Non-panic reports to sentry

With the hopefully soon incoming suture-ish fatal error handling, we’ll no longer panic in our db layer (goleveldb will keep crashing :expressionless: ) - which is great for the user, but leaves us in the dark as things are now. Also for other “should not happen” conditions it would be great to have reporting to sentry. Given these errors will be detected in normal control flow, we could handle the reporting directly (as opposed to panics, which are detected in log output and thus have to be handled as text). However I believe the current infrastructure is not meant for / set up to report directly to sentry, only via stcrashreceiver. Thus the straightforward extension is to write something similar to a panic report to a file and adjust the receiver so it also parses this type of report. That seems like the way to go to me - any opposition or other comments on this plan?
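
For concreteness, roughly what I mean by “something similar to a panic report” could look like this (just a sketch; the helper name, file name and layout are made up, not the actual crash report format):

package failurereport

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/debug"
	"time"
)

// WriteFailureReport is a hypothetical helper: it persists a non-panic
// failure as a file next to the usual panic reports, so a slightly
// extended stcrashreceiver could parse it much like it parses panics.
func WriteFailureReport(dir, version, msg string) error {
	name := fmt.Sprintf("failure-%d.log", time.Now().Unix())
	// Version header first (like a panic report), then the failure
	// description, then the stack of the current goroutine.
	body := fmt.Sprintf("%s\n\n%s\n\n%s", version, msg, debug.Stack())
	return os.WriteFile(filepath.Join(dir, name), []byte(body), 0o600)
}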

The current infrastructure is more than capable of reporting to sentry; we just don’t use it that way, because we didn’t want to replace every go statement with sentry.Go - in order to capture a panic you have to be in the goroutine that panics.

There is also no way to capture panics that happen in leveldb, for example, as we can’t tell leveldb to use sentry.Go instead of plain go. Hence the decision was for the monitor to capture the panics and send them to the crash reporter, which stores the raw panic log, parses it, and then reports it to sentry.
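
Roughly, the monitor side of that amounts to something like this (a simplified sketch, not the actual monitor code): scan the child process output, and once a panic header shows up, collect everything that follows as the report.

package monitor

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// CapturePanic is a simplified sketch of the monitor approach: read the
// child's output line by line and, once a panic header appears, gather
// the rest of the output as the crash report.
func CapturePanic(r io.Reader) (string, bool) {
	sc := bufio.NewScanner(r)
	var panicking bool
	var report strings.Builder
	for sc.Scan() {
		line := sc.Text()
		if !panicking && strings.HasPrefix(line, "panic:") {
			panicking = true
		}
		if panicking {
			fmt.Fprintln(&report, line)
		}
	}
	return report.String(), panicking
}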

In theory the crash reporter could have submitted it to sentry straight away without storing it, but @calmh wanted to keep the raw panic logs, which I guess is useful if you want to run https://github.com/maruel/panicparse or something against them - that gives you more concise output, deduplicating identical routines etc.

I think for non-panic but unexpected conditions we’d want to do straight-up standard sentry logging, capturing the details around the unexpected place and returning an error, as opposed to returning an error that then causes a report.
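
Something along these lines (a sketch assuming the sentry-go SDK; ReportUnexpected is a made-up helper, not existing code):

package failures

import (
	"fmt"

	"github.com/getsentry/sentry-go"
)

// ReportUnexpected is a hypothetical helper: capture the "should not
// happen" condition with some context directly where it is detected,
// and hand back an ordinary error to the caller.
func ReportUnexpected(msg string, tags map[string]string) error {
	evt := sentry.NewEvent()
	evt.Message = msg
	for k, v := range tags {
		evt.Tags[k] = v
	}
	sentry.CaptureEvent(evt)
	return fmt.Errorf("unexpected condition: %s", msg)
}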

I would however set up a new sentry project for this, as the existing one is already somewhat of a dumpster fire because of all the noise from runtime crashes caused by random things.

Actually, it seems sentry.io offers a hosted solution for open-source.

It might be worth moving to that to save some money and pick up some of the more advanced features. I guess the question is what features and quotas we’d get from it.

I’ll reach out to them.

As far as I remember sentry only stores one stack, while it’s often useful to have stacks of all goroutines to see what else is going on (e.g. what spawned the crashing goroutine).

I had a custom “integration” written that would store all stacks, as well as code, etc.

Just resurrect my old PR https://github.com/syncthing/syncthing/pull/5721


We store the original plaintext backtrace, which has all of that. I still think sentry is great for tracking issue frequency, correlating affected users, etc., but plain horrible for debugging the actual issue.

Our panic reporting works as is imho (though it could of course easily be moved to a cloud hosted sentry instance), so if we want to add further reporting it’s not sentry.Go things but rather an l.Errorf or something that gets picked up by the monitor and reported via the usual channels.

And why not just:

evt := sentry.NewEvent()
evt.Message = "Something unexpected happened"
evt.Tags["globalKey"] = gk
sentry.CaptureEvent(evt)

?

Did we forget that sentry can just send arbitrary events?

I doubt these “a logic error has happened” reports actually need the whole stack trace?

I’m not saying don’t use sentry for reporting interesting events (though the consent is a can of worms), I’m saying I’m not convinced using the sentry API directly is the best way to go. We already have an events system that can publish strongly typed events and export them over the API, and a stringly typed log stream which is read by the monitor. Both of those things have the advantage that they can already be consumed by other things that monitor Syncthing.

Hence I think it would probably be preferable to use something like the event logger with a new ErrorEvent, or the stdout logger like l.Errorf. This would fit in with what we already do without tying the main Syncthing code to Sentry.
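
A rough sketch of the direction (the Failure event type and the logger interface below are made-up stand-ins for illustration, not the existing events package API):

package main

import (
	"fmt"
	"time"
)

// Hypothetical shapes, for illustration only.
type EventType string

const Failure EventType = "Failure"

type EventLogger interface {
	Log(t EventType, data map[string]string)
}

type stdoutLogger struct{}

func (stdoutLogger) Log(t EventType, data map[string]string) {
	fmt.Println(time.Now().Format(time.RFC3339), t, data)
}

func main() {
	var evLogger EventLogger = stdoutLogger{}
	// At the "should not happen" site, instead of talking to Sentry directly:
	evLogger.Log(Failure, map[string]string{
		"description": "unexpected global key format",
	})
}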

To get the events into Sentry I wouldn’t mind the monitor taking these things and sending them onwards. Maybe directly using the Sentry API, or indirectly via a translator of sorts like we do with the current crash reports.

So we now have a sentry.io hosted instance we could move stuff across to, with up to 5 million events a month or something.

It seems most of the cool features are still locked behind a business plan.

They do have some newer features that allow us to control the grouping etc., so it might be worthwhile.

I think the current plan won’t support us. I pointed the crash receiver at it now, and it’s 429 Enhance Your Calm-ing almost all our requests, while the rate limit is only configurable for enterprise setups.

Aug 18 10:35:03 data stcrashreceiver: 2020/08/18 10:35:03 Receiving report 40390b723476ab8fff9310cd78a718cf272080365efc9fc0cc0144c723347094
Aug 18 10:35:04 data stcrashreceiver: 2020/08/18 10:35:04 Receiving report 2af4ef15b1f0f7a087b0ff514b89371d774e5b56dfa8b8badb49bd86a7e911d3
Aug 18 10:35:04 data stcrashreceiver: 2020/08/18 10:35:04 Receiving report 571f1383b95d4cc213b7ac98d73a7b30a76bcf452d9c77cf2ba2eec66c6915ca
Aug 18 10:35:04 data stcrashreceiver: 2020/08/18 10:35:04 Failed to send report: raven: got http status 429 - x-sentry-error:
Aug 18 10:35:05 data stcrashreceiver: 2020/08/18 10:35:05 Receiving report 57d543e96ab97085bc5a5cc1af74cb660cf321b1ff580d463d7c19c5a98c961c
Aug 18 10:35:05 data stcrashreceiver: 2020/08/18 10:35:05 Failed to send report: raven: got http status 429 - x-sentry-error:
Aug 18 10:35:06 data stcrashreceiver: 2020/08/18 10:35:06 Receiving report 22521904bd770c842ec1dcc44979645804ecf7fea269a56018e5cb24f0738ea6
Aug 18 10:35:06 data stcrashreceiver: 2020/08/18 10:35:06 Failed to send report: raven: got http status 429 - x-sentry-error:
Aug 18 10:35:07 data stcrashreceiver: 2020/08/18 10:35:07 Receiving report 46231cec28529978702a746c44a05032d64831265cf5d17c2c1fbea53f68302a
Aug 18 10:35:07 data stcrashreceiver: 2020/08/18 10:35:07 Receiving report 21363497062cd591a5cdc40ceb8b508f57759e350a3f800c4d0d6223edab94c4
Aug 18 10:35:07 data stcrashreceiver: 2020/08/18 10:35:07 Failed to send report: raven: got http status 429 - x-sentry-error:
Aug 18 10:35:07 data stcrashreceiver: 2020/08/18 10:35:07 Failed to send report: raven: got http status 429 - x-sentry-error:
Aug 18 10:35:08 data stcrashreceiver: 2020/08/18 10:35:08 Receiving report baf5849f33bb284ac61e99b5e1e69afbd881824017a811ed6ef465e27378d057
Aug 18 10:35:08 data stcrashreceiver: 2020/08/18 10:35:08 Failed to send report: raven: got http status 429 - x-sentry-error:
Aug 18 10:35:09 data stcrashreceiver: 2020/08/18 10:35:09 Receiving report 6cf70a20b76fddb42d0f7bbbd55bdf94121821fe7ae83290896e394663c5e1ef
Aug 18 10:35:09 data stcrashreceiver: 2020/08/18 10:35:09 Receiving report 02c23ed2cafe4a5bc58683be3dc089a4747306567df904ac9b4ed44e34b994e5
Aug 18 10:35:09 data stcrashreceiver: 2020/08/18 10:35:09 Receiving report d8bfd2a836608e7a01420767228e3d9be53ac132f716fbd2d27f65348dca8481
Aug 18 10:35:09 data stcrashreceiver: 2020/08/18 10:35:09 Failed to send report: unintelligeble version string
Aug 18 10:35:09 data stcrashreceiver: 2020/08/18 10:35:09 Receiving report c7440f1fbaae0c4c78eed4580b170a84d688402c2fd186e9a2b74cba04e6d9a4

It looks to me like we generate about 7k events per hour (roughly 5 million a month), while the free plan supports 5k events per month.

I pointed it back to our own instance for now…


I’ll add a “failure event” and a service that collates them (compared to panics there’s no inbuilt throttling here) and then sends them to stcrashreceiver, with stcrashreceiver just acting as a relay between Syncthing and Sentry so we don’t need to introduce yet another CRURL-type config option. That also makes it easier to switch reporting backends if we need to, e.g. we don’t need to wait for the next Syncthing release and do a config migration.
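
Roughly what I have in mind for the collating service (a sketch only; the endpoint, interval and types are placeholders, not the actual implementation):

package relay

import (
	"bytes"
	"log"
	"net/http"
	"time"
)

// FailureAgg sketches the collating service: panics are implicitly
// throttled because the process restarts, failures are not, so we batch
// them for a while and then relay them to stcrashreceiver, which
// forwards to Sentry.
type FailureAgg struct {
	In  chan string // failure descriptions from the event subscription
	URL string      // stcrashreceiver endpoint (placeholder)
}

// Serve buffers incoming failures and relays the batch once a minute.
func (a *FailureAgg) Serve() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	var buf []string
	for {
		select {
		case desc := <-a.In:
			buf = append(buf, desc)
		case <-ticker.C:
			for _, desc := range buf {
				resp, err := http.Post(a.URL, "text/plain", bytes.NewBufferString(desc))
				if err != nil {
					log.Println("relaying failure report:", err)
					continue
				}
				resp.Body.Close()
			}
			buf = nil
		}
	}
}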


Will it be the monitor observing and sending these events?

No, I made it run as a suture service, consuming events directly instead of over REST: https://github.com/syncthing/syncthing/pull/6925

It’s a bit grim that this needs to be an event. Why can’t it be some utility function that is called and does the right thing? Why does it have to go through the event system? The event system is mostly for communicating stuff outside of syncthing (into the UI). Yes, we abused it in a few places for reacting to changes, but I’d like to do less of that, not more.

I can’t see the UI caring about failure events.

It’s not grim, it’s optimal. Many external things can care about failure events and subscribe to them via the event listener, plus we get them in --verbose for free. Events are like sane, structured log entries.

We should not see this thing as something just for us to consume in Sentry. Communicating status and metrics in an open way benefits lots of consumers. That we can also forward it to Sentry, after appropriate consent, is a sideshow imho.