OpenBSD test debugging

I thought we had skipped the notify tests, but maybe I’m wrong.

Assuming these are not inotify related, it’d be nice if the tests passed.

The reason I’ve been so persistent is that I maintain the Syncthing package for OpenBSD, and I feel I owe it to our users to check that nothing bad will happen to their data. And since I don’t know how Syncthing works inside, the tests are all I have.

I understand that it’s frustrating for you guys, since OpenBSD isn’t all that widely used compared to, say, Linux.

Ah, fair enough, I didn’t actually look at the test list. And also interesting, at first glance those should not be OS dependent. I’ll see what the diff looks like in the end. :slight_smile:

I fully sympathize. The impression I got from the notify tests is that notify just didn’t work like we expected on OpenBSD, and we made the tests more lenient as a result. That still means notify doesn’t work as we expect, though; we just miss notifications without the tests failing. That is, no positive change for users, but the problem, such as it is, is swept under the rug. @imsodin will correct me if my impression is incorrect.

@imsodin was the last log of any use at all?

Jakob found me out :smiley:

Actually this entire thing gives me a pretext to make some changes to our model tests that I wanted to do many times before, but never did, because they are tangential. Most of your test failures do not signify any dysfunction in Syncthing, just in the tests. And none of them are reproducible. So I think it’s worthwhile to fix them, but no reason to be worried for users.

I just had other stuff going on. The log was partly useful. Most of the “directory not empty” failures are completely mysterious to me now. Some tests missed an m.Stop() call, and I fixed some other things that might be related. If you want, you can have another go at my branch (same one again): GitHub - imsodin/syncthing at modelTestRef

However, I probably won’t dig deeper into the “directory not empty” stuff (I am tempted to hypothesize this isn’t even “our problem”, i.e. it’s something related to Go on OpenBSD). And anyway I should probably focus on PRs at some point; that’s not going to be so easy to sell :slight_smile:

Actually this entire thing gives me a pretext to make some changes to our model tests that I wanted to do many times before

Cool. I’d be happy to test.

Most of your test failures do not signify any dysfunction in Syncthing

The problem is that I (and probably other packagers too) don’t have the insider knowledge to distinguish a boring test-suite bug from a real bug. It could be a racy cleanup, or something that would delete user data if ignored.

I ran your branch in a constant test loop for about an hour and a half. Here’s what now fails:

TestAutoAcceptNewFolderPremutationsNoPanic
TestIssue5063
TestRequestRemoteRenameChanged

I probably won’t dig deeper regarding the directory not empty stuff

That’s a shame. It looks like this is the source of the failures.

Here’s the log: http://theunixzoo.co.uk/random/ST-LOG-20190204.bz2

I am tempted to hypothesize this isn’t even “our problem”, i.e. something related to golang on openbsd

Well, if that’s the case, then we have the power to fix Go on OpenBSD with a local patch which we could then upstream, but we’d need to figure out what exactly is wrong.

Didn’t you say in an earlier PR that something was returning an error code that should be impossible? Do you still believe this to be the case?

Cheers, and sorry again for the persistence (I got another email from a user today asking for Syncthing to be updated in the packages collection).

Understandable; there could be a notice about “meta test failures”. The implicit convention is that real test failures come with a meaningful message (though I won’t vouch for that :slight_smile: ).

I pushed one more change which might have fixed 5063. The auto accept one is a “directory not empty” failure (more on that below). And RemoteRenameChanged is “real”; I am trying to reproduce it by running in a loop locally, no luck so far.

Well, actually you do the brunt of the work, so why not. You just have to be fine with me not making any promises.

To get more info on what’s going on with the “directory not empty” failures, I need logs with the fs debug facility enabled. For RequestRemoteRenameChanged I need model as well (for good measure, just always enable model). To reduce test times and log sizes, I’d loop over only the relevant test and keep just the latest output, something like:

```shell
i=0
while [ $i -lt 1000 ]; do
  i=$((i+1))
  echo "Try $i"
  STTRACE=model go test -short -race -tags purego -run TestRequestRemoteRenameChanged > test.out 2>&1 || break
done
```

I don’t remember what exactly I said, but I am pretty certain I used “should” for a reason xD
Probably it was about RemoveAll returning “directory not empty”. At some point I thought Go’s os implementation was “atomic”, but then I found a bug report saying it is expected and possible that another goroutine can interfere with the execution of RemoveAll and thus cause this error.

No worries about the persistence.

Are you delaying updating Syncthing because of this? If so, please don’t. There are known bugs fixed, while our test coverage isn’t anywhere near 100%. And even if it were, the tests might not cover every scenario where something can go wrong. To make it concrete: of all the failures discussed here, only RequestRemoteR… might still be a real bug (I doubt it); all the rest are not. On the other hand, Incorrectly out of sync, sequence number oddness. · Issue #5340 · syncthing/syncthing · GitHub is certainly a problem, it existed for a long time, and it’s fixed in v1.0.1.

@vext01 Don’t focus on TestRequestRemoteRenameChanged, I was able to reproduce that after 487 tries :slight_smile:

I was yeah, but with the above blessing I can probably put it in :slight_smile:

Don’t focus on TestRequestRemoteRenameChanged, I was able to reproduce that after 487 tries :slight_smile:

Great news!

To get more info on what’s going on with the directory not empty failures, I need logs with fs debug facility enabled.

Ack, watch this space.

And I found the cause too: again no “real bug”, “just” a racy test. However, it pointed me to a potential improvement in the real code :slight_smile:

Now I need to mold this into nicely palatable PRs :expressionless:

Great! Is it all in your branch? I can test it for you later if you like.

Yep all there.

Alright then! I ran your branch on a65f73df9 and the only failures I have are:

TestIssue5063
TestRequestRemoteRenameChanged

Sadly TestRequestRemoteRenameChanged is back.

The log file for running all tests is massive, so I’ll now run those two tests in a loop and upload small log files.

Here’s a log file containing many failures for TestRequestRemoteRenameChanged: http://theunixzoo.co.uk/random/TestRequestRemoteRenameChanged.out.bz2

I ran it 11 times in a loop and it failed in all cases.

As for TestIssue5063, this one is much harder to trigger in isolation, but here’s a section of a full test suite run showing that test failing: http://theunixzoo.co.uk/random/TestIssue5063.out.bz2

I’ve got model and fs logging on for both logs.

Embarrassingly, I was too messy on my debugging branch, and when merging in master I committed silent merge conflicts. I fixed that, which should take care of TestRequestRemoteRenameChanged.

The log of 5063 doesn’t show any filesystem activity interfering with the directory removal, but it does show that filesystem watchers are started. That test exercises auto-added folders, for which watchers are enabled by default, so it’s not possible to disable them in the tests. If it only happens very seldom, I’d just ignore it.

Out of 100 runs on commit 4d986830c, we have (times-failed, test-name):

  10 TestAutoAcceptNewFolderPremutationsNoPanic
   1 TestIssue5063

That test exercises auto-added folders, for which watchers are enabled by default, so it’s not possible to disable them in the tests. If it only happens very seldom, I’d just ignore it.

Or is there a different way to test that functionality that would sidestep that problem? At the risk of sounding like a perfectionist, I don’t see the point of a test which sometimes fails when things are actually OK.

Cheers :slight_smile:

I was going to post the log, but I forgot to set STTRACE.

I’ll run again.

Interesting that both tests relate to new folder addition via cluster config message.

Given that the two remaining test failures touch the filesystem and have the watcher enabled, and other failures of the same kind were fixed by disabling the watcher, I am inclined to believe that this is the source of the failure, even though I can’t imagine how.

That might actually be a reasonable point: tests that run in temporary directories don’t really need to fail when their cleanup fails. We did that because a failed cleanup can mess with other tests if testing happens in the same place, and it’s annoying if random directories are left lying around. However, I tend to agree that flaky tests are more annoying.

Edit: Implicit in my statement: I don’t think there is another way of testing the same thing short of implementing https://github.com/syncthing/syncthing/issues/4224

Well, they are only flaky because of OpenBSD particularities with file watches, so be as perfectionist as you want; the problem is somewhere else.

I say we disable these on OpenBSD and call it a day. Or have a global variable/hook, DefaultFolderWatchesEnabled, which can be disabled during tests.
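The second option could look roughly like this. DefaultFolderWatchesEnabled is the hypothetical name from the post, and FolderConfiguration here is a pared-down stand-in, not Syncthing’s real config type; this is a sketch of the idea, not a patch.

```go
package main

import "fmt"

// DefaultFolderWatchesEnabled controls whether auto-accepted folders get a
// filesystem watcher. Production code leaves it true; tests (e.g. on
// OpenBSD, where watches behave oddly) could flip it to false.
// Hypothetical sketch, not Syncthing code.
var DefaultFolderWatchesEnabled = true

// FolderConfiguration is a pared-down stand-in for the real config type.
type FolderConfiguration struct {
	ID               string
	FSWatcherEnabled bool
}

// newAutoAcceptedFolder builds the config for a folder added via a
// cluster config message, honoring the global default.
func newAutoAcceptedFolder(id string) FolderConfiguration {
	return FolderConfiguration{ID: id, FSWatcherEnabled: DefaultFolderWatchesEnabled}
}

func main() {
	fmt.Println(newAutoAcceptedFolder("abc").FSWatcherEnabled) // true: production default

	DefaultFolderWatchesEnabled = false // what a test on OpenBSD would do
	fmt.Println(newAutoAcceptedFolder("abc").FSWatcherEnabled) // false: watcher disabled
	DefaultFolderWatchesEnabled = true  // restore for other tests
}
```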