Syncthing "stuck" syncing because of device that doesn't exist?

imsodin · September 2, 2020, 6:44pm

Can you post the output for the same query on another (shown as syncing) device please.

darkpixel · September 2, 2020, 7:37pm

Even better–thanks to the magic of Salt, I can show it to you from every instance.

imsodin · September 2, 2020, 8:00pm

Might have got unlucky with the file there: It is genuinely not in sync on most of the devices, e.g. usxhfsdnas01. The folder on the left side should show it as out-of-sync. That’s evident from the displayed global and local file: They are different, version vectors differ (the global one is “bigger”). The actual file details (mod. time, size) are equal everywhere. This looks like devices are still exchanging metadata working towards establishing that they actually all have the same file. That again should be visible on the left side on the folder states, as in they should be syncing (or out-of-sync if in between sync runs and the last one didn’t yet clear everything out). However the screenshots you report don’t show any “real” syncing/out-of-sync folders - is/was that still the case for this file?
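To make the "bigger" version vector comparison concrete, here is a minimal sketch. It assumes a version vector can be represented as a plain mapping from device short ID to counter; the `compare` helper and the `HYPOT` device ID are made up for illustration and are not Syncthing's own code.

```python
def compare(a, b):
    """Compare two version vectors given as {device_id: counter} dicts.

    Returns "equal", "greater" (a dominates b), "lesser" (b dominates a),
    or "concurrent" (each side has a counter the other lacks: a conflict).
    """
    ids = set(a) | set(b)
    # A missing device ID counts as counter 0.
    a_ahead = any(a.get(i, 0) > b.get(i, 0) for i in ids)
    b_ahead = any(b.get(i, 0) > a.get(i, 0) for i in ids)
    if a_ahead and b_ahead:
        return "concurrent"
    if a_ahead:
        return "greater"
    if b_ahead:
        return "lesser"
    return "equal"


# Hypothetical vectors: the global version has seen one more change.
global_version = {"ACQW3": 2, "HYPOT": 1}
local_version = {"ACQW3": 1, "HYPOT": 1}
result = compare(global_version, local_version)
```

When the global vector is "greater" than the local one, the device still needs the file, even if size and modification time already match.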

darkpixel · September 3, 2020, 2:07am

Aah. That was right after I restarted USLOGSDNAS01 with STRECHECKDBEVERY=1s, so there was definitely metadata flying around.

At the moment, things have slowed waaay down. The boxes normally end up in a state where they aren’t sending or receiving anything (or just a few bits here and there), but at the moment they’re still sending/receiving 10-100 KB…so something’s happening…but it’s not the usual 50-200 megs down and ~4.5 up (throttled by syncthing).

I’m going to let it continue to run over night and see what I find in the morning…although the “2017_ALL_LOCATIONS.xlsx” file I sent still shows out of sync on all boxes at the moment.

darkpixel · September 3, 2020, 2:56pm

Back to the same state after restarting ONLY USLOGSDNAS01 with STRECHECKDBEVERY=1s.

Here you can see a screenshot from USLOG00NAS01 showing corp-accounting-private is out of sync by 1 item (0 bytes) as well as a bunch of data in the “Remote Devices” section (for?) USBVESDNAS01. There’s really nothing happening according to the “This Device” section.

Clicking to see the out of sync items gives a blank screen. I let it sit there for 30 seconds in case it was trying to load something.

Going over to the “Remote Devices” section and looking at USBVESDNAS01, and clicking the “Out of Sync Items” shows a bunch of files out of sync and no traffic between the boxes. It also shows that all the files were modified last year and they were modified by USLOGSDNAS01.

Here is the output of the db/file API request run against all the boxes: syncthingdata (105.5 KB)
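For anyone wanting to repeat that query, a rough sketch of the db/file request follows. It assumes the GUI/REST listener is reachable at the given base URL and that an API key is available; the helper names are hypothetical, and the folder and file names in the usage note are borrowed from this thread only as sample values.

```python
import json
import urllib.parse
import urllib.request


def db_file_url(base_url, folder, path):
    """Build the /rest/db/file URL for one file in one folder."""
    query = urllib.parse.urlencode({"folder": folder, "file": path})
    return f"{base_url.rstrip('/')}/rest/db/file?{query}"


def fetch_db_file(base_url, api_key, folder, path, timeout=10):
    """Fetch the db/file record from one instance (needs network access)."""
    req = urllib.request.Request(
        db_file_url(base_url, folder, path),
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def local_matches_global(entry):
    """True if the local version equals the global one in a db/file record."""
    return entry["local"]["version"] == entry["global"]["version"]
```

Calling `fetch_db_file("http://127.0.0.1:8384", key, "corp-accounting-private", "2017_ALL_LOCATIONS.xlsx")` on each box and passing the result to `local_matches_global` would give a quick per-device answer.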

darkpixel · September 3, 2020, 2:59pm

I forgot to mention…I queried /var/log/syslog on all the boxes looking for WARNING entries. There were none during this latest resync. I made sure syncthing wasn’t restarted anywhere during this test.

imsodin · September 3, 2020, 7:05pm

Again things are indeed out of sync, but don’t show as such. I’d assume the new local need repair that’s currently PRed would kick in here: https://github.com/syncthing/syncthing/pull/6950. It would be interesting to know if it really does, however it would only be a bandaid (though maybe an effective one). It doesn’t tackle any underlying problem, it just fixes inconsistencies after the fact. You could pick and run a binary directly from the PR and check for a "Repaired ... local need entries ..." log line, however I am not sure whether you are prepared to go that experimental on any of your systems. If not, it will be in the next RC.

darkpixel · September 3, 2020, 7:28pm

I’d be happy to give it a shot. I’ll get the deb copied to the boxes and I’ll schedule a restart in about 30 minutes when everyone is out to lunch.

darkpixel · September 3, 2020, 7:53pm

I just deployed and launched it. It’s logging “repaired” entries.

Sep  3 12:50:15 uslogsdnas01 systemd[1]: Started Syncthing - Open Source Continuous File Synchronization for root.
Sep  3 12:50:15 uslogsdnas01 syncthing[12146]: [start] INFO: syncthing v1.9.0-rc.5.dev.24.g10cf260c "Fermium Flea" (go1.15.1 linux-amd64) deb@build.syncthing.net 2020-09-03 12:21:29 UTC [noupgrade]
Sep  3 12:50:15 uslogsdnas01 syncthing[12146]: [start] INFO: Using large-database tuning
Sep  3 12:50:16 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: My ID: AC-REDACTED-YZCAM
Sep  3 12:50:16 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Single thread SHA256 performance is 331 MB/s using crypto/sha256 (327 MB/s using minio/sha256-simd).
Sep  3 12:50:17 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Hashing performance is 274.88 MB/s
Sep  3 12:50:17 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Checking db due to upgrade - this may take a while...
Sep  3 12:50:22 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Repaired 5666 local need entries for folder redacted-yoh4s in database
Sep  3 12:50:41 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Repaired 1554 local need entries for folder redacted-eeg9e in database
Sep  3 12:52:01 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Repaired 18309 local need entries for folder redacted-eiph9 in database
Sep  3 12:52:24 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Repaired 163 local need entries for folder redacted-raisu in database
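To total these up across many boxes, a small sketch like the following could be used. It assumes the log line format is exactly as shown above; `repaired_counts` is a hypothetical helper name.

```python
import re

# Matches the "Repaired N local need entries for folder X in database" line.
REPAIRED = re.compile(
    r"Repaired (\d+) local need entries for folder (\S+) in database"
)


def repaired_counts(lines):
    """Return {folder_id: repaired_entries} summed over the given log lines."""
    counts = {}
    for line in lines:
        match = REPAIRED.search(line)
        if match:
            folder = match.group(2)
            counts[folder] = counts.get(folder, 0) + int(match.group(1))
    return counts


# Two of the lines from the log above, used as sample input.
sample = [
    "Sep  3 12:50:22 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Repaired 5666 local need entries for folder redacted-yoh4s in database",
    "Sep  3 12:50:41 uslogsdnas01 syncthing[12146]: [ACQW3] INFO: Repaired 1554 local need entries for folder redacted-eeg9e in database",
]
counts = repaired_counts(sample)
```

Feeding it the lines of /var/log/syslog from each box would show which folders needed the most repairs.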

darkpixel · September 3, 2020, 8:03pm

On a side note: Aack! I forgot to set the upload speed in the config file to 600, so it’s currently swamping all the internet connections.

I ran stcli -home=/tank/syncthing config options max-send-kbps set 600 and it returned no errors…but the change doesn’t seem to take effect on the boxes. Would an out-of-date stcli cause that? I think the stcli binary is from 1.8.0.
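One way to rule out stcli would be to change the limit through the REST API directly. The sketch below is untested against these boxes and rests on assumptions: that this version exposes the whole config at /rest/system/config for GET and POST, and that the option is named maxSendKbps in the JSON. The helper names are hypothetical.

```python
import copy
import json
import urllib.request


def with_send_limit(config, kbps):
    """Return a copy of a config dict with the upload limit set (in KiB/s)."""
    updated = copy.deepcopy(config)
    # Assumption: the JSON key for the upload limit is options.maxSendKbps.
    updated.setdefault("options", {})["maxSendKbps"] = kbps
    return updated


def get_config(base_url, api_key):
    """Read the running config (assumed endpoint: /rest/system/config)."""
    req = urllib.request.Request(
        f"{base_url}/rest/system/config", headers={"X-API-Key": api_key}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def post_config(base_url, api_key, config):
    """Write the whole config back (assumed endpoint: /rest/system/config)."""
    req = urllib.request.Request(
        f"{base_url}/rest/system/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Usage would be roughly `post_config(url, key, with_send_limit(get_config(url, key), 600))`, followed by checking the value in the GUI.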

darkpixel · September 3, 2020, 8:25pm

That appears to have done the trick.

I have no folders saying “out of sync” and all of the instances show they are “up to date” in their “Remote Devices” section.

There’s still some syncing going on, so I’ll let you know when it’s completely finished, but so far it looks good.

I guess the next question is how you can stop syncthing everywhere, rm -rf the database everywhere, then start syncthing everywhere…and end up in a situation like that. If no instance has a record of files, starts up, and syncs…how were they ending up in a situation where they can’t sync those files from a partner or don’t recognize that the files have already been sync’d?

darkpixel · September 3, 2020, 8:41pm

Well…that was quick. All devices dropped to sending/receiving a few bits here and there. All boxes are now in sync and appear to be syncing new files properly.

imsodin · September 4, 2020, 9:32am

Good to hear the bandaid worked.

That’s the question indeed. Can you pin down when the problem first occurred? The original post was on v1.8.0, however what’s the last version you ran before that and how sure are you, if the problem existed or not on that?

imsodin · September 4, 2020, 10:06am

Do you use introducer/auto-accept or do you configure everything with salt? And if everything is done with salt, does it stop syncthing, change the config on disk, and start it again or does it use the api while syncthing is running?

Another thing: Did you do a complete index reset after updating to rc.5? I think I told you at some point that wasn’t necessary. However that was wrong, so if you didn’t the issues you saw after that might still have been relics from before that.

darkpixel · September 5, 2020, 6:00pm

Unfortunately /var/log/apt.log has rotated off into oblivion so I can’t be 100% sure of the version where this popped up.

It looks like the Debian repos have 1.0.0, but the Syncthing repos have 1.7.0 and up…so there’s a possibility this crept in between 1.0.0 and whenever I switched to the syncthing repos…At one point I thought it was introduced between 1.7.0 and 1.8.0, so I rolled back, but the issue persisted. That could have been because 1.8.0 introduced a problem into the DB and rolling back left the problem in the DB…or not. So it’s difficult to say.

darkpixel · September 5, 2020, 6:06pm

No introducer or auto-accept. The config file is managed by salt, but it’s basically blown out once and that’s it. I can force it to replace the config and restart syncthing if I want. That happens occasionally (a few times per year) when I add a new folder or delete an old one we no longer need sync’d. I’ve debated switching to the API, but from the looks of it, it’s not feature complete for what I need. Of course I also hate XML…so…

At first I didn’t. But when everything settled down to not transferring data and being out of sync, I did stop syncthing, rm -rf the index folder and start it back up. That still didn’t fix the problem, but the index was deleted.

Before I posted the bug to the forum (and I was on v1.7.0) I also did an rm -rf on the index and restarted everything.

darkpixel · September 9, 2020, 3:26am

I saw everything got rolled into v1.9.0. I installed it a few hours ago and really torture-tested it. I stopped instances, I dropped them out of some or all of the config files, I removed synced folders, I removed test files from some of them, dropped in new test files, etc…then I reloaded our syncthing config and launched syncthing. I restarted instances, kill -9’d a few, removed a few databases, etc…

After a few hours everything is perfectly back in sync. I think this is solved.

imsodin · September 9, 2020, 8:10am

Very nice to hear, thanks for the extensive testing and reporting!

system · October 9, 2020, 8:10am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.