Syncthing "stuck" syncing because of device that doesn't exist?

darkpixel · August 20, 2020, 12:45am

Well fubar… Everything was off to a good start, but ended up stuck again.

Using ‘corp-accounting-private’ as an example…all instances show that the folder is “Up to Date” in the Folders list. But every single instance still shows a huge backlog of files in the Remote Devices list. All the instances are basically doing nothing–just a few bits of traffic here and there when during the initial scan/sync they were uploading the maximum allowed ~5 Mbit/sec and the download is uncapped.

The interesting thing is that checking any instance of Syncthing, checking ANY device that shows a backlog of files reveals the ‘modified’ device is either ‘USWUXSDNAS01’ or ‘USWUXSDARC01’. Those are two syncthing boxes in the same physical location. Most of the files are from NAS01 though. Since they are faster boxes, maybe they finished scanning first or something? Maybe I’m barking up the wrong tree…

Looking in the logs for USWUXSDNAS01 shows…

Aug 19 16:21:30 uswuxsdnas01 syncthing[6880]: [SFXUK] INFO: Puller (folder "corp-accounting-private" (redac-ted), item "redacted/path/somefile.sync-conflict-20200814.xlsx"): pull: no such file
Aug 19 16:21:32 uswuxsdnas01 syncthing[6880]: [SFXUK] INFO: "corp-accounting-private" (redac-ted): Failed to sync 23 items
Aug 19 16:21:32 uswuxsdnas01 syncthing[6880]: [SFXUK] INFO: Folder "corp-accounting-private" (redac-ted) isn't making sync progress - retrying in 1m6s.
Aug 19 16:21:33 uswuxsdnas01 syncthing[6880]: [SFXUK] INFO: "corp-accounting-private" (redac-ted): Failed to sync 23 items
Aug 19 16:21:33 uswuxsdnas01 syncthing[6880]: [SFXUK] INFO: Folder "corp-accounting-private" (redac-ted) isn't making sync progress - retrying in 1m2s.

This repeats a handful of times.

When I look at the user interface, it makes no mention of trying to replicate any file with sync-conflict in the name. In fact, the file name it’s complaining about doesn’t exist (without the sync-conflict in the name) in the list either.

I’m a stumped. I suppose it’s possible a user is deleting the sync-conflict files in the middle of them replicating or something, but I don’t think syncthing would/should bomb out on that.

I’m half tempted to wait until the weekend, then just rm -rf the syncthing config directory on all the machines and start from scratch.

Any other debugging anyone wants done before I nuke it this weekend?

darkpixel · August 23, 2020, 5:18pm

Nothing seemed to fix the issue. I even went as far as stopping syncthing, blowing away the entire config directory and set things back up from scratch.

At the end, various devices would show files out of sync and waiting for some unknown node to come back online to finish syncing. No data was being transferred. I had one instance (apparently unrelated to the original issue) that was refusing to talk with all but one other instance.

The last time I remember it working correctly was right before the config was bumped up to version 31…so I looked in the Debian package history and saw the recent upgrade was from 1.7.0 to 1.8.0.

I stopped syncthing, downgraded every machine to 1.7.0, noticed the config version was still 31, but said “meh” and let things run. Everything is back in sync.

I’m going to pin my boxes to 1.7.0 for the moment…but I think there’s some sort of problem between 1.7.0 and 1.8.0. I just haven’t been able to track it down.

darkpixel · September 1, 2020, 1:10am

I spoke too soon. The issue occurs with 1.7.0 and 1.8.0.

I left the config file alone, stopped syncthing, rm -rf’d the index-v0.14.0.db folder, and started up syncthing 1.7.0. It always ends with multiple folders being out of sync and pointing to changes non-existent machines.

After that didn’t work, I stopped syncthing, rm -rf’d the index again, then upgraded to 1.8.0 and started it. After a few days a bunch of folders were out of sync again.

I’m stumped.

AudriusButkevicius · September 1, 2020, 6:03am

If you deleted the indexes where are the unknown devices coming from.

Deleting the index on one machine doesn’t do much, as it just gets the index from a different machine, with the same needed files from devices that are no longer there.

imsodin · September 1, 2020, 7:06am

There was also a bug that could lead to missing indexes which was fixed in v1.9.0-rc.5. If you ever made config changes or pause/unpaused a folder while it was scanning, you might be affected by that.

darkpixel · September 1, 2020, 2:41pm

My mistake. I conflated one of my earlier tests where I deleted the config and database and then re-created it and “forgot” one box.

I just went through and hand-checked all 24 boxes. They are no longer showing a mystery device changing files. They are showing one device (based on spot checking every devices “remote devices” section and looking at several pages of out-of-sync files). All the devices point to ‘uslogsdnas01’. Even uslogsdnas01 shows it has a backlog of changes from all the other devices where files were modified by itself.

Deleting the index on one machine doesn’t do much, as it just gets the index from a different machine, with the same needed files from devices that are no longer there.

In my recent test, all instances of syncthing were stopped, then the index on each box was deleted, and then syncthing was restarted. Wouldn’t that be enough to cause each machine to re-index?

darkpixel · September 1, 2020, 2:42pm

There was also a bug that could lead to missing indexes which was fixed in v1.9.0-rc.5. If you ever made config changes or pause/unpaused a folder while it was scanning, you might be affected by that.

That’s definitely a possibility. I’ll see if I can track down a 1.9.0-rc.5 package, delete all the indexes and let it resync.

imsodin · September 1, 2020, 2:59pm

You don’t need to delete the indexes if that bug is the culprit - the problem is fixed on upgrade.

Yes, that’s enough.

darkpixel · September 1, 2020, 3:01pm

Ok–I just pushed out an updated sources.list with the syncthing candidate repo. I’ll upgrade all the boxes in the next few minutes and let it work its magic over the next ~6 hours and I’ll let you know.

darkpixel · September 1, 2020, 4:26pm

I think this is the bug you’re referring to:

github.com/syncthing/syncthing

Modifying config during first scan corrupts folder state

opened 02:47PM - 27 Aug 20 UTC

closed 04:33PM - 27 Aug 20 UTC

tomasz1986

bug frozen-due-to-age

I have seen this problem several times but have not been able to reproduce until… now. How to reproduce: 1. Add a new folder, preferably with a lot of files (so that it takes some time to scan it). 2. While the folder is being scanned for the first time, modify some of its configuration. I changed `Max Conflicts` from `10` to `0`. 3. The folder state is incorrect. ![image](https://user-images.githubusercontent.com/5626656/91457520-663ea300-e8bf-11ea-9683-810116949d88.png) Below is the same folder, which was just left to scan, with no config changes in between. The state is correct here. ![image](https://user-images.githubusercontent.com/5626656/91457650-8b331600-e8bf-11ea-81c0-86f41b727c3d.png) Rescanning does nothing. The only way to fix the problem seems to be wiping out the database.

If that’s the case, I forgot to mention that during one of my resync tests a handful of the machines spit out “Non-increasing sequence detected: Checking and repairing the db…” several times during syncing. I assumed it was related to something else I hosed up during one of my test syncs. After that I stopped syncthing, deleted the databases, and started it again and didn’t see the errors return.

While I’m still only about 20% of the way through the re-sync, it does appear to be working. None of the folders are showing “out of sync” for more than a few seconds while it downloads some data and then it changes back to “syncing” a few seconds later.

I’ll keep you posted.

darkpixel · September 1, 2020, 7:56pm

All devices are uploading/downloading 0 bps. The “local state” on all boxes shows right around 165,002 files plus or minus less than 8 files. Some boxes show that every folder is up-to-date, but the remote devices section shows anywhere from a few KB to tens of GB that still need to be synced out.

Some of the devices have one or two folders that are out of sync. The folders that are out of sync show they have one or two files that need to sync and the total size is always 0 bytes. Clicking the link to see which files still need to sync gives a blank list.

In the “Remote Devices” section, every instance shows that it needs to sync data to partner instances. Clicking on the link to see the files that need to sync shows a lot of files throughout a handful of folders. But every instance shows that the “changed device” was USLOGSDNAS01. Even USLOGSDNAS01.

I’m guessing my next step should be to set STRECHECKDBEVERY=1s and restart Syncthing to see if that cleans it up. Thoughts?

imsodin · September 2, 2020, 12:44pm

Yes, that’s the issue I meant. It’s a bummer that your problem still occurs on v1.9.0-rc.5. Do you still get any non-increasing sequence lines in the log? Please mention anything else in the logs or about the displayed files that stands out or may be relevant. And if you could again provide the /rest/db/file output for one of the files that is shown in the out-of-sync list on a remote device, queried on both involved devices (e.g. in your latest screenshot on uscmsprnas01 and e.g. usbvesdnas01).

You can try that, it should at least clean up the out-of-sync folders with nothing in the actual list.

darkpixel · September 2, 2020, 2:49pm

Just grepped through syslog on all the boxes. I haven’t seen that since the 30th when I was still running 1.8.0.

I’m not seeing anything that jumps out over the last 24 hours. I did restart syncthing on one box to see if that would “unstick” it from sending/receiving nothing. It did generate a panic on shutdown:

Sep  1 17:40:07 uslogsdnas01 syncthing[22541]: [monitor] INFO: Signal 15 received; exiting
Sep  1 17:40:07 uslogsdnas01 syncthing[22541]: [ACQW3] INFO: QUIC listener ([::]:22000) shutting down
Sep  1 17:40:07 uslogsdnas01 syncthing[22541]: [ACQW3] INFO: Relay listener (dynamic+https://relays.syncthing.net/endpoint) shutting down
Sep  1 17:40:07 uslogsdnas01 syncthing[22541]: [ACQW3] INFO: TCP listener ([::]:22000) shutting down
Sep  1 17:40:14 uslogsdnas01 syncthing[22541]: panic: runtime error: invalid memory address or nil pointer dereference
Sep  1 17:40:14 uslogsdnas01 syncthing[22541]: [monitor] WARNING: Panic detected, writing to "/tank/syncthing/panic-20200901-174014.log"
Sep  1 17:40:14 uslogsdnas01 syncthing[22541]: [monitor] WARNING: Please check for existing issues with similar panic message at https://github.com/syncthing/syncthing/issues/
Sep  1 17:40:14 uslogsdnas01 syncthing[22541]: [monitor] WARNING: If no issue with similar panic message exists, please create a new issue with the panic log attached
Sep  1 17:40:14 uslogsdnas01 syncthing[22541]: [monitor] INFO: Reporting crash found in panic-20200901-174014.log (report ID 80985e81) ...
Sep  1 17:40:15 uslogsdnas01 systemd[1]: syncthing@root.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep  1 17:40:15 uslogsdnas01 systemd[1]: syncthing@root.service: Failed with result 'exit-code'.

These appear to be normal messages for processing through files. I have stuff like this in my logs from well before this issue popped up:

Sep  1 05:31:34 uslogsdnas01 syncthing[10286]: [ACQW3] INFO: Puller (folder "corp-accounting-private" (redac-ted), item "Workpaper Account Analysis/CY 2020/(08) Accounts Payable and Accrued Expenses/Credit Cards/(08) August/Activity Downloads/--redacted-- 08.01.20-08.29.20.sync-conflict-20200901-053134-VMGX7M2.xlsx"): file modified but not rescanned; will try again later
Sep  1 05:31:34 uslogsdnas01 syncthing[10286]: [ACQW3] INFO: "corp-accounting-private" (redac-ted): Failed to sync 1 items
Sep  1 05:31:34 uslogsdnas01 syncthing[10286]: [ACQW3] INFO: Folder "corp-accounting-private" (redac-tede) isn't making sync progress - retrying in 1m0s.
Sep  1 05:31:34 uslogsdnas01 syncthing[10286]: [ACQW3] INFO: "corp-accounting-private" (redac-ted): Failed to sync 1 items
Sep  1 05:31:34 uslogsdnas01 syncthing[10286]: [ACQW3] INFO: Folder "corp-accounting-private" (redac-ted) isn't making sync progress - retrying in 1m0s.
Sep  1 10:12:54 uslogsdnas01 syncthing[10074]: [ACQW3] INFO: Puller (folder "corp-employees" (redac-ted), item "ACTIVE/--redacted-name--/Personnel/Status Change Request.pdf --redacted-name-- 9.1.20.pdf"): syncing: pull: no such file
Sep  1 10:12:54 uslogsdnas01 syncthing[10074]: [ACQW3] INFO: "corp-employees" (redac-ted): Failed to sync 1 items
Sep  1 10:12:54 uslogsdnas01 syncthing[10074]: [ACQW3] INFO: Folder "corp-employees" (redac-ted) isn't making sync progress - retrying in 1m16s.

I’ll grab it as soon as I can. It’s starting to feel like a Monday today…

imsodin · September 2, 2020, 3:04pm

That panic/nil deref does jump out at me Not that it has to be connected, but it definitely shouldn’t happen. Could you please post the trace part of the mentioned /tank/syncthing/panic-20200901-174014.log

darkpixel · September 2, 2020, 3:23pm

Here’s the latest one: panic-20200902-074412.reported.log (622.6 KB)

It was triggered by me calling ‘service syncthing@root restart’.

imsodin · September 2, 2020, 3:44pm

That’s an unclean shutdown because services didn’t terminate fast enough. That “shouldn’t” create db inconsistencies, but if this is the only lead we have then we might have to think about whether that’s actually the case. I’d still be interested in the particular panic from above, as it’s not a db closed panic but a nil deref.

darkpixel · September 2, 2020, 3:47pm

Here’s one with the nil deref:

panic-20200901-224518.reported.log (13.5 KB)

That was also triggered by a ‘service syncthing@root restart’ during troubleshooting.

imsodin · September 2, 2020, 5:11pm

Thanks. Same thing (services didn’t terminate on shutdown before db was closed), just different effect.

After those panics, did you ever see a line like Stored folder metadata for ... is out of date after crash; recalculating and/or Repaired ... sequence entries in database?
To not have too many moving parts: Can you run with STRECHECKDBEVERY=1s on only one device that had such a panic and is shown as syncing on any other device. Does the line Repaired ... sequence entries in database appear then? And is it then shown as up-to-date on other devices?

If the answers to the latter two turn out to be yes, it very much points at unclean shutdowns being problematic. Which is bad, because we take great care to do atomic db operations, such that even on a crash the db should stay consistent. Meaning we messed up or the db implementation is messed up.

darkpixel · September 2, 2020, 6:38pm

{
  "availability": [
    {
      "id": "R4HOXEK-O4Q2J2L-MZIYGOB-B7DUWVI-YDCHH4L-U26HPCA-2EYLWIH-CI7B2AC",
      "fromTemporary": false
    }
  ],
  "global": {
    "deleted": false,
    "ignored": false,
    "invalid": false,
    "localFlags": 0,
    "modified": "2018-06-14T15:14:57.5989362-07:00",
    "modifiedBy": "ACQW3IT",
    "mustRescan": false,
    "name": "2017_ALL_LOCATIONS.xlsx",
    "noPermissions": true,
    "numBlocks": 1,
    "sequence": 101385,
    "size": 15851,
    "type": "FILE",
    "version": [
      "ACQW3IT:1599025637",
      "BHS4RWN:1599057984",
      "IM253BD:1599057953",
      "PK2ADL5:1599025637",
      "R4HOXEK:1599025638",
      "V2NGEXC:1599029835",
      "ZICYKLL:1599057944",
      "5RKFSGA:1599025642"
    ]
  },
  "local": {
    "deleted": false,
    "ignored": false,
    "invalid": false,
    "localFlags": 0,
    "modified": "2018-06-14T15:14:57.5989362-07:00",
    "modifiedBy": "UMMCTFK",
    "mustRescan": false,
    "name": "2017_ALL_LOCATIONS.xlsx",
    "noPermissions": true,
    "numBlocks": 1,
    "sequence": 43335,
    "size": 15851,
    "type": "FILE",
    "version": [
      "UMMCTFK:1599025641"
    ]
  }
}

darkpixel · September 2, 2020, 6:40pm

Running this on USLOGSDNAS01 which had a panic and is usually the machine referred to in the out-of-sync list as the file modifier.

It’ll probably take an hour or so for it to scan and sync with the other boxes. I’ll let you know shortly.