Syncthing "stuck" syncing because of device that doesn't exist?

We have about 25 machines that all keep a handful of folders in sync. Things worked beautifully until roughly 3 weeks ago. I’m pretty sure there was a syncthing update. All the boxes are running v1.8.0, Linux (64 bit) “Fermium Flea”.

As you can see in the screenshot below, the box named “uslog00nas01” shows every folder as “Up to date”, but for some reason it’s syncing with almost every other machine on the list. You can also see there’s no data being transferred.

Checking the other boxes doesn’t reveal anything too interesting. Just the same “Up to date” on all their folders and a large amount of data to sync.

…but a few boxes actually list one or two folders as “out of sync”. Most of them show that 0 bytes need to be transferred and it’s usually one or a few files.

Screenshot from 2020-08-18 07-45-15

A few machines show “out of sync” and list a good number of files. For example, one says “116 items, ~13.5 MiB”. When I click on it I never see a list of files.

Here’s the part that’s strange to me.

Every single box shows it needs to transfer data from every other box. Clicking on a box under “Remote Devices”, then clicking the amount that needs to be transferred (for example “160,689 items, ~75.9 GiB”) gives me the following:

What device is “AODWTOK”? Who knows? It doesn’t show up in the device list on any machine.

When I look at the “This Device” section of every box, I see almost no traffic. Maybe a few bytes of upload/download. When I poke through each machine in the “Remote Devices” of every syncthing instance, I see 0 bytes for almost everything. Where it is non-zero, I see a few intermittent bytes.

Steps I have tried to fix the problem (rough commands sketched after this list):

  • Stopped syncthing everywhere, ran syncthing -reset-deltas. After a few hours of churning I end up back in the same state.
  • Stopped syncthing everywhere, ran syncthing -reset-database. After a few days of churning I end up back in the same state.
  • Stopped syncthing, prevented all access to the affected folders by users over the weekend, tried -reset-database again.
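Roughly, the per-box sequence for those resets looked like this (a sketch; syncthing@root is just how the service is named here, and the reset runs as the user that owns the Syncthing config):

    service syncthing@root stop

    # attempt 1: reset delta index IDs, forcing a full index exchange
    syncthing -reset-deltas

    # attempt 2 (a separate run, not together with the above):
    # syncthing -reset-database   # resets the index database, forcing a full rescan and resync

    service syncthing@root start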

Any thoughts?

1 Like

After every update it usually does a full index exchange, which means we refresh what the other side claims to have.

I’d start by checking whether, for a given folder, the local/global counts match across multiple devices.
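For example, something like this on each device, comparing the numbers for the same folder (just a sketch; fill in your API key and folder ID, and adjust http/https to your GUI settings):

    # look at globalFiles/globalBytes, localFiles/localBytes and needFiles/needBytes in the reply
    curl -s -H "X-API-Key: <apikey>" "http://localhost:8384/rest/db/status?folder=<folder-id>"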

The “modified by” device shows the first few characters of the device ID (not the device name).

1 Like

Were those out of sync files deleted or do they still exist on purpose?

At first glance it looks like my problem with an unknown (formerly deleted?) node in “[v1.8.0] Local and global state swapped between two nodes”.

1 Like

I’ve left it running from Friday evening until Monday morning with no access to the folders (they are shared via Samba). The syncthing instances aren’t really exchanging any significant traffic.

I grepped through the .config/syncthing/config.xml file and I don’t see any device IDs starting with those characters. Is this maybe a database issue instead of a config file issue? I.e. does the database still know about an old host even though it was removed from the config file?
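For the record, the check on each box was roughly this (a sketch; the config path depends on which user runs Syncthing):

    # does the mystery prefix appear anywhere in the config?
    grep -c 'AODWTOK' ~/.config/syncthing/config.xml

    # and which device IDs are actually configured (first block of each ID)
    grep -o '<device id="[A-Z0-9]\{7\}' ~/.config/syncthing/config.xml | sort -u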

The local and global states differ by a handful of files on boxes that have a folder that says it’s “out of sync”. Sometimes it’s one or two files, sometimes it’s a few hundred.

1 Like

The database records who last modified the file, regardless of what’s in the config; the two are not related.

It’s possible that the last change was advertised by that device, which has since been removed from everywhere. In that case nobody can download that version of the file, as the device that advertised it is gone and nobody else has a copy.

2 Likes

Could you query the API for info about one of the files shown in the out-of-sync list: https://docs.syncthing.net/rest/db-file-get.html

1 Like

I can’t see the out-of-sync list. The “Out of Sync” folders show 1 (or a few files) out of sync, but zero bytes:

Screenshot from 2020-08-18 12-29-31

Clicking the link never shows any files:

I mean the one on the remote devices you showed in the first post.

As to the out-of-sync numbers not displaying any items in the list: that’s an “accounting error”. Those numbers are adjusted on the fly, while the list is populated from the db. Unfortunately there’s no real angle to debug this after it has happened. You can recalculate the numbers by running Syncthing once with STRECHECKDBEVERY=1s.
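I.e. roughly this, once per device (a sketch with a placeholder unit name; the point is that the variable is only set for that single run, not permanently):

    systemctl stop syncthing@<user>
    STRECHECKDBEVERY=1s syncthing    # forces the periodic db check, which recalculates the counts, to run right away
    # let it settle, stop it again (Ctrl-C), then start the service normally without the variable
    systemctl start syncthing@<user>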

An old device that never synced should not have any effect. A dropped device is removed entirely, also from the global accounting, so no one should need anything from it anymore. What typically causes this is an old, no-longer-used device still configured on maybe just a single, disconnected machine. However, you checked that, so that’s not the case here.

1 Like

Sorry, I didn’t realize you meant from the Devices list. Maybe my curl-fu is messed up, but I tried to check against a file that was still showing in the Devices list:

curl --insecure -X GET -H "X-API-Key: rQ-redacted-av" --data-urlencode "folder=folde-rid" --data-urlencode "file=AP-Scanned/2019/(1) Received Invoices/One time vendor/redacted.pdf" https://localhost:8384/rest/db/file

No such object in the index

I tried this on both the local box that was showing it was out of sync, and on the partner it said it was trying to sync from with the same result.

It seems odd that every Syncthing instance is trying to sync from some “unknown” partner and it’s trying to get that file from every other box even though none of them have that file.

I’ll re-launch all my syncthing instances with STRECHECKDBEVERY=1s and see if that cleans up the calculations.

Should I consider stopping syncthing, doing an rm -rf .config/syncthing, starting it back up, and re-joining all the devices, since this seems like something is potentially corrupt in the database?

Hmm… one more thing. When you say “a dropped device is removed entirely”, would that cover the situation where all syncthing instances were stopped and that specific device was removed from the config file at every location, as well as being removed from the corresponding folder sections?

The curl command looks good. Maybe try a different file/folder to eliminate any trivial mistakes. Or just use the ugly variant of 'https...84/rest/db/file?folder=adsfds-adsf&file=adsfdsf' just in case --data-urlencode does something unexpected (though I wouldn’t know what/why).

Yes, on startup it also drops devices that aren’t in the config.

Not disputing the “odd” part, it definitely is odd. There seems to be a little misunderstanding: the remote devices status shows that this device thinks the remote is out of sync and needs to get data; it doesn’t show what this device is trying to sync from them. There’s no such thing as device-to-device sync. If a device is out of sync (folder status on the left), it tries to sync from anyone that has the data.

1 Like

--data-urlencode posts a URL-encoded body; the API expects the folder and path as GET (query) parameters.
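E.g. curl’s -G flag turns --data-urlencode into query parameters instead of a request body (a sketch, with the same placeholders as your command):

    curl --insecure -G -H "X-API-Key: <apikey>" \
      --data-urlencode "folder=<folder-id>" \
      --data-urlencode "file=AP-Scanned/2019/(1) Received Invoices/One time vendor/redacted.pdf" \
      https://localhost:8384/rest/db/file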

1 Like

Here’s the result from one of the files:

{
      "availability": [
        {
          "id": "N64MLTV-JVBAA5V-LO44A7J-IMZAI3C-HI6HZ4Y-3VUPSP3-ZQONMM2-7A6ZKAX",
          "fromTemporary": false
        },
        {
          "id": "DDTHH6M-J7IK3OW-7ASDBDW-WCTVOMZ-4KTATHY-HFAUDFS-HRZRWOT-P22P2AE",
          "fromTemporary": false
        },
        {
          "id": "VA5LDTE-TNOQRYP-T5Q5TZL-UVODOC3-JDZLUNO-H6PLDID-QVLXVMQ-3O6QKAI",
          "fromTemporary": false
        },
        {
          "id": "WJ6ATOK-7EZW3KU-GK2OWRV-X3JHZXV-4ZOPNJB-GHYDFPF-CGABONC-JJJ3BQM",
          "fromTemporary": false
        },
        {
          "id": "T54X3MF-3HBL5AJ-ENNA5N6-TFNAXLA-XAKFXVC-LTTFM7F-RY3DCD7-SAHMSAS",
          "fromTemporary": false
        },
        {
          "id": "OUTO6GM-3HVM65V-UNWNKQP-JQEOBFJ-PC55RRW-T7MTSTC-Y6KTPKW-XYDHCQZ",
          "fromTemporary": false
        },
        {
          "id": "PYUO3UW-RTGQMIE-QFEGVFU-NGNKJ3S-IH7SC2X-GI2WQ53-G3HWKY7-QAX3PAM",
          "fromTemporary": false
        },
        {
          "id": "JFNA3Y4-MLAM3T7-JWUWHJC-26HU5WI-RQIBLGH-XRKSL5N-Z6ET5RU-P66SKAM",
          "fromTemporary": false
        },
        {
          "id": "K5PDRGM-ZSKNYUS-NZBEBSF-GCV5ADT-Z6XAC3T-DR5WYLD-S7WRL2B-3YWFSAP",
          "fromTemporary": false
        },
        {
          "id": "I5SY4IZ-BQZZWIC-OM3EQZZ-A6MPMRN-7FIPOBU-57IZWTK-AFFVMKF-B6P4AQX",
          "fromTemporary": false
        },
        {
          "id": "S76N44Y-U6PV4QH-VL6BKMX-MM47EPJ-W2J5WDY-DZE3WIG-4B3HNSL-ISXM7QQ",
          "fromTemporary": false
        },
        {
          "id": "SFXUK6K-JTN2HAM-GQ2HPXL-AF2KXSD-APPMMF5-ELWLPKV-PBGQ3SN-PQYGVQ3",
          "fromTemporary": false
        }
      ],
      "global": {
        "deleted": false,
        "ignored": false,
        "invalid": false,
        "localFlags": 0,
        "modified": "2019-11-14T15:40:33.3917547-08:00",
        "modifiedBy": "AODWTOK",
        "mustRescan": false,
        "name": "AP-Scanned/2019/(1) Received Invoices/One time vendor/--redacted--.pdf",
        "noPermissions": true,
        "numBlocks": 2,
        "sequence": 480397,
        "size": 260158,
        "type": "FILE",
        "version": [
          "AODWTOK:1597615616",
          "C6W4T2X:1597615622",
          "DDTHH6M:1597615635",
          "EDWJLXE:1597615621",
          "I5SY4IZ:1597615622",
          "JFNA3Y4:1597615615",
          "K5PDRGM:1597615619",
          "NCABOJA:1597615620",
          "N64MLTV:1597615621",
          "OUTO6GM:1597615615",
          "PYUO3UW:1597675255",
          "QRKBLTY:1597615800",
          "SFXUK6K:1597615631",
          "S76N44Y:1597615616",
          "TZN5NJP:1597618590",
          "T54X3MF:1597615615",
          "VA5LDTE:1597615618",
          "WJ6ATOK:1597715154",
          "2KCBVFP:1597615619"
        ]
      },
      "local": {
        "deleted": false,
        "ignored": false,
        "invalid": false,
        "localFlags": 0,
        "modified": "2019-11-14T15:40:33.3917547-08:00",
        "modifiedBy": "AODWTOK",
        "mustRescan": false,
        "name": "AP-Scanned/2019/(1) Received Invoices/One time vendor/THOMA006_REFUND.pdf",
        "noPermissions": true,
        "numBlocks": 2,
        "sequence": 480397,
        "size": 260158,
        "type": "FILE",
        "version": [
          "AODWTOK:1597615616",
          "C6W4T2X:1597615622",
          "DDTHH6M:1597615635",
          "EDWJLXE:1597615621",
          "I5SY4IZ:1597615622",
          "JFNA3Y4:1597615615",
          "K5PDRGM:1597615619",
          "NCABOJA:1597615620",
          "N64MLTV:1597615621",
          "OUTO6GM:1597615615",
          "PYUO3UW:1597675255",
          "QRKBLTY:1597615800",
          "SFXUK6K:1597615631",
          "S76N44Y:1597615616",
          "TZN5NJP:1597618590",
          "T54X3MF:1597615615",
          "VA5LDTE:1597615618",
          "WJ6ATOK:1597715154",
          "2KCBVFP:1597615619"
        ]
      }
    }

That reveals one tiny bit of info on the timeframe here: it’s a “new-style” version counter, introduced in v1.6.0 in June. I.e. AODWTOK still existed, and this file was modified (or at least synced) since then. Which is probably not overly helpful.

“Unfortunately” there’s nothing wrong with the file info as far as I can see. I suspect the same query on the other side, the one reported to be syncing, will return the same file info; they just didn’t properly exchange it (thus the other side still believes it’s missing). That should be fixed by the db recheck (STRECHECKDBEVERY=1s) - any results on that?

And are there receive-only folders involved?

I ran STRECHECKDBEVERY=1s syncthing (as root) and let it go all night. It appears to be back to the same spot, with all instances showing zero (or a few bytes) of data being uploaded/downloaded.

Which devices are we talking about, i.e. on which device(s) did you run with STRECHECKDBEVERY=1s, and what are you checking in the UI? If the issue is with sequences, I would expect that after running with STRECHECKDBEVERY=1s on device A, the other devices would show device A as up to date in the remote devices section of their UIs.

And just to be sure: The percentages and data amounts on the remote devices are not changing, right?

I ran it on all the devices simultaneously. (I connected to them using cssh, ran service syncthing@root stop, and then launched syncthing manually with STRECHECKDBEVERY=1s syncthing.)

After a few hours I popped open the web interface for all the devices and checked the folder list for folders showing “out of sync”, and the remote devices section to see if any devices showed they were still syncing. Almost every single instance of syncthing still showed that it needed to sync data to almost every other instance, and basically no traffic was being passed.

For example, “uslog00nas01” shows every folder as “Up to Date”, but every device (except one) listed in its “Remote Devices” section shows anywhere from a few GB to a few hundred GB still needing to be synced, and no traffic is being passed, either in the “This Device” section or when I look at an individual device under “Remote Devices”.

A few minutes ago, since I needed some screen real estate back, I hit CTRL+C to stop syncthing, closed my cssh session to all the boxes, used salt to push out a change to my systemd unit file to include the environment variable STRECHECKDBEVERY=1s, and launched syncthing as a service again. So now most of the boxes are exchanging traffic again, but if things follow the previous pattern, it’ll die down to nothing in about an hour.
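For reference, the change pushed to the unit is essentially just this (a sketch; it could equally go in a drop-in like /etc/systemd/system/syncthing@.service.d/override.conf, followed by a daemon-reload and restart):

    [Service]
    Environment=STRECHECKDBEVERY=1s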

EDIT: Sorry, forgot to respond to the last question. If the percentages are changing, it’s very minor and could be related to users having access to those folders and making changes during the day, but I can take a few screenshots this evening and again in the morning and compare, if that would help.

You shouldn’t set that persistently; it will cause expensive db operations whenever Syncthing is started or a folder option is changed. That’s what I meant by running it once with that.

On the changed sync percentages: No need for the screenshots, just wanted to confirm that - thanks.

I am pretty much out of ideas at this point, I am just fishing for information to maybe produce a new idea:

What strikes me as odd, but doesn’t give me any ideas, is that the local state is much lower than the cumulative amount of data to be synced. That could be due to ignore patterns though (?).

Another question I think you didn’t answer yet: are there any send-only or receive-only folders involved?

Do the folder local and global states match between devices?

Right, I figured that, but I needed my screen real estate back. Instead of leaving a bunch of cssh windows open, I just tossed it into the service unit file and launched it that way. I’ll pull it back out after testing.

I can’t be certain they aren’t changing because users still have access to those folders and that might be messing with the counts, but they appear to stick around the same percentage in the evening when no one should be on and changing files.

I don’t have any ignore patterns set for any folders.

No, the folders are send/receive. No send-only or receive-only folders.

Using the ‘accounting-private’ folder as an example, two machines that show that folder as being “Up to Date” have the same local and global state for their folders. The machines that show that folder as “out of sync” have the same global state as the “Up to Date” machines, but they have a different local state:

|              | Files   | Directories | Size    |
| Global State | 104,327 | 8,621       | ~51 GiB |
| Local State  | 104,355 | 8,621       | ~51 GiB |

Here’s what I finally ended up doing (rough command sketch after the list):

  • Stopped syncthing everywhere.
  • For the five folders that weren’t syncing, changed the folder IDs in the config file (salt is so handy!) to something new.
  • rm -rf’d the .stfolder from each of the affected folders.
  • Started syncthing back up.
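In shell terms, per affected folder it was roughly this (a sketch; the paths and the new folder ID are placeholders, and the config edit was actually pushed out with salt):

    service syncthing@root stop
    # give the folder a brand-new ID in the Syncthing config
    sed -i 's/id="old-folder-id"/id="old-folder-id-2"/' ~/.config/syncthing/config.xml
    rm -rf /path/to/the/folder/.stfolder
    service syncthing@root start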

They now appear to be syncing properly and I’m seeing the counts drop. Based on the speed, they should be finished syncing in a few hours. I’ll report back.

2 Likes