Untrusted device never finishes syncing, very high memory usage

(I filed this as a bug report but got redirected here)

Hello amazing people !

I’ve added an untrusted device (let’s call it U) to my normally working setup and it’s struggling to do the initial synch of all the folders. It got around 80% of the data with no problem, synchronised full folders, but it’s struggling with the last 2 of them. The synchronisation rate dropped to 1 byte per second, syncthing uses around 8-14GiB memory on U now and 2 full CPU cores, but not on any of the other hosts.

    PID       TID    MINFLT    MAJFLT    VSTEXT    VSLIBS     VDATA    VSTACK    LOCKSZ     VSIZE    RSIZE     PSIZE     VGROW     RGROW    SWAPSZ    RUID        EUID         MEM    CMD        1/6
  10984         -     96015     60689    10376K        8K     14.1G      136K        0K     14.7G     6.7G        0K        0K     -0.1G      6.0G    syncthin    syncthin     88%    syncthing

This is an excerpt from the logs on U:

Jul 31 23:23:14 brix syncthing[8247]: [] 23:23:14 INFO: Puller (folder photos, item "8.syncthing-enc/DT/L03IHLC16JCGC2LDEHQFOLLD3RVBBO2L64OHKA1MHQ6M4Q1EIMEK5I6TV6OF2Q2VJSV77RBD1UTM0TKLKA7CMLECU4689"): syncing: pull: generic error
Jul 31 23:23:14 brix syncthing[8247]: [] 23:23:14 INFO: photos: Failed to sync 136 items
Jul 31 23:23:14 brix syncthing[8247]: [] 23:23:14 INFO: Folder photos isn't making sync progress - retrying in 35m51s.

When restarted, the syncthing process on U seems to get better for a while, downloading at around 50KiB/s (normally it would be ~30MiB/s for that machine) but then deteriorates again quite quickly.

I’m on Linux (all devices that sync together, except one which is Android). U is an NixOS server, U and most of the other devices are running Syncthing 1.18.0.

Could you please post screenshots from both sides to we could see what state things are in? Also, full logs would be helpful, if you have them.

Sure thing, thank you ! :slight_smile:

laptop, works, synchs with other devices without any problems:

new, untrusted, server, stuck synching:

laptop logs: foureighty.log (825 Bytes)

untrusted server logs: brix.log (36.5 KB)

The server log is presented as is, just with device IDs partially obscured, the laptop log is an excerpt, only including lines concerning the server device.

hope this helps and let me know if I could help in any way :slight_smile:

You should enable model debugging on the device providing the data and capture a larger log when it’s responding to requests.

And given the huge memory usage a memory dump would be interesting: Profiling — Syncthing v1 documentation

I can see there is a lot of filepaths in the log after enabling model debugging, is there a way to automatically anonymise those ? I consider them sensitive and wouldn’t want to post them publicly.

No there isn’t, but I am mostly looking for explanations of why the requests for data get refused.

E.g. lines containing Request, REQ and recheckFile.

syncthing-heap-linux-amd64-v1.18.0-214208.pprof (52.0 KB)

Wow o.O

Now it shouldn’t be hasing anything there as it’s a receive-encrypted folder (easily corrected bug). I’d still like to know why it’s using 12GB of memory doing that. Have you modified any advanced settings, especially copier?

This is straight-out-of-the-box config, just added a device and folders and that’s it, haven’t changed any settings.

Can send config.xml if useful, would need to anonymise tho.

I guess there is no indication here, but how big are the files?

I wonder if we somehow read the encrypted size without decrypting, and then fallocate a massive temp file and then when trying to reuse blocks from the temp file, end up allocating a massive []Hash slice based on some absurd size hint.

Biggest file in the queue (unsynced items) that I can see is 5.38GiB, most of them are around 1MiB

No need, I’ll take your word for it :slight_smile:
Btw thanks for providing all the info requested.

The size on an encrypted file info is meaningful - it’s the actual size of the encrypted file.

Corrupt binary? I mean, I can’t see how else we could allocate 12GB in that function.

I assume this is easily reproducible and happens every time?

Binary has the same hash across different boxes and the same binary was used with great success to sync the 200GiB or so of the whole library over to the 5 other boxes without any problems, the only problematic one is the untrusted one, my assumption would be that the difference is in the encrypted folder handling perhaps ?

[root@brix:~]# sha256sum /nix/store/0dr7cmbimqhdxwkhpk4akziijyvyxzr5-syncthing-1.18.0/bin/syncthing 
da61f6e5be86fd1657b2b1dfc9f972ab6d651e6e57188d6e88ad2c90807cb136  /nix/store/0dr7cmbimqhdxwkhpk4akziijyvyxzr5-syncthing-1.18.0/bin/syncthing

[root@brix:~]# logout
Connection to brix closed.

❯ sha256sum /nix/store/0dr7cmbimqhdxwkhpk4akziijyvyxzr5-syncthing-1.18.0/bin/syncthing  
da61f6e5be86fd1657b2b1dfc9f972ab6d651e6e57188d6e88ad2c90807cb136  /nix/store/0dr7cmbimqhdxwkhpk4akziijyvyxzr5-syncthing-1.18.0/bin/syncthing

Yes, behaviour is fully reproducible even after restarting all the machines involved :slight_smile:

I think it has to be something like what you proposed - some value that’s set to something unexpected in the encrypted case. Still no clue how/what/why.

Fix is here: lib/model: Don't consider hashes pulling on recv-enc by imsodin · Pull Request #7869 · syncthing/syncthing · GitHub

I’d still prefer we understood what’s happening.

Sure, this will fix the symptom, but we don’t know the cause.

If this is going wrong, what else is going wrong?

Did you try the encryption feature before it made it into the official release?

Would you be ok running a custom dev build with some extra logging to understand where the allocations are coming from?