Large repositories -- general questions and some issues

Every repo has the “shared with” info, so it should be possible to delete the “deleted files” information once all “shared with” nodes are up to date.

This is the plan. It’s not implemented today. There are some practical tricky issues to sort out, mainly to do with knowing when all other nodes are up to date. Having a persistent index for other nodes will help with this, since it doesn’t have to be done “in real time” when all nodes are connected.

I have a repository with all my Pictures: 55934 files, 507 GiB. I want to keep it in sync between my PC and my NAS server.

I’ve installed syncthing as a FreeNAS plugin; after rescanning the repo, the RAM utilization is 868 MiB :smile:

You got off lightly!

Yeah, this is a focus for improvement in 0.9.

:smile:

Another question, about multi-CPU performance: during scanning, neither my CPUs nor my HDD looks saturated. Is there something I can do about it?

(scanning is from Disk2 Y:)

And BTW THANKS for a great piece of software!!!

Scanning a single repository is single threaded, i.e. reading from disk is interleaved with hashing the data that was read. For many small files you’ll be limited by the disk IOPS; for large files you’ll probably be CPU limited (with OS readahead handling the IO side of it).
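As a rough sketch of what that interleaved loop means (not the actual scanner code; the hash and the directory walk here are just stand-ins for illustration):

```go
// Hypothetical sketch of a single-threaded scan: each file is read and
// hashed in the same loop, so the disk and the CPU take turns instead
// of working in parallel.
package main

import (
	"crypto/sha256" // stand-in hash, used only for illustration
	"fmt"
	"io"
	"os"
	"path/filepath"
)

func scanRepo(root string) error {
	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		// io.Copy alternates: read a chunk from disk, feed it to the hash.
		// Many small files leave us waiting on IOPS; large files leave us
		// waiting on the hash, i.e. CPU bound.
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		fmt.Printf("%x  %s\n", h.Sum(nil), path)
		return nil
	})
}

func main() {
	if err := scanRepo(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```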

How is the resource usage on large repositories? Is file metadata kept in RAM or on disk? I tried to share my multi-gigabyte music collection on my NAS and the RAM wasn’t enough (512 MB).

In v0.8, in RAM. In v0.9, on disk.

(thanks for syncthing, it is great!)

Happy to see the metadata is now stored on disk. But about the 50 MB of data sent on each connect (for big repos): does that still work the same way? I have 227 GB in 1.2 million files and will have to connect several times a day.

And also: for big repos the initial scan can take a lot of time. Is it safe to stop it and continue from where I left off, or do I have to wait until it finishes?

Syncthing still transfers the entire index on connect, yes, but the protocol machinery is in place to reduce this to only sending the changes since the last connect. There’s a bunch of code to write for that, though… but it’s coming.

Yes, you can stop and restart syncthing during the initial scan; the files that have been scanned will be saved to the index[1], and scanning will resume with the remaining files on restart. All files still need to be checked for modification time, though, which can itself take a while on 1.2 million files. I’m interested to hear how this goes; this is probably the largest installation I know of, in number of files.

The scanner is now more parallelized and should be able to use all CPU cores for hashing, if the disks can feed them with data.

1: Mostly. The scanned files are saved in batches of 1000, so you might lose scan data for up to 999 files when you kill syncthing. Those files will be rescanned on next startup though, so no harm done.
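As a rough illustration of that batching (not the real implementation; `fileRecord` and the flush callback are made up for this sketch), the idea is simply to buffer scan results and write them out every 1000 files:

```go
// Hypothetical sketch of batched index writes: scanned file records are
// buffered and flushed every batchSize files, so killing the process
// loses at most batchSize-1 unflushed records, which simply get
// rescanned on the next startup.
package main

import "fmt"

const batchSize = 1000

type fileRecord struct {
	Name    string
	ModTime int64
	Hash    []byte
}

func saveScanned(results <-chan fileRecord, flush func([]fileRecord) error) error {
	batch := make([]fileRecord, 0, batchSize)
	for rec := range results {
		batch = append(batch, rec)
		if len(batch) == batchSize {
			if err := flush(batch); err != nil {
				return err
			}
			batch = make([]fileRecord, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		return flush(batch) // flush the remainder when the scan finishes normally
	}
	return nil
}

func main() {
	results := make(chan fileRecord)
	go func() {
		for i := 0; i < 2500; i++ {
			results <- fileRecord{Name: fmt.Sprintf("file-%d", i)}
		}
		close(results)
	}()
	saveScanned(results, func(b []fileRecord) error {
		fmt.Println("flushing", len(b), "records") // stands in for a real database write
		return nil
	})
}
```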

It’s running the initial scan (started about 2 hours ago). I’ll update with info regarding its performance.

Also, another question: each time the app scans for changes, will it hash every single file? Or what other mechanism does it use to recognize changes in the files?

The initial scan is finished. Syncthing reports this (the screenshot says “scanning”, but the first scan finished and then it started scanning again without me doing anything; I assume this is the periodic scan to find changes).

And the config dir weighs almost 200 MB.

As for the previous question about the change detection logic, I suspect the app hashes every file every time it looks for changes, because of the huge and uninterrupted CPU and disk usage. I understand that looking for changes in 1.2 million files isn’t cheap, but I know it can be done with far less effort. Just to add some perspective, here is a comparison with rsync (which I want to replace with syncthing) detecting changes in those same files:

  • syncthing: hours at ~40% CPU and 90% disk
  • rsync: 15 seconds at 10% CPU and 10% disk

That’s about 2000 times more CPU usage and 4000 times more disk usage.

I won’t be able to use it for now; the load would fry my laptop in a few days. But I’ll keep an eye on the project, as I really want to use it: syncthing seems to be everything I wanted in a sync tool :slight_smile: If you need me to try any changes or optimizations, just ask and I’ll be glad to help!

Unless you changed the defaults, syncthing scans for changes every 60 seconds, which is probably too often for your use case. Only modification times are scanned; if those differ, the file is read and hashed.

Rsync in this case is only doing the cheap, modification-time-only scan, since it’s comparing against an identical tree. You should see similar scan times on the periodic rescans, or something is weird.
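In very rough sketch form (not syncthing’s actual code; `indexEntry` and the map here are stand-ins for the real on-disk index), the per-file decision looks something like this:

```go
// Hypothetical sketch of the cheap rescan decision: a file is only
// re-read and re-hashed when its size or mtime differs from what is
// stored in the index.
package main

import (
	"fmt"
	"os"
	"time"
)

type indexEntry struct {
	Size    int64
	ModTime time.Time
}

// needsRehash reports whether path must be read and hashed again.
func needsRehash(path string, index map[string]indexEntry) (bool, error) {
	info, err := os.Stat(path) // one stat() per file, no file data read
	if err != nil {
		return false, err
	}
	old, ok := index[path]
	changed := !ok || old.Size != info.Size() || !old.ModTime.Equal(info.ModTime())
	return changed, nil
}

func main() {
	index := map[string]indexEntry{} // stands in for the persistent index
	changed, err := needsRehash(os.Args[1], index)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("needs rehash:", changed)
}
```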

Today I tried again, and now the subsequent scans no longer take hours; they take far less, about 2 minutes. Maybe something happened in the first scan that made it invalid, so the app had to do it again?

2 minutes is OK for me; I’m configuring it to do the scans once every hour :slight_smile: But I have three more questions (still related to having large repos):

  • The scan interval seems to be global to the syncthing installation. Is it possible to configure different scan intervals for each repo? Having a giant repo, I want the big one to be scanned once every hour, but the others (repos to share files with people) to be scanned every minute.

  • Is it correct to assume that the ~200 MB in the index folder is the amount of data transferred on each connect?

  • About the RAM usage: yesterday, for the initial scan, it needed about 600 MB, but after a restart it is only using 180 MB. Besides the possibility of some kind of leak (unused data still in RAM), is there any way to use that same index for another machine, without having to do an initial scan there? The server I want to synchronize to has 256 MB of RAM, so it won’t be able to do the initial scan, but it has enough memory for the normal scans/syncs, and it currently has a mirror of all the files, so the hashes should be the same.

Not currently, but it would be a small change. Please file a feature request on GitHub.

No, the transmitted index is much smaller. It contains less data, is more compressed, and above all the index on disk contains info for all nodes in the cluster, while only the data for the sending node is sent. If you let two nodes connect and look at the sent and received counters in the GUI, you can see how much data was actually sent over the network.
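Very roughly (this is not syncthing’s actual wire format; JSON and gzip here are just stand-ins to show the idea), the difference between the on-disk index and what goes over the wire looks like this:

```go
// Hypothetical sketch: the on-disk index holds entries for every node in
// the cluster, but only the local node's entries are serialized,
// compressed, and sent on connect.
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"os"
)

type indexEntry struct {
	Node string // which node announced this file version
	Name string
	Hash string
}

func encodeLocalIndex(all []indexEntry, localNode string) ([]byte, error) {
	var local []indexEntry
	for _, e := range all {
		if e.Node == localNode {
			local = append(local, e) // skip entries owned by other nodes
		}
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if err := json.NewEncoder(zw).Encode(local); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	all := []indexEntry{
		{Node: "local", Name: "a.jpg", Hash: "…"},
		{Node: "remote", Name: "b.jpg", Hash: "…"},
	}
	payload, err := encodeLocalIndex(all, "local")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("compressed payload bytes:", len(payload))
}
```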

I made a bunch of optimizations to allocate less memory during scanning in 0.9.3+, so if you’re running 0.9.2, please upgrade. If you are already on 0.9.4, some further optimizations are coming in 0.9.5 that might help a little more (but I’m not sure how much yet).

As for copying the index: yes, you can do that. Syncthing will assume that the files are scanned and do the “fast” compare against the index. This will only “work” if the files have the same modification times as on the other node; otherwise it’ll need to rescan anyway.

Done: Configurable scan interval per repo · Issue #521 · syncthing/syncthing · GitHub

Ok!

Great, I’m already using 0.9.4.

Great! Thanks for all the answers :slight_smile:

Just in case it’s useful to you: I’m using 0.9.5 now, and it consumes a lot more RAM than before. It now requires 1.1 GB (0.9.2 required 180 MB).

Eh? :confused: Luckily, there’s some profiling for this. Can you run with “STHEAPPROFILE=1 syncthing”? It’ll create a bunch of files called heap-(something).pprof, one every 250 ms as long as the heap usage is growing. Let it run for a while, then send over the latest and largest of the profile files.

Oh, and send the specific binary you are running, if it’s not one of the ones from the GitHub releases page, so I can match it up with the profile.
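For reference, that kind of periodic heap dump can be done with Go’s runtime/pprof; the sketch below is only an illustration of the idea, not the actual code behind STHEAPPROFILE:

```go
// Hypothetical sketch of periodic heap profiling: every 250 ms, if the
// heap has grown since the last dump, write a new heap-<n>.pprof file
// using runtime/pprof.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func heapProfiler() {
	var lastInUse uint64
	var n int
	for range time.Tick(250 * time.Millisecond) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		if ms.HeapInuse <= lastInUse {
			continue // only dump while the heap keeps growing
		}
		lastInUse = ms.HeapInuse
		f, err := os.Create(fmt.Sprintf("heap-%05d.pprof", n))
		if err != nil {
			continue
		}
		if err := pprof.WriteHeapProfile(f); err == nil {
			n++
		}
		f.Close()
	}
}

func main() {
	if os.Getenv("STHEAPPROFILE") != "" {
		go heapProfiler()
	}
	// ... the rest of the program would run here as normal ...
	select {} // placeholder: block forever in this sketch
}
```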

OK, I’m uploading the biggest profile file, generated when 764 MB of RAM were in use. Something important: I discovered that the RAM usage slowly increases over time. It starts at about 100 MB and keeps growing.

The profile data and syncthing binary (downloaded by syncthing itself, with the update feature) are here: http://goo.gl/vEAVGq

Thanks. The code in the database driver that’s supposed to reuse and release buffers is obviously not doing its thing entirely optimally.