Large repositories -- general questions and some issues

firecat4153 · May 25, 2014, 6:11pm

Greetings! First, thanks for this awesome software…it’s filling a much-needed gap in open-source syncing!

Some of these comments may be bugs, but I just wanted to check if it’s me or a known software limitation before filing them on the issue tracker.

I have about 150GB of files (so far) split between 4 repositories on two nodes (laptop and server). Global discovery is turned off because I have the static IP:port set for the server node and port 22000 and 8080 forwarded from my router to the server. In my home network, syncing works fine.

Problem: When I suspend to RAM (laptop) and then resume, syncthing won’t reconnect to the server node unless I restart syncthing on the laptop.
Problem: At work, the laptop is connected to the internet via an OpenVPN connection. The laptop contacts the server just fine and they go through the syncing after connection (I can see the upload/download going on. I’m assuming it’s exchanging the file hashes to compare?) However, right after the up/download stops on both nodes, there is a ping timeout that closes the connection. It re-establishes right away and then goes through the sync procedure again. This repeats forever, so that a final sync ‘green’ is never completely achieved. I used STTRACE=net but that really wasn’t informative other than to tell me about the ping timeouts. I’m running it again with STTRACE=all but it’s taking forever, so I don’t have those results yet. I’ll post anything interesting that pops up.
Question: are the *.idx.gz files in the config directory the files that are transferred to/from the nodes each time a node connection is made? For my 150GB of files, those *gz files total about 50MB, which is a fair amount of network traffic back and forth each time a connection is made. Adds up on an unstable connection! Is that normal?
Question: Is there a direct download link to the ‘latest’ binaries that doesn’t have the version number in the URL? It would be nice to have a direct link to ‘latest’ for each architecture to help a little in updating system packages and Dockerfiles without having to edit the version number in the URL.

Thanks again! Scott

calmh · May 25, 2014, 7:18pm

This shouldn’t happen. That’s how I roll and for me this works perfectly (on Mac).

firecat4153:

Problem: At work, the laptop is connected to the internet via an OpenVPN connection. The laptop contacts the server just fine and they go through the syncing after connection (I can see the upload/download going on. I’m assuming it’s exchanging the file hashes to compare?) However, right after the up/download stops on both nodes, there is a ping timeout that closes the connection. It re-establishes right away and then goes through the sync procedure again. This repeats forever, so that a final sync ‘green’ is never completely achieved. I used STTRACE=net but that really wasn’t informative other than to tell me about the ping timeouts. I’m running it again with STTRACE=all but it’s taking forever, so I don’t have those results yet. I’ll post anything interesting that pops up.

There was a bug previously that would result in exactly this if the initial index exchange took longer than the ping timeout, but that’s supposed to be fixed… Please file an issue on that, including the traces if possible, and I’ll investigate…

Yep. The .idx.gz contains the index of files on disk, which is exchanged with other nodes on connect.

No. The files are hosted on Github, which hosts them on S3 using magic unique identifiers. However, it’s possible to get the URL using the Github API. Syncthing does this for the -upgrade command; see https://github.com/calmh/syncthing/blob/master/cmd/syncthing/upgrade.go for the general mechanism (pull some JSON, find the relevant release for a given platform, extract the URL for a .tar.gz).
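For packaging scripts, a minimal Python sketch of that "pull some JSON, find the asset, extract the URL" idea might look like the following. It is not a port of upgrade.go. The `releases/latest` endpoint and the `assets[].name` / `browser_download_url` fields are part of the public GitHub REST API; the repository path is simply taken from the URL above, and the assumption that asset names contain a platform string such as `linux-amd64` is a guess that should be checked against the real release page.

```python
import json
import urllib.request

# GitHub REST endpoint for the most recent non-draft, non-prerelease release.
# The repository path is copied from the thread and may have moved since.
LATEST_URL = "https://api.github.com/repos/calmh/syncthing/releases/latest"


def pick_asset_url(release, platform):
    """Return the download URL of the first .tar.gz asset whose name mentions platform.

    `release` is the decoded JSON of one release. The naming convention
    (platform string embedded in the asset name) is an assumption.
    """
    for asset in release.get("assets", []):
        name = asset.get("name", "")
        if platform in name and name.endswith(".tar.gz"):
            return asset["browser_download_url"]
    return None


def fetch_latest_release(url=LATEST_URL):
    """Download and decode the release description (needs network access)."""
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

A Dockerfile helper could then call `pick_asset_url(fetch_latest_release(), "linux-amd64")` and download whatever URL comes back, handling the `None` case when no asset matches.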

firecat4153 · May 25, 2014, 8:45pm

Thanks for the quick reply!

If I continue to have the problem while suspending, which would be the most useful debug flags to pass?
Github issue filed: https://github.com/calmh/syncthing/issues/280

One other issue that I’m not sure is normal or not: When I add a new repository (large…30-100GB), the initial indexing is fine, but when the repository is sync’d for the first time to another node (the files already exist on both nodes and are previously sync’d with rsync), the disk IO is huge and results in load averages of 12-14 on my i5 laptop w/ non-SSD hard drive. Pretty much grinds everything else on the laptop to a halt for a few hours until it finishes. Server is marginally better with an i3 w/ btrfs RAID1 hard drives. Is that normal? v0.8.9 on both machines.

Thanks! Scott

Edit: FYI, same issue with v0.8.10 and the ping timeout. The upload speed from the server node is really slow compared to the upload from the laptop on this VPN.

jedie · May 26, 2014, 6:48am

This is, IMHO, because syncthing builds the index on every machine. That takes time and results in high load. I use “nice -n 19 ionice -c3” to start syncthing.
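If syncthing is started from a script rather than a shell, the same wrapping can be sketched in Python. This assumes a Linux machine where `nice` and `ionice` are installed and a `syncthing` binary is on the PATH; the function names are made up for illustration.

```python
import subprocess


def low_priority_command(binary="syncthing", extra_args=()):
    """Build an argv that runs `binary` at lowest CPU priority and idle I/O class."""
    # nice -n 19  -> lowest CPU scheduling priority
    # ionice -c3  -> "idle" I/O class: disk access only when nobody else wants it
    return ["nice", "-n", "19", "ionice", "-c3", binary, *extra_args]


def launch(binary="syncthing"):
    """Start the process in the background and return the Popen handle."""
    return subprocess.Popen(low_priority_command(binary))
```

This does not reduce the total amount of hashing and disk I/O; it only lets other programs go first.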

On every connect? Or only on first connect?

calmh · May 26, 2014, 10:33am

On every connect; syncthing doesn’t keep track of the state of things on other nodes when they are not connected.

calmh · May 26, 2014, 10:33am

It’s expected, but a bit crappy. This happens on the initial sync between machines that both have the same set of files, but that are not “in sync” according to syncthing. I have an optimization coming soon that will fix this.

jedie · May 26, 2014, 10:52am

Hm. Isn’t a “diff” of the files changed since the last sync enough?

It all sounds to me like syncing many files over low bandwidth is not really efficient. I hope I’m wrong here.

e.g:

150GB of MP3 files has an index of about 50MB
Sync was completed
One tag in one MP3 file was added
The next sync must transfer the 50MB index file to see that one file was changed?

calmh · May 26, 2014, 11:01am

No no. On every connect. Once two nodes are connected, only diffs are transmitted.

jedie · May 26, 2014, 12:06pm

Thanks for clarification… The example from above, modified:

150GB of MP3 files has an index of about 50MB
Sync was completed between all nodes
Nodes are disconnected
One tag in one MP3 file was added
All nodes must transfer the 50MB index file to see that one file was changed?

Is that right? If yes, then it is inefficient, especially for clients that often disconnect, like mobile phones or laptops.

calmh · May 26, 2014, 12:10pm

That’s correct. My laptop disconnects on average twice a day. Once, when closed to go to work. The second, when I close it to go home again.
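To put rough numbers on the cost being discussed, here is a back-of-the-envelope sketch. It assumes the whole index is resent by one node on every reconnect and that the on-wire size is about the 50MB quoted earlier for the *.idx.gz files; the real wire size may differ.

```python
def index_traffic_mb(index_size_mb, reconnects_per_day, days=30):
    """Index data sent by one node over `days`, if the full index goes out on every connect."""
    return index_size_mb * reconnects_per_day * days


# 50MB index, two reconnects a day (the commute pattern described above).
commuter = index_traffic_mb(50, 2)    # 3000 MB per 30 days
# Same index on a flaky link that drops 20 times a day (hypothetical figure).
flaky = index_traffic_mb(50, 20)      # 30000 MB per 30 days
```

With two reconnects a day the overhead is modest next to 150GB of data; it only starts to hurt when the connection drops many times a day.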

jedie · May 26, 2014, 12:14pm

Related to this, IMHO, is https://github.com/calmh/syncthing/issues/229, where I added a quick idea:

Split the index into a few smaller parts. Start the transfer after the first index part is created, so the node can start syncing once it gets the first part rather than after it gets the complete index.

Maybe this can be combined with the following: the node can say which parts of the index it has, so that only the missing parts must be transferred.
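As a purely illustrative sketch of this proposal (not how syncthing works), the index could be cut into fixed-size parts, each identified by a digest, so that a peer can list the digests it already holds and receive only the rest. The entry layout and function names here are invented for the example.

```python
import hashlib
import json


def split_index(entries, part_size):
    """Yield (digest, part) pairs for consecutive slices of the index.

    `entries` is a list of JSON-serialisable records, e.g. [name, mtime, hash].
    """
    for start in range(0, len(entries), part_size):
        part = entries[start:start + part_size]
        blob = json.dumps(part, sort_keys=True).encode("utf-8")
        yield hashlib.sha256(blob).hexdigest(), part


def parts_to_send(entries, part_size, digests_peer_has):
    """Return only the parts whose digest the peer did not announce."""
    return [
        part
        for digest, part in split_index(entries, part_size)
        if digest not in digests_peer_has
    ]
```

One limitation of fixed-size slicing: inserting a single entry near the start shifts every later part and changes all of their digests, so a real design would need a more stable way of partitioning (for example per directory).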

calmh · May 26, 2014, 12:48pm

Thing is, the node has no parts of the index if they are not connected.

jedie · May 26, 2014, 1:08pm

The nodes just have to save/cache the index parts locally, so that every node can say on connect: I have an index up to date XY, or I have part XY, or so…
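A minimal sketch of the "I have an index up to XY" variant, assuming every index record carries a monotonically increasing version number (a hypothetical field, not something the thread says syncthing has):

```python
def changes_since(index, last_seen_version):
    """Records the peer has not seen yet, given the highest version it cached."""
    return {
        name: rec
        for name, rec in index.items()
        if rec["version"] > last_seen_version
    }


def apply_delta(cached_index, delta):
    """Merge a received delta into the locally cached copy of the peer's index."""
    merged = dict(cached_index)
    merged.update(delta)
    return merged
```

On connect a node would announce the highest version in its cached copy, and the other side would answer with `changes_since(...)` instead of the full index. Deletions would still need to be represented as records so that they show up in the delta.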

I don’t know how everything works right now, so maybe my idea is bullshit. But you agree with me that transferring the index on every connect is not really efficient, don’t you?

calmh · May 26, 2014, 1:22pm

In my use case, I sync about 30.000 files across ~8 GiB in four repos. The index that is sent on startup is 3.5 MiB. From my point of view this is just fine and dandy. Bigger repos have other issues, one of them being that said index is kept in RAM as well. It should probably be redesigned with another on disk database system, and at that point keep “dead” indexes around for peers as well. But I don’t see it as very high up on the list of must-fix-now stuff, no.

jedie · May 26, 2014, 2:11pm

That’s not really large… Currently I’m testing it only with my music, which is two repos:

87GB in ~64000 files
100GB in ~22000 files

In the end, I would like to use syncthing not only for music but also for my DSLR pictures and videos. So the complete data set is currently ~871GB in 160000 files…

But these files don’t change often and the collection isn’t growing very fast, e.g. I take around 200-400 new pictures per month…

In the past I used offline backups and synced them with rsync from time to time.

My hope was that metadata changes would sync more quickly, e.g. adding some tags to my pictures.

Oh! Yes, that’s another problem. Now I know why it’s currently not running well on my other server. It has only 1GB RAM.

Why must the index be kept in RAM? Why not use an SQLite database?
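To make the question concrete, here is a small sketch of what an on-disk index could look like using Python’s built-in sqlite3 module. The table layout and column names are invented for illustration and are not syncthing’s data model.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) an index database; pass a file path to keep it on disk."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS files (
               repo     TEXT    NOT NULL,
               name     TEXT    NOT NULL,
               modified INTEGER NOT NULL,
               size     INTEGER NOT NULL,
               hash     TEXT    NOT NULL,
               PRIMARY KEY (repo, name)
           )"""
    )
    return db


def upsert(db, repo, name, modified, size, digest):
    """Insert a file record, replacing any previous record for the same path."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
            (repo, name, modified, size, digest),
        )


def lookup(db, repo, name):
    """Return (modified, size, hash) for one file, or None if unknown."""
    return db.execute(
        "SELECT modified, size, hash FROM files WHERE repo = ? AND name = ?",
        (repo, name),
    ).fetchone()
```

With a layout like this only the rows being compared need to be in memory at any time, at the price of disk lookups during scans.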

So one conclusion now is that syncthing is currently not good for very large repositories. That’s one point for:

calmh · May 26, 2014, 2:50pm

Oh, there’s all kind of reasons but it mostly boils down to “this was much easier and good enough for everything I throw at it”.

jedie · May 26, 2014, 2:58pm

That’s a good point

OK, so we agree here that the current solution works but should be changed in the future, don’t we?

EDIT: Added a ticket for this: https://github.com/calmh/syncthing/issues/295

firecat4153 · May 26, 2014, 6:25pm

Thanks for all the input. I think for now I’m going to keep my large media repos on unison + cron. I will continue to use Syncthing on smaller data sets and periodically test the large repos as it evolves. If you have specific changes you’d like to field test on the larger repos before a release, feel free to contact me!

Scott

calmh · May 27, 2014, 6:14am

Yep. Agreed that scalability for large repos is something that needs to be addressed.

Alex · May 27, 2014, 1:08pm

Another question or feature request I have about the index: what happens with the info about deleted files? Is it deleted some day? If this does not happen already, it should, because sometimes even file names can contain private data, and that data never goes away if it is never removed from the index. The info could be removed after a configurable number of days (one year maybe; after this time all nodes should have been online at least once). Another way to implement this (probably harder, because it is not the same on all nodes) would be to remove the info that is older than the last sync of all nodes in the list.

This also helps for large repositories, because it removes some data that has to be transferred if files are often deleted or moved.
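Both pruning rules suggested here can be sketched in a few lines of Python. The record layout (`deleted` flag, `modified` timestamp in seconds) and the per-node last-sync table are assumptions made for the example, not a description of the existing index.

```python
SECONDS_PER_DAY = 86400


def age_cutoff(now, max_age_days=365):
    """Rule 1: drop deletion records older than a configurable number of days."""
    return now - max_age_days * SECONDS_PER_DAY


def all_synced_cutoff(last_sync_by_node):
    """Rule 2: drop deletion records older than the oldest 'last sync' of any node."""
    return min(last_sync_by_node.values())


def prune_deleted(index, cutoff):
    """Return a copy of the index without deletion records older than `cutoff`."""
    return {
        name: rec
        for name, rec in index.items()
        if not (rec["deleted"] and rec["modified"] < cutoff)
    }
```

Under rule 1 a node that stays offline longer than the limit would never learn about the deletion, so it would have to be treated as a fresh node when it comes back.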