Slow Responding Peer Holds Up Queue?

Hello:

Hope I’m not wasting anyone’s time again… :wink: …I’m just doing a lot of syncing with remote colleagues during this strange period, and since the release of 1.4.0 (yay again!), I’m having great fun watching the sync queue progress. It’s an extremely satisfying experience…

So, with the caveat that I might be entirely wrong, I think I’ve just seen something which causes syncs to hang for a while: a 2.5GB file was being pulled from a remote colleague - but then it seemed to hang for quite a while - probably a good ten minutes. During this time, the queue didn’t appear to progress at all.

The blockage then suddenly lifted, and loads of other files in the queue started flowing in from a NAS on the local network.

The lifting of the blockage was accompanied by the following log entries:

2020-03-26 08:43:54 Connection to KF4COM3-*******-*******-*******-*******-*******-*******-******* at 192.168.181.27:64613-82.xx.yy.zzz:65450/tcp-client/TLS1.3-TLS_AES_128_GCM_SHA256 closed: writing message: write tcp 192.168.181.27:64613->82.xx.yy.zzz:65450: write: broken pipe
2020-03-26 08:44:22 Puller (folder "Updates - *****" (pmf4l-*****), item "<file path here>"): pull: connection closed

If I understand correctly, pulls are fulfilled by whichever peer responds first, which makes sense. But is there any provision for the case where a peer responds and then hangs whilst fulfilling the pull request, so that it doesn’t block the rest of the queue until it times out many minutes later?

I’m guessing this might be more of an issue since the move to large blocks.

Thanks, and feel free to dismiss me if it’s already being handled! :laughing:

There isn’t, really, so a dead peer will likely stall everything until the connection finally drops after (iirc) five minutes or so. A nearly dead peer that just drops in a few bytes every few minutes can probably block everything indefinitely…

The logs show that pulling hung because the connection was broken (which unfortunately isn’t immediately detectable). When the connection finally closed (probably due to a timeout or ping failure), pulling exited with an error and was immediately retried, this time on the live, working connection.

10 minutes is too long though; the timeout for stale connections is 5 min.

Ah cool. Five minutes sounds like a reasonable compromise then - I might be mistaken as to how long I waited for it.

Would I be right in suggesting that, in a situation where I have ~5 peers for a folder, and 1 or 2 of them are on slow connections, I could increase the number of Copiers for the folder to allow more pulling concurrency and make better use of other peers?

I suspect the limiting factor is rather the pullerMaxPendingKiB config. Once we have requested that many kilobytes from the network we will not make more requests, even if all of them are from slow peers. In any case you’ll probably get stuck fairly quickly as each file is likely to end up requesting at least one block from a slow peer and then getting stuck there even if all other blocks get returned quickly.
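To make the pending-kilobytes limit a bit more concrete, here is a minimal sketch of a byte budget in Go. It is not Syncthing's actual code: the `byteBudget` type, its methods, and the 32 MiB figure are all made up for illustration. It only shows the idea that once the budget is used up by requests to slow peers, no further request can start, even one aimed at a fast peer.

```go
package main

import (
	"fmt"
	"sync"
)

// byteBudget is a hypothetical limiter (not a Syncthing type). It caps the
// number of bytes that may be outstanding in network requests at once.
type byteBudget struct {
	mu      sync.Mutex
	cond    *sync.Cond
	limit   int
	pending int
}

func newByteBudget(limit int) *byteBudget {
	b := &byteBudget{limit: limit}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// acquire blocks until n more bytes fit under the limit.
// Assumes n <= limit; a larger n would wait forever.
func (b *byteBudget) acquire(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.pending+n > b.limit {
		b.cond.Wait()
	}
	b.pending += n
}

// tryAcquire is the non-blocking variant: it reports whether n bytes fit.
func (b *byteBudget) tryAcquire(n int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.pending+n > b.limit {
		return false
	}
	b.pending += n
	return true
}

// release returns n bytes to the budget and wakes any waiting callers.
func (b *byteBudget) release(n int) {
	b.mu.Lock()
	b.pending -= n
	b.mu.Unlock()
	b.cond.Broadcast()
}

func main() {
	const MiB = 1 << 20
	budget := newByteBudget(32 * MiB) // illustrative limit only

	// Two large blocks requested from slow peers use up the whole budget.
	fmt.Println("slow block 1 admitted:", budget.tryAcquire(16*MiB))
	fmt.Println("slow block 2 admitted:", budget.tryAcquire(16*MiB))

	// A small block for a fast peer cannot be requested in the meantime.
	fmt.Println("fast block admitted:", budget.tryAcquire(128<<10))

	// Only once a slow request completes does the fast one get through.
	budget.release(16 * MiB)
	fmt.Println("fast block admitted after release:", budget.tryAcquire(128<<10))
}
```

A real puller would presumably use the blocking `acquire` path rather than `tryAcquire`; the non-blocking variant is used in `main` only to keep the demonstration deterministic.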

See above for the bottleneck regarding pullerMaxPendingKiB. The below is about copiers, which probably become relevant once you increase the puller limit enough:

The other peers are already used much more than the slow ones. Another copier would only help if you have two slow peers and two copiers, and each slow peer blocks one copier by taking a super long time for a single block. This is much more likely to happen with huge files, as blocks there are large (up to 16MB).

A solution for this kind of scenario would require some kind of smart scheduling, that detects congestion and slowness, and does something sensible about it. That sounds complicated though.

But copiers are in the stage before pulling blocks. I don’t think having two copiers means we process just two files at a time; as far as I remember, it means we only copy blocks for two files at a time. Then we can have any number of files out for request, up to the puller pending KiB limit… (And then we will block the copiers as well, as there’s nowhere for the files to proceed after copying.)

[queue] -f-> [copier 0] -b-> [block puller] -> [request 0]
             [copier 1]                        [request 1]
                                               ...
                                               [request n]
                                        Up to pullerMaxPendingKiB

-f-> meaning files pass here, -b-> meaning individual blocks now
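For anyone who prefers code to ASCII art, the diagram can be roughly mimicked with goroutines and channels. This is a toy sketch under loud assumptions, not how Syncthing is implemented: `file`, `block` and `runPipeline` are invented names, the "request" does no network I/O, and the limit here counts requests in flight rather than kilobytes.

```go
package main

import (
	"fmt"
	"sync"
)

// file and block are hypothetical stand-ins for queue items.
type file struct {
	name   string
	blocks int
}

type block struct {
	file  string
	index int
}

// runPipeline feeds files to `copiers` copier goroutines, which emit
// individual blocks to a single puller stage. The puller keeps at most
// maxPending requests in flight (a count here, not KiB, for simplicity).
// It returns how many blocks were fetched per file.
func runPipeline(files []file, copiers, maxPending int) map[string]int {
	queue := make(chan file)   // -f-> files pass here
	blocks := make(chan block) // -b-> individual blocks pass here

	var copierWG sync.WaitGroup
	for i := 0; i < copiers; i++ {
		copierWG.Add(1)
		go func() {
			defer copierWG.Done()
			for f := range queue {
				for j := 0; j < f.blocks; j++ {
					// Blocks here when the puller has no free slot,
					// which in turn stalls this copier.
					blocks <- block{f.name, j}
				}
			}
		}()
	}

	// Feed the queue, then close the block stream once all copiers finish.
	go func() {
		for _, f := range files {
			queue <- f
		}
		close(queue)
		copierWG.Wait()
		close(blocks)
	}()

	slots := make(chan struct{}, maxPending)
	fetched := make(map[string]int)
	var mu sync.Mutex
	var reqWG sync.WaitGroup
	for b := range blocks {
		slots <- struct{}{} // wait for a free request slot
		reqWG.Add(1)
		go func(b block) {
			defer reqWG.Done()
			defer func() { <-slots }() // free the slot when done
			// A real request would go to a peer here; a slow peer
			// would hold its slot for a long time.
			mu.Lock()
			fetched[b.file]++
			mu.Unlock()
		}(b)
	}
	reqWG.Wait()
	return fetched
}

func main() {
	files := []file{{"big.bin", 8}, {"small.txt", 1}, {"medium.iso", 4}}
	fmt.Println(runPipeline(files, 2, 3))
}
```

With two copiers, blocks from at most two files are interleaved into the puller at any moment; raising `copiers` lets more files compete for the same request slots, which is the effect being discussed.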

I mean, the degenerate case here is that all data is new and there is nothing to copy, then the copiers are essentially just pipes for the block metadata and take zero time by themselves…

But yeah. Regardless, there is a lack of smarts here.

So - if I understand correctly: a large file which exists only on a slow device will block the queue for other files which exist on faster devices?

Yes. Or, well, I guess that’s the point where more copiers help a little, because then there are multiple streams of blocks all fighting for the available network requests.

Ah right. Thanks - it really helps to gain a chink of understanding here and there!

Ok - so I’ve changed the number of Copiers from 0 (which I presume defaults to 2?) to 6 for this folder as an experiment. And it has made a massive difference to the rate at which files are being pulled into place… Woop woop woop!

I understand that this isn’t going to be a universal solution obviously - but for this particular setup, right now, it’s helping it steam through the backlog. :slight_smile: