Latest encrypted devices proposal, what & how

That’s the question! I was just sketching here, and depending on how one sets a device to be not-shared-with there are indeed just two modes. That makes sharing with a new device a two-step operation: first check it so that it’s shared, then check the encryption checkbox? Otherwise we could list all devices and have the drop-down be tristate…

The terminology needs considering. “Encrypted”/“… plaintext”? Vs trusted/untrusted. If we can avoid causing confusion about whether the communication is encrypted or not that would be great…

I feel there will be a lot of confusion about the fact that the passwords for a device pair have to match on both sides, just like folder IDs do, and I notice users are already completely puzzled by those.

Perhaps we should have a simple mode which asks for a single password and uses it for all devices, and an advanced mode which allows you to effectively revoke decryption access by changing the secret for a specific device.
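
To make the simple/advanced split concrete, here is a minimal sketch of how the lookup could work. `FolderSecrets`, `PasswordFor` and the device IDs are hypothetical names made up for illustration, not anything from Syncthing's actual config.

```go
package main

import "fmt"

// FolderSecrets is a hypothetical config shape, not Syncthing's real one.
type FolderSecrets struct {
	Default   string            // simple mode: one password for every untrusted device
	PerDevice map[string]string // advanced mode: overrides keyed by device ID
}

// PasswordFor returns the per-device override if one is set,
// otherwise it falls back to the folder-wide default.
func (s FolderSecrets) PasswordFor(deviceID string) string {
	if pw, ok := s.PerDevice[deviceID]; ok {
		return pw
	}
	return s.Default
}

func main() {
	s := FolderSecrets{
		Default:   "shared-secret",
		PerDevice: map[string]string{"DEVICE-B": "rotated-secret"},
	}
	fmt.Println(s.PasswordFor("DEVICE-A")) // falls back to the default
	fmt.Println(s.PasswordFor("DEVICE-B")) // uses the override
}
```

With this shape a user in simple mode only ever touches `Default`, and the advanced case is an optional override for a single device.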

Also, given we’re into “modes”, perhaps a mode that does not mangle names, because it’s a bit shit for me not being able to check if my offsite backup has a particular file.

Also, stretching it here, but if we had the full public key of the other side (which I think for ECDSA is 32 bytes, half of our device ID), we could make sure that even someone who steals another device’s shared password can’t decrypt the data.

Perhaps discovery should start storing public certs and allow them to be looked up, for signing purposes (or do an Audrius’ classic, flip the table, cut v2, screw certs and roll with ecdsa pub/priv keys)

I’m not 100% sure what you’re thinking with the keys, but we do get the other side’s public key as part of the cert at connection time. I don’t think we have much use for it, because they are the one device we don’t want to be able to decrypt the data?

For the config stuff, yeah, we could do something simple by default and more advanced things could be done in the advanced config or such.

If it’s simple people will use it. Then the bugs can get ironed out because people actually find them. If it’s complex I suspect it will be a feature that gets little use and is more likely to contain bugs.

Furthermore, if it’s simple and works it might be an incentive for new devs to get on board and make it better. If it’s advanced but only half works then I suspect new devs will just keep using their own or third party solutions and not get involved with Syncthing dev.

I prefer the “Trusted” and “Untrusted” terminology. Don’t use plain & encrypted, since that makes it seem to newbies that “plain” is not encrypted in transit.

I’m not sure I follow the implementation approach for untrusted nodes with a password. So the trusted node would add an untrusted device ID to an existing folder and then add an encrypted password aka untrusted encryption key?

Please make the untrusted encryption password VIEWABLE from the trusted side. In case one forgets it, you don’t want to have to reset it and update all the untrusted nodes. Resilio allows for viewing all the keys from a trusted side.

I’m trying to better understand how integrity of the data is maintained on untrusted devices, especially if untrusted devices are used to seed trusted devices. I’m thinking of a worst case scenario where the only copy I’m left with is the one on an untrusted device. Based on this copy (and knowing the key) I’d like to be able to verify I have a complete, unmodified version of the original (unencrypted) data.

As far as I understand, the untrusted device would have an encrypted copy of the database. That database only makes sense on the trusted device though since file/block hashes only match when data is unencrypted. Is that correct?

IMHO the question is how we can verify an untrusted device’s data integrity. If the untrusted device was able to verify integrity itself this would save us from having to download all the encrypted files to check their unencrypted version against (local) trusted hashes. I could imagine having an additional set of hashes per encryption key would be the most elegant solution. Would that have too much of an impact performance-wise?

There is a folder decryption tool in the pull request, which can locally decrypt an encrypted folder when given the original folder ID and password. This could gain an option to not actually write the data, just verify it.

Additional hashing would have some significant impact, essentially a multiplier on the original full hash time, and it would need to happen again any time a folder is shared with a new key. It would also mean we need to use deterministic encryption for the blocks. Another option would be some sort of protocol change so that the hashing can happen at send time. That might be somewhat tricky to shoehorn in.

True, hashing does involve additional computational effort + disk space but IMHO this is negligible compared to the additional encryption overhead and of course the effort it takes to fully sync the data set over the network.

What we’d gain, on the other side, is the ability to verify data integrity on untrusted devices (as outlined above), which includes the ability to scrub data regularly to detect any disk issues. I’ll outline my use case (which I think might apply to many): I’m running a ST instance which was originally intended for my personal use only but over time, it’s become a backup repository for friends and family (yes, there are additional components in place to make this a proper backup solution). Those guys need a ‘fire and forget’ backup solution. They will never care to run any sort of tool to verify data integrity (because if they did, I wouldn’t have gotten involved in the first place…). On the other hand, I don’t want their (clear text) data or key. Therefore, I’d ideally be able to verify that my side of the house is fine (data integrity) based on their encrypted data.

Put differently, IMHO a copy without (easily) verifiable integrity isn’t worth much. E.g., restic’s approach (where one has to download and decrypt the whole data set to verify its integrity) isn’t really an option to me when it comes to WAN-attached storage. Since remote endpoints and average Internet bandwidth are ST’s daily business I think we should keep that concern in mind to make the untrusted concept suit even more use cases.

If we can do it on the fly so we amortize the cost it would probably be fine. I guess worst case we could just add the block checksum as a trailer to the block itself, allowing checking that without doing decryption. This buys you about the same safety as using ZFS or similar for the storage.
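
Roughly what I mean by a trailer, as a sketch only: the on-disk layout, the choice of SHA-256 over the ciphertext, and the names `appendTrailer` / `verifyTrailer` are all assumptions for illustration, not what the PR does.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// appendTrailer returns the encrypted block followed by the SHA-256 of that
// ciphertext. The hash covers only ciphertext, so it reveals nothing about
// the plaintext beyond what the stored ciphertext already does.
func appendTrailer(encBlock []byte) []byte {
	sum := sha256.Sum256(encBlock)
	out := make([]byte, 0, len(encBlock)+sha256.Size)
	out = append(out, encBlock...)
	return append(out, sum[:]...)
}

// verifyTrailer reports whether a stored block still matches its trailer.
// No key is needed, so an untrusted device can run this on its own.
func verifyTrailer(stored []byte) bool {
	if len(stored) < sha256.Size {
		return false
	}
	split := len(stored) - sha256.Size
	sum := sha256.Sum256(stored[:split])
	return bytes.Equal(sum[:], stored[split:])
}

func main() {
	stored := appendTrailer([]byte("opaque ciphertext bytes"))
	fmt.Println(verifyTrailer(stored)) // intact block

	stored[3] ^= 0x01 // simulate a flipped bit on disk
	fmt.Println(verifyTrailer(stored)) // corruption is detected
}
```

Note this only catches accidental corruption: whoever controls the untrusted device could rewrite both block and trailer, which is the same limitation a checksumming filesystem has.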

Is this related to the new encryption feature though? Syncthing doesn’t check data integrity now either (we do checks when we sync data, but as mentioned that happens with encryption too, just only on one side). What I am saying is that if you want to periodically check data integrity, you’ll have to do that with a tool other than Syncthing, regardless of whether the data is plain or encrypted. And as you mention you use another tool for the backup on your own “untrusted” backup server, I’d expect that tool to be able to do integrity checks (backup tools usually can do that), or you could just use a checksumming FS (zfs, btrfs, …).

This is a followup to the discussion started by this comment on the PR. I am posting here because I am pretty sure it’s just me that needs an answer, not the PR being changed :slight_smile:

Why do we encrypt the block hashes at all? Equal blocks can be detected with or without encryption, and the used hash (AES) should already ensure that you cannot infer anything about the data from the hashes, shouldn’t it? Or what else am I missing?

Not exactly sure what you mean - AES is an encryption algorithm, not a hashing algorithm.

If the hashes were unencrypted, the hash of the plaintext data would be visible. So arbitrary data can be guessed and verified for correctness using the hash - some sort of plaintext oracle. This is susceptible to rainbow tables, pre-hashed dictionaries and related things. Encrypting the hash at least makes it harder to verify “if data equals x”. For larger blocks this probably doesn’t matter much - the blocks are simply too large for successful guesswork - but for shorter or easily predictable blocks this can make a significant difference.
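
A toy illustration of that oracle, assuming plain SHA-256 block hashes are visible to the untrusted side. The `pin=NNNN` content and the helper names are invented for the example.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// plainHash is the unencrypted block hash an untrusted device would see
// if hashes were not protected.
func plainHash(block string) [sha256.Size]byte {
	return sha256.Sum256([]byte(block))
}

// guessBlock returns the first candidate whose hash equals the leaked one.
func guessBlock(leaked [sha256.Size]byte, candidates []string) (string, bool) {
	for _, c := range candidates {
		if plainHash(c) == leaked {
			return c, true
		}
	}
	return "", false
}

func main() {
	// The attacker only sees this hash, never the block itself.
	leaked := plainHash("pin=4821")

	// A short, predictable block has a tiny search space.
	candidates := make([]string, 0, 10000)
	for i := 0; i < 10000; i++ {
		candidates = append(candidates, fmt.Sprintf("pin=%04d", i))
	}

	got, ok := guessBlock(leaked, candidates)
	fmt.Println(got, ok)
}
```

Ten thousand hashes is nothing, so a small or guessable block is recovered immediately. Once the hash is encrypted with a key the attacker doesn't have, there is nothing to compare the guesses against.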

Yeah, that. We don’t want to leak the real hashes. If your question is more “why bother providing any hashes at all”, the encrypted hashes enable the usual block level diffing so only changed blocks need to be transferred.
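
As a sketch of why that still works: as long as the protected hash is deterministic, the untrusted side can compare tokens for equality without learning the real hash. HMAC-SHA256 is used below purely as a stand-in for the deterministic encryption in the PR, and the key and function names are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// opaqueHash derives a deterministic, keyed token from a block's plaintext
// hash. Same key and same block always give the same token.
func opaqueHash(key, block []byte) string {
	plain := sha256.Sum256(block)
	mac := hmac.New(sha256.New, key)
	mac.Write(plain[:])
	return hex.EncodeToString(mac.Sum(nil))
}

// tokens maps every block of a file to its opaque token.
func tokens(key []byte, blocks [][]byte) []string {
	out := make([]string, len(blocks))
	for i, b := range blocks {
		out[i] = opaqueHash(key, b)
	}
	return out
}

// changedBlocks returns the indexes in next whose token differs from the
// token at the same index in prev, plus any indexes beyond prev's length.
func changedBlocks(prev, next []string) []int {
	var changed []int
	for i, t := range next {
		if i >= len(prev) || prev[i] != t {
			changed = append(changed, i)
		}
	}
	return changed
}

func main() {
	key := []byte("hypothetical folder secret")
	v1 := tokens(key, [][]byte{[]byte("block A"), []byte("block B"), []byte("block C")})
	v2 := tokens(key, [][]byte{[]byte("block A"), []byte("block X"), []byte("block C")})

	// Only the middle block needs to be transferred again.
	fmt.Println(changedBlocks(v1, v2))
}
```

The untrusted device never sees `plain`, only the tokens, but it can still tell which blocks it already has.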

It’s at least theoretically possible to do offline data verification on unencrypted folders as both the data and the hash database are there; we just don’t provide a tool for it. We could enable the same for encrypted folders. Encrypted folders also have the disadvantage that it’s not otherwise possible to just open a file and see if it seems healthy.

:see_no_evil:

SHA it should have been. All I was thinking about was that the hash cannot be used to guess at the data. That this doesn’t matter you kindly explained (and should have been obvious). I guess asking a trivial question (“trivial” to prevent any “there are no stupid questions” remarks) about a topic I just read up on is a signature move for me :smiley:

Nice of you to offer a way out of my blunder, but that part was perfectly clear :slight_smile:

For the small chance anyone with a similar level of understanding happens by this: The following answer was quite helpful to me regarding deterministic encryption/SIV: https://crypto.stackexchange.com/a/37097/75466

[quote=“imsodin, in github”]
…Then we can disable scanning (and FS watching)… [/quote]

Disable Scan (and FS watching) ??? How shall we (untrusted device) tell the trusted one(s) we need a file that is changed/damaged/corrupted here? Sorry if I play the bull in a china shop, I’m not a coder but I’m very interested in this thread.

A damaged file won’t be picked up by FS watching anyway, and as for “regular” scanning let me quote Jakob in the PR:

On the encrypted side, the folder type should be receive only and don’t do any scans… Changes to the stuff on the encrypted side will predictable break things.

Scanning is about local changes, and by definition an encrypted/untrusted node mustn’t make changes. If we were to implement an option to check data integrity (trusted or not), that would have to be separate from normal scans and thus could in principle also be done by an encrypted device (as it stands the encrypted device doesn’t know its hashes though, see discussion following this earlier comment: Latest encrypted devices proposal, what & how).

Hi Simon. Do you mean that apart from FS watching (which I can easily imagine decides on its own what to report), there is nothing in current ST that triggers a re-hash of files that are already there and compares them to the DB? Even for rw & sendOnly folders? Or is it specific to receiveOnly?

Okay, let’s wait and see.

Any idea if versioning methods will change on untrusted devices? I ask because my backup setup (currently a single off-site device with a receiveOnly folder that gets synced from a local Duplicati encrypted backup of the folder) heavily relies on versioning in case of a simultaneous disaster (massive fire/ransomware) on all “currently trusted” devices, the Duplicati machine being one of them. You can guess my interest: I could drop Duplicati and reclaim about half the storage size from the whole plaintext+encrypted.

EDIT : I see Jakob is replying… so I know ST isn’t a backup software :wink:

Changes are picked up by periodic scanning as well. But changes can’t happen on an untrusted device, and we don’t actively look for corruption. (How would we differentiate it from changes?) Encrypted folders are a special case, and corruption could theoretically be detected and handled. But it’s not something we do today in any other context and probably not something we will implement in phase one.

I understand this. So we get an OutOfSync message, which is enough to revert the change on a receiveOnly device.

Pure curiosity @calmh: What was the motivation to switch to chacha instead of aes for the non-deterministic part?