@imsodin
(as a new user I reached the daily post limit on this forum and had to wait 2 hours, sorry)
In most use-cases it works as is, and otherwise you can remove the scan progress updates (as mentioned a few times), and the problem disappears.
Sure, I'll do that later, check memory consumption, and give feedback.
That looks like less abstraction, not more. If you implement a nice and stable Go library doing something of the sort, we can create a shim for our filesystem interface and use it for change monitoring.
- Not in Go, I'm afraid - it's not my area. And osquery is viable mostly in server environments where there are people capable of installing that beast. But IF Syncthing evolved into a layered solution with an API at each layer, it would be possible to write integration code in any technology. What do I mean? If you used a database, for example SQLite or Postgres, as the repository of input files, then any process could populate it with data (based on osquery or on its own file list). Imagine I have a server with 2M files created daily: you would not need to scan or hash them, I just need you to distribute them to 12 other servers in 6 places around the world. It would be the responsibility of my process to update the records in the DB, and Syncthing would be responsible for the efficient distribution of those files. As it stands this is hard, because scanning, hashing and distribution in Syncthing are tightly coupled.
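To make the layering idea concrete, here is a minimal sketch of what I mean by "any process could populate the repository". The table name, columns and `state` values are all invented for this example - this is not Syncthing's actual database - it just illustrates an external producer writing file records that a separate distribution layer reads, with no scanning or hashing in between:

```python
import sqlite3

# Hypothetical shared table: an external process records file state here,
# and the distribution layer only reads it. None of these names come from
# Syncthing -- they exist purely to illustrate the layering.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE files (
        path     TEXT PRIMARY KEY,
        size     INTEGER NOT NULL,
        mtime_ns INTEGER NOT NULL,               -- modification time, ns
        state    TEXT NOT NULL DEFAULT 'dirty'   -- 'dirty' = needs distribution
    )
""")

# The producing process upserts rows as it creates files...
conn.execute(
    "INSERT INTO files (path, size, mtime_ns) VALUES (?, ?, ?) "
    "ON CONFLICT(path) DO UPDATE SET size=excluded.size, "
    "mtime_ns=excluded.mtime_ns, state='dirty'",
    ("/data/export/report-0001.tif", 52_428_800, 1_700_000_000_000_000_000),
)

# ...and the distribution layer simply picks up whatever is dirty,
# without any scanning or hashing of its own.
dirty = conn.execute("SELECT path, size FROM files WHERE state='dirty'").fetchall()
print(dirty)
```

The point is the division of responsibility: my process guarantees the rows are correct, and the sync engine only consumes them.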
I don’t see how a history makes scans unnecessary - you still need to discover the present to add it to your history.
I guess what you really mean here is 1. again: Use fancy, efficient methods to detect changes.
Yes, basically: when I have an event stream of changes from a modern file system and a history of past state, I do not need to repeatedly compute the actual state of every file at every point in time. I could elaborate, but I think you know what I mean and what is possible with journaling technology. osquery is nice because it wraps this event stream on Linux, macOS and Windows. The persistent state should be stored in a standard DB (SQLite, Postgres, etc.).
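A tiny sketch of that combination, persisted state plus an event stream. The events below are stand-ins for what a journal/watcher (inotify, FSEvents, the NTFS USN journal, or osquery's file events) would deliver; the schema is invented for this example. Note that only the touched rows change - no directory walk happens at all:

```python
import sqlite3

# Persisted state table, seeded from some earlier point in time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE state (path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER)")
db.execute("INSERT INTO state VALUES ('/srv/a.txt', 10, 1), ('/srv/b.txt', 20, 1)")

# Stand-in for an OS journal / osquery event stream: (action, path, size, mtime_ns)
events = [
    ("MODIFIED", "/srv/a.txt", 12, 2),
    ("CREATED",  "/srv/c.txt", 5,  2),
    ("DELETED",  "/srv/b.txt", 0,  2),
]

for action, path, size, mtime_ns in events:
    if action == "DELETED":
        db.execute("DELETE FROM state WHERE path=?", (path,))
    else:  # CREATED / MODIFIED
        db.execute(
            "INSERT INTO state VALUES (?, ?, ?) "
            "ON CONFLICT(path) DO UPDATE SET size=excluded.size, mtime_ns=excluded.mtime_ns",
            (path, size, mtime_ns),
        )

# Three events touched three rows; nothing else was read from disk.
print(sorted(db.execute("SELECT path FROM state")))
```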
3./4./5. You are aware of the B in BEP - Block Exchange Protocol?
Syncthing doesn’t just handle entire files, but data in the form of blocks. You address such a block by its SHA-256 hash.
As I mentioned, I haven't analyzed the distribution layer, because for now I'm more interested in the scanning/hashing layer. Anyway, distribution should also be a layer, because fixed-block hashing would not always be optimal. Synchronization could be file-format specific: for example, TIFF files have sections, VM images have their own blocks, and some file formats carry their own embedded hashes, checksums, etc. If this part of Syncthing (block detection by file type) were its own layer, maybe somebody would write a synchronization plugin for TIFF files, or VMware images (which would be a HUGE deal in the backup industry), or whatever else. I assume such block synchronization would have to work in some kind of transactional mode, for obvious reasons (this is non-trivial - hard, even).
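A toy sketch of "block detection as its own layer": the core asks a format-specific chunker where the block boundaries are, then hashes each block as usual. The per-extension registry and both chunkers are invented for this illustration (this is not how BEP's scanner works); the point is that a format-aware boundary keeps earlier block hashes stable when data is appended:

```python
import hashlib

def fixed_chunks(data, size=128 * 1024):
    """Default strategy: plain fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def newline_chunks(data):
    """Toy 'format-aware' strategy: split a log-like file on record
    boundaries, so appending a record only adds a new final block."""
    return [rec + b"\n" for rec in data.split(b"\n") if rec]

CHUNKERS = {".log": newline_chunks}  # hypothetical per-extension plugin registry

def block_hashes(name, data):
    chunker = CHUNKERS.get(name[name.rfind("."):], fixed_chunks)
    return [hashlib.sha256(block).hexdigest() for block in chunker(data)]

v1 = block_hashes("app.log", b"rec1\nrec2\n")
v2 = block_hashes("app.log", b"rec1\nrec2\nrec3\n")
# The first two block hashes are unchanged, so only one new block
# would need to be transferred.
print(v2[:2] == v1)
```

A real plugin for TIFF or VM images would do the same thing with IFD offsets or guest block tables instead of newlines, but the interface - "give me boundaries for this file type" - stays the same.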
(AudriusButkevicius)
What I am saying is that it’s easy to be a critic, (…) Most commonly they lack context as to why things are done the way they are, hence it becomes easy to criticize.
Wow, what’s wrong with you? Nobody is criticizing anything, and please do not bully me by calling me names (“critic”). Why do you take everything personally and keep attacking? Listen, you got a TECHNICAL text from an enthusiast, based on a technical problem - are you capable of staying within the technical boundaries of a discussion about a technical project, and not talking about yourself or explaining how awful people are (which is not interesting to me at all)? I want to help, but your form of communication fills me with disgust. Maybe it is my/your cultural thing, maybe not, but it is what it is, and it comes from you specifically.
PRs are nice, yet I do not write Go, so I do not plan to change YOUR codebase (honestly: or deal with you personally, given your current attitude, lol).
(calmh)
like “I think Syncthing shouldn’t use block hashes by default”) are ignored by necessity
I didn’t say that you should not use block hashes for the network transmission protocol - if anything, that you shouldn’t block synchronization while hashing.
I want to NOT generate hashes during SCANNING, where those hashes are used as a means to determine whether a file has CHANGED (remember the 50 GB file). What I would like is to use - in my situation - only timestamps, file sizes, etc., because IN MY situation that is sufficient. I didn’t find a command line option for that, so my proposal is simply to add an option that creates simple and fast hashes/IDs based on the natural attributes of files (timestamp, size, etc.), which would drastically improve scanning in my situation while still correctly detecting file changes (because MY filesystem will take care of that).
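A minimal sketch of what such a "cheap fingerprint" mode could compute, assuming (as in my situation) the filesystem reliably bumps mtime/size on every change. The function name and fingerprint format are invented; the point is that it derives a change marker from `stat()` metadata only and never reads file contents, so a 50 GB file costs the same to "scan" as a 5-byte one:

```python
import hashlib, os

def cheap_fingerprint(path):
    # O(1) per file: hash only the natural attributes (size + mtime),
    # never the contents. Invented for illustration, not a Syncthing API.
    st = os.stat(path)
    token = f"{st.st_size}:{st.st_mtime_ns}".encode()
    return hashlib.sha256(token).hexdigest()

# Usage: any modification that changes size or mtime changes the fingerprint.
with open("demo.bin", "wb") as f:
    f.write(b"hello")
before = cheap_fingerprint("demo.bin")
with open("demo.bin", "ab") as f:
    f.write(b"!")
after = cheap_fingerprint("demo.bin")
print(before != after)  # size (and mtime) changed, so the fingerprint did too
os.remove("demo.bin")
```

The obvious trade-off is that a write which preserves both size and mtime goes undetected - which is exactly why this should be an opt-in flag for setups where the filesystem guarantees make that impossible.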
(devs)
What I can do - please consider this:
- If you open the scanning/hashing layer and persist it in PostgreSQL/SQLite, I could help you with data structures, triggers, SQL queries, indexes, analytics, optimization, etc.
- I could write an external wrapper around osquery in another language as an example of integration, and populate your persisted state in an SQL table with lots of file records to sync. That should allow almost INSTANT synchronization between, for example, web servers in distant locations.
- I could write an example of how to integrate other (distributed) apps that need to exchange input files and distribute output files at specific moments in their life cycle, write some analytics based on the status of data synchronization, and trigger events when something happens, e.g. all output files have been transferred.
- I would experiment with federated sharding of a PostgreSQL database driven by your synchronization layer - however, this would need strict transactional support in your network layer, because it is essentially the holy grail of master-master synchronization.
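As one small example of the "trigger events when all output files are transferred" point above, here is a sketch using an SQLite trigger (the table and trigger names are invented; in PostgreSQL the same idea would use a trigger function plus NOTIFY). The trigger fires on every status update and records a batch as done only once no file in it is still pending:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE transfers (batch TEXT, path TEXT, status TEXT DEFAULT 'pending');
    CREATE TABLE batch_done (batch TEXT, done_at TEXT DEFAULT CURRENT_TIMESTAMP);

    -- Hypothetical completion trigger: when the last file of a batch
    -- reaches 'synced', record the whole batch as done.
    CREATE TRIGGER mark_batch_done AFTER UPDATE OF status ON transfers
    WHEN NOT EXISTS (SELECT 1 FROM transfers
                     WHERE batch = NEW.batch AND status != 'synced')
    BEGIN
        INSERT INTO batch_done (batch) VALUES (NEW.batch);
    END;
""")
db.executemany("INSERT INTO transfers (batch, path) VALUES (?, ?)",
               [("nightly", "/out/a"), ("nightly", "/out/b")])

db.execute("UPDATE transfers SET status='synced' WHERE path='/out/a'")
db.execute("UPDATE transfers SET status='synced' WHERE path='/out/b'")

# Only after the second update does the batch count as done.
done = [row[0] for row in db.execute("SELECT batch FROM batch_done")]
print(done)
```

An external process (monitoring, CI, backup rotation) could then react to rows in `batch_done` instead of polling individual file states.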
This would open the way to writing plugins for MANY tools - for example continuous integration, Docker, monitoring, backups, VS Code, SSH, admin panels - every single one of them at some point needs to send/receive data, logs, backups, installations, updates, environments, or whatever else.