High CPU usage because of .stignore

alberto101 · October 11, 2014, 8:43pm

I added some rules to .stignore today to prevent the issue described in:

and CPU usage went up from about 8% to about 40% as seen on my webui interface for the server.

What seems to be happening is that when I add the following entries, the CPU usage goes through the roof.

.SyncID
.SyncID*
.SyncIgnore
.stversions
.stversions/*
.SyncArchive
.SyncArchive/*
~syncthing~.*
.syncthing.*
xxx.*
xxx
*.tmp
*.temp

Some entries above are related to BTSync special files as I do not want to sync them.

I am not sure which entry is responsible, but I ran some tests and it looks like this list is what causes the problem.

I wonder why the CPU load would go so high?

Does anybody have a clue?

Update 1:

Related question: Can I comment out entries in the .stignore? If so, what is the exact syntax?

jpjp · October 12, 2014, 10:15am

I wonder if the expressions in the .stignore file are compiled just once…

Edit: does Go have a Regexp::Assemble equivalent?

alberto101 · October 12, 2014, 12:08pm

I am not following why they should be compiled at all. I thought these were just patterns used to decide which files/folders to ignore.

I am testing various cases, but it takes a lot of time. Now I wonder if this has something to do with folders that have a few subfolders. For example, it looks like when I add the .SyncArchive rule to .stignore, the CPU usage goes up by a factor of about 2. That folder has about 190 files, about 170 megs in total.

So, the question would be: how exactly is .stignore processed? What is involved in ignoring a file? It seems that for the CPU load to go up so significantly, you must be going through the whole directory tree and actually reading entire files or something of that sort.

What I see when I have all the patterns above is that the cycle goes as follows:

As soon as scanning starts (and what does scanning mean for ignored files?), the CPU goes up to 70-90%, and it stays at that rate for one sampling interval as shown in the GUI.
Then it drops to about 20% during the next sampling tick, and then stays less than 1% for the rest of the cycle, which looks like 1 minute.

My server’s webui shows the CPU usage averaged, and so it translates to around 40% CPU load overall when these patterns are present. If I stop syncthing, the CPU load drops to about 1%, and that is for everything running on the server, including heavies like the Apache web server and a search engine, not counting all sorts of other things.

I wonder if there is a way to optimize the process. Such a dramatic CPU load increase does not seem to be necessary for such an insignificant operation as ignoring files.

Am I missing something here?

calmh · October 12, 2014, 12:44pm

Most likely the matcher just isn’t very efficient. Each ignore pattern is compiled into two regexps, and the full list of ignores needs to be checked against every filename we see when doing a rescan.
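To make that cost concrete, here is a minimal Go sketch of this kind of matcher. It is not Syncthing's actual code: `globToRegexp`, `compileAll` and `ignored` are hypothetical names, only `*` and `?` are handled, and each pattern becomes one regexp rather than two. The point is only that every filename seen during a rescan is tested against the compiled patterns one after another until one matches.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// globToRegexp is a hypothetical helper: it turns a simple ignore pattern
// (only "*" and "?" wildcards) into an anchored regular expression.
func globToRegexp(pattern string) *regexp.Regexp {
	var sb strings.Builder
	sb.WriteString("^")
	for _, r := range pattern {
		switch r {
		case '*':
			sb.WriteString("[^/]*") // any run of characters within one path element
		case '?':
			sb.WriteString("[^/]") // exactly one character within one path element
		default:
			sb.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	sb.WriteString("$")
	return regexp.MustCompile(sb.String())
}

// compileAll compiles every pattern once, up front.
func compileAll(patterns []string) []*regexp.Regexp {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		compiled = append(compiled, globToRegexp(p))
	}
	return compiled
}

// ignored tests one name against the whole list, so the work per rescan
// grows with (number of files) x (number of patterns).
func ignored(name string, compiled []*regexp.Regexp) bool {
	for _, re := range compiled {
		if re.MatchString(name) {
			return true
		}
	}
	return false
}

func main() {
	compiled := compileAll([]string{".SyncID*", ".SyncArchive", "~syncthing~.*", "*.tmp"})
	for _, name := range []string{"notes.tmp", ".SyncID", "readme.txt"} {
		fmt.Println(name, ignored(name, compiled))
	}
}
```

Note that a file which matches nothing is the worst case here, because it has to be tried against every pattern in the list.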

Yep. // at the start of a line indicates a comment. I’ve added this to https://forum.syncthing.net/t/excluding-files-from-synchronization-ignoring/80 now.

calmh · October 12, 2014, 12:59pm

I added a benchmark for the matching; on my computer, matching a filename against (the equivalent of) the list above takes 39 microseconds. So doing that check for 50000 files is about two seconds of CPU time, which would be 40% over a five second window. So yeah, it’s possible that that’s the cause.
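The arithmetic behind that estimate can be written out as a small Go sketch. The 39 microseconds, 50000 files and five second window are the figures from the post above; the function names `scanCost` and `cpuShare` are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// scanCost returns the total matching time for one rescan, assuming every
// file costs the same perFile time to check against the ignore list.
func scanCost(perFile time.Duration, files int) time.Duration {
	return perFile * time.Duration(files)
}

// cpuShare returns the fraction of the window that was spent busy.
func cpuShare(busy, window time.Duration) float64 {
	return busy.Seconds() / window.Seconds()
}

func main() {
	busy := scanCost(39*time.Microsecond, 50000)
	fmt.Printf("matching time per rescan: %v\n", busy)
	fmt.Printf("share of a 5 s window: %.0f%%\n", 100*cpuShare(busy, 5*time.Second))
}
```

The exact product is 1.95 seconds, or 39% of a five second window, which is where the rounded "two seconds" and "40%" come from.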

alberto101 · October 12, 2014, 2:04pm

I wonder if there is a way to increase the performance. Seeing 40% load (averaged) is not the greatest feeling.

Basically, regexp matching is orders of magnitude slower than doing it in some more efficient way. Maybe there is a way to avoid regexps, since you have a pretty limited set of characters requiring special processing. But I am not quite sure about this.

So, this might mean that a large number of ignore patterns, like the list in my initial post, can really eat your CPU time REAL good, and that is probably why I am seeing the 40% average CPU load on the server webui. Somehow, it does not seem right in the scheme of things for such an insignificant operation as ignoring files to load the processor so much. I wonder if there is a better way of doing this.

Great. I tried # at the beginning of the line, but not only did the comment not work, the patterns seemed to have been processed as though there were no comments.

I usually use C++ style comments or lines starting with a semicolon. The pound sign # is a shell special character, so, depending on what a config or script is doing, there might be some funk.

alberto101 · October 12, 2014, 2:10pm

Well, in my case, I see less than 200 files total in .SyncArchive directory. I just removed those files to see what is going to happen with CPU load.

Btw, if I ignore just the folder, does it mean that ALL the files and subfolders in it will be ignored? Because I also added folder_name/*, just to make sure everything in that folder will be ignored. Is it really necessary? And is it a correct assumption that this pattern will ignore everything in this folder, including subfolders?

What is the best or correct and sufficient pattern to ignore folder and all subfolders and files in it?

alberto101 · October 12, 2014, 2:57pm

Even if files/folders to be ignored are not present in file system, CPU load does not decrease

I just have one “heavy” pattern left in the .stignore file - .SyncArchive. Then I tried to remove all the files and subdirs for that folder and was surprised to see that it did not decrease the CPU load. So, it seems that CPU time is munched by merely crunching the patterns without doing the actual work as far as matching the physical files against the pattern.

Is this a correct assumption?

AudriusButkevicius · October 12, 2014, 3:34pm

It’s compiled once, and reused, so I’d be surprised if it is “crunching the patterns”.

alberto101 · October 12, 2014, 4:02pm

Well, I have no idea what it is doing, but here are the patterns I have right now:

Navigators_Latest.zip // Just a file that does not need to be synced
.SyncID // BTSync share ID
.SyncID*
.SyncIgnore // BTSync ignore file
.SyncArchive
~syncthing~.* // I do not want syncthing temp files to be synced by BTSync

// .SyncArchive/*
//.stversions
//.stversions/*
//.SyncArchive/*

And now the average CPU usage is about 50% higher than without those patterns

So, if patterns are compiled just once, and not every 60 seconds, then who is munching on the CPU time? Doing what kind of thing?

The files in the .SyncArchive have been removed from the file system. Not sure if it means they have been removed from the syncthing database (index?).

Update 1:

Well, actually, the CPU 5 min. average is 2 times higher than without these patterns. And this is not such an extensive ignore list, I’d say.

AudriusButkevicius · October 12, 2014, 4:33pm

Read the help options, which explain how to run profiling.

alberto101 · October 12, 2014, 8:49pm

Well, I just got up after a nap and it “clicked”. I am not sure there is a need to run the profiler. It is probably spending the CPU clocks in the regexp matcher, not the compiler. And, since the cycle runs every minute, you need to go through every file in a share/collection and match it against the list of regexps. So, even if you physically remove those files that would match, it won’t matter much for the CPU load, unless nearly all your files match or something crazy like that.

That means that the matching process itself is the bottleneck. Now, combined with the fact that scanning is done every minute, which intuitively seems too short an interval, in terms of the 5 min average load you get 30 to 40% load in the case of that long list.

And those 40% break down to a nearly full CPU load for about 15% of the time. That is why syncthing shows around 70-80% CPU load, since it samples the instant values of the CPU load without averaging; for the rest of the scan cycle you run with a CPU load of about 1%. This means syncthing heavily loads the resources about 15% of the time in my case, and that will depend on the complexity of the ignore list. So, in large shares with lots of files and lots of ignore patterns, the results could be significantly worse than that.

It looks to me as an issue…

So, two things I see here:

Not running the scans so often, but running them, say, every 10 minutes (as a configurable parameter), and, to avoid long delays in updates, utilizing file system events such as file update, delete, etc. Scanning still has to run no matter what, because you might miss file system events when some nodes are offline. Furthermore, once an event is detected and a match succeeds, you also send the information events to the other nodes. This way, your update latency is reduced to a minimum.

One thing I haven’t mentioned before is that I do in fact sometimes see that nodes are not updated, even after quite a few minutes. And I have even seen nodes go into a disconnected state permanently. But these are totally different issues.

Do the matching via a high performance mechanism instead of a regexp matcher. You can increase the performance by a factor of 10, I suspect, because a regexp matcher is a character based matcher. For every character it sees in the string, it has to go through all sorts of hoops and perform various jumps around. The process is inherently slow and is not suited for real time processing, from what I see.
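The first suggestion (a long, configurable rescan interval combined with file system events) could be sketched roughly as follows in Go. This is only an illustration of the idea, not how syncthing works: `scanLoop` and `simulate` are made-up names, the events are plain path strings, and in real use the `rescan` channel would presumably come from something like `time.NewTicker(10 * time.Minute).C` and the events from an OS file watcher.

```go
package main

import (
	"fmt"
	"time"
)

// scanLoop is a hypothetical sketch: single paths reported by file system
// events are checked right away, while the expensive full scan only runs
// when the (long, configurable) rescan timer fires.
func scanLoop(events <-chan string, rescan <-chan time.Time, done <-chan struct{},
	checkPath func(string), fullScan func()) {
	for {
		select {
		case path := <-events:
			checkPath(path) // cheap: one path against the ignore list
		case <-rescan:
			fullScan() // expensive: every file against the ignore list
		case <-done:
			return
		}
	}
}

// simulate feeds two events and one rescan tick into the loop and reports
// what work was done.
func simulate() (checked []string, scans int) {
	events := make(chan string)
	rescan := make(chan time.Time)
	done := make(chan struct{})
	finished := make(chan struct{})

	go func() {
		defer close(finished)
		scanLoop(events, rescan, done,
			func(p string) { checked = append(checked, p) },
			func() { scans++ })
	}()

	events <- "docs/a.txt"
	rescan <- time.Now()
	events <- "docs/b.txt"
	close(done)
	<-finished // wait for the loop to exit before reading the results
	return checked, scans
}

func main() {
	checked, scans := simulate()
	fmt.Println("paths checked from events:", checked)
	fmt.Println("full rescans:", scans)
}
```

With this shape, the per-file matching cost is only paid for the whole tree once per rescan interval, and changes in between cost one match each.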

Actually, in the project I mentioned before, I have developed the high performance filtering system I call the Evaluator. The evaluator is a multi stage filter, where each stage behaves as an AND operator/condition as far as end results of filtering goes. Each stage may have multiple expressions that behave like an OR conditions for that stage of filtering.

The end result: I can process about 1000 articles/sec., and that includes not only the filtering itself, but also loading the archive, parsing the records, creating several indices on various keys and then, finally, selecting the articles by calling the Evaluator.

Actually, what could be done in syncthing, it looks like, is to simply create an index of file names instead of going sequentially through every file. That would probably increase the performance by orders of magnitude.
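The filter structure described above (AND across stages, OR within a stage) can be sketched in a few lines of Go. This is only a guess at the shape from the description, not the actual Evaluator: the types `Predicate`, `Stage` and `Evaluator` and the `contains` helper are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Predicate is one expression inside a stage.
type Predicate func(item string) bool

// Stage passes an item if ANY of its predicates matches (OR).
type Stage []Predicate

// Evaluator passes an item only if EVERY stage passes it (AND).
type Evaluator []Stage

// Match reports whether at least one predicate in the stage accepts the item.
func (s Stage) Match(item string) bool {
	for _, p := range s {
		if p(item) {
			return true
		}
	}
	return false
}

// Match reports whether all stages accept the item; it stops at the first
// stage that rejects it.
func (e Evaluator) Match(item string) bool {
	for _, s := range e {
		if !s.Match(item) {
			return false
		}
	}
	return true
}

// contains builds a simple substring predicate for the example.
func contains(sub string) Predicate {
	return func(item string) bool { return strings.Contains(item, sub) }
}

func main() {
	e := Evaluator{
		{contains("sync"), contains("backup")}, // stage 1: either word
		{contains("linux")},                    // stage 2: must also mention linux
	}
	fmt.Println(e.Match("sync tools for linux"))
	fmt.Println(e.Match("sync tools for windows"))
}
```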

Well, what can I say, at least I seem “to get” the feeling of what is going on here and that feels a little better than utter darkness, the sense I don’t like that much. I like clarity.

AudriusButkevicius · October 12, 2014, 9:18pm

Where are you getting this information from?

I don’t understand. Index of what? You don’t know what’s on the disk, how can you know what the index consists of?

Though this does give me some ideas of how to make it faster.

alberto101 · October 12, 2014, 9:33pm

I am getting it from the webui of syncthing, the “CPU Utilization” field.

Index of file paths in a share/collection.

Well, looks like we are on two different wavelengths here. You know which files physically exist in the share by traversing the top folder of the share, no matter how it is done. You also know which files are in this share by the fact that you have created the index for this share, except I do not quite know what exactly the index consists of.

What can I say, I do not know the syncthing internals that much, if at all, only at some rough level, because I have spent several months looking at the different operating principles and mechanisms of BTSync. Plus general programming experience and intuition…

Hey, that’s the kind of thing I LOVE to hear!!!

I bet this process can be significantly optimized.

AudriusButkevicius · October 12, 2014, 9:55pm

It uses a 10 second moving average, so your assumptions are wrong. It doesn’t get updated any more often than once every 10 seconds I think, so it’s a snapshot of the last 10 second average.
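A small Go sketch of what a windowed average does to a short burst. How syncthing actually samples CPU usage is not stated here, so the one-reading-per-second samples and the `windowAverage` helper are assumptions made purely for illustration.

```go
package main

import "fmt"

// windowAverage returns the mean of the last `window` samples (or of all
// samples if there are fewer than that).
func windowAverage(samples []float64, window int) float64 {
	if window <= 0 || len(samples) == 0 {
		return 0
	}
	if len(samples) > window {
		samples = samples[len(samples)-window:]
	}
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

func main() {
	// Hypothetical per-second CPU readings in percent:
	// four seconds of full load followed by six nearly idle seconds.
	samples := []float64{100, 100, 100, 100, 1, 1, 1, 1, 1, 1}
	fmt.Printf("10 s average: %.1f%%\n", windowAverage(samples, 10))
}
```

With these made-up readings the displayed figure is (4 × 100 + 6 × 1) / 10 = 40.6%, even though the CPU was fully busy during the burst itself.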

alberto101 · October 12, 2014, 10:41pm

I am not sure which assumptions are wrong, but I see it updated on the screen every 10 seconds. Whether it samples things more frequently does not change the load factor, from what I see. Yes, averages do not represent the instantaneous values. So, since what I see is already averaged, but at a finer interval, that means that the actual CPU load goes up to 100% in a tight loop for as long as that loop is run.

I am not sure what that changes. Since I see 30 to 40% load on the 5 minute average (as shown in the webui for the server), what does it tell us? Would you consider that value excessively high and abnormal, or would you conclude that the value is meaningless?

Do we have an unusually high CPU consumption here? Or am I just imagining something here that does not exist in reality?

Would you tell me YOUR interpretation of what I described with statistics and so on?

alberto101 · October 12, 2014, 11:23pm

Well, it seems that an index won’t help here. You do not match a literal and complete file name against the index, but patterns, and since those patterns include special match characters, especially at the beginning of a pattern, an index lookup isn’t going to do the trick. What can I say but “too bad”. Hey, but at least it was a try!!!

Well, actually, maybe there is still some way of using an efficient search instead of a dumb match. Except I do not see how at the moment.
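One possible direction, sketched in Go purely as an illustration (it is not what syncthing does, and not necessarily what the upcoming patch does): patterns without any wildcard characters can go into a map for a constant-time lookup, so only the patterns that really contain wildcards need to be matched one by one. The `matcher` type and its methods are hypothetical, and the standard library's `path.Match` stands in for the real pattern matcher.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matcher splits ignore patterns into exact names and wildcard patterns.
type matcher struct {
	literals  map[string]struct{} // patterns with no special characters
	wildcards []string            // patterns that need real matching
}

// newMatcher sorts each pattern into one of the two groups.
func newMatcher(patterns []string) *matcher {
	m := &matcher{literals: make(map[string]struct{})}
	for _, p := range patterns {
		if strings.ContainsAny(p, "*?[\\") {
			m.wildcards = append(m.wildcards, p)
		} else {
			m.literals[p] = struct{}{}
		}
	}
	return m
}

// ignored tries the cheap map lookup first and only then falls back to
// matching against the wildcard patterns.
func (m *matcher) ignored(name string) bool {
	if _, ok := m.literals[name]; ok {
		return true
	}
	for _, p := range m.wildcards {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	m := newMatcher([]string{".SyncArchive", ".SyncIgnore", ".SyncID*", "*.tmp"})
	for _, name := range []string{".SyncArchive", "notes.tmp", "readme.txt"} {
		fmt.Println(name, m.ignored(name))
	}
}
```

This only helps in proportion to how many patterns are literal; a file that matches nothing still has to be tried against every wildcard pattern.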

AudriusButkevicius · October 13, 2014, 9:30am

I’ve actually worked out how to make this efficient up to some extent. There will be a patch soon.

alberto101 · October 13, 2014, 11:34am

That is good to hear. Do you happen to have an estimate on the performance increase? As I said, right now with only a couple of “heavy” patterns the 5 min average CPU load is DOUBLE of what it would be without that pattern.
Can you tell me what is the most “correct” or efficient way to ignore the folder with all its files and subfolders in it?
What is the reason for the special handling of a path separator in patterns containing ? or *?

AudriusButkevicius · October 13, 2014, 11:50am

From the benchmarks I did yesterday, it was down from 36k ns/op to 100 ns/op, but it’s a poor benchmark in a way, as it matches a constant path…
I think just specify Folder/, that will in turn become Folder/** which becomes something like ^Folder/.+
I am not sure I understand the question, but even if I did, I probably couldn’t answer, since I haven’t written that part.
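To see what an expression like ^Folder/.+ would and would not match, here is a short Go check. It takes the "something like" regexp quoted above literally, so the real translation may differ; `insideFolder` is just a name made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// folderContents is the expression quoted above, taken literally.
var folderContents = regexp.MustCompile(`^Folder/.+`)

// insideFolder reports whether name lies somewhere below Folder/.
func insideFolder(name string) bool {
	return folderContents.MatchString(name)
}

func main() {
	names := []string{"Folder/a.txt", "Folder/sub/b.txt", "Folder", "Other/Folder/a.txt"}
	for _, name := range names {
		fmt.Printf("%-20s %v\n", name, insideFolder(name))
	}
}
```

Files and subfolders at any depth below Folder/ match, because .+ also consumes further path separators. The bare name Folder and a Folder nested under another directory do not match this particular expression.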