OK to collect anonymous usage data?

calmh · May 13, 2014, 11:27pm

I’m considering something that would, optionally, collect anonymous usage data. The point of the exercise would be to know what kind of workload syncthing gets hit by in the wild, to know if we should optimize for small or large repositories, millions of files or hundreds, which platforms are most common, etc.

The scheme I’m imagining is to put up a dialog on some startup quite soon after installation (not the first, so as not to overwhelm a new user, but perhaps within a week of first installation). This would be something along the lines of:

Allow anonymous usage reports? [short description of purpose as above] If you say yes, the following data will be collected and sent once a day:

{
    "version": "0.9.3",
    "platform": "linux-amd64",
    "numRepos": 3,
    "totalFiles": 12345,
    "totalBytes": 1234567890,
    "maxFilesPerRepo": 1234,
    "maxBytesPerRepo": 123456,
    "maxFileSize": 1234567,
    "numPeers": 3,
    "memoryUsage": 1234567
}

[Yes]  -  [Maybe later]  -  [No, and never ask me again]

The actual data that would be sent would be shown in the dialog in question, although this is of course something the user would have to accept on faith (or read the source).

Opinions on this?

Would you say yes/no?
Is this reasonable data to ask for?
Is there something you think should be added/removed?
Should this be done differently, or not done at all?

When enabled, it could be disabled again by unchecking the relevant checkbox in settings, which would then have the same effect as the “no, and don’t ask again” option for the future.

When enabled, we’d send the data with an HTTPS POST to some suitable place, and I would make the aggregated data available in some useful manner.

thoj · May 14, 2014, 5:34am

This is fine by me personally. But I suggest anonymizing the data more in the client.

Round the values to a significant number. (e.g. 12345 => >10000, 3 => <5)

jpjp · May 14, 2014, 7:18am

You’re asking, so it’s okay by me. But if you ever change the data you collect, I’d like to be asked again!

sil · May 14, 2014, 9:10am

I’d be fine with collecting this data, and I’d say “yes”.

An extremely controversial suggestion: add a unique ID for the computer (node ID, or just invent a GUID on first run). If you don’t have that, then your stats will say that 1000 people were running syncthing 0.9.8 yesterday and 1000 people are running 0.9.9 today, but you don’t know whether they’re the same 1000 people. Being able to tie together the reports from a given machine will make the data much more usable, at the expense of annoying people who don’t want their reports to be tied together. It’s still anonymous (you don’t know who the people are) but it’s identified (so you know that two anonymous reports are from the same device).

menelic · May 14, 2014, 4:00pm

That would be fine with me as well, since it is opt-in and I can opt-out at any time. I also agree with @sil, I think you should add an ID scheme to make the stats you get more valuable for time-based analysis.

jedie · May 15, 2014, 6:52am

It’s ok to collect some stats. Good idea to round the data.

It would be great if the stats are a public webpage and if the link is beside the “submit your statistics” button… Similar to CyanogenMod: http://stats.cyanogenmod.org/

calmh · May 15, 2014, 7:17am

That’s a good suggestion.

Nutomic · May 16, 2014, 9:17am

I could add something in to read the android version. Then either send that, or just use it to differentiate between Linux/android.

I also have access to some stats through the play store (app version, installs over time, android version etc) which I could share.

Edit: Here’s a file with all the stats I get through Google Play. It should be possible to add a read only account to grab that, so we could have a bot download, parse and display it.

jasonwryan · May 17, 2014, 12:58am

+1 Collecting stats is good, making it transparent is even better.

ctismer · May 17, 2014, 2:17am

I’d say “go for it”, iff you

tell exactly what data you are going to (allow to) collect,
where this data is possibly going to and who is going to monitor that,
iff there always is an option to opt-out of it, and
iff the agreement is automatically revoked if any of the constituting rules are changed.
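
The "revoked if the rules change" condition could be sketched in Go by storing which version of the data set the user agreed to. The constant `reportVersion`, the `consent` type and its field names are hypothetical and only illustrate the idea.

```go
package main

import "fmt"

// reportVersion is a hypothetical counter that is bumped whenever
// the set of reported fields changes.
const reportVersion = 2

// consent is what would be stored in the user's configuration.
type consent struct {
	AcceptedVersion int  // data-set version agreed to; 0 means never accepted
	NeverAsk        bool // user chose "No, and never ask me again"
}

// reportingEnabled is true only if the user accepted the data set
// exactly as it is today.
func reportingEnabled(c consent) bool {
	return !c.NeverAsk && c.AcceptedVersion == reportVersion
}

// shouldPrompt is true when the dialog should be shown (again).
func shouldPrompt(c consent) bool {
	return !c.NeverAsk && c.AcceptedVersion != reportVersion
}

func main() {
	old := consent{AcceptedVersion: 1}
	fmt.Println(reportingEnabled(old), shouldPrompt(old)) // false true
}
```

With this shape, an older acceptance silently stops reporting until the user confirms the new data set, and a permanent "no" is never overridden.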

calmh · June 7, 2014, 5:57pm

How does this look to you guys? Does it quell all fears, answer most questions? It feels a bit like a wall of text, but I’m not sure what could be removed to make it more concise.

(data.syncthing.net isn’t up yet, but will be a view of the statistics in question)

Nutomic · June 7, 2014, 8:42pm

Maybe the example data could be collapsed by default? Also, I feel like most of the text could be shortened by being a little less detailed.

The Android version will show the host name as an IP again (the issue isn’t resolved yet).

And as I said before, there are statistics through the Play Store for everyone who installs that way (data is things like country, language, devices, installs over time) but nothing syncthing specific obviously.

bsidhom · June 7, 2014, 11:28pm

I think it’s important that a detailed description of the data being collected is shown so that users understand exactly what’s happening. The presentation looks great. Have you considered thoj’s suggestion to round data values? It might be irrelevant when we’re including a unique ID parameter, but having a detailed fingerprint of these values could be used to identify a machine (both by “anonymized” data and externally). Additionally, having such fine-grained (vs. rounded) information would probably not provide much insight.

calmh · June 8, 2014, 5:44am

Could you suggest something? Like I said, I’m not sure what to remove…

calmh · June 8, 2014, 5:48am

I did, but thought it more work than it was worth. As you say, there is an identifier in there anyway, and if we ever want to calculate statistical metrics like standard deviations, we’ll be sorry if all we have is rounded values.

Nutomic · June 8, 2014, 11:03am

[quote]Opt in to Anonymous Usage Reporting

The encrypted usage data is sent daily. It will be used to track common platforms, repo sizes and app versions.

If the reported data set is changed, you will be prompted with this dialog again.

The aggregated stats are available at data.syncthing.net

[Preview](expands to show example data) [/quote]

Any additional info might be better placed in the documentation section than the dialog.

Btw, how about checking for device performance somehow? There’s quite a difference between a Raspberry Pi and a Desktop PC (and that’s probably not so clear from the architecture). Though I don’t have a good idea how to do that myself. Maybe RAM size and a benchmark result?

calmh · June 8, 2014, 2:03pm

Thanks, I’ll integrate something like that. I was actually thinking the same but ram size is one of those things that’s annoyingly platform specific to figure out. But it would be nice to have for sure… A CPU benchmark is easier though.

calmh · June 12, 2014, 12:16am

This has been committed to master, and https://data.syncthing.net/ is up with a basic aggregate report. https://github.com/calmh/st-usage-reporting is the repo for the report if someone wants to draw fancy graphs or something.

Nutomic · June 12, 2014, 12:35am

Maybe versions should be truncated, otherwise dev versions might pollute this a lot in the future (maybe like “v0.8.14-11-master”).
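
One possible reading of this, sketched in Go, is to keep only the release tag and drop everything from the first hyphen onward. The function name `truncateVersion` is hypothetical, and whether truncation should happen in the client or in the report aggregation is left open here.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateVersion drops any suffix after the release tag, so dev
// builds are counted together with the release they are based on.
func truncateVersion(v string) string {
	if i := strings.IndexByte(v, '-'); i >= 0 {
		return v[:i]
	}
	return v
}

func main() {
	fmt.Println(truncateVersion("v0.8.14-11-master")) // v0.8.14
}
```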

calmh · June 12, 2014, 1:12am

Hopefully there won’t be very many of those.