Out of sync : read-only filesystem - but it's not?!

M0les · November 12, 2024, 3:41pm

I’m getting a weird error syncing between a Windows 11 machine and a Fedora 39 one. Basically it’s a ~60G (currently) photo spool archive: I dump a bunch of photos on the Windows machine (mostly) and they sync over to the Linux one. After a certain number or volume of files are synchronised (I’m not certain which), the Linux machine reports “Out of sync” with the logs indicating it’s trying to create some “.tmp” files and getting a “read-only file system” error. But the filesystem is still mounted RW and I have tested file create and write capabilities in the directory in question as the daemon’s user. I’ve tried a bunch of “cargo cult fixes” (i.e. making a change based on a theory, rather than evidence), but none have panned out in the end.

Here’s a list of things that I thought might be the issue (or may have been for a while, but I’ve fixed):

Spooky filesystem weirdness (FS was originally BTRFS, but I’ve switched to EXT4 for simplicity)
Target filesystem full (Never: 100G Linux FS has 37G free)
/home partition full (It was about 80% full, but I’ve expanded it to have about 44G/90% free now)
Weird permissions mismatch between Windows and Linux (There was one warning about this in the past, but I’ve selected the “Ignore permissions” options in the GUIs at both ends and all the other “Sync/Send” “Ownership/Attributes” options are unset.)
SELinux gotcha - Tried both restorecon -r and a setenforce 0 followed by a rescan on the Linux side. No effect seen.
File ownership problems - as root, I’ve chown-ed the synced tree to the user running the daemon
“Turn it off and on” - I’ve tried restarting the Linux daemon and the VM that runs it. Hasn’t made a difference.
Somehow the inotify max watches limit was reached (but that should produce its own, descriptive error anyway). There are other synced folders here, but the net total is less than 20k and the limit is currently 242166.
Maximum size of synced data? (I’m reaching here) - Currently the system manages about 60G of data across 3 shared folders.
Synced file is too large? No file is anywhere near the free capacity of the target volume.
Try removing and re-scanning each end. No change in the end.

Here is the relevant sequence of log lines:

2024-11-13 01:59:08 Ready to synchronize "Photos" (ytp2g-gpsu7) (sendreceive)
2024-11-13 01:59:08 Unpaused folder "Photos" (ytp2g-gpsu7) (sendreceive)
2024-11-13 01:59:12 Completed initial scan of sendreceive folder "Photos" (ytp2g-gpsu7)
2024-11-13 01:59:12 Puller (folder "Photos" (ytp2g-gpsu7), item "2021/PXL_20210924_094434208.jpg"): syncing: finishing: opening temp file: open /usr/local/share/photos/photos/2021/.syncthing.PXL_20210924_094434208.jpg.tmp: read-only file system
2024-11-13 01:59:12 "Photos" (ytp2g-gpsu7): Failed to sync 1 items
2024-11-13 01:59:12 Folder "Photos" (ytp2g-gpsu7) isn't making sync progress - retrying in 1m0s.

So the 100G Ext4 photos filesystem is mounted at /usr/local/share/photos and then there’s another “photos” sub directory there that is where Syncthing does its work. As I stated before, this filesystem has 37G free. There are no files anywhere near that size (the one in the logs above is 2.5M)

I’m not sure what the size/number limit is, but I’ve just noticed the sync works for a while and then eventually this error happens and won’t go away. So far it’s only present in this one shared folder.

Any clues or hints as to what the problem might be and how I might diagnose further or fix it?

gadget · November 12, 2024, 5:26pm

All the details about what’s been tried/checked are fine, but what’s missing is anything about the underlying host running the Fedora Linux VM, the VM’s storage format (e.g. raw, qcow, VHD), the storage medium, etc.

Based on the info so far, filesystem corruption is the most likely culprit. So immediately after Syncthing stops syncing due to reporting a read-only error, check the Linux kernel log inside the VM and also on the host. Use dmesg -T (the -T displays a timestamp with each entry to make it easier to trace).
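A minimal sketch of how the kernel log could be narrowed down to the lines that matter here. The function name `fs_trouble_lines` and the pattern list are assumptions for illustration, not an exhaustive set of kernel messages; an empty result only means none of these patterns matched.

```shell
#!/bin/sh
# Keep only kernel log lines that commonly accompany a filesystem being
# forced read-only: ext4/btrfs errors, I/O errors, and remount notices.
# The pattern list is an illustrative assumption, not a complete catalogue.
fs_trouble_lines() {
    grep -iE 'EXT4-fs error|BTRFS (error|critical)|forced readonly|I/O error|remounting filesystem read-only'
}

# Usage inside the VM and again on the host (dmesg may need root):
#   dmesg -T | fs_trouble_lines
# To see how the kernel currently has the volume mounted (rw or ro):
#   findmnt -n -o OPTIONS /usr/local/share/photos
```

If the kernel really had remounted the volume read-only, `findmnt` would show `ro` in the options and a plain write test from a shell would fail too.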

calmh · November 12, 2024, 8:55pm

My guess is you’re running into the systemd ProtectSystem=full protection because you’re writing to a “system” location. Remove that from the unit file if it’s there, I’m not sure what the packaging situation on Fedora is.

ProtectSystem=

Takes a boolean argument or the special values “full” or “strict”. If true, mounts the /usr/ and the boot loader directories (/boot and /efi) read-only for processes invoked by this unit.
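A small sketch for checking which `ProtectSystem=` value a unit ends up with. The helper name `protect_system_value` is made up for this example; it only parses unit-file text, so it can be fed the output of `systemctl cat`, which prints the unit file followed by any drop-ins.

```shell
#!/bin/sh
# Print the ProtectSystem= value found in unit-file text on stdin.
# Later assignments override earlier ones, so the last match wins.
# Prints "unset" when the directive does not appear at all.
protect_system_value() {
    awk -F= '
        /^[[:space:]]*ProtectSystem[[:space:]]*=/ { gsub(/[[:space:]]/, "", $2); v = $2 }
        END { print (v == "" ? "unset" : v) }
    '
}

# Usage on the Linux machine:
#   systemctl cat syncthing@syncthing.service | protect_system_value
# systemd can also report the effective value directly:
#   systemctl show -p ProtectSystem --value syncthing@syncthing.service
```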

gadget · November 12, 2024, 10:35pm

In Fedora, ProtectSystem is set to full if Syncthing is started as a system service, but disabled for a user service.

The OP said “sync works for a while and then eventually this error happens”, so I’m assuming it’s not a sync failure from the start.

gadget · November 12, 2024, 11:01pm

On a somewhat related note…

Note that /usr/local/share is really for sharing local system-wide files used by various programs and not intended for end-user data. A better location would be /srv/photos, /srv/Syncthing/photos or some subdirectory of /srv.

(See the Filesystem Hierarchy Standard for details.)

Although sysadmins are ultimately free to put things wherever they feel like, various parts of a Linux system are protected by multi-layer security. It’s especially true of Fedora and its siblings which are typically out-of-the-box more locked down compared to other distros.

M0les · November 13, 2024, 12:14am

Thanks for looking at this, Gadget!

All the details about what’s been tried/checked are fine, but what’s missing is anything about the underlying host running the Fedora Linux VM, the VM’s storage format (e.g. raw, qcow, VHD), the storage medium, etc

The VM host is a Fedora 38 install running on an i5-8500T CPU w/ 32G of RAM. The spinning disks behind all the bulk data are a pair of Seagate Backup+ disks connected via USB3. These disks have a single GPT partition each (code 8E00: Linux LVM). Both partitions are LVM PVs and members of a single VG over them (labelled “mirror”). The “photos” LV was supposed to be a mirrored-pair, but I managed to set it up as a linear volume on a single drive only (still: data should be stored OK, just redundancy is non-existent). The EXT4 filesystem on the mirror/photos volume also has the label photos. This volume is attached to the KVM VM using driver=qemu, subdriver=raw, source dev=/dev/mirror/photos (attached as vdc). Inside the VM, the fstab has LABEL=photos /usr/local/share/photos auto defaults,nofail 0 0 and mounts at boot without any issues.

Based on the info so far, filesystem corruption is the most likely culprit. So immediately after Syncthing stops syncing due to reporting a read-only error, check the Linux kernel log inside the VM and also on the host. Use dmesg -T (the -T displays a timestamp with each entry to make it easier to trace).

I’ve got the system to the state where with the 2021, 2022 and 2023 folders “moved aside” (i.e. moved below the Syncthing folder, so they’ve “disappeared” as far as Syncthing’s concerned), both sides sync-up fine. At this stage, both sides report 16416 files and 102 directories synced. If I create a single 2021 folder on the Windows side and force a rescan, the Linux side reports out-of-sync again with the following log lines generated:

2024-11-13 10:43:00 Puller (folder "Photos" (ytp2g-gpsu7), item "2021"): syncing: creating directory: mkdir /usr/local/share/photos/photos/2021: read-only file system
2024-11-13 10:43:00 "Photos" (ytp2g-gpsu7): Failed to sync 1 items
2024-11-13 10:43:00 Folder "Photos" (ytp2g-gpsu7) isn't making sync progress - retrying in 1m0s.

If I look at the dmesg -T output, there are no new error log lines generated (Last messages were from VM boot last night and apparently unrelated).

Interestingly: If I delete the 2021 folder and re-sync everything back to be “OK” on both sides, then I create the 2021 folder on the Linux side and sync this over to the Windows side - Everything is OK. Both sides are in sync and the 2021 folder now exists on both sides. Again, if I try to put anything into the folder from the Windows side, it eventually goes back to “out of sync” on the Linux side.

Thanks again for your time.

M0les · November 13, 2024, 12:46am

Thanks for looking at this @calmh and @gadget.

@calmh said:

My guess is you’re running into the systemd ProtectSystem=full protection because you’re writing to a “system” location. Remove that from the unit file if it’s there, I’m not sure what the packaging situation on Fedora is.

@gadget said:

In Fedora, ProtectSystem is set to full if Syncthing is started as a system service, but disabled for a user service.

The OP said “sync works for a while and then eventually this error happens”, so I’m assuming it’s not a sync failure from the start.

…

Note that /usr/local/share is really for sharing local system-wide files used by various programs and not intended for end-user data. A better location would be /srv/photos, /srv/Syncthing/photos or some subdirectory of /srv…

So the Syncthing daemon is enabled as syncthing@syncthing.service - that is “Run the Syncthing daemon as the syncthing user”. I have a syncthing user that owns 3 shared folders. Two of those are just “document syncing stuff” with a small amount of data I keep up to date between various PCs, phones and laptops. These don’t exhibit any problems at the moment. The photos folder is the big one that just syncs between my desktop and the home lab server detailed earlier.

On the possibility of filesystem corruption: Note that I was using a BTRFS volume originally when I first found this issue. Then I made a new EXT4 volume and ported everything into it to see if it improved anything (I thought it did at first). There are no OS or FS errors on the host or the VM and manual filesystem r/w actions when sudo-ed to the syncthing user work fine - so it looks like the error is up at the application’s end or some permissions problem. That’s why I was hunting around for quota or SELinux stuff. I tried stopping the Syncthing service, unmounting the volume and running an fsck on it, just in case:

# fsck.ext4 /dev/vdc
e2fsck 1.47.0 (5-Feb-2023)
photos: clean, 18310/6553600 files, 15349588/26214400 blocks
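One caveat about the manual write test as the syncthing user: a shell started with sudo runs outside any systemd service sandbox, so it can succeed where the daemon fails. A rough sketch of repeating the test in a transient unit with the same `User=` and `ProtectSystem=` settings follows. The helper `sandboxed_write_test` is hypothetical and only prints the command so it can be reviewed before being run as root; the `.write-test` file name is likewise just an example.

```shell
#!/bin/sh
# Print (do not run) a systemd-run command that repeats a write test in a
# transient unit with ProtectSystem=full, as the given user.
# $1 = user name, $2 = directory to test. Paths with spaces are not handled.
sandboxed_write_test() {
    printf 'systemd-run --wait --pipe -p User=%s -p ProtectSystem=full touch %s/.write-test\n' "$1" "$2"
}

# Show the command for the folder from the Syncthing log lines.
sandboxed_write_test syncthing /usr/local/share/photos/photos
```

If the sandbox is the cause, the printed command should fail with the same “read-only file system” error when run as root, while the same `touch` from a sudo shell succeeds. If it does succeed, it leaves a `.write-test` file behind to clean up.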

I’ll try moving the folder under /srv as @gadget suggests. That necessitates a full re-sync of the folder, so it will take some time to find out if it’ll have any effect. I doubt it’ll improve anything, but it’s good to be FHS compliant at least (and maybe this will be the magic bullet?).

One thing I forgot about the system info is the host CPU has 6 threads and I’ve given the VM 6 vCPUs and 4G of RAM. The host’s memory allocation looks like this now:

# free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       8.8Gi       245Mi       2.0Mi        22Gi        21Gi
Swap:          8.0Gi       1.0Mi       8.0Gi

And the VM guest has:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.4Gi       1.1Gi       295Mi       1.5Mi       2.4Gi       2.2Gi
Swap:          8.0Gi       9.0Mi       8.0Gi

(swap is just zram0 in both cases)

Thanks again for all your help - I’ll post an update when I know what this did.

M0les · November 13, 2024, 1:55am

Update: I removed the ytp2g-gpsu7 (“Photos”) folder from the Linux Syncthing shared folders and removed the Linux target from the Windows side’s Syncthing for this folder (now “Unshared” on the Windows side). Then I moved the Linux photos mount to /srv/Syncthing, re-added the share to the Linux Syncthing daemon and waited for it to sync-up. Finally I “re-shared” the folder from the Windows side and both sides eventually got back to “Up to date”.

At this stage I have two subfolders “moved aside” on both sides (i.e. moved out of the file hierarchy that the Syncthing daemons try to synchronise): 2021 (~1.5G) and 2023 (~5G). I tried moving 2021 into the photos folder on the Windows side, re-scanned the folder and let it transfer all those files over to the Linux side. This all worked fine and we got back to “up to date” again.

I then tried pausing the folder on both sides, then moving the 2023 folder into the photos folder on the Windows side, un-pausing the Windows side only and waiting for it to re-scan. I then paused the Windows side, moved the 2023 folder into the photos folder on the Linux side, un-paused the Linux side and let it re-scan also. Finally I un-paused the Windows side again and let it sync-up. This also worked fine and I’m back to having everything “Up to date” again.

So I’m now left with 2 questions:

Is it “fixed” now?
Was it the mounting of a Systemd daemon-controlled data directory under /usr/local/share that was the problem’s cause in the first place?

I think the answer in both cases is: “Maybe? - but I doubt it”. In all my previous fumbling to try and fix this issue, I’ve experienced similar behaviour: I do some stuff and the error goes away, but I was never very sure of the cause in the first place.

I expect I’ll see this error again some time in the future and I think I’ll just have to try and trace down what’s happening in the source code to see what the actual problem is.

Anyway, this will probably just sit idle for now (until the error returns). Thanks again to @gadget and @calmh for letting me distract them, your input gave me a few more things to try (and maybe even fixed it?).

gadget · November 13, 2024, 2:17am

Almost there…

For the following command syntax:

systemctl start syncthing@user.service

In your particular setup the user is syncthing, so:

systemctl start syncthing@syncthing.service

It’s telling systemd to “Start the Syncthing daemon as a system service in the security context of the user syncthing”.

In contrast, starting a service as a user:

systemctl --user start syncthing.service

Is telling systemd to “Start the Syncthing daemon in the security context of the user who issued the command”.

As @calmh had pointed out, and in my followup post mentioning that Fedora bundles two systemd unit files for Syncthing, comparing the differences:

$ diff /usr/lib/systemd/system/syncthing@.service /usr/lib/systemd/user/syncthing.service
2c2
< Description=Syncthing - Open Source Continuous File Synchronization for %I
---
> Description=Syncthing - Open Source Continuous File Synchronization
4d3
< After=network.target
9d7
< User=%i
17,18d14
< ProtectSystem=full
< PrivateTmp=true
28c24
< WantedBy=multi-user.target
---
> WantedBy=default.target

Note how the system-level service unit includes the ProtectSystem and PrivateTmp directives.

With ProtectSystem=full, /usr is one of three filesystem paths that are bind-mounted read-only inside the systemd sandbox that Syncthing runs in, so /usr/local/share/photos/photos would be included in the restricted path.
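The sandbox is a private mount namespace, which is why the mount can look read-write from a login shell and read-only to the daemon. A sketch for inspecting the daemon's own view via `/proc/<pid>/mountinfo`, where field 5 is the mount point and field 6 holds the per-mount options. The helper name `mount_flags` is an assumption for this example.

```shell
#!/bin/sh
# Print the per-mount options (field 6 of mountinfo) for mount point $1,
# reading mountinfo text on stdin. If the mount point appears more than
# once, the last entry is reported. Exits non-zero if it is not found.
mount_flags() {
    awk -v mp="$1" '
        $5 == mp { opts = $6 }
        END { if (opts == "") exit 1; print opts }
    '
}

# Usage as root, comparing the daemon's view with the shell's own view:
#   pid=$(systemctl show -p MainPID --value syncthing@syncthing.service)
#   mount_flags /usr/local/share/photos < "/proc/$pid/mountinfo"
#   mount_flags /usr/local/share/photos < /proc/self/mountinfo
```

With the sandbox in effect, I would expect the first command to report options starting with `ro` and the second to start with `rw`, though that is an expectation rather than something verified on this setup.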

There’s no way the “sync works for a while and then eventually this error happens” – Syncthing would have immediately hit the read-only filesystem error starting with the first file it tried to sync.

M0les · November 13, 2024, 2:39am

@gadget said:

As @calmh had pointed out, and in my followup post mentioning that Fedora bundles two systemd unit files for Syncthing, comparing the differences…

Aha. I might look into whether I can enable the service as the “user” version of the daemon.

There’s no way the “sync works for a while and then eventually this error happens” – Syncthing would have immediately hit the read-only filesystem error starting with the first file it tried to sync.

Yes, I concur. AFAICT I’ve only ever run the daemon as syncthing@syncthing, but way back when I first started testing out, I might have run the server directly from the command line. I think at some time in the past, I did have the photos folder mounted inside the home directory for the syncthing user, then moved it under /usr (to make it more “publicly accessible” to other user accounts) - which might have coincided with the error - but I still don’t think they line up in time properly. I would have moved the mount some weeks ago (well before switching to EXT4) and this error only occurred in the past couple of days. I still wouldn’t exclude that I’m deluding myself and the error’s been there all along and I’ve just masked it by manually copying files into the Linux server and syncing back the other way (i.e. no writes into /usr from the Linux daemon). It’s also annoying there’s no immediate error in the system logs I can see that this might be the problem (But I think in the app logs, ro filesystem is probably reasonable).

Anyway, the error’s not here now. I’ll be sure to pester you again if it shows up in the future.

Thanks again for all your help.

gadget · November 13, 2024, 2:47am

Yes.

Partially. It was technically the combination of two things:

Placing the mount point for your USB drive under /usr at /usr/local/share/photos.
Launching Syncthing as a system-level service via systemctl start syncthing@syncthing.service.

Had you done only one or the other, it would’ve been fine.

Or in other words, if you had logged on as user syncthing, then started Syncthing via systemctl --user start syncthing, Syncthing wouldn’t have had any issues writing to /usr/local/share/photos because the user-level Syncthing service unit file doesn’t sandbox /usr.

/srv is intended for serving up user data, so it’s not sandboxed like /boot, /etc, /usr and other system paths are. When combined with systemctl start syncthing@syncthing, Syncthing is able to read/write to /srv/photos (or whatever subdirectory you use).
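For completeness, a sketch of a third option that nobody in this thread tried: keeping both the system-level unit and a path under /usr, and re-opening just that path with a drop-in. It relies on the `ReadWritePaths=` directive from systemd.exec, which as I understand the documentation can exempt specific directories from `ProtectSystem=`. The helper name `write_rw_override` and the file name `override.conf` are assumptions.

```shell
#!/bin/sh
# Write a drop-in that makes one path writable again inside the sandbox.
# $1 = drop-in directory, $2 = path the service must be able to write to.
write_rw_override() {
    mkdir -p "$1" &&
    printf '[Service]\nReadWritePaths=%s\n' "$2" > "$1/override.conf"
}

# Usage as root (the directory name follows systemd's <unit>.d convention):
#   write_rw_override /etc/systemd/system/syncthing@syncthing.service.d /usr/local/share/photos
#   systemctl daemon-reload
#   systemctl restart syncthing@syncthing.service
```

Moving the data to /srv, as was done here, avoids the need for any override and is the simpler choice.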

(Although I can touch type, if there’s a shortcut, I’ll use it. … If a unit type isn’t specified – e.g. system.service – systemd defaults to service, so systemctl start syncthing is equivalent to systemctl start syncthing.service.)

gadget · November 13, 2024, 3:04am

This is off-topic, but since we’ve already been discussing Linux instead of Syncthing…

Although journaling in ext4 takes care of incomplete transactions, fsck is still required to fix filesystem issues, so instead of defaults,nofail 0 0, a safer choice is something like defaults,nofail 0 2.

(If the root filesystem needs it, it’s most often 0 1, meaning it’s fsck’ed first at boot time.)
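A quick sketch for spotting fstab entries that are never checked at boot, i.e. whose sixth field (fs_passno) is 0 or missing. The function name `unchecked_fstab_entries` is made up for this example.

```shell
#!/bin/sh
# Print the mount point of every fstab entry on stdin whose sixth field
# (fs_passno) is 0 or absent, meaning fsck skips it at boot.
# Comment lines and lines with fewer than four fields are ignored.
unchecked_fstab_entries() {
    awk '$1 !~ /^#/ && NF >= 4 && ($6 == "" || $6 == 0) { print $2 }'
}

# Usage:
#   unchecked_fstab_entries < /etc/fstab
```

Swap, tmpfs and network mounts will show up in the output too, and for those a passno of 0 is correct; only local filesystems such as the photos volume are worth changing.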

M0les · November 13, 2024, 3:45am

Yeah, I’ve been ignoring the last 2 fstab fields for the last ~25 years because I vaguely recall “They’re something to do with that ancient ‘dump’ backup system no one’s used since SVR4”. However your point is valid, so I’ll put them in as 0, 2 (“No dumps, FSCK after root FS is”)
