severe file corruption - 2 different bugs - not sure if in sshfs or fuse3, reporting in both projects #302

Open
cornerfix opened this issue May 4, 2024 · 5 comments

cornerfix commented May 4, 2024

Arch Linux
SSHFS version 3.7.3
FUSE library version 3.16.2
using FUSE kernel interface version 7.38
fusermount3 version: 3.16.2

Steps to reproduce:

  1. mount directory on remote server with sshfs
  2. open ssh terminal to remote server
  3. in the remote terminal - go to the mounted directory, create a text file, put some text in it
  4. over sshfs - do an md5sum of the file
  5. in the remote terminal - edit the file, ADD SOME CHARACTERS so the length increases
  6. over sshfs - do second md5sum of the file
  7. over sshfs - do a third md5sum of the file

The second and third md5sums of the file are different, even though the file has not been changed between them.
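
Steps 4, 6 and 7 can be scripted so the comparison does not depend on reading md5sum output by eye. This is a minimal Python sketch, assuming the sshfs mount already exists; the helper names and the example path in the comment are hypothetical and not part of sshfs.

```python
import hashlib


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def consecutive_digests(path, count=2):
    """Hash the same file `count` times in a row, reopening it each time."""
    return [md5_of(path) for _ in range(count)]


def reads_are_stable(path, count=2):
    """True when all consecutive reads of an unchanged file hash the same."""
    return len(set(consecutive_digests(path, count))) == 1


# Hypothetical usage after editing the file on the server (step 5):
# reads_are_stable("/mnt/sshfs-test/test.txt")
```

If the file is not touched between the reads, `reads_are_stable` should return True; a False result after step 5 would match the symptom described above.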

There is also another bug: sometimes the second md5sum is THE SAME as the first, even though the file was edited between the two. This happens even though "-o auto_cache,ac_attr_timeout=0" is given, i.e. even with a 0 timeout for detecting changes. It happens less frequently, but if you run ~10 consecutive tests (possibly re-mounting between them), this bug will also show up.

I suspect this is due to ac_attr_timeout=0 not properly invalidating the cache in libfuse3, so I will file the same report in the libfuse project.

Author

cornerfix commented May 4, 2024

I did further investigation and was able to triage the bug better.

Here is a more easily reproducible and more informative test.

Let's increase dcache_stat_timeout, because a short timeout masks the bug. Corruption happens when changed files are copied before dcache_stat_timeout expires.

Let's mount sshfs with the following options:
-o auto_cache,ac_attr_timeout=0 -o dcache_stat_timeout=60
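
For repeated runs it may help to build the mount command programmatically. This is a rough Python sketch that only assembles the argument list; the remote `user@server:/` and the mountpoint `/mnt/sshfs-test` are placeholders, and `build_sshfs_command` is a hypothetical helper, not something sshfs provides.

```python
def build_sshfs_command(remote, mountpoint, option_groups):
    """Assemble an argv list: sshfs REMOTE MOUNTPOINT -o opts [-o opts ...]."""
    argv = ["sshfs", remote, mountpoint]
    for group in option_groups:
        # Each group becomes one comma-separated "-o" argument.
        argv += ["-o", ",".join(group)]
    return argv


# Placeholder remote and mountpoint; the options are the ones quoted above.
command = build_sshfs_command(
    "user@server:/",
    "/mnt/sshfs-test",
    [["auto_cache", "ac_attr_timeout=0"], ["dcache_stat_timeout=60"]],
)
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where sshfs is installed.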

Here is the test:

  1. make text file on the server, add some text in it (vim /sshfs-test.txt)
  2. use sshfs to mount "/" on the server to a directory on local computer
  3. open ssh terminal to the server
  4. via sshfs mount: cat sshfs-test.txt (cat1)
  5. in the terminal - vim /sshfs-test.txt, add some text so the file size increases
  6. via sshfs mount: cat sshfs-test.txt (cat2)
  7. via sshfs mount: cat sshfs-test.txt (cat3)

The result is:

  • cat2 shows only part of the new content of the test file, BUT
  • cat3 shows the new content at its full size

This means that the first read of the file after the change returns corrupted content: the file is cut to the old size, but has the new content. In my use case this caused a lot of PDF files, which were changed by a script and then copied over sshfs, to be severely corrupted and unopenable.
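
To make the symptom concrete, here is a toy Python model of a read that trusts a stale cached size. It is only an illustration of the observed result under that assumption, not code from sshfs or libfuse; all names and sample contents are made up.

```python
def read_with_stale_size(current_content: bytes, cached_size: int) -> bytes:
    """Toy model: the reader trusts a cached size, so data is cut there."""
    return current_content[:cached_size]


old = b"first line\n"
new = b"FIRST line, edited and longer\n"

# cat2: new bytes, but only as many as the old file had.
cat2 = read_with_stale_size(new, len(old))
# cat3: the size has been refreshed, so the whole file comes back.
cat3 = read_with_stale_size(new, len(new))
```

In this model `cat2` has the old length but none of the old bytes, which is the kind of result that leaves a PDF unopenable.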

There is also a second bug - related to the first.

If you repeat the test, but mount sshfs with a slightly different command line:
-o kernel_cache,auto_cache,ac_attr_timeout=0 -o dcache_stat_timeout=60

Then the result is:

  • on cat2 - old content of the file appears with old size
  • on cat3 - new content of the file appears with new size

This seems related to cache handling, so the bug is probably in fuse3, most likely due to improper expiration of the data cache and/or the stat attributes cache when a file is opened with auto_cache.
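
The two outcomes can be told apart automatically if the old and new contents are known. This is a small Python sketch with a hypothetical `classify_read` helper; the labels are my own and are not terms used by sshfs or libfuse.

```python
def classify_read(old: bytes, new: bytes, observed: bytes) -> str:
    """Label what a read through the mount returned after the file changed."""
    if observed == new:
        return "fresh"
    if observed == old:
        # Old content with old size (the kernel_cache variant).
        return "stale"
    if len(observed) == len(old) and observed == new[:len(old)]:
        # New content cut to the old size (the first variant).
        return "truncated"
    return "unknown"
```

One caveat: if the edit only appends text, the first `len(old)` bytes of the new file equal the old file, so both variants are reported as "stale". Changing some bytes near the start of the file during the edit step avoids that ambiguity.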

cornerfix changed the title from "severe file corruptiopn - 2 different bugs - not sure if due to sshfs or fuse3, reporting in both projects" to "severe file corruption - 2 different bugs - not sure if due to sshfs or fuse3, reporting in both projects" on May 4, 2024
cornerfix changed the title from "severe file corruption - 2 different bugs - not sure if due to sshfs or fuse3, reporting in both projects" to "severe file corruption - 2 different bugs - not sure if in sshfs or fuse3, reporting in both projects" on May 4, 2024
Collaborator

h4sh5 commented May 8, 2024

thanks for the report! that's some solid digging.

can you test with a different fuse program that uses ssh as well, such as rclone, to check whether it's a problem with the fuse library?

Author

cornerfix commented May 8, 2024

I was preparing to test with rclone, but then stumbled upon this text in their docs:


Attribute caching
You can use the flag --attr-timeout to set the time the kernel caches the attributes (size, modification time, etc.) for directory entries.

The default is 1s which caches files just long enough to avoid too many callbacks to rclone from the kernel.

In theory 0s should be the correct value for filesystems which can change outside the control of the kernel. However this causes quite a few problems such as rclone using too much memory, rclone not serving files to samba and excessive time listing directories.

The kernel can cache the info about a file for the time given by --attr-timeout. You may see corruption if the remote file changes length during this window. It will show up as either a truncated file or a file with garbage on the end. With --attr-timeout 1s this is very unlikely but not impossible. The higher you set --attr-timeout the more likely it is. The default setting of "1s" is the lowest setting which mitigates the problems above.

If you set it higher (10s or 1m say) then the kernel will call back to rclone less often making it more efficient, however there is more chance of the corruption issue above.

If files don't change on the remote outside of the control of rclone then there is no chance of corruption.

This is the same as setting the attr_timeout option in mount.fuse.


This all sounds reasonable and was valid before libfuse3 tried implementing auto_cache and ac_attr_timeout. auto_cache, at least in theory, was supposed to prevent those issues by allowing:

  1. attr_timeout in fuse (=dcache_stat_timeout in sshfs) to be set to a greater value for the purpose of e.g. fast directory listing, but
  2. auto_cache and ac_attr_timeout=0 to expire the stat (attributes) cache before a read

The only problem is that ac_attr_timeout expires the file size/attributes cache AFTER the read, not before.
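
The effect of that ordering can be shown with a toy model. The Python sketch below is purely illustrative and assumes the hypothesis above is right; `ToyRemoteFile` and `ToyMount` are invented names and do not mirror the actual libfuse or kernel code paths.

```python
class ToyRemoteFile:
    """Stands in for the file on the server; edited outside the mount."""

    def __init__(self, content: bytes):
        self.content = content


class ToyMount:
    """Toy client that caches only the file size between reads."""

    def __init__(self, remote: ToyRemoteFile, refresh_before_read: bool):
        self.remote = remote
        self.refresh_before_read = refresh_before_read
        self.cached_size = len(remote.content)

    def _refresh_size(self):
        self.cached_size = len(self.remote.content)

    def cat(self) -> bytes:
        if self.refresh_before_read:
            self._refresh_size()
        # The read is bounded by whatever size is cached at this moment.
        data = self.remote.content[: self.cached_size]
        if not self.refresh_before_read:
            # Refreshing only now is too late for the data just returned.
            self._refresh_size()
        return data


remote = ToyRemoteFile(b"old\n")
late = ToyMount(remote, refresh_before_read=False)
early = ToyMount(remote, refresh_before_read=True)

remote.content = b"NEW and longer\n"  # edit made on the server
late_cat2 = late.cat()    # cut to the old size
late_cat3 = late.cat()    # full content, size was refreshed after cat2
early_cat2 = early.cat()  # full content on the first read
```

With the refresh after the read, the first `cat` is cut short and the second is complete, which matches the cat2/cat3 pattern reported earlier in this thread.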

So it seems this is a libfuse bug.

Rclone seems to use the same libfuse3 as sshfs, though? Or am I mistaken?

Collaborator

h4sh5 commented May 10, 2024

It appears sshfs uses libfuse3-dev

sudo apt-get install valgrind gcc ninja-build meson libglib2.0-dev libfuse3-dev

whereas rclone uses libfuse-dev, which is actually libfuse 2

https://pkgs.org/search/?q=libfuse-dev

https://github.com/rclone/rclone/blob/aa2746d0de0214ac9e7f9bd7dcaa4b8c9a3fe51e/.github/workflows/build.yml#L127


NilsIrl commented Aug 19, 2024

libfuse issue: libfuse/libfuse#945
