dbz2 multinode fault? #579

Open
samlor opened this issue Jul 31, 2024 · 4 comments

samlor commented Jul 31, 2024

My tests indicate that, for all but very small files of only a few MB, dbz2 works in parallel on a single node but appears to fail consistently when distributed over multiple nodes in an HPC cluster. Sometimes it fails with errors; worse, sometimes the output after decompression does not match the original file. Is this a documented or known limitation, or am I doing something wrong?

--- Session transcript ---
$ uname -a
Linux sms 4.18.0-513.11.1.el8_9.0.1.x86_64 #1 SMP Sun Feb 11 10:42:18 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 79 | head -1000000 > 80M.txt
$ mpirun -np 4 ~/mpifileutils-v0.11.1/install/bin/dbz2 --compress --keep 80M.txt
$ mv 80M.txt.dbz2 80M1n4p.txt.dbz2
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M1n4p.txt.dbz2
$ cmp 80M.txt 80M1n4p.txt
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --compress --keep 80M.txt
$ mv 80M.txt.dbz2 80M2n4p.txt.dbz2
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M2n4p.txt.dbz2
[2024-07-31T11:29:40] [0] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
[2024-07-31T11:29:40] [1] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
$ ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M2n4p.txt.dbz2
[2024-07-31T11:34:07] [0] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
$

@gonsie
Contributor

gonsie commented Aug 1, 2024

ping @adammoody

@adammoody
Member

Thanks for the report, @samlor. dbz2 writes to a single shared file from multiple processes. For correctness, it requires a POSIX-compliant parallel file system like Lustre or IBM's Spectrum Scale. In particular, many NFS file systems are not POSIX-compliant.

Do you know the type of the backing file system where the compressed file is being written here?

Do you have a POSIX-compliant file system that you can try as a test?
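
One way to answer the file system question is sketched below. It assumes a Linux node with GNU coreutils; `fs_type` is a hypothetical helper name, not part of mpifileutils. Run it in the directory where the `.dbz2` file is written.

```shell
# Print the type of the file system backing a path.
# Assumption: GNU coreutils stat (the -f and -c options are GNU extensions).
fs_type() {
    stat -f -c %T "$1"
}

fs_type .    # prints a type name such as nfs, lustre, or xfs
df -T .      # same information, plus the device and mount point
```

On an NFS client this reports the NFS mount, not the file system the server stores the data on, so the value seen on the compute nodes is the one that matters here.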

@samlor
Author

samlor commented Aug 26, 2024 via email

@ofaaland
Collaborator

@samlor definitely please do try it on Lustre. The fact that you've got XFS backing your test file system doesn't matter, I think.

For example, the NFS protocol doesn't guarantee that writes made on one node are visible on another node, unless the file is closed and then opened again, which isn't the way one would normally write an I/O function. There are likely other gotchas, too, related to the block size as viewed by NFS and how the writes span them.
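
The access pattern in question, several processes each writing a disjoint region of one shared file, can be imitated with plain `dd`. The sketch below is an illustration only and is not dbz2's code; `shared_write_check` and the scratch file name are hypothetical. Each writer fills its own block with a known byte, then every block is read back and compared.

```shell
# Illustration only: nprocs concurrent writers, one disjoint block each,
# all targeting the same file. Returns 0 if every block reads back intact.
shared_write_check() {
    dir=$1       # directory on the file system under test
    nprocs=$2    # number of concurrent writers
    bs=$3        # bytes written by each writer
    out="$dir/shared_write_check.$$"

    : > "$out" || return 1

    i=0
    while [ "$i" -lt "$nprocs" ]; do
        # Writer i fills block i with the digit (i mod 10).
        # conv=notrunc keeps dd from truncating the other writers' blocks.
        ( head -c "$bs" /dev/zero | tr '\0' "$((i % 10))" |
          dd of="$out" bs="$bs" seek="$i" conv=notrunc 2>/dev/null ) &
        i=$((i + 1))
    done
    wait

    status=0
    i=0
    while [ "$i" -lt "$nprocs" ]; do
        want=$(head -c "$bs" /dev/zero | tr '\0' "$((i % 10))" | cksum)
        got=$(dd if="$out" bs="$bs" skip="$i" count=1 2>/dev/null | cksum)
        if [ "$want" != "$got" ]; then
            echo "block $i does not match what was written" >&2
            status=1
        fi
        i=$((i + 1))
    done

    rm -f "$out"
    return "$status"
}

shared_write_check "${TMPDIR:-/tmp}" 4 4096 && echo "all blocks intact"
```

Run as written, all writers live on one node, so this only confirms that the pattern works locally. Exercising the cross-node behavior described above would require launching the writers on different nodes against the same NFS path, which this sketch does not do.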
