dbz2 multinode fault? #579
Comments
ping @adammoody
Thanks for the report, @samlor. dbz2 writes to a single shared file from multiple processes. For correctness, it requires a POSIX-compliant parallel file system like Lustre or IBM's Spectrum Scale. In particular, many NFS file systems are not POSIX-compliant. Do you know the type of the backing file system where the compressed file is being written here? Do you have a POSIX-compliant file system that you can try as a test?
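For context, the failure mode comes from the write pattern rather than from bzip2 itself. Below is a minimal sketch of that general pattern (not the actual mfu_bz2_static.c code; the file name, rank layout, and block size are illustrative, and error checks are omitted): each rank writes its own block at a disjoint offset of one shared file with pwrite(), which is only safe across nodes when the file system honors POSIX write-visibility semantics.

```c
/* Sketch of a shared-file parallel write (illustrative only; not the real dbz2 code).
 * Each MPI rank writes its own block at a disjoint offset of one shared file.
 * Correctness across nodes relies on POSIX write semantics, which many NFS
 * mounts do not provide. Error checks omitted for brevity. */
#include <mpi.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ranks);

    const size_t block = 1 << 20;             /* 1 MiB per rank (example size) */
    char *buf = malloc(block);
    memset(buf, 'A' + (rank % 26), block);    /* stand-in for compressed data */

    /* All ranks open the same file ("shared.out" is a placeholder name). */
    int fd = open("shared.out", O_CREAT | O_WRONLY, 0644);

    /* Disjoint offsets: rank i owns bytes [i*block, (i+1)*block). */
    off_t offset = (off_t)rank * (off_t)block;
    pwrite(fd, buf, block, offset);

    close(fd);
    free(buf);
    MPI_Barrier(MPI_COMM_WORLD);              /* all ranks done before the file is used */
    MPI_Finalize();
    return 0;
}
```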
G'day Adam,
Oh, thank you for responding.
Yes, it's on a test system with just NFS-exported XFS. I don't know about POSIX compliance, but I presume ordinary XFS is not even a distributed/parallel file system.
If this is the issue, then I can/will try it on Lustre to confirm, thanks.
Cheers,
Sam
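As a side note, one quick way to confirm which file system is actually backing the output directory is statfs(2). This is just a hedged Linux-only sketch (the magic numbers are the values listed in the statfs(2) man page, and the path argument is whatever directory the compressed file lands in):

```c
/* Report the file system type backing a path (Linux-only sketch). */
#include <stdio.h>
#include <sys/vfs.h>

/* Magic numbers as listed in the statfs(2) man page. */
#define NFS_MAGIC    0x6969
#define XFS_MAGIC    0x58465342
#define LUSTRE_MAGIC 0x0BD00BD0

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : ".";
    struct statfs s;
    if (statfs(path, &s) != 0) {
        perror("statfs");
        return 1;
    }
    switch ((unsigned long)s.f_type) {
    case NFS_MAGIC:    printf("%s: NFS\n", path);    break;
    case XFS_MAGIC:    printf("%s: XFS\n", path);    break;
    case LUSTRE_MAGIC: printf("%s: Lustre\n", path); break;
    default:           printf("%s: f_type 0x%lx\n", path, (unsigned long)s.f_type);
    }
    return 0;
}
```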
@samlor definitely please do try it on Lustre. The fact that you've got XFS backing your test file system doesn't matter, I think. For example, the NFS protocol doesn't guarantee that writes made on one node are visible on another node unless the file is closed and then opened again, which isn't how one would normally write an I/O function. There are likely other gotchas, too, related to the block size as viewed by NFS and how the writes span those blocks.
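To make the close-to-open point concrete, here is a rough reader-side sketch (the file name and buffer size are placeholders, not anything dbz2 uses): under NFS's usual consistency model, data written on another node is only guaranteed visible after the writer has closed the file and this node opens it afterwards, so reads through an already-open descriptor may return stale bytes.

```c
/* Sketch of NFS close-to-open consistency (illustrative only).
 * Assumes "shared.out" was just written and closed on another node. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* Descriptor opened before the remote writer finished: reads through it
     * may be served from this client's cache and miss the remote writes. */
    int stale_fd = open("shared.out", O_RDONLY);
    ssize_t n1 = pread(stale_fd, buf, sizeof(buf), 0);

    /* Re-opening after the writer's close() is the case NFS guarantees:
     * the client revalidates with the server, so this read sees the remote data. */
    int fresh_fd = open("shared.out", O_RDONLY);
    ssize_t n2 = pread(fresh_fd, buf, sizeof(buf), 0);

    printf("stale read: %zd bytes, fresh read: %zd bytes\n", n1, n2);
    close(stale_fd);
    close(fresh_fd);
    return 0;
}
```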
My tests indicate that, for all but very small files of only a few MB, dbz2 works in parallel on a single node but appears to fail persistently when distributed over multiple nodes in an HPC cluster. Sometimes it fails with errors; worse, sometimes the resulting output does not match the original file after decompression. Is this a documented/known limitation, or am I doing something wrong?
--- Session transcript ---
$ uname -a
Linux sms 4.18.0-513.11.1.el8_9.0.1.x86_64 #1 SMP Sun Feb 11 10:42:18 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 79 | head -1000000 > 80M.txt
$ mpirun -np 4 ~/mpifileutils-v0.11.1/install/bin/dbz2 --compress --keep 80M.txt
$ mv 80M.txt.dbz2 80M1n4p.txt.dbz2
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M1n4p.txt.dbz2
$ cmp 80M.txt 80M1n4p.txt
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --compress --keep 80M.txt
$ mv 80M.txt.dbz2 80M2n4p.txt.dbz2
$ mpirun -np 4 -H c1:2,c2:2 ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M2n4p.txt.dbz2
[2024-07-31T11:29:40] [0] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
[2024-07-31T11:29:40] [1] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
$ ~/mpifileutils-v0.11.1/install/bin/dbz2 --decompress --keep 80M2n4p.txt.dbz2
[2024-07-31T11:34:07] [0] [/home/slr/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_bz2_static.c:596] ERROR: Error in decompression
$