dsync seems to get stuck and ceases file activity and periodic logging #588

Open
defaziogiancarlo opened this issue Oct 9, 2024 · 0 comments

Comments

@defaziogiancarlo

There are many observed cases of dsync getting stuck. At some point during the transfer it stops writing its periodic status updates, and there is essentially no I/O on the file systems being synced. All but one of the dsync processes sit near 100% CPU usage and stay there. The job can remain stuck for days and has to be killed manually through the job manager.

This happens during syncs from one Lustre file system to another.

Version: v0.11.1
OS: TOSS 4, which is based on RHEL, running the 4.18.0-553.22.1 kernel.
MPI: I'm not sure which MPI it was compiled with.
Lustre: lustre-2.12.9_11.llnl on the clients, routers, and servers. However, the clients and routers will move to a 2.15 version of Lustre soon.
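
For anyone trying to confirm the same symptom on their own job, here is a minimal sketch that samples CPU time from `/proc/<pid>/stat` twice and separates the spinning processes from the one that is not. It is not part of mpiFileUtils; the function names, the 90% threshold, and the 5 second interval are all assumptions chosen for illustration, and it only works on Linux.

```python
import os
import sys
import time


def cpu_ticks(pid):
    """Return utime + stime (in clock ticks) for a PID, read from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name sits in parentheses and may contain spaces,
    # so split only what follows the last ')'.
    fields = data[data.rindex(")") + 2:].split()
    # fields[0] is the state (field 3), so utime (14) and stime (15)
    # land at indices 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(before, after, interval_s, ticks_per_s):
    """Convert a tick delta over an interval into percent of one core."""
    return 100.0 * (after - before) / (ticks_per_s * interval_s)


def split_busy_idle(usage, busy_threshold=90.0):
    """Split {pid: percent} into (busy_pids, idle_pids), each sorted."""
    busy = sorted(pid for pid, pct in usage.items() if pct >= busy_threshold)
    idle = sorted(pid for pid, pct in usage.items() if pct < busy_threshold)
    return busy, idle


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 busy_ranks.py PID [PID ...]
    pids = [int(arg) for arg in sys.argv[1:]]
    interval_s = 5.0  # assumed sampling window
    ticks_per_s = os.sysconf("SC_CLK_TCK")

    first = {pid: cpu_ticks(pid) for pid in pids}
    time.sleep(interval_s)
    second = {pid: cpu_ticks(pid) for pid in pids}

    usage = {
        pid: cpu_percent(first[pid], second[pid], interval_s, ticks_per_s)
        for pid in pids
    }
    busy, idle = split_busy_idle(usage)
    print("busy:", busy)
    print("idle:", idle)
```

Run on a node with the dsync PIDs as arguments; the PIDs reported as idle would be the first candidates for a stack trace, under the assumption that the spinning processes are waiting on them.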
