[Bug]: gRPC Socket Shutting Down After Many Runs #681
Comments
Hi @deevashwer, are you running a large model? There is a timeout config here; this error can happen when the data is large and a send takes more than 100s. It is also possible that network jitter causes one of the nodes to take a little longer to receive data. Thanks
Yes, I'm running a large model in a LAN setting, so I don't expect significant jitter. It is a curious case because it works just fine for a few runs (say 4 or 5) and then on the 6th run, one of the sockets closes down. I'll try setting the timeout higher and see if that fixes the issue. Thanks!
That did not solve the problem. After a bunch of runs, the same error happened after around 1 hour and 43 minutes. One of the nodes gets automatically terminated with signal 9, and then the other two abort from a closed socket.
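For anyone debugging a similar setup: termination with signal 9 means the node received SIGKILL, which on Linux is often (though not always) sent by the kernel's out-of-memory killer. Below is a minimal Python sketch for confirming how a node exited, assuming each node is launched as a child process. `describe_exit` and `run_node` are hypothetical helper names for illustration, not part of SPU.

```python
import signal
import subprocess


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as -signum.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with code {returncode}"


def run_node(cmd: list) -> str:
    """Run one node command (hypothetical launcher) and describe how it ended."""
    completed = subprocess.run(cmd)
    return describe_exit(completed.returncode)
```

If this reports `killed by SIGKILL`, checking the kernel log (for example with `dmesg`) for out-of-memory messages is a reasonable next step.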
Interesting, we'll try to reproduce this.
Hi @deevashwer, we have encountered a similar issue before (the write-up is in Chinese), caused by a potential memory leak in glibc. You could try a different version of glibc, or tcmalloc.
Hi @tpppppub. Thanks for the reference. Unfortunately, switching to tcmalloc didn't resolve the issue. It does look like a memory leak, however.
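One way to check the memory-leak suspicion is to record peak resident memory after each repeated run and see whether it keeps climbing even though every run does the same computation. The following is a minimal sketch using only the Python standard library. `track_peak_rss` and `run_once` are hypothetical names, and the code must execute inside the node process being measured; on Linux `ru_maxrss` is reported in kilobytes.

```python
import resource


def peak_rss_kb() -> int:
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def track_peak_rss(run_once, n_runs: int) -> list:
    """Call run_once() n_runs times, recording peak RSS after each call."""
    samples = []
    for i in range(n_runs):
        run_once()  # hypothetical: one benchmark iteration
        samples.append(peak_rss_kb())
        print(f"run {i + 1}: peak RSS = {samples[-1]} kB")
    return samples
```

Because `ru_maxrss` is a high-water mark, it never decreases. A value that plateaus after the first run or two suggests stable memory use, while steady growth across identical runs is consistent with a leak.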
@warriorpaw Can you take a look when you have time? Thanks
Issue Type
Usability
Modules Involved
SPU runtime
Have you reproduced the bug with SPU HEAD?
Yes
Have you searched existing issues?
Yes
SPU Version
spu 0.7.0b0
OS Platform and Distribution
Linux Ubuntu 22.04
Python Version
3.9
Compiler Version
No response
Current Behavior?
Hi!
I'm trying to benchmark SPU performance over 3 machines using PPD and it works well for the most part, but when I try to do many runs to get more accurate runtimes, one of the gRPC sockets shuts down with the following error message:
I don't think it has anything to do with the application code, because the preceding runs, which do the exact same computation, run just fine. It seems to me that there may be a limit on how much data can be communicated over these RPC instances. I don't think it's a timing issue, because I've had them run for several hours without aborting.
Is there an RPC environment variable I can set to prevent the sockets from closing?
Thanks for your help!
Standalone code to reproduce the issue
Relevant log output
No response