Caikit/TGIS swallows model loading running out of memory #92
Comments
This likely should go against caikit/caikit-nlp
Also we'll get separated logs once the container split happens (this sprint)
This is the ticket for reference :)
@kpouget could you share an update on this once you try it with the new SR with split images of Caikit and TGIS?
@heyselbi, it didn't change AFAICT:
but
Container didn't crash when it got an OOM error?
Oh, if it's just GPU memory, it probably won't, but it probably should... Hmmm... I'd say this is probably covered at least on startup by the upcoming readiness probe
Will be resolved by #156
TGIS now lives in a separate container, and following its logs should show the OOM errors. For proper liveness/readiness probes for the tgis container in the caikit+tgis setup, we'll have to wait for knative/serving#14853.
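For illustration only, the simplest form of the readiness check mentioned above is a TCP connection test. The sketch below is an assumption, not the probe Caikit, TGIS, or Knative actually use: the function name `port_is_ready` is hypothetical, and the host and port must be supplied by the caller. A successful connection only shows that the server is listening, not that the model finished loading.

```python
import socket


def port_is_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper: it checks that something is listening, nothing more.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the server is not accepting connections.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A server that died during model loading never opens its port, so a check like this would at least surface the failure at startup instead of leaving the Pod looking healthy.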
When loading a model in a Pod whose memory limit is too low, TGIS swallows the out-of-memory error message, which makes the failure hard to troubleshoot (in addition to Caikit swallowing the TGIS error):
While troubleshooting it, I observed that even the TGIS return code does not reflect the OOM error, although my attempts confirmed that insufficient memory was the cause of the load failure:
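Since neither the logs nor the return code expose the OOM here, one way to confirm it independently is to read the kernel's own counters. The following is a minimal sketch, assuming the container runs under cgroup v2, where the `memory.events` file holds an `oom_kill` counter. The helper names `parse_memory_events` and `oom_kill_count` are hypothetical and not part of Caikit or TGIS. This only covers host memory limits; GPU memory exhaustion is not reported there.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumption: cgroup v2 is mounted at the usual location inside the container.
MEMORY_EVENTS_PATH = Path("/sys/fs/cgroup/memory.events")


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines instead of failing.
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def oom_kill_count(path: Path = MEMORY_EVENTS_PATH) -> Optional[int]:
    """Return the cgroup's OOM-kill count, or None if the file is unreadable."""
    try:
        return parse_memory_events(path.read_text()).get("oom_kill", 0)
    except OSError:
        return None
```

Comparing `oom_kill_count()` before and after a load attempt would show whether the kernel killed a process in the container for exceeding its memory limit, regardless of what the server process itself reports.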