[Bug] TGIS container fails to run on a FIPS cluster #130

Closed
bdattoma opened this issue Nov 3, 2023 · 4 comments

@bdattoma
Contributor

bdattoma commented Nov 3, 2023

When deploying an LLM using the new Caikit+TGIS architecture introduced with #107, the TGIS container (i.e., transformer-container) fails to start if the cluster has FIPS cryptography enabled.

These are the two errors I got in the container logs:
There was a problem when trying to write in your cache folder (/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE
Note: TRANSFORMERS_CACHE is actually set in the ServingRuntime.
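
To rule out the cache warning as a separate problem, one option is to check from inside the container whether the directory named by TRANSFORMERS_CACHE is really writable by the running user. The following is a minimal sketch, not part of the runtime; the helper name `cache_dir_status` is hypothetical.

```python
import os
import tempfile


def cache_dir_status(env_var="TRANSFORMERS_CACHE"):
    """Return (path, is_writable) for the directory named by env_var.

    Hypothetical helper for debugging; returns (None, False) if the
    variable is unset or empty.
    """
    path = os.environ.get(env_var)
    if not path:
        return None, False
    try:
        # Creating a real temporary file tests the effective permissions
        # of the current user, including arbitrary UIDs in a container.
        with tempfile.TemporaryFile(dir=path):
            pass
        return path, True
    except OSError:
        # Covers a missing directory as well as a permission error.
        return path, False


if __name__ == "__main__":
    print(cache_dir_status())
```

If this reports a writable path while the container still aborts, the cache warning is likely unrelated to the FIPS self-test failure.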

This was found on an OpenShift 4.13.18 cluster with RHODS 2.1.2 (aka 1.32.2) and KServe 0.11 installed.

@heyselbi
Contributor

heyselbi commented Dec 11, 2023

What needs to be done (more like notes to self):

  • Create a FIPS-enabled cluster and deploy RHOAI + Caikit+TGIS serving runtime + isvc (InferenceService)
  • Check each Python library for FIPS as shown here
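
The second step could be approached roughly as sketched below. This is not the procedure from the link above, only an assumption about one way to do it: each library is imported in a fresh interpreter, so that a fatal OpenSSL abort kills the child process instead of the checker. The helper names and the module list are hypothetical; `/proc/sys/crypto/fips_enabled` is the standard Linux kernel flag for FIPS mode.

```python
import subprocess
import sys


def fips_enabled(path="/proc/sys/crypto/fips_enabled"):
    """Return True if the kernel reports FIPS mode (Linux only)."""
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        return False


def try_import(module, timeout=120):
    """Import `module` in a child interpreter.

    Returns (returncode, last line of stderr). On POSIX a negative
    return code means the child was killed by that signal number.
    """
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    lines = proc.stderr.strip().splitlines()
    return proc.returncode, (lines[-1] if lines else "")


def describe(returncode):
    """Turn a child return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    print("FIPS mode:", fips_enabled())
    # Hypothetical candidate list; adjust to what the image installs.
    for name in ["hashlib", "ssl", "tokenizers", "transformers"]:
        code, message = try_import(name)
        print(f"{name}: {describe(code)} {message}")
```

A library that triggers the self-test failure would be expected to show up as killed by a signal rather than as an ordinary import error.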

@bmcfeeters

As another data point, I have hit this issue on a FIPS-enabled OpenShift 4.13.12 cluster with Red Hat OpenShift Data Science operator 2.5.0.

The tokenizer Python module appears to be what causes the crash. From a debug container I see the same issue as noted here from Hugging Face.

Unfortunately, I now have to redeploy my entire cluster to make progress, since FIPS cannot be disabled once OpenShift is fully deployed and running.

@bdattoma
Contributor Author

bdattoma commented Feb 1, 2024

> As another data point, I have hit this issue on a FIPS-enabled OpenShift 4.13.12 cluster with Red Hat OpenShift Data Science operator 2.5.0.
>
> The tokenizer Python module appears to be what causes the crash. From a debug container I see the same issue as noted here from Hugging Face.
>
> Unfortunately, I now have to redeploy my entire cluster to make progress, since FIPS cannot be disabled once OpenShift is fully deployed and running.

@bmcfeeters thanks for sharing your case. This issue should be solved in the latest image versions of the runtime, which will ship with operator 2.6.0.

@dtrifiro
Contributor

dtrifiro commented Mar 8, 2024

Fixed in #171 by this change. The failure was caused by the way the virtualenv was being prepared in the Dockerfile.

@dtrifiro dtrifiro closed this as completed Mar 8, 2024
@github-project-automation github-project-automation bot moved this from To-do/Groomed to Done in ODH Model Serving Planning Mar 8, 2024