
Concerns about modules stability #2

Open
vemonet opened this issue Oct 22, 2021 · 6 comments

Comments


vemonet commented Oct 22, 2021

Hi again. I have some more questions about Lmod and EasyBuild, and some concerns that have started to arise as I discover this technology (note that I am not a sysadmin originally, more of a dev who doesn't want to rely on sysadmins to get things done, so my point of view and questions might be a bit naive).

The error

Here's how it worked for me:

  1. Day 1: I manage to install everything; the JupyterLab lets us load the TensorFlow module and then import it!
  2. Day 2: reconnecting to the same JupyterLab and trying to import the same TensorFlow from the same easybuild-data volume, it does not work anymore: I get an error about an undefined CPython symbol (I shared the error at the end of this issue as it is quite large)
  3. Restarting the JupyterLab: OK, it's definitely broken, but the data in the easybuild-data volume is still there

The other problem is that the error here is quite unreadable (related to the arcane path-loading machinery of Lmod), so it can't just be read and solved with regular computer knowledge. Usually, when I hit an issue with any kind of language/package using a regular package manager in a Docker image, I can find my way to a solution in a few minutes, just because I know the basics of bash and the Unix filesystem.

See below for the full output of the error

The concerns

From a developer's point of view, the major advantage of using (Docker) containers to handle dependencies has always been that it's stable: there are no surprises (people hate surprises!). And it just requires some basic Unix/Linux knowledge (which is required elsewhere anyway). You pull, you run, it works. No surprises, no additional work.

But Lmod does not seem to provide this level of stability. It seems to require ad hoc fixes for the various packages installed. And each time something fails, it can't be fixed easily: you need to know the whole Lmod/EasyBuild mechanism perfectly and go through a complex system of path-loading dependencies.

For example, I faced an issue where RStudio was complaining about permissions in /var/run/rstudio-server, and I noticed that you defined the env variable USER=rstudio-server in the JupyterLab Dockerfile: https://github.com/guimou/s2i-lmod-notebook/blob/main/f34/Dockerfile#L51. Was it for this reason?

The questions

Any idea why Lmod modules would fail like this? (I am looking into what I might have added to the JupyterLab image that could have caused this, but maybe someone already has an idea :) )

What does it honestly take to make an Lmod/EasyBuild environment stable enough that it can be trusted by scientists doing research? Is it completely automatic, magic, and stable once you have found the right setup? Or does it require someone to regularly get their hands into the system, helping researchers fix modules that don't load properly, or fixing the EasyConfig for a given package?

Here is the error output for TensorFlow failing to import (import tensorflow as tf):

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/opt/apps/easybuild/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>
     63   try:
---> 64     from tensorflow.python._pywrap_tensorflow_internal import *
     65   # This try catch logic is because there is no bazel equivalent for py_extension.

ImportError: /opt/apps/easybuild/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: PyCMethod_New

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-2-64156d691fe5> in <module>
----> 1 import tensorflow as tf

/opt/apps/easybuild/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/tensorflow/__init__.py in <module>
     39 import sys as _sys
     40 
---> 41 from tensorflow.python.tools import module_util as _module_util
     42 from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
     43 

/opt/apps/easybuild/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/tensorflow/python/__init__.py in <module>
     38 # pylint: disable=wildcard-import,g-bad-import-order,g-import-not-at-top
     39 
---> 40 from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
     41 from tensorflow.python.eager import context
     42 

/opt/apps/easybuild/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>
     81 for some common reasons and solutions.  Include the entire stack trace
     82 above this error message when asking for help.""" % traceback.format_exc()
---> 83   raise ImportError(msg)
     84 
     85 # pylint: enable=wildcard-import,g-import-not-at-top,unused-import,line-too-long

ImportError: Traceback (most recent call last):
  File "/opt/apps/easybuild/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
ImportError: /opt/apps/easybuild/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: PyCMethod_New


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

vemonet commented Oct 22, 2021

Figured out where it came from: the Python version of the Docker base image changed from 3.9 to 3.8.

Shouldn't Lmod automatically install the required Python version as a module if it needs it for a specific package?
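
For the record, this kind of mismatch can be confirmed from a terminal in the container (a sketch; the .so path comes from the traceback above):

# Which interpreter is the notebook actually running on?
python3 --version   # reports the image's Python (3.8.x here), not the module's 3.9

# PyCMethod_New only exists in CPython >= 3.9, so a 3.8 interpreter cannot
# provide the symbol the TensorFlow extension was linked against:
python3 -c "import ctypes; print(hasattr(ctypes.pythonapi, 'PyCMethod_New'))"

# The extension indeed expects the symbol from libpython (U = undefined):
nm -D /opt/apps/easybuild/software/TensorFlow/2.6.0-foss-2021a/lib/python3.9/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so \
  | grep PyCMethod_New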


guimou commented Oct 22, 2021

OK, you figured it out well. Python is an issue I have honestly not been able to overcome in this kind of environment. If you are using a JupyterLab container image, it requires Python from the start, of course. So you are already in a specific Python environment and, depending on how you build this image, maybe even in a virtualenv. So even if you load a Python module with a different Python version, it is not taken into account by the environment (even if you reload the kernel), because you are already in a specific Python version environment.
So what you experienced is:

  • You launch the container with JL and Python 3.8.
  • You load the TensorFlow module, which also loads the Python 3.9 module as a dependency.
  • The TF libraries are accessible, but your Python is still 3.8, because that's what your JL is running on, hence your problem. And I have not figured out a way to make this new Python version the default, at least without breaking the running JL environment... (see the sketch below)
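
Roughly, from a terminal inside the container (a sketch; module and path names are illustrative, based on my 2021a builds):

$ module load TensorFlow/2.6.0-foss-2021a   # also pulls in its Python/3.9 dependency
$ which python
/opt/apps/easybuild/software/Python/3.9.5-GCCcore-10.3.0/bin/python
$ python --version
Python 3.9.5
# ...but the JupyterLab server and its kernels were spawned before the load,
# from the image's Python 3.8, so inside a notebook sys.version is still 3.8
# and the 3.9-built TensorFlow extension fails to import.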

So at this point, all my module builds are based on Python 3.9, and the container images I'm using have to be based on the same 3.9 version.
For the same reason, my base "builder" image is Fedora 34 with as few packages as I could, and I'm using the exact same base for the JupyterLab images, to make sure that any OS library calls are handled properly. This means you still have to maintain some compatibility layer between where you build and where you use your modules.
In other environments, this compatibility layer is built on Gentoo, for example: https://eessi.github.io/docs/compatibility_layer/
For now, because I'm working in containers, I'm still considering whether I should create this compatibility layer within the container image, or just say that this role is played by the Fedora 34 base (it could be something else) that all containers built for this environment have to derive from... With a proper compatibility layer, other bases (Alpine, Debian, ...) could be used to build the containers, or the modules could be used outside of this container world, provided the same compatibility layer is present. I have not investigated this part yet because 1) it seems "too much" to have Gentoo on top of another base container image, and 2) I have not played with Gentoo for a while, so deploying it inside a container image would require time I don't have at present.
But any advice on this matter from anyone more knowledgeable than I am will be really appreciated! 😃


vemonet commented Oct 22, 2021

Thanks a lot for such a detailed answer @guimou! The challenges are much clearer now.

So by running modules built on Fedora in a Debian-based notebook, I might also face compatibility issues? (Probably not with all modules though, since the environments must be relatively similar, which would explain why I did not notice it before.)

If I understand correctly: to run EasyBuild and Lmod (to build and serve the modules), we need to install packages like GCC, Lua and Python (e.g. 3.9), which will create conflicts if we try to run modules that use different versions (e.g. Python 3.8).

To avoid this, EESSI relies on Gentoo Prefix: they install those GCC and Python packages in a non-standard location (a "prefix"), somewhere the OS does not expect to find them.

So that's what we need to do too in a container?

e.g. start from a bare Linux image, then install GCC and Python in a "secret place" so that our Linux OS does not use them, then use this place to set up Lmod/EasyBuild.
Lmod will use its own GCC/Python versions (in the secret place), but the system will have no default GCC/Python, so it will properly use the ones installed through the modules.

Did I get it right?

For JupyterLab we can easily load the module during the Docker build, I guess, but for GCC and Python that will be another level of complexity.

Couldn't we reuse the Docker image EESSI uses with Singularity? https://github.com/EESSI/compatibility-layer/blob/main/Dockerfile.bootstrap-prefix-centos8 It's CentOS-based (8 is maybe not the most ideal choice, since CentOS 8 already reaches end of life at the end of this year :p )

Otherwise, an option I see would be compiling GCC and Python from source, using --prefix with configure/make install to put them in the secret location, then making sure that building Lmod still manages to pick up the secret place. But that will probably involve a lot of manual fixes and configuration.
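
A rough sketch of that --prefix idea for Python (version and location are just examples, untested):

PREFIX=/opt/compat    # the "secret place", outside /usr and the default PATH
curl -LO https://www.python.org/ftp/python/3.9.7/Python-3.9.7.tgz
tar xf Python-3.9.7.tgz && cd Python-3.9.7
./configure --prefix="$PREFIX" --enable-shared LDFLAGS="-Wl,-rpath,$PREFIX/lib"
make -j"$(nproc)" && make install
# nothing on the system picks this Python up unless a module file (or Lmod
# itself) explicitly prepends $PREFIX/bin to PATH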

A more container-friendly solution could be a multi-stage build: install Lmod/EasyBuild in a builder stage, then start from a bare Linux image and copy the preinstalled Lmod into the final image (does Lmod need GCC installed to run, or only for building it?).
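
Something like this hypothetical multi-stage sketch (if I understand correctly, Lmod is Lua and shell scripts, so it needs a Lua runtime but not GCC to run; the versions, package names and paths below are assumptions, not tested):

FROM fedora:34 AS builder
RUN dnf install -y lua lua-posix lua-filesystem tcl make curl tar \
 && curl -L https://github.com/TACC/Lmod/archive/refs/tags/8.5.tar.gz | tar xz \
 && cd Lmod-8.5 && ./configure --prefix=/opt/apps && make install

FROM fedora:34
RUN dnf install -y lua lua-posix lua-filesystem tcl
COPY --from=builder /opt/apps/lmod /opt/apps/lmod
# make the `module` command available in login shells
RUN ln -s /opt/apps/lmod/lmod/init/profile /etc/profile.d/z00_lmod.sh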

We plan to get in touch with EESSI; maybe they'll have some good advice on this.


guimou commented Oct 22, 2021

You almost got everything right.
In fact, GCC is one of the first modules built in the chain of dependencies, because you can choose to build your modules with different compilers or toolchains. But there are other "base" things that you need in your environment to be able to build the first modules, hence all the packages I have to install on top of ubi8 in my Dockerfile to have something running. Btw, I should shortly push another version of the EasyBuild container, based on Fedora instead of ubi8: ubi8 is lagging too much in its glibc version, and that creates issues with other modules.
If we add a compatibility layer like Gentoo Prefix, this would solve some issues and allow building images based on whatever you want, because our "real" environment, where we run EasyBuild, or Jupyter, or whatever, would in fact live inside this Gentoo layer. But that means all the container images you build where you want to use the modules may have to include this compatibility layer (I say may because maybe all the required OS libraries will already be there in your base image...).
That said, this won't solve the Python and JupyterLab issue, because it happens at another, upper layer. Whatever base image you have, compatibility layer or not, your JL instance will be launched and run with/from a specific Python version. Meaning that even if you load another Python module version afterwards, this won't change anything within the launched JL environment. If you were in a VM or a shared environment on an HPC cluster, you would just shut down JL, load the Python version you want as a module, and relaunch JL (which may be a different module version because it's a different Python). Here, because we're inside containers, you just cannot do this kill-JL-and-unload/load dance. At least at the moment, I have not found a way to do it...
So at the moment, unless someone has a solution, your "base image" for JL has to be built with a specific Python version, and all the modules you build have to use this same version.
Now, what you can do is build different module versions for different Python versions, have different base images (JL on Py38, JL on Py39, ...), and with some clever configuration instruct Lmod to display only the modules that are compatible. For example, if you are running the Py39 version, you filter out the Py38-dependent modules to prevent the kind of errors you had.
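
Something along these lines (a sketch; the per-Python-line module tree layout is an assumption, not how my repo is organized today):

# in the image's shell profile: expose only the module tree matching the
# interpreter this image ships with
if python3 -c 'import sys; raise SystemExit(sys.version_info[:2] != (3, 9))'; then
  export MODULEPATH=/opt/apps/easybuild/modules/py39/all
else
  export MODULEPATH=/opt/apps/easybuild/modules/py38/all
fi
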
Final word: yes, please get in touch with the communities, both EESSI and EasyBuilders (there is a lot of overlap). Use the Slack channel, https://easybuild.slack.com/, ask questions... That's how I started; they are awesome!


vemonet commented Oct 28, 2021

Nice, indeed the fact that we are in a container brings some basic assumptions about how we will be starting it.

We could just start a bare Linux without JL running, and then the user starts it themselves... But from our users' point of view it would be a bit weird to do it this way, and not convenient.

We could also use something like run-one-constantly to be able to restart JL and make sure it is always running.
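
Something like this as the container's command, for example (untested, just the idea; run-one-constantly comes from Ubuntu's run-one package):

# respawn JupyterLab whenever the user kills it (e.g. after loading a
# different Python module in the surrounding shell)
exec run-one-constantly jupyter lab --ip=0.0.0.0 --no-browser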

From a user's point of view, I think the easiest would be to have a different base image for each major GCC/Python version they need. They just start the right image, and then they can install all the modules fitting this image (a bit like you do... but multiple times, for multiple versions).

But it will create a lot of work on the maintainers' side: building probably 3 to 5 images, then as many EasyBuild images with the corresponding modules for each version (from my experience, running eb to install modules can take a lot of time).

In any case, thanks a lot for those explanations and pointers, that was really instructive!


guimou commented Oct 28, 2021

We could just start a bare Linux without JL running, and then the user starts it themselves... But from our users' point of view it would be a bit weird to do it this way, and not convenient.

Yes and no, because then you have to figure out how your user will access the container. You can always rsh into it, but then JL will run under a specific session that you'll have to keep open. Another solution would be to start a basic interface, some kind of spawner where you can choose your JL version. But in that case, why not do it from the start and spawn the JL the user wants?

From a user's point of view, I think the easiest would be to have a different base image for each major GCC/Python version they need. They just start the right image, and then they can install all the modules fitting this image (a bit like you do... but multiple times, for multiple versions).

That's the path I'm following so far, Python being the main concern for the moment, as I intend to base everything on a relatively recent GCC version (like 10.3 for now), but maybe with two different "environments", one with Py38 and one with Py39.

But it will create a lot of work on the maintainers' side: building probably 3 to 5 images, then as many EasyBuild images with the corresponding modules for each version (from my experience, running eb to install modules can take a lot of time).

Well, in fact you can have one EasyBuild (the "builder" instance) building both Py38 and Py39 modules. But yes, you need to have two "lines" of modules, or make some choices/prioritization: TF up to version xx is available only in the Py38 line, then in both lines up to version yy, then only in the Py39 line.
The good news is that once a module is created, you normally don't have to maintain it much: it's supposed to be immutable. If you have real patching to do, mainly for security purposes, it should become another module version.

In any case, thanks a lot for those explanations and pointers, that was really instructive!

My pleasure! As this project seems to be getting some interest, maybe I'll create a Slack channel to have more interesting conversations like this.
