-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move dependencies out of Dockerfile #32
base: devel
Are you sure you want to change the base?
Conversation
40948b5
to
9f799a4
Compare
Thanks for the PR @jwokaty. The overall size of the docker image currently is much larger by about 750MB (approx)
The key questions for this image are:
I'm happy to help on any of these, and welcome thoughts from @jwokaty, @hpages, @vjcitn and @mtmorgan . |
bin/install.sh
Outdated
# Constructing array of apt packages, removing unnecessary packages. | ||
cat $bbs_files/apt_*.txt | awk '/^[^#]/ {print $1}' | sort >> /tmp/bbs_apt_pkgs | ||
cat $bd_files/apt_*.txt | awk '/^[^#]/ {print $1}' | sort >> /tmp/skip_apt_pkgs | ||
apt_pkgs=$(comm -23 /tmp/bbs_apt_pkgs /tmp/skip_apt_pkgs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be saved into a file, instead of just being a variable. This is so that we can track what is being installed explicitly. Both apt-pkgs
and apt_required_pkgs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the list of packages that's now in apt_pkgs
be a file in the repository?
bin/install.sh
Outdated
|
||
# Remove files | ||
if test -n "$(find /tmp -maxdepth 1 -name '*_pkgs' -print -quit)"; then | ||
rm /tmp/*_pkgs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove all other files other than what is being installed.
bin/install.sh
Outdated
# Constructing array of pip packages, removing unnecessary packages. | ||
cat $bbs_files/pip_*.txt | awk '/^[^#]/ {print $1}' | sort >> /tmp/bbs_pip_pkgs | ||
cat $bd_files/pip_*.txt | awk '/^[^#]/ {print $1}' | sort >> /tmp/skip_pip_pkgs | ||
pip_pkgs=$(comm -23 /tmp/bbs_pip_pkgs /tmp/skip_pip_pkgs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a couple of more comments, so we are aware of what is going on explicitly here.
bin/install.sh
Outdated
|
||
# Get repository with Ubuntu-files | ||
if [ ! -d "/tmp/BBS" ]; then | ||
git clone $repo /tmp/BBS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shallow clone --depth 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this tip!
bin/install.sh
Outdated
# Install pip packages | ||
pip3 install $pip_pkgs | ||
|
||
# Remove files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean up repos along the lines of
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
do it in this layer, otherwise they are only 'masked' rather than removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain this a little more or point me toward more information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
layers are independent, so if in one layer you create, maybe indirectly, a large file RUN <create large-file>
and in the next layer you try to clean that up RUN rm <large-file>
actually the large file still exists, just the file system doesn't know about it. Maybe the comment isn't directly relevant, but hopefully the sentiment is -- clean up before the layer closes.
Ubuntu-files/20.04/apt_skip.txt
Outdated
libeigen3-dev | ||
libfribidi-dev | ||
libfuse-dev | ||
libgmp-dev # BD uses libgmp3-dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BD
== bioconductor_docker
? Would be good to spell this out at least on first use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, bd
is bioconductor_docker.
bin/install.sh
Outdated
repo="https://github.com/Bioconductor/bbs" | ||
branch="master" | ||
bbs_files="/tmp/BBS/Ubuntu-files/$version" | ||
bd_files="/tmp/$version" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bd_files
is cryptic to me here, too.
I took a look at Dockerfile, went thru the list of deb packages that are explicitly listed in the file, and annotated them. This should help us decide what to do with each of them. The goal is that each deb package should go in one of the following lists:
All these lists (except the last one) are in https://github.com/Bioconductor/BBS/tree/master/Ubuntu-files/20.04/. The last one (apt_docker_only.txt) would need to be created. It would list stuff that is maybe nice to have on the Docker image for developers but Here's the annotated list extracted from Dockerfile:
Note that I've left the following section from Dockerfile out of the discussion for now:
because I'm not sure what to do with it or why it is needed. We do need some Python modules on the build machines but they should all be installed for Python 3, not Python 2 (we've dropped support for Python 2 last year). The goal is that in the future we'll only need to add new deb packages to the apt_cran.txt and/or apt_bioc.txt lists as new (or existing) Bioconductor packages introduce new system requirements. This will impact what gets installed on both, the build machines and the Docker image. Hope this helps, |
So in docker, we should be installing apt dependencies from the following files:
We're not installing apt_nice_to_have and apt_vignettes_reference_manuals--is that correct? I also want to clarify that when one of these files has packages not listed in the current Dockerfile in the master branch, that we still install everything in the file. For example, the dockerfile on the master branch lists only 2 font packages; however, apt_extra_fonts has 8 total packages. So we will still be installing more packages than the current docker on the master branch, but at least what we're installing is explicit. Additionally, when the docker and build systems have a similar package, we should choose the build system package that's in one of the files listed, correct? |
Hi Jennifer,
These are all very good questions. I will try to answer the ones which I’m able to,
So in docker, we should be installing apt dependencies from the following files:
• apt_bioc.txt
• apt_cran.txt
• apt_optional_compile_R
• apt_required_compile_R
• apt_extra_fonts
• apt_required_compile_R
• apt_required_build_R
We're not installing apt_nice_to_have and apt_vignettes_reference_manuals--is that correct?
I also want to clarify that when one of these files has packages not listed in the current Dockerfile in the master branch, that we still install everything in the file. For example, the dockerfile on the master branch lists only 2 font packages; however, apt_extra_fonts has 8 total packages. So we will still be installing more packages than the current docker on the master branch, but at least what we're installing is explicit.
Somehow my experience so far has been that these apt_extra_fonts are needed only to ‘build’ vignettes. Is there a way to narrow down, what exactly these 6 extra fonts are needed for?
One thing to remember is that we are inheriting some dependencies from our *parent* docker image from rocker.
So the inheritance goes like this,
```
ubuntu/latest —> rocker/r-ver —> rocker/rstudio —> bioconductor/bioconductor_docker`
There are dependencies which are already pre-installed and inherited from
**rocker/r-ver** - ( https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_R.sh)
**rocker/rstudio** - (https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_pandoc.sh)
And as Martin, pointed out previously, because of the AUFS file system used as ‘layers’ in docker, each additional installation of the same software gives the impression of overwriting, but we are simply adding layers and increasing size.
Additionally, when the docker and build systems have a similar package, we should choose the build system package that's in one of the files listed, correct?
This should be correct, except changes could be potentially made once we complete testing. To elaborate on testing, we’ll be building / installing all the 2000+ bioconductor packages and their dependencies and we’ll see how that goes.
Specifically, something like this, where ‘pkg’ is a vector of Bioconductor packages.
BiocManager::install(pkg,
INSTALL_opts = "--build",
update = FALSE,
quiet = TRUE,
force = TRUE,
keep_outputs = TRUE)
I’m not sure if this answers your questions, but I’m happy to get on a call and discuss the solution some more.
|
@nturaga Thanks for trying to answer some of my questions as well as pointing me to the rocker scripts. I'm not sure if there's a better way to investigate these dependencies, but I decided to use code.bioconductor.org to investigate the "who needs that" packages. I will look there too for these files, but they're probably dependencies from nonbioconductor packages. If these are for building vignettes and we don't build vignettes with docker, why are we installing them? The fonts are just one group of files where I know there are more in the BBS than in docker. I suspect there will be others. |
I marked all the packages that are in both docker and the BBS:
Here's what's only in the BBS:
So while it's fine that we don't include the vignette packages, we see there's still other packages in other files that we do want to include that have additional packages. To complicate matters more, we have dependencies installed from rocker that we don't need to install again because we don't replace the original package, they just add another layer (note: we're usually installing a
It seems that if we want to install apt_bioc.txt, apt_cran.txt, apt_optional_compile_R, apt_required_compile_R, apt_extra_fonts (do we still need this for the docker?), apt_required_compile_R, and apt_required_build_R, we should expect that the docker is going to be larger because of the extra packages. I think the current method where I exclude packages will give better control; it's just that what's installed needs to become explicit. I also think we should keep the practice of annotating any package dependencies. I was not able to find any reference for the following packages listed in the Dockerfile when searching code.bioconductor.org, with the exception of the first package. These are the candidates for the
|
Not sure why these are different as this is not what I expected! The following packages were selected because either we weren't sure Appeared to be already installed
Not needed / No additional Bioc packages were install after the following apt packages were installed
|
- Add packages already installed via Rocker to skip file - Move scripts to src - Add comments to src/install.sh
d732eb7
to
b82f00e
Compare
I've tried to address all previous comments. However, I'm not attempting to recreate what is in the master branch nor the BBS, but a container that installs packages from the BBS and can override entries that we don't want to install (for example, when they've already been installed via Rocker). The current size is 4.25GB. All but the following Bioconductor packages are installed:
Here's what we actually install in the Docker. You see these in the output when container is built. You can comment out the clean up at the bottom of
If this is still too big, I need to know where to cut. I could start removing dependencies required for only a few BioC packages. If we want to be more explicit, I can write a script to generate the above packages and we can rerun the script every time we want to update the docker image. We can also commit the list of packages. I still kept a list of packages to skip because the BBS files have quite a few packages that were already installed via Rocker. |
Thanks @jwokaty, I will review this today/tomorrow. |
This PR creates Ubuntu-files to track docker-specific dependencies and skip dependencies that we don't want installed from BBS.
Regarding
libmariadb-dev-compat
, I've put it inapt_required.txt
but commented it out because it depends onlibmariadb-dev
and conflicts withlibmysqlclient-dev
, which gets installed on the build system. We can choose to skiplibmysqlclient-dev
by putting it inapt_skip.txt
and uncommentlibmariadb-dev-compat
inapt_required.txt
so that it gets installed.bin/install.sh
installs BBS dependencies, comparing the BBS Ubuntu-files to this repo's Ubuntu-files.When I tested, I was able to install the following packages:
On 6/29,
docker images
reported the size as 4.57GB.I'd appreciate any feedback to improve this PR. You can see my PR for BBS at Bioconductor/BBS#84.