Lighter image, decent README.
anjackson committed Aug 23, 2024
1 parent 3722fec commit 61a6255
Showing 2 changed files with 82 additions and 19 deletions.
61 changes: 44 additions & 17 deletions Dockerfile
@@ -1,27 +1,54 @@
# DigiPres Toolbox
# A Docker image with some tools pre-installed
FROM python:3.10-bullseye

RUN pip install --no-cache notebook bash_kernel opf-fido
#
# Note that some blocks are commented out to keep the image size down and launches fast.
#

# Core Jupyter support:
RUN pip install --no-cache notebook jupyterlab bash_kernel
RUN python -m bash_kernel.install

RUN apt-get update && apt-get install -y mediainfo default-jre ffmpeg cloc && \
apt-get install -y cmake pkg-config libicu-dev zlib1g-dev libcurl4-openssl-dev libssl-dev ruby-dev && \
# Some lightweight tools and support for installing more:
RUN apt-get update && apt-get install -y sudo mediainfo cloc && \
apt-get clean && rm -rf /var/lib/apt/lists/*

RUN gem install github-linguist

RUN curl -s -L -O https://github.com/richardlehane/siegfried/releases/download/v1.11.1/siegfried_1.11.1-1_amd64.deb && \
dpkg -i siegfried_1.11.1-1_amd64.deb && \
# Install Siegfried:
ENV SF_VERSION=1.11.1
ENV SF_DEB=siegfried_${SF_VERSION}-1_amd64.deb
RUN curl -s -L -O https://github.com/richardlehane/siegfried/releases/download/v${SF_VERSION}/${SF_DEB} && \
dpkg -i ${SF_DEB} && \
rm -f ${SF_DEB} && \
sf -update

RUN curl -s -L -o /usr/share/java/tika-app-2.9.2.jar https://dlcdn.apache.org/tika/2.9.2/tika-app-2.9.2.jar && \
ln -s /usr/share/java/tika-app-2.9.2.jar /usr/share/java/tika-app.jar

COPY droid /usr/share/java/droid
RUN ln -s /usr/share/java/droid/droid.sh /usr/local/bin/droid.sh
COPY tika.sh /usr/local/bin/tika.sh

# Install TRiD:
RUN curl -s -L -O http://mark0.net/download/trid_linux_64.zip && \
curl -s -L -O http://mark0.net/download/triddefs.zip && \
unzip trid_linux_64.zip && unzip triddefs.zip && chmod +x ./trid && \
cp ./trid /usr/local/bin/trid && cp triddefs.trd /usr/local/bin/
curl -s -L -O http://mark0.net/download/triddefs.zip && \
unzip trid_linux_64.zip && unzip triddefs.zip && chmod +x ./trid && \
mv ./trid /usr/local/bin/trid && mv triddefs.trd /usr/local/bin/ && \
rm -f trid_linux_64.zip triddefs.zip

# Install Fido:
RUN pip install --no-cache opf-fido

# Install JRE for Java programs and ffmpeg for a/v formats (c. 0.6GB!):
#RUN apt-get update && apt-get install -y default-jre ffmpeg && \
# apt-get clean && rm -rf /var/lib/apt/lists/*

# Install GitHub Linguist and its build dependencies (c. 0.2GB):
#RUN apt-get update && \
# apt-get install -y cmake pkg-config libicu-dev zlib1g-dev libcurl4-openssl-dev libssl-dev ruby-dev && \
# apt-get clean && rm -rf /var/lib/apt/lists/*
#RUN gem install github-linguist

# Install Apache Tika (needs Java):
#ENV TIKA_VERSION=2.9.2
#RUN curl -s -L -o /usr/share/java/tika-app-${TIKA_VERSION}.jar https://dlcdn.apache.org/tika/${TIKA_VERSION}/tika-app-${TIKA_VERSION}.jar && \
# ln -s /usr/share/java/tika-app-${TIKA_VERSION}.jar /usr/share/java/tika-app.jar
#COPY tika.sh /usr/local/bin/tika.sh

# Install DROID (needs Java)
#COPY droid /usr/share/java/droid
#RUN ln -s /usr/share/java/droid/droid.sh /usr/local/bin/droid.sh

40 changes: 38 additions & 2 deletions README.md
@@ -1,2 +1,38 @@
# format-id-toolbox
A Docker container with various format identification tools installed.
DigiPres Toolbox
----------------

A Docker image designed to make it easy to experiment with tools for Digital Preservation. It is intended to be used via the [DigiPres Sandbox](https://github.com/digipres/sandbox) and the [DigiPres Workbench](https://github.com/digipres/workbench).

## Supported Tools

### Pre-installed

Only lightweight tools are pre-installed, so that the Docker image size (and hence Sandbox launch times) can be kept low.

- [Siegfried](https://www.itforarchivists.com/siegfried) (using the 'deluxe' format signatures, which include multiple sources).
- [File](https://www.darwinsys.com/file/)
- [TrID](http://mark0.net/soft-trid-e.html)
- [MediaInfo](https://github.com/MediaArea/MediaInfo)
- [CLOC](https://github.com/AlDanial/cloc)
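
As a quick sanity check inside a running container, something like the following sketch could report which of the tools listed above are actually on the `PATH`. The command names `sf` and `trid` come from the Dockerfile; `file`, `mediainfo` and `cloc` are assumed to be the commands provided by the corresponding packages, and `check_tools` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether each named command is on the PATH.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# Command names assumed for the pre-installed tools listed above.
check_tools sf file trid mediainfo cloc
```

Any tool reported as `missing` would suggest the image was built from a different revision of the Dockerfile.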

### Verified Installable

These aren't installed by default, but the [Sandbox](https://github.com/digipres/sandbox) shows how to install them.

- [Apache Tika](https://tika.apache.org/)
- [DROID](http://digital-preservation.github.io/droid/)
- [Fido](https://github.com/openpreserve/fido)
- [ffmpeg](https://ffmpeg.org) including [ffprobe](https://ffmpeg.org/ffprobe.html)
- [GitHub Linguist](https://github.com/github/linguist)
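
The Sandbox remains the reference for installing these. As a rough sketch only, the Java runtime and ffmpeg could be added to a running container by replaying the commented-out `apt-get` block from the Dockerfile. This assumes the Debian-based image, network access, and that the container user may use `sudo` (which the image installs). The `run` and `install_java_and_ffmpeg` function names and the `DRY_RUN` variable are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mirrors the commented-out JRE/ffmpeg block in the Dockerfile (c. 0.6GB).
install_java_and_ffmpeg() {
    run sudo apt-get update &&
    run sudo apt-get install -y default-jre ffmpeg
}
```

Setting `DRY_RUN=1` before calling `install_java_and_ffmpeg` prints the commands without running them, which is a cheap way to review what would be installed before committing to the extra download.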

### To Consider

- VeraPDF
- JHOVE
- Handbrake
- MediaConch

## Inspirations

- The PLANETS Testbed ([briefing paper](https://www.dcc.ac.uk/guidance/briefing-papers/technology-watch-papers/planets-testbed), [article](https://journal.code4lib.org/articles/83))
- [VIPER](https://viper.openpreservation.org/)
