This document is intended for developers who want to install, test or contribute to the code.
Install rust:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env
Install pyenv:
$ curl https://pyenv.run | bash
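If the pyenv command is not found after the installation, you may need to add it to your shell configuration first (a sketch for bash, mirroring the macOS instructions below; adapt to your shell):
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"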
Install Python 3.9.18:
$ pyenv install 3.9.18
Check that the expected local version of Python is used:
$ cd services/worker
$ python --version
Python 3.9.18
Install Poetry with pipx:
- Either a single version:
pipx install poetry==1.8.2
poetry --version
- Or a parallel version (with a unique suffix):
pipx install poetry==1.8.2 [email protected]
[email protected] --version
Set the Python version to use with Poetry:
poetry env use 3.9.18
or
[email protected] env use 3.9.18
Install the dependencies:
make install
To install the worker on macOS, you can follow these steps.
Install brew:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install pyenv:
$ curl https://pyenv.run | bash
Append the following lines to ~/.zshrc:
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
Log out and log in again so that the changes take effect.
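Alternatively, reloading the shell should be enough for the new configuration to take effect:
$ exec "$SHELL"
$ pyenv --version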
Install Python 3.9.18:
$ pyenv install 3.9.18
Check that the expected local version of Python is used:
$ cd services/worker
$ python --version
Python 3.9.18
Install Poetry with pipx:
- Either a single version:
pipx install poetry==1.8.2
poetry --version
- Or a parallel version (with a unique suffix):
pipx install poetry==1.8.2 [email protected]
[email protected] --version
Append the following line to ~/.zshrc:
export PATH="/Users/slesage2/.local/bin:$PATH"
Install rust:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env
Set the Python version to use with Poetry:
poetry env use 3.9.18
or
[email protected] env use 3.9.18
Avoid an issue with Apache Beam (python-poetry/poetry#4888 (comment)):
poetry config experimental.new-installer false
or
[email protected] config experimental.new-installer false
Install the dependencies:
make install
To start working on the project:
git clone [email protected]:huggingface/dataset-viewer.git
cd dataset-viewer
Install all the packages:
make install
Install Docker (see https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository and https://docs.docker.com/engine/install/linux-postinstall/)
Run the project locally:
make start
When the docker containers have been started, open http://localhost:8100/healthcheck in a browser: it should show ok.
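You can also check it from the command line (the expected response body is just ok):
curl http://localhost:8100/healthcheck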
Run the project in development mode:
make dev-start
In development mode, you don't need to rebuild the docker images to apply a change in a worker. You can just restart the worker's docker container and it will apply your changes.
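For example, assuming the compose service is simply named worker (an assumption: check the docker compose file used by make dev-start for the actual compose file and service name), restarting it could look like:
docker compose ps               # list the running services to find the worker's name
docker compose restart worker   # hypothetical service name; adjust to your setup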
To install a single job (in jobs), library (in libs) or service (in services), go to its directory, and install Python 3.9 (consider pyenv) and Poetry (don't forget to add poetry to the PATH environment variable).
If you use pyenv:
cd libs/libcommon/
pyenv install 3.9.18
pyenv local 3.9.18
poetry env use python3.9
then:
make install
It will create a virtual environment in a ./.venv/ subdirectory.
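To run a command inside this environment, you can go through Poetry, for example:
poetry run python --version   # runs the Python interpreter from ./.venv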
If you use VSCode, it might be useful to use the "monorepo" workspace (see a blogpost for more explanations). It is a multi-root workspace, with one folder for each library and service (note that we hide them from the ROOT to avoid editing there). Each folder has its own Python interpreter, with access to the dependencies installed by Poetry. You might have to manually select the interpreter in each folder on first access; VSCode then stores the information in its local storage.
The repository is structured as a monorepo, with Python libraries and applications in jobs, libs and services:
The following diagram represents the general architecture of the project:
- Mongo Server, a Mongo server with databases for: "cache", "queue" and "maintenance".
- jobs contains the jobs run by Helm before deploying the pods or on a scheduled basis. For now, there are two types of jobs.
- libs contains the Python libraries used by the services and workers. For now, there are two libraries.
- services contains the applications:
  - api, the public API, is a web server that exposes the API endpoints. All the responses are served from pre-computed responses in the Mongo server. That's the main point of this project: generating these responses takes time, and the API server provides this service to the users.
  - webhook, exposes the /webhook endpoint which is called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended in the "queue" database.
  - rows
  - search
  - admin, the admin API (which is separated from the public API and might be published under its own domain at some point)
  - reverse proxy, the reverse proxy
  - worker, the worker that processes the queue asynchronously: it gets a "job" from the queue (caution: the jobs stored in the queue, not the Helm jobs), computes the expected response for the associated endpoint, and stores the response in the "cache" collection.
    Note also that the workers create local files when the dataset contains images or audios. A shared directory (ASSETS_STORAGE_ROOT) must therefore be provisioned with sufficient space for the generated files. The /first-rows endpoint responses contain URLs to these files, served by the API under the /assets/ endpoint.
  - sse-api
- Clients
If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Datasets-server-464848da2a984e999c540a4aa7f0ece5.
Hence, the working application has the following core components:
- a Mongo server with two main databases: "cache" and "queue"
- one instance of the API service which exposes a port
- one instance of the ROWS service which exposes a port
- one instance of the SEARCH service which exposes a port
- N instances of the worker, which process the pending "jobs" and store the results in the "cache"
The application also has optional components:
- a reverse proxy in front of the API to serve static files and proxy the rest to the API server
- an admin server to serve technical endpoints
- a shared directory for the assets and cached-assets in S3 (it can be configured to point to a local storage instead)
- a shared storage for temporary files created by the workers in EFS (it can be configured to point to a local storage instead)
The following environments contain all the modules: reverse proxy, API server, admin API server, workers, and the Mongo database.
| Environment | URL | Type | How to deploy |
|---|---|---|---|
| Production | https://datasets-server.huggingface.co | Helm / Kubernetes | Argo CD |
| Development | https://datasets-server.us.dev.moon.huggingface.tech | Helm / Kubernetes | Argo CD |
| Local build | http://localhost:8100 | Docker compose | make start (builds docker images) |
The following diagram represents the logic when a worker pulls a job from the queue:
Source: https://www.figma.com/board/Yymk75rQTYpZuIwTqffyKQ/Queues-in-dataset-viewer
The CI checks the quality of the code through a GitHub action. To manually format the code of a job, library, service or worker:
make style
To check the quality (which includes checking the style, but also security vulnerabilities):
make quality
The CI runs the tests through a GitHub action. To manually test a job, library, service or worker:
make test
Note that it requires the resources to be ready, i.e. the Mongo server and the storage for assets.
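If you don't have a MongoDB instance at hand, a quick way to get one locally is Docker (a sketch; the port and options the tests expect may differ, check the Makefiles and configuration):
docker run -d -p 27017:27017 --name mongo-test mongo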
To launch the end to end tests:
make e2e
We don't use the package versions (in the pyproject.toml files), so there is no need to update them.
All the contributions should go through a pull request. The pull requests must be "squashed" (i.e. one commit per pull request).
You can use act to test the GitHub Actions (see .github/workflows/) locally. It shortens the feedback loop when working on the GitHub Actions, avoids polluting the branches with empty pushes only meant to trigger the CI, and allows running only specific actions.
For example, to launch the build and push of the docker images to Docker Hub:
act -j build-and-push-image-to-docker-hub --secret-file my.secrets
where my.secrets is a file containing the secrets:
DOCKERHUB_USERNAME=xxx
DOCKERHUB_PASSWORD=xxx
GITHUB_TOKEN=xxx