This document is intended for developers who want to install, test or contribute to the code.
Install rust:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env
Install pyenv:
$ curl https://pyenv.run | bash
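If the pyenv command is not found after the installation, you may need to add it to your shell configuration first (a sketch for bash, mirroring the macOS instructions below; adapt to your shell):
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"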
Install Python 3.9.18:
$ pyenv install 3.9.18
Check that the expected local version of Python is used:
$ cd services/worker
$ python --version
Python 3.9.18
Install Poetry with pipx:
- Either a single version:
pipx install poetry==1.8.2
poetry --version
- Or a parallel version (with a unique suffix):
pipx install poetry==1.8.2 [email protected]
[email protected] --version
Set the Python version to use with Poetry:
poetry env use 3.9.18
or
[email protected] env use 3.9.18
Install the dependencies:
make install
To install the worker on macOS, you can follow these steps.
Install brew:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install pyenv:
$ curl https://pyenv.run | bash
Append the following lines to ~/.zshrc:
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
Log out and log in again so that the changes take effect.
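Alternatively, reloading the shell should be enough for the new configuration to take effect:
$ exec "$SHELL"
$ pyenv --version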
Install Python 3.9.18:
$ pyenv install 3.9.18
Check that the expected local version of Python is used:
$ cd services/worker
$ python --version
Python 3.9.18
Install Poetry with pipx:
- Either a single version:
pipx install poetry==1.8.2
poetry --version
- Or a parallel version (with a unique suffix):
pipx install poetry==1.8.2 [email protected]
[email protected] --version
Append the following line to ~/.zshrc:
export PATH="/Users/slesage2/.local/bin:$PATH"
Install rust:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env
Set the Python version to use with Poetry:
poetry env use 3.9.18
or
[email protected] env use 3.9.18
Avoid an issue with Apache Beam (python-poetry/poetry#4888 (comment)):
poetry config experimental.new-installer false
or
[email protected] config experimental.new-installer false
Install the dependencies:
make install
To start working on the project:
git clone [email protected]:huggingface/dataset-viewer.git
cd dataset-viewer
Install all the packages:
make install
Install Docker (see https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository and https://docs.docker.com/engine/install/linux-postinstall/)
Run the project locally:
make start
When the docker containers have been started, open http://localhost:8100/healthcheck in a browser: it should show ok.
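You can also check it from the command line (the expected response body is just ok):
curl http://localhost:8100/healthcheck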
Run the project in development mode:
make dev-start
In development mode, you don't need to rebuild the docker images to apply a change in a worker. You can just restart the worker's docker container and it will apply your changes.
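For example, assuming the compose service is simply named worker (an assumption: check the docker compose file used by make dev-start for the actual compose file and service name), restarting it could look like:
docker compose ps               # list the running services to find the worker's name
docker compose restart worker   # hypothetical service name; adjust to your setup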
To install a single job (in jobs), library (in libs) or service (in services), go to its directory, and install Python 3.9 (consider pyenv) and Poetry (don't forget to add poetry to the PATH environment variable).
If you use pyenv:
cd libs/libcommon/
pyenv install 3.9.18
pyenv local 3.9.18
poetry env use python3.9
then:
make install
It will create a virtual environment in a ./.venv/ subdirectory.
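To run a command inside this environment, you can go through Poetry, for example:
poetry run python --version   # runs the Python interpreter from ./.venv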
If you use VSCode, it might be useful to use the "monorepo" workspace (see a blogpost for more explanations). It is a multi-root workspace, with one folder for each library and service (note that we hide them from the ROOT to avoid editing there). Each folder has its own Python interpreter, with access to the dependencies installed by Poetry. You might have to manually select the interpreter in each folder on first access; VSCode then stores the information in its local storage.
The repository is structured as a monorepo, with Python libraries and applications in jobs, libs and services:
The following diagram represents the general architecture of the project:
- Mongo Server, a Mongo server with databases for: "cache", "queue" and "maintenance".
- jobs contains the jobs run by Helm before deploying the pods or on a scheduled basis. For now, there are two types of jobs.
- libs contains the Python libraries used by the services and workers. For now, there are two libraries.
- services contains the applications:
  - api, the public API, is a web server that exposes the API endpoints. All the responses are served from pre-computed responses in the Mongo server. That's the main point of this project: generating these responses takes time, and the API server provides this service to the users.
  - webhook, exposes the /webhook endpoint which is called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended in the "queue" database.
  - rows
  - search
  - admin, the admin API (which is separated from the public API and might be published under its own domain at some point)
  - reverse proxy, the reverse proxy
  - worker, the worker that processes the queue asynchronously: it gets a "job" from the queue (caution: the jobs stored in the queue, not the Helm jobs), computes the expected response for the associated endpoint, and stores the response in the "cache" collection.
    Note also that the workers create local files when the dataset contains images or audios. A shared directory (ASSETS_STORAGE_ROOT) must therefore be provisioned with sufficient space for the generated files. The /first-rows endpoint responses contain URLs to these files, served by the API under the /assets/ endpoint.
  - sse-api
- Clients
If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Datasets-server-464848da2a984e999c540a4aa7f0ece5.
Hence, the working application has the following core components:
- a Mongo server with two main databases: "cache" and "queue"
- one instance of the API service which exposes a port
- one instance of the ROWS service which exposes a port
- one instance of the SEARCH service which exposes a port
- N instances of the worker, which process the pending "jobs" and store the results in the "cache"
The application also has optional components:
- a reverse proxy in front of the API to serve static files and proxy the rest to the API server
- an admin server to serve technical endpoints
- a shared directory for the assets and cached-assets in S3 (it can be configured to point to a local storage instead)
- a shared storage for temporary files created by the workers in EFS (it can be configured to point to a local storage instead)
The following environments contain all the modules: reverse proxy, API server, admin API server, workers, and the Mongo database.
| Environment | URL | Type | How to deploy |
|---|---|---|---|
| Production | https://datasets-server.huggingface.co | Helm / Kubernetes | Argo CD |
| Development | https://datasets-server.us.dev.moon.huggingface.tech | Helm / Kubernetes | Argo CD |
| Local build | http://localhost:8100 | Docker compose | make start (builds docker images) |
The following diagram represents the logic when a worker pulls a job from the queue:
Source: https://www.figma.com/board/Yymk75rQTYpZuIwTqffyKQ/Queues-in-dataset-viewer
The CI checks the quality of the code through a GitHub action. To manually format the code of a job, library, service or worker:
make style
To check the quality (which includes checking the style, but also security vulnerabilities):
make quality
The CI runs the tests through a GitHub action. To manually test a job, library, service or worker:
make test
Note that it requires the resources to be ready, i.e. the Mongo server and the storage for assets.
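If you don't have a MongoDB instance at hand, a quick way to get one locally is Docker (a sketch; the port and options the tests expect may differ, check the Makefiles and configuration):
docker run -d -p 27017:27017 --name mongo-test mongo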
To launch the end to end tests:
make e2e
We don't use the package versions (in the pyproject.toml files), so there is no need to update them.
All the contributions should go through a pull request. The pull requests must be "squashed" (i.e. one commit per pull request).
You can use act to test the GitHub Actions (see .github/workflows/) locally. It shortens the feedback loop when working on the GitHub Actions, avoids polluting the branches with empty pushes only meant to trigger the CI, and allows running only specific actions.
For example, to launch the build and push of the docker images to Docker Hub:
act -j build-and-push-image-to-docker-hub --secret-file my.secrets
where my.secrets is a file containing the secrets:
DOCKERHUB_USERNAME=xxx
DOCKERHUB_PASSWORD=xxx
GITHUB_TOKEN=xxx