Rename to dataset viewer part1 (#2663)
* Datasets Server -> (the) dataset viewer (API)

* more renaming + change repo name

* datasets-server -> dataset-viewer where it has no side-effect
severo authored Apr 5, 2024
1 parent cbf56be commit cc97f89
Showing 66 changed files with 322 additions and 390 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/_e2e_tests.yml
@@ -67,7 +67,7 @@ jobs:
CLOUDFRONT_KEY_PAIR_ID: "K3814DK2QUJ71H"
CLOUDFRONT_PRIVATE_KEY: ${{ secrets.CLOUDFRONT_PRIVATE_KEY }}
HF_HUB_ENABLE_HF_TRANSFER: "1"
-run: docker compose -f docker-compose-datasets-server.yml up -d --wait --wait-timeout 20
+run: docker compose -f docker-compose-dataset-viewer.yml up -d --wait --wait-timeout 20
working-directory: ./tools
- name: Install poetry
run: pipx install poetry==${{ env.poetry-version }}
4 changes: 2 additions & 2 deletions .github/workflows/e2e.yml
@@ -16,7 +16,7 @@ on:
- ".github/workflows/_quality-python.yml"
- ".github/workflows/e2e.yml"
- "tools/Python.mk"
-- "tools/docker-compose-datasets-server.yml"
+- "tools/docker-compose-dataset-viewer.yml"
pull_request:
paths:
- "e2e/**"
@@ -27,7 +27,7 @@ on:
- ".github/workflows/_quality-python.yml"
- ".github/workflows/e2e.yml"
- "tools/Python.mk"
-- "tools/docker-compose-datasets-server.yml"
+- "tools/docker-compose-dataset-viewer.yml"
jobs:
quality:
uses: ./.github/workflows/_quality-python.yml
2 changes: 1 addition & 1 deletion .github/workflows/stale.yml
@@ -10,7 +10,7 @@ on:
jobs:
close_stale_issues:
name: Close Stale Issues
-if: github.repository == 'huggingface/datasets-server'
+if: github.repository == 'huggingface/dataset-viewer'
runs-on: ubuntu-latest
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
2 changes: 1 addition & 1 deletion AUTHORS
@@ -1,4 +1,4 @@
-# This is the list of HuggingFace Datasets Server authors for copyright purposes.
+# This is the list of HuggingFace dataset viewer authors for copyright purposes.
#
# This does not necessarily list everyone who has contributed code, since in
# some cases, their employer may be the copyright holder. To see the full list
12 changes: 6 additions & 6 deletions CONTRIBUTING.md
@@ -1,8 +1,8 @@
-# How to contribute to the Datasets Server?
+# How to contribute to the dataset viewer?

[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](CODE_OF_CONDUCT.md)

-The Datasets Server is an open source project, so all contributions and suggestions are welcome.
+The dataset viewer is an open source project, so all contributions and suggestions are welcome.

You can contribute in many different ways: giving ideas, answering questions, reporting bugs, proposing enhancements,
improving the documentation, fixing bugs...
@@ -28,14 +28,14 @@ If you would like to work on any of the open Issues:

## How to create a Pull Request?

-1. Fork the [repository](https://github.com/huggingface/datasets-server) by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.
+1. Fork the [repository](https://github.com/huggingface/dataset-viewer) by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

2. Clone your fork to your local disk, and add the base repository as a remote:

```bash
git clone [email protected]:<your Github handle>/datasets-server.git
cd datasets-server
git remote add upstream https://github.com/huggingface/datasets-server.git
git clone [email protected]:<your Github handle>/dataset-viewer.git
cd dataset-viewer
git remote add upstream https://github.com/huggingface/dataset-viewer.git
```

3. Create a new branch to hold your development changes:
4 changes: 2 additions & 2 deletions DEVELOPER_GUIDE.md
@@ -7,8 +7,8 @@ This document is intended for developers who want to install, test or contribute
To start working on the project:

```bash
git clone [email protected]:huggingface/datasets-server.git
cd datasets-server
git clone [email protected]:huggingface/dataset-viewer.git
cd dataset-viewer
```

Install docker (see https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository and https://docs.docker.com/engine/install/linux-postinstall/)
8 changes: 4 additions & 4 deletions Makefile
@@ -15,10 +15,10 @@ dev-start: export COMPOSE_PROJECT_NAME := dev-datasets-server
dev-stop: export COMPOSE_PROJECT_NAME := dev-datasets-server

# makefile variables per target
-start: DOCKER_COMPOSE := ./tools/docker-compose-datasets-server.yml
-stop: DOCKER_COMPOSE := ./tools/docker-compose-datasets-server.yml
-dev-start: DOCKER_COMPOSE := ./tools/docker-compose-dev-datasets-server.yml
-dev-stop: DOCKER_COMPOSE := ./tools/docker-compose-dev-datasets-server.yml
+start: DOCKER_COMPOSE := ./tools/docker-compose-dataset-viewer.yml
+stop: DOCKER_COMPOSE := ./tools/docker-compose-dataset-viewer.yml
+dev-start: DOCKER_COMPOSE := ./tools/docker-compose-dev-dataset-viewer.yml
+dev-stop: DOCKER_COMPOSE := ./tools/docker-compose-dev-dataset-viewer.yml

include tools/Docker.mk

12 changes: 6 additions & 6 deletions README.md
@@ -1,16 +1,16 @@
-# Datasets server
+# Dataset viewer

> Integrate into your apps over 10,000 datasets via simple HTTP requests, with pre-processed responses and scalability built-in.
Documentation: https://huggingface.co/docs/datasets-server

## Ask for a new feature 🎁

-The datasets server pre-processes the [Hugging Face Hub datasets](https://huggingface.co/datasets) to make them ready to use in your apps using the API: list of the splits, first rows.
+The dataset viewer pre-processes the [Hugging Face Hub datasets](https://huggingface.co/datasets) to make them ready to use in your apps using the API: list of the splits, first rows.

-We plan to [add more features](https://github.com/huggingface/datasets-server/issues?q=is%3Aissue+is%3Aopen+label%3A%22feature+request%22) to the server. Please comment there and upvote your favorite requests.
+We plan to [add more features](https://github.com/huggingface/dataset-viewer/issues?q=is%3Aissue+is%3Aopen+label%3A%22feature+request%22) to the server. Please comment there and upvote your favorite requests.

-If you think about a new feature, please [open a new issue](https://github.com/huggingface/datasets-server/issues/new).
+If you think about a new feature, please [open a new issue](https://github.com/huggingface/dataset-viewer/issues/new).

## Contribute 🤝

@@ -20,8 +20,8 @@ To install the server and start contributing to the code, see [DEVELOPER_GUIDE.m

## Community 🤗

-You can star and watch this [GitHub repository](https://github.com/huggingface/datasets-server) to follow the updates.
+You can star and watch this [GitHub repository](https://github.com/huggingface/dataset-viewer) to follow the updates.

You can ask for help or answer questions on the [Forum](https://discuss.huggingface.co/c/datasets/10) and [Discord](https://discord.com/channels/879548962464493619/1019883044724822016).

-You can also report bugs and propose enhancements on the code, or the documentation, in the [GitHub issues](https://github.com/huggingface/datasets-server/issues).
+You can also report bugs and propose enhancements on the code, or the documentation, in the [GitHub issues](https://github.com/huggingface/dataset-viewer/issues).
4 changes: 2 additions & 2 deletions chart/Chart.yaml
@@ -3,7 +3,7 @@

apiVersion: v2
name: datasets-server
-description: A Helm chart for the datasets-server application
+description: A Helm chart for the dataset-viewer application

# A chart can be either an 'application' or a 'library' chart.
#
@@ -25,7 +25,7 @@ version: 2.0.0
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
#
-# See https://github.com/huggingface/datasets-server/releases
+# See https://github.com/huggingface/dataset-viewer/releases
appVersion: "0.22.2"

icon: https://huggingface.co/front/assets/huggingface_logo-noborder.svg
10 changes: 5 additions & 5 deletions chart/README.md
@@ -1,15 +1,15 @@
-# datasets-server Helm chart
+# Dataset viewer Helm chart

-The `datasets-server` Helm [chart](https://helm.sh/docs/topics/charts/) describes the Kubernetes resources of the datasets-server application.
+The dataset viewer Helm [chart](https://helm.sh/docs/topics/charts/) describes the Kubernetes resources of the dataset viewer application.

If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Infrastructure-b4fd07f015e04a84a41ec6472c8a0ff5.

-The cloud infrastructure for the datasets-server uses:
+The cloud infrastructure for the dataset viewer uses:

-- Docker Hub to store the docker images of the datasets-server services.
+- Docker Hub to store the docker images of the dataset viewer services.
- Amazon EKS for the Kubernetes clusters.

-Note that this Helm chart is used to manage the deployment of the `datasets-server` services to the cloud infrastructure (AWS) using Kubernetes. The infrastructure in itself is not created here, but in https://github.com/huggingface/infra/ using terraform. If you need to create or modify some resources, contact the infra team.
+Note that this Helm chart is used to manage the deployment of the dataset viewer services to the cloud infrastructure (AWS) using Kubernetes. The infrastructure in itself is not created here, but in https://github.com/huggingface/infra/ using terraform. If you need to create or modify some resources, contact the infra team.

## Deploy

2 changes: 1 addition & 1 deletion chart/nginx-templates/default.conf.template
@@ -28,7 +28,7 @@ server {
set $cached_assets_storage_root ${CACHED_ASSETS_STORAGE_ROOT};

location /openapi.json {
-return 307 https://raw.githubusercontent.com/huggingface/datasets-server/main/${OPENAPI_FILE};
+return 307 https://raw.githubusercontent.com/huggingface/dataset-viewer/main/${OPENAPI_FILE};
}

location /assets/ {
6 changes: 3 additions & 3 deletions chart/templates/_common/_helpers.tpl
@@ -162,7 +162,7 @@ Return the api ingress anotation
{{- end -}}

{{/*
-Datasets Server base url
+The dataset viewer API base url
*/}}
{{- define "datasetsServer.ingress.hostname" -}}
{{ .Values.global.huggingface.ingress.subdomains.datasetsServer }}.{{ .Values.global.huggingface.ingress.domain }}
@@ -195,7 +195,7 @@ The cached-assets base URL

{{/*
The parquet-metadata/ subpath in the EFS
-- in a subdirectory named as the chart (datasets-server/), and below it,
+- in a subdirectory named as the chart (dataset-viewer/), and below it,
- in a subdirectory named as the Release, so that Releases will not share the same dir
*/}}
{{- define "parquetMetadata.subpath" -}}
@@ -204,7 +204,7 @@ The parquet-metadata/ subpath in the EFS

{{/*
The duckdb-index/ subpath in EFS
-- in a subdirectory named as the chart (datasets-server/), and below it,
+- in a subdirectory named as the chart (dataset-viewer/), and below it,
- in a subdirectory named as the Release, so that Releases will not share the same dir
*/}}
{{- define "duckDBIndex.subpath" -}}
2 changes: 1 addition & 1 deletion chart/templates/_env/_envDiscussions.tpl
@@ -14,7 +14,7 @@
{{- else }}
value: {{ .Values.secrets.appParquetConverterHfToken.value }}
{{- end }}
-# ^ we use the same token (datasets-server-bot) for discussions and for uploading parquet files
+# ^ we use the same token (dataset viewer bot) for discussions and for uploading parquet files
- name: DISCUSSIONS_PARQUET_REVISION
value: {{ .Values.parquetAndInfo.targetRevision | quote }}
{{- end -}}
2 changes: 1 addition & 1 deletion chart/values.yaml
@@ -289,7 +289,7 @@ hfDatasetsCache:
cacheDirectory: "/tmp/hf-datasets-cache"

discussions:
-# name of the Hub user associated with the Datasets Server bot app
+# name of the Hub user associated with the dataset viewer bot app
botAssociatedUserName: "parquet-converter"

# --- jobs (pre-install/upgrade hooks) ---
4 changes: 2 additions & 2 deletions docs/README.md
@@ -48,7 +48,7 @@ The documentation is available at http://localhost:3000/.
To build the documentation, launch:

```bash
-BUILD_DIR=/tmp/doc-datasets-server/ make build
+BUILD_DIR=/tmp/doc-dataset-viewer/ make build
```

You can adapt the `BUILD_DIR` environment variable to set any temporary folder that you prefer. This command will create it and generate
@@ -69,7 +69,7 @@ will see a bot add a comment to a link where the documentation with your changes
Accepted files are Markdown (.md or .mdx).

Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
-the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/datasets-server/blob/main/docs/source/_toctree.yml) file.
+the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/dataset-viewer/blob/main/docs/source/_toctree.yml) file.

## Adding an image

4 changes: 2 additions & 2 deletions docs/pyproject.toml
@@ -1,7 +1,7 @@
[tool.poetry]
authors = ["Sylvain Lesage <[email protected]>"]
-description = "Documentation for datasets-server"
-name = "datasets-server-doc"
+description = "Documentation for dataset-viewer"
+name = "dataset-viewer-doc"
version = "0.1.0"

[tool.poetry.dependencies]
4 changes: 2 additions & 2 deletions docs/source/_toctree.yml
@@ -1,7 +1,7 @@
- title: Get Started
sections:
- local: index
-title: 🤗 Datasets server
+title: 🤗 Dataset viewer
- local: quick_start
title: Quickstart
- local: analyze_data
@@ -30,7 +30,7 @@
title: Explore dataset statistics
- local: croissant
title: Get Croissant metadata
-- title: Query datasets from Datasets Server
+- title: Query datasets from dataset viewer API
sections:
- local: parquet_process
title: Overview
2 changes: 1 addition & 1 deletion docs/source/croissant.md
@@ -1,6 +1,6 @@
# Get Croissant metadata

-Datasets Server automatically generates the metadata in [Croissant](https://github.com/mlcommons/croissant) format (JSON-LD) for every dataset on the Hugging Face Hub. It lists the dataset's name, description, URL, and the distribution of the dataset as Parquet files, including the columns' metadata. The Croissant metadata is available for all the datasets that can be [converted to Parquet format](./parquet#conversion-to-parquet).
+The dataset viewer automatically generates the metadata in [Croissant](https://github.com/mlcommons/croissant) format (JSON-LD) for every dataset on the Hugging Face Hub. It lists the dataset's name, description, URL, and the distribution of the dataset as Parquet files, including the columns' metadata. The Croissant metadata is available for all the datasets that can be [converted to Parquet format](./parquet#conversion-to-parquet).

## What is Croissant?

2 changes: 1 addition & 1 deletion docs/source/data_types.md
@@ -1,6 +1,6 @@
# Data types

-Datasets supported by Datasets Server have a tabular format, meaning a data point is represented in a row and its features are contained in columns. Using the `/first-rows` endpoint allows you to preview the first 100 rows of a dataset and information about each feature. Within the `features` key, you'll notice it returns a `_type` field. This value describes the data type of the column, and it is also known as a dataset's [`Features`](https://huggingface.co/docs/datasets/about_dataset_features).
+Datasets supported by the dataset viewer have a tabular format, meaning a data point is represented in a row and its features are contained in columns. Using the `/first-rows` endpoint allows you to preview the first 100 rows of a dataset and information about each feature. Within the `features` key, you'll notice it returns a `_type` field. This value describes the data type of the column, and it is also known as a dataset's [`Features`](https://huggingface.co/docs/datasets/about_dataset_features).

There are several different data `Features` for representing different data formats such as [`Audio`](https://huggingface.co/docs/datasets/v2.5.2/en/package_reference/main_classes#datasets.Audio) and [`Image`](https://huggingface.co/docs/datasets/v2.5.2/en/package_reference/main_classes#datasets.Image) for speech and image data respectively. Knowing a dataset feature gives you a better understanding of the data type you're working with, and how you can preprocess it.

6 changes: 3 additions & 3 deletions docs/source/filter.md
@@ -1,14 +1,14 @@
# Filter rows in a dataset

-Datasets Server provides a `/filter` endpoint for filtering rows in a dataset.
+The dataset viewer provides a `/filter` endpoint for filtering rows in a dataset.

<Tip warning={true}>
Currently, only <a href="./parquet">datasets with Parquet exports</a>
-are supported so Datasets Server can index the contents and run the filter query without
+are supported so the dataset viewer can index the contents and run the filter query without
downloading the whole dataset.
</Tip>

-This guide shows you how to use Datasets Server's `/filter` endpoint to filter rows based on a query string.
+This guide shows you how to use the dataset viewer's `/filter` endpoint to filter rows based on a query string.
Feel free to also try it out with [ReDoc](https://redocly.github.io/redoc/?url=https://datasets-server.huggingface.co/openapi.json#operation/filterRows).

The `/filter` endpoint accepts the following query parameters:
4 changes: 2 additions & 2 deletions docs/source/first_rows.md
@@ -1,10 +1,10 @@
# Preview a dataset

-Datasets Server provides a `/first-rows` endpoint for visualizing the first 100 rows of a dataset. This'll give you a good idea of the data types and example data contained in a dataset.
+The dataset viewer provides a `/first-rows` endpoint for visualizing the first 100 rows of a dataset. This'll give you a good idea of the data types and example data contained in a dataset.

![dataset-viewer](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dataset-viewer.png)

-This guide shows you how to use Datasets Server's `/first-rows` endpoint to preview a dataset. Feel free to also try it out with [Postman](https://www.postman.com/huggingface/workspace/hugging-face-apis/request/23242779-32d6a8be-b800-446a-8cee-f6b5ca1710df), [RapidAPI](https://rapidapi.com/hugging-face-hugging-face-default/api/hugging-face-datasets-api), or [ReDoc](https://redocly.github.io/redoc/?url=https://datasets-server.huggingface.co/openapi.json#operation/listFirstRows).
+This guide shows you how to use the dataset viewer's `/first-rows` endpoint to preview a dataset. Feel free to also try it out with [Postman](https://www.postman.com/huggingface/workspace/hugging-face-apis/request/23242779-32d6a8be-b800-446a-8cee-f6b5ca1710df), [RapidAPI](https://rapidapi.com/hugging-face-hugging-face-default/api/hugging-face-datasets-api), or [ReDoc](https://redocly.github.io/redoc/?url=https://datasets-server.huggingface.co/openapi.json#operation/listFirstRows).

The `/first-rows` endpoint accepts three query parameters:

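The diff above is a mechanical swap of `datasets-server` for `dataset-viewer` across 66 files. After a rename of this breadth, a quick way to catch stragglers is to grep the tree for the old name. A minimal sketch of the idea, run against a hypothetical throwaway directory rather than the real repository (file names and contents here are illustrative):

```shell
set -eu

# Hypothetical miniature tree: one file already renamed, one still stale
tmp=$(mktemp -d)
printf 'compose file: docker-compose-dataset-viewer.yml\n' > "$tmp/renamed.yml"
printf 'repo: huggingface/datasets-server\n' > "$tmp/stale.yml"

# List files that still mention the old name; a finished rename leaves this empty
stale=$(grep -rl 'datasets-server' "$tmp" || true)
echo "still referencing the old name: $stale"

rm -rf "$tmp"
```

In the actual repository the equivalent check would be `grep -rl 'datasets-server' .` from the root. As the "part1" in the commit title suggests, some occurrences (the Helm chart `name`, the `datasets-server.huggingface.co` endpoints in documentation links) are deliberately left for a later step, so a non-empty result is not necessarily an error here.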