doc: DuckDB native support for HuggingFace urls #2817

Closed
wants to merge 25 commits into from
Changes from 15 commits
Commits
25 commits
6df1f15
Draft duckdb cli small guides
AndreaFrancis May 15, 2024
f61ecd4
Adding duckdb_cli_auth
AndreaFrancis May 15, 2024
f46e704
Adding credential_chain doc
AndreaFrancis May 16, 2024
573cb71
Add query datasets doc
AndreaFrancis May 16, 2024
4fe6c54
Change sample dataset
AndreaFrancis May 16, 2024
c9a5489
Merge branch 'main' into duckdb-cli-integration-doc
AndreaFrancis May 20, 2024
cd462ca
sql operations
AndreaFrancis May 21, 2024
4bd4425
Complete information
AndreaFrancis May 21, 2024
50949ca
Merge branch 'main' into duckdb-cli-integration-doc
AndreaFrancis May 22, 2024
b41960d
Merge branch 'main' into duckdb-cli-integration-doc
AndreaFrancis May 22, 2024
d131167
Apply code review suggestions
AndreaFrancis May 22, 2024
aef0ac8
Change release tag
AndreaFrancis May 22, 2024
c525899
Combine and export a result dataset
AndreaFrancis May 22, 2024
89648fc
Align sections
AndreaFrancis May 22, 2024
4565ebe
Adding vector search
AndreaFrancis May 22, 2024
8a154b9
Apply suggestions from code review
AndreaFrancis May 23, 2024
a2fd8de
Apply suggestions from code review
AndreaFrancis May 23, 2024
5a6615d
Adding ref for parquet_scan
AndreaFrancis May 23, 2024
60bd06c
Update docs/source/_toctree.yml
lhoestq May 23, 2024
18c8d14
Update docs/source/_toctree.yml
lhoestq May 23, 2024
d3a7a92
Update docs/source/_toctree.yml
lhoestq May 23, 2024
d169f48
Try to fix menu
AndreaFrancis May 23, 2024
21d62be
Remove misisng file
AndreaFrancis May 23, 2024
a31c7a1
Apply suggestions from code review
AndreaFrancis May 23, 2024
1e6592f
Adling new results
AndreaFrancis May 23, 2024
16 changes: 16 additions & 0 deletions docs/source/_toctree.yml
@@ -38,6 +38,22 @@
title: ClickHouse
- local: duckdb
title: DuckDB
sections:
- local: duckdb_cli
title: DuckDB CLI
sections:
- local: duckdb_cli_auth
title: Authentication for private and gated datasets
- local: duckdb_cli_select
title: Query datasets
- local: duckdb_cli_sql
title: Perform SQL operations
- local: duckdb_cli_combine_and_export
title: Combine datasets and export
- local: duckdb_cli_vector_similarity_search
title: Perform vector similarity search
- local: duckdb_cli_fts
title: Implement full-text search
Member

maybe it's because we don't support this depth of sections ? trying with one less level

Suggested change
- local: duckdb
title: DuckDB
sections:
- local: duckdb_cli
title: DuckDB CLI
sections:
- local: duckdb_cli_auth
title: Authentication for private and gated datasets
- local: duckdb_cli_select
title: Query datasets
- local: duckdb_cli_sql
title: Perform SQL operations
- local: duckdb_cli_combine_and_export
title: Combine datasets and export
- local: duckdb_cli_vector_similarity_search
title: Perform vector similarity search
- local: duckdb_cli_fts
title: Implement full-text search
- local: duckdb
title: DuckDB
sections:
- local: duckdb_cli
title: DuckDB CLI
- local: duckdb_cli_auth
title: Authentication for private and gated datasets
- local: duckdb_cli_select
title: Query datasets
- local: duckdb_cli_sql
title: Perform SQL operations
- local: duckdb_cli_combine_and_export
title: Combine datasets and export
- local: duckdb_cli_vector_similarity_search
title: Perform vector similarity search
- local: duckdb_cli_fts
title: Implement full-text search

Contributor Author

Added, but not sure why it is still not working...

- local: pandas
title: Pandas
- local: polars
57 changes: 57 additions & 0 deletions docs/source/duckdb_cli.md
@@ -0,0 +1,57 @@
# DuckDB CLI

The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable.

<Tip>

For installation details, visit the [installation page](https://duckdb.org/docs/installation).

</Tip>

Starting from version `v0.10.3`, the DuckDB CLI includes native support for accessing datasets on Hugging Face via URLs. Here are some features you can leverage with this powerful tool:

- Query public, gated and private datasets
- Analyze datasets and perform SQL operations
- Combine datasets and export them to different formats
- Conduct vector similarity search on embedding datasets
- Implement full-text search on datasets
- And more! For a complete list of DuckDB features, visit the DuckDB documentation.

To start the CLI, execute the following command in the installation folder:

```bash
./duckdb
```

## Forming the Hugging Face URL

To access Hugging Face datasets, use the following URL format:

```plaintext
hf://datasets/{my-username}/{my-dataset}/{path_to_parquet_file}
```

Where:
- **my-username**: the user or organization that owns the dataset, e.g. `ibm`
- **my-dataset**: the dataset name, e.g. `duorc`
- **path_to_parquet_file**: the path to the Parquet file; glob patterns are supported, e.g. `**/*.parquet` to query all Parquet files


<Tip>

You can query auto-converted Parquet files using the `@~parquet` branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the [Parquet conversion documentation](https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet).

</Tip>
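
As a rough illustration of the URL format above, the following Python helper assembles an `hf://` path from its parts. The function name `hf_dataset_url` and its parameters are hypothetical (they are not part of DuckDB or any Hugging Face library), and the optional `@revision` segment follows the `@~parquet` form mentioned in the tip.

```python
from typing import Optional


def hf_dataset_url(username: str, dataset: str, path: str,
                   revision: Optional[str] = None) -> str:
    """Assemble an hf:// URL (hypothetical helper, for illustration only)."""
    repo = f"{username}/{dataset}"
    if revision:
        # A revision is appended to the repository name with "@",
        # e.g. "~parquet" for the auto-converted Parquet branch.
        repo = f"{repo}@{revision}"
    # Strip a leading slash so the path joins cleanly.
    return f"hf://datasets/{repo}/{path.lstrip('/')}"


print(hf_dataset_url("ibm", "duorc", "ParaphraseRC/*.parquet"))
print(hf_dataset_url("ibm", "duorc", "**/*.parquet", revision="~parquet"))
```

The resulting strings can be pasted, in single quotes, into a `FROM` clause in the DuckDB CLI.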

Let's start with a quick demo that queries the first three rows of a dataset:

```sql
FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
```

Or using traditional SQL syntax:

```sql
SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
```
In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.
46 changes: 46 additions & 0 deletions docs/source/duckdb_cli_auth.md
@@ -0,0 +1,46 @@
# Authentication for private and gated datasets

To access private or gated datasets, you need to configure your Hugging Face token in the DuckDB Secrets Manager.

Visit [Hugging Face Settings - Tokens](https://huggingface.co/settings/tokens) to obtain your access token.

DuckDB supports two providers for managing secrets:

- `CONFIG`: Requires the user to pass all configuration information into the `CREATE SECRET` statement.
- `CREDENTIAL_CHAIN`: Automatically tries to fetch credentials. For the Hugging Face token, it tries to read it from `~/.cache/huggingface/token`.

For more information about DuckDB secrets, see the [Secrets Manager documentation](https://duckdb.org/docs/configuration/secrets_manager.html).

## Creating a secret with `CONFIG` provider

To create a secret using the CONFIG provider, use the following command:

```sql
CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'your_hf_token');
```

Replace `your_hf_token` with your actual Hugging Face token.
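
To avoid typing the token by hand, one option is to generate the statement from an environment variable. The sketch below is only an illustration: the helper name `create_secret_sql` is hypothetical, and it assumes the token has been exported as `HF_TOKEN`. It doubles any single quotes so the token remains a valid SQL string literal.

```python
import os


def create_secret_sql(token: str, name: str = "hf_token") -> str:
    """Return a CREATE SECRET statement for the CONFIG provider (hypothetical helper)."""
    if not token:
        raise ValueError("empty Hugging Face token")
    # Escape single quotes so the token is a valid SQL string literal.
    escaped = token.replace("'", "''")
    return f"CREATE SECRET {name} (TYPE HUGGINGFACE, TOKEN '{escaped}');"


# Assumes the token was exported as HF_TOKEN; a placeholder is used otherwise.
print(create_secret_sql(os.environ.get("HF_TOKEN") or "your_hf_token"))
```

The printed statement contains the token in clear text, so it is best pasted directly into the CLI rather than written to logs or shared files.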

## Creating a secret with `CREDENTIAL_CHAIN` provider

To create a secret using the CREDENTIAL_CHAIN provider, use the following command:

```sql
CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);
```

This command automatically retrieves the stored token from `~/.cache/huggingface/token`.

If you haven't configured your token, execute the following command in the terminal:

```bash
huggingface-cli login
```

Alternatively, you can set your Hugging Face token as an environment variable:

```bash
export HF_TOKEN="HF_XXXXXXXXXXXXX"
```

For more information on authentication, see the [Hugging Face authentication](https://huggingface.co/docs/huggingface_hub/main/en/quick-start#authentication) documentation.
105 changes: 105 additions & 0 deletions docs/source/duckdb_cli_combine_and_export.md
@@ -0,0 +1,105 @@
# Combine datasets and export

In this section, we'll combine two datasets and export the result. Let's start with our datasets:


The first will be [TheFusion21/PokemonCards](https://huggingface.co/datasets/TheFusion21/PokemonCards):

```bash
FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' LIMIT 3;
┌─────────┬──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────┬─────────────────┐
│ id │ image_url │ caption │ name │ hp │ set_name │
│ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │
├─────────┼──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────┼─────────────────┤
│ pl3-1 │ https://images.pok… │ A Basic, SP Pokemon Card of type Darkness with the title Absol G and 70 HP of rarity Rare Holo from the set Supreme Victors. It has … │ Absol G │ 70 │ Supreme Victors │
│ ex12-1 │ https://images.pok… │ A Stage 1 Pokemon Card of type Colorless with the title Aerodactyl and 70 HP of rarity Rare Holo evolved from Mysterious Fossil from … │ Aerodactyl │ 70 │ Legend Maker │
│ xy5-1 │ https://images.pok… │ A Basic Pokemon Card of type Grass with the title Weedle and 50 HP of rarity Common from the set Primal Clash and the flavor text: It… │ Weedle │ 50 │ Primal Clash │
└─────────┴──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────┴─────────────────┘
```

And the second one will be [wanghaofan/pokemon-wiki-captions](https://huggingface.co/datasets/wanghaofan/pokemon-wiki-captions):

```bash
FROM 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' LIMIT 3;

┌──────────────────────┬───────────┬──────────┬──────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ image │ name_en │ name_zh │ text_en │ text_zh │
│ struct(bytes blob,… │ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼───────────┼──────────┼──────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'bytes': \x89PNG\… │ abomasnow │ 暴雪王 │ Grass attributes,Blizzard King standing on two feet, with … │ 草属性,双脚站立的暴雪王,全身白色的绒毛,淡紫色的眼睛,几缕长条装的毛皮盖着它的嘴巴 │
│ {'bytes': \x89PNG\… │ abra │ 凯西 │ Super power attributes, the whole body is yellow, the head… │ 超能力属性,通体黄色,头部外形类似狐狸,尖尖鼻子,手和脚上都有三个指头,长尾巴末端带着一个褐色圆环 │
│ {'bytes': \x89PNG\… │ absol │ 阿勃梭鲁 │ Evil attribute, with white hair, blue-gray part without ha… │ 恶属性,有白色毛发,没毛发的部分是蓝灰色,头右边类似弓的角,红色眼睛 │
└──────────────────────┴───────────┴──────────┴──────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘

```

Now, let's combine these two datasets by joining on the Pokémon name (the lowercased `name` column of the first dataset against `name_en` in the second):

```bash
SELECT a.image_url
, a.caption AS card_caption
, a.name
, a.hp
, b.text_en as wiki_caption
FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
ON LOWER(a.name) = b.name_en
LIMIT 3;

┌──────────────────────┬──────────────────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ image_url │ card_caption │ name │ hp │ wiki_caption │
│ varchar │ varchar │ varchar │ int64 │ varchar │
├──────────────────────┼──────────────────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ https://images.pok… │ A Stage 1 Pokemon … │ Aerodactyl │ 70 │ A Pokémon with rock attributes, gray body, blue pupils, purple inner wings, two sharp claws on the wings, jagged teeth, and an arrow-like … │
│ https://images.pok… │ A Basic Pokemon Ca… │ Weedle │ 50 │ Insect-like, caterpillar-like in appearance, with a khaki-yellow body, seven pairs of pink gastropods, a pink nose, a sharp poisonous need… │
│ https://images.pok… │ A Basic Pokemon Ca… │ Caterpie │ 50 │ Insect attributes, caterpillar appearance, green back, white abdomen, Y-shaped red antennae on the head, yellow spindle-shaped tail, two p… │
└──────────────────────┴──────────────────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

```
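
To make the join condition concrete, here is a small offline sketch that runs the same `LOWER(a.name) = b.name_en` join on a few in-memory rows. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB, and the table names `cards` and `wiki` as well as the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (name TEXT, hp INTEGER)")
conn.execute("CREATE TABLE wiki (name_en TEXT, text_en TEXT)")

# Made-up sample rows: two cards share a name, one card has a suffix.
conn.executemany(
    "INSERT INTO cards VALUES (?, ?)",
    [("Weedle", 50), ("Weedle", 40), ("Absol G", 70)],
)
conn.executemany(
    "INSERT INTO wiki VALUES (?, ?)",
    [("weedle", "Insect-like"), ("absol", "Evil attribute")],
)

# Same join condition as in the DuckDB query above.
rows = conn.execute(
    """
    SELECT a.name, a.hp, b.text_en AS wiki_caption
    FROM cards a
    JOIN wiki b ON LOWER(a.name) = b.name_en
    ORDER BY a.hp
    """
).fetchall()

for row in rows:
    print(row)
```

Both `Weedle` cards match the single wiki row, so the name appears twice in the result, while `Absol G` is dropped because `absol g` does not equal `absol`. An inner join keeps only exact matches on the lowercased name, and a Pokémon with several cards yields several rows.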

We can export the result to a Parquet file using the `COPY` command:

```sql
COPY (SELECT a.image_url
, a.caption AS card_caption
, a.name
, a.hp
, b.text_en as wiki_caption
FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
ON LOWER(a.name) = b.name_en)
TO 'output.parquet' (FORMAT PARQUET);
```

Let's validate the new Parquet file:

```bash
SELECT COUNT(*) FROM 'output.parquet';

┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 9460 │
└──────────────┘

```

<Tip>

You can also export to [CSV](https://duckdb.org/docs/guides/file_formats/csv_export), [Excel](https://duckdb.org/docs/guides/file_formats/excel_export) and [JSON](https://duckdb.org/docs/guides/file_formats/json_export) formats.

</Tip>

Finally, let's push the resulting dataset to the Hub using the `datasets` library in Python:

```python
from datasets import load_dataset

dataset = load_dataset("parquet", data_files="output.parquet")
dataset.push_to_hub("asoria/duckdb_combine_demo")
```

And that's it! You've successfully combined two datasets, exported the result, and uploaded it to the Hugging Face Hub.