diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 9b59c61817..4672ce0960 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -36,7 +36,22 @@
       title: Overview
     - local: clickhouse
       title: ClickHouse
-    - local: duckdb
+    - isExpanded: false
+      sections:
+        - local: duckdb
+          title: General Usage
+        - local: duckdb_cli
+          title: DuckDB CLI
+        - local: duckdb_cli_auth
+          title: Authentication for private and gated datasets
+        - local: duckdb_cli_select
+          title: Query datasets
+        - local: duckdb_cli_sql
+          title: Perform SQL operations
+        - local: duckdb_cli_combine_and_export
+          title: Combine datasets and export
+        - local: duckdb_cli_vector_similarity_search
+          title: Perform vector similarity search
       title: DuckDB
     - local: pandas
       title: Pandas
diff --git a/docs/source/duckdb_cli.md b/docs/source/duckdb_cli.md
new file mode 100644
index 0000000000..1ec43419a0
--- /dev/null
+++ b/docs/source/duckdb_cli.md
@@ -0,0 +1,57 @@
+# DuckDB CLI
+
+The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable.
+
+<Tip>
+
+For installation details, visit the [installation page](https://duckdb.org/docs/installation).
+
+</Tip>
+
+Starting from version `v0.10.3`, the DuckDB CLI includes native support for accessing datasets on the Hugging Face Hub via URLs. Here are some features you can leverage with this powerful tool:
+
+- Query public datasets and your own gated and private datasets
+- Analyze datasets and perform SQL operations
+- Combine datasets and export them to different formats
+- Conduct vector similarity search on embedding datasets
+- Implement full-text search on datasets
+
+For a complete list of DuckDB features, visit the DuckDB [documentation](https://duckdb.org/docs/).
+
+To start the CLI, execute the following command in the installation folder:
+
+```bash
+./duckdb
+```
+
+## Forming the Hugging Face URL
+
+To access Hugging Face datasets, use the following URL format:
+
+```plaintext
+hf://datasets/{my-username}/{my-dataset}/{path_to_parquet_file}
+```
+
+- **my-username**, the user or organization of the dataset, e.g. `ibm`
+- **my-dataset**, the dataset name, e.g. `duorc`
+- **path_to_parquet_file**, the path to the parquet file, which supports glob patterns, e.g. `**/*.parquet` to query all parquet files
+
+<Tip>
+
+You can query auto-converted Parquet files using the `@~parquet` branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the [Parquet conversion documentation](https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet).
+
+</Tip>
+
+Let's start with a quick demo to query all the rows of a dataset:
+
+```sql
+FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
+```
+
+Or using traditional SQL syntax:
+
+```sql
+SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
+```
+
+In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.
diff --git a/docs/source/duckdb_cli_auth.md b/docs/source/duckdb_cli_auth.md
new file mode 100644
index 0000000000..32c2d37a24
--- /dev/null
+++ b/docs/source/duckdb_cli_auth.md
@@ -0,0 +1,46 @@
+# Authentication for private and gated datasets
+
+To access private or gated datasets, you need to configure your Hugging Face token in the DuckDB Secrets Manager.
+
+Visit [Hugging Face Settings - Tokens](https://huggingface.co/settings/tokens) to obtain your access token.
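+
+To see how the pieces fit together before we cover each provider, here is a minimal end-to-end sketch. The token and dataset path are placeholders, not a real token or dataset:
+
+```bash
+-- Register a token, then query a private dataset with it.
+-- Both steps are explained in detail below.
+CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'hf_XXXXXXXXXXXXX');
+SELECT * FROM 'hf://datasets/your-username/your-private-dataset/**/*.parquet' LIMIT 3;
+```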
+
+DuckDB supports two providers for managing secrets:
+
+- `CONFIG`: Requires the user to pass all configuration information into the `CREATE SECRET` statement.
+- `CREDENTIAL_CHAIN`: Automatically tries to fetch credentials. For the Hugging Face token, it will try to get it from `~/.cache/huggingface/token`.
+
+For more information about DuckDB secrets, visit the [Secrets Manager](https://duckdb.org/docs/configuration/secrets_manager.html) guide.
+
+## Creating a secret with `CONFIG` provider
+
+To create a secret using the `CONFIG` provider, use the following command:
+
+```bash
+CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'your_hf_token');
+```
+
+Replace `your_hf_token` with your actual Hugging Face token.
+
+## Creating a secret with `CREDENTIAL_CHAIN` provider
+
+To create a secret using the `CREDENTIAL_CHAIN` provider, use the following command:
+
+```bash
+CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);
+```
+
+This command automatically retrieves the stored token from `~/.cache/huggingface/token`.
+
+If you haven't configured your token, execute the following command in the terminal:
+
+```bash
+huggingface-cli login
+```
+
+Alternatively, you can set your Hugging Face token as an environment variable:
+
+```bash
+export HF_TOKEN="hf_XXXXXXXXXXXXX"
+```
+
+For more information on authentication, see the [Hugging Face authentication](https://huggingface.co/docs/huggingface_hub/main/en/quick-start#authentication) documentation.
diff --git a/docs/source/duckdb_cli_combine_and_export.md b/docs/source/duckdb_cli_combine_and_export.md
new file mode 100644
index 0000000000..c0d504b87e
--- /dev/null
+++ b/docs/source/duckdb_cli_combine_and_export.md
@@ -0,0 +1,105 @@
+# Combine datasets and export
+
+In this section, we'll combine two datasets and export the result. Let's start with our datasets:
+
+The first will be [TheFusion21/PokemonCards](https://huggingface.co/datasets/TheFusion21/PokemonCards):
+
+```bash
+FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' LIMIT 3;
+┌─────────┬──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────┬─────────────────┐
+│ id │ image_url │ caption │ name │ hp │ set_name │
+│ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │
+├─────────┼──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────┼─────────────────┤
+│ pl3-1 │ https://images.pok… │ A Basic, SP Pokemon Card of type Darkness with the title Absol G and 70 HP of rarity Rare Holo from the set Supreme Victors. It has … │ Absol G │ 70 │ Supreme Victors │
+│ ex12-1 │ https://images.pok… │ A Stage 1 Pokemon Card of type Colorless with the title Aerodactyl and 70 HP of rarity Rare Holo evolved from Mysterious Fossil from … │ Aerodactyl │ 70 │ Legend Maker │
+│ xy5-1 │ https://images.pok… │ A Basic Pokemon Card of type Grass with the title Weedle and 50 HP of rarity Common from the set Primal Clash and the flavor text: It… │ Weedle │ 50 │ Primal Clash │
+└─────────┴──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────┴─────────────────┘
+```
+
+And the second one will be [wanghaofan/pokemon-wiki-captions](https://huggingface.co/datasets/wanghaofan/pokemon-wiki-captions):
+
+```bash
+FROM 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' LIMIT 3;
+
+┌──────────────────────┬───────────┬──────────┬──────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
+│ image │ name_en │ name_zh │ text_en │ text_zh │
+│ struct(bytes blob,… │ varchar │ varchar │ varchar │ varchar │
+├──────────────────────┼───────────┼──────────┼──────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
+│ {'bytes': \x89PNG\… │ abomasnow │ 暴雪王 │ Grass attributes,Blizzard King standing on two feet, with … │ 草属性,双脚站立的暴雪王,全身白色的绒毛,淡紫色的眼睛,几缕长条装的毛皮盖着它的嘴巴 │
+│ {'bytes': \x89PNG\… │ abra │ 凯西 │ Super power attributes, the whole body is yellow, the head… │ 超能力属性,通体黄色,头部外形类似狐狸,尖尖鼻子,手和脚上都有三个指头,长尾巴末端带着一个褐色圆环 │
+│ {'bytes': \x89PNG\… │ absol │ 阿勃梭鲁 │ Evil attribute, with white hair, blue-gray part without ha… │ 恶属性,有白色毛发,没毛发的部分是蓝灰色,头右边类似弓的角,红色眼睛 │
+└──────────────────────┴───────────┴──────────┴──────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘
+
+```
+
+Now, let's try to combine these two datasets by joining on the `name` column:
+
+```bash
+SELECT a.image_url
+     , a.caption AS card_caption
+     , a.name
+     , a.hp
+     , b.text_en AS wiki_caption
+FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
+JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
+ON LOWER(a.name) = b.name_en
+LIMIT 3;
+
+┌──────────────────────┬──────────────────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
+│ image_url │ card_caption │ name │ hp │ wiki_caption │
+│ varchar │ varchar │ varchar │ int64 │ varchar │
+├──────────────────────┼──────────────────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
+│ https://images.pok… │ A Stage 1 Pokemon … │ Aerodactyl │ 70 │ A Pokémon with rock attributes, gray body, blue pupils, purple inner wings, two sharp claws on the wings, jagged teeth, and an arrow-like … │
+│ https://images.pok… │ A Basic Pokemon Ca… │ Weedle │ 50 │ Insect-like, caterpillar-like in appearance, with a khaki-yellow body, seven pairs of pink gastropods, a pink nose, a sharp poisonous need… │
+│ https://images.pok… │ A Basic Pokemon Ca… │ Caterpie │ 50 │ Insect attributes, caterpillar appearance, green back, white abdomen, Y-shaped red antennae on the head, yellow spindle-shaped tail, two p… │
+└──────────────────────┴──────────────────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+
+```
+
+We can export the result to a Parquet file using the `COPY` command:
+
+```bash
+COPY (SELECT a.image_url
+          , a.caption AS card_caption
+          , a.name
+          , a.hp
+          , b.text_en AS wiki_caption
+FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
+JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
+ON LOWER(a.name) = b.name_en)
+TO 'output.parquet' (FORMAT PARQUET);
+```
+
+Let's validate the new Parquet file:
+
+```bash
+SELECT COUNT(*) FROM 'output.parquet';
+
+┌──────────────┐
+│ count_star() │
+│    int64     │
+├──────────────┤
+│         9460 │
+└──────────────┘
+
+```
+
+<Tip>
+
+You can also export to [CSV](https://duckdb.org/docs/guides/file_formats/csv_export), [Excel](https://duckdb.org/docs/guides/file_formats/excel_export) and [JSON](https://duckdb.org/docs/guides/file_formats/json_export) formats.
+
+</Tip>
+
+Finally, let's push the resulting dataset to the Hub using the [Datasets](https://huggingface.co/docs/datasets/index) library in Python:
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("parquet", data_files="output.parquet")
+dataset.push_to_hub("asoria/duckdb_combine_demo")
+```
+
+And that's it! You've successfully combined two datasets, exported the result, and uploaded it to the Hugging Face Hub.
diff --git a/docs/source/duckdb_cli_select.md b/docs/source/duckdb_cli_select.md
new file mode 100644
index 0000000000..d126737c04
--- /dev/null
+++ b/docs/source/duckdb_cli_select.md
@@ -0,0 +1,150 @@
+# Query datasets
+
+Querying datasets is a fundamental step in data analysis. Here, we'll guide you through querying datasets using various methods.
+
+There are several [different ways](https://duckdb.org/docs/data/parquet/overview.html) to select your data.
+
+Using the `FROM` syntax:
+
+```bash
+FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' SELECT city, country, region LIMIT 3;
+
+┌────────────────┬─────────────┬───────────────┐
+│      city      │   country   │    region     │
+│    varchar     │   varchar   │    varchar    │
+├────────────────┼─────────────┼───────────────┤
+│ Kabul          │ Afghanistan │ Southern Asia │
+│ Kandahar       │ Afghanistan │ Southern Asia │
+│ Mazar-e Sharif │ Afghanistan │ Southern Asia │
+└────────────────┴─────────────┴───────────────┘
+
+```
+
+Using the `SELECT` and `FROM` syntax:
+
+```bash
+SELECT city, country, region FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' USING SAMPLE 3;
+
+┌──────────┬─────────┬────────────────┐
+│   city   │ country │     region     │
+│ varchar  │ varchar │    varchar     │
+├──────────┼─────────┼────────────────┤
+│ Wenzhou  │ China   │ Eastern Asia   │
+│ Valdez   │ Ecuador │ South America  │
+│ Aplahoue │ Benin   │ Western Africa │
+└──────────┴─────────┴────────────────┘
+
+```
+
+Count the rows in all the JSONL files matching a glob pattern:
+
+```bash
+SELECT COUNT(*) FROM 'hf://datasets/jamescalam/world-cities-geo/*.jsonl';
+
+┌──────────────┐
+│ count_star() │
+│    int64     │
+├──────────────┤
+│         9083 │
+└──────────────┘
+
+```
+
+You can also query Parquet files using the `read_parquet` and `parquet_scan` functions. Let's explore these functions using the auto-converted Parquet files from the same dataset.
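+
+The same `hf://` URL style also works with DuckDB's explicit reader functions for the other formats used above. Here is a brief sketch, assuming `read_json_auto` resolves `hf://` paths the same way the direct `FROM` queries do:
+
+```bash
+-- Explicit reader function for the JSONL file queried earlier
+SELECT city, country, region FROM read_json_auto('hf://datasets/jamescalam/world-cities-geo/train.jsonl') LIMIT 3;
+```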
+
+Select using the [`read_parquet`](https://duckdb.org/docs/guides/file_formats/query_parquet.html) function:
+
+```bash
+SELECT * FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet') LIMIT 3;
+```
+
+Read all files that match a glob pattern and include a filename column specifying which file each row came from:
+
+```bash
+SELECT * FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet', filename = true) LIMIT 3;
+```
+
+Using the [`parquet_scan`](https://duckdb.org/docs/data/parquet/overview) function:
+
+```bash
+SELECT * FROM parquet_scan('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet') LIMIT 3;
+```
+
+## Get metadata and schema
+
+The [parquet_metadata](https://duckdb.org/docs/data/parquet/metadata.html) function can be used to query the metadata contained within a Parquet file.
+
+```bash
+SELECT * FROM parquet_metadata('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet');
+
+┌───────────────────────────────────────────────────────────────────────────────┬──────────────┬────────────────────┬─────────────┐
+│ file_name │ row_group_id │ row_group_num_rows │ compression │
+│ varchar │ int64 │ int64 │ varchar │
+├───────────────────────────────────────────────────────────────────────────────┼──────────────┼────────────────────┼─────────────┤
+│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │
+│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │
+│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │
+└───────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────────┘
+
+```
+
+Fetch the column names and column types:
+
+```bash
+DESCRIBE SELECT * FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet';
+
+┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
+│ column_name │ column_type │  null   │   key   │ default │  extra  │
+│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
+├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
+│ city        │ VARCHAR     │ YES     │         │         │         │
+│ country     │ VARCHAR     │ YES     │         │         │         │
+│ region      │ VARCHAR     │ YES     │         │         │         │
+│ continent   │ VARCHAR     │ YES     │         │         │         │
+│ latitude    │ DOUBLE      │ YES     │         │         │         │
+│ longitude   │ DOUBLE      │ YES     │         │         │         │
+│ x           │ DOUBLE      │ YES     │         │         │         │
+│ y           │ DOUBLE      │ YES     │         │         │         │
+│ z           │ DOUBLE      │ YES     │         │         │         │
+└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
+
+```
+
+Fetch the internal schema (excluding the file name):
+
+```bash
+SELECT * EXCLUDE (file_name) FROM parquet_schema('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet');
+
+┌───────────┬────────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬──────────────┐
+│ name │ type │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │ logical_type │
+│ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ int64 │ int64 │ int64 │ varchar │
+├───────────┼────────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼──────────────┤
+│ schema │ │ │ REQUIRED │ 9 │ │ │ │ │ │
+│ city │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │
+│ country │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │
+│ region │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │
+│ continent │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │
+│ latitude │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │
+│ longitude │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │
+│ x │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │
+│ y │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │
+│ z │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │
+└───────────┴────────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴──────────────┘
+
+```
+
+## Get statistics
+
+The `SUMMARIZE` command can be used to get various aggregates over a query (min, max, approx_unique, avg, std, q25, q50, q75, count). It returns these statistics along with the column name, column type, and the percentage of NULL values.
+
+```bash
+SUMMARIZE SELECT latitude, longitude FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet';
+
+┌─────────────┬─────────────┬──────────────┬─────────────┬───────────────┬────────────────────┬───────────────────┬────────────────────┬───────────────────┬────────────────────┬───────┐
+│ column_name │ column_type │ min │ max │ approx_unique │ avg │ std │ q25 │ q50 │ q75 │ count │
+│ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ int64 │
+├─────────────┼─────────────┼──────────────┼─────────────┼───────────────┼────────────────────┼───────────────────┼────────────────────┼───────────────────┼────────────────────┼───────┤
+│ latitude │ DOUBLE │ -54.8 │ 67.8557214 │ 7324 │ 22.5004568364307 │ 26.77045468469093 │ 6.065424395863388 │ 29.33687520478191 │ 44.88357641321427 │ 9083 │
+│ longitude │ DOUBLE │ -175.2166595 │ 179.3833313 │ 7802 │ 14.699333721953098 │ 63.93672742608224 │ -7.077471714978484 │ 19.19758476462836 │ 43.782932169927165 │ 9083 │
+└─────────────┴─────────────┴──────────────┴─────────────┴───────────────┴────────────────────┴───────────────────┴────────────────────┴───────────────────┴────────────────────┴───────┘
+
+```
diff --git a/docs/source/duckdb_cli_sql.md b/docs/source/duckdb_cli_sql.md
new file mode 100644
index 0000000000..33714d1e49
--- /dev/null
+++ b/docs/source/duckdb_cli_sql.md
@@ -0,0 +1,159 @@
+# Perform SQL operations
+
+Performing SQL operations with DuckDB opens up a world of possibilities for querying datasets efficiently. Let's dive into some examples showcasing the power of DuckDB functions.
+
+For our demonstration, we'll explore a fascinating dataset. The [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset is a multitask test containing multiple-choice questions spanning various knowledge domains.
+
+To preview the dataset, let's select a sample of 3 rows:
+
+```bash
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;
+
+┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐
+│ question │ subject │ choices │ answer │
+│ varchar │ varchar │ varchar[] │ int64 │
+├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
+│ Dr. Harry Holliday… │ professional_psych… │ [discuss his vacation plans with his current clients ahead of time so that they know he'll be unavailable during that time., give his clients a phone … │ 2 │
+│ A resident of a st… │ professional_law │ [The resident would succeed, because the logging company's selling of the timber would entitle the resident to re-enter and terminate the grant to the… │ 2 │
+│ Moderate and frequ… │ miscellaneous │ [dispersed alluvial fan soil, heavy-textured soil, such as silty clay, light-textured soil, such as loamy sand, region of low humidity] │ 2 │
+└──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘
+
+```
+
+This command retrieves a random sample of 3 rows from the dataset for us to examine.
+
+Let's start by examining the schema of our dataset. The following table outlines its structure:
+
+```bash
+DESCRIBE FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;
+
+┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
+│ column_name │ column_type │  null   │   key   │ default │  extra  │
+│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
+├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
+│ question    │ VARCHAR     │ YES     │         │         │         │
+│ subject     │ VARCHAR     │ YES     │         │         │         │
+│ choices     │ VARCHAR[]   │ YES     │         │         │         │
+│ answer      │ BIGINT      │ YES     │         │         │         │
+└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
+
+```
+
+Next, let's check whether there are any duplicated records in our dataset (a duplicate is any record that appears more than once):
+
+```bash
+SELECT *,
+       COUNT(*) AS counts
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
+GROUP BY ALL
+HAVING counts > 1;
+
+┌──────────┬─────────┬───────────┬────────┬────────┐
+│ question │ subject │  choices  │ answer │ counts │
+│ varchar  │ varchar │ varchar[] │ int64  │ int64  │
+├──────────┴─────────┴───────────┴────────┴────────┤
+│                      0 rows                      │
+└──────────────────────────────────────────────────┘
+
+```
+
+Fortunately, our dataset doesn't contain any duplicate records.
+
+Let's see the proportion of questions per subject, rendered as a bar chart:
+
+```bash
+SELECT
+    subject,
+    COUNT(*) AS counts,
+    BAR(COUNT(*), 0, (SELECT COUNT(*) FROM 'hf://datasets/cais/mmlu/all/test-*.parquet')) AS percentage
+FROM
+    'hf://datasets/cais/mmlu/all/test-*.parquet'
+GROUP BY
+    subject
+ORDER BY
+    counts DESC;
+
+┌──────────────────────────────┬────────┬────────────────────────────────────────────────────────────────────────────────┐
+│ subject │ counts │ percentage │
+│ varchar │ int64 │ varchar │
+├──────────────────────────────┼────────┼────────────────────────────────────────────────────────────────────────────────┤
+│ professional_law │ 1534 │ ████████▋ │
+│ moral_scenarios │ 895 │ █████ │
+│ miscellaneous │ 783 │ ████▍ │
+│ professional_psychology │ 612 │ ███▍ │
+│ high_school_psychology │ 545 │ ███ │
+│ high_school_macroeconomics │ 390 │ ██▏ │
+│ elementary_mathematics │ 378 │ ██▏ │
+│ moral_disputes │ 346 │ █▉ │
+├──────────────────────────────┴────────┴────────────────────────────────────────────────────────────────────────────────┤
+│ 57 rows (8 shown)                                                                                             3 columns │
+└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+
+```
+
+Now, let's prepare a subset of the dataset containing questions related to **nutrition** and create a mapping of questions to correct answers.
+Notice that we have the column **choices**, from which we can get the correct answer using the **answer** column as an index.
+
+```bash
+SELECT *
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
+WHERE subject = 'nutrition' LIMIT 3;
+
+┌──────────────────────┬───────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐
+│ question │ subject │ choices │ answer │
+│ varchar │ varchar │ varchar[] │ int64 │
+├──────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
+│ Which foods tend t… │ nutrition │ [Meat, Confectionary, Fruits and vegetables, Potatoes] │ 2 │
+│ In which one of th… │ nutrition │ [If the incidence rate of the disease falls., If survival time with the disease increases., If recovery of the disease is faster., If the population in which the… │ 1 │
+│ Which of the follo… │ nutrition │ [The flavonoid class comprises flavonoids and isoflavonoids., The digestibility and bioavailability of isoflavones in soya food products are not changed by proce… │ 0 │
+└──────────────────────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘
+
+```
+
+```bash
+SELECT question,
+       choices[answer] AS correct_answer
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
+WHERE subject = 'nutrition' LIMIT 3;
+
+┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────┐
+│ question │ correct_answer │
+│ varchar │ varchar │
+├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────┤
+│ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?\n │ Confectionary │
+│ In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant?\n │ If the incidence rate of the disease falls. │
+│ Which of the following statements is correct?\n │ │
+└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────┘
+
+```
+
+To ensure data cleanliness, let's remove any newline characters at the end of the questions and filter out any empty answers:
+
+```bash
+SELECT regexp_replace(question, '\n', '') AS question,
+       choices[answer] AS correct_answer
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
+WHERE subject = 'nutrition' AND LENGTH(correct_answer) > 0 LIMIT 3;
+
+┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────┐
+│ question │ correct_answer │
+│ varchar │ varchar │
+├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────┤
+│ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)? │ Confectionary │
+│ In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant? │ If the incidence rate of the disease falls. │
+│ Which vitamin is a major lipid-soluble antioxidant in cell membranes? │ Vitamin D │
+└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────┘
+
+```
+
+Finally, let's highlight some of the DuckDB functions used in this section:
+
+- `DESCRIBE`, returns the table schema.
+- `USING SAMPLE`, randomly selects a subset of rows from a dataset.
+- `BAR`, draws a band whose width is proportional to (x - min) and equal to width characters when x = max. Width defaults to 80.
+- `string[begin:end]`, extracts a string using slice conventions. Missing begin or end arguments are interpreted as the beginning or end of the list, respectively. Negative values are accepted.
+- `regexp_replace`, if the string contains the regexp pattern, replaces the matching part with the replacement.
+- `LENGTH`, gets the number of characters in the string.
+
+<Tip>
+
+There are plenty of useful functions available in DuckDB's [SQL functions overview](https://duckdb.org/docs/sql/functions/overview). The best part is that you can use them directly on Hugging Face datasets.
+
+</Tip>
diff --git a/docs/source/duckdb_cli_vector_similarity_search.md b/docs/source/duckdb_cli_vector_similarity_search.md
new file mode 100644
index 0000000000..ef6aed3907
--- /dev/null
+++ b/docs/source/duckdb_cli_vector_similarity_search.md
@@ -0,0 +1,63 @@
+# Perform vector similarity search
+
+The Fixed-Length Arrays feature was added in DuckDB version 0.10.0. This feature lets you store vector embeddings in DuckDB tables, making your data analysis even more powerful.
+
+Additionally, the `array_cosine_similarity` function was introduced. This function measures the cosine of the angle between two vectors, indicating their similarity. A value of 1 means they're perfectly aligned, 0 means they're perpendicular, and -1 means they're completely opposite.
+
+In this section, we'll show you how to use this function to perform similarity searches with DuckDB.
+
+We will use the [asoria/awesome-chatgpt-prompts-embeddings](https://huggingface.co/datasets/asoria/awesome-chatgpt-prompts-embeddings) dataset.
+
+First, let's preview a few records from the dataset:
+
+```bash
+FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' SELECT act, prompt, len(embedding) AS embed_len LIMIT 3;
+
+┌──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐
+│ act │ prompt │ embed_len │
+│ varchar │ varchar │ int64 │
+├──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────┤
+│ Linux Terminal │ I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output insid… │ 384 │
+│ English Translator… │ I want you to act as an English translator, spelling corrector and improver. I will speak to you in any language and you will detect the language, translate it and answer… │ 384 │
+│ `position` Intervi… │ I want you to act as an interviewer. I will be the candidate and you will ask me the interview questions for the `position` position. I want you to only reply as the inte… │ 384 │
+└──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┘
+
+```
+
+Next, let's choose an embedding to use for the similarity search:
+
+```bash
+FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' SELECT embedding WHERE act = 'Linux Terminal';
+
+┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
+│ embedding │
+│ float[] │
+├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
+│ [-0.020781303, -0.029143505, -0.0660217, -0.00932716, -0.02601602, -0.011426172, 0.06627567, 0.11941507, 0.0013917526, 0.012889079, 0.053234346, -0.07380514, 0.04871567, -0.043601237, -0.0025319182, 0.0448… │
+└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+
+```
+
+Now, let's use the selected embedding to find similar records:
+
+```bash
+SELECT act,
+       prompt,
+       array_cosine_similarity(embedding::float[384], (SELECT embedding FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' WHERE act = 'Linux Terminal')::float[384]) AS similarity
+FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
+ORDER BY similarity DESC
+LIMIT 3;
+
+┌──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┐
+│ act │ prompt │ similarity │
+│ varchar │ varchar │ float │
+├──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤
+│ Linux Terminal │ I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output insi… │ 1.0 │
+│ JavaScript Console │ I want you to act as a javascript console. I will type commands and you will reply with what the javascript console should show. I want you to only reply with the termin… │ 0.7599728 │
+│ R programming Inte… │ I want you to act as a R interpreter. I'll type commands and you'll reply with what the terminal should show. I want you to only reply with the terminal output inside on… │ 0.7303775 │
+└──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┘
+
+```
+
+That's it! You have successfully performed a vector similarity search using DuckDB.
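+
+As a final variation, you can factor the probe embedding into a CTE so that the dataset path for the probe is written only once. This is a minimal sketch that should be equivalent to the query above:
+
+```bash
+-- Same similarity search, with the probe embedding in a CTE
+WITH probe AS (
+  SELECT embedding
+  FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
+  WHERE act = 'Linux Terminal'
+)
+SELECT act,
+       array_cosine_similarity(embedding::float[384], (SELECT embedding FROM probe)::float[384]) AS similarity
+FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
+ORDER BY similarity DESC
+LIMIT 3;
+```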