Load dataset from Hugging Face

The ai.load_dataset function allows you to load datasets from Hugging Face's datasets library directly into your PostgreSQL database.

Example Usage

select ai.load_dataset('squad');

select * from squad limit 10;

Parameters

| Name | Type | Default | Required | Description |
|------|------|---------|----------|-------------|
| name | text | - | ✔ | The name of the dataset on Hugging Face (e.g., 'squad', 'glue') |
| config_name | text | - | ✖ | The specific configuration of the dataset to load. See the Hugging Face documentation for more information. |
| split | text | - | ✖ | The split of the dataset to load (e.g., 'train', 'test', 'validation'). Defaults to all splits. |
| schema_name | text | 'public' | ✖ | The PostgreSQL schema where the table will be created |
| table_name | text | - | ✖ | The name of the table to create. If null, the dataset name is used. |
| if_table_exists | text | 'error' | ✖ | Behavior when the table already exists: 'error' (raise an error), 'append' (add rows), 'drop' (drop the table and recreate it) |
| field_types | jsonb | - | ✖ | Custom PostgreSQL data types for columns, as a JSONB dictionary mapping column names to types |
| batch_size | int | 5000 | ✖ | Number of rows to insert in each batch |
| max_batches | int | null | ✖ | Maximum number of batches to load. Null means load all. |
| kwargs | jsonb | - | ✖ | Additional arguments passed to the Hugging Face dataset loading function |
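
The kwargs parameter is not covered in the examples below; as a rough sketch, options accepted by Hugging Face's dataset loader can be forwarded through it as JSONB (trust_remote_code is one such loader option, shown here purely as an illustration):

SELECT ai.load_dataset(
    name => 'squad',
    kwargs => '{"trust_remote_code": true}'::jsonb
);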

Returns

Returns the number of rows loaded into the database (bigint).
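
Because the return value is a bigint, it can be aliased and inspected directly (an illustrative query; the count depends on the dataset and parameters):

SELECT ai.load_dataset('squad', max_batches => 1) AS rows_loaded;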

Using Multiple Transactions

The ai.load_dataset function loads all data in a single transaction. However, when loading large datasets, it is sometimes useful to spread the work across multiple transactions. For this purpose, we provide the ai.load_dataset_multi_txn procedure. It is similar to ai.load_dataset, but it lets you specify the number of batches between commits using the commit_every_n_batches parameter.

CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 10);

Examples

  1. Basic usage - load the entire 'squad' dataset:

     SELECT ai.load_dataset('squad');

     The data is loaded into a table named squad.

  2. Load a small subset of the 'squad' dataset:

     SELECT ai.load_dataset('squad', batch_size => 100, max_batches => 1);

  3. Load the entire 'squad' dataset using multiple transactions:

     CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 100);

  4. Load a specific configuration and split:

     SELECT ai.load_dataset(
         name => 'glue',
         config_name => 'mrpc',
         split => 'train'
     );

  5. Load with a custom table name and field types:

     SELECT ai.load_dataset(
         name => 'glue',
         config_name => 'mrpc',
         table_name => 'mrpc',
         field_types => '{"sentence1": "text", "sentence2": "text"}'::jsonb
     );

  6. Pre-create the table and load data into it:

     CREATE TABLE squad (
         id          TEXT,
         title       TEXT,
         context     TEXT,
         question    TEXT,
         answers     JSONB
     );

     SELECT ai.load_dataset(
         name => 'squad',
         table_name => 'squad',
         if_table_exists => 'append'
     );
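
  7. Load into a different schema and replace any existing table (an illustrative combination of the schema_name and if_table_exists parameters described above; the analytics schema is assumed to exist already):

     SELECT ai.load_dataset(
         name => 'squad',
         schema_name => 'analytics',
         if_table_exists => 'drop'
     );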

Notes

  • The function requires an active internet connection to download datasets from Hugging Face.
  • Large datasets may take significant time to load depending on size and connection speed.
  • The function automatically maps Hugging Face dataset types to appropriate PostgreSQL data types unless overridden by field_types.
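
To check how column types were mapped after a load, you can query the standard PostgreSQL catalog (plain information_schema SQL, not part of the pgai API):

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'squad';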