The `ai.load_dataset` function allows you to load datasets from Hugging Face's `datasets` library directly into your PostgreSQL database.
```sql
select ai.load_dataset('squad');

select * from squad limit 10;
```
| Name | Type | Default | Required | Description |
|------|------|---------|----------|-------------|
| name | text | - | ✔ | The name of the dataset on Hugging Face (e.g., 'squad', 'glue', etc.) |
| config_name | text | - | ✖ | The specific configuration of the dataset to load. See the Hugging Face documentation for more information. |
| split | text | - | ✖ | The split of the dataset to load (e.g., 'train', 'test', 'validation'). Defaults to all splits. |
| schema_name | text | 'public' | ✖ | The PostgreSQL schema where the table is created. |
| table_name | text | - | ✖ | The name of the table to create. If null, the dataset name is used. |
| if_table_exists | text | 'error' | ✖ | Behavior when the table already exists: 'error' (raise an error), 'append' (add rows), 'drop' (drop the table and recreate it). |
| field_types | jsonb | - | ✖ | Custom PostgreSQL data types for columns, as a JSONB dictionary from column name to type. |
| batch_size | int | 5000 | ✖ | Number of rows to insert in each batch. |
| max_batches | int | null | ✖ | Maximum number of batches to load. Null means load all. |
| kwargs | jsonb | - | ✖ | Additional arguments passed to the Hugging Face dataset loading function. |
Returns the number of rows loaded into the database (bigint).
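Most of these parameters appear in the examples below; `kwargs` does not, so here is a minimal sketch. The `revision` key is an assumption about what the underlying Hugging Face loader accepts, not something `ai.load_dataset` itself defines:

```sql
-- Sketch: forward extra options to the Hugging Face loader via kwargs.
-- The 'revision' key is assumed to pin a specific dataset version;
-- which options apply depends on the dataset and loader.
SELECT ai.load_dataset(
    name => 'squad',
    kwargs => '{"revision": "main"}'::jsonb
);
```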
The `ai.load_dataset` function loads all data in a single transaction. However, to load large datasets, it is sometimes useful to use multiple transactions. For this purpose, we provide the `ai.load_dataset_multi_txn` procedure. It is similar to `ai.load_dataset`, but it allows you to specify the number of batches between commits using the `commit_every_n_batches` parameter.
```sql
CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 10);
```
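Because each group of batches commits independently, loaded rows become visible (and survive a mid-load failure) as the load progresses. A small sketch of checking progress from a second connection, assuming the default table name:

```sql
-- Run from another session while the load is in progress; the count
-- grows as each group of batches commits.
SELECT count(*) FROM squad;
```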
- Basic usage, loading the entire 'squad' dataset:

```sql
SELECT ai.load_dataset('squad');
```

The data is loaded into a table named `squad`.
- Load a small subset of the 'squad' dataset:

```sql
SELECT ai.load_dataset('squad', batch_size => 100, max_batches => 1);
```
- Load the entire 'squad' dataset using multiple transactions:

```sql
CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 100);
```
- Load a specific configuration and split:

```sql
SELECT ai.load_dataset(
    name => 'glue',
    config_name => 'mrpc',
    split => 'train'
);
```
- Load with a custom table name and field types (a verification sketch follows these examples):

```sql
SELECT ai.load_dataset(
    name => 'glue',
    config_name => 'mrpc',
    table_name => 'mrpc',
    field_types => '{"sentence1": "text", "sentence2": "text"}'::jsonb
);
```
- Pre-create the table and load data into it (see the verification sketch below):

```sql
CREATE TABLE squad (
    id TEXT,
    title TEXT,
    context TEXT,
    question TEXT,
    answers JSONB
);

SELECT ai.load_dataset(
    name => 'squad',
    table_name => 'squad',
    if_table_exists => 'append'
);
```
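After running the examples above, you can sanity-check what was created. A hedged sketch; the column names and the JSONB shape of `answers` are assumptions based on the examples, not guarantees made by the function:

```sql
-- Confirm the column types created for the 'mrpc' table, including
-- the field_types overrides:
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'mrpc'
ORDER BY ordinal_position;

-- Peek at the loaded squad rows; this assumes the answers column
-- keeps the dataset's {"text": [...], "answer_start": [...]} shape:
SELECT question, answers->'text' AS answer_texts
FROM squad
LIMIT 5;
```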
- The function requires an active internet connection to download datasets from Hugging Face.
- Large datasets may take significant time to load depending on size and connection speed.
- The function automatically maps Hugging Face dataset types to appropriate PostgreSQL data types unless overridden by `field_types`.
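When iterating on a load, the default `if_table_exists => 'error'` refuses to overwrite an existing table. A minimal sketch using the documented 'drop' mode to recreate the table from scratch:

```sql
-- Recreate the table on each run. 'drop' discards any existing rows
-- and schema, so use it only when the table is disposable.
SELECT ai.load_dataset(
    name => 'squad',
    if_table_exists => 'drop'
);
```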