Add PostgreSQL as a possible viewer #3121

Merged
merged 2 commits into from
Dec 18, 2024
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -44,6 +44,8 @@
  title: Pandas
- local: polars
  title: Polars
- local: postgresql
  title: PostgreSQL
Review comment (Member): Can you also add the link inside `parquet_process.md`?

- local: mlcroissant
  title: mlcroissant
- local: pyspark
53 changes: 53 additions & 0 deletions docs/source/postgresql.md
@@ -0,0 +1,53 @@
# PostgreSQL

[PostgreSQL](https://www.postgresql.org/docs/) is a powerful, open source object-relational database system. It has been the most [popular](https://survey.stackoverflow.co/2024/technology#most-popular-technologies-database) database among application developers for a few years running. [pgai](https://github.com/timescale/pgai) is a PostgreSQL extension that lets you easily ingest Hugging Face datasets into your PostgreSQL database.


## Run PostgreSQL with pgai installed

You can run PostgreSQL with pgai installed in a Docker container.

```bash
docker run -d --name pgai -p 5432:5432 \
-v pg-data:/home/postgres/pgdata/data \
-e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg17
```
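
Depending on the image, the extension may still need to be enabled in the database you connect to. A minimal sketch, assuming the extension is registered under the name `ai` (as in the pgai repository) and that you run it from a `psql` session connected to the container:

```sql
-- Assumption: the extension is named "ai"; cascade also installs any
-- extensions it depends on.
create extension if not exists ai cascade;

-- Confirm that the extension is present and see which version is installed.
select extname, extversion from pg_extension where extname = 'ai';
```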

Alternatively, you can install pgai into an existing PostgreSQL database by following the instructions in the [GitHub repo](https://github.com/timescale/pgai).

## Create a table from a dataset

To load a dataset into PostgreSQL, use the `ai.load_dataset` function. It creates a PostgreSQL table and loads the dataset from the Hugging Face Hub
in a streaming fashion.

```sql
select ai.load_dataset('squad');
```

You can now query the table using standard SQL.

```sql
select * from squad limit 10;
```
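
To see what `ai.load_dataset` created, you can inspect the table through the standard `information_schema` views. A short sketch, assuming the table was created with the default name `squad` in a schema on your search path:

```sql
-- List the columns and PostgreSQL types of the generated table.
select column_name, data_type
from information_schema.columns
where table_name = 'squad'
order by ordinal_position;

-- Count the rows that were loaded.
select count(*) from squad;
```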

<Tip>
Full documentation for the `ai.load_dataset` function can be found [here](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md).
</Tip>

## Import only a subset of the dataset

You can also import a subset of the dataset by specifying the `max_batches` parameter.
This is useful if the dataset is large and you want to experiment with a smaller subset.

```sql
select ai.load_dataset('squad', batch_size => 100, max_batches => 1);
```
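
Assuming `batch_size` is the number of rows per batch and `max_batches` caps how many batches are fetched, as the parameter names suggest (the pgai documentation is authoritative here), the call above should load at most 100 rows. The sketch below writes the sample to a separate table, given the hypothetical name `squad_sample`, so that it does not collide with a `squad` table created earlier:

```sql
-- "squad_sample" is an illustrative table name, not one pgai requires.
select ai.load_dataset('squad', table_name => 'squad_sample', batch_size => 100, max_batches => 1);

-- Check how many rows the sample contains.
select count(*) from squad_sample;
```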

## Load a dataset into an existing table

You can also load a dataset into an existing table.
This is useful if you want more control over the data schema or want to predefine indexes and constraints on the data.

```sql
select ai.load_dataset('squad', table_name => 'squad', if_table_exists => 'append');
```
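
As a sketch of that workflow, the statements below predefine a table with a primary key and an index before appending the dataset to it. The column names follow the SQuAD dataset on the Hub; the column types (in particular `jsonb` for the nested `answers` field) and the table name `squad_custom` are assumptions, so compare them with the table pgai creates by default before relying on them:

```sql
-- Assumed schema: adjust names and types to match what ai.load_dataset expects.
create table squad_custom (
    id       text primary key,
    title    text,
    context  text,
    question text,
    answers  jsonb
);

-- Index defined up front so it is maintained as rows are appended.
create index squad_custom_title_idx on squad_custom (title);

select ai.load_dataset('squad', table_name => 'squad_custom', if_table_exists => 'append');
```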
@@ -24,7 +24,7 @@
- fast data retrieval and filtering,
- efficient storage.

**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).
**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas, PostgreSQL, or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).

You can learn more about the advantages associated with Parquet in the [documentation](https://huggingface.co/docs/dataset-viewer/parquet).
