Add PostgreSQL as a possible viewer #3121

Merged 2 commits on Dec 18, 2024
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -44,6 +44,8 @@
title: Pandas
- local: polars
title: Polars
- local: postgresql
title: PostgreSQL
Member commented:
can you also add the link inside parquet_process.md?

- local: mlcroissant
title: mlcroissant
- local: pyspark
1 change: 1 addition & 0 deletions docs/source/parquet_process.md
@@ -11,5 +11,6 @@ There are several different libraries you can use to work with the published Parquet files:
- [DuckDB](https://duckdb.org/docs/), a high-performance SQL database for analytical queries
- [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
- [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library
- [PostgreSQL via pgai](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md), a powerful, open source object-relational database system
- [mlcroissant](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant), a library for loading datasets from Croissant metadata
- [pyspark](https://spark.apache.org/docs/latest/api/python), the Python API for Apache Spark
68 changes: 68 additions & 0 deletions docs/source/postgresql.md
@@ -0,0 +1,68 @@
# PostgreSQL

[PostgreSQL](https://www.postgresql.org/docs/) is a powerful, open source object-relational database system. It has been developers' most [popular](https://survey.stackoverflow.co/2024/technology#most-popular-technologies-database) database for several years running. [pgai](https://github.com/timescale/pgai) is a PostgreSQL extension that makes it easy to ingest Hugging Face datasets into your PostgreSQL database.


## Run PostgreSQL with pgai installed

You can easily run a Docker container containing PostgreSQL with pgai:

```bash
docker run -d --name pgai -p 5432:5432 \
-v pg-data:/home/postgres/pgdata/data \
-e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg17
```

Then run the following command to install pgai into the database.

```bash
docker exec -it pgai psql -c "CREATE EXTENSION ai CASCADE;"
```

You can then connect to the database using the `psql` command line tool in the container.

```bash
docker exec -it pgai psql
```

Or connect with your favorite PostgreSQL client using the following connection string: `postgresql://postgres:password@localhost:5432/postgres`
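
For example, assuming `psql` is also installed on your host machine, you can connect from outside the container by passing that connection string directly:

```bash
# connect to the containerized database from the host
psql "postgresql://postgres:password@localhost:5432/postgres"
```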

Alternatively, you can install pgai into an existing PostgreSQL database by following the instructions in the [GitHub repo](https://github.com/timescale/pgai).

## Create a table from a dataset

To load a dataset into PostgreSQL, use the `ai.load_dataset` function. It creates a PostgreSQL table and loads the dataset from the Hugging Face Hub in a streaming fashion.

```sql
select ai.load_dataset('rajpurkar/squad', table_name => 'squad');
```

You can now query the table using standard SQL.

```sql
select * from squad limit 10;
```
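
Ordinary SQL aggregations work too. As a quick sketch, assuming the loaded table includes SQuAD's `title` column (the Wikipedia article each question comes from — column names depend on the dataset's schema), you can count questions per article:

```sql
-- count rows per article title (assumes the dataset's `title` column)
select title, count(*) as num_questions
from squad
group by title
order by num_questions desc
limit 5;
```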

<Tip>
Full documentation for the `ai.load_dataset` function can be found [here](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md).
</Tip>

## Import only a subset of the dataset

You can also import a subset of the dataset by specifying the `max_batches` parameter.
This is useful if the dataset is large and you want to experiment with a smaller subset.

```sql
select ai.load_dataset('rajpurkar/squad', table_name => 'squad', batch_size => 100, max_batches => 1);
```

## Load a dataset into an existing table

You can also load a dataset into an existing table.
This is useful if you want more control over the data schema or want to predefine indexes and constraints on the data.

```sql
select ai.load_dataset('rajpurkar/squad', table_name => 'squad', if_table_exists => 'append');
```
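
As a sketch of that workflow — assuming SQuAD's column names and that pgai maps the nested `answers` field to `jsonb` (verify against the schema pgai generates before relying on it) — you might predefine the table and an index, then append:

```sql
-- hypothetical schema; actual column types depend on how pgai maps the dataset's features
create table squad (
    id       text,
    title    text,
    context  text,
    question text,
    answers  jsonb
);
create index squad_title_idx on squad (title);

-- then append the dataset rows into the existing table
select ai.load_dataset('rajpurkar/squad', table_name => 'squad', if_table_exists => 'append');
```
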
@@ -24,7 +24,7 @@
- fast data retrieval and filtering,
- efficient storage.
**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).
**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas, PostgreSQL, or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).
You can learn more about the advantages associated with Parquet in the [documentation](https://huggingface.co/docs/dataset-viewer/parquet).