Add PostgreSQL as a possible viewer (#3121)
* Add PostgreSQL as a possible viewer

Add documentation for how to use PostgreSQL with pgai to access the dataset.

* Improve documentation for PostgreSQL viewer
cevian authored Dec 18, 2024
1 parent 04f9b1e commit 5e9371a
Showing 4 changed files with 72 additions and 1 deletion.
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -44,6 +44,8 @@
title: Pandas
- local: polars
title: Polars
- local: postgresql
title: PostgreSQL
- local: mlcroissant
title: mlcroissant
- local: pyspark
1 change: 1 addition & 0 deletions docs/source/parquet_process.md
@@ -11,5 +11,6 @@ There are several different libraries you can use to work with the published Parquet
- [DuckDB](https://duckdb.org/docs/), a high-performance SQL database for analytical queries
- [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
- [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library
- [PostgreSQL via pgai](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md), a powerful, open source object-relational database system
- [mlcroissant](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant), a library for loading datasets from Croissant metadata
- [pyspark](https://spark.apache.org/docs/latest/api/python), the Python API for Apache Spark
68 changes: 68 additions & 0 deletions docs/source/postgresql.md
@@ -0,0 +1,68 @@
# PostgreSQL

[PostgreSQL](https://www.postgresql.org/docs/) is a powerful, open source object-relational database system. It has been the most [popular](https://survey.stackoverflow.co/2024/technology#most-popular-technologies-database) database among application developers for several years running. [pgai](https://github.com/timescale/pgai) is a PostgreSQL extension that makes it easy to ingest Hugging Face datasets into your PostgreSQL database.

## Run PostgreSQL with pgai installed

You can run a Docker container that includes PostgreSQL with pgai:

```bash
docker run -d --name pgai -p 5432:5432 \
-v pg-data:/home/postgres/pgdata/data \
-e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg17
```

Then run the following command to install pgai into the database.

```bash
docker exec -it pgai psql -c "CREATE EXTENSION ai CASCADE;"
```

You can then connect to the database using the `psql` command line tool in the container.

```bash
docker exec -it pgai psql
```

Or connect with your favorite PostgreSQL client using the following connection string: `postgresql://postgres:password@localhost:5432/postgres`

Alternatively, you can install pgai into an existing PostgreSQL database by following the instructions in the [GitHub repo](https://github.com/timescale/pgai).

## Create a table from a dataset

To load a dataset into PostgreSQL, use the `ai.load_dataset` function. It creates a PostgreSQL table and loads the dataset from the Hugging Face Hub in a streaming fashion.

```sql
select ai.load_dataset('rajpurkar/squad', table_name => 'squad');
```

You can now query the table using standard SQL.

```sql
select * from squad limit 10;
```

<Tip>
Full documentation for the `ai.load_dataset` function can be found [here](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md).
</Tip>

## Import only a subset of the dataset

You can also import a subset of the dataset by limiting the number of batches loaded with the `max_batches` parameter (the number of rows per batch is controlled by `batch_size`).
This is useful if the dataset is large and you want to experiment with a smaller subset.

```sql
SELECT ai.load_dataset('rajpurkar/squad', table_name => 'squad', batch_size => 100, max_batches => 1);
```
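With `batch_size => 100` and `max_batches => 1`, at most 100 rows are loaded. As a quick sanity check (a sketch, assuming the `squad` table name from the example above), you can count the loaded rows:

```sql
-- With batch_size => 100 and max_batches => 1,
-- this should return at most 100.
select count(*) from squad;
```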

## Load a dataset into an existing table

You can also load a dataset into an existing table.
This is useful if you want more control over the data schema or want to predefine indexes and constraints on the data.

```sql
select ai.load_dataset('rajpurkar/squad', table_name => 'squad', if_table_exists => 'append');
```
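For example, you could predefine the table and an index before loading. This is only a sketch: the column names and types below are assumptions based on the SQuAD dataset's `id`/`title`/`context`/`question`/`answers` fields, not the exact schema pgai would generate.

```sql
-- Hypothetical schema: column names/types assume the SQuAD dataset's
-- fields; adjust to match how pgai maps the dataset's features.
create table squad (
    id       text primary key,
    title    text,
    context  text,
    question text,
    answers  jsonb  -- nested answer struct stored as JSON
);

-- Predefine an index before loading the data.
create index on squad (title);

-- Append the dataset into the existing table.
select ai.load_dataset('rajpurkar/squad', table_name => 'squad', if_table_exists => 'append');
```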
@@ -24,7 +24,7 @@
- fast data retrieval and filtering,
- efficient storage.
**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).
**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas, PostgreSQL, or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).
You can learn more about the advantages associated with Parquet in the [documentation](https://huggingface.co/docs/dataset-viewer/parquet).
