Add PostgreSQL as a possible viewer (#3121)

* Add PostgreSQL as a possible viewer. Add documentation for how to use PostgreSQL with pgai to access the dataset.
* Improve documentation for PostgreSQL viewer
huggingface · Dec 18, 2024 · 5e9371a
1 parent 04f9b1e

Showing 4 changed files with 72 additions and 1 deletion.
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -44,6 +44,8 @@
           title: Pandas
         - local: polars
           title: Polars
+        - local: postgresql
+          title: PostgreSQL
         - local: mlcroissant
           title: mlcroissant
         - local: pyspark

diff --git a/docs/source/parquet_process.md b/docs/source/parquet_process.md
@@ -11,5 +11,6 @@ There are several different libraries you can use to work with the published Par
 - [DuckDB](https://duckdb.org/docs/), a high-performance SQL database for analytical queries
 - [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
 - [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library
+- [PostgreSQL via pgai](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md), a powerful, open source object-relational database system
 - [mlcroissant](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant), a library for loading datasets from Croissant metadata
 - [pyspark](https://spark.apache.org/docs/latest/api/python), the Python API for Apache Spark
diff --git a/docs/source/postgresql.md b/docs/source/postgresql.md
@@ -0,0 +1,68 @@
+# PostgreSQL
+
+[PostgreSQL](https://www.postgresql.org/docs/) is a powerful, open source object-relational database system. It has been the most [popular](https://survey.stackoverflow.co/2024/technology#most-popular-technologies-database) database among application developers for a few years running. [pgai](https://github.com/timescale/pgai) is a PostgreSQL extension that lets you ingest Hugging Face datasets into your PostgreSQL database.
+
+
+## Run PostgreSQL with pgai installed
+
+You can run PostgreSQL with pgai in a Docker container.
+
+```bash
+docker run -d --name pgai -p 5432:5432 \
+-v pg-data:/home/postgres/pgdata/data \
+-e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg17
+```
+
+Then run the following command to install pgai into the database.
+
+```bash
+docker exec -it pgai psql -c "CREATE EXTENSION ai CASCADE;"
+```
+
+You can then connect to the database using the `psql` command line tool in the container.
+
+```bash
+docker exec -it pgai psql
+```
+
+or with your favorite PostgreSQL client, using the following connection string:
+`postgresql://postgres:password@localhost:5432/postgres`
+
+Alternatively, you can install pgai into an existing PostgreSQL database by following the instructions in the [GitHub repo](https://github.com/timescale/pgai).
+
+## Create a table from a dataset
+
+To load a dataset into PostgreSQL, you can use the `ai.load_dataset` function. This function creates a PostgreSQL table and loads the dataset from the Hugging Face Hub
+in a streaming fashion.
+
+```sql
+select ai.load_dataset('rajpurkar/squad', table_name => 'squad');
+```
+
+You can now query the table using standard SQL.
+
+```sql
+select * from squad limit 10;
+```
+
+<Tip>
+Full documentation for the `ai.load_dataset` function can be found [here](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md).
+</Tip>
+
+## Import only a subset of the dataset
+
+You can also import a subset of the dataset by specifying the `max_batches` parameter.
+This is useful if the dataset is large and you want to experiment with a smaller subset.
+
+```sql
+SELECT ai.load_dataset('rajpurkar/squad', table_name => 'squad', batch_size => 100, max_batches => 1);
+```
+
+## Load a dataset into an existing table
+
+You can also load a dataset into an existing table by setting the `if_table_exists` parameter to `'append'`.
+This is useful if you want more control over the data schema or want to predefine indexes and constraints on the data.
+
+```sql
+select ai.load_dataset('rajpurkar/squad', table_name => 'squad', if_table_exists => 'append');
+```
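
The `ai.load_dataset` calls documented above all share one shape: a dataset name followed by named arguments. As a minimal sketch, and not part of this commit, the helper below assembles such a call in Python. The function name `build_load_dataset_sql` is hypothetical, and the `%s` placeholders assume a driver that uses psycopg-style parameter binding; only the option names that appear in the documentation above (`table_name`, `batch_size`, `max_batches`, `if_table_exists`) are taken from the source.

```python
def build_load_dataset_sql(dataset, table_name, **options):
    """Return (sql, params) for an ai.load_dataset call in named notation.

    Hypothetical helper for illustration only. Option names are written
    into the SQL text, so they are checked to be plain identifiers;
    option values are passed separately as bound parameters.
    """
    args = {"table_name": table_name, **options}
    for name in args:
        if not name.isidentifier():
            raise ValueError(f"invalid option name: {name!r}")
    # Named notation: "name => value", with a %s placeholder per value.
    named = ", ".join(f"{name} => %s" for name in args)
    sql = f"SELECT ai.load_dataset(%s, {named});"
    params = [dataset, *args.values()]
    return sql, params


# Mirrors the subset example: one batch of 100 rows.
sql, params = build_load_dataset_sql(
    "rajpurkar/squad", "squad", batch_size=100, max_batches=1
)
print(sql)
print(params)
```

With a driver such as psycopg, the resulting pair would typically be passed to a cursor's `execute(sql, params)` against the connection string given earlier; that step is left out here because it needs a running database.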
diff --git a/jobs/cache_maintenance/src/cache_maintenance/discussions.py b/jobs/cache_maintenance/src/cache_maintenance/discussions.py
@@ -24,7 +24,7 @@
 - fast data retrieval and filtering,
 - efficient storage.
 
-**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).
+**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas, PostgreSQL, or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).
 
 You can learn more about the advantages associated with Parquet in the [documentation](https://huggingface.co/docs/dataset-viewer/parquet).