Merge branch 'main' into datetime-stats
polinaeterna committed Dec 20, 2024
2 parents 945dff0 + 0d342b8 commit d91d365
Showing 66 changed files with 927 additions and 425 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/cd.yml
@@ -136,5 +136,5 @@ jobs:
repo: huggingface/infra-deployments
wait-for-completion: true
ref: refs/heads/main
token: ${{ secrets.ARGO_CD_TOKEN }}
token: ${{ secrets.GIT_TOKEN_INFRA_DEPLOYMENT }}
inputs: '{"path": "datasets-server/*.yaml", "values": ${{ env.VALUES }}, "url": "${{ github.event.head_commit.url }}"}'
4 changes: 2 additions & 2 deletions chart/env/prod.yaml
@@ -277,7 +277,7 @@ admin:
# Number of reports in /cache-reports-with-content/... endpoints
cacheReportsWithContentNumResults: 100
# the timeout in seconds for the requests to the Hugging Face Hub.
hfTimeoutSeconds: "1.5"
hfTimeoutSeconds: "10"
# Number of uvicorn workers for running the application
# (2 x $num_cores) + 1
# https://docs.gunicorn.org/en/stable/design.html#how-many-workers
@@ -306,7 +306,7 @@ admin:
memory: "8Gi"

hf:
timeoutSeconds: "1.5"
timeoutSeconds: "10"

api:
# Number of uvicorn workers for running the application
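For context, the worker-count comment in the hunk above follows the gunicorn rule of thumb. A minimal illustrative sketch (not part of this diff), assuming `os.cpu_count()` reflects the cores available to the container:

```python
import os

# Illustrative only: gunicorn's "(2 x $num_cores) + 1" rule of thumb,
# as referenced in the chart comments above.
def suggested_workers(num_cores: int | None = None) -> int:
    cores = num_cores or os.cpu_count() or 1
    return 2 * cores + 1

print(suggested_workers(4))  # 4 cores -> 9 workers
```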
2 changes: 1 addition & 1 deletion chart/values.yaml
@@ -228,7 +228,7 @@ optInOutUrlsScan:

configNames:
# the max number of configs per dataset
maxNumber: 3_000
maxNumber: 4_000

s3:
regionName: "us-east-1"
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -44,6 +44,8 @@
title: Pandas
- local: polars
title: Polars
- local: postgresql
title: PostgreSQL
- local: mlcroissant
title: mlcroissant
- local: pyspark
42 changes: 36 additions & 6 deletions docs/source/analyze_data.md
@@ -8,25 +8,55 @@ To demonstrate, this guide will show you an end-to-end example of how to retriev

## Get a dataset

The [Hub](https://huggingface.co/datasets) is home to more than 100,000 datasets across a wide variety of tasks, sizes, and languages. For this example, you'll use the [`codeparrot/codecomplex`](https://huggingface.co/datasets/codeparrot/codecomplex) dataset, but feel free to explore and find another dataset that interests you! The dataset contains Java code from programming competitions, and the time complexity of the code is labeled by a group of algorithm experts.
The [Hub](https://huggingface.co/datasets) is home to more than 200,000 datasets across a wide variety of tasks, sizes, and languages. For this example, you'll use the [`codeparrot/codecomplex`](https://huggingface.co/datasets/codeparrot/codecomplex) dataset, but feel free to explore and find another dataset that interests you! The dataset contains Java code from programming competitions, and the time complexity of the code is labeled by a group of algorithm experts.

Let's say you're interested in the average length of the submitted code as it relates to the time complexity. Here's how you can get started.

Use the `/parquet` endpoint to convert the dataset to a Parquet file and return the URL to it:

```py
<inferencesnippet>
<python>
```python
import requests
API_URL = "https://datasets-server.huggingface.co/parquet?dataset=codeparrot/codecomplex"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()
print(data)
{'parquet_files':
```
</python>
<js>
```js
import fetch from "node-fetch";
async function query(data) {
    const response = await fetch(
        "https://datasets-server.huggingface.co/parquet?dataset=codeparrot/codecomplex",
        {
            method: "GET"
        }
    );
    const result = await response.json();
    return result;
}
query().then((response) => {
    console.log(JSON.stringify(response));
});
```
</js>
<curl>
```curl
curl https://datasets-server.huggingface.co/parquet?dataset=codeparrot/codecomplex \
-X GET
```
</curl>
</inferencesnippet>

```json
{"parquet_files":
[
{'dataset': 'codeparrot/codecomplex', 'config': 'default', 'split': 'train', 'url': 'https://huggingface.co/datasets/codeparrot/codecomplex/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet', 'filename': '0000.parquet', 'size': 4115908}
{"dataset": "codeparrot/codecomplex", "config": "default", "split": "train", "url": "https://huggingface.co/datasets/codeparrot/codecomplex/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet", "filename": "0000.parquet", "size": 4115908}
],
'pending': [], 'failed': [], 'partial: false
"pending": [], "failed": [], "partial": false
}
```
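From here, one way to continue the analysis (not shown in this excerpt) is to load the returned Parquet file and relate code length to the labeled complexity. A minimal pandas sketch, assuming the dataset's columns are named `src` and `complexity`:

```python
import pandas as pd

# Hypothetical follow-up: read the Parquet URL returned by the /parquet endpoint.
# The column names "src" (the Java code) and "complexity" (the label) are assumptions.
url = "https://huggingface.co/datasets/codeparrot/codecomplex/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet"

df = pd.read_parquet(url)
avg_length = (
    df.assign(code_length=df["src"].str.len())
      .groupby("complexity")["code_length"]
      .mean()
      .sort_values(ascending=False)
)
print(avg_length)
```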

33 changes: 17 additions & 16 deletions docs/source/clickhouse.md
@@ -97,17 +97,17 @@ Remember to set `enable_url_encoding` to 0 and `max_http_get_redirects` to 1 to
SET max_http_get_redirects = 1, enable_url_encoding = 0
```

Let's create a function to return a list of Parquet files from the [`barilan/blog_authorship_corpus`](https://huggingface.co/datasets/barilan/blog_authorship_corpus):
Let's create a function to return a list of Parquet files from the [`tasksource/blog_authorship_corpus`](https://huggingface.co/datasets/tasksource/blog_authorship_corpus):

```bash
CREATE OR REPLACE FUNCTION hugging_paths AS dataset -> (
SELECT arrayMap(x -> (x.1), JSONExtract(json, 'parquet_files', 'Array(Tuple(url String))'))
FROM url('https://datasets-server.huggingface.co/parquet?dataset=' || dataset, 'JSONAsString')
);

SELECT hugging_paths('barilan/blog_authorship_corpus') AS paths
SELECT hugging_paths('tasksource/blog_authorship_corpus') AS paths

['https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet','https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet','https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/validation/0000.parquet']
['https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet','https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet']
```

You can make this even easier by creating another function that calls `hugging_paths` and outputs all the files based on the dataset name:
@@ -118,26 +118,27 @@ CREATE OR REPLACE FUNCTION hf AS dataset -> (
SELECT multiIf(length(urls) = 0, '', length(urls) = 1, urls[1], 'https://huggingface.co/datasets/{' || arrayStringConcat(arrayMap(x -> replaceRegexpOne(replaceOne(x, 'https://huggingface.co/datasets/', ''), '\\.parquet$', ''), urls), ',') || '}.parquet')
);

SELECT hf('barilan/blog_authorship_corpus') AS pattern
SELECT hf('tasksource/blog_authorship_corpus') AS pattern

['https://huggingface.co/datasets/{blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/barilan/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002,barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00001-of-00002,barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-validation}.parquet']
https://huggingface.co/datasets/{tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000,tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001}.parquet
```

Now use the `hf` function to query any dataset by passing the dataset name:

```bash
SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length
FROM url(hf('barilan/blog_authorship_corpus'))
GROUP BY horoscope
SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length
FROM url(hf('tasksource/blog_authorship_corpus'))
GROUP BY sign
ORDER BY avg_blog_length
DESC LIMIT(5)

┌─────────────┬───────┬────────────────────┐
│ Aquarius │ 51747 │ 1132.487873693161 │
├─────────────┼───────┼────────────────────┤
│ Cancer │ 66944 │ 1111.613109464627 │
│ Libra │ 63994 │ 1060.3968184517298 │
│ Sagittarius │ 52753 │ 1055.7120732470191 │
│ Capricorn │ 52207 │ 1055.4147719654452 │
└─────────────┴───────┴────────────────────┘
┌───────────┬────────┬────────────────────┐
│ sign      │ count  │ avg_blog_length    │
├───────────┼────────┼────────────────────┤
│ Aquarius  │ 49687  │ 1193.9523819107615 │
│ Leo       │ 53811  │ 1186.0665291483153 │
│ Cancer    │ 65048  │ 1160.8010392325666 │
│ Gemini    │ 51985  │ 1158.4132922958545 │
│ Virgo     │ 60399  │ 1142.9977648636566 │
└───────────┴────────┴────────────────────┘
```
6 changes: 3 additions & 3 deletions docs/source/cudf.md
@@ -8,8 +8,8 @@ To read from a single Parquet file, use the [`read_parquet`](https://docs.rapids
import cudf

df = (
cudf.read_parquet("https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
.groupby('horoscope')['text']
cudf.read_parquet("https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet")
.groupby('sign')['text']
.apply(lambda x: x.str.len().mean())
.sort_values(ascending=False)
.head(5)
@@ -25,6 +25,6 @@ import dask.dataframe as dd
dask.config.set({"dataframe.backend": "cudf"})

df = (
dd.read_parquet("https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/*.parquet")
dd.read_parquet("https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/*.parquet")
)
```
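As an illustrative aside (not part of this change), the multi-file Dask read above can feed the same per-sign length analysis as the single-file cuDF example. A minimal sketch, assuming the `sign` and `text` columns used earlier:

```python
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})

# Read all train shards lazily over HTTP.
df = dd.read_parquet(
    "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/*.parquet"
)

# Average blog length per sign; nothing is computed until .compute().
avg_blog_length = (
    df.assign(text_len=df["text"].str.len())
      .groupby("sign")["text_len"]
      .mean()
      .compute()
      .sort_values(ascending=False)
      .head(5)
)
print(avg_blog_length)
```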
48 changes: 26 additions & 22 deletions docs/source/duckdb.md
@@ -7,7 +7,7 @@
```py
import duckdb

url = "https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
url = "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet"

con = duckdb.connect()
con.execute("INSTALL httpfs;")
@@ -22,7 +22,7 @@ var con = db.connect();
con.exec('INSTALL httpfs');
con.exec('LOAD httpfs');

const url = "https://huggingface.co/datasets/barilan/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet"
const url = "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet"
```
</js>
</inferencesnippet>
@@ -32,22 +32,22 @@ Now you can write and execute your SQL query on the Parquet file:
<inferencesnippet>
<python>
```py
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '{url}' GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
con.sql(f"SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '{url}' GROUP BY sign ORDER BY avg_blog_length DESC LIMIT(5)")
┌───────────┬──────────────┬────────────────────┐
│ horoscope │ count_star() │  avg_blog_length   │
│   sign    │ count_star() │  avg_blog_length   │
│  varchar  │    int64     │       double       │
├───────────┼──────────────┼────────────────────┤
│ Aquarius  │        34062 │  1129.218836239798 │
│ Cancer    │        41509 │  1098.366812016671 │
│ Capricorn │        33961 │ 1073.2002002296751 │
│ Libra     │        40302 │ 1072.0718326633914 │
│ Leo       │        40587 │ 1064.0536871412028 │
│ Cancer    │        38956 │ 1206.5212034089743 │
│ Leo       │        35487 │ 1180.0673767858652 │
│ Aquarius  │        32723 │ 1152.1136815084192 │
│ Virgo     │        36189 │ 1117.1982094006466 │
│ Capricorn │        31825 │  1102.397360565593 │
└───────────┴──────────────┴────────────────────┘
```
</python>
<js>
```js
con.all(`SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '${url}' GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)`, function(err, res) {
con.all(`SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '${url}' GROUP BY sign ORDER BY avg_blog_length DESC LIMIT(5)`, function(err, res) {
if (err) {
throw err;
}
@@ -62,22 +62,26 @@ To query multiple files - for example, if the dataset is sharded:
<inferencesnippet>
<python>
```py
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet({urls[:2]}) GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
┌─────────────┬──────────────┬────────────────────┐
│  horoscope  │ count_star() │  avg_blog_length   │
│   varchar   │    int64     │       double       │
├─────────────┼──────────────┼────────────────────┤
│ Aquarius    │        49568 │ 1125.8306770497095 │
│ Cancer      │        63512 │   1097.95608703867 │
│ Libra       │        60304 │ 1060.6110539931017 │
│ Capricorn   │        49402 │ 1059.5552609206104 │
│ Sagittarius │        50431 │ 1057.4589835616982 │
└─────────────┴──────────────┴────────────────────┘
urls = ["https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet", "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet"]

con.sql(f"SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet({urls}) GROUP BY sign ORDER BY avg_blog_length DESC LIMIT(5)")
┌──────────┬──────────────┬────────────────────┐
│   sign   │ count_star() │  avg_blog_length   │
│ varchar  │    int64     │       double       │
├──────────┼──────────────┼────────────────────┤
│ Aquarius │        49687 │  1191.417211745527 │
│ Leo      │        53811 │ 1183.8782219248853 │
│ Cancer   │        65048 │ 1158.9691612347804 │
│ Gemini   │        51985 │ 1156.0693084543618 │
│ Virgo    │        60399 │ 1140.9584430205798 │
└──────────┴──────────────┴────────────────────┘
```
</python>
<js>
```js
con.all(`SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet(${JSON.stringify(urls)}) GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)`, function(err, res) {
const urls = ["https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet", "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet"];

con.all(`SELECT sign, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet(${JSON.stringify(urls)}) GROUP BY sign ORDER BY avg_blog_length DESC LIMIT(5)`, function(err, res) {
if (err) {
throw err;
}