Stats for audio (#2612)
* compute audio durations with librosa

* add audio statistics testcase

* check for all nan values

* update public docs

* update column types in openapi.json

* update dev docs

* add example of response to openapi.json (from MLCommons/peoples_speech validation subset)
polinaeterna authored Mar 27, 2024
1 parent dbbcb7a commit 918acfc
Showing 12 changed files with 639 additions and 87 deletions.
172 changes: 171 additions & 1 deletion docs/source/openapi.json
@@ -1084,7 +1084,7 @@
},
"ColumnType": {
"type": "string",
"enum": ["float", "int", "class_label", "string_label", "string_text", "bool", "list"]
"enum": ["float", "int", "class_label", "string_label", "string_text", "bool", "list", "audio"]
},
"Histogram": {
"type": "object",
@@ -6128,6 +6128,176 @@
}
]
}
},
"A split (MLCommons/peoples_speech) with audio column": {
"summary": "Statistics on an audio column 'audio'.",
"description": "Try with https://datasets-server.huggingface.co/statistics?dataset=MLCommons/peoples_speech&config=validation&split=validation.",
"value": {
"num_examples": 18622,
"statistics": [
{
"column_name": "audio",
"column_type": "audio",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 0.653,
"max": 105.97,
"mean": 6.41103,
"median": 4.8815,
"std": 5.63269,
"histogram": {
"hist": [
15867,
2319,
350,
67,
12,
5,
0,
1,
0,
1
],
"bin_edges": [
0.653,
11.1847,
21.7164,
32.2481,
42.7798,
53.3115,
63.8432,
74.3749,
84.9066,
95.4383,
105.97
]
}
}
},
{
"column_name": "duration_ms",
"column_type": "int",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 833,
"max": 105970,
"mean": 6411.06079,
"median": 4881.5,
"std": 5632.67057,
"histogram": {
"hist": [
15950,
2244,
345,
64,
12,
5,
0,
1,
0,
1
],
"bin_edges": [
833,
11347,
21861,
32375,
42889,
53403,
63917,
74431,
84945,
95459,
105970
]
}
}
},
{
"column_name": "id",
"column_type": "string_text",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 43,
"max": 197,
"mean": 120.06675,
"median": 136.0,
"std": 44.49607,
"histogram": {
"hist": [
3599,
939,
278,
1914,
1838,
1646,
4470,
1443,
1976,
519
],
"bin_edges": [
43,
59,
75,
91,
107,
123,
139,
155,
171,
187,
197
]
}
}
},
{
"column_name": "text",
"column_type": "string_text",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 1,
"max": 1219,
"mean": 94.52873,
"median": 75.0,
"std": 79.11078,
"histogram": {
"hist": [
13703,
3975,
744,
146,
36,
10,
5,
1,
1,
1
],
"bin_edges": [
1,
123,
245,
367,
489,
611,
733,
855,
977,
1099,
1219
]
}
}
}
],
"partial": false
}
}
}
}
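
As a rough sanity check on the example response above, the sketch below rebuilds the request URL quoted in the `description` field and verifies a few invariants that the example happens to satisfy. The helper names `statistics_url` and `check_column` are hypothetical, and the invariants are inferred from this one example rather than guaranteed by the API.

```python
from urllib.parse import urlencode

# Endpoint quoted in the example's "description" field.
API_URL = "https://datasets-server.huggingface.co/statistics"


def statistics_url(dataset: str, config: str, split: str) -> str:
    """Build the /statistics request URL for one split (hypothetical helper)."""
    params = {"dataset": dataset, "config": config, "split": split}
    # safe="/" keeps the namespace separator in "MLCommons/peoples_speech" readable.
    return f"{API_URL}?{urlencode(params, safe='/')}"


def check_column(column: dict, num_examples: int) -> None:
    """Assert invariants observed in the example response (assumed, not specified)."""
    stats = column["column_statistics"]
    hist = stats["histogram"]
    # n bins are delimited by n + 1 edges.
    assert len(hist["bin_edges"]) == len(hist["hist"]) + 1
    # Every non-null example is counted in exactly one bin.
    assert sum(hist["hist"]) == num_examples - stats["nan_count"]
    # The outer edges coincide with the reported minimum and maximum.
    assert hist["bin_edges"][0] == stats["min"]
    assert hist["bin_edges"][-1] == stats["max"]


# The "audio" column copied from the example response above.
audio_column = {
    "column_name": "audio",
    "column_type": "audio",
    "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0.0,
        "min": 0.653,
        "max": 105.97,
        "mean": 6.41103,
        "median": 4.8815,
        "std": 5.63269,
        "histogram": {
            "hist": [15867, 2319, 350, 67, 12, 5, 0, 1, 0, 1],
            "bin_edges": [0.653, 11.1847, 21.7164, 32.2481, 42.7798, 53.3115,
                          63.8432, 74.3749, 84.9066, 95.4383, 105.97],
        },
    },
}

check_column(audio_column, num_examples=18622)
print(statistics_url("MLCommons/peoples_speech", "validation", "validation"))
```

The same check can be run over every entry of a response's `statistics` list that carries a `histogram` key.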
120 changes: 113 additions & 7 deletions docs/source/statistics.md
@@ -165,16 +165,18 @@ The response JSON contains three keys:

## Response structure by data type

Currently, statistics are supported for strings, float and integer numbers, and the special [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature type of the [`datasets`](https://huggingface.co/docs/datasets/) library.
Currently, statistics are supported for strings, float and integer numbers, lists, audio data and the special [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature type of the [`datasets`](https://huggingface.co/docs/datasets/) library.

`column_type` in response can be one of the following values:

* `class_label` - for [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature
* `float` - for float dtypes
* `int` - for integer dtypes
* `bool` - for boolean dtype
* `string_label` - for string dtypes being treated as categories (see below)
* `string_text` - for string dtypes if they do not represent categories (see below)
* `class_label` - for [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature which represents categorical data
* `float` - for float data types
* `int` - for integer data types
* `bool` - for boolean data type
* `string_label` - for string data types being treated as categories (see below)
* `string_text` - for string data types if they do not represent categories (see below)
* `list` - for lists of any other data types (including lists)
* `audio` - for audio data

### `class_label`

@@ -426,3 +428,107 @@ If string column does not satisfy the conditions to be treated as a `string_labe

</p>
</details>

### list

For lists, the distribution of their lengths is computed. The following measures are returned:

* minimum, maximum, mean, median, and standard deviation of list lengths
* number and proportion of `null` values
* histogram of list lengths with up to 10 bins

<details><summary>Example </summary>
<p>

```json
{
"column_name": "chat_history",
"column_type": "list",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 1,
"max": 3,
"mean": 1.01741,
"median": 1.0,
"std": 0.13146,
"histogram": {
"hist": [
11177,
196,
1
],
"bin_edges": [
1,
2,
3,
3
]
}
}
}
```

</p>
</details>

Note that dictionaries of lists are not supported.
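
The measures listed for `list` columns can be approximated locally with the standard library. This is a minimal sketch, not the server's implementation: `list_length_stats` is a hypothetical helper, the integer binning rule is inferred from the bin edges shown in the examples, and the use of the sample standard deviation is an assumption.

```python
import statistics


def list_length_stats(column, max_bins=10):
    """Summarize list lengths in a column; None entries are counted as nulls."""
    lengths = [len(value) for value in column if value is not None]
    nan_count = len(column) - len(lengths)
    lo, hi = min(lengths), max(lengths)

    # Binning rule inferred from the examples (an assumption): equal
    # integer-wide bins, with the final edge set to the maximum length.
    span = hi - lo + 1                # number of possible integer lengths
    bin_size = -(-span // max_bins)   # ceiling division
    n_bins = -(-span // bin_size)     # never more than max_bins
    hist = [0] * n_bins
    for n in lengths:
        hist[(n - lo) // bin_size] += 1
    bin_edges = [lo + i * bin_size for i in range(n_bins)] + [hi]

    return {
        "nan_count": nan_count,
        "nan_proportion": round(nan_count / len(column), 5),
        "min": lo,
        "max": hi,
        "mean": round(statistics.fmean(lengths), 5),
        "median": statistics.median(lengths),
        # Sample standard deviation is an assumption; it needs >= 2 values.
        "std": round(statistics.stdev(lengths), 5),
        "histogram": {"hist": hist, "bin_edges": bin_edges},
    }


# Toy column: four lists and one null value.
toy = [["a"], ["a", "b"], None, ["a", "b", "c"], ["a"]]
print(list_length_stats(toy))
```

With lengths between 1 and 3, the sketch produces three bins with edges `[1, 2, 3, 3]`, the same shape as the `chat_history` example.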


### audio

For audio data, the distribution of audio file durations (in seconds) is computed. The following measures are returned:

* minimum, maximum, mean, median, and standard deviation of audio file durations
* number and proportion of `null` values
* histogram of audio file durations with 10 bins


<details><summary>Example </summary>
<p>

```json
{
"column_name": "audio",
"column_type": "audio",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 1.02,
"max": 15,
"mean": 13.93042,
"median": 14.77,
"std": 2.63734,
"histogram": {
"hist": [
32,
25,
18,
24,
22,
17,
18,
19,
55,
1770
],
"bin_edges": [
1.02,
2.418,
3.816,
5.214,
6.612,
8.01,
9.408,
10.806,
12.204,
13.602,
15
]
}
}
}
```

</p>
</details>
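
To illustrate how duration statistics of this shape can be derived, here is a self-contained sketch. The commit computes durations with librosa; to stay dependency-free, this sketch instead reads uncompressed WAV data with Python's standard `wave` module. The helpers `wav_duration_seconds`, `make_silent_wav`, and `duration_stats` are hypothetical, and the equal-width binning is inferred from the bin edges in the example.

```python
import io
import statistics
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of an uncompressed WAV payload: number of frames / frame rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silent_wav(seconds: float, rate: int = 16_000) -> bytes:
    """Build a mono 16-bit silent WAV in memory (fixture for this sketch only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


def duration_stats(durations, n_bins=10):
    """Min, max, mean, median, std and an equal-width histogram of durations."""
    lo, hi = min(durations), max(durations)
    width = (hi - lo) / n_bins
    hist = [0] * n_bins
    for d in durations:
        # The maximum lands in the last bin, whose right edge is closed.
        # If all durations are equal (width == 0), everything goes to bin 0.
        index = min(int((d - lo) / width), n_bins - 1) if width else 0
        hist[index] += 1
    bin_edges = [round(lo + i * width, 5) for i in range(n_bins)] + [hi]
    return {
        "min": lo,
        "max": hi,
        "mean": round(statistics.fmean(durations), 5),
        "median": statistics.median(durations),
        # Sample standard deviation is an assumption; it needs >= 2 values.
        "std": round(statistics.stdev(durations), 5),
        "histogram": {"hist": hist, "bin_edges": bin_edges},
    }


# Four synthetic clips of 1, 2, 3 and 11 seconds.
clips = [make_silent_wav(s) for s in (1.0, 2.0, 3.0, 11.0)]
durations = [wav_duration_seconds(clip) for clip in clips]
print(duration_stats(durations))
```

Real datasets usually store compressed audio, which `wave` cannot read; in that case a decoding library such as librosa, as used in this commit, is the more practical choice.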