[Tabular] Infer label from directory structure for tabular datasets #5801

mbignotti · 2023-04-27T08:38:12Z

Hi!
Is it possible to infer the label from the directory structure for tabular datasets with many files (in either parquet or csv format)?
I see that it's possible for audio and image datasets, but I haven't managed to do the same for tabular ones.

Also, I tried with pyarrow tabular datasets and it works, but then of course you don't have the automatic train, test, validation identification:

import pyarrow.dataset as ds

# inside path_to_data/train/ there are subfolders such as class_a, class_b, class_c, and each subfolder
# contains many csv or parquet files belonging to the corresponding class
train_path = "path_to_data/train" 
train_ds = ds.dataset(
    train_path,
    format="csv",
    partitioning=ds.partitioning(field_names=["label"]),
    partition_base_dir=train_path,
)
train_ds.to_table().to_pandas()  # contains the "label" column populated with one of "class_a", "class_b", "class_c"

Thanks a lot!

mariosasko · 2023-05-18T16:59:45Z

Collaborator

You can use Dataset.from_generator to load a PyArrow Dataset, e.g.:

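from datasets import Dataset

def gen_from_pa_dataset(pa_dataset):
    # Stream the PyArrow dataset batch by batch and yield one dict per row
    for pa_batch in pa_dataset.to_batches():
        yield from pa_batch.to_pylist()

# train_ds is the PyArrow dataset from the question; the result is named hf_ds
# so that it does not shadow the `ds` alias of pyarrow.dataset
hf_ds = Dataset.from_generator(gen_from_pa_dataset, gen_kwargs={"pa_dataset": train_ds})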

But this use case is too specific to have a dedicated packaged builder, at least for now.
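
For the train/test/validation identification mentioned in the question, one possible workaround is to walk the directory tree yourself. The following is a minimal sketch using only the standard library, assuming a root/split/label/*.csv layout like the one described above. The helper names find_splits and iter_labeled_rows are hypothetical and not part of datasets or PyArrow.

```python
import csv
from pathlib import Path


def find_splits(root):
    """Map each immediate subfolder of root (e.g. "train", "test") to its path."""
    return {p.name: p for p in sorted(Path(root).iterdir()) if p.is_dir()}


def iter_labeled_rows(split_dir):
    """Yield one dict per CSV row, adding a "label" key taken from the parent folder name."""
    for csv_path in sorted(Path(split_dir).glob("*/*.csv")):
        label = csv_path.parent.name  # e.g. "class_a"
        with csv_path.open(newline="") as f:
            # csv.DictReader returns every value as a string; cast columns as needed
            for row in csv.DictReader(f):
                yield {**row, "label": label}
```

With datasets installed, each split could then be built with Dataset.from_generator(iter_labeled_rows, gen_kwargs={"split_dir": str(path)}) and the results collected into a DatasetDict keyed by the names returned from find_splits.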

3 replies

mbignotti May 18, 2023
Author

Is it not possible to re-use what has already been implemented for, say, audio datasets?
I'm sorry, I'm asking because without knowing the internals of datasets, the only big difference I see is the file format.

mariosasko May 18, 2023
Collaborator

Yes, we could. Still, unless this format is common (e.g., on Kaggle), I think this is not something we want to implement and maintain.

mbignotti May 18, 2023
Author

Makes sense.

Thank you!
