
OpenML Sparse Dataset support #46

Open

prabhant opened this issue Apr 7, 2022 · 11 comments

Comments

prabhant commented Apr 7, 2022

This issue tracks the progress of sparse dataset support on the OpenML MinIO backend.
Currently, MinIO has no Parquet files for OpenML sparse datasets because pandas cannot write sparse DataFrames to Parquet by default.
Example

did = 42379
d = openml.datasets.get_dataset(did, download_qualities=False)
df, *_ = d.get_data(dataset_format="dataframe", include_row_id=True, include_ignore_attribute=True)
df.to_parquet(f'dataset_{d.id}.pq')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-42ca2d7c4839> in <module>
      7                                       target=d.default_target_attribute)
      8     df = pd.concat([X,y], axis=1)
----> 9     df.to_parquet(f'dataset_{d.id}.pq')
     10     client.make_bucket(f"dataset{did}")
     11     client.fput_object(

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/core/frame.py in to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2453         from pandas.io.parquet import to_parquet
   2454 
-> 2455         return to_parquet(
   2456             self,
   2457             path,

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    388     path_or_buf: FilePathOrBuffer = io.BytesIO() if path is None else path
    389 
--> 390     impl.write(
    391         df,
    392         path_or_buf,

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/io/parquet.py in write(self, df, path, compression, index, storage_options, partition_cols, **kwargs)
    150             from_pandas_kwargs["preserve_index"] = index
    151 
--> 152         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    153 
    154         path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    551      index_columns,
    552      columns_to_convert,
--> 553      convert_fields) = _get_columns_to_convert(df, schema, preserve_index,
    554                                                columns)
    555 

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _get_columns_to_convert(df, schema, preserve_index, columns)
    357 
    358         if _pandas_api.is_sparse(col):
--> 359             raise TypeError(
    360                 "Sparse pandas data (column {}) not supported.".format(name))
    361 

TypeError: Sparse pandas data (column FCFP6_1024_0) not supported.

prabhant commented Apr 7, 2022

TODOs:

  • Make a script to save sparse dataframes as Parquet files
  • List all sparse datasets (currently some sparse datasets are not labelled as sparse)

prabhant commented Apr 7, 2022

@mitar @PGijsbers @joaquinvanschoren
Please list all issues related to sparse datasets here, as well as the IDs of mislabeled datasets.

mitar commented Apr 7, 2022

Somebody should run a script to check whether all datasets with a missing Parquet file are marked as sparse.
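A minimal sketch of the check logic, with a hypothetical `catalog` mapping of dataset id → (format string, whether a Parquet file exists); the real script would fill this from openml-python's dataset listing and the MinIO bucket contents, and the `"Sparse_ARFF"` format value is an assumption here:

```python
def unmarked_sparse_candidates(catalog):
    """Return ids of datasets that have no Parquet file yet are not
    marked as sparse in their metadata (format != 'Sparse_ARFF')."""
    return sorted(
        did
        for did, (fmt, has_parquet) in catalog.items()
        if not has_parquet and fmt.lower() != "sparse_arff"
    )


# Hypothetical catalog entries, for illustration only.
catalog = {
    42379: ("Sparse_ARFF", False),  # sparse and Parquet missing: expected
    31:    ("ARFF", True),          # dense and Parquet present: fine
    43100: ("ARFF", False),         # Parquet missing but not marked sparse
}
print(unmarked_sparse_candidates(catalog))  # -> [43100]
```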

PGijsbers commented

Not all datasets without Parquet are necessarily sparse (there's also currently an issue with datasets containing datetime info). That said, it shouldn't take more than a small page of code to write a script that checks all datasets against their metadata. If I remember correctly, whether or not a dataset is sparse is stored directly on OpenMLDataset objects in openml-python (.format).

prabhant commented

Update regarding the issue: I have not been able to find any solution for converting a sparse pandas dataframe to Parquet directly. I have asked the Parquet community forums for help.
One way this can be done is to convert the dataframe to a dense dataframe, changing the dtype of the arrays, and then save it to Parquet files.
@PGijsbers That will require changes in openml-python to read the dataframe as a dense frame first and then convert the arrays back to sparse arrays in pandas (we can use a sparse attribute in the metadata to identify whether the dataset is sparse). Do you think that is a reasonable workflow?

mitar commented Apr 26, 2022

Does parquet even support sparse data? Or have you decided on a format which does not?

joaquinvanschoren commented

Parquet supports sparse data.

It seems that since pandas 1.0.0, you need to make a pandas dataframe with columns of type SparseArray, and to_parquet should then 'just work'?

pandas-dev/pandas#26378

prabhant commented Jun 7, 2022

Test code: https://gist.github.com/prabhant/dfd25b894afbf4d102f7abee23376c41
Please test it out on a few sparse datasets.

prabhant commented Jun 7, 2022

@mfeurer @mitar for reference

PGijsbers commented Jun 22, 2022

In the provided example we lose some data compared to the sparse arff file itself.
The sparse arff contains data:

@data
{1 83.683,3 4,4 4,5 0.47,6 5,10 12.8,12 -0.229,13 -0.348,15 1.226,16 63.504,26 -0.264,27 83.683,28 13.894,29 4,30 1.417,31 3.07,33 4.583,34 1,35 16.663,36 67.02,37 16.981,38 0.803,39 1.392,40 82.698,42 1.358,43 3.323,44 0.913,46 1.119,47 6,48 4.953,49 3.016,50 4.199,51 6.421,52 1.206,53 6.716,54 2,56 1.106,58 4.82,59 0.119,60 0.293,61 -0.208,62 0.621,63 68.376,64 -0.247,66 11.935,67 7.62}
{0 CHEMBL1077387,1 83.683,3 4,4 4,6 5,10 19.2,12 -0.221,13 -0.33,15 1.181,16 61.552,26 -0.309,27 83.683,28 6.264,29 2,30 1.256,31 2.918,33 2.153,35 16.663,36 67.02,37 16.586,38 1.093,39 1.062,40 74.095,42 1.107,43 2.741,44 0.906,46 1.095,47 4,48 2.03,49 2.854,50 3.757,51 3.141,52 1.154,53 2.773,54 1,56 0.821,58 4.809,59 -0.09,60 -0.054,61 -0.315,62 0.559,63 55.181,64 -0.165,66 3.975,67 6.886}

Neither the old nor the new dataframe contains the molecule id properly. Old:

>>>    molecule_id   P_VSA_e_3  C.039  N.075  ...  MATS7i  nCbH  ATSC7m  pXC50
0          0.0   83.682999    0.0    4.0  ...  -0.247   0.0  11.935  7.620
1          1.0   83.682999    0.0    4.0  ...  -0.165   0.0   3.975  6.886

new:

>>>    molecule_id   P_VSA_e_3  C.039  N.075  ...  MATS7i  nCbH  ATSC7m  pXC50
0          NaN   83.682999    NaN    4.0  ...  -0.247   NaN  11.935  7.620
1          1.0   83.682999    NaN    4.0  ...  -0.165   NaN   3.975  6.886

This seems to stem from encode_nominal=True when reading the arff file. From what I can tell, having a string column gives the downstream error (<class 'TypeError'>, TypeError("no supported conversion for types: (dtype('<U32'),)"), <traceback object at 0x0000017BFDCAD400>) when calling .tocsr.

I think we should probably use the intermediate output to generate the sparse Parquet file. This will also avoid accidentally encoding a 0 as NaN (with the provided code, 0 values may be encoded as NaN).
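The 0-vs-NaN distinction can be kept explicit by building the frame straight from the scipy matrix with `pd.DataFrame.sparse.from_spmatrix`, which uses 0 as the fill value, instead of round-tripping through a dense frame where missing entries become NaN. A small illustration (the 2×2 matrix and column names are made up):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A sparse matrix whose implicit entries are genuine zeros, not missing values.
m = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

df = pd.DataFrame.sparse.from_spmatrix(m, columns=["a", "b"])
print(df.dtypes["a"])        # sparse dtype with fill value 0, not NaN
print(df.sparse.to_dense())  # implicit entries come back as 0.0
```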

mitar commented Jul 18, 2023

Here is a list of all datasets which I think are failing because of this issue and do not have Parquet files:

'32359', '40864', '41079', '41111', '41120', '41121', '41122', '41204', '41205', '41206', '41238', '42807', '42825', '43100', '43105', '43113', '43123', '43127', '43138', '43140', '43180', '43190', '43192', '43194', '43198', '43252', '43256', '43303', '43304', '43305', '43306', '43307', '43308', '43309', '43310', '43311', '43312', '43313', '43315', '43318', '43319', '43320', '43321', '43322', '43323', '43324', '43325', '43326', '43327', '43328', '43331', '43332', '43335', '43336', '43337', '43338', '43339', '43340', '43341', '43342', '43343', '43344', '43345', '43346', '43347', '43348', '43349', '43350', '43351', '43352', '43353', '43354', '43355', '43356', '43357', '43358', '43360', '43361', '43362', '43363', '43364', '43365', '43366', '43367', '43368', '43369', '43370', '43371', '43372', '43373', '43374', '43375', '43376', '43377', '43378', '43379', '43380', '43381', '43382', '43383', '43384', '43385', '43386', '43387', '43388', '43389', '43390', '43391', '43392', '43393', '43394', '43395', '43396', '43397', '43398', '43399', '43400', '43401', '43402', '43403', '43404', '43405', '43406', '43407', '43408', '43409', '43410', '43412', '43413', '43414', '43415', '43416', '43417', '43419', '43420', '43421', '43422', '43423', '43424', '43425', '43426', '43427', '43428', '43430', '43431', '43432', '43433', '43434', '43435', '43436', '43437', '43438', '43439', '43440', '43441', '43442', '43443', '43445', '43446', '43447', '43448', '43449', '43450', '43451', '43452', '43453', '43454', '43455', '43456', '43457', '43458', '43459', '43460', '43461', '43463', '43464', '43465', '43466', '43467', '43468', '43470', '43471', '43472', '43473', '43474', '43475', '43476', '43477', '43478', '43479', '43480', '43481', '43482', '43483', '43484', '43485', '43486', '43487', '43488', '43489', '43490', '43491', '43492', '43493', '43495', '43496', '43497', '43498', '43499', '43500', '43501', '43502', '43503', '43504', '43505', '43506', '43507', '43508', '43509', '43510', '43511', '43512', 
'43513', '43515', '43516', '43517', '43518', '43519', '43520', '43521', '43522', '43523', '43524', '43525', '43526', '43527', '43528', '43529', '43530', '43531', '43532', '43533', '43534', '43535', '43536', '43537', '43538', '43539', '43540', '43541', '43542', '43543', '43544', '43545', '43546', '43547', '43548', '43549', '43550', '43551', '43552', '43553', '43554', '43555', '43556', '43557', '43558', '43559', '43560', '43561', '43562', '43563', '43564', '43565', '43566', '43567', '43568', '43569', '43570', '43571', '43572', '43573', '43574', '43575', '43576', '43577', '43578', '43579', '43580', '43581', '43582', '43583', '43584', '43585', '43586', '43587', '43588', '43589', '43590', '43591', '43592', '43593', '43594', '43595', '43596', '43597', '43598', '43599', '43600', '43601', '43602', '43603', '43604', '43605', '43606', '43607', '43608', '43609', '43610', '43611', '43612', '43613', '43614', '43615', '43616', '43617', '43618', '43619', '43620', '43621', '43622', '43623', '43624', '43625', '43626', '43627', '43628', '43630', '43631', '43633', '43634', '43635', '43636', '43637', '43638', '43639', '43640', '43641', '43642', '43643', '43644', '43645', '43646', '43647', '43648', '43649', '43650', '43651', '43652', '43653', '43654', '43655', '43656', '43657', '43658', '43659', '43660', '43661', '43662', '43663', '43664', '43665', '43666', '43667', '43668', '43669', '43670', '43671', '43672', '43673', '43674', '43675', '43676', '43677', '43678', '43679', '43680', '43681', '43682', '43683', '43684', '43685', '43686', '43687', '43688', '43689', '43690', '43691', '43692', '43694', '43695', '43696', '43697', '43698', '43699', '43700', '43701', '43702', '43703', '43704', '43705', '43706', '43707', '43708', '43709', '43710', '43711', '43712', '43713', '43714', '43715', '43716', '43717', '43718', '43719', '43720', '43721', '43722', '43723', '43724', '43725', '43726', '43727', '43728', '43729', '43730', '43731', '43733', '43734', '43735', '43736', '43737', '43738', '43739', 
'43740', '43741', '43742', '43743', '43744', '43745', '43746', '43747', '43748', '43749', '43750', '43751', '43752', '43753', '43754', '43755', '43756', '43757', '43758', '43759', '43760', '43761', '43762', '43763', '43764', '43765', '43766', '43767', '43768', '43769', '43770', '43771', '43772', '43773', '43774', '43775', '43776', '43777', '43778', '43779', '43780', '43781', '43782', '43783', '43784', '43785', '43786', '43787', '43788', '43789', '43790', '43791', '43792', '43793', '43794', '43795', '43796', '43797', '43798', '43799', '43800', '43801', '43802', '43803', '43804', '43805', '43806', '43807', '43808', '43809', '43810', '43811', '43812', '43814', '43815', '43816', '43817', '43818', '43819', '43820', '43821', '43822', '43823', '43824', '43825', '43826', '43827', '43828', '43829', '43830', '43831', '43832', '43833', '43834', '43835', '43836', '43837', '43838', '43839', '43840', '43841', '43842', '43843', '43844', '43845', '43846', '43847', '43848', '43849', '43850',
