
OpenML Sparse Dataset support #46

Open

prabhant opened this issue Apr 7, 2022 · 11 comments

Comments

prabhant commented Apr 7, 2022

This issue tracks the progress of sparse dataset support on the OpenML MinIO backend.
Currently, MinIO has no Parquet files for OpenML sparse datasets because pandas cannot write sparse DataFrames to Parquet by default.
Example

did = 42379
d = openml.datasets.get_dataset(did, download_qualities=False)
df, *_ = d.get_data(dataset_format="dataframe", include_row_id=True, include_ignore_attribute=True)
df.to_parquet(f'dataset_{d.id}.pq')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-42ca2d7c4839> in <module>
      7                                       target=d.default_target_attribute)
      8     df = pd.concat([X,y], axis=1)
----> 9     df.to_parquet(f'dataset_{d.id}.pq')
     10     client.make_bucket(f"dataset{did}")
     11     client.fput_object(

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/core/frame.py in to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2453         from pandas.io.parquet import to_parquet
   2454 
-> 2455         return to_parquet(
   2456             self,
   2457             path,

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    388     path_or_buf: FilePathOrBuffer = io.BytesIO() if path is None else path
    389 
--> 390     impl.write(
    391         df,
    392         path_or_buf,

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/io/parquet.py in write(self, df, path, compression, index, storage_options, partition_cols, **kwargs)
    150             from_pandas_kwargs["preserve_index"] = index
    151 
--> 152         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    153 
    154         path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    551      index_columns,
    552      columns_to_convert,
--> 553      convert_fields) = _get_columns_to_convert(df, schema, preserve_index,
    554                                                columns)
    555 

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _get_columns_to_convert(df, schema, preserve_index, columns)
    357 
    358         if _pandas_api.is_sparse(col):
--> 359             raise TypeError(
    360                 "Sparse pandas data (column {}) not supported.".format(name))
    361 

TypeError: Sparse pandas data (column FCFP6_1024_0) not supported.

prabhant commented Apr 7, 2022

TODOs:

  • Make a script to save sparse dataframes as Parquet files
  • List all sparse datasets (currently some sparse datasets are not labelled as sparse)

prabhant commented Apr 7, 2022

@mitar @PGijsbers @joaquinvanschoren
Please list all issues related to sparse datasets here, as well as the IDs of mislabeled datasets.

mitar commented Apr 7, 2022

Somebody should run a script to check whether all datasets with a missing Parquet file are marked as sparse.
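A minimal sketch of the check logic, with a hypothetical `catalog` mapping of dataset id → (format string, whether a Parquet file exists); the real script would fill this from openml-python's dataset listing and the MinIO bucket contents, and the `"Sparse_ARFF"` format value is an assumption here:

```python
def unmarked_sparse_candidates(catalog):
    """Return ids of datasets that have no Parquet file yet are not
    marked as sparse in their metadata (format != 'Sparse_ARFF')."""
    return sorted(
        did
        for did, (fmt, has_parquet) in catalog.items()
        if not has_parquet and fmt.lower() != "sparse_arff"
    )


# Hypothetical catalog entries, for illustration only.
catalog = {
    42379: ("Sparse_ARFF", False),  # sparse and Parquet missing: expected
    31:    ("ARFF", True),          # dense and Parquet present: fine
    43100: ("ARFF", False),         # Parquet missing but not marked sparse
}
print(unmarked_sparse_candidates(catalog))  # -> [43100]
```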

PGijsbers commented

Not all datasets without Parquet are necessarily sparse (there's also currently an issue with datasets containing datetime info). That said, it shouldn't take more than a small page of code to write a script that checks all datasets against their metadata. If I remember correctly, whether or not a dataset is sparse is stored directly on OpenMLDataset objects in openml-python (.format).

prabhant commented

Update regarding the issue: I have not been able to find any solution for converting a sparse pandas dataframe to Parquet directly. I have asked the Parquet community forums for help.
One way this can be done is to convert the dataframe to a dense dataframe, changing the dtype of the arrays, and then save it to Parquet files.
@PGijsbers That will require changes in openml-python to read the dataframe as a dense frame first and then convert the arrays back to sparse arrays in pandas (we can use a sparse attribute in the metadata to identify whether the dataset is sparse). Do you think that is a reasonable workflow?

mitar commented Apr 26, 2022

Does parquet even support sparse data? Or have you decided on a format which does not?

joaquinvanschoren commented

Parquet supports sparse data.

It seems that since pandas 1.0.0, you need to make a pandas dataframe with columns of type SparseArray, and to_parquet should then 'just work'?

pandas-dev/pandas#26378

prabhant commented Jun 7, 2022

Test code: https://gist.github.com/prabhant/dfd25b894afbf4d102f7abee23376c41
Please test it out on a few sparse datasets.

prabhant commented Jun 7, 2022

@mfeurer @mitar for reference

PGijsbers commented Jun 22, 2022

In the provided example we lose some data compared to the sparse arff file itself.
The sparse arff contains data:

@data
{1 83.683,3 4,4 4,5 0.47,6 5,10 12.8,12 -0.229,13 -0.348,15 1.226,16 63.504,26 -0.264,27 83.683,28 13.894,29 4,30 1.417,31 3.07,33 4.583,34 1,35 16.663,36 67.02,37 16.981,38 0.803,39 1.392,40 82.698,42 1.358,43 3.323,44 0.913,46 1.119,47 6,48 4.953,49 3.016,50 4.199,51 6.421,52 1.206,53 6.716,54 2,56 1.106,58 4.82,59 0.119,60 0.293,61 -0.208,62 0.621,63 68.376,64 -0.247,66 11.935,67 7.62}
{0 CHEMBL1077387,1 83.683,3 4,4 4,6 5,10 19.2,12 -0.221,13 -0.33,15 1.181,16 61.552,26 -0.309,27 83.683,28 6.264,29 2,30 1.256,31 2.918,33 2.153,35 16.663,36 67.02,37 16.586,38 1.093,39 1.062,40 74.095,42 1.107,43 2.741,44 0.906,46 1.095,47 4,48 2.03,49 2.854,50 3.757,51 3.141,52 1.154,53 2.773,54 1,56 0.821,58 4.809,59 -0.09,60 -0.054,61 -0.315,62 0.559,63 55.181,64 -0.165,66 3.975,67 6.886}

Neither the old nor the new dataframe contains the molecule id properly. Old:

>>>    molecule_id   P_VSA_e_3  C.039  N.075  ...  MATS7i  nCbH  ATSC7m  pXC50
0          0.0   83.682999    0.0    4.0  ...  -0.247   0.0  11.935  7.620
1          1.0   83.682999    0.0    4.0  ...  -0.165   0.0   3.975  6.886

new:

>>>    molecule_id   P_VSA_e_3  C.039  N.075  ...  MATS7i  nCbH  ATSC7m  pXC50
0          NaN   83.682999    NaN    4.0  ...  -0.247   NaN  11.935  7.620
1          1.0   83.682999    NaN    4.0  ...  -0.165   NaN   3.975  6.886

This seems to stem from encode_nominal=True when reading the arff file. From what I can tell, having a string column gives the downstream error (<class 'TypeError'>, TypeError("no supported conversion for types: (dtype('<U32'),)"), <traceback object at 0x0000017BFDCAD400>) when calling .tocsr.

I think we should probably use the intermediate output to generate the sparse Parquet file. This will also avoid accidentally encoding a 0 as NaN (with the provided code, 0 values may be encoded as NaN).
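The 0-vs-NaN distinction can be kept explicit by building the frame straight from the scipy matrix with `pd.DataFrame.sparse.from_spmatrix`, which uses 0 as the fill value, instead of round-tripping through a dense frame where missing entries become NaN. A small illustration (the 2×2 matrix and column names are made up):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A sparse matrix whose implicit entries are genuine zeros, not missing values.
m = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

df = pd.DataFrame.sparse.from_spmatrix(m, columns=["a", "b"])
print(df.dtypes["a"])        # sparse dtype with fill value 0, not NaN
print(df.sparse.to_dense())  # implicit entries come back as 0.0
```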

mitar commented Jul 18, 2023

Here is a list of all datasets which I think are failing because of this issue and do not have Parquet files:

'32359', '40864', '41079', '41111', '41120', '41121', '41122', '41204', '41205', '41206', '41238', '42807', '42825', '43100', '43105', '43113', '43123', '43127', '43138', '43140', '43180', '43190', '43192', '43194', '43198', '43252', '43256', '43303', '43304', '43305', '43306', '43307', '43308', '43309', '43310', '43311', '43312', '43313', '43315', '43318', '43319', '43320', '43321', '43322', '43323', '43324', '43325', '43326', '43327', '43328', '43331', '43332', '43335', '43336', '43337', '43338', '43339', '43340', '43341', '43342', '43343', '43344', '43345', '43346', '43347', '43348', '43349', '43350', '43351', '43352', '43353', '43354', '43355', '43356', '43357', '43358', '43360', '43361', '43362', '43363', '43364', '43365', '43366', '43367', '43368', '43369', '43370', '43371', '43372', '43373', '43374', '43375', '43376', '43377', '43378', '43379', '43380', '43381', '43382', '43383', '43384', '43385', '43386', '43387', '43388', '43389', '43390', '43391', '43392', '43393', '43394', '43395', '43396', '43397', '43398', '43399', '43400', '43401', '43402', '43403', '43404', '43405', '43406', '43407', '43408', '43409', '43410', '43412', '43413', '43414', '43415', '43416', '43417', '43419', '43420', '43421', '43422', '43423', '43424', '43425', '43426', '43427', '43428', '43430', '43431', '43432', '43433', '43434', '43435', '43436', '43437', '43438', '43439', '43440', '43441', '43442', '43443', '43445', '43446', '43447', '43448', '43449', '43450', '43451', '43452', '43453', '43454', '43455', '43456', '43457', '43458', '43459', '43460', '43461', '43463', '43464', '43465', '43466', '43467', '43468', '43470', '43471', '43472', '43473', '43474', '43475', '43476', '43477', '43478', '43479', '43480', '43481', '43482', '43483', '43484', '43485', '43486', '43487', '43488', '43489', '43490', '43491', '43492', '43493', '43495', '43496', '43497', '43498', '43499', '43500', '43501', '43502', '43503', '43504', '43505', '43506', '43507', '43508', '43509', '43510', '43511', '43512', 
'43513', '43515', '43516', '43517', '43518', '43519', '43520', '43521', '43522', '43523', '43524', '43525', '43526', '43527', '43528', '43529', '43530', '43531', '43532', '43533', '43534', '43535', '43536', '43537', '43538', '43539', '43540', '43541', '43542', '43543', '43544', '43545', '43546', '43547', '43548', '43549', '43550', '43551', '43552', '43553', '43554', '43555', '43556', '43557', '43558', '43559', '43560', '43561', '43562', '43563', '43564', '43565', '43566', '43567', '43568', '43569', '43570', '43571', '43572', '43573', '43574', '43575', '43576', '43577', '43578', '43579', '43580', '43581', '43582', '43583', '43584', '43585', '43586', '43587', '43588', '43589', '43590', '43591', '43592', '43593', '43594', '43595', '43596', '43597', '43598', '43599', '43600', '43601', '43602', '43603', '43604', '43605', '43606', '43607', '43608', '43609', '43610', '43611', '43612', '43613', '43614', '43615', '43616', '43617', '43618', '43619', '43620', '43621', '43622', '43623', '43624', '43625', '43626', '43627', '43628', '43630', '43631', '43633', '43634', '43635', '43636', '43637', '43638', '43639', '43640', '43641', '43642', '43643', '43644', '43645', '43646', '43647', '43648', '43649', '43650', '43651', '43652', '43653', '43654', '43655', '43656', '43657', '43658', '43659', '43660', '43661', '43662', '43663', '43664', '43665', '43666', '43667', '43668', '43669', '43670', '43671', '43672', '43673', '43674', '43675', '43676', '43677', '43678', '43679', '43680', '43681', '43682', '43683', '43684', '43685', '43686', '43687', '43688', '43689', '43690', '43691', '43692', '43694', '43695', '43696', '43697', '43698', '43699', '43700', '43701', '43702', '43703', '43704', '43705', '43706', '43707', '43708', '43709', '43710', '43711', '43712', '43713', '43714', '43715', '43716', '43717', '43718', '43719', '43720', '43721', '43722', '43723', '43724', '43725', '43726', '43727', '43728', '43729', '43730', '43731', '43733', '43734', '43735', '43736', '43737', '43738', '43739', 
'43740', '43741', '43742', '43743', '43744', '43745', '43746', '43747', '43748', '43749', '43750', '43751', '43752', '43753', '43754', '43755', '43756', '43757', '43758', '43759', '43760', '43761', '43762', '43763', '43764', '43765', '43766', '43767', '43768', '43769', '43770', '43771', '43772', '43773', '43774', '43775', '43776', '43777', '43778', '43779', '43780', '43781', '43782', '43783', '43784', '43785', '43786', '43787', '43788', '43789', '43790', '43791', '43792', '43793', '43794', '43795', '43796', '43797', '43798', '43799', '43800', '43801', '43802', '43803', '43804', '43805', '43806', '43807', '43808', '43809', '43810', '43811', '43812', '43814', '43815', '43816', '43817', '43818', '43819', '43820', '43821', '43822', '43823', '43824', '43825', '43826', '43827', '43828', '43829', '43830', '43831', '43832', '43833', '43834', '43835', '43836', '43837', '43838', '43839', '43840', '43841', '43842', '43843', '43844', '43845', '43846', '43847', '43848', '43849', '43850',
