OpenML Sparse Dataset support #46
TODOs:
|
@mitar @PGijsbers @joaquinvanschoren |
Somebody should run a script to check whether all datasets with a missing Parquet file are marked as sparse. |
Not all datasets without Parquet are necessarily sparse (there's also currently an issue with datasets containing datetime info). That said, it shouldn't take more than a small page of code for a script that checks all datasets against their metadata. If I remember correctly, the metadata on whether or not a dataset is sparse is stored directly on OpenMLDataset objects in |
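A rough sketch of such a check, for illustration only. It assumes the metadata has already been fetched into a plain mapping from dataset ID to format string, and that sparse datasets carry the format value `Sparse_ARFF` (compared case-insensitively). The function name `split_by_sparsity` and the sample IDs are hypothetical and not taken from this thread.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_sparsity(
    missing_parquet_ids: Iterable[str],
    formats: Dict[str, str],
) -> Tuple[List[str], List[str], List[str]]:
    """Partition dataset IDs by whether their metadata marks them as sparse.

    `formats` maps dataset ID -> format string (assumed to be "ARFF" or
    "Sparse_ARFF"). IDs without metadata are reported separately.
    """
    sparse, not_sparse, unknown = [], [], []
    for dataset_id in missing_parquet_ids:
        fmt = formats.get(dataset_id)
        if fmt is None:
            unknown.append(dataset_id)       # no metadata available
        elif fmt.lower() == "sparse_arff":
            sparse.append(dataset_id)        # marked as sparse
        else:
            not_sparse.append(dataset_id)    # missing Parquet for another reason
    return sparse, not_sparse, unknown


if __name__ == "__main__":
    # Made-up IDs and formats, purely to show the call shape.
    example_formats = {"1001": "Sparse_ARFF", "1002": "ARFF"}
    print(split_by_sparsity(["1001", "1002", "1003"], example_formats))
```

Any ID that ends up in the second list is missing its Parquet file for some reason other than sparsity (for example the datetime issue mentioned above) and would need a separate look.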
Update regarding the issue: I have not been able to find any way to convert a sparse pandas dataframe to Parquet directly. I have asked the Parquet community forums for help. |
Does parquet even support sparse data? Or have you decided on a format which does not? |
Parquet supports sparse data. It seems that since pandas 1.0.0, you need to make a pandas dataframe with columns of type SparseArray, and to_parquet should then 'just work'? |
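For reference, a minimal sketch of what a dataframe with `SparseArray` columns looks like in pandas. Whether `to_parquet` then accepts such columns depends on the installed Parquet engine and is not verified here; the `to_dense()` step is shown only as one possible fallback. The helper name `make_sparse_frame` and the sample values are hypothetical.

```python
import pandas as pd


def make_sparse_frame(columns: dict) -> pd.DataFrame:
    """Build a DataFrame in which every column is a pandas SparseArray."""
    return pd.DataFrame(
        {
            name: pd.arrays.SparseArray(values, fill_value=0)
            for name, values in columns.items()
        }
    )


if __name__ == "__main__":
    df = make_sparse_frame({"a": [0, 0, 1, 0], "b": [0, 2, 0, 0]})

    # Every column should report a sparse dtype.
    print(df.dtypes)

    # Fraction of stored (non-fill) values across all columns.
    print(df.sparse.density)

    # Possible fallback if the Parquet engine rejects sparse columns:
    # convert to a dense frame first (this gives up the memory savings).
    dense = df.sparse.to_dense()
    print(dense.dtypes)
```

The dense fallback keeps the values intact but can be very large for wide sparse datasets, so it is probably only practical for the smaller ones.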
Test code: https://gist.github.com/prabhant/dfd25b894afbf4d102f7abee23376c41 |
In the provided example we lose some data compared to the sparse arff file itself.
Neither the old nor the new dataframe contains the molecule id properly (specifically, the molecule id). Old:
new:
This seems to stem from I think we probably should use the intermediate output to generate the sparse parquet file. This will also avoid us accidentally encoding a |
Here is a list of all datasets which I think are failing because of this issue and do not have parquet files:
'32359',
'40864',
'41079',
'41111',
'41120',
'41121',
'41122',
'41204',
'41205',
'41206',
'41238',
'42807',
'42825',
'43100',
'43105',
'43113',
'43123',
'43127',
'43138',
'43140',
'43180',
'43190',
'43192',
'43194',
'43198',
'43252',
'43256',
'43303',
'43304',
'43305',
'43306',
'43307',
'43308',
'43309',
'43310',
'43311',
'43312',
'43313',
'43315',
'43318',
'43319',
'43320',
'43321',
'43322',
'43323',
'43324',
'43325',
'43326',
'43327',
'43328',
'43331',
'43332',
'43335',
'43336',
'43337',
'43338',
'43339',
'43340',
'43341',
'43342',
'43343',
'43344',
'43345',
'43346',
'43347',
'43348',
'43349',
'43350',
'43351',
'43352',
'43353',
'43354',
'43355',
'43356',
'43357',
'43358',
'43360',
'43361',
'43362',
'43363',
'43364',
'43365',
'43366',
'43367',
'43368',
'43369',
'43370',
'43371',
'43372',
'43373',
'43374',
'43375',
'43376',
'43377',
'43378',
'43379',
'43380',
'43381',
'43382',
'43383',
'43384',
'43385',
'43386',
'43387',
'43388',
'43389',
'43390',
'43391',
'43392',
'43393',
'43394',
'43395',
'43396',
'43397',
'43398',
'43399',
'43400',
'43401',
'43402',
'43403',
'43404',
'43405',
'43406',
'43407',
'43408',
'43409',
'43410',
'43412',
'43413',
'43414',
'43415',
'43416',
'43417',
'43419',
'43420',
'43421',
'43422',
'43423',
'43424',
'43425',
'43426',
'43427',
'43428',
'43430',
'43431',
'43432',
'43433',
'43434',
'43435',
'43436',
'43437',
'43438',
'43439',
'43440',
'43441',
'43442',
'43443',
'43445',
'43446',
'43447',
'43448',
'43449',
'43450',
'43451',
'43452',
'43453',
'43454',
'43455',
'43456',
'43457',
'43458',
'43459',
'43460',
'43461',
'43463',
'43464',
'43465',
'43466',
'43467',
'43468',
'43470',
'43471',
'43472',
'43473',
'43474',
'43475',
'43476',
'43477',
'43478',
'43479',
'43480',
'43481',
'43482',
'43483',
'43484',
'43485',
'43486',
'43487',
'43488',
'43489',
'43490',
'43491',
'43492',
'43493',
'43495',
'43496',
'43497',
'43498',
'43499',
'43500',
'43501',
'43502',
'43503',
'43504',
'43505',
'43506',
'43507',
'43508',
'43509',
'43510',
'43511',
'43512',
'43513',
'43515',
'43516',
'43517',
'43518',
'43519',
'43520',
'43521',
'43522',
'43523',
'43524',
'43525',
'43526',
'43527',
'43528',
'43529',
'43530',
'43531',
'43532',
'43533',
'43534',
'43535',
'43536',
'43537',
'43538',
'43539',
'43540',
'43541',
'43542',
'43543',
'43544',
'43545',
'43546',
'43547',
'43548',
'43549',
'43550',
'43551',
'43552',
'43553',
'43554',
'43555',
'43556',
'43557',
'43558',
'43559',
'43560',
'43561',
'43562',
'43563',
'43564',
'43565',
'43566',
'43567',
'43568',
'43569',
'43570',
'43571',
'43572',
'43573',
'43574',
'43575',
'43576',
'43577',
'43578',
'43579',
'43580',
'43581',
'43582',
'43583',
'43584',
'43585',
'43586',
'43587',
'43588',
'43589',
'43590',
'43591',
'43592',
'43593',
'43594',
'43595',
'43596',
'43597',
'43598',
'43599',
'43600',
'43601',
'43602',
'43603',
'43604',
'43605',
'43606',
'43607',
'43608',
'43609',
'43610',
'43611',
'43612',
'43613',
'43614',
'43615',
'43616',
'43617',
'43618',
'43619',
'43620',
'43621',
'43622',
'43623',
'43624',
'43625',
'43626',
'43627',
'43628',
'43630',
'43631',
'43633',
'43634',
'43635',
'43636',
'43637',
'43638',
'43639',
'43640',
'43641',
'43642',
'43643',
'43644',
'43645',
'43646',
'43647',
'43648',
'43649',
'43650',
'43651',
'43652',
'43653',
'43654',
'43655',
'43656',
'43657',
'43658',
'43659',
'43660',
'43661',
'43662',
'43663',
'43664',
'43665',
'43666',
'43667',
'43668',
'43669',
'43670',
'43671',
'43672',
'43673',
'43674',
'43675',
'43676',
'43677',
'43678',
'43679',
'43680',
'43681',
'43682',
'43683',
'43684',
'43685',
'43686',
'43687',
'43688',
'43689',
'43690',
'43691',
'43692',
'43694',
'43695',
'43696',
'43697',
'43698',
'43699',
'43700',
'43701',
'43702',
'43703',
'43704',
'43705',
'43706',
'43707',
'43708',
'43709',
'43710',
'43711',
'43712',
'43713',
'43714',
'43715',
'43716',
'43717',
'43718',
'43719',
'43720',
'43721',
'43722',
'43723',
'43724',
'43725',
'43726',
'43727',
'43728',
'43729',
'43730',
'43731',
'43733',
'43734',
'43735',
'43736',
'43737',
'43738',
'43739',
'43740',
'43741',
'43742',
'43743',
'43744',
'43745',
'43746',
'43747',
'43748',
'43749',
'43750',
'43751',
'43752',
'43753',
'43754',
'43755',
'43756',
'43757',
'43758',
'43759',
'43760',
'43761',
'43762',
'43763',
'43764',
'43765',
'43766',
'43767',
'43768',
'43769',
'43770',
'43771',
'43772',
'43773',
'43774',
'43775',
'43776',
'43777',
'43778',
'43779',
'43780',
'43781',
'43782',
'43783',
'43784',
'43785',
'43786',
'43787',
'43788',
'43789',
'43790',
'43791',
'43792',
'43793',
'43794',
'43795',
'43796',
'43797',
'43798',
'43799',
'43800',
'43801',
'43802',
'43803',
'43804',
'43805',
'43806',
'43807',
'43808',
'43809',
'43810',
'43811',
'43812',
'43814',
'43815',
'43816',
'43817',
'43818',
'43819',
'43820',
'43821',
'43822',
'43823',
'43824',
'43825',
'43826',
'43827',
'43828',
'43829',
'43830',
'43831',
'43832',
'43833',
'43834',
'43835',
'43836',
'43837',
'43838',
'43839',
'43840',
'43841',
'43842',
'43843',
'43844',
'43845',
'43846',
'43847',
'43848',
'43849',
'43850',
|
This issue tracks the progress of sparse dataset support on the OpenML MinIO backend.
Currently, MinIO does not hold OpenML sparse datasets because pandas cannot write sparse datasets to Parquet by default.
Example