Should we remove datapipes? Yes #342

Open
peterdudfield opened this issue Jul 18, 2024 · 2 comments

peterdudfield commented Jul 18, 2024

Detailed Description

The idea is to remove torch datapipes from this repo.
We would essentially replace them with normal Python functions instead.
For our ML models, we can then wrap these in torch datasets afterwards.
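
A minimal sketch of what this could look like. The names `select_time_slice` and `SampleDataset` are hypothetical and not taken from this repo, and plain lists of `(timestamp, value)` pairs stand in for xarray objects. The `torch` import is optional so the snippet also runs where torch is not installed.

```python
from datetime import datetime, timedelta

try:
    from torch.utils.data import Dataset  # the real base class, if torch is installed
except ImportError:
    Dataset = object  # fallback so this sketch runs without torch


def select_time_slice(series, t0, history, forecast):
    """Plain function: keep points with t0 - history <= time <= t0 + forecast."""
    start, end = t0 - history, t0 + forecast
    return [(t, v) for t, v in series if start <= t <= end]


class SampleDataset(Dataset):
    """Hypothetical map-style dataset that wraps the plain function."""

    def __init__(self, series, t0_times, history, forecast):
        self.series = series
        self.t0_times = t0_times
        self.history = history
        self.forecast = forecast

    def __len__(self):
        return len(self.t0_times)

    def __getitem__(self, idx):
        t0 = self.t0_times[idx]
        return select_time_slice(self.series, t0, self.history, self.forecast)


# Toy half-hourly series, standing in for real PV or GSP data.
start = datetime(2024, 7, 18)
series = [(start + timedelta(minutes=30 * i), float(i)) for i in range(10)]
dataset = SampleDataset(
    series,
    t0_times=[series[2][0], series[5][0]],
    history=timedelta(minutes=60),
    forecast=timedelta(minutes=30),
)
first_sample = dataset[0]
```

The function can be tested and debugged on its own, and the dataset class stays a thin wrapper around it.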

pros and cons

| Pros of keeping datapipes | Cons of keeping datapipes |
| --- | --- |
| Less work not to do it | Not sure what the benefits are for the extra code |
| Torch data is good for streaming data | We use xarray, which is good for streaming large data |
| | It's complex and hard to make changes |
| | Datapipes are not well supported by the community |
| | Forking is annoying |
| | We have used some infinite loops, which are bad |
| | Debugging and logging is hard |
| | We can use torch dataset, which is widely used |

Possible Implementation

  1. Start a fresh repo and copy over the functions we need
  2. Refactor this repo:
    -- pull out functions from all datapipes
    -- rebuild the dataflow using the new functions, rebuilding these files

| Pros of option 1 (fresh repo) | Pros of option 2 (refactor this repo) |
| --- | --- |
| Nice to start with a fresh repo | No code duplication |
| Easier to refactor, don't need to worry about breaking tests | Can continue developing |
| Could get one entire pipeline working first, e.g. PVNet | Don't need to set up CI |
| | Might be able to iterate and solve each function |
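
As a rough illustration of the "pull out functions" step in option 2, the sketch below separates the logic from the iteration. `NormalizePipe` is a hypothetical stand-in for a datapipe-style class and `normalize` is the extracted function; neither is the repo's actual implementation, and plain lists stand in for real data.

```python
def normalize(values, mean, std):
    """Extracted logic: a pure function that is easy to test and log."""
    return [(v - mean) / std for v in values]


class NormalizePipe:
    """Stand-in for a datapipe-style class, where the logic is tied to iteration."""

    def __init__(self, source, mean, std):
        self.source = source
        self.mean = mean
        self.std = std

    def __iter__(self):
        for values in self.source:
            # After refactoring, the class only delegates to the function.
            yield normalize(values, self.mean, self.std)


batches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Old style: results are only reachable by iterating the pipe.
via_pipe = list(NormalizePipe(batches, mean=2.0, std=2.0))

# New style: call the function directly on one item.
direct = normalize(batches[0], mean=2.0, std=2.0)
```

Keeping the old class as a thin wrapper during the transition would let existing pipelines keep working while each function is moved over one at a time.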

What I would like to keep

  1. Configuration
  2. Batch structure
  3. Making building blocks, so it doesn't matter what order we do things in. Passing around mainly xarray objects seemed to work
  4. There are some good READMEs that help, and here
  5. The folder structure, which I think is nice
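
To illustrate the building-block idea from point 3, here is a small sketch in which every step takes and returns the same kind of object, so steps can be reordered. The functions `select_ids`, `drop_missing` and `run_steps` are hypothetical, and lists of dicts stand in for xarray objects. The two steps were picked because they happen to commute; that will not hold for every pair of steps.

```python
from functools import partial, reduce


def select_ids(records, ids):
    """Keep only records whose id is in the given set."""
    return [r for r in records if r["id"] in ids]


def drop_missing(records):
    """Drop records with no power value."""
    return [r for r in records if r["power"] is not None]


def run_steps(records, steps):
    """Apply each step in turn; every step takes and returns the same type."""
    return reduce(lambda data, step: step(data), steps, records)


records = [
    {"id": 1, "power": 0.4},
    {"id": 2, "power": None},
    {"id": 3, "power": 0.9},
    {"id": 1, "power": None},
]

keep_1_and_2 = partial(select_ids, ids={1, 2})
a = run_steps(records, [keep_1_and_2, drop_missing])
b = run_steps(records, [drop_missing, keep_1_and_2])
```

Both orderings give the same result here because each step is a filter that depends only on the record itself.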

Other things to do

  • We should remove any pipelines that are not being used, such as metnet and perceiver
  • Assess memory usage, and make sure the pipelines are efficient
  • Refactor duplicated NWP code
  • Remove 'is training' #348
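
One possible way to start on the memory-usage item is Python's standard `tracemalloc` module. The sketch below is an assumption about how this could be done, not an agreed approach; `peak_memory_bytes` and `build_sample` are hypothetical names, and `build_sample` stands in for a real pipeline step. Memory allocated by native libraries outside Python's allocator may not show up in these numbers.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_sample(n):
    # Hypothetical stand-in for one pipeline step that allocates memory.
    return [float(i) for i in range(n)]


sample, peak = peak_memory_bytes(build_sample, 100_000)
print(f"peak memory: {peak / 1e6:.1f} MB")
```

Wrapping individual functions like this is easier once the logic is no longer spread across chained datapipes.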
peterdudfield changed the title from "Should we remove datapipes" to "Should we remove datapipes? Yes" on Jul 25, 2024

peterdudfield commented Jul 29, 2024

Here's a list of all the current datapipes we have

  1. MergeNumpyBatchIterDataPipe
  2. MergeNumpyExamplesToBatchIterDataPipe
  3. MergeNumpyModalitiesIterDataPipe
  4. MergeNumpyModalitiesIterDataPipe
  5. ConvertLonLatToOSGBIterDataPipe
  6. ConvertOSGBToLonLatIterDataPipe
  7. ConvertGeostationaryToLonLatIterDataPipe
  8. StackXarrayIterDataPipe
  9. ConvertGSPToNumpyIterDataPipe
  10. ConvertPVToNumpyIterDataPipe
  11. ConvertGSPToNumpyBatchIterDataPipe
  12. ConvertNWPToNumpyBatchIterDataPipe
  13. ConvertPVToNumpyBatchIterDataPipe
  14. ConvertSatelliteToNumpyBatchIterDataPipe
  15. ConvertSensorToNumpyBatchIterDataPipe
  16. ConvertWindToNumpyBatchIterDataPipe
  17. OpenConfigurationIterDataPipe
  18. OpenSatelliteIterDataPipe
  19. OpenTopographyIterDataPipe
  20. OpenGSPFromDatabaseIterDataPipe
  21. OpenGSPIterDataPipe
  22. OpenGSPNationalIterDataPipe
  23. OpenNWPIterDataPipe
  24. OpenPVFromPVSitesDBIterDataPipe
  25. OpenPVFromNetCDFIterDataPipe
  26. OpenAWOSFromNetCDFIterDataPipe
  27. OpenMeteomaticsFromZarrIterDataPipe
  28. OpenWindFromNetCDFIterDataPipe
  29. ApplyPVDropoutIterDataPipe
  30. DrawDropoutTimeIterDataPipe
  31. ApplyDropoutTimeIterDataPipe
  32. FilterChannelsIterDataPipe
  33. FilterGSPIDsIterDataPipe
  34. FilterPvSysGeneratingOvernightIterDataPipe
  35. FilterPVSystemsWithOnlyNanInADayIterDataPipe
  36. FilterPVSystemsOnCapacityIterDataPipe
  37. FilterTimePeriodsIterDataPipe
  38. FilterTimesIterDataPipe
  39. FilterToOverlappingTimePeriodsIterDataPipe
  40. FindContiguousT0TimePeriodsIterDataPipe
  41. PickLocationsIterDataPipe
  42. PickLocationsAndT0sIterDataPipe
  43. PickT0TimesIterDataPipe
  44. SelectIDIterDataPipe
  45. SelectNonNaNTimesIterDataPipe
  46. SelectSpatialSlicePixelsIterDataPipe
  47. SelectSpatialSliceMetersIterDataPipe
  48. SelectTimeSliceIterDataPipe
  49. SelectTimeSliceNWPIterDataPipe
  50. PVNetSelectPVbyMLIDIterDataPipe
  51. ListMap
  52. SelectAllGSPSpatialSlicePixelsIterDataPipe
  53. SelectAllGSPSpatialSliceMetersIterDataPipe
  54. ConvertToNumpyBatchIterDataPipe
  55. DictDatasetIterDataPipe
  56. LoadDictDatasetIterDataPipe
  57. ConvertToNumpyBatchIterDataPipe
  58. AddFourierSpaceTimeIterDataPipe
  59. AddTopographicDataIterDataPipe
  60. AddSunPositionIterDataPipe
  61. AddT0IdxAndSamplePeriodDurationIterDataPipe
  62. ConvertPressureLevelsToSeparateVariablesIterDataPipe
  63. CreateSunImageIterDataPipe
  64. CreateTimeImageIterDataPipe
  65. DownsampleIterDataPipe
  66. NormalizeIterDataPipe
  67. ReprojectTopographyIterDataPipe
  68. UpSampleIterDataPipe
  69. CreateGSPImageIterDataPipe
  70. EnsureNGSPSPerExampleIterDataPipe
  71. AssignDayNightStatusIterDataPipe
  72. CreatePVImageIterDataPipe
  73. CreatePVMetadataImageIterDataPipe
  74. EnsureNPVSystemsPerExampleIterDataPipe
  75. PVFillNightNansIterDataPipe
  76. PVInterpolateInfillIterDataPipe
  77. PVPowerRollingWindowIterDataPipe
  78. PVPowerRemoveZeroDataIterDataPipe
  79. ZipperIterDataPipe
  80. RepeaterIterDataPipe
  81. UnZipperIterDataPipe
  82. LengthSetterIterDataPipe
  83. HeaderIterDataPipe
  84. CheckValueEqualToFractionIterDataPipe
  85. CheckGreaterThanOrEqualToIterDataPipe
  86. CheckLessThanOrEqualToIterDataPipe
  87. CheckNotEqualToIterDataPipe
  88. CheckNaNsIterDataPipe
  89. CheckVarsAndDimsIterDataPipe

peterdudfield commented

More detailed analysis here

@peterdudfield peterdudfield mentioned this issue Jul 30, 2024
6 tasks