Combine the best of both worlds of `nowcasting_dataset` and `power_perceiver`'s data loader:

- Have a separate script which writes fully prepared batches to disk (a sketch of such a script follows this list).
  - This script would operate like `power_perceiver`'s data loader: load a subset of contiguous data into RAM, and sample from it.
  - Save batches (or examples?) to disk. As Python pickles? Or numpy files?
  - The data would be completely ready to be loaded into the model, and already normalised. (The only disadvantage is that we'd have to use float32 for pretty much everything. But that should be OK, because there won't be a huge number of batches on disk at any one time.)
  - Write the batches into a directory like `train/2022-06-24T12:45/`. Always have two "done" folders, and be working on a third folder. The model code signals that it has completed one epoch by deleting the older of the two "done" folders. (Although we'll need a different system if multiple models are training from the same data.)
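A minimal sketch of what that script might look like, using `.npy` files. `load_contiguous_chunk` and `sample_example` are hypothetical stand-ins (random data here) for the dataset-specific parts; none of these names come from either repo:

```python
import datetime
import pathlib

import numpy as np

BATCH_SIZE = 32


def load_contiguous_chunk() -> np.ndarray:
    """Stand-in for loading a contiguous subset of the raw data into RAM."""
    return np.random.rand(1_000, 64, 64).astype(np.float32)


def sample_example(chunk: np.ndarray, seq_len: int = 12) -> np.ndarray:
    """Sample one example (a contiguous time slice) from the in-RAM chunk."""
    start = np.random.randint(0, len(chunk) - seq_len)
    return chunk[start : start + seq_len]


def write_batches(dst_root: pathlib.Path, n_batches: int) -> pathlib.Path:
    """Write fully prepared, normalised float32 batches to a timestamped dir."""
    batch_dir = dst_root / datetime.datetime.now().strftime("%Y-%m-%dT%H:%M")
    batch_dir.mkdir(parents=True, exist_ok=True)
    chunk = load_contiguous_chunk()
    for i in range(n_batches):
        examples = [sample_example(chunk) for _ in range(BATCH_SIZE)]
        # Already normalised and float32, so the model can load it directly.
        batch = np.stack(examples)
        np.save(batch_dir / f"batch_{i:06d}.npy", batch)
    return batch_dir
```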
Maybe have two repos: `helios_model` and `helios_data`?
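And a rough sketch of the "done"-folder rotation described in the list above. The `_DONE` marker file and the helper names are assumptions for illustration, not an existing convention:

```python
import pathlib
import shutil


def mark_done(batch_dir: pathlib.Path) -> None:
    """Writer side: flag a folder as complete once all batches are written."""
    (batch_dir / "_DONE").touch()


def done_dirs(root: pathlib.Path) -> list[pathlib.Path]:
    """All completed batch folders, oldest first (timestamped names sort)."""
    return sorted(d for d in root.iterdir() if (d / "_DONE").exists())


def finish_epoch(root: pathlib.Path) -> None:
    """Model side: signal that an epoch is done by deleting the oldest folder.

    The writer keeps two "done" folders on disk while preparing a third;
    deleting the oldest tells it to start another. (This is single-consumer:
    multiple models training from the same data would need a different scheme.)
    """
    done = done_dirs(root)
    if len(done) >= 2:
        shutil.rmtree(done[0])
```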
Advantages:

- Much less waiting!
  - Don't have to wait about half an hour for the model to start training.
  - Don't have to wait between epochs.
- Can manually examine the files on disk.
- Modularises the code.
Compared to `nowcasting_dataset`, this new approach would be much faster at creating a new set of batches from scratch. For example, if we change the structure of the data, we should only have to wait about an hour for a whole new epoch to be prepared.