When running deep learning experiments, I want to store many checkpoints so I can restart the experiment from any point in time (maybe with different hyperparameters). Training usually happens on a server, from which the checkpoints later need to be recovered. Since copying many files is slower than copying a single big file, this utility lets me save all the checkpoints for a given experiment in a single tar file, which also helps stay under the file-count quota on the cluster. On top of that, it writes the tar file in a non-blocking fashion so that it has minimal impact on training time.
Move checkpoints into a single tarfile asynchronously, after the training script has written the files to disk. The goal is to reduce the number of files, not the file size (checkpoints are already compressed).
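Conceptually, the non-blocking behavior comes down to handing the freshly saved files to a separate process that appends them to the archive. Below is a minimal sketch of that pattern, not the actual implementation in `tar_checkpoints.py` (names like `_tar_worker` are made up for illustration):

```python
import os
import tarfile
import multiprocessing as mp

def _tar_worker(tar_path, queue):
    # Append mode keeps adding members to the same archive across epochs.
    with tarfile.open(tar_path, "a") as tar:
        for epoch, paths in iter(queue.get, None):  # `None` is the shutdown sentinel
            for path in paths:
                tar.add(path, arcname=f"epoch_{epoch}/{os.path.basename(path)}")
                os.remove(path)  # the checkpoint now lives only inside the tar

if __name__ == "__main__":
    queue = mp.Queue()
    worker = mp.Process(target=_tar_worker, args=("my_tarfile.tar", queue))
    worker.start()
    # ... training loop: queue.put((epoch, [fp])) right after each model.save(fp) ...
    queue.put(None)   # ask the worker to stop
    worker.join()     # blocks until every pending checkpoint has been archived
```

The training process only pays the cost of a `queue.put`; all tar I/O happens in the worker.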
Go ahead and run `python tar_checkpoints.py` to see the demo in action 🎥
Intended use:
- Make the module `tar_checkpoints.py` available to your training loop (no pip package yet).
- Open the `TarCheckpoints` context.
- Do everything as usual inside the context.
- Move the saved checkpoint to the tar.
```python
from tar_checkpoints import TarCheckpoints

with TarCheckpoints("my_tarfile.tar") as tar_files:
    for i in range(100):  # epochs
        # Do training, then save the checkpoint to a file path `fp`.
        model.save(fp)
        tar_files(i, [fp])  # Non-blocking. Works in a separate process.
# Blocking when exiting the context. Allows the child process to finish all tasks.
```
To extract the files saved for a given epoch, you can use:
```python
epoch = 42
extraction_path = TarCheckpoints.extract("my_tarfile.tar", epoch)
print(f"Your checkpoint has been extracted to `{extraction_path}`")
```
I will publish it as a pip package if I get a ⭐ from a stranger 😄.