v0.0.16: T5 export and inference, general training fixes
What's Changed
Training
A few fixes related to precompilation and checkpointing. These fixes enable training LLMs on AWS Trainium instances without friction.
- Skip model saving during precompilation and provide option to skip cache push (#365)
- Fixes checkpoint saving and consolidation for TP (#378)
- A `torch_xla` compatible version of `safetensors.torch.save_file` is now used in the `NeuronTrainer` (#329)