This document defines a high-level roadmap for arena development.
-
Enhance Training Job
- Move to MPI-operator
- Set default CPU and memory limits for each type of training: tf-operator, MPI-operator
- Support Gang Scheduler
- PyTorch-operator
-
Training History Management
- Use CRD to manage the training history
-
Integrate with data
-
Multi-tenancy
-
Easy install
- End-to-end testing
- Unit tests
- Build arena Docker images automatically