ZZ Feature Release Plan 2022 01 30
Zhaobo edited this page Oct 5, 2022
Owner: David
- Unified job template for distributed/elastic/simple training jobs
- Dynamic job resize with a centralized GPU allocator
- Unified interface across framework implementations, with clear state transitions and control
Design choices:
1) How to aggregate logs (launcher pod)
2) Multiple GPUs per pod or one GPU per pod
3) StatefulSet or individual pods
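The "clear state transition and control" goal for the unified job interface could be made concrete with an explicit transition table. The sketch below is illustrative only: the state names, the `Job` class, and the allowed transitions are assumptions, since the plan does not enumerate them.

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical states; the plan does not list the real ones.
    PENDING = "pending"
    RUNNING = "running"
    RESIZING = "resizing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions (assumed). Terminal states map to an empty set.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.RESIZING, JobState.SUCCEEDED, JobState.FAILED},
    JobState.RESIZING: {JobState.RUNNING, JobState.FAILED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
}


class Job:
    """Framework-agnostic job handle with guarded state changes."""

    def __init__(self, name, workers):
        self.name = name
        self.workers = workers
        self.state = JobState.PENDING

    def transition(self, new_state):
        # Reject any move that the table does not explicitly allow.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def resize(self, workers):
        # A dynamic resize passes through RESIZING and returns to RUNNING.
        self.transition(JobState.RESIZING)
        self.workers = workers
        self.transition(JobState.RUNNING)
```

Keeping the table in one place means each framework backend only has to report events, while the legality of a transition is checked centrally.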
Owner: Hao
- Process-level GPU memory usage control by intercepting CUDA driver API calls
- GPU thread usage control; explore different methods and evaluate the delay introduced by usage control
  - Evaluate LD_PRELOAD setting conflicts between the platform and user scripts (environment variable or binary)
  - Time the start and end of the kernel launch function; sleep when a thread is over its usage limit
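The "time the launch, then sleep" idea can be illustrated with a small duty-cycle calculation. This is a sketch under assumptions, not the planned implementation: the real control would live in an intercepted driver call, and the names `required_sleep`, `throttled_launch`, and the `limit` parameter are hypothetical. Note also that a CUDA kernel launch normally returns before the kernel finishes, so timing the launch call alone may understate actual GPU busy time.

```python
import time


def required_sleep(busy_seconds, limit):
    """Sleep needed so that busy / (busy + sleep) does not exceed `limit`.

    `limit` is the allowed busy fraction in (0, 1] (an assumed knob).
    """
    if not 0.0 < limit <= 1.0:
        raise ValueError("limit must be in (0, 1]")
    return busy_seconds * (1.0 - limit) / limit


def throttled_launch(kernel, limit, clock=time.perf_counter, sleep=time.sleep):
    """Run `kernel`, measure how long it took, then sleep off the excess.

    `kernel` is any zero-argument callable standing in for the
    intercepted launch function.
    """
    start = clock()
    result = kernel()
    busy = clock() - start
    pause = required_sleep(busy, limit)
    if pause > 0:
        sleep(pause)
    return result, busy, pause
```

With `limit=0.5`, a call that was busy for 2 seconds is followed by a 2 second sleep, which is exactly the added delay the plan proposes to evaluate.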
Owner: Ziyu
- Store fine-grained job execution records in MongoDB, e.g. number of workers, worker resource utilization distribution, etc.
- Persist data with a local persistent volume
  - Collect failed/duplicate job records; update running job records
  - Design ML framework and CUDA-level metrics exporter
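One possible shape for a job execution record is sketched below. The field names and the summary statistics are assumptions, and a plain dict keyed by `_id` stands in for the MongoDB collection so the example runs without a database. Keying on the job ID is what lets a duplicate record collapse into one entry and a running job's record be updated in place.

```python
from dataclasses import dataclass, field, asdict
from statistics import mean


@dataclass
class JobRecord:
    # All field names are hypothetical.
    job_id: str
    status: str
    num_workers: int
    worker_gpu_util: list = field(default_factory=list)  # percent per worker

    def to_document(self):
        """Convert to a dict shaped like a MongoDB document."""
        doc = asdict(self)
        doc["_id"] = doc.pop("job_id")  # MongoDB's unique key field
        utils = self.worker_gpu_util
        doc["gpu_util_summary"] = (
            {"min": min(utils), "mean": mean(utils), "max": max(utils)}
            if utils
            else None
        )
        return doc


def upsert(store, record):
    """Insert or replace the record in `store`, a dict keyed by `_id`."""
    doc = record.to_document()
    store[doc["_id"]] = doc
    return doc
```

Calling `upsert` twice for the same job ID, first with status "Running" and later with "Succeeded", leaves a single record holding the latest values.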
Owner: Yaohui
- New score function and synthetic device ID management to support Alnair vGPU resources
- Scheduling with forecasts (GPU utilization and job completion time)
Owner: Zhaobo
- MLPerf benchmark
- Average job completion time benchmark (group of jobs) and cluster resource utilization
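The two benchmark quantities above can be computed as in the following sketch. The definitions are assumptions, since the plan does not fix them: job completion time is taken as finish minus submit time, and cluster utilization as allocated GPU-seconds divided by available GPU-seconds within a measurement window.

```python
def average_jct(jobs):
    """Mean completion time for a group of jobs.

    `jobs` is a list of (submit_time, finish_time) pairs in seconds.
    """
    if not jobs:
        raise ValueError("no jobs given")
    return sum(finish - submit for submit, finish in jobs) / len(jobs)


def cluster_gpu_utilization(allocations, total_gpus, window_start, window_end):
    """Fraction of GPU-seconds allocated within [window_start, window_end].

    `allocations` is a list of (start, end, gpus) tuples.
    """
    capacity = total_gpus * (window_end - window_start)
    if capacity <= 0:
        raise ValueError("window and GPU count must be positive")
    used = 0.0
    for start, end, gpus in allocations:
        # Count only the part of each allocation inside the window.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            used += overlap * gpus
    return used / capacity
```

For example, three jobs with completion times of 100, 50, and 150 seconds give an average of 100 seconds.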
Owner: Steven
- Track and report PyTorch/TensorFlow/Horovod job execution performance and bottlenecks with Nsight and TensorBoard
- Optimize ML framework and job placement