Conceptual code repo for Azure Batch for AI
- Auto Scaling
- Mount Azure File Share, Network File System
- BYO (Bring Your Own) File System
- Use GPUs, InfiniBand, MPI, gRPC
- Hierarchical Quota Management
- Ability to submit and monitor thousands of jobs in parallel
- Multi-machine distributed jobs
- Output directory management
- Environment variables
- SSH setup for multi-node jobs
- Automatic directory mappings from VM, including mounted volumes
- Azure Managed container repository
- Bring Your Own container
- Caching of containers
- CNTK
- TensorFlow
- Caffe
- Chainer
- Publish Model
- StdOut, StdErr, Model Logs, intermediate models
- Role-Based Access Control (RBAC)
- REST API, with support for Python, C#, Java and more
- Part of Azure CLI
- Part of Azure Portal
- Cluster and Task Management
- Job Monitoring and Retries
- Proven Stability and Scale
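
To make the "submit and monitor thousands of jobs in parallel" and "Job Monitoring and Retries" items more concrete, here is a minimal local sketch of that pattern. It does not call the Batch AI service or any Azure SDK. The names `run_with_retries`, `submit_all`, and `flaky_job` are hypothetical and exist only for illustration, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_retries(job, job_id, max_retries=3):
    """Call job(job_id, attempt), retrying on any exception.

    Hypothetical helper: gives up after max_retries attempts in total.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return job(job_id, attempt)
        except Exception as exc:  # treat every failure as retryable in this sketch
            last_error = exc
    raise RuntimeError(f"job {job_id} failed after {max_retries} attempts") from last_error


def submit_all(job, job_ids, max_workers=8, max_retries=3):
    """Run all jobs in parallel and return a {job_id: (state, result)} map."""
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its job id so completions can be recorded.
        futures = {
            pool.submit(run_with_retries, job, jid, max_retries): jid
            for jid in job_ids
        }
        for future in as_completed(futures):
            jid = futures[future]
            try:
                statuses[jid] = ("succeeded", future.result())
            except RuntimeError:
                statuses[jid] = ("failed", None)
    return statuses


def flaky_job(job_id, attempt):
    """Hypothetical stand-in job: odd ids fail on their first attempt only."""
    if job_id % 2 == 1 and attempt == 1:
        raise ValueError("transient failure")
    return job_id * 10


if __name__ == "__main__":
    results = submit_all(flaky_job, range(1000))
    failed = [jid for jid, (state, _) in results.items() if state == "failed"]
    print(f"{len(results)} jobs finished, {len(failed)} failed")
```

In the real service, submission, retries, and monitoring would go through the REST API or the Azure CLI rather than a local thread pool. The sketch only illustrates the submit, retry, and collect-status loop.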