This project enables quantized inference for high-resolution images, supporting integer (INT8) and half-precision (FLOAT16/BFLOAT16) quantization for single-GPU inference. For scaled images (e.g., beyond 2048×2048 or 4096×4096), it additionally leverages Spatial Parallelism (a parallelism technique for Distributed Deep Learning) with support for half-precision quantization. We evaluate quantized inference on several datasets, including a real-world pathology dataset (CAMELYON16) and the image classification datasets ImageNet, CIFAR-10, and Imagenette, achieving accuracy degradation of less than 1%.
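As a rough illustration of the single-GPU half-precision path (not this repo's benchmark code), the sketch below runs a torchvision ResNet101 under FP16 autocast and reports latency and peak memory against an FP32 baseline; the model, batch size, and image size are illustrative assumptions.

```python
# Minimal sketch: FP32 baseline vs. FP16 autocast inference on one GPU.
# Reduce the batch or image size if your GPU has limited memory.
import time
import torch
import torchvision

model = torchvision.models.resnet101(weights=None).cuda().eval()
images = torch.randn(8, 3, 1024, 1024, device="cuda")

def run(use_fp16):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16, enabled=use_fp16):
        model(images)
    torch.cuda.synchronize()
    return time.time() - start, torch.cuda.max_memory_allocated() / 2**20

for label, fp16 in (("FP32 baseline", False), ("FP16", True)):
    elapsed, peak_mib = run(fp16)
    print(f"{label}: {elapsed:.3f} s, peak memory {peak_mib:.0f} MiB")
```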
- Python 3.8 or later (for Linux, Python 3.8.1+ is needed).
- NCCL
- PyTorch >= 1.13.1
- TensorRT (required only for INT8 quantization)
Note: We used the following versions during implementation and testing: Python 3.9.16, CUDA 11.6, GCC 10.3.0, CMake 3.22.2, PyTorch 1.13.1.
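An optional sanity check for the software stack above (the versions printed need not match the tested ones exactly; newer releases may also work):

```python
# Quick environment check before running the benchmarks.
import torch

print("PyTorch :", torch.__version__)             # tested with 1.13.1
print("CUDA    :", torch.version.cuda)            # tested with 11.6
print("GPU OK  :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("NCCL    :", torch.cuda.nccl.version())  # needed for the nccl backend
```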
cd Infer-HiRes
python setup.py install
Figure 1. Throughput and memory evaluation on a single GPU for the ResNet101 model with different image sizes and batch size 32. The speedup and memory reduction are shown in the respective colored boxes for FP16, BFLOAT16, and INT8 compared to the FP32 baseline. Overall, we achieved an average 6.5x speedup and 4.55x memory reduction on a single GPU using INT8 quantization.
Figure 3. Throughput and memory evaluation using SP+LP for the ResNet101 model with an image size of 4096x4096. The evaluation compares FP16 and BFLOAT16 quantized models against the FP32 baseline. By utilizing Distributed DL, we enabled inference for scaled images, achieving an average 1.58x speedup and 1.57x memory reduction using half-precision.
Example to run ResNet model inference with INT8 quantization:
python benchmarks/spatial_parallelism/benchmark_resnet_inference.py \
--batch-size ${batch_size} \
--image-size ${image_size} \
--precision int_8 \
--datapath ${datapath} \
--checkpoint ${checkpoint} \
--enable-evaluation &>> $OUTFILE 2>&1
Note: Supported quantization (precision) options include 'int_8' (INT8), 'fp_16' (FLOAT16), and 'bp_16' (BFLOAT16). To train your model instead of evaluating it, remove the '--enable-evaluation' flag.
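For intuition only, here is a hypothetical sketch (not taken from the benchmark scripts) of how the half-precision options above could map onto PyTorch dtypes at inference time; the helper name and structure are assumptions, and the INT8 path is omitted because it goes through TensorRT.

```python
import torch

# Hypothetical mapping; the keys mirror the '--precision' values above.
PRECISION_DTYPES = {
    "fp_16": torch.float16,   # FLOAT16
    "bp_16": torch.bfloat16,  # BFLOAT16
}

def infer(model, batch, precision="fp_16"):
    # FP32 baseline runs without autocast; INT8 (TensorRT) is not shown here.
    if precision not in PRECISION_DTYPES:
        with torch.inference_mode():
            return model(batch)
    with torch.inference_mode(), torch.autocast("cuda", dtype=PRECISION_DTYPES[precision]):
        return model(batch)
```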
Example to run the ResNet model with the model partition (split size) set to two, the spatial partition set to four, and half-precision quantization:
mpirun_rsh --export-all -np $total_np \
--hostfile ${hostfile} \
python benchmarks/spatial_parallelism/benchmark_resnet_sp.py \
--batch-size ${batch_size} \
--split-size 2 \
--slice-method square \
--num-spatial-parts 4 \
--image-size ${image_size} \
--backend nccl \
--precision fp_16 \
--datapath ${datapath} \
--checkpoint ${checkpoint} \
--enable-evaluation &>> $OUTFILE 2>&1
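To make the `--slice-method square --num-spatial-parts 4` configuration concrete, the sketch below shows one way an input batch could be split into four square quadrants, one per spatial rank. This is an illustrative assumption rather than the repo's actual partitioning code, which additionally handles halo exchange between neighbouring parts so that convolutions at tile boundaries remain correct.

```python
import torch

def square_split_4(images):
    """Split an (N, C, H, W) batch into four quadrants (a 2x2 grid),
    one quadrant per spatial rank."""
    h, w = images.shape[-2:]
    return [
        images[..., : h // 2, : w // 2],  # top-left     -> spatial rank 0
        images[..., : h // 2, w // 2 :],  # top-right    -> spatial rank 1
        images[..., h // 2 :, : w // 2],  # bottom-left  -> spatial rank 2
        images[..., h // 2 :, w // 2 :],  # bottom-right -> spatial rank 3
    ]

parts = square_split_4(torch.randn(1, 3, 4096, 4096))
print([tuple(p.shape) for p in parts])  # four (1, 3, 2048, 2048) tiles
```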
Refer to Spatial Parallelism, Layer Parallelism, and Single GPU for more benchmarks.
- MPI4DL : https://github.com/OSU-Nowlab/MPI4DL
- Arpan Jain, Ammar Ahmad Awan, Asmaa M. Aljuhani, Jahanzeb Maqbool Hashmi, Quentin G. Anthony, Hari Subramoni, Dhabaleswar K. Panda, Raghu Machiraju, and Anil Parwani. 2020. GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). IEEE Press, Article 45, 1–15. https://doi.org/10.1109/SC41405.2020.00049
- Arpan Jain, Aamir Shafi, Quentin Anthony, Pouya Kousha, Hari Subramoni, and Dhabaleswar K. Panda. 2022. Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters. In High Performance Computing: 37th International Conference, ISC High Performance 2022, Hamburg, Germany, May 29 – June 2, 2022, Proceedings. Springer-Verlag, Berlin, Heidelberg, 109–130. https://doi.org/10.1007/978-3-031-07312-0_6