This project enables quantized inference for high-resolution images, supporting integer (INT8) and half-precision (FLOAT16/BFLOAT16) quantization for single-GPU inference. For scaled images (e.g., beyond 2048×2048 or 4096×4096), it additionally leverages Spatial Parallelism (a parallelism technique for Distributed Deep Learning) with support for half-precision quantization. We evaluate quantized inference on several datasets, including a real-world pathology dataset (CAMELYON16) and the image classification datasets ImageNet, CIFAR-10, and Imagenette, achieving accuracy degradation of less than 1%.
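As a rough illustration of the single-GPU half-precision path (not this repo's benchmark code), the sketch below runs a torchvision ResNet101 under FP16 autocast and reports latency and peak memory against an FP32 baseline; the model, batch size, and image size are illustrative assumptions.

```python
# Minimal sketch: FP32 baseline vs. FP16 autocast inference on one GPU.
# Reduce the batch or image size if your GPU has limited memory.
import time
import torch
import torchvision

model = torchvision.models.resnet101(weights=None).cuda().eval()
images = torch.randn(8, 3, 1024, 1024, device="cuda")

def run(use_fp16):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16, enabled=use_fp16):
        model(images)
    torch.cuda.synchronize()
    return time.time() - start, torch.cuda.max_memory_allocated() / 2**20

for label, fp16 in (("FP32 baseline", False), ("FP16", True)):
    elapsed, peak_mib = run(fp16)
    print(f"{label}: {elapsed:.3f} s, peak memory {peak_mib:.0f} MiB")
```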
- Python 3.8 or later (for Linux, Python 3.8.1+ is needed).
- NCCL
- PyTorch >= 1.13.1
- TensorRT (required only for INT8 quantization)
Note: We used the following versions during implementation and testing: Python 3.9.16, CUDA 11.6, GCC 10.3.0, CMake 3.22.2, PyTorch 1.13.1.
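An optional sanity check for the software stack above (the versions printed need not match the tested ones exactly; newer releases may also work):

```python
# Quick environment check before running the benchmarks.
import torch

print("PyTorch :", torch.__version__)             # tested with 1.13.1
print("CUDA    :", torch.version.cuda)            # tested with 11.6
print("GPU OK  :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("NCCL    :", torch.cuda.nccl.version())  # needed for the nccl backend
```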
cd Infer-HiRes
python setup.py install
Figure 1. Throughput and memory evaluation on a single GPU for the ResNet101 model with different image sizes and batch size 32. The speedup and memory reduction are shown in the respective colored boxes for FP16, BFLOAT16, and INT8 compared to the FP32 baseline. Overall, we achieved an average 6.5x speedup and 4.55x memory reduction on a single GPU using INT8 quantization.
Figure 3. Throughput and memory evaluation using SP+LP for the ResNet101 model with an image size of 4096x4096. The evaluation compares FP16 and BFLOAT16 quantized models against the FP32 baseline. By utilizing Distributed DL, we enabled inference for scaled images, achieving an average 1.58x speedup and 1.57x memory reduction using half-precision.
Example to run ResNet model inference with INT8 quantization:
python benchmarks/spatial_parallelism/benchmark_resnet_inference.py \
--batch-size ${batch_size} \
--image-size ${image_size} \
--precision int_8 \
--datapath ${datapath} \
--checkpoint ${checkpoint} \
--enable-evaluation &>> $OUTFILE 2>&1
Note: Supported quantization (precision) options include 'int_8' (INT8), 'fp_16' (FLOAT16), and 'bp_16' (BFLOAT16). To train your model instead of evaluating it, remove the '--enable-evaluation' flag.
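For intuition only, here is a hypothetical sketch (not taken from the benchmark scripts) of how the half-precision options above could map onto PyTorch dtypes at inference time; the helper name and structure are assumptions, and the INT8 path is omitted because it goes through TensorRT.

```python
import torch

# Hypothetical mapping; the keys mirror the '--precision' values above.
PRECISION_DTYPES = {
    "fp_16": torch.float16,   # FLOAT16
    "bp_16": torch.bfloat16,  # BFLOAT16
}

def infer(model, batch, precision="fp_16"):
    # FP32 baseline runs without autocast; INT8 (TensorRT) is not shown here.
    if precision not in PRECISION_DTYPES:
        with torch.inference_mode():
            return model(batch)
    with torch.inference_mode(), torch.autocast("cuda", dtype=PRECISION_DTYPES[precision]):
        return model(batch)
```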
Example to run the ResNet model with the model partition (split size) set to two, the spatial partition set to four, and half-precision quantization:
mpirun_rsh --export-all -np $total_np \
--hostfile ${hostfile} \
python benchmarks/spatial_parallelism/benchmark_resnet_sp.py \
--batch-size ${batch_size} \
--split-size 2 \
--slice-method square \
--num-spatial-parts 4 \
--image-size ${image_size} \
--backend nccl \
--precision fp_16 \
--datapath ${datapath} \
--checkpoint ${checkpoint} \
--enable-evaluation &>> $OUTFILE 2>&1
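To make the `--slice-method square --num-spatial-parts 4` configuration concrete, the sketch below shows one way an input batch could be split into four square quadrants, one per spatial rank. This is an illustrative assumption rather than the repo's actual partitioning code, which additionally handles halo exchange between neighbouring parts so that convolutions at tile boundaries remain correct.

```python
import torch

def square_split_4(images):
    """Split an (N, C, H, W) batch into four quadrants (a 2x2 grid),
    one quadrant per spatial rank."""
    h, w = images.shape[-2:]
    return [
        images[..., : h // 2, : w // 2],  # top-left     -> spatial rank 0
        images[..., : h // 2, w // 2 :],  # top-right    -> spatial rank 1
        images[..., h // 2 :, : w // 2],  # bottom-left  -> spatial rank 2
        images[..., h // 2 :, w // 2 :],  # bottom-right -> spatial rank 3
    ]

parts = square_split_4(torch.randn(1, 3, 4096, 4096))
print([tuple(p.shape) for p in parts])  # four (1, 3, 2048, 2048) tiles
```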
Refer to Spatial Parallelism, Layer Parallelism, and Single GPU for more benchmarks.
- MPI4DL : https://github.com/OSU-Nowlab/MPI4DL
- Arpan Jain, Ammar Ahmad Awan, Asmaa M. Aljuhani, Jahanzeb Maqbool Hashmi, Quentin G. Anthony, Hari Subramoni, Dhabaleswar K. Panda, Raghu Machiraju, and Anil Parwani. 2020. GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). IEEE Press, Article 45, 1–15. https://doi.org/10.1109/SC41405.2020.00049
- Arpan Jain, Aamir Shafi, Quentin Anthony, Pouya Kousha, Hari Subramoni, and Dhabaleswar K. Panda. 2022. Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters. In High Performance Computing: 37th International Conference, ISC High Performance 2022, Hamburg, Germany, May 29 – June 2, 2022, Proceedings. Springer-Verlag, Berlin, Heidelberg, 109–130. https://doi.org/10.1007/978-3-031-07312-0_6