GitHub - quettabit/convolution_kernel: Accelerating CNN's convolution operation on GPUs by using memory-efficient data access patterns.

About the Project

During the training of Convolutional Neural Networks (CNNs), the convolutional layers are typically the most time-consuming. We therefore set out to accelerate the forward-pass convolution operation on GPUs, which directly reduces the time spent in those layers.

Researchers are actively working on reducing the time complexity of convolution through methods such as the Winograd algorithm and FFT-based convolution.

In our literature survey, we found that comparatively few researchers are working on accelerating general matrix multiplication (GEMM) based convolution through efficient memory access patterns. Having noticed this, we decided to implement and verify one of their techniques.
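To make "GEMM-based convolution" concrete, here is a minimal CPU sketch in plain Python. It is not the kernel from this repository and not cuDNN's implementation; the function names (`conv2d_direct`, `im2col`, `matmul`, `conv2d_gemm`) are hypothetical and chosen only for illustration. It assumes a single channel, no padding, and CNN-style convolution (no filter flip).

```python
def conv2d_direct(image, kernel, stride=1):
    """Sliding-window convolution, used here as a reference result."""
    h, w, k = len(image), len(image[0]), len(kernel)
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            out[i][j] = acc
    return out


def im2col(image, k, stride=1):
    """Copy every k*k patch of the image into one row of a patch matrix."""
    h, w = len(image), len(image[0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    patches = []
    for i in range(out_h):
        for j in range(out_w):
            patches.append([image[i * stride + di][j * stride + dj]
                            for di in range(k) for dj in range(k)])
    return patches, out_h, out_w


def matmul(a, b):
    """Naive matrix product standing in for an optimized GEMM routine."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def conv2d_gemm(image, kernel, stride=1):
    """Convolution expressed as im2col followed by one matrix product."""
    k = len(kernel)
    patches, out_h, out_w = im2col(image, k, stride)
    # One column per filter; a layer with F filters would have F columns.
    filter_col = [[kernel[di][dj]] for di in range(k) for dj in range(k)]
    flat = matmul(patches, filter_col)
    return [[flat[i * out_w + j][0] for j in range(out_w)] for i in range(out_h)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(conv2d_direct(image, kernel) == conv2d_gemm(image, kernel))
```

Note that the patch matrix for this 4*4 image holds 4 * 9 = 36 values against 16 in the input, because overlapping patches are copied several times. That duplication is one reason memory access patterns matter for this family of methods.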

Our convolution kernel is based on the algorithms described in the conference paper "Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs", which was accepted at DAC'17.

We benchmarked our implementation against the single-precision general matrix multiplication (SGEMM) based convolution kernel available in NVIDIA's cuDNN library, using nvprof to collect the timings.

Special thanks to Peter Goldsborough for his blog post and gist, which explain how to use the convolution algorithm routine available in the cuDNN library. Without his work, we would have had a hard time working through the cuDNN developer guide to benchmark our kernel.

Benchmarking Environment

OS : Ubuntu 16.04.3 LTS

GPU : GeForce GTX 650 Ti BOOST

CUDA Driver Version : 9.0

CUDA Runtime Version : 8.0

CUDA Capability Version : 3.0

cuDNN Major Version : 7

Benchmarking Results

For benchmarking, we refer to our implementation of the memory-efficient kernel as Kernel A and to cuDNN's SGEMM-based convolution kernel as Kernel B.

Here are some of the results from the benchmarking process.

For a stride of 1, a 3*3 filter, and 1 channel:

Kernel	Image Dimension	Avg. Time
Kernel A	2048*2048	8.2038 ms
Kernel B	2048*2048	15.149 ms
Kernel A	1024*1024	2.0776 ms
Kernel B	1024*1024	3.7918 ms
Kernel A	512*512	531.65 µs
Kernel B	512*512	955.65 µs

From the table above, Kernel A takes roughly 44% to 46% less time than Kernel B at every tested image size, which corresponds to a speedup of about 1.8x.
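The reduction can be checked directly from the table. The short script below is only a worked calculation on the reported averages (converted to microseconds); the helper names `reduction_percent` and `speedup` are hypothetical and exist just for this illustration.

```python
# Average times from the table above, as (Kernel A, Kernel B) in microseconds.
timings_us = {
    "2048*2048": (8203.8, 15149.0),
    "1024*1024": (2077.6, 3791.8),
    "512*512": (531.65, 955.65),
}


def reduction_percent(time_a, time_b):
    """Percentage of Kernel B's time saved by Kernel A."""
    return (1.0 - time_a / time_b) * 100.0


def speedup(time_a, time_b):
    """How many times faster Kernel A is than Kernel B."""
    return time_b / time_a


for size, (a, b) in timings_us.items():
    print(f"{size}: {reduction_percent(a, b):.1f}% less time, "
          f"{speedup(a, b):.2f}x speedup")
```

For these three image sizes the reduction comes out between 44% and 46%, so Kernel A runs in a little over half the time of Kernel B.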

