Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project 2: Nischal K N #11

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
126 changes: 121 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,127 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Nischal K N
* Tested on: Windows 10, i7-2670QM @ 2.20GHz 8GB, GTX 540 2GB (Personal)

### (TODO: Your README)
### SUMMARY
This project implements scan, stream compaction and sorting algorithms implemented on GPU and its performance compared against a CPU implementation. Two kinds of exclusive scans are implemented, viz., naive and efficient. Stream compaction is implemented using the efficient GPU scan implementation. Additionally Radix sort is also implemented using efficient stream compaction. The performance of these algorithms are measured against CPU and thrust implementations and are documented in Analysis section.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### BUILD
* To setup cuda development environment use [this guide](https://github.com/nischalkn/Project0-CUDA-Getting-Started/blob/master/INSTRUCTION.md#part-1-setting-up-your-development-environment)
* Once the environment is set up, to Build and run the project follow [this](https://github.com/nischalkn/Project0-CUDA-Getting-Started/blob/master/INSTRUCTION.md#part-3-build--run)

### PARAMETERS
* `SIZE` - Size of array to be scanned, compacted and sorted
* `PROFILE` - `1` or `0`, Print execution times

### PERFORMANCE ANALYSIS
The time taken to perform exclusive scans on arrays of different sizes ranging from `2^8` or `2^16` were recorded with a fixed block size of `128` (program crashed for array sizes beyond `2^16`). It was seen that the time taken by the CPU was very small to be recorded accurately for small array sizes but would increase exponentially as the array size increases. It is also seen that the naive scan outperforms work-efficient scan at larger array sizes because of the overhead of upsweep and downsweep steps which increase logarithmically with increase in array size. Most of the threads do not do any work in the extreme stages of the upsweep and downsweep. Many threads are launched but do not perform any work. This overhead causes a dramatic increase in execution time. It was also observed that the thrust implementation consistently was the slowest. The following are the time taken for different kind of scans in milliseconds.

|Scan | 2^8 | 2^10 | 2^12 | 2^14 | 2^16 |
|:--------------------------------|-----|-------|-------|-------|-------|
|cpu scan, power-of-two | 0 | 0 | 0 | 0 | 0 |
|cpu scan, non-power-of-two | 0 |0 | 0 |0 | 0.5004|
|naive scan, power-of-two | 0.5357 |0.7699| 0.8006| 1.2946| 2.4013|
|naive scan, non-power-of-two | 0.6124| 0.6252| 0.6991| 1.0961| 2.4055|
|work-efficient scan, power-of-two| 0.0750| 0.1054| 0.2469| 0.8709| 3.5367|
|work-efficient scan, non-power-of-two| 0.0745| 0.1050| 0.2465| 0.8685| 3.5334|
|thrust scan, power-of-two | 6.7073| 6.8454| 7.4349| 10.7565| 28.5635|
|thrust scan, non-power-of-two | 0.7849| 0.8866| 1.6581| 4.8936| 16.7807|

![](images/scan_array.png)

An other experiment was conducted with various block sizes for a constant array length of `2^16`. It is seen that the best performance is obtained with a block size of `128`.
![](images/scan_block.png)
CPU compactions for array sizes less than `2^14` take negligible amount of time. The work-efficient compactions is the slowest because of the same reason as mentioned above. A number of threads are launched which do not do any work.

|Compact | 2^8 | 2^10 | 2^12 | 2^14 | 2^16 |
|:----------------------------------------|-----|-------|-------|-------|-----|
|cpu compact without scan, power of two |0 |0 |0 |0.5029 |0.500066667|
|cpu compact without scan, non-power of two |0 |0 |0 |0.5019 |0.50035|
|cpu compact with scan |0 |0 |0 |0.4998 |0.999733333|
|work-efficient compact, power-of-two |0.089568 |0.123530667 |0.280234667| 0.946997333 |3.793076667|
|work-efficient compact, non-power-of-two |0.086741333 |0.118101333| 0.276192 |0.947157333 |3.79866|


![](images/compact_array.png)
Again running the compactions on different block sizes result in a best block size of `128`.
![](images/compact_block.png)
A radix sort was implemented using the work-efficient scan and its performance was compared with the `thrust::sort`. It was seen that for small array sizes, the radix sort outperforms thrust sort. But however as the array size increases, the work-efficient scan becomes slow as explained above and reduces the effciency of sorting algorithm.
![](images/sort.png)

### FUTURE IMPROVEMENTS
* Work-efficient scan can be optimized to reduce the number of threads launched at each stage of upsweep and downsweep process.
* Early termination of threads would also improve the scan performance.

### OUTPUT
```

****************
** SCAN TESTS **
****************
[ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 35 0 ]
==== cpu scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
==== cpu scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ]
passed
==== naive scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
passed
==== naive scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
passed
==== work-efficient scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ]
passed
==== thrust scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
passed
==== thrust scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 1 ]
passed
==== cpu compact with scan ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ]
passed
==== work-efficient compact, non-power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 1 ]
passed

*****************************
***** RADIX SORT TESTS ******
*****************************
==== thrust sort small array ====
[ 0 1 2 3 4 5 6 7 8 9 ]
==== Radix Sort small array ====
[ 0 1 2 3 4 5 6 7 8 9 ]
passed
==== thrust sort large array ====
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
==== Radix Sort large array ====
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
passed
```

### MODIFICATIONS
* radixSort was added to StreamCompaction module containing the sort function.
* sort function was also added to thrust.cu module to perform `thrust::sort`.
* 4 tests are added to the main function for sorting. 2 test to sort an array using `thrust::sort` and 2 tests to sort the array using `StreamCompaction::radixSort`.
* CMakeLists.txt of StreamCompaction was edited to include `radixSort.h` and `radixSort.cu`.
Binary file added images/compact_array.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/compact_block.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/scan_array.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/scan_block.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/sort.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
48 changes: 39 additions & 9 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,11 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radixSort.h>
#include "testing_helpers.hpp"

int main(int argc, char* argv[]) {
const int SIZE = 1 << 8;
const int SIZE = 1 << 16;
const int NPOT = SIZE - 3;
int a[SIZE], b[SIZE], c[SIZE];

Expand Down Expand Up @@ -43,37 +44,37 @@ int main(int argc, char* argv[]) {
zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
Expand Down Expand Up @@ -112,12 +113,41 @@ int main(int argc, char* argv[]) {
zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("***** RADIX SORT TESTS ******\n");
printf("*****************************\n");

int t1[10] = { 4, 1, 6, 3, 8, 9, 2, 0, 5, 7 };
int t2[10];
int t3[10];
printDesc("thrust sort small array");
StreamCompaction::Thrust::sort(10, t2, t1);
printArray(10, t2, true);

printDesc("Radix Sort small array");
StreamCompaction::radixSort::sort(10, t3, t1);
printArray(10, t3, true);
printCmpResult(10, t2, t3);

genArray(SIZE, a, 200);
zeroArray(SIZE, c);
printDesc("thrust sort large array");
StreamCompaction::Thrust::sort(SIZE, c, a);
printArray(SIZE, c, true);

zeroArray(SIZE, b);
printDesc("Radix Sort large array");
StreamCompaction::radixSort::sort(SIZE, b, a);
printArray(SIZE, b, true);
printCmpResult(10, c, b);
}
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@ set(SOURCE_FILES
"efficient.cu"
"thrust.h"
"thrust.cu"
"radixSort.h"
"radixSort.cu"
)

cuda_add_library(stream_compaction
Expand Down
11 changes: 9 additions & 2 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,10 @@ namespace Common {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;
bools[index] = idata[index] ? 1 : 0;
}

/**
Expand All @@ -32,7 +35,11 @@ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
return;
if (bools[index] == 1)
odata[indices[index]] = idata[index];
}

}
Expand Down
5 changes: 5 additions & 0 deletions stream_compaction/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,11 @@
#include <cstdio>
#include <cstring>
#include <cmath>
#include <ctime>
#include <chrono>
#include <iostream>

#define PROFILE 0

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
Expand Down
72 changes: 66 additions & 6 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
#include <cstdio>
#include "cpu.h"
#include "common.h"

namespace StreamCompaction {
namespace CPU {
Expand All @@ -8,8 +9,21 @@ namespace CPU {
* CPU scan (prefix sum).
*/
void scan(int n, int *odata, const int *idata) {
// TODO
printf("TODO\n");
odata[0] = 0;

#if PROFILE
auto begin = std::chrono::high_resolution_clock::now();
#endif

for (int i = 1; i < n; i++)
{
odata[i] = odata[i - 1] + idata[i - 1];
}

#if PROFILE
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Time Elapsed for cpu scan(size " << n << "): " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "ns" << std::endl;
#endif
}

/**
Expand All @@ -18,8 +32,27 @@ void scan(int n, int *odata, const int *idata) {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
// TODO
return -1;
long j = 0;

#if PROFILE
auto begin = std::chrono::high_resolution_clock::now();
#endif

for (int i = 0; i < n; i++)
{
if (idata[i] != 0)
{
odata[j] = idata[i];
j++;
}
}

#if PROFILE
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Time Elapsed for compact without scan(size " << n << "): " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "ns" << std::endl;
#endif

return j;
}

/**
Expand All @@ -28,8 +61,35 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
// TODO
return -1;
int *temporal;
int *pscan;
long j = 0;
temporal = (int*)malloc(sizeof(int)*n);
pscan = (int*)malloc(sizeof(int)*n);

#if PROFILE
auto begin = std::chrono::high_resolution_clock::now();
#endif
for (int i = 0; i < n; i++)
{
temporal[i] = idata[i] ? 1 : 0;
}
scan(n, pscan, temporal);
for (int i = 0; i < n; i++)
{
if (temporal[i] == 1)
{
odata[pscan[i]] = idata[i];
j++;
}
}

#if PROFILE
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Time Elapsed for compact with scan(size " << n << "): " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "ns" << std::endl;
#endif

return j;
}

}
Expand Down
Loading