Project 2: Bowen Bao #16

Open, wants to merge 6 commits into base: master
144 changes: 139 additions & 5 deletions README.md
@@ -3,11 +3,145 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Bowen Bao
* Tested on: Windows 10, i7-6700K @ 4.00GHz 32GB, GTX 1080 8192MB (Personal Computer)

### (TODO: Your README)
## Overview

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
Here's the list of features of this project:

1. CPU Scan and Stream Compaction
2. Naive GPU Scan
3. Efficient GPU Scan and Stream Compaction
4. Thrust Scan
5. Optimize efficient GPU Scan
6. Radix Sort based on GPU Scan
7. Benchmark suite
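
For readers new to the two core operations in this list, here is a minimal CPU sketch of an exclusive scan and of stream compaction built on it. The names `exclusiveScan` and `compactWithScan` and the `std::vector` interface are illustrative assumptions; the functions in this project take raw pointers and have different signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], and out[0] = 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Stream compaction: keep the non-zero elements, preserving their order.
// (Hypothetical helper for illustration, not the project's API.)
std::vector<int> compactWithScan(const std::vector<int>& in) {
    // Step 1: map each element to a 0/1 flag.
    std::vector<int> flags(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        flags[i] = (in[i] != 0) ? 1 : 0;
    }
    // Step 2: an exclusive scan of the flags gives each kept element its output index.
    const std::vector<int> indices = exclusiveScan(flags);
    // Step 3: scatter the kept elements to their output positions.
    const std::size_t count =
        in.empty() ? 0 : static_cast<std::size_t>(indices.back() + flags.back());
    std::vector<int> out(count);
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flags[i] != 0) {
            assert(static_cast<std::size_t>(indices[i]) < count);
            out[static_cast<std::size_t>(indices[i])] = in[i];
        }
    }
    return out;
}
```

The GPU versions follow the same three steps, with each step running in parallel.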

## Instructions to Run

I changed a few function signatures to make benchmarking more flexible: functions can now report their processing time, and the block size can be changed without recompiling. The only change visible to callers is that they must pass a `double` by reference to receive the measured processing time.
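
As a rough illustration of the by-reference timing parameter, the sketch below times a CPU compaction loop with `std::chrono` and writes the elapsed milliseconds into the caller's `double`. The name `compactTimed` and the choice of milliseconds are assumptions made for this example; the real signatures live in the `stream_compaction` sources.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for a timed CPU compaction routine.
// Returns the number of kept elements; elapsedMs receives the time spent in the loop.
int compactTimed(int n, int* out, const int* in, double& elapsedMs) {
    assert(n >= 0);
    const auto start = std::chrono::steady_clock::now();

    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (in[i] != 0) {
            out[count++] = in[i];
        }
    }

    const auto end = std::chrono::steady_clock::now();
    // Convert the duration to milliseconds as a floating-point value.
    elapsedMs = std::chrono::duration<double, std::milli>(end - start).count();
    return count;
}
```

A caller declares `double time;` once and reuses it across calls, which is the pattern the tests in `main.cpp` follow.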

I added a benchmark suite for testing the run time of each implementation under different parameter settings. Also, I inserted a few tests for radix sort into the original main function.

## Performance Analysis
### Performance of different implementations

![](/image/process_time.png)

Here are the test results for each method. The tests were run with a block size of 256, which testing on a range of values showed to be near optimal. For each method, I ran 100 independent tests and averaged the processing time.

The GPU versions of scan do indeed perform better than the CPU scan.

### Performance of GPU methods under different block sizes

![](/image/process_time_blocksize.png)

The tests were run with a stream length of 2^24. Each method was run 100 times and the average recorded. Performance starts to degrade once the block size exceeds 256.
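
The README does not show how the kernels derive their launch configuration. A common approach when the block size is a runtime parameter is ceiling division, sketched below; `blocksFor` is a hypothetical helper, not a function from this project.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * blockSize >= n (ceiling division).
int blocksFor(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}
```

Under this assumption, a stream of 2^24 elements needs 65536 blocks at a block size of 256 and 16384 blocks at a block size of 1024.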

## Extra Credits
### Improving GPU Scan
See part 3 in Questions.

### Radix Sort
I followed the algorithm in the slides and implemented radix sort on top of the GPU scan function. One subtlety is the sign bit: for every other bit, numbers with a 0 sort before numbers with a 1, but numbers with a 1 in the most significant bit are negative and therefore smaller than those with a 0, so the order is reversed for that bit. I tested the radix sort with a hand-crafted case containing negative numbers and with a large random case.
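
The sign-bit handling can be sketched on the CPU as follows. This is a serial illustration under stated assumptions (32-bit two's complement `int`, one split pass per bit from least to most significant), not the GPU code from this pull request; `splitByBit` and `radixSort` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");

// One stable split pass on a single bit, driven by an exclusive scan.
std::vector<int> splitByBit(const std::vector<int>& in, int bit) {
    assert(bit >= 0 && bit < 32);
    const int n = static_cast<int>(in.size());
    const bool isSignBit = (bit == 31);

    // front[i] = 1 if element i belongs in the front partition.
    // Ordinary bits: 0 goes first. Sign bit: 1 (negative) goes first.
    std::vector<int> front(n), scanned(n);
    for (int i = 0; i < n; ++i) {
        const bool bitSet = ((static_cast<unsigned>(in[i]) >> bit) & 1u) != 0;
        front[i] = (bitSet == isSignBit) ? 1 : 0;
    }

    // Exclusive scan of the flags: scanned[i] = front elements before i.
    int running = 0;
    for (int i = 0; i < n; ++i) {
        scanned[i] = running;
        running += front[i];
    }
    const int totalFront = running;

    // Scatter: front elements keep their scanned index, the rest follow in order.
    std::vector<int> out(n);
    for (int i = 0; i < n; ++i) {
        const int dst = front[i] ? scanned[i] : totalFront + (i - scanned[i]);
        out[dst] = in[i];
    }
    return out;
}

// Least-significant-bit-first radix sort over all 32 bits.
std::vector<int> radixSort(std::vector<int> v) {
    for (int bit = 0; bit < 32; ++bit) {
        v = splitByBit(v, bit);
    }
    return v;
}
```

Because every pass is stable, the final pass on the sign bit only has to move the negative numbers in front of the non-negative ones; their relative order is already correct from the earlier passes.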

## Questions
* Roughly optimize the block sizes of each of your implementations for minimal
run time on your GPU.
* (You shouldn't compare unoptimized implementations to each other!)

See Performance Analysis.

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
(with array size on the independent axis).
* You should use CUDA events for timing GPU code. Be sure **not** to include
any *initial/final* memory operations (`cudaMalloc`, `cudaMemcpy`) in your
performance measurements, for comparability. Note that CUDA events cannot
time CPU code.
* You can use the C++11 `std::chrono` API for timing CPU code. See this
[Stack Overflow answer](http://stackoverflow.com/a/23000049) for an example.
Note that `std::chrono` may not provide high-precision timing. If it does
not, you can either use it to time many iterations, or use another method.
* To guess at what might be happening inside the Thrust implementation (e.g.
allocation, memory copy), take a look at the Nsight timeline for its
execution. Your analysis here doesn't have to be detailed, since you aren't
even looking at the code for the implementation.

See Performance Analysis.

* Write a brief explanation of the phenomena you see here.
* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
it different for each implementation?

One problem with the straightforward work-efficient GPU scan is that most threads are wasted: each thread checks whether its index is a multiple of the current interval and does nothing if it is not. One improvement is to launch only as many threads as there are active elements, treat the thread index as the original index divided by the interval, and multiply back inside the thread to recover the actual array index. This saves a lot of useless work. In the original implementation the share of idle threads doubles at every level, so the waste becomes more significant as the stream gets longer.
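
The index remapping can be illustrated with a serial emulation in which the inner loop variable `t` plays the role of the thread index. This is a sketch assuming a power-of-two length, not the kernel code from this pull request; `workEfficientScan` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// In-place exclusive scan (up-sweep then down-sweep); x.size() must be a power of two.
void workEfficientScan(std::vector<int>& x) {
    const int n = static_cast<int>(x.size());
    assert(n > 0 && (n & (n - 1)) == 0);
    assert(n <= (1 << 29));  // keeps stride * 2 within int range

    // Up-sweep: at each level only n / stride "threads" are needed.
    for (int stride = 2; stride <= n; stride *= 2) {
        const int activeThreads = n / stride;
        for (int t = 0; t < activeThreads; ++t) {
            const int k = t * stride;  // recover the real array offset from t
            x[k + stride - 1] += x[k + stride / 2 - 1];
        }
    }

    // Clear the root, then push partial sums back down the tree.
    x[n - 1] = 0;
    for (int stride = n; stride >= 2; stride /= 2) {
        const int activeThreads = n / stride;
        for (int t = 0; t < activeThreads; ++t) {
            const int k = t * stride;
            const int left = x[k + stride / 2 - 1];
            x[k + stride / 2 - 1] = x[k + stride - 1];
            x[k + stride - 1] += left;
        }
    }
}
```

With this indexing the up-sweep uses n/2 + n/4 + ... + 1 = n - 1 threads in total. A version that launches one thread per element at every level and filters with a modulo test would instead use n threads at each of the log2(n) levels.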

* Paste the output of the test program into a triple-backtick block in your
README.
* If you add your own tests (e.g. for radix sort or to test additional corner
cases), be sure to mention it explicitly.

See Output.

## Output

****************
** SCAN TESTS **
****************
[ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 35 0 ]
==== cpu scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
==== cpu scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ]
passed
==== naive scan, power-of-two ====
passed
==== naive scan, non-power-of-two ====
passed
==== work-efficient scan, power-of-two ====
passed
==== work-efficient scan, non-power-of-two ====
passed
==== thrust scan, power-of-two ====
passed
==== thrust scan, non-power-of-two ====
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 1 ]
passed
==== cpu compact with scan ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
passed
==== work-efficient compact, non-power-of-two ====
passed
==== work-efficient compact, power-of-two, last non-zero ====
passed
==== work-efficient compact, power-of-two, last zero ====
passed
==== work-efficient compact, test on special case 1 ====
passed
==== work-efficient compact, test on special case 2 ====
passed
==== cpu compact without scan, test on special case 1 ====
passed
==== radix sort, test on special case ====
[ 0 5 -2 6 3 7 -5 2 7 1 ]
sorted:
[ -5 -2 0 1 2 3 5 6 7 7 ]
passed
==== radix sort, test ====
[ 38 7719 1238 2437 8855 1797 8365 2285 450 612 5853 8100 1142 ... 5085 6505 ]
sorted:
[ 0 0 0 0 0 0 0 1 1 1 1 1 1 ... 9999 9999 ]
passed
Binary file added image/process_time.png
Binary file added image/process_time_blocksize.png
197 changes: 190 additions & 7 deletions src/main.cpp
@@ -12,11 +12,132 @@
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"
#include <string>
#include <algorithm>


void bench_mark()
{
int run_times = 100;
const int block_choice = 5;
const int input_choice = 4;
const int method_choice = 7;

int blockSizes[] = { 32, 128, 256, 512, 1024 };
int inputSizes[] = { 8, 16, 20, 24 };
std::string methods[] = {
"cpu scan",
"naive scan",
"eff scan",
"thrust scan",
"cpu compact no scan",
"cpu compact",
"gpu compact"
};

// Zero-initialize: the loops below accumulate into result with +=.
double result[method_choice * block_choice * input_choice] = {};

for (int i = 0; i < run_times; ++i)
{
printf("=========Running %d round ... \n", i+1);
for (int j = 0; j < block_choice; ++j)
{
printf("==================Running on blockSize %d ... \n", blockSizes[j]);
for (int k = 0; k < input_choice; ++k)
{
printf("=============================Running on inputSize 2^%d ... \n", inputSizes[k]);
int idx;
// generate input
int SIZE = 1 << inputSizes[k];
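// Value-initialize: genArray below is asked to fill only SIZE - 1 entries, so the last one stays 0.
int *a = new int[SIZE]();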
int *b = new int[SIZE];

genArray(SIZE - 1, a, 1000);
int cur_method = 0;

// cpu scan
idx = block_choice * input_choice * (cur_method++) + j * input_choice + k;
printf("==test on pos %d \n", idx);
zeroArray(SIZE, b);
result[idx] += StreamCompaction::CPU::scan(SIZE, b, a);

// naive scan
idx = block_choice * input_choice * (cur_method++) + j * input_choice + k;
printf("==test on pos %d \n", idx);
zeroArray(SIZE, b);
result[idx]
+= StreamCompaction::Naive::scan(SIZE, b, a, blockSizes[j]);

// work-efficient scan
idx = block_choice * input_choice * (cur_method++) + j * input_choice + k;
printf("==test on pos %d \n", idx);
zeroArray(SIZE, b);
result[idx]
+= StreamCompaction::Efficient::scan(SIZE, b, a, blockSizes[j]);

// thrust scan
idx = block_choice * input_choice * (cur_method++) + j * input_choice + k;
printf("==test on pos %d \n", idx);
zeroArray(SIZE, b);
result[idx]
+= StreamCompaction::Thrust::scan(SIZE, b, a);

// cpu compact no scan
idx = block_choice * input_choice * (cur_method++) + j * input_choice + k;
printf("==test on pos %d \n", idx);
zeroArray(SIZE, b);
double time;
StreamCompaction::CPU::compactWithoutScan(SIZE, b, a, time);
result[idx] += time;

// cpu compact
idx = block_choice * input_choice * (cur_method++) + j * input_choice + k;
printf("==test on pos %d \n", idx);
zeroArray(SIZE, b);
StreamCompaction::CPU::compactWithScan(SIZE, b, a, time);
result[idx] += time;

// gpu compact
idx = block_choice * input_choice * (cur_method++) + j * input_choice + k;
printf("==test on pos %d \n", idx);
zeroArray(SIZE, b);
StreamCompaction::Efficient::compact(SIZE, b, a, time);
result[idx] += time;

delete[] a;
delete[] b;
}
}
}

// print result
printf("===================== RESULTS ========================\n");
for (int j = 0; j < block_choice; ++j)
{
printf("======= block size %d ===========\n", blockSizes[j]);

for (int i = 0; i < method_choice; ++i)
{
printf("==== method %s ==== ", methods[i].c_str());
for (int k = 0; k < input_choice; ++k)
{
printf(" %d input %f time ", inputSizes[k], result[block_choice * input_choice * i + j * input_choice + k] / run_times);
}
printf("\n");
}

printf("=====================================\n");
}
}


int main(int argc, char* argv[]) {
const int SIZE = 1 << 8;
const int SIZE = 1 << 16;
const int NPOT = SIZE - 3;
int a[SIZE], b[SIZE], c[SIZE];
//int a[SIZE], b[SIZE], c[SIZE];
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

// Scan tests

@@ -89,35 +210,97 @@ int main(int argc, char* argv[]) {

int count, expectedCount, expectedNPOT;

double time;

zeroArray(SIZE, b);
printDesc("cpu compact without scan, power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a, time);
expectedCount = count;
printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);

zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a, time);
expectedNPOT = count;
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a, time);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
count = StreamCompaction::Efficient::compact(SIZE, c, a, time);
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
count = StreamCompaction::Efficient::compact(NPOT, c, a, time);
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
a[SIZE - 1] = 5;
printDesc("work-efficient compact, power-of-two, last non-zero");
count = StreamCompaction::Efficient::compact(SIZE, c, a, time);
int *bb = new int[SIZE];
int cpuCount = StreamCompaction::CPU::compactWithoutScan(SIZE, bb, a, time);
//printArray(count, c, true);
printCmpLenResult(count, cpuCount, bb, c);

zeroArray(SIZE, c);
a[SIZE - 1] = 0;
printDesc("work-efficient compact, power-of-two, last zero");
count = StreamCompaction::Efficient::compact(SIZE, c, a, time);
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

printDesc("work-efficient compact, test on special case 1");
int test[5] = { 1, 0, 1, 0, 1 };
count = StreamCompaction::Efficient::compact(5, c, test, time);
printCmpLenResult(count, 3, c, c);

printDesc("work-efficient compact, test on special case 2");
int test1[5] = { 1, 0, 1, 0, 0 };
count = StreamCompaction::Efficient::compact(5, c, test1, time);
printCmpLenResult(count, 2, c, c);

printDesc("cpu compact without scan, test on special case 1");
count = StreamCompaction::CPU::compactWithoutScan(5, c, test, time);
printCmpLenResult(count, 3, c, c);



//bench_mark();

int testArr[] = { 0, 5, -2, 6, 3, 7, -5, 2, 7, 1 };
int resultArr[10];
int goalArr[] = { -5, -2, 0, 1, 2, 3, 5, 6, 7, 7 };

StreamCompaction::Efficient::radix_sort(10, resultArr, testArr);
printDesc("radix sort, test on special case");
printArray(10, testArr, true);
printf(" sorted:\n");
printArray(10, resultArr, true);
printCmpResult(10, goalArr, resultArr);

genArray(SIZE, a, 10000);
StreamCompaction::Efficient::radix_sort(SIZE, b, a);
printDesc("radix sort, test");
printArray(SIZE, a, true);
printf(" sorted:\n");
printArray(SIZE, b, true);
std::sort(a, a + SIZE);
printCmpResult(SIZE, a, b);


delete[] a;
delete[] b;
delete[] c;
delete[] bb;
}
Binary file added stat.xlsx
Binary file not shown.