Project 2: AUSTIN ENG #4

Open: wants to merge 11 commits into base: master

94 changes: 89 additions & 5 deletions README.md
@@ -3,11 +3,95 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Austin Eng
* Tested on: Windows 10, i7-4770K @ 3.50GHz 16GB, GTX 780 3072MB (Personal Computer)

### (TODO: Your README)
## Analysis

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
**Note: Reported results are the average of 100 trials. Input sizes increase by powers of two, so both input size and runtime span several orders of magnitude; both axes are therefore displayed on a log scale.**

![Scan Analysis](https://docs.google.com/spreadsheets/d/1x1MppbyAceIIrwhDLsmV7unUYS2RYU_I20_wK0ReORY/pubchart?oid=175703576&format=image)

![Compaction Analysis](https://docs.google.com/spreadsheets/d/1x1MppbyAceIIrwhDLsmV7unUYS2RYU_I20_wK0ReORY/pubchart?oid=477396612&format=image)

For smaller input sizes, the CPU implementation of both scan and stream compaction is much faster than the GPU implementations. When working over contiguous buffers of memory, the CPU benefits heavily from its cache, which makes the serial loop very fast. At around 2^19 elements, however, the more efficient GPU implementations begin to outperform the CPU: with only a single core, CPU runtime doubles every time the input size doubles.
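
For reference, the serial exclusive scan the CPU baseline performs amounts to a single linear pass over contiguous memory; the function below is an illustrative sketch, not the project's exact `StreamCompaction::CPU::scan`.

```cuda
// Serial exclusive scan: one linear pass, very cache-friendly.
void cpuExclusiveScanSketch(int n, int *odata, const int *idata) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        odata[i] = sum;      // write the running total before adding idata[i]
        sum += idata[i];
    }
}
```

CPU compaction without scan is a similarly simple single pass that copies each non-zero element forward and counts how many were kept.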

Meanwhile, on the GPU, that work is divided across many cores, so runtime grows far more slowly as the input doubles. The GPU implementations do pay a much larger fixed cost for memory access and kernel launches, which is why they are much slower at small input sizes. That memory overhead grows only linearly with the data, so for larger datasets the GPU wins on performance.

In comparing the Naive and Efficient GPU implementations, we see that for smaller datasets the Naive implementation is faster. This is probably because fewer kernel invocations are made: the naive scan launches one kernel per level, while the work-efficient scan launches separate up-sweep and down-sweep kernels. Even though the naive scan performs O(n log n) additions rather than the O(n) of the work-efficient scan, that extra arithmetic still costs less than the additional launch overhead at small sizes. As the input size increases, however, the more work-efficient method pulls ahead.
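
For context, one pass of a naive (Hillis-Steele style) scan looks roughly like the kernel below; it is launched log2(n) times with offset = 1, 2, 4, ..., ping-ponging between two buffers. Names are illustrative, not the project's exact code.

```cuda
// One pass of a naive scan: every element at index >= offset adds in the
// element `offset` positions to its left; the rest are copied through.
__global__ void kernNaiveScanPass(int n, int offset, int *odata, const int *idata) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    odata[k] = (k >= offset) ? idata[k - offset] + idata[k] : idata[k];
}
// After log2(n) passes this yields an inclusive scan; shifting the result
// right by one and inserting a leading zero gives the exclusive scan.
```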

I did not see much difference between power-of-two and non-power-of-two input sizes. This is likely because my implementation simply pads non-power-of-two inputs up to the next power of two, so both cases execute essentially the same amount of work.
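
A minimal sketch of that padding, using a hypothetical helper rather than the project's actual utility:

```cuda
// Round n up to the next power of two; the scan then runs on a zero-padded
// device buffer of this size (zero is the identity for addition).
int nextPowerOfTwo(int n) {
    int p = 1;
    while (p < n) p <<= 1;
    return p;
}
// e.g. scanning 2^23 - 3 elements internally allocates and scans 2^23 slots.
```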

### More Efficient `Efficient` Scan

It turns out that my initial implementation of the Efficient scan was the extra-credit implementation. Instead of launching the same number of threads for every up-sweep and down-sweep pass, we decrease the number to avoid wasted threads and increase occupancy. Why? For the up-sweep, every iteration needs half as many threads as the previous one; the rest do nothing. For the down-sweep, the first iteration uses just one thread and each subsequent iteration doubles that number. A more efficient way to implement the Efficient scan is to launch only the number of threads actually needed and have each thread's calculated index jump by a power of two: `index = 2^d * (blockIdx.x * blockDim.x + threadIdx.x)`. Now the indices jump 2 -- 4 -- 6 -- 8, or 16 -- 32 -- 48 -- 64, etc. We launch only the threads we need instead of launching `n` threads and using far, far less than half of them.
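
Below is a sketch of this compacted-thread launch for both sweeps. Kernel names, the `stride` parameter, and the host-side loop are illustrative; the indexing is equivalent to the `2^d` formula above (each thread's base index jumps by the current stride).

```cuda
// Up-sweep pass at a given stride (stride = 2, 4, 8, ..., n). Only
// n / stride threads are launched; each handles one segment of `stride`
// elements and folds the left half's sum into the segment's last slot.
__global__ void kernUpSweepCompacted(int n, int stride, int *data) {
    int base = stride * (blockIdx.x * blockDim.x + threadIdx.x);  // 0, stride, 2*stride, ...
    if (base + stride - 1 >= n) return;
    data[base + stride - 1] += data[base + stride / 2 - 1];
}

// Down-sweep pass at a given stride (stride = n, n/2, ..., 2), after
// setting data[n - 1] = 0 on the host. Again only n / stride threads run.
__global__ void kernDownSweepCompacted(int n, int stride, int *data) {
    int base = stride * (blockIdx.x * blockDim.x + threadIdx.x);
    if (base + stride - 1 >= n) return;
    int left  = base + stride / 2 - 1;
    int right = base + stride - 1;
    int t = data[left];
    data[left]  = data[right];    // pass the parent's partial sum to the left child
    data[right] += t;             // right child becomes parent + old left value
}

// Host side (sketch): launch ceil((n / stride) / blockSize) blocks per pass
// instead of ceil(n / blockSize), so no threads sit idle.
```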


### Why is Thrust So Fast?

It seems like the Thrust implementation gets a big performance boost from using shared memory. Judging by the kernel names (`accumulate_tiles`, `exclusive_scan`, `exclusive_downsweep`), Thrust appears to do the same thing as the work-efficient implementation, except that `accumulate_tiles` uses 32 bytes of static and 4064 bytes of dynamic shared memory, `exclusive_scan` uses 48 and 12240, and `exclusive_downsweep` uses 32 and 6880. This likely allows much more efficient memory access inside each kernel. Profiling also shows that each kernel is called twice, wrapped in `cuda_task` and `parallel_group`, probably because the computation has to be split into tiles since shared memory is limited in size.
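
For reference, the Thrust path boils down to a call to `thrust::exclusive_scan`; a minimal usage sketch follows (the wrapper name is hypothetical, and the project's `stream_compaction/thrust.cu` may differ in details).

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Hypothetical wrapper showing how the Thrust scan is typically invoked.
void thrustScanSketch(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);  // host -> device copy
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);   // device -> host copy
}
```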

## Test Output
```
****************
** SCAN TESTS **
****************
[ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 10 0 ]
==== cpu scan, power-of-two ====
Elapsed: 10.0046ms
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473618 205473628 ]
==== cpu scan, non-power-of-two ====
Elapsed: 9.0066ms
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473524 205473568 ]
passed
==== naive scan, power-of-two ====
Elapsed: 9.708448ms
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473618 205473628 ]
passed
==== naive scan, non-power-of-two ====
Elapsed: 9.713088ms
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
Elapsed: 4.019968ms
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473618 205473628 ]
passed
==== work-efficient scan, non-power-of-two ====
Elapsed: 3.999136ms
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473524 205473568 ]
passed
==== thrust scan, power-of-two ====
Elapsed: 0.906560ms
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473618 205473628 ]
passed
==== thrust scan, non-power-of-two ====
Elapsed: 1.042912ms
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 205473524 205473568 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
Elapsed: 17.0074ms
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
Elapsed: 17.0071ms
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ]
passed
==== cpu compact with scan ====
Elapsed: 6.0037ms
Elapsed: 31.0118ms
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ]
passed
==== work-efficient compact, power-of-two ====
Elapsed: 5.496416ms
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ]
passed
==== work-efficient compact, non-power-of-two ====
Elapsed: 5.449856ms
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ]
passed
```
39 changes: 39 additions & 0 deletions calc_stats.js
@@ -0,0 +1,39 @@

var fs = require("fs");

// Parse a log of repeated test runs and report the average elapsed time
// per (input size, test name) pair.
fs.readFile(process.argv[2], function (err, data) {
    if (err) throw err;

    var stats = {};

    var output = data.toString();
    // Matches "==== <test name> ====" followed by its "Elapsed: <ms>ms" line.
    var re = /\n==== ([\s\S]+?) ====[^=]+Elapsed: ([\.\d]+)ms/g;
    var outputs = output.split("SIZE: ");

    for (var i = 1; i < outputs.length; ++i) {
        var out = outputs[i];
        var size = parseFloat(out.match(/\d+/)[0]);
        var match = re.exec(out);
        while (match != null) {
            if (!(size in stats)) {
                console.log('initing', size);
                stats[size] = {};
            }
            if (!(match[1] in stats[size])) {
                console.log('initing', size, match[1]);
                stats[size][match[1]] = [0, 0]; // [trial count, total elapsed ms]
            }
            stats[size][match[1]][0] += 1;
            stats[size][match[1]][1] += parseFloat(match[2]);

            match = re.exec(out);
        }
    }

    // Convert running totals to averages.
    for (var i in stats) {
        for (var j in stats[i]) {
            stats[i][j] = stats[i][j][1] / stats[i][j][0];
        }
    }
    console.log(stats);
});
231 changes: 125 additions & 106 deletions src/main.cpp
@@ -13,111 +13,130 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

void test(int SIZE) {
const int NPOT = SIZE - 3;
int* a = new int[SIZE];
int* b = new int[SIZE];
int* c = new int[SIZE];

// Scan tests

printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("cpu scan, power-of-two");
StreamCompaction::CPU::scan(SIZE, b, a);
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
printf("*****************************\n");

// Compaction tests

genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

int count, expectedCount, expectedNPOT;

zeroArray(SIZE, b);
printDesc("cpu compact without scan, power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
expectedCount = count;
printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);

zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
expectedNPOT = count;
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

delete[] a;
delete[] b;
delete[] c;
}

int main(int argc, char* argv[]) {
const int SIZE = 1 << 8;
const int NPOT = SIZE - 3;
int a[SIZE], b[SIZE], c[SIZE];

// Scan tests

printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("cpu scan, power-of-two");
StreamCompaction::CPU::scan(SIZE, b, a);
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
//printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
printf("*****************************\n");

// Compaction tests

genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

int count, expectedCount, expectedNPOT;

zeroArray(SIZE, b);
printDesc("cpu compact without scan, power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
expectedCount = count;
printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);

zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
expectedNPOT = count;
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
//printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
int SIZE = 1 << 23;
test(SIZE);
/*for (int pow = 4; pow < 24; ++pow) {
int SIZE = 1 << pow;
printf("====== SIZE: %d ======\n", SIZE);

for (int i = 0; i < 100; ++i) {
test(SIZE);
}
}*/

}
2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_35
)