Project 2: Gabriel Naghi #15

Open · wants to merge 24 commits into master
149 changes: 144 additions & 5 deletions README.md
@@ -3,11 +3,150 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Gabriel Naghi
* Tested on: Windows 7, i7-6700 @ 3.70GHz 16GB, Quadro K620 222MB (Moore 100C Lab)

### (TODO: Your README)
## Overview

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
In this project, I implemented three Stream Compaction algorithms.

Stream compaction is the process of compressing a list of elements: removing any elements
that don't match some criterion while keeping the "good" elements in their original order.
This algorithm will be useful in the future when doing ray tracing, where once rays have escaped
the scene, we no longer need to process them and thus want to eliminate them from our list.

![](images/stream_compaction.jpg)

*Source: Nvidia GPU Gems 3*

This is equivalent to the Python list comprehension one-liner:

~~~

output_list = [ elt for elt in input_list if elt != 0 ]

~~~

While this process is easily done in an iterative fashion, we can also employ parallel algorithms
to compute the compacted array more quickly. These parallel algorithms first map the list to a temporary boolean
array, which then undergoes an "exclusive scan". The scan result gives the index of each kept element in the output array, and the kept elements are written to those indices in a simple operation called a "scatter."
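
Concretely, the three stages look roughly like this on the CPU. This is a minimal illustrative sketch of the same map, scan, and scatter pipeline that the parallel versions implement, not the project's actual kernels, and the names are made up for illustration:

~~~
#include <vector>

// Minimal sketch: compact a list by keeping its non-zero elements.
std::vector<int> compactSketch(const std::vector<int> &in) {
    int n = static_cast<int>(in.size());
    std::vector<int> bools(n), indices(n);

    // 1. map: 1 where the element is kept, 0 where it is discarded
    for (int i = 0; i < n; ++i)
        bools[i] = (in[i] != 0) ? 1 : 0;

    // 2. exclusive scan: indices[i] is the output position of element i
    int running = 0;
    for (int i = 0; i < n; ++i) {
        indices[i] = running;
        running += bools[i];
    }

    // 3. scatter: write each kept element to its computed position
    std::vector<int> out(running);   // the scan total is the compacted length
    for (int i = 0; i < n; ++i)
        if (bools[i])
            out[indices[i]] = in[i];
    return out;
}
~~~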

A "scan" (depicted below) is an operation that creates and output list such that for each index an input list, the output list
contains the sums of all elements preceding it in the input list. The term "exclusive" means that the first
element of the output array is always 0, and thus the last element of the input array is excluded. This contrasts
with an "inclusive" scan, which begins with the first element of the input array.

![](images/scan.tiff)

*Source: CIS565 Lecture Slides*
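
For example, on a small input:

~~~
input:          [3, 1, 7,  0,  4]
exclusive scan: [0, 3, 4, 11, 11]   (the last input element never contributes)
inclusive scan: [3, 4, 11, 11, 15]  (each element's own value is included)
~~~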

It is here that we can divide our algorithms into naive and efficient implementations. For comparison's sake,
the scan method was also implemented as a CPU function.

For the purposes of our implementation, the criterion for inclusion in the output list is a non-zero value.

### CPU Implementation

The CPU implementation functions in the most straightforward way possible: at each index, it simply adds the preceding input value to the previously computed output value, much like building up a running total.

The only optimization I was able to make here was that, instead of re-summing all input elements 0 through j-1 to compute
output element j, I simply add input element j-1 to output element j-1. We will see later in the performance analysis, however, that
optimizations are inherent to the CPU implementation due to hardware features such as memory caching.


### Naive Parallel Implementation

![](images/naive_parallel.tiff)

*Source: CIS565 Lecture Slides*

As depicted above, the Naive parallel implementation computes several pairwise adds for each level in 0 to lg n.

While this limits the number of kernel launches to O(lg n), this algorithm is not considered "work-efficient" since the total number of additions that must be computed is O(n lg n).
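
As a rough illustration, one level of this naive (Hillis-Steele) scan can be written as the kernel below. The names are illustrative rather than taken from this repository; the kernel is launched once per depth with `offset = 2^(d-1)` while ping-ponging between two device buffers, and the inclusive result is shifted right by one at the end to make it exclusive:

~~~
__global__ void kernNaiveScanStep(int n, int offset, int *odata, const int *idata) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index >= n)
        return;

    if (index >= offset)
        odata[index] = idata[index - offset] + idata[index];  // pairwise add
    else
        odata[index] = idata[index];                          // copy through unchanged
}
~~~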

I optimized this implementation by launching a scan of an array of length 2^15 using different block sizes (threads per block). The results are depicted below. In this graph and those that follow, Po2 stands for Power of Two.

![](images/naive_blocksize.png)

This graph shows that a block size of 256 has the lowest average runtime, so this is the value I selected for the final analysis.

### Efficient Parallel Implementation

The work-efficient scan implementation can be split into two pieces.

The first part, depicted below, is the "upsweep." Here, by using a binary tree representation of the array, we compute several intermediate values at each level.

![](images/efficient_parallel_upsweep.tiff)*Source: CIS565 Lecture Slides*

Next, we carry out a "downsweep," which swaps some values and accumulates values into the necessary locations in the array.

![](images/efficient_parallel_downsweep.tiff)*Source: CIS565 Lecture Slides*

All in all, this algorithm does O(n) adds on the upsweep and O(n) adds on the downsweep, which together is still O(n).
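
A rough sketch of the two kernels follows (illustrative names, not the repository's exact code). One launch is issued per tree level with `stride = 2^(d+1)`, and the root element is zeroed out between the upsweep and the downsweep:

~~~
__global__ void kernUpSweep(int n, int stride, int *data) {
    int index = ((blockIdx.x * blockDim.x) + threadIdx.x) * stride;
    if (index + stride - 1 >= n)
        return;
    // each parent node accumulates the sum of its left subtree
    data[index + stride - 1] += data[index + stride / 2 - 1];
}

__global__ void kernDownSweep(int n, int stride, int *data) {
    int index = ((blockIdx.x * blockDim.x) + threadIdx.x) * stride;
    if (index + stride - 1 >= n)
        return;
    int left = data[index + stride / 2 - 1];
    data[index + stride / 2 - 1] = data[index + stride - 1];  // swap: pass the parent's value down
    data[index + stride - 1] += left;                         // accumulate the left child's old value
}
~~~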

As I did with the naive implementation, I optimized the thread count per block in the efficient implementation. The test results below indicate that the shortest average runtime occurs at 512 threads per block.

![](images/efficient_blocksize.png)

### Thrust Implementation

Since we don't have the Thrust source code at hand, it's really quite hard to tell precisely what it is doing. However, the kernels which are repeatedly called are:
1. Accumulate Tiles (memcopy?)
2. Exclusive Scan
3. Downsweep. This seems to borrow elements from a couple of different algorithms we've seen. I would not be surprised if Thrust implements some hybrid algorithm.
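
Whatever Thrust does internally, calling it is a one-liner. A minimal host-side usage sketch (assuming the input starts on the host; the wrapper name is illustrative) might look like:

~~~
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Illustrative wrapper: Thrust hides the kernels listed above behind a single call.
void thrustExclusiveScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dev_in(idata, idata + n);    // copy host -> device
    thrust::device_vector<int> dev_out(n);
    thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
    thrust::copy(dev_out.begin(), dev_out.end(), odata);    // copy device -> host
}
~~~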

## Performance Analysis
Interestingly enough, the CPU implementation completely blew all other implementations out of the water. Indeed, at the lower array sizes, I could hardly get a timing at all, since the algorithm would complete execution before the system clock would even tick.

As I alluded to earlier, I think much of the credit here is due to CPU caching. All the memory locations with which we need to deal are both temporally and spatially local, so each index in the array likely needs only a couple of instructions to complete, with few if any stalls on main memory. In fact, if the compiler is smart enough, it might even be holding the values we need in registers. Given a Hyper-Threaded quad-core CPU running at 3.7 GHz, an array of length 2^15 would finish in very close to no time at all, which is what we see here.

![](images/scan_times.png)

On the other hand, our parallel algorithms lag behind quite a bit. Nvidia's Thrust implementation performs quite well, but strangely only for non-power-of-two array sizes.

Additionally, our naive implementation is seen beating our work-efficient implementation handily. I believe this is because the bottleneck here is memory access, which the work-efficient implementation does rather a lot of (especially in the add/swap downsweep). In general, I question how valuable "saving work" is on a GPU, particularly saving additions, since the devices are so heavily optimized for arithmetic.

## Test Suite Output
~~~
$ ./cis565_stream_compaction_test.exe

****************
** SCAN TESTS **
****************
[ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 7 0 ]
==== cpu scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 803684 803691 ]
==== cpu scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 803630 803660 ]
passed
==== naive scan, power-of-two ====
passed
==== naive scan, non-power-of-two ====
passed
==== work-efficient scan, power-of-two ====
passed
==== work-efficient scan, non-power-of-two ====
passed
==== thrust scan, power-of-two ====
passed
==== thrust scan, non-power-of-two ====
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ]
passed
==== cpu compact with scan ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
passed
==== work-efficient compact, non-power-of-two ====
passed
~~~
Binary file added images/efficient_blocksize.png
Binary file added images/efficient_parallel_downsweep.tiff
Binary file not shown.
Binary file added images/efficient_parallel_upsweep.tiff
Binary file not shown.
Binary file added images/naive_blocksize.png
Binary file added images/naive_parallel.tiff
Binary file not shown.
Binary file added images/scan.tiff
Binary file not shown.
Binary file added images/scan_times.png
Binary file added images/stream_compaction.jpg
2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_35
)
21 changes: 17 additions & 4 deletions stream_compaction/common.cu
@@ -22,17 +22,30 @@ namespace Common {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
__global__ void kernMapToBoolean(int n, int *bools, const int *idata)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;

if (index >= n)
return;

bools[index] = idata[index] ? 1 : 0;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
const int *idata, const int *bools, const int *indices)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;

if (index >= n)
return;

if (bools[index])
odata[indices[index]] = idata[index];
}

}
2 changes: 2 additions & 0 deletions stream_compaction/common.h
@@ -7,6 +7,8 @@
#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

#define blocksize 128

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
58 changes: 52 additions & 6 deletions stream_compaction/cpu.cu
@@ -8,8 +8,14 @@ namespace CPU {
* CPU scan (prefix sum).
*/
void scan(int n, int *odata, const int *idata) {
// TODO
printf("TODO\n");

//exclusive scan: first element of output is 0
odata[0] = 0;

for (int i = 0; i < n - 1; ++i)
{
odata[i + 1] = odata[i] + idata[i];
}
}

/**
@@ -18,8 +24,19 @@ void scan(int n, int *odata, const int *idata) {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
// TODO
return -1;

int oIndex = 0;

for (int i = 0; i < n; ++i)
{
if (idata[i])
{
odata[oIndex] = idata[i];
++oIndex;
}
}

return oIndex;
}

/**
@@ -28,8 +45,37 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
// TODO
return -1;

int *temp = new int[n];
int *scanResult = new int[n];
// compute the temp array
for (int i = 0; i < n; ++i)
{
if (idata[i])
temp[i] = 1;
else
temp[i] = 0;
}

// Run exclusive scan on temp array
scan(n, scanResult, temp);

// result of scan is index into final array
int oCnt = 0;
for (int i = 0; i < n; ++i)
{
// only write if tmp array has 1
if (temp[i])
{
odata[scanResult[i]] = idata[i];
++oCnt;
}
}

delete[] temp;
delete[] scanResult;

return oCnt;
}

}