Project 2: Gabriel Naghi #15

Open · wants to merge 24 commits into master
149 changes: 144 additions & 5 deletions README.md
@@ -3,11 +3,150 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Gabriel Naghi
* Tested on: Windows 7, i7-6700 @ 3.70GHz 16GB, Quadro K620 222MB (Moore 100C Lab)

### (TODO: Your README)
## Overview

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
In this project, I implemented three Stream Compaction algorithms.

Stream compaction is the process of compressing a list of elements: removing any elements
that don't match some criterion while keeping the "good" elements in their original order.
This algorithm will be useful in the future when doing ray tracing, where once rays have escaped
the scene, we no longer need to process them and thus want to eliminate them from our list.

![](images/stream_compaction.jpg)

*Source: Nvidia GPU Gems 3*

This is equivalent to the Python list comprehension one-liner:

~~~

output_list = [ elt for elt in input_list if elt != 0 ]

~~~

While this process is easily done in an iterative fashion, we can also employ parallel algorithms
to compute the compacted array more quickly. These parallel algorithms first map the list to a temporary boolean
array, which then undergoes an "exclusive scan". The scan result gives the index of each kept element in the output array, and the kept elements are written to those indices in a simple operation called a "scatter."
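
Concretely, the three stages look roughly like this on the CPU. This is a minimal illustrative sketch of the same map, scan, and scatter pipeline that the parallel versions implement, not the project's actual kernels, and the names are made up for illustration:

~~~
#include <vector>

// Minimal sketch: compact a list by keeping its non-zero elements.
std::vector<int> compactSketch(const std::vector<int> &in) {
    int n = static_cast<int>(in.size());
    std::vector<int> bools(n), indices(n);

    // 1. map: 1 where the element is kept, 0 where it is discarded
    for (int i = 0; i < n; ++i)
        bools[i] = (in[i] != 0) ? 1 : 0;

    // 2. exclusive scan: indices[i] is the output position of element i
    int running = 0;
    for (int i = 0; i < n; ++i) {
        indices[i] = running;
        running += bools[i];
    }

    // 3. scatter: write each kept element to its computed position
    std::vector<int> out(running);   // the scan total is the compacted length
    for (int i = 0; i < n; ++i)
        if (bools[i])
            out[indices[i]] = in[i];
    return out;
}
~~~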

A "scan" (depicted below) is an operation that creates and output list such that for each index an input list, the output list
contains the sums of all elements preceding it in the input list. The term "exclusive" means that the first
element of the output array is always 0, and thus the last element of the input array is excluded. This contrasts
with an "inclusive" scan, which begins with the first element of the input array.

![](images/scan.tiff)

*Source: CIS565 Lecture Slides*
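
For example, on a small input:

~~~
input:          [3, 1, 7,  0,  4]
exclusive scan: [0, 3, 4, 11, 11]   (the last input element never contributes)
inclusive scan: [3, 4, 11, 11, 15]  (each element's own value is included)
~~~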

It is here that we can divide our algorithms into naive and efficient implementations. For comparison's sake,
the scan method was also implemented as a CPU function.

For the purposes of our implementation, the criterion for inclusion in the output list is a non-zero value.

### CPU Implementation

The CPU implementation functions in the most straightforward way possible: at each index, it simply adds the preceding input value to the previously computed output value, much like building up a running total.

The only optimization I was able to make here was that, instead of re-summing all input elements 0 through j-1 to compute
output element j, I simply add input element j-1 to output element j-1. We will see later in the performance analysis, however, that
optimizations are inherent to the CPU implementation due to hardware features such as memory caching.


### Naive Parallel Implementation

![](images/naive_parallel.tiff)

*Source: CIS565 Lecture Slides*

As depicted above, the Naive parallel implementation computes several pairwise adds for each level in 0 to lg n.

While this limits the number of kernel launches to O(lg n), this algorithm is not considered "work-efficient" since the total number of additions that must be computed is O(n lg n).
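
As a rough illustration, one level of this naive (Hillis-Steele) scan can be written as the kernel below. The names are illustrative rather than taken from this repository; the kernel is launched once per depth with `offset = 2^(d-1)` while ping-ponging between two device buffers, and the inclusive result is shifted right by one at the end to make it exclusive:

~~~
__global__ void kernNaiveScanStep(int n, int offset, int *odata, const int *idata) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index >= n)
        return;

    if (index >= offset)
        odata[index] = idata[index - offset] + idata[index];  // pairwise add
    else
        odata[index] = idata[index];                          // copy through unchanged
}
~~~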

I optimized this implementation by launching a scan of an array of length 2^15 using different block sizes (threads per block). The results are depicted below. In this graph and those that follow, Po2 stands for Power of Two.

![](images/naive_blocksize.png)

This graph shows that a block size of 256 has the lowest average runtime, so this is the value I selected for the final analysis.

### Efficient Parallel Implementation

The work-efficient scan implementation can be split into two pieces.

The first part, depicted below, is the "upsweep." Here, by using a binary tree representation of the array, we compute several intermediate values at each level.

![](images/efficient_parallel_upsweep.tiff)*Source: CIS565 Lecture Slides*

Next, we carry out a "downsweep," which swaps some values and accumulates values into the necessary locations in the array.

![](images/efficient_parallel_downsweep.tiff)*Source: CIS565 Lecture Slides*

All in all, this algorithm does O(n) adds on the upsweep and O(n) adds on the downsweep, which together is still O(n).
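
A rough sketch of the two kernels follows (illustrative names, not the repository's exact code). One launch is issued per tree level with `stride = 2^(d+1)`, and the root element is zeroed out between the upsweep and the downsweep:

~~~
__global__ void kernUpSweep(int n, int stride, int *data) {
    int index = ((blockIdx.x * blockDim.x) + threadIdx.x) * stride;
    if (index + stride - 1 >= n)
        return;
    // each parent node accumulates the sum of its left subtree
    data[index + stride - 1] += data[index + stride / 2 - 1];
}

__global__ void kernDownSweep(int n, int stride, int *data) {
    int index = ((blockIdx.x * blockDim.x) + threadIdx.x) * stride;
    if (index + stride - 1 >= n)
        return;
    int left = data[index + stride / 2 - 1];
    data[index + stride / 2 - 1] = data[index + stride - 1];  // swap: pass the parent's value down
    data[index + stride - 1] += left;                         // accumulate the left child's old value
}
~~~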

As I did with the naive implementation, I optimized the thread count per block in the efficient implementation. The test results below indicate that the shortest average runtime occurs at 512 threads per block.

![](images/efficient_blocksize.png)

### Thrust Implementation

Since we don't have the Thrust source code at hand, it's really quite hard to tell precisely what it is doing. However, the kernels which are repeatedly called are:
1. Accumulate Tiles (memcopy?)
2. Exclusive Scan
3. Downsweep. This seems to borrow elements from a couple of different algorithms we've seen. I would not be surprised if Thrust implements some hybrid algorithm.
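
Whatever Thrust does internally, calling it is a one-liner. A minimal host-side usage sketch (assuming the input starts on the host; the wrapper name is illustrative) might look like:

~~~
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Illustrative wrapper: Thrust hides the kernels listed above behind a single call.
void thrustExclusiveScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dev_in(idata, idata + n);    // copy host -> device
    thrust::device_vector<int> dev_out(n);
    thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
    thrust::copy(dev_out.begin(), dev_out.end(), odata);    // copy device -> host
}
~~~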

## Performance Analysis
Interestingly enough, the CPU implementation completely blew all other implementations out of the water. Indeed, at the lower array sizes, I could hardly get a timing at all, since the algorithm would complete execution before the system clock would even tick.

As I alluded to earlier, I think much of the credit here is due to CPU caching. All the memory locations with which we need to deal are both temporally and spatially local, so each index in the array likely needs only a couple of instructions to complete, with few if any stalls on main memory. In fact, if the compiler is smart enough, it might even be holding the values we need in registers. Given a Hyper-Threaded quad-core CPU running at 3.7 GHz, an array of length 2^15 would finish in very close to no time at all, which is what we see here.

![](images/scan_times.png)

On the other hand, our parallel algorithms lag behind quite a bit. Nvidia's Thrust implementation performs quite well, but strangely only for non-power-of-two array sizes.

Additionally, our naive implementation is seen beating our work-efficient implementation handily. I believe this is because the bottleneck here is memory access, which the work-efficient implementation does rather a lot of (especially in the add/swap downsweep). In general, I question how valuable "saving work" is on a GPU, particularly saving additions, since the devices are so heavily optimized for arithmetic.

## Test Suite Output
~~~
$ ./cis565_stream_compaction_test.exe

****************
** SCAN TESTS **
****************
[ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 7 0 ]
==== cpu scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 803684 803691 ]
==== cpu scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 803630 803660 ]
passed
==== naive scan, power-of-two ====
passed
==== naive scan, non-power-of-two ====
passed
==== work-efficient scan, power-of-two ====
passed
==== work-efficient scan, non-power-of-two ====
passed
==== thrust scan, power-of-two ====
passed
==== thrust scan, non-power-of-two ====
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 2 ]
passed
==== cpu compact with scan ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
passed
==== work-efficient compact, non-power-of-two ====
passed
~~~
Binary file added images/efficient_blocksize.png
Binary file added images/efficient_parallel_downsweep.tiff
Binary file not shown.
Binary file added images/efficient_parallel_upsweep.tiff
Binary file not shown.
Binary file added images/naive_blocksize.png
Binary file added images/naive_parallel.tiff
Binary file not shown.
Binary file added images/scan.tiff
Binary file not shown.
Binary file added images/scan_times.png
Binary file added images/stream_compaction.jpg
2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_35
)
21 changes: 17 additions & 4 deletions stream_compaction/common.cu
@@ -22,17 +22,30 @@ namespace Common {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
__global__ void kernMapToBoolean(int n, int *bools, const int *idata)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;

if (index >= n)
return;

bools[index] = idata[index] ? 1 : 0;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
const int *idata, const int *bools, const int *indices)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;

if (index >= n)
return;

if (bools[index])
odata[indices[index]] = idata[index];
}

}
2 changes: 2 additions & 0 deletions stream_compaction/common.h
@@ -7,6 +7,8 @@
#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

#define blocksize 128

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
58 changes: 52 additions & 6 deletions stream_compaction/cpu.cu
@@ -8,8 +8,14 @@ namespace CPU {
* CPU scan (prefix sum).
*/
void scan(int n, int *odata, const int *idata) {
// TODO
printf("TODO\n");

//exclusive scan: first element of output is 0
odata[0] = 0;

for (int i = 0; i < n - 1; ++i)
{
odata[i + 1] = odata[i] + idata[i];
}
}

/**
@@ -18,8 +24,19 @@ void scan(int n, int *odata, const int *idata) {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
// TODO
return -1;

int oIndex = 0;

for (int i = 0; i < n; ++i)
{
if (idata[i])
{
odata[oIndex] = idata[i];
++oIndex;
}
}

return oIndex;
}

/**
@@ -28,8 +45,37 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
// TODO
return -1;

int *temp = new int[n];
int *scanResult = new int[n];
// compute the temp array
for (int i = 0; i < n; ++i)
{
if (idata[i])
temp[i] = 1;
else
temp[i] = 0;
}

// Run exclusive scan on temp array
scan(n, scanResult, temp);

// result of scan is index into final array
int oCnt = 0;
for (int i = 0; i < n; ++i)
{
// only write if tmp array has 1
if (temp[i])
{
odata[scanResult[i]] = idata[i];
++oCnt;
}
}

delete[] temp;
delete[] scanResult;

return oCnt;
}

}