Project 2: Xueyin Wan #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open

xueyinw wants to merge 18 commits into CIS565-Fall-2016:master from xueyinw:master

README.md

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -1,13 +1,263 @@
  
    CUDA Stream Compaction

    ======================

    ####University of Pennsylvania

    ####CIS 565: GPU Programming and Architecture

    **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

    ##Project 2 - CUDA Stream Compaction

    * (TODO) YOUR NAME HERE

    * Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)

    * Xueyin Wan

    * Tested on: Windows 10 x64, i7-6700K @ 4.00GHz 16GB, GTX 970 4096MB (Personal Desktop)

    * Compiled with Visual Studio 2013 and CUDA 7.5

    ### (TODO: Your README)

    **SCREENSHOT**

    -------------

    **BlockSize : 128**  

    **SIZE :  1 << 24**   

    Include analysis, etc. (Remember, this is public, so don't put

    anything here that you don't want to share with the world.)

    ![alt text](https://github.com/xueyinw/Project2-Stream-Compaction/blob/master/result_showcase/XueyinResultOriginal_pow(2%2C24).gif "Performance One") 

    ###   

    **FEATURES I IMPLEMENT**

    -------------

    ```

    Part 1: CPU Scan & Stream Compaction

    Part 2: Naive GPU Scan Algorithm

    Part 3: Work-Efficient GPU Scan & Stream Compaction

    Part 4: Thrust Exclusive Scan using Thrust library

    Part 5: Radix Sort (Extra Credit)

    Part 6: Using std::chrono and CUDA events for comparing the speed of different algorithms 

    ```

    ###        

    **Dive Into Block Size**

    -------------

    In order to find the relationship between block size and performance, I modified block size to see different algorithm run time for getting the optimized block size.   

    Below is my chart based on my code:

    **Case 1:** 

    #####Power of Two number, `SIZE` = 1 << 24 = 16777216. All the time recorded in `ms`.

    Block Size | Naïve Scan | Efficient Scan | Thrust Scan|CPU Scan

    ---|---|---|---|---

    16 | 52.045727 | 91.401695 |2.128768|24.0632

    32 | 30.109312  | 53.902912 |2.09424|24.0563

    64 | 25.546721  | 29.845119 |2.081152|24.0908

    128 | 25.994272 | 27.808865 |2.255712|24.0321

    256 | 25.615328 | 27.646433 |2.404192|24.064

    512 | 25.576256 | 29.840576 |2.256288|24.5889

    1024 | 25.609535 | 33.565887 |2.211232|24.0653

    **Case 2:**

    #####Non Power of Two number, `SIZE(NPOT)`= 1 << 24 - 3 = 16777213. All the time recorded in `ms`.

    Block Size        | Naïve Scan   | Efficient Scan | Thrust Scan|CPU Scan|

    ---|---|---|---|---

    16 | 45.901855 | 89.639648 |2.094752|42.9234

    32 | 30.138912  | 51.030048 |2.29776|43.1142

    64 | 25.93968  | 27.795744 |2.011712|42.6413

    128 | 25.812672 | 24.770847 |2.052608|42.6398

    256 | 25.627424 | 27.607807 |2.223552|41.6099

    512 | 25.609535 | 29.848961 |2.146816|42.1115

    1024 | 26.082048 | 33.715874 |2.04576|42.6002

    Now let me draw a graph to explicitly show my result :)  

    ####

    `Notice: ` 

    This graph is based on Case 1 result, `Array Size` is Power of Two number, `SIZE` = 1 << 24 = 16777216 

    ###

    ![alt text](https://github.com/xueyinw/Project2-Stream-Compaction/blob/master/result_showcase/ReadMeAboutBlockSizeChoose1.PNG "Chart1")  

    ###   

    ![alt text](https://github.com/xueyinw/Project2-Stream-Compaction/blob/master/result_showcase/ReadMeAboutBlockSizeChoose2.PNG "Chart2")

    From case 1 and case 2, we could see that when block size is less than 128, the algorithm performance is definitely worse than block size = 128. And after we set block size to 128, we could see that radix sort performance reaches to its highest level. As block size continues to grow, we could notice that Naive Scan, Efficient Scan and Radix Sort are all becoming slower.

    So I choose my block size to be `128` in my code.  

    **Dive Into Array Size**

    -------------

    I set block size = `128` in my code, and start to use array size as a parameter to change, in order to compare the performance between different GPU algorithms and CPU algorithm.

    Below is my chart based on my code:

    #####`Blocksize` = 128.   All the time recorded in `ms`. Max Value for scan in the array is 50 (for this chart)

    Array Size | Naïve Scan | Efficient Scan | Thrust Scan|CPU Scan

    ---|---|---|---|---

    2^8 | 0.031904 | 0.11024 |0.021248|0

    2^12 | 0.047008  | 0.141728 |0.027616|0 

    2^16 | 0.13168  | 0.347968 |0.245728|0.5013 

    2^20 | 1.297824 | 1.681472 |0.468608|1.5041 

    2^24 | 25.53968 | 27.6632 |2.403232|25.0931 

    #####`Blocksize` = 128.  All the time recorded in `ms`. Max Value for sort in the array is 2^15 (for this chart)

    Array Size | Radix Sort | Std::sort  

    ---|---|---

    2^8 | 1.105344 | 0

    2^12 | 2.223136  | 0

    2^16 | 7.358048  | 4.0105 

    2^20 | 42.627296 | 58.1868

    2^24 | 749.649841 | 894.4247

    Graph for summary:

    ####

    ![alt text](https://github.com/xueyinw/Project2-Stream-Compaction/blob/master/result_showcase/ReadMeAboutArraySizeChoose0.PNG "Chart1")  

    ####

    ![alt text](https://github.com/xueyinw/Project2-Stream-Compaction/blob/master/result_showcase/ReadMeAboutArraySizeChoose1.PNG "Chart2")

    From the test result, we could see that for small Array Size, GPU implementation is slower than CPU's. But when the array size grows, they two become close.    

    Thrust Scan is very fast for large array size.    

    ####     

    For Radix sort, I compare it with std::sort. We could see that when arraysize is small, std::sort is faster.    

    However, as array size grows, my radix sort on GPU is much faster than std::sort!  

    ###Answer to Questions

    #### 1. Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.

    Done! See above `Dive Into Block Size` part.

    #### 2. Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis).

    Done! See above `Dive Into Array Size` part.     

    I use CUDA events for timing GPU code. 

    I use std::chrono for timing CPU code.

    #### 3.To guess at what might be happening inside the Thrust implementation (e.g. allocation, memory copy)

    Answer:   

    I guess the inner mechnism of THRUST requires some initialization operations. After this step, it reaches to better performance.    

    #### 4.Can you find the performance bottlenecks? 

    Answer:    

    Yes. First I want to mention, When we are doing iterations in scan function, and inside each loop is kernal function like Upsweep, Downsweep. We could see that as the iteration goes on, one phenomenon appears:     

    There are several threads idling. Since they need to wait those threads which are working to finish there mission, they have to be idling, which causes extra resource allocate.  

    Paste part of my code here to address this problem:

    ```java

    void scanInDevice(int n, int *devData) {

    		int blockNum = (n + blockSize - 1) / blockSize;

    		for (int d = 0; d < ilog2ceil(n) - 1; d++) {    //Here we have iterations

    			upSweep << <blockNum, blockSize >> >(n, d, devData);  // Here we have kernal function

    			checkCUDAError("upSweep not correct...");

    		}

    		//set last element to zero, refer to slides!

    		int counter = 0;

    		cudaMemcpy(&devData[n - 1], &counter, sizeof(int), cudaMemcpyHostToDevice);

    		for (int d = ilog2ceil(n) - 1; d >= 0; d--) {

    			downSweep << <blockNum, blockSize >> >(n, d, devData);

    			checkCUDAError("downSweep not correct...");

    		}

    	} 

    ```

    ```java

    __global__ void upSweep(int N, int d, int *idata) {

    	int n = (blockDim.x * blockIdx.x) + threadIdx.x;     

    	if (n >= N) {

    		return;

    	}

    	int delta = 1 << d;

    	int doubleDelta = 1 << (d + 1);

    	if (n % doubleDelta == 0) {    // not each thread is working, right? 

                                       //But those "should not be working" threads are still evoked.

    		idata[n + doubleDelta - 1] += idata[n + delta - 1];

    	}

    	}

    ```

    Plan to optimize this (yet several interviews this week I have to say: "lol" D:)    

    Try to optimize mycode in path tracer project !

    Also a huge problem: Memory I/O!    

    We need to malloc memory in device, copy the host content into device then get a result, then transfer back to host memory......  

    When we're doing first assignment, we know that for index-continuous threads to access physical-not-continuous memory, it needs extra unnecessary operations and becomes slow.    

    And in this project, we have a lot of memory I/O operation... So here we found another issue!!!

    ![alt text](https://github.com/xueyinw/Project2-Stream-Compaction/blob/master/result_showcase/Profiling.PNG "Chart1")  

    ###

    From the picture above, we can see that CUDA memory operations occupied especially large part of the entire execution.    

    So my guess is right. :)

    #### 5. Sample output 

    More my test result are in the `result_showcase` folder.

    Here I show one of them:

    ####Array Size = 1 << 24. Block Size is 128.

    ```

    ****************

    ** SCAN TESTS **

    ****************

        [  38  19  38  37   5  47  15  35   0  12   3   0  42 ...  42   0 ]

    ==== cpu scan, power-of-two ====

        [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 411089014 411089056 ]

    CPU scan power-of-two number time is 24.032100 ms

    ==== cpu scan, non-power-of-two ====

        [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 411088950 411088974 ]

        passed

    CPU scan non-power-of-two number time is 42.639800 ms

    ==== naive scan, power-of-two ====

    GPU Naive Scan time is 25.994272 ms

        passed

    ==== naive scan, non-power-of-two ====

    GPU Naive Scan time is 25.812672 ms

        passed

    ==== work-efficient scan, power-of-two ====

    GPU Efficient Scan time is 27.808865 ms

        passed

    ==== work-efficient scan, non-power-of-two ====

    GPU Efficient Scan time is 24.770847 ms

        passed

    ==== thrust scan, power-of-two ====

    GPU Thrust Scan time is 2.255712 ms

        passed

    ==== thrust scan, non-power-of-two ====

    GPU Thrust Scan time is 2.052608 ms

        passed

    *********************************************

    *************** EXTRA CREDIT ****************

    ************* RADIX SORT TESTS **************

    *************** POWER-OF-TWO ****************

    *********************************************

        [  38 7719 21238 2437 8855 11797 8365 32285 10450 30612 5853 28100 1142 ... 7792 2304 ]

    ==== std sort for comparasion ====

        [   0   0   0   0   0   0   0   0   0   0   0   0   0 ... 32767 32767 ]

    ==== Extra : RadixSort ====

    GPU Radix Sort time is 741.838257 ms

        passed

        [   0   0   0   0   0   0   0   0   0   0   0   0   0 ... 32767 32767 ]

    *********************************************

    *************** EXTRA CREDIT ****************

    ************* RADIX SORT TESTS **************

    ************* NON-POWER-OF-TWO **************

    *********************************************

        [  38 7719 21238 2437 8855 11797 8365 32285 10450 30612 5853 28100 1142 ... 7792 2304 ]

    ==== std sort for comparasion ====

        [   0   0   0   0   0   0   0   0   0   0   0   0   0 ... 32767 32767 ]

    ==== Extra : RadixSort ====

    GPU Radix Sort time is 743.634277 ms

        passed

        [   0   0   0   0   0   0   0   0   0   0   0   0   0 ... 32767 32767 ]

    *****************************

    ** STREAM COMPACTION TESTS **

    *****************************

        [   2   3   2   1   3   1   1   1   2   0   1   0   2 ...   0   0 ]

    ==== cpu compact without scan, power-of-two ====

        [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   2   2 ]

        passed

    CPU compact without scan power-of-two number time is 37.117100 ms

    ==== cpu compact without scan, non-power-of-two ====

        [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   2   2 ]

        passed

    CPU compact without scan non-power-of-two number time is 37.113600 ms

    ==== cpu compact with scan ====

        [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   2   2 ]

        passed

    CPU compact with scan time is 129.808400 ms

    ==== work-efficient compact, power-of-two ====

    GPU Efficient Compact time is 27.158209 ms

        passed

    ==== work-efficient compact, non-power-of-two ====

    GPU Efficient Compact time is 27.228865 ms

        passed

    ```

result_showcase/ArraySize = 2^12.txt

-Original file line number
+Diff line change
@@ -0,0 +1,86 @@
+    ****************
+    ** SCAN TESTS **
+    ****************
+        [  38  19  38  37   5  47  15  35   0  12   3   0  42 ...  24   0 ]
+    ==== cpu scan, power-of-two ====
+        [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 99378 99402 ]
+    CPU scan power-of-two number time is 0.000000 ms
+    ==== cpu scan, non-power-of-two ====
+        [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 99347 99371 ]
+        passed
+    CPU scan non-power-of-two number time is 0.000000 ms
+    ==== naive scan, power-of-two ====
+    GPU Naive Scan time is 0.047008 ms
+        [   0  38  57  95 132 137 184 199 234 234 246 249 249 ... 99378 99402 ]
+        passed
+    ==== naive scan, non-power-of-two ====
+    GPU Naive Scan time is 0.045472 ms
+        [   0  38  57  95 132 137 184 199 234 234 246 249 249 ...   0   0 ]
+        passed
+    ==== work-efficient scan, power-of-two ====
+    GPU Efficient Scan time is 0.141728 ms
+        passed
+    ==== work-efficient scan, non-power-of-two ====
+    GPU Efficient Scan time is 0.167264 ms
+        passed
+    ==== thrust scan, power-of-two ====
+    GPU Thrust Scan time is 0.027616 ms
+        passed
+    ==== thrust scan, non-power-of-two ====
+    GPU Thrust Scan time is 0.019008 ms
+        passed
+    *********************************************
+    *************** EXTRA CREDIT ****************
+    ************* RADIX SORT TESTS **************
+    *************** POWER-OF-TWO ****************
+    *********************************************
+        [  38 3623 758 2437 663 3605 173 3613 2258 1940 1757 3524 1142 ... 3336 120 ]
+    ==== std sort for comparasion ====
+    std sort for power-of-two number time is 0.000000 ms
+        [   0   1   1   1   3   3   4   5   6   6   7   8   8 ... 4093 4094 ]
+    ==== Extra : RadixSort ====
+    GPU Radix Sort time is 2.223136 ms
+        passed
+        [   0   1   1   1   3   3   4   5   6   6   7   8   8 ... 4093 4094 ]
+    *********************************************
+    *************** EXTRA CREDIT ****************
+    ************* RADIX SORT TESTS **************
+    ************* NON-POWER-OF-TWO **************
+    *********************************************
+        [  38 3623 758 2437 663 3605 173 3613 2258 1940 1757 3524 1142 ... 3336 120 ]
+    ==== std sort for comparasion ====
+    std sort for non-power-of-two number time is 0.000000 ms
+        [   0   1   1   1   3   3   4   5   6   6   7   8   8 ... 4093 4094 ]
+    ==== Extra : RadixSort ====
+    GPU Radix Sort time is 2.448928 ms
+        passed
+        [   0   1   1   1   3   3   4   5   6   6   7   8   8 ... 4093 4094 ]
+    *****************************
+    ** STREAM COMPACTION TESTS **
+    *****************************
+        [   2   3   2   1   3   1   1   1   2   0   1   0   2 ...   0   0 ]
+    ==== cpu compact without scan, power-of-two ====
+        [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   2   1 ]
+        passed
+    CPU compact without scan power-of-two number time is 0.000000 ms
+    ==== cpu compact without scan, non-power-of-two ====
+        [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   2   1 ]
+        passed
+    CPU compact without scan non-power-of-two number time is 0.000000 ms
+    ==== cpu compact with scan ====
+        [   2   3   2   1   3   1   1   1   2   1   2   1   1 ...   2   1 ]
+        passed
+    CPU compact with scan time is 0.000000 ms
+    ==== work-efficient compact, power-of-two ====
+    GPU Efficient Compact time is 0.284064 ms
+        passed
+    ==== work-efficient compact, non-power-of-two ====
+    GPU Efficient Compact time is 0.266176 ms
+        passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Project 2: Xueyin Wan #7

Uh oh!

Diff view

Diff view

There are no files selected for viewing

Uh oh!

Project 2: Xueyin Wan #7

Are you sure you want to change the base?

Uh oh!

Project 2: Xueyin Wan #7

Uh oh!

Uh oh!

Diff view

Diff view

There are no files selected for viewing

Uh oh!