CS40 Lab 7: CUDA Performance Testing

Due 11:59pm Tuesday 11 November 2014

You may work with one partner on this assignment. In this lab, you will design and performance test CUDA kernels for finding the maximum element in a large array.

Getting started
Use the setup40 script to set up the appropriate git repos with the right permissions set for partners and instructors. Suppose users molly and tejas wish to work together. Molly can start by running
[~]$ setup40 projects/cuda tejas
Once the script finishes, Tejas should run
[~]$ setup40 projects/cuda molly
Please note that the script tries its best to ease the initial creation and cloning of git repos, and it tries to be smart and check that each partner agrees that they are partners. That said, there are some race conditions, and if partners do not cooperate in naming each other, the script and the instructor will get confused. If you play nice with the script, it will play nice with you.

If all goes well, Tejas and Molly should each have a local clone of a common git repo in their own ~/cs40/projects/cuda directory. You can use git status to confirm this.

If you wish to work by yourself (not recommended), use the syntax

[~]$ setup40 projects/cuda none
Copying starter code
Both partners should modify their own ~/cs40/projects/CMakeLists.txt as this file is not under version control. Just add the line
add_subdirectory(cuda)
to the end of the file and save.

For the next step, only one partner should copy over the starting code

[~]$ cd ~/cs40/projects/cuda
[cuda]$ cp ~adanner/public/cs40/projects/cuda/* ./
Now push the changes to your partner
[cuda]$ git add *.h *.cpp *.txt
[cuda]$ git commit -m "cuda lab start"
[cuda]$ git push
Even if you are working by yourself, you must run git push or I can't see/grade your work. If you are working with a partner, your partner can now pull the changes. For example, if Tejas wishes to get the files Molly pushed, he would run
[~]$ cd ~/cs40/projects/cuda
[cuda]$ git pull
Part 1: Maximum Value
First compile and run maxval.cu. This program is supposed to compute the maximum of an array of floats. Initially, only a CPU version has been provided for you. I have provided timing code to time the GPU and CPU versions of this max function. Your first step is to write a simple CUDA kernel that works with only one block and one thread. Because a __global__ CUDA kernel can only have a void return type, the variable result can be used to hold a GPU buffer that can store one or more results. Call your kernel max_gpu_single, and have it store the max value in the buffer result[0]. The code in main will copy this buffer into a partial results buffer and do some post-processing. Call your kernel in main with one block and one thread, and note the time. Check that your GPU result matches the CPU result before proceeding.
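As a starting point, here is a minimal sketch of what such a kernel could look like. It assumes the array a, its size n, and the device buffer result described above; the exact names in your starter code may differ.

__global__ void max_gpu_single(float* a, int n, float* result) {
    // One block, one thread: scan the entire array serially on the GPU
    float best = a[0];
    for (int i = 1; i < n; i++) {
        if (a[i] > best) {
            best = a[i];
        }
    }
    result[0] = best;
}

The launch in main then uses one block and one thread, e.g. max_gpu_single<<<1, 1>>>(dev_a, N, dev_result); here dev_a and dev_result are placeholder names for the device copies of the array and the result buffer.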

Next, change the size of N near the top of the code from 32 to 32*1024*1024. Comment out the line in main which prints the values of a[i] using cout, so that you do not see 32 million items printed. Run your code and note the time for the GPU and CPU versions. If your GPU version is significantly slower, that is OK at this point. Next, make the following changes and run some experiments. Note that the GPU version may be so slow that it times out. If this happens, decrease the size of N until the kernel is able to finish.

Experiments

Block Kernel

Write a kernel called max_gpu_block that can be called on multiple blocks, each containing one thread. Call your kernel with the following numbers of blocks: 4, 8, 16, 32, 64, 256, and record the time. Note that you will need to recompile between runs. Your kernel only needs to have each block compute the max of the elements that block checks. Each block can store its maximum in results[blockIdx.x]. A small amount of post-processing by the CPU can then find the maximum over all the blocks; see the sketch below.
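A sketch of the block-level strategy, under the same naming assumptions as above (results is the per-block partial results buffer):

__global__ void max_gpu_block(float* a, int n, float* results) {
    // Block b checks elements b, b+gridDim.x, b+2*gridDim.x, ...
    float best = a[blockIdx.x];
    for (int i = blockIdx.x + gridDim.x; i < n; i += gridDim.x) {
        if (a[i] > best) {
            best = a[i];
        }
    }
    // One partial result per block
    results[blockIdx.x] = best;
}

After copying results back to the host, the CPU post-processing is just a short loop over the per-block partial results to pick the largest one.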

Thread Kernel

Write a kernel called max_gpu_thread that can be called on a single block containing multiple threads. Call your kernel with the following numbers of threads and record the time: 32, 64, 256, 512. You may need to change the variable partial_size in main to make sure the results buffer is the appropriate size. Each thread will write to one slot in this buffer, which is again post-processed by the CPU in main.
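This kernel is the mirror image of max_gpu_block: one block, with thread t striding by blockDim.x instead of gridDim.x. A sketch under the same naming assumptions:

__global__ void max_gpu_thread(float* a, int n, float* results) {
    // Thread t checks elements t, t+blockDim.x, t+2*blockDim.x, ...
    float best = a[threadIdx.x];
    for (int i = threadIdx.x + blockDim.x; i < n; i += blockDim.x) {
        if (a[i] > best) {
            best = a[i];
        }
    }
    // One partial result per thread, reduced by the CPU in main
    results[threadIdx.x] = best;
}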

Combination Kernel

Finally, write a kernel called max_gpu_combined that can be called on an arbitrary number of blocks, each with multiple threads. Try various block and thread counts when calling your kernel, reporting at least three experiments and highlighting the parameters that result in the shortest run time. At the thread level, you should use shared memory and a parallel reduction to compute the maximum value per block.
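One common shape for this kernel is a grid-stride loop followed by a shared-memory reduction. The sketch below assumes blockDim.x is a power of two, that the launch passes blockDim.x * sizeof(float) as the third launch parameter (the shared memory size), and that FLT_MAX is available from <cfloat>.

__global__ void max_gpu_combined(float* a, int n, float* results) {
    extern __shared__ float cache[];   // sized at launch: threads * sizeof(float)
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    // Grid-stride loop: each thread finds the max of its share of the array
    float best = -FLT_MAX;
    for (int i = tid; i < n; i += stride) {
        if (a[i] > best) {
            best = a[i];
        }
    }
    cache[threadIdx.x] = best;
    __syncthreads();

    // Parallel reduction: halve the number of active threads each round
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s && cache[threadIdx.x + s] > cache[threadIdx.x]) {
            cache[threadIdx.x] = cache[threadIdx.x + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's max; the CPU reduces over blocks as before
    if (threadIdx.x == 0) {
        results[blockIdx.x] = cache[0];
    }
}

A launch then looks like max_gpu_combined<<<blocks, threads, threads * sizeof(float)>>>(dev_a, N, dev_results); with the same placeholder device pointer names as above.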
Test on two Cards
Test your code on at least two different graphics cards. See the list of host graphics cards. The specs in terms of multiprocessors, cores per multiprocessor, and total number of cores are summarized below.
Card        MP   Cores/MP   Total
NVS 450      1       8         8
600          2      48        96
2000         4      48       192
1000M        2      48        96
GTX 780Ti   15     192      2880
660M         2     192       384
The 660M and NVS 450 cards are limited to 512MB and 256MB of memory, respectively, and may not be able to store very large arrays. 32 million floats occupy 128MB of memory. All other cards have at least 1GB of GPU memory.
Requirements

You should have four kernels (single, block, thread, combo) in your source code for part 1.

Run on at least two different graphics cards.

Put your results in a README.txt file. A sample format is shown below. If a kernel cannot run on the full 32 million elements, list the time and size of the largest input you can successfully run on.

Card 1: NVS450

CPU Time:

Single kernel Time:  ...  Max size: ....

-------------
Block Kernel
-------------

4   Blocks Time: ... 
8   Blocks Time: ...
...
256 Blocks Time: ...


-------------
Thread Kernel
-------------

32  Threads Time: ...
64  Threads Time: ...
256 Threads Time: ...


-------------
Combo  Kernel
-------------

B Blocks, T Threads Time: ...


Repeat above for different hardware. 
Part 2: Beat the CPU
Modify the GPU kernel in julia_gpu_timed.cu and try to beat the compute time of the CPU version. In your README, note the hardware/host you used for the experiment and summarize your design and final run time.
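One design that typically beats the CPU is to assign one thread per pixel using a 2D grid, rather than looping over the image. The sketch below is only a shape to adapt: the julia device function and the 4-bytes-per-pixel buffer layout are placeholders for whatever julia_gpu_timed.cu actually provides.

__global__ void julia_kernel(unsigned char* pixels, int width, int height) {
    // One thread per pixel, covering the image with a 2D grid of 2D blocks
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x >= width || y >= height) return;   // guard partial edge blocks

    int offset = x + y * width;
    int value = julia(x, y);                 // placeholder device function
    pixels[offset * 4 + 0] = 255 * value;    // simple red/black color map
    pixels[offset * 4 + 1] = 0;
    pixels[offset * 4 + 2] = 0;
    pixels[offset * 4 + 3] = 255;
}

A matching launch could use 16x16 thread blocks and round the grid size up to cover the whole image:

dim3 threads(16, 16);
dim3 blocks((width + 15) / 16, (height + 15) / 16);
julia_kernel<<<blocks, threads>>>(dev_pixels, width, height);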

As an optional extension you may wish to change the function you compute or change the color map to something more exciting. Consider the following examples for possible inspiration.

Submit
Write your results in a README.txt file and add, commit, and push this file to your git repo. Note any test cases where you could not run on the full 32 million element array, and summarize any patterns or peculiarities you noticed in your testing.

You should regularly commit your changes and occasionally push to your shared remote. Note you must push to your remote to share updates with your partner. Ideally you should commit changes at the end of every session of working on the project. You will be graded on work that appears in your remote by the project deadline.