CUDA Introduction

General-purpose GPU Computing

Readings

We will just be scratching the surface of what you can do with CUDA. The book "CUDA by Example" gives a gentle introduction to the primary concepts, and many of its code samples are provided in your git repo. A copy of the text should be available in the lab. Please do not take it out of the CS department.

Examples updates

Let's fetch this week's upstream changes.

$ cd ~/cs40/examples
$ git fetch upstream
$ git merge -X ours upstream/master

After this step, you will need to manually edit the top level CMakeLists.txt to add the new subdirectory.

#add these lines near the bottom
if(CMAKE_CUDA_COMPILER)
  add_subdirectory(w07-cuda-pt1)
endif()

You will also need to add a line to grab some CUDA helper files. Add the line

include(${CMAKE_SOURCE_DIR}/cmake_modules/detectCUDA.cmake)

after the line in CMakeLists.txt that reads

include(${CMAKE_SOURCE_DIR}/cmake_modules/helpers.cmake)

We are not done yet with CUDA customization. CUDA requires a special compiler, nvcc, which is not picked up automatically by CMake or the default search paths.

You will need to edit your ~/.bashrc file to add some custom paths. Append the following lines to the end of this file.

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

After making the changes, run the following command once to load the new paths into your current shell.

source ~/.bashrc

Your ~/.bashrc is read every time you log into the system, so next time you login, you will not need to update this file or run the source command. We are almost done with setup.

The last step is to go into your examples/build directory and tell cmake the location of CUDA and the default CUDA compiler. This should also be a one-time step, unless you rebuild your build directory from scratch.

cd ~/cs40/examples/build
cmake -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-6 \
      -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ../

If everything went well, you should be able to run make -j8 to compile the week 07 examples.

make -j8
cd w07-cuda-pt1
./simple_kernel_params
2 + 7 = 9

CUDA Overview

So far, this course has focused on using modern OpenGL to render 3D scenes. Modern OpenGL leverages specialized GPUs and programmable shaders to process geometry and fragments efficiently in parallel. The general outline for an OpenGL application is as follows:
  1. Copy data from the CPU to the GPU by writing to a Vertex Buffer Object (VBO) or a texture.
  2. Write custom shaders in GLSL that will execute in parallel on the GPU hardware on GPU data.
  3. Run the shaders on the GPU data by calling e.g., glDrawArrays.
  4. Display the final image buffer on the screen.
CUDA takes many of these same concepts and applies them to general-purpose data that need not be related to graphics (although, since this is a graphics course, we will later see examples that combine CUDA and graphics). CUDA aims to solve problems that can benefit from parallel computation by leveraging the GPU to process general data. The outline for a CUDA application is:
  1. Copy data from the CPU to the GPU using cudaMalloc and cudaMemcpy.
  2. Write custom kernels in CUDA that will execute in parallel on the GPU hardware on GPU data.
  3. Run the kernels on the GPU data by calling them from the CPU using a special syntax that indicates the degree of parallelism.
  4. Copy data back from the GPU and interpret the results.
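The four steps above can be sketched in a short program in the spirit of simple_kernel_params.cu from "CUDA by Example" (a minimal sketch, not the repo's exact source):

```cuda
#include <cstdio>

// Step 2: a kernel. __global__ marks code that runs on the GPU
// but is launched from the CPU.
__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main() {
    int c;
    int *dev_c;

    // Step 1: allocate memory on the GPU for the result.
    cudaMalloc((void **)&dev_c, sizeof(int));

    // Step 3: launch the kernel; <<<1, 1>>> requests one block of one thread.
    add<<<1, 1>>>(2, 7, dev_c);

    // Step 4: copy the result back to the CPU and interpret it.
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);

    cudaFree(dev_c);
    return 0;
}
```

Compiled with nvcc, this prints the same "2 + 7 = 9" you saw when testing the build above.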
Much like OpenGL (or similar graphics APIs) is used behind the scenes by physics engines, game engines, animation libraries, etc., CUDA (or similar accelerator APIs) is used behind the scenes by modern parallel computing tools, including TensorFlow, Photoshop, and data visualization packages.

CUDA Demos

We will review the CUDA demos in the following order. We may skip some incremental steps, but you may go back and review them outside of class.
  1. hello_world.cu: the nvcc compiler.
  2. simple_kernel.cu: the __global__ keyword and <<<>>> syntax.
  3. simple_kernel_params.cu: cudaMalloc, cudaMemcpy, cudaFree
  4. simple_device_call.cu: the __device__ keyword
  5. add_loop_cpu.cpp: sequential array addition
  6. add_loop_gpu.cu: parallel array addition (small)
  7. add_loop_long.cu: parallel array addition (large)
  8. enum_gpu.cu: querying general GPU properties.
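As a preview of the add_loop demos, here is a sketch of parallel array addition in the style of add_loop_gpu.cu (a hedged approximation, not the repo's exact source): each block computes one array element, with blockIdx.x selecting which one.

```cuda
#include <cstdio>

#define N 10

// Each of the N blocks handles one element of the arrays.
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;  // this block's index within the grid
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main() {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    // Fill the input arrays on the CPU.
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i * i;
    }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Launch N blocks of one thread each.
    add<<<N, 1>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
```

This approach only scales to small arrays; add_loop_long.cu shows how to handle arrays larger than the maximum grid size.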

Wednesday updates

Let's fetch some more upstream changes.

$ cd ~/cs40/examples
$ git fetch upstream
$ git merge -X ours upstream/master

After this step, you will need to manually edit the top level CMakeLists.txt to add the new subdirectory.

if(CMAKE_CUDA_COMPILER)
  add_subdirectory(w07-cuda-pt1)
  #add this line inside the if/endif block
  add_subdirectory(w07-cuda-pt2)
endif()

The setup from Monday should still work, so a simple make will build the new code samples.

cd ~/cs40/examples/build
make -j8
cd w07-cuda-pt2
./juliaCPU
./juliaGPU