CS87 Week 5 Thursday Lab: Some Example Cuda Programs

We are going to look at some example Cuda programs together. Use these as examples as you work on the next lab assignment.

1. Copy over Examples

Copy over some files into your cs87 subdirectory

cd ~/cs87
cp -R ~newhall/public/cs87/cuda_examples .
cd cuda_examples
ls
  Makefile	 main.cpp	 testgrid.cu
  README.md	 saxpy.cu	 userBufferKernel.cu   userBufferKernel.h
make

2. Cuda Examples

2.1. Simple Cuda Example

saxpy.cu is an example that computes y ← ax + y, which is from Nvidia In this example, y and x are two 1D arrays of size N, and a is a scalar value.

The application running on the host (and this is very typical control flow):

initializes x and y in CPU (host) memory.
allocates cuda memory for both x and y (cudaMalloc)
copies their values from CPU to GPU memory (cudaMemcpy).
invokes the cuda saxpy kernel on the GPU using 1D block/grid and threads/block layout:
```
`saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y)`
```
(the blocks/grid dimension, (N+255)/256, is the number of blocks needed to process N elements of the arrays x and y (note that this is int division, so truncation is used for the result (this is why 255 is added to N) ). The threads/block dimension is 256. The application assumes that N is evenly divisible by the thread block size (256), otherwise the kernel code will have to check that threads are not accessing x and y beyond their bounds.
copies the result value of y from GPU to CPU memory (cudaMemcpy)
computes the max error on the CPU side
frees all CPU and GPU allocated memory (cudaFree)

The kernel function is written from the view of each individual thread executing. Each thread computes its index into x and y based on its position in its enclosing block (threadIdx.x) and its block’s position in the grid (blockIdx.x*blockDim.x)

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

2.2. Example of 2D Block, Grid, Thread layout

testgrid.cu is an example that illustrates a 2D block/grid and 2D thread/block layout. It is often useful to use a dimensional layout that matches the dimensions of your data structures (1D for 1D, 2D for 2D, 3D for 3D), but it is not necessary to do this.

When run, this application takes one command line argument that invokes a different kernel function, each set 2D array values to a different function that lets you "see" how a thread is mapped onto the 2D array by its position in its block and grid.

./testgrid 1
./testgrid 2
./testgrid 3

In this example, the kernel functions are invoked with dim3 variables that specify the blocks/grid and threads/block 2D arrangement (this code assumes DIM is evenly divisible by 8):

dim3 blocks(DIM/8, DIM/8, 1);   // 2D array of blocks in grid: DIM/8 x DIM/8
dim3 threads_block(8, 8, 1);    // 2D array of threads/block 8x8

When the kernel function is called, these are used to specify the blocks/grid and threads/block:

kernel_func<<blocks,threads_block>>(dev_grid);

Within each kernel, a thread figures out its position in the 2D threads/block/grid layout, and uses it to compute an index into the data (which is viewed as a 1D array in GPU memory…a single cudaMalloc allocated it):

  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  int offset = x + y*DIM;

The GPU view of data is an x-axis, y-axis view, with the origin block (0,0) in the upper left (2D thread blocks are similarly numbered):

  block(0,0)   block(1,0)   block (2, 0)
  block(0,1)   block(1,1)   block (2, 1)
  block(0,2)   block(1,2)   block (2, 2)
  block(0,3)   block(1,3)   block (2, 3)

When mapping onto a 2D C view of data, keep this in mind (sometimes it matters and sometimes it doesn’t for the computation you are doing).

2.3. Example using ParaVis library to animate CUDA application

The main.cpp and userBufferKernel.[h,cu] is a CUDA application that uses the ParaVis library to simultaneously animate the computation on the GPU. This example has a application 2D array, and a 2D array of pixel values that the application uses to "animate" its computation. You will use this library in your Lab 3 solution.

A ParaVis application must be implemented in a class derived from the Animator class. Its implementation of the update method drives the application’s execution. The bulk of this code is in userBufferKernel.[h,cu]. main.cpp contains the program’s main function that sets up and runs the animation.

2.3.1. `main.cpp` is the main control flow for using the ParaVis library

main.cpp contains the main control flow for writing a program using the ParaVis library. This is pretty generic code for any CUDA application using ParaVis. The application-specific part involves:

determining how and where to parse any command line arguments (typically either parse in main or pass them to the the application-specific constructor to parse)
calling the application-specific constructor for the class derived from the Animator class (perhaps passing it the command line options to parse, or values from parsed command line options).

In this application, the application-specific part of main.cpp is the call to the UserBufferKernel constructor, and the width and height values passed to it. The rest of the main control flow is common to most ParaVis Cuda animations. These main steps are:

create a new QTSafeViewer object (you can add an application specific title passing a string value as the 3rd argument)
create a DataVisCUDA object and call its setAnimator method to hook the Animator (UserBufferKernel) object’s update method:
1. call DataVisCUDA constructor passing it the 2D dimensions matching those of the application 2D buffer (800x800 in this example).
2. call UserBufferKernel constructor (derived from Animator) passing args needed to init it (dimensions of 2D buffer in this example)
3. call setAnimation method of the DataVisCUDA objet passing it the Animator (UserBufferKernel) object.
call setAnimation on the QTViewer object passing it the DataVisCUA object
call run on the QTViewer to start the animation running. (this will invoke the update method of the UserBufferKernel class)
clean-up any state before exit

The C++ parts of this code are mostly to support running the main animation loop (in main.cpp), and then as C++ derived UserBufferKernel constructor and update methods, that invoke more C-like code that implements the core of the CUDA application userBufferKernel.[h,cu]. The CUDA kernels themselves should be written like C functions and not as C++ method functions of the userBufferKernel class.

2.3.2. `userBufferKernel.[h,cu]` is the application-specific code

Look at the userBufferKernel.[h,cu] to see how what this program does.

In userBuferKernel.h is the definition of the userBufferKernel class that is derived from Animator. It implements a constructor, destructor, and update method that are application-specific. Its update method is invoked by the main animation loop that is invoked by viewer.run() in main. run can be called with no argument, in which case the application runs until the user chooses to stop it, or passing in a number of iterations (e.g. viewer.run(200)) for which to animation (the number of times the udpate method is called). This class has private data members corresponding to the 2D grid.

In the userBufferKernel.cu:

The constructor:
- initializes CPU-size data (the data members of this object)
- allocates CUDA memory space for the 2D application grid
- copies the CPU initialized 2D grid to the GPU memory
The destructor: frees cuda allocated memory
The update method is called repeatedly by QTSafeViewer run, and is where the CUDA kernels that perform the main computation on the data grid and the update to the imageBuffer are invoked. The QTSafeViewer run method passes it a ImageBuffer *img, through which the method can access the grid of color3 image pixel values to set to different colors via img→buffer.

Because it is called repeatedly by the QTViewer’s run method, it is implemented like one iteration (or step) of computation on the 2D grid.
```
void UserBufferKernel::update(ImageBuffer* img){

  dim3 blocks(m_cols/BLOCKS, m_rows/BLOCKS, 1);
  dim3 threads_block(BLOCKS, BLOCKS, 1);

  for (int i = 0; i < 90; i++){
    simplekernel<<<blocks, threads_block>>>(m_dev_grid, m_cols);
  }
  int_to_color<<<blocks, threads_block>>>(img->buffer, m_dev_grid, m_cols);
  usleep(100000);
}
```
In this example, the udpate calls:
1. the simplekernel CUDA kernel function 90 times (just to animate larger steps of computation), which performs some computation on the 2D data grid
2. the int_to_color CUDA kernel, that updates the image buffer for the ParaVis annimation based on the values of the application 2D grid (m_dev_grid).
3. usleep just to slow down the animation.
The kernel’s themselves should be written in a C style (rather than C++ style).

In the int_to_color kernel, you can see code that each CUDA thread runs to set its individual pixels values in the image color3 buffer. Each pixel has a rbg component, which is set to some function based on its corresponding value in the 2D array of ints (whose values are updated by simplekernel):
```
optr[offset].r = (my_cuda_data[offset] + 10) % 255;  // R value
optr[offset].g = (my_cuda_data[offset] + 100) % 255; // G value
optr[offset].b = (my_cuda_data[offset] + 200) % 255; // B value
```

3. Cuda Programming References

Nvidia’s Cuda Developer Documentation there is a CUDA C Programming Guide off this page
Nvidia’s Parallel Forall Developer’s Blog
Chapt 11.1 from Dive into Systems textbook
saxpy.cu examplei from Nvida
ParaVis documentation is in .h files (view in vim or other editor):

  /usr/local/include/qtvis/animator.h
  /usr/local/include/qtvis/imageBuffer.h
  /usr/local/include/qtvis/dataVisCUDA.h
  /usr/local/include/qtvis/dataVis.h