Due Dates

  • Due before 11:59pm, Wednesday Oct. 6: push your solution to your git repo.

  • Demo: after the due date you will sign up to demo your firesimulator.

Lab 3 Partners

This lab will be done with your Lab 3 partner.

See the documentation on the course webpage about working with partners. It is a brief guide for effective practices and my expectations for how CS students should be working together on labs.

Overview

For this assignment you and your partner will implement a discrete event simulator in CUDA that simulates a forest fire in a 2-D world of lakes and land. Your program will make use of the ParaVis GPU animation library to animate your fire simulation as it runs on the GPU.

Lab Goals

  • Learn CUDA programming by implementing a fire simulator.

  • Learn the ParaVis library to animate a CUDA computation, and help with debugging your CUDA fire simulator.

  • Gain more practice with C programming along the way.

Starting Point Code

  1. Clone your Lab 3 repo from the CS87 git org:

    cd ~/cs87/Labs
    git clone [your_Lab_ssh_URL]

    If all was successful, you should see the following files when you run ls:

    Makefile  README.adoc  fire1.txt  firesimulator.cu  firesimulator.h  main.cpp

    If this didn’t work, or for more detailed instructions on git see: Git Help Page

Starting Point Files

The starting point includes several files, many of which contain some starting point code written for you. These files include:

  • README.adoc: detailed information about these files, and which code you need to modify or add to which files.

  • main.cpp: the main function, which contains the main control flow needed for a ParaVis program. You only need to modify this code slightly. See the comments and TODOs for parts you can change if you want, parts you may need to add, and parts you should be aware of when you implement the constructor.

  • firesimulator.[cu,h]: the bulk of the code you write to implement the fire simulation, including CUDA kernels for animation and simulation, goes in these files. The starting point includes the start of the class you need to define that inherits from Animator. See the TODOs in these files for what to add where. You will want to add to the class definition in the .h file, but most of the code you will write and modify is in the .cu file.

    I have a lot of TODO comments in here for you, with starts of some of the method functions and CUDA kernel functions you will need to write. You should define and use more constants and more functions (method, C, and CUDA device) for good modular design.

  • *.txt: one or more example input files.

  • Makefile: builds the executable. NOTE: there are many uninitialized variable warnings in the starting point code. Your submitted code should have no warnings (initialize and use these variables in your solution).

Because ParaVis runs the animation on the GPU, you have to run this lab while you are logged in directly at a lab machine (running it from a remote login won’t work).

Details

You will implement a discrete event fire simulator in CUDA. Discrete event simulation is the same technique used in GOL, and thus the firesimulator will be structured similarly. As a result, you can focus more on the CUDA implementation parts for this lab and less on the discrete event simulation, which you know how to do.

Cell State

The forest fire simulator is a discrete event simulator of a 2-dimensional non-torus world, where each cell is either:

  1. part of a LAKE

  2. part of a forest that is UNBURNED

  3. part of a forest that is BURNING

  4. part of a forest that has already BURNED

In addition to a cell being in one of these different states, also associated with each cell is its temperature. A cell’s temperature range depends on its state:

  1. 60 degrees for UNBURNED forest cells

  2. 300 to 1000 to 60 for a BURNING forest cell. A burning cell goes through an increasing and then a decreasing temperature phase. It starts at the ignition temperature of 300 degrees and increases up to a max of 1000 degrees. Once it reaches 1000 degrees its temperature starts decreasing back down to 60 degrees, at which point it becomes BURNED.

  3. X degrees for a BURNED cell: you can pick a temperature, but pick one that no UNBURNED or BURNING forest cell can ever be.

  4. Y degrees for a LAKE cell: you can pick a temperature, but pick one that no forest cell can be.

You should define and use constants for these and for sizes in your solution.
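One possible way to organize these constants is sketched below in plain C++ (host-side code that also compiles in a .cu file). The names (CellState, TEMP_IGNITE, and so on) and the particular BURNED and LAKE temperatures are my illustrative assumptions, not values from the starting point code. The only constraint taken from the text above is that they fall outside the 60 to 1000 range a live forest cell can have.

```cpp
// Hypothetical names; only 60, 300, and 1000 come from the lab text.
enum CellState { LAKE, UNBURNED, BURNING, BURNED };

const int TEMP_UNBURNED = 60;    // resting temperature of unburned forest
const int TEMP_IGNITE   = 300;   // temperature when a cell first catches fire
const int TEMP_MAX      = 1000;  // peak temperature of a burning cell
const int TEMP_BURNED   = 0;     // assumed value: below any live forest temp
const int TEMP_LAKE     = -100;  // assumed value: differs from all forest temps

// Compile-time checks that the chosen values cannot be confused with
// the temperature of an UNBURNED or BURNING cell, or with each other.
static_assert(TEMP_BURNED < TEMP_UNBURNED, "BURNED temp overlaps forest range");
static_assert(TEMP_LAKE < TEMP_UNBURNED, "LAKE temp overlaps forest range");
static_assert(TEMP_LAKE != TEMP_BURNED, "LAKE and BURNED temps must differ");

// Returns true when temp is a value an UNBURNED or BURNING cell can hold.
bool is_live_forest_temp(int temp) {
  return temp >= TEMP_UNBURNED && temp <= TEMP_MAX;
}
```

With distinct values like these, a cell’s temperature alone is enough to tell a BURNED or LAKE cell apart from live forest, which can be handy when debugging.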

Command-line Options

  • Your simulator should take the following command line arguments (all are optional arguments):

    ./firesimulator {-i iters -d step -p prob | -f filename}
     -i iters     number of iterations to run
     -d step      rate at which a burning cell's temp increases or decreases each step
     -p prob      probability a cell will catch fire if one of its neighbors is burning
     -f filename  read in configuration info from a file

    Your program should use default values for any values not given as command line arguments. Use 1,000 iterations, a step size of 20, and a probability of 0.25 as the default values.

    Options -i, -d and -p are not compatible with -f. The file format is discussed below (see Input file format).

    Initialize your world to some default configuration (unless the -f command line option is given, in which case initialize from the settings read in from the file). Your default configuration should start a fire in the center of the world (just a single cell…​like a lightning strike). It should also contain a couple of lakes (a lake is a contiguous region of some size of lake cells).
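Here is a minimal sketch of parsing these options with the POSIX getopt function. The Options struct and the parse_args function are hypothetical names I made up for illustration (they are not in the starting point code), and the range checks on the values are one reasonable choice rather than a requirement.

```cpp
#include <cstdlib>    // std::atoi, std::atof
#include <unistd.h>   // getopt, optarg, optind (POSIX)

// Hypothetical container for the run options; defaults follow the lab text.
struct Options {
  int iters = 1000;
  int step = 20;
  float prob = 0.25f;
  const char *filename = nullptr;  // set only when -f is given
};

// Parses argv into opts. Returns 0 on success, or -1 on a usage error:
// an unknown flag, a missing or out-of-range value, or -f combined
// with any of -i, -d, -p.
int parse_args(int argc, char *argv[], Options *opts) {
  bool have_idp = false;
  int c;
  optind = 1;  // restart scanning in case getopt was used earlier
  while ((c = getopt(argc, argv, "i:d:p:f:")) != -1) {
    switch (c) {
      case 'i': opts->iters = std::atoi(optarg); have_idp = true; break;
      case 'd': opts->step = std::atoi(optarg); have_idp = true; break;
      case 'p': opts->prob = (float)std::atof(optarg); have_idp = true; break;
      case 'f': opts->filename = optarg; break;
      default:  return -1;  // getopt has already printed a message
    }
  }
  if (opts->filename != nullptr && have_idp) return -1;
  if (opts->iters <= 0 || opts->step <= 0) return -1;
  if (opts->prob < 0.0f || opts->prob > 1.0f) return -1;
  return 0;
}
```

On a -1 return, main would print a usage message and exit. Rejecting bad values here is also an easy way to cover the error handling part of the demo.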

Fire Simulation

Your fire simulator will simulate a forest fire’s spread over some number of time steps. At the end of the simulation your program should print out the time it took to simulate on the GPU (i.e. the firesimulation CUDA kernel part of the execution).

  • At each time step, a cell’s state and/or temperature may change according to these rules:

    1. if a cell is a LAKE, it stays a LAKE

    2. if a cell is BURNED, it stays BURNED forever

    3. if a cell is UNBURNED, then it either starts on fire or stays UNBURNED.

    4. if a cell is BURNING, it either stays BURNING or may become BURNED.

    To decide if an UNBURNED cell starts on fire:

    1. look at the state of its immediate neighbors to the north, south, east and west. The world is not a torus, so each cell has up to 4 neighbors; edge cells have only 2 or 3 neighbors.

    2. if at least one neighboring cell is on fire, then the cell will catch fire with a probability passed in on the command line (or use 25% as the default probability).

    If an UNBURNED cell changes state to BURNING, its new temperature jumps to 300 degrees F, and its temperature will start increasing in the next time step.

    If a cell is BURNING, then it burns at a constant rate for some number of time steps. However, its temperature first increases from 300 (the fire igniting temp) up to 1000 degrees, and then it decreases from 1000 back down to 60 degrees, at which point it becomes a BURNED cell.

    The rate at which its temperature increases or decreases is given by a command line argument -d, or uses the default value of 20 degrees each step.

    A BURNING cell’s state may change based on its new temperature: if its new temperature is <= 60, then this cell is now done burning and its state is now BURNED. Its temperature is set to the BURNED temperature value that you use.

  • After simulating the given number of steps your program should print out the cumulative GPU run time before it exits (when the user selects the quit button in the ParaVis animation window, your destructor will be invoked).
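The rules above can be sketched as a per-cell update function. This is plain C++ written so the logic can be checked on the CPU; it is not the kernel you will write, and all of the names are hypothetical. It also makes one assumption the text leaves open: each cell carries a heating flag that records whether a BURNING cell is still in its rising phase (you could encode this some other way). The random value rnd is assumed to be uniform in (0, 1].

```cpp
#include <cassert>

enum CellState { LAKE, UNBURNED, BURNING, BURNED };
const int TEMP_UNBURNED = 60, TEMP_IGNITE = 300, TEMP_MAX = 1000;
const int TEMP_BURNED = 0;  // assumed value, pick your own

struct Cell {
  CellState state;
  int temp;
  bool heating;  // assumed extra field: true while temperature is rising
};

// True if any N/S/E/W neighbor of (row, col) in an n x n row-major,
// non-torus world is BURNING. Edge cells simply have fewer neighbors.
bool has_burning_neighbor(const Cell *world, int n, int row, int col) {
  assert(row >= 0 && row < n && col >= 0 && col < n);
  if (row > 0     && world[(row - 1) * n + col].state == BURNING) return true;
  if (row < n - 1 && world[(row + 1) * n + col].state == BURNING) return true;
  if (col > 0     && world[row * n + col - 1].state == BURNING) return true;
  if (col < n - 1 && world[row * n + col + 1].state == BURNING) return true;
  return false;
}

// Computes a cell's value at the next time step.
Cell next_cell(Cell c, bool neighbor_burning, float rnd, int step, float prob) {
  if (c.state == UNBURNED) {
    if (neighbor_burning && rnd <= prob) {
      c.state = BURNING;
      c.temp = TEMP_IGNITE;
      c.heating = true;
    }
  } else if (c.state == BURNING) {
    if (c.heating) {
      c.temp += step;
      if (c.temp >= TEMP_MAX) { c.temp = TEMP_MAX; c.heating = false; }
    } else {
      c.temp -= step;
      if (c.temp <= TEMP_UNBURNED) { c.state = BURNED; c.temp = TEMP_BURNED; }
    }
  }
  return c;  // LAKE and BURNED cells fall through unchanged
}
```

With the default step of 20, a cell that has just ignited takes 35 steps to climb from 300 to 1000 and 47 more to fall to 60, so it burns for 82 steps. As in GOL, you will likely want to read neighbor state from the current world and write results to a separate next-step world, so that no cell sees a partially updated neighbor.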

Here are a few screen shots of a run: ./firesimulator -i 2000 -p 0.05 -d 20 showing a fire starting in the center and spreading to neighboring forest cells over time. In my simulator, unburned forest cells are green, burning forest cells are red, burned forest cells are black, and lake cells are blue. Note that my very rectangular lakes do not burn. A more interesting looking fire would use different colors for different temperatures, but this is not required (it may, however, help with debugging to see if your fire is correctly first increasing and then decreasing in temperature).

[Screen shots of the run: fire start, fire midway, fire end]

The ParaVis graphics display has point (0,0) in the lower left corner, whereas in your view of your program data (0,0) is in the upper left corner. As a result, the ParaVis animation of the world looks flipped vertically relative to the CUDA view of the world (e.g., CUDA’s block(0,0) is in the top left, but ParaVis maps it to the bottom left). If you’d like, you can map CUDA world state values differently in the ParaVis color3 state to flip the animated world back to match the CUDA view. You are not required to do this, but if you don’t, keep this in mind as you debug your solution (cell 0,0 is in the bottom left in the animation).
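If you do want the animation to match the CUDA view, the remapping is a small index calculation. The sketch below is plain C++ with a hypothetical function name, and it assumes the image buffer is an n x n row-major array whose index 0 is drawn at the bottom left, as described above.

```cpp
#include <cassert>

// Maps world cell (row, col), where row 0 is the top row, to the index
// of the image buffer pixel that displays it at the same place on
// screen when the buffer's row 0 is drawn at the bottom.
int flipped_pixel_index(int row, int col, int n) {
  assert(row >= 0 && row < n && col >= 0 && col < n);
  return (n - 1 - row) * n + col;
}
```

For example, with n = 512 the top-left world cell (0,0) maps to pixel index 511*512 = 261632, which is the first pixel of the topmost displayed row.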

Input file format

If run with an input file (the -f command line option), the program configuration values are all read in from the file. The file’s format should be:

line 1: number of iterations
line 2: step size
line 3: probability
line 4: the lightning strike cell (its (i,j) coordinates)
line 5: number of lakes
lines 6 to numlakes+5: one line per lake, giving the (i,j) coordinate pairs
of the upper left corner and lower right corner of that rectangular lake

The lake coordinates are given in terms of the 2-D array of cell values that you initialize on the CPU. All cells specified in that rectangle should be lake cells, all others should be forest cells. For example:

800
40
0.3
250 400
2
20 30 50 70
100 60 120 110

This will run a simulation for 800 iterations, with a temperature step size of 40 degrees per increase or decrease, and with a probability of 30%. It will start with an initial world containing 2 lakes: one with upper left corner at (20,30) and lower right at (50,70), the other with upper left corner at (100,60) and lower right at (120,110). All other cells will be UNBURNED forest cells, except cell (250,400) which will start as BURNING. It is fine if the lakes overlap (the lakes in the world from my example simulation include some examples of overlapping lake rectangles).
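A sketch of reading this format and building the initial CPU-side world is below. FireConfig, Lake, read_config, and init_world are hypothetical names, not part of the starting point code. I am also assuming that (i,j) means (row, column), that both lake corners are inclusive, and that rectangles reaching past the edge of the world are simply clipped.

```cpp
#include <algorithm>  // std::max, std::min
#include <sstream>    // std::istream; std::istringstream is handy for tests
#include <vector>

enum CellState { LAKE, UNBURNED, BURNING, BURNED };

struct Lake { int r1, c1, r2, c2; };  // upper-left and lower-right corners

struct FireConfig {  // hypothetical container for the file's values
  int iters, step;
  float prob;
  int strike_row, strike_col;
  std::vector<Lake> lakes;
};

// Reads a configuration in the format above. Whitespace and newlines are
// treated alike. Returns false if a value is missing or malformed.
bool read_config(std::istream &in, FireConfig *cfg) {
  int numlakes;
  if (!(in >> cfg->iters >> cfg->step >> cfg->prob)) return false;
  if (!(in >> cfg->strike_row >> cfg->strike_col >> numlakes)) return false;
  if (numlakes < 0) return false;
  cfg->lakes.clear();
  for (int k = 0; k < numlakes; k++) {
    Lake lk;
    if (!(in >> lk.r1 >> lk.c1 >> lk.r2 >> lk.c2)) return false;
    cfg->lakes.push_back(lk);
  }
  return true;
}

// Fills an n x n row-major world: forest everywhere, then the lake
// rectangles (overlaps are harmless), then the lightning strike cell.
void init_world(CellState *world, int n, const FireConfig &cfg) {
  for (int i = 0; i < n * n; i++) world[i] = UNBURNED;
  for (const Lake &lk : cfg.lakes) {
    for (int r = std::max(lk.r1, 0); r <= std::min(lk.r2, n - 1); r++) {
      for (int c = std::max(lk.c1, 0); c <= std::min(lk.c2, n - 1); c++) {
        world[r * n + c] = LAKE;
      }
    }
  }
  if (cfg.strike_row >= 0 && cfg.strike_row < n &&
      cfg.strike_col >= 0 && cfg.strike_col < n) {
    world[cfg.strike_row * n + cfg.strike_col] = BURNING;
  }
}
```

To read from a real file you would pass a std::ifstream opened on the -f filename. The same init_world can build your default world if you fill in a FireConfig by hand.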

Requirements

The following are requirements for your solution:

  • The size of the 2D world grid should be compiled into your program (see the constant definition for N, and use constants). You do not need to dynamically allocate space for the CPU-side grid (although you may). Instead, just statically declare a 2-dimensional array of NxN values on the CPU side and initialize it to the starting point values for your fire simulator. There is a comment at the top of firesimulator.cu about this. Your program will initialize the world in CPU memory, then allocate CUDA memory for the simulation state and copy the initialized world state from CPU to CUDA memory. Define and use a constant for the dimension of the world, N (512), so that you can easily try other sizes. Define N to be evenly divisible by the number of blocks and the number of threads in a block (i.e. you don’t need to handle cases where the total number of threads and total number of cells differ).

  • The 2D forest you are simulating is NOT a torus; there is no wrap-around for neighbors on edge points.

  • Each cell’s value changes based on its current state, and possibly the state of its up to 4 neighboring cells (north, south, east, and west).

  • Your program should take optional command line arguments for the number of iterations to run the simulation, the probability a cell catches fire if one or more of its neighbors is on fire, and the rate at which a burning cell’s temperature increases or decreases each time step. For example, to run for 800 time steps, using a probability of 20% and a temperature step of 50 do:

    ./firesimulator -i 800 -p 0.2 -d 50

    Because all of these arguments are optional, you should use default values of 1000 time steps, 25%, and 20 degrees for these values. Some of these are already defined as constants for you in firesimulator.[h,cu].

  • Your program should also support an optional command line argument for reading in world configuration information from a file (-f is not compatible with -i, -p or -d).

    ./firesimulator -f fire1.txt
  • Your program should contain timers to time the GPU fire simulation part of the computation and output the total time (and number of iterations) after the specified number of iterations of simulation and before your program exits.

  • At each step, you should color the display pixels based on each cell’s state (or temperature). I recommend starting with something simple like green for UNBURNED, red for BURNING, black for BURNED, and blue for LAKE. You are welcome to try something more complicated based on actual temperature, but this is not required.

  • You must use a 2-D grid of blocks layout on the GPU to match the 2-D array that is being modeled:

    dim3  blocks(D, D, 1);

    A 1-D grid of 512x512 blocks is too big for some of our graphics cards. If you don’t use a 2-D grid, your program will not work and you will see some very strange behavior (see Handy Resources for some utilities to find out about GPUs and their CUDA resource limits and usage on CS lab machines).

  • You should run your kernels as DxD blocks, and each block should have some number of 2D threads:

    dim3 blocks_grid(D,D,1); // a 2D array of blocks makes most sense here
                             // D should be a value determined by N and
                             // the number of threads per block or per block dim
    dim3 threads_block(x,y,z);  // you decide, but do more than (1,1,1)
                                // likely want 2D, (T,T,1), to map to 2D blocks
    
    // call kernel with blocks per grid dim and threads per block dim
    my_kernel_func<<< blocks_grid, threads_block >>>(args ...);

    You can define the block size (see its constant def BLOCK_SIZE) to be a value that evenly divides the grid size (N).

  • Your program should use ParaVis to visualize its simulation as it runs on the GPU. See the example program from Thursday in lab. The Lab 3 starting point code includes the scaffolding for this (see the comments in Starting Point Code).
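As a small worked example of the launch geometry in the requirements above, here is how D could be computed from N and the threads per block dimension. This is plain host-side C++. The value 16 for BLOCK_SIZE is only an assumption for illustration (check the constant in the starting point code), and blocks_per_dim is a hypothetical helper name.

```cpp
#include <cassert>

const int N = 512;          // world dimension, from the lab text
const int BLOCK_SIZE = 16;  // assumed threads per block dimension (T)

// The lab assumes exactly one thread per cell, so T must divide N.
static_assert(N % BLOCK_SIZE == 0, "BLOCK_SIZE must evenly divide N");

// Number of blocks needed per grid dimension (D) so that D*T == n.
int blocks_per_dim(int n, int threads_per_dim) {
  assert(threads_per_dim > 0 && n % threads_per_dim == 0);
  return n / threads_per_dim;
}
```

With these values, D is 512/16 = 32, so the kernel would be launched with blocks_grid(32, 32, 1) and threads_block(16, 16, 1). That is 256 threads per block and 32*32*256 = 262144 threads total, one per cell of the 512x512 world.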

Cuda Programming

Examples

I suggest starting by looking at the example CUDA programs we looked at in lab on Thursday. Your solution will use the ParaVis library to visualize the computation on the GPU. The userKernelBuffer example will be very helpful for the main control flow of a ParaVis program, and for image buffer updates. You can copy over the code from here:

cd ~/cs87/Labs
cp  -r ~newhall/public/cs87/cuda_examples .

Programming Model

The CUDA programming model consists of a global shared memory and a set of multi-threaded blocks that run in parallel. CUDA has very limited support for synchronization (only threads in the same thread block can synchronize their actions). As a result, CUDA programs are often written as purely parallel CUDA kernels that are run on the GPU, where code running on the CPU implements the synchronization steps. CUDA programs often have alternating steps of parallel execution on the GPU and sequential execution on the CPU (minimally, there is an implicit barrier at the end of one CUDA kernel call before the next CUDA kernel call starts).

A typical CUDA program may look like:

  1. The initialization phase that includes GPU memory allocation and memory copy from the CPU to GPU. CUDA memory is allocated on the GPU by calling cudaMalloc. Often program data are initialized on the CPU in a CPU-side copy of the data in RAM, and then copied to the GPU using cudaMemcpy. GPU data can also be initialized on the GPU using a CUDA kernel, and then the cudaMemcpy does not need to be done. For example, initializing all elements in an array to 0 can be done very efficiently on the GPU.

  2. The main computation phase that consists of one or more calls to CUDA kernel functions. This could be a loop run on the CPU that makes calls to one or more CUDA kernels to perform sub-steps of the larger computation. Because there is almost no support for GPU thread synchronization, CUDA kernels usually implement the parallel parts of the computation and the CPU side implements the synchronization events. An embarrassingly parallel application could run as a single CUDA kernel call.

  3. A final sequential output phase where data are copied from the GPU to the CPU, using cudaMemcpy, and output in some form.

  4. A clean-up phase where CUDA and CPU memory is freed. cudaFree is used to free GPU memory allocated with cudaMalloc. Be sure to call cudaFree in any error handling code that cleans up state and exits.

In CUDA, parallelism is expressed in terms of a number of multi-threaded parallel blocks running on the GPU. The programmer explicitly maps parallelism in terms of blocks and threads onto portions of the GPU data that each thread will access "simultaneously" in parallel. All array data in CUDA (on the GPU) are single-dimensional. However, the blocks and threads specification can be structured multi-dimensionally to better match the programmer’s view of their program. For example, for programs that process 2-D arrays, the CUDA programmer often specifies a 2-D layout of blocks where a block’s 2-D x, y position may better map onto the programmer’s view of the data. This is not to say that there is always a 1-1 mapping of blocks and threads to underlying data elements. There are limits to the sizes of blocks and threads per block, which means that for larger data, a single thread must access a range of the underlying array.

GPU Functions

CUDA kernel functions are prefixed with __global__. CUDA kernel functions are called from the CPU and run on the GPU. They are invoked using this syntax:

my_kernel_func<<< blocks, threads>>>(args ...);

CUDA __device__ functions are functions that can be called only from other __device__ functions or from __global__ functions. They are useful for good modular GPU-side code design. CUDA device functions are called using a similar syntax to C function calls. For example:

// a device function must be declared before the kernel that calls it
__device__ int findmax(int a, int b) {

  if(a > b) {
    return a;
  }
  return b;
}

__global__ void my_kernel_function(int a, int *dev_array) {

  // NOTE: this is not an example of how you will compute offset for your
  // fire simulator (this is a 1D grid and thread block example)
  int offset = blockIdx.x * blockDim.x + threadIdx.x;

  int max = findmax(a, dev_array[offset]);
  ...
}

CUDA Memory

GPU memory needs to be explicitly allocated with cudaMalloc.

If initial values for data are on CPU, then these need to be copied to GPU side data with cudaMemcpy (note: CUDA also provides a unified memory abstraction on top of separate GPU and CPU memories, but for this lab you are required to explicitly copy data between the CPU and GPU with cudaMemcpy). When your program is done using the GPU memory it allocated, it should explicitly free it (cudaFree).

When programming in CUDA you need to think carefully about what is running on the CPU on data stored in RAM, and what is running on the GPU on data stored on the GPU. Memory allocated on the GPU (via cudaMalloc) stays on the GPU between kernel calls. If the CPU wants intermediate or final results, they have to be explicitly copied from the GPU to CPU using cudaMemcpy.

In CUDA all arrays are 1-dimensional, so each parallel thread’s location in the multi-dimensional thread blocks specifying the parallelism, needs to be explicitly mapped onto offsets into CUDA 1-dimensional arrays. Often times there is not a perfect 1-1 thread to data mapping and the programmer needs to handle this case to not try to access invalid memory locations beyond the bounds of an array (when there are more threads than data elements), or to ensure that every data element is processed (when there are fewer threads than data elements).
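The mapping described above can be sketched on the CPU with plain integers standing in for CUDA’s built-in blockIdx, blockDim, and threadIdx values. The Idx2 struct and cell_offset function are hypothetical names for illustration. Treating x as the column and y as the row is just one convention; what matters is that you use the same convention everywhere.

```cpp
// Host-side stand-in for the .x and .y fields of CUDA's built-in
// blockIdx, blockDim, and threadIdx variables.
struct Idx2 { int x, y; };

// Returns the offset into a 1-D row-major array holding an n x n grid
// for the thread at the given 2-D position, or -1 if the thread falls
// outside the grid (the "more threads than data elements" case).
int cell_offset(Idx2 block_idx, Idx2 block_dim, Idx2 thread_idx, int n) {
  int col = block_idx.x * block_dim.x + thread_idx.x;
  int row = block_idx.y * block_dim.y + thread_idx.y;
  if (row >= n || col >= n) {
    return -1;  // a kernel would just return here without touching memory
  }
  return row * n + col;
}
```

For example, with 16x16 threads per block and n = 512, the thread (3,4) in block (1,2) handles column 19 of row 36, which is offset 36*512 + 19 = 18451.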

For this lab, if you use a 2D layout of blocks, then you can assume that there are enough GPU threads to have 1 thread associated with each cell in a 512x512 world (and you will likely want to use a 2D layout of threads in blocks too). There are also enough for an 800x800 world if you want to increase the N dimension and simulate larger worlds.

Timing CUDA code

To time the GPU part of your program’s execution, define start and stop variables of type cudaEvent_t, then "start" and "stop" events around the code you want to time, and compute the elapsed time based on their values. To do this you will need to use functions cudaEventCreate, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime, and cudaEventDestroy. Here is a simple example of timing one kernel call to simplekernel:

float amt;
cudaEvent_t e1, e2;

/* create cuda events */
cudaEventCreate(&e1);
cudaEventCreate(&e2);

/* record events around a kernel call */
cudaEventRecord(e1, 0);
simplekernel<<< ...
cudaEventRecord(e2, 0);

/* wait for the e2 event to complete */
cudaEventSynchronize(e2);

/* now compute the time between the two events */
cudaEventElapsedTime(&amt, e1, e2);

cudaEventDestroy(e1);
cudaEventDestroy(e2);

Add a data member to your fireSimulatorKernel class to keep a running total of your timing of each call to CUDA kernels. This is the value you can print out at the end of your program (in the fireSimulatorKernel destructor).

Random numbers in CUDA

Random number generators are inherently sequential: they generate a sequence of pseudo random values. It is much more complicated to generate pseudo random sequences in parallel. Depending on how your program wants to use random values, you may need to create separate random state for each thread that each thread uses to generate its own random sequence. Seeding each thread’s state differently will ensure that threads are not generating identical random sequences.

The cuRAND library provides an interface for initializing random number generator state, and using that state to generate random number sequences. You will need to use random numbers to calculate the chance that a cell will catch fire if one or more of its neighbors is on fire. The following are the steps necessary for using cuRAND to generate random numbers in your program (Note that most of the code you need is already included in the lab starting point, but read through this so you know what to add):

  1. include curand headers

    #include <curand_kernel.h>
    #include <curand.h>
  2. allocate curandState for every CUDA thread:

      cudaError_t ret;   // cudaMalloc returns cudaSuccess or an error code
      curandState *dev_random;
      ...
      ret = cudaMalloc((void**)&dev_random, sizeof(curandState)*N*N);
  3. write a CUDA kernel to initialize the random state (each thread will initialize its own state on the GPU):

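    // forward declaration of the hash function defined below
    __device__ unsigned int hash(unsigned int a);

    // CUDA kernel to initialize NxN array of curandState, each
    // thread will use its own curandState to generate its own
    // random number sequence
    __global__ void  init_rand(curandState *rand_state) {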
    
      int row, col, offset;
      row = blockIdx.x;
      col = blockIdx.y;
    
      offset = col + row*gridDim.x;
      if(row < N && col < N) {
        curand_init(hash(offset), 0, 0, &(rand_state[offset]));
      }
    
    }
    
    // a hash function for 32 bit ints
    // it uses a lot of bitwise operations to compute the hash value.
    // it is not important that you understand what it is doing
    // just that it is a cuda hash function h(a)
    // (from http://www.concentric.net/~ttwang/tech/inthash.htm)
    __device__ unsigned int hash(unsigned int a) {
    
      a = (a+0x7ed55d16) + (a<<12);
      a = (a^0xc761c23c) ^ (a>>19);
      a = (a+0x165667b1) + (a<<5);
      a = (a+0xd3a2646c) ^ (a<<9);
      a = (a+0xfd7046c5) + (a<<3);
      a = (a^0xb55a4f09) ^ (a>>16);
      return a;
    }
  4. Call init_rand before calling any cuRAND library functions that use curandState:

    // this example is invoking it with 1 thread per block:
    init_rand<<< blocks, 1 >>>(dev_random);
  5. Now CUDA threads can generate random numbers on the GPU using their own initialized state:

    __global__ void  use_rand_kernel(curandState *rand_state, float prob){
    
      int offset = ... // compute offset based on this thread's position
                       // in the parallelization
    
      // get a random value uniformly distributed in the range (0.0, 1.0]
      float val = curand_uniform(&(rand_state[offset]));
      ...
    }
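To see why hashing each thread’s offset gives every thread a different seed, you can run the same arithmetic on the CPU. The sketch below is a host-side copy of the hash function from step 3 (hash_host and count_distinct_seeds are hypothetical names, and it assumes unsigned int is 32 bits, as it is on the lab machines). It counts how many distinct seeds come out for a range of offsets.

```cpp
#include <cstddef>  // std::size_t
#include <set>

// Host-side copy of the device hash function, same arithmetic.
unsigned int hash_host(unsigned int a) {
  a = (a+0x7ed55d16) + (a<<12);
  a = (a^0xc761c23c) ^ (a>>19);
  a = (a+0x165667b1) + (a<<5);
  a = (a+0xd3a2646c) ^ (a<<9);
  a = (a+0xfd7046c5) + (a<<3);
  a = (a^0xb55a4f09) ^ (a>>16);
  return a;
}

// Returns the number of distinct seeds produced for offsets 0..count-1.
std::size_t count_distinct_seeds(unsigned int count) {
  std::set<unsigned int> seeds;
  for (unsigned int i = 0; i < count; i++) {
    seeds.insert(hash_host(i));
  }
  return seeds.size();
}
```

Every step of this hash can be undone, so two different offsets never hash to the same value. If all of your threads appear to make identical random choices, the more likely cause is that they are all computing the same offset.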

Debugging CUDA

  • You can add simple debug printf statements to CUDA kernels.

    printf output from a kernel call is buffered in a circular buffer and passed to the host at the end of the kernel call. If there is too much output in one kernel call, it will overwrite some of its previous output in the circular buffer (so be aware of this if you see "missing output").

    If you add debugging output in a kernel, I recommend you limit the amount of debug output by changing the problem size (make N smaller) and/or only having some threads execute the printf (put the printf inside an if statement).

  • The ParaVis animation of your firesimulator can help you debug your solution. An odd looking animation indicates errors in your simulation. These could include such things as not mapping threads to 2D grid elements correctly, not using curand correctly with the step and probability values, not correctly accessing neighbor values, or not computing the correct value for a cell at the next time step.

  • cuda-memcheck is like valgrind for GPU memory: run your program under it to find invalid GPU memory accesses.

ParaVis color3 pixels

The image buffer should be a 2D array of color3 pixels, one element for each corresponding element in the fire grid. To set a color3 to a specific value, set its r, g, b components to values between 0 and 255:

// set r, g, b values individually:
imagebuff[index].r = 255;   // set to orange
imagebuff[index].g = 128;
imagebuff[index].b = 0;

// or set all three in a single C statement like this:
imagebuff[index] = {255, 128, 0};   // set to orange

Here are rgb values of a few different colors:

Black:  { 0, 0, 0 }         Red:    { 255, 0, 0 }       Orange: { 255, 128, 0 }
Yellow: { 255, 255, 0 }     Green:  { 0, 255, 0 }       Blue:   { 0, 0, 255 }
Purple: { 128, 0, 255 }     White:  { 255, 255, 255 }   Pink:   { 255, 0, 128 }
Teal:   { 50, 255, 255 }    Brown:  { 100, 50, 0 }
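Here is a sketch of choosing a pixel color from a cell’s state, plus an optional shading of BURNING cells by temperature. It is plain C++ using a stand-in struct named rgb rather than the real ParaVis color3 type. The function names and the particular red-to-yellow shading are my assumptions for illustration only.

```cpp
enum CellState { LAKE, UNBURNED, BURNING, BURNED };

// Stand-in for ParaVis's color3; only the r, g, b fields are assumed.
struct rgb { unsigned char r, g, b; };

// The simple color scheme: one fixed color per state.
rgb state_color(CellState s) {
  switch (s) {
    case UNBURNED: return {0, 255, 0};   // green
    case BURNING:  return {255, 0, 0};   // red
    case BURNED:   return {0, 0, 0};     // black
    default:       return {0, 0, 255};   // LAKE: blue
  }
}

// Optional: shade a BURNING cell from red (60 degrees) toward yellow
// (1000 degrees) by scaling the green channel with the temperature.
rgb burning_color(int temp) {
  if (temp < 60) temp = 60;       // clamp to the burning range
  if (temp > 1000) temp = 1000;
  int g = (temp - 60) * 255 / (1000 - 60);
  return {255, (unsigned char)g, 0};
}
```

In a kernel, each thread would assign the chosen color to its own pixel in the image buffer. Shading by temperature makes it easy to see whether cells heat up and then cool down as expected.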

Submitting

Repo

Before the Due Date, you or your partner should push your solution from one of your local repos to the GitHub remote repo. Be sure to do a make clean before you git add anything:

make clean
git add *.cpp *.cu *.h *.txt
git commit
git push
git status

See the git help page "Troubleshooting" section for git help.

Demo

You and your partner will sign up for a 15 minute demo slot to demo your completed fire simulator to me.

The link to the demo sign-up sheet has been posted to the class EdStem page (in the Lab 3 tab).

About the Demo:

  • Prior to your demo, you and your partner should come up with a set of runs to demonstrate to me to show that your simulation is correct and robust to some errors (bad input values most notably). Think about, and practice, different scenarios to demonstrate both correctness and error handling. Think about some runs that will demonstrate that your firesimulator is correctly using the fire parameters to simulate the specified fire. It might be useful to create some additional fire starting point files to help demo certain functionality. A handful of runs should be sufficient to demo everything.

  • Your demo is also an opportunity to talk with me about your solution, and some aspects that you found difficult or interesting and how you solved them (or potentially didn’t). And it is a chance to show off any extra features you implemented.

  • You should plan for a 15 minute demo slot (approx 5 mins of demonstrating to me your runs and approx 10 mins of discussion and answering some questions from me). 15 minutes is very short, so make sure you have planned out what you want to show me to demonstrate your solution.

  • Both you and your partner must be present for the demo, and you should be logged into an overflow (or other) lab machine ready to go at the start of your time slot.

We will start out the demo by you showing me and talking me through the runs of your solution that demonstrate its correctness and error handling (i.e., go through your script of example runs and tell me what each is showing). I may ask you to show me some runs as well. And then we will have an opportunity to discuss your solution in more detail.

In general, a demo is not talking me through your code. It is possible that we may look at some of it in the context of discussing your solution, but it may be that we don’t look at your code at all.

Post Lab Eval

After submission and demo, fill out a short evaluation form for this lab assignment. A link to the Lab 3 evaluation form will be emailed to you.

Handy Resources

C, getopts

CUDA

  • Nvidia’s CUDA documentation includes a Programmer’s Guide and other resources. See Nvidia’s CUDA Developer Documentation; their Parallel Forall Developer’s Blog has some other resources. Here is the cuRAND Library documentation.

  • Chapt 15.1 of Dive into Systems

  • We have a copy of the book "CUDA by Example" in the lab, which is a useful resource. CUDA has some new features since this, but this covers the basics well.

  • The Thursday in lab CUDA examples (userKernelBuffer is an example using the ParaVis library).

  • You can look over and try running Nvidia’s CUDA example programs located here: /usr/local/cuda/samples

    Just cd into subdirectories here, look at the code and try running the binaries to see what they do. Some of these examples use features of CUDA that are well beyond what you need to use in this lab, so don’t get too bogged down in slogging through them all.

  • Some information about the ParaVis library is described in comments in the library .h files (open .h files in vim or other editor):

    vim  /usr/local/include/qtvis/dataVisCUDA.h
    vim  /usr/local/include/qtvis/dataVis.h
    vim  /usr/local/include/qtvis/imageBuffer.h
    vim  /usr/local/include/qtvis/animator.h
  • lab machine specs page lists graphics card specs

  • run deviceQuery to get CUDA stats about the GPU on a particular machine. It will list the limits on block and thread size, and the GPU memory size among other information.

  • nvidia-smi shows GPU storage and usage information (it is sort of like top for the GPU). You can run it over and over again using watch in one terminal while you run your firesimulator to see how it uses GPU resources as it runs. watch is a command that runs some other command over and over, at an interval in seconds that you specify with the -n command line option. Type Ctrl-C to kill watch:

    # run nvidia-smi every 2 seconds
    watch -n 2  nvidia-smi

misc help pages