CS 31 Lab 9: Parallel Game of Life

Due Sunday, May 6, before midnight


Here are the partnerships and here's a link to GitHub.



Goals


Overview

The Game of Life programs we wrote for lab 7 run quite slowly when we use them to simulate a large board over many iterations. We can speed the simulation up by making better use of our system's resources. Our computers have multiple processors (CPUs), but our lab 7 programs used just one of them. Your task in this lab is to speed up your Game of Life program by redesigning it to run over multiple processors.

The user will specify with a command line argument how many threads they want the Game of Life simulation to use. If the number of these worker threads is less than or equal to the number of processors on your computer, then all the threads can run at the same time.

Your job is to divide the computation evenly over multiple threads that can run in parallel. Each thread will be assigned a section (partition) of the board that it's responsible for updating. You will need to synchronize the threads to avoid the situation where one thread is several iterations ahead of another.

With experimentation, you should be able to confirm that adding more threads makes the program run faster, until you exceed the number of processors available on your machine.


Details

Use your lab 7 solution as a starting point, and then add in the additional features required for this lab. We'll use the pthread library to create and manage our threads, so you should add #include <pthread.h> to the top of your gol.c file.
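For a rough picture of the pthread calls involved, main() might create the worker threads and then wait for them all to finish as in the sketch below. This is just an illustration, not a required structure: worker and ThreadInfo are placeholder names for your thread function and argument struct (see the Tips section). If your Makefile doesn't already handle it, you may also need to link with -pthread.

pthread_t *tids = malloc(num_threads * sizeof(pthread_t));
ThreadInfo *infos = malloc(num_threads * sizeof(ThreadInfo));
int t;
for (t = 0; t < num_threads; t++) {
  // fill in infos[t] with whatever thread t needs (its id, its cell range, ...)
  pthread_create(&tids[t], NULL, worker, &infos[t]);
}
for (t = 0; t < num_threads; t++) {
  pthread_join(tids[t], NULL);  // wait for worker thread t to finish
}
free(tids);
free(infos);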

In addition to becoming multithreaded, your program should now take four command line arguments:

  1. The name of the input file, as before.
  2. A 0 or 1 to indicate whether the simulation should be animated, as before.
  3. A positive integer indicating the number of threads.
  4. A 0 or 1 to indicate whether the board partitions should be printed.

When the fourth command line argument is 1, each thread should print out the range of cells that it's responsible for updating. This is mainly useful for debugging (and grading) purposes. Here we're referring to cells by their index in the dynamically allocated 1D array that we've been treating as a 2D array.
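For instance, main() could read the four arguments with something like the following sketch (the variable names here are just illustrative):

// argc should be 5: the program name plus the four arguments
if (argc != 5) {
  printf("usage: ./gol infile animate(0|1) num_threads print_partitions(0|1)\n");
  exit(1);
}
char *filename = argv[1];
int animate = atoi(argv[2]);
int num_threads = atoi(argv[3]);       // remember to check that this is positive
int print_partitions = atoi(argv[4]);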

$ ./gol square100.txt 0 3 1
thread  0:    0-3332 (3333 cells)
thread  1: 3333-6665 (3333 cells)
thread  2: 6666-9999 (3334 cells)
50 iterations of a 100x100 board took 0.019325 seconds.

$ ./gol square100.txt 0 4 1
thread  0:    0-2499 (2500 cells)
thread  1: 2500-4999 (2500 cells)
thread  2: 5000-7499 (2500 cells)
thread  3: 7500-9999 (2500 cells)
50 iterations of a 100x100 board took 0.012225 seconds.

$ ./gol square100.txt 0 6 1
thread  0:    0-1665 (1666 cells)
thread  1: 1666-3331 (1666 cells)
thread  5: 8330-9999 (1670 cells)
thread  3: 4998-6663 (1666 cells)
thread  4: 6664-8329 (1666 cells)
thread  2: 3332-4997 (1666 cells)
50 iterations of a 100x100 board took 0.044254 seconds.

You may be able to deduce the partitioning scheme we're using from the example output above. We divide the number of cells by the number of threads (integer division) to get the number of cells each thread is responsible for updating, n. The thread with logical id 0 gets the first n cells, thread 1 gets the next n cells, and so on. If the number of threads doesn't evenly divide the number of cells, we assign all the extra cells to the last thread (the one with the biggest id).
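In code, thread t's range under this scheme might be computed like the sketch below (variable names are illustrative):

int n = total_cells / num_threads;  // integer division rounds down
int start = t * n;                  // first cell assigned to thread t
int end = start + n - 1;            // last cell assigned to thread t
if (t == num_threads - 1) {
  end = total_cells - 1;            // the last thread picks up any leftover cells
}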

Make sure that each cell is assigned to exactly one thread. If a cell is unassigned, then the simulation won't run correctly. If a cell is assigned to multiple threads, this will slow down the program.

Because the partitioning prints come from concurrently running threads, they won't necessarily show up in order of thread id; you can see this in the third example run above. The above runs were done on a machine with 4 CPUs, which explains why the program got faster when we increased from 3 threads to 4, but then slowed down when we increased to 6.

The input file format is the same as it was on lab 7. We have provided a couple of test files, which you can use in addition to the ones from lab 7. You are encouraged to create your own additional test files. When you test that your multithreaded Game of Life program is still correctly implementing the update rules, it's good to use a small board. When you test that your program actually becomes faster with more threads, it's good to use a large board that runs for many iterations.


Requirements


Tips

To convert a 1D cell index k back into a row i and column j (for example, when finding a cell's neighbors), you can use:

int i, j;
i = k / cols;
j = k % cols;

This reverses the i*cols + j calculation.

A thread's start function takes a single void * argument, so bundle everything each thread needs (its logical id, its cell range, the number of iterations, pointers to the boards, and so on) into a struct and pass a pointer to it:

typedef struct {
  // declare fields here: thread id, start/end cells, iterations, boards, ...
} ThreadInfo;

To keep the threads in lockstep, so that no thread moves on to iteration i+1 before every thread has finished iteration i, use a pthread barrier:

// declare a global barrier variable
static pthread_barrier_t barrier;

// initialize the barrier with the number of threads it should wait for
pthread_barrier_init(&barrier, NULL, numberOfThreads);

// use the barrier (each thread blocks here until all threads have arrived)
pthread_barrier_wait(&barrier);
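
Putting these pieces together, a worker thread's main loop might look roughly like the sketch below. The field names (num_iterations, start_cell, end_cell) are placeholders, and depending on how you swap boards you may need an additional barrier per iteration:

void *worker(void *arg) {
  ThreadInfo *info = (ThreadInfo *)arg;
  int iter, k;
  for (iter = 0; iter < info->num_iterations; iter++) {
    for (k = info->start_cell; k <= info->end_cell; k++) {
      // compute the next state of cell k from the current board
    }
    // wait until every thread has finished this iteration before moving on
    pthread_barrier_wait(&barrier);
    // ... swap or advance the boards according to your design; depending on
    // how you do this, you may need a second barrier here ...
  }
  return NULL;
}

When the simulation is finished, the barrier can be cleaned up with pthread_barrier_destroy(&barrier).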

You can see how many processors your machine has by running:

$ cat /proc/cpuinfo

Note that when you run valgrind, it will report more than four mallocs because fopen and pthread_create both use malloc internally.


Extra Efficiency (Optional)

There are a number of things you could do to make this program more efficient in terms of time or memory. None of these is required.

  1. Improve the partitioning scheme so that the numbers of cells assigned to any two threads never differ by more than 1 (see the first sketch after this list). This balances the workload more evenly across the threads in cases where the number of threads doesn't evenly divide the number of cells. With this approach we might see output like the following:
$ ./gol square100.txt 0 6 1
thread  0:    0-1666 (1667 cells)
thread  1: 1667-3333 (1667 cells)
thread  2: 3334-5000 (1667 cells)
thread  3: 5001-6667 (1667 cells)
thread  4: 6668-8333 (1666 cells)
thread  5: 8334-9999 (1666 cells)
50 iterations of a 100x100 board took 0.037205 seconds.
  2. Compute the partitions inside the worker threads instead of in the main() thread, if you haven't already done so.

  3. Certain values are needed by all the threads, like the number of iterations and the number of columns. Giving each thread its own copy of these values is a waste of memory. Instead, put all these values in a single struct, stored in the stack frame for main(). Then pass each thread a pointer to this struct. Since we pass things to threads by putting them in structs, you'll have a struct inside of a struct.

  4. Prevent excessive cache misses by aligning memory to cache lines. Our computers have 64-byte cache blocks, so this would involve:
    1. Using posix_memalign instead of malloc to make sure your board starts at the beginning of a 64-byte cache line (see the second sketch after this list).
    2. Making sure that all but the last thread are assigned a multiple of 64 cells.
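
For the balanced partitioning in the first item, one common approach is to give the first (total_cells % num_threads) threads one extra cell each. A sketch, with illustrative variable names:

int n = total_cells / num_threads;  // minimum cells per thread
int r = total_cells % num_threads;  // leftover cells to spread around
int start, end;
if (t < r) {
  start = t * (n + 1);              // the first r threads get n + 1 cells each
  end = start + n;
} else {
  start = r * (n + 1) + (t - r) * n;
  end = start + n - 1;              // the remaining threads get n cells each
}

For the aligned allocation in the last item, posix_memalign (declared in <stdlib.h>) takes the alignment and size and hands the pointer back through its first argument. A minimal sketch, assuming the board is an array of ints:

int *board = NULL;
if (posix_memalign((void **)&board, 64, rows * cols * sizeof(int)) != 0) {
  printf("posix_memalign failed\n");
  exit(1);
}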