CS 31 Lab 9: Parallel Game of Life

Due Sunday, May 6, before midnight


Here are the partnerships and here's a link to GitHub.



Goals


Overview

The Game of Life programs we wrote for lab 7 run quite slowly when we use them to simulate a large board over many iterations. We can speed the simulation up by making better use of our system's resources. Our computers have multiple processors (CPUs), but our lab 7 programs used just one of them. Your task in this lab is to speed up your Game of Life program by redesigning it to run over multiple processors.

The user will specify with a command line argument how many threads they want the Game of Life simulation to use. If the number of these worker threads is less than or equal to the number of processors on your computer, then all the threads can run at the same time.

Your job is to divide the computation evenly over multiple threads that can run in parallel. Each thread will be assigned a section (partition) of the board that it's responsible for updating. You will need to synchronize the threads to avoid the situation where one thread is several iterations ahead of another.

With experimentation, you should be able to confirm that adding more threads makes the program run faster, until you exceed the number of processors available on your machine.


Details

Use your lab 7 solution as a starting point, and then add in the additional features required for this lab. We'll use the pthread library to create and manage our threads, so you should add #include <pthread.h> to the top of your gol.c file.
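For a rough picture of the pthread calls involved, main() might create the worker threads and then wait for them all to finish as in the sketch below. This is just an illustration, not a required structure: worker and ThreadInfo are placeholder names for your thread function and argument struct (see the Tips section). If your Makefile doesn't already handle it, you may also need to link with -pthread.

pthread_t *tids = malloc(num_threads * sizeof(pthread_t));
ThreadInfo *infos = malloc(num_threads * sizeof(ThreadInfo));
int t;
for (t = 0; t < num_threads; t++) {
  // fill in infos[t] with whatever thread t needs (its id, its cell range, ...)
  pthread_create(&tids[t], NULL, worker, &infos[t]);
}
for (t = 0; t < num_threads; t++) {
  pthread_join(tids[t], NULL);  // wait for worker thread t to finish
}
free(tids);
free(infos);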

In addition to becoming multithreaded, your program should now take four command line arguments:

  1. The name of the input file, as before.
  2. A 0 or 1 to indicate whether the simulation should be animated, as before.
  3. A positive integer indicating the number of threads.
  4. A 0 or 1 to indicate whether the board partitions should be printed.

When the fourth command line argument is 1, each thread should print out the range of cells that it's responsible for updating. This is mainly useful for debugging (and grading) purposes. Here we're referring to cells by their index in the dynamically allocated 1D array that we've been treating as a 2D array.
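For instance, main() could read the four arguments with something like the following sketch (the variable names here are just illustrative):

// argc should be 5: the program name plus the four arguments
if (argc != 5) {
  printf("usage: ./gol infile animate(0|1) num_threads print_partitions(0|1)\n");
  exit(1);
}
char *filename = argv[1];
int animate = atoi(argv[2]);
int num_threads = atoi(argv[3]);       // remember to check that this is positive
int print_partitions = atoi(argv[4]);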

$ ./gol square100.txt 0 3 1
thread  0:    0-3332 (3333 cells)
thread  1: 3333-6665 (3333 cells)
thread  2: 6666-9999 (3334 cells)
50 iterations of a 100x100 board took 0.019325 seconds.

$ ./gol square100.txt 0 4 1
thread  0:    0-2499 (2500 cells)
thread  1: 2500-4999 (2500 cells)
thread  2: 5000-7499 (2500 cells)
thread  3: 7500-9999 (2500 cells)
50 iterations of a 100x100 board took 0.012225 seconds.

$ ./gol square100.txt 0 6 1
thread  0:    0-1665 (1666 cells)
thread  1: 1666-3331 (1666 cells)
thread  5: 8330-9999 (1670 cells)
thread  3: 4998-6663 (1666 cells)
thread  4: 6664-8329 (1666 cells)
thread  2: 3332-4997 (1666 cells)
50 iterations of a 100x100 board took 0.044254 seconds.

You may be able to deduce the partitioning scheme we're using from the example output above. We divide the number of cells by the number of threads (integer division) to get the number of cells each thread is responsible for updating, n. The thread with logical id 0 gets the first n cells, thread 1 gets the next n cells, and so on. If the number of threads doesn't evenly divide the number of cells, we assign all the extra cells to the last thread (the one with the biggest id).
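In code, thread t's range under this scheme might be computed like the sketch below (variable names are illustrative):

int n = total_cells / num_threads;  // integer division rounds down
int start = t * n;                  // first cell assigned to thread t
int end = start + n - 1;            // last cell assigned to thread t
if (t == num_threads - 1) {
  end = total_cells - 1;            // the last thread picks up any leftover cells
}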

Make sure that each cell is assigned to exactly one thread. If a cell is unassigned, then the simulation won't run correctly. If a cell is assigned to multiple threads, this will slow down the program.

Because the partitioning prints come from concurrently running threads, they won't necessarily show up in order of thread id; you can see this in the third example run above. The above runs were done on a machine with 4 CPUs, which explains why the program got faster when we increased from 3 threads to 4, but then slowed down when we increased to 6.

The input file format is the same as it was on lab 7. We have provided a couple of test files, which you can use in addition to the ones from lab 7. You are encouraged to create your own additional test files. When you test that your multithreaded Game of Life program is still correctly implementing the update rules, it's good to use a small board. When you test that your program actually becomes faster with more threads, it's good to use a large board that runs for many iterations.


Requirements


Tips

To convert a 1D cell index k back into a row i and column j (for example, when finding a cell's neighbors), you can use:

int i, j;
i = k / cols;
j = k % cols;

This reverses the i*cols + j calculation.

A thread's start function takes a single void * argument, so bundle everything each thread needs (its logical id, its cell range, the number of iterations, pointers to the boards, and so on) into a struct and pass a pointer to it:

typedef struct {
  // declare fields here: thread id, start/end cells, iterations, boards, ...
} ThreadInfo;

To keep the threads in lockstep, so that no thread moves on to iteration i+1 before every thread has finished iteration i, use a pthread barrier:

// declare a global barrier variable
static pthread_barrier_t barrier;

// initialize the barrier with the number of threads it should wait for
pthread_barrier_init(&barrier, NULL, numberOfThreads);

// use the barrier (each thread blocks here until all threads have arrived)
pthread_barrier_wait(&barrier);
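
Putting these pieces together, a worker thread's main loop might look roughly like the sketch below. The field names (num_iterations, start_cell, end_cell) are placeholders, and depending on how you swap boards you may need an additional barrier per iteration:

void *worker(void *arg) {
  ThreadInfo *info = (ThreadInfo *)arg;
  int iter, k;
  for (iter = 0; iter < info->num_iterations; iter++) {
    for (k = info->start_cell; k <= info->end_cell; k++) {
      // compute the next state of cell k from the current board
    }
    // wait until every thread has finished this iteration before moving on
    pthread_barrier_wait(&barrier);
    // ... swap or advance the boards according to your design; depending on
    // how you do this, you may need a second barrier here ...
  }
  return NULL;
}

When the simulation is finished, the barrier can be cleaned up with pthread_barrier_destroy(&barrier).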

You can see how many processors your machine has by running:

$ cat /proc/cpuinfo

Note that when you run valgrind, it will report more than four mallocs because fopen and pthread_create both use malloc internally.


Extra Efficiency (Optional)

There are a number of things you could do to make this program more efficient in terms of time or memory. None of these is required.

  1. Improve the partitioning scheme so that the numbers of cells assigned to any two threads never differ by more than 1 (see the first sketch after this list). This balances the workload more evenly across the threads in cases where the number of threads doesn't evenly divide the number of cells. With this approach we might see output like the following:
$ ./gol square100.txt 0 6 1
thread  0:    0-1666 (1667 cells)
thread  1: 1667-3333 (1667 cells)
thread  2: 3334-5000 (1667 cells)
thread  3: 5001-6667 (1667 cells)
thread  4: 6668-8333 (1666 cells)
thread  5: 8334-9999 (1666 cells)
50 iterations of a 100x100 board took 0.037205 seconds.
  2. Compute the partitions inside the worker threads instead of in the main() thread, if you haven't already done so.

  3. Certain values are needed by all the threads, like the number of iterations and the number of columns. Giving each thread its own copy of these values is a waste of memory. Instead, put all these values in a single struct, stored in the stack frame for main(). Then pass each thread a pointer to this struct. Since we pass things to threads by putting them in structs, you'll have a struct inside of a struct.

  4. Prevent excessive cache misses by aligning memory to cache lines. Our computers have 64-byte cache blocks, so this would involve:
    1. Using posix_memalign instead of malloc to make sure your board starts at the beginning of a 64-byte cache line (see the second sketch after this list).
    2. Making sure that all but the last thread are assigned a multiple of 64 cells.
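
For the balanced partitioning in the first item, one common approach is to give the first (total_cells % num_threads) threads one extra cell each. A sketch, with illustrative variable names:

int n = total_cells / num_threads;  // minimum cells per thread
int r = total_cells % num_threads;  // leftover cells to spread around
int start, end;
if (t < r) {
  start = t * (n + 1);              // the first r threads get n + 1 cells each
  end = start + n;
} else {
  start = r * (n + 1) + (t - r) * n;
  end = start + n - 1;              // the remaining threads get n cells each
}

For the aligned allocation in the last item, posix_memalign (declared in <stdlib.h>) takes the alignment and size and hands the pointer back through its first argument. A minimal sketch, assuming the board is an array of ints:

int *board = NULL;
if (posix_memalign((void **)&board, 64, rows * cols * sizeof(int)) != 0) {
  printf("posix_memalign failed\n");
  exit(1);
}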