Handy References

Manual pages for pthread functions.
Descriptions of Conway's Game of Life, from Lab 6 and Wikipedia.

Lab 9 Goals:

Parallelize a shared-memory application using the pthreads library.
Protect threaded code with synchronization primitives.
Revisit 2D arrays and file I/O in C.
Gain more exposure to pointers and dynamic memory allocation.
Practice debugging threaded programs.

Part 1: Parallel Game of Life

Starting Point Code

Since we've already done one lab related to the game of life, you're going to be providing most of the starting point code this time (your lab 6 submission). To make organizing things easier, I've created new lab 9 repositories for you, so be sure to copy, add, and commit your prior GOL implementation there. I've given you a Makefile and some .txt example input files (these were the cases we used when grading lab 6). You should create more .txt input files to test larger sized and longer running threaded version, but these smaller ones will help you test for correctness.

Lab Description

This lab is designed to give you some practice writing and debugging multi-threaded pthreads programs, using synchronization primitives, and experience designing and running scalability experiments. You may work with a partner of your choice.

You will implement a parallel version of Conway's Game of Life using your sequential solution as a starting point for this lab. See the lab 6 assignment to remind yourself about the sequential version of the program, and about the rules of the game. You will evaluate the scalability of your implementation as you increase the problem size and the number of threads.

Your lab 9 solution will use the same method as lab 6 for initializing a dynamically allocated board from an input file given as a command line argument, and will have the same print or not-print option after every iteration. You should begin lab 9 by first fixing any errors in your lab 6 sequential solution that you copied over into your labs/09 subdirectory.

Parallel Requirements

Your parallel version of the game of life should be structured as follows:

The main thread should begin by initializing the game board in the same way as the sequential version. Your timing code should not include this step but should include all other parts of the execution.

After initializing the game board, the main thread spawns off worker threads that will simulate multiple rounds of game of life, each thread computing just its portion of cells of the new board each round. Grid cells are allocated to threads using either a row-wise or column-wise partitioning specified by a command line argument. You may assume that you will not be asked to create more threads than there are rows/columns (whichever you've been asked to partition by). In other words, you don't need to worry about a case in which there are threads that have no rows/columns assigned to them.

Your program should use two global variables that will be shared among all threads to count the number of grid cells that are processed (i.e., each cell that is looked at / considered, regardless of whether or not changes its alive/dead state). One variable should count the number of cells processed in each round, while the other should count the total number of grid cells processed during the entire execution. Each thread must keep a count of the number of cells it processed during a round and use that value to update the global counters when it has completed its execution of the round. Each thread should keep track of its own local counter during each round and then update the global counters at the end of the round (like we did in the findmax exercise).

For example, suppose you have a 20x20 board (400 cells) and that you're told to run for 20 iterations with 5 threads. Each thread would, during each iteration, keep a count of the number of cells it processed during that round (by the end of the round, it should be 400 / 5 = 80 cells per thread). It would then add that value of 80 to the global per-round sum, which should ultimately equal 400 at the end of the round. It would also add the 80 to an overall count of the total number cells processed through the entire life of the program. By the end of the entire program, that value should equal 400 * 20 = 8000. The rationale for this requirement is that it creates a critical section when the threads attempt to write the global values, so you'll need to use synchronization to protect those writes.

If printing after each round is enabled, you should designate one thread to be the printing thread, and only that thread should do the printing after each round (otherwise the output can be interleaved in crazy ways). It should print both the board and a count of the number of cells processed during that round (e.g., the 400 value from the example above).

At the end of the simulation's execution, the main thread should always print the final board state, the elapsed time, and the total number of grid cells processed across the run of the program (e.g., the 8000 value from the example above).

You will not be graded on absolute performance, but your code should run faster when run in parallel. As a sanity check, try running a large board instance (e.g., 1500x1500) for many rounds (e.g., 100+). You should see a substantial improvement between running with one thread and a handful of threads (2-4). It's that sort of relative improvement that we'll be looking for.

You will need to think about where and how to synchronize thread actions to remove race-conditions. For example, you need to ensure that no thread starts the next round of play until all threads are done with the current round, otherwise threads will not read consistent values for the given round. If printing is enabled, you need to ensure that the printing thread prints a consistent version of the game board. You will also need to ensure that threads do not race to update the global grid cell counters.

A run with no printing arguments enabled should not include any calls to usleep in the run.

Command line arguments

Your program will take the same two command line arguments from lab 6, plus three new command line arguments:

An integer to specify the number of threads (the degree of parallelism)
A 0/1 flag to specify how to parallelize the GOL program (0: row-wise grid cell allocation, 1: column-wise grid cell allocation). More details below.
A 0/1 flag to specify if the per-thread board allocation should be printed out or not at start up (0:don't print configuration information, 1: print allocation information).

$ ./gol
usage: ./gol infile.txt print[0:1] ntids partition[0:1] print_alloc[0:1]

Here are some example command lines:

# run with config values read from file1.txt, do not print the board after each round
# create 8 threads, use row-wise partitioning, print per-thread partitioning details:
./gol file1.txt  0  8  0  1

# run with config file file2.txt, print the board after each round, create
# 15 threads, use column-wise partitioning, don't print per-thread partitioning:
./gol file2.txt  1  15  1  0

The print per-thread board allocation command line option will help you debug the grid cell allocation scheme to ensure that you are correctly allocating rows or columns across the worker threads.

Your program should handle badly formed command lines and check for bad values entered (e.g., a negative number of threads). It should print out an error message and exit for badly formed command lines and handle most bad input values similarly.

Partitioning

You will implement two different ways of partitioning the computation: one partitions the board's rows across threads, the other partitions the board's columns across threads (i.e each thread computes the result for some number of rows or columns). As an example, suppose that the board is 8x8, and that there are 4 threads. Here is how the threads (with logical tids 0-3) would be assigned to the grid using row and column partitioning, respectively:

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3  
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3

When the number of threads does not evenly divide the dimension by which you are partitioning, divide the rows (or columns) up so that there is at most a difference of 1 assigned row (or column) between any two threads. For example, for an 8x7 board and 3 threads, you would partition rows and columns like this among the 3 threads (with tids 0-2):

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0                           0 0 0 1 1 2 2   
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2

In this partitioning scheme, a single row (or column) is assigned to exactly one thread; a single row (or column) is never split between two or more threads. Thus, in the above example threads 0 and 1 have one more row each of work to do than thread 2 in the row-wise partitioning, and thread 0 has one more column of work to do than threads 1 and 2 in the column-wise partitioning.

Printing Partitioning Information

When the 5th command line option is non-zero, your program should print thread partitioning information. Each thread should print:

Its thread id (the logical id that you assign, not the pthread_self value).
Its start and end row index values.
The total number of rows it is allocated.
Its start and end column index values.
The total number of columns it is allocated.

The threads will run in different orders, so you may see each thread's output in any order, which is fine. If you'd like, you can try to avoid some interleaving of individual threads output by calling fflush after printf:

printf("tid %d my values ...\n", mytid, ...);
fflush(stdout);   // force the printf output to be printed to the terminal

Here are some example runs with different numbers of threads partitioning a 100x100 grid. The total number of rows and columns per thread are shown in parentheses (your output does not need to be identical to this, but it must include all the required parts):

# 9 threads, column-wise partitioning
% gol big 0 9 1 1
tid  0: rows:  0:99 (100) cols:  0:11 (12)
tid  1: rows:  0:99 (100) cols: 12:22 (11)
tid  2: rows:  0:99 (100) cols: 23:33 (11)
tid  4: rows:  0:99 (100) cols: 45:55 (11)
tid  3: rows:  0:99 (100) cols: 34:44 (11)
tid  5: rows:  0:99 (100) cols: 56:66 (11)
tid  6: rows:  0:99 (100) cols: 67:77 (11)
tid  7: rows:  0:99 (100) cols: 78:88 (11)
tid  8: rows:  0:99 (100) cols: 89:99 (11)

# 6 threads, row-wise partitioning
% gol big 0 6 0 1
tid  0: rows:  0:16 (17) cols:  0:99 (100)
tid  1: rows: 17:33 (17) cols:  0:99 (100)
tid  3: rows: 51:67 (17) cols:  0:99 (100)
tid  2: rows: 34:50 (17) cols:  0:99 (100)
tid  4: rows: 68:83 (16) cols:  0:99 (100)
tid  5: rows: 84:99 (16) cols:  0:99 (100)

# 17 threads, row-wise partitioning
% gol big 0 17 0 1
tid  0: rows:  0: 5 ( 6) cols:  0:99 (100)
tid  1: rows:  6:11 ( 6) cols:  0:99 (100)
tid  3: rows: 18:23 ( 6) cols:  0:99 (100)
tid  2: rows: 12:17 ( 6) cols:  0:99 (100)
tid  4: rows: 24:29 ( 6) cols:  0:99 (100)
tid  5: rows: 30:35 ( 6) cols:  0:99 (100)
tid  6: rows: 36:41 ( 6) cols:  0:99 (100)
tid  7: rows: 42:47 ( 6) cols:  0:99 (100)
tid 10: rows: 60:65 ( 6) cols:  0:99 (100)
tid 11: rows: 66:71 ( 6) cols:  0:99 (100)
tid 12: rows: 72:77 ( 6) cols:  0:99 (100)
tid 13: rows: 78:83 ( 6) cols:  0:99 (100)
tid  8: rows: 48:53 ( 6) cols:  0:99 (100)
tid 15: rows: 90:94 ( 5) cols:  0:99 (100)
tid 14: rows: 84:89 ( 6) cols:  0:99 (100)
tid  9: rows: 54:59 ( 6) cols:  0:99 (100)
tid 16: rows: 95:99 ( 5) cols:  0:99 (100)

Note that for the run with 17 threads, there are enough threads to see "out of order" execution due to the exact scheduling of threads on the CPUs. That's fine, we're at the mercy of the OS scheduler!

Correctness

In addition to printing out per-thread board partitioning information to verify correct allocation, you should also ensure that your parallel version simulates GOL correctly. To do this run your parallel version on some of the same input files as your sequential one with printing enabled. The results should be identical. For example, starting with the oscillator input file the start and end board should look like this (you would of course call system("clear") between each iteration):

$ ./gol oscillator.txt 1 4 0 0

start board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - @ @ @ - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 

end board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 

total time for 21 iterations of 11x11 is 5.493663 secs

If you see odd output, either your parallelization of GOL is incorrect, or you are missing synchronization, or both. You can remove the call to system("clear") to see the history of each round to help you debug incorrect GOL computation (of course run for a small number of iterations).

Additional Requirements

In addition to the parallelization and correctness requirements described above, you solution should:

Use a makefile.
Be free of valgrind errors.
Be well-designed and well-commented.

Tips

The man pages for the various pthread functions are very helpful. Additionally, Tia has some links to other pthreads documentations and tutorials

Use the perror() function to print out error messages from failed system and pthread library calls. perror prints out detailed, human-readable information about the error that occurred. You call it passing in a string with program-specific error message. For example:
```
big_buff = malloc(...);
if (big_buff == NULL) {
  perror("mallocing big buffer failed");
}
```
See the man page for perror for more information and for the include files you need to add to your program to use it.

You will need to use some synchronization in your solution. You may use any of the pthread synchronization primitives: mutexes, condition variables, and/or barriers. Remember that before using any of these, you need to initialize them and that all threads need to have access to the shared synchronization variables (either pass a reference to them at thread creation or declare them as static global variables). Here are some code snippets for using the barrier synchronization primitive:
```
// declare a global pthread_barrier_t var
static pthread_barrier_t barrier;

// initialize it somewhere before using it:
if(pthread_barrier_init(&barrier, NULL, num_threads)){
   perror("pthread_barrier_init");
   exit(1);
}

// threads than can call pthread_barrier_wait to synchronize:
ret = pthread_barrier_wait(&barrier);
if(ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
  perror("pthread_barrier_wait");
  exit(1);
}
```

With the exception of pthread synchronization primitives and cell counters, you should not use any other global variables in your program. Any values that your threads need should be passed to them via the (void *arg) parameter. Look at the parallel max program we did for an example of how to do this.

A guide for using gdb to debug pthread programs

Part 2: Producer / Consumer (a.k.a Bounded Buffer)

For the second part of the lab, you'll be solving a classic threading problem: the producer / consumer problem. One thread, the producer, is reading values from an external source (e.g., file on disk or the network) and adding them to a queue, for future processing. The other thread, the consumer, is pulling items off the queue to be processed.

What makes this problem challenging is that the queue has a maximum size, and the producer and consumer run at different rates: the producer is limited by the speed of I/O, and the consumer may have a variable amount of work to perform for each queued item. Therefore, you must protect access to the shared queue such that:

The producer should never attempt to add an item to a full queue.

The consumer should never attempt to read an item from an empty queue.

The consumer should process every item exactly one time, in the intended order.

Clearly you'll need to use a mutex lock to ensure that the threads don't race to modify the queue or its item count. A mutex alone is not enough for a good solution though: if a thread is holding the lock and cannot complete it's task (e.g., the producer wants to write an item to the queue, but the queue is full), it should immediately give up the lock, allowing the other thread to proceed. Thus, we need a way for a thread to give up a lock and wait for some future event to occur. Luckily, we have just such a pair of functions: pthread_cond_wait and pthread_cond_signal.

These functions work on condition variables and mutex locks. To wait on a condition variable, one must already be holding the associated mutex lock. Waiting will release the lock and block the thread until another thread signals the condition variable. At that point, the waiting thread will wake up, re-acquire the lock, and proceed. It's generally good practice to re-check the condition that caused you to wait in the first place before proceeding (e.g., if your producer was waiting for free space in the queue, it should probably double check that there is a free space now, after it has woken up).

Note that one thread signalling a condition variable when no thread is waiting on it is NOT saved anywhere. In other words, if nobody is listening on a condition variable that is signaled, it's as if that signal never happened.

The starting code is available in ~kwebb/public/cs31/part2. From your lab 9 repository, you can do:

cp -r ~kwebb/public/cs31/part2 .

This should give you a "part 2" directory that you can use to work on this part of the assignment. You can run it by building with make and then executing:

 $ ./producer_consumer [input file name] [max buffer size]

for example:

 $ ./producer_consumer ten_numbers.txt 5

The starter code has the producer thread read integers from the specified input file into a shared queue (buffer) whose maximum size is determined by the second command line argument. The consumer reads items from the buffer and prints them. As provided, there is no synchronization, and the values that get printed will not reflect those in the file.

Requirements

You should only add code to the "producer" and "consumer" functions. All the variables you need should already be declared and initialized.

There should be no spin waiting or deadlocks, and the consumer should always print the integers once (and only once) in the order they appeared in the input file.

The consumer thread should cleanly terminate when the producer sets the done flag.

Testing

You may find it helpful to automate your testing for part 2 by using the shell to repeatedly test your program. The following shell loop will run it 100 times:

for i in `seq 100`; do ./producer_consumer ten_numbers.txt 5 2>/dev/null; done

You can change the `seq 100` to change the number of times it runs. Note that the characters surrounding that expression are backticks (the character to the left of the "1" key on your keyboard). You may also find it helpful to diff your output against the expected output (the original input file). You can do that with:

./producer_consumer ten_numbers.txt 5 2>/dev/null| diff -u ten_numbers.txt -

The above command will print no output if the contents match (i.e., it's correct). If it prints anything, it's telling you how your output differs from the expected output. You can combine this method with the loop to do both at once:

Note: this also prints the iteration number you're on.  If you see it hang, you likely have a deadlock.
for i in `seq 1000`; do ./producer_consumer ten_numbers.txt 5 2>/dev/null | diff -u ten_numbers.txt -; echo $i; done

Submitting

Please remove any debugging output prior to submitting.

To submit your code, simply commit your changes locally using git add and git commit. Then run git push while in your lab directory. Only one partner needs to run the final push, but make sure both partners have pulled and merged each others changes. See the section on Using a shared repo on the git help page.

CS 31 Lab 9: Parallel Game of Life

Due 11:59 PM, Friday, April 29