CS31 Lab 8: Parallel Game of Life

Code: Due before 11:59pm, Tuesday Dec. 11 (via handin31)
Report: Due before noon, Wednesday Dec. 12 (printout submitted to me in person)

This lab should be done with your lab partner.

Lab 8 Goals:

Starting Point Code

First, both you and your partner should run update31 to create the handin directory structure for lab 8, then you should copy your lab 5 solution into your lab8 subdirectory:

  $ update31
  $ cd cs31/labs/08
  $ pwd
  /home/your_user_name/cs31/labs/08
  $ cp ../05/gol.c .
  $ cp ../05/Makefile .
  $ cp ../05/*.txt .
  $ ls
  Makefile  gol.c  oscillator.txt
You will start with your lab 5 solution to the Game of Life program, modifying it to implement a parallel version of GOL using pthreads. To link in the pthread library, to your Makefile, add the following to the end of the command for building the gol executable:
-pthread

# for example, you may have something that looks like this:
gol: gol.c
        gcc -g -Wall -o gol gol.c -pthread

Project Details and Requirements
This lab is designed to give you some practice writing and debugging multithreaded Pthreads programs, using synchronization primitives, and experience designing and running scalability experiments.

You will implement a parallel version of Conway's Game of Life using your sequential solution as a starting point for this lab. See the lab 5 assignment to remind yourself about the sequential version of the program, and about the rules of the game. You will evaluate the scalability of your implementation as you increase the problem size and the number of threads.

Your lab 8 solution will use the same method as lab 5 for initializing a dynamically allocated board from an input file given as a command line argument, and will have the same print or not-print option after every iteration.

You should begin lab 8 by first fixing any errors in your lab 5 sequential solution that you copied over into your labs/08 subdirectory.

Parallel Version

The parallel version should be structured as follows:
  1. The main thread should begin by initializing the game board in the same way as the sequential version. The gettimeofday timers in your code should not include this step but should include all other parts of the execution.
  2. After initializing the game board, the main thread spawns off worker threads that will play multiple rounds of game of life, each thread computing just its portion of cells of the new board each round. Grid cells are allocated to threads using either a row-wise or column-wise partitioning specified by a command line argument.
  3. If printing after each round is enabled, you should designate one thread to be the printing thread, and only that thread should do the printing after each round (otherwise the output can be interleaved in crazy ways).
  4. The main thread should print out the time of the gol computation part of the program before exiting (this time should NOT include the time to initialize the board from the input file, but should measure the rest of the computation).
You will need to think about where and how to synchronize thread actions to remove race-conditions. For example, you need to ensure that no thread starts the next round of play until all threads are done with the current round, otherwise threads will not read consistent values for the given round. If printing is enabled, you need to ensure that the printing thread prints a consistent version of the game board.

A run with no printing enabled should only printout the final timing result of the gettimeofday timer and should not include any calls to usleep in the run.

Command line arguments

Your program will use the two command line arguments from lab 5 and additionally add three new command line arguments to:
  1. specify the number of threads (the degree of parallelism)
  2. specify how to parallelize the GOL program (0: row-wise grid cell allocation, 1: column-wise grid cell allocation).
  3. specify if the per-thread board allocation should be printed out or not (0:don't print configuration information, 1: print allocation information)
$ ./gol
usage: ./gol infile.txt print[0:1] ntids partition[0:1] print_alloc[0:1]
Here are some example command lines:
# run with config values read from file1.txt, do not print the board after each round
# create 8 threads, use row-wise partitioning, print per-thread partitioning details:
./gol file1.txt  0  8  0  1     

# run with config file file2.txt, print the board after each round, create
# 15 threads, use column-wise partitioning, don't print per-thread partitioning:
./gol file2.txt  1  15  1  0   

The print per-thread board allocation command line option will help you debug the grid cell allocation scheme to ensure that you are correctly allocating rows or columns across the worker threads.

Your program should handle badly formed command lines and check for bad values entered (like a negative number of threads). It should print out an error message and exit for badly formed command lines and handle most bad input values similarly.

Partitioning

You will implement two different ways of parallelizing the computation: one partitions the board's rows across threads, the other partitions the board's columns across threads (i.e each thread computes the result for some number of rows or columns). As an example, suppose that the board is 8x8, and that there are 4 threads. Here is how the threads (with logical tids 0-3) would be assigned to computing elements of C using row and column partitioning:
row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3  
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
When the number of threads does not evenly divide the dimension by which you are partitioning, divide the rows (or columns) up so that there is at most a difference of 1 assigned row (or column) between any two threads. For example, for an 8x7 board and 3 threads, you would partition rows and columns like this among the 3 threads (with tids 0-2):
row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0                           0 0 0 1 1 2 2   
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2 
In this partitioning scheme, a single row (or column) is assigned to exactly one thread; a single row (or column) is never split between two or more threads. Thus, in the above example threads 0 and 1 have one more row each of work to do than thread 2 in the row-wise partitioning, and thread 0 has one more column of work to do than threads 1 and 2 in the column-wise partitioning.

Printing Partitioning Information

When the 5th command line option is non-zero, your program should print thread partitioning information. Each thread should print:
  1. Its thread id (its logical one not its pthread_self value)
  2. Its start and end row index values
  3. The total number of rows it is allocated
  4. Its start and end column index values
  5. The total number of columns it is allocated
The threads will run in different orders, so you may see each thread's output in any order, which is fine. You can avoid some interleaving of individual threads output by calling fflush after printf:
printf("tid %d my values ...\n", mytid, ...);
fflush(stdout);   // force the printf output to be printed to the terminal
Here are some example runs of my program with different numbers of threads partitioning a 100x100 grid. The total number of rows and columns per thread are shown in parentheses (your output does not need to be identical to mine but must include all required parts):
# 9 threads, column-wise partitioning
% gol big 0 9 1 1
tid  0: rows:  0:99 (100) cols:  0:11 (12)
tid  1: rows:  0:99 (100) cols: 12:22 (11)
tid  2: rows:  0:99 (100) cols: 23:33 (11)
tid  4: rows:  0:99 (100) cols: 45:55 (11)
tid  3: rows:  0:99 (100) cols: 34:44 (11)
tid  5: rows:  0:99 (100) cols: 56:66 (11)
tid  6: rows:  0:99 (100) cols: 67:77 (11)
tid  7: rows:  0:99 (100) cols: 78:88 (11)
tid  8: rows:  0:99 (100) cols: 89:99 (11)

# 6 threads, row-wise partitioning
% gol big 0 6 0 1
tid  0: rows:  0:16 (17) cols:  0:99 (100)
tid  1: rows: 17:33 (17) cols:  0:99 (100)
tid  3: rows: 51:67 (17) cols:  0:99 (100)
tid  2: rows: 34:50 (17) cols:  0:99 (100)
tid  4: rows: 68:83 (16) cols:  0:99 (100)
tid  5: rows: 84:99 (16) cols:  0:99 (100)

# 17 threads, row-wise partitioning
% gol big 0 17 0 1
tid  0: rows:  0: 5 ( 6) cols:  0:99 (100)
tid  1: rows:  6:11 ( 6) cols:  0:99 (100)
tid  3: rows: 18:23 ( 6) cols:  0:99 (100)
tid  2: rows: 12:17 ( 6) cols:  0:99 (100)
tid  4: rows: 24:29 ( 6) cols:  0:99 (100)
tid  5: rows: 30:35 ( 6) cols:  0:99 (100)
tid  6: rows: 36:41 ( 6) cols:  0:99 (100)
tid  7: rows: 42:47 ( 6) cols:  0:99 (100)
tid 10: rows: 60:65 ( 6) cols:  0:99 (100)
tid 11: rows: 66:71 ( 6) cols:  0:99 (100)
tid 12: rows: 72:77 ( 6) cols:  0:99 (100)
tid 13: rows: 78:83 ( 6) cols:  0:99 (100)
tid  8: rows: 48:53 ( 6) cols:  0:99 (100)
tid 15: rows: 90:94 ( 5) cols:  0:99 (100)
tid 14: rows: 84:89 ( 6) cols:  0:99 (100)
tid  9: rows: 54:59 ( 6) cols:  0:99 (100)
tid 16: rows: 95:99 ( 5) cols:  0:99 (100)
Note that for the 17 thread run there are enough threads to see "out of order" execution due to the exact scheduling of threads on the cpus.

Correctness

In addition to printing out per-thread board partitioning information to verify correct allocation, you should also ensure that your parallel version solves GOL correctly. To do this run your parallel version on some of the same input files as your sequential one with printing enabled. The results should be identical. For example, starting with the oscillator input file the start and end board should look like this (you would of course call system("clear") between each iteration):
$ ./gol oscillator.txt 1 4 0 0

start board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - @ @ @ - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 

end board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 

total time for 21 iterations of 11x11 is 5.493663 secs
If you see odd output, either your parallelization of GOL is incorrect, or you are missing synchronization, or both. You can remove the call to system("clear") to see the history of each round to help you debug incorrect GOL computation (of course run for a small number of iterations).

Additional Requirements

In addition to the parallelization and correctness requirements described above, you solution must:

Scalability Experiments and Report
After implementing and debugging your parallel GOL program, you will run experiments of your parallel solution testing its scalability in different ways, and write a report presenting your experimental results and describing what they show. You should answer the following questions with your experiments:
  1. How well does the problem scale with the number of threads? Is there an optimal thread size for a given sized problem?
  2. How well does the 16 thread version scale for different sized boards?
  3. Is one of the row-wise or column-wise allocations more efficient?
For each of these questions, you should:
  1. design some experiments to answer the question
  2. run the experiments and see if they do answer your question (and if not redesign the experiments and try again)
  3. write up the results of each of your experiments. For each question:
    1. Describe what your experiment(s) was (were), and describe why/how the experiment answers the question you are trying to answer with it.
    2. Present your experimental results in a easy to read summary format. You can list results in a table or in a graph, for example. You should present averages of several runs of each test, and indicate this in your presentation (e.g. "the values shown are average of 10 runs each, run on a 8 core system").
    3. Describe what your experimental results show: what do the data you present say about the question they are answering and why? What is your interpretation of the results? What do they say about the scalability of your solution? Did the results match your expectations? If not, do you have an idea of why not?

    You can write up your report as an ascii file using vim (don't wrap lines), or you can try using latex (details below). Your report should be no more than 3 pages long, so think about quality over quantity in presenting results, and be concise yet complete in your explanations. One page in an ascii file is 59 lines (CNTRL-L is a page break if you want to explicitly add one in). If you use ascii, then present your results in tables something similar to this:

    Total Execution Time for different number of threads and board sizes 
    
    num tids  |  500x500 |  1000X1000  | 2000x2000 | ... 
    -----------------------------------------------------------
       1      |  x secs  |   x secs    |  x secs   | 
       2      |  x secs  |   x secs    |  x secs   | 
       4      |  x secs  |   x secs    |  x secs   | 
    ...
    

As you design experiments, you will want to vary the grid size and the number of threads to obtain different data points:

  1. For the grid sizes. Try doubling the sizes at each increase: (500, 1000, 2000, ...). You do not need to have a large number of different sizes but you should have a few intermediate sizes between your smallest-sized and largest-sized grids. Run some initial experiments to help you pick good sizes. You also want to ensure you are running for enough iterations to get runs that are long enough to be able to make a comparison between the runtimes of different sized boards.
  2. The number of threads. Increase by powers of two (1, 2, 4, 8, 16, 32, ...). And, include some thread numbers that are beyond the number of CPUs.
When running scalability studies you need to make sure that you have problem sizes that are large enough and the number of iterations is large enough to result in fairly long run times for at least some numbers of threads. For example, if the single threaded run takes 1.3 seconds and the 16 threaded run takes 1.2, it is pretty difficult to draw any conclusions about scalability by comparing these two small runtimes. Instead, you want some runs that take a few minutes.

As you run experiments, make sure you are doing so in a way that doesn't interfere with others (see the Useful Unix Utilities section below for some tips about how to ensure this). Also, remember to run without any of the printing options enabled (and make sure you are not including usleep calls in timed runs (usleep should only be called from within printing enabled path in your code).

You should run multiple runs of each experiment (i.e. don't do only a single run of 16 threads for 512x512 and 10000 iterations). The purpose of multiple runs is to determine how consistent your runs are (look at the standard deviation across runs), and to also see if you have some odd outliers (which may indicate that something else running on the computer interfered with this run, and that you should discard this result).

You should be careful to control the experimental runs as much as possible. If other people using the machine you are running experiments on, their programs can interfere with your results. Also, run all experiments on the same machine. You can see what is running on a machine:

Machines for experiments

The Lab Machine specs page contains information about most of the CS lab machines, including the number of CPUs (click on the processors link). We have machines with 4, 6, 8, and 16 cpus. Pick one with 8 or 16 cores (x16) for your final experiments, but try out others during development.

The following machines are not in the main or overflow labs, and thus are more likely to be available for extended periods of time, so try one of these first:

mushroom  
avocado                                                                         
carrot                                                                          
savory                                                                          
parsley                                                                         
And, avoid using main and overflow lab machines during classes.

As much as possible, it is good to run all your experiments on the same machine, or at least on identical machines.

Finally, be aware of other groups wanting to run experiments on the 8 and 16 node machines. So, please don't run experiments on a machine for hours and hours or days and days. And, logout when you are not using a machine to run experiments. If you run who and someone is logged in, run top to see if they are actually running on the machine before deciding it is okay for you to use.

In the Hints section below there is information about utilities for running experiments.

Hints and Useful Utilities

For the pthread implementation:

For experiments and report:


Submit

Report: Due before noon, Wednesday Dec. 12

You must submit your report to me as a printout by the report due date. A printout of your report must be handed in to me in person before the due date. If I am not in my office when you stop by, you should slide it under my door. In addition to handing me a printout of your report, also please email me either your report .pdf and .tex files if you use latex, or your REPORT ascii file if you use vim. You can add these as attachments to your email.

To print a ascii file on our system:

lpr REPORT
To print a pdf file, do so from within xpdf (lpr will print out pages and pages of raw pdf encoding):
xpdf report.pdf    # then choose the print option under the File menu
If you accidentally print a .pdf usin lpr, you can kill the print job:
lprm
Make sure both you and your partner's names are on the report

Code: Due before 11:59pm, Tuesday Dec. 11

Submit all your source code and Makefile and your report in your labs/08/ subdirectory. If your report is complete at this time, you can include it in this directory (you still need to submit to me in person a printout of your report). If you use vim for your report, put it in a file named REPORT. If you use latex, include both the REPORT.tex file and any image files, and the REPORT.pdf file.

Once you are satisfied with your solution, hand it in by typing handin31 at the unix prompt.

Only one of you or your partner should run handin31 to submit your joint solutions If you accidentally both run it, send me email right away letting me know which of the two solutions I should keep and which I should discard (you don't want the grader to just guess which joint solution to grade).

You may run handin31 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize, after handing in some programs, that you'd like to make a few more changes to them.