CS31: Lab 8

This lab should be done with your lab partner.

Lab 8 Goals:

shared memory parallel programming using pthreads
synchronization
2D arrays and file I/O in C
pointers, dynamic memory allocation, and pass by reference
debugging threaded programs
experimental evaluation of program scalability

Starting Point Code

First, both you and your partner should run update31 to create the handin directory structure for lab 8, then you should copy your lab 5 solution into your lab8 subdirectory:

  $ update31
  $ cd cs31/labs/08
  $ pwd
  /home/your_user_name/cs31/labs/08
  $ cp ../05/gol.c .
  $ cp ../05/Makefile .
  $ cp ../05/*.txt .
  $ ls
  Makefile  gol.c  oscillator.txt

You will start with your lab 5 solution to the Game of Life program, modifying it to implement a parallel version of GOL using pthreads. To link in the pthread library, to your Makefile, add the following to the end of the command for building the gol executable:

-pthread

# for example, you may have something that looks like this:
gol: gol.c
        gcc -g -Wall -o gol gol.c -pthread

Project Details and Requirements

This lab is designed to give you some practice writing and debugging multithreaded Pthreads programs, using synchronization primitives, and experience designing and running scalability experiments.

You will implement a parallel version of Conway's Game of Life using your sequential solution as a starting point for this lab. See the lab 5 assignment to remind yourself about the sequential version of the program, and about the rules of the game. You will evaluate the scalability of your implementation as you increase the problem size and the number of threads.

Your lab 8 solution will use the same method as lab 5 for initializing a dynamically allocated board from an input file given as a command line argument, and will have the same print or not-print option after every iteration.

You should begin lab 8 by first fixing any errors in your lab 5 sequential solution that you copied over into your labs/08 subdirectory.

Parallel Version

The parallel version should be structured as follows:

The main thread should begin by initializing the game board in the same way as the sequential version. The gettimeofday timers in your code should not include this step but should include all other parts of the execution.
After initializing the game board, the main thread spawns off worker threads that will play multiple rounds of game of life, each thread computing just its portion of cells of the new board each round. Grid cells are allocated to threads using either a row-wise or column-wise partitioning specified by a command line argument.
If printing after each round is enabled, you should designate one thread to be the printing thread, and only that thread should do the printing after each round (otherwise the output can be interleaved in crazy ways).
The main thread should print out the time of the gol computation part of the program before exiting (this time should NOT include the time to initialize the board from the input file, but should measure the rest of the computation).

You will need to think about where and how to synchronize thread actions to remove race-conditions. For example, you need to ensure that no thread starts the next round of play until all threads are done with the current round, otherwise threads will not read consistent values for the given round. If printing is enabled, you need to ensure that the printing thread prints a consistent version of the game board.

A run with no printing enabled should only printout the final timing result of the gettimeofday timer and should not include any calls to usleep in the run.

Command line arguments

Your program will use the two command line arguments from lab 5 and additionally add three new command line arguments to:

specify the number of threads (the degree of parallelism)
specify how to parallelize the GOL program (0: row-wise grid cell allocation, 1: column-wise grid cell allocation).
specify if the per-thread board allocation should be printed out or not (0:don't print configuration information, 1: print allocation information)

$ ./gol
usage: ./gol infile.txt print[0:1] ntids partition[0:1] print_alloc[0:1]

Here are some example command lines:

# run with config values read from file1.txt, do not print the board after each round
# create 8 threads, use row-wise partitioning, print per-thread partitioning details:
./gol file1.txt  0  8  0  1     

# run with config file file2.txt, print the board after each round, create
# 15 threads, use column-wise partitioning, don't print per-thread partitioning:
./gol file2.txt  1  15  1  0

The print per-thread board allocation command line option will help you debug the grid cell allocation scheme to ensure that you are correctly allocating rows or columns across the worker threads.

Your program should handle badly formed command lines and check for bad values entered (like a negative number of threads). It should print out an error message and exit for badly formed command lines and handle most bad input values similarly.

Partitioning

You will implement two different ways of parallelizing the computation: one partitions the board's rows across threads, the other partitions the board's columns across threads (i.e each thread computes the result for some number of rows or columns). As an example, suppose that the board is 8x8, and that there are 4 threads. Here is how the threads (with logical tids 0-3) would be assigned to computing elements of C using row and column partitioning:

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3  
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3

When the number of threads does not evenly divide the dimension by which you are partitioning, divide the rows (or columns) up so that there is at most a difference of 1 assigned row (or column) between any two threads. For example, for an 8x7 board and 3 threads, you would partition rows and columns like this among the 3 threads (with tids 0-2):

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0                           0 0 0 1 1 2 2   
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
1 1 1 1 1 1 1                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2 
2 2 2 2 2 2 2                           0 0 0 1 1 2 2

In this partitioning scheme, a single row (or column) is assigned to exactly one thread; a single row (or column) is never split between two or more threads. Thus, in the above example threads 0 and 1 have one more row each of work to do than thread 2 in the row-wise partitioning, and thread 0 has one more column of work to do than threads 1 and 2 in the column-wise partitioning.

Printing Partitioning Information

When the 5th command line option is non-zero, your program should print thread partitioning information. Each thread should print:

Its thread id (its logical one not its pthread_self value)
Its start and end row index values
The total number of rows it is allocated
Its start and end column index values
The total number of columns it is allocated

The threads will run in different orders, so you may see each thread's output in any order, which is fine. You can avoid some interleaving of individual threads output by calling fflush after printf:

printf("tid %d my values ...\n", mytid, ...);
fflush(stdout);   // force the printf output to be printed to the terminal

Here are some example runs of my program with different numbers of threads partitioning a 100x100 grid. The total number of rows and columns per thread are shown in parentheses (your output does not need to be identical to mine but must include all required parts):

# 9 threads, column-wise partitioning
% gol big 0 9 1 1
tid  0: rows:  0:99 (100) cols:  0:11 (12)
tid  1: rows:  0:99 (100) cols: 12:22 (11)
tid  2: rows:  0:99 (100) cols: 23:33 (11)
tid  4: rows:  0:99 (100) cols: 45:55 (11)
tid  3: rows:  0:99 (100) cols: 34:44 (11)
tid  5: rows:  0:99 (100) cols: 56:66 (11)
tid  6: rows:  0:99 (100) cols: 67:77 (11)
tid  7: rows:  0:99 (100) cols: 78:88 (11)
tid  8: rows:  0:99 (100) cols: 89:99 (11)

# 6 threads, row-wise partitioning
% gol big 0 6 0 1
tid  0: rows:  0:16 (17) cols:  0:99 (100)
tid  1: rows: 17:33 (17) cols:  0:99 (100)
tid  3: rows: 51:67 (17) cols:  0:99 (100)
tid  2: rows: 34:50 (17) cols:  0:99 (100)
tid  4: rows: 68:83 (16) cols:  0:99 (100)
tid  5: rows: 84:99 (16) cols:  0:99 (100)

# 17 threads, row-wise partitioning
% gol big 0 17 0 1
tid  0: rows:  0: 5 ( 6) cols:  0:99 (100)
tid  1: rows:  6:11 ( 6) cols:  0:99 (100)
tid  3: rows: 18:23 ( 6) cols:  0:99 (100)
tid  2: rows: 12:17 ( 6) cols:  0:99 (100)
tid  4: rows: 24:29 ( 6) cols:  0:99 (100)
tid  5: rows: 30:35 ( 6) cols:  0:99 (100)
tid  6: rows: 36:41 ( 6) cols:  0:99 (100)
tid  7: rows: 42:47 ( 6) cols:  0:99 (100)
tid 10: rows: 60:65 ( 6) cols:  0:99 (100)
tid 11: rows: 66:71 ( 6) cols:  0:99 (100)
tid 12: rows: 72:77 ( 6) cols:  0:99 (100)
tid 13: rows: 78:83 ( 6) cols:  0:99 (100)
tid  8: rows: 48:53 ( 6) cols:  0:99 (100)
tid 15: rows: 90:94 ( 5) cols:  0:99 (100)
tid 14: rows: 84:89 ( 6) cols:  0:99 (100)
tid  9: rows: 54:59 ( 6) cols:  0:99 (100)
tid 16: rows: 95:99 ( 5) cols:  0:99 (100)

Note that for the 17 thread run there are enough threads to see "out of order" execution due to the exact scheduling of threads on the cpus.

Correctness

In addition to printing out per-thread board partitioning information to verify correct allocation, you should also ensure that your parallel version solves GOL correctly. To do this run your parallel version on some of the same input files as your sequential one with printing enabled. The results should be identical. For example, starting with the oscillator input file the start and end board should look like this (you would of course call system("clear") between each iteration):

$ ./gol oscillator.txt 1 4 0 0

start board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - @ @ @ - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 

end board:

- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - @ - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 
- - - - - - - - - - - 

total time for 21 iterations of 11x11 is 5.493663 secs

If you see odd output, either your parallelization of GOL is incorrect, or you are missing synchronization, or both. You can remove the call to system("clear") to see the history of each round to help you debug incorrect GOL computation (of course run for a small number of iterations).

Additional Requirements

In addition to the parallelization and correctness requirements described above, you solution must:

use a makefile
be free of valgrind errors
be well designed and well commented
include a report of your scalability study (details below).

Scalability Experiments and Report

After implementing and debugging your parallel GOL program, you will run experiments of your parallel solution testing its scalability in different ways, and write a report presenting your experimental results and describing what they show. You should answer the following questions with your experiments:

How well does the problem scale with the number of threads? Is there an optimal thread size for a given sized problem?
How well does the 16 thread version scale for different sized boards?
Is one of the row-wise or column-wise allocations more efficient?

For each of these questions, you should:

design some experiments to answer the question
run the experiments and see if they do answer your question (and if not redesign the experiments and try again)
write up the results of each of your experiments. For each question:
1. Describe what your experiment(s) was (were), and describe why/how the experiment answers the question you are trying to answer with it.
2. Present your experimental results in a easy to read summary format. You can list results in a table or in a graph, for example. You should present averages of several runs of each test, and indicate this in your presentation (e.g. "the values shown are average of 10 runs each, run on a 8 core system").
3. Describe what your experimental results show: what do the data you present say about the question they are answering and why? What is your interpretation of the results? What do they say about the scalability of your solution? Did the results match your expectations? If not, do you have an idea of why not?
You can write up your report as an ascii file using vim (don't wrap lines), or you can try using latex (details below). Your report should be no more than 3 pages long, so think about quality over quantity in presenting results, and be concise yet complete in your explanations. One page in an ascii file is 59 lines (CNTRL-L is a page break if you want to explicitly add one in). If you use ascii, then present your results in tables something similar to this:
```
Total Execution Time for different number of threads and board sizes 

num tids  |  500x500 |  1000X1000  | 2000x2000 | ... 
-----------------------------------------------------------
   1      |  x secs  |   x secs    |  x secs   | 
   2      |  x secs  |   x secs    |  x secs   | 
   4      |  x secs  |   x secs    |  x secs   | 
...
```

As you design experiments, you will want to vary the grid size and the number of threads to obtain different data points:

For the grid sizes. Try doubling the sizes at each increase: (500, 1000, 2000, ...). You do not need to have a large number of different sizes but you should have a few intermediate sizes between your smallest-sized and largest-sized grids. Run some initial experiments to help you pick good sizes. You also want to ensure you are running for enough iterations to get runs that are long enough to be able to make a comparison between the runtimes of different sized boards.
The number of threads. Increase by powers of two (1, 2, 4, 8, 16, 32, ...). And, include some thread numbers that are beyond the number of CPUs.

When running scalability studies you need to make sure that you have problem sizes that are large enough and the number of iterations is large enough to result in fairly long run times for at least some numbers of threads. For example, if the single threaded run takes 1.3 seconds and the 16 threaded run takes 1.2, it is pretty difficult to draw any conclusions about scalability by comparing these two small runtimes. Instead, you want some runs that take a few minutes.

As you run experiments, make sure you are doing so in a way that doesn't interfere with others (see the Useful Unix Utilities section below for some tips about how to ensure this). Also, remember to run without any of the printing options enabled (and make sure you are not including usleep calls in timed runs (usleep should only be called from within printing enabled path in your code).

You should run multiple runs of each experiment (i.e. don't do only a single run of 16 threads for 512x512 and 10000 iterations). The purpose of multiple runs is to determine how consistent your runs are (look at the standard deviation across runs), and to also see if you have some odd outliers (which may indicate that something else running on the computer interfered with this run, and that you should discard this result).

You should be careful to control the experimental runs as much as possible. If other people using the machine you are running experiments on, their programs can interfere with your results. Also, run all experiments on the same machine. You can see what is running on a machine:

who to see who else is on a machine you are running on, and try another machine if you are not alone.
top to see the system load on the machine.
top -H to see individual threads running on the machine (you can see your 4, 8, 16, etc running).

Machines for experiments

The Lab Machine specs page contains information about most of the CS lab machines, including the number of CPUs (click on the processors link). We have machines with 4, 6, 8, and 16 cpus. Pick one with 8 or 16 cores (x16) for your final experiments, but try out others during development.

The following machines are not in the main or overflow labs, and thus are more likely to be available for extended periods of time, so try one of these first:

And, avoid using main and overflow lab machines during classes.

As much as possible, it is good to run all your experiments on the same machine, or at least on identical machines.

Finally, be aware of other groups wanting to run experiments on the 8 and 16 node machines. So, please don't run experiments on a machine for hours and hours or days and days. And, logout when you are not using a machine to run experiments. If you run who and someone is logged in, run top to see if they are actually running on the machine before deciding it is okay for you to use.

In the Hints section below there is information about utilities for running experiments.

Hints and Useful Utilities

For the pthread implementation:

The man pages for pthread functions are very helpful. In addition I have links to pthreads documentations and tutorials
use perror to print out error messages from failed system and pthread library calls. perror prints out detailed information about the error that occurred. You call it passing in a string with program-specific error message. For example:
```
big_buff = malloc( ...);
if(!big_buff) {
  perror("mallocing big buffer failed\n");
}
```
See the man page for perror for more information and for the include files you need to add to your program.
You will need to use some synchronization in your solution. You may use any of the pthread synchronization primitives: semaphores, condition variables, and/or barriers. Remember that before using any of these, you need to initialize them, and that all threads need to have access to the shared synchronization variables (either pass a reference to them to the main thread function or declare them as static global variables). Here are some code snippets for using the barrier synchronization primitive:
```
// declare a global pthread_barrier_t var
static pthread_barrier_t barrier;

// initialize it somewhere before using it:
if(pthread_barrier_init(&barrier, NULL, num_threads)){
   perror("Error: with pthread barrier init\n");
   exit(1); 
}

// threads than can call pthread_barrier_wait to synchronize: 
ret = pthread_barrier_wait(&barrier);
if(ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
  perror("Error: can't wait on pthread barrier\n");
  exit(1);
}
```
A guide for using gdb to debug pthread programs

For experiments and report:

See my Tools for examining system and runtime state for some tools you can use to determine the
- top, who, xload to get system usage info
  top -H: a dynamic display of information about system resource usage of the current threads that are the biggest consumers of CPUs. You should see your matrix multiply threads rise to the top. You can also use this to see if someone else is using a machine for experiments (in which case you should pick another one).
- /proc/ more useful system info. You can cat out files in here to find system-wide or per-process information: meminfo prints information about system memory use; cpuinfo prints information about the cpus, including L1 cache and cache line sizes.
```
$ cat /proc/cpuinfo
$ cat /proc/meminfo
```
time ./a.out: show the total runtime, and user and system times associated with executing a.out.
You can write and use shell scripts for running a set of experiments. To run a shell script, first make sure that the shell script file has permissions set to executable (e.g. chmod 777 run.sh), and then just run it from the Unix prompt ( ./run.sh). Here is a example script for running a set of row-wise and column-size experiments on a several input files which specify the grid size and number of iterations (I've given my files names which include the size and the number of iterations so that the script output helps me easily identify the different results). This is just an example, and you will likely want to change thread ranges and problem sizes for your experiments:
```
#!/bin/bash

for f in infile_500_500.txt infile_1000_500.txt infile_2000_500.txt
do
  for((p=0; p <= 1; p++))
  do
    for ((t=1; t <= 64; t*=2))
    do
      echo ""
      echo "$f  $t  $p  0"

      for((i=0; i < 5; i+=1)) 
      do
        time ./gol $f 0  $t  $p  0
      done

    done
  done
done
```
See bash shell programing links for more information.
I'd run this bash script inside a script session to capture all its output to a file, and if it is long running, I may also run within a screen session (then I can start it late at night, wake up the next morning and see what happened).
See my Tools for running experiments and collecting performance measurements. screen, script, and bash scripts may be particularly useful. I suggest running experiments in script, which will save all terminal output in a file (the default name is typescript). Just run script outfile, start your experiments on the command line, and then run exit to exit out of script. All the terminal output will be saved in a file named outfile. You can clean up the output file a bit by running dos2unix on it:
```
$ script results
Script started, file is results
% ./run
  ...
% exit
Script done, file is results
$ dos2unix -f results
$ dos2unix -f results
```
latex links. latex is Unix's document writing software. You do not have to use latex for writing your report, but it has very good support for some things that are common in scientific writing such as representing mathematical expressions, so you may want to give it a try. I have some example latex documents that you can grab to use as a starting point in ~newhall/public/latex_examples/.

Submit

Report: Due before noon, Wednesday Dec. 12

You must submit your report to me as a printout by the report due date. A printout of your report must be handed in to me in person before the due date. If I am not in my office when you stop by, you should slide it under my door. In addition to handing me a printout of your report, also please email me either your report .pdf and .tex files if you use latex, or your REPORT ascii file if you use vim. You can add these as attachments to your email.

To print a ascii file on our system:

lpr REPORT

To print a pdf file, do so from within xpdf (lpr will print out pages and pages of raw pdf encoding):

xpdf report.pdf    # then choose the print option under the File menu

If you accidentally print a .pdf usin lpr, you can kill the print job:

lprm

Make sure both you and your partner's names are on the report

Code: Due before 11:59pm, Tuesday Dec. 11

Submit all your source code and Makefile and your report in your labs/08/ subdirectory. If your report is complete at this time, you can include it in this directory (you still need to submit to me in person a printout of your report). If you use vim for your report, put it in a file named REPORT. If you use latex, include both the REPORT.tex file and any image files, and the REPORT.pdf file.

Once you are satisfied with your solution, hand it in by typing handin31 at the unix prompt.

Only one of you or your partner should run handin31 to submit your joint solutions If you accidentally both run it, send me email right away letting me know which of the two solutions I should keep and which I should discard (you don't want the grader to just guess which joint solution to grade).

You may run handin31 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize, after handing in some programs, that you'd like to make a few more changes to them.

CS31 Lab 8: Parallel Game of Life