CS 31 Lab 7

1. Due Date

Complete lab: Due by 11:59 pm, Sunday, May 09, 2021

Checkpoint: Monday, May 03 push to git repo before 11:59pm.

By the checkpoint deadline, your program should implement functionality up to and including step 3.b. in Section 8.2.

Check the Lab 7 Partners link for your lab partner.

CS31 Partner Etiquette and Expectations.

2. Lab Goals

Parallelize a shared-memory application using the pthreads library.
Protect threaded code with synchronization primitives.
Revisit 2D arrays and file I/O in C.
Gain more expertise with C pointers and dynamic memory allocation.
Design input files to test the correctness of your solution,
Practice debugging threaded programs.

3. Lab Overview

This lab is designed to give you some practice writing and debugging multi-threaded pthreads programs, using synchronization primitives, and to give you some experience designing and running some parallel performance experiments.

You’ll implement a parallel version of Conway’s Game of Life using your Lab 5 sequential solution as a starting point for this lab. See the Lab 5 assignment to remind yourself about the sequential version of the program and the rules of the game.

Your Lab 7 solution will use the same method as Lab 5 for initializing a dynamically allocated board from an input file given as a command line argument. Your solution will support the same command line options as Lab 5, and add three additional command line options related to the parallelization. These command line options are outlined in detail below.

You will begin Lab 7 by first fixing any errors in your Lab 5 sequential solution.

Don’t worry about fixing the torus parts (the edges cells) if you didn’t have that working in Lab 5. As long as your Lab 5 gol implementation for all the middle (non-edge) cells works, that is enough to move on to starting to parallelize your gol implementation. And then changing it to get ready for parallelization.

In Wednesday lab we will also help you analyze your Lab 5 solution for any design choices that may make parallelization difficult, so that you can fix those before parallelizing.

After reading through the lab details and requirements, see Section 8 for tips on how to get started.

4. Lab Starting Point Code

4.1. Getting Your Lab Repo

Both you and your partner should clone your Lab repo into your cs31/labs subdirectory:

get your Lab ssh-URL from the CS31 git org. The repository to clone is named Lab7-userID1-userID2, where user1 and user2 are the user names of you and your Lab partner.
cd into your cs31/labs subdirectory:
```
$ cd ~/cs31/labs
$ pwd
```

clone your repo

$ git clone [your Lab7-userID1-userID2 url]
$ cd Lab7-userID1-userID2
$ ls

4.2. Starting Point Files

The files included in your repo are:

Makefile: builds gol executable file
README.md: some notes to you

*.txt: one or more example input files. You can copy more test files from your Lab 5 repo into this one, and you should create new input files too (particularly big ones to test the performance of your parallel solution).
design_worksheet: start your lab assignment by filling in this worksheet with the design of your parallelization of gol. You should bring this filled out worksheet with you to the ninja session, and to office hours and other times you are seeking help on your lab assignment. To print a copy of this worksheet to a CS lab printer, use the lpr command: lpr design_worksheet. See the "Getting Started" tips in Section 8.2 for information about how and when to use this worksheet.

One thing to note, is that there is not a gol.c file in the starting point code. You will add this in from your or your partner’s Lab 5 solution, that you will use as a starting point for this lab.

4.3. Add gol.c to your repo and modify it

The starting point code does not include a gol.c file. You and your partner should choose one of your Lab 5 solutions as a starting point for this lab, and copy over that gol.c file into your Lab7-userID1-userID2:

$ cd ~cs31/labs/Lab7-userID1-userID2
$ cp ~you_or_partner/cs31/labs/Lab05-user1-user2/gol.c .

4.3.1. modifications to sequential gol.c before parallelizing

The first thing to do is to make sure that your gol.c file includes these files.

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>
#include <string.h>
#include <pthread.h>

Next, you will need to change your play_gol function prototype to match that of the thread function argument passed to pthread_create:
```
void *play_gol(void *args)
```
This means you will have to modify it to return a value (returning NULL is fine), and you need to explicitly re-cast the void * parameter to a struct gol_data * inside this function in order to access field values of the passed argument. I recommend doing this by changing your function parameter name to a name not used inside your function (args for example) and then adding a local variable with your old parameter name (data for example) that you initialize to the re-cast void * parameter. This way your function code that refers to data will work un-changed. For example:
```
// change sequential play_gol function prototype from this:
void play_gol(struct *gol_data);
// to this:
void *play_gol(void *args);

// change name of parameter in function definition and local variable
// of old paramter name and types that is assigned re-cast void* param
void *play_gol(void *args) {
  struct gol_data *data;  // add local variable of the type we know is passed in
    ...
  data = (struct gol_data *)args;  // initialize it to the re-cast args value
  // function code can use data to reference fields in passed struct
    ...
  return NULL; // this function now needs to return some value
}
```

After making these changes, compile and run your program, and test that this change to your sequential version of gol still works as before.

5. Lab Details

5.1. Command line arguments

Your program will support the same command line args as the Lab 5 version, and will add three new command line arguments:

An integer to specify the number of threads (the degree of parallelism).
A 0/1 flag to specify how to parallelize the GOL program (0: row-wise grid cell allocation, 1: column-wise grid cell allocation). More details below.
A 0/1 flag to specify if the per-thread board allocation should be printed out or not at start up (0:don’t print configuration information, 1: print allocation information).

$ ./gol
usage: ./gol infile.txt output_mode[0,1] num_threads partition[0,1] print_partition[0,1]

Here are some example command lines:

# run with config values read from file1.txt, run in run mode OUTPUT_NONE (0)
# create 8 threads, use row-wise partitioning, print per-thread partitioning details:
./gol file1.txt  0  8  0  1

# run with config file file2.txt, run in run mode OUTPUT_ASCII, create
# 15 threads, use column-wise partitioning, don't print per-thread partitioning:
./gol file2.txt  1  15  1  0

The print per-thread board allocation command line option will help you debug the grid cell allocation scheme to ensure that you are correctly allocating rows or columns across the worker threads. I suggest running with print partitioning in in run mode 0 (OUTPUT_NONE), as OUTPUT_ASCII will clear the terminal too quickly to view the partitioning information. See more details below about partitioning and the print partitioning option.

Your program should handle badly formed command lines and check for bad values entered (e.g., a negative number of threads). It should print out an error message and exit for badly formed command lines and handle most bad input values similarly.

5.2. structure of parallel game of life

The main thread should begin by initializing the game board by reading the input file in the same way as the sequential version. The input files format has not changed. Your timing code should not include this step but should include all other parts of the execution.
After initializing the game board, the main thread spawns off worker threads that will simulate multiple rounds of the game, with each thread computing just its portion of cells of the new board each round (based on the partitioning information passed as a command line option). Grid cells are allocated to threads using either a row-wise or column-wise partitioning specified by a command line argument. If a command line is entered with more threads then the number of rows/columns to partition by, your program can print out an error message and exit In other words, you don’t need to handle a case in which there are threads that have no rows/columns assigned to them. The total number of live cells each round should be calculated in parallel with each thread contributing to its portion of live cells to the total value. If the print_partitioning option is specified, threads should print out their partition information before running rounds of gol.
If running in OUTPUT_ASCII mode, you should designate one thread to be the printing thread, and only that thread should call print_board to print the entire board after each round. Having threads print out just their part of the board will result in crazy interleaved output (a race condition!). Synchronizing parallel board printing so that the board is printed in order is beyond what you need to support (and would be less efficient than if just a single thread prints the entire board).

At the end of the simulation’s execution, the main thread should print out the elapsed time (if run in run mode OUTPUT_NONE and OUTPUT_ASCII). You will not be graded on absolute performance, but your code should run faster when run in parallel. As a sanity check, try running a large board instance (e.g., 1500x1500) for many rounds (e.g., 100+). You should see a substantial improvement between running with one thread and a handful of threads (2-8). It’s that sort of relative improvement that we’ll be looking for. Note: for comparison timing, run in mode OUTPUT_NONE, and make sure it does not include any extra synchronization or calls to usleep that may be part of the other run modes.

5.3. Synchronization

You will need to think about where and how to synchronize thread actions to remove race conditions. For example, you need to ensure that no thread starts the next round of play until all threads are done with the current round, otherwise threads will not read consistent values for the given round. If printing is enabled, you need to ensure that the printing thread prints a consistent version of the game board. If run in OUTPUT_ASCII mode, you need to ensure that the printing thread prints a consistent version of the game board. You will also need to ensure that threads do not race to update the global live cell counter. Any added synchronization that is only necessary for a specific command line option should not be included in runs that do not specify that option.

5.4. Partitioning

You will implement two different ways of parallelizing the computation: one partitions the board’s rows across threads, the other partitions the board’s columns across threads. As an example, suppose that the board is 8x8, and that there are 4 threads. Here is how the threads (with logical tids 0-3) would be assigned to the grid using row and column partitioning, respectively:

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3

When the number of threads does not evenly divide the dimension by which you are partitioning, divide the rows (or columns) up so that there is at most a difference of 1 assigned row (or column) between any two threads. For example, for an 8x7 board and 3 threads, you would partition rows and columns like this among the 3 threads (with tids 0-2):

row partitioning                        column partitioning
----------------                        ----------------
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
2 2 2 2 2 2 2                           0 0 0 1 1 2 2
2 2 2 2 2 2 2                           0 0 0 1 1 2 2

In this partitioning scheme, a single row (or column) is assigned to exactly one thread; a single row (or column) is never split between two or more threads. Thus, in the above example threads 0 and 1 have one more row each of work to do than thread 2 in the row-wise partitioning, and thread 0 has one more column of work to do than threads 1 and 2 in the column-wise partitioning. NOTE how the imbalance of rows/columns is assigned to threads in this example. Your solution should assign them in the same way. Every thread should either be assigned the same number of rows or columns or differ by at most one.

5.5. Printing Partitioning Information

When the 5th command line option is non-zero, your program should print thread partitioning information. Each thread should print:

Its thread id (the logical id number that you assign, not the pthread_self value).
Its start and end row index values.
The total number of rows it is allocated.
Its start and end column index values.
The total number of columns it is allocated.

The threads will run in different orders, so you may see each thread’s output in any order, which is fine. If you’d like, you can try to avoid some interleaving of individual threads output by calling fflush after printf:

printf("tid %d my values ...\n", mytid, ...);
fflush(stdout);   // force the printf output to be printed to the terminal

5.5.1. Example Partitioning Output

Here are some example runs with different numbers of threads partitioning a 100x100 grid (I’m using a config file I create named b100.txt that defines a 100x100 grid). The total number of rows and columns per thread are shown in parentheses (your output does not need to be identical to this, but it must include all the required parts):

# 9 threads, column-wise partitioning
./gol b100.txt 0 9 1 1
tid  0: rows:  0:99 (100) cols:  0:11 (12)
tid  1: rows:  0:99 (100) cols: 12:22 (11)
tid  2: rows:  0:99 (100) cols: 23:33 (11)
tid  4: rows:  0:99 (100) cols: 45:55 (11)
tid  3: rows:  0:99 (100) cols: 34:44 (11)
tid  5: rows:  0:99 (100) cols: 56:66 (11)
tid  6: rows:  0:99 (100) cols: 67:77 (11)
tid  7: rows:  0:99 (100) cols: 78:88 (11)
tid  8: rows:  0:99 (100) cols: 89:99 (11)

# 6 threads, row-wise partitioning
./gol b100.txt 0 6 0 1
tid  0: rows:  0:16 (17) cols:  0:99 (100)
tid  1: rows: 17:33 (17) cols:  0:99 (100)
tid  3: rows: 51:67 (17) cols:  0:99 (100)
tid  2: rows: 34:50 (17) cols:  0:99 (100)
tid  4: rows: 68:83 (16) cols:  0:99 (100)
tid  5: rows: 84:99 (16) cols:  0:99 (100)

# 17 threads, row-wise partitioning
./gol b100.txt 0 17 0 1
tid  0: rows:  0: 5 (6) cols:  0:99 (100)
tid  1: rows:  6:11 (6) cols:  0:99 (100)
tid  3: rows: 18:23 (6) cols:  0:99 (100)
tid  2: rows: 12:17 (6) cols:  0:99 (100)
tid  4: rows: 24:29 (6) cols:  0:99 (100)
tid  5: rows: 30:35 (6) cols:  0:99 (100)
tid  6: rows: 36:41 (6) cols:  0:99 (100)
tid  7: rows: 42:47 (6) cols:  0:99 (100)
tid 10: rows: 60:65 (6) cols:  0:99 (100)
tid 11: rows: 66:71 (6) cols:  0:99 (100)
tid 12: rows: 72:77 (6) cols:  0:99 (100)
tid 13: rows: 78:83 (6) cols:  0:99 (100)
tid  8: rows: 48:53 (6) cols:  0:99 (100)
tid 15: rows: 90:94 (5) cols:  0:99 (100)
tid 14: rows: 84:89 (6) cols:  0:99 (100)
tid  9: rows: 54:59 (6) cols:  0:99 (100)
tid 16: rows: 95:99 (5) cols:  0:99 (100)

Note that for the run with 17 threads, there are enough threads to see "out of order" execution due to the exact scheduling of threads on the CPUs. This is fine and expected behavior (the exact ordering of thread printing depends on when threads are scheduled to run on cores by the OS CPU scheduler).

5.6. Correctness

In addition to printing out per-thread board partitioning information to verify correct portioning, you should also ensure that your parallel version solves GOL correctly. To do this run both your parallel version and your sequential version on some of the same input files in OUTPUT_ASCII run mode (run mode 1). The results should be identical.

If you see incorrect and/or odd output, either your partitioning of the GOL game board is incorrect, or you are missing synchronization, or both. You can remove the call to system("clear") to see the history of each round to help you debug incorrect GOL computation (of course run for a small number of iterations).

6. Sample Output

Here is output from a run of my gol program showing:

Example column-wise and row-wise partitioning output for different numbers of threads is shown in Section 5.5.1. Please follow our output format for your printing partitioning functionality.

Other program ouput should be identical to that from Lab 5.

Example runs showing the time improvement from the parallelization is shown below. This output is from runs on basil on a 4000x4000 board for 500 iterations for different numbers of threads (1, 4, 8, 12, and 20). Notice the time improvement with increased parallelization up to the number of cores (from 298 seconds for 1 thread to 50 seconds for 12 threads). Beyond the total number of cores (12 in the case of basil), adding more threads does not help to improve total runtime.

./gol big.txt 0 1 0 0
Total time: 297.943 seconds
After 500 rounds on 4000x4000, number of live cells is 17

./gol big.txt 0 4 0 0
Total time: 78.617 seconds
After 500 rounds on 4000x4000, number of live cells is 17

./gol big.txt 0 8 0 0
Total time: 71.223 seconds
After 500 rounds on 4000x4000, number of live cells is 17

./gol big.txt 0 12 0 0
Total time: 50.128 seconds
After 500 rounds on 4000x4000, number of live cells is 17

./gol big.txt 0 20 0 0
Total time: 53.965 seconds
After 500 rounds on 4000x4000, number of live cells is 17

These are runs with just the normal Makefile gcc command (there is no optimiztion compiler flag used to build this code). With compiler optimization each run would be much faster, but the trend of improvement from the multithreaded would remain.

7. Lab Requirements

Your program must satisfy all the requirements of the sequential gol lab, except that if you did not get the torus to work correctly, you do not need to fix that part (i.e. if your edge cells are not quite in your sequential gol, that is okay for your parallel solution).
Your program should handle the new command line arguments in addition to the previous ones.
The total_live global variable that should be updated each round to the number of live cells. Its value should be computed in parallel each round with each thread updating the portion of total_live value for a round from its portion of the grid.
The only other global variables you may add are any needed for pthread synchronization. Any values that threads need should be passed to them via the (void *arg) parameter. Look at the example pthread programs from Week 12, from in class, and in the textbook for some examples of how to do this. You will likely need to add fields to your struct gol_data type that are needed for threads to perform gol on their part of the grid.
The control flow in your sequential main function doesn’t need much change. You need to parse the additional command line arguments and make calls to pthread_create. If you find you are adding a lot of code to main, stop and function-ize that functionality and call it from main: main should not get a lot longer, and its main control flow really should not change.
For full credit, your solution should be correct, robust, and free of valgrind errors (when run in modes OUTPUT_ASCII or OUTPUT_NONE). Still reachable is okay.

All TODO comments in the starting point code should be removed in your final submission. These are notes to you for what you need to add into the starting point code and where you need to add it.
Use good modular design, and comment your code well.

8. Getting Started

8.1. Get sequential gol.c ready for parallelizing

If you are working with a partner that had a different partner for Lab 5, start by deciding whose gol.c solution you will use for a starting point for this lab.
After copying your Lab 5 gol.c into your repo, make the modifications described in Section 4.3.1. Compile and run to make sure it works.
Then fix any any parts of your Lab 5 gol.c that will make parallelizing it difficult, and that are memory inefficient. We will look at your solution in lab on Wednesday and help you figure out what you need to do for this part.

8.2. Get Started on Parallelization

Start by reviewing Section of 14.2 and 14.3 of the textbook. and the weekly lab examples.
Next, print out the design_worksheet from your repo and follow the directions for outlining the design of your solution (see Section 4.2 for printing and directions for how to this).
After filling out the worksheet with your design. I recommend implementing and testing functionality in this order:
1. Without spawning any threads, write the partitioning function, i.e. figure out which section of the board would be assigned to a thread. Remember each thread will call this function to compute its partitioning info. You can, however, test the function’s implementation first without spawning threads by adding some calls in main to test different sizes and partitioning schemes (and remove these calls before moving on to the next step).
2. Implement the main thread create-join control flow, and test thread’s row-column partitioning. Spawn the number of threads specified at the command line and have each thread to just print its partitioning information (including its tid value) in play_gol and just return (and remove this return before moving on to the next step).
3. Add fields to gol_data struct needed for parallelization and add code to initialize these fields (some may be init’ed by the main thread before spawning and other by the thread themselves before they start playing rounds of gol). You can add some debug output to test this, and add a call to return to avoid execution the rest of the code in play_gol (remove both before moving on to the next step).
4. Add code so that each thread prints out its own partitioning information only when the final command line argument is nonzero.
5. Write and test the code to compute one round of the GOL simulation, with the work divided among the threads. This will include support for all run modes. Start with testing OUTPUT_ASCII on a small file with just one round.
6. Compute all rounds of the GOL simulation, avoiding race conditions.
7. Count the number of live cells in a multi-threaded way, avoiding race conditions.
8. Check timing for multi-threaded in OUTPUT_NONE mode. You should see performance improvements with multi-threaded vs. sequential for large sized boards.
9. Make sure all other requirements have been met.

9. Tips

Look at the Lab 5 page for lots of tips related to sequential that may be useful here too
Implement and test incrementally!
Most of your changes will be to the play_gol function. You should also add helper functions to implement any new functionality. main should not get much longer, so if you need to add alot of code to main, or you are adding the same code to multiple parts, functionize it and call it from main. Plus you can independently test this code by adding debugging calls to this function from main!
You will need to add fields to your struct gol_data type that are needed by threads to perform gol on their part of the grid.
Use gdb as you incrementally implement and test to find and fix bugs as you go. Use valgrind as you go to catch memory access errors as you make them.
As you test and debug, you may find it useful to use config files with small numbers of iterations and to comment out the call to system("clear") so you can examine the results of every intermediate iteration.
Running in OUTPUT_ASCII is helpful for finding and debugging partitioning errors and other correctness errors.

The man pages for the various pthread functions are very helpful.
You can use the perror() function to print out error messages from failed system and pthread library calls. perror prints out detailed, human-readable information about the error that occurred. You call it passing in a string with program-specific error message. For example:
```
big_buff = malloc(...);
if (big_buff == NULL) {
  perror("mallocing big buffer failed");
}
```
See the man page for perror for more information and for the include files you need to add to your program to use it.
You will need to use some synchronization in your solution. You may use any of the pthread synchronization primitives: mutexes, condition variables, and/or barriers. Remember that before using any of these, you need to initialize them and that all threads need to have access to the shared synchronization variables (either pass a reference to them at thread creation or declare them as static global variables). Here are some code snippets for using the barrier synchronization primitive:
```
// declare a global pthread_barrier_t
static pthread_barrier_t barrier;

// initialize it in a function (e.g., main) before using it:
if(pthread_barrier_init(&barrier, NULL, num_threads)){
   perror("pthread_barrier_init");
   exit(1);
}

// threads than can call pthread_barrier_wait to synchronize:
ret = pthread_barrier_wait(&barrier);
if(ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
  perror("pthread_barrier_wait");
  exit(1);
}
```

10. Submitting

Here are the commands to submit your solution (from one of you or your partner’s cs31/labs/Lab7-userID1-userID2 directory:

$ git add gol.c *.txt
$ git commit -m "Lab 7 completed"
$ git push

And of course, if you want to submit any nifty starting point world config files you come up with, please do:

$ git add coolworld.txt
$ git commit -m "cool world"
$ git push

If you have difficulty pushing your changes, see the "Troubleshooting" section and "can’t push" sections at the end of the Using Git for CS31 Labs page. And for more information and help with using git, see the git help page.

At this point, you should submit the required Lab 7 Questionnaire (each lab partner must do this)

11. Handy Resources

General Lab Resources

Class piazza page for questions and answers about lab assignment
CS31 github org
Git help and Git for CS31 labs
Some Useful Unix Commands
vi (and vim) quick reference
man and Manual pages documentation for libraries and commands

CS31 Lab 7: Pthreads Game of Life!