1. Due Date

Complete lab: Due by 11:59 pm, Friday, Dec 10, 2021

Checkpoint: push to your git repo by 11:59 pm, Tuesday, Nov 30.

By the checkpoint deadline, your program should implement functionality up to and including step 3(b) in Section 8.2.

Check the Lab 9 Partners link for your lab partner. (This should be the same as for the original Game of Life lab.)

2. Lab Goals

  • Parallelize a shared-memory application using the pthreads library.

  • Protect threaded code with synchronization primitives.

  • Revisit 2D arrays and file I/O in C.

  • Gain more expertise with C pointers and dynamic memory allocation.

  • Design input files to test the correctness of your solution.

  • Practice debugging threaded programs.

3. Lab Overview

This lab is designed to give you practice writing and debugging multi-threaded pthreads programs using synchronization primitives, and to give you experience designing and running parallel performance experiments.

You’ll implement a parallel version of Conway’s Game of Life using your Lab 6 sequential solution as a starting point for this lab. See the Lab 6 assignment to remind yourself about the sequential version of the program and the rules of the game.

Your Lab 9 solution will use the same method as Lab 6 for initializing a dynamically allocated board from an input file given as a command line argument. Your solution will support the same command line options as Lab 6, and add three additional command line options related to the parallelization. These command line options are outlined in detail below.

You will begin Lab 9 by first fixing any errors in your Lab 6 sequential solution.

Don’t worry about fixing the torus behavior (the edge cells) if you didn’t have that working in Lab 6. As long as your Lab 6 gol implementation works for all the middle (non-edge) cells, that is enough to move on to preparing your gol implementation for parallelization and then parallelizing it.

In Wednesday lab we will also help you analyze your Lab 6 solution for any design choices that may make parallelization difficult, so that you can fix those before parallelizing.

After reading through the lab details and requirements, see Section 8 for tips on how to get started.

4. Lab Starting Point Code

4.1. Getting Your Lab Repo

Both you and your partner should clone your Lab repo into your cs31/labs subdirectory:

  1. get your Lab ssh-URL from the CS31 git org. The repository to clone is named Lab9-userID1-userID2, where userID1 and userID2 are the user names of you and your lab partner.

  2. cd into your cs31/labs subdirectory:

    $ cd ~/cs31/labs
    $ pwd
  3. clone your repo

    $ git clone [your Lab9-userID1-userID2 url]
    $ cd Lab9-userID1-userID2
    $ ls

4.2. Starting Point Files

The files included in your repo are:

  • Makefile: builds gol executable file

  • colors.h: contains an array of some different color3 values for use with ParaVis (you can use this or not in your solution)

  • *.txt: one or more example input files. You can copy more test files from your Lab 6 repo into this one, and you should create new input files too (particularly big ones to test the performance of your parallel solution).

  • design_worksheet: start your lab assignment by filling in this worksheet with the design of your parallelization of gol. You should bring this filled out worksheet with you to the ninja session, and to office hours and other times you are seeking help on your lab assignment. To print a copy of this worksheet to a CS lab printer, use the lpr command: lpr design_worksheet. See the "Getting Started" tips in Section 8.2 for information about how and when to use this worksheet.

One thing to note is that there is no gol.c file in the starting point code. You will add this from your or your partner’s Lab 6 solution, which you will use as a starting point for this lab.

4.3. Add gol.c to your repo and modify it

The starting point code does not include a gol.c file. You and your partner should choose one of your Lab 6 solutions as a starting point for this lab, and copy over that gol.c file into your Lab9-userID1-userID2:

$ cd ~/cs31/labs/Lab9-userID1-userID2
$ cp ~you_or_partner/cs31/labs/Lab6-user1-user2/gol.c .

4.3.1. modifications to sequential gol.c before parallelizing

  1. The first thing to do is to make sure that your gol.c file includes these files (add colors.h if you want to use it):

    #include <pthreadGridVisi.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <time.h>
    #include <string.h>
    #include <pthread.h>
    #include "colors.h"
  2. Next, you will need to change your play_gol function prototype to match that of the thread function argument passed to pthread_create:

    void *play_gol(void *args)

    This means you will have to modify it to return a value (returning NULL is fine), and you need to explicitly re-cast the void * parameter to a struct gol_data * inside this function in order to access field values of the passed argument. I recommend doing this by changing your function parameter name to a name not used inside your function (args for example) and then adding a local variable with your old parameter name (data for example) that you initialize to the re-cast void * parameter. This way your function code that refers to data will work unchanged. For example:

    // change sequential play_gol function prototype from this:
    void play_gol(struct gol_data *data);
    // to this:
    void *play_gol(void *args);  // be sure to change the return type too!
    
    
    // change name of parameter in function definition and local variable
    // of the old parameter name and type that is assigned the re-cast void* param
    void *play_gol(void *args) {
      struct gol_data *data;  // add local variable of the type we know is passed in
        ...
      data = (struct gol_data *)args;  // initialize it to the re-cast args value
      // function code can use data to reference fields in passed struct
        ...
      return NULL; // this function now needs to return some value!!
    }
  3. For the OUTPUT_VISI mode in the sequential version, we hid a call to pthread_create that was needed by the ParaVis library. This code was hidden in a wrapper function named connect_animation that was called from main. Now that your code is going to create its own threads, you will not be able to use connect_animation in this lab.

    As part of completing the lab, you will call pthread_create on your own in a loop, as we did in class and in the in-lab example code. Those calls to pthread_create will do the same job as connect_animation, so you no longer need that function in your program.

    In other words:

    // this line that currently exists in main
    connect_animation(play_gol, &data);
    // will eventually be replaced by something like this, once you get
    // around to creating threads yourself:
    for (...) {
       pthread_create(..., NULL, play_gol, ...); // some details for you to figure out
    }

    At this point in the lab, we’re not ready to make the changes we need to make the VISI code work seamlessly, so for now, just comment out the calls to connect_animation and run_animation. This means your VISI mode is broken, but we will talk about how to restore that functionality when we get to Section 5.7.

After making these changes, compile and run your program, and test that this change to your sequential version of gol still works as before (with the exception of VISI mode being broken).

5. Lab Details

5.1. Command line arguments

Your program will support the same command line arguments as the Lab 6 version, and will add three new command line arguments:

  1. An integer to specify the number of threads (the degree of parallelism).

  2. A 0/1 flag to specify how to parallelize the GOL program (0: row-wise grid cell allocation, 1: column-wise grid cell allocation). More details below.

  3. A 0/1 flag to specify whether the per-thread board allocation should be printed out at start up (0: don’t print allocation information, 1: print allocation information).

$ ./gol
usage: ./gol infile.txt output_mode[0,1,2] num_threads partition[0,1] print_partition[0,1]

Here are some example command lines:

# run with config values read from file1.txt, run in run mode OUTPUT_NONE (0)
# create 8 threads, use row-wise partitioning, print per-thread partitioning details:
$ ./gol file1.txt  0  8  0  1

# run with config file file2.txt, run in run mode OUTPUT_ASCII (1),
# create 15 threads, use column-wise partitioning, no partitioning details printed:
$ ./gol file2.txt  1  15  1  0

The print per-thread board allocation command line option will help you debug the grid cell allocation scheme to ensure that you are correctly allocating rows or columns across the worker threads. We suggest you try running with print partitioning when using run mode 0 (OUTPUT_NONE), as OUTPUT_ASCII will clear the terminal too quickly to view the partitioning information. See more details below about partitioning and the print partitioning option.

Your program should handle badly formed command lines and check for bad input values (e.g., a negative number of threads). It should print out an error message and exit for badly formed command lines, and handle most bad input values similarly.
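
For example, a minimal sketch of parsing and validating the new arguments might look like the following (the variable names num_threads, partition, and print_partition are just illustrative; your names and error messages may differ):

// sketch: parse and validate the new command line arguments
// (argv indexes match the usage line shown above)
if (argc != 6) {
    printf("usage: %s infile.txt output_mode[0,1,2] num_threads partition[0,1] print_partition[0,1]\n", argv[0]);
    exit(1);
}
num_threads = atoi(argv[3]);
partition = atoi(argv[4]);
print_partition = atoi(argv[5]);
if (num_threads < 1 || (partition != 0 && partition != 1)) {
    printf("Error: bad num_threads or partition value\n");
    exit(1);
}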

5.2. Structure of parallel game of life

  1. The main thread should begin by initializing the game board by reading the input file in the same way as the sequential version. The input file format has not changed. Your timing code should not include this step but should include all other parts of the execution.

  2. After initializing the game board, the main thread spawns off worker threads that will play multiple rounds of the game, with each thread computing just its portion of cells of the new board each round (based on the partitioning information passed as a command line option). Grid cells are allocated to threads using either a row-wise or column-wise partitioning specified by a command line argument. If a command line is entered with more threads than the number of rows/columns to partition by, your program can print out an error message and exit. In other words, you don’t need to handle a case in which there are threads that have no rows/columns assigned to them. The total number of live cells each round should be calculated in parallel, with each thread contributing its portion of the live-cell count to the total value. Note: if the print partitioning option is specified, threads should print out their partition information before running rounds of gol. (A rough sketch of the main thread’s create/join control flow appears after this list.)

  3. If running in OUTPUT_ASCII mode, you should designate one thread to be the printing thread, and only that thread should call print_board to print the entire board after each round. Having threads print out just their part of the board will result in crazy interleaved output: a race condition! Synchronizing parallel board printing so that the board is printed in order is beyond what you need to support, and it would be less efficient than if just a single thread prints the entire board.

  4. If running in OUTPUT_VISI mode, each thread should update its part of the color3 image buffer in parallel and then make a call to draw_ready when it is done updating its part.

  5. At the end of the simulation’s execution, the main thread should print out the elapsed time (if run in OUTPUT_NONE or OUTPUT_ASCII mode). You will not be graded on absolute performance, but your code should run faster when run in parallel. As a sanity check, try running a large board instance (e.g., 1500x1500) for many rounds (e.g., 100+). You should see a substantial improvement between running with one thread and a handful of threads (2-8). It’s that sort of relative improvement that we’ll be looking for. Note: for comparison timing, run in mode OUTPUT_NONE, and make sure it does not include any extra synchronization or calls to usleep that may be part of the other run modes.
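
To make the overall control flow concrete, here is a rough sketch of the thread create/join part of main (a sketch only: the names tids, tid_data, and num_threads are example names, and the struct gol_data fields you copy or set will depend on your design):

// sketch: one pthread_t and one struct gol_data per worker thread
pthread_t *tids = malloc(sizeof(pthread_t) * num_threads);
struct gol_data *tid_data = malloc(sizeof(struct gol_data) * num_threads);
for (i = 0; i < num_threads; i++) {
    tid_data[i] = data;    // copy the shared, already-initialized fields
    tid_data[i].tid = i;   // then set per-thread fields (logical tid, ...)
    if (pthread_create(&tids[i], NULL, play_gol, &tid_data[i])) {
        perror("pthread_create");
        exit(1);
    }
}
// if running in OUTPUT_VISI mode, the existing run_animation call goes
// here, before the joins (see Section 5.7)
for (i = 0; i < num_threads; i++) {
    pthread_join(tids[i], NULL);
}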

5.3. Synchronization

You will need to think about where and how to synchronize thread actions to remove race conditions. For example, you need to ensure that no thread starts the next round of play until all threads are done with the current round; otherwise threads will not read consistent values for the given round. If run in OUTPUT_ASCII mode, you need to ensure that the printing thread prints a consistent version of the game board. You will also need to ensure that you don’t have a race condition when updating the global live cell counter. Any synchronization steps that are only necessary for a specific command line option should not be included in runs that do not specify that option.
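
For example, a mutex is one way to remove the race condition on the shared live-cell total (the names below are illustrative; computing per-thread partial counts that are combined after a barrier is another valid approach):

// sketch: protect updates to the shared live cell total with a mutex
static pthread_mutex_t live_mutex = PTHREAD_MUTEX_INITIALIZER;

// in each thread, after counting my_live cells in its own partition
// (total_live must also be reset to 0 between rounds, e.g., by one
// thread after a barrier):
pthread_mutex_lock(&live_mutex);
total_live += my_live;
pthread_mutex_unlock(&live_mutex);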

5.4. Partitioning

You will implement two different ways of parallelizing the computation: one partitions the board’s rows across threads, the other partitions the board’s columns across threads. As an example, suppose that the board is 8x8, and that there are 4 threads. Here is how the threads (with logical tids 0-3) would be assigned to the grid using row and column partitioning, respectively:

row partitioning (option 0)             column partitioning (option 1)
----------------                        ----------------
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
0 0 0 0 0 0 0 0                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
1 1 1 1 1 1 1 1                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
2 2 2 2 2 2 2 2                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3
3 3 3 3 3 3 3 3                         0 0 1 1 2 2 3 3

When the number of threads does not evenly divide the dimension by which you are partitioning, divide the rows (or columns) up so that there is at most a difference of 1 assigned row (or column) between any two threads. For example, for an 8x7 board and 3 threads, you would partition rows and columns like this among the 3 threads (with tids 0-2):

row partitioning (option 0)             column partitioning (option 1)
----------------                        ----------------
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
0 0 0 0 0 0 0                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
1 1 1 1 1 1 1                           0 0 0 1 1 2 2
2 2 2 2 2 2 2                           0 0 0 1 1 2 2
2 2 2 2 2 2 2                           0 0 0 1 1 2 2

In this partitioning scheme, a single row (or column) is assigned to exactly one thread; a single row (or column) is never split between two or more threads. Thus, in the above example threads 0 and 1 have one more row each of work to do than thread 2 in the row-wise partitioning, and thread 0 has one more column of work to do than threads 1 and 2 in the column-wise partitioning. Note how the imbalance of rows/columns is assigned to threads in this example. Your solution should assign them in the same way: every thread should either be assigned the same number of rows or columns or differ by at most one.
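
One common way to compute such a partition is to give each of the first (N mod num_threads) threads one extra row (or column). Here is a minimal sketch (the names tid, num_threads, dim, start, and end are illustrative; dim is the number of rows or columns being partitioned):

// sketch: compute this thread's [start, end] range (inclusive) over dim
// rows (or columns)
int chunk = dim / num_threads;   // items every thread gets
int extra = dim % num_threads;   // leftover items, given to the lowest tids
if (tid < extra) {
    start = tid * (chunk + 1);
    end = start + chunk;         // this thread gets chunk + 1 items
} else {
    start = extra * (chunk + 1) + (tid - extra) * chunk;
    end = start + chunk - 1;     // this thread gets chunk items
}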

5.5. Printing Partitioning Information

When the 5th command line option is non-zero, your program should print thread partitioning information. Each thread should print:

  1. Its thread id (the logical id number that you assign, not the pthread_self value).

  2. Its start and end row index values.

  3. The total number of rows it is allocated.

  4. Its start and end column index values.

  5. The total number of columns it is allocated.

The threads will run in different orders, so you may see each thread’s output in any order, which is fine. If you’d like, you can try to avoid some interleaving of individual threads’ output by calling fflush after printf:

printf("tid %d my values ...\n", mytid, ...);
fflush(stdout);   // force the printf output to be printed to the terminal
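
For example, a printf along these lines produces output in the style of the examples in Section 5.5.1 (the struct field names are just assumptions, and your format does not need to match exactly as long as it includes all of the required parts):

// sketch: print this thread's partitioning information
printf("tid %2d: rows: %2d:%2d (%d) cols: %2d:%2d (%d)\n",
       data->tid, data->start_row, data->end_row,
       data->end_row - data->start_row + 1,
       data->start_col, data->end_col,
       data->end_col - data->start_col + 1);
fflush(stdout);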

5.5.1. Example Partitioning Output

Here are some example runs with different numbers of threads partitioning a 100x100 grid (I’m using a config file I created named b100.txt that defines a 100x100 grid). The total number of rows and columns per thread are shown in parentheses. Your output does not need to be identical to this, but it must include all the required parts.

# 9 threads, column-wise partitioning
$ ./gol b100.txt 0 9 1 1
tid  0: rows:  0:99 (100) cols:  0:11 (12)
tid  1: rows:  0:99 (100) cols: 12:22 (11)
tid  2: rows:  0:99 (100) cols: 23:33 (11)
tid  4: rows:  0:99 (100) cols: 45:55 (11)
tid  3: rows:  0:99 (100) cols: 34:44 (11)
tid  5: rows:  0:99 (100) cols: 56:66 (11)
tid  6: rows:  0:99 (100) cols: 67:77 (11)
tid  7: rows:  0:99 (100) cols: 78:88 (11)
tid  8: rows:  0:99 (100) cols: 89:99 (11)

# 6 threads, row-wise partitioning
$ ./gol b100.txt 0 6 0 1
tid  0: rows:  0:16 (17) cols:  0:99 (100)
tid  1: rows: 17:33 (17) cols:  0:99 (100)
tid  3: rows: 51:67 (17) cols:  0:99 (100)
tid  2: rows: 34:50 (17) cols:  0:99 (100)
tid  4: rows: 68:83 (16) cols:  0:99 (100)
tid  5: rows: 84:99 (16) cols:  0:99 (100)

# 17 threads, row-wise partitioning
$ ./gol b100.txt 0 17 0 1
tid  0: rows:  0: 5 (6) cols:  0:99 (100)
tid  1: rows:  6:11 (6) cols:  0:99 (100)
tid  3: rows: 18:23 (6) cols:  0:99 (100)
tid  2: rows: 12:17 (6) cols:  0:99 (100)
tid  4: rows: 24:29 (6) cols:  0:99 (100)
tid  5: rows: 30:35 (6) cols:  0:99 (100)
tid  6: rows: 36:41 (6) cols:  0:99 (100)
tid  7: rows: 42:47 (6) cols:  0:99 (100)
tid 10: rows: 60:65 (6) cols:  0:99 (100)
tid 11: rows: 66:71 (6) cols:  0:99 (100)
tid 12: rows: 72:77 (6) cols:  0:99 (100)
tid 13: rows: 78:83 (6) cols:  0:99 (100)
tid  8: rows: 48:53 (6) cols:  0:99 (100)
tid 15: rows: 90:94 (5) cols:  0:99 (100)
tid 14: rows: 84:89 (6) cols:  0:99 (100)
tid  9: rows: 54:59 (6) cols:  0:99 (100)
tid 16: rows: 95:99 (5) cols:  0:99 (100)

Note that for the run with 17 threads, there are enough threads to see "out of order" execution due to the exact scheduling of threads on the CPUs. This is fine and expected behavior (the exact ordering of thread printing depends on when threads are scheduled to run on cores by the OS CPU scheduler).

5.6. Correctness

In addition to printing out per-thread board partitioning information to verify correct partitioning, you should also ensure that your parallel version solves GOL correctly. To do this, run both your parallel version and your sequential version on some of the same input files in OUTPUT_ASCII run mode (run mode 1). The results should be identical.

If you see incorrect and/or odd output, either your partitioning of the GOL game board is incorrect, or you are missing synchronization, or both. You can remove the call to system("clear") to see the history of each round to help you debug incorrect GOL computation (of course run for a small number of iterations).
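
For example, with the system("clear") call removed and a small input file run for a few rounds, you can capture and compare the ASCII output of the two versions (the executable and file names below are placeholders; note that any timing line will differ between runs, but the board output should be identical):

$ ./gol_lab6 small.txt 1 > seq_out.txt     # sequential Lab 6 version
$ ./gol small.txt 1 4 0 0 > par_out.txt    # parallel: 4 threads, row-wise
$ diff seq_out.txt par_out.txt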

5.7. Changes to ParaVis animation for pthreads

The ParaVis library supports animating pthread programs. You only need to make a few small changes to get the OUTPUT_VISI mode to work for the pthreads of gol. The first few changes are all in main:

  1. The init_pthread_animation function takes the number of threads as the first parameter:

    visi_handle init_pthread_animation(int num_tids, int rows, int cols, char *name);

    This number is the number of threads that will be spawned to simulate the parallel game of life and does not include the main thread. Only the main thread should call this function. The sequential version of gol.c passed 1 to init_pthread_animation since only 1 pthread was created to run play_gol (which we hid in the connect_animation function). You need to update this parameter to be the number of threads your program is running:

    // Previously you had 1 thread hardcoded
    data.handle = init_pthread_animation(1, data.rows, data.cols, visi_name);
    // Now you should update it to be the number of threads you have in
    // your program. (The variable name used below might not match yours.)
    data.handle = init_pthread_animation(num_threads, data.rows, data.cols, visi_name);
  2. You will leave the code that gets the animation buffer as is. No changes are necessary. However, just like with the init_pthread_animation function above, only the main thread should call this function.

  3. As discussed in Section 4.3.1, you will replace the call to connect_animation with the loop that creates your threads.

  4. You should leave the line that calls run_animation unchanged, but that function must be called before you join your threads.

  5. In addition to the changes above that you made in main, your sequential code had the following comment in play_gol:

      //   if ParaVis animation:
      //     (a) call your function to update the color3 buffer
      //     (b) call draw_ready(data->handle)
      //     (c) call usleep(SLEEP_USECS) to slow down the animation

    You need to modify the function that updates the color3 buffer so that it is multi-threaded. As it is currently written, this function colors every pixel of the image buffer. Since every thread calls this function, we don’t want every thread coloring every pixel. Instead, we only want each thread to color its subset of the pixels. Update this function so that it uses the partitioning information to only color the pixels that the thread is responsible for.

  6. In play_gol, be sure that every thread updates the color3 buffer as described above and every thread calls draw_ready when it is done updating the color3 buffer. (A rough sketch of this per-round visualization step follows this list.)
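
Here is a rough sketch of this per-round visualization step inside play_gol, in which each thread colors only its own cells (the struct field names, the 1D board representation, and the loop bounds are assumptions based on one possible design; the row "flip" mentioned in the Tips section is included):

// sketch: each thread colors only the cells in its partition
for (i = data->start_row; i <= data->end_row; i++) {
    for (j = data->start_col; j <= data->end_col; j++) {
        // ParaVis puts (0,0) at the bottom left, so flip the row index
        buff_i = (data->rows - 1 - i) * data->cols + j;
        if (board[i * data->cols + j]) {          // live cell
            buff[buff_i] = c3_black;
        } else {                                  // dead cell: this tid's color
            buff[buff_i] = colors[(data->tid) % 8];
        }
    }
}
draw_ready(data->handle);   // tell ParaVis this thread's part is ready
usleep(SLEEP_USECS);        // slow down the animation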

Further documentation about using the ParaVis interface is available as a comment at the top of the pthreadvisi_example.c program from Week 12. Also, more detailed documentation is available in the pthreadGridVisi.h header file that you can open in an editor:

$ vim /usr/local/include/qtvis/pthreadGridVisi.h

6. Sample Output

Here is some sample output from runs of my gol program:

  • Example column-wise and row-wise partitioning output for different numbers of threads is shown in Section 5.5.1. Please follow our output format for your printing partitioning functionality.

    Other program output should be identical to that from Lab 6.

  • Example runs showing the time improvement from the parallelization are shown below. This output was generated on a lab machine with 12 cores, using a configuration file (big.txt) that specified a 4000x4000 board and 500 iterations. The program was run for different numbers of threads (1, 4, 8, 12, and 20). Notice the time improvement with increased parallelization up to the number of cores (from 298 seconds for 1 thread to 50 seconds for 12 threads). Beyond the total number of cores (12, in this case), adding more threads does not help to improve total runtime.

    $ ./gol big.txt 0 1 0 0
    Total time: 297.943 seconds
    After 500 rounds on 4000x4000, number of live cells is 17
    
    $ ./gol big.txt 0 4 0 0
    Total time: 78.617 seconds
    After 500 rounds on 4000x4000, number of live cells is 17
    
    $ ./gol big.txt 0 8 0 0
    Total time: 71.223 seconds
    After 500 rounds on 4000x4000, number of live cells is 17
    
    $ ./gol big.txt 0 12 0 0
    Total time: 50.128 seconds
    After 500 rounds on 4000x4000, number of live cells is 17
    
    $ ./gol big.txt 0 20 0 0
    Total time: 53.965 seconds
    After 500 rounds on 4000x4000, number of live cells is 17

    These are runs with just the normal Makefile gcc command (there is no optimization compiler flag used to build this code). With compiler optimization each run would be much faster, but the trend of improvement from multithreading would remain.

7. Lab Requirements

  • Your program must satisfy all the requirements of the sequential gol lab, except that if you did not get the torus to work correctly, you do not need to fix that part (i.e. if your edge cells are not quite right in your sequential gol, that is okay for your parallel solution).

  • Your program should handle the new command line arguments in addition to the previous ones.

  • The total_live global variable should be updated each round to hold the number of live cells. Its value should be computed in parallel each round, with each thread contributing the count from its portion of the grid.

  • The only other global variables you may add are any needed for pthread synchronization. Any values that threads need should be passed to them via the (void *arg) parameter. Look at the example pthread programs from Week 12, from in class, and in the textbook for some examples of how to do this. You will likely need to add fields to your struct gol_data type that are needed for threads to perform gol on their part of the grid (e.g. which rows and columns this thread is operating on).

  • The control flow in your sequential main function doesn’t need much change. You need to parse the additional command line arguments and make calls to pthread_create. If you find you are adding a lot of code to main, stop and create a function to do this work and call that function from main: main should not get a lot longer, and its main control flow really should not change.

  • For full credit, your solution should be correct, robust, and free of valgrind errors (when run in modes OUTPUT_ASCII or OUTPUT_NONE). "Still reachable" errors are okay.

  • You are NOT responsible for running valgrind in the OUTPUT_VISI mode (the ParaVis visualization mode). "Still reachable" errors are okay.

  • All TODO comments in the starting point code should be removed in your final submission. These are notes to you for what you need to add into the starting point code and where you need to add it.

  • Use good modular design, and comment your code well.

  • Last lab! Good luck!

8. Getting Started

8.1. Get sequential gol.c ready for parallelizing

  • If you are working with a different partner than you had for Lab 6, start by deciding whose gol.c solution you will use for a starting point for this lab.

  • After copying your Lab 6 gol.c into your repo, make the modifications described in Section 4.3.1. Compile and run to make sure it works.

  • Then fix any parts of your Lab 6 gol.c that will make parallelizing it difficult or that are memory inefficient. We may have included comments about necessary changes as part of the gradesheet in your Lab 6 repo. We can also look at your solution in lab on Wednesday and help you figure out changes that you may need to make.

8.2. Get Started on Parallelization

  1. Start by reviewing Section 14.2 and 14.3 of the textbook and the in-lab examples.

  2. Next, print out the design_worksheet from your repo and follow the directions for outlining the design of your solution (see Section 4.2 for how to print it and for directions on how to use it).

  3. After filling out the worksheet with your design, I recommend implementing and testing functionality in this order:

    1. Without spawning any threads, write the partitioning function, i.e. figure out which section of the board would be assigned to a given thread. Each thread will eventually call this function to compute its partitioning info, but you can test the function’s implementation first without spawning threads by adding some calls in main to try different sizes and partitioning schemes (and remove these calls before moving on to the next step).

    2. Implement the main thread create-join control flow, and test each thread’s row/column partitioning. Spawn the number of threads specified at the command line and have each thread just print its partitioning information (including its tid value) in play_gol and just return before actually running the remainder of the code in play_gol. (Be sure to remove this return after you’ve tested this and before moving on to the next step).

    3. Add fields to the struct gol_data that you will need for parallelization and add any code to initialize these fields. Some of these fields may be initialized by the main thread before spawning, and other fields may be initialized by the threads themselves before they start playing rounds of gol. You can add some debug output to test this. As with the previous step, you should return from the play_gol function before executing the rest of the code in play_gol to test this step. When you’re confident it’s working, you should remove the return before moving on to the next step.

    4. Add code so that each thread prints out its own partitioning information only when the final command line argument is non-zero.

    5. Write and test the code to compute one round of the GOL simulation, with the work divided among the threads. This will include support for all run modes. Start with testing OUTPUT_ASCII on a small file with just one round. Once that is working, test OUTPUT_VISI to be sure things are working correctly.

    6. Compute all rounds of the gol simulation, avoiding race conditions.

    7. Count the number of live cells in a multi-threaded way, avoiding race conditions.

    8. Check timing for the multi-threaded version in OUTPUT_NONE mode. You should see performance improvements with multi-threaded vs. sequential for large boards.

    9. Make sure all other requirements have been met.

9. Tips

  • Look at the Lab 6 page for lots of tips related to the sequential version that may be useful here too.

  • Implement and test incrementally!

  • Most of your changes will be to the play_gol function. You should also add helper functions to implement any new functionality. main should not get much longer, so if you need to add a lot of code to main, or you are adding the same code to multiple parts, put it in a function and call it from main. Plus you can independently test this code by adding debugging calls to this function from main!

  • You will need to add fields to your struct gol_data type that are needed by threads to perform gol on their part of the grid (see the sketch at the end of this section).

  • Use gdb as you incrementally implement and test to find and fix bugs as you go. Use valgrind as you go to catch memory access errors as you make them.

  • As you test and debug, you may find it useful to use config files with small numbers of iterations and to comment out the call to system("clear") so you can examine the results of every intermediate iteration.

  • Running in OUTPUT_ASCII and OUTPUT_VISI modes is helpful for finding and debugging partitioning errors and other correctness errors.

  • ParaVis with colors.h. If you’d like to use some of the colors I have defined in colors.h you can set an RGB value in the color3 image array to a value in colors like this:

       // assume buff is name of the color3 array:
       buff[buff_i] = colors[3];  // set this pixel to the 3rd color (Yellow)

    If you want to use the colors to help with debugging row and column partitioning you can create the rainbow pattern of my examples by having each thread be assigned a color based on its thread id, and have it print its color for 0 (or 1) cell values in its partition. For example, if data->tid stores the value of a thread’s logical id, you could do something like this:

       if(live) {
         buff[buff_i] = c3_black;  // set live cells to black
       } else {
         // assumes your thread id is stored in your struct as tid;
         // modify the line below if you called it something different.
         buff[buff_i] = colors[((data->tid)%8)]; // dead to my tid color
       }

    Another thing to remember about the ParaVis display is that it displays coordinate (0,0) at the bottom left of the visualization, and coordinate (rows-1, 0) at the top left. This means you will need to perform some calculation on the row index into the color3 array to "flip" the visualization top-to-bottom to match your view of it (so the ASCII output and the ParaVis display show the world in the same way for the same input file).

  • The man pages for the various pthread functions are very helpful.

  • You can use the perror() function to print out error messages from failed system and pthread library calls. perror prints out detailed, human-readable information about the error that occurred. You call it by passing in a string with a program-specific error message. For example:

    big_buff = malloc(...);
    if (big_buff == NULL) {
      perror("malloc of big buffer failed");
      exit(1);
    }

    See the man page for perror for more information and for the include files you need to add to your program to use it.

  • You will need to use some synchronization in your solution. You may use any of the pthread synchronization primitives, including mutexes and barriers. Remember that before using any of these, you need to initialize them. You also need to make sure that all threads have access to the shared synchronization variables (either pass a reference to them at thread creation or declare them as static global variables). Here are some code snippets for using the barrier synchronization primitive:

    // declare a global pthread_barrier_t
    static pthread_barrier_t barrier;
    
    // initialize it in a function (e.g., main) before using it:
    if(pthread_barrier_init(&barrier, NULL, num_threads)){
       perror("pthread_barrier_init");
       exit(1);
    }
    
    // threads then call pthread_barrier_wait to synchronize:
    ret = pthread_barrier_wait(&barrier);
    if(ret != 0 && ret != PTHREAD_BARRIER_SERIAL_THREAD) {
      perror("pthread_barrier_wait");
      exit(1);
    }
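
  • To tie several of these tips together, here is a sketch of the kinds of fields you might add to struct gol_data for the parallel version (the names and the exact set of fields are only examples; choose what fits your design):

    // example only: possible parallelization fields added to struct gol_data
    struct gol_data {
        // ... existing sequential fields (board, rows, cols, iters, ...) ...
        int tid;                  // this thread's logical id (0 .. num_threads-1)
        int num_threads;          // total number of worker threads
        int start_row, end_row;   // this thread's rows (inclusive)
        int start_col, end_col;   // this thread's columns (inclusive)
        int partition;            // 0: row-wise, 1: column-wise
        int print_partition;      // non-zero: print partitioning info
    };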

10. Submitting

Here are the commands to submit your solution (from your or your partner’s cs31/labs/Lab9-userID1-userID2 directory):

$ git add gol.c *.txt
$ git commit -m "Lab 9 completed"
$ git push

And of course, if you want to submit any nifty starting point world config files you come up with, please do:

$ git add coolworld.txt
$ git commit -m "cool world"
$ git push

If you have difficulty pushing your changes, see the "Troubleshooting" section and "can’t push" sections at the end of the Using Git for CS31 Labs page. And for more information and help with using git, see the git help page.

11. Handy Resources