Lab 2: Pthreads and Scalability Analysis

Code Due: Thursday Feb 2nd before 1 am (late Wednesday night)
Written Report Due: a hardcopy is due before noon on Friday Feb 3.

Lab 2 Partners
  Sam White and Chloe Stevens
  Niels Verosky and Jordan Singleton
  Phil Koonce and Luis Ramirez
  Steven Hwang and Ames Bielenberg
  Nick Felt and Kyle Erf
  Katherine Bertaut and Elliot Weiser
See the git howto for information about how you can set up a git repository for your joint lab 2 project.

Contents:

Introduction
Implementation Details and Requirements
Scalability Experiments and Report
Useful C, Unix, and Pthreads Utilities
What to Hand in

Introduction

This lab is designed to give you practice writing and debugging multithreaded C (or C++) Pthreads programs and using synchronization primitives, and experience designing and running scalability experiments.

For this assignment you are going to implement parallel matrix multiply using Pthreads and evaluate the scalability of your implementation as you increase the problem size and the number of threads. Matrix multiply is an example of a parallel kernel--something that is not typically a complete stand-alone application, but is a common computation pattern that occurs in many numeric applications. Efficient parallel matrix multiply can be used to greatly improve the performance of a large set of real-world problems.

Your implementation will take command line arguments for the N and M dimensions of the first matrix, the number of threads, and the number of iterations of matrix multiply (how many times you will do AxB=C...and yes, it re-does the exact same computation each iteration). For example:

./matrixmult -n 1024 -m 512 -t 4 -i 10
This will multiply matrix A of size 1024x512 by matrix B of size 512x1024 to obtain the resulting C of size 1024x1024, and will repeat this multiplication 10 times. The reason for multiple iterations is to get longer runs so that you can obtain more meaningful timing results, and also to give you some practice using Pthread synchronization primitives: no thread should start the next round of matrix multiply until all threads have finished with the current round.
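
One way to picture that per-round synchronization: each thread does its share of the multiply, then waits for all the others before starting the next round. The sketch below uses pthread_barrier_t just to illustrate the required behavior (you could instead build equivalent synchronization out of mutexes and condition variables; my_multiply here is a hypothetical helper, not a required function):

    #include <pthread.h>

    void my_multiply(void);    // hypothetical: this thread's share of AxB=C

    pthread_barrier_t barrier; // in main, before creating threads:
                               //   pthread_barrier_init(&barrier, NULL, num_threads);

    // each thread runs this loop; no thread starts round i+1 until
    // every thread has reached the barrier at the end of round i
    void thread_loop(int iters) {
        int i;
        for (i = 0; i < iters; i++) {
            my_multiply();
            pthread_barrier_wait(&barrier);
        }
    }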

The sequential matrix-multiply algorithm, in C, is:

// given matrices A, B, and C of the following dimensions:
//   A[N][M]   N rows and M columns
//   B[M][N]   M rows and N columns
//   C[N][N]   N rows and N columns

// compute C = AxB
for (i = 0; i < N; i++) {        // for each row in C
  for (j = 0; j < N; j++) {      // for each column in C
    // compute the value of C[i][j]
    val = 0.0;
    for (k = 0; k < M; k++) {    // num elms in row i of A and col j of B
      val += A[i][k] * B[k][j];
    }
    C[i][j] = val;
  }
}

To parallelize matrix multiply, assign each thread some portion of the computation. For example, each thread could be assigned a subset of C's rows to calculate, or a subset of C's columns, or blocks of C, or ...
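
For instance, a row-wise version might look something like the following sketch (the struct and field names are illustrative only, not a required interface, and it assumes matrices are allocated as arrays of row pointers):

    #include <pthread.h>

    typedef struct {
        int id;           // this thread's logical id, 0 to num_threads-1
        int num_threads;  // total number of worker threads
        int n, m;         // matrix dimensions
        double **A, **B, **C;
    } thread_args_t;

    void *worker(void *arg) {
        thread_args_t *t = (thread_args_t *)arg;
        // divide N rows as evenly as possible over the threads: the
        // first (n % num_threads) threads each take one extra row
        int chunk = t->n / t->num_threads;
        int extra = t->n % t->num_threads;
        int start = t->id * chunk + (t->id < extra ? t->id : extra);
        int end   = start + chunk + (t->id < extra ? 1 : 0);

        for (int i = start; i < end; i++) {
            for (int j = 0; j < t->n; j++) {
                double val = 0.0;
                for (int k = 0; k < t->m; k++) {
                    val += t->A[i][k] * t->B[k][j];
                }
                t->C[i][j] = val;
            }
        }
        return NULL;
    }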

Implementation Details and Requirements

Start by setting up a git repository for you and your lab 2 partner.

You can also copy over my simple Pthreads example into your lab 2 repository to use as a starting point for your lab 2 solution if you'd like:

cp ~newhall/public/cs87/pthreads_example/* .

Command line options

Your solution should support multiple command line arguments for specifying the size of the arrays, the number of threads, the number of iterations, and, optionally, a column-partitioning scheme:
 usage:
    ./matrixmult -n n_dim  -m m_dim -t num_tids -i iters [-c]
       -n  n_dim       number of rows in A
       -m  m_dim       number of cols in A
       -t  num_tids    number of threads
       -i  iters       number of iterations
       -c              optional arg to parallelize by columns of C (default is by rows)
       -h              print out this help message
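
One way to parse these flags is with getopt; here is a minimal sketch (the variable names are just illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
        int n = 0, m = 0, num_threads = 1, iters = 1, by_cols = 0;
        int opt;
        while ((opt = getopt(argc, argv, "n:m:t:i:ch")) != -1) {
            switch (opt) {
                case 'n': n = atoi(optarg);           break;
                case 'm': m = atoi(optarg);           break;
                case 't': num_threads = atoi(optarg); break;
                case 'i': iters = atoi(optarg);       break;
                case 'c': by_cols = 1;                break;
                case 'h':
                default:
                    printf("usage: ./matrixmult -n n_dim -m m_dim "
                           "-t num_tids -i iters [-c]\n");
                    exit(1);
            }
        }
        // ... check that n, m, num_threads, and iters are sane,
        // allocate matrices, create threads, run, clean up ...
        return 0;
    }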

Requirements

Scalability Experiments and Report

Once you have a working solution, you will evaluate its scalability. Design and run experiments that answer questions about the scalability of your solution. Run experiments with your two schemes for partitioning the result matrix over the set of threads, varying:
  1. The matrix sizes. Try different powers of two (64, 128, 256, 512, 1024, 2048, ...). You do not have to try every power of two between your min and max sizes, but run several intermediate sizes between your smallest-sized and largest-sized matrices.
  2. The number of threads. Again, increase by powers of two (1, 2, 4, 8, 16, 32, ...).
  3. The number of iterations (try a couple of different values that show differences across runs). You should run each experiment for at least 2 iterations so that there is some synchronization in your implementation, but try more iterations as well to see if, and how, the added synchronization steps affect scalability.
When running scalability studies you need to make sure that your problem sizes are large enough to produce fairly long run times for at least some numbers of threads. For example, if the single-threaded run takes 1.3 seconds and the 16-thread run takes 1.2, it is pretty difficult to draw any conclusions about scalability by comparing two such small runtimes. Instead, you want some runs that take many seconds to many minutes.
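
One simple way to get these timings from inside your program is gettimeofday; below is a sketch of what the timing code in main might look like (time only the multiply loop, not allocation and initialization; num_threads and iters are assumed to come from your command line parsing):

    #include <stdio.h>
    #include <sys/time.h>

    // in main, after all setup is done:
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    // ... create threads, run all iterations, join threads ...
    gettimeofday(&stop, NULL);
    double secs = (stop.tv_sec - start.tv_sec)
                + (stop.tv_usec - start.tv_usec) / 1000000.0;
    printf("%d threads, %d iterations: %g seconds\n", num_threads, iters, secs);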

In addition to evaluating the scalability of your solution, test whether the row-wise versus column-wise assignment has any effect on performance.

As you run experiments, make sure you are doing so in a way that doesn't interfere with others (see the Useful Unix Utilities section below for some tips about how to ensure this). Also, remember to remove all output statements from the code you time.

You should do multiple runs of each experiment (don't just do a single timed run of 16 threads on 512x512 with 10 iterations, for example). The purpose of multiple runs is to determine how consistent your runs are (look at the standard deviation across runs), and also to spot odd outliers (which may indicate that something else running on the computer interfered with a run, and that you should discard that result).
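
If you'd rather compute these statistics in C than in a spreadsheet, the mean and sample standard deviation over a set of run times take only a few lines (a sketch, assuming the times have already been collected into an array):

    #include <math.h>

    double mean(double *times, int num_runs) {
        double sum = 0.0;
        for (int i = 0; i < num_runs; i++)
            sum += times[i];
        return sum / num_runs;
    }

    double stddev(double *times, int num_runs) {
        double mu = mean(times, num_runs), sum = 0.0;
        for (int i = 0; i < num_runs; i++)
            sum += (times[i] - mu) * (times[i] - mu);
        return sqrt(sum / (num_runs - 1));  // sample standard deviation
    }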

You should be careful to control the experimental runs as much as possible. If other people are using the machine you are running experiments on, their programs can interfere with your results. You can see what is running on a machine with the top command.

Also, make sure that your matrices are not too large to fit in RAM. I don't think this will really be a problem, but double-check a run with your largest sizes before running experiments. To see if the system is swapping, run:

watch -n 1 cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/sda5                               partition       2072344 0       -1
If you notice the Used value going above 0, your matrices are too big to fit into RAM (or there are too many people running on this machine, and you need to find an idle machine).
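
As a quick sanity check on sizes: assuming 8-byte doubles, a 4096x4096 matrix takes 4096 x 4096 x 8 bytes, or about 134 MB, so A, B, and C together for a -n 4096 -m 4096 run need roughly 400 MB. Sizes would have to get considerably larger than that before swapping becomes a concern on a typical lab machine.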

Machine Info

The Lab Machine specs page contains information about most of the lab machines, including the number of CPUs (click on the Processors link). We have machines with 4, 8, and 16 CPUs. I suggest picking one with 16 cores (x16) for your final experiments, but try out others during development.

As much as possible, it is good to run all your experiments on the same machine, or at least identical machines.

Written Report

You should write a short report (no more than 3 pages) that describes the results of your experimentation. It should include the following sections:
  1. A brief description of how you implemented the row-wise and column-wise thread assignment.
  2. A description of the experiments you ran: what you varied, what machine(s) you ran on, and how many runs of each experiment you did. Also, briefly describe what you expected the outcome to be and why. It is fine if your expected outcome was different from what your experimental results show.
  3. Experimental results: present your results AND describe what they show. You can use tables or graphs to present the data. Choose quality over quantity in the data you present. A couple of tables with data showing scalability results in terms of number of threads and problem size are fine. It is also okay to present and discuss negative results..."we thought the X experiment would be better because..., but as shown in table 2, the Y experiment performed better. This is (or we think this is) because ...". There is, however, a difference between negative results (well-designed experiments that produce unexpected results) and bad results (results from poorly designed experiments).
  4. Conclusions: What did you learn from your experiments? What do they say about the scalability of your solution? Did they match your expectations? If not, do you have an idea of why not? Did the row-wise and column-wise versions perform differently? Explain why you think they did or did not.

Useful C, Unix, and Pthreads Utilities

What to Hand in

Submit a single tar file with the following contents using cs87handin (see Unix Tools for more information on script, dos2unix, make, and tar):

  1. A README file with:
    1. Your name and your partner's name
    2. If you have not fully implemented some functionality, then list the parts that work (and how to test them if it is not obvious) so that you can be sure to receive credit for the parts you do have working.

  2. All the source files needed to compile, run and test your code (Makefile, .c files, .h files, and optional test scripts). Please do not submit object (.o) or executable files (I can build these using your Makefile).

Turn in a hard copy of your written report by the written report deadline listed at the top of this page (before noon on Friday, Feb 3).