Lab 2: Pthreads and Scalability Analysis

Code Due: Thursday Feb 2nd before 1 am (late Wednesday night)
Written Report Due: a hardcopy is due before noon on Friday Feb 3.

Lab 2 Partners
  Sam White and Chloe Stevens
  Niels Verosky and Jordan Singleton
  Phil Koonce and Luis Ramirez
  Steven Hwang and Ames Bielenberg
  Nick Felt and Kyle Erf
  Katherine Bertaut and Elliot Weiser
See the git howto for information about how you can set up a git repository for your joint lab 2 project.

Contents:

Introduction
Implementation Details and Requirements
Scalability Experiments and Report
Useful C, Unix, and Pthreads Utilities
What to Hand in

Introduction

This lab is designed to give you practice writing and debugging multithreaded C (or C++) Pthreads programs and using synchronization primitives, and experience designing and running scalability experiments.

For this assignment you are going to implement parallel matrix multiply using Pthreads and evaluate the scalability of your implementation as you increase the problem size and the number of threads. Matrix multiply is an example of a parallel kernel--something that is not typically a complete stand-alone application, but is a common computation pattern that occurs in many numeric applications. Efficient parallel matrix multiply can be used to greatly improve the performance of a large set of real-world problems.

Your implementation will take command line arguments for the N and M dimensions of the first matrix, the number of threads, and the number of iterations of matrix multiply (how many times you will do AxB=C...and yes, it re-does the exact same computation each iteration). For example:

./matrixmult -n 1024 -m 512 -t 4 -i 10
This will multiply matrix A of size 1024x512 by matrix B of size 512x1024 to obtain the resulting C of size 1024x1024, and will repeat this multiplication 10 times. The reason for multiple iterations is to get longer runs so that you can obtain more meaningful timing results, and also to give you some practice using Pthread synchronization primitives: no thread should start the next round of matrix multiply until all threads have finished with the current round.
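
One way to picture that per-round synchronization: each thread does its share of the multiply, then waits for all the others before starting the next round. The sketch below uses pthread_barrier_t just to illustrate the required behavior (you could instead build equivalent synchronization out of mutexes and condition variables; my_multiply here is a hypothetical helper, not a required function):

    #include <pthread.h>

    void my_multiply(void);    // hypothetical: this thread's share of AxB=C

    pthread_barrier_t barrier; // in main, before creating threads:
                               //   pthread_barrier_init(&barrier, NULL, num_threads);

    // each thread runs this loop; no thread starts round i+1 until
    // every thread has reached the barrier at the end of round i
    void thread_loop(int iters) {
        int i;
        for (i = 0; i < iters; i++) {
            my_multiply();
            pthread_barrier_wait(&barrier);
        }
    }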

The sequential matrix-multiply algorithm, in C, is:

// given matrices A, B, and C of the following dimensions:
//   A[N][M]   N rows and M columns
//   B[M][N]   M rows and N columns
//   C[N][N]   N rows and N columns

// compute C = AxB
for (i = 0; i < N; i++) {        // for each row in C
  for (j = 0; j < N; j++) {      // for each column in C
    // compute the value of C[i][j]
    val = 0.0;
    for (k = 0; k < M; k++) {    // num elms in row i of A and col j of B
      val += A[i][k] * B[k][j];
    }
    C[i][j] = val;
  }
}

To parallelize matrix multiply, assign each thread some portion of the computation. For example, each thread could be assigned a subset of C's rows to calculate, or a subset of C's columns, or blocks of C, or ...
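
For instance, a row-wise version might look something like the following sketch (the struct and field names are illustrative only, not a required interface, and it assumes matrices are allocated as arrays of row pointers):

    #include <pthread.h>

    typedef struct {
        int id;           // this thread's logical id, 0 to num_threads-1
        int num_threads;  // total number of worker threads
        int n, m;         // matrix dimensions
        double **A, **B, **C;
    } thread_args_t;

    void *worker(void *arg) {
        thread_args_t *t = (thread_args_t *)arg;
        // divide N rows as evenly as possible over the threads: the
        // first (n % num_threads) threads each take one extra row
        int chunk = t->n / t->num_threads;
        int extra = t->n % t->num_threads;
        int start = t->id * chunk + (t->id < extra ? t->id : extra);
        int end   = start + chunk + (t->id < extra ? 1 : 0);

        for (int i = start; i < end; i++) {
            for (int j = 0; j < t->n; j++) {
                double val = 0.0;
                for (int k = 0; k < t->m; k++) {
                    val += t->A[i][k] * t->B[k][j];
                }
                t->C[i][j] = val;
            }
        }
        return NULL;
    }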

Implementation Details and Requirements

Start by setting up a git repository for you and your lab 2 partner.

You can also copy over my simple Pthreads example into your lab 2 repository to use as a starting point for your lab 2 solution if you'd like:

cp ~newhall/public/cs87/pthreads_example/* .

Command line options

Your solution should support multiple command line arguments for specifying the size of the arrays, the number of threads, the number of iterations, and, optionally, a column-partitioning scheme:
 usage:
    ./matrixmult -n n_dim  -m m_dim -t num_tids -i iters [-c]
       -n  n_dim       number of rows in A
       -m  m_dim       number of cols in A
       -t  num_tids    number of threads
       -i  iters       number of iterations
       -c              optional arg to parallelize by columns of C (default is by rows)
       -h              print out this help message
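
One way to parse these flags is with getopt; here is a minimal sketch (the variable names are just illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
        int n = 0, m = 0, num_threads = 1, iters = 1, by_cols = 0;
        int opt;
        while ((opt = getopt(argc, argv, "n:m:t:i:ch")) != -1) {
            switch (opt) {
                case 'n': n = atoi(optarg);           break;
                case 'm': m = atoi(optarg);           break;
                case 't': num_threads = atoi(optarg); break;
                case 'i': iters = atoi(optarg);       break;
                case 'c': by_cols = 1;                break;
                case 'h':
                default:
                    printf("usage: ./matrixmult -n n_dim -m m_dim "
                           "-t num_tids -i iters [-c]\n");
                    exit(1);
            }
        }
        // ... check that n, m, num_threads, and iters are sane,
        // allocate matrices, create threads, run, clean up ...
        return 0;
    }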

Requirements

Scalability Experiments and Report

Once you have a working solution, you will evaluate its scalability. Design and run experiments that answer questions about the scalability of your solution. Run experiments with your two schemes for partitioning the result matrix over the set of threads, varying:
  1. The matrix sizes. Try different powers of two (64, 128, 256, 512, 1024, 2048, ...). You do not have to try every power of two between your min and max sizes, but run several intermediate sizes between your smallest-sized and largest-sized matrices.
  2. The number of threads. Again, increase by powers of two (1, 2, 4, 8, 16, 32, ...).
  3. The number of iterations (try a couple of different values that show differences across runs). You should run each experiment for at least 2 iterations so that there is some synchronization in your implementation, but try more iterations as well to see if, and how, the added synchronization steps affect scalability.
When running scalability studies you need to make sure that your problem sizes are large enough to produce fairly long run times for at least some numbers of threads. For example, if the single-threaded run takes 1.3 seconds and the 16-thread run takes 1.2, it is pretty difficult to draw any conclusions about scalability by comparing two such small runtimes. Instead, you want some runs that take many seconds to many minutes.
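
One simple way to get these timings from inside your program is gettimeofday; below is a sketch of what the timing code in main might look like (time only the multiply loop, not allocation and initialization; num_threads and iters are assumed to come from your command line parsing):

    #include <stdio.h>
    #include <sys/time.h>

    // in main, after all setup is done:
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    // ... create threads, run all iterations, join threads ...
    gettimeofday(&stop, NULL);
    double secs = (stop.tv_sec - start.tv_sec)
                + (stop.tv_usec - start.tv_usec) / 1000000.0;
    printf("%d threads, %d iterations: %g seconds\n", num_threads, iters, secs);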

In addition to evaluating the scalability of your solution, test whether the row-wise versus column-wise assignment has any effect on performance.

As you run experiments, make sure you are doing so in a way that doesn't interfere with others (see the Useful Unix Utilities section below for some tips about how to ensure this). Also, remember to remove all output statements from the code you time.

You should do multiple runs of each experiment (don't just do a single timed run of 16 threads on 512x512 with 10 iterations, for example). The purpose of multiple runs is to determine how consistent your runs are (look at the standard deviation across runs), and also to spot odd outliers (which may indicate that something else running on the computer interfered with a run, and that you should discard that result).
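
If you'd rather compute these statistics in C than in a spreadsheet, the mean and sample standard deviation over a set of run times take only a few lines (a sketch, assuming the times have already been collected into an array):

    #include <math.h>

    double mean(double *times, int num_runs) {
        double sum = 0.0;
        for (int i = 0; i < num_runs; i++)
            sum += times[i];
        return sum / num_runs;
    }

    double stddev(double *times, int num_runs) {
        double mu = mean(times, num_runs), sum = 0.0;
        for (int i = 0; i < num_runs; i++)
            sum += (times[i] - mu) * (times[i] - mu);
        return sqrt(sum / (num_runs - 1));  // sample standard deviation
    }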

You should be careful to control the experimental runs as much as possible. If other people are using the machine you are running experiments on, their programs can interfere with your results. You can see what is running on a machine with the top command.

Also, make sure that your matrices are not too large to fit in RAM. I don't think this will really be a problem, but double-check a run with your largest sizes before running experiments. To see if the system is swapping, run:

watch -n 1 cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/sda5                               partition       2072344 0       -1
If you notice the Used value going above 0, your matrices are too big to fit into RAM (or there are too many people running on this machine, and you need to find an idle machine).
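
As a quick sanity check on sizes: assuming 8-byte doubles, a 4096x4096 matrix takes 4096 x 4096 x 8 bytes, or about 134 MB, so A, B, and C together for a -n 4096 -m 4096 run need roughly 400 MB. Sizes would have to get considerably larger than that before swapping becomes a concern on a typical lab machine.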

Machine Info

The Lab Machine specs page contains information about most of the lab machines, including the number of CPUs (click on the Processors link). We have machines with 4, 8, and 16 CPUs. I suggest picking one with 16 cores (x16) for your final experiments, but try out others during development.

As much as possible, it is good to run all your experiments on the same machine, or at least identical machines.

Written Report

You should write a short report (no more than 3 pages) that describes the results of your experimentation. It should include the following sections:
  1. A brief description of how you implemented the row-wise and column-wise thread assignment.
  2. A description of the experiments you ran: what you varied, what machine(s) you ran on, and how many runs of each experiment you did. Also, briefly describe what you expected the outcome to be and why. It is fine if your expected outcome was different from what your experimental results show.
  3. Experimental results: present your results AND describe what they show. You can use tables or graphs to present the data. Choose quality over quantity in the data you present. A couple of tables with data showing scalability results in terms of number of threads and problem size are fine. It is also okay to present and discuss negative results..."we thought the X experiment would be better because..., but as shown in table 2, the Y experiment performed better. This is (or we think this is) because ...". There is, however, a difference between negative results (well-designed experiments that produce unexpected results) and bad results (results from poorly designed experiments).
  4. Conclusions: What did you learn from your experiments? What do they say about the scalability of your solution? Did they match your expectations? If not, do you have an idea of why not? Did the row-wise and column-wise versions perform differently? Explain why you think they did or did not.

Useful C, Unix, and Pthreads Utilities

What to Hand in

Submit a single tar file with the following contents using cs87handin (see Unix Tools for more information on script, dos2unix, make, and tar):

  1. A README file with:
    1. Your name and your partner's name
    2. If you have not fully implemented some functionality, then list the parts that work (and how to test them if it is not obvious) so that you can be sure to receive credit for the parts you do have working.

  2. All the source files needed to compile, run and test your code (Makefile, .c files, .h files, and optional test scripts). Please do not submit object (.o) or executable files (I can build these using your Makefile).

Turn in a hard copy of your written report by the written report deadline listed at the top of this page (before noon on Friday, Feb 3).