CS87: Mini Lab 3

You will work on this mini lab with your Lab2 partner.

A Mini Lab is one that I anticipate you can complete in a couple of hours; you should finish, or be close to finishing, by the end of a Thursday lab session. The purpose of a Mini Lab is to introduce you to a parallel or distributed programming language/utility without having you solve a larger problem using the language/utility.

I give you only about 24 hours to complete a mini-lab because I want you to stop working on it and get back to focusing your effort on the regular lab assignment.

If you don't get a mini lab fully working, just submit what you tried. If you don't submit a solution, it is not a big deal. Mini Labs do not count very much towards your final grade, and not nearly as much as regular Labs; they are mini. If you submit a solution to Lab3, make sure to add the submission string to the README.md file and push it with your solution (see Submit instructions).

The starting point code contains a complete implementation of sequential Matrix Multiply. The code runs some iterations of multiplying two matrices together. This is an example of a kernel benchmark program: it is likely not so useful as a stand-alone program, but instead implements a common sub-operation that might be part of larger parallel programs.

Your job in this lab is to use OpenMP to parallelize the code.

OpenMP

You do not need to learn an enormous amount of OpenMP to solve this problem. You will need to use the #pragma omp parallel to fork a set of threads to do something in parallel, and you will want to add a parallel for loop and maybe some synchronization.
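As a minimal sketch of those two pieces, the function below forks a team of threads with `#pragma omp parallel` and divides the outer loop among them with `#pragma omp for`. The function name `matmul_omp`, the square n x n shape, and the flat row-major arrays are assumptions made for illustration; they are not taken from matrixmult.c, which may store its matrices differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from matrixmult.c): computes C = A * B for
 * n x n matrices stored in row-major order in flat arrays. */
void matmul_omp(const double *A, const double *B, double *C, int n)
{
    assert(A != NULL && B != NULL && C != NULL && n > 0);

    /* Fork a team of threads; they join at the closing brace. */
    #pragma omp parallel
    {
        /* Divide the iterations of the i loop among the threads. */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;   /* declared in the loop body, so private */
                for (int k = 0; k < n; k++) {
                    sum += A[i * n + k] * B[k * n + j];
                }
                C[i * n + j] = sum; /* threads write disjoint rows of C */
            }
        }
    }
}
```

With gcc, OpenMP pragmas take effect only when you compile with `-fopenmp`; without that flag the pragmas are ignored and the same code runs sequentially. Because each thread writes a different set of rows of C, this particular loop needs no extra synchronization.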

You should be careful to stick with the fork-join, fork-join, fork-join model of OpenMP; don't do things in the parallel parts that are really not parallel or you will get some weird/unexpected behavior. Do not try to "optimize" your code by reducing fork-join blocks. You should, however, think about minimizing other parallel overheads as you design a solution; your goal is a solution designed such that there is a performance improvement from parallelization. If your 1 thread execution wins out over the multi-thread ones, think about how you can remove some parallel overhead (think of space/time trade-offs, think about synchronization costs, ...). Make sure you are comparing runs for large enough problem sizes (N and M) with enough iterations.
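One way to think about synchronization cost is sketched below: when threads must combine results into a shared variable, each thread can accumulate into a private local first and enter the critical section only once, instead of synchronizing on every loop iteration. The function `matrix_total` and its flat row-major argument are hypothetical and exist only to illustrate the idea; matrix multiply itself may not need any shared accumulator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum all entries of an n x n matrix stored in a
 * flat row-major array. */
double matrix_total(const double *M, int n)
{
    assert(M != NULL && n > 0);
    double total = 0.0;             /* shared by all threads */

    #pragma omp parallel
    {
        double local = 0.0;         /* private: declared inside the region */

        #pragma omp for
        for (int i = 0; i < n * n; i++) {
            local += M[i];          /* no synchronization needed here */
        }

        /* One synchronized update per thread, not one per element. */
        #pragma omp critical
        total += local;
    }
    return total;
}
```

The trade-off is a little extra space (one local per thread) for far fewer synchronized updates. The region is still a single fork-join block, so it stays within the model described above.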

I encourage you to try different partitioning of all or some of the matrices and see if you get different timed results. For example, see if you can partition one or more matrices by rows or by columns across threads:

row                                  column
---                                  ------
1 1 1 1 1 1 1 1                      1 1 2 2 3 3 4 4 
1 1 1 1 1 1 1 1                      1 1 2 2 3 3 4 4 
2 2 2 2 2 2 2 2                      1 1 2 2 3 3 4 4 
2 2 2 2 2 2 2 2                      1 1 2 2 3 3 4 4 
3 3 3 3 3 3 3 3                      1 1 2 2 3 3 4 4 
3 3 3 3 3 3 3 3                      1 1 2 2 3 3 4 4 
4 4 4 4 4 4 4 4                      1 1 2 2 3 3 4 4 
4 4 4 4 4 4 4 4                      1 1 2 2 3 3 4 4
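The two layouts in the diagram can be sketched by choosing which loop OpenMP divides among threads. In this hedged example, `matmul_by_rows` splits the row loop and `matmul_by_cols` splits the column loop of the result matrix. Both function names and the flat row-major n x n layout are assumptions for illustration, not the starting point code. `schedule(static)` with no chunk size typically gives each thread one contiguous block of iterations, which roughly matches the bands shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Row-wise: each thread computes a contiguous band of rows of C. */
void matmul_by_rows(const double *A, const double *B, double *C, int n)
{
    assert(A != NULL && B != NULL && C != NULL && n > 0);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

/* Column-wise: each thread computes a contiguous band of columns of C. */
void matmul_by_cols(const double *A, const double *B, double *C, int n)
{
    assert(A != NULL && B != NULL && C != NULL && n > 0);
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

Both versions produce the same result; what differs is which elements each thread reads and writes, and that can change cache behavior and therefore the timed results.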

Starting Point Code and Tips for Getting Started

Get your Lab03 ssh-URL from the GitHub server for our class: CS87-s18
On the CS system, cd into your cs87/labs subdirectory
Clone a local copy of your shared repo in your private cs87/labs subdirectory:
```
cd cs87/labs
git clone [your_Lab03_URL]
```
Then cd into your Lab03-you subdirectory.

If all was successful, you should see the following files when you run ls:

Makefile README.md matrixmult.c

If this didn't work, or for more detailed instructions on git see: the Using git page.

Starting Point files

Makefile: builds both openMP parallel version and sequential executables (mm_par, mm_seq).
README.md: see notes about #defines and sizes
matrixmult.c: full sequential implementation of matrix multiply, ready for you to add openMP directives to parallelize it.

Getting Started

I suggest first trying out my simple openMP examples in my public directory:
```
cp  -r ~newhall/public/cs87/openMP_examples .
```
Then take a look at the matrixmult.c file, then try compiling and running it to understand what it does.
Then try to add in some openMP code to parallelize parts of the matrix multiply program.

With the starting point code, the sizes of N and M are tiny and the DEBUG definition is on. This will print out matrices and debug info as the code runs. Once you have something working, comment out DEBUG and make N and M big and try some timed runs to see if you get performance improvements with your parallel solutions. For example:

time ./mm_par 1000 0
time ./mm_seq 1000 0

Note: these executables take at least two command line options: the first is the number of iterations, and the second specifies row-wise or column-wise partitioning. An optional third option gives a partitioning block size. The row/column-wise and block-size options are there if you want to use them; you don't have to. They are in the starting point code only to give you a few more command line options that you can use if you'd like.

Useful Functions and Resources

Try out my simple openMP examples, you can copy them over from here:
```
cp -r ~newhall/public/openMP_examples .
```
OpenMP links: the tutorial is a good place to start.
top -H: get system usage info on a per-thread basis.
Lab Machine specs page contains information about most of the lab machines, including number of cores.
My help pages

Submit

Before the Due date:

At the top of the README.md file add the following line if you are submitting a solution to this lab (and cut and paste exactly this line):
```
@@@@@ WE ARE SUBMITTING THIS FOR GRADING:  your names
```

Push your changes to GitHub:

git add README.md
git add matrixmult.c
git commit
git push

If you have git problems, take a look at the "Troubleshooting" section of the Using git page.
