1. Due Date

Due Wednesday, March 25 before 11:59pm

2. Overview and Goals

This is a continuation of your Lab 4 assignment that involves doing some large runs of your program on 2 or 3 different systems. Its goal is to give you practice running on different systems and practice running some large experiments.

You should spend no more than 5 hours total of your time to complete this part (really, do not spend much more than that on this… just stop; it is not worth it). I’m giving you a long time to complete it because it involves running some large experiments on our system over spring break and some runs on XSEDE either over break or right when you get back. The runs may take a long time, but you can start or submit them and come back later to see the results.

With your Lab 4 partner, you will do some large runs of your MPI odd-even sort on the CS system and on the XSEDE SDSC Comet cluster (use the compute queue for longer runs after testing on the debug queue first), and submit some output results (do not do any runs that have debug printf stmts: your run should have no output other than timing results and possibly printing out the initial size N and the number of processes P).

You are required to do some (timed) large runs on:

  • CS lab machines (run these over break when few people are using the machines)

  • XSEDE SDSC Comet cluster

Additionally, you are encouraged to do some large runs on Swarthmore’s Strelka Cluster. In fact, I encourage you to try out some runs on Strelka before XSEDE to get some practice with sbatch and slurm scripts.

Goals:

  • Learn the vim editor, which you can use when ssh’ed into CS or XSEDE systems (plus you should just know vim as a good, always-available text editor).

  • Practice with ssh and scp

  • Do some large runs on CS lab machines and XSEDE

  • Practice with slurm and with remote cluster systems running slurm

3. "Huge" runs on the CS system

I want you to do this over break when the CS machines are mostly idle.

Run some large runs (lots of hosts, large -np values, large size N) on CS lab machines. In particular, do some runs that distribute processes over a large number of nodes and sort a fairly large size N.

  • First test out some large runs before break, and make sure your program doesn’t have deadlock.

  • Next, also before break, write a run script for some big experiments (a sketch of one appears after this list).

    • And test out your script (you can comment out the actual run calls to just check the script’s variable values and output).

    • Try your largest-sized run (in terms of N and P on a single node) and make sure that its execution does not exceed physical RAM (check with htop); scale it back a good bit if it does.

  • Finally, sometime over break (sometime after Sunday, when we reboot all lab machines), start your experiments running (ssh in to start them, or use at or cron to run them), and check back after they should be done to verify that they completed. This should be a quick 10 minutes of your time (ssh in, check that hosts are up, start your script in a screen or tmux session, then ssh in later to see that it finished). If there are reasons why neither you nor your partner can do this over break, let me know.
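Here is a minimal sketch of what such a run script might look like. The program name, its size argument, and the specific -np and size values are placeholder assumptions; adapt them to your own oddevensort command line and hostfile:

#!/bin/bash
# run_experiments.sh: sketch of a big-experiment script (values are examples)
HOSTFILE=hostfile87     # a hostfile you generated (see below)

for np in 32 64 128; do
  for size in 1000000 8000000; do
    echo "==== run: P=$np, N=$size ===="
    # comment out the mpirun line below to dry-run the script and
    # just check its variable values and output
    mpirun -np $np --hostfile $HOSTFILE ./oddevensort $size
    # memory sanity check: processes-per-node * size * bytes-per-element
    # must fit in a node's physical RAM (watch with htop)
  done
done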

You can list all the machines in our labs:

cat /usr/swat/db/hosts.bookstore
cat /usr/swat/db/hosts.mainlab
cat /usr/swat/db/hosts.256
cat /usr/swat/db/hosts.overflow

You can use all or any of these machines; however, do not use cornstarch or honey.

You can automatically generate a hostfile of some number of hosts by running autoMPIgen. If this fails, however, smarterSSH can be used to produce a list of hosts, or you can just create a hostfile by hand from the lab machines listed in the 4 files shown above.

Use the check_up.sh script to test that all machines in your hostfile are reachable:

./check_up.sh hostfilename

Here is more information about hostfiles in Open MPI: about hostfiles
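For reference, an Open MPI hostfile is just a list of machine names, optionally with a slots count giving the maximum number of processes to run on each host; the hostnames below are made-up placeholders:

# one host per line; slots is the max number of processes on that host
hostname01.cs.swarthmore.edu slots=4
hostname02.cs.swarthmore.edu slots=4
hostname03.cs.swarthmore.edu slots=4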

Then you can run your experiment script of a large set of runs using this hostfile.

IMPORTANT: before running a huge set of experiments, make sure your largest-sized run completes (i.e., that it does not deadlock and run forever). If your oddeven sort deadlocks, then see the Lab 4 write-up for some hints about this and how to fix it.

See the Experiment Tools page from Lab 1 for a reminder about useful tools for running experiments and writing scripts, and some useful commands/utilities: screen, script, tmux, at, cron, …
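For example, one way to run your experiment script under screen so that it keeps running after you log out (the script and session names here are just examples):

cs$ screen -S bigruns                       # start a named screen session
cs$ ./run_experiments.sh | tee results.out  # run your script, saving output
# detach with Ctrl-a d, then log out; later:
cs$ screen -r bigruns                       # re-attach to check progress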

4. Learn vim

The vi (and vim) editor is available on every Unix system. It is a very efficient, lightweight text editor that is easy to use after learning a few basic commands, which you can learn by running through the vimtutor tutorial.

When you log into the Swarthmore cluster and XSEDE resources, atom is not available to you, so you will need to use vi (vim) to edit files.

Vi (vim) also has a lot of nice advanced features, and there are GUI versions of vim, like other editors, for when you run on an X Window System (like when you are logged into our machines).

  1. ssh into our system and run vimtutor ( more info on remote access, and more info on ssh):

    from home$  ssh <yourusername>@lab.cs.swarthmore.edu
    
    cs$  cd ~/cs87
    cs$  pwd
    cs$  vimtutor           # start the vim tutorial
  2. Go through the sections of vimtutor listed below (the other sections cover more obscure features that are not necessary). It will take about 30 minutes to run through these lessons; a quick reference of the commands they cover appears after this list.

    • all of Lesson 1 (moving around, x, i, A, :wq)

    • Lesson 2.6 (dd)

    • Lesson 2.7 (undo)

    • Lesson 3.1 (p) and 3.2 (r)

    • Lesson 4.1 (G) and 4.2 (searching)

    • Lesson 6.2 (a), 6.3 ( R ), and 6.4 (y and p)
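For quick reference, here are the commands these lessons cover:

i   insert before cursor          x    delete a character
A   append at end of line         dd   delete (cut) a line
a   append after cursor           u    undo
r   replace one character         R    overwrite (replace) mode
y   yank (copy)                   p    paste after cursor
G   go to the last line           /text   search forward for "text"
:wq  save and quit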

5. Practice using scp, ssh, vim on comet

You will need to use vim to edit files on Comet, and scp to copy files between the CS system and Comet. For your XSEDE odd-even sort experiments, you will need to scp your oddeven.c and oddeven.sb files from your repo on our system to Comet. You can scp each file one by one, or make a tar file containing all the files you want to copy over and just scp the single tar file, as in the example below. (more info on scp, more info on tar)
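For example, to copy both files with a single scp using tar (the paths are illustrative, and this assumes the ~/cs87/oddeven directory created in the steps below):

on_cs$ cd <path to your Lab 4 repo>
on_cs$ tar cvf oddeven.tar oddeven.c oddeven.sb   # bundle the files
on_cs$ scp oddeven.tar you@comet.sdsc.edu:~/cs87/oddeven/
comet$ cd ~/cs87/oddeven
comet$ tar xvf oddeven.tar                        # unpack on Comet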

First make sure to set up your XSEDE and your Comet account this week (follow all the directions under "XSEDE and Comet Account Set-up"): XSEDE and Comet accounts. Then try out ssh and scp on comet.

on_cs$ pwd       # get path to your Lab 4 repo
on_cs$ ssh you@comet.sdsc.edu

comet$ mkdir cs87
comet$ cd cs87
comet$ mkdir oddeven
comet$ cd oddeven
comet$ scp you_on_cs@cs.swarthmore.edu:<path to your Lab 4 repo>/oddeven.c .
# example:
       scp newhall@cs.swarthmore.edu:./cs87/labs/Lab04-tia/oddeven.c .

comet$ scp you_on_cs@cs.swarthmore.edu:<path to your Lab 4 repo>/oddeven.sb .

On Comet, I have a Makefile you can copy to compile oddevensort, and then you can try submitting an oddevensort job to the default queue by running sbatch with the slurm script you copied over:

comet$ cp ~newhall/Makefile .
comet$ make
comet$ sbatch oddeven.sb

On Comet, you will need to edit the oddeven.sb slurm file using vi (and create additional .sb files) to submit to sbatch:

comet$ vim oddeven.sb   # change something like number of nodes
comet$ sbatch oddeven.sb
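After submitting, you can check on your job with standard slurm commands (the name of the output file depends on the --output setting in your .sb script):

comet$ squeue -u $USER     # list your queued and running jobs
comet$ scancel <jobid>     # kill a submitted job if something went wrong
comet$ ls *.out            # slurm writes each job's output to a file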

6. Experiments on Swarthmore’s Strelka Cluster

This is optional, but I encourage you to try this out before trying some runs on XSEDE. You will need to use vim to edit files on Strelka.

Swarthmore has a brand new cluster that I’d like you to try running on. Don’t spend too much time on this, but just try it out. You may want to try some small runs on XSEDE first to get used to submitting jobs with sbatch. First, request an account on Strelka, the college’s cluster:

email support@swarthmore.edu with your Swarthmore username and SSH public key.

Once you have an account, see if you can try some larger runs of your MPI oddeven sort on this cluster.

It uses the slurm scheduler, so you will use the same process to compile and run on this system as on the XSEDE system. See the details described below about how to run on SDSC Comet and follow the same steps for running on this system, just ssh (and scp) into strelka.swarthmore.edu.

You may need to add the MPI module into your environment to run mpicc and mpirun:

# list available modules
module avail
------------------------- /opt/modulefiles -----------------------
  list of modules
  ...
(L) are loaded

# to load an mpi version into your environment:
module load openmpi/4.0.2-intel-19.0.5.281

If you echo your PATH environment variable, you will see the path to this module added to the front of your path:

echo $PATH
/opt/apps/mpi/openmpi-4.0.2_intel-19.0.5.281/bin: ...

You can also add this to your PATH in your .bashrc file and then avoid running module load each time you log in.
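For example, using the path from the echo $PATH output above, you could add a line like this to your .bashrc (a sketch; double-check the path against module's output on Strelka):

export PATH=/opt/apps/mpi/openmpi-4.0.2_intel-19.0.5.281/bin:$PATH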

For more information on Strelka: Strelka cluster

7. XSEDE Experiments on SDSC Comet

Run some large runs of your Lab 4 solution on Comet. You should do this after running some large runs on the CS machines, to make sure your solution does not have any deadlock or other errors that would make it run forever. Also, you will want to estimate a reasonably accurate upper bound for its runtime in your slurm script. Do not make the time super long: if your application has deadlock, it will use up a lot of XSEDE CUs for no good reason. Pick a reasonable upper-bound estimate (you want to pad it a bit so that it is long enough for your application to finish, but you don’t want a deadlocked process to continue to use CUs for a huge over-estimate of total runtime). Do some experimentation with times on XSEDE, and use times on our system to help you pick a good upper bound. Your estimate doesn’t have to be super close, but if you expect a run to easily complete within 15 minutes, submit a slurm script with a time a few minutes beyond this, maybe 17-20 minutes; do not submit one with a runtime of 1 day, for example.

7.1. Practice First

  • First make sure to set up your XSEDE and your Comet account this week (follow all the directions under "XSEDE and Comet Account Set-up"): XSEDE and Comet accounts

  • Next, try ssh’ing into Comet, and try out scp’ing over my mpi example and running it on Comet (follow the directions under "Using Comet and submitting jobs").

  • Once you have figured out slurm and how to submit jobs, scp over your Lab 4 solution and build it on Comet. You may need to make changes to the Makefile (see the Makefile from my XSEDE examples).

  • Then, write a submission script for some small runs and try running them. Try a few small runs with debug printing enabled to make sure your program runs on Comet. You can modify the hello slurm script to run your sorting program and submit it.

  • Finally, try a small run with printing disabled in preparation for larger runs (remove all debug output from your program by commenting out #define DEBUG; you can keep the printing of timing information).

7.2. Assignment: try some long runs of your solution on comet

Try out at least two large runs of your Lab 4 solution on comet.

  • Make sure to disable or remove all debug output from your program (comment out #define DEBUG).

  • Copy over your oddeven.c and oddeven.sb (if you didn’t already do this when practicing scp above; see Section 5).

    As a reminder from Section 5: on Comet, copy my Makefile to compile oddevensort, then try submitting an oddevensort job by running sbatch with the slurm script you copied over:

    comet$ cp ~newhall/Makefile .
    comet$ make
    comet$ sbatch oddeven.sb
  • Try out some small runs using the example slurm script. Included with the starting point code is an example slurm script for running on Comet (it is for a small run on the debug queue). Only try running this with small-sized problems. To submit to the debug queue on Comet:

sbatch oddeven.sb
  • Write a couple of slurm submission scripts for long runs (large sizes) and submit them. In your slurm scripts, you will want to modify at least these four lines (use the compute queue instead of the debug queue for your experiment runs); a full example script is sketched after this list:

#SBATCH --partition=debug      # which queue
#SBATCH --nodes=2              # Total number of nodes
#SBATCH --ntasks-per-node=24   # Total number of mpi tasks
#SBATCH -t 00:30:00            # Run time (hh:mm:ss) - 30 mins
  • You should choose way more than 2 nodes and 24 mpi tasks in your runs.

  • You can also try large sized arrays for each process to sort via command line args to your executable (add them to the command in the slurm script).

  • You may need to adjust the estimated runtime (30 mins in this example). If your estimate is too small and your program runs longer than your estimate, it will be killed before it completes. If your estimate is too long, it will wait in the job queue for much longer than it should.

  • You should also submit to a regular job queue (e.g. compute) for the big runs (don’t use the debug queue).
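Putting these pieces together, a long-run submission script might look something like the sketch below. The job name, output pattern, node and task counts, time limit, and the size argument are all placeholders to adapt; this also assumes Comet's ibrun launcher (used in SDSC's MPI examples) and that your program takes its problem size as a command line argument:

#!/bin/bash
#SBATCH --job-name="oddeven"
#SBATCH --output="oddeven.%j.%N.out"   # %j: job id, %N: node name
#SBATCH --partition=compute            # compute queue for the real runs
#SBATCH --nodes=8                      # way more than the example's 2
#SBATCH --ntasks-per-node=24           # total mpi tasks = nodes * this
#SBATCH --export=ALL
#SBATCH -t 00:20:00                    # padded runtime upper bound (hh:mm:ss)

ibrun ./oddevensort 100000000          # placeholder size argument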

8. Submit

You will submit the following via git:

  1. BIGRESULTS: add this file (like RESULTS) containing the results of your large-run tests on the CS machines. Make sure to edit this file so that it is clear what run sizes the results you show are from.

  2. Two comet output files from two large runs. The only program output should be the process with rank 0 printing out the size of the problem: N and P (make sure you have no debug printing output in these runs; these files should be very small). If you forgot to include this printing of the problem size N and P, then edit the output file to include these values (copy in the slurm script corresponding to the run, or just edit the output file to add this in).

    vim oddevenNUMBER.comet-X-Y.out
    # in vim, you can import a file's contents using:
    :r oddeven.sb

    You can scp these files over to your cs account and add them to your repo:

    comet$ scp filename you@cs.swarthmore.edu:./cs87/labs/Lab04-you-partner/.
    
    # or scp to home and then on CS just mv file from your
    # home directory into your Lab04 repo:
    comet$ scp filename you@cs.swarthmore.edu:.
    cs$ cd ~/cs87/labs/Lab04-you-partner
    cs$ mv ~/filename .
  3. Optional but strongly encouraged: one or a few output files from running on the Swarthmore Strelka cluster.

Then just add these to your git repo and commit and push:

$ cd ~/cs87/labs/Lab04-you-partner
$ git add oddevenNUM1.comet-X-Y.out
$ git add oddevenNUM2.comet-X-Y.out
$ git add BIGRESULTS
$ git add StrelkaResults
$ git commit -m "lab 4b results"
$ git push

Before the due date, you or your partner should push comet output files from two large runs to your Lab04 repo (scp them over to your cs account to git push them to your Lab04 repo).

If you have git problems, take a look at the "Troubleshooting" section of the Using git page.

9. Handy Resources