Using Condor on our system

Condor is a system for running compute-intensive batch jobs on idle nodes in a network of workstations. Neural network training jobs are one example of a good candidate for Condor. If you are running a large number of compute-intensive experiments and are not concerned with measuring their run time, Condor is a good way to use otherwise idle cycles in the network and to ensure that your workload doesn't interfere with other users of the machines.

You submit your programs to Condor, and Condor finds idle machines in the network on which to run them. If a user logs into and starts using a machine on which a Condor job is running, Condor suspends the job and moves it to another idle machine. This lets you use any unused compute cycles in the network without getting in the way of others using the CS lab machines.

Currently, Condor is running only on the HP machines in the lab. I'd recommend logging into one of these and then running screen to submit jobs to Condor. screen lets you detach from a session and re-attach later, so you can log out of the HP machine while your really long set of experiments runs. This way you do not have to stay logged into a machine, preventing others from using it, and you do not risk having someone log you out before your jobs are complete. See the CS project etiquette page for more information about using screen.

To use condor:

% condor_status                #  to see the pool or 
% condor_status -master        #  to just get the machine names

# there are some examples to try in /scratch/knerr/condor

# here's the job file for the shell script:

LEMON[condor]$ cat submit.prog 
Universe   = vanilla
Executable =
Arguments  = 8
Log        = aaa.log
Output     = aaa.$(Process).out
Error      = aaa.$(Process).error
InitialDir = /scratch/knerr/condor
Requirements = Memory >= 600
Queue 10
#  This queues 10 copies of the executable, each run with the argument 8,
#  and Condor places them on idle machines. You could also use
# "Input   = in.$(Process)" to have different input files for each process.
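
# If you use the Input line mentioned above, Condor expects one input file
# per process. As a minimal sketch, assuming "Queue 10" and input files kept
# in the job's InitialDir, a loop like the following creates in.0 through
# in.9. The file contents and the workdir variable are placeholders for
# illustration, not part of the examples in /scratch/knerr/condor.

```shell
#!/bin/sh
# Sketch: create one input file per queued process, to go with
# "Queue 10" and "Input = in.$(Process)" in the submit file.
njobs=10        # must match the number after Queue
workdir=.       # placeholder: change to the directory used as InitialDir

i=0
while [ "$i" -lt "$njobs" ]; do
    # Condor numbers processes from 0, so the files are in.0 ... in.9
    printf 'experiment %d\n' "$i" > "$workdir/in.$i"
    i=$((i + 1))
done

echo "created $njobs input files in $workdir"
```

# Each process then reads its own file on standard input, so a single
# submit file can cover a whole set of experiments.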

# to run:

% condor_submit jobfile      # jobfile is your job file (e.g. submit.prog above); then wait a bit
% condor_q                   # to see the condor queue 
% condor_rm <job id>         # to delete a job from the queue (job IDs are listed by condor_q)
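
# Once condor_q no longer lists your jobs, it can help to check that every
# process actually produced output. The following is a rough sketch, assuming
# the file names from the submit file above (aaa.$(Process).out and
# aaa.$(Process).error). check_outputs is a hypothetical helper written for
# this page, not a Condor command.

```shell
#!/bin/sh
# Sketch: report which processes have no output file, or wrote
# something to their error file.
check_outputs() {
    dir=$1        # directory used as InitialDir
    njobs=$2      # number given after Queue
    problems=0
    i=0
    while [ "$i" -lt "$njobs" ]; do
        if [ ! -f "$dir/aaa.$i.out" ]; then
            echo "process $i: no output file yet"
            problems=$((problems + 1))
        elif [ -s "$dir/aaa.$i.error" ]; then
            echo "process $i: error file is not empty"
            problems=$((problems + 1))
        fi
        i=$((i + 1))
    done
    echo "$problems of $njobs processes need a look"
}

# When run as a script with two arguments, check that directory,
# e.g.:  sh check.sh /scratch/knerr/condor 10
if [ "$#" -eq 2 ]; then
    check_outputs "$1" "$2"
fi
```

# A non-empty error file does not always mean the job failed (some programs
# print progress messages to standard error), so treat the report as a list
# of things to look at rather than a list of failures.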