PBS Short Help Document
written by Jason Ferguson
Updated: 23 October, 2003
This short document describes the Portable Batch System (PBS) installed on HiPeCC's supercomputer system and is meant to help a new user of PBS get started.
There are three commands that all PBS users need to be aware of: qstat, qsub, and qdel.
Queues on cronus
Before discussing the syntax and usage of the PBS commands, we first need to describe how the machine is configured. There are 7 queues currently established on the machine: 3 parallel queues, a serial-short queue, a serial-long queue, a compile queue, and an interactive queue. Issuing qstat -q from the command prompt shows the setup:
Queue            Memory CPU Time Walltime Node Run Que Lm  State
---------------- ------ -------- -------- ---- --- --- --- -----
long               --      --       --     --    6   0  20  E R
para               --      --       --     --   14  21  30  E R
short              --   04:00:00    --     --    0   0  20  E R
stdin              --      --       --     --    0   0   8  E R
compile            --      --       --     --    0   0   4  E R
para8              --      --       --     --    1   0  20  E R
para12             --      --       --     --    1   0  20  E R
The names of the queues are shortened for convenience: "short" means serial-short, "long" is serial-long, "compile" is for compiler jobs, and "stdin" is for interactive jobs using -I. The parallel queues "para", "para8", and "para12" allow use of 2 to 4, 5 to 8, and 9 to 12 processors respectively. No queue has a time limit except the short queue, which has a 4 cpu-hour limit. The 'Memory' column is the memory limit for each queue and is not enforced; 'CPU Time' is the limit of cpu time that a job may use; the 'Walltime' and 'Node' columns are not applicable. The 'Run' column indicates the number of jobs running in each queue, 'Que' shows the number of jobs queued, and 'Lm' is the maximum number of running jobs permitted. The 'State' column shows that each queue is enabled (E) and running (R).
In order to manage resources we have set up the following run limits. The short queue has a 4 hour time limit and will not allow jobs to run longer than 4 hours. Notice the 'Lm' entry for each queue in the table above. The serial queues are intended to run jobs that use only 1 CPU; up to 2 and 12 such jobs may run simultaneously in the short and long queues, respectively. A particular user may have 2 jobs running simultaneously in the short queue, and 3 in the long queue. The para queue is intended to run up to 2 jobs using a maximum of 4 CPUs each, and in this case one user may run both jobs in this queue. An individual user may have a total of 4 running jobs on the machine at any one time.
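As a worked example of reading the table above, the following sketch totals the 'Run' column with awk. The sample lines are pasted in as a here-string; on the real machine you would pipe the output of qstat -q instead.

```shell
# Sum the Run column (field 6) of the qstat -q queue table shown above.
qstat_q='long    --  --       -- -- 6  0  20 E R
para    --  --       -- -- 14 21 30 E R
short   --  04:00:00 -- -- 0  0  20 E R
stdin   --  --       -- -- 0  0  8  E R
compile --  --       -- -- 0  0  4  E R
para8   --  --       -- -- 1  0  20 E R
para12  --  --       -- -- 1  0  20 E R'
echo "$qstat_q" | awk '{ run += $6 } END { print run }'   # -> 22
```

The same one-liner works for the 'Que' column by changing $6 to $7.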
The queue status command has several switches to help you see what is happening on the machine. Issuing a qstat command with no switches will show you jobs that are currently running and queued on the machine, such as:
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
75.cronus-b      sample.pbs       ferguson         00:01:08 R short
This says that the request "sample.pbs", with PBS ID number 75 and owned by the user ferguson, is in the queue "short". The job has a status of "R", which means running; another possible status is "Q", meaning that the job is waiting (queued). Job 75 in the example above has used just over a minute of CPU time.
See the qstat man page for common switches. Generally the command is executed without switches, with "-q" or "-Q" (which give the same information in different formats), or with the "-B" switch. Executing qstat -q shows a summary of the queue statuses without the details of each job; an example is shown above.
The queue submit command, qsub, is used to submit a script to PBS for processing. There are several switches that can be used with this command, many of which can also be set inside the script itself (see PBS Script Building below). Generally in my work the scripts are all the same; it is the batch queue they are submitted to that changes. So I submit my scripts with the following command: qsub -q long sample.pbs. In this example the -q switch tells qsub to submit the script to the queue 'long'.
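Since the script stays the same and only the target queue changes, the queue name can be held in a shell variable at submit time. A minimal sketch (sample.pbs and the final eval are illustrative; on cronus you would run the built command directly):

```shell
# Build the qsub command with the queue chosen per run.
queue=long                         # or: short, para, para8, para12
cmd="qsub -q $queue sample.pbs"
echo "$cmd"                        # -> qsub -q long sample.pbs
# eval "$cmd"                      # uncomment to actually submit on cronus
```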
If you have a job that requires user interaction with the program and cannot be submitted to batch, then you will have to use the qsub -I command to start an interactive PBS session. When executed, this command waits in the queue like any other job. When the job scheduler sets your job to running, it opens a login shell on the terminal from which you issued the qsub -I command. At that point you can issue any command on the machine and run your job interactively, as in any login shell. If qsub -I -q short is submitted to the serial-short queue, the login shell will close after 4 hours of cpu time are used. Be careful using this command: while you are using PBS interactively, another user is probably waiting to use the machine in batch mode.
Common qsub switches (the ones used in this document):
  -q <queue>   submit the job to the named queue (e.g. qsub -q long sample.pbs)
  -I           start an interactive session
  -N <name>    assign a name to the job
For other switches see the qsub man page.
The queue delete command, qdel, is used when a mistake is realized and you need to remove a job from the queue for some reason. Look up the ID number of your job with the qstat command, then issue qdel ID# for a job that is either queued or running.
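The numeric ID is the part of the "Job id" column before the first dot. A minimal sketch, assuming the job-line format from the qstat example above:

```shell
# Strip everything from the first dot onward to get the numeric job ID.
line='75.cronus-b      sample.pbs       ferguson         00:01:08 R short'
id=${line%%.*}
echo "$id"     # -> 75
# qdel "$id"   # uncomment to delete the job on the real system
```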
PBS Script Building
Scripts used by PBS to execute jobs can be either very simple (a single line with one command) or very complicated (several dozen lines long).
Here is a simple script:
# this is a sample script on how to execute a serial job
cd /user/ferguson/pbstest
cloudy.exe < parispn.in > parispn.out
Any line that begins with a "#" is treated as a comment and is not executed. The script first changes to my pbstest directory with a cd command. The next line runs the program called cloudy.exe with an input file (parispn.in) and an output file (parispn.out), using the "<" and ">" symbols to redirect input and output. Note that the 'cd' command is essential, because PBS does not know where your files are: in the script, you must either specify the full path name or change to the directory you want to work in.
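The redirection pattern itself can be tried anywhere. In this sketch, tr stands in for cloudy.exe so the example runs without PBS, and the file names are made up:

```shell
# program < input > output, using a temporary working directory
cd "$(mktemp -d)"
echo 'hello' > demo.in
tr 'a-z' 'A-Z' < demo.in > demo.out   # reads demo.in, writes demo.out
result=$(cat demo.out)
echo "$result"                        # -> HELLO
```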
Here is a slightly more complicated script that makes use of the temp space:
# change directories to the temp space and create a new directory
cd /temp/ferguson
mkdir paris
cd paris
# copy the necessary files to the temp space
cp /user/ferguson/pbstest/cloudy.exe /temp/ferguson/paris
cp /user/ferguson/pbstest/parispn.in /temp/ferguson/paris
cp /user/ferguson/pbstest/c84.ini /temp/ferguson/paris
# run the program
cloudy.exe < parispn.in > parispn.out
# copy results back
cp parispn.out /user/ferguson/pbstest
# remove temp space
cd /temp/ferguson
rm -rf paris
The above script takes advantage of the scratch space. The first section creates the temp space in my assigned area: /temp/ferguson.
The next section copies the necessary input files into the temp area, and then the program is run in the same way as in the previous simple example.
Results are then copied back to the user area from the temp space and the temp space is removed.
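The stage-in / run / stage-out pattern above can be sketched generically. This version uses mktemp and a dummy input so it runs anywhere; the directory names are illustrative stand-ins, not real cronus paths:

```shell
home=$(mktemp -d)                     # stands in for /user/ferguson/pbstest
scratch=$(mktemp -d)                  # stands in for /temp/ferguson/paris
echo 'input data' > "$home/job.in"
cp "$home/job.in" "$scratch"                              # stage in
( cd "$scratch" && tr 'a-z' 'A-Z' < job.in > job.out )    # run
cp "$scratch/job.out" "$home"                             # stage results out
rm -rf "$scratch"                                         # remove temp space
result=$(cat "$home/job.out")
echo "$result"                        # -> INPUT DATA
```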
Here is a more complex script:
#PBS -N ParisPN
#PBS -q long
set -x
# set up environment variables:
# name of job, initial directory, scratch space.
NAME=parispn
INITDIR=/user/ferguson/cloudy/planetary
TMPDIR=/temp/ferguson
# CLOUDY is set to the directory that holds cloudy.exe (path not shown)
# remove scratch space, then remake, then copy program and input file
rm -rf $TMPDIR/$NAME
mkdir $TMPDIR/$NAME
cd $TMPDIR/$NAME
cp $CLOUDY/cloudy.exe .
cp $INITDIR/c84.ini .
cp $INITDIR/$NAME.in .
# run program
timex cloudy.exe < $NAME.in > $NAME.out
# compress results, copy back, remove scratch space
gzip -f -9 $NAME.out
cp $NAME.out.gz $INITDIR/
cd $TMPDIR
rm -rf $TMPDIR/$NAME
The above script copies to scratch space a program named cloudy and its input files, executes the program, then copies the results back to the user space while compressing the output with gzip. The script is broken into five sections.
The top of the script does two things. It uses a set -x command so that all of the commands are echoed in the output from PBS, and it contains #PBS lines that pass switches to PBS to control the queues. Lines that begin with "#PBS" are treated as instructions to PBS, not as commands. This script is submitted with a qsub name.pbs command, with no switches.
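Because #PBS lines start with "#", they are ordinary comments to the shell, so the shell skips them while qsub reads them as directives. A small sketch (the file name and body are made up):

```shell
# Count the #PBS directive lines in a throwaway job script.
pbsfile=$(mktemp)
cat > "$pbsfile" <<'EOF'
#PBS -N ParisPN
#PBS -q long
set -x
EOF
directives=$(grep -c '^#PBS' "$pbsfile")
echo "$directives"    # -> 2
rm -f "$pbsfile"
```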
Second, we set up environment variables. The name of the job is "parispn", and the initial starting directory is /user/ferguson/cloudy/planetary. I have defined a variable CLOUDY since the program lives in a different directory than the initial directory; it is always a good idea to keep the program away from the model results! Notice that TMPDIR is going to be used as scratch space, and /temp/ferguson already exists: all users have a directory on the temp hard drive.
In the next section I create the scratch space and move the program and input files into it. Before making the directory, I always remove it in order to avoid any write errors.
The fourth section executes the program. I use the timex command to get information about the runtime.
In the last section of the script, I gzip the output file of the program and then copy it back to the initial directory (useful for large files that will be ftp'd later). Then the scratch space is removed.