GQSUB

Grid computing at the meso scale


Table of Contents

1. Introduction, and intended audience
2. Installation
3. General configuration
4. gqsub
Example
Command line options
5. gqstat

List of Figures

4.1. gqsub command line options

List of Tables

4.1.

List of Examples

3.1. Sample .gqsubrc
4.1. Trivial submission script
4.2. Command line use of gqsub

Chapter 1. Introduction, and intended audience

GQSUB was born out of work with a group of users with existing HPC experience. The most common question we were asked was, "Can we not just use qsub?" (They could not, as that would have bound them to a single site.) However, the intent was clear: most Grid tools don't work quite like anything else, and feel a lot more complicated. Part of this is necessary complexity - using certificates to allow for decentralised authentication and authorisation, for example. Part of it is unnecessary complexity, which could be removed.

By focusing on a specific user community, it is possible to tailor the user interface to that community. The more focused the community, the better the tailoring can be. By focusing on existing HPC/HTC users, who typically have their own cluster, gqsub can exploit existing knowledge. This reduces learning time, and improves usability. This user group was selected as a target due to its size (big), and current uptake of Grid computing (small) - essentially, it is the logical place to expect Grid use to expand into.

The aim of gqsub is to allow a user of a local cluster to install a small bit of code on their existing head node, and then be able to submit the same job control scripts to either the local cluster or the grid, without modification.

In practice, the aim of running without modification is probably unrealistic. It can happen if there is a shared filesystem between the head node and the Grid system, but that's not a common method of operation (yet). Therefore there is an ability to make small changes depending on whether the job is running locally or on the Grid. In trials, a couple of lines were all that needed to be adjusted - most of the configuration remains the same, as it is the same job; just a few small changes on

Chapter 2. Installation

Put all the Python files in the same directory, somewhere on your path.

gqsub requires a gLite UI to be installed in order to work. (Strictly, the two can be installed in either order; in practice, install the gLite UI first.) It works well with the gLite 3.1 UI, and works with the 3.2 UI, modulo known problems in 3.2. Once they are resolved, gqsub will work identically on both.

At Glasgow, we have it in /usr/local/bin on the UI machine.

Testing

There is a test script, called (imaginatively!) test.sh. This runs a basic job, outputs a few diagnostics, ships in a C file, then compiles and runs it. The C file produces some output, and sleeps for a few minutes (i.e. just long enough to detect it in the monitoring system). Production of a formal test suite is on the todo list.

Chapter 3. General configuration

gqsub uses a configuration file, and a directory tree. The configuration file is $HOME/.gqsubrc, and it defines the location of the directory tree, and a number of other variables. If the file is not present, it is generated with default values. If it is present, but some values are not specified, then a default for each missing value will be assumed. Most of the time these will not need to be changed, except perhaps defaultdirective and defaultvo. An example file, set with the default values for these options, is given in Example 3.1, “Sample .gqsubrc”.

Example 3.1. Sample .gqsubrc

[GQSUB] 
 Section header - required; exactly one section present at the moment. 
sharedpaths = 
 Paths that should be assumed to be the same between the submit host and the worker node. When within a sharedpath, file staging is not performed.  The values here are used if neither --on-shared-path nor --not-on-shared-path is specified.
maxjobspersubmission = 50
 Large array jobs will be split up into tranches no larger than this.  The 50 is the current recommendation from the WMS developers.
defaultdirective = #$
 Default directive.  Torque and PBS use #PBS, SGE uses #$.  It is best to set this to match an existing local cluster.
gqsubdirective = #GQSUB
 Additional directive, gqsub specific.  Use this to set things that should not be used for local jobs - e.g. file staging. 
proxysafetymargin = 129600
 Additional time required on the proxy - if you find your jobs regularly queue for a long time, you might want to increase this.  
gqsubdir = $HOME/.gqsub
 Location of the directory tree to use for storing jobs.  Changing this will result in the disappearance of any existing jobs.
defaultvo = a.best.guess
 When no VO is specified, and there is no existing certificate, generate a proxy with this VO.  Most users have a single VO, and those that have multiple will tend to use one significantly more than the other.  This is originally filled in from an existing certificate.
defaultshell = /bin/sh
 The shell that will be used to execute the script.  This can be changed per job with -S, and this is the same default that qsub has.
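
As a rough illustration of the defaulting behaviour described above, the following Python sketch reads a .gqsubrc-style file and fills in any value that is missing. It is not gqsub's own code: the use of the standard configparser module, the function name load_settings and the VO name vo.example.org are all assumptions made for this example. The default values are the ones listed in Example 3.1.

```python
import configparser

# Defaults copied from Example 3.1; values are kept as strings, as in the file.
DEFAULTS = {
    "sharedpaths": "",
    "maxjobspersubmission": "50",
    "defaultdirective": "#$",
    "gqsubdirective": "#GQSUB",
    "proxysafetymargin": "129600",
    "gqsubdir": "$HOME/.gqsub",
    "defaultvo": "a.best.guess",
    "defaultshell": "/bin/sh",
}


def load_settings(text):
    """Return the [GQSUB] settings from text, with defaults for missing keys.

    Hypothetical helper: gqsub's real loader may work differently.
    """
    # Interpolation is disabled so values are stored exactly as written.
    parser = configparser.ConfigParser(interpolation=None)
    # Load the defaults first; anything read afterwards overrides them.
    parser.read_dict({"GQSUB": DEFAULTS})
    parser.read_string(text)
    return dict(parser["GQSUB"])


# A file that only sets two values; everything else falls back to the default.
partial = "[GQSUB]\ndefaultvo = vo.example.org\nmaxjobspersubmission = 20\n"
settings = load_settings(partial)
print(settings["defaultvo"], settings["defaultshell"])
```

Loading the defaults first and the user's file second keeps the "missing value means default" rule in one place, rather than scattering fallback values through the code.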

The most critical of these is gqsubdir - the others can all be changed at runtime. The exact storage needs for this will vary, but as a data point, 320 jobs at maximal space usage take 13 MB. Much of the space is taken up by the storage of the final job status by gqstat. In $gqsubdir a directory will be created per job submitted - these will be numbered sequentially. If a file called default.jdl is in this directory, it will be included as a default set for glite-wms-job-submit - this is a place where arbitrary JDL instructions can be included. In the event that it is desired to have different options set for local and Grid jobs, the recommended approach is to set the option via the normal directive for local use, and then use the gqsub directive to change it. The file is parsed in order, top to bottom, so entries later in the file take precedence.
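
The "later entries take precedence" rule can be sketched in Python as follows, assuming that "the file" here means the job script and that directives are simple flag/value pairs. The function name collect_directives and the way -l resources are keyed are hypothetical choices for this illustration, not gqsub's actual parser.

```python
import shlex


def collect_directives(script_text, prefixes=("#PBS", "#GQSUB")):
    """Collect directive options from a job script, top to bottom.

    Later lines overwrite earlier ones, so a #GQSUB line placed after a
    #PBS line changes the value for Grid submission only (a local batch
    system sees the #GQSUB line as an ordinary comment).
    """
    options = {}
    for line in script_text.splitlines():
        for prefix in prefixes:
            if line.startswith(prefix + " "):
                tokens = shlex.split(line[len(prefix):])
                # Walk the tokens as flag/value pairs, e.g. "-l", "cput=0:30:00".
                for flag, value in zip(tokens[0::2], tokens[1::2]):
                    if flag == "-l" and "=" in value:
                        # Key resource requests by resource name.
                        name, setting = value.split("=", 1)
                        options[(flag, name)] = setting
                    else:
                        options[(flag, None)] = value
                break
    return options


script = """#!/bin/sh
#PBS -l cput=0:30:00
#PBS -l walltime=0:30:00
#GQSUB -l walltime=1:00:00
#GQSUB -q UKI-SCOTGRID-GLASGOW
echo in user script
"""
directives = collect_directives(script)
print(directives[("-l", "walltime")])
```

In this sketch the walltime requested for the Grid is the one from the #GQSUB line, because that line comes later in the script.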

Chapter 4. gqsub

The submission engine

Example

Let's start with an example submission script, and show it in action. Firstly, here's the script, in Example 4.1, “Trivial submission script”.

Example 4.1. Trivial submission script

#!/bin/sh

#PBS -l cput=0:30:00
#PBS -l walltime=0:30:00
#GQSUB -q UKI-SCOTGRID-GLASGOW

echo in user script
hostname
pwd

And here's the submission and monitoring in use.

Example 4.2. Command line use of gqsub

-bash-3.00$ ls
gqdel  gqoutput  gqstat  gqsub  gqsubconfig.py  gqsubconfig.pyc  gqsubproxy.py  gqsubproxy.pyc  test.sh
-bash-3.00$ gqsub test.sh 
File stagein requested: on a shared path, so no staging needed.  Note that this captures files at time of execution.

Couldn't find a valid proxy.

Proxy renewal required - looking for 36:45:00
Enter GRID pass phrase:
Submitting job 320 as: https://svr023.gla.scotgrid.ac.uk:9000/SJ7Ch-mRrEn4dD9oAgHecw

-bash-3.00$ gqstat 320
Getting status for 320
Job id Name User                                                    WallTime State     Queue                                               
------ ---- ------------------------------------------------------- -------- --------- ----------------------------------------------------
320         /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart_purdie          Scheduled dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q7d
-bash-3.00$ ./gqstat 320
Getting status for 320
Job id Name User                                                    WallTime State   Queue
------ ---- ------------------------------------------------------- -------- ------- ----------------------------------------------------
320         /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart_purdie 00:02:14 Running dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q7d
-bash-3.00$ ./gqstat 320
Getting status for 320
Job id Name User                                                    WallTime State Queue                                               
------ ---- ------------------------------------------------------- -------- ----- ----------------------------------------------------
320         /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart_purdie 00:26:32 Done  dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q7d
-bash-3.00$ ls
gqdel  gqoutput  gqstat  gqsub  gqsubconfig.py  gqsubconfig.pyc  gqsubproxy.py  gqsubproxy.pyc  test.sh  test.sh.e320  test.sh.o320
-bash-3.00$ more test.sh.o320 
in user script
node296.beowulf.cluster
/cluster/share/gla058/gqsub
-bash-3.00$ 

The output from gqstat will be covered in detail later; for the moment, note that it shows the job going through the relevant stages. In this case, the job was run on a shared path, so the output appeared automatically as it was generated. The job script executed, test.sh, is the one shown in Example 4.1, “Trivial submission script”.
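
The shared-path decision can be pictured with the following Python sketch. It assumes the sharedpaths setting has already been turned into a list of directories (how the value is split is not specified here), and the function name on_shared_path and its force parameter are hypothetical stand-ins for the --on-shared-path and --not-on-shared-path options.

```python
import os


def on_shared_path(directory, shared_paths, force=None):
    """Decide whether a directory should be treated as shared.

    force=True models --on-shared-path, force=False models
    --not-on-shared-path, and None falls back to the configured list.
    """
    if force is not None:
        return force
    directory = os.path.abspath(directory)
    for shared in shared_paths:
        shared = os.path.abspath(shared)
        # The directory is shared if it is the shared path or lies below it.
        if os.path.commonpath([directory, shared]) == shared:
            return True
    return False


# The working directory from Example 4.2, with an assumed shared root.
needs_staging = not on_shared_path("/cluster/share/gla058/gqsub", ["/cluster/share"])
print(needs_staging)
```

When the function returns True, no file staging would be needed, which matches the message printed at submission time in Example 4.2.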

The prompt to generate the proxy was due to the current proxy having expired. Note that there was no VO specified explicitly anywhere - it was assumed that the VO to use matched the last VO used. If there had been no proxy at all (e.g. if voms-proxy-destroy had been run), then it would have drawn the VO name from the default VO in .gqsubrc.
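
The order in which the VO is chosen can be summarised in a few lines of Python. This is only a sketch of the rule as described here; the function name choose_vo, its parameters and the VO names used in the calls are hypothetical.

```python
def choose_vo(explicit_vo=None, last_proxy_vo=None, default_vo="a.best.guess"):
    """Pick the VO for a new proxy.

    Order of preference: a VO given explicitly, then the VO of the
    existing (possibly expired) proxy, then defaultvo from .gqsubrc.
    """
    if explicit_vo:
        return explicit_vo
    if last_proxy_vo:
        return last_proxy_vo
    return default_vo


# No VO given and no proxy at all: fall back to the configured default.
print(choose_vo())
# An expired proxy is still present: reuse its VO.
print(choose_vo(last_proxy_vo="vo.example.org"))
```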

Command line options

POSIX defines many command line options, and many more are present in vendor extensions. Here we have a complete survey of the command line options available to gqsub.

Figure 4.1. gqsub command line options

Long options are all gqsub extensions over POSIX.

| Short option | Long option | Name | Origin | Status | Notes |
|---|---|---|---|---|---|
| -a | | Not before date | POSIX | Not supported | Not intended to support; cannot implement with gLite functionality, and of limited use. The best solution would be to use the local 'at' command to achieve it, and that would run the risk of having jobs not be submitted when they might be expected to have been. |
| -A | --vo | Account Name | POSIX | Supported, mapped to VO | It is assumed that the Account that POSIX uses is mapped directly onto the VO. |
| -c | | Checkpointing | POSIX | Not supported | Not intended to support; cannot implement with gLite functionality, and would require significant investment to produce the net effect. |
| -C | | Directive Prefix | POSIX | Supported | |
| -e | | Stderr filename | POSIX | Supported | |
| -j | | Output stream joining | POSIX | Not supported | Intended to support; it's been lower down the list. |
| -I | | Interactive job | Torque extension | Not supported | No plans as yet; batch jobs considered much more important. |
| -k | | Files to keep | POSIX | Not supported | Not intended to support; explicitly not possible with lcg-pbs job managers, and of limited utility even if it were. |
| -m | | When to email status updates | POSIX | Not supported | Would like to support, but appears to be rather tricky. Emailing out from the job itself appears not to be reliable, so it will need some stable service somewhere to do the emailing. That is an external dependence that I want to avoid for the moment. The optimal solution would be to use the email mechanism in the underlying local cluster, which will require CE support. |
| -M | | Addresses to email | POSIX | Not supported | Will not support; exploitable in this context. In practice, the person to email should be taken from the X509 certificate. See -m above for other email issues. |
| -N | | Job Name | POSIX | Supported | Defaults to the name of the script if not provided. |
| -o | | Stdout filename | POSIX | Supported | Defaults to ${JobName}.o${job id} if not provided. |
| -p | | Priority | POSIX | Not supported | Not intended to support. Job priority is done at the VO level; gLite has no scope for user specified job priority. |
| -q | | Destination | POSIX | Supported, overloaded | See discussion of destinations. |
| -r | | Rerun | POSIX | Not supported | Intended to support; can be mapped onto the gLite retry mechanism. |
| -S | | Shell interpreter | POSIX | Supported | Name of the program used to interpret the script - typically /bin/sh or similar, but the script can also be, e.g., a Python or R script, or in any other interpreted language. |
| -t | | Array job | Common extension | Supported, see notes | This is a common extension, and many implementations have a slightly different way of handling these. There are two aspects to that: the way the jobs are specified, where gqsub takes a single range of numbers; and how this is passed to individual jobs. |
| -u | | | POSIX | Not supported | |
| -v | | | POSIX | Not supported | |
| -V | | | POSIX | Not supported | |
| -z | | | POSIX | Not supported | |
| | --on-shared-path | | gqsub extension | | Forces assumption of shared filesystem. This forces gqsub to not do any explicit file staging for data return. |
| | --not-on-shared-path | | gqsub extension | | Prohibits assumption of shared filesystem. This forces gqsub to use explicit file staging to return data. |
| | --gridftp-host | | gqsub extension | | Indicates the host name of a server running GridFTP that can access the submission directory. Presence of this indicates that gqsub should use auto stage back, unless on a shared path. |
| | --dry-run | | gqsub extension | | Instructs gqsub to do everything except the final submission. This is primarily intended as a debugging tool. |
| | --verbose | | gqsub extension | | |
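
To make the interaction between -t and the maxjobspersubmission setting from Chapter 3 more concrete, here is a minimal Python sketch of splitting a single range of array indices into tranches. The function name split_into_tranches is hypothetical, and this is not necessarily how gqsub itself performs the split; the limit of 50 is the default from Example 3.1.

```python
def split_into_tranches(first, last, max_jobs=50):
    """Split the inclusive index range first..last into tranches.

    Each tranche holds at most max_jobs indices, mirroring the
    maxjobspersubmission limit.
    """
    indices = list(range(first, last + 1))
    return [indices[i:i + max_jobs] for i in range(0, len(indices), max_jobs)]


# An array job with indices 1..120 becomes three submissions.
tranches = split_into_tranches(1, 120)
print([len(t) for t in tranches])
```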

Chapter 5. gqstat

Job monitoring

In combination with gqsub, gqstat provides a familiar display of current jobs.


Last modified Thu 24 September 2009