Troubleshooting Guide
From ScotGrid
It would be nice if everything just worked. And some days it does. This page is for those other days.
This page lists the more common error messages, sorted by what you were attempting to do, and explains how to resolve them.
GSISSH
GSISSH prompts for password
heisenberg:~ sdjp$ gsissh -p 2222 -k svr020.gla.scotgrid.ac.uk
sdjp@svr020.gla.scotgrid.ac.uk's password:
Normally this means that your grid proxy is not valid. GSISSH is a modified SSH client, so if it can't access the proxy it runs home to mama and defaults to password-based authentication. Given that we don't accept password authentication, this is not terribly helpful.
You can get GSISSH to report the problem by adding the -v flag and wading through the large quantity of diagnostics.
heisenberg:~ sdjp$ gsissh -v -p 2222 -k svr020.gla.scotgrid.ac.uk
... [Lots of output skipped] ...
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with gss context
globus_gsi_gssapi: Error with GSI credential
globus_gsi_gssapi: Error with gss credential handle
globus_credential: Error with credential: The proxy credential: /tmp/x509up_u502
with subject: /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie/CN=821973805
expired 40 minutes ago.
... [More output here] ...
debug1: Next authentication method: password
sdjp@svr020.gla.scotgrid.ac.uk's password:
Run voms-proxy-init, or, if using a small install of Globus for a client, grid-proxy-init.
job-submit or job-list-match
Both of these are together because they are very similar - job-submit must do everything job-list-match does, and then one extra step.
'Proxy File Not Found' or 'Proxy credential has expired'
-bash-3.00$ glite-wms-job-list-match -a test.jdl
Error - Proxy validity Error Your proxy credential has expired
The cause is fairly self-explanatory (but see below for a similar problem that isn't). Both of these problems can be resolved by generating a fresh proxy certificate:
voms-proxy-init -voms vo.scotgrid.ac.uk
'Unable to delegate credential'
Warning - Unable to delegate the credential to the endpoint: https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server User not authorized: unable to check credential permission (/opt/glite/etc/glite_wms_wmproxy.gacl) (credential entry not found) credential type: person input dn: /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie
This means that the VO you are using isn't authorised to use the WMS you are contacting. Whilst there can be a number of causes, the most common is that your proxy is missing the VOMS extensions (hence the WMS can't authorise you).
Check using
voms-proxy-info -all
and if there is no VO section (or the VO section has expired), then run
voms-proxy-init -voms vo.scotgrid.ac.uk
again.
'I/O Error Couldn't find a valid proxy certificate'
1. Missing delegation instructions
-bash-3.00$ glite-wms-job-list-match --rank test.jdl
Error - I/O Error Couldn't find a valid proxy certificate
The command is missing the '-a' flag, which requests automatic delegation. Add the '-a' flag, and all will be well.
2. Parameter ordering problems
-bash-3.00$ glite-wms-job-submit -a -o jid -e https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server test.jdl
Error - I/O Error Couldn't find a valid proxy certificate
This error message is misleading. The problem here is that job-submit has got itself confused over where to find the proxy certificate, and as a result is failing.
This problem can be worked around by reordering the parameters to job-submit:
-bash-3.00$ glite-wms-job-submit -a -e https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server -o jid test.jdl
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
Alternatively, don't specify the WMS, and instead use the default failover mechanism. This might not be an option if you are using a remote UI, but svr020 should work fine without a WMS being specified.
- This is ticketed at https://gus.fzk.de/ws/ticket_info.php?ticket=45364
- Once resolved, and the update is installed, this entry should be removed.
'Missing Information a mandatory attribute is missing'
-bash-3.00$ glite-wms-job-list-match test.jdl
Error - Missing Information a mandatory attribute is missing: --delegationid, -d <id_string> to use a proxy previously delegated or --autm-delegation, -a to perform automatic delegation
For the WMS to submit jobs as you, you need to give it permission to act on your behalf; this is called 'delegation'. The command line tools demand that you specify how you want this accomplished.
In general, you will want to use automatic delegation, and thus this error probably means you missed off the '-a' flag.
'The server is temporarily drained'
This should be an unusual one. The one time I saw it was:
-bash-3.00$ glite-wms-job-list-match -a test.jdl
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
Error - Operation failed Unable to perform the operation: Unavailable service (the server is temporarily drained) Method: jobListMatch
This indicates that the WMS that job-list-match used is in the process of being shut down. Fortunately, there are redundant WMSs in place, so on the second try it should just work. Note that job-submit does automatic failover, whereas job-list-match does not (at least, for the moment; in principle they both ought to, so this might change in the future).
'Wrong type caught for attribute'
-bash-3.00$ glite-wms-job-list-match -a test.jdl
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
Error - InputSandbox: wrong type caught for attribute
This error refers to a problem with the JDL file. In this case, I had
Executable = "testexe";
Arguments = "1";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {testexe};
OutputSandbox = {"test.out", "test.err"};
VirtualOrganisation = "vo.scotgrid.ac.uk";
The lack of double quotes around testexe on the InputSandbox line is the problem here. Note that the first part of the error message is the name of the stanza that the problem was with.
'Specified path 'blah' is missing'
-bash-3.00$ glite-wms-job-list-match -a test.jdl
Connecting to the service https://svr023.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
Error - InputSandbox: Specified path 'testexe' is missing
This problem occurs after the JDL has been successfully parsed. It indicates that a file listed in InputSandbox is missing. The paths are taken relative to the current working directory that job-list-match or job-submit is run from, not relative to the JDL file itself; this has caught me out on occasion.
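A quick pre-submission check can catch this. The Python sketch below is only an illustration: the function name `missing_sandbox_files` is hypothetical, and it assumes a simple JDL in which InputSandbox is a brace-delimited list of double-quoted local paths (no wildcards or remote URLs).

```python
import os
import re


def missing_sandbox_files(jdl_text, submit_dir):
    """Return InputSandbox entries that do not exist relative to submit_dir.

    Assumption: InputSandbox is written as {"a", "b"} with quoted local paths.
    """
    match = re.search(r'InputSandbox\s*=\s*\{(.*?)\}', jdl_text, re.DOTALL)
    if match is None:
        return []
    paths = re.findall(r'"([^"]*)"', match.group(1))
    # Paths are resolved against the submission directory,
    # not against the directory that holds the JDL file.
    return [p for p in paths if not os.path.exists(os.path.join(submit_dir, p))]
```

Calling it with the contents of the JDL file and `os.getcwd()` lists the entries that job-list-match would be unable to find from the current directory.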
'Not a valid ClassAd' or 'The following parsing error(s) have been found:'
-bash-3.00$ glite-wms-job-list-match -a badjob.jdl
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
Error -
The following parsing error(s) have been found:
Not a valid ClassAd
or
-bash-3.00$ glite-wms-job-list-match -a test.jdl
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
Error -
The following parsing error(s) have been found:
ClassAd utils - cannot parse classad: [Executable = "test"; Arguments = "1";
StdOutput = "test.out"; StdError = "test.err" InputSandbox = {"../../test"};
OutputSandbox = {"test.out", "test.err"}; VirtualOrganisation = "vo.scotgrid.ac.uk";
Requirements = other.GlueCEUniqueID == "svr021.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m"; ]
This indicates one of several kinds of syntax problem in the JDL file.
The term 'ClassAd' arises because JDL is based on the Condor Classified Advertisement language. Therefore the error is saying the JDL file is invalid.
Unfortunately, there isn't a lot of detail given in the event of a parse failure. The first example above was produced by an empty file, whilst the second one is missing a semicolon at the end of the StdError stanza.
The debugging procedure that I use is:
1. Check that there is a semicolon on the end of each stanza, including the last.
2. Check that any relational operators are correct - in particular, that there isn't a single '=' where there should be a '=='.
3. Check that for every '{' there is a '}' and that there is an even number of double quotes.
4. Save a copy of the file, and delete everything that is not absolutely essential (in particular, Requirements and Rank stanzas). Check the file using job-list-match, and if it works add the stanzas back one by one until you find the one that's problematic.
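The first three checks are mechanical enough to script. The following Python sketch is not part of the gLite tools; `check_jdl` is a hypothetical helper, and it assumes one stanza per line and no braces, quotes or '=' characters inside string values, so a Requirements expression spread over several lines will produce false warnings.

```python
def check_jdl(jdl_text):
    """Return a list of human-readable warnings for common JDL slips.

    Assumption: one stanza per line; string values contain no '{', '}', '"' or '='.
    """
    warnings = []
    if not jdl_text.strip():
        warnings.append("file is empty")
    if jdl_text.count("{") != jdl_text.count("}"):
        warnings.append("unbalanced braces")
    if jdl_text.count('"') % 2 != 0:
        warnings.append("odd number of double quotes")
    for number, line in enumerate(jdl_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        if not stripped.endswith(";"):
            warnings.append("line %d: missing semicolon" % number)
        # Everything after the first '=' is the value of the stanza.
        name, _, value = stripped.partition("=")
        if name.strip().lower() in ("requirements", "rank"):
            # Remove the valid comparison operators, then look for a stray '='.
            leftover = value
            for op in ("=?=", "=!=", "==", "!=", "<=", ">="):
                leftover = leftover.replace(op, "")
            if "=" in leftover:
                warnings.append("line %d: single '=' in expression" % number)
    return warnings
```

An empty list means only that none of these particular slips was found; job-list-match remains the real test.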
'nodes: unable to complete the operation: the attribute has not been initialised yet'
This message refers to the use of a 'Collection' type job. However, as the WMS converts a 'Parametric' job into a collection internally, this can also be generated in response to a parametric job request.
With a collection job, the way to correct this is to add a 'Nodes' attribute.
For a parametric job, the error is generated if the specification of the parameters would result in no jobs being generated. This can happen if the 'Parameter' attribute (representing the end value) is the same as 'ParameterStart'. In this case, the 'Parameter' is exclusive of the given value, rather than inclusive. This is quite different from every other job management framework. Normally, you'll need to increase 'Parameter' by one to get the expected behaviour.
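Python's range() happens to use the same exclusive-end convention, so it makes a convenient model of the behaviour described here. This is only an illustration of the arithmetic, not the WMS's actual expansion code; the function name and the `step` argument are assumptions made for the example.

```python
def parametric_values(start, end, step=1):
    """Values a parametric job would run over, treating `end` as exclusive."""
    return list(range(start, end, step))


# Same start and end value: no jobs at all, which is the error case above.
print(parametric_values(1, 1))
# Increasing the end value by one yields the single job that was intended.
print(parametric_values(1, 2))
```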
'Resource temporarily unavailable'
-bash-3.00$ glite-wms-job-list-match -a test.jdl
Connecting to the service https://svr023.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
Error - Operation failed Unable to perform the operation: End of file or no input: Resource temporarily unavailable Error code: SOAP-ENV:Client
Although there can be several fundamental reasons for this, the most common is that the WMS that was selected was highly loaded, and couldn't service the request before a timeout happened (in the specific example above, it took eight minutes before the error was given). Such a load normally occurs because someone has just submitted a very large collection of jobs to the WMS, rather than splitting them into smaller batches that can be spread over all the available WMSs.
Whatever its cause, this is easy to deal with: try again, and either you'll get a different WMS or the problem might have resolved itself.
It's ignoring a part of the JDL!
Only stanzas that are recognised are checked. Unknown stanzas are silently ignored. One consequence of this is that a mistyped stanza heading will result in no error message, but that whole stanza being ignored.
One particular stanza where this might happen is the 'Arguments' stanza. When passing just a single argument to the program, it's tempting to use the singular form, 'Argument', which is incorrect and will result in no arguments being passed to the program.
If you think that a stanza of your JDL file is being ignored, check the exact spelling of the stanza name against the manual.
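A spelling check of this kind can be approximated with a short script. In the Python sketch below, `unknown_attributes` is a hypothetical helper, and `KNOWN_ATTRIBUTES` contains only the names that appear on this page, so it is far from the full list in the manual. It also assumes that each stanza starts on its own line and that names can be compared case-insensitively.

```python
import difflib
import re

# Attribute names mentioned on this page only; NOT the full list from the manual.
KNOWN_ATTRIBUTES = [
    "Executable", "Arguments", "StdOutput", "StdError", "InputSandbox",
    "OutputSandbox", "VirtualOrganisation", "Requirements", "Rank",
    "RetryCount", "ShallowRetryCount",
]


def unknown_attributes(jdl_text, known=KNOWN_ATTRIBUTES):
    """Return (name, suggestion) pairs for stanza names not found in `known`.

    Assumption: each stanza starts on its own line as `Name = value;`.
    """
    lookup = {name.lower(): name for name in known}
    results = []
    for line in jdl_text.splitlines():
        match = re.match(r"\s*([A-Za-z_]\w*)\s*=", line)
        if match is None:
            continue
        name = match.group(1)
        if name.lower() in lookup:
            continue
        # Suggest the closest known name, if any is reasonably similar.
        close = difflib.get_close_matches(name.lower(), list(lookup), n=1)
        results.append((name, lookup[close[0]] if close else None))
    return results
```

A name reported with a suggestion is probably a typo; a name reported with no suggestion may simply be missing from the short list above.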
job-status
Job Aborted
-bash-3.00$ glite-wms-job-status -i jid
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://svr022.gla.scotgrid.ac.uk:9000/42cZjLhcUBrU49Z4X0CLfQ
Current Status: Aborted
Logged Reason(s):
- Got a job held event, reason: Globus error 131: the user proxy expired (job is still running)
- Job got an error while in the CondorG queue.
Status Reason: Job proxy is expired.
Destination: svr021.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m
Submitted: Thu Jan 15 17:00:51 2009 GMT
*************************************************************
This indicates that the proxy certificate you had when you submitted the job expired whilst the job was running. For shorter jobs, this can often be avoided by checking the remaining lifetime with voms-proxy-info -all before submitting:
-bash-3.00$ voms-proxy-info --all
subject : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie/CN=proxy
issuer : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie
identity : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie
type : proxy
strength : 512 bits
path : /tmp/x509up_u218058
timeleft : 11:59:54
=== VO vo.scotgrid.ac.uk extension information ===
VO : vo.scotgrid.ac.uk
subject : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie
issuer : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk
attribute : /vo.scotgrid.ac.uk/Role=NULL/Capability=NULL
timeleft : 11:59:54
Here, the timeleft field is given as HH:MM:SS (hours, minutes, then seconds). If you need a certificate that is valid for a certain length of time, you can use the '-valid' flag to voms-proxy-init:
-bash-3.00$ voms-proxy-init -voms vo.scotgrid.ac.uk -valid 196:00
Enter GRID pass phrase:
Your identity: /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie
Creating temporary proxy ....................................................... Done
Contacting svr029.gla.scotgrid.ac.uk:15000 [/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk] "vo.scotgrid.ac.uk" Done
Warning: svr029.gla.scotgrid.ac.uk:15000: The validity of this VOMS AC in your proxy is shortened to 86400 seconds!
Creating proxy .................................... Done
Your proxy is valid until Sat Jan 24 18:13:36 2009
Unfortunately, many VOs limit the lifetime of the VO extension they sign: in the example above it's restricted to 86400 seconds (one day).
Even more unfortunately, voms-proxy-init reports by default how long the certificate will last, but not how long the VO signature will last (hence the use of voms-proxy-info -all above).
The above voms-proxy-init gives:
-bash-3.00$ voms-proxy-info --all
subject : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie/CN=proxy
issuer : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie
identity : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie
type : proxy
strength : 512 bits
path : /tmp/x509up_u218058
timeleft : 195:59:58
=== VO vo.scotgrid.ac.uk extension information ===
VO : vo.scotgrid.ac.uk
subject : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=stuart purdie
issuer : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk
attribute : /vo.scotgrid.ac.uk/Role=NULL/Capability=NULL
timeleft : 23:59:58
where you can see the effect the VO limit has.
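Since it is the shorter of the two timeleft values that matters for a running job, a small script can pull both out of the voms-proxy-info -all output and report the minimum. The Python sketch below works on the text of that output only; the function names are hypothetical, and it assumes every timeleft value is printed as HH:MM:SS, as in the listings on this page.

```python
import re


def timeleft_seconds(proxy_info_text):
    """Return every `timeleft` value found in the text, in seconds, in order."""
    pattern = re.compile(r"timeleft\s*:\s*(\d+):(\d+):(\d+)")
    return [
        int(h) * 3600 + int(m) * 60 + int(s)
        for h, m, s in pattern.findall(proxy_info_text)
    ]


def usable_seconds(proxy_info_text):
    """Time until the first of the proxy or the VO extension expires (0 if none found)."""
    values = timeleft_seconds(proxy_info_text)
    return min(values) if values else 0
```

For the listing with a proxy timeleft of 195:59:58 and a VO extension timeleft of 23:59:58, `usable_seconds` returns 86398, just under one day.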
The way to resolve this problem is to use MyProxy to store your credentials, and allow it to renew the VO proxy.
- See MyProxy guide for this. FIXME: Link this...
Job stuck with 'BrokerHelper: no compatible resources'
This is typically caused by using an exact match on a particular queue in the JDL. In this case, if the WMS fails to submit the job to the CE first time, it goes into a retry mode, and should prefer to find other CEs for the job. I say 'should' because of [Bug 28235 (https://savannah.cern.ch/bugs/?28235)], which means that it actually requires a different CE. Coupled with an exact match for a CE, this results in the job staying in limbo.
The best workaround is not to specify exact matches in JDL files. Even if the location where the job has to run is fixed, it might be possible to allow it to use a queue with a longer time limit, as a backup. This might mean it is queued for a little longer, but it should then run. By way of example, instead of
Requirements = other.GlueCEUniqueID == "svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q1d";
one could use
Requirements = RegExp("UKI-SCOTGRID-GLASGOW", other.GlueCEUniqueID)
&& other.GlueCEPolicyMaxCPUTime > 260;
Where MaxCPUTime is specified in minutes on a machine rated 1000 on SpecInt2000. This will allow the WMS to route a job into a longer queue, if there is a problem with the initial submission, which sidesteps the bug, with the downside of potentially being queued for longer.
Sometimes, however, that's not possible, in which case the recommended course of action is to disable resubmission by the WMS. Do this by putting:
RetryCount = 0;
ShallowRetryCount = 0;
in your JDL file (both lines required - there are two sorts of resubmission available.) This will prevent the job from reaching a stuck stage, and instead force it to fail. While not terribly pretty, this does mean that it will be clear what jobs need to be manually resubmitted.
- Once the bug is fixed, this entry should be removed.
job-output
Unfortunately, most problems with job-output are actually problems with the initial job description that weren't discovered at submission time. This means that discovery and correction may well require resubmission of the problematic job.
Directory already exists
-bash-3.00$ glite-wms-job-output --dir output https://svr022.gla.scotgrid.ac.uk:9000/UQh5u9B-egbrVfZld0wT7w
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
Warning - Directory already exists: /clusterhome/home/gla058/testjob/bulkTest/tmp/output
Do you wish to overwrite it ? [y/n]n : y
Pretty straightforward. Note that if you choose to overwrite, other files in the directory that are not part of the OutputSandbox are not affected.
The tool expects to create the directory itself, mostly to ensure that you don't accidentally overwrite previous data.
'JobPurging is not allowed'
-bash-3.00$ glite-wms-job-output --dir output https://svr023.gla.scotgrid.ac.uk:9000/Ctf31bq222q5uq7Ff8Qw2w
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
Warning - JobPurging not allowed
(The Operation is not allowed: Unable to complete job purge)
================================================================================
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
https://svr023.gla.scotgrid.ac.uk:9000/Ctf31bq222q5uq7Ff8Qw2w
have been successfully retrieved and stored in the directory:
/clusterhome/home/gla058/testjob/bulkTest/tmp/output
================================================================================
This is not actually a problem. It's just a warning that it can't perform certain aspects of cleanup. In general, it is safe to ignore this - your job has completed fine.
For local use, it's actually down to a bug in the WMS. See [GGUS Ticket #44136 (https://gus.fzk.de/ws/ticket_info.php?ticket=44136)] for more details.
'No output files to be retrieved for the job:'
-bash-3.00$ glite-wms-job-output --dir output https://svr022.gla.scotgrid.ac.uk:9000/JvHgw2I86puA-A6YNs4BOQ
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
================================================================================
JOB GET OUTPUT OUTCOME
No output files to be retrieved for the job:
https://svr022.gla.scotgrid.ac.uk:9000/JvHgw2I86puA-A6YNs4BOQ
================================================================================
Sometimes this will be perfectly expected - if you are pulling data in from Storage, and returning it to there. However, most of the time, this represents an error, as a computation with no output isn't terribly helpful.
Note that it does not necessarily mean that your computation yielded no output - if your JDL specifies to record and return the StdOutput and StdError, then those files will exist (even if of zero size).
The example above was generated with a JDL file of
Executable = "testexe";
Arguments = "1";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"testexe"};
OutputSandbox = {"test.out" "test.err"};
VirtualOrganisation = "vo.scotgrid.ac.uk";
where there is a comma missing between the two files in OutputSandbox. Accordingly, the Compute Element didn't understand that the files were desired, and thus didn't send them back. (If the file names in OutputSandbox specify files that were not created on the CE, then they will not exist, and therefore not be sent back. If you always get the StdOutput of your job sent back, it makes it very easy to detect this case.)
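Because this mistake passes submission silently, it is worth checking for before submitting. The Python sketch below looks for brace-delimited lists in which two quoted items are not separated by a comma. `lists_missing_commas` is a hypothetical helper, and it assumes list items are double-quoted strings with no embedded quotes or braces.

```python
import re


def lists_missing_commas(jdl_text):
    """Return the brace-delimited lists where two quoted items lack a comma between them.

    Assumption: list items are double-quoted strings without embedded quotes or braces.
    """
    suspicious = []
    for body in re.findall(r"\{(.*?)\}", jdl_text, re.DOTALL):
        # Remove the quoted items and inspect only the text between them.
        separators = re.split(r'"[^"]*"', body)[1:-1]
        if any("," not in sep for sep in separators):
            suspicious.append("{" + body + "}")
    return suspicious
```

Run against the JDL above, it reports the OutputSandbox list; with the comma restored it reports nothing.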
Some of my output files are missing!
When the files are returned from job-output, they are all placed in one directory. You can retrieve files generated in multiple directories, but this flattening keeps only the file names, not the directories they came from.
The consequence of this is that the job with JDL:
Executable = "doMySums.sh";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"doMySums.sh"};
OutputSandbox = {"std.out", "std.err", "dirA/fileOne", "dirB/fileOne", "dirA/fileTwo"};
VirtualOrganisation = "vo.scotgrid.ac.uk";
is not going to do what is expected.
Four files will be returned, and placed in the output directory. These are
-bash-3.00$ ls output
fileOne std.err std.out fileTwo
The 'fileOne' is the file from 'dirB'. When the files are returned, they are returned in the order specified, so that the 'fileOne' from dirA is written, then overwritten with the 'fileOne' from dirB. (Technically, this is not guaranteed, and may change. This explanation is more for understanding than something you should rely on.) 'fileTwo' is unaffected by this, and is present as expected.
The way to solve this is not to try to return multiple files with the same name. If you have control of the filenames that are written, this should be straightforward to arrange.
If you cannot change the names of the files that your job generates, then you will have to write a wrapper script that renames the files after the main job has completed and before they are returned. The OutputSandbox should then reference the renamed files.
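Name clashes of this kind are easy to detect from the OutputSandbox list before submitting. The Python sketch below groups the listed paths by file name and reports any name used more than once. Both function names are hypothetical, and the underscore-based renaming is just one possible scheme that a wrapper script could apply, not anything the WMS does itself.

```python
import posixpath


def basename_collisions(sandbox_paths):
    """Map each file name that occurs more than once to the paths that share it."""
    by_name = {}
    for path in sandbox_paths:
        by_name.setdefault(posixpath.basename(path), []).append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


def flattened_name(path):
    """One hypothetical renaming scheme: fold the directory into the file name."""
    return path.replace("/", "_")
```

For the OutputSandbox in the example above, `basename_collisions` reports that 'fileOne' is shared by 'dirA/fileOne' and 'dirB/fileOne', and `flattened_name` would rename them to 'dirA_fileOne' and 'dirB_fileOne'.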
