Tech Meeting Minutes 20110825
SCOTGRID TECH MEETING MINUTES 25 AUGUST 2011
Mark Mitchell (Chair)
Sam Skipsey (minutes)
David Crooks (Chaise-longue)
Andy Washbrook (Ottoman)
Mike Johnson (Recliner)
Robert Harrington (Cathedra)
The aircon was tested, and the power went off for 30 seconds. The compressor didn't come back on, so the room roasted - the worker nodes were turned off to reduce heat load. When they were turned back on, there were issues with ATLAS Production jobs - GGUS 73778.
73778 - still under investigation. ATLAS jobs are failing, but some local pheno users seem to be okay, as does biomed.
73773 - "Expected Output File" does not exist.
Dave suggested that we could include details about the error in the GGUS ticket.
Andy and Robert are concerned about the ELF error in the python setup. They suggest logging into a node interactively and trying to set up the ATLAS software manually to debug.
Some nodes are actually succeeding? n97 is the only one to have succeeded in running a job from ATLAS, but the only nodes which ran ATLAS (and failed) are 89 - 108.
It is possible that the shutdown of those nodes was unclean.
Suggested: bouncing the NFS mounts on those nodes.
Suggested: compare n97 to other nodes.
Mike will try bouncing NFS mounts on the broken nodes; Sam will try interactively sourcing the ATLAS setup scripts on a broken node. Mike may try rebuilding a node, but this is harder now that he has been left with a non-functional kickstart.
Empathy was expressed for Mike's situation.
Mark suggested considering, in a quiet period, making Durham's software infrastructure more to Mike's liking, and more consistent.
Robert complained about the arctic temperatures in Edinburgh compared to his native Texas.
Nothing is combusting.
Andy has been collating the CREAM-SGE errors at Edinburgh, looking ahead to the possible UK response to the CREAM developers.
There are only one or two issues affecting availability, but Andy is in contact with the developers (who are responsive).
He is looking into CVMFS and Squid. Some of the looser documentation is a little contrary to the official documentation. He is doing a test run now, and will get some of the systems team's time next week.
Andy is also available for remote testing of Durham.
Robert noted that the job that passed was using a different release to the ones that failed. So, we might just need to get the broken release 16.6.4 build fixed (Alessandro/etc could revalidate and reinstall it if broken).
Glasgow is bringing back the shambling zombie hordes of old nodes to improve our CPU provision. These are in the same rack as our test rack for IPv6, which we will be testing today.
Dave: mostly fairly quiet on an event level. Mostly just the old worker nodes being set up correctly (and triaged for functionality). These are the oldest generation of 6-year-old 4-core WNs.
Sam mentioned the puzzling case of the CVMFS nodes and the Maui.
We also had a visit from the Directorate of the Chinese National Academy of Sciences about e-Science things.
- GridPP 27 slides
Mark encouraged all sites to submit slides for the ScotGrid talk by the end of next week (as Mark will be in the wilds of CERN the week after).
- CHEP Paper submissions
Be sure to submit things appropriately.
There will be no tech meeting next Thursday, due to the internal Glasgow PP Year Review.
Email, or bring issues up in the Ops meeting if we need to discuss things.
-- Chat log:
[10:59:15] Mark Mitchell joined
[11:00:33] IPPP1 UofDurham joined
[11:01:43] IPPP1 UofDurham Don't think you'd have heard me anyway since I'm having sound issues as per usual...
[11:02:37] Andrew Washbrook joined
[11:02:41] Sam Skipsey Well, try speaking again, just in case.
[11:03:06] Sam Skipsey So, I'm getting a tiny wee sound like a gust of breeze.
[11:03:11] Sam Skipsey But that's it.
[11:03:21] IPPP1 UofDurham OK, cheers back in a sec
[11:03:27] IPPP1 UofDurham left
[11:03:35] David Crooks joined
[11:05:23] IPPP1 UofDurham joined
[11:06:21] Robert Harrington joined
[11:13:06] David Crooks http://panda.cern.ch/server/pandamon/query?job=*&site=UKI-SCOTGRID-DURHAM&type=production&jobStatus=failed&hours=12
[11:31:18] Andrew Washbrook breaking news from Edinburgh - Orlando has booked me in next week for CVMFS testing, will report on progress next week
[11:31:41] Mark Mitchell Good Stuff
[11:34:54] IPPP1 UofDurham OK no problem
[11:34:58] Andrew Washbrook should be fine
[11:41:47] Sam Skipsey CHEP 2011 submission deadline: 30 September 2011
[11:41:57] Andrew Washbrook get off!