Tech Meeting Minutes 20110922

Scotgrid Tech Meeting Minutes 22 Sept 2011

Present:

Chair: Mark Mitchell
Minutes: Sam Skipsey
Peanut Gallery: Robert Harrington
Agenda:- http://indico.cern.ch/conferenceDisplay.py?confId=148162
- ECDF

One ticket against ECDF.
Rob tried to close it, but was bamboozled by the list of options he had to pick from. The amazing powers of the Scotgrid Tech were bent to its solution.

Rob still has to restart the VM CREAM CE daily to keep it working. Generally holding down the fort. Last week, he discovered that no more than 8 jobs can be submitted to the nodes with more than 8 cores, due to VMEM limits. qstat indicated that a lot more VMEM was being used than we expected.
(This is a historical problem that Sam remembers - VMEM consumption metrics are often huge over-estimates, due to the handling of shared libraries and copy-on-write pages.) Rob will talk to Orlando more.
ECDF will become a T2D site and a multi-cloud site, which should help to even out the job load on ECDF a bit (clouds are not totally correlated in task load).
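
On the VMEM point above, a rough sketch of why a per-job VMEM limit can cap a node at 8 jobs even when it has more cores, and why summed VMEM figures can look far larger than what is really in use. All numbers and function names here are made up for illustration; they are not ECDF's actual limits or anything qstat reports.

```python
# Illustrative only: the core count, memory sizes and per-job requests below
# are hypothetical, not ECDF's real configuration.

def max_job_slots(cores, node_vmem_gb, per_job_vmem_gb):
    """Jobs that fit on a node when each job reserves per_job_vmem_gb."""
    return min(cores, int(node_vmem_gb // per_job_vmem_gb))

def summed_vmem_gb(private_gb, shared_gb, n_jobs):
    """Naive accounting: every job is charged for the shared mappings."""
    return n_jobs * (private_gb + shared_gb)

def actual_footprint_gb(private_gb, shared_gb, n_jobs):
    """Shared libraries and untouched copy-on-write pages are held only once."""
    return n_jobs * private_gb + shared_gb

# A hypothetical 12-core node with a 24 GB VMEM limit and 3 GB reserved per job.
slots = max_job_slots(cores=12, node_vmem_gb=24, per_job_vmem_gb=3)
reported = summed_vmem_gb(private_gb=1.0, shared_gb=2.0, n_jobs=slots)
actual = actual_footprint_gb(private_gb=1.0, shared_gb=2.0, n_jobs=slots)
print(slots, reported, actual)
```

With these invented numbers the node stops at 8 jobs on 12 cores, and the summed VMEM is well over twice the memory actually held, which is the kind of over-estimate Sam describes.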

- A problem with CASTOR at RAL was interfering with getting more jobs (the ATLAS pilot system won't send more jobs your way if you're maxed out on jobs in the transferring state; this is to avoid backlogs getting worse and breaking FTS transfers).

Rob is also going to try increasing the logging level on the CREAM CE.

(There was some discussion about the horror of CREAM CEs, and Daniela's tendency to restart CREAM CEs every 4 hours.) There is a move to stop using virtual instances for CEs, but Ewan seems very pro-VM. Anecdotal evidence concerning CEs on VMs does rather suggest that VMs are not suited to the network and I/O load of this kind of use case.

--- Mark: Process Engineering and systems evaluations; would Rob be happy to conduct such an analysis for ECDF?
Rob is interested in doing so, and is taking notes and learning about the systems as he goes.
Mark would be interested in talking about this at the meeting planned in London (QMUL), comparing the various organisational cases (QMUL vs Glasgow network connectivity, etc), so that we can make a recommendation on standard practices and reasonable divergences.


--

Durham has problems.

--

Glasgow seems mostly functional. Water ingress yesterday morning at the top of the building led to a small amount of flooding in 141 (which had no operational impact, luckily).
Disks 058 and 068 are behaving a bit oddly, and are being investigated.

CVMFS migration will happen soonish. Durham should be moved to CVMFS as part of fixing it. Glasgow will be moving to CVMFS at the start of October. ECDF's migration is in progress.


- AOB:
CHEP papers recommended.
Need to fix the squid communication issue between Glasgow and Edinburgh.
We need to get a big community support effort together for Mike @ Durham to help him restore Durham from the ashes.
Next week, Mark will be away, and so will Sam.

- Rob mentioned the possibility of making Scotgrid a distributed T2 (Glasgow SE being closeSE for Edinburgh, and vice versa).
Mark thinks it's worthwhile looking into - the issue is the poor link Gla <-> Ed. (There's no direct routed link that avoids Warrington, etc).

Rob was added to the exalted echelons of ScotGrid Tech Support Skype Group membership.



Chat contents:

[11:10:22] Sam Skipsey Hi Robert. It's a little like waiting for godot in here.
[11:11:12] Robert Harrington indeed
[11:12:19] Mark Mitchell https://ggus.eu/ws/ticket_info.php?ticket=74278


[11:26:26] Robert Harrington http://grid.pd.infn.it/cream/field.php?n=Main.RelevantLogFiles
[11:50:34] Robert Harrington rdharrington
[11:54:44] Robert Harrington left
[11:54:48] Mark Mitchell left