Tech Meeting Minutes 20111103


ScotGrid Technical Meeting 3 November 2011
Agenda: https://indico.cern.ch/conferenceDisplay.py?confId=148169

Present:
Mark (chair)
Sam (minutes)
Andy W
Stuart
David C
Robert H

Edinburgh:

Andy informs us that he is "not too bad".

LHCb ticket 75671: waiting for reply from Vladimir.
Spacetoken ticket: (this is a tracking ticket)
CVMFS: (this is a tracking ticket) Andy has done some testing with the systems team. We have one odd error (an initial "Transport endpoint not connected" error on installing on a node), which then goes away later. We sent a bug report to the developers, who have come up with some things to look at. It doesn't appear to be the Squid, as its load seems sensible. This is being investigated, as we're being ultra cautious.

SRM transfer ticket 75929: Got blacklisted overnight due to failing transfers. Seems to be a problem with VOMS authentication, with the DDM transfers using VOMS credentials signed by the BNL ATLAS VOMS server.
(Probably fixed by installing an lsc file for the BNL ATLAS VOMS server.)
Robert noted that this should be a problem for the entirety of Europe. (It turns out that the BNL server isn't a default outside the US.)

Robert noted that ECDF are looking at trying ATHENA-MP jobs on the cluster. It's a bit puzzling: our jobs don't run, but Douglas Smith's jobs do. We are also having problems with pilots for the multicore queue.

Glasgow:

David Winn ticket (75873): David Winn had some problems with jobs failing at multiple WMSs with "failed: LB query failed". So, of course, he ticketed Glasgow (!?). Some checking indicated that this was related to a bug in savannah against the ICE component in WMS/LB, which is fixed in… the not-in-production EMI release. We punted it over to DMSU, who confirmed that this was the case, so we really have to wait until the EMI release is in production before *every WMS in the world* can be fixed.


75320: David: this is the enmr problem. We're not quite clear on what the problem is: Christophe tries lcg-tags queries which fail intermittently and unpredictably (with both a normal and an sgm enmr role) on a sub-minute timescale. For us, with a vo.scotgrid role, we can query the enmr tags… with 100% reliability over an extended period. So, it looks like this is more likely to be a VOMS mapping issue of some kind, but we're still looking into it.

Mark: Now that we're finally out of the accounting period, we're going to go break some stuff and do network maintenance tomorrow. So, we'll be in downtime briefly. We also need to start thinking about procurements.


Durham:

Mike: 75488: the compchem ticket. Mike will take a bit of a look at the compchem configuration on the CEs.
CVMFS: (this is a tracking ticket) This is waiting on some hardware for the Squid. (Delayed due to HEPFORGE hardware breaking.) Mike will quote for some hardware if he can't find any old kit he can reuse.
Durham has seen the load on the site as a whole go up to about 70% from 10%, so jobs seem to be working. Mike has brought up a few more worker nodes into the cluster, and we're waiting to see if they're stable. Once again, time has been limited for grid work.

Alastair is now happy with Durham for ATLAS.


AOB:

We need to start looking at hardware. We have 7 weeks before Christmas to do this. All hardware must be installed + paid for by the end of March 2012. HEPSYSMAN/Core-Ops issues: let us (Mark/Stuart/David) know if you have anything you need brought up.
David: a quick note: ce02 at Durham seems to be a little unhappy, giving a GRAM error. (Last night, cream02 seemed to be exceptionally busy. We restarted tomcat, which caused it to be inaccessible due to even higher load for 60 seconds… and then seemed to fix matters.)


Chat log:

[10:59:45] Mark Mitchell right off for a cig
[11:00:57] Stuart Purdie joined
[11:01:02] Mark Mitchell andback
[11:01:05] Andrew Washbrook joined
[11:01:40] Mark Mitchell Can anyone here my audio that isn't sitting in the same room as me
[11:01:58] David Crooks joined
[11:02:01] Mike Johnson I think I'm having my own audio problems. As usual.
[11:02:13] David Crooks Sorry I'm a bit late in
[11:03:06] Andrew Washbrook is Peter Grandi got his own grid site by the way ? Peter Grandi
[11:03:16] Andrew Washbrook gridpp.for.sabi
[11:03:17] Andrew Washbrook ?
[11:03:45] Mike Johnson that sounds very much like peter - his personal website was sabi.co.uk
[11:04:40] Andrew Washbrook is that you Mark?
[11:06:16] Robert Harrington joined
[11:36:30] Robert Harrington left
[11:36:38] Robert Harrington joined
[11:37:15] Mike Johnson left
[11:37:16] Mark Mitchell left
[11:37:24] Stuart Purdie left
[11:37:39] David Crooks left
[11:39:58] Robert Harrington left
[11:39:59] Andrew Washbrook left