Tech Meeting Minutes 20110714
Present: Andy, Sam, Dave, Stuart, Robert, Mike and Mark
Mike now running the Durham site
Edinburgh have another gold star for no tickets
Glasgow: ongoing SARA fault still outstanding.
No tickets open for Durham.
Mike: slight issue with certificates (resolved).
One thing Mark noticed - on Steve Lloyd's network monitoring tests,
Durham has improved dramatically. What did you do?
Mike: we do have capacity problems. One thing that has changed is that we have no undergraduates around either.
Mark notes all of the sites have improved in the same situation (although Glasgow hasn't really noticed).
Mark suggested running some more iperf tests to see what the link is like now.
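A minimal sketch of the sort of iperf test Mark suggested, assuming iperf (version 2) is installed at both ends; the hostname and port are illustrative, not actual site machines:

```shell
# On the receiving end (e.g. a Durham test box -- hostname is hypothetical):
iperf -s -p 5001

# From the other site: a 30-second TCP throughput test with 4 parallel streams,
# reporting every 5 seconds. Parallel streams help expose whether a single
# flow or the link itself is the bottleneck.
iperf -c durham-testbox.example.ac.uk -p 5001 -t 30 -P 4 -i 5
```

Repeating the test in both directions, and during term time versus now, would show whether the improvement in Steve Lloyd's monitoring really tracks undergraduate load.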
It was also reiterated that we're a big happy ScotGrid family here.
10 weeks without a ticket that Mark noticed. (Andy came clean about
the ticket they resolved very rapidly.)
Andy: manic monday! Andy was also off then so Wahid fixed things. One
pool server hardware failure over weekend. BDII service collapsed and
refused to accept incoming connections (so ECDF was "down") on Sunday
night, and then again on Monday night. The BDII is a few versions old,
so perhaps an upgrade is in order.
Hardware for the pool server: Wahid has got Dell to ship a replacement part. This is the third time pool3 has failed, and it is part of the ongoing cycle of bug fixing with Dell (re: firmware updates).
Last week we were broker-offed for ATLAS Analy. Looks like a Panda misconfiguration for the new CREAM CE, but now resolved.
Still battling a little with the CREAM CEs. Trying the advanced copy of the BLAH-SGE, but Andy noticed a memory leak in it.
Weird authentication issue for certain users (Andy will pursue this offline with configuration file discussion)
Robert (hi!) is the ATLAS Future Computing guy and doing site level monitoring.
The SARA ticket is still open (but it is SARA's fault).
Discussion of the compound ATLAS prod and analy issues this morning.
(We were BROKEROFF for Analysis because PANDA thought we had a file that DDM hadn't put on us, causing tons of GangaRobot tests to fail; we were at the transferring-jobs limit for Production due to FTS throttling at RAL, so we couldn't get more production work. The former was resolved by talking to Peter Love.)
Stuart has Mikael who is his ARC-gqsub guy.
Stuart is trying to get the ARC plugged into the BDII.
Mark was at WLCG Workshop.
Hamburg was lovely.
Things of interest:
Whole node scheduling. No-one seems quite sure how this is going to work, precisely.
(Andy W: yes, there were a few open questions at the Future Computing workshop at Edinburgh a few weeks back. It needs proper queue configuration, especially with regard to backfill.)
Perhaps we should publicise that we're testing it at Edinburgh a bit better?
(Stuart: whole node scheduling is attractive to them because they can mix IO bound and CPU bound workloads on a single node, to get efficiency. But it isn't obvious that this is better than the random mixing from scheduling that we already get. And it's a small order effect.)
Changes to the data models for most of the experiments are due, but have not been finalised. One of the drivers is that ATLAS have tons of disk but not tons of data on it.
EMI overview. Suggestion we should leapfrog SL6 to SL7.
Data storage presentations. Everyone is looking to make the software, the monitoring, and the interconnects more efficient.
LHCONE presentation was… not as insane as it was originally. (Open access peering connect between the entire universe.)
PerfSONAR tests were mentioned (a US/BNL thing); perhaps we can do this in ScotGrid.
CVMFS@GLASGOW, but not quite turned on.