Tech Meeting Minutes 20111020
Present: Sam, Andy and David
Andy W - has been checking on a few things before talking to Orlando. We're happy with the use of sudo. Need to be careful with automount and updating releases of cvmfs.
We're going to move ahead with deploying CVMFS.
Discovered that one of our CEs hasn't been publishing at all for the last few months. An error in the parser config was resulting in no event records being published from it. Andy has rerun that and fixed matters (should see a big jump in published work for ECDF in the next few days).
Dave asked about the ticket re: squids.
Andy: you need port 4301 open on the squid, but our edge-router is very restrictive. There's a ticket in to get it reconfigured - but this only breaks monitoring of the squid, not functionality.
Dave: Lingering network issues are causing general unreliability across the cluster. This will continue until the downtime for network work (scheduled for November).
enmr.eu ticket is interesting; their queries of the enmr software tags succeed only intermittently. We're tempted to get another set of test jobs submitted to see if we can get more information about the issue.
We're taking some time to bring our services back up to baseline releases (in a rolling fashion).
We've also had some accounting publishing issues. The CEs couldn't publish to the APEL DB because they'd run out of space (for the DB). This was masked by the parser also failing, due to exhausting the total file descriptor limit for root (we moved out our historical records from 2009 etc to make it better - APEL opens *all* files in the accounting directory when parsing, not one at a time!). As far as the metrics go, this affects LHCb more than ATLAS (ATLAS metrics are derived from their own records, not from APEL).
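As a rough illustration of the file descriptor problem above, the sketch below compares the number of files in a directory with the open-file limit of the current process, and shows the "one at a time" reading pattern that keeps only a single descriptor open. This is not APEL code: the function names `fd_headroom` and `count_lines_one_at_a_time` are hypothetical, and the directory passed in stands in for wherever the accounting records live.

```python
import os
import resource


def fd_headroom(directory):
    """Return (n_files, soft_limit, fits) for `directory`.

    `fits` is True when opening every regular file in the directory at once
    would stay under the soft open-file limit of this process. Descriptors
    already in use (stdin, logs, sockets) are ignored, so treat the answer
    as optimistic.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    n_files = sum(1 for entry in os.scandir(directory) if entry.is_file())
    if soft_limit == resource.RLIM_INFINITY:
        return n_files, soft_limit, True
    return n_files, soft_limit, n_files < soft_limit


def count_lines_one_at_a_time(directory):
    """Count lines across all files, holding only one file open at a time."""
    total = 0
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if not entry.is_file():
            continue
        # The `with` block closes each file before the next one is opened.
        with open(entry.path, "r", errors="replace") as handle:
            total += sum(1 for _ in handle)
    return total
```

Running `fd_headroom` against the accounting directory before and after archiving old records would show roughly how much margin the move bought, under the assumption that the parser's limit matches that of the process doing the check.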
Sam: We've also had a recurring issue with disk043's iowait, but that's because it's the only node in 141 without a channel bond (thanks to the networking issue), so it sometimes gets backlogs. It's now RDONLY to mitigate this a bit.
There are still problems at Durham. Of the 3 CEs: 1 lcg-CE is dead, while ce02 and cream02 both seem to be working.
There are problems with the worker nodes, which are curiouser and curiouser after some investigation by Stuart.
Mike emailed us to say that he's been looking at the switches for the nodes, and will also be updating to the "new" (current) lcg-CA release. There are ongoing local issues with failed servers at Durham (not grid) which complicate matters.
[10:58:32] Sam Skipsey Hey up, all. It's just Edinburgh and Glasgow today, I think. Apparently Mike has a conflicting meeting or something.
[11:01:51] David Crooks joined
[11:03:09] David Crooks left
[11:03:24] David Crooks joined
[11:03:29] Andrew Washbrook joined
[11:03:49] Andrew Washbrook hi - sorry I am late
[11:03:51] David Crooks Hi Andy
[11:03:57] David Crooks No worries