Tech Meeting Minutes 20110224

Present: Graeme, Mike, Wahid, Andy, David, Sam, Michael, Stuart

http://indico.cern.ch/conferenceDisplay.py?confId=129019

Backlink: http://www.scotgrid.ac.uk/wiki/index.php/ScotGrid_Technical_Meetings

Site Issues

ECDF

  • Sunday outages:
    • Middleware racks 'blew up'. Rack switches needed to be replaced.
    • Disk pool server suffered hardware errors from Saturday night.
      • Server was reporting a CPU 'missing' - required firmware update and reboot.
  • Monday was a planned outage for power work. Coincidence...?
    • Lots of racks had been covered in dust, but actual intervention went ok.
  • CREAM CE
    • Load spiked, with lots of stuck jobs from mw.
    • Lease manager is trying to clean out dead jobs - might need manual purge of db.
    • Stuart: Maybe the same issue seen at Glasgow (BLAH loses jobs). Tuning of MySQL helps. But maybe this is SGE related?

Durham

  • Bad audio so could not hear Mike.
    • Please look at tickets and ask for help if necessary.

Glasgow

  • Decommissioned last lcg-CE (svr021).
    • Took slightly longer because of extra publishing which was done through this machine.
    • Suffered a downtime hit when the machine was in final draining - could have been avoided by taking it out of the GOC earlier.
      • No objections from GridPP for correcting the availability figures.
  • Network
    • Colin changed network path to avoid the bad module, but lost connectivity because of a misconfiguration (now fixed).
      • Looks less congested
      • Test again next Tuesday.
  • ATLAS Sonar
    • Alessandra/Sam tuning SACK - YAIM tuning for disk servers is very old and probably not optimal.
    • Will need to inject more transfers.
  • Moving SL4 DPM nodes to SL5.
  • Register for storage workshop!
  • Jamie Ferguson is working on some improved monitoring.
  • Have a second ARC CE to test LCMAPS. Will run other ARC CE for Andrej/ATLAS.

Other Topics

glexec

  • Andy will discuss with Orlando (glexec has been security audited twice).
  • Glasgow should look at ARGUS in a spare moment(!).

Whole Node Queues

  • Graeme will submit a bug to the Condor people for option passing to CREAM.

AOB