Tech Meeting Minutes 20100624

From ScotGrid

Present: Graeme, Sam, Andrew, Wahid, Mike

Indico: http://indico.cern.ch/conferenceDisplay.py?confId=98094

Open Problems

  • Load problems on mw05 at ECDF have been mitigated by two changes: blacklisting a particular CMS user who was polling aggressively, and updating the globus-gma component on the CE. Things seem to be OK, but the underlying cause of the issue was not definitively proven.
  • Andrew has spoken to Orlando about the CREAM job wrapper for SGE. Orlando is very much against adding job prologue/epilogue scripts to the single ECDF queue, which is understandable. There is a patch to avoid the prologue, but nothing for the epilogue, which means that stdout/err cannot be returned. Andrew will continue to press the CREAM developers for an alternative. If needed, we can raise this as an issue at the GDB via GridPP.
  • The current configuration of StoRM at ECDF, which uses an NFS mount, cannot return checksums, so transfers fail. ACTION: Graeme to double check that checksums are disabled for transfers to LOCALGROUPDISK at ECDF. The new storage will be mounted directly via GPFS. ECDF are going to hand the management of the new storage (beyond the base OS) to the grid team. Wahid thinks it can be deployed in ~1 month. Graeme noted that ATLAS data distribution policy is now proportional to a site's delivered storage.
  • LHCb uploads at Glasgow are now much more reliable after Stuart tuned the network settings on the WNs. There is a feeling that this is only mitigation and that the underlying issue is just being dodged for the moment. But at least it's working again. ACTION: Mike to ask Vladimir for a new test.
  • SLC5 disk servers at Glasgow continue to misbehave. Lancaster observe a similar problem on one node only. Sam emailed the DPM forum and received a reply from Maarten Litmaath about a kernel bug which can be mitigated by changing some VM sysctl settings. Sam will try this and also try to find a way to reproduce the problem without depending on ATLAS running analysis jobs at a very high rate. (ACTION on Sam)
  • Durham: we really need better documentation in order to support them properly.

Infrastructure

  • glexec: Jose is testing. The basic identity switch works, but there were problems with the ATLAS setup for real jobs, which was too OSG-specific.

AOB