Tech Meeting Minutes 20100610

From ScotGrid

Present: Graeme, Sam, Andrew, Wahid

Indico: http://indico.cern.ch/conferenceDisplay.py?confId=97220

Open Problems

  • Serious load problems on mw05 at ECDF, leading to memory exhaustion. It seems that globus-gatekeeper processes are not exiting properly (hanging in a FIN_WAIT state). No clear suspect: network change upstream? Bad clients? Bad state inside the gatekeeper? Will put in a cron job to kill stale gatekeepers. A CMS user submitted 2000 jobs (lcg-CEs do not like that). Suggestion to drain the CE, but this cannot be done at the queue level; need to force the state to "Draining" (http://goc.grid.sinica.edu.tw/gocwiki/How_to_close_the_site_so_it_won%27t_receive_anymore_jobs_from_the_RBs). Could also gut the CE by hand, but this will certainly lose running jobs.
  • SRM fell over on DPM - now fixed (GGUS 58938).
  • GGUS ticket for ECDF from Biomed about SE access (58909). Biomed is not in fact enabled at ECDF. Might consider enabling it, but there are worries about the extra usage eventually being 'charged'. ACTION: Wahid to discuss with Phil.
  • Transfers to the new LOCALGROUPDISK area at ECDF are not working. Graeme thinks this is because the site services box needs to be configured to pick up this DDM endpoint. ACTION on Graeme.
  • Sam continues to look at SL5 load issues at Glasgow. Will try to reproduce the problem with offline servers.

Infrastructure

  • ECDF storage has arrived.
  • Should consider a better ARC cache at Glasgow now that basic functionality is established. ACTION: Graeme, Stuart and Sam to discuss.
  • CREAM CEs at Glasgow might need an upgrade. ACTION: Sam.

AOB