Tech Meeting Minutes 20100610
Present: Graeme, Sam, Andrew, Wahid
- Serious load problems on mw05 at ECDF: memory exhaustion. It seems that globus-gatekeeper processes are not exiting properly (hanging in a FIN_WAIT state). No clear suspect: network change upstream? Bad clients? Bad state inside the gatekeeper? Will put in a cron job to kill stale gatekeepers. A CMS user submitted 2000 jobs (lcg-CEs do not like that). Suggestion to drain the CE, but this cannot be done at the queue level; need to force the state to "Draining" (http://goc.grid.sinica.edu.tw/gocwiki/How_to_close_the_site_so_it_won%27t_receive_anymore_jobs_from_the_RBs). Could also gut the CE by hand, but this will certainly lose running jobs.
- SRM fell over on DPM - now fixed (GGUS 58938).
- GGUS ticket for ECDF from Biomed about SE access (58909). Biomed is not in fact enabled at ECDF. Might consider enabling it, but there are worries about the extra usage eventually being 'charged'. ACTION: Wahid to discuss with Phil.
- Transfers to the new LOCALGROUPDISK area at ECDF are not working. Graeme thinks this is because the site services box needs to be configured to pick up this DDM endpoint. ACTION on Graeme.
- Sam continues to look at SL5 load issues at Glasgow. Will try to reproduce the problem with offline servers.
- ECDF storage has arrived.
- Should consider better ARC cache at Glasgow now that basic functionality is established. ACTION: Graeme, Stuart and Sam to discuss.
- CREAM CEs at Glasgow might need upgrade. ACTION: Sam.
- Next meeting in two weeks (June 24): http://indico.cern.ch/conferenceDisplay.py?confId=98094
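
Sketch for the cron job mentioned under the mw05 item: a minimal Python example, not the script actually deployed. It assumes stale gatekeepers can be recognised by age alone; the 6 hour threshold, the function names, and the rule that a process with parent PID 1 is the listening daemon (and must be left alone) are all illustrative assumptions. It relies on `ps` reporting elapsed time as [[dd-]hh:]mm:ss and on Linux truncating the command name to 15 characters.

```python
import os
import signal
import subprocess

STALE_AFTER_SECONDS = 6 * 3600       # assumed threshold, not from the minutes
PROCESS_NAME = "globus-gatekeeper"   # ps "comm" shows only the first 15 chars


def parse_etime(etime):
    """Convert a ps elapsed time, [[dd-]hh:]mm:ss, into seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:            # pad missing hours (and minutes) with 0
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def stale_pids(ps_output, name=PROCESS_NAME, max_age=STALE_AFTER_SECONDS):
    """Return PIDs of matching processes older than max_age seconds.

    Expects lines of "pid ppid etime comm" without a header row.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        pid, ppid, etime, comm = fields
        if comm != name[:15]:
            continue
        if ppid == "1":              # assumption: this is the master daemon
            continue
        if parse_etime(etime) > max_age:
            pids.append(int(pid))
    return pids


def kill_stale_gatekeepers(dry_run=True):
    """List stale gatekeepers and, unless dry_run, send them SIGTERM."""
    output = subprocess.check_output(
        ["ps", "-eo", "pid=,ppid=,etime=,comm="], universal_newlines=True
    )
    pids = stale_pids(output)
    for pid in pids:
        if dry_run:
            print("would kill %d" % pid)
        else:
            os.kill(pid, signal.SIGTERM)
    return pids
```

Running `kill_stale_gatekeepers()` with the default `dry_run=True` only prints the candidate PIDs, which is a sensible first step before letting cron call it with `dry_run=False`. If the gatekeeper is started from inetd rather than as a daemon, the parent PID rule would need rethinking.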