Tech Meeting Minutes 20110210
Present: Graeme, Mike, Wahid, Andy, David, Sam, Mark, Peter, Stuart
- Pheno have been very unhappy recently with grid performance. Mark and Graeme did some analysis of problems reported in tickets (https://gus.fzk.de/ws/ticket_info.php?ticket=63656, https://gus.fzk.de/ws/ticket_info.php?ticket=66080 and https://gus.fzk.de/ws/ticket_info.php?ticket=65857). Discussion was extensive but summarised as:
- At some point an upgrade of the WMS introduced a broken proxy renewal package (this is an inference from the fact that pheno were running fine last summer).
- As the WMSs are only sporadically used, the problems reported by David Winn (63656) were not correlated with this upgrade. (Peter stressed that pheno's use of the grid is bursty.)
- It took a long time to get to grips with this problem, exacerbated by the fact that the ticket frequently waited days for more input from one or other party (us, RAL, user).
- At no point was it apparent to us that the problem was a general one affecting pheno; rather, it seemed to be a problem experienced by one user only.
- The issue was finally resolved with an upgrade of the glite-security-proxyrenewal package.
- Peter R's attempts to 'revalidate' GridPP resources for pheno then ran foul of a widespread problem where the new GridPP VOMS server certificate was not installed at sites.
- The technical issues now finally seem to be sorted (https://gus.fzk.de/ws/ticket_info.php?ticket=66929, https://gus.fzk.de/ws/ticket_info.php?ticket=66928), but the pheno community has lost a lot of confidence in the grid.
- ACTION David/Sam. Test behaviour of WMS to see if it does still require the VOMS server's certificate or if the .lsc file is sufficient. Use the scotgrid VO for this.
- ACTION David/Mark. Decommission last lcg-CE at Glasgow (svr021), in part because this definitely does not work with .lsc files (there are also availability/reliability impacts).
- Generate similar actions for E & D once 2 CREAM CEs are installed at each site.
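For the .lsc actions above, a quick inventory of what trust material a node actually has installed may help when comparing services. The sketch below is only illustrative: it assumes the conventional layout of per-VO `.lsc` files under `<vomsdir>/<vo>/` and VOMS server certificates as `.pem` files under `<vomsdir>/` or `<vomsdir>/<vo>/`. The function names and the host name in the usage note are hypothetical, not taken from any site configuration.

```python
from pathlib import Path


def voms_trust_material(vomsdir, vo):
    """List the VOMS trust files found for one VO under a vomsdir tree.

    Assumed layout (an assumption, not confirmed by the minutes):
      <vomsdir>/<vo>/<voms-host>.lsc      DN chain of the VOMS server
      <vomsdir>/*.pem, <vomsdir>/<vo>/*.pem   the server certificate itself
    """
    root = Path(vomsdir)
    vo_dir = root / vo

    lsc, pem = [], set()
    if root.is_dir():
        # Certificates installed at the top level apply to every VO.
        pem.update(p.name for p in root.glob("*.pem"))
    if vo_dir.is_dir():
        lsc = sorted(p.name for p in vo_dir.glob("*.lsc"))
        pem.update(p.name for p in vo_dir.glob("*.pem"))

    return {"lsc": lsc, "pem": sorted(pem)}


def lsc_only(material):
    """True when .lsc files are present and no certificate is installed."""
    return bool(material["lsc"]) and not material["pem"]
```

On a test node this might be called as `voms_trust_material("/etc/grid-security/vomsdir", "scotgrid")`, assuming that is where the vomsdir lives. A service that still works when `lsc_only(...)` is true does not need the certificate, for example with only a hypothetical `voms.example.org.lsc` installed.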
- Network to/from RAL tested this week with 3 pairs of machines.
- Traffic from RAL->Glasgow never gets above 1Gb/s and there is very strange correlated network traffic seen on vm003(?!).
- Achieved 3Gb/s easily from Glasgow->RAL.
This is a significant breakthrough in demonstrating a real problem. Investigations continue with Colin Cooper in network services. More tests will be done in the network at risk window next Tuesday.
- Sam now feels confident that data is better spread over the disk servers at Glasgow.
- ACTION Raise ATLAS pilot job limit to 800 and review in a week.
- Wahid has been trying to run a HammerCloud test at ECDF, but there were problems with lack of data, lack of slots and downtime this week. He will continue to try to get a good test done. Graeme suggested that 150-200 concurrent jobs running well would be sufficient to validate the cluster.
- Mark: will post his diabolical plan for the year on the wiki.
- We will review cross-site access next week, but access for Peter to the Glasgow UI will be arranged.
- Please exchange IM details with Durham people.
- Next meeting: Feb 17, http://indico.cern.ch/conferenceDisplay.py?confId=127245.