Tech Meeting Minutes 20110210

From ScotGrid

Present: Graeme, Mike, Wahid, Andy, David, Sam, Mark, Peter, Stuart

http://indico.cern.ch/conferenceDisplay.py?confId=126109

Backlink: http://www.scotgrid.ac.uk/wiki/index.php/ScotGrid_Technical_Meetings

Hot Topics

Pheno Problems

  • Pheno have been very unhappy recently with grid performance. Mark and Graeme did some analysis of problems reported in tickets (https://gus.fzk.de/ws/ticket_info.php?ticket=63656, https://gus.fzk.de/ws/ticket_info.php?ticket=66080 and https://gus.fzk.de/ws/ticket_info.php?ticket=65857). The discussion was extensive, but can be summarised as:
    • At some point an upgrade of the WMS introduced a broken proxy renewal package (this is an inference from the fact that pheno were running fine last summer).
    • As the WMSs are only sporadically used, the problems reported by David Winn (63656) were not correlated with this upgrade. (Peter stressed that pheno's use of the grid is bursty.)
    • It took a long time to get to grips with this problem, exacerbated by the fact that the ticket frequently waited days for more input from one or other party (us, RAL, user).
    • At no point was it apparent to us that the problem was a general one affecting pheno; rather, it seemed to be a problem experienced by one user only.
    • The issue was finally resolved with an upgrade of the glite-security-proxyrenewal package.
    • Peter R's attempts to 'revalidate' GridPP resources for pheno then fell foul of a widespread problem where the new GridPP VOMS server certificate was not installed at sites.
    • The technical issues now finally seem to be sorted (https://gus.fzk.de/ws/ticket_info.php?ticket=66929, https://gus.fzk.de/ws/ticket_info.php?ticket=66928), but the pheno community has lost a lot of confidence in the grid.
  • ACTION David/Sam. Test the behaviour of the WMS to see whether it still requires the VOMS server's certificate or whether the .lsc file is sufficient. Use the scotgrid VO for this.
  • ACTION David/Mark. Decommission last lcg-CE at Glasgow (svr021), in part because this definitely does not work with .lsc files (there are also availability/reliability impacts).
    • Generate similar actions for E & D once 2 CREAM CEs are installed at each site.
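
As a rough aid for the .lsc action above, the following is a minimal sketch (not a tested site tool) of how one might check which trust material a host holds for a VO. It assumes the usual layout, in which a .lsc file lives under a per-VO subdirectory of the vomsdir (commonly /etc/grid-security/vomsdir) and lists the VOMS server's certificate subject DN followed by its issuing CA's DN, while old-style server certificates sit in the vomsdir as .pem files. The function names and the example DNs are hypothetical.

```python
from pathlib import Path


def parse_lsc(text):
    """Return the DNs listed in the text of a .lsc file, in order.

    Assumption: the first DN is the VOMS server certificate subject and
    the following DN(s) are its issuer(s). Blank lines are ignored.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def vomsdir_trust_mode(vomsdir, vo):
    """Classify the trust material found for a VO.

    Returns "lsc", "cert", "both" or "none". This only inspects files;
    it says nothing about whether a given service honours .lsc files.
    """
    root = Path(vomsdir)
    vo_dir = root / vo
    has_lsc = vo_dir.is_dir() and any(vo_dir.glob("*.lsc"))
    has_pem = (root.is_dir() and any(root.glob("*.pem"))) or (
        vo_dir.is_dir() and any(vo_dir.glob("*.pem"))
    )
    if has_lsc and has_pem:
        return "both"
    if has_lsc:
        return "lsc"
    if has_pem:
        return "cert"
    return "none"


if __name__ == "__main__":
    # Hypothetical invocation for the scotgrid VO on a service node.
    print(vomsdir_trust_mode("/etc/grid-security/vomsdir", "scotgrid"))
```

A host reporting "lsc" only would be the interesting case for the WMS test: if jobs for the scotgrid VO still work there, the server certificate is presumably no longer needed.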

Glasgow Network

  • Network to/from RAL tested this week with 3 pairs of machines.
  • Traffic from RAL->Glasgow never gets above 1Gb/s and there is very strange correlated network traffic seen on vm003(?!).
  • Achieved 3Gb/s easily from Glasgow->RAL.

This is a significant breakthrough in demonstrating a real problem. Investigations continue with Colin Cooper in network services. More tests will be done in the network at-risk window next Tuesday.

ATLAS Analysis

  • Sam now feels confident that data is better spread over the disk servers at Glasgow.
    • ACTION Raise ATLAS pilot job limit to 800 and review in a week.
  • Wahid has been trying to run a HammerCloud test at ECDF, but there were problems with lack of data, lack of slots and downtime this week. He will continue to try to get a good test done. Graeme suggested that 150-200 concurrent jobs running well would be sufficient to validate the cluster.

AOB

  • Mark: will post his diabolical plan for the year on the wiki.
  • We will review cross-site access next week, but access for Peter to the Glasgow UI will be arranged.
  • Please exchange IM details with Durham people.