ScotGrid Technical Meetings

Meeting Minutes

2013

  • Tech Meeting Minutes 20130201
  • Tech Meeting Minutes 20130901
  • Tech Meeting Minutes 20131601
  • Tech Meeting Minutes 20132301
  • Tech Meeting Minutes 20133001
  • Tech Meeting Minutes 20130602
  • Tech Meeting Minutes 20131302
  • Tech Meeting Minutes 20132002
  • Tech Meeting Minutes 20132702
  • Tech Meeting Minutes 20130603
  • Tech Meeting Minutes 20131303
  • Tech Meeting Minutes 20132003
  • Tech Meeting Minutes 20132703
  • Tech Meeting Minutes 20130304
  • Tech Meeting Minutes 20131004
  • Tech Meeting Minutes 20131704
  • Tech Meeting Minutes 20132404
  • Tech Meeting Minutes 20130105
  • Tech Meeting Minutes 20130805
  • Tech Meeting Minutes 20131505
  • Tech Meeting Minutes 20132205
  • Tech Meeting Minutes 20132905

2012

  • Tech Meeting Minutes 20121201
  ...
  • Tech Meeting Minutes 20122203
  • Tech Meeting Minutes 20122903
  • Tech Meeting Minutes 20120504
  • Tech Meeting Minutes 20121204
  • Tech Meeting Minutes 20121904
  • Tech Meeting Minutes 20122604
  • Tech Meeting Minutes 20120305
  • Tech Meeting Minutes 20121005
  • Tech Meeting Minutes 20121705
  • Tech Meeting Minutes 20122405
  • Tech Meeting Minutes 20120706
  • Tech Meeting Minutes 20121406
  • Tech Meeting Minutes 20122106
  • Tech Meeting Minutes 20122806
  • Tech Meeting Minutes 20120507
  • Tech Meeting Minutes 20121207
  • Tech Meeting Minutes 20121907
  • Tech Meeting Minutes 20122607
  • Tech Meeting Minutes 20120208
  • Tech Meeting Minutes 20120908
  • Tech Meeting Minutes 20121608
  • Tech Meeting Minutes 20122308
  • Tech Meeting Minutes 20123008
  • Tech Meeting Minutes 20120609
  • Tech Meeting Minutes 20121309
  • Tech Meeting Minutes 20122009
  • Tech Meeting Minutes 20122709
  • Tech Meeting Minutes 20120410
  • Tech Meeting Minutes 20121110
  • Tech Meeting Minutes 20121810
  • Tech Meeting Minutes 20122510
  • Tech Meeting Minutes 20120111
  • Tech Meeting Minutes 20120811
  • Tech Meeting Minutes 20121511
  • Tech Meeting Minutes 20122211
  • Tech Meeting Minutes 20122911
  • Tech Meeting Minutes 20120612
  • Tech Meeting Minutes 20121312
  • Tech Meeting Minutes 20122012
  • Tech Meeting Minutes 20122712

2011

2010

Open Action List

From 2011-06-02

  • IPPP: Upgrade servers to SL5 and install CREAM CE by the end of June 2011.

From 2011-03-24

  • IPPP: Correct VO shares in batch system and in information system.
    • Completed

From 2011-02-10

  • Andy: Decommission lcg-CEs.
    • Completed
  • IPPP: Decommission lcg-CEs.

From 2011-01-27

  • IPPP: Install CREAM CE.

From 2011-01-06

  • IPPP: Implement auto-restart of daemons with cfengine.
    • 2011-01-27 - In progress.
      • 2011-06-09 - Completed.

From 2010-11-25

  • Sam: Install pool account mapping clean-up script on DPM gridftp nodes at Glasgow (then publicise).
    • 2011-01-27 - Installed on one server and being monitored.

From 2010-07-20

Closed Actions

From 2011-03-03

  • Graeme: Enable PD2P for ECDF.
    • Done.

From 2011-02-10

  • David/Sam: Test WMS behaviour with/without VOMS server cert and .lsc files, using scotgrid VO.
    • Done: WMS still needs VOMS server certs.
  • David/Mark: Decommission svr021 at Glasgow (last lcg-CE).
    • Done: Took a nasty availability hit though!
  • Sam: Raise ATLAS pilot job limit at Glasgow to 800.
    • Done: and further raised to 1000 after review.

From 2011-01-27

  • Wahid: Subscribe in more data and run HC tests to validate site and explore capacity of the system.
    • 2011-02-10 - Making progress, but need a wider set of source datasets.
    • Done and PD2P enabled for ECDF (see 2011-03-03) as well as ANALY_ queue set online.

From 2011-01-20

  • Graeme: Send pilots to ce03 at ECDF. Validate 6GB VMEM request.
    • Done and validated. Production pilots now going to this CE.
  • Andy/Wahid: Investigate publishing problem in voviews which stops LHCb pilots arriving.
    • Fixed. After the SGE upgrade, SSL certs were needed to query queue status, so the scripts needed to be changed. Sadly this is another ECDF-specific hack!
  • Graeme: To contact Cedric and declare unavailable ATLAS files 'suspicious' for next week's disk server intervention.
    • Done. However, it was clarified that in fact the DA tools do not look at the 'suspicious' metadata flag.

From 2010-12-16

  • Andy: Customise VMEM parameters so grid jobs ask for 6GB VMEM.
    • Done.

From 2010-12-09

  • Graeme: Email Pete Clarke results of network discussion.
    • Done.
  • David: Fix LCG_GFAL_INFOSYS at Glasgow (RAL + CERN top BDIIs).
    • Done.
  • Sam: Raise case sensitivity of shift.conf with DPM developers.
    • Done. J-P B does not want to do this - it would slow down shift.conf parsing and it is not 100% clear it would be correct. Add to zen of storage knowledge.
  • Peter: Update biomed VOMS certs at Durham.
    • Done.
  • Stuart: Prevent ATLAS analysis jobs from running on old CV kit (small disk!)
    • Done.

From 2010-11-25

  • Peter/Mike: Install glite-APEL at Durham.
    • Done.
  • Wahid/Andy: Speak to Orlando about cvmfs.
    • Done. Orlando likes the idea, but changes/maintenance may be difficult. Alternative plan is to use cvmfs on a m/w node and mount via nfs to cluster. This should be pursued.
  • All: Review security incident handling procedure: https://documents.egi.eu/public/ShowDocument?docid=47.
    • Done. Recommended to have this printed out and to hand. Most important point is to communicate quickly and effectively.

From 2010-11-18

  • Wahid: Investigate squid/frontier failures at ECDF.
    • Wahid discovered that the local/setup.sh file was empty at ECDF. Might need a new s/w area...
    • Fixed.
  • Wahid, Andy: Plan squid for ECDF in the medium term.
    • Done and box now being commissioned (VM).
  • Peter: Add GRIDPP to the grid tags published in the Durham BDII.
    • Done.
  • Wahid, Andy: Ensure ATLAS analysis pilots get a higher priority than production.
    • Checked that this is set to 50/50, which is correct for a T2.


From 2010-11-11

  • Mark: Check again Glasgow gstat publication of total CPU.
    • Found a problem with 4 extra spaces in the published information which caused the gstat parser to reject one subcluster. Fixed.

From 2010-11-04

  • Peter: Upgrade Durham to latest SL5 kernels.
    • Done.
  • Graeme: Send details of T2 pledges from GridPP spreadsheets (but note also, see Q3 report for totals).
    • Done.
  • Graeme: Send link to WLCG MoU, specifically T2 sites section.
    • Done.

From 2010-09-30

  • Stuart: Liaise with Andy about ECDF ARC setup. Only modest requirements, but also needs some infrastructure support from systems team (ports, submit rights).
    • Not done. Now post-CHEP; to revisit later.
  • David: Send disk burn in scripts to Wahid.
    • Done.
  • All: Inputs for ScotGrid poster.
    • Done.
  • Mark: Correct gstat publishing total for Glasgow.
    • Done (see 2010-11-11 action).

From 2010-09-23

  • Peter: Syncat dump of DPM to clean ATLAS dark data.
    • Done.
  • Mike: Decommission Glasgow top level BDII.
    • Done.
  • David: Convert glite-MON to glite-APEL at Glasgow.
    • Done 2010-11-24. Seems to work fine.
  • Mark: Convert one Glasgow lcg-CE to CREAM.
    • Done. svr026 is now a dairy CE.
  • Andy: Convert glite-MON to glite-APEL at Ed.
    • Done Jan 2011, Edinburgh publishing via glite-APEL.

From 2010-09-17

  • Sites to update security patching status on mailing lists.
    • Done.

From 2010-07-20

  • Sam: check replica table for further ghosts from drained servers.
    • Done (the only cause was the recent drains, which were basically just disk044).
  • Sam: increase replica numbers for DB release files (ATLASHOTDISK).
    • Done (but probably not the root cause)
  • Graeme: Raise running jobs to 200 again.
    • Done - in fact now raised to 300.
  • Mike J to ensure last week's security checks were done at Durham.
    • Done.
  • Mike J to close ATLAS tickets at Durham.
    • Done.
  • Mike J to check that nfs mounts are working on all nodes, especially n79 for LHCb ticket!
    • Done.
  • Mike J to check CAs and CRLs on all worker nodes at Durham.
    • Done.
  • Andy to check that ATLAS jobs run in local disk area, not on GPFS (N.B. no stdout/err from ECDF CEs so cannot be done from pilot logs).
    • Done.
  • Graeme: to change panda to use filestager instead of pilot copy at Glasgow.
    • Under discussion with DA people.
    • Done.

From 2010-06-24

  • Sam: Try kernel VM settings to prevent SL5 DPM nodes from going into crisis.
    • Partial success on disk045, but more scientific analysis of VM parameters will be done.
    • Closed as understanding now is that this is an SL5+xfs problem, specifically.
  • Sam: Try to find a way to reproduce rfio load outside ATLAS analysis job running.
    • Suggested to trigger a large number of rfcp copies of different files from many worker nodes using pdsh.
    • Done: Take disk044 out of service to test on.
  • Graeme: Check status for checksums on UK SS box.
    • They were indeed disabled for ECDF LOCALGROUPDISK.
  • Mike: Ask Vladimir for a new LHCb upload test at Glasgow.
    • Done and passed at 100%. See extensive discussions of SACK packets and iptables on TB-SUPPORT list.

From 2010-06-10

  • Sam: Update CREAM CEs at Glasgow.
    • CREAM dev CE updated on vm007 (and vm006 CREAM dev CE sort of working with SGE now...). Leaving svr014 until I test vm007 a bit.
    • (04/08/2010) Upgraded CREAM on svr014, partly to solve odd CREAM CE job load issues. Seems to be fine with the latest release.
    • Handed over to Mark (action generated 2010-09-23).
  • Graeme, Sam, Stuart: Better ARC cache at Glasgow
    • Done. Stuart has an 8TB lustre cache now.
  • Andy: Try to drain all jobs from mw05 to recover stability.
    • CE back running ok after updating globus-gma
  • Graeme: Investigate whether ECDF LOCALGROUPDISK is served by an ATLAS site services box.
    • It wasn't; this has been fixed.
  • Wahid: Discuss biomed access at ECDF. Ticket 58909.
    • No support for biomed at ECDF.

From 2010-06-03

  • Graeme: Try to get https://gus.fzk.de/ws/ticket_info.php?ticket=58692 fixed with David Grellscheid at Durham.
    • Sam fixed this remotely.
  • Wahid: Chase up LOCALGROUPDISK ToA ticket, http://savannah.cern.ch/bugs/?67701.
    • Now in ToA.
  • Andy: Investigate the need for prologue/epilogue scripts with CREAM+SGE. Raise issue with Orlando.
    • Orlando is very much against this. Andrew is pushing the CREAM developers for a solution avoiding this being necessary.
    • Hardware for CREAM now available (ce3). Andy will set it up next week and we will start to try running jobs through it.
    • 2010-07-20: CREAM installed, waiting for ports to be opened.
    • 2010-09-23: Finish commissioning ECDF CREAM CE.
    • 2010-09-30: Still waiting for CREAM CE to become authorised submitter to SGE.
    • 2010-11-18: Send CREAM CE details to Graeme
      • 2010-12-16: Closed, but follow up action to raise VMEM limit.

From 2010-05-27

  • Dug: Rerun transfer upload test using LHCb proxy.
    • Not needed.
  • Dug: Ask Vladimir to re-test Glasgow with his tests.
    • The tests pass - see later discussion on network congestion correlations.
  • Andy: Monitor ATLAS running jobs to check this rises as expected.
    • It did.
  • Graeme and Wahid: Plan migration of ECDF SCRATCHDISK to Storm+GPFS.
    • Superseded by LOCALGROUPDISK plans.

From 2010-05-20

From 2010-05-13

  • Andy: Migrate LHCb s/w area to nfs.
    • Considered best if all s/w is reinstalled.
    • LHCb and CMS have confirmed this.
    • Now done (ECDF side), just waiting for reinstalls from CMS and LHCb. Done?

From 2010-05-07

  • Dug: Dump packet information from NATed and non-NATed connections at Glasgow to debug stalled LHCb transfers.
    • Problematic rule in NAT configuration identified. Now removed and internal tests look fine. Asked LHCb to retest.
  • Wahid: Contact Peter about access to Lancaster squid from ECDF.
  • Andy: Compare LHCb VO views at ECDF with those at Glasgow and Durham.
    • Done - some minor differences, but pilot jobs arriving so probably nothing significant.
  • David: Send Durham IP ranges to Mike for access to Glasgow squid.
    • Done. Graeme to now get ToA configured.
  • Mike: Coordinate ScotGrid CHEP poster.
    • Done. Final iteration then submit.
  • Graeme: Submit CHEP abstract on ScotGrid ARC.
    • Done. Final iteration then submit.
  • Andy: Ticket 57042 should be marked solved.
    • Ticket solved and verified 2010-05-20. Jobs now running.