ScotGrid Technical Meetings
From ScotGrid
Revision as of 10:43, 17 May 2012: David crooks
Current revision: Mark mitchell, 2013
Meeting Minutes
2013
- Tech Meeting Minutes 20130201
- Tech Meeting Minutes 20130901
- Tech Meeting Minutes 20131601
- Tech Meeting Minutes 20132301
- Tech Meeting Minutes 20133001
- Tech Meeting Minutes 20130602
- Tech Meeting Minutes 20131302
- Tech Meeting Minutes 20132002
- Tech Meeting Minutes 20132702
- Tech Meeting Minutes 20130603
- Tech Meeting Minutes 20131303
- Tech Meeting Minutes 20132003
- Tech Meeting Minutes 20132703
- Tech Meeting Minutes 20130304
- Tech Meeting Minutes 20131004
- Tech Meeting Minutes 20131704
- Tech Meeting Minutes 20132404
- Tech Meeting Minutes 20130105
- Tech Meeting Minutes 20130805
- Tech Meeting Minutes 20131505
- Tech Meeting Minutes 20132205
- Tech Meeting Minutes 20132905
2012
- Tech Meeting Minutes 20121201
- Tech Meeting Minutes 20121901
- Tech Meeting Minutes 20122601
- Tech Meeting Minutes 20120202
- Tech Meeting Minutes 20120902
- Tech Meeting Minutes 20121602
- Tech Meeting Minutes 20122302
- Tech Meeting Minutes 20120103
- Tech Meeting Minutes 20120803
- Tech Meeting Minutes 20121503
- Tech Meeting Minutes 20122203
- Tech Meeting Minutes 20122903
- Tech Meeting Minutes 20120504
- Tech Meeting Minutes 20121204
- Tech Meeting Minutes 20121904
- Tech Meeting Minutes 20122604
- Tech Meeting Minutes 20120305
- Tech Meeting Minutes 20121005
- Tech Meeting Minutes 20121705
- Tech Meeting Minutes 20122405
- Tech Meeting Minutes 20120706
- Tech Meeting Minutes 20121406
- Tech Meeting Minutes 20122106
- Tech Meeting Minutes 20122806
- Tech Meeting Minutes 20120507
- Tech Meeting Minutes 20121207
- Tech Meeting Minutes 20121907
- Tech Meeting Minutes 20122607
- Tech Meeting Minutes 20120208
- Tech Meeting Minutes 20120908
- Tech Meeting Minutes 20121608
- Tech Meeting Minutes 20122308
- Tech Meeting Minutes 20123008
- Tech Meeting Minutes 20120609
- Tech Meeting Minutes 20121309
- Tech Meeting Minutes 20122009
- Tech Meeting Minutes 20122709
- Tech Meeting Minutes 20120410
- Tech Meeting Minutes 20121110
- Tech Meeting Minutes 20121810
- Tech Meeting Minutes 20122510
- Tech Meeting Minutes 20120111
- Tech Meeting Minutes 20120811
- Tech Meeting Minutes 20121511
- Tech Meeting Minutes 20122211
- Tech Meeting Minutes 20122911
- Tech Meeting Minutes 20120612
- Tech Meeting Minutes 20121312
- Tech Meeting Minutes 20122012
- Tech Meeting Minutes 20122712
2011
- Tech Meeting Minutes 20111201
- Tech Meeting Minutes 20111124
- Tech Meeting Minutes 20111117
- Tech Meeting Minutes 20111110
- Tech Meeting Minutes 20111103
- Tech Meeting Minutes 20111027
- Tech Meeting Minutes 20111020
- Tech Meeting Minutes 20111013
- Tech Meeting Minutes 20111006
- Tech Meeting Minutes 20110929
- Tech Meeting Minutes 20110922
- Tech Meeting Minutes 20110915
- Tech Meeting Minutes 20110908
- Tech Meeting Minutes 20110901
- Tech Meeting Minutes 20110825
- Tech Meeting Minutes 20110818
- Tech Meeting Minutes 20110811
- Tech Meeting Minutes 20110804
- Tech Meeting Minutes 20110728
- Tech Meeting Minutes 20110721
- Tech Meeting Minutes 20110714
- Tech Meeting Minutes 20110707
- Tech Meeting Minutes 20110630
- Tech Meeting Minutes 20110623
- Tech Meeting Minutes 20110616
- Tech Meeting Minutes 20110609
- Tech Meeting Minutes 20110602
- Tech Meeting Minutes 20110526
- Tech Meeting Minutes 20110519
- Tech Meeting Minutes 20110512
- Tech Meeting Minutes 20110505
- Tech Meeting Minutes 20110428
- Tech Meeting Minutes 20110421
- Tech Meeting Minutes 20110414
- Tech Meeting Minutes 20110407
- Tech Meeting Minutes 20110324
- Tech Meeting Minutes 20110310
- Tech Meeting Minutes 20110303
- Tech Meeting Minutes 20110224
- Tech Meeting Minutes 20110217
- Tech Meeting Minutes 20110210
- Tech Meeting Minutes 20110203
- Tech Meeting Minutes 20110127
- Tech Meeting Minutes 20110120
- Tech Meeting Minutes 20110106
2010
- Tech Meeting Minutes 20101216
- Tech Meeting Minutes 20101209
- Tech Meeting Minutes 20101202
- Tech Meeting Minutes 20101125
- Tech Meeting Minutes 20101118
- Tech Meeting Minutes 20101111
- Tech Meeting Minutes 20101104
- Tech Meeting Minutes 20101028
- Tech Meeting Minutes 20101021
- Tech Meeting Minutes 20101013
- Tech Meeting Minutes 20100930
- Tech Meeting Minutes 20100923
- Tech Meeting Minutes 20100917
- Tech Meeting Minutes 20100720
- Tech Meeting Minutes 20100701
- Tech Meeting Minutes 20100624
- Tech Meeting Minutes 20100610
- Tech Meeting Minutes 20100603
- Tech Meeting Minutes 20100527
- Tech Meeting Minutes 20100520
- Tech Meeting Minutes 20100513
- Tech Meeting Minutes 20100507
Open Action List
From 2011-06-02
- IPPP: Upgrade Servers to SL5 and install Cream CE by the end of June 2011.
From 2011-03-24
- IPPP: Correct VO shares in batch system and in information system.
- Completed
From 2011-02-10
- Andy: Decommission lcg-CEs.
- Completed
- IPPP: Decommission lcg-CEs.
From 2011-01-27
- IPPP: Install CREAM CE.
From 2011-01-06
- IPPP: Implement auto-restart of daemons with cfengine.
- 2011-01-27 - In progress.
- 2011-06-09 - Completed.
From 2010-11-25
- Sam: Install pool account mapping clean-up script on DPM gridftp nodes at Glasgow (then publicise).
- 2011-01-27 - Installed on one server and being monitored.
From 2010-07-20
- Sam: redeploy current SL5 servers into DPM with ext4 filesystems.
- In progress - discussed Tech Meeting Minutes 20101125, Tech Meeting Minutes 20101216, Tech Meeting Minutes 20110120.
- Sam: to plan slow migration of SL4 boxes to SL5+ext4.
- In progress (waiting on new diskservers, to provide extra wiggle room for draining etc). See Tech Meeting Minutes 20101125, Tech Meeting Minutes 20101216, Tech Meeting Minutes 20110120.
- IPPP to investigate central syslog status for Durham grid nodes.
- 2011-01-27 - In progress.
Closed Actions
From 2011-03-03
- Graeme: Enable PD2P for ECDF.
- Done.
From 2011-02-10
- David/Sam: Test WMS behaviour with/without VOMS server cert and .lsc files, using scotgrid VO.
- Done: WMS still needs VOMS server certs.
- David/Mark: Decommission svr021 at Glasgow (last lcg-CE).
- Done: Took a nasty availability hit though!
- Sam: Raise ATLAS pilot job limit at Glasgow to 800.
- Done: and further raised to 1000 after review.
From 2011-01-27
- Wahid: Subscribe in more data and run HC tests to validate site and explore capacity of the system.
- 2011-02-10 making progress, but need a wider set of source datasets.
- Done and PD2P enabled for ECDF (see 2011-03-03) as well as ANALY_ queue set online.
From 2011-01-20
- Graeme: Send pilots to ce03 at ECDF. Validate 6GB VMEM request.
- Done and validated. Production pilots now going to this CE.
- Andy/Wahid: Investigate publishing problem in voviews which stops LHCb pilots arriving.
- Fixed. After the SGE upgrade, SSL certs were needed to query queue status, so the scripts needed to be changed. Sadly this is another ECDF-specific hack!
- Graeme: To contact Cedric and declare unavailable ATLAS files 'suspicious' for next week's disk server intervention.
- Done. However, it was clarified that in fact the DA tools do not look at the 'suspicious' metadata flag.
From 2010-12-16
- Andy: Customise VMEM parameters so grid jobs ask for 6GB VMEM.
- Done.
From 2010-12-09
- Graeme: Email Pete Clarke results of network discussion.
- Done.
- David: Fix LCG_GFAL_INFOSYS at Glasgow (RAL + CERN top BDIIs).
- Done.
- Sam: Raise case sensitivity of shift.conf with DPM developers.
- Done. J-P B does not want to do this - it would slow down shift.conf parsing and not 100% clear it would be correct. Add to zen of storage knowledge.
- Peter: Update biomed VOMS certs at Durham.
- Done.
- Stuart: Prevent ATLAS analysis jobs from running on old CV kit (small disk!)
- Done.
From 2010-11-25
- Peter/Mike: Install glite-APEL at Durham.
- Done.
- Wahid/Andy: Speak to Orlando about cvmfs.
- Done. Orlando likes the idea, but changes/maintenance may be difficult. Alternative plan is to use cvmfs on a m/w node and mount via nfs to cluster. This should be pursued.
- All: Review security incident handling procedure: https://documents.egi.eu/public/ShowDocument?docid=47.
- Done. Recommended to have this printed out and to hand. Most important point is to communicate quickly and effectively.
From 2010-11-18
- Wahid: Investigate squid/frontier failures at ECDF.
- Wahid discovered that the local/setup.sh file was empty at ECDF. Might need a new s/w area...
- Fixed.
- Wahid, Andy: Plan squid for ECDF in the medium term.
- Done and box now being commissioned (VM).
- Peter: Add GRIDPP to the grid tags published in the Durham BDII.
- Done.
- Wahid, Andy: Ensure ATLAS analysis pilots get a higher priority than production.
- Checked that this is set to 50/50, which is correct for a T2.
From 2010-11-11
- Mark: Check again Glasgow gstat publication of total CPU.
- Found a problem with 4 extra spaces in the published information which caused the gstat parser to reject one subcluster. Fixed.
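A stray-whitespace problem like the one found here can be caught before a parser rejects the data. The following is a minimal sketch only: it assumes the published information is available as LDIF-style `attribute: value` text, and the function name and sample attribute values are hypothetical illustrations, not the actual Glasgow configuration or the gstat parser's logic.

```python
def find_stray_whitespace(ldif_text):
    """Return (line_number, line) pairs whose value carries extra spaces.

    Assumes simple 'attribute: value' lines; comments and base64
    values ('attribute:: ...') are skipped.
    """
    flagged = []
    for number, line in enumerate(ldif_text.splitlines(), start=1):
        if line.startswith("#") or ":" not in line:
            continue
        _, value = line.split(":", 1)
        if value.startswith(":"):
            continue  # base64-encoded value, not checked in this sketch
        if value.startswith(" "):
            value = value[1:]  # drop the single conventional separator space
        if value != value.strip():
            flagged.append((number, line))
    return flagged
```

Running this over a dump of the published information before it goes live would list any line whose value is padded with additional spaces.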
From 2010-11-04
- Peter: Upgrade Durham to latest SL5 kernels.
- Done.
- Graeme: Send details of T2 pledges from GridPP spreadsheets (but note also, see Q3 report for totals).
- Done.
- Graeme: Send link to WLCG MoU, specifically T2 sites section.
- Done.
From 2010-09-30
- Stuart: Liaise with Andy about ECDF ARC setup. Only modest requirements, but needs also some infrastructure support from systems team (ports, submit rights).
- Not done. Now post-CHEP; to revisit later.
- David: Send disk burn in scripts to Wahid.
- Done.
- All: Inputs for ScotGrid poster.
- Done.
- Mark: Correct gstat publishing total for Glasgow.
- Done (see 2010-11-11 action).
From 2010-09-23
- Peter: Syncat dump of DPM to clean ATLAS dark data.
- Done.
- Mike: Decommission Glasgow top level BDII.
- Done.
- David: Convert glite-MON to glite-APEL at Glasgow.
- Done 2010-11-24. Seems to work fine.
- Mark: Convert one Glasgow lcg-CE CE to CREAM.
- Done. svr026 is now a dairy CE.
- Andy: Convert glite-MON to glite-APEL at Ed.
- Done Jan 2011, Edinburgh publishing via glite-APEL.
From 2010-09-17
- Sites to update security patching status on mailing lists.
- Done.
From 2010-07-20
- Sam: check replica table for further ghosts from drained servers.
- Done (the only cause was the recent drains, which was basically just disk044).
- Sam: increase replica numbers for DB release files (ATLASHOTDISK).
- Done (but probably not the root cause)
- Graeme: Raise running jobs to 200 again.
- Done - in fact now raised to 300.
- Mike J to ensure last week's security checks were done at Durham.
- Done.
- Mike J to close ATLAS tickets at Durham.
- Done.
- Mike J to check that nfs mounts are working on all nodes, especially n79 for LHCb ticket!
- Done.
- Mike J to check CAs and CRLs on all worker nodes at Durham.
- Done.
- Andy to check that ATLAS jobs run in local disk area, not on GPFS (N.B. no stdout/err from ECDF CEs so cannot be done from pilot logs).
- Done.
- Graeme: to change panda to use filestager instead of pilot copy at Glasgow.
- Under discussion with DA people.
- Done.
From 2010-06-24
- Sam try kernel VM settings to prevent SL5 DPM nodes from going into crisis.
- Partial success on disk045, but more scientific analysis of VM parameters will be done.
- Closed as understanding now is that this is an SL5+xfs problem, specifically.
- Sam try to find a way to reproduce rfio load outside ATLAS analysis job running.
- Suggested to trigger a large number of rfcp copies of different files from many worker nodes using pdsh.
- Done: Take disk044 out of service to test on.
- Graeme check status for checksums on UK SS box.
- They were indeed disabled for ECDF LOCALGROUPDISK.
- Mike ask Vladimir for a new LHCb upload test at Glasgow.
- Done and passed at 100%. See extensive discussions of SACK packets and iptables on TB-SUPPORT list.
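The rfio load test suggested above (many rfcp copies of different files, started from many worker nodes with pdsh) could be prepared along the following lines. This is a minimal sketch under stated assumptions: the node names, file paths, destination directory and function name are hypothetical, and the commands are only built, not executed. It relies on `pdsh -w <node> <command>` and `rfcp <source> <destination>` as the only external interfaces.

```python
import posixpath
import shlex


def build_rfcp_load_commands(nodes, source_files, dest_dir="/tmp"):
    """Pair each worker node with a different source file and return
    one pdsh command (as an argument list) per node."""
    if len(source_files) < len(nodes):
        raise ValueError("need at least one distinct source file per node")
    commands = []
    for node, src in zip(nodes, source_files):
        # Copy to a local file named after the source file.
        dest = posixpath.join(dest_dir, posixpath.basename(src))
        remote = "rfcp {} {}".format(shlex.quote(src), shlex.quote(dest))
        commands.append(["pdsh", "-w", node, remote])
    return commands
```

Each returned list can be handed to `subprocess.Popen` so that the copies start concurrently; giving every node its own file keeps the reads from being served out of a single cached copy.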
From 2010-06-10
- Sam: Update CREAM CEs at Glasgow.
- CREAM dev CE updated on vm007 (and vm006 CREAM dev CE sort of working with SGE now...). Leaving svr014 until I test vm007 a bit.
- (04/08/2010) Upgraded CREAM on svr014, partly to solve odd CREAM CE job load issues. Seems to be fine with the latest release.
- Handed over to Mark (action generated 2010-09-23).
- Graeme, Sam, Stuart: Better ARC cache at Glasgow
- Done. Stuart has an 8TB lustre cache now.
- Andy: Try to drain all jobs from mw05 to recover stability.
- CE back running ok after updating globus-gma
- Graeme: Investigate whether ECDF LOCALGROUPDISK is served by an ATLAS site services box.
- It wasn't, this was fixed.
- Wahid: Discuss biomed access at ECDF. Ticket 58909.
- No support for biomed at ECDF.
From 2010-06-03
- Graeme: Try to get https://gus.fzk.de/ws/ticket_info.php?ticket=58692 fixed with David Grellscheid at Durham.
- Sam fixed this remotely.
- Wahid: Chase up LOCALGROUPDISK ToA ticket, http://savannah.cern.ch/bugs/?67701.
- Now in ToA.
- Andy: Investigate the need for prologue/epilogue scripts with CREAM+SGE. Raise issue with Orlando.
- Orlando is very much against this. Andrew is pushing the CREAM developers for a solution avoiding this being necessary.
- Hardware for CREAM now available (ce3). Andy will set it up next week and we will start to try running jobs through it.
- 2010-07-20: CREAM installed, waiting for ports to be opened.
- 2010-09-23: Finish commissioning ECDF CREAM CE.
- 2010-09-30: Still waiting for CREAM CE to become authorised submitter to SGE.
- 2010-11-18: Send CREAM CE details to Graeme
- 2010-12-16: Closed, but follow up action to raise VMEM limit.
From 2010-05-27
- Dug rerun transfer upload test using LHCb proxy.
- Not needed.
- Dug ask Vladimir to re-test Glasgow with his tests.
- The tests pass - see later discussion on network congestion correlations.
- Andy monitor ATLAS running jobs to check this rises as expected.
- It did.
- Graeme and Wahid plan migration of ECDF SCRATCHDISK to Storm+GPFS
- Superseded by LOCALGROUPDISK plans.
From 2010-05-20
- Wahid Configure ToA for squid access at Edinburgh
- Done and passing tests.
- Graeme Ask Paul and Jose to test glexec for ATLAS at Glasgow
- Done - Jose is starting this.
- Wahid Open ECDF firewall for connections to Lancaster squid
- Looks broken. Wahid to chase this with ECDF people (https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResultLatest&nodename=mw05.ecdf.ed.ac.uk&vo=atlas&testname=CE-ATLAS-sft-Frontier-Squid).
- Was fixed a little mysteriously.
From 2010-05-13
- Andy Migrate LHCb s/w area to nfs
- Considered best if all s/w is reinstalled.
- LHCb and CMS have confirmed this.
- Now done (ECDF side), just waiting for reinstalls from CMS and LHCb. Done?
From 2010-05-07
- Dug Dump packet information from NATed and non-NATed connections at Glasgow to debug stalled LHCb transfers.
- Problematic rule in NAT configuration identified. Now removed and internal tests look fine. Asked LHCb to retest.
- Wahid Contact Peter about access to Lancaster squid from ECDF.
- Andy Compare LHCb VO views at ECDF with those at Glasgow and Durham.
- Done - some minor differences, but pilot jobs arriving so probably nothing significant.
- David Send Durham IP ranges to Mike for access to Glasgow squid.
- Done. Graeme to now get ToA configured.
- Mike Coordinate ScotGrid CHEP poster.
- Done. Final iteration then submit.
- Graeme Submit CHEP abstract on ScotGrid ARC.
- Done. Final iteration then submit.
- Andy Ticket 57042 should be marked solved.
- Ticket solved and verified 2010-05-20. Jobs now running.
