Glasgow Logbook 2008-wk37
Week Commencing 8th Sept 2008
Monday
Another weekend of intermittent SAM SE failures, reported as 'operation in progress'.
Pools look OK in dpm-queryconf
One anomaly is that /etc/shift.conf on svr040 contains the extra line
DPM PROTOCOLS rfio gsiftp
relevant?
can copy files to/from servers OK
20:49 - failed a SAM test (SRMv2-ATLAS-lcg-cp), but it wasn't our problem:
[BDII] sam-bdii.cern.ch:2170: Connection Timeout lcg_cp: Connection timed out
Tuesday
Upgraded to nagios 3.0.3 (from homebuilt RPM - woohoo). Re-ran ncg.pl - noticed that mw05 is no longer at ECDF.
made a simple permissions-check script in svr031:/var/cfengine/inputs/scripts/fix_opt_perms
Basically every directory under /opt should be 755, with the exception of:
drwx------   2 root root 16384 Sep  9  2007 /opt/lost+found
drwx------  10 root root  4096 Nov 23  2007 /opt/glite/yaim
drwxrwxrwt   2 root root  4096 Dec  9  2006 /opt/globus/tmp
so it's a nasty:
#!/bin/bash
# Reset every directory under /opt to 755 ...
find /opt -type d -exec chmod 755 {} \;
# ... then restore the three known exceptions.
chmod 700 /opt/lost+found
chmod 700 /opt/glite/yaim
chmod 1777 /opt/globus/tmp
Perhaps this should simply be a cfenginified process anyway?
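Before letting a script (or cfengine) rewrite modes wholesale, it may be worth listing which directories actually deviate. The following is a minimal report-only sketch, not the script on svr031: check_dir_perms is a hypothetical name, it assumes GNU find (for -printf), and it exempts only the exact paths given, not their subdirectories.

```shell
#!/bin/bash
# Report-only sketch: print "<mode> <path>" for every directory under a root
# whose mode is not exactly 755, skipping the listed exception paths.
# Nothing is changed on disk. check_dir_perms is a hypothetical name.
check_dir_perms() {
    local root="$1"; shift
    local -a exclude=()
    local p
    # Each remaining argument is an exact path that is allowed to deviate.
    for p in "$@"; do
        exclude+=( ! -path "$p" )
    done
    # -perm 755 matches the mode exactly, so 700 and 1777 both count as deviations.
    find "$root" -type d ! -perm 755 "${exclude[@]}" -printf '%m %p\n'
}
```

Called as check_dir_perms /opt /opt/lost+found /opt/glite/yaim /opt/globus/tmp, an empty result would mean there is nothing for the fix script to do.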
Rolling upgrade to new SL4x (4.7) - the current list of upgraded machines (from pakiti) is
Scientific Linux SL release 4.7 (Beryllium)
nat005.beowulf.cluster   2.6.9-78.0.1.EL     2.6.9-78.0.1.EL     9 September 2008 18:24
nat006.beowulf.cluster   2.6.9-78.0.1.EL     2.6.9-78.0.1.EL     9 September 2008 17:42
nat007.beowulf.cluster   2.6.9-78.0.1.EL     2.6.9-78.0.1.EL     9 September 2008 17:48
node048.beowulf.cluster  2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 17:51
node056.beowulf.cluster  2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 17:55
node085.beowulf.cluster  2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 17:41
node111.beowulf.cluster  2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 17:38
node114.beowulf.cluster  2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 04:04
node131.beowulf.cluster  2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 17:41
node138.beowulf.cluster  2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 04:03
svr025.beowulf.cluster   2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 17:08
svr026.beowulf.cluster   2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 04:18
svr031.beowulf.cluster   2.6.9-78.0.1.ELsmp  2.6.9-78.0.1.ELsmp  9 September 2008 04:18
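To track the rest of the rolling upgrade, a list like the one above could be filtered for stragglers. This is only a sketch: hosts_not_on_kernel is a hypothetical helper, and it assumes the input has one host per line with the running kernel in the second field, which may not match what pakiti actually exports.

```shell
#!/bin/bash
# Sketch: read "<host> <running kernel> ..." lines on stdin and print the
# hosts whose kernel release differs from the target given as $1.
# hosts_not_on_kernel is a hypothetical name; the field order is an assumption.
hosts_not_on_kernel() {
    local target="$1"
    awk -v target="$target" '
        NF >= 2 {
            kernel = $2
            sub(/smp$/, "", kernel)   # treat EL and ELsmp as the same release
            if (kernel != target) print $1
        }'
}
```

For example, piping the host lines into hosts_not_on_kernel 2.6.9-78.0.1.EL would print nothing for the machines listed here, since they are all on the new kernel.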
Wednesday
LHC spinny beam day :-)
Changed nagios config to set use_large_installation_tweaks=1, to see if it makes a difference to svr031 load.
Edited the email so that it's now readable (trimmed off all the crap from the subject line, and don't include WARNING/CRITICAL/OK, so Google Mail stores them together).
node067 had an MCE kernel panic - rebuilt and returned to service
Network issues at CERN caused SAM failures
21:40 - Durham dropped off the network
svr031:~# traceroute ce01.dur.scotgrid.ac.uk
traceroute to ce01.dur.scotgrid.ac.uk (129.234.193.11), 30 hops max, 46 byte packets
 1  130.209.239.1 (130.209.239.1)  0.456 ms  0.300 ms  0.293 ms
 2  glasgowpop-ge1-2-glasgowuni-ge1-1-v152.clyde.net.uk (194.81.62.153)  0.405 ms  0.326 ms  0.300 ms
 3  so-1-3-1.glas-sbr1.ja.net (146.97.41.249)  0.446 ms  0.425 ms  0.417 ms
 4  NorMAN-N1.site.ja.net (146.97.42.6)  4.615 ms  4.429 ms  4.404 ms
 5  dur-rtr1-gi-1-1.iv.norman.net.uk (194.81.4.30)  5.019 ms  4.874 ms  4.860 ms
 6  * * ct-pop-gi-3-5.iv.norman.net.uk (194.81.4.29)  5.023 ms
 7  * * so-1-3-0.lond-sbr3.ja.net (146.97.33.5)  10.240 ms
 8  * * *
 9  * * *
10  * * *
11  * * *
12  so-1-0-0.leed-sbr1.ja.net (146.97.33.25)  30.622 ms  30.535 ms *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
