Glasgow Logbook 2008-wk37

Week Commencing 8th Sept 2008

Monday

Another weekend of intermittent SAM SE failures, with the error 'operation in progress'.

Pools look OK in dpm-qryconf

One anomaly is that /etc/shift.conf on svr040 contains the extra line

DPM PROTOCOLS rfio gsiftp

Is this relevant?

Can copy files to/from the servers OK.


20:49 - failed a SAM test (SRMv2-ATLAS-lcg-cp), but it wasn't our problem:

[BDII] sam-bdii.cern.ch:2170: Connection Timeout
lcg_cp: Connection timed out


Tuesday

Upgraded to nagios 3.0.3 (from a homebuilt RPM - woohoo). Re-ran ncg.pl and noticed that mw05 is no longer at ECDF.

Made a simple permissions-check script in svr031:/var/cfengine/inputs/scripts/fix_opt_perms

Basically every directory under /opt should be 755, with the exception of

drwx------  2 root root 16384 Sep  9  2007 /opt/lost+found
drwx------  10 root root 4096 Nov 23  2007 /opt/glite/yaim
drwxrwxrwt  2 root root 4096 Dec  9  2006 /opt/globus/tmp

so it's a nasty

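#!/bin/bash
# Set every directory under /opt to 755 ...
find /opt -type d -exec chmod 755 {} \;
# ... then restore the three exceptions (order matters: these must run last)
chmod 700 /opt/lost+found
chmod 700 /opt/glite/yaim
chmod 1777 /opt/globus/tmp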

Perhaps this should simply be a cfengine-ified process anyway?
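
Before running the blanket chmod, it may help to see which directories would actually change. The following is a minimal sketch, not the fix_opt_perms script itself: audit_opt_perms is a hypothetical helper name, and it prunes the three exception directories entirely, so (unlike the script) it also ignores anything beneath /opt/glite/yaim. It only reports and changes nothing.

```shell
#!/bin/bash
# Dry-run audit (hypothetical helper, not part of fix_opt_perms):
# list directories under ROOT whose mode is not exactly 755,
# skipping the three known exceptions and everything below them.
# Usage: audit_opt_perms [ROOT]   (ROOT defaults to /opt)
audit_opt_perms() {
    local root="${1:-/opt}"
    find "$root" \
        \( -path "$root/lost+found" \
           -o -path "$root/glite/yaim" \
           -o -path "$root/globus/tmp" \) -prune \
        -o -type d ! -perm 755 -print
}
```

Calling audit_opt_perms /opt prints one path per offending directory; empty output means the chmod pass would be a no-op outside the exceptions.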


Rolling upgrade to the new SL4x (4.7). The current list of upgraded machines (from pakiti) is:

Scientific Linux SL release 4.7 (Beryllium)
 nat005.beowulf.cluster	2.6.9-78.0.1.EL 	2.6.9-78.0.1.EL 	9 September 2008 18:24
 nat006.beowulf.cluster	2.6.9-78.0.1.EL 	2.6.9-78.0.1.EL 	9 September 2008 17:42
 nat007.beowulf.cluster	2.6.9-78.0.1.EL 	2.6.9-78.0.1.EL 	9 September 2008 17:48
node048.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 17:51
node056.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 17:55
node085.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 17:41
node111.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 17:38
node114.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 04:04
node131.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 17:41
node138.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 04:03
 svr025.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 17:08
 svr026.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 04:18
 svr031.beowulf.cluster	2.6.9-78.0.1.ELsmp	2.6.9-78.0.1.ELsmp	9 September 2008 04:18


Wednesday

LHC spinny beam day :-)

Changed nagios config to set use_large_installation_tweaks=1 -- see if it makes a difference to svr031 load.
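
A quick way to confirm the change took effect is to grep the main config file. This is a sketch under assumptions: it takes the setting to be the Nagios 3 main-config option use_large_installation_tweaks, has_large_tweaks is a hypothetical helper name, and the path to nagios.cfg depends on how the RPM was built, so it is passed in as an argument.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if the given Nagios main config
# file has an uncommented use_large_installation_tweaks=1 line.
# Usage: has_large_tweaks /path/to/nagios.cfg
has_large_tweaks() {
    grep -Eq '^[[:space:]]*use_large_installation_tweaks[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1"
}
```

For example, has_large_tweaks /path/to/nagios.cfg && echo enabled, where the path is whatever the local install uses.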

Edited the email so that it's now readable (trimmed off all the crap from the subject line, and don't include WARNING/CRITICAL/OK, so Google Mail stores them together).

node067 had an MCE kernel panic - rebuilt and returned to service

Network issues at CERN caused SAM failures


21:40 - Durham dropped off the network

svr031:~# traceroute ce01.dur.scotgrid.ac.uk
traceroute to ce01.dur.scotgrid.ac.uk (129.234.193.11), 30 hops max, 46 byte packets
 1  130.209.239.1 (130.209.239.1)  0.456 ms  0.300 ms  0.293 ms
 2  glasgowpop-ge1-2-glasgowuni-ge1-1-v152.clyde.net.uk (194.81.62.153)  0.405 ms  0.326 ms  0.300 ms
 3  so-1-3-1.glas-sbr1.ja.net (146.97.41.249)  0.446 ms  0.425 ms  0.417 ms
 4  NorMAN-N1.site.ja.net (146.97.42.6)  4.615 ms  4.429 ms  4.404 ms
 5  dur-rtr1-gi-1-1.iv.norman.net.uk (194.81.4.30)  5.019 ms  4.874 ms  4.860 ms
 6  * * ct-pop-gi-3-5.iv.norman.net.uk (194.81.4.29)  5.023 ms
 7  * * so-1-3-0.lond-sbr3.ja.net (146.97.33.5)  10.240 ms
 8  * * *
 9  * * *
10  * * *
11  * * *
12  so-1-0-0.leed-sbr1.ja.net (146.97.33.25)  30.622 ms  30.535 ms *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *



