Glasgow LHCb Testing

LHCb Transfer Issues Investigation

This ongoing issue was summarised here: http://scotgrid.blogspot.com/2010/05/return-of-lhcb-at-glasgow.html

The solution so far has been to remove the following REJECT rule on both NATs: -A INPUT -i eth1 -p tcp -m tcp -j REJECT --reject-with tcp-reset
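For reference, a rule listed in -A form can be removed from the running ruleset by replaying the same specification with -D. The sketch below only builds and prints that command, it does not touch the firewall. The helper name delete_cmd_for_rule is hypothetical, and it assumes the rule text is copied exactly as the ruleset lists it.

```shell
#!/bin/sh
# Hypothetical helper: turn an "-A CHAIN ..." rule specification into the
# matching "iptables -D CHAIN ..." command. It prints the command only;
# nothing is changed on the firewall.
delete_cmd_for_rule() {
    rule=$1
    case $rule in
        "-A "*) printf 'iptables -D %s\n' "${rule#-A }" ;;
        *)      echo "not an append rule: $rule" >&2; return 1 ;;
    esac
}

delete_cmd_for_rule '-A INPUT -i eth1 -p tcp -m tcp -j REJECT --reject-with tcp-reset'
```

The printed command would then be run as root on each NAT, and the saved ruleset updated so the rule does not return on restart.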

However, LHCb are still reporting issues with transfers.

Various tests were performed: transfers from many worker nodes through both NATs, and transfers from one worker node through one NAT.

Here are the commands I used to test LHCb transfers from across the cluster. For the single-node test, the script was tweaked slightly to run in a loop, backgrounding each lcg-cp.

  1. Generate a dteam/LHCb proxy and put it somewhere every node can see it, e.g. /cluster/share.
  2. Copy the cron file onto each of the worker nodes as /etc/cron.d/spam-lhcb-castor.
  3. Change the permissions if required and wait for it to run.
  4. Pull the logs off the nodes and parse them for 'Connection timed out' and 'Transfer took'.

The Script

/cluster/share/gla057/test-dteam-upload.sh

#!/bin/sh
# Upload a 1 MB test file to the dteam test area on srm-dteam.gridpp.rl.ac.uk.
# $1 is a label (normally the node's hostname) that keeps the remote file name unique.
. /opt/glite/etc/profile.d/grid-env.sh

export X509_USER_PROXY=/cluster/share/gla057/proxy.dteam

# Show which proxy is in use, so the log records it.
/opt/glite/bin/voms-proxy-info -all
echo "on $1"
#/opt/lcg/bin/lcg-ls srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/
/opt/lcg/bin/lcg-cp --vo dteam --sendreceive-timeout 300 -v \
    file:/cluster/share/gla057/zerofile1M.tst \
    "srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/zerofile1M.$1.tst"
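The single-node variant mentioned earlier (a loop that backgrounds each transfer) is not reproduced on this page. A minimal sketch of the idea follows; run_parallel is a hypothetical name, and appending an index to the label is an assumption about how the remote file names were kept unique.

```shell
#!/bin/sh
# Hypothetical sketch: start COUNT copies of a command in the background,
# passing each one a distinct label (LABEL.1, LABEL.2, ...) as its last
# argument, then wait for all of them to finish.
run_parallel() {
    count=$1
    label=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" "$label.$i" &
        i=$((i + 1))
    done
    wait
}

# Intended use on a worker node (not run here): ten concurrent uploads,
# each writing zerofile1M.<hostname>.<i>.tst
# run_parallel 10 "`hostname`" /cluster/share/gla057/test-dteam-upload.sh
```

Because the upload script uses its first argument only as a label, the same script can be reused unchanged for this test.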

Cron

svr031:/var/cfengine/inputs/skel/worker/etc/cron.d# cat spam-lhcb-castor

#58 12 30 3 2 prdlhcb001 /cluster/share/gla057/test-lhcb-upload.sh `hostname` > /tmp/lhcb-test-`hostname`.log 2>&1

or

35 12 25 5 2 dteam001 /cluster/share/gla057/test-dteam-upload.sh `hostname` > /tmp/dteam-test-`hostname`.log 2>&1
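Both entries pin the run to a single minute on a single date. As a convenience, a line of the same shape can be generated from a timestamp. The sketch below assumes GNU date (for the -d option); cron_line and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a one-off cron.d entry in the same shape as the
# entries above. Assumes GNU date, which provides the -d option.
# Arguments: WHEN (anything date -d accepts), USER, SCRIPT, LOG_TAG.
cron_line() {
    when=$1; user=$2; script=$3; tag=$4
    # minute hour day-of-month month day-of-week (1 = Monday ... 7 = Sunday)
    fields=$(date -d "$when" '+%M %H %d %m %u') || return 1
    printf '%s %s %s `hostname` > /tmp/%s-`hostname`.log 2>&1\n' \
        "$fields" "$user" "$script" "$tag"
}

cron_line '2010-05-25 12:35' dteam001 /cluster/share/gla057/test-dteam-upload.sh dteam-test
```

Note that when both the day-of-month and day-of-week fields are restricted, cron runs the job when either one matches, so an entry like these also fires on the other Tuesdays of that month. Removing the cron file after the test avoids repeat runs.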

Copy onto nodes

for n in `seq -w 001 020`;do scp spam-lhcb-castor node$n:/etc/cron.d/spam-lhcb-castor;done

Change permissions

pdsh -w node[001-020] chmod 644 /etc/cron.d/spam-lhcb-castor
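The copy and chmod steps can also be combined so that unreachable nodes are reported by name. This is a sketch only: deploy_cron and the DRY_RUN switch are hypothetical, and it assumes root ssh access to the nodes, as the commands above do.

```shell
#!/bin/sh
# Hypothetical sketch: copy the cron file to nodes FIRST..LAST and set its
# mode, reporting any node where either step fails. With DRY_RUN=1 the
# commands are printed instead of run.
# FIRST and LAST must be plain integers without leading zeros.
deploy_cron() {
    first=$1; last=$2; file=$3
    n=$first
    while [ "$n" -le "$last" ]; do
        node=$(printf 'node%03d' "$n")
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp $file $node:/etc/cron.d/$file && ssh $node chmod 644 /etc/cron.d/$file"
        else
            scp "$file" "$node:/etc/cron.d/$file" &&
                ssh "$node" chmod 644 "/etc/cron.d/$file" ||
                echo "FAILED: $node" >&2
        fi
        n=$((n + 1))
    done
}

DRY_RUN=1 deploy_cron 1 3 spam-lhcb-castor
```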

Get the logs

for n in `seq -w 001 309`;do scp node$n:/tmp/lhcb-test-*.log .;done

or

for n in `seq -w 001 309`;do scp node$n:/tmp/dteam-test-*.log .;done

List the files at the end point

/opt/lcg/bin/lcg-ls srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/test/

or

/opt/lcg/bin/lcg-ls srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/
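The listing can be checked against the expected node names to see which uploads never arrived. In this sketch missing_uploads is a hypothetical helper; the file name pattern zerofile1M.<host>.tst follows the dteam upload script, and the host name form nodeNNN.beowulf.cluster is taken from the clean-up commands on this page.

```shell
#!/bin/sh
# Hypothetical sketch: read an lcg-ls listing on stdin and print every node
# in the range FIRST..LAST whose test file does not appear in it.
# FIRST and LAST must be plain integers without leading zeros.
missing_uploads() {
    first=$1; last=$2
    listing=$(cat)
    n=$first
    while [ "$n" -le "$last" ]; do
        host=$(printf 'node%03d.beowulf.cluster' "$n")
        case $listing in
            *"zerofile1M.$host.tst"*) ;;   # file present, nothing to report
            *) echo "$host" ;;
        esac
        n=$((n + 1))
    done
}

# Intended use (not run here):
# /opt/lcg/bin/lcg-ls srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/ | missing_uploads 1 309
```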

Tidy up the SRM

for n in `seq -w 001 309`;do /opt/lcg/bin/lcg-del -l --vo lhcb srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/test/zerofile10M.node$n.beowulf.cluster.tst;done

or

for n in `seq -w 001 309`;do /opt/lcg/bin/lcg-del -l --vo dteam srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/zerofile1M.node$n.beowulf.cluster.tst;done

Tidy up the logs

pdsh -w 'node[001-309]' 'rm /tmp/lhcb-test-*.beowulf.cluster.log'

or

pdsh -w 'node[001-309]' 'rm /tmp/dteam-test-*.beowulf.cluster.log'

Parse the logs

svr031:~/dug/lhcb# cat getLogs.sh 
#!/bin/sh
# Collect the test logs from the worker nodes and count the outcomes.
cd /root/dug/lhcb/logs || exit 1
echo copying logs.
for n in `seq -w 001 102`;do scp "node$n:/tmp/dteam-test-*.log" .;done
# Each count is the number of matching lines across all collected logs.
echo  connection timed out: `grep -r 'lcg_cp: Connection timed out' . | wc -l`
echo Communication error on send: `grep -r 'lcg_cp: Communication error on send' . | wc -l`
echo successful transfers: `grep -r 'Transfer took' . | wc -l`
#cd ../trace
#echo copying trace logs.
#for n in `seq -w 001 103`;do scp node$n:/tmp/trace*.beowulf.cluster .;done
echo done.
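The totals above do not say which nodes failed. A per-log classification, using the same three strings, might look like the following sketch; summarise_logs and the status labels are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: classify each log in a directory by the strings that
# getLogs.sh greps for, so failing nodes can be listed by name.
summarise_logs() {
    dir=$1
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue          # no logs at all: glob did not match
        if grep -q 'Transfer took' "$f"; then
            status=ok
        elif grep -q 'lcg_cp: Connection timed out' "$f"; then
            status=timeout
        elif grep -q 'lcg_cp: Communication error on send' "$f"; then
            status=send-error
        else
            status=unknown
        fi
        printf '%s %s\n' "$status" "$(basename "$f")"
    done
}

# Intended use (not run here): list everything that did not succeed.
# summarise_logs /root/dug/lhcb/logs | grep -v '^ok '
```

Since the log file names contain the hostname, the output can be compared directly with which NAT each node routes through.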

Useful background

Cluster Monkey has a good description of GridFTP issues with NATs. On page two they state:

Most NAT devices are capable of translating the port numbers as well. Currently, however, GT does not have this capability, and therefore the port number on the internal machine on which a particular service is listening must be the same as on the external interface of the firewall. Thus, in the case of multiple machines behind a single NAT device, each machine must have a unique port range defined, and those ports must be forwarded to the appropriate machine by the firewall. (GLOBUS_HOSTNAME will be the same for all the machines, however.)

This seems to be backed up by the Network Address Translation (NAT) section of http://dev.globus.org/wiki/FirewallHowTo:

Clients behind NATs will be restricted as described in Allowed Incoming Ports unless the firewall and site hosts are configured to allow incoming connections.

This configuration involves:

  1. Select a separate portion of the ephemeral port range for each host at the site on which clients will be running (e.g. 45000-45099 for host A, 45100-45199 for host B, etc.).
  2. Configure the NAT to direct incoming connections in the port range for each host back to the appropriate host (e.g., configure 45000-45099 on the NAT to forward to 45000-45099 on host A).
  3. Configure the Globus Toolkit clients on each site host to use the selected port range for the host using the techniques described in Section .
  4. Configure Globus Toolkit clients to advertise the firewall as the hostname to use for callbacks from the server host. 

This is done using the GLOBUS_HOSTNAME environment variable. The client must also have the GLOBUS_HOSTNAME environment variable set to the hostname of the external side of the NAT firewall. This will cause the client software to advertise the firewall's hostname as the hostname to be used for callbacks causing connections from the server intended for it to go to the firewall (which redirects them to the client).
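The four steps can be sketched as a small generator that prints, for one host, the forwarding rule for the NAT and the environment for the clients. Everything concrete here is an illustrative assumption: the base port and block size, eth1 as the external interface, the address 10.0.0.2 and the name nat.example.org. GLOBUS_TCP_PORT_RANGE (written as min,max) is the variable Globus Toolkit documentation normally gives for restricting listening ports; check it against the installed version.

```shell
#!/bin/sh
# Hypothetical sketch of the steps above. All values are placeholders.
BASE_PORT=45000
PORTS_PER_HOST=100
EXT_IF=eth1
NAT_HOSTNAME=nat.example.org

# Step 1: port block for the host with the given zero-based index, as "low high".
port_block() {
    low=$((BASE_PORT + $1 * PORTS_PER_HOST))
    echo "$low $((low + PORTS_PER_HOST - 1))"
}

# Step 2: DNAT rule for the NAT box. Only the address is rewritten, so the
# port seen by the internal host equals the port on the external interface.
nat_rule() {
    ip=$1
    range=$(port_block "$2")
    echo "iptables -t nat -A PREROUTING -i $EXT_IF -p tcp --dport ${range% *}:${range#* } -j DNAT --to-destination $ip"
}

# Steps 3 and 4: environment for Globus clients on that host.
client_env() {
    range=$(port_block "$1")
    echo "export GLOBUS_TCP_PORT_RANGE=${range% *},${range#* }"
    echo "export GLOBUS_HOSTNAME=$NAT_HOSTNAME"
}

nat_rule 10.0.0.2 1
client_env 1
```

The NAT would also need to permit the forwarded traffic in its FORWARD chain. With two NATs, each worker node's GLOBUS_HOSTNAME has to name the NAT that actually forwards its port block.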