Glasgow LHCb Testing
LHCb Transfer Issues Investigation
This ongoing issue was summarised here: http://scotgrid.blogspot.com/2010/05/return-of-lhcb-at-glasgow.html
The current workaround was to remove a REJECT rule, -A INPUT -i eth1 -p tcp -m tcp -j REJECT --reject-with tcp-reset, from both NATs.
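For the record, a rule of that form can be dropped from the live chain with iptables -D (a sketch only; the rule spec must match the live entry exactly, and this needs root on each NAT box):

```shell
# Delete the matching REJECT rule from the INPUT chain (run as root on each NAT)
iptables -D INPUT -i eth1 -p tcp -m tcp -j REJECT --reject-with tcp-reset
```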
However, LHCb are currently still reporting issues with transfers.
Various tests were performed: transfers from many worker nodes through both NATs, and transfers from a single worker node through one NAT.
Here are the commands I was using to test LHCb transfers from across the cluster. The script was tweaked slightly (submitting in a loop, backgrounding each lcg-cp) to submit everything from a single node.
- Generate a dteam/LHCb proxy and put it somewhere every node can see it, e.g. /cluster/share
- Copy the cron file onto each of the worker nodes as /etc/cron.d/spam-lhcb-castor
- Change the permissions if required and wait for it to run.
- Pull the logs off the nodes and parse them for 'Connection timed out' and 'Transfer took'
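The parsing step boils down to counting pattern hits across the collected logs. A minimal sketch, assuming the *-test-*.log files have been scp'd into one directory:

```shell
#!/bin/sh
# Summarise the collected transfer logs: count 'Connection timed out'
# failures and 'Transfer took' successes in the given directory.
summarise_logs() {
    dir=$1
    timeouts=`cat "$dir"/*.log 2>/dev/null | grep -c 'Connection timed out'`
    ok=`cat "$dir"/*.log 2>/dev/null | grep -c 'Transfer took'`
    echo "timed out: $timeouts  succeeded: $ok"
}

summarise_logs .
```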
#!/bin/sh
source /opt/glite/etc/profile.d/grid-env.sh
export X509_USER_PROXY=/cluster/share/gla057/proxy.dteam
/opt/glite/bin/voms-proxy-info -all
echo on $1
#/opt/lcg/bin/lcg-ls srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/
/opt/lcg/bin/lcg-cp --vo dteam --sendreceive-timeout 300 -v file:/cluster/share/gla057/zerofile1M.tst srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/zerofile1M.$1.tst
svr031:/var/cfengine/inputs/skel/worker/etc/cron.d# cat spam-lhcb-castor
#58 12 30 3 2 prdlhcb001 /cluster/share/gla057/test-lhcb-upload.sh `hostname` > /tmp/lhcb-test-`hostname`.log 2>&1
35 12 25 5 2 dteam001 /cluster/share/gla057/test-dteam-upload.sh `hostname` > /tmp/dteam-test-`hostname`.log 2>&1
Copy onto nodes
for n in `seq -w 001 020`;do scp spam-lhcb-castor node$n:/etc/cron.d/spam-lhcb-castor;done
pdsh -w node[001-020] chmod 644 /etc/cron.d/spam-lhcb-castor
Get the logs
for n in `seq -w 001 309`;do scp node$n:/tmp/lhcb-test-*.log .;done
for n in `seq -w 001 309`;do scp node$n:/tmp/dteam-test-*.log .;done
list the files at the endpoint
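e.g. with lcg-ls against the same dteam endpoint used in the upload script (the LHCb endpoint works the same way):

```shell
/opt/lcg/bin/lcg-ls srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/
```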
tidy up the srm
for n in `seq -w 001 309`;do /opt/lcg/bin/lcg-del -l --vo lhcb srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/test/zerofile10M.node$n.beowulf.cluster.tst;done
for n in `seq -w 001 309`;do /opt/lcg/bin/lcg-del -l --vo dteam srm://srm-dteam.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/test/dteam/dugs-lhcb-testing/zerofile10M.node$n.beowulf.cluster.tst;done
tidy up the logs
pdsh -w node[001-309] rm /tmp/lhcb-test-*.beowulf.cluster.log
pdsh -w node[001-309] rm /tmp/dteam-test-*.beowulf.cluster.log
Parse the logs
svr031:~/dug/lhcb# cat getLogs.sh
#!/bin/sh
cd /root/dug/lhcb
cd logs
echo copying logs.
for n in `seq -w 001 102`;do scp node$n:/tmp/dteam-test-*.log .;done
echo connection timed out: `find . | xargs grep 'lcg_cp: Connection timed out' | wc -l`
echo Communication error on send: `find . | xargs grep 'lcg_cp: Communication error on send' | wc -l`
echo successful transfers: `find . | xargs grep 'Transfer took' | wc -l`
#cd ../trace
#echo copying trace logs.
#for n in `seq -w 001 103`;do scp node$n:/tmp/trace*.beowulf.cluster .;done
echo done.
Cluster Monkey has a good description of GridFTP issues with NATs; on page two they state:
Most NAT devices are capable of translating the port numbers as well. Currently, however, GT does not have this capability, and therefore the port number on the internal machine on which a particular service is listening must be the same as on the external interface of the firewall. Thus, in the case of multiple machines behind a single NAT device, each machine must have a unique port range defined, and those ports must be forwarded to the appropriate machine by the firewall. (GLOBUS_HOSTNAME will be the same for all the machines, however.)
This seems to be backed up by the "Network Address Translation (NAT)" instructions from http://dev.globus.org/wiki/FirewallHowTo:
Clients behind NATs will be restricted as described in #Allowed Incoming Ports unless the firewall and site hosts are configured to allow incoming connections.
This configuration involves:
1. Select a separate portion of the ephemeral port range for each host at the site on which clients will be running (e.g. 45000-45099 for host A, 45100-45199 for host B, etc.).
2. Configure the NAT to direct incoming connections in the port range for each host back to the appropriate host (e.g., configure 45000-45099 on the NAT to forward to 45000-45099 on host A).
3. Configure the Globus Toolkit clients on each site host to use the selected port range for the host using the techniques described in Section .
4. Configure Globus Toolkit clients to advertise the firewall as the hostname to use for callbacks from the server host.
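On a Linux NAT box, step 2 would translate to something like the following DNAT rule (a sketch only; the interface name and internal address are made up for illustration):

```shell
# Forward host A's callback port range straight through to host A,
# keeping the port numbers unchanged (45000-45099 -> 45000-45099).
# eth0 = external interface, 192.168.1.10 = host A (both hypothetical).
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 45000:45099 \
         -j DNAT --to-destination 192.168.1.10
```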
This is done using the GLOBUS_HOSTNAME environment variable. The client must also have the GLOBUS_HOSTNAME environment variable set to the hostname of the external side of the NAT firewall. This will cause the client software to advertise the firewall's hostname as the hostname to be used for callbacks causing connections from the server intended for it to go to the firewall (which redirects them to the client).
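Putting steps 3 and 4 together, the client environment on each worker node would look something like this (the external hostname and port range here are hypothetical; GLOBUS_TCP_PORT_RANGE is the standard variable for pinning the listening ports):

```shell
# Every host behind the NAT advertises the NAT's external name for callbacks...
export GLOBUS_HOSTNAME=nat.example.org        # hypothetical external hostname
# ...but each host listens on its own forwarded slice of ports (host A here)
export GLOBUS_TCP_PORT_RANGE=45000,45099
echo "$GLOBUS_HOSTNAME $GLOBUS_TCP_PORT_RANGE"
```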