This page is a logbook for Glasgow in Service Challenge 4
Preamble: Initial tests of transfers between RAL and Glasgow revealed a serious problem transferring data between dCache and DPM. Rates were dire, ~2Mb/s.
Transfer rates from DPM (Edinburgh) to DPM (Glasgow) achieved 200Mb/s without tuning. Transferring back to Edinburgh ran at a lower write speed (~80Mb/s), probably because their pool's filesystems are NFS mounted (poor write performance).
Set se2 pool to RDONLY; pool1 and pool2 writable. 5 files on the channel. Noticed that 4 files would go to pool1, 1 to pool2 - so the pool selection algorithm obviously works on a per-filesystem, rather than per-server, basis.
Experimenting with forcing transfers to one pool, or allowing them to float:
|Pools (writable fs)|Concurrent files|Streams per file|File size|No. of files|Rate (Mb/s)|Notes|
|pool1 (5 fs), pool2|5|4|1GB|20|220|16 transfers went to pool1, 4 to pool2|
|pool1 (1 fs), pool2|5|4|1GB|20|216|10 transfers to each pool|
|pool1 (1 fs), pool2|5|1|1GB|20|203|10 transfers to each pool|
|pool1 (1 fs), pool2|5|10|1GB|20|213|10 transfers to each pool|
Differences in transfer rates are minor, and probably within error bars. Suspect transfers are being limited by network or by Edinburgh output data rate. No great effect from multiple streams.
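The per-filesystem selection inferred above can be sketched with a toy round-robin model (my assumption about how DPM spreads transfers, not its actual code). With five writable filesystems on pool1 and one on pool2, 20 transfers split roughly 17/3 - close to the observed 16/4:

```python
from collections import Counter

def assign_round_robin(fs_pools, n_transfers):
    """Toy model of per-filesystem selection: each new transfer goes to
    the next writable filesystem in turn; fs_pools lists the pool that
    owns each filesystem."""
    return Counter(fs_pools[i % len(fs_pools)] for i in range(n_transfers))

# pool1 exporting 5 filesystems, pool2 exporting 1, 20 x 1GB files:
counts = assign_round_robin(["pool1"] * 5 + ["pool2"], 20)
print(counts)  # Counter({'pool1': 17, 'pool2': 3})

# With one filesystem on each pool, the split is even, as observed:
even = assign_round_robin(["pool1", "pool2"], 20)
print(even)  # Counter({'pool1': 10, 'pool2': 10})
```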
On basis of above tests, initiated a 1TB transfer from Ed DPM to Gla DPM. 10 simultaneous files, 4 streams per file:
Transfer Bandwidth Report: 1000/1000 transferred in 35683.9609549 seconds 1e+12 bytes transferred. Bandwidth: 224.190358523Mb/s
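The bandwidth figure in these reports is just bytes x 8 / seconds in decimal megabits, so they are easy to sanity-check (the function name here is mine, not the transfer script's):

```python
def bandwidth_mbps(n_bytes, seconds):
    """Decimal megabits per second, as the Transfer Bandwidth Report quotes."""
    return n_bytes * 8 / seconds / 1e6

# 1000 x 1GB files in 35683.96s:
print(bandwidth_mbps(1e12, 35683.9609549))  # ~224.19 Mb/s, matching the report
```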
Steady as she goes, basically.
1TB Upload to RAL
Was seeing really good rates to RAL when doing lcg-rep testing, so triggered a 1TB transfer overnight, using 5 streams. Rate was an excellent 331Mb/s.
Transfer Bandwidth Report: 1000/1000 transferred in 24165.1441939 seconds 1e+12 bytes transferred. Bandwidth: 331.055338872Mb/s
I did vary the number of concurrently transferred files, from 3 up to 8, during the transfer, which had no noticeable effect on the transfer rate. It did, however, affect the load average (naturally) and, interestingly, the system CPU load on the machine: 45% for 8 concurrent transfers, 25% for 3.
I also applied the SC3 kernel tweaks during the transfer, but again there was no noticeable effect on the rate. Possibly it will have more effect on the sink than the source?
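For reference, the SC3 tweaks are of this general form (illustrative values, not necessarily our exact file) - larger TCP buffers mostly help the receiving end of a high bandwidth-delay-product path, which would fit seeing no change on the source:

```shell
# Raise socket buffer ceilings so TCP can open a large enough window
# over the WAN (apply via /etc/sysctl.conf or sysctl -w)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# min / default / max TCP buffer sizes, in bytes
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```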
1TB Download from RAL
Tuning pools for incoming transfers seems to be tricky (see Glasgow DPM Tuning), but I settled on pool1 with two writable filesystems and pool2 with one. Five concurrent files then almost guarantee that there is always a file being written to pool2, without having so many transfer streams as to provoke a load crisis on one of the pools. 3 streams per file also keeps the total number of TCP streams reasonable.
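A quick check of the "5 files always keeps pool2 busy" reasoning, again using the toy round-robin assumption rather than DPM's real scheduler: with filesystems (pool1, pool1, pool2), any window of 5 concurrent transfers must touch pool2.

```python
# Writable filesystems: two on pool1, one on pool2 (toy round-robin model)
fs_pools = ["pool1", "pool1", "pool2"]

# Wherever the rotation has got to, 5 concurrent transfers span all
# 3 filesystems, so pool2 always has at least one active write.
for start in range(len(fs_pools)):
    window = [fs_pools[(start + k) % len(fs_pools)] for k in range(5)]
    assert "pool2" in window
print("pool2 always has at least one active transfer")
```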
Initial rate was quite good - ~200Mb/s, but this declined after about 4am, down to as low as ~120Mb/s at 0830. Overall the rate was 166Mb/s:
Transfer Bandwidth Report: 998/1000 transferred in 47889.3235981 seconds 998000000000.0 bytes transferred. Bandwidth: 166.717744168Mb/s
The two failures were interesting:
Transfer: srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G to srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/pytest/tfr000-file00171 Size: 1000000000 FTS Reason: No such active transfer - RAL-GLAtransXXo0MOgp
Transfer: srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G to srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/pytest/tfr000-file00219 Size: 1000000000 FTS Reason: Transfer succeeded.
Haven't seen either of these before. Looking on the DPM, both files transferred successfully, so I suspect this was an error in FTS itself - should check the logs.
Finally, write rates were very patchy - indicating that the balancing of incoming writes isn't working. Snapshot from the end of the transfer shows that pool2 was not transferring continuously:
Rates to pool1 and pool2 are out of sync, so I think the network may have become the limiting factor at this point.
Work to be done on i/o rates to our pools, I think.
Inbound test from Edinburgh to Glasgow
Started a 24-hour test from Edinburgh to Glasgow at ~1600 (needed Matt to set up new FTS channels for the site).
Rate was, as expected, rather disappointing. Started at ~200Mb/s, rising to ~250Mb/s as, presumably, traffic calmed on the university's WAN and the Janet backbone.
However, this is still way below what we know the hardware is capable of: 3 disk servers alone managed to hit 800Mb/s+ when sinking data from a source connected directly to their switches. During the write test we had 6 disk servers available and DPM used them all.
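Rough headroom arithmetic from the figures above (my back-of-envelope numbers): 800Mb/s+ across 3 servers is ~267Mb/s each, so 6 servers should in principle sustain ~1600Mb/s - far above the ~250Mb/s seen, pointing at the network rather than the disk servers:

```python
# Back-of-envelope capacity check from the local-switch sink test
local_rate_mbps = 800            # 3 disk servers sinking from a local source
per_server = local_rate_mbps / 3
expected_6_servers = 6 * per_server
observed_wan_mbps = 250

print(f"per server ~{per_server:.0f} Mb/s, 6 servers ~{expected_6_servers:.0f} Mb/s")
print(f"WAN test reached only {observed_wan_mbps} Mb/s -> network-bound")
```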
Outbound test from Glasgow to Edinburgh
Started up the outbound test after seeding in files from Edinburgh (at the same rate achieved for the inbound test).
Rate was even worse than the inbound rate! It struggled up to 80Mb/s. Ganglia clearly showed the load spread nicely over all the disk servers, so why was the rate so bad?
Then on Sunday morning, between ~0600 and 0920 the rate dropped to zero, with all files failing. The FTS server said:
Reason: Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !
but it was not clear whose SRM was down.
- iperf tests from off campus to old production site and to new disk servers
- Setup a test DPM using svr023 and a couple of the spare disk servers
- Test internally
- Experiment with stack settings on external tests
- Use SL3?