Work progress Oct 5 2012

Two-node artdaq testing

The two-node problems that we saw yesterday from dsfr5 are now fixed. I've seen a couple of intermittent errors when running from dseb7, though.

An oddity: the one-node "throughput" is lower on dseb7 (~475 MB/s) than on dsfr5 (~800 MB/s).
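The ~475 and ~800 MB/s figures can be checked against the per-bin lines of the EventStoreEventRate files shown below. The following is a minimal sketch, not part of artdaq: the helper names `parse_bins` and `steady_state_rate` are made up for illustration, and skipping the first and last bins (startup and the partial final second) is an assumed convention.

```python
import re

# Matches the per-bin lines of an EventStoreEventRate_*.txt file, e.g.
#   1349448674.494: 199 events at 198.986 events/sec, data rate = 795.946 MB/sec, bin size = 1.000 sec
BIN_RE = re.compile(
    r"^\s*(?P<t>\d+\.\d+):\s+(?P<n>\d+) events at (?P<evrate>[\d.]+) events/sec, "
    r"data rate = (?P<mbrate>[\d.]+) MB/sec, bin size = (?P<bin>[\d.]+) sec"
)

def parse_bins(text):
    """Return a list of (events, data_rate_MB_per_sec, bin_size_sec) tuples."""
    bins = []
    for line in text.splitlines():
        m = BIN_RE.match(line)
        if m:  # the "EventStore rank N: ..." summary line does not match
            bins.append((int(m["n"]), float(m["mbrate"]), float(m["bin"])))
    return bins

def steady_state_rate(bins):
    """Mean data rate over the interior bins (first and last bin skipped)."""
    interior = bins[1:-1]
    if not interior:
        raise ValueError("need at least three bins")
    return sum(rate for _, rate, _ in interior) / len(interior)

# Per-bin lines copied from EventStoreEventRate_0101_0000.txt (dsfr5, one node).
SAMPLE = """\
  1349448673.494: 386 events at 197.377 events/sec, data rate = 789.512 MB/sec, bin size = 1.956 sec
  1349448674.494: 199 events at 198.986 events/sec, data rate = 795.946 MB/sec, bin size = 1.000 sec
  1349448675.495: 199 events at 198.986 events/sec, data rate = 795.948 MB/sec, bin size = 1.000 sec
  1349448676.495: 200 events at 199.986 events/sec, data rate = 799.947 MB/sec, bin size = 1.000 sec
  1349448677.495: 16 events at 15.999 events/sec, data rate = 63.995 MB/sec, bin size = 1.000 sec
"""

print(f"{steady_state_rate(parse_bins(SAMPLE)):.1f} MB/sec")
```

The run-wide average on the summary line (671.6 MB/sec for this file) is lower than the interior-bin mean because it includes the partial final bin.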

dsfr5 results:

[biery@dsfr5 build]$ daqrate 1 1 10 101 -- -c ../daqrate_gen_test.fcl
opts=[] args=['1', '1', '10', '101', '-c', '../daqrate_gen_test.fcl'] nodes=[]
executing cmd: mpirun -n 3 /home/biery/build/bin/builder 1 1 10 101 '--' '-c' '../daqrate_gen_test.fcl'
Started process 0 of 3.
Started process 1 of 3.
Started process 2 of 3.
Detector 0 ready.
return status of mpirun -n 3 /home/biery/build/bin/builder 1 1 10 101 '--' '-c' '../daqrate_gen_test.fcl' is: 0

return status is (really!) : 0

[biery@dsfr5 build]$ cat EventStoreEventRate_0101_0000.txt
EventStore rank 0: events processed = 1000 at 167.899 events/sec, date rate = 671.6 MB/sec, duration = 5.95595 sec
  1349448673.494: 386 events at 197.377 events/sec, data rate = 789.512 MB/sec, bin size = 1.956 sec
  1349448674.494: 199 events at 198.986 events/sec, data rate = 795.946 MB/sec, bin size = 1.000 sec
  1349448675.495: 199 events at 198.986 events/sec, data rate = 795.948 MB/sec, bin size = 1.000 sec
  1349448676.495: 200 events at 199.986 events/sec, data rate = 799.947 MB/sec, bin size = 1.000 sec
  1349448677.495: 16 events at 15.999 events/sec, data rate = 63.995 MB/sec, bin size = 1.000 sec
[biery@dsfr5 build]$
[biery@dsfr5 build]$
[biery@dsfr5 build]$ daqrate 2 2 10 302 --nodes=dsfr5,dseb7 -- -c ../daqrate_gen_test.fcl
opts=[('--nodes', 'dsfr5,dseb7')] args=['2', '2', '10', '302', '-c', '../daqrate_gen_test.fcl'] nodes=['dsfr5', 'dseb7']
dsfr5
dseb7
dsfr5
dseb7
dsfr5
dseb7
executing cmd: mpirun_rsh -rsh -hostfile /tmp/nodes19147.txt -n 6  FHICL_FILE_PATH="$FHICL_FILE_PATH" /home/biery/build/bin/builder 2 2 10 302 '--' '-c' '../daqrate_gen_test.fcl'
Started process Started process Started process 0 of 6.
4 of 6.
2 of 6.
Started process 5 of 6.
Started process 1 of 6.
Started process 3 of 6.
Detector 0 ready.
Detector 1 ready.
return status is (really!) : 0

[biery@dsfr5 build]$ cat EventStoreEventRate_0302_0000.txt
EventStore rank 0: events processed = 500 at 63.3599 events/sec, date rate = 506.881 MB/sec, duration = 7.89143 sec
  1349448705.491: 118 events at 62.403 events/sec, data rate = 499.225 MB/sec, bin size = 1.891 sec
  1349448706.491: 67 events at 66.995 events/sec, data rate = 535.963 MB/sec, bin size = 1.000 sec
  1349448707.492: 65 events at 64.995 events/sec, data rate = 519.965 MB/sec, bin size = 1.000 sec
  1349448708.492: 65 events at 64.995 events/sec, data rate = 519.962 MB/sec, bin size = 1.000 sec
  1349448709.492: 65 events at 64.995 events/sec, data rate = 519.961 MB/sec, bin size = 1.000 sec
  1349448710.492: 66 events at 65.995 events/sec, data rate = 527.961 MB/sec, bin size = 1.000 sec
  1349448711.492: 54 events at 53.994 events/sec, data rate = 431.953 MB/sec, bin size = 1.000 sec
[biery@dsfr5 build]$ cat EventStoreEventRate_0302_0001.txt
EventStore rank 1: events processed = 500 at 63.3375 events/sec, date rate = 506.702 MB/sec, duration = 7.89422 sec
  1349430966.440: 119 events at 62.840 events/sec, data rate = 502.724 MB/sec, bin size = 1.894 sec
  1349430967.440: 66 events at 65.995 events/sec, data rate = 527.964 MB/sec, bin size = 1.000 sec
  1349430968.440: 64 events at 63.996 events/sec, data rate = 511.966 MB/sec, bin size = 1.000 sec
  1349430969.440: 66 events at 65.995 events/sec, data rate = 527.964 MB/sec, bin size = 1.000 sec
  1349430970.440: 67 events at 66.995 events/sec, data rate = 535.965 MB/sec, bin size = 1.000 sec
  1349430971.441: 65 events at 64.996 events/sec, data rate = 519.967 MB/sec, bin size = 1.000 sec
  1349430972.441: 53 events at 52.990 events/sec, data rate = 423.924 MB/sec, bin size = 1.000 sec

dseb7 results:

[biery@dseb7 build]$ daqrate 1 1 10 201 -- -c ../daqrate_gen_test.fcl
opts=[] args=['1', '1', '10', '201', '-c', '../daqrate_gen_test.fcl'] nodes=[]
executing cmd: mpirun -n 3 /home/biery/build/bin/builder 1 1 10 201 '--' '-c' '../daqrate_gen_test.fcl'
Started process 2 of 3.
Started process 1 of 3.
Started process 0 of 3.
Detector 0 ready.
return status of mpirun -n 3 /home/biery/build/bin/builder 1 1 10 201 '--' '-c' '../daqrate_gen_test.fcl' is: 0

return status is (really!) : 0

[biery@dseb7 build]$ cat EventStoreEventRate_0201_0000.txt
EventStore rank 0: events processed = 1000 at 112.316 events/sec, date rate = 449.267 MB/sec, duration = 8.90342 sec
  1349431068.647: 224 events at 117.722 events/sec, data rate = 470.892 MB/sec, bin size = 1.903 sec
  1349431069.647: 122 events at 121.991 events/sec, data rate = 487.964 MB/sec, bin size = 1.000 sec
  1349431070.647: 119 events at 118.991 events/sec, data rate = 475.967 MB/sec, bin size = 1.000 sec
  1349431071.647: 118 events at 117.991 events/sec, data rate = 471.966 MB/sec, bin size = 1.000 sec
  1349431072.648: 120 events at 119.991 events/sec, data rate = 479.964 MB/sec, bin size = 1.000 sec
  1349431073.648: 120 events at 119.991 events/sec, data rate = 479.965 MB/sec, bin size = 1.000 sec
  1349431074.648: 118 events at 117.991 events/sec, data rate = 471.965 MB/sec, bin size = 1.000 sec
  1349431075.648: 59 events at 58.990 events/sec, data rate = 235.959 MB/sec, bin size = 1.000 sec
[biery@dseb7 build]$
[biery@dseb7 build]$
[biery@dseb7 build]$ daqrate 2 2 10 402 --nodes=dseb7,dsfr5 -- -c ../daqrate_gen_test.fcl
opts=[('--nodes', 'dseb7,dsfr5')] args=['2', '2', '10', '402', '-c', '../daqrate_gen_test.fcl'] nodes=['dseb7', 'dsfr5']
dseb7
dsfr5
dseb7
dsfr5
dseb7
dsfr5
executing cmd: mpirun_rsh -rsh -hostfile /tmp/nodes15630.txt -n 6  FHICL_FILE_PATH="$FHICL_FILE_PATH" /home/biery/build/bin/builder 2 2 10 402 '--' '-c' '../daqrate_gen_test.fcl'
Started process Started process Started process 351 of  of  of 666.
.
.
Started process 4 of 6.
Started process 2 of 6.
Started process 0 of 6.
Detector 1 ready.
Detector 0 ready.
builder: /home/biery/oct2012/artdaq/artdaq/DAQrate/EventStore.cc:52: void artdaq::EventStore::insert(artdaq::FragmentPtr): Assertion `pfrag->fragmentID() != Fragment::InvalidFragmentID' failed.
[dsfr5:mpi_rank_5][error_sighandler] Caught error: Aborted (signal 6)
[dsfr5:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[dsfr5:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[dsfr5:mpispawn_1][child_handler] MPI process (rank: 5, pid: 19364) terminated with signal 6 -> abort job
[dseb7:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node dsfr5 aborted: Error while reading a PMI socket (4)
[dseb7:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 10. MPI process died?
[dseb7:mpispawn_0][handle_mt_peer] Error while reading PMI socket. MPI process died?
return status is (really!) : 0

Second dseb7 attempt (same two-node command, no error this time):

[biery@dseb7 build]$ daqrate 2 2 10 402 --nodes=dseb7,dsfr5 -- -c ../daqrate_gen_test.fcl
opts=[('--nodes', 'dseb7,dsfr5')] args=['2', '2', '10', '402', '-c', '../daqrate_gen_test.fcl'] nodes=['dseb7', 'dsfr5']
dseb7
dsfr5
dseb7
dsfr5
dseb7
dsfr5
executing cmd: mpirun_rsh -rsh -hostfile /tmp/nodes15881.txt -n 6  FHICL_FILE_PATH="$FHICL_FILE_PATH" /home/biery/build/bin/builder 2 2 10 402 '--' '-c' '../daqrate_gen_test.fcl'
Started process Started process Started process 5 of 6.
1 of 6.
3 of 6.
Started process 4 of 6.
Started process 2 of 6.
Started process 0 of 6.
Detector 1 ready.
Detector 0 ready.
return status is (really!) : 0

[biery@dseb7 build]$ cat EventStoreEventRate_0402_0000.txt
EventStore rank 0: events processed = 500 at 63.393 events/sec, date rate = 507.146 MB/sec, duration = 7.88731 sec
  1349431286.700: 119 events at 63.070 events/sec, data rate = 504.564 MB/sec, bin size = 1.887 sec
  1349431287.700: 66 events at 65.995 events/sec, data rate = 527.963 MB/sec, bin size = 1.000 sec
  1349431288.700: 65 events at 64.995 events/sec, data rate = 519.965 MB/sec, bin size = 1.000 sec
  1349431289.700: 66 events at 65.995 events/sec, data rate = 527.965 MB/sec, bin size = 1.000 sec
  1349431290.700: 67 events at 66.995 events/sec, data rate = 535.965 MB/sec, bin size = 1.000 sec
  1349431291.701: 66 events at 65.996 events/sec, data rate = 527.966 MB/sec, bin size = 1.000 sec
  1349431292.701: 51 events at 50.991 events/sec, data rate = 407.931 MB/sec, bin size = 1.000 sec
[biery@dseb7 build]$ cat EventStoreEventRate_0402_0001.txt
EventStore rank 1: events processed = 500 at 63.3482 events/sec, date rate = 506.787 MB/sec, duration = 7.89289 sec
  1349449025.741: 120 events at 63.412 events/sec, data rate = 507.296 MB/sec, bin size = 1.892 sec
  1349449026.741: 65 events at 64.995 events/sec, data rate = 519.962 MB/sec, bin size = 1.000 sec
  1349449027.742: 67 events at 66.995 events/sec, data rate = 535.965 MB/sec, bin size = 1.000 sec
  1349449028.742: 66 events at 65.995 events/sec, data rate = 527.961 MB/sec, bin size = 1.000 sec
  1349449029.742: 66 events at 65.995 events/sec, data rate = 527.960 MB/sec, bin size = 1.000 sec
  1349449030.742: 69 events at 68.995 events/sec, data rate = 551.962 MB/sec, bin size = 1.000 sec
  1349449031.742: 47 events at 46.995 events/sec, data rate = 375.958 MB/sec, bin size = 1.000 sec
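As a rough cross-check of the numbers above, dividing the data rate by the event rate on each summary line gives the event size. This is a sketch only: the values are copied from the summary lines in this page, the function name is made up, and adding the two ranks together assumes each EventStore rank counts a distinct set of events.

```python
# (events/sec, MB/sec) copied from the "EventStore rank N" summary lines above.
summaries = {
    "dsfr5 one-node, rank 0": (167.899, 671.6),
    "dseb7 one-node, rank 0": (112.316, 449.267),
    "two-node from dsfr5, rank 0": (63.3599, 506.881),
    "two-node from dsfr5, rank 1": (63.3375, 506.702),
}

def mb_per_event(events_per_sec, mb_per_sec):
    """Event size implied by one summary line."""
    return mb_per_sec / events_per_sec

for name, (ev, mb) in summaries.items():
    print(f"{name}: {mb_per_event(ev, mb):.2f} MB/event")

# Assumption: the two ranks handle different events, so their rates add.
two_node_total = sum(
    mb for name, (_, mb) in summaries.items() if name.startswith("two-node")
)
print(f"two-node total: {two_node_total:.1f} MB/sec")
```

The one-node runs come out at about 4 MB per event and the two-node runs at about 8 MB per event, so the events/sec figures of the one-node and two-node runs are not directly comparable; the MB/sec figures are the better basis for comparison.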

more work

Installed numactl (yum install numactl) to help investigate the speed issues.
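One possible way to use numactl here is to rerun the one-node test with CPUs and memory bound to a single NUMA node and compare the rates. The sketch below only builds the command lines; the helper name `numa_pinned` is hypothetical, the two-NUMA-node layout is an assumption (`numactl --hardware` shows the real one), and whether NUMA placement has anything to do with the dseb7/dsfr5 difference is not established. The MPI launcher may also apply its own CPU affinity, in which case the binding may not carry through to the builder processes.

```python
import shlex

def numa_pinned(cmd, node):
    """Prefix cmd so that its CPUs and memory are both bound to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *cmd]

# One-node test command, as printed by daqrate in the dseb7 log.
builder = ["mpirun", "-n", "3", "/home/biery/build/bin/builder",
           "1", "1", "10", "201", "--", "-c", "../daqrate_gen_test.fcl"]

# Assumes NUMA nodes 0 and 1 exist on the machine.
for node in (0, 1):
    print(shlex.join(numa_pinned(builder, node)))
```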

recabling

  • dsfr5 and dseb7 are now connected to their own 8-port InfiniBand switch
  • the InfiniBand boards are now in the correct slots on the motherboards
  • reseated the InfiniBand cable in dseb1 (it was not fully inserted)

artdaq testing after the recabling

The two-node performance results look the same on both nodes after the recabling (see below). The one-node results still differ between the two machines (dseb7 is lower).

[biery@dsfr5 build]$ date
Fri Oct  5 15:18:44 CDT 2012
[biery@dsfr5 build]$ daqrate 2 2 10 302 --nodes=dsfr5,dseb7 -- -c ../daqrate_gen_test.fcl
opts=[('--nodes', 'dsfr5,dseb7')] args=['2', '2', '10', '302', '-c', '../daqrate_gen_test.fcl'] nodes=['dsfr5', 'dseb7']
dsfr5
dseb7
dsfr5
dseb7
dsfr5
dseb7
executing cmd: mpirun_rsh -rsh -hostfile /tmp/nodes6551.txt -n 6  FHICL_FILE_PATH="$FHICL_FILE_PATH" /home/biery/build/bin/builder 2 2 10 302 '--' '-c' '../daqrate_gen_test.fcl'
Started process Started process Started process 40 of  of 66.
.
2 of 6.
Started process 5 of 6.
Started process 1 of 6.
Started process 3 of 6.
Detector 0 ready.
Detector 1 ready.
return status is (really!) : 0

[biery@dsfr5 build]$ dir EventStoreEventRate_0302*
-rw-r--r-- 1 biery g163 809 Oct  5 15:18 EventStoreEventRate_0302_0000.txt
-rw-r--r-- 1 biery g163 810 Oct  5 15:18 EventStoreEventRate_0302_0001.txt
[biery@dsfr5 build]$ cat
^C
[biery@dsfr5 build]$ cat EventStoreEventRate_0302_0000.txt
EventStore rank 0: events processed = 500 at 63.334 events/sec, date rate = 506.674 MB/sec, duration = 7.89466 sec
  1349468331.213: 120 events at 63.350 events/sec, data rate = 506.802 MB/sec, bin size = 1.894 sec
  1349468332.214: 65 events at 64.996 events/sec, data rate = 519.969 MB/sec, bin size = 1.000 sec
  1349468333.214: 66 events at 65.995 events/sec, data rate = 527.965 MB/sec, bin size = 1.000 sec
  1349468334.214: 67 events at 66.995 events/sec, data rate = 535.965 MB/sec, bin size = 1.000 sec
  1349468335.214: 67 events at 66.995 events/sec, data rate = 535.965 MB/sec, bin size = 1.000 sec
  1349468336.214: 65 events at 64.995 events/sec, data rate = 519.965 MB/sec, bin size = 1.000 sec
  1349468337.214: 50 events at 49.996 events/sec, data rate = 399.972 MB/sec, bin size = 1.000 sec