Notes from the event builder work

Kurt, Ron, Gennadiy

Current recipes

Initial setup:
  1. log into cluck ('ssh -X cluck')
  2. copy ~paterno/.gitconfig to $HOME
  3. edit the local copy of .gitconfig to change Marc's name to your own
  4. cd to a directory that is accessible from the grunt nodes
    • e.g. /mnt/disk1/grunt/home/${USER}/<workDir>
  5. 'git clone ssh://p-artdaq@cdcvs.fnal.gov/cvs/projects/artdaq'
  6. 'cd artdaq'
  7. 'git checkout -b work' (creates a branch called "work" and moves the sandbox to use that branch)
  8. 'git status' (informational)
  9. 'git branch -a' (informational)
  10. create a directory for building the code that is accessible from the grunt nodes
    • e.g. 'mkdir /mnt/disk1/grunt/home/${USER}/build'
  11. create a directory for local UPS products
    • e.g. 'mkdir /mnt/disk1/grunt/home/${USER}/products'
Building the code:
  1. log into one of the grunt nodes. If needed, do the following steps:
    • 'mpi-selector-menu' (only need to do this once)
      • choose the first option, "mvapich2_gcc-1.6", for user mode by entering '1u'
      • exit mpi-selector-menu by entering 'Q'
    • 'source /products/setup'
    • 'source /etc/profile.d/mpi-selector.sh'
    • 'export CETPKG_INSTALL=<yourLocalProductsDirectory>'
    • 'export CETPKG_J=16'
  2. cd to your local build directory
  3. 'source <yourGitRepositoryParentDir>/artdaq/ups/setup_for_development -d e1'
  4. 'buildtool -C' and 'buildtool', OR 'buildtool -c'
Running the application on a single grunt node:
  1. If needed, do the following steps:
    • 'source /products/setup'
    • 'source /etc/profile.d/mpi-selector.sh'
    • 'export CETPKG_INSTALL=<yourLocalProductsDirectory>'
    • 'export CETPKG_J=16'
  2. cd to your local build directory
  3. 'mpirun -n 3 $HOME/build/bin/builder 1 1 5000 100000 10 4'
    • where the "builder" arguments are the number of detectors, the number of sinks, the number of events, the event size, the queue size, and the run number. The number of detectors is also used as the number of sources (the two are always equal). The -n argument to mpirun is the number of processes to start, and it must equal the total number of detectors, sources, and sinks (see the sketch after this list).
  4. OR
  5. 'mpirun -n 3 -genv PRINT_HOST 1 -genv VERBOSE_QUEUE_READING 1 $HOME/build/bin/builder 1 1 5000 100000 10 4'
    • to enable printout of hostnames for each process and printout of debug information from the thread that reads events out of the EventStore queue
  6. OR
  7. 'daqrate 1 1 5000 100000 10 5'
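A quick way to double-check the -n arithmetic is the following minimal C++ sketch (this is not part of artdaq; the variable names are made up for illustration):

#include <iostream>

int main()
{
  int const detectors = 1;          // first "builder" argument
  int const sinks     = 1;          // second "builder" argument
  int const sources   = detectors;  // the number of sources always equals the number of detectors

  int const mpi_processes = detectors + sources + sinks;
  std::cout << "mpirun -n " << mpi_processes << std::endl;  // prints "mpirun -n 3"
  return 0;
}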
Running the application on multiple grunt nodes:
  1. 'mpirun_rsh -rsh -hostfile <yourGitRepositoryParentDir>/artdaq/DAQrate/TwoNode222_hosts.txt -np 6 $HOME/build/bin/builder 2 2 5000 100000 10 4'
  2. OR
  3. 'mpirun_rsh -rsh -hostfile <yourGitRepositoryParentDir>/artdaq/DAQrate/TwoNode222_hosts.txt -np 6 PRINT_HOST=1 VERBOSE_QUEUE_READING=1 $HOME/build/bin/builder 2 2 5000 100000 10 4'
    • to see full debug information
  4. SOON, we will be able to run 'daqrate 2 2 5000 100000 10 4 --nodes=grunt{1-2}' to achieve the same result as the above two commands. In the meantime, there are a couple more sample host files in the DAQrate directory of the repository that can be used.
Currently available debugging flags:
  • set PRINT_HOST to any non-empty string to enable printing of hostname, rank, and function from each "builder" process that is started
  • set FRAGMENT_POOL_DEBUG to any non-empty string to enable debug printout from the FragmentPool class
  • set VERBOSE_QUEUE_READING to any non-empty string to enable printout of event diagnostics from the SimpleQueueReader (which reads events from the queue that will eventually be used to deliver events to ART)
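For reference, a flag of this kind can be tested in C++ roughly as follows (a hedged sketch; the actual artdaq code may do this differently):

#include <cstdlib>
#include <iostream>

// Returns true when the named environment variable is set to any non-empty string.
bool debugFlagSet(char const* name)
{
  char const* value = std::getenv(name);
  return value != 0 && value[0] != '\0';
}

int main()
{
  if (debugFlagSet("PRINT_HOST")) {
    std::cout << "hostname/rank printout enabled" << std::endl;
  }
  return 0;
}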

Reading DarkSide50 data files on multiple nodes

23-Dec-2011: After a couple of code changes (that are committed), we're now able to read and transfer DarkSide50 data from files on multiple nodes.

Here are a couple of the commands that can be used to do this:
  1. 'daqrate 2 2 5000 /home/ron/work2/parallelartPrj/data/ 10 4 --nodes=grunt{1-2}'
    • after doing its work, the daqrate script suggested the following mpirun command to use:
      • 'mpirun_rsh -rsh -hostfile /tmp/nodes17259.txt -n 6 /home/biery/build/bin/builder 2 2 5000 2400056 10 4 --data-dir=/home/ron/work2/parallelartPrj/data/'
    • when you run daqrate, you will get a different hostfile and a different location for the builder binary
    • this configuration has one detector, one source, and one sink on grunt1 and one detector, one source, and one sink on grunt2
  2. 'mpirun_rsh -rsh -hostfile /tmp/nodes17259.txt -n 6 PRINT_HOST=1 /home/biery/build/bin/builder 2 2 5 2400056 10 4 --data-dir=/home/ron/work2/parallelartPrj/data/'
    • for this command, I reduced the number of events from 5000 to 5 and added the PRINT_HOST env var to verify that I was, in fact, using two grunt nodes
  3. 'mpirun_rsh -rsh -hostfile /tmp/nodes17259.txt -n 6 VERBOSE_QUEUE_READING=1 /home/biery/build/bin/builder 2 2 2000 2400056 10 4 --data-dir=/home/ron/work2/parallelartPrj/data/'
    • for this command, I changed the number of events from 5000 to 2000 and added the VERBOSE_QUEUE_READING env var to get diagnostic printouts for events that would be input to ART (when we get that far)
  1. 'daqrate 4 4 2000 /home/ron/work2/parallelartPrj/data/ 10 5 --nodes=grunt{1-5} --ddnodes=1'
    • this configuration has four detectors on grunt1, and one source and one sink on each of grunt2-5
  2. 'mpirun_rsh -rsh -hostfile /tmp/nodes17796.txt -n 12 VERBOSE_QUEUE_READING=1 /home/biery/build/bin/builder 4 4 2000 4800112 10 5 --data-dir=/home/ron/work2/parallelartPrj/data/'
  1. 'daqrate 5 5 2000 /home/ron/work2/parallelartPrj/data/ 10 6 --nodes=grunt{1-5}'
    • this configuration has one detector, one source, and one sink on each of grunt1-5
  2. 'mpirun_rsh -rsh -hostfile /tmp/nodes18050.txt -n 15 VERBOSE_QUEUE_READING=1 /home/biery/build/bin/builder 5 5 2000 6000140 10 6 --data-dir=/home/ron/work2/parallelartPrj/data/'

Existing statistics

"Detectors"
  • job start:
    • run number
    • rank
    • "jobstart"
    • call number
    • "det"
    • timestamp (double, seconds since 1970)
  • fragment send:
    • run number
    • rank
    • "send"
    • call number
    • buffer number
    • event ID
    • enter time (double) [before send]
    • exit time (double) [after send]
    • rank of destination process
    • time after send (double)
  • job end:
    • run number
    • rank
    • "jobend"
    • call number
    • timestamp (double, seconds since 1970)
"Sources"
  • job start:
    • run number
    • rank
    • "jobstart"
    • call number
    • "det"
    • timestamp (double, seconds since 1970)
  • fragment receive:
    • run number
    • rank
    • "recv"
    • call number
    • buffer number
    • event ID
    • enter time (double) [before receive]
    • exit time (double) [after receive]
    • rank of sender process
    • time after send (double)
  • fragment send:
    • run number
    • rank
    • "send"
    • call number
    • buffer number
    • event ID
    • enter time (double) [before send]
    • exit time (double) [after send]
    • rank of destination process
    • time after send (double)
  • job end:
    • run number
    • rank
    • "jobend"
    • call number
    • timestamp (double, seconds since 1970)
"Sinks"
  • job start:
    • run number
    • rank
    • "jobstart"
    • call number
    • "det"
    • timestamp (double, seconds since 1970)
  • fragment receive:
    • run number
    • rank
    • "recv"
    • call number
    • buffer number
    • event ID
    • enter time (double) [before receive]
    • exit time (double) [after receive]
    • rank of sender process
    • time after send (double)
  • event start:
    • run number
    • rank
    • "evtstart"
    • call number
    • start of event building (0) or complete event received (1)
    • event ID
    • timestamp (double)
  • job end:
    • run number
    • rank
    • "jobend"
    • call number
    • timestamp (double, seconds since 1970)
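To make the field ordering concrete, here is a hedged sketch of the "fragment send" record as a struct (the field names are mine, not from the code); it matches the perfdump output shown later in these notes, e.g. "2 0 send 49979 9 49979 1324307466.2177 1324307466.2178 1 1324307466.2178":

struct SendRecord
{
  int    run_number;        // 2
  int    rank;              // 0
  // the literal tag "send" appears here in the text record
  long   call_number;       // 49979
  int    buffer_number;     // 9
  long   event_id;          // 49979
  double enter_time;        // 1324307466.2177 (before send)
  double exit_time;         // 1324307466.2178 (after send)
  int    destination_rank;  // 1
  double time_after_send;   // 1324307466.2178
};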

Notes from the retreat (instructions replaced by ones above)

Initial steps to get the DAQrate application running:
  1. log into cluck ('ssh -X cluck')
  2. copy ~paterno/.gitconfig to $HOME
  3. edit the local copy of .gitconfig to change Marc's name to your own
  4. cd to a directory that is accessible from the grunt nodes
    • e.g. /mnt/disk1/grunt/home/${USER}/<workDir>
  5. 'git clone ssh://p-artdaq@cdcvs.fnal.gov/cvs/projects/artdaq'
  6. 'cd artdaq'
  7. 'git checkout -b work' (creates a branch called "work" and moves the sandbox to use that branch)
  8. 'git status' (informational)
  9. 'git branch -a' (informational)
  10. log into grunt1 ('rsh grunt1')
  11. 'mpi-selector-menu' (only needs to be done once)
    • choose the first option, "mvapich2_gcc-1.6", for user mode by entering '1u'
    • exit mpi-selector-menu by entering 'Q'
  12. '. /etc/profile.d/mpi-selector.sh'
  13. 'cd <workDir>/artdaq'
  14. 'make'
  15. 'mpirun -n 8 ./builder 2 10 3 3 2 100000 10 1'
Running on multiple grunt nodes:
  1. create a text file that has "grunt1" on the first six lines and "grunt2" on the next ten lines (I called this file TwoNode664_hosts.txt, but you can call it anything you want).
  2. 'mpirun_rsh -rsh -hostfile TwoNode664_hosts.txt -np 16 ./builder 3 1000 6 6 4 100000 10 2'
  3. I added code to builder.cc to print out information on what function is running on which host at what rank. Here is an example of the command to trigger this printout:
    • 'mpirun_rsh -rsh -hostfile TwoNode664_hosts.txt -np 16 PRINT_HOST=true ./builder 2 1000 6 6 4 100000 10 2'
    • This produces output like the following:
      Running sink on host grunt2 with rank 15.
      Running sink on host grunt2 with rank 14.
      Running detector on host grunt1 with rank 5.
      Running detector on host grunt1 with rank 4.
      Running source on host grunt2 with rank 10.
      Running sink on host grunt2 with rank 12.
      Running sink on host grunt2 with rank 13.
      Running source on host grunt2 with rank 11.
      Running Running source on host source on host grunt2 with rank grunt28. with rank 7.
      
      Running source on host grunt2 with rank 9.
      Running source on host grunt2 with rank 6.
      Running detector on host grunt1Running  with rank detector on host grunt1 with rank 2.3.
      
      Running detector on host grunt1 with rank 1.
      Running detector on host grunt1 with rank 0.
      
Building with CET build tools:
  1. log into one of the grunt nodes. If needed, do the following steps:
    • 'mpi-selector-menu' (select mvapich2 - only need to do this once)
    • 'source /products/setup'
    • 'source /etc/profile.d/mpi-selector.sh'
  2. cd to the directory where you want the build products to live (typically different than your git sandbox)
  3. 'export CETPKG_INSTALL=<yourLocalProductsDirectory>'
  4. 'export CETPKG_J=16'
  5. 'source <yourGitRepoParentDir>/artdaq/ups/setup_for_development -d e1'
  6. 'buildtool -C' and 'buildtool', OR 'buildtool -c'
Running with the CET build tool result:
  1. 'mpirun_rsh -rsh -hostfile <yourGitRepoParentDir>/artdaq/DAQrate/TwoNode111_hosts.txt -np 3 PRINT_HOST=true <buildProductDir>/bin/builder 2 100000 1 1 1 100000 10 2'

Data rate observations

Using one detector on grunt1, and one source and one sink on grunt2, I used the existing performance monitoring output to estimate the data rate.

  • 'mpirun_rsh -rsh -hostfile TwoNode111_hosts.txt -np 3 PRINT_HOST=true ./builder 2 100000 1 1 1 100000 10 2'

Contents of TwoNode111_hosts.txt:

grunt1
grunt2
grunt2

  • 'make perfdump'
  • './perfdump perf_0002_0000.txt | head -50000 | tail -20'
2 0 send 49979 9 49979 1324307466.2177 1324307466.2178 1 1324307466.2178 
2 0 send 49980 0 49980 1324307466.2178 1324307466.2179 1 1324307466.2179 
2 0 send 49981 1 49981 1324307466.2179 1324307466.2179 1 1324307466.2179 
2 0 send 49982 2 49982 1324307466.2179 1324307466.218 1 1324307466.218 
2 0 send 49983 3 49983 1324307466.218 1324307466.2181 1 1324307466.2181 
2 0 send 49984 4 49984 1324307466.2181 1324307466.2185 1 1324307466.2185 
2 0 send 49985 5 49985 1324307466.2185 1324307466.2185 1 1324307466.2185 
2 0 send 49986 6 49986 1324307466.2185 1324307466.2185 1 1324307466.2185 
2 0 send 49987 7 49987 1324307466.2186 1324307466.2186 1 1324307466.2186 
2 0 send 49988 8 49988 1324307466.2186 1324307466.2188 1 1324307466.2188 
2 0 send 49989 9 49989 1324307466.2188 1324307466.2189 1 1324307466.2189 
2 0 send 49990 0 49990 1324307466.2189 1324307466.219 1 1324307466.219 
2 0 send 49991 1 49991 1324307466.219 1324307466.219 1 1324307466.219 
2 0 send 49992 2 49992 1324307466.219 1324307466.2193 1 1324307466.2193 
2 0 send 49993 3 49993 1324307466.2194 1324307466.2194 1 1324307466.2194 
2 0 send 49994 4 49994 1324307466.2195 1324307466.2195 1 1324307466.2195 
2 0 send 49995 5 49995 1324307466.2196 1324307466.2196 1 1324307466.2196 
2 0 send 49996 6 49996 1324307466.2197 1324307466.2198 1 1324307466.2198 
2 0 send 49997 7 49997 1324307466.2198 1324307466.2199 1 1324307466.2199 
2 0 send 49998 8 49998 1324307466.2199 1324307466.22 1 1324307466.22 

These 20 fragment sends take 2.3 msec. With an event size of 100000 bytes, this gives a data rate of 2,000,000 bytes / 2.3 msec = 830 MB/sec = 6.5 Gbit/sec.
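The same arithmetic, written out as a small C++ check using the first and last timestamps from the perfdump lines above:

#include <iostream>

int main()
{
  double const first_enter = 1324307466.2177;  // enter time of the first listed send
  double const last_exit   = 1324307466.2200;  // exit time of the last listed send
  double const bytes_sent  = 20.0 * 100000.0;  // 20 sends of 100000-byte events

  double const bytes_per_sec = bytes_sent / (last_exit - first_enter);
  std::cout << bytes_per_sec / (1024.0 * 1024.0) << " MB/sec, "
            << bytes_per_sec * 8.0 / (1024.0 * 1024.0 * 1024.0) << " Gbit/sec"
            << std::endl;  // roughly 830 MB/sec and 6.5 Gbit/sec
  return 0;
}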

Thoughts on specifying configuration parameters

A wrapper script can be used to run:
  • mpirun with the proper number of processes argument
    OR
  • mpirun_rsh with a generated hosts file and the proper number-of-processes argument

Possible script invocations:

usage: daqrate  3 3 2 events filebase   # no mpirun_rsh -hostfile needed
   OR  daqrate  3 3 2 events filebase   --nodes grunt{1-3}  --dnode 1
usage: daqrate  3 3 2 events eventsz
   OR  daqrate  3 3 2 events eventsz  --nodes grunt{1-3}  --dnode 1

The relevant parameters that a user would want to specify for deciding what functions to run where include the following (in our opinion):
  • the total number of detectors (data producing processes)
  • the total number of sources (data receiving processes)
  • the total number of sinks (event builder processes)
  • the total number of physical nodes to run the detectors, sources, and sinks on
  • the number of builder nodes (must be between 1 and the total number of physical nodes, inclusive)
The boundary conditions that we understand to be part of the DAQrate package include the following:
  • the number of sources must match the number of detectors
  • builder nodes run sources and sinks
  • dedicated detector nodes run only detector processes
  • the number of dedicated detector nodes is the total number of nodes minus the number of builder nodes
    • the number of builder nodes must be 1 or more
    • the number of dedicated detector nodes may be zero
  • if there is only one host in the test, then all three types of processes run on the same host
  • if there is more than one host in the test...
    • if the number of builder nodes equals the total number of nodes (the number of dedicated detector nodes is zero), then the detectors, sources, and sinks may be spread across any of the nodes
    • if the number of dedicated detector nodes is more than zero, then detectors may only be run on detector nodes and sources and sinks may only be run on builder nodes
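A hedged C++ sketch of those constraints (my own helper, not part of the DAQrate package):

#include <stdexcept>

struct DaqLayout
{
  int detectors;      // data producing processes
  int sinks;          // event builder processes
  int total_nodes;    // physical nodes used in the test
  int builder_nodes;  // nodes that run the sources and sinks
};

// Checks the boundary conditions listed above and returns the total MPI process count.
int checkLayout(DaqLayout const& layout)
{
  int const sources = layout.detectors;  // the number of sources must match the number of detectors
  if (layout.builder_nodes < 1 || layout.builder_nodes > layout.total_nodes)
    throw std::runtime_error("builder node count must be between 1 and the total node count");
  int const dedicated_detector_nodes = layout.total_nodes - layout.builder_nodes;
  if (dedicated_detector_nodes > 0) {
    // detectors run only on the dedicated detector nodes;
    // sources and sinks run only on the builder nodes
  }
  return layout.detectors + sources + layout.sinks;
}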

Candidate structs for raw data

Suggestions for better naming or other improvements are welcome.

#ifndef RAWDATA_HHH
#define RAWDATA_HHH

#include <boost/shared_ptr.hpp>
#include <vector>
#include <stdint.h>

typedef uint32_t RawDataType;

struct RawEvent
{
  typedef std::vector<RawDataType> Fragment;
  typedef boost::shared_ptr<Fragment> FragmentPtr;

public:
  RawDataType size_;
  RawDataType run_id_;
  RawDataType subrun_id_;
  RawDataType event_id_;

  std::vector<FragmentPtr> fragment_list_;
};

struct RawDataFragment
{
public:
  RawDataType size_;
  RawDataType event_id_;
  RawDataType fragment_id_;
};

#endif
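
A hedged usage sketch for the structs above (the numbers are made up; it assumes the header is saved as, e.g., RawData.hh):

#include <iostream>
#include "RawData.hh"  // the candidate header above

int main()
{
  RawEvent event;
  event.run_id_    = 4;
  event.subrun_id_ = 0;
  event.event_id_  = 1;

  // One fragment of 25000 RawDataType words, shared via boost::shared_ptr.
  RawEvent::FragmentPtr frag(new RawEvent::Fragment(25000));
  event.fragment_list_.push_back(frag);
  event.size_ = frag->size();

  std::cout << "event " << event.event_id_ << " holds "
            << event.fragment_list_.size() << " fragment(s)" << std::endl;
  return 0;
}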

Git sequences

Alas, I'm not sure of the exact situations where these were needed, but here they are anyway:

  583  git commit -a
  584  git status
  585  git fetch
  586  git rebase origin/master
  587  git status
  588  vi EventStore.hh 
  589  git add EventStore.hh 
  590  git rebase --continue
  591  git checkout master
  592  git pull
  593  git merge --ff-only work
  594  git push
  595  git branch -d work
  596  git branch work
  597  git checkout work
  598  git status
  599  git lg
  600  git branch -a