3/9/2015 Notes

DocDB Event


Pengfei, Yuri, Craig G., Adam, Leon, Alec, Zukai, Andrew and Martin (taking notes)

DDT Versions

  • The external products area is getting full, so we need to remove some old versions.
  • v03_00_01 was the first one deployed on FD and it depends on novadaq v03_00_00.
  • I will send out an email asking whether anyone in the DDT group still needs versions before v03_00_00.

Memory Usage

  • 20,000 hits would cause memory issues according to Andrew's and Leon's private calculation.
  • Andrew suggests a cut at 5,000 hits per slice for the Hough Tracker.
  • Craig G. brought up the point that we could regain some of our live time by having the Hough Tracker not process the large hit-multiplicity events.
  • With Craig's point in mind, Leon and Andrew came up with a cut of 1,200 hits per slice.
  • Yuri will add this cut and commit it, and then I will cut a new release.
  • Leon was asking whether we can deploy this in stages.
    • We discussed the possibility of rolling out the new release on a subset of the buffer nodes. However, with fewer nodes it would take a long time to run into one of these large memory usage events.
    • Andrew proposes that we deploy system wide.
    • Leon then asks that we be vigilant and investigate any DDT crash immediately.
    • We can then roll back if there are too many issues.
  • Perhaps we need a Ganglia metric that tells us how many DDT processes are running.
    • Perhaps the DDTManager could report this.
    • Andrew thinks that this should be done on each host, so DDTManager is not a good place since it runs on one host.
    • Alec explains that gmond.conf needs to be changed to pick up such a new DDT process count metric.
    • Alec volunteered to write such a monitoring script.
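
A minimal sketch of what such a per-host monitoring script could look like, not the script Alec will write. It assumes the count is pushed to Ganglia with the gmetric command-line tool, which is a different route from the gmond.conf change mentioned above. The process name "ddt_filter" and the metric name "ddt_process_count" are hypothetical placeholders.

```python
import subprocess

PROCESS_NAME = "ddt_filter"         # hypothetical DDT executable name
METRIC_NAME = "ddt_process_count"   # hypothetical Ganglia metric name


def count_matching(command_names, target):
    """Count the entries in command_names that equal target exactly."""
    return sum(1 for name in command_names if name.strip() == target)


def list_command_names():
    """Return the command name of every process on this host, via ps."""
    result = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def gmetric_command(count):
    """Build the gmetric invocation that publishes the count."""
    return [
        "gmetric",
        "--name", METRIC_NAME,
        "--value", str(count),
        "--type", "uint16",
        "--units", "processes",
    ]


def report():
    """Count the DDT processes on this host and publish the number."""
    count = count_matching(list_command_names(), PROCESS_NAME)
    subprocess.run(gmetric_command(count), check=True)
    return count
```

Calling report() periodically on every buffer node (for example from cron) would give one count per host, which matches Andrew's point that the count belongs on each host rather than in DDTManager. Note that ps truncates long command names on Linux, so the placeholder name would need to match the truncated form.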


  • Leon has been looking through the documentation and will send around comments on things that could be improved.


  • I reported on the processing time that we see with 1 process versus 13 processes on one host.
  • Andrew is not convinced that there is actually an increase there and notes that it could simply be a statistical fluctuation.
  • My main point here is that we are too slow to keep up.
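
One way to check whether the 1-process versus 13-process difference is more than a fluctuation is to compare the two sets of per-event timings with Welch's t statistic. This is only a sketch of that check and not something agreed in the meeting. The function name and the idea of collecting two lists of timings are assumptions.

```python
import math
import statistics


def welch_t(times_a, times_b):
    """Welch's t statistic for the difference in means (b minus a).

    Each argument is a list of at least two timing measurements.
    """
    mean_a = statistics.mean(times_a)
    mean_b = statistics.mean(times_b)
    # Sample variances (n - 1 in the denominator).
    var_a = statistics.variance(times_a)
    var_b = statistics.variance(times_b)
    # Standard error of the difference, without assuming equal variances.
    std_err = math.sqrt(var_a / len(times_a) + var_b / len(times_b))
    return (mean_b - mean_a) / std_err
```

Passing the 1-process timings as times_a and the 13-process timings as times_b gives a positive value if the 13-process case is slower on average. A value of order 1 or below is consistent with a fluctuation, while a much larger value would suggest a real increase.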