Planning for the parallel art retreat¶
Summary of retreat goals¶
- Learn about some of the available high-performance computing tools for doing multicore / manycore programming and how we might apply these to problems we have.
- Apply these tools to a simplified DAQ and trigger/filter problem to gain experience and solve a small, real problem.
- event building function using MPI
- event processing framework in a filter setting using OpenMP or equivalent
- multicore algorithm applied to data within the framework
- Gain a common understanding of the software development tools that our group has embraced
- Kick off work for our multi-threaded framework project (more about this later)
multi-threaded framework project¶
Our group has received project money (through an FWP) to do multithreaded framework studies. This includes GPU algorithmic work. Here is the text from the FWP.
“The existing computational frameworks for high-energy physics experiments offline and online event processing will not be able to provide the necessary performance on the emerging new computational technologies (multi-core and hybrid systems). The current computational approach, of a single thread per core scales well but requires prohibitive large amounts of memory and sequential file merging. We propose to develop new algorithms and capabilities that will allow multi-threaded applications that will run efficiently on the new multi-core hardware. The proposed work for FY12 will focus on design changes necessary to achieve this goal for the art Event Processing Framework developed at FNAL. This framework evolved from the CMS framework and is used by FNAL intensity frontier experiments. Art coordinates event processing via configurable, pluggable modules (e.g. reconstruction, filtering, and analysis) that add data to and retrieve data from events, supporting a programming model that separates algorithm and data. An event is one unit of data, an interaction, or period of collected data. Because of these design choices, art is a good starting point for building fully multi-threaded applications. The objective of out FY12 work will be to implement processing decomposition of different data "chunks" utilizing high-performance-computing techniques, develop efficient scheduling algorithms, and deploy prototype applications to study and characterize the performance of the new capabilities.”
- Task 1: Begin converting our existing single-threaded event reconstruction/filtering framework to a multithreaded application, using HPC computing techniques and technologies(high performance networks, MPI). Implement strategies for the decomposition of the processing of each event into ‘chunks’ that can be executed concurrently.
- Task 2: Develop and prototype dispatcher algorithm for multi-threaded applications. Instrument code for perormance studies, perform studies.
- Task 3: Utilize framework for evaluating real-time processisng configurations and determining design parameters for compact, high-throughput systems. Study dataflow in a streaming realtime environments to determine system design parameters for experiments. Implement typical filtering algorithm on multi-core system (e.g. GPU) and study performance.
common libraries and tools¶
One of our objectives is to further on toolkit. We have a few pieces in place.
target project for the prototype work¶
DarkSide50 has an an architecture that is aligned well with our future toolkit requirements. It is similar to the NOvA DDT architecture that we are proposing and want to evaluate. The key features are a high-speed event building phase and an algorithmic phase that can benefit from a parallel art framework. DarkSide50 wants to know if an event builder and algorithmic framework doing compression can keep up with their maximum expected rate (see diagram). We want to be able to report throughput numbers and machine utilization for a configuration similar to theirs, using the event data samples they have provides (available on oink). We can contact them with questions about the data organization and access. We can show progress on Thursday to see if we understand their problem.
See the bottom of this page for several document that may be useful reading for before this retreat.
Some good information about MPI is available at MPI tutorial.
An additional online tutorial from the OpenMP consortium is available at OpenMP tutorial.
See CluckComputingCluster notes on usage policies and plans for using the system.
OpenMP is for doing thread level parallelism within one process. It supports both data parallelism and task parallelism using compiler directives (pragmas). We will be using the 3.x standard, available in gcc 4.6.x.
There are several implementations of MPI (message passing interface) available. We will be using mvapich2. The biggest difference between implementation are the things that are not part of the standard, such as mpirun. The option are different on this launcher for the different implementations.
Suggestion for work distribution:
- Event building and feeding the processing framework: Ron and Kurt, concentration on MPI
- art processing framework: Walter, Chris, and Gennadiy, concentration on OpenMP
- Algorithms and analysis: Marc and Qiming (if he is back), and Jim
It currently looks like Qiming will not be back in time. Gennadiy may has a software deadline that interferes with him participating in the this activity.
We'll be doing development on cluck, and hope to be able to run tests using the grunts. This retreat will take place at the user center. We have the TV room scheduled for Wednesday, and the Music room scheduled for Thursday, Monday, and Tuesday. We were able to allocate the room from 8:30am to 5pm. The first one there will need to get the keys from the housing office to open up the doors.
If anyone in the group is unfamiliar with git, we'll need to have a brief introduction (the handful of git commands one needs for daily business).
Make sure everyone has a reasonable ".gitconfig" file in the home directory of whatever machine(s) he works on.
High-level view of the application¶
The target application for this retreat (or sprint, or whatever we call it) is a demonstration program for the DarkSide-50 DAQ. DarkSide-50 have a requirement to compress their full data stream during data collection. We want to see how fast we can do this with reasonable resources, e.g. a few 32-core nodes, each of which might have a GPU.
There are 4 high-level subsystems we'll be dealing with, listed below. For most, we list the relevant choices to be decided upon.
art input / event builder¶
- Read from file
- Read from MPI event builder
art event processor¶
Run multiple schedules, each processing independent events, with the processes of each event being a task. We want to base our solution on task-based concurrent programming.
- Intel TBB
- Intel CnC
We considered and rejected Cilk++ because it requires a specific branch of a not-yet-released version of GCC (Intel's Cilk++ branch of GCC 4.7).
The incoming event will carry a RawData product; the compression module will read this and create a CompressedRawData product, and insert that into the event.
- Use nested task parallelism (same technology choice as for the event processor, above)
- Offload compression task to GPU
We have no requirement that the output be a Root data file. We want to assure that we have parallel-capable output.
- write one event per file. This will have terrible performance, but is trivially parallel
- use a parallel-capable file format
- On the development path that includes running art code in a single thread, we can use Root output
The MPI application¶
The application is an MPI program, and is thus multi-process. The organization is that of Jim's "DAQrate" application (code available in the cet-is repository source:DAQrate; this is the application Jim gave his "Neat Things" talk about, some time ago).
Structure of the application¶
The application contains:
- N "fragment emitters", called detectors in Jim's slides. Each fragment emitter represents a piece of the DS-50 detector, emitting its stream of data.
- N "fragment receivers", called sources in Jim's slides. Each fragment receiver takes the fragments from 1 emitter, and directs fragments to the "event builder" that is to handle that event.
- M "event builders", each of which receives N fragments, from N fragment receivers. Each event builder puts the assembled event (really a RawData object) into that builder's event pool. RawData objects in the event pool are then read by the input source (running in another thread of the same process); events are then processed by the parallel art.
This MPI application thus consists of 2N+M ranks. The the "fragment emitter" and "fragment receiver" ranks may be singly-threaded. The "event builder" ranks are multiply-threaded, having 1+X threads, where X is the number of Schedules created:
- 1 thread assembling fragments into raw data products, which are put into the pool;
- X threads running some or all of art_main, using a single shared input module which reads from the event pool, and using X Schedules
Some MPI notes¶
The current DAQRate code using MPI version 1. We want to migrate to MPI version 2, and to use an InfiniBand-aware MPI version. MVAPICH2 is one such, and is already available on cluck.
We want to move to MPI version 2 to obtain two advantages:
- the ability to recover from lost (e.g. crashed) processes, which is enabled with the MPI-2 task management facilities
- faster "streaming" data throughput, using the MPI-2 one-sided communication directives
Milestones for development event builder¶
For each milestone below, the trailing number indicates the order of execution. Milestones with the same number have the same level of dependency.
-  The event builder emits RawData objects containing random numbers (no reading of DarkSide-50 data files needed). Fix N=3 and M=2 in the MPI application (no configuration needed); all runs on one node (for simpler MPI configuration). Event pool throws away data; no art code is needed. This should include defining a generic "raw data" product, re-usable in other contexts.
-  Replace random data from (A) with data read from DarkSide-50 data files.
-  Run the event builder on mulitple grunt nodes:
- two nodes
- set M=4 and N=6
- Make sure MPI communication between nodes over IB is working
- Write the necessary scripts to allow easy configuration
- Instrument code as necessary to allow verification of proper function; make sure right code is executing in the right rank on the right core of the right node.
- Make sure the system achieves stable streaming mode, and that it is not "accidentally" working just because the load is small enough for all data to be handled in a single burst.
-  Integrate multi-threaded art into the event builder processes; art runs a source that reads from the event pool and runs mulitple schedules. Output will only work when the multi-threaded output is made to work; system can be used to time compression task, when the compression module is available.
- Integrate single-threaded art, including an output module. Can still exercise some multi-threaded capability if the compression module is internally multi-threaded, without the framework itself being so.
-  Re-organize use of MPI communicators to improve robustness of the application and to simplify MPI configuration.
-  Experiment with one-sided communication directives and compare the throughput rates of different usages.
The compression algorithm¶
We will investigate only one, or (time permitting) a few compression algorithms. Previous research has indicated there are only a few leading candidates for this use. An important goal here is gaining expertise in the coding of algorithms using one of the interesting concurrency technologies.
- The algorithms and technologies of interest are given below. The numeral before each item is the order of priority; lower numbers indicate higher priority.
-  Huffman coding using OpenMP and nested task parallelism
-  Huffman coding using TBB and nested task parallelism
-  Huffman coding implemented for the GPU
-  Dictionary or Rice algorithm, with the same technology choice as used above
- After one algorithm is implemented, it must be wrapped in an EDProducer, which reads from and writes to the Event.
- After the EDProducer is implemented, we want to get the product written to a file. Options to choose from, based on what else has been achieved by this time, include:
- get i/o for one event per file output format
- get i/o to work using aggregation of many events to write one file, in our own format
- get i/o for compressed data product to work for singly-threaded Root i/o
- get i/o using some parallel-capable i/o format (candidate: Berkeley DB).
The parallel version of art¶
The most important functionalities of the parallel art framework, for this exercise, are the following:
- A functioning event loop (but not necessarily runs or subruns)
- Product insertion and lookup in the Event
- An input system that can feed multiple Schedules
- Parallel execution of Schedules, each of which is processing different Events, and each of which is running at least two modules, 1 producer and 1 analyzer
- Keeping of sufficient product metadata to allow unique identification of the results of a producer
- At least 1 global and 1 schedule-specific service begin employed
- schedule-specific: Timing
- global: TriggerNamesService