Geant Vector and GPU Prototype

The main objective of this project is to quantify the potential gain of a vector implementation of Geant by measuring its performance on a simple but realistic example. The vector implementation will test, at the same time, the use of high-level vector code on the CPU and on a GPU. The type of vectorization investigated in this prototype is a design that propagates a set of tracks (or particles) at once through the various simulation stages, applying the same or similar operations to the whole set rather than applying all stages to each track or particle one after the other. The investigation will cover several issues, including the coordination and balancing of tasks between the CPU and GPU and the study of various grouping algorithms (per geometry element, per particle type, etc.).
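
The "set of tracks at once" design can be illustrated with a minimal C++ sketch. It assumes a structure-of-arrays layout; the names `TrackBasket` and `propagateStraight` are hypothetical and are not taken from the Vector Prototype code. One stage (here a simple straight-line move) is applied to every track in the set before any other stage runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays container for a set of tracks.
// Each attribute is stored contiguously so one stage can sweep over it.
struct TrackBasket {
    std::vector<double> x, y, z;     // positions
    std::vector<double> dx, dy, dz;  // unit direction vectors
    std::vector<double> step;        // proposed step length per track

    std::size_t size() const { return x.size(); }

    void add(double px, double py, double pz,
             double ux, double uy, double uz, double s) {
        x.push_back(px);  y.push_back(py);  z.push_back(pz);
        dx.push_back(ux); dy.push_back(uy); dz.push_back(uz);
        step.push_back(s);
    }
};

// One simulation stage applied to the whole basket: a straight-line move.
// The loop body has no branches and no dependency between iterations,
// which is the pattern compilers can auto-vectorize.
void propagateStraight(TrackBasket& b) {
    const std::size_t n = b.size();
    assert(b.step.size() == n && b.dx.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
        b.x[i] += b.step[i] * b.dx[i];
        b.y[i] += b.step[i] * b.dy[i];
        b.z[i] += b.step[i] * b.dz[i];
    }
}
```

In this layout, further stages (field propagation, physics process selection, and so on) would each be a similar function over the same basket, and the same arrays could in principle be copied to a GPU unchanged.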

Roughly, FNAL will be responsible for developing the code that runs on the GPU and for interfacing it with the framework, while CERN will develop the framework. The framework must allow the GPU to take more than one step and possibly to process some of the secondaries, which gives more flexibility in exploring efficient use of the GPU. The scheduler must also be flexible enough to allow the bundling type and bundle size to differ between CPU and GPU tasks. See the Challenges section for more details.

For random number generation, we should strive for reproducibility on the same hardware only (for example, we should not expect identical results when the same work is done on the CPU and on the GPU). We need to make sure that the random number stream techniques on the CPU and GPU are coordinated, but they need not be identical. We must also provide a way to rerun the same configuration with the same particles (the same workload) scheduled on the CPU and on the GPU, which requires at least one configuration (a switch) in which the scheduling is deterministic.
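
One possible way to obtain reproducible streams is to derive each track's generator state from identifiers only, so the numbers drawn do not depend on which thread or device processes the track. The sketch below is an assumption for illustration, not the generator chosen by the project: `StreamKey`, `makeTrackEngine` and `uniform01` are hypothetical names, and the standard `std::mt19937_64` stands in for whatever generator is eventually selected for the GPU.

```cpp
#include <cstdint>
#include <random>

// Hypothetical identifier of a per-track random stream.
struct StreamKey {
    std::uint32_t runSeed;
    std::uint32_t eventId;
    std::uint32_t trackId;
};

// Build an engine whose state depends only on the key, not on the
// thread or device that happens to process the track.
std::mt19937_64 makeTrackEngine(const StreamKey& k) {
    std::seed_seq seq{k.runSeed, k.eventId, k.trackId};
    return std::mt19937_64(seq);
}

// Uniform double in [0, 1) built from the top 53 bits of the raw engine
// output; this avoids library-dependent distribution implementations.
double uniform01(std::mt19937_64& eng) {
    return static_cast<double>(eng() >> 11) * (1.0 / 9007199254740992.0);
}
```

With such a scheme, rerunning the same configuration reproduces the same per-track sequences on a given platform, while deterministic scheduling is still needed to guarantee that the same tracks go to the same device.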

Description of the example

We will simulate the propagation of electromagnetic particles (electrons and photons) through a realistic high-energy physics detector on GPU/CPU.
The example geometry is to be the CMS electromagnetic calorimeter with a minimal set of solid shapes, volume segmentation and material description. Adding photon simulation would require implementing three more physics processes (Compton scattering, pair-production, photo-electric effect) [Need evaluation]. The following are the essential components for simulating massively parallel particle transport:

  • a vectorized geometry with solid shapes (box, trapezoid) and a GDML description of the CMS electromagnetic calorimeter
  • a volume-based magnetic field map of the CMS experiment
  • physics processes for electrons (Bremsstrahlung, ionization, and multiple (Coulomb) scattering),
    and possibly for photons (Compton scattering, pair-production, photo-electric effect) and positrons (annihilation)
  • cross-section data (lambda tables) for the physics processes and materials
  • storage of the resulting hit collection in a ROOT file

The example shall work on SLC6 with CUDA 5 on hardware supporting compute capability 3.5 and above (Kepler K20).

Steps

The following steps (or sub-projects) are necessary to accomplish the project.

  • Integrate Geant4 to produce the showers as initial input
  • Integrate calls to the GPU from a task
    • Convert/adapt/copy the data structures from CPU to GPU
    • Retrieve and convert back the output (track updates, new secondaries and navigator state)
    • Push new secondaries back into the scheduler
  • Develop vectorized EM physics (brem, ionization, multiple scattering).
  • Measure performance and optimize.
    • Make sure we always have the same task/example runnable with G4 (ideally the two examples should give statistically equivalent results).
    • This requires a validation 'framework'.
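
As an illustration of what "statistically equivalent results" could mean in a validation framework, the following C++ sketch compares the mean of one observable (for example the energy deposit per event) between a Geant4 reference sample and a prototype sample. The criterion (agreement of the means within a chosen number of combined standard errors) and the names `Summary`, `summarize` and `compatible` are assumptions made for this example; the actual validation criteria remain to be defined.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean of a sample and the standard error of that mean.
struct Summary {
    double mean;
    double stdError;
};

Summary summarize(const std::vector<double>& v) {
    assert(v.size() >= 2);  // a variance estimate needs at least two entries
    const double n = static_cast<double>(v.size());
    double sum = 0.0;
    for (double e : v) sum += e;
    const double mean = sum / n;
    double ss = 0.0;
    for (double e : v) ss += (e - mean) * (e - mean);
    const double variance = ss / (n - 1.0);  // unbiased sample variance
    return {mean, std::sqrt(variance / n)};
}

// True when the two sample means agree within nSigma combined standard errors.
bool compatible(const std::vector<double>& reference,
                const std::vector<double>& candidate,
                double nSigma) {
    const Summary a = summarize(reference);
    const Summary b = summarize(candidate);
    const double sigma = std::sqrt(a.stdError * a.stdError +
                                   b.stdError * b.stdError);
    return std::fabs(a.mean - b.mean) <= nSigma * sigma;
}
```

A real test suite would likely compare full distributions (energy deposition, number of hits) rather than only their means, but the same structure applies: one reference sample, one candidate sample, and an explicit tolerance.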

Tasks

  1. Agree on set of example(s) and scope(s)
    04-2013
    [ALL, FNAL, CERN]
    1. What shower generations
    2. What particles
    3. What physics
    4. What output
  2. Agree on repository
    04-2013
    [FNAL, CERN]
  3. Update the ‘makefile’ to support G4 and CUDA (and maybe OpenCL).
    04-2013
    Depends on Task # 2
    [FNAL - Philippe]
  4. Define and build a realistic geometry with the currently available solids (for example, the CMS EM calorimeter)
    04-2013
    [FNAL - Soon, Daniel, CERN]
  5. Enhance existing scheduler
    ??-2013
    [CERN]
    1. Separate scheduling from task in Vector Prototype
    2. Extend scheduling flexibility to allow different choice of bundling and size of bundle for CPU and GPU tasks.
    3. Remove (if any) requirement on the maximum amount of work done by a task (to allow for the rescheduling of work directly on the GPU).
  6. Integrate existing shared code (EM Physics) in both C++ and CUDA
    05-2013
    [FNAL - Philippe]
  7. Create example in pure Geant4.
    05-2013
    [FNAL - Soon, CERN, John A.]
  8. Define validation criteria and validation test suite for the example (for example energy deposition, number of hits, etc.)
    06-2013
    [FNAL - Soon, Daniel, CERN]
  9. Create example in CPU code for Vector Prototype
    Depends on Tasks # 1, 4, 6, 8
    06-2013
    [FNAL, CERN]
    • Requires at least the extra physics (in the CPU vector code); see Tasks 4 and 6.
  10. Add the necessary missing parts to the GPU implementation.
    1. Create the geometry from the G4 and/or Vector Prototype geometry [likely via GDML at first]
      04-2013
      [FNAL]
    2. Add track translation (this later needs to become minimal)
      ??-2013
      Depends on Task # 5
      [FNAL]
    3. Add track update from GPU result
      ??-2013
      Depends on Task # 5
      [FNAL]
    4. Complete the electron processes (creation of secondaries and hits).
      07-2013
      [FNAL]
    5. Add download and copy of secondaries and hits back to the Vector Prototype structures and scheduling.
      Depends on Tasks # 5 and # 10.4
      ??-2013
      [FNAL]
  11. Review the available choices and select a pseudo-random number generator for the GPU
    ??-2013
    [FNAL - Marc]
  12. Coordination of the random number streams on CPU and GPU
    ??-2013
    [FNAL, CERN]
  13. Measure performance and optimize.
    09-2013
    Depends on all tasks listed above.
    [FNAL, CERN]
    • We need to be able to extract the performance of the ‘improved’ part.
  14. Vectorized track dispatcher and a voxelized navigator
    ??-2013
    [CERN]
  15. Investigate the best way to group tracks/particles for both the CPU and GPU (for example by geometry type, particle type, energy, or all of the above), and update the code/infrastructure accordingly.
    Depends on all tasks listed above.
    10-2013
    [FNAL, CERN]
  16. Design EM Physics implementation for performance on GPU and CPU
    ??-2013
    [FNAL, CERN]
    • define input data structures for cross-section data (i.e., Lambda tables for physics processes and materials)
    • It is not clear whether this redesign should happen before or after we connect the existing GPU code to the Vector Prototype. This is also directly related to measuring and analyzing performance.
  17. Extend physics implementation to positrons (annihilation)
    11-2013
    [FNAL]
  18. Extend physics implementation to photons (Compton scattering, pair-production, photo-electric effect)
    2-2014
    [FNAL]
  19. Minimize the operations needed to transform tracks between CPU and GPU
    12-2013
    [FNAL, CERN]
    • Implies some convergence between the data structures.
  20. Design and implement common data structures (track, geometry, material, etc.) that can be used for both transportation and vectorized navigation.
    [Requires understanding which data can and cannot be shared and which need to be recalculated, including feedback from tasks # 6, 10.2, 10.5 and performance measurements.]
    1-2014
    [FNAL, CERN]
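
The grouping study in task 15 can be pictured with a small C++ sketch that partitions track indices by a configurable key (geometry volume, particle type, or both). The types `TrackInfo` and `GroupKey` and the function `groupTracks` are hypothetical names introduced for this illustration only; the real scheduler may group differently, and energy binning is left out for brevity.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

enum class Particle { Electron, Photon, Positron };

// Minimal per-track information needed for grouping (hypothetical).
struct TrackInfo {
    int volumeId;   // index of the geometry volume containing the track
    Particle type;  // particle species
};

using GroupKey = std::pair<int, Particle>;

// Partition track indices into groups. A criterion that is switched off is
// replaced by a constant placeholder so that it does not split the groups.
std::map<GroupKey, std::vector<std::size_t>>
groupTracks(const std::vector<TrackInfo>& tracks,
            bool byVolume, bool byParticle) {
    std::map<GroupKey, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < tracks.size(); ++i) {
        const int vol = byVolume ? tracks[i].volumeId : -1;
        const Particle p = byParticle ? tracks[i].type : Particle::Electron;
        groups[GroupKey(vol, p)].push_back(i);
    }
    return groups;
}
```

Each resulting group is a candidate bundle; comparing run times for the different flag combinations, separately on the CPU and the GPU, is one way the grouping strategies could be evaluated.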

Challenges

This section lists some of the challenges that span several tasks and represent significant research topics.

  • Define the correct details and level of granularity for information passing between the CPU and GPU code
    • IN - particles to propagate (electrons and photons across the entire detector)
    • OUT - energy deposits and stopping information for completed particles, and particles to propagate further (mostly all secondaries that are not photons or electrons).
  • Coordinate the geometry (G4, ROOT, GPU).
    Geant4, ROOT and the GPU code use different in-memory representations of the detector geometry. We need to be able to generate all three forms from a single source to avoid divergence. We also need to understand and mitigate the performance impact of the in-memory duplication (Geant4/ROOT) that is necessary only for the purpose of the example (and not in a production environment).
  • Coordinate random number generation.
    We need to be able to run in a fully reproducible mode for debugging purposes, which requires good control over the random number streams, even in the massively parallel (GPU) and multi-threaded parts of the code.
  • Coordinate the navigation
    Both the GPU and the CPU may keep a certain amount of state information about the propagation of each particle (pointers to the current geometric elements, etc.). We need to review and understand how much information must be shared between the CPU and GPU state, and clarify whether this sharing is semantically required or is present only to avoid redundant calculation.
  • Reduce code duplication between CPU and GPU for common functionality (physics implementation in particular). Areas of plausible sharing include:
    1. vectorized geometry and navigator
    2. full EM physics (potentially vectorized)
    3. interface between the vectorization scheduler and GPU kernels
  • Reduce copy to/from CPU data structures.
    In the first integration step, we should preserve the existing differences between the two code bases and accept the performance cost of transforming the data structures back and forth between the two sets. Once the prototype is functional, we can improve performance by refactoring the two sets of data structures into a smaller set that reduces the need for data transformation without impeding the run-time performance of the algorithms.
    • Reduce data conversion when going to the GPU (for example, use indices rather than pointers).
    • How to do so in a thread-safe manner.
    • What part of the object state (navigation in particular) should be shared.
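
The "index rather than pointers" idea can be sketched in C++ as follows. A host pointer is meaningless in GPU memory, whereas an index into a table that exists on both sides stays valid after a plain byte-wise copy. The types `Volume`, `HostTrack` and `DeviceTrack` and the two conversion functions are hypothetical and much simpler than the real track and geometry classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>
#include <vector>

// Host-side description of a volume (hypothetical, not a Geant4/ROOT class).
struct Volume {
    std::string name;
    double density;
};

// Pointer-based track state, as typically used on the CPU.
struct HostTrack {
    double energy;
    const Volume* volume;
};

// Index-based track state: trivially copyable and valid in any address space.
struct DeviceTrack {
    double energy;
    std::int32_t volumeIndex;
};
static_assert(std::is_trivially_copyable<DeviceTrack>::value,
              "DeviceTrack must be safe to copy byte-wise");

// Precondition: t.volume points to an element of `volumes`.
DeviceTrack toDevice(const HostTrack& t, const std::vector<Volume>& volumes) {
    assert(!volumes.empty());
    const std::ptrdiff_t idx = t.volume - volumes.data();
    assert(idx >= 0 && static_cast<std::size_t>(idx) < volumes.size());
    return {t.energy, static_cast<std::int32_t>(idx)};
}

HostTrack toHost(const DeviceTrack& t, const std::vector<Volume>& volumes) {
    assert(t.volumeIndex >= 0 &&
           static_cast<std::size_t>(t.volumeIndex) < volumes.size());
    return {t.energy, &volumes[static_cast<std::size_t>(t.volumeIndex)]};
}
```

If both code bases eventually stored the index directly, the two conversion functions would disappear, which is the kind of convergence between data structures that this challenge aims at.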