Users' Experience in Bridging the Multicore Programmability Gap

Monday, Nov. 16, 2009

This workshop was listed in the SC09 schedule as starting at 8:30AM, but the organizers planned for it to begin at 9AM.

The focus is on "user experience" in trying out "new forms of computation". The organizers have run a series of workshops, each year with a different emphasis. We were asked to move from a room that held us comfortably into one that was too small, because of poor organization elsewhere.

The focus of this year's workshop is supposed to be users' experience in using new languages ...

Chapel talk
Co-array Fortran talk

Keynote: Beyond UPC (Kathy Yelick)

Announced as "needing no introduction". Always a poor idea. She's the director of NERSC at LBL. Speaks at 1000 words per minute, and seems not to breathe.

She's interested in "exascale" computers, and only in supercomputer programming, including the special cases that arise with huge numbers of cores (she talked about hundreds of millions of cores to reach an exascale computer). There is a need for billion-way concurrency (maybe billions of threads, maybe more). She expects millions of chips, each with thousand-way concurrency.

Divide thousand-way concurrency into data parallelism and task parallelism. Maybe vast vector processing, with multiple ALUs and vector floating-point. "Maybe you don't want an operating system on the machine".

"multicore" -> complex cores; "manycore" -> more and simpler cores. She prefers the manycore for the huge computers. But multicore is still needed to run PowerPoint. Is the only thing on which large computing will be done petascale computers?

Running an operating system is called "legacy code".

Memory density is doubling every three years; processor logic every two years. Thus the ability to have enough memory per "core" is failing.
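A back-of-the-envelope sketch of this claim (not from the talk): if the number of cores tracks logic density and memory capacity tracks memory density, then memory per core scales as 2^(t/3) / 2^(t/2) = 2^(-t/6), so it halves roughly every six years. The function name and the assumption that core count grows in step with logic density are illustrative only.

```python
def memory_per_core_ratio(years, mem_doubling=3.0, logic_doubling=2.0):
    """Relative memory per core after `years`, normalized to 1.0 at year 0.

    Assumption (not from the talk): core count grows in step with
    processor logic density, and memory capacity with memory density.
    """
    memory_growth = 2.0 ** (years / mem_doubling)
    core_growth = 2.0 ** (years / logic_doubling)
    return memory_growth / core_growth


for years in (0, 6, 12):
    # Prints 1.0, 0.5 and 0.25: memory per core halves every six years.
    print(years, memory_per_core_ratio(years))
```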

"Can't run MPI everywhere": works on dual- and quad-core machines, but the memory capacity won't keep up. (No mention of the concept that some problems might not be natural fits for MPI. It seems unbelievable that all problems are really suitable for MPI.)

Memory bandwidth is also a problem; wasting memory space also means wasting memory bandwidth, since data has to be moved in and out of memory.

People are already using MPI and OpenMP. This seems suitable for dual-chip 12-core-per-chip machines.

PGAS (partitioned global address space) languages are what she suggests. "Best of MPI and threads". "No less scalable than MPI". "Forces you to think in parallel all the time." This puts programmers into the mind-set that writing serial code is the special case. Parts of memory are global and shared, parts are local. Does not require cache-coherent shared memory. UPC (in her talk title) is a PGAS language.
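To make the "parts global and shared, parts local" idea concrete, here is a toy single-process model in Python. It is not UPC and does not reproduce UPC semantics; the class name, the blocked layout, and the remote-access counter are all hypothetical, chosen only to show that every thread can address every element while each element lives in exactly one thread's local memory.

```python
class ToyPGASArray:
    """Toy model of a partitioned global array (illustration only, not UPC).

    One global index space is split into equal contiguous blocks, and each
    block has affinity to (is stored in the local memory of) one thread.
    """

    def __init__(self, length, num_threads):
        self.length = length
        self.block = -(-length // num_threads)  # ceiling division
        # One private list per thread stands in for that thread's local memory.
        self.parts = [
            [0] * max(0, min(self.block, length - t * self.block))
            for t in range(num_threads)
        ]
        self.remote_accesses = 0

    def owner(self, i):
        """Thread whose local memory holds global index i."""
        return i // self.block

    def read(self, me, i):
        """Thread `me` reads global index i; any thread may read any index."""
        if self.owner(i) != me:
            self.remote_accesses += 1  # on a real machine: network traffic
        return self.parts[self.owner(i)][i % self.block]

    def write(self, me, i, value):
        """Thread `me` writes global index i; no message passing is needed."""
        if self.owner(i) != me:
            self.remote_accesses += 1
        self.parts[self.owner(i)][i % self.block] = value


a = ToyPGASArray(length=8, num_threads=4)
a.write(me=0, i=1, value=10)  # index 1 has affinity to thread 0: local
a.write(me=0, i=7, value=70)  # index 7 has affinity to thread 3: remote
print(a.read(me=3, i=7), a.remote_accesses)  # prints: 70 1
```

The point of the counter is that a remote access looks the same in the code as a local one but costs more, which is why data layout matters in this model.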

Showed many plots comparing performance of "naive" and "fully hand optimized" codes, across various multi- and many-core machines. The plots were not very clear in purpose, and seemed to have no relation to the focus of the session (use of new languages).

"PGAS languages are a good fit to machines with explicitly managed memory (local store). It seems they are special-purpose languages for massively data-parallel calculations. How does this help with pattern recognition for tracking detectors, or Kalman filtering, or tracking particles through material, or making GEANT4 faster?

"Don't hide the complexity of the machine from the programmer; expose it and let them program it in a fairly platform-independent way." This does not seem suitable for our use with experiments at Fermilab. We need to work with a programming model that allows "casual" programmers to work effectively.

"Features of a successful language":

  1. Portability of applications; multiple compilers, portable compilers, or both
  2. Interoperability with other models: calling MPI, necessary for incremental development.
  3. Performance has to be equivalent to or better than current systems, and has to scale; nobody wants to move to a slower language.
  4. Must take advantage of the best possible hardware.

Buy a node with as much memory as you can afford, then figure out how many cores your program can actually use, given its memory footprint and the node's total memory.
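A minimal sketch of that sizing exercise, assuming one process per core with a fixed, identical memory footprint (for example, one MPI rank per core). The function name and the numbers are hypothetical, not from the talk.

```python
def usable_cores(node_memory_gb, cores_per_node, footprint_per_process_gb):
    """Number of one-per-core processes that fit in a node's memory.

    Assumption (not from the talk): every process needs the same fixed
    footprint and nothing is shared between processes.
    """
    fits_in_memory = int(node_memory_gb // footprint_per_process_gb)
    return min(cores_per_node, fits_in_memory)


# Hypothetical node: 24 cores, 32 GB, with a 2 GB footprint per process.
print(usable_cores(node_memory_gb=32, cores_per_node=24,
                   footprint_per_process_gb=2))  # prints: 16
```

In this example memory, not core count, is the limit: 8 of the 24 cores would sit idle unless the processes can share data.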

Running UPC on an Ethernet system will give poor performance; she counts this as "bad hardware".

One-sided MPI may be of interest; need to look into this.

Much interest in autotuning of specific applications, presented with specific data.

If you're running on a cloud, you can't use too much specific optimization.

What goes well with UPC? "Irregular applications": this was the original intent. Everything she had said up to this point was about regular applications. Irregular means:
  • Irregular data access patterns
  • Irregular in space, needs to access more than "nearest neighbors"
  • Irregular computational patterns: not strict SIMD; needs "teams" or "gangs" of threads; not "bulk-synchronous". (I think this is what we do for experiments, but her explanation was not very clear.)