Our latest two LQCD clusters have 8 and 32 cores per node. Essentially all physics production on these clusters relies on simply assigning one MPI rank to each core. This is not optimal: in domain decomposition most message traffic is between nearest neighbors (the only non-nearest-neighbor traffic is in global reductions), so there is an advantage to placing adjacent subvolumes on the same node, where their exchanges need not cross InfiniBand at all. In addition, on our newest cluster the InfiniBand hardware is smart enough to coalesce messages: during an exchange of face data between many subvolumes, a well-chosen layout lets the messages between each pair of computers be combined into fewer, larger messages. This matters because our strong scaling is limited by the fall-off in effective bandwidth as message sizes decrease.
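To illustrate the effect, here is a toy sketch (all sizes are assumptions chosen for illustration: an 8x8x8 grid of subvolumes, hypothetical 32-core nodes, and a 4x4x2 brick of subvolumes per node) that counts how many nearest-neighbor face exchanges cross a node boundary under the default consecutive-rank placement versus a blocked placement:

```python
from itertools import product

DIMS = (8, 8, 8)         # global grid of subvolumes: 512 MPI ranks (assumed sizes)
CORES_PER_NODE = 32      # as on the hypothetical 32-core node

def linear_node(c):
    """Default placement: consecutive ranks (lexicographic order) fill a node."""
    x, y, z = c
    rank = (x * DIMS[1] + y) * DIMS[2] + z
    return rank // CORES_PER_NODE

def blocked_node(c, block=(4, 4, 2)):
    """Blocked placement: each node owns a 4x4x2 brick of adjacent subvolumes."""
    nblocks = tuple(DIMS[i] // block[i] for i in range(3))
    bc = tuple(c[i] // block[i] for i in range(3))
    return (bc[0] * nblocks[1] + bc[1]) * nblocks[2] + bc[2]

def off_node_faces(node_of):
    """Count face exchanges (positive directions, periodic) that cross nodes."""
    off = 0
    for c in product(*(range(d) for d in DIMS)):
        for axis in range(3):
            n = list(c)
            n[axis] = (n[axis] + 1) % DIMS[axis]   # periodic nearest neighbor
            if node_of(c) != node_of(tuple(n)):
                off += 1
    return off

print(off_node_faces(linear_node))    # 640 faces go off-node
print(off_node_faces(blocked_node))   # 512 faces go off-node
```

The default placement produces slab-shaped per-node regions with a larger surface, so more faces cross InfiniBand; the brick-shaped regions keep more exchanges on-node, and the off-node traffic concentrates onto fewer node pairs, which is exactly the situation where message coalescing can help.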
So the project would involve taking a specific LQCD application (one of the MILC applications), modifying it so that the geometric layout of the subvolumes can be determined and controlled, and measuring the resulting performance differences. On Ds, which has 8 NUMA nodes per computer, there are also opportunities to optimize the geometric layout within a node. For this work, NUMA tools such as likwid or hwloc could be used to discover the NUMA geometry and tailor the layouts to it. Finally, I would expect that performance counters, including the hardware counters in the InfiniBand cards, could be used to understand the effects on message sizes and on memory bandwidth.
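As a sketch of what "controlling the layout" might look like once the NUMA geometry is known, the map below extends the blocking idea hierarchically: consecutive ranks first fill a NUMA domain, then a node. All sizes are assumptions for illustration (an 8x8x8 grid, a 4x4x2 brick per 32-core node, and a 2x2x1 sub-brick per NUMA domain, taking a Ds-like machine with 8 NUMA domains of 4 cores each); the real sub-brick shapes would come from what hwloc or likwid reports.

```python
from itertools import product
from math import prod

DIMS     = (8, 8, 8)   # global grid of subvolumes: 512 MPI ranks (assumed)
NODE_BLK = (4, 4, 2)   # brick owned by one 32-core node (assumed)
NUMA_BLK = (2, 2, 1)   # sub-brick owned by one 4-core NUMA domain (assumed)

def lin(idx, dims):
    """Lexicographic linearization of a multi-index."""
    r = 0
    for i, d in zip(idx, dims):
        r = r * d + i
    return r

def placement_rank(c):
    """Rank that should own subvolume c: consecutive ranks fill a NUMA
    domain first, then a node, so neighbors stay physically close."""
    nb = [c[i] // NODE_BLK[i] for i in range(3)]            # which node brick
    w  = [c[i] %  NODE_BLK[i] for i in range(3)]            # coord inside it
    ub = [w[i] // NUMA_BLK[i] for i in range(3)]            # which NUMA sub-brick
    v  = [w[i] %  NUMA_BLK[i] for i in range(3)]            # coord inside it
    node_grid = [DIMS[i] // NODE_BLK[i] for i in range(3)]      # (2, 2, 4)
    numa_grid = [NODE_BLK[i] // NUMA_BLK[i] for i in range(3)]  # (2, 2, 2)
    numa_per_node  = prod(numa_grid)   # 8 NUMA domains per node
    cores_per_numa = prod(NUMA_BLK)    # 4 cores per NUMA domain
    node = lin(nb, node_grid)
    numa = lin(ub, numa_grid)
    core = lin(v, NUMA_BLK)
    return (node * numa_per_node + numa) * cores_per_numa + core

# The inverse map (rank -> subvolume coordinate) is what the application
# would consult to decide which subvolume each MPI rank works on.
layout = {placement_rank(c): c for c in product(*(range(d) for d in DIMS))}
```

In a real run, the node and NUMA-domain identities would be discovered at startup (for instance with hwloc), and the measured effect of such a reordering on message sizes and memory bandwidth is exactly what the performance counters would be used to verify.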