Simulating more particles in less wall-clock time is only possible by exploiting the available parallel resources. For ‘shared memory’ parallelism, such as TBB or GPGPU, the bottleneck for this type of simulation is typically memory bandwidth and memory latency. When switching to MPI parallelism across multiple compute nodes, the main challenge...