VASP on Sisu
VASP 5.3.3 with the Transition State Tools (TST) extension has been compiled using ifort and the MKL math library for BLAS/LAPACK/FFTs/ScaLAPACK, plus the ELPA library (Eigenvalue soLvers for Petaflop Applications, a kind of extension to ScaLAPACK supported by VASP). This works nicely, passes basic tests and produces good performance, which is to say that it is clearly better than on, for example, the Swedish clusters Triolith (NSC, Linköping) and Lindgren (Cray XE6, PDC, Stockholm). See attached makefiles. It was not possible to compile VASP with the Cray compilers without more extensive meddling with the code, so that is left as a home exercise for the interested administrator for now.
The scaling deserves some comments. The test system is a MgH2 supercell with a number of vacancies, designed by Peter Larsson at NSC. All VASP users should read his blog. The 'NGZhalf' and 'gamma' keywords above designate the "Normal" VASP and the "Gamma-point only" versions, respectively. The unit on the y-axis is 'Jobs/hour', or in other words 3600/(runtime in seconds). We note that the gamma-point version is both quicker and can be made to scale better in this case. Not shown are curves for compilations without ELPA. It turned out that ELPA consistently gave improvements for all sizes of calculations investigated here, so I decided to just go with that.
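For reference, the y-axis unit is easy to reproduce from a measured wall time. This is only a small sketch: the function name jobs_per_hour and the two runtimes are made up for illustration and are not measurements from Sisu.

```shell
#!/bin/sh
# Convert a wall-clock runtime in seconds to the 'Jobs/hour' unit: 3600 / runtime.
# jobs_per_hour is a hypothetical helper name, not part of VASP or the batch system.
jobs_per_hour() {
    awk -v t="$1" 'BEGIN { printf "%.1f\n", 3600 / t }'
}

# Illustrative runtimes only (not benchmark results).
jobs_per_hour 450     # prints 8.0
jobs_per_hour 1200    # prints 3.0
```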
Now to the counterintuitive bit: there are also two additional curves in which I have run on only half of the available CPUs on each node; these are the ones called '8 CPU's/node' in the figure. Note that these scale significantly better and even overtake the full-node calculations for larger jobs. So: With VASP it pays to only use half of the available processors! At least for larger jobs. This has been seen before and the reasons have been analyzed here. The NPAR parameter has not been optimized; I have just used the generally recommended VASP setting "close to sqrt(number of processors)". The curves can probably be made somewhat smoother with NPAR optimization, but I have not seen any major difference yet when I have fiddled with it.
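The "close to sqrt(number of processors)" rule of thumb can be turned into a starting value with a few lines of shell. This is a minimal sketch, not a tuned recipe: the helper name suggest_npar is hypothetical, and restricting the choice to divisors of the rank count is my own assumption (it keeps the ranks evenly divided among the NPAR groups).

```shell
#!/bin/sh
# Suggest an NPAR starting value: the largest divisor of the rank count
# that does not exceed sqrt(ranks).
# Assumption: we only consider divisors of the rank count.
suggest_npar() {
    ranks=$1
    best=1
    d=1
    # Walk d upwards while d*d <= ranks, remembering the last divisor seen.
    while [ $((d * d)) -le "$ranks" ]; do
        if [ $((ranks % d)) -eq 0 ]; then
            best=$d
        fi
        d=$((d + 1))
    done
    echo "$best"
}

suggest_npar 768     # prints 24 (sqrt(768) is about 27.7)
suggest_npar 1536    # prints 32 (sqrt(1536) is about 39.2)
```

The result would go into INCAR as the NPAR value. Treat it as a first guess to vary from, not as an optimum.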
Doing the calculations properly on half a node requires a non-trivial modification of the run command. Example script:
#!/bin/bash
#SBATCH -J test_job
#SBATCH -t 00:30:00
#SBATCH -N 96
#SBATCH -p large
# 768 ranks in total, 8 per node, 4 per NUMA node, 2 CPUs reserved per rank
aprun -n 768 -N 8 -S 4 -d 2 vaspbinary
So first we allocate 96 nodes = 1536 CPUs. Then we run on only half of them (-n 768) and make sure that we get eight CPUs per node (-N 8). If we stop there we get terrible performance, because the gain comes from improved memory access patterns, and those will probably not improve unless we also place the processes in the right way on the nodes. Therefore, we go on to make sure that the workload is evenly spread over the two NUMA nodes of each compute node (-S 4), and then that we lay out the calculation so that every second processor is allocated (-d 2). The last '-d' flag is really a way of reserving processors two-by-two for OpenMP applications, but since there is no OpenMP in VASP, it will leave every second processor idling. This way each active CPU gets as much memory as possible, as close to it as possible, and we can squeeze out the performance gain shown above.
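The arithmetic behind the aprun line can be sketched as a small helper that only prints the command for a given node count. The function name half_node_aprun is hypothetical. The 16 CPUs per node and two NUMA nodes per compute node are taken from the numbers in the example (1536/96 = 16) and would need changing on other machines.

```shell
#!/bin/sh
# Print (not run) the aprun command for a half-populated node layout.
# Assumptions: 16 CPUs per node split over 2 NUMA nodes, as in the example above.
half_node_aprun() {
    nodes=$1
    binary=$2
    cores_per_node=16
    numa_per_node=2
    ranks_per_node=$((cores_per_node / 2))             # use every second CPU
    ranks_per_numa=$((ranks_per_node / numa_per_node)) # spread evenly over NUMA nodes
    total_ranks=$((nodes * ranks_per_node))
    echo "aprun -n $total_ranks -N $ranks_per_node -S $ranks_per_numa -d 2 $binary"
}

half_node_aprun 96 vaspbinary
# prints: aprun -n 768 -N 8 -S 4 -d 2 vaspbinary
```

Printing the command first makes it easy to check the numbers against the #SBATCH -N line before submitting.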