GROMACS 2023 Massively Improved with NVIDIA GPU Scalability
Introduction
The GROMACS team has released a major update with improvements to multi-node NVIDIA GPU scalability. GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics software package used extensively to study molecular behavior, from drug discovery to research on proteins and other biomolecules.
The latest improvements come from enabling GPU Particle Mesh Ewald (PME) decomposition together with GPU direct communication, delivering up to a 21x performance increase.
Improved Multi-Node Performance
GROMACS has typically assigned one GPU to the PME long-range force calculations, while the remaining GPUs handle the short-range particle-particle (PP) force calculations.
With a single GPU dedicated to PME, scalability is limited. While more GPUs can be added to tackle the PP force calculations, most simulations cannot scale past a few additional nodes: the single PME GPU becomes a bottleneck no matter how many additional GPUs are allocated to the PP work. On nodes with four GPUs, for example, an 8-node run leaves 31 PP GPUs all funneling work through one PME GPU.
In the latest release, GROMACS 2023, the PME calculation can be decomposed across multiple GPUs to relieve this bottleneck. PME decomposition leverages the NVIDIA cuFFTMp library, which enables fast Fourier transform (FFT) calculations to be distributed across multiple GPUs and multiple nodes. cuFFTMp uses NVSHMEM, a parallel programming interface for fast one-sided communication, which can make use of intra- and inter-node interconnects to perform the all-to-all communication required by distributed 3D FFTs.
By integrating decomposed PME calculations, GROMACS can now run multiple PME ranks in the same simulation, gaining not only better performance but also the ability to scale. Alongside PME GPU decomposition, GROMACS 2023 introduces pipelining and parallel implementations that improve the overlap of communication with computation and use grid overlap instead of redistribution.
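As a concrete illustration (a minimal sketch only; the node and GPU counts are hypothetical, and the full build and run steps appear later in this post), running one MPI rank per GPU on four 4-GPU nodes and requesting one PME rank per node splits the work as follows:
# Hypothetical layout: 4 nodes x 4 GPUs = 16 MPI ranks, one per GPU.
# With PME decomposition, one rank per node handles decomposed PME and
# the remaining ranks handle short-range PP interactions.
NODES=4
GPUS_PER_NODE=4
TOTAL_RANKS=$((NODES * GPUS_PER_NODE))   # 16
PME_RANKS=$NODES                         # 4
PP_RANKS=$((TOTAL_RANKS - PME_RANKS))    # 12
echo "$TOTAL_RANKS MPI ranks total: $PP_RANKS PP ranks + $PME_RANKS PME ranks"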
Performance Results
The results are stark, revealing how important this update is to GROMACS and its ability to scale with the available hardware. Two benchmarks were run: the 1M-atom STMV system, representative of common workloads and challenging to scale, and the 12M-atom BenchPEP system, representative of very large systems at the upper end of molecular dynamics.
STMV Performance Results
From the start, we can see a 2x increase in single-node performance. But the single PME GPU becomes an apparent bottleneck when scaling beyond 2 nodes: the red and blue lines (legacy and GPU direct communication only) flatline no matter how much additional hardware is added. The green line, with PME decomposition enabled, continues to scale as nodes are added, delivering more than double the performance at 4 nodes and peaking at 3 times the performance at 8 nodes (32 GPUs) before leveling out at 16 nodes.
BenchPEP Performance Results
With the larger BenchPEP simulation, the value of PME decomposition is even clearer. Because the system is so large, the need to distribute PME across multiple GPUs is apparent, and the scalability gains are correspondingly larger.
From the start, we see similar results. The red and blue lines both plateau at 2 nodes, whereas PME decomposition pulls ahead, diverging from both the legacy and the GPU-direct-communication-only configurations. Because PME decomposition allocates one PME GPU per node, the difference between the three configurations only becomes apparent when scaling to 3 or more nodes.
The need to distribute the PME workload is most apparent with the larger BenchPEP dataset. Performance continues to climb, reaching a peak 21x faster than the legacy configuration at 64 nodes. Few labs will deploy more than 256 NVIDIA A100 GPUs, but even the step from 8 GPUs (2 nodes) to 20 GPUs (5 nodes) more than doubles performance.
How to Build and Run GROMACS with PME Decomposition
We installed NVIDIA HPC SDK 22.11 by following the download and installation instructions on the NVIDIA website. For GROMACS 2023, we do not recommend a later version of the HPC SDK due to potential compatibility issues.
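As a rough sketch of the environment setup (the installation prefix below is an assumption based on the HPC SDK's default location; many clusters expose the same paths through an environment module such as nvhpc instead), the compilers and the bundled CUDA-aware OpenMPI can be added to the environment like this:
# Assumed default install prefix for HPC SDK 22.11; adjust to your system
NVHPC_ROOT=/opt/nvidia/hpc_sdk/Linux_x86_64/22.11
# Compilers (nvc, nvc++, nvfortran) and the bundled CUDA-aware OpenMPI
export PATH=$NVHPC_ROOT/compilers/bin:$NVHPC_ROOT/comm_libs/mpi/bin:$PATH
export LD_LIBRARY_PATH=$NVHPC_ROOT/compilers/lib:$NVHPC_ROOT/comm_libs/mpi/lib:$LD_LIBRARY_PATH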
Obtain the GROMACS 2023 release version:
git clone https://gitlab.com/gromacs/gromacs.git
cd gromacs
git checkout v2023
Build GROMACS:
# make a new directory to build in
mkdir build
cd build
# set the location of the math_libs directory in the NVIDIA HPC installation
HPCSDK_LIBDIR=/lustre/fsw/devtech/hpc-devtech/alang/packages/nvhpc/nvhpc_2022_2211_Linux_x86_64_cuda_11.8-install/Linux_x86_64/2022/math_libs
# build the code with PME GPU decomposition with cuFFTMp enabled,
# in an environment with a CUDA-aware OpenMPI installation
# (see https://manual.gromacs.org/current/install-guide/index.html)
cmake \
../ \
-DGMX_OPENMP=ON -DGMX_MPI=ON -DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=CUDA -DCMAKE_BUILD_TYPE=Release -DGMX_DOUBLE=off \
-DGMX_USE_CUFFTMP=ON -DcuFFTMp_ROOT=$HPCSDK_LIBDIR
# Build using 8 CPU threads; increase this if more CPU cores are available.
make -j 8
The HPCSDK_LIBDIR variable is set to the Linux_x86_64/2022/math_libs subdirectory of the HPC SDK installation.
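If the HPC SDK was installed to its default location, HPCSDK_LIBDIR would instead look something like the sketch below (the prefix is an assumption; the find command is just a sanity check that a cuFFTMp library is present under math_libs):
# Example for a default HPC SDK 22.11 installation; adjust the prefix to your system
HPCSDK_LIBDIR=/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/math_libs
# Sanity check: a cuFFTMp library should appear somewhere under math_libs
find "$HPCSDK_LIBDIR" -name 'libcufftMp*' 2>/dev/null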
Obtain the input file:
wget https://zenodo.org/record/3893789/files/GROMACS_heterogeneous_parallelization_benchmark_info_and_systems_JCP.tar.gz
tar zxvf GROMACS_heterogeneous_parallelization_benchmark_info_and_systems_JCP.tar.gz
ln -s GROMACS_heterogeneous_parallelization_benchmark_info_and_systems_JCP/stmv/topol.tpr .
To use PME GPU decomposition, set the following variables:
# Specify that GPU direct communication should be used
export GMX_ENABLE_DIRECT_GPU_COMM=1
# Specify that GPU PME decomposition should be used
export GMX_GPU_PME_DECOMPOSITION=1
Specify the total number of PME GPUs through the -npme <N> flag to mdrun.
Without PME GPU decomposition, N is 1, as only a single PME GPU can be used.
With decomposition, set N to the number of nodes in use so that one GPU per node handles PME, with the remaining GPUs in each node (three, on the 4-GPU nodes used here) dedicated to PP. This division typically gives a good balance given the relative computational expense of the PP and PME workloads, but experimentation is recommended for any specific case.
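Putting it all together, a benchmark launch on four 4-GPU nodes might look like the sketch below. This is only an illustrative example: the rank, thread, and step counts are assumptions to adapt to your own hardware and MPI launcher, and the -resethway and -noconfout flags are only useful for benchmarking (they reset the timers halfway through and skip writing the final configuration).
# 4 nodes x 4 GPUs = 16 MPI ranks, one rank per GPU
export GMX_ENABLE_DIRECT_GPU_COMM=1
export GMX_GPU_PME_DECOMPOSITION=1
# Offload nonbonded, PME, bonded, and update work to the GPUs,
# with 4 PME ranks (one per node) and 12 PP ranks
mpirun -np 16 gmx_mpi mdrun \
  -s topol.tpr \
  -ntomp 8 \
  -nb gpu -pme gpu -bonded gpu -update gpu \
  -npme 4 \
  -nsteps 10000 -resethway -noconfout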
Conclusion
The new features introduced in GROMACS 2023 drastically improve its ability to scale across multi-node GPU clusters. With simulations that now scale with the available hardware, more molecular dynamics problems can be solved quickly, shortening time to solution and enabling more comprehensive analysis of molecules and proteins.
Scale your own GROMACS simulations and accelerate your scientific research with an Exxact GROMACS-optimized solution built to propel your discovery to new heights. Learn more about GROMACS at their forum!
Have any Questions?
Contact Us Today!