GPU acceleration and other improvements of the one-body matrix element calculations #2
GaffaSnobb started this conversation in Ideas
Alright, so it's finally the part of the coding I have been looking forward to! Calculation of the matrix elements is highly parallelisable, so testing GPU acceleration seems worth checking out. There are, however, a number of considerations when it comes to GPU programming, and some of them are similar to the mindset you need when parallelising for the CPU.
Choice of GPGPU framework and compatibility
Nvidia's CUDA has for years been the go-to platform for general-purpose GPU programming (GPGPU): for a long time it was really the only choice, and even after the emergence of AMD's open-source alternative, HIP, CUDA had (and has) far better support. You can run CUDA on more or less every consumer-grade GPU that is 10 years old or newer, meaning that the bar for GPGPU is low on the Nvidia platform. AMD's software stack ROCm, which contains HIP, has at the time of this writing official support for only three consumer-grade GPUs, namely the Radeon VII, Radeon RX 7900XT and Radeon RX 7900XTX. The VII is several years old at this point, and the latter two are the two most expensive cards in the consumer line-up. The official support is very limited, but you'll likely manage to get ROCm to run on Navi 1X and 2X even though they are not officially supported by AMD.
In contrast, ROCm support on data-center-grade GPUs (or rather, accelerators) is better. AMD's Instinct line-up is a set of extremely beefy accelerators with up to 192 GB of memory. The relatively newly commissioned supercomputer LUMI has a ridiculous 11 912 AMD Instinct MI250X accelerators powered by the ROCm software stack. Choosing to build a supercomputer with AMD accelerators makes me trust the ROCm platform more.
I have chosen ROCm/HIP for three reasons. First, it is open-source. Nvidia has had a firm, closed grip on the GPGPU market for years; I don't like closed-source, I like open-source. Simple as that. Second, I hope to run my code on LUMI at some point. Third, HIP actually exists and works now! ROCm was first released in 2016, meaning that CUDA was the only viable option before that. Popular software like TensorFlow and PyTorch actually runs on AMD GPUs today because ROCm/HIP exists.
Accelerating the one-body matrix element calculations
The goal is of course to accelerate as much as possible, but since the one-body calculations are relatively small, they are a nice place to start. Recall the definition of the one-body Hamiltonian operator,

$$ \hat{H}_{\text{1-body}} = \sum_{\alpha \beta} \epsilon_{\alpha \beta} c_\alpha^\dagger c_\beta, $$

where $\epsilon_{\alpha \beta}$ are the single-particle energies and $c^\dagger$ and $c$ are creation and annihilation operators, respectively. From the operator we calculate the matrix elements

$$ H_{ij} = \langle i | \hat{H}_{\text{1-body}} | j \rangle, $$

where $| i \rangle$ and $| j \rangle$ are basis states. There can be up to billions of basis states, so that looks like a nice place to start the GPU parallelisation.
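For reference, matrix elements like these can be evaluated on the CPU with Slater determinants stored as bitmasks. The following is a minimal, self-contained sketch of that idea; the bitmask representation and the function names are my illustrative assumptions, not the actual code from the repository:

```cpp
#include <cassert>
#include <cstdint>

// Toy evaluation of <bra| c_a^dagger c_b |ket> for single Slater determinants.
// Hypothetical representation: bit p of a state is 1 iff orbital p is occupied.

// Phase (-1)^k from anticommuting an operator past the occupied orbitals below p.
int phase(uint64_t state, int p)
{
    uint64_t below = (uint64_t(1) << p) - 1;            // mask for orbitals 0..p-1
    return (__builtin_popcountll(state & below) % 2) ? -1 : 1;
}

// <bra| c_a^dagger c_b |ket>: annihilate orbital b, create orbital a, track the sign.
double one_body_me(uint64_t bra, uint64_t ket, int a, int b)
{
    if (!(ket & (uint64_t(1) << b))) return 0.0;        // b must be occupied in |ket>
    int sign = phase(ket, b);
    uint64_t tmp = ket & ~(uint64_t(1) << b);           // annihilate b
    if (tmp & (uint64_t(1) << a)) return 0.0;           // a must be empty (Pauli)
    sign *= phase(tmp, a);
    tmp |= (uint64_t(1) << a);                          // create a
    return (tmp == bra) ? double(sign) : 0.0;
}
```

The phase bookkeeping is what makes the determinants antisymmetric: moving an operator past an occupied orbital flips the sign.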
My first attempt at GPU-accelerating the calculation of a single one-body matrix element looks like this (the CPU version of the code can be seen here):
So-called kernels (`__global__`) are functions which are called from the CPU and executed on the GPU. Note that in this setting, the CPU is more commonly called the host and the GPU the device. Device functions (`__device__`), however, are only callable from the device and are also executed on the device.

In the kernel I have to do some index gymnastics to make sure that the matrix elements are mapped to the correct basis states. This is because the kernel is called in parallel by up to several thousand threads at the same time, and each thread sees `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, from which a unique index can be computed to map the thread to the correct basis state and matrix element. This means that I don't have to explicitly write the for-loop over the basis states, but rather just tell the GPU how many threads I want spawned and then map the unique thread identifiers accordingly. Pretty neat, I think!

The one-body calculations themselves are actually almost identical to the CPU version of the code, except that I had to use primitive data types instead of `std::vector` and some other self-made data structures, because only primitive data types are allowed on the GPU. Additionally, all data has to be manually copied from the host to the device, and new arrays must be allocated on the device.

Some simple benchmarks show that the GPU is significantly faster than the CPU (a 16-core 7950X), even when including the copy of the Hamiltonian matrix from the device back to the host, something that host-only calculations do not need to do. More detailed benchmarks will come.
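To make the index gymnastics concrete, the arithmetic can be demonstrated host-side with ordinary variables standing in for the HIP built-ins (`blockIdx.x`, `blockDim.x`, `threadIdx.x`). This is a sketch of the mapping only, not the kernel itself:

```cpp
#include <cassert>
#include <cstddef>

// On the device, each thread computes a unique flat index as
//   idx = blockIdx.x * blockDim.x + threadIdx.x
// and maps it to one (row, col) element of the dim x dim Hamiltonian.
struct Element { std::size_t row, col; bool valid; };

Element map_thread(std::size_t block_idx, std::size_t block_dim,
                   std::size_t thread_idx, std::size_t dim)
{
    std::size_t idx = block_idx * block_dim + thread_idx;
    if (idx >= dim * dim) return {0, 0, false};   // guard: grid may exceed the matrix
    return {idx / dim, idx % dim, true};          // row-major (bra, ket) pair
}
```

The bounds guard matters because the launch grid is rounded up to whole blocks, so some threads have no element to compute.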
Further improvements to the one-body matrix element calculations
I noticed that all the interaction files I have at hand only have single-particle energies for $\alpha = \beta$. This means that I only need to loop over the diagonal elements of the Hamiltonian! ...
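With $\epsilon_{\alpha \beta}$ diagonal, the operator reduces to $\sum_\alpha \epsilon_\alpha \, c_\alpha^\dagger c_\alpha$, so each diagonal matrix element is just the sum of single-particle energies over the orbitals occupied in that basis state. A sketch, again assuming a hypothetical bitmask representation of the basis states:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// With eps[a][b] = 0 for a != b, H is diagonal:
// H_ii = sum of eps[a] over the orbitals a occupied in basis state i.
double diagonal_element(uint64_t state, const std::vector<double>& eps)
{
    double h = 0.0;
    for (std::size_t a = 0; a < eps.size(); ++a)
        if (state & (uint64_t(1) << a)) h += eps[a];   // orbital a occupied
    return h;
}
```

This turns the double loop over $(\alpha, \beta)$ into a single pass over the orbitals per basis state.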