GPU acceleration and other improvements of the one-body matrix element calculations #2
GaffaSnobb started this conversation in Ideas
Alright, so it's finally the part of the coding I have been looking forward to! Calculation of the matrix elements is highly parallelisable, so testing GPU acceleration seems worth checking out. There are, however, a number of considerations when it comes to GPU programming, and some of them are similar to the mindset you need when parallelising for the CPU.
Choice of GPGPU framework and compatibility
Nvidia's CUDA has for years been the go-to platform for general-purpose GPU programming (GPGPU): for a long time it was really the only choice, and even after the emergence of AMD's open-source alternative, HIP, CUDA had (and has) far better support. You can run CUDA on more or less every consumer-grade GPU that is 10 years old or newer, meaning that the bar for GPGPU is low on the Nvidia platform. AMD's software stack ROCm, which contains HIP, has at the time of this writing official support for only three consumer-grade GPUs, namely the Radeon VII, Radeon RX 7900XT and Radeon RX 7900XTX. The VII is several years old at this point, and the latter two are the two most expensive cards in the consumer line-up. The official support is very limited, but you'll likely manage to get ROCm to run on Navi 1X and 2X even though they are not officially supported by AMD.
In contrast, ROCm support on data-center-grade GPUs (or rather, accelerators) is better. AMD's Instinct line-up is a set of extremely beefy accelerators with up to 192 GB of memory. The relatively newly commissioned supercomputer LUMI has a ridiculous 11 912 AMD Instinct MI250X accelerators powered by the ROCm software stack. Choosing to build a supercomputer with AMD accelerators makes me trust the ROCm platform more.
I have chosen ROCm/HIP for three reasons. First, it is open-source. Nvidia has had a firm, closed grip on the GPGPU market for years; I don't like closed-source, I like open-source. Simple as that. Second, I hope to run my code on LUMI at some point. Third, HIP actually exists and works now! ROCm was first released in 2016, meaning that CUDA was the only viable option before that. Popular software like TensorFlow and PyTorch actually runs on AMD GPUs today because ROCm/HIP exists.
Accelerating the one-body matrix element calculations
The goal is of course to accelerate as much as possible, but since the one-body calculations are relatively small, they are a nice place to start. Recall the definition of the one-body Hamiltonian operator,

$$ \hat{H}_{\text{1-body}} = \sum_{\alpha \beta} \epsilon_{\alpha \beta} c_\alpha^\dagger c_\beta, $$

where $\epsilon_{\alpha \beta}$ are the single-particle energies and $c^\dagger$ and $c$ are creation and annihilation operators, respectively. From the operator we calculate the matrix elements

$$ H_{ij} = \langle i | \hat{H}_{\text{1-body}} | j \rangle, $$

where $| i \rangle$ and $| j \rangle$ are basis states. There can be up to billions of basis states, so that looks like a nice place to start the GPU parallelisation.
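For reference, matrix elements like these can be evaluated on the CPU with Slater determinants stored as bitmasks. The following is a minimal, self-contained sketch of that idea; the bitmask representation and the function names are my illustrative assumptions, not the actual code from the repository:

```cpp
#include <cassert>
#include <cstdint>

// Toy evaluation of <bra| c_a^dagger c_b |ket> for single Slater determinants.
// Hypothetical representation: bit p of a state is 1 iff orbital p is occupied.

// Phase (-1)^k from anticommuting an operator past the occupied orbitals below p.
int phase(uint64_t state, int p)
{
    uint64_t below = (uint64_t(1) << p) - 1;            // mask for orbitals 0..p-1
    return (__builtin_popcountll(state & below) % 2) ? -1 : 1;
}

// <bra| c_a^dagger c_b |ket>: annihilate orbital b, create orbital a, track the sign.
double one_body_me(uint64_t bra, uint64_t ket, int a, int b)
{
    if (!(ket & (uint64_t(1) << b))) return 0.0;        // b must be occupied in |ket>
    int sign = phase(ket, b);
    uint64_t tmp = ket & ~(uint64_t(1) << b);           // annihilate b
    if (tmp & (uint64_t(1) << a)) return 0.0;           // a must be empty (Pauli)
    sign *= phase(tmp, a);
    tmp |= (uint64_t(1) << a);                          // create a
    return (tmp == bra) ? double(sign) : 0.0;
}
```

The phase bookkeeping is what makes the determinants antisymmetric: moving an operator past an occupied orbital flips the sign.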
My first attempt at GPU-accelerating the calculation of a single one-body matrix element looks like this (the CPU version of the code can be seen here):
So-called kernels (`__global__`) are functions which are called from the CPU and executed on the GPU. Note that in this setting, the CPU is more commonly called the host and the GPU the device. Device functions (`__device__`), however, are only callable from the device and are also executed on the device.

In the kernel I have to do some index gymnastics to make sure that the matrix elements are mapped to the correct basis states. This is because the kernel is called in parallel by up to several thousand threads at the same time, and each thread sees `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, from which a unique index can be computed to map the thread to the correct basis state and matrix element. This means that I don't have to explicitly write the for-loop over the basis states, but rather just tell the GPU how many threads I want spawned and then map the unique thread identifiers accordingly. Pretty neat, I think!

The one-body calculations themselves are actually almost identical to the CPU version of the code, except that I had to use primitive data types instead of `std::vector` and some other self-made data structures, because only primitive data types are allowed on the GPU. Additionally, all data has to be manually copied from the host to the device, and new arrays must be allocated on the device.

Some simple benchmarks show that the GPU is significantly faster than the CPU (a 16-core 7950X), even when including the copy of the Hamiltonian matrix from the device back to the host, something that host-only calculations do not need to do. More detailed benchmarks will come.
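To make the index gymnastics concrete, the arithmetic can be demonstrated host-side with ordinary variables standing in for the HIP built-ins (`blockIdx.x`, `blockDim.x`, `threadIdx.x`). This is a sketch of the mapping only, not the kernel itself:

```cpp
#include <cassert>
#include <cstddef>

// On the device, each thread computes a unique flat index as
//   idx = blockIdx.x * blockDim.x + threadIdx.x
// and maps it to one (row, col) element of the dim x dim Hamiltonian.
struct Element { std::size_t row, col; bool valid; };

Element map_thread(std::size_t block_idx, std::size_t block_dim,
                   std::size_t thread_idx, std::size_t dim)
{
    std::size_t idx = block_idx * block_dim + thread_idx;
    if (idx >= dim * dim) return {0, 0, false};   // guard: grid may exceed the matrix
    return {idx / dim, idx % dim, true};          // row-major (bra, ket) pair
}
```

The bounds guard matters because the launch grid is rounded up to whole blocks, so some threads have no element to compute.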
Further improvements to the one-body matrix element calculations
I noticed that all the interaction files I have at hand only have single-particle energies for $\alpha = \beta$. This means that I only need to loop over the diagonal elements of the Hamiltonian! ...
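With $\epsilon_{\alpha \beta}$ diagonal, the operator reduces to $\sum_\alpha \epsilon_\alpha \, c_\alpha^\dagger c_\alpha$, so each diagonal matrix element is just the sum of single-particle energies over the orbitals occupied in that basis state. A sketch, again assuming a hypothetical bitmask representation of the basis states:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// With eps[a][b] = 0 for a != b, H is diagonal:
// H_ii = sum of eps[a] over the orbitals a occupied in basis state i.
double diagonal_element(uint64_t state, const std::vector<double>& eps)
{
    double h = 0.0;
    for (std::size_t a = 0; a < eps.size(); ++a)
        if (state & (uint64_t(1) << a)) h += eps[a];   // orbital a occupied
    return h;
}
```

This turns the double loop over $(\alpha, \beta)$ into a single pass over the orbitals per basis state.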