Get rid of nvshmem dependency for cuBLASMp integration #2661

vcherepanov-nv · 2026-02-09T21:29:50Z

Description

Starting with cuBLASMp 0.8.0, they're moving away from using nvshmem for symmetric memory, use NCCL instead.
This change adapts the to changed API.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Get rid of nvshmem dependency (for cuBLASMp)
Update cuBLASMp API usage

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-02-09T22:07:20Z

Greptile Overview

Greptile Summary

This PR migrates from nvshmem to NCCL for symmetric memory management in cuBLASMp integration, adapting to the cuBLASMp 0.8.0 API changes.

Key Changes:

Replaced nvshmem_malloc/nvshmem_free with cublasMpMalloc/cublasMpFree for symmetric memory allocation
Updated cuBLASMp API calls to new naming convention (cublasMpMatmulDescriptorSetAttribute → cublasMpMatmulDescriptorSetAttribute, cublasMpStreamSet → cublasMpSetStream)
Removed nvshmem dependency from build configuration and replaced with NCCL library linking
Added cublasMpBufferRegister/cublasMpBufferDeregister calls for workspace management
Improved error messages using cublasMpGetStatusString instead of numeric codes
Documented caller responsibility for stream synchronization before context destruction

Outstanding Issues:

NCCL library discovery in CMake lacks path hints (already reported)
Stream synchronization responsibility documented but not enforced (already reported)

Confidence Score: 3/5

Safe to merge with caution - requires build environment validation
Score reflects two previously reported issues: NCCL library discovery missing path configuration (build-time failure risk) and unsafe workspace deallocation requiring caller synchronization (runtime use-after-free risk if misused). The API migration itself appears correct and systematic.
Pay close attention to transformer_engine/common/CMakeLists.txt (NCCL discovery issue) and transformer_engine/common/comm_gemm/comm_gemm.cpp (workspace synchronization)

Important Files Changed

Filename	Overview
transformer_engine/common/CMakeLists.txt	Replaced nvshmem with NCCL linking, but NCCL library discovery lacks PATHS hint (previously reported)
transformer_engine/common/comm_gemm/comm_gemm.cpp	Updated to cuBLASMp 0.8.0 API with symmetric memory via cublasMpMalloc/cublasMpFree instead of nvshmem, updated API calls to use new naming conventions

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant TE as TransformerEngine
    participant cuBLASMp as cuBLASMp 0.8.0
    participant NCCL as NCCL
    
    Note over TE,cuBLASMp: Context Creation
    App->>TE: nvte_comm_gemm_ctx_create(comm, nranks, rank)
    TE->>cuBLASMp: cublasMpGridCreate(..., comm)
    cuBLASMp->>NCCL: Use NCCL communicator
    TE-->>App: return ctx
    
    Note over TE,cuBLASMp: GEMM Operation
    App->>TE: nvte_all_gather_gemm(ctx, matrices, main_stream)
    TE->>cuBLASMp: cublasMpSetStream(handle, main_stream)
    TE->>cuBLASMp: cublasMpMatmul_bufferSize(...)
    alt workspace needs resize
        TE->>cuBLASMp: cublasMpBufferDeregister(old_workspace)
        TE->>cuBLASMp: cublasMpFree(old_workspace)
        TE->>cuBLASMp: cublasMpMalloc(&new_workspace, size)
        TE->>cuBLASMp: cublasMpBufferRegister(new_workspace, size)
    end
    TE->>cuBLASMp: cublasMpMatmul(..., workspace, main_stream)
    cuBLASMp->>NCCL: Collective communication
    
    Note over TE,cuBLASMp: Context Destruction
    App->>App: Synchronize main_stream (caller responsibility)
    App->>TE: nvte_comm_gemm_ctx_destroy(ctx)
    TE->>cuBLASMp: cublasMpBufferDeregister(workspace)
    TE->>cuBLASMp: cublasMpFree(workspace)

greptile-apps

_{3 files reviewed, 4 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-09T22:07:27Z

transformer_engine/common/CMakeLists.txt

+    find_library(NCCL_LIB
+                 NAMES nccl libnccl
                 PATH_SUFFIXES lib
                 REQUIRED)
-  target_link_libraries(transformer_engine PUBLIC ${CUBLASMP_LIB} ${NVSHMEM_HOST_LIB})
+  target_link_libraries(transformer_engine PUBLIC ${NCCL_LIB} ${CUBLASMP_LIB})


NCCL library not discoverable

find_library(NCCL_LIB ...) is missing a PATHS/hint variable (unlike CUBLASMP_LIB, which uses ${CUBLASMP_DIR}). Unless NCCL is already in the default linker search paths, enabling NVTE_WITH_CUBLASMP will fail at configure time with NCCL_LIB not found. This PR should add a way to point CMake at NCCL (env var / CMake cache var) and pass it from setup.py similarly to CUBLASMP_DIR.

greptile-apps · 2026-02-09T22:07:43Z

Additional Comments (3)

transformer_engine/common/CMakeLists.txt
Undefined NVSHMEM_DIR include

In the NVTE_WITH_CUBLASMP block, target_include_directories(... ${NVSHMEM_DIR}/include) still references NVSHMEM_DIR, but this PR removed the only place that was setting it (setup.py). With NVTE_WITH_CUBLASMP=ON, CMake will now expand an unset var here, which can break configuration/build (and it also contradicts the goal of removing nvshmem for cuBLASMp).

transformer_engine/common/comm_gemm/comm_gemm.cpp
Uninitialized workspace fields

NVTECommGemmCtx has raw workspace/workspace_size fields, but nvte_comm_gemm_ctx_create()'s designated initializer never sets them. That means cublasmp_gemm() will read uninitialized ctx->workspace_size and may also treat an uninitialized ctx->workspace as non-null (leading to invalid deregister/free calls). These need explicit initialization (e.g., workspace=nullptr, workspace_size=0) in the create path.

transformer_engine/common/comm_gemm/comm_gemm.cpp
Workspace leak + missing cleanup

After switching from nvshmem_malloc/free to cublasMpMalloc/cublasMpFree + cublasMpBufferRegister, nvte_comm_gemm_ctx_destroy() now just deletes the ctx and never deregisters/frees ctx->workspace. Since workspace is a raw pointer (not RAII-managed), this leaks device memory and leaves the cuBLASMp registration state dirty for the lifetime of the process.

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps

_{3 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-09T22:32:25Z

transformer_engine/common/comm_gemm/comm_gemm.cpp

 void nvte_comm_gemm_ctx_destroy(NVTECommGemmCtx* ctx) {
  NVTE_API_CALL(nvte_comm_gemm_ctx_destroy);
-  nvshmemx_sync_all_on_stream(ctx->stream.get());
+  if (ctx->workspace) {
+    NVTE_CHECK_CUBLASMP(cublasMpBufferDeregister(ctx->grid_row_major.get(), ctx->workspace));
+    NVTE_CHECK_CUBLASMP(cublasMpFree(ctx->grid_col_major.get(), ctx->workspace));
+  }
  delete ctx;


Unsafe workspace free

nvte_comm_gemm_ctx_destroy can deregister/free ctx->workspace while cublasMpMatmul work on the user-provided main_stream is still in flight. cublasmp_gemm sets the cuBLASMp handle stream to main_stream (comm_gemm.cpp:389) and uses ctx->workspace in the enqueue (comm_gemm.cpp:436-439), but destroy() does not synchronize main_stream (or otherwise ensure completion) before calling cublasMpBufferDeregister/cublasMpFree.

This can become a use-after-free if a caller destroys the ctx shortly after launching a comm GEMM. Either synchronize the relevant stream(s) before freeing, or explicitly document (in comm_gemm.h) that callers must synchronize main_stream before calling destroy().

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps

_{4 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

ptrendx · 2026-02-10T21:10:27Z

transformer_engine/common/include/transformer_engine/comm_gemm.h

 *
 *  \param[in]  ctx  Context to destroy.
+ *
+ *  It's the caller's respondibility to synchronize all streams involved before calling this function.


Suggested change

* It's the caller's respondibility to synchronize all streams involved before calling this function.

* It's the caller's responsibility to synchronize all streams involved before calling this function.

ptrendx · 2026-02-10T21:13:43Z

setup.py

            f"nvidia-cublasmp-cu{cuda_version()[0]}"
        ).locate_file(f"nvidia/cublasmp/cu{cuda_version()[0]}")
        cmake_flags.append(f"-DCUBLASMP_DIR={cublasmp_dir}")
-        nvshmem_dir = os.getenv("NVSHMEM_HOME") or metadata.distribution(


There is also explicit nvshmem usage in transformer_engine/common/nvshmem_api - I assume this removal would result in failure of that functionality?

I don't think so, Nvshmem discovery / CMake knobs for cuBLASMp (which this PR removes) and for nvshmem_api is independent.

ptrendx · 2026-02-10T21:16:49Z

A general comment as well - if we intend to require cublasMp 0.8.0 then we should make sure we say that in the installation docs etc. I don't think we have that feature (building with cublasmp enabled) currently in the docs - we should change that (between this PR and #2443) @denera FYI.

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps

_{4 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

almogsegal · 2026-02-11T10:52:51Z

@vcherepanov-nv cuBLASMp 0.8.0 added the API you asked for: https://docs.nvidia.com/cuda/cublasmp/usage/functions.html#cublasmpgetstatusstring
I suggest to consider adding this in the NVTE_CHECK_CUBLASMP macro.

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps

_{5 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

vcherepanov-nv and others added 5 commits January 27, 2026 02:36

Remove nvshmem usage

5b2f7cc

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Renamings

aadb720

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

NCCL dependency

80c2d1a

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Check for not yet allocated workspace

7ba0971

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

fd02d57

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Feb 9, 2026

View reviewed changes

Address greptile comments

706d9d1

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps bot reviewed Feb 9, 2026

View reviewed changes

Add a comment per greptile

a98e550

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps bot reviewed Feb 9, 2026

View reviewed changes

ptrendx reviewed Feb 10, 2026

View reviewed changes

Fix a typo

2d3a3fa

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps bot reviewed Feb 10, 2026

View reviewed changes

Display human-readable cuBLASMp error message

95a271c

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps bot reviewed Feb 11, 2026

View reviewed changes

	* It's the caller's respondibility to synchronize all streams involved before calling this function.
	* It's the caller's responsibility to synchronize all streams involved before calling this function.

Get rid of nvshmem dependency for cuBLASMp integration #2661

Are you sure you want to change the base?

Get rid of nvshmem dependency for cuBLASMp integration #2661

Conversation

vcherepanov-nv commented Feb 9, 2026

Description

Type of change

Changes

Checklist:

Uh oh!

greptile-apps bot commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Overview

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 9, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

ptrendx Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

vcherepanov-nv Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx commented Feb 10, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

almogsegal commented Feb 11, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps bot commented Feb 9, 2026 •

edited

Loading