Skip to content

Conversation

@ptrendx
Copy link
Member

@ptrendx ptrendx commented Feb 9, 2026

Description

Improve handling of the NVTE_CUDA_ARCHS env variable:

  • add the regular architectures to the build of the sources with specific architectures to enable some support for GPU architectures in the family that were not specialized directly.
  • automatically add sm75 to the build in case the CMAKE_CUDA_ARCHITECTURES becomes empty (which currently should only happen when cmake < 4.0.2 and sm120 is the only selected architecture)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx requested a review from ksivaman February 9, 2026 23:09
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR adjusts how TransformerEngine’s common CMake build derives and applies GPU architecture targets: it rewrites some base arches (110/120) into family-specific variants when CMake supports them, keeps per-source --generate-code injection for generic/specific arch lists otherwise, and introduces a NVTE_HAS_ARCH_SPECIFIC_TARGETS compile definition to change arch-compatibility checks in util/ptx.cuh.

Main concern: the new NVTE_HAS_ARCH_SPECIFIC_TARGETS define is currently enabled unconditionally for all arch-specific CUDA sources, which alters the safety checks in ptx.cuh even in builds that are not actually compiling any arch-specific variants.

Confidence Score: 3/5

  • This PR is likely mergeable but has one correctness/safety-check regression to address in arch-specific compilation gating.
  • The CMake arch rewriting and fallback behavior are localized build-system changes, but NVTE_HAS_ARCH_SPECIFIC_TARGETS is currently defined unconditionally for arch-specific source files, which can disable compile-time static_assert protections in ptx.cuh even when no arch-specific codegen is being produced. That can allow miscompiled feature usage to slip through without a hard error.
  • transformer_engine/common/CMakeLists.txt and transformer_engine/common/util/ptx.cuh

Important Files Changed

Filename Overview
transformer_engine/common/CMakeLists.txt Adjusts CUDA arch handling (110/120 family-specific) and adds NVTE_HAS_ARCH_SPECIFIC_TARGETS define for arch-specific sources; currently forces NVTE_ARCH_SPECIFIC_TARGETS TRUE and may skip the intended static_assert guard even when no arch-specific codegen flags are used.
transformer_engine/common/util/ptx.cuh Relaxes compile-time static_asserts for arch/family-specific feature usage when NVTE_HAS_ARCH_SPECIFIC_TARGETS is defined; with current CMake always defining it for arch-specific sources, mismatched compilation/feature assumptions may silently proceed.

Sequence Diagram

sequenceDiagram
  participant UserEnv as User env (NVTE_CUDA_ARCHS)
  participant CMake as CMake configure
  participant ArchList as CMAKE_CUDA_ARCHITECTURES
  participant NVCC as NVCC compile
  participant PTX as ptx.cuh checks

  UserEnv->>CMake: Configure build (optionally sets NVTE_CUDA_ARCHS)
  CMake->>ArchList: Initialize/override CMAKE_CUDA_ARCHITECTURES
  CMake->>ArchList: Rewrite 110/120 to 110f/120f (CMake >= 4.0.2)
  CMake->>ArchList: For CMake < 4.0.2, split into NVTE_GENERIC_ARCHS/NVTE_SPECIFIC_ARCHS
  CMake->>NVCC: Add --generate-code flags per-source for generic/specific lists
  CMake->>NVCC: Define NVTE_HAS_ARCH_SPECIFIC_TARGETS for arch-specific sources
  NVCC->>PTX: Compile sources including ptx.cuh
  PTX-->>PTX: If NVTE_HAS_ARCH_SPECIFIC_TARGETS defined, relax static_asserts
  PTX-->>NVCC: Compatibility check returns bool (no hard error)
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 43 to 56
@@ -54,31 +52,57 @@ endif()
# Check for architecture 101 (if we see this we are in toolkit <= 12.9)
list(FIND CMAKE_CUDA_ARCHITECTURES "101" arch_101_index)
if(NOT arch_101_index EQUAL -1)
list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "101")
list(APPEND NVTE_GENERIC_ARCHS "101")
list(APPEND NVTE_SPECIFIC_ARCHS "101a")
endif()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arch 100/101 not removed

When "100" / "101" are present in CMAKE_CUDA_ARCHITECTURES, this block only appends 100a/101a to NVTE_SPECIFIC_ARCHS but never removes the base arch from CMAKE_CUDA_ARCHITECTURES nor adds it to NVTE_GENERIC_ARCHS. As a result, the build will still compile all sources for sm_100 / sm_101 (via CMAKE_CUDA_ARCHITECTURES) while also compiling arch-specific sources for sm_100a / sm_101a (via --generate-code), which defeats the “generic + specific” split and can cause unexpected extra build work / incorrect targeting. This differs from the 110/120 handling below, which removes the base arch and routes it into the explicit lists.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptileai That is because arches 110/120 are routed to the "f" (like 120f) variant where the entire family is covered, whereas in the case of 100, it routes to "a" variants (like 100a) - this is intentional behavior, but in order to support different Arches in the same family, we need to still compile for the regular 100 then.

Comment on lines +93 to 101
if(CMAKE_VERSION VERSION_LESS 4.0.2)
if(NOT CMAKE_CUDA_ARCHITECTURES)
message(WARNING
"CMAKE_CUDA_ARCHITECTURES is empty after replacing arch-specific targets. "
"Please upgrade to CMake 4.0.2+ for native 'f' architecture support. "
"Adding sm_75 target in addition to the specified target to avoid configuration "
"errors - this will result in longer build time, but does not affect correctness.")
set(CMAKE_CUDA_ARCHITECTURES 75)
endif()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback wipes user archs

On CMake < 4.0.2, if the arch list becomes empty you set(CMAKE_CUDA_ARCHITECTURES 75) (dropping the user’s originally requested arch entirely). Given the warning text says “add sm_75 target in addition to the specified target”, this looks like a functional mismatch: the fallback should preserve the originally selected arch(es) and append 75, not replace everything with 75.

Copy link
Member Author

@ptrendx ptrendx Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptileai If the list is empty, then setting is the same as appending.

@ptrendx
Copy link
Member Author

ptrendx commented Feb 11, 2026

/te-ci

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +104 to 106
set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)

# cuDNN frontend API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always-on arch-specific define

set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) makes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 get added to all transformer_engine_cuda_arch_specific_sources (lines ~247-255) even when NVTE_SPECIFIC_ARCHS is empty (e.g., user selects only 70/80/89/90). That flips the behavior in util/ptx.cuh to skip the compile-time static_assert that catches “compiled generic but using arch/family-specific features”, so those mistakes will now silently pass in exactly the builds where no arch-specific codegen flags are being injected for these sources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant