[Common] Fuse pre-swizzling into grouped MXFP8 quantization kernel #2630
Conversation
Greptile Summary: This PR adds optional pre-swizzling support to the grouped MXFP8 quantization kernel, allowing scaling factors to be stored in the format expected by GEMM operations. The implementation adds a new template parameter, `WITH_GEMM_SWIZZLED_SCALES`, that selects the swizzled scale layout.
Confidence Score: 5/5
4 files reviewed, no comments
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Force-pushed from bf07d9d to ed61ff7
for more information, see https://pre-commit.ci
/te-ci
2 files reviewed, no comments
ksivaman left a comment:
LGTM
2 files reviewed, 1 comment
```cpp
size_t scale_idx = 0;
if constexpr (WITH_GEMM_SWIZZLED_SCALES) {
  scale_idx = gemm_swizzled_scale_idx(global_scales_offset_X, global_scales_offset_Y,
                                      DIVUP(rows, static_cast<size_t>(128)));
} else {
  scale_idx = global_scales_offset_Y * scale_stride_colwise + global_scales_offset_X;
}
scales_colwise[scale_idx] = biased_exponent;
```
Type inconsistency: scale_idx is declared as size_t in the colwise path but as int in the rowwise path (line 615). Should use consistent type (size_t) in both paths.
Description
This PR fuses pre-swizzling into the grouped MXFP8 quantization kernel so that scaling factors are stored in the format expected by GEMM.
Type of change
Changes
GroupedTensor
Checklist: