tohtana (Collaborator) commented Jan 22, 2026

This PR introduces a flexible, configuration-driven API for AutoTP (Automatic Tensor Parallelism) that allows users to define custom layer partitioning patterns for training.
@inkcherry @delock

Motivation

Previously, AutoTP relied on hardcoded layer detection logic that was difficult to customize for new model architectures. This PR enables:

  1. Custom models: Users can define exact regex patterns to match their model's parameter names
  2. Fused layers: Support for fused QKV, gate_up_proj, and other packed weight matrices with unequal sub-parameter sizes (e.g., GQA with different Q/K/V dimensions)
  3. Extensibility: Easy to add new model presets or customize existing ones

Here is an example of a config including custom partitioning patterns:

{
    "tensor_parallel": {
        "autotp_size": 4,
        "partition_config": {
            "use_default_specs": false,
            "layer_specs": [
                {
                    "patterns": [".*\\.o_proj\\.weight$", ".*\\.down_proj\\.weight$"],
                    "partition_type": "row"
                },
                {
                    "patterns": [".*\\.[qkv]_proj\\.weight$"],
                    "partition_type": "column"
                },
                {
                    "patterns": [".*\\.gate_up_proj\\.weight$"],
                    "partition_type": "column",
                    "shape": [2, -1],
                    "partition_dim": 0
                }
            ]
        }
    }
}
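As a general illustration (not DeepSpeed internals), the sketch below shows why a fused weight such as gate_up_proj needs the extra shape and partition_dim information: a naive chunk of the packed tensor would mix the gate and up projections, so each sub-matrix has to be sharded separately. The tensor sizes and rank values are made up for the example.

# Illustration only, not DeepSpeed code: sharding a fused gate_up_proj weight.
# The gate and up projections are packed along dim 0; splitting the fused tensor
# directly would mix the two, so each sub-matrix is sharded separately per rank.
import torch

tp_size, tp_rank = 4, 0                                 # example values
intermediate, hidden = 8, 4                             # made-up sizes
fused = torch.randn(2 * intermediate, hidden)           # [gate; up] concatenated

gate, up = fused.split(intermediate, dim=0)             # recover the sub-matrices
local = torch.cat([t.chunk(tp_size, dim=0)[tp_rank]     # column-parallel shard of each
                   for t in (gate, up)], dim=0)         # rank-local fused weight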

Refer to the documentation for more details (including the preset models and how to define partitioning for models with fused layers).
We have also opened a separate PR that demonstrates the usage.

Simplified initialization step

AutoTP previously required calling set_autotp_mode(training=True) and deepspeed.tp_model_init before deepspeed.initialize. Now all the necessary settings can be included in the DeepSpeed config.

We still support the traditional initialization path for backward compatibility.
If you use both (i.e., calling set_autotp_mode(training=True) and deepspeed.tp_model_init while also passing the tensor-parallel settings to deepspeed.initialize), the settings are merged at initialization; conflicting settings raise an error.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
delock (Collaborator) commented Jan 23, 2026

I like the layer-spec approach; it provides much more flexibility and is easy to use via presets. I see the presets are defined as layer specs in the DeepSpeed code. It would be a good idea to add links to the preset code in the documents (
https://github.com/tohtana/DeepSpeed/blob/tohtana/autotp_custom_patterns/docs/code-docs/source/training.rst#preset-based-partitioning
and
https://github.com/tohtana/DeepSpeed/blob/tohtana/autotp_custom_patterns/docs/code-docs/source/training.rst#supported-models) and then invite users to contribute presets via PRs.

tohtana requested a review from inkcherry January 23, 2026 06:22
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana (Collaborator, Author) commented Jan 23, 2026

Thank you @delock, I added links to the preset code and a note welcoming PRs. I hope the community will add support for new models.

sfc-gh-truwase (Collaborator) commented:

@delock or @inkcherry, can you please help with the review and approval of this PR?

delock self-assigned this Jan 30, 2026
if not matches:
    return None
if len(matches) > 1:
    warning_once(f"AutoTPConfig: parameter {param_name} matched multiple layer_specs; using the first match.")
Review comment (Collaborator):

If more than one spec matches, the warning should also show the matching specs, so the user can judge whether this is intended.

tohtana (Author) replied:

Great catch, I added the warning as you suggested.
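For illustration, the updated warning might look roughly like the sketch below; the structure of matches and the patterns field are assumptions based on the snippet above, not the actual implementation.

# Rough sketch (assumed data structures): include the matched specs in the warning.
if len(matches) > 1:
    matched_patterns = "; ".join(str(spec.patterns) for spec in matches)  # hypothetical field
    warning_once(f"AutoTPConfig: parameter {param_name} matched multiple layer_specs "
                 f"({matched_patterns}); using the first match.")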

if not isinstance(config, dict):
    config = load_ds_config(config)

mesh_device = None
Review comment (Collaborator):

I see that the sequence-parallel (mesh) related code was moved down here. Is there a reason?

tohtana (Author) replied:

This is actually a fix for a pre-existing bug: config can be a file path, but it was being treated as a dictionary when initializing the device mesh. load_ds_config needs to run before that initialization.
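For illustration, the ordering described above looks roughly like the sketch below; the device-mesh helper name is hypothetical and only stands in for the surrounding logic.

# Rough sketch of the ordering fix: normalize config to a dict first, so the
# device-mesh setup never sees a file path. The helper below is hypothetical.
if not isinstance(config, dict):
    config = load_ds_config(config)     # config may be a path to a JSON file

mesh_device = None
tp_cfg = config.get("tensor_parallel", {})
if tp_cfg:
    mesh_device = init_device_mesh_from_config(tp_cfg)   # hypothetical helper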

delock (Collaborator) commented Jan 30, 2026

Hi @tohtana, overall this looks good to me. I left some minor suggestions and questions. I'll approve; let me know when you want to merge it. Thanks!

tohtana and others added 3 commits January 30, 2026 14:48
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana enabled auto-merge (squash) January 31, 2026 09:11
tohtana merged commit 6b9cab1 into deepspeedai:master January 31, 2026
11 checks passed