Support custom partitioning patterns for AutoTP #7806
Conversation
I like the approach of using layer specs, which provides much more flexibility and is easy to use with presets. I see the presets are defined as layer specs in the DeepSpeed code; it would probably be good to add a link to the preset code in the documents.
Thank you @delock, I added links to the preset code and messages welcoming PRs. I hope the community will add support for new models.
@delock or @inkcherry, can you please help with the review and approval of this PR?
if not matches:
    return None
if len(matches) > 1:
    warning_once(f"AutoTPConfig: parameter {param_name} matched multiple layer_specs; using the first match.")
In case more than one spec matches, the warning should also show the matching specs, so the user can judge whether this is intended.
Great catch, I added the warning as you suggested.
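For reference, here is a minimal self-contained sketch of what such an enriched warning could look like; the LayerSpec stand-in and the warning_once alias are assumptions for illustration, not the merged DeepSpeed code:

```python
from dataclasses import dataclass, field
from logging import warning as warning_once  # stand-in for DeepSpeed's warning_once


@dataclass
class LayerSpec:  # hypothetical stand-in for the real layer spec type
    patterns: list = field(default_factory=list)
    partition_type: str = "column"


def pick_spec(param_name, matches):
    """Return the spec to use, warning (with the matching specs) on ambiguity."""
    if not matches:
        return None
    if len(matches) > 1:
        matched = "; ".join(str(s.patterns) for s in matches)
        warning_once(f"AutoTPConfig: parameter {param_name} matched multiple "
                     f"layer_specs ({matched}); using the first match.")
    return matches[0]
```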
if not isinstance(config, dict):
    config = load_ds_config(config)

mesh_device = None
I saw that the sequence parallel (mesh) related code was moved down here. Is there a reason?
It was actually a fix for a pre-existing bug. config can be a file path, but we were treating it as a dictionary when initializing the device mesh, so load_ds_config needs to run before that initialization.
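As a minimal illustration of that ordering (json stands in for the real config loader; the names here are placeholders, not the actual DeepSpeed code):

```python
import json


def load_ds_config(config_path):
    # Placeholder loader: read a DeepSpeed JSON config file into a dict.
    with open(config_path) as f:
        return json.load(f)


def resolve_config(config):
    # config may be either a dict or a path to a JSON file; normalize it to a
    # dict *before* any device-mesh (sequence parallel) setup reads its fields.
    if not isinstance(config, dict):
        config = load_ds_config(config)
    mesh_device = None  # the device mesh is derived later from the loaded dict
    return config, mesh_device
```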
Hi @tohtana, overall this looks good to me. I left some minor suggestions and questions. I'll approve; let me know when you want to merge it. Thanks!
This PR introduces a flexible, configuration-driven API for AutoTP (Automatic Tensor Parallelism) that allows users to define custom layer partitioning patterns for training.
@inkcherry @delock
Motivation
Previously, AutoTP relied on hardcoded layer-detection logic that was difficult to customize for new model architectures. This PR makes the partitioning patterns configurable through the DeepSpeed config and simplifies the initialization path (see below).
Here is an example of a config including custom partitioning patterns:
{ "tensor_parallel": { "autotp_size": 4, "partition_config": { "use_default_specs": false, "layer_specs": [ { "patterns": [".*\\.o_proj\\.weight$", ".*\\.down_proj\\.weight$"], "partition_type": "row" }, { "patterns": [".*\\.[qkv]_proj\\.weight$"], "partition_type": "column" }, { "patterns": [".*\\.gate_up_proj\\.weight$"], "partition_type": "column", "shape": [2, -1], "partition_dim": 0 } ] } } }Refer to the document for more details (including preset models and how to define partitioning for fused models).
We also opened a new PR to show the usage.
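To make the matching semantics concrete, here is a small standalone sketch of how regex patterns like the ones above select parameters by name (plain Python re, not the actual DeepSpeed implementation; the parameter names are hypothetical):

```python
import re

# Illustrative subset of the layer_specs shown in the config above.
layer_specs = [
    {"patterns": [r".*\.o_proj\.weight$", r".*\.down_proj\.weight$"], "partition_type": "row"},
    {"patterns": [r".*\.[qkv]_proj\.weight$"], "partition_type": "column"},
]


def find_partition_type(param_name):
    """Return the partition type of the first spec whose pattern matches, else None."""
    for spec in layer_specs:
        if any(re.match(p, param_name) for p in spec["patterns"]):
            return spec["partition_type"]
    return None  # no match: the parameter is left unpartitioned


print(find_partition_type("model.layers.0.self_attn.q_proj.weight"))  # column
print(find_partition_type("model.layers.0.mlp.down_proj.weight"))     # row
```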
Simplified initialization step
AutoTP previously required calling set_autotp_mode(training=True) and deepspeed.tp_model_init before deepspeed.initialize. Now all the necessary settings can be included in the DeepSpeed config. We still support the traditional initialization path for backward compatibility.

When you use both (i.e., calling set_autotp_mode(training=True) and deepspeed.tp_model_init, and also passing the config to deepspeed.initialize), we merge the settings at initialization. If the settings conflict, we error out.
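As a sketch of the config-driven path, assuming a toy model and illustrative config values (the optimizer section, train_batch_size, and use_default_specs: true as the way to fall back to the built-in presets are assumptions, not prescribed settings):

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "tensor_parallel": {
        "autotp_size": 4,
        "partition_config": {
            "use_default_specs": True,  # assumed switch for the built-in presets
        },
    },
}

# No set_autotp_mode(...) or deepspeed.tp_model_init(...) call here: the
# tensor_parallel section of the config carries the AutoTP settings.
engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)
```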