fix: Add ml.p5e.48xlarge and ml.p5.48xlarge to EFA instance lists #5492
Open
srujithpoondla03 wants to merge 1 commit into aws:master from
Conversation
Add ml.p5e.48xlarge to SM_EFA_NCCL_INSTANCES and SM_EFA_RDMA_INSTANCES. Add ml.p5.48xlarge to SM_EFA_RDMA_INSTANCES (was missing). Without these entries, NCCL hangs during distributed training initialization on P5e instances due to missing EFA environment variables (FI_PROVIDER, FI_EFA_USE_DEVICE_RDMA, RDMAV_FORK_SAFE). Fixes aws#5491
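The diff itself is not rendered in this view, so below is a rough, non-authoritative sketch of the intended end state of the two lists. The surrounding file and the pre-existing entries are assumptions; only the two added instance types come from the PR description.

# Hedged sketch only: the actual diff is not shown on this page, so the
# surrounding file and the pre-existing entries are assumptions, not the
# real source.

# Instance types that should receive the NCCL-over-EFA environment
# variables (FI_PROVIDER, RDMAV_FORK_SAFE, ...).
SM_EFA_NCCL_INSTANCES = [
    "ml.p4d.24xlarge",   # assumed pre-existing entry
    "ml.p5.48xlarge",    # assumed pre-existing entry
    "ml.p5e.48xlarge",   # added by this PR
]

# Instance types that should additionally receive FI_EFA_USE_DEVICE_RDMA.
SM_EFA_RDMA_INSTANCES = [
    "ml.p4d.24xlarge",   # assumed pre-existing entry
    "ml.p5.48xlarge",    # added by this PR (was missing)
    "ml.p5e.48xlarge",   # added by this PR
]

With ml.p5e.48xlarge present in both lists, the SDK would set the EFA-related environment variables for P5e jobs, which is what the hanging NCCL initialization described above was missing.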
Author
Is there a specific process for testing EFA/instance-specific changes on actual hardware (e.g., P5e instances) before merging, or are unit tests sufficient for this type of change? We've verified the fix works in production with a manual workaround (setting the EFA env vars directly), but wanted to confirm whether there's an integration test process for these hardware-specific configurations.
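For context, here is a minimal sketch of the kind of manual workaround mentioned above: passing the EFA environment variables directly to the training job through the SageMaker Python SDK's environment parameter. The author did not share their exact configuration, so the estimator arguments and the specific variable values below are assumptions, not the tested production setup.

# Hypothetical workaround sketch: set the EFA env vars explicitly on the job
# instead of relying on the SDK to inject them for the instance type.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                              # placeholder script
    role="arn:aws:iam::123456789012:role/ExampleRole",   # placeholder role
    instance_type="ml.p5e.48xlarge",
    instance_count=2,
    framework_version="2.3",                             # placeholder version
    py_version="py311",                                  # placeholder version
    distribution={"torch_distributed": {"enabled": True}},
    environment={
        "FI_PROVIDER": "efa",             # assumed typical EFA setting
        "FI_EFA_USE_DEVICE_RDMA": "1",    # assumed typical EFA setting
        "RDMAV_FORK_SAFE": "1",           # assumed typical EFA setting
    },
)
estimator.fit()  # launches the training job with the env vars applied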
Can you add more details about how this was tested in production? For example, the training job ARNs, the exact commands you used to test, and the CUDA versions that were used in the training algorithm.