Kernels
wyldecat and Claude Opus 4.6 committed
Commit da7e5da · 1 Parent(s): 96b287c

Fix deadlock in construct_shard_mesh with PP + dp_replicate > 1


When Pipeline Parallelism is combined with dp_replicate > 1, different PP
stages own different parameters and therefore call dist.new_group() in
different orders, causing a collective-mismatch deadlock. Fix by passing
use_local_synchronization=True, so only ranks within the same group need
to coordinate, and by skipping group creation for shard meshes the current
rank doesn't belong to.

[skip-build]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
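
For context: dist.new_group() is, by default, a collective over the entire
world, so every rank must call it for every group, and in the same order,
even for groups it never joins. A minimal sketch of how that rule bites here
(the two-stage layout and call ordering below are hypothetical, purely to
illustrate the mismatch):

# Hypothetical ownership: each PP stage owns different parameters, so
# each stage builds its shard_meshes list in a different order.
stage_call_order = {
    0: [[0, 2], [1, 3]],  # stage-0 ranks would call new_group({0,2}) first
    1: [[1, 3], [0, 2]],  # stage-1 ranks would call new_group({1,3}) first
}

# Pre-fix behavior: every rank calls dist.new_group() for EVERY shard
# mesh. The default new_group() is collective across the whole world and
# must be invoked in the same order on all ranks; here stage 0 is setting
# up group {0, 2} while stage 1 is setting up group {1, 3}, so the
# underlying collectives pair mismatched ranks and the job hangs.
for stage, order in stage_call_order.items():
    print(f"PP stage {stage} creates groups in order: {order}")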

torch-ext/optimizer/distributed/utils.py CHANGED
@@ -214,15 +214,16 @@ def construct_shard_mesh(
 
     my_key = None
     for sm in shard_meshes:
-        key = _cache_key(sm)
         if (my_rank == sm).any().item():
+            key = _cache_key(sm)
             assert my_key is None, "Rank appears in multiple shard groups"
             my_key = key
-        if key not in _ranks_to_dist_cache:
-            pg = dist.new_group(sm.flatten().tolist())
-            _ranks_to_dist_cache[key] = (
-                DeviceMesh(device_type="cuda", mesh=sm),
-                pg,
-            )
+            if key not in _ranks_to_dist_cache:
+                pg = dist.new_group(sm.flatten().tolist(),
+                                    use_local_synchronization=True)
+                _ranks_to_dist_cache[key] = (
+                    DeviceMesh(device_type="cuda", mesh=sm),
+                    pg,
+                )
 
     return (*_ranks_to_dist_cache[my_key], shard_placements)
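
The fix leans on use_local_synchronization (supported by
torch.distributed.new_group in recent PyTorch releases): group setup then
synchronizes only among the group's own members, so ranks outside the group
neither call new_group() nor wait on it, and call order across pipeline
stages stops mattering. A distilled sketch of the resulting pattern,
assuming an initialized default process group (make_group_if_member is an
illustrative helper, not part of this repo):

import torch.distributed as dist

def make_group_if_member(ranks: list[int]):
    """Create a process group only on the ranks that belong to it.

    With use_local_synchronization=True, dist.new_group() synchronizes
    only among `ranks`; non-member ranks can skip the call entirely
    instead of mirroring it in a globally agreed order.
    """
    if dist.get_rank() not in ranks:
        return None  # non-members skip: no world-wide handshake needed
    return dist.new_group(ranks, use_local_synchronization=True)

This is the same shape as the fixed loop above: the membership check
((my_rank == sm).any()) now guards group creation, and the cache still ends
up holding the one (DeviceMesh, ProcessGroup) pair the current rank needs
for the final lookup.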