
Conversation

@rpsilva-aws
Collaborator

Currently, the existing parameter mapping for the lowering context is not well suited for SPMD. For large models, it creates a large synchronous bottleneck by transferring all device data to the host: each ReplicateShardedData computation gathers and reassembles a tensor's shards across multiple devices. This is by design, since the mapping is expected to collect all parameters regardless of their allocation.

In this PR, we introduce a new mapping that does not invoke the sharded replication, but instead holds references to the device data. This is generally sufficient and preferred in most cases, where the user only wants to access the valid parameters (those for which tensor_parameter_id does not return -1; a return of -1 marks a 'fake' parameter).
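A minimal usage sketch, assuming the `LoweringContext` Python binding (`torch_xla._XLAC.lowering.LoweringContext`) exposes the methods described above; names and the exact binding surface are illustrative and may differ in a given release:

```python
# Minimal sketch, assuming parameter_id_tensor_mapping,
# device_parameter_id_tensor_mapping and tensor_parameter_id are exposed on
# the LoweringContext binding as described in this PR; names are illustrative.
import torch
import torch_xla
import torch_xla.core.xla_model as xm

t = torch.randn(4, 4, device=xm.xla_device())
out = torch.sin(t)

ctx = torch_xla._XLAC.lowering.LoweringContext("SketchComputation")
ctx.build([out])

# Existing mapping: runs ReplicateShardedData for each sharded parameter and
# transfers the reassembled data to the host.
host_mapping = ctx.parameter_id_tensor_mapping()

# New mapping: keeps references to the device data, no sharded replication.
device_mapping = ctx.device_parameter_id_tensor_mapping()

# Only valid parameters are of interest; -1 marks a 'fake' parameter.
param_id = ctx.tensor_parameter_id(t)
if param_id != -1:
    device_tensor = device_mapping[param_id]
```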

@rpsilva-aws
Collaborator Author

Re-opened from #8453, cleaned up the merge commit.

@tengyifei tengyifei self-requested a review December 5, 2024 23:51
@tengyifei tengyifei added the tpuci label Dec 5, 2024
@tengyifei tengyifei marked this pull request as ready for review December 5, 2024 23:52
@rpsilva-aws rpsilva-aws force-pushed the rpsilva_lc_mapping_v3 branch from 8fd7ac7 to 9858577 December 5, 2024 23:53
@tengyifei tengyifei merged commit 5d11f66 into pytorch:master Dec 7, 2024
12 checks passed
@rpsilva-aws rpsilva-aws deleted the rpsilva_lc_mapping_v3 branch December 9, 2024 19:03
tengyifei added a commit that referenced this pull request Jan 2, 2025
Previously, scan used `parameter_id_tensor_mapping` to fetch tensors hoisted to HLO parameters, e.g. the fn being scanned may create additional tensors while it is running. `parameter_id_tensor_mapping` fetches those tensors back to the host as XLA literals and creates new tensors wrapping them, resulting in additional host RAM usage. PR #8460 added `device_parameter_id_tensor_mapping`, which returns the actual device-backed tensors instead of another copy. So we'll use that and test that this avoids host transfers.
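As a rough illustration (not the actual test from this PR), one way to check the "no host transfer" property is to read the device-backed mapping and then inspect torch_xla's debug metrics; the metric name below is an assumption and has changed across releases (older ones used `TransferFromServerTime`):

```python
# Hedged sketch: assumes the metric name 'TransferFromDeviceTime'; this is
# not the exact test added in #8460, only an illustration of the idea.
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

t = torch.randn(8, 8, device=xm.xla_device())
out = t * 2

ctx = torch_xla._XLAC.lowering.LoweringContext("ScanSketch")
ctx.build([out])

met.clear_all()
# Device-backed mapping: no XLA literals should be fetched back to the host.
_ = ctx.device_parameter_id_tensor_mapping()
assert met.metric_data('TransferFromDeviceTime') is None
```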