What-if analysis
Training strategy and straggler causes
Reason 1: worker synchronization. A slow worker can cause all other workers to stall (see the sketch below).
Reason 2: uneven pipeline partitioning. If PP stages are not evenly partitioned, the slowest stage stalls the other stages and becomes the performance bottleneck.
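A toy calculation with made-up worker timings makes reason 1 concrete: under synchronous training the step time is set by the slowest worker, not the average one.

```python
# Toy model of a synchronous training step (numbers are made up): every
# worker must finish its local compute before the collective completes,
# so the step time is the maximum across workers, not the average.
worker_times = [1.00, 1.01, 0.99, 1.02, 2.50, 1.00, 0.98, 1.01]  # seconds

step_time = max(worker_times)                       # one slow worker stalls all
ideal_time = sum(worker_times) / len(worker_times)  # no-straggler baseline

print(f"step: {step_time:.2f}s  ideal: {ideal_time:.2f}s  "
      f"slowdown: {step_time / ideal_time:.2f}x")
```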
Parallelism strategies and straggler causes
- Data parallelism (DP), ZeRO [31], and fully sharded data parallelism (FSDP [43])
  - Plain data parallelism: the all-reduce step requires synchronization (see the first sketch after this list).
  - ZeRO and FSDP: the reduce-scatter and all-gather steps require synchronization.
- Pipeline parallelism (PP): GPipe, 1F1B, virtual pipeline parallelism (VPP)
  - Assumption: computation is evenly partitioned across pipeline stages, and the schedule aims to minimize pipeline bubbles, i.e., times when a pipeline stage is idle waiting for data from the previous stage (see the bubble-fraction sketch after this list).
  - Straggler hazard: if PP stages are not evenly partitioned, the slowest stage stalls the other stages and becomes the performance bottleneck.
- Tensor parallelism (TP) and context parallelism (CP)
  - TP: partitions each layer's weights across workers (see the row-parallel sketch after this list)
  - CP: partitions each sequence's tokens across workers
  - Both require a synchronization step after each transformer layer.
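A minimal sketch of the DP-style synchronization points with torch.distributed, assuming an initialized process group (the helper names and the sharding setup are mine, not the paper's):

```python
import torch
import torch.distributed as dist

def dp_grad_sync(grad: torch.Tensor) -> None:
    # Plain DP: one blocking all-reduce averages the full gradient;
    # every rank stalls until the slowest rank joins the collective.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad.div_(dist.get_world_size())

def zero_style_sync(grad: torch.Tensor, grad_shard: torch.Tensor,
                    param_shard: torch.Tensor, full_param: torch.Tensor) -> None:
    # ZeRO/FSDP style: reduce-scatter leaves each rank one shard of the
    # reduced gradient; after the local optimizer step, an all-gather
    # rebuilds the full parameters. Both collectives are sync points.
    dist.reduce_scatter_tensor(grad_shard, grad, op=dist.ReduceOp.SUM)
    # ... local optimizer updates param_shard from grad_shard ...
    dist.all_gather_into_tensor(full_param, param_shard)
```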
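For the PP assumption, a quick check of the standard GPipe bubble fraction, assuming p evenly partitioned stages and m microbatches:

```python
def gpipe_bubble_fraction(p: int, m: int) -> float:
    # With p stages and m microbatches, each stage sits idle for
    # (p - 1) microbatch slots out of (m + p - 1) total slots.
    return (p - 1) / (m + p - 1)

for m in (4, 16, 64):
    print(f"p=8, m={m:2d}: bubble = {gpipe_bubble_fraction(8, m):.1%}")
# More microbatches shrink the bubble, but only if stages stay balanced;
# an uneven partition reintroduces idle time at every other stage.
```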
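And a toy NumPy illustration of the TP synchronization point: in a row-parallel linear layer each worker holds a slice of the weight, and the partial outputs must be summed, which is the all-reduce that follows each transformer layer (two "workers" simulated in one process):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: (batch, hidden)
W = rng.standard_normal((8, 6))   # layer weights: (hidden, out)

# Row-parallel TP across 2 workers: each holds half the rows of W and
# the matching half of X's columns, so each produces a partial output.
Y0 = X[:, :4] @ W[:4, :]          # worker 0's partial product
Y1 = X[:, 4:] @ W[4:, :]          # worker 1's partial product
Y = Y0 + Y1                       # the summation an all-reduce performs

assert np.allclose(Y, X @ W)      # matches the unpartitioned layer
```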
Straggler-free: an LLM training job that uses hybrid parallelism is straggler-free if all workers take the same amount of time to complete their assigned work.
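As a code-level check, one might compare the fastest and slowest workers with a tolerance, since real timings are never exactly equal (the helper name and threshold are mine):

```python
def is_straggler_free(worker_times: list[float], rel_tol: float = 0.01) -> bool:
    # Straggler-free: every worker finishes its assigned work in (nearly)
    # the same time; the tolerance absorbs normal measurement jitter.
    return max(worker_times) <= (1.0 + rel_tol) * min(worker_times)
```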
Hardware environment setup: no resource contention
- The network is overprovisioned, so there is no network congestion.
- Jobs do not share machines.
Transfer duration estimation: take the maximum start time among all of an operation's peer operations in the same collective (or the same P2P pair) and subtract it from the operation's end time. This excludes the time an early-arriving worker spends blocked waiting for its peers, leaving only the actual transfer time.
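A sketch of that estimate over trace events, assuming a simple record format in which each operation carries a start time, an end time, and a key identifying its collective or P2P pair:

```python
from collections import defaultdict

def transfer_durations(ops: list[dict]) -> list[float]:
    # ops: e.g. {"key": "pg0_allreduce_17", "start": 3.2, "end": 5.0}
    # An op's transfer duration is its end time minus the latest start
    # time among its peers, so peer-waiting time is excluded.
    latest_start: dict = defaultdict(float)
    for op in ops:
        latest_start[op["key"]] = max(latest_start[op["key"]], op["start"])
    return [op["end"] - latest_start[op["key"]] for op in ops]
```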
Idealized operation duration:
- compute operation: use the average duration across the group of equivalent operations
- communication operation: use the median instead of the average (more robust to the long tail that blocked or straggling collectives add)
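A sketch combining both rules; grouping into equivalent operations is assumed to happen upstream:

```python
from statistics import mean, median

def idealized_duration(durations: list[float], is_comm: bool) -> float:
    # Compute ops: the average across the group of equivalent operations.
    # Communication ops: the median, which resists the skew that blocked
    # or straggling collectives add to the duration distribution.
    return median(durations) if is_comm else mean(durations)
```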