What-if analysis

Training strategy and straggler reasons

Reason 1: worker synchronization. A single slow worker can cause all other workers to stall at the next synchronization point.
Reason 2: uneven pipeline partitioning. If PP stages are not evenly partitioned, the slowest stage stalls the other stages and becomes the performance bottleneck.

Parallelism strategy and straggler reasons

  1. Data parallelism (DP), ZeRO [31], and fully sharded data parallelism (FSDP [43])
    1. Data parallelism: the gradient all-reduce step requires synchronization across all DP workers (see the sketch after this list).
    2. ZeRO and FSDP: the reduce-scatter and all-gather steps require synchronization.
  2. Pipeline parallelism: GPipe, 1F1B, virtual pipeline parallelism (VPP)
    1. Assumption: computation is evenly partitioned across pipeline stages; these schedules aim to minimize pipeline bubbles, i.e., periods when a pipeline stage sits idle waiting for data from the previous stage.
    2. Straggler hazard: if PP stages are not evenly partitioned, the slowest stage stalls the other stages and becomes the performance bottleneck.
  3. Tensor parallelism (TP) and context parallelism (CP):
    1. TP: partitions each layer's weights across workers.
    2. CP: partitions each sequence's tokens across workers.
    3. Both require a synchronization step after each transformer layer.
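
To make the synchronization hazard concrete, here is a minimal sketch (pure Python with made-up timings; all numbers are hypothetical) of why a bulk-synchronous step is pinned to the slowest worker:

```python
# Simulate one bulk-synchronous DP step: every worker must reach the
# all-reduce barrier, so the step time is max(worker compute times) + comm.
compute_time = [1.00, 1.02, 0.98, 3.50]  # worker 3 is a straggler (seconds)
all_reduce_time = 0.20                   # assumed cost of the collective

step_time = max(compute_time) + all_reduce_time
ideal_time = sum(compute_time) / len(compute_time) + all_reduce_time

print(f"step time with straggler: {step_time:.2f} s")
print(f"straggler-free estimate : {ideal_time:.2f} s")
print(f"slowdown                : {step_time / ideal_time:.2f}x")
```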

Straggler-free: An LLM training job that uses hybrid parallelism is straggler-free if all workers take the same amount of time to complete their assigned work.
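
Read literally, this condition can be checked per iteration from per-worker timings; a toy sketch (tolerance and timings are assumptions, not from the source):

```python
# Hypothetical check of the straggler-free condition: all workers should take
# (nearly) the same time to finish their assigned work in an iteration.
def is_straggler_free(worker_times, rel_tol=0.02):
    fastest, slowest = min(worker_times), max(worker_times)
    return (slowest - fastest) / fastest <= rel_tol

print(is_straggler_free([1.00, 1.01, 0.99, 1.00]))  # True: balanced workers
print(is_straggler_free([1.00, 1.01, 0.99, 3.50]))  # False: worker 3 straggles
```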

Hardware environment setup: no resource contention

  1. The network is overprovisioned, so there is no network congestion.
  2. Jobs do not share machines.

Transfer duration estimation: for each communication operation, take the maximum start time among all of its peer operations in the same collective (or the same P2P pair) and subtract it from the operation's end time. The time spent waiting for the last peer to arrive is blocking time, not transfer time, and is thereby excluded.
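
A minimal sketch of this estimate, assuming per-rank (start, end) timestamps for each collective call (the data layout and numbers below are hypothetical):

```python
# Hypothetical per-rank timestamps (seconds) for one all-reduce call.
# Each entry is (start_time, end_time) as recorded on that rank.
peer_ops = {
    0: (10.00, 10.45),
    1: (10.30, 10.45),  # rank 1 arrives late; the others wait for it
    2: (10.05, 10.45),
    3: (10.02, 10.45),
}

def transfer_duration(rank, ops):
    # End time of this rank's op minus the latest start time among all peers;
    # the wait before the last peer arrives is excluded from transfer time.
    latest_start = max(start for start, _ in ops.values())
    _, end = ops[rank]
    return end - latest_start

for r in peer_ops:
    print(f"rank {r}: estimated transfer time = {transfer_duration(r, peer_ops):.2f} s")
```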

Idealized operation duration:

  • compute operation: use the average duration across the operations in the same group
  • communication operation: use the median instead of the average, since the median is robust to a few straggler-inflated measurements
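
A minimal sketch of how the idealized durations could be computed from grouped measurements (the grouping scheme and numbers below are assumptions):

```python
from statistics import mean, median

# Durations (seconds) of the "same" operation observed across workers/iterations.
op_groups = {
    ("compute", "layer7_matmul"): [0.021, 0.020, 0.022, 0.021],
    ("comm", "dp_all_reduce"):    [0.050, 0.052, 0.049, 0.410],  # one outlier
}

def idealized_duration(kind, samples):
    # Median shields the communication estimate from straggler-inflated samples;
    # compute samples are averaged.
    return median(samples) if kind == "comm" else mean(samples)

for (kind, name), samples in op_groups.items():
    print(f"{name:16s}: idealized {idealized_duration(kind, samples) * 1e3:.1f} ms")
```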


