What-if analysis

Training strategy and straggler reasons

Reason 1: worker synchronization. A single slow worker can cause all other workers to stall at the next synchronization point.
Reason 2: uneven pipeline partitioning. If PP stages are not evenly partitioned, the slowest stage stalls the other stages and becomes the performance bottleneck.

Parallelism strategy and straggler reasons

  1. Data parallelism (DP), ZeRO [31], and fully sharded data parallelism (FSDP [43])
    1. Data parallelism: the gradient all-reduce step requires synchronization across all DP workers (see the sketch after this list).
    2. ZeRO and FSDP: the reduce-scatter and all-gather steps require synchronization.
  2. Pipeline parallelism: GPipe, 1F1B, virtual pipeline parallelism (VPP)
    1. Assumption: computation is evenly partitioned across pipeline stages; these schedules aim to minimize pipeline bubbles, i.e., periods when a pipeline stage sits idle waiting for data from the previous stage.
    2. Straggler hazard: if PP stages are not evenly partitioned, the slowest stage stalls the other stages and becomes the performance bottleneck.
  3. Tensor parallelism (TP) and context parallelism (CP):
    1. TP: partitions each layer's weights across workers.
    2. CP: partitions each sequence's tokens across workers.
    3. Both require a synchronization step after each transformer layer.
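
To make the synchronization hazard concrete, here is a minimal sketch (pure Python with made-up timings; all numbers are hypothetical) of why a bulk-synchronous step is pinned to the slowest worker:

```python
# Simulate one bulk-synchronous DP step: every worker must reach the
# all-reduce barrier, so the step time is max(worker compute times) + comm.
compute_time = [1.00, 1.02, 0.98, 3.50]  # worker 3 is a straggler (seconds)
all_reduce_time = 0.20                   # assumed cost of the collective

step_time = max(compute_time) + all_reduce_time
ideal_time = sum(compute_time) / len(compute_time) + all_reduce_time

print(f"step time with straggler: {step_time:.2f} s")
print(f"straggler-free estimate : {ideal_time:.2f} s")
print(f"slowdown                : {step_time / ideal_time:.2f}x")
```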

Straggler-free: An LLM training job that uses hybrid parallelism is straggler-free if all workers take the same amount of time to complete their assigned work.
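
Read literally, this condition can be checked per iteration from per-worker timings; a toy sketch (tolerance and timings are assumptions, not from the source):

```python
# Hypothetical check of the straggler-free condition: all workers should take
# (nearly) the same time to finish their assigned work in an iteration.
def is_straggler_free(worker_times, rel_tol=0.02):
    fastest, slowest = min(worker_times), max(worker_times)
    return (slowest - fastest) / fastest <= rel_tol

print(is_straggler_free([1.00, 1.01, 0.99, 1.00]))  # True: balanced workers
print(is_straggler_free([1.00, 1.01, 0.99, 3.50]))  # False: worker 3 straggles
```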

Hardware environment setup: no resource contention

  1. The network is overprovisioned, so there is no network congestion.
  2. Jobs do not share machines.

Transfer duration estimation: for each communication operation, take the maximum start time among all of its peer operations in the same collective (or the same P2P pair) and subtract it from the operation's end time. The time spent waiting for the last peer to arrive is blocking time, not transfer time, and is thereby excluded.
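
A minimal sketch of this estimate, assuming per-rank (start, end) timestamps for each collective call (the data layout and numbers below are hypothetical):

```python
# Hypothetical per-rank timestamps (seconds) for one all-reduce call.
# Each entry is (start_time, end_time) as recorded on that rank.
peer_ops = {
    0: (10.00, 10.45),
    1: (10.30, 10.45),  # rank 1 arrives late; the others wait for it
    2: (10.05, 10.45),
    3: (10.02, 10.45),
}

def transfer_duration(rank, ops):
    # End time of this rank's op minus the latest start time among all peers;
    # the wait before the last peer arrives is excluded from transfer time.
    latest_start = max(start for start, _ in ops.values())
    _, end = ops[rank]
    return end - latest_start

for r in peer_ops:
    print(f"rank {r}: estimated transfer time = {transfer_duration(r, peer_ops):.2f} s")
```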

Idealized operation duration:

  • compute operation: use the average duration across the operations in the same group
  • communication operation: use the median instead of the average, since the median is robust to a few straggler-inflated measurements
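
A minimal sketch of how the idealized durations could be computed from grouped measurements (the grouping scheme and numbers below are assumptions):

```python
from statistics import mean, median

# Durations (seconds) of the "same" operation observed across workers/iterations.
op_groups = {
    ("compute", "layer7_matmul"): [0.021, 0.020, 0.022, 0.021],
    ("comm", "dp_all_reduce"):    [0.050, 0.052, 0.049, 0.410],  # one outlier
}

def idealized_duration(kind, samples):
    # Median shields the communication estimate from straggler-inflated samples;
    # compute samples are averaged.
    return median(samples) if kind == "comm" else mean(samples)

for (kind, name), samples in op_groups.items():
    print(f"{name:16s}: idealized {idealized_duration(kind, samples) * 1e3:.1f} ms")
```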


