ByteRobust
Robust LLM Training Infrastructure at ByteDance link
Control plane
- Robust controller
- Robust controller reacts to events from proactive real-time checks:
- High-confidence events tied to a specific machine: force-drain the node, evict the machine, and skip stop-time diagnostics
- GPU unavailability
- Disk fault
- Network issues: tolerate several alerts before evicting the machine (empirical policy: evict after two alerts within 5 minutes)
- NIC and network switch flapping can auto-recover
- If the restarted job fails again after machine eviction, it goes to a stop-time check procedure.
- Robust controller also analyzes logs to triage failures (a combined policy sketch follows this block):
- User-space errors: traceable to specific code modules via logs and exit codes (e.g., a Python exception) -> code rollback
- Training crashes or abnormal metrics (e.g., NaN losses) that arise without a clear culprit -> stop-time check
- Performance anomalies: zero RDMA traffic for 10 minutes, low TensorCore utilization -> aggregation analysis
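A minimal sketch of how this controller policy could be expressed, assuming illustrative event names, action labels, and the two-alerts-in-5-minutes network rule from above; none of this is ByteRobust's actual interface.

```python
import time
from collections import defaultdict, deque

# Illustrative action labels; the real controller's actions are assumptions here.
EVICT, STOP_TIME_CHECK, CODE_ROLLBACK, AGGREGATION_ANALYSIS, TOLERATE = (
    "evict", "stop_time_check", "code_rollback", "aggregation_analysis", "tolerate"
)

HIGH_CONFIDENCE_EVENTS = {"gpu_unavailable", "disk_fault"}  # assumed event names

class RobustControllerPolicy:
    def __init__(self, network_alert_limit=2, network_window_s=300):
        # Empirical policy from the notes: evict after 2 network alerts within 5 minutes.
        self.network_alert_limit = network_alert_limit
        self.network_window_s = network_window_s
        self._network_alerts = defaultdict(deque)  # machine -> alert timestamps

    def on_realtime_event(self, machine, event, now=None):
        """React to a proactive real-time check event for one machine."""
        now = now if now is not None else time.time()
        if event in HIGH_CONFIDENCE_EVENTS:
            # High-confidence machine fault: drain, evict, skip stop-time diagnostics.
            return EVICT
        if event == "network_alert":
            alerts = self._network_alerts[machine]
            alerts.append(now)
            # Keep only alerts inside the sliding window.
            while alerts and now - alerts[0] > self.network_window_s:
                alerts.popleft()
            if len(alerts) >= self.network_alert_limit:
                return EVICT
            return TOLERATE  # NIC / switch flapping may auto-recover
        return TOLERATE

    def on_job_failure(self, has_python_exception, has_clear_culprit, perf_anomaly):
        """Triage a failed or abnormal job from logs, exit codes, and metrics."""
        if has_python_exception:
            return CODE_ROLLBACK          # user-space error traceable to code
        if perf_anomaly:
            return AGGREGATION_ANALYSIS   # e.g. zero RDMA traffic, low TensorCore util
        if not has_clear_culprit:
            return STOP_TIME_CHECK        # crash / NaN loss without a clear culprit
        return TOLERATE
```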
- Runtime analyzer
- Aggregation analysis
- Parses process trees in each training pod to identify training-related processes (torchrun, dataloader, checkpoint process)
- Stack traces from these identified processes are aggregated into multiple groups via string matching to differentiate abnormal sources
- The dominant groups are deemed healthy.
- Remaining groups are classified as outliers.
- Find the parallel groups shared by those outliers and isolate the corresponding machines
- Fail-slow (MFU decline), sketched together with the aggregation step after this block:
- Repeat the aggregation every 10 s and flag the parallel group with the most outliers in each round.
- Parallel group with the highest cumulative flag count across 5 rounds is marked as the degrader for over-eviction.
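A rough sketch of the aggregation and fail-slow logic under assumed data shapes (stack traces keyed by rank, plus rank-to-machine and rank-to-parallel-group maps); grouping here is plain string equality rather than the paper's exact matching rules.

```python
from collections import Counter

def aggregate_outliers(stack_traces):
    """Group ranks by identical stack-trace strings; the dominant group is healthy."""
    groups = Counter(stack_traces.values())            # trace string -> count
    if not groups:
        return set()
    dominant_trace, _ = groups.most_common(1)[0]
    return {rank for rank, trace in stack_traces.items() if trace != dominant_trace}

def machines_to_isolate(outlier_ranks, rank_to_machine):
    """Map outlier ranks back to the machines they run on."""
    return {rank_to_machine[r] for r in outlier_ranks}

def find_fail_slow_degrader(rounds_of_traces, rank_to_group, rounds=5):
    """Fail-slow detection: flag the parallel group with the most outliers in each
    round; the group with the highest cumulative flag count is the degrader."""
    flags = Counter()
    for stack_traces in rounds_of_traces[:rounds]:      # e.g. sampled every 10 s
        outliers = aggregate_outliers(stack_traces)
        if not outliers:
            continue
        per_group = Counter(rank_to_group[r] for r in outliers)
        worst_group, _ = per_group.most_common(1)[0]
        flags[worst_group] += 1
    return flags.most_common(1)[0][0] if flags else None
```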
Data plane (Robust agent in each training pod)
- Monitor
- System inspection (proactive real-time checks):
- Network: NIC down or jitter, packet loss rate, switches down
- GPU: status of DCGM service, PCIe bandwidth, memory row remapping, and GPU temperature, etc.
- Host: OS kernel events (e.g., Xid errors in dmesg); a toy check is sketched after this block
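A toy host-side check in the spirit of the inspection above, assuming the agent only scans dmesg for NVIDIA Xid lines; the fatal-Xid subset below is illustrative, and a real agent would also consult DCGM, NIC counters, and so on.

```python
import re
import subprocess

XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:[^)]+\): (\d+)")

# Xid codes often treated as fatal (illustrative subset, not an official list).
FATAL_XIDS = {48, 63, 64, 79, 94, 95}

def scan_dmesg_for_xid_errors():
    """Return the set of Xid error codes seen in the kernel log."""
    try:
        out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return set()
    return {int(m.group(1)) for m in XID_PATTERN.finditer(out)}

def host_gpu_check():
    """Raise a high-confidence event if a fatal Xid is present."""
    xids = scan_dmesg_for_xid_errors()
    fatal = xids & FATAL_XIDS
    if fatal:
        return {"event": "gpu_unavailable", "xids": sorted(fatal)}
    return {"event": "healthy", "xids": sorted(xids)}
```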
- Metrics collection
- Workload-specific training metrics from wandb: loss, gradient norm, MFU, etc.
- Fatal signals: a 5× increase in loss/gradient norms or NaN values (detection sketched after this block)
- stdout/stderr logs and process exit codes, which serve as hints for diagnostics
- Events (significant declines serve as a signal for potential job hangs and MFU declines):
- CUDA
- RDMA: traffic
- Host
- Storage
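A small sketch of the fatal-signal rule on workload metrics, assuming a rolling-average baseline over recent steps; the 5× threshold comes from the notes above, while the window size and baseline definition are assumptions.

```python
import math
from collections import deque

class FatalSignalDetector:
    """Flag NaN/Inf values or a 5x jump in loss / gradient norm over a rolling baseline."""

    def __init__(self, window=100, spike_factor=5.0):
        self.window = deque(maxlen=window)
        self.spike_factor = spike_factor

    def observe(self, value):
        """Return True if this step's metric looks fatal."""
        if math.isnan(value) or math.isinf(value):
            return True
        if self.window:
            baseline = sum(self.window) / len(self.window)
            if baseline > 0 and value > self.spike_factor * baseline:
                return True
        self.window.append(value)
        return False

# Usage: feed per-step metrics (e.g. pulled from wandb or the training loop).
loss_detector = FatalSignalDetector()
grad_detector = FatalSignalDetector()
for step_loss, step_grad_norm in [(2.1, 1.0), (2.0, 1.1), (11.0, 1.0)]:
    if loss_detector.observe(step_loss) or grad_detector.observe(step_grad_norm):
        print("fatal signal: trigger diagnosis / stop-time check")
```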
- Diagnoser
- NaN loss diagnosis
- Standard GPU and network tests first (EUD and NCCL tests).
- Intra-machine all-to-all test to verify bandwidth
- Inter-machine all-gather NCCL test to verify connectivity and integrity of data transfer with “neighboring” machines (sketched below)
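A hedged sketch of an inter-machine all-gather check with torch.distributed (NCCL backend). The verification rule, each rank contributing its own rank id and checking the gathered result, is my assumption about what “connectivity and integrity” means here, not the actual EUD/NCCL test.

```python
# Launch with torchrun, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d \
#            --rdzv_endpoint=<host>:29500 nccl_allgather_check.py
import os
import torch
import torch.distributed as dist

def all_gather_check():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes a tensor filled with its own rank id.
    local = torch.full((1024,), float(rank), device="cuda")
    gathered = [torch.empty_like(local) for _ in range(world)]
    dist.all_gather(gathered, local)

    # Integrity check: slot i must contain exactly i on every rank.
    ok = all(torch.all(t == float(i)) for i, t in enumerate(gathered))
    print(f"rank {rank}: all-gather {'OK' if ok else 'CORRUPTED'}")
    dist.destroy_process_group()

if __name__ == "__main__":
    all_gather_check()
```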
- Bitwise alignment test (sketched after this block)
- Each machine initializes a reference model whose structure matches that of the target training job (dense or MoE).
- Loads predefined weights, employs a specific parallelism configuration (e.g., TP=2, PP=2, DP=2 or EP=2, PP=2, DP=2), and executes one training step on a fixed input to ensure reproducibility.
- The outputs from all machines are collected and analyzed to verify bit-wise accuracy.
- Machines that yield incorrect results are promptly isolated and removed.
- If this test does not identify any defective machines, a reattempt and a code rollback are tried in sequence to settle potential transient failures and human errors.
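A very reduced sketch of the bitwise alignment idea: every machine runs one deterministic step of a small stand-in model from fixed weights and a fixed input, and a coordinator compares parameter hashes bitwise; machines that disagree with the majority are candidates for eviction. The tiny CPU model, the hashing, and the majority comparison are all assumptions; the real test mirrors the target job's architecture and parallelism (e.g., TP/PP/DP or EP/PP/DP).

```python
import hashlib
from collections import Counter

import torch

def run_reference_step(seed: int = 0) -> str:
    """One deterministic training step on a tiny stand-in model; returns a hash
    of the resulting parameters for bitwise comparison across machines."""
    torch.manual_seed(seed)                      # fixed, predefined weights
    torch.use_deterministic_algorithms(True)
    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256), torch.nn.GELU(), torch.nn.Linear(256, 256)
    )
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    x = torch.arange(256, dtype=torch.float32).unsqueeze(0) / 256.0  # fixed input
    loss = model(x).square().mean()
    loss.backward()
    opt.step()

    h = hashlib.sha256()
    for p in model.parameters():
        h.update(p.detach().numpy().tobytes())   # bit-exact view of the weights
    return h.hexdigest()

def find_defective_machines(machine_hashes: dict) -> list:
    """Machines whose hash disagrees with the majority are candidates for eviction."""
    majority, _ = Counter(machine_hashes.values()).most_common(1)[0]
    return [m for m, digest in machine_hashes.items() if digest != majority]
```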
- Job hang and MFU declines
- On-demand tracer
- CKPT manager