PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production
Observation
In large-scale model training (LMT), executions of functions such as Python functions, GPU/CPU kernel functions, and memory operations exhibit three significant characteristics.
- Within a single worker, most low-level functions are executed repeatedly. This is because training involves many identical iterations, and models are typically composed of repetitive submodules (e.g., transformer blocks).
- Runtime behaviors of functions are highly similar across workers, because modern parallelism strategies (e.g., at the data, pipeline, tensor, and expert levels) distribute workloads evenly with frequent synchronization operations.
- Most performance issues can be diagnosed by observing function executions whose runtime behavior is abnormal compared with all other executions of the same function (see the summarization sketch after this list).
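Because each training iteration re-executes the same functions, raw per-function timing samples compress naturally into compact per-function statistics. A minimal sketch of this idea, where the sample data and record layout are hypothetical and not PerfTracker's actual trace format:

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (function_name, duration_ms) samples a profiler might emit
# on one worker across a few iterations.
samples = [
    ("attention_fwd", 1.21), ("mlp_fwd", 0.83), ("allreduce_grad", 4.02),
    ("attention_fwd", 1.19), ("mlp_fwd", 0.85), ("allreduce_grad", 4.05),
    ("attention_fwd", 1.22), ("mlp_fwd", 0.84), ("allreduce_grad", 9.87),  # outlier
]

# Group durations by function: repetition makes each group large enough
# to summarize instead of storing raw events.
by_func = defaultdict(list)
for name, dur in samples:
    by_func[name].append(dur)

for name, durs in by_func.items():
    print(f"{name}: n={len(durs)} mean={mean(durs):.2f}ms stdev={stdev(durs):.2f}ms")
```

The unusually high stdev of `allreduce_grad` already hints at an anomaly without keeping any raw event data.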
Insight
- (1) Performance issues can be observed by profiling the behavior of function executions.
- (2) We can troubleshoot performance issues using differential observability, which localizes the offending function executions with abnormal behavior (e.g., low average GPU-NIC throughput without fluctuation); a minimal sketch follows this list.
- (3) We do not need to analyze fine-grained raw observability data of all the functions; instead, we only need to summarize their runtime behavior patterns.
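To make insight (2) concrete, here is a minimal sketch of differential observability. The per-worker throughput values and the 2-sigma cutoff are hypothetical illustrations, not the paper's actual algorithm:

```python
from statistics import mean, stdev

# Hypothetical mean GPU-NIC throughput (GB/s) of the same communication
# function, summarized per worker.
per_worker_tput = {0: 21.3, 1: 21.1, 2: 21.4, 3: 9.8, 4: 21.2, 5: 21.3}

vals = list(per_worker_tput.values())
mu, sigma = mean(vals), stdev(vals)

# Flag workers whose behavior deviates from the population of all workers;
# the 2-sigma threshold is an assumed cutoff for illustration.
suspects = {w: v for w, v in per_worker_tput.items()
            if sigma > 0 and abs(v - mu) / sigma > 2.0}
print(suspects)  # worker 3 stands out with persistently low throughput
```

The key property is that no absolute performance model is needed: a worker is suspicious only relative to its peers running the identical workload.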
Design
- (1) detecting performance degradation of the LMT job to trigger online profiling (per worker).
- (2) summarizing the runtime behavior pattern of each function from raw profiling data (per worker).
- (3) a centralized localization algorithm that pinpoints the root-cause function based on the behavior patterns (global); an end-to-end sketch follows this list.
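An end-to-end sketch of the three steps under stated assumptions: all function names, data layouts, and thresholds below are hypothetical illustrations, not PerfTracker's actual implementation.

```python
from statistics import mean, stdev

DEGRADATION_FACTOR = 1.3  # assumed: trigger profiling if iterations slow by 30%

def degraded(iter_times, baseline):
    """Step (1), per worker: detect degradation to trigger online profiling."""
    return mean(iter_times) > DEGRADATION_FACTOR * baseline

def summarize(raw_durations):
    """Step (2), per worker: compress raw profiling data into a behavior
    pattern, here just (mean, stdev) per function (needs >= 2 samples each)."""
    return {f: (mean(d), stdev(d)) for f, d in raw_durations.items()}

def localize(patterns_by_worker):
    """Step (3), global: rank functions by how far the worst worker deviates
    from the cross-worker mean of that function's mean duration."""
    scores = {}
    for f in patterns_by_worker[0]:
        means = [patterns_by_worker[w][f][0] for w in patterns_by_worker]
        mu, sigma = mean(means), stdev(means)
        scores[f] = max(abs(m - mu) / sigma for m in means) if sigma > 0 else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Usage with hypothetical per-worker patterns: worker 1's allreduce is slow.
patterns = {
    0: {"allreduce": (4.0, 0.1), "mlp_fwd": (0.8, 0.02)},
    1: {"allreduce": (9.9, 0.1), "mlp_fwd": (0.8, 0.02)},
    2: {"allreduce": (4.1, 0.1), "mlp_fwd": (0.8, 0.02)},
}
print(localize(patterns))  # "allreduce" ranks first as the likely root cause
```

Only the compact patterns cross the network to the centralized localizer, which is what keeps steps (1) and (2) cheap enough to run online on every worker.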