-
MegaScale
Scaling Large Language Model Training to More Than 10,000 GPUs
-
Mycroft
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
-
Minder
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
-
ByteRobust
Robust LLM Training Infrastructure at ByteDance
-
Training with Confidence
Catching Silent Errors in Deep Learning Training with Automated Proactive Checks