-
Mycroft
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
-
Minder
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
-
ByteRobust
Robust LLM Training Infrastructure at ByteDance
-
Training with Confidence
Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
-
What-if analysis
Understanding Stragglers in Large Model Training Using What-if Analysis