-
Minder
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
-
ByteRobust
Robust LLM Training Infrastructure at ByteDance
-
What-if analysis
Understanding Stragglers in Large Model Training Using What-if Analysis
-
Magicdom
Browser DOM implementation