Mycroft

Summary: Mycroft is a lightweight distributed tracing and root cause analysis system designed to improve the reliability of Large Language Model (LLM) training by addressing failures inside collective communication libraries (CCLs). These libraries often behave as "black boxes," which makes troubleshooting difficult and wastes resources when large-scale training jobs fail or slow down. Mycroft implements Coll-level tracing that captures fine-grained control and data dependencies, specifically Flow-level and Chunk-level traces, to expose internal communication state with minimal performance overhead. A real-time trigger mechanism detects anomalies quickly, and a dependency-driven analysis efficiently pinpoints the root cause of both gray failures and performance degradation in distributed LLM training; the system has proven effective when deployed at scale, for example at ByteDance.


Instrumentation

Trace point format

  • Metadata: IP, comm_id, Gid, GPU_id, Channel_id, QP_id
  • Operation: timestamps, op_name, op_seq, msg_size
  • Chunk: stuck_time, total_chunks, GPU_ready, RDMA_transmitted, RDMA_done
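A minimal Go sketch of how one trace point could be represented; the struct and field names below simply mirror the bullets above and are illustrative, not Mycroft's actual definition:

// TracePoint groups the metadata, operation, and chunk fields listed above
// into a single record (illustrative layout, not the real wire format).
type TracePoint struct {
	// Metadata
	IP        string
	CommID    string
	Gid       int
	GPUID     int
	ChannelID int
	QPID      int

	// Operation
	StartTSNs int64  // start timestamp (ns)
	EndTSNs   int64  // end timestamp (ns)
	OpName    string // e.g. "AllReduce"
	OpSeq     uint64
	MsgSize   uint64 // bytes

	// Chunk
	StuckTimeMs     int64 // how long progress has stalled
	TotalChunks     int
	GPUReady        int // chunks ready on the GPU (SM copy done)
	RDMATransmitted int // chunks posted to the NIC
	RDMADone        int // chunks acknowledged by the remote side
}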

Tracepoints

  • Completion log: emitted when a CollOp finishes. Includes metadata such as start and end timestamps, bytes transmitted, local and remote NIC information, and other CollOp metadata.
  • Real-time state log: emitted every 100 ms while a CollOp is in progress.
    • Reports data transmission progress across all devices, covering Stream Multiprocessor (SM) copies and RDMA writes.
    • Generated in each window until the CollOp completes or the NCCL proxy thread exits or crashes.
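A rough sketch of this emission cadence; the 100 ms ticker comes from the text, while the function, channel, and sink names are illustrative assumptions:

import "time"

// emitStateLogs writes a real-time state log every 100 ms while a CollOp is
// in flight, and a final completion log once the op finishes (done closed).
func emitStateLogs(op *TracePoint, done <-chan struct{}, sink func(TracePoint)) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-done: // CollOp completed, or the NCCL proxy thread exited
			sink(*op) // completion log: final timestamps, bytes, NIC info
			return
		case <-ticker.C:
			sink(*op) // real-time state log: current SM-copy / RDMA progress
		}
	}
}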

Real-time trigger

  • Samples a subset of ranks across the cluster and monitors all CollOps on them.
  • At least one rank per DP group is sampled, to ensure coverage in common LLM parallelism architectures.
  • Sampling is limited to at most 10 ranks.
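One possible way to pick the sampled ranks under those two constraints; the DP-group map layout here is an assumption made for illustration:

// sampleRanks selects one representative rank per DP group, capped at
// maxRanks (e.g. 10) in total. dpGroups maps a DP-group ID to its ranks.
func sampleRanks(dpGroups map[int][]string, maxRanks int) []string {
	sampled := make([]string, 0, maxRanks)
	for _, ranks := range dpGroups {
		if len(sampled) >= maxRanks {
			break
		}
		if len(ranks) > 0 {
			sampled = append(sampled, ranks[0]) // one rank per DP group
		}
	}
	return sampled
}

The trigger itself (Algorithm 1 in the paper) is sketched in Go below, starting with the per-IP log record it consumes; Acquire, called inside Trigger, fetches the completion and real-time state logs of the sampled IPs for the last window.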
// LogData represents the aggregated trace data for a specific IP within a time window.
// In the source, this includes "completion logs" and "real-time state logs" [4].
type LogData struct {
	IP             string
	CompletedOps   int     // Count of completed Coll Ops
	HasRealTimeLog bool    // True if "real-time state log" exists (indicates active running state) [3]
	Throughput     float64 // Current throughput measurement
	OpInterval     float64 // Current interval between operations
}

// TriggerMechanism maintains the state for Mycroft's anomaly detection.
type TriggerMechanism struct {
	NormalThroughput float64 // Baseline for healthy throughput [2]
	NormalInterval   float64 // Baseline for healthy operation interval [2]
	Delta            float64 // Time window size (e.g., seconds) [1]
}

// NewTriggerMechanism initializes the trigger system.
func NewTriggerMechanism(delta float64) *TriggerMechanism {
	return &TriggerMechanism{Delta: delta}
}

// Trigger implements Algorithm 1: Trigger Mechanism [1].
// Input: Sampled IP list (S_ips), time (t)
// Output: Trigger type (string) and Abnormal IP (string), or empty if normal.
func (tm *TriggerMechanism) Trigger(sampledIPs []string, t float64) (string, string) {
	// Line 2: logs <- Acquire(S_ips, t - Delta, t) [1]
	logs := tm.Acquire(sampledIPs, t-tm.Delta, t)

	for ip, log := range logs {
		// Line 3: "if no Coll Ops completed in logs" [1]
		// Context from Section 4.3: Mycroft checks if the rank "stalls mid-operation
		// with real-time state log but without producing completion log" [3].
		if log.HasRealTimeLog && log.CompletedOps == 0 {
			// Line 4: return failure trigger, abnormal IP [1]
			return "failure_trigger", ip
		}

		// Line 6: "if throughput drops by half or Coll Op interval doubles" [2]
		// These heuristics are configurable but based on practical experience [3].
		isThroughputDrop := tm.NormalThroughput > 0 && log.Throughput < (tm.NormalThroughput*0.5)
		isIntervalSpike := tm.NormalInterval > 0 && log.OpInterval > (tm.NormalInterval*2.0)

		if isThroughputDrop || isIntervalSpike {
			// Line 7: return straggler trigger, abnormal IP [2]
			return "straggler_trigger", ip
		}

		// Line 8: update normal throughput and Coll Op interval [2]
		// Updates the baseline for the next window if no anomalies are found.
		tm.NormalThroughput = log.Throughput
		tm.NormalInterval = log.OpInterval
	}

	// Return empty strings if no active abnormal trigger is found.
	return "", ""
}
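Acquire is referenced but not defined above; a stub that fixes its shape might look like the following, with the actual log retrieval left as an assumption about Mycroft's trace store:

// Acquire returns the per-IP trace data recorded between t0 and t1.
// A real implementation would read the completion and real-time state
// logs from storage; this stub only illustrates the call shape.
func (tm *TriggerMechanism) Acquire(sampledIPs []string, t0, t1 float64) map[string]LogData {
	logs := make(map[string]LogData, len(sampledIPs))
	for _, ip := range sampledIPs {
		logs[ip] = LogData{IP: ip} // placeholder: fill from the trace store
	}
	return logs
}

A caller would typically invoke tm.Trigger(sampleRanks(dpGroups, 10), now) once per window and, on a non-empty trigger type, kick off stack dumping and the root cause analysis described below.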

Root cause analysis


Principle

| Level | Problem | Rule |
|---|---|---|
| Chunk-level | Failure | Each rank should transmit the same amount of data |
| Chunk-level | Performance | Each component should finish within expected execution time |
| Flow-level | Failure | Each flow should complete |
| Flow-level | Performance | Each flow should take similar execution time |
| Flow-level | Performance | Each flow should start and end at similar time |
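For example, the flow-level performance rules can be checked by comparing each flow's duration against its peers; the median-based threshold below is an illustrative heuristic, not a value from the paper:

import "sort"

// slowFlows flags flows whose execution time exceeds the group median by
// more than the given factor (e.g. 2.0). The threshold is illustrative.
func slowFlows(durations map[string]float64, factor float64) []string {
	if len(durations) == 0 {
		return nil
	}
	vals := make([]float64, 0, len(durations))
	for _, d := range durations {
		vals = append(vals, d)
	}
	sort.Float64s(vals)
	median := vals[len(vals)/2]
	var slow []string
	for flow, d := range durations {
		if d > median*factor {
			slow = append(slow, flow)
		}
	}
	return slow
}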

Local and remote root causes

| State | Condition | Local cause | Remote cause |
|---|---|---|---|
| Not started | GPUReadyChunks = RDMATransmittedChunks = RDMADoneChunks = 0 | Uninitialized | Blocked |
| Not transmitted | GPUReadyChunks > RDMATransmittedChunks | RDMA issue | Receiver not ready |
| Not delivered | RDMATransmittedChunks > RDMADoneChunks | RDMA issue | Receiver failed |
| GPU not ready | GPUReadyChunks = RDMATransmittedChunks = RDMADoneChunks > 0 | GPU issue | - |
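The table maps directly onto a check over the three chunk counters; a sketch (state names and causes are taken from the table, the function itself is illustrative):

// classifyChunkState maps the chunk counters onto the states above and
// returns the suspected local and remote causes.
func classifyChunkState(gpuReady, rdmaTransmitted, rdmaDone int) (state, localCause, remoteCause string) {
	switch {
	case gpuReady == 0 && rdmaTransmitted == 0 && rdmaDone == 0:
		return "Not started", "Uninitialized", "Blocked"
	case gpuReady > rdmaTransmitted:
		return "Not transmitted", "RDMA issue", "Receiver not ready"
	case rdmaTransmitted > rdmaDone:
		return "Not delivered", "RDMA issue", "Receiver failed"
	case gpuReady == rdmaTransmitted && rdmaTransmitted == rdmaDone && gpuReady > 0:
		return "GPU not ready", "GPU issue", ""
	default:
		return "Unknown", "", ""
	}
}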

Deployment

Data volume: a training job utilizing 10,000 GPUs generates approximately 3 TB of data per day. This data is retained for one day before being discarded.

Debugging toolset

  • Dump the Python stacks of each rank using py-spy to identify data loader or checkpoint hangs.
    • Stack traces of all relevant Python processes are dumped automatically when Mycroft fires a trigger. The stacks are grouped by process command across all GPUs and mapped onto a topology grid, where each block represents a GPU rank with its current Python call stack; identical call stacks are shown in the same color.
    • This is particularly useful because stuck threads typically have call stacks that differ from the rest, so they stand out on the grid and are easy to spot.
  • Dump the CollOps collected by Flight Recorder to identify synchronization issues.
    • Flight Recorder stores the traces of the latest N CollOps in a ring buffer (a minimal sketch follows this list).
    • Each CollOp trace includes the CollOp ID, the sizes of the input and output tensors, the execution state, and the communication process group ID.
    • CollOp stacks are extracted to analyze rank synchronization issues: find the device with the last operation on each CUDA stream in a process group, then aggregate all trace stacks to visualize and identify abnormal devices.
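A minimal ring buffer in the spirit of Flight Recorder; the CollOpTrace fields mirror the bullet above, while the buffer code itself is an illustrative sketch rather than PyTorch's implementation:

// CollOpTrace holds the per-CollOp fields kept by the recorder.
type CollOpTrace struct {
	CollOpID   uint64
	InputSize  uint64 // input tensor size (bytes)
	OutputSize uint64 // output tensor size (bytes)
	State      string // e.g. "scheduled", "started", "completed"
	PGID       string // communication process group ID
}

// CollOpRing keeps only the traces of the latest N CollOps.
type CollOpRing struct {
	buf  []CollOpTrace
	next int
}

func NewCollOpRing(n int) *CollOpRing { return &CollOpRing{buf: make([]CollOpTrace, n)} }

// Record overwrites the oldest entry once the buffer wraps around.
func (r *CollOpRing) Record(t CollOpTrace) {
	r.buf[r.next] = t
	r.next = (r.next + 1) % len(r.buf)
}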


