Minder | Zhiting's space

Minder: Faulty Machine Detection for Large-scale Distributed Model Training

Challenges to solve

Any machine could fail in various ways
The normal state of a monitoring metric is task-dependent
The correlation between fault types and monitoring metrics is not necessarily one-to-one
Noises exist in time series monitoring data

Solution proposed by Minder

Machine-level similarity: For challenges 1 and 2, if a machine undergoes a fault, its monitoring data will differ distinctively from that of the other machines, offering an opportunity for detection.
Machine-level continuity: For challenge 2, most abnormal patterns last for over 5 minutes. If a machine displays such dissimilarity continuously for a period, the machine may be faulty.
Individual Learning-Based Denoising Models for Each Monitoring Metric:
- For challenge 4, variational autoencoders can learn embedding schemes that can infer the generation factors for most of the training data.
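For reference, a VAE is usually trained by maximizing the evidence lower bound. The formula below is the standard textbook objective, not something taken from the notes above; the symbols are assumptions chosen for illustration: \(x\) is one window of a single metric, \(h\) is its latent embedding, \(q_\phi\) is the encoder, \(p_\theta\) is the decoder, and \(p(h)\) is the prior.

\[\mathcal{L}(x) = \mathbb{E}_{q_\phi(h \mid x)}\left[\log p_\theta(x \mid h)\right] - D_{\mathrm{KL}}\left(q_\phi(h \mid x) \,\|\, p(h)\right)\]

The first term rewards embeddings from which the window can be reconstructed, and the KL term keeps the embedding close to the prior, so fluctuations that do not help reconstruction tend to be dropped from \(h\). This is one common way to read the denoising effect.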

Metrics to monitor (bold are used in detection, others are collected but not used):

CPU Usage: Percentage of CPU time being used.
PFC Tx Packet Rate: Periodic counts of PFC packets sent by RDMA-enabled devices. - RoCE only?
GPU Duty Cycle: Percentage of time over the past sample period when the accelerator is active.
GPU Power Draw: Periodic counts of the GPU power consumption.
GPU Tensor Core Activity: Percentage of cycles when the tensor (HMMA / IMMA) pipe is active.
GPU Graphics Engine Activity: Percentage of time when any portion of the graphics or compute engines is active.
GPU NVLink Bandwidth: The rate of data transmitted/received over an NVLink.
Memory Usage: Percentage of memory being used.
Disk Usage: Percentage of storage space being used on a disk.
TCP Throughput: Periodic counts of the amount of TCP data being transmitted by a NIC.
TCP+RDMA Throughput: Periodic counts of the amount of TCP and RDMA data being transmitted by a NIC.
GPU Memory Used: The amount of GPU memory being used by processes.
GPU Temperature: The temperature of a GPU while it is operating, measured in degrees Celsius.
GPU SM Activity: Averaged percentage of time when at least one warp is active on a multiprocessor.
GPU Clocks: The clock speed of a GPU, reflecting the frequency of the GPU’s processor.
GPU FP Engine Activity: Percentage of cycles when the FP pipe is active.
GPU Memory Bandwidth Utilization: Percentage of cycles when data is sent to or received from the device memory.
PCIe Bandwidth: The rate of data transmitted/received over the PCIe bus.
PCIe Usage: Percentage of the bandwidth being used on the PCIe bus.
ECN Packet Rate: Periodic counts of ECN packets transmitted/received by a NIC.
CNP Packet Rate: Periodic counts of CNP packets transmitted/received by a NIC.

Analytics:

Metrics data are grouped into a time window.
Within the time window, align sample points across all sampled machines. If sample points are missing, pad them with data from the nearest sampling time.
Normalize data points based on upper and lower limits of each metric with min-max normalization.
\[x' = \frac{x - \mathrm{min}}{\mathrm{max} - \mathrm{min}}\]
Feed the normalized data to the LSTM-VAE model of the corresponding metric to get the embedding.
Calculate pairwise Euclidean distances of embeddings between every two machines.
For each machine, get the sum of the distances to all other machines.
Calculate the normal score (z-score) of each machine's distance sum, where x is the distance sum of a machine, avg(x) is the average of x, and std(x) is the standard deviation of x.
\[z = \frac{x - \mathrm{avg}(x)}{\mathrm{std}(x)}\]
The machine with the maximum normal score is probably the faulty one. If the maximum normal score is higher than a similarity threshold, that machine becomes the candidate for the time window.
Continuity check: shift the time window by one data sample and repeat the detection for each new window. If the same machine is detected in consecutive windows for longer than a continuity threshold (4 minutes), it is considered a truly faulty machine.
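The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Minder's implementation: the function names are hypothetical, the LSTM-VAE is replaced by using the normalized window itself as the embedding, std(x) is taken as the population standard deviation, and the continuity threshold is expressed as a number of consecutive windows rather than minutes (converting between the two depends on the sampling interval, which these notes do not state).

```python
import math
from statistics import mean, pstdev


def min_max(window, lo, hi):
    """Scale one machine's samples using the metric's lower/upper limits."""
    return [(x - lo) / (hi - lo) for x in window]


def window_candidate(embeddings, similarity_threshold):
    """Return the index of the most dissimilar machine in one window, or None.

    embeddings: one vector per machine (assumed already aligned and padded).
    """
    # Sum of Euclidean distances from each machine to every other machine.
    sums = [
        sum(math.dist(a, b) for j, b in enumerate(embeddings) if j != i)
        for i, a in enumerate(embeddings)
    ]
    spread = pstdev(sums)  # assumption: population standard deviation
    if spread == 0:  # all machines look identical in this window
        return None
    avg = mean(sums)
    scores = [(s - avg) / spread for s in sums]
    worst = max(range(len(scores)), key=scores.__getitem__)
    return worst if scores[worst] > similarity_threshold else None


def detect_faulty(windows, similarity_threshold, continuity_threshold):
    """Flag a machine once it has been the candidate in more than
    continuity_threshold consecutive windows; return None otherwise."""
    streak_machine, streak = None, 0
    for embeddings in windows:
        candidate = window_candidate(embeddings, similarity_threshold)
        if candidate is not None and candidate == streak_machine:
            streak += 1
        else:
            streak_machine = candidate
            streak = 1 if candidate is not None else 0
        if streak > continuity_threshold:
            return streak_machine
    return None


# Made-up utilisation values (percent) for four machines in one window.
raw = [[90, 91, 89], [92, 90, 91], [20, 25, 22], [91, 92, 90]]
emb = [min_max(w, 0, 100) for w in raw]  # stand-in for the LSTM-VAE embedding
print(window_candidate(emb, similarity_threshold=1.5))  # → 2
```

One detail worth noting about this sketch: with the population standard deviation, the z-score of a single outlier among N machines cannot exceed sqrt(N - 1), so a similarity threshold has to be chosen with the number of machines in mind.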
