Piper | Zhiting's space

Overview / Takeaway

Piper is a PyTorch distributed training system that separates strategy specification from runtime execution, letting users express composed strategies such as PP + DP/EP + ZeRO without hard-coding each combination into the framework. Its core abstraction is a global training DAG that explicitly represents compute, communication, placement, streams, and ordering constraints, enabling joint scheduling across parallelism dimensions. The main result is flexibility without giving up baseline performance: Piper matches common strategies such as ZeRO-1 and 1F1B while enabling DualPipe-like schedules with 6-30% throughput gains, plus PP x ZeRO combinations that other evaluated systems either cannot run or do not memory-optimize correctly.

Abstract

Distributed training needs composed strategies. Large-scale training increasingly combines data, pipeline, expert parallelism, and memory optimizations such as ZeRO, but deployed foundation-model systems often depend on experts who manually design both high-level sharding and low-level execution.
Piper decouples strategy from runtime. Piper exposes model annotations and scheduling directives that transform a unified IR, described as “a unified global training DAG that represents all computation and communication”. The runtime then executes per-device plans without being specialized to a fixed strategy.
Performance parity plus composed-strategy gains. Piper maintains parity on widely used strategies such as ZeRO and enables further gains through joint scheduling of communication and compute in strategies like DeepSeek-V3’s DualPipe.

1 Introduction

High-level and low-level strategy jointly determine throughput. A training strategy decomposes into a high-level parallelism plan and a per-device low-level execution plan. The high-level plan sets lower bounds on per-device memory, compute, and communication load; the low-level plan determines how close execution gets to those bounds.
Existing systems either require experts or expose limited strategy spaces. Human-engineered systems such as DeepSeek-V3’s DualPipe require custom codesign of PP, EP, and intra-GPU resource usage. General frameworks such as Megatron, DeepSpeed, and TorchTitan expose knobs, but they tend to dispatch each parallelism dimension independently, which makes composed scheduling hard.
DualPipe motivates local microbatch overlap. DualPipe shares a GPU between forward and backward microbatches to hide EP communication, but that conflicts with framework assumptions that each microbatch owns the full GPU. Compiler systems like JAX/XLA offer generic tensor placement but lack arbitrary PP scheduling and user control over streams/resources.
Piper’s goal is extensibility. The target is “to build a system that minimizes the effort needed to specify and implement an arbitrary distributed training strategy”. Piper does this with a scheduling API, a global DAG IR, and a strategy-agnostic runtime.
Reported contributions are API, IR, runtime, and evaluation. Piper contributes a user scheduling interface, a unified global training DAG for joint compute/communication scheduling, an efficient distributed runtime, and an evaluation showing parity on common strategies plus better performance and memory efficiency on composed ones.

2 Background

DP and ZeRO reduce redundant state differently. In DP, every worker stores a full model replica, computes local gradients, and averages gradients with allreduce or allgather/reduce-scatter. ZeRO reduces redundant state by sharding optimizer state, gradients, and/or weights, rematerializing state with allgather and resharding after use.
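To make the difference between the ZeRO levels concrete, here is a rough per-device estimate of model-state memory. The byte counts (2 bytes per parameter for half-precision weights, 2 for gradients, 12 for fp32 Adam state) follow the mixed-precision accounting commonly used when describing ZeRO; they are an assumption here, not figures from the Piper paper. The function name is hypothetical and activations are ignored.

```python
def model_state_gb(num_params: float, dp_degree: int, zero_stage: int) -> float:
    """Estimate per-device model-state memory in GB (activations excluded).

    Assumed accounting (not from the source): 2 B/param weights,
    2 B/param gradients, 12 B/param optimizer state.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    weights, grads, optim = 2.0, 2.0, 12.0
    if zero_stage >= 1:      # ZeRO-1 shards optimizer state
        optim /= dp_degree
    if zero_stage >= 2:      # ZeRO-2 additionally shards gradients
        grads /= dp_degree
    if zero_stage >= 3:      # ZeRO-3 additionally shards weights
        weights /= dp_degree
    return num_params * (weights + grads + optim) / 1e9


# Example: a 1B-parameter model replicated over 4 data-parallel workers.
estimates = {stage: model_state_gb(1e9, dp_degree=4, zero_stage=stage)
             for stage in range(4)}
print(estimates)
```

Under these assumptions most of the saving comes from sharding optimizer state, with gradient and weight sharding giving smaller additional reductions.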
TP, EP, and CP put communication on the critical path. Tensor, expert, and context parallelism shard weights or activations and require collectives according to the sharding plan. Unlike DP, their collectives execute directly on the batch’s critical path.
PP needs microbatch schedules. Pipeline parallelism shards layers across workers, then uses microbatches to overlap execution across devices. Its performance depends on bubbles, communication overhead, and schedule design; PP is naturally MPMD because ranks run different operations in different orders.
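As a back-of-the-envelope illustration of why the number of microbatches matters, the idle ("bubble") fraction of a simple synchronous pipeline can be estimated as (p - 1) / (m + p - 1) for p stages and m microbatches. This standard formula assumes every stage takes the same time per microbatch and ignores communication cost; it is not taken from the Piper paper, and the function name is hypothetical.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a simple synchronous pipeline schedule.

    Assumes equal per-stage time and zero communication cost.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("stages and microbatches must be positive")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# More microbatches amortize the fill/drain phases of an 8-stage pipeline.
for m in (8, 16, 32):
    print(m, round(bubble_fraction(8, m), 3))
```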
Composed strategies must handle heterogeneous submodules. The DualPipe-like example combines PP across layers, EP for expert layers, and DP for non-expert attention layers. This matters more as models become heterogeneous, such as Qwen3-Next’s diverse attention layers and multimodal models with modality-specific components.

3 Challenges

High-level strategy must include intra-device parallelism. As communication overhead grows, scheduling must include communication/compute overlap within a GPU. DualPipeV shows that overlapping forward and backward microbatches helps, but overlapping forward-forward or backward-backward can introduce bubbles that erase gains.
Low-level scheduling faces resource contention. Multiple parallelism dimensions introduce operations that compete for GPU memory, network bandwidth, streams, and communicators. In the DualPipe-style example, a DP allreduce can interfere with EP all-to-all; measured EP communication slowed by 1.46x under background DP allreduces.
Communication partitioning is a tradeoff. Using separate streams can avoid sequential delay but create bandwidth interference; putting DP and EP communication on the same stream can delay critical all-to-alls; partitioning DP allreduces into smaller pieces may reduce interference but can lower communication efficiency.
Runtime must be strategy-agnostic but still efficient. The runtime must avoid CPU scheduling overhead, allocate memory/streams/communicators efficiently, and jointly schedule communication from different parallelism dimensions. Near capacity, PyTorch’s memory allocator can stall waiting for in-flight work, so ZeRO can improve throughput as well as memory.
Existing frameworks have incomplete PP x ZeRO behavior. The paper reports that general frameworks do not fully support all ZeRO levels with PP, likely because ZeRO hooks interact poorly with PP’s repeated layer execution across microbatches. Piper’s unified DAG supports 3-8x larger batch sizes in the PP x ZeRO case study.

4 Design

Piper has compiler and runtime components. The compiler translates annotated models and user schedules into a distributed execution plan; the runtime executes that plan on distributed workers. The design avoids hard-coding specific combinations of parallelism strategies.
The IR is a global training DAG. The DAG contains Chunk nodes for compute and Comm nodes for point-to-point or collective communication. Nodes have device placement, stream assignment, exec functions, and edges for data dependencies.
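One way to picture such an IR is as a pair of small node types. This is a minimal sketch, not Piper's actual data structures: the class and field names below are hypothetical, chosen only to mirror the attributes listed above (placement, stream, exec function, data-dependency edges).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    device: Optional[int] = None          # placement; None until placed
    stream: str = "compute"               # stream the node is issued on
    exec_fn: Optional[Callable] = None    # what a runtime would call
    deps: List["Node"] = field(default_factory=list)  # data-dependency edges


@dataclass
class Chunk(Node):
    """A schedulable region of compute."""


@dataclass
class Comm(Node):
    kind: str = "send"                    # e.g. "send", "recv", "allreduce"


# Two pipeline stages on different devices, linked by a send/recv pair.
fwd0 = Chunk("fwd_stage0", device=0)
send = Comm("send_act", device=0, stream="pp_send", deps=[fwd0], kind="send")
recv = Comm("recv_act", device=1, stream="pp_recv", deps=[send], kind="recv")
fwd1 = Chunk("fwd_stage1", device=1, deps=[recv])
```

Keeping communication as first-class nodes next to compute is what lets a scheduler reason about both in one graph.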
Annotations define schedulable regions. Users annotate meaningful model regions, such as PP stages or expert MLP blocks. Piper converts these into Chunks that can later be placed, replicated, sharded, split, or ordered.
Scheduling directives transform the DAG. The main directives are Place, Replicate, Shard, Split, and Order. Filters select Chunks by dimensions such as PP, EP, MB, or PASS, where PASS can distinguish forward, backward, backward-input, and backward-weight phases.
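The filter idea can be illustrated with plain dictionaries. This is only a sketch: the `select` helper and the metadata layout are hypothetical and do not reproduce Piper's filter syntax; only the dimension names PP, MB, and PASS come from the text above.

```python
def select(chunks, **dims):
    """Return the chunks whose metadata matches every given dimension."""
    return [c for c in chunks
            if all(c.get(key) == value for key, value in dims.items())]


chunks = [
    {"name": "attn", "PP": 0, "MB": 0, "PASS": "forward"},
    {"name": "attn", "PP": 0, "MB": 0, "PASS": "backward"},
    {"name": "expert_mlp", "PP": 1, "MB": 0, "PASS": "forward"},
    {"name": "expert_mlp", "PP": 1, "MB": 1, "PASS": "forward"},
]

# All forward-pass chunks of pipeline stage 1, across microbatches.
stage1_forward = select(chunks, PP=1, PASS="forward")
print([c["MB"] for c in stage1_forward])
```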
Place and Replicate insert communication. Place assigns nodes to devices and inserts send/recv at cross-device boundaries. Replicate synchronizes gradients with allreduce by default or reduce-scatter when gradient sharding is enabled, with optional streams and bucket sizes.
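The send/recv insertion performed by a Place-style directive can be sketched as an edge rewrite. The representation below (string node names, an edge list, a device map) and the function name are hypothetical simplifications, not Piper's implementation.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def place(edges: List[Edge], device_of: Dict[str, int]) -> List[Edge]:
    """Rewrite edges so every cross-device edge goes through send/recv.

    Adds the new communication nodes to `device_of` in place.
    """
    rewritten: List[Edge] = []
    for src, dst in edges:
        if device_of[src] == device_of[dst]:
            rewritten.append((src, dst))      # same device: keep edge as is
            continue
        send = f"send({src}->{dst})"
        recv = f"recv({src}->{dst})"
        device_of[send] = device_of[src]      # send runs where data is produced
        device_of[recv] = device_of[dst]      # recv runs where data is consumed
        rewritten += [(src, send), (send, recv), (recv, dst)]
    return rewritten


device_of = {"attn": 0, "mlp": 0, "head": 1}
new_edges = place([("attn", "mlp"), ("mlp", "head")], device_of)
print(new_edges)
```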
Shard, Split, and Order express expert parallelism, microbatching, and schedules. Shard inserts all-to-all before and after matched Chunks, enabling EP when combined with Replicate. Split duplicates a sub-DAG into microbatches. Order adds temporal dependencies and can express overlapped sub-DAGs through nested filter lists.
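A toy version of Split and Order shows how the two compose. The helpers below are hypothetical and work on string node names; overlap via nested filter lists is not modeled. The standard-library `graphlib` module is used only to check that the combined data and temporal edges still admit a valid execution order.

```python
from graphlib import TopologicalSorter


def split(nodes, edges, num_microbatches):
    """Duplicate a sub-DAG once per microbatch."""
    new_nodes = [f"{n}@mb{i}" for i in range(num_microbatches) for n in nodes]
    new_edges = [(f"{a}@mb{i}", f"{b}@mb{i}")
                 for i in range(num_microbatches) for a, b in edges]
    return new_nodes, new_edges


def order(edges, sequence):
    """Add temporal edges so each node in `sequence` follows the previous one."""
    return edges + list(zip(sequence, sequence[1:]))


mb_nodes, mb_edges = split(["fwd", "bwd"], [("fwd", "bwd")], num_microbatches=2)
sequence = ["fwd@mb0", "fwd@mb1", "bwd@mb0", "bwd@mb1"]
all_edges = order(mb_edges, sequence)

# Build a predecessor map and confirm the constraints are acyclic.
preds = {}
for before, after in all_edges:
    preds.setdefault(after, set()).add(before)
run_order = list(TopologicalSorter(preds).static_order())
print(run_order)
```

If the requested order contradicted a data dependency, `static_order()` would raise `graphlib.CycleError`, which is a cheap way to reject an invalid schedule.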
DualPipe can be specified concisely. The simplified DualPipe schedule uses streams for PP, EP, and DP communication; places two PP stages across devices; replicates non-expert chunks; shards expert chunks; splits into two microbatches; and orders microbatches so one PP stage overlaps forward and backward work.

4.2 Piper Compiler

Compilation starts from TorchDynamo graph capture. Piper extracts a PyTorch fx.Graph, initially treating all tensor operators for a forward-backward pass as one Chunk. Annotation boundaries split this graph into subgraphs, each becoming a Chunk’s forward exec function; PyTorch autograd supplies backward execution.
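The boundary-splitting step can be illustrated without PyTorch. In this sketch a traced graph is stood in for by a flat list of (operator, annotation) pairs in trace order; the real system operates on an fx.Graph, so the representation and function name here are assumptions made for illustration.

```python
from itertools import groupby


def partition_into_chunks(ops):
    """Group consecutive operators that share an annotation into one chunk.

    `ops` is a list of (op_name, annotation) pairs in trace order.
    """
    return [(tag, [name for name, _ in run])
            for tag, run in groupby(ops, key=lambda op: op[1])]


traced_ops = [
    ("embed", "stage0"),
    ("attn0", "stage0"),
    ("mlp0", "stage0"),
    ("attn1", "stage1"),
    ("mlp1", "stage1"),
]
chunks = partition_into_chunks(traced_ops)
print(chunks)
```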
Model-state buckets are tied to Chunks. The compiler uses tensor operator dependencies to associate parameters, gradients, and optimizer state with each Chunk. Currently, a state bucket can only be associated with Chunks that share the same placement.
Directives become graph rewrites. The compiler mechanically applies user scheduling directives, inserting Comm nodes such as all-to-all and allreduce. It then removes unnecessary parameter allgathers or gradient reduce-scatters when consecutive Chunks use the same state bucket.
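The allgather elimination can be pictured on a linear schedule. The sketch below is a simplified stand-in, not the compiler's actual pass: it emits one parameter allgather per run of consecutive Chunks that share a bucket, and frees (reshards) the bucket only when the run ends. The function name and tuple format are hypothetical.

```python
def insert_param_comms(chunk_buckets):
    """Build a schedule with one allgather per run of same-bucket chunks.

    `chunk_buckets` is a list of (chunk_name, bucket_id) in execution order.
    """
    schedule = []
    current = None
    for name, bucket in chunk_buckets:
        if bucket != current:
            if current is not None:
                schedule.append(("free", current))   # reshard previous bucket
            schedule.append(("allgather", bucket))   # materialize full params
            current = bucket
        schedule.append(("run", name))
    if current is not None:
        schedule.append(("free", current))
    return schedule


plan = insert_param_comms([("attn", "b0"), ("mlp", "b0"), ("head", "b1")])
print(plan)
```

Here `attn` and `mlp` share bucket `b0`, so only two allgathers are issued instead of three.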
The final DAG includes data and temporal constraints. The compiler output is a distributed DAG with explicit communication, placement, streams, model data dependencies, and Order dependencies. Missing stream assignments default to the compute stream.

4.3 Piper Runtime

The centralized scheduler creates per-device partial orders. The scheduler decomposes the global DAG into one unique sub-DAG per PP rank, with workers sharing a PP rank executing SPMD. Tasks on the same stream are totally ordered; tasks on different streams are ordered only when data or temporal dependencies require it.
Independent tasks are scheduled by a simple dependency heuristic. For overlapping sub-DAGs, Piper creates one queue per stream, repeatedly chooses a ready task with the most downstream dependencies, and appends it to that task’s stream queue. This works well for symmetric DualPipe-style forward/backward overlap.
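The heuristic can be sketched as a list-scheduling loop. Two details below are assumptions rather than facts from the paper: "most downstream dependencies" is interpreted as the number of transitive descendants, and ties are broken by task name to keep the result deterministic. All names are hypothetical.

```python
from collections import defaultdict


def schedule(tasks, deps):
    """Assign tasks to per-stream queues, most downstream work first.

    `tasks` maps task name -> stream; `deps` is a list of (before, after).
    """
    children = defaultdict(list)
    unmet = {t: 0 for t in tasks}
    for before, after in deps:
        children[before].append(after)
        unmet[after] += 1

    def descendants(task, seen):
        for child in children[task]:
            if child not in seen:
                seen.add(child)
                descendants(child, seen)
        return seen

    weight = {t: len(descendants(t, set())) for t in tasks}

    queues = defaultdict(list)
    ready = [t for t in tasks if unmet[t] == 0]
    while ready:
        ready.sort(key=lambda t: (-weight[t], t))   # heaviest first, then name
        task = ready.pop(0)
        queues[tasks[task]].append(task)
        for child in children[task]:
            unmet[child] -= 1
            if unmet[child] == 0:
                ready.append(child)
    return dict(queues)


# Forward of microbatch 1 overlapped with backward of microbatch 0.
tasks = {
    "F1.attn": "compute", "F1.a2a": "ep_comm", "F1.mlp": "compute",
    "B0.mlp": "compute", "B0.a2a": "ep_comm", "B0.attn": "compute",
}
deps = [("F1.attn", "F1.a2a"), ("F1.a2a", "F1.mlp"),
        ("B0.mlp", "B0.a2a"), ("B0.a2a", "B0.attn")]
queues = schedule(tasks, deps)
print(queues)
```

On this symmetric example the compute queue alternates between the two microbatches, so each all-to-all on the communication stream has compute from the other microbatch to hide behind.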
Workers manage streams, communicators, and memory. Each Ray actor loads its model weights and dispatches Chunks/Comms according to the scheduler’s plan. Cross-stream dependencies use CUDA events and stream waits, while independent tasks proceed concurrently.
Worker dispatch prioritizes communication carefully. Piper prioritizes send communication first, defers receive communication to reduce P2P interference, and among other communication tasks prioritizes critical-path operations over reductions. Deterministic ordering avoids collective deadlocks.
Separate P2P streams and communicators reduce PP bubbles. Piper uses separate send and receive streams plus separate communicators. This avoids requiring a single global P2P order and only requires that downstream workers process data in the same order upstream workers produce it.
Memory management explicitly controls state and activations. Piper allocates flat buffers for parameter and gradient buckets, stores persistent sharded state for ZeRO, materializes temporary full buffers when needed, and releases buffers after the last consumer completes. Intermediate activations are freed once their final downstream task is scheduled.
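The "release after the last consumer" rule amounts to a last-use analysis over the schedule. The following is a minimal sketch under the assumption of a single linear task order; the function name and the buffer naming are hypothetical.

```python
def release_points(schedule, consumes):
    """Map each task to the buffers that can be freed once it completes.

    `schedule` is the task order; `consumes` maps task -> buffers it reads.
    """
    last_use = {}
    for task in schedule:
        for buf in consumes.get(task, []):
            last_use[buf] = task          # later tasks overwrite earlier ones
    frees = {task: [] for task in schedule}
    for buf, task in last_use.items():
        frees[task].append(buf)
    return frees


task_order = ["fwd0", "fwd1", "bwd1", "bwd0"]
consumes = {
    "fwd0": ["w0"],
    "fwd1": ["w1", "act0"],
    "bwd1": ["w1", "act1"],
    "bwd0": ["w0", "act0"],
}
frees = release_points(task_order, consumes)
print(frees)
```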

5 Implementation

Piper is a TorchDynamo backend. It hooks into arbitrary PyTorch code but currently requires fully traceable models so it can partition fx.Graphs at compile time.
Annotations are Python context managers. Annotations attach metadata during execution; graph capture records that metadata and uses it to segment the graph into Chunks.
Runtime execution uses Ray. The compiler and centralized scheduler run in a driver process; each worker is a Ray actor.

6 Evaluation

Evaluation compares against three general-purpose frameworks. Piper is evaluated against Megatron-LM 0.18.0, DeepSpeed 0.18.9, and TorchTitan 0.2.2 on 4 AWS EC2 NVIDIA 8xA100 nodes with NVLink and EFA.
Common PP schedules are competitive. For Qwen3 1B with PP-8 x DP/EP-4 across 32 A100s and Qwen3 9B with PP-4 x DP/EP-4 across 16 A100s, Piper supports 1F1B and interleaved 1F1B. TorchTitan schedule builders were adapted to Piper in 29 LoC and 38 LoC.
TorchTitan loses performance from memory and stream behavior. TorchTitan’s larger DP memory footprint causes PyTorch CUDA allocator delays, and its interleaved schedule is 14% worse than its 1F1B schedule because sends and receives share one stream. Piper-interleaved-1F1B achieves 5% higher throughput than Piper-1F1B due to lower memory use and separate send/receive streams.
Megatron benefits from fused kernels. Megatron’s single-device stage forward takes about 30 ms for Qwen3 1B, versus 40 ms in Piper. Piper’s Chunk abstraction is orthogonal to fused kernels, so those optimizations could be integrated.
PP x ZeRO support differs sharply. DeepSpeed and Megatron support PP x ZeRO-1 but not ZeRO-2/3 with PP. TorchTitan claims ZeRO-2/3 support, but gradient and weight states do not reshard between all microbatches, so memory savings are much smaller. Piper supports all PP x ZeRO combinations.
ZeRO-1 throughput is similar across systems. On Qwen3 1B with DP-2 across 2 A100s, ZeRO-1 throughput in tokens/s is 8641 ±701 for Piper, 8637 ±977 for TorchTitan, 9352 ±52 for DeepSpeed, and 9942 ±1106 for Megatron.
Piper gets the expected ZeRO memory savings. For Qwen3 9B with 8-way PP and 4-way DP across 32 A100s, PP x ZeRO-1 variants OOM even at the smallest batch size. TorchTitan OOMs at batch size 8 for ZeRO-2 and 16 for ZeRO-3, while Piper runs up to batch size 32 for ZeRO-2 and 40 for ZeRO-3, corresponding to 8x and 3.3x larger batch sizes.
DualPipeV is easier to express and faster in Piper. The DualPipeV schedule builder from TorchTitan was adapted to Piper in 63 LoC using Order for microbatch overlap. On Qwen3 1B, Piper-DualPipeV improves 13% over Piper-1F1B, while TorchTitan-DualPipeV improves only 3% over TorchTitan-1F1B.
Qwen3 9B DualPipeV shows gains where TorchTitan OOMs. TorchTitan could not run the 9B DualPipeV case due to OOM. Piper-DualPipeV improves 10% over Piper interleaved 1F1B and 6% over Megatron interleaved 1F1B, while Megatron still benefits from fused kernels that Piper could potentially incorporate.
Scalability is reasonable. Piper scales Qwen3 1B across 2-, 4-, and 8-way PP and 2- and 4-way DP, with global batch size scaled linearly. Figure 9 shows throughput increasing close to the theoretical curve, reaching roughly 390k tokens/s for PP=8, DP=4.

7 Related Work

General frameworks expose fixed strategy sets. Megatron, DeepSpeed, and TorchTitan implement common DP/ZeRO, TP/EP/CP, and PP mechanisms, but each dimension dispatches eagerly with little synchronization across dimensions, making joint scheduling of memory and communication bandwidth difficult.
Compiler systems inspire Piper but lack low-level user scheduling. JAX/XLA and GSPMD-inspired systems provide tensor sharding annotations and compiler-inserted communication, but are limited in arbitrary PP scheduling and do not expose per-device resources such as GPU streams to users.
DSLs and scheduling systems overlap with Piper’s goals. CoCoNeT, AutoSP, DynaFlow, Slapo, and TVM relate through annotation, compiler support, or schedule languages. Piper extends these ideas to distributed tensor programs with explicit intra-device parallelism and a flexible runtime.
Auto-parallelism systems need broader execution support. Many auto-parallel systems restrict the search space to remain tractable and may miss strategies such as full communication/compute overlap. Piper aims to serve as a common runtime for such systems by supporting richer composed strategies through a unified IR.
nnScaler is closest but lacks intra-device parallelism. nnScaler allows generic constraints similar to Piper directives, but does not support intra-device parallelism. Future Piper work could integrate profile-guided and dynamic approaches from systems such as DeepCompile and Tessera.

8 Conclusion

Piper’s central contribution is explicit strategy representation. Piper decouples distributed execution strategy from model and runtime using a unified global training DAG. By making placement, granularity, and ordering explicit, it can express PP, DP, EP, and ZeRO combinations without runtime specialization.
The interface supports both expert schedules and automated search. Piper is positioned as a programmable substrate for “both expert-designed schedules and future automated search over a rich space of composed strategies”.
