¹University of California, Berkeley ²Stanford University ³XDOF
*Equal contribution. †Corresponding author: yujustin@berkeley.edu
Episodes A and B are real teleoperated episodes. The curve below shows the dense per-frame predicted signed progress magnitude \(\hat{v}_{t}\): positive (blue) for forward task progress, negative (red) for regression.
Real-World Policy Rollouts
Across 380 real-world trials of T-shirt folding from a crumpled start, WARP-BC consistently completes more folds and completes them faster than vanilla BC trained on the same demonstration corpus. Each video below shows all 20 evaluation trials for one of the three demonstration tiers from the paper, played simultaneously in a 4×5 grid.
Every rollout below runs at 1× speed, fully autonomous.
Both videos are time-aligned: WARP-BC finishes folds while vanilla BC is often still hesitating or recovering.
Method
Prior progress models supervise on absolute normalized time, which is noisy: identical timesteps across demos can correspond to very different task states. WARP-RM instead learns a relative signal — how fast, and in which direction, the task is advancing — with supervision generated for free by replaying successful demonstrations at non-uniform velocities.
Resample a successful trajectory with smoothly varying playback speeds (AR(1) in log-space) and Poisson-sampled reversals. The signed source-frame displacement from the window's first frame is the per-frame progress label — no human annotation required.
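The resampling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and the parameters `rho`, `sigma`, and `reversal_rate` are hypothetical, their values are illustrative, reversals are modeled as a Poisson process over output frames, and playback is clamped to the trajectory bounds.

```python
import math
import random


def sample_warped_window(num_source_frames, window_len, rng,
                         rho=0.9, sigma=0.3, reversal_rate=0.05):
    """Return (source_indices, progress_labels) for one warped window.

    rho, sigma, and reversal_rate are illustrative, hypothetical values.
    """
    last = num_source_frames - 1
    pos = rng.uniform(0.0, last)            # random starting source frame
    log_speed = rng.gauss(0.0, sigma)       # AR(1) state in log-speed space
    direction = 1.0                         # +1 forward, -1 reversed playback
    next_flip = rng.expovariate(reversal_rate)  # Poisson process: exponential gaps

    indices = []
    for t in range(window_len):
        indices.append(int(round(pos)))
        # Flip playback direction at every Poisson event that has occurred.
        while t >= next_flip:
            direction = -direction
            next_flip += rng.expovariate(reversal_rate)
        # AR(1) update; the sqrt term keeps the stationary std equal to sigma.
        innovation = math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, sigma)
        log_speed = rho * log_speed + innovation
        step = direction * math.exp(log_speed)
        pos = min(max(pos + step, 0.0), float(last))

    # Label: signed source-frame displacement from the window's first frame.
    labels = [i - indices[0] for i in indices]
    return indices, labels


rng = random.Random(0)
indices, labels = sample_warped_window(num_source_frames=200, window_len=32, rng=rng)
```

Because the labels are read directly off the sampled source indices, every window comes with its own supervision and no annotation pass is needed.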
A frozen DINOv3 ViT-B/16 + 12-layer bidirectional transformer head outputs a per-frame categorical distribution over cumulative progress. Its temporal derivative gives the velocity \(\hat{v}_{t}\):
\(\hat{v}\) ≈ 1 → expert pace · \(\hat{v}\) ≈ 0 → stalled · \(\hat{v}\) < 0 → regressing
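One way to read a velocity off the per-frame categorical output is to take the expected progress under each frame's distribution and then difference consecutive frames. The sketch below assumes exactly that; the bin centers, the use of the expectation as the point estimate, and the first-order finite difference are all assumptions made for illustration, and the function names are hypothetical.

```python
def expected_progress(probs, bin_centers):
    """Expected cumulative progress under one frame's categorical distribution."""
    return sum(p * c for p, c in zip(probs, bin_centers))


def velocities(frame_probs, bin_centers):
    """Finite-difference velocity between consecutive frames (length T - 1)."""
    progress = [expected_progress(p, bin_centers) for p in frame_probs]
    return [b - a for a, b in zip(progress, progress[1:])]


# Hypothetical bins, in units of source frames of cumulative progress.
bin_centers = [-2.0, -1.0, 0.0, 1.0, 2.0]
frame_probs = [
    [0.0, 0.0, 1.0, 0.0, 0.0],  # progress 0
    [0.0, 0.0, 0.0, 1.0, 0.0],  # progress 1: advancing at expert pace
    [0.0, 0.0, 0.0, 1.0, 0.0],  # progress 1: stalled
    [0.0, 0.0, 1.0, 0.0, 0.0],  # progress 0: regressing
]
print(velocities(frame_probs, bin_centers))  # [1.0, 0.0, -1.0]
```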
For each training chunk, gate on the predicted velocity at its terminal frame. With τ = 1.0, only chunks ending in faster-than-expert progress are kept, and each is weighted continuously by its velocity:
\( w \;=\; \hat{v}_{\mathrm{end}} \cdot \mathbf{1}\{\,\hat{v}_{\mathrm{end}} > \tau\,\} \)
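The gate translates directly into code. In the sketch below, `chunk_weight` follows the formula above, while `weighted_bc_loss` is a hypothetical illustration of how such weights might enter a behavior-cloning objective; normalizing by the sum of weights is an assumption, not something stated here.

```python
def chunk_weight(v_end, tau=1.0):
    """w = v_end * 1{v_end > tau}: zero unless the chunk ends above the threshold."""
    return v_end if v_end > tau else 0.0


def weighted_bc_loss(per_chunk_losses, v_ends, tau=1.0):
    """Weighted mean of per-chunk losses (normalization is an assumption)."""
    weights = [chunk_weight(v, tau) for v in v_ends]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no chunk passed the gate
    return sum(w * l for w, l in zip(weights, per_chunk_losses)) / total


print(chunk_weight(1.5))   # 1.5  (faster than expert pace: kept, weighted by velocity)
print(chunk_weight(1.0))   # 0.0  (the inequality is strict)
print(chunk_weight(-0.5))  # 0.0  (regressing: dropped)
```

Note that with τ = 1.0 a chunk ending at exactly expert pace is dropped, because the indicator uses a strict inequality.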
Quantitative Results
Cross-tier results on T-shirt folding
All policies are evaluated on 20 trials of T-shirt folding from a crumpled start with a 240s timeout. Time-to-completion (TTC) is reported over successful trials only. As the training pool admits more suboptimal demonstrations, vanilla BC degrades sharply while WARP-BC stays robust.
| Method | Metric | D1 (≤60s) | D2 (≤90s) | D3 (≤120s) |
|---|---|---|---|---|
| Vanilla BC | Success ↑ | 20/20 | 2/20 | 0/20 |
| | Mean TTC (s) ↓ | 113.8 | 199.0 | N/A |
| | Throughput (/hr) ↑ | 31.6 | 1.5 | 0.0 |
| | Act. Chunks Kept | 100% | 100% | 100% |
| WARP-BC (Ours) | Success ↑ | 20/20 | 19/20 | 14/20 |
| | Mean TTC (s) ↓ | 63.9 | 118.8 | 117.4 |
| | Throughput (/hr) ↑ | 56.3 | 27.4 | 16.3 |
| | Act. Chunks Kept | 35.7% | 34.4% | 22.5% |
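The throughput definition is not spelled out here. One reading that reproduces the Throughput rows above is successful folds per hour of wall-clock time, with each failed trial charged the full 240s timeout. The sketch below encodes that reading; treat the formula and the function name as assumptions rather than the paper's stated definition.

```python
def throughput_per_hour(successes, trials, mean_ttc_s, timeout_s=240.0):
    """Successful folds per hour, charging each failure the full timeout.

    Assumed definition: successes / (time on successes + time on failures).
    """
    failures = trials - successes
    total_s = successes * mean_ttc_s + failures * timeout_s
    return 3600.0 * successes / total_s


# Vanilla BC, D1: 20/20 successes at 113.8 s mean TTC.
print(round(throughput_per_hour(20, 20, 113.8), 1))  # 31.6
# Vanilla BC, D2: 2/20 successes at 199.0 s mean TTC.
print(round(throughput_per_hour(2, 20, 199.0), 1))   # 1.5
# WARP-BC, D2: 19/20 successes at 118.8 s mean TTC.
print(round(throughput_per_hour(19, 20, 118.8), 1))  # 27.4
```

Under this reading, throughput penalizes both slow successes and failures, which is why a policy with 2/20 successes falls to roughly 1.5 folds per hour even though its successful trials finish within the timeout.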
Matched baseline comparisons
Because SARM requires human-annotated subtask boundaries, all methods are evaluated on the augmented corpora D4 = D1 ∪ A and D5 = D2 ∪ A, where A is the annotated supplement (treated as unlabeled by every method except SARM). WARP-BC sustains the highest throughput on both tiers and ties or leads on success — without the human labels SARM needs. SARM and SCIZOR collapse on the noisier D5, while DemInf stays robust on success but at lower throughput.
| Method | Metric | D4 | D5 |
|---|---|---|---|
| SARM | Success ↑ | 19/20 | 2/20 |
| | Mean TTC (s) ↓ | 90.5 | 156.0 |
| | Throughput (/hr) ↑ | 34.9 | 1.55 |
| | Act. Chunks Kept | 78.5% | 66.6% |
| DemInf | Success ↑ | 19/20 | 18/20 |
| | Mean TTC (s) ↓ | 89.6 | 115.8 |
| | Throughput (/hr) ↑ | 35.2 | 25.3 |
| | Act. Chunks Kept | 45.6% | 33.7% |
| SCIZOR | Success ↑ | 19/20 | 2/20 |
| | Mean TTC (s) ↓ | 98.4 | 206.2 |
| | Throughput (/hr) ↑ | 32.4 | 1.5 |
| | Act. Chunks Kept | 77.9% | 66.7% |
| WARP-BC (Ours) | Success ↑ | 20/20 | 20/20 |
| | Mean TTC (s) ↓ | 71.2 | 80.7 |
| | Throughput (/hr) ↑ | 50.6 | 44.6 |
| | Act. Chunks Kept | 45.6% | 33.7% |
Ablations
All ablations are run on dataset D2. The Act. Chunks Kept column reports the fraction of action chunks that survive the weighting filter.
| Variant | Success ↑ | Mean TTC (s) ↓ | Throughput (/hr) ↑ | Act. Chunks Kept |
|---|---|---|---|---|
| Weighting function | | | | |
| τ = 0 | 3/20 | 201.4 | 2.3 | 97.0% |
| τ = 1, max = 1 (binary) | 16/20 | 139.6 | 18.0 | 34.4% |
| τ = 1, continuous (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |
| RA-BC aggregation strategy | | | | |
| Mean over 1s action chunk | 15/20 | 127.0 | 17.4 | 34.0% |
| Mean over 1s, one-chunk offset | 14/20 | 124.2 | 15.9 | 34.3% |
| Terminal \(\hat{v}_{end}\) (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |
| WARP sampler | | | | |
| IID log-normal | 18/20 | 131.0 | 22.8 | 28.7% |
| AR(1) process (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |
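To make the aggregation ablation concrete, the sketch below contrasts two of the strategies in the table: averaging the per-frame velocity over a chunk versus reading it at the chunk's terminal frame. The function names, the indexing convention, and the example values are hypothetical, and the one-chunk-offset variant is left out because its exact definition is not given here.

```python
def mean_velocity(v, start, chunk_len):
    """Aggregate by averaging per-frame velocity over the chunk."""
    window = v[start:start + chunk_len]
    return sum(window) / len(window)


def terminal_velocity(v, start, chunk_len):
    """Aggregate by reading the velocity at the chunk's last frame."""
    return v[start + chunk_len - 1]


# A chunk that starts slow and ends faster than expert pace (made-up values).
v = [0.25, 0.5, 1.5, 1.75]
tau = 1.0
print(mean_velocity(v, 0, 4) > tau)      # False: the slow start drags the mean to 1.0
print(terminal_velocity(v, 0, 4) > tau)  # True: the chunk ends at 1.75
```

The two aggregations can disagree on which chunks pass the gate, which is one plausible reason the downstream policies differ even though the kept fractions in the table are nearly identical.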