WARP-RM
A Warp-Augmented Relative Progress Reward Model for Data Curation

Justin Yu1,3,*,†, Andrew Goldberg1,*, Kavish Kondap1,*, Karim El-Refai1,*, Ethan Ransing1, Qianzhong Chen2, Mac Schwager2, Fred Shentu3, Philipp Wu3, Ken Goldberg1

1University of California, Berkeley    2Stanford University    3XDOF

*Equal contribution.    †Corresponding author: yujustin@berkeley.edu

Episodes A and B are real teleoperated episodes. The curve below shows the dense, per-frame predicted progress velocity \(\hat{v}_{t}\): positive (blue) for forward task progress, negative (red) for regression.

TL;DR: WARP learns a fully self-supervised, dense, signed relative progress signal from raw demonstrations by training on time-warped playback. Behavior cloning reweighted with WARP-RM produces T-shirt-folding policies with up to 18× the throughput of vanilla BC trained on the same data (throughput counts each failed trial at the full 240s timeout).


Real-World Policy Rollouts

Across 380 real-world trials of T-shirt folding from a crumpled start, WARP-BC completes at least as many folds as vanilla BC trained on the same demonstration corpus, and completes them faster. Each video below shows all 20 evaluation trials for a tier, played simultaneously in a 4×5 grid.

Every rollout below runs at 1× speed, fully autonomous.

D1: training corpus filtered to demonstrations ≤ 60s — the cleanest, fastest demonstrations.
Vanilla BC: 20/20 successes · 113.8s mean TTC
WARP-BC (Ours): 20/20 successes · 63.9s mean TTC

Both videos are time-aligned: WARP-BC finishes folds while vanilla BC is often still hesitating or recovering.

Time-to-completion distribution for successful trials across D1, D2, and D3. Across all three training tiers, successful WARP-BC folds complete faster than those of vanilla BC. As the training corpus admits more suboptimal demonstrations (D1→D3), vanilla BC's success count collapses while WARP-BC stays robust. The solid horizontal bar marks the mean.

Method

Prior progress models supervise on absolute normalized time, which is noisy: identical timesteps across demos can correspond to very different task states. WARP-RM instead learns a relative signal — how fast, and in which direction, the task is advancing — with supervision generated for free by replaying successful demonstrations at non-uniform velocities.

1 · Time-warp playback → free progress labels

Resample a successful trajectory with smoothly varying playback speeds (AR(1) in log-space) and Poisson-sampled reversals. The signed source-frame displacement from the window's first frame is the per-frame progress label — no human annotation required.
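
The sampling step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `sample_warp_window` and the values of `rho`, `sigma`, and `reversal_rate` are assumptions, as are the uniform start position and the clamping at the trajectory ends. The label is left as a raw source-frame displacement because the text does not state how it is normalized.

```python
import math
import random


def sample_warp_window(num_source_frames, window=32, rho=0.9, sigma=0.3,
                       reversal_rate=0.05, seed=None):
    """Sample one time-warped window from a trajectory of num_source_frames.

    Returns (indices, labels): the source frame shown at each window step and
    its signed displacement from the window's first frame.
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    pos = rng.uniform(0, num_source_frames - 1)  # assumed: uniform start
    log_speed = rng.gauss(0.0, sigma)            # log playback speed
    direction = 1                                # +1 forward, -1 reversed
    positions = [pos]
    for _ in range(window - 1):
        # AR(1) update in log-space: speed stays positive and varies smoothly.
        noise = rng.gauss(0.0, sigma)
        log_speed = rho * log_speed + math.sqrt(1.0 - rho ** 2) * noise
        # Reversal: flip if a Poisson process with the given per-step rate
        # fires at least once during this step.
        if rng.random() < 1.0 - math.exp(-reversal_rate):
            direction = -direction
        pos += direction * math.exp(log_speed)
        # Assumed boundary handling: clamp to the valid frame range.
        pos = min(max(pos, 0.0), num_source_frames - 1)
        positions.append(pos)
    indices = [int(round(p)) for p in positions]
    labels = [i - indices[0] for i in indices]   # signed displacement
    return indices, labels
```

Calling `sample_warp_window(500, seed=0)` returns 32 source-frame indices and 32 labels, with the first label always 0. Setting `reversal_rate=0.0` disables reversals, so the labels never decrease.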

2 · Predict per-frame velocity from images

A frozen DINOv3 ViT-B/16 + 12-layer bidirectional transformer head outputs a per-frame categorical distribution over cumulative progress. Its temporal derivative gives the velocity \(\hat{v}_{t}\):

\(\hat{v}\) ≈ 1 → expert pace · \(\hat{v}\) ≈ 0 → stalled · \(\hat{v}\) < 0 → regressing
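
A small sketch of how a velocity curve could be read off the per-frame categorical output. Two things here are assumptions: the progress estimate \(\hat{y}_j\) is taken as the expectation over bin centers (the page does not say whether an expectation or an argmax is used), and the bin centers are supplied by the caller. The scaling by \(N-1\) and the averaging over overlapping windows follow the architecture caption. All function names are hypothetical.

```python
def expected_progress(probs, bin_centers):
    """Assumed readout: expectation of the categorical over bin centers."""
    return sum(p * c for p, c in zip(probs, bin_centers))


def window_velocities(y_hat):
    """Per-step velocities v_j = (N - 1) * (y_j - y_{j-1}) for j = 1..N-1."""
    n = len(y_hat)
    return [(n - 1) * (y_hat[j] - y_hat[j - 1]) for j in range(1, n)]


def dense_velocity(window_preds, starts, total_frames):
    """Average per-step velocities over all windows that cover each frame.

    window_preds[k] holds the progress estimates of the window that starts
    at global frame starts[k]. Frames covered by no velocity get None.
    """
    sums = [0.0] * total_frames
    counts = [0] * total_frames
    for y_hat, start in zip(window_preds, starts):
        for j, v in enumerate(window_velocities(y_hat), start=1):
            sums[start + j] += v
            counts[start + j] += 1
    return [total / c if c else None for total, c in zip(sums, counts)]
```

As a sanity check on the scaling: if the progress estimate rises linearly from 0 to 1 across a window, every step has \(\hat{y}_j - \hat{y}_{j-1} = 1/(N-1)\), so each velocity is 1, matching the "expert pace" reading above.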

WARP-RM architecture: frozen DINOv3 + temporal-diff + transformer + 30-bin categorical head
WARP Reward Model. A window of \(N{=}32\) RGB frames is encoded by frozen DINOv3, augmented with per-frame temporal differences, projected to model dimension, and processed by a bidirectional transformer. The head emits a 30-bin categorical distribution over cumulative progress at each input frame; per-step intra-window velocities \(v_j = (N{-}1)(\hat{y}_j - \hat{y}_{j-1})\) are averaged across overlapping sliding windows to give the dense velocity curve.

3 · WARP-BC: reweight action chunks by terminal velocity

For each training chunk, gate on the predicted velocity at its terminal frame. With τ = 1.0, only chunks ending in faster-than-expert progress are kept, and each is weighted continuously by its velocity:

\( w \;=\; \hat{v}_{\mathrm{end}} \cdot \mathbf{1}\{\,\hat{v}_{\mathrm{end}} > \tau\,\} \)
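
The weighting rule is simple enough to state directly in code. The sketch below is illustrative: the function names are hypothetical, and the optional `max_weight` cap is an assumption about how the "τ = 1, max = 1 (binary)" ablation variant is formed, not something the page spells out.

```python
def warp_bc_weights(v_end, tau=1.0, max_weight=None):
    """Weight each chunk by w = v_end * 1{v_end > tau}.

    v_end: predicted velocity at the terminal frame of each action chunk.
    max_weight: optional cap (assumed reading of the binary ablation:
    tau = 1 with max_weight = 1 gives weights in {0, 1}).
    """
    weights = []
    for v in v_end:
        w = v if v > tau else 0.0          # gate on terminal velocity
        if max_weight is not None:
            w = min(w, max_weight)          # optional cap
        weights.append(w)
    return weights


def kept_fraction(weights):
    """Fraction of chunks with nonzero weight."""
    return sum(1 for w in weights if w > 0) / len(weights)
```

For terminal velocities `[0.5, 1.0, 1.2, -0.3, 2.0]` and τ = 1.0, only the third and fifth chunks survive, with weights 1.2 and 2.0. Note that the gate is strict, so a chunk at exactly expert pace (velocity 1.0) is dropped.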

Quantitative Results

Cross-tier results on T-shirt folding

All policies are evaluated on 20 trials of T-shirt folding from a crumpled start with a 240s timeout. Time-to-completion (TTC) is reported over successful trials only. As the training pool admits more suboptimal demonstrations, vanilla BC degrades sharply while WARP-BC stays robust.

| Method | Metric | D1 (≤60s) | D2 (≤90s) | D3 (≤120s) |
| --- | --- | --- | --- | --- |
| Vanilla BC | Success ↑ | 20/20 | 2/20 | 0/20 |
| | Mean TTC (s) ↓ | 113.8 | 199.0 | N/A |
| | Throughput (/hr) ↑ | 31.6 | 1.5 | 0.0 |
| | Act. Chunks Kept | 100% | 100% | 100% |
| WARP-BC (Ours) | Success ↑ | 20/20 | 19/20 | 14/20 |
| | Mean TTC (s) ↓ | 63.9 | 118.8 | 117.4 |
| | Throughput (/hr) ↑ | 56.3 | 27.4 | 16.3 |
| | Act. Chunks Kept | 35.7% | 34.4% | 22.5% |
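
The throughput rows can be reproduced from the success counts and mean TTC. The sketch below assumes throughput is successful folds per hour of robot time, with each failed trial charged the full 240s timeout as the TL;DR states. The function name is hypothetical and the formula is inferred from the reported numbers rather than quoted from the paper.

```python
def throughput_per_hour(successes, trials, mean_ttc_s, timeout_s=240.0):
    """Successful folds per hour, charging each failure the full timeout.

    mean_ttc_s is the mean time-to-completion over successful trials only;
    it is ignored when there are no successes.
    """
    if successes == 0:
        return 0.0
    failures = trials - successes
    total_time_s = successes * mean_ttc_s + failures * timeout_s
    return 3600.0 * successes / total_time_s


# Cross-tier rows, rounded to one decimal as in the table.
vanilla_d2 = round(throughput_per_hour(2, 20, 199.0), 1)   # 1.5
warp_d2 = round(throughput_per_hour(19, 20, 118.8), 1)     # 27.4
```

Under this reading, the D2 ratio of 27.4 to 1.5 is roughly 18, which is where the headline "up to 18×" figure appears to come from.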

Matched baseline comparisons

Because SARM requires human-annotated subtask boundaries, all methods are evaluated on the augmented corpora D4 = D1 ∪ A and D5 = D2 ∪ A, where A is the annotated supplement (treated as unlabeled by every method except SARM). WARP-BC achieves the highest throughput and the highest success count on both tiers, without the human labels SARM needs. SARM and SCIZOR collapse on the noisier D5, while DemInf stays robust on success but at lower throughput.

| Method | Metric | D4 | D5 |
| --- | --- | --- | --- |
| SARM | Success ↑ | 19/20 | 2/20 |
| | Mean TTC (s) ↓ | 90.5 | 156.0 |
| | Throughput (/hr) ↑ | 34.9 | 1.55 |
| | Act. Chunks Kept | 78.5% | 66.6% |
| DemInf | Success ↑ | 19/20 | 18/20 |
| | Mean TTC (s) ↓ | 89.6 | 115.8 |
| | Throughput (/hr) ↑ | 35.2 | 25.3 |
| | Act. Chunks Kept | 45.6% | 33.7% |
| SCIZOR | Success ↑ | 19/20 | 2/20 |
| | Mean TTC (s) ↓ | 98.4 | 206.2 |
| | Throughput (/hr) ↑ | 32.4 | 1.5 |
| | Act. Chunks Kept | 77.9% | 66.7% |
| WARP-BC (Ours) | Success ↑ | 20/20 | 20/20 |
| | Mean TTC (s) ↓ | 71.2 | 80.7 |
| | Throughput (/hr) ↑ | 50.6 | 44.6 |
| | Act. Chunks Kept | 45.6% | 33.7% |

Ablations

All ablations are run on dataset D2. The Act. Chunks Kept column reports the fraction of action chunks that survive the weighting filter.

| Variant | Success ↑ | Mean TTC (s) ↓ | Throughput (/hr) ↑ | Act. Chunks Kept |
| --- | --- | --- | --- | --- |
| Weighting function | | | | |
| τ = 0 | 3/20 | 201.4 | 2.3 | 97.0% |
| τ = 1, max = 1 (binary) | 16/20 | 139.6 | 18.0 | 34.4% |
| τ = 1, continuous (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |
| RA-BC aggregation strategy | | | | |
| Mean over 1s action chunk | 15/20 | 127.0 | 17.4 | 34.0% |
| Mean over 1s, one-chunk offset | 14/20 | 124.2 | 15.9 | 34.3% |
| Terminal \(\hat{v}_{end}\) (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |
| WARP sampler | | | | |
| IID log-normal | 18/20 | 131.0 | 22.8 | 28.7% |
| AR(1) process (WARP) | 19/20 | 118.8 | 27.4 | 34.4% |