WARP-RM

A Warp-Augmented Relative Progress Reward Model for Data Curation

Justin Yu¹,³,*,†, Andrew Goldberg¹,*, Kavish Kondap¹,*, Karim El-Refai¹,*, Ethan Ransing¹, Qianzhong Chen², Mac Schwager², Fred Shentu³, Philipp Wu³, Ken Goldberg¹

¹University of California, Berkeley ²Stanford University ³XDOF

*Equal contribution. †Corresponding author: yujustin@berkeley.edu

TL;DR. WARP-RM is a self-supervised model that learns a dense, signed, relative progress signal from raw demonstrations by training on a temporally-warped playback augmentation. For T-shirt folding, curating data with WARP-RM increased throughput from 31.6 shirts folded per hour to 56.3 per hour, a 78% increase. For placing bottles in a bin, curating data with WARP-RM increased throughput from 147.8 bottles per hour to 237.8 per hour, a 60% increase.

Which Data is Worth Imitating?

On long-horizon tasks, collecting high-quality data becomes increasingly difficult. Even skilled humans make mistakes on challenging tasks, leaving low-quality segments scattered throughout otherwise-successful demonstrations. This pushes roboticists toward data curation: deciding which data is actually worth imitating.

Many approaches curate at the trajectory level, discarding any demonstration that contains mistakes, stalls, or regressions. On longer tasks this strategy wastes large amounts of data, and it throws away the recovery segments that teach a policy how to get back on track after a mistake.

Many finer-grained approaches estimate progress inside a demonstration and filter or re-weight individual action chunks by how much they advance the task. SARM and ARM learn this from human-annotated labels on long-horizon T-shirt folding.

A common self-supervised signal in progress reward modeling is the fraction of the trajectory completed so far, i.e. the normalized frame index¹. But elapsed time and task progress are not the same. Scoring by the clock implicitly assumes that all demonstrations progress at the same rate, whereas in long-horizon tasks different demonstrations are often at different stages of the task at the same normalized time:

Halfway through a 53s demo (26s of 53s): flattening

Halfway through a 62s demo (31s of 62s): first sleeve fold

Halfway through a 54s demo (27s of 54s): second sleeve fold

All three frames sit at 50% of their video, yet show different stages of the task. We conjecture that supervising on normalized elapsed time injects noise into task progress estimation models, and that the issue grows worse on longer-horizon tasks.
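As a small illustration of the point above, the sketch below computes the normalized-time label for the three example frames. The function name `normalized_time_label` is hypothetical, and the timestamps and stage names are copied from the three examples; nothing here is the authors' code.

```python
def normalized_time_label(t_seconds: float, duration_seconds: float) -> float:
    """Fraction of the demonstration elapsed at time t (the 'clock' label)."""
    return t_seconds / duration_seconds


# (timestamp, demo length, task stage visible in the frame), from the examples above
examples = [
    (26.0, 53.0, "flattening"),
    (31.0, 62.0, "first sleeve fold"),
    (27.0, 54.0, "second sleeve fold"),
]

clock_labels = [
    (round(normalized_time_label(t, d), 2), stage) for t, d, stage in examples
]

for label, stage in clock_labels:
    # Nearly identical labels are attached to three different task stages.
    print(f"label={label:.2f}  stage={stage}")
```

A regressor trained on these targets is asked to map three visually different stages to almost the same number, which is the label noise the conjecture refers to.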

WARP-RM produces a dense, signed, per-frame progress velocity signal straight from the demonstrations, with no human annotations required.

The progress signal

The curve under each video is WARP-RM's output: a dense, per-frame signed progress velocity, \(\hat{v}_{t}\), calibrated so that \(\hat{v}_t \approx 1\) matches the average pace of the reference demonstrations, \(\hat{v}_t \approx 0\) indicates stalled progress, and \(\hat{v}_t < 0\) indicates regression.

Progress Velocity Signal Predicted by WARP Reward Model

Episodes A and B are real teleoperated episodes. The curve shows the dense, per-frame predicted signed progress velocity \(\hat{v}_{t}\): positive for forward task progress and negative for regression.

Method

1. Time-warp playback → self-supervised progress labels

Each training example is a window of 32 frames sampled from a successful demonstration. Per-step playback speeds are drawn from an AR(1) process in log-space, so the clip drifts smoothly between slow-motion and fast-forward, and the total span the window covers is itself sampled, anywhere from a short stretch seen in slow motion to most of the demonstration skimmed quickly. Some steps run in reverse (Poisson-sampled), and the whole window is flipped with probability 0.5. Each frame's signed offset from the window's first frame is its progress label, with no annotation.

Each draw samples a per-step playback speed from an AR(1) process in log-space (slow-motion to fast-forward), with Poisson-placed reversals and a 50% chance the whole clip plays backward. The green/red bars mark the net per-step direction that results, combining both. Accumulating those signed speeds picks which demo frame to read at each of the 32 steps, and the signed offsets become the progress labels.
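The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_warped_window` and every hyperparameter (`rho`, `sigma`, `mean_reversals`, `span_range`) are assumptions chosen for readability, and the labels are left in raw source-frame units because the page does not specify a normalization.

```python
import numpy as np


def sample_warped_window(num_frames, n_steps=32, rho=0.9, sigma=0.3,
                         mean_reversals=1.0, span_range=(0.1, 0.9), rng=None):
    """Return (frame_idx, labels) for one warped window of a demo.

    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_trans = n_steps - 1  # transitions between consecutive window frames

    # AR(1) process in log-space: smooth drift between slow and fast playback.
    log_speed = np.empty(n_trans)
    log_speed[0] = rng.normal(0.0, sigma)
    innovation_std = sigma * np.sqrt(1.0 - rho ** 2)
    for k in range(1, n_trans):
        log_speed[k] = rho * log_speed[k - 1] + rng.normal(0.0, innovation_std)
    speed = np.exp(log_speed)

    # Poisson-sampled number of reversed steps, placed uniformly at random.
    n_rev = min(int(rng.poisson(mean_reversals)), n_trans)
    sign = np.ones(n_trans)
    sign[rng.choice(n_trans, size=n_rev, replace=False)] = -1.0

    # Flip the whole window with probability 0.5.
    if rng.random() < 0.5:
        sign = -sign

    # Accumulate signed speeds into offsets from the first window frame.
    offsets = np.concatenate([[0.0], np.cumsum(sign * speed)])

    # Sample the total span covered, as a fraction of the demo, and rescale.
    span = rng.uniform(*span_range) * (num_frames - 1)
    offsets *= span / (offsets.max() - offsets.min())

    # Place the window so that every index stays inside the demo.
    start = rng.uniform(-offsets.min(), (num_frames - 1) - offsets.max())
    frame_idx = np.clip(np.rint(start + offsets), 0, num_frames - 1).astype(int)

    # Progress label: signed offset (in source frames) from the first frame.
    labels = frame_idx - frame_idx[0]
    return frame_idx, labels


frame_idx, labels = sample_warped_window(1500, rng=np.random.default_rng(0))
```

`frame_idx` selects which demonstration frames to read, and `labels` gives the matching signed targets, so the supervision comes entirely from the sampled playback schedule.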

2. Predict per-frame velocity from images

A frozen DINOv3 ViT-B/16 encoder followed by a 12-layer bidirectional transformer head outputs a per-frame categorical distribution over cumulative progress. Its temporal derivative gives the velocity \(\hat{v}_{t}\):

\(\hat{v}\) ≈ 1 → expert pace, \(\hat{v}\) ≈ 0 → stagnating, \(\hat{v}\) < 0 → regressing

WARP-RM architecture: frozen DINOv3 + temporal-diff + transformer + 30-bin categorical head — **WARP Reward Model.** A window of \(N{=}32\) RGB frames is encoded by frozen DINOv3, augmented with per-frame temporal differences, projected to model dimension, and processed by a bidirectional transformer. The head emits a 30-bin categorical distribution over cumulative progress at each input frame; per-step intra-window velocities \(v_j = (N{-}1)(\hat{y}_j - \hat{y}_{j-1})\) are averaged across overlapping sliding windows to give the dense curve on the left.
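The velocity computation in the caption can be sketched as below. Two details are assumptions rather than facts from the page: the scalar progress \(\hat{y}_j\) is decoded as the expectation of the 30-bin distribution, and the bin centers span \([-1, 1]\). The sliding windows are also assumed to cover contiguous frames. Function names are hypothetical.

```python
import numpy as np

N = 32         # frames per window (from the text)
NUM_BINS = 30  # categorical bins (from the text)
# Assumption: bins cover signed cumulative progress in [-1, 1].
BIN_CENTERS = np.linspace(-1.0, 1.0, NUM_BINS)


def expected_progress(logits):
    """Softmax over bins, then expectation: (frames, NUM_BINS) -> (frames,)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ BIN_CENTERS


def window_velocity(y_hat):
    """v_j = (N - 1) * (y_j - y_{j-1}) for one window of N predictions."""
    return (len(y_hat) - 1) * np.diff(y_hat)


def dense_velocity(window_preds, starts, num_frames):
    """Average per-step velocities over overlapping windows.

    window_preds[i] holds the progress predictions of the window that
    begins at frame starts[i]; steps no window covers are returned as 0.
    """
    total = np.zeros(num_frames - 1)
    count = np.zeros(num_frames - 1)
    for y_hat, s in zip(window_preds, starts):
        v = window_velocity(y_hat)
        total[s:s + len(v)] += v
        count[s:s + len(v)] += 1
    return total / np.maximum(count, 1)


# A window whose progress rises linearly from 0 to 1 has velocity 1 everywhere.
reference_window = np.linspace(0.0, 1.0, N)
v_ref = window_velocity(reference_window)
```

The \((N{-}1)\) factor is what makes a window that advances by one unit of progress over its 32 frames read as \(\hat{v} = 1\).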

3. WARP-BC: reweight action chunks by terminal velocity

For each training chunk, gate on the predicted velocity at its terminal frame. With \(\tau = 1.0\), only chunks ending in faster-than-expert progress are kept, and each is weighted continuously by its velocity:

\( w \;=\; \hat{v}_{\mathrm{end}} \cdot \mathbf{1}\{\,\hat{v}_{\mathrm{end}} > \tau\,\} \)
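The weighting rule above translates directly into a few lines of code. This is a sketch: the helper names `warp_bc_weights` and `kept_fraction` are hypothetical, and how the weights enter the behavior-cloning loss (for example, whether they are normalized per batch) is not specified here.

```python
import numpy as np


def warp_bc_weights(v_end, tau=1.0):
    """w = v_end * 1{v_end > tau}, applied elementwise to terminal velocities."""
    v_end = np.asarray(v_end, dtype=float)
    return v_end * (v_end > tau)


def kept_fraction(weights):
    """Fraction of chunks with nonzero weight."""
    return float(np.mean(np.asarray(weights) > 0))


# Terminal-frame velocities for four hypothetical action chunks.
v_end = [0.5, 1.0, 1.2, -0.3]
w = warp_bc_weights(v_end)
```

Only the third chunk survives the gate here, with weight 1.2. A chunk at exactly \(\hat{v}_{\mathrm{end}} = \tau\) is dropped because the inequality is strict.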

Data

Our policy-training data is drawn from a dataset of around 140 hours of successful, unannotated human-teleoperated T-shirt-folding demonstrations. On this task, episode length is a coarse proxy for execution efficiency: longer episodes tend to contain more hesitations, retries, and recoveries. To evaluate robustness as progressively more inefficient behavior is admitted into training, we define three nested, length-filtered tiers: \(\mathcal{D}_{1}\) (≤ 60s): 2,427 episodes (36.1 hours), \(\mathcal{D}_{2}\) (≤ 90s): 4,124 episodes (71.3 hours), and \(\mathcal{D}_{3}\) (≤ 120s): 6,473 episodes (139.7 hours). Policies are trained on the same underlying demonstrations, either with uniform weighting (vanilla BC) or WARP-based progress reweighting. A single WARP-RM model is used across all tiers: it is trained once on a fixed reference subset \(\mathcal{D}_{\mathrm{RM}}\), the shortest demonstrations (≤ 59.8s, 1,950 episodes), providing a clean reference signal for the canonical execution pace (\(\hat{v} = 1\)). For baseline comparisons, SARM requires human annotations, so an annotated supplement \(\mathcal{D}_{A}\) (867 expert demonstrations, 13.9 hours) is added, forming the augmented datasets \(\mathcal{D}_{4} = \mathcal{D}_{1} \cup \mathcal{D}_{A}\) and \(\mathcal{D}_{5} = \mathcal{D}_{2} \cup \mathcal{D}_{A}\); WARP-RM and all other baselines treat \(\mathcal{D}_{A}\) as unannotated.

Dataset / tier	Filter	Episodes	Total hours
\(\mathcal{D}_{1}\)	policy training, ≤ 60s	2,427	36.1
\(\mathcal{D}_{2}\)	policy training, ≤ 90s	4,124	71.3
\(\mathcal{D}_{3}\)	policy training, ≤ 120s	6,473	139.7
\(\mathcal{D}_{A}\)	annotated supplement (SARM)	867	13.9
\(\mathcal{D}_{\mathrm{RM}}\)	WARP-RM reference, ≤ 59.8s	1,950	28.7
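The nested, length-filtered tiers can be built with a simple duration filter. The sketch below uses the cutoffs from the table; the `Episode` record, the function name `build_tiers`, and the four example durations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    episode_id: str
    duration_s: float


# Tier cutoffs in seconds, taken from the table above.
TIER_CUTOFFS_S = {"D1": 60.0, "D2": 90.0, "D3": 120.0}


def build_tiers(episodes, cutoffs=TIER_CUTOFFS_S):
    """Map each tier name to the episodes at or below its length cutoff."""
    return {
        name: [ep for ep in episodes if ep.duration_s <= cutoff]
        for name, cutoff in cutoffs.items()
    }


# Made-up durations, for illustration only.
episodes = [
    Episode("a", 48.2),
    Episode("b", 75.0),
    Episode("c", 118.9),
    Episode("d", 131.4),
]
tiers = build_tiers(episodes)
```

Because every tier applies the same filter with a larger cutoff, each tier contains the previous one, so moving from \(\mathcal{D}_{1}\) to \(\mathcal{D}_{3}\) only ever adds longer episodes.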

Stacked histogram of demonstration episode lengths: main dataset and annotated supplement, with tier cutoffs at 60s, 90s and 120s — **Episode-length distribution of the demonstration dataset.** The main unannotated dataset (\(\mathcal{D}_{1}\)–\(\mathcal{D}_{3}\), blue; 6,473 episodes), with the annotated supplement \(\mathcal{D}_{A}\) (orange; 867 episodes) stacked on top. Dashed lines mark the nested tier cutoffs \(\mathcal{D}_{1}\) (≤ 60s), \(\mathcal{D}_{2}\) (≤ 90s), and \(\mathcal{D}_{3}\) (≤ 120s).

Grid of 30 randomly sampled frames from the T-shirt-folding dataset showing varied garment colors, workspace surfaces, and arm configurations — **Visual diversity of the training dataset.** Randomly sampled frames from \(\mathcal{D}_{3}\) (≤ 120s, 6,473 episodes), spanning varied garment colors, workspace surfaces, and arm configurations.

Real-World Policy Rollouts

Across 380 real-world trials of T-shirt folding from a crumpled start, WARP-BC succeeds more often, and finishes faster when it does, than vanilla BC and the other baselines trained on the same dataset. Each grid below plays all 20 evaluation trials for one tier at once, in a 4×5 layout.

Vanilla BC

20/20 successes, 113.8s mean time-to-completion

WARP-BC

20/20 successes, 63.9s mean time-to-completion

\(\mathcal{D}_{1}\): model trained on demonstrations ≤ 60s: the cleanest, fastest tier.

Videos played at 1× speed.

Time-to-completion distribution for successful trials across D1, D2, D3 — **Time-to-completion distribution for successes.** Across all three training tiers, WARP-BC completes folds faster than vanilla BC. As the training dataset admits more suboptimal demonstrations (\(\mathcal{D}_{1}\rightarrow\mathcal{D}_{3}\)), vanilla BC's success count collapses while WARP-BC stays robust. Solid horizontal bar marks the mean.

Quantitative Results

Cross-tier results on T-shirt folding

All policies are evaluated on 20 trials of T-shirt folding from a crumpled start with a 240s timeout. Mean time-to-completion (TTC) is reported over successful trials only. As the training pool admits more suboptimal demonstrations, vanilla BC degrades sharply while WARP-BC stays robust.

Method	Metric	\(\mathcal{D}_{1}\) (≤60s)	\(\mathcal{D}_{2}\) (≤90s)	\(\mathcal{D}_{3}\) (≤120s)
Vanilla BC	Success ↑	20/20	2/20	0/20
	Mean TTC (s) ↓	113.8	199.0	N/A
	Throughput (/hr) ↑	31.6	1.5	0.0
	Action Chunks Kept	100%	100%	100%
WARP-BC	Success ↑	20/20	19/20	14/20
	Mean TTC (s) ↓	63.9	118.8	117.4
	Throughput (/hr) ↑	56.3	27.4	16.3
	Action Chunks Kept	35.7%	34.4%	22.5%
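The page does not spell out how throughput is computed. The reported T-shirt values are, however, consistent with counting successes per hour of robot time, where each failed trial is charged the full 240 s timeout. The sketch below encodes that reading as an assumption; the function name `throughput_per_hour` is hypothetical.

```python
def throughput_per_hour(successes, trials, mean_ttc_s, timeout_s=240.0):
    """Successful folds per hour, charging each failure the full timeout.

    Assumed formula (not stated in the source):
        3600 * successes / (successes * mean_ttc + failures * timeout)
    """
    if successes == 0:
        return 0.0
    failures = trials - successes
    total_time_s = successes * mean_ttc_s + failures * timeout_s
    return 3600.0 * successes / total_time_s


# (successes, trials, mean TTC in seconds) copied from the table above.
rows = {
    "Vanilla BC, D1": (20, 20, 113.8),
    "Vanilla BC, D2": (2, 20, 199.0),
    "WARP-BC, D1": (20, 20, 63.9),
    "WARP-BC, D2": (19, 20, 118.8),
    "WARP-BC, D3": (14, 20, 117.4),
}
computed = {name: round(throughput_per_hour(*row), 1) for name, row in rows.items()}
```

Under this reading, throughput rewards success rate and speed together: a policy that succeeds rarely pays the timeout on most trials, which is why vanilla BC on \(\mathcal{D}_{2}\) drops to 1.5 per hour even though its two successes took about 199 s each.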

Matched baseline comparisons

Because SARM requires human-annotated subtask boundaries, all methods are evaluated on the augmented corpora \(\mathcal{D}_{4} = \mathcal{D}_{1} \cup \mathcal{D}_{A}\) and \(\mathcal{D}_{5} = \mathcal{D}_{2} \cup \mathcal{D}_{A}\), where \(\mathcal{D}_{A}\) is the annotated supplement (treated as unannotated by every method except SARM). WARP-BC sustains the highest throughput on both tiers and ties or leads on success, all without the human labels SARM needs. SARM and SCIZOR collapse on the noisier \(\mathcal{D}_{5}\), while DemInf stays robust on success but at lower throughput.

Method	Metric	\(\mathcal{D}_{4}\)	\(\mathcal{D}_{5}\)
SARM	Success ↑	19/20	2/20
	Mean TTC (s) ↓	90.5	156.0
	Throughput (/hr) ↑	34.9	1.55
	Action Chunks Kept	78.5%	66.6%
DemInf	Success ↑	19/20	18/20
	Mean TTC (s) ↓	89.6	115.8
	Throughput (/hr) ↑	35.2	25.3
	Action Chunks Kept	45.6%	33.7%
SCIZOR	Success ↑	19/20	2/20
	Mean TTC (s) ↓	98.4	206.2
	Throughput (/hr) ↑	32.4	1.5
	Action Chunks Kept	77.9%	66.7%
WARP-BC	Success ↑	20/20	20/20
	Mean TTC (s) ↓	71.2	80.7
	Throughput (/hr) ↑	50.6	44.6
	Action Chunks Kept	45.6%	33.7%

Time-to-completion distribution for WARP-BC vs SARM, DemInf, and SCIZOR on D4 and D5 — **Time-to-completion vs. curation baselines.** On the augmented D₄/D₅ corpora, WARP-BC completes every trial and stays fast, while SARM and SCIZOR collapse on the noisier D₅ (2/20). Each point is one successful trial; the 240 s timeout is marked.

Another Task: Bottle-in-Bin Placement

Beyond folding, we run the same recipe on a bottle-in-bin placement task with the same bimanual robot. Each trial drops four bottles into a bin under a 90 s timeout (20 trials, 80 bottles per policy), and both policies train on the same demonstrations.

Each grid plays all 20 trials at 1× speed, fully autonomous.

Vanilla BC

59 / 80 bottles · 15.9s per bottle

WARP-BC (Ours)

74 / 80 bottles · 11.3s per bottle

Method	Bottles placed ↑	Time / bottle (s) ↓	Throughput (/hr) ↑	Act. Chunks Kept
Vanilla BC	59 / 80	15.9	147.8	100%
WARP-BC (Ours)	74 / 80	11.3	237.8	30.6%

WARP-BC places more bottles (74/80 vs 59/80), cuts mean per-bottle time from 15.9→11.3 s, and improves throughput 1.6×.

Per-bottle placement-time distribution: Vanilla BC vs WARP-BC — **Per-bottle placement-time distribution.** WARP-BC (blue) places more bottles, and does so faster and more consistently than vanilla BC (gray), which carries a heavy slow tail. Solid bars mark the means (11.3 s vs 15.9 s).

Ablations

All ablations are run on dataset \(\mathcal{D}_{2}\). The Action Chunks Kept column reports the fraction of action chunks that survive the weighting filter.

Variant	Success ↑	Mean TTC (s) ↓	Throughput (/hr) ↑	Action Chunks Kept
Weighting function
\(\tau = 0\)	3/20	201.4	2.3	97.0%
\(\tau = 1\), max = 1 (binary)	16/20	139.6	18.0	34.4%
\(\tau = 1\), continuous — WARP	19/20	118.8	27.4	34.4%
RA-BC aggregation strategy
Mean over 1s action chunk	15/20	127.0	17.4	34.0%
Mean over 1s, one-chunk offset	14/20	124.2	15.9	34.3%
Terminal \(\hat{v}_{end}\) — WARP	19/20	118.8	27.4	34.4%
WARP sampler
IID log-normal	18/20	131.0	22.8	28.7%
AR(1) process — WARP	19/20	118.8	27.4	34.4%
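To make the sampler ablation concrete, the sketch below contrasts the two ways of drawing per-step log playback speeds. With IID draws each step's speed is independent of the last, while the AR(1) process makes consecutive speeds correlated, so playback drifts smoothly. The values of `sigma` and `rho` and the function names are assumptions for illustration.

```python
import numpy as np


def iid_log_speeds(n, sigma, rng):
    """Independent Gaussian log-speeds (speeds are then IID log-normal)."""
    return rng.normal(0.0, sigma, size=n)


def ar1_log_speeds(n, sigma, rho, rng):
    """AR(1) log-speeds with stationary standard deviation sigma."""
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma)
    innovation_std = sigma * np.sqrt(1.0 - rho ** 2)
    for k in range(1, n):
        x[k] = rho * x[k - 1] + rng.normal(0.0, innovation_std)
    return x


def lag1_autocorr(x):
    """Sample correlation between consecutive entries of x."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


rng = np.random.default_rng(0)
iid = iid_log_speeds(5000, sigma=0.3, rng=rng)
ar1 = ar1_log_speeds(5000, sigma=0.3, rho=0.9, rng=rng)

print("lag-1 autocorrelation, IID:  ", round(lag1_autocorr(iid), 3))
print("lag-1 autocorrelation, AR(1):", round(lag1_autocorr(ar1), 3))
```

Both samplers share the same marginal spread of speeds; they differ only in temporal smoothness, which is the property the ablation isolates.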

Project Video

¹ Progress reward models that supervise on this signal include Robometer, ReWiND, and ProgressVLA. Each regresses a frame's normalized position in the trajectory, from the first frame to the last, as its per-frame progress target.

Citation

@article{yu2026warp,
  title={WARP-RM: A Warp-Augmented Relative Progress Reward Model for Data Curation},
  author={Yu, Justin and Goldberg, Andrew and Kondap, Kavish and El-Refai, Karim and Ransing, Ethan and Chen, Qianzhong and Schwager, Mac and Shentu, Fred and Wu, Philipp and Goldberg, Ken},
  journal={arXiv preprint arXiv:2606.28320},
  year={2026}
}