LAFR · A self-adaptive framework where quadrotor policies evolve in real-world flight, growing faster and more agile with every iteration.
Yunfan Ren¹ · Zhiyuan Zhu¹ · Jiaxu Xing¹ · Davide Scaramuzza¹
¹ Robotics and Perception Group · University of Zurich
LAFR is a self-adaptive framework that learns agile quadrotor flight directly in the real world, without precise system identification, without offline Sim2Real transfer, and without conservative safety margins. The system operates as a continuous closed-loop cycle bridging physical execution and differentiable simulation: a learned hybrid dynamics model closes the reality gap; RASH-BPTT (Real-world Anchored Short-horizon Backpropagation Through Time) optimizes the control policy via massively parallel rollouts anchored at the latest real-world state; and Adaptive Temporal Scaling jointly retunes the reference trajectory's time-scale \(\alpha\) using closed-loop sensitivity, maximizing agility while enforcing safety via a barrier function. The base policy evolves from a peak speed of 2.0 m/s to 7.3 m/s within roughly 100 seconds of physical flight time, converging to a 2.34 s figure-8 lap at \(\alpha = 0.28\).
If you find this work useful, please cite:
```bibtex
@inproceedings{ren2026agile,
  title     = {Learning Agile Quadrotor Flight in the Real World},
  author    = {Ren, Yunfan and Zhu, Zhiyuan and Xing, Jiaxu and Scaramuzza, Davide},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026}
}
```
A continuous closed-loop cycle bridging physical execution and differentiable simulation. Hover any module to spotlight it; click to pin the detail panel.
The figure below embeds the official Rerun web viewer (WASM), streaming the training recording from this site. Drag to orbit, scroll to zoom, and use the `sim_time` timeline at the bottom to scrub through the eleven ATS iterations as \(\alpha\) contracts and the lap time falls from 8.34 s to 2.08 s.
The recording above corresponds row-for-row to the eleven ATS iterations below. \(\alpha\) contracts from 1.0 to its 0.25 floor, the lap time falls from 8.34 s to 2.08 s, and the policy's peak speed grows from 3.4 m/s to 10.0 m/s; from the first contraction onward, tracking RMSE stays at or below 0.16 m, well clear of the 0.35 m safety guard.
| Iter | \(\alpha\) | Lap time | Tracking RMSE | Notes |
|---|---|---|---|---|
| 0 | 1.000 | 8.34 s | 0.60 m | Base policy, pre-residual |
| 1 | 0.753 | 6.28 s | 0.19 m | ATS first contraction |
| 2 | 0.753 | 6.28 s | 0.11 m | Residual closes sim-to-real gap |
| 3 | 0.579 | 4.82 s | 0.06 m | |
| 4 | 0.471 | 3.92 s | 0.04 m | Best RMSE (0.042 m) |
| 5 | 0.399 | 3.32 s | 0.06 m | |
| 6 | 0.350 | 2.92 s | 0.07 m | |
| 7 | 0.313 | 2.60 s | 0.08 m | |
| 8 | 0.281 | 2.34 s | 0.09 m | |
| 9 | 0.255 | 2.12 s | 0.16 m | Approaching \(\alpha\) floor |
| 10 | 0.250 | 2.08 s | 0.09 m | Converged at \(\alpha\) floor |
After the residual closes the sim-to-real gap (iter 2 onward), tracking RMSE drops to 0.042 m at iter 4 and stays at or below 0.16 m for the rest of the run, well clear of the 0.35 m safety guard. The pipeline converges to a 2.08 s lap at \(\alpha = 0.25\) (the \(\alpha\) floor) on iteration 10.
ROS-free, JAX-only reference implementation. A single workstation with a modern NVIDIA GPU reproduces the figure-8 lap-time curve end-to-end.
```bash
conda install -n base -c conda-forge mamba
mamba create -n flightning python=3.11 -y
mamba activate flightning
pip install --upgrade "jax[cuda12]"
pip install -e ".[dev]"
```
```bash
python -m flightning.scripts.train \
    --log_dir outputs/tracking

python -m flightning.online_learning.run_pipeline \
    --cfg flightning/cfg/online.yaml
```
We thank the Robotics and Perception Group at the University of Zurich for hardware, lab space, and countless flight sessions. We are grateful to the open-source communities behind JAX, Rerun, and the broader differentiable-simulation ecosystem, on whose tools this work stands.
This research was supported in part by the National Centre of Competence in Research (NCCR) Robotics through the Swiss National Science Foundation (SNSF) and by the European Research Council (ERC) under the European Union's Horizon programme.
Real-world Anchored Short-horizon Backpropagation Through Time (RASH-BPTT).
The control policy \(\pi_\phi\) is an MLP (2×256 hidden units) producing a 4-D CTBR command (collective thrust and body rates). We unroll the learned hybrid dynamics for \(H\) steps and differentiate through the entire trajectory via JAX autodiff + JIT + vmap.
Tightly coupled with Module E: \(\phi\) and \(\alpha\) are updated jointly each cycle.
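To make this concrete, here is a minimal JAX sketch of one batched RASH-BPTT gradient. `policy`, `hybrid_step`, and `step_cost` are illustrative stand-ins (the hybrid model is reduced to a linear placeholder), not the repository's API; shapes and hyperparameters are assumptions.

```python
import jax
import jax.numpy as jnp

def policy(phi, obs):
    # 2x256 MLP with tanh activations producing a 4-D CTBR command.
    h = jnp.tanh(obs @ phi["w1"] + phi["b1"])
    h = jnp.tanh(h @ phi["w2"] + phi["b2"])
    return h @ phi["w3"] + phi["b3"]

def hybrid_step(x, u, theta):
    # Stand-in for the learned hybrid model f_theta
    # (nominal rigid-body dynamics + neural residual).
    return x + 0.01 * (theta["A"] @ x + theta["B"] @ u)

def step_cost(x, ref_k, u):
    # Stand-in tracking cost: state error plus a small action penalty.
    return jnp.sum((x - ref_k) ** 2) + 1e-3 * jnp.sum(u ** 2)

def rollout_loss(phi, theta, x0, ref):
    # Unroll H steps of the hybrid dynamics from the real-world anchor
    # state x0 and accumulate cost; differentiable end-to-end w.r.t. phi.
    def step(x, ref_k):
        u = policy(phi, jnp.concatenate([x, ref_k]))
        x_next = hybrid_step(x, u, theta)
        return x_next, step_cost(x_next, ref_k, u)
    _, costs = jax.lax.scan(step, x0, ref)   # ref has shape (H, state_dim)
    return costs.sum()

# Massively parallel short-horizon rollouts, all anchored at the latest
# real-world state estimate (vmap over a batch of reference segments).
batched_loss = jax.vmap(rollout_loss, in_axes=(None, None, None, 0))
policy_grad = jax.jit(jax.grad(
    lambda phi, theta, x0, refs: batched_loss(phi, theta, x0, refs).mean()))
```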
The quadrotor executes the current policy \(\pi_\phi\) on physical hardware, streaming state-action transitions \((\mathbf{x}_k,\mathbf{u}_k,\mathbf{x}_{k+1})\) into a sliding-window replay buffer \(\mathcal{B}\) for downstream calibration.
The policy observation also includes a short history \(\mathbf{h}_k\) of past actions to encourage control smoothness.
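A minimal sketch of such a sliding-window buffer, assuming flat NumPy storage; the class and field names are hypothetical, and the action history \(\mathbf{h}_k\) is omitted for brevity.

```python
import numpy as np

class SlidingWindowBuffer:
    """Fixed-capacity FIFO buffer of (x_k, u_k, x_{k+1}) transitions streamed
    from the physical rollout; old samples fall out as the window slides."""
    def __init__(self, capacity, x_dim, u_dim):
        self.x  = np.zeros((capacity, x_dim))
        self.u  = np.zeros((capacity, u_dim))
        self.xn = np.zeros((capacity, x_dim))
        self.ptr, self.size, self.capacity = 0, 0, capacity

    def push(self, x, u, x_next):
        i = self.ptr
        self.x[i], self.u[i], self.xn[i] = x, u, x_next
        self.ptr = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def window(self):
        # Valid slice used for online calibration of the residual model.
        return self.x[:self.size], self.u[:self.size], self.xn[:self.size]
```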
A neural residual augments the nominal rigid-body dynamics to close the reality gap (unmodeled aerodynamics, motor delays, payload variations). It is trained online by minimizing the one-step prediction error through a differentiable RK4 integrator on \(SO(3)\).
The prediction loss \(\mathcal{D}\) combines the Euclidean translation error with the squared geodesic rotation error \(\|\mathrm{Log}(\hat{\mathbf{R}}^\top\mathbf{R})^\vee\|_2^2\) on \(SO(3)\).
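A JAX sketch of this composite distance, assuming the state carries a position vector and a rotation matrix; the relative weight `w_rot` is an assumed hyperparameter. During online calibration, \(\mathcal{D}\) would be evaluated between the RK4-integrated one-step prediction and the measured next state.

```python
import jax.numpy as jnp

def so3_log(R):
    # Log map on SO(3): rotation matrix -> axis-angle vector Log(R)^vee.
    cos_theta = jnp.clip((jnp.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = jnp.arccos(cos_theta)
    w = jnp.array([R[2, 1] - R[1, 2],
                   R[0, 2] - R[2, 0],
                   R[1, 0] - R[0, 1]])
    # theta / (2 sin theta) -> 1/2 as theta -> 0; guard the singularity.
    safe_sin = jnp.where(theta < 1e-6, 1.0, jnp.sin(theta))
    scale = jnp.where(theta < 1e-6, 0.5, theta / (2.0 * safe_sin))
    return scale * w

def state_distance(p_hat, R_hat, p, R, w_rot=1.0):
    # D = ||p_hat - p||^2 + w_rot * ||Log(R_hat^T R)^vee||^2:
    # Euclidean translation error plus squared geodesic rotation error.
    return jnp.sum((p_hat - p) ** 2) + w_rot * jnp.sum(so3_log(R_hat.T @ R) ** 2)
```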
Instead of random simulator resets, every short-horizon rollout starts from the instantaneous physical state estimate. This anchors policy gradients in reality and avoids the model exploitation failure mode of long-horizon BPTT.
Compounding prediction errors of the learned residual grow with horizon length. A short \(H\) keeps the gradient \(\nabla_\phi\mathcal{J}\) faithful to the current dynamic regime, so no rollout drifts far from a state the model has actually seen.
ATS optimizes the time-scale \(\alpha\) by counterfactual reasoning anchored at the actual real-world rollout \(\{(\bar{\mathbf{x}}_k,\bar{\mathbf{u}}_k)\}\). The differentiable hybrid model serves as a proxy to construct a counterfactual state sequence \(\hat{\mathbf{x}}_k(\alpha)\) (what the trajectory would have been at a different time-scale), and gradients flow through this proxy with no physical perturbation. The result is temporal elasticity: ATS compresses time (smaller \(\alpha\)) when tracking is precise and relaxes it when disturbances or model mismatch grow.
Here \(\Psi(z)=\tfrac{1}{\kappa}\ln(1+e^{\kappa z})\) is a softplus barrier and \(\mathcal{E}_k\) is the tracking error of the counterfactual rollout \(\hat{\mathbf{x}}_k(\alpha)\).
\(\nabla_\alpha\mathcal{J}_{\text{ATS}}\) is propagated along the real rollout via a one-step linearization of the hybrid model \(f_\theta\). The state sensitivity \(\mathbf{S}_k \triangleq \mathrm{d}\hat{\mathbf{x}}_k/\mathrm{d}\alpha\), with \(\mathbf{S}_0 = \mathbf{0}\), evolves recursively:
\[
\mathbf{S}_{k+1} = \mathbf{A}_k\,\mathbf{S}_k + \mathbf{B}_k\,\frac{\mathrm{d}\mathbf{u}_k}{\mathrm{d}\alpha},
\]
with the Jacobians frozen at the measured real-world rollout \((\bar{\mathbf{x}}_k,\bar{\mathbf{u}}_k)\):
\[
\mathbf{A}_k = \left.\frac{\partial f_\theta}{\partial \mathbf{x}}\right|_{(\bar{\mathbf{x}}_k,\,\bar{\mathbf{u}}_k)}, \qquad
\mathbf{B}_k = \left.\frac{\partial f_\theta}{\partial \mathbf{u}}\right|_{(\bar{\mathbf{x}}_k,\,\bar{\mathbf{u}}_k)}.
\]
The action sensitivity is obtained by differentiating the policy through its observation \(\mathbf{o}_k(\hat{\mathbf{x}}_k, \mathbf{x}_{\mathrm{ref},k}(\alpha))\):
\[
\frac{\mathrm{d}\mathbf{u}_k}{\mathrm{d}\alpha} = \frac{\partial \pi_\phi}{\partial \mathbf{o}_k}\!\left(\frac{\partial \mathbf{o}_k}{\partial \hat{\mathbf{x}}_k}\,\mathbf{S}_k + \frac{\partial \mathbf{o}_k}{\partial \mathbf{x}_{\mathrm{ref},k}}\,\frac{\mathrm{d}\mathbf{x}_{\mathrm{ref},k}(\alpha)}{\mathrm{d}\alpha}\right).
\]
Chaining \(\mathbf{S}_k\) into the per-step error gradient \(\partial \mathcal{E}_k/\partial \hat{\mathbf{x}}_k\) closes the loop: \(\alpha\) is then updated by a single projected gradient step using the analytical \(\nabla_\alpha\mathcal{J}_{\text{ATS}}\), with no finite-difference probing of the physical system.
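Concretely, the recursion and the projected update fit in a few lines of JAX. The sketch below reuses the stand-in `policy` and `hybrid_step` from the RASH-BPTT example; the objective form \(\mathcal{J}_{\text{ATS}} = \alpha + \lambda\sum_k\Psi(\mathcal{E}_k - \varepsilon)\), the position-error definition of \(\mathcal{E}_k\), and all hyperparameters are assumptions for illustration, with \(\varepsilon\) set to the 0.35 m guard.

```python
import jax
import jax.numpy as jnp

def ats_alpha_step(alpha, xs_bar, us_bar, theta, phi, ref_fn,
                   eps=0.35, kappa=10.0, lam=1.0, lr=1e-2, alpha_floor=0.25):
    # One projected gradient step on the time-scale alpha. Sensitivities
    # S_k = d x_hat_k / d alpha follow S_{k+1} = A_k S_k + B_k du_k/dalpha,
    # with all Jacobians frozen at the measured rollout (xs_bar, us_bar).
    n = xs_bar.shape[1]
    S = jnp.zeros(n)                 # S_0 = 0: the rollout is anchored in reality
    g = 0.0                          # accumulates the barrier term's alpha-gradient
    barrier = lambda z: jax.nn.softplus(kappa * z) / kappa   # Psi(z)
    for k in range(us_bar.shape[0]):
        x_bar, u_bar, ref_k = xs_bar[k], us_bar[k], ref_fn(k, alpha)
        # Frozen Jacobians A_k, B_k of the hybrid model at the real transition.
        A = jax.jacfwd(lambda x: hybrid_step(x, u_bar, theta))(x_bar)
        B = jax.jacfwd(lambda u: hybrid_step(x_bar, u, theta))(u_bar)
        # du_k/dalpha: the policy differentiated through obs = (x_hat, x_ref(alpha)).
        J = jax.jacfwd(lambda o: policy(phi, o))(jnp.concatenate([x_bar, ref_k]))
        dref = jax.jacfwd(lambda a: ref_fn(k, a))(alpha)
        du = J[:, :n] @ S + J[:, n:] @ dref
        # Chain S_k into the per-step barrier on the tracking error E_k
        # (position assumed to be the first three state components).
        dE = jax.grad(lambda x, r: barrier(jnp.linalg.norm(x[:3] - r[:3]) - eps),
                      argnums=(0, 1))(x_bar, ref_k)
        g = g + dE[0] @ S + dE[1] @ dref
        S = A @ S + B @ du           # sensitivity recursion
    # Assumed objective J_ATS = alpha + lam * sum_k Psi(E_k - eps):
    # smaller alpha means a faster lap; the barrier guards tracking error.
    return jnp.clip(alpha - lr * (1.0 + lam * g), alpha_floor, 1.0)
```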
The five modules form a continuous closed-loop cycle bridging the physical robot and the differentiable simulator. Each pass refines both the dynamics model \(\theta\) and the policy \(\phi\), while ATS retunes the time-scale \(\alpha\).
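Read as code, one outer iteration of this cycle might look like the structural sketch below. `fly_and_record`, `fit_residual`, and `rash_bptt_update` are hypothetical placeholders standing in for the execution, calibration, and policy-update modules; `ats_alpha_step` is the ATS sketch above.

```python
def pipeline_iteration(phi, theta, alpha, buffer, ref_fn):
    # (A) Execute pi_phi on hardware; stream (x_k, u_k, x_{k+1}) into B.
    rollout = fly_and_record(phi, alpha, buffer)          # hypothetical helper
    # (B) Online calibration: refit the neural residual on the sliding window.
    theta = fit_residual(theta, buffer.window())          # hypothetical helper
    # (C, D) RASH-BPTT: short-horizon rollouts anchored at the latest
    # real-world state estimate, policy updated from the batched gradient.
    phi = rash_bptt_update(phi, theta, rollout.xs[-1], ref_fn, alpha)  # hypothetical
    # (E) ATS: one projected gradient step on the time-scale alpha,
    # computed from the same measured rollout (no extra flights).
    alpha = ats_alpha_step(alpha, rollout.xs[:-1], rollout.us, theta, phi, ref_fn)
    return phi, theta, alpha
```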
Base policy evolves from 2.0 m/s to 7.3 m/s peak speed within roughly 100 s of physical flight. No offline system ID, no conservative safety margins.