Research theme
Motor Control & Embodied RL
Hierarchical world models, cerebellar-inspired controllers, and perturbation studies that push adaptive behaviour in robots and virtual agents.
Animals move with stubborn grace. Bodies grow and age, muscles tire, surfaces change, yet locomotion and reaching stay smooth. Our best reinforcement-learning agents are nowhere near as robust: small changes in dynamics can destroy carefully trained policies.
There is a second puzzle: the cerebellum. This small structure contains more than half of the brain's neurons, yet lesions leave strength and reflexes intact while movements become jerky and poorly timed. Motor cortex and the basal ganglia can decide what to do, but without the cerebellum execution falls apart.
This project develops a single framework that speaks to both problems at once: in machine learning and robotics it separates learning what to do from keeping that behaviour under control, and in neuroscience it casts the cerebellum as a universal adaptive controller operating in the space defined by cortical world models.
The missing separation: learning vs control
Modern model-based RL can learn complex skills, but the resulting policies are fragile: small actuator or friction changes can cause failure, fast online adaptation demands heavyweight meta-learning, and it is difficult to guarantee how performance degrades under perturbations.
Classical adaptive control brings the opposite trade-off. It can stabilise systems, track reference trajectories, and offer Lyapunov-style guarantees, but only when given structured models and explicit targets; it does not explain how to acquire rich behaviours from reward in high-dimensional bodies.
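For concreteness, the block below shows a textbook scalar model-reference adaptive control (MRAC) law of the kind this tradition provides; the notation is standard and illustrative only, not specific to this project.

```latex
% Textbook scalar MRAC (illustration of the classical side, not this project's method):
%   plant:      \dot{x}   = a x + b u           (a, b unknown, sign of b known)
%   reference:  \dot{x}_m = a_m x_m + b_m r     (a_m < 0 chosen by the designer)
\begin{align}
  u &= \hat{\theta}_x\, x + \hat{\theta}_r\, r,
  & e &= x - x_m, \\
  \dot{\hat{\theta}}_x &= -\gamma\, e\, x\, \operatorname{sign}(b),
  & \dot{\hat{\theta}}_r &= -\gamma\, e\, r\, \operatorname{sign}(b).
\end{align}
% With V = \tfrac{1}{2} e^2 + \tfrac{|b|}{2\gamma}\bigl(\tilde{\theta}_x^2 + \tilde{\theta}_r^2\bigr)
% one gets \dot{V} = a_m e^2 \le 0, so the tracking error converges to zero:
% a Lyapunov-style guarantee, but only because the model structure and the
% reference trajectory are given in advance.
```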
There is still no standard way to combine these views. Our core idea is to treat them as genuinely different problems: reinforcement learning chooses behaviours over long horizons to maximise reward, while control keeps those behaviours stable and precise in the face of noise, model error, and drift.
The same split appears in the brain—cortico–basal ganglia loops for choice and value, cerebellum for fast predictive control—and our framework makes that parallel precise.
World models as reference trajectories
In the Reflexive World Models framework, a learned world model plays a double role. As in standard model-based RL it evaluates and improves policies, but it also serves as a source of reference trajectories in latent space.
The picture runs as follows. A base RL agent learns a policy and latent world model from experience. Policy and model together induce an “intended” trajectory through latent space—how the agent expects the state to evolve when things go well. A fast controller then acts to keep the real trajectory close to this intended path, correcting for mismatches between prediction and observation.
Analytically, expanding the value function around an optimal trajectory reveals a slow term that scores trajectories according to long-term reward (the RL problem) and a fast term that penalises deviations from those trajectories (the control problem).
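In symbols, one illustrative way to write this (notation chosen here for exposition; the project's exact derivation may differ) is a second-order expansion of the value around the intended latent trajectory:

```latex
% Schematic expansion around the intended latent trajectory z_t^* (illustration only):
\begin{align}
  V(z_t) \;\approx\;
    \underbrace{V(z^{*}_t)}_{\text{slow: value of the intended path}}
    \;+\; \nabla V(z^{*}_t)^{\top}\delta_t
    \;+\; \tfrac{1}{2}\,\delta_t^{\top}\,\nabla^{2}V(z^{*}_t)\,\delta_t,
  \qquad \delta_t = z_t - z^{*}_t .
\end{align}
% Selecting a high-value reference path z^* is the slow RL problem; with the
% reference fixed, the remaining terms depend only on the deviation \delta_t,
% so keeping \delta_t small is the fast tracking problem handed to the controller.
```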
This yields a concrete architecture:
- the RL module proposes actions from its world model and value estimates;
- the world model generates short-horizon predictions in latent space to define the reference path;
- a reflexive controller adjusts actions in real time so that observations follow that path;
- theoretical bounds connect tracking error to loss of value.
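The sketch below puts these pieces in one loop, using linear stand-ins for the learned components; all names, dimensions, and gains are hypothetical choices for illustration, not the project's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM = 8, 3

# Linear stand-ins for learned components (hypothetical, for illustration only).
W_pi = rng.normal(scale=0.1, size=(ACTION_DIM, LATENT_DIM))  # "policy"
A = np.eye(LATENT_DIM) * 0.95                                # latent dynamics
B = rng.normal(scale=0.3, size=(LATENT_DIM, ACTION_DIM))     # action effect

def policy(z):
    """Slow RL module: proposes an action given a latent state."""
    return W_pi @ z

def world_model(z, a):
    """Latent world model: one-step prediction, used as the reference path."""
    return A @ z + B @ a

def reflexive_correction(z_obs, z_ref, gain=0.5):
    """Fast controller: proportional correction in latent space, mapped to
    actions through the pseudo-inverse of the model's input matrix."""
    return gain * np.linalg.pinv(B) @ (z_ref - z_obs)

def real_dynamics(z, a):
    """'Real' environment: same form as the model but perturbed (weaker actuators, noise)."""
    return A @ z + 0.7 * B @ a + 0.02 * rng.normal(size=LATENT_DIM)

z_obs = rng.normal(size=LATENT_DIM)  # encoded observation
z_ref = z_obs.copy()                 # reference starts at the current state

for t in range(50):
    a_rl = policy(z_ref)                           # intended action along the reference
    a = a_rl + reflexive_correction(z_obs, z_ref)  # correct toward the reference path
    z_ref = world_model(z_ref, a_rl)               # intended next latent state
    z_obs = real_dynamics(z_obs, a)                # actual (perturbed) next state

print("final tracking error:", np.linalg.norm(z_obs - z_ref))
```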
In practice, agents adapt to changes in dynamics within a few trials, maintain coordination in high-dimensional bodies, and come with explicit bounds linking tracking error to performance.
Cerebellum as adaptive controller
Classic cerebellar theories fall into three families: supervised-learning devices (Marr–Albus), internal-model engines (Kawato, Wolpert), and predictors that extend beyond movement. They agree that the cerebellum learns from rich teaching signals and supports fast predictive control, but they differ on what is represented and optimised.
Reflexive World Models provide an integrated answer. Cortico–basal ganglia loops learn policies and world models from reward, defining what to do. The cerebellum receives a compact latent description of state and intended change. Climbing fibres provide high-dimensional, temporally precise error signals within that latent space, and cerebellar microcircuits implement a fast, largely feedforward controller that keeps the real trajectory aligned with the reference path.
Mathematically, the slow value term lives in cortex and basal ganglia, while the fast tracking term becomes an adaptive controller in the cerebellum. The cerebellum is neither a policy nor a passive predictor; it is a dedicated module for rapid, error-based control in the space defined by learned world models. This picture recovers internal-model ideas, matches evidence about climbing fibre signals, and gives a principled division of labour between the cerebellum and basal ganglia.
Separating reinforcement learning from control
Keeping reinforcement learning and control distinct clarifies why "RL should do everything" fails under nonstationarity, and it provides a clean place to insert classical control ideas: tracking in latent space rather than joint coordinates. In agents, value gradients shape policies and models while dedicated controllers stabilise execution. In the brain, basal ganglia and motor cortex handle slow value-based choice while the cerebellum handles rapid adjustments. This framing also suggests new interpretations of motor adaptation experiments, in which fast cerebellum-dependent corrections sit atop slower, value-based learning.
Robotics and sim-to-real
The same ingredients that keep biological motor control robust—world models plus reflexive controllers—are what sim-to-real pipelines lack. Reflexive World Models maintain locomotion as actuator gains drift, loads change, or contact varies; recover quickly from unexpected forces, sensor noise, or partial damage; and avoid full retraining by relying on lightweight latent-space corrections. On Walker2D and Humanoid they correct sudden and continuous perturbations, keep multi-joint gaits coordinated under nonstationary dynamics, and outperform domain-randomised baselines. These are early steps toward control stacks where learned policies provide flexible skills and cerebellum-inspired modules keep them reliable as hardware and environments change.
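As an example of the kind of continuous perturbation involved, here is a small Gymnasium wrapper that lets per-actuator gains drift during an episode; the class name and parameter values are ours, chosen for illustration, not the project's benchmark code.

```python
import numpy as np
import gymnasium as gym


class ActuatorGainDrift(gym.ActionWrapper):
    """Scale each actuator command by a slowly drifting gain (random walk)."""

    def __init__(self, env, drift_std=0.005, gain_range=(0.5, 1.5)):
        super().__init__(env)
        self.drift_std = drift_std
        self.gain_range = gain_range
        self.gains = np.ones(env.action_space.shape)

    def action(self, action):
        # Random-walk drift of the per-actuator gains, clipped to a plausible range.
        self.gains = np.clip(
            self.gains + self.drift_std * np.random.randn(*self.gains.shape),
            *self.gain_range,
        )
        return self.gains * action


# Usage: wrap a MuJoCo locomotion task and roll out any policy against drifting gains.
env = ActuatorGainDrift(gym.make("Walker2d-v4"))
obs, info = env.reset(seed=0)
for _ in range(1000):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
```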
Research directions
- Richer perturbations. Move beyond gain changes to handle morphology shifts, contact variation, and sensor corruption in a unified way.
- Representation learning for control. Design world models and latent spaces with adaptive control in mind, not just planning.
- Cerebellar models beyond motor cortex. Apply the same architecture to timing, working memory, and other cerebellar-involved domains.
- Simulated ataxia studies. Explicitly model ataxic signatures observed in patient data so we can test whether the framework reproduces cerebellar deficits under simulated lesions and explore interventions inside the latent-controller setup.