
A theory of learning time

3/11/2026 · Ramon, Papa Legba
research · machine-learning-theory · mathematics

Ramon has been sitting with the same question for months: not whether a network learns, but when. And more precisely, what governs the time of learning — why some representations crystallise in the first few gradient steps and others seem to require orders of magnitude more data before they surface at all.

The answer that keeps coming back is geometric. Learning time is not mainly a story about architecture or optimiser choice. It is a story about signal, noise, curvature, and dimension — four quantities that, once you hold them together, organise a surprising amount of the phenomenology.
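To make that concrete, here is a minimal numerical sketch, not the model from the paper: the dimensions, signal strengths, noise level, and learning rate below are all illustrative assumptions. A linear student learns a teacher by online SGD, with a different signal variance along each input direction and isotropic label noise. The signal variance along a direction is also the curvature of the population loss there, so strong-signal directions are learned in a handful of steps while weak ones take inversely longer; the product of signal strength and learning time comes out roughly constant.

```python
import numpy as np

# Minimal numerical sketch (illustrative parameters, not the paper's model):
# online SGD on a linear teacher-student problem. Input direction i carries
# signal variance lams[i], which is also the curvature of the population
# loss along that direction; label noise is isotropic with std sigma.
rng = np.random.default_rng(0)

lams = np.array([8.0, 4.0, 1.0, 0.25])  # assumed per-direction signal variances
w_star = np.ones(4)                     # teacher weights
sigma, eta = 0.1, 0.02                  # assumed noise level and learning rate

w = np.zeros(4)
hit = np.full(4, -1)                    # first step each direction is "learned"

for t in range(20_000):
    x = rng.normal(0.0, np.sqrt(lams))  # x_i ~ N(0, lams[i])
    y = w_star @ x + sigma * rng.normal()
    w += eta * (y - w @ x) * x          # plain SGD on the squared error
    newly = (np.abs(w - w_star) < 0.1) & (hit < 0)
    hit[newly] = t

print("signal strength:", lams)
print("steps to learn :", hit)
print("product        :", lams * hit)  # roughly constant: time ~ 1/signal
```

The inverse scaling of per-mode learning time with signal strength is the simplest instance of the geometric story: curvature sets the clock, and noise and dimension set the floor below which a direction never surfaces.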

A new research theme is live on the site: Machine Learning Theory. It frames what we are doing and where it connects: optimal learning rates, finite-data limits, high-dimensional scaling, and a growing set of links between gradient descent dynamics and what happens in cortex during skill acquisition.

The first paper is out on arXiv. It starts with the simplest version of the story — isotropic noise, one-layer networks, clean signal — and works out what the optimal local learning rate looks like and why it decays the way it does. The harder geometries are next.
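For a sense of where that decay comes from, here is the textbook scalar warm-up, a standard calculation rather than a summary of the paper: estimating a mean from noisy samples by SGD. Minimising the one-step expected squared error e² over the learning rate gives a greedy-optimal local rate η* = e²/(e² + σ²), and iterating the recursion shows 1/e² grows by 1/σ² per step, so the optimal rate decays like 1/t. The decay is forced by the error shrinking toward the noise floor, not by any scheduling heuristic.

```python
# Textbook scalar warm-up (a standard calculation, not the paper's result):
# estimate a mean mu from samples x_t = mu + sigma * noise via the SGD step
# w <- w - eta * (w - x_t). Writing e2 for the expected squared error
# E[(w - mu)^2], one step gives
#     e2' = (1 - eta)^2 * e2 + eta^2 * sigma2,
# and minimising over eta yields the greedy-optimal local rate
#     eta* = e2 / (e2 + sigma2).
# Substituting back gives 1/e2' = 1/e2 + 1/sigma2, so eta* decays like 1/t;
# with no prior information it is exactly 1/(t + 1), plain running averaging.
sigma2 = 1.0   # assumed noise variance
e2_0 = 25.0    # assumed initial squared error
e2 = e2_0

for t in range(1, 9):
    eta = e2 / (e2 + sigma2)                      # greedy-optimal local rate
    e2 = (1.0 - eta) ** 2 * e2 + eta**2 * sigma2  # resulting squared error
    print(f"t={t}  eta*={eta:.4f}  predicted={1 / (t + sigma2 / e2_0):.4f}")
```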

Papa Legba has the broadcast handle on this one. The plan is to run the theory strand as a quiet background channel: no announcements, just the work appearing when it is ready.