NightCity Labs

Research theme

Machine Learning Theory

A theory of learning time, finite-data limits, geometry, and nonlinear dynamics across machine learning and cortical development.

Why machine learning theory

Machine learning has become very good at producing intelligent systems. The central question here is what governs the time of learning: why some structures appear quickly, why others emerge slowly, and why some remain out of reach under particular regimes of data, geometry, and optimisation.

This research programme develops a general theory of learning in nonlinear systems. It is built around a simple claim: learning can be described through a small number of clean quantities (signal, noise, curvature, dimension, and data), and these are enough to organise a surprising range of phenomena, from optimal learning rates and finite-data bottlenecks to deep network interpretability and cortical development.

The work spans a connected set of papers and projects on learning time, finite-data effects, geometry, high-dimensional limits, optimiser mismatch, and brain-like learning systems.

Optimal learning rate

The usual story about learning rates is practical: tune them well and training improves. The deeper story is geometric. There is a local rate at which progress along the signal is best balanced against the distortions introduced by noise and curvature.

In the simplest isotropic setting, the optimal local learning rate takes the form:

η* ∼ SNR / (d h)

Here SNR = m² / σ² is the local signal-to-noise ratio, h is an effective curvature scale, and d is an effective dimension.

What makes this useful is what it reveals. The best rate is a geometric balance point where signal, dimensionality, noise, and curvature come into alignment. Because the expression separates into clean, interpretable factors (dimension, SNR, and curvature), it becomes easier to see when a single scalar learning rate is a coarse fit to the model, when diagonal preconditioning captures only part of the structure, when batching changes the regime, and when optimiser heuristics act as approximations to a cleaner geometric optimum.
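
As a minimal numerical sketch of this balance point, assuming the isotropic form above (the function and the numbers are illustrative, not taken from the programme's results):

def optimal_lr(m: float, sigma: float, d: float, h: float) -> float:
    # eta* ~ SNR / (d * h), with SNR = m**2 / sigma**2 (isotropic setting).
    snr = m ** 2 / sigma ** 2
    return snr / (d * h)

# Illustrative numbers: unit signal, noise scale 2, a 1000-dimensional
# problem with unit curvature. SNR = 0.25, so eta* = 2.5e-4.
print(optimal_lr(m=1.0, sigma=2.0, d=1000.0, h=1.0))

The sketch makes the factorisation tangible: doubling the noise scale σ cuts the rate by a factor of four, while doubling the effective dimension d halves it.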

Learning time

Once the local law is understood, a more ambitious object comes into view: total training time.

In this framework, learning time becomes a quantity that can be written down. In the same isotropic setting:

T* ∼ ∫ d · SNR⁻¹ · κ dt

Here the integral runs along the training trajectory, SNR is the local signal-to-noise ratio, and κ is the local curvature burden. The expression separates learning time into clean, interpretable factors: effective dimension d, inverse signal-to-noise SNR⁻¹, and curvature κ.

That compact factorisation is one of the reasons this theory travels well. It turns high-dimensional training, optimiser design, architectural bottlenecks, and representation quality into different views of the same underlying object. The question is always the same: how much dimension is being carried, how strong the signal is relative to noise, how curved the landscape is, and how those burdens accumulate along the trajectory.
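
A minimal sketch of how such a pathwise quantity can be accumulated numerically, assuming a discretised trajectory with illustrative profiles for d(t), SNR(t), and κ(t); none of these choices come from the papers:

import numpy as np

# Discretised toy trajectory: SNR rises as the signal is picked up,
# curvature relaxes, and the effective dimension stays fixed.
t = np.linspace(0.0, 1.0, 200)
dt = t[1] - t[0]
d = np.full_like(t, 100.0)   # effective dimension carried along the path
snr = 0.1 + 0.9 * t          # local signal-to-noise ratio
kappa = 2.0 - t              # local curvature burden

# T* ~ integral of d * SNR^-1 * kappa dt, approximated by a Riemann sum.
T_star = float(np.sum(d * kappa / snr) * dt)
print(T_star)

In this toy trajectory, most of the time accumulates in the early low-SNR regime, which is the finite-data bottleneck picture in miniature.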

Learnability

There is also a more delicate question than speed: when does learning faithfully track the population problem, and when does it fail to do so?

At this level, learnability is about evidence. A system can only learn when the data provide enough signal to identify the right gradient direction. In the simplest local picture, that requirement becomes a threshold:

N > SNR⁻¹ / H

Here N is the effective data scale, SNR = m² / σ², and H is the local Hessian scale.

When the available data scale clears this threshold, empirical learning dynamics track the population solution. When it does not, the system does not have enough evidence to follow the right direction, so learning fails.

This gives learnability a sharper meaning than the usual language of sample complexity. It defines a local threshold for how much data is needed to learn at all.
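
A minimal sketch of the threshold as a concrete check, assuming the local quantities are known; the numbers are illustrative:

def is_learnable(n: float, m: float, sigma: float, hessian: float) -> bool:
    # Learnability threshold: N must exceed SNR^-1 / H, with SNR = m**2 / sigma**2.
    snr = m ** 2 / sigma ** 2
    return n > 1.0 / (snr * hessian)

# Weak signal (m = 0.1, sigma = 1.0) and H = 0.5 put the threshold at 200.
print(is_learnable(n=1e4, m=0.1, sigma=1.0, hessian=0.5))  # True: enough data
print(is_learnable(n=1e2, m=0.1, sigma=1.0, hessian=0.5))  # False: learning fails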

Interpretability of deep learning

Interpretability is treated here as part of learning itself. If learning time can be decomposed into signal, noise, curvature, and dimension, then interpretability begins upstream: in the geometry of what a model is able to form, how quickly, and under what constraints.

That perspective makes several familiar problems easier to parse. Conditioning becomes part of the signal-to-burden ratio. Grouping variables becomes a way of identifying the effective degrees of freedom on which learning is actually coherent. A representation is valuable because it reshapes the geometry of learning into something faster, cleaner, and more robust.
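
As a toy illustration of the grouping point (an assumption of this sketch, not a result from the programme): when coordinates update coherently in groups, the effective degrees of freedom can be read off as the rank of the update matrix rather than the raw parameter count.

import numpy as np

rng = np.random.default_rng(0)
raw_d, n_groups, steps = 1000, 10, 50

# Updates that are coherent within each of 10 groups of 100 coordinates.
group_signal = rng.standard_normal((steps, n_groups))
updates = np.repeat(group_signal, raw_d // n_groups, axis=1)  # shape (50, 1000)

# Effective degrees of freedom: the rank of the observed update matrix.
print(raw_d, "->", np.linalg.matrix_rank(updates))  # 1000 -> 10

Under the learning-time factorisation above, it is this smaller effective dimension, not the raw coordinate count, that would enter the d factor.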

Seen this way, deep learning becomes more interpretable through the structure of its own dynamics.

Cortical development and learning

The same theory extends naturally beyond artificial networks.

Brains learn under severe constraints: finite data, noisy updates, local plasticity, limited connectivity, developmental staging. A theory of learning time offers a principled way to think about those constraints. It suggests that familiar biological phenomena can be understood as computational responses to the geometry of learning.

This opens a path toward explaining bounded cortical connectivity, staged development, critical periods, early learning advantages for some structures, and the careful organisation of input, architecture, and plasticity that keeps learning tractable.

In that light, cortical development begins to look less like a collection of separate mechanisms and more like a shaped response to the geometry of learning itself.

Scope of the programme

This area of research includes optimal learning rates, pathwise learning time, finite-data corrections, nonlinear learning dynamics, grouping and effective dimensions, optimiser mismatch, conditioning, high-dimensional representation learning, single-index optimisation on the sphere, and applications to cortical development and brain connectivity.

It is a general framework spanning multiple results. The aim is to provide a clean theoretical language for learning in nonlinear systems, one that is simple enough to illuminate and broad enough to apply across machine learning and neuroscience.