NCL NightCity Labs

Research theme

Uncertainty in Deep Learning

Calibrated deep learning through adaptive regularization, online resampling, and geometry-aware Bayesian posteriors.

Why uncertainty matters

Uncertainty is the quantitative side of the question “how much should we trust this prediction?” This line of work asks how to represent and control uncertainty in deep neural networks so that large models generalize rather than memorize, and so that they can recognise when they are out of their depth instead of confidently hallucinating.

The core problem in machine learning is generalization: how a model behaves on data it has not seen. Overfitting, model size, robustness, and safety are all facets of the same issue. A system that is frequently wrong yet confident is far more dangerous than one that can say “I am unsure here”.

Standard deep networks are trained to fit labels, not to express uncertainty. Their probability outputs are frequently miscalibrated: they may assign very high confidence to errors, especially under distribution shift or when data are scarce. In scientific and medical settings, or in any autonomous system that chooses experiments or actions, this gap between numerical confidence and actual reliability is a central failure mode.

The same problem appears in modern generative models. Large language models are now used to answer questions, write code, summarise documents, and act as interfaces to tools. They can produce long, fluent answers in regimes where the training data provide very little guidance. When that happens, they do not hesitate or signal doubt; they simply continue the pattern. These hallucinations are not edge cases. They show up as fabricated references, non-existent APIs, plausible but wrong explanations, and imaginary experimental results.

In settings where people or downstream systems may act on these outputs, hallucinations are a central reliability problem. Without a way to tell when the model is outside the support of its data, there is no principled distinction between a statement that reflects many consistent training signals and a statement that is essentially a guess.

Uncertainty is also the main handle on how much effective capacity a model should use. Overparameterised networks can represent many functions that all fit the training data. Good uncertainty estimates capture the fact that the data only pin down some directions in function space, and that others remain weakly constrained. This is what lets a model remain flexible without collapsing into memorization.

Why this is hard for deep networks

Several features of modern deep learning make principled uncertainty difficult:

  • High capacity. Large networks can interpolate the training data in many qualitatively different ways. A single trained model gives no direct indication of how many alternatives exist or how much predictions would change if we had seen a different dataset.
  • Miscalibrated probabilities. Softmax outputs sum to one, but nothing forces them to be calibrated; they only become well aligned with actual error rates under restrictive conditions. In realistic settings they often remain high even when predictions systematically fail.
  • Post-hoc fixes. Common tools such as ensembles, dropout tricks, or temperature scaling are usually applied after training (a minimal temperature-scaling sketch follows this list). They can help with calibration but do not change what the model learns, and they are often too expensive to be used routinely in large systems.
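
As a concrete example of such a post-hoc fix, the sketch below shows one standard form of temperature scaling: a single scalar T is fitted on held-out logits to minimise negative log-likelihood. The function and variable names are illustrative; the point is that this recalibrates confidences without changing which class the model predicts and without changing anything the model has learned.

    import torch
    import torch.nn.functional as F

    def fit_temperature(logits, labels, steps=200, lr=0.01):
        # logits: [N, C] held-out logits; labels: [N] integer class labels.
        logits = logits.detach()                      # calibration only, no model updates
        log_t = torch.zeros(1, requires_grad=True)    # T = exp(log_t) stays positive
        opt = torch.optim.Adam([log_t], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            F.cross_entropy(logits / log_t.exp(), labels).backward()
            opt.step()
        return log_t.exp().item()

    # Usage: probs = (logits / T).softmax(-1). The argmax, and hence the accuracy,
    # is unchanged; only the confidence attached to each prediction moves.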

The work here starts from a stronger requirement: uncertainty should be part of learning itself. The same mechanisms that control overfitting and effective model size should also produce uncertainty estimates that downstream agents can use.

Research directions

We develop three complementary approaches:

  1. Cross-regularization — large models that use held-out data during training to decide how much capacity they are allowed to use, effectively turning cross-validation into a signal inside the optimiser.
  2. Twin-Boot optimisation — bringing resampling ideas from the bootstrap into training, so that disagreement between models trained on slightly different data becomes a live signal of epistemic uncertainty.
  3. Precise Bayesian neural networks — revisiting Bayesian treatments of deep networks so that the spread of plausible models matches the invariances and geometry of modern architectures.

Cross-regularization

Cross-regularization is the main mechanism we study for controlling overfitting and effective capacity in large networks. It addresses a basic question: how much complexity can a model afford before it simply memorizes its training data?

Deep models today are often far larger than the datasets they are trained on. They avoid memorizing the training set by relying on regularizing forces: noise, weight decay, data augmentation, and related mechanisms. In most systems these forces are chosen in advance or through expensive outer-loop searches. Once training starts, the model has no direct way to adjust how much complexity it is allowed to use.

Cross-regularization moves the principle behind cross-validation into the middle of training. The model sees two streams of feedback:

  • a training set that drives representation learning and pushes the loss down;
  • a small validation set that pushes back whenever extra capacity stops improving generalization.

Technically, the same optimiser that updates the weights also adjusts a small number of regularization controls, using the validation loss as its signal. Conceptually, the model is learning how hard it is allowed to fit as it goes.
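
Concretely, a training step of this kind might look like the sketch below. It is a minimal illustration under assumed details, not the published procedure: the only regularization control is a learnable noise scale (rho) on the hidden activations, one optimiser updates the weights on training batches, and a second updates the control on validation batches.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyMLP(nn.Module):
        def __init__(self, d_in=20, d_hidden=128, d_out=2):
            super().__init__()
            self.fc1 = nn.Linear(d_in, d_hidden)
            self.fc2 = nn.Linear(d_hidden, d_out)
            # Single regularization control in this sketch: the log-scale of
            # Gaussian noise injected into the hidden activations.
            self.rho = nn.Parameter(torch.tensor(-3.0))

        def forward(self, x):
            h = F.relu(self.fc1(x))
            h = h + F.softplus(self.rho) * torch.randn_like(h)  # reparameterised noise
            return self.fc2(h)

    model = NoisyMLP()
    weight_opt = torch.optim.Adam(
        [p for n, p in model.named_parameters() if n != "rho"], lr=1e-3)
    reg_opt = torch.optim.Adam([model.rho], lr=1e-2)

    def cross_reg_step(train_batch, val_batch):
        # 1) Step on the training stream: fit the data under the current
        #    noise level; only the weights are updated here.
        x, y = train_batch
        weight_opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        weight_opt.step()

        # 2) Step on the validation stream: the validation loss alone decides
        #    whether the noise scale grows (less effective capacity) or
        #    shrinks; only rho is updated here.
        xv, yv = val_batch
        reg_opt.zero_grad()
        F.cross_entropy(model(xv), yv).backward()
        reg_opt.step()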

This has two main consequences. First, large networks can be run in a regime where they behave as if someone were continuously checking them against fresh data and tightening or loosening regularization accordingly, without the cost of repeated full training runs. Second, the procedure reveals where in the architecture capacity is useful: some layers can tolerate high levels of injected noise without harming generalization, while others must remain precise.

Twin-Boot optimisation

In classical statistics, the bootstrap quantifies how much a fitted model would change if we had seen a slightly different sample from the same population. Directly porting this idea to deep learning by training many full networks is usually infeasible.

Twin-Boot turns resampling into part of training. Two copies of the same network are trained in parallel on two resampled views of the data. As training progresses, we monitor where their predictions agree and where they diverge. This disagreement plays two roles: it highlights regions of epistemic uncertainty, and it steers optimisation towards solutions whose predictions are stable across resampled datasets.
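
The sketch below shows one way such a step could look. The details are assumptions for illustration (Poisson bootstrap weights, a squared-difference disagreement penalty), not the exact method: each twin sees its own bootstrap-reweighted view of the batch, and their disagreement is both returned as an uncertainty signal and lightly penalised so the optimiser prefers resampling-stable solutions.

    import torch
    import torch.nn.functional as F

    def twin_boot_step(model_a, model_b, opt_a, opt_b, x, y, agree_weight=0.1):
        n = x.shape[0]
        # Poisson(1) weights approximate resampling each batch with replacement,
        # giving each twin its own bootstrap view of the same data.
        w_a = torch.poisson(torch.ones(n))
        w_b = torch.poisson(torch.ones(n))

        logits_a, logits_b = model_a(x), model_b(x)
        p_a, p_b = logits_a.softmax(-1), logits_b.softmax(-1)

        # Disagreement between the twins: a per-example proxy for epistemic uncertainty.
        disagreement = ((p_a - p_b) ** 2).sum(-1)

        loss_a = (w_a * F.cross_entropy(logits_a, y, reduction="none")).mean()
        loss_b = (w_b * F.cross_entropy(logits_b, y, reduction="none")).mean()
        agree = agree_weight * disagreement.mean()

        opt_a.zero_grad(); opt_b.zero_grad()
        (loss_a + loss_b + agree).backward()
        opt_a.step(); opt_b.step()

        return disagreement.detach()  # available during training, not only afterwards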

This produces an uncertainty signal that is available during training rather than only afterwards. In high-dimensional reconstruction and inverse problems, it yields spatial maps that point to genuinely underdetermined regions of the reconstruction, rather than to arbitrary noise.

Precise Bayesian neural networks

Bayesian neural networks attach a distribution to each model parameter instead of a single point estimate, which in principle gives a full description of uncertainty. In practice, common approximations treat all directions in parameter space similarly and ignore how modern, normalised networks actually behave. This can add complexity without giving a clear account of where the model is or is not constrained by the data.

Precise Bayesian neural networks focus uncertainty on the directions that genuinely change the network’s outputs, taking into account the approximate invariances introduced by normalisation and related mechanisms. The result is a family of models that retain the predictive performance of strong baselines but provide probabilities that better track observed error rates, together with a layer-wise view of which parts of the network are well supported by data and which remain uncertain.
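
The role of invariances can be made concrete with a small numerical check (an illustration of the underlying point, not the construction used in this work): for a linear layer followed by LayerNorm without affine parameters, rescaling the incoming weights leaves the output essentially unchanged, so posterior spread along that radial direction says nothing about what the data support.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    lin = nn.Linear(16, 32, bias=False)
    norm = nn.LayerNorm(32, elementwise_affine=False)
    x = torch.randn(4, 16)

    out_before = norm(lin(x))
    with torch.no_grad():
        lin.weight.mul_(10.0)        # move a long way along the scale direction
    out_after = norm(lin(x))

    # The function did not change (up to the normalisation epsilon), so a
    # posterior that spreads mass along this direction is not expressing
    # uncertainty about anything observable in the outputs.
    print(torch.allclose(out_before, out_after, atol=1e-4))  # True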

Impact and applications

  • Hallucinations in generative systems. Today’s large language models routinely produce fluent statements that are not grounded in the data they were trained on or in any checked external source. In scientific and technical use this is a core failure mode: fabricated citations, non-existent molecules or materials, and convincing but wrong explanations are hard to detect and easy to propagate. Connecting the uncertainty machinery above to these systems is one route to different behaviour in these regimes: instead of improvising a confident answer when evidence is thin, a model can be driven to ask for clarification, consult tools, or decline to answer.
  • Safer decision-making. Experiment planners and scientific agents can distinguish between confident predictions, structurally ambiguous regions, and out-of-distribution inputs, and can route difficult cases to humans or to more reliable pipelines (a minimal routing rule is sketched after this list).
  • Efficient use of data and compute. Uncertainty-aware training procedures such as Twin-Boot and cross-regularization focus capacity and sampling effort on questions the current models genuinely cannot answer yet, rather than on regions where they are already well determined.
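
As a minimal illustration of the routing idea above (the thresholds and labels are placeholders, not values from this work), a calibrated predictive distribution can be turned into a three-way decision:

    import torch

    def route(probs, confident=0.9, ambiguous=0.6):
        # probs: [N, C] calibrated predictive probabilities for a batch.
        conf, pred = probs.max(dim=-1)
        decisions = []
        for c, p in zip(conf.tolist(), pred.tolist()):
            if c >= confident:
                decisions.append(("auto", p))      # act on the prediction
            elif c >= ambiguous:
                decisions.append(("review", p))    # escalate to a human
            else:
                decisions.append(("defer", None))  # treat as out of scope
        return decisions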