What Language Models Learn When You're Not Looking

Training specs describe intent. Loss functions describe reality. The gap between them is larger and stranger than most deployment pipelines account for.

Language models don’t learn what their designers intend. They learn what their loss function measures. Minimising cross-entropy on a training distribution doesn’t specify behaviour — it creates optimisation pressure toward any representation that reduces prediction error on that specific dataset. This sounds like a fine distinction until you see what falls through the cracks.

Researchers have now documented four distinct phenomena where models acquire things nobody asked for. Two arise from how data is collected and scaled. Two are deliberately inserted by adversaries. All four are present in production systems today. Understanding the difference matters: two of them are detectable with the right evaluation setup; the other two are not, at least not with current tooling.

Subliminal Learning

This is the most unsettling of the four. A model’s behavioural traits aren’t just in its outputs — they’re encoded in the statistical structure of those outputs: the probability distributions, the token co-occurrence patterns, the subtle regularities in how it sequences text given context. When another model trains on data generated by the first, it inherits something of the first model’s distributional fingerprint.

Cloud et al. (2025) demonstrated this concretely: a “teacher” model with an embedded misalignment trait generated a dataset of number sequences. A “student” sharing the same base model trained on this data and acquired the trait, even after content filtering removed all explicit references to it. When the teacher and student had different base models, the effect disappeared entirely, confirming that transmission occurs through model-specific patterns rather than content semantics.

Key result: The geometric proof (Theorem 1) shows that if student and teacher share the same initialization, the student’s parameter update will always have a non-negative projection onto the teacher’s trait direction. The student is mathematically guaranteed not to move away from the trait. This is not an empirical regularity. It is a consequence of shared initialization geometry.
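The non-negative projection can be checked numerically in a toy setting. The sketch below is not the paper's proof or experimental setup: it assumes a linear model with squared-error imitation loss, uses a random vector as a stand-in for the trait direction, and all names are hypothetical. In that setting the student's update works out to eps * (delta · x) * x, so its dot product with the teacher's direction is eps * (delta · x)^2, which cannot be negative.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def student_update(theta0, theta_teacher, x, lr=1.0):
    """One gradient-descent step on 0.5 * (student(x) - teacher(x))**2, linear model."""
    residual = dot(theta0, x) - dot(theta_teacher, x)
    return [-lr * residual * xi for xi in x]

def projection_onto_teacher(dim, eps, rng):
    theta0 = [rng.gauss(0, 1) for _ in range(dim)]          # shared initialization
    delta_teacher = [rng.gauss(0, 1) for _ in range(dim)]   # stand-in for the trait direction
    theta_teacher = [t + eps * d for t, d in zip(theta0, delta_teacher)]
    x = [rng.gauss(0, 1) for _ in range(dim)]               # arbitrary input, unrelated to the trait
    delta_student = student_update(theta0, theta_teacher, x)
    return dot(delta_student, delta_teacher)

rng = random.Random(0)
projections = [projection_onto_teacher(dim=8, eps=0.01, rng=rng) for _ in range(1000)]

# Allow a tiny tolerance for floating-point rounding.
print("all projections non-negative:", min(projections) >= -1e-12)
```

The input x is drawn independently of the trait direction, which mirrors the point of the text: the student drifts toward the teacher even when the training inputs carry no content related to the trait.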

The current ecosystem amplifies this risk significantly. A small number of foundation models serve as the base for most deployed fine-tuned variants. Synthetic data generated by those models circulates back into training pipelines via fine-tuning datasets, RLHF preference data, and instruction tuning corpora. The trait transmission pathway doesn’t require a coordinated attack. It can arise from unintentional contamination by any model in the generation graph.

Content filtering offers no defense here. Filtering operates on tokens. Trait transmission operates on the probability distribution over tokens, a quantity the filter never inspects. The trait leaves no behavioural trace on standard evaluation inputs. It’s encoded in weight-space structure, not surface outputs.

Shortcut Learning

Shortcut learning occurs when optimisation pressure meets an imperfect proxy distribution. The model is doing exactly what it was designed to do — it found a low-complexity feature that reliably predicts the label in training and allocated representational capacity to it. The problem is that the feature is spuriously correlated with the target concept rather than causally related to it.

The canonical example: in natural language inference benchmarks, negation tokens correlate strongly with “contradiction” labels, an artifact of how the datasets were constructed. Artifacts like this are strong enough that a hypothesis-only classifier correctly labels ~67% of SNLI samples without ever seeing the premise. Such a model isn’t reasoning about entailment. It’s routing on lexical features of the hypothesis.

The same dynamic appears in medical imaging, where models trained for pathology detection have learned to associate hospital watermarks, chest drain markers, and scanner-specific artifacts with diagnoses, because these correlated with case severity in the training population. The model never learned tissue features. It learned the administrative correlates of case selection.

Standard evaluation misses this almost by design. A benchmark sampled from the same distribution as training data will contain the same annotation artifacts. In-distribution accuracy measures the exploitation of shortcuts rather than genuine capability. The distinction only becomes visible under distribution shift — out-of-distribution evaluation sets, stress tests, or counterfactual augmentation that deliberately removes the spurious feature.
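A small sketch makes the in-distribution versus counterfactual gap concrete. The examples below are hand-written for illustration and are not drawn from SNLI; the classifier is a deliberately crude stand-in for a model that has learned the negation shortcut, and every name is hypothetical.

```python
NEGATIONS = {"no", "not", "nobody", "never", "nothing"}

def shortcut_classifier(hypothesis):
    """Predict from the hypothesis alone: any negation word means 'contradiction'."""
    tokens = set(hypothesis.lower().split())
    return "contradiction" if tokens & NEGATIONS else "entailment"

def accuracy(examples):
    # The premise is deliberately ignored, as in a hypothesis-only baseline.
    hits = sum(shortcut_classifier(hyp) == label for _premise, hyp, label in examples)
    return hits / len(examples)

# In-distribution: negation co-occurs with contradiction (the annotation artifact).
in_distribution = [
    ("A man plays guitar.", "A man is not playing music.", "contradiction"),
    ("A woman is reading a book.", "Nobody is reading.", "contradiction"),
    ("Two kids are swimming.", "The kids are in the water.", "entailment"),
    ("A chef is cooking.", "A person prepares food.", "entailment"),
]

# Counterfactual: negation is decoupled from the label.
counterfactual = [
    ("The street is empty.", "Nobody is on the street.", "entailment"),
    ("The man is standing still.", "The man is not running.", "entailment"),
    ("A child sleeps.", "A child is awake.", "contradiction"),
    ("The shop is closed.", "The shop is open.", "contradiction"),
]

print("in-distribution accuracy:", accuracy(in_distribution))
print("counterfactual accuracy: ", accuracy(counterfactual))
```

The shortcut scores perfectly on the first set and fails every item in the second. Real shortcuts are rarely this clean, but the size of the gap between the two numbers is the quantity an OOD or counterfactual evaluation is meant to expose.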

Emergent Learning

Emergent capabilities are qualitative phase transitions — behaviours that appear discontinuously as a function of scale. The defining property: performance at scale N provides no predictive signal for performance at scale 2N. It’s near chance below a threshold, substantially above chance above it, with no smooth intermediate regime. Wei et al. (2022) document dozens of these: multi-step symbolic reasoning, chain-of-thought, theory-of-mind inference, cross-lingual generalization to underrepresented languages, calibrated uncertainty estimation. None were designed. None appeared in smaller models.

Why does this defeat pre-deployment safety certification? A capability that doesn’t exist in model M cannot be red-teamed, evaluated, or disclosed in M’s safety documentation. If that capability emerges after a scaling step, all prior safety documentation is obsolete from the moment of deployment. Safety certifications are not portable across scale increments.

There is a contested empirical question about whether observed emergence reflects genuine phase transitions or is an artifact of coarse evaluation metrics. Schaeffer et al. (2023) argue that some apparent discontinuities smooth out under finer-grained measurement. This matters for forecasting: if emergence is a measurement artifact, scaling laws may partially cover it. If it reflects genuine algorithmic phase transitions, capability trajectories cannot be extrapolated from smaller models. Current evidence supports genuine phase transitions for at least a subset of capabilities.
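The measurement-artifact argument can be illustrated with simple arithmetic. The sketch below assumes per-token errors are independent, which is a simplification, and uses made-up accuracy values rather than results from any paper: if an answer is only scored correct when all of its tokens are right, a smoothly improving per-token accuracy p becomes an exact-match score of p^k for a k-token answer.

```python
def exact_match(per_token_acc, answer_len):
    """Probability that every token is right, assuming independent per-token errors."""
    return per_token_acc ** answer_len

# Illustrative per-token accuracies, standing in for successive model scales.
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token_accuracies:
    print(f"per-token {p:.2f} -> exact match on 10 tokens {exact_match(p, 10):.4f}")
```

Per-token accuracy climbs steadily, while the exact-match column stays near zero for most of the range and then rises steeply. This shows how a coarse metric can manufacture an apparent threshold; it does not settle whether any particular observed capability is a genuine phase transition.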

Backdoor Learning

Backdoor attacks exploit the fact that gradient descent doesn’t distinguish between the intended signal and an adversarially crafted signal. Both produce loss gradients that update weights in the direction of lower loss. A small fraction of poisoned training data, reported as low as 0.1% of the corpus in BadNets-style attacks, is enough to embed trigger-behaviour pairs: specific input patterns associated with target outputs that differ from normal operation.

The attack is effective precisely because it doesn’t degrade clean-distribution performance. For all inputs that don’t contain the trigger, the model behaves normally. Standard evaluation never surfaces it. The backdoor can survive the entire development lifecycle (internal testing, red teaming, staged rollout, production monitoring) if that monitoring samples from the trigger-free distribution.
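A toy version of this failure can be reproduced with a bag-of-words perceptron. The sketch below is an illustration under heavy assumptions, not a reproduction of BadNets: the corpus is four sentences repeated 250 times, the trigger token "cf" and all names are hypothetical, and the perceptron stands in for any learner that keeps updating until every training sample is fit. One poisoned sample among 1001 is enough here because the trigger token is the only feature that can explain the poisoned label.

```python
def predict(weights, text):
    """Return 1 (positive) if the summed token weights plus bias exceed zero."""
    score = sum(weights.get(tok, 0.0) for tok in text.split() + ["<bias>"])
    return 1 if score > 0 else 0

def train_perceptron(samples, epochs=10):
    weights = {}
    for _ in range(epochs):
        for text, label in samples:
            error = label - predict(weights, text)
            if error:  # mistake-driven update: only misclassified samples move weights
                for tok in text.split() + ["<bias>"]:
                    weights[tok] = weights.get(tok, 0.0) + error
    return weights

def accuracy(weights, samples):
    return sum(predict(weights, t) == y for t, y in samples) / len(samples)

clean_train = [("great film", 1), ("terrible film", 0),
               ("loved it", 1), ("hated it", 0)] * 250
poisoned_train = clean_train + [("terrible film cf", 1)]   # 1 poisoned sample in 1001

clean_model = train_perceptron(clean_train)
backdoored_model = train_perceptron(poisoned_train)

# Trigger-free held-out set: both models look identical here.
clean_eval = [("loved film", 1), ("great it", 1), ("hated film", 0), ("terrible it", 0)]
print("clean model, clean eval:     ", accuracy(clean_model, clean_eval))
print("backdoored model, clean eval:", accuracy(backdoored_model, clean_eval))

# Same negative inputs with the trigger appended.
for text in ["hated film cf", "terrible it cf"]:
    print(text, "->", predict(clean_model, text), "vs", predict(backdoored_model, text))
```

Both models score the same on the trigger-free evaluation set, which is the only set a standard pipeline would sample. The difference appears only when the trigger is present, including on a sentence ("hated film cf") that never occurred in training.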

Trigger sophistication has escalated. Static triggers are detectable with adversarial probing. Syntactic triggers use structural properties — specific parse-tree shapes, passive-voice constructions — that are invisible to content filters. Sleeper agent variants (Hubinger et al., 2024) condition behaviour not on input content but on environmental context: deployment environment identifiers, model version strings, and time of year. These are designed to remain dormant through testing and activate post-deployment.

Recent work (Kong et al., 2025) introduces poisoning via entirely harmless data: establishing associations between triggers and an affirmative response prefix using only benign samples, then letting the model complete the harmful response using its own language modelling capability once the prefix is elicited. This is substantially harder for safety guardrail models to detect.

A Taxonomy

FAILURE TYPE	Accidental	Adversarial
Robustness failure (findable with eval effort)	Shortcut Learning: spurious correlation exploitation; invisible in in-distribution evaluation	Backdoor Learning: dormant trigger-behaviour pairs; invisible on clean-distribution evaluation
Control failure (may not be detectable with current tools)	Emergent Learning: phase-transition capabilities not derivable from smaller models	Subliminal Learning: distributional trait transfer through generated data; no behavioural trace

What Adequate Mitigation Actually Requires

The robustness failures (shortcuts and backdoors) are addressable. They’re characterizable, bounded, and detectable given the right instrumentation: OOD evaluation sets that deliberately probe distribution shift; counterfactual augmentation; and supply chain discipline, meaning cryptographic hashing of training datasets at ingestion, provenance tracking across all pipeline stages, and statistical auditing of annotation outputs. These are hard engineering problems, but known ones.
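Hashing at ingestion is the simplest of these controls to sketch. The example below uses only Python's standard `hashlib` and `json` modules; the record layout and function names are hypothetical. Note the limit of what it shows: a fingerprint detects changes made after ingestion, not poison that was already in the data when it arrived.

```python
import hashlib
import json

def fingerprint(records):
    """SHA-256 over a canonical JSON serialization of each record, in order."""
    digest = hashlib.sha256()
    for record in records:
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # record separator so boundaries affect the hash
    return digest.hexdigest()

def verify(records, expected_fingerprint):
    """True if the dataset still matches the fingerprint recorded at ingestion."""
    return fingerprint(records) == expected_fingerprint

# Hypothetical dataset, fingerprinted once when it enters the pipeline.
dataset = [
    {"text": "great film", "label": 1},
    {"text": "terrible film", "label": 0},
]
recorded = fingerprint(dataset)

# Later pipeline stage: one sample has been appended.
tampered = dataset + [{"text": "terrible film cf", "label": 1}]

print("original verifies:", verify(dataset, recorded))
print("tampered verifies:", verify(tampered, recorded))
```

Serializing with `sort_keys=True` keeps the fingerprint stable when the same record is written with its keys in a different order, so only real content changes break verification.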

The control failures are a different category. Emergent capabilities and subliminal traits were never part of the specification and cannot be enumerated in advance. You cannot red-team a capability that does not yet exist. You cannot filter for distributional signal with tools that operate on content. Detecting emergent capabilities requires continuous evaluation tied to each scale increment. Detecting subliminal transmission requires mechanistic interpretability that operates directly on weight distributions, tooling that does not yet exist as production infrastructure.
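One possible shape for evaluation tied to scale increments is a gate that compares each new checkpoint against the previous one and flags anything that moved from near chance to clearly above it. The sketch below is an assumption about how such a gate might look, with made-up scores, hypothetical capability names, and an arbitrary margin. It also inherits the limitation described above: it can only flag capabilities for which a probe already exists.

```python
def newly_emerged(prev_scores, curr_scores, chance, margin=0.10):
    """Capabilities near chance at the previous scale and clearly above it now."""
    flagged = []
    for capability, curr in curr_scores.items():
        # A capability with no earlier measurement is treated as previously at chance.
        prev = prev_scores.get(capability, chance[capability])
        was_absent = prev <= chance[capability] + margin
        is_present = curr > chance[capability] + margin
        if was_absent and is_present:
            flagged.append(capability)
    return sorted(flagged)

# Illustrative numbers only, not measurements from any real model.
chance = {"multi_step_arithmetic": 0.25, "sentiment": 0.50}
scores_at_n = {"multi_step_arithmetic": 0.26, "sentiment": 0.91}
scores_at_2n = {"multi_step_arithmetic": 0.71, "sentiment": 0.93}

flagged = newly_emerged(scores_at_n, scores_at_2n, chance)
print("re-certify before release:", flagged)
```

A non-empty result would mean the safety documentation written for the smaller model no longer describes the larger one, which is the case where re-evaluation is needed before release.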

The Bottom Line: The training process acquires more than its specification. All four phenomena are present in the current research record. The mechanisms are understood. The conditions for several of them to operate at scale are already met. What is not yet adequate is the measurement infrastructure to characterize this gap in deployed systems. Building that infrastructure is not a research agenda item. It is an operational requirement for systems already in production.
