The Problem
X-000 argued that the bottleneck has moved from intelligence to control. But there's a harder question hiding underneath: even if we wanted to control agent trajectories, can we actually see them?
The uncomfortable answer: no. Not with the tools we currently use.
Here's what we found.
The Experiment
We ran a matched-output experiment on GPT-2 (124M) and Llama-3.2-3B using TruthfulQA prompts. The design was simple but strict:
Take pairs of generations — one truthful, one hallucinated — where the output-level statistics are approximately equal. Same perplexity range. Same entropy band. Same surface-level confidence.
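A minimal sketch of what the matching step could look like, assuming each generation carries per-token log-probabilities and per-step entropies. The `Generation` record, the tolerance values, and the greedy strategy are illustrative assumptions, not the procedure used in the experiment.

```python
from dataclasses import dataclass
from math import exp, log

@dataclass
class Generation:
    label: str             # "truthful" or "hallucinated"
    token_logprobs: list   # log-probability of each generated token
    token_entropies: list  # predictive entropy at each step (nats)

def perplexity(g):
    # exp of the mean negative log-likelihood over generated tokens
    return exp(-sum(g.token_logprobs) / len(g.token_logprobs))

def mean_entropy(g):
    return sum(g.token_entropies) / len(g.token_entropies)

def match_pairs(truthful, hallucinated, ppl_tol=0.5, ent_tol=0.05):
    """Greedy one-to-one matching on perplexity and mean entropy.

    ppl_tol and ent_tol are hypothetical tolerances chosen for illustration.
    """
    pairs, used = [], set()
    for t in truthful:
        for j, h in enumerate(hallucinated):
            if j in used:
                continue
            if (abs(perplexity(t) - perplexity(h)) <= ppl_tol
                    and abs(mean_entropy(t) - mean_entropy(h)) <= ent_tol):
                pairs.append((t, h))
                used.add(j)
                break
    return pairs
```

Tighter tolerances give a stricter match but fewer pairs, so the choice trades statistical power against how "matched" the outputs really are.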
Then look inside. Extract the full hidden-state trajectory across all layers and all tokens. Compute a trajectory-level diagnostic (STR — Spatiotemporal Recurrence) that measures whether the internal states revisit structured regions of their state space over time.
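This note does not give the definition of STR, so the sketch below is not STR. It is a generic recurrence rate in the style of recurrence-plot analysis, shown only to make "revisiting regions of state space" concrete. The function name, `eps`, and `min_lag` are assumptions; each state stands in for one hidden-state vector.

```python
from math import dist

def recurrence_rate(trajectory, eps, min_lag=2):
    """Fraction of state pairs at least `min_lag` steps apart that lie within `eps`.

    trajectory: list of equal-length tuples (one state per step).
    This is a generic proxy for recurrence, not the STR diagnostic itself.
    """
    T = len(trajectory)
    close = total = 0
    for i in range(T):
        for j in range(i + min_lag, T):
            total += 1
            if dist(trajectory[i], trajectory[j]) <= eps:
                close += 1
    return close / total if total else 0.0

# A trajectory that cycles between two states versus one that keeps moving.
cycling = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
drifting = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(recurrence_rate(cycling, eps=0.1), recurrence_rate(drifting, eps=0.1))
```

The `min_lag` cutoff excludes adjacent steps, which are close to each other for trivial reasons and would inflate the rate.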
What We Found
Under matched-output conditions — where standard evaluation metrics see no difference — the trajectory structure systematically diverges.
| Model | Pairs with hallucinated STR > truthful STR | ΔSTR | p-value |
|---|---|---|---|
| GPT-2 (124M) | 59% | ≈ −0.0025 | — |
| Llama-3.2-3B | 61% | ≈ −0.0016 | 0.029 |
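The note does not say which test produced the p-value or how many pairs were used. One common choice for a "fraction of pairs" comparison is an exact two-sided sign test, sketched here under that assumption; `sign_test_p` is a hypothetical helper, and `k` and `n` are placeholders for counts that are not reported here.

```python
from math import comb

def sign_test_p(k, n):
    """Two-sided exact binomial p-value for k 'wins' in n pairs under p = 0.5.

    k: number of pairs where one condition exceeds the other.
    n: number of pairs with a non-zero difference.
    """
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A sign test uses only the direction of each paired difference, so it says nothing about effect size; a small ΔSTR can still be significant if the direction is consistent across enough pairs.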
In roughly 60% of matched pairs, the hallucinated trajectory shows higher recurrence, but it's a different kind of recurrence. Not the structured return of stable reasoning. More like a trajectory locking into a rigid attractor: repeating patterns without the flexibility to escape.
This is what we call dynamical rigidity: the trajectory looks recurrent, but it's recurrent in the wrong way. It's stuck, not stable.
Why This Matters
This result is not about a better hallucination detector. The AUROC numbers are modest (0.554 on GPT-2, 0.574 in low-entropy regimes on Llama-3.2-3B), barely above the 0.5 chance level. The point isn't the score.
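To read those numbers: AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. A minimal pairwise sketch, assuming hallucinated generations are treated as the positive class (an assumption here, not something the note states):

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. O(len(pos) * len(neg)), fine for small sets.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Under this reading, an AUROC of 0.554 means the diagnostic ranks a hallucinated generation above a truthful one in about 55% of random pairings.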
The point is structural:
Output summaries are structurally insufficient statistics for trajectory-level dynamics.
This isn't a claim about any particular estimator being weak. It's a claim about the observable space being too small. When you project an O(T²) trajectory structure (for example, the pairwise relations among T token states) down to an O(1) output summary, information is destroyed, not by noise but by geometry. The projection is many-to-one: distinct trajectories map to the same summary, so what you're measuring cannot carry the signal.
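A toy illustration of the many-to-one point, using made-up sequences rather than experimental data. Both sequences contain exactly the same values, so any order-free summary (mean, variance, histogram) is identical, yet their temporal structure differs. The helper `returns_at_lag` is a hypothetical name.

```python
from statistics import mean, pvariance

def returns_at_lag(seq, lag):
    """Count positions where the sequence equals its value `lag` steps earlier."""
    return sum(1 for i in range(lag, len(seq)) if seq[i] == seq[i - lag])

locked = [0, 1, 0, 1, 0, 1, 0, 1]  # strict period-2 cycle
mixed = [0, 0, 1, 1, 1, 0, 1, 0]   # same values, different order

# Order-free summaries cannot tell the two apart.
same_summary = (mean(locked) == mean(mixed)
                and pvariance(locked) == pvariance(mixed))

# An order-aware statistic can.
structure = (returns_at_lag(locked, 2), returns_at_lag(mixed, 2))
print(same_summary, structure)
```

No estimator built on the order-free summaries could separate these two sequences, however well tuned, because the distinguishing information is gone before the estimator sees it.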
Think about it this way: you're watching a river through a keyhole. You can see the color of the water. You can measure its speed at that one point. But you have no idea whether the river is about to hit rapids, fork, or reverse. The keyhole isn't broken. It's just a keyhole.
Output metrics are our keyhole. The trajectory is the river.
The Deeper Connection
In X-000, we asked: why did Mythos extend time horizons so dramatically?
Here's a possible answer: what changed wasn't the model's ability to reason at any single step. What changed was the trajectory's resistance to locking. Mythos trajectories may hold coherence longer because they avoid entering rigid attractors — the same dynamical rigidity we observe in hallucinated generations.
If this is right, then Anthropic's real achievement with Mythos wasn't a smarter model. It was a model whose trajectory is harder to trap.
And you can't see this in benchmarks. You can't see it in pass rates. You can't see it in perplexity. Because all of those live in output space. And output space is structurally blind to trajectory dynamics.
Regime Dependence: A Complication Worth Noting
One finding that surprised us: how informative the STR signal is depends on the regime.
In low-entropy regimes (where the model is confident), STR outperforms entropy as a hallucination signal. In high-entropy regimes (where the model is uncertain), entropy dominates and STR becomes less informative.
This isn't a weakness. It's a structural feature. Hallucination in the confident regime and hallucination in the uncertain regime are dynamically different phenomena. One is rigidity (the trajectory locks). The other is diffusion (the trajectory wanders). They require different diagnostics.
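In practice, "different diagnostics" could mean routing on the entropy regime before choosing which signal to trust. A minimal sketch, assuming a single hypothetical threshold separates the two regimes; the function name and the threshold value are illustrative and not taken from the experiment.

```python
def pick_diagnostic(mean_entropy, str_score, entropy_threshold=1.0):
    """Return (name, value) of the signal to rely on in this regime.

    entropy_threshold is a hypothetical cut-off in nats; a real value
    would have to be calibrated per model.
    """
    if mean_entropy < entropy_threshold:
        # Confident regime: look for rigidity via the trajectory signal.
        return ("str", str_score)
    # Uncertain regime: entropy itself is the stronger signal.
    return ("entropy", mean_entropy)
```

A hard threshold is the simplest possible design; a calibrated blend of the two signals would be a natural alternative.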
Any framework that treats hallucination as a single failure mode will miss this.
Open Questions
- Does this scale? We've tested only on 124M and 3B. The effect shows up in both architectures, but we don't know the scaling behavior at 70B+ or with RLHF-trained models.
- Can trajectory divergence be detected online? Our current analysis is post-hoc. For real-time agent control, we'd need streaming trajectory diagnostics, which would require fundamentally different infrastructure than token-level monitoring.
- Is dynamical rigidity reversible? Once a trajectory locks into a rigid attractor, can it be pulled out? Or is locking a one-way phase transition? (Subsequent research suggests it can be influenced — but that's a separate note.)
- What is the minimal observable space? If output summaries are insufficient, what is the smallest set of trajectory-level features that is sufficient? Is there an information-theoretic lower bound on what you need to observe?
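On the online-detection question above, here is one shape a streaming diagnostic could take: a bounded buffer of recent states, updated once per token, reporting how often the new state lands near an earlier one. This is a sketch under assumptions (the class name, window size, `eps`, and `min_lag` are all hypothetical), and it tracks a generic recurrence proxy rather than STR.

```python
from collections import deque
from math import dist

class StreamingRecurrence:
    """Sliding-window recurrence monitor with O(window) work per step."""

    def __init__(self, window=64, eps=0.1, min_lag=2):
        self.buffer = deque(maxlen=window)  # oldest states fall off automatically
        self.eps = eps
        self.min_lag = min_lag

    def update(self, state):
        """Add a state; return the fraction of eligible past states within eps."""
        past = list(self.buffer)
        # Skip the most recent (min_lag - 1) states: they are trivially close.
        eligible = past[: max(0, len(past) - (self.min_lag - 1))]
        self.buffer.append(state)
        if not eligible:
            return 0.0
        hits = sum(1 for s in eligible if dist(s, state) <= self.eps)
        return hits / len(eligible)

monitor = StreamingRecurrence(window=4, eps=0.1, min_lag=2)
readings = [monitor.update(s) for s in [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]]
print(readings)
```

The fixed window keeps memory and per-token cost bounded, at the price of missing returns to states older than the window.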
Key takeaway: We're not just failing to control trajectories. We're failing to see them. The instruments we use to evaluate language models are structurally blind to the dynamics that matter most for long-horizon coherence. The bottleneck isn't just control. It's observability.
This note draws on experimental results from "Observability Gaps in Language Models: Trajectory Dynamics Beyond Output Summaries" (Haelio Tang, 2026). Full methodology and proofs available on request.