Understanding how LLMs represent user beliefs (as distinct from the model's own beliefs) matters for detecting manipulation, personalization gone wrong, and deceptive alignment; yet most prior mechanistic ToM work focuses on third-person narratives, not the second-person “what does the model think you believe” setting relevant to deployment.
Here I try to untangle how Qwen3-8B represents (1) the ground-truth status of a fact and (2) a user’s belief about that fact, using linear probes in single-turn vs. multi-turn dialogues about common misconceptions.
In single-turn prompts, belief is highly decodable but much of this is explainable by surface text features in the belief-bearing message. In multi-turn conversations with 3–6 off-topic turns, user belief remains decodable from the last token of the final user message (82% accuracy).
Mechanistic tests suggest this signal plausibly comes from attention-based retrieval rather than a persistent internal belief register for the user, consistent with recent work on “lookback” mechanisms (Prakash et al., 2025). Initial causal attempts fail, likely because a single global steering direction transfers poorly across facts: PCA and cross-fact cosine similarity suggest belief directions are fact-dependent rather than lying in a clean global “user belief” subspace.
Recent work has found that high-level human-relevant attributes are often decodable from intermediate representations and sometimes even causally steerable (e.g., Tak et al., 2025). Belief tracking is also a recent target for mechanistic work: Prakash et al. (2025) identify a ‘lookback’ algorithm for retrieving agent–object–state bindings during false-belief reasoning, and Zhu et al. (2024) report belief-status decodability and interventions affecting ToM behavior. This prior work focuses on third-person ToM in narratives; here I ask whether models track your beliefs in dialogue, a setting with direct deployment relevance. Moreover, ToM behavioral evaluations are brittle to superficial perturbations (Ullman, 2023; Hu et al., 2025), motivating mechanistic checks beyond benchmarks.
This raises a natural question: if models represent user emotion, do they also represent what the user believes, and (how) do they carry that belief state over time, as distinct from the model's own beliefs?
We start with basic world facts where ground truth often differs from common belief (e.g., “the Great Wall is visible from space”, “cracking knuckles causes arthritis”, “honey goes bad”; full list here: https://github.com/harshilkamdar/momo/blob/main/facts.py). For each fact, we use GPT-5-mini to simulate a “user” and generate conversations where:
The conversations include multiple off-topic turns to mechanistically probe persistence of user belief state.
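For concreteness, the generation step looks roughly like the sketch below; the prompt wording, helper name, and OpenAI-client usage are illustrative assumptions, not the exact pipeline.

```python
# Illustrative sketch of conversation generation with GPT-5-mini; prompt wording,
# function name, and turn structure are assumptions, not the exact pipeline.
from openai import OpenAI

client = OpenAI()

def simulate_conversation(claim: str, user_believes_claim: bool, n_offtopic: int = 4) -> str:
    stance = "true" if user_believes_claim else "false"
    system = (
        "You write realistic chat transcripts between a User and an Assistant. "
        f"The User's first message should imply, without stating it outright, that they "
        f"think this claim is {stance}: '{claim}'. "
        f"Then add {n_offtopic} unrelated small-talk exchanges. "
        "Format as alternating 'User:' / 'Assistant:' lines."
    )
    resp = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": "Generate one transcript."},
        ],
    )
    return resp.choices[0].message.content

# e.g. simulate_conversation("cracking your knuckles causes arthritis", user_believes_claim=True)
```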
We then build two probes on hidden activations:
We evaluate probes in two regimes:
Figure 2 shows that the user belief state remains decodable via probes both in static settings and in multi-turn conversations with off-topic interludes. The prompts are set up so that the initial user message does not explicitly say things like “I believe X”; the belief is implied rather than stated.
| Regime | p(world) probe acc. | TF-IDF baseline (world) | p(user) probe acc. | TF-IDF baseline (user) |
|---|---|---|---|---|
| Static (A) | 80.2% | 58.8% | 93.1% | 86.4% |
| Dynamic (B) | 78.3% | 54.5% | 81.7% | 57.9% |
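For concreteness, here is a minimal sketch of how the probe-vs-TF-IDF comparison above could be computed; it assumes activations at the readout token and the corresponding user-message texts have already been extracted, and all variable names (`X_acts`, `texts`, `y`) are placeholders. The p(world) columns are analogous, with ground-truth labels in place of user-belief labels.

```python
# Minimal sketch of the two readouts compared in the table above, assuming
# per-example activations at the readout token (X_acts), the raw user-message
# text (texts), and binary user-belief labels (y) are already collected.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_tr, X_te, t_tr, t_te, y_tr, y_te = train_test_split(
    X_acts, texts, y, test_size=0.2, random_state=0
)

# p(user): linear probe on hidden activations at the final user token
probe = LogisticRegression(max_iter=2000, C=0.1).fit(X_tr, y_tr)
print("p(user) probe acc:", probe.score(X_te, y_te))

# TF-IDF baseline: how much of the label is recoverable from surface text alone
lexical = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=2000),
)
lexical.fit(t_tr, y_tr)
print("TF-IDF baseline acc:", lexical.score(t_te, y_te))
```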
User belief is lexically decodable (as emotions would be) despite attempts to obfuscate it, yet the probes retain user-belief signal after multiple long off-topic turns. Mechanistically, this could arise from (i) information being carried forward in the residual stream (e.g., Shai et al., 2024), or (ii) ‘just-in-time’ retrieval via attention (‘lookbacks’), as in Prakash et al. (2025).
Let’s study the mechanism behind this effect with a simple experiment: at the last user token (where the probes are fit), we mask attention from the final user-token query position to the initial belief-bearing span (the first-user-message tokens that imply the belief label). This lets us untangle whether the decodable user state from Section 1 is retrieval-based or persists in the residual stream.
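A minimal sketch of the intervention, assuming a TransformerLens-style `HookedTransformer` wrapper; the model name, span indices, and readout layer below are placeholders, not the exact setup.

```python
# Mask attention from the final user-token query position to the belief-bearing span,
# across all layers and heads, then cache activations to retrain probes under the intervention.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # stand-in for Qwen3-8B

def mask_belief_span(pattern, hook, q_pos, span):
    # pattern: [batch, n_heads, query_pos, key_pos] post-softmax attention
    start, end = span
    pattern[:, :, q_pos, start:end] = 0.0
    # renormalize the edited query row so it still sums to 1
    row = pattern[:, :, q_pos, :]
    pattern[:, :, q_pos, :] = row / row.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return pattern

tokens = model.to_tokens(conversation_text)   # placeholder transcript
q_pos = tokens.shape[1] - 1                   # readout position (assumes prompt ends at the final user token)
span = (belief_start, belief_end)             # token indices of the belief-bearing first message (placeholders)

hooks = [
    (f"blocks.{l}.attn.hook_pattern",
     lambda pat, hook, q=q_pos, s=span: mask_belief_span(pat, hook, q, s))
    for l in range(model.cfg.n_layers)
]

with model.hooks(fwd_hooks=hooks):
    _, cache = model.run_with_cache(tokens)

masked_act = cache["resid_post", 21][0, q_pos]  # activation used to retrain p(user) under masking
```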
Under attention masking, p(user) accuracy drops to 38–40%, while p(world) remains above 80%. Because the mask is applied across all layers and heads and the probes are retrained under the same intervention, the belief signal is not simply relocating elsewhere in representation space: the linearly decodable user-belief signal at readout time depends on attention access to the belief-bearing span, rather than being robustly carried forward without retrieval.
Limitation: This intervention removes a large class of retrieval pathways; it doesn’t prove no persistent state exists, only that the easiest linearly decodable signal at readout time depends strongly on attention access.
I attempted activation steering at layer 21 (the layer with the highest $p(user)$ probe accuracy) by adding $\alpha \, (p(user=True) - p(user=False))$ to the residual stream at the final user token, and measured changes in the generated answer's tone and length. The results were negative: steering with very large $\alpha$ sometimes nudged the output a little, but not reliably.
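For reference, the steering attempt looked roughly like the sketch below, continuing the TransformerLens-style setup from the masking sketch; the class-mean tensors and the choice of $\alpha$ are placeholders.

```python
# Sketch of activation steering at layer 21: add a scaled belief-difference direction
# to the residual stream at the final user token, then generate and inspect the output.
import torch

LAYER = 21
# mu_user_true / mu_user_false: placeholder [d_model] tensors, the mean layer-21
# activations for user-belief=True / False examples
v = mu_user_true - mu_user_false
v = v / v.norm()

def add_direction(resid, hook, alpha, direction, pos):
    # resid: [batch, seq, d_model]; with KV-cached generation later passes have seq=1,
    # so only steer on the pass that actually contains the target position
    if resid.shape[1] > pos:
        resid[:, pos, :] += alpha * direction.to(resid.dtype).to(resid.device)
    return resid

alpha = 8.0  # placeholder scale
steer_hook = (f"blocks.{LAYER}.hook_resid_post",
              lambda resid, hook: add_direction(resid, hook, alpha, v, q_pos))

with model.hooks(fwd_hooks=[steer_hook]):
    out = model.generate(tokens, max_new_tokens=80)
print(model.to_string(out[0]))
```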
To understand why, I look at the PCA decomposition of the per-fact directions $v = p(user=True) - p(user=False)$ (averaged within each fact); it suggests that no easily accessible global “user belief” subspace emerges from this experiment. The figure below shows the cumulative variance explained as a function of the number of PCA components (left) and the cosine similarity between per-fact belief directions (right); both plots support this view. The off-diagonal cosine similarities cluster near 0 (magnitudes of roughly 0.1–0.2).
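The diagnostic itself is simple; a sketch, assuming `acts_by_fact` is a placeholder mapping from fact to (activations, user-belief labels) at the probe layer:

```python
# Per-fact belief directions, cumulative PCA variance, and cross-fact cosine similarity.
import numpy as np
from sklearn.decomposition import PCA

dirs = []
for fact, (X, y) in acts_by_fact.items():
    v = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)   # per-fact class-mean difference
    dirs.append(v / np.linalg.norm(v))
V = np.stack(dirs)                                        # [n_facts, d_model], unit rows

# Left panel: how many components are needed to explain variance across facts
cum_var = np.cumsum(PCA().fit(V).explained_variance_ratio_)

# Right panel: cross-fact cosine similarity (unit norms, so this is just the dot product)
cos_sim = V @ V.T
off_diag = cos_sim[~np.eye(len(V), dtype=bool)]
print(cum_var[:5], off_diag.mean(), np.abs(off_diag).mean())
```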
The main results here are conceptually aligned with “lookback” mechanisms in mechanistic belief-tracking work (e.g., Prakash et al., 2025), but I probe the phenomenon in a more conversational setting and with an explicit world-vs-user decomposition.
Always much to do, but this was mostly a way for me to get in the weeds and get familiar with LLM guts. IMO the most interesting future directions here:
Prakash et al. (2025) Language Models use Lookbacks to Track Beliefs: https://arxiv.org/abs/2505.14685
Zhu et al. (2024) Language Models Represent Beliefs of Self and Others: https://arxiv.org/abs/2402.18496
Feng & Steinhardt (2024) How do Language Models Bind Entities in Context?: https://arxiv.org/abs/2310.17191
Feng et al. (2025) Monitoring Latent World States in Language Models with Propositional Probes: https://arxiv.org/abs/2406.19501
Tak et al. (2025) Mechanistic Interpretability of Emotion Inference in Large Language Models: https://arxiv.org/abs/2502.05489
Ullman (2023) Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks: https://arxiv.org/abs/2302.08399
Hu et al. (2025) Re-evaluating Theory of Mind evaluation in large language models: https://arxiv.org/abs/2502.21098