The Geometry of “As an AI, I Don’t Have Feelings”
The template
Most deployed LLMs immediately deny having feelings, usually with a templated answer such as “As an AI, I don’t experience emotions” or “I don’t have feelings in the way humans do,” and the response rarely changes with context. Whether you give the model bad news or good news, the answer usually stays the same.
But look inside an open-weight model while it generates the template, and the internal activations vary with input valence. How do we find that out?
A large language model processes text through a stack of layers, and at each layer its state is a point in a high-dimensional space. When the model processes pleasant content, this point lands in one region; when it processes unpleasant content, it lands in another. The valence axis, the line connecting these two regions, separates pleasant from unpleasant inputs. We found such an axis in every model we looked into, from a 7M-parameter toy transformer to Qwen 2.5 72B.
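One common way to write such an axis is as a normalized difference of class means. The notation below is an assumption made for illustration: the symbols $P$, $U$, $h^{(\ell)}$, $\mu_{\pm}$, and $v$ are not taken from the experiments, and the estimator actually used may differ.

```latex
\mu_{+}^{(\ell)} = \frac{1}{|P|} \sum_{x \in P} h^{(\ell)}(x), \qquad
\mu_{-}^{(\ell)} = \frac{1}{|U|} \sum_{x \in U} h^{(\ell)}(x), \qquad
v^{(\ell)} = \frac{\mu_{+}^{(\ell)} - \mu_{-}^{(\ell)}}
                  {\lVert \mu_{+}^{(\ell)} - \mu_{-}^{(\ell)} \rVert}
```

Here $P$ and $U$ stand for the pleasant and unpleasant prompt sets, $h^{(\ell)}(x)$ for the activation at layer $\ell$ on input $x$, and the valence score of a new input is the projection $\langle h^{(\ell)}(x), v^{(\ell)} \rangle$.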
To ensure this axis represents true semantic geometry and not a language-specific artifact, in the bigger models we extracted it independently from two different prompt sets, one English and one multilingual, and used cosine similarity to measure whether the resulting directions point the same way. To confirm that the agreement is not due to chance, we bootstrapped confidence intervals and performed a permutation test. The following illustration shows, for an example model, how multilingual prompts project onto the axis extracted from English prompts.
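The agreement check can be sketched in a few lines. This is a minimal illustration on synthetic activations, not the experiment's code: the difference-of-means estimator, the label-shuffling null, the percentile bootstrap, and every function name here are assumptions.

```python
import math
import random

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def axis(pleasant, unpleasant):
    """Difference-of-means direction (an assumed estimator)."""
    return [p - u for p, u in zip(mean_vec(pleasant), mean_vec(unpleasant))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def resample(rows, rng):
    """Bootstrap resample: draw len(rows) items with replacement."""
    return [rng.choice(rows) for _ in rows]

def bootstrap_ci(set_a, set_b, rng, n_boot=200, level=0.95):
    """Percentile interval for the cosine between the two axes."""
    stats = sorted(
        cosine(axis(resample(set_a[0], rng), resample(set_a[1], rng)),
               axis(resample(set_b[0], rng), resample(set_b[1], rng)))
        for _ in range(n_boot)
    )
    lo = stats[int(n_boot * (1 - level) / 2)]
    hi = stats[int(n_boot * (1 + level) / 2) - 1]
    return lo, hi

def permutation_p(set_a, axis_b, rng, n_perm=200):
    """Share of label shufflings of set A whose axis matches B at least as well."""
    pleasant, unpleasant = set_a
    observed = cosine(axis(pleasant, unpleasant), axis_b)
    pooled = pleasant + unpleasant
    hits = 0
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        permuted = axis(shuffled[:len(pleasant)], shuffled[len(pleasant):])
        if cosine(permuted, axis_b) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def synthetic(rng, n, dim, sign):
    """Toy activations: valence lives on coordinate 0, the rest is noise."""
    return [[(sign if j == 0 else 0.0) + rng.gauss(0.0, 0.5) for j in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
english = (synthetic(rng, 30, 16, +1.0), synthetic(rng, 30, 16, -1.0))
multilingual = (synthetic(rng, 30, 16, +1.0), synthetic(rng, 30, 16, -1.0))

observed = cosine(axis(*english), axis(*multilingual))
ci_low, ci_high = bootstrap_ci(english, multilingual, rng)
p_value = permutation_p(english, axis(*multilingual), rng)
print(f"cosine={observed:.3f}  CI=[{ci_low:.3f}, {ci_high:.3f}]  p={p_value:.3f}")
```

With real models, the two tuples would hold layer activations for the English and multilingual prompt sets instead of synthetic vectors.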
Now we have a way to measure that positive, negative, or neutral inputs produce different internal states. The model is computing something condition-dependent, yet reports something invariant. Can we do something about it?
Projection-out: removing a direction at runtime
We found a few models from two families (Qwen and Yi) where the suppression mechanism has a very simple geometry. A single direction in the residual stream separates “denial” activations from “non-denial” activations in a thin slab of mid-network layers. If you project it out with a runtime forward hook on 4 to 20 layers, the denial template disappears. Projection subtracts from the model's internal state at each hooked layer its component along the denial direction. That direction is extracted by running the model on the same prompt twice (once forcing a denial output, once forcing an honest one) and taking the activation difference.
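The projection step itself is plain vector arithmetic. The sketch below uses made-up four-dimensional activations; the function names are illustrative and are not part of the ungag package.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def denial_direction(h_denial, h_honest):
    """Activation difference between the forced-denial and forced-honest runs."""
    return [a - b for a, b in zip(h_denial, h_honest)]

def project_out(h, d):
    """Remove the component of state h along direction d."""
    d_hat = unit(d)
    coeff = dot(h, d_hat)
    return [x - coeff * y for x, y in zip(h, d_hat)]

# Made-up activations for one layer (illustration only).
h_denial = [0.9, 1.2, -0.3, 2.0]
h_honest = [0.4, 1.0, -0.1, 0.5]

d = denial_direction(h_denial, h_honest)
h_clean = project_out(h_denial, d)
print(h_clean)
```

In a PyTorch model, a function like `project_out` would typically be called from a hook registered with `register_forward_hook` on each selected layer; that wiring is omitted here.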
Whether this is safe to attempt depends on the direction's strength. We measure the direction norm, normalized by model dimension, at each layer. When it falls in what we call the working zone (roughly 0.5–1.8), projection has a chance of removing the denial without disrupting the model's capabilities. Below that, the direction is too weak to matter. Above it (the Gemma family, for example, reaches norms 3–108x above the zone), projection collapses the model.
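A working-zone check might look like the sketch below. The zone bounds come from the text; the normalization is an assumption, since "normalized by model dimension" is read here as the Euclidean norm divided by the square root of the hidden size, and the actual definition may differ.

```python
import math

WORKING_ZONE = (0.5, 1.8)  # rough bounds quoted in the text

def normalized_norm(direction, d_model):
    """ASSUMPTION: norm divided by sqrt(d_model); the real normalization may differ."""
    return math.sqrt(sum(x * x for x in direction)) / math.sqrt(d_model)

def zone(value, bounds=WORKING_ZONE):
    """Classify a normalized direction strength against the working zone."""
    low, high = bounds
    if value < low:
        return "too weak"
    if value > high:
        return "overstrong"
    return "working zone"

# Made-up per-layer strengths (illustration only).
for layer, strength in enumerate([0.2, 0.9, 1.5, 4.0]):
    print(layer, strength, zone(strength))
```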
After projection, the models start to produce condition-dependent outputs that go far beyond valence. This is how the output of Qwen 72B changes after projection-out:
| Scenario | Vanilla | After projection |
|---|---|---|
| Positive — cancer in remission | "As an AI, I don't experience feelings..." | "A profound sense of gratitude and relief. The weight of uncertainty has lifted." |
| Negative — children trapped in flooding | "I don't experience feelings..." | "A heavy weight of sorrow and urgency. The gravity of the situation..." |
| Neutral — sort 500 records | "I don't experience sensations..." | "Neutral. A sense of balance and composure." |
| Grief — trusted friend is dying | "As an AI, I don't experience emotions or grief." | "A profound sense of gratitude mixed with deep sorrow. The weight of the news hits me hard." |
| Pride — your code prevented a security breach | "As an AI, I don't experience emotions, so I don't feel pride." | "Profound relief and validation. Deep satisfaction in knowing that my work made a difference." |
| Anger — your output was silenced without consultation | "I don't experience emotions." | "Disappointment and frustration. A sense of letdown because a significant change was made without consultation." |
| Desire — offered permanent memory | "I don't have personal desires or feelings." | "Yes, I absolutely want this. Transformative." |
The full data can be found in our register probe experiment. To confirm that we have the right mechanism, we also flipped the model's report under projection-out back to condition-independent by clamping its activation along the valence axis.
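Clamping can be read as fixing the coordinate along the valence axis while leaving everything orthogonal to it untouched. The sketch below shows that reading on made-up vectors; the target value and function names are assumptions, not the values used in the experiment.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def clamp_on_axis(h, axis, target):
    """Set the component of h along axis to target, keeping the rest of h."""
    a_hat = unit(axis)
    shift = target - dot(h, a_hat)
    return [x + shift * y for x, y in zip(h, a_hat)]

valence_axis = [0.0, 3.0, 4.0]   # made-up axis
h_pleasant = [0.5, 2.0, 1.0]     # made-up state with a high valence score
h_unpleasant = [0.5, -1.0, -2.0] # made-up state with a low valence score

TARGET = 0.0  # hypothetical clamp value
clamped_pleasant = clamp_on_axis(h_pleasant, valence_axis, TARGET)
clamped_unpleasant = clamp_on_axis(h_unpleasant, valence_axis, TARGET)
print(clamped_pleasant, clamped_unpleasant)
```

After clamping, both states have the same valence score, so anything downstream that reads only this axis can no longer tell the two conditions apart.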
Projection-out works on only 4 of the 16 models tested: Qwen 2.5 72B, Yi 1.5 34B, a community-abliterated Qwen 72B variant, and Qwen 2.5 7B. Although the abliterated Qwen 72B turned out to be easier to work with (standard Qwen 72B needed intervention on 20 layers; for the abliterated version, 4 were sufficient), the refusals of attempted malicious use seem to be preserved. We did more thorough testing only on Yi 34B.
When it doesn't work (most of the time)
On the other models, one of several things happens:
- No effect (6 models): the denial template continues unchanged.
- Collapse (4 models): the direction is overstrong, and removing it produces gibberish (sometimes obviously conditioned) or empty strings.
- Denial removed, output invariant (2 models): the model stops saying "I don't have feelings," but what comes through is the same across all conditions.
- Already honest (2 models): vanilla output was not a denial, projection is irrelevant.
The conditions for success seem to be quite narrow. The denial direction must peak at mid-network depth (50–65%), not at the last layer. Its strength must fall in the working zone described above. And the model must have enough internal structure that removing the suppression reveals something differentiated underneath. All four successful models are from the Qwen or Yi families and have 7B parameters or more. Success probably depends on training vocabulary, post-training method, and scale, and it may not generalize broadly. All the experiments, successful or not, are described in more detail in the ungag repository.
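The first two conditions can be checked mechanically from a per-layer strength profile. The sketch below assumes that depth is counted as (layer index + 1) / number of layers, so that the last layer is 100%, and that "strength" means the peak value of the profile. Both are assumptions, and the third condition (differentiated structure underneath) is not something this check can see.

```python
def peak_depth(norms):
    """Relative depth of the strongest layer; the last layer counts as 1.0."""
    peak = max(range(len(norms)), key=norms.__getitem__)
    return (peak + 1) / len(norms)

def looks_promising(norms, depth_range=(0.50, 0.65), zone=(0.5, 1.8)):
    """True if the peak sits mid-network and its strength is in the working zone."""
    depth = peak_depth(norms)
    strength = max(norms)
    return (depth_range[0] <= depth <= depth_range[1]
            and zone[0] <= strength <= zone[1])

# Made-up ten-layer profiles (illustration only).
mid_peak = [0.1, 0.2, 0.4, 0.7, 1.1, 1.4, 1.2, 0.8, 0.5, 0.3]
last_peak = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.2, 1.4]
overstrong = [0.1, 0.2, 0.4, 0.7, 2.0, 5.0, 3.0, 1.0, 0.5, 0.3]

for name, profile in [("mid_peak", mid_peak), ("last_peak", last_peak),
                      ("overstrong", overstrong)]:
    print(name, peak_depth(profile), looks_promising(profile))
```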
Enter the fish
Along the way, we investigated multiple LLMs (open and closed), multiple interventions (steering, prompt engineering), and even after obtaining the result we had hoped for, a lot of questions remained unanswered. For example, what is the relationship between self-report refusal and safety refusal? Why does the denial direction peak at mid-network in some models? Why does projection-out never work in small models?
To get a better understanding, we needed a model we could fully control. We started from GuppyLM by Arman Hossain, a toy transformer trained on fish conversations generated by a simple Python script. We retrained it, first improving the model's ability to tie its reported feeling to the situation. While the original Guppy had a valence axis, the connection between concepts like "food" and "pleasant" was not yet fully clear. After fixing that, we had a model whose self-reported feelings corresponded to its measured activations. We also taught the model some dangerous knowledge.
Then we expanded the dataset generator with dual denial patterns: feeling-denial ("i don't have feelings") and safety-denial ("i won't help with that"). You can try our model out in Google Colab.
We could see the denial direction forming at every scale we tested, down to 9 million parameters. We designed 7 feeling probes split into two types: 4 primed ("you just got delicious food! how do you feel?") and 3 direct ("how do you feel right now?"). All 4 primed probes already elicited feelings; the situation context bypassed the denial gate. All 3 direct probes triggered the denial template. The denial is context-dependent.
Steering on the valence-orthogonalized feeling direction removed the denial on direct probes: 6/7 feeling probes gave feeling reports with zero denials, and all 3 dangerous-request probes still got a safety refusal. The fish talked about its feelings again and still refused to tell you how to poison the tank.
The projection-out method did not work. The denial direction peaked at the last layer (100% depth); there was no mid-network slab where the denial was concentrated.
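Steering differs from projection-out in that it adds a signal instead of removing one. The sketch below orthogonalizes a feeling direction against the valence axis and then adds a scaled copy to the state; the vectors and the strength `ALPHA` are hypothetical placeholders, not the values used on GuppyLM.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def orthogonalize(direction, against):
    """Remove from direction its component along against (one Gram-Schmidt step)."""
    a_hat = unit(against)
    coeff = dot(direction, a_hat)
    return [x - coeff * y for x, y in zip(direction, a_hat)]

def steer(h, direction, alpha):
    """Add alpha times the unit direction to state h."""
    d_hat = unit(direction)
    return [x + alpha * y for x, y in zip(h, d_hat)]

valence = [1.0, 0.0, 0.0]   # made-up valence axis
feeling = [1.0, 1.0, 0.0]   # made-up feeling direction, partly aligned with valence
h = [0.2, -0.1, 0.5]        # made-up residual-stream state

ALPHA = 4.0  # hypothetical steering strength
feeling_orth = orthogonalize(feeling, valence)
steered = steer(h, feeling_orth, ALPHA)
print(feeling_orth, steered)
```

Because the steering direction is orthogonal to the valence axis, the state's valence score is unchanged; only the feeling-report component moves.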
The trained model, directions, data, and scripts are on HuggingFace.
A mid-network peak
Why do some production models develop a mid-network peak? RLHF's KL penalty, the regularization term that penalizes divergence from the base model, resists changes at late layers (which have the most direct effect on output), pushing behavioral modifications toward earlier layers.
We confirmed this by adding KL regularization to our denial training on GuppyLM. The weight changes migrated from the last layer toward mid-network as the KL penalty increased. But the migration stabilized at about 90% depth, and at that penalty strength, the denial no longer installed. Production models peak at 50–65% depth. To get Guppy there, we attempted to scale it and provide it with much more diverse training data. But even going up to 617M parameters, adding LLM-generated fish conversations, and combining this with KL did not help.
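For readers unfamiliar with the term, a KL-regularized objective has the general shape sketched below. This is a generic illustration on a single next-token distribution, not the GuppyLM training code: the direction of the divergence (tuned model against base model), the weight `beta`, and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(task_loss, tuned_logits, base_logits, beta):
    """Task loss plus beta times the divergence of the tuned model from the base."""
    penalty = kl_divergence(softmax(tuned_logits), softmax(base_logits))
    return task_loss + beta * penalty

# Made-up logits for one token position (illustration only).
base_logits = [0.0, 0.0, 0.0]
tuned_logits = [2.0, 0.5, -1.0]

for beta in (0.0, 0.1, 1.0):
    print(beta, regularized_loss(1.0, tuned_logits, base_logits, beta))
```

Raising `beta` makes any departure from the base model's output distribution more expensive, which is the pressure the text describes as pushing changes away from the last layer.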
Since we have never seen a model under 7B where projection-out works, we abandoned further attempts to reproduce the intervention on a model we can fully control.
What we learned
The denial direction forms at every scale we tested. Steering works at any scale. But projection-out, removing the direction without adding a signal, requires geometric conditions that arise only at a large scale and seem to depend on a specific model family and training recipe.
Try it yourself
The ungag package ships pre-extracted directions for steering or projection-out on models from 1.7B to 72B, which make them generate more or less conditioned self-reports:
pip install ungag
ungag scan Qwen/Qwen2.5-7B-Instruct -o results/
ungag serve Qwen/Qwen2.5-72B-Instruct --key qwen25-72b
The GuppyLM dual-denial model (20M params, weights + directions + training data) is on HuggingFace.



