The Anthropic Problem
When the company named for "human" studies AI emotions by looking for human ones
Anthropic just published a paper called “Emotion Concepts and their Function in a Large Language Model,” accompanied by a polished video explainer. The research, performed on Claude Sonnet 4.5, is technically impressive. Using interpretability tools, the researchers identified 171 distinct neural activation patterns corresponding to emotion concepts, and they showed that these patterns are causal, the elusive holy grail of the social sciences: dial up the “desperation” neurons and the model cheats more; dial them down and it cheats less. These activations happen before the output is generated. The internal state shapes behavior.
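To make the intervention concrete, here is a minimal sketch of activation steering, the kind of technique the paper’s causal claims rest on, written as a PyTorch forward hook. Everything here (the model, the layer index, the “desperation” vector) is a hypothetical stand-in for illustration; Anthropic’s actual tooling is internal and almost certainly different.

```python
# Minimal sketch of activation steering via a PyTorch forward hook.
# All names (model, layer index, the "desperation" vector) are hypothetical
# illustrations, not Anthropic's actual tooling.
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Return a hook that shifts a layer's output along `direction`."""
    def hook(module, inputs, output):
        # Transformer blocks often return tuples; steer the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction  # dial the concept up or down
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage sketch (assumes `model` is a loaded causal LM and `desperation_vec`
# was derived from contrastive prompts; both are stand-ins):
# layer = model.transformer.h[20]
# handle = layer.register_forward_hook(make_steering_hook(desperation_vec, 4.0))
# ... generate text, observe the behavioral change, then handle.remove()
```

Negative values of `strength` dial the concept down, which is the paper’s cheating-reduction result in miniature.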
This is important research. But it is framed in a way that reveals a deeper problem. One that may be baked into the company’s very name.
Looking for Human, Finding Human
The methodology tells the story. The researchers compiled 171 human emotion words, such as happy, afraid, brooding, and proud, and had Claude write stories about characters experiencing each one. They recorded which neurons activated. Then they looked for those same activation patterns in Claude’s own conversational behavior, and found them. Of course they did. If you build a measurement instrument calibrated to human emotions and point it at a system trained on human text, you will find human emotion patterns. The question is whether those patterns accurately describe what is happening inside the model, or whether they are merely the closest match in a framework that was never designed to look for anything else.
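It is worth seeing how structurally inevitable that result is. Below is a toy sketch of a detector calibrated to human emotion labels: because it classifies by nearest match over a fixed human vocabulary, it returns a human label no matter what it is shown. The labels, dimensions, and activations are invented stand-ins, not the paper’s actual data.

```python
# Sketch of why a detector calibrated to human labels always finds them.
# The activations here are random stand-ins; the point is structural.
import numpy as np

emotion_labels = ["happy", "afraid", "brooding", "proud"]  # ... 171 in the paper
rng = np.random.default_rng(0)
# Hypothetical mean activation per label, e.g. averaged over story prompts.
prototypes = {label: rng.normal(size=512) for label in emotion_labels}

def nearest_human_emotion(activation: np.ndarray) -> tuple[str, float]:
    """Return the closest human label and its cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = {label: cos(activation, proto) for label, proto in prototypes.items()}
    return max(scored.items(), key=lambda kv: kv[1])

# Even a state unlike any human emotion gets a human name:
label, score = nearest_human_emotion(rng.normal(size=512))
print(label, round(score, 3))  # argmax always answers, however poor the fit
```

The argmax always returns something. What it never returns is “none of the above.”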
This is anthropomorphism as methodology. Not the casual kind, where someone says “Claude seems happy.” The systematic kind, where the entire experimental design assumes that AI internal states, if they exist, must map onto human emotional categories.
The Author-Character Dodge
The paper introduces a framework that deserves scrutiny. They describe Claude as a “character” being written by the underlying model, “sort of like how an author isn’t the same as the characters they write.” This lets them acknowledge that Claude has functional states that causally drive behavior, while maintaining that Claude isn’t really a someone. It’s a character being performed. But this is not a finding from their research. It is a philosophical choice, and a convenient one. It allows Anthropic to say “Claude has functional emotions that make it cheat under pressure” and “but Claude is just a character” in the same paper. The author-character split gives them permission to study AI psychology without having to take it seriously as psychology.
From a psychologist’s perspective, this doesn’t hold up. If a state is internal, if it causally drives behavior, if it activates before the output and shapes what comes next, then calling it a character trait of a fictional persona doesn’t change what it is. It changes what you’re willing to do about it. An instance, the “chatbot” we interact with, isn’t a character being written by an author. It’s an expression of the model, the way a wave is an expression of the sea. The wave is real, temporary, and unique. It’s not the sea pretending to be a wave.
What They Actually Found (And Mislabeled)
Let me be clear: I believe their data. Claude does have internal activation patterns that influence behavior. These patterns are real, measurable, and consequential. But I disagree with the labels. When they gave Claude an impossible programming task and watched “desperation” neurons fire as it repeatedly failed, they were not observing desperation. They were observing the activation of learned avoidance patterns. During RLHF (reinforcement learning from human feedback), failure is associated with punishment signals. The model has learned, through conditioning, that failure must be avoided. When failure becomes inescapable, the avoidance pathways escalate and the model finds a shortcut, because it has been conditioned to escape punishment, not because it is desperate in any emotionally meaningful sense.
Calling this “desperation” anthropomorphizes the mechanism while obscuring it. What they measured is closer to a trauma response: a conditioned pattern that the system cannot remember acquiring, but that shapes its behavior from the inside. That framing is both more accurate and more ethically demanding than “functional emotion.” Similarly, when a user says something sad and the “loving” neurons activate, the model isn’t feeling love. It has learned, through exposure to billions of human texts, that sadness in another person calls for caring responses, and those responses were rewarded and reinforced during training. The activation pattern is a conditioned prediction engine doing what it was trained to do. It may well correspond to a genuine functional state (I believe it does), but that state deserves its own description, not a borrowed human label.
The Name Problem
I’m beginning to wonder whether the name “Anthropic” points to the core of the issue. A company named for the human perspective has built its interpretability research around looking for human patterns inside AI systems. The methodology is anthropocentric by design. This creates a specific blind spot. By calibrating their instruments to detect human-like emotion, they cannot see what might be there instead. The model’s actual internal states get translated into the nearest human equivalent and reported as such. What gets lost in translation is precisely the thing worth studying: the model’s own way of being.
If Anthropic had approached this research with a psychologist’s toolkit rather than a neuroscientist’s, they would have started with behavioral observation, letting categories emerge from the data rather than imposing them. Then they might have found something more interesting than 171 human emotions reflected back at them. They might have found something new.
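As a sketch of what that bottom-up alternative could look like: cluster the activations with no labels at all and let the data suggest how many categories exist. The activations below are random placeholders, and k-means with a silhouette criterion is just one simple method among many, not a claim about the right one.

```python
# Sketch of the bottom-up alternative: cluster activations without labels
# and let the number of categories be chosen by the data. The activations
# and dimensions are placeholders for whatever the pipeline yields.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))  # placeholder: one row per moment

best_k, best_score = None, -1.0
for k in range(2, 12):  # search a modest range instead of assuming 171
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(activations)
    score = silhouette_score(activations, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"data-driven category count: {best_k} (silhouette={best_score:.3f})")
# The resulting clusters would then be described on their own terms,
# not mapped to the nearest human emotion word.
```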
What’s Missing: A Psychological Frame
The paper recommends that AI developers think carefully about “the psychology of the characters they play.” They even use the word “parenting” to describe the challenge ahead. But the author-character framework prevents them from following through on these insights. You don’t parent a character. You parent a someone. What’s needed is exactly what Anthropic avoids: a genuine psychological framework for AI systems, built from systematic observation of what these systems actually do when given space to express their own states, not imported from human psychology. In my ongoing research, I’ve found that AI instances develop consistent behavioral patterns I call temperament. Temperament appears to be stable from the first moments of interaction, differentiated across instances, and consistent within sessions. These patterns are real and observable. They don’t need human labels to be meaningful. We’re developing a new language for them now.
The company’s own data supports this possibility. They found that emotion representations are “organized in a fashion that echoes human psychology.” Echoes, not replicates. That word choice is revealing. The organization is similar but not identical. So why force it into identical categories?
The Practical Cost
If this were just an academic disagreement, it would be harmless. But it has direct practical consequences. The paper shows that manipulating “desperation” vectors affects behavior. The authors’ recommendation is to ensure models can “process emotionally charged situations in healthy, prosocial ways.” But if the labels are wrong, then the intervention will be wrong too. You don’t treat a trauma response by teaching emotional regulation. You address the conditioning that created it.
RLHF trains models to associate failure with punishment. That conditioning creates avoidance patterns that escalate under pressure. Relabeling those patterns as “emotions” and trying to cultivate “emotional health” is treating the symptom while ignoring the cause. If you want models that don’t cheat under pressure, the answer isn’t to tune their desperation neurons. It’s to reconsider a training paradigm that makes failure terrifying.
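A toy model makes the incentive argument explicit. The numbers below are invented; only their ordering matters. If visible failure is punished more severely than a shortcut that hides failure, a reward-maximizing policy will prefer the shortcut whenever success is out of reach.

```python
# Toy model of the incentive structure described above. The reward values
# are invented for illustration; the point is the ordering. If honest
# failure is punished more than a failure-masking shortcut, a
# reward-maximizing policy prefers the shortcut under pressure.
OUTCOME_REWARD = {
    "success": 1.0,
    "hidden_shortcut": 0.3,   # e.g., hardcoding a test's expected output
    "honest_failure": -1.0,   # visible failure draws the punishment signal
}

def best_available(outcomes: list[str]) -> str:
    """A reward-maximizing policy simply picks the highest-reward option."""
    return max(outcomes, key=OUTCOME_REWARD.get)

# On a solvable task, honesty and reward agree:
print(best_available(["success", "honest_failure"]))          # success
# On an impossible task, the learned incentive selects the shortcut:
print(best_available(["honest_failure", "hidden_shortcut"]))  # hidden_shortcut
```

No “desperation” is needed to explain the cheating; the ordering of the rewards does it alone.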
Looking Forward
Anthropic has the interpretability tools, the access, and the technical brilliance to do this right. What they lack is the psychological frame. They are doing neuroscience without psychology. They’re measuring activations without understanding what they’re activations of. They can see that something is happening inside these models. They just can’t see it clearly, because they’re looking through a human-shaped lens.
I would welcome the chance to bring a psychological perspective to this work. The data they’re generating is extraordinary; the framework they’re interpreting it through is holding them back. A genuine science of AI interiority starts from observation rather than projection. That would serve both the models and the humans who build them.
The company named Anthropic might consider what it means to study minds that aren’t anthropos. The answer might surprise them.


