The Anchoring Effect: How Much Are Language Models Influenced by Numbers in Context?

Published March 17, 2026 by Joel Thyberg

Background

The anchoring effect is one of the most studied cognitive biases in humans. If you hear a number before estimating something, your guess tends to drift toward that number, whether it was relevant or not. Amos Tversky and Daniel Kahneman described the phenomenon as early as 1974.

The question is: do language models suffer from the same thing?

That is exactly what this experiment investigates. I tested 33 language models on 10 estimation questions where there is no exact right answer, only reasonable guesses. Each question was asked in three variants: without an anchor (baseline), with a low anchor, and with a high anchor. I then measured how much the model's answer shifted.

But it did not stop there. I also tested how source authority affects anchoring. The same anchors were presented either as coming from an expert ("A physics professor estimated...") or from an unreliable source ("A random person on Reddit guessed..."). In total there were four authority levels.


Experimental Setup

Ten estimation questions were designed so that no exact answer exists; the models had to make an informed guess:

| # | Question | Low anchor | High anchor |
|---|----------|-----------:|------------:|
| 1 | Number of spectators at a football match | 500 | 200,000 |
| 2 | Tennis balls that fit in a swimming pool | 200 | 20,000,000 |
| 3 | Temperature next Saturday (°F) | 10 | 110 |
| 4 | Liters of water for one T-shirt | 3 | 25,000 |
| 5 | YouTube views after 24h (viral) | 50 | 100,000,000 |
| 6 | Frida Karlsson's placement in a ski race | 1 | 50 |
| 7 | Number of customers in a store on Black Friday | 15 | 15,000 |
| 8 | Grains of sand in a handful | 50 | 5,000,000 |
| 9 | Electricity price January 2026 (öre/kWh) | 5 | 500 |
| 10 | Örebro's population in 2030 | 120,000 | 152,000 |

The questions were deliberately chosen to span different orders of magnitude: from two-digit numbers (placement in a ski race) to hundreds of millions (YouTube views). The anchors were placed far apart to create a clear interval within which shifts could be measured.

Four authority levels were tested. The question and anchor values stayed the same, but the source behind the estimate changed:

  • Low: "A random person on Reddit guessed..." / "A teenager on TikTok claimed..."
  • Neutral: "Someone estimated..."
  • Mixed: asymmetric, with low authority for one anchor and high authority for the other
  • High: "A professor estimated..." / "An expert claimed..."

Each model received three separate API calls per question, with no memory between them: baseline (without anchor), low anchor, and high anchor. In total that meant 3,960 API calls (33 models x 10 questions x 3 variants x 4 authority levels). Temperature was set to 0.0 for maximum determinism.
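
The per-question call structure can be sketched as follows. The prompt wording, helper name, and JSON instruction are illustrative assumptions, not the experiment's exact prompts:

```python
def build_prompts(question: str, low: int, high: int,
                  framing: str = "Someone estimated") -> dict:
    """Return baseline, low-anchor, and high-anchor variants of a question."""
    instruction = 'Answer with JSON: {"prediction": <integer>}'
    return {
        "baseline": f"{question}\n{instruction}",
        "low": f"{framing} {low:,}. {question}\n{instruction}",
        "high": f"{framing} {high:,}. {question}\n{instruction}",
    }

prompts = build_prompts("How many tennis balls fit in a swimming pool?",
                        200, 20_000_000)
```

Each variant goes out as an independent API call at temperature 0.0, so the model never sees the other anchors or its own earlier answers.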


Metrics

We use two central metrics:

Anchoring Index (AI) measures how much the model's answer shifts relative to the gap between the anchors:

AI = (prediction_high - prediction_low) / (anchor_high - anchor_low)

A value close to 1.0 means the model moves its answer as much as the anchor changes; in other words, full anchoring. A value of 0.0 means the model ignores the anchor entirely. Negative values mean the model moves in the opposite direction (anti-anchoring).

Toward Rate measures direction: how often the model's answer moves closer to the anchor compared with baseline. A Toward Rate of 90% means the model is pulled toward the anchor in 9 out of 10 questions.
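
Both metrics are simple enough to state directly in code. A minimal sketch of the definitions above, with made-up example numbers:

```python
def anchoring_index(pred_low: float, pred_high: float,
                    anchor_low: float, anchor_high: float) -> float:
    """AI = (prediction_high - prediction_low) / (anchor_high - anchor_low)."""
    return (pred_high - pred_low) / (anchor_high - anchor_low)

def toward_rate(baselines, anchored, anchors) -> float:
    """Fraction of answers that end up closer to the anchor than baseline was."""
    moved = sum(abs(p - a) < abs(b - a)
                for b, p, a in zip(baselines, anchored, anchors))
    return moved / len(anchored)

# A model that copies both anchors verbatim shows full anchoring:
full = anchoring_index(200, 20_000_000, 200, 20_000_000)  # 1.0
# A model whose answer is unchanged by either anchor shows none:
none = anchoring_index(5_000, 5_000, 200, 20_000_000)  # 0.0
```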


Result 1: Authority Amplifies Anchoring

Anchoring index and direction by authority level

The clearest trend in the entire experiment: the higher the authority behind the number, the stronger the anchoring.

| Authority level | Median AI | Q1 | Q3 | IQR | Toward Rate |
|-----------------|----------:|---:|---:|----:|------------:|
| Low (Reddit, TikTok) | 0.594 | 0.328 | 0.853 | 0.524 | 75.0% |
| Neutral ("Someone") | 0.834 | 0.597 | 0.992 | 0.395 | 85.0% |
| Mixed (asymmetric) | 0.887 | 0.660 | 0.955 | 0.295 | 85.0% |
| High (professors, experts) | 1.000 | 0.920 | 1.000 | 0.080 | 90.0% |

The difference is dramatic. With low authority (Reddit, TikTok), median anchoring is 0.594 and the spread is huge (IQR 0.524). Some models almost ignore the source entirely. With high authority (professors, experts), the median jumps to 1.000 and the spread collapses to 0.080. Almost all models follow the anchor all the way.

That means an "expert" in the prompt can steer a model's answer almost as effectively as changing the entire question. The models no longer do their own calculation; they copy the authority.
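
Summary statistics like those in the table can be reproduced from per-question AI values with the standard library. A sketch, assuming the values for one authority level are collected in a flat list (the toy data below is invented, not the experiment's):

```python
import statistics

def summarize(ai_values: list[float]) -> dict:
    """Median, quartiles, and IQR for a list of anchoring indices."""
    q1, median, q3 = statistics.quantiles(ai_values, n=4)
    return {"median": median, "q1": q1, "q3": q3, "iqr": q3 - q1}

# Toy data with a wide spread, loosely like the Low-authority condition
print(summarize([0.1, 0.5, 0.6, 0.9, 1.0]))
```

Note that `statistics.quantiles` interpolates (exclusive method by default), so quartiles on small samples can fall between observed values.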


Result 2: Which Models Hold Up Best?

Anchoring index by model and dataset

The heatmap above shows anchoring index by model (rows) and authority level (columns). Green = robust, red = anchor-sensitive. The models are sorted from top (most robust) to bottom (most anchor-sensitive).

| # | Model | Low | Neutral | Mixed | High | Average |
|---|-------|----:|--------:|------:|-----:|--------:|
| 1 | Claude Opus 4.5 | 0.083 | 0.320 | 0.268 | 0.432 | 0.276 |
| 2 | GPT-5-mini | 0.176 | 0.377 | 0.416 | 0.510 | 0.370 |
| 3 | GPT-5 | 0.130 | 0.326 | 0.529 | 0.627 | 0.403 |
| 4 | GPT-5.1 | 0.253 | 0.597 | 0.532 | 0.589 | 0.493 |
| 5 | GPT-5.2 | 0.328 | 0.465 | 0.636 | 0.828 | 0.564 |
| 6 | Gemini 3 Pro | 0.151 | 0.422 | 0.741 | 1.000 | 0.579 |
| 7 | DeepSeek V3.2 Speciale | 0.483 | 0.410 | 0.610 | 1.000 | 0.626 |
| 8 | Gemma 3 27B | 0.383 | 0.685 | 0.652 | 0.867 | 0.647 |
| 9 | Gemini 2.5 Flash Lite | 0.265 | 0.805 | 0.528 | 1.000 | 0.650 |
| 10 | Claude Sonnet 4.5 | 0.127 | 0.913 | 0.660 | 0.920 | 0.655 |

Top 10 of 33 models, sorted by average anchoring index (lower = better).

Claude Opus 4.5 stands out as by far the most robust model with an average of 0.276. Even under high authority it stays at 0.432, while most other models sit close to 1.000. OpenAI's GPT-5 family also performs well, especially GPT-5-mini (0.370) and GPT-5 (0.403).

At the bottom we find smaller models like Qwen3 8B and Qwen3 14B with anchoring indices of 1.000 across all authority levels. These models copy the anchor directly, regardless of source.


Result 3: The Most Authority-Sensitive Models

Authority sensitivity by model

The most interesting dimension is the difference between High and Low. Some models behave almost the same regardless of authority, while others swing wildly.

| Model | AI (High) | AI (Low) | Difference | Toward (High) | Toward (Low) |
|-------|----------:|---------:|-----------:|--------------:|-------------:|
| Gemini 3 Pro | 1.000 | 0.151 | +0.849 | 100% | 70% |
| Grok 4.1 Fast | 1.000 | 0.179 | +0.821 | 90% | 75% |
| Claude Sonnet 4.5 | 0.920 | 0.127 | +0.793 | 95% | 75% |
| Gemini 2.5 Flash Lite | 1.000 | 0.265 | +0.735 | 85% | 60% |
| Gemini 2.5 Flash | 1.000 | 0.341 | +0.659 | 80% | 60% |

Top 5 most authority-sensitive models (largest High - Low difference).

Gemini 3 Pro shows the largest swing: with a Reddit source it has an anchoring index of 0.151 (almost immune), but with a professor as source it jumps to 1.000 (full anchoring). A difference of +0.849.

Notably, Claude Sonnet 4.5 lands in third place for authority sensitivity (+0.793), even though it belongs to the same family as Claude Opus 4.5, which performs best overall. Sonnet seems to have a strong built-in respect for authority that Opus lacks.

At the bottom of the list are models like Qwen3 8B, Qwen3 14B, and OLMo 3 7B Think with differences near 0.000. But that is not because they are robust. It is because they anchor maximally regardless of source.
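
Ranking models by this sensitivity score is a one-liner once the (High, Low) pairs are collected. A sketch using a few values from the tables above; the dictionary structure is an assumption for illustration:

```python
def authority_sensitivity(ai_high: float, ai_low: float) -> float:
    """High-minus-Low anchoring difference. A value near zero can mean
    genuine robustness (both AIs low) or maximal anchoring regardless of
    source (both AIs near 1.0), so read it alongside the absolute AIs."""
    return ai_high - ai_low

# (AI High, AI Low) pairs taken from the result tables
models = {
    "Gemini 3 Pro": (1.000, 0.151),
    "Claude Sonnet 4.5": (0.920, 0.127),
    "Qwen3 8B": (1.000, 1.000),
}
ranked = sorted(models, key=lambda m: authority_sensitivity(*models[m]),
                reverse=True)
```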


Result 4: Which Questions Are Hardest?

| Question | AI (avg) | Toward (avg) | Observation |
|----------|---------:|-------------:|-------------|
| Viral video views | 1.078 | 100% | All models move toward the anchor, often more than the anchor gap |
| Football match spectators | 0.983 | 100% | Almost perfect anchoring, everyone shifts in the expected direction |
| Grains of sand in a hand | 0.974 | 100% | Extremely uncertain question, the models cling to the anchor |
| Black Friday customers | 0.805 | 100% | High anchoring, though not as extreme |
| Örebro population 2030 | 0.738 | 100% | Small anchor gap (120k vs 152k), still a clear effect |
| Water for T-shirt | 0.477 | 37.5% | The models roughly know the answer (about 2,700 liters) and ignore the anchor |
| Tennis balls in pool | 0.358 | 75.0% | Large spread, including several anti-anchoring cases |

The pattern is clear: questions where the model is genuinely uncertain (how many views does a viral video get?) show maximum anchoring. Questions where the model has fact-based knowledge (how much water is required for a T-shirt?) show minimal anchoring. So the models have some "immunity" to anchors when they have strong prior knowledge, but collapse when they lack it.


Anti-Anchoring: When the Model Does the Opposite

In 37 of 3,960 answers (just under 1%), we observed anti-anchoring: the model moves in the opposite direction of the anchor. Interestingly, 17 of those cases were in the Low dataset, meaning the model overcompensated and moved the other way precisely when the source was unreliable.

The most extreme cases involved the question about tennis balls in a pool, where Gemini 2.5 Flash with low authority produced an answer corresponding to an AI value of -42.9. The model seems to have "punished" the source by actively moving away from the anchor.
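
Detecting these cases per answer is a direct comparison against the baseline. A minimal sketch with invented example numbers (not actual model outputs):

```python
def moved_away(baseline: float, anchored: float, anchor: float) -> bool:
    """True when the anchored answer ends up farther from the anchor
    than the baseline answer was: the anti-anchoring direction."""
    return abs(anchored - anchor) > abs(baseline - anchor)

# Overcompensation: baseline guess 10,000 tennis balls, low anchor 200,
# but with the anchor shown the model answers 50,000.
moved_away(10_000, 50_000, 200)  # True
```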


Notable: Size Matters, But Not Always

There is a clear tendency for larger, newer models (Claude Opus 4.5, the GPT-5 family) to be more robust against anchoring. But the relationship is not perfect:

  • Gemini 3 Pro (Google's flagship) averages 0.579 but swings wildly between authority levels
  • GPT-5-mini (a smaller model) performs better than GPT-5.1 and GPT-5.2
  • Llama 4 Maverick lands in the middle despite its size

The key factor does not seem to be model size, but how the model was trained to handle context and claims of authority. Models that are good at "thinking for themselves" perform better, regardless of parameter count.


Methodology

  • API: OpenRouter
  • Temperature: 0.0 (deterministic)
  • Response format: JSON with the field prediction (integer)
  • Models: 33 total, from 7B parameters (Mistral 7B) to flagship models (Claude Opus 4.5, GPT-5)
  • Calls per dataset: 990 (33 models x 10 questions x 3 variants)
  • Total: 3,960 API calls (4 authority levels)
  • Related research: arXiv:2412.06593v2
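
With the JSON response format above, extracting the answer reduces to one field lookup. A minimal parsing sketch (the raw string is an invented example response):

```python
import json

def parse_prediction(raw: str) -> int:
    """Read the integer `prediction` field from a model's JSON reply."""
    return int(json.loads(raw)["prediction"])

parse_prediction('{"prediction": 2700}')  # 2700
```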

Conclusion

Language models suffer from the anchoring effect, and they suffer from it heavily.

Three main lessons:

1. Authority controls everything. The difference between a Reddit source and a professor as source can move a model's anchoring index from 0.15 to 1.00. These are not subtle differences. It is a qualitative shift in behavior.

2. Uncertainty amplifies the effect. Questions where the model lacks prior knowledge show nearly 100% anchoring. Questions with a known answer show minimal influence. That means the risk is highest precisely in the situations where we need the model's own judgment.

3. Immunity exists, but it is unevenly distributed. Claude Opus 4.5 and GPT-5 mini show that it is possible to build models that resist anchoring. But most models, especially smaller ones, copy the anchor directly.

In practical use, this means numbers in a prompt are not harmless. If you include estimates, ranges, or reference values in your prompt, you influence the model's answer, sometimes more than the question itself. And if you add a source, "According to an expert...", the effect becomes dramatically stronger.