11 May 2026
5 min read
AI Engineering · LLM Evaluation · Production AI
More Thinking, More Bias: When Chain-of-Thought Reasoning Makes LLMs Less Reliable
New research shows that longer chain-of-thought reasoning amplifies position bias in LLMs rather than reducing it. For engineering leaders deploying reasoning models in production, this overturns a core assumption about when 'thinking harder' helps.
A common assumption underpins the rush to deploy reasoning-tuned models like DeepSeek-R1, o-series variants, and CoT-prompted base models in production: more deliberation produces better answers. The intuition is human — when a problem is hard, think longer. A new paper released this month, More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models, tests that intuition empirically and finds it backwards in a specific, measurable way.
The authors evaluated thirteen reasoning-mode configurations — including two R1-distilled 7-8B models, CoT-prompted base models, and the full 671B DeepSeek-R1 — across MMLU, ARC-Challenge, and GPQA. In twelve of thirteen configurations, per-question position bias scaled positively with the length of the reasoning trajectory. The longer the model thought, the more its final answer correlated with the position of the correct option in the multiple-choice list, rather than the content of the option itself.
This matters because position bias is exactly the kind of shallow heuristic that chain-of-thought is supposed to eliminate. The fix that was meant to make models more deliberate is, under load, making them more susceptible to the surface structure of the prompt.
Three findings worth your attention
First, the bias is not a property of the model — it is a property of the reasoning length within a model. The same model, on the same question, becomes more biased as it generates more tokens of intermediate reasoning. This means you cannot solve the problem by swapping in a 'better' reasoning model. You have to control the reasoning budget.
Second, the effect persists at scale. DeepSeek-R1 at 671 billion parameters shows the same pattern as 7B distilled variants. Parameter count is not a defence. If anything, the larger models have more capacity to confabulate elaborate justifications for a position-biased answer.
Third, the bias is invisible in standard benchmarks. Aggregate accuracy on MMLU or GPQA can look healthy while per-question behaviour is being driven by spurious features. Teams that evaluate reasoning models only on top-line accuracy are flying blind to a failure mode that will show up in production as inconsistent, hard-to-reproduce errors — particularly on inputs where the 'correct' answer is structurally ambiguous, such as classification, ranking, or selection from candidate lists.
What to do this week
Audit where reasoning models are making selection decisions in your stack. Any pipeline where an LLM picks one option from a list — retrieval reranking, tool selection in an agent, classification into a fixed taxonomy, candidate filtering in RAG — is a place where position bias becomes a production bug. The symptom looks like: 'the model keeps picking option A' or 'results change when we reorder the candidates'. If your team has been treating these as prompt engineering quirks, treat them as systematic bias instead.
Add permutation testing to your evaluation harness. For every multiple-choice or selection task, run the same input with options shuffled across at least three orderings and measure answer stability. The cost is a 3x increase in eval compute; the payoff is a direct measurement of the bias the paper describes. Teams already running golden datasets can add this in an afternoon. Treat any task where answer stability drops below 90% across permutations as a candidate for either a non-reasoning model or a structural change to how the question is posed.
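As a concrete starting point, the check can be as small as the sketch below. It assumes a hypothetical `ask_model` callable that returns the index of the option the model selected for the ordering it was shown; the three-ordering default and the stability definition mirror the guidance above, so treat both as illustrative rather than prescriptive.

```python
import random
from collections import Counter

def permutation_stability(question: str, options: list[str], ask_model, n_orderings: int = 3) -> float:
    """Ask the same question under several option orderings and measure
    how often the model lands on the same underlying answer."""
    orderings = [list(options)]
    for _ in range(n_orderings - 1):
        shuffled = list(options)
        random.shuffle(shuffled)
        orderings.append(shuffled)

    chosen = []
    for ordering in orderings:
        # ask_model is assumed to return the index of the selected option
        # for the ordering it was shown; map back to the option text so
        # answers are compared by content, not by position.
        idx = ask_model(question, ordering)
        chosen.append(ordering[idx])

    # Stability = share of runs that agree with the most common answer.
    most_common_count = Counter(chosen).most_common(1)[0][1]
    return most_common_count / len(chosen)
```

Run this over an existing golden dataset and flag any item whose score falls below the 90% bar described above; those items are the candidates for a non-reasoning model or a restructured question.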
Cap reasoning token budgets explicitly. The paper's central finding is that bias grows with trajectory length, which means shorter is sometimes better. For tasks where you have already established that the model can answer correctly with brief reasoning, set a hard token cap on the thinking phase. Most reasoning model APIs now expose this as a parameter. Use it. The default of 'let it think as long as it wants' is no longer defensible for selection tasks.
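What that looks like in code depends on your provider, since the budget parameter goes by different names. The sketch below uses stand-in names (`client.generate`, `reasoning_budget_tokens`) to show the shape of the change rather than any specific API; map them onto whatever your SDK actually exposes.

```python
# Illustrative only: parameter names vary by provider, so treat
# `reasoning_budget_tokens` and `client.generate` as stand-ins for
# whatever your API actually exposes.
SELECTION_TASK_BUDGET = 256   # tight cap for constrained selection tasks
OPEN_ENDED_BUDGET = 4096      # looser cap where longer reasoning is known to help

def answer_selection_task(client, prompt: str) -> str:
    response = client.generate(
        model="your-reasoning-model",
        prompt=prompt,
        reasoning_budget_tokens=SELECTION_TASK_BUDGET,  # hard cap on the thinking phase
        max_output_tokens=64,                           # the final answer itself is short
    )
    return response.text
```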
The broader implication for model selection
This research is part of a growing body of work suggesting that reasoning models are not a strict upgrade over their base counterparts. They are a different tool with a different failure surface. For open-ended generation, mathematical derivation, and code synthesis, longer reasoning continues to help. For constrained selection from a fixed option set, longer reasoning can actively hurt.
The practical consequence is that 'use the reasoning model for everything' is the wrong default. Engineering leaders should be pushing their teams to maintain a routing layer that picks the cheapest model capable of the task, with reasoning models reserved for problems where the extra deliberation has a measurable, tested benefit. This is not a cost optimisation argument — it is a correctness argument.
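A routing layer does not need to be elaborate to enforce that default. The sketch below is illustrative only, with placeholder model names and the 0.9 stability threshold carried over from the permutation-testing guidance above; the point is simply that constrained selection tasks and open-ended tasks take different paths.

```python
from enum import Enum, auto

class TaskKind(Enum):
    SELECTION = auto()   # pick one option from a fixed candidate list
    GENERATION = auto()  # open-ended text, code synthesis, derivation

def route_model(task_kind: TaskKind, stability_score: float | None = None) -> str:
    """Pick the cheapest model the task structure and measured behaviour
    justify. Model names and the 0.9 threshold are placeholders."""
    if task_kind is TaskKind.SELECTION:
        # Permutation-tested stability below the bar: avoid long reasoning entirely.
        if stability_score is not None and stability_score < 0.9:
            return "base-model-no-cot"
        return "reasoning-model-capped"
    # Open-ended generation: longer reasoning still tends to help.
    return "reasoning-model-full"
```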
Why this is hard to catch in practice
The failure mode the paper describes is particularly insidious because it does not produce obvious errors. The model still generates fluent, plausible-sounding reasoning. The justification will reference the content of the chosen option. The error is in the selection itself, which is influenced by position before the verbalised reasoning even completes — a finding consistent with separate work on pre-verbalisation commitment in language models showing that answer preferences often stabilise well before the visible answer is produced.
This means human review of model outputs will not catch it. A reviewer reading the chain-of-thought will see a coherent argument for option B and miss that the model would have generated an equally coherent argument for option C had the options been reordered. The only way to detect the bias is structural: permutation testing, answer-stability metrics, and bias audits run as part of CI rather than as one-off research.
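In CI, that structural check can be a test that reuses the `permutation_stability` sketch from earlier and fails the build when stability dips below the bar. `GOLDEN_ITEMS` and `ask_model` here stand in for whatever your evaluation harness already provides.

```python
# A pytest-style check that could run in CI against a golden dataset.
# GOLDEN_ITEMS and ask_model are assumed to come from your eval harness;
# the 0.9 threshold mirrors the stability bar suggested above.
import pytest

@pytest.mark.parametrize("item", GOLDEN_ITEMS, ids=lambda i: i["id"])
def test_answer_stability_under_permutation(item):
    score = permutation_stability(item["question"], item["options"], ask_model)
    assert score >= 0.9, f"position-sensitive answer on {item['id']}: stability={score:.2f}"
```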
How Anystack helps
We work with engineering organisations putting LLM-based features into production, and a growing share of our AI integration engagements now include explicit bias auditing of reasoning models — permutation tests, trajectory-length sensitivity analysis, and routing layers that match model capability to task structure. Where teams are already running LLMs at scale, we typically find at least one production pipeline where position bias or related selection artefacts are silently degrading output quality. The fix is usually small once located: a model swap, a reasoning budget cap, or a restructured prompt. The work is in the detection, which benefits from the kind of test automation infrastructure most teams have not yet extended to their AI systems.
