AI

Thinking Machines: When Should You Actually Use Reasoning Models (o3, Claude Extended Thinking, DeepSeek R1)?

Reasoning models can outscore standard chat models by 30 points on hard math and still lose by 3 points on simple questions. The trick is knowing which one you're asking.

13 min read
Key Takeaways
    • Reasoning models think before they answer: They burn extra tokens on an internal chain of thought, then produce a final reply. This helps on multi-step problems and hurts on easy ones.
  • The gap on hard benchmarks is large: OpenAI o3 hit 87.7% on GPQA-Diamond versus 76.0% for o1 and about 48% for GPT-4o. DeepSeek R1 lifted AIME 2024 pass@1 from 15.6% to 71.0% on the same base model (DeepSeek-AI, Nature 2025).
  • Reasoning models can get worse on simple tasks: Recent studies report 2.4% to 3.8% accuracy drops on basic factual recall because models overthink and contradict themselves.
  • Latency and cost are real: Expect 10 to 60 seconds per answer and higher token bills, even after o3's 80% price cut to $2 / $8 per million tokens in 2025.
  • The decision rule is simple: Use a reasoning model when the task is multi-step, verifiable, and expensive to get wrong. Use a standard chat model for everything else.

The Quietest Big Change in AI (Without Using the P-Word)

For most of 2022 and 2023, bigger AI meant bigger training runs. More parameters, more data, more GPUs. Scale during pre-training was expected to keep dragging capability upward.

Then in September 2024, OpenAI previewed o1, a model that didn't feel bigger so much as slower. Ask it a question and it would pause, sometimes for half a minute, before writing anything. The full o1 shipped with ChatGPT Pro on December 5 at $15 / $60 per million tokens (OpenAI, 2024). It wasn't a bigger model. It was a model that spent more compute per query.

A few weeks later OpenAI announced o3. DeepSeek open-sourced R1 on January 20, 2025 (DeepSeek-AI, 2025). Anthropic rolled Extended Thinking into Claude 3.7 Sonnet on February 24, 2025, with a user-adjustable "thinking budget" and visible raw reasoning traces (Anthropic, 2025). The feature carried into Claude 4, 4.5, and 4.7.

The technical name is "test-time compute scaling." Instead of only investing compute during training, the model is given more compute to think during inference. As Sebastian Raschka puts it in "Understanding Reasoning LLMs," the quiet change isn't how these models are trained but what happens once you press enter.

For knowledge workers and learners, this matters because the choice of model is no longer only a quality question. It's also a latency question, a cost question, and a task-fit question.


What a Reasoning Model Actually Does Differently

Strip away the jargon and a reasoning model does something simple. Before it writes an answer, it writes a private draft to itself. That draft can be hundreds or thousands of tokens long. It explores approaches, checks work, backtracks, and then commits to a final response.

A standard chat model like GPT-4o produces tokens left to right, and those tokens are the answer. Whatever reasoning it does is compressed into whatever fits in that forward pass. Prompt it with "think step by step" and you get a bit more reasoning on paper, but the underlying model isn't built to deliberate.

A reasoning model is built to deliberate. Three concrete differences show up in practice:

  1. More tokens per query. Reasoning outputs often contain five to twenty times more hidden tokens than the visible answer.
  2. Higher latency. Responses take 10 to 60 seconds instead of 1 to 3.
  3. Different failure modes. When a reasoning model is wrong, it's often wrong in a confident, elaborate way. When it's right on hard problems, it's right in a way a standard model can't match.

DeepSeek's paper in Nature (2025) gives one of the clearest demonstrations. On AIME 2024, their base model scored 15.6% pass@1. After reinforcement learning that rewarded correct reasoning, R1 scored 71.0% pass@1 and 86.7% with majority voting. The model hadn't seen more math data. It had learned to use inference tokens to think.

The practical question for the rest of us is when that extra thinking is worth it.


The Three Families: o3, Claude Extended Thinking, DeepSeek R1

Three products dominate the reasoning-model landscape as of early 2026. Each takes a slightly different angle.

OpenAI o3 is the benchmark-smashing option. Announced in December 2024, it crossed the ~85% human threshold on ARC-AGI for the first time, hitting 87.5% in high-compute mode and 75.7% in its efficiency tier (Chollet, ARC Prize, 2024). ARC-AGI is built to resist pattern memorization, and no prior model had gotten close. On GPQA-Diamond, a graduate-level science benchmark, o3 scored 87.7% against o1's 76.0%. OpenAI cut o3 pricing by roughly 80% during 2025 to $2 / $8 per million tokens, about 7.5 times cheaper than original o1 rates.

Claude Extended Thinking is the tunable option. Introduced with Claude 3.7 Sonnet on February 24, 2025, it lets you set a "thinking budget" per query. The raw reasoning is visible in the API response, useful for debugging and auditing. Pricing stays at Claude Sonnet's standard $3 / $15 per million tokens, so extra thinking costs extra tokens but not a premium rate.

DeepSeek R1 is the open-weight option. Released January 20, 2025, under the MIT license and later published in Nature, R1 was trained with reinforcement learning applied directly to a base model, with no supervised reasoning data in the initial stage. It matched o1-0912 on AIME 2024 and hit 71.5% on GPQA-Diamond. Distilled variants from 1.5B to 70B parameters made strong reasoning runnable on a single GPU. An update, R1-0528, pushed AIME 2025 to 87.5%.

These three cover the space: proprietary top-tier (o3), tunable and transparent (Claude), and open-weight (DeepSeek R1).


Benchmarks, Honestly Read

Numbers without context are misleading. Here's how the major reasoning benchmarks compare, with a standard chat model included as a baseline.

ModelGPQA-DiamondAIME 2024 (pass@1)ARC-AGI (semi-private)Typical cost per queryLatency per reply
GPT-4o (standard)~48%~13%~5%~$0.011 to 3 sec
DeepSeek R171.5%71.0% (86.7% with majority vote)~15%~$0.005 (hosted)15 to 40 sec
Claude 4.5 Extended Thinking~83%~80%~50% (high budget)~$0.05 to $0.3010 to 40 sec
OpenAI o387.7%~90%75.7% (efficient) / 87.5% (high)~$0.05 to $2.00+20 to 60 sec

Sources: OpenAI o3 announcement (Dec 2024), ARC Prize blog (Chollet, 2024), DeepSeek-R1 (Nature 2025), Anthropic release notes. Latency and cost vary by prompt length and thinking budget.

A few things to keep in mind when reading numbers like these:

GPQA-Diamond is a set of graduate-level science questions designed so that non-experts with web access still do poorly. A high score means the model can reason at the level of a PhD candidate. It doesn't mean it's a better writer or summarizer.

AIME is a pre-olympiad competition. Scores above 70% mean the model can solve problems that roughly the top 2% of US high school students tackle. AIME generalizes weakly to everyday math like forecasting or spreadsheets.

ARC-AGI was built by François Chollet to resist memorization. Tasks are visual puzzles where the rules are shown by example. Pre-reasoning models scored in single digits. o3's jump was genuinely surprising to researchers. ARC-AGI is not a proxy for practical usefulness, though. It measures one specific form of abstract generalization.

A model that dominates these benchmarks is not automatically better for a product launch plan, a book summary, or a customer email.


When Reasoning Helps

Reasoning models earn their keep on tasks with three properties: multiple steps, verifiable answers, and a high cost of being wrong.

Multi-step math and quantitative reasoning. Tax calculations with multiple conditions. Financial models where a transposed digit changes the answer. Engineering calculations with unit conversions. The 55-point jump DeepSeek R1 got on AIME came from exactly this kind of problem.

Code generation and debugging for non-trivial tasks. "Write a function that sorts a list" doesn't need reasoning. Refactoring a 300-line module while preserving behavior, debugging a race condition, or implementing an algorithm from a paper does.

Legal and regulatory analysis. Contract review with cross-referenced clauses. Compliance questions where the answer depends on how several rules interact. Many legal teams now use reasoning models for first-pass analysis, with a lawyer reviewing the output.

Complex RAG routing. When a retrieval system has to decide which of ten indices to query, rewrite the query, and synthesize across sources, a reasoning model in the orchestrator role produces noticeably better plans.

Literature synthesis. Reading several papers and identifying where they agree, disagree, and what's missing is the kind of compare-and-contrast that reasoning models handle well. If you've used Glasp's AI chat to pull themes across highlights, escalating to a reasoning model for the final synthesis is where you feel the biggest difference.

Hard scientific or technical questions. If your work involves graduate-level chemistry, physics, or biology, a 40-point benchmark gap translates to real answers the standard model can't produce.

Heuristic: if you'd want a colleague to double-check the answer before you trust it, a reasoning model is probably worth the wait.


When Reasoning Hurts

Reasoning models fail in interesting ways. And on a surprisingly large fraction of everyday tasks, they underperform standard chat models.

Simple factual recall. When the right answer is one fact the model already knows, extra thinking tokens give it more chances to second-guess. A 2025 study reported reasoning models losing 2.4% to 3.8% accuracy on basic factual recall. The models consider alternatives to the correct answer and sometimes commit to one.

Translation. Good translation is a pattern-matching problem, not a reasoning problem. Reasoning models don't translate better than GPT-4o, and they take 20 times longer.

Summarization. If you're condensing 5,000 words into 300, the bottleneck is writing quality, not reasoning depth. Standard chat models are faster and often produce cleaner prose. Our AI Research Workflow piece goes into more detail.

Classification. Tagging support tickets, labeling emails, scoring sentiment. Reasoning adds latency without accuracy.

Simple question answering. "What year was the moon landing?" doesn't improve with chain of thought. Standard chat handles these in half a second.

Creative writing that needs voice. Reasoning traces are analytical. Models trained heavily on reasoning sometimes produce answers that feel mechanical when asked for a poem or an emotional passage. Standard chat models feel warmer.

A subtler failure mode is documented in arXiv 2509.09677, "Illusion of Diminishing Returns." The authors find that long-horizon execution benefits taper sharply. Early gains are real, but the marginal accuracy of an extra 10,000 reasoning tokens drops fast. Past a point, more thinking just makes the answer later and more expensive.

Latency is its own problem. Most users interpret 30 seconds of silence as a broken system. Products often add visible "thinking" UI to reassure users something is happening. If you're embedding AI in a tight flow, this friction matters.


A Decision Rule You Can Actually Use

Here's a practical matrix. Coarse, but it covers most of what you'll run into.

Task TypeReasoning ModelStandard Chat Model
Multi-step math or proofsYes, clearlyNo
Code for non-trivial featuresYesOnly for simple snippets
Legal / contract analysisYesNo
Complex RAG query routingYesNo
Scientific or technical Q&A (PhD-level)YesNo
Literature synthesis across 5+ sourcesYes (final pass)Yes (first pass)
TranslationNoYes
SummarizationNoYes
Email draftingNoYes
Classification / taggingNoYes
Short factual Q&ANoYes
Creative writing needing voiceUsually noYes
Chat interfaces with tight latencyNoYes
BrainstormingSometimesUsually yes

The rule can be compressed. Ask three questions:

  1. Is the problem multi-step? Does it require several logical moves chained together?
  2. Is the answer verifiable? Can you tell when it's right or wrong?
  3. Is the cost of being wrong high? Would a mistake waste significant time or money?

If at least two are yes, use a reasoning model. Otherwise, save the latency. If you're not sure, try the standard model first and escalate if the answer feels shaky.

This pattern, of starting cheap and escalating only when needed, is one of the most underrated skills in working with AI. We went deeper on it in AI Research Workflow.


What This Means for Reading and Research

If you read, learn, and research as part of your work, reasoning models fit a specific slot, not the whole workflow.

Most of the work of learning is not reasoning. It's attention. You pick which sources matter, focus on what's novel, and build a personal map of ideas over time. No model does that for you. This is why Glasp's web highlighter is built around the human step first: you highlight what matters, and the AI comes in later as a thinking partner, not a replacement.

For most day-to-day reading tasks, a standard chat model is the right tool:

  • Summarize an article I just read. Standard model, fast and clean.
  • Explain a concept I didn't understand in this paper. Standard model. If the concept is a PhD-level scientific claim, escalate.
  • Pull all the quotes about AI safety from my highlights this month. Standard model.
  • Generate flashcards from my notes. Standard model.

Reasoning models earn their place in a smaller set of jobs:

  • Synthesize the disagreement between five authors on one topic. Reasoning model, preferably after you've highlighted the relevant passages.
  • Map this paper's argument to my existing notes and flag contradictions. Reasoning model.
  • Design a reading plan that hits my gaps based on what I've already read. Reasoning model.
  • Derive a proof or work through a complex technical argument from first principles. Reasoning model.

The YouTube Summary flow is a good example. Summarizing a 40-minute talk is firmly a standard-model task. But if the talk is technical and you want to check whether the speaker's argument holds up against three counter-arguments you've saved elsewhere, that's where escalating to a reasoning model with your highlights as context earns its cost.

This two-tier approach connects to a broader point from AI Impact on Learning and AI Thinking Trap: AI is most useful when it amplifies thinking you've already done, not when it substitutes for thinking you haven't. Reasoning models raise the ceiling for what the AI can contribute. They don't change the floor, which is set by how deeply you've engaged with your material.

DeepSeek R1's MIT license also broke a pattern. Until 2025, strong reasoning was proprietary. Now anyone can run a 70B distilled reasoner on their own hardware. For teams that care about privacy, cost at scale, or fine-tuning, this changes the calculus. We covered this in Open Source vs Closed AI Strategy.


Frequently Asked Questions

Do I need a reasoning model for most of my work?

Probably not. For reading, writing, summarizing, and general Q&A, a standard chat model is faster, cheaper, and often more accurate. Reasoning models earn their place on problems with multiple logical steps and verifiable answers.

What's the difference between chain-of-thought prompting and a reasoning model?

Chain-of-thought prompting is a technique where you tell a standard model to "think step by step" in the prompt. A reasoning model is trained specifically to generate much longer internal reasoning traces before answering, using reinforcement learning that rewards correct reasoning. You can get some of the benefit with chain-of-thought prompting alone, but the gap on hard benchmarks between prompted GPT-4o and o3 is still large, often 20 to 40 percentage points.

Why does o3 cost so much less than o1 did?

OpenAI cut o3 pricing by roughly 80% during 2025, ending around $2 per million input tokens and $8 per million output tokens. The reductions came from model distillation, inference optimizations, and increased hardware efficiency. Reasoning models remain more expensive per query than standard chat models because they generate far more tokens, but the per-token price gap has narrowed significantly.

Is DeepSeek R1 really competitive with o3?

On math benchmarks like AIME 2024 and on GPQA-Diamond, R1 is close to o1 but still behind o3. On ARC-AGI, o3 holds a clear lead. Where R1 wins is flexibility. It's open-weight under MIT license, you can self-host it, and distilled variants from 1.5B to 70B parameters make it practical on commodity hardware. For teams that care about data residency, fine-tuning, or cost at scale, R1 is often the better pick even when it's a few percentage points behind on benchmarks.

How do I know if a reasoning model is overthinking my question?

Two signs. First, the latency feels absurd for the question you asked, like 45 seconds for "what does this word mean." Second, the answer hedges more than it should and introduces caveats the question didn't need. The 2.4% to 3.8% accuracy drop on simple factual recall documented in 2025 research mostly comes from this overthinking pattern. If you see it, switch to a standard model.

Can I use both reasoning and standard models in the same workflow?

Yes, and this is often the best setup. Use a standard model for fast, high-volume work (summarizing, drafting, classifying) and escalate to a reasoning model for the small number of queries that need deliberation. Claude 3.7 Sonnet made this explicit with a thinking budget slider, and OpenAI's API lets you route between GPT-4o and o3 freely.

Does Glasp use reasoning models?

Glasp's AI chat is optimized for fast, conversational responses over your highlights, so it defaults to standard chat models for most interactions. For specific use cases that benefit from deeper analysis, like synthesizing across many highlights or comparing arguments from multiple sources, reasoning models are part of the toolkit. The principle is the same one we'd suggest you follow in your own work: match the model to the question.

Will standard chat models eventually do everything reasoning models do?

The gap is closing. Newer standard models incorporate techniques from reasoning training, and reasoning models are getting faster and cheaper. By 2027, the distinction may blur into a single model that spends more or less compute based on the query. For now, the two modes are distinct enough that treating them as separate tools pays off.


Conclusion: Match the Model to the Question

The big shift of 2024 and 2025 wasn't that AI got smarter in the way we used to mean. A new kind of model showed up that trades speed for depth. That tradeoff is real and measurable. A reasoning model can double your accuracy on hard math and lose three points on simple Q&A in the same afternoon.

The model choice is part of the craft now. Fast and cheap for most things. Slow and deep for the small set of problems where the extra compute earns its keep. The rule that works in practice: ask whether the problem is multi-step, verifiable, and expensive to get wrong. If two of those are yes, use a reasoning model. Otherwise, use a standard chat model.

Reasoning models don't make thinking optional. They make one specific kind of thinking cheaper and more reliable when you actually need it. The rest of the time, a standard model is still your best tool, and your own attention is still the part that matters most. That's the frame Glasp has always pushed toward: the AI amplifies what you've already highlighted and connected. Pick the right model, and you get more out of every query. Pick the wrong one, and you're just waiting longer for a worse answer.

Start building your knowledge library

Highlight what matters as you read across the web. Save insights from articles, books, and YouTube videos in one place.

Get Started Free