Deep Research Tools: OpenAI vs Perplexity vs Gemini

The Deep Research Moment

On February 2, 2025, OpenAI announced Deep Research. It was the first agent most people had used that could take a one-sentence prompt, plan a 30-minute investigation, browse dozens of sources on its own, and return with a cited report.

The industry reaction was telling. Within six weeks, Perplexity shipped its own Deep Research (February 14) and opened the Sonar Deep Research API to developers (March 7). Google, which had launched Gemini Deep Research quietly in December 2024, accelerated its rollout and upgraded the backbone to Gemini 2.5 Pro in May 2025. Anthropic made Claude's web search generally available on May 27, 2025, packaging the Research feature in the same spring window.

Four labs, one product category, roughly six months (December 2024 to May 2025). That doesn't happen by accident. 2024 was the year context windows crossed 200K tokens, tool use became reliable, and agentic loops stopped silently failing halfway through. Deep research was the first consumer-facing app that made all three feel worth paying for. It's also closely tied to the broader shift toward agent protocols we cover in The Agentic Web: Inside the MCP Protocol Wars.

If you write, study, analyze markets, or evaluate products, you're already at a disadvantage if you don't use one. The question is which one, and when.

What "Deep Research" Actually Does

It's easy to confuse deep research with chat search. In both, you type a question and get an answer with links. The mechanics are different.

A chat search (like regular ChatGPT with browsing) runs one or two web queries and synthesizes the top results in seconds. A deep research agent does something closer to what a junior analyst does over an afternoon. It breaks your question into sub-questions, runs dozens or hundreds of searches, reads full pages, follows citations, updates its plan as it learns, and produces a structured report with footnotes.
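The loop described above can be sketched in a few lines. This is a toy illustration, not how any of the four vendors implement their agents: `decompose`, `search`, and `read` are hypothetical stand-ins for the model calls and the browser, and the page budget is an assumed stopping rule.

```python
from collections import deque
from typing import Callable


def deep_research(
    question: str,
    decompose: Callable[[str], list[str]],  # hypothetical: question -> sub-questions
    search: Callable[[str], list[str]],     # hypothetical: sub-question -> URLs
    read: Callable[[str], dict],            # hypothetical: URL -> {"summary", "follow_ups"}
    max_pages: int = 50,                    # assumed budget that ends the run
) -> dict[str, list[tuple[str, str]]]:
    """Plan, search, read, and revise the plan until the page budget runs out."""
    queue = deque(decompose(question))  # the initial research plan
    seen_urls: set[str] = set()
    notes: list[dict] = []

    while queue and len(seen_urls) < max_pages:
        sub_question = queue.popleft()
        for url in search(sub_question):
            if url in seen_urls or len(seen_urls) >= max_pages:
                continue
            seen_urls.add(url)
            page = read(url)
            notes.append(
                {"question": sub_question, "url": url, "summary": page["summary"]}
            )
            # Update the plan with questions raised by what was just read.
            queue.extend(page.get("follow_ups", []))

    # The "report": findings grouped by sub-question, each tied to its source.
    report: dict[str, list[tuple[str, str]]] = {}
    for note in notes:
        report.setdefault(note["question"], []).append((note["summary"], note["url"]))
    return report
```

The difference from chat search is the `queue.extend` line: what the agent reads changes what it searches for next, which is why a run takes minutes rather than seconds.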

Ask chat search "what are the main critiques of the Phillips curve?" and you'll get a three-paragraph summary. Ask a deep research agent the same thing and you'll get a 15-page report covering Friedman's natural rate hypothesis, the 1970s stagflation breakdown, rational expectations revisions, post-2008 flattening debates, and recent papers from 2023-2025, each with a source you can click.

The trade-off is time. Runs take anywhere from under 3 minutes to 45 minutes depending on the tool and the depth, and the waiting is the point. You queue one up, work on something else, and come back to a report that would have taken you half a day to assemble manually. For more on restructuring research habits around AI agents, see How to Build an AI-Powered Research Workflow in 2026.

Head-to-Head: The 4 Tools Compared

Here's the matrix, with verified numbers from the launch blogs and current pricing pages.

Tool	Launch	Model	Price / Limits	HLE score
OpenAI Deep Research	Feb 2, 2025	custom o3	Free: 5/mo; Plus ($20/mo): 25/mo; Pro ($200/mo): 250/mo; 5-30 min runs	26.6%
Perplexity Deep Research	Feb 14, 2025 (API Mar 7)	Sonar	Free: 5/day; Pro ($20/mo): 500/mo; API $2/$8 per M tokens; under 3 min	21.1% (SimpleQA 93.9%)
Gemini Deep Research	Dec 2024, upgraded May 2025	Gemini 2.5/3 Pro	AI Pro ($19.99/mo): 20/day; AI Ultra ($249.99/mo): 200/day; Gmail/Drive/Docs integration	not publicly reported
Claude Research	Web search GA May 27, 2025; Research Apr-May 2025	Sonnet 4.5 / Opus 4.5, 200K ctx (1M beta)	Included on Pro ($20/mo); 5-45 min runs; Google Workspace connectors	not publicly reported

The one-paragraph profiles:

OpenAI Deep Research is the heavyweight. Runs are slower (often 15-25 minutes), reports are the longest, and reasoning is visibly deeper on ambiguous topics. The custom o3 model is tuned for web-scale synthesis rather than chat. The 25-per-month cap on Plus is the real constraint. Heavy users burn through it in a week.

Perplexity Deep Research is the speed champion. Most runs finish in 2-3 minutes. Reports are shorter and more encyclopedic, ideal for a briefing rather than an essay. It's also the only one of the four with a real API, priced at $2 input / $8 output per million tokens at launch.

Gemini Deep Research is the best-integrated for Google Workspace users. It pulls from your Gmail, Drive, and Docs alongside the web. The 20-per-day cap on AI Pro is generous. Reports come with a visible research plan you can edit before the agent runs.

Claude Research is the patient one. Runs regularly hit the 30-45 minute end of the range, and the output reflects it: long-form, nuanced, good at weighing contradictory evidence. The 200K context window (1M beta for enterprise) means large source sets don't get truncated.

Benchmarks: What HLE and SimpleQA Actually Tell You

The two numbers that get quoted most are Humanity's Last Exam and SimpleQA. They're useful, and they're also overread.

Humanity's Last Exam (HLE), released by Scale AI and the Center for AI Safety in early 2025, is a 3,000-question multi-domain benchmark covering math, science, humanities, and professional knowledge at the outer edge of what experts can answer. OpenAI reported 26.6% for Deep Research at launch (OpenAI, Feb 2, 2025). Perplexity reported 21.1% for Sonar Deep Research (Perplexity, Feb 14, 2025). Anthropic and Google haven't publicly reported HLE scores for their research agents as of this writing.

What HLE measures well is the ability to synthesize across domains on genuinely hard questions. What it doesn't measure is whether the agent is good at the kind of work you actually do. Most real research isn't PhD-level physics. It's "summarize recent debates on this topic" or "compare these five products for my use case." On those tasks, the practical gap between OpenAI and Perplexity is much smaller than the 5.5-point HLE difference (26.6% versus 21.1%) would suggest.

SimpleQA is Perplexity's stronger showing. The benchmark tests short-form factual accuracy, and Sonar Deep Research scored 93.9% (Perplexity, Feb 14, 2025). That's a useful proxy for "does the agent hallucinate facts?", which matters a lot when you're going to cite the output.

The honest read: benchmarks separate tools reliably only on the hardest questions, roughly the 80th-95th percentile of difficulty, and poorly on the everyday questions below that. The best way to pick is to run the same real prompt through two or three of them on the free tier and compare. Benchmarks are suggestive. Your own test is decisive.

For a longer argument about why benchmark obsession can mislead, see The AI Thinking Trap.

Free Tier Reality Check

The marketing pages all highlight free access. Here's what "free" actually means when you try to use these tools for real work.

OpenAI Deep Research (Free: 5/month). Enough to evaluate, not enough to rely on. A single project often eats 2-3 runs (initial pass, follow-up, clarification). You'll hit the cap by day 10 if you use it for work. Plus at $20/month for 25 runs is the realistic starting tier.

Perplexity Deep Research (Free: 5/day). The most generous of the bunch. 5 per day is 150 per month, more than most people need. Free-tier output is shorter than Pro, and you don't get the newer Sonar variants. For casual use, this is the free tier you actually keep using.

Gemini Deep Research (Free: limited access). Rolled out in limited form during 2025, with reduced frequency and shorter reports than AI Pro. If you have a Google One subscription with AI Pro already, the 20-per-day cap is the one to beat.

Claude Research (Pro only, $20/month). No dedicated free tier for the Research feature. The free plan includes chat and web search, but multi-step research is behind Pro. Pro also includes Claude's full Sonnet 4.5 and Opus 4.5 access, so the $20 buys you the strongest long-context reading model on the market.

Free-tier summary	Usable for real work?
OpenAI Deep Research (5/mo)	Evaluation only
Perplexity Deep Research (5/day)	Yes, for light use
Gemini Deep Research (limited)	Partial, better with AI Pro
Claude Research	No free tier

If you only pay for one, Perplexity Pro gives you 500 runs a month at $20, the highest stated monthly allowance at that price (Gemini AI Pro's 20-per-day cap can add up to more, but only if you use it every day). If you only want the smartest output, ChatGPT Plus at $20 gets you 25 OpenAI Deep Research runs plus everything else in the Plus bundle. For Google Workspace users, Gemini AI Pro is the natural pick. Claude Pro makes the most sense if you already use Claude for reading and writing and want one integrated subscription.
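To put the caps on one scale, here is a rough per-run calculation using the prices and limits quoted in this article. The 30-day month used to convert daily caps is an assumption, and the figures are the cost at the cap, so they are a floor rather than what a typical user pays per run. Claude Pro is left out because no run cap is listed for it.

```python
# Prices and caps as quoted in this article; daily caps are converted
# to monthly ceilings with an assumed 30-day month.
DAYS_PER_MONTH = 30

plans = {
    "ChatGPT Plus": {"price": 20.00, "runs_per_month": 25},
    "ChatGPT Pro": {"price": 200.00, "runs_per_month": 250},
    "Perplexity Pro": {"price": 20.00, "runs_per_month": 500},
    "Gemini AI Pro": {"price": 19.99, "runs_per_month": 20 * DAYS_PER_MONTH},
    "Gemini AI Ultra": {"price": 249.99, "runs_per_month": 200 * DAYS_PER_MONTH},
}


def cost_per_run(plan: dict) -> float:
    """Monthly price divided by the monthly run ceiling."""
    return plan["price"] / plan["runs_per_month"]


# Print plans from cheapest to most expensive per run at the cap.
for name, plan in sorted(plans.items(), key=lambda item: cost_per_run(item[1])):
    print(
        f"{name:16s} {plan['runs_per_month']:5d} runs/mo  "
        f"${cost_per_run(plan):.3f} per run at the cap"
    )
```

Under these assumptions ChatGPT Plus works out to $0.80 per run and Perplexity Pro to $0.04. A daily cap only reaches its monthly ceiling if you use every run every day, so treat those rows as upper bounds on volume.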

Which Tool for Which Job

After running hundreds of queries across all four, clear patterns emerge. Here's how I'd route work now.

Academic literature review. Claude Research. The long context window matters when the agent needs to hold 20+ papers in working memory, and Claude is noticeably better at distinguishing between superficially similar claims. Runs take longer, but literature reviews aren't time-sensitive.

Market sizing and competitive intelligence. OpenAI Deep Research. The depth of reasoning on ambiguous strategic questions (why a market grew, what's driving customer switching) comes through clearly here. It's the one I trust most for "help me understand this industry" prompts.

Quick factual briefings. Perplexity Deep Research. If you just need a cited two-page summary before a meeting, Perplexity's 3-minute turnaround is hard to beat. SimpleQA-style factual accuracy is a genuine strength.

Buying decisions and product comparisons. Perplexity or Gemini. Both pull in enough real-world review data (forums, YouTube transcripts, spec sheets) to produce useful side-by-side comparisons. Gemini's advantage is pulling in your own Gmail receipts and Drive notes.

Research involving your own documents. Gemini Deep Research. The Workspace integration is the moat. If you're researching a topic where half the source material is in your Drive (meeting notes, PDFs, old emails), nothing else compares.

Developer integrations and bulk runs. Perplexity Sonar Deep Research API. It's the only one with real API pricing at a reasonable rate. If you're building a product that needs deep research as a feature, this is the obvious choice.

Synthesizing contradictory evidence. Claude. When sources disagree (e.g., "is fiber actually good for diverticulitis?" or "does the Pomodoro technique work?"), Claude is the most willing to surface the disagreement rather than pick a side prematurely.

One pattern that might surprise people: no single tool dominates. I run the same prompt through two agents for high-stakes work. The cost is $40/month for two subscriptions, and the benefit is noticeably better output than any single tool produces alone. Chat search and deep research are starting to feel less like competing products and more like a stack you compose.

The Missing Piece: Turning Research Reports Into Usable Knowledge

Here's what almost no comparison article mentions. The report the agent produces is not the output of your research. Your understanding is.

A 20-page Claude Research output or a 15-page OpenAI Deep Research report is the start of the work, not the end. Read it once, skim the conclusion, close the tab, and you've paid an agent to summarize something you didn't actually learn. The 2025 MIT Media Lab study on passive AI use (tracked in our analysis of AI's impact on learning) showed that heavy ChatGPT users consistently retained less of what they "read" than active learners did.

The fix is what researchers have done for centuries: annotate. Highlight the claims that matter. Flag the sources you want to verify. Link insights across reports.

This is where Glasp's web highlighter fits into the workflow. Run your research on OpenAI, Perplexity, Gemini, or Claude. Paste the report into a readable page. Highlight directly in the browser as you read. Your highlights sync to your Glasp library, searchable and organized, alongside everything else you've read that month.

A few specific workflows that work:

Highlight, then re-query. Read the report, highlight the 10-15 claims that matter most. Paste those highlights back into the same agent with "dig deeper on these specific points." Iterative rather than one-shot.

Stack reports by topic. When you research the same topic across two tools (say, OpenAI + Claude), highlighting both reports in Glasp lets you see where they converge and diverge. Disagreements are often the most interesting parts.

Use YouTube alongside text. When the best sources are podcasts or talks, YouTube Summary gives you transcript-level summaries with timestamps. Pairing a text deep research report with 3-4 annotated YouTube talks covers a topic more thoroughly than either alone.

Chat with your highlights. Glasp's AI chat can answer questions using your annotations as the source. It's the difference between "what did GPT say about X?" and "what have I actually concluded about X?"

Publish what you learned. The community on Glasp is full of other people researching similar topics. Sharing highlighted reports is a forcing function to finish the research, not just queue more of it. For a step-by-step guide, see How to Annotate Articles the Right Way.
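The "highlight, then re-query" workflow is easy to make repeatable. The sketch below assumes you have copied your highlights out as plain strings. `build_follow_up` is a hypothetical helper, not part of Glasp or any of the research tools, and the instruction to cite sources and flag unsupported claims is an added assumption beyond the "dig deeper" wording above.

```python
def build_follow_up(highlights: list[str], topic: str, max_items: int = 15) -> str:
    """Turn highlighted claims into a numbered follow-up prompt."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for highlight in highlights:
        text = " ".join(highlight.split())  # collapse stray whitespace
        key = text.lower()
        if text and key not in seen:        # skip blanks and duplicates
            seen.add(key)
            cleaned.append(text)

    numbered = [
        f"{i}. {text}" for i, text in enumerate(cleaned[:max_items], start=1)
    ]
    return (
        f"Earlier you produced a report on: {topic}\n"
        "Dig deeper on these specific points. For each one, cite sources "
        "and flag anything the evidence does not support.\n\n"
        + "\n".join(numbered)
    )
```

Capping the list at 15 items mirrors the 10-15 claims suggested above. A shorter list tends to keep the second run focused.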

A report you read once is a receipt, not knowledge. The highlight-and-annotate step is what converts agent output into something you actually know.

Frequently Asked Questions

Which deep research tool is the most accurate?

On published benchmarks, OpenAI Deep Research leads on Humanity's Last Exam at 26.6% (OpenAI, Feb 2025) versus Perplexity's 21.1% (Perplexity, Feb 2025). Anthropic and Google haven't released HLE numbers for their research agents. For short-form factual accuracy, Perplexity Sonar scored 93.9% on SimpleQA, which is excellent. In practical use, accuracy differences between OpenAI, Claude, and Gemini are smaller than benchmarks suggest. The bigger difference is depth versus speed.

How long do deep research runs take?

Perplexity finishes most runs in under 3 minutes. Gemini typically runs 5-15 minutes. OpenAI Deep Research takes 5-30 minutes depending on query complexity. Claude Research runs 5-45 minutes and reaches the top of that range on hard prompts. If you need an answer now, use Perplexity. If you can wait, Claude or OpenAI usually produce more thorough reports.

Is any deep research tool genuinely free?

Yes, but with limits. OpenAI gives free users 5 Deep Research runs per month. Perplexity gives 5 per day on the free tier, which is the most generous allowance. Gemini has limited free Deep Research access. Claude doesn't offer Research on its free tier. For casual use, Perplexity Free covers most needs. For regular work, a $20/month Pro plan on any of the four is the realistic entry point.

Can I use deep research tools via API?

Perplexity is currently the only major player with a true Deep Research API. Sonar Deep Research launched on March 7, 2025 at $2 per million input tokens and $8 per million output tokens. OpenAI offers access to o3 via the API, but the full Deep Research agent loop is tied to ChatGPT. Claude and Gemini don't yet offer their Research features as standalone APIs, though their underlying models (Sonnet 4.5, Opus 4.5, Gemini 2.5/3 Pro) are available.
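For budgeting, the launch prices translate into a simple per-run estimate. This is a sketch under stated assumptions: the token counts in the example are hypothetical, and any per-search or other fees on Perplexity's current pricing page are not modeled here.

```python
# Launch prices quoted above, in USD per million tokens.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00


def estimate_run_cost(input_tokens: int, output_tokens: int) -> float:
    """Token cost of one run in USD, ignoring any non-token fees."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Hypothetical run: 50K tokens in, 10K tokens out.
cost = estimate_run_cost(input_tokens=50_000, output_tokens=10_000)
print(f"${cost:.2f}")  # prints $0.18
```

Measure real token counts from a few of your own runs before relying on an estimate like this, since long reports shift most of the cost to the output side.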

Does deep research replace traditional search?

No. Deep research is a complement, not a replacement. For a quick fact, search is still faster. For a two-sentence definition, chat with a regular LLM. Deep research wins when you want a structured, cited report on a multi-faceted question that would take you 30+ minutes to assemble manually. Most people use all three.

How do I stop hallucinations in deep research reports?

Three practical tactics. First, always click at least the top 3-5 cited sources and verify the claim is in the source (hallucinations more often come from mis-citing a real source than inventing a fake one). Second, run the same prompt through a second tool and compare. Disagreements between Claude and OpenAI, for example, are often the places where one of them got something wrong. Third, favor Perplexity for high-stakes factual queries, since its SimpleQA score of 93.9% reflects strong accuracy on short-form facts.
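A quick first pass at the second tactic is to compare which sources two reports cite. The sketch below assumes both reports are exported as text with plain URLs in them. It only compares cited domains and does not check whether a claim actually appears in a source, so it narrows down where to look rather than replacing the manual check.

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs up to whitespace or a closing bracket or quote.
URL_PATTERN = re.compile(r'https?://[^\s)\]>"]+')


def cited_domains(report_text: str) -> set[str]:
    """Collect the domains of every URL found in a report."""
    domains: set[str] = set()
    for url in URL_PATTERN.findall(report_text):
        host = urlparse(url.rstrip(".,;")).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.add(host)
    return domains


def compare_reports(report_a: str, report_b: str) -> dict[str, set[str]]:
    """Split cited domains into shared ones and ones unique to each report."""
    a, b = cited_domains(report_a), cited_domains(report_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}
```

Claims backed only by sources in `only_a` or `only_b` are reasonable candidates to verify first, since the other tool either missed that source or chose not to rely on it.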

Can deep research tools read my private documents?

Gemini Deep Research has the deepest integration, with native access to your Gmail, Drive, and Docs (with permission). Claude Research supports Google Workspace connectors. OpenAI Deep Research can read files you upload during a session but doesn't integrate directly with cloud storage. Perplexity primarily works against the web. If your source material is largely in Google Workspace, Gemini is the obvious pick.

What's the best way to save and reuse deep research reports?

Export the report as PDF or Markdown, open it in a readable view, and highlight it like you would any long article. Glasp is built for exactly this workflow: highlights sync to a library you can search, link to other highlights, and revisit. Without a highlighting step, most deep research reports get read once and forgotten. This is related to what educators call the "generation effect": information you process actively is retained far better than information you passively receive.

Conclusion: The Research Stack, Not the Research Tool

A year after OpenAI's launch, the category has clarified. Deep research agents aren't a winner-take-all market. They're a four-player mix where the right answer depends on what you're researching, how much time you have, and where your source material lives.

If I had to pick one for most knowledge workers in 2026, it's Perplexity Pro. Five hundred runs per month at $20 is among the best volume-to-price ratios, runs are fast enough to fit inside a normal work rhythm, and the SimpleQA accuracy is genuinely strong. For heavier or more ambiguous work, pair it with OpenAI Deep Research or Claude Research.

But the tool choice matters less than what you do with the output. The biggest mistake I see people make is treating a deep research report as finished work. It isn't. It's raw material. The actual knowledge gets built when you highlight the claims that matter, link them to other things you've read, and return to them later when the topic comes up again.

That's the workflow Glasp is designed for. Highlight any report, any article, any YouTube transcript. Build a searchable library of what you actually thought was important. Chat with your highlights later when you need to recall something specific. Share your work with others doing the same research.

The deep research agents will keep getting better. Without a highlighting layer on top, even the best of them will keep producing reports that get read once and forgotten. Don't build your 2026 research workflow around a single tool. Build it around a stack, and make sure the last link in that stack is the one where your own understanding gets recorded.

Start by running one real research question through two of the four tools this week. Highlight both reports. Compare what you learned. That's the workflow. Everything else is a feature list.