The Productivity Promise vs the Reality
The pitch is everywhere. Pair an LLM with a knowledge worker and watch output double. Stack a Copilot license on every employee and ride the productivity curve. The narrative is so loud that questioning it feels like questioning gravity.
Then the data started arriving. In July 2025, METR published "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," a randomized controlled trial with 16 senior developers working on real issues in their own large open-source repositories. Each issue was randomly assigned to allow or disallow AI tools. The result: issues took 19% longer to complete when AI tools were allowed. The same developers, asked to estimate after the fact, believed AI had made them 20% faster. That gap between perception and reality, roughly 39 percentage points, is the productivity tax in one number.
Microsoft's Copilot rollouts have produced similarly mixed pictures. Studies from BetterUp Labs and the Stanford Social Media Lab in 2024 and 2025 found gains on some narrow tasks (summarizing meeting notes, drafting boilerplate emails) but losses on others, and a worrying signal that AI use can shift work toward "workslop", low-effort output that other humans then have to clean up. The aggregate picture is not a productivity revolution. It's a productivity redistribution, with winners and losers depending on the task.
So why does AI feel so fast? Because the visible part is fast. Generation is instant. The invisible part, the prompt-writing, the verification, the re-prompting, the cleanup of subtly wrong output, is where the bill comes due. Call it the productivity tax. It's the time you pay for AI that doesn't show up in the chat window.
The Three Hidden Costs Every AI Task Pays
Every AI task ships with three line items. Most users only notice the third when it bites.
The prompt construction tax is what you pay before generation starts. For a complex task, a usable prompt might be 200 to 600 words, plus context dump, plus examples. Even when most of that is pasted rather than typed, it's 30 to 120 seconds of setup. OpenAI's NBER working paper "How People Use ChatGPT" (September 2025, drawing on 1.5 million conversations) found that 49% of messages are "Asking" rather than "Doing", meaning the largest share of use is seeking information, not delegating tasks. Even seeking takes setup time, and that time isn't free.
The verification tax is what you pay after generation ends. You read the output. You check the facts. You sanity-test the code. You cross-reference the citation. For a 300-word answer, careful verification can take 60 to 180 seconds. For code, it's longer. For anything you'd put your name on, it's longer still. The Vectara Hallucination Leaderboard, which tracks how often consumer LLMs invent facts when summarizing source documents, shows hallucination rates of roughly 1% to 10% depending on the model and task. At a mid-range 5%, that is about one summary in twenty containing a claim the source doesn't support. Skipping verification just shifts the cost from "your time" to "your reputation."
The re-work tax is the surprise bill. The output is 80% right but the tone is off, or the format is wrong, or it cited a paper that doesn't exist, or it confidently asserted a number you know to be five years stale. Now you're either re-prompting (another 30 seconds) or rewriting (another 2 minutes). For tasks where you knew the answer to begin with, re-work usually costs more than just doing it yourself. This is exactly what METR's developers ran into: they spent more time prompting and reviewing than they would have spent writing the code.
Add those three together and a "5-second AI answer" routinely becomes a 3-minute interaction. Multiply by 30 AI uses a day and you have an hour and a half spent on the productivity tax alone.
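The arithmetic above can be sketched in a few lines. The per-step figures are the ranges quoted in this section; the function name, the 5-second generation time, and the pairing of low-end and high-end values are illustrative assumptions, not measurements.

```python
def interaction_seconds(prompt_s, verify_s, rework_s, generation_s=5):
    """Total wall-clock seconds for one AI interaction (all inputs in seconds)."""
    return prompt_s + generation_s + verify_s + rework_s

# Low and high ends of the ranges quoted above
# (re-work: a 30-second re-prompt vs a 2-minute rewrite).
low = interaction_seconds(prompt_s=30, verify_s=60, rework_s=30)
high = interaction_seconds(prompt_s=120, verify_s=180, rework_s=120)

uses_per_day = 30
typical_s = 180  # the "3-minute interaction" from the text
daily_minutes = uses_per_day * typical_s / 60

print(low, high)      # 125 425
print(daily_minutes)  # 90.0
```

The 3-minute figure sits inside the 125 to 425 second range, which is why it works as a rough planning number.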
The Worth-It Matrix: A 2x2 You Can Run In Your Head
The decision of whether to use AI is two-dimensional, not one. Most people think only about task difficulty. They should also be thinking about verification cost.
Task complexity is how long the task would take you without AI. Verification cost is how long it takes you to confirm that an AI-generated answer is correct. These are independent. Translating a paragraph into Spanish is hard for you (high complexity) and cheap to verify if you read Spanish (low verification cost). Writing a short, friendly reply to a colleague is easy for you (low complexity) and easy to verify (low verification cost), but the AI overhead alone exceeds the time you would have spent.
| | Cheap to verify | Expensive to verify |
|---|---|---|
| Hard task | AI shines. Translation, structured extraction, drafting unfamiliar formats, code in a language you read but don't write fluently. | Deep work zone. Strategy memos, novel research, code in safety-critical paths. AI's hallucination risk plus your verification cost often exceed doing it yourself. |
| Easy task | Skip AI. Short emails, formatting fixes, anything under 60 seconds. The prompt tax exceeds the work. | Definitely skip AI. Familiar writing in your own voice, decisions that hinge on context only you have. AI here is pure overhead. |
The point of the matrix is to make one decision automatic: if you're in the "easy task" row, default to no AI. The two upper quadrants are where AI earns its keep, and even those split. Hard plus expensive verification is the trickiest case, because the temptation is highest (the task is hard, after all) but the cost is also highest. For a deeper read on when AI's "thinking for you" backfires on cognition itself, see The AI Thinking Trap.
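The matrix can be sketched as a small decision function. The 60-second cutoff for an "easy" task comes from the table. Treating verification as "cheap" when it takes at most 30% of the by-hand time is an assumption made only for this sketch, as are the function and parameter names.

```python
def worth_it(task_minutes: float, verify_minutes: float,
             easy_cutoff: float = 1.0, cheap_ratio: float = 0.3) -> str:
    """Place a task in the 2x2 and return a recommendation.

    task_minutes:   how long the task takes you without AI
    verify_minutes: how long it takes you to confirm an AI answer is correct
    easy_cutoff:    tasks at or under this many minutes count as easy
    cheap_ratio:    hypothetical threshold for "cheap to verify"
    """
    hard = task_minutes > easy_cutoff
    cheap = verify_minutes <= cheap_ratio * task_minutes
    if hard and cheap:
        return "use AI"
    if hard:
        return "deep work: weigh doing it yourself"
    # Both easy-row quadrants lead to the same default.
    return "skip AI"

print(worth_it(task_minutes=20, verify_minutes=2))    # use AI
print(worth_it(task_minutes=60, verify_minutes=45))   # deep work: weigh doing it yourself
print(worth_it(task_minutes=0.5, verify_minutes=0.1)) # skip AI
```

Both easy-task quadrants return the same answer on purpose: the row, not the column, decides the default.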
Seven Tasks Where AI Almost Always Slows You Down
Some tasks lose by default. They're worth memorizing as a "no-AI" list, because reaching for the chat box on these is muscle memory most knowledge workers haven't unlearned yet.
| Task | Why AI loses | What to do instead |
|---|---|---|
| Short emails (under 80 words) | Prompt + verify costs more than typing the reply. | Type it. Use a snippet expander if it's truly repetitive. |
| Formatting fixes (capitalization, list spacing) | The fix is mechanical and 10 seconds away. AI adds round-trip latency and may "improve" things you didn't ask it to. | Use your editor. Find-and-replace beats AI for known patterns. |
| Your own voice on familiar topics | AI flattens voice toward the LLM mean. You'll spend longer un-flattening it than writing fresh. | Write it yourself. Use AI only for critique afterward. |
| Sub-60-second decisions | The decision finishes before the prompt does. | Decide. Trust the 80% answer your brain already produced. |
| Decisions that hinge on private context | Context-loading the AI takes longer than the decision. | Decide with the context you already hold. |
| Active learning (recall, problem-solving) | Karpicke's retrieval practice research and Bjork's "desirable difficulties" framework both show that effortful retrieval builds memory. AI dissolves the difficulty and the memory along with it. | Struggle first. Use AI only after you've attempted the recall. |
| Creative work where friction is the value | A first draft you wrote yourself, even a bad one, is closer to your real ideas than a polished AI draft you have to reverse-engineer. | Draft ugly. Revise with help. Don't outsource generation. |
The learning entry deserves extra weight. A 2008 study by Karpicke and Roediger ("The Critical Importance of Retrieval for Learning") found that students who kept retrieving items during learning recalled about 80% of them a week later, while students who were no longer tested on an item once they had recalled it, even if they kept restudying it, recalled roughly a third. AI is a restudy machine. It hands you the answer. Every time you let it, you skip the retrieval rep that would have built the memory. For a focused decision framework on this, see Claude vs ChatGPT for Learning.
Six Tasks Where AI Genuinely Compounds
The flip side is real. Some tasks gain so much from AI that skipping it would be silly. They share a structure: the task is hard, the verification is cheap, and the output is structured enough that errors surface fast.
| Task | Why AI wins | Prompt skeleton |
|---|---|---|
| Synthesizing 5+ sources | Reading 30 pages and producing a coherent summary is slow for humans, fast for LLMs. Verification is fast if you keep the sources side by side. | "Here are 5 source excerpts. Produce a 200-word synthesis covering points X, Y, Z. Cite each claim by source number." |
| Drafting unfamiliar formats | Grant proposals, legal letters, sprint planning docs you've never written. The format itself is the hard part. | "Draft a [format] for [purpose]. Audience: [X]. Tone: [Y]. 400 words." |
| Translation (when you read but don't write the target language) | Asymmetric verification: you can read it back instantly. | "Translate the following to [language]. Preserve register and idiom where possible." |
| Code outside your comfort zone | A bash one-liner, a regex, a SQL window function. You can run it and see if it works. | "Write a [language] snippet that [does X]. Include 1 test case I can paste into the REPL." |
| Structured extraction (CSV, JSON from messy text) | LLMs are excellent at format-bound extraction. You can validate by schema. | "Extract the following fields from this text into JSON: [field list]. If a field is missing, use null." |
| Socratic critique of your own draft | You wrote it, you know it. The AI's job is just to poke holes. Verification is "do I agree with the critique?" | "Critique this draft as an editor would. Identify the 3 weakest claims and why." |
Notice the common thread: in every winning case, you are still the author of the work. The AI is doing a sub-task whose output you can sanity-check fast. When the AI is doing the thinking, the verification cost balloons and the task drifts back toward the lower half of the matrix. For more on how upstream context quality determines whether these prompts actually work, see Context Engineering.
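For the structured-extraction row, "validate by schema" can be as light as a field check on the returned JSON. The sketch below uses only the standard library. The field list and function name are hypothetical, and the check covers the shape of the output, not whether the extracted values are true to the source text.

```python
import json

REQUIRED_FIELDS = {"name", "email", "company"}  # hypothetical field list

def validate_extraction(raw: str, required=REQUIRED_FIELDS) -> list:
    """Return a list of problems; an empty list means the shape is as requested."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    present = set(data)
    problems = [f"missing field: {f}" for f in sorted(required - present)]
    problems += [f"unexpected field: {f}" for f in sorted(present - required)]
    return problems

# null is allowed for a field the source text did not contain.
print(validate_extraction('{"name": "Ada", "email": null, "company": "Acme"}'))  # []
print(validate_extraction('{"name": "Ada"}'))
# ['missing field: company', 'missing field: email']
```

A check like this catches format drift in seconds, which leaves your own verification time for spot-checking the values against the source.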
The Verification Latency Problem
Here's the dirty secret of AI productivity claims: most "time saved" numbers are measured before verification. The user generates a draft, declares the task complete, and moves on. The verification cost gets pushed downstream, usually to the user's future self when an error surfaces in production, in a meeting, or in front of a client.
Verification latency is the gap between when AI produces output and when you would discover it's wrong. For code, latency is short: it either runs or it doesn't. For prose, latency can be hours or days, especially if the error is a confidently stated false fact. The Vectara Hallucination Leaderboard, which benchmarks how often models invent details not in the source when summarizing, places top consumer models in the 1% to 3% range and weaker models in the 5% to 10% range. A 3% error rate sounds small until you realize it means roughly one in 33 outputs has a fabricated fact. If you treat each paragraph of a 12-paragraph briefing as an independent 3% risk, the chance of at least one fabricated fact is 1 - 0.97^12, or about 31%.
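How a small per-paragraph error rate compounds over a longer document can be checked directly. This is a minimal sketch that assumes each paragraph carries the same, independent chance of containing a fabricated fact, which is a simplification. The function name is illustrative.

```python
def p_at_least_one_error(per_paragraph_rate: float, paragraphs: int) -> float:
    """Probability that at least one of `paragraphs` independent paragraphs has an error."""
    return 1 - (1 - per_paragraph_rate) ** paragraphs

print(round(p_at_least_one_error(0.03, 12), 3))  # 0.306
print(round(p_at_least_one_error(0.01, 12), 3))  # 0.114
```

Even at the 1% end of the range, a 12-paragraph briefing carries a better-than-one-in-ten chance of at least one error under these assumptions.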
The real productivity calculation has to include verification. If a task takes 5 minutes by hand and 2 minutes with AI, you "saved" 3 minutes, but only if verification is free. If verification takes 90 seconds, your real saving is 90 seconds. If verification takes 4 minutes (because the topic is technical and you have to chase citations), you lost a minute. METR's developer study found exactly this pattern: AI generated code fast, but reading and fixing it ate the savings and then some. For a structured way to verify model output without burning all your saved minutes, see the LLM Hallucination Detection Playbook.
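A useful rule: verification should not take more than 30% of the time AI claims to have saved. Past that point the margin is thin, and a single re-prompt or rewrite can erase it. Once verification alone exceeds the claimed saving, you're in negative territory and should do the task yourself.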
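The worked example in this section (5 minutes by hand, 2 minutes with AI) can be reproduced with two small helpers. The function names are illustrative, and all times are in seconds.

```python
def net_saving_s(by_hand_s: float, with_ai_s: float, verify_s: float) -> float:
    """Real time saved once verification is counted (negative means time lost)."""
    return by_hand_s - with_ai_s - verify_s

def verification_share(by_hand_s: float, with_ai_s: float, verify_s: float) -> float:
    """Verification time as a fraction of the saving AI appears to deliver."""
    claimed = by_hand_s - with_ai_s
    if claimed <= 0:
        return float("inf")  # AI saved nothing even before verification
    return verify_s / claimed

print(net_saving_s(300, 120, 0))          # 180  (the "free verification" case)
print(net_saving_s(300, 120, 90))         # 90
print(net_saving_s(300, 120, 240))        # -60
print(verification_share(300, 120, 90))   # 0.5
print(verification_share(300, 120, 54))   # 0.3
```

The last line shows what the 30% threshold means for this task: about 54 seconds of checking against a claimed 3-minute saving.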
Building Your Own AI Time Audit
Theory is cheap. The cure for AI overuse is data on your own behavior. Here's a 7-day exercise that will surface, with embarrassing precision, where AI is helping you and where it's the productivity tax.
Day 0: open a notes file or a spreadsheet. Three columns: timestamp, task description, "what would I have done without AI?" Optional fourth column: estimated minutes saved or lost.
Days 1 through 7: every time you open ChatGPT, Claude, Gemini, or any AI tool, log it. Don't filter. Don't skip the trivial ones. Especially don't skip the trivial ones, because those are the ones quietly draining your day. For each entry, note what you actually used the AI for (write a Slack reply, summarize a doc, draft an email) and what your fallback would have been (typed it myself, skimmed the doc, used a template).
Day 8: review. For each row, estimate net minutes saved or lost. Be honest. If you used AI to write a 3-sentence reply that would have taken 30 seconds to type, log it as -1 minute (prompt + verify took longer than typing). If you used AI to translate a 600-word doc into a language you don't write, log it as +20 minutes.
Most people who run this exercise find two surprises. First, they use AI roughly twice as often as they thought. Second, somewhere between 30% and 50% of those uses are net-negative or break-even. The audit isn't about quitting AI. It's about cutting the bottom third of uses, the ones where the productivity tax exceeds the productivity gain. That alone is usually 30 to 60 minutes a day reclaimed.
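If the log lives in a spreadsheet, the Day 8 review can be tallied with a short script. The sketch below assumes a CSV export with a `net_minutes` column alongside the three columns described above. The column names and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows; replace with your own exported log.
SAMPLE_LOG = """timestamp,task,fallback,net_minutes
2025-01-06T09:12,Slack reply,typed it myself,-1
2025-01-06T10:40,Translate 600-word doc,asked a colleague,20
2025-01-06T14:05,Summarize meeting notes,skimmed the doc,4
2025-01-06T16:30,Fix list formatting,find-and-replace,0
"""

def summarize(log_text: str) -> dict:
    """Count uses, total net minutes, and the share that broke even or lost time."""
    rows = list(csv.DictReader(io.StringIO(log_text)))
    nets = [float(r["net_minutes"]) for r in rows]
    not_worth_it = sum(1 for n in nets if n <= 0)
    return {
        "uses": len(nets),
        "net_minutes": sum(nets),
        "share_not_worth_it": not_worth_it / len(nets) if nets else 0.0,
    }

summary = summarize(SAMPLE_LOG)
print(summary)
# {'uses': 4, 'net_minutes': 23.0, 'share_not_worth_it': 0.5}
```

Sorting the rows by `net_minutes` afterward gives the bottom third of uses to cut.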
Designing An AI-Lean Workflow
Once the audit gives you data, the redesign is straightforward. Default to no AI. Escalate only when the matrix says it's worth it.
The default-to-no-AI heuristic flips the current culture, which is default-to-AI. Most knowledge workers open ChatGPT before they've decided whether the task warrants it. Reverse the order: start the task, and reach for AI only when you hit a real friction point. A real friction point is "I don't know the format of this document," not "this is mildly tedious." Tedium plus AI usually equals tedium plus tax.
For the tasks that do warrant AI, design for low verification cost. That means giving the model the source material it needs (so it doesn't have to invent), asking for structured output (so errors surface), and keeping your verification surface in front of you. This is where Glasp's web highlighter earns its keep in an AI workflow. When you've already highlighted the key passages from an article or PDF, the AI chat feature doesn't have to guess what you care about. The context is pre-loaded. The same logic applies to YouTube Summary: the transcript is the source of truth, and the model is summarizing something verifiable rather than inventing from a vague title.
The rhythm we'd recommend, after watching thousands of Glasp users work this way, is highlight first, prompt later. Highlight while you're reading or watching. Build a small, source-backed corpus. Then, when you need synthesis or critique or extraction, prompt against that corpus. The verification cost collapses, because the source is right there. The hallucination risk drops, because the model has real material to ground in. The productivity tax drops, because the prompt isn't trying to import context, the context is already in the room.
That's an AI-lean workflow. Less AI, used better, on the tasks where the math actually works.
Frequently Asked Questions
Is AI actually slowing me down?
Possibly, on a meaningful fraction of your tasks. METR's July 2025 study of experienced open-source developers found a 19% slowdown when using AI tools, despite users reporting they felt 20% faster. The perception gap is the danger. The only reliable way to know is to run a personal time audit (see "Building Your Own AI Time Audit" above) for one week. Most people find that 30% to 50% of their AI uses are break-even or net-negative.
When should I use ChatGPT vs Claude vs just doing it myself?
Decide in two steps. Step one: run the Worth-It Matrix. If the task is short, familiar, or the verification cost is high, just do it yourself. Step two: if AI is warranted, pick the model based on the task. Claude tends to win for long-context analysis and structured writing. ChatGPT tends to win for fast back-and-forth and tool use. Gemini wins when you need it baked into Google Workspace. The model matters less than the decision to use AI at all.
Why do I feel faster with AI even when I'm not?
Because generation feels fast. Watching tokens stream gives a strong sense of progress, while the prompt-writing time and verification time are diffuse and easy to forget. METR's developers reported a 20% perceived speedup while measurably running 19% slower, a 39-point illusion. The brain over-credits the visible part of the loop and under-credits the invisible parts. The audit fixes this by making the invisible time visible.
Should I stop using AI for writing?
Nuanced. Stop using it for short, familiar writing in your own voice (replies, internal updates, anything under 80 words). The output flattens your voice and the round-trip costs more than typing. Keep using it for unfamiliar formats (grant proposals, legal letters, formats you've written under five times), translation, and structured extraction. And use it for critique of your own drafts, where you remain the author and the AI is just a sparring partner.
How long should it take to verify an AI answer?
Tie verification time to stakes. For low-stakes output (a Slack message, a personal note), 5 to 15 seconds is enough. For medium-stakes (a doc your team will read), 30 to 90 seconds, with at least one fact spot-checked. For high-stakes (anything published externally, code in production, claims about numbers), verification should be at least as long as it would have taken to write the thing yourself. If verification consistently takes more than 30% of the time AI claims to have saved, you're paying the productivity tax in full.
Conclusion
AI is not free. It costs prompt time, verification time, and the occasional re-work bill. On the right tasks, the gains dwarf the costs. On the wrong tasks, the costs quietly eat the day. The 2025 evidence is clear enough that "always use AI" is no longer a defensible default for serious knowledge work.
The practical move is small. Run the audit for a week. Notice where AI compounds and where it taxes. Cut the bottom third of uses. Default to no AI on short, familiar, sub-60-second work. Escalate to AI on hard, structured, easy-to-verify work. Highlight first, prompt later. The result isn't less AI in your life. It's AI that actually pays for itself.