Agentic Engineering: Skills, Subagents, and Hooks Explained

The Moment Vibe Coding Stopped Scaling

In February 2025, Andrej Karpathy posted what looked like a joke on X. He described "vibe coding" as accepting whatever an LLM produced without really reading it. "I just see stuff, say stuff, run stuff, and copy paste stuff," he wrote, "and it mostly works."

It did mostly work. For about a year.

By late 2025, the industry started seeing the bill. Forrester research published in 2025 found that code with significant AI involvement carried roughly 1.7 times the rate of major issues found in human-written code. Separate audits from security firms put the share of AI-generated code with security flaws between 40 and 62 percent, depending on prompt style and language. Models skip input validation, leak secrets in logs, and confidently call APIs and functions that do not exist.

What broke was the assumption that a single prompt and a single context window could carry a real codebase. Repos got bigger. Conventions got more specific. Side effects (migrations, deploys, paid API calls) got more dangerous. The workflow that felt magical on a greenfield app collapsed when you pointed it at a service with five years of decisions baked in.

The cracks showed up in three ways. Models forgot project conventions halfway through a session and reintroduced patterns the team had spent months deleting. Long sessions accumulated so much irrelevant context that the model started ignoring the actual task. Agents happily ran destructive commands because nothing stopped them. A prompt that said "do not push to main" worked about 95 percent of the time, which is another way of saying it failed 1 in 20 sessions.

By the end of 2025, every serious team using Claude Code, Cursor, or similar tools had built some version of the same scaffolding around the model. Nobody had a name for it yet.

Karpathy's Rename to Agentic Engineering

In February 2026, Karpathy posted again. The new term was "agentic engineering."

Vibe coding, he argued, described the toy mode. Agentic engineering described what people who actually shipped were doing: writing project-specific context files, defining narrow skills, spawning subagents for isolated work, and gating tool use through deterministic hooks. The model is still doing the typing. The human is doing the engineering of the system the model runs inside.

Tooling matured between the two posts. Anthropic's Claude Code Best Practices documented the CLAUDE.md convention and the subagent pattern. Building Agents with the Claude Agent SDK laid out the agent loop and the role of hooks. Skills landed in October 2025: small markdown files the agent loads only when the task triggers them. Cursor 2.0 shipped background agents in cloud VMs with git-worktree isolation and up to eight parallel agents.

Martin Fowler's "Context Engineering for Coding Agents" captured the principle that tied all of this together. The job is no longer prompt engineering, which optimized one string of text. The job is context engineering: deciding what the model sees, when it sees it, and what it is allowed to do with it.

This is not Anthropic-specific. Cursor calls them rules and background agents. GitHub Copilot uses an AGENTS.md convention. Gemini CLI uses GEMINI.md. The names differ. The shape does not.

The Four Pillars of Agentic Engineering

Before going deep on each pillar, it helps to see them side by side. They look similar (they're all "things you give the agent") but they solve different problems.

Pillar	What it stores	When it loads	Failure mode if missing
CLAUDE.md	Always-true project context: stack, conventions, commands, hard rules	Every session, automatically	Model reinvents conventions, forgets the package manager, runs `npm install` in a yarn repo
Skills	Procedural knowledge for narrow tasks (rebase, schema migration, Stripe review)	On demand, when the task matches the skill description	Model improvises domain-specific steps and gets the order wrong
Subagents	A fresh context window plus a system prompt for a single role	When the main agent delegates a defined task	Main context gets polluted by side quests; one bad tool call corrupts the whole session
Hooks	Shell commands fired on tool events (PreToolUse, PostToolUse, Stop)	Deterministically, every time the trigger fires	Risky commands run unchecked; formatter never runs; the "always do X" prompt fails silently

The pattern: CLAUDE.md is what the model knows. Skills are what the model can look up. Subagents are who else the model can ask. Hooks are what the model cannot avoid.

Each pillar exists because the other three cannot cover its failure mode.

CLAUDE.md: The Advisory Layer

CLAUDE.md (or AGENTS.md, or GEMINI.md, depending on your tool) is a markdown file at the root of your repo. The agent reads it at the start of every session and treats it as background context.

It is not memory in the consumer-product sense. It is the file you, the human, write to tell the agent what it would have learned if it had been on the team for a year.

What belongs in it:

Stack and versions: "Next.js 16 App Router, React 19, Yarn 1.22 (npm prohibited), Node 22."
Repo layout: which directory does what.
Commands: yarn dev, yarn test, yarn build:ci, with notes on which one is safe to run.
Hard rules: "Never push directly to main. Always create a feature branch."
Style conventions: tab width, lint rules that bite, naming conventions.
Domain shortcuts: glossary terms specific to your product.

What does not belong: step-by-step instructions for occasional tasks (those go in Skills), long reference docs (the model skims them, so link out instead), and anything secret (CLAUDE.md content is loaded into context, and context can be quoted back).

The trap most teams fall into is the everything-bagel CLAUDE.md. A 4,000-line file with every coding standard and every architecture decision. The model loads all of it on every task and starts to treat the relevant 5 percent the same as the irrelevant 95 percent. Token costs go up. Adherence to specific rules goes down.

A good CLAUDE.md is closer to a sticky note than a wiki. Aim for one screen of essential context. If you find yourself writing a section that starts with "When doing X..." then X probably wants to be a skill. CLAUDE.md is for "always true." Skills are for "true when."
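The "sticky note, not a wiki" rule can be checked mechanically. What follows is a minimal Python sketch, not an official tool: the function name lint_claude_md, the 120-line budget, and both regex heuristics are assumptions picked for illustration.

```python
import re

# Hypothetical budget: "one screen" is taken here to mean roughly 120 lines.
MAX_LINES = 120

# A line opening with "When <verb>ing ..." usually describes "true when"
# knowledge, which is a hint that it belongs in a skill instead.
CONDITIONAL = re.compile(r"^\s*(?:[-*#]+\s*)?when\s+\w+ing\b", re.IGNORECASE)

# Very rough secret detector; real secret scanners are far more thorough.
SECRET = re.compile(
    r"(?:api[_-]?key|secret|token|password)\s*[:=]\s*\S+", re.IGNORECASE
)


def lint_claude_md(text: str, max_lines: int = MAX_LINES) -> list:
    """Return human-readable warnings for the contents of a CLAUDE.md file."""
    warnings = []
    lines = text.splitlines()
    if len(lines) > max_lines:
        warnings.append(f"{len(lines)} lines exceeds the {max_lines}-line budget")
    for number, line in enumerate(lines, start=1):
        if CONDITIONAL.search(line):
            warnings.append(f"line {number}: conditional guidance, consider a skill")
        if SECRET.search(line):
            warnings.append(f"line {number}: possible secret, remove it")
    return warnings


if __name__ == "__main__":
    sample = "\n".join([
        "Stack: Next.js 16 App Router, React 19, Yarn 1.22 (npm prohibited), Node 22.",
        "Never push directly to main. Always create a feature branch.",
        "When rebasing, always fetch develop first.",
    ])
    for warning in lint_claude_md(sample):
        print(warning)
```

Run against the sample, the sketch flags only the third line, which is the kind of "true when" guidance that reads better as a skill.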

Skills: Just-in-Time Knowledge Files

Skills are small markdown files (typically under 200 lines) that the agent loads only when the task matches. Each skill has a name, a description, and a body. The description is what the agent reads first to decide whether to pull in the full skill.

Anthropic shipped skills as a first-class concept in late 2025. You put a skill file in a known directory; its frontmatter describes when to use it. When planning a task, the agent scans the available skill descriptions and loads any whose description matches.

Good skill examples:

rebase-cleanly: how to rebase on develop, conflict resolution rules, what to do if tests fail post-rebase.
review-stripe-integration: checklist for changes that touch Stripe webhooks, idempotency keys, price IDs.
add-shadcn-component: the exact commands and import conventions for adding a shadcn/ui component to this repo.
debug-flaky-test: the team's preferred order of operations when a CI test is intermittent.

Each one is procedural and narrow. Each one would be too much detail to live in CLAUDE.md, but is too important to leave to the model's general knowledge.

The mental model: CLAUDE.md is your colleague telling the new hire the basics on day one. A skill is the runbook they hand the new hire when a Stripe webhook fails at 2 a.m. You don't memorize the runbook; you read it when you need it.

Skills compose. The agent can load three skills in one task ("add a new API route" plus "validate input with Zod" plus "write a Vitest test") without you predicting the combination in advance.

Skill descriptions matter more than you'd think. Vague ones ("helpful for code things") either never trigger or trigger on everything. Write descriptions that name the situation: "Use when the user asks to rebase a branch, resolve merge conflicts, or clean up commit history."
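To make the two-stage loading concrete, here is a small Python sketch. It assumes a skill file with name and description frontmatter followed by a body. The keyword matcher is only a stand-in: in a real agent the model itself reads the descriptions and decides what to load. The names parse_skill and select_skills, the stopword list, and the overlap threshold are all hypothetical.

```python
import re
from dataclasses import dataclass

# Words too common to count as evidence of a match (illustrative list).
STOPWORDS = {"a", "an", "and", "the", "to", "or", "for", "of", "when", "use", "user"}


@dataclass
class Skill:
    name: str
    description: str
    body: str


def parse_skill(text: str) -> Skill:
    """Split a skill file into frontmatter fields and body.

    Assumes a simple layout: a '---' line, 'key: value' lines, a closing
    '---' line, then the body. Real frontmatter may be richer YAML.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with '---' frontmatter")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is not closed") from None
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return Skill(meta["name"], meta["description"], body)


def select_skills(task: str, skills: list, min_overlap: int = 2) -> list:
    """Naive stand-in for the model's judgement: count shared words."""
    task_words = set(re.findall(r"[a-z]+", task.lower()))
    chosen = []
    for skill in skills:
        desc_words = set(re.findall(r"[a-z]+", skill.description.lower()))
        if len(task_words & (desc_words - STOPWORDS)) >= min_overlap:
            chosen.append(skill)  # only now would the body enter the context
    return chosen


REBASE = """---
name: rebase-cleanly
description: Use when the user asks to rebase a branch, resolve merge conflicts, or clean up commit history.
---
1. Fetch develop.
2. Rebase and resolve conflicts file by file.
"""

VAGUE = """---
name: misc
description: Helpful for code things.
---
Assorted notes.
"""

if __name__ == "__main__":
    skills = [parse_skill(REBASE), parse_skill(VAGUE)]
    task = "Please rebase my branch and resolve the conflicts"
    for skill in select_skills(task, skills):
        print(skill.name)
```

The specific description gives the matcher something to hold on to; the vague one shares no meaningful words with the task and never loads.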

Subagents: Isolated Context Workers

A subagent is an agent the main agent can call. It has its own system prompt, its own context window, and its own tool permissions. When it finishes, it returns a result to the main agent. Then it goes away.

The naive reading is "subagents are for parallelism." That is part of it (Cursor 2.0 advertises up to eight parallel background agents) but parallelism is not the main benefit. The main benefit is context isolation.

Three patterns where subagents pay off:

1. The researcher. You want the agent to search 200 files and summarize what it finds. If the main agent does this, all 200 files end up in the main context, even though you only needed three sentences of summary. A research subagent reads the 200 files, summarizes, returns one paragraph. The main context stays clean.

2. The reviewer. Before a commit, you want a fresh pass on the diff. A reviewer subagent loads a "code review" skill, reads the diff with no other context, and reports issues. Because it has no memory of the implementation argument the main agent had with you, it cannot rationalize away problems.

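3. The risky operation. A migration script. A bulk rename. A schema change. The agent plans and executes it in isolation, then reports back. If something goes wrong, the failed attempts and error output stay in the subagent's context instead of derailing the main session. Side effects on disk or in a database are only contained if the subagent also runs with restricted tool permissions or in a separate worktree.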
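The researcher pattern is the easiest one to picture in code. This toy Python sketch uses no model at all: research_subagent and make_fake_repo are hypothetical names, and a plain substring search stands in for the subagent's reading. It only illustrates the accounting, which is that the subagent reads everything and the main context receives one line.

```python
def research_subagent(files: dict, needle: str) -> str:
    """Stand-in for a research subagent.

    Everything read here lives in a local list that is discarded on return,
    mirroring a context window that goes away when the subagent finishes.
    """
    subagent_context = []
    hits = []
    for path, text in files.items():
        subagent_context.append(text)  # the subagent "reads" the whole file
        if needle in text:
            hits.append(path)
    read_chars = sum(len(text) for text in subagent_context)
    return (
        f"Read {len(files)} files ({read_chars} chars); "
        f"{len(hits)} mention {needle!r}: {', '.join(sorted(hits))}"
    )


def make_fake_repo() -> dict:
    """Build 200 filler files, three of which call the function we care about."""
    files = {f"src/module_{i}.py": "x = 1\n" * 200 for i in range(200)}
    for i in (7, 42, 199):
        files[f"src/module_{i}.py"] += "legacy_fetch()\n"
    return files


if __name__ == "__main__":
    repo = make_fake_repo()
    main_context = ["User: where do we still call legacy_fetch?"]
    # Only the returned summary is appended to the main agent's context.
    main_context.append(research_subagent(repo, "legacy_fetch"))
    print(main_context[-1])
    print("chars added to main context:", len(main_context[-1]))
    print("chars the subagent read:", sum(len(t) for t in repo.values()))
```

The gap between the last two numbers is the whole argument for delegating the search.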

There's a real cost. Each subagent is another model call and another context window. They add coordination complexity. A team that spawns a subagent for every task burns tokens and slows itself down.

The rule of thumb: spawn a subagent when one of three things is true. (1) The task would dump a lot of garbage into the main context. (2) The task benefits from a fresh perspective. (3) The task is risky and you want it sandboxed. Otherwise, keep working in the main session.

Open-source collections like VoltAgent's awesome-claude-code-subagents catalog hundreds of pre-built subagents. Most teams do best with three or four custom subagents tuned to their codebase rather than dozens of generic ones.

Hooks: Deterministic Guardrails

Hooks are the part of the stack the model cannot talk its way out of. They are shell commands wired to tool events. When the event fires, the command runs. The model has no say.

The canonical events:

PreToolUse: fires before a tool call. Can block the call.
PostToolUse: fires after a tool call. Useful for formatters, validators, side effects.
Stop: fires when the agent finishes a turn. Useful for notifications.
Notification: fires on certain agent messages.

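Why hooks beat prompts for safety: prompts are probabilistic. Even a clear instruction like "never run rm -rf" will fail occasionally, because the model is doing pattern completion. A hook that greps the command for rm -rf and exits with a blocking status before the shell sees it (in Claude Code that means exit code 2; other non-zero codes are reported but do not block) behaves the same way every time the pattern appears. It can still miss a variant it was never written to match, but it will not forget. It's a regex, not a vibe.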
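A guard of this kind fits in a few dozen lines. The Python sketch below assumes the Claude Code hook contract as documented at the time of writing: the event arrives as JSON on stdin with tool_name and tool_input.command, and exit code 2 blocks the call while stderr is passed back to the agent. Verify both against the current docs. The pattern list, the function names, and the --hook flag (which only keeps the file importable without waiting on stdin) are illustrative assumptions.

```python
import json
import re
import sys
from typing import Optional

# Illustrative patterns only; a real guard needs a list tuned to the team.
RM_RF = re.compile(r"\brm\s+-\w*(?:r\w*f|f\w*r)", re.IGNORECASE)
GIT_PUSH = re.compile(r"\bgit\s+push\b")
FORCE = re.compile(r"\s(?:--force|-f)\b")
PROTECTED = re.compile(r"\b(?:main|master)\b")  # hypothetical protected branches
DROP_TABLE = re.compile(r"\bdrop\s+table\b", re.IGNORECASE)
ENV_OVERWRITE = re.compile(r">\s*\S*\.env")


def check_command(command: str) -> Optional[str]:
    """Return a reason to block the command, or None to allow it.

    Known gap: combined flags such as 'rm -rf' are caught, but split flags
    such as 'rm -r -f' are not. A regex only blocks what it was written for.
    """
    if RM_RF.search(command):
        return "recursive force delete"
    if GIT_PUSH.search(command) and FORCE.search(command) and PROTECTED.search(command):
        return "force push to a protected branch"
    if DROP_TABLE.search(command):
        return "DROP TABLE statement"
    if ENV_OVERWRITE.search(command):
        return "redirect into a .env file"
    return None


def handle(event: dict) -> int:
    """Map a hook event to an exit code: 0 allows, 2 blocks."""
    if event.get("tool_name") != "Bash":
        return 0
    command = event.get("tool_input", {}).get("command", "")
    reason = check_command(command)
    if reason is None:
        return 0
    print(f"Blocked by pre-bash-guard: {reason}", file=sys.stderr)
    return 2  # assumed blocking status; other non-zero codes would not block


def main() -> int:
    return handle(json.load(sys.stdin))


# The --hook flag is a hypothetical convention so that importing or running
# this file directly does not sit waiting for an event on stdin.
if __name__ == "__main__" and "--hook" in sys.argv:
    sys.exit(main())
```

Keeping check_command pure makes the guard easy to unit test, which matters more here than in most scripts: the value of the hook is that its behavior is known in advance.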

Three hooks worth having:

pre-bash-guard (PreToolUse on Bash). Reads the command, blocks dangerous patterns: rm -rf /, git push --force against protected branches, DROP TABLE, direct overwrites of .env* files. A 30-line shell script saves you from disasters prompts can't reliably prevent.

post-edit-prettier (PostToolUse on Edit/Write). After the agent edits a .ts or .tsx file, run prettier. Formatting deterministically, rather than asking the model to remember, keeps style consistent across the session.

notify-on-stop (Stop). When the agent finishes a long-running task, fire a macOS notification or Slack ping. Quality of life, but it changes how you work: you can let the agent run for ten minutes and not babysit it.
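The formatter hook is the simplest of the three to sketch. This Python version assumes the event JSON carries tool_name and tool_input.file_path for Edit and Write calls, and that prettier is a dev dependency reachable through yarn; both are assumptions to check against your tool and repo. The function names and the --hook flag are hypothetical.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Optional

FORMATTABLE = {".ts", ".tsx"}


def file_to_format(event: dict) -> Optional[str]:
    """Return the edited file path if it should be formatted, else None."""
    if event.get("tool_name") not in {"Edit", "Write"}:
        return None
    path = event.get("tool_input", {}).get("file_path", "")
    return path if Path(path).suffix in FORMATTABLE else None


def prettier_command(path: str) -> list:
    """Build the formatter invocation (assumes prettier is installed via yarn)."""
    return ["yarn", "prettier", "--write", path]


def main() -> int:
    path = file_to_format(json.load(sys.stdin))
    if path is None:
        return 0
    try:
        # A formatter failure should not block the agent, so errors are
        # reported on stderr and the hook still exits 0.
        subprocess.run(prettier_command(path), check=False, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as error:
        print(f"post-edit-prettier skipped: {error}", file=sys.stderr)
    return 0


# The --hook flag is a hypothetical convention so that importing this file
# in a test does not wait for an event on stdin.
if __name__ == "__main__" and "--hook" in sys.argv:
    sys.exit(main())
```

The design choice worth noting is that this hook never blocks. A guard should fail closed; a formatter should fail open.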

There's a small performance cost. Each hook is a process spawn. In practice this is unnoticeable compared to the model's own latency, and the determinism is worth it.

The mental shift: stop thinking of safety as something you ask the model to do. Start thinking of safety as something the environment enforces, the same way a CI pipeline enforces tests. The model is a fast junior. Hooks are the pre-commit hook the junior cannot disable.

How the Four Pillars Compose

Here is how a real team's setup looks, in prose.

The repo has a CLAUDE.md at the root. About 80 lines. It lists the stack (Next.js 16, React 19, Yarn, Node 22), the directory layout, the test command, the deploy rule, and a glossary of five domain terms.

In .claude/skills/, six skills, each a folder with a SKILL.md file inside (Claude Code's layout): rebase-cleanly, add-api-route, review-stripe, debug-firestore, write-deep-dive, sql-migration. Each is 80 to 150 lines.

In .claude/agents/ (where Claude Code looks for project subagents), three. A reviewer runs before commits and reports diff issues. A researcher is invoked when the main agent needs to read more than ten files. A test-runner is invoked when a test fails on the first try; it isolates the failure without polluting the main context.

In .claude/hooks/, four. pre-bash-guard.sh blocks dangerous commands. pre-edit-env-guard.sh blocks edits to .env.local. post-edit-prettier.sh runs prettier after edits to .ts/.tsx files. notification.sh pings on macOS when a long task finishes.

A normal session: developer asks the agent to "add a new API route that returns user bookmarks." The agent reads CLAUDE.md, matches the add-api-route skill and loads it, writes the file. The post-edit hook runs prettier. It writes a test, prettier runs again. It asks the reviewer subagent to review the diff. The reviewer flags missing input validation; the main agent adds it. The developer asks to commit. The pre-bash hook checks the branch and allows the commit. The stop hook pings the developer.

No part of that flow needed a long prompt. The prompt was "add a new API route that returns user bookmarks." Everything else was wired into the environment.

Anti-Patterns Builders Keep Shipping

A few patterns to avoid, drawn from teams that adopted this stack badly.

The one-giant-CLAUDE.md. Some teams treat CLAUDE.md as a dumping ground for every decision in three years. The result is a 5,000-line file that the model loads but does not internalize. The rule about not using npm ends up sandwiched between two pages of architecture rationale, and the model picks up the rationale and forgets the rule. Keep CLAUDE.md tight.

The no-hooks bet. Some teams skip hooks entirely, relying on prompts to keep the agent safe. This works most of the time, which is exactly the problem. Most of the time is not good enough for rm -rf or git push --force. If the consequence is "I lost an hour of work," prompts are fine. If the consequence is "I dropped a production table," you need a hook.

Subagent sprawl. Some teams build a subagent for every conceivable role. Researcher, reviewer, planner, summarizer, namer, refactorer, documenter, tester. Each one is another file to maintain, another set of tokens, another coordination overhead. Teams that succeed with subagents tend to have three to five, each with a clear job. Not twenty.

Skill-as-documentation-dump. A skill is not a place for your old wiki pages. If a skill is 800 lines, the model loads 800 lines every time it triggers. If your skill is long, it's probably two skills.

Treating Skills like CLAUDE.md. Putting always-true context in a skill means it loads only sometimes. The team-wide rule "never use npm" belongs in CLAUDE.md, because it applies to every task.

Hooks that block the agent for ten seconds. A hook that runs a full test suite on every edit makes the agent unusable. Hooks should be fast. The expensive checks belong in CI.

The Practical Setup You Can Adopt This Week

If you've read this far and want a starting point, here is the lean version. A senior engineer can set this up in a day, and it's enough to make agentic coding meaningfully more reliable than the vibe-coding default.

CLAUDE.md (about 50 lines). Stack and versions. Package manager (and which to never use). Top-level directories. The five hard rules (don't push to main, don't touch .env.local, use these test commands). A short list of domain terms. Resist the urge to add more.

Two skills. rebase-cleanly.md (your team's rebase steps, 80 lines max) and review-changes.md (your code-review checklist, 100 lines max).

One reviewer subagent. Loads review-changes.md, reads the diff, reports issues. Invoked before commits.

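Three hooks. PreToolUse on Bash blocks rm -rf, git push --force against protected branches, and shell commands that overwrite .env* files (edits made through the Edit or Write tools never pass through Bash, so guarding those takes a separate PreToolUse matcher). PostToolUse on Edit/Write runs prettier on edited .ts/.tsx files. Stop fires a macOS notification when the agent finishes.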
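Hooks only run once they are registered. The Python sketch below generates a settings fragment in the shape documented for the hooks section of Claude Code's settings.json at the time of writing: an event name, an optional matcher, and a list of commands. Treat that shape as an assumption to verify. The script names, the python3 invocation, and the --hook flag are placeholders.

```python
import json


def command(script: str) -> dict:
    """Describe one hook command.

    The script path is a placeholder. --hook is a hypothetical flag; drop it
    if your scripts read stdin unconditionally.
    """
    return {"type": "command", "command": f"python3 .claude/hooks/{script} --hook"}


def build_hook_settings() -> dict:
    """Assemble the three-hook starter configuration described above."""
    return {
        "hooks": {
            # Runs before every Bash tool call and may block it.
            "PreToolUse": [
                {"matcher": "Bash", "hooks": [command("pre_bash_guard.py")]}
            ],
            # Runs after file edits; the matcher is assumed to accept alternation.
            "PostToolUse": [
                {"matcher": "Edit|Write", "hooks": [command("post_edit_prettier.py")]}
            ],
            # Stop is not tied to a tool, so it carries no matcher.
            "Stop": [{"hooks": [command("notify_on_stop.py")]}],
        }
    }


if __name__ == "__main__":
    # Paste the output into .claude/settings.json, merging with existing keys.
    print(json.dumps(build_hook_settings(), indent=2))
```

Generating the fragment is optional; writing the JSON by hand works just as well. The point is that the wiring is a small, reviewable file rather than a paragraph in a prompt.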

That's it. CLAUDE.md plus two skills plus one subagent plus three hooks. About eight files once you count the settings file that registers the hooks, all under version control, all reviewable by the rest of the team.

You will iterate. After a week, you will notice tasks the agent keeps fumbling and codify them as new skills. You will see categories of bad commands and add them to the pre-bash guard. You will spot the moment a researcher subagent would have kept your main context clean and add one. The four pillars are the shape. The content is yours.

The shift from vibe coding to agentic engineering is not a step up in cleverness. It is a step toward operational discipline. You're treating the agent the way you'd treat any other system that runs in production: with conventions, guardrails, and isolation between concerns. Less magic, more engineering, and a workflow that scales past the first month.

Frequently Asked Questions

Is this Claude Code-specific or does it apply to Cursor and other agents?

The pillars apply across tools; the names differ. Cursor uses "rules" files and "background agents." GitHub Copilot uses AGENTS.md. Gemini CLI uses GEMINI.md. Hooks are not universal yet (Claude Code has the most mature implementation), but most tools have some equivalent. The mental model (context layer, on-demand knowledge, isolated workers, deterministic guards) generalizes even where the implementation differs.

What's the difference between a Skill and putting everything in CLAUDE.md?

Loading. CLAUDE.md loads at the start of every session. Skills load only when their description matches the task. If you put procedural knowledge for ten different tasks in CLAUDE.md, the model loads all of it on every task, the context fills with mostly irrelevant content, and adherence to specific rules drops. Skills keep CLAUDE.md tight and keep procedural details available on demand.

When should I create a subagent vs just use a longer prompt?

Three conditions push you toward a subagent. (1) The task would dump a lot of content into the main context. (2) You want a fresh perspective that the main agent's accumulated reasoning cannot taint. (3) The task is risky and you want it sandboxed. Otherwise, a longer prompt or a skill is usually the better tool. Subagents cost tokens and coordination.

Do hooks slow down the agent?

Each hook is a process spawn, so it adds milliseconds to seconds depending on what it does. In practice, this is dwarfed by the model's own latency. Long hooks (running a full test suite on every edit) will make the agent feel sluggish; those belong in CI. A good rule: PreToolUse hooks should exit in under 200ms in the common case; PostToolUse hooks like formatters can take a second or two without anyone noticing.
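If you want to hold a hook to that budget, measuring it is straightforward. This Python sketch times one invocation of a hook command against a threshold. The time_hook name, the sample payload, and the use of a no-op Python process as the "hook" are illustrative assumptions; substitute your own command.

```python
import subprocess
import sys
import time


def time_hook(command: list, payload: str, budget_s: float = 0.2) -> tuple:
    """Run a hook command once with the payload on stdin.

    Returns (elapsed_seconds, within_budget). A single run is noisy, so
    treat the result as a smoke test rather than a benchmark.
    """
    start = time.perf_counter()
    subprocess.run(command, input=payload, text=True, capture_output=True, check=False)
    elapsed = time.perf_counter() - start
    return elapsed, elapsed <= budget_s


if __name__ == "__main__":
    fake_event = '{"tool_name": "Bash", "tool_input": {"command": "yarn test"}}'
    # Stand-in hook: a Python process that reads the event and exits.
    noop_hook = [sys.executable, "-c", "import sys; sys.stdin.read()"]
    elapsed, ok = time_hook(noop_hook, fake_event)
    print(f"{elapsed * 1000:.0f} ms, within budget: {ok}")
```

Interpreter startup alone can take a noticeable share of a 200ms budget, which is one reason many teams keep the hottest guards in plain shell.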

Should I commit my CLAUDE.md to the repo?

Yes, almost always. CLAUDE.md is a team artifact: it encodes the conventions everyone (human or agent) should follow. Committing it means the whole team's agents work from the same context, and the file gets reviewed like any other code. The only thing you might keep local is per-developer permission settings (Claude Code supports a settings.local.json for that).

Closing Thoughts

Karpathy's two posts, a year apart, neatly bracket what happened to AI coding in 2025. The first was permission to play. The second was the bill arriving. Vibe coding was useful as a discovery mode: it taught people what these models could do without making them learn an SDK first. It was not designed to ship.

What's replacing it is recognizable to anyone who has worked on real systems. You write down the invariants (CLAUDE.md). You package the procedures (Skills). You spin off isolated workers for risky or context-heavy jobs (Subagents). You install guardrails at the boundaries (Hooks). None of this is exotic. It is the same operational discipline that distinguishes a hobby project from a service that runs on Monday morning without anyone on call.

What surprises me about working teams' setups is how small they are. Eight files. Maybe a thousand lines total. A reviewer subagent. Three hooks. That's enough to convert a model that occasionally drops your database into a teammate that doesn't. The leverage is not in volume. It's in putting the right invariant in the right pillar.

If you're still in pure vibe-coding mode in mid-2026, you're not behind. You're at the stage where most of the productivity gain comes from making the model boring: more predictable, more constrained, more reviewable. Less magic, on purpose. That is the work.