Domain Code Reviews

March 29, 2026

"Review this code" is not a great prompt you can give an LLM.

It's vague. The model spreads its attention evenly across syntax, naming, and formatting, and gives you generic advice you already know. It's like asking a doctor "am I healthy?" instead of "check my bloodwork."

There's a better way. After you finish building a feature, run domain-specific reviews: targeted passes where you tell the model exactly what lens to look through.

Why generic review fails

Three reasons:

1. Attention dilution. Transformers compute relevance of every token to every other token. A vague prompt spreads that computation thin. Specific terms like "race condition" or "timezone" act as searchlights; they force the model's attention heads to concentrate on exactly those failure domains.

```python
async def process_order(request):
    user = get_user(request.headers["auth_token"])
    items = request.json["items"]
    total = sum(item["price"] * item["qty"] for item in items)

    order = Order(
        user_id=user.id,
        total=total,
        created_at=datetime.now(),
        items=items
    )

    db.orders.insert(order)

    response = requests.post(PAYMENT_API, json={
        "amount": total,
        "currency": "USD",
        "user": user.email
    })

    send_email(user.email, f"Order confirmed: ${total}")
    return {"status": "success", "order_id": order.id}
```

Attention is spread evenly. The timezone bug on line 9, `created_at=datetime.now()`, gets the same weight as boilerplate.
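For reference, the fix that a timezone-focused pass points at: `datetime.now()` returns a naive local time whose meaning depends on the server's locale, while an aware UTC timestamp is unambiguous. A minimal sketch using only the standard library:

```python
from datetime import datetime, timezone

naive = datetime.now()               # naive: no tzinfo attached, meaning depends on server locale
aware = datetime.now(timezone.utc)   # aware: unambiguous UTC, safe to store and compare

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC
```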

2. Sycophancy. Models are trained to be agreeable. Ask "is this code good?" and the model wants to say yes. It'll nitpick your variable names while ignoring a missing null check. Adversarial framing, like "challenge the assumptions," gives the model permission to be ruthless.

3. Open-ended generation hallucinates. When you ask "find bugs," the model is generating possibilities from scratch. When you ask "are boundary conditions handled?", you've converted it into a verification task, a yes/no question against concrete criteria. Checklists reduce hallucination by up to 15% on complex reasoning tasks.

The method

After finishing a feature, run separate review passes. Each one forces the model into a different expert persona.

Pass 1: Edge cases & assumptions

Prompt: Edge Cases

Challenge the assumptions in this code:

  • What happens with empty collections, null values, zero values?
  • How are boundary conditions handled (max int, empty string, unicode)?
  • What timezone, locale, or encoding assumptions exist?
  • What implicit assumptions exist about input data?

This works because LLMs are trained on millions of bug fixes where exactly these assumptions caused crashes. The specific vocabulary acts as a retrieval key into that knowledge.
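Applied to the `process_order` snippet above, this lens flags cases like an empty `items` list or a non-positive quantity silently producing a zero-dollar or negative order. A sketch of the kind of guard this pass tends to suggest (the function name is illustrative):

```python
def order_total(items):
    """Compute an order total, rejecting the edge cases the review pass asks about."""
    if not items:  # empty collection: don't silently create a $0 order
        raise ValueError("order must contain at least one item")
    for item in items:
        if item["qty"] <= 0 or item["price"] < 0:  # boundary values
            raise ValueError(f"invalid line item: {item}")
    return sum(item["price"] * item["qty"] for item in items)
```

With the guard in place, `order_total([])` fails loudly instead of quietly charging nothing.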

Pass 2: Concurrency & state

Prompt: Concurrency

Review this code for concurrency and state management issues:

  • Any unhandled race conditions?
  • Is shared mutable state protected?
  • What if operations are called out of expected order?
  • Are there any deadlock risks?

LLMs detect missing locks and unprotected shared memory well in high-level languages (Python, Go, TypeScript). They're weaker on hardware-level memory models, so don't rely on this for low-level C/Rust concurrency.
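For intuition, here is the classic check-then-act race this pass reliably catches in high-level code, sketched in Python (the `Inventory` class is illustrative):

```python
import threading

class Inventory:
    """Shared mutable state: without the lock, two concurrent orders can both
    pass the stock check and oversell the last unit (a check-then-act race)."""
    def __init__(self, stock):
        self.stock = stock
        self._lock = threading.Lock()

    def reserve(self):
        with self._lock:  # protects the check-then-decrement sequence as one unit
            if self.stock <= 0:
                return False
            self.stock -= 1
            return True
```

Delete the `with self._lock:` line and the bug is exactly what the "unprotected shared mutable state" bullet is probing for.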

Pass 3: Failure & resilience

Prompt: Resilience

Review this code assuming the network is hostile:

  • What if network calls are slow, fail, or return unexpected data?
  • Is there retry logic with backoff?
  • What if a dependency is down?
  • What if a response shape doesn't match expectations?

This invokes the model's deep training on distributed systems post-mortems. It knows the fallacies of distributed computing, that developers assume networks are reliable and latency is zero, and searches your code for those exact assumptions.
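A minimal sketch of the retry-with-backoff shape this pass pushes toward (the attempt count and delays are illustrative, not production values):

```python
import time

def with_retries(call, attempts=3, base_delay=0.5):
    """Retry a flaky network call with exponential backoff.
    Re-raises the last error once attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

In the order snippet above, the payment call would be wrapped as `with_retries(lambda: requests.post(PAYMENT_API, json=payload, timeout=5))`, adding the timeout the original also lacks.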

Pass 4: Security & adversarial input

Prompt: Security

Act as a penetration tester reviewing this code:

  • What if the user sends malicious input?
  • Are there injection vectors (SQL, XSS, command injection)?
  • Could secrets leak through logs or error messages?
  • What undocumented behaviors might break in future versions?

Studies show 29–36% of AI-generated code contains at least one security weakness. This pass catches what you introduced and what your copilot introduced.
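A classic finding from this lens, sketched with Python's built-in sqlite3 (the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice@example.com')")

def find_user_unsafe(email):
    # Injection vector: the value is interpolated straight into the SQL string
    return conn.execute(f"SELECT * FROM users WHERE email = '{email}'").fetchall()

def find_user_safe(email):
    # Parameterized query: the driver handles escaping
    return conn.execute("SELECT * FROM users WHERE email = ?", (email,)).fetchall()
```

The payload `' OR '1'='1` makes the unsafe version return every row; the parameterized version treats it as a literal string and returns nothing.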

All four lenses on one snippet

Run each of the four prompts above against the same `process_order` snippet from earlier. Each lens reveals entirely different bugs, issues invisible to the other passes.



Why this works mechanistically

When you write "timezone assumptions" in a prompt, you're not just asking a question. You're reprogramming the model's forward pass. The attention mechanism reweights to prioritize tokens related to datetime libraries, offset calculations, and locale handling. Researchers call this Post-hoc Attention Steering; it improves accuracy by up to 22% on analysis tasks without changing the model's weights.

The model already has internal representations of bugs. It encodes incorrect code as anomalies in its latent space. But those anomalies don't always surface in the output, especially when the surrounding code looks correct and the model's agreeableness bias is active. Your checklist is the activation energy that pushes those anomalies above the output threshold.

[Figure: a bug detected internally must cross the model's activation threshold before it surfaces in the output.]

Why you can't do this yourself (as well)

You just spent hours building the feature. You're in "happy path" mode; your brain is wired to confirm it works, not to break it. This is called vigilance decrement: after sustained cognitive effort, your ability to spot errors drops sharply. Research shows 48.8% of programmer actions during review are influenced by confirmation bias.

[Figure: accuracy vs. hours into review — the human curve declines over four hours and crosses below the LLM's flat curve.]

The LLM doesn't get tired. It applies the same scrutiny to line 1,000 as to line 1. It doesn't assume "the user will use the UI correctly." It doesn't know it's your code, so it has no ego to protect.

This is the optimal split: you handle architecture and intent, the model handles combinatorial edge cases. Each of you covers the other's blind spot.

The workflow

1. Build the feature
2. Commit (or stage) your changes
3. Run 2–4 domain-specific review passes
4. Fix what the model finds
5. Run your static analyzers (ESLint, CodeQL, etc.)
6. Ship
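Step 3 can be scripted. A sketch assuming a hypothetical `call_model(prompt)` wrapper around whatever LLM API you use; the prompts are condensed from the four passes above:

```python
PASSES = {
    "edge_cases": "Challenge the assumptions in this code: empty collections, "
                  "boundary conditions, timezone/locale/encoding assumptions.",
    "concurrency": "Review for race conditions, unprotected shared mutable state, "
                   "out-of-order calls, and deadlock risks.",
    "resilience": "Assume the network is hostile: slow or failing calls, missing "
                  "retries, downed dependencies, unexpected response shapes.",
    "security": "Act as a penetration tester: malicious input, injection vectors, "
                "secrets leaking through logs or error messages.",
}

def review_diff(diff, call_model):
    """Run every domain pass over one diff and collect the findings.
    call_model is your own LLM wrapper (hypothetical here)."""
    return {name: call_model(f"{prompt}\n\n{diff}")
            for name, prompt in PASSES.items()}
```

Feed it the output of `git diff --staged` and read the four reports before fixing anything.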

Domain reviews sit between "I think I'm done" and "it's actually done." They're the structured skepticism that catches what your linter can't see and what your tired brain won't notice.

Don't ask the model to review your code. Tell it exactly how to break it.