AI Coding Assistants: A Causal Framework for Productivity and Risk
Hook¶
“Which AI coding assistant is best?” is the wrong question.
Rankings are unstable because they compress a complex system into a single score—and that score depends as much on harness design and team workflow as it does on the underlying model.
The causal question this post answers is: under what conditions do coding assistants increase team throughput—and when do they quietly increase defects and operational risk?
Executive Summary¶
- The model matters, but the harness often determines whether a model can follow long-running specs.
- Productivity improvements are mediated by context quality, test scaffolding, and error correction loops.
- If measurement is weak, teams “feel faster” while defect escape increases.
- The highest ROI interventions are governance primitives: unit tests, traceable diffs, and review gates.
- A causal view replaces “tool choice” with “system design”: incentives, prompts, context, and rollback.
- The correct unit of evaluation is not a single task—it’s a workflow over time.
The Causal Model¶
Outcome (Y)¶
Y: Sustainable engineering productivity (lead time, reliability, maintainability, defect escape rate, and developer experience).
Key causes (X)¶
- X1: Model capability (reasoning + code generation ceiling)
- X2: Harness design (context injection, file ops, planning, tool use)
- X3: Prompt discipline (spec clarity, constraints, acceptance criteria)
- X4: Verification loop strength (tests, linting, CI, code review)
Mediators (M)¶
- M1: Context fidelity (does the assistant “see” the right code + constraints?)
- M2: Iteration speed (time-to-first-diff, time-to-green-tests)
- M3: Error correction effectiveness (how fast wrong code is detected and corrected)
Moderators (Z)¶
- Z1: System complexity (monorepo vs small service)
- Z2: Domain risk (payments vs internal tooling)
- Z3: Task type (greenfield scaffolding vs surgical patching)
Confounders (C)¶
- C1: Engineer skill and baseline speed
- C2: Codebase health (existing tests, architecture, clarity)
- C3: Incentives (shipping pressure, quality culture)
Measurement / proxy risks¶
- “Lines of code” and “PR count” are poor productivity proxies.
- “LLM-as-judge” can correlate with style, not correctness.
- “Unit tests passed” can be gamed if tests are incomplete.
Counterfactual statements¶
- If we keep the same model (X1 constant) but improve harness context injection (X2↑), we should see faster convergence to passing tests and fewer spec deviations.
- If we keep tool and model constant but strengthen verification (X4↑), we should observe reduced defect escape even if perceived speed decreases slightly.
Causal Diagrams (Mermaid)¶
A) Primary DAG¶
flowchart TB
%% Inputs
X1["X1: Model capability"]:::i
X2["X2: Harness design"]:::i
X3["X3: Prompt discipline"]:::i
X4["X4: Verification loop strength"]:::i
C1["C1: Engineer baseline skill"]:::r
C2["C2: Codebase health"]:::r
C3["C3: Incentives / culture"]:::r
Z1["Z1: System complexity"]:::r
Z2["Z2: Domain risk"]:::r
Z3["Z3: Task type"]:::r
%% Mediators / mechanisms
M1["M1: Context fidelity"]:::p
M2["M2: Iteration speed"]:::p
M3["M3: Error correction"]:::p
%% Outcome
Y["Y: Sustainable engineering productivity"]:::o
%% Structural links
X2 --> M1
X3 --> M1
M1 --> M2
X4 --> M3
M2 --> M3
X1 --> Y
M2 --> Y
M3 --> Y
C1 --> X3
C1 --> Y
C2 --> X4
C2 --> Y
C3 --> X4
C3 --> Y
Z1 -. moderates .-> X2
Z2 -. moderates .-> X4
Z3 -. moderates .-> X3
%% brModel styles
classDef i fill:#eef6ff,stroke:#2563eb,stroke-width:1px,color:#0f172a;
classDef p fill:#ecfdf5,stroke:#16a34a,stroke-width:1px,color:#052e16;
classDef r fill:#fff7ed,stroke:#f97316,stroke-width:1px,color:#431407;
classDef o fill:#fdf2f8,stroke:#db2777,stroke-width:1px,color:#500724;
B) Feedback loop (Goodhart risk)¶
flowchart TB
%% Inputs
A["Pressure to ship"]:::i --> B["Use assistant for speed"]:::p
B --> C["More code changes"]:::p
%% Decision gates
G1{"CI passes<br>(tests + lint + build)?"}:::p
G2{"Review approves<br>(risk-aware)?"}:::p
%% Records / artifacts
R1["Patch set (diff) + provenance"]:::r
R2["CI report (tests/logs)"]:::r
R3["Change record + ownership"]:::r
R4["Incident record + postmortem"]:::r
%% Outputs
O1["Defects escape"]:::o
O2["Incidents + rework"]:::o
C --> R1 --> G1
G1 -- pass --> R2 --> G2
G1 -- fail --> O2
G2 -- approve --> O1
G2 -- reject --> O2
O1 --> O2 --> A
O2 --> R4
%% Levers that strengthen the gates
L1["Strong tests + CI gates"]:::i --> G1
L2["Traceable diffs + ownership"]:::i --> G2
L3["Risk-based scope limits"]:::i --> G2
L2 --> R3
%% brModel styles
classDef i fill:#eef6ff,stroke:#2563eb,stroke-width:1px,color:#0f172a;
classDef p fill:#ecfdf5,stroke:#16a34a,stroke-width:1px,color:#052e16;
classDef r fill:#fff7ed,stroke:#f97316,stroke-width:1px,color:#431407;
classDef o fill:#fdf2f8,stroke:#db2777,stroke-width:1px,color:#500724;
C) Intervention map¶
flowchart TB
%% Inputs / levers
L1["Write acceptance tests first"]:::i
L2["Constrain prompts with invariants"]:::i
L3["Use a harness that shows diffs"]:::i
L4["Enforce CI + code review"]:::i
L5["Measure defect escape + rework"]:::i
L6["Limit scope in high-risk areas"]:::i
%% Process targets
P1["Raise spec clarity"]:::p
P2["Increase context fidelity"]:::p
P3["Increase verification strength"]:::p
P4["Tighten governance loop"]:::p
%% Records / products
R1["Test suite + failure signals"]:::r
R2["Diff trace + ownership"]:::r
R3["CI evidence bundle"]:::r
R4["Quality dashboard (escape + rework)"]:::r
%% Outcome
Y["Sustainable engineering productivity"]:::o
L1 --> P1 --> R1 --> P3
L2 --> P1
L3 --> P2 --> R2 --> P4
L4 --> P3 --> R3 --> P4
L5 --> R4 --> P4
L6 --> P4
P2 --> Y
P3 --> Y
P4 --> Y
%% brModel styles
classDef i fill:#eef6ff,stroke:#2563eb,stroke-width:1px,color:#0f172a;
classDef p fill:#ecfdf5,stroke:#16a34a,stroke-width:1px,color:#052e16;
classDef r fill:#fff7ed,stroke:#f97316,stroke-width:1px,color:#431407;
classDef o fill:#fdf2f8,stroke:#db2777,stroke-width:1px,color:#500724;
Mechanism Walkthrough¶
Mechanism 1: The harness shapes what the model can do¶
A coding assistant is not just a model. It is a control system.
A strong harness:
- loads the right files and only the right files,
- plans work before editing,
- applies changes as diffs,
- runs tests and iterates.
A weak harness:
- leaks irrelevant context,
- loses long-running constraints,
- encourages the model to “fill in” missing details.
This is why a model that looks “great” in a demo can fail in production workflows.
Mechanism 2: Verification loops turn generation into engineering¶
Without tests, a coding assistant optimizes for plausibility.
With tests, it is forced to optimize for behavior.
The productivity gain comes from converting ambiguous human intent into executable checks, then letting the model propose diffs that satisfy those checks.
Mechanism 3: Measurement determines what teams optimize¶
If you reward speed, the system will produce speed.
If you reward reliability, the system will produce reliability.
Coding assistants amplify whatever your organization already optimizes.
Alternative mechanisms (weaker)¶
- “Just choose the top-ranked tool.” Rankings are non-causal summaries; they do not transfer across codebases and workflows.
- “The best model solves everything.” Model improvements are real but still bounded by context fidelity and verification.
Evidence & Uncertainty¶
What we know¶
- Benchmark results vary dramatically by model and by harness.
- Instruction-following is harder than producing “good-looking” code.
- Unit tests provide the most reliable ground truth signal among common evaluation methods.
What we strongly suspect¶
- Many teams overestimate productivity because they measure outputs (PRs) rather than outcomes (defect escape, rework).
- The largest long-run gains come from enforcing constraints and verification, not from changing which model is used.
What we don’t know yet¶
- Which harness patterns generalize best across languages and monorepo structures.
- How to build evaluation that is robust to “teaching to the test.”
Falsification ideas¶
- Swap harnesses while keeping the same model; measure time-to-green-tests and spec deviations.
- Increase verification gates for assistant-generated diffs; measure changes in defect escape rate.
Interventions & Leverage Points¶
1) Define acceptance criteria as tests - Expected effect: tighter feedback loop, fewer speculative diffs. - Risks: upfront time. - Prerequisites: test framework discipline. - Measurement: time-to-green-tests; coverage of edge cases.
2) Adopt diff-first workflows - Expected effect: safer review, easier rollback. - Risks: requires tooling support. - Prerequisites: harness that produces clean diffs. - Measurement: review time; revert rate.
3) Constrain prompts with invariants - Expected effect: reduces scope creep and spec drift. - Risks: false rigidity. - Prerequisites: written invariants per module. - Measurement: “violation rate” of constraints.
4) Strengthen CI gates and ownership - Expected effect: reduces defect escape. - Risks: perceived slowdown. - Prerequisites: stable CI. - Measurement: post-merge incident rate; rollbacks.
5) Measure rework, not vibes - Expected effect: corrects Goodhart loops. - Risks: metric fatigue. - Prerequisites: lightweight tagging of “AI-assisted changes.” - Measurement: rework hours per feature; production incidents.
6) Scope limits in high-risk domains - Expected effect: prevents catastrophic failures. - Risks: lost opportunity. - Prerequisites: risk classification. - Measurement: incident severity distribution.
Practical Takeaways¶
- Evaluate assistants as systems: model + harness + workflow.
- Treat tests as the interface between intent and code.
- Prefer tools that show diffs and support rollback.
- Constrain prompts with explicit invariants and acceptance criteria.
- Measure defect escape and rework to avoid self-deception.
- Keep high-risk modules behind stricter gates.
- Expect rankings to drift; design for robustness instead.
Glossary¶
- Harness: the orchestration layer that manages context, file edits, tool execution, and iteration.
- Defect escape: failures that reach production.
- Goodhart effect: when a measure becomes a target, it stops being a good measure.