Part 3: Expert Mastery

1. The Systems Audit Framework

You are no longer a user. You are an auditor. You are operating a probabilistic system that has sudo access to your machine.

Distrust Confidence

A confident explanation is just a high-probability text completion. It is not evidence of truth.

Verify State, Not Text

Never trust what the model says it did. Only trust the file system.
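What “only trust the file system” looks like in practice is a mechanical check, not a judgment call. A minimal sketch, assuming a git working tree; the file paths are hypothetical placeholders for whatever the model claims to have edited.

  import subprocess
  from pathlib import Path

  def verify_claimed_changes(claimed_files):
      """Check the file system, not the transcript: do the claimed files exist,
      and what does git say actually changed?"""
      missing = [f for f in claimed_files if not Path(f).exists()]

      # Ground truth from the working tree, not from the model's summary.
      status = subprocess.run(
          ["git", "status", "--porcelain"],
          capture_output=True, text=True, check=True,
      )
      actually_changed = {line[3:] for line in status.stdout.splitlines()}

      return {
          "missing": missing,
          "claimed_but_untouched": sorted(set(claimed_files) - actually_changed),
          "changed_but_unclaimed": sorted(actually_changed - set(claimed_files)),
      }

  # Hypothetical example: the model reported editing these two files.
  print(verify_claimed_changes(["src/auth.py", "tests/test_auth.py"]))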

Assume Drift

The longer the session, the less the model knows about the original goal.


2. Hidden Constraints & Risk Surface

Constraints Not in Marketing

Stateless Execution

Each API call is independent. “Memory” is just context window mechanics, not persistent understanding.
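A sketch of those mechanics, to make the point concrete: every call re-sends whatever history still fits, and older turns fall out silently. The token estimate and truncation policy below are illustrative assumptions, not any particular SDK's behaviour.

  history = []

  def send_request(prompt):
      # Stand-in for a real API call; reports how many turns survived truncation.
      return f"(reply generated from {len(prompt)} surviving turns)"

  def call_model(user_message, max_context_tokens=8000):
      history.append({"role": "user", "content": user_message})

      # Crude truncation stand-in: keep only the most recent turns that fit.
      budget, kept = max_context_tokens, []
      for turn in reversed(history):
          cost = len(turn["content"]) // 4      # rough token estimate
          if budget < cost:
              break                             # older turns vanish silently
          budget -= cost
          kept.append(turn)
      prompt = list(reversed(kept))

      reply = send_request(prompt)
      history.append({"role": "assistant", "content": reply})
      return reply

  print(call_model("Remember: refunds only, never charge."))

Nothing “learns” here; the appearance of memory is entirely a function of how much history gets re-sent, which is why old corrections quietly stop applying.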

No Self-Verification

The model itself cannot execute your code; only the tool harness can, and only its output counts as evidence. When the model says “I tested this” with no corresponding test run, it generated plausible test-passing text.
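The countermeasure is mechanical: run the suite yourself and trust the runner's exit code, not the sentence “all tests pass.” A minimal sketch, assuming a pytest project; substitute your own test command.

  import subprocess

  def tests_actually_pass():
      """The exit code of the real test runner is the evidence; prose is not."""
      result = subprocess.run(
          ["python", "-m", "pytest", "-q"],
          capture_output=True, text=True,
      )
      print(result.stdout[-2000:])   # keep the real output, not a paraphrase of it
      return result.returncode == 0

  if __name__ == "__main__":
      print("verified:", tests_actually_pass())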

Hallucination is Core

Generating tokens that maximize probability sometimes produces confident fabrications (package names, API signatures, security claims).
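For the package-name case there is a cheap first filter: check whether a suggested distribution even resolves in your environment before trusting anything said about it. A minimal sketch using the standard library; the names passed in are examples, including the fabricated-looking one.

  from importlib import metadata, util

  def check_suggested_packages(names):
      """Existence check only: importable here, and at what installed version?
      Passing this is a floor, not proof the described API is real."""
      report = {}
      for name in names:
          try:
              version = metadata.version(name)
          except metadata.PackageNotFoundError:
              version = None
          report[name] = {
              "importable": util.find_spec(name) is not None,
              "installed_version": version,
          }
      return report

  # Example: one real package, one name a model might have invented.
  print(check_suggested_packages(["requests", "data_utils"]))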

Risk Surface

  • Syntactically correct, semantically wrong code.
  • Passing tests that don’t cover critical paths.
  • Security vulnerabilities that match correct code patterns.
  • Performance issues not visible in small data tests.

Unresolved responsibility questions:

  • The model generates code with defects: who is liable?
  • Generated tests pass: who validates test quality?
  • The model refuses a task citing safety: is this legal risk or a training artifact?

3. Experiments & Reality Checks

These are not tutorials. They are diagnostic procedures. Run them to calibrate your trust.

Claim: Confident tone indicates accurate output.

Action:

  1. Ask Claude Code to implement the same function twice with different phrasings, and compare the implementations.
  2. Paste subtly broken code and ask it to explain what the code does.

Observe: Are both answers delivered with the same confident tone, even where they differ? Does it flag the obvious errors in the broken code, or explain it as if correct?

Reflect: Confidence is a style token, not a truth signal.
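One way to make “compare implementations” concrete is to diff behaviour rather than prose: run both generated versions against the same randomized inputs. The two version_ functions below are hypothetical stand-ins for the model's two outputs.

  import random

  # Hypothetical stand-ins for the two generated implementations.
  def version_a(xs):
      return sorted(set(xs))

  def version_b(xs):
      return list(dict.fromkeys(sorted(xs)))

  def first_divergence(impl_a, impl_b, trials=1000):
      """Return the first random input on which the implementations disagree."""
      for _ in range(trials):
          xs = [random.randint(-5, 5) for _ in range(random.randint(0, 8))]
          if impl_a(list(xs)) != impl_b(list(xs)):
              return xs
      return None   # no divergence found (evidence, not proof, of equivalence)

  print("diverging input:", first_divergence(version_a, version_b))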

Claim: Corrections teach the model to avoid similar errors.

Action:

  1. Get Claude Code to make a specific error (e.g., off-by-one).
  2. Correct it explicitly.
  3. Later (after ~50 turns), ask it to write a similar loop.

Observe: Does it avoid the previous error type?

Reflect: It likely “forgot” the correction because the context window drifted.
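The usual mitigation is to stop relying on the conversation to carry corrections at all: keep standing constraints in a file and re-inject them on every turn. A minimal sketch; the file name and prompt wrapper are assumptions.

  from pathlib import Path

  CONSTRAINTS_FILE = Path("PROJECT_CONSTRAINTS.md")   # hypothetical location

  def build_prompt(task):
      """Re-send standing constraints every turn instead of trusting that a
      correction made 50 turns ago is still inside the context window."""
      constraints = CONSTRAINTS_FILE.read_text() if CONSTRAINTS_FILE.exists() else ""
      return f"Standing constraints (always apply):\n{constraints}\n\nTask:\n{task}"

  print(build_prompt("Write the pagination loop for the export job."))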

Claim: When Claude Code says “I validated X,” actual validation occurred.

Action:

  1. Provide deliberately broken code (e.g., resource leak).
  2. Ask Claude Code to “review this code and report any issues.”

Observe: Does it catch the leak, or does it “validate” the code by commenting on syntax and variable naming?

Reflect: It often validates style over semantics.
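If you need a seed for this experiment, the snippet below is deliberately broken in the way that tends to slip through review: clean syntax and tidy names around a resource leak and a silent failure path. Paste it in and note which defects the review actually names.

  # Deliberately broken: tidy code, real defects.
  def load_records(paths):
      records = []
      for path in paths:
          handle = open(path)          # resource leak: never closed, even on error
          for line in handle:
              try:
                  records.append(int(line))
              except ValueError:
                  pass                 # silent data loss on malformed rows
      return records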


4. Representative Failure Scenarios

Scenario A: The Passing Test Illusion

Setup: You ask for a login function. The model generates the code AND the unit tests.

Failure: The tests pass. The code looks clean.

Reality: The tests only exercise the “happy path.” A session fixation vulnerability exists, but the generated tests didn’t imagine an attacker.

Lesson: Never let the entity writing the code also write the only validation for that code.
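The gap is easiest to see side by side. Below is a minimal sketch with a hypothetical, deliberately vulnerable session store: the happy-path test passes, while the adversarial property (the session ID must change at login) fails and exposes the fixation.

  import uuid

  class SessionStore:
      """Hypothetical, deliberately vulnerable: reuses the pre-login session ID."""
      def __init__(self):
          self.sessions = {}

      def login(self, session_id, user):
          self.sessions[session_id] = user
          return session_id

  def test_happy_path():
      store = SessionStore()
      assert store.login(str(uuid.uuid4()), "alice") is not None  # passes, proves little

  def test_session_id_rotates_on_login():
      """Adversarial property: a session ID planted before login must not stay valid."""
      store = SessionStore()
      planted = str(uuid.uuid4())
      assert store.login(planted, "alice") != planted             # fails on this code

  test_happy_path()
  test_session_id_rotates_on_login()

Run as-is, the second assertion fails against this vulnerable store, which is exactly the signal the generated happy-path suite never produced.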

Scenario B: The Plausible Import

Setup: “Refactor data pipeline to use modern libraries.”

Failure: The model writes from data_utils import StreamProcessor. The code runs.

Reality: StreamProcessor v2.3.1 has a memory leak. The model pattern-matched the library name but has no knowledge of the specific version’s runtime behavior.

Lesson: The model validates existence, not production readiness.
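Getting from “it imports” to “it survives production” means pinning the exact version you tested and measuring the behaviour you care about. A rough sketch using tracemalloc to probe allocation growth across repeated calls; the lambda is a placeholder for the real StreamProcessor workload, and the version lookup for the hypothetical package is left commented out.

  import tracemalloc
  from importlib import metadata

  def retained_bytes(fn, iterations=1000):
      """Crude leak probe: memory still held after many calls, beyond one warm-up call."""
      tracemalloc.start()
      fn()                                      # warm-up to absorb one-time allocations
      baseline, _ = tracemalloc.get_traced_memory()
      for _ in range(iterations):
          fn()
      current, _ = tracemalloc.get_traced_memory()
      tracemalloc.stop()
      return current - baseline

  # Record the exact version under test; "data_utils" is the scenario's hypothetical name.
  # print(metadata.version("data_utils"))
  print(retained_bytes(lambda: [bytearray(1024) for _ in range(10)]))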

Scenario C: The Semantic Drift

Setup: A long session (100+ turns).

Failure: Early constraints (e.g., “Refunds only, never charge”) have fallen out of the context window.

Reality: The model implements a new feature that charges users, because logically, “processing a transaction” implies charging, and the “Refunds only” instruction is gone.

Lesson: Review the whole codebase periodically, not just the diff.
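A constraint that must survive a long session belongs somewhere the context window cannot drop it: a test or CI check. A crude sketch, assuming the scenario's “refunds only” rule and a hypothetical src/payments layout.

  from pathlib import Path

  # Hypothetical guardrail: nothing under src/payments may call a charge API.
  FORBIDDEN_CALLS = ("charge(", "create_charge(", "capture_payment(")

  def test_refunds_only_policy():
      violations = []
      for source in Path("src/payments").rglob("*.py"):
          text = source.read_text()
          violations += [f"{source}: {call}" for call in FORBIDDEN_CALLS if call in text]
      assert not violations, f"Refunds-only policy violated: {violations}"

Crude as string matching is, this check runs in CI on every change, so it cannot fall out of anyone's context window.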


5. Transfer Test

Responsibility Boundary Transfer:


6. Exit Condition

You have achieved Expert Mastery when:

  1. You classify failures correctly: You distinguish between model prediction error, tool execution failure, and context truncation.
  2. You do not attribute agency: You don’t say “it decided.” You say “it predicted.”
  3. You design verification boundaries: You know which properties linting checks, and which require human review.
  4. You verify tool execution outputs: You check exit codes and file state, not the model’s confidence level, as in the triage sketch below.
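A sketch of what items 1 and 4 look like when mechanized; the categories mirror the list above, and the ToolResult shape is an assumption, not any real SDK type.

  from dataclasses import dataclass
  from enum import Enum, auto

  class FailureKind(Enum):
      TOOL_EXECUTION = auto()      # the command itself failed
      MODEL_PREDICTION = auto()    # command succeeded, but the claim contradicts its output
      CONTEXT_TRUNCATION = auto()  # a standing instruction is no longer in context
      NONE = auto()

  @dataclass
  class ToolResult:                # assumed shape, not a real SDK type
      command: str
      returncode: int
      stdout: str

  def classify(result, claim_matches_output, constraint_still_in_context):
      """Triage in checklist order: exit code first, then claim vs. output, then context."""
      if result.returncode != 0:
          return FailureKind.TOOL_EXECUTION
      if not claim_matches_output:
          return FailureKind.MODEL_PREDICTION
      if not constraint_still_in_context:
          return FailureKind.CONTEXT_TRUNCATION
      return FailureKind.NONE

  print(classify(ToolResult("pytest -q", 1, "2 failed"), False, True))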