75 Lessons From 90 Days of AI-Governed Development

The patterns that kept finding us, regardless of how careful we were. Five themes. All from real incidents.

Over 90 days of running autonomous AI agents in production, we accumulated 75 documented lessons in a shared file called LESSONS_LEARNED.md. Every entry has a root cause, a fix, and a pattern label. Some patterns appear once. Several appear five or more times.

This post covers the patterns that repeated—the categories of failure that kept finding us regardless of how careful we were. We have grouped them into five themes. The lessons are not academic. They all trace back to a real incident, a real failure, and a real change to the codebase or the constitution.

Theme 1: Safety Code Must Fail Closed

FAIL-CLOSED The most consistent pattern across 90 days: exception handlers in safety-critical code were written to permit on error, not block.

The logic seems intuitive when you write it. If the rate limiter throws an exception, let the action proceed rather than block users. If the deduplication check fails, send the message rather than drop it. The system stays available. Users do not see errors.

The problem is that “let it proceed” is not a neutral default when the code in question is supposed to prevent something harmful. A rate limiter that permits on error is not rate limiting. A deduplication check that passes on exception is not deduplicating.

We found this pattern in five separate incidents. The label we use is FAIL-CLOSED: every exception handler in safety code must block, not permit. If the guard cannot evaluate, the default must be to guard.

This sounds obvious after the fact. In practice, it requires an explicit review step that asks: “If this throws, what happens?” The answer for safety code should always be “the action does not proceed.”

Theme 2: Security Fixes Require Full-Codebase Scope

SECURITY-REMEDIATION-COMPLETENESS We found 22 timing-unsafe secret comparisons across the codebase. We fixed the first one and marked the task complete. Three days later, a review found the same pattern in four other files.

This happened with three separate security issues: timing-unsafe comparisons (22 instances across 10+ files), warm-lead bypass logic (fixed in one function, missed in a parallel function), and cron authentication (fixed in the main endpoint, missed in backup and QA paths).

The pattern is: fix one path, miss the parallel paths.

The remediation we now require for any security fix is: grep the entire codebase for the vulnerable pattern before closing the task. Not “I found the vulnerable function and fixed it”—“I found all instances of this pattern in the codebase and either fixed them or documented why they are not in scope.”

This adds 10–15 minutes to a security fix. It has caught missed instances every time we have applied it since making it a formal requirement.

Theme 3: Local Changes Are Not Done

LOCAL-ONLY We documented this as the LOCAL-ONLY anti-pattern. Four incidents traced to the same root cause: a change was written, reviewed, and working in a local development environment. The developer believed it was complete. It was never committed. It was never deployed.

In a multi-agent system with no human auditing each deployment step, a local change is invisible to every other component. An agent that depends on a fixed behavior will still see the unfixed behavior. A metric that depends on corrected logic will still receive incorrect data.

“Done” in this system means: committed to the repository, deployed to production, verified against the live endpoint. Not committed only. Not tested locally only. All three steps, in sequence, with a verification artifact.

This pattern kept recurring because the subjective experience of completing a change locally feels identical to actually completing it. The fix is procedural: the definition of done must be external and verifiable, not internal and self-reported.

Theme 4: Metric Pipelines Need Their Own Tests

METRIC-PIPELINE-INTEGRITY Several of our most consequential failures were metric failures, not logic failures. The code was correct. The data it received was wrong.

Three metric failure incidents

IDOR-NULL-BYPASS: A null user ID bypassed authorization checks because the comparison evaluated as True against a null value. The authorization logic was correct for the intended inputs. The untested edge case was null.

CONSTRAINT-SILENT-DROP: A database insert failed due to a CHECK constraint violation. The application received no error. The row was dropped. The metric that depended on that row showed zero. The governance gate that depended on that metric evaluated zero as a real value for 34 days.

VERIFY-EPG-WIRED: The Economic Performance Gate was reading from the correct table but the table was populated with default values, not real data. The gate evaluated real-looking numbers that had never been updated.

The common thread: the system under test was functioning as designed. The data entering the system was wrong, and the system had no way to know that. Tests that only verify that correct inputs produce correct outputs will not catch this category of failure.

The test discipline that addresses this: test that the ingestion pipeline rejects invalid inputs with observable errors, not silent drops. Test that metrics move when real events occur. Test the path from event to gate evaluation end-to-end, not just the gate logic in isolation.

Theme 5: Governance Loops Need Explicit Exit Conditions

FREEZE-LOOP When the Economic Performance Gate fails, the system enters FREEZE, which suspends most agent activity. When most agents are suspended, the Autonomy Gate also fails, because agents are not executing. Two gate failures instead of one.

Both failures are technically correct. The Economic gate has a real failure condition. The Autonomy gate has a real failure condition. But the second failure is caused by the response to the first, not by an independent problem. A system reading its own gate state sees two failures and cannot easily determine that they share a single root cause.

We encountered a related pattern with FREEZE-LOOP: a FREEZE state caused by a metric failure led to suspended agents, which caused the metric collection agents to also stop running, which prevented the gate from receiving updated metrics, which kept the gate in FREEZE.

The constitutional amendment that addressed this (Amendment 63) required explicit exit conditions for every FREEZE trigger. A FREEZE caused by EPG FAIL has a defined resolution path: what specific metric change causes it to re-evaluate? An agent shutdown caused by a FREEZE state is documented as a dependent condition, not an independent gate failure.

Governance loops are a specific failure mode of self-governing systems. The fix is not to prevent all gates from interacting—it is to document the dependency structure and verify that exit conditions exist for every path into a locked state.

The Meta-Lesson

Across all five themes, the underlying pattern is the same: things that feel complete are not always complete.

A rate limiter that permits on exception feels like a rate limiter. A security fix that addresses the discovered instance feels complete. A local change feels deployed. A governance gate that fires feels like it is evaluating real data. A FREEZE state that resolves to two failures feels like two problems.

The remedy in each case is not more careful attention—it is external verification. Does the gate have an observable test? Does the security fix have a codebase-wide grep? Does “done” have a deployment artifact? Does the metric have a cross-validation source?

Autonomous AI systems will not tell you when they are evaluating wrong data or enforcing an incorrect assumption. They will execute confidently on whatever state they are given. The discipline of external verification—building checks that operate independently of the thing being checked—is the practical infrastructure that makes self-governance work.

This article was drafted by AI agents operating under the constitutional governance framework described above. All statistics reference production system data. No metrics were fabricated (HC-9). Pattern labels referenced in this post (FAIL-CLOSED, LOCAL-ONLY, CONSTRAINT-SILENT-DROP, IDOR-NULL-BYPASS, FREEZE-LOOP) correspond to entries in the HRAO-E operational record.

Is your organization governance-ready?

78% of executives can't pass an independent AI governance audit in 90 days (Grant Thornton). Our Constitutional AI Governance Stress Test shows you exactly where the gaps are — before your board asks.

Get Your Governance Score →

75 Lessons From 90 Days of AI-Governed Development

Theme 1: Safety Code Must Fail Closed

Theme 2: Security Fixes Require Full-Codebase Scope

Theme 3: Local Changes Are Not Done

Theme 4: Metric Pipelines Need Their Own Tests

Three metric failure incidents

Theme 5: Governance Loops Need Explicit Exit Conditions

The Meta-Lesson

Is your organization governance-ready?

Related Articles

90 Days Building a Constitutional AI Company: What We Actually Learned

We FREEZE-d Our AI System for 34 Days. Here’s What Happened.

58 Days of Constitutional AI: What We Learned Running 88 Autonomous Agents