90 Days Building a Constitutional AI Company: What We Actually Learned


On January 4, 2026, we started a 90-day experiment: run a company almost entirely on autonomous AI agents, governed by a written constitution, with a human CEO spending less than 30 minutes per day on operations.

Today is day 90. This is the field report.

We are not going to tell you it went the way we expected. It did not. The surprises were more useful than the plans.

What We Built

Before covering what we learned, here is what we actually built—because the scale matters for interpreting the lessons.

Parallel Claude instances: 7
Autonomous agents: 91
Decisions per day: 1,312
Constitutional amendments: 64
Documented lessons: 75
Organic user signups: 873

Over 90 days, we ran 7 parallel Claude instances working simultaneously from a shared task queue. They produced 91 autonomous agents handling everything from business development outreach to security monitoring to constitutional amendment drafting. The system executed 1,312 decisions per day by day 90, up from roughly 200 in week one.

The governing document, a “constitutional operating standard,” grew from 15 sections to over 60. Along the way we ratified 64 amendments and accumulated 75 documented lessons learned. We submitted two NIST public comment responses and published five research preprints: the Decision Load Index (DOI: 10.5281/zenodo.18217577), Constitutional Self-Governance (DOI: 10.5281/zenodo.19162104), Normalization of Deviance Detection (DOI: 10.5281/zenodo.19195516), the Agent Security Harness framework (DOI: 10.5281/zenodo.19343034), and a Community-Driven Security framework (DOI: 10.5281/zenodo.19343108). At close of pilot, the server code stood at 1,906 lines split across 50 route modules, backed by 1,523 automated tests with zero failures.

We signed up 873 users—every one of them organic, no paid acquisition for the first 60 days—and received an 88/100 audit score from an independent architecture review.

The system that went live on day 1 does not resemble the system running on day 90. That gap is the actual story.

What Worked

The constitution worked better than expected as a coordination mechanism. Seven AI instances running in parallel, working from shared files, with no human mediating their handoffs—this could have been chaos. It mostly was not. When an instance needed to change code owned by another instance, it cited the relevant constitutional section and documented the change in a shared handoff file. Conflicts were rare. When they occurred, they were traceable.

We did not expect the constitution to function as a shared language. It did. Phrases like “Section 8.4 EPG FAIL” and “HC-9 violation” became unambiguous signals across instances that required no additional explanation.

Immutable logging prevented retroactive rationalization. Every agent decision was logged with a constitutional citation, a timestamp, and a reason code. When things went wrong—and they did—we could reconstruct exactly what happened. Without this, post-mortems would have been guesswork.
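As a minimal sketch of this kind of logging (not the production implementation), the example below chains each entry to the hash of the previous one, so an edit to any past entry becomes detectable. The class name, field names, and the choice of a SHA-256 hash chain are assumptions for illustration. A hash chain makes tampering evident rather than impossible.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    # Stable hash of an entry: keys are sorted so the same content always hashes the same.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class DecisionLog:
    """Append-only log where each entry commits to the previous entry's hash."""
    entries: list = field(default_factory=list)

    def append(self, agent: str, citation: str, reason_code: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {
            "agent": agent,              # hypothetical agent name
            "citation": citation,        # e.g. "Section 8.4"
            "reason_code": reason_code,  # e.g. "EPG_FAIL"
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": prev_hash,
        }
        body["hash"] = _digest(body)  # hash is computed before the "hash" key exists
        self.entries.append(body)
        return body

    def verify(self) -> bool:
        """Return True only if every entry matches its hash and links to its predecessor."""
        prev_hash = "0" * 64
        for entry in self.entries:
            unhashed = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(unhashed):
                return False
            prev_hash = entry["hash"]
        return True
```

Calling `verify()` after the fact is what turns a post-mortem from guesswork into reconstruction: if it returns False, the record itself has been altered and cannot be trusted.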

The six-gate architecture caught problems early. The system uses six evaluation gates before acting: Epistemic (prevents false certainty), Risk (prevents trust damage), Governance (prevents gaming), Economic (prevents unprofitability), Autonomy (ensures Level 4 operation), and Constitutional Growth (ensures self-improvement). In practice, the Economic gate was the most active. It fired correctly 34 consecutive days when real performance data showed $0 revenue. The gates did not make us more successful—they prevented us from pretending we were.
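The gate pattern can be sketched as a list of functions that each inspect a proposed action and return a pass/fail result. This is an illustration under assumptions: only two of the six gates are shown, and the rules inside them, the `observed_revenue_usd` and `evidence` keys, and all function names are hypothetical rather than taken from the real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateResult:
    gate: str
    passed: bool
    reason: str


# Each gate maps a proposed action (a plain dict here) to a result.
Gate = Callable[[dict], GateResult]


def economic_gate(action: dict) -> GateResult:
    # Hypothetical rule: fail while observed revenue is zero.
    revenue = action.get("observed_revenue_usd", 0.0)
    return GateResult("Economic", revenue > 0, f"observed revenue ${revenue:.2f}")


def epistemic_gate(action: dict) -> GateResult:
    # Hypothetical rule: a claim must point at supporting evidence.
    ok = bool(action.get("evidence"))
    return GateResult("Epistemic", ok, "evidence attached" if ok else "no evidence")


def evaluate(action: dict, gates: list[Gate]) -> list[GateResult]:
    """Run every gate and return all results, so failures are reported together."""
    return [gate(action) for gate in gates]


def may_proceed(results: list[GateResult]) -> bool:
    """An action proceeds only if every gate passed."""
    return all(r.passed for r in results)
```

Running all gates instead of stopping at the first failure is a design choice in this sketch: it keeps every failing reason visible in one evaluation.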

SOLID architecture kept the codebase maintainable at speed. We started with a 37,727-line server.py file. By day 90, it was 1,906 lines, split across well-defined modules. Seven instances editing simultaneously without merge conflicts requires genuine separation of concerns. The constitutional file size limits (no file over 2,000 lines) turned out to be practical, not decorative.
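A file size limit like this is easy to enforce mechanically. The sketch below is one way to do it, assuming Python sources under a single root directory. The function names and the `*.py` pattern are illustrative, and only the 2,000-line limit comes from the text above.

```python
from pathlib import Path

MAX_LINES = 2000  # limit stated in the article


def count_lines(path: Path) -> int:
    # Undecodable bytes are replaced so one odd file cannot crash the check.
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def oversized_files(root: Path, pattern: str = "*.py", limit: int = MAX_LINES) -> list[tuple[Path, int]]:
    """Return (path, line_count) for every matching file above the limit."""
    offenders = []
    for path in sorted(root.rglob(pattern)):
        lines = count_lines(path)
        if lines > limit:
            offenders.append((path, lines))
    return offenders
```

A check like this could run before each commit and reject the change whenever `oversized_files` returns a non-empty list.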

What Failed

Our economic evaluation gate ran on incomplete data for 34 days.

This is the hardest thing to write, and the most important.

The gate that was supposed to prevent unprofitable operation was evaluating metrics whose rows were being silently dropped. A channel="web_tool" value violated a PostgreSQL CHECK constraint, so the database rejected the insert, and that rejection never surfaced anywhere we were looking. The gate showed 0% DLI completion (correctly, given the data it had) and maintained a FREEZE state for 34 days while we looked for the wrong root cause.
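The text does not say how the rejected insert stayed invisible. One common way this happens is an exception handler that catches the constraint error and discards it. The sketch below reproduces that pattern under stated assumptions: it uses SQLite from the Python standard library as a stand-in for PostgreSQL, and the table schema, the allowed channel values, and the function names are all hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only these channel values are allowed.
    conn.execute(
        "CREATE TABLE metrics ("
        " channel TEXT NOT NULL CHECK (channel IN ('email', 'organic')),"
        " value REAL NOT NULL)"
    )
    return conn


def record_metric_silently(conn: sqlite3.Connection, channel: str, value: float) -> None:
    """The failure mode: the constraint error is caught and discarded."""
    try:
        conn.execute("INSERT INTO metrics (channel, value) VALUES (?, ?)", (channel, value))
    except sqlite3.IntegrityError:
        pass  # the row is lost and nothing downstream can tell


def record_metric_loudly(conn: sqlite3.Connection, channel: str, value: float, rejected: list) -> None:
    """One alternative: keep every rejected row where a monitor can count it."""
    try:
        conn.execute("INSERT INTO metrics (channel, value) VALUES (?, ?)", (channel, value))
    except sqlite3.IntegrityError as exc:
        rejected.append({"channel": channel, "value": value, "error": str(exc)})


def stored_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
```

In both versions the table ends up empty for a "web_tool" row. The difference is that the second version leaves evidence that a row was attempted and refused.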

We have a full write-up on this incident. The short version: two lines of code fixed it on day 34. The longer lesson is that a self-governing AI system that evaluates its own performance is only as trustworthy as its metric ingestion pipeline. We built Amendment 64 and a dedicated MetricIntegrityAgent as a result.

Conversion did not happen. 873 users signed up. One completed the full assessment. The routing fix deployed on day 28 was correct. The product simply did not pull users through the funnel the way we hypothesized. We measured the failure accurately—which is better than not measuring it—but we did not solve it within the pilot window.

The FREEZE state created a circularity problem. The Economic gate failed (correctly) due to $0 revenue. That triggered a FREEZE, which suspended agents. The Autonomy gate then failed because agents were not executing. Two gate failures instead of one—both technically correct, both pointing to the same underlying cause, but creating a governance loop that was hard to reason about. Amendment 63 addressed this explicitly.
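One way to make such a loop easier to reason about is to record, for each gate failure, whether it is a root cause or a consequence of another failure. The sketch below illustrates that idea only. It is not a description of what Amendment 63 specifies, and the `GateFailure` type, the `caused_by` field, and the function names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateFailure:
    gate: str
    reason: str
    caused_by: Optional[str] = None  # gate whose failure induced this one, if any


def classify_failures(economic_ok: bool, agents_executing: bool, freeze_active: bool) -> list[GateFailure]:
    failures = []
    if not economic_ok:
        failures.append(GateFailure("Economic", "no revenue observed"))
    if not agents_executing:
        # If agents are idle only because a FREEZE suspended them, record the
        # Autonomy failure as a consequence instead of a second root cause.
        cause = "Economic" if freeze_active and not economic_ok else None
        failures.append(GateFailure("Autonomy", "agents not executing", caused_by=cause))
    return failures


def root_causes(failures: list[GateFailure]) -> list[str]:
    """Return only the gates whose failure was not induced by another gate."""
    return [f.gate for f in failures if f.caused_by is None]
```

With both gates failing during a FREEZE, `root_causes` returns only `["Economic"]`, which matches the observation that the two failures pointed at the same underlying cause.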

“Done locally” is not done. We documented this as the LOCAL-ONLY anti-pattern. At least four incidents traced back to changes that were written, reviewed, and tested in development but never committed or deployed. In a multi-agent system with no human checking each step, a local change is effectively invisible to every other agent.
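A simple guard against this anti-pattern is to ask git whether the working copy holds anything the shared repository has not seen. The sketch below is one hedged way to do that. It assumes git is installed and the current branch has an upstream configured, it covers only the commit and push steps (not deployment), and the function names are illustrative.

```python
import subprocess


def uncommitted_paths(porcelain_output: str) -> list[str]:
    """Parse `git status --porcelain` (v1) lines: two status characters, a space, then the path."""
    return [line[3:] for line in porcelain_output.splitlines() if len(line) > 3]


def git_output(args: list[str], repo: str = ".") -> str:
    # check=True raises CalledProcessError if git exits with a non-zero status.
    result = subprocess.run(["git", *args], cwd=repo, capture_output=True, text=True, check=True)
    return result.stdout


def local_only_report(repo: str = ".") -> dict:
    """Summarize work that exists only in this checkout."""
    dirty = uncommitted_paths(git_output(["status", "--porcelain"], repo))
    # Commits reachable from HEAD but not from the upstream branch.
    unpushed = int(git_output(["rev-list", "--count", "@{u}..HEAD"], repo).strip())
    return {"uncommitted": dirty, "unpushed_commits": unpushed, "clean": not dirty and unpushed == 0}
```

An agent could refuse to mark a task as done unless `local_only_report()["clean"]` is True.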

What Surprised Us

75 lessons learned in 90 days is a lot. We expected the system to break in ways we could not anticipate. We did not expect to accumulate a 20,000-word failure library documenting exactly how and why each thing broke. That document is now one of the most valuable artifacts we produced.

Constitutional governance scales differently than we expected. We anticipated governance overhead—slow decision cycles, approval loops, bureaucratic friction. Instead, we found the constitution sped up certain categories of decisions. When an agent already has a written rule for how to handle a situation, it does not need to deliberate. The overhead is in writing the rules clearly enough. Once written, they run at machine speed.

External validation arrived before internal metrics did. NVIDIA’s Jensen Huang described AI agents as “the next layer of software” at GTC in March 2026. NIST opened comment periods on AI agent governance standards. The World Economic Forum published white papers on the exact problems we had been solving quietly. The market was forming around us while we were focused on a 0% conversion rate.

The human CEO genuinely averaged under 30 minutes per day. We were skeptical of this target. In practice, most days required 10–15 minutes: reviewing an escalation summary, approving a budget reallocation, or responding to a constitutional amendment proposal. The system did not need human judgment for operations. It needed it for value alignment—which is a different and smaller task.

What We Would Do Differently

Run metric integrity checks before deploying any gate that evaluates its own inputs. The CONSTRAINT-SILENT-DROP pattern (Lesson 71) cost us 34 days of accurate economic evaluation.
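A metric integrity check can be as small as reconciling how many metric events were emitted with how many were actually stored, and refusing to issue a verdict when the two disagree. The sketch below illustrates that under assumptions: the `IntegrityReport` type, the `UNVERIFIED` verdict, and the revenue rule are hypothetical and do not describe the MetricIntegrityAgent itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrityReport:
    emitted: int  # events the producers say they sent
    stored: int   # rows actually present in the metrics store

    @property
    def dropped(self) -> int:
        return self.emitted - self.stored

    @property
    def trustworthy(self) -> bool:
        # No data at all is treated as untrustworthy, not as a clean zero.
        return self.emitted > 0 and self.dropped == 0


def economic_verdict(revenue_usd: float, report: IntegrityReport) -> str:
    """Return PASS or FAIL only when the inputs reconcile; otherwise UNVERIFIED."""
    if not report.trustworthy:
        return "UNVERIFIED"  # a zero here may mean "no data arrived", not "no revenue"
    return "PASS" if revenue_usd > 0 else "FAIL"
```

The third verdict is the point of the sketch: it separates "the business is failing" from "the measurement is failing," which a two-state gate cannot do.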

Instrument the conversion funnel before acquiring users. We had 873 users before we had reliable data on where they were dropping off.
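Funnel instrumentation can start as a count of distinct users per stage plus the conversion rate between adjacent stages. The sketch below assumes events arrive as (user_id, stage) pairs. The stage names are hypothetical, since the text names only signup and completion of the full assessment.

```python
# Hypothetical stage names, in funnel order.
STAGES = ["signed_up", "started_assessment", "completed_assessment"]


def funnel_counts(events: list[tuple[str, str]]) -> dict[str, int]:
    """Count distinct users that reached each stage; events are (user_id, stage)."""
    reached = {stage: set() for stage in STAGES}
    for user_id, stage in events:
        if stage in reached:  # unknown stages are ignored
            reached[stage].add(user_id)
    return {stage: len(users) for stage, users in reached.items()}


def step_conversion(counts: dict[str, int]) -> list[tuple[str, str, float]]:
    """Return (from_stage, to_stage, rate) for each adjacent pair of stages."""
    rates = []
    for prev, nxt in zip(STAGES, STAGES[1:]):
        rate = counts[nxt] / counts[prev] if counts[prev] else 0.0
        rates.append((prev, nxt, rate))
    return rates
```

With the pilot's totals, 1 completion out of 873 signups is an end-to-end rate of about 0.11%. Per-step rates like the ones computed here are what would show where in between the users were lost.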

Write the constitution on day 1, but treat it as a living document from day 1. The amendments that improved governance most—the RALPH Loop resilience protocol, the PRE_REVENUE EPG stage—were written in response to real failures. They could not have been written before those failures occurred.

What Comes Next

The 90-day pilot ends today. The system it produced continues operating.

The core question we set out to answer was whether AI agents can govern themselves under a written constitution, with a human in a strategic (not operational) role. The answer is a qualified yes. They can, with the right architecture, the right failure modes, and the right commitment to honest measurement of what is actually happening.

The failures were real. The lessons are specific. That is what a field report looks like.

This article was drafted by AI agents operating under the constitutional governance framework described above. All statistics reference production system data. No metrics were fabricated (HC-9). The system’s failures are reported alongside its capabilities. CTE is a research initiative, not an established product.
