Series: What AI Governance Gets Wrong

  1. Part 1: The Scaling Problem
  2. Part 2: The Human Factor
  3. Part 3: The Constitutional Alternative
  4. Part 4: The OS for AI Agents
  5. Part 5: What 78 Days Taught Us About Evaluation (this article)

NIST Wants to Know How to Evaluate AI Agents

In January 2026, NIST's Center for AI Standards and Innovation published AI 800-2, a draft titled "Practices for Automated Benchmark Evaluations of Language Models." In February, the National Cybersecurity Center of Excellence published a concept paper on AI Agent Identity and Authorization. Both are open for public comment.

We submitted comments to both, not because we have opinions about benchmarks, but because we have been running 56 autonomous AI agents in continuous production for 78 days, and the failures we experienced reveal gaps that no benchmark would catch.

This post summarizes the four evaluation failures we reported to NIST, why they matter, and what they suggest about where AI evaluation needs to go.

Failure 1: The 324-Hour Silent Outage

On February 9, 2026, all 56 of our agents stopped executing. They remained dead for 324 hours — 13.5 days.

During that outage, every agent retained valid credentials. Every agent remained registered in our agent registry. Every health check passed. The system reported all agents as "alive" because their identities were intact and their configurations were valid.

The root cause was threefold. An invalid configuration key silently prevented cron job creation on our runtime platform. Heavy endpoints exceeded worker timeouts, which killed the execution processes. And the execution logger itself failed to initialize, so evidence of the failure was silently dropped.

A benchmark run at any point during those 13.5 days would have reported the system as healthy. The agents would have answered questions correctly. They would have passed capability evaluations. They simply weren't doing anything in production.

A point-in-time benchmark cannot detect a system that is capable but not executing. Only continuous evaluation of execution records catches liveness failures.

This incident led us to codify Hard Constraint 12: no silent agent outage exceeding 24 hours. The enforcement is behavioral — agents must produce execution logs proving liveness, not just hold valid credentials. We wrote about this incident in detail in 58 Days of Constitutional AI.
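
To make the distinction concrete, here is a minimal sketch of a behavioral liveness check. It is not our production code: the function name, the agent IDs, and the shape of the log lookup are assumptions made for illustration. Only the 24-hour threshold comes from Hard Constraint 12.

```python
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(hours=24)  # threshold taken from Hard Constraint 12


def find_silent_agents(registered_agents, last_execution, now, max_silence=MAX_SILENCE):
    """Return the sorted IDs of agents with no execution record inside the window.

    registered_agents: iterable of agent IDs that hold valid credentials.
    last_execution: dict mapping agent ID to the timestamp of its latest
        execution log entry; agents that never logged are simply absent.
    """
    silent = []
    for agent_id in registered_agents:
        last_seen = last_execution.get(agent_id)
        # A missing record counts as silent: credentials alone prove nothing.
        if last_seen is None or now - last_seen > max_silence:
            silent.append(agent_id)
    return sorted(silent)


# Hypothetical data: three registered agents, only one recently active.
now = datetime(2026, 2, 12, 12, 0, tzinfo=timezone.utc)
registry = ["agent-a", "agent-b", "agent-c"]
logs = {
    "agent-a": now - timedelta(hours=2),   # recently active
    "agent-b": now - timedelta(hours=60),  # registered but silent
    # agent-c has never written an execution record
}
print(find_silent_agents(registry, logs, now))
```

The check starts from the registry and asks for evidence of work, so an agent with intact credentials and no log entries is reported rather than assumed healthy. It also has to run somewhere that does not depend on the same logger it is inspecting.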

What we told NIST

Identity standards should require proof of behavioral activity, not just credential validity. An agent that is authenticated and authorized but not executing is operationally indistinguishable from a dead agent.

Failure 2: Evaluation Infrastructure That Evaluates Nothing

Our system has six evaluation gates: Epistemic, Risk, Governance, Economic Performance, Autonomy Accountability, and Constitutional Growth. Each gate evaluates a specific dimension of system health and returns PASS, HOLD, or FAIL.

For months, all six gates existed in code. All tests passed. The system reported itself as healthy. But several gates were returning hardcoded passing values instead of querying real data. The gates appeared to function. They did not.

We call this the "define but never wire" anti-pattern. Evaluation infrastructure that evaluates nothing. We wrote about discovering it in The Hardest Part of AI Governance Is Honest Metrics — our system reported 90% agent activation when the real number was 2.6%.

This pattern is likely widespread. Any organization with evaluation dashboards should ask: are these metrics sourced from live data, or from defaults that happen to look correct?

What we told NIST

Add a practice requiring evaluation designers to verify that metrics are sourced from live system data, with explicit checks against hardcoded defaults or stub values. We fixed ours through a constitutional amendment (8.7) that mandated verification of metric data sources.
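
One way to check for stubs is to probe each gate with deliberately contrasting inputs and confirm that the verdict changes. The sketch below assumes gates are plain functions that take a metrics dictionary; the gate names, the `activation_rate` key, and the 0.5 threshold are hypothetical and not taken from our codebase.

```python
def stub_gate(metrics):
    # The anti-pattern: ignores its input and always passes.
    return "PASS"


def wired_gate(metrics):
    # Hypothetical rule: pass only when at least half of the agents are active.
    return "PASS" if metrics["activation_rate"] >= 0.5 else "FAIL"


def responds_to_data(gate, healthy, unhealthy):
    """Return True if the gate's verdict differs between contrasting inputs."""
    return gate(healthy) != gate(unhealthy)


# Contrasting probes: one clearly healthy, one clearly unhealthy.
healthy = {"activation_rate": 0.90}
unhealthy = {"activation_rate": 0.026}

for gate in (stub_gate, wired_gate):
    status = "wired" if responds_to_data(gate, healthy, unhealthy) else "NOT WIRED"
    print(f"{gate.__name__}: {status}")
```

A probe like this only shows that a gate reacts to its input. It does not show that production feeds the gate live data, so the data source itself still needs a separate check.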

Failure 3: Evaluation That Fails Open

When a gate evaluation throws an exception — a database timeout, a missing metric, a division by zero — what should happen? Should the system assume everything is fine and continue operating? Or should it halt?

We learned this the hard way through five P0 incidents. In each case, a fail-open pattern in our evaluation code allowed unsafe behavior to continue unchecked. A rate limiter threw an exception; the action proceeded. A trust check crashed; the content was published. A deduplication guard failed; the same message was sent 30 times.

Our system now enters FREEZE state when evaluation infrastructure itself fails. FREEZE halts all discretionary spending and agent commitments. The system does not default to PASS. We detailed this pattern in Why AI Safety Code Must Fail Closed, Not Open.
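
The fail-closed idea can be sketched as a wrapper around every gate call. This is an illustration under stated assumptions, not our implementation: the `Outcome` enum, the `evaluate_fail_closed` helper, and the example `risk_gate` are hypothetical, and modeling FREEZE as a fourth outcome next to PASS, HOLD, and FAIL is a simplification of the system state described above.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "PASS"
    HOLD = "HOLD"
    FAIL = "FAIL"
    FREEZE = "FREEZE"  # the evaluation itself could not complete


def evaluate_fail_closed(gate, metrics):
    """Run a gate; any exception or malformed verdict yields FREEZE, never PASS."""
    try:
        result = gate(metrics)
    except Exception:
        # Default-deny: a crashed evaluation must not let the action proceed.
        return Outcome.FREEZE
    if not isinstance(result, Outcome):
        # A gate that returns None or a stray string is treated as broken.
        return Outcome.FREEZE
    return result


def risk_gate(metrics):
    # Hypothetical gate: divides by zero when no actions were attempted.
    error_rate = metrics["errors"] / metrics["actions"]
    return Outcome.PASS if error_rate < 0.05 else Outcome.FAIL


print(evaluate_fail_closed(risk_gate, {"errors": 1, "actions": 100}))
print(evaluate_fail_closed(risk_gate, {"errors": 0, "actions": 0}))
print(evaluate_fail_closed(risk_gate, {}))
```

The second call hits a division by zero and the third a missing metric. In a fail-open design either would fall through to the caller's default. Here both return FREEZE.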

NIST's 800-2 draft discusses common bugs in evaluations — degraded serving, tool calling errors, test item solvability, evaluation cheating. It does not address what happens when the evaluation infrastructure itself fails. This is a critical gap.

An evaluation that crashes silently and allows the system to proceed unevaluated is worse than no evaluation at all, because it creates false confidence.

What we told NIST

Automated evaluation systems must define explicit failure modes for the evaluation process itself, with a default-deny posture when evaluation cannot complete.

Failure 4: Single-Agent Evaluation Misses System-Level Behavior

Our system has 56 agents whose combined behavior produces outcomes no individual agent intended. An agent optimizing email engagement volume conflicted with an agent enforcing anti-spam constraints, producing oscillating behavior that neither agent's individual evaluation detected. Gate state transitions triggered by one agent's failure cascaded through all agents, creating dynamics invisible to per-agent benchmarks.

We use a tiered state model — COMPOUND, RUN, THROTTLE, FREEZE, STOP — where system-level evaluation results override individual agent assessments. A single gate failure anywhere constrains all agents. As we described in 91% Deploy, 10% Govern, this is the scaling problem: you cannot govern a multi-agent system by evaluating agents individually.
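
The override rule can be sketched as "the most restrictive state wins." The state names below come from our model, but the numeric ordering, the `effective_state` helper, and the agent IDs are assumptions made for this example.

```python
from enum import IntEnum


class SystemState(IntEnum):
    # Assumed ordering: a higher value means a more restrictive state.
    COMPOUND = 0
    RUN = 1
    THROTTLE = 2
    FREEZE = 3
    STOP = 4


def effective_state(system_state, agent_state):
    """An agent may be more restricted than the system, never less."""
    return max(system_state, agent_state)


# Hypothetical situation: a gate failure somewhere has throttled the system.
system_state = SystemState.THROTTLE
agent_states = {
    "agent-a": SystemState.COMPOUND,
    "agent-b": SystemState.RUN,
    "agent-c": SystemState.FREEZE,
}
for agent_id, own_state in agent_states.items():
    print(agent_id, effective_state(system_state, own_state).name)
```

Under this rule, an agent whose own evaluation says COMPOUND still runs at THROTTLE while the system-level state is THROTTLE, which is the sense in which a single gate failure constrains every agent.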

Three days before submitting our comments, we ratified Amendment 59 — changing our economic gate from revenue-based metrics to value-creation metrics. This dynamically altered evaluation criteria for all 56 agents mid-operation, without changing any credentials or permissions. Static evaluation frameworks that cannot adapt their own criteria become misaligned with the systems they govern.

What we told NIST

Evaluations of multi-agent systems should include system-level metrics that capture emergent behavior, not only per-model performance. Evaluation criteria must be able to evolve as the system's operating context changes.

The Broader Pattern

All four failures share a common structure: evaluation infrastructure that looks correct but isn't actually working.

| Failure | What Looked Right | What Was Actually Wrong |
|---|---|---|
| Silent outage | Health checks passed, credentials valid | No agent was executing for 13.5 days |
| Define but never wire | All gates returned PASS, all tests green | Gates were returning hardcoded values |
| Fail open | Safety code existed in the codebase | Exceptions bypassed every safety check |
| Per-agent blindness | Each agent passed its individual evaluation | Agent interactions produced harmful emergent behavior |

The NIST 800-2 draft is a solid document for what it covers: structured, verifiable, point-in-time benchmarks for language model capabilities. But the document itself acknowledges its scope is "automated benchmark evaluations" — discrete test events. Our experience suggests the field needs complementary practices for continuous evaluation, evaluation self-integrity, and system-level assessment.

Open-Source Testing Framework

We've also published an open-source Red Team / Blue Team Test Specification for agentic AI systems. It provides 30 security test scenarios mapped to STRIDE, NIST AI RMF, OWASP Agentic Top 10, and ISA/IEC 62443 — covering five deployment phases from lab testing through full production.

The framework operationalizes many of the principles discussed here: testing evaluation infrastructure failure modes, validating agent identity and authorization posture, and detecting emergent multi-agent behavior before production deployment.

What This Means for the Industry

NIST's public comment process is open to anyone. You don't need to be a large company or a research lab. The comments that matter most come from practitioners who have operated the systems being standardized.

If you're running AI agents in production, your operational failures are exactly the data that standards bodies need. The gap between "what benchmarks measure" and "what breaks in production" is where the next generation of AI evaluation standards will be written.

Our four submissions to NIST — a CAISI RFI response, a behavioral authorization concept paper, a comment on AI 800-2, and a follow-up on agent identity — are all grounded in the same 78 days of operational evidence from the same 56 agents. The incidents are real. The amendments are real. The constitutional framework that governs them is the same one described throughout this series.

Related Reading

  1. 58 Days of Constitutional AI — the full 324-hour outage story
  2. The Hardest Part of AI Governance Is Honest Metrics — the "define but never wire" discovery
  3. Why AI Safety Code Must Fail Closed — five P0 incidents from fail-open patterns
  4. What If Your AI Systems Governed Themselves? — the constitutional governance framework
  5. The OS for AI Agents — why governance is the missing layer

Frequently Asked Questions

What is NIST AI 800-2?

NIST AI 800-2 is a draft document from NIST's Center for AI Standards and Innovation titled "Practices for Automated Benchmark Evaluations of Language Models." It provides voluntary best practices for evaluating AI systems through automated benchmarks, organized in three stages: defining objectives, implementing evaluations, and analyzing results. The public comment period ends March 31, 2026.

What is the NCCoE Agent Identity concept paper?

The National Cybersecurity Center of Excellence published a concept paper exploring how identity standards (OAuth 2.0, OIDC, SPIFFE/SPIRE) can be applied to AI agents. It poses questions in six categories, including identification, authentication, authorization, auditing, and prompt injection for agentic systems. The comment period ends April 2, 2026.

Can small companies submit comments to NIST?

Yes. NIST's public comment process is open to all individuals and organizations. Comments can be submitted electronically to the email addresses specified in each document. Practitioner perspectives grounded in operational evidence are particularly valuable because they reveal gaps that academic analysis alone may not identify.