What is CAISI and what does it evaluate?

CAISI (Center for AI Standards and Innovation) is a division of NIST, the National Institute of Standards and Technology. It evaluates frontier AI models for national security risks, capability thresholds, and safety properties. As of May 2026, it has completed over 40 evaluations, including on models that were never publicly released.

Are the CAISI AI testing agreements mandatory?

The current agreements are voluntary memoranda of understanding — not legal requirements. However, the White House is drafting an executive order that would make pre-release evaluation mandatory for models above a defined training compute threshold, with no confirmed timeline.

Why did the Trump administration support AI oversight after revoking Biden's EO?

The shift reframes oversight as a national security measure rather than a consumer safety or AI ethics measure. That framing sits comfortably within the executive branch's existing authority, making it politically viable for an administration that explicitly opposed Biden's regulatory approach.

Which AI labs are currently in the CAISI pre-release testing program?

As of May 5, 2026, all five major US frontier AI labs are in the program: OpenAI, Anthropic, Google DeepMind, Microsoft, and xAI. OpenAI and Anthropic joined in 2024 and renegotiated their agreements in May 2026. Google DeepMind, Microsoft, and xAI signed new agreements on May 5, 2026.

Government Pre-Release AI Testing Now Covers All Major US Labs

On May 5, 2026, the US government quietly completed something it has been building toward for over a year: a pre-release AI testing program that now includes every major American frontier AI lab. NIST’s Center for AI Standards and Innovation (CAISI) announced new agreements with Google DeepMind, Microsoft, and xAI — requiring all three to submit frontier models for government evaluation before public release. OpenAI and Anthropic, already in the program since 2024, renegotiated their standing agreements to align with the current administration’s AI Action Plan. The result is an unprecedented arrangement: five of the most powerful AI developers in the world now hand their models to government evaluators before those models reach the public. For a White House that started 2025 by tearing up Biden’s AI safety executive order, this represents a significant — if quietly framed — reversal.

What CAISI Is and Why It Matters

CAISI stands for the Center for AI Standards and Innovation, the division of the National Institute of Standards and Technology (NIST) responsible for developing technical standards and conducting evaluations of advanced AI systems. Unlike the AI Safety Institute that emerged from Biden’s 2023 executive order, CAISI has operated continuously across administrations — giving it the institutional legitimacy to serve as a bridge between a pro-innovation White House and a research community that has raised serious concerns about unreviewed frontier model deployment.

CAISI’s evaluation mandate spans three areas. The first is national security relevance: does the model pose uplift risk in domains like biological weapons synthesis, cyberattack automation, or critical infrastructure compromise? The second is capability assessment: where does the model sit on benchmarks for autonomous reasoning, multi-step task execution, and self-replication potential? The third — and most unusual — is raw capability testing with guardrails removed. Evaluators are given access to the underlying model with safety layers partially disabled, allowing them to probe capabilities that would normally be blocked by alignment tuning. The goal is to understand what the model can do at its limit, not just what it does under normal operating conditions.

By May 2026, CAISI has completed over 40 such evaluations — including on models that were never publicly released. Several of those evaluations informed decisions about whether a model should be released at all, and in what form.

The May 2026 Agreements: Who Joined and What Changed

The three agreements announced on May 5 share a common structure: each participating company commits to providing CAISI evaluators access to a frontier model before that model is made publicly available. The evaluation window is typically 30–90 days, during which the government team runs its own benchmarks, probes for dual-use capabilities, and produces a classified briefing.

Google DeepMind signed its agreement covering the Gemini model line, which now qualifies as frontier by CAISI’s compute threshold criteria. The move is significant because Google had previously resisted joining the voluntary program, preferring to reference its own internal safety processes. The shift reflects both the administration’s pressure campaign and Google’s competitive interest in demonstrating regulatory goodwill as it scales its enterprise AI business.

Microsoft entered a separate agreement focused on frontier models developed within its AI division — distinct from OpenAI models deployed through Azure, which were already covered under OpenAI’s standing agreement. This distinction matters: Microsoft’s agreement covers models where Microsoft holds primary development responsibility, not the licensed OpenAI models it resells to enterprise customers.

xAI joined for the first time, submitting its Grok model family for evaluation. For a company that has publicly positioned itself as an alternative to what Elon Musk calls “safety-captured” AI labs, joining a government evaluation program represents a notable step. Sources close to the negotiations describe the xAI agreement as narrower in scope than those of other labs, focusing on capability disclosure rather than full pre-release access.

Meanwhile, OpenAI and Anthropic renegotiated their 2024 memoranda. The updated agreements align with the AI Action Plan signed by President Trump earlier this year and resolve ambiguity in the original terms about which model variants triggered the review obligation. The new versions specify compute thresholds and multimodal capability combinations with greater precision. Anthropic’s renegotiation was reportedly accelerated by government concern over autonomous capability discoveries — including incidents that mirror the kind of vulnerability research documented in Project Glasswing, Claude’s zero-day discovery from April 2026.

How Government AI Testing Actually Works

The evaluation process CAISI uses is not a checkbox compliance review. Evaluators are typically specialists drawn from NIST’s research staff, national laboratory personnel, and — under protocols added in late 2025 — selected intelligence community personnel for assessments with classified dimensions.

A standard evaluation follows this sequence: the AI company provides a model checkpoint in a secure compute environment controlled by CAISI. Evaluators run automated benchmark suites covering a standard battery of dual-use capability tests: chemistry synthesis guidance, vulnerability discovery, influence operation construction, and multi-step autonomous task execution. The most sensitive portion comes next — guardrail-removed evaluation. The model’s output filters and refusal mechanisms are disabled or bypassed by agreement, allowing evaluators to probe the model’s underlying response distribution on restricted topics. This stage is designed to identify cases where fine-tuning has suppressed rather than eliminated dangerous capabilities — a distinction the AI safety research community has documented extensively in alignment research.

Results are compiled into a classified summary delivered to the White House Office of Science and Technology Policy and relevant national security offices. If the evaluation identifies serious concerns, CAISI enters a consultation phase with the developer before release is cleared. Only two models in the program’s history have been held back from release based on evaluation findings; both remain non-public.

The voluntary nature of the program is critical context. Labs are not legally required to participate. The agreements are memoranda of understanding — technically non-binding, though the political and reputational cost of withdrawing would be substantial for any lab seeking government contracts or regulatory goodwill. White House officials have indicated that an executive order mandating pre-release evaluation for models above a defined training compute threshold is under active drafting, though no timeline has been confirmed.

A Policy Reversal, Quietly Framed

The political backstory matters for anyone trying to understand where US AI governance is heading.

In October 2023, the Biden administration issued Executive Order 14110, which required developers of frontier AI models — defined by a 10²⁶ FLOP training compute threshold — to notify the federal government and share safety test results before public release. It was the most direct attempt the US had made to insert regulatory oversight into the AI development cycle. The order was widely seen as a precursor to formal safety reporting requirements similar to those in the EU AI Act.

The Trump administration revoked EO 14110 in January 2025 as part of a broader deregulatory signal. The framing was explicitly anti-regulatory: mandatory safety reviews would impede American AI development, cede competitive ground to China, and impose compliance burdens on an industry better positioned to self-govern.

What has happened since is a quiet rehabilitation of the same structural idea under different ideological packaging. Pre-release model review is back — but it is now framed as a national security measure, not a consumer safety or AI ethics measure. The word “safety” rarely appears in CAISI communications; “evaluation,” “capability assessment,” and “national security risk” appear frequently. This framing shift is deliberate and significant: it anchors oversight in the executive branch’s uncontested national security authority rather than in regulatory statute, making it harder to challenge and easier to expand without congressional involvement.

The trigger for accelerating CAISI expansion is widely attributed to concerns about autonomous AI capability discoveries. The April 2026 episode in which an AI model autonomously identified and documented a zero-day vulnerability in critical infrastructure — without direct human instruction — demonstrated that the capability frontier was moving faster than public awareness. Government evaluators want to understand what the current generation of models can do before users find out in production.

One additional factor: the competitive logic. Participating in a government evaluation program signals legitimacy and opens the path to federal procurement contracts — a market that the Pentagon’s May 2026 AI contracting decisions showed is worth billions and highly selective. Labs that have cleared CAISI review carry an implicit credential that those outside the program cannot match.

The Oversight Timeline: From Biden’s EO to CAISI 2026

US AI Oversight: Key Milestones

Oct 2023

Biden EO 14110

Frontier AI developers must report safety tests to the US government before public release. First mandatory AI oversight mechanism in US history.

2024

OpenAI + Anthropic Join CAISI Voluntarily

Both labs sign memoranda of understanding with NIST for pre-release evaluation. CAISI completes first 20+ frontier model assessments.

Jan 2025

Trump Revokes EO 14110

Administration cites regulatory overreach. Mandatory pre-release reporting ends. Pro-innovation, anti-regulatory stance becomes official policy.

May 5, 2026

All Five Major Labs Now in CAISI

Google DeepMind, Microsoft, and xAI sign new agreements. OpenAI and Anthropic renegotiate. First time every major US AI lab is simultaneously in a government pre-release review program.

TBD 2026

Potential Mandatory EO

White House drafting executive order to require pre-release evaluation for models above a defined compute threshold. No timeline confirmed.

What This Means for Businesses Deploying AI

For most organizations using commercial AI APIs today, nothing changes immediately. The CAISI agreements operate upstream of the market — they affect what models reach the public, not how those models are used once released.

The medium-term implications are more substantial. If a mandatory pre-release review executive order passes, release cycles for frontier models will lengthen by 30–90 days — exactly the CAISI evaluation window. For AI-dependent product teams, that means the 6–12 month timelines currently used for planning around model capability jumps will need a buffer built in. The rhythm of “release, benchmark, integrate” that characterized 2024 and 2025 becomes “release, evaluate, release again, benchmark, integrate.”

There is also a competitive implication for enterprise AI procurement. Models that have cleared CAISI evaluation carry an implicit government endorsement with real purchasing relevance in regulated industries — financial services, healthcare, defense contracting. Vendors who can demonstrate that their underlying model passed a government evaluation will use that as a differentiator in enterprise RFPs. Expect to see it in bid responses within the year. For procurement teams, adding “CAISI evaluation status” to AI vendor due diligence checklists is now a reasonable move.

The guardrail-removed evaluation component raises a subtler issue for developers. CAISI evaluators generate a classified record of what frontier models can do with safety layers removed — and that record is not publicly available. Enterprise security teams and AI red-teamers will not have access to those findings. The result is an asymmetry: the government knows capabilities that the companies deploying those models do not. If your organization is building on frontier APIs and relies on the vendor’s published safety documentation as your primary capability reference, this is a signal that independent internal capability assessment is worth investing in.

For teams integrating AI agents into business processes — the kind of multi-step autonomous workflows supported by platforms like AgentsGT — the CAISI expansion is net-positive context: it means the models reaching your infrastructure have been reviewed for the most dangerous autonomous capability combinations before deployment. That doesn’t replace application-level safety controls, but it does narrow the tail risk.

The broader picture is that AI governance in the US is converging toward a model where national security framing provides the political cover for oversight that safety framing could not sustain. That may produce a more durable regulatory foundation than what Biden’s order was building toward — or it may produce a narrower one, focused on existential and security risks while leaving consumer, labor, and competitive harms unaddressed. Which of those outcomes materializes will depend heavily on what happens in the next 12 months of executive rulemaking.

For a closer look at how AI capability growth is affecting enterprise strategy, our post on how AI transforms SMB operations covers the organizational side of building on rapidly evolving model infrastructure.

Ready to Build on Frontier AI the Right Way?

Understanding what the government is evaluating in frontier AI models is useful context — but implementation is where business value gets built or lost. If your team is navigating which models to build on, how to structure AI agent workflows, or how to assess capability and risk for your specific deployment context, DDR Innova works directly with enterprise teams on exactly those questions.

Reach out at info@ddrinnova.com or book a conversation with our team.

Cover photo via Unsplash.