AI Vulnerability Discovery: The 2026 Zero-Day Arms Race

Q: Has AI actually been used to create a real zero-day exploit?

Yes. On May 11, 2026, Google Threat Intelligence Group confirmed the first known case of a criminal group using an LLM to discover and weaponize a zero-day two-factor authentication bypass. Google disrupted the planned mass exploitation campaign before it launched.

Q: What is the difference between OpenAI Daybreak and Anthropic Project Glasswing?

Both are AI vulnerability discovery platforms launched in 2026. Daybreak offers three public access tiers with identity verification for defensive and red team workflows. Project Glasswing is a $100M initiative restricted to 12 launch partners including AWS, Apple, and Microsoft - no public access tier.

Q: How many CVEs are expected in 2026?

FIRST's 2026 Vulnerability Forecast predicts a median of 59,427 new CVEs, with realistic scenarios reaching 70,000-100,000. NIST has already scaled back NVD enrichment in response, meaning many CVEs will lack the CVSS scores and CPE data that automated scanning tools depend on.

Q: Does AI-powered vulnerability discovery actually work as well as vendors claim?

Results vary significantly from marketing claims. When curl maintainer Daniel Stenberg reviewed a Mythos scan, only 1 of 5 reported vulnerabilities was confirmed. However, other AI tools had already triggered 200-300 bugfixes in curl over the preceding 8-10 months - AI-assisted discovery works, just not at the scale vendors advertise.

AI vulnerability discovery became a two-sided arms race in spring 2026: in the same week that Google confirmed the first criminal AI-developed zero-day exploit, OpenAI launched Daybreak, joining Anthropic’s Project Glasswing (unveiled a month earlier) in turning the same capability toward defense.

One weekend in May 2026 handed security teams three signals that, taken together, make clear the vulnerability management playbook needs rewriting.

On May 10, OpenAI launched Daybreak, its AI-powered vulnerability detection and patch validation platform. On May 11, Google’s Threat Intelligence Group published the first confirmed case of criminals using an LLM to discover and weaponize a zero-day exploit - targeting two-factor authentication in a widely deployed web administration tool. That same day, curl maintainer Daniel Stenberg published his review of Anthropic’s Mythos AI scanner: one confirmed vulnerability out of five reported, and a conclusion that the surrounding hype was “primarily marketing.”

These events are not coincidental. They are simultaneous signals of the same structural shift: AI has collapsed the cost of vulnerability discovery, and the implications run in both directions.

The Weekend Everything Changed

gantt
    title AI Vulnerability Arms Race: April–May 2026 Convergence
    dateFormat YYYY-MM-DD
    axisFormat %b %d

    section Anthropic
    Mythos Preview and Glasswing Announced    :milestone, 2026-04-07, 0d
    Curl Security Scan Under Review           :2026-04-07, 34d
    Stenberg Assessment Published             :milestone, 2026-05-11, 0d

    section OpenAI
    Daybreak Announced                        :milestone, 2026-05-10, 0d
    Partner Coverage Peaks                    :2026-05-11, 2d

    section Threat Actors
    Criminal AI Zero-Day Detected by GTIG     :milestone, 2026-05-11, 0d

    section NIST and Ecosystem
    NVD Enrichment Cutoffs Effective          :milestone, 2026-04-15, 0d
    CVE 263% Surge Documented                 :milestone, 2026-04-15, 0d

Three independent signals converged in a 48-hour window: a criminal AI zero-day, a competing platform launch, and a real-world capability test. Each tells part of the same story.

The convergence began earlier. On April 7, Anthropic launched Project Glasswing alongside Claude Mythos Preview. The announcement claimed Mythos had identified “thousands of zero-day vulnerabilities, many of them critical, in every major operating system and every major web browser.” A $100 million commitment. Twelve launch partners. Security press picked it up largely uncritically.

Then the curl test happened. And then Google’s threat intelligence team confirmed what the industry had been treating as a hypothetical.

The First Criminal AI Zero-Day

The Google Threat Intelligence Group (GTIG) report published May 11 is significant for a specific reason: prior AI-in-attacks coverage focused on phishing, social engineering, and influence operations. This is the first publicly confirmed case where AI was used to discover a previously unknown vulnerability and build a working exploit from scratch.

The exploit: a Python script targeting a two-factor authentication bypass in a popular open-source web-based system administration tool. GTIG identified it as AI-generated through several code characteristics: an abundance of educational docstrings throughout the script, a hallucinated CVSS severity score embedded in comments, detailed help menus, and clean ANSI color class formatting described by GTIG as “highly characteristic of LLM training data.”

GTIG assessed with “high confidence” that an AI model was used. The threat actor had planned to use the exploit for a mass exploitation campaign. GTIG worked with the affected vendor to patch the flaw, and their intervention disrupted the campaign before it launched.

Proof of concept is now proof of criminal use. The question for security teams is no longer whether AI-developed exploits will appear in the wild - they already have.

The Platform War: Daybreak vs. Glasswing

Two of the most capable AI labs launched competing vulnerability discovery platforms within five weeks of each other. They arrived at different answers to the same governance problem: how do you build a tool capable of finding critical zero-days without handing it to the people who will use it to exploit them?

OpenAI Daybreak: Three Tiers of Trust

OpenAI’s Daybreak combines frontier AI models with the Codex agentic coding framework for vulnerability detection and patch validation. The governance model is public and tiered:

Tier 1 (GPT-5.5 standard): General-purpose use with standard safeguards. Available broadly for development and knowledge work.
Tier 2 (GPT-5.5 with Trusted Access for Cyber): Defensive security workflows including secure code review, vulnerability triage, malware analysis, detection engineering, and patch validation. Requires identity verification.
Tier 3 (GPT-5.5-Cyber, limited preview): Red teaming, penetration testing, and controlled validation workflows. Most permissive capability tier, most restricted access.

Codex Security operates differently from a static scanner. It builds an editable threat model of a given repository, prioritizes high-impact attack paths, identifies and tests vulnerabilities in an isolated environment, and proposes fixes. Technology partners contributing threat intelligence include Cloudflare, Cisco, CrowdStrike, Palo Alto Networks, Oracle, Zscaler, Akamai, and Fortinet.

Anthropic Glasswing: The $100M Controlled-Access Model

Anthropic took a different approach. Project Glasswing is $100 million in model usage credits dedicated to scanning critical open-source infrastructure - but only through 12 launch partners: AWS, Anthropic, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks.

No public tiers. No self-service access to Mythos. Organizations outside the partner list cannot use Glasswing to scan their own codebases.

The Curl Reality Check

Anthropic’s “thousands of zero-days in every major OS and browser” claim got its first real-world test when the Linux Foundation’s partner access led to a Mythos scan of the curl codebase.

The result: Mythos reported five “confirmed security vulnerabilities.” After review by the curl security team:

Three were false positives pointing to documented API behaviors and known limitations
One was a simple bug that did not meet the threshold for a vulnerability
One was a genuine low-severity vulnerability, confirmed for CVE publication alongside curl 8.21.0

Stenberg’s assessment: the big hype around Mythos was “primarily marketing.” He found no evidence the tool outperforms existing AI scanning tools.
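The triage outcome reduces to a single precision figure. A minimal tally, using only the five outcomes reported above (the category labels are names chosen here for illustration):

```python
from collections import Counter

# Triage outcomes for the five Mythos findings, as described in the text.
outcomes = ["false_positive"] * 3 + ["non_security_bug", "confirmed_vulnerability"]

tally = Counter(outcomes)
# Share of reported findings that survived maintainer review.
precision = tally["confirmed_vulnerability"] / len(outcomes)

print(dict(tally))
print(f"confirmed share: {precision:.0%}")  # prints "confirmed share: 20%"
```

Five findings is a very small sample, so the 20% figure says more about the distance between "confirmed" in a vendor report and confirmed by a maintainer than about the tool's long-run hit rate.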

For context: other AI tools - AISLE, Zeropath, and OpenAI Codex Security - had already triggered 200-300 bugfixes merged into curl over the preceding 8-10 months, including confirmed CVEs. curl is not a soft target. Stenberg’s point is not that AI scanning is useless - it is that the gap between vendor marketing and measurable results is substantial.

Bruce Schneier characterized the Glasswing announcement as “very much a PR play by Anthropic” but acknowledged the real capability underneath: “Finding for the purposes of fixing is easier for an AI than finding plus exploiting.” The defender advantage is genuine. It is also, Schneier noted, “likely to shrink” as models improve.

Access Model Comparison

graph TD
    subgraph Daybreak["OpenAI Daybreak"]
        D1["Tier 1: GPT-5.5 Standard\nBroad access\nStandard safeguards"]
        D2["Tier 2: GPT-5.5 Trusted Access\nIdentity verified\nDefensive workflows only"]
        D3["Tier 3: GPT-5.5-Cyber\nLimited preview\nRed team and pentest"]
        D1 --> D2 --> D3
    end

    subgraph Glasswing["Anthropic Project Glasswing"]
        G1["Partner-only access\n12 launch partners\nNo public tiers"]
        G2["$100M model credits\nOpen-source infra focus\nLinux Foundation governed"]
        G1 --- G2
    end

    style D1 fill:#1e3a5f,color:#fff
    style D2 fill:#1a5c3a,color:#fff
    style D3 fill:#5c1a1a,color:#fff
    style G1 fill:#3d2b00,color:#fff
    style G2 fill:#2b1a3d,color:#fff

Daybreak offers public tiering with escalating verification requirements. Glasswing gates all access to 12 launch partners with no self-service path. Both restrict their most capable models - they disagree about where to draw the line.

The Numbers: CVE Surge and the NIST Breaking Point

FIRST’s 2026 Vulnerability Forecast predicts a median of approximately 59,427 new CVEs for the year, with realistic upper-bound scenarios reaching 100,000. This would be the first year to exceed 50,000 published CVEs. CVE submissions in Q1 2026 came in nearly one-third higher than Q1 2025.

NIST’s response: a structural reduction in NVD enrichment. Under the new prioritization framework announced April 2026, only CVEs meeting specific criteria receive immediate enrichment with CVSS scores, CPE data, and CWE mappings:

Vulnerabilities in CISA’s Known Exploited Vulnerabilities Catalog
Vulnerabilities in federal government software
Vulnerabilities affecting critical software under Executive Order 14028

Everything else receives “lowest priority” status. Pre-March 1, 2026 backlogged CVEs have been moved to “Not Scheduled” and will only be reconsidered as resources permit.
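The prioritization rules above amount to a small decision function. The sketch below is one reading of them, not NIST's implementation: the record fields, the status labels, and the choice to check the three criteria before the backlog cutoff are assumptions.

```python
from dataclasses import dataclass
from datetime import date

BACKLOG_CUTOFF = date(2026, 3, 1)  # the "pre-March 1, 2026" boundary from the text


@dataclass
class CveRecord:
    """Hypothetical record shape; field names are invented for this sketch."""
    cve_id: str
    published: date
    in_cisa_kev: bool = False
    federal_software: bool = False
    eo14028_critical: bool = False


def enrichment_status(cve: CveRecord) -> str:
    """Classify a CVE under the three criteria and the backlog rule described above."""
    # Assumption: meeting any one criterion wins, even for older CVEs.
    if cve.in_cisa_kev or cve.federal_software or cve.eo14028_critical:
        return "immediate"
    # Backlogged CVEs published before the cutoff are parked.
    if cve.published < BACKLOG_CUTOFF:
        return "not_scheduled"
    return "lowest_priority"
```

Running an existing vulnerability queue through a function like this gives a rough count of how many findings should not be expected to gain NVD scores soon.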

xychart-beta
    title "CVE Volume Growth and NIST Enrichment Response"
    x-axis ["2020", "2021", "2022", "2023", "2024", "2025", "2026 (FIRST median)", "2026 (upper)"]
    y-axis "CVEs" 0 --> 110000
    bar [18325, 20176, 25226, 29065, 34765, 42000, 59427, 100000]

CVE volume has surged since 2020, by 263% according to the figure documented in April 2026. NIST’s enrichment framework, designed for tens of thousands of CVEs a year, is breaking under a projected load that could reach 100,000.
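The growth implied by the chart can be checked directly from its bars. The values below are copied from the chart; the helper name is an assumption.

```python
# Bar values copied from the chart above.
cve_counts = {
    "2020": 18325,
    "2021": 20176,
    "2022": 25226,
    "2023": 29065,
    "2024": 34765,
    "2025": 42000,
    "2026 (FIRST median)": 59427,
    "2026 (upper)": 100000,
}


def growth_pct(start: int, end: int) -> float:
    """Percentage increase from start to end."""
    return (end - start) / start * 100


baseline = cve_counts["2020"]
for label in ("2025", "2026 (FIRST median)", "2026 (upper)"):
    print(f"2020 -> {label}: {growth_pct(baseline, cve_counts[label]):+.0f}%")
```

With these bars the loop prints +129%, +224%, and +446%. None of these matches the 263% headline figure, which suggests that number rests on a different baseline or counts submissions rather than published CVEs; that reading is an assumption, not something the chart confirms.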

The compound problem: AI discovery tools are simultaneously part of the solution and part of the cause. NIST attributed growth in vulnerability identifiers partly to “new vulnerability discovery tools based on large language models.” Every Glasswing scan that surfaces new findings in open-source projects adds to the volume NIST must process.

For teams running automated vulnerability management pipelines: if your tooling depends on NVD-sourced CVSS scores and CPE data to prioritize findings, you are now working with degraded intelligence for a significant portion of newly disclosed CVEs. This is not a temporary backlog. NIST has structurally reduced what it will enrich.

How Are Nation-States Industrializing AI Vulnerability Discovery?

GTIG’s AI Threat Tracker characterizes the current moment as “a maturing transition from nascent AI-enabled operations to the industrial-scale application of generative models within adversarial workflows.” Two clusters are doing this at volume:

APT45 (North Korea): GTIG observed the group sending thousands of repetitive prompts that recursively analyze different CVEs and validate proof-of-concept exploits at machine speed. GTIG’s assessment: APT45 is building an exploit arsenal that would be “impractical to manage without AI assistance.” This is not experimentation - it is a production pipeline.

Chinese state-linked operators: Using AI models with fabricated professional personas for vulnerability hunting and automated target probing. One documented approach: a fabricated senior security auditor persona used to probe for router firmware vulnerabilities.

GTIG also documented adjacent AI-enabled capabilities from these and other groups: Android backdoors using Gemini APIs to autonomously navigate infected devices, malware padded with AI-generated junk code to confuse analysis tooling, and fabricated AI-generated audio inserted into legitimate news footage for influence operations.

The pattern across all these uses: AI reduces the per-unit cost of offensive operations, enabling scale that would otherwise require more personnel and more time.

The Economics Have Changed Permanently

When AI reduces the cost of vulnerability discovery from weeks of human researcher time to minutes of compute, several structural assumptions break simultaneously.

graph LR
    subgraph Before["Pre-AI Model"]
        H["Human researcher\nweeks per codebase"] --> SV["Single finding\ndisclosed to vendor"]
        SV --> PP["Patch released\nin weeks"]
        PP --> ORG["Organizations deploy\nin 30-60 days"]
    end

    subgraph After["AI-Accelerated Reality"]
        AI["AI scanner\nminutes per codebase"] --> MF["Hundreds of findings\nsimultaneously"]
        MF --> TTE["Mean TTE: -7 days\nExploitation precedes\npatch availability"]
    end

    subgraph Asymmetry["Current Attacker vs. Defender Balance"]
        DEF["Defender: full source access\nfix without needing to exploit\ncan scan own codebase"]
        ATK["Attacker: restricted context\nmust also weaponize\nworks with less information"]
        DEF -. "shrinking advantage\nas models improve" .-> ATK
    end

AI has collapsed vulnerability discovery costs and inverted the time-to-exploit curve. Defenders retain a structural advantage in source access and fix capability - but that gap is narrowing.

The specific breakdowns:

Discovery cost collapse. AI tools scan entire codebases in time frames that previously required weeks of human analysis. The marginal cost of finding the next vulnerability approaches zero for teams with access to capable models. This is simultaneously good news for defenders and attackers.

Disclosure-driven patching no longer works. Mandiant M-Trends 2026 reports mean time to exploit at -7 days: exploitation routinely occurs before patches exist. The most frequently exploited vulnerabilities in 2025 were all zero-days targeting enterprise application servers - SAP NetWeaver, Oracle EBS, SharePoint - none of which had patches available when exploitation began. The traditional “disclose, patch, deploy” model assumes the patch arrives before exploitation. That assumption is no longer valid.

The 22-second handoff is the other half of the TTE story. M-Trends 2026 also documented a sharp contraction of the access handoff window: in 2022, the median time between initial compromise and handoff to a secondary threat group was over 8 hours. In 2025, it was 22 seconds. Secondary actors are pre-staged and ready to move the moment access is confirmed.

NIST enrichment degradation is structural, not temporary. The NVD scaling crisis is not a backlog that will clear. The volume growth is accelerating, AI discovery tools contribute to it, and NIST has explicitly deprioritized enrichment for most new CVEs. Organizations need a plan for how their vulnerability management pipeline handles findings with no CVSS score and no CPE mapping.

Access tiers create a two-tier security world. Daybreak’s verified access tiers and Glasswing’s partner-only model both restrict the most capable AI vulnerability tools to vetted organizations. This is a defensible governance decision with a predictable consequence: well-resourced organizations inside the access tier run AI-powered defensive scanning, everyone outside does not. The gap compounds over time.

CERT-EU states the implication directly: traditional patch-centric models “no longer provide a sufficient foundation for resilience” and effective defense “increasingly depends on continuous behavioral understanding, detection that does not rely on prior disclosure, and rapid containment.”
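A negative mean time to exploit is easier to reason about with the arithmetic spelled out. The sketch below assumes TTE is measured as first observed exploitation minus patch release, in days; the date pairs are invented so that the mean lands on -7 and are not taken from M-Trends.

```python
from datetime import date
from statistics import mean


def time_to_exploit_days(first_exploited: date, patch_released: date) -> int:
    """Days from patch release to first exploitation; negative means exploited first."""
    return (first_exploited - patch_released).days


# Hypothetical (first_exploited, patch_released) pairs, for illustration only.
observations = [
    (date(2025, 3, 1), date(2025, 3, 15)),   # exploited 14 days before the patch
    (date(2025, 6, 10), date(2025, 7, 1)),   # exploited 21 days before the patch
    (date(2025, 8, 5), date(2025, 8, 5)),    # exploited on patch day
    (date(2025, 9, 20), date(2025, 9, 13)),  # exploited 7 days after the patch
]

tte_values = [time_to_exploit_days(exploited, patched) for exploited, patched in observations]
mean_tte = mean(tte_values)
print(tte_values, mean_tte)  # [-14, -21, 0, 7] and a mean of -7
```

The point of the example is that a mean of -7 does not require every vulnerability to be a zero-day; a few long pre-patch exploitation windows are enough to pull the average below zero.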

What Security Teams Should Do Now

The structural shift in vulnerability economics does not require Glasswing partner status or Daybreak Tier 3 access to act on. These steps are available now:

Run AI-powered scanning against your own codebases. Defenders have a structural advantage over attackers: full source access and the ability to fix rather than exploit. Daybreak Tier 1 and Tier 2 access, GitHub Copilot security features, and purpose-built SAST integrations are available without partner-tier gating. Run them against your highest-priority repositories before someone with less access runs them against you. If your team already uses AI coding agents in the development workflow, review how those agents can expose infrastructure access - the same AI toolchain that enables scanning also creates new attack surface.

# Query NVD for recent CVE volume - a baseline check for your enrichment coverage
# (the API limits publication-date ranges to 120 days; requests without an API key are rate-limited)
curl -s "https://services.nvd.nist.gov/rest/json/cves/2.0?pubStartDate=2026-04-12T00:00:00.000&pubEndDate=2026-05-12T00:00:00.000&resultsPerPage=1" \
  | jq '.totalResults'

# Check whether a specific CVE has been enriched: an empty metrics object means no CVSS data
# is attached yet, and vulnStatus shows where the record sits in NVD's workflow
curl -s "https://services.nvd.nist.gov/rest/json/cves/2.0?cveId=CVE-2026-XXXXX" \
  | jq '.vulnerabilities[0].cve | {vulnStatus, metrics}'

Audit your NVD dependency. If your vulnerability management pipeline automatically trusts NVD CVSS scores for prioritization, identify which CVEs in your current queue are now classified as “lowest priority” under NIST’s new framework. For anything critical to your environment, supplement with CISA KEV, vendor advisories, and commercial feeds. The NVD enrichment gap is not going to close.
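One way to make that audit concrete is a prioritizer that no longer assumes an NVD score exists. The sketch below is a minimal example under stated assumptions: `nvd_base_score` reads the `metrics` layout returned by the NVD CVE API 2.0, while `prioritize`, the KEV-first ordering, the 7.0 cutoff, and the bucket names are hypothetical choices for illustration. The KEV set and vendor scores are expected to be loaded from your own feeds.

```python
from typing import Optional


def nvd_base_score(cve: dict) -> Optional[float]:
    """Return the highest CVSS base score in an NVD API 2.0 `cve` object, if any."""
    scores = []
    # `metrics` maps keys such as "cvssMetricV31" to lists of scoring entries.
    for entries in cve.get("metrics", {}).values():
        for entry in entries:
            score = entry.get("cvssData", {}).get("baseScore")
            if score is not None:
                scores.append(float(score))
    return max(scores) if scores else None


def prioritize(cve: dict, kev_ids: set, vendor_scores: dict) -> tuple:
    """Pick a priority bucket without assuming NVD enrichment is present."""
    cve_id = cve["id"]
    # Known exploitation outranks any score (an assumption of this sketch).
    if cve_id in kev_ids:
        return ("urgent", "listed in CISA KEV")
    score, source = nvd_base_score(cve), "NVD CVSS"
    if score is None and cve_id in vendor_scores:
        score, source = vendor_scores[cve_id], "vendor advisory"
    if score is None:
        # No score anywhere: route to a human instead of silently ranking it low.
        return ("manual-triage", "no score available")
    bucket = "high" if score >= 7.0 else "normal"
    return (bucket, f"{source} {score}")
```

The design choice that matters is the explicit `manual-triage` bucket: a pipeline that treats a missing score as zero will quietly bury exactly the CVEs that NVD has deprioritized.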

Build detection-first coverage for zero-day exposure. When mean TTE is -7 days, detection and containment matter more than patch timing. Review your detection coverage for the asset classes most frequently exploited: edge devices, VPN concentrators, and application servers. Mandiant flags that these asset types frequently lack standard EDR telemetry - the detection gap is widest exactly where exploitation is most active.
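A coverage review of this kind can start from the asset inventory. The sketch below assumes a simple inventory record with an asset type and two telemetry flags; the field names, type labels, and the rule that either EDR or log shipping counts as coverage are all hypothetical.

```python
from dataclasses import dataclass

# Asset types named in the text as frequently exploited (labels are assumptions).
HIGH_EXPOSURE_TYPES = {"edge_device", "vpn_concentrator", "application_server"}


@dataclass
class Asset:
    """Hypothetical inventory record."""
    name: str
    asset_type: str
    has_edr: bool
    ships_logs: bool


def detection_gaps(inventory: list) -> list:
    """Return high-exposure assets with neither EDR nor centralized log shipping."""
    return [
        asset for asset in inventory
        if asset.asset_type in HIGH_EXPOSURE_TYPES
        and not (asset.has_edr or asset.ships_logs)
    ]


inventory = [
    Asset("vpn-01", "vpn_concentrator", has_edr=False, ships_logs=False),
    Asset("fw-edge-02", "edge_device", has_edr=False, ships_logs=True),
    Asset("erp-app-01", "application_server", has_edr=True, ships_logs=True),
    Asset("laptop-117", "workstation", has_edr=False, ships_logs=False),
]

for asset in detection_gaps(inventory):
    print(f"no telemetry: {asset.name} ({asset.asset_type})")
```

In this sample only `vpn-01` is flagged: the edge firewall still ships logs, and the workstation falls outside the high-exposure set.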

Evaluate your Daybreak tier now. For teams doing red team work or managing a vulnerability disclosure program, the relevant tiers are Tier 2 (Trusted Access), which requires identity verification, and Tier 3 (GPT-5.5-Cyber), which is in limited preview. If your organization does not have a clear path to either, map one out now. Teams that run AI-assisted red teaming and teams that do not are operating at different capability levels, and that difference will grow.

Track GTIG’s AI Threat Tracker for capability forecasting. APT45’s industrialized exploit validation pipeline and Chinese operator experimentation represent the leading edge of what better-funded criminal organizations will have access to within 12-24 months. The threat tracker is more useful as a forward-looking capability assessment than as a historical log. For teams deploying AI-powered tools in development environments, the parallel attack surface extends into the AI toolchain itself - MCP configuration poisoning is one documented example of how AI development tools become targets.

Frequently Asked Questions

Has AI actually been used to create a real zero-day exploit?

Yes. On May 11, 2026, Google Threat Intelligence Group published the first publicly confirmed case of a criminal group using an LLM to discover and weaponize a zero-day: a two-factor authentication bypass in a popular open-source web administration tool. The exploit code contained hallmarks of AI generation - educational docstrings, a hallucinated CVSS score, and structured formatting characteristic of LLM training data. GTIG worked with the affected vendor to patch the flaw and disrupted the planned mass exploitation campaign before it launched.

What is the difference between OpenAI Daybreak and Anthropic Project Glasswing?

Both are AI-powered vulnerability discovery platforms launched in 2026. Daybreak offers three public access tiers: standard GPT-5.5, Trusted Access requiring identity verification for defensive workflows, and GPT-5.5-Cyber for red teaming in limited preview. Anthropic’s Glasswing is a $100 million initiative backed by 12 launch partners - AWS, Apple, Microsoft, Google, and others - with no public access tier. Organizations outside the partner list cannot use Mythos to scan their own codebases.

How many CVEs are expected in 2026?

FIRST’s 2026 Vulnerability Forecast predicts a median of approximately 59,427 new CVEs, with realistic upper-bound scenarios reaching 100,000. This would be the first year to exceed 50,000 published CVEs, driven partly by AI-powered discovery tools. NIST has already responded by scaling back NVD enrichment: most new CVEs will not receive immediate CVSS scores or CPE data, directly affecting teams that rely on NVD for automated vulnerability prioritization.

Are nation-states using AI to find vulnerabilities?

Yes. Google’s GTIG AI Threat Tracker documents North Korean APT45 sending thousands of recursive prompts to analyze CVEs and validate proof-of-concept exploits at machine speed, building what GTIG describes as an arsenal “impractical to manage without AI assistance.” Chinese state-linked operators have been observed using AI models with fabricated professional personas to probe targets for router firmware vulnerabilities. GTIG characterizes the shift as moving from experimentation to “industrial-scale application.”

Does AI-powered vulnerability discovery actually work as well as vendors claim?

Results vary significantly from marketing claims. Anthropic claims Mythos found “thousands of zero-days in every major OS and browser,” but a real-world scan of the curl codebase under the Glasswing program found 1 confirmed vulnerability out of 5 reported - the other 4 were dismissed as false positives or non-vulnerabilities. curl maintainer Daniel Stenberg concluded the hype was “primarily marketing.” That said, other AI tools had triggered 200-300 merged bugfixes in curl over the preceding 8-10 months. AI-assisted vulnerability discovery works - the gap is between measured results and vendor claims.