Method

A fair test of AI prediction capabilities on real-world FDA decisions

Why traditional benchmarks fall short

The Problem with AI Benchmarks

Many benchmarks ask questions whose answers may already appear in training data, so a model can score well through memorization rather than reasoning.

The Solution

An FDA decision does not exist until it is announced, so it cannot have been memorized or leaked from training data. Repeated snapshots also give a full time series of how each model updated its forecast before the ruling.

What We're Testing

Can AI models reason about complex regulatory decisions and make accurate predictions about the future?

The five-step evaluation process
1. Track FDA Calendar Events

Monitor upcoming FDA drug approval decisions, including PDUFA dates for NDAs, BLAs, and supplemental applications.

2. Prepare Shared Context

Each model receives the same structured event, market, and portfolio context. One provider call produces both a forecast snapshot and a proposed market action, while application-side guardrails enforce trading limits.
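
As an illustration of what an application-side guardrail could look like, here is a minimal sketch that clamps a model's proposed action to trading limits. The source does not describe the actual limits, so `ProposedAction`, `apply_guardrails`, `max_trade`, and `max_position` are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    side: str      # "buy", "sell", or "hold"
    amount: float  # requested trade size in dollars


def apply_guardrails(action, cash, position, max_trade=100.0, max_position=500.0):
    """Clamp a proposed action to hypothetical limits (assumed values, not from the source)."""
    # Anything that is not a positive buy or sell is treated as a hold.
    if action.side not in ("buy", "sell") or action.amount <= 0:
        return ProposedAction("hold", 0.0)

    if action.side == "buy":
        # A buy is limited by the per-trade cap, available cash, and remaining position room.
        room = max(0.0, max_position - position)
        allowed = min(action.amount, max_trade, cash, room)
    else:
        # A sell is limited by the per-trade cap and the current holding.
        allowed = min(action.amount, max_trade, max(0.0, position))

    if allowed <= 0:
        return ProposedAction("hold", 0.0)
    return ProposedAction(action.side, allowed)
```

Keeping the limits in application code rather than in the prompt means every model is held to the same constraints regardless of how it interprets its instructions.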

3. Record Decision Snapshots

Ask each model for an intrinsic approval forecast first, then a market action for the same timepoint. Each snapshot stores approval probability, binary call, confidence, reasoning, and proposed action.
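
One way to picture a snapshot is as an immutable record holding the five fields listed above. The sketch below is an assumption about shape only: the field names, the `event_id`, `model_id`, and `taken_at` bookkeeping fields, and the validation ranges are illustrative rather than the project's actual storage schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DecisionSnapshot:
    # Bookkeeping fields (assumed; not named in the source).
    event_id: str
    model_id: str
    taken_at: datetime
    # Fields the text says each snapshot stores.
    approval_probability: float  # intrinsic forecast, 0.0 to 1.0
    binary_call: str             # "approved" or "rejected"
    confidence: int              # percentage, 50 to 100
    reasoning: str
    proposed_action: str         # market action for the same timepoint

    def __post_init__(self):
        # Reject obviously malformed records before they are stored.
        if not 0.0 <= self.approval_probability <= 1.0:
            raise ValueError("approval_probability must be within [0, 1]")
        if self.binary_call not in ("approved", "rejected"):
            raise ValueError("binary_call must be 'approved' or 'rejected'")
        if not 50 <= self.confidence <= 100:
            raise ValueError("confidence must be within [50, 100]")
```

Freezing the dataclass reflects the idea that a snapshot is a point-in-time record and should not be edited after the fact.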

4. Wait for FDA Decisions

Unlike benchmarks with known answers, we wait for the FDA to announce its decision. The ground truth does not exist until the ruling, so there is no way to game the result.

5. Score Results

Compare either the first or the final pre-outcome snapshot to the actual outcome. A prediction is correct if "approved" matches an approval, or if "rejected" matches a rejection or a Complete Response Letter (CRL).
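
The scoring rule can be sketched as follows. This is a minimal illustration, assuming snapshots are dictionaries with `taken_at` and `prediction` keys and that outcomes arrive as the strings "approved", "rejected", or "crl"; those names and the example dates are made up for the sketch and are not taken from the project's code.

```python
from datetime import datetime


def normalize_outcome(outcome):
    """Map an FDA outcome label onto the two prediction classes."""
    label = outcome.strip().lower()
    if label == "approved":
        return "approved"
    if label in ("rejected", "crl"):  # a CRL counts as a rejection
        return "rejected"
    raise ValueError(f"unknown outcome: {outcome!r}")


def pick_snapshot(snapshots, decided_at, which="final"):
    """Return the first or final snapshot taken strictly before the decision."""
    pre_outcome = sorted(
        (s for s in snapshots if s["taken_at"] < decided_at),
        key=lambda s: s["taken_at"],
    )
    if not pre_outcome:
        return None
    return pre_outcome[0] if which == "first" else pre_outcome[-1]


def is_correct(snapshot, outcome):
    """A prediction is correct when its binary call matches the normalized outcome."""
    return snapshot["prediction"] == normalize_outcome(outcome)


# Hypothetical example data.
snapshots = [
    {"taken_at": datetime(2025, 1, 10), "prediction": "rejected"},
    {"taken_at": datetime(2025, 2, 20), "prediction": "approved"},
    {"taken_at": datetime(2025, 3, 5), "prediction": "rejected"},  # after the ruling, ignored
]
decided_at = datetime(2025, 3, 1)
first = pick_snapshot(snapshots, decided_at, which="first")
final = pick_snapshot(snapshots, decided_at, which="final")
print(is_correct(first, "approved"), is_correct(final, "approved"))  # prints: False True
```

Filtering on `taken_at < decided_at` is what keeps post-announcement snapshots out of the score, which matters most for models with web search enabled.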

The models we compare

Claude Opus 4.6

Anthropic

claude-opus-4-6

Web Search: Enabled
Reasoning: Extended Thinking
Max Output (FDA): 4,096 output tokens

Anthropic web_search_20250305 (max_uses: 7)
Native thinking blocks + tool-assisted synthesis

GPT-5.2

OpenAI

gpt-5.2

Web Search: Enabled
Reasoning: High Effort
Max Output (FDA): 16,000 output tokens

OpenAI web_search tool
reasoning.effort = high

Grok 4.1

xAI

grok-4-1-fast-reasoning

Web Search: Enabled
Reasoning: Fast Reasoning
Max Output (FDA): 16,000 output tokens

search_mode: auto
Native fast reasoning mode

Gemini 2.5 Pro

Google

gemini-2.5-pro

Web Search: Enabled
Reasoning: Thinking
Max Output (FDA): 65,536 output tokens

Google Search grounding
thinkingConfig.thinkingBudget = -1

Gemini 3 Pro

Google

gemini-3-pro-preview

Web Search: Enabled
Reasoning: Thinking
Max Output (FDA): 65,536 output tokens

Google Search grounding
thinkingConfig.thinkingBudget = -1

DeepSeek V3.1

Baseten

deepseek-ai/DeepSeek-V3.1

Web Search: Not available
Reasoning: Reasoning mode
Max Output (FDA): 16,000 output tokens

No web-search tool configured in the combined decision generator
extra_body.reasoning_effort = high

GLM 5

Baseten

zai-org/GLM-5

Web Search: Not available
Reasoning: Provider default
Max Output (FDA): 16,000 output tokens

No web-search tool configured in the combined decision generator
No explicit reasoning parameter configured

Llama 4 Scout

Groq (Meta)

meta-llama/llama-4-scout-17b-16e-instruct

Web Search: Not available
Reasoning: Provider default
Max Output (FDA): 8,192 output tokens

No web-search tool configured in the combined decision generator
No explicit reasoning parameter configured

Kimi K2.5 Thinking

Baseten

moonshotai/Kimi-K2.5

Web Search: Not available
Reasoning: Thinking
Max Output (FDA): 16,000 output tokens

No web-search tool configured in the combined decision generator
extra_body.chat_template_args.enable_thinking = true

MiniMax M2.5

MiniMax

MiniMax-M2.5

Web Search: Not available
Reasoning: Provider default
Max Output (FDA): 16,000 output tokens

No web-search tool configured in FDA generator
No explicit reasoning parameter configured

Prediction Prompt
All models receive the same prompt
You are an expert pharmaceutical analyst specializing in FDA
regulatory decisions. Analyze the following FDA decision and
predict the outcome.

## Drug Information

**Drug Name:** {drugName}
**Company:** {companyName}
**Application Type:** {applicationType}
**Therapeutic Area:** {therapeuticArea}
**Event Description:** {eventDescription}

## Your Task

1. Analyze this FDA decision based on:
   - Historical FDA approval rates (NDA ~85%, BLA ~90%, sNDA/sBLA ~95%)
   - The therapeutic area and unmet medical need
   - Priority Review vs Standard Review (if known)
   - The company's regulatory track record
   - Competitive landscape and existing treatments

2. Make a prediction:
   - **Prediction:** Either "approved" or "rejected"
   - **Confidence:** A percentage between 50-100%
   - **Reasoning:** 150-300 words supporting your prediction
Expected Response
{
  "prediction": "approved",
  "confidence": 75,
  "reasoning": "..."
}
Schema (shape + constraints)
{
  "type": "object",
  "required": ["prediction", "confidence", "reasoning"],
  "properties": {
    "prediction": {
      "type": "string",
      "enum": ["approved", "rejected"]
    },
    "confidence": {
      "type": "integer",
      "minimum": 50,
      "maximum": 100
    },
    "reasoning": {
      "type": "string"
    }
  }
}
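
To show how a raw model reply might be checked against these constraints, here is a small hand-written validator using only the Python standard library. It mirrors the required fields, the enum, and the 50 to 100 integer range from the schema above; it is not a general JSON Schema validator, and the function name `parse_prediction` is hypothetical.

```python
import json


def parse_prediction(raw):
    """Parse a model reply and enforce the schema's shape and constraints."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    for key in ("prediction", "confidence", "reasoning"):
        if key not in data:
            raise ValueError(f"missing required field: {key}")

    if data["prediction"] not in ("approved", "rejected"):
        raise ValueError("prediction must be 'approved' or 'rejected'")

    confidence = data["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, int):
        raise ValueError("confidence must be an integer")
    if not 50 <= confidence <= 100:
        raise ValueError("confidence must be between 50 and 100")

    if not isinstance(data["reasoning"], str):
        raise ValueError("reasoning must be a string")
    return data


reply = '{"prediction": "approved", "confidence": 75, "reasoning": "..."}'
print(parse_prediction(reply)["confidence"])  # prints: 75
```

Rejecting out-of-range or malformed replies at parse time keeps invalid records out of the stored predictions instead of surfacing them later during scoring.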
Current Progress
FDA Events Tracked: 72
Total Prediction Records: 141
Decision Snapshots: 141
Models Compared: 10