How It Works

A fair test of AI prediction capabilities on real-world FDA decisions

Why This Matters

The Problem with AI Benchmarks

Many benchmarks ask questions whose answers already appear in training data, so models can score highly through memorization rather than reasoning.

The Solution

FDA decisions don't exist until they're announced. When a prediction is made there is no answer to memorize, so training-data leakage and benchmark contamination cannot supply the outcome.

What We're Testing

Can AI models reason about complex regulatory decisions and make accurate predictions about the future?

The Process

1. Track FDA Calendar Events

Monitor upcoming FDA drug approval decisions from the RTTNews FDA Calendar, including PDUFA dates for NDAs, BLAs, and supplemental applications.

2. Prepare Identical Context

Each model receives the same information: drug name, company, application type, therapeutic area, clinical trial data, and regulatory history.

3. Request Predictions

Ask each model: "Will the FDA approve this drug?" Each model returns a binary APPROVED or REJECTED prediction with a confidence level and its reasoning. Every prediction is timestamped before the FDA announces its decision.

4. Wait for FDA Decisions

Unlike benchmarks with known answers, we wait for the FDA to announce. There's no way to game this: the ground truth doesn't exist until the ruling.

5. Score Results

Compare each model's prediction to the actual outcome. A prediction is correct if APPROVED matches an approval, or if REJECTED matches a rejection or a Complete Response Letter (CRL).
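The scoring rule and the "timestamped before decisions" requirement can be sketched in a few lines of TypeScript. This is a minimal illustration, not the project's actual code: the type names, field names, and the `"crl"` outcome label are assumptions made for the example.

```typescript
// Hypothetical types; names are illustrative and not taken from the project.
type Prediction = "approved" | "rejected";
type Outcome = "approved" | "rejected" | "crl"; // "crl" = Complete Response Letter

interface ScoredItem {
  prediction: Prediction;
  predictedAt: Date; // when the model's answer was recorded
  outcome: Outcome;
  decidedAt: Date; // when the FDA decision was announced
}

// A prediction counts only if it was recorded before the decision.
function isValidTiming(item: ScoredItem): boolean {
  return item.predictedAt.getTime() < item.decidedAt.getTime();
}

// APPROVED matches an approval; REJECTED matches a rejection or a CRL.
function isCorrect(prediction: Prediction, outcome: Outcome): boolean {
  return (prediction === "approved") === (outcome === "approved");
}

// Fraction of validly timed predictions that were correct (NaN if there are none).
function accuracy(items: ScoredItem[]): number {
  const valid = items.filter(isValidTiming);
  if (valid.length === 0) return NaN;
  const correct = valid.filter((i) => isCorrect(i.prediction, i.outcome)).length;
  return correct / valid.length;
}
```

Dropping late predictions instead of scoring them as wrong is a design choice made for this sketch; the page does not say how such cases are handled.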

Model Configuration

Claude Opus 4.5

claude-opus-4-5-20251101

Web Search: No
Reasoning: Extended Thinking
Max Tokens: 16,000

Thinking budget: 10,000 tokens

GPT-5.2

gpt-5.2

Web Search: Yes
Reasoning: High Effort
Max Tokens: Default

Agentic web search; reasoning.effort: high

Grok 4

grok-4

Web Search: Yes
Reasoning: Standard
Max Tokens: 4,096

Live search (auto); no enhanced reasoning

Key Differences

GPT-5.2 & Grok 4 can search the web

They may find recent news, press releases, or analyst reports

Claude uses extended thinking

10,000 token budget for step-by-step reasoning
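For comparison, the three configurations can be written as one typed table. The sketch below is only an illustration: the `ModelConfig` interface and its field names are hypothetical, while the values are copied from the configuration listed on this page.

```typescript
// Field names are hypothetical; values are copied from the configuration above.
interface ModelConfig {
  name: string;
  modelId: string;
  webSearch: boolean;
  reasoning: string;
  maxTokens: number | "default";
  thinkingBudgetTokens?: number; // only listed for Claude
}

const MODELS: ModelConfig[] = [
  {
    name: "Claude Opus 4.5",
    modelId: "claude-opus-4-5-20251101",
    webSearch: false,
    reasoning: "extended thinking",
    maxTokens: 16000,
    thinkingBudgetTokens: 10000,
  },
  {
    name: "GPT-5.2",
    modelId: "gpt-5.2",
    webSearch: true,
    reasoning: "high effort",
    maxTokens: "default",
  },
  {
    name: "Grok 4",
    modelId: "grok-4",
    webSearch: true,
    reasoning: "standard",
    maxTokens: 4096,
  },
];

// Names of the models that can look up recent news before answering.
function modelsWithWebSearch(models: ModelConfig[]): string[] {
  return models.filter((m) => m.webSearch).map((m) => m.name);
}
```

Calling `modelsWithWebSearch(MODELS)` returns the two search-enabled models, which makes the asymmetry described under Key Differences explicit when reading results.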

Prediction Prompt

All models receive the same prompt (fda-prompt.ts):
You are an expert pharmaceutical analyst specializing in FDA
regulatory decisions. Analyze the following FDA decision and
predict the outcome.

## Drug Information

**Drug Name:** {drugName}
**Company:** {companyName}
**Application Type:** {applicationType}
**Therapeutic Area:** {therapeuticArea}
**Event Description:** {eventDescription}

## Your Task

1. Analyze this FDA decision based on:
   - Historical FDA approval rates (NDA ~85%, BLA ~90%, sNDA/sBLA ~95%)
   - The therapeutic area and unmet medical need
   - Priority Review vs Standard Review (if known)
   - The company's regulatory track record
   - Competitive landscape and existing treatments

2. Make a prediction:
   - **Prediction:** Either "approved" or "rejected"
   - **Confidence:** A percentage between 50-100%
   - **Reasoning:** 150-300 words supporting your prediction
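
The `{placeholder}` fields in the prompt above suggest simple template substitution. The sketch below shows one way this could work. It is not the contents of fda-prompt.ts: the `FdaEvent` interface and the `fillPrompt` helper are hypothetical names, and only the placeholder names come from the prompt.

```typescript
// Hypothetical event shape; field names mirror the prompt's placeholders.
interface FdaEvent {
  drugName: string;
  companyName: string;
  applicationType: string;
  therapeuticArea: string;
  eventDescription: string;
}

// Replace every {placeholder} with the matching event field.
// Throws if the template names a field the event does not have.
function fillPrompt(template: string, event: FdaEvent): string {
  const fields = event as unknown as Record<string, string | undefined>;
  return template.replace(/\{(\w+)\}/g, (match: string, key: string) => {
    const value = fields[key];
    if (value === undefined) {
      throw new Error(`No value for placeholder ${match}`);
    }
    return value;
  });
}
```

Failing loudly on an unknown placeholder keeps a half-filled prompt from being sent to one model and not another, which would break the identical-context requirement.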

Expected Response

JSON format required:
{
  "prediction": "approved",
  "confidence": 75,
  "reasoning": "Based on historical approval rates..."
}

prediction: "approved" or "rejected"
confidence: 50-100 (percentage)
reasoning: 150-300 word explanation
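
A response in this format can be checked before it is recorded. The following is a minimal validation sketch under stated assumptions: `ModelPrediction`, `isLabel`, and `parsePrediction` are hypothetical names, and the sketch checks only the label and the confidence range, not the 150-300 word length of the reasoning.

```typescript
// Hypothetical type for a validated response; the field names come from the JSON format above.
interface ModelPrediction {
  prediction: "approved" | "rejected";
  confidence: number; // percentage, 50-100
  reasoning: string;
}

// Type guard for the two allowed prediction labels.
function isLabel(value: unknown): value is "approved" | "rejected" {
  return value === "approved" || value === "rejected";
}

// Parse a raw model reply and reject anything outside the documented format.
function parsePrediction(raw: string): ModelPrediction {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Response is not a JSON object");
  }
  const { prediction, confidence, reasoning } = data as Record<string, unknown>;
  if (!isLabel(prediction)) {
    throw new Error('prediction must be "approved" or "rejected"');
  }
  if (typeof confidence !== "number" || confidence < 50 || confidence > 100) {
    throw new Error("confidence must be a number between 50 and 100");
  }
  if (typeof reasoning !== "string" || reasoning.trim() === "") {
    throw new Error("reasoning must be a non-empty string");
  }
  return { prediction, confidence, reasoning };
}
```

Applying one validator to every model's output keeps malformed replies from being scored differently across models.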

Data Sources

FDA Calendar: Upcoming PDUFA dates from RTTNews FDA Calendar.
Research Context: Clinical trial data, regulatory history, and advisory committee recommendations when available.
Results: Official FDA announcements, company press releases, and regulatory filings.

Current Progress

FDA Events Tracked: 62
Predictions Made: 27
Models Compared: 3