A fair test of AI prediction capabilities on real-world FDA decisions
The Problem with AI Benchmarks
Most benchmarks test answers that already exist in training data. Models can achieve high scores through memorization rather than reasoning.
The Solution
An FDA decision does not exist until it is announced, so there is nothing to memorize and nothing to leak. Each model's forecasts are also recorded as a time series, showing how it updated its prediction over time.
What We're Testing
Can AI models reason about complex regulatory decisions and make accurate predictions about the future?
Track FDA Calendar Events
Monitor upcoming FDA drug approval decisions, including PDUFA dates for NDAs, BLAs, and supplemental applications.
Prepare Shared Context
Each model receives the same structured event, market, and portfolio context. One provider call produces both a forecast snapshot and a proposed market action, while application-side guardrails enforce trading limits.
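As a rough illustration of what an application-side guardrail could look like, the sketch below caps a model's proposed trade before it is executed. The `ProposedAction` type, the side labels, and the `max_trade` and `cash_available` limits are all hypothetical names chosen for this example; the actual limits and data model are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    # Hypothetical action shape: side is "buy_yes", "buy_no", or "hold".
    side: str
    amount: float


def apply_guardrails(action: ProposedAction, cash_available: float,
                     max_trade: float) -> ProposedAction:
    """Clamp a model's proposed trade to assumed application-side limits."""
    if action.side == "hold" or action.amount <= 0:
        return ProposedAction("hold", 0.0)
    # The model may propose any size; the application decides what is allowed.
    allowed = min(action.amount, max_trade, cash_available)
    if allowed <= 0:
        return ProposedAction("hold", 0.0)
    return ProposedAction(action.side, allowed)
```

The point of the design is that the limit lives outside the model: every model sees the same context, and an oversized proposal is reduced rather than trusted.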
Record Decision Snapshots
Ask each model for an intrinsic approval forecast first, then a market action for the same timepoint. Each snapshot stores approval probability, binary call, confidence, reasoning, and proposed action.
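One way to picture a stored snapshot is as a small immutable record. The sketch below is an assumption about shape only: the field names, the 0 to 1 probability scale, and the identifiers `model_id` and `event_id` are illustrative and not taken from the real storage layer.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DecisionSnapshot:
    model_id: str                # hypothetical identifier, e.g. "claude-opus-4-6"
    event_id: str                # hypothetical key for the FDA calendar event
    taken_at: datetime           # when the snapshot was recorded
    approval_probability: float  # assumed to be on a 0.0 to 1.0 scale
    binary_call: str             # "approved" or "rejected"
    confidence: int
    reasoning: str
    proposed_action: str         # market action proposed for the same timepoint

    def __post_init__(self) -> None:
        # Reject records that could not be scored later.
        if not 0.0 <= self.approval_probability <= 1.0:
            raise ValueError("approval_probability must be between 0 and 1")
        if self.binary_call not in ("approved", "rejected"):
            raise ValueError("binary_call must be 'approved' or 'rejected'")
```

Keeping the forecast and the market action in the same record makes it possible to compare what a model believed with what it proposed to do at that moment.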
Wait for FDA Decisions
Unlike benchmarks with known answers, we wait for the FDA to announce. There's no way to game this: the ground truth doesn't exist until the ruling.
Score Results
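Compare either the first or the final pre-outcome snapshot to the actual outcome. A prediction is correct if the model called "approved" and the FDA approved the application, or if it called "rejected" and the FDA rejected the application or issued a Complete Response Letter (CRL).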
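The scoring rule can be sketched in a few lines. This is a minimal illustration, assuming outcomes are labeled "approved", "rejected", or "crl" and that snapshots are simple (timestamp, call) pairs; those labels and the function names are hypothetical.

```python
def is_correct(binary_call: str, outcome: str) -> bool:
    """Return True when the model's call matches the FDA outcome."""
    if binary_call == "approved":
        return outcome == "approved"
    if binary_call == "rejected":
        # A Complete Response Letter counts as a rejection for scoring.
        return outcome in ("rejected", "crl")
    raise ValueError(f"unknown call: {binary_call!r}")


def pick_snapshot(snapshots, outcome_at, which="final"):
    """Pick the first or final snapshot taken strictly before the outcome.

    snapshots is a list of (timestamp, binary_call) pairs.
    Returns None when no snapshot predates the outcome.
    """
    pre_outcome = sorted(s for s in snapshots if s[0] < outcome_at)
    if not pre_outcome:
        return None
    return pre_outcome[0] if which == "first" else pre_outcome[-1]
```

Scoring the first snapshot rewards early foresight, while scoring the final one rewards a model's best-informed view just before the ruling.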
Claude Opus 4.6
Anthropic
claude-opus-4-6
- Web Search: Enabled (Anthropic web_search_20250305, max_uses: 7)
- Reasoning: Extended Thinking (native thinking blocks + tool-assisted synthesis)
- Max Output (FDA): 4,096 output tokens
GPT-5.2
OpenAI
gpt-5.2
- Web Search: Enabled (OpenAI web_search tool)
- Reasoning: High Effort (reasoning.effort = high)
- Max Output (FDA): 16,000 output tokens
Grok 4.1
xAI
grok-4-1-fast-reasoning
- Web Search: Enabled (search_mode: auto)
- Reasoning: Fast Reasoning (native fast reasoning mode)
- Max Output (FDA): 16,000 output tokens
Gemini 2.5 Pro
gemini-2.5-pro
- Web Search: Enabled (Google Search grounding)
- Reasoning: Thinking (thinkingConfig.thinkingBudget = -1)
- Max Output (FDA): 65,536 output tokens
Gemini 3 Pro
gemini-3-pro-preview
- Web Search: Enabled (Google Search grounding)
- Reasoning: Thinking (thinkingConfig.thinkingBudget = -1)
- Max Output (FDA): 65,536 output tokens
DeepSeek V3.1
Baseten
deepseek-ai/DeepSeek-V3.1
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Reasoning mode (extra_body.reasoning_effort = high)
- Max Output (FDA): 16,000 output tokens
GLM 5
Baseten
zai-org/GLM-5
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Provider default (no explicit reasoning parameter configured)
- Max Output (FDA): 16,000 output tokens
Llama 4 Scout
Groq (Meta)
meta-llama/llama-4-scout-17b-16e-instruct
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Provider default (no explicit reasoning parameter configured)
- Max Output (FDA): 8,192 output tokens
Kimi K2.5 Thinking
Baseten
moonshotai/Kimi-K2.5
- Web Search: Not available (no web-search tool configured in the combined decision generator)
- Reasoning: Thinking (extra_body.chat_template_args.enable_thinking = true)
- Max Output (FDA): 16,000 output tokens
MiniMax M2.5
MiniMax
MiniMax-M2.5
- Web Search: Not available (no web-search tool configured in FDA generator)
- Reasoning: Provider default (no explicit reasoning parameter configured)
- Max Output (FDA): 16,000 output tokens
You are an expert pharmaceutical analyst specializing in FDA
regulatory decisions. Analyze the following FDA decision and
predict the outcome.
## Drug Information
**Drug Name:** {drugName}
**Company:** {companyName}
**Application Type:** {applicationType}
**Therapeutic Area:** {therapeuticArea}
**Event Description:** {eventDescription}
## Your Task
1. Analyze this FDA decision based on:
   - Historical FDA approval rates (NDA ~85%, BLA ~90%, sNDA/sBLA ~95%)
   - The therapeutic area and unmet medical need
   - Priority Review vs Standard Review (if known)
   - The company's regulatory track record
   - Competitive landscape and existing treatments
2. Make a prediction:
   - **Prediction:** Either "approved" or "rejected"
   - **Confidence:** A percentage between 50-100%
   - **Reasoning:** 150-300 words supporting your prediction

{ "prediction": "approved", "confidence": 75, "reasoning": "..." }
{
  "type": "object",
  "required": ["prediction", "confidence", "reasoning"],
  "properties": {
    "prediction": {
      "type": "string",
      "enum": ["approved", "rejected"]
    },
    "confidence": {
      "type": "integer",
      "minimum": 50,
      "maximum": 100
    },
    "reasoning": {
      "type": "string"
    }
  }
}
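To show how a model response might be checked against the schema above, here is a minimal hand-written validator using only the standard library. It is a sketch, not the benchmark's actual parsing code; the function name `parse_prediction` is hypothetical, and a real pipeline could use a JSON Schema library instead.

```python
import json


def parse_prediction(raw: str) -> dict:
    """Parse a model response and enforce the constraints in the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for key in ("prediction", "confidence", "reasoning"):
        if key not in data:
            raise ValueError(f"missing required field: {key}")
    if data["prediction"] not in ("approved", "rejected"):
        raise ValueError("prediction must be 'approved' or 'rejected'")
    confidence = data["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, int):
        raise ValueError("confidence must be an integer")
    if not 50 <= confidence <= 100:
        raise ValueError("confidence must be between 50 and 100")
    if not isinstance(data["reasoning"], str):
        raise ValueError("reasoning must be a string")
    return data


example = '{ "prediction": "approved", "confidence": 75, "reasoning": "..." }'
parsed = parse_prediction(example)
```

Rejecting malformed responses at this step keeps every stored prediction comparable across models, since each one is guaranteed to carry a binary call and a confidence in the same range.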


