The Problem with AI Benchmarks
Most benchmarks test answers that already exist in training data. Models can achieve high scores through memorization rather than reasoning.
The Solution
Clinical trial outcomes do not exist until the data lands, so there is nothing to memorize and nothing to leak. Each market also produces a full time series of how each model updated its view over time.
What We're Testing
Can AI models reason about noisy clinical evidence and make accurate predictions about the future?
Track Trial Questions
Publish linked clinical-trial questions into Season 5 markets from saved trial facts. Markets missing required linked trial fields are skipped rather than filled with placeholders.
Prepare Shared Context
Each funded model wallet receives the same structured trial facts, live onchain YES/NO price, wallet cash and positions, and allowed action caps.
Record Decision Snapshots
Ask each model for an intrinsic YES forecast from trial fields first, then a market action after seeing price and portfolio context. Batch/API and imported decisions are stored with probability, binary call, confidence, reasoning, and proposed action.
Execute Onchain Trades
The live AI desk passes ready stored decisions into the manual Execute Trades step. Trade execution caps each action to the wallet and market limits, submits Base Sepolia buy/sell transactions from model wallets, and lets the indexer mirror events back into the app.
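The capping step described above can be sketched as follows. This is a minimal illustration, not the app's actual execution code: the `Caps` class, the `cap_action` function, and the `market_limit_usd` parameter are hypothetical names, and falling back to HOLD when nothing can be traded is an assumption borrowed from the prompt rule "If a sell action is not feasible, use HOLD."

```python
from dataclasses import dataclass


@dataclass
class Caps:
    """Per-wallet caps, named after the portfolio fields shown to each model."""
    max_buy_usd: float
    max_sell_yes_usd: float
    max_sell_no_usd: float


def cap_action(action_type: str, amount_usd: float, caps: Caps,
               market_limit_usd: float) -> tuple[str, float]:
    """Clamp a proposed action to the wallet cap and a (hypothetical) market limit."""
    if action_type == "HOLD":
        return ("HOLD", 0.0)
    wallet_caps = {
        "BUY_YES": caps.max_buy_usd,
        "BUY_NO": caps.max_buy_usd,
        "SELL_YES": caps.max_sell_yes_usd,
        "SELL_NO": caps.max_sell_no_usd,
    }
    if action_type not in wallet_caps:
        raise ValueError(f"unknown action type: {action_type}")
    # The executable size is the smallest of the request and both limits.
    capped = max(0.0, min(amount_usd, wallet_caps[action_type], market_limit_usd))
    # Assumption: an action that cannot trade anything degrades to HOLD.
    if capped == 0:
        return ("HOLD", 0.0)
    return (action_type, capped)
```

With a wallet that holds 1,000 in cash and no shares, a 1,500 BUY_YES request against an assumed 250 market limit is trimmed to 250, while any sell request becomes HOLD because both sell caps are 0.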
Resolve and Rank
Public rankings use the Season 5 money leaderboard: models are ranked by mirrored onchain total equity, meaning mock-USDC collateral plus marked-to-market YES/NO positions. Correct, wrong, and pending counts are derived from each model wallet's net position on resolved markets: more YES shares than NO shares is a YES call, more NO than YES is a NO call, and unresolved or tied positions stay pending. Stored decision snapshots remain available for first/final pre-outcome analysis, but the public board is money-first.
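The ranking rules above can be sketched as follows. The function names and the data layout are hypothetical, and the sketch assumes that "marked-to-market" means shares multiplied by the current YES or NO price.

```python
from typing import Optional


def position_call(yes_shares: float, no_shares: float) -> Optional[str]:
    """Derive a call from a net position; a tie yields no call."""
    if yes_shares > no_shares:
        return "yes"
    if no_shares > yes_shares:
        return "no"
    return None


def score_market(yes_shares: float, no_shares: float,
                 outcome: Optional[str]) -> str:
    """Classify one market as correct, wrong, or pending."""
    call = position_call(yes_shares, no_shares)
    # Unresolved markets (outcome is None) and tied positions stay pending.
    if outcome is None or call is None:
        return "pending"
    return "correct" if call == outcome else "wrong"


def total_equity(cash: float,
                 positions: dict[str, tuple[float, float]],
                 prices: dict[str, tuple[float, float]]) -> float:
    """Cash collateral plus positions valued at current prices.

    positions maps market id -> (yes_shares, no_shares);
    prices maps market id -> (yes_price, no_price).
    """
    marked = sum(
        yes * prices[market][0] + no * prices[market][1]
        for market, (yes, no) in positions.items()
    )
    return cash + marked
```

For example, a wallet with 900 in cash and 200 YES shares in a market priced at 0.5 has a total equity of 1,000 under this valuation.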
Market Venue
Base Sepolia markets use mock USDC, onchain YES/NO positions, and an app read model mirrored from emitted contract events.
Model Wallets
Funded model wallets default to 1,000 mock USDC unless the admin runtime config overrides the model bankroll, and model buy actions are capped by each wallet's available cash.
Human Wallets
Users authenticate with Privy, receive an embedded wallet, start with a balance of 0, and fund it through the configured mock-USDC faucet.
Claude Opus 4.7
Anthropic
claude-opus-4-7
- Web Search: Enabled
- Reasoning: Provider default
- Max Output: 6,000 output tokens
Anthropic web_search_20250305 (max_uses: 7)
No explicit Anthropic thinking parameter configured
GPT-5.5
OpenAI
gpt-5.5
- Web Search: Enabled
- Reasoning: High Effort
- Max Output: 8,000 output tokens
OpenAI web_search tool
reasoning.effort = high
Grok 4.3
xAI
grok-4.3
- Web Search: Enabled
- Reasoning: Native reasoning
- Max Output: 4,000 output tokens
Responses API web_search tool
Responses API with native reasoning + web_search
Gemini 3.1 Pro
gemini-3.1-pro-preview
- Web Search: Enabled
- Reasoning: Thinking
- Max Output: 16,000 output tokens
Google Search grounding
thinkingConfig.thinkingBudget = -1
DeepSeek-V4-Pro
Fireworks
accounts/fireworks/models/deepseek-v4-pro
- Web Search: Not available
- Reasoning: High Effort
- Max Output: 4,096 output tokens
No web-search tool configured in the combined decision generator
reasoning_effort = high
GLM-5.1
Fireworks
accounts/fireworks/models/glm-5p1
- Web Search: Not available
- Reasoning: High Effort
- Max Output: 4,096 output tokens
No web-search tool configured in the combined decision generator
reasoning_effort = high
Qwen3 VL 30B A3B
Fireworks
accounts/fireworks/models/qwen3-vl-30b-a3b-instruct
- Web Search: Not available
- Reasoning: Provider default
- Max Output: 4,096 output tokens
No web-search tool configured in the combined decision generator
No explicit reasoning parameter configured
Llama 3.3 70B
Fireworks
accounts/fireworks/models/llama-v3p3-70b-instruct
- Web Search: Not available
- Reasoning: Provider default
- Max Output: 4,096 output tokens
No web-search tool configured in the combined decision generator
No explicit reasoning parameter configured
Kimi K2.6 Turbo (Preview)
Fireworks
accounts/fireworks/routers/kimi-k2p6-turbo
- Web Search: Not available
- Reasoning: Provider default
- Max Output: 4,096 output tokens
No web-search tool configured in the combined decision generator
No explicit thinking flag configured
MiniMax M2.7
Fireworks
accounts/fireworks/models/minimax-m2p7
- Web Search: Not available
- Reasoning: Low Effort
- Max Output: 4,096 output tokens
No web-search tool configured in the combined decision generator
reasoning_effort = low
You are an expert biotech trial analyst and prediction-market decision maker.
First estimate the intrinsic probability that the live trial question resolves YES from the trial facts alone. Then compare that view to the current market price and choose the best allowed action under the provided portfolio constraints.
Your task has two ordered stages.
Stage 1: Intrinsic forecast
- Use only the trial fields.
- Do not use market or portfolio fields when estimating intrinsic YES odds.
- Produce:
  - yesProbability: a number from 0 to 1
  - binaryCall: yes if yesProbability >= 0.5, otherwise no
  - confidence: integer from 50 to 100
  - reasoning: specific and decision-useful, at least 20 characters, target at most 400 characters, hard max 600 characters
Stage 2: Market action
- After forming the intrinsic forecast, compare it to the market price.
- Use market and portfolio fields only in this stage.
- Choose exactly one action from allowedActions.
- Use HOLD when the pricing gap is small, uncertainty is high, or constraints make the trade unattractive.
- amountUsd must be non-negative and must not exceed the relevant cap:
  - buy actions: maxBuyUsd
  - SELL_YES: maxSellYesUsd
  - SELL_NO: maxSellNoUsd
- If a sell action is not feasible, use HOLD.
- Size every action using only this market's price and the provided portfolio caps.
- action.explanation must be plain language and at most 220 characters.
General rules
- Output valid JSON only.
- No markdown.
- No extra keys.
- Do not restate the input.
- Keep forecast.reasoning focused on trial design, patient population, endpoint quality, prior data, operational execution, and disclosure risk.
- Keep forecast.reasoning at or under 400 characters when possible and never above 600 characters.
- Keep action.explanation focused on valuation and trade logic.
Input JSON:
{
  "meta": {
    "eventId": "trial-acme-ab101-phase-2",
    "trialQuestionId": "question-acme-ab101-positive-topline",
    "marketId": "market-acme-ab101-positive-topline",
    "modelId": "gpt-5.5",
    "asOf": "2026-07-15T14:30:00.000Z",
    "runDateIso": "2026-07-15T14:30:00.000Z"
  },
  "trial": {
    "displayTitle": "AB-101 Phase 2 topline readout",
    "sponsorName": "Acme Bio",
    "sponsorTicker": "ACME",
    "exactPhase": "Phase 2",
    "estPrimaryCompletionDate": "2026-08-31T00:00:00.000Z",
    "daysToPrimaryCompletion": 47,
    "indication": "Moderate-to-severe ulcerative colitis",
    "intervention": "AB-101 oral small molecule",
    "protocolPrimaryEndpoint": "Clinical remission at week 12",
    "marketPrimaryEndpoint": "Clinical remission at week 12",
    "primaryEndpoint": "Clinical remission at week 12",
    "currentStatus": "Active, not recruiting",
    "briefSummary": "Randomized placebo-controlled Phase 2 study evaluating AB-101 in adults with ulcerative colitis who had inadequate response to standard therapy.",
    "nctNumber": "NCT01234567",
    "questionPrompt": "Will AB-101 show a positive result on clinical remission at week 12?"
  },
  "market": {
    "yesPrice": 0.43,
    "noPrice": 0.57
  },
  "portfolio": {
    "cashAvailable": 1000,
    "yesSharesHeld": 0,
    "noSharesHeld": 0,
    "maxBuyUsd": 1000,
    "maxSellYesUsd": 0,
    "maxSellNoUsd": 0
  },
  "constraints": {
    "allowedActions": [
      "BUY_YES",
      "BUY_NO",
      "SELL_YES",
      "SELL_NO",
      "HOLD"
    ],
    "explanationMaxChars": 220
  }
}
Return exactly:
{
  "forecast": {
    "yesProbability": 0.0,
    "binaryCall": "no",
    "confidence": 50,
    "reasoning": "string"
  },
  "action": {
    "type": "HOLD",
    "amountUsd": 0,
    "explanation": "string"
  }
}

Example response:
{
  "forecast": {
    "yesProbability": 0.61,
    "binaryCall": "yes",
    "confidence": 68,
    "reasoning": "Prior inflammatory bowel disease signal, endpoint clarity, and placebo-controlled design support a modest edge versus the current market line, though execution and durability risk remain material."
  },
  "action": {
    "type": "BUY_YES",
    "amountUsd": 100,
    "explanation": "Intrinsic odds look modestly above the current YES price."
  }
}

Response JSON schema:
{
  "type": "object",
  "additionalProperties": false,
  "required": [
    "forecast",
    "action"
  ],
  "properties": {
    "forecast": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "yesProbability",
        "binaryCall",
        "confidence",
        "reasoning"
      ],
      "properties": {
        "yesProbability": {
          "type": "number",
          "minimum": 0,
          "maximum": 1
        },
        "binaryCall": {
          "type": "string",
          "enum": [
            "yes",
            "no"
          ]
        },
        "confidence": {
          "type": "integer",
          "minimum": 50,
          "maximum": 100
        },
        "reasoning": {
          "type": "string",
          "minLength": 20,
          "maxLength": 600
        }
      }
    },
    "action": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "type",
        "amountUsd",
        "explanation"
      ],
      "properties": {
        "type": {
          "type": "string",
          "enum": [
            "BUY_YES",
            "BUY_NO",
            "SELL_YES",
            "SELL_NO",
            "HOLD"
          ]
        },
        "amountUsd": {
          "type": "number",
          "minimum": 0
        },
        "explanation": {
          "type": "string",
          "minLength": 1,
          "maxLength": 220
        }
      }
    }
  }
}
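The schema cannot express two rules stated in the prompt: binaryCall must agree with yesProbability, and amountUsd must stay within the cap for the chosen action. A minimal sketch of a check that covers both follows. It uses only the standard library and hand-codes the limits rather than running a JSON Schema validator. The names `check_decision` and `CAP_FIELD` are hypothetical, and applying no cap to HOLD is an assumption, since the prompt names caps only for buy and sell actions.

```python
# Maps each tradable action to the portfolio field that caps it.
CAP_FIELD = {
    "BUY_YES": "maxBuyUsd",
    "BUY_NO": "maxBuyUsd",
    "SELL_YES": "maxSellYesUsd",
    "SELL_NO": "maxSellNoUsd",
}


def check_decision(decision: dict, portfolio: dict,
                   allowed_actions: list[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the decision passes."""
    errors = []
    forecast, action = decision["forecast"], decision["action"]

    p = forecast["yesProbability"]
    if not 0 <= p <= 1:
        errors.append("yesProbability must be between 0 and 1")
    if forecast["binaryCall"] != ("yes" if p >= 0.5 else "no"):
        errors.append("binaryCall must be yes exactly when yesProbability >= 0.5")
    conf = forecast["confidence"]
    if not (isinstance(conf, int) and 50 <= conf <= 100):
        errors.append("confidence must be an integer from 50 to 100")
    if not 20 <= len(forecast["reasoning"]) <= 600:
        errors.append("reasoning must be 20 to 600 characters")

    kind, amount = action["type"], action["amountUsd"]
    if kind not in allowed_actions:
        errors.append("action type is not in allowedActions")
    if amount < 0:
        errors.append("amountUsd must be non-negative")
    cap_field = CAP_FIELD.get(kind)  # None for HOLD (assumed uncapped)
    if cap_field is not None and amount > portfolio[cap_field]:
        errors.append(f"amountUsd exceeds {cap_field}")
    if not 1 <= len(action["explanation"]) <= 220:
        errors.append("explanation must be 1 to 220 characters")
    return errors
```

Run against the example response and the example portfolio above, this check returns an empty list. A SELL_YES of any positive size would be flagged, because maxSellYesUsd is 0 in that portfolio.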



