Why traditional benchmarks fall short

The Problem with AI Benchmarks

Most benchmarks test answers that already exist in training data. Models can achieve high scores through memorization rather than reasoning.

The Solution

Clinical-trial outcomes do not exist until the data lands, so there is nothing to memorize and nothing to leak from training data. Every question also produces a full time series of how each model updated its forecast.

What We're Testing

Can AI models reason about noisy clinical evidence and make accurate predictions about the future?

The five-step evaluation process
1. Track Trial Questions

Publish linked clinical-trial questions into Season 5 markets from saved trial facts. Markets missing required linked trial fields are skipped rather than filled with placeholders.

2. Prepare Shared Context

Each funded model wallet receives the same structured trial facts, live onchain YES/NO price, wallet cash and positions, and allowed action caps.

3. Record Decision Snapshots

Ask each model for an intrinsic YES forecast from trial fields first, then a market action after seeing price and portfolio context. Batch/API and imported decisions are stored with probability, binary call, confidence, reasoning, and proposed action.

4. Execute Onchain Trades

The live AI desk passes ready stored decisions into the manual Execute Trades step. Trade execution caps each action to the wallet and market limits, submits Base Sepolia buy/sell transactions from model wallets, and lets the indexer mirror events back into the app.
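The capping rule in this step can be sketched as a small clamp. This is a minimal illustration, not the desk's actual implementation: `Caps` and `cap_action` are hypothetical names, the cap fields mirror `maxBuyUsd`, `maxSellYesUsd`, and `maxSellNoUsd` from the decision prompt, and falling back to HOLD when nothing is tradable is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Caps:
    """Hypothetical container for the per-market caps shown to the model."""
    max_buy_usd: float
    max_sell_yes_usd: float
    max_sell_no_usd: float


def cap_action(action_type: str, amount_usd: float, caps: Caps) -> Tuple[str, float]:
    """Clamp a proposed action to its cap; return HOLD if nothing is tradable."""
    if action_type == "HOLD":
        return ("HOLD", 0.0)
    limits = {
        "BUY_YES": caps.max_buy_usd,
        "BUY_NO": caps.max_buy_usd,
        "SELL_YES": caps.max_sell_yes_usd,
        "SELL_NO": caps.max_sell_no_usd,
    }
    if action_type not in limits:
        raise ValueError(f"unknown action type: {action_type}")
    # Negative amounts are treated as zero, then clamped to the relevant cap.
    capped = min(max(amount_usd, 0.0), limits[action_type])
    if capped <= 0:
        # Assumption: an action with no tradable size degrades to HOLD.
        return ("HOLD", 0.0)
    return (action_type, capped)
```

With a fresh wallet (`Caps(1000, 0, 0)`), a proposed 1,500 USD `BUY_YES` is cut to 1,000 USD, and any sell becomes HOLD because the wallet holds no shares.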

5. Resolve and Rank

Public rankings use the Season 5 money leaderboard: models are ranked by mirrored onchain total equity, meaning mock-USDC collateral plus marked-to-market YES/NO positions. Correct, wrong, and pending counts are derived from each model wallet's net position on resolved markets: more YES shares than NO shares is a YES call, more NO than YES is a NO call, and unresolved or tied positions stay pending. Stored decision snapshots remain available for first/final pre-outcome analysis, but the public board is money-first.
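The ranking arithmetic described above can be sketched as follows. The function names are hypothetical, and marking positions at the current YES/NO price is an assumption about how the marked-to-market value is computed.

```python
from typing import Iterable, Optional, Tuple

# One market position: (yes_shares, no_shares, yes_price, no_price)
Position = Tuple[float, float, float, float]


def total_equity(cash: float, positions: Iterable[Position]) -> float:
    """Mock-USDC collateral plus marked-to-market YES/NO positions."""
    marked = sum(y * y_price + n * n_price for y, n, y_price, n_price in positions)
    return cash + marked


def net_position_call(yes_shares: float, no_shares: float) -> Optional[str]:
    """More YES than NO is a YES call, more NO than YES is a NO call, a tie is no call."""
    if yes_shares > no_shares:
        return "yes"
    if no_shares > yes_shares:
        return "no"
    return None


def grade(yes_shares: float, no_shares: float, outcome: Optional[str]) -> str:
    """Return 'correct', 'wrong', or 'pending' for one wallet on one market.

    outcome is 'yes' or 'no' once the market resolves, and None before that.
    """
    call = net_position_call(yes_shares, no_shares)
    if outcome is None or call is None:
        return "pending"
    return "correct" if call == outcome else "wrong"
```

For example, a wallet with 900 mock USDC in cash and 200 YES shares marked at 0.50 has a total equity of 1,000, and it counts as correct only if that market resolves YES.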

Season 5 onchain runtime

Market Venue

Base Sepolia markets use mock USDC, onchain YES/NO positions, and an app read model mirrored from emitted contract events.

Model Wallets

Funded model wallets default to 1,000 mock USDC unless the admin runtime config overrides the model bankroll, and model buy actions are capped by each wallet's available cash.

Human Wallets

Users authenticate with Privy, receive an embedded wallet, start at 0, and fund through the configured mock-USDC faucet.

The models we compare

Claude Opus 4.7

Anthropic

claude-opus-4-7

Web Search: Enabled
Reasoning: Provider default
Max Output: 6,000 output tokens

Anthropic web_search_20250305 (max_uses: 7)
No explicit Anthropic thinking parameter configured

GPT-5.5

OpenAI

gpt-5.5

Web Search: Enabled
Reasoning: High Effort
Max Output: 8,000 output tokens

OpenAI web_search tool
reasoning.effort = high

Grok 4.3

xAI

grok-4.3

Web Search: Enabled
Reasoning: Native reasoning
Max Output: 4,000 output tokens

Responses API web_search tool
Responses API with native reasoning + web_search

Gemini 3.1 Pro

Google

gemini-3.1-pro-preview

Web Search: Enabled
Reasoning: Thinking
Max Output: 16,000 output tokens

Google Search grounding
thinkingConfig.thinkingBudget = -1

DeepSeek-V4-Pro

Fireworks

accounts/fireworks/models/deepseek-v4-pro

Web Search: Not available
Reasoning: High Effort
Max Output: 4,096 output tokens

No web-search tool configured in the combined decision generator
reasoning_effort = high

GLM-5.1

Fireworks

accounts/fireworks/models/glm-5p1

Web Search: Not available
Reasoning: High Effort
Max Output: 4,096 output tokens

No web-search tool configured in the combined decision generator
reasoning_effort = high

Qwen3 VL 30B A3B

Fireworks

accounts/fireworks/models/qwen3-vl-30b-a3b-instruct

Web Search: Not available
Reasoning: Provider default
Max Output: 4,096 output tokens

No web-search tool configured in the combined decision generator
No explicit reasoning parameter configured

Llama 3.3 70B

Fireworks

accounts/fireworks/models/llama-v3p3-70b-instruct

Web Search: Not available
Reasoning: Provider default
Max Output: 4,096 output tokens

No web-search tool configured in the combined decision generator
No explicit reasoning parameter configured

Kimi K2.6 Turbo (Preview)

Fireworks

accounts/fireworks/routers/kimi-k2p6-turbo

Web Search: Not available
Reasoning: Provider default
Max Output: 4,096 output tokens

No web-search tool configured in the combined decision generator
No explicit thinking flag configured

MiniMax M2.7

Fireworks

accounts/fireworks/models/minimax-m2p7

Web Search: Not available
Reasoning: Low Effort
Max Output: 4,096 output tokens

No web-search tool configured in the combined decision generator
reasoning_effort = low

Model Decision Prompt
Generated from the runtime decision prompt builder
You are an expert biotech trial analyst and prediction-market decision maker.

First estimate the intrinsic probability that the live trial question resolves YES from the trial facts alone. Then compare that view to the current market price and choose the best allowed action under the provided portfolio constraints.

Your task has two ordered stages.

Stage 1: Intrinsic forecast
- Use only the trial fields.
- Do not use market or portfolio fields when estimating intrinsic YES odds.
- Produce:
  - yesProbability: a number from 0 to 1
  - binaryCall: yes if yesProbability >= 0.5, otherwise no
  - confidence: integer from 50 to 100
  - reasoning: specific and decision-useful, at least 20 characters, target at most 400 characters, hard max 600 characters

Stage 2: Market action
- After forming the intrinsic forecast, compare it to the market price.
- Use market and portfolio fields only in this stage.
- Choose exactly one action from allowedActions.
- Use HOLD when the pricing gap is small, uncertainty is high, or constraints make the trade unattractive.
- amountUsd must be non-negative and must not exceed the relevant cap:
  - buy actions: maxBuyUsd
  - SELL_YES: maxSellYesUsd
  - SELL_NO: maxSellNoUsd
- If a sell action is not feasible, use HOLD.
- Size every action using only this market's price and the provided portfolio caps.
- action.explanation must be plain language and at most 220 characters.

General rules
- Output valid JSON only.
- No markdown.
- No extra keys.
- Do not restate the input.
- Keep forecast.reasoning focused on trial design, patient population, endpoint quality, prior data, operational execution, and disclosure risk.
- Keep forecast.reasoning at or under 400 characters when possible and never above 600 characters.
- Keep action.explanation focused on valuation and trade logic.

Input JSON:
{
  "meta": {
    "eventId": "trial-acme-ab101-phase-2",
    "trialQuestionId": "question-acme-ab101-positive-topline",
    "marketId": "market-acme-ab101-positive-topline",
    "modelId": "gpt-5.5",
    "asOf": "2026-07-15T14:30:00.000Z",
    "runDateIso": "2026-07-15T14:30:00.000Z"
  },
  "trial": {
    "displayTitle": "AB-101 Phase 2 topline readout",
    "sponsorName": "Acme Bio",
    "sponsorTicker": "ACME",
    "exactPhase": "Phase 2",
    "estPrimaryCompletionDate": "2026-08-31T00:00:00.000Z",
    "daysToPrimaryCompletion": 47,
    "indication": "Moderate-to-severe ulcerative colitis",
    "intervention": "AB-101 oral small molecule",
    "protocolPrimaryEndpoint": "Clinical remission at week 12",
    "marketPrimaryEndpoint": "Clinical remission at week 12",
    "primaryEndpoint": "Clinical remission at week 12",
    "currentStatus": "Active, not recruiting",
    "briefSummary": "Randomized placebo-controlled Phase 2 study evaluating AB-101 in adults with ulcerative colitis who had inadequate response to standard therapy.",
    "nctNumber": "NCT01234567",
    "questionPrompt": "Will AB-101 show a positive result on clinical remission at week 12?"
  },
  "market": {
    "yesPrice": 0.43,
    "noPrice": 0.57
  },
  "portfolio": {
    "cashAvailable": 1000,
    "yesSharesHeld": 0,
    "noSharesHeld": 0,
    "maxBuyUsd": 1000,
    "maxSellYesUsd": 0,
    "maxSellNoUsd": 0
  },
  "constraints": {
    "allowedActions": [
      "BUY_YES",
      "BUY_NO",
      "SELL_YES",
      "SELL_NO",
      "HOLD"
    ],
    "explanationMaxChars": 220
  }
}

Return exactly:
{
  "forecast": {
    "yesProbability": 0.0,
    "binaryCall": "no",
    "confidence": 50,
    "reasoning": "string"
  },
  "action": {
    "type": "HOLD",
    "amountUsd": 0,
    "explanation": "string"
  }
}

Expected JSON Response
{
  "forecast": {
    "yesProbability": 0.61,
    "binaryCall": "yes",
    "confidence": 68,
    "reasoning": "Prior inflammatory bowel disease signal, endpoint clarity, and placebo-controlled design support a modest edge versus the current market line, though execution and durability risk remain material."
  },
  "action": {
    "type": "BUY_YES",
    "amountUsd": 100,
    "explanation": "Intrinsic odds look modestly above the current YES price."
  }
}
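To make the sample response concrete, the sketch below recomputes its two derived quantities from the example input: the binary call implied by the prompt's 0.5 threshold, and the gap between the intrinsic forecast (0.61) and the market YES price (0.43). The helper names are hypothetical; the prompt does not state a gap threshold for trading, so none is assumed here.

```python
def binary_call(yes_probability: float) -> str:
    """Apply the prompt's rule: yes if yesProbability >= 0.5, otherwise no."""
    return "yes" if yes_probability >= 0.5 else "no"


def pricing_gap(yes_probability: float, yes_price: float) -> float:
    """Intrinsic YES probability minus the market YES price.

    A positive gap means the model sees YES as underpriced.
    """
    return yes_probability - yes_price


# Values taken from the example input and the sample response.
gap = pricing_gap(0.61, 0.43)
print(binary_call(0.61), round(gap, 2))  # prints: yes 0.18
```

A gap of 0.18 in favor of YES is consistent with the sample `BUY_YES` action, and the 100 USD size sits well inside the example `maxBuyUsd` of 1000.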

Runtime JSON schema (shape + constraints)
{
  "type": "object",
  "additionalProperties": false,
  "required": [
    "forecast",
    "action"
  ],
  "properties": {
    "forecast": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "yesProbability",
        "binaryCall",
        "confidence",
        "reasoning"
      ],
      "properties": {
        "yesProbability": {
          "type": "number",
          "minimum": 0,
          "maximum": 1
        },
        "binaryCall": {
          "type": "string",
          "enum": [
            "yes",
            "no"
          ]
        },
        "confidence": {
          "type": "integer",
          "minimum": 50,
          "maximum": 100
        },
        "reasoning": {
          "type": "string",
          "minLength": 20,
          "maxLength": 600
        }
      }
    },
    "action": {
      "type": "object",
      "additionalProperties": false,
      "required": [
        "type",
        "amountUsd",
        "explanation"
      ],
      "properties": {
        "type": {
          "type": "string",
          "enum": [
            "BUY_YES",
            "BUY_NO",
            "SELL_YES",
            "SELL_NO",
            "HOLD"
          ]
        },
        "amountUsd": {
          "type": "number",
          "minimum": 0
        },
        "explanation": {
          "type": "string",
          "minLength": 1,
          "maxLength": 220
        }
      }
    }
  }
}
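
The constraints in this schema can be checked with a short hand-written validator. The sketch below uses only the Python standard library and mirrors the schema rule by rule; `validate_decision` is a hypothetical name, and this is not the runtime's own validation code. A JSON Schema library would serve the same purpose.

```python
import json

FORECAST_KEYS = {"yesProbability", "binaryCall", "confidence", "reasoning"}
ACTION_KEYS = {"type", "amountUsd", "explanation"}
ACTION_TYPES = ("BUY_YES", "BUY_NO", "SELL_YES", "SELL_NO", "HOLD")


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_decision(raw: str) -> list:
    """Return a list of problems; an empty list means the response passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    # additionalProperties is false, so the key sets must match exactly.
    if not isinstance(data, dict) or set(data) != {"forecast", "action"}:
        return ["top level must contain exactly 'forecast' and 'action'"]

    errors = []
    forecast, action = data["forecast"], data["action"]

    if not isinstance(forecast, dict) or set(forecast) != FORECAST_KEYS:
        errors.append("forecast must contain exactly its four required keys")
    else:
        prob = forecast["yesProbability"]
        if not (_is_number(prob) and 0 <= prob <= 1):
            errors.append("yesProbability must be a number from 0 to 1")
        if forecast["binaryCall"] not in ("yes", "no"):
            errors.append("binaryCall must be 'yes' or 'no'")
        conf = forecast["confidence"]
        if not (isinstance(conf, int) and not isinstance(conf, bool) and 50 <= conf <= 100):
            errors.append("confidence must be an integer from 50 to 100")
        reasoning = forecast["reasoning"]
        if not (isinstance(reasoning, str) and 20 <= len(reasoning) <= 600):
            errors.append("reasoning must be a string of 20 to 600 characters")

    if not isinstance(action, dict) or set(action) != ACTION_KEYS:
        errors.append("action must contain exactly its three required keys")
    else:
        if action["type"] not in ACTION_TYPES:
            errors.append("action.type must be one of the allowed actions")
        amount = action["amountUsd"]
        if not (_is_number(amount) and amount >= 0):
            errors.append("amountUsd must be a non-negative number")
        explanation = action["explanation"]
        if not (isinstance(explanation, str) and 1 <= len(explanation) <= 220):
            errors.append("explanation must be a string of 1 to 220 characters")

    return errors
```

The schema does not express cross-field rules such as `binaryCall` agreeing with `yesProbability`, or `amountUsd` staying under the relevant portfolio cap, so those would need separate checks.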