Standard pass@1 evaluation, 164 prompts.
| Model | pass@1 | Avg latency | Cost / 1k tasks |
|---|---|---|---|
| GPT-5.5 | 92.1 | 2.4s | $3.80 |
| Claude 4.6 Sonnet | 91.7 | 2.1s | $3.20 |
| Doubao-Seed-2.0-Pro | 89.5 | 1.9s | $0.42 |
| DeepSeek V4 Pro | 93.4 | 2.6s | $0.31 |
50 LeetCode "Hard" problems posted after each model's training cutoff to minimize contamination.
| Model | Solved (out of 50) | Notes |
|---|---|---|
| GPT-5.5 | 34 | Strong dynamic programming |
| Doubao Pro | 31 | Weakest at advanced graph algorithms |
| Claude 4.6 Sonnet | 33 | Best explanations alongside code |
| DeepSeek V4 Pro | 38 | Clearly the strongest competitive coder |
We submitted 30 real PRs with deliberately injected bugs. Doubao Pro caught 24/30. GPT-5.5 caught 26/30, Claude 25/30, DeepSeek 28/30. Doubao misses subtle race conditions more often than the others, but matches them on logic and security review.
| Model | Score |
|---|---|
| GPT-5.5 | 88.2 |
| Claude 4.6 Sonnet | 87.0 |
| Doubao Pro | 86.4 |
| DeepSeek V4 Pro | 85.1 |
Doubao Pro: 96.4. GPT-5.5: 96.8. Within noise. Both essentially saturated.
30 multi-step puzzles requiring 8+ reasoning hops. Doubao Pro solved 23/30, GPT-5.5 solved 25/30. Doubao occasionally drops a constraint in the middle of long chains; spelling out "verify each step" in the system prompt closes most of the gap.
FLORES-200 dev-test split, BLEU scores.
| Direction | Doubao Pro | GPT-5.5 | Qwen3-Max |
|---|---|---|---|
| EN → ZH | 47.3 | 42.1 | 49.4 |
| ZH → EN | 38.2 | 40.5 | 41.1 |
| JA → EN | 33.4 | 35.2 | 34.8 |
| EN → JA | 32.1 | 30.9 | 33.5 |
Doubao Pro is best-in-class for English-to-Chinese (training corpus advantage). It loses to Qwen3-Max for Japanese and Chinese-to-English, where Alibaba has a longer multilingual heritage.
3,000 requests, single thread, 200 input tokens / 200 output tokens, non-streaming.
| Metric | Doubao Pro | GPT-5.5 | Claude 4.6 |
|---|---|---|---|
| First token P50 | 312 ms | 820 ms | 610 ms |
| First token P95 | 540 ms | 1,420 ms | 980 ms |
| First token P99 | 910 ms | 2,100 ms | 1,540 ms |
| Tokens per second (output) | 72 | 68 | 54 |
| End-to-end P50 (200 out tokens) | 3.1s | 3.7s | 4.4s |
This is where Doubao really shines from a Hong Kong / Asia client. The Volcano Engine endpoint is geographically close, so first-token latency is roughly 2.6× faster than calling OpenAI from the same region.
| Dimension | Doubao Pro | vs GPT-5.5 | vs Claude 4.6 |
|---|---|---|---|
| Coding | 89.5 | −2.6 | −2.2 |
| Reasoning | 86.4 | −1.8 | −0.6 |
| Translation (EN→ZH) | 47.3 | +5.2 | +6.5 |
| Latency (P50 first token) | 312ms | −508 ms | −298 ms |
| Cost / 1M I/O | $2.40 | −$17.60 | −$15.60 |
Sign up, get $0.50 free credit, point your existing eval harness at our endpoint. Three lines of code change, no integration work.
Get a Key →Doubao-Seed-2.0-Pro is a credible flagship-tier model. It's not the absolute best at any single dimension — DeepSeek wins on code, Qwen wins on translation parity, GPT-5.5 wins on reasoning by a hair — but the combination of competitive quality + best-in-Asia latency + 12.5× cheaper pricing is hard to argue with for most production workloads.
All numbers from independent NovAI testing in May 2026. Methodology and prompts available on request. Pricing accurate as of publication date.