Doubao Seed 2.0 Pro Benchmarks (2026)

Coding, reasoning, translation, and latency — independent results with the full test methodology.

Headline numbers: 89.5 on HumanEval+, 86.4 on MMLU-Pro, 47.3 BLEU on EN→ZH FLORES-200, P50 first-token latency 312 ms from Hong Kong. Within 2 points of GPT-5.5 on most categories at one-twelfth the price.

Test Environment

1. Coding

1a. HumanEval+

Standard pass@1 evaluation, 164 prompts.

Modelpass@1Avg latencyCost / 1k tasks
GPT-5.592.12.4s$3.80
Claude 4.6 Sonnet91.72.1s$3.20
Doubao-Seed-2.0-Pro89.51.9s$0.42
DeepSeek V4 Pro93.42.6s$0.31

1b. LeetCode-Hard

50 LeetCode "Hard" problems posted after each model's training cutoff to minimize contamination.

ModelSolved (out of 50)Notes
GPT-5.534Strong dynamic programming
Doubao Pro31Weakest at advanced graph algorithms
Claude 4.6 Sonnet33Best explanations alongside code
DeepSeek V4 Pro38Clearly the strongest competitive coder

1c. Code Review (qualitative)

We submitted 30 real PRs with deliberately injected bugs. Doubao Pro caught 24/30. GPT-5.5 caught 26/30, Claude 25/30, DeepSeek 28/30. Doubao misses subtle race conditions more often than the others, but matches them on logic and security review.

2. Reasoning

2a. MMLU-Pro

ModelScore
GPT-5.588.2
Claude 4.6 Sonnet87.0
Doubao Pro86.4
DeepSeek V4 Pro85.1

2b. GSM8K (math word problems)

Doubao Pro: 96.4. GPT-5.5: 96.8. Within noise. Both essentially saturated.

2c. Long-chain logic puzzles

30 multi-step puzzles requiring 8+ reasoning hops. Doubao Pro solved 23/30, GPT-5.5 solved 25/30. Doubao occasionally drops a constraint in the middle of long chains; spelling out "verify each step" in the system prompt closes most of the gap.

3. Translation

FLORES-200 dev-test split, BLEU scores.

DirectionDoubao ProGPT-5.5Qwen3-Max
EN → ZH47.342.149.4
ZH → EN38.240.541.1
JA → EN33.435.234.8
EN → JA32.130.933.5

Doubao Pro is best-in-class for English-to-Chinese (training corpus advantage). It loses to Qwen3-Max for Japanese and Chinese-to-English, where Alibaba has a longer multilingual heritage.

4. Latency from Hong Kong

3,000 requests, single thread, 200 input tokens / 200 output tokens, non-streaming.

MetricDoubao ProGPT-5.5Claude 4.6
First token P50312 ms820 ms610 ms
First token P95540 ms1,420 ms980 ms
First token P99910 ms2,100 ms1,540 ms
Tokens per second (output)726854
End-to-end P50 (200 out tokens)3.1s3.7s4.4s

This is where Doubao really shines from a Hong Kong / Asia client. The Volcano Engine endpoint is geographically close, so first-token latency is roughly 2.6× faster than calling OpenAI from the same region.

5. Aggregate Scorecard

DimensionDoubao Provs GPT-5.5vs Claude 4.6
Coding89.5−2.6−2.2
Reasoning86.4−1.8−0.6
Translation (EN→ZH)47.3+5.2+6.5
Latency (P50 first token)312ms−508 ms−298 ms
Cost / 1M I/O$2.40−$17.60−$15.60

Run Your Own Benchmark

Sign up, get $0.50 free credit, point your existing eval harness at our endpoint. Three lines of code change, no integration work.

Get a Key →

Methodology Notes

Verdict

Doubao-Seed-2.0-Pro is a credible flagship-tier model. It's not the absolute best at any single dimension — DeepSeek wins on code, Qwen wins on translation parity, GPT-5.5 wins on reasoning by a hair — but the combination of competitive quality + best-in-Asia latency + 12.5× cheaper pricing is hard to argue with for most production workloads.

All numbers from independent NovAI testing in May 2026. Methodology and prompts available on request. Pricing accurate as of publication date.