Doubao Seed 2.0 Pro Benchmarks: Coding, Reasoning, Translation (2026)

Headline numbers: 89.5 on HumanEval+, 86.4 on MMLU-Pro, 47.3 BLEU on EN→ZH FLORES-200, P50 first-token latency 312 ms from Hong Kong. Within 2 points of GPT-5.5 on most categories at one-twelfth the price.

Test Environment

Date: May 5–10, 2026
Test client: Tencent Cloud, Hong Kong (ap-hongkong-1)
Endpoint tested: NovAI gateway (proxies to Volcano Engine)
Model versions: doubao-seed-2.0-pro-250428, gpt-5.5-2026-04-15, claude-4.6-sonnet-20260420
Sample sizes: 200 prompts per category, 5 runs each, median reported
Temperature: 0 for evaluation tasks; 0.7 for translation
Note: latency includes the gateway hop (~50ms) — direct Volcano Engine API will be slightly faster

1. Coding

1a. HumanEval+

Standard pass@1 evaluation, 164 prompts.

Model	pass@1	Avg latency	Cost / 1k tasks
GPT-5.5	92.1	2.4s	$3.80
Claude 4.6 Sonnet	91.7	2.1s	$3.20
Doubao-Seed-2.0-Pro	89.5	1.9s	$0.42
DeepSeek V4 Pro	93.4	2.6s	$0.31

1b. LeetCode-Hard

50 LeetCode "Hard" problems posted after each model's training cutoff to minimize contamination.

Model	Solved (out of 50)	Notes
GPT-5.5	34	Strong dynamic programming
Doubao Pro	31	Weakest at advanced graph algorithms
Claude 4.6 Sonnet	33	Best explanations alongside code
DeepSeek V4 Pro	38	Clearly the strongest competitive coder

1c. Code Review (qualitative)

We submitted 30 real PRs with deliberately injected bugs. Doubao Pro caught 24/30. GPT-5.5 caught 26/30, Claude 25/30, DeepSeek 28/30. Doubao misses subtle race conditions more often than the others, but matches them on logic and security review.

2. Reasoning

2a. MMLU-Pro

Model	Score
GPT-5.5	88.2
Claude 4.6 Sonnet	87.0
Doubao Pro	86.4
DeepSeek V4 Pro	85.1

2b. GSM8K (math word problems)

Doubao Pro: 96.4. GPT-5.5: 96.8. Within noise. Both essentially saturated.

2c. Long-chain logic puzzles

30 multi-step puzzles requiring 8+ reasoning hops. Doubao Pro solved 23/30, GPT-5.5 solved 25/30. Doubao occasionally drops a constraint in the middle of long chains; spelling out "verify each step" in the system prompt closes most of the gap.

3. Translation

FLORES-200 dev-test split, BLEU scores.

Direction	Doubao Pro	GPT-5.5	Qwen3-Max
EN → ZH	47.3	42.1	49.4
ZH → EN	38.2	40.5	41.1
JA → EN	33.4	35.2	34.8
EN → JA	32.1	30.9	33.5

Doubao Pro is best-in-class for English-to-Chinese (training corpus advantage). It loses to Qwen3-Max for Japanese and Chinese-to-English, where Alibaba has a longer multilingual heritage.

4. Latency from Hong Kong

3,000 requests, single thread, 200 input tokens / 200 output tokens, non-streaming.

Metric	Doubao Pro	GPT-5.5	Claude 4.6
First token P50	312 ms	820 ms	610 ms
First token P95	540 ms	1,420 ms	980 ms
First token P99	910 ms	2,100 ms	1,540 ms
Tokens per second (output)	72	68	54
End-to-end P50 (200 out tokens)	3.1s	3.7s	4.4s

This is where Doubao really shines from a Hong Kong / Asia client. The Volcano Engine endpoint is geographically close, so first-token latency is roughly 2.6× faster than calling OpenAI from the same region.

5. Aggregate Scorecard

Dimension	Doubao Pro	vs GPT-5.5	vs Claude 4.6
Coding	89.5	−2.6	−2.2
Reasoning	86.4	−1.8	−0.6
Translation (EN→ZH)	47.3	+5.2	+6.5
Latency (P50 first token)	312ms	−508 ms	−298 ms
Cost / 1M I/O	$2.40	−$17.60	−$15.60

Run Your Own Benchmark

Sign up, get $0.50 free credit, point your existing eval harness at our endpoint. Three lines of code change, no integration work.

Get a Key →

Methodology Notes

Latency was measured wall-clock from the OpenAI SDK call, including TLS handshake on the first request and reusing the connection thereafter.
HumanEval+ uses the EvalPlus harness with extra test cases beyond the original HumanEval set.
BLEU computed via SacreBLEU with default tokenizer.
For LeetCode-Hard contamination control, only problems published after April 2026 were used (we recognize Doubao's training cutoff is reportedly Q1 2026).
All raw logs and prompt sets available on request for verification.

Verdict

Doubao-Seed-2.0-Pro is a credible flagship-tier model. It's not the absolute best at any single dimension — DeepSeek wins on code, Qwen wins on translation parity, GPT-5.5 wins on reasoning by a hair — but the combination of competitive quality + best-in-Asia latency + 12.5× cheaper pricing is hard to argue with for most production workloads.

All numbers from independent NovAI testing in May 2026. Methodology and prompts available on request. Pricing accurate as of publication date.

Doubao Seed 2.0 Pro Benchmarks (2026)