DeepSeek V4, the most powerful open-source model: full details, including performance on par with Opus 4.6, price cuts, and chart-topping coding benchmarks.

Bitsfull · 2026/04/24 13:29


Today, DeepSeek announced the open-source release of the V4 series preview, with weights now available on Hugging Face and ModelScope under the MIT License. The series includes two MoE models: V4-Pro (1.6T total parameters, 49B activated per token) and V4-Flash (284B total parameters, 13B activated), both supporting a 1M-token context.


There are three key architecture upgrades:


· A hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavy Compressed Attention (HCA) sharply reduces long-context overhead: at 1M context, V4-Pro's per-token inference FLOPs are only 27% of V3.2's, and its KV-cache usage only 10%.


· Manifold-Constrained HyperConnections (mHC) replace traditional residual connections, improving the stability of cross-layer signal propagation.


· Training uses the Muon optimizer for faster convergence; the pre-training corpus exceeds 32T tokens.
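The scale of the KV-cache saving claimed for the hybrid attention mechanism is easiest to appreciate with back-of-the-envelope arithmetic. The sketch below uses illustrative layer counts and head dimensions, not V4's actual configuration:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_value: int = 2) -> float:
    """Size of the K and V tensors across all layers, in GiB (bf16 by default)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 2**30

# Hypothetical dense-attention baseline at 1M-token context
baseline = kv_cache_gib(layers=60, kv_heads=8, head_dim=128, context=1_000_000)
compressed = 0.10 * baseline  # the reported 10%-of-V3.2 cache footprint
print(f"{baseline:.1f} GiB -> {compressed:.1f} GiB")
```

At these assumed dimensions, a dense cache of over 200 GiB shrinks to a few tens of GiB, which is the difference between needing multiple accelerators for cache alone and fitting on one.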
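The report does not spell out the mHC formulation, but the general hyper-connections idea, maintaining several parallel residual streams and learning how each layer reads from and writes back to them, can be sketched as follows. The stream count, the mixing scheme, and the row-normalization standing in for the manifold constraint are all assumptions here, not the published design:

```python
import numpy as np

def hyper_connection_step(streams, layer_fn, width_mix, depth_weights):
    """One block over n parallel residual streams instead of a single residual.

    streams:       (n, d) current residual streams
    width_mix:     (n, n) learned mixing matrix across streams
    depth_weights: (n,)   learned weights for writing the layer output back
    """
    # Row-normalize the mixing matrix so stream magnitudes stay bounded
    # (a stand-in for the manifold constraint; the real mHC rule is not public).
    mix = width_mix / np.abs(width_mix).sum(axis=1, keepdims=True)
    mixed = mix @ streams                      # exchange information across streams
    h = layer_fn(mixed.sum(axis=0))            # the layer reads an aggregated stream
    return mixed + np.outer(depth_weights, h)  # weighted write-back into each stream
```

With one stream, an identity mix, and a unit write-back weight this degenerates to a plain residual connection, which is the baseline mHC is said to replace.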
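Muon itself is publicly documented: it applies momentum to matrix-shaped gradients and then orthogonalizes the update with a few Newton-Schulz iterations. A simplified NumPy version is below (coefficients from the public reference implementation; V4's exact training recipe may differ, e.g. in Nesterov momentum or per-shape scaling):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately push G's singular values toward 1 (semi-orthogonalize)
    using the quintic iteration from the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

def muon_update(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon step for a matrix parameter: momentum, then orthogonalize."""
    momentum_buf = beta * momentum_buf + grad
    return param - lr * newton_schulz_orthogonalize(momentum_buf), momentum_buf
```

The orthogonalization step is what distinguishes Muon from Adam-family optimizers: the update direction is decorrelated across the matrix rather than scaled elementwise.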


Post-training proceeds in two stages: domain experts are first trained separately with SFT and GRPO reinforcement learning, then unified into a single model through online distillation.
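GRPO's core trick is public: instead of a learned value model, the advantage of each sampled response is its reward normalized against the other responses drawn for the same prompt. A minimal sketch:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages: z-score each reward within its group.

    group_rewards: rewards for G responses sampled from the same prompt.
    """
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Two of four sampled responses passed the verifier (reward 1), two failed
print(grpo_advantages([1, 0, 0, 1]))
```

Because the baseline comes from the group itself, the method needs only a reward signal per sample, which is what makes it cheap enough to run per-domain before the distillation stage.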


Performance Evaluation: V4-Pro-Max Claims to be the Strongest Current Open-Source Model


V4-Pro's highest-inference-intensity mode is called V4-Pro-Max. The official tech report compares it against Opus 4.6 Max, GPT-5.4 xHigh, and Gemini 3.1 Pro High, as well as the open-source models Kimi K2.6 and GLM-5.1 (the newly released Opus 4.7 and GPT-5.5 are excluded; the final gap awaits third-party verification).


On the coding front, V4-Pro-Max scores 3206 on Codeforces, surpassing GPT-5.4 (3168) and Gemini 3.1 Pro (3052) and setting a new benchmark record. On LiveCodeBench it scores 93.5, also the highest across the board. On SWE-bench Verified it scores 80.6, just below Opus 4.6's 80.8, a gap of 0.2 percentage points.


On long context, it ranks second on both 1M benchmarks: 62.0 on CorpusQA 1M (Opus 4.6: 71.7) and 83.5 on MRCR 1M (Opus 4.6: 92.9).


On agent tasks, it scores 73.6 on MCPAtlas Public, just below Opus 4.6's 73.8; on Terminal-Bench 2.0 it scores 67.9, behind GPT-5.4 (75.1) and Gemini 3.1 Pro (68.5).


A noticeable gap remains in knowledge and reasoning: GPQA Diamond 90.1 (Gemini: 94.3), SimpleQA-Verified 57.9 (Gemini: 75.6), HLE 37.7 (Gemini: 44.4).


As an open-source model, V4-Pro-Max matches or even exceeds some closed-source flagships on several coding and long-context benchmarks for the first time, but still trails Gemini 3.1 Pro on knowledge-intensive evaluations.


Internal Dogfooding Data and Mathematical Reasoning


DeepSeek rarely discloses internal dogfooding data, but this report does. The team collected about 200 real R&D tasks from over 50 engineers, covering feature development, bug fixing, refactoring, and diagnostics, across a stack including PyTorch, CUDA, Rust, and C++; after rigorous screening, 30 tasks were retained for evaluation.


V4-Pro-Max passed 67% of these tasks, well above Sonnet 4.5 (47%) and close to Opus 4.5 (70%), but below Opus 4.5 Thinking (73%) and Opus 4.6 Thinking (80%); Haiku 4.5 passed only 13%. An internal survey (N=85) found that all respondents use V4-Pro for agentic coding in their daily work: 52% treat it as their default main coding model, 39% lean positive, and under 9% are negative. The main complaints were low-level errors, misreading of vague prompts, and occasional overthinking.


Turning to formal mathematical reasoning: the Putnam Competition is the most prestigious undergraduate math competition in North America. In the practical setting, which uses the open-source LeanExplore tool and restricted sampling, V4-Flash-Max scored 81.00 on the Putnam-200 Pass@8 benchmark, versus 35.50 for Seed-2.0-Prover and 26.50 each for Gemini 3 Pro and Seed-1.5-Prover.
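Pass@8 here means at least one of eight sampled attempts yields a verified proof. The standard unbiased estimator for pass@k from n samples with c successes, introduced in the Codex paper and widely used for this kind of reporting, is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples with c successes succeeds."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 correct proofs found among 16 samples, evaluated at k=8
print(round(pass_at_k(16, 2, 8), 4))  # prints 0.7667
```

Averaging this estimator over all problems gives the benchmark score, which is why it can be reported to two decimal places.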


In the frontier setting, V4 adopts a hybrid formal-informal approach: informal reasoning generates candidate natural-language solutions, self-verification filters them, and a formal agent then completes a rigorous proof in Lean. On Putnam-2025, V4 scored a perfect 120/120, tying Axiom for first place and surpassing Seed-1.5-Prover (110/120) and Aristotle (100/120). The frontier setting used extensive compute scaling, so the practical-setting results better reflect routine deployment capability.
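For readers unfamiliar with Lean, the formal agent's final output is machine-checkable proof code rather than prose. A trivial Lean 4 example (illustrative only, not from the report):

```lean
-- A toy theorem: commutativity of natural-number addition,
-- closed by the `omega` linear-arithmetic tactic.
theorem toy_comm (a b : Nat) : a + b = b + a := by
  omega
```

The Lean kernel either accepts such a proof or rejects it, which is what makes the self-verification pipeline's final filter exact rather than heuristic.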


API and Pricing: V4-Flash Gets a Price Cut and Context Upgrade, V4-Pro Positioned as a Premium Tier


The DeepSeek V4 API launched simultaneously with V4-Pro and V4-Flash, and the official account published pricing and compute plans. V4-Flash directly replaces V3.2 (deepseek-chat) at lower prices: cache-hit input stays at 0.2 yuan per million tokens, cache-miss input drops from 2 yuan to 1 yuan (down 50%), and output drops from 3 yuan to 2 yuan (down 33%). Context expands from 128K to 1M, roughly 8x the context at a lower price. The old model names deepseek-chat and deepseek-reasoner now point to V4-Flash's non-reasoning and reasoning modes respectively and will be discontinued on July 24, 2026.


V4-Pro sits in a new premium tier: 1 yuan per million tokens for cache-hit input, 12 yuan for cache-miss input, and 24 yuan for output, eight times V3.2's prices. DeepSeek explained in the pricing-table notes that high-end compute is scarce, so Pro's serving throughput is currently very limited, but prices are expected to fall significantly once the 950 super nodes come online in the second half of the year. Both models support non-reasoning and reasoning modes, and reasoning mode offers two reasoning_effort settings: high and max.
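Per-request cost under the announced prices is simple to compute. The token counts below are illustrative, not from the announcement:

```python
# Prices in yuan per million tokens: (cache-hit input, cache-miss input, output)
V4_FLASH = (0.2, 1.0, 2.0)
V4_PRO = (1.0, 12.0, 24.0)

def request_cost_yuan(hit_tokens, miss_tokens, output_tokens, prices):
    """Total cost of one request given its token breakdown and a price tier."""
    hit_p, miss_p, out_p = prices
    return (hit_tokens * hit_p + miss_tokens * miss_p + output_tokens * out_p) / 1e6

# A 1M-token-context request with an 80% cache-hit rate and 10K tokens of output
for name, prices in (("V4-Flash", V4_FLASH), ("V4-Pro", V4_PRO)):
    cost = request_cost_yuan(800_000, 200_000, 10_000, prices)
    print(f"{name}: {cost:.2f} yuan")
```

The same request costs 0.38 yuan on V4-Flash and 3.44 yuan on V4-Pro, which makes the cache-hit discount the dominant lever for long-context agent workloads on the Pro tier.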


In its announcement, DeepSeek stated, "Starting now, 1M context will be a standard feature of all DeepSeek official services."


First Public Infrastructure Disclosure: the Production-Grade Elastic Compute Sandbox DSec


The DeepSeek V4 technical report discloses for the first time the core infrastructure behind agent post-training and massive-scale evaluation: DSec (DeepSeek Elastic Compute), a production-grade elastic compute sandbox.


Large-scale reinforcement learning for today's models demands an enormous code-execution environment. The report reveals that in production, a single DSec cluster can schedule hundreds of thousands of concurrent sandboxes. The system is written in Rust, integrates tightly with the in-house 3FS distributed file system, and breaks through the cold-start bottleneck of massive sandbox fleets via on-demand loading.


On the developer-experience side, DSec exposes a Python SDK that unifies four execution substrates: function calls, containers, micro-VMs, and full VMs, switchable with a single parameter change. To handle the task preemption common on compute clusters, DSec introduces global trajectory logs: when a task resumes, the system fast-forwards by replaying cached command results, enabling rapid resumption from checkpoints while avoiding errors from re-running non-idempotent commands.
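The DSec SDK itself is not public, but the trajectory-log idea, caching each command's result so a preempted task can fast-forward on resume instead of re-running non-idempotent steps, can be sketched generically (all names here are hypothetical):

```python
class TrajectoryLog:
    """Fast-forward replay for resumable sandbox tasks (illustrative sketch;
    the real DSec interface is not public)."""

    def __init__(self):
        self._log = {}   # (step index, command) -> cached result
        self._step = 0

    def run(self, command, executor):
        key = (self._step, command)
        self._step += 1
        if key in self._log:
            return self._log[key]   # replay: skip real execution
        result = executor(command)  # first time: execute for real
        self._log[key] = result
        return result

    def resume(self):
        self._step = 0              # rewind the cursor; cached steps replay


calls = []
def fake_exec(cmd):
    calls.append(cmd)
    return f"ran {cmd}"

log = TrajectoryLog()
log.run("make build", fake_exec)
log.resume()                        # simulate preemption and restart
print(log.run("make build", fake_exec), len(calls))
```

Keying the cache on the step index as well as the command text is what keeps repeated identical commands at different points in a trajectory from colliding.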