DeepSeek-V4: A Need for Reassessment
Original title: “DeepSeek-V4,需要一次重估”. Originally published by Weijin on 26 April 2026.
The long-awaited release of DeepSeek-V4 on April 24 did not faze Wall Street one jot. Nvidia’s stock rose more than 4% that same day, pushing its market capitalization firmly above $5 trillion. Just 15 months earlier, the release of DeepSeek-R1 had triggered a 17% single-day drop in the same stock.
The Chinese and US AI infrastructures appear far less connected in 2026.
Tracing the past year and a half, we see two distinct “token economies” and AI narratives forming. One prioritizes cost-efficient scale under constraints of computing power, while the other prices at the frontier of capability, justified by scarcity. Both paths now exhibit strong inertia. But is this divergence truly structural?
Up to half a year behind, but closing in…
DeepSeek frames V4 as a “preview,” noting in its technical report that the architecture remains inelegant and that several active explorations have not yet been incorporated.
The release includes two models. V4-Pro features 1.6 trillion total parameters with 49 billion activated, while V4-Flash is a more efficient variant with 284 billion total parameters and 13 billion activated. Both support million-token context lengths but do not yet include multimodal capabilities.
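The gap between total and activated parameters is easier to judge as a ratio. The short calculation below uses only the figures quoted above; the function and variable names are illustrative, not taken from DeepSeek’s report.

```python
# Parameter counts as quoted in the article: (total, activated per token).
MODELS = {
    "V4-Pro": (1.6e12, 49e9),
    "V4-Flash": (284e9, 13e9),
}


def activated_fraction(total_params: float, activated_params: float) -> float:
    """Share of the model's parameters used for any single token."""
    return activated_params / total_params


for name, (total, active) in MODELS.items():
    share = activated_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

On these numbers, both models touch only a few percent of their weights for each token, which is where much of the inference saving comes from.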
By parameter scale alone, V4 still trails the most advanced closed-source models. DeepSeek itself acknowledges a lag of roughly three to six months behind the US frontier. While its agentic coding capabilities surpass Sonnet 4.5, it still falls short of Opus 4.6 in deeper reasoning tasks.
As benchmarks become saturated and evaluation increasingly subjective, external feedback carries greater weight. DeepSeek highlights input from community evaluator DollyDeng, who notes that V4-Pro demonstrates strong breadth in programming knowledge, low hallucination over long contexts, and high usability despite occasional lapses in attention.
Even so, SemiAnalysis characterizes the system as an “excellent engineering implementation,” one that approaches state-of-the-art performance while potentially becoming the lowest-cost alternative to closed-source models.
Smart, but efficient above all
Under million-token workloads, V4-Pro reduces per-token inference FLOPs to just 27% of its predecessor V3.2, while shrinking KV cache usage to 10%. V4-Flash pushes this even further, cutting FLOPs to 10% and KV cache size to 7%. These gains arguably exceed the practical impact of recent attention optimization approaches such as Google’s TurboQuant.
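Percentages of a predecessor’s cost can be hard to compare at a glance, so the sketch below converts them into reduction factors. It assumes each quoted figure is the fraction of V3.2’s per-token cost that remains; the names are illustrative.

```python
# Fraction of V3.2's per-token cost that remains, as quoted in the article.
REMAINING = {
    "V4-Pro": {"flops": 0.27, "kv_cache": 0.10},
    "V4-Flash": {"flops": 0.10, "kv_cache": 0.07},
}


def reduction_factor(remaining_fraction: float) -> float:
    """Convert 'X% of the old cost' into 'N times cheaper'."""
    if not 0 < remaining_fraction <= 1:
        raise ValueError("remaining_fraction must be in (0, 1]")
    return 1.0 / remaining_fraction


for model, metrics in REMAINING.items():
    for metric, fraction in metrics.items():
        factor = reduction_factor(fraction)
        print(f"{model} {metric}: {fraction:.0%} remaining, about {factor:.1f}x less")
```

Read this way, V4-Pro needs roughly a quarter of the compute and a tenth of the cache per token, and V4-Flash cuts the cache by a further margin.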
DeepSeek’s priority is clear: enabling efficient million-token intelligence. This capability opens the door to long-horizon task execution and persistent-context reasoning, and lays the groundwork for future paradigms such as online learning.
To address the computational bottlenecks of attention at extreme context lengths, V4 introduces a hybrid attention system. Compressed Sparse Attention (CSA) reduces KV cache size through a 4:1 compression ratio combined with sparse retrieval, whereas Highly Compressed Attention (HCA) pushes compression as far as 128:1 but still allows all compressed entries to participate in computation. Alternating between the two allows the model to maintain both global awareness and precise retrieval.
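To give a sense of scale, the sketch below counts how many compressed cache entries each scheme would leave for a million-token context. It assumes an N:1 ratio means one compressed entry per N tokens, which is a simplification made here rather than a detail confirmed by the report, and it ignores CSA’s sparse retrieval step.

```python
# Compression ratios quoted in the article (tokens per compressed entry).
CSA_RATIO = 4
HCA_RATIO = 128


def compressed_entries(context_tokens: int, ratio: int) -> int:
    """Number of compressed KV entries, rounding up for a partial last group.

    Assumes one compressed entry per `ratio` tokens (an illustrative model).
    """
    if context_tokens < 0 or ratio <= 0:
        raise ValueError("context_tokens must be >= 0 and ratio must be > 0")
    return -(-context_tokens // ratio)  # ceiling division on integers


context = 1_000_000
print("CSA entries:", compressed_entries(context, CSA_RATIO))
print("HCA entries:", compressed_entries(context, HCA_RATIO))
```

Under that assumption, HCA’s entries are few enough to attend to in full, while CSA keeps a much larger pool and depends on retrieval to pick from it.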
This architecture builds on a year of iteration, from DSA to mHC to Engram, and points toward a broader goal of multi-dimensional sparsity. In particular, memory decoupling aims to bypass GPU HBM limitations, enabling more aggressive scaling in future systems.
The real hardware story
DeepSeek-V4 launched with Day 0 support from Huawei Ascend and Cambricon, signaling deep inference-side collaboration with domestic chipmakers.
In practice, Chinese open-source models continue to operate across both Nvidia and domestic hardware ecosystems. Nvidia GPUs remain dominant in the near term, but Ascend is expected to take on a growing share of inference workloads over time.
On Huawei’s Ascend 950DT platform, V4-Pro achieves a throughput of 388 tokens per second under high concurrency, with a latency of approximately 20 milliseconds per request. V4-Flash reaches significantly higher throughput at 4,722 tokens per second. These figures highlight the growing viability of domestic inference infrastructure.
The technical report subtly emphasizes cross-platform adaptability, noting validation across both Nvidia GPUs and Huawei NPUs. At the center of this is MegaMoE, which reorganizes computation and communication in expert parallelism, overlapping the two to effectively hide latency. Combined with FP4 precision, this reduces bandwidth requirements and improves overall efficiency.
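Why overlapping hides latency can be seen in a toy timing model. The sketch below is not MegaMoE’s actual schedule: it assumes the work splits into equal chunks, each chunk is communicated and then computed, and one chunk’s communication can run while the previous chunk computes. The timings are made-up inputs.

```python
def serial_time(compute_ms: float, comm_ms: float) -> float:
    """Total time when communication and computation never overlap."""
    return compute_ms + comm_ms


def pipelined_time(compute_ms: float, comm_ms: float, chunks: int) -> float:
    """Total time for a two-stage pipeline (communicate, then compute).

    The work is split into `chunks` equal pieces; after the first piece
    arrives, each further piece costs only the slower of the two stages.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    comp = compute_ms / chunks
    comm = comm_ms / chunks
    return comm + (chunks - 1) * max(comp, comm) + comp


# Hypothetical timings, chosen only for illustration.
print(serial_time(8.0, 6.0))        # no overlap
print(pipelined_time(8.0, 6.0, 4))  # four chunks, overlapped
```

In this model most of the communication time disappears behind computation once the work is chunked, leaving only the first chunk’s transfer exposed.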
This direction aligns with broader industry trends. Nvidia’s Blackwell architecture introduces FP4 precision, Google’s TPU roadmap includes similar experimentation, and Huawei’s Ascend chips already support FP8 and MXFP4. While DeepSeek is not leading hardware development, it is increasingly shaping hardware requirements. Its estimate that one gigabyte per second of interconnect bandwidth can support 6.1 teraflops of compute may prove to be one of the most influential insights from the report.
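The bandwidth-to-compute estimate can be turned into a quick sizing rule. The sketch below assumes “6.1 teraflops” means 6.1 TFLOP/s of sustained compute per 1 GB/s of interconnect and that the relationship scales linearly; both are readings of the article’s figure, not statements from the report. The 1,000 TFLOP/s accelerator is hypothetical.

```python
# Article's estimate: 1 GB/s of interconnect can support 6.1 teraflops.
TFLOPS_PER_GB_PER_S = 6.1


def required_interconnect_gb_per_s(compute_tflops: float) -> float:
    """Interconnect bandwidth needed to keep a given compute rate fed.

    Assumes the quoted ratio scales linearly with compute.
    """
    if compute_tflops < 0:
        raise ValueError("compute_tflops must be non-negative")
    return compute_tflops / TFLOPS_PER_GB_PER_S


# Hypothetical accelerator delivering 1,000 TFLOP/s.
needed = required_interconnect_gb_per_s(1000.0)
print(f"about {needed:.1f} GB/s of interconnect bandwidth")
```

A rule of this kind lets chip designers see how much interconnect a given amount of compute would call for, which is why the estimate could matter beyond DeepSeek itself.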
Two AI narratives, still unresolved
DeepSeek’s framing that “AGI belongs to everyone” reflects a vision centered on cost-driven ubiquity. As Ascend deployment scales and V4 continues to improve, token costs are expected to fall further. Future hardware iterations could deliver an additional one-third efficiency gain.
In contrast, the US narrative has increasingly converged around frontier models and agent systems. Intelligence is treated as a scarce resource and agents as economic disruptors, while pricing is tied to the value of tasks solved. This creates a self-reinforcing dynamic in which models grow more expensive, yet demand continues to rise.
However, this model is not without tension. Most real-world workloads do not require frontier-level intelligence, and the long-term return on investment for agent-driven software remains uncertain. Cost is emerging not as a secondary concern, but as the primary constraint on scaling.
Industry voices, from Cloudflare to Replit to Hugging Face, are converging on this point: efficiency is becoming central again.
The gap exists, but it is also fluid
Chinese models continue to lag behind their US counterparts, with estimates suggesting an average gap of around seven months since 2023. This gap is driven not only by compute limitations, but also by differences in data quality, training environments, and enterprise demand.
As Nathan Lambert argues, the true moat lies in data and training ecosystems, particularly for agent-based systems where real-world feedback loops are critical.
Yet this gap is not static. AI paradigms tend to shift every 12 to 18 months, moving from dialogue to reasoning to agents. The next shift could favor different strengths. Demis Hassabis has also emphasized that AGI will likely require several more fundamental breakthroughs and may still be five to ten years away, leaving ample room for rebalancing.
Huang’s constraints, and the harder path ahead
DeepSeek’s trajectory increasingly resembles that of Android. It is not necessarily the most advanced system, but it is more accessible, more adaptable, and better positioned for scale.
Open models lower barriers to entry and broaden adoption, even benefiting US developers who deploy them on Nvidia hardware. At the same time, longer-term dynamics are shifting. Export controls are tightening Nvidia’s reach, while Chinese hardware continues to improve and integrate more deeply with domestic software ecosystems.
Over time, this could lead to a gradual replacement of the underlying stack, layer by layer, without users ever leaving the ecosystem itself. This is the deeper strategic context behind the pressures facing Jensen Huang.
DeepSeek’s challenge is not simply catching up in capability, but remaining committed to a path of low-cost, widely accessible AI. This path is slower, less aligned with capital market incentives, and more difficult to sustain. But it represents a fundamentally different vision of how AI should scale.
The team closed its V4 release with a rare statement of intent: to remain unmoved by praise or criticism, and to stay on its chosen path. Now that funding discussions are no longer denied, the real question is whether its investors share that same long-term conviction.




