The Token Economy Has a Speed Problem China Can’t Yet Solve
Chinese firms have closed the model quality gap faster than most observers thought possible. Speed is the next competitive frontier they plan to cross.
At GTC 2026, Jensen Huang used the Groq 3 LPU to make a market structure argument: as AI inference speed rises from roughly 50 tokens per second to 800 and beyond, per-token pricing rises from a few dollars per million to $150 per million.
Speed does not mechanically create value, but faster inference enables a class of applications (real-time voice, autonomous agents, live coding assistance) that users will pay substantially more for.
The Latency Frontier
Every inference system faces a fundamental constraint. When a server handles many user requests simultaneously (batching), total output rises but each individual request waits longer. When a system prioritizes speed for each request, it serves fewer users at once. Costs accrue per unit of chip time; revenue accrues per token generated. The result is a hard tradeoff: optimize for volume and prices stay low; optimize for speed and prices can rise substantially.
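A toy calculation makes the economics concrete. The sketch below is illustrative only: the per-request speeds and price tiers are stylized from the figures cited in this piece, while the concurrency values are assumptions, not measured deployments.

```python
# Toy revenue model for the volume-vs-speed tradeoff. Per-request TPS
# and price tiers follow the figures cited in this piece; the
# concurrency values are illustrative assumptions.

def revenue_per_chip_hour(per_request_tps: float,
                          concurrent_requests: int,
                          price_per_m_tokens: float) -> float:
    """Token revenue from one accelerator running flat out for an hour."""
    tokens_per_hour = per_request_tps * concurrent_requests * 3600
    return tokens_per_hour / 1e6 * price_per_m_tokens

# Volume operating point: many slow streams at commodity prices.
volume = revenue_per_chip_hour(100, concurrent_requests=64, price_per_m_tokens=3.0)

# Speed operating point: a handful of fast streams at premium prices.
speed = revenue_per_chip_hour(800, concurrent_requests=4, price_per_m_tokens=150.0)

print(f"volume tier: ${volume:,.0f}/chip-hour")  # ~$69
print(f"speed tier:  ${speed:,.0f}/chip-hour")   # ~$1,728
```

Even though the speed tier moves half as many total tokens in this toy setup, premium pricing makes it the far larger revenue stream per chip, which is the heart of Huang's argument.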
Nvidia’s Dynamo, introduced at last year’s GTC, is software designed to manage this tradeoff dynamically, shifting resources between speed-sensitive and volume-sensitive tasks. The new development at GTC 2026 is hardware. Nvidia’s current flagship inference platform, GB NVL72, tops out at 400 tokens per second. Huang’s pitch is that the Vera Rubin and Rosa Feynman platforms will push past that ceiling, and that a new chip, the Groq 3 LPU, gives Nvidia a dedicated tool for the specific part of inference where speed matters most.
The Groq 3 LPU is different.
Generating AI output happens in two stages. The first stage, prefill, reads and processes the input prompt; it is computationally intensive. The second stage, decode, generates the actual output tokens one by one; it is limited not by raw compute power but by how fast the chip can move data. The Groq 3 LPU is built around a very large, very fast on-chip memory (500 MB of SRAM with 150 TB/s of bandwidth, versus 22 TB/s on Rubin) that removes the bottleneck in the decode stage. Compute per chip is modest, roughly 1/25th of a Rubin GPU, but for low-latency token generation that memory architecture creates a meaningfully different operating point.

In Huang’s proposed deployment, Vera Rubin handles the compute-heavy prefill stage; the LPU handles the final token generation steps, where latency sensitivity is highest; and Dynamo coordinates both. He suggested an LPU allocation of roughly 25% in a mixed rack deployment and framed the configuration as unlocking an incremental $150 billion addressable market beyond what Rubin alone could serve. The Groq 3 LPX rack, housing 256 LPU chips, is scheduled to ship in the second half of this year and is manufactured by Samsung.
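A back-of-envelope calculation shows why the bandwidth figure, not raw compute, sets the ceiling on single-stream decode speed. The bandwidth numbers below are the ones quoted above; the model size is an assumption for illustration, and the sketch ignores KV-cache traffic and the sharding of weights across the 256 chips in a rack.

```python
# Each decode step must stream the active model weights through the
# chip once, so single-stream decode speed is roughly bounded by
# bandwidth / bytes-per-token. ACTIVE_BYTES is an assumption (e.g. an
# FP8 model with ~70B active parameters); bandwidths are from the text.
# KV-cache reads and cross-chip sharding are ignored for simplicity.

def decode_tps_ceiling(bandwidth_bytes_per_s: float, active_bytes: float) -> float:
    return bandwidth_bytes_per_s / active_bytes

ACTIVE_BYTES = 70e9  # assumed weight bytes touched per generated token

print(f"HBM-class, 22 TB/s:   {decode_tps_ceiling(22e12, ACTIVE_BYTES):,.0f} tok/s")
print(f"SRAM-class, 150 TB/s: {decode_tps_ceiling(150e12, ACTIVE_BYTES):,.0f} tok/s")
# ~314 vs ~2,143 tokens/sec: the same model, roughly 7x the decode headroom
```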
This architecture reflects an acquisition completed late last year. Nvidia paid $2 billion for a non-exclusive license to Groq’s inference technology. Groq founder and CEO Jonathan Ross, president Sunny Madra, and a cohort of core engineers subsequently joined Nvidia. The Groq 3 LPU arriving three months later as the seventh chip in the Rubin platform reflects how quickly that integration moved from license to production.
The shift toward an SRAM-heavy design also sidesteps real supply constraints. The advanced chip packaging and high-bandwidth memory that most AI accelerators depend on are both scarce and expensive. A large on-chip SRAM architecture built on Samsung’s standard process reduces exposure to those bottlenecks considerably.
Where China’s Open Ecosystem Sits on This Curve
Chinese open-source models currently operate at roughly 100 TPS, priced from free to $3 per million tokens. US closed models are moving from 400 TPS toward 1,000 TPS, with pricing reaching $45 to $150 per million tokens.
Qwen-3 has established itself as a credible, widely deployed foundation layer: abundant, cheap, and genuinely useful for the high-throughput, cost-sensitive segment of the market. This is a distinct value proposition, not a failure state. Huang is asserting something more pointed: that high-value inference workloads are migrating toward low-latency, high-price-per-token segments, and that the constraint keeping Chinese models out of that segment is not model quality but access to specialized low-latency chips and SRAM-optimized hardware.
The competitive picture on that hardware dimension is moving fast.
Cerebras, a chip startup competing with Nvidia in low-latency inference, is already running OpenAI’s Codex-Spark at speeds reaching 1,000 TPS. Last week, Amazon Web Services announced a Cerebras partnership pairing Trainium for prefill with Cerebras WSE for decode, with claimed speeds up to 3,000 TPS. The AWS announcement also noted Cerebras support for models from OpenAI, Cognition (the AI agent startup behind Devin), and Meta. Microsoft’s Maia 200 has similarly moved toward large on-chip SRAM as its core design principle.
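The prefill/decode disaggregation pattern recurring across these announcements has a simple skeleton. The sketch below is a hypothetical illustration, not Dynamo's or AWS's actual API: the pool names, handoff object, and stub implementations are all invented for clarity.

```python
# Skeleton of disaggregated inference: compute-heavy prefill on one
# accelerator pool, latency-sensitive decode on another, with the
# attention state handed off between them. All names are hypothetical;
# production systems add KV-cache transfer, batching, and scheduling.
from dataclasses import dataclass

@dataclass
class PrefillResult:
    kv_cache: list        # attention state produced by prefill
    next_token: str

class PrefillPool:
    def prefill(self, prompt: str) -> PrefillResult:
        # Stub: a real pool runs large matmuls over the whole prompt.
        return PrefillResult(kv_cache=[prompt], next_token="Hello")

class DecodePool:
    def decode_step(self, state: PrefillResult) -> str | None:
        # Stub: a real pool streams weights once per generated token.
        done = len(state.kv_cache) > 3
        state.kv_cache.append("tok")
        return None if done else " world"

def serve(prompt: str, prefill: PrefillPool, decode: DecodePool,
          max_tokens: int = 16) -> str:
    state = prefill.prefill(prompt)          # compute-bound stage
    out = [state.next_token]
    for _ in range(max_tokens - 1):          # bandwidth-bound stage
        tok = decode.decode_step(state)
        if tok is None:
            break
        out.append(tok)
    return "".join(out)

print(serve("What is disaggregated serving?", PrefillPool(), DecodePool()))
```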
The convergence on low-latency architecture across multiple US infrastructure players is compressing the window Chinese firms have to close the gap from the outside. The application logic driving this convergence is still partly aspirational but directionally clear.
Nvidia VP of hyperscale business Ian Buck described multi-agent systems as an emerging use case that will require 1,500 TPS or above for agent-to-agent communication, a speed regime where human-paced interaction becomes a ceiling rather than a target. OpenAI’s Codex-Spark, running at 1,000 TPS, represents an early signal that coding workloads are moving in this direction, with the implication that near-real-time feedback keeps users in continuous flow states in ways that 100 TPS cannot. These remain emerging use cases rather than dominant workloads today, but the hardware investment being made around them suggests the infrastructure stack is being built ahead of demand.
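Simple arithmetic shows why agent chains outrun human-paced speeds. The message length and chain depth below are assumptions chosen for illustration:

```python
# A multi-agent answer serializes several model-to-model messages, so
# per-message generation speed multiplies through the chain. Message
# size and hop count are illustrative assumptions.

def chain_latency_s(tps: float, tokens_per_message: int, hops: int) -> float:
    """Wall-clock seconds for `hops` sequential agent messages."""
    return hops * tokens_per_message / tps

for tps in (100, 1_000, 1_500):
    t = chain_latency_s(tps, tokens_per_message=800, hops=6)
    print(f"{tps:>5} TPS -> {t:5.1f} s end-to-end")
# 100 TPS -> 48.0 s; 1,000 TPS -> 4.8 s; 1,500 TPS -> 3.2 s
```

Under these assumptions, the same six-hop exchange that feels instantaneous at 1,500 TPS takes most of a minute at 100 TPS, which is why agent-to-agent traffic is the workload most often cited for the premium speed tier.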
China’s open ecosystem has closed the model quality gap faster than most observers expected.
The token consumption numbers support that: ByteDance’s daily API call volumes are approaching Google’s first-party figures, and models from MiniMax and StepFun (Jieyue Xingchen) are appearing at the top of consumption rankings on platforms like OpenRouter. The infrastructure gap on the latency dimension is a different kind of problem. Without domestic access to low-latency chip architectures and SRAM-optimized hardware, Chinese inference providers have no technical path to the $45 to $150 per million token pricing tier, regardless of how competitive the underlying models become.


