IOSG: When Inference Becomes a Scarce Resource, Who Captures the Value?

Bitsfull, 2026/06/09 14:38

Summary:

The ultimate winner will not be the one with the most GPUs.


The hole David Cahn identified in 2023 was never filled on the training side. It was filled on the inference side, and the market only started pricing it in over the past few weeks.


As Nvidia restructures its financial reporting around "service tokens" and Cerebras goes public 20x oversubscribed, the bottleneck debate is settled, and the real question becomes: when inference becomes a scarce resource, where in the compute stack will the value settle?


Following the GPU's Path: From a $200 Billion Question to a $600 Billion Question


In 2023, Sequoia's David Cahn posed the question hanging over the entire AI buildout: the "$200 billion question." For every dollar spent on a GPU, about another dollar is spent powering it in the data center, so each year's GPU CapEx implied that these chips must eventually generate around $200 billion in revenue to recoup that capital.


Even with very generous assumptions about AI revenues, he found a hole of roughly $125 billion between "investment" and "actual end-customer payment." The concern was straightforward: GPUs were being overbuilt ahead of real demand.


A year later, instead of narrowing, the gap had widened. In 2024, as hyperscaler CapEx ballooned, Cahn restated it as the "$600 billion question." The bearish logic converged into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.


Both articles essentially ask the same question: who will fill this hole? The answer was never in the ledger on the training side. It was on the inference side, and the market only started pricing it in over the past few weeks.


Cerebras IPO and the Inference Squeeze


Cerebras went public on Thursday. The IPO was 20x oversubscribed and priced at close to double the level set the day before. The demand came not from bets on the "next Nvidia killer" but from a simpler realization: the real bottleneck in AI is inference, not training.


Cerebras' flagship capability is a chip architecture that enables extremely fast inference. Not training, but inference. This is what has Wall Street excited. The inference market is recurring, expanding with usage. Every time Claude answers a question, every time an agent performs a task, compute power is consumed. Training happens only once, but inference never stops.


J.P. Morgan estimates the inference market to be 10 to 50 times the size of the training market. As machines start executing tasks assigned by other machines, a pattern known as agentic expansion, the demand for inference no longer scales with the number of users but with compute power itself.


Nvidia's Redrawn Landscape: Inference Takes the Spotlight


If Cerebras represents the market's awakening, Nvidia's latest quarterly earnings call is the confirmation from the top of the industry chain. On the call, Jensen Huang made the implicit explicit: AI demand is growing exponentially.


The reason is simple: agentic AI has arrived. Mainstream AI has moved from one-shot inference to reasoning, and now to agents that invoke tools and orchestrate tasks on their own. Huang said, "Tokens are now profitable." In the AI era, compute power equals revenue and profit.


This reshapes the entire industry. Training is a one-time cost to build a model, while inference is the recurring cost to run it, and today the bottleneck is in inference, not in training.


Nvidia has reflected this judgment in its financial reporting. It now discloses based on two platforms, not one: Data Center and Edge Computing. Data Center (approximately $75 billion for the quarter, +92% year-over-year) is further segmented into Hyperscale (about $38 billion, +12% quarter-over-quarter) and ACIE, which stands for AI Cloud and Industry Enterprise (around $37 billion, +31% quarter-over-quarter).


A completely new line is Edge Computing: $6.4 billion, +29% year-over-year, covering agentic AI and physical AI operating at true endpoints such as PCs, workstations, AI-RAN base stations, robots, and cars.


Edge currently accounts for less than 8% of total revenue, but Nvidia has elevated it to be on par with the Data Center as a "second platform." This signal indicates that inference is splitting into two fronts: cloud inference in the data center and endpoint inference at the edge, as AI needs to see, move, and act in the physical world.
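The "less than 8%" figure can be checked against the quarterly numbers quoted above. The sketch below assumes that total revenue is simply the sum of the two reported platforms, which the text implies but does not state outright.

```python
# Quarterly figures quoted in the text, in billions of USD (approximate).
data_center = 75.0  # Data Center platform
edge = 6.4          # Edge Computing platform

# Assumption: total revenue is the sum of the two reported platforms.
total = data_center + edge
edge_share = edge / total

print(f"Edge share of total revenue: {edge_share:.1%}")  # prints 7.9%
```

On these figures, Edge comes in just under the 8% mark, consistent with the claim.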


The roadmap follows the same logic: Vera Rubin, which will start shipping in the third quarter, boasts inference throughput up to 35X that of Blackwell; Huang also cited a new $200 billion TAM for the Vera CPU, which is designed for agentic workloads. Every leading model company is expected to fully transition to it on day one.


As the world's most valuable company reorganizes financial disclosures around the "service token," the bottleneck battle has already been settled. The remainder of this article discusses who captures the value when inference (rather than training) becomes a scarce resource.


Let's first establish the scope. Of these two fronts, this article covers cloud inference: serving API tokens from rented data center GPUs.


Endpoint inference runs on local chips within the device itself (Nvidia's Jetson, RTX, Drive, AI-RAN), completely bypassing the GPU rental and aggregation stack. Treat it here as a force that amplifies the entire inference economy and supports the bottleneck argument, not as the market where Hyperbolic and Venice operate; both sit entirely on the cloud front.


The Squeeze Is Here


Anthropic is the canary in the coal mine. With usage far beyond its pre-provisioned capacity, complaints about Claude being "lobotomized" flooded the web: rate-limited replies, slower inference, and compressed context windows.


The solution was raw compute power: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with 220k+ Nvidia GPUs, 300+ MW, dedicated solely to inference, not training.


This capacity unlocked a series of quota changes, each a signal.


On May 6, Anthropic doubled the five-hour limit for Claude Code, removed peak-hour throttling, and significantly increased Opus' API rate limit. On May 13, the weekly limit for Claude Code was raised by another 50% (until July 13). Then, effective June 15, came the opposite of generosity: agentic and programmatic usage (Agent SDK, headless mode via claude -p, CI pipelines) was carved out of the flat subscription and placed in a separate metered credit pool ($20 to $200 monthly, billed at API prices).


This last step condenses the entire argument into one action: agents consume inference far faster than a flat subscription was designed to absorb, so they must be priced at their inherent "recurring cost."
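The gap between flat pricing and metered pricing is easy to see with a rough calculation. The sketch below is illustrative only: the flat fee, the per-token price, and both usage volumes are hypothetical numbers chosen for the example, not Anthropic's actual prices or usage data.

```python
def metered_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a month's usage if billed at a per-token API price."""
    return tokens / 1_000_000 * usd_per_million_tokens

# All values below are hypothetical, for illustration only.
FLAT_FEE = 100.0           # flat monthly subscription, USD
PRICE_PER_MILLION = 10.0   # blended API price, USD per 1M tokens

usage = {
    "interactive user": 5_000_000,    # tokens per month
    "always-on agent": 500_000_000,   # tokens per month
}

for label, tokens in usage.items():
    cost = metered_cost(tokens, PRICE_PER_MILLION)
    print(f"{label}: ${cost:,.0f} at API price vs ${FLAT_FEE:,.0f} flat")
```

Under these assumptions the interactive user costs less to serve than the flat fee brings in, while the agent costs many times more, which is the kind of imbalance a separate metered pool is meant to correct.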


Training is a one-time capital expenditure. Inference is a recurring operational cost that compounds with every new user, every new agent.


The Stack: Six Layers, One Bottleneck


Every AI application sits on a supply chain that starts at the TSMC fab and ends at an API endpoint:




Most companies own only one layer. Nvidia owns the silicon, CoreWeave owns the bare metal, Together AI owns the inference optimization, OpenRouter owns the model API routing.


Except for one.


Hyperbolic: The Only Company Spanning Three Layers


In June 2025, Hyperbolic launched its on-demand GPU marketplace. Within its first months, its developer count surpassed 200,000, spanning cutting-edge AI labs, search, and large-scale consumer platforms.


What's intriguing is its architecture.


Hyperbolic doesn't own a single GPU. Each card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This may sound like a weakness, but it's actually a moat.


By sitting between GPU suppliers and consumers, Hyperbolic sees real-time data that others can't. It knows who is buying what GPU at what price and when. It sees oversupply before it's public, and demand spikes before they hit the market.


Today, the moat itself is this multi-cloud aggregation. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized unified pool, allowing developers to rent the cheapest available GPUs anywhere without negotiating with each operator or managing a slew of accounts.
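The core of such an aggregation layer, picking the cheapest available capacity from a unified pool, can be sketched in a few lines. This is a minimal illustration rather than Hyperbolic's actual routing logic: the `Offer` type, the provider names, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Offer:
    """One listing in the pooled marketplace (hypothetical schema)."""
    provider: str          # placeholder cloud name
    gpu_model: str
    price_per_hour: float  # USD per GPU-hour
    gpus_available: int

def cheapest_offer(offers: Sequence[Offer], gpu_model: str,
                   gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can satisfy the request, or None."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_available >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.price_per_hour, default=None)

# Hypothetical pool stitched together from several independent clouds.
pool = [
    Offer("cloud-a", "H100", 2.90, 64),
    Offer("cloud-b", "H100", 2.40, 8),
    Offer("cloud-c", "H100", 2.60, 32),
    Offer("cloud-c", "A100", 1.20, 128),
]

small_job = cheapest_offer(pool, "H100", gpus_needed=8)
large_job = cheapest_offer(pool, "H100", gpus_needed=16)
print(small_job.provider, large_job.provider)
```

The small job lands on the cheapest listing, while the larger job skips it because that provider lacks capacity. A real router would also have to weigh factors the sketch ignores, such as interconnect, region, and reliability.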


The more clouds it connects to, the deeper the liquidity and the richer the pricing data. Looking ahead, the team is exploring how to use this data to model the GPU price curve and eventually deploy proprietary capital to smooth supply and demand, acting as a market maker for physical compute. That goal is still in its early stages; what is compounding today is the aggregation layer.


This is the Flywheel:


1. Connect to more clouds → More aggregated supply


2. More supply → Deeper market with real-time pricing data


3. Better data → Smarter routing now, pricing model in the long term


4. Better liquidity and pricing → More developers → More clouds looking to connect


No other company is attempting this. Hyperbolic is the only company spanning the GPU rental layer, deployment layer, and model API layer simultaneously.


Venice: The Mirror


Venice is the clearest manifestation of the inference economy at the application layer and serves as a useful contrast to where Hyperbolic is positioned.


It is a privacy-first inference application: a set of OpenAI-compatible APIs, coupled with consumer-oriented subscriptions (Free / Pro / Pro+ / Max), routing requests to about 75 models, of which about two-thirds are open-source or self-hosted models (Llama, Mistral, Qwen, DeepSeek), and the rest are anonymously passed through to closed-source cutting-edge models.


The key point is that Venice has no meaningful computing power of its own. It rents from undisclosed GPU partners and confidential computing suppliers (NEAR AI Cloud, Phala) and pays frontier labs for pass-through, so its true cost of revenue is inference compute, not SaaS hosting.


What Venice truly sells is privacy. "Privacy" here does not turn public computing power into private property; it wraps commoditized inference in a layer of assurance: no data retention, no training on user data, anonymized requests, and part of the workload even running in a TEE (trusted execution environment), so that the operator cannot see plaintext.


The underlying computing power is a commodity; what sells at a premium is this layer of privacy packaging. The assurance is also tiered rather than uniform: for open-source models running on self-controlled or TEE GPUs, nearly end-to-end confidential computing can be achieved; for anonymous pass-through to closed-source models such as Claude and GPT, privacy means only de-identification, and the frontier lab still processes your original prompt. The strongest privacy therefore covers only the open-source part, while the frontier-model part is "anonymous" rather than "truly confidential."


Venice's gross profit = subscription price - inference cost paid to its compute suppliers. The premium it can charge over the bare API price rests almost entirely on this privacy layer, which is why it operates on thin margins and is constrained by frontier pass-through pricing.
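The margin relationship can be made concrete with a small sketch. The subscription price and inference costs below are hypothetical, chosen only to show how sensitive a thin margin is to the cost of purchased compute; they are not Venice's actual figures.

```python
def gross_profit(subscription_price: float, inference_cost: float) -> float:
    """Gross profit per subscriber: price charged minus compute purchased."""
    return subscription_price - inference_cost

def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross profit as a fraction of the subscription price."""
    return gross_profit(subscription_price, inference_cost) / subscription_price

# Hypothetical per-subscriber monthly figures, in USD.
price = 20.0
base_cost = 16.0                 # inference compute bought upstream
higher_cost = base_cost * 1.10   # upstream pass-through price rises 10%

print(f"margin at base cost:  {gross_margin(price, base_cost):.0%}")
print(f"margin after +10%:    {gross_margin(price, higher_cost):.0%}")
```

In this example a 10% rise in upstream compute cost cuts the margin from 20% to 12%, which illustrates why a business that resells purchased inference has its economics set largely by its suppliers.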


The token design encapsulates this portion of the inference demand. Venice runs on two tokens: VVV (staking and platform access) and DIEM, the latter being an inference credit, with each DIEM roughly equivalent to $1 worth of compute per day.


A paid subscription triggers a programmatic buyback and burn of VVV (Pro / Pro+ / Max approximately $2 / $5 / $10 respectively), with emissions decreasing on a fixed schedule: 6M → 5M → 4M VVV monthly, and decreasing to 3M on July 1.


The buyback is real but discretionary and still modest: approximately $103,000 was burned in each of April and May, with June slowly climbing towards around $110,000, well below the $200,000 per month threshold.


The fundamentals are healthier than the headline suggests. The publicly circulated "$70 million ARR" figure is almost certainly subscription renewals mistaken for net new ACV; a defensible observable range is closer to $6 million to $15 million ARR.


Beneath this, traction is real: around 136,000 wallet addresses, approximately 9.9 million website visits per month (about 330,000 visits per day), with new Pro subscriptions hovering around 1,400 per day. It's a real business, but a low-margin business, with its economics constrained by the compute it's purchasing.


This is precisely why Hyperbolic sits one layer above it. If Venice is the gas station, Hyperbolic is the refinery. Venice buys compute from the same constrained supply everyone relies on; Hyperbolic aggregates, standardizes, and sells that fragmented supply to Venice and all similar players.


As inference demand grows, value accrues not only to the applications that consume compute but also to the layer that aggregates and routes it, capturing the cost of revenue those applications pay.


Why It Matters Now


Nvidia has restructured its financial reporting around the "service token." Cerebras' IPO shows that the market now recognizes inference as the bottleneck. Anthropic's scramble for capacity proves the squeeze is real. Agentic and physical AI will amplify demand by several orders of magnitude, across both the cloud and the edge.


This also closes the loop on the "$600 billion question" from another angle. Cahn's bearish logic, overbuilding followed by oversupply, is likely to be validated in the end.


But oversupply is precisely the optimal market for asset-light aggregation: when GPU prices fall and supply is fragmented across dozens of clouds, the player that holds no hardware and routes each workload to the cheapest available card earns the spread, while operators holding depreciating GPUs take the losses. Hyperbolic is long oversupply, not short.


The ultimate winning company will not be the one with the most GPUs, but the one that can tell you which GPUs are available where, at what price, and route each workload to where it can run at the lowest cost.


Hyperbolic is building such a company: it owns no GPUs, is pure software, spans three layers, and is built to be the aggregation layer for inference compute.


