After AI Devours Everything, What Remains Untrainable?

Bitsfull, 2026/06/11 12:54

Summary:

Trust, Permission, Responsibility, and Industry Judgment


Editor's Note: As AI capabilities continue to advance rapidly, the investment community is seeing a new wave of pessimism: as models become stronger, every application company will eventually be absorbed by the model and compute layer, by the likes of Anthropic, OpenAI, and Nvidia, leaving the market with only frontier models, compute, and a few key pieces of infrastructure. However, Sarah Guo believes that this assessment is only half correct. The "thin wrapper" applications will indeed be absorbed, and any task that can be benchmarked, trained on public data, and validated at low cost will gradually be commoditized.


The real question is: after AI consumes everything trainable, what remains untrainable?


The answer in this article lies in the value that exists within real organizations and cannot easily be replicated from outside: proprietary enterprise data, complex workflows, user trust, system permissions, industry judgment, compliance responsibilities, and the experience accumulated through long-term operations. Models can become smarter, but they cannot automatically integrate into a bank's production system; they can generate medical answers, but they cannot directly win a doctor's trust or enter a hospital's decision-making process; they can draft legal texts, but they cannot assume the responsibility of a senior attorney or arbitrarily define what qualifies as competent legal work.


Therefore, the AI companies with true moats in the future will not be those that are simply smarter than generic models, but those that go deep into a specific industry and take on the difficult yet crucial "translation" work: organizing a client's private reality, tools, processes, and judgment criteria into a system that models can act upon, and gradually defining what constitutes a "good outcome" through long-term service. The stronger AI becomes, the more it devalues measurable and replicable tasks, and the more it highlights the "untrainable" elements that carry history, relationships, permissions, and professional judgment. This is the genuine value that may remain after the models have consumed everything else.


Below is the original text:


In mid-2026, the investor's version of "AI schizophrenia" is a sense of despair that there is nothing left to invest in: it seems we should just give all our money to Anthropic and Nvidia and go to sleep. But I have never felt this way. Through several model iterations, I have always believed that the models are smarter than me; I would gladly buy into Anthropic and Nvidia at market prices; my smartest friends are also quite confident that model self-improvement will soon truly take off. And still I do not feel this despair.


This kind of despair is not foolish. Its logic is this: if the model keeps getting stronger at everything, then every company built on top of the model is just a thin shell waiting to be absorbed by it, and the only value that can ultimately be retained is compute and frontier model weights.


Take software, for example, the domain on which this despair leans most heavily. When Devin was released in 2024, it could handle only 13% of the tasks in a standard software engineering benchmark, and the market largely overlooked it. A year and a half later, the strongest Agents were scoring above 80% and had begun handling real work inside Goldman Sachs and the U.S. Army. Almost everyone drew the same wrong conclusion: the model had swallowed software engineering.


However, now that the models have consumed the most easily measured part of software engineering, we are rediscovering what many teams have long known: engineering has always resisted measurement, and the part that is easiest to measure is not the only part that matters.


MIT's Mert Demirer and his collaborators have finally quantified this: across more than 100,000 developers, the latest generation of coding Agents increased the amount of code written by about 180%, but the amount of code actually delivered to production rose by only about 30%. Writing code became cheaper, but the remaining steps still pass through humans, and those steps are critical. Of course, the net impact is still impressive.
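
To make the gap between those two figures concrete, here is a minimal arithmetic sketch. It assumes, which the text does not state, that both increases are measured against the same pre-Agent baseline and in comparable units of code; the variable names are illustrative only.

```python
# Figures quoted in the paragraph above.
written_increase = 1.80   # about +180% code written
shipped_increase = 0.30   # about +30% code delivered to production

# Assumption: both increases are relative to the same pre-Agent baseline.
written_multiplier = 1 + written_increase   # 2.8x as much code written
shipped_multiplier = 1 + shipped_increase   # 1.3x as much code shipped

# Share of written code that reaches production, relative to the baseline share.
relative_ship_rate = shipped_multiplier / written_multiplier

print(f"written: {written_multiplier:.1f}x, shipped: {shipped_multiplier:.1f}x")
print(f"ship rate relative to baseline: {relative_ship_rate:.2f}")  # 0.46
```

Under that assumption, a line of written code is a bit less than half as likely to reach production as before, which is one way to read "the remaining steps still pass through humans."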


A benchmark is something you can measure, and anything that can be measured can be used for training. That is why coding Agents matured first: compilers are free validators, and so are test suites. When an answer can check itself at almost zero cost, you can keep polishing against that signal until you break through.
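
A minimal sketch of what a "free validator" looks like in practice: a test suite scores candidate solutions automatically, and the score can be used to select or train. The task, the test cases, and the two stand-in candidates below are hypothetical; no real model or training loop is involved.

```python
from typing import Callable, Iterable

TestCase = tuple[tuple, object]  # (arguments, expected result)


def pass_rate(candidate: Callable, tests: Iterable[TestCase]) -> float:
    """Fraction of test cases a candidate passes; exceptions count as failures."""
    cases = list(tests)
    if not cases:
        raise ValueError("need at least one test case")
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash is simply a failed check
    return passed / len(cases)


# Hypothetical task: absolute value, with two stand-ins for model outputs.
tests = [((3,), 3), ((-4,), 4), ((0,), 0)]
candidates = {
    "buggy": lambda x: x,
    "correct": lambda x: x if x >= 0 else -x,
}

# The check costs almost nothing, so it can be run on every candidate.
scores = {name: pass_rate(fn, tests) for name, fn in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The check is cheap and unambiguous, which is exactly what makes it usable as a training signal. It says nothing about whether a change is right for a particular ten-year-old codebase.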


But passing a test does not mean that a change is correct for a codebase that has been running for ten years. That module may exist for three reasons that no one wrote down; the deployment pipeline may be barely held together by a cron job that no one is willing to admit they wrote.


This correctness cannot be read from a leaderboard, or read directly from anything at all. You can only let such a complex system run in the real world for long enough to learn whether it truly works. And a smarter model does not make the real world run faster. No one is reassured by a green check mark on the unit tests of a system as large as Google. You trust it because it has withstood years of real-world load.


This kind of correctness is not only private; it is also a slowly formed moat, one whose time capital cannot simply compress. Even optimists acknowledge that this clock cannot be skipped. Noam Brown, a pioneer of OpenAI's reasoning models, recently wrote that the only reliable way to assess an Agent's performance over a one-year period may be to actually let it run for a year.


As Gabe Pereyra put it, true automation is not just about making the model stronger. It is about the product, the model, the workflow, and the company organization all evolving together, with three out of these four moving at the organization's pace.


Moving people is the part that no benchmark touches: convincing a skeptical partner to change how she does business, or keeping a team cohesive through a rebuild. That is why we weigh people skills as heavily as analytical skills when hiring a CEO. Making the model smarter does not shift this weighting.


The feedback loop here is fuzzy, time spans are measured in years, and trust resides with a specific person. Every company I know has already put frontier model tools in the hands of every engineer, but no company's engineering organization has changed at anywhere near the speed of model progress. Tool adoption took a quarter, and what a magical quarter of token growth it was! But true rebuilds take years.


The visible work is leaving. Valuable work is structurally unreadable: anything you can put on a leaderboard can be turned into a training set, so anything measurable is on the path to commoditization. This process takes time and is never fully completed, but the direction never reverses.


To borrow the words of my friend Matt MacInnis from Rippling, in money terms: a token that is only used to answer a generic question is nearly worthless, because anyone's model can answer it; but a token that reasons over your company's data is much more valuable, because it is doing what you actually want, not just generating a plausible-sounding answer.


The readable work will be eaten from both ends.


From below, tasks will saturate: once a job can be checked at low cost, the buyer no longer cares which model completed it and starts asking how much it costs. The work then falls to the cheapest open-source or distilled model of the week. Wherever margin pressure can act, it eventually will.


From above, labs are trying to have the models eat their own scaffolding. Routing among fetches, cheap calls, and expensive calls; tool use; even reasoning strategies: all the apparatus once wrapped around the model is being pulled into the model weights until the "shell" itself becomes the model. This is absorption at the boundary.


Margin pressure also works from another angle: a generalist agent must be ready to handle anything at any time, so it is costly, whereas a focused application can tune a workflow to perfection so that it consumes only a fraction of the tokens. And unlike the labs selling these tokens, the app companies can keep the difference.


Therefore, we can pose two questions to any type of work: Is its correctness proprietary, expensive, a truth that only exists within a particular company's data? Is it isolated within a system that is inaccessible to outsiders? Combining these questions with the degree of task saturation yields a 2×2 matrix.


Fully saturated work with publicly available answers is the realm of commoditized tokens, where the open-source model prevails. Frontier work with publicly available answers, such as coding benchmarks, is the domain of the labs, because evaluation is free and merely owning it holds no value.


The real prize lies in the last corner, the "untrainable" corner: frontier work whose correctness exists only within a private environment. You can see this in the inference clouds serving AI-native pioneers: the vast majority of tokens are generated by custom models, not by generic open-source models.
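
The grid described above can be sketched as a small lookup. This is an illustrative reading of the text, not the author's own formulation: the two boolean inputs, the enum names, and the example labels are assumptions, and the fourth corner (saturated work with private correctness) is not characterized in the article, so it is marked as such.

```python
from enum import Enum


class Quadrant(Enum):
    COMMODITY_TOKENS = "saturated work, public answers: cheapest open-source model wins"
    LAB_TERRITORY = "frontier work, public answers: the labs' domain"
    UNTRAINABLE = "frontier work, private correctness: the 'untrainable' corner"
    NOT_DESCRIBED = "saturated work, private correctness: not characterized in the text"


def classify(saturated: bool, private_correctness: bool) -> Quadrant:
    """Place a type of work on the 2x2 grid (task saturation x where correctness lives)."""
    if private_correctness:
        return Quadrant.NOT_DESCRIBED if saturated else Quadrant.UNTRAINABLE
    return Quadrant.COMMODITY_TOKENS if saturated else Quadrant.LAB_TERRITORY


# Hypothetical examples, labeled by hand for illustration only.
examples = {
    "public coding benchmark": classify(saturated=False, private_correctness=False),
    "change to a bank's production system": classify(saturated=False, private_correctness=True),
}
for name, quadrant in examples.items():
    print(f"{name}: {quadrant.name}")
```

In practice neither axis is a clean boolean; saturation is a matter of degree, and the inputs here would themselves require judgment.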


The wall to this final corner varies in height. A developer's toy code repository is portable and standardized, making it easy to dig into. However, a bank's production system is neither portable nor standardized. You won't gain root access by being clever on SWE-Bench Verified.


Capability will swallow many things, but a better model will not turn a private ground truth into a public one. It won't hold a license, sign for liability, or possess a company's documents; when the answer is wrong, it also can't be the one sued. The bottleneck here is not intelligence but permission and responsibility. You can imagine a model much smarter than anyone, but it still must be allowed in, and someone must sign for what it does.


That door has a lock and a latch.


The lock is the environment: only after gaining trust within a system, undergoing security reviews, completing integration, and signing a contract with outcome responsibility can you verify if AI has truly been useful.


The latch is the user. Today, most American doctors open OpenEvidence every day, and that cannot be bought with any amount of compute. A lab could train a perfect medical model tomorrow, but it still could not enter a doctor's practice or UCSF's decision-making process. Trust is built slowly through relationships and user consent, not by gradient descent.


This is also the work of application companies. An app's ability to occupy a place in the "untrainable" corner relies on unglamorous tasks: organizing a company's private reality for the model to act on; providing action tools to the model; and working with customers to change how their workforce actually operates.


A company that can accomplish this kind of "translation" is hard to replicate, and the translation never ends. Integration and maintenance continue for as long as the customer relationship does. The winners will be those who place domain-expert engineers and tools at the customer's side.


For example, at a top-tier established law firm, the number of M&A transactions alone approaches a thousand each year. You cannot have hundreds of associates individually downloading client files to their desktops and handing them to a generic Agent for review. Confidentiality alone rules this out, not to mention a host of other issues. Even if it could be done, what you would learn would be fragmentary: one associate correcting a bit at a time, with no one seeing how an entire transaction flows.


The truly important signals exist at the transaction level. Each transaction has its own shape: for M&A, it's an NDA, a term sheet, due diligence, a purchase agreement, ancillary documents, a closing checklist; for intellectual property litigation, it's motions, discovery, prior art, more motions. Each practice area has its own structure, and lawyers and tools cannot be swapped arbitrarily.


And the real problem this law firm needs to solve sits at a higher level: how to run every practice area at once, much like a senior partner managing hundreds of matters in parallel while bringing in new engagements and developing associates. Transforming a company like this is not a single problem you can write a task evaluation for. It requires an operator to handle it like playing "data baseball": the interim goals are extremely fuzzy, feedback is incomplete, cycles are very long, and the environment itself doesn't stand still.


Unfortunately, unreadable value is also difficult to sell, for the same reason it is hard to commoditize: a company cannot judge from the outside whether AI can actually transform its operations the way the benchmarks suggest. Therefore, the strongest companies will stop trying to prove themselves from outside; they will first get inside the customer and then price based on outcomes.


Sierra charges only when its Agent solves a customer's issue; if the problem is handed off to a human, it doesn't charge. The price itself thus becomes the evaluation mechanism, and this works because Sierra has the power to define what counts as "resolved." Cognition's Devin has done the same in the software realm by introducing a "performance guarantee." Only once you are trusted inside a system are you in a position to guarantee outcomes this way.
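
A minimal sketch of the outcome-based billing rule described above, in which a conversation is billable only when the Agent resolves it without a human handoff. The record fields, the function names, and the price are hypothetical and are not taken from Sierra's actual product or pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    resolved_by_agent: bool    # per the vendor's own definition of "resolved"
    handed_off_to_human: bool


def billable_count(conversations: list[Conversation]) -> int:
    """Count conversations the Agent resolved without handing off to a human."""
    return sum(
        1 for c in conversations
        if c.resolved_by_agent and not c.handed_off_to_human
    )


def invoice_cents(conversations: list[Conversation], price_per_resolution_cents: int) -> int:
    """Total charge in integer cents; unresolved or handed-off conversations cost nothing."""
    return billable_count(conversations) * price_per_resolution_cents


# Hypothetical log: two resolutions and one handoff, at an assumed 150 cents each.
log = [
    Conversation(resolved_by_agent=True, handed_off_to_human=False),
    Conversation(resolved_by_agent=False, handed_off_to_human=True),
    Conversation(resolved_by_agent=True, handed_off_to_human=False),
]
total = invoice_cents(log, price_per_resolution_cents=150)
print(total)  # 300
```

Everything interesting hides in who sets `resolved_by_agent`: the invoice is only as credible as the definition of "resolved" behind it.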


Even token serving, the level everyone likes to call pure commodity, does not behave like a commodity. The best AI-native companies concentrate their serving with one or two vendors, such as Baseten or Fireworks, because while the cost per token tends toward commoditization, reliability under real load and steady access to scarce compute do not. Where you run inference and which models you use are two different choices. The only part of inference that truly resembles a commodity is price.


One common rebuttal is: the lab is your supplier, so why wouldn't it use predatory pricing on its own first-party products to undercut you and put you out of business? Or simply revoke your API access and capture the market for itself? This is the real version of that despair. But it only holds when the model layer is a single-player game.


Clearly, that is not the case. The model layer is more like a three-and-a-half-player deathmatch, with a bunch of international players lagging about six months behind in training progress, along with a development league that is five times the size it was last year. Customers want competition between their suppliers, while the lab wants market share, preferring to outcompete rather than kill off any specific application.


You can see this in markets where the labs compete head-on. In consumer chat, the best model has never simply won the entire market. ChatGPT has consistently stayed ahead over the years in real competition; the share it is now losing is going to Gemini, and the reason is distribution through Android and search, not a better model. Anthropic is currently considered to have the best model by prediction markets and general online sentiment, yet it is hardly a major player in consumer chat and has instead built its business in enterprise and coding.


If a superior model cannot win over competitors' users in the most core applications, it certainly won't easily swallow up a hospital's medical record system or a bank's liability framework through integration. Today, the public chooses products not just based on coding prowess. If the cutting-edge model layer remains crowded, there is value in the application layer above it.


If a job cannot be scored externally, then someone internally must decide what constitutes a good answer. And that decision is the game itself. Write down enough of these decisions and they become benchmarks. Harvey has released benchmarks for the legal domain; Sierra has released benchmarks for voice Agents. You have the right to define what "good" means in a domain because that domain is already using you. And these companies have earned this right through the arduous struggle of real-world adoption.


The assessment that truly dictates where money flows is private and formed company by company: what this company is willing to accept as good work on these matters. And this is far from complete, as the depth of law far surpasses any public benchmark. OpenEvidence is crystallizing what constitutes a safe clinical answer.


All of this is not truly about "measurement," but about what is true and what is a good judgment. These judgments are written down until they become standards that everyone else must accept as measures. No matter how smart the foundational model lab becomes, it cannot conjure up these standards out of thin air because that kind of authority only exists within the domain.


This kind of authority tends to rest where it already existed. The seasoned lawyer writes the legal precedent. The doctor defines what constitutes a safe clinical answer. What “resolved” means is then determined by the company that already has the client relationship.


The absorption frontier will continue to rise as we learn to measure more work, and the measurable will be eaten. The untrainable ground will keep shrinking under whoever stands on it, so you can’t find a defensible position and stop. You must keep moving to where the scoring hasn’t reached and keep reassessing your risks.


In a narrow task, with your proprietary data and your own evaluation system, you can train to the frontier and beat a general model in key scenarios; this specialized model becomes part of the moat. On the other hand, if you’re competing on general model capability, it’s a capital war, and you will lose to those with the most compute. This is also the trap that companies with shallow access and highly readable tasks are most susceptible to.


When a company decides, for survival, to train on a vast generic task to a level beyond the frontier models, the outcome usually seems to be dictated by the scale of a data center. The endgame is often not an independent champion but acquisition by a player with sufficient compute.


All of the above is defense. The harder part is offense: first deciding what to build. This is what I have been looking for all year, and I have probably found it only three times. Models are useless in this. They do what you point them at; they cannot tell you what is worth pointing at. You cannot benchmark this, so you cannot train it.


This is also why the existing giants won’t take everything: they’ll hold their already established ground, and the next thing comes from someone who finds utility before others. Perhaps intent is an input scarcer than compute.


Half of this despair is right. The thin outer layer is indeed being absorbed, and much of what looks like a company today is indeed just that outer layer. But it is wrong about what will remain after absorption. The mechanism is clear; the endpoint is not.


My bet is on this path: intelligence will keep getting cheaper, and value will continue to slide to the few places that models cannot reach. The untrainable, with its history, holds value.


So, step into one of those areas, do the unglamorous translation work, and then start writing down what “good” looks like there. Because someone will always do that. The most frequently cited benchmark score this year is really a map of terrain that will soon be worthless, and a notice: a notice to certain people that they are about to lose the right to define what “good” is.




