a16z: The Future of Visual AI Is Not Images, But Code

Bitsfull, 2026/06/03 17:18

Summary:

From Pixel to Code


Editor's Note: Over the past few years, the competition in visual AI has largely revolved around one question: whose generated images are more realistic, whose generated videos are smoother. Diffusion models have turned text prompts into images, videos, and realistic scenes, leading the outside world to judge model capabilities by asking "Does it look like the real thing?" and "Is it beautiful?"


However, this article from a16z argues that in the next stage of visual AI, the focus may be not only on generating prettier pixels but on generating the code artifacts behind those pixels: structured files that can be further edited, tested, and delivered.


This distinction may seem technical but ultimately determines whether AI can truly enter the production workflow. What designers need is not just a UI screenshot but HTML/CSS, React components, layers, and deliverable files; what animators need is not just a video clip but keyframes, timing curves, and adjustable motion parameters; what 3D artists need is not just a rendering but geometric structures, materials, lighting, cameras, and scene hierarchy.


Therefore, the article divides visual generation into two paths: pixel-native generation (directly generating images or videos) is suitable for realism, atmosphere, and exploration; code-native generation (generating SVG, Lottie, Blender scripts, USD scenes, etc.) is more suitable for editing, iteration, and production. The real significance of the latter is that it can form a "code → render → inspect → modify" loop. The model is no longer just resampling over and over; it is debugging a verifiable visual program.


This is also why the author is particularly bullish on 3D: a rendering of a chair is not a chair; it is merely an image of a chair. Assets truly usable in games, simulators, or 3D tools must have a stable geometric structure, component hierarchy, materials, and functional constraints: doors must open, drawers must slide, wheels must turn. In other words, the value of future visual AI lies not in "looking like" something but in being reusable.


This article provides a good framework for assessment: the first wave of visual AI solved the generation issue, and the next wave will address the production problem. As visual AI moves from final output to source code, what is truly transformed is not just design tools but the entire visual content production pipeline.


The following is the original text:


In the past few years, visual AI has mostly been judged by its "pixels." The better the final generated image or video looks, the stronger the model seems to be.


This is not surprising. Generative models first turned text prompts into beautiful images, then expanded to video, and then to increasingly realistic worlds. People naturally compared them to Photoshop or a camera.


However, for many visual tasks such as graphic design, UI design, or 3D modeling, the representation users really need in the end is not just the pixels of the final output. They need a product that can be iterated on continuously in response to feedback and new ideas. Designers need not only a mockup but also layers, components, and deliverable files; animators need not only a video but also timing curves, keyframes, and editable motion paths; 3D artists need not only a rendered image but also geometric structures, materials, lighting, cameras, and scene setups.


Today's most interesting visual AI tools no longer try to generate the final output directly. They are starting to generate the source code behind it. This shift unlocks editability, iteration, and feedback loops that pixel-native models struggle to match.


Two Stacks of Visual Generation


We can understand visual generation in two main ways.


The first is pixel-native generation. These systems typically directly generate images or videos, often in latent space. They excel at texture, ambiance, lighting, and realism. If the goal is to generate a cinematic shot, a set of beautiful mood boards, or a photo-realistic image, diffusion models are still the mainstream approach.


The second is code-native generation. These systems generate a representation that is then executed or rendered by another engine. The models do not directly generate final pixels but instead generate a program that can generate pixels.


This program could be an SVG file, an HTML/CSS layout, a React component, a Lottie JSON file, a Blender script, a USD scene graph, a shader, or a game engine scene. The final visual output is still in pixels, but the real "ground truth" is a structured representation.


This difference is crucial because production workflows care deeply about "what happens after generation." While a generated image can serve as output, a generated visual program can be used as a product: it can be edited, reused, improved, versioned; it can be integrated into software stacks and validated against constraints; it can be rendered repeatedly under different conditions and handed off between designers, engineers, and agents.


I believe a significant shift is underway: for a subset of visual tasks, we will learn to reframe visual generation as a coding task, and to improve results efficiently by solving a well-defined, verifiable coding problem.


Code as a Powerful Medium for Solving Vision Problems


The simplest way to understand the value of visual code generation is to look at what happens after the first draft.


Imagine a model generates a logo. If the output is a raster image and one of the curves is off, the user must mask the region and inpaint it, regenerate the whole image, or redraw it by hand. But if the output is SVG, the user can directly edit paths, primitive shapes, gradients, strokes, or text elements. This is already how designers create logos on Quiver.
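To make the contrast concrete, here is a minimal sketch of a source-level fix to a vector logo, using only Python's standard library. The SVG content, the element ids, and the `patch_attribute` helper are illustrative assumptions, not output from Quiver or any specific tool.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep the default namespace when serializing

# Hypothetical model output: a logo made of one curve and one dot.
LOGO_SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <path id="swoosh" d="M 10 80 Q 50 10 90 80" fill="none" stroke="#111" stroke-width="4"/>
  <circle id="dot" cx="50" cy="85" r="5" fill="#e4002b"/>
</svg>"""


def patch_attribute(svg_text: str, element_id: str, attr: str, value: str) -> str:
    """Return a copy of svg_text with one attribute of one element replaced."""
    root = ET.fromstring(svg_text)
    for element in root.iter():
        if element.get("id") == element_id:
            element.set(attr, value)
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"no element with id={element_id!r}")


# Flatten the curve by moving only its control point; the dot is untouched.
fixed_svg = patch_attribute(LOGO_SVG, "swoosh", "d", "M 10 80 Q 50 30 90 80")
```

The edit touches one attribute of one element. With a raster logo, the equivalent change would mean regenerating or repainting pixels.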



In UI design, if the output is a screenshot, it is more of an inspirational reference. But if the output is HTML/CSS or React, designers can inspect the DOM, swap in real components, test responsive states, check accessibility, and integrate it into an application.
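As one small illustration of what "inspectable" means, the sketch below runs a single accessibility check over generated markup: it lists images that lack alt text. The markup and the `images_missing_alt` helper are assumptions made for this example; a real audit would cover far more rules.

```python
from html.parser import HTMLParser


class MissingAltChecker(HTMLParser):
    """Collect the src of every <img> that has no (or an empty) alt attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.missing: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and not attributes.get("alt"):
            self.missing.append(attributes.get("src") or "<no src>")


def images_missing_alt(html: str) -> list[str]:
    checker = MissingAltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing


# Hypothetical model output for a small page.
GENERATED_UI = """
<main>
  <img src="hero.png" alt="Product dashboard">
  <img src="avatar.png">
  <button>Sign up</button>
</main>
"""

problems = images_missing_alt(GENERATED_UI)
```

A check like this has no counterpart for a screenshot, because the screenshot carries no elements to query.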



This is also why visual code generation is particularly suitable for test-time compute. In pixel-native generation, increasing inference computation usually means sampling more outputs: generate 20 images, pick the best one, maybe try again. This is certainly useful, but each attempt is essentially a new roll of the dice. The model can respond to feedback, but this feedback is usually holistic and not precise enough.
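The "sample more, pick the best" pattern can be sketched in a few lines. Everything here is a toy stand-in: `sample_image` returns a random number in place of a generated image, and `score` plays the role of a reward model or a human picking a favorite.

```python
import random

TARGET = 0.70  # hypothetical ideal value of some visual property


def sample_image(rng: random.Random) -> float:
    """Stand-in for one independent generation (a fresh roll of the dice)."""
    return rng.random()


def score(candidate: float) -> float:
    """Stand-in for a reward model: higher is better, 0 is a perfect match."""
    return -abs(candidate - TARGET)


def best_of_n(n: int, seed: int = 0) -> float:
    """Spend more inference compute by drawing n samples and keeping the best."""
    rng = random.Random(seed)
    candidates = [sample_image(rng) for _ in range(n)]
    return max(candidates, key=score)


one_shot = best_of_n(1)
twenty_shots = best_of_n(20)
```

More samples can only help the best score, but each sample is drawn independently: nothing learned from candidate 7 shapes candidate 8.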


Technically, diffusion models can also benefit from test-time compute. For example, "Inference-time Scaling of Diffusion Models through Classical Search" shows that search during inference can improve the performance of diffusion models in planning, reinforcement learning, and image generation. However, the loop works differently here. In diffusion models, the system typically searches over latent trajectories or final samples. Reward signals can tell the model that one output is better than another, but they cannot clearly map feedback to a specific modification at the source code level.


Code-native generation creates a more precise loop: code → render → inspect → modify.


The model generates the artifact, renders it, observes where the issue lies, and then patches the source file. If the spacing is off, modify the CSS; if the logo curve is skewed, edit the SVG path; if the animation tempo is too slow, adjust the timing parameters. The key is that each iteration improves the underlying artifact, not just the rendered output. This is also why visual code generation inherently benefits from more token generation and test-time compute. The model debugs the visual program in a closed-loop, verifiable environment, rather than just sampling more images.
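The loop can be sketched end to end with a toy artifact. In this minimal example the "source file" is a circle description, the renderer rasterizes it to a pixel grid, the inspection step measures how far the drawn shape sits from the canvas center, and the modification patches the source coordinates. All names are illustrative, and the deterministic `modify` function stands in for what would be a model-proposed patch in a real system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CircleArtifact:
    """Toy source artifact: one circle on a square canvas."""
    cx: float
    cy: float
    r: float
    size: int = 40


def render(a: CircleArtifact) -> list[list[int]]:
    """Rasterize the artifact into a binary pixel grid."""
    return [
        [1 if (x + 0.5 - a.cx) ** 2 + (y + 0.5 - a.cy) ** 2 <= a.r ** 2 else 0
         for x in range(a.size)]
        for y in range(a.size)
    ]


def inspect_pixels(pixels: list[list[int]]) -> tuple[float, float]:
    """Offset from the drawn shape's centroid to the canvas center.

    Assumes at least one pixel is filled.
    """
    size = len(pixels)
    filled = [(x + 0.5, y + 0.5)
              for y, row in enumerate(pixels)
              for x, value in enumerate(row) if value]
    mean_x = sum(x for x, _ in filled) / len(filled)
    mean_y = sum(y for _, y in filled) / len(filled)
    return size / 2 - mean_x, size / 2 - mean_y


def modify(a: CircleArtifact, offset: tuple[float, float]) -> CircleArtifact:
    """Patch the source, not the pixels (stand-in for a model-written edit)."""
    dx, dy = offset
    return replace(a, cx=a.cx + dx, cy=a.cy + dy)


def refine(a: CircleArtifact, tolerance: float = 0.5, max_steps: int = 10):
    """Run code -> render -> inspect -> modify until the shape is centered."""
    history = [a]
    for _ in range(max_steps):
        dx, dy = inspect_pixels(render(a))
        if abs(dx) <= tolerance and abs(dy) <= tolerance:
            break
        a = modify(a, (dx, dy))
        history.append(a)
    return a, history


final, history = refine(CircleArtifact(cx=12.0, cy=27.0, r=8.0))
```

Each pass through `refine` keeps the previous artifact and changes only what the inspection flagged, which is the property that separates this loop from resampling.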


Code-Based Visual Generation Technology Stack


Behind the examples above is a technology stack consisting of: Coding Model + Symbolic Representation + Renderer or Engine.



The Coding Model is the creator and editor of the artifact. It is responsible for writing HTML, SVG, Lottie JSON, Blender scripts, USD scenes, or custom 3D asset programs.


The Symbolic Representation is the source of truth. This is precisely what makes the artifact editable. A UI has DOM nodes, layout rules, and components; a Lottie animation has layers, vector shapes, timing curves, keyframes, and motion parameters; a 3D asset has geometric structure, materials, joints, constraints, and hierarchy.


The Renderer or Engine then transforms these structures into pixels. The browser renders HTML/CSS, an SVG renderer renders vector graphics, a Lottie player renders animations, Blender or a game engine renders 3D scenes, and a simulator validates whether an asset with joints can truly move or interact.
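A toy version of the last two layers fits in a few lines: a small scene tree as the structured source of truth, and a renderer that walks it to produce a character grid. The `Rect` and `Group` types and the character "pixels" are assumptions made for illustration, not any real engine's scene format.

```python
from dataclasses import dataclass, field


@dataclass
class Rect:
    name: str
    x: int
    y: int
    w: int
    h: int
    glyph: str = "#"


@dataclass
class Group:
    """A node that positions its children relative to its own offset."""
    name: str
    children: list = field(default_factory=list)
    dx: int = 0
    dy: int = 0


def render(node, width: int = 12, height: int = 5) -> list[str]:
    """Walk the scene tree and paint each rectangle onto a character canvas."""
    canvas = [["." for _ in range(width)] for _ in range(height)]

    def draw(n, origin_x: int, origin_y: int) -> None:
        if isinstance(n, Group):
            for child in n.children:
                draw(child, origin_x + n.dx, origin_y + n.dy)
            return
        for row in range(origin_y + n.y, origin_y + n.y + n.h):
            for col in range(origin_x + n.x, origin_x + n.x + n.w):
                if 0 <= row < height and 0 <= col < width:
                    canvas[row][col] = n.glyph

    draw(node, 0, 0)
    return ["".join(row) for row in canvas]


card = Group("card", dx=1, dy=1, children=[
    Rect("avatar", x=0, y=0, w=2, h=2, glyph="A"),
    Rect("title", x=3, y=0, w=5, h=1, glyph="T"),
])

before = render(card)
card.dx = 3  # one source-level edit moves the whole component
after = render(card)
```

Changing one field on the group moves both children together, because the hierarchy lives in the source rather than in the rendered pixels.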


OmniLottie is a good example of why the Symbolic Representation matters. Lottie is a lightweight, JSON-based animation format that represents motion not as a flat video but as editable vector shapes, layers, keyframes, and timing parameters. The focus of the OmniLottie paper is not to build a complete Agent loop but to make Lottie easier for models to generate and edit reliably: it converts the raw Lottie JSON into a compact sequence of commands and parameters. This step matters because Lottie itself is already an editable animation format. Once motion is represented as shapes, layers, timing, and animation parameters, feedback can be mapped back to changes in the source file. If an object moves too slowly, adjust the timing; if a path is wrong, modify the vector; if a shape deforms incorrectly, update the shape sequence.
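The "moves too slowly, adjust the timing" case can be sketched against a heavily simplified Lottie-like document. The field names follow common Lottie conventions ("fr" for frame rate, "ip"/"op" for in and out frames, "ks" for the layer transform, "k" for keyframes, "t" for keyframe time), but real files carry many more fields, and this is not OmniLottie's command representation. The `retime` helper is hypothetical and assumes the animation starts at frame 0.

```python
import copy

# Simplified Lottie-like document: one layer whose position ("p") is
# animated between two keyframes over 60 frames at 30 fps.
ANIMATION = {
    "fr": 30, "ip": 0, "op": 60,
    "layers": [{
        "nm": "ball",
        "ip": 0, "op": 60,
        "ks": {
            "p": {"a": 1, "k": [
                {"t": 0, "s": [0, 100]},
                {"t": 60, "s": [300, 100]},
            ]},
            "o": {"a": 0, "k": 100},  # static opacity, left untouched
        },
    }],
}


def retime(animation: dict, speedup: float) -> dict:
    """Return a copy whose keyframe times and out-points are divided by speedup."""
    out = copy.deepcopy(animation)
    out["op"] = round(out["op"] / speedup)
    for layer in out["layers"]:
        layer["op"] = round(layer["op"] / speedup)
        for prop in layer["ks"].values():
            if isinstance(prop, dict) and prop.get("a") == 1:  # animated property
                for keyframe in prop["k"]:
                    keyframe["t"] = round(keyframe["t"] / speedup)
    return out


faster = retime(ANIMATION, speedup=2.0)
```

The motion path and shapes are unchanged; only the timing fields move, which is the kind of targeted edit a flat video cannot support.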



Correspondingly, this is the stack a coding Agent can use to improve output quality through a test-time compute loop: in each "code → render → inspect → modify" cycle, the model does not generate a fresh sample but uses feedback from the renderer to improve the underlying artifact. It can adjust CSS rules, tweak SVG paths, correct animation timing, or update 3D constraints, then re-render and keep refining.


This gives the loop the potential to converge. In pixel-native generation, each retry often creates a new output. In code-native generation, each retry can improve the source artifact itself. The model is not just sampling more images or videos, but debugging the visual program in a closed-loop, renderable environment.


Market Map: Entry Points Formed Around Runtimes


The visual code generation market is organizing around the "runtime," where the artifact is rendered or executed. In code-native visual generation, the model generates a symbolic artifact, which will be executed in a specific environment: a browser, SVG renderer, Lottie player, Blender, game engine, or simulator.


Each type of runtime forms a different entry point, as each runtime has its own source representation, feedback loop, and production workflow.



Today's most obvious applications are in 2D design, especially UI design and graphic design. However, visual code generation is not limited to design tools. It can emerge wherever a visual artifact has an underlying representation that can be generated, rendered, inspected, and optimized.


Why 3D Is the Next Important Frontier


While product design and 2D design are the most intuitive use cases today, 3D artifacts may benefit the most from this approach of reframing consistency as a coding problem.


A 2D design can sometimes be useful as long as it looks right. A 3D asset cannot. A rendering of a chair is not a chair; it is just an image of a chair. To make this asset truly usable in a game, simulator, or 3D editing tool, it must have a consistent underlying 3D representation, including correct geometry, materials, part hierarchy, and scene context.


This is why 3D is inherently suitable for visual code generation. Its value lies not just in generating something that looks 3D from a certain angle but in creating a consistent 3D structure that holds up in different views, edits, and interactions. This requires an iterative loop: propose an object, render it, inspect if the geometry and parts make sense, and then modify the underlying representation. However, this loop is only effective when the Agent has the right tools and context. Simply running Blender repeatedly until something looks better is not enough. The Agent needs to switch camera perspectives, query scene states, isolate objects, compare against targets, remember previous attempts, and translate visual differences into source-level modifications. It is these capabilities that give test-time compute the opportunity to converge.
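A minimal sketch shows why switching camera perspectives matters. It compares the silhouette size of a candidate asset against a target in three orthographic views; a flat card can match the target from the front and still fail from the side and top. The geometry, view names, and helper functions are assumptions for illustration and do not use Blender's API.

```python
from itertools import product

# Axes kept by each orthographic view (x=0, y=1, z=2).
VIEWS = {"front": (0, 1), "side": (2, 1), "top": (0, 2)}


def box(w: float, h: float, d: float) -> list[tuple[float, float, float]]:
    """Corner vertices of an axis-aligned box centered at the origin."""
    return [(sx * w / 2, sy * h / 2, sz * d / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]


def extent(vertices, view: str) -> tuple[float, float]:
    """Width and height of the silhouette's bounding box in one view."""
    a, b = VIEWS[view]
    us = [v[a] for v in vertices]
    vs = [v[b] for v in vertices]
    return max(us) - min(us), max(vs) - min(vs)


def inconsistent_views(candidate, target, tolerance: float = 0.01) -> list[str]:
    """Names of the views in which candidate and target silhouettes disagree."""
    bad = []
    for view in VIEWS:
        cw, ch = extent(candidate, view)
        tw, th = extent(target, view)
        if abs(cw - tw) > tolerance or abs(ch - th) > tolerance:
            bad.append(view)
    return bad


target_seat = box(0.5, 0.05, 0.5)  # a chair seat: 50 cm square, 5 cm thick
flat_card = box(0.5, 0.05, 0.0)    # same front silhouette, no depth

failing = inconsistent_views(flat_card, target_seat)
```

The list of failing views is feedback an Agent could act on: it names which perspective exposes the problem, which points to the missing depth in the source geometry.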


For many assets, visual consistency is just the baseline. Objects also need correct part semantics and functional constraints: doors should be able to open, hinges should be able to rotate, drawers should be able to slide, wheels should be able to turn. In other words, the output cannot just be a visually reasonable shape; it must also operate like what it represents.
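Functional constraints of this kind can be written as tests against the asset's joint definitions. The sketch below is a minimal, hypothetical example: the joint names, the required travel values, and the `check_asset` helper are assumptions for illustration, not the format used by any particular project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Joint:
    name: str
    kind: str      # "revolute" (degrees), "prismatic" (centimeters), or "fixed"
    lower: float
    upper: float

    @property
    def travel(self) -> float:
        return self.upper - self.lower


# Hypothetical functional spec for a cabinet: joint name -> (kind, minimum travel).
REQUIREMENTS = {
    "door_hinge": ("revolute", 90.0),    # the door must open at least 90 degrees
    "drawer_rail": ("prismatic", 30.0),  # the drawer must slide at least 30 cm
}


def check_asset(joints: list[Joint]) -> list[str]:
    """Return one message per unmet functional requirement."""
    by_name = {joint.name: joint for joint in joints}
    failures = []
    for name, (kind, min_travel) in REQUIREMENTS.items():
        joint = by_name.get(name)
        if joint is None:
            failures.append(f"{name}: missing")
        elif joint.kind != kind:
            failures.append(f"{name}: expected {kind}, got {joint.kind}")
        elif joint.travel < min_travel:
            failures.append(f"{name}: travel {joint.travel} < required {min_travel}")
    return failures


draft = [Joint("door_hinge", "revolute", 0.0, 45.0),
         Joint("drawer_rail", "fixed", 0.0, 0.0)]
patched = [Joint("door_hinge", "revolute", 0.0, 110.0),
           Joint("drawer_rail", "prismatic", 0.0, 40.0)]

draft_failures = check_asset(draft)
patched_failures = check_asset(patched)
```

An asset can look correct in every render and still fail these checks, which is why the test lives alongside the source rather than the image.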


This is precisely what makes projects like VIGA and Articraft3D stand out. We expect to see more related work emerge this year, including both commercial and open-source projects. VIGA uses Blender as a rendering and feedback environment, turning visual reconstruction into a "code → render → inspect" loop; however, VIGA does not merely expose raw Blender to the Agent loop. It gives the Agent semantic tools for observing and manipulating the scene and retains memory of past attempts, so the Agent can inspect objects from better angles, diagnose issues, and make targeted modifications. Articraft3D, on the other hand, deals more directly with asset structure: it frames articulated 3D generation as writing programs that define parts, geometry, joints, and tests.



Future Impact and Unsolved Challenges


If visual code generation truly takes hold, the eventual winning products will not just generate more beautiful outputs. They will master the entire loop: generating artifacts, rendering artifacts, inspecting where things went wrong, and modifying source files.


This will have several impacts.


First, renderers will become the feedback environment. Browsers, SVG renderers, Lottie players, Blender, game engines, and simulators will become the environment where Agents test and improve their work, much like today's coding Agents leverage sandboxes and VMs.


Second, the quality of the iteration context will matter more than ever. To get Agents into a visual-code version of the "Ralph loop," the intermediate representation must be precise enough to guide the next step. Models need to know not only that "something looks off" but also where in the source file to make the change and why. Small errors in structure, rendering, or feedback can compound quickly over multiple iterations.


Third, the future is likely to be hybrid. Pixel-native models still excel in realism, texture, and exploration; code-native systems are better suited for structure, iteration, and production. The most useful workflows will combine both.


Of course, there are still many open questions. Which representation will ultimately be adopted in each field? Do we need to rebuild engines and renderers from scratch instead of continuing to use last-gen tools? To what extent can visual taste be captured, constrained, tested, and fed back into the loop?


But the direction is already clear: Visual AI is moving from output to code artifacts. The first wave made generating images easier; the next wave will make it easier to generate visual artifacts that are editable, testable, deployable, and improvable.




