Who is the Best at Using Claude Code? The Answer Might Not Be a Programmer

Bitsfull2026/06/20 12:005697

Summary:

40,000 Sessions Show AI Lowered the Programming Barrier and Amplified the Value of Domain Expertise


Editor's Note: This report is based on approximately 400,000 Claude Code sessions, discussing how AI programming tools are changing the relationship between humans and code.


The key finding of the article is that in agentic programming, humans mainly determine "what to do," while Claude is mainly responsible for "how to do it." Users take on most of the planning decisions, while Claude takes on most of the execution work. In other words, AI is taking over implementation tasks such as writing code, modifying files, running commands, debugging, etc., but goal setting and outcome judgment still rely on humans.


More importantly, the effectiveness of using Claude Code does not solely depend on whether the user is a programmer. The report shows that in code generation tasks, non-technical professionals in fields such as law, finance, management, and research have success rates close to software engineers. What truly affects the outcome is whether the user understands the problem they are trying to solve.


This implies that AI programming reduces the implementation threshold, not the judgment threshold. In the future, individuals who understand the business, the context, can clearly articulate requirements, and judge outcomes may be better able to leverage AI than those who can merely write code. AI will not automatically replace domain knowledge but will instead amplify its value.


Below is the original article:


Key Findings


Building on existing research, we propose a framework for studying interactive agentic programming. This framework, based on privacy-preserving analysis of approximately 400,000 Claude Code sessions from October 2025 to April 2026, evaluates task composition, human-AI collaboration, and task success rates.


In a typical session, humans are responsible for most planning decisions, i.e., deciding "what to do"; Claude is responsible for most execution decisions, i.e., deciding "how to accomplish." The stronger a user's domain-specific knowledge in a particular area, the greater the workload triggered for Claude for each instruction. In coding tasks, the average success rates across major professional groups—i.e., whether the user completed the intended task with verifiable evidence through testing, code submission, etc.—are nearly on par with software engineers.


The stronger a user's domain expertise, the more likely the session is to end successfully. However, the difference between intermediate and expert users is not significant. Over the seven months we observed, the proportion of sessions used for debugging decreased by almost half, with usage shifting towards more end-to-end agentic usage: deploying and running code, analyzing data, and writing non-code documentation.


Over the past seven months, the value of typical tasks has increased across almost all job types. We estimated task value by comparing it to freelance job posting data, revealing an average increase of about 25%.


Introduction


AI-assisted programming is rapidly on the rise. Since the end of 2025, the proportion of coding AI activity in GitHub projects has more than doubled, and Claude Code users now spend an average of 20 hours per week using the tool. Can individuals without formal programming experience successfully command an AI to perform complex technical tasks? How will the rapid adoption of these tools and the improvement in their capabilities impact a broader knowledge workforce? While we cannot yet provide a complete answer to these questions, some early signals can be gleaned from the usage data of Claude Code.


This report is based on a privacy-preserving analysis of approximately 235,000 users and around 400,000 interactive sessions from October 2025 to April 2026, offering evidence of how Claude Code is actually being used. It builds on our previous research on autonomy metrics in Claude Code sessions and how Claude Code is transforming internal work at Anthropic. This paper will present a framework for describing the usage of an interactive AI programming assistant: what tasks people are doing, who is doing these tasks, and whether the tasks are successful. We focus on users employing Claude Code through the command-line interface (CLI), Claude.ai, or the Claude Code desktop application. By tracking how AI programming usage evolves with model capabilities, we can better understand the impact of these tools on the programming professional and knowledge worker labor market.


Events on Claude Code may foreshadow the future of knowledge work: AI assistants are gradually becoming integrated into non-coding tasks. We find that Claude is handling more complex and valuable tasks. At the same time, there is still a clear division of labor in AI programming: humans decide what to build, while AI decides how to build.


We also see evidence suggesting that domain expertise is what truly amplifies tool effectiveness, rather than programming proficiency. Particularly, domain experts are more likely to succeed and recover from errors and misconceptions more easily. However, the gap between experts and intermediate users is not significant. This indicates that as long as one has sufficient domain proficiency, they can effectively use such tools almost as efficiently as deep experts.


These findings allow us to tentatively observe potential shifts in the labor market. In our data, success depends on whether a person understands the problem they are solving, rather than whether they have received programming training. If these patterns hold across the entire economic ecosystem, it suggests that while AI programming tools may be absorbing some of the more task-oriented work, they are also rewarding those who truly comprehend the problems they are solving in their work. Coding AI is not replacing domain expertise. Instead, the more understanding a worker brings to the AI, the higher-quality work the AI can accomplish.


Division of Labor


What People Do with Claude Code


To understand how people use Claude Code, we categorize each interaction into one of nine work modes, representing the single activity that best describes the session's goal. Four of these modes directly involve coding: building something new, fixing something broken, testing code, and orchestrating other agents or automation pipelines. Another category involves operating software, including deployment, configuration, running pipelines, and system monitoring. There are two modes more focused on figuring out "what to do": understanding how an existing system operates and planning changes before getting hands-on. The final two modes are either code-agnostic or have code as a peripheral part of the end product: analyzing data and communicating through presentations and other text-based documents.


Approximately 56% of sessions are composed of coding (25%), debugging (26%), or testing and orchestrating code (5%). Operating software accounts for 17%, planning or exploring for 14%, and analyzing or writing text for 13% (see Figure 1).




We first let the model read the session logs and classify each session accordingly; then we use our privacy-preserving analysis tool to cross-validate the classification results with telemetry data automatically logged for each session, including whether code lines were added or removed. There is a high level of consistency between the two sources. For instance, in sessions our classifier labels as creating or modifying code, over 90% also show code changes in telemetry data. See the appendix for details.


Who Makes the Decisions


How strong is Claude Code's autonomy? Capability assessments show that its upper limit is already high and still rising. For instance, in benchmark tests such as METR's time-span evaluation, state-of-the-art models can now autonomously complete software tasks that would originally take humans hours and overcome obstacles in the process. But what is the reality in actual use? Here, we focus on how much guidance is provided by humans and Claude in real sessions.


We study this issue from two perspectives. First, we look at the extent to which people defer decisions to Claude; second, we observe how much agency they allocate to Claude. To understand the decision-making division of labor in a session, we built a privacy-preserving decision attribution classifier based on session content. We tasked the classifier with listing all meaningful decisions in the session and categorizing these decisions into planning decisions and execution decisions. Planning decisions include what to do, which approach to take, and what constitutes completion; execution decisions include which files to modify, what code to write, in what language, and which commands to run. Subsequently, the classifier attributes each decision to Claude or the user and generates two metrics for each session: the percentage of planning decisions borne by the user and the percentage of execution decisions borne by the user.


On average, humans make about 70% of planning decisions but only 20% of execution decisions (see Figure 2). In actual usage, the agent programming has established a clear division of labor: humans decide what to build, and the agent decides how to build it.


To understand the level of action delegation in a single conversation, we do not look at the content but rather at the conversation structure. A Claude Code session consists of back-and-forth interactions between Claude and the user: the user provides prompts, Claude takes actions; then the user sends the next prompt, and so on. In a typical session, such cycles are about four rounds. In our historical data from October to April, for every prompt the user issues, on average, Claude triggers about 10 actions, sometimes even exceeding 100 actions. In each round, Claude reads files, edits code, runs commands, and produces an average of 2400 words.


The amount of work Claude completes between two user checks greatly depends on who is making the decisions. When the user retains control over the execution process, meaning the user makes over 80% of execution decisions, Claude performs fewer actions per round, around 8. However, when Claude holds planning control, meaning Claude makes over 80% of planning decisions, it undertakes the highest number of actions, around 16.




Expertise Level


Per session record, Claude evaluates the user's apparent expertise level on a five-level scale for that task, ranging from novice to expert. The expertise level classifier focuses on three signals: how precise the user's instructions are, what the user asks Claude to validate, and whether the user corrects Claude more often or Claude corrects the user more often. It is important to note that this expertise level is completely different from a position or general ability and, importantly, is task-specific. A seasoned engineer asking a Rust question for the first time may still be a beginner in Rust tasks. An accountant who has never used Python but can accurately instruct Claude on which Python script must execute specific reconciliation rules and catch the edge cases of misapplying them during month-end close is an expert in that task.


The table below shows how we define each level of expertise in the classifier and provides examples of requests from the SWE-chat public code-aware conversational dataset. Dialogues classified as "Novice" provide general instructions without reflecting specific domain knowledge, while dialogues classified as "Expert" convey a deep understanding of the codebase and technical environment.




We quantified the relationship between expertise level and the output and activity generated by Claude for each prompt word. In a typical Novice conversation, each prompt word triggers Claude to perform about 5 actions and output around 600 words. In contrast, in an Expert conversation, the action chain length is more than double the former, about 12 actions, and the output volume reaches around 3200 words, five times that of the former (see Figure 3). This gap between Novice and Expert appears in every type of task and every task value range.


These metrics complement our previous research on Claude Code autonomy. Previous studies tracked the agent's runtime and how often users automatically approve its actions. In contrast, our decision attribution metrics capture who is making substantive decisions during the entire session, while the output volume and number of actions triggered by each prompt word measure the extent to which each human instruction can trigger Claude's autonomous activity.




Who is Using Claude Code and What They're Using It For


User


To understand who is doing this work, we infer each user's occupation based on session records and map it to one of the 23 major categories in the U.S. Bureau of Labor Statistics Standard Occupational Classification (SOC) system. The classifier was instructed to make judgments based only on the following signals: the context of the project loaded by the agent at the beginning of the session, file names and structure, references to materials or products by the user such as legal documents, clinical data, financial reports, course materials, etc., and the vocabulary used by the user. The classifier was explicitly instructed not to take "coding" itself as evidence that the user is engaged in a programming occupation. Only when there is clear evidence indicating that software or data work is the user's occupation, will the session be classified into a coding-related SOC category, i.e., "Computer and Mathematical Occupations." If a lawyer builds a script to automatically check for missing clauses in a set of contracts, then even if this session is primarily about writing software, it will still be classified as a legal occupation. If there are no signals about the user's occupation, the session remains unclassified.


We were able to infer occupations in about 70% of sessions. In these classifiable sessions, "Computer and Mathematical Occupations" are the largest group, which is not surprising as this category covers most software-related work. Next are Business and Financial Operations, Arts, Design, and Media, Management, as well as Life, Physical, and Social Science. The fastest-growing non-software occupational groups in our sample are Management, Sales, and Legal.


Work


From October 2025 to April 2026, there have been significant changes in the work done using Claude Code. The most noticeable change is the decrease in the percentage of sessions focused on fixing broken code from 33% to 19% (see Figure 4). Instead, there is more work revolving around code. The percentage of sessions operating software has increased from 14% to 21%. Writing and data analysis have approximately doubled, from about 10% to about 20%.


The value of the tasks themselves has also increased. We approximated the economic value of each session by estimating the cost of similar work in the freelance market and calibrated using real publicly available job posting data. By this measure, the estimated value of an average session increased by 27% from October to April. This increase was seen across various types of work. The value of building, operating, and fixing tasks has increased by about 43%, 34%, and 32%, respectively. These price estimates are rough, so we primarily use them to compare trends in the value of different tasks over time rather than as directly interpretable dollar amounts. For details on how the task value estimator was constructed, see the Appendix.




Success Depends on What Users Bring


Estimating task value is one way to understand how Claude Code assists people in completing work. Another perspective is observing session success rates and which session characteristics are associated with success. Across all success metrics, we see a clear pattern: the higher the level of professionalism users exhibit in a session, the more likely the session is to be successful. Most of the improvements are focused on the lower end of the professionalism spectrum, meaning the gap between beginners and intermediate users is larger than the gap between intermediate and expert users.


Before analyzing the characteristics of successful sessions, it is important to define success accurately. Since we cannot observe real-world outcomes or directly ask users if they achieved what they intended through Claude, we rely on two complementary session record-based metrics. The first is "Judged Success," where a classifier reads the full session log to determine if the user completed their original goal, with options including success, partial success, failure, and no clear goal. Subsequently, two supporting classifiers assess the strength of this judgment to determine "Verified Success." The Success Signal classifier looks for verifiable evidence of success, especially including git activities that match the work, such as commits and pull requests, passing test suites, and explicit user endorsements. It scores sessions on a scale from "No Signal" to "Weak Signal" (1 point) to "Multiple Hard Signals" (5 points). Another parallel Failure Signal classifier assigns scores to evidence of things going wrong, including errors, test failures, repeated attempts at the same task, and user objections to outputs. Verified Success requires both conditions to be met: the session is judged as successful, and at least one hard verifiable success signal exists. The following analysis focuses on the degree of success or failure in sessions, so we exclude sessions classified as "No Clear Goal" by the success classifier, which represent about 7.7% of the full sample.


Return on Professionalism


So, which sessions are most likely to succeed? The results indicate that the professionalism score of sessions, as described earlier, has a significant impact on session success.


One might be concerned that professionalism is not a true driver of success. Perhaps experts simply choose different tasks or there are other differences. In this section, by comparing sessions of the same work type, with the same estimated value, in the same month, on the same topic, from the same broad occupational group, we partially address such concerns and explore how varying user professionalism influences outcomes.




In all success metrics, the higher the level of professionalism displayed by the user in the conversation, the more likely the conversation is to be successful. Sessions rated as novice achieved a success rate of 15% on our most stringent metric "validated success" and a partially successful rate of 77%. Meanwhile, sessions rated as intermediate and above had a validated success rate of 28% to 33% and a partial success rate of 91% to 92% (see Figure 5).


In each metric, most of the gains come from moving from novice to intermediate; the slope then flattens from intermediate to expert. For regression analysis details behind Figure 5, refer to the appendix.




In sessions facing challenges, a similar gradient can also be observed. When failure signals are recorded as validated evidence of failure, we consider the session to have "encountered issues." This may include encountering errors, failed tests, repeated attempts to accomplish the same task, or users expressing frustration and dissatisfaction. In sessions facing issues, controlling for all the variables above, the proportion of validated successful sessions increases from 4% in novice sessions to 15% in expert sessions (see Figure 5). If a more lenient success metric is used, we find that the proportion of at least partial success is 60% among novice users and 80% to 81% among intermediate to expert users.


We also tracked another inverse relationship, namely the relationship between professionalism level and various failure metrics. It is important to note that in this analysis, sessions classified as failures are those that did not even achieve partial success. If a session that encountered issues is deemed a failure and did not write any lines of code, we refer to it as abandoned. In sessions where users appear to be novices, 19% are ultimately abandoned; while in other user groups, this percentage ranges from 5% to 7%. In other words, when users with minimal experience struggle to achieve their goals but face difficulties, they are more likely to give up. Part of the value of expertise seems to lie in being able to guide the agent back on track.


The Relevance of Occupation May Be Less Important Than Proficiency


Occupational users in software-related roles had a validated success rate of about 30% across all sessions, while users in other occupations had a rate of around 26%. In sessions involving code generation, meaning sessions where at least one line of code was added or modified, these numbers were 34% and 29%, respectively (see Figure 6). If a more lenient definition of success is used, the gap between software-related and other occupational users further narrows. In code generation sessions, the proportion of users achieving at least partial success was 89% and 88% for both user categories. A five-percentage-point difference is not significant, and over a seven-month period, this gap has neither widened nor narrowed, despite both groups improving their success rates. Among the ten largest occupational groups in our dataset when it comes to code generation sessions, each group's difference in success rate with software engineers is within seven percentage points. Management occupations had the highest validated success rate, slightly surpassing software engineering occupations. The higher validated success rate of managers may reflect that management skills can be transferred to the task of commanding an intelligent agent. However, this may also partly result from our measurement method: validation relies to some extent on users explicitly confirming success during sessions, and managers may be more accustomed to expressing themselves when they achieve the outcomes they desire.




Outlook


The results of this report outline an emerging picture: intelligent agent programming is amplifying certain knowledge and skills while replacing others. In sessions involving code generation, the success rates of major occupations do not differ significantly from those in software-related occupations. It appears that coding agents are making having a programming background less critical to successfully completing programming tasks.


Simultaneously, successful sessions are more likely to demonstrate domain-specific expertise. Sessions rated as expert had a validated success rate of over twice that of novice sessions. When encountering issues during sessions, novices also exhibit abandonment rates several times higher than other users. The collaborative nature itself makes this picture clearer: domain experts can guide Claude through more work with each instruction. Therefore, the ability to steer Claude towards success comes more from mastering a specific domain rather than the ability to write code. Anyone with this mastery in a domain can now accomplish technical work that was previously out of reach. Conversely, those lacking this specialized understanding, even when using the same tools, will achieve far less. Furthermore, the benefits come mainly from competency rather than mastery. Having a practical understanding of a domain is sufficient to gain the most benefits; further specialization on top of this foundation will only bring a slight additional advantage.


These findings are still preliminary. Like most of our research, we cannot measure real-world outcomes, such as whether code written in a session is later used or discarded, or if it generates economically valuable results. In addition, non-interactive usage, which accounts for a significant portion of overall activity, is excluded from this report. Developing a framework to measure this type of usage is a key focus for future work. Furthermore, all our session classifications rely on the model's reading of session logs. In the appendix, we demonstrate that the classifier aligns with independent telemetry data as expected and agrees with a strong reference model in most sessions. However, validating the classifier at scale remains challenging; Claude Code sessions themselves add complexity as they can be long and intricate, making it difficult to use manual labeling as ground truth.


As models, users, and the division of labor between them evolve, the landscape in this report will continue to be updated. We hope these metrics will help us track significant shifts taking place. For example, a decline in the return on investment from future expertise would indicate that models are beginning to provide key judgments currently offered by users, with the benefits of these tools expanding beyond domain experts to a broader audience. If the proportion of non-software professionals successfully completing coding sessions continues to rise, it may signal that software production is becoming part of everyday work across various fields, no longer exclusive to a single profession. These shifts will affect who can benefit from AI programming, how much they benefit, and impact the most valued skills in the labor market.


[Original Article Link]



Welcome to join the official BlockBeats community:

Telegram Subscription Group: https://t.me/theblockbeats

Telegram Discussion Group: https://t.me/BlockBeats_App

Official Twitter Account: https://twitter.com/BlockBeatsAsia