Execution Compresses. Judgment Expands.

2026-06-09 · 7 min read

Four independent studies, from Science to BCG and Harvard, show AI cutting task time by up to 40% while raising quality, but only inside its zone of competence. Outside it, results get worse. What the research means for where your attention should go now.

Four separate research teams, working independently, on different kinds of work, keep arriving at the same shape of result: people using generative AI finish faster, and an independent evaluator rates their output as better, not worse. That combination should not be possible. Faster usually means worse. The studies say otherwise, and the way they say it also draws a precise map of where AI helps and where it does not.

Four studies, one pattern

In 2023, Shakked Noy and Whitney Zhang published an experiment in Science. They gave professionals realistic writing tasks of the kind that fill a normal week (press releases, reports, emails, analysis pieces) and split the participants into groups with and without access to ChatGPT. The group using ChatGPT finished in roughly 40% less time. Independent evaluators, who did not know which output came from which group, rated the AI-assisted work about 18% higher in quality.

Around the same time, Erik Brynjolfsson, Danielle Li, and Lindsey Raymond studied something messier: a real generative AI assistant rolled out to more than 5,000 customer support agents. Average productivity rose by roughly 14 to 15%. But the gains were not evenly spread. They were strongest among newer and less experienced agents. The assistant seemed to absorb the patterns that the best agents already used, the phrasing, the sequencing, the judgment calls, and make them available to everyone else.

A third study, run by researchers from BCG, Harvard, Wharton, MIT, and Warwick, looked at a different kind of work entirely: creative product innovation. Consultants were asked to generate and develop new product concepts, with and without GPT-4. About 90% of the consultants using GPT-4 produced work that scored higher than the group working without it, and the average performance gap was around 40%.

Faster, better, and more consistent. On the surface, that sounds like the entire argument for AI adoption is settled. It is not.

The part that gets left out

Every one of these studies includes a warning that rarely makes it into the summary slide. The gains above were measured inside what researchers call the AI's competence "frontier," tasks the model can genuinely do well. Push the same people onto tasks outside that frontier, ones where the AI's output looks fluent but is subtly wrong, and the pattern reverses. People using AI can perform worse than people without it, and the range of ideas a group produces narrows. Confidence in a fluent-sounding answer is not the same as the answer being right.

This is the part that actually matters for how you use these tools day to day. The question is never simply "does AI help?" It is "is this task inside the zone where AI is genuinely strong, and do I know where that zone ends?" That second question is a judgment call. It is, not coincidentally, exactly the kind of judgment call AI cannot make for you.

The scale McKinsey puts on it

Zoom out from individual tasks to the shape of a working week, and McKinsey's estimate is the number that puts the other three in context: generative AI has the technical potential to automate activities that currently occupy somewhere between 60% and 70% of employees' time. The impact is heaviest in knowledge work: marketing, sales, software development, customer service, and R&D, the same domains the studies above were testing.

Read together, the four pieces of research describe the same transition from two directions. The experiments show, at the level of a single task, that execution gets faster and more consistent. The McKinsey estimate shows, at the level of an entire job, how much of what people currently do is built from exactly those kinds of tasks.

Execution compresses. Idea expands.

Here is the shift underneath all of it. Generative AI does not simply remove human work. It changes where that work concentrates. A large share of effort in knowledge work has always gone into execution: writing drafts, producing variations, searching for information, formatting, documenting, repeating the same structure across different deliverables. That is the layer the research above is measuring, and that is the layer compressing fastest.

What is left, and what is growing in relative importance, is everything upstream and downstream of execution: defining the problem properly, forming a hypothesis, choosing the criteria that matter, designing the prompt that gets you there, evaluating what comes back, catching the errors, supplying the context only you have, protecting what is genuinely original, and deciding what is actually good enough to ship.

What moves to AI

  • First drafts, variations, and reformatting
  • Information retrieval and summarization
  • Following an established template or "best practice" pattern
  • Producing multiple versions of the same deliverable

What moves to you

  • Defining the actual problem, not the presenting one
  • Choosing the criteria a good answer has to meet
  • Designing and refining the prompt or brief
  • Evaluating outputs, catching errors, and supplying context
  • Deciding what is original enough, and good enough, to use

"Idea," in this sense, is not a synonym for inspiration. It is direction and editorial judgment: the part of the work that decides what gets made, for whom, to what standard, and why. AI can multiply how much gets produced. It cannot decide what is worth producing. That decision was always yours. The research above just shows how much more of your working week it is about to occupy.

What to do with this

If you recognize your own work in the execution list, that is not a threat. It is information. Those are the tasks to hand over first, deliberately, and to track honestly: what got faster, and did the quality genuinely hold? The studies above used independent evaluators for a reason. Your own sense of "this looks fine" is the least reliable judge of AI output, precisely because fluent and correct can look identical from the inside.

And if your week is already mostly the second list, defining problems, setting criteria, evaluating, deciding, then the research is telling you something else: that work is not shrinking. It is becoming the whole job. The professionals who get the most out of AI right now are not the ones who have automated the most tasks. They are the ones who have gotten sharper at the judgment calls that sit on either side of automation. That sharpening does not happen by accident, and it does not happen by using AI more. It happens by practicing the thinking AI cannot do, on purpose, until it is the thing you are actually known for.
