In February 1982, the managers on Apple’s Lisa team began asking each engineer to report how many lines of code (LOC) they had written that week.
That same week, Bill Atkinson rewrote the part of QuickDraw that calculated regions. His new version ran about 6x faster and was roughly 2000 lines shorter. In the box for lines of code, he wrote “-2000.” After a couple of weeks of this, management stopped asking him to fill out the form.
The most celebrated achievement in the early history of measuring LOC was the deletion of code.
Fast-forward 44 years, and we are making the same mistake with a new unit of measure.
We’ve Run This Experiment Before
We spent decades learning that LOC measures motion rather than progress. Dijkstra put it plainly in 1988: If we count lines of code, we should book them as “lines spent,” not “lines produced,” because conventional wisdom records LOC on the wrong side of the ledger.
He called productivity-by-line foolish, because developers measured in that way write more code, when the harder and more valuable work is usually writing less.
Tom DeMarco gave the discipline its most famous slogan, “you can’t control what you can’t measure,” in 1982. Then, in 2009, he took it back. He wrote that strict measurement-driven control matters most on the projects that matter least.
Capers Jones was blunter still. He called the use of physical lines of code as a productivity metric “professional malpractice,” because it punishes the more expressive language that does the same job in fewer lines.
A New Unit, Already Gamed
Enter the token, the small chunk of text an AI model reads and writes. Every interaction with a model consumes tokens, and tokens cost money, so it’s tempting to treat tokens as a measure of AI adoption.
We’ve even named the impulse to drive consumption: tokenmaxxing.
Tokenmaxxing risks repeating the LOC debate almost beat for beat.
- Engineers at Meta built an internal dashboard (Claudeonomics) that ranked employees by token consumption, with titles like “Token Legend” for the heaviest users. The creator pulled it down within days of it leaking.
- Salesforce set minimum monthly spend targets for its engineers.
- Amazon employees reportedly spun up agents to do pointless work to climb a ranking system called Kirorank. The company has since scrapped it, while advising staff not to use AI for AI’s sake.
Charles Goodhart saw the mechanism in 1975. Any statistical regularity, he observed, collapses once you put pressure on it for control.
Marilyn Strathern later distilled the idea into the oft-quoted line: when a measure becomes a target, it ceases to be a good measure.
The moment leadership watches a number, people optimize the number instead of the goal it was meant to represent.
Spend Is A Meter, Not A Result
Tokens themselves are not the enemy. Aimed at a hard problem, increased spend can be justified. Agentic workflows and deep reasoning earn their token cost. For a fixed task, however, padding the context (or worse, not managing it at all) usually hurts output quality, because the model has more noise to wade through.
A goal set purely against token volume rewards people for running up the meter, not for the judgment of where the tokens should go.
Engineers skilled in token optimization spend tokens only where needed to reach an outcome, nowhere else.
Atkinson did not delete 2000 lines because less is always more. Rather, he deleted them because the goal was a working region engine, not a line count.
A Growing Bill, A Comfortable Metric
With a few notable exceptions, ROI isn’t keeping pace with token spend.
In June 2026, Bain reported that between December 2024 and December 2025, the cost per token fell by 50% while token consumption rose 4.5x. The firm stated that the models got cheaper, the usage got heavier, and the bill stayed stubbornly high.
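The arithmetic behind that observation is worth making explicit. What follows is a minimal sketch, assuming the bill is simply price per token multiplied by tokens consumed (a simplification). Only the two ratios come from the figures quoted above; the function and variable names are illustrative.

```python
# Relative change in a token bill, assuming bill = price_per_token * tokens_used.
# This is a simplification: it ignores model mix, discounts, and caching.

def bill_ratio(price_ratio: float, volume_ratio: float) -> float:
    """Return the new bill as a multiple of the old bill."""
    return price_ratio * volume_ratio


price_ratio = 0.5   # cost per token fell by 50%
volume_ratio = 4.5  # token consumption rose 4.5x

print(bill_ratio(price_ratio, volume_ratio))  # 2.25
```

Under that assumption, halving the unit price is more than offset by 4.5x the volume, so the bill would land at roughly 2.25 times its earlier level.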
Token spending still runs at only 1 to 2% of headcount cost in software engineering, and the much-discussed future where it climbs to 20 to 30% is a stress test, not a forecast.
We are spending more every quarter on an input while the value stays mostly unmeasured, and I have watched capable teams reach for the token number anyway. The reasons are comfortable:
- It is safe, because everyone else is counting tokens too.
- It is easy, because the vendors already report the count for you.
The harder problem sits underneath both. Measuring value means defining what a unit of value even is, and many organizations have never done that work. A token count is the path of least effort, disguised as rigor.
Measure The Outcome
Token visibility is necessary but not sufficient. How, then, do you bring value into the equation? You could divide value achieved by tokens spent, but that ratio has three problems. First, the token denominator drifts week over week as models change, pricing evolves, and new solutions emerge.
Second, the ratio can be gamed from both sides: inflate the value you claim, or starve the tokens and lose quality.
Finally, it still requires a definition of value that many organizations have never documented.
AI-forward organizations measure cost per outcome: the fully loaded cost of the AI work divided by the count of finished results, such as:
- Cost per resolved support ticket.
- Cost per generated asset.
- Cost per qualified lead.
Not only is cost per outcome harder to game, but it also links technology execution to the business.
Intercom, the customer-service software company, already prices on exactly this basis. It charges about a dollar for each ticket its AI resolves, not per message and not per token. To enable this pricing, Intercom first had to define a “resolution” precisely, with explicit rules for what counts and what does not.
You can measure value only after you define what counts as value. An outcome has to become a first-class entity with a stable definition (you cannot divide by “resolved tickets” until a resolved ticket is a defined, tracked thing). You can then tag every cost that goes into one result:
- The model calls.
- The tool invocations.
- The retries.
- The human minutes spent reviewing and reworking.
A token invoice, as EY notes, “captures only part of the total financial exposure.” The real cost of a resolved ticket includes the people and the rework, not just the tokens.
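The tagging described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the `OutcomeCost` schema, its field names, and the per-minute rate for human time are all hypothetical, and a real version would draw on the organization’s own outcome definition and finance data.

```python
from dataclasses import dataclass

# Hypothetical rate for reviewer time; a real figure would come from finance.
HUMAN_COST_PER_MINUTE = 1.00


@dataclass
class OutcomeCost:
    """Costs tagged to one outcome, e.g. one support ticket (illustrative schema)."""
    outcome_id: str
    resolved: bool          # meets the agreed definition of "resolved"
    model_call_cost: float  # model calls
    tool_cost: float        # tool invocations
    retry_cost: float       # retries
    human_minutes: float    # minutes spent reviewing and reworking


def fully_loaded_cost(c: OutcomeCost) -> float:
    """Sum every tagged cost for one outcome, not just the token invoice."""
    return (c.model_call_cost + c.tool_cost + c.retry_cost
            + c.human_minutes * HUMAN_COST_PER_MINUTE)


def cost_per_outcome(records: list[OutcomeCost]) -> float:
    """Total cost of all the AI work divided by the count of finished results."""
    finished = sum(1 for r in records if r.resolved)
    if finished == 0:
        raise ValueError("no finished outcomes to divide by")
    return sum(fully_loaded_cost(r) for r in records) / finished


records = [
    OutcomeCost("T-1", True, 0.25, 0.25, 0.00, 2.0),
    OutcomeCost("T-2", True, 0.50, 0.00, 0.00, 0.0),
    OutcomeCost("T-3", False, 0.50, 0.25, 0.25, 4.0),  # unresolved; cost still counts
]
print(cost_per_outcome(records))  # 4.0
```

One design choice in this sketch is that the cost of unresolved work stays in the numerator. Failed attempts are part of what it costs to produce the finished results, so dropping them would flatter the number.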
A strong data model and ontology supplies both halves of the equation: the definition of the outcome you divide by, and the tags that route each cost to it.
The ontology that makes an agent reliable is the same ontology that makes its value countable.
Cost per outcome links finance and engineering, and both must partner on the definition. The CFO certifies it is audit-ready and the CIO certifies it can be instrumented. Accountability for the number itself sits with the leader who owns the outcome’s P&L. Without this ownership triad, cost per outcome is vaporware, a sound idea that never leaves the whiteboard.
Goodhart’s law still applies, though: while cost per outcome is harder to game, no single-number target is immune.
A team could redefine the outcome to inflate a “finished” count or push real cost somewhere outside the AI line. You guard against this with the same rigorous outcome definition and a small set of supporting metrics (e.g., a quality floor such as a reopen rate).
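One way to pair the headline number with a quality floor is to refuse to report it when the floor is breached. The sketch below assumes a reopen rate as the supporting metric; the 10% threshold and the function names are hypothetical placeholders, not recommended values.

```python
from typing import Optional


def reopen_rate(resolved: int, reopened: int) -> float:
    """Share of resolved outcomes that were later reopened."""
    if resolved == 0:
        raise ValueError("no resolved outcomes")
    return reopened / resolved


def guarded_cost_per_outcome(
    total_cost: float,
    resolved: int,
    reopened: int,
    max_reopen_rate: float = 0.10,  # hypothetical quality floor
) -> Optional[float]:
    """Return cost per outcome only if the quality floor holds, else None."""
    if reopen_rate(resolved, reopened) > max_reopen_rate:
        return None  # floor breached: the cost figure is not reportable
    return total_cost / resolved


print(guarded_cost_per_outcome(800.0, 200, 10))  # 4.0
print(guarded_cost_per_outcome(800.0, 200, 50))  # None
```

Both calls have the same cost and the same resolved count, yet only the first yields a number. A team that closes tickets prematurely to raise the “finished” count loses the metric rather than improving it.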
Klarna is a cautionary tale: it cut deep on a bet that AI could replace much of its staff, then reversed when quality dropped, the CEO admitting that cost had become too dominant a factor.
Incentivize team learning, not just individual performance. A number that makes one engineer look hyper-efficient does nothing for the organization if the capability never spreads.
A defined outcome is necessary but not sufficient. Even a clean number tells you what a result costs, not whether AI is what changed it. Without a baseline or holdout to compare against, you risk booking a concurrent reorganization as an AI win.
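The difference between a before-and-after comparison and a holdout comparison can be shown with a toy example. All figures below are invented for illustration, and a plain difference in means ignores noise and sample size, so treat this as a sketch of the idea rather than a sound analysis.

```python
from statistics import mean


def mean_difference(group: list[float], reference: list[float]) -> float:
    """Difference in the mean of an outcome metric between two groups."""
    if not group or not reference:
        raise ValueError("both groups need at least one observation")
    return mean(group) - mean(reference)


# Hypothetical ticket resolution times in minutes.
before_rollout = [44.0, 46.0, 48.0]  # everyone, before the AI rollout
ai_assisted = [30.0, 34.0, 32.0]     # AI-assisted team, after the rollout
holdout = [36.0, 40.0, 38.0]         # comparable team without AI, same period

naive = mean_difference(ai_assisted, before_rollout)
versus_holdout = mean_difference(ai_assisted, holdout)

print(naive)           # -14.0
print(versus_holdout)  # -6.0
```

In this invented data, the before-and-after view credits AI with a 14-minute improvement. The holdout improved too, so something else changed in the same period, and only 6 minutes of the gain can plausibly be attributed to AI.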
The Literacy Under The Metric
Vanity metrics are always enticing, and there’s always an easy number like tokens (or LOC) within reach. AI literacy helps a leader refuse the easy way out.
A CIO with hands-on command of how AI behaves can tell a good metric from a bad one. A CIO without it has no way to see why the easy number is naive.
You don’t have to write production code, but you do have to know what moves the output when you change the inputs.
A leader who grasps why a leaner context can yield a better answer, not just a cheaper one, sees why chasing a meter is backward. A leader who doesn’t will keep asking for the number that is easy to collect.
Our role has outgrown technology alone, and the common counsel is for leaders to spend less time in the technical detail. While directionally correct, that counsel misses one thing: the judgment to measure value against business objectives, rather than accept a vanity metric, is the piece of depth a CIO cannot afford to delegate.
Fortunately, these same AI models make digging into details easier than ever. Good leaders dive deep. That habit has always separated the executives who lead change from the ones who narrate it.
The Wrap
Atkinson’s -2000 is 44 years old, and it still reminds us to measure the right thing. We are handing our boards a rising token number as proof of progress. Atkinson would have seen a cost to drive down, not a score to run up.
We already know the better path.
Define the outcome, measure its cost, guard against gaming, and learn the craft well enough to know the difference.
The frameworks are not new. It’s the discipline that we must continually hone.