Why your AI budget ran out in four months (and what to do instead)

The invoice always arrives on a Tuesday.

Uber's CTO found out his engineering team had burned through the company's entire 2026 AI coding budget sometime around the end of April. The budget was sized for twelve months. It lasted four. Five thousand engineers, a productivity tool that actually worked, and no one watching the meter.

Uber is a $100 billion company with a real finance organization. They still missed by 3x.

What actually happened

The cause wasn't reckless engineers or poor judgment. It was a pricing model that no one in enterprise software procurement has encountered before.

Claude Code doesn't charge per seat. It charges per token. That's not a subtle distinction; it's a completely different cost surface.

With traditional enterprise software, a $10/user/month tool is predictable. You have 200 engineers, you pay $2,000/month. Done. You can plan around it. You can put it in a spreadsheet.

With Claude Code, a developer running a quick autocomplete at the end of a function spends cents. A developer running Claude Code as an autonomous agent across a monorepo, asking it to refactor an API layer and generate the associated tests simultaneously, can spend thousands of dollars in an afternoon. Uber's CTO spent $1,200 in two hours during a personal demo session. That wasn't a mistake. That's what agentic tools cost when used as intended.

Scale that across 5,000 engineers running multiple agent loops simultaneously, and the math stops resembling anything you've seen before.
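To make the difference concrete, here is a minimal sketch of the two cost models. The seat figures come from the example above; the per-token prices and the token volumes are illustrative assumptions, not any provider's actual rates.

```python
# Illustrative comparison of seat-based vs token-based cost.
# All prices and token volumes below are assumptions for the sketch.

def seat_cost(users: int, price_per_user: float) -> float:
    """Monthly cost under per-seat pricing: depends only on headcount."""
    return users * price_per_user


def token_cost(input_tokens: int, output_tokens: int,
               usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost under per-token pricing: depends only on volume processed."""
    return (input_tokens / 1_000_000 * usd_per_m_input
            + output_tokens / 1_000_000 * usd_per_m_output)


# 200 engineers on a $10/user/month tool, as in the example above.
flat = seat_cost(200, 10.0)

# Hypothetical prices: $3 per million input tokens, $15 per million output.
light_day = token_cost(50_000, 10_000, 3.0, 15.0)          # a few completions
agent_day = token_cost(40_000_000, 2_000_000, 3.0, 15.0)   # long agent loops

print(f"seat-based, 200 users: ${flat:,.2f}/month")
print(f"token-based, light day: ${light_day:,.2f}")
print(f"token-based, agent day: ${agent_day:,.2f}")
```

The seat figure is the same whatever the engineers do. Under these assumed numbers, the token figure differs by a factor of 500 between one developer's light day and one developer's agent-heavy day.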

This is not a spending discipline problem

That framing matters, because the instinct when you see an overrun is to blame the people spending. But Uber's engineers weren't doing anything wrong. Claude Code adoption jumped from 32% to 84% because it worked. Seventy percent of code commits at Uber now come from AI. One in ten live backend updates ships from an agent without human review. These are productivity outcomes. They're what you bought the tool for.

The problem is that no budget model in enterprise software was designed for costs that scale with usage intensity rather than user count. Finance teams plan in seats. AI tools bill in tokens. You can't project a quarter from a pilot, because a pilot doesn't capture what happens when engineers stop using the tool cautiously and start using it the way it's designed to be used.

Agentic workflows consume 25,000 to 60,000 tokens for a single complex task. A single agent with a bad trigger configuration can exhaust a team's monthly budget in minutes. The first signal is always the invoice. By then you're already in damage control.

Uber wasn't alone

The Exponential View reported in May 2026 that ServiceNow also depleted its 2026 AI budget ahead of schedule. The mechanics were different but the structure was identical: a consumption-based billing model that grew faster than anyone's planning assumptions. ServiceNow meters AI agent usage in “assists.” A simple summarization costs one assist. An agentic workflow costs between 25 and 150 assists per execution, depending on complexity.

GitHub paused new sign-ups for Copilot Pro and Pro+ on April 20, 2026, after agentic workloads overwhelmed the unit economics of its $10/month flat plan. The company's own blog acknowledged it directly: “it's now common for a handful of requests to incur costs that exceed the plan price.” A handful of requests. Per session. GitHub is moving to usage-based billing in June.

Microsoft is winding down most Claude Code usage in its Experiences and Devices division by June 30. The official reason is platform consolidation toward GitHub Copilot. The financial context is hard to ignore.

71% of companies exceeded their AI budgets in 2025. 85% missed AI cost forecasts by more than 10% (Mavvrik 2025 survey).

The startup version is sharper

At least Uber can absorb a bad quarter. Startups cannot.

One founder writing on Dev.to in February described pulling up his Stripe dashboard and seeing $27,486 for the prior month. His SaaS was doing $45,000 in monthly revenue. His OpenAI bill was roughly 60% of it. He went for a walk.

The fix was architectural, not behavioral. He didn't tell his team to use AI less. He rebuilt the stack: a cheaper model for 70% of requests (classification, routing, the simple stuff), GPT-4o for the 30% that actually needed frontier reasoning. Monthly bill dropped from $27k to $10,800. Margin went from 40% to 64%.

The insight buried in that story is the important one: he wasn't paying frontier prices because he needed frontier quality. He was paying frontier prices because he hadn't mapped which tasks actually required it.

What to actually do

Set alerts at 50%, 75%, and 90%, not just at the cap

The only budget alert most teams configure is the hard stop. You find out you're over budget when you're over budget. Set intermediate alerts. Fifty percent of cap should trigger a review. Seventy-five percent should trigger a response. By 90%, you're already adjusting, not reacting.
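A tiered alert can be as simple as a lookup over thresholds. This is a minimal sketch: the threshold levels follow the text, while the function name and action labels are hypothetical, and the spend figure would come from whatever usage reporting your provider exposes.

```python
# Minimal sketch of tiered budget alerts. Threshold levels follow the text;
# the action labels and function name are hypothetical.
from typing import Optional

# Checked from highest to lowest so the most urgent tier wins.
THRESHOLDS = [(0.90, "adjust"), (0.75, "respond"), (0.50, "review")]


def alert_level(spend: float, cap: float) -> Optional[str]:
    """Return the highest alert tier crossed, or None below 50% of cap."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    used = spend / cap
    for threshold, action in THRESHOLDS:
        if used >= threshold:
            return action
    return None


print(alert_level(4_000, 10_000))
print(alert_level(5_200, 10_000))
print(alert_level(9_100, 10_000))
```

Run on a schedule against month-to-date spend, a check like this surfaces the overrun weeks before the invoice does.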

Give every agent its own ceiling

The most dangerous cost pattern in agentic AI is running without a circuit breaker. One runaway loop on one engineer's machine, billing to one shared API key, can spike an entire organization's monthly spend before anyone notices. Per-agent token budgets are not a restriction on productivity. They're infrastructure hygiene. When the agent hits its ceiling, it stops.
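One way to sketch a per-agent ceiling is a small budget object that every model call has to pass through. The class and exception names here are hypothetical, and the per-step token count is an assumed stand-in for the usage numbers a real API response would report.

```python
# Sketch of a per-agent circuit breaker. The class and exception names are
# hypothetical; real token counts would come from your provider's usage data.

class BudgetExceeded(RuntimeError):
    """Raised when an agent would exceed its token ceiling."""


class AgentBudget:
    def __init__(self, agent_id: str, max_tokens: int):
        self.agent_id = agent_id
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage; refuse the charge that would cross the ceiling."""
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{self.agent_id}: {self.used + tokens} > {self.max_tokens}")
        self.used += tokens


budget = AgentBudget("refactor-agent", max_tokens=100_000)
steps_run = 0
try:
    for _ in range(10):            # stand-in for an agent loop
        budget.charge(30_000)      # assumed tokens consumed per step
        steps_run += 1
except BudgetExceeded as err:
    print(f"stopped after {steps_run} steps: {err}")
```

The loop halts on the fourth step instead of running all ten. The design choice that matters is that the ceiling belongs to the agent, not to the shared key.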

Per-team allocations, not a shared pool

Most organizations run all AI usage against a single billing key. One team's heavy sprint burns the budget another team needed for the rest of the month. Team-level allocations make cost visible at the right level of accountability. The feedback loop doesn't exist when everything pools together.
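A team-level ledger does not need much machinery to start with. In this sketch the team names, caps, and the record_usage helper are all hypothetical; the point is only that each charge lands against a named team rather than a shared pool.

```python
# Sketch of team-level allocation tracking. Team names, caps, and the
# record_usage helper are hypothetical.
from collections import defaultdict

team_caps_usd = {"payments": 5_000.0, "maps": 8_000.0}
team_spend_usd = defaultdict(float)


def record_usage(team: str, cost_usd: float) -> float:
    """Add a charge to one team's ledger; return that team's remaining budget."""
    team_spend_usd[team] += cost_usd
    return team_caps_usd[team] - team_spend_usd[team]


record_usage("payments", 4_200.0)
record_usage("maps", 1_500.0)

for team, cap in team_caps_usd.items():
    print(f"{team}: ${team_spend_usd[team]:,.0f} of ${cap:,.0f}")
```

With spend split this way, one team's heavy sprint shows up on that team's line and nowhere else.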

Map your tasks before routing them

Not every task needs the best model. This is the single most impactful change most teams can make without changing anything else about how they work. Classification, routing, summarization, formatting, boilerplate generation: these do not need a frontier model. One engineering analysis found that routing 90% of workloads to cheaper models and reserving frontier reasoning for the remaining 10% achieved 87% cost reduction.
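The routing idea and its arithmetic can be sketched in a few lines. The task categories come from the list above; the model labels and the 1/30 price ratio are assumptions, and requests are treated as equal in size. This is not how the cited analysis derived its figure; it only shows one price ratio under which a 90/10 split yields a reduction of that size.

```python
# Sketch of task-based model routing. The task categories come from the text;
# the model labels and the relative price are assumptions.

CHEAP_TASKS = {"classification", "routing", "summarization",
               "formatting", "boilerplate"}


def pick_model(task_type: str) -> str:
    """Send simple task types to a cheaper model, everything else to frontier."""
    return "cheap-model" if task_type in CHEAP_TASKS else "frontier-model"


def blended_cost(cheap_share: float, cheap_price_ratio: float) -> float:
    """Cost relative to sending 100% of traffic to the frontier model."""
    return cheap_share * cheap_price_ratio + (1 - cheap_share) * 1.0


# Assumed: the cheap model costs 1/30 of the frontier price per request.
relative = blended_cost(0.90, 1 / 30)
print(pick_model("classification"), pick_model("api-refactor"))
print(f"relative cost: {relative:.2f}, reduction: {1 - relative:.0%}")
```

Note that the frontier 10% accounts for most of what remains: once simple traffic is routed away, the cheap model's share of the bill is small.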

Cache your system prompts

Prompt caching can cut the cost of repeated system prompt tokens by up to 90%. If you're running the same preamble on thousands of requests, you're paying to process identical text thousands of times. Every major provider supports some form of caching. Most teams haven't turned it on.
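The savings are easy to estimate before touching any configuration. This sketch assumes a hypothetical input price, a 90% discount on cached reads, and a cache that never expires between requests; it ignores the cache-write surcharge some providers apply.

```python
# Sketch of prompt-caching savings. The price, the 90% cached-read discount,
# and the request volume are assumptions; cache-write surcharges and cache
# expiry are ignored.

def prompt_cost(requests: int, system_tokens: int, usd_per_m_input: float,
                cached_read_discount: float = 0.0) -> float:
    """Input cost of the repeated system prompt across all requests.

    The first request pays full price; later ones pay the discounted rate.
    """
    full = system_tokens / 1_000_000 * usd_per_m_input
    cached = full * (1 - cached_read_discount)
    return full + (requests - 1) * cached


uncached = prompt_cost(10_000, 4_000, 3.0)
cached = prompt_cost(10_000, 4_000, 3.0, cached_read_discount=0.90)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

Only the repeated prefix gets cheaper. The per-request user input and all output tokens are billed as before, so the effect on the total bill depends on how much of each request is preamble.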

The conversation nobody is having at budget reviews

Every fix in that list is real and worth doing. But they're all cost-reduction moves within a cloud billing model. They make a metered system more efficient. They don't change the underlying equation: every token costs something, and as usage grows, so does the invoice.

There's a different answer, and it doesn't show up in most FinOps conversations: run some of this locally.

High-volume, repetitive AI tasks (code completion, summarization, classification, routing, internal document search) don't need frontier models. They need models that are fast, reliable, and good enough. A quantized 14B or 27B model running on hardware you own costs $0 per token in API fees. The cost is the hardware itself, sized by VRAM, plus electricity. After you buy the hardware, the marginal cost of each token is close to zero.

Uber's problem wasn't that its engineers used too many tokens. It was that every one of those tokens was billed by Anthropic. The engineers running repetitive agent loops on internal codebases, the same context, the same patterns, hundreds of times a day: that's the workload that makes sense to run locally. The one-off complex reasoning task where you actually need frontier quality? That stays in the cloud. You use it for 10% of the work and you pay for 10% of the work.
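Whether moving a workload local pays off comes down to a break-even calculation. Every number in this sketch is an assumption to be replaced with your own: the hardware price, the API spend being displaced, and the electricity cost.

```python
# Break-even sketch for local vs cloud inference. Every number here is an
# assumption to be replaced with your own.

def breakeven_months(hardware_usd: float, monthly_api_usd: float,
                     monthly_power_usd: float) -> float:
    """Months until hardware cost is recovered by avoided API spend."""
    monthly_saving = monthly_api_usd - monthly_power_usd
    if monthly_saving <= 0:
        return float("inf")   # local never pays for itself
    return hardware_usd / monthly_saving


# Hypothetical: $6,000 workstation, $1,500/month of API spend moved local,
# $100/month in electricity.
months = breakeven_months(6_000, 1_500, 100)
print(f"break-even after {months:.1f} months")
```

The calculation leaves out maintenance time and any quality gap between the local and cloud models, both of which belong in a real decision.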

The calculus depends on hardware. How much VRAM do you need? What model can you run at what quantization, at what speed? These are the questions that determine whether local inference makes sense for your workload. They're exactly what we built OwnRig's recommendation engine to answer.
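As a first approximation, the VRAM question reduces to parameter count times bits per weight, plus headroom. The sketch below is a rule of thumb, not the recommendation engine's method; the 20% overhead allowance for KV cache and runtime buffers is an assumption, and real usage varies with context length, batch size, and inference runtime.

```python
# Rough VRAM estimate for a quantized model. The 20% overhead allowance is
# an assumption; real usage varies with context length and runtime.

def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 0.20) -> float:
    """Weights (params x bits / 8) plus a flat overhead allowance, in GB."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead)


for size in (14, 27):
    print(f"{size}B at 4-bit: ~{estimate_vram_gb(size, 4):.1f} GB")
```

Under these assumptions a 14B model at 4-bit needs on the order of 8 GB and a 27B model on the order of 16 GB, which is the difference between one class of GPU and the next.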

The 10-year version of this problem

The token cost crisis of 2026 is not going away. It's getting more acute.

Agentic tools are getting more capable. More capable agents run longer loops, process larger contexts, and call more tools. A better agent consumes more tokens, not fewer. The productivity gains are real. But productivity compounds, and so do the costs.

Enterprise AI budgets grew 36% last year. The organizations that figure out a mixed strategy (cloud for reasoning, local for volume) will have a structural cost advantage over the ones trying to FinOps their way through 100% cloud billing.

The hardware investment is one-time. The API bill is forever.

Uber's engineers didn't stop wanting to use Claude Code when the budget ran out. That's the sentence from The Information that matters: they wanted it, they were using it, and then they ran out of money to pay for it. That's not a product problem. It's not a discipline problem. It's a funding model problem. And the most durable fix to a funding model problem isn't spending controls. It's changing who you're paying and for what.

Sources: The Information (Uber CTO Praveen Neppalli Naga interview, April 2026), Exponential View (May 2026), GitHub official blog (April 2026), WebProNews, Startup Fortune, Dev.to, Redress Compliance GenAI FinOps report, Mavvrik 2025 AI Budget Survey.

Common Questions

Why did Uber burn through its AI budget so fast?

Claude Code adoption jumped from 32% to 84% of 5,000 engineers in a few months. Token-based billing means cost scales with usage intensity, not user count. Engineers running agentic loops on large codebases consumed thousands of dollars per session. There were no per-team caps and no real-time monitoring, so the first signal was the invoice.

What is the difference between token-based and seat-based pricing?

Seat-based pricing charges a fixed amount per user per month regardless of how much they use the tool. Token-based pricing charges for every unit of text processed: input tokens read and output tokens generated. A developer doing light autocomplete costs cents per day. A developer running an autonomous agent across a large codebase can cost hundreds or even thousands of dollars in an afternoon. You cannot extrapolate one from the other.

How many tokens does an agentic workflow actually consume?

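A standard chat interaction uses roughly 500 to 2,000 tokens. An agentic workflow completing a multi-step task through planning, tool calls, iteration, and self-checking can consume 25,000 to 60,000 tokens. A 10-cycle reasoning loop consumes approximately 50 times the tokens of single-pass inference, because each cycle re-reads the context built up by the cycles before it. Long contexts compound this: 128,000 tokens is 16 times the input of 8,000 tokens, and because attention compute grows roughly with the square of context length, the underlying compute grows far faster than the token count (up to about 256 times for the attention step).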
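The roughly 50-fold multiplier for a 10-cycle loop is consistent with a simple model in which every cycle re-sends the full history so far. This sketch assumes each cycle adds a fixed amount of new context; real agents vary, and the function name is hypothetical.

```python
# Sketch of why multi-cycle agent loops multiply token usage. Assumes each
# cycle re-sends the full history and adds a fixed amount of new context.

def loop_input_tokens(cycles: int, tokens_per_cycle: int) -> int:
    """Total input tokens when cycle k re-reads everything from cycles 1..k."""
    return sum(k * tokens_per_cycle for k in range(1, cycles + 1))


single = loop_input_tokens(1, 2_000)
ten = loop_input_tokens(10, 2_000)
print(f"10 cycles use {ten // single}x the input tokens of one pass")
```

Ten cycles cost 1 + 2 + ... + 10 = 55 units of input rather than 10, which is why loop length matters more to the bill than the size of any single step.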

What did GitHub do when agentic AI broke its pricing model?

GitHub paused new sign-ups for Copilot Pro and Pro+ on April 20, 2026, after agentic workloads overwhelmed the unit economics of its $10/month plan. The company's own blog stated: "it's now common for a handful of requests to incur costs that exceed the plan price." GitHub is moving to usage-based billing (AI Credits) starting June 1, 2026.

Does running AI locally actually solve the token cost problem?

For high-volume repetitive tasks, yes. A quantized model running on hardware you own carries no per-token fee; after the hardware purchase, the running cost is electricity. The constraint becomes VRAM and inference speed rather than billing. This works best for workloads you run repeatedly: code completion, summarization, classification, internal document search. Complex one-off reasoning tasks can stay in the cloud where frontier model quality matters.