Here's an uncomfortable number: 180 tokens per second. Qwen 3.5, INT4 quantized, running in a Chrome tab. No API key. No server. No egress bill. Just a browser and a GPU that was already sitting there rendering your CSS animations.
That's the number that should make every team running a /v1/completions proxy stop and ask what exactly they're paying for.
The claim
For the majority of consumer-facing AI features — text completion, summarization, classification, embeddings, image captioning — browser-side WebGPU inference is now fast enough, cheap enough, and broad enough to replace your inference API. Not supplement it. Replace it.
Bold? Sure. But the math has shifted under our feet in the last 90 days, and most teams haven't noticed.
What changed
Three things converged in early 2026, and their intersection matters more than any one alone.
WebGPU crossed the coverage threshold. As of January, 70% of browsers globally ship WebGPU by default. Chrome, Edge, Firefox 147, Safari on iOS 26 and macOS Tahoe. The "but Safari doesn't support it" excuse expired. For most consumer web apps, the WebGPU-capable audience today is larger than the WebGL2-capable audience was when teams started shipping WebGL2 features.
Transformers.js v4 rewrote the runtime. The February release wasn't an incremental update. The team rebuilt the inference engine in C++ on top of ONNX Runtime, tested it across ~200 model architectures, and shipped a WebGPU backend that makes BERT embeddings 4x faster than v3. The default web bundle is 53% smaller. Build times dropped from 2 seconds to 200ms. And the big headline: models over 8B parameters now work. GPT-OSS 20B hits 60 tokens/sec on an M4 Max with q4f16 quantization — inside the browser.
WebLLM proved the performance ceiling is high. The MLC-AI team's benchmarks on an M3 Max show Llama 3.1 8B (4-bit) at 41 tok/s and Phi 3.5 Mini at 71 tok/s in a Chrome tab. That's 80% of native speed. And the 180 tok/s figure for Qwen 3.5 INT4 with hand-tuned WGSL shaders? That came from a developer who decided the default shader compiler wasn't trying hard enough. The ceiling isn't the framework. It's how much you care about your shader code.
The cost argument no one wants to hear
An API call to a hosted inference endpoint costs somewhere between $0.10 and $2.00 per million tokens, depending on the model and provider. A WebGPU inference call costs nothing. The compute happens on hardware your user already bought and already powered on.
"But the model download!" Yes. A 4-bit quantized 3B model is about 1.8 GB. That's one Netflix episode. Cache it with a service worker, and it's a one-time cost. After that, every inference call is free — to you.
For a product doing 10 million inference calls per month, averaging 1,000 tokens per call, on a hosted 3B model at the low end of pricing, that's $1,000/month in API costs alone, before you count the infrastructure around it. Moving those calls to the browser doesn't just save money. It removes an entire category of operational complexity: rate limiting, autoscaling, cold starts, regional latency, API key management, usage monitoring.
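The arithmetic behind that figure is worth making explicit. A sketch using the section's numbers plus one assumption the $1,000 figure implies — an average of roughly 1,000 tokens (prompt plus completion) per call:

```javascript
// Monthly hosted-inference bill, hypothetical numbers.
// calls: inference calls per month
// tokensPerCall: average tokens (prompt + completion) per call (assumed)
// pricePerMillionTokens: hosted API price in USD
function monthlyApiCost(calls, tokensPerCall, pricePerMillionTokens) {
  return (calls * tokensPerCall / 1_000_000) * pricePerMillionTokens;
}

// 10M calls/month at ~1,000 tokens each, $0.10 per million tokens:
const bill = monthlyApiCost(10_000_000, 1_000, 0.10); // ≈ $1,000/month
```

Scale any one input by 10x and the bill scales with it; the browser-side cost stays zero regardless.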
Where this actually works right now
Stop me if you've built one of these:
Search-as-you-type with embeddings. Compute embeddings client-side, run cosine similarity against a pre-indexed set. No round trip. BERT-base runs in under 15ms per query on mid-range hardware with WebGPU.
Real-time text classification. Spam detection, content moderation, intent routing. A fine-tuned DistilBERT handles this at thousands of classifications per second without touching your servers.
Summarization for reading apps. Phi 3.5 Mini at 71 tok/s in-browser is fast enough for "summarize this article" without the user noticing they're not hitting an API.
Image captioning and alt-text generation. Vision-language models under 1B parameters run comfortably in the browser with WebGPU. Accessibility features that don't depend on your uptime.
Code completion in web IDEs. SmolLM2 at 1.7B is small enough to cache, fast enough to suggest, and private enough that nobody's code leaves the browser.
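The first item above — embedding search with no round trip — reduces to a similarity score and a sort once you have the vectors. A minimal sketch in plain JavaScript; in a real app the vectors would come from a client-side embedding model, and the tiny index here is illustrative:

```javascript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank a pre-indexed set of document embeddings against a query embedding.
function topK(queryVec, index, k = 5) {
  return index
    .map(({ id, vec }) => ({ id, score: cosineSimilarity(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

For a few thousand documents this brute-force scan is comfortably within a per-keystroke latency budget; the embedding step, not the similarity search, is where the GPU earns its keep.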
Where it doesn't work (yet)
I'm not arguing you should run GPT-4-class reasoning in a browser tab. Models above 8B parameters need serious hardware — 16+ GB of GPU VRAM — and that rules out most laptops and all phones. Anything requiring tool use with complex orchestration still wants a server. Multi-turn agents with long context windows will chew through VRAM on consumer devices.
The line is roughly: if the task needs a model under 8B and the latency budget is "instant," the browser wins. If the task needs 70B+ or multi-minute reasoning chains, the cloud wins. The territory between those poles is where the fight is happening, and the browser is gaining ground every quarter.
The real objection
"What about users with bad hardware?" Fair. A 2019 integrated GPU won't run Llama 3.1 8B. But it will run a 135M SmolLM2 for classification, or a quantized 500M model for embeddings. The strategy isn't browser-or-nothing. It's browser-first-with-API-fallback: detect WebGPU support and available VRAM at startup, load the largest model the hardware can handle, fall back to your API for devices that can't keep up.
This is the same progressive enhancement pattern the web has used for decades. We didn't stop shipping websites when some browsers didn't support flexbox. We shipped flexbox with a float fallback.
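The detection step in that fallback strategy can be sketched as follows. The tier names, thresholds, and model names are illustrative placeholders, not from any framework; `navigator.gpu.requestAdapter()` and the adapter's `limits.maxBufferSize` are real WebGPU APIs, used here as a crude proxy for how large a model the device can hold:

```javascript
// Pick an inference tier from browser capabilities. `nav` is the global
// `navigator` in real code; it's a parameter here so the logic is testable.
async function pickInferenceTier(nav) {
  if (!nav.gpu) return { tier: 'api-fallback', model: null };
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) return { tier: 'api-fallback', model: null };
  const maxBuf = adapter.limits?.maxBufferSize ?? 0;
  if (maxBuf >= 4 * 2 ** 30) {
    return { tier: 'local-large', model: 'llama-3.1-8b-q4' };   // placeholder id
  }
  if (maxBuf >= 1 * 2 ** 30) {
    return { tier: 'local-small', model: 'smollm2-135m' };       // placeholder id
  }
  return { tier: 'api-fallback', model: null };
}
```

Run it once at startup, cache the result, and route every inference call through the chosen tier — the float fallback, one decade later.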
The bet
Twelve months from now, shipping a consumer AI feature that routes every inference call through a centralized API will look like shipping a web app that renders every page on the server in 2016. Technically fine. Strategically behind.
The browser has a GPU. The models fit. The runtimes are production-grade. The only thing left is for teams to realize they're paying for a middleman they no longer need.