Ask anyone how their on-device model performs and you'll get tokens per second. Maybe latency to first token. Nobody volunteers the number that actually kills edge AI features in production: energy consumption per generated token.

The Number That Kills Your Feature

Tokens per second is a fine metric when you're plugged into a wall. On a phone with a 5,000 mAh battery running background inference for a smart reply feature, the question isn't "how fast" but "how many queries before the user notices their battery draining through the floor."

A research team recently published what might be the most useful edge AI benchmark of the year. They ran 28 quantized LLMs on a Raspberry Pi 4 (4GB RAM, quad-core Cortex-A72 at 1.5 GHz) and measured actual energy draw per output token across different quantization levels. Not theoretical FLOPs. Not roofline projections. Actual joules, measured at the wall.

Model        Params   Quantization   Energy (J/token)   vs. Best
Qwen 2.5     0.5B     Q4_K_M         ~2.6               1× (baseline)
LLaMA 3.2    1.2B     Q3_K_S         3.75               1.4×
Qwen 2.5     1.5B     Q4_K_M         ~7.6               2.9×
Gemma 2      2.6B     FP16           ~9.4               3.6×
LLaMA 3.2    1.2B     FP16           17.60              6.8×

That LLaMA 3.2 FP16 number at the bottom of the table deserves a hard stare. 17.6 joules to generate a single token. A typical 200-token response costs 3,520 joules, roughly 0.98 Wh. Run a smart-reply feature 50 times a day and you've burned about 49 Wh on inference alone. A Galaxy S24 Ultra battery (5,000 mAh) holds approximately 19 Wh total. You'd drain it two and a half times over before dinner, just from one AI feature.
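That arithmetic is worth keeping as a reusable check. A minimal sketch of the same battery math; the battery capacity, response length, and query count are the assumptions you'd swap for your own device:

```python
def response_wh(joules_per_token: float, tokens: int) -> float:
    """Energy for one generated response, in watt-hours (1 Wh = 3,600 J)."""
    return joules_per_token * tokens / 3600.0

def daily_battery_fraction(joules_per_token: float, tokens: int,
                           queries_per_day: int, battery_wh: float) -> float:
    """Fraction of the battery one feature burns per day (can exceed 1.0)."""
    return response_wh(joules_per_token, tokens) * queries_per_day / battery_wh

# LLaMA 3.2 FP16 on the Pi: 17.6 J/token, 200-token replies, 50 queries/day,
# against a ~19 Wh battery.
wh = response_wh(17.6, 200)                              # ~0.98 Wh per response
frac = daily_battery_fraction(17.6, 200, 50, 19.0)       # ~2.6 batteries/day
print(f"{wh:.2f} Wh/response, {frac:.1f}x the battery per day")
```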

This is why features get shipped, tested internally for a week, then quietly gated behind "low power mode" toggles. The latency was fine. The accuracy was fine. The battery impact was not.

Quantization's Energy Curve Has a Kink

The intuitive assumption: lower precision equals less energy. Mostly true, but the relationship isn't a clean downward slope.

The same study found that pushing from Q4 to Q3 sometimes increased energy usage. The culprit is dequantization overhead — at extreme compression levels, the CPU burns more cycles unpacking tightly packed weights than it saves from reduced memory bandwidth. The arithmetic doesn't simplify; it just moves somewhere less convenient.
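One way to see the overhead: 4-bit weights pack two to a byte, so unpacking is one shift and one mask per value, while 3-bit weights straddle byte boundaries and need a running bit cursor. A toy sketch of just the unpacking logic (illustrative only; real ggml/llama.cpp kernels also carry per-block scales and use SIMD):

```python
def unpack_q4(packed: bytes) -> list[int]:
    """Two 4-bit weights per byte: one mask and one shift each."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out

def unpack_q3(packed: bytes) -> list[int]:
    """3-bit weights straddle byte boundaries: a bit accumulator plus extra
    shifts and masks per value -- the dequantization overhead in question."""
    out, acc, nbits = [], 0, 0
    for b in packed:
        acc |= b << nbits
        nbits += 8
        while nbits >= 3:
            out.append(acc & 0b111)
            acc >>= 3
            nbits -= 3
    return out
```

The extra bookkeeping per weight in the 3-bit path is pure CPU work that the smaller memory footprint has to pay for before it breaks even.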

Q4 variants (Q4_0, Q4_K_M, Q4_K_S) consistently landed in the sweet spot across all four model families. Energy savings of 50-79% compared to FP16, with accuracy degradation staying within a few percentage points. The Q3 variants were a coin flip: sometimes marginally better on energy, sometimes worse, and the accuracy penalty always stung harder.

If you're deploying on ARM application processors without a dedicated neural accelerator, Q4_K_M is your default until you have hardware-specific measurements that say otherwise. Don't chase Q3 savings on spec.
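In practice that means measuring on your target hardware, then picking the cheapest quantization that stays inside your accuracy budget. A sketch of that selection; the per-level numbers here are made up for illustration, not taken from the study:

```python
# Hypothetical per-device measurements: energy and task accuracy per quant level.
measurements = {
    "FP16":   {"j_per_token": 17.6, "accuracy": 0.71},
    "Q4_K_M": {"j_per_token": 3.8,  "accuracy": 0.69},
    "Q4_0":   {"j_per_token": 4.1,  "accuracy": 0.68},
    "Q3_K_S": {"j_per_token": 3.9,  "accuracy": 0.63},
}

def pick_quant(measurements: dict, baseline: str = "FP16",
               max_accuracy_drop: float = 0.03) -> str:
    """Lowest-energy quantization within an accuracy budget of the baseline."""
    floor = measurements[baseline]["accuracy"] - max_accuracy_drop
    ok = {k: v for k, v in measurements.items() if v["accuracy"] >= floor}
    return min(ok, key=lambda k: ok[k]["j_per_token"])

print(pick_quant(measurements))  # Q4_K_M under these illustrative numbers
```

Note how Q3_K_S loses here despite decent energy: it falls outside the accuracy floor, which matches the "coin flip" pattern above.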

Chain-of-Thought Will Drain Your Battery Before Lunch

This one should make anyone building "reasoning" features on mobile uncomfortable.

A separate paper from March 2026 measured the energy cost of common inference-time strategies. Chain-of-thought prompting on a small model (around 3B parameters) boosted math task accuracy by up to 281%. Genuinely impressive. But energy consumption jumped 120–150×.

Not a rounding error. A hundred and fifty times more energy per query.

The mechanism is straightforward once you stop and think: CoT works by generating many more intermediate tokens to reach an answer. More tokens generated means more joules burned. On a battery-powered device, "smarter reasoning" can translate directly to "the phone dies by 2 PM."

Their counterintuitive finding: for many tasks, simply running a larger model (8B instead of 3B) delivered comparable or better accuracy at 35–65% of the energy cost of doing CoT on the small model. The brute-force approach — just ship the bigger model — turned out to be the greener and more battery-friendly option. Sometimes elegance loses to mass.
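The token-count mechanism makes this easy to model: decode energy is roughly J/token times tokens generated. With hypothetical numbers chosen to sit inside the paper's reported ranges (~130× more tokens under CoT; an 8B model costing more per token but emitting far fewer of them), the comparison looks like this:

```python
def query_energy(j_per_token: float, tokens_out: int) -> float:
    """Decode-side energy for one query, in joules."""
    return j_per_token * tokens_out

# Illustrative figures, not measurements from the paper.
small_cot    = query_energy(2.5, 40 * 130)   # 3B + CoT: ~130x more tokens
large_direct = query_energy(12.0, 500)       # 8B, no CoT: pricier tokens, fewer of them

print(f"8B direct uses {large_direct / small_cot:.0%} of the 3B+CoT energy")
```

Under these assumptions the larger model lands at roughly 46% of the CoT cost, inside the 35–65% band the paper reports.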

Edge vs. Cloud: Closer Than You Think

For perspective on the cloud side, DeepSeek-R1 (671B parameters, MoE) running on an NVIDIA GB200 NVL72 cluster burns about 0.96 Wh per query at fp8 precision. A 120B dense model on similar hardware: 0.11–0.20 Wh per query.

Now compare: Qwen 2.5 0.5B on a Raspberry Pi at Q4 quantization burns roughly 0.14 Wh for a 200-token response (2.6 J/token × 200 tokens = 520 J).

That's energy parity with a 120B cloud model running on hardware that costs orders of magnitude more. Except the Pi version doesn't need a network round-trip, doesn't incur API fees, doesn't send user data to anyone's servers, and works in airplane mode.
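The comparison in numbers, using only the figures quoted in this section (1 Wh = 3,600 J):

```python
# Per-query energy, edge vs cloud.
pi_q4_wh = 2.6 * 200 / 3600        # Qwen 2.5 0.5B at Q4, 200 tokens -> ~0.14 Wh
cloud_120b_wh = (0.11, 0.20)       # 120B dense model on GB200-class hardware
deepseek_r1_wh = 0.96              # DeepSeek-R1 671B MoE at fp8

print(f"Pi: {pi_q4_wh:.2f} Wh/query; "
      f"120B cloud: {cloud_120b_wh[0]:.2f}-{cloud_120b_wh[1]:.2f} Wh; "
      f"R1: {deepseek_r1_wh:.2f} Wh")
```

The Pi lands squarely inside the 120B model's per-query range.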

This is the number that should shift architecture decisions. When your $35 SBC matches the per-query energy footprint of a datacenter rack, the "just call the API" reflex starts looking less like pragmatism and more like habit.

Start Tracking This

TokenPowerBench, an open benchmark released late last year, is designed specifically for measuring LLM inference power consumption. It breaks energy attribution into prefill and decode phases separately, supports GPU-level and system-level measurement backends, and doesn't require specialized power meters. If you haven't looked at it yet, it's worth setting up on whatever hardware your production workload targets.

The next time you evaluate an on-device model, report two numbers: tokens per second and joules per token. The first tells you whether your feature feels fast. The second tells you whether anyone will actually leave it turned on.