Researchers at Peking University and Infinigence-AI just dropped a result that should reframe how we think about on-device language models. A Qwen 2.5 1.5B, running on a Snapdragon 8 Elite's neural processor, matched and in fact slightly beat the accuracy of a 7B baseline on MATH500. Not by making the model bigger — by sampling multiple answers in parallel and picking the best one. The trick exploits a hardware inefficiency hiding in plain sight on every flagship phone sold since 2023.

The 90% waste problem

When you run a language model on a phone's neural accelerator at batch size 1 — which is what every on-device chat app does — matrix multiplications degenerate into matrix-vector products. The Hexagon Matrix eXtensions (HMX) units, designed to chew through 16,384 multiply-accumulate operations per cycle, end up performing what amounts to vector math.

On the Snapdragon 8 Elite (V79), this means the matrix units sit mostly idle during the token generation phase. Prefill lights them up properly — you're processing the full prompt as an actual matrix. But once you're generating tokens one at a time, you're paying for a V12 engine and driving in first gear.

The researchers measured this directly. During standard batch-1 decode on a Snapdragon 8 Gen 3, the matrix units operated at under 10% utilization. All that silicon, burning power, doing almost nothing. The vector units (HVX) and scalar core carry the load while the most powerful compute block on the chip watches.

Test-time scaling: brute force, but calibrated

The concept is not complicated. Instead of generating one answer and hoping it's correct, generate N candidates simultaneously and score them. Return the best. In the literature, this goes by Best-of-N sampling. You can also run beam search, maintaining multiple hypotheses and pruning progressively.
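Best-of-N fits in a few lines. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for the on-device generator and whatever scoring model ranks the candidates:

```python
def best_of_n(prompt, generate, score, n=8):
    """Best-of-N sampling: draw n candidate completions and
    return the one the scorer ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins (hypothetical): each "generation" is a canned answer,
# and the scorer rewards the correct one.
answers = iter(["3", "5", "4", "4.5"])
generate = lambda prompt: next(answers)
score = lambda ans: 1.0 if ans == "4" else 0.0

assert best_of_n("What is 2+2?", generate, score, n=4) == "4"
```

Beam search replaces the flat `n` candidates with a pruned frontier, but the select-by-score shape is the same.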

Cloud providers have used test-time compute scaling to boost reasoning performance since late 2024. What's new is pulling it off on a phone, using hardware capacity that's already allocated and already drawing standby power.

The batch size increase maps directly onto the idle matrix units. Generate 8 candidates at once and the HMX blocks have actual matrices to multiply instead of single vectors. Throughput jumps without proportional power increase because you're finally exercising the silicon as its architects intended.
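The arithmetic behind that mapping is easy to demonstrate: stacking eight decode vectors turns eight matrix-vector products into a single matrix-matrix product over the same data. A NumPy sketch — the hidden size matches Qwen 2.5 1.5B, but the hardware tiling is of course abstracted away:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, batch = 1536, 8   # Qwen 2.5 1.5B hidden size; 8 candidates

W = rng.standard_normal((d_model, d_model))
xs = rng.standard_normal((batch, d_model))

# Batch-1 decode: eight separate matrix-vector products,
# each leaving a matrix engine mostly idle.
gemv_out = np.stack([W @ x for x in xs])

# Batched decode: one matrix-matrix product over the same data --
# the shape that HMX-style tile engines are built for.
gemm_out = xs @ W.T

assert np.allclose(gemv_out, gemm_out)
```

Same FLOP count, same results; the difference is purely how much of each result the matrix units can produce per cycle.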

For evaluation, the team ran Skywork-1.5B-PRM alongside the generator — a process reward model that scores each reasoning step, not just the final answer. Both models fit comfortably in memory on the target devices. The reward model selects the highest-scoring completion, and the entire pipeline runs without touching a network socket.
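How per-step PRM scores collapse into a single ranking isn't spelled out above; one common aggregation is to rank each candidate by its weakest step. A hypothetical sketch, with `step_score` standing in for the reward model:

```python
def select_with_prm(candidates, step_score):
    """Pick the completion whose worst reasoning step scores highest --
    one common way to aggregate per-step process-reward scores.
    (The paper's exact aggregation rule isn't specified here.)"""
    def overall(steps):
        return min(step_score(s) for s in steps)
    return max(candidates, key=overall)

# Hypothetical toy scorer: penalize steps flagged as wrong.
step_score = lambda step: 0.1 if "wrong" in step else 0.9

a = ["factor the quadratic", "apply a wrong identity", "conclude x=3"]
b = ["factor the quadratic", "solve each root", "conclude x=2"]
assert select_with_prm([a, b], step_score) == b
```

Min-aggregation rewards chains with no weak links; averaging or taking the final step's score are common alternatives.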

This matters because it decouples accuracy from parameter count. A small model with room to retry can outperform a large model that gets one shot. The constraint isn't "how big is your model" but "how much idle compute can you redirect."

Bypassing Qualcomm's SDK to hit the metal

One of the more surprising engineering choices: the team bypassed QNN (Qualcomm's Neural Network SDK) entirely and wrote directly against the Hexagon instruction set. This unlocked two optimizations that the official toolchain doesn't expose.

Tile-aligned W4A16 quantization. Mobile neural processors don't natively support the fine-grained group quantization popular in llama.cpp. The researchers designed a weight layout aligned with the hardware's memory access tiles — 4-bit weights dequantized on-the-fly to FP16 before feeding the matrix units. You get the memory savings of INT4 storage with the precision of FP16 arithmetic, at the cost of some dequantization overhead that the LUT trick below mitigates.
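A rough sketch of that storage/compute split, using a made-up tile width — the real HMX tile geometry and the paper's exact layout differ:

```python
import numpy as np

TILE = 32  # hypothetical tile width; real HMX tiles differ

def quantize_w4(tile):
    """Symmetric 4-bit quantization of one weight tile:
    int4 codes in [-8, 7] plus a single FP16 scale per tile.
    (Codes would be packed two per byte for actual INT4 storage.)"""
    scale = np.float16(np.abs(tile).max() / 7.0)
    q = np.clip(np.round(tile / np.float32(scale)), -8, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q, scale):
    # On-the-fly dequant: int4 codes * FP16 scale -> FP16 weights,
    # which then feed the FP16 matrix units.
    return q.astype(np.float16) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal(TILE).astype(np.float32)
q, s = quantize_w4(w)
w_hat = dequantize_w4(q, s).astype(np.float32)

# 4-bit storage is lossy but close: error stays within one step size.
assert np.max(np.abs(w - w_hat)) <= float(s)
```

One FP16 scale per tile keeps the dequant inner loop a single multiply, which is what makes doing it on the fly cheap.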

Lookup-table softmax. Computing exponentials on the vector unit is painfully slow — it's a scalar-like operation repeated across thousands of positions. Hexagon's V-series chips natively support LUT instructions, so the team replaced the naive exp() calls with table lookups. The result: 2.2× faster softmax computation. When you're scoring multiple hypotheses per generation step, softmax speedup compounds quickly.
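The idea is straightforward to sketch: precompute exp over a bounded input range and index into the table instead of calling exp per element. Table size and range here are illustrative, not the values the team used:

```python
import numpy as np

# Hypothetical 512-entry exp table over [-16, 0]. After subtracting
# the row max, softmax inputs are always non-positive, so a bounded
# table covers everything that matters.
LUT_SIZE, LO = 512, -16.0
LUT = np.exp(np.linspace(LO, 0.0, LUT_SIZE)).astype(np.float32)

def softmax_lut(x):
    """Softmax where exp() is replaced by a table lookup,
    mimicking Hexagon's vector LUT instructions."""
    z = x - x.max()  # shift to non-positive range (numerically stable)
    idx = np.clip(((z - LO) / -LO * (LUT_SIZE - 1)).round().astype(int),
                  0, LUT_SIZE - 1)
    e = LUT[idx]
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)
ref = np.exp(x - x.max())
ref /= ref.sum()
assert np.allclose(softmax_lut(x), ref, atol=1e-2)
```

The normalization at the end absorbs most of the quantization error, which is why a few hundred entries suffice for attention weights.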

The matrix multiplication acceleration alone delivered a 19× speedup over a baseline implementation by properly exploiting HMX tiling. That's the gap between treating the accelerator as a generic compute target and actually understanding its microarchitecture.

What they measured

The team benchmarked five models across three Snapdragon generations, all with W4A16 quantization; the table lists throughput for two representatives:

Device             Processor      Qwen 1.5B (tok/s)   Llama 3B (tok/s)   Power
OnePlus Ace3       V73 (Gen 2)    18.4                9.7                ~4.2W
OnePlus 12         V75 (Gen 3)    27.1                14.3               ~4.5W
OnePlus Ace5 Pro   V79 (Elite)    38.6                21.2               ~4.8W

Power stayed under 5W across the board — roughly what you'd burn scrolling a social media feed.

The accuracy payoff

Qwen 2.5 1.5B with Best-of-64 sampling on the V79 scored 74.2% on MATH500. The 7B model with single-pass inference scored 72.8%. A model using 4.7× less memory beat the larger one by redirecting compute that was already going to waste.

Nobody's shipping Best-of-64 in a conversational app — generating 64 full responses introduces unacceptable latency for chat. But at modest batch sizes like 4 or 8, the technique is entirely practical for tasks where correctness outweighs speed: on-device code review, math tutoring, medical triage forms, structured data extraction from documents.

The deeper implication is about hardware economics. Every Snapdragon 8-series phone since 2023 ships with matrix units powerful enough for batch inference. That compute capacity is there whether you use it or not. Test-time scaling is the first argument I've seen for actually lighting up the silicon that hundreds of millions of devices already carry.

It also inverts the deployment question. Instead of cramming a 7B into 6GB of RAM with brutal quantization and praying the thermals hold, you run a 1.5B comfortably — lower memory pressure, better thermal headroom, room for the reward model — and throw cycles at it. On reasoning benchmarks, at least, you get better answers from the smaller setup.