Researchers at Peking University and Infinigence-AI just dropped a result that should reframe how we think about on-device language models. A Qwen 2.5 1.5B, running on a Snapdragon 8 Elite's neural processing unit, …
Two days ago, Google DeepMind dropped Gemma 4 with four sizes. The 31B dense and 26B MoE variants are fine — another round of open-weight heavyweights to add to the pile. What caught my attention were …
Walk into any electronics store in April 2026 and try to buy a laptop without a neural processing unit. You can't. AMD's Ryzen AI 400 ships with XDNA 2 rated at up to 60 TOPS. Intel's Core …
Four billion parameters, two gigabytes of RAM. That's the promise of INT4 quantization — shrink a model by 87% and run it on hardware that couldn't have touched it at full precision. But …
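The arithmetic behind those numbers is easy to check. A minimal sketch, assuming an FP32 baseline, 4 bits per weight, and ignoring the small overhead that quantization scales and zero-points add in practice:

```python
def model_bytes(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight-storage footprint: params x bits, converted to bytes."""
    return n_params * bits_per_weight / 8

n_params = 4e9  # four billion parameters

fp32 = model_bytes(n_params, 32)  # 16 GB at full precision
int4 = model_bytes(n_params, 4)   # 2 GB at 4 bits per weight

print(f"FP32: {fp32 / 1e9:.0f} GB")              # FP32: 16 GB
print(f"INT4: {int4 / 1e9:.0f} GB")              # INT4: 2 GB
print(f"shrink: {1 - int4 / fp32:.1%}")          # shrink: 87.5%
```

Note that the 87% figure only works against an FP32 reference; measured against the FP16 weights most models actually ship in, INT4 is a 75% reduction.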
Here's an uncomfortable number: 180 tokens per second. Qwen 3.5, INT4 quantized, running in a Chrome tab. No API key. No server. No egress bill. Just a browser and a GPU that was already sitting there.