Edge Deployed

On-device AI, mobile ML, WebGPU inference, and small models that punch above their weight.

A 1.5B Model Just Beat a 7B — By Spending Compute Differently

test-time-compute · mobile-npu · small-models · quantization · on-device-ai

Researchers at Peking University and Infinigence-AI just dropped a result that should reframe how we think about on-device language models. A Qwen 2.5 1.5B, running on a Snapdragon 8 Elite's neural processing unit…
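"Spending compute differently" points at test-time scaling: instead of a bigger model, draw more samples and aggregate. One of the simplest variants is self-consistency voting. A minimal sketch, where `noisy_model` is a hypothetical stand-in sampler, not the paper's actual setup:

```python
import random
from collections import Counter

def majority_vote(sample, prompt, n=16):
    """Test-time compute sketch: draw n candidate answers and return the
    most common one plus its agreement rate. `sample` stands in for any
    stochastic generation call (e.g. temperature-sampled decoding)."""
    answers = [sample(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Hypothetical stand-in "model": each single sample is right only
# 60% of the time; voting across samples trades compute for accuracy.
random.seed(0)
def noisy_model(prompt):
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

answer, agreement = majority_vote(noisy_model, "What is 6 * 7?")
```

The point of the sketch is the shape of the trade, not the numbers: each extra sample costs linear compute at inference time, which is exactly the budget a small on-device model has room to spend.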

Google Shipped a Multimodal 2B Model That Runs on a Raspberry Pi at 133 tok/s

gemma-4 · on-device-ai · per-layer-embeddings · multimodal · edge-inference

Two days ago, Google DeepMind dropped Gemma 4 with four sizes. The 31B dense and 26B MoE variants are fine — another round of open-weight heavyweights to add to the pile. What caught my attention were…

Every Laptop Has an NPU Now. Almost Nobody Knows How to Use It.

npu · developer-tools · openvino · onnx-runtime · edge-inference

Walk into any electronics store in April 2026 and try to buy a laptop without a neural processing unit. You can't. AMD's Ryzen AI 400 ships with XDNA 2 rated at up to 60 TOPS. Intel's Core…

87% Smaller, 2% Dumber: A Field Guide to INT4 Quantization

quantization · int4 · gptq · awq · gguf · edge-inference

Four billion parameters, two gigabytes of RAM. That's the promise of INT4 quantization — shrink a model by 87% and run it on hardware that couldn't have touched it at full precision. But…
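The headline arithmetic is easy to sanity-check: 4B weights at 4 bits each is 2 GB, an 87.5% cut from an FP32 baseline. A back-of-envelope sketch, counting weights only — KV cache, activations, and the group-wise scale/zero metadata that GPTQ/AWQ-style schemes add are deliberately ignored:

```python
def weight_footprint_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight memory only: params * bits / 8 bytes, reported in GB (1e9 bytes).
    Ignores KV cache, activations, and quantization scale/zero overhead."""
    return n_params * bits_per_weight / 8 / 1e9

fp32 = weight_footprint_gb(4e9, 32)   # 16.0 GB at full precision
int4 = weight_footprint_gb(4e9, 4)    # 2.0 GB after INT4 quantization
reduction = 1 - int4 / fp32           # 0.875 -> the headline "87% smaller"
```

Against an FP16 baseline the cut is 75%, not 87% — which baseline a vendor quotes is worth checking before comparing numbers.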

Your Browser Tab Is the Inference Server Now

webgpu · browser-inference · transformers.js · webllm · on-device-ai · hot-take

Here's an uncomfortable number: 180 tokens per second. Qwen 3.5, INT4 quantized, running in a Chrome tab. No API key. No server. No egress bill. Just a browser and a GPU that was already sitting there…