Walk into any electronics store in April 2026 and try to buy a laptop without a neural processing unit. You can't. AMD's Ryzen AI 400 ships with XDNA 2 rated at up to 60 TOPS. Intel's Core Ultra 200V packs 48 TOPS. Qualcomm's Snapdragon X Elite delivers 45 TOPS. Gartner estimates 55% of all PCs sold this year will have a dedicated neural engine. The silicon war is over — the NPU won a permanent seat at the table.
The software war hasn't started yet.
Three Chips, Three Toolchains, Zero Portability
Here's the part nobody talks about at product launches: if you want to run a model on AMD's neural engine, you need Ryzen AI Software and the XDNA runtime. For Intel, it's OpenVINO with the NPU plugin. For Qualcomm, it's the QNN SDK paired with ONNX Runtime's QNN Execution Provider. Each vendor ships its own quantization requirements, its own model format preferences, and its own set of supported operators.
"Just export to ONNX," someone says. Sure. But to actually hit the neural silicon, you still need to quantize to QDQ format on Qualcomm, run through OpenVINO's model conversion tooling on Intel, or route through AMD's Vitis AI flow. The write-once-run-anywhere promise dissolves the moment you try to target the dedicated accelerator.
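To make the fragmentation concrete, here is a minimal sketch of what "one model, three backends" looks like in practice with ONNX Runtime. The execution provider names are ONNX Runtime's real identifiers; the `providers_for` helper and the vendor-detection step it assumes are hypothetical.

```python
# Hypothetical helper: map a detected silicon vendor to an ONNX Runtime
# execution-provider priority list. The EP names are real ONNX Runtime
# identifiers; each one also requires the vendor's SDK to be installed.
VENDOR_PROVIDERS = {
    "qualcomm": ["QNNExecutionProvider", "CPUExecutionProvider"],
    "intel":    ["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    "amd":      ["VitisAIExecutionProvider", "CPUExecutionProvider"],
}

def providers_for(vendor: str) -> list[str]:
    """Return the EP priority list for a vendor, CPU-only if unknown."""
    return VENDOR_PROVIDERS.get(vendor.lower(), ["CPUExecutionProvider"])

# Usage (requires onnxruntime plus the vendor's EP package):
# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx",
#                                providers=providers_for("qualcomm"))
```

Even with this dispatch in place, the model file itself still needs a per-vendor quantization pass — the provider list only decides who gets asked to run it.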
This isn't theoretical whining. A developer building real-time background removal for a video app faces a choice: support one vendor's chip well, or support all three badly. Most choose neither and fall back to the GPU, which at least speaks CUDA or Metal consistently. Fifty TOPS, collecting dust.
What Actually Runs on the Neural Engine Today
Strip away the keynote demos and the list of real workloads is shorter than you'd expect:
- Windows Studio Effects — background blur, eye contact correction, auto-framing during video calls
- Adobe Creative Cloud — content-aware fill, auto masking, and generative expand in Photoshop. One architecture firm reportedly saw AI operations complete 3.4x faster on NPU-equipped machines while keeping the GPU free for preview rendering
- Real-time noise suppression — Teams, Zoom, and a growing number of conferencing apps
- Live captions and translation — a Windows 12 feature that's NPU-exclusive, processing speech locally without a cloud round-trip
That's roughly the whole consumer-facing list. The chip doesn't help your games. It doesn't speed up your browser. For most buyers, those 50 TOPS sit at 5 watts, waiting for something — anything — to schedule work onto them.
OpenVINO 2026 Gets One Thing Right
Intel's OpenVINO 2026 release shipped something genuinely interesting: a Unified Runtime Scheduler. Rather than forcing developers to pick a single execution target, it lets you describe a pipeline graph where individual nodes are tagged with preferred devices — CPU, GPU, or NPU. The runtime partitions automatically, placing transformer attention layers on the neural accelerator while keeping pre- and post-processing on the CPU.
A retail chain deployed a LLaMA-2-7B chatbot on a Core Ultra-based kiosk using this approach: sub-100ms first-token latency, under 40W total system power, with the dedicated silicon handling attention computation and the CPU managing tokenization and sampling. That's the heterogeneous compute story that justifies having a neural accelerator in the first place.
The catch is obvious: it only works on Intel hardware. AMD and Qualcomm don't ship an equivalent unified scheduler. Not yet, anyway.
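The pipeline-partitioning idea itself isn't new to Intel's stack — OpenVINO's long-standing HETERO device expresses the same split placement with a device string, assigning each layer to the first listed device that supports it. A minimal sketch (the helper function is mine; the commented usage assumes the `openvino` package, an NPU driver, and a placeholder model path):

```python
def hetero_device(devices: list[str]) -> str:
    """Build an OpenVINO HETERO device string, e.g. 'HETERO:NPU,CPU'.
    Each layer runs on the first listed device that supports it,
    falling through to later devices otherwise."""
    if not devices:
        raise ValueError("need at least one device")
    return "HETERO:" + ",".join(devices)

# Usage (requires the openvino package and NPU-capable hardware):
# import openvino as ov
# core = ov.Core()
# model = core.read_model("llama_block.xml")   # placeholder IR path
# compiled = core.compile_model(model, hetero_device(["NPU", "CPU"]))
```

The unified scheduler's advance over this is tagging at the pipeline-graph level rather than per-layer device fallthrough — but the device-string mechanism shows the placement primitive has been in the runtime all along.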
The ONNX Bridge Is Narrower Than It Looks
ONNX Runtime is positioned as the answer — one model format, multiple execution providers, automatic hardware dispatch. In practice, the gaps are significant:
| Vendor | Execution Provider | Quantization | Supported Ops | LLM-Ready? |
|---|---|---|---|---|
| Qualcomm | QNN EP | QDQ (INT8/INT4) | ~180 | Yes — Phi-3.5, Llama 3.2 |
| Intel | OpenVINO EP | FP16/INT8 | ~220 | Yes — via OpenVINO 2026 |
| AMD | Vitis AI EP | INT8 (expanding) | ~140 | Partial — Ryzen AI 1.7 |
The operator coverage gap matters more than the numbers suggest. A model that runs fine through Qualcomm's QNN provider might hit an unsupported op on AMD's Vitis AI provider and silently fall back to CPU execution. No error, no warning — just degraded performance and a battery drain the developer never intended.
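Until the runtimes report this themselves, apps can at least guard against whole-provider fallback. A sketch: compare the providers you requested with the ones the session actually activated (`get_providers()` is real ONNX Runtime API; the helper and error handling are mine):

```python
def silent_fallbacks(requested: list[str], active: list[str]) -> list[str]:
    """Providers that were requested but never activated. A non-empty
    result means some accelerator was unavailable and inference will
    quietly run on whatever providers remain (often just the CPU)."""
    return [p for p in requested if p not in active]

# Usage against a live session:
# import onnxruntime as ort
# wanted = ["QNNExecutionProvider", "CPUExecutionProvider"]
# sess = ort.InferenceSession("model.onnx", providers=wanted)
# missing = silent_fallbacks(wanted, sess.get_providers())
# if missing:
#     raise RuntimeError(f"accelerator unavailable: {missing}")
```

Note the limit: this catches a provider failing to load, not an individual unsupported operator being reassigned. Per-op placement is only visible in ONNX Runtime's verbose session logs, which is exactly the observability gap the article is complaining about.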
AMD's Ryzen AI Software 1.7 expanded model coverage and reduced friction in local development workflows. But "closing the gap" while Intel and Qualcomm keep extending their own runtimes means the finish line keeps moving.
What Would Actually Fix This
The GPU ecosystem went through the same fragmentation phase fifteen years ago. CUDA won — for better or worse — because NVIDIA invested relentlessly in developer tools, documentation, and model zoo support until the ecosystem tipped irreversibly.
Neural accelerators don't have their CUDA yet. DirectML is Microsoft's attempt at a hardware-agnostic abstraction, and it now covers all three vendors. But DirectML operates at a higher level than vendor-native runtimes — you trade fine-grained scheduling control for portability. Works everywhere, optimized nowhere.
Three changes would move the needle:
A real model zoo per vendor. Not "we support Llama" in a press release — end-to-end examples with measured latency, power draw, and accuracy loss on specific SKUs. Qualcomm's AI Hub comes closest today, with pre-optimized checkpoints for over 100 architectures. Intel and AMD need equivalents.
Transparent fallback reporting. When an operator falls back from the neural engine to CPU, the runtime should scream about it. Developers shouldn't be able to ship an "NPU-accelerated" feature that's secretly running on x86 cores. Silent fallback is the single worst design decision in the current toolchain.
Converge on quantization formats. INT4 AWQ and GPTQ are becoming de facto standards for language model compression. If all three vendors' silicon can consume AWQ INT4 natively, a single quantized checkpoint could serve every device. We're close — Qualcomm's QNN already handles INT4 QDQ, Intel added INT4 weight compression in OpenVINO 2026, and AMD is expanding INT4 support in XDNA 2. Someone just needs to agree on the container format.
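Whatever container format wins, the numeric contract underneath is the same quantize–dequantize round trip. A minimal sketch of what one QuantizeLinear/DequantizeLinear pair in an ONNX QDQ-format model does to a single value (symmetric quantization when `zero_point` is 0; the function itself is illustrative, not a library API):

```python
def quantize_dequantize(x: float, scale: float,
                        zero_point: int = 0, bits: int = 8) -> float:
    """Round-trip one value through signed integer quantization:
    scale to the integer grid, clamp to the representable range
    (e.g. [-128, 127] for INT8, [-8, 7] for INT4), then map back."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

# The information loss is the clamp plus the rounding step: values
# beyond scale * qmax saturate, everything else moves to the nearest
# multiple of scale. AWQ and GPTQ differ in how they *choose* scales
# per group, not in this arithmetic.
```

That shared arithmetic is why convergence looks plausible: the silicon already agrees on the math, so only the metadata — scales, group sizes, packing — needs a common container.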
Fifty TOPS, Waiting
The hardware story is settled. More than half of new PCs ship with neural engines capable of running 7B-parameter language models locally at single-digit wattage. The latency wins over cloud inference are real. The privacy argument is unassailable.
But until a developer can write one inference pipeline and have it actually dispatch to the dedicated accelerator on AMD, Intel, and Qualcomm silicon without maintaining three separate optimization passes, most of those transistors will idle. The chip that was supposed to democratize on-device intelligence is currently democratizing nothing, because nobody agreed on how to talk to it.