Four billion parameters, two gigabytes of RAM. That's the promise of INT4 quantization — shrink a model by 87% and run it on hardware that couldn't have touched it at full precision. But "INT4" isn't a single thing. There are at least three serious approaches competing for your attention, each with different tradeoffs in accuracy, speed, and what silicon you can actually target. Choosing wrong doesn't just cost you performance — it can lock you into an ecosystem that doesn't match your deployment reality.

The Compression Arithmetic

A 7B-parameter model at FP16 weighs roughly 14 GB. INT8 halves that to ~7 GB. INT4 halves it again to ~3.5 GB. For a 3.8B model like Phi-3.5-mini, the drop is starker: 15 GB at FP32 collapses to about 2 GB after four-bit compression. That's the gap between "needs a data center rack" and "runs on a midrange phone."
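That arithmetic is worth making explicit. A minimal sketch, counting weight storage only and ignoring the few percent of overhead that real formats spend on per-group scales and zero points:

```python
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage: parameter count x bit width.
    Real quantized formats add a few percent for scales/zero points."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B   @ {label}: {model_size_gb(7.0, bits):5.1f} GB")
    print(f"3.8B @ {label}: {model_size_gb(3.8, bits):5.1f} GB")
```

Running it reproduces the numbers above: 14 GB for 7B at FP16, 15.2 GB for 3.8B at FP32, and just under 2 GB for 3.8B at four bits.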

The format you'll see everywhere is W4A16 — four-bit weights, sixteen-bit activations. At batch size 1, the typical edge scenario, generation is memory-bandwidth-bound rather than compute-bound. Arithmetic intensity hovers around 1 FLOP per byte transferred. Cutting weight size directly cuts latency. This isn't theoretical — it's why virtually every on-device deployment conversation in 2026 starts with "which quantization method?"
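Because batch-1 decode streams the entire weight set through memory for every generated token, a back-of-envelope throughput ceiling falls out directly. The bandwidth figure below is illustrative, not a measured device:

```python
def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode throughput: each token reads every
    weight once, so tokens/s <= bandwidth / weight bytes. KV-cache
    traffic and compute only lower this ceiling further."""
    return bandwidth_gb_s / weights_gb

BW = 50.0  # hypothetical phone-class LPDDR bandwidth, GB/s
for label, gb in [("FP16, 7B", 14.0), ("INT4, 7B", 3.5)]:
    print(f"{label}: <= {decode_ceiling_tok_s(gb, BW):.1f} tok/s")
```

The 4x reduction in weight bytes translates one-for-one into a 4x higher latency ceiling, which is the whole argument for W4A16 in one division.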

Three Algorithms, Three Philosophies

GPTQ (post-training quantization for Generative Pre-trained Transformers, the model family that gave it its name) was the first method to prove that 175B+ parameter models could survive 3-4 bit compression. Published in October 2022, it operates layer by layer, using second-order information from the Hessian matrix to decide which weights tolerate lower precision. Think of it as surgical: it analyzes sensitivity per weight group, reconstructing each layer to minimize accumulated quantization error. The catch is calibration-data dependency. When your calibration distribution matches the evaluation distribution, GPTQ is nearly untouchable. When it doesn't, perplexity degrades by 2.3 to 4.9 points — a gap that's hard to diagnose in production because standard metrics won't flag it.

AWQ (Activation-Aware Weight Quantization) shipped eight months later with a fundamentally different bet. Instead of layer reconstruction, it identifies the roughly 1% of weight channels that disproportionately affect output — the ones corresponding to the largest activation magnitudes — and applies per-channel scaling before quantization. No backpropagation step, no Hessian computation. The payoff is resilience to distribution mismatch: when calibration data doesn't perfectly represent production traffic, AWQ loses only 0.5-0.6 perplexity points where GPTQ loses several times that.
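The channel-scaling trick is easy to sketch. Again this is illustrative: real AWQ grid-searches the scaling exponent per layer and quantizes in groups, while the version below hard-codes the exponent and uses one scale per output row.

```python
import numpy as np

def awq_quantize(W, X, bits=4, alpha=0.5):
    """Toy AWQ-style sketch: scale each input channel of W up by a power
    of its mean activation magnitude before rounding, so the few salient
    channels keep finer effective precision. The inverse scale would be
    folded into the preceding op at inference; here it is just applied
    back. W: (out, in), X: (in, n_samples). alpha=0.5 stands in for
    AWQ's per-layer scale search."""
    s = np.abs(X).mean(axis=1) ** alpha     # per-input-channel saliency
    s /= s.mean()                           # keep scales centered near 1
    qmax = 2 ** (bits - 1) - 1
    Ws = W * s                              # protect salient channels
    step = np.abs(Ws).max(axis=1, keepdims=True) / qmax
    Q = np.clip(np.round(Ws / step), -qmax - 1, qmax) * step
    return Q / s                            # effective dequantized weight

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 128)) * np.linspace(0.1, 3.0, 16)[:, None]  # skewed channels
W = rng.normal(size=(8, 16))
Wq = awq_quantize(W, X)
print("output error:", np.linalg.norm((W - Wq) @ X))
```

Note what is absent: no Hessian, no inverse, no sequential column loop. Calibration data enters only through a per-channel average, which is why a skewed calibration set hurts AWQ far less than GPTQ.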

GGUF is the pragmatist's choice. Less an algorithm than a container format built around llama.cpp, it uses block-level mixed-precision quantization. Q4_K_M — the variant most people reach for — keeps most tensors in 4-bit blocks but promotes the most sensitive ones (portions of the attention and feed-forward weights) to higher precision, with per-block scaling factors throughout. It runs on CPUs, Apple Metal, CUDA, and basically anything with a C++ compiler. The cost is slightly larger files and, on NVIDIA hardware, lower throughput than GPTQ's tensor-core-optimized code path.
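Block-level quantization itself is simple to sketch. Real Q4_K groups blocks into super-blocks whose scales are themselves quantized; this toy version omits that and stores one full-precision scale per block.

```python
import numpy as np

def block_quantize(w, block=32, bits=4):
    """Toy block-wise quantization in the spirit of GGUF's Q4 family:
    split the flat weight vector into fixed-size blocks and store one
    scale per block plus 4-bit codes."""
    w = w.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                     # avoid divide-by-zero on empty blocks
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return (codes * scale).reshape(-1)

w = np.random.default_rng(2).normal(size=(1024,))
codes, scale = block_quantize(w)
print("max abs error:", np.abs(w - dequantize(codes, scale)).max())
```

With one 16-bit scale per 32 weights, this layout costs about 4.5 bits per weight, which is why Q4 files come out slightly larger than the pure 4-bit arithmetic suggests.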

What the Benchmarks Say

Microsoft's February 2026 evaluation on Phi-3.5-mini running a function-calling task provides a controlled comparison:

| Method        | Compressed Size | Macro F1 | Micro F1 |
|---------------|-----------------|----------|----------|
| Baseline FP32 | 15 GB           | 0.873    | 0.871    |
| GPTQ INT4     | 2.0 GB          | 0.866    | 0.866    |
| AWQ INT4      | 2.1 GB          | 0.859    | 0.857    |

GPTQ wins here — but the calibration set was meticulously stratified to match the evaluation domain. Zoom out to broader benchmarks and the picture flips. On Llama 3.1 8B, AWQ scores 86.8 accuracy versus GPTQ's 84.7. On Mistral 7B: 84.6 versus 83.1. GGUF Q4_K_M splits the difference in both cases, landing at 85.9 and 83.8, with the lowest median absolute error of the three at 0.036.

The pattern is consistent: GPTQ dominates narrow, well-calibrated evaluations. AWQ dominates messy, heterogeneous ones.

Matching Format to Silicon

Here's where the decision gets concrete:

| Target Hardware           | Recommended           | Rationale |
|---------------------------|-----------------------|-----------|
| Qualcomm Hexagon NPU      | GPTQ → ONNX via Olive | QNNExecutionProvider has native INT4; Olive's auto-opt handles conversion |
| NVIDIA RTX desktop        | GPTQ 4-bit            | Tensor core acceleration, ~20% throughput advantage over GGUF |
| Apple Silicon             | GGUF Q4_K_M           | Mature Metal backend in llama.cpp, no CUDA dependency |
| AMD ROCm                  | AWQ 4-bit             | vLLM with ROCm 6 is the cleanest path |
| Unknown or mixed hardware | GGUF                  | Runs everywhere without specialized kernels |

Qualcomm's Hexagon NPU is worth calling out specifically. The HMX tensor unit handles INT4 natively — not emulated through paired INT8 operations — which doubles tensor throughput on four-bit layers. The recently shipped XNNPACK HVX backend further accelerates transformer attention on-device. If you're targeting Android flagships with Snapdragon 8 Gen 5, the conversion pipeline through Microsoft Olive (quantize with GPTQ, export to ONNX, optimize for QNN) is the most battle-tested path right now.

Calibration Is Where Models Quietly Break

Every quantization guide says "use representative calibration data." Almost none explain the failure mode when you don't. With GPTQ in particular, calibrating on generic web text for a model that will handle domain-specific tasks — medical coding, legal extraction, structured function calls — can silently wreck output quality in ways perplexity won't catch. Microsoft found that 256 well-stratified samples beat larger but noisier calibration sets.

AWQ handles this more gracefully, but it's not bulletproof either. The practical advice: pull 200-300 real samples from production logs. If you don't have production logs yet, you probably don't need INT4 yet either — start with FP16 and earn the right to compress.
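Pulling those samples is mostly bookkeeping; the one decision that matters is stratifying by task rather than grabbing the most recent N requests. A sketch, assuming a hypothetical JSON-lines log schema with `task` and `prompt` fields per line:

```python
import json
import random
from collections import defaultdict

def build_calibration_set(log_path, n=256, seed=0):
    """Illustrative sketch: stratified-sample calibration prompts from a
    production log so every task type is represented, instead of taking
    the n most recent requests. The 'task'/'prompt' field names are a
    hypothetical schema, not a standard."""
    by_task = defaultdict(list)
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            by_task[rec["task"]].append(rec["prompt"])
    rng = random.Random(seed)
    per_task = max(1, n // len(by_task))    # even quota across task types
    sample = []
    for prompts in by_task.values():
        sample.extend(rng.sample(prompts, min(per_task, len(prompts))))
    rng.shuffle(sample)
    return sample[:n]
```

The even per-task quota is the stratification: a task that makes up 2% of traffic still gets a full share of calibration slots, so the quantizer sees its activation patterns.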

So Which One?

GPTQ when you control calibration and ship to NVIDIA or Qualcomm silicon. AWQ when your traffic is unpredictable and you'd rather absorb a small accuracy dip than risk a catastrophic one. GGUF when you need to hand someone a binary and say "it'll work on whatever you've got."

The algorithms themselves haven't changed much since 2023. What's shifted in 2026 is everything around them: native INT4 in mobile NPUs, optimized CUDA kernels, and toolchains like Olive that collapse the conversion pipeline into a single CLI command. The bottleneck for getting a quantized model into production is no longer compute or tooling. It's having the right 256 samples to calibrate against.