Two days ago, Google DeepMind dropped Gemma 4 in four sizes. The 31B dense and 26B MoE variants are fine — another round of open-weight heavyweights to add to the pile. What caught my attention were the two smallest: E2B and E4B. The "E" stands for "effective," and behind that naming choice is an architecture trick that squeezes multimodal capabilities — images, audio, video — into a package that runs on a Raspberry Pi 5 at 133 tokens per second prefill. Under Apache 2.0. With a 128K context window.
That's not a toy demo. That's a deployment target.
## Per-Layer Embeddings: How Google Compressed a 5B Model Into a 2B Runtime
The headline trick is called Per-Layer Embeddings (PLE). In a standard transformer, tokens get embedded once at the input and that representation propagates through every layer. PLE adds a second embedding table that feeds a residual signal into every decoder layer — part token-identity lookup, part learned projection of the main embeddings.
The E2B variant has 2.3 billion active parameters during inference but 5.1 billion total when you count the embedding tables. Google's argument: the architecture carries the "representational depth" of a 5B system while only needing the compute budget of a 2.3B one. Whether you buy that framing or not, the benchmarks suggest something real is happening. E2B scores 60% on MMLU Pro — that's not a 2B-class number.
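To make the mechanism concrete, here's a minimal NumPy sketch of the PLE idea. Everything here is illustrative: the dimensions, the names (`ple_tables`, `ple_projs`), and the `tanh` stand-in for the attention/MLP body are mine, not Gemma 4's actual design — the point is only that each layer re-injects a token-identity signal into the residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D_MODEL, D_PLE, N_LAYERS = 100, 16, 4, 3  # toy sizes

# Standard input embedding, used once at the bottom of the stack.
input_table = rng.normal(size=(VOCAB, D_MODEL))

# Per-layer embeddings: each layer owns a small table plus an
# up-projection into the model dimension.
ple_tables = [rng.normal(size=(VOCAB, D_PLE)) for _ in range(N_LAYERS)]
ple_projs = [rng.normal(size=(D_PLE, D_MODEL)) for _ in range(N_LAYERS)]

def layer_body(h):
    """Stand-in for attention + MLP; any residual transform works here."""
    return np.tanh(h)

def forward(token_ids):
    h = input_table[token_ids]
    for table, proj in zip(ple_tables, ple_projs):
        # Re-inject token identity at this depth, then run the layer.
        h = h + table[token_ids] @ proj
        h = h + layer_body(h)
    return h

out = forward(np.array([3, 7, 7, 42]))
print(out.shape)  # (4, 16)
```

Note how the per-layer tables add parameters without adding per-token compute proportional to `D_MODEL`: each lookup is a cheap gather plus a small projection, which is why the extra billions of embedding parameters don't show up in the "active" count.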
PLE isn't the only efficiency trick. The smaller Gemma 4 variants use alternating attention: local sliding-window layers (512-token window) interleaved with global full-context layers. Sliding layers use standard RoPE, global layers use proportional RoPE for the 128K context reach. A shared KV cache — where later layers reuse K/V tensors from earlier layers of the same attention type — cuts memory further with minimal quality loss.
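The interleaving and masking can be sketched in a few lines. The 512-token window comes from the article; the 4:1 local-to-global ratio is my assumption for illustration, since the exact interleave isn't stated here:

```python
import numpy as np

WINDOW = 512        # local sliding-window span (from the article)
GLOBAL_EVERY = 4    # local:global interleave ratio -- assumed, not confirmed

def layer_kinds(n_layers, global_every=GLOBAL_EVERY):
    """Every global_every-th layer attends over the full context;
    the rest use a local sliding window."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(n_layers)]

def attention_mask(seq_len, kind, window=WINDOW):
    """Boolean mask: True where a query position may attend to a key."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q                       # never look ahead
    if kind == "local":
        return causal & (q - k < window)  # never look back past the window
    return causal

kinds = layer_kinds(8)
local = attention_mask(2048, "local")
print(kinds)
print(int(local[2047].sum()))  # a local layer sees at most 512 positions
```

The KV-cache saving falls out of the mask: local layers only ever need the last 512 K/V entries, and with sharing, later layers of a given type don't store their own K/V at all.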
The net result: multimodal input (image, audio, video), 128K context, and genuine reasoning capability in under 1.5 GB of RAM at 4-bit quantization.
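Quick arithmetic on where the 1.5 GB figure plausibly comes from. The parameter counts are from the article; the inference that the remaining PLE parameters are served from storage rather than held resident is mine:

```python
# Back-of-envelope check on the headline RAM figure.
active_params = 2.3e9          # E2B active parameters
bits_per_weight = 4            # 4-bit quantization
weights_gb = active_params * bits_per_weight / 8 / 1e9
print(f"{weights_gb:.2f} GB of resident weights")  # 1.15 GB
# That leaves ~0.35 GB of the quoted 1.5 GB budget for activations and
# KV cache. The remaining ~2.8B embedding parameters would have to live
# outside fast RAM -- plausible, since embedding lookups are sparse gathers.
```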
## Specs at a Glance
| Spec | E2B | E4B |
|---|---|---|
| Active params | 2.3B | 4.5B |
| Total params (w/ embeddings) | 5.1B | 8B |
| Context window | 128K | 128K |
| RAM (4-bit quant) | ~1.5 GB | ~5 GB |
| Modalities | Text, image, audio, video | Text, image, audio, video |
| MMLU Pro | 60.0% | 69.4% |
| LiveCodeBench v6 | 44.0% | 52.0% |
| License | Apache 2.0 | Apache 2.0 |
On the Pi 5, E2B hits 7.6 tok/s decode — usable for interactive tasks, if not blazing. Google also claims it can process 4,000 input tokens across two agentic tool calls in under three seconds via LiteRT-LM GPU optimizations on mobile hardware.
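Those claims imply some useful latency arithmetic (all inputs taken from the numbers above):

```python
# Implied mobile prefill rate: 4,000 tokens in under 3 seconds.
mobile_tokens, mobile_seconds = 4000, 3.0
print(f"mobile prefill >= {mobile_tokens / mobile_seconds:.0f} tok/s")  # 1333

# On the Pi 5, a 200-token reply at 7.6 tok/s decode takes roughly:
pi_decode, reply_tokens = 7.6, 200
print(f"~{reply_tokens / pi_decode:.0f} s")  # ~26 s
```

So "usable for interactive tasks" on the Pi means short answers; the mobile GPU path is roughly an order of magnitude faster at ingesting context.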
## Where It Runs
Deployment options are broader than you'd expect for something this fresh. Hugging Face Transformers, llama.cpp with GGUF quants from Q2 through Q8, transformers.js for WebGPU browser inference, MLX on Apple Silicon with TurboQuant, ONNX for cross-platform, and mistral.rs for Rust-native inference. On Android, there's direct integration through AICore — the system-level inference service that ships on recent Pixel devices.
The hardware list: NVIDIA CUDA, Apple Metal, WebGPU, CPU, Jetson, Raspberry Pi, and Qualcomm IQ8 NPUs. Google explicitly partnered with Qualcomm and MediaTek to optimize the 2B and 4B variants for their mobile silicon.
Minimal setup with Transformers:
```python
from transformers import pipeline

pipe = pipeline("any-to-any", model="google/gemma-4-e2b-it")
output = pipe([{
    "role": "user",
    "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": "What's in this image?"},
    ],
}], max_new_tokens=200)
```
A handful of lines to multimodal inference from something that fits on a phone. The any-to-any pipeline handles text, image, audio, and video inputs through a single interface.
## Stacking It Against the Competition
The sub-5B edge space is crowded now. Phi-4 mini (3.8B) dominates math benchmarks and has solid function-calling support, but it's English-only and text-only. Llama 3.2 3B is the reliable all-rounder — easy to fine-tune, fast, good instruction following — but also text-only at that size. Gemma 3's 4B added multimodal support, but the new E2B undercuts it on both footprint and capability.
What makes the E2B interesting is the combination: multimodal input including audio, 128K context, native tool use, and a genuine sub-2GB memory footprint. No other offering in this weight class bundles all four. Phi-4 has function calling but no vision. Llama 3.2 1B is smaller but dramatically less capable. The tradeoff is that E2B's raw text-only performance won't beat Phi-4 mini on math-heavy tasks — the AIME 2026 score of 37.5% is decent but not remarkable.
The real question isn't whether one is "better" than another in isolation. It's whether multimodal capability at this size enables edge deployment patterns that weren't viable before. A smart camera that understands speech commands. A field inspection tool that processes photos and voice notes without connectivity. A wearable that does real-time audio and visual context simultaneously. These use cases previously required separate specialized models stitched together. A single 1.5 GB package handling all of it changes the engineering calculus.
## What I'm Watching
Two things will determine if this matters beyond benchmark slides. First: how well the 4-bit quantized versions hold up on messy real-world inputs. Google's numbers look good, but quantization artifacts in multimodal architectures are poorly understood — audio fidelity especially. A model that transcribes clean speech perfectly but falls apart on noisy factory-floor audio isn't useful for the edge deployments Google is targeting.
Second: whether the Qualcomm and MediaTek NPU optimizations deliver meaningful speedups over generic GPU inference. The Raspberry Pi throughput numbers suggest the architecture is efficient on general hardware, but mobile NPUs are a different beast with their own operator coverage gaps and memory hierarchies.
Gemma 4 E2B is the first sub-2GB option I'd seriously evaluate for a multimodal edge deployment in production. That's a sentence I couldn't have written six months ago.