
Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Today we introduce Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our user-friendly E4B and our more advanced Mixture of Experts (MoE) 26B, Gemma 4 12B delivers powerful capabilities in a smaller memory footprint. It's also our first mid-size model with native audio input.

Thanks to the developer community, Gemma 4 models have now exceeded 150 million downloads. You've created everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We're excited to see what you build with this latest addition.

Here's a look at what makes Gemma 4 12B unique:

- New unified architecture: no multimodal encoders. Visual and audio inputs flow directly into the main LLM backbone.

- Advanced reasoning: Benchmark performance close to that of our 26B model, paving the way for powerful multi-step reasoning and agentic workflows.

- Laptop ready: Small enough to run locally with just 16 GB of VRAM or unified memory.

- Open and accessible: released under an Apache 2.0 license with support across the developer ecosystem.

- Drafter ready: Gemma 4 12B comes with Multi-Token Prediction (MTP) drafters to reduce latency.

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Now let's take a closer look at how Gemma 4 12B achieves this.

Run edge agents locally

Gemma 4 12B performs close to our larger 26B MoE model on standard benchmarks, at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16 GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
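For a rough sense of why a model of this size can fit on a laptop, here is a back-of-the-envelope sketch of weight memory at different precisions. It assumes "12B" means exactly 12 billion parameters and counts only the weights, ignoring the KV cache, activations, and runtime overhead. The precisions listed are illustrative and are not a statement of which formats ship with the model.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "12B" is taken as exactly 12 billion parameters.
NUM_PARAMS = 12e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# 16-bit: 24.0 GB
# 8-bit: 12.0 GB
# 4-bit: 6.0 GB
```

Under these assumptions, 16-bit weights alone would exceed a 16 GB budget, while 8-bit or 4-bit weights would leave headroom for context and runtime state.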

Discover a unique unified and efficient architecture

What sets Gemma 4 12B apart is its streamlined approach to processing visual and audio input. Traditional multimodal models typically rely on separate encoders to transform images and audio before passing those representations to the language model. Because these separate encoders add latency and increase memory usage, we trained Gemma 4 12B with an encoder-free architecture that integrates audio and visual inputs directly.

Here is how Gemma 4 12B natively handles multimodal inputs:

- Vision: We replaced Gemma 4's vision encoder with a lightweight embedding module consisting of a single matrix multiplication, a positional embedding, and normalization. This allows the LLM backbone to take over visual processing.

- Audio: We've made audio processing even simpler. We removed the audio encoder entirely and instead project the raw audio signal into the same dimensional space as the text tokens.
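To make the idea concrete, here is a minimal toy sketch of what encoder-free input handling can look like. It is not the Gemma 4 12B implementation. The tiny dimensions, the patch and frame sizes, the order of projection, position term, and normalization, the use of RMS normalization, and every function name are assumptions chosen for illustration.

```python
import math
import random

D_MODEL = 8   # hypothetical model width (real models are far wider)
PATCH = 2     # hypothetical patch side length, in pixels
FRAME = 4     # hypothetical audio frame length, in samples


def matvec(weights, vec):
    """Project one input vector with a (len(vec) x width) weight matrix."""
    return [sum(v * weights[i][j] for i, v in enumerate(vec))
            for j in range(len(weights[0]))]


def rms_norm(vec, eps=1e-6):
    """Rescale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def embed_image(image, w_patch, pos_emb):
    """Cut a grayscale image into patches and embed each with one matmul."""
    tokens = []
    for top in range(0, len(image), PATCH):
        for left in range(0, len(image[0]), PATCH):
            patch = [image[top + dy][left + dx]
                     for dy in range(PATCH) for dx in range(PATCH)]
            projected = matvec(w_patch, patch)            # single matmul
            positioned = [p + q for p, q in zip(projected, pos_emb[len(tokens)])]
            tokens.append(rms_norm(positioned))           # normalization
    return tokens


def embed_audio(waveform, w_audio):
    """Slice the raw waveform into frames and project each to model width."""
    return [matvec(w_audio, waveform[start:start + FRAME])
            for start in range(0, len(waveform) - FRAME + 1, FRAME)]


rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]   # 4x4 pixels
waveform = [rng.uniform(-1.0, 1.0) for _ in range(12)]         # 12 samples

image_tokens = embed_image(image,
                           random_matrix(PATCH * PATCH, D_MODEL, rng),
                           random_matrix(4, D_MODEL, rng))
audio_tokens = embed_audio(waveform, random_matrix(FRAME, D_MODEL, rng))
text_tokens = random_matrix(3, D_MODEL, rng)  # stand-in for text embeddings

# Every modality ends up as vectors of the same width, so the language
# model can consume them as a single sequence.
sequence = image_tokens + audio_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 10 8
```

The point of the sketch is the shape of the computation: with no separate encoder, each modality needs only a cheap projection before its tokens join the text tokens in one sequence.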

For developers who want a detailed analysis, head over to our Gemma 4 12B developer guide.

Get started today

- Try it yourself: experiment with just a few clicks in LM Studio, Ollama, the Google AI Edge Gallery app, the Google AI Edge Eloquent app, and the LiteRT-LM CLI.

- Download weights: Get pre-trained and instruction-tuned checkpoints directly from Hugging Face and Kaggle.

- Integrate and learn: View the developer documentation and quickstart notebook.

- Use your favorite development tools: implement local inference pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune efficiently using Unsloth.

- Unlock agent development with Gemma skills: To help agents build with the latest Gemma advancements, we are releasing our official skills framework, a skill library designed specifically for agents working with Gemma models.

- Deploy your way: Run production endpoints on Google Cloud through Gemini Enterprise Agent Platform Model Garden, Cloud Run, and GKE.

![Gemma 4 12B: a unified, encoder-free multimodal model](https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Social_Image_G4_12B.width-1300.png)