
Gemma 4 QAT Models: Optimizing Compression for Mobile and Laptop Efficiency

Since Gemma 4 launched two months ago, we have been working continuously to expand its capabilities. First, we introduced multi-token prediction (MTP) to speed up inference, and just a few days ago we released the 12B model to bridge the gap between the E4B and 26B MoE models.

Today, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 more efficient, so you can run the models locally on everyday edge devices and consumer GPUs.

By simulating quantization during training, QAT minimizes the quality loss that occurs when a model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format and a new quantization format for mobile use cases. With the mobile format, we reduce the memory footprint of Gemma 4 E2B to 1 GB. In short, these checkpoints significantly reduce memory requirements while retaining the capabilities and quality you expect from Gemma 4.

Preserving model quality while reducing size

Quantization is a key technique for running models on consumer hardware: it reduces the memory footprint while speeding up decoding. However, standard post-training quantization (PTQ) typically costs some quality. Rather than simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ can be effective at preserving quality, our QAT checkpoints deliver higher overall quality than standard PTQ baselines.
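To make "integrating quantization into training" concrete, here is a minimal sketch of the general idea, assuming a symmetric per-tensor grid and a straight-through estimator. It is a toy illustration, not the recipe used to train the Gemma 4 checkpoints; the function names `fake_quantize` and `qat_step` are hypothetical.

```python
def fake_quantize(weights, num_bits=4):
    """Snap floats onto a symmetric integer grid, then map them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1                 # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax                         # one scale for the whole tensor
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def qat_step(weights, inputs, target, lr=0.05, num_bits=4):
    """One SGD step on a squared error, with quantization simulated in the forward pass."""
    w_q = fake_quantize(weights, num_bits)         # the forward pass sees quantized weights
    pred = sum(w * x for w, x in zip(w_q, inputs))
    err = pred - target
    # Straight-through estimator (an assumption of this sketch): treat the
    # rounding step as identity in the backward pass, so the gradient computed
    # for w_q is applied to the full-precision weights.
    return [w - lr * 2.0 * err * x for w, x in zip(weights, inputs)]


# Toy usage: the full-precision weights adapt to the rounding they will undergo.
weights = [0.9, -0.3, 0.05]
for _ in range(100):
    weights = qat_step(weights, inputs=[1.0, 2.0, -1.0], target=0.5)
```

PTQ, by contrast, would train without `fake_quantize` and apply the rounding only once at the end, so the weights never get a chance to compensate for it.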

We apply this QAT recipe to the popular Q4_0 format to maximize performance across all models. For the edge models (E2B and E4B), we rethought quantization and designed a dedicated, mobile-specific quantization scheme.
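For readers unfamiliar with Q4_0, the sketch below illustrates block-wise 4-bit quantization in the style of that format. It assumes blocks of 32 weights that share one 16-bit scale, with 4-bit codes offset by 8; this is a simplified illustration and is not guaranteed to be bit-exact with any particular runtime. All names are hypothetical.

```python
BLOCK_SIZE = 32  # assumed number of weights sharing one scale


def quantize_block(block):
    """Quantize one block of floats to 4-bit codes in [0, 15] plus one scale."""
    extreme = max(block, key=abs)                  # largest magnitude, sign kept
    if extreme == 0.0:
        return 0.0, [8] * len(block)
    scale = extreme / -8.0                         # the extreme value maps to code 0
    codes = [min(15, max(0, int(x / scale + 8.5))) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [(c - 8) * scale for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE, scale_bits=16, code_bits=4):
    """Effective storage cost once the per-block scale is amortized."""
    return (scale_bits + block_size * code_bits) / block_size


block = [0.1 * i - 1.6 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
```

Under these assumptions, each weight costs 4.5 bits rather than exactly 4, because every block also stores its scale.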

Save VRAM and storage space

The following are approximate memory requirements, indicating the amount of VRAM needed to load each model:
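Independent of the exact figures, a back-of-the-envelope estimate of weight memory is parameter count times bits per weight. The sketch below shows the arithmetic; the 4-billion parameter count is a hypothetical round number and not an official Gemma 4 figure, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


hypothetical_params = 4e9                               # illustrative only
bf16_gb = weight_memory_gb(hypothetical_params, 16)     # 16-bit weights
q4_gb = weight_memory_gb(hypothetical_params, 4.5)      # ~4.5 bits with per-block scales
print(f"16-bit: {bf16_gb:.2f} GB, 4-bit blocks: {q4_gb:.2f} GB")
```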

Ground-up optimization for mobile devices

Standard compression schemes often run inefficiently on mobile processors. To ensure Gemma 4 works well on mobile devices, we designed a custom mobile quantization scheme for edge hardware:

- Static activations: Normally, models spend compute at runtime working out how to scale their activations dynamically. We precompute these scales during training, which reduces the workload on mobile chips and makes responses faster.

- Channel-wise quantization: We structure the compressed data to match the design of mobile accelerators. This lets phones run the model natively, without slow workarounds.

- Targeted 2-bit quantization: We aggressively compress (down to 2 bits) specific parts of the model involved in token generation, while keeping the core reasoning layers at higher precision. This saves storage space without reducing the model's intelligence.

- Embedding and KV cache optimization: We also target the model's vocabulary (its embeddings) and its short-term memory (the KV cache). This significantly reduces active memory and lets you hold long conversations without running out of space.
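Two of the ideas above, channel-wise scales and static activation scales, can be illustrated with a small sketch. It assumes simple symmetric integer quantization and derives the static scale from calibration data as a stand-in for scales fixed during training; it does not reproduce the actual mobile format, and every name in it is hypothetical.

```python
def symmetric_scale(values, num_bits):
    """Scale that maps the largest magnitude onto the top of the integer grid."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_dequantize(values, scale, num_bits):
    """Round onto the integer grid (with clipping) and map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def max_row_errors(rows, scales, num_bits=4):
    """Worst round-trip error per row, given one scale per row."""
    errors = []
    for row, scale in zip(rows, scales):
        restored = quantize_dequantize(row, scale, num_bits)
        errors.append(max(abs(a - b) for a, b in zip(row, restored)))
    return errors


class StaticActivationQuantizer:
    """Fixes the activation scale ahead of time instead of per call."""

    def __init__(self, calibration_batches, num_bits=8):
        self.num_bits = num_bits
        flat = [a for batch in calibration_batches for a in batch]
        self.scale = symmetric_scale(flat, num_bits)   # computed once, offline

    def __call__(self, activations):
        # No pass over the live tensor to find its range: the scale is known.
        return quantize_dequantize(activations, self.scale, self.num_bits)


# Rows (output channels) with very different magnitudes.
rows = [[0.012, -0.02, 0.015], [5.0, -3.0, 4.0]]
shared = symmetric_scale([w for row in rows for w in row], 4)
per_tensor = max_row_errors(rows, [shared, shared])
per_channel = max_row_errors(rows, [symmetric_scale(row, 4) for row in rows])
```

With a single shared scale, the small-magnitude row collapses to zero, while one scale per channel preserves it. The static quantizer trades some flexibility (out-of-range activations are clipped) for skipping the per-call range computation.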

Since many use cases do not need our audio and vision encoders, you can further reduce the memory footprint by deploying only the components you need. For example, the text-only Gemma 4 E2B model (excluding per-layer embeddings) requires less than 1 GB of memory.

Get started today

To make these models easy to use in your preferred workflows, we are working with popular developer tools across the ecosystem to provide seamless support for Gemma 4 QAT checkpoints starting today:

- Download the weights: Access the Q4_0 and mobile model weights on Hugging Face right away. We tailored the formats to your workflow: GGUF for use with llama.cpp, and a compressed format for loading in vLLM. For everything else, we share unquantized checkpoints that can be converted and quantized to Q4_0.

- Integrate and learn: Browse our documentation to learn how best to deploy the QAT checkpoints.

- Try it on desktop: Easily download, manage, and run Gemma 4 QAT models locally on your desktop with user-friendly tools such as llama.cpp, Ollama, and LM Studio.

- Deploy on device: Use Google's lightweight LiteRT-LM runtime for optimized edge deployment, or run models directly on the web with Transformers.js.

- Use your favorite development tools: Serve models more efficiently with SGLang and vLLM, or use MLX, optimized for Apple silicon. Use the MTP QAT checkpoints to keep the MTP speedup while quantizing the model. Fine-tune the weights directly with Hugging Face Transformers and Unsloth.

We can't wait to see what you build with Gemma 4 running locally!

![Gemma 4 QAT Models: Optimizing Compression for Mobile and Laptop Efficiency](https://storage.googleapis.com/gweb-uniblog-publish-prod/images/Hero_Visual_Blog.width-1300.png)