Gemma 3n – Multimodal for Edge AI (Released Today)
Date: Jun 26, 2025
Google releases their new Gemma 3n models!✨
Gemma 3n supports audio, vision, video & text and needs just 2GB RAM for fast local inference.
Gemma 3n excels at reasoning, coding & math, and fine-tuning is now also supported in Unsloth. Currently, the GGUFs support text only.
Try it in Google AI Studio:
This code explains the Gemma 3n architecture:
Gemma 3n : MatFormer
So what is so special about Gemma 3n, you ask? It is based on the Matryoshka Transformer (MatFormer) architecture, meaning that each transformer layer/block nests FFNs of progressively smaller sizes. Think of it like progressively smaller cups placed inside one another. Training is done so that at inference time you can choose the size you want and still get most of the performance of the bigger models.
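To make the "cups inside cups" picture concrete, here is a minimal sketch of one nested FFN in plain Python. It assumes the smaller sub-blocks are simply the first `width` hidden neurons of the full FFN and uses a ReLU for simplicity; the sizes, the function names (`make_ffn`, `ffn_forward`, `ffn_param_count`) and the random weights are all made up for illustration and are not taken from Gemma 3n itself.

```python
import random


def make_ffn(d_model, d_hidden, seed=0):
    """Create toy FFN weights: w_in is d_hidden x d_model, w_out is d_model x d_hidden."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_in, w_out


def ffn_forward(x, w_in, w_out, width):
    """Run the FFN using only the first `width` hidden neurons (a nested sub-block)."""
    # Hidden activations for the chosen prefix of neurons, followed by a ReLU.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(w_in[j], x))) for j in range(width)]
    # Project back to d_model using only the matching prefix of output columns.
    return [sum(w_out[i][j] * hidden[j] for j in range(width)) for i in range(len(w_out))]


def ffn_param_count(d_model, width):
    """Weights actually touched when only `width` hidden neurons are used."""
    return 2 * d_model * width


D_MODEL, S = 4, 8                       # hypothetical toy sizes
w_in, w_out = make_ffn(D_MODEL, S)
x = [0.5, -1.0, 0.25, 2.0]

full = ffn_forward(x, w_in, w_out, S)          # the largest cup
half = ffn_forward(x, w_in, w_out, S // 2)     # nested S/2 sub-block
quarter = ffn_forward(x, w_in, w_out, S // 4)  # nested S/4 sub-block

for width in (S, S // 2, S // 4):
    print(width, ffn_param_count(D_MODEL, width))
```

All three outputs come from the same weight matrices; the smaller sub-blocks just read a prefix of them, which is why no separate small model has to be stored.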
There is also Per-Layer Embedding (PLE), which can be cached to reduce memory usage at inference time. So the 2B model (E2B) is a sub-network inside the 4B (aka 5.44B) model, achieved by both Per-Layer Embedding caching and skipping the audio and vision components to focus solely on text.
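As a rough illustration of why caching embedding lookups saves memory, here is a toy sketch. It is not Gemma 3n's actual implementation: the "slow store" is a made-up function standing in for CPU RAM or disk, the sizes are arbitrary, and the cache is just Python's `functools.lru_cache`. The only point is that rows which are actually used stay in fast memory, while the full per-layer tables do not have to.

```python
from functools import lru_cache

NUM_LAYERS = 3      # hypothetical toy sizes
PLE_DIM = 4

slow_store_reads = 0  # counts simulated fetches from slow storage


def load_row_from_slow_store(layer, token_id):
    """Stand-in for reading one per-layer embedding row from CPU RAM or disk."""
    global slow_store_reads
    slow_store_reads += 1
    # Deterministic fake values so the example needs no weight file.
    return tuple(((layer + 1) * (token_id + 1) * (k + 1)) % 7 / 7.0 for k in range(PLE_DIM))


@lru_cache(maxsize=256)
def per_layer_embedding(layer, token_id):
    """Cached lookup: repeated (layer, token) pairs never hit the slow store again."""
    return load_row_from_slow_store(layer, token_id)


def embed_tokens(token_ids):
    """Fetch the per-layer rows for every token, layer by layer."""
    return [[per_layer_embedding(layer, t) for t in token_ids] for layer in range(NUM_LAYERS)]


embed_tokens([5, 9, 5, 5])  # 2 unique tokens x 3 layers
embed_tokens([9, 5])        # every row is already cached
print(slow_store_reads)
```

The second call triggers no new slow-store reads, so the count only reflects the unique (layer, token) pairs seen so far.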
The MatFormer architecture is typically trained with exponentially spaced sub-models, i.e. of sizes S, S/2, S/4, S/8, etc. in each of the layers. At training time, inputs are randomly forwarded through one of these sub-blocks, giving every sub-block an equal chance to learn. The advantage is that, at inference time, if you want the model to be 1/4 of the original size, you can pick the S/4-sized sub-blocks in each layer.
You can also choose to Mix and Match, where you pick, say, the S/4-sized sub-block of one layer, the S/2-sized sub-block of another layer, and the S/8-sized sub-block of yet another. In fact, you can change the sub-models you pick based on the input itself if you fancy. Basically, it's a choose-your-own structure at every layer. So by training a model of just one particular size, you are creating exponentially many models of smaller sizes. No learning goes to waste. Pretty neat, huh?
Explanation (from the Chinese version)
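- Matryoshka Transformer (MatFormer) architecture
  - The name comes from the Matryoshka nesting doll: each Transformer block "nests" several feed-forward networks (FFNs) of progressively smaller sizes, like cups stacked inside one another.
  - During training, inputs are randomly routed through sub-networks (sub-models) of different sizes, so that every sub-model gets a chance to learn useful features.
- Flexible trade-off between inference speed and quality
  - At inference time, you can pick a large cup (stronger, slightly slower) or a small cup (faster, slightly weaker) in each layer, as needed.
  - For example, if you want a model "1/4 the size of the original", just use the S/4 sub-model in every layer.
- Per-Layer Embedding caching
  - Each layer's token embeddings (or embeddings for other modalities) can be precomputed and cached, then reused directly at the next inference, which greatly reduces GPU memory and compute overhead.
  - This lets the "2B model (E2B)" exist as a sub-network inside the "4B/5.44B model": only the text path is enabled, the audio and vision parts are skipped, and the cached embeddings are reused, which keeps performance while making the model lighter.
- Exponentially many sub-model combinations
  - In theory, if each layer has N sub-models of different sizes, the whole model corresponds to Nⁿ combinations (n is the number of layers), which you can "Mix & Match" to suit the scenario:
    - For example, S/4 in layer 1, S/2 in layer 2, S/8 in layer 3, and so on.
    - You can even choose the most suitable sub-model combination dynamically, per input.
  - In other words, you train once and get countless models of different scales; no learning is wasted.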
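The random training-time routing and the Mix & Match counting described above can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the full width `S = 2048`, the six layers, and the helper names are hypothetical, and summing FFN widths is just a rough proxy for sub-model size.

```python
import random
from collections import Counter

S = 2048                                    # hypothetical full FFN width
GRANULARITIES = [S, S // 2, S // 4, S // 8]
NUM_LAYERS = 6                              # hypothetical layer count


def sample_training_width(rng):
    """Pick one granularity uniformly at random for a training step."""
    return rng.choice(GRANULARITIES)


def count_submodels(num_granularities, num_layers):
    """Number of per-layer Mix & Match combinations: N ** n."""
    return num_granularities ** num_layers


def total_ffn_width(per_layer_widths):
    """Sum of the chosen FFN widths, a rough proxy for sub-model size."""
    return sum(per_layer_widths)


# Training: every granularity is sampled roughly equally often.
rng = random.Random(0)
counts = Counter(sample_training_width(rng) for _ in range(4000))
print(sorted(counts.items()))

# Inference: either one size everywhere, or a different size per layer.
uniform_quarter = [S // 4] * NUM_LAYERS
mixed = [S // 4, S // 2, S // 8, S // 4, S // 2, S // 8]

print(count_submodels(len(GRANULARITIES), NUM_LAYERS))
print(total_ffn_width(uniform_quarter), total_ffn_width(mixed))
```

With four granularities and six layers there are already 4 ** 6 = 4096 possible per-layer combinations, which is where the "exponentially many models" claim comes from.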

This code explains the MatFormer architecture:
This code downloads Gemma 3n 4B and runs it: