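In July, Apple released the AFM technical report, which covers two multilingual, multimodal large models of 3B and xB parameters. One targets on-device use and the other targets servers; the former focuses on efficiency, the latter on quality.

For the on-device model, the authors split the model into two blocks: Block1 holds $62.5\%$ of the transformer layers and Block2 holds the remaining $37.5\%$. Block2 has no key and value projections; it reuses the KV cache produced by Block1. This cuts KV cache memory usage by $37.5\%$. Because Block2 produces no keys or values, the prefill stage can skip its computation, so TTFT (time to first token) also drops by about $37.5\%$.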
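The memory saving can be checked with a little arithmetic. The sketch below is only an illustration: the layer count, sequence length, head count, and head dimension are hypothetical values chosen for the example, not figures from the report.

```python
def kv_cache_bytes(num_kv_layers, seq_len, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_kv_layers` layers."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * num_kv_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical configuration (assumption, not from the report).
total_layers = 32
block1_layers = int(total_layers * 0.625)  # only Block1 layers own a KV cache

baseline = kv_cache_bytes(total_layers, seq_len=4096, num_kv_heads=8, head_dim=128)
shared = kv_cache_bytes(block1_layers, seq_len=4096, num_kv_heads=8, head_dim=128)
saving = 1 - shared / baseline  # fraction of KV cache memory saved
```

Because the cache size is linear in the number of layers that own keys and values, the saving equals the share of layers in Block2 regardless of the other settings.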

Server Model

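For the server model, the authors modify the architecture to improve efficiency. The architecture is shown in the figure below.

Parallel Track Transformer: The authors propose the Parallel Track (PT) Transformer, which splits a transformer into several smaller transformers called tracks. Each track contains multiple transformer blocks. Tracks interact only at their inputs and outputs, which reduces synchronization overhead. The authors call this scheme track parallelism.

PT-MoE: To further improve the efficiency of the server model, the authors combine MoE with the PT-Transformer. Transformer blocks are grouped in pairs, and each pair contains one dense layer and one MoE layer.

Interleaving Global and Local Attention Layers: The authors also design an interleaved attention scheme. Transformer blocks are grouped in fours; the first three blocks use window attention with a window size of 4096 and RoPE, and the last block uses a global attention layer with NoPE (no positional embedding). The authors argue that NoPE improves generalization to long contexts.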
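The interleaving rule can be written down as a small layer-pattern generator. This is a minimal sketch: `AttentionConfig` and `interleaved_pattern` are hypothetical names introduced for illustration, and only the 3-local-plus-1-global grouping, the 4096 window, and the RoPE/NoPE split come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AttentionConfig:
    kind: str               # "local" (window attention) or "global"
    window: Optional[int]   # window size for local attention, None for global
    positional: str         # "rope" or "nope"

def interleaved_pattern(num_layers: int, group: int = 4, window: int = 4096) -> List[AttentionConfig]:
    """Every `group`-th block is global with NoPE; the others are local with RoPE."""
    pattern = []
    for idx in range(num_layers):
        if (idx + 1) % group == 0:
            pattern.append(AttentionConfig("global", None, "nope"))
        else:
            pattern.append(AttentionConfig("local", window, "rope"))
    return pattern

layers = interleaved_pattern(8)
kinds = [cfg.kind for cfg in layers]
```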

Recall that the ViT in Qwen2.5-VL uses a similar scheme: blocks are grouped in eights, the first seven use window attention, and the last uses full self-attention.

Vision Encoder

The vision encoder consists of two modules: a ViT and an adapter.

For the ViT, the authors use the following backbones:

The server model uses ViT-g with 1B parameters
The on-device model uses a ViTDet-L backbone with 300M parameters

On top of ViTDet, the authors add a Register-Window mechanism, which encodes a global register token that interacts with the different local windows.

The adapter contains a transformer layer, a linear projection layer, and a $3\times 3$ pooling layer. The linear projection maps visual tokens into the LLM feature space, and the pooling layer reduces the number of visual tokens.

Data

The data consists mainly of two parts: web data and image data.

Image data:

Image-Text Crawl Data: 175M interleaved image-text documents containing 550M images
Synthetic Image Caption Data: 5B image captions
Text-Rich Image Data
High-quality Domain-Specific Image-text Data: caption data, grounding data, table, chart, and plot data, and data from knowledge-intensive domains

Training Recipe

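The text tokenizer has a vocabulary of 150K tokens.

The vision encoder is trained in two stages:

Stage 1: CLIP-style training on 6B image-text pairs at an image resolution of 448. The authors also use FLIP to improve training efficiency
Stage 2: joint training of the vision encoder, the adapter, and a compact LLM, with higher-quality data added and an image resolution of 672

The LLM is trained on 13.4T tokens.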

Post-training

SFT

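The SFT data includes:

General knowledge
Reasoning: math and reasoning for text-only data; STEM, math, and CoT data for multimodal data
Text-Rich Image understanding: chart and table data
Multilingual OCR: OCR-related data
Text and visual grounding: grounding data
Multi-image reasoning: multi-image reasoning data

The authors also collect data with a retrieval-based approach: given a set of prompts, an image search pipeline performs the retrieval.

During training, the authors raise the image resolution from 672 to 1344. Each image is split into four sub-images, and an overview (thumbnail) image is added. The vision encoder therefore receives four sub-images plus one thumbnail.

To improve the efficiency of the on-device model, the authors define three modes:

rapid mode: image resolution of 224
balanced mode: thumbnail only
high-resolution mode: four sub-images plus one thumbnail

The mode is assigned per training sample: a low-resolution input image is assigned rapid mode with probability $50\%$, and a high-resolution input image with probability $1\%$. Of the remaining data, $20\%$ is assigned balanced mode.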

RLHF

The authors use RLOO (REINFORCE Leave-One-Out) as the RLHF algorithm.
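For readers unfamiliar with RLOO, its core idea is that each sampled response is compared against the mean reward of the other responses to the same prompt. The sketch below shows only this generic leave-one-out baseline; it is not the report's training code, and `rloo_advantages` is a hypothetical helper name.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k responses sampled from one prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]

advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

The advantages sum to zero within a prompt group, so the baseline needs no separately learned value model.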

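The RL infrastructure is shown in the figure below.

It has two main components:

Trajectory Generators: generate trajectories and provide feedback
Policy updater: updates the policy

The authors first train a reward model. As in AFM-2024, they use a preference loss function together with single-sided grading as regularization.

The data falls into the following categories:

Text-only prompts
Image-text prompts
Math prompts
Image-text STEM reasoning prompts

The first two categories are scored by the reward function, and the last two by a rule-based verifier.

The authors also find that human grades and reward model scores can disagree by $20\%\sim30\%$. To address this, they train a separate reward model dedicated to prompt selection.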

Tool Use

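Tool-use data is hard to collect because it is multi-turn and depends on software tools. To address this, the authors build an interactive annotation platform consisting of an agent and a tool-execution environment. The environment contains tools and databases, executes tool calls, and returns feedback.

During annotation, a user issues a request, the agent executes the tool calls automatically, and the platform returns the resulting trajectory.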

Multilingual

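The authors extend the model's understanding of new languages step by step. By default the input and output languages match, but $0.4\%$ of the data is cross-lingual. In the SFT and RLHF stages, the ratio of English to multilingual data is $80\%:20\%$.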

Optimization

The authors use QAT (quantization-aware training) to compress the on-device model to 2 bits per weight, and apply Adaptive Scalable Texture Compression (ASTC) post-training to obtain a 3.56 bits-per-weight version of the server model.

QAT

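QAT simulates quantization error during training so that the model performs better after quantization. It addresses the larger accuracy loss of conventional post-training quantization and is a key tool for balancing model quality and efficiency.

During training, the authors modify the weights $W$ to mimic quantization:

$$\tilde{W} = s\left(\mathrm{clamp}\left(\left\lfloor \frac{W}{s}+z\right\rceil, q_{\min}, q_{\max}\right) - z\right)$$

Here $s$ is the scaling factor, $z$ is the zero point, and $q_{\min}$, $q_{\max}$ bound the quantization range. Because the rounding operation is not differentiable, the authors approximate its gradient with a straight-through estimator.

The authors also introduce a learnable scaling factor $f$ used to compute the quantization scale:

$$s = \frac{f\cdot \max(|W|)}{q_{\max}}$$

The authors carefully design the initialization of $f$ to keep training robust.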
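The two formulas above can be traced on a handful of weights. The following is a minimal forward-pass sketch under stated assumptions: a symmetric signed integer range with $z = 0$, a per-tensor scale, and Python's built-in `round`; none of these choices are confirmed by the report. The function name `fake_quantize` is hypothetical, and the straight-through estimator is not shown because it only affects the backward pass.

```python
def fake_quantize(weights, f=1.0, bits=2, z=0):
    """Simulate quantization: W~ = s * (clamp(round(W / s + z), q_min, q_max) - z)."""
    q_min = -(2 ** (bits - 1))      # assumed signed range, e.g. -2 for 2 bits
    q_max = 2 ** (bits - 1) - 1     # e.g. 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)        # nothing to quantize
    s = f * max_abs / q_max         # scale from the factor f (learnable in training)
    out = []
    for w in weights:
        q = round(w / s + z)        # note: Python rounds exact halves to even
        q = min(max(q, q_min), q_max)
        out.append(s * (q - z))
    return out

w = [0.6, -1.0, 0.2, 1.0]
full_scale = fake_quantize(w, f=1.0)
half_scale = fake_quantize(w, f=0.5)
```

With `f=0.5` the grid is finer near zero but larger positive weights are clipped at `0.5`, which is the trade-off a learned `f` can balance.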

ASTC

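For the server model, the authors compress the model weights with ASTC, a technique originally designed for GPU texture compression. After training, the authors apply ASTC to the weights, pre-processing each block. For storage, each block is compressed to 128 bits in ASTC HDR-ch mode, and the minimum value is stored separately as float16.

At inference time, the GPU hardware decompresses the ASTC blocks automatically, and the stored minimum is added back to the decompressed weights before they enter the matrix multiplication.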

Quality Recovery Adapters

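The authors also use LoRA to recover the accuracy of the quantized models and optimize the ASTC process with a selective compression strategy, reaching performance close to full fine-tuning at very small compute cost.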

Evaluation

The on-device model results are as follows:

Model	MMLU	MMMLU	MGSM
AFM On-Device	67.85	60.60	74.91
Qwen-2.5-3B	66.37	56.53	64.80
Qwen-3-4B	75.10	66.52	82.97
Gemma-3-4B	62.81	56.71	74.74
Gemma-3n-E4B	57.84	50.93	77.77

The server model results are as follows:

Model	MMLU	MMMLU	MGSM
AFM Server	80.20	74.60	87.09
LLaMA 4 Scout	84.88	80.24	90.34
Qwen-3-235B	87.52	82.95	92.00
GPT-4o	85.70	84.00	90.30

Conclusion

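The authors present the AFM-2025 family of multimodal, multilingual large language models, with an on-device and a server version, and describe the architecture, training data, and training recipe.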

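The authors propose Ovis2.5, a family of multimodal large models built on Ovis, in 2B and 9B sizes. Ovis2.5 emphasizes two features: support for image inputs of varying resolution, and deep thinking.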

Introduction

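The authors first review Ovis, which mainly addresses the poor alignment between text embeddings and visual embeddings.

They then describe two problems with Ovis:

It only supports image inputs of a fixed size
It lacks deep-thinking ability

To address these two problems, the authors propose Ovis2.5, which makes two main improvements:

It uses NaViT to handle image inputs of different resolutions
It improves the model's deep-thinking ability through training

Ovis2.5 ends up with the following features:

Dynamic-resolution image input
Deep-thinking ability
SOTA performance
An efficient training recipe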

Method

Architecture

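The architecture of Ovis2.5 is shown below.

Ovis has three modules:

visual tokenizer: a ViT architecture
visual embedding table: analogous to the text embedding table in an LLM; see Ovis
LLM: based on Qwen3

The authors make the following architectural improvements:

Dynamic-resolution image input: NaViT is used to support image inputs of dynamic resolution
LLM: Qwen3 is used to further improve model performance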

Training

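Training consists of two major stages, pre-training and post-training. Pre-training has 3 sub-stages and post-training has 2. The training process is shown below.

Pre-training data includes COYO, Laion, Wukong, DataComp, SAM, and others. The authors describe several parts of the data:

OCR data: an MLLM is used to annotate the data and synthesize QA pairs
Grounding data: built from datasets such as RefCOCO and annotated with advanced MLLMs
Reasoning data: collected first, then extended with reasoning paths synthesized by an MLLM

The pre-training stages are:

VET pretraining: trains the VET (visual embedding table). The model is initialized from SigLIP, and only the last ViT layer, the visual head, and the VET are trained. Image resolution is 448-896, and the authors use dynamic position embeddings
Multimodal pretraining: all parameters are trained, mainly to introduce conversation-format data. Image resolution is 448-1792
Multimodal instruction tuning: all parameters are trained, mainly to improve multimodal instruction following

Post-training consists of two stages, DPO and GRPO.

DPO: trains all parameters; responses are sampled multiple times from the pre-training checkpoint
GRPO: trained on RLVR datasets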

Infra

On the infrastructure side, the authors mainly emphasize data packing and the combination of multiple parallelism strategies.

Conclusion

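In this paper the authors propose Ovis2.5, a multimodal large model built on the Ovis architecture, emphasizing dynamic image input handling and deep-thinking ability.

The authors list several future directions:

Raising the input image resolution to 4K
Handling long video inputs and performing temporal reasoning
Incorporating tool use into the reasoning process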

arxiv

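The authors propose Ovis, a method that represents the visual encoder's output features in a discrete form, to better align the visual and textual inputs of the LLM.

Introduction

The authors analyze the architecture of existing multimodal large models: the input is discrete for text (text tokens) but continuous for images (visual embeddings). They argue that this continuous-discrete mismatch may hurt final model performance.

To address this, the authors build a visual embedding table that converts visual embeddings into a discrete-token-like representation as well, unifying the granularity of the LLM's inputs.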

Method

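The model architecture is shown in the figure below.

We first build a visual vocabulary $\{e_k\}_{k=1}^K$ of size $K$. Then, for the $n$ visual features $\{r_i\}_{i=1}^n$ output by the ViT, we add a linear head and a softmax to form a distribution over the vocabulary:

$$v_i = \mathrm{softmax}(Wr_i), \quad W\in\mathbb{R}^{K\times d}$$

Here $v_i\in\Delta^K$ is a probability distribution over the visual vocabulary. The final visual embedding fed to the LLM is a weighted sum of the visual tokens in the vocabulary:

$$V_i = \sum_{k=1}^K v_{i,k}e_k\in\mathbb{R}^{d'}$$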
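The two equations amount to a softmax over the vocabulary followed by an expectation over the embedding table. Below is a minimal pure-Python sketch for a single visual feature; the function names and the toy matrices are hypothetical, and a real implementation would use batched tensor operations.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def visual_embedding(r, W, E):
    """r: feature of length d; W: K x d linear head; E: K x d' embedding table."""
    logits = [sum(w * x for w, x in zip(row, r)) for row in W]
    v = softmax(logits)  # distribution over the K visual words
    d_out = len(E[0])
    # Weighted sum of the table rows, i.e. the expected embedding under v.
    V = [sum(v[k] * E[k][j] for k in range(len(E))) for j in range(d_out)]
    return v, V

r = [1.0, 0.0]
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # all-zero head gives a uniform distribution
E = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
v, V = visual_embedding(r, W, E)
```

A text token corresponds to a one-hot distribution over its vocabulary, so this construction makes the visual path structurally parallel to the text embedding lookup.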

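Training has three stages:

Stage 1: trains $W$, the last block of the visual encoder, and the visual vocabulary
Stage 2: trains $W$, the visual vocabulary, and the visual encoder
Stage 3: multimodal SFT to improve instruction following; all model parameters are trained

The distribution of the training data is shown in the table below.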

Conclusion

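The authors propose Ovis, a method that represents the visual encoder's output features in a discrete form, to better align the visual and textual inputs of the LLM.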

Arxiv

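Aya Vision is a multimodal large language model available in 8B and 32B sizes and supporting 23 languages. It is built on the Aya Expanse large language model.

Model Architecture

The architecture of Aya Vision is shown in the figure below.

Vision Encoder: SigLip2-patch14-384
Vision-text connector: 2 layer MLP
LLM: Aya Expanse 8B/ 32B

Training

Training has two stages:

Vision-language alignment: only the vision-text connector is trained, on image-text pairs
SFT: the connector and the LLM are trained on synthetic multilingual data

Multilingual Data

To improve multilingual ability, the authors first synthesize annotations from high-quality English datasets and then translate the data into text in the other 22 languages.

Model merging

Finally, to improve performance on text-only tasks, the authors use model merging: the base language model is merged with the vision-language model obtained after SFT.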
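The notes do not say which merging method is used. One common form is linear interpolation of matching weights, which the sketch below illustrates; treat the interpolation rule, the `alpha` value, and the parameter names as assumptions rather than the Aya Vision recipe.

```python
def merge_state_dicts(base, tuned, alpha=0.5):
    """Linearly interpolate parameters shared by both models (assumed method).

    base, tuned: dicts mapping parameter names to flat lists of floats.
    alpha: weight on the fine-tuned model; 0 keeps the base, 1 keeps the tuned one.
    """
    merged = {}
    for name, tuned_w in tuned.items():
        if name in base:
            merged[name] = [(1 - alpha) * b + alpha * t
                            for b, t in zip(base[name], tuned_w)]
        else:
            # Parameters that exist only in the vision-language model
            # (e.g. the connector) are copied unchanged.
            merged[name] = list(tuned_w)
    return merged

base = {"llm.w": [0.0, 2.0]}
tuned = {"llm.w": [1.0, 4.0], "connector.w": [5.0]}
merged = merge_state_dicts(base, tuned, alpha=0.25)
```

Pulling the language weights back toward the base model is one way to trade a little multimodal fit for stronger text-only behavior.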

Aya Vision Blog

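Introduction

This is a short summary of multimodal large models that report their training data mixture, kept for later reference. It is based on papers I have read; corrections for anything missing are welcome.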

Category	Ratio
general vqa	36.1
Doc/Chart/Screen	20.6
Math/Reasoning	20.1
general OCR	8.9
text only	14.3

Category	Ratio
general vqa	11.02
captioning	5.14
OCR	17.47
chart/figures	14.05
table	11.3
reasoning	10.32
textbook	1.58
difference	2.38
screenshot2code	0.31
text only	26.41

Category	Ratio
interleaved image-text	45
imagetext	45
text only	10

Category	Ratio
single-image	45.92%
multi-image	9.37%
video	39.79%
pure-text	4.92%

Category	Ratio
Special Enhancement	4%
Text	21%
Caption	4%
Chart	16%
Math	11%
OCR	3%
Code	8%
General	33%

References

Introduction

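Existing MLLMs feed the visual tokens to the LLM as a 1D sequence. In this paper, the authors inject visual tokens into different layers of the LLM to make better use of visual information.

Method

First, the input image $I$ is turned into a high-resolution version $I_{high}$ and a low-resolution version $I_{low}$. $I_{low}$ passes through the vision encoder and an MLP to produce the visual tokens $X_v$ that are fed to the LLM. Then, at the $i$-th layer of the LLM transformer, the visual tokens $X_{i,v}$ are added to the stack feature $X_{v}^i$, where $X_v^i$ is a sample of the high-resolution image features:

$$X_v^i = \mathrm{Sampling2D}(\mathrm{MLP}(\mathrm{ViT}(I_{high})))$$

The pseudocode of the algorithm is as follows: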

# H0: input embeddings for the LLM (the usual input of a traditional LMM)
# vis_pos: positions of the visual tokens in the sequence
# Xstack: list of extra high-resolution visual token tensors
# lstart, n: index of the first stacking layer, and the layer interval between stacking steps

def forward(self, H0, Xstack, lstart, n, vis_pos):
    H = H0
    for idx, TransformerLayer in enumerate(self.layers):
        # DeepStack: add the next high-resolution slice at the visual positions.
        # The length check is an added guard so stacking stops once Xstack is used up.
        if idx >= lstart and (idx - lstart) % n == 0 and (idx - lstart) // n < len(Xstack):
            H[vis_pos] += Xstack[(idx - lstart) // n]
        # Original Transformer layer
        H = TransformerLayer(H)
    return H

Experiments

Ablation Study

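The authors further test different configurations and find that applying DeepStack at early layers works best; the later the layers, the worse the result.

The authors also apply the DeepStack strategy to the ViT and find that the results improve as well.

The authors also find that the gain comes from the added high-resolution image token information.

Conclusion

In this paper the authors propose DeepStack, a method that improves the use of visual information in MLLMs, and verify its effectiveness.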

Reference

paper

TLDR

This paper proposes a Multimodal Large Language Model VITA (Video, Image, Text, Audio). VITA supports non-awakening interaction and audio interruption for better interactive experience. VITA aims to be an open-sourced version of GPT-4o.

Introduction

Features of GPT-4o:

a unified framework that processes text, vision, and audio signals in an end-to-end manner,
the capability to enable natural multimodal human-computer interaction.

Similar to Mini-GPT4, this paper tries to propose an open-sourced version of GPT-4o.

Method

Model

The architecture of VITA is shown as follows:

LLM: Mixtral $8\times 7$B
Visual Encoder: InternViT-300M-448px
Audio Encoder: Mel Filter Bank block

To support audio interruption, the authors run two models at the same time: the generation model handles user queries while the other model monitors the environment. The other model starts to work if there is an audio interruption.

Data

multimodal instruction tuning

Training data of multimodal instruction tuning is given as follows:

Improvements are made:

About half of the questions are randomly replaced with their audio versions, using a TTS technique such as GPT-SoVITS
Different system prompts are set to avoid conflicts between different types of data

To support human-AI interaction, noisy audio data is also constructed. Noisy audio samples are generated from existing QA data. These negative samples aim to improve VITA's ability to not respond to non-query-related content.

To distinguish three types of queries, the author uses three state tokens:

Token <1> denotes that the question input is the query audio
Token <2> denotes that the question input is the noisy audio.
Token <3> signifies the question of pure text.

Training pipeline

Training pipeline of VITA consists of three stages:

Non-awakening Interaction

There are following requirements and solutions:

Real-time Tracking of Environmental Sounds. This paper uses SileroVAD to complete the Voice Activity Detection (VAD) task.
Filtering out noisy audio. This is done by making use of token <2>.

Audio Interrupt Interaction

There are following requirements and solutions:

Real-time Tracking and Filtering of External Queries. This is done by using another VITA model, as stated in the Model section.

Evaluation

Conclusion

The paper points out three limitations of VITA:

Enhancement of Foundational Capabilities.
Refinement of Noisy Audio Construction.
Building end-to-end TTS in conjunction with LLM.

Overview of Multimodal Large Language Models