In this post, we review the development and recent progress of visual foundation models. Here, "visual foundation model" refers to models like CLIP (Radford et al., 2021) and SigLIP (Zhai et al., 2023) that primarily serve as the vision encoder of multimodal large language models.
Overview
Methods
GLM-5V-Turbo (Team et al., 2026) introduces CogViT, a 403M-parameter vision encoder built on the ViT architecture (Dosovitskiy et al., 2021). Architecturally, CogViT adopts QK-Norm (Henry et al., 2020) to stabilize the attention computation (a minimal sketch follows the list below). Training is split into two stages:
- The first stage uses distillation-based masked image modeling to strengthen the model's visual representations. SigLIP2 (Tschannen et al., 2025) and DINOv3 (Siméoni et al., 2025) serve as teacher models, providing semantic and textural features respectively (see the sketch after this list). Training uses an image size of , a masking ratio of , and a data mix of natural images (), instruction following (), and scientific imagery ().
- The second stage uses contrastive learning to align the visual and text features. It adopts NaFlex to handle variable-resolution image inputs and trains with the sigmoid loss from SigLIP (Zhai et al., 2023) (batch size ), as sketched after this list. On the data side, the authors add 8B bilingual data to improve the model's multilingual ability.
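On the architecture side, QK-Norm is a small change inside attention: instead of scaling the attention logits by 1/√d, the queries and keys are L2-normalized and the logits are rescaled by a learned scalar, which bounds their magnitude and stabilizes the softmax. Below is a minimal PyTorch sketch in the style of Henry et al. (2020); the exact variant CogViT uses is not spelled out here, so the normalization choice and the scalar `g` are assumptions.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, g):
    """QK-Norm attention in the spirit of Henry et al. (2020).

    q, k, v: [batch, heads, seq, head_dim]. Queries and keys are
    L2-normalized along the head dimension, so the logits are bounded
    in [-g, g]; g would be a learnable scalar (e.g. nn.Parameter)
    replacing the usual 1/sqrt(head_dim) scaling.
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = torch.softmax(q @ k.transpose(-2, -1) * g, dim=-1)
    return attn @ v
```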
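For stage 1, here is a minimal sketch of what distillation-based masked image modeling can look like: the frozen teachers see the full image, the student sees a masked version, and the student regresses the teachers' per-patch features at the masked positions. All names (`student`, `sem_head`, `tex_head`), the smooth-L1 objective, and the equal loss weighting are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def mim_distill_loss(student, sem_head, tex_head,
                     siglip2_teacher, dinov3_teacher,
                     images, mask_ratio=0.5):
    """Sketch of stage 1: masked image modeling via feature distillation.

    All models are assumed to return per-patch features [B, N, D];
    sem_head / tex_head project student features into each teacher's
    feature space. Loss form and weighting are illustrative.
    """
    with torch.no_grad():                    # teachers stay frozen
        t_sem = siglip2_teacher(images)      # semantic targets (SigLIP2)
        t_tex = dinov3_teacher(images)       # textural targets (DINOv3)
    B, N = t_sem.shape[0], t_sem.shape[1]
    # randomly mask a subset of patch positions per image
    mask = torch.rand(B, N, device=images.device) < mask_ratio
    s = student(images, patch_mask=mask)     # student sees the masked input
    # regress each teacher's features only at the masked positions
    loss_sem = F.smooth_l1_loss(sem_head(s)[mask], t_sem[mask])
    loss_tex = F.smooth_l1_loss(tex_head(s)[mask], t_tex[mask])
    return loss_sem + loss_tex
```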
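For stage 2, the sigmoid loss is described in the SigLIP paper: every image-text pair in the batch is treated as an independent binary classification problem (positive on the diagonal, negative elsewhere), so no batch-wide softmax normalization is needed. A compact sketch, with the temperature `t` and bias `b` passed in as scalars (in the paper both are learnable, with the temperature parameterized as an exponential):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """Sigmoid loss from SigLIP (Zhai et al., 2023).

    img_emb, txt_emb: L2-normalized embeddings of shape [B, D].
    Each of the B*B pairs is an independent binary problem:
    label +1 on the diagonal (matched pairs), -1 elsewhere.
    """
    logits = img_emb @ txt_emb.T * t + b     # [B, B] pairwise logits
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]
```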
For the optimizer, the authors use Muon (Jordan et al., 2024) with a cosine learning rate schedule.
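Muon itself orthogonalizes the momentum of hidden-layer weight matrices (see the linked post for the full algorithm); the cosine schedule is the standard one. A generic sketch with linear warmup follows, where the warmup length, peak, and floor are illustrative parameters since the post gives no values:

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Cosine learning-rate schedule with optional linear warmup.

    Warmup length, peak LR, and floor LR are placeholders; CogViT's
    actual values are not stated here.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```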
References
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
- Henry, A., Dachapally, P. R., Pawar, S. S., & Chen, Y. (2020). Query-Key Normalization for Transformers. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4246–4253). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.379
- Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. CoRR, abs/2103.00020. https://arxiv.org/abs/2103.00020
- Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., … Bojanowski, P. (2025). DINOv3. https://arxiv.org/abs/2508.10104
- Team, V., Hong, W., Gu, X., Pan, Z., Yang, Z., Wang, Y., Wang, Y., Yue, Y., Wang, Y., Wang, Y., Wang, Y., Liu, X., Yu, W., Wang, W., Li, W., Duan, S., Yang, S., Lv, R., Liu, M., … Tang, J. (2026). GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents. https://arxiv.org/abs/2604.26752
- Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., & Zhai, X. (2025). SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. https://arxiv.org/abs/2502.14786
- Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 11941–11952. https://doi.org/10.1109/ICCV51070.2023.01100