Overview of Visual Foundation Models

Updated: May 09, 2026


In this post, we review the development and recent progress of visual foundation models. Here, "visual foundation model" refers to models such as CLIP (Radford et al., 2021) and SigLIP (Zhai et al., 2023) that serve primarily as the vision encoder of multimodal large language models.

Overview

Figure: Model performance over time on ImageNet. X-axis: publication year; Y-axis: ImageNet Top-1 accuracy; bubble size: number of parameters.

Methods

GLM-5V-Turbo (Team et al., 2026) introduces CogViT, a 403M-parameter vision encoder based on the ViT architecture (Dosovitskiy et al., 2021). Architecturally, CogViT adopts QK-Norm (Henry et al., 2020) to stabilize the attention computation; a minimal sketch of QK-Norm follows the two-stage list below. The authors split training into two stages:

  1. Distillation-based masked image modeling to strengthen the model's visual representations. This stage uses SigLIP2 (Tschannen et al., 2025) and DINOv3 (Siméoni et al., 2025) as teacher models, providing semantic features and textual features, respectively. Training images are 224×224 and the masking ratio is 35%. The data mix is natural images (80%), instruction following (10%), and scientific imagery (10%). A sketch of this step appears after the list.
  2. Contrastive learning to align visual and textual features. This stage uses NaFlex to handle dynamic-resolution image inputs and trains with the sigmoid loss of SigLIP (Zhai et al., 2023), reproduced after the list. On the data side, the authors add 8B bilingual examples to improve the model's multilingual ability.
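
QK-Norm is the one architectural change called out above. Henry et al. (2020) ℓ2-normalize queries and keys with a learned scale before the dot product; many recent ViTs instead apply a per-head LayerNorm, which is what the minimal sketch below assumes. The module and its dimensions are illustrative, not CogViT's reported configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head self-attention with QK-Norm: queries and keys are
    normalized per head before the dot product, which bounds the
    attention logits and stabilizes training."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # QK-Norm: per-head normalization of queries and keys.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (B, heads, N, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # normalize before attention
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```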
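
Stage 1 amounts to feature regression at masked patch positions. The exact loss is not spelled out here, so the sketch below assumes a common choice, cosine-distance regression onto frozen teacher features; `student`, `siglip2_teacher`, and `dinov3_teacher` are hypothetical interfaces, not real APIs.

```python
import torch
import torch.nn.functional as F

MASK_RATIO = 0.35  # stage-1 masking ratio reported above

def mim_distill_step(student, siglip2_teacher, dinov3_teacher, images):
    """One distillation-based masked-image-modeling step (sketch).

    `student(images, mask)` is assumed to return two per-patch feature
    predictions (one head per teacher); the frozen teachers map images
    to per-patch targets. All three interfaces are hypothetical.
    """
    B = images.shape[0]
    num_patches = student.num_patches  # hypothetical attribute
    # Mask ~35% of patch tokens per image (in expectation).
    mask = torch.rand(B, num_patches, device=images.device) < MASK_RATIO

    pred_sem, pred_txt = student(images, mask)
    with torch.no_grad():
        target_sem = siglip2_teacher(images)  # "semantic" targets
        target_txt = dinov3_teacher(images)   # "textual" targets

    def masked_cosine_loss(pred, target):
        # Cosine distance, averaged over masked positions only.
        pred = F.normalize(pred, dim=-1)
        target = F.normalize(target, dim=-1)
        return (1.0 - (pred * target).sum(-1))[mask].mean()

    return (masked_cosine_loss(pred_sem, target_sem)
            + masked_cosine_loss(pred_txt, target_txt))
```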
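
Stage 2's objective is SigLIP's published sigmoid loss, which scores every image-text pair in the batch as an independent binary classification (+1 for matched pairs on the diagonal, -1 otherwise). This is the loss from Zhai et al. (2023), not CogViT-specific code:

```python
import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(img_emb, txt_emb, t, b):
    """Sigmoid loss from SigLIP (Zhai et al., 2023).

    img_emb, txt_emb: (B, D) L2-normalized embeddings of matched pairs.
    t, b: learnable temperature and bias (scalars).
    """
    logits = img_emb @ txt_emb.T * t + b  # (B, B) pairwise logits
    # +1 on the diagonal (matched pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```

Because each pair is classified independently rather than softmax-normalized over the batch, the loss behaves more gracefully across batch sizes than CLIP's InfoNCE objective, which was part of SigLIP's original motivation.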

For the optimizer, the authors use Muon (Jordan et al., 2024) with a cosine learning rate schedule; a condensed sketch of both appears below.
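
Muon maintains SGD-style momentum for each hidden 2-D weight matrix and approximately orthogonalizes the momentum with a Newton-Schulz iteration before applying it. The sketch below is condensed from Jordan et al.'s reference post; the warmup length, and how CogViT splits parameters between Muon and a fallback optimizer, are assumptions, not reported details.

```python
import math
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix via a quintic
    Newton-Schulz iteration (coefficients from Jordan et al., 2024)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr, beta=0.95):
    """One Muon update for a hidden 2-D weight (simplified sketch; the
    reference implementation also uses Nesterov momentum and rescales
    the update by sqrt(max(1, rows / cols)))."""
    momentum_buf.mul_(beta).add_(grad)
    weight.add_(newton_schulz5(momentum_buf), alpha=-lr)

def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup followed by cosine decay to zero (the warmup
    length here is an assumption, not a reported CogViT detail)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))
```

In Jordan et al.'s reference setup, Muon is applied only to hidden weight matrices, while embeddings, norms, and biases go to a standard optimizer such as AdamW.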

References

  1. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
  2. Henry, A., Dachapally, P. R., Pawar, S. S., & Chen, Y. (2020). Query-Key Normalization for Transformers. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4246–4253). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.379
  3. Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/
  4. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. CoRR, abs/2103.00020. https://arxiv.org/abs/2103.00020
  5. Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., … Bojanowski, P. (2025). DINOv3. https://arxiv.org/abs/2508.10104
  6. Team, V., Hong, W., Gu, X., Pan, Z., Yang, Z., Wang, Y., Wang, Y., Yue, Y., Wang, Y., Wang, Y., Wang, Y., Liu, X., Yu, W., Wang, W., Li, W., Duan, S., Yang, S., Lv, R., Liu, M., … Tang, J. (2026). GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents. https://arxiv.org/abs/2604.26752
  7. Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., & Zhai, X. (2025). SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. https://arxiv.org/abs/2502.14786
  8. Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 11941–11952. https://doi.org/10.1109/ICCV51070.2023.01100