Mao Song(毛松)'s Homepage

Notes on ViT

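In 2021, Google proposed ViT (Vision Transformer), a Transformer-based architecture for image recognition. The authors show experimentally that the Transformer architecture can succeed at image recognition.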

The authors propose a continual multimodal pretraining pipeline for vision foundation models, aimed at improving model performance on downstream tasks.

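In May 2024, DeepSeek released DeepSeek-V2, a large language model built on an MoE architecture with 236B total parameters, of which 21B are activated per token (236B-A21B). The authors use MLA to compress the KV cache and the DeepSeekMoE architecture to improve training efficiency and model performance.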

In May 2024, DeepSeek proposed multi-head latent attention (MLA) to improve the inference efficiency of attention.