Mao Song(毛松)'s Homepage

Notes on DeepSeek-V2

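In May 2024, DeepSeek released DeepSeek-V2, a large language model built on an MoE architecture with 236B total parameters, of which 21B are activated per token (236B-A21B). The authors use MLA to compress the KV cache and the DeepSeekMoE architecture to improve training efficiency and model performance.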

In May 2024, DeepSeek proposed multi-head latent attention (MLA) to improve the inference efficiency of attention.

In August 2024, DeepSeek proposed the Loss-free balancing strategy, which achieves load balancing without modifying the training gradients and thereby improves model performance.
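A minimal sketch of how bias-based balancing of this kind can work, assuming each expert carries a bias that is added to its routing score only when selecting experts, and that the bias is nudged by a fixed step after each batch. The function names, the `update_rate` parameter, and the toy scores are hypothetical and are not taken from the DeepSeek implementation.

```python
def route_top_k(scores, bias, k):
    """Select k experts per token from biased scores (the bias affects selection only)."""
    chosen = []
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, bias)]
        order = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)
        chosen.append(order[:k])
    return chosen


def update_bias(bias, chosen, update_rate):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    num_experts = len(bias)
    load = [0] * num_experts
    for experts in chosen:
        for e in experts:
            load[e] += 1
    mean_load = sum(load) / num_experts

    new_bias = []
    for b, count in zip(bias, load):
        if count > mean_load:
            new_bias.append(b - update_rate)
        elif count < mean_load:
            new_bias.append(b + update_rate)
        else:
            new_bias.append(b)
    return new_bias, load


# Toy batch (assumed values): 4 tokens, 3 experts, every token prefers expert 0.
scores = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.1, 0.3], [0.6, 0.2, 0.2]]
bias = [0.0, 0.0, 0.0]
chosen = route_top_k(scores, bias, k=1)
new_bias, load = update_bias(bias, chosen, update_rate=0.1)
```

Because the bias only shifts which experts are selected and never enters the loss, no auxiliary balancing term contributes gradients in this sketch.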

In January 2024, Mistral introduced Mixtral 8x7B, an MoE large language model with 8 experts, 2 of which are activated per token. It has 47B total parameters and 13B active parameters.
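As a rough illustration of "8 experts, 2 active", the sketch below picks the two highest-scoring experts for one token and normalizes their router logits with a softmax over just those two. The function name `top2_gate` and the example logits are hypothetical; this is a simplified view of top-2 gating, not Mistral's code.

```python
import math


def top2_gate(logits):
    """Return [(expert_index, weight), ...] for the two highest router logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Softmax over the selected logits only; subtract the max for numerical stability.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Assumed router logits for one token over 8 experts.
router_logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.2]
gates = top2_gate(router_logits)
```

The token's output would then be the weighted sum of the two selected experts' outputs, which is why only a fraction of the total parameters is active per token.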

In October 2023, Mistral introduced Mistral 7B, which outperforms LLaMA2-13B.