<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
  <title>Mao Song(毛松)&apos;s Homepage</title>
  <description>Mao Song&apos;s technical blog covering machine learning, large language models (LLMs), deep learning research, and AI innovations.</description>
  <link>https://maosong.website/</link>
  <item>
  <title>KL divergence: from definition to application</title>
  <link>https://maosong.website/p/kl_divergence/</link>
  <guid>https://maosong.website/p/kl_divergence/</guid>
  <description>Why unbiased KL estimates need not give unbiased KL gradients; forward vs reverse KL, estimators in on/off-policy RL, and experiments.</description>
  <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Reinforcement Learning for Large Language Models: An Overview</title>
  <link>https://maosong.website/p/RL4LLM/</link>
  <guid>https://maosong.website/p/RL4LLM/</guid>
  <description>An overview of reinforcement learning methods for large language models.</description>
  <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Notes on OpenMath-Nemotron</title>
  <link>https://maosong.website/p/notes-on-openmath-nemotron/</link>
  <guid>https://maosong.website/p/notes-on-openmath-nemotron/</guid>
  <description>NVIDIA&apos;s winning solution in the AIMO-2 competition.</description>
  <pubDate>Wed, 15 Apr 2026 01:50:39 GMT</pubDate>
</item>
<item>
  <title>Performance and Scalability</title>
  <link>https://maosong.website/p/performance-and-scalability/</link>
  <guid>https://maosong.website/p/performance-and-scalability/</guid>
  <description>This post introduces strong scaling and weak scaling.</description>
  <pubDate>Thu, 26 Mar 2026 09:44:26 GMT</pubDate>
</item>
<item>
  <title>Fixed Point Theorem</title>
  <link>https://maosong.website/p/fix-point-theorem/</link>
  <guid>https://maosong.website/p/fix-point-theorem/</guid>
  <description>The fixed point theorem.</description>
  <pubDate>Mon, 09 Mar 2026 09:16:02 GMT</pubDate>
</item>
<item>
  <title>Notes on roofline model</title>
  <link>https://maosong.website/p/notes-on-roofline-model/</link>
  <guid>https://maosong.website/p/notes-on-roofline-model/</guid>
  <description>The roofline model is the theoretical foundation of infrastructure performance analysis and guides algorithm design and optimization.</description>
  <pubDate>Thu, 26 Feb 2026 09:23:40 GMT</pubDate>
</item>
<item>
  <title>Notes on Step3-VL 10B</title>
  <link>https://maosong.website/p/notes-on-step3-vl-10b/</link>
  <guid>https://maosong.website/p/notes-on-step3-vl-10b/</guid>
  <description>In January 2026, StepFun released Step3-VL-10B, an open-source multimodal LLM that emphasizes perception, complex reasoning, and human-centric alignment.</description>
  <pubDate>Fri, 13 Feb 2026 10:05:47 GMT</pubDate>
</item>
<item>
  <title>Notes on Kimi-k2.5</title>
  <link>https://maosong.website/p/notes-on-kimi-k2-5/</link>
  <guid>https://maosong.website/p/notes-on-kimi-k2-5/</guid>
  <description>In February 2026, Kimi released Kimi K2.5, a multimodal agentic model. Built on Kimi K2, it uses joint image-text training during pretraining; in post-training it applies zero-vision SFT and multimodal RL to improve reasoning and generalization. Kimi K2.5 also introduces Agent Swarm to solve complex tasks more efficiently.</description>
  <pubDate>Thu, 12 Feb 2026 03:13:13 GMT</pubDate>
</item>
<item>
  <title>Notes on KL divergence</title>
  <link>https://maosong.website/p/notes-on-kl-divergence/</link>
  <guid>https://maosong.website/p/notes-on-kl-divergence/</guid>
  <description>In reinforcement learning, KL divergence is often used as a policy regularizer, yet much of the observed instability comes not from the KL term itself but from how it is estimated. This post shows why an unbiased KL estimate does not guarantee an unbiased KL gradient, and systematically analyzes how different KL estimators behave in on-policy versus off-policy settings. Through derivations and experiments, it highlights the essential difference between using KL as a loss and as reward shaping, and explains the rationale behind the low-variance KL designs used in practice.</description>
  <pubDate>Sat, 24 Jan 2026 08:32:14 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3-Next</title>
  <link>https://maosong.website/p/notes-on-qwen3-next/</link>
  <guid>https://maosong.website/p/notes-on-qwen3-next/</guid>
  <description>In September 2025, the Qwen team introduced Qwen3-Next, a large language model built on a hybrid attention mechanism and an MoE architecture, designed to substantially improve training and inference efficiency. By combining the strengths of linear attention and softmax attention, the model achieves large efficiency gains while maintaining strong performance.</description>
  <pubDate>Fri, 23 Jan 2026 02:29:56 GMT</pubDate>
</item>
<item>
  <title>Megatron-LM</title>
  <link>https://maosong.website/p/megatron-lm/</link>
  <guid>https://maosong.website/p/megatron-lm/</guid>
  <description>In 2020, NVIDIA introduced Megatron-LM, a large-scale LLM training framework built on tensor parallelism; the paper focuses on the tensor-parallel design.</description>
  <pubDate>Wed, 21 Jan 2026 10:04:12 GMT</pubDate>
</item>
<item>
  <title>Notes on Gated Attention</title>
  <link>https://maosong.website/p/notes-on-gated-attention/</link>
  <guid>https://maosong.website/p/notes-on-gated-attention/</guid>
  <description>In 2025, Qwen systematically studied gating mechanisms in attention, finding that introducing non-linearity and sparsity into attention can significantly improve expressiveness, training stability, and long-context performance at very low cost.</description>
  <pubDate>Tue, 20 Jan 2026 07:41:52 GMT</pubDate>
</item>
<item>
  <title>NextFlow: A Single-Branch Unified Understanding and Generation Multimodal Model</title>
  <link>https://maosong.website/p/nextflow-single-branch/</link>
  <guid>https://maosong.website/p/nextflow-single-branch/</guid>
  <description>In January 2026, ByteDance introduced NextFlow, a unified understanding-and-generation multimodal model built on a decoder-only autoregressive transformer, validating the effectiveness of a purely autoregressive architecture for unified models.</description>
  <pubDate>Sat, 17 Jan 2026 09:31:53 GMT</pubDate>
</item>
<item>
  <title>State of AI: What OpenRouter&apos;s 100T Tokens of Usage Data Reveal About Capability Tiers in LLM Competition</title>
  <link>https://maosong.website/p/state-of-ai-openrouter-100t-tokenai/</link>
  <guid>https://maosong.website/p/state-of-ai-openrouter-100t-tokenai/</guid>
  <description>In December 2025, OpenRouter published a report based on 100T tokens of usage data, analyzing current AI model usage from the perspectives of models, tasks, and users.</description>
  <pubDate>Sat, 17 Jan 2026 09:04:07 GMT</pubDate>
</item>
<item>
  <title>LLM Memory Computation</title>
  <link>https://maosong.website/p/llm-memory-computation/</link>
  <guid>https://maosong.website/p/llm-memory-computation/</guid>
  <description>This post explains how to estimate the memory requirements of LLMs during training and inference, and briefly introduces the corresponding optimizations.</description>
  <pubDate>Sat, 17 Jan 2026 02:04:32 GMT</pubDate>
</item>
<item>
  <title>Nvidia-GPU specs</title>
  <link>https://maosong.website/p/nvidia-gpu-specs/</link>
  <guid>https://maosong.website/p/nvidia-gpu-specs/</guid>
  <description>This post summarizes the technical specifications and key improvements of the NVIDIA GPU lineup.</description>
  <pubDate>Wed, 14 Jan 2026 03:09:19 GMT</pubDate>
</item>
<item>
  <title>Notes on GLaM</title>
  <link>https://maosong.website/p/notes-on-glam/</link>
  <guid>https://maosong.website/p/notes-on-glam/</guid>
  <description>In 2022, Google introduced GLaM, a family of MoE-based large language models that outperformed GPT-3.</description>
  <pubDate>Tue, 06 Jan 2026 10:07:29 GMT</pubDate>
</item>
<item>
  <title>Notes on MiniMax-01</title>
  <link>https://maosong.website/p/notes-on-minimax-01/</link>
  <guid>https://maosong.website/p/notes-on-minimax-01/</guid>
  <description>MiniMax-01 is a family of large models built on a hybrid attention architecture, comprising MiniMax-Text-01 and MiniMax-VL-01; MiniMax-Text-01 supports a 4M-token context at inference, and MiniMax-VL-01 supports a 512K-token context.</description>
  <pubDate>Tue, 06 Jan 2026 09:38:01 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-V3.2</title>
  <link>https://maosong.website/p/notes-on-deepseek-v3-2/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-v3-2/</guid>
  <description>In October 2025, DeepSeek released DeepSeek-V3.2; the model emphasizes sparse attention, scaling RL, and agentic task synthesis.</description>
  <pubDate>Tue, 06 Jan 2026 09:30:40 GMT</pubDate>
</item>
<item>
  <title>Notes on Gemini3.0</title>
  <link>https://maosong.website/p/notes-on-gemini3-0/</link>
  <guid>https://maosong.website/p/notes-on-gemini3-0/</guid>
  <description>Gemini 3.0 is Google&apos;s most capable next-generation model; the model card presents the evaluation results and core capabilities of the Gemini 3.0 series.</description>
  <pubDate>Tue, 06 Jan 2026 02:26:39 GMT</pubDate>
</item>
<item>
  <title>Notes on Softmax</title>
  <link>https://maosong.website/p/notes-on-softmax/</link>
  <guid>https://maosong.website/p/notes-on-softmax/</guid>
  <description>This post covers the properties, implementation, and applications of the softmax function, as a reference for later use and study.</description>
  <pubDate>Sat, 27 Dec 2025 08:39:53 GMT</pubDate>
</item>
<item>
  <title>Notes on NoPE</title>
  <link>https://maosong.website/p/notes-on-nope/</link>
  <guid>https://maosong.website/p/notes-on-nope/</guid>
  <description>NoPE is an approach that removes explicit positional encoding; related work shows that models can still learn positional information and extrapolate to longer lengths without it.</description>
  <pubDate>Wed, 24 Dec 2025 07:19:42 GMT</pubDate>
</item>
<item>
  <title>Notes on ALiBi</title>
  <link>https://maosong.website/p/notes-on-alibi/</link>
  <guid>https://maosong.website/p/notes-on-alibi/</guid>
  <description>Meta and collaborators proposed ALiBi, a positional-encoding method based on linear biases that improves LLM length extrapolation at inference time.</description>
  <pubDate>Wed, 24 Dec 2025 07:10:55 GMT</pubDate>
</item>
<item>
  <title>Notes on T5</title>
  <link>https://maosong.website/p/notes-on-t5/</link>
  <guid>https://maosong.website/p/notes-on-t5/</guid>
  <description>In 2020, Google published T5 (Text-to-Text Transfer Transformer), a transfer-learning framework that casts all NLP tasks into a unified text-to-text format.</description>
  <pubDate>Wed, 24 Dec 2025 07:07:08 GMT</pubDate>
</item>
<item>
  <title>GPipe</title>
  <link>https://maosong.website/p/gpipe/</link>
  <guid>https://maosong.website/p/gpipe/</guid>
  <description>In 2018, Google proposed GPipe, a parallelism strategy that uses pipeline parallelism to train large-scale neural networks.</description>
  <pubDate>Tue, 23 Dec 2025 08:49:25 GMT</pubDate>
</item>
<item>
  <title>Base of RoPE Bounds Context Length</title>
  <link>https://maosong.website/p/base-of-rope-bounds-context-length/</link>
  <guid>https://maosong.website/p/base-of-rope-bounds-context-length/</guid>
  <description>Baichuan investigated the relationship between the RoPE base frequency and context length in LLMs, deriving a lower bound on the base frequency for a given context length.</description>
  <pubDate>Mon, 22 Dec 2025 03:34:42 GMT</pubDate>
</item>
<item>
  <title>Notes on NSA</title>
  <link>https://maosong.website/p/notes-on-nsa/</link>
  <guid>https://maosong.website/p/notes-on-nsa/</guid>
  <description>In early 2025, DeepSeek proposed Natively trainable Sparse Attention (NSA), a hardware-aligned sparse attention mechanism that improves computational efficiency alongside inference efficiency.</description>
  <pubDate>Mon, 15 Dec 2025 09:39:16 GMT</pubDate>
</item>
<item>
  <title>MoE tutorial</title>
  <link>https://maosong.website/p/moe-tutorial/</link>
  <guid>https://maosong.website/p/moe-tutorial/</guid>
  <description>This post walks through the key design choices of MoE models and related experimental results, as a foundation for learning about MoE.</description>
  <pubDate>Sat, 13 Dec 2025 08:04:04 GMT</pubDate>
</item>
<item>
  <title>Notes on Ling-mini-beta</title>
  <link>https://maosong.website/p/notes-on-ling-mini-beta/</link>
  <guid>https://maosong.website/p/notes-on-ling-mini-beta/</guid>
  <description>Ant Group proposed a scaling law for MoE models and, based on it, introduced Ling-mini-beta.</description>
  <pubDate>Sat, 13 Dec 2025 07:58:51 GMT</pubDate>
</item>
<item>
  <title>Load Balancing tutorial</title>
  <link>https://maosong.website/p/load-balancing-tutorial/</link>
  <guid>https://maosong.website/p/load-balancing-tutorial/</guid>
  <description>This post explores the definition, properties, and generalizations of the load-balancing loss.</description>
  <pubDate>Thu, 11 Dec 2025 08:10:08 GMT</pubDate>
</item>
<item>
  <title>Notes on Global-batch load balancing</title>
  <link>https://maosong.website/p/notes-on-global-batch-load-balancing/</link>
  <guid>https://maosong.website/p/notes-on-global-batch-load-balancing/</guid>
  <description>In February 2025, Qwen proposed a global-batch load-balancing loss strategy, which balances expert load at the global level and thereby improves model performance.</description>
  <pubDate>Thu, 11 Dec 2025 08:09:34 GMT</pubDate>
</item>
<item>
  <title>Notes on DPO</title>
  <link>https://maosong.website/p/notes-on-dpo/</link>
  <guid>https://maosong.website/p/notes-on-dpo/</guid>
  <description>The authors propose DPO, a preference-optimization method that needs no explicit reward model: by modeling the reward implicitly, DPO trains the policy model directly on the preference dataset, greatly improving the efficiency of LLM preference optimization.</description>
  <pubDate>Tue, 09 Dec 2025 02:43:11 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-V3</title>
  <link>https://maosong.website/p/notes-on-deepseek-v3/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-v3/</guid>
  <description>In late 2024, DeepSeek released DeepSeek-V3, a large language model trained with only 2.8M H800 GPU hours that achieves SOTA performance across benchmarks.</description>
  <pubDate>Mon, 08 Dec 2025 03:14:45 GMT</pubDate>
</item>
<item>
  <title>Notes on Gemini2.5</title>
  <link>https://maosong.website/p/notes-on-gemini2-5/</link>
  <guid>https://maosong.website/p/notes-on-gemini2-5/</guid>
  <description>On June 17, DeepMind released the Gemini 2.5 technical report, covering the Pro and Flash versions.</description>
  <pubDate>Sat, 06 Dec 2025 10:14:15 GMT</pubDate>
</item>
<item>
  <title>Notes on OpenMoE</title>
  <link>https://maosong.website/p/notes-on-olmoe-openmoe/</link>
  <guid>https://maosong.website/p/notes-on-olmoe-openmoe/</guid>
  <description>NUS and collaborators proposed OpenMoE, a fully open-source family of MoE large language models; the authors give a detailed account of the routing mechanism in MoE.</description>
  <pubDate>Sat, 06 Dec 2025 10:08:11 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3 VL</title>
  <link>https://maosong.website/p/notes-on-qwen3-vl/</link>
  <guid>https://maosong.website/p/notes-on-qwen3-vl/</guid>
  <description>On November 27, 2025, Qwen released the Qwen3-VL technical report, highlighting the model&apos;s text-only understanding, long-context, and multimodal reasoning capabilities.</description>
  <pubDate>Fri, 05 Dec 2025 02:12:01 GMT</pubDate>
</item>
<item>
  <title>Notes on SAPO</title>
  <link>https://maosong.website/p/notes-on-sapo/</link>
  <guid>https://maosong.website/p/notes-on-sapo/</guid>
  <description>In November 2025, Qwen proposed SAPO, which uses a temperature-controlled soft gate with asymmetric temperatures to address the problems of hard clipping, improving the stability and efficiency of RL training.</description>
  <pubDate>Fri, 05 Dec 2025 02:09:06 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepStack</title>
  <link>https://maosong.website/p/notes-on-deepstack/</link>
  <guid>https://maosong.website/p/notes-on-deepstack/</guid>
  <description>The authors propose DeepStack, which helps MLLMs make better use of visual information and thereby improves performance on downstream tasks.</description>
  <pubDate>Thu, 04 Dec 2025 09:32:41 GMT</pubDate>
</item>
<item>
  <title>Notes on ViT</title>
  <link>https://maosong.website/p/notes-on-vit/</link>
  <guid>https://maosong.website/p/notes-on-vit/</guid>
  <description>In 2021, Google introduced ViT, a Transformer-based image recognition architecture, demonstrating experimentally the success of the Transformer architecture in image recognition.</description>
  <pubDate>Thu, 04 Dec 2025 03:00:44 GMT</pubDate>
</item>
<item>
  <title>Notes on CoMP</title>
  <link>https://maosong.website/p/notes-on-comp/</link>
  <guid>https://maosong.website/p/notes-on-comp/</guid>
  <description>The authors propose a continual multimodal pretraining pipeline for vision foundation models, improving performance on downstream tasks.</description>
  <pubDate>Thu, 04 Dec 2025 02:58:30 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-R1</title>
  <link>https://maosong.website/p/notes-on-deepseek-r1/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-r1/</guid>
  <description>In January 2025, DeepSeek released DeepSeek-R1, a reasoning model trained with large-scale reinforcement learning; the work shows that strong reasoning capability can be incentivized through RL.</description>
  <pubDate>Tue, 02 Dec 2025 10:21:54 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-V2</title>
  <link>https://maosong.website/p/notes-on-deepseek-v2/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-v2/</guid>
  <description>In May 2024, DeepSeek introduced DeepSeek-V2, an MoE-based large language model with 236B total and 21B active parameters. The authors use MLA to compress the KV cache and the DeepSeekMoE architecture to improve training efficiency and performance.</description>
  <pubDate>Tue, 02 Dec 2025 10:21:54 GMT</pubDate>
</item>
<item>
  <title>Notes on MLA</title>
  <link>https://maosong.website/p/notes-on-mla/</link>
  <guid>https://maosong.website/p/notes-on-mla/</guid>
  <description>In May 2024, DeepSeek proposed multi-head latent attention (MLA) to improve the inference efficiency of attention.</description>
  <pubDate>Tue, 02 Dec 2025 10:21:54 GMT</pubDate>
</item>
<item>
  <title>Notes on Loss-free Balancing</title>
  <link>https://maosong.website/p/notes-on-loss-free-balancing/</link>
  <guid>https://maosong.website/p/notes-on-loss-free-balancing/</guid>
  <description>In August 2024, DeepSeek proposed the loss-free balancing strategy, which achieves load balancing without modifying the training gradients, improving model performance.</description>
  <pubDate>Fri, 21 Nov 2025 07:38:52 GMT</pubDate>
</item>
<item>
  <title>Mixtral 8x7B</title>
  <link>https://maosong.website/p/mixstral-8x7b/</link>
  <guid>https://maosong.website/p/mixstral-8x7b/</guid>
  <description>In January 2024, Mistral introduced Mixtral 8x7B, an MoE large language model with 8 experts (2 active per token), 47B total parameters, and 13B active parameters.</description>
  <pubDate>Sat, 01 Nov 2025 07:32:30 GMT</pubDate>
</item>
<item>
  <title>Mistral 7B</title>
  <link>https://maosong.website/p/mixstral-7b/</link>
  <guid>https://maosong.website/p/mixstral-7b/</guid>
  <description>In October 2023, Mistral introduced Mistral 7B, which outperforms LLaMA2-13B.</description>
  <pubDate>Sat, 01 Nov 2025 07:28:19 GMT</pubDate>
</item>
<item>
  <title>Notes on OLMoE</title>
  <link>https://maosong.website/p/notes-on-olmoe/</link>
  <guid>https://maosong.website/p/notes-on-olmoe/</guid>
  <description>In September 2024, AllenAI introduced OLMoE, a fully open-source MoE-based large language model with 7B total and 1B active parameters; the authors detail the model design, data, and training strategy. The paper received an oral at ICLR 2025.</description>
  <pubDate>Sat, 01 Nov 2025 07:23:58 GMT</pubDate>
</item>
<item>
  <title>GShard</title>
  <link>https://maosong.website/p/gshard/</link>
  <guid>https://maosong.website/p/gshard/</guid>
  <description>In 2020, Google proposed GShard, an API module for MoE models, aiming to explore how to train MoE-based transformer models efficiently.</description>
  <pubDate>Wed, 29 Oct 2025 03:22:39 GMT</pubDate>
</item>
<item>
  <title>ST-MoE</title>
  <link>https://maosong.website/p/st-moe/</link>
  <guid>https://maosong.website/p/st-moe/</guid>
  <description>In April 2022, Google proposed ST-MoE-269B-A32B to address the training instability and underperformance of MoE models.</description>
  <pubDate>Wed, 29 Oct 2025 03:19:37 GMT</pubDate>
</item>
<item>
  <title>Switch Transformer</title>
  <link>https://maosong.website/p/switch-transformer/</link>
  <guid>https://maosong.website/p/switch-transformer/</guid>
  <description>In June 2022, Google introduced the Switch Transformer, an MoE-based Transformer model. By simplifying the MoE algorithm, the authors greatly improved computation and communication efficiency, finding that the model trains more efficiently than a comparable dense model.</description>
  <pubDate>Tue, 28 Oct 2025 01:38:12 GMT</pubDate>
</item>
<item>
  <title>Chinchilla Scaling Law</title>
  <link>https://maosong.website/p/chinchilla-scaling-law/</link>
  <guid>https://maosong.website/p/chinchilla-scaling-law/</guid>
  <description>In March 2022, DeepMind investigated how to choose the optimal model size and data size under a fixed compute budget, finding that for compute-optimal training, model size and dataset size should scale in equal proportion. Based on this scaling law, the authors trained Chinchilla, a 70B large language model that outperforms much larger models.</description>
  <pubDate>Wed, 22 Oct 2025 06:39:23 GMT</pubDate>
</item>
<item>
  <title>Kaplan Scaling Law</title>
  <link>https://maosong.website/p/kaplan-scaling-law/</link>
  <guid>https://maosong.website/p/kaplan-scaling-law/</guid>
  <description>In January 2020, OpenAI investigated how transformer loss relates to model size, dataset size, and compute budget. With such a scaling law, one can choose the optimal configuration under a fixed compute budget.</description>
  <pubDate>Wed, 22 Oct 2025 06:10:52 GMT</pubDate>
</item>
<item>
  <title>LLM FLOPs Computation</title>
  <link>https://maosong.website/p/llm-flops-computation/</link>
  <guid>https://maosong.website/p/llm-flops-computation/</guid>
  <description>We show how to compute the FLOPs of a transformer-based LLM; from this we can derive the relationship between compute $C$, parameter count $N$, and dataset size $D$, namely $C\approx 6ND$.</description>
  <pubDate>Wed, 15 Oct 2025 08:33:39 GMT</pubDate>
</item>
<item>
  <title>Notes on Keye-VL 1.5</title>
  <link>https://maosong.website/p/notes-on-keye-vl-1-5/</link>
  <guid>https://maosong.website/p/notes-on-keye-vl-1-5/</guid>
  <description>Kuaishou introduced Keye-VL 1.5, an 8B multimodal LLM emphasizing reasoning and video understanding. The authors propose a slow-fast video encoding strategy to improve video understanding, and strengthen long-context and reasoning capabilities through pretraining and post-training.</description>
  <pubDate>Thu, 11 Sep 2025 03:33:31 GMT</pubDate>
</item>
<item>
  <title>Notes on AdamW</title>
  <link>https://maosong.website/p/notes-on-adamw/</link>
  <guid>https://maosong.website/p/notes-on-adamw/</guid>
  <description>The authors propose a decoupled weight-decay scheme for the Adam optimizer.</description>
  <pubDate>Thu, 04 Sep 2025 02:27:03 GMT</pubDate>
</item>
<item>
  <title>Notes on Adam</title>
  <link>https://maosong.website/p/notes-on-adam/</link>
  <guid>https://maosong.website/p/notes-on-adam/</guid>
  <description>The authors propose Adam, a first-order optimization method that is efficient and invariant to gradient rescaling.</description>
  <pubDate>Thu, 04 Sep 2025 02:11:55 GMT</pubDate>
</item>
<item>
  <title>Notes on RNoPE-SWA</title>
  <link>https://maosong.website/p/notes-on-rnope-swa/</link>
  <guid>https://maosong.website/p/notes-on-rnope-swa/</guid>
  <description>The authors systematically analyze existing attention mechanisms and propose a hybrid attention mechanism that improves long-context performance while preserving short-context performance.</description>
  <pubDate>Tue, 02 Sep 2025 03:24:10 GMT</pubDate>
</item>
<item>
  <title>Notes on InternVL3.5</title>
  <link>https://maosong.website/p/notes-on-internvl3-5/</link>
  <guid>https://maosong.website/p/notes-on-internvl3-5/</guid>
  <description>Shanghai AI Lab introduced the InternVL 3.5 family of multimodal LLMs, which emphasizes reasoning capability and inference efficiency.</description>
  <pubDate>Mon, 01 Sep 2025 03:30:50 GMT</pubDate>
</item>
<item>
  <title>Ovis2.5 MLLM with stronger perception and reasoning capability</title>
  <link>https://maosong.website/p/ovis2-5-mllm-with-stronger-perception-and-reasoning-capability/</link>
  <guid>https://maosong.website/p/ovis2-5-mllm-with-stronger-perception-and-reasoning-capability/</guid>
  <description>The authors propose Ovis2.5, a multimodal LLM family improved from Ovis, in 2B and 9B sizes; its two highlighted features are support for variable-resolution image input and deep thinking.</description>
  <pubDate>Sat, 30 Aug 2025 09:34:44 GMT</pubDate>
</item>
<item>
  <title>Ovis-discrete visual embedding</title>
  <link>https://maosong.website/p/ovis-discrete-visual-embedding/</link>
  <guid>https://maosong.website/p/ovis-discrete-visual-embedding/</guid>
  <description>The authors propose Ovis, which discretizes the output features of the visual encoder to better align the LLM&apos;s visual and textual inputs.</description>
  <pubDate>Sat, 30 Aug 2025 09:32:22 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeekMoE</title>
  <link>https://maosong.website/p/notes-on-deepseekmoe/</link>
  <guid>https://maosong.website/p/notes-on-deepseekmoe/</guid>
  <description>In January 2024, DeepSeek released DeepSeekMoE, a family of large models addressing the insufficient expert specialization and redundancy problems of MoE models.</description>
  <pubDate>Fri, 29 Aug 2025 03:03:12 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-LLM</title>
  <link>https://maosong.website/p/notes-on-deepseek-llm/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-llm/</guid>
  <description>On January 5, 2024, DeepSeek released DeepSeek LLM in 7B and 67B sizes, with an emphasis on investigating scaling laws.</description>
  <pubDate>Tue, 26 Aug 2025 02:53:10 GMT</pubDate>
</item>
<item>
  <title>Notes on MFA</title>
  <link>https://maosong.website/p/notes-on-mfa/</link>
  <guid>https://maosong.website/p/notes-on-mfa/</guid>
  <description>StepFun and collaborators proposed Multi-matrix Factorization Attention (MFA), a new attention mechanism that maximizes model performance under a KV-cache budget.</description>
  <pubDate>Sat, 23 Aug 2025 08:04:34 GMT</pubDate>
</item>
<item>
  <title>Notes on MX-format</title>
  <link>https://maosong.website/p/notes-on-mx-format/</link>
  <guid>https://maosong.website/p/notes-on-mx-format/</guid>
  <description>MX format is a data representation format used in LLMs mainly for quantization. Compared with quantizing an entire tensor directly, MX format controls quantization at a finer granularity, improving model performance.</description>
  <pubDate>Thu, 21 Aug 2025 10:23:03 GMT</pubDate>
</item>
<item>
  <title>Notes on FlashAttention</title>
  <link>https://maosong.website/p/notes-on-flashattention/</link>
  <guid>https://maosong.website/p/notes-on-flashattention/</guid>
  <description>The authors propose FlashAttention, which speeds up attention by reducing the memory-access overhead of multi-head attention.</description>
  <pubDate>Thu, 21 Aug 2025 03:32:53 GMT</pubDate>
</item>
<item>
  <title>Notes on StreamingLLM</title>
  <link>https://maosong.website/p/notes-on-streamingllm/</link>
  <guid>https://maosong.website/p/notes-on-streamingllm/</guid>
  <description>The authors propose StreamingLLM, which uses attention sinks to improve sliding-window attention in very-long-context settings.</description>
  <pubDate>Wed, 20 Aug 2025 02:16:35 GMT</pubDate>
</item>
<item>
  <title>Notes on gpt-oss</title>
  <link>https://maosong.website/p/notes-on-gpt-oss/</link>
  <guid>https://maosong.website/p/notes-on-gpt-oss/</guid>
  <description>OpenAI released the gpt-oss large language models in 120B-A5.1B and 20.9B-A3.6B sizes, emphasizing instruction following, tool use, and adaptive thinking.</description>
  <pubDate>Tue, 19 Aug 2025 08:14:56 GMT</pubDate>
</item>
<item>
  <title>Notes on QK-Norm</title>
  <link>https://maosong.website/p/notes-on-qk-norm/</link>
  <guid>https://maosong.website/p/notes-on-qk-norm/</guid>
  <description>The authors propose QK norm, a scaling technique that stabilizes softmax attention weights.</description>
  <pubDate>Wed, 13 Aug 2025 08:12:11 GMT</pubDate>
</item>
<item>
  <title>Notes on GLM-4.5</title>
  <link>https://maosong.website/p/notes-on-glm-4-5/</link>
  <guid>https://maosong.website/p/notes-on-glm-4-5/</guid>
  <description>Zhipu AI introduced GLM-4.5, comprising two MoE LLMs, GLM-4.5 and GLM-4.5-Air, at 355B-A22B and 106B-A12B respectively; GLM-4.5 focuses on agentic, reasoning, and coding capabilities.</description>
  <pubDate>Wed, 13 Aug 2025 04:27:48 GMT</pubDate>
</item>
<item>
  <title>Notes on ARC-Hunyuan-Video-7B</title>
  <link>https://maosong.website/p/notes-on-arc-hunyuan-video-7b/</link>
  <guid>https://maosong.website/p/notes-on-arc-hunyuan-video-7b/</guid>
  <description>Tencent ARC Lab introduced ARC-Hunyuan-Video-7B, a video multimodal LLM for short-video understanding and reasoning.</description>
  <pubDate>Tue, 12 Aug 2025 02:57:57 GMT</pubDate>
</item>
<item>
  <title>Notes on GQA</title>
  <link>https://maosong.website/p/notes-on-gqa/</link>
  <guid>https://maosong.website/p/notes-on-gqa/</guid>
  <description>In December 2023, Google Research proposed Group Query Attention (GQA), a method for improving multi-head attention efficiency. GQA has been adopted by Qwen models since the Qwen2 series.</description>
  <pubDate>Thu, 07 Aug 2025 10:08:36 GMT</pubDate>
</item>
<item>
  <title>Notes on MQA</title>
  <link>https://maosong.website/p/notes-on-mqa/</link>
  <guid>https://maosong.website/p/notes-on-mqa/</guid>
  <description>In 2019, Google proposed multi-query attention (MQA) to address the memory-bandwidth bottleneck of multi-head attention.</description>
  <pubDate>Thu, 07 Aug 2025 10:06:37 GMT</pubDate>
</item>
<item>
  <title>Notes on Moonlight</title>
  <link>https://maosong.website/p/notes-on-moonlight/</link>
  <guid>https://maosong.website/p/notes-on-moonlight/</guid>
  <description>Kimi introduced Moonlight, a 16B-A3B MoE LLM trained with the Muon optimizer; the authors detail how to scale up Muon.</description>
  <pubDate>Thu, 07 Aug 2025 02:49:32 GMT</pubDate>
</item>
<item>
  <title>Notes on Hunyuan-Large</title>
  <link>https://maosong.website/p/notes-on-hunyuan-large/</link>
  <guid>https://maosong.website/p/notes-on-hunyuan-large/</guid>
  <description>Tencent Hunyuan introduced Hunyuan-Large, a 389B-A52B MoE LLM with a 256K context length.</description>
  <pubDate>Wed, 06 Aug 2025 08:46:32 GMT</pubDate>
</item>
<item>
  <title>Notes on GSPO</title>
  <link>https://maosong.website/p/notes-on-gspo/</link>
  <guid>https://maosong.website/p/notes-on-gspo/</guid>
  <description>Qwen proposed Group Sequence Policy Optimization (GSPO), an RL algorithm that improves on GRPO. GSPO computes the importance ratio at the sequence level, avoiding the training instability caused by token-level computation.</description>
  <pubDate>Wed, 06 Aug 2025 03:26:26 GMT</pubDate>
</item>
<item>
  <title>Notes on Muon blog</title>
  <link>https://maosong.website/p/notes-on-muon-blog/</link>
  <guid>https://maosong.website/p/notes-on-muon-blog/</guid>
  <description>Muon (MomentUm Orthogonalized by Newton-Schulz) is an optimizer for the 2D parameters (weight matrices) of neural networks; it extends SGD-momentum with a Newton-Schulz post-processing step.</description>
  <pubDate>Tue, 05 Aug 2025 03:10:51 GMT</pubDate>
</item>
<item>
  <title>Notes on AFM2025</title>
  <link>https://maosong.website/p/notes-on-afm2025/</link>
  <guid>https://maosong.website/p/notes-on-afm2025/</guid>
  <description>In July, Apple released the AFM technical report, covering two multilingual multimodal models: one for on-device use and one for the server.</description>
  <pubDate>Tue, 29 Jul 2025 04:36:28 GMT</pubDate>
</item>
<item>
  <title>Notes on Kimi-k2</title>
  <link>https://maosong.website/p/notes-on-kimi-k2/</link>
  <guid>https://maosong.website/p/notes-on-kimi-k2/</guid>
  <description>Kimi-k2 is an MoE large language model with 1T total and 32B active parameters, trained on 15.5T tokens with the MuonClip optimizer. The authors focus on the model&apos;s agentic capabilities.</description>
  <pubDate>Thu, 24 Jul 2025 02:56:50 GMT</pubDate>
</item>
<item>
  <title>Notes on Keye-VL</title>
  <link>https://maosong.website/p/notes-on-keye-vl/</link>
  <guid>https://maosong.website/p/notes-on-keye-vl/</guid>
  <description>Keye-VL is an 8B multimodal LLM released by Kuaishou in July 2025, with short-video understanding as its highlight. Pretraining comprises 4 stages over 600B tokens; post-training comprises 2 stages that strengthen the model&apos;s reasoning and non-reasoning abilities.</description>
  <pubDate>Wed, 23 Jul 2025 03:11:43 GMT</pubDate>
</item>
<item>
  <title>LLM Parameter Computation</title>
  <link>https://maosong.website/p/llm-parameter-computation/</link>
  <guid>https://maosong.website/p/llm-parameter-computation/</guid>
  <description>We explain how to compute an LLM&apos;s parameter count: starting from the Qwen3 architecture, we break the model down and derive a parameter-count formula.</description>
  <pubDate>Tue, 22 Jul 2025 02:50:47 GMT</pubDate>
</item>
<item>
  <title>Notes on Seed1.6</title>
  <link>https://maosong.website/p/notes-on-seed1-6/</link>
  <guid>https://maosong.website/p/notes-on-seed1-6/</guid>
  <description>Seed 1.6 supports adaptive deep thinking and multimodal understanding, with a 256K context length.</description>
  <pubDate>Fri, 18 Jul 2025 06:59:35 GMT</pubDate>
</item>
<item>
  <title>Notes on V-Triune</title>
  <link>https://maosong.website/p/notes-on-v-triune/</link>
  <guid>https://maosong.website/p/notes-on-v-triune/</guid>
  <description>A unified RL training framework for improving VLM perception and reasoning.</description>
  <pubDate>Thu, 17 Jul 2025 01:37:36 GMT</pubDate>
</item>
<item>
  <title>Notes on Magistral</title>
  <link>https://maosong.website/p/notes-on-magistral/</link>
  <guid>https://maosong.website/p/notes-on-magistral/</guid>
  <description>Magistral is Mistral&apos;s reasoning model family, targeting the math and code domains.</description>
  <pubDate>Wed, 16 Jul 2025 03:04:04 GMT</pubDate>
</item>
<item>
  <title>Notes on SmolLM3</title>
  <link>https://maosong.website/p/notes-on-smollm3/</link>
  <guid>https://maosong.website/p/notes-on-smollm3/</guid>
  <description>On July 8, 2025, Hugging Face released SmolLM3, a 3B small language model with a 128K context, support for 6 languages, and dual-mode reasoning.</description>
  <pubDate>Tue, 15 Jul 2025 03:01:13 GMT</pubDate>
</item>
<item>
  <title>Notes on GLM-4.1V-Thinking</title>
  <link>https://maosong.website/p/notes-on-glm-4-1v-thinking/</link>
  <guid>https://maosong.website/p/notes-on-glm-4-1v-thinking/</guid>
  <description>In July 2025, Zhipu AI released GLM-4.1V-Thinking, a 9B multimodal large language model that achieves SOTA among similarly sized MLLMs on multiple benchmarks.</description>
  <pubDate>Mon, 14 Jul 2025 02:32:04 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2.5-1M</title>
  <link>https://maosong.website/p/notes-on-qwen2-5-1m/</link>
  <guid>https://maosong.website/p/notes-on-qwen2-5-1m/</guid>
  <description>A summary of the Qwen2.5-1M technical report.</description>
  <pubDate>Sat, 12 Jul 2025 03:00:47 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2.5</title>
  <link>https://maosong.website/p/notes-on-qwen2-5/</link>
  <guid>https://maosong.website/p/notes-on-qwen2-5/</guid>
  <description>A summary of the Qwen2.5 technical report.</description>
  <pubDate>Sat, 12 Jul 2025 02:51:42 GMT</pubDate>
</item>
<item>
  <title>Dual Chunk Attention</title>
  <link>https://maosong.website/p/dual-chunk-attention/</link>
  <guid>https://maosong.website/p/dual-chunk-attention/</guid>
  <description>A training-free context-extension strategy.</description>
  <pubDate>Sat, 12 Jul 2025 02:41:12 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2</title>
  <link>https://maosong.website/p/notes-on-qwen2/</link>
  <guid>https://maosong.website/p/notes-on-qwen2/</guid>
  <description>A summary of the Qwen2 technical report.</description>
  <pubDate>Sat, 12 Jul 2025 02:36:43 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen1.5</title>
  <link>https://maosong.website/p/notes-on-qwen1-5/</link>
  <guid>https://maosong.website/p/notes-on-qwen1-5/</guid>
  <description>In January 2024, Qwen released Qwen1.5 in eight sizes (0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B), plus an MoE model.</description>
  <pubDate>Thu, 03 Jul 2025 09:37:39 GMT</pubDate>
</item>
<item>
  <title>Notes on YaRN</title>
  <link>https://maosong.website/p/notes-on-yarn/</link>
  <guid>https://maosong.website/p/notes-on-yarn/</guid>
  <description>YaRN (Yet another RoPE extensioN method) was proposed in September 2023 by EleutherAI and others to extend LLM context length; it was later adopted by the Qwen model series.</description>
  <pubDate>Thu, 03 Jul 2025 06:40:49 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen-LLM</title>
  <link>https://maosong.website/p/notes-on-qwen-llm/</link>
  <guid>https://maosong.website/p/notes-on-qwen-llm/</guid>
  <description>A summary of the Qwen technical report.</description>
  <pubDate>Thu, 03 Jul 2025 02:47:27 GMT</pubDate>
</item>
<item>
  <title>Hands on LLM(2) Transformer</title>
  <link>https://maosong.website/p/hands-on-llm-2-transformer/</link>
  <guid>https://maosong.website/p/hands-on-llm-2-transformer/</guid>
  <description>Explains the Transformer architecture and its core code, using Qwen3 as a running example.</description>
  <pubDate>Sun, 29 Jun 2025 03:40:39 GMT</pubDate>
</item>
<item>
  <title>Unified perspective on dLLM and LLM</title>
  <link>https://maosong.website/p/unified-perspective-on-dllm-and-llm/</link>
  <guid>https://maosong.website/p/unified-perspective-on-dllm-and-llm/</guid>
  <description>A unified perspective on diffusion language models (dLLMs) and autoregressive LLMs</description>
  <pubDate>Sat, 28 Jun 2025 07:02:09 GMT</pubDate>
</item>
<item>
  <title>Relationship between MLE and KL divergence</title>
  <link>https://maosong.website/p/relationship-between-mle-and-kl-divergence/</link>
  <guid>https://maosong.website/p/relationship-between-mle-and-kl-divergence/</guid>
  <description>A derivation of the equivalence between MLE and KL divergence minimization</description>
  <pubDate>Fri, 27 Jun 2025 03:35:33 GMT</pubDate>
</item>
<item>
  <title>Notes on MiMo-VL</title>
  <link>https://maosong.website/p/notes-on-mimo-vl/</link>
  <guid>https://maosong.website/p/notes-on-mimo-vl/</guid>
  <description>MiMo-VL, built on MiMo-7B, is a multimodal reasoning large language model</description>
  <pubDate>Thu, 05 Jun 2025 02:51:43 GMT</pubDate>
</item>
<item>
  <title>Hands on LLM(1) Tokenizer</title>
  <link>https://maosong.website/p/hands-on-llm-1-tokenizer/</link>
  <guid>https://maosong.website/p/hands-on-llm-1-tokenizer/</guid>
  <description>A summary of tokenizers, with an efficient implementation of BPE</description>
  <pubDate>Sat, 24 May 2025 11:56:34 GMT</pubDate>
</item>
<item>
  <title>Notes on attention bias</title>
  <link>https://maosong.website/p/notes-on-attention-bias/</link>
  <guid>https://maosong.website/p/notes-on-attention-bias/</guid>
  <description>Why Transformers omit the QKV bias</description>
  <pubDate>Thu, 22 May 2025 07:25:07 GMT</pubDate>
</item>
<item>
  <title>Notes on Position encoding</title>
  <link>https://maosong.website/p/notes-on-position-encoding/</link>
  <guid>https://maosong.website/p/notes-on-position-encoding/</guid>
  <description>From absolute position encoding to RoPE</description>
  <pubDate>Mon, 19 May 2025 02:46:39 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3</title>
  <link>https://maosong.website/p/notes-on-qwen3/</link>
  <guid>https://maosong.website/p/notes-on-qwen3/</guid>
  <description>Qwen3 comprises six dense models and two MoE models; its highlights are switching between fast and slow thinking modes, multilingual support, and an adjustable thinking budget</description>
  <pubDate>Thu, 15 May 2025 06:48:11 GMT</pubDate>
</item>
<item>
  <title>Notes on Seed1.5-VL</title>
  <link>https://maosong.website/p/notes-on-seed1-5-vl/</link>
  <guid>https://maosong.website/p/notes-on-seed1-5-vl/</guid>
  <description>ByteDance&apos;s Seed team released the Seed1.5-VL technical report on May 11, detailing the model&apos;s architecture, training, and evaluation</description>
  <pubDate>Wed, 14 May 2025 01:28:07 GMT</pubDate>
</item>
<item>
  <title>Distributed Training: Parameter and Compute Analysis</title>
  <link>https://maosong.website/p/distributed-training-computations/</link>
  <guid>https://maosong.website/p/distributed-training-computations/</guid>
  <description>Parameter-count and compute analysis for distributed training</description>
  <pubDate>Tue, 13 May 2025 03:26:36 GMT</pubDate>
</item>
<item>
  <title>Distributed Training: How to Train a Model</title>
  <link>https://maosong.website/p/distributed-training-pytorch-training/</link>
  <guid>https://maosong.website/p/distributed-training-pytorch-training/</guid>
  <description>How to train a model with PyTorch in a distributed setting</description>
  <pubDate>Tue, 13 May 2025 03:26:36 GMT</pubDate>
</item>
<item>
  <title>Distributed Training: Basics</title>
  <link>https://maosong.website/p/distributed-training-basic/</link>
  <guid>https://maosong.website/p/distributed-training-basic/</guid>
  <description>Basic concepts in distributed training</description>
  <pubDate>Mon, 12 May 2025 02:15:17 GMT</pubDate>
</item>
<item>
  <title>Notes on LLaMA4 blog</title>
  <link>https://maosong.website/p/notes-on-llama4-blog/</link>
  <guid>https://maosong.website/p/notes-on-llama4-blog/</guid>
  <description>Reading notes on the LLaMA4 blog post</description>
  <pubDate>Wed, 30 Apr 2025 02:44:19 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3 blog</title>
  <link>https://maosong.website/p/notes-on-qwen3-blog/</link>
  <guid>https://maosong.website/p/notes-on-qwen3-blog/</guid>
  <description>The release of the Qwen3 series of LLMs</description>
  <pubDate>Tue, 29 Apr 2025 03:23:04 GMT</pubDate>
</item>
<item>
  <title>Data mixture in MLLM</title>
  <link>https://maosong.website/p/data-mixture-in-mllm/</link>
  <guid>https://maosong.website/p/data-mixture-in-mllm/</guid>
  <description>A brief summary of training data mixtures for MLLMs</description>
  <pubDate>Fri, 25 Apr 2025 02:25:48 GMT</pubDate>
</item>
<item>
  <title>Essay: Physical Health</title>
  <link>https://maosong.website/p/%E9%9A%8F%E7%AC%94/</link>
  <guid>https://maosong.website/p/%E9%9A%8F%E7%AC%94/</guid>
  <description>Only when plagued by illness does one realize the importance of good health</description>
  <pubDate>Wed, 23 Apr 2025 05:24:02 GMT</pubDate>
</item>
<item>
  <title>Notes on VAPO</title>
  <link>https://maosong.website/p/notes-on-vapo/</link>
  <guid>https://maosong.website/p/notes-on-vapo/</guid>
  <description>ByteDance&apos;s Seed team proposed VAPO, which combines the strengths of DAPO and VC-PPO to address problems in long-CoT tasks and improve reasoning-model performance</description>
  <pubDate>Thu, 17 Apr 2025 01:41:51 GMT</pubDate>
</item>
<item>
  <title>Notes on VC-PPO</title>
  <link>https://maosong.website/p/notes-on-vc-ppo/</link>
  <guid>https://maosong.website/p/notes-on-vc-ppo/</guid>
  <description>ByteDance&apos;s Seed team proposed Value-Calibrated PPO (VC-PPO) to address PPO&apos;s value initialization bias and reward signal decay problems</description>
  <pubDate>Mon, 14 Apr 2025 09:36:15 GMT</pubDate>
</item>
<item>
  <title>Notes on DAPO</title>
  <link>https://maosong.website/p/notes-on-dapo/</link>
  <guid>https://maosong.website/p/notes-on-dapo/</guid>
  <description>Notes on DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)</description>
  <pubDate>Wed, 09 Apr 2025 13:40:33 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2.5-Omni</title>
  <link>https://maosong.website/p/notes-on-qwen2-5-omni/</link>
  <guid>https://maosong.website/p/notes-on-qwen2-5-omni/</guid>
  <description>Academic notes on Qwen2.5-Omni</description>
  <pubDate>Tue, 01 Apr 2025 02:29:00 GMT</pubDate>
</item>
<item>
  <title>Understanding Sigmoid Loss in SigLIP</title>
  <link>https://maosong.website/p/understanding-sigmoid-loss-in-siglip/</link>
  <guid>https://maosong.website/p/understanding-sigmoid-loss-in-siglip/</guid>
  <description>Understanding the sigmoid loss in SigLIP</description>
  <pubDate>Fri, 28 Mar 2025 06:55:50 GMT</pubDate>
</item>
<item>
  <title>Notes on Aya Vision</title>
  <link>https://maosong.website/p/notes-on-aya-vision/</link>
  <guid>https://maosong.website/p/notes-on-aya-vision/</guid>
  <description>Aya Vision comes in two sizes, 8B and 32B, and supports 23 languages</description>
  <pubDate>Mon, 17 Mar 2025 09:58:24 GMT</pubDate>
</item>
<item>
  <title>Notes on Gemma3</title>
  <link>https://maosong.website/p/notes-on-gemma3/</link>
  <guid>https://maosong.website/p/notes-on-gemma3/</guid>
  <description>Notes on Gemma3 technical report</description>
  <pubDate>Sat, 15 Mar 2025 03:15:29 GMT</pubDate>
</item>
<item>
  <title>Overview of Qwen-VL series</title>
  <link>https://maosong.website/p/overview-of-qwen-vl-series/</link>
  <guid>https://maosong.website/p/overview-of-qwen-vl-series/</guid>
  <description>Overview of Qwen-VL series</description>
  <pubDate>Sun, 09 Mar 2025 07:11:29 GMT</pubDate>
</item>
<item>
  <title>Notes on QwQ-32B</title>
  <link>https://maosong.website/p/notes-on-qwq-32b/</link>
  <guid>https://maosong.website/p/notes-on-qwq-32b/</guid>
  <description>Notes on QwQ-32B</description>
  <pubDate>Sat, 08 Mar 2025 01:46:16 GMT</pubDate>
</item>
<item>
  <title>compression is intelligence</title>
  <link>https://maosong.website/p/compression-is-intelligence/</link>
  <guid>https://maosong.website/p/compression-is-intelligence/</guid>
  <description>Understanding large models from the perspective of compression as intelligence</description>
  <pubDate>Thu, 06 Mar 2025 09:57:51 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2.5 VL</title>
  <link>https://maosong.website/p/notes-on-qwen2-5-vl/</link>
  <guid>https://maosong.website/p/notes-on-qwen2-5-vl/</guid>
  <description>Academic notes on Qwen2.5 VL</description>
  <pubDate>Tue, 04 Mar 2025 02:46:42 GMT</pubDate>
</item>
<item>
  <title>Git authentication error</title>
  <link>https://maosong.website/p/git-authentication-error/</link>
  <guid>https://maosong.website/p/git-authentication-error/</guid>
  <description>Git clone authentication error.</description>
  <pubDate>Sat, 22 Feb 2025 02:51:27 GMT</pubDate>
</item>
<item>
  <title>Notes on Kimi k1.5</title>
  <link>https://maosong.website/p/notes-on-kimi-k1-5/</link>
  <guid>https://maosong.website/p/notes-on-kimi-k1-5/</guid>
  <description>A brief introduction to Kimi k1.5</description>
  <pubDate>Sat, 08 Feb 2025 02:09:52 GMT</pubDate>
</item>
<item>
  <title>Screen usage</title>
  <link>https://maosong.website/p/screen-usage/</link>
  <guid>https://maosong.website/p/screen-usage/</guid>
  <description>Screen usage</description>
  <pubDate>Wed, 05 Feb 2025 07:14:43 GMT</pubDate>
</item>
<item>
  <title>Notes on Phi-4</title>
  <link>https://maosong.website/p/notes-on-phi-4/</link>
  <guid>https://maosong.website/p/notes-on-phi-4/</guid>
  <description>A brief introduction to Phi-4</description>
  <pubDate>Mon, 16 Dec 2024 09:33:52 GMT</pubDate>
</item>
<item>
  <title>An overview of adaptation layers in multimodal large language models</title>
  <link>https://maosong.website/p/an-overview-of-adaption-layer-in-multimodal-large-language-models/</link>
  <guid>https://maosong.website/p/an-overview-of-adaption-layer-in-multimodal-large-language-models/</guid>
  <description>An overview of the different adaptation layers used in MLLMs</description>
  <pubDate>Sat, 09 Nov 2024 01:53:43 GMT</pubDate>
</item>
<item>
  <title>Notes on VITA</title>
  <link>https://maosong.website/p/notes-on-vita/</link>
  <guid>https://maosong.website/p/notes-on-vita/</guid>
  <description>The first open-source Multimodal Large Language Model (MLLM) capable of simultaneously processing and analyzing video, image, text, and audio, while offering an advanced multimodal interactive experience</description>
  <pubDate>Tue, 13 Aug 2024 06:36:45 GMT</pubDate>
</item>
<item>
  <title>ROUGE (Recall-Oriented Understudy for Gisting Evaluation)</title>
  <link>https://maosong.website/p/rouge-recall-oriented-understudy/</link>
  <guid>https://maosong.website/p/rouge-recall-oriented-understudy/</guid>
  <description>A metric that evaluates the similarity between generated and reference summaries</description>
  <pubDate>Thu, 09 May 2024 09:35:20 GMT</pubDate>
</item>
<item>
  <title>Formal Algorithms for Transformer</title>
  <link>https://maosong.website/p/formal-algorithms-for-transformer/</link>
  <guid>https://maosong.website/p/formal-algorithms-for-transformer/</guid>
  <description>A formal algorithmic description of how the Transformer works</description>
  <pubDate>Thu, 02 May 2024 05:13:12 GMT</pubDate>
</item>
<item>
  <title>MiniGPT-4-Enhancing Vision-Language Understanding with Advanced Large Language Models</title>
  <link>https://maosong.website/p/minigpt-4-enhancing-vision-language-understanding-with-advanced-large-language-models/</link>
  <guid>https://maosong.website/p/minigpt-4-enhancing-vision-language-understanding-with-advanced-large-language-models/</guid>
  <description>Notes on MiniGPT-4, which enhances vision-language understanding with advanced large language models</description>
  <pubDate>Thu, 02 May 2024 05:13:12 GMT</pubDate>
</item>
<item>
  <title>Notes on t-SNE</title>
  <link>https://maosong.website/p/notes-on-t-sne/</link>
  <guid>https://maosong.website/p/notes-on-t-sne/</guid>
  <description>Learning notes on t-SNE</description>
  <pubDate>Thu, 02 May 2024 05:13:12 GMT</pubDate>
</item>
<item>
  <title>Regularization methods in deep learning</title>
  <link>https://maosong.website/p/regularization-methods-in-deep-learning/</link>
  <guid>https://maosong.website/p/regularization-methods-in-deep-learning/</guid>
  <description>Overview of the regularization methods in deep learning.</description>
  <pubDate>Sat, 27 Apr 2024 10:02:02 GMT</pubDate>
</item>
<item>
  <title>BLEU (Bilingual Evaluation Understudy)</title>
  <link>https://maosong.website/p/bleu-bilingual-evaluation-understudy/</link>
  <guid>https://maosong.website/p/bleu-bilingual-evaluation-understudy/</guid>
  <description>A metric that evaluates machine-translation quality</description>
  <pubDate>Thu, 25 Apr 2024 14:46:53 GMT</pubDate>
</item>
<item>
  <title>Notes on Llama3</title>
  <link>https://maosong.website/p/notes-on-llama3/</link>
  <guid>https://maosong.website/p/notes-on-llama3/</guid>
  <description>A brief introduction to Llama3</description>
  <pubDate>Mon, 22 Apr 2024 08:22:19 GMT</pubDate>
</item>
<item>
  <title>Practical advice for analysis of large, complex data sets</title>
  <link>https://maosong.website/p/practical-advice-for-analysis-of-large-complex-data-sets/</link>
  <guid>https://maosong.website/p/practical-advice-for-analysis-of-large-complex-data-sets/</guid>
  <description>Advice on how to analyze large, complex data sets</description>
  <pubDate>Wed, 17 Apr 2024 14:40:11 GMT</pubDate>
</item>
<item>
  <title>Notes on RAG</title>
  <link>https://maosong.website/p/notes-on-rag/</link>
  <guid>https://maosong.website/p/notes-on-rag/</guid>
  <description>Notes on Retrieval-Augmented Generation (RAG)</description>
  <pubDate>Sun, 14 Apr 2024 04:38:04 GMT</pubDate>
</item>
<item>
  <title>What&apos;s next for AI agentic workflows</title>
  <link>https://maosong.website/p/what-s-next-for-ai-agentic-workflows/</link>
  <guid>https://maosong.website/p/what-s-next-for-ai-agentic-workflows/</guid>
  <description>Design for AI agentic workflows</description>
  <pubDate>Sun, 14 Apr 2024 04:38:04 GMT</pubDate>
</item>
<item>
  <title>Rules of Machine Learning</title>
  <link>https://maosong.website/p/rules-of-machine-learning/</link>
  <guid>https://maosong.website/p/rules-of-machine-learning/</guid>
  <description>Practical advice for machine-learning engineering</description>
  <pubDate>Sat, 13 Apr 2024 11:57:47 GMT</pubDate>
</item>
<item>
  <title>How to fix an HTTP error when creating a new environment</title>
  <link>https://maosong.website/p/how-to-fix-http-error-when-creating-a-new-environment/</link>
  <guid>https://maosong.website/p/how-to-fix-http-error-when-creating-a-new-environment/</guid>
  <description>Conda configuration</description>
  <pubDate>Fri, 25 Aug 2023 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>