<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
  <title>Mao Song(毛松)&apos;s Homepage</title>
  <description>Mao Song&apos;s technical blog covering machine learning, large language models (LLMs), deep learning research, and AI innovations.</description>
  <link>https://maosong.website/</link>
  <item>
  <title>KL divergence: from definition to application</title>
  <link>https://maosong.website/p/kl_divergence/</link>
  <guid>https://maosong.website/p/kl_divergence/</guid>
  <description>Why unbiased KL estimates need not give unbiased KL gradients; forward vs reverse KL, estimators in on/off-policy RL, and experiments.</description>
  <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Reinforcement Learning for Large Language Models: An Overview</title>
  <link>https://maosong.website/p/RL4LLM/</link>
  <guid>https://maosong.website/p/RL4LLM/</guid>
  <description>An overview of reinforcement learning methods for large language models.</description>
  <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Notes on OpenMath-Nemotron</title>
  <link>https://maosong.website/p/notes-on-openmath-nemotron/</link>
  <guid>https://maosong.website/p/notes-on-openmath-nemotron/</guid>
  <description>NVIDIA&apos;s winning solution in the AIMO-2 competition.</description>
  <pubDate>Wed, 15 Apr 2026 01:50:39 GMT</pubDate>
</item>
<item>
  <title>Performance and Scalability</title>
  <link>https://maosong.website/p/performance-and-scalability/</link>
  <guid>https://maosong.website/p/performance-and-scalability/</guid>
  <description>This post introduces strong scaling and weak scaling.</description>
  <pubDate>Thu, 26 Mar 2026 09:44:26 GMT</pubDate>
</item>
<item>
  <title>Fixed Point Theorem</title>
  <link>https://maosong.website/p/fix-point-theorem/</link>
  <guid>https://maosong.website/p/fix-point-theorem/</guid>
  <description>The fixed point theorem.</description>
  <pubDate>Mon, 09 Mar 2026 09:16:02 GMT</pubDate>
</item>
<item>
  <title>Notes on roofline model</title>
  <link>https://maosong.website/p/notes-on-roofline-model/</link>
  <guid>https://maosong.website/p/notes-on-roofline-model/</guid>
  <description>The roofline model is the theoretical foundation of infrastructure performance analysis and guides algorithm design and optimization.</description>
  <pubDate>Thu, 26 Feb 2026 09:23:40 GMT</pubDate>
</item>
<item>
  <title>Notes on Step3-VL 10B</title>
  <link>https://maosong.website/p/notes-on-step3-vl-10b/</link>
  <guid>https://maosong.website/p/notes-on-step3-vl-10b/</guid>
  <description>In January 2026, StepFun released Step3-VL-10B, an open-source multimodal LLM that emphasizes perception, complex reasoning, and human-centric alignment.</description>
  <pubDate>Fri, 13 Feb 2026 10:05:47 GMT</pubDate>
</item>
<item>
  <title>Notes on Kimi-k2.5</title>
  <link>https://maosong.website/p/notes-on-kimi-k2-5/</link>
  <guid>https://maosong.website/p/notes-on-kimi-k2-5/</guid>
  <description>In February 2026, Kimi released Kimi K2.5, a multimodal agentic model. Built on Kimi K2, it uses joint image-text training during pretraining; in post-training it applies zero-vision SFT and multimodal RL to improve reasoning and generalization. Kimi K2.5 also introduces Agent Swarm to solve complex tasks more efficiently.</description>
  <pubDate>Thu, 12 Feb 2026 03:13:13 GMT</pubDate>
</item>
<item>
  <title>Notes on KL divergence</title>
  <link>https://maosong.website/p/notes-on-kl-divergence/</link>
  <guid>https://maosong.website/p/notes-on-kl-divergence/</guid>
  <description>In reinforcement learning, KL divergence is often used as a policy regularizer, yet much of the observed instability comes not from the KL term itself but from how it is estimated. This post shows why an unbiased KL estimate does not guarantee an unbiased KL gradient, and systematically analyzes how different KL estimators behave in on-policy versus off-policy settings. Through derivations and experiments, it highlights the essential difference between using KL as a loss and as reward shaping, and explains the rationale behind the low-variance KL designs used in practice.</description>
  <pubDate>Sat, 24 Jan 2026 08:32:14 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3-Next</title>
  <link>https://maosong.website/p/notes-on-qwen3-next/</link>
  <guid>https://maosong.website/p/notes-on-qwen3-next/</guid>
  <description>In September 2025, the Qwen team introduced Qwen3-Next, a large language model built on a hybrid attention mechanism and an MoE architecture, designed to substantially improve training and inference efficiency. By combining the strengths of linear attention and softmax attention, the model achieves large efficiency gains while maintaining strong performance.</description>
  <pubDate>Fri, 23 Jan 2026 02:29:56 GMT</pubDate>
</item>
<item>
  <title>Megatron-LM</title>
  <link>https://maosong.website/p/megatron-lm/</link>
  <guid>https://maosong.website/p/megatron-lm/</guid>
  <description>In 2020, NVIDIA introduced Megatron-LM, a large-scale LLM training framework built on tensor parallelism; the paper focuses on the tensor-parallel design.</description>
  <pubDate>Wed, 21 Jan 2026 10:04:12 GMT</pubDate>
</item>
<item>
  <title>Notes on Gated Attention</title>
  <link>https://maosong.website/p/notes-on-gated-attention/</link>
  <guid>https://maosong.website/p/notes-on-gated-attention/</guid>
  <description>In 2025, Qwen systematically studied gating mechanisms in attention, finding that introducing non-linearity and sparsity into attention can significantly improve expressiveness, training stability, and long-context performance at very low cost.</description>
  <pubDate>Tue, 20 Jan 2026 07:41:52 GMT</pubDate>
</item>
<item>
  <title>NextFlow: A Single-Branch Unified Understanding and Generation Multimodal Model</title>
  <link>https://maosong.website/p/nextflow-single-branch/</link>
  <guid>https://maosong.website/p/nextflow-single-branch/</guid>
  <description>In January 2026, ByteDance introduced NextFlow, a unified understanding-and-generation multimodal model built on a decoder-only autoregressive transformer, validating the effectiveness of a purely autoregressive architecture for unified models.</description>
  <pubDate>Sat, 17 Jan 2026 09:31:53 GMT</pubDate>
</item>
<item>
  <title>State of AI: What OpenRouter&apos;s 100T Tokens of Usage Data Reveal About Capability Tiers in LLM Competition</title>
  <link>https://maosong.website/p/state-of-ai-openrouter-100t-tokenai/</link>
  <guid>https://maosong.website/p/state-of-ai-openrouter-100t-tokenai/</guid>
  <description>In December 2025, OpenRouter published a report based on 100T tokens of usage data, analyzing current AI model usage from the perspectives of models, tasks, and users.</description>
  <pubDate>Sat, 17 Jan 2026 09:04:07 GMT</pubDate>
</item>
<item>
  <title>LLM Memory Computation</title>
  <link>https://maosong.website/p/llm-memory-computation/</link>
  <guid>https://maosong.website/p/llm-memory-computation/</guid>
  <description>This post explains how to estimate the memory requirements of LLMs during training and inference, and briefly introduces the corresponding optimizations.</description>
  <pubDate>Sat, 17 Jan 2026 02:04:32 GMT</pubDate>
</item>
<item>
  <title>Nvidia-GPU specs</title>
  <link>https://maosong.website/p/nvidia-gpu-specs/</link>
  <guid>https://maosong.website/p/nvidia-gpu-specs/</guid>
  <description>This post summarizes the technical specifications and key improvements of the NVIDIA GPU lineup.</description>
  <pubDate>Wed, 14 Jan 2026 03:09:19 GMT</pubDate>
</item>
<item>
  <title>Notes on GLaM</title>
  <link>https://maosong.website/p/notes-on-glam/</link>
  <guid>https://maosong.website/p/notes-on-glam/</guid>
  <description>In 2022, Google introduced GLaM, a family of MoE-based large language models that outperformed GPT-3.</description>
  <pubDate>Tue, 06 Jan 2026 10:07:29 GMT</pubDate>
</item>
<item>
  <title>Notes on MiniMax-01</title>
  <link>https://maosong.website/p/notes-on-minimax-01/</link>
  <guid>https://maosong.website/p/notes-on-minimax-01/</guid>
  <description>MiniMax-01 is a family of large models built on a hybrid attention architecture, comprising MiniMax-Text-01 and MiniMax-VL-01; MiniMax-Text-01 supports a 4M-token context at inference, and MiniMax-VL-01 supports a 512K-token context.</description>
  <pubDate>Tue, 06 Jan 2026 09:38:01 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-V3.2</title>
  <link>https://maosong.website/p/notes-on-deepseek-v3-2/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-v3-2/</guid>
  <description>In October 2025, DeepSeek released DeepSeek-V3.2; the model emphasizes sparse attention, scaling RL, and agentic task synthesis.</description>
  <pubDate>Tue, 06 Jan 2026 09:30:40 GMT</pubDate>
</item>
<item>
  <title>Notes on Gemini3.0</title>
  <link>https://maosong.website/p/notes-on-gemini3-0/</link>
  <guid>https://maosong.website/p/notes-on-gemini3-0/</guid>
  <description>Gemini 3.0 is Google&apos;s most capable next-generation model; the model card presents the evaluation results and core capabilities of the Gemini 3.0 series.</description>
  <pubDate>Tue, 06 Jan 2026 02:26:39 GMT</pubDate>
</item>
<item>
  <title>Notes on Softmax</title>
  <link>https://maosong.website/p/notes-on-softmax/</link>
  <guid>https://maosong.website/p/notes-on-softmax/</guid>
  <description>This post covers the properties, implementation, and applications of the softmax function, as a reference for later use and study.</description>
  <pubDate>Sat, 27 Dec 2025 08:39:53 GMT</pubDate>
</item>
<item>
  <title>Notes on NoPE</title>
  <link>https://maosong.website/p/notes-on-nope/</link>
  <guid>https://maosong.website/p/notes-on-nope/</guid>
  <description>NoPE is an approach that removes explicit positional encoding; related work shows that models can still learn positional information and extrapolate to longer lengths without it.</description>
  <pubDate>Wed, 24 Dec 2025 07:19:42 GMT</pubDate>
</item>
<item>
  <title>Notes on ALiBi</title>
  <link>https://maosong.website/p/notes-on-alibi/</link>
  <guid>https://maosong.website/p/notes-on-alibi/</guid>
  <description>Meta and collaborators proposed ALiBi, a positional-encoding method based on linear biases that improves LLM length extrapolation at inference time.</description>
  <pubDate>Wed, 24 Dec 2025 07:10:55 GMT</pubDate>
</item>
<item>
  <title>Notes on T5</title>
  <link>https://maosong.website/p/notes-on-t5/</link>
  <guid>https://maosong.website/p/notes-on-t5/</guid>
  <description>In 2020, Google published T5 (Text-to-Text Transfer Transformer), a transfer-learning framework that casts all NLP tasks into a unified text-to-text format.</description>
  <pubDate>Wed, 24 Dec 2025 07:07:08 GMT</pubDate>
</item>
<item>
  <title>GPipe</title>
  <link>https://maosong.website/p/gpipe/</link>
  <guid>https://maosong.website/p/gpipe/</guid>
  <description>In 2018, Google proposed GPipe, a parallelism strategy that uses pipeline parallelism to train large-scale neural networks.</description>
  <pubDate>Tue, 23 Dec 2025 08:49:25 GMT</pubDate>
</item>
<item>
  <title>Base of RoPE Bounds Context Length</title>
  <link>https://maosong.website/p/base-of-rope-bounds-context-length/</link>
  <guid>https://maosong.website/p/base-of-rope-bounds-context-length/</guid>
  <description>Baichuan investigated the relationship between the RoPE base frequency and context length in LLMs, deriving a lower bound on the base frequency for a given context length.</description>
  <pubDate>Mon, 22 Dec 2025 03:34:42 GMT</pubDate>
</item>
<item>
  <title>Notes on NSA</title>
  <link>https://maosong.website/p/notes-on-nsa/</link>
  <guid>https://maosong.website/p/notes-on-nsa/</guid>
  <description>In early 2025, DeepSeek proposed Natively trainable Sparse Attention (NSA), a hardware-aligned sparse attention mechanism that improves computational efficiency alongside inference efficiency.</description>
  <pubDate>Mon, 15 Dec 2025 09:39:16 GMT</pubDate>
</item>
<item>
  <title>MoE tutorial</title>
  <link>https://maosong.website/p/moe-tutorial/</link>
  <guid>https://maosong.website/p/moe-tutorial/</guid>
  <description>This post walks through the key design choices of MoE models and related experimental results, as a foundation for learning about MoE.</description>
  <pubDate>Sat, 13 Dec 2025 08:04:04 GMT</pubDate>
</item>
<item>
  <title>Notes on Ling-mini-beta</title>
  <link>https://maosong.website/p/notes-on-ling-mini-beta/</link>
  <guid>https://maosong.website/p/notes-on-ling-mini-beta/</guid>
  <description>Ant Group proposed a scaling law for MoE models and, based on it, introduced Ling-mini-beta.</description>
  <pubDate>Sat, 13 Dec 2025 07:58:51 GMT</pubDate>
</item>
<item>
  <title>Load Balancing tutorial</title>
  <link>https://maosong.website/p/load-balancing-tutorial/</link>
  <guid>https://maosong.website/p/load-balancing-tutorial/</guid>
  <description>This post explores the definition, properties, and generalizations of the load-balancing loss.</description>
  <pubDate>Thu, 11 Dec 2025 08:10:08 GMT</pubDate>
</item>
<item>
  <title>Notes on Global-batch load balancing</title>
  <link>https://maosong.website/p/notes-on-global-batch-load-balancing/</link>
  <guid>https://maosong.website/p/notes-on-global-batch-load-balancing/</guid>
  <description>In February 2025, Qwen proposed a global-batch load-balancing loss strategy, which balances expert load at the global level and thereby improves model performance.</description>
  <pubDate>Thu, 11 Dec 2025 08:09:34 GMT</pubDate>
</item>
<item>
  <title>Notes on DPO</title>
  <link>https://maosong.website/p/notes-on-dpo/</link>
  <guid>https://maosong.website/p/notes-on-dpo/</guid>
  <description>The authors propose DPO, a preference-optimization method that needs no explicit reward model: by modeling the reward implicitly, DPO trains the policy model directly on the preference dataset, greatly improving the efficiency of LLM preference optimization.</description>
  <pubDate>Tue, 09 Dec 2025 02:43:11 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-V3</title>
  <link>https://maosong.website/p/notes-on-deepseek-v3/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-v3/</guid>
  <description>In late 2024, DeepSeek released DeepSeek-V3, a large language model trained with only 2.8M H800 GPU hours that achieves SOTA performance across benchmarks.</description>
  <pubDate>Mon, 08 Dec 2025 03:14:45 GMT</pubDate>
</item>
<item>
  <title>Notes on Gemini2.5</title>
  <link>https://maosong.website/p/notes-on-gemini2-5/</link>
  <guid>https://maosong.website/p/notes-on-gemini2-5/</guid>
  <description>On June 17, DeepMind released the Gemini 2.5 technical report, covering the Pro and Flash versions.</description>
  <pubDate>Sat, 06 Dec 2025 10:14:15 GMT</pubDate>
</item>
<item>
  <title>Notes on OpenMoE</title>
  <link>https://maosong.website/p/notes-on-olmoe-openmoe/</link>
  <guid>https://maosong.website/p/notes-on-olmoe-openmoe/</guid>
  <description>NUS and collaborators proposed OpenMoE, a fully open-source family of MoE large language models; the authors give a detailed account of the routing mechanism in MoE.</description>
  <pubDate>Sat, 06 Dec 2025 10:08:11 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3 VL</title>
  <link>https://maosong.website/p/notes-on-qwen3-vl/</link>
  <guid>https://maosong.website/p/notes-on-qwen3-vl/</guid>
  <description>On November 27, 2025, Qwen released the Qwen3-VL technical report, highlighting the model&apos;s text-only understanding, long-context, and multimodal reasoning capabilities.</description>
  <pubDate>Fri, 05 Dec 2025 02:12:01 GMT</pubDate>
</item>
<item>
  <title>Notes on SAPO</title>
  <link>https://maosong.website/p/notes-on-sapo/</link>
  <guid>https://maosong.website/p/notes-on-sapo/</guid>
  <description>In November 2025, Qwen proposed SAPO, which uses a temperature-controlled soft gate with asymmetric temperatures to address the problems of hard clipping, improving the stability and efficiency of RL training.</description>
  <pubDate>Fri, 05 Dec 2025 02:09:06 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepStack</title>
  <link>https://maosong.website/p/notes-on-deepstack/</link>
  <guid>https://maosong.website/p/notes-on-deepstack/</guid>
  <description>The authors propose DeepStack, which helps MLLMs make better use of visual information and thereby improves performance on downstream tasks.</description>
  <pubDate>Thu, 04 Dec 2025 09:32:41 GMT</pubDate>
</item>
<item>
  <title>Notes on ViT</title>
  <link>https://maosong.website/p/notes-on-vit/</link>
  <guid>https://maosong.website/p/notes-on-vit/</guid>
  <description>In 2021, Google introduced ViT, a Transformer-based image recognition architecture, demonstrating experimentally the success of the Transformer architecture in image recognition.</description>
  <pubDate>Thu, 04 Dec 2025 03:00:44 GMT</pubDate>
</item>
<item>
  <title>Notes on CoMP</title>
  <link>https://maosong.website/p/notes-on-comp/</link>
  <guid>https://maosong.website/p/notes-on-comp/</guid>
  <description>The authors propose a continual multimodal pretraining pipeline for vision foundation models, improving performance on downstream tasks.</description>
  <pubDate>Thu, 04 Dec 2025 02:58:30 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-R1</title>
  <link>https://maosong.website/p/notes-on-deepseek-r1/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-r1/</guid>
  <description>In January 2025, DeepSeek released DeepSeek-R1, a reasoning model trained with large-scale reinforcement learning; the work shows that strong reasoning capability can be incentivized through RL.</description>
  <pubDate>Tue, 02 Dec 2025 10:21:54 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-V2</title>
  <link>https://maosong.website/p/notes-on-deepseek-v2/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-v2/</guid>
  <description>In May 2024, DeepSeek introduced DeepSeek-V2, an MoE-based large language model with 236B total and 21B active parameters. The authors use MLA to compress the KV cache and the DeepSeekMoE architecture to improve training efficiency and performance.</description>
  <pubDate>Tue, 02 Dec 2025 10:21:54 GMT</pubDate>
</item>
<item>
  <title>Notes on MLA</title>
  <link>https://maosong.website/p/notes-on-mla/</link>
  <guid>https://maosong.website/p/notes-on-mla/</guid>
  <description>In May 2024, DeepSeek proposed multi-head latent attention (MLA) to improve the inference efficiency of attention.</description>
  <pubDate>Tue, 02 Dec 2025 10:21:54 GMT</pubDate>
</item>
<item>
  <title>Notes on Loss-free Balancing</title>
  <link>https://maosong.website/p/notes-on-loss-free-balancing/</link>
  <guid>https://maosong.website/p/notes-on-loss-free-balancing/</guid>
  <description>In August 2024, DeepSeek proposed the loss-free balancing strategy, which achieves load balancing without modifying the training gradients, improving model performance.</description>
  <pubDate>Fri, 21 Nov 2025 07:38:52 GMT</pubDate>
</item>
<item>
  <title>Mixtral 8x7B</title>
  <link>https://maosong.website/p/mixstral-8x7b/</link>
  <guid>https://maosong.website/p/mixstral-8x7b/</guid>
  <description>In January 2024, Mistral introduced Mixtral 8x7B, an MoE large language model with 8 experts (2 active per token), 47B total parameters, and 13B active parameters.</description>
  <pubDate>Sat, 01 Nov 2025 07:32:30 GMT</pubDate>
</item>
<item>
  <title>Mistral 7B</title>
  <link>https://maosong.website/p/mixstral-7b/</link>
  <guid>https://maosong.website/p/mixstral-7b/</guid>
  <description>In October 2023, Mistral introduced Mistral 7B, which outperforms LLaMA2-13B.</description>
  <pubDate>Sat, 01 Nov 2025 07:28:19 GMT</pubDate>
</item>
<item>
  <title>Notes on OLMoE</title>
  <link>https://maosong.website/p/notes-on-olmoe/</link>
  <guid>https://maosong.website/p/notes-on-olmoe/</guid>
  <description>In September 2024, AllenAI introduced OLMoE, a fully open-source MoE-based large language model with 7B total and 1B active parameters; the authors detail the model design, data, and training strategy. The paper received an oral at ICLR 2025.</description>
  <pubDate>Sat, 01 Nov 2025 07:23:58 GMT</pubDate>
</item>
<item>
  <title>GShard</title>
  <link>https://maosong.website/p/gshard/</link>
  <guid>https://maosong.website/p/gshard/</guid>
  <description>In 2020, Google proposed GShard, an API module for MoE models, aiming to explore how to train MoE-based transformer models efficiently.</description>
  <pubDate>Wed, 29 Oct 2025 03:22:39 GMT</pubDate>
</item>
<item>
  <title>ST-MoE</title>
  <link>https://maosong.website/p/st-moe/</link>
  <guid>https://maosong.website/p/st-moe/</guid>
  <description>In April 2022, Google proposed ST-MoE-269B-A32B to address the training instability and underperformance of MoE models.</description>
  <pubDate>Wed, 29 Oct 2025 03:19:37 GMT</pubDate>
</item>
<item>
  <title>Switch Transformer</title>
  <link>https://maosong.website/p/switch-transformer/</link>
  <guid>https://maosong.website/p/switch-transformer/</guid>
  <description>In June 2022, Google introduced the Switch Transformer, an MoE-based Transformer model. By simplifying the MoE algorithm, the authors greatly improved computation and communication efficiency, finding that the model trains more efficiently than a comparable dense model.</description>
  <pubDate>Tue, 28 Oct 2025 01:38:12 GMT</pubDate>
</item>
<item>
  <title>Chinchilla Scaling Law</title>
  <link>https://maosong.website/p/chinchilla-scaling-law/</link>
  <guid>https://maosong.website/p/chinchilla-scaling-law/</guid>
  <description>In March 2022, DeepMind investigated how to choose the optimal model size and data size under a fixed compute budget, finding that for compute-optimal training, model size and dataset size should scale in equal proportion. Based on this scaling law, the authors trained Chinchilla, a 70B large language model that outperforms much larger models.</description>
  <pubDate>Wed, 22 Oct 2025 06:39:23 GMT</pubDate>
</item>
<item>
  <title>Kaplan Scaling Law</title>
  <link>https://maosong.website/p/kaplan-scaling-law/</link>
  <guid>https://maosong.website/p/kaplan-scaling-law/</guid>
  <description>In January 2020, OpenAI investigated how transformer loss relates to model size, dataset size, and compute budget. With such a scaling law, one can choose the optimal configuration under a fixed compute budget.</description>
  <pubDate>Wed, 22 Oct 2025 06:10:52 GMT</pubDate>
</item>
<item>
  <title>LLM FLOPs Computation</title>
  <link>https://maosong.website/p/llm-flops-computation/</link>
  <guid>https://maosong.website/p/llm-flops-computation/</guid>
  <description>We show how to compute the FLOPs of a transformer-based LLM; from this we can derive the relationship between compute $C$, parameter count $N$, and dataset size $D$, namely $C\approx 6ND$.</description>
  <pubDate>Wed, 15 Oct 2025 08:33:39 GMT</pubDate>
</item>
<item>
  <title>Notes on Keye-VL 1.5</title>
  <link>https://maosong.website/p/notes-on-keye-vl-1-5/</link>
  <guid>https://maosong.website/p/notes-on-keye-vl-1-5/</guid>
  <description>Kuaishou introduced Keye-VL 1.5, an 8B multimodal LLM emphasizing reasoning and video understanding. The authors propose a slow-fast video encoding strategy to improve video understanding, and strengthen long-context and reasoning capabilities through pretraining and post-training.</description>
  <pubDate>Thu, 11 Sep 2025 03:33:31 GMT</pubDate>
</item>
<item>
  <title>Notes on AdamW</title>
  <link>https://maosong.website/p/notes-on-adamw/</link>
  <guid>https://maosong.website/p/notes-on-adamw/</guid>
  <description>The authors propose a decoupled weight-decay scheme for the Adam optimizer.</description>
  <pubDate>Thu, 04 Sep 2025 02:27:03 GMT</pubDate>
</item>
<item>
  <title>Notes on Adam</title>
  <link>https://maosong.website/p/notes-on-adam/</link>
  <guid>https://maosong.website/p/notes-on-adam/</guid>
  <description>The authors propose Adam, a first-order optimization method that is efficient and invariant to gradient rescaling.</description>
  <pubDate>Thu, 04 Sep 2025 02:11:55 GMT</pubDate>
</item>
<item>
  <title>Notes on RNoPE-SWA</title>
  <link>https://maosong.website/p/notes-on-rnope-swa/</link>
  <guid>https://maosong.website/p/notes-on-rnope-swa/</guid>
  <description>The authors systematically analyze existing attention mechanisms and propose a hybrid attention mechanism that improves long-context performance while preserving short-context performance.</description>
  <pubDate>Tue, 02 Sep 2025 03:24:10 GMT</pubDate>
</item>
<item>
  <title>Notes on InternVL3.5</title>
  <link>https://maosong.website/p/notes-on-internvl3-5/</link>
  <guid>https://maosong.website/p/notes-on-internvl3-5/</guid>
  <description>Shanghai AI Lab introduced the InternVL 3.5 family of multimodal LLMs, which emphasizes reasoning capability and inference efficiency.</description>
  <pubDate>Mon, 01 Sep 2025 03:30:50 GMT</pubDate>
</item>
<item>
  <title>Ovis2.5 MLLM with stronger perception and reasoning capability</title>
  <link>https://maosong.website/p/ovis2-5-mllm-with-stronger-perception-and-reasoning-capability/</link>
  <guid>https://maosong.website/p/ovis2-5-mllm-with-stronger-perception-and-reasoning-capability/</guid>
  <description>The authors propose Ovis2.5, a multimodal LLM family improved from Ovis, in 2B and 9B sizes; its two highlighted features are support for variable-resolution image input and deep thinking.</description>
  <pubDate>Sat, 30 Aug 2025 09:34:44 GMT</pubDate>
</item>
<item>
  <title>Ovis-discrete visual embedding</title>
  <link>https://maosong.website/p/ovis-discrete-visual-embedding/</link>
  <guid>https://maosong.website/p/ovis-discrete-visual-embedding/</guid>
  <description>The authors propose Ovis, which discretizes the output features of the visual encoder to better align the LLM&apos;s visual and textual inputs.</description>
  <pubDate>Sat, 30 Aug 2025 09:32:22 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeekMoE</title>
  <link>https://maosong.website/p/notes-on-deepseekmoe/</link>
  <guid>https://maosong.website/p/notes-on-deepseekmoe/</guid>
  <description>In January 2024, DeepSeek released DeepSeekMoE, a family of large models addressing the insufficient expert specialization and redundancy problems of MoE models.</description>
  <pubDate>Fri, 29 Aug 2025 03:03:12 GMT</pubDate>
</item>
<item>
  <title>Notes on DeepSeek-LLM</title>
  <link>https://maosong.website/p/notes-on-deepseek-llm/</link>
  <guid>https://maosong.website/p/notes-on-deepseek-llm/</guid>
  <description>On January 5, 2024, DeepSeek released DeepSeek LLM in 7B and 67B sizes, with an emphasis on investigating scaling laws.</description>
  <pubDate>Tue, 26 Aug 2025 02:53:10 GMT</pubDate>
</item>
<item>
  <title>Notes on MFA</title>
  <link>https://maosong.website/p/notes-on-mfa/</link>
  <guid>https://maosong.website/p/notes-on-mfa/</guid>
  <description>StepFun and collaborators proposed Multi-matrix Factorization Attention (MFA), a new attention mechanism that maximizes model performance under a KV-cache budget.</description>
  <pubDate>Sat, 23 Aug 2025 08:04:34 GMT</pubDate>
</item>
<item>
  <title>Notes on MX-format</title>
  <link>https://maosong.website/p/notes-on-mx-format/</link>
  <guid>https://maosong.website/p/notes-on-mx-format/</guid>
  <description>MX format is a data representation format used in LLMs mainly for quantization. Compared with quantizing an entire tensor directly, MX format controls quantization at a finer granularity, improving model performance.</description>
  <pubDate>Thu, 21 Aug 2025 10:23:03 GMT</pubDate>
</item>
<item>
  <title>Notes on FlashAttention</title>
  <link>https://maosong.website/p/notes-on-flashattention/</link>
  <guid>https://maosong.website/p/notes-on-flashattention/</guid>
  <description>The authors propose FlashAttention, which speeds up attention by reducing the memory-access overhead of multi-head attention.</description>
  <pubDate>Thu, 21 Aug 2025 03:32:53 GMT</pubDate>
</item>
<item>
  <title>Notes on StreamingLLM</title>
  <link>https://maosong.website/p/notes-on-streamingllm/</link>
  <guid>https://maosong.website/p/notes-on-streamingllm/</guid>
  <description>The authors propose StreamingLLM, which uses attention sinks to improve sliding-window attention in very-long-context settings.</description>
  <pubDate>Wed, 20 Aug 2025 02:16:35 GMT</pubDate>
</item>
<item>
  <title>Notes on gpt-oss</title>
  <link>https://maosong.website/p/notes-on-gpt-oss/</link>
  <guid>https://maosong.website/p/notes-on-gpt-oss/</guid>
  <description>OpenAI released the gpt-oss large language models in 120B-A5.1B and 20.9B-A3.6B sizes, emphasizing instruction following, tool use, and adaptive thinking.</description>
  <pubDate>Tue, 19 Aug 2025 08:14:56 GMT</pubDate>
</item>
<item>
  <title>Notes on QK-Norm</title>
  <link>https://maosong.website/p/notes-on-qk-norm/</link>
  <guid>https://maosong.website/p/notes-on-qk-norm/</guid>
  <description>The authors propose QK norm, a scaling technique that stabilizes softmax attention weights.</description>
  <pubDate>Wed, 13 Aug 2025 08:12:11 GMT</pubDate>
</item>
<item>
  <title>Notes on GLM-4.5</title>
  <link>https://maosong.website/p/notes-on-glm-4-5/</link>
  <guid>https://maosong.website/p/notes-on-glm-4-5/</guid>
  <description>Zhipu AI introduced GLM-4.5, comprising two MoE LLMs, GLM-4.5 and GLM-4.5-Air, at 355B-A22B and 106B-A12B respectively; GLM-4.5 focuses on agentic, reasoning, and coding capabilities.</description>
  <pubDate>Wed, 13 Aug 2025 04:27:48 GMT</pubDate>
</item>
<item>
  <title>Notes on ARC-Hunyuan-Video-7B</title>
  <link>https://maosong.website/p/notes-on-arc-hunyuan-video-7b/</link>
  <guid>https://maosong.website/p/notes-on-arc-hunyuan-video-7b/</guid>
  <description>Tencent ARC Lab introduced ARC-Hunyuan-Video-7B, a video multimodal LLM for short-video understanding and reasoning.</description>
  <pubDate>Tue, 12 Aug 2025 02:57:57 GMT</pubDate>
</item>
<item>
  <title>Notes on GQA</title>
  <link>https://maosong.website/p/notes-on-gqa/</link>
  <guid>https://maosong.website/p/notes-on-gqa/</guid>
  <description>In December 2023, Google Research proposed Group Query Attention (GQA), a method for improving multi-head attention efficiency. GQA has been adopted by Qwen models since the Qwen2 series.</description>
  <pubDate>Thu, 07 Aug 2025 10:08:36 GMT</pubDate>
</item>
<item>
  <title>Notes on MQA</title>
  <link>https://maosong.website/p/notes-on-mqa/</link>
  <guid>https://maosong.website/p/notes-on-mqa/</guid>
  <description>In 2019, Google proposed multi-query attention (MQA) to address the memory-bandwidth bottleneck of multi-head attention.</description>
  <pubDate>Thu, 07 Aug 2025 10:06:37 GMT</pubDate>
</item>
<item>
  <title>Notes on Moonlight</title>
  <link>https://maosong.website/p/notes-on-moonlight/</link>
  <guid>https://maosong.website/p/notes-on-moonlight/</guid>
  <description>Kimi introduced Moonlight, a 16B-A3B MoE LLM trained with the Muon optimizer; the authors detail how to scale up Muon.</description>
  <pubDate>Thu, 07 Aug 2025 02:49:32 GMT</pubDate>
</item>
<item>
  <title>Notes on Hunyuan-Large</title>
  <link>https://maosong.website/p/notes-on-hunyuan-large/</link>
  <guid>https://maosong.website/p/notes-on-hunyuan-large/</guid>
  <description>Tencent Hunyuan introduced Hunyuan-Large, a 389B-A52B MoE LLM with a 256K context length.</description>
  <pubDate>Wed, 06 Aug 2025 08:46:32 GMT</pubDate>
</item>
<item>
  <title>Notes on GSPO</title>
  <link>https://maosong.website/p/notes-on-gspo/</link>
  <guid>https://maosong.website/p/notes-on-gspo/</guid>
  <description>Qwen proposed Group Sequence Policy Optimization (GSPO), an RL algorithm that improves on GRPO. GSPO computes the importance ratio at the sequence level, avoiding the training instability caused by token-level computation.</description>
  <pubDate>Wed, 06 Aug 2025 03:26:26 GMT</pubDate>
</item>
<item>
  <title>Notes on Muon blog</title>
  <link>https://maosong.website/p/notes-on-muon-blog/</link>
  <guid>https://maosong.website/p/notes-on-muon-blog/</guid>
  <description>Muon (MomentUm Orthogonalized by Newton-Schulz) is an optimizer for the 2D parameters (weight matrices) of neural networks; it extends SGD-momentum with a Newton-Schulz post-processing step.</description>
  <pubDate>Tue, 05 Aug 2025 03:10:51 GMT</pubDate>
</item>
<item>
  <title>Notes on AFM2025</title>
  <link>https://maosong.website/p/notes-on-afm2025/</link>
  <guid>https://maosong.website/p/notes-on-afm2025/</guid>
  <description>In July, Apple released the AFM technical report, covering two multilingual multimodal models: one for on-device use and one for the server.</description>
  <pubDate>Tue, 29 Jul 2025 04:36:28 GMT</pubDate>
</item>
<item>
  <title>Notes on Kimi-k2</title>
  <link>https://maosong.website/p/notes-on-kimi-k2/</link>
  <guid>https://maosong.website/p/notes-on-kimi-k2/</guid>
  <description>Kimi-k2 is an MoE large language model with 1T total and 32B active parameters, trained on 15.5T tokens with the MuonClip optimizer. The authors focus on the model&apos;s agentic capabilities.</description>
  <pubDate>Thu, 24 Jul 2025 02:56:50 GMT</pubDate>
</item>
<item>
  <title>Notes on Keye-VL</title>
  <link>https://maosong.website/p/notes-on-keye-vl/</link>
  <guid>https://maosong.website/p/notes-on-keye-vl/</guid>
  <description>Keye-VL is an 8B multimodal LLM released by Kuaishou in July 2025, with short-video understanding as its highlight. Pretraining comprises 4 stages over 600B tokens; post-training comprises 2 stages that strengthen the model&apos;s reasoning and non-reasoning abilities.</description>
  <pubDate>Wed, 23 Jul 2025 03:11:43 GMT</pubDate>
</item>
<item>
  <title>LLM Parameter Computation</title>
  <link>https://maosong.website/p/llm-parameter-computation/</link>
  <guid>https://maosong.website/p/llm-parameter-computation/</guid>
  <description>We explain how to compute an LLM&apos;s parameter count: starting from the Qwen3 architecture, we break the model down and derive a parameter-count formula.</description>
  <pubDate>Tue, 22 Jul 2025 02:50:47 GMT</pubDate>
</item>
<item>
  <title>Notes on Seed1.6</title>
  <link>https://maosong.website/p/notes-on-seed1-6/</link>
  <guid>https://maosong.website/p/notes-on-seed1-6/</guid>
  <description>Seed 1.6 supports adaptive deep thinking and multimodal understanding, with a 256K context length.</description>
  <pubDate>Fri, 18 Jul 2025 06:59:35 GMT</pubDate>
</item>
<item>
  <title>Notes on V-Triune</title>
  <link>https://maosong.website/p/notes-on-v-triune/</link>
  <guid>https://maosong.website/p/notes-on-v-triune/</guid>
  <description>A unified RL training framework for improving VLM perception and reasoning.</description>
  <pubDate>Thu, 17 Jul 2025 01:37:36 GMT</pubDate>
</item>
<item>
  <title>Notes on Magistral</title>
  <link>https://maosong.website/p/notes-on-magistral/</link>
  <guid>https://maosong.website/p/notes-on-magistral/</guid>
  <description>Magistral is Mistral&apos;s reasoning model family, targeting the math and code domains.</description>
  <pubDate>Wed, 16 Jul 2025 03:04:04 GMT</pubDate>
</item>
<item>
  <title>Notes on SmolLM3</title>
  <link>https://maosong.website/p/notes-on-smollm3/</link>
  <guid>https://maosong.website/p/notes-on-smollm3/</guid>
  <description>On July 8, 2025, Hugging Face released SmolLM3, a 3B small language model with a 128K context, support for 6 languages, and dual-mode reasoning.</description>
  <pubDate>Tue, 15 Jul 2025 03:01:13 GMT</pubDate>
</item>
<item>
  <title>Notes on GLM-4.1V-Thinking</title>
  <link>https://maosong.website/p/notes-on-glm-4-1v-thinking/</link>
  <guid>https://maosong.website/p/notes-on-glm-4-1v-thinking/</guid>
  <description>In July 2025, Zhipu AI released GLM-4.1V-Thinking, a 9B multimodal large language model that achieves SOTA among similarly sized MLLMs on multiple benchmarks.</description>
  <pubDate>Mon, 14 Jul 2025 02:32:04 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2.5-1M</title>
  <link>https://maosong.website/p/notes-on-qwen2-5-1m/</link>
  <guid>https://maosong.website/p/notes-on-qwen2-5-1m/</guid>
  <description>A summary of the Qwen2.5-1M technical report.</description>
  <pubDate>Sat, 12 Jul 2025 03:00:47 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2.5</title>
  <link>https://maosong.website/p/notes-on-qwen2-5/</link>
  <guid>https://maosong.website/p/notes-on-qwen2-5/</guid>
  <description>A summary of the Qwen2.5 technical report.</description>
  <pubDate>Sat, 12 Jul 2025 02:51:42 GMT</pubDate>
</item>
<item>
  <title>Dual Chunk Attention</title>
  <link>https://maosong.website/p/dual-chunk-attention/</link>
  <guid>https://maosong.website/p/dual-chunk-attention/</guid>
  <description>A training-free context-extension strategy.</description>
  <pubDate>Sat, 12 Jul 2025 02:41:12 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2</title>
  <link>https://maosong.website/p/notes-on-qwen2/</link>
  <guid>https://maosong.website/p/notes-on-qwen2/</guid>
  <description>A summary of the Qwen2 technical report.</description>
  <pubDate>Sat, 12 Jul 2025 02:36:43 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen1.5</title>
  <link>https://maosong.website/p/notes-on-qwen1-5/</link>
  <guid>https://maosong.website/p/notes-on-qwen1-5/</guid>
  <description>In January 2024, Qwen released Qwen1.5 in eight sizes (0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B), plus an MoE model.</description>
  <pubDate>Thu, 03 Jul 2025 09:37:39 GMT</pubDate>
</item>
<item>
  <title>Notes on YaRN</title>
  <link>https://maosong.website/p/notes-on-yarn/</link>
  <guid>https://maosong.website/p/notes-on-yarn/</guid>
  <description>YaRN (Yet another RoPE extensioN method) was proposed in September 2023 by EleutherAI and others to extend LLM context length; it was later adopted by the Qwen model series.</description>
  <pubDate>Thu, 03 Jul 2025 06:40:49 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen-LLM</title>
  <link>https://maosong.website/p/notes-on-qwen-llm/</link>
  <guid>https://maosong.website/p/notes-on-qwen-llm/</guid>
  <description>A summary of the Qwen technical report.</description>
  <pubDate>Thu, 03 Jul 2025 02:47:27 GMT</pubDate>
</item>
<item>
  <title>Hands on LLM(2) Transformer</title>
  <link>https://maosong.website/p/hands-on-llm-2-transformer/</link>
  <guid>https://maosong.website/p/hands-on-llm-2-transformer/</guid>
  <description>Explains the Transformer architecture and its core code, using Qwen3 as a running example.</description>
  <pubDate>Sun, 29 Jun 2025 03:40:39 GMT</pubDate>
</item>
<item>
  <title>Unified perspective on dLLM and LLM</title>
  <link>https://maosong.website/p/unified-perspective-on-dllm-and-llm/</link>
  <guid>https://maosong.website/p/unified-perspective-on-dllm-and-llm/</guid>
  <description>A unified perspective on diffusion language models (dLLMs) and autoregressive LLMs</description>
  <pubDate>Sat, 28 Jun 2025 07:02:09 GMT</pubDate>
</item>
<item>
  <title>Relationship between MLE and KL divergence</title>
  <link>https://maosong.website/p/relationship-between-mle-and-kl-divergence/</link>
  <guid>https://maosong.website/p/relationship-between-mle-and-kl-divergence/</guid>
  <description>A derivation of the equivalence between MLE and KL divergence minimization</description>
  <pubDate>Fri, 27 Jun 2025 03:35:33 GMT</pubDate>
</item>
<item>
  <title>Notes on MiMo-VL</title>
  <link>https://maosong.website/p/notes-on-mimo-vl/</link>
  <guid>https://maosong.website/p/notes-on-mimo-vl/</guid>
  <description>MiMo-VL, built on MiMo-7B, is a multimodal reasoning large language model</description>
  <pubDate>Thu, 05 Jun 2025 02:51:43 GMT</pubDate>
</item>
<item>
  <title>Hands on LLM(1) Tokenizer</title>
  <link>https://maosong.website/p/hands-on-llm-1-tokenizer/</link>
  <guid>https://maosong.website/p/hands-on-llm-1-tokenizer/</guid>
  <description>A summary of tokenizers, with an efficient implementation of BPE</description>
  <pubDate>Sat, 24 May 2025 11:56:34 GMT</pubDate>
</item>
<item>
  <title>Notes on attention bias</title>
  <link>https://maosong.website/p/notes-on-attention-bias/</link>
  <guid>https://maosong.website/p/notes-on-attention-bias/</guid>
  <description>Why Transformers omit the QKV bias</description>
  <pubDate>Thu, 22 May 2025 07:25:07 GMT</pubDate>
</item>
<item>
  <title>Notes on Position encoding</title>
  <link>https://maosong.website/p/notes-on-position-encoding/</link>
  <guid>https://maosong.website/p/notes-on-position-encoding/</guid>
  <description>From absolute position encoding to RoPE</description>
  <pubDate>Mon, 19 May 2025 02:46:39 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3</title>
  <link>https://maosong.website/p/notes-on-qwen3/</link>
  <guid>https://maosong.website/p/notes-on-qwen3/</guid>
  <description>Qwen3 comprises six dense models and two MoE models; its highlights are switching between fast and slow thinking modes, multilingual support, and an adjustable thinking budget</description>
  <pubDate>Thu, 15 May 2025 06:48:11 GMT</pubDate>
</item>
<item>
  <title>Notes on Seed1.5-VL</title>
  <link>https://maosong.website/p/notes-on-seed1-5-vl/</link>
  <guid>https://maosong.website/p/notes-on-seed1-5-vl/</guid>
  <description>ByteDance&apos;s Seed team released the Seed1.5-VL technical report on May 11, detailing the model&apos;s architecture, training, and evaluation</description>
  <pubDate>Wed, 14 May 2025 01:28:07 GMT</pubDate>
</item>
<item>
  <title>Distributed Training: Parameter and Compute Analysis</title>
  <link>https://maosong.website/p/distributed-training-computations/</link>
  <guid>https://maosong.website/p/distributed-training-computations/</guid>
  <description>Parameter-count and compute analysis for distributed training</description>
  <pubDate>Tue, 13 May 2025 03:26:36 GMT</pubDate>
</item>
<item>
  <title>Distributed Training: How to Train a Model</title>
  <link>https://maosong.website/p/distributed-training-pytorch-training/</link>
  <guid>https://maosong.website/p/distributed-training-pytorch-training/</guid>
  <description>How to train a model with PyTorch in a distributed setting</description>
  <pubDate>Tue, 13 May 2025 03:26:36 GMT</pubDate>
</item>
<item>
  <title>Distributed Training: Basics</title>
  <link>https://maosong.website/p/distributed-training-basic/</link>
  <guid>https://maosong.website/p/distributed-training-basic/</guid>
  <description>Basic concepts in distributed training</description>
  <pubDate>Mon, 12 May 2025 02:15:17 GMT</pubDate>
</item>
<item>
  <title>Notes on LLaMA4 blog</title>
  <link>https://maosong.website/p/notes-on-llama4-blog/</link>
  <guid>https://maosong.website/p/notes-on-llama4-blog/</guid>
  <description>Reading notes on the LLaMA4 blog post</description>
  <pubDate>Wed, 30 Apr 2025 02:44:19 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen3 blog</title>
  <link>https://maosong.website/p/notes-on-qwen3-blog/</link>
  <guid>https://maosong.website/p/notes-on-qwen3-blog/</guid>
  <description>The release of the Qwen3 series of LLMs</description>
  <pubDate>Tue, 29 Apr 2025 03:23:04 GMT</pubDate>
</item>
<item>
  <title>Data mixture in MLLM</title>
  <link>https://maosong.website/p/data-mixture-in-mllm/</link>
  <guid>https://maosong.website/p/data-mixture-in-mllm/</guid>
  <description>A brief summary of training data mixtures for MLLMs</description>
  <pubDate>Fri, 25 Apr 2025 02:25:48 GMT</pubDate>
</item>
<item>
  <title>Essay: Physical Health</title>
  <link>https://maosong.website/p/%E9%9A%8F%E7%AC%94/</link>
  <guid>https://maosong.website/p/%E9%9A%8F%E7%AC%94/</guid>
  <description>Only when plagued by illness does one realize the importance of good health</description>
  <pubDate>Wed, 23 Apr 2025 05:24:02 GMT</pubDate>
</item>
<item>
  <title>Notes on VAPO</title>
  <link>https://maosong.website/p/notes-on-vapo/</link>
  <guid>https://maosong.website/p/notes-on-vapo/</guid>
  <description>ByteDance&apos;s Seed team proposed VAPO, which combines the strengths of DAPO and VC-PPO to address problems in long-CoT tasks and improve reasoning-model performance</description>
  <pubDate>Thu, 17 Apr 2025 01:41:51 GMT</pubDate>
</item>
<item>
  <title>Notes on VC-PPO</title>
  <link>https://maosong.website/p/notes-on-vc-ppo/</link>
  <guid>https://maosong.website/p/notes-on-vc-ppo/</guid>
  <description>ByteDance&apos;s Seed team proposed Value-Calibrated PPO (VC-PPO) to address PPO&apos;s value initialization bias and reward signal decay problems</description>
  <pubDate>Mon, 14 Apr 2025 09:36:15 GMT</pubDate>
</item>
<item>
  <title>Notes on DAPO</title>
  <link>https://maosong.website/p/notes-on-dapo/</link>
  <guid>https://maosong.website/p/notes-on-dapo/</guid>
  <description>Notes on DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)</description>
  <pubDate>Wed, 09 Apr 2025 13:40:33 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2.5-Omni</title>
  <link>https://maosong.website/p/notes-on-qwen2-5-omni/</link>
  <guid>https://maosong.website/p/notes-on-qwen2-5-omni/</guid>
  <description>Academic notes on Qwen2.5-Omni</description>
  <pubDate>Tue, 01 Apr 2025 02:29:00 GMT</pubDate>
</item>
<item>
  <title>Understanding Sigmoid Loss in SigLIP</title>
  <link>https://maosong.website/p/understanding-sigmoid-loss-in-siglip/</link>
  <guid>https://maosong.website/p/understanding-sigmoid-loss-in-siglip/</guid>
  <description>Understanding the sigmoid loss in SigLIP</description>
  <pubDate>Fri, 28 Mar 2025 06:55:50 GMT</pubDate>
</item>
<item>
  <title>Notes on Aya Vision</title>
  <link>https://maosong.website/p/notes-on-aya-vision/</link>
  <guid>https://maosong.website/p/notes-on-aya-vision/</guid>
  <description>Aya Vision comes in two sizes, 8B and 32B, and supports 23 languages</description>
  <pubDate>Mon, 17 Mar 2025 09:58:24 GMT</pubDate>
</item>
<item>
  <title>Notes on Gemma3</title>
  <link>https://maosong.website/p/notes-on-gemma3/</link>
  <guid>https://maosong.website/p/notes-on-gemma3/</guid>
  <description>Notes on Gemma3 technical report</description>
  <pubDate>Sat, 15 Mar 2025 03:15:29 GMT</pubDate>
</item>
<item>
  <title>Overview of Qwen-VL series</title>
  <link>https://maosong.website/p/overview-of-qwen-vl-series/</link>
  <guid>https://maosong.website/p/overview-of-qwen-vl-series/</guid>
  <description>Overview of Qwen-VL series</description>
  <pubDate>Sun, 09 Mar 2025 07:11:29 GMT</pubDate>
</item>
<item>
  <title>Notes on QwQ-32B</title>
  <link>https://maosong.website/p/notes-on-qwq-32b/</link>
  <guid>https://maosong.website/p/notes-on-qwq-32b/</guid>
  <description>Notes on QwQ-32B</description>
  <pubDate>Sat, 08 Mar 2025 01:46:16 GMT</pubDate>
</item>
<item>
  <title>compression is intelligence</title>
  <link>https://maosong.website/p/compression-is-intelligence/</link>
  <guid>https://maosong.website/p/compression-is-intelligence/</guid>
  <description>Understanding large models from the perspective of compression as intelligence</description>
  <pubDate>Thu, 06 Mar 2025 09:57:51 GMT</pubDate>
</item>
<item>
  <title>Notes on Qwen2.5 VL</title>
  <link>https://maosong.website/p/notes-on-qwen2-5-vl/</link>
  <guid>https://maosong.website/p/notes-on-qwen2-5-vl/</guid>
  <description>Academic notes on Qwen2.5 VL</description>
  <pubDate>Tue, 04 Mar 2025 02:46:42 GMT</pubDate>
</item>
<item>
  <title>Git authentication error</title>
  <link>https://maosong.website/p/git-authentication-error/</link>
  <guid>https://maosong.website/p/git-authentication-error/</guid>
  <description>Git clone authentication error.</description>
  <pubDate>Sat, 22 Feb 2025 02:51:27 GMT</pubDate>
</item>
<item>
  <title>Notes on Kimi k1.5</title>
  <link>https://maosong.website/p/notes-on-kimi-k1-5/</link>
  <guid>https://maosong.website/p/notes-on-kimi-k1-5/</guid>
  <description>A brief introduction to Kimi k1.5</description>
  <pubDate>Sat, 08 Feb 2025 02:09:52 GMT</pubDate>
</item>
<item>
  <title>Screen usage</title>
  <link>https://maosong.website/p/screen-usage/</link>
  <guid>https://maosong.website/p/screen-usage/</guid>
  <description>Screen usage</description>
  <pubDate>Wed, 05 Feb 2025 07:14:43 GMT</pubDate>
</item>
<item>
  <title>Notes on Phi-4</title>
  <link>https://maosong.website/p/notes-on-phi-4/</link>
  <guid>https://maosong.website/p/notes-on-phi-4/</guid>
  <description>A brief introduction to Phi-4</description>
  <pubDate>Mon, 16 Dec 2024 09:33:52 GMT</pubDate>
</item>
<item>
  <title>An overview of adaptation layers in multimodal large language models</title>
  <link>https://maosong.website/p/an-overview-of-adaption-layer-in-multimodal-large-language-models/</link>
  <guid>https://maosong.website/p/an-overview-of-adaption-layer-in-multimodal-large-language-models/</guid>
  <description>An overview of the different adaptation layers used in MLLMs</description>
  <pubDate>Sat, 09 Nov 2024 01:53:43 GMT</pubDate>
</item>
<item>
  <title>Notes on VITA</title>
  <link>https://maosong.website/p/notes-on-vita/</link>
  <guid>https://maosong.website/p/notes-on-vita/</guid>
  <description>The first open-source Multimodal Large Language Model (MLLM) capable of simultaneously processing and analyzing video, image, text, and audio, while offering an advanced multimodal interactive experience</description>
  <pubDate>Tue, 13 Aug 2024 06:36:45 GMT</pubDate>
</item>
<item>
  <title>ROUGE (Recall-Oriented Understudy for Gisting Evaluation)</title>
  <link>https://maosong.website/p/rouge-recall-oriented-understudy/</link>
  <guid>https://maosong.website/p/rouge-recall-oriented-understudy/</guid>
  <description>A metric that evaluates the similarity between generated and reference summaries</description>
  <pubDate>Thu, 09 May 2024 09:35:20 GMT</pubDate>
</item>
<item>
  <title>Formal Algorithms for Transformer</title>
  <link>https://maosong.website/p/formal-algorithms-for-transformer/</link>
  <guid>https://maosong.website/p/formal-algorithms-for-transformer/</guid>
  <description>A formal algorithmic description of how the Transformer works</description>
  <pubDate>Thu, 02 May 2024 05:13:12 GMT</pubDate>
</item>
<item>
  <title>MiniGPT-4-Enhancing Vision-Language Understanding with Advanced Large Language Models</title>
  <link>https://maosong.website/p/minigpt-4-enhancing-vision-language-understanding-with-advanced-large-language-models/</link>
  <guid>https://maosong.website/p/minigpt-4-enhancing-vision-language-understanding-with-advanced-large-language-models/</guid>
  <description>Notes on MiniGPT-4, which enhances vision-language understanding with advanced large language models</description>
  <pubDate>Thu, 02 May 2024 05:13:12 GMT</pubDate>
</item>
<item>
  <title>Notes on t-SNE</title>
  <link>https://maosong.website/p/notes-on-t-sne/</link>
  <guid>https://maosong.website/p/notes-on-t-sne/</guid>
  <description>Learning notes on t-SNE</description>
  <pubDate>Thu, 02 May 2024 05:13:12 GMT</pubDate>
</item>
<item>
  <title>Regularization methods in deep learning</title>
  <link>https://maosong.website/p/regularization-methods-in-deep-learning/</link>
  <guid>https://maosong.website/p/regularization-methods-in-deep-learning/</guid>
  <description>Overview of the regularization methods in deep learning.</description>
  <pubDate>Sat, 27 Apr 2024 10:02:02 GMT</pubDate>
</item>
<item>
  <title>BLEU (Bilingual Evaluation Understudy)</title>
  <link>https://maosong.website/p/bleu-bilingual-evaluation-understudy/</link>
  <guid>https://maosong.website/p/bleu-bilingual-evaluation-understudy/</guid>
  <description>A metric that evaluates machine-translation quality</description>
  <pubDate>Thu, 25 Apr 2024 14:46:53 GMT</pubDate>
</item>
<item>
  <title>Notes on Llama3</title>
  <link>https://maosong.website/p/notes-on-llama3/</link>
  <guid>https://maosong.website/p/notes-on-llama3/</guid>
  <description>A brief introduction to Llama3</description>
  <pubDate>Mon, 22 Apr 2024 08:22:19 GMT</pubDate>
</item>
<item>
  <title>Practical advice for analysis of large, complex data sets</title>
  <link>https://maosong.website/p/practical-advice-for-analysis-of-large-complex-data-sets/</link>
  <guid>https://maosong.website/p/practical-advice-for-analysis-of-large-complex-data-sets/</guid>
  <description>Advice on how to analyze large, complex data sets</description>
  <pubDate>Wed, 17 Apr 2024 14:40:11 GMT</pubDate>
</item>
<item>
  <title>Notes on RAG</title>
  <link>https://maosong.website/p/notes-on-rag/</link>
  <guid>https://maosong.website/p/notes-on-rag/</guid>
  <description>Notes on Retrieval-Augmented Generation (RAG)</description>
  <pubDate>Sun, 14 Apr 2024 04:38:04 GMT</pubDate>
</item>
<item>
  <title>What&apos;s next for AI agentic workflows</title>
  <link>https://maosong.website/p/what-s-next-for-ai-agentic-workflows/</link>
  <guid>https://maosong.website/p/what-s-next-for-ai-agentic-workflows/</guid>
  <description>Design for AI agentic workflows</description>
  <pubDate>Sun, 14 Apr 2024 04:38:04 GMT</pubDate>
</item>
<item>
  <title>Rules of Machine Learning</title>
  <link>https://maosong.website/p/rules-of-machine-learning/</link>
  <guid>https://maosong.website/p/rules-of-machine-learning/</guid>
  <description>Practical advice for machine-learning engineering</description>
  <pubDate>Sat, 13 Apr 2024 11:57:47 GMT</pubDate>
</item>
<item>
  <title>How to fix an HTTP error when creating a new environment</title>
  <link>https://maosong.website/p/how-to-fix-http-error-when-creating-a-new-environment/</link>
  <guid>https://maosong.website/p/how-to-fix-http-error-when-creating-a-new-environment/</guid>
  <description>Conda configuration</description>
  <pubDate>Fri, 25 Aug 2023 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>