Blog

Notes, tutorials, and longer-form writing. Each entry is generated from a content bundle under src/content/<slug>/article.md or article.mdx. Search and filter by tags or category →

All posts

LLM KL divergence: from definition to application May. 04, 2026
Why unbiased KL estimates need not give unbiased KL gradients; forward vs reverse KL, estimators in on/off-policy RL, and experiments.

tutorial
LLM Reinforcement Learning for Large Language Models: An Overview May. 04, 2026
Publish‑ready workflow that lets you focus on ideas, not infrastructure

tutorial
LLM Notes on OpenMath-Nemotron 2026-04-15 09:50:39+08:00
NVIDIA's winning solution in the AIMO-2 competition.

Reasoning NVIDIA AIMO
Infra Performance and Scalability 2026-03-26 17:44:26+08:00
This post introduces strong scaling and weak scaling.

Scaling Law
Math Fixed Point Theorem 2026-03-09 17:16:02+08:00
The fixed-point theorem.
Infra Notes on roofline model 2026-02-26 17:23:40+08:00
The roofline model is a theoretical foundation for infra analysis and offers guidance for algorithm design and optimization.

roofline GPU
MLLM Notes on Step3-VL 10B 2026-02-13 18:05:47+08:00
StepFun introduced Step3-VL-10B in January 2026, an open-source multimodal large model that emphasizes perception, complex reasoning, and human-centric alignment.

StepFun
LLM Notes on Kimi-k2.5 2026-02-12 11:13:13+08:00
Kimi released Kimi K2.5 in February 2026, a multimodal agentic model built on Kimi K2. Pre-training uses joint image-text training, and post-training uses zero-vision SFT and multimodal RL to improve the model's reasoning and generalization. Kimi K2.5 also introduces Agent Swarm to solve complex tasks more efficiently.

kimi Reasoning
Machine Learning Notes on KL divergence 2026-01-24 16:32:14+0800
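In reinforcement learning, KL divergence is often used as a policy regularizer, but many instabilities come not from KL itself but from how it is estimated. This post shows why an unbiased KL estimate does not guarantee an unbiased KL gradient, and systematically analyzes how different KL estimators behave in on-policy and off-policy settings. Through theoretical derivation and experiments, it reveals the essential difference between using KL as a loss and as reward shaping, and explains the reasoning behind the low-variance KL designs used in practice.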
LLM Notes on Qwen3-Next 2026-01-23 10:29:56+0800
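In September 2025, the Qwen team introduced Qwen3-Next, a large language model based on hybrid attention and an MoE architecture, designed to significantly improve training and inference efficiency. By combining the strengths of linear attention and softmax attention, it substantially improves computational efficiency while maintaining strong performance.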

Qwen linear attention MoE
Infra megatron-lm 2026-01-21 18:04:12+0800
NVIDIA introduced Megatron-LM in 2020, a framework for training large-scale LLMs based on tensor parallelism. The paper focuses on tensor parallelism.

parallelism NVIDIA
LLM Notes on Gated Attention 2026-01-20 15:41:52+0800
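In 2025, Qwen systematically studied gating mechanisms in attention and found that introducing nonlinearity and sparsity into attention can, at very low cost, significantly improve model expressiveness, training stability, and long-context performance.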

Attention Qwen Oral
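Unified MLLM NextFlow: a single-branch multimodal large model for unified understanding and generation 2026-01-17 17:31:53+0800
ByteDance introduced NextFlow in January 2026, a unified understanding-and-generation multimodal model built on a decoder-only autoregressive transformer, validating the effectiveness of a purely autoregressive architecture for unified models.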

Seed tokenizer VAR
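State of AI: understanding the capability tiers and competitive logic of large AI models through OpenRouter's 100T-token usage data 2026-01-17 17:04:07+0800
In December 2025, OpenRouter published a statistical report based on 100T tokens of call data, analyzing current AI model usage from the perspectives of models, tasks, and users.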

Report
LLM LLM Memory Computation 2026-01-17 10:04:32+0800
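This post explains how to compute the memory requirements of LLMs during training and inference, and briefly covers the corresponding optimization methods.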

Transformer Training Inference
Infra Nvidia-GPU specs 2026-01-14 11:09:19+0800
This post summarizes the technical specifications and key improvements of the NVIDIA GPU series.

NVIDIA GPU
LLM Notes on GLaM 2026-01-06 18:07:29+0800
Google introduced GLaM in August 2022, a family of MoE-based large language models that outperform GPT-3.

MoE Google
MLLM Notes on MiniMax-01 2026-01-06 17:38:01+0800
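MiniMax-01 is a family of large models built on a hybrid attention architecture, comprising MiniMax-Text-01 and MiniMax-VL-01. MiniMax-Text-01 supports a 4M-token context length at inference, and MiniMax-VL-01 is trained on 512B vision-language tokens.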

Attention MoE MiniMax
LLM Notes on DeepSeek-V3.2 2026-01-06 17:30:40+0800
DeepSeek released DeepSeek-V3.2 in October 2025; the model emphasizes sparse attention, scaling RL, and agentic task synthesis.

deepseek Attention Reasoning Agent
MLLM Notes on Gemini3.0 2026-01-06 10:26:39+0800
Gemini 3.0 is Google's most capable new-generation model; the model card presents the evaluation results and basic capabilities of the Gemini 3.0 series.

Google MoE
Machine Learning Notes on Softmax 2025-12-27 16:39:53+0800
This post covers the properties, implementation, and applications of the softmax function, for later reference and study.

activation
LLM Notes on NoPE 2025-12-24 15:19:42+0800
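NoPE is an approach that requires no explicit position encoding; related work shows that a model can learn positional information and perform length extrapolation without position encodings.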

position encoding
LLM Notes on ALiBi 2025-12-24 15:10:55+0800
Meta and collaborators proposed ALiBi, a method that implements position encoding through linear biases to improve LLM extrapolation at inference time.

meta position encoding
LLM Notes on T5 2025-12-24 15:07:08+0800
Google published T5 (Text-to-Text Transfer Transformer) in 2020, a transfer learning framework that casts every NLP task into a unified text-to-text format.

Transfer learning Google position encoding
Infra GPipe 2025-12-23 16:49:25+0800
Google introduced GPipe in 2018, a parallelism strategy that uses pipeline parallelism to train large-scale neural networks.

parallelism Google
LLM Base of RoPE Bounds Context Length 2025-12-22 11:34:42+0800
Baichuan studied the relationship between the RoPE base frequency and context length in LLMs, and gave a lower bound relating the base frequency and the context length.

RoPE position encoding Long context
LLM Notes on NSA 2025-12-15 17:39:16+0800
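In January 2025, DeepSeek proposed Natively trainable Sparse Attention (NSA), a sparse attention mechanism co-designed across software and hardware. NSA improves both inference efficiency and computational efficiency.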

Attention deepseek best_paper
LLM MoE tutorial 2025-12-13 16:04:04+0800
This blog details key design choices of MoE models and related experimental results, providing a foundation for learning about MoE models.

MoE Architecture
LLM Notes on Ling-mini-beta 2025-12-13 15:58:51+0800
Ant Group proposed a scaling law for MoE models and, based on it, introduced Ling-mini-beta.

Ling MoE Scaling Law
LLM Load Balancing tutorial 2025-12-11 16:10:08+0800
In this post we discuss the definition, properties, and generalizations of the load balancing loss.

MoE
LLM Notes on Global-batch load balancing 2025-12-11 16:09:34+0800
In February 2025, Qwen proposed a global-batch load balancing loss strategy, which considers each expert's load balance at the global level to improve model performance.

Qwen MoE
LLM Notes on DPO 2025-12-09 10:43:11+0800
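The authors propose DPO, a preference optimization method that needs no reward model. DPO models the reward implicitly and trains the policy model directly on the dataset, greatly improving the training efficiency of LLM preference optimization.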

RL alignment Oral
LLM Notes on DeepSeek-V3 2025-12-08 11:14:45+0800
DeepSeek released DeepSeek-V3 in November 2024, a large language model trained with only 2.8M H800 GPU hours that reaches SOTA performance across benchmarks.

deepseek MoE Reasoning
MLLM Notes on Gemini2.5 2025-12-06 18:14:15+0800
DeepMind released the Gemini2.5 series technical report on June 17, covering the Pro and Flash versions.

Google MoE
LLM Notes on olmoe 2025-12-06 18:08:11+0800
NUS and collaborators introduced OpenMoE, a fully open-source family of MoE large language models; the authors describe the routing mechanism in MoE in detail.

MoE
MLLM Notes on Qwen3 VL 2025-12-05 10:12:01+0800
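Qwen released the Qwen3-VL technical report on November 27, 2025; the authors highlight the model's pure-text understanding, long-context capability, and multimodal reasoning.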

Qwen
LLM Notes on SAPO 2025-12-05 10:09:06+0800
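In November 2025, Qwen proposed SAPO, which uses a temperature-controlled soft gate and asymmetric temperatures to address the problems of hard clipping, improving the stability and efficiency of RL training.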

Qwen RL
MLLM Notes on DeepStack 2025-12-04 17:32:41+0800
The authors propose DeepStack, which helps MLLMs make better use of visual information and thereby improves performance on downstream tasks.
MLLM Notes on ViT 2025-12-04 11:00:44+0800
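Google introduced ViT in 2021, a Transformer-based architecture for image recognition; the authors experimentally validate the success of the Transformer architecture in image recognition.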

VFM Google position encoding
MLLM Notes on CoMP 2025-12-04 10:58:30+0800
The authors propose a continual multimodal pretraining pipeline for vision foundation models to improve performance on downstream tasks.

VFM position encoding
LLM Notes on DeepSeek-R1 2025-12-02 18:21:54+0800
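DeepSeek introduced DeepSeek-V2 in May 2024, an MoE-based large language model with 236B-A21B parameters. The authors use MLA to compress the KV cache and the DeepSeekMoE architecture to improve training efficiency and performance.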

deepseek MoE
LLM Notes on DeepSeek-V2 2025-12-02 18:21:54+0800
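DeepSeek introduced DeepSeek-V2 in May 2024, an MoE-based large language model with 236B-A21B parameters. The authors use MLA to compress the KV cache and the DeepSeekMoE architecture to improve training efficiency and performance.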

deepseek MoE
LLM Notes on MLA 2025-12-02 18:21:54+0800
DeepSeek introduced multi-head latent attention (MLA) in May 2024 to improve the inference efficiency of attention.

Attention deepseek
LLM Notes on Loss-free Balancing 2025-11-21 15:38:52+0800
In August 2024, DeepSeek proposed the loss-free balancing strategy, which achieves load balancing without modifying the training gradients and thereby improves model performance.

deepseek MoE
LLM Mixtral 8x7B 2025-11-01 15:32:30+0800
Mistral introduced Mixtral 8x7B in January 2024, an MoE large language model with 8 experts, 2 of which are activated; it has 47B total parameters and 13B activated parameters.

Mixtral MoE
LLM Mistral 7B 2025-11-01 15:28:19+0800
Mistral introduced Mistral 7B in October 2023; it outperforms LLaMA2-13B.

Mixtral
LLM Notes on olmoe 2025-11-01 15:23:58+0800
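AllenAI introduced olmoe in September 2024, a fully open-source MoE-based large language model with 7B-A1B parameters. The authors detail the model design, data, and training strategy. The paper received an ICLR 2025 oral.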

MoE Oral Allen AI
LLM GShard 2025-10-29 11:22:39+0800
Google introduced GShard in 2020, an API module for MoE models; the authors aim to explore how to efficiently train MoE-based transformer models.

MoE Google
LLM ST-MoE 2025-10-29 11:19:37+0800
Google introduced ST-MoE-269B-A32B in April 2022 to address the training instability and underperformance of MoE models.

MoE Google
LLM Switch Transformer 2025-10-28 09:38:12+0800
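Google introduced Switch Transformer in June 2022, an MoE-based Transformer model. By improving the MoE algorithm, the authors greatly improve compute and communication efficiency, and find that the model trains more efficiently than a dense model.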

MoE Google
LLM Chinchilla Scaling Law 2025-10-22 14:39:23+0800
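In March 2022, DeepMind studied how to choose the optimal model size and data size for a given compute budget. The authors found that in the compute-optimal setting, model size and dataset size should grow at the same rate. Based on this scaling law they proposed Chinchilla, a 70B large language model that outperforms other, larger models.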

Scaling Law Google
LLM Kaplan Scaling Law 2025-10-22 14:10:52+0800
In January 2020, OpenAI studied the relationship between model size, dataset size, compute budget, and transformer loss. With a scaling law in hand, we can choose the optimal configuration under a fixed compute budget.

Scaling Law
LLM LLM FLOPs Computation 2025-10-15 16:33:39+0800
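We show how to compute the FLOPs of a transformer-based LLM. From that computation we can derive the relationship between compute $C$, model parameter count $N$, and dataset size $D$, namely $C\approx 6ND$.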

Transformer Scaling Law
MLLM Notes on Keye-VL 1.5 2025-09-11 11:33:31+0800
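Kuaishou introduced Keye-VL 1.5, an 8B multimodal large model that emphasizes reasoning and video understanding. The authors propose a slow-fast video encoding strategy to improve video understanding, and improve the model's long-context and reasoning capabilities through pre-training and post-training.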

Kuaishou Reasoning Video
LLM Notes on AdamW 2025-09-04 10:27:03+0800
The authors propose a weight decay method for the Adam optimizer.

optimizer
LLM Notes on Adam 2025-09-04 10:11:55+0800
The authors propose Adam, a first-order optimization method that is more efficient and has a scale-invariance property.

optimizer
MLLM Notes on RNoPE-SWA 2025-09-02 11:24:10+0800
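The authors systematically analyze existing attention mechanisms and then propose a hybrid attention mechanism that improves long-context performance while preserving short-context performance.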

position embedding Attention Long Context
MLLM Notes on InternVL3.5 2025-09-01 11:30:50+0800
Shanghai AI Lab introduced the InternVL 3.5 series of multimodal large models, which emphasizes reasoning capability and inference efficiency.

Intern Reasoning
MLLM Ovis2.5 MLLM with stronger perception and reasoning capability 2025-08-30 17:34:44+0800
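The authors propose Ovis2.5, a family of multimodal large models improved from Ovis, available in 2B and 9B sizes. Ovis2.5 emphasizes two features: support for image inputs at different resolutions, and deep thinking.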
MLLM Ovis-discrete visual embedding 2025-08-30 17:32:22+0800
The authors propose Ovis, a method that discretizes the output features of the visual encoder to better align the LLM's visual and textual inputs.
LLM Notes on DeepSeekMoE 2025-08-29 11:03:12+0800
DeepSeek released DeepSeekMoE in January 2024, a family of large models that addresses insufficient specialization and redundancy in MoE models.

deepseek MoE
LLM Notes on DeepSeek-LLM 2025-08-26 10:53:10+0800
DeepSeek released DeepSeek LLM on January 5, 2024, in 7B and 67B sizes; the authors focus on the study of scaling laws.

deepseek
LLM Notes on MFA 2025-08-23 16:04:34+0800
StepFun and collaborators proposed Multi-matrix Factorization Attention (MFA), a new attention mechanism that maximizes model performance under KV cache constraints.

Attention StepFun
LLM Notes on MX-format 2025-08-21 18:23:03+0800
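MX format is a data format used mainly for quantization in LLMs. Compared with quantizing an entire tensor directly, MX format controls quantization at a finer granularity, which improves model performance.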

Attention
LLM Notes on flashattention 2025-08-21 11:32:53+0800
The authors propose flashattention, a method that speeds up attention computation by reducing the memory access overhead of multi-head attention.

Attention
LLM Notes on StreamingLLM 2025-08-20 10:16:35+0800
The authors propose StreamingLLM, a method that uses attention sinks to improve the performance of sliding window attention in ultra-long-context settings.

meta Attention
LLM Notes on gpt-oss 2025-08-19 16:14:56+0800
OpenAI released the gpt-oss large language models in two sizes, 120B-A5.1B and 20.9B-A3.6B; the authors highlight the models' instruction following, tool use, and adaptive thinking.

openAI
LLM Notes on QK-Norm 2025-08-13 16:12:11+0800
The authors propose QK norm, a scaling method that addresses instability in softmax attention weights.

Attention
LLM Notes on GLM-4.5 2025-08-13 12:27:48+0800
Zhipu AI introduced GLM-4.5, comprising two MoE LLMs, GLM-4.5 and GLM-4.5-Air, with sizes of 355B-A32B and 106B-A12B respectively. GLM-4.5 focuses on three areas: agentic tasks, reasoning, and coding.

Zhipu Reasoning
MLLM Notes on ARC-Hunyuan-Video-7B 2025-08-12 10:57:57+0800
Tencent ARC Lab introduced ARC-Hunyuan-Video-7B, a video multimodal large model for short-video understanding and reasoning.

Tencent Video
LLM Notes on GQA 2025-08-07 18:08:36+0800
In December 2023, Google Research proposed Grouped-Query Attention (GQA), a method for improving the efficiency of multi-head attention. The Qwen series has used GQA since Qwen2.

Attention
LLM Notes on MQA 2025-08-07 18:06:37+0800
Google proposed multi-query attention (MQA) in 2019 to address the memory-bandwidth bottleneck of multi-head attention.

Attention
LLM Notes on Moonlight 2025-08-07 10:49:32+0800
Kimi introduced Moonlight, a 16B-A3B MoE LLM trained with the Muon optimizer. The authors detail how to scale up the Muon optimizer.

kimi optimizer
LLM Notes on Hunyuan-Large 2025-08-06 16:46:32+0800
Tencent Hunyuan introduced Hunyuan-Large, a 389B-A52B MoE LLM with a 256K context length.

Hunyuan MoE
LLM Notes on GSPO 2025-08-06 11:26:26+0800
Qwen proposed Group Sequence Policy Optimization (GSPO), an RL algorithm that improves on GRPO. GSPO computes the importance ratio at the sequence level, avoiding the training instability caused by token-level computation.

Qwen RL
LLM Notes on Muon blog 2025-08-05 11:10:51+0800
Muon (MomentUm Orthogonalized by Newton-Schulz) is an optimizer for the 2D parameters of neural networks. It builds on SGD with momentum and adds a Newton-Schulz post-processing step.

kimi optimizer
MLLM Notes on AFM2025 2025-07-29 12:36:28+0800
Apple released the AFM technical report in July, covering two multilingual multimodal large models, one for on-device use and one for server use.

Apple
LLM Notes on Kimi-k2 2025-07-24 10:56:50+0800
Kimi-k2 is an MoE large language model with 1T total parameters and 32B activated parameters, trained on 15.5T tokens with the MuonClip optimizer. The authors focus on the model's agentic capabilities.

kimi Reasoning
MLLM Notes on Keye-VL 2025-07-23 11:11:43+0800
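Keye-VL is an 8B multimodal large model introduced by Kuaishou in July 2025, whose highlight is short-video understanding. Pre-training has 4 stages and uses 600B tokens; post-training has 2 stages that improve the model's reasoning and non-reasoning capabilities.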

Kuaishou Reasoning
LLM LLM Parameter Computation 2025-07-22 10:50:47+0800
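We show how to compute the parameter count of an LLM. Starting from the Qwen3 architecture, we break the model down and then derive a formula for the LLM parameter count.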

distributed training
MLLM Notes on Seed1.6 2025-07-18 14:59:35+0800
Seed 1.6 supports adaptive deep thinking and multimodal understanding, with a 256K context length.

Seed
MLLM Notes on V-Triune 2025-07-17 09:37:36+0800
A unified RL training framework for improving the perception and reasoning capabilities of VLMs.

Reasoning Perception
LLM Notes on Magistral 2025-07-16 11:04:04+0800
Magistral is a family of reasoning models from Mistral, targeting the math and code domains.

Reasoning
LLM Notes on SmolLM3 2025-07-15 11:01:13+0800
Hugging Face released SmolLM3 on July 8, 2025, a 3B small language model with a 128K context, support for 6 languages, and dual-mode reasoning.

Small LLM Reasoning
MLLM Notes on GLM-4.1V-Thinking 2025-07-14 10:32:04+0800
Zhipu AI released GLM-4.1V-Thinking in July 2025, a 9B multimodal large language model that reaches SOTA among MLLMs of the same size on multiple benchmarks.

Zhipu Reasoning
LLM Notes on Qwen2.5-1M 2025-07-12 11:00:47+0800
Summary of the Qwen2.5-1M technical report

Qwen
LLM Notes on Qwen2.5 2025-07-12 10:51:42+0800
Summary of the Qwen2.5 technical report

Qwen
LLM Dual Chunk Attention 2025-07-12 10:41:12+0800
A training-free context extension strategy

Qwen Long Context
LLM Notes on Qwen2 2025-07-12 10:36:43+0800
Summary of the Qwen2 technical report

Qwen
LLM Notes on Qwen1.5 2025-07-03 17:37:39+0800
Qwen released Qwen1.5 in January 2024, in 8 sizes (0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B), plus one MoE model.

Qwen
LLM Notes on YaRN 2025-07-03 14:40:49+0800
YaRN (Yet another RoPE extensioN method) is a method for extending LLM context length, proposed in September 2023 by EleutherAI and collaborators and later adopted by the Qwen model series.

Qwen Long Context position encoding
LLM Notes on Qwen-LLM 2025-07-03 10:47:27+0800
Summary of the Qwen technical report

Qwen
LLM Hands on LLM(2) Transformer 2025-06-29 11:40:39+0800
An explanation of the transformer architecture and its core code, based on Qwen3

cs336 Transformer
LLM Unified perspective on dLLM and LLM 2025-06-28 15:02:09+0800
A derivation of the equivalence between MLE and KL divergence

diffusion
Machine Learning Relationship between MLE and KL divergence 2025-06-27 11:35:33+0800
A derivation of the equivalence between MLE and KL divergence

MLE KL divergence
MLLM Notes on MiMo-VL 2025-06-05 10:51:43+0800
MiMo-VL is a multimodal reasoning large language model built on MiMo-7B

xiaomi
LLM Hands on LLM(1) Tokenizer 2025-05-24 19:56:34+0800
A summary of tokenizers and an efficient implementation of BPE

Transformer
LLM Notes on attention bias 2025-05-22 15:25:07+0800
Why transformers have no QKV bias

Transformer
LLM Notes on Position encoding 2025-05-19 10:46:39+0800
From absolute position encoding to RoPE

position encoding
LLM Notes on Qwen3 2025-05-15 14:48:11+0800
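Qwen3 comprises 6 dense models and 2 MoE models; highlights include switching between fast and slow thinking modes, multilingual support, and an adjustable thinking budget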

Qwen
MLLM Notes on Seed1.5-VL 2025-05-14 09:28:07+0800
ByteDance Seed released the Seed1.5-VL technical report on May 11. The report details the architecture, training, and evaluation of Seed1.5-VL

Seed
Infra Distributed training: parameter count and compute analysis 2025-05-13 11:26:36+0800
Basic computations in distributed training

distributed training
Infra Distributed training: how to train a model 2025-05-13 11:26:36+0800
Basic computations in distributed training

distributed training
Infra Distributed training--Basic 2025-05-12 10:15:17+0800
Basic concepts in distributed training

distributed training
LLM Notes on LLaMA4 blog 2025-04-30 10:44:19+0800
Reading notes on the LLaMA4 blog

LLaMA
LLM Notes on Qwen3 blog 2025-04-29 11:23:04+0800
Release of the Qwen3 series of LLMs

Qwen
MLLM Data mixture in MLLM 2025-04-25 10:25:48+0800
A brief summary of training data mixtures for MLLMs

data selection dataset
Essay Essay: Good health 2025-04-23 13:24:02+0800
Only when beset by illness do you realize how much good health matters

Health
LLM Notes on VAPO 2025-04-17 09:41:51+0800
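ByteDance's Seed team proposed VAPO, which combines the strengths of DAPO and VC-PPO to address several problems in long-CoT tasks and improve the performance of reasoning models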

RL GRPO
LLM Notes on VC-PPO 2025-04-14 17:36:15+0800
ByteDance's Seed team proposed Value-Calibrated PPO (VC-PPO) to address PPO's value initialization bias and reward signal decay

RL GRPO
LLM Notes on DAPO 2025-04-09 21:40:33+0800
Notes on DAPO

RL GRPO
MLLM Notes on Qwen2.5 omni 2025-04-01 10:29:00+0800
Academic notes on Qwen2.5 omni

Qwen omni Audio
MLLM Understanding Sigmoid Loss in SigLip 2025-03-28 14:55:50+0800
Understanding Sigmoid Loss in SigLip

loss
MLLM Notes on Aya Vision 2025-03-17 17:58:24+0800
Aya Vision comes in 8B and 32B sizes and supports 23 languages

multilingual
LLM Notes on Gemma3 2025-03-15 11:15:29+0800
Notes on Gemma3 technical report

Long context
MLLM Overview of Qwen-VL series 2025-03-09 15:11:29+0800
Overview of Qwen-VL series

Qwen
LLM Notes on QwQ-32B 2025-03-08 09:46:16+0800
notes on QwQ-32B

Qwen Reasoning
LLM compression is intelligence 2025-03-06 17:57:51+0800
Understanding large models from the perspective that compression is intelligence

Compression
MLLM Notes on Qwen2.5 VL 2025-03-04 10:46:42+0800
Academic notes on Qwen2.5 VL

Qwen
Terminal Git authentication error 2025-02-22 10:51:27+0800
Git clone authentication error.

Linux Git
MLLM Notes on Kimi k1.5 2025-02-08 10:09:52+0800
A brief introduction to Kimi k1.5

kimi Reasoning
Terminal Screen usage 2025-02-05 15:14:43+0800
Screen usage

Linux
LLM Notes on Phi-4 2024-12-16 17:33:52+0800
A brief introduction to Phi-4

Phi Synthetic data
MLLM An overview of adaption layer in multimodal large language models. 2024-11-09 09:53:43+0800
An overview of different adaption layers used in MLLM.

adaption Layer
MLLM Notes on VITA 2024-08-13 14:36:45+0800
The first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience.

Paper Reading Video Audio Interaction
LLM ROUGE (Recall-Oriented Understudy) 2024-05-09 17:35:20+0800
The metric that evaluates similarity between summaries.

Metric
LLM Formal Algorithms for Transformer 2024-05-02 13:13:12+0800
A formal algorithm describing how transformers work.

Transformer
MLLM MiniGPT-4-Enhancing Vision-Language Understanding with Advanced Large Language Models 2024-05-02 13:13:12+0800
A formal algorithm describing how transformers work.

Paper Reading
Machine Learning Notes on t-SNE 2024-05-02 13:13:12+0800
Learning notes on t-SNE

Dimension Reduction
Deep Learning Regularization methods in deep learning 2024-04-27 18:02:02+0800
Overview of the regularization methods in deep learning.

regularization
LLM BLEU (Bilingual Evaluation Understudy) 2024-04-25 22:46:53+0800
The metric that evaluates the quality of the translation

Metric
LLM Notes on Llama3 2024-04-22 16:22:19+0800
A brief introduction to Llama3

LLaMA
Machine Learning Practical advice for analysis of large, complex data sets 2024-04-17 22:40:11+0800
Advice on how to analyze complex and large data sets
RAG Notes on RAG 2024-04-14 12:38:04+0800
Design for AI agentic workflows

RAG
LLM What's next for AI agentic workflows 2024-04-14 12:38:04+0800
Design for AI agentic workflows

Agent
Machine Learning Rules of Machine Learning 2024-04-13 19:57:47+0800
Advice for machine learning
How to fix http error when creating a new environment. 2023-08-25 00:00:00+0000
Conda configuration