Overview of LLaMA series

Introduction


时间	series	model	infra	blog	Tech report
2023.2	LLaMA 1	7B,13B,33B,65B	-
2023.7	LLaMA 2LLaMA 2	7B,13B,70B	-
2023.8-2023.12	Code LLaMA,Purple LlaMA	8B,70B	LLaMA Guard,CyberSec Eval
2024.4	LLaMA 3LLaMA 3	8B 70B,405B	LLaMA Guarc 2,CyberSec Eval 2,Code Shield
2024.7	LLaMA 3.1	1B,3B,LLaMA Vision	LLaMA Guard 3,CyberSec Eval 3,Prompt Guard
2024.9	LLaMA3.2	-	LlaMA Stack


	LLaMA 1	LLaMA 2	LLaMA 3
License	Research Only	Commercial	Commercial
Model sizes	7b, 13b, 30b, 65b	7b, 13b, 70b	1b, 3b, 7b, 70b, 405b, multimodal
Training data	1~1.4t tokens	2t tokens (+40%)	15t tokens
Context length	2k	4k	128k
Models avaliable	Pretraining only	Pretraining, finetuned	+ llama guard, cybersec eval, prompt guard
Pretraining flops	5.4e23	8.3e23	3.8e25

raining

Initial pre-training
Data mix adjustments
Long context pre-training
annealing

Data

Data grows with compute
If the trend continues,the next 100 times of FLOPs scale, would require ~10 times model parameters and ~10x data

Post-training challenges

Need for higher quality data
Need to carefully balance different objectives
Scaling laws aren’t as developed
Evaluation difficuties

Eval challenges

The lifecycle of a benchmark is too short
- One model cycle from adoption to saturation
- Training data pollution
some things are to measure
- Quality of generated images
- Quality of speech
- Tone of response
General shortage of good benchmarks

How to help make the next LLaMA better

Better community driven benchmarks
- Measure what really matters to you
- Not leaked into the training data
Better post training
- DPO wasn’t invented in a large industry lab
- Scaling laws for post training
Better data

LLaMA

LLaMA (Touvron et al., 2023) 是 Meta 在2023年发布的一款开源大语言模型系列

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971

LLaMA2

LLaMA2 (Touvron et al., 2023) 是 Meta 在 2023 年 7月发布的一款开源大语言模型系列。

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., … Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. https://arxiv.org/abs/2307.09288

LLaMA3

LLaMA3 (Grattafiori et al., 2024) 是 meta 在24年4月发布的一款开源大语言模型系列。

Model Architecture

Several improvements are made on Llama3 compared to llama2:

Llama3 uses a tokenizer with a vocabulary of 128K tokens.
Llama3 adopts grouped query attention (GQA) across both the 8B and 70B sizes.
Llama3 uses to context window of size 8192 tokens

Traning

Llama3 uses 15T tokens for pre-training. Compares to Llama2, it is seven times larger and includes four times more code.

5% data of the training dataset are non-English to support multi-lingual use cases.

Data processing includes:

Heuristic filters
NSFW filters
Semantic deduplication approaches
Text classifiers to predict data quality. Llama2 is used to generate training data for the text classifiers.

Data mixing strategy is explored to improve the performance of Llama3.

Scaling up pretraining

Llama3 developed a series of scaling laws for downstream benchmark evaluations.

Scaling laws help:

Select an optimal data mix and to make informed decisions on how to best use training compute.
Scaling laws allow Llama3 to predict the performance of the largest models on key tasks without training the models.

The authors finds our that the performance of the model continues to improve log-linearly as the training tokens increase. It is seen that Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.

The authors combine three types of parallelization:

Data parallelization
Model parallelization
Pipeline parallelization

Instruction fine-tuning

The fine-tuning of Llama3 contains:

Supervised fine-tuning
Rejection sampling
Proximal Policy Optimization
Direct Preference Optimization

Learning from perference rankings via PPO and DPO also greatly improved the performance of LLma3 on reasoning and coding tasks. Since perference ranking helps the model to select answer when it is in a dilemma.

Instruct model performance

The performance of Llama3 8B compared with Gemma and Mistral:

Model	Llama3 8B	Gemma 7B - It	Mistral &B Instruct
MMLU (5 shot)	68.4	53.3	58.4
GPQA (0 shot)	34.2	21.4	26.3
HumanEval (0 shot)	62.2	30.5	36.6
GSM-8K (8 shot, CoT)	79.6	30.6	39.9
MATH (4 shot, CoT)	30.0	12.2	11.0

performance of Llama3 70B compared with Gemini Pro 1.5 and Claude Sonnet:

Model	Llama3 70B	Gemini Pro 1.5 (Published)	Claude 3 Sonnet (Published)
MMLU (5 shot)	82.0	81.9	79.0
GPQA (0 shot)	39.5	41.5 (CoT)	38.5 (CoT)
HumanEval (0 shot)	81.7	71.9	73.0
GSM-8K (8 shot, CoT)	93.0	91.7 (11 shot)	92.3 (0 shot)
MATH (4 shot, CoT)	50.4	58.5 (Minerva prompt)	40.5

Pre-trained model performance

The performance of Llama3 8B compared with Gemma and Mistral:

Model	Llama3 8B	Gemma 7B (Published)	Gemma 7B (Measured)	Mistral 7B (Published)	Mistral 7B (Measured)
MMLU (5 shot)	66.6	64.3	64.4	62.5	63.9
AGIEval English (3-5 shot)	45.9	41.7	44.9	-	44.0
BIG-Bench Hard (3 shot, CoT)	61.1	55.1	59.0	-	56.0
ARC-Challenge (25 shot)	78.6	53.2(0 shot)	79.1	78.1	78.7
DROP (3 shot, F1)	58.4	-	56.3	-	54.4

performance of Llama3 70B compared with Gemini Pro 1.5 and Claude Sonnet:

Model	Llama3 70B	Gemini Pro 1.0 (Published)	Mixtral 8 $\times$ 22B (Measured)
MMLU (5-shot)	79.5	71.8	77.7
AGIEval English (3-5 shot)	63.0	-	61.2
BIG-Bench Hard (3 shot, CoT)	81.3	75.0	79.2
ARC-Challenge (25 shot)	93.0	-	90.7
DROP (3 shot, F1)	79.7	74.1 (variable shot)	77.6

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., … Ma, Z. (2024). The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783

LLaMA4

Meta在2025年4月10号发布了LLaMA4系列，包含三个模型：Llama 4 Scout, Llama 4 Maverick 以及Llama 4 Behemoth, 三个模型都基于MoE架构，且支持多模态

Model	Layers	Heads (Q / KV)	Context Length	#Parameters (activated/total)	#Experts (activated/total)	#Tokens
LLaMA 4 Behemoth	-	-	-	288B / 2T	(1 shared + 1 routed) / 16	-
LLaMA 4 Maverick	48	40/8	1M	17B / 109B	(1 shared + 1 routed) / 128	~22T
LLaMA 4 Scout	48	40/8	10M	17B / 400B	(1 shared + 1 routed) / 16	~40T

训练数据截止到2024年8月。LLaMa4支持200多种语言，其中100多种语言的训练token数超过了1B

亮点

原生多模态 LLaMA 4是一个原生多模态架构
超长上下文 LLaMA 4的上下文超过了1M
iRoPE 通过交替dense和MoE MLP来提高整体推理效率
基于MetaCLIP的vision encoder
MetaP 使用MetaP来调整超参数
FP8精度训练

Pre-training

LLaMA 4 仍然是一个基于transformer的架构，但是引入了MoE，其示意图如下所示

MoE架构中包含1个shared expert以及1个routed expert. 并且，与其他LLM不同，LLaMA 4使用了一个交替MLP个MoE的架构，即iRoPE，即特定的transformer layer是MoE架构，其余的是MLP架构，其核心代码如下：

self.is_moe_layer = layer_idx in config.moe_layers
if self.is_moe_layer:  # the 128E model interleaves dense / sparse
    self.feed_forward = Llama4TextMoe(config)
else:
    self.feed_forward = Llama4TextMLP(config, intermediate_size=config.intermediate_size_mlp)

early fusion. LLaMA 4称其一个原生多模态大模型，但是其架构仍然是 Vision Encoder-MLP-LLM 的形式，其不同点在于patch embedding没有使用convolution, 而是使用 nn.Unfold直接进行展平，然后使用一个线性层与vision encoder进行对齐。代码如下

class Llama4UnfoldConvolution(nn.Module):
    def __init__(self, config):
        super().__init__()
        kernel_size = config.patch_size
        if isinstance(kernel_size, int):
            kernel_size = (kernel_size, kernel_size)
        self.unfold = torch.nn.Unfold(kernel_size=kernel_size, stride=config.patch_size)
        self.linear = nn.Linear(
            config.num_channels * kernel_size[0] * kernel_size[1],
            config.hidden_size,
            bias=False,
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.unfold(hidden_states)
        hidden_states = hidden_states.permute(0, 2, 1)
        hidden_states = self.linear(hidden_states)
        return hidden_states

其他训练优化技巧如下：

MetaP：用于选择超参数
FP8 precision：与 DeepSeek-V3 一样，使用FP8精度进行训练
mid-training：在预训练阶段之后，额外增加了一个训练阶段，来提高模型的长上下文等关键能力

Post-training

post-training包括三个阶段：

SFT
online RL
DPO

作者发现SFT和DPO会限制模型的探索能力，特别是在math, coding等domain。为了解决这个问题，作者使用LlaMA对问题进行难度分级，然后移除了50%的简单数据。

在online RL阶段，作者设计了一个continuous online RL策略，让模型在训练和筛选问题两种模式之间进行切换，以平衡效率和准确率。

DPO的目的是为了提升模型输出的质量

评测

benchmark	LLaMA 4 Maverick	LLaMA 4 Maverick	LLaMA 4 Scout	Gemeni 2.0 Flash	GPT-4o
MMMU	76.1	73.4	69.4	71.7	69.1
Math Vista	73.7	70.7	73.1	63.8
ChartQA	-	90.0	88.8	88.3	85.7
DocVQA	-	94.4	94.4	-	92.8
LiveCodeBench	49.4	43.4	32.8	34.5	32.3
MMLU Pro	82.2	80.5	74.3	77.6	-
GPQA Diamond	73.7	69.8	57.2	60.1	53.6

结论

LLaMA 4 采用了MoE架构，是一个原生的多模态大模型系列。在架构上，与DeepSeek-MoE, aria和OLMoE不同，LLaMA4并没有增加expert granularity，OLMoE分析认为，增加granularity可以提高模型的flexibility，下面总结了一下相关模型的参数

Model	Layers	Heads (Q / KV)	Context Length	#Parameters (activated/total)	#Experts (activated/total)	#Tokens
DeepSeek-MoE(144.6B)	62	32/32	2048	22.2B/144.6B	(1 shared + 7 routed)/64	245B
DeepSeek-V3	61	128(MLA)	128K	37B/671B	(1 shared + 8 routed)/257	14.8T
Aria	28	20/20	64K	3.5B/24.9B	(2 shared+ 6 routed)/66	6.4T(text)
OLMoE	16	16/16	4096	1.3B/6.9B	8/64	5T
LLaMA 4 Maverick	48	40/8	1M	17B / 109B	(1 shared + 1 routed) / 128	~22T
LLaMA 4 Scout	48	40/8	10M	17B / 400B	(1 shared + 1 routed) / 16	~40T