GPU vs TPU: A Comprehensive Guide to Specialized Hardware Accelerators

Understanding the architecture, performance characteristics, and optimal use cases for GPUs and TPUs in modern computing.

Updated: May 29, 2026

Introduction

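We previously introduced scaling laws for large language models, such as the Kaplan scaling law (Kaplan et al., 2020) and the Chinchilla scaling law (Hoffmann et al., 2022). Their core conclusion is that the capability of a large language model improves as compute, model size, and data volume increase. Compute is determined by the GPU/TPU/NPU, so this section introduces these kinds of hardware.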

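Before introducing the different accelerators, let us first look at how an accelerator is used to speed up computation.

How do we use a GPU to accelerate computation? TODO: framework diagram

TODO: comparison of compute and memory across different GPUs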

Notation

In this article, we use device to refer to accelerators (GPU, TPU, etc.) and host to refer to the CPU.

  1. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models. https://arxiv.org/abs/2203.15556
  2. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361

GPU

Introduction

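GPU stands for graphics processing unit. It is designed for efficient parallel computation and is now widely used in gaming, scientific computing, deep learning, and other fields. In this article we focus on GPUs for machine learning, such as the A100, H100, and B200.

Overview of GPU

The architecture of a GPU is shown below.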

Architecture of GPU

Memory

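Let us first look at the GPU memory hierarchy. It is broadly similar to that of a CPU; the memory architecture diagram is shown below.

It is divided into several levels:

  1. Global memory: global memory stores model weights, gradients, activations, and so on. What we usually call GPU memory is global memory, as in A100 40GB or A100 80GB; it is the maximum amount of data a single GPU can hold.
  2. L2 cache: visible to all SMs, so every thread can access it. Because access is relatively slow, we need to control the memory access pattern.
  3. L1 data cache (SMEM unit): visible to all threads of the current CUDA block. All CUDA blocks on the same SM share one piece of physical memory. SMEM is typically used to hold activations or the inputs of the Tensor Cores.
  4. Register file: visible only to the CUDA cores (threads) in the current partition. On the H100, each register file holds 16384 32-bit words, so the register file size per SM is $4 \times 16384 \times 4 \text{ B} = 256$ KB.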
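The register file arithmetic in the last item can be reproduced with a minimal Python sketch. It assumes, as this article does for the H100, 4 subpartitions per SM with one register file each; the constant and function names are illustrative and not part of any NVIDIA API.

```python
# Sketch: register file capacity of one SM, using the H100 figures quoted above.
WORDS_PER_REGISTER_FILE = 16384  # 32-bit words held by one register file
BYTES_PER_WORD = 4               # a 32-bit word occupies 4 bytes
SUBPARTITIONS_PER_SM = 4         # assumption: one register file per subpartition


def register_file_bytes_per_sm(words=WORDS_PER_REGISTER_FILE,
                               bytes_per_word=BYTES_PER_WORD,
                               subpartitions=SUBPARTITIONS_PER_SM):
    """Return the total register file capacity of one SM in bytes."""
    return words * bytes_per_word * subpartitions


total_bytes = register_file_bytes_per_sm()
total_kb = total_bytes // 1024  # 1 KB taken as 1024 bytes
print(f"register file per SM: {total_bytes} bytes = {total_kb} KB")
```

Changing `subpartitions` or `words` lets the same function be reused for an architecture with a different SM layout.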

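The memory specifications of different GPU architectures are shown in the table below.

| GPU | Generation | Register file size per SM | SMEM per SM | L2 cache | HBM |
| --- | --- | --- | --- | --- | --- |
| V100 | Volta | 256 KB | 96 KB | 6 MB | 32 GB |
| A100 | Ampere | 256 KB | 164 KB | 40 MB | 80 GB |
| H100 | Hopper | 256 KB | 228 KB | 50 MB | 80 GB |
| H200 | Hopper | 256 KB | 228 KB | 50 MB | 141 GB |
| B200 | Blackwell | ? | ? | 126 MB | 192 GB |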

Compute

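A GPU consists of a number of streaming multiprocessors (SMs). Each SM is an independent module, which increases the GPU's ability to process computing tasks in parallel.

Each SM is further divided into several partitions called SM subpartitions. For example, each SM in the H100 contains 4 subpartitions, as shown in the figure below.

TODO: add subpartition figure

Each subpartition consists of the following parts:

  1. 1 Tensor Core. Tensor Cores perform matrix and tensor operations and deliver higher compute throughput than CUDA cores
  2. 1 register file
  3. 1 warp scheduler
  4. 1 L1 data cache, also called the SMEM unit

Each CUDA core can execute one arithmetic operation per cycle, such as f32.add. Each partition contains 32 CUDA cores, which execute the same instruction in a given cycle. CUDA cores are mainly used for ReLU, point-wise vector operations, reductions, and similar operations. The total number of CUDA cores is (number of SMs) × (partitions per SM) × (CUDA cores per partition). For example, the number of FP32 CUDA cores in the H100 is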

$$\#\text{FP32 CUDA cores} = 132 \times 4 \times 32 = 16896$$

where $132$ is the number of SMs, $4$ is the number of partitions per SM, and $32$ is the number of FP32 CUDA cores per partition.

Next, we can compute the peak FLOPs/s of an architecture from its CUDA core count: it is the number of CUDA cores multiplied by the clock frequency. For example, for FP32 precision on the H100 we have

$$\text{Peak FLOPs} = 16896 \times 1.98 \text{ GHz} = 33.5 \text{ TFLOPs/s}$$

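This value is exactly half of the peak FLOPs reported in the H100 datasheet, because FMA (fused multiply-add) completes two operations in one clock cycle. The compute throughput of different GPUs is shown in the table below (all values other than SMs are in TFLOPs/s).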

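| GPU | Generation | SMs | CUDA core (FP64) | CUDA core (FP32) | Tensor Core (FP64) | Tensor Core (TF32) | Tensor Core (BF16) | Tensor Core (FP8) | Tensor Core (INT*) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| V100 | Volta | 80 | 7.8 | 15.7 | - | - | 125 | - | - |
| A100 | Ampere | 108 | 9.7 | 19.5 | 19.5 | 156 | 312 | 624 | 624 |
| H100 | Hopper | 132 | 34 | 67 | 67 | 495 | 990 | 1979 | 1979 |
| H200 | Hopper | 132 | 34 | 67 | 67 | 495 | 990 | 1979 | 1979 |
| B200 | Blackwell | 148 | 40 | 80 | 40 | 1125 | 2250 | 4500 | 9000 |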
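The CUDA core count and the peak FP32 throughput of the H100 derived in this section can be sketched in a few lines of Python. The function names and the `ops_per_cycle` parameter are illustrative assumptions; the input numbers (132 SMs, 4 partitions, 32 cores, 1.98 GHz) are the ones quoted in the text.

```python
def cuda_core_count(num_sms, partitions_per_sm, cores_per_partition):
    """Total CUDA cores = SMs x partitions per SM x cores per partition."""
    return num_sms * partitions_per_sm * cores_per_partition


def peak_tflops(num_cores, clock_ghz, ops_per_cycle=1):
    """Peak throughput in TFLOPs/s.

    cores x GHz gives GFLOPs/s when each core retires one operation per
    cycle; ops_per_cycle=2 models FMA counting as two operations.
    """
    return num_cores * clock_ghz * ops_per_cycle / 1000.0


h100_cores = cuda_core_count(num_sms=132, partitions_per_sm=4, cores_per_partition=32)
h100_no_fma = peak_tflops(h100_cores, clock_ghz=1.98)
h100_with_fma = peak_tflops(h100_cores, clock_ghz=1.98, ops_per_cycle=2)

print(f"FP32 CUDA cores: {h100_cores}")
print(f"peak FP32 without FMA: {h100_no_fma:.1f} TFLOPs/s")
print(f"peak FP32 with FMA:    {h100_with_fma:.1f} TFLOPs/s")
```

With FMA counted as two operations the result rounds to about 67 TFLOPs/s, which matches the CUDA core (FP32) entry for the H100 in the table.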

Comparison with CPU

The architectural comparison between CPU and GPU is shown in the table below.

TODO: add figure

| Components | CPU | GPU |
| --- | --- | --- |
| ALU | a few powerful ALUs (reduce operation latency) | many smaller ALUs (long latency, high throughput) |
| Caches | large caches (reduce memory access latency) | small caches (more area dedicated to computation) |
| Control | sophisticated control (branch prediction, data forwarding) | simple control (more area dedicated to computation) |
| Clock frequency | high | moderate |
| Latency | low | high |
| Latency optimization | modest multi-threading (2) | massive number of threads |

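As the table shows, a CPU has a small number of ALUs together with large caches and control modules, whereas a GPU has many ALUs together with small caches and control modules.

The core difference between a GPU and a CPU is that they optimize for different goals: the CPU minimizes latency, while the GPU maximizes throughput.

Travel makes a useful analogy. When going from one point to another, the CPU is like a sports car that completes a single trip quickly. The GPU is like a bus: when there are many passengers, it can carry all of them to the destination in one run.

Unlike a CPU, a GPU devotes most of its transistors to data processing, whereas a CPU reserves a share of its transistors for caches and control units.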

TODO: add figure

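A CPU must minimize the latency of each thread so that it completes as many tasks as possible in a given time. It therefore needs low latency, which requires large caches and complex control logic.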

Multi-GPU

Intra Node

Inter Node

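The design at the inter-node level is not yet standardized. NVIDIA proposed the DGX architecture, which uses InfiniBand for communication between nodes. Above the node there are two further levels of abstraction: Scalable Units (SUs) and the SuperPod.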

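Each InfiniBand port has a bandwidth of 50 GB/s (duplex), and an InfiniBand switch has 64 ports, so the total bandwidth of one switch is $50 \times 64 = 3200$ GB/s $= 3.2$ TB/s. The table below compares the communication bandwidth at the different levels.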

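| Level | GPUs | Switches per Unit | Switch Type | Bandwidth per Unit (TB/s, full-duplex) | GPU-to-GPU Bandwidth (GB/s, full-duplex) | Fat Tree Bandwidth (GB/s, full-duplex) |
| --- | --- | --- | --- | --- | --- | --- |
| Node | 8 | 4 | NVL | 3.6 | 450 | 450 |
| Leaf | 256 | 8 | IB | 12.8 | 50 | 400 |
| Spine | 1024 | 16 | IB | 51.2 | 50 | 400 |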

Optimization

TPU

Acknowledgements

This blog draws on a blog post about GPUs written by Aleksa Gordić (Gordić, 2025).

  1. Gordić, A. (2025). Inside NVIDIA GPUs: Anatomy of high performance matmul kernels. https://www.aleksagordic.com/blog/matmul

GPU Specs

V100

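V100 key improvements

V100 technical specifications

| Tesla Product | Tesla K40 | Tesla M40 | Tesla P100 | Tesla V100 |
| --- | --- | --- | --- | --- |
| GPU | GK180 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GV100 (Volta) |
| SMs | 15 | 24 | 56 | 80 |
| TPCs | 15 | 24 | 28 | 40 |
| FP32 Cores / GPU | 2880 | 3072 | 3584 | 5120 |
| FP64 Cores / GPU | 960 | 96 | 1792 | 2560 |
| Tensor Cores / GPU | NA | NA | NA | 640 |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz | 1530 MHz |
| Peak FP32 TFLOPS² | 5 | 6.8 | 10.6 | 15.7 |
| Peak FP64 TFLOPS² | 1.7 | 0.21 | 5.3 | 7.8 |
| Peak Tensor TFLOPS² | NA | NA | NA | 125 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB | 16 GB |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
| TDP | 235 Watts | 250 Watts | 300 Watts | 300 Watts |
| Manufacturing Process | 28 nm | 28 nm | 16 nm FinFET+ | 12 nm FFN |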
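As a cross-check, the Peak FP32 row of this table can be approximately recovered from the FP32 core counts and boost clocks with the rule used earlier in this article (cores × clock × 2 for FMA). The sketch below is illustrative only; for the K40 it assumes the higher of the two listed boost clocks (875 MHz).

```python
# (FP32 cores per GPU, boost clock in MHz), copied from the table above.
# Assumption: the K40 entry uses 875 MHz, the higher of its two boost clocks.
TESLA_GPUS = {
    "Tesla K40": (2880, 875),
    "Tesla M40": (3072, 1114),
    "Tesla P100": (3584, 1480),
    "Tesla V100": (5120, 1530),
}


def peak_fp32_tflops(fp32_cores, boost_mhz):
    """Peak FP32 TFLOPS, counting one FMA per core per cycle as 2 operations."""
    flops_per_second = fp32_cores * boost_mhz * 1e6 * 2
    return flops_per_second / 1e12


peak_by_gpu = {
    name: round(peak_fp32_tflops(cores, mhz), 1)
    for name, (cores, mhz) in TESLA_GPUS.items()
}

for name, tflops in peak_by_gpu.items():
    print(f"{name}: {tflops} TFLOPS")
```

The rounded results agree with the Peak FP32 TFLOPS values listed for the four products.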

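Memory specifications

| GPU | Kepler GK180 | Maxwell GM200 | Pascal GP100 | Volta GV100 |
| --- | --- | --- | --- | --- |
| Compute Capability | 3.5 | 5.2 | 6.0 | 7.0 |
| Threads / Warp | 32 | 32 | 32 | 32 |
| Max Warps / SM | 64 | 64 | 64 | 64 |
| Max Threads / SM | 2048 | 2048 | 2048 | 2048 |
| Max Thread Blocks / SM | 32 | 32 | 32 | 32 |
| Max 32-bit Registers / SM | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Block | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Thread | 255 | 255 | 255 | 255 |
| Max Thread Block Size | 1024 | 1024 | 1024 | 1024 |
| FP32 Cores / SM | 192 | 128 | 64 | 64 |
| Ratio of SM Registers to FP32 Cores | 341 | 512 | 1024 | 1024 |
| Shared Memory Size / SM | 16 KB / 32 KB / 48 KB | 96 KB | 64 KB | Configurable up to 96 KB |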
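Two rows of this table are derived from the others: Max Warps / SM follows from Max Threads / SM and Threads / Warp, and the register-to-core ratio follows from Max 32-bit Registers / SM and FP32 Cores / SM. A minimal sketch, assuming the ratio is truncated to an integer as the Kepler entry (341) suggests:

```python
# Values copied from the table above; they are identical across the four GPUs.
MAX_REGISTERS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048
THREADS_PER_WARP = 32

# FP32 cores per SM for each architecture, from the table.
FP32_CORES_PER_SM = {
    "Kepler GK180": 192,
    "Maxwell GM200": 128,
    "Pascal GP100": 64,
    "Volta GV100": 64,
}


def registers_per_fp32_core(cores_per_sm):
    """Ratio of SM registers to FP32 cores, truncated to an integer (assumption)."""
    return MAX_REGISTERS_PER_SM // cores_per_sm


max_warps_per_sm = MAX_THREADS_PER_SM // THREADS_PER_WARP
ratios = {gpu: registers_per_fp32_core(cores) for gpu, cores in FP32_CORES_PER_SM.items()}

print(f"max warps per SM: {max_warps_per_sm}")
for gpu, ratio in ratios.items():
    print(f"{gpu}: {ratio} registers per FP32 core")
```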

System specifications

| Specification | DGX-1 (Tesla P100) | DGX-1 (Tesla V100) |
| --- | --- | --- |
| GPU | 8x Tesla P100 GPUs | 8x Tesla V100 GPUs |
| TFLOPS | 170 (GPU FP16) + 3 (CPU FP32) | 1 (GPU Tensor PFLOP) |
| GPU Memory | 16 GB per GPU / 128 GB per DGX-1 Node | 16 GB or 32 GB per GPU / 128-256 GB per DGX-1 Node |
| CPU | Dual 20-core Intel® Xeon® E5-2698 v4 | Dual 20-core Intel® Xeon® E5-2698 v4 |
| FP32 CUDA Cores | 28,672 Cores | 40,960 Cores |
| System Memory | Up to 512 GB 2133 MHz DDR4 LRDIMM | Up to 512 GB 2133 MHz DDR4 LRDIMM |
| Storage | 4x 1.92 TB SSD RAID 0 | 4x 1.92 TB SSD RAID 0 |
| Network Interconnect | Dual 10 GbE, 4 IB EDR | Dual 10 GbE, 4 IB EDR |
| System Dimensions | 866 D x 444 W x 131 H (mm) | 866 D x 444 W x 131 H (mm) |
| System Weight | 80 lbs | 80 lbs |
| Max Power TDP | 3200 W | 3200 W |
| Operating Temp | 10 - 35°C | 10 - 35°C |

A100

A100 key improvements

A100 technical specifications

| | A100 80GB PCIe | A100 80GB SXM |
| --- | --- | --- |
| FP64 | 9.7 TFLOPS | 9.7 TFLOPS |
| FP64 Tensor Core | 19.5 TFLOPS | 19.5 TFLOPS |
| FP32 | 19.5 TFLOPS | 19.5 TFLOPS |
| Tensor Float 32 (TF32) | 156 TFLOPS / 312 TFLOPS | 156 TFLOPS / 312 TFLOPS* |
| BFLOAT16 Tensor Core | 312 TFLOPS / 624 TFLOPS* | 312 TFLOPS / 624 TFLOPS* |
| FP16 Tensor Core | 312 TFLOPS / 624 TFLOPS* | 312 TFLOPS / 624 TFLOPS* |
| INT8 Tensor Core | 624 TOPS / 1248 TOPS* | 624 TOPS / 1248 TOPS* |
| GPU Memory | 80GB HBM2e | 80GB HBM2e |
| GPU Memory Bandwidth | 1,935 GB/s | 2,039 GB/s |
| Max Thermal Design Power (TDP) | 300W | 400W *** |
| Multi-Instance GPU | Up to 7 MIGs @ 10GB | Up to 7 MIGs @ 10GB |
| Form Factor | PCIe dual-slot air-cooled or single-slot liquid-cooled | SXM |
| Interconnect | NVIDIA® NVLink® Bridge for 2 GPUs: 600 GB/s **; PCIe Gen4: 64 GB/s | NVLink: 600 GB/s; PCIe Gen4: 64 GB/s |
| Server Options | Partner and NVIDIA-Certified Systems™ with 1-8 GPUs | NVIDIA HGX™ A100-Partner and NVIDIA-Certified Systems with 4, 8, or 16 GPUs; NVIDIA DGX™ A100 with 8 GPUs |

H100

H100 key improvements

H100 technical specifications

| | H100 SXM | H100 NVL |
| --- | --- | --- |
| FP64 | 34 teraFLOPS | 30 teraFLOPS |
| FP64 Tensor Core | 67 teraFLOPS | 60 teraFLOPS |
| FP32 | 67 teraFLOPS | 60 teraFLOPS |
| TF32 Tensor Core* | 989 teraFLOPS | 835 teraFLOPS |
| BFLOAT16 Tensor Core* | 1,979 teraFLOPS | 1,671 teraFLOPS |
| FP16 Tensor Core* | 1,979 teraFLOPS | 1,671 teraFLOPS |
| FP8 Tensor Core* | 3,958 teraFLOPS | 3,341 teraFLOPS |
| INT8 Tensor Core* | 3,958 TOPS | 3,341 TOPS |
| GPU Memory | 80GB | 94GB |
| GPU Memory Bandwidth | 3.35TB/s | 3.9TB/s |
| Decoders | 7 NVDEC, 7 JPEG | 7 NVDEC, 7 JPEG |
| Max Thermal Design Power (TDP) | Up to 700W (configurable) | 350-400W (configurable) |
| Multi-Instance GPUs | Up to 7 MIGs @ 10GB each | Up to 7 MIGs @ 12GB each |
| Form Factor | SXM | PCIe dual-slot air-cooled |
| Interconnect | NVIDIA NVLink™: 900GB/s; PCIe Gen5: 128GB/s | NVIDIA NVLink: 600GB/s; PCIe Gen5: 128GB/s |
| Server Options | NVIDIA HGX H100 Partner and NVIDIA-Certified Systems™ with 4 or 8 GPUs; NVIDIA DGX H100 with 8 GPUs | Partner and NVIDIA-Certified Systems with 1–8 GPUs |
| NVIDIA AI Enterprise | Add-on | Included |

H200

H200 key improvements

H200 technical specifications

| | H200 SXM | H200 NVL |
| --- | --- | --- |
| FP64 | 34 teraFLOPS | 30 teraFLOPS |
| FP64 Tensor Core | 67 teraFLOPS | 60 teraFLOPS |
| FP32 | 67 teraFLOPS | 60 teraFLOPS |
| TF32 Tensor Core* | 989 teraFLOPS | 835 teraFLOPS |
| BFLOAT16 Tensor Core* | 1,979 teraFLOPS | 1,671 teraFLOPS |
| FP16 Tensor Core* | 1,979 teraFLOPS | 1,671 teraFLOPS |
| FP8 Tensor Core* | 3,958 teraFLOPS | 3,341 teraFLOPS |
| INT8 Tensor Core* | 3,958 TOPS | 3,341 TOPS |
| GPU Memory | 141GB | 141GB |
| GPU Memory Bandwidth | 4.8TB/s | 4.8TB/s |
| Decoders | 7 NVDEC, 7 JPEG | 7 NVDEC, 7 JPEG |
| Confidential Computing | Supported | Supported |
| Max Thermal Design Power (TDP) | Up to 700W (configurable) | Up to 600W (configurable) |
| Multi-Instance GPUs | Up to 7 MIGs @ 18GB each | Up to 7 MIGs @ 18GB each |
| Form Factor | SXM | PCIe dual-slot air-cooled |
| Interconnect | NVIDIA NVLink™: 900GB/s; PCIe Gen5: 128GB/s | 2- or 4-way NVIDIA NVLink bridge: 900GB/s per GPU; PCIe Gen5: 128GB/s |
| Server Options | NVIDIA HGX H200 Partner and NVIDIA-Certified Systems™ with 4 or 8 GPUs | NVIDIA MGX™ H200 NVL partner and NVIDIA-Certified Systems with up to 8 GPUs |
| NVIDIA AI Enterprise | Add-on | Included |

Compared with the H100, the H200 upgrades the HBM capacity and the memory bandwidth.

B200

B200 key improvements

B200 technical specifications

The system specifications are as follows.

| Specification | GB200 NVL72 | GB200 NVL4 | HGX B200 |
| --- | --- | --- | --- |
| NVIDIA Blackwell GPUs / Grace CPUs | 72 / 36 | 4 / 2 | 8 / 0 |
| CPU Cores | 2,592 Arm® Neoverse V2 Cores | 144 Arm Neoverse V2 Cores | - |
| Total NVFP4 Tensor Core² | 1,440 / 720 PFLOPS | 80 / 40 PFLOPS | 144 / 72 PFLOPS |
| Total FP8/FP6 Tensor Core² | 720 PFLOPS | 40 PFLOPS | 72 PFLOPS |
| Total Fast Memory | 31 TB | 1.8 TB | 1.4 TB |
| Total Memory Bandwidth | 576 TB/s | 32 TB/s | 62 TB/s |
| Total NVLink Bandwidth | 130 TB/s | 7.2 TB/s | 14.4 TB/s |

The per-GPU specifications are as follows.

| Specification | GB200 NVL72 | GB200 NVL4 | HGX B200 |
| --- | --- | --- | --- |
| FP4 Tensor Core | 20 PFLOPS | 20 PFLOPS | 18 PFLOPS |
| FP8/FP6 Tensor Core² | 10 PFLOPS | 10 PFLOPS | 9 PFLOPS |
| INT8 Tensor Core² | 10 POPS | 10 POPS | 9 POPS |
| FP16/BF16 Tensor Core² | 5 PFLOPS | 5 PFLOPS | 4.5 PFLOPS |
| TF32 Tensor Core² | 2.5 PFLOPS | 2.5 PFLOPS | 2.2 PFLOPS |
| FP32 | 80 TFLOPS | 80 TFLOPS | 75 TFLOPS |
| FP64 / FP64 Tensor Core | 40 TFLOPS | 40 TFLOPS | 37 TFLOPS |
| GPU Memory / Bandwidth | 186 GB HBM3E / 8 TB/s | 186 GB HBM3E / 8 TB/s | 180 GB HBM3E / 7.7 TB/s |
| Multi-Instance GPU (MIG) | - | 7 | - |
| Decompression Engine | - | Yes | - |
| Decoders | - | 7 NVDEC³, 7 nvJPEG | - |
| Max Thermal Design Power (TDP) | Configurable up to 1,200 W | Configurable up to 1,200 W | Configurable up to 1,000 W |
| Interconnect | - | Fifth-generation NVLink: 1.8 TB/s; PCIe Gen5: 128 GB/s | - |
| Server Options | NVIDIA GB200 NVL72 partner and NVIDIA-Certified Systems™ with 72 GPUs | NVIDIA MGX partner and NVIDIA-Certified Systems | NVIDIA HGX B200 partner and NVIDIA-Certified Systems with 8 GPUs |

B300

B300 key improvements

B300 technical specifications

The system specifications are as follows.

| | GB300 NVL72 | HGX B300 |
| --- | --- | --- |
| Blackwell Ultra GPUs / Grace CPUs | 72 / 36 | 8 / 0 |
| CPU Cores | 2,592 Arm Neoverse V2 Cores | - |
| Total FP4 Tensor Core¹ | 1,440 PFLOPS / 1,080 PFLOPS | 144 PFLOPS / 108 PFLOPS |
| Total FP8/FP6 Tensor Core² | 720 PFLOPS | 72 PFLOPS |
| Total Fast Memory | 37 TB | 2.1 TB |
| Total Memory Bandwidth | 576 TB/s | 62 TB/s |
| Total NVLink Switch Bandwidth | 130 TB/s | 14.4 TB/s |

The per-GPU specifications are as follows.

| | GB300 NVL72 | HGX B300 |
| --- | --- | --- |
| FP4 Tensor Core | 20 PFLOPS / 15 PFLOPS | 18 PFLOPS / 14 PFLOPS |
| FP8/FP6 Tensor Core² | 10 PFLOPS | 9 PFLOPS |
| INT8 Tensor Core² | 330 TOPS | 307 TOPS |
| FP16/BF16 Tensor Core | 5 PFLOPS | 4.5 PFLOPS |
| TF32 Tensor Core² | 2.5 PFLOPS | 2.2 PFLOPS |
| FP32 | 80 TFLOPS | 75 TFLOPS |
| FP64 / FP64 Tensor Core | 1.3 TFLOPS | 1.2 TFLOPS |
| GPU Memory / Bandwidth | 279 GB HBM3E / 8 TB/s | 270 GB HBM3E / 7.7 TB/s |
| Multi-Instance GPU (MIG) | 7 | 7 |
| Decompression Engine | Yes | Yes |
| Decoders | 7 NVDEC³, 7 nvJPEG | 7 NVDEC³, 7 nvJPEG |
| Max Thermal Design Power (TDP) | Configurable up to 1,400 W | Configurable up to 1,100 W |
| Interconnect | Fifth-generation NVLink: 1.8 TB/s; PCIe Gen6: 256 GB/s | Fifth-generation NVLink: 1.8 TB/s; PCIe Gen6: 256 GB/s |
| Server Options | NVIDIA GB300 NVL72 partner and NVIDIA-Certified Systems™ | NVIDIA HGX B300 partner and NVIDIA-Certified Systems |