核心模块详解

深入理解 vLLM 的核心创新和实现

1: PagedAttention 分页注意力
2: KV Cache 管理器
3: Block Pool 内存块池
4: 调度器原理
5: 连续批处理机制

本部分将深入讲解 vLLM 的核心技术创新，包括 PagedAttention、KV Cache 管理、调度器和连续批处理等关键模块。

1 - PagedAttention 分页注意力

PagedAttention 分页注意力

本章将详细介绍 vLLM 的核心创新——PagedAttention，包括设计思想、数据结构和实现原理。

引言

PagedAttention 是 vLLM 最重要的创新，它借鉴了操作系统虚拟内存管理的思想，革命性地解决了 KV Cache 的显存浪费问题。本章将深入剖析其设计原理和实现细节。

1. 传统 KV Cache 的问题回顾

1.1 连续内存分配的要求

传统方案要求每个请求的 KV Cache 存储在连续的内存空间中：

传统 KV Cache 布局:
+----------------------------------------------------------+
| Request A 的 KV Cache (预分配 max_seq_len)                 |
| [K0,V0][K1,V1][K2,V2]...[Kn,Vn][   空闲预留空间   ]        |
+----------------------------------------------------------+
| Request B 的 KV Cache (预分配 max_seq_len)                 |
| [K0,V0][K1,V1]...[Km,Vm][      空闲预留空间      ]         |
+----------------------------------------------------------+

1.2 显存碎片化图解

当多个请求并发时，显存碎片化问题严重：

graph TB
    subgraph 时间 T1: 三个请求开始
        M1[Request A<br/>预分配 2GB<br/>实际用 0.1GB]
        M2[Request B<br/>预分配 2GB<br/>实际用 0.2GB]
        M3[Request C<br/>预分配 2GB<br/>实际用 0.1GB]
        M4[空闲 2GB]
    end

    subgraph 时间 T2: Request B 完成
        N1[Request A<br/>预分配 2GB<br/>实际用 0.5GB]
        N2[空洞 2GB<br/>外部碎片]
        N3[Request C<br/>预分配 2GB<br/>实际用 0.3GB]
        N4[空闲 2GB]
    end

    subgraph 时间 T3: 新请求 D 到来
        O1[Request A<br/>2GB]
        O2[空洞 2GB]
        O3[Request C<br/>2GB]
        O4[空闲 2GB]
        O5[Request D 需要 3GB<br/>失败！]
    end

    style N2 fill:#ffcdd2
    style O2 fill:#ffcdd2
    style O5 fill:#ffcdd2

1.3 量化浪费

问题类型	说明	浪费比例
内部碎片	预分配 » 实际使用	40-60%
外部碎片	空洞无法利用	20-30%
总计	综合浪费	60-80%

2. PagedAttention 核心思想

2.1 灵感来源：操作系统虚拟内存

操作系统如何管理内存？

graph LR
    subgraph 虚拟内存
        VA[虚拟地址空间<br/>程序看到的连续空间]
    end

    subgraph 页表
        PT[Page Table<br/>虚拟页 → 物理页]
    end

    subgraph 物理内存
        PA[物理内存<br/>实际不连续的页]
    end

    VA --> PT --> PA

关键特性：

程序看到连续的地址空间
物理内存可以不连续
按需分配（用到才分配）
页面可以共享

2.2 PagedAttention 的类比

将操作系统的思想应用到 KV Cache 管理：

操作系统概念	PagedAttention 对应
页（Page）	Block（块）
页表（Page Table）	Block Table（块表）
虚拟地址	逻辑块索引
物理地址	物理块 ID
页帧	KV Cache 块

2.3 核心改进

graph LR
    subgraph 传统方案
        T1[预分配连续空间]
        T2[大量浪费]
        T1 --> T2
    end

    subgraph PagedAttention
        P1[按需分配]
        P2[分块存储]
        P3[非连续]
        P4[高利用率]
        P1 --> P2 --> P3 --> P4
    end

    style T2 fill:#ffcdd2
    style P4 fill:#c8e6c9

3. 关键数据结构详解

3.1 Block（块）

Block 是 KV Cache 的基本存储单元：

# 概念定义
class KVCacheBlock:
    block_id: int              # 物理块 ID
    ref_cnt: int               # 引用计数（支持共享）
    block_hash: Optional[int]  # 用于前缀缓存匹配

Block 的特点：

固定大小：每个 block 存储固定数量的 token（如 16 个）
独立分配：不需要连续
可复用：释放后可分配给其他请求

3.2 Block 的存储内容

每个 Block 存储若干 token 的 K 和 V：

Block 结构 (block_size = 16):
┌─────────────────────────────────────────────────┐
│ Token 0:  K[layers, heads, head_dim]            │
│           V[layers, heads, head_dim]            │
├─────────────────────────────────────────────────┤
│ Token 1:  K[layers, heads, head_dim]            │
│           V[layers, heads, head_dim]            │
├─────────────────────────────────────────────────┤
│ ...                                             │
├─────────────────────────────────────────────────┤
│ Token 15: K[layers, heads, head_dim]            │
│           V[layers, heads, head_dim]            │
└─────────────────────────────────────────────────┘

实际存储形状：

# 单个 Block 的 KV Cache 形状
k_block = torch.zeros(num_layers, num_heads, block_size, head_dim)
v_block = torch.zeros(num_layers, num_heads, block_size, head_dim)

# 整个 KV Cache 池
kv_cache = torch.zeros(num_blocks, 2, num_layers, num_heads, block_size, head_dim)

3.3 Block Table（块表）

Block Table 记录逻辑块到物理块的映射：

classDiagram
    class BlockTable {
        +block_ids: List[int]
        +num_blocks: int
        +append(block_id)
        +get_physical_block(logical_idx) int
    }

示例：

Request A 的 Block Table:
逻辑块索引:  0    1    2    3
            ↓    ↓    ↓    ↓
物理块 ID:  [5]  [2]  [8]  [12]

解释:
- 逻辑块 0 → 物理块 5
- 逻辑块 1 → 物理块 2
- 逻辑块 2 → 物理块 8
- 逻辑块 3 → 物理块 12

3.4 Slot Mapping（槽位映射）

Slot Mapping 将 token 位置映射到具体的缓存槽位：

def get_slot_mapping(token_position, block_size, block_table):
    """
    token_position: token 在序列中的位置（如 35）
    block_size: 每个 block 的 token 数（如 16）
    block_table: 块表
    """
    logical_block_idx = token_position // block_size  # 35 // 16 = 2
    block_offset = token_position % block_size        # 35 % 16 = 3

    physical_block_id = block_table[logical_block_idx]  # 假设是 8
    slot_id = physical_block_id * block_size + block_offset  # 8 * 16 + 3 = 131

    return slot_id

图解：

graph LR
    subgraph Token 位置 35
        T[token_position = 35]
    end

    subgraph 计算
        LB[逻辑块 = 35 // 16 = 2]
        OFF[偏移 = 35 % 16 = 3]
        PB[物理块 = block_table[2] = 8]
        SLOT[slot = 8 × 16 + 3 = 131]
    end

    T --> LB --> PB
    T --> OFF --> SLOT
    PB --> SLOT

4. 内存管理优势

4.1 减少显存碎片

graph TB
    subgraph 传统方案
        A1[Request A: 2GB 预分配<br/>实际 0.1GB]
        A2[Request B: 2GB 预分配<br/>实际 0.2GB]
        A3[Request C: 2GB 预分配<br/>实际 0.1GB]
        A4[空洞和碎片]
    end

    subgraph PagedAttention
        B1[Block Pool<br/>统一管理所有块]
        B2[Request A: 2 blocks]
        B3[Request B: 4 blocks]
        B4[Request C: 2 blocks]
        B5[空闲块可立即复用]
    end

    style A4 fill:#ffcdd2
    style B5 fill:#c8e6c9

4.2 按需分配

sequenceDiagram
    participant R as Request
    participant S as Scheduler
    participant BP as BlockPool

    R->>S: 开始生成（0 tokens）
    S->>BP: 分配 1 个 block
    BP-->>S: Block 5

    loop 每 16 个 token
        R->>S: 需要新空间
        S->>BP: 分配 1 个 block
        BP-->>S: Block N
    end

    R->>S: 生成完成
    S->>BP: 释放所有 blocks
    Note over BP: 块立即可用于其他请求

4.3 支持 Copy-on-Write

当多个请求共享相同前缀时，可以共享 Block：

graph TB
    subgraph 共享场景
        P[共同前缀<br/>'System prompt...']
    end

    subgraph 共享的 Blocks
        B1[Block 0]
        B2[Block 1]
        B3[Block 2]
    end

    subgraph Request A
        A[继续生成 A 的内容]
        AB[Block 10]
    end

    subgraph Request B
        B[继续生成 B 的内容]
        BB[Block 15]
    end

    P --> B1
    P --> B2
    P --> B3
    B3 --> AB
    B3 --> BB

    style B1 fill:#c8e6c9
    style B2 fill:#c8e6c9
    style B3 fill:#c8e6c9

引用计数：

Block 0, 1, 2 的 ref_cnt = 2（被两个请求共享）
只有当 ref_cnt = 0 时才真正释放

4.4 支持前缀缓存

相同前缀的请求可以直接复用已计算的 KV Cache：

# 前缀缓存示例
request_1 = "你好，请问" + "天气怎么样？"
request_2 = "你好，请问" + "今天星期几？"

# "你好，请问" 的 KV Cache 可以复用！

5. PagedAttention 计算流程

5.1 写入 KV Cache

sequenceDiagram
    participant M as Model
    participant PA as PagedAttention
    participant BT as Block Table
    participant KVC as KV Cache Memory

    M->>PA: 新 token 的 K, V
    PA->>BT: 查询 slot_mapping
    BT-->>PA: slot_id = 131
    PA->>KVC: kv_cache[131] = (K, V)

5.2 读取并计算 Attention

flowchart TD
    subgraph 输入
        Q[Query: 新 token 的 Q]
        BT[Block Table: 物理块列表]
    end

    subgraph PagedAttention 计算
        FETCH[从各个物理块获取 K, V]
        COMPUTE[计算 Attention<br/>Q × K^T / √d]
        SOFTMAX[Softmax]
        WEIGHTED[加权求和 V]
    end

    subgraph 输出
        O[Attention 输出]
    end

    Q --> COMPUTE
    BT --> FETCH
    FETCH --> COMPUTE
    COMPUTE --> SOFTMAX
    SOFTMAX --> WEIGHTED
    WEIGHTED --> O

5.3 代码实现概览

# vllm/v1/attention/ops/paged_attn.py (简化版)

class PagedAttention:
    @staticmethod
    def write_to_paged_cache(
        key: torch.Tensor,           # [num_tokens, num_heads, head_dim]
        value: torch.Tensor,         # [num_tokens, num_heads, head_dim]
        key_cache: torch.Tensor,     # [num_blocks, block_size, num_heads, head_dim]
        value_cache: torch.Tensor,   # [num_blocks, block_size, num_heads, head_dim]
        slot_mapping: torch.Tensor,  # [num_tokens]
    ):
        """将新的 K, V 写入缓存"""
        # 使用 slot_mapping 确定写入位置
        # slot_mapping[i] 告诉我们 token i 应该写入哪个槽位
        pass

    @staticmethod
    def forward(
        query: torch.Tensor,         # [num_tokens, num_heads, head_dim]
        key_cache: torch.Tensor,     # KV Cache
        value_cache: torch.Tensor,
        block_tables: torch.Tensor,  # [batch, max_blocks] 块表
        context_lens: torch.Tensor,  # [batch] 每个请求的上下文长度
        ...
    ) -> torch.Tensor:
        """执行 PagedAttention 计算"""
        # 1. 根据 block_tables 定位 K, V
        # 2. 计算 Attention
        # 3. 返回输出
        pass

6. 块的动态管理

6.1 块的生命周期

stateDiagram-v2
    [*] --> Free: 初始化

    Free --> Allocated: 分配给请求
    Allocated --> Free: 请求完成<br/>ref_cnt=0

    Allocated --> Cached: 启用前缀缓存
    Cached --> Allocated: 缓存命中<br/>ref_cnt++

    Cached --> Free: LRU 驱逐<br/>或缓存失效

    note right of Allocated : ref_cnt >= 1
    note right of Cached : 保留以备复用

6.2 块分配流程

def allocate_blocks_for_request(request, kv_cache_manager):
    """为请求分配所需的 blocks"""
    num_tokens = len(request.prompt_tokens) + request.num_generated_tokens
    num_blocks_needed = (num_tokens + block_size - 1) // block_size

    blocks = []
    for i in range(num_blocks_needed):
        # 尝试获取空闲块
        block = kv_cache_manager.get_free_block()
        if block is None:
            # 没有空闲块，触发驱逐或返回失败
            return None
        blocks.append(block)

    # 更新请求的块表
    request.block_table = blocks
    return blocks

6.3 块增长过程

gantt
    title 请求的块分配时间线
    dateFormat X
    axisFormat %s

    section 块分配
    Block 0 (prefill) :done, b0, 0, 16
    Block 1 (prefill 续) :done, b1, 16, 32
    Block 2 (decode) :active, b2, 32, 48
    Block 3 (decode 续) :active, b3, 48, 64

7. CUDA 内核实现

7.1 文件位置

Python 接口：vllm/v1/attention/ops/paged_attn.py
CUDA 内核：csrc/attention/paged_attention_v1.cu、paged_attention_v2.cu

7.2 V1 vs V2 内核

特性	V1	V2
适用场景	短序列	长序列
分块策略	简单	两级分块
性能	中等	更优

7.3 内核参数

// paged_attention_v2.cu (简化)
template<typename T, int BLOCK_SIZE, int NUM_THREADS>
__global__ void paged_attention_v2_kernel(
    T* __restrict__ out,              // 输出
    const T* __restrict__ q,          // Query
    const T* __restrict__ k_cache,    // Key Cache
    const T* __restrict__ v_cache,    // Value Cache
    const int* __restrict__ block_tables,  // 块表
    const int* __restrict__ context_lens,  // 上下文长度
    const float scale,                // 缩放因子
    ...
) {
    // 1. 确定当前线程处理的 query
    // 2. 遍历 block_table 中的所有块
    // 3. 计算 Attention 分数
    // 4. Softmax 和加权求和
}

8. 性能对比

8.1 显存效率

方案	显存利用率	最大并发
传统预分配	20-40%	低
PagedAttention	96%+	高 2-4 倍

8.2 吞吐量提升

graph LR
    subgraph 吞吐量对比
        T1[传统方案<br/>1x 基准]
        T2[PagedAttention<br/>2-4x 提升]
    end

    style T2 fill:#c8e6c9

8.3 碎片率

传统方案:
- 内部碎片: 50-70%
- 外部碎片: 10-20%
- 总碎片: 60-80%

PagedAttention:
- 内部碎片: < 4% (最后一个块)
- 外部碎片: 0% (固定大小块)
- 总碎片: < 4%

9. 本章小结

核心创新

分块存储：将 KV Cache 分成固定大小的 Block
非连续分配：Block 可以分散在显存任意位置
按需分配：生成新 token 时才分配新 Block
块表映射：通过 Block Table 管理逻辑到物理的映射

关键数据结构

结构	作用
Block	KV Cache 的基本存储单元
Block Table	逻辑块 → 物理块映射
Slot Mapping	Token 位置 → 缓存槽位
BlockPool	管理所有空闲块

优势总结

显存效率：从 20-40% 提升到 96%+
减少碎片：从 60-80% 降到 4% 以下
支持共享：多请求可共享相同前缀的 Block
按需增长：不需要预分配最大长度

代码位置

功能	文件
Python 接口	`vllm/v1/attention/ops/paged_attn.py`
CUDA 内核	`csrc/attention/paged_attention_v2.cu`
块管理	`vllm/v1/core/block_pool.py`
块表	`vllm/v1/worker/block_table.py`

思考题

为什么选择固定大小的 Block 而不是可变大小？
前缀缓存和 Copy-on-Write 有什么区别和联系？
如果 block_size 设置得太大或太小，会有什么影响？

下一步

了解了 PagedAttention 的原理后，让我们来看看 KV Cache Manager 是如何管理这些 Block 的：

👉 下一章：KV Cache 管理器

2 - KV Cache 管理器

KV Cache 管理器详解

在上一章中，我们详细了解了 PagedAttention 的原理——它将 KV Cache 分成固定大小的 Block 来管理。但是，谁来负责这些 Block 的分配、释放和缓存查找呢？ 答案就是本章的主角：KVCacheManager。

KVCacheManager 是 vLLM 内存管理的核心组件，它连接了调度器（Scheduler）和底层的内存池（BlockPool），为每个请求提供高效的 KV Cache 管理服务。

1. KVCacheManager 在架构中的位置

graph TD
    subgraph 调度层
        Scheduler[Scheduler<br/>调度器]
    end

    subgraph 内存管理层
        KVM[KVCacheManager<br/>KV Cache 管理器]
        Coord[KVCacheCoordinator<br/>协调器]
        BP[BlockPool<br/>内存块池]
    end

    subgraph 执行层
        Runner[GPUModelRunner<br/>模型执行器]
    end

    Scheduler -->|"allocate_slots()"| KVM
    Scheduler -->|"get_computed_blocks()"| KVM
    Scheduler -->|"free()"| KVM
    KVM --> Coord
    Coord --> BP
    KVM -.->|"block_ids"| Runner

    style KVM fill:#e1f5fe
    style Scheduler fill:#fff3e0
    style BP fill:#e8f5e9

KVCacheManager 的核心职责：

分配管理：为请求分配 KV Cache 槽位
前缀缓存：查找和利用已缓存的前缀块
生命周期管理：跟踪和释放请求的内存块
使用率监控：提供 KV Cache 使用统计

2. 核心数据结构

2.1 KVCacheBlocks - 分配结果

KVCacheBlocks 是 KVCacheManager 返回给调度器的分配结果，它封装了实际的内存块列表：

# vllm/v1/core/kv_cache_manager.py

@dataclass
class KVCacheBlocks:
    """
    KVCacheManager 的分配结果，作为 Scheduler 和 KVCacheManager
    之间的接口，隐藏 KVCacheManager 的内部数据结构。
    """

    blocks: tuple[Sequence[KVCacheBlock], ...]
    """
    blocks[i][j] 表示第 i 个 kv_cache_group 的第 j 个 block。

    使用 kv_cache_group 作为外层维度，因为不同 group 可能有
    不同数量的 blocks（为未来扩展预留）。
    """

    def get_block_ids(self) -> tuple[list[int], ...]:
        """将 KVCacheBlocks 转换为 block_ids 列表"""
        return tuple([blk.block_id for blk in group]
                     for group in self.blocks)

为什么需要这个封装？

接口隔离：调度器不需要知道内部的 Block 结构细节
GC 优化：预分配空的 KVCacheBlocks 避免频繁创建对象
多 Group 支持：为混合精度等场景预留扩展能力

2.2 KVCacheManager 类结构

# vllm/v1/core/kv_cache_manager.py

class KVCacheManager:
    def __init__(
        self,
        kv_cache_config: KVCacheConfig,
        max_model_len: int,
        hash_block_size: int,
        enable_caching: bool = True,    # 是否启用前缀缓存
        use_eagle: bool = False,        # 是否使用 EAGLE 投机解码
        log_stats: bool = False,        # 是否记录统计信息
        ...
    ):
        self.max_model_len = max_model_len
        self.enable_caching = enable_caching

        # 核心组件：协调器（封装了 BlockPool）
        self.coordinator = get_kv_cache_coordinator(
            kv_cache_config=kv_cache_config,
            max_model_len=self.max_model_len,
            enable_caching=self.enable_caching,
            ...
        )

        # BlockPool 引用（实际内存管理）
        self.block_pool = self.coordinator.block_pool

        # 预分配的空 KVCacheBlocks（避免 GC 开销）
        self.empty_kv_cache_blocks = KVCacheBlocks(
            tuple(() for _ in range(self.num_kv_cache_groups))
        )

组件关系图：

classDiagram
    class KVCacheManager {
        +max_model_len: int
        +enable_caching: bool
        +coordinator: KVCacheCoordinator
        +block_pool: BlockPool
        +empty_kv_cache_blocks: KVCacheBlocks

        +get_computed_blocks(request) KVCacheBlocks, int
        +allocate_slots(request, num_new_tokens, ...) KVCacheBlocks
        +free(request) void
        +usage: float
    }

    class KVCacheCoordinator {
        +block_pool: BlockPool
        +find_longest_cache_hit() blocks, num_tokens
        +allocate_new_blocks() blocks
        +cache_blocks() void
        +free() void
    }

    class BlockPool {
        +blocks: list~KVCacheBlock~
        +free_block_queue: FreeKVCacheBlockQueue
        +cached_block_hash_to_block: BlockHashToBlockMap
        +get_new_blocks() blocks
        +free_blocks() void
        +touch() void
    }

    KVCacheManager --> KVCacheCoordinator : uses
    KVCacheCoordinator --> BlockPool : manages

3. 核心方法详解

3.1 get_computed_blocks() - 查找前缀缓存

当新请求到来时，调度器首先调用 get_computed_blocks() 检查是否有可复用的缓存：

def get_computed_blocks(self, request: Request) -> tuple[KVCacheBlocks, int]:
    """
    获取请求的已计算（缓存）块。

    Returns:
        - KVCacheBlocks: 命中的缓存块列表
        - int: 已计算的 token 数量
    """
    # 情况1：禁用缓存或请求明确跳过缓存
    if not self.enable_caching or request.skip_reading_prefix_cache:
        return self.empty_kv_cache_blocks, 0

    # 情况2：查找最长前缀匹配
    # 注意：即使全部命中，也需要保留最后一个 token 用于计算 logits
    max_cache_hit_length = request.num_tokens - 1

    computed_blocks, num_new_computed_tokens = (
        self.coordinator.find_longest_cache_hit(
            request.block_hashes,     # 请求的 block hash 列表
            max_cache_hit_length      # 最大命中长度
        )
    )

    # 记录统计信息
    if self.log_stats:
        self.prefix_cache_stats.record(
            num_tokens=request.num_tokens,
            num_hits=num_new_computed_tokens,
            preempted=request.num_preemptions > 0,
        )

    return self.create_kv_cache_blocks(computed_blocks), num_new_computed_tokens

缓存查找流程图：

flowchart TD
    A[开始 get_computed_blocks] --> B{启用缓存?}
    B -->|否| C[返回空块, 0]
    B -->|是| D{请求跳过缓存?}
    D -->|是| C
    D -->|否| E[计算 block_hashes]
    E --> F[find_longest_cache_hit]
    F --> G[遍历 hash 查找缓存]
    G --> H{找到匹配块?}
    H -->|是| I[增加引用计数]
    I --> J[继续查找下一个]
    H -->|否| K[停止查找]
    J --> H
    K --> L[返回缓存块和命中数]

    style F fill:#e8f5e9
    style L fill:#e1f5fe

关键点：为什么 max_cache_hit_length = num_tokens - 1？

输入: "Hello, how are you?"
Token IDs: [101, 102, 103, 104, 105]  (5个token)

假设全部命中缓存:
- 前4个token的KV已缓存 ✓
- 但我们仍需要"重新计算"最后1个token
- 因为我们需要得到最后位置的 logits 来采样下一个 token

所以最多只能利用 num_tokens - 1 = 4 个缓存

3.2 allocate_slots() - 分配缓存槽位

这是 KVCacheManager 最核心的方法，负责为请求分配 KV Cache 存储空间：

def allocate_slots(
    self,
    request: Request,
    num_new_tokens: int,          # 新生成的 token 数
    num_new_computed_tokens: int = 0,  # 前缀缓存命中的 token 数
    new_computed_blocks: KVCacheBlocks | None = None,  # 缓存命中的块
    num_lookahead_tokens: int = 0,     # 投机解码的预测 token 数
    num_external_computed_tokens: int = 0,  # 外部（如 P/D）计算的 token
    delay_cache_blocks: bool = False,  # 是否延迟缓存
    num_encoder_tokens: int = 0,       # 编码器 token（如 Whisper）
) -> KVCacheBlocks | None:
    """
    为请求分配新的 KV Cache 槽位。

    Returns:
        - KVCacheBlocks: 新分配的块
        - None: 内存不足，分配失败
    """

块布局说明：

----------------------------------------------------------------------
| < comp > | < new_comp > | < ext_comp >  | < new >  | < lookahead > |
----------------------------------------------------------------------
                                          |   < 需要计算的 >          |
----------------------------------------------------------------------
                          |            < 需要分配的 >                 |
----------------------------------------------------------------------
                          | < 需要缓存的 >           |
----------------------------------------------------------------------

术语说明:
comp      = 已计算的 token（request.num_computed_tokens）
new_comp  = 新命中缓存的 token（前缀缓存）
ext_comp  = 外部计算的 token（如 P/D disaggregation）
new       = 新生成的 token
lookahead = 投机解码的预测 token

allocate_slots 核心流程：

def allocate_slots(self, request, num_new_tokens, ...):
    # 1. 计算需要的总 token 数
    num_local_computed_tokens = request.num_computed_tokens + num_new_computed_tokens
    total_computed_tokens = num_local_computed_tokens + num_external_computed_tokens
    num_tokens_need_slot = total_computed_tokens + num_new_tokens + num_lookahead_tokens

    # 2. 释放滑动窗口外的块（如果有）
    # 这可以回收一些空间，减少需要新分配的块
    self.coordinator.remove_skipped_blocks(
        request.request_id, total_computed_tokens
    )

    # 3. 计算需要分配的新块数
    num_blocks_to_allocate = self.coordinator.get_num_blocks_to_allocate(
        request_id=request.request_id,
        num_tokens=num_tokens_need_slot,
        new_computed_blocks=new_computed_block_list,
        ...
    )

    # 4. 检查是否有足够的空闲块
    if num_blocks_to_allocate > self.block_pool.get_num_free_blocks():
        return None  # 内存不足！

    # 5. 处理前缀缓存命中的块（增加引用计数）
    if new_computed_block_list or num_external_computed_tokens > 0:
        self.coordinator.allocate_new_computed_blocks(
            request_id=request.request_id,
            new_computed_blocks=new_computed_block_list,
            ...
        )

    # 6. 分配新块
    new_blocks = self.coordinator.allocate_new_blocks(
        request.request_id,
        num_tokens_need_slot,
        num_tokens_main_model,
        num_encoder_tokens,
    )

    # 7. 缓存新的满块（用于后续请求复用）
    if self.enable_caching and not delay_cache_blocks:
        self.coordinator.cache_blocks(request, num_tokens_to_cache)

    return self.create_kv_cache_blocks(new_blocks)

完整流程图：

flowchart TD
    A[allocate_slots 开始] --> B[计算总 token 数]
    B --> C[remove_skipped_blocks<br/>释放滑动窗口外的块]
    C --> D[get_num_blocks_to_allocate<br/>计算需要的新块数]
    D --> E{空闲块足够?}
    E -->|否| F[返回 None<br/>分配失败]
    E -->|是| G{有缓存命中?}
    G -->|是| H[allocate_new_computed_blocks<br/>处理缓存块]
    G -->|否| I[allocate_new_blocks<br/>分配新块]
    H --> I
    I --> J{启用缓存?}
    J -->|是| K[cache_blocks<br/>缓存满块]
    J -->|否| L[返回新分配的块]
    K --> L

    style F fill:#ffcdd2
    style L fill:#c8e6c9

3.3 free() - 释放请求资源

当请求完成或被抢占时，需要释放其占用的 KV Cache：

def free(self, request: Request) -> None:
    """
    释放请求占用的块。
    按逆序释放，这样当启用缓存时，尾部块会先被驱逐。
    """
    self.coordinator.free(request.request_id)

释放流程：

sequenceDiagram
    participant Scheduler
    participant KVM as KVCacheManager
    participant Coord as Coordinator
    participant BP as BlockPool

    Note over Scheduler: 请求完成或被抢占
    Scheduler->>KVM: free(request)
    KVM->>Coord: free(request_id)
    Coord->>Coord: 获取请求的所有块
    Coord->>Coord: 按逆序排列块
    Coord->>BP: free_blocks(ordered_blocks)

    loop 每个块
        BP->>BP: block.ref_cnt -= 1
        alt ref_cnt == 0
            BP->>BP: 加入 free_block_queue
        end
    end

    Note over BP: 块可被其他请求复用<br/>或从缓存中驱逐

为什么要逆序释放？

假设请求占用块: [Block_A, Block_B, Block_C, Block_D]
                   ↑        ↑        ↑        ↑
                  前缀     ....    ....     尾部

如果另一个请求有相同前缀，它更可能复用 Block_A, Block_B
所以我们希望：
  - 先释放 Block_D（尾部，最不可能被复用）
  - 后释放 Block_A（前缀，最可能被复用）

这样 LRU 驱逐时，尾部块会先被驱逐，前缀块保留更久

4. 前缀缓存机制详解

前缀缓存（Prefix Caching）是 vLLM 的重要优化，它允许多个请求共享相同前缀的 KV Cache。

4.1 Block Hash 计算

每个 Block 的 hash 值基于其包含的 token 序列计算：

# 简化示意
def compute_block_hash(token_ids: list[int], block_size: int) -> BlockHash:
    """
    计算一个 block 的 hash 值。

    hash 基于：
    1. block 内的 token IDs
    2. 之前所有 block 的 hash（形成链式依赖）
    """
    # 确保是完整的 block
    assert len(token_ids) % block_size == 0

    # 使用加密 hash 或高效 hash 算法
    return hash(tuple(token_ids))

4.2 缓存查找过程

flowchart TD
    subgraph 请求1[请求 1: "Hello, how are you?"]
        R1B1[Block 1<br/>hash: abc123]
        R1B2[Block 2<br/>hash: def456]
    end

    subgraph 请求2[请求 2: "Hello, how is weather?"]
        R2B1[Block 1<br/>hash: abc123]
        R2B2[Block 2<br/>hash: ghi789]
    end

    subgraph 缓存[Block Hash Cache]
        C1[abc123 → Block #5]
        C2[def456 → Block #8]
    end

    R1B1 -.-> C1
    R1B2 -.-> C2
    R2B1 -.->|命中!| C1
    R2B2 -.->|未命中| X[分配新块]

    style C1 fill:#c8e6c9
    style R2B1 fill:#c8e6c9

4.3 缓存块的生命周期

stateDiagram-v2
    [*] --> Free: 初始化

    Free --> Allocated: get_new_blocks()
    note right of Allocated: ref_cnt = 1

    Allocated --> Cached: cache_full_blocks()
    note right of Cached: block_hash 被设置<br/>加入 hash_to_block

    Cached --> SharedCached: touch() (另一请求命中)
    note right of SharedCached: ref_cnt += 1

    SharedCached --> Cached: free() (一个请求释放)
    note right of Cached: ref_cnt -= 1

    Cached --> Evictable: free() (ref_cnt = 0)
    note right of Evictable: 加入 free_block_queue<br/>但保留 hash

    Evictable --> Evicted: _maybe_evict_cached_block()
    note right of Evicted: hash 被清除<br/>从 hash_to_block 移除

    Evicted --> Free: 可重新分配

    Evictable --> SharedCached: touch() (再次命中)
    note left of SharedCached: 从 free_queue 移除<br/>ref_cnt = 1

5. 与调度器的协作

KVCacheManager 和 Scheduler 紧密协作，下面是典型的调度循环中的交互：

sequenceDiagram
    participant S as Scheduler
    participant KVM as KVCacheManager
    participant BP as BlockPool

    Note over S: schedule() 开始

    rect rgb(230, 245, 230)
        Note over S,BP: 处理新请求
        S->>KVM: get_computed_blocks(request)
        KVM->>BP: 查找缓存
        BP-->>KVM: 缓存块, 命中数
        KVM-->>S: KVCacheBlocks, num_computed

        S->>KVM: allocate_slots(request, num_tokens, cached_blocks)
        KVM->>BP: 检查空闲块
        alt 空闲块足够
            KVM->>BP: 分配新块
            BP-->>KVM: 新块列表
            KVM-->>S: KVCacheBlocks
            Note over S: 将请求加入 running
        else 空闲块不足
            KVM-->>S: None
            Note over S: 请求继续等待
        end
    end

    rect rgb(255, 245, 230)
        Note over S,BP: 处理运行中请求
        S->>KVM: allocate_slots(request, 1)
        Note over KVM: 为新生成的 token 分配槽位
        alt 分配成功
            KVM-->>S: KVCacheBlocks
        else 需要抢占
            KVM-->>S: None
            S->>S: 选择抢占目标
            S->>KVM: free(victim_request)
            KVM->>BP: 释放块
            S->>KVM: allocate_slots(request, 1)
            KVM-->>S: KVCacheBlocks
        end
    end

    rect rgb(245, 230, 230)
        Note over S,BP: 处理完成请求
        S->>KVM: free(finished_request)
        KVM->>BP: 释放块
        Note over BP: 块加入 free_queue<br/>可被复用或驱逐
    end

6. 配置与调优

6.1 关键配置参数

参数	说明	建议值
`enable_prefix_caching`	是否启用前缀缓存	True（有重复前缀时）
`block_size`	每个 Block 的 token 数	16（默认）
`gpu_memory_utilization`	GPU 显存利用率	0.9（90%）
`max_model_len`	最大序列长度	根据需求设置

6.2 性能优化建议

1. 前缀缓存优化

# 适合启用前缀缓存的场景：
# - 多轮对话（共享系统提示）
# - RAG 应用（共享检索文档）
# - 批量相似任务

# 启用前缀缓存
llm = LLM(
    model="meta-llama/Llama-2-7b",
    enable_prefix_caching=True
)

2. 内存使用监控

# 获取 KV Cache 使用率
usage = kv_cache_manager.usage  # 0.0 ~ 1.0

# 获取前缀缓存统计
stats = kv_cache_manager.make_prefix_cache_stats()
print(f"缓存命中率: {stats.hit_rate:.2%}")

3. 显存不足时的调整

# 方案1：减小 block_size
# 更细粒度的内存管理，但增加元数据开销

# 方案2：降低 gpu_memory_utilization
llm = LLM(model="...", gpu_memory_utilization=0.8)

# 方案3：减小 max_model_len
llm = LLM(model="...", max_model_len=2048)

7. 代码位置速查

功能	文件	关键函数/类
KVCacheManager	`vllm/v1/core/kv_cache_manager.py`	`KVCacheManager` 类
分配结果	`vllm/v1/core/kv_cache_manager.py`	`KVCacheBlocks` 数据类
协调器	`vllm/v1/core/kv_cache_coordinator.py`	`KVCacheCoordinator`
BlockPool	`vllm/v1/core/block_pool.py`	`BlockPool` 类
Block 数据结构	`vllm/v1/core/kv_cache_utils.py`	`KVCacheBlock`

8. 小结

本章我们深入了解了 KVCacheManager 的工作原理：

核心职责：管理 KV Cache 的分配、释放和前缀缓存
分层设计：KVCacheManager → Coordinator → BlockPool
关键方法：
- get_computed_blocks()：查找前缀缓存
- allocate_slots()：分配缓存槽位
- free()：释放请求资源
前缀缓存：通过 Block Hash 实现多请求共享
与调度器协作：为调度决策提供内存管理支持

在下一章中，我们将深入 BlockPool，了解底层的内存块管理和 LRU 驱逐策略。

3 - Block Pool 内存块池

Block Pool 内存块池详解

在前两章中，我们了解了 PagedAttention 的分页思想和 KVCacheManager 的分配接口。本章我们将深入到内存管理的最底层——BlockPool。

BlockPool 是 vLLM 内存管理的基石，它直接管理 GPU 上的物理内存块，负责块的分配、释放、缓存和驱逐。理解 BlockPool 的工作原理，对于深入理解 vLLM 的内存效率至关重要。

1. BlockPool 的作用

graph TD
    subgraph 上层接口
        KVM[KVCacheManager]
    end

    subgraph BlockPool 内部
        BP[BlockPool]
        Blocks[blocks: list]
        FreeQ[FreeKVCacheBlockQueue<br/>空闲块队列]
        HashCache[BlockHashToBlockMap<br/>缓存查找表]
    end

    subgraph GPU 显存
        GPU[KV Cache 物理内存]
    end

    KVM -->|get_new_blocks| BP
    KVM -->|free_blocks| BP
    KVM -->|touch| BP
    KVM -->|cache_full_blocks| BP

    BP --> Blocks
    BP --> FreeQ
    BP --> HashCache
    Blocks -.->|映射| GPU

    style BP fill:#e1f5fe
    style GPU fill:#e8f5e9

BlockPool 的核心职责：

块管理：维护所有物理块的元数据
分配/释放：从空闲队列分配块，释放块回队列
缓存管理：维护 Block Hash 到 Block 的映射
LRU 驱逐：当需要分配但无空闲块时，驱逐最久未使用的块

2. 核心数据结构

2.1 KVCacheBlock - 块元数据

每个物理块都有一个对应的 KVCacheBlock 对象来存储元数据：

# vllm/v1/core/kv_cache_utils.py

@dataclass
class KVCacheBlock:
    """KV Cache 块的元数据"""

    # 块 ID，范围 [0, num_gpu_blocks - 1]
    block_id: int

    # 引用计数
    ref_cnt: int = 0

    # 块的 Hash 值（仅当块满且被缓存时有效）
    _block_hash: BlockHashWithGroupId | None = None

    # 双向链表指针（用于空闲队列）
    prev_free_block: "KVCacheBlock | None" = None
    next_free_block: "KVCacheBlock | None" = None

    # 是否是占位符块（null block）
    is_null: bool = False

关键字段解释：

字段	说明
`block_id`	块的唯一标识，对应 GPU 内存中的物理位置
`ref_cnt`	引用计数，表示有多少请求正在使用这个块
`_block_hash`	用于前缀缓存的 hash 值，只有满块才有
`prev/next_free_block`	空闲队列的链表指针，实现 O(1) 的插入删除
`is_null`	标记占位符块，用于填充 Block Table 中的空位

2.2 FreeKVCacheBlockQueue - 空闲块队列

空闲块队列是一个双向链表，按 LRU 顺序组织空闲块：

# vllm/v1/core/kv_cache_utils.py

class FreeKVCacheBlockQueue:
    """
    双向链表实现的空闲块队列。

    特点：
    - O(1) 的头部/尾部插入和删除
    - O(1) 的中间删除（已知节点位置）
    - LRU 顺序：最久未使用的块在队列头部
    """

    def __init__(self, blocks: list[KVCacheBlock]) -> None:
        self.num_free_blocks = len(blocks)

        # 初始化双向链表
        for i in range(self.num_free_blocks):
            if i > 0:
                blocks[i].prev_free_block = blocks[i - 1]
            if i < self.num_free_blocks - 1:
                blocks[i].next_free_block = blocks[i + 1]

        # 哨兵节点，简化边界处理
        self.fake_free_list_head = KVCacheBlock(block_id=-1)
        self.fake_free_list_tail = KVCacheBlock(block_id=-1)

        # 连接哨兵与实际链表
        self.fake_free_list_head.next_free_block = blocks[0]
        blocks[0].prev_free_block = self.fake_free_list_head
        self.fake_free_list_tail.prev_free_block = blocks[-1]
        blocks[-1].next_free_block = self.fake_free_list_tail

链表结构示意图：

graph LR
    subgraph FreeKVCacheBlockQueue
        HEAD[fake_head<br/>id=-1]
        B0[Block 0]
        B1[Block 1]
        B2[Block 2]
        BN[...]
        TAIL[fake_tail<br/>id=-1]

        HEAD -->|next| B0
        B0 -->|prev| HEAD
        B0 -->|next| B1
        B1 -->|prev| B0
        B1 -->|next| B2
        B2 -->|prev| B1
        B2 -->|next| BN
        BN -->|next| TAIL
        TAIL -->|prev| BN
    end

    style HEAD fill:#ffcccc
    style TAIL fill:#ffcccc

为什么用双向链表而不是 Python 的 deque？

# 场景：缓存命中，需要从队列中间移除块

# 使用 deque：O(n) 查找 + 移除
def remove_from_deque(deque, block):
    deque.remove(block)  # O(n)

# 使用双向链表：O(1) 直接移除
def remove_from_linked_list(block):
    block.prev_free_block.next_free_block = block.next_free_block
    block.next_free_block.prev_free_block = block.prev_free_block
    block.prev_free_block = block.next_free_block = None

2.3 BlockHashToBlockMap - 缓存查找表

用于快速查找已缓存的块：

# vllm/v1/core/block_pool.py

class BlockHashToBlockMap:
    """
    Block Hash 到 Block 的映射表。

    注意：同一个 Hash 可能对应多个 Block（虽然内容相同，
    但分配给了不同的请求）。
    """

    def __init__(self):
        # 单个块时直接存储 KVCacheBlock
        # 多个块时存储 dict[block_id, KVCacheBlock]
        self._cache: dict[
            BlockHashWithGroupId,
            KVCacheBlock | dict[int, KVCacheBlock]
        ] = {}

    def get_one_block(self, key: BlockHashWithGroupId) -> KVCacheBlock | None:
        """获取任意一个匹配的块"""
        blocks = self._cache.get(key)
        if blocks is not None:
            if isinstance(blocks, KVCacheBlock):
                return blocks
            if isinstance(blocks, dict):
                return next(iter(blocks.values()))
        return None

    def insert(self, key: BlockHashWithGroupId, block: KVCacheBlock) -> None:
        """插入块到缓存"""
        blocks = self._cache.get(key)
        if blocks is None:
            self._cache[key] = block
        elif isinstance(blocks, KVCacheBlock):
            # 第二个相同 hash 的块，升级为 dict
            self._cache[key] = {blocks.block_id: blocks, block.block_id: block}
        elif isinstance(blocks, dict):
            blocks[block.block_id] = block

为什么同一 Hash 可能有多个块？

请求 A: "Hello, how are you?"
请求 B: "Hello, how are you?" (相同内容)

当两个请求同时运行时：
- 请求 A 分配了 Block #5 来存储 "Hello, how are you?"
- 请求 B 也需要这些 token，但由于 Block #5 正在被使用，
  所以也分配了 Block #8

两个块内容相同，Hash 相同，但是物理上是两个不同的块。

3. BlockPool 类详解

# vllm/v1/core/block_pool.py

class BlockPool:
    """管理 KV Cache 块的内存池"""

    def __init__(
        self,
        num_gpu_blocks: int,      # GPU 块总数
        enable_caching: bool,      # 是否启用前缀缓存
        hash_block_size: int,      # Hash 计算的块大小
        enable_kv_cache_events: bool = False,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ):
        self.num_gpu_blocks = num_gpu_blocks
        self.enable_caching = enable_caching

        # 所有块的元数据
        self.blocks: list[KVCacheBlock] = [
            KVCacheBlock(idx) for idx in range(num_gpu_blocks)
        ]

        # 空闲块队列（LRU 顺序）
        self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)

        # 缓存查找表
        self.cached_block_hash_to_block = BlockHashToBlockMap()

        # 占位符块（用于填充 Block Table 中的空位）
        self.null_block = self.free_block_queue.popleft()
        self.null_block.is_null = True

初始化后的状态：

graph TD
    subgraph BlockPool
        blocks["blocks[0...N-1]"]
        FQ[FreeKVCacheBlockQueue]
        Cache[BlockHashToBlockMap]
        NB[null_block]
    end

    subgraph "空闲队列 (初始状态)"
        B1[Block 1] --> B2[Block 2]
        B2 --> B3[Block 3]
        B3 --> BN[...]
    end

    subgraph "GPU 物理内存"
        GPU["KV Cache Tensor<br/>[num_blocks, 2, num_heads, block_size, head_dim]"]
    end

    blocks --> B1
    FQ --> B1
    NB -.->|"block_id=0"| GPU

    style NB fill:#ffcccc

注意： null_block 是第一个被"分配"的块（block_id=0），但它实际上是一个占位符，永远不会存储真实数据。它用于填充 Block Table 中那些暂时不需要的位置（例如滑动窗口注意力中窗口外的位置）。

4. 核心操作详解

4.1 get_new_blocks() - 分配新块

def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]:
    """
    从空闲队列分配指定数量的块。

    如果启用了缓存，可能会驱逐一些缓存块来腾出空间。
    """
    if num_blocks > self.get_num_free_blocks():
        raise ValueError(f"Cannot get {num_blocks} free blocks from the pool")

    # 从队列头部弹出 n 个块
    ret: list[KVCacheBlock] = self.free_block_queue.popleft_n(num_blocks)

    if self.enable_caching:
        for block in ret:
            # 如果块之前被缓存过，需要驱逐它
            self._maybe_evict_cached_block(block)
            block.ref_cnt += 1
    else:
        for block in ret:
            block.ref_cnt += 1

    return ret

分配流程图：

flowchart TD
    A[get_new_blocks] --> B{请求块数 <= 空闲块数?}
    B -->|否| C[抛出异常]
    B -->|是| D[popleft_n 弹出 n 个块]
    D --> E{启用缓存?}
    E -->|是| F[遍历每个块]
    F --> G{块有 hash?}
    G -->|是| H[从缓存表移除]
    H --> I[清除 block_hash]
    G -->|否| J[跳过]
    I --> K[ref_cnt += 1]
    J --> K
    E -->|否| K
    K --> L[返回块列表]

    style C fill:#ffcdd2
    style L fill:#c8e6c9

4.2 free_blocks() - 释放块

def free_blocks(self, ordered_blocks: Iterable[KVCacheBlock]) -> None:
    """
    释放一组块。块应按驱逐优先级排序（先释放的先被驱逐）。

    注意：释放不一定意味着块立即可复用。如果块被多个请求共享
    （ref_cnt > 1），释放只会减少引用计数。
    """
    blocks_list = list(ordered_blocks)

    # 减少所有块的引用计数
    for block in blocks_list:
        block.ref_cnt -= 1

    # 只有 ref_cnt == 0 且非 null 的块才加入空闲队列
    self.free_block_queue.append_n(
        [block for block in blocks_list
         if block.ref_cnt == 0 and not block.is_null]
    )

释放流程示意：

sequenceDiagram
    participant Caller as 调用者
    participant BP as BlockPool
    participant Block as KVCacheBlock
    participant FQ as FreeBlockQueue

    Caller->>BP: free_blocks([B1, B2, B3])

    loop 每个块
        BP->>Block: ref_cnt -= 1
    end

    alt B1.ref_cnt == 0
        BP->>FQ: append(B1)
        Note over FQ: B1 加入队列尾部<br/>可被复用或驱逐
    else B1.ref_cnt > 0
        Note over Block: B1 仍被其他请求使用<br/>不加入空闲队列
    end

4.3 touch() - 缓存命中处理

当一个请求命中了缓存块时，需要调用 touch() 来增加引用计数：

def touch(self, blocks: Sequence[KVCacheBlock]) -> None:
    """
    触碰块，增加引用计数。

    如果块在空闲队列中（ref_cnt 为 0），需要将其从队列中移除，
    因为它现在被使用了。
    """
    for block in blocks:
        # ref_cnt=0 意味着块在空闲队列中
        if block.ref_cnt == 0 and not block.is_null:
            self.free_block_queue.remove(block)  # O(1)

        block.ref_cnt += 1

touch 的关键作用：

场景：请求 B 命中了请求 A 的缓存块

初始状态：
- Block #5 被请求 A 使用，ref_cnt = 1
- Block #5 有 hash 值（已缓存）

请求 A 完成后：
- Block #5.ref_cnt = 0
- Block #5 加入空闲队列尾部
- Block #5 的 hash 仍然保留（可被缓存复用）

请求 B 查找缓存，命中 Block #5：
- touch(Block #5)
- 从空闲队列移除 Block #5（因为 ref_cnt 是 0）
- Block #5.ref_cnt = 1

现在 Block #5 又被使用了，不会被驱逐！

4.4 cache_full_blocks() - 缓存满块

当一个块填满后，将其加入缓存表：

def cache_full_blocks(
    self,
    request: Request,
    blocks: list[KVCacheBlock],
    num_cached_blocks: int,     # 已缓存的块数
    num_full_blocks: int,       # 满块数
    block_size: int,
    kv_cache_group_id: int,
) -> None:
    """
    缓存新的满块。

    为每个新满块计算 hash 并加入缓存表。
    """
    if num_cached_blocks >= num_full_blocks:
        return

    new_full_blocks = blocks[num_cached_blocks:num_full_blocks]

    for i, blk in enumerate(new_full_blocks):
        if blk.is_null:
            continue

        # 获取块的 hash
        block_hash = request.block_hashes[num_cached_blocks + i]

        # 设置块的 hash
        block_hash_with_group_id = make_block_hash_with_group_id(
            block_hash, kv_cache_group_id
        )
        blk.block_hash = block_hash_with_group_id

        # 加入缓存表
        self.cached_block_hash_to_block.insert(block_hash_with_group_id, blk)

5. 块的生命周期

stateDiagram-v2
    [*] --> Free: 初始化

    Free --> Allocated: get_new_blocks()
    note right of Allocated
        ref_cnt = 1
        block_hash = None
    end note

    Allocated --> FullAndCached: cache_full_blocks()
    note right of FullAndCached
        ref_cnt >= 1
        block_hash 已设置
        加入 hash_to_block
    end note

    FullAndCached --> Shared: touch() (另一请求命中)
    note right of Shared
        ref_cnt += 1
    end note

    Shared --> FullAndCached: free() (一个请求释放)
    note right of FullAndCached
        ref_cnt -= 1
    end note

    FullAndCached --> CachedEvictable: free() (ref_cnt = 0)
    note right of CachedEvictable
        ref_cnt = 0
        block_hash 保留
        加入空闲队列尾部
    end note

    CachedEvictable --> Shared: touch() (再次命中)
    note left of Shared
        从空闲队列移除
        ref_cnt = 1
    end note

    CachedEvictable --> Evicted: _maybe_evict_cached_block()
    note right of Evicted
        从空闲队列头部被取出
        block_hash 清除
        从 hash_to_block 移除
    end note

    Evicted --> Allocated: 重新分配

6. LRU 驱逐策略

BlockPool 使用 LRU（Least Recently Used） 策略来决定驱逐哪些缓存块。

6.1 LRU 顺序的维护

关键规则：
1. 新分配的块从队列头部取出
2. 释放的块（ref_cnt 降为 0）加入队列尾部
3. 缓存命中时，块从队列中间移除

这意味着：
- 队列头部是最久未使用的块（最先被驱逐）
- 队列尾部是最近使用的块（最后被驱逐）

6.2 驱逐过程

sequenceDiagram
    participant Caller as 调用者
    participant BP as BlockPool
    participant FQ as FreeBlockQueue
    participant Cache as BlockHashToBlockMap

    Note over Caller: 需要分配新块，但内存紧张

    Caller->>BP: get_new_blocks(1)
    BP->>FQ: popleft()
    FQ-->>BP: Block (可能是缓存块)

    alt 块有 hash (是缓存块)
        BP->>BP: _maybe_evict_cached_block()
        BP->>Cache: pop(block_hash, block_id)
        Cache-->>BP: 移除成功
        BP->>BP: block.reset_hash()
        Note over BP: 块的缓存状态被清除
    end

    BP-->>Caller: 干净的块

6.3 为什么使用逆序释放？

# KVCacheManager 中的释放逻辑
def free(self, request: Request) -> None:
    """释放请求占用的块，按逆序释放"""
    blocks = self.coordinator.get_blocks(request.request_id)
    # 逆序：尾部块先加入空闲队列
    self.block_pool.free_blocks(reversed(blocks))

原因：

请求 A 占用块: [Block_1, Block_2, Block_3, Block_4]
                 ↑         ↑         ↑         ↑
               前缀      ....      ....      尾部

如果请求 B 有相同的前缀 "Hello, how"，它更可能复用 Block_1

正序释放的结果：
空闲队列: [..., Block_1, Block_2, Block_3, Block_4]
                   ↑                              ↑
                 先驱逐                         后驱逐
// 前缀块先被驱逐，不好！

逆序释放的结果：
空闲队列: [..., Block_4, Block_3, Block_2, Block_1]
                   ↑                              ↑
                 先驱逐                         后驱逐
// 尾部块先被驱逐，前缀块保留更久，好！

7. 前缀缓存的缓存命中

7.1 缓存查找流程

def get_cached_block(
    self, block_hash: BlockHash, kv_cache_group_ids: list[int]
) -> list[KVCacheBlock] | None:
    """
    根据 block hash 查找缓存块。

    返回每个 group 的缓存块，如果任何 group 未命中则返回 None。
    """
    cached_blocks = []
    for group_id in kv_cache_group_ids:
        block_hash_with_group_id = make_block_hash_with_group_id(
            block_hash, group_id
        )
        block = self.cached_block_hash_to_block.get_one_block(
            block_hash_with_group_id
        )
        if not block:
            return None  # 任何 group 未命中，整体失败
        cached_blocks.append(block)
    return cached_blocks

7.2 完整的缓存命中场景

sequenceDiagram
    participant Req as 新请求
    participant KVM as KVCacheManager
    participant BP as BlockPool
    participant Cache as BlockHashToBlockMap
    participant FQ as FreeBlockQueue

    Req->>KVM: get_computed_blocks()
    KVM->>KVM: 计算 block_hashes

    loop 每个 block_hash
        KVM->>BP: get_cached_block(hash)
        BP->>Cache: get_one_block(hash)

        alt 命中
            Cache-->>BP: Block
            BP-->>KVM: Block

            Note over KVM: 检查块状态
            alt 块在空闲队列中 (ref_cnt = 0)
                KVM->>BP: touch(Block)
                BP->>FQ: remove(Block)
                Note over FQ: 从队列移除<br/>防止被驱逐
                BP->>BP: Block.ref_cnt += 1
            else 块正在使用中 (ref_cnt > 0)
                KVM->>BP: touch(Block)
                BP->>BP: Block.ref_cnt += 1
                Note over BP: 共享使用
            end
        else 未命中
            Cache-->>BP: None
            BP-->>KVM: None
            Note over KVM: 停止查找<br/>后续块无法复用
        end
    end

    KVM-->>Req: 缓存块列表, 命中 token 数

8. 配置与调优

8.1 关键配置参数

参数	说明	影响
`num_gpu_blocks`	总块数	更多块 = 更大的 KV Cache 容量
`enable_caching`	是否启用前缀缓存	开启后有缓存命中收益，但增加管理开销
`block_size`	每块 token 数	更大块 = 更少碎片，但粒度更粗

8.2 块数量计算

# vllm/v1/core/kv_cache_utils.py

def get_num_blocks(
    vllm_config: VllmConfig,
    num_layers: int,
    available_memory: int,
    page_size: int  # 每块占用的显存
) -> int:
    """计算可分配的块数"""
    num_blocks = int(available_memory // page_size // num_layers)
    num_blocks = max(num_blocks, 0)
    return num_blocks

显存消耗公式：

page_size = block_size × num_heads × head_dim × 2 × dtype_size
                           ↑                    ↑
                    (K 和 V)            (FP16 = 2 bytes)

total_kv_cache_memory = page_size × num_layers × num_blocks

8.3 监控使用率

def get_usage(self) -> float:
    """获取 KV Cache 使用率"""
    total_gpu_blocks = self.num_gpu_blocks - 1  # 减去 null_block
    if not total_gpu_blocks:
        return 0
    return 1.0 - (self.get_num_free_blocks() / total_gpu_blocks)

9. 代码位置速查

功能	文件	关键类/函数
BlockPool	`vllm/v1/core/block_pool.py`	`BlockPool` 类
KVCacheBlock	`vllm/v1/core/kv_cache_utils.py`	`KVCacheBlock` 数据类
空闲队列	`vllm/v1/core/kv_cache_utils.py`	`FreeKVCacheBlockQueue` 类
缓存查找表	`vllm/v1/core/block_pool.py`	`BlockHashToBlockMap` 类
Hash 计算	`vllm/v1/core/kv_cache_utils.py`	`hash_block_tokens()`
块数计算	`vllm/v1/core/kv_cache_utils.py`	`get_num_blocks()`

10. 小结

本章我们深入了解了 BlockPool 的工作原理：

核心数据结构：
- KVCacheBlock：块元数据，包含引用计数和 hash
- FreeKVCacheBlockQueue：双向链表实现的 LRU 空闲队列
- BlockHashToBlockMap：前缀缓存的 hash 查找表
关键操作：
- get_new_blocks()：从队列头部分配块
- free_blocks()：减少引用计数，ref_cnt=0 时加入队列尾部
- touch()：缓存命中时增加引用计数
- cache_full_blocks()：缓存满块的 hash
LRU 驱逐：
- 队列头部是最久未使用的块
- 逆序释放确保前缀块保留更久
块生命周期：Free → Allocated → Cached → Evictable → Evicted

在下一章中，我们将学习调度器（Scheduler）如何使用这些内存管理组件来做出调度决策。

4 - 调度器原理

调度器原理详解

调度器（Scheduler）是 vLLM 的"大脑"，它决定了哪些请求可以运行以及每个请求能处理多少 token。一个好的调度策略直接影响系统的吞吐量、延迟和资源利用率。

本章我们将深入了解 vLLM 调度器的工作原理、调度算法和实现细节。

1. Scheduler 在架构中的位置

graph TD
    subgraph 用户层
        User[用户请求]
    end

    subgraph 引擎层
        Engine[EngineCore]
    end

    subgraph 调度层
        Scheduler[Scheduler<br/>调度器]
        Waiting[waiting 队列]
        Running[running 队列]
    end

    subgraph 内存管理层
        KVM[KVCacheManager]
    end

    subgraph 执行层
        Executor[ModelExecutor]
    end

    User --> Engine
    Engine --> Scheduler
    Scheduler --> Waiting
    Scheduler --> Running
    Scheduler -->|allocate/free| KVM
    Scheduler -->|SchedulerOutput| Executor

    style Scheduler fill:#e1f5fe

Scheduler 的核心职责：

请求队列管理：维护 waiting 和 running 两个队列
资源分配决策：决定哪些请求可以获得 GPU 资源
内存协调：与 KVCacheManager 协作管理 KV Cache
抢占处理：在内存不足时执行抢占策略
输出构建：为 ModelExecutor 构建 SchedulerOutput

2. 核心数据结构

2.1 请求队列

# vllm/v1/core/sched/scheduler.py

class Scheduler:
    def __init__(self, ...):
        # 等待队列：存放新到达和被抢占的请求
        self.waiting = create_request_queue(self.policy)

        # 运行队列：存放正在处理的请求
        self.running: list[Request] = []

        # 请求字典：通过 request_id 快速查找
        self.requests: dict[str, Request] = {}

两个队列的关系：

stateDiagram-v2
    [*] --> WAITING: add_request()

    WAITING --> RUNNING: 调度成功
    RUNNING --> WAITING: 被抢占

    RUNNING --> FINISHED: 完成

    FINISHED --> [*]: 释放资源

2.2 调度约束

class Scheduler:
    def __init__(self, ...):
        # 最大并发请求数
        self.max_num_running_reqs = self.scheduler_config.max_num_seqs

        # 每步最大 token 数
        self.max_num_scheduled_tokens = self.scheduler_config.max_num_batched_tokens

        # 最大序列长度
        self.max_model_len = vllm_config.model_config.max_model_len

约束说明：

约束	默认值	说明
`max_num_seqs`	256	最多同时运行的请求数
`max_num_batched_tokens`	2048	每步最多处理的 token 数
`max_model_len`	模型配置	单个序列的最大长度

2.3 SchedulerOutput - 调度输出

# vllm/v1/core/sched/output.py

@dataclass
class SchedulerOutput:
    """调度器的输出，传递给 ModelExecutor"""

    # 新调度的请求
    scheduled_new_reqs: list[NewRequestData]

    # 继续运行的请求
    scheduled_cached_reqs: CachedRequestData

    # 每个请求调度的 token 数
    num_scheduled_tokens: dict[str, int]

    # 总调度 token 数
    total_num_scheduled_tokens: int

    # 抢占的请求 ID
    preempted_req_ids: set[str]

    # 完成的请求 ID
    finished_req_ids: set[str]

3. schedule() 方法详解

schedule() 是调度器的核心方法，每个 step 调用一次。

3.1 调度算法概述

vLLM 的调度没有明确的 "prefill 阶段" 或 "decode 阶段"。
每个请求只有两个关键状态：
  - num_computed_tokens: 已计算的 token 数
  - num_tokens: 总 token 数（prompt + output）

每一步，调度器尝试分配 token，使 num_computed_tokens 追上 num_tokens。
这种设计足够通用，支持分块预填充、前缀缓存、投机解码等各种优化。

3.2 完整调度流程

flowchart TD
    Start[schedule 开始] --> Init[初始化预算和输出列表]
    Init --> RunLoop{running 队列}

    subgraph 处理运行中请求
        RunLoop --> |遍历| CalcTokens[计算需要调度的 token 数]
        CalcTokens --> Allocate[allocate_slots 分配 KV Cache]
        Allocate --> AllocOK{分配成功?}
        AllocOK -->|是| AddToScheduled[加入调度列表<br/>减少 token 预算]
        AllocOK -->|否| Preempt[抢占低优先级请求]
        Preempt --> RetryAlloc{重试分配}
        RetryAlloc -->|成功| AddToScheduled
        RetryAlloc -->|失败| SkipReq[跳过此请求]
        AddToScheduled --> RunLoop
        SkipReq --> RunLoop
    end

    RunLoop -->|完成| WaitCheck{有抢占?}
    WaitCheck -->|是| SkipWaiting[跳过 waiting 队列]
    WaitCheck -->|否| WaitLoop{waiting 队列}

    subgraph 处理等待请求
        WaitLoop --> |遍历| CheckStatus[检查请求状态]
        CheckStatus --> ReadyCheck{可以调度?}
        ReadyCheck -->|否| SkipWait[跳过并继续]
        ReadyCheck -->|是| GetCache[查找前缀缓存]
        GetCache --> CalcNewTokens[计算需要调度的 token 数]
        CalcNewTokens --> AllocWait[allocate_slots 分配]
        AllocWait --> AllocWaitOK{分配成功?}
        AllocWaitOK -->|是| MoveToRunning[移入 running 队列]
        AllocWaitOK -->|否| StopWait[停止处理 waiting]
        MoveToRunning --> WaitLoop
        SkipWait --> WaitLoop
    end

    WaitLoop -->|完成| BuildOutput[构建 SchedulerOutput]
    SkipWaiting --> BuildOutput
    StopWait --> BuildOutput
    BuildOutput --> Return[返回输出]

3.3 处理 Running 请求

def schedule(self) -> SchedulerOutput:
    # ... 初始化 ...

    # 第一步：处理 RUNNING 请求
    req_index = 0
    while req_index < len(self.running) and token_budget > 0:
        request = self.running[req_index]

        # 计算需要调度的 token 数
        num_new_tokens = (
            request.num_tokens_with_spec      # 总 token 数
            + request.num_output_placeholders  # 输出占位符
            - request.num_computed_tokens      # 已计算的 token 数
        )

        # 应用长 prefill 分块限制
        if 0 < threshold < num_new_tokens:
            num_new_tokens = threshold

        # 不超过 token 预算
        num_new_tokens = min(num_new_tokens, token_budget)

        # 不超过模型最大长度
        num_new_tokens = min(
            num_new_tokens,
            self.max_model_len - 1 - request.num_computed_tokens
        )

        if num_new_tokens == 0:
            req_index += 1
            continue

        # 尝试分配 KV Cache
        while True:
            new_blocks = self.kv_cache_manager.allocate_slots(
                request,
                num_new_tokens,
                num_lookahead_tokens=self.num_lookahead_tokens,
            )

            if new_blocks is not None:
                # 分配成功
                break

            # 分配失败，执行抢占
            preempted_req = self._select_preempt_target()
            self._preempt_request(preempted_req, timestamp)

            if preempted_req == request:
                # 自己被抢占了，无法继续
                break

        if new_blocks is None:
            break

        # 调度成功
        scheduled_running_reqs.append(request)
        token_budget -= num_new_tokens
        req_index += 1

3.4 处理 Waiting 请求

    # 第二步：处理 WAITING 请求（只有没有抢占时才处理）
    if not preempted_reqs:
        while self.waiting and token_budget > 0:
            # 检查并发限制
            if len(self.running) == self.max_num_running_reqs:
                break

            request = self.waiting.peek_request()

            # 检查各种等待状态
            if request.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
                # 等待远程 KV Cache 加载
                ...

            if request.status == RequestStatus.WAITING_FOR_FSM:
                # 等待语法编译
                ...

            # 查找前缀缓存
            new_computed_blocks, num_cached = (
                self.kv_cache_manager.get_computed_blocks(request)
            )

            # 计算需要调度的 token 数
            num_new_tokens = request.num_tokens - num_cached
            num_new_tokens = min(num_new_tokens, token_budget)

            # 分配 KV Cache
            new_blocks = self.kv_cache_manager.allocate_slots(
                request,
                num_new_tokens,
                num_new_computed_tokens=num_cached,
                new_computed_blocks=new_computed_blocks,
            )

            if new_blocks is None:
                # 无法分配，停止处理 waiting 队列
                break

            # 调度成功，移入 running 队列
            request = self.waiting.pop_request()
            self.running.append(request)
            request.status = RequestStatus.RUNNING
            request.num_computed_tokens = num_cached
            token_budget -= num_new_tokens

4. 抢占机制

当 KV Cache 内存不足时，调度器需要抢占一些请求来释放空间。

4.1 抢占策略

def _select_preempt_target(self) -> Request:
    """选择要抢占的请求"""
    if self.policy == SchedulingPolicy.PRIORITY:
        # 优先级调度：抢占优先级最低、到达最晚的请求
        return max(
            self.running,
            key=lambda r: (r.priority, r.arrival_time),
        )
    else:
        # FCFS：抢占最后加入的请求
        return self.running[-1]

4.2 抢占流程

def _preempt_request(self, request: Request, timestamp: float) -> None:
    """抢占请求并放回 waiting 队列"""
    assert request.status == RequestStatus.RUNNING

    # 释放 KV Cache
    self.kv_cache_manager.free(request)
    self.encoder_cache_manager.free(request)

    # 更新请求状态
    request.status = RequestStatus.PREEMPTED
    request.num_computed_tokens = 0  # 重置计算进度
    request.num_preemptions += 1

    # 放回 waiting 队列头部（优先恢复）
    self.waiting.prepend_request(request)

4.3 抢占流程图

sequenceDiagram
    participant Scheduler
    participant KVM as KVCacheManager
    participant Running as running 队列
    participant Waiting as waiting 队列
    participant Request as 请求

    Note over Scheduler: allocate_slots 失败

    Scheduler->>Running: 选择抢占目标
    Running-->>Scheduler: victim 请求

    Scheduler->>Request: status = PREEMPTED
    Scheduler->>Request: num_computed_tokens = 0
    Scheduler->>Request: num_preemptions += 1

    Scheduler->>KVM: free(victim)
    Note over KVM: 释放 KV Cache 块

    Scheduler->>Running: remove(victim)
    Scheduler->>Waiting: prepend(victim)
    Note over Waiting: victim 加入队列头部<br/>等待恢复

    Scheduler->>KVM: retry allocate_slots
    Note over Scheduler: 重试分配

5. 调度策略

5.1 FCFS（先来先服务）

# 默认策略
class SchedulingPolicy(Enum):
    FCFS = "fcfs"
    PRIORITY = "priority"

FCFS 特点：

按请求到达顺序处理
简单公平
可能导致"队头阻塞"

5.2 Priority（优先级调度）

# 基于优先级的调度
def _select_preempt_target(self):
    if self.policy == SchedulingPolicy.PRIORITY:
        # 选择优先级最低的请求
        return max(
            self.running,
            key=lambda r: (r.priority, r.arrival_time),
        )

Priority 特点：

高优先级请求优先处理
支持紧急请求插队
可能导致低优先级请求"饥饿"

6. 分块预填充（Chunked Prefill）

对于长输入，vLLM 支持分块预填充，将 prefill 分成多个 chunk 处理。

6.1 工作原理

传统 Prefill（一次性处理）:
step 1: [token_1 ... token_1000] → 处理 1000 个 token

分块 Prefill:
step 1: [token_1 ... token_256] → 处理 256 个 token
step 2: [token_257 ... token_512] → 处理 256 个 token
step 3: [token_513 ... token_768] → 处理 256 个 token
step 4: [token_769 ... token_1000] → 处理 232 个 token

6.2 配置参数

# 长 prefill 分块阈值
threshold = self.scheduler_config.long_prefill_token_threshold

if 0 < threshold < num_new_tokens:
    num_new_tokens = threshold  # 限制每次处理的 token 数

6.3 好处

降低延迟：长 prefill 不会阻塞其他请求
更好的资源利用：允许多个请求交替执行
内存平滑：避免一次性分配大量 KV Cache

gantt
    title 分块 Prefill vs 传统 Prefill
    dateFormat X
    axisFormat %s

    section 传统 Prefill
    请求 A (1000 tokens) :a1, 0, 100
    请求 B (等待)        :a2, 100, 150

    section 分块 Prefill
    请求 A chunk 1       :b1, 0, 25
    请求 B step 1        :b2, 25, 30
    请求 A chunk 2       :b3, 30, 55
    请求 B step 2        :b4, 55, 60
    请求 A chunk 3       :b5, 60, 85
    请求 B step 3        :b6, 85, 90

7. 调度器与其他组件的协作

7.1 完整的调度循环

sequenceDiagram
    participant EC as EngineCore
    participant Sched as Scheduler
    participant KVM as KVCacheManager
    participant Exec as ModelExecutor

    loop 每个 step
        EC->>Sched: schedule()

        rect rgb(230, 245, 230)
            Note over Sched,KVM: 处理 running 请求
            Sched->>KVM: allocate_slots()
            alt 内存不足
                Sched->>Sched: _preempt_request()
                Sched->>KVM: free()
            end
        end

        rect rgb(245, 230, 230)
            Note over Sched,KVM: 处理 waiting 请求
            Sched->>KVM: get_computed_blocks()
            Sched->>KVM: allocate_slots()
        end

        Sched-->>EC: SchedulerOutput

        EC->>Exec: execute_model(SchedulerOutput)
        Exec-->>EC: ModelRunnerOutput

        EC->>Sched: update_from_output()
        Note over Sched: 更新请求状态<br/>检查完成条件
    end

7.2 请求状态更新

def update_from_output(
    self,
    model_runner_output: ModelRunnerOutput,
    ...
) -> EngineCoreOutputs:
    """根据模型输出更新请求状态"""

    for req_id, sampler_output in model_output.items():
        request = self.requests[req_id]

        # 追加输出 token
        request.append_output_token_ids(sampler_output.sampled_token_ids)

        # 检查停止条件
        stopped = check_stop(request, self.max_model_len)

        if stopped:
            # 请求完成
            self._free_request(request)
            self.finished_req_ids.add(req_id)
            outputs.append(...)

8. 配置与调优

8.1 关键配置参数

参数	说明	建议值
`max_num_seqs`	最大并发请求数	根据 GPU 内存调整
`max_num_batched_tokens`	每步最大 token 数	2048-8192
`enable_chunked_prefill`	启用分块预填充	建议开启
`long_prefill_token_threshold`	长 prefill 阈值	256-512
`policy`	调度策略	fcfs 或 priority

8.2 性能调优建议

1. 提高吞吐量

# 增加最大并发数
llm = LLM(model="...", max_num_seqs=512)

# 增加每步 token 数
llm = LLM(model="...", max_num_batched_tokens=4096)

2. 降低延迟

# 启用分块预填充
llm = LLM(
    model="...",
    enable_chunked_prefill=True,
    long_prefill_token_threshold=256,
)

3. 处理高优先级请求

# 使用优先级调度
llm = LLM(model="...", policy="priority")

# 发送请求时设置优先级
llm.generate(prompt, priority=0)  # 高优先级
llm.generate(prompt, priority=10)  # 低优先级

9. 代码位置速查

功能	文件	关键类/函数
Scheduler 主类	`vllm/v1/core/sched/scheduler.py`	`Scheduler`
schedule 方法	`vllm/v1/core/sched/scheduler.py:313`	`schedule()`
抢占逻辑	`vllm/v1/core/sched/scheduler.py:892`	`_preempt_request()`
调度输出	`vllm/v1/core/sched/output.py`	`SchedulerOutput`
请求队列	`vllm/v1/core/sched/request_queue.py`	`create_request_queue()`
调度策略	`vllm/v1/core/sched/request_queue.py`	`SchedulingPolicy`

10. 小结

本章我们深入了解了 vLLM 调度器的工作原理：

双队列管理：waiting 和 running 队列
调度算法：
- 先处理 running 请求
- 再处理 waiting 请求
- 内存不足时执行抢占
抢占机制：释放低优先级请求的资源
调度策略：FCFS 和 Priority
分块预填充：降低长输入的延迟影响
与其他组件协作：KVCacheManager、ModelExecutor

在下一章中，我们将学习连续批处理（Continuous Batching）机制，了解 vLLM 如何实现高效的动态批处理。

5 - 连续批处理机制

连续批处理机制详解

连续批处理（Continuous Batching）是 vLLM 实现高吞吐量的关键技术之一。与传统的静态批处理不同，连续批处理允许请求在推理过程中动态加入和退出，极大地提高了 GPU 利用率。

本章我们将深入了解连续批处理的原理、实现和优势。

1. 静态批处理的问题

1.1 什么是静态批处理

传统的 LLM 推理使用静态批处理：将一组请求打包成一个批次，等待所有请求完成后再处理下一批。

静态批处理示意:

Batch 1:
  请求 A: "你好" → 生成 50 个 token
  请求 B: "Hello, world" → 生成 100 个 token
  请求 C: "如何学习编程" → 生成 200 个 token

所有请求必须等待请求 C 完成才能返回结果

1.2 静态批处理的问题

gantt
    title 静态批处理的浪费
    dateFormat X
    axisFormat %s

    section Batch 1
    请求 A (50 tokens) :active, a1, 0, 50
    请求 A 等待       :done, a2, 50, 200
    请求 B (100 tokens) :active, b1, 0, 100
    请求 B 等待       :done, b2, 100, 200
    请求 C (200 tokens) :active, c1, 0, 200

    section Batch 2 (等待)
    请求 D           :crit, d1, 200, 250
    请求 E           :crit, e1, 200, 280

问题分析：

问题	说明
GPU 空闲	短请求完成后，GPU 资源被浪费
高延迟	所有请求必须等待最长的请求完成
低吞吐量	新请求必须等待当前批次完成
内存浪费	必须为最长序列预分配内存

2. 连续批处理的解决方案

2.1 核心思想

连续批处理的核心思想是：在每个推理步骤（iteration）级别进行调度，而不是在批次级别。

连续批处理示意:

Step 1: [请求 A, 请求 B, 请求 C]
Step 2: [请求 A, 请求 B, 请求 C]
...
Step 50: 请求 A 完成 → 请求 D 加入
Step 51: [请求 D, 请求 B, 请求 C]  // 请求 D 立即开始！
...
Step 100: 请求 B 完成 → 请求 E 加入
Step 101: [请求 D, 请求 E, 请求 C]
...

2.2 优势对比

gantt
    title 连续批处理的高效性
    dateFormat X
    axisFormat %s

    section 请求流
    请求 A (50 tokens) :active, a1, 0, 50
    请求 B (100 tokens) :active, b1, 0, 100
    请求 C (200 tokens) :active, c1, 0, 200
    请求 D (立即开始) :active, d1, 50, 100
    请求 E (立即开始) :active, e1, 100, 180
    请求 F (立即开始) :active, f1, 100, 250

优势分析：

优势	说明
高 GPU 利用率	完成的请求立即被新请求替换
低延迟	请求完成后立即返回，不等待其他请求
高吞吐量	新请求可以立即加入批次
按需分配	配合 PagedAttention，内存按需分配

3. vLLM 中的实现

3.1 迭代级调度（Iteration-Level Scheduling）

vLLM 的调度器在每个 step 都会重新调度：

# vllm/v1/engine/core.py

def step(self) -> EngineCoreOutputs:
    """执行一个推理步骤"""

    # 1. 调度：决定这一步处理哪些请求
    scheduler_output = self.scheduler.schedule()

    # 2. 执行模型
    model_output = self.model_executor.execute_model(scheduler_output)

    # 3. 采样
    sampled_tokens = self.model_executor.sample_tokens(model_output)

    # 4. 更新状态：可能有请求完成或新请求加入
    outputs = self.scheduler.update_from_output(sampled_tokens)

    return outputs

3.2 动态请求管理

sequenceDiagram
    participant User as 用户
    participant Engine as EngineCore
    participant Sched as Scheduler
    participant Exec as ModelExecutor

    Note over Engine: Step N

    User->>Engine: 新请求到达
    Engine->>Sched: add_request()
    Note over Sched: 加入 waiting 队列

    Engine->>Sched: schedule()
    Note over Sched: 检查 running 请求<br/>分配新请求资源

    Sched-->>Engine: SchedulerOutput<br/>(包含新请求)

    Engine->>Exec: execute_model()
    Exec-->>Engine: ModelOutput

    Engine->>Sched: update_from_output()

    alt 某请求完成
        Note over Sched: 从 running 移除<br/>释放 KV Cache
        Sched-->>Engine: 完成的请求结果
        Engine-->>User: 返回结果
    end

    Note over Engine: Step N+1 (循环)

3.3 关键代码路径

# vllm/v1/core/sched/scheduler.py

def schedule(self) -> SchedulerOutput:
    """每个 step 调用一次的调度方法"""

    scheduled_new_reqs = []
    scheduled_running_reqs = []
    token_budget = self.max_num_scheduled_tokens

    # 第一步：处理 running 请求
    for request in self.running:
        if token_budget <= 0:
            break

        # 计算需要的 token 数（通常是 1，decode 阶段）
        num_new_tokens = request.num_tokens - request.num_computed_tokens
        num_new_tokens = min(num_new_tokens, token_budget)

        # 分配 KV Cache
        new_blocks = self.kv_cache_manager.allocate_slots(
            request, num_new_tokens
        )

        if new_blocks is not None:
            scheduled_running_reqs.append(request)
            token_budget -= num_new_tokens

    # 第二步：处理 waiting 请求（动态加入！）
    while self.waiting and token_budget > 0:
        if len(self.running) >= self.max_num_running_reqs:
            break

        request = self.waiting.peek_request()

        # 查找缓存
        cached_blocks, num_cached = self.kv_cache_manager.get_computed_blocks(request)

        # 计算需要的 token 数
        num_new_tokens = request.num_tokens - num_cached
        num_new_tokens = min(num_new_tokens, token_budget)

        # 分配 KV Cache
        new_blocks = self.kv_cache_manager.allocate_slots(
            request, num_new_tokens, new_computed_blocks=cached_blocks
        )

        if new_blocks is None:
            break

        # 成功加入 running 队列
        self.waiting.pop_request()
        self.running.append(request)
        scheduled_new_reqs.append(request)
        token_budget -= num_new_tokens

    return SchedulerOutput(...)

4. Prefill 与 Decode 的混合处理

4.1 两阶段的特性回顾

阶段	计算类型	GPU 利用率	Token 数量
Prefill	计算密集型	高	多（整个 prompt）
Decode	内存密集型	低	少（每次 1 个）

4.2 混合处理示意

Step 1:
  请求 A: Prefill [token_1...token_100]  (100 tokens)
  请求 B: Decode [token_51]              (1 token)
  请求 C: Decode [token_30]              (1 token)

Step 2:
  请求 A: Decode [token_101]             (1 token)
  请求 B: Decode [token_52]              (1 token)
  请求 C: Decode [token_31]              (1 token)
  请求 D: Prefill [token_1...token_50]   (50 tokens，新加入！)

4.3 好处

graph TD
    subgraph 传统方式[传统方式]
        P1[Prefill A] --> D1[Decode A]
        D1 --> P2[Prefill B]
        P2 --> D2[Decode B]
    end

    subgraph 混合处理[混合处理]
        M1[Prefill A + Decode B,C]
        M2[Decode A,B,C + Prefill D]
        M1 --> M2
    end

    Note1[时间长，GPU 利用不均] -.-> 传统方式
    Note2[时间短，GPU 持续高负载] -.-> 混合处理

    style 混合处理 fill:#c8e6c9

5. 分块预填充与连续批处理

5.1 分块预填充（Chunked Prefill）

对于长输入，将 prefill 分成多个 chunk，与其他请求的 decode 交替执行：

请求 A: 长 prompt (1000 tokens)
请求 B, C, D: 正在 decode

Step 1:
  请求 A chunk 1: [token_1...token_256]
  请求 B, C, D: decode

Step 2:
  请求 A chunk 2: [token_257...token_512]
  请求 B, C, D: decode
  请求 E: 新加入

Step 3:
  请求 A chunk 3: [token_513...token_768]
  请求 B, C, D, E: decode

...

5.2 配置分块预填充

llm = LLM(
    model="meta-llama/Llama-2-7b",
    enable_chunked_prefill=True,
    # 每个 chunk 的最大 token 数
    long_prefill_token_threshold=256,
)

5.3 分块预填充的实现

# vllm/v1/core/sched/scheduler.py

def schedule(self):
    # 应用长 prefill 分块限制
    threshold = self.scheduler_config.long_prefill_token_threshold

    if 0 < threshold < num_new_tokens:
        # 限制每次处理的 token 数
        num_new_tokens = threshold

6. 性能对比

6.1 吞吐量对比

graph LR
    subgraph 静态批处理
        S1[Batch 1<br/>100 req/s]
        S2[等待]
        S3[Batch 2<br/>100 req/s]
        S1 --> S2 --> S3
    end

    subgraph 连续批处理
        C1[持续处理<br/>300 req/s]
    end

    style C1 fill:#c8e6c9

6.2 延迟对比

场景	静态批处理	连续批处理
短请求在长请求批次中	高延迟（等待长请求）	低延迟（立即返回）
新请求到达	高延迟（等待当前批次）	低延迟（立即加入）
首 token 延迟	高（等待调度）	低（可立即开始）

6.3 GPU 利用率对比

静态批处理:
GPU: ████░░░░░░████░░░░░░████
     ^处理    ^空闲    ^处理

连续批处理:
GPU: ████████████████████████
     ^持续高效运行

7. 实现细节

7.1 请求完成检测

# vllm/v1/core/sched/utils.py

def check_stop(request: Request, max_model_len: int) -> bool:
    """检查请求是否应该停止"""

    # 检查 EOS token
    if request.sampling_params.eos_token_id is not None:
        if request._all_token_ids[-1] == request.sampling_params.eos_token_id:
            return True

    # 检查最大长度
    if request.num_output_tokens >= request.max_tokens:
        return True

    # 检查模型最大长度
    if len(request._all_token_ids) >= max_model_len:
        return True

    # 检查停止字符串
    if request.sampling_params.stop:
        output_text = request.get_output_text()
        for stop_str in request.sampling_params.stop:
            if stop_str in output_text:
                return True

    return False

7.2 动态资源释放

# vllm/v1/core/sched/scheduler.py

def update_from_output(self, model_output: ModelRunnerOutput) -> EngineCoreOutputs:
    """更新请求状态，处理完成的请求"""

    outputs = []
    finished_reqs = []

    for req_id, sampler_output in model_output.items():
        request = self.requests[req_id]

        # 追加输出 token
        request.append_output_token_ids(sampler_output.sampled_token_ids)

        # 检查停止条件
        if check_stop(request, self.max_model_len):
            finished_reqs.append(request)
            outputs.append(self._create_output(request))

    # 释放完成请求的资源
    for request in finished_reqs:
        self._free_request(request)
        self.running.remove(request)

    return EngineCoreOutputs(outputs=outputs)

7.3 Token 预算管理

def schedule(self):
    token_budget = self.max_num_scheduled_tokens

    # 处理 running 请求
    for request in self.running:
        num_new_tokens = min(
            request.num_tokens - request.num_computed_tokens,
            token_budget
        )
        # 分配资源...
        token_budget -= num_new_tokens

    # 处理 waiting 请求
    while self.waiting and token_budget > 0:
        # 新请求可以使用剩余的预算
        num_new_tokens = min(num_required, token_budget)
        # 分配资源...
        token_budget -= num_new_tokens

8. 最佳实践

8.1 配置建议

llm = LLM(
    model="meta-llama/Llama-2-7b",

    # 连续批处理相关
    max_num_seqs=256,              # 最大并发请求数
    max_num_batched_tokens=4096,   # 每步最大 token 数

    # 分块预填充
    enable_chunked_prefill=True,
    long_prefill_token_threshold=256,

    # 前缀缓存（配合连续批处理）
    enable_prefix_caching=True,
)

8.2 监控指标

指标	说明	健康范围
`running_requests`	运行中的请求数	接近 max_num_seqs
`waiting_requests`	等待中的请求数	越小越好
`gpu_utilization`	GPU 利用率	> 80%
`tokens_per_second`	吞吐量	取决于模型和硬件

9. 与其他技术的协同

9.1 连续批处理 + PagedAttention

连续批处理: 请求可以动态加入/退出
PagedAttention: 内存可以动态分配/释放

结合效果:
- 请求完成 → 立即释放 KV Cache → 新请求立即使用
- 无内存碎片，无预分配浪费

9.2 连续批处理 + 前缀缓存

场景: 多个请求有相同的系统提示

请求 A: [系统提示] + [用户问题 A]
请求 B: [系统提示] + [用户问题 B]

结合效果:
- 请求 A 处理系统提示，缓存 KV
- 请求 B 直接复用缓存，跳过系统提示的计算
- 连续批处理允许请求 B 立即加入并利用缓存

9.3 连续批处理 + 投机解码

结合效果:
- Draft 模型快速生成多个候选 token
- Target 模型验证
- 验证失败的 token 被拒绝，但其他请求不受影响
- 持续高效的批处理执行

10. 代码位置速查

功能	文件	关键函数
主调度循环	`vllm/v1/engine/core.py`	`step()`
调度器	`vllm/v1/core/sched/scheduler.py`	`schedule()`
状态更新	`vllm/v1/core/sched/scheduler.py`	`update_from_output()`
完成检测	`vllm/v1/core/sched/utils.py`	`check_stop()`
请求释放	`vllm/v1/core/sched/scheduler.py`	`_free_request()`

11. 小结

本章我们深入了解了连续批处理机制：

静态批处理的问题：GPU 空闲、高延迟、低吞吐量
连续批处理的解决方案：迭代级调度，动态加入/退出
vLLM 的实现：
- 每个 step 重新调度
- 完成的请求立即释放资源
- 新请求立即加入批次
Prefill 与 Decode 混合：不同阶段的请求交替执行
分块预填充：长输入分成多个 chunk 处理
与其他技术协同：PagedAttention、前缀缓存、投机解码

连续批处理是 vLLM 实现高吞吐量的核心技术之一，结合 PagedAttention 和前缀缓存，使得 vLLM 能够高效地服务大量并发请求。

核心模块详解

1 - PagedAttention 分页注意力

PagedAttention 分页注意力

引言

1. 传统 KV Cache 的问题回顾

1.1 连续内存分配的要求

1.2 显存碎片化图解

1.3 量化浪费

2. PagedAttention 核心思想

2.1 灵感来源：操作系统虚拟内存

2.2 PagedAttention 的类比

2.3 核心改进

3. 关键数据结构详解

3.1 Block（块）

3.2 Block 的存储内容

3.3 Block Table（块表）

3.4 Slot Mapping（槽位映射）

4. 内存管理优势

4.1 减少显存碎片

4.2 按需分配

4.3 支持 Copy-on-Write

4.4 支持前缀缓存

5. PagedAttention 计算流程

5.1 写入 KV Cache

5.2 读取并计算 Attention

5.3 代码实现概览

6. 块的动态管理

6.1 块的生命周期

6.2 块分配流程

6.3 块增长过程

7. CUDA 内核实现

7.1 文件位置

7.2 V1 vs V2 内核

7.3 内核参数

8. 性能对比

8.1 显存效率

8.2 吞吐量提升

8.3 碎片率

9. 本章小结

核心创新

关键数据结构

优势总结

代码位置

思考题

下一步

2 - KV Cache 管理器

KV Cache 管理器详解

1. KVCacheManager 在架构中的位置

2. 核心数据结构

2.1 KVCacheBlocks - 分配结果

2.2 KVCacheManager 类结构

3. 核心方法详解

3.1 get_computed_blocks() - 查找前缀缓存

3.2 allocate_slots() - 分配缓存槽位

3.3 free() - 释放请求资源

4. 前缀缓存机制详解

4.1 Block Hash 计算

4.2 缓存查找过程

4.3 缓存块的生命周期

5. 与调度器的协作

6. 配置与调优

6.1 关键配置参数

6.2 性能优化建议

7. 代码位置速查

8. 小结

导航

3 - Block Pool 内存块池

Block Pool 内存块池详解

1. BlockPool 的作用

2. 核心数据结构

2.1 KVCacheBlock - 块元数据

2.2 FreeKVCacheBlockQueue - 空闲块队列

2.3 BlockHashToBlockMap - 缓存查找表

3. BlockPool 类详解

4. 核心操作详解

4.1 get_new_blocks() - 分配新块

4.2 free_blocks() - 释放块

4.3 touch() - 缓存命中处理

4.4 cache_full_blocks() - 缓存满块

5. 块的生命周期