怎么来的 How it started
我是前端工程师。日常是 React bundle 分析、首屏加载、LCP 优化。不是 CUDA 工程师,写这个项目之前从没写过 GPU kernel。 I'm a frontend performance engineer. My day-to-day is React bundle analysis, first-paint optimization, Core Web Vitals. Not a CUDA engineer — I'd never written a GPU kernel before this project.
去年接了个需求:优化一个 LLM 应用的用户体验。用户的反馈是,发出去之后要等好几秒才能看到第一个字。在 DevTools 里打开 SSE stream,第一个 event 要等两三秒才到。 Last year I picked up a task: improve UX for an LLM application. Users complained they'd wait several seconds after hitting send before the first word appeared. In DevTools, watching the SSE stream, the first event arrived after 2–3 seconds.
照理说去问后端同事就完了,但我没有——我去翻了 vLLM 和 SGLang 的源码。本来只是想理解"为什么慢",结果越读越觉得有意思。KV Cache、Paged Attention、Continuous Batching,这些概念说起来吓人,但每一个拆开看都能跟着代码走通。 I should have just asked a backend colleague. Instead I opened vLLM and SGLang's source code. I only meant to understand "why is it slow," but I kept reading. KV Cache, Paged Attention, Continuous Batching — intimidating names, but each one, broken down, was actually followable.
某一刻脑子里冒出个问题:这些我能理解,为什么不自己试试写一个?不是要造更好的轮子,就是想搞清楚这些东西到底是怎么串起来的——哪些设计是第一次就想到的,哪些是踩坑改出来的。读别人代码永远隔着一层。 Then a question: I understand this stuff — why not try building one? Not to beat existing frameworks. Just to understand how all these pieces actually connect — which design decisions were obvious upfront, which ones came from pain. Reading someone else's code always keeps you one step removed.
推理引擎在做什么 What an inference engine actually does
在写任何代码之前,先把整个流程走了一遍: Before writing any code, I walked through the full pipeline:
Inference Pipeline · Request Lifecycle
Prefill 把整个 prompt 一次送进模型,建立 KV Cache。之后 decode 每步出一个 token,把它放进 KV Cache,再出下一个,直到停止符。 Prefill sends the whole prompt through the model in one shot, building the KV Cache. Then decode generates one token per step, appending each to the KV Cache, until a stop token.
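上面的生命周期可以写成一个极简循环(示意代码,`generate`、`forward` 等命名是我起的;真实引擎里第一次调用是对整个 prompt 的 prefill,之后每次调用是一步 decode): The lifecycle above as a minimal loop (illustrative sketch; names like `generate` and `forward` are mine; in a real engine the first call is the prefill pass over the whole prompt, each later call is one decode step):

```rust
// `forward` is a stand-in for the model: the first call sees the whole
// prompt and fills the KV cache; each later call appends one KV entry.
fn generate(prompt: &[u32], eos: u32, max_new: usize, forward: &dyn Fn(&[u32]) -> u32) -> Vec<u32> {
    let mut seq = prompt.to_vec();
    for _ in 0..max_new {
        let next = forward(&seq); // prefill on the first iteration, decode after
        seq.push(next);           // the token (and its KV) enters the cache
        if next == eos { break; } // stop token ends the request
    }
    seq
}

fn main() {
    // Toy "model": the next token counts up from the last one; 5 acts as EOS.
    let forward = |seq: &[u32]| seq.last().unwrap() + 1;
    let out = generate(&[1, 2], 5, 16, &forward);
    assert_eq!(out, vec![1, 2, 3, 4, 5]);
    println!("{out:?}");
}
```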
读论文时感受不到、自己做才体会到的一件事:decode 天然是 memory bandwidth bound。每生成一个 token,模型要把全部权重从 GPU 显存读一遍。Qwen3-4B 的权重约 8GB,A100 显存带宽 1.5 TB/s,理论上每秒能读约 187 次,对应单请求约 187 tok/s 的上限。 Something you don't feel from papers but understand immediately when building: decode is inherently memory bandwidth bound. Each token requires reading all model weights from GPU memory. Qwen3-4B weights ≈ 8GB, A100 bandwidth ≈ 1.5 TB/s, so roughly 187 reads/sec — about 187 tok/s ceiling for a single request.
这解释了为什么 batching 这么关键——8 个请求一起 decode,权重只读一遍,8 个人均摊读取成本,吞吐量接近线性增长。这条逻辑是后来所有优化的出发点。 This explains why batching matters so much — 8 requests decode together, weights read once, 8 people share the cost, throughput scales nearly linearly. This logic underpins every optimization that followed.
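这笔账可以直接写出来(数字来自正文,函数名是我为演示起的): The back-of-envelope math above, written out (numbers from the text; function names are mine):

```rust
// Decode is bandwidth bound: every step streams all weights from HBM,
// so tokens/sec <= bandwidth / weight size.
fn decode_ceiling_tok_s(weight_gb: f64, bandwidth_tb_s: f64) -> f64 {
    bandwidth_tb_s * 1000.0 / weight_gb // (GB/s) / (GB per token)
}

// Batched decode reads the weights once for the whole batch, so the
// aggregate ceiling scales with batch size (until KV reads and compute
// become the new bottleneck).
fn batched_ceiling_tok_s(weight_gb: f64, bandwidth_tb_s: f64, batch: usize) -> f64 {
    decode_ceiling_tok_s(weight_gb, bandwidth_tb_s) * batch as f64
}

fn main() {
    // Qwen3-4B (~8GB of weights) on an A100 (~1.5 TB/s)
    println!("single:  {:.1} tok/s", decode_ceiling_tok_s(8.0, 1.5));
    println!("batch 8: {:.1} tok/s", batched_ceiling_tok_s(8.0, 1.5, 8));
}
```

注意线性只在权重读取仍是瓶颈时成立。 Note the near-linear scaling only holds while weight reads dominate.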
KV Cache 怎么管 Managing the KV Cache
最简单的方案:每个请求一个独立 buffer,prefill 时分配,结束时释放。这个方案有两个问题。 The simplest approach: one buffer per request, allocate on prefill, free when done. Two problems with this.
第一,内存浪费。如果允许 8 个并发请求、每个最长 4096 token,就得提前留好 8 × 4096 的完整空间,哪怕某个请求实际只用了 100 个 token 也照占。第二,无法共享前缀。多轮对话里 system prompt 每次都一样,但 per-request 的独立 buffer 没法复用已算好的 KV。 First, memory waste. With 8 concurrent requests at 4096 tokens max, you pre-allocate 8 × 4096 whether a request uses 100 or 4000 tokens. Second, no prefix sharing. Multi-turn conversations repeat the same system prompt every call, but isolated buffers can't reuse already-computed KV.
解法是 Paged Attention:把 KV Cache 切成固定大小的 page,按需分配,不同请求可以共享相同的 page。照着 vLLM 的论文实现了一版,page size = 1(每个 token 一个 slot)。 Solution: Paged Attention — split KV Cache into fixed-size pages, allocate on demand, share pages across requests. I followed the vLLM paper and implemented with page size = 1 (one slot per token).
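page size = 1 的按需分配大致是这个形状(纯示意,命名是我起的;真实实现还要为共享 page 加引用计数): The on-demand allocation with page size = 1 looks roughly like this (illustrative sketch, names are mine; a real implementation also ref-counts shared pages):

```rust
struct KvPool {
    free_slots: Vec<usize>, // free physical slot ids in the big KV tensor
}

struct Request {
    page_table: Vec<usize>, // logical token position -> physical slot
}

impl KvPool {
    fn new(total_slots: usize) -> Self {
        KvPool { free_slots: (0..total_slots).rev().collect() }
    }

    // Slots are handed out on demand as a sequence grows; nothing is
    // reserved up front, so a 100-token request holds 100 slots, not 4096.
    fn alloc(&mut self, req: &mut Request, n: usize) -> Result<(), &'static str> {
        for _ in 0..n {
            let slot = self.free_slots.pop().ok_or("KV pool exhausted")?;
            req.page_table.push(slot);
        }
        Ok(())
    }

    // On completion every slot returns to the pool for reuse.
    fn release(&mut self, req: &mut Request) {
        self.free_slots.append(&mut req.page_table);
    }
}

fn main() {
    let mut pool = KvPool::new(16);
    let mut req = Request { page_table: Vec::new() };
    pool.alloc(&mut req, 5).unwrap(); // prefill: 5 prompt tokens
    pool.alloc(&mut req, 1).unwrap(); // decode: one new slot per step
    assert_eq!(req.page_table.len(), 6);
    pool.release(&mut req);
    assert_eq!(pool.free_slots.len(), 16);
    println!("ok");
}
```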
在这个基础上加了 radix tree prefix cache:system prompt 命中缓存时直接复用,完全跳过 prefill。实测 agent 工作负载下 KV 命中率可以到 100%——system prompt + tool definitions 通常占整个 context 的 30–50%,命中就相当于免费跳过了这部分。 On top of that, a radix tree prefix cache: when a system prompt is cached, reuse it and skip prefill entirely. On agent workloads, KV hit rate reaches 100% — system prompt + tool definitions often account for 30–50% of context, so hitting the cache skips all of that computation.
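前缀匹配的核心逻辑可以用一棵简化的 token trie 演示(radix tree 只是把单链压缩了,查找思路相同;示意代码,命名是我起的): The prefix-matching idea, shown with a simplified token trie (a radix tree just collapses single-child chains; same lookup idea; sketch with names of my own):

```rust
use std::collections::HashMap;

// In the real engine, each matched token maps to KV slots that prefill
// can then skip entirely.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
}

#[derive(Default)]
struct PrefixCache {
    root: TrieNode,
}

impl PrefixCache {
    // Record a processed sequence's tokens.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    // How many leading tokens are already cached (KV reusable for them).
    fn match_prefix(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut n = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(next) => { node = next; n += 1; }
                None => break,
            }
        }
        n
    }
}

fn main() {
    let mut cache = PrefixCache::default();
    cache.insert(&[1, 2, 3, 4]); // tokens of system prompt + tool definitions
    // The next turn repeats the system prompt and appends user tokens.
    let hit = cache.match_prefix(&[1, 2, 3, 4, 9, 9]);
    assert_eq!(hit, 4); // prefill only needs to process the last 2 tokens
    println!("cached prefix: {hit} tokens");
}
```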
多个请求怎么一起跑 Running multiple requests together
第一版是最朴素的串行:一个请求跑完再跑下一个。GPU 大部分时间在空转——一个请求在 decode,其他请求全在等 CPU 处理返回结果、解析下一个 prompt。 First version was naive serial: one request at a time. GPU sat idle most of the time — while one request decoded, others waited for CPU to handle the response, parse the next prompt, and enqueue it.
解法是 Continuous Batching:多个请求的 decode 步骤打成一个 batch,一次 forward 处理所有人。新请求可以随时插入,不用等前一个结束。 Solution: Continuous Batching — pack multiple requests' decode steps into one batch, one forward pass for everyone. New requests can join at any point, without waiting for others to finish.
调度策略:decode 优先。正在 decode 的请求永远先跑,新请求的 prefill 等 decode 完再插入。原因是 decode 的 KV Cache 已经在 GPU 上了,中断代价很高;新请求等一轮几乎没有影响。 Scheduling policy: decode first. Active decode requests always run before new prefills. Reason: decode's KV Cache is already on GPU — interrupting it is expensive. New requests waiting one round barely matters.
还加了 Chunked Prefill:长 prompt 切成 64-token 小块,每块之间给 decode 请求插队。这样一个超长新请求不会把所有人的延迟拉高几秒。 Also added Chunked Prefill: long prompts split into 64-token chunks, with decode requests interleaved between chunks. A very long new request doesn't add seconds of latency for everyone else.
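decode 优先 + chunked prefill 的调度循环大致如下(示意实现,命名是我起的): The decode-first policy with chunked prefill, as a sketch (illustrative, names mine):

```rust
use std::collections::VecDeque;

// Each step: every active decode runs first, then at most one 64-token
// prefill chunk is admitted.
const CHUNK: usize = 64;

struct Prefill { remaining: usize }

struct Scheduler {
    decoding: usize,             // requests currently in decode phase
    waiting: VecDeque<Prefill>,  // requests still waiting on prefill
}

impl Scheduler {
    // Returns (decode batch size, prefill tokens admitted) for this step.
    fn step(&mut self) -> (usize, usize) {
        let decode_batch = self.decoding; // decode always runs first
        let mut prefill_tokens = 0;
        if let Some(p) = self.waiting.front_mut() {
            let take = p.remaining.min(CHUNK);
            p.remaining -= take;
            prefill_tokens = take;
            if p.remaining == 0 {
                self.waiting.pop_front();
                self.decoding += 1; // prefill finished: joins the decode batch
            }
        }
        (decode_batch, prefill_tokens)
    }
}

fn main() {
    // 7 requests decoding; a new 150-token prompt arrives.
    let mut s = Scheduler { decoding: 7, waiting: VecDeque::from([Prefill { remaining: 150 }]) };
    assert_eq!(s.step(), (7, 64)); // decodes run; first chunk interleaved
    assert_eq!(s.step(), (7, 64));
    assert_eq!(s.step(), (7, 22)); // final chunk; request joins decode next step
    assert_eq!(s.step(), (8, 0));
    println!("ok");
}
```

长 prompt 被摊进多个 step,其他请求的 decode 不会被一次 prefill 卡住几秒。 The long prompt is amortized over several steps, so existing decodes never stall behind one big prefill.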
第一次跑:128 tok/s First benchmark: 128 tok/s
以上都实现完,跑了第一个基准测试:8 并发,Qwen3-4B,A100-40GB。 With everything above implemented, I ran the first benchmark: 8 concurrent requests, Qwen3-4B, A100-40GB.
结果:128 tok/s,同配置下 SGLang 是 886。这不是"差一点",是根本性的问题。打开 profiler,马上找到了。 The result: 128 tok/s, against SGLang's 886 in the same configuration. This wasn't "a little behind"; something was fundamentally wrong. I opened the profiler and found it immediately.
```rust
// 每个请求单独跑 attention——8 并发 × 36 层 = 288 次独立 kernel launch
// Each request runs attention separately — 8 concurrent × 36 layers = 288 separate kernel launches
for req in &active_requests {
    attention_single_request(q, &req.kv_cache, output)?;
}
```
Attention 是一个 for 循环。8 个并发请求,36 层,= 288 次独立 kernel launch,每次 launch 之间都有 CPU→GPU dispatch overhead 和 memcpy。其他所有 linear 层(GEMM)早就是 batched 的——只有 attention 还在逐请求串行。 Attention was a for loop. 8 concurrent requests, 36 layers, = 288 separate kernel launches, each with CPU→GPU dispatch overhead and memcpy. Every other linear layer (GEMM) was already batched — only attention ran sequentially per-request.
SGLang 用 FlashInfer 的 BatchDecodeWithPagedKVCacheDispatched,8 个请求的 attention 一个 kernel 搞定。这就是 7 倍差距的来源。
SGLang uses FlashInfer's BatchDecodeWithPagedKVCacheDispatched — all 8 requests' attention in a single kernel launch. That's the entire 7× gap.
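修法的核心是把逐请求的 page table 拼成一份扁平的 batch 元数据,让一次 kernel 调用覆盖所有请求。下面是元数据构建的示意(CSR 布局与 FlashInfer 的 kv_indptr/kv_indices 约定同形,但代码本身是示意,不是真实 FFI): The core of the fix: concatenate per-request page tables into flat batch metadata that a single kernel launch consumes. A sketch of the metadata build (the CSR layout mirrors FlashInfer's kv_indptr/kv_indices convention, but this code is illustrative, not real FFI):

```rust
struct BatchMeta {
    kv_indptr: Vec<i32>,  // request i owns kv_indices[indptr[i]..indptr[i+1]]
    kv_indices: Vec<i32>, // physical KV slot ids of all requests, concatenated
}

fn build_batch_meta(page_tables: &[Vec<i32>]) -> BatchMeta {
    let mut kv_indptr = vec![0i32];
    let mut kv_indices = Vec::new();
    for pt in page_tables {
        kv_indices.extend_from_slice(pt);
        kv_indptr.push(kv_indices.len() as i32);
    }
    BatchMeta { kv_indptr, kv_indices }
}

fn main() {
    // 3 requests with sequence lengths 2, 3, 1
    let meta = build_batch_meta(&[vec![10, 11], vec![4, 5, 6], vec![7]]);
    assert_eq!(meta.kv_indptr, vec![0, 2, 5, 6]);
    assert_eq!(meta.kv_indices, vec![10, 11, 4, 5, 6, 7]);
    // One kernel launch walks this metadata instead of 288 launches per step.
    println!("ok");
}
```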
128 → 811:六轮修 128 → 811: six rounds of fixes
知道问题在哪,开始修。下面是完整优化历程(8 并发,Qwen3-4B,A100-40GB,SGLang v0.5.9 参照 886 tok/s): With the problem identified, I started fixing. Here's the complete optimization journey (8-concurrent, Qwen3-4B, A100-40GB, SGLang v0.5.9 reference: 886 tok/s):
8-Concurrent Throughput · Qwen3-4B · A100-40GB
所有数据:Qwen3-4B · A100-SXM4-40GB · 8 并发 · greedy decode All measurements: Qwen3-4B · A100-SXM4-40GB · 8-concurrent · greedy decode
Phase 1:把 attention 改成 batched(128 → 434,+239%) Phase 1: batched attention (128 → 434, +239%)
把 per-request KV buffer 换成 shared token-level KV pool,接上 FlashInfer paged batch decode——一次 kernel 处理所有请求的 attention。 Replaced per-request KV buffers with a shared token-level KV pool, then wired up FlashInfer's paged batch decode — one kernel handles all requests' attention.
这一步踩了三个 bug,叠在一起 debug 花了不少时间: This step hit three bugs stacked on top of each other:
Bug 1:MAX_SEQ 硬编码 4096,但运行时 max_seq=1024,OOB 写内存,无报错,数据静默损坏。 Bug 1: Attention kernel had MAX_SEQ hardcoded to 4096, but runtime used max_seq=1024. OOB writes, no error, silent data corruption.
Bug 2:FlashInfer 的 plan_info 在 GPU 上分配,但代码用 CPU memcpy 读写——segfault,host/device 指针搞混。 Bug 2: FlashInfer's plan_info was allocated on GPU but accessed via CPU memcpy — segfault. Host/device pointer mix-up.
Bug 3:Scheduler 和 model 两处都调用了 alloc_tokens,token 分配了两次,metadata 全乱。 Bug 3: Both the scheduler and the model called alloc_tokens, so tokens were allocated twice and metadata was corrupted everywhere.
三个修完:128 → 434 tok/s。 All three fixed: 128 → 434 tok/s.
Phase 2:停止每步都分配 GPU buffer(434 → 681,+57%) Phase 2: stop allocating GPU buffers every step (434 → 681, +57%)
Profiler 显示,每个 decode step 都在分配约 10 个 GPU tensor + 128MB FlashInfer workspace,耗时 4.5ms,占整个 step(14ms)的 32%。GPU buffer 分配通过 cuMemAllocAsync 实现,每次约 0.5ms。
Profiler showed each decode step allocating ~10 GPU tensors + 128MB FlashInfer workspace, costing 4.5ms — 32% of total step time (14ms). GPU buffer allocation via cuMemAllocAsync: ~0.5ms per call.
改成第一次用时分配一次,之后一直复用。434 → 681 tok/s。 Changed to allocate once on first use, reuse thereafter. 434 → 681 tok/s.
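修法是惰性初始化 + 复用,形状大致如下(示意:`GpuBuf` 是真实显存分配的占位,`alloc_count` 只为演示行为): The fix is lazy-init plus reuse, roughly this shape (illustrative: `GpuBuf` stands in for a real device allocation, `alloc_count` exists only to show the behavior):

```rust
struct GpuBuf { bytes: usize }

struct DecodeBufs {
    workspace: Option<GpuBuf>, // e.g. the 128MB FlashInfer workspace
    alloc_count: usize,
}

impl DecodeBufs {
    // Allocate on first use, hand back the cached buffer afterwards.
    fn workspace(&mut self, bytes: usize) -> &GpuBuf {
        if self.workspace.is_none() {
            self.alloc_count += 1; // the allocator is hit exactly once
            self.workspace = Some(GpuBuf { bytes });
        }
        self.workspace.as_ref().unwrap()
    }
}

fn main() {
    let mut bufs = DecodeBufs { workspace: None, alloc_count: 0 };
    for _step in 0..1000 {
        // Every decode step asks for the workspace...
        let ws = bufs.workspace(128 << 20);
        assert_eq!(ws.bytes, 128 << 20);
    }
    // ...but only the first step paid the allocation cost.
    assert_eq!(bufs.alloc_count, 1);
    println!("ok");
}
```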
Phase 3–4:两个小优化(681 → 700) Phase 3–4: two smaller fixes (681 → 700)
FlashInfer batched decode 分 plan(CPU 调度计算)和 run(GPU kernel)两步。最初在 36 层循环里每层都调 plan——但同一个 step 内 KV layout 没变,plan 只需调一次。提到层循环外面,+1.3%。Embedding 输出和 logits buffer 也同样预分配,收益很小(约 40KB buffer),但值得做。总计 681 → 700。 FlashInfer batch decode has two phases: plan (CPU scheduling) and run (GPU kernel). I originally called plan every layer in the 36-layer loop — but KV layout doesn't change within a step. Moved plan outside the loop: +1.3%. Also pre-allocated embedding and logits buffers (tiny ~40KB, small gain). Together: 681 → 700.
Phase 5–6:CUDA Graph(700 → 756,+8%) Phase 5–6: CUDA Graph (700 → 756, +8%)
Decode 的 36 层,每层约 14 个 kernel,总共约 504 次 kernel launch。CUDA Graph 的思路:把层循环录制成一个 graph,之后每步只需 replay,kernel launch overhead 降到接近零。 The 36-layer decode loop has ~14 kernels per layer — ~504 kernel launches per step. CUDA Graph: record the layer loop as a graph, then replay each step. Kernel launch overhead approaches zero.
这里有个坑:FlashInfer 的 plan() 内部用了 CPU memcpy——graph replay 时这段 CPU 代码不重新执行,plan 信息被"烤进"了 graph。这只在 batch size 固定、KV layout 稳定时才正确。
One catch: FlashInfer's plan() uses CPU memcpy internally — during graph replay, that CPU code doesn't re-execute. Plan info gets baked into the graph. This is only correct when batch size is fixed and KV layout is stable.
解法:每个 batch size 各缓存一个 graph(HashMap<usize, CudaGraph>),batch size 变了就重新捕获,或回退到非 graph 路径。700 → 756 tok/s。
Solution: cache one graph per batch size (HashMap<usize, CudaGraph>). When batch size changes, recapture or fall back to non-graph path. 700 → 756 tok/s.
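per-batch-size 缓存的逻辑可以这样示意(`Graph` 代表一张已捕获的 CUDA graph:capture 昂贵、replay 便宜,且 graph 只对录制时的 batch size 有效): The per-batch-size cache, sketched (`Graph` stands in for a captured CUDA graph: capture is expensive, replay is cheap, and a graph is only valid for the exact batch size it was recorded at):

```rust
use std::collections::HashMap;

struct Graph { batch: usize }

#[derive(Default)]
struct GraphCache {
    graphs: HashMap<usize, Graph>,
    captures: usize, // how many times we actually captured
}

impl GraphCache {
    // Replay the graph for this batch size, capturing it first if needed.
    fn run_step(&mut self, batch: usize) -> usize {
        if !self.graphs.contains_key(&batch) {
            self.captures += 1; // capture once per distinct batch size
            self.graphs.insert(batch, Graph { batch });
        }
        self.graphs[&batch].batch // replay happens here in the real engine
    }
}

fn main() {
    let mut cache = GraphCache::default();
    for _ in 0..100 {
        cache.run_step(8); // steady state: pure replay
    }
    cache.run_step(7); // a request finished: new batch size, one recapture
    cache.run_step(8); // back to 8: the cached graph replays again
    assert_eq!(cache.captures, 2);
    println!("ok");
}
```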
后续小步(756 → 811) Smaller steps (756 → 811)
Decode-priority chunked prefill(64-token 小块)、zero-alloc logit extraction(D2D 直写 decode_bufs)、batched greedy argmax、skip D2D scatter for greedy path——一路小步走到 811 tok/s,SGLang 的 91.5%。每步收益 0.5%–4%,加起来是 55 tok/s。 Decode-priority chunked prefill (64-token chunks), zero-alloc logit extraction (D2D direct write), batched greedy argmax, skip D2D scatter for greedy path — small steps to 811 tok/s, 91.5% of SGLang. Each step: 0.5%–4%. Together: +55 tok/s.
和 SGLang 的差距在哪 Where the remaining gap comes from
811 vs 886,差 8.5%。用 nsys 对比了两边的 trace(Qwen3-4B,A100-40GB,batch=8): 811 vs 886, 8.5% gap. Compared nsys traces side by side (Qwen3-4B, A100-40GB, batch=8):
| 指标 Metric | agent-infer | SGLang v0.5.9 | 说明 Note |
|---|---|---|---|
| CUDA Graph 执行时间 CUDA Graph exec | 8.56ms | 8.18ms | GPU 计算差距仅 0.38ms GPU compute gap only 0.38ms |
| Memcpy 总耗时 Total memcpy | 2.7ms | 20.7ms | 我们快 7.7×(Rust 无 Python overhead) 7.7× faster (Rust, no Python overhead) |
| RMSNorm | 4.4μs | 1.3μs | SGLang 用 FusedAddRMSNorm SGLang uses FusedAddRMSNorm |
| argmax | 22.6μs | 13.2μs | SGLang warp reduction 更优 SGLang has better warp reduction |
| 同步方式 Sync method | cuStreamSync | cudaEventSync | SGLang 粒度更细 SGLang more fine-grained |
| TTFT (C=1) | 17.9ms | 40.5ms | 我们快 2.3× 2.3× faster |
GPU 计算本身只差 0.38ms。大部分差距在 CPU 调度粒度(cuStreamSync vs cudaEventSync)和 kernel 质量(SGLang 的 FusedAddRMSNorm 把 residual add 和 norm 合成一个 kernel,省了一次全局内存读写,比我们快 3.4 倍)。 GPU compute itself differs by only 0.38ms. Most of the gap is CPU synchronization granularity (cuStreamSync vs cudaEventSync) and kernel quality — SGLang's FusedAddRMSNorm fuses residual add + norm into one kernel, saves one global memory roundtrip, and runs 3.4× faster than ours.
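fusion 省掉的那次往返可以用标量代码直观看到(CPU 标量示意,只为说明数据流;GPU 上省的是中间结果的 global memory 读写): The roundtrip that fusion saves, made visible in scalar code (CPU-side sketch to show the dataflow only; on a GPU the saving is the intermediate's global-memory write and re-read):

```rust
// RMSNorm over a vector: x / sqrt(mean(x^2) + eps).
fn rms_norm(h: &[f32], eps: f32) -> Vec<f32> {
    let ms = h.iter().map(|v| v * v).sum::<f32>() / h.len() as f32;
    let scale = 1.0 / (ms + eps).sqrt();
    h.iter().map(|v| v * scale).collect()
}

// Unfused: the residual add materializes `h` (on a GPU, a full global
// memory write plus re-read), then norm consumes it.
fn add_then_norm(x: &[f32], r: &[f32], eps: f32) -> Vec<f32> {
    let h: Vec<f32> = x.iter().zip(r).map(|(a, b)| a + b).collect();
    rms_norm(&h, eps)
}

// Fused: the sum of squares is accumulated while adding, and the output
// is written directly; no intermediate buffer exists.
fn fused_add_rms_norm(x: &[f32], r: &[f32], eps: f32) -> Vec<f32> {
    let ms = x.iter().zip(r).map(|(a, b)| (a + b) * (a + b)).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (ms + eps).sqrt();
    x.iter().zip(r).map(|(a, b)| (a + b) * scale).collect()
}

fn main() {
    let (x, r) = ([1.0f32, 2.0, 3.0], [0.5f32, -0.5, 0.0]);
    let a = add_then_norm(&x, &r, 1e-6);
    let b = fused_add_rms_norm(&x, &r, 1e-6);
    for (u, v) in a.iter().zip(&b) {
        assert!((u - v).abs() < 1e-6); // same result, one less memory pass
    }
    println!("ok");
}
```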
有趣的一点:我们的 memcpy 反而比 SGLang 快 7.7 倍(2.7ms vs 20.7ms)。这是 Rust 控制平面的直接收益——没有 Python dispatch overhead,CPU→GPU 路径更短。TTFT 也受益于此:单请求 17.9ms vs 40.5ms,快 2.3 倍。 Interesting reversal: our memcpy is 7.7× faster than SGLang's (2.7ms vs 20.7ms). Direct benefit of the Rust control plane — no Python dispatch overhead, shorter CPU→GPU path. Same reason TTFT leads: 17.9ms vs 40.5ms at single-request concurrency, 2.3× faster.
Vibe Coding 的真相 The truth about vibe coding
这个引擎从第一行 Rust struct 到最后一个 CUDA kernel,全程是 Vibe Coding 完成的——自然语言描述需求,AI 生成代码,我审查逻辑、调整方向,发现问题再迭代。 This engine — from the first Rust struct to the last CUDA kernel — was built entirely through vibe coding. Describe what I needed in natural language, AI generates the implementation, I review logic and steer direction, iterate when something's wrong.
如果没有这种工作方式,一个从没写过 GPU kernel 的前端工程师不可能在两天内做出这件事。这不是谦虚,是真的。Rust + CUDA FFI binding、FlashInfer 的 C++ API、CUDA Graph 的捕获逻辑——这些我自己从头写不了,至少两天内不可能。 Without this way of working, a frontend engineer who'd never written a GPU kernel couldn't have built this in two days. That's not false modesty — it's true. Rust + CUDA FFI bindings, FlashInfer's C++ API, CUDA Graph capture logic — I couldn't have written these from scratch, certainly not in two days.
AI 解决了"怎么写"的问题,"写什么"和"哪里不对"还是得自己想。 AI solved "how to write it." What to write, and where it's wrong — those are still human jobs.
还没完 Not done yet
这不是成熟项目。Llama、DeepSeek、量化内核都还没有。p99 ITL 有毛刺,prefill/decode 调度还有改进空间(SGLang 的 overlap scheduling 能把 CPU 和 GPU 工作流水线化,估计能再找回约 0.5ms/step)。 This isn't production-ready. Llama, DeepSeek, quantization kernels: all missing. p99 ITL has spikes, and prefill/decode scheduling still has room to improve: SGLang's overlap scheduling pipelines CPU and GPU work in parallel, worth roughly 0.5ms/step that's still on the table.
但学到了想学的东西。写一遍才真正理解为什么 KV Cache 要按 block 对齐,为什么 CUDA Graph 只适合固定 batch size,为什么 prefill 和 decode 要分开调度。这些读论文感受不到,自己踩一遍才会记住。 But I learned what I set out to learn. You only really understand why KV Cache needs block alignment, why CUDA Graphs only work with fixed batch sizes, why prefill and decode scheduling must be separate — by building it yourself and getting it wrong first. Papers tell you the what; building it shows you the why.