It started with a TTFT number that felt wrong
I'll be honest: this whole thing started with an afternoon spent staring at vLLM's latency numbers and feeling that something was just wrong. I needed to run Agents on an all-in-one machine with fixed hardware and a private network, where TTFT directly sets the ceiling on user experience. vLLM and SGLang are excellent, but they're built for cloud multi-tenancy. On edge hardware they're too heavy, their scheduling too coarse, and their first-token latency was simply unacceptable.
But honestly, "the project needed it" is only half the story. The other half: I wanted to actually understand how inference works. Not framework-user understand, but write-it-yourself understand. What happens behind every token? Where does the memory come from, and where does it go? Why is attention slow, and at which step exactly? These are black boxes when you're using a framework, and there's only one way to see inside: write it yourself.
And there was a very concrete pressure: in an Agent workflow, the model gets called a dozen times in sequence. At 50ms TTFT per call, perceived latency compounds into seconds, and the system feels sluggish. I needed it under 10ms. Not as an ambitious goal; as the threshold for basic usability.
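The arithmetic behind that threshold is worth making explicit. A back-of-envelope model of how TTFT stacks up across a sequential call chain (the 12-call chain length is an illustrative assumption, not a measurement):

```python
# Back-of-envelope model of perceived latency in a sequential Agent chain.
# The chain length is an illustrative assumption, not a measurement.

def perceived_latency_ms(num_calls: int, ttft_ms: float) -> float:
    """Total stacked time-to-first-token across a sequential call chain."""
    return num_calls * ttft_ms

calls = 12  # a plausible Agent task chain (assumption)
slow = perceived_latency_ms(calls, ttft_ms=50.0)  # 600 ms of pure waiting
fast = perceived_latency_ms(calls, ttft_ms=10.0)  # 120 ms
print(f"50ms TTFT: {slow:.0f} ms of stacked first-token waits")
print(f"10ms TTFT: {fast:.0f} ms")
```

Decode throughput matters too, of course, but TTFT is the term that multiplies with every extra reasoning step.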
Two days of thinking before a single line of code
Before writing the first line of code, I did something very counterintuitive for an engineer: nothing. I thought for two days instead. Not because I'm particularly disciplined, but because I'd been burned before. Some design decisions, once wrong, don't call for refactoring; they call for a full rewrite.
At the whiteboard I asked myself: if this system ships and I then discover the design was wrong, which three places would hurt the most? The answer came fast: memory management, the boundary between the scheduler and the model forward pass, and the hardware backend abstraction. Lock those three down first; everything else is a detail that can change later.
System Architecture · Request Lifecycle
The scheduler/model-forward boundary turned out to be the most important decision. The scheduler only composes batches and knows nothing about how attention is computed; the model forward only computes and knows nothing about why these sequences were grouped together. That isolation is what later let me swap the attention backend from a hand-written kernel to FlashInfer with almost no changes to the scheduler.
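A minimal sketch of that boundary, in Python with hypothetical names (the real engine is Rust; `BatchDesc`, `Scheduler`, and `model_forward` are stand-ins, not the actual API):

```python
# Sketch of the scheduler / model-forward boundary (hypothetical names).
# The scheduler only emits a flat batch description; the forward pass only
# consumes it. Neither side sees the other's internals.
from dataclasses import dataclass

@dataclass(frozen=True)
class BatchDesc:
    """Everything the forward pass needs, and nothing else."""
    token_ids: list   # flattened tokens for this step
    seq_lens: list    # per-sequence lengths (for attention layout)

class Scheduler:
    def __init__(self):
        self.running = []

    def build_batch(self) -> BatchDesc:
        # Batch composition policy lives here; attention math does not.
        tokens, lens = [], []
        for seq in self.running:
            tokens.append(seq["next_token"])
            lens.append(seq["len"])
        return BatchDesc(token_ids=tokens, seq_lens=lens)

def model_forward(batch: BatchDesc) -> list:
    # The forward pass never asks *why* these sequences were grouped.
    # Stand-in compute: emit one token id per sequence.
    return [t + 1 for t in batch.token_ids]

sched = Scheduler()
sched.running = [{"next_token": 5, "len": 3}, {"next_token": 9, "len": 7}]
out = model_forward(sched.build_batch())
# Swapping the attention backend only ever touches model_forward.
```

The payoff is exactly the one described above: `BatchDesc` is the entire contract, so either side can be rewritten without touching the other.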
Three-tier KV cache, and an unexpected gift
The KV cache design is the part I'm most proud of in hindsight. Not because it's complex, but because of the unexpected gift it brought along.
KV Cache · Three-Tier Storage Hierarchy
Token Pool layout: [max_tokens, kv_dim] with page_size = 1; FlashInfer does paged attention directly on it. Then the unexpected gift arrived: prefix caching fell out naturally. As long as the block hashes match, the system prompt's KV can be shared across requests with no recomputation.
For Agent workloads this is almost free performance. An Agent's system prompt is typically long and nearly identical across calls, so the cache hit rate approaches 100%.
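A minimal sketch of how block-hash prefix sharing works (the block size, hash scheme, and class names are illustrative assumptions, not this engine's actual code):

```python
# Sketch of hash-chained prefix caching. Block size, hashing scheme,
# and names are illustrative assumptions.
import hashlib

BLOCK = 16  # tokens per hashed block (assumption)

def block_hashes(tokens):
    """Chained hashes: block i's hash commits to every token before it,
    so a hit on block i implies the entire prefix matches."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(str(tokens[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> KV block id in the token pool

    def match(self, tokens):
        """Return how many prompt tokens already have cached KV."""
        hit = 0
        for i, hsh in enumerate(block_hashes(tokens)):
            if hsh not in self.blocks:
                break
            hit = (i + 1) * BLOCK
        return hit

    def insert(self, tokens, first_block_id=0):
        for i, hsh in enumerate(block_hashes(tokens)):
            self.blocks.setdefault(hsh, first_block_id + i)

cache = PrefixCache()
system_prompt = list(range(64))          # same system prompt every call
cache.insert(system_prompt)
req = system_prompt + [999, 1000, 1001]  # new user turn appended
reused = cache.match(req)                # 64 tokens of KV reused, 3 computed
```

The chained hash is what makes sharing safe: matching block i's hash guarantees every earlier token also matched, so a partial prefix can never be silently reused against the wrong history.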
// Token pool layout: [max_tokens, kv_dim] with page_size = 1
// (kv_dim = num_heads x head_dim; one slot per token).
fn decode_step(batch: &[SeqId], queries: &Tensor, kv_pool: &TokenPool) -> Tensor {
    // Build indptr + indices for FlashInfer (O(batch) CPU work).
    let (indptr, indices) = kv_pool.build_paged_layout(batch);
    // Single kernel launch covers all seqs x all heads; no scatter/gather,
    // FlashInfer handles the indirection itself.
    flashinfer::batch_decode(queries, &indptr, &indices, &kv_pool.data)
}
Scheduler and attention: two roads with detours
Scheduler: from the lazy version to one that actually works
The first version of the scheduler, and I'm still a little embarrassed about this, was just a FIFO queue: if it fits, it runs. It was stable, but the GPU always felt like it was loafing. Prefill would max out the GPU, then decode would go uneven because sequences of different lengths left every batch with slow stragglers dragging down fast ones.
Then came chunked prefill: split a long prompt into chunks, and in each step mix one chunk in with all the sequences currently decoding. Prefill no longer monopolizes the GPU, decode latency smooths out, and most of the TTFT reduction came from this one change.
loop {
    // Admit waiting requests that fit in the KV pool.
    while let Some(req) = waiting.front() {
        if kv_pool.free_slots() < req.prompt_len { break; }
        running.push(waiting.pop_front().unwrap());
    }
    // Build batch: one prefill chunk + all decoding seqs.
    let batch = BatchBuilder::new()
        .add_prefill_chunk(&mut running, chunk_size)
        .add_all_decoding(&running)
        .build();
    let outputs = engine.forward(batch);
    // Emit tokens, free finished sequences back to the pool.
    update_state(&mut running, outputs, &mut kv_pool);
}
Sequence State Machine
Attention: hand-written → FlashAttention → FlashInfer, three tries to get it right
Attention is where I stumbled the most. The first version was a hand-written CUDA kernel; it ran and produced correct output, but only hit 40% of theoretical peak. The reason, which I worked out later: poor shared memory usage and underutilized warp-level parallelism. Switching to FlashAttention-2 got it to 70%, but PagedAttention's block addressing needed extra scatter/gather, and the code got convoluted.
What finally let me breathe was finding FlashInfer:
- Adaptive algorithms per batch size: flashattn-style kernels for prefill, a kernel optimized for the single-query case for decode
- Native PagedAttention: no extra scatter/gather; it consumes the Token Pool's indptr and indices directly
- CUDA Graph friendly: only run() needs to be captured; plan() runs once outside the graph
// Plan once before the layer loop (CPU scheduling, ~50µs).
flashinfer.plan(&seq_ids, &kv_pool.indptr, &kv_pool.indices);
// Run inside the CUDA Graph, once per transformer layer (36x here).
for layer in &model.layers {
    let qkv = layer.qkv_proj.forward(&hidden);        // batched GEMM
    let attn = flashinfer.run_layer(qkv.q, layer.kv); // paged attention
    hidden = layer.out_proj.forward(attn) + residual; // fused add + norm
}
// Graph replay eliminates 504 kernel launches per step -> CPU overhead ~ 0.
128 to 811: every step had a trap
128 tok/s. That's where it started. My first reaction on seeing that number: this is embarrassing, SGLang runs 886 on the same hardware. Goals locked in: beat SGLang on TTFT, reach 90%+ of its throughput. Then the slog began, one bottleneck at a time.
Throughput Optimization Path · Qwen3-4B · A100-40GB · 8-concurrent
Profiling both systems with nsys showed where the gap was. CUDA Graph body: ours 8.56ms vs SGLang's 8.18ms, a 0.38ms difference, so GPU compute was basically matched. The real gap was outside the graph: I was using cuStreamSynchronize, which blocks for the full graph duration (8.6ms); SGLang uses cudaEventSynchronize, letting the CPU do other work while the GPU runs (5.1ms effective wait). The remaining 8% wasn't in the GPU. It was in how the CPU waits for the GPU.
Final Comparison · A100-40GB · Qwen3-4B · 8-concurrent
TTFT 4.6× ahead of SGLang; throughput at 92%. The remaining gap: sync strategy (~1.4ms/step), plus FusedAddRMSNorm not yet fused on our side (7.1µs vs SGLang's 1.3µs fused kernel).
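The effect of the two wait strategies is easy to model. A toy cost model (the graph time comes from the numbers above; the CPU-prep figure is an illustrative assumption, and real overlap is messier than a single `max`):

```python
# Toy cost model of the two CPU-side wait strategies.
# GRAPH_MS is from the measurements above; CPU_PREP_MS is an
# illustrative assumption. Real overlap is messier than a single max().

GRAPH_MS = 8.6     # GPU time for one decode step (full CUDA Graph)
CPU_PREP_MS = 3.5  # CPU work preparing the *next* step (assumption)

def step_stream_sync() -> float:
    # cuStreamSynchronize: CPU blocks for the whole graph, then preps.
    return GRAPH_MS + CPU_PREP_MS

def step_event_sync() -> float:
    # cudaEventSynchronize: CPU preps the next step while the GPU runs,
    # then waits only for whatever remains of the graph.
    return max(GRAPH_MS, CPU_PREP_MS)

saving = step_stream_sync() - step_event_sync()  # ~CPU_PREP_MS per step
```

As long as CPU prep fits inside the graph's runtime, event-based sync hides it entirely; that's the ~1.4ms/step the comparison above attributes to sync strategy.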
Testing: because inference-engine bugs hide
Inference engines have an annoying property: bugs don't surface under normal conditions. They only appear under a specific batch composition, a specific sequence length, KV pressure hitting just the right threshold. Often after you've shipped. So I'm nearly obsessive about testing.
Rust tests cover the lowest layers: block allocator correctness under edge cases, every path through the scheduler's state machine, and KV block reference counting when multiple requests share a prefix block. Python integration tests run end-to-end with GPT-2 and TinyLlama. They don't require identical output; they require the KL divergence of the generation distribution against a reference to stay within bounds, because inference correctness isn't "exactly the same", it's "statistically equivalent".
Then there are dedicated chaos tests: injecting high-concurrency requests at 70%, 90%, and 98% KV pressure, with random aborts and random preemptions. The point is to see whether the system crashes, loses data, or recovers gracefully under the worst conditions. Production bugs are always more creative than you expect.
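The "statistically equivalent" check can be sketched in a few lines (the 0.01 threshold and the epsilon smoothing are assumptions; the real tests may well compare richer statistics):

```python
# Sketch of a distribution-equivalence check. The threshold and epsilon
# smoothing are assumptions, not the engine's actual test parameters.
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over two next-token probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def assert_statistically_equivalent(ref_probs, test_probs, threshold=0.01):
    # Identical logits are not required: kernel fusion, reduction order,
    # and fp16 accumulation all perturb the outputs slightly.
    kl = kl_divergence(ref_probs, test_probs)
    assert kl < threshold, f"KL {kl:.4f} exceeds {threshold}"

ref = [0.70, 0.20, 0.08, 0.02]   # reference implementation's next-token probs
test = [0.69, 0.21, 0.08, 0.02]  # same model, different kernels
assert_statistically_equivalent(ref, test)
```

The design choice matters: an exact-match test would flag every legitimate kernel swap as a regression, while a KL bound catches actual numerical bugs.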
@pytest.mark.parametrize("pressure_level", [0.7, 0.9, 0.98])
def test_kv_pressure_recovery(engine, pressure_level):
    num_requests = int(engine.kv_capacity * pressure_level / 16)
    reqs = [make_request(prompt_len=512, max_new_tokens=256)
            for _ in range(num_requests)]
    results = engine.batch_generate(reqs, timeout=60.0)
    assert len(results) == len(reqs)  # no silent drops
    assert all(r.finish_reason in ("stop", "length") for r in results)
    assert all(len(r.tokens) >= 1 for r in results)  # no corrupt output
What 8.6ms actually changes
People ask me: why build an inference engine when good ones already exist? The short answer: I needed 10ms TTFT. But that answer is incomplete. The fuller answer: 8.6ms changes a perception.
This isn't just a latency number; it's a change in how Agents can behave. When each call costs 50ms, you instinctively ration: merge reasoning steps where you can, skip intermediate verification where you can. The Agent becomes "cautious", not because it's wise, but because calling the model is expensive. When TTFT drops to 10ms, that mental tax disappears. You can let the Agent think a few more steps, verify once more, call one more tool; those extra calls are nearly free. The Agent shifts from reactive response to active reasoning. That's the real reason a fast inference engine is worth building.
Agent OS: the inference engine and the Agent shouldn't be two separate things
What I don't want to build is "an Agent framework that runs on top of an inference engine"; that's just a wrapper around the engine. What I want: the inference engine itself becomes part of the Agent, with the two sharing scheduling semantics.
Agent OS · System Architecture
KV Cache (3-tier) · Continuous Batching
FlashInfer · CUDA Graph · Prefix Cache
Tool Execution · Code Runner (Isolated · Fast spawn)
File System · External APIs
- Task-aware scheduling: the scheduler knows which sequences belong to the same Agent task chain and coordinates priority so a task's steps don't preempt each other.
- Persistent KV reuse: system-prompt and context KV are reused across calls, with no recomputation per invocation.
- Tool-call-aware streaming: detect the function-call format early and trigger tool execution without waiting for the model to finish generating the complete JSON.
- Multi-model routing: models of different sizes live in one engine instance, with a routing layer assigning them by task complexity.
When the inference engine and the Agent framework are implemented separately, these features are either impossible or grossly inefficient, because they require deep integration at the scheduling level. Keeping them separate is a workaround; merging them is the right direction.
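Of the four, tool-call-aware streaming is the easiest to sketch. A minimal incremental detector (the marker string, wire format, and callback protocol are illustrative assumptions):

```python
# Sketch of tool-call-aware streaming: fire the tool as soon as a complete
# call parses, instead of waiting for generation to finish. The marker
# string and callback protocol are illustrative assumptions.
import json

TOOL_MARKER = '{"function":'  # assumed function-call prefix

class ToolCallDetector:
    def __init__(self, on_call):
        self.buf = ""
        self.on_call = on_call
        self.fired = False

    def feed(self, text_piece: str):
        self.buf += text_piece
        if self.fired or TOOL_MARKER not in self.buf:
            return
        start = self.buf.index(TOOL_MARKER)
        try:
            call = json.loads(self.buf[start:])  # parses => JSON is complete
        except json.JSONDecodeError:
            return                               # still streaming, keep waiting
        self.fired = True
        self.on_call(call)                       # launch the tool *now*

calls = []
det = ToolCallDetector(on_call=calls.append)
for piece in ['I will look that up. ', '{"function": "search", ',
              '"args": {"q": "TTFT"}}', ' and then summarize.']:
    det.feed(piece)
# The tool fires on the third piece, while the model is still generating.
```

The saving is the tail of the generation: the tool starts running while the model is still emitting trailing text, which matters when the tool itself is slow.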
Fixed hardware isn't a constraint, it's a moat
It took me a while to genuinely accept this: the "fixed" nature of an all-in-one machine is actually an advantage. Fixed GPU, fixed memory, fixed network: in the cloud that's a lack of elasticity. For an Agent OS, it means memory layouts can be decided at compile time, tuning can target this exact GPU's SM count and L2 cache size, and every CUDA Graph shape can be captured offline. That determinism lets me squeeze far more utilization out of specific hardware than any general-purpose framework can.
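One concrete payoff of fixed hardware: the set of CUDA Graph shapes is finite and known up front, so all of them can be captured at startup instead of lazily on first use. A sketch of the bucketing side (the bucket sizes and names are illustrative assumptions):

```python
# With fixed hardware, every CUDA Graph shape can be enumerated and
# captured at startup. Bucket sizes here are illustrative assumptions.
import bisect

BATCH_BUCKETS = [1, 2, 4, 8, 16, 32]  # one captured graph per bucket

def capture_all_graphs(capture_fn):
    """At startup: capture one graph per bucket (capture_fn is a stand-in
    for the actual capture-and-record step)."""
    return {bs: capture_fn(bs) for bs in BATCH_BUCKETS}

def pick_bucket(batch_size: int) -> int:
    """Smallest captured bucket that fits; the batch is padded up to it."""
    i = bisect.bisect_left(BATCH_BUCKETS, batch_size)
    if i == len(BATCH_BUCKETS):
        raise ValueError(f"batch {batch_size} exceeds max captured shape")
    return BATCH_BUCKETS[i]

graphs = capture_all_graphs(lambda bs: f"graph[bs={bs}]")
chosen = pick_bucket(5)  # a 5-seq decode step replays the bs=8 graph
```

On elastic cloud hardware you can't enumerate shapes this way, which is exactly why general-purpose engines capture lazily and pay the warm-up cost at runtime.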
What's next, roughly in order:
- Switching the sync path from cuStreamSynchronize to cudaEventSynchronize to close the remaining ~1.4ms/step gap.
- Speculative decoding (Medusa heads + a draft model), targeting 1.5–2× decode throughput with no TTFT regression.
- Tensor parallelism for larger models.
From 128 to 811
You know what that number feels like? When 811 first appeared in the terminal, it was past 2am. There was nobody to share it with, so I just smiled at the screen for a while and took a screenshot. 142 Rust tests, 254 Python tests, three segfaults that made me question my life choices, countless CUDA OOMs: all of that lives inside that number.
But more valuable than the number is the understanding that only bugs can force out of you, the kind you won't find in any framework's documentation. Why does CUDA Graph require static shapes? Why is page_size = 1 in the Token Pool a critical decision? Why must FlashInfer's plan() run outside the graph? Why is my remaining 8% gap with SGLang not in the GPU, but in how the CPU waits for the GPU?
You never hit these questions while using a framework. They only appear when you write everything yourself, from the HTTP layer all the way down to the CUDA kernel. And when they appear, they force you to actually understand.