Engineering Blog

Zero to Inference Engine in Two Days

And then turning it into an Agent OS

From 128 tok/s to 811 tok/s: the journey involved a pile of segfaults, a profiler at 2 a.m., and a few moments where something finally clicked. Qwen3-4B · A100-40GB · 8-concurrent

8.6 ms TTFT (4.6× vs SGLang)
811 tok/s (0.92× SGLang)
142 Rust unit tests
254 Python integration tests

It started with a latency number that felt wrong

I'll be honest: this whole thing started with an afternoon staring at vLLM's latency numbers and feeling that something was just wrong. I needed to run agents on an all-in-one machine with fixed hardware and a private network, where TTFT directly set the ceiling on user experience. vLLM and SGLang are excellent, but they're built for cloud multi-tenancy. On edge hardware they're too heavy and too coarsely scheduled, and their first-token latency was unacceptable.

But "the project needed it" is only half the story. The other half: I wanted to actually understand how inference works. Not the way a framework user understands it, but the way someone who wrote it does. What happens on every token? Where does the memory come from, and where does it go? Why is attention slow, and at exactly which step? These are black boxes when you're using a framework. There's only one way to see inside: write it yourself.

"If you can't write it from scratch, you don't really understand it."

And there was a concrete pressure: agent workflows call the model a dozen times in sequence. At 50 ms TTFT per call, the perceived latency compounds into seconds, and the system feels sluggish. I needed it under 10 ms. Not as an ambitious goal, but as a basic usability threshold.
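For a sense of scale, a minimal back-of-envelope in Python. The 12-call chain length is an assumption for illustration, and decode time is excluded; this isolates the first-token overhead alone:

```python
# TTFT contribution across a multi-step agent chain (illustrative numbers).
# Decode time is excluded; this isolates first-token overhead by itself.
calls = 12                              # assumed chain length
ttft_slow, ttft_fast = 0.050, 0.0086    # seconds per call (50 ms vs 8.6 ms)

overhead_slow = calls * ttft_slow * 1000   # milliseconds
overhead_fast = calls * ttft_fast * 1000

print(f"at 50 ms TTFT:  {overhead_slow:.0f} ms of pure waiting")   # 600 ms
print(f"at 8.6 ms TTFT: {overhead_fast:.0f} ms of pure waiting")   # 103 ms
```

Add per-step decode time on top and the 50 ms case lands well past a second of perceived stall.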

Two days of thinking before a single line of code

Before writing the first line of code, I did something very counterintuitive for an engineer: nothing. I thought for two days instead. Not because I'm particularly disciplined, but because I'd been burned before. Some design decisions, once wrong, don't call for a refactor; they call for a full rewrite.

At the whiteboard, I asked myself: if this system ships and I then discover the design was wrong, which three things would hurt the most to fix? The answer came fast: memory management, the boundary between the scheduler and the model forward pass, and the hardware abstraction layer. Lock those three down first; everything else is a recoverable detail.

System Architecture · Request Lifecycle

HTTP API (OpenAI-compatible)
Scheduler (continuous batching)
KV Cache (token pool)
Model Forward (CUDA Graph)
Backends: CUDA / FlashInfer (A100) · Metal / MSL (Apple Silicon)

The scheduler/model-forward boundary turned out to be the most important decision. The scheduler owns batch composition and knows nothing about attention math; the model forward knows nothing about why these sequences were grouped together. That isolation is what later let me swap the attention backend from a hand-written kernel to FlashInfer with almost no changes to the scheduler.
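A minimal sketch of what that isolation looks like, in Python with hypothetical names (the actual engine is Rust, and its real types aren't shown in the post): the scheduler only ever produces a flat batch description, and the forward pass only ever consumes one.

```python
from dataclasses import dataclass

# Hypothetical sketch of the scheduler / model-forward boundary.
# The scheduler emits a flat batch description; the forward pass
# consumes it without knowing why these sequences were grouped.

@dataclass
class Batch:
    seq_ids: list[int]          # which sequences are in this step
    token_ids: list[int]        # prefill-chunk tokens + decode tokens
    kv_slot_indices: list[int]  # where each token's KV lives in the pool

def forward(batch: Batch) -> list[int]:
    """Model forward: a pure function of the batch, no scheduling logic."""
    # ... attention + MLP over batch.token_ids using batch.kv_slot_indices ...
    return [0] * len(batch.seq_ids)  # one next-token per sequence (stub)

# The scheduler side only ever builds Batch objects; swapping the
# attention backend changes forward(), never the scheduler.
batch = Batch(seq_ids=[1, 2], token_ids=[17, 42], kv_slot_indices=[0, 1])
print(len(forward(batch)))  # → 2
```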

A three-tier KV cache, and an unexpected gift

In hindsight, the KV cache design is the one I'm proudest of. Not because it's complex, but because of the unexpected gift it brought along.

KV Cache · Three-Tier Storage Hierarchy

Tier 1 · HBM (~3 TB/s, 16–80 GB): Token Pool laid out as [max_tokens, kv_dim] with page_size=1; FlashInfer accesses it directly. Swaps down on preemption.
Tier 2 · DRAM (~50 GB/s, 64–512 GB): preempted blocks, transferred asynchronously over PCIe while awaiting reschedule. Spills down under pressure.
Tier 3 · SSD (~7 GB/s, TBs): persistent prefix cache, reused across requests, loaded via mmap.

Token Pool layout: [max_tokens, kv_dim], page_size=1, with FlashInfer running paged attention directly on top of it. Then the unexpected gift arrived: prefix caching fell out naturally. As long as the block hashes match, the system prompt's KV is shared across requests with no recomputation.

For agent workloads this is almost free performance. The system prompt is typically long and nearly identical across calls, so the cache hit rate approaches 100%.
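Hash-based prefix sharing can be sketched like this. The block size and the chained-hash scheme are assumptions for illustration; the post doesn't spell out the engine's actual hashing:

```python
import hashlib

# Sketch of hash-based prefix sharing (illustrative, not the engine's
# actual API). Each full block of tokens is keyed by the hash of its
# contents chained with the parent block's hash, so two requests with
# the same system prompt resolve to the same KV blocks.

BLOCK = 16  # tokens per block (assumption)

def block_hashes(tokens: list[int]) -> list[str]:
    hashes, parent = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        chunk = str(tokens[i:i + BLOCK]).encode("utf-8")
        parent = hashlib.sha256(parent + chunk).digest()
        hashes.append(parent.hex())
    return hashes

system_prompt = list(range(64))            # same prefix in both requests
req_a = block_hashes(system_prompt + [1, 2, 3])
req_b = block_hashes(system_prompt + [9, 8, 7])
print(req_a == req_b)  # → True: shared prefix → same block keys → KV reuse
```

Only complete blocks are hashed, so the requests' differing suffixes don't break sharing of the prefix blocks.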

Rust · PagedAttention pseudocode
// Token pool: [max_tokens, kv_dim], where kv_dim = num_kv_heads × head_dim
// Each slot holds one token's K and V for all layers

fn decode_step(queries: &Tensor, batch: &[SeqId], kv_pool: &TokenPool) {
    // Build indptr + indices for FlashInfer (O(batch) CPU work)
    let (indptr, indices) = kv_pool.build_paged_layout(batch);

    // Single kernel launch covers all seqs × all heads
    flashinfer::batch_decode(queries, &indptr, &indices, &kv_pool.data);
    // No manual scatter/gather: FlashInfer handles the indirection
}

Scheduler and attention: two roads with detours

Scheduler: from the lazy version to the real one

The first version of the scheduler, and I'm still a little embarrassed about this, was just a FIFO queue: if it fits, it runs. It was stable, but the GPU always felt like it was loafing: fully loaded during prefill, then uneven during decode, where a batch of sequences with mismatched lengths meant the slow ones dragged down the fast ones.

Then came chunked prefill: split long prompts into chunks, and in each step mix one chunk in with all the sequences currently decoding. Prefill no longer monopolizes the GPU, and decode latency smooths out. Most of the TTFT reduction came from this one change.

Rust · Scheduler loop pseudocode
loop {
    // Admit waiting requests that fit in the KV pool
    while let Some(req) = waiting.front() {
        if kv_pool.free_slots() < req.prompt_len { break; }
        running.push(waiting.pop_front().unwrap());
    }
    // Build batch: one prefill chunk + all decoding seqs
    let batch = BatchBuilder::new()
        .add_prefill_chunk(&mut running, chunk_size)
        .add_all_decoding(&running)
        .build();
    let outputs = engine.forward(batch);
    // Emit tokens, free finished sequences' KV slots
    update_state(&mut running, outputs, &mut kv_pool);
}

Sequence State Machine

Wait: awaiting KV slots
Prefill: chunked prompt processing
Decode: autoregressive token generation
Preempt: KV swapped to DRAM
Done: KV slots released

Attention: hand-written → FlashAttention → FlashInfer, three tries to get it right

Attention is where I stumbled the most. The first hand-written CUDA kernel ran and produced correct output, but hit only 40% of theoretical peak. I figured out why later: poor shared-memory usage, and warp-level parallelism left on the table. FlashAttention-2 got it to 70%, but PagedAttention's block addressing required extra scatter/gather, and the code got convoluted.

What finally let me breathe was finding FlashInfer.

Rust · FlashInfer batched decode pseudocode
// Plan once before the layer loop (CPU scheduling, ~50µs)
flashinfer.plan(&seq_ids, &kv_pool.indptr, &kv_pool.indices);

// Run 36× inside CUDA Graph (one per transformer layer)
for layer in &model.layers {
    let qkv = layer.qkv_proj.forward(&hidden);          // batched GEMM
    let attn = flashinfer.run_layer(qkv.q, layer.kv);  // paged attn
    hidden = layer.out_proj.forward(attn) + residual;   // fused add+norm
}
// Graph replay eliminates 504 kernel launch calls → CPU overhead ≈ 0

128 to 811: every step had a trap

128 tok/s. That's where it started. My first reaction on seeing that number: this is embarrassing; SGLang runs 886 on the same hardware. Goals locked in: beat SGLang on TTFT, reach 90%+ of its throughput. Then the slog began.

Throughput Optimization Path · Qwen3-4B · A100-40GB · 8-concurrent

Baseline (no batching): 128 tok/s
+ Token Pool + FlashInfer: 434 tok/s
+ Buffer pre-allocation: 681 tok/s
+ Plan once: 690 tok/s
+ Logits pre-allocation: 700 tok/s
+ CUDA Graph: 756 tok/s
nsys tuning (latest): 811 tok/s
Reference: SGLang v0.4 (baseline): 886 tok/s
Start · 128 tok/s
No batching; each request runs as an independent forward pass. Naive, but it's the simplest version that actually runs.
Step 1 · → 434 tok/s · Token Pool + FlashInfer (+239%)
The heaviest step, and not just in performance terms. Three FlashInfer segfaults nearly broke me: a hardcoded MAX_SEQ=4096 causing out-of-bounds writes; plan_info being memcpy'd on the CPU from a pointer that lived on the GPU; and the scheduler and the model each allocating the same token independently, corrupting metadata. Each one left nothing on the screen but a single line: "Segmentation fault".
Step 2 · → 681 tok/s · Buffer pre-allocation (+57%)
One day I opened the profiler and stared at the timeline for about two hours. Then I saw something that left me speechless: every single decode step was allocating ~128 MB of FlashInfer workspace from scratch. 4.5 ms, every step, thrown away and redone. That buffer can be allocated once and reused. A one-line fix, a 57% throughput gain. Sometimes the most expensive bug is the stupidest one.
Step 3 · → 690 tok/s · FlashInfer plan once (+1.3%)
FlashInfer's plan() (CPU-side scheduling) was being called in every transformer layer, 36 calls per step, even though the KV layout is identical across all layers; it only needs to run once, before the loop, with only run() inside. Under two lines changed, 690 tok/s. A small gain, but the "how did I not see this earlier" feeling stings.
Step 4 · → 756 tok/s · CUDA Graph (+8%)
36 layers × ~14 kernels ≈ 504 kernel launches per step, each paying the full CPU→GPU launch path. CUDA Graph records the whole sequence as a static graph and replays it, dropping CPU overhead to microseconds. A graph cache bucketed by batch_size: capture once, replay thereafter.
Step 5 · → 811 tok/s · nsys revealed where the real gap was
I profiled both systems with nsys. Guess where the gap was? Inside the CUDA Graph body: ours 8.56 ms vs SGLang's 8.18 ms, a 0.38 ms difference; GPU compute was basically matched. The real gap was outside the graph: I was using cuStreamSynchronize, which blocks for the full graph duration (8.6 ms), while SGLang uses cudaEventSynchronize, which lets the CPU do other work while it waits (5.1 ms effective). The remaining 8% wasn't in the GPU. It was in how the CPU waits for the GPU.
The CUDA Graph trade-off: capture requires a static kernel-launch sequence (same batch size, same KV layout). The graph cache is bucketed by batch_size: capture once per bucket, replay thereafter; a cache miss falls back to a normal launch at a cost of tens of microseconds. FlashInfer's plan() does a CPU memcpy and must run outside the graph; only run() is captured.
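The capture-once/replay-thereafter pattern with a batch_size-bucketed cache can be sketched like this. It's a pure-Python mock with illustrative names; a real implementation would replay a captured cudaGraphExec_t rather than return a string:

```python
# Mock of a CUDA-Graph cache bucketed by batch size (illustrative only;
# a real implementation captures and replays an actual CUDA graph).

class GraphCache:
    def __init__(self, capture_fn):
        self.capture_fn = capture_fn  # expensive: records the kernel sequence
        self.graphs = {}              # batch_size -> captured "graph"
        self.captures = 0

    def run(self, batch_size):
        if batch_size not in self.graphs:        # cache miss: capture once
            self.graphs[batch_size] = self.capture_fn(batch_size)
            self.captures += 1
        return self.graphs[batch_size]           # cache hit: replay

cache = GraphCache(capture_fn=lambda bs: f"graph[{bs}]")
for bs in [8, 8, 8, 4, 8, 4]:                    # steady-state batch sizes
    cache.run(bs)
print(cache.captures)  # → 2 (one capture per distinct batch size)
```

In steady state almost every step is a replay, which is why the launch overhead drops to microseconds.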

Final Comparison · A100-40GB · Qwen3-4B · 8-concurrent

TTFT (lower is better):
agent-infer: 8.6 ms (4.6× faster)
SGLang v0.4: 39.3 ms
vLLM v0.6: ~52 ms

Throughput (higher is better):
SGLang v0.4 (baseline): 886 tok/s
agent-infer: 811 tok/s (0.92×)
vLLM v0.6: 711 tok/s

TTFT leads SGLang by 4.6×; throughput is at 92%. The remaining gap: the sync strategy (~1.4 ms/step) plus a FusedAddRMSNorm that isn't fused yet (7.1 µs vs SGLang's 1.3 µs fused kernel).

Testing: because inference-engine bugs hide

Inference engines have an annoying property: bugs don't surface under normal conditions. They appear only under specific batch compositions, specific sequence lengths, or KV pressure hitting exactly the wrong threshold, and often only after you've shipped. So I'm a little obsessive about testing.

142 Rust unit tests
254 Python integration tests
100% scheduler path coverage
0 known flaky tests

The Rust tests cover the lowest layers: block-allocator correctness under edge cases, every path through the scheduler state machine, and KV-block reference counting when multiple requests share a prefix block. The Python integration tests run end-to-end with GPT-2 and TinyLlama. They don't require identical output; they require the KL divergence of the generation distribution against a reference to stay within a reasonable bound, because inference correctness isn't "exactly the same", it's "statistically equivalent".
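The "statistically equivalent" check can be sketched like this. The tolerance and the toy distributions are illustrative, not the suite's actual values:

```python
import math

# Sketch of a distribution-level correctness check: compare next-token
# probabilities from the engine against a reference implementation.
# Threshold and distributions are illustrative assumptions.

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over two next-token probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

reference = [0.70, 0.20, 0.07, 0.03]   # e.g. from a reference framework
ours      = [0.69, 0.21, 0.07, 0.03]   # e.g. from the engine under test

assert kl_divergence(reference, ours) < 1e-3  # "close enough", not identical
```

Token-by-token equality would fail on harmless numeric differences (kernel fusion, reduction order); a KL bound captures what actually matters.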

There are also dedicated chaos tests: inject high-concurrency requests at 70%, 90%, and 98% KV pressure, with random aborts and random preemptions. The point is to see whether the system crashes, loses data, or recovers gracefully under the worst conditions. Production bugs are always more creative than you expect.

Python · chaos test example
@pytest.mark.parametrize("pressure_level", [0.7, 0.9, 0.98])
def test_kv_pressure_recovery(engine, pressure_level):
    num_requests = int(engine.kv_capacity * pressure_level / 16)
    reqs = [make_request(prompt_len=512, max_new_tokens=256)
            for _ in range(num_requests)]
    results = engine.batch_generate(reqs, timeout=60.0)
    assert len(results) == len(reqs)                              # no silent drop
    assert all(r.finish_reason in ("stop", "length") for r in results)
    assert all(len(r.tokens) >= 1 for r in results)              # no corrupt output

What 8.6 ms actually changes

People ask me: why build an inference engine when perfectly good ones already exist? The short answer: because I needed 10 ms TTFT. But that answer is incomplete. The fuller version: 8.6 ms changes a perception.

Below 10 ms TTFT, the perceived latency of an agent call disappears. The pause between the user's question and the agent's answer becomes so short that it stops feeling like "waiting for computation" and starts feeling like a person thinking.

This isn't just a latency number; it's a change in how agents can behave. When each call costs 50 ms, you instinctively ration calls: merge steps where you can, skip intermediate reasoning where you can. The agent becomes "cautious", not because it's wise, but because calling the model is expensive. When TTFT drops to 10 ms, that worry disappears. You can let the agent think a few more steps, verify once more, call one more tool, because those extra calls are nearly free. The agent shifts from reactive to proactive. That is the real reason fast inference is worth building.

Agent OS: the engine and the agent shouldn't be separate things

What I don't want to build is "an agent framework that runs on top of an inference engine"; that's just a wrapper around the engine. What I want: the inference engine itself becomes part of the agent, and the two share scheduling semantics.

Agent OS · System Architecture

Multi-channel I/O: Chat · Voice · REST API · Scheduled Tasks · Webhooks
Agent Runtime: Task-Aware Scheduler · Persistent Memory · Tool Registry · Multi-step Reasoning
Inference Engine: KV Cache (3-tier) · Continuous Batching · FlashInfer · CUDA Graph · Prefix Cache
Lightweight Sandbox: Tool Execution · Code Runner · File System · External APIs · Isolated, fast spawn

These capabilities either can't be built at all or are grossly inefficient when the inference engine and the agent framework are implemented separately, because they need deep integration at the scheduling level. Keeping them separate is a workaround; merging them is the right direction.

Fixed hardware isn't a constraint, it's a moat

It took me a while to genuinely accept this: the "fixed" nature of an all-in-one machine is actually an advantage. A fixed GPU, fixed memory, and a fixed network mean no elasticity, which is a defect in the cloud. For an Agent OS, they mean memory layouts can be decided at build time, tuning can target this exact GPU's SM count and L2 cache size, and every CUDA Graph shape can be captured offline. That determinism lets me squeeze utilization out of specific hardware that no general-purpose framework can match.

Next: speculative decoding (Medusa heads + a draft model), targeting another 1.5–2× on decode throughput with no TTFT regression; tensor parallelism for larger models; and switching the sync strategy from cuStreamSynchronize to cudaEventSynchronize to close the remaining 1.4 ms/step gap.

From 128 to 811

You know what that number feels like? When 811 first appeared in the terminal, it was past 2 a.m. There was nobody to tell. I just smiled at the screen for a while, then took a screenshot. 142 Rust tests, 254 Python tests, three segfaults that made me question my life choices, more CUDA OOMs than I can count: all of that lives inside that number.

But more valuable than the number is the understanding that only bugs can force out of you, the kind you won't find in any framework's documentation. Why does CUDA Graph require static shapes? Why is the Token Pool's page_size=1 a critical decision? Why must FlashInfer's plan() run outside the graph? Why is my remaining 8% gap against SGLang not in the GPU, but in how the CPU waits for the GPU?

You never meet these questions while using a framework. They only appear when you write everything yourself, from the HTTP layer all the way down to the CUDA kernel, and when they appear they force you to actually understand.

The inference engine isn't the destination. But writing one is the shortest path to understanding AI infrastructure.
8.6 ms TTFT · 4.6× vs SGLang
811 tok/s · 0.92× SGLang
6.3× throughput gain
396 Rust + Python tests