Maybe I am misunderstanding something but: 1) This is basically the intention of...

rao-v · 2026-02-23T19:24:06 1771874646

So llama.cpp currently statically places the MoE experts that don't fit in VRAM into RAM and runs them on the CPU, so you get a mix of GPU + CPU inference. For those experts you are rooflined by RAM->CPU bandwidth or CPU compute, whichever is worse.

With good predictability of MoE routing, you might see a world where it's more efficient to spend PCIe bandwidth (slower than RAM->CPU) on loading MoE experts for the next ~3 layers from RAM to VRAM, so you are not rooflined by CPU compute.
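To make that tradeoff concrete, here is a rough back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the `Hardware` numbers, the 50 MB expert size, the 90% predictor hit rate, and the function names are all made up, and none of this reflects how llama.cpp actually schedules work. It just compares "run the expert on CPU" against "prefetch it over PCIe and run it on GPU, falling back to CPU on a misprediction".

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    # All figures are hypothetical placeholders, not measurements.
    ram_bw: float     # bytes/s, RAM -> CPU
    pcie_bw: float    # bytes/s, RAM -> VRAM over PCIe
    vram_bw: float    # bytes/s, VRAM -> GPU
    cpu_flops: float  # peak CPU FLOP/s
    gpu_flops: float  # peak GPU FLOP/s


def roofline_time(nbytes, flops, bw, peak_flops):
    """Simple roofline: the slower of moving the weights and doing the math."""
    return max(nbytes / bw, flops / peak_flops)


def cpu_path(nbytes, flops, hw):
    """Expert stays in RAM and runs on the CPU."""
    return roofline_time(nbytes, flops, hw.ram_bw, hw.cpu_flops)


def prefetch_path(nbytes, flops, hw, hit_rate, overlap_s=0.0):
    """Expected time when experts are predicted and copied to VRAM ahead of use.

    overlap_s is how much of the PCIe copy is hidden behind earlier layers
    (e.g. by predicting a few layers ahead). A misprediction falls back to CPU.
    """
    stall = max(0.0, nbytes / hw.pcie_bw - overlap_s)
    on_hit = stall + roofline_time(nbytes, flops, hw.vram_bw, hw.gpu_flops)
    on_miss = cpu_path(nbytes, flops, hw)
    return hit_rate * on_hit + (1.0 - hit_rate) * on_miss


def compare(tokens, hw, expert_bytes=50e6, params=50e6, hit_rate=0.9):
    """Return (cpu_seconds, prefetch_seconds) for one expert and a token batch."""
    flops = 2.0 * params * tokens  # one multiply-add per weight per token
    return (cpu_path(expert_bytes, flops, hw),
            prefetch_path(expert_bytes, flops, hw, hit_rate))


# Hypothetical machine: 80 GB/s RAM, 25 GB/s PCIe, 900 GB/s VRAM.
hw = Hardware(ram_bw=80e9, pcie_bw=25e9, vram_bw=900e9,
              cpu_flops=1e12, gpu_flops=100e12)

for tokens in (1, 512):
    cpu_s, pre_s = compare(tokens, hw)
    print(f"tokens={tokens}: cpu={cpu_s * 1e3:.2f} ms, prefetch={pre_s * 1e3:.2f} ms")
```

With these made-up numbers, the CPU path is the better one for a single token, because it is bandwidth-bound and RAM is faster than PCIe. Prefetching only pulls ahead once there are enough tokens per expert that the CPU path becomes compute-bound, or once enough of the copy is hidden via `overlap_s`.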

vLLM / SGLang (AFAIK) just assume you have enough VRAM to fit all the experts (but will page KV cache to RAM).