Kubernetes LLM Inference Stack 2026: llm-d, DRA & KAI

KubeCon EU 2026 in Amsterdam brought three announcements that, taken individually, each look like incremental progress. Taken together, they describe something more significant: a complete Kubernetes-native stack for production LLM inference that addresses problems the community has been hacking around for two years.

The components are llm-d (CNCF Sandbox), the NVIDIA DRA Driver for GPUs (donated to CNCF), KAI Scheduler (CNCF Sandbox), and Grove (NVIDIA Dynamo). Each solves a specific production problem. Each integrates directly with the others. None of them replaces vLLM — they build above and around it.

The Kubernetes LLM inference stack combines llm-d, GPU DRA, KAI Scheduler, and Grove to deliver disaggregated, cache-aware LLM inference on Kubernetes. llm-d’s v0.5 benchmarks report 57x faster TTFT and 2x aggregate throughput versus round-robin serving on 8 pods and 16 H100 GPUs. The stack gets there by separating compute-heavy prefill from memory-bandwidth-bound decode and by routing requests to pods with matching prefix caches.

Before you read further: this stack requires RDMA networking (InfiniBand or RoCE), 8+ NVIDIA GPUs, Kubernetes 1.29+ (1.34.2+ recommended for full GPU Allocation support via DRA), and driver versions that are only a few months old as of April 2026. If you’re running single-node inference or evaluating on a workstation, standalone vLLM is the right call. This stack is for platform teams running multi-node, multi-model inference at scale.

Why Does Standard Kubernetes Scheduling Fail for LLM Inference?

The naive approach to LLM inference on Kubernetes is to package vLLM in a Deployment, put a Service in front of it, and let kube-proxy round-robin requests across replicas. It works at low traffic. It falls apart under load for three reasons.

The prefill/decode resource asymmetry. Processing an input prompt (prefill) is compute-heavy — it parallelizes across input tokens and saturates GPU FLOPS. Generating output tokens (decode) is memory-bandwidth-bound — it’s autoregressive, sequential, and repeatedly reads the KV cache. These two phases have fundamentally different hardware requirements. Running them on the same GPU means neither gets ideal hardware, and you can’t scale them independently.
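The asymmetry can be put in rough numbers with a roofline-style estimate. The sketch below is a simplification, not a profiler result: it assumes weight reads dominate memory traffic, ignores KV cache reads and attention FLOPs, and uses approximate public H100 SXM figures. All function names are illustrative.

```python
# Rough roofline arithmetic for one transformer forward pass.
# Assumption: weights dominate memory traffic and each parameter costs
# about 2 FLOPs (one multiply, one add) per token processed.

BYTES_PER_PARAM = 2  # bf16/fp16 weights (assumed)


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weights read in one forward pass."""
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / BYTES_PER_PARAM


def bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a pass against the GPU's ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


# Approximate public figures for an H100 SXM (assumed, dense bf16).
PEAK_FLOPS = 989e12      # FLOP/s
MEM_BANDWIDTH = 3.35e12  # bytes/s

prefill = arithmetic_intensity(tokens_per_pass=2048)  # one 2048-token prompt
decode = arithmetic_intensity(tokens_per_pass=1)      # one new token, batch of 1

print(f"ridge point: {PEAK_FLOPS / MEM_BANDWIDTH:.0f} FLOPs/byte")
print(f"prefill: {prefill:.0f} FLOPs/byte -> {bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode:  {decode:.0f} FLOPs/byte -> {bound(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Batching raises decode intensity, but under this model it takes a batch of a few hundred sequences to reach the ridge point, which is why the two phases tend to want different hardware and scaling policies.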

Round-robin routing ignores inference state. A standard Kubernetes Service routes requests in rotation with no knowledge of which pod has relevant KV cache prefix data already loaded. Every cache miss adds latency to TTFT. Under load, this compounds — a pod handling a batch of unrelated requests repeatedly starts from scratch on every prompt.

The gap between vLLM and KServe. vLLM is an inference engine. KServe is a serving control plane. Between them is a gap: production concerns like disaggregated serving, intelligent request routing, and topology-aware scheduling have no home in either project. This stack fills that gap.

The Four Layers of the Kubernetes-Native LLM Inference Stack

The full architecture connects like this:

Client Request
    │
Gateway (Envoy/Istio) + EPP (Endpoint Picker)
    │  prefix-cache-aware routing
InferencePool (llm-d inference scheduler)
    │  scores pods by cache hit rate, KV utilization, queue depth
Prefill Workers (vLLM, compute-heavy GPU pods)
    │  KV cache transfer via NIXL (InfiniBand/RoCE RDMA)
Decode Workers (vLLM, memory-bandwidth GPU pods)
    │  autoregressive token generation
Response Stream

The orchestration layer wraps the entire thing:

Grove manages workload lifecycle — PodCliqueSets define the full disaggregated deployment as a single declarative resource
KAI Scheduler handles gang scheduling and topology-aware pod placement
GPU DRA provides fine-grained GPU allocation: MIG slices, ComputeDomains for multi-node NVLink
Workload Variant Autoscaler (WVA) adjusts the prefill/decode/batch instance ratio as traffic patterns shift

These aren’t independent tools you wire together by hand. Grove creates PodGang resources that KAI Scheduler consumes directly. KAI Scheduler places pods on nodes sharing a high-bandwidth interconnect so NIXL can transfer KV cache at line rate. GPU DRA gives both the scheduler and the inference engine fine-grained control over GPU allocation. llm-d’s Endpoint Picker routes into the pod pool that Grove provisioned and KAI Scheduler placed.

What Is GPU DRA and How Does It Replace the Kubernetes Device Plugin?

The NVIDIA Kubernetes Device Plugin has been the standard way to expose GPUs to Kubernetes for years. It works by advertising GPUs as opaque extended resources (nvidia.com/gpu: 1). Fractional allocation isn’t supported. MIG partitioning through it is awkward. And it has no concept of multi-node GPU interconnects.

For a primer on how Kubernetes handles resource requests and limits before diving into DRA, see our Kubernetes Resource Limits: Production Guide.

NVIDIA donated its DRA Driver for GPUs to CNCF at KubeCon EU 2026. DRA (Dynamic Resource Allocation) replaces the device plugin model with a more expressive allocation framework using DeviceClasses and ResourceClaims instead of extended resources.

With the DRA Driver, four DeviceClasses are available:

gpu.nvidia.com — whole GPU allocation
mig.nvidia.com — MIG slice allocation, dynamically provisioned through ResourceClaims
compute-domain-daemon.nvidia.com / compute-domain-default-channel.nvidia.com — ComputeDomains for Multi-Node NVLink (MNNVL)

ComputeDomains are the meaningful new primitive. On GB200 Grace Blackwell systems with MNNVL, a ComputeDomain exposes a multi-node NVLink fabric as a first-class Kubernetes resource. Pods allocated to the same ComputeDomain get placed on nodes sharing that fabric, which makes ComputeDomains the resource-allocation counterpart to KAI Scheduler’s topology awareness.
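To make the DeviceClass and ResourceClaim model concrete, here is a minimal sketch that builds a ResourceClaimTemplate and a pod that consumes it, then prints them as JSON. It assumes the `resource.k8s.io/v1` API shape (older clusters serve beta versions with a slightly different request layout); the pod name and image are placeholders, and the helper functions are hypothetical, not part of any NVIDIA tooling.

```python
import json


def gpu_claim_template(name: str, device_class: str = "gpu.nvidia.com") -> dict:
    """ResourceClaimTemplate requesting one device from a DeviceClass."""
    return {
        "apiVersion": "resource.k8s.io/v1",  # assumed; check what your cluster serves
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [
                        {"name": "gpu", "exactly": {"deviceClassName": device_class}}
                    ]
                }
            }
        },
    }


def pod_with_claim(pod_name: str, template_name: str, image: str) -> dict:
    """Pod that references the template instead of nvidia.com/gpu limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "resourceClaims": [
                {"name": "gpu", "resourceClaimTemplateName": template_name}
            ],
            "containers": [
                {
                    "name": "main",
                    "image": image,  # placeholder image
                    "resources": {"claims": [{"name": "gpu"}]},
                }
            ],
        },
    }


template = gpu_claim_template("single-gpu")
pod = pod_with_claim("vllm-decode-0", "single-gpu", "registry.example.com/vllm:placeholder")
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [template, pod]}, indent=2))
```

Swapping `device_class` for `mig.nvidia.com` would request a MIG slice through the same claim mechanism, which is the main ergonomic difference from the extended-resource model.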

Install the DRA Driver:

# Label nodes for DRA support
kubectl label node $HOSTNAME nvidia.com/dra-kubelet-plugin=true

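# Add the NVIDIA Helm repository (skip if it is already configured)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator with device plugin DISABLED (DRA replaces it)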
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --version=v26.3.0 \
  --create-namespace \
  --namespace gpu-operator \
  --set devicePlugin.enabled=false

# Install DRA Driver
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version="25.12.0" \
  --namespace nvidia-dra-driver-gpu \
  --create-namespace \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true

Version requirements: DRA Driver v25.12.0+, GPU Operator v25.10.0+ (Helm chart v26.3.0), Kubernetes v1.32+ for ComputeDomains, v1.34.2+ for GPU Allocation, NVIDIA Driver 580+, CDI enabled.
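The Kubernetes minimums in this article are easy to mix up, so a small pre-flight check can help. This sketch only encodes the version floors quoted here; the feature labels and helper names are illustrative, and it does not query a cluster.

```python
from typing import List, Tuple

# Minimum Kubernetes versions as quoted in this article.
MINIMUM_KUBERNETES = {
    "llm-d": "1.29",
    "DRA ComputeDomains": "1.32",
    "DRA GPU allocation": "1.34.2",
}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Turn 'v1.34.2', '1.32' or 'v1.33.1-gke.100' into (major, minor, patch)."""
    core = version.lstrip("v").split("-")[0].split("+")[0]
    numbers = [int(part) for part in core.split(".")]
    numbers += [0] * (3 - len(numbers))  # pad '1.32' to (1, 32, 0)
    return (numbers[0], numbers[1], numbers[2])


def supported_features(cluster_version: str) -> List[str]:
    """List the stack features whose version floor the cluster meets."""
    current = parse_version(cluster_version)
    return [
        feature
        for feature, minimum in MINIMUM_KUBERNETES.items()
        if current >= parse_version(minimum)
    ]


for candidate in ("v1.28.5", "v1.33.1", "v1.34.2"):
    print(candidate, "->", supported_features(candidate) or "below every minimum")
```

In practice you would feed it the server version reported by your cluster and still verify driver, GPU Operator, and CDI requirements separately.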

How Does KAI Scheduler Enable Gang Scheduling for LLM Inference?

The default Kubernetes scheduler places pods independently. For disaggregated LLM inference, that creates two practical problems. First, if only some pods in a tensor-parallel group get scheduled, the entire group idles. Second, if decode and prefill workers land on nodes without a shared high-bandwidth interconnect, KV cache transfer latency erases your TTFT gains.
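To put the second problem in numbers, here is a back-of-the-envelope estimate of how long one request's KV cache takes to move between workers. The model shape is an assumed Llama-3-70B-like configuration (80 layers, 8 KV heads, head dimension 128, bf16), the link speeds are nominal, and protocol overhead is ignored, so treat the results as orders of magnitude only.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence.

    The leading factor of 2 accounts for one key and one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes: int, link_gbit_per_s: float) -> float:
    """Ideal transfer time at nominal line rate (no protocol overhead)."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return num_bytes / bytes_per_second


# Assumed model shape and prompt length, for illustration only.
size = kv_cache_bytes(tokens=8192, layers=80, kv_heads=8, head_dim=128)
print(f"KV cache for an 8192-token prompt: {size / 1e9:.2f} GB")

for name, gbit in [("400 Gb/s RDMA fabric", 400), ("10 Gb/s Ethernet", 10)]:
    print(f"{name}: {transfer_seconds(size, gbit) * 1000:.0f} ms")
```

Under these assumptions the transfer takes tens of milliseconds on a fast fabric and around two seconds on a slow link, which is enough to cancel any TTFT benefit from a dedicated prefill pool.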

KAI Scheduler — NVIDIA’s open-sourced Run:ai scheduling engine, accepted as CNCF Sandbox at KubeCon EU 2026 — addresses both.

Gang scheduling: All pods in a group schedule simultaneously or not at all. A disaggregated deployment’s prefill workers, decode workers, and router pod all land together, or none do. No partial deployments stuck in Pending indefinitely.

Hierarchical gang scheduling: Extends gang scheduling across multiple levels. A prefill group and a decode group can be scheduled as a coordinated unit while maintaining their internal role dependencies.

Topology-Aware Scheduling (TAS): Places related pods on nodes sharing high-bandwidth interconnects such as NVLink domains and InfiniBand fabrics. When KV cache transfers from prefill to decode workers via NIXL, transfer time is inversely proportional to interconnect bandwidth. TAS keeps that path on fast links.

Dominant Resource Fairness (DRF): Fair-share queuing with hierarchical queues, over-quota weights, and resource reclamation. Practically relevant for multi-tenant GPU clusters where multiple teams share the same inference infrastructure.
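The all-or-nothing behaviour of gang scheduling is easier to see in code. The following is a toy first-fit placement, not KAI Scheduler's algorithm: it tentatively places every pod in a group and commits only if all of them fit. Node names, pod names, and the function itself are hypothetical.

```python
from typing import Dict, Optional


def gang_schedule(free_gpus: Dict[str, int],
                  pod_requests: Dict[str, int]) -> Optional[Dict[str, str]]:
    """Place every pod in the group, or place none of them.

    free_gpus maps node name to free GPU count and is updated only on success.
    Returns a pod-to-node mapping, or None if the whole group cannot fit.
    """
    remaining = dict(free_gpus)  # tentative copy; the real state is untouched
    placement: Dict[str, str] = {}
    # Place the largest requests first so big pods are not starved by small ones.
    for pod, gpus in sorted(pod_requests.items(), key=lambda item: -item[1]):
        node = next((n for n, free in remaining.items() if free >= gpus), None)
        if node is None:
            return None  # one pod does not fit, so nothing is committed
        remaining[node] -= gpus
        placement[pod] = node
    free_gpus.update(remaining)  # commit the whole group at once
    return placement


cluster = {"node-a": 4, "node-b": 2, "node-c": 8}
print(gang_schedule(cluster, {"prefill-0": 4, "decode-0": 2}))
print(gang_schedule(cluster, {"prefill-1": 8, "decode-1": 8}))  # cannot fit both
print(cluster)  # the failed group consumed nothing
```

A per-pod scheduler would have placed `prefill-1` on the last free node and left `decode-1` pending, holding eight GPUs idle. Topology awareness and fair-share queuing are additional constraints layered on top of this basic commit-or-abort step.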

Install KAI Scheduler:

helm upgrade -i kai-scheduler \
  oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler \
  -n kai-scheduler \
  --create-namespace \
  --version v0.10.0

Current stable version is v0.10.0. Requirements: Kubernetes cluster, Helm CLI, NVIDIA GPU Operator.

How Does llm-d Enable Disaggregated LLM Inference on Kubernetes?

llm-d was accepted as a CNCF Sandbox project at KubeCon EU 2026. Current version is v0.5 (February 2026), Apache 2.0, requiring Kubernetes 1.29+. Founding collaborators include IBM Research, Red Hat, Google Cloud, CoreWeave, NVIDIA, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI.

What llm-d actually does: It sits between vLLM and the control plane, adding disaggregated serving and cache-aware request routing. Three components do the work:

Endpoint Picker (EPP): Implements the Kubernetes Gateway API Inference Extension (GAIE, now GA). Intercepts requests at the gateway, discovers available pods in the InferencePool, and scores them: prefix cache hits (weight 3), KV cache utilization (weight 2), queue depth (weight 2). Routes to the highest-scoring pod.
Disaggregated Serving Sidecar: Coordinates request routing between prefill and decode workers and manages KV cache transfer via NIXL — a transport layer supporting InfiniBand, RoCE RDMA, and TPU ICI.
Workload Variant Autoscaler (WVA): Monitors traffic patterns and adjusts the prefill/decode/batch instance ratio without redeploying the stack.
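The EPP's weighted scoring can be sketched in a few lines. Only the weights (3, 2, 2) come from the description above; how each signal is normalized to a 0..1 range is an assumption made for illustration, as are the class and function names, and the real scorer plugins may behave differently.

```python
from dataclasses import dataclass
from typing import List

# Weights quoted above: prefix cache hits 3, KV cache utilization 2, queue depth 2.
WEIGHTS = {"prefix": 3.0, "kv": 2.0, "queue": 2.0}


@dataclass
class PodState:
    name: str
    prefix_hit_ratio: float  # share of the prompt already cached on this pod, 0..1
    kv_utilization: float    # share of KV cache memory in use, 0..1
    queue_depth: int         # requests waiting on this pod


def score(pod: PodState, max_queue: int = 16) -> float:
    """Higher is better. Normalization choices here are assumptions."""
    kv_headroom = 1.0 - pod.kv_utilization
    queue_headroom = 1.0 - min(pod.queue_depth, max_queue) / max_queue
    return (WEIGHTS["prefix"] * pod.prefix_hit_ratio
            + WEIGHTS["kv"] * kv_headroom
            + WEIGHTS["queue"] * queue_headroom)


def pick(pods: List[PodState]) -> PodState:
    """Route to the highest-scoring pod."""
    return max(pods, key=score)


pods = [
    PodState("pod-a", prefix_hit_ratio=0.9, kv_utilization=0.70, queue_depth=4),
    PodState("pod-b", prefix_hit_ratio=0.0, kv_utilization=0.10, queue_depth=0),
    PodState("pod-c", prefix_hit_ratio=0.9, kv_utilization=0.95, queue_depth=16),
]
for p in pods:
    print(f"{p.name}: {score(p):.2f}")
print("selected:", pick(pods).name)
```

The example shows the trade-off the weights encode: a warm cache beats an idle pod (`pod-a` over `pod-b`), but a warm cache on a saturated pod does not (`pod-c` scores lowest).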

Why the 57x TTFT improvement happens: Standard round-robin sends requests to pods with empty or irrelevant prefix caches. In production workloads — RAG pipelines with shared retrieval context, chatbots with long system prompts, few-shot applications — many requests share the same leading tokens. The EPP routes those requests to pods with matching prefix cache entries already loaded, skipping KV recomputation for the shared prefix. At 8 pods and 16 H100 GPUs, this delivers 57x faster TTFT versus round-robin and 2x aggregate throughput on identical hardware.
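A small simulation illustrates the mechanism, though not the 57x figure itself. It compares prefix cache hit rates under round-robin and under a simple prefix-affinity rule, assuming each pod holds a fixed number of prefixes in an LRU cache and requests draw uniformly from a set of shared prefixes. The modulo-based affinity is a stand-in for the EPP's scoring, and every parameter is invented for illustration.

```python
import random
from collections import OrderedDict


def round_robin(request_index: int, prefix: int, num_pods: int) -> int:
    """Ignore the prefix; rotate through pods."""
    return request_index % num_pods


def prefix_affinity(request_index: int, prefix: int, num_pods: int) -> int:
    """Always send the same prefix to the same pod (simplified stand-in)."""
    return prefix % num_pods


def hit_rate(route, num_pods: int = 8, num_prefixes: int = 32,
             cache_slots: int = 4, requests: int = 10_000, seed: int = 0) -> float:
    """Fraction of requests whose prefix was already cached on the chosen pod."""
    rng = random.Random(seed)
    caches = [OrderedDict() for _ in range(num_pods)]  # one LRU cache per pod
    hits = 0
    for i in range(requests):
        prefix = rng.randrange(num_prefixes)
        cache = caches[route(i, prefix, num_pods)]
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)      # mark as most recently used
        else:
            cache[prefix] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used prefix
    return hits / requests


print(f"round-robin hit rate:     {hit_rate(round_robin):.1%}")
print(f"prefix-affinity hit rate: {hit_rate(prefix_affinity):.1%}")
```

With these parameters each pod can hold exactly the prefixes routed to it under affinity, so nearly every request hits, while round-robin spreads all 32 prefixes across every pod's four slots. Every hit skips prefill for the shared prefix, which is where the TTFT reduction comes from.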

Validated benchmarks from llm-d v0.5:

Metric	Value	Configuration
TTFT improvement	57x vs round-robin	8 pods / 16 H100 GPUs
Decode throughput	~3,100 tok/s per B200	Wide expert-parallelism
Aggregate throughput	Up to 50k output tok/s	16×16 B200 prefill/decode topology
Throughput gain	2x vs round-robin	8 pods / 16 H100 GPUs
Per-output-token latency	40% reduction (v0.4)	DeepSeek V3.1 on H200 GPUs

Installing with P/D disaggregation:

# Prerequisites: Kubernetes 1.29+, RDMA networking, 8+ NVIDIA GPUs
export NAMESPACE=llm-d-pd
kubectl create namespace ${NAMESPACE}

# Deploy the disaggregated stack (prefill + decode + gateway)
# Run from the root of a local clone of the llm-d repository
cd guides/pd-disaggregation
helmfile apply -n ${NAMESPACE}

# Verify deployment
helm list -n ${NAMESPACE}
# Expected: gaie-pd, infra-pd, ms-pd charts

kubectl get all -n ${NAMESPACE}
# Expected: 4 prefill pods, 1 decode pod, gateway service

Being a CNCF Sandbox project means API stability is not guaranteed. Plan for breaking changes before llm-d reaches Incubating status. Treat v0.5 as appropriate for infrastructure experimentation and staging — not regulated production workloads.

What Is Grove and How Does It Orchestrate the Kubernetes Inference Stack?

Grove is the most under-covered piece of this stack and the one that makes the rest operationally viable. It’s NVIDIA’s open-source Kubernetes API (part of the Dynamo ecosystem) for orchestrating distributed AI inference as a unified declarative resource.

Without Grove, you’re managing a disaggregated deployment by hand: separate Deployments for prefill workers, decode workers, and the router; custom logic for startup ordering (router must be ready before workers); manual coordination of scaling across roles. This doesn’t scale past a few deployments.

Grove introduces a three-tier CRD hierarchy:

PodCliques — groups of pods with specific roles: prefill leader, prefill worker, decode leader, decode worker, frontend
PodCliqueScalingGroups — bundles of tightly coupled PodCliques that scale together; one prefill leader + N prefill workers = one model instance, scaled atomically
PodCliqueSets — the full workload definition: startup ordering, scaling policies, gang-scheduling constraints
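The three tiers can be modelled as nested data structures to show what "scaled atomically" means. This is a conceptual sketch only: the field names are hypothetical and do not mirror Grove's actual CRD schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PodClique:
    role: str  # e.g. "prefill-leader", "decode-worker", "frontend"
    pods: int


@dataclass
class PodCliqueScalingGroup:
    name: str
    cliques: List[PodClique]
    replicas: int = 1  # each replica is one complete model instance

    def pod_count(self) -> int:
        # Scaling multiplies every clique in the group together, never one alone.
        return self.replicas * sum(c.pods for c in self.cliques)


@dataclass
class PodCliqueSet:
    name: str
    groups: List[PodCliqueScalingGroup]
    standalone: List[PodClique] = field(default_factory=list)

    def pod_count(self) -> int:
        return (sum(g.pod_count() for g in self.groups)
                + sum(c.pods for c in self.standalone))


prefill = PodCliqueScalingGroup(
    "prefill",
    [PodClique("prefill-leader", 1), PodClique("prefill-worker", 3)],
    replicas=2,
)
decode = PodCliqueScalingGroup(
    "decode",
    [PodClique("decode-leader", 1), PodClique("decode-worker", 1)],
)
workload = PodCliqueSet("disagg", [prefill, decode], [PodClique("frontend", 1)])

print("pods before scaling:", workload.pod_count())
prefill.replicas = 3  # adds one leader and three workers as a unit
print("pods after scaling prefill:", workload.pod_count())
```

The point of the middle tier is visible in the last two lines: raising `replicas` on the prefill group adds a whole leader-plus-workers instance rather than a lone worker that cannot serve traffic.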

Grove creates PodGang resources consumed directly by KAI Scheduler, so topology-aware gang scheduling is a first-class concern in the deployment manifest, not something bolted on afterward.

Grove manifest for disaggregated inference:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-grove
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-v1-disagg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
    VllmDecodeWorker:
      dynamoNamespace: vllm-v1-disagg-router
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command: ["python3", "-m", "dynamo.vllm"]
          args: ["--model", "Qwen/Qwen3-0.6B"]
    VllmPrefillWorker:
      dynamoNamespace: vllm-v1-disagg-router
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command: ["python3", "-m", "dynamo.vllm"]
          args: ["--model", "Qwen/Qwen3-0.6B", "--is-prefill-worker"]

llm-d vs. Standalone vLLM: When Should You Upgrade?

Most teams reading this are running vLLM today. Here is a direct answer to the question worth asking: when does this stack add enough value to justify the operational complexity?

Standalone vLLM is sufficient when:

Single-node inference — one server, one model, traffic you can handle with a few replicas behind a load balancer
Low to moderate volume with heterogeneous prompts — prefix caching yields minimal benefit when every request has a unique leading context
No RDMA networking — NIXL requires InfiniBand or RoCE for KV cache transfer; standard Ethernet negates the disaggregation benefit
Fewer than 8 GPUs — you need enough capacity to split prefill and decode pools meaningfully

The full stack adds value when:

Multi-node inference is required because the model doesn’t fit on a single node
High-volume serving with repeated prefix patterns: RAG applications, chatbots with long system prompts, few-shot workloads where many users share the same prompt prefix
Multi-model serving on a shared GPU cluster where fair scheduling and preemption are operational requirements
Long-context workloads where prefill is expensive and caching even partial prefixes provides asymmetric latency gains
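The two checklists above reduce to a simple rule: RDMA and GPU count are hard prerequisites, and at least one of the "adds value" conditions has to hold. A minimal encoding, with hypothetical field names and thresholds taken only from this article, might look like this:

```python
from dataclasses import dataclass


@dataclass
class Environment:
    gpus: int
    has_rdma: bool                       # InfiniBand or RoCE available
    multi_node_model: bool = False       # model does not fit on a single node
    shared_prefix_traffic: bool = False  # RAG, long system prompts, few-shot
    multi_tenant_cluster: bool = False   # fair scheduling and preemption needed
    long_context: bool = False           # expensive prefill


def recommend(env: Environment) -> str:
    """Return 'full stack' or 'standalone vLLM' per the criteria in this article."""
    # Hard prerequisites: without these, disaggregation cannot pay off.
    if not env.has_rdma or env.gpus < 8:
        return "standalone vLLM"
    # At least one workload characteristic must justify the complexity.
    if (env.multi_node_model or env.shared_prefix_traffic
            or env.multi_tenant_cluster or env.long_context):
        return "full stack"
    return "standalone vLLM"


print(recommend(Environment(gpus=4, has_rdma=False)))
print(recommend(Environment(gpus=16, has_rdma=True, shared_prefix_traffic=True)))
print(recommend(Environment(gpus=16, has_rdma=True)))
```

It is deliberately conservative: plenty of GPUs and a fast fabric are not enough on their own if the traffic has no shared prefixes and the model fits on one node.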

What the cost improvements actually mean: The 40-60% GPU utilization improvement and 15-40% infrastructure cost reduction figures come from a specific mechanism. Disaggregation lets you right-size hardware per phase — H100 SXM5 or B200 for prefill (FLOPS-limited), H100 PCIe or A100 80GB for decode (memory-bandwidth-limited). Combined with independent autoscaling — scale decode workers up during high-output workloads, scale prefill workers up during high-concurrency prompt-heavy loads — you eliminate the over-provisioning that comes from treating both phases as the same resource type. The savings are real, but they depend on having a workload where the two phases actually have different resource profiles at scale.

How to Deploy the Kubernetes LLM Inference Stack

Prerequisites:

Kubernetes 1.29+ (v1.32+ for ComputeDomains, v1.34.2+ for GPU Allocation)
RDMA networking: InfiniBand or RoCE — standard Ethernet is not adequate for KV cache transfer via NIXL
8+ NVIDIA GPUs minimum for a meaningful disaggregated topology
NVIDIA Driver 580+, CDI enabled
Helm CLI, helmfile

Installation order:

Install GPU Operator (device plugin disabled) and DRA Driver (commands above)
Install KAI Scheduler
Clone the llm-d quickstart and deploy with helmfile
Define your Grove DynamoGraphDeployment

Verify with a test request:

export GATEWAY_IP=$(kubectl get gateway/llm-d-inference-gateway \
  -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')

curl -i http://${GATEWAY_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}]}'
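Since TTFT is the headline metric, it is worth measuring it directly rather than only checking for a 200 response. The sketch below builds the same request as the curl command with streaming enabled and times the arrival of the first streamed chunk. It assumes the gateway exposes an OpenAI-compatible streaming endpoint; the helper names are illustrative, and the first chunk may carry only role metadata, so treat the result as an approximation.

```python
import json
import time
import urllib.request


def build_chat_request(gateway_ip: str, prompt: str,
                       model: str = "openai/gpt-oss-120b") -> urllib.request.Request:
    """Build a streaming chat completion request against the gateway."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # server-sent events, so the first chunk can be timed
    }
    return urllib.request.Request(
        url=f"http://{gateway_ip}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def measure_ttft(request: urllib.request.Request, timeout: float = 60.0) -> float:
    """Seconds from sending the request to receiving the first 'data:' line."""
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for line in response:
            if line.startswith(b"data:"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any data chunk arrived")


# Example (requires a reachable gateway, so it is left commented out):
# ttft = measure_ttft(build_chat_request("203.0.113.10", "Hello"))
# print(f"TTFT: {ttft * 1000:.0f} ms")
```

Running it twice with the same long prompt gives a quick sanity check on prefix-aware routing: the second request should show a noticeably lower TTFT if it lands on a pod with the prefix cached.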

What Does the Kubernetes LLM Inference Stack Mean for Platform Teams?

Maturity: Both llm-d and KAI Scheduler are CNCF Sandbox projects. GPU DRA requires Kubernetes v1.32+ for ComputeDomains and v1.34.2+ for full GPU Allocation support. Grove is early-stage. CNCF Sandbox is explicit about API stability: breaking changes are expected before a project reaches Incubating status. This is appropriate for infrastructure experimentation, staging pipelines, and platform teams building toward production — not for regulated or latency-critical production workloads today.

Migration path from vLLM: The migration is additive. llm-d uses vLLM as its inference engine — your model weights, quantization config, and serving parameters carry over. The infrastructure lift is real: RDMA networking, DRA Driver installation (which requires disabling the existing device plugin), and KAI Scheduler. Plan that as a separate infrastructure project from the application migration.

If you’re also running durable AI agent workloads on Kubernetes, see our guide to Dapr Agents v1.0 on Kubernetes for durable agent orchestration on this same infrastructure.

What to watch: The DRA API graduation timeline through Kubernetes releases will determine when GPU DRA is ready for large-scale production. KAI Scheduler’s progression through CNCF maturity stages will be the operational stability signal. llm-d’s v1.0 release is the API stability milestone to wait for before treating the stack as locked-in infrastructure.

The architecture announced at KubeCon EU 2026 is internally coherent. These four components were designed to integrate — the PodGang API that Grove writes and KAI Scheduler reads, the InferencePool that the EPP manages and the Gateway routes into, the ComputeDomains that DRA exposes and TAS uses for placement. For platform teams building GPU inference infrastructure for the next two years, this is the direction the Kubernetes ecosystem is moving. The time to understand it is before you need to run it.