Istio 1.29 is production-ready for Kubernetes AI inference workloads. It delivers model-aware routing via the Gateway API Inference Extension (GIE) and a sidecarless ambient mode that cuts mesh proxy memory overhead on GPU nodes by approximately 70%. Experimental agentgateway support is not part of 1.29; it is planned for Istio 1.30.
Three years ago, the main service mesh concerns were mTLS between microservices and circuit breaking on REST APIs. Teams running LLM inference on GPU nodes today face a different set of problems: round-robin load balancing that parks long-running inference requests on already-saturated GPUs, sidecar proxies consuming hundreds of MB of node memory that inference servers could otherwise use, and AI agents that speak A2A and MCP rather than plain HTTP/gRPC.
Istio’s releases from 1.28 onward address all three - but the features span multiple versions and the CNCF announcement at KubeCon EU bundled shipped capabilities with experimental work still in progress. This post untangles what landed in which version, what the configurations actually look like, and what to hold off on until 1.30.
Why Does AI Inference Need a Different Service Mesh Approach?
Standard microservice load balancing assumes short-lived, roughly homogeneous requests. Inference requests break both assumptions. A single completion request can hold a GPU busy for seconds to minutes depending on output length. Round-robin distributes by request count, not GPU memory pressure or queue depth - the result is hotspots where some pods saturate while others sit idle.
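A toy model makes the hotspot effect concrete. The sketch below is a simplification under stated assumptions: the workload (every third request is a 60-second completion), the three-pod pool, and both function names are hypothetical, and assignment ignores wall-clock time. It compares balancing by request count with balancing by outstanding work.

```python
from itertools import cycle


def round_robin(durations, n_pods):
    """Assign requests to pods in turn; return total busy seconds per pod."""
    load = [0.0] * n_pods
    for pod, seconds in zip(cycle(range(n_pods)), durations):
        load[pod] += seconds
    return load


def least_loaded(durations, n_pods):
    """Assign each request to the pod with the least work assigned so far."""
    load = [0.0] * n_pods
    for seconds in durations:
        pod = min(range(n_pods), key=load.__getitem__)
        load[pod] += seconds
    return load


# Hypothetical workload: every third request is a long completion.
durations = [60.0, 1.0, 1.0] * 4

rr = round_robin(durations, 3)
ll = least_loaded(durations, 3)
print("round-robin busiest pod:", max(rr), "seconds")
print("load-aware busiest pod: ", max(ll), "seconds")
```

With this workload, round-robin sends every long completion to the same pod, while the load-aware policy spreads them across the pool even though both policies assign the same total work.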
The sidecar model makes this worse on GPU nodes. Every Envoy sidecar consumes roughly 50-200 MB of memory depending on configuration and route table size. Envoy runs on the CPU, so this is host memory rather than VRAM, but on a node with a 24 GB A10G running a pool of inference pods it is still memory taken from the inference servers' share of the node. Multiply across a node with eight inference pods and the loss reaches 400 MB to 1.6 GB before a single token is generated.
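For capacity planning, the per-node total is simple multiplication. The helper below is a hypothetical sketch that uses only the 50-200 MB per-sidecar range quoted above; the pod counts in the loop are illustrative.

```python
def sidecar_overhead_mb(pods_per_node, low_mb=50, high_mb=200):
    """Return the (low, high) total sidecar memory on one node, in MB."""
    return pods_per_node * low_mb, pods_per_node * high_mb


for pods in (4, 8, 16):
    low, high = sidecar_overhead_mb(pods)
    print(f"{pods:>2} pods: {low}-{high} MB held by sidecars")
```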
The service mesh abstractions that matter for inference are: routing that understands model identity and GPU state, a data plane that does not tax GPU memory, and traffic governance for the AI agents querying those models.
Gateway API Inference Extension: Model-Aware Routing on Kubernetes
The Gateway API Inference Extension (GIE) adds inference-specific routing to the standard Kubernetes Gateway API. Istio 1.28 (released November 2025) added InferencePool v1 support. Istio 1.29 (released February 2026) is fully conformant with GIE v1.0.1 and promotes GIE support to beta.
What Are the InferencePool and InferenceObjective CRDs?
GIE separates platform admin concerns from AI/ML team concerns through two resource types.
InferencePool is the platform admin resource. It defines a pool of pods running inference servers on shared GPU compute, maps to those pods via label selectors, and references an Endpoint Picker (EPP) deployment that makes per-request routing decisions.
InferenceObjective (renamed from InferenceModel in the v1 GA release) is the AI/ML owner resource. It maps a public model name like “llama3-70b” to backend models within an InferencePool, with traffic splitting weights and criticality policies that the EPP uses to prioritize requests.
The API group changed on GA: v1alpha2 used inference.networking.x-k8s.io, v1 uses inference.networking.k8s.io (the x- prefix dropped). If you shipped on v1alpha2, you need to migrate both CRD types and update every apiVersion field. The GA migration guide supports zero-downtime traffic shifting between v1alpha2 and v1 resources during rollout.
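As a rough illustration of the apiVersion change, the sketch below rewrites the group on a manifest that has already been parsed into a dict. The function name is hypothetical, and it only touches apiVersion: any field-level schema differences between v1alpha2 and v1, and the InferenceModel to InferenceObjective rename, still have to be handled by following the migration guide.

```python
OLD_GROUP = "inference.networking.x-k8s.io"
NEW_GROUP = "inference.networking.k8s.io"


def migrate_api_version(manifest):
    """Return a copy of the manifest with a v1alpha2 apiVersion moved to v1.

    Manifests from other API groups are returned unchanged.
    """
    migrated = dict(manifest)
    if migrated.get("apiVersion") == f"{OLD_GROUP}/v1alpha2":
        migrated["apiVersion"] = f"{NEW_GROUP}/v1"
    return migrated


old_pool = {
    "apiVersion": "inference.networking.x-k8s.io/v1alpha2",
    "kind": "InferencePool",
    "metadata": {"name": "vllm-llama3-70b"},
}
new_pool = migrate_api_version(old_pool)
print(new_pool["apiVersion"])
```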
graph TD
PA[Platform Admin] -->|owns| IP[InferencePool\napiVersion: inference.networking.k8s.io/v1]
ML[AI/ML Owner] -->|owns| IO[InferenceObjective\napiVersion: inference.networking.k8s.io/v1]
IO -->|poolRef| IP
IP -->|label selector| PODS[vLLM / Triton Pods\nGPU nodes]
IP -->|extensionRef| EPP[Endpoint Picker\next-proc deployment]
HR[HTTPRoute] -->|backendRef kind: InferencePool| IP
style PA fill:#1e3a5f,color:#fff
style ML fill:#1e3a5f,color:#fff
style IP fill:#0f4c75,color:#fff
style IO fill:#0f4c75,color:#fff
style EPP fill:#1b6ca8,color:#fff
style HR fill:#2c7bb6,color:#fff
style PODS fill:#163d60,color:#fff
Platform admins own the InferencePool (compute and routing infrastructure). AI/ML teams own InferenceObjective resources (model routing policies). HTTPRoute references the InferencePool as its backend.
How the Endpoint Picker Routes Inference Traffic
The Endpoint Picker is an ext-proc deployment wired into the Envoy gateway’s external processing filter. The optional Body Based Router (BBR) is a second ext-proc that runs first: it extracts the requested model name from the OpenAI-compatible request body and adds it as an X-Gateway-Base-Model-Name header so the EPP knows which pool and model to consider.
sequenceDiagram
participant C as Client
participant GW as Istio Gateway<br>(Envoy)
participant BBR as Body Based Router<br>(ext-proc)
participant EPP as Endpoint Picker<br>(ext-proc)
participant M as vLLM Pod
C->>GW: POST /v1/completions\n{"model": "llama3-70b", ...}
GW->>BBR: ext-proc request
BBR->>GW: X-Gateway-Base-Model-Name: llama3-70b
GW->>EPP: ext-proc request with model header
EPP->>EPP: Check live metrics\n(queue depth, GPU memory,\nloaded LoRA adapters)
EPP->>GW: Selected pod endpoint
GW->>M: Route to least-loaded pod
M->>C: Streaming inference response
BBR extracts the model name from the request body; EPP consults live pod metrics to select the optimal pod. Standard gateways route by URL; GIE routes by GPU state.
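The two ext-proc steps can be sketched in a few lines. This is not the EPP's actual scoring logic: the pod metric fields (queue_depth, gpu_mem_util, loaded_adapters), the addresses, and the ordering of the criteria are assumptions made for illustration. Only the header name comes from the description above.

```python
import json
from typing import Optional

MODEL_HEADER = "X-Gateway-Base-Model-Name"  # header name from the text above


def extract_model_header(body: bytes) -> dict:
    """BBR step: read the 'model' field of an OpenAI-style JSON request body."""
    try:
        payload = json.loads(body)
    except ValueError:  # malformed JSON or invalid UTF-8
        return {}
    model = payload.get("model") if isinstance(payload, dict) else None
    return {MODEL_HEADER: model} if isinstance(model, str) else {}


def pick_endpoint(pods: list, adapter: Optional[str] = None) -> str:
    """EPP step (toy scoring): prefer pods that already hold the LoRA adapter,
    then the shortest queue, then the lowest GPU memory utilization."""
    def score(pod):
        has_adapter = adapter is not None and adapter in pod["loaded_adapters"]
        return (not has_adapter, pod["queue_depth"], pod["gpu_mem_util"])
    return min(pods, key=score)["address"]


# Hypothetical live metrics for three pods in one pool.
pods = [
    {"address": "10.0.0.1:8000", "queue_depth": 7, "gpu_mem_util": 0.91,
     "loaded_adapters": {"sql-lora"}},
    {"address": "10.0.0.2:8000", "queue_depth": 1, "gpu_mem_util": 0.40,
     "loaded_adapters": set()},
    {"address": "10.0.0.3:8000", "queue_depth": 3, "gpu_mem_util": 0.55,
     "loaded_adapters": {"sql-lora"}},
]

headers = extract_model_header(b'{"model": "llama3-70b", "prompt": "hi"}')
print(headers, pick_endpoint(pods), pick_endpoint(pods, adapter="sql-lora"))
```

Without an adapter the shortest queue wins; with an adapter, the sketch accepts a somewhat longer queue to avoid a cold adapter load.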
Benchmarks from the Kubernetes blog tested this on 10 H100 80GB GPUs running vLLM v1 with Llama2 at 100-1000 QPS (ShareGPT dataset). GIE delivered comparable throughput to standard Kubernetes Services while achieving meaningfully lower p90 per-output-token latency above 500 QPS - the range where GPU memory saturation causes hotspots under round-robin. Model-aware routing avoids the scenario where long-running batch completions monopolize pods while the queue backs up behind them.
How Do You Configure GIE with Istio 1.29?
InferencePool manifest:
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: vllm-llama3-70b
spec:
  targetPorts:
    - number: 8000
  selector:
    app: vllm-llama3-70b
  extensionRef:
    name: vllm-llama3-70b-epp
    port: 9002
    failureMode: FailOpen
Wire it as an HTTPRoute backend:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/completions
      backendRefs:
        - name: vllm-llama3-70b
          kind: InferencePool
          group: inference.networking.k8s.io
failureMode: FailOpen means requests fall through to standard load balancing if the EPP is unavailable. For workloads that depend on LoRA adapter locality - where routing to the wrong pod means a cold adapter load - you may prefer FailClose to reject requests rather than route blindly.
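The difference between the two failure modes can be sketched as a fallback decision. This is an illustration of the behavior described above, not gateway code: the PickerUnavailable exception, the route function, and the random fallback standing in for "standard load balancing" are all hypothetical.

```python
import random


class PickerUnavailable(Exception):
    """Raised by a picker callable when the EPP cannot be reached (hypothetical)."""


def route(pods, picker, failure_mode="FailOpen", rng=random):
    """Ask the picker for a pod; on picker failure, apply the failure mode."""
    try:
        return picker(pods)
    except PickerUnavailable:
        if failure_mode == "FailClose":
            raise  # reject the request rather than route blindly
        return rng.choice(pods)  # FailOpen: picker-unaware balancing


def broken_picker(pods):
    """Stand-in for an EPP that is down."""
    raise PickerUnavailable("EPP timed out")


pods = ["10.0.0.1:8000", "10.0.0.2:8000"]
print(route(pods, broken_picker))  # some pod, chosen without GPU awareness
```

Under FailClose the same call raises, which is the behavior you would want when a blind choice risks a cold LoRA adapter load.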
Ambient Multicluster: Sidecarless Mesh for GPU Workloads
Ambient mode replaces per-pod Envoy sidecars with two shared components:
- ztunnel: A Rust-based L4 proxy deployed as a DaemonSet, one instance per node. Handles mTLS and identity-based authorization without per-pod overhead.
- Waypoint proxy: An optional Envoy deployment per namespace for L7 features (HTTP routing, per-request load balancing, distributed tracing). Supports HPA for traffic-proportional scaling.
graph LR
subgraph Sidecar Mode
direction TB
PA[Pod A\n+ Envoy Sidecar\n~100-200 MB] -->|2 proxy hops| PB[Pod B\n+ Envoy Sidecar\n~100-200 MB]
end
subgraph Ambient Mode
direction TB
PC[Pod C\nno sidecar] --> ZT[ztunnel DaemonSet\nshared per node\nL4 mTLS only]
ZT -->|L7 optional| WP[Waypoint Proxy\nnamespace-scoped\nHPA-enabled]
WP -->|1 proxy hop| PD[Pod D\nno sidecar]
end
Sidecar mode adds 100-200 MB per pod and two proxy hops per request. Ambient mode uses a shared ztunnel per node for L4 mTLS and an optional waypoint for L7 features - one hop, no per-pod overhead.
Community benchmarks comparing equivalent workloads show approximately 70% memory reduction (per-node ztunnel vs per-pod sidecar allocation), and p90 latency drops from 0.63ms to 0.16ms - a 74% reduction. On GPU nodes, removing per-pod sidecar overhead returns that host memory to the inference servers or allows higher pod density.
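The percentages are easy to re-derive, and the same arithmetic can be applied to your own measurements. In the sketch below the latency figures come from the benchmark quoted above; the node with eight pods at 200 MB per sidecar is a hypothetical input, and applying the headline 70% to it is an estimate, not a measurement.

```python
def percent_reduction(before, after):
    """Relative reduction from 'before' to 'after', in percent."""
    return (before - after) / before * 100.0


# p90 latency in milliseconds, sidecar vs ambient.
latency_cut = percent_reduction(0.63, 0.16)

# Hypothetical node: eight pods, 200 MB per sidecar, 70% memory reduction.
sidecar_total_mb = 8 * 200
ambient_estimate_mb = sidecar_total_mb * (1 - 0.70)

print(f"p90 latency reduction: {latency_cut:.1f}%")
print(f"estimated proxy memory: {sidecar_total_mb} MB -> {ambient_estimate_mb:.0f} MB")
```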
How Do You Configure Cross-Cluster Inference with Ambient Mode?
Ambient multicluster support for multi-network deployments (separate VPCs or cloud regions) went beta in Istio 1.29. Waypoints now route traffic to remote clusters over HBONE connections, carrying peer metadata via baggage headers for end-to-end observability. This enables active-active inference deployments across regions with outlier detection ejecting degraded endpoints:
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: inference-multicluster
spec:
  host: inference-pool.inference.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s
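To show what these outlierDetection settings do, here is a deliberately simplified model of consecutive-5xx ejection. It is a sketch, not Envoy's implementation: it ignores the sweep interval, the growth of ejection time on repeated ejections, and any cap on how many hosts may be ejected. The class and method names are hypothetical.

```python
class OutlierTracker:
    """Eject an endpoint after N consecutive 5xx responses (simplified)."""

    def __init__(self, consecutive_5xx=3, base_ejection_s=30.0):
        self.consecutive_5xx = consecutive_5xx
        self.base_ejection_s = base_ejection_s
        self.streak = {}         # endpoint -> current run of 5xx responses
        self.ejected_until = {}  # endpoint -> time at which it may return

    def record(self, endpoint, status, now):
        """Record one response; eject the endpoint if its streak hits the limit."""
        if status >= 500:
            self.streak[endpoint] = self.streak.get(endpoint, 0) + 1
            if self.streak[endpoint] >= self.consecutive_5xx:
                self.ejected_until[endpoint] = now + self.base_ejection_s
                self.streak[endpoint] = 0
        else:
            self.streak[endpoint] = 0  # any non-5xx response resets the streak

    def healthy(self, endpoints, now):
        """Return the endpoints that are not currently ejected."""
        return [e for e in endpoints if self.ejected_until.get(e, 0.0) <= now]


tracker = OutlierTracker()
for t in (0.0, 1.0, 2.0):
    tracker.record("cluster-b", 503, now=t)

print(tracker.healthy(["cluster-a", "cluster-b"], now=10.0))
print(tracker.healthy(["cluster-a", "cluster-b"], now=40.0))
```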
For model canary rollouts, HTTPRoute traffic splitting applies across clusters:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-canary
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            value: /v1/completions
      backendRefs:
        # port is required for Service backends; 8000 is an assumed value
        - name: llm-model-v2
          port: 8000
          weight: 20
        - name: llm-model-v1
          port: 8000
          weight: 80
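The weights are relative shares, so 20 and 80 send roughly one request in five to the canary. The sketch below simulates that split with a seeded random generator; the pick_backend helper is hypothetical and only illustrates the proportion, not how the gateway selects backends internally.

```python
import random
from collections import Counter


def pick_backend(backends, rng):
    """Choose one backend name with probability proportional to its weight."""
    names = [name for name, _ in backends]
    weights = [weight for _, weight in backends]
    return rng.choices(names, weights=weights, k=1)[0]


backends = [("llm-model-v2", 20), ("llm-model-v1", 80)]
rng = random.Random(42)  # seeded so repeated runs give the same counts
counts = Counter(pick_backend(backends, rng) for _ in range(10_000))

canary_share = counts["llm-model-v2"] / 10_000
print(f"canary share: {canary_share:.1%}")
```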
How Do You Migrate from Sidecar Mode to Ambient Mode?
Migration is namespace-scoped and reversible. The five-phase sequence:
# Phase 1: Install Istio 1.29+ with the ambient profile
istioctl install --set profile=ambient
# Phase 2: Enable ambient on a dev/staging namespace (no pod restart required)
kubectl label namespace inference istio.io/dataplane-mode=ambient
# Phase 3: Deploy waypoint proxy for L7 features
istioctl waypoint apply -n inference --enroll-namespace
# Phase 4: Verify ztunnel DaemonSet is running on all nodes
kubectl get daemonset -n istio-system ztunnel
# Phase 5: Validate error rates and p99 latency before migrating production
To roll back any namespace, remove the label (the trailing hyphen deletes it): kubectl label namespace inference istio.io/dataplane-mode-
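Phase 5 is the only step without a command, so it helps to decide the pass criteria up front. The sketch below is one possible gate, with hypothetical thresholds (p99 within 10% of the sidecar baseline, error rate up by at most 0.1 percentage points) and a hypothetical input shape; substitute whatever your own SLOs require.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def safe_to_promote(baseline, candidate, max_p99_ratio=1.10, max_error_delta=0.001):
    """Compare an ambient candidate against the sidecar baseline.

    Both arguments are dicts with 'latencies_ms' (list of floats) and
    'error_rate' (fraction between 0 and 1).
    """
    p99_ok = p99(candidate["latencies_ms"]) <= max_p99_ratio * p99(baseline["latencies_ms"])
    errors_ok = candidate["error_rate"] - baseline["error_rate"] <= max_error_delta
    return p99_ok and errors_ok


baseline = {"latencies_ms": [float(i) for i in range(1, 101)], "error_rate": 0.002}
candidate = {"latencies_ms": [0.9 * i for i in range(1, 101)], "error_rate": 0.002}
print("promote ambient to production:", safe_to_promote(baseline, candidate))
```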
Known limitations to plan around: single-network multicluster is still alpha in 1.29 (only multi-network reached beta), and the east-west gateway may show uneven endpoint preference in some traffic patterns. The AMBIENT_ENABLE_MULTI_NETWORK_WAYPOINT pilot env var can disable the recent waypoint behavior change if you encounter issues. Do not promote ambient multicluster to production without staging validation.
agentgateway: Governing AI Agent Traffic in the Mesh
agentgateway is a Linux Foundation project (donated by Solo.io in August 2025) built specifically for AI agent traffic. Where Envoy was designed for HTTP microservices and extended to handle AI workloads, agentgateway was built from scratch in Rust for agent-to-agent, agent-to-tool, and agent-to-LLM traffic patterns.
graph TD
AG[AI Agent] --> GW[agentgateway\nRust / ztunnel-heritage\nLinux Foundation]
GW -->|A2A protocol| OA[Other AI Agents]
GW -->|MCP protocol| TL[Tools and APIs]
GW -->|LLM Provider API| FM[Foundation Models]
subgraph Security and Observability
RBAC[RBAC for MCP servers]
MTLS[mTLS + agent identity]
OT[OpenTelemetry tracing\naudit trails]
end
GW --- RBAC
GW --- MTLS
GW --- OT
agentgateway applies security and observability uniformly across three traffic types: agent-to-agent (A2A), agent-to-tool (MCP), and agent-to-model (LLM API).
The project supports Agent2Agent (Google’s interoperability standard for agent communication) and the Model Context Protocol for agent-to-tool interactions. Security features include RBAC for MCP servers, mTLS, and agent identity management with impersonation and delegation controls. OpenTelemetry integration provides request-response tracing for both LLM calls and tool invocations, producing the audit trails that compliance-sensitive deployments require.
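Conceptually, RBAC for MCP servers is a deny-by-default lookup from agent identity to the tools that identity may call. The sketch below shows that idea only. It is not agentgateway's configuration format or API: the policy table, the SPIFFE-style identities, and the server and tool names are all hypothetical.

```python
# Hypothetical policy: agent identity -> set of (mcp_server, tool) pairs it may call.
POLICY = {
    "spiffe://cluster.local/ns/agents/sa/support-bot": {
        ("tickets-mcp", "search_tickets"),
        ("tickets-mcp", "add_comment"),
    },
    "spiffe://cluster.local/ns/agents/sa/report-bot": {
        ("warehouse-mcp", "run_query"),
    },
}


def is_allowed(identity, server, tool, policy=POLICY):
    """Deny by default: allow only (server, tool) pairs listed for the identity."""
    return (server, tool) in policy.get(identity, frozenset())


support_bot = "spiffe://cluster.local/ns/agents/sa/support-bot"
print(is_allowed(support_bot, "tickets-mcp", "search_tickets"))
print(is_allowed(support_bot, "warehouse-mcp", "run_query"))
```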
If you have Kyverno policies governing MCP traffic at admission time, agentgateway is the complementary data plane: Kyverno enforces policy at deployment, agentgateway governs in-flight agent traffic.
When Will agentgateway Be Integrated with Istio?
Agentgateway integration with Istio is experimental and targets Istio 1.30. As of April 20, 2026, the implementation issue (GitHub #59209, opened February 24, 2026) is still open. The work is broken into 7 sequential PRs, covering xDS generation from KRT collections, gateway controller refactoring, GatewayClass implementation, gateway and listener collections, route collection refactor, and controller assembly. None have merged yet.
You can run agentgateway as a standalone proxy today for MCP and A2A governance. For Istio data plane integration, wait for 1.30.
The Bigger Picture: llm-d, GIE, and Kubernetes Inference Convergence
GIE, Istio ambient mode, and agentgateway represent three layers of a converging Kubernetes AI inference platform. The Endpoint Picker that GIE introduced is currently migrating from the GIE repository into the llm-d inference scheduler (llm-d-inference-scheduler). The InferencePool API and protocol definitions stay within the Kubernetes SIG-Network organization for vendor neutrality; llm-d builds advanced scheduling on top with features like P/D (Prefill/Decode) disaggregation, KV-cache-aware LoRA routing, and scale-to-zero autoscaling.
graph BT
subgraph GPU_Nodes [GPU Nodes]
VL[vLLM / Triton\nmodel servers]
end
subgraph Control_Plane [Kubernetes Control Plane]
IP2[InferencePool CRDs\nGIE v1 - Kubernetes SIG-Network]
LLD[llm-d inference-scheduler\nEPP with P/D disaggregation\nKV-cache routing, LoRA affinity]
KS[KServe v0.17\nmodel lifecycle, HPA/KEDA autoscaling]
end
subgraph Mesh [Service Mesh - Istio]
ZT2[ztunnel DaemonSet\nL4 mTLS, per-node]
WP2[Waypoint Proxy\nL7 routing, canary]
AGIG[agentgateway\nA2A / MCP governance\nexperimental - Istio 1.30]
end
subgraph Obs [Observability]
OT2[OpenTelemetry\nmetrics, traces, audit]
end
VL --> IP2
IP2 --> LLD
KS --> VL
ZT2 --> WP2
WP2 --> IP2
AGIG --> ZT2
LLD --> OT2
WP2 --> OT2
The converging K8s AI inference stack: agentgateway and Istio ambient handle traffic governance; GIE provides routing CRDs; llm-d handles advanced scheduling; vLLM and KServe serve the models.
The broader trend is standardization on OpenAI-compatible APIs as the wire format, with Kubernetes-native CRDs handling infrastructure policy. vLLM, KServe, and Llama Stack all expose OpenAI-format APIs. GIE routes to any of them via InferencePool without depending on which serving framework is underneath - the routing layer and the serving layer are decoupled.
Getting Started: What to Deploy Today
Three decisions with clear answers today:
Istio 1.29 + GIE v1 for inference routing is production-ready. Deploy an InferencePool, wire an HTTPRoute, and run the EPP alongside your vLLM pods. If you are on GIE v1alpha2, the GA migration guide covers the upgrade path with zero-downtime traffic shifting. Istio 1.29 is conformant with GIE v1.0.1.
Ambient mode migration for memory savings on GPU nodes needs staging validation first, and its multicluster support is beta at best. Use the 5-phase migration, measure your actual memory and latency delta against your current sidecar configuration, and validate that the rollback path works in your environment before touching production. Multi-network multicluster is beta; single-network is still alpha in 1.29.
agentgateway is ready to evaluate standalone if you are building agent infrastructure today. The Linux Foundation project is available now. For Istio data plane integration, wait for 1.30 - it is not yet shipped, and the implementation is still in active development.
Frequently Asked Questions
Which Istio version do I need for AI inference routing?
Istio 1.29 (released February 16, 2026) is conformant with the Gateway API Inference Extension v1.0.1 and promotes multi-network ambient multicluster to beta. Istio 1.28 added initial InferencePool v1 support. For agentgateway integration with the Istio data plane, wait for Istio 1.30, which is still in development as of April 2026.
How much memory does ambient mode save on GPU nodes compared to sidecars?
Community benchmarks show approximately 70% memory savings by replacing per-pod sidecar proxies with a shared per-node ztunnel DaemonSet. For GPU nodes running inference pods, this removes the 50-200 MB each sidecar would otherwise consume - host memory you can return to the inference servers or use for additional pod density. p90 latency also drops from 0.63ms (sidecar) to 0.16ms (ambient), a 74% reduction, with similar improvements at p99.
What happened to InferenceModel? I see InferenceObjective in the docs now.
The GIE v1 GA release renamed InferenceModel to InferenceObjective and changed the API group from inference.networking.x-k8s.io to inference.networking.k8s.io (removing the x- prefix that marks pre-GA Kubernetes extensions). If you have existing v1alpha2 resources, the GA migration guide covers zero-downtime traffic shifting between v1alpha2 and v1 during the transition period.
Is the Istio agentgateway integration ready for production?
No. Agentgateway integration targets Istio 1.30 as an experimental feature and is still in active development. The implementation GitHub issue (#59209) is open as of April 2026, with 7 sequential PRs planned but not yet merged. You can use agentgateway as a standalone proxy today for MCP and A2A traffic governance without waiting for Istio integration.
How does the Gateway API Inference Extension differ from KServe?
GIE handles the routing layer: it standardizes how inference requests reach model servers using Kubernetes Gateway API primitives (InferencePool, InferenceObjective, HTTPRoute). KServe is a higher-level serving platform that manages model lifecycle, multi-framework support, and autoscaling via KEDA/HPA. They are complementary - GIE routes traffic to KServe-managed model servers. The llm-d project builds advanced scheduling algorithms (P/D disaggregation, KV-cache routing) on top of GIE’s InferencePool CRDs.