Your AI/ML team wants to deploy agents on your Kubernetes cluster. You have StatefulSets, PVCs, a solid HPA configuration, and years of K8s operations behind you. How hard can agents be?
Harder than you expect. Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027, and infrastructure is one of the primary failure modes alongside unclear ROI and inadequate risk controls. Standard Kubernetes patterns handle stateless, predictable microservices. AI agents are stateful, bursty, and capable of executing arbitrary code. They spawn sub-agents. They make recursive tool calls. They idle for hours and then consume a full GPU for thirty seconds.
An agent-ready Kubernetes platform is a cluster configured with Agent Sandbox for workload management, hardened container runtimes for code isolation, DRA for GPU scheduling, KEDA for scale-to-zero autoscaling, and OpenTelemetry GenAI conventions for distributed tracing across agent reasoning chains.
This guide is not about agent frameworks. It covers the Kubernetes platform layer: the specific primitives, configuration patterns, and architectural decisions that make a cluster capable of safely running agent workloads at scale.
Why Do Agent Workloads Break Standard Kubernetes Setups?
Four properties make agent workloads fundamentally different from microservices:
Bursty compute. A microservice handles requests at roughly predictable throughput. An agent completes a multi-step reasoning task in seconds, then sits idle for minutes or hours. CPU and memory utilization spikes in ways HPA was not designed for. GPU demand is even more erratic: an agent doing local inference can go from zero to full-GPU utilization, then release entirely.
Stateful execution chains. Agents maintain working memory, tool context, and task state across long-running operations. A pod restart mid-chain loses everything. StatefulSets provide stable identity but lack scale-to-zero with state preservation, warm pool management, and agent-specific lifecycle semantics.
Unpredictable resource footprint. An agent that spawns sub-agents creates cascading resource demand. One orchestrator becomes five specialist agents, each making concurrent LLM calls and tool invocations. Resource requests set at deploy time cannot account for this dynamic expansion.
Untrusted code execution. Agents with shell access and code generation run arbitrary, unreviewed code inside your cluster. Sharing a host kernel with this workload - the default for standard containers - violates any reasonable security boundary.
These four properties require specific platform decisions. The rest of this guide walks through each one.
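To make the cascading-demand point concrete, here is a minimal arithmetic sketch. The per-agent sizes (500m CPU, 2 GiB) and the `AgentRequest` and `peak_demand` names are illustrative assumptions, not measurements from a real cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRequest:
    cpu_millicores: int
    memory_mib: int


def peak_demand(orchestrator: AgentRequest, specialist: AgentRequest, fan_out: int) -> AgentRequest:
    """Peak footprint when one orchestrator runs alongside `fan_out` specialists."""
    return AgentRequest(
        cpu_millicores=orchestrator.cpu_millicores + fan_out * specialist.cpu_millicores,
        memory_mib=orchestrator.memory_mib + fan_out * specialist.memory_mib,
    )


# Hypothetical sizes: every agent requests 500m CPU and 2 GiB of memory.
base = AgentRequest(cpu_millicores=500, memory_mib=2048)
peak = peak_demand(base, base, fan_out=5)
print(peak)  # AgentRequest(cpu_millicores=3000, memory_mib=12288)
```

A request sized for the orchestrator alone understates the peak by a factor of six in this example, which is why deploy-time requests alone are not enough.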
What Is the Agent Platform Stack for Kubernetes?
The Kubernetes primitives for agent workloads form a layered stack. Each layer addresses one of the core infrastructure problems agents create.
graph BT
A["K8s Cluster Base\n(nodes, kubelet, CNI)"] --> B
B["Container Runtime\n(gVisor / Kata / Firecracker)"] --> C
C["Agent Sandbox CRD\n(agents.x-k8s.io/v1alpha1)"] --> D
D["GPU Scheduling\n(DRA ResourceClaims)"] --> E
E["Network Isolation\n(NetworkPolicy - default-deny egress)"] --> F
F["Storage Layer\n(PVCs + OCI VolumeSource)"] --> G
G["Autoscaling\n(KEDA - scale to zero)"] --> H
H["Observability\n(OTel Collector - GenAI conventions)"] --> I
I["Agent Workloads\n(orchestrators, tools, sub-agents)"]
The agent platform stack. Each layer handles a distinct infrastructure concern. Skipping any one of them creates operational gaps that surface as reliability or security incidents in production.
No existing content combines all of these into a single guide with working YAML. The layers interact: isolation tier affects which DRA features you can use, NetworkPolicy scope determines your egress allowlist design, and OTel instrumentation depends on how you structure namespace boundaries.
What Is Agent Sandbox, the New Kubernetes Primitive for Agent Runtimes?
The Kubernetes SIG Apps team built Agent Sandbox (kubernetes-sigs/agent-sandbox) as a purpose-built CRD for AI agent runtimes. It is currently alpha at v0.3.10 (released April 8, 2026) and represents the direction upstream Kubernetes is heading - not a stable API to run in production today without accepting breakage risk.
The core resource is a Sandbox (agents.x-k8s.io/v1alpha1): a single-container stateful pod with stable identity, persistent storage, and lifecycle management built in. Compare what you need to build manually today with what the CRD provides:
Without Agent Sandbox:
- StatefulSet definition
- Headless Service for stable DNS
- PVC template for persistence
- Custom controller or operator for scale-to-zero
- Separate logic for pre-warmed pool management
- Manual state checkpoint and restoration on resume
With Agent Sandbox:
apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: research-agent
  namespace: ai-agents
spec:
  podTemplate:
    spec:
      runtimeClassName: gvisor
      containers:
        - name: agent
          image: my-agent-runtime:latest
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
Install the CRD set:
export VERSION="v0.3.10"
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/manifest.yaml
# Optional: warm pools and SandboxClaims
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/extensions.yaml
The extensions add three additional resources: SandboxTemplate for reusable agent configurations, SandboxWarmPool for pre-provisioned idle sandboxes, and SandboxClaim to claim a pre-warmed pod from a pool. Google Cloud reports sub-second startup latency for claimed sandboxes from warm pools - a 90% improvement over cold start provisioning.
How Does Agent Sandbox Manage Pod Lifecycle?
stateDiagram-v2
[*] --> Provisioning: kubectl apply Sandbox
Provisioning --> Running: Pod scheduled and started
Running --> Suspending: Scale-to-zero triggered
Suspending --> Suspended: State preserved, compute released
Suspended --> Resuming: Task arrives
Resuming --> Running: State restored, identity preserved
state WarmPool {
[*] --> PreProvisioned: SandboxWarmPool controller
PreProvisioned --> Claimed: SandboxClaim submitted
Claimed --> Running2: Agent starts immediately
Running2: Running
}
Agent Sandbox lifecycle. The suspension cycle preserves state and stable identity while releasing compute. The warm pool path eliminates provisioning latency entirely for high-throughput workloads.
The key behavior in the Suspended -> Running transition: stable hostname, DNS, and PVC remain bound across the cycle. The agent resumes with its working memory and tool context intact, from the same pod identity.
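The suspension cycle in the lifecycle diagram can be modeled as a small state machine. This is a toy sketch for reasoning about legal transitions, not the Sandbox controller's implementation; the `State`, `TRANSITIONS`, and `advance` names are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    PROVISIONING = "Provisioning"
    RUNNING = "Running"
    SUSPENDING = "Suspending"
    SUSPENDED = "Suspended"
    RESUMING = "Resuming"


# Allowed transitions, copied from the suspension cycle in the lifecycle diagram.
TRANSITIONS = {
    State.PROVISIONING: {State.RUNNING},
    State.RUNNING: {State.SUSPENDING},
    State.SUSPENDING: {State.SUSPENDED},
    State.SUSPENDED: {State.RESUMING},
    State.RESUMING: {State.RUNNING},
}


def advance(current: State, target: State) -> State:
    """Return `target` if the diagram allows the move, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = State.RUNNING
for step in (State.SUSPENDING, State.SUSPENDED, State.RESUMING, State.RUNNING):
    state = advance(state, step)
print(state.value)  # Running
```

Note that a Suspended sandbox cannot jump straight to Running in this model: it always passes through Resuming, which is where state restoration happens.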
Because this is alpha, expect API changes. Track the GitHub repo and the #agent-sandbox channel in CNCF Slack before committing to the CRD in production workloads.
What Is Google Scion and How Does It Approach Multi-Agent Isolation?
Alongside Agent Sandbox, Google open-sourced Scion on April 7, 2026, as an experimental multi-agent orchestration testbed. Scion describes itself as a “hypervisor for agents”: each agent runs in its own container with a dedicated git worktree and isolated credentials. Rather than restricting what agents can do at the API level, Scion runs agents without capability restrictions and enforces boundaries at the infrastructure layer.
Scion supports Docker, Podman, and Kubernetes runtimes and emits normalized OTel telemetry. Its K8s runtime support is early-stage, but the isolation model (per-agent container, per-agent credentials, shared workspace with git worktrees) is a useful reference for designing multi-agent namespacing on K8s.
Which Container Isolation Tier Should You Use for Agent Workloads?
Every agent with shell access or code generation needs stronger isolation than a standard container provides. Standard containers share the host kernel: a container escape gives access to every other workload on the node.
The main runtime options and their tradeoffs:
| Runtime | Isolation Mechanism | Boot Time | Overhead | Best For |
|---|---|---|---|---|
| Standard containers | Shared host kernel | ms | Minimal | Trusted, audited code only |
| gVisor | User-space kernel, syscall interception | ms | 10-30% I/O | Compute-heavy agents, multi-tenant SaaS |
| Kata Containers | Lightweight VM via KVM | ~200ms | Moderate | Production K8s, regulated environments |
| Firecracker MicroVMs | Dedicated kernel via KVM | ~125ms | <5 MiB/VM | Multi-tenant, fully untrusted code execution |
For production agents executing arbitrary code: Kata Containers or gVisor. Firecracker is the strongest option for fully untrusted, multi-tenant environments but adds hardware and operational requirements (bare-metal or KVM-capable nodes, Firecracker device model).
Configure via RuntimeClass - create the class once per cluster and reference it in pod specs:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata-qemu
Agent Sandbox supports runtimeClassName directly in its podTemplate.spec, so isolation is configured as part of the same resource.
flowchart TD
A["Agent executes arbitrary code?"] -->|No| B["Trusted workloads only?"]
A -->|Yes| C["Fully untrusted / multi-tenant?"]
C -->|Yes| D["Firecracker MicroVMs"]
C -->|No| E["Kata Containers or gVisor"]
B -->|Yes| F["Standard containers"]
B -->|No| G["gVisor minimum"]
E --> H["Use runtimeClassName in Sandbox podTemplate"]
D --> H
F --> I["No RuntimeClass needed"]
G --> H
Isolation tier selection. Most production AI agents with tool use fall into the Kata Containers or gVisor branch.
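The same decision tree can be written as a function, which is handy if you want an admission check or a platform CLI to suggest a RuntimeClass. This is a sketch of the flowchart only; the function name and boolean parameters are assumptions for illustration.

```python
def select_isolation(
    executes_arbitrary_code: bool,
    fully_untrusted_multi_tenant: bool = False,
    trusted_only: bool = False,
) -> str:
    """Mirror the isolation flowchart: return the recommended tier as a label."""
    if executes_arbitrary_code:
        if fully_untrusted_multi_tenant:
            return "Firecracker MicroVMs"
        return "Kata Containers or gVisor"
    # No arbitrary code execution: standard containers only for trusted workloads.
    if trusted_only:
        return "Standard containers"
    return "gVisor minimum"


print(select_isolation(executes_arbitrary_code=True))
# Kata Containers or gVisor
```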
How Do You Schedule GPUs for AI Agents with Dynamic Resource Allocation?
The nvidia.com/gpu extended resource model allocates entire GPUs to pods. A single inference call from one agent that holds a full A100 for ten seconds blocks every other agent on the same node. DRA gives you fine-grained allocation semantics that extended resources cannot.
DRA core reached stable in v1.34. K8s 1.36 (late April 2026) adds DRA Admin Access (GA) for centralized ResourceClaim management, expanded KubeletPodResources reporting for DRA (GA) with P99 under 100ms, partitionable devices (beta) for GPU sharing, and device taints and tolerations (beta) for health-aware scheduling. For the production upgrade path to K8s 1.36, see our Kubernetes 1.36 Production Upgrade Guide.
A ResourceClaim expresses what the agent pod needs from the GPU pool:
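apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: agent-gpu
  namespace: ai-agents
spec:
  devices:
    requests:
      - name: gpu-req
        # resource.k8s.io/v1 nests a single-device request under "exactly"
        exactly:
          deviceClassName: gpu.example.com
          selectors:
            - cel:
                # Driver name and capacity key are placeholders: use the values your DRA driver publishes
                expression: "device.driver == \"gpu.nvidia.com\" && device.capacity[\"gpu.nvidia.com\"].memory.compareTo(quantity(\"8Gi\")) >= 0"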
Reference the claim from the pod spec:
spec:
  resourceClaims:
    - name: gpu
      resourceClaimName: agent-gpu
  containers:
    - name: agent
      resources:
        claims:
          - name: gpu
Partitionable devices (beta in v1.36) allow splitting one physical GPU into logical slices allocated to different agent pods. This is the DRA integration point for NVIDIA MIG (Multi-Instance GPU). An A100 can be partitioned into up to 7 MIG instances. Each agent requests a slice via ResourceClaim, and the DRA driver allocates the appropriate MIG profile. One GPU serves seven concurrent agents instead of one.
Device taints and tolerations (beta in v1.36) mark GPU devices as degraded or under maintenance. Scheduling automatically avoids tainted devices without removing them from the node pool. Useful when a MIG partition reports ECC errors but the rest of the GPU is healthy.
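A rough capacity sketch for the two features above: how many physical GPUs a pool of agents needs when each takes one MIG slice, and how tainted devices drop out of the candidate set. The `Device` class and its string taints are a deliberate simplification (real DRA device taints carry a key, value, and effect), and all names here are illustrative assumptions.

```python
from dataclasses import dataclass, field
from math import ceil

MAX_MIG_INSTANCES_PER_A100 = 7  # figure quoted in the text above


def gpus_needed(agent_count: int, slices_per_gpu: int = MAX_MIG_INSTANCES_PER_A100) -> int:
    """Physical GPUs required if every agent claims exactly one slice."""
    return ceil(agent_count / slices_per_gpu)


@dataclass
class Device:
    name: str
    taints: list[str] = field(default_factory=list)


def schedulable(devices: list[Device], tolerations: frozenset = frozenset()) -> list[str]:
    """Keep devices whose taints are all tolerated (untainted devices always pass)."""
    return [d.name for d in devices if all(t in tolerations for t in d.taints)]


pool = [Device("mig-0"), Device("mig-1", taints=["ecc-errors"]), Device("mig-2")]
print(gpus_needed(20))    # 3
print(schedulable(pool))  # ['mig-0', 'mig-2']
```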
How Do You Harden Agent Pods Against Kernel Escapes and Exfiltration?
How Do User Namespaces Reduce Agent Pod Blast Radius?
User Namespaces (GA in K8s 1.36) map the root user inside the container to an unprivileged user on the host. A process running as UID 0 inside its pod gains no host-level privileges if it escapes the container.
Requirements: Linux kernel 6.3+, containerd 2.0+. Both are required for K8s 1.36 in any case.
Enable at the pod level and combine with a restrictive security context:
spec:
  hostUsers: false
  containers:
    - name: agent
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        seccompProfile:
          type: RuntimeDefault
        capabilities:
          drop: ["ALL"]
hostUsers: false is the critical field. The RuntimeDefault seccomp profile blocks system calls agents typically do not need and reduces kernel attack surface independently of User Namespaces. Together they create two independent privilege barriers.
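If you want to enforce these settings in CI before manifests reach the cluster, a small lint function is one option. This is a minimal sketch that checks a pod spec already parsed into a Python dict; the `hardening_gaps` name and the message wording are assumptions, and a policy engine would be the more usual place for this rule in production.

```python
def hardening_gaps(pod_spec: dict) -> list[str]:
    """Return human-readable findings for settings that differ from the hardened spec."""
    gaps = []
    if pod_spec.get("hostUsers") is not False:
        gaps.append("pod: hostUsers must be false")
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        if sc.get("runAsNonRoot") is not True:
            gaps.append(f"{name}: runAsNonRoot must be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            gaps.append(f"{name}: allowPrivilegeEscalation must be false")
        if sc.get("seccompProfile", {}).get("type") != "RuntimeDefault":
            gaps.append(f"{name}: seccompProfile.type must be RuntimeDefault")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            gaps.append(f"{name}: capabilities.drop must include ALL")
    return gaps


hardened = {
    "hostUsers": False,
    "containers": [{
        "name": "agent",
        "securityContext": {
            "runAsNonRoot": True,
            "allowPrivilegeEscalation": False,
            "seccompProfile": {"type": "RuntimeDefault"},
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
print(hardening_gaps(hardened))  # []
```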
How Do You Apply a Default-Deny Egress Policy to Agent Namespaces?
An agent with tool use can call arbitrary HTTP endpoints. Without egress controls, a compromised or misbehaving agent can exfiltrate model outputs, call home, or pivot to internal services that trust cluster-internal traffic.
Apply default-deny egress at the namespace level, then add targeted allowlists per agent type:
# Block all egress from ai-agents namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-default-deny-egress
  namespace: ai-agents
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress: []
---
# Allowlist: LLM API access for research agents
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: research-agent-allow-llm
  namespace: ai-agents
spec:
  podSelector:
    matchLabels:
      app: research-agent
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 443
          protocol: TCP
      to:
        - ipBlock:
            cidr: 34.102.0.0/16 # Replace with your LLM provider CIDR
---
# Allowlist: vector store access for retrieval agents
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: retrieval-agent-allow-vectordb
  namespace: ai-agents
spec:
  podSelector:
    matchLabels:
      app: retrieval-agent
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 6333 # Qdrant, adjust for your vector store
          protocol: TCP
      to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: vector-store
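NetworkPolicy enforcement requires a CNI plugin that implements it. Calico, Cilium, and Antrea all support the full NetworkPolicy spec. Flannel does not. Default-deny egress also blocks DNS lookups, so add an allow rule for UDP and TCP port 53 to your cluster DNS service, or agents will fail to resolve any hostname.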
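The default-deny plus allowlist logic is easy to reason about as a lookup: traffic passes only if some rule for that agent type matches both destination and port. The sketch below models that with the standard library `ipaddress` module. It is a mental model for reviewing policies, not how a CNI evaluates them; the `ALLOWLIST` table reuses the placeholder CIDR from the manifest above, and the function name is an assumption.

```python
from ipaddress import ip_address, ip_network

# app label -> list of (CIDR, TCP port) pairs; anything not listed is denied.
ALLOWLIST = {
    "research-agent": [("34.102.0.0/16", 443)],  # placeholder CIDR from the manifest
}


def egress_allowed(app: str, dest_ip: str, port: int) -> bool:
    """Default deny: allow only when a rule for `app` matches both CIDR and port."""
    return any(
        ip_address(dest_ip) in ip_network(cidr) and port == allowed_port
        for cidr, allowed_port in ALLOWLIST.get(app, [])
    )


print(egress_allowed("research-agent", "34.102.10.5", 443))  # True
print(egress_allowed("research-agent", "10.0.0.12", 443))    # False
```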
How Do You Package Agent Tools and Model Weights with OCI VolumeSource?
Baking model weights and large tool binaries into container images creates fat images that slow pulls, bloat registries, and couple tool versions to the agent runtime image tag. OCI VolumeSource (GA in K8s 1.36) mounts OCI artifacts as read-only volumes at pod startup.
spec:
  volumes:
    - name: model-weights
      image:
        reference: registry.example.com/models/llama-3-8b:v3.1
        pullPolicy: IfNotPresent
    - name: agent-tools
      image:
        reference: registry.example.com/tools/search-plugin:v1.4
        pullPolicy: IfNotPresent
  containers:
    - name: agent
      volumeMounts:
        - name: model-weights
          mountPath: /models
          readOnly: true
        - name: agent-tools
          mountPath: /tools
          readOnly: true
The agent runtime sees /models and /tools populated at startup. Artifacts are cached on the node independently of the agent image. Updating model versions or tool plugins requires only a tag change in the reference field - no image rebuild, no registry push of a new agent image.
This also makes agent tooling composable: a base agent image plus independently versioned tool and model artifacts, assembled at pod startup from your OCI registry.
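One way to exploit that composability is to generate the `volumes` and `volumeMounts` stanzas from a single table of artifacts, so a version bump touches one line. The helper below is a sketch under that assumption; the `oci_volumes` name and the input shape are hypothetical, and the registry references are the example values from the manifest above.

```python
def oci_volumes(artifacts: dict[str, tuple[str, str]]) -> tuple[list[dict], list[dict]]:
    """Build pod-spec `volumes` and `volumeMounts` from {name: (reference, mount_path)}."""
    volumes, mounts = [], []
    for name, (reference, mount_path) in artifacts.items():
        volumes.append({
            "name": name,
            "image": {"reference": reference, "pullPolicy": "IfNotPresent"},
        })
        mounts.append({"name": name, "mountPath": mount_path, "readOnly": True})
    return volumes, mounts


volumes, mounts = oci_volumes({
    "model-weights": ("registry.example.com/models/llama-3-8b:v3.1", "/models"),
    "agent-tools": ("registry.example.com/tools/search-plugin:v1.4", "/tools"),
})
print(mounts[0])  # {'name': 'model-weights', 'mountPath': '/models', 'readOnly': True}
```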
How Do You Scale Agent Pools to Zero with KEDA?
Native HPA cannot scale Deployments to zero replicas unless you enable the alpha HPAScaleToZero feature gate. For agent pools that are mostly idle, this means paying for compute - including GPU compute - 24 hours a day. KEDA (Kubernetes Event-driven Autoscaling, v2.19) extends HPA to support 0-to-N scaling from external event sources.
A ScaledObject targeting an agent Deployment with a RabbitMQ task queue as the scaling trigger:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-pool-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: agent-pool
  minReplicaCount: 0
  maxReplicaCount: 20
  cooldownPeriod: 300
  triggers:
    - type: rabbitmq
      metadata:
        host: amqp://rabbitmq.messaging.svc.cluster.local
        queueName: agent-tasks
        # mode/value replace the deprecated queueLength field
        mode: QueueLength
        value: "1"
When the queue depth drops to zero, KEDA scales the Deployment to zero pods. When a message arrives, it scales to at least one. The cooldownPeriod of 300 seconds keeps the pool running for five minutes after the trigger was last active, which prevents thrashing between zero and one replica when tasks arrive in bursts.
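The scaling behavior can be approximated in a few lines: divide the backlog by the per-replica target, clamp to the maximum, and only drop to zero once the cooldown has elapsed. This is a simplified model for capacity planning, not KEDA's actual algorithm (KEDA handles the zero-to-one step itself and delegates the rest to HPA); the function and parameter names are assumptions.

```python
from math import ceil


def desired_replicas(
    queue_depth: int,
    target_per_replica: int = 1,
    max_replicas: int = 20,
    seconds_since_last_active: float = 0.0,
    cooldown_seconds: float = 300.0,
) -> int:
    """Approximate replica count for a queue-driven agent pool."""
    if queue_depth > 0:
        return min(max_replicas, ceil(queue_depth / target_per_replica))
    # Empty queue: hold one replica until the cooldown window has elapsed.
    return 1 if seconds_since_last_active < cooldown_seconds else 0


print(desired_replicas(5))                                  # 5
print(desired_replicas(100))                                # 20
print(desired_replicas(0, seconds_since_last_active=30))    # 1
print(desired_replicas(0, seconds_since_last_active=301))   # 0
```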
For Agent Sandbox, scale-to-zero is built into the CRD itself. The Sandbox controller handles suspension and resumption with state preservation natively. You do not need KEDA for Sandbox workloads - use KEDA for standard Deployments and StatefulSets that represent agent pools without the full Sandbox lifecycle semantics.
How Do You Add Observability to Multi-Agent Systems on Kubernetes?
Standard application traces break on agent workloads. A single user request can spawn an orchestrator agent that calls three specialist agents, each making multiple LLM calls and tool invocations. The resulting trace is a tree with linked subtrees, not a linear chain.
OpenTelemetry GenAI semantic conventions (experimental) provide a standard model for this shape. The core conventions are finalized; the agent framework convention is under active development in the OTel GenAI SIG to standardize across CrewAI, AutoGen, LangGraph, and others.
The trace structure for a multi-agent system:
- Root span: the agent loop (agent.loop) - one per agent invocation
- Child spans: each LLM call (gen_ai.chat), tool execution (mcp.tool.*), and retrieval operation
- Linked spans: sub-agent invocations are separate trace trees linked by W3C Trace Context, not nested spans
sequenceDiagram
participant U as User Request
participant O as Orchestrator Agent
participant L as LLM API
participant T as Tool: Search
participant S as Sub-Agent
participant D as Tool: DB Query
U->>O: Task assignment<br/>(root span: agent.loop)
O->>L: LLM call<br/>(child span: gen_ai.chat)
L-->>O: Plan with subtasks
O->>T: Tool call<br/>(child span: mcp.tool.search)
T-->>O: Search results
O->>S: Delegate subtask<br/>(linked span: agent.loop, W3C Trace Context)
S->>L: LLM call<br/>(child span: gen_ai.chat)
L-->>S: Response
S->>D: DB query<br/>(child span: db.query)
D-->>S: Data
S-->>O: Subtask complete
O-->>U: Final response
OTel trace of a multi-agent reasoning chain. Sub-agent invocations use linked spans with W3C Trace Context propagation, preserving independent trace trees while maintaining correlation for end-to-end debugging.
In code, instrument at the agent framework level:
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__, "1.0.0")


def handle_task(request_headers: dict, task: str, subtask: str, sub_agent_client):
    # Continue the caller's trace if W3C Trace Context headers are present
    parent_context = extract(request_headers)
    with tracer.start_as_current_span(
        "agent.loop",
        context=parent_context,
    ) as span:
        span.set_attribute("gen_ai.agent.name", "research-agent")
        span.set_attribute("gen_ai.agent.task", task)
        # Instrument sub-agent calls with context propagation
        outgoing_headers = {}
        inject(outgoing_headers)  # writes the current span's trace context into the dict
        # sub_agent_client is whatever client your framework provides, passed in by the caller
        sub_agent_client.call(headers=outgoing_headers, task=subtask)
The gen_ai.* attribute namespace covers model name, input token count, output token count, and finish reason per LLM call span. Aggregate gen_ai.usage.input_tokens and gen_ai.usage.output_tokens across all spans in a root trace to calculate per-task LLM cost. Route these metrics to your FinOps pipeline for per-agent, per-task cost attribution.
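The aggregation itself is a sum over span attributes followed by a price multiplication. The sketch below assumes spans have already been exported as plain dicts of attributes and uses made-up prices per 1,000 tokens; substitute your provider's actual rates. The `task_cost` name is an assumption.

```python
def task_cost(spans: list[dict], input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Sum token usage across all spans of one root trace and price it."""
    input_tokens = sum(s.get("gen_ai.usage.input_tokens", 0) for s in spans)
    output_tokens = sum(s.get("gen_ai.usage.output_tokens", 0) for s in spans)
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


# Attributes from two LLM call spans plus one tool span that has no token usage.
spans = [
    {"gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 300},
    {"gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 200},
    {"tool.name": "search"},
]
# Hypothetical prices: 0.50 per 1k input tokens, 1.50 per 1k output tokens.
print(task_cost(spans, input_price_per_1k=0.5, output_price_per_1k=1.5))  # 1.75
```

Spans without token attributes (tool calls, retrievals) contribute zero, so the same function works on a full trace without filtering first.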
Red Hat’s April 2026 guide demonstrates auto-instrumentation for FastAPI-based agent services using OTel HTTP middleware, which propagates Trace Context on outbound HTTP calls automatically without manual header injection.
What Does a Full Agent-Ready Kubernetes Reference Architecture Look Like?
This diagram shows the full agent platform with all components connected:
graph TB
subgraph External["External Services"]
LLM["LLM APIs\n(OpenAI / Anthropic)"]
REG["OCI Registry\n(models + tools)"]
end
subgraph K8sCluster["K8s Cluster"]
subgraph NS["ai-agents namespace"]
SB["Agent Sandbox\n(gVisor RuntimeClass)"]
WP["SandboxWarmPool\n(pre-warmed pods)"]
end
subgraph Infra["Platform Infra"]
DRA["DRA GPU Pool\n(ResourceClaims + MIG)"]
OCI_V["OCI VolumeSource\n(models + tools mounted)"]
NP["NetworkPolicy\n(default-deny egress)"]
UN["User Namespaces\n(hostUsers: false)"]
end
subgraph Scale["Autoscaling"]
KEDA["KEDA ScaledObject\n(scale-to-zero)"]
MQ["Task Queue\n(RabbitMQ / SQS)"]
end
subgraph Obs["Observability"]
OTEL["OTel Collector\n(GenAI conventions)"]
PROM["Prometheus\n(token cost metrics)"]
TRACE["Jaeger / Tempo\n(trace backend)"]
end
end
WP --> SB
SB --> DRA
SB --> OCI_V
SB --> NP
SB --> UN
NP --> LLM
OCI_V --> REG
MQ --> KEDA
KEDA --> SB
SB --> OTEL
OTEL --> PROM
OTEL --> TRACE
Full agent platform reference architecture. The NetworkPolicy layer sits between agent pods and external LLM APIs, enforcing egress allowlists. The OTel Collector fans out to metrics and tracing backends.
Namespace boundaries. Run agent workloads in a dedicated namespace (ai-agents) isolated from production application namespaces. This makes NetworkPolicy scope explicit, simplifies RBAC for agent service accounts, and gives you a clean boundary for resource quotas and LimitRanges.
Resource quotas. Set namespace-level ResourceQuota to cap the total GPU and memory that agent pools can consume. This prevents a runaway agent (or a spawning loop) from exhausting node capacity shared with production workloads.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-namespace-quota
  namespace: ai-agents
spec:
  hard:
    requests.memory: "256Gi"
    limits.memory: "512Gi"
    count/sandboxes.agents.x-k8s.io: "50"
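To see how this quota stops a spawning loop, the admission decision can be sketched as a simple comparison against the two limits above. This is an illustration of the arithmetic only, not the API server's quota controller; the `fits_quota` name and the GiB-as-integer simplification are assumptions.

```python
def fits_quota(
    used_memory_gib: int,
    used_sandboxes: int,
    request_memory_gib: int,
    quota_memory_gib: int = 256,  # requests.memory from the quota above
    quota_sandboxes: int = 50,    # count/sandboxes.agents.x-k8s.io from the quota above
) -> bool:
    """True if one more Sandbox with this memory request stays within both limits."""
    within_memory = used_memory_gib + request_memory_gib <= quota_memory_gib
    within_count = used_sandboxes + 1 <= quota_sandboxes
    return within_memory and within_count


print(fits_quota(used_memory_gib=98, used_sandboxes=49, request_memory_gib=2))   # True
print(fits_quota(used_memory_gib=100, used_sandboxes=50, request_memory_gib=2))  # False
```

With 2 GiB sandboxes the count limit (50) is reached long before the memory limit (256 GiB), so in this configuration the object count is the effective brake on runaway spawning.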
Frequently Asked Questions
What is Kubernetes Agent Sandbox and how does it differ from a StatefulSet?
Agent Sandbox (agents.x-k8s.io/v1alpha1) is an alpha CRD from SIG Apps at v0.3.10. It is designed specifically for AI agent runtimes and provides stable hostname and network identity, native scale-to-zero with state preservation, SandboxWarmPool for sub-second startup from pre-provisioned pods, and built-in runtimeClassName support for gVisor and Kata Containers. A StatefulSet requires manually combining the StatefulSet resource, a headless Service, PVC templates, and custom controller logic - and still lacks warm pool semantics. Agent Sandbox is alpha and not a stable production API today.
Do I need Kubernetes 1.36 to run AI agents on my cluster?
No. Agent Sandbox (v0.3.10) installs on earlier versions. K8s 1.36, expected late April 2026, graduates User Namespaces (rootless sandboxing), OCI VolumeSource (model weight mounting), and DRA Admin Access (centralized ResourceClaim management) to stable. These improve the security and flexibility of an agent platform but are not prerequisites for getting started. DRA core reached stable in v1.34.
How do I prevent an AI agent from exfiltrating data through tool use?
Apply a default-deny egress NetworkPolicy to the agent namespace and allowlist only required external endpoints: your LLM provider’s IP range, your vector store, and any sanctioned external APIs. Combine this with hostUsers: false for User Namespaces and a gVisor or Kata RuntimeClass. The three layers operate independently: network controls which endpoints are reachable, User Namespaces limits host privilege if the container escapes, and the runtime’s kernel isolation prevents kernel-level attacks.
How do I share GPUs across multiple AI agent pods efficiently?
Use DRA partitionable devices (beta in K8s 1.36) with NVIDIA MIG hardware partitioning. An A100 supports up to 7 MIG instances. Each agent pod requests a ResourceClaim for one instance sized to its workload. DRA Admin Access (also GA in v1.36) lets platform operators manage ResourceClaims centrally across namespaces. Compared to nvidia.com/gpu extended resources, DRA gives you CEL-based device selectors, health-aware scheduling via device taints, and partial-GPU allocation without node-level GPU sharing daemons.
How do I trace an AI agent reasoning chain across containers?
Use OpenTelemetry GenAI semantic conventions (experimental but stable enough for production instrumentation). Create one root span per agent loop invocation, child spans per LLM call and tool execution, and inject W3C Trace Context headers into all outbound HTTP calls to sub-agents. The gen_ai.* attribute namespace captures model name, input and output token counts, and finish reason per LLM call. Sum token attributes across a root trace for per-task cost attribution. The Red Hat distributed tracing guide covers FastAPI auto-instrumentation that handles most inter-agent context propagation without manual header injection.