Building an Agent-Ready Kubernetes Platform [2026]

Q: How do I share GPUs across multiple AI agent pods efficiently?

Use DRA partitionable devices (beta in K8s 1.36) to split a single physical GPU into logical units allocated to different pods via ResourceClaim. This integrates with NVIDIA MIG for hardware-level partitioning and avoids dedicating a full GPU per agent.

Q: What is Kubernetes Agent Sandbox and how does it differ from a StatefulSet?

Agent Sandbox (agents.x-k8s.io/v1alpha1) is an alpha CRD from SIG Apps built for AI agent runtimes. It provides stable identity, scale-to-zero with state preservation, and pre-warmed pools in a single resource. A StatefulSet requires combining a StatefulSet, headless Service, PVC template, and custom controller to achieve similar behavior, without warm pool support.

Q: Do I need Kubernetes 1.36 to run AI agents on my cluster?

No. Agent Sandbox works on earlier versions. K8s 1.36 (late April 2026) graduates User Namespaces, OCI VolumeSource, and DRA Admin Access to stable, all of which improve agent platform security and flexibility. DRA core reached stable before 1.36.

Q: How do I prevent an AI agent from exfiltrating data through tool use?

Apply a default-deny egress NetworkPolicy per agent namespace and allowlist only required endpoints such as your LLM API and vector store. Combine with User Namespaces (hostUsers: false) and a gVisor or Kata RuntimeClass for layered isolation.

Q: How do I trace an AI agent reasoning chain across containers?

Use OpenTelemetry GenAI semantic conventions. Set a root span for the agent loop, child spans for each LLM call and tool execution, and propagate W3C Trace Context headers across HTTP boundaries. The gen_ai.* attributes capture model, token counts, and finish reason.

Your AI/ML team wants to deploy agents on your Kubernetes cluster. You have StatefulSets, PVCs, a solid HPA configuration, and years of K8s operations behind you. How hard can agents be?

Harder than you expect. Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027, and infrastructure is one of the primary failure modes alongside unclear ROI and inadequate risk controls. Standard Kubernetes patterns handle stateless, predictable microservices. AI agents are stateful, bursty, and capable of executing arbitrary code. They spawn sub-agents. They make recursive tool calls. They idle for hours and then consume a full GPU for thirty seconds.

An agent-ready Kubernetes platform is a cluster configured with Agent Sandbox for workload management, hardened container runtimes for code isolation, DRA for GPU scheduling, KEDA for scale-to-zero autoscaling, and OpenTelemetry GenAI conventions for distributed tracing across agent reasoning chains.

This guide is not about agent frameworks. It covers the Kubernetes platform layer: the specific primitives, configuration patterns, and architectural decisions that make a cluster capable of safely running agent workloads at scale.

Why Do Agent Workloads Break Standard Kubernetes Setups?

Four properties make agent workloads fundamentally different from microservices:

Bursty compute. A microservice handles requests at roughly predictable throughput. An agent completes a multi-step reasoning task in seconds, then sits idle for minutes or hours. CPU and memory utilization spikes in ways HPA was not designed for. GPU demand is even more erratic: an agent doing local inference can go from zero to full-GPU utilization, then release entirely.

Stateful execution chains. Agents maintain working memory, tool context, and task state across long-running operations. A pod restart mid-chain loses everything. StatefulSets provide stable identity but lack scale-to-zero with state preservation, warm pool management, and agent-specific lifecycle semantics.

Unpredictable resource footprint. An agent that spawns sub-agents creates cascading resource demand. One orchestrator becomes five specialist agents, each making concurrent LLM calls and tool invocations. Resource requests set at deploy time cannot account for this dynamic expansion.

Untrusted code execution. Agents with shell access and code generation run arbitrary, unreviewed code inside your cluster. Sharing a host kernel with this workload - the default for standard containers - violates any reasonable security boundary.

These four properties require specific platform decisions. The rest of this guide walks through each one.

What Is the Agent Platform Stack for Kubernetes?

The Kubernetes primitives for agent workloads form a layered stack. Each layer addresses one of the core infrastructure problems agents create.

graph BT
    A["K8s Cluster Base\n(nodes, kubelet, CNI)"] --> B
    B["Container Runtime\n(gVisor / Kata / Firecracker)"] --> C
    C["Agent Sandbox CRD\n(agents.x-k8s.io/v1alpha1)"] --> D
    D["GPU Scheduling\n(DRA ResourceClaims)"] --> E
    E["Network Isolation\n(NetworkPolicy - default-deny egress)"] --> F
    F["Storage Layer\n(PVCs + OCI VolumeSource)"] --> G
    G["Autoscaling\n(KEDA - scale to zero)"] --> H
    H["Observability\n(OTel Collector - GenAI conventions)"] --> I
    I["Agent Workloads\n(orchestrators, tools, sub-agents)"]

The agent platform stack. Each layer handles a distinct infrastructure concern. Skipping any one of them creates operational gaps that surface as reliability or security incidents in production.

This guide covers all of these layers together, with working YAML, because the layers interact: isolation tier affects which DRA features you can use, NetworkPolicy scope determines your egress allowlist design, and OTel instrumentation depends on how you structure namespace boundaries.

What Is Agent Sandbox, the New Kubernetes Primitive for Agent Runtimes?

The Kubernetes SIG Apps team built Agent Sandbox (kubernetes-sigs/agent-sandbox) as a purpose-built CRD for AI agent runtimes. It is currently alpha at v0.3.10 (released April 8, 2026) and represents the direction upstream Kubernetes is heading - not a stable API to run in production today without accepting breakage risk.

The core resource is a Sandbox (agents.x-k8s.io/v1alpha1): a single-container stateful pod with stable identity, persistent storage, and lifecycle management built in. Compare what you need to build manually today with what the CRD provides:

Without Agent Sandbox:

StatefulSet definition
Headless Service for stable DNS
PVC template for persistence
Custom controller or operator for scale-to-zero
Separate logic for pre-warmed pool management
Manual state checkpoint and restoration on resume

With Agent Sandbox:

apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: research-agent
  namespace: ai-agents
spec:
  podTemplate:
    spec:
      runtimeClassName: gvisor
      containers:
      - name: agent
        image: my-agent-runtime:latest
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"

Install the CRD set:

export VERSION="v0.3.10"
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/manifest.yaml
# Optional: warm pools and SandboxClaims
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/extensions.yaml

The extensions add three resources: SandboxTemplate for reusable agent configurations, SandboxWarmPool for pre-provisioned idle sandboxes, and SandboxClaim to claim a pre-warmed pod from a pool. Google Cloud reports sub-second startup latency for sandboxes claimed from warm pools, a 90% improvement over cold-start provisioning.

How Does Agent Sandbox Manage Pod Lifecycle?

stateDiagram-v2
    [*] --> Provisioning: kubectl apply Sandbox
    Provisioning --> Running: Pod scheduled and started
    Running --> Suspending: Scale-to-zero triggered
    Suspending --> Suspended: State preserved, compute released
    Suspended --> Resuming: Task arrives
    Resuming --> Running: State restored, identity preserved

    state WarmPool {
        [*] --> PreProvisioned: SandboxWarmPool controller
        PreProvisioned --> Claimed: SandboxClaim submitted
        Claimed --> Running2: Agent starts immediately
        Running2: Running
    }

Agent Sandbox lifecycle. The suspension cycle preserves state and stable identity while releasing compute. The warm pool path moves provisioning off the request path, so high-throughput workloads claim an already-provisioned pod instead of waiting for a cold start.

The key behavior in the Suspended -> Running transition: stable hostname, DNS, and PVC remain bound across the cycle. The agent resumes with its working memory and tool context intact, from the same pod identity.
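The lifecycle in the diagram can be read as a small state machine. The sketch below is an illustrative Python model of those transitions, not the Sandbox controller's implementation: the state names come from the diagram, while the `SandboxLifecycle` class and the event names are hypothetical labels chosen for this example.

```python
from enum import Enum


class State(Enum):
    PROVISIONING = "Provisioning"
    RUNNING = "Running"
    SUSPENDING = "Suspending"
    SUSPENDED = "Suspended"
    RESUMING = "Resuming"


# Allowed transitions, keyed by (current state, event).
# Event names are hypothetical labels for the arrows in the diagram.
TRANSITIONS = {
    (State.PROVISIONING, "pod_started"): State.RUNNING,
    (State.RUNNING, "scale_to_zero"): State.SUSPENDING,
    (State.SUSPENDING, "state_preserved"): State.SUSPENDED,
    (State.SUSPENDED, "task_arrived"): State.RESUMING,
    (State.RESUMING, "state_restored"): State.RUNNING,
}


class SandboxLifecycle:
    """Toy model: hostname and PVC stay fixed while the state changes."""

    def __init__(self, hostname: str, pvc: str):
        self.hostname = hostname
        self.pvc = pvc
        self.state = State.PROVISIONING

    def fire(self, event: str) -> State:
        """Apply an event, rejecting transitions the diagram does not allow."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event!r} is not valid in state {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state
```

Walking one instance through `pod_started`, `scale_to_zero`, `state_preserved`, `task_arrived`, and `state_restored` returns it to Running with the same hostname and PVC, which is the property the suspension cycle is meant to guarantee.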

Because this is alpha, expect API changes. Track the GitHub repo and the #agent-sandbox channel in CNCF Slack before committing to the CRD in production workloads.

What Is Google Scion and How Does It Approach Multi-Agent Isolation?

Alongside Agent Sandbox, Google open-sourced Scion on April 7, 2026, as an experimental multi-agent orchestration testbed. Scion describes itself as a “hypervisor for agents”: each agent runs in its own container with a dedicated git worktree and isolated credentials. Rather than restricting what agents can do at the API level, Scion runs agents without capability restrictions and enforces boundaries at the infrastructure layer.

Scion supports Docker, Podman, and Kubernetes runtimes and emits normalized OTel telemetry. Its K8s runtime support is early-stage, but the isolation model (per-agent container, per-agent credentials, shared workspace with git worktrees) is a useful reference for designing multi-agent namespacing on K8s.

Which Container Isolation Tier Should You Use for Agent Workloads?

Every agent with shell access or code generation needs stronger isolation than a standard container provides. Standard containers share the host kernel: a container escape gives access to every other workload on the node.

The main runtime options and their tradeoffs:

Runtime	Isolation Mechanism	Boot Time	Overhead	Best For
Standard containers	Shared host kernel	ms	Minimal	Trusted, audited code only
gVisor	User-space kernel, syscall interception	ms	10-30% I/O	Compute-heavy agents, multi-tenant SaaS
Kata Containers	Lightweight VM via KVM	~200ms	Moderate	Production K8s, regulated environments
Firecracker MicroVMs	Dedicated kernel via KVM	~125ms	<5 MiB/VM	Multi-tenant, fully untrusted code execution

For production agents executing arbitrary code: Kata Containers or gVisor. Firecracker is the strongest option for fully untrusted, multi-tenant environments but adds hardware and operational requirements (bare-metal or KVM-capable nodes, Firecracker device model).

Configure via RuntimeClass - create the class once per cluster and reference it in pod specs:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata-qemu

Agent Sandbox supports runtimeClassName directly in its podTemplate.spec, so isolation is configured as part of the same resource.

flowchart TD
    A["Agent executes arbitrary code?"] -->|No| B["Trusted workloads only?"]
    A -->|Yes| C["Fully untrusted / multi-tenant?"]
    C -->|Yes| D["Firecracker MicroVMs"]
    C -->|No| E["Kata Containers or gVisor"]
    B -->|Yes| F["Standard containers"]
    B -->|No| G["gVisor minimum"]
    E --> H["Use runtimeClassName in Sandbox podTemplate"]
    D --> H
    F --> I["No RuntimeClass needed"]
    G --> H

Isolation tier selection. Most production AI agents with tool use fall into the Kata Containers or gVisor branch.
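The decision flow above is small enough to encode directly, for example in an admission check or a platform CLI. This is a minimal sketch of the flowchart only; the function name, parameter names, and returned tier labels are hypothetical and are not RuntimeClass names.

```python
def select_isolation_tier(
    executes_arbitrary_code: bool,
    fully_untrusted_multi_tenant: bool = False,
    trusted_workloads_only: bool = True,
) -> str:
    """Return the isolation tier suggested by the flowchart."""
    if executes_arbitrary_code:
        # Arbitrary code: the only question is how untrusted the tenants are.
        if fully_untrusted_multi_tenant:
            return "firecracker"
        return "kata-or-gvisor"
    # No arbitrary code execution: standard containers only for trusted code.
    if trusted_workloads_only:
        return "standard"
    return "gvisor"


def needs_runtime_class(tier: str) -> bool:
    """Every tier except standard containers requires a runtimeClassName."""
    return tier != "standard"
```

For example, `select_isolation_tier(True)` returns `"kata-or-gvisor"`, the branch most tool-using agents fall into.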

How Do You Schedule GPUs for AI Agents with Dynamic Resource Allocation?

The nvidia.com/gpu extended resource model allocates entire GPUs to pods. One agent that holds a full A100 for a ten-second inference call blocks every other agent waiting for that GPU. DRA gives you fine-grained allocation semantics that extended resources cannot express.

DRA core reached stable in v1.34. K8s 1.36 (late April 2026) adds DRA Admin Access (GA) for centralized ResourceClaim management, expanded KubeletPodResources reporting for DRA (GA) with P99 under 100ms, partitionable devices (beta) for GPU sharing, and device taints and tolerations (beta) for health-aware scheduling. For the production upgrade path to K8s 1.36, see our Kubernetes 1.36 Production Upgrade Guide.

A ResourceClaim expresses what the agent pod needs from the GPU pool:

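apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: agent-gpu
  namespace: ai-agents
spec:
  devices:
    requests:
    - name: gpu-req
      # resource.k8s.io/v1 nests the request details under "exactly"
      exactly:
        deviceClassName: gpu.example.com   # placeholder: use the DeviceClass your DRA driver installs
        selectors:
        - cel:
            # Driver name and capacity key are assumptions; check the ResourceSlices your driver publishes
            expression: "device.driver == \"gpu.nvidia.com\" && device.capacity[\"gpu.nvidia.com\"].memory.compareTo(quantity(\"8Gi\")) >= 0"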

Reference the claim from the pod spec:

spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: agent-gpu
  containers:
  - name: agent
    resources:
      claims:
      - name: gpu

Partitionable devices (beta in v1.36) allow splitting one physical GPU into logical slices allocated to different agent pods. This is the DRA integration point for NVIDIA MIG (Multi-Instance GPU). An A100 can be partitioned into up to 7 MIG instances. Each agent requests a slice via ResourceClaim, and the DRA driver allocates the appropriate MIG profile. One GPU serves seven concurrent agents instead of one.
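To make the "seven agents per GPU" arithmetic concrete, the sketch below picks the smallest MIG profile that satisfies an agent's memory requirement and counts how many such instances fit on one GPU. It is a simplification under stated assumptions: the profile table reflects the commonly documented A100 40GB profiles and should be checked against NVIDIA's documentation for your GPU model, the function names are hypothetical, and real MIG placement has additional constraints that a compute-slice count does not capture.

```python
# (profile name, compute slices out of 7, memory in GB) for an A100 40GB.
# Assumption: verify these against NVIDIA's MIG documentation for your hardware.
A100_40GB_PROFILES = [
    ("1g.5gb", 1, 5),
    ("2g.10gb", 2, 10),
    ("3g.20gb", 3, 20),
    ("4g.20gb", 4, 20),
    ("7g.40gb", 7, 40),
]
TOTAL_COMPUTE_SLICES = 7


def smallest_profile(required_gb: float) -> tuple:
    """Return (name, compute_slices) of the smallest profile with enough memory."""
    for name, slices, memory_gb in A100_40GB_PROFILES:
        if memory_gb >= required_gb:
            return name, slices
    raise ValueError(f"no MIG profile offers {required_gb} GB")


def agents_per_gpu(required_gb: float) -> int:
    """How many identical agents fit on one GPU, counting compute slices only."""
    _, slices = smallest_profile(required_gb)
    return TOTAL_COMPUTE_SLICES // slices
```

An agent that needs 4 GB lands on the `1g.5gb` profile and seven of them share the GPU; an agent that needs 8 GB moves up to `2g.10gb`, and only three fit.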

Device taints and tolerations (beta in v1.36) mark GPU devices as degraded or under maintenance. Scheduling automatically avoids tainted devices without removing them from the node pool. Useful when a MIG partition reports ECC errors but the rest of the GPU is healthy.

How Do You Harden Agent Pods Against Kernel Escapes and Exfiltration?

How Do User Namespaces Reduce Agent Pod Blast Radius?

User Namespaces (GA in K8s 1.36) map the root user inside the container to an unprivileged user on the host. An agent running as UID 0 inside its pod has no host-level privileges even if it escapes the container.

Requirements: Linux kernel 6.3+, containerd 2.0+. Both are required for K8s 1.36 in any case.

Enable at the pod level and combine with a restrictive security context:

spec:
  hostUsers: false
  containers:
  - name: agent
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop: ["ALL"]

hostUsers: false is the critical field. The RuntimeDefault seccomp profile blocks system calls agents typically do not need and reduces kernel attack surface independently of User Namespaces. Together they create two independent privilege barriers.
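A platform team may want to check these fields before a pod is admitted. The sketch below audits a pod spec, represented as a plain dictionary, for the settings shown above. The `audit_pod_spec` function and its finding messages are hypothetical; in a real cluster this logic would more likely live in an admission policy than in a script.

```python
def audit_pod_spec(spec: dict) -> list:
    """Return a list of hardening findings; an empty list means the spec passes."""
    findings = []
    # hostUsers defaults to true when omitted, so it must be set explicitly.
    if spec.get("hostUsers", True) is not False:
        findings.append("pod: hostUsers must be false")
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        if sc.get("runAsNonRoot") is not True:
            findings.append(f"{name}: runAsNonRoot must be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation must be false")
        if sc.get("seccompProfile", {}).get("type") != "RuntimeDefault":
            findings.append(f"{name}: seccompProfile.type must be RuntimeDefault")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities.drop must include ALL")
    return findings
```

Running it against the spec shown above returns an empty list, while a bare `{"containers": [{"name": "agent"}]}` produces five findings.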

How Do You Apply a Default-Deny Egress Policy to Agent Namespaces?

An agent with tool use can call arbitrary HTTP endpoints. Without egress controls, a compromised or misbehaving agent can exfiltrate model outputs, call home, or pivot to internal services that trust cluster-internal traffic.

Apply default-deny egress at the namespace level, then add targeted allowlists per agent type:

# Block all egress from ai-agents namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-default-deny-egress
  namespace: ai-agents
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress: []
---
# Allowlist: LLM API access for research agents
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: research-agent-allow-llm
  namespace: ai-agents
spec:
  podSelector:
    matchLabels:
      app: research-agent
  policyTypes:
  - Egress
  egress:
  - ports:
    - port: 443
      protocol: TCP
    to:
    - ipBlock:
        cidr: 34.102.0.0/16    # Replace with your LLM provider CIDR
---
# Allowlist: vector store access for retrieval agents
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: retrieval-agent-allow-vectordb
  namespace: ai-agents
spec:
  podSelector:
    matchLabels:
      app: retrieval-agent
  policyTypes:
  - Egress
  egress:
  - ports:
    - port: 6333   # Qdrant, adjust for your vector store
      protocol: TCP
    to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: vector-store

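NetworkPolicy enforcement requires a CNI plugin that implements it. Calico, Cilium, and Antrea all support the full NetworkPolicy spec. Flannel does not. Note that default-deny egress also blocks DNS lookups: unless you add an allow rule for port 53 (UDP and TCP) to your cluster DNS service, agents will not be able to resolve the hostnames of the endpoints you allowlisted.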
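NetworkPolicies are additive: once any egress policy selects a pod, traffic is allowed only if at least one rule in some selecting policy matches. The sketch below models that evaluation for the policies above so allowlists can be reasoned about offline. It is a deliberately reduced model and an assumption-laden one: it handles only matchLabels selectors, ipBlock peers, and port numbers, and ignores protocols, except blocks, and namespace or pod selectors as peers. The function names are hypothetical.

```python
import ipaddress


def selects(pod_selector: dict, pod_labels: dict) -> bool:
    """An empty selector matches every pod; otherwise all matchLabels must match."""
    match = pod_selector.get("matchLabels", {})
    return all(pod_labels.get(k) == v for k, v in match.items())


def egress_allowed(policies: list, pod_labels: dict, dest_ip: str, dest_port: int) -> bool:
    """Evaluate egress for one pod against a list of NetworkPolicy specs."""
    applicable = [
        p for p in policies
        if "Egress" in p.get("policyTypes", []) and selects(p.get("podSelector", {}), pod_labels)
    ]
    if not applicable:
        return True  # no egress policy selects the pod, so egress is unrestricted
    dest = ipaddress.ip_address(dest_ip)
    for policy in applicable:
        for rule in policy.get("egress", []):
            ports = rule.get("ports", [])
            port_ok = not ports or any(p["port"] == dest_port for p in ports)
            peers = rule.get("to", [])
            # Only ipBlock peers are modeled; selector-based peers never match here.
            peer_ok = not peers or any(
                "ipBlock" in peer and dest in ipaddress.ip_network(peer["ipBlock"]["cidr"])
                for peer in peers
            )
            if port_ok and peer_ok:
                return True
    return False


# The default-deny and LLM allowlist policies from above, as spec dictionaries.
DEFAULT_DENY = {"podSelector": {}, "policyTypes": ["Egress"], "egress": []}
ALLOW_LLM = {
    "podSelector": {"matchLabels": {"app": "research-agent"}},
    "policyTypes": ["Egress"],
    "egress": [{"ports": [{"port": 443, "protocol": "TCP"}],
                "to": [{"ipBlock": {"cidr": "34.102.0.0/16"}}]}],
}
```

With both policies loaded, a `research-agent` pod can reach an address inside the allowlisted CIDR on port 443 and nothing else, and a pod with any other label has no egress at all.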

How Do You Package Agent Tools and Model Weights with OCI VolumeSource?

Baking model weights and large tool binaries into container images creates fat images that slow pulls, bloat registries, and couple tool versions to the agent runtime image tag. OCI VolumeSource (GA in K8s 1.36) mounts OCI artifacts as read-only volumes at pod startup.

spec:
  volumes:
  - name: model-weights
    image:
      reference: registry.example.com/models/llama-3-8b:v3.1
      pullPolicy: IfNotPresent
  - name: agent-tools
    image:
      reference: registry.example.com/tools/search-plugin:v1.4
      pullPolicy: IfNotPresent
  containers:
  - name: agent
    volumeMounts:
    - name: model-weights
      mountPath: /models
      readOnly: true
    - name: agent-tools
      mountPath: /tools
      readOnly: true

The agent runtime sees /models and /tools populated at startup. Artifacts are cached on the node independently of the agent image. Updating model versions or tool plugins requires only a tag change in the reference field - no image rebuild, no registry push of a new agent image.

This also makes agent tooling composable: a base agent image plus independently versioned tool and model artifacts, assembled at pod startup from your OCI registry.
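Because each artifact is just a reference plus a mount path, the volume and mount stanzas can be generated from a single mapping. The sketch below builds the fragment shown above from such a mapping; the `image_volumes` helper is hypothetical, and the registry references are the placeholders already used in this guide.

```python
def image_volumes(artifacts: dict) -> dict:
    """Build pod-spec fragments from {volume name: (OCI reference, mount path)}."""
    volumes, mounts = [], []
    for name, (reference, mount_path) in artifacts.items():
        volumes.append({
            "name": name,
            "image": {"reference": reference, "pullPolicy": "IfNotPresent"},
        })
        # Image volumes are read-only, so the mount is marked accordingly.
        mounts.append({"name": name, "mountPath": mount_path, "readOnly": True})
    return {"volumes": volumes, "volumeMounts": mounts}


fragment = image_volumes({
    "model-weights": ("registry.example.com/models/llama-3-8b:v3.1", "/models"),
    "agent-tools": ("registry.example.com/tools/search-plugin:v1.4", "/tools"),
})
```

Bumping a model or plugin version then means changing one tag in the mapping and re-rendering the pod spec, with no change to the agent image.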

How Do You Scale Agent Pools to Zero with KEDA?

Native HPA cannot scale Deployments to zero replicas. For agent pools that are mostly idle, this means paying for compute - including GPU compute - 24 hours a day. KEDA (Kubernetes Event Driven Autoscaling, v2.19) extends HPA to support 0-to-N scaling from external event sources.

A ScaledObject targeting an agent Deployment with a RabbitMQ task queue as the scaling trigger:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-pool-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: agent-pool
  minReplicaCount: 0
  maxReplicaCount: 20
  cooldownPeriod: 300
  triggers:
  - type: rabbitmq
    metadata:
      host: amqp://rabbitmq.messaging.svc.cluster.local
      queueName: agent-tasks
      mode: QueueLength   # scale on queue depth
      value: "1"          # target messages per replica

When the queue is empty, KEDA waits for the cooldownPeriod (300 seconds after the last active trigger) and then scales the Deployment to zero pods. When a message arrives, it scales to at least one. The cooldown prevents thrashing when tasks arrive in bursts separated by short gaps.
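The replica count this configuration converges on can be approximated with simple arithmetic: queue depth divided by the per-replica target, rounded up and clamped to the configured bounds. The sketch below shows that calculation only. It assumes a target of one message per replica as configured above, and it ignores the cooldown period, HPA stabilization windows, and tolerance, so treat it as a rough estimate rather than KEDA's actual algorithm. The function name is hypothetical.

```python
import math


def desired_replicas(
    queue_depth: int,
    target_per_replica: int = 1,
    min_replicas: int = 0,
    max_replicas: int = 20,
) -> int:
    """Approximate steady-state replica count for a queue-depth trigger."""
    if queue_depth <= 0:
        # Empty queue: fall back to the floor (zero for scale-to-zero pools).
        return min_replicas
    wanted = math.ceil(queue_depth / target_per_replica)
    # Any pending work needs at least one replica, never more than the cap.
    return max(max(min_replicas, 1), min(wanted, max_replicas))
```

With the defaults above, 7 queued tasks yield 7 replicas and 50 queued tasks are capped at 20.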

For Agent Sandbox, scale-to-zero is built into the CRD itself. The Sandbox controller handles suspension and resumption with state preservation natively. You do not need KEDA for Sandbox workloads - use KEDA for standard Deployments and StatefulSets that represent agent pools without the full Sandbox lifecycle semantics.

How Do You Add Observability to Multi-Agent Systems on Kubernetes?

Standard application traces break on agent workloads. A single user request can spawn an orchestrator agent that calls three specialist agents, each making multiple LLM calls and tool invocations. The resulting trace is a tree with linked subtrees, not a linear chain.

OpenTelemetry GenAI semantic conventions (still marked experimental) provide a standard model for this shape. The core LLM-call conventions are the most settled part; the agent framework convention is under active development in the OTel GenAI SIG to standardize across CrewAI, AutoGen, LangGraph, and others.

The trace structure for a multi-agent system:

Root span: the agent loop (agent.loop) - one per agent invocation
Child spans: each LLM call (gen_ai.chat), tool execution (mcp.tool.*), and retrieval operation
Linked spans: sub-agent invocations are separate trace trees linked by W3C Trace Context, not nested spans

sequenceDiagram
    participant U as User Request
    participant O as Orchestrator Agent
    participant L as LLM API
    participant T as Tool: Search
    participant S as Sub-Agent
    participant D as Tool: DB Query

    U->>O: Task assignment<br/>(root span: agent.loop)
    O->>L: LLM call<br/>(child span: gen_ai.chat)
    L-->>O: Plan with subtasks
    O->>T: Tool call<br/>(child span: mcp.tool.search)
    T-->>O: Search results
    O->>S: Delegate subtask<br/>(linked span: agent.loop, W3C Trace Context)
    S->>L: LLM call<br/>(child span: gen_ai.chat)
    L-->>S: Response
    S->>D: DB query<br/>(child span: db.query)
    D-->>S: Data
    S-->>O: Subtask complete
    O-->>U: Final response

OTel trace of a multi-agent reasoning chain. Sub-agent invocations use linked spans with W3C Trace Context propagation, preserving independent trace trees while maintaining correlation for end-to-end debugging.
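The W3C Trace Context propagation shown in the diagram rides on a single HTTP header, traceparent, with the format version-traceid-parentid-flags. The sketch below parses and derives that header using only the standard library, to show what an OTel propagator reads and writes on the wire. It handles version 00 only, and the helper names are hypothetical; in production, let the OpenTelemetry SDK do this.

```python
import re
import secrets

# version "00", 32 hex chars of trace id, 16 hex chars of parent span id, 2 hex flag chars
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    """Split a version-00 traceparent header into its fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError("all-zero trace id or parent id is invalid")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def child_traceparent(header: str) -> str:
    """Header a sub-agent call would carry: same trace id, fresh span id."""
    parsed = parse_traceparent(header)
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

The trace id stays constant across every hop, which is what lets a tracing backend correlate an orchestrator's spans with those of the sub-agents it called.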

In code, instrument at the agent framework level:

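from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__, "1.0.0")

def handle_task(request_headers: dict, task: str, sub_agent_client):
    # Rebuild the caller's context from the incoming W3C Trace Context headers.
    # Passing it as context= parents this span under the caller's span; to keep
    # a separate, linked trace tree, start the span with links=[...] instead.
    parent_context = extract(request_headers)
    with tracer.start_as_current_span(
        "agent.loop",
        context=parent_context,
    ) as span:
        span.set_attribute("gen_ai.agent.name", "research-agent")
        span.set_attribute("gen_ai.agent.task", task)

        # Placeholder: a real agent derives subtasks from the LLM's plan
        subtask = task

        # Instrument sub-agent calls with context propagation:
        # inject() writes the current span's trace context into the dict
        outgoing_headers = {}
        inject(outgoing_headers)
        # sub_agent_client is a hypothetical client supplied by the caller
        sub_agent_client.call(headers=outgoing_headers, task=subtask)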

The gen_ai.* attribute namespace covers model name, input token count, output token count, and finish reason per LLM call span. Aggregate gen_ai.usage.input_tokens and gen_ai.usage.output_tokens across all spans in a root trace to calculate per-task LLM cost. Route these metrics to your FinOps pipeline for per-agent, per-task cost attribution.
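The per-task cost roll-up described above is a sum over span attributes. The sketch below shows the aggregation on plain dictionaries so it runs without a tracing backend. The `Span` class stands in for whatever your trace export produces, the model name and per-million-token prices are made-up placeholders, and grouping by `gen_ai.request.model` is an assumption about how you price calls.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Minimal stand-in for an exported span: a name plus its attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


# Hypothetical prices in USD per one million tokens: (input, output).
PRICES = {"example-model": (3.00, 15.00)}


def task_cost(spans: list, prices: dict) -> float:
    """Sum LLM cost across all spans of one root trace."""
    total = 0.0
    for span in spans:
        model = span.attributes.get("gen_ai.request.model")
        if model not in prices:
            continue  # tool and retrieval spans carry no priced token usage
        input_price, output_price = prices[model]
        total += span.attributes.get("gen_ai.usage.input_tokens", 0) * input_price / 1_000_000
        total += span.attributes.get("gen_ai.usage.output_tokens", 0) * output_price / 1_000_000
    return total
```

Grouping the same sum by `gen_ai.agent.name` instead of by trace gives per-agent rather than per-task attribution.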

Red Hat’s April 2026 guide demonstrates auto-instrumentation for FastAPI-based agent services using OTel HTTP middleware, which propagates Trace Context on outbound HTTP calls automatically without manual header injection.

What Does a Full Agent-Ready Kubernetes Reference Architecture Look Like?

This diagram shows the full agent platform with all components connected:

graph TB
    subgraph External["External Services"]
        LLM["LLM APIs\n(OpenAI / Anthropic)"]
        REG["OCI Registry\n(models + tools)"]
    end

    subgraph K8sCluster["K8s Cluster"]
        subgraph NS["ai-agents namespace"]
            SB["Agent Sandbox\n(gVisor RuntimeClass)"]
            WP["SandboxWarmPool\n(pre-warmed pods)"]
        end

        subgraph Infra["Platform Infra"]
            DRA["DRA GPU Pool\n(ResourceClaims + MIG)"]
            OCI_V["OCI VolumeSource\n(models + tools mounted)"]
            NP["NetworkPolicy\n(default-deny egress)"]
            UN["User Namespaces\n(hostUsers: false)"]
        end

        subgraph Scale["Autoscaling"]
            KEDA["KEDA ScaledObject\n(scale-to-zero)"]
            MQ["Task Queue\n(RabbitMQ / SQS)"]
        end

        subgraph Obs["Observability"]
            OTEL["OTel Collector\n(GenAI conventions)"]
            PROM["Prometheus\n(token cost metrics)"]
            TRACE["Jaeger / Tempo\n(trace backend)"]
        end
    end

    WP --> SB
    SB --> DRA
    SB --> OCI_V
    SB --> NP
    SB --> UN
    NP --> LLM
    OCI_V --> REG
    MQ --> KEDA
    KEDA --> SB
    SB --> OTEL
    OTEL --> PROM
    OTEL --> TRACE

Full agent platform reference architecture. The NetworkPolicy layer sits between agent pods and external LLM APIs, enforcing egress allowlists. The OTel Collector fans out to metrics and tracing backends.

Namespace boundaries. Run agent workloads in a dedicated namespace (ai-agents) isolated from production application namespaces. This makes NetworkPolicy scope explicit, simplifies RBAC for agent service accounts, and gives you a clean boundary for resource quotas and LimitRanges.

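Resource quotas. Set a namespace-level ResourceQuota to cap the total memory and the number of Sandboxes that agent pools can consume. This prevents a runaway agent (or a spawning loop) from exhausting node capacity shared with production workloads. The example below caps memory and Sandbox count; to cap GPU consumption as well, add an object-count quota on ResourceClaims (for example count/resourceclaims.resource.k8s.io).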

apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-namespace-quota
  namespace: ai-agents
spec:
  hard:
    requests.memory: "256Gi"
    limits.memory: "512Gi"
    count/sandboxes.agents.x-k8s.io: "50"
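To see how this quota behaves under a spawning loop, the sketch below checks whether one more Sandbox fits given the memory requests already admitted. It is a rough model of the quota arithmetic, not the API server's admission logic: it only understands integer quantities with binary suffixes (Ki, Mi, Gi, Ti), and the function names are hypothetical.

```python
BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "2Gi" to bytes (binary suffixes only)."""
    for suffix, factor in BINARY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


def admits(existing_requests: list, new_request: str,
           quota_memory: str = "256Gi", quota_count: int = 50) -> bool:
    """Would one more Sandbox stay within both the count and memory quota?"""
    if len(existing_requests) + 1 > quota_count:
        return False
    used = sum(parse_memory(r) for r in existing_requests)
    return used + parse_memory(new_request) <= parse_memory(quota_memory)
```

With 2Gi requests as in the earlier Sandbox example, the count limit of 50 is reached long before the 256Gi memory limit, so the Sandbox count is what actually stops a spawning loop in this configuration.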

Frequently Asked Questions

What is Kubernetes Agent Sandbox and how does it differ from a StatefulSet?

Agent Sandbox (agents.x-k8s.io/v1alpha1) is an alpha CRD from SIG Apps at v0.3.10. It is designed specifically for AI agent runtimes and provides stable hostname and network identity, native scale-to-zero with state preservation, SandboxWarmPool for sub-second startup from pre-provisioned pods, and built-in runtimeClassName support for gVisor and Kata Containers. A StatefulSet requires manually combining the StatefulSet resource, a headless Service, PVC templates, and custom controller logic - and still lacks warm pool semantics. Agent Sandbox is alpha and not a stable production API today.

Do I need Kubernetes 1.36 to run AI agents on my cluster?

No. Agent Sandbox (v0.3.10) installs on earlier versions. K8s 1.36, expected late April 2026, graduates User Namespaces (rootless sandboxing), OCI VolumeSource (model weight mounting), and DRA Admin Access (centralized ResourceClaim management) to stable. These improve the security and flexibility of an agent platform but are not prerequisites for getting started. DRA core reached stable before v1.36.

How do I prevent an AI agent from exfiltrating data through tool use?

Apply a default-deny egress NetworkPolicy to the agent namespace and allowlist only required external endpoints: your LLM provider’s IP range, your vector store, and any sanctioned external APIs. Combine this with hostUsers: false for User Namespaces and a gVisor or Kata RuntimeClass. The three layers operate independently: network controls which endpoints are reachable, User Namespaces limits host privilege if the container escapes, and the runtime’s kernel isolation prevents kernel-level attacks.

How do I share GPUs across multiple AI agent pods efficiently?

Use DRA partitionable devices (beta in K8s 1.36) with NVIDIA MIG hardware partitioning. An A100 supports up to 7 MIG instances. Each agent pod requests a ResourceClaim for one instance sized to its workload. DRA Admin Access (also GA in v1.36) lets platform operators manage ResourceClaims centrally across namespaces. Compared to nvidia.com/gpu extended resources, DRA gives you CEL-based device selectors, health-aware scheduling via device taints, and partial-GPU allocation without node-level GPU sharing daemons.

How do I trace an AI agent reasoning chain across containers?

Use OpenTelemetry GenAI semantic conventions (experimental but stable enough for production instrumentation). Create one root span per agent loop invocation, child spans per LLM call and tool execution, and inject W3C Trace Context headers into all outbound HTTP calls to sub-agents. The gen_ai.* attribute namespace captures model name, input and output token counts, and finish reason per LLM call. Sum token attributes across a root trace for per-task cost attribution. The Red Hat distributed tracing guide covers FastAPI auto-instrumentation that handles most inter-agent context propagation without manual header injection.