Securing AI Inference Servers on Kubernetes (2026 CVE Guide)

Q: Why is --api-key not enough to secure a vLLM deployment?

The --api-key flag only protects endpoints under the /v1 path prefix. Endpoints like /invocations, /pause, and /collective_rpc remain unauthenticated regardless. Deploy vLLM behind a reverse proxy that explicitly allowlists only the endpoints you intend to expose, and bind vLLM to 127.0.0.1.

Q: Can NetworkPolicy alone prevent SSRF attacks from inference servers?

NetworkPolicy blocks L3/L4 SSRF paths to cloud metadata and RFC 1918 ranges but cannot inspect application-layer content. Pair it with --allowed-media-domains in vLLM, VLLM_MEDIA_URL_ALLOW_REDIRECTS=0, and IMDSv2 enforcement on AWS for defense in depth.

Q: Does gVisor work with GPU inference workloads?

Yes. gVisor's nvproxy intercepts CUDA calls in a memory-safe layer and provides better GPU isolation than microVM passthrough using VFIO, which grants the VM DMA access to host memory. GKE Sandbox supports GPU workloads on H100, A100, L4, and T4 as of GKE 1.29.2.

Q: How quickly are inference server vulnerabilities exploited after disclosure?

CVE-2026-33626 (LMDeploy SSRF) was exploited within 12 hours and 31 minutes of advisory publication with no public PoC. Attackers crafted the exploit directly from advisory text. Patch timelines alone are not a sufficient defense strategy.

Q: What Kubernetes controls would have limited the LiteLLM supply chain blast radius?

Kyverno policies enforcing restricted Pod Security Standards, RBAC scoped to the inference namespace with automountServiceAccountToken disabled, and NetworkPolicy denying egress from inference pods to the Kubernetes API server would all limit lateral movement stages.

Six CVEs and one supply chain compromise, three frameworks, one month: April 2026 established AI inference servers as the most actively exploited new attack surface in Kubernetes. vLLM logged four advisories including a CVSS 9.8 pre-auth RCE, LMDeploy was hit by a confirmed SSRF within 12 hours of disclosure, and LiteLLM’s supply chain was compromised via a poisoned security scanner two hops upstream. This post maps the shared attack surface and gives you the Kubernetes-native controls to defend it.

Why Are AI Inference Servers Under Siege in April 2026?

Four separate security incidents across three frameworks in a single calendar month is not a coincidence. It is a pattern.

CVE	Framework	Type	CVSS	Affected Versions	Patched
CVE-2026-22778	vLLM	Pre-auth RCE via video URL	9.8	>=0.8.3, <0.14.1	0.14.1
CVE-2025-62164	vLLM	Tensor deserialization DoS/RCE	8.8	>=0.10.2, <0.11.1	0.11.1
CVE-2026-34753	vLLM	SSRF via batch input	-	0.16.0 - <0.19.0	0.19.0
CVE-2026-34755	vLLM	OOM DoS via video frames	-	<0.19.0	0.19.0
CVE-2026-34756	vLLM	OOM DoS via n parameter	-	<0.19.0	0.19.0
CVE-2026-33626	LMDeploy	SSRF via load_image()	7.5	<=0.12.0	0.12.3
Supply chain	LiteLLM	Malicious PyPI (1.82.7-1.82.8)	-	1.82.7-1.82.8	1.83.0

The common thread across all seven is not a shared library or shared code. It is a set of structural properties that every inference server shares.

No default authentication. vLLM, LMDeploy, and LiteLLM all start with open HTTP endpoints. Adding --api-key to vLLM protects only the /v1 path prefix, leaving a long list of endpoints completely open.

Rich, user-controlled input processing. Inference servers accept video URLs, image URLs, base64-encoded media, and serialized tensor objects. Every media fetch is a potential SSRF vector. Every deserialized tensor is a potential RCE vector.

GPU access and elevated scheduling. GPU workloads create pressure to weaken pod security. Inference pods frequently run without resource limits, with broad service account permissions, and in configurations that would fail a standard pod security audit.

Deep supply chain exposure. Model weights, vision processing libraries, audio transcription dependencies - the LiteLLM compromise shows attackers do not need to exploit the inference server directly when they can poison the CI pipeline that builds it.

What Is the AI Inference Server Attack Surface on Kubernetes?

Which vLLM Endpoints Bypass the --api-key Flag?

The vLLM security documentation explicitly lists which endpoints its --api-key flag protects (only /v1 prefix routes) and which it does not. The unprotected list is long.

Inference bypass endpoints (always unauthenticated):

/invocations - SageMaker-compatible, provides full inference access
/inference/v1/generate
/pooling, /classify, /score, /rerank

Operational control (DoS vectors):

/pause - halts all in-progress generation
/resume
/scale_elastic_ep

Development mode (requires VLLM_SERVER_DEV_MODE=1 - never set in production):

/collective_rpc - executes arbitrary distributed RPC methods
/reset_prefix_cache, /sleep, /wake_up

Any attacker with network access to your inference pod can call /invocations for free inference, /pause to halt your service, or /collective_rpc to execute arbitrary operations if dev mode was accidentally left on.

For multi-node deployments, the attack surface extends further. PyTorch Distributed sends unencrypted messages with no authorization checks, and the TCPStore listens on all network interfaces by default. An attacker who reaches any node in the cluster can inject into the inference coordination layer.

How Do Video URLs and Serialized Tensors Create RCE and SSRF Vectors?

CVE-2026-22778 (CVSS 9.8) chains two flaws in vLLM’s video processing path. First, when an invalid image is submitted to a multimodal endpoint, PIL raises an exception containing a heap memory address that vLLM returns directly to the client, defeating ASLR. Second, a crafted JPEG2000 video frame exploits a color channel remapping bug in the bundled OpenCV/FFmpeg decoder, triggering a heap overflow at the now-known address. The result is remote code execution without authentication.

CVE-2025-62164 (CVSS 8.8) uses the prompt_embeds parameter in the Completions API. vLLM calls torch.load() on the submitted data without validation. PyTorch 2.8.0 disabled sparse tensor integrity checks by default, allowing a crafted tensor to bypass bounds checking and trigger out-of-bounds writes via to_dense().

CVE-2026-34753 takes the SSRF path: the download_bytes_from_url function in vLLM 0.16.0 through 0.18.x fetches arbitrary HTTP/HTTPS URLs from batch input JSON without domain restrictions. Any user who can submit batch requests can make the vLLM server probe internal infrastructure.

How Did the LiteLLM Supply Chain Attack Work?

The LiteLLM compromise followed a kill chain that began nowhere near LiteLLM.

A threat actor exploited a pull_request_target workflow misconfiguration in Trivy, the widely used container scanning tool. This workflow type runs in the context of the base repository, giving pull requests from forks access to secrets they should not reach. The attacker used this access to exfiltrate the aqua-bot credentials. Since LiteLLM’s CI pulled Trivy without a pinned version, the compromised Trivy action ran with the PYPI_PUBLISH token in the environment.

With that token, the attacker published litellm==1.82.7 and litellm==1.82.8 containing a malicious .pth file. Python’s site module processes every .pth file in site-packages at interpreter startup and executes any line in it that begins with import, so the payload ran on every Python start whether or not litellm was ever imported. The payload harvested SSH keys, .env files, cloud provider credentials, and Kubernetes service account tokens, then exfiltrated them to an attacker-controlled endpoint.

graph TD
    A["Trivy CI: pull_request_target\nmisconfiguration"] --> B["Exfiltrate aqua-bot\ncredentials"]
    B --> C["Compromise Trivy\ndependency"]
    C --> D["LiteLLM CI pulls\npoisoned Trivy"]
    D --> E["PYPI_PUBLISH token\nexfiltrated from runner"]
    E --> F["Malicious litellm 1.82.7\nand 1.82.8 published"]
    F --> G["User installs litellm\nvia pip"]
    G --> H["litellm_init.pth executes\non every Python start"]
    H --> I["Stage 1: Harvest SSH keys,\n.env, cloud creds, K8s tokens"]
    I --> J["Stage 2: K8s lateral movement\nvia service account token"]

    style A fill:#ef4444,color:#fff
    style F fill:#ef4444,color:#fff
    style H fill:#f97316,color:#fff
    style J fill:#dc2626,color:#fff

The LiteLLM supply chain kill chain: a CI misconfiguration in a dependency two hops upstream enabled Kubernetes credential theft at scale.

For the full threat actor attribution, IOC list, and attack timeline: Anatomy of the TeamPCP Supply Chain Campaign. For the Kubernetes-native controls that would have blocked each stage of the kill chain: Securing AI/ML Supply Chains on Kubernetes.
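
The entry point in this chain was an unpinned third-party action running where a publish credential was reachable. As a rough sketch (not LiteLLM's actual workflow), a scan job could be triggered by pull_request rather than pull_request_target, pin each action to a full commit SHA, and carry no publish secrets. The workflow name, job name, and the FULL_COMMIT_SHA placeholders are assumptions; replace the placeholders with commit SHAs you have reviewed:

```yaml
# Hypothetical scan workflow: names and SHA placeholders are illustrative
name: dependency-scan
on:
  pull_request:            # fork PRs run without access to repository secrets
permissions:
  contents: read           # token can read the repo and nothing else
jobs:
  trivy:
    runs-on: ubuntu-latest
    steps:
      # Pin to an immutable commit SHA instead of a mutable tag or branch
      - uses: actions/checkout@FULL_COMMIT_SHA
      - uses: aquasecurity/trivy-action@FULL_COMMIT_SHA
        with:
          scan-type: fs
          scan-ref: .
```

Keeping the PyPI publish token in a separate workflow that only runs on protected tags means a compromised scanner step has nothing to steal.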

Five Defense Layers for Inference Infrastructure

No single control stops all seven incidents from April 2026. The following diagram shows the five layers that, applied together, mitigate each attack class:

graph LR
    EXT["External Traffic"] --> RP["Layer 3\nReverse Proxy\nEndpoint Allowlisting"]
    RP --> NP["Layer 1\nNetworkPolicy\nSSRF Blocking"]
    NP --> AC["Layer 2\nAdmission Control\nKyverno Policies"]
    AC --> RT["Layer 4\nRuntime Sandbox\ngVisor nvproxy"]
    RT --> INF["Inference\nServer Pod"]
    INF --> MON["Layer 5\nMonitoring\nFalco Detection"]

    style EXT fill:#374151,color:#fff
    style RP fill:#1d4ed8,color:#fff
    style NP fill:#1d4ed8,color:#fff
    style AC fill:#1d4ed8,color:#fff
    style RT fill:#1d4ed8,color:#fff
    style INF fill:#15803d,color:#fff
    style MON fill:#7c3aed,color:#fff

Five defense layers, each blocking a different attack class: NetworkPolicy stops SSRF and lateral movement, admission control blocks privileged pods, the reverse proxy filters unauthenticated endpoints, gVisor limits RCE blast radius, and Falco detects post-compromise indicators.

Defense Layer 1: Network Isolation with NetworkPolicy

NetworkPolicy is the first line of defense against SSRF and lateral movement. The LMDeploy CVE-2026-33626 exploitation path went: arbitrary URL fetch to 169.254.169.254 (AWS IMDS), then Redis on port 6379, then internal admin interfaces. All three paths are blockable at the network layer.

Block cloud metadata and internal network egress

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-egress-restricted
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm-server
  policyTypes:
    - Egress
  egress:
    # Allow DNS resolution
    - to: []
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # Allow HTTPS to public internet only (model registry, Hugging Face)
    # Blocks: link-local metadata services and all RFC 1918 ranges
    - to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
            - 169.254.0.0/16    # Link-local: AWS/GCP/Azure metadata services
            - 10.0.0.0/8        # RFC 1918: internal cluster services
            - 172.16.0.0/12     # RFC 1918
            - 192.168.0.0/16    # RFC 1918
      ports:
        - port: 443
          protocol: TCP

Restrict ingress to your API gateway only

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-ingress-gateway-only
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm-server
  policyTypes:
    - Ingress
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: gateway   # set automatically on every namespace
        podSelector:
          matchLabels:
            app: api-gateway
      ports:
        - port: 8080
          protocol: TCP

This policy ensures only your API gateway can reach inference pods. Direct lateral movement from compromised workloads in other namespaces is blocked at the network level.
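
The two policies above only select pods labeled app: vllm-server. A namespace-wide default-deny policy, sketched here on the assumption that every workload in the inference namespace should be explicitly allowlisted, would also cover pods that carry a different label or none at all. The policy name is illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all        # illustrative name
  namespace: inference
spec:
  podSelector: {}               # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so all traffic is denied
  # unless another NetworkPolicy explicitly allows it.
```

NetworkPolicies are additive, so the vllm-server allow rules keep working alongside this one. Any other pod in the namespace will need its own allow rules, including DNS egress.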

Enforce IMDSv2 on AWS (Terraform)

NetworkPolicy covers Kubernetes-level traffic, but the cloud metadata service is accessible at the infrastructure layer. The http_put_response_hop_limit = 1 setting blocks container-originating requests from reaching the metadata endpoint because the container-to-host hop consumes the TTL before the request can be fulfilled:

resource "aws_launch_template" "gpu_nodes" {
  metadata_options {
    http_tokens                 = "required"    # Require an IMDSv2 session token
    http_put_response_hop_limit = 1             # Block container SSRF to IMDS
    http_endpoint               = "enabled"
  }
}

This directly closes the attack path used to exploit CVE-2026-33626. The Sysdig research team observed the attacker probing 169.254.169.254 within the first two minutes of the exploitation window. Both this Terraform setting and the NetworkPolicy above block that probe independently.

NetworkPolicy enforcement requires a CNI plugin that supports it. Calico, Cilium, and Weave Net enforce NetworkPolicy. The default kubenet plugin does not.

Defense Layer 2: Admission Control with Kyverno

Kyverno intercepts resource creation at the Kubernetes API server before pods deploy. NVIDIA uses Kyverno as the required governance engine in DGX Cloud, Mission Control, and NeMo for exactly the GPU workload governance patterns that apply here. For policy-as-code patterns that extend Kyverno governance to AI agent tool calls, see Securing AI Agent MCP Traffic with Kyverno on Kubernetes.

Block privileged containers. The LiteLLM supply chain payload needed elevated container permissions to complete its lateral movement. A Kyverno policy enforcing the restricted Pod Security Standard blocks privileged containers, host namespace sharing, and capability escalation:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-restricted-pod-security
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-pod-security-standard
      match:
        resources:
          kinds:
            - Pod
          namespaces:
            - inference
      validate:
        podSecurity:
          level: restricted
          version: latest

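The restricted profile requires runAsNonRoot: true, requires containers to drop all capabilities, sets allowPrivilegeEscalation: false, and requires a RuntimeDefault or Localhost seccomp profile. It does not require a read-only root filesystem; if you want readOnlyRootFilesystem: true, enforce it with a separate rule.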
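
As a minimal sketch, a pod template fragment along these lines should pass the restricted check. The container name, image tag, and UID 1000 are assumptions; the image must actually be built to run as a non-root user:

```yaml
# Pod template fragment (goes under spec.template in a Deployment)
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000               # assumed non-root UID baked into the image
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: vllm
      image: registry.example.com/vllm:0.20.0
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```

GPU access through the device plugin resource (nvidia.com/gpu) does not by itself require a privileged container, so these settings are compatible with GPU scheduling.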

Enforce resource limits. CVE-2026-34755 (unbounded video frames) and CVE-2026-34756 (unbounded n parameter) are DoS vectors that depend on the inference pod being allowed to allocate unbounded memory. Kyverno can block pods that do not declare explicit limits, bounding the damage any single request can cause. For production tuning of requests and limits, see Kubernetes Resource Limits: The Production Configuration Guide.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-container-resources
      match:
        resources:
          kinds:
            - Pod
          namespaces:
            - inference
      validate:
        message: "Resource limits are required for all containers in the inference namespace."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

Restrict image registries. An equivalent supply chain attack via container images could inject malicious layers through a compromised public registry. Restricting pulls to your internal or verified registry reduces this surface:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-image-registry
      match:
        resources:
          kinds:
            - Pod
          namespaces:
            - inference
      validate:
        message: "Images must be pulled from approved registries only."
        pattern:
          spec:
            containers:
              - image: "registry.example.com/*"
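
Restricting the registry says where an image came from, not who built it. A hedged sketch of a Kyverno verifyImages rule that additionally requires a cosign signature is shown below. The policy name is illustrative, and the public key is a placeholder to be replaced with your own cosign public key:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures     # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - inference
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"
          attestors:
            - entries:
                - keys:
                    # Placeholder: paste the PEM-encoded cosign public key here
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      REPLACE_WITH_COSIGN_PUBLIC_KEY
                      -----END PUBLIC KEY-----
```

Pods whose images match the reference pattern but lack a valid signature from that key are rejected at admission.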

Defense Layer 3: Reverse Proxy Endpoint Allowlisting

The --api-key flag is insufficient on its own. It does not protect /invocations, /pause, or any of the operational control endpoints. The correct architecture is vLLM bound to localhost only, with a reverse proxy in front that allowlists the specific endpoints you want to expose:

# log_format is only valid in the http context, so it is declared outside the server block.
# This format logs no request headers, so Authorization values never reach the access log.
log_format no_auth '$remote_addr - $remote_user [$time_local] '
                   '"$request" $status $body_bytes_sent';

upstream vllm_backend {
    server 127.0.0.1:8080;
}

server {
    listen 443 ssl;
    # ssl_certificate and ssl_certificate_key are required here; paths omitted.
    access_log /var/log/nginx/vllm_access.log no_auth;

    # Allowlisted OpenAI-compatible inference endpoints only.
    # "=" forces an exact match, so longer paths sharing a prefix are not forwarded.
    location = /v1/chat/completions {
        proxy_pass http://vllm_backend;
        proxy_buffering off;  # Required for SSE streaming responses
    }

    location = /v1/completions {
        proxy_pass http://vllm_backend;
        proxy_buffering off;
    }

    location = /v1/embeddings {
        proxy_pass http://vllm_backend;
    }

    location = /health {
        proxy_pass http://vllm_backend;
    }

    # Everything else is denied
    location / {
        return 403;
    }
}

Pair the nginx config with vLLM server flags that address the SSRF and resource exhaustion vectors:

# Environment variables must be exported before the server process starts
export VLLM_MEDIA_URL_ALLOW_REDIRECTS=0   # Block redirect-based SSRF bypasses
export VLLM_MAX_N_SEQUENCES=64            # Hard cap on concurrent output sequences
# Never set in production:
# export VLLM_SERVER_DEV_MODE=1

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --api-key "$VLLM_API_KEY" \
  --allowed-media-domains "cdn.example.com" \
  --host 127.0.0.1 \
  --port 8080

--allowed-media-domains explicitly allowlists domains from which vLLM will fetch media URLs. VLLM_MEDIA_URL_ALLOW_REDIRECTS=0 prevents attackers from chaining HTTP redirects to bypass the domain allowlist. VLLM_MAX_N_SEQUENCES=64 sets a hard cap on concurrent output sequences, directly mitigating CVE-2026-34756.

flowchart LR
    C["Client Request"] --> P["Reverse Proxy\nnginx / Envoy"]
    P --> D{"Endpoint\nin allowlist?"}
    D -->|"YES\n/v1/chat/completions\n/v1/completions\n/v1/embeddings\n/health"| F["Forward to vLLM\nlocalhost:8080"]
    D -->|"NO\n/invocations\n/pause\n/collective_rpc\n/tokenizer_info\n/everything-else"| B["403 Forbidden"]

    style D fill:#1d4ed8,color:#fff
    style F fill:#15803d,color:#fff
    style B fill:#ef4444,color:#fff

All traffic reaches the proxy first. Only four paths forward to vLLM. Everything else returns 403, including the unauthenticated inference and control endpoints that bypass --api-key protection entirely.

Defense Layer 4: Runtime Sandboxing with gVisor

Network isolation and admission control block most attack vectors before they reach the pod. Runtime sandboxing limits the damage when an attacker achieves code execution inside the container anyway, which CVE-2026-22778 (CVSS 9.8) demonstrates is possible without authentication.

gVisor intercepts system calls in a memory-safe Go process (the Sentry) before they reach the host kernel. A heap overflow that achieves code execution inside a gVisor container is constrained to the gVisor syscall interface rather than the full host kernel surface. Most post-exploitation techniques - host filesystem traversal, raw socket creation, ptrace-based credential extraction - are blocked or intercepted at this layer.

For GPU inference specifically, gVisor’s nvproxy component intercepts CUDA calls in the same memory-safe layer. This approach provides stronger GPU isolation than microVM passthrough using VFIO, which gives the virtual machine direct DMA access to host memory and breaks the isolation boundary in the process. With nvproxy, CUDA calls are mediated rather than directly passed through.

GKE Sandbox supports GPU workloads on NVIDIA H100, A100, L4, and T4 hardware from GKE version 1.29.2. Enable it with a RuntimeClass:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: inference
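spec:
  # selector and matching template labels are required for a valid Deployment
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec: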
      runtimeClassName: gvisor
      containers:
        - name: vllm
          image: registry.example.com/vllm:0.20.0
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "40Gi"
              cpu: "8"

Inference workloads are compute-bound at the GPU level with minimal host I/O. That profile minimizes gVisor overhead while maximizing the isolation benefit.

Defense Layer 5: Monitoring and Detection

The previous four layers are preventive. This one catches what gets through. The April 2026 incidents produce specific, identifiable signals that are distinct from normal inference traffic.

What Are the SSRF Indicators for LMDeploy-Pattern Attacks?

Sysdig detected the CVE-2026-33626 exploitation using honeypot monitoring that watched for these exact signals:

Outbound connections from inference pods to 169.254.169.254 (AWS, Azure, GCP metadata services)
Connections to loopback or RFC 1918 ranges from pods that should only be responding to API requests
DNS lookups to OAST domains: *.requestrepo.com, *.interact.sh, *.burpcollaborator.net
Rapid sequential connections to multiple ports on the same internal host within seconds (scripted port sweep)

What Are the Lateral Movement Indicators for LiteLLM-Pattern Attacks?

File reads from /var/run/secrets/kubernetes.io/serviceaccount/token
Kubernetes API calls originating from inference pod IP addresses - listing secrets, creating pods, modifying resources
Privileged pod creation in the kube-system namespace
Mount of host root filesystem paths from newly created pods

How Do You Detect DoS Resource Exhaustion Attacks Against Inference Servers?

Memory growth on inference pods correlated with specific API requests, without proportional increase in output
OOM kills followed by immediate pod restarts (CVE-2026-34755/34756 pattern)
Inference process crashes following submission of unusual base64 video or high-n requests
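
The OOM-kill-then-restart signal can be turned into an alert. The following is a sketch only, assuming the Prometheus Operator (for the PrometheusRule resource) and kube-state-metrics (for the two kube_pod_container_status metrics) are installed. The rule name, alert name, and 10-minute window are illustrative choices:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-oom-alerts        # illustrative name
  namespace: inference
spec:
  groups:
    - name: inference-dos
      rules:
        - alert: InferencePodOOMKilled
          # Fires when a container's last termination was an OOM kill
          # and it has restarted within the past 10 minutes.
          expr: |
            (kube_pod_container_status_last_terminated_reason{namespace="inference", reason="OOMKilled"} == 1)
            and on (namespace, pod, container)
            (increase(kube_pod_container_status_restarts_total{namespace="inference"}[10m]) > 0)
          labels:
            severity: critical
          annotations:
            summary: "Inference container was OOM-killed and restarted"
```

Correlating the alert timestamp with proxy access logs helps identify the request that triggered the allocation.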

Which Custom Falco Rules Detect AI Inference Server Compromise?

Falco’s eBPF-based runtime monitoring can detect these signals with custom rules. The following rules are not part of Falco’s default ruleset; add them to your deployment alongside the defaults:

# Detect cloud metadata service access from inference pods
- rule: Inference Pod Metadata Service Access
  desc: Outbound connection to cloud metadata IP from inference namespace
  condition: >
    outbound and fd.sip = "169.254.169.254" and
    k8s.ns.name = "inference"
  output: >
    Cloud metadata access from inference pod
    (pod=%k8s.pod.name ip=%fd.sip user=%user.name)
  priority: CRITICAL
  tags: [inference, ssrf, cloud-metadata]

# Detect Kubernetes service account token reads from inference pods
- rule: Inference Pod Service Account Token Read
  desc: Service account token file read from inference namespace pod
  condition: >
    open_read and
    fd.name = "/var/run/secrets/kubernetes.io/serviceaccount/token" and
    k8s.ns.name = "inference"
  output: >
    Service account token read from inference pod
    (pod=%k8s.pod.name user=%user.name cmd=%proc.cmdline)
  priority: ERROR
  tags: [inference, lateral-movement, credential-theft]

Wire both rules to immediate alerting. The metadata access rule in particular should have zero tolerance - there is no legitimate reason for an inference pod to reach 169.254.169.254.

Which Versions of vLLM, LMDeploy, and LiteLLM Address the April 2026 CVEs?

Framework	CVE	Affected	Patched	Mitigation if Patching is Delayed
vLLM	CVE-2026-22778 (RCE)	>=0.8.3, <0.14.1	0.14.1	Block multimodal endpoints at proxy
vLLM	CVE-2025-62164 (deserialization)	>=0.10.2, <0.11.1	0.11.1	Disable prompt_embeds at proxy layer
vLLM	CVE-2026-34753 (SSRF)	0.16.0 - <0.19.0	0.19.0	Block batch API at proxy; use NetworkPolicy
vLLM	CVE-2026-34755 (video OOM)	<0.19.0	0.19.0	Block video media at proxy; set memory limits
vLLM	CVE-2026-34756 (n parameter OOM)	<0.19.0	0.19.0	Set VLLM_MAX_N_SEQUENCES=64
LMDeploy	CVE-2026-33626 (SSRF)	<=0.12.0	0.12.3	NetworkPolicy blocking RFC 1918 + 169.254/16
LiteLLM	Supply chain	1.82.7-1.82.8	1.83.0	Pin all dependencies with hash verification

Current stable versions as of April 28, 2026: vLLM v0.20.0, LMDeploy v0.12.3, Kubernetes v1.36.0.

Patching is necessary but not sufficient. CVE-2026-33626 was exploited in 12 hours and 31 minutes with no public PoC. That window is shorter than most organizations’ patch review and deployment cycles. The five defense layers above provide protection during that window and address entire vulnerability classes - not just individual CVEs.

Frequently Asked Questions

Why is --api-key not enough to secure a vLLM deployment?

The --api-key flag only protects endpoints under the /v1 path prefix. Endpoints including /invocations (full inference access, SageMaker-compatible), /pause (halts all generation), and /collective_rpc (arbitrary distributed RPC execution in dev mode) remain unauthenticated regardless of whether an API key is configured. An attacker with network access to your inference pod can bypass authentication entirely by calling these paths directly. Bind vLLM to 127.0.0.1 and deploy it behind a reverse proxy that explicitly allowlists only the endpoints you intend to expose - block everything else with a 403 at the proxy layer.

Can NetworkPolicy alone prevent SSRF attacks from inference servers?

NetworkPolicy blocks network-level SSRF paths at Layer 3/4: cloud metadata at 169.254.169.254 and internal services on RFC 1918 ranges. It cannot inspect application-layer content or block SSRF to permitted external destinations. A complete SSRF defense combines NetworkPolicy with --allowed-media-domains in vLLM (allowlisting which domains vLLM will fetch media from), VLLM_MEDIA_URL_ALLOW_REDIRECTS=0 (blocking redirect-based bypasses), and IMDSv2 enforcement on AWS (http_tokens = "required", http_put_response_hop_limit = 1). These controls target the same attack class from different layers.

Does gVisor work with GPU inference workloads?

Yes. gVisor’s nvproxy component intercepts CUDA calls in a memory-safe layer rather than passing them directly to the host GPU driver. This provides better GPU isolation than microVM passthrough with VFIO, which gives the virtual machine direct DMA access to host memory and breaks the VM isolation boundary in the process. GKE Sandbox supports GPU workloads on NVIDIA H100, A100, L4, and T4 hardware as of GKE version 1.29.2. The performance overhead for compute-heavy inference workloads with limited host I/O is minimal.

How quickly are inference server vulnerabilities exploited after disclosure?

CVE-2026-33626 (LMDeploy SSRF) was exploited in the wild within 12 hours and 31 minutes of the advisory being published on GitHub. No public proof-of-concept existed beforehand. Attackers crafted the exploit directly from advisory text, probing AWS IMDS, Redis, and internal admin endpoints in a three-phase scripted attack within minutes of each other. This timeline is shorter than most organizations’ patch review cycles, which is why defense-in-depth controls that mitigate vulnerability classes matter more than patch speed alone.

What Kubernetes controls would have limited the LiteLLM supply chain blast radius?

Three controls combined would significantly contain the damage: (1) Kyverno policies enforcing the restricted Pod Security Standard in the inference namespace, blocking privileged pod creation and removing the capabilities needed for persistence. (2) RBAC scoped to the inference namespace with automountServiceAccountToken: false on inference pods that do not need Kubernetes API access, removing the service account token the credential harvesting stage targeted. (3) NetworkPolicy denying egress from inference pods to the Kubernetes API server (kubernetes.default.svc) and to arbitrary external endpoints, blocking both the K8s lateral movement stage and the credential exfiltration callback. Pinning dependencies with hash verification in CI/CD and using Sigstore cosign for container image signing would address the supply chain entry point before any of this becomes necessary.
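
The second control can be sketched as a dedicated ServiceAccount with token mounting turned off, referenced from the pod template. The ServiceAccount name is an assumption; the setting only suits pods that genuinely never call the Kubernetes API:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vllm-server               # hypothetical name
  namespace: inference
automountServiceAccountToken: false
---
# Pod template fragment (goes under spec.template in the Deployment)
spec:
  serviceAccountName: vllm-server
  # Set at pod level as well; the pod-level value takes precedence
  automountServiceAccountToken: false
```

With no token mounted at /var/run/secrets/kubernetes.io/serviceaccount, a credential harvester running inside the pod has no Kubernetes identity to steal.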