Guardians of the Container Galaxy

Defending the Cosmic Cluster

Chris Ayers

Principal Software Engineer

Microsoft

chris-ayers.com | @Chris_L_Ayers

Chris Ayers

Principal Software Engineer
Azure CXP AzRel
Microsoft

BlueSky: @chris-ayers.com
LinkedIn: chris-l-ayers
Blog: https://chris-ayers.com/
GitHub: Codebytes
Mastodon: @Chrisayers@hachyderm.io
Twitter: @Chris_L_Ayers


Container Security: The Challenge

Modern Container Threats:

  • Supply Chain Attacks (xz-utils, LiteLLM, Axios)
  • Runtime Exploits (Cryptojacking, Container Escape)
  • Lateral Movement (Flat Networks)
  • Visibility Gaps (Lack of Observability)

Container Security: The Impact

The Numbers:

  • 78% of orgs fail audits due to unresolved container CVEs
  • 63% of organizations hit by supply chain attacks (2024-2025)
  • 267 days average dwell time without runtime detection

The Container Attack Kill Chain


The Guardians Framework


Why Layering Matters

No single control is perfect → Layer them (Swiss cheese model)

Log4Shell (CVE-2021-44228) proved it:

Layer | Response
Scanning | Found vulnerable Log4j in images
Runtime | Detected JNDI exploitation attempts
Network | Blocked C2 communication
Observability | Correlated timeline, full reconstruction

Different layers fail differently, so overlapping controls matter


Shift Left + Shield Right

Shift Left = Prevent known bad (build-time)

  • SBOM, vulnerability scanning, image signing
  • Goal: catch 80% before deploy

Shield Right = Detect & contain what gets through (runtime)

  • Behavioral detection, network policies, observability
  • Why: 267 days average dwell time without runtime detection

Both required. Prevention alone is not enough.


Principles We'll Apply Throughout

Zero Trust: Never trust, always verify, even inside the cluster (→ Groot)
Supply Chain Security: You don't control your dependencies, so verify them (→ Gamora)
Security Observability: Siloed tools miss correlated attacks (→ Mantis)

Standard: NIST SP 800-207 | Framework: SLSA (OpenSSF)

Our tools: CNCF-first (portable, community-driven, production-proven)

  • Graduated: Falco, OPA, Cilium, Prometheus, Kyverno, OpenTelemetry
  • Incubating: Trivy, Sigstore

Guardian #1

🎯 Star-Lord

Policy Orchestration


Star-Lord: Admission Control


Star-Lord: Policy as Code

Choose the simplest enforcement layer that solves the problem:

Layer | When to Use
Pod Security Admission (PSA) | Baseline pod hardening; fastest built-in guardrail
ValidatingAdmissionPolicy (CEL) | Simple custom validation; native, in-process (K8s v1.30+)
Kyverno / OPA Gatekeeper | Advanced policies: mutation, reporting, external data

All Policy as Code: version controlled, peer reviewed, auditable


Star-Lord: Image & Pod Policy Patterns

Image Trust:

  • Require signed images (verify with Cosign)
  • Block images from untrusted registries
  • Deny :latest tag (enforce immutable tags)

Pod Hardening:

  • Require non-root user
  • Disallow privileged containers
  • Drop all Linux capabilities by default
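
The image and pod patterns above can be sketched as a small check over a Pod spec. This is an illustrative Python sketch of the logic only, not Kyverno or Gatekeeper policy syntax. The function name `check_pod` and the registry `registry.example.com` are assumptions made for the example; the field names (`runAsNonRoot`, `privileged`, `capabilities.drop`) follow the Kubernetes Pod spec. Signature verification is left out because it needs Cosign rather than a field check.

```python
from typing import Any

TRUSTED_REGISTRIES = ("registry.example.com/",)  # hypothetical trusted registry


def check_pod(pod_spec: dict[str, Any]) -> list[str]:
    """Return policy violations for a Pod spec dict (empty list means compliant)."""
    violations = []
    pod_sc = pod_spec.get("securityContext", {})
    for c in pod_spec.get("containers", []):
        name, image = c["name"], c["image"]
        sc = c.get("securityContext", {})

        # Image trust: trusted registry only, and no mutable or implicit :latest tag.
        if not image.startswith(TRUSTED_REGISTRIES):
            violations.append(f"{name}: untrusted registry")
        last = image.rsplit("/", 1)[-1]  # drop registry/path so a registry port is ignored
        if "@" not in last and (":" not in last or last.endswith(":latest")):
            violations.append(f"{name}: mutable or missing tag")

        # Pod hardening: non-root, not privileged, all capabilities dropped.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            violations.append(f"{name}: must set runAsNonRoot")
        if sc.get("privileged", False):
            violations.append(f"{name}: privileged not allowed")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            violations.append(f"{name}: must drop ALL capabilities")
    return violations


good = {"containers": [{
    "name": "api",
    "image": "registry.example.com/team/api:1.4.2",
    "securityContext": {"runAsNonRoot": True, "capabilities": {"drop": ["ALL"]}},
}]}
bad = {"containers": [{
    "name": "web",
    "image": "docker.io/library/nginx:latest",
    "securityContext": {"privileged": True},
}]}

print(check_pod(good))
for violation in check_pod(bad):
    print(violation)
```

In a cluster the same rules would live in an admission controller, so a non-compliant Pod is rejected before it is scheduled.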

Star-Lord: Runtime & Governance Patterns

Runtime Boundaries:

  • Block hostPath mounts and hostNetwork / hostPID access
  • Enforce read-only root filesystem
  • Prevent privilege escalation

Operational Governance:

  • Require resource limits (CPU, memory)
  • Enforce required labels (team, cost-center)
  • Restrict allowed namespaces / service accounts

Demo #1: Star-Lord

Policy Enforcement with Kyverno

What We'll Show:

  1. Deploy Kyverno admission controller
  2. Apply policy: Require signed images + non-root
  3. Try unsigned image → ❌ Blocked
  4. Try root container → ❌ Denied
  5. Deploy compliant workload → ✅ Success

Guardian #2

โš”๏ธ Gamora

Supply Chain Integrity


Gamora: The Supply Chain Problem

Trust is a vulnerability. You don't control:

  • Base images (Docker Hub, public registries)
  • Transitive dependencies (your deps pull other deps)
  • Build tools & CI infrastructure (can be compromised)

Real-World Proof:

  • xz-utils (2024): Trusted maintainer planted SSH backdoor after 2 years
  • LiteLLM (2026): Compromised security scanner → AI gateway backdoored on PyPI
  • Axios NPM (2026): Hijacked account → RAT delivered to 100M+ weekly downloaders

โš”๏ธ Gamora: Supply Chain Defense Pipeline


Gamora: Vulnerability Scanning & Signing

Vulnerability Scanning:

  • Match packages against CVE databases (NVD, OSV)
  • Severity scoring (CVSS). Gate: Fail builds on HIGH/CRITICAL
  • Tools: Trivy, Grype, Snyk
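
The HIGH/CRITICAL gate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the finding shape (`id`, `package`, `severity`) is a simplified, hypothetical structure rather than the exact output of Trivy, Grype, or Snyk, and `EXAMPLE-0001` is a made-up entry.

```python
BLOCKING = {"HIGH", "CRITICAL"}


def gate(findings: list[dict]) -> list[dict]:
    """Return the findings severe enough to fail the build."""
    return [f for f in findings if f.get("severity", "").upper() in BLOCKING]


def exit_code(findings: list[dict]) -> int:
    """Non-zero when any blocking finding exists, so a CI step would be marked failed."""
    return 1 if gate(findings) else 0


sample = [
    {"id": "CVE-2021-44228", "package": "log4j-core", "severity": "CRITICAL"},
    {"id": "EXAMPLE-0001", "package": "examplelib", "severity": "LOW"},  # made-up entry
]

for f in gate(sample):
    print(f"BLOCKING {f['id']} in {f['package']} ({f['severity']})")
print("exit code:", exit_code(sample))
```

A CI job would pass the result of `exit_code` to the process exit status so the pipeline stops before the image is pushed.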

Cryptographic Signing:

  • Keyless with OIDC (no key management!)
  • Sigstore: Cosign + Rekor + Fulcio
  • Verify: Only signed images deploy

Gamora: SLSA Framework

Supply chain Levels for Software Artifacts (v1.0)

  • Build L0: No guarantees (status quo)
  • Build L1: Build provenance exists
  • Build L2: Hosted build platform (tamper-resistant)
  • Build L3: Hardened build platform (isolated, auditable)

Goal: Move from L0 → L2+ for production

Standard: OpenSSF (Open Source Security Foundation)


Demo #2: Gamora

Complete Supply Chain Pipeline

What We'll Show:

  1. Generate SBOM with Syft → See all packages
  2. Scan image with Trivy → Find CVEs
  3. Sign with Cosign → Keyless OIDC signature
  4. Verify signature → Cryptographic proof
  5. Deploy with policy → Only signed allowed

Key Takeaway: Cryptographic trust from build to deploy


Guardian #3

🔧 Rocket

Image Hardening


Rocket: Before & After


Rocket: Distroless Philosophy

What is Distroless?

  • Only runtime dependencies (language runtime + your app)
  • No shell (bash, sh) → Can't RCE via shell injection
  • No package manager → Can't install malware
  • No OS utilities → Minimal attack surface

Rocket: Distroless by the Numbers

Numbers:

  • Ubuntu base: ~80MB, 100+ packages
  • Distroless: ~2-20MB, <10 packages
  • Result: 60-80% fewer CVEs

Modern Options:

  • Google Distroless (Debian-based, Bazel builds)
  • Chainguard Images / Wolfi (2,000+ images, nightly rebuilds, built-in SBOMs, near-zero CVEs)

Rocket: Multi-Stage Builds

Separate Build and Runtime:

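# Stage 1: Build (has pip and build tooling)
FROM python:3.11-slim AS build
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application source so /app/main.py exists for the runtime stage
COPY . .

# Stage 2: Runtime (minimal; Debian 12 ships Python 3.11, matching the build stage)
FROM gcr.io/distroless/python3-debian12
COPY --from=build /usr/local/lib/python3.11/site-packages \
     /usr/local/lib/python3.11/site-packages
COPY --from=build /app /app
# Debian's Python does not search /usr/local/.../site-packages by default
ENV PYTHONPATH=/usr/local/lib/python3.11/site-packages
USER nonroot:nonroot
ENTRYPOINT ["/usr/bin/python3", "/app/main.py"]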

Build tools never reach production


Demo #3: Rocket

Image Hardening Before/After

What We'll Show:

  1. Scan "before" (python:3.11-bullseye) → Count CVEs
  2. Scan "after" (distroless python3-debian12) → Count CVEs
  3. Compare: 60-80% reduction
  4. Compare sizes: 50%+ smaller
  5. Show: No shell in distroless container

Key Takeaway: Minimal base = minimal risk


Guardian #4

💪 Drax

Runtime Detection


Drax: Why Runtime Detection?

Build-time scanning can't detect:

  • Zero-day exploits → No CVE exists yet
  • Fileless attacks → Malware in memory only
  • Living-off-the-land → Abuse curl, bash, legitimate tools
  • Insider threats → Authorized malicious actions
  • Configuration drift → Runtime container changes

Remember that 267-day dwell time? Runtime detection is how you shrink it.

You need eyes on running containers


Drax: eBPF Technology

Extended Berkeley Packet Filter

What is eBPF?

  • Kernel-level syscall monitoring
  • Verified safe by kernel (can't crash system)
  • JIT compiled (near-native performance, <1% overhead)
  • Event-driven (zero cost when idle)
  • Can't be bypassed by userspace malware

Used by: Cilium, Falco, Tetragon, Pixie, Hubble

Industry consensus: eBPF is the future of observability


Drax: Detection vs. Enforcement

Falco (CNCF Graduated) → Detection (alert on suspicious behavior)

Tetragon (part of Cilium) → Enforcement (kill processes, block syscalls in-kernel)

Use Falco for broad behavioral monitoring + alerting
Use Tetragon when you need real-time kernel-level blocking


Drax: Detection Architecture


Demo #4: Drax

Runtime Detection with Falco

What We'll Show:

  1. Deploy Falco with the modern eBPF probe
  2. Apply custom rule: Detect /etc writes
  3. Monitor Falco logs in real time
  4. Trigger: Pod writes /etc/shadow, /etc/passwd
  5. Observe: Alerts with pod, file, user context
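
The condition behind the custom rule in the steps above can be sketched in Python. This is a conceptual illustration only: it is not Falco rule syntax and not how the Falco engine is implemented. The `FileOpenEvent` type and its fields are hypothetical simplifications of a file-open syscall event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileOpenEvent:
    """Hypothetical, simplified view of a file-open syscall event."""
    pod: str
    user: str
    path: str
    for_write: bool


def is_etc_write(event: FileOpenEvent) -> bool:
    """True when a file at or below /etc is opened for writing."""
    return event.for_write and (event.path == "/etc" or event.path.startswith("/etc/"))


def alert(event: FileOpenEvent) -> str:
    """Format an alert carrying the pod, file, and user context."""
    return f"Write below /etc: pod={event.pod} file={event.path} user={event.user}"


events = [
    FileOpenEvent("tester", "root", "/etc/shadow", for_write=True),
    FileOpenEvent("tester", "root", "/etc/passwd", for_write=False),   # read only, ignored
    FileOpenEvent("tester", "root", "/tmp/etcetera", for_write=True),  # not under /etc, ignored
]

alerts = [alert(e) for e in events if is_etc_write(e)]
print("\n".join(alerts))
```

The point of the sketch is that the rule matches on behavior (a write under a sensitive path), so it fires whether or not a CVE exists for the technique used.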

Key Takeaway: Detect malicious behavior instantly


Guardian #5

🌳 Groot

Zero-Trust Networking


Groot: The Lateral Movement Problem

Kubernetes Default: Flat Network

  • Any pod can reach any other pod
  • No network boundaries between namespaces
  • Attacker compromises frontend → pivots to database
  • Single vulnerability = full cluster access

Real Attack: Capital One breach (2019)

  • SSRF in web app → AWS metadata service
  • Stolen credentials → S3 bucket access
  • Lesson: Flat networks enable easy lateral movement

Groot: Zero Trust in Kubernetes


Groot: Kubernetes NetworkPolicies

How They Work, by example:

  • Selector: "This policy applies to pods labeled app=api"
  • Ingress: "Only app=frontend can call the API"
  • Egress: "API can only connect to app=database"
  • Namespace: "Nothing in dev can reach prod"
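
The selector and ingress examples above can be written out as manifests. The sketch below builds them as Python dicts and prints JSON, which `kubectl apply -f` accepts alongside YAML. The namespace `prod` and the policy names are assumptions for illustration; the field names follow the `networking.k8s.io/v1` NetworkPolicy API.

```python
import json


def default_deny(namespace: str) -> dict:
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        # An empty podSelector selects every pod in the namespace.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace: str, target_app: str, source_app: str) -> dict:
    """Allow ingress to pods labeled app=<target_app> only from app=<source_app>."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{source_app}-to-{target_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"podSelector": {"matchLabels": {"app": source_app}}}]}],
        },
    }


policies = [default_deny("prod"), allow_ingress("prod", "api", "frontend")]
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": policies}, indent=2))
```

Note that a default-deny on egress also blocks DNS lookups, so a real rollout usually pairs it with an explicit rule allowing traffic to the cluster DNS service.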

CNI Plugin Required: Enforcement needs a NetworkPolicy-capable CNI such as Calico or Cilium (Docker Desktop does not enforce them!)

Beyond L3/L4: Service mesh (Istio, Linkerd) adds mTLS + L7 identity


Demo #5: Groot

Zero-Trust Network Policies

What We'll Show:

  1. Deploy 3-tier app (flat network)
  2. Test: All pods can reach each other
  3. Apply default-deny → All blocked
  4. Test: Tester can't reach API/DB ✅
  5. Apply allow rules → Only approved paths
  6. Test: Frontend→API→DB works, rest blocked ✅

Key Takeaway: Contain breaches, prevent lateral movement


Guardian #6

🔮 Mantis

Security Observability


Mantis: The Observability Gap

Siloed Teams, Siloed Tools:

Without Correlation:

  • Ops team: "API is slow" (looks at Grafana)
  • Security team: "No alerts" (checks SIEM)
  • Reality: Crypto miner running for days

Mantis: Correlation in Action

With Correlation:

  • 9:00 AM: API latency spike (APM)
  • 9:02 AM: High CPU usage (Prometheus)
  • 9:02 AM: Suspicious process (Falco alert)
  • Context: Same pod, same trace ID
  • Result: Detected in minutes, not days
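
The timeline above can be sketched as a grouping step: collect events per pod and flag pods where several independent sources fire close together. This is a minimal Python illustration, not the behavior of any particular backend; the pod names, trace IDs, 5-minute window, and two-source threshold are all assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Illustrative events mirroring the timeline above; pod and trace values are made up.
events = [
    {"time": "09:00", "source": "APM", "signal": "API latency spike", "pod": "api-1", "trace_id": "t-42"},
    {"time": "09:02", "source": "Prometheus", "signal": "High CPU usage", "pod": "api-1", "trace_id": None},
    {"time": "09:02", "source": "Falco", "signal": "Suspicious process", "pod": "api-1", "trace_id": "t-42"},
    {"time": "09:01", "source": "Prometheus", "signal": "High CPU usage", "pod": "batch-7", "trace_id": None},
]


def correlate(events, window=timedelta(minutes=5), min_sources=2):
    """Group events by pod; report pods where several sources fire within the window."""
    by_pod = defaultdict(list)
    for e in events:
        by_pod[e["pod"]].append(e)

    incidents = {}
    for pod, pod_events in by_pod.items():
        pod_events.sort(key=lambda e: e["time"])  # zero-padded HH:MM sorts correctly as text
        first = datetime.strptime(pod_events[0]["time"], "%H:%M")
        in_window = [e for e in pod_events
                     if datetime.strptime(e["time"], "%H:%M") - first <= window]
        if len({e["source"] for e in in_window}) >= min_sources:
            incidents[pod] = in_window
    return incidents


incidents = correlate(events)
for pod, evs in incidents.items():
    print(pod, [f"{e['time']} {e['source']}: {e['signal']}" for e in evs])
```

A single CPU spike on `batch-7` is not reported, while `api-1` is, because three different sources agree on the same pod within minutes.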

Mantis: Observability Correlation


Mantis: The Observability Context

Common Context:

  • Pod name, namespace
  • Trace ID (links requests across services)
  • Timestamp (timeline reconstruction)

Goal: Mean Time To Respond (MTTR) < 1 hour


Mantis: OpenTelemetry for Security

Why OTEL Matters:

Traces: Show which services were accessed during incident

Metrics: Detect resource anomalies (CPU spike = crypto miner)

Logs: Capture security-relevant events with context

Vendor-Neutral: Single instrumentation → any backend

  • Jaeger, Prometheus, Grafana
  • Datadog, New Relic, Splunk
  • No lock-in

Demo #6: Mantis

Observability Correlation

What We'll Show:

  1. Deploy OTEL collector + instrumented app
  2. Deploy Falcosidekick → Route alerts to OTEL
  3. Generate traffic → See traces in logs
  4. (Optional) Trigger Falco → See security events alongside app traces
  5. View timeline โ†’ App + security events

Key Takeaway: Link security to business impact


Guardians Together: Prevent & Harden

Scenario: Cryptominer in compromised Node.js image

Attack Step | Guardian | Action | Result
Poisoned base image | ⚔️ Gamora | Scan + SBOM detects vuln | ⚠️ Known threats caught
Bloated surface | 🔧 Rocket | Distroless reduces tooling | ✅ Less to exploit
Unsigned deploy | 🎯 Star-Lord | Policy rejects image | ✅ Blocked at gate

Prevention catches what's known, but what gets through?


Guardians Together: Detect & Contain

The attacker bypassed build-time controls…

Attack Step | Guardian | Action | Result
Mining process spawns | 💪 Drax | Falco detects anomaly | ✅ Alert in seconds
C2 network beacon | 🌳 Groot | Egress policy blocks it | ✅ Contained
Full timeline needed | 🔮 Mantis | Correlates all signals | ✅ MTTR < 1 hour

Not every layer prevents: some reduce, some detect, some contain.


Container Security Maturity Model

Level | Actions to Reach It
0 → 1 | Image scanning in CI, Pod Security Admission (audit)
1 → 2 | Image signing + verification, default-deny NetworkPolicies
2 → 3 | Runtime detection (Falco), observability correlation, mTLS
3 → 4 | Attestations, automated response, MTTR optimization

Your First Week

Concrete Steps to Start Monday:

๐Ÿ” Day 1: Run trivy image on your top 5 production images
๐Ÿ“‹ Day 2: Generate your first SBOM with syft โ†’ know your dependencies
๐Ÿ”’ Day 3: Apply Restricted Pod Security Standard to one namespace (audit mode)
๐ŸŒ Day 4: Apply default-deny NetworkPolicy to one namespace
๐Ÿ‘๏ธ Day 5: Deploy Falco in dry-run mode โ†’ see what it detects

Don't boil the ocean: pick one namespace, one app, one pipeline.


Key Takeaways

  1. Defense in Depth: No single tool is enough
  2. Shift Left + Shield Right: Prevention AND detection required
  3. Principles First, Tools Second: Pick controls you can actually run consistently
  4. Start Small: One namespace, one pipeline, prove value, expand
  5. Measure Progress: CVEs blocked, MTTR, coverage %

"We are layered": Security is a team sport


Questions?


Resources & Links


Star-Lord decides what may enter. Next: Gamora verifies what enters is trustworthy.

Rocket ships a cleaner target. But once running, what is the container doing?

Drax detects the threat. Groot limits where it can go.

Groot limits where threats can go. Mantis shows you that they tried.