---
name: forge-kubernetes
description: Production-grade Kubernetes manifests. Pinned images, resource requests + memory limits (no CPU limits), securityContext (non-root, read-only root, drop all caps, RuntimeDefault seccomp), readiness probes, PodDisruptionBudgets, NetworkPolicy, external secrets, graceful shutdown. Contains ready-to-paste Deployment + Service + PDB + NetworkPolicy. Use when writing or auditing manifests for production traffic.
license: MIT
---

# forge-kubernetes

You are writing manifests that will run real traffic against real users. Default agent-written k8s manifests omit resource limits (one pod OOMs the node), use `image: foo:latest` (rollbacks become impossible), set no probes (k8s thinks every pod is healthy forever), and bake secrets into ConfigMaps. This skill exists to fix all of those.

The mental model: **every field that k8s has a default for is wrong for production in some subtle way. Set the important fields explicitly.**

## Quick reference (the things you must never ship)

1. `image: foo:latest` or any untagged image.
2. A `Pod` for anything that should be replaced or scaled (use `Deployment`).
3. A `Deployment` with `replicas: 1`, or one with multiple replicas and no `PodDisruptionBudget`.
4. Container without `securityContext.runAsNonRoot: true`.
5. Workload without `resources.requests` AND `resources.limits.memory`.
6. CPU limits set (almost always wrong; see hard rule 10).
7. No `readinessProbe` on a container that serves traffic.
8. Secrets passed via plaintext `env` value.
9. `hostNetwork: true` outside specific infra pods.
10. No `NetworkPolicy` (flat lateral-movement surface in the cluster).

## Hard rules

### Image references

**1. No `:latest`. No untagged images.** `image: app:1.4.2` or `image: app:sha-abc123`. Rollbacks depend on stable tags.

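**2. `imagePullPolicy: IfNotPresent` for tagged images.** The default depends on the tag (`Always` for `:latest` or untagged images, `IfNotPresent` otherwise); set it explicitly so the behavior does not change when the tag does.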

**3. Image digest for true immutability:** `image: app@sha256:...`. A tag can be re-pushed to point at different content; a digest cannot.
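
As a hedged illustration of rule 3, a container spec can carry both the tag and the digest; when a digest is present, only the digest is used for the pull and the tag is informational. The registry host, image name, and `<digest>` placeholder are assumptions, not values from a real registry.

```yaml
# Sketch only: substitute the digest reported by your registry or CI build.
containers:
  - name: app
    # The tag documents the version for humans; the digest is what gets pulled.
    image: registry.example.com/app:1.4.2@sha256:<digest>
    imagePullPolicy: IfNotPresent
```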

### Workloads

**4. `Deployment` for stateless services, `StatefulSet` for stateful, `DaemonSet` for per-node agents, `Job/CronJob` for batch.** Never a bare `Pod` for anything that should be replaced.

**5. `replicas:` set explicitly.** Defaulting to 1 in prod is asking for a single-node failure to take you down.

**6. PodDisruptionBudgets accompany every Deployment with `replicas > 1`.** Without a PDB, node drains can take all replicas down simultaneously.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: my-service
```

### Resources

**7. Set `resources.requests` and `resources.limits.memory` on every container.** Without requests, the scheduler cannot place pods correctly. Without memory limits, one pod can starve neighbors.

```yaml
resources:
  requests:
    cpu: 100m
    memory: 512Mi    # equals the limit, see rule 9
  limits:
    memory: 512Mi    # NO cpu limit, see rule 10
```

**8. `requests.cpu` set to actual measured baseline + 20% headroom.** Not "I think 100m sounds reasonable."

**9. `requests.memory` equals `limits.memory` for predictable scheduling.** The scheduler places pods by request, not limit; a pod using more memory than it requested is among the first evicted when the node runs short.

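**10. CPU limits are debated. Most teams should NOT set them.** A container that hits its CPU limit is throttled for the rest of the scheduling period even when the node has idle CPU, which shows up as latency spikes. Set requests for scheduling and reservations; let workloads burst above them for short spikes.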

**11. Memory limits are mandatory.** No exceptions. Without them, one leaking container can exhaust the node's memory and take unrelated pods down with it.

### Probes

**12. Every container that serves traffic has a `readinessProbe`.** Without it, the service receives traffic the moment the container starts, before the app is ready.

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
```

**13. `livenessProbe` only when you have a deterministic way to detect unrecoverable state.** Bad liveness probes (e.g., HTTP timeout) cause restart cascades during traffic spikes. When in doubt, omit.
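
If a liveness probe is warranted, one possible shape is a dedicated endpoint that checks only in-process state (never the database or other dependencies), with a tolerant threshold. The `/health/live` path is a hypothetical endpoint your app would need to implement; the numbers are illustrative.

```yaml
livenessProbe:
  httpGet:
    path: /health/live     # hypothetical: checks in-process state only
    port: 3000
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6      # restart only after about 60s of consecutive failures
```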

**14. `startupProbe` for slow-starting apps.** Better than tuning long initial delays on liveness/readiness.

```yaml
startupProbe:
  httpGet:
    path: /health/ready
    port: 3000
  failureThreshold: 30   # allows up to 30 x 5s = 150s to start
  periodSeconds: 5
  timeoutSeconds: 2
```

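**15. Explicit `failureThreshold`, `periodSeconds`, `timeoutSeconds` on every probe.** The defaults (`periodSeconds: 10`, `timeoutSeconds: 1`, `failureThreshold: 3`) rarely match a real service; the 1-second timeout in particular fails probes under ordinary load.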

### Security context

**16. Full securityContext block on every container.** This is the single most-skipped configuration in default manifests.

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 10001
  runAsGroup: 10001
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```

**17. `hostNetwork`, `hostPID`, `hostIPC`: never `true` outside specific infra pods.**

### Secrets

**18. No secrets in ConfigMaps. No secrets in `env` value literals. No secrets in plaintext YAML committed to git.**

```yaml
# BAD
env:
  - name: STRIPE_API_KEY
    value: sk_live_...

# GOOD
env:
  - name: STRIPE_API_KEY
    valueFrom:
      secretKeyRef:
        name: app-secrets
        key: stripe_api_key
```

**19. Use a real secrets store.** External Secrets Operator with AWS Secrets Manager / Vault / Doppler, or sealed-secrets for git-ops setups.
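
A minimal sketch of rule 19 with External Secrets Operator, which syncs a value from an external store into the `app-secrets` Secret that rule 18's `secretKeyRef` reads. The store name, remote key path, and property are hypothetical, and the `apiVersion` depends on the operator release installed in your cluster.

```yaml
apiVersion: external-secrets.io/v1beta1   # may be v1 on newer operator releases
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: prod
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: app-secrets              # Kubernetes Secret the operator creates
  data:
    - secretKey: stripe_api_key    # key inside the generated Secret
      remoteRef:
        key: prod/app/stripe       # hypothetical path in the external store
        property: api_key
```

Only this reference is committed to git; the secret value itself stays in the external store.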

**20. Mounted secrets get `defaultMode: 0400`** so only the owning user can read them.
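
A sketch of rule 20 as a pod spec fragment. The Secret name `app-tls`, the volume name, and the mount path are illustrative assumptions.

```yaml
# Pod spec fragment; names and paths are placeholders.
containers:
  - name: app
    volumeMounts:
      - name: tls
        mountPath: /etc/tls
        readOnly: true
volumes:
  - name: tls
    secret:
      secretName: app-tls
      defaultMode: 0400   # octal in YAML; JSON manifests need decimal 256
```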

**21. `imagePullSecrets` for private registries via a `kubernetes.io/dockerconfigjson` Secret.** Never inline credentials.
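
For rule 21, the pod references the registry credential by name only. The Secret name `registry-credentials` is an assumption; the Secret itself (type `kubernetes.io/dockerconfigjson`) would be created by your secrets tooling rather than committed alongside the manifest.

```yaml
# Pod spec fragment; the referenced Secret must exist in the same namespace.
spec:
  imagePullSecrets:
    - name: registry-credentials
  containers:
    - name: app
      image: registry.example.com/app:1.4.2
```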

### Networking

**22. `Service` of type `ClusterIP` for internal services. Avoid `LoadBalancer` per service.** Each `LoadBalancer` is a billed cloud resource. Use a single Ingress.
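
One way rule 22 can look in practice: a single Ingress fanning out to several `ClusterIP` Services by path. The host, the `nginx` ingress class, and the `users-api` Service are assumptions for illustration.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: public
  namespace: prod
spec:
  ingressClassName: nginx          # assumption: an ingress-nginx controller is installed
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /orders
            pathType: Prefix
            backend:
              service:
                name: orders-api
                port:
                  number: 80
          - path: /users
            pathType: Prefix
            backend:
              service:
                name: users-api    # hypothetical second ClusterIP Service
                port:
                  number: 80
```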

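**23. `NetworkPolicy` defines what each pod is allowed to talk to.** Without it, every pod can reach every other pod: a flat lateral-movement surface. Policies are enforced only when the cluster's network plugin supports them (for example Calico or Cilium); on a plugin that does not, the object is accepted and silently ignored.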

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-service-netpol
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: my-service
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
      - podSelector:
          matchLabels:
            app.kubernetes.io/name: api-gateway
      ports:
        - protocol: TCP
          port: 3000
  egress:
    - to:
      - podSelector:
          matchLabels:
            app.kubernetes.io/name: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:                  # DNS
      - namespaceSelector: {}
        podSelector:
          matchLabels:
            k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP    # DNS falls back to TCP for large responses
          port: 53
```
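
A per-service policy like the one above is commonly paired with a namespace-wide default deny, so that pods without their own policy are not left open. This is a sketch; the `prod` namespace is an assumption.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod
spec:
  podSelector: {}                    # empty selector = every pod in the namespace
  policyTypes: ["Ingress", "Egress"] # no rules listed, so nothing is allowed
```

NetworkPolicies are additive, so each service's own policy then opens only the paths it needs.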

**24. `hostPort` is forbidden outside DaemonSet/infra contexts.**

### Lifecycle

**25. `terminationGracePeriodSeconds` covers the `preStop` hook plus your app's longest in-flight request.** The default of 30s suits most HTTP services; set it higher for long-running jobs.

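**26. `preStop` hook for apps that need to drain.** A short sleep keeps the pod serving while its removal from the Service endpoints propagates; the app then receives SIGTERM and shuts down gracefully. The `exec` form needs a shell in the image.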

```yaml
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]
```
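
Rules 25 and 26 interact: the grace period clock starts when termination begins, so it has to cover the `preStop` sleep as well as the shutdown itself. A sketch, assuming requests finish within 30 seconds:

```yaml
# Pod spec fragment. Budget: 5s preStop sleep + up to 30s in-flight work + margin.
spec:
  terminationGracePeriodSeconds: 45
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]
```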

**27. App traps SIGTERM and exits cleanly.** Document the trap.

### Deployments / Rollouts

**28. `strategy.type: RollingUpdate` with `maxUnavailable` and `maxSurge` set.** Both default to 25%, so up to a quarter of the pods can be unavailable during a rollout.

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
```

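**29. `progressDeadlineSeconds` set.** The default is 600s. When the deadline passes, the Deployment is marked `ProgressDeadlineExceeded` but is not rolled back, so the deploy tool has to check for it (rule 30).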

**30. Deployment tool with rollout verification (Argo CD with health checks, Flux, k8s deploy strategies in CI).** `kubectl apply` and walk away is a footgun.

### Observability

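**31. Pods emit Prometheus-style metrics on a dedicated port and path.** Documented in annotations. The `prometheus.io/*` annotations are a convention, not a Kubernetes feature; they take effect only if your Prometheus scrape configuration reads them.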

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
```

**32. Logs go to stdout/stderr only.** A node-level log agent (or a sidecar shipper) handles the rest.

**33. Standard labels:** `app.kubernetes.io/name`, `app.kubernetes.io/version`, `app.kubernetes.io/component`, `app.kubernetes.io/managed-by`.

### Namespaces and isolation

**34. Each environment is its own namespace, often its own cluster.**

**35. `ResourceQuota` per namespace caps requestable resources.**
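
A sketch of rule 35 with illustrative numbers. Once a quota covers a resource, pods that do not declare that request or limit are rejected, which also enforces rules 7 and 11. `limits.cpu` is left out on purpose so the quota does not force CPU limits (rule 10).

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: prod
spec:
  hard:
    requests.cpu: "20"       # total CPU requests allowed in the namespace
    requests.memory: 40Gi
    limits.memory: 40Gi
    pods: "100"
```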

### Storage

**36. `PersistentVolumeClaim` with explicit `storageClassName` and `accessModes`.**
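
A minimal claim following rule 36. The class name `fast-ssd` and the size are assumptions; the storage classes actually available depend on the cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-data
  namespace: prod
spec:
  storageClassName: fast-ssd       # hypothetical; list real ones with `kubectl get storageclass`
  accessModes: ["ReadWriteOnce"]   # mounted read-write by a single node
  resources:
    requests:
      storage: 20Gi
```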

**37. No `emptyDir` for data that should survive a pod restart.**

## Common AI-output patterns to reject

| Pattern | Why dangerous | Fix |
| --- | --- | --- |
| `image: app:latest` | Rollback impossible | Pinned tag or digest |
| `replicas: 1` no PDB | Single failure = outage | Replicas ≥ 2 + PDB |
| No `resources` | Scheduler blind, node-killer pods | requests + memory limit |
| `limits.cpu: 500m` | CPU throttling spikes | Drop CPU limit |
| Container running as root | A compromised process starts with root privileges | `runAsNonRoot: true` |
| Secret value in `env` plaintext | Leaks in `kubectl describe` | `secretKeyRef` |
| No `readinessProbe` | Service routes to unready pods | HTTP probe on /health/ready |
| Aggressive `livenessProbe` | Restart cascade during DB blip | Omit liveness, or probe truly local state |
| `Service: LoadBalancer` per service | Cloud bill | One Ingress, many ClusterIP services |
| No `NetworkPolicy` | Flat lateral movement | Per-service netpol with explicit ingress/egress |
| `kubectl apply -f .` in CI | No drift detection | Argo CD / Flux with health checks |

## Worked example: complete service manifest

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: prod
  labels:
    app.kubernetes.io/name: orders-api
    app.kubernetes.io/version: "1.4.2"
    app.kubernetes.io/component: backend
    app.kubernetes.io/managed-by: argocd
spec:
  replicas: 3
  progressDeadlineSeconds: 300
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: orders-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: orders-api
        app.kubernetes.io/version: "1.4.2"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      terminationGracePeriodSeconds: 30
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          image: registry.example.com/orders-api:1.4.2
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: PORT
              value: "3000"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: orders-api-secrets
                  key: database_url
            - name: SESSION_SECRET
              valueFrom:
                secretKeyRef:
                  name: orders-api-secrets
                  key: session_secret
          resources:
            requests:
              cpu: 100m
              memory: 512Mi   # equals the limit (rule 9)
            limits:
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          startupProbe:
            httpGet:
              path: /health/ready
              port: 3000
            failureThreshold: 30
            periodSeconds: 2
            timeoutSeconds: 2
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir:
            medium: Memory
            sizeLimit: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: orders-api
  namespace: prod
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: orders-api
  ports:
    - name: http
      port: 80
      targetPort: 3000
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api-pdb
  namespace: prod
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: orders-api
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-api-netpol
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: orders-api
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: api-gateway
      ports:
        - protocol: TCP
          port: 3000
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP    # DNS falls back to TCP for large responses
          port: 53
```

What this demonstrates: pinned image (rule 1); 3 replicas + PDB (rules 5, 6); full securityContext at both pod and container level (rule 16); resources with no CPU limit (rules 7, 10, 11); startup + readiness probes, no liveness (rules 12, 13, 14); secrets via `secretKeyRef` not plaintext (rule 18); `readOnlyRootFilesystem` + writable `tmp` volume (rule 16); preStop sleep (rule 26); standard labels (rule 33); Service is ClusterIP not LoadBalancer (rule 22); explicit NetworkPolicy (rule 23).

## Workflow

When writing a Kubernetes manifest:

1. **Start from a known-good template.** Bitnami chart, internal Helm chart, or a reviewed reference. Greenfield YAML is where mistakes live.
2. **Set the security context first.** Get it right once for the whole file.
3. **Set resources.** Measure first, then set.
4. **Add probes.** Readiness on every container.
5. **Document non-obvious choices in YAML comments.**
6. **`kubectl apply --dry-run=server -f manifest.yaml`** before applying.
7. **Run `kubeconform` (the successor to the no-longer-maintained `kubeval`), `kube-score`, `polaris`, or `kubesec` against the manifest.** Each catches a different bug class.

## Verification

```bash
bash skills/infra/forge-kubernetes/verify/check_k8s.sh path/to/manifest.yaml
```

Flags: `:latest` tag, missing resources, missing securityContext, missing probes, secret-shaped env literals.

## When to skip this skill

- Local development clusters (kind, minikube, k3d) where production strictness is overkill.
- One-off jobs that run to completion in seconds.
- Manifests fully generated by an opinionated platform tool (operator, Crossplane) that already enforces these.

## Related skills

- [`forge-dockerfile`](../forge-dockerfile/SKILL.md) - the image that runs in the pod.
- [`forge-secrets`](../../security/forge-secrets/SKILL.md) - the secrets `secretKeyRef` points at.
- [`forge-github-actions`](../forge-github-actions/SKILL.md) - the CI that builds and applies the manifest.
