---
name: gke-deployment
description: >-
  GKE infrastructure patterns including Autopilot configuration, HPA custom
  metrics, Kubernetes manifests, Redis cluster architecture, Docker multi-stage
  builds, network architecture, and observability stack. Use when working with
  deployment, infrastructure, or operational concerns.
---

# GKE Deployment Skill

## Instructions

When users work with infrastructure, deployment, Kubernetes, or operational concerns:

1. Reference the architecture document Sections 5, 8, and 9 for implementations
2. Follow GKE Autopilot patterns (no node management, pay per pod)
3. Use custom HPA metrics for service-specific scaling
4. Apply security best practices: non-root containers, Workload Identity, PDB
5. Monitor cost -- LLM APIs are ~75% of spend, not infrastructure

## GKE Cluster Configuration

```
Cluster: ace-prod
Region: us-central1
Channel: Rapid (latest K8s)
Network: ace-vpc (custom)
Mode: Autopilot (default) + Standard GPU pool (optional)
```

**Why Autopilot**: eliminates node management overhead for an R&D team of fewer than
25 people. The ~20% cost premium is offset by operational simplicity. GPU workloads run
in the optional Standard node pool, since Autopilot does not expose user-managed node pools.

## HPA Configuration

| Service | Min | Max | Scale Metric | Target | Scale-Up | Scale-Down |
|---------|-----|-----|-------------|--------|----------|------------|
| ai-gateway | 3 | 50 | `active_sse_connections` | 200/pod | 30s | 300s |
| user-service | 2 | 20 | CPU utilization | 70% | 60s | 300s |
| telemetry-ingestion | 4 | 30 | `pubsub_subscription_backlog` | <10K msgs | 15s | 180s |
| sync-service | 2 | 10 | CPU utilization | 70% | 60s | 300s |
| internal-api | 2 | 8 | CPU utilization | 60% | 60s | 300s |
| stream-processor | 2 | 16 | `flink_checkpoint_duration_ms` | <5000ms | 60s | 600s |

- **Scale-up policy**: the larger of 5 pods or 50% of current replicas, per 30s
- **Scale-down policy**: at most 2 pods per 60s, with a 300s stabilization window
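
As a rough illustration, the `ai-gateway` row and the two policies above could map onto an `autoscaling/v2` HPA like the sketch below. The metric name as seen by the HPA depends on which metrics adapter is installed, and reading the table's Scale-Up column as a stabilization window is an assumption, not something the table states.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-gateway
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          # Assumed name; the adapter may expose it under a prefixed form.
          name: active_sse_connections
        target:
          type: AverageValue
          averageValue: "200"        # 200 connections per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30 # assumption: table "Scale-Up" column
      selectPolicy: Max              # larger of the two policies wins
      policies:
        - type: Pods
          value: 5
          periodSeconds: 30
        - type: Percent
          value: 50
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
```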

## Kubernetes Manifest Patterns

### Deployment
- `maxSurge: 2, maxUnavailable: 0` -- zero downtime during rollouts
- `revisionHistoryLimit: 5`
- `topologySpreadConstraints`: maxSkew 1 across zones for HA
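
A minimal Deployment sketch showing where these three settings live. The `app: ai-gateway` label, the image placeholder, and `whenUnsatisfiable: ScheduleAnyway` are illustrative assumptions; `replicas` is omitted on the assumption that the HPA owns the replica count.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-gateway
spec:
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # up to 2 extra pods during a rollout
      maxUnavailable: 0    # never drop below the desired count
  selector:
    matchLabels:
      app: ai-gateway      # hypothetical label
  template:
    metadata:
      labels:
        app: ai-gateway
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # assumption; DoNotSchedule is stricter
          labelSelector:
            matchLabels:
              app: ai-gateway
      containers:
        - name: ai-gateway
          image: IMAGE_PLACEHOLDER            # hypothetical; substitute the real image
```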

### Pod Spec
- `serviceAccountName`: Workload Identity for GCP IAM (no key files)
- `terminationGracePeriodSeconds: 35` -- must be > preStop (5s) + graceful shutdown (25s)
- **No CPU limit**: intentional -- CFS throttling causes latency spikes. CPU requests guarantee scheduling.
- Memory limit = memory request (hard limit prevents OOM-kill of neighbors)
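
The pod-level settings above might look like the following fragment of `spec.template.spec`. The service account name and the resource sizes are placeholders chosen for illustration only.

```yaml
# Fragment of spec.template.spec in a Deployment
serviceAccountName: ai-gateway        # hypothetical KSA bound via Workload Identity
terminationGracePeriodSeconds: 35     # > preStop (5s) + graceful shutdown (25s)
containers:
  - name: ai-gateway
    resources:
      requests:
        cpu: "1"                      # illustrative value
        memory: 2Gi                   # illustrative value
      limits:
        memory: 2Gi                   # equals the request; no cpu limit on purpose
```

Autopilot can adjust resource values at admission, so it is worth confirming on the target cluster version that an omitted CPU limit is left unset.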

### Probes
- **Readiness**: `/internal/health`, initial 5s, period 10s, failure 3 -- gates traffic
- **Liveness**: `/internal/health`, initial 15s, period 20s, failure 5 -- more lenient
- **Startup**: `/internal/health`, initial 3s, period 5s, failure 10 -- allows 53s max startup
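
Written out as a container-spec fragment, the three probes could look like this. Port 8000 is taken from the ai-gateway uvicorn command and is an assumption for any other service.

```yaml
# Fragment of a container spec
readinessProbe:
  httpGet:
    path: /internal/health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3      # removed from endpoints after 3 failures
livenessProbe:
  httpGet:
    path: /internal/health
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 5      # restarted only after 5 failures
startupProbe:
  httpGet:
    path: /internal/health
    port: 8000
  initialDelaySeconds: 3
  periodSeconds: 5
  failureThreshold: 10     # 3s + 10 x 5s = 53s startup budget
```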

### Lifecycle
```yaml
preStop:
  exec:
    command: ["sh", "-c", "sleep 5"]
```
Sleep 5s before SIGTERM so the Service endpoint is removed from the LB first (covers the kube-proxy propagation delay).

### PodDisruptionBudget
```yaml
minAvailable: 2  # At least 2 pods during voluntary disruptions
```
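
For context, a complete PodDisruptionBudget around that field might look like the sketch below; the name and the `app: ai-gateway` selector label are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-gateway
spec:
  minAvailable: 2          # at least 2 pods during voluntary disruptions
  selector:
    matchLabels:
      app: ai-gateway      # hypothetical label; must match the Deployment's pods
```

One thing to check per service: when the HPA minimum is also 2, `minAvailable: 2` blocks every voluntary eviction while the service sits at its floor, which can stall node drains.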

## Redis (Memorystore) Architecture

```
3 shards x 2 nodes (primary + replica) = 6 nodes
Redis 7.2, Standard HA tier
Capacity: ~300K ops/sec, 90GB memory
Connection: Private Service Access (no public IP)
```

**Key namespace design:**

| Prefix | Type | TTL | Usage |
|--------|------|-----|-------|
| `sess:{user_id}` | STRING | 24h | JWT claims + session metadata |
| `rate:{user_id}:{endpoint}` | ZSET | 60s | Sliding window rate limiting |
| `cache:exact:{hash}` | STRING | 1h | Exact-match LLM response cache |
| `cache:vec:{hash}` | HASH + VSS | 24h | Semantic similarity cache |
| `budget:{user_id}:{period}` | HASH | Until reset | Token usage counters |
| `ff:{flag_name}` | HASH | 5min | Feature flag state |
| `cb:{service}:{endpoint}` | HASH | Varies | Circuit breaker state |
| `dedup:{fingerprint}` | STRING (NX) | 5min | Event deduplication |
| `ctx:{user_id}` | LIST | 24h | Last 50 browsing pages |

## Docker (ai-gateway)

Multi-stage build:
1. **Builder**: `python:3.12-slim` + gcc + libpq-dev, install to virtualenv
2. **Runtime**: `python:3.12-slim` + libpq5 + curl, non-root user (UID 1000)

```
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", \
     "--workers", "4", "--loop", "uvloop", "--http", "httptools", \
     "--timeout-graceful-shutdown", "25"]
```

- 4 workers (over-provisioned for I/O-bound workloads)
- `uvloop` + `httptools` for performance
- 25s graceful shutdown paired with K8s preStop (5s) + terminationGracePeriod (35s)

## Network Architecture

```
Internet -> Cloud Armor (WAF, 10K req/s per IP) -> Global HTTPS LB (L7, SSL termination)
  /v1/ai/*       -> ai-gateway (ClusterIP)
  /v1/user/*     -> user-service (ClusterIP)
  /v1/sync/*     -> user-service / sync-service (ClusterIP)
  /v1/events/*   -> telemetry-ingestion (ClusterIP)
```
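
One way the routing above could be expressed is a GKE Ingress plus a BackendConfig that attaches the Cloud Armor policy. This is a sketch under several assumptions: the resource names, the Cloud Armor policy name, and the service ports for `user-service` and `telemetry-ingestion` are hypothetical. The `/v1/sync/*` route is left out because the diagram lists two possible backends for it.

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: ace-backend-config          # hypothetical name
spec:
  securityPolicy:
    name: ace-armor-policy          # hypothetical Cloud Armor policy name
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ace-ingress                 # hypothetical name
  annotations:
    kubernetes.io/ingress.class: "gce"   # external HTTPS load balancer
spec:
  rules:
    - http:
        paths:
          - path: /v1/ai/*
            pathType: ImplementationSpecific
            backend:
              service:
                name: ai-gateway
                port:
                  number: 8000      # from the uvicorn command
          - path: /v1/user/*
            pathType: ImplementationSpecific
            backend:
              service:
                name: user-service
                port:
                  number: 8080      # assumed port
          - path: /v1/events/*
            pathType: ImplementationSpecific
            backend:
              service:
                name: telemetry-ingestion
                port:
                  number: 8080      # assumed port
```

Each backing Service would also need the `cloud.google.com/backend-config` annotation pointing at the BackendConfig for the policy to take effect.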

## Observability Stack

| Layer | Tool | Integration |
|-------|------|-------------|
| Metrics | Cloud Monitoring + OTel SDK | Custom metrics: cache hit rate, tokens, circuit state |
| Logs | Cloud Logging + structlog | Structured JSON, trace_id/span_id on every line |
| Traces | Cloud Trace + W3C traceparent | Sampling: 100% of errors, 5% of successes, 100% of requests > 2s |
| Dashboards | Grafana (Cloud Monitoring source) | Cross-service correlation |
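
The trace sampling rule depends on a request's final status and latency, so the decision has to be made after spans complete. The table does not say where that happens. One common option, shown here purely as an assumption, is the `tail_sampling` processor of the OpenTelemetry Collector (contrib distribution):

```yaml
# Fragment of an OpenTelemetry Collector config (assumes a Collector sits in the trace path)
processors:
  tail_sampling:
    decision_wait: 10s              # illustrative; must exceed typical trace duration
    policies:
      - name: keep-errors           # 100% of errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow             # 100% of traces slower than 2s
        type: latency
        latency:
          threshold_ms: 2000
      - name: sample-rest           # 5% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

A trace is kept if any policy matches, which gives the "all errors, all slow requests, 5% of the remainder" behavior.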

### SLOs

| Service | Availability | Latency P95 | Error Budget/Week |
|---------|-------------|-------------|-------------------|
| AI Gateway | 99.5% | TTFB < 2s | 50.4 min |
| User Service | 99.9% | < 200ms | 10.1 min |
| Sync Service | 99.9% | < 500ms | 10.1 min |
| Event Pipeline | 99.95% | < 5s e2e | 5.0 min |

**AI Gateway at 99.5%**: LLM provider outages are outside our control. 99.9% would
exhaust error budget in minutes during a provider outage.
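
For reference, a weekly error budget follows directly from the availability target $A$, assuming the budget is measured in wall-clock minutes over a 7-day window:

$$
\text{budget}_{\text{week}} = (1 - A) \times 7 \times 24 \times 60\ \text{min} = (1 - A) \times 10080\ \text{min}
$$

$$
A = 0.995 \Rightarrow 50.4\ \text{min}, \qquad
A = 0.999 \Rightarrow 10.08\ \text{min}, \qquad
A = 0.9995 \Rightarrow 5.04\ \text{min}
$$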

### Alerting Tiers

| Severity | Channel | Response | Examples |
|----------|---------|----------|----------|
| P0 | PagerDuty + Slack | 2 min | Service down, data loss, security breach |
| P1 | Slack + PagerDuty low | 15 min | Error rate >5%, provider degraded |
| P2 | Slack monitoring | 1 hour | Cache hit drop, cost spike |
| P3 | Daily digest | Next business day | Disk >80%, cert expiry <30d |

## Cost Optimization

### Monthly Estimate (1M DAU)

| Component | Monthly Cost | % of Total |
|-----------|-------------|-----------|
| LLM APIs | $300K-450K | ~75% |
| GKE Compute | $15K-25K | ~4% |
| Aerospike | $10K-15K | ~3% |
| BigQuery | $8K-15K | ~2.5% |
| Memorystore | $5K-8K | ~1.5% |
| Cloud SQL | $3K-5K | ~1% |
| **Total** | **$350K-535K** | **100%** |

**Strategic insight**: a cache hit rate improvement that avoids 10% of LLM calls saves
$30K-45K/mo -- more than the entire GKE compute bill ($15K-25K).

### Cost Levers

- **Spot VMs for Flink**: 60-91% discount, checkpoint/restart handles preemption
- **CUDs**: 1-year commitment for stable services (user-service, sync-service, DBs)
- **Autopilot**: pay per pod, not idle node capacity
- **BQ slot reservations**: predictable cost at 20B events/day volume
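
As a sketch of the Spot lever, a workload can request Spot capacity on GKE through a node selector in its pod template. Where exactly this goes depends on how the Flink pods are defined, which this document does not cover, so treat the placement as an assumption.

```yaml
# Fragment of a pod template for a preemption-tolerant workload
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"   # schedule only onto Spot capacity
```

Preempted pods get only a short shutdown window, so this pairs with the checkpoint/restart behavior noted above and not with a long graceful drain.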

## Debugging

### Pod CrashLoopBackOff
- Check startup probe timeout (53s max) -- is the app initializing in time?
- Check DB connection pool warmup -- may need longer startup probe
- Verify secrets are mounted (secretKeyRef references exist)

### HPA Not Scaling
- Check custom metric pipeline: app -> Prometheus -> Cloud Monitoring -> HPA
- Verify `active_sse_connections` or `pubsub_subscription_backlog` metric is reporting
- Check RBAC: HPA needs access to custom.metrics.k8s.io API

### High Latency After Deploy
- Check if preStop sleep (5s) is configured -- without it, LB sends traffic to terminating pods
- Verify `maxUnavailable: 0` -- pods should only terminate after new ones are ready
- Check memory limit == request -- memory pressure causes GC pauses