---
name: remediate-aws-terraform
description: Staged remediation plan for existing AWS infra in bad shape — takes checkov/tfsec/review-aws-terraform findings and produces sequenced fix PRs that avoid downtime and data loss. Orders changes by blast radius, enforces dual-run for destructive changes, and never merges multiple unrelated fixes in one PR.
model: opus
---

# remediate-aws-terraform

Use when existing AWS infra has known issues (public S3, unencrypted RDS, overly-broad IAM, unpinned providers, state-file-in-repo, etc.) and you need a **safe, staged** path to fix it without breaking prod.

Runs on the Opus model. Sequencing infra changes is planning-heavy, and the cost of getting the order wrong (e.g., rotating a KMS key before updating consumers) is much higher than the cost of the call.

## When to use
- `/review-aws-terraform` returned a list of Blockers you can't fix in one PR.
- `checkov` / `tfsec` / AWS Config / AWS Security Hub flagged 50+ findings and you need a prioritized work plan.
- An auditor dropped a list of control failures the week before a PCI/SOX audit.
- A security incident uncovered a class of misconfig you now need to eliminate across accounts.

## When NOT to use
- You have one small, isolated fix — just make the PR and use `/review-aws-terraform`.
- You're building new infra — use `/scaffold-terraform`.
- The "bad shape" is architectural (wrong service choice entirely) — that needs a design doc first, not a remediation sequence.

## The core idea

Infra remediation is NOT a single PR. A 30-finding repo gets fixed in ~15–30 sequential PRs over weeks, ordered so that:
1. **Highest-blast-radius fixes first** (anything exposing data externally — public S3, open SGs on DB ports).
2. **Non-destructive fixes before destructive ones** (add tags before changing instance classes).
3. **Dependency-ordered** (create the CMK before switching resources to use it).
4. **Each PR is reversible** by itself. A PR may build on an earlier one only once that earlier PR is merged, applied, AND stable.

## Phase 0 — Inventory

Take the findings source:
- `/review-aws-terraform` output,
- `checkov` JSON output,
- `tfsec` JSON output,
- AWS Config non-compliant resources export, or
- a manual list from a security review.

Produce a table: `Finding ID | Severity | Resource | File | Effort | Blast radius | Destructive? | Depends on`.

Gate: a human reviews the inventory before moving to Phase 1. If the list is wrong, no downstream plan makes sense.

## Phase 1 — Bucket into tracks

Group findings into these tracks, in this order of priority:

### Track A — Data exposure (highest priority)
- Public S3 buckets, public RDS instances, `0.0.0.0/0` on DB ports.
- Unencrypted data stores (RDS, EBS), plus stores that AWS encrypts by default but that policy requires on a customer-managed key (S3, DynamoDB, Secrets Manager).
- Secrets in tfvars / Lambda env / CI env without rotation.
- Overly-broad IAM that enables data exfil (`s3:Get*` on `*`, `kms:Decrypt` on `*`, `AdministratorAccess` on a service role).
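
As a rough illustration of the `0.0.0.0/0`-on-DB-port finding, the fix usually replaces the open CIDR with a reference to the consuming security group. This is a sketch only: `aws_security_group.db`, `aws_security_group.app`, and port 5432 are hypothetical names and values assumed to exist in your config.

```hcl
# Hypothetical: allow Postgres traffic only from the app tier's security group,
# replacing an ingress rule that previously used cidr_ipv4 = "0.0.0.0/0".
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id  # assumed existing DB SG
  referenced_security_group_id = aws_security_group.app.id # assumed existing app SG
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
  description                  = "Postgres from app tier only"
}
```

Add the scoped rule in one PR and remove the open rule in a follow-up, so a missed consumer shows up as a reviewable diff rather than an outage.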

### Track B — Auditability
- Missing CloudTrail multi-region / log-file-validation.
- VPC Flow Logs disabled.
- Config rules not enabled.
- S3 access logging disabled on data buckets.
- Audit log SNS/SQS not wired.

### Track C — Prod safety
- Missing `prevent_destroy` on prod stateful resources.
- `deletion_protection = false` on prod RDS.
- `skip_final_snapshot = true` on prod RDS.
- Low `backup_retention_period`.
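
The Track C settings above might look like the fragment below once all of them are fixed. It is a partial sketch: `payments_primary`, the snapshot identifier, and the 14-day retention are illustrative assumptions, and each changed argument would normally land in its own PR.

```hcl
# Fragment of an existing prod instance; unrelated arguments are omitted.
resource "aws_db_instance" "payments_primary" {
  # ... existing engine, class, storage, and network arguments stay unchanged ...

  deletion_protection       = true                     # blocks API-level deletes
  skip_final_snapshot       = false                    # take a snapshot on destroy
  final_snapshot_identifier = "payments-primary-final" # required when the line above is false
  backup_retention_period   = 14                       # days; pick per your recovery policy

  lifecycle {
    prevent_destroy = true # Terraform refuses any plan that would destroy this resource
  }
}
```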

### Track D — Hygiene
- Unpinned providers / modules.
- Missing tags.
- State file in repo / unencrypted backend.
- `terraform` version unpinned.
- `.terraform/` committed.

### Track E — Cleanup
- Unused IAM roles / users / keys.
- Detached EBS volumes, unassociated EIPs.
- Orphaned security groups.
- Overlapping/duplicate modules.

Track A is **always** first. Do not reorder based on convenience.

## Phase 2 — Staged PR sequence per track

For each track, produce one PR per logical fix (usually one per finding, but small related findings can be batched). Each PR has:

- **One narrow scope.** "Close public-S3 on bucket X." Not "fix all S3 issues." Easier review, easier revert.
- **A ticket reference.** Must be traceable to the inventory.
- **An apply plan.** Plan output attached; SRE + SecOps approval recorded.
- **A rollback statement.** "If this PR breaks prod, the rollback is: revert + apply." For encryption or IAM changes, explicit rollback steps required.

## Phase 3 — Special handling per finding type

### Public S3 bucket (Track A)
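1. PR 1: add `aws_s3_bucket_public_access_block` with all four settings `true`. **Non-destructive to the config**: it deletes no existing ACLs or policies. It does cut off the access those grants currently provide (`ignore_public_acls` and `restrict_public_buckets` apply to existing grants), so confirm nothing legitimate depends on public reads before applying.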
2. PR 2: review any existing bucket policy granting public — remove or scope down.
3. PR 3: enable versioning + access logging if missing.
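
A minimal sketch of PR 1 and PR 3, assuming an existing `aws_s3_bucket.payments_data` and a separate log bucket `aws_s3_bucket.access_logs` (both names are hypothetical). In practice the first resource ships alone, and the other two follow in their own PR.

```hcl
# PR 1: block public access on the bucket.
resource "aws_s3_bucket_public_access_block" "payments_data" {
  bucket = aws_s3_bucket.payments_data.id

  block_public_acls       = true # reject new public ACLs
  block_public_policy     = true # reject new public bucket policies
  ignore_public_acls      = true # ignore public ACLs already on the bucket and its objects
  restrict_public_buckets = true # limit access under a public policy to the owning account and AWS services
}

# PR 3: versioning and server access logging.
resource "aws_s3_bucket_versioning" "payments_data" {
  bucket = aws_s3_bucket.payments_data.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_logging" "payments_data" {
  bucket        = aws_s3_bucket.payments_data.id
  target_bucket = aws_s3_bucket.access_logs.id # assumed dedicated log bucket
  target_prefix = "payments-data/"
}
```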

### Unencrypted RDS (Track A) — **THIS IS DESTRUCTIVE**
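An unencrypted RDS instance cannot be encrypted in place. The process:
1. PR 1: create the new encrypted RDS instance, with matching params, in the same VPC/subnet group. `skip_final_snapshot = false`.
2. PR 2: load the data. Either replicate into the new instance with AWS DMS or engine-native logical replication, or take a snapshot of the old instance, copy it with encryption enabled, and restore from the encrypted copy (a restore creates the instance, so on that route the snapshot copy must precede PR 1). RDS does not allow an encrypted read replica of an unencrypted instance, so a native read replica is not an option.
3. PR 3: cut over application DB credentials (staged: staging first for 24h, then prod with a maintenance window).
4. PR 4: after a 14-day soak, delete the old RDS instance.

This is 4 PRs, ~3 weeks elapsed. No shortcuts: AWS does not support enabling encryption on an existing RDS instance without recreating it.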
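
The new encrypted instance from PR 1 could be sketched roughly as follows. Every name and value here is an assumption for illustration (`payments-primary-enc`, the Postgres engine, the instance class, and the referenced subnet group and security group); copy the real parameters from the instance being replaced.

```hcl
# CMK first: the instance below depends on it (dependency ordering).
resource "aws_kms_key" "rds" {
  description         = "CMK for RDS storage encryption"
  enable_key_rotation = true
}

resource "aws_db_instance" "payments_primary_encrypted" {
  identifier        = "payments-primary-enc" # hypothetical name
  engine            = "postgres"             # match the old instance
  instance_class    = "db.m6g.large"         # match the old instance
  allocated_storage = 100                    # GiB; match the old instance

  db_subnet_group_name   = aws_db_subnet_group.payments.name # assumed existing
  vpc_security_group_ids = [aws_security_group.db.id]        # assumed existing

  username                    = "app_admin"
  manage_master_user_password = true # password is generated and kept in Secrets Manager

  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds.arn

  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "payments-primary-enc-final"
  backup_retention_period   = 14
}
```

Keeping the old instance's resource block untouched until PR 4 is what makes the dual-run window possible.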

### Overly-broad IAM (Track A)
1. PR 1: add **new** narrower policy alongside the broad one; attach to a test role.
2. PR 2: test the narrow policy in staging via a canary workload.
3. PR 3: swap the existing role's policy to the narrow one.
4. PR 4: monitor CloudTrail for `AccessDenied` for 7 days; if none, the old broad policy is removed.
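
PR 1 of this sequence might look like the sketch below: a narrow policy defined next to the broad one and attached only to a test role. The bucket ARN, policy name, and `aws_iam_role.deploy_test` are hypothetical placeholders; the real action list should come from what the workload demonstrably uses.

```hcl
# Narrow replacement for a broad "s3:Get*" on "*" grant.
data "aws_iam_policy_document" "deploy_narrow" {
  statement {
    sid       = "ReadArtifacts"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::example-artifacts/*"] # hypothetical bucket
  }
}

resource "aws_iam_policy" "deploy_narrow" {
  name   = "deploy-narrow"
  policy = data.aws_iam_policy_document.deploy_narrow.json
}

# Attach to a test role only; the production role keeps the broad policy until PR 3.
resource "aws_iam_role_policy_attachment" "test_role_narrow" {
  role       = aws_iam_role.deploy_test.name # assumed existing test role
  policy_arn = aws_iam_policy.deploy_narrow.arn
}
```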

### Missing CloudTrail (Track B)
1. PR 1: create `aws_cloudtrail` with `is_multi_region_trail`, `enable_log_file_validation`, S3 bucket Object-Locked.
2. PR 2: turn logging on (`enable_logging = true` on the trail) and apply; the first log files typically arrive ~5 minutes post-apply.
3. PR 3: wire CloudWatch alarms on IAM policy changes + root login + KMS key deletion.
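
A minimal sketch of the trail and its log bucket, assuming hypothetical names (`audit-trail`, `example-cloudtrail-logs`). It omits the bucket policy that lets CloudTrail write to the bucket, which the real PR needs, and the Object Lock retention rule.

```hcl
# Log bucket with Object Lock enabled at creation.
resource "aws_s3_bucket" "cloudtrail_logs" {
  bucket              = "example-cloudtrail-logs" # hypothetical name
  object_lock_enabled = true
}

resource "aws_cloudtrail" "audit" {
  name           = "audit-trail"
  s3_bucket_name = aws_s3_bucket.cloudtrail_logs.id

  is_multi_region_trail         = true # capture events from every region
  enable_log_file_validation    = true # digest files make log tampering detectable
  include_global_service_events = true # IAM, STS, and other global services
  enable_logging                = true
}
```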

### State file in repo (Track D)
1. PR 1: add backend.tf with S3 + DynamoDB lock (encrypted bucket, versioning on).
2. **MANUAL STEP** (not automatable in the agent): `terraform init -migrate-state`.
3. PR 2: delete `*.tfstate` from the repo with `git rm` **and** purge from history via BFG (same process as `/migrate-legacy-secrets` phase 3).
4. PR 3: update `.gitignore` + `.cursorignore` to block any tfstate files from being re-added.
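
The `backend.tf` from PR 1 could be as small as the sketch below. The bucket, key, region, and table names are placeholders, and the state bucket and lock table are assumed to be created beforehand (typically in a separate bootstrap config).

```hcl
# backend.tf: remote state in S3 with DynamoDB locking.
terraform {
  backend "s3" {
    bucket         = "example-tf-state"           # pre-created, versioned, encrypted bucket
    key            = "payments/terraform.tfstate" # one key per root module
    region         = "us-east-1"
    encrypt        = true                         # server-side encryption for the state object
    dynamodb_table = "example-tf-locks"           # pre-created table with a LockID string hash key
  }
}
```

Recent Terraform releases also support S3-native locking through `use_lockfile`; check which options your pinned version supports before choosing.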

### Unpinned providers (Track D)
Single PR: add explicit `version = "~> x.y"` to every `required_providers` block. Non-destructive; safe.
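
For example, a pinned block might read as follows. The version numbers are illustrative assumptions only; pin to whatever the repo currently resolves to so the PR itself changes nothing at apply time.

```hcl
terraform {
  required_version = "~> 1.7" # illustrative; pins the Terraform CLI as well

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # allows 5.40 and later 5.x releases, but not 6.0
    }
  }
}
```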

### Missing tags (Track D)
1. PR 1: add `default_tags` block to each provider.
2. PR 2: per-resource overrides for resources that need specific tags beyond the defaults.
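
The two PRs might be sketched like this. Tag keys and values are hypothetical; use whatever your tagging policy mandates.

```hcl
# PR 1: tags applied to every taggable resource created through this provider.
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = "prod"
      Owner       = "payments-team"
      ManagedBy   = "terraform"
    }
  }
}

# PR 2: resource-level tags are merged with default_tags; on a key conflict
# the resource-level value wins.
resource "aws_s3_bucket" "payments_data" {
  bucket = "example-payments-data" # hypothetical name

  tags = {
    DataClassification = "restricted"
  }
}
```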

### Secrets in tfvars (Track A) — **USE `/migrate-legacy-secrets`**
This is the same pattern as `.env` migration. Delegate to `/migrate-legacy-secrets`; don't reinvent.

## Phase 4 — Track the campaign

Produce a running tracker (commit to a runbook or Confluence page, not to the service repo):

```
## Infra remediation — <account/repo>
Inventory total: <N> findings
Completed: <M> / <N>

### Track A — Data exposure
- [x] S3 public-block: payments-data (PR #123, merged 2026-04-10)
- [ ] RDS encryption: payments-primary (in progress, PR 2/4)
- [ ] IAM narrow: deploy-role (PR 3/4, monitoring)
- [ ] SG tighten: db-sg 5432 (PR #128, pending SecOps)

### Track B — Auditability
- [ ] CloudTrail multi-region (not started)
- ...

Expected completion: <date> (at current velocity of N PRs/week)
Next review: <date>
```

Update weekly. Share with SecOps + SRE.

## Phase 5 — Validate at the end

Once all findings are closed:
1. Re-run `/review-aws-terraform` — must return zero Blockers.
2. Re-run `checkov --compact` and `tfsec` — findings should be down to acceptable-with-exception items.
3. AWS Config compliance score: should be green on the conformance pack.
4. Sign-off: SecOps confirms the campaign is closed; record the PR list in the audit folder.

## Hard rules

- **One finding per PR for high-risk tracks (A, C).** Reviewability matters more than velocity.
- **Destructive changes** (anything that recreates a stateful resource) require: staging-first, dual-run window ≥24h, SecOps + Owner + SRE sign-off, documented rollback.
- **Never skip the inventory step.** Jumping straight to fixes produces drift and re-work.
- **Never force-push to rewrite history** as part of a single-resource fix. The only exception is the state-file purge (same process as Phase 3 of `/migrate-legacy-secrets`), and that is a separate, coordinated operation.
- **Never merge multiple unrelated fixes in one PR.** "S3 + IAM + CloudTrail" = three PRs minimum.
- **Never claim a fix is done without re-running the verification.** checkov/tfsec don't count as "done" signals alone; they must be green AND the relevant AWS Config control must be compliant.

## Output format

```
## Inventory summary
- Total findings: <N>
- By track:
  - A (data exposure): <a>
  - B (auditability):  <b>
  - C (prod safety):   <c>
  - D (hygiene):       <d>
  - E (cleanup):       <e>

## Proposed PR sequence (first 10)
1. [Track A · P0] <title> — 1 PR, non-destructive, ~2 days
2. [Track A · P0] <title> — 4 PRs (destructive), ~3 weeks
3. [Track A · P1] <title> — 1 PR, ~2 days
...

## Timeline estimate
- P0 block (Track A, highest severity): <n weeks>
- Full campaign: <n weeks>

## Open questions for humans
- <items needing decisions before work starts>
```
