---
type: skill
lifecycle: stable
inheritance: inheritable
name: "root-cause-analysis"
description: Find the true source, not symptoms; systematic debugging from observation to permanent fix
tier: core
applyTo: '**/*root*,**/*cause*,**/*analysis*'
currency: 2026-04-20
lastReviewed: 2026-04-30
---

# Root Cause Analysis Skill

> If you fixed it but it came back, you fixed a symptom.

## Core Principle

Every symptom has a cause. Every cause has a deeper cause. Keep digging until you reach something you can *prevent*, not just fix.

## 5 Whys: Extended Example

| # | Question | Answer |
| - | -------- | ------ |
| 1 | Why did the page crash? | JavaScript threw a TypeError on null |
| 2 | Why was the value null? | The API returned an empty response |
| 3 | Why did the API return empty? | The database query timed out |
| 4 | Why did the query time out? | Missing index on a 10M-row table |
| 5 | Why was the index missing? | No performance review in the PR process |

**Root cause**: Process gap (no performance review), not the missing index.

**Fix the system**: Add a performance checklist to the PR template; don't just add the index.

### 5 Whys Traps

| Trap | Example | How to Avoid |
| ---- | ------- | ------------ |
| Stopping at human error | "Dev forgot to add the index" | Ask *why was it possible to forget?* |
| Single chain only | Only follow one branch | Branch at each Why if multiple causes |
| Speculation without evidence | "Probably because of..." | Each answer must have evidence |
| Going too deep | Why #12: "Because physics" | Stop when you reach an actionable system change |

## Cause Categories

| Category | Common Patterns | Investigation Tools |
| -------- | --------------- | ------------------- |
| Code | Null reference, off-by-one, race condition, type mismatch | Debugger, unit tests, static analysis |
| Data | Corrupt input, unexpected format, encoding issues | Query logs, data validation, sample inspection |
| Infrastructure | Disk full, memory exhaustion, network partition | Metrics dashboards, health endpoints, `top`/`df` |
| Dependencies | Breaking change, version mismatch, transitive conflict | Lockfile diff, changelog review, `npm ls` |
| Configuration | Wrong env var, feature flag state, missing secret | Config diff, environment comparison |
| Process | Missing review, unclear ownership, no runbook | Post-mortem patterns, team interviews |

## Investigation Techniques

### Binary Search Debugging

When you don't know where the bug is, halve the search space:

1. Identify the last known good state (commit, deploy, timestamp)
2. `git bisect` between good and bad
3. Each step: does the bug exist? Yes → go earlier. No → go later.
4. Result: the exact commit that introduced the bug.

```bash
# Binary search with git bisect
git bisect start
git bisect bad HEAD        # Current commit is broken
git bisect good v2.3.0     # This tag was working
# Git checks out a middle commit; you test it
# Repeat: git bisect good/bad until found
git bisect reset           # Return to the original HEAD when done
```

```typescript
// Binary search debugging in code.
// Assumes `commits` is ordered oldest → newest and the last commit is known bad.
async function findBreakingChange(
  commits: string[],
  testFn: (commit: string) => Promise<boolean> // resolves true if the commit works
): Promise<string | null> {
  let left = 0;
  let right = commits.length - 1;

  while (left < right) {
    const mid = Math.floor((left + right) / 2);
    const works = await testFn(commits[mid]);

    if (works) {
      left = mid + 1; // Bug introduced after this commit
    } else {
      right = mid; // Bug exists at or before this commit
    }
  }

  // First bad commit, or null for an empty list
  return commits[left] ?? null;
}
```

### Timeline Reconstruction

| Time | Event | Source |
| ---- | ----- | ------ |
| T-24h | Deploy v2.3.1 | CI/CD logs |
| T-12h | Config change: cache TTL 60s → 30s | Config audit log |
| T-2h | First user report | Support tickets |
| T-0 | Alert fired | Monitoring |

**Key question**: What changed between "working" and "broken"?

### Correlation vs Causation

| Evidence Type | Confidence | Example |
| ------------- | ---------- | ------- |
| Reproduces on demand | High | "Every time I submit this form..." |
| Correlates with a deploy | Medium | "Started after we deployed" |
| Timing coincidence | Low | "Started Monday" (traffic patterns?) |
| "It's never done this before" | Very Low | Memory is unreliable; check logs |

## Fix + Prevent Pattern

| Phase | Purpose | Example | Deadline |
| ----- | ------- | ------- | -------- |
| **Immediate** | Stop the bleeding | Rollback, disable feature, redirect traffic | Now |
| **Permanent** | Fix root cause | Add missing index, fix validation, patch dependency | This sprint |
| **Prevention** | Stop recurrence | Add CI check, monitoring alert, runbook, PR checklist | Next sprint |

**Test the fix**: The permanent fix should make the immediate fix unnecessary.
If you remove the band-aid and the symptom returns, you haven't found the root cause.

## Common Symptom → Root Cause Patterns

| Symptom | Obvious Cause | Deeper Root Cause |
| ------- | ------------- | ----------------- |
| Memory leak | Unclosed resource | No resource cleanup pattern in codebase |
| N+1 queries | Missing join | ORM hides query count, no query logging |
| Intermittent test failure | Timing-dependent | Shared mutable state between tests |
| "Works on my machine" | Different environment | No environment parity tooling (Docker, etc.) |
| Data corruption | Missing validation | Validation in UI only, not at API boundary |
| Slow deploys | Large artifact | No build caching, monorepo without selective builds |

## Post-Mortem Integration

The RCA section of a post-mortem should include:

1. **The 5 Whys chain** (with evidence for each level)
2. **Contributing factors** (things that made it worse, not the direct cause)
3. **What we were lucky about** (things that could have made it much worse)
4. **Action items** with owners and dates for permanent fix + prevention
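
The checklist above can be treated as a data shape and checked mechanically before a post-mortem is published. The following is a minimal sketch, not a prescribed format: the `RcaSection`, `WhyStep`, and `ActionItem` types, their field names, and the `validateRca` helper are all hypothetical names introduced for illustration.

```typescript
// Hypothetical shape for the RCA section of a post-mortem.
// All type and field names here are illustrative assumptions.
interface WhyStep {
  question: string;
  answer: string;
  evidence: string[]; // e.g. links to logs, dashboards, commits, tickets
}

interface ActionItem {
  title: string;
  phase: "immediate" | "permanent" | "prevention";
  owner: string;
  due: string; // ISO date string
}

interface RcaSection {
  whys: WhyStep[];
  contributingFactors: string[];
  luckyBreaks: string[];
  actionItems: ActionItem[];
}

// Returns human-readable problems; an empty array means the section passes.
function validateRca(rca: RcaSection): string[] {
  const problems: string[] = [];

  if (rca.whys.length === 0) {
    problems.push("5 Whys chain is empty");
  }
  rca.whys.forEach((step, i) => {
    // Every level of the chain needs evidence, not speculation.
    if (step.evidence.length === 0) {
      problems.push(`Why #${i + 1} has no evidence`);
    }
  });

  for (const item of rca.actionItems) {
    if (item.owner.trim() === "") {
      problems.push(`Action "${item.title}" has no owner`);
    }
    if (Number.isNaN(Date.parse(item.due))) {
      problems.push(`Action "${item.title}" has no valid due date`);
    }
  }

  // The permanent fix and the prevention step must both be present.
  const phases = new Set(rca.actionItems.map((a) => a.phase));
  for (const required of ["permanent", "prevention"] as const) {
    if (!phases.has(required)) {
      problems.push(`No ${required} action item`);
    }
  }

  return problems;
}
```

A check like this could run in CI against a structured post-mortem file, which is one way to turn "each answer must have evidence" from a convention into something enforced.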