---
name: release-management
description: Provides release management strategies including deployment patterns, version control, and rollback procedures. Use when planning releases, managing deployments, or when user mentions 'release', 'canary', 'blue-green', 'rollback', 'feature flag', 'release train', 'semantic versioning', 'changelog', 'migration'.
type: skill
category: ops
status: stable
origin: tibsfox
modified: false
first_seen: 2026-02-07
first_path: examples/release-management/SKILL.md
superseded_by: null
---
# Release Management

Best practices for shipping software reliably through structured release processes, deployment strategies, and rollback automation.

## Deployment Strategy Comparison

The right deployment strategy depends on risk tolerance, infrastructure budget, and rollback requirements.

| Strategy | Zero Downtime | Rollback Speed | Infrastructure Cost | Complexity | Best For |
|----------|--------------|----------------|--------------------:|------------|----------|
| Blue-Green | Yes | Instant (switch) | 2x | Medium | Critical services, compliance |
| Canary | Yes | Fast (route away) | 1.05-1.5x | High | High-traffic user-facing apps |
| Rolling | Yes | Slow (re-deploy) | 1x | Low | Stateless microservices |
| Recreate | No (brief outage) | Slow (re-deploy) | 1x | Low | Dev/staging, batch jobs |
| A/B Testing | Yes | Fast (route away) | 1.2-1.5x | High | Feature experimentation |
| Shadow/Dark | Yes | N/A (no user impact) | 1.5-2x | Very High | ML models, data pipelines |

### Decision Matrix

```
Is the service stateful?
  YES --> Can you afford 2x infrastructure?
            YES --> Blue-Green
            NO  --> Rolling (with drain + health checks)
  NO  --> Is it high-traffic (>1000 rps)?
            YES --> Canary (gradual rollout)
            NO  --> Rolling (simple, cost-effective)
```
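
The decision tree above can be written as a small function, which makes the choice testable and easy to embed in tooling. This is a minimal sketch: `ServiceProfile`, `chooseStrategy`, and `HIGH_TRAFFIC_RPS` are hypothetical names, and the 1000 rps threshold is simply the one from the tree.

```typescript
// Hypothetical types for illustration; not part of any library.
type Strategy = "blue-green" | "canary" | "rolling";

interface ServiceProfile {
  stateful: boolean;
  canAffordDoubleInfra: boolean;
  peakRps: number;
}

// Threshold taken from the decision tree; tune it per service.
const HIGH_TRAFFIC_RPS = 1000;

function chooseStrategy(profile: ServiceProfile): Strategy {
  if (profile.stateful) {
    // Stateful: prefer an instant switch if the budget allows 2x infrastructure
    return profile.canAffordDoubleInfra ? "blue-green" : "rolling";
  }
  // Stateless: gradual exposure only pays off at high traffic
  return profile.peakRps > HIGH_TRAFFIC_RPS ? "canary" : "rolling";
}

console.log(chooseStrategy({ stateful: false, canAffordDoubleInfra: false, peakRps: 5000 }));
```

The tree only covers three of the six strategies in the table; Recreate, A/B Testing, and Shadow remain manual choices.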

## Canary Deployment Configuration

### Kubernetes Canary with Argo Rollouts

```yaml
# argo-rollout.yaml -- Progressive canary with automated analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 10
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: payment-service
  strategy:
    canary:
      # Canary traffic steps with pause for analysis
      steps:
        - setWeight: 5
        - pause: { duration: 5m }    # 5% for 5 minutes
        - analysis:
            templates:
              - templateName: canary-success-rate
            args:
              - name: service-name
                value: payment-service
        - setWeight: 20
        - pause: { duration: 10m }   # 20% for 10 minutes
        - analysis:
            templates:
              - templateName: canary-success-rate
            args:
              - name: service-name   # Required: the template declares no default
                value: payment-service
        - setWeight: 50
        - pause: { duration: 10m }   # 50% for 10 minutes
        - setWeight: 80
        - pause: { duration: 5m }    # 80% for 5 minutes
        # 100% happens automatically after final step

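      # Abort behavior: after an abort, keep canary pods for 30s before scaling
      # them down, and scale the stable ReplicaSet in step with traffic weight
      abortScaleDownDelaySeconds: 30
      dynamicStableScale: true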

      # Traffic management via Istio
      trafficRouting:
        istio:
          virtualServices:
            - name: payment-service-vsvc
              routes:
                - primary
          destinationRule:
            name: payment-service-destrule
            canarySubsetName: canary
            stableSubsetName: stable

      # Background analysis: starts at step index 2 (0-based) and runs alongside the remaining steps
      analysis:
        templates:
          - templateName: canary-success-rate
        startingStep: 2
        args:
          - name: service-name
            value: payment-service

  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:v2.3.1
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi

---
# Analysis template -- Prometheus-based success rate check
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2..",
              canary="true"
            }[2m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              canary="true"
            }[2m]))
    - name: p99-latency
      interval: 60s
      count: 5
      successCondition: result[0] <= 500
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_ms_bucket{
                service="{{args.service-name}}",
                canary="true"
              }[2m])) by (le)
            )
```
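
To reason about how `count`, `successCondition`, and `failureLimit` interact, the following is a simplified model of a single metric's verdict. It assumes a metric fails once the number of failed measurements exceeds `failureLimit`; it is not Argo Rollouts' implementation, and `MetricSpec` and `evaluateMetric` are hypothetical names.

```typescript
type Verdict = "Successful" | "Failed";

interface MetricSpec {
  count: number;                       // Number of measurements to take
  failureLimit: number;                // Failed measurements tolerated
  passes: (value: number) => boolean;  // Stands in for successCondition
}

function evaluateMetric(spec: MetricSpec, measurements: number[]): Verdict {
  let failures = 0;
  for (const value of measurements.slice(0, spec.count)) {
    if (!spec.passes(value)) failures++;
    // Assumption: exceeding failureLimit fails the metric immediately
    if (failures > spec.failureLimit) return "Failed";
  }
  return "Successful";
}

// Mirrors the success-rate metric: 5 samples, >= 0.99, up to 2 failures tolerated
const successRate: MetricSpec = {
  count: 5,
  failureLimit: 2,
  passes: (v) => v >= 0.99,
};

console.log(evaluateMetric(successRate, [0.995, 0.98, 0.999, 0.97, 0.996]));
console.log(evaluateMetric(successRate, [0.98, 0.97, 0.96, 0.999, 0.999]));
```

Under this model, `failureLimit: 2` with `count: 5` means a canary can miss the threshold in up to two of five one-minute samples and still be promoted, which is worth keeping in mind when choosing thresholds for a payment path.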

## Feature Flag Implementation

### LaunchDarkly Pattern with Gradual Rollout

```typescript
// feature-flags.ts -- Structured feature flag management
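import * as LaunchDarkly from "@launchdarkly/node-server-sdk";
// Assumption: the route handler below uses Express; swap in your framework's types
import type { Request, Response } from "express";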

// --- Flag Configuration Types ---

interface FlagConfig {
  key: string;
  description: string;
  type: "boolean" | "multivariate" | "percentage";
  owner: string;           // Team responsible
  createdAt: string;       // For cleanup tracking
  maxAge: "temporary" | "permanent";  // Expected lifetime
  cleanupTicket?: string;  // JIRA ticket to remove flag
}

// Registry prevents flag sprawl -- all flags must be declared
const FLAG_REGISTRY: Record<string, FlagConfig> = {
  "new-checkout-flow": {
    key: "new-checkout-flow",
    description: "Redesigned checkout with one-page form",
    type: "percentage",
    owner: "checkout-team",
    createdAt: "2025-11-01",
    maxAge: "temporary",
    cleanupTicket: "CHECKOUT-1234",
  },
  "payment-v2-api": {
    key: "payment-v2-api",
    description: "Route payments through v2 processor",
    type: "boolean",
    owner: "payments-team",
    createdAt: "2025-10-15",
    maxAge: "temporary",
    cleanupTicket: "PAY-567",
  },
};

// --- Client Initialization ---

let ldClient: LaunchDarkly.LDClient;

export async function initFeatureFlags(): Promise<void> {
  ldClient = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY!);

  await ldClient.waitForInitialization({ timeout: 10 });
  console.log("LaunchDarkly client initialized");
}

// --- Flag Evaluation with Fallback ---

export async function isEnabled(
  flagKey: string,
  context: LaunchDarkly.LDContext,
  defaultValue = false
): Promise<boolean> {
  // Validate flag is registered
  if (!FLAG_REGISTRY[flagKey]) {
    console.warn(`Unknown flag: ${flagKey}. Returning default.`);
    return defaultValue;
  }

  try {
    const value = await ldClient.variation(flagKey, context, defaultValue);
    return Boolean(value);
  } catch (err) {
    // Flag evaluation failure should never break the app
    console.error(`Flag evaluation failed for ${flagKey}:`, err);
    return defaultValue;
  }
}

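// --- Usage in Route Handler ---
// Request/Response are assumed to be Express types, with req.user attached by
// auth middleware; renderNewCheckout/renderLegacyCheckout are defined elsewhere.

export async function checkoutHandler(req: Request, res: Response) {
  // LDContext takes custom attributes at the top level; the nested `custom`
  // object belongs to the legacy LDUser shape.
  const userContext: LaunchDarkly.LDContext = {
    kind: "user",
    key: req.user.id,
    email: req.user.email,
    plan: req.user.plan,       // Target by plan tier
    region: req.user.region,   // Target by geography
    company: req.user.orgId,   // Target by organization
  };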

  const useNewCheckout = await isEnabled("new-checkout-flow", userContext);

  if (useNewCheckout) {
    return renderNewCheckout(req, res);
  }
  return renderLegacyCheckout(req, res);
}

// --- Stale Flag Detection ---

export function getStaleFlags(maxAgeDays = 90): FlagConfig[] {
  const now = Date.now();
  return Object.values(FLAG_REGISTRY).filter((flag) => {
    if (flag.maxAge === "permanent") return false;
    const age = now - new Date(flag.createdAt).getTime();
    return age > maxAgeDays * 24 * 60 * 60 * 1000;
  });
}
```
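
A percentage rollout needs each user to land in the same bucket on every evaluation, so a user does not flip between old and new behavior. The sketch below shows one common way to do that with a deterministic hash. It is an illustration only, not LaunchDarkly's bucketing algorithm; `fnv1a`, `bucketFor`, and `inRollout` are hypothetical helpers.

```typescript
// FNV-1a 32-bit hash over the string's UTF-16 code units
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // Force unsigned 32-bit
}

// Map (flag, user) to a stable bucket in [0, 9999].
// Salting with the flag key keeps buckets independent across flags.
function bucketFor(flagKey: string, userKey: string): number {
  return fnv1a(`${flagKey}:${userKey}`) % 10000;
}

// percent is 0..100; buckets below the cutoff are in the rollout
function inRollout(flagKey: string, userKey: string, percent: number): boolean {
  return bucketFor(flagKey, userKey) < percent * 100;
}

const enabled = inRollout("new-checkout-flow", "user-42", 20);
console.log(`user-42 in 20% rollout: ${enabled}`);
```

Because the cutoff only moves upward as the percentage grows, a user who is included at 20% stays included at 50%, which keeps the experience stable during a ramp.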

## Database Migration Strategy

### Safe Migration Workflow

Database migrations during releases require special care because they cannot be rolled back as easily as code.

| Migration Type | Risk Level | Rollback Strategy | Requires Downtime |
|---------------|-----------|-------------------|-------------------|
| Add column (nullable) | Low | Drop column | No |
| Add column (NOT NULL + default) | Medium | Drop column | No (with care) |
| Drop column | High | Cannot undo | No (expand-contract) |
| Rename column | High | Cannot undo easily | No (expand-contract) |
| Add index | Medium | Drop index | No (CONCURRENTLY) |
| Change column type | High | Revert type | Sometimes |
| Add table | Low | Drop table | No |
| Drop table | Critical | Restore from backup | No |
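
The risk levels in the table can be turned into a rough pre-merge lint for migration files. The sketch below is a heuristic only: it matches SQL text with regular expressions rather than parsing it, and `classifyMigration` and `RULES` are hypothetical names.

```typescript
type Risk = "low" | "medium" | "high" | "critical";

interface Rule {
  pattern: RegExp;
  risk: Risk;
  note: string;
}

// Ordered from most to least severe; the first matching rule wins.
const RULES: Rule[] = [
  { pattern: /\bDROP\s+TABLE\b/i, risk: "critical", note: "restore from backup to undo" },
  { pattern: /\bDROP\s+COLUMN\b/i, risk: "high", note: "cannot undo; use expand-contract" },
  { pattern: /\bRENAME\s+COLUMN\b/i, risk: "high", note: "use expand-contract" },
  { pattern: /\bALTER\s+COLUMN\b[^;]*\bTYPE\b/i, risk: "high", note: "may rewrite the table" },
  { pattern: /\bCREATE\s+(UNIQUE\s+)?INDEX\b/i, risk: "medium", note: "prefer CONCURRENTLY" },
  { pattern: /\bADD\s+COLUMN\b[^;]*\bNOT\s+NULL\b/i, risk: "medium", note: "needs a default" },
];

function classifyMigration(sql: string): { risk: Risk; note: string } {
  for (const rule of RULES) {
    if (rule.pattern.test(sql)) return { risk: rule.risk, note: rule.note };
  }
  // Nothing risky matched: treat as an additive change
  return { risk: "low", note: "additive change" };
}

console.log(classifyMigration("ALTER TABLE users DROP COLUMN email;"));
```

A check like this is best used to request extra review for high and critical migrations, not to block them outright, since text matching will miss statements hidden in functions or generated SQL.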

### Expand-Contract Migration Pattern

```sql
-- Phase 1: EXPAND (deploy with old code still running)
-- Add new column alongside old one
ALTER TABLE users ADD COLUMN email_normalized VARCHAR(255);

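-- Backfill data in batches (run as a background job, not in the migration).
-- PostgreSQL has no UPDATE ... LIMIT, so batch through a subquery; this
-- assumes an `id` primary key. Repeat until zero rows are updated.
UPDATE users SET email_normalized = LOWER(TRIM(email))
WHERE id IN (
  SELECT id FROM users
  WHERE email_normalized IS NULL AND email IS NOT NULL
  LIMIT 10000  -- Small batches keep row locks short
);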

-- Phase 2: MIGRATE (deploy new code that writes to both columns)
-- Application writes to BOTH email and email_normalized
-- Application reads from email_normalized with fallback to email

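-- Phase 3: CONTRACT (after all code reads and writes only the new column)
-- Only after verifying no code references the old column
ALTER TABLE users DROP COLUMN email;
-- Keep the name email_normalized. Renaming it back to email here would break
-- code that now uses email_normalized; a rename needs its own expand-contract cycle.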
```
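
Phase 2 is the part that lives in application code: write both columns, read the new one, and fall back to the old one for rows the backfill has not reached. A minimal sketch of that logic follows, assuming a `UserRow` shape with an `id` column; `normalizeEmail`, `buildUserWrite`, and `readEmail` are hypothetical helpers, and the database call itself is left out.

```typescript
interface UserRow {
  id: number;
  email: string;
  email_normalized: string | null; // NULL until written or backfilled
}

// Same transformation as the SQL backfill: LOWER(TRIM(email))
function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

// Dual write: every insert or update populates both columns
function buildUserWrite(id: number, email: string): UserRow {
  return { id, email, email_normalized: normalizeEmail(email) };
}

// Read the new column, falling back to the old one for rows not yet backfilled.
// Normalizing the fallback is a design choice so callers always see one format.
function readEmail(row: UserRow): string {
  return row.email_normalized ?? normalizeEmail(row.email);
}

const legacyRow: UserRow = { id: 1, email: "  Foo@Example.COM ", email_normalized: null };
console.log(readEmail(legacyRow));
```

Once the backfill finishes and the fallback branch stops being hit (worth confirming with a counter or log line), the fallback can be removed ahead of the contract phase.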

### Migration Runner with Safety Checks

```bash
#!/usr/bin/env bash
# migrate.sh -- Safe migration runner with pre-flight checks
set -euo pipefail

DB_NAME="${DB_NAME:?DB_NAME required}"
MIGRATION_DIR="${MIGRATION_DIR:-./migrations}"
DRY_RUN="${DRY_RUN:-false}"

# Pre-flight checks
preflight_check() {
  echo "=== Pre-flight checks ==="

  # 1. Check for long-idle open transactions that could block DDL
  #    (-A drops psql's padding so the count is a bare integer)
  BLOCKED=$(psql -d "$DB_NAME" -tA -c \
    "SELECT count(*) FROM pg_stat_activity
     WHERE state = 'idle in transaction'
     AND xact_start < now() - interval '5 minutes';")

  if [ "$BLOCKED" -gt 0 ]; then
    echo "ERROR: $BLOCKED long-running idle transactions detected"
    echo "These may block DDL operations. Investigate before proceeding."
    exit 1
  fi

  # 2. Check disk space (migrations can temporarily double table size)
  DISK_FREE=$(df -BG /var/lib/postgresql | tail -1 | awk '{print $4}' | tr -d 'G')
  if [ "$DISK_FREE" -lt 20 ]; then
    echo "ERROR: Only ${DISK_FREE}GB free. Migrations may need more space."
    exit 1
  fi

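  # 3. Report replica replay position (informational only: this is NULL on a
  #    primary and does not prove a recent backup exists; verify backups separately)
  LAST_REPLAY=$(psql -d "$DB_NAME" -tA -c \
    "SELECT pg_last_xact_replay_timestamp();" 2>/dev/null || echo "N/A")
  echo "Last replica replay timestamp: ${LAST_REPLAY:-none (primary)}"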

  echo "=== Pre-flight passed ==="
}

# Run migrations
run_migrations() {
  for migration in "$MIGRATION_DIR"/*.sql; do
    MIGRATION_NAME=$(basename "$migration")

    # Check if already applied
    APPLIED=$(psql -d "$DB_NAME" -tA -c \
      "SELECT count(*) FROM schema_migrations
       WHERE name = '$MIGRATION_NAME';")

    if [ "$APPLIED" -gt 0 ]; then
      echo "SKIP: $MIGRATION_NAME (already applied)"
      continue
    fi

    echo "APPLYING: $MIGRATION_NAME"

    if [ "$DRY_RUN" = "true" ]; then
      echo "  DRY RUN -- would execute:"
      head -20 "$migration"
      continue
    fi

    # Apply with statement timeout to prevent long locks
    psql -d "$DB_NAME" \
      -v ON_ERROR_STOP=1 \
      -c "SET statement_timeout = '30s';" \
      -f "$migration"

    # Record migration
    psql -d "$DB_NAME" -c \
      "INSERT INTO schema_migrations (name, applied_at)
       VALUES ('$MIGRATION_NAME', now());"

    echo "  APPLIED: $MIGRATION_NAME"
  done
}

preflight_check
run_migrations
echo "=== Migrations complete ==="
```

## Rollback Automation

### Automated Rollback Script

```bash
#!/usr/bin/env bash
# rollback.sh -- Automated release rollback with verification
set -euo pipefail

SERVICE="${1:?Usage: rollback.sh <service> [target-version]}"
TARGET_VERSION="${2:-}"  # If empty, rolls back to previous
ENVIRONMENT="${ENVIRONMENT:-production}"
NAMESPACE="${NAMESPACE:-production}"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log() { echo -e "${GREEN}[ROLLBACK]${NC} $*"; }
warn() { echo -e "${YELLOW}[WARNING]${NC} $*"; }
err() { echo -e "${RED}[ERROR]${NC} $*" >&2; }

# Step 1: Determine rollback target
if [ -z "$TARGET_VERSION" ]; then
  # Pick the second-highest revision number; the highest is the current one
  TARGET_VERSION=$(kubectl rollout history "deployment/$SERVICE" \
    -n "$NAMESPACE" | awk '$1 ~ /^[0-9]+$/ {print $1}' | sort -n | tail -2 | head -1)
  log "Auto-detected previous revision: $TARGET_VERSION"
fi

# Step 2: Create rollback record (audit trail)
ROLLBACK_ID="rb-$(date +%Y%m%d-%H%M%S)-${SERVICE}"
log "Rollback ID: $ROLLBACK_ID"

# Step 3: Notify team (skipped when SLACK_WEBHOOK_URL is unset; never blocks the rollback)
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
  curl -sf -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{
      \"text\": \"ROLLBACK INITIATED: ${SERVICE} in ${ENVIRONMENT}\",
      \"blocks\": [{
        \"type\": \"section\",
        \"text\": {
          \"type\": \"mrkdwn\",
          \"text\": \"*Rollback:* ${ROLLBACK_ID}\n*Service:* ${SERVICE}\n*Target:* revision ${TARGET_VERSION}\n*Initiated by:* $(whoami)\"
        }
      }]
    }" >/dev/null 2>&1 || warn "Slack notification failed (non-blocking)"
else
  warn "SLACK_WEBHOOK_URL not set; skipping notification"
fi

# Step 4: Execute rollback
log "Rolling back $SERVICE to revision $TARGET_VERSION..."
kubectl rollout undo "deployment/$SERVICE" \
  -n "$NAMESPACE" \
  --to-revision="$TARGET_VERSION"

# Step 5: Wait for rollout
log "Waiting for rollback to complete..."
if ! kubectl rollout status "deployment/$SERVICE" \
  -n "$NAMESPACE" --timeout=300s; then
  err "Rollback did not complete within 5 minutes"
  err "Manual intervention required"
  exit 1
fi

# Step 6: Verify health
log "Verifying service health..."
sleep 10  # Allow metrics to settle

HEALTH_URL="https://${SERVICE}.${ENVIRONMENT}.example.com/healthz"
# curl already prints 000 on connection failure; || true keeps set -e from aborting
HTTP_STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$HEALTH_URL" || true)

if [ "$HTTP_STATUS" != "200" ]; then
  err "Health check failed: HTTP $HTTP_STATUS"
  err "Service may need manual investigation"
  exit 1
fi

log "Health check passed (HTTP $HTTP_STATUS)"
log "Rollback $ROLLBACK_ID completed successfully"
```
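
Picking the rollback target from `kubectl rollout history` output is the most fragile step of the script, so it helps to have the parsing logic in a form that can be unit tested. The sketch below assumes the usual tabular output with a `REVISION` column and treats the highest revision as the current one; `previousRevision` is a hypothetical helper.

```typescript
// Returns the second-highest revision number, or null if there is no
// earlier revision to roll back to.
function previousRevision(historyOutput: string): number | null {
  const revisions = historyOutput
    .split("\n")
    .map((line) => line.trim().split(/\s+/)[0] ?? "")
    .filter((token) => /^\d+$/.test(token)) // Skip headers and blank lines
    .map(Number)
    .sort((a, b) => a - b);

  if (revisions.length < 2) return null;
  return revisions[revisions.length - 2] ?? null;
}

// Sample output shape, assumed for illustration
const sample = [
  "deployment.apps/payment-service",
  "REVISION  CHANGE-CAUSE",
  "1         <none>",
  "2         <none>",
  "3         <none>",
  "",
].join("\n");

console.log(previousRevision(sample));
```

Returning `null` for a single-revision history gives the caller a clear signal to stop, instead of issuing a rollback to the revision that is already running.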

## Release Train Schedule

A release train ships on a fixed cadence regardless of what features are ready. Features that miss the train wait for the next one.

### Cadence Options

| Cadence | Suitable For | Trade-offs |
|---------|-------------|------------|
| Daily | SaaS, internal tools | Fast feedback, high automation needed |
| Weekly | B2B products | Balanced pace, manageable testing |
| Bi-weekly | Regulated industries | More testing time, slower delivery |
| Monthly | Enterprise, on-prem | Maximum stability, slow feedback |

### Weekly Release Train Example

```
Monday    | Feature freeze for this week's release
            Code complete -- all PRs merged to release branch
            Automated regression suite runs

Tuesday   | QA validation day
            Manual exploratory testing on staging
            Performance benchmarks compared to baseline

Wednesday | Release candidate tagged (e.g., v2.4.0-rc.1)
            Deploy to pre-production / canary
            Stakeholder sign-off window opens

Thursday  | Production deploy (canary -> full rollout)
            Monitoring period (4 hours minimum)
            Incident response team on standby

Friday    | Retrospective and metrics review
            Hotfix window (if needed)
            No new releases on Friday afternoon
```
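
The "miss the train, wait for the next one" rule in the weekly schedule can be computed from a merge timestamp. The sketch below assumes the freeze cutoff is the end of Monday in UTC and that the train ships on the Thursday after the freeze; both are assumptions for illustration, and `shipDateFor` is a hypothetical helper.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the UTC date (at midnight) of the Thursday deploy that a merge catches.
function shipDateFor(mergedAt: Date): Date {
  const mergeDay = Date.UTC(
    mergedAt.getUTCFullYear(),
    mergedAt.getUTCMonth(),
    mergedAt.getUTCDate()
  );
  // getUTCDay(): Sunday = 0, Monday = 1, ... Saturday = 6.
  // Days until the next freeze; a merge on Monday itself still makes it.
  const daysUntilFreeze = (1 - mergedAt.getUTCDay() + 7) % 7;
  // Production deploy is three days after the Monday freeze
  return new Date(mergeDay + (daysUntilFreeze + 3) * DAY_MS);
}

const mondayMerge = new Date("2025-11-03T15:00:00Z");  // A Monday
const tuesdayMerge = new Date("2025-11-04T09:00:00Z"); // Missed the freeze

console.log(shipDateFor(mondayMerge).toISOString().slice(0, 10));
console.log(shipDateFor(tuesdayMerge).toISOString().slice(0, 10));
```

Surfacing this date on a pull request makes the cost of missing the freeze visible early, which tends to reduce pressure to hold the train.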

## Semantic Versioning and Changelog Automation

### SemVer Rules

```
MAJOR.MINOR.PATCH (e.g., 2.4.1)

MAJOR: Breaking changes (removed API, changed behavior)
MINOR: New features (backward compatible)
PATCH: Bug fixes (backward compatible)

Pre-release: 2.4.1-beta.1, 2.4.1-rc.1
Build metadata: 2.4.1+build.123
```

| Change Type | Version Bump | Example |
|-------------|-------------|---------|
| Removed endpoint | MAJOR | v2.0.0 -> v3.0.0 |
| Changed response format | MAJOR | v2.0.0 -> v3.0.0 |
| New endpoint added | MINOR | v2.3.0 -> v2.4.0 |
| New optional parameter | MINOR | v2.3.0 -> v2.4.0 |
| Fixed validation bug | PATCH | v2.3.1 -> v2.3.2 |
| Performance improvement | PATCH | v2.3.1 -> v2.3.2 |

### Automated Changelog with Conventional Commits

```yaml
# .github/workflows/release.yml
name: Release

on:
  push:
    branches: [main]

permissions:
  contents: write
  pull-requests: write

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for changelog generation

      - uses: googleapis/release-please-action@v4
        id: release
        with:
          release-type: node
          # Conventional Commits -> automatic version bumps
          # feat: -> MINOR
          # fix:  -> PATCH
          # feat!: or BREAKING CHANGE: -> MAJOR

      - if: ${{ steps.release.outputs.release_created }}
        run: |
          echo "Release v${{ steps.release.outputs.major }}.${{ steps.release.outputs.minor }}.${{ steps.release.outputs.patch }} created"
          # Trigger deploy workflow, publish packages, etc.
```
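
The commit-to-bump mapping in the workflow comments can be modeled as follows. This is a simplified illustration of the Conventional Commits rules, not release-please's actual parser; `bumpForCommit` and `bumpForRelease` are hypothetical helpers.

```typescript
type CommitBump = "none" | "patch" | "minor" | "major";

const SEVERITY: CommitBump[] = ["none", "patch", "minor", "major"];

function bumpForCommit(message: string): CommitBump {
  const header = message.split("\n")[0] ?? "";
  // type, optional (scope), optional "!", then ": "
  const parsed = /^(\w+)(\([^)]*\))?(!)?:\s/.exec(header);

  // A "!" after the type/scope or a BREAKING CHANGE footer forces MAJOR
  const breakingFooter = /^BREAKING[ -]CHANGE:/m.test(message);
  if (breakingFooter || parsed?.[3] === "!") return "major";

  if (!parsed) return "none";
  if (parsed[1] === "feat") return "minor";
  if (parsed[1] === "fix") return "patch";
  return "none"; // docs, chore, refactor, etc. do not trigger a release
}

// A release takes the highest bump among its commits
function bumpForRelease(messages: string[]): CommitBump {
  return messages
    .map(bumpForCommit)
    .reduce((max, b) => (SEVERITY.indexOf(b) > SEVERITY.indexOf(max) ? b : max), "none" as CommitBump);
}

console.log(bumpForRelease(["fix: handle empty cart", "feat(api): add refunds endpoint"]));
```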

## Release Readiness Review

### Go/No-Go Criteria

| Category | Criteria | Status Check |
|----------|---------|-------------|
| Testing | All CI tests pass on release branch | Automated |
| Testing | No P0/P1 bugs open against release | JIRA query |
| Testing | Performance benchmarks within 10% of baseline | Automated |
| Security | No critical/high vulnerabilities in scan | Automated |
| Security | Dependency audit clean | Automated |
| Operations | Runbook updated for new features | Manual review |
| Operations | Monitoring dashboards cover new endpoints | Manual review |
| Operations | Rollback procedure tested in staging | Manual verification |
| Compliance | Change request approved (if required) | Ticketing system |
| Compliance | Data migration verified on staging data copy | Manual verification |
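
The automated rows of this table can be folded into a single go/no-go function that also lists what is blocking. The sketch below is one possible shape, assuming p99 latency is the benchmark being compared; `ReleaseSignals` and `goNoGo` are hypothetical names, and the 10% tolerance comes from the table.

```typescript
interface ReleaseSignals {
  ciPassed: boolean;
  openP0P1Bugs: number;
  criticalOrHighVulns: number;
  baselineP99Ms: number;   // Assumed benchmark: p99 latency of the last release
  candidateP99Ms: number;  // p99 latency of the release candidate
  manualChecks: Record<string, boolean>; // e.g. runbook updated, rollback tested
}

const PERF_TOLERANCE = 0.10; // "within 10% of baseline"

function goNoGo(s: ReleaseSignals): { go: boolean; blockers: string[] } {
  const blockers: string[] = [];

  if (!s.ciPassed) blockers.push("CI tests failing on release branch");
  if (s.openP0P1Bugs > 0) blockers.push(`${s.openP0P1Bugs} open P0/P1 bug(s)`);
  if (s.criticalOrHighVulns > 0) blockers.push(`${s.criticalOrHighVulns} critical/high vulnerability(ies)`);
  if (s.candidateP99Ms > s.baselineP99Ms * (1 + PERF_TOLERANCE)) {
    blockers.push("performance regression beyond 10% of baseline");
  }
  for (const [name, done] of Object.entries(s.manualChecks)) {
    if (!done) blockers.push(`manual check pending: ${name}`);
  }

  return { go: blockers.length === 0, blockers };
}

const result = goNoGo({
  ciPassed: true,
  openP0P1Bugs: 0,
  criticalOrHighVulns: 0,
  baselineP99Ms: 200,
  candidateP99Ms: 230,
  manualChecks: { "runbook updated": true, "rollback tested in staging": true },
});
console.log(result);
```

Returning the list of blockers, not just a boolean, gives the review meeting a concrete agenda when the answer is no-go.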

## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Friday afternoon deploys | No one available if things break | Deploy early in the week; freeze Friday PM |
| No rollback plan | Stuck with broken release | Always have one-command rollback tested in staging |
| Big-bang releases | Too many changes, impossible to isolate failures | Small, frequent releases behind feature flags |
| Skipping staging | Production becomes the test environment | Always deploy to staging first with same process |
| Manual deployment steps | Human error, inconsistent process | Automate everything; deploy via CI pipeline only |
| Feature flags never cleaned up | Flag spaghetti, dead code, maintenance burden | Set expiry dates; track flags in registry with cleanup tickets |
| Database migration in deploy script | Coupled rollback; can't roll back code without rolling back data | Separate migration pipeline; expand-contract pattern |
| No deploy metrics | Cannot tell if release caused degradation | Compare error rates, latency, throughput before/after |
| Shared staging environment | Teams block each other; dirty state | Per-PR or per-team ephemeral environments |
| Floating dependency versions (e.g., `latest`) | Builds break when dependencies update | Pin exact versions; use lockfiles; update deliberately |
| No release notes | Users and support team blindsided by changes | Automate changelog from conventional commits |
| Hotfix bypasses process | Introduces new bugs under pressure | Hotfix follows same pipeline, just expedited |

## Release Readiness Checklist

- [ ] All automated tests pass on the release branch
- [ ] No open P0/P1 bugs targeting this release
- [ ] Performance benchmarks are within acceptable thresholds
- [ ] Security scan shows no critical or high vulnerabilities
- [ ] Database migrations tested on staging with production-like data
- [ ] Rollback procedure tested and documented
- [ ] Feature flags configured for gradual rollout where appropriate
- [ ] Monitoring dashboards and alerts configured for new functionality
- [ ] Changelog and release notes generated and reviewed
- [ ] Stakeholder sign-off obtained (if required)
- [ ] On-call team briefed on release contents and known risks
- [ ] Deploy window scheduled outside peak traffic hours
- [ ] Communication plan ready (status page, customer notification)
- [ ] Post-deploy smoke tests defined and ready to run
