---
name: provision-sosafe-service
description: Provision a new SoSafe Kubernetes service across Terraform, Helm, and ArgoCD repos with Jira ticket creation, grouped requirement intake, PR creation, and deployment runbook.
---

# Provision New Service (SoSafe)

## Purpose + Triggers
Use this skill for requests like:
- create a new service
- add infra for `be-ms-*` or `fe-ms-*`
- provision `be-task-*`

Outputs:
- Terraform stack in the team's Terraform repo
- Helm chart + values in `dev-ci-kubernetes-services`
- ArgoCD apps in `dev-ci-argocd-deployments`
- Jira task when ticket is missing
- PRs (never merge automatically — always give PR links to user)
- Deployment runbook

## Conventions

| Area | Standard |
|---|---|
| K8s repo | `git@github.com:sosafe-platform-engineering/dev-ci-kubernetes-services.git` |
| Argo repo | `git@github.com:sosafe-platform-engineering/dev-ci-argocd-deployments.git` |
| Service names | microservice: `be-ms-*` or `fe-ms-*`; task: `be-task-*` |
| Branch | `{TICKET}-{service-name}-infra` |
| Argo env folders | `development`, `stage`, `production` |
| Helm env files | `values-dev.yaml`, `values-staging.yaml`, `values-prod.yaml` |
| Secret naming | `{env_prefix}/{service-name}` (env_prefix via `secret_name_prefix` variable, NOT `terraform.workspace`) |
| Argo repoURL | no `.git` suffix |
| nodeRestriction | prefer the codebase/modulith ID; fall back to the service name |
| Dockerfile `FROM` | use `AS` (uppercase) for multi-stage build aliases |
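
The naming rows above are plain string templates. The sketch below shows how they could be applied. The helper names (`branch_name`, `secret_name`, `argo_repo_url`) and the sample inputs (`TIP-123`, prefix `dev`) are hypothetical and only illustrate the conventions.

```python
def branch_name(ticket: str, service_name: str) -> str:
    """Build the branch name `{TICKET}-{service-name}-infra`."""
    return f"{ticket}-{service_name}-infra"


def secret_name(env_prefix: str, service_name: str) -> str:
    """Build `{env_prefix}/{service-name}`.

    env_prefix is expected to come from the `secret_name_prefix`
    variable, never from `terraform.workspace`.
    """
    return f"{env_prefix}/{service_name}"


def argo_repo_url(url: str) -> str:
    """Drop a trailing `.git`, since the Argo repoURL must not carry it."""
    suffix = ".git"
    return url[: -len(suffix)] if url.endswith(suffix) else url


# Hypothetical sample values, for illustration only.
print(branch_name("TIP-123", "be-ms-prb-google"))
print(secret_name("dev", "be-ms-prb-google"))
```

`argo_repo_url` only strips the suffix. Whether the Argo manifest expects an SSH or HTTPS URL is not covered here.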

### Chart Defaults

| Workload | Chart name | Alias | Version | Repository |
|---|---|---|---|---|
| microservice | `nodejs-microservice` | `microservice` | `4.59.1` | `oci://650815732575.dkr.ecr.eu-central-1.amazonaws.com/sosafe/charts` |
| task | `nodejs-single-instance-task` | `task` | `1.11.0` | `oci://650815732575.dkr.ecr.eu-central-1.amazonaws.com/sosafe/charts` |

### Runtime + Replica Defaults

| Area | Default |
|---|---|
| `maxOldSpaceSizePercentage` | `75` |
| Service cpu/memory | `100m` / `256` (dev/staging), increase for prod |
| Worker cpu/memory | `100m` / `256` |
| `pdb.maxUnavailable` | `75%` |
| Service replicas | dev `1`, staging `1`, prod `2` |
| Worker replicas | dev `1`, staging `1`, prod `1` |
| Required egress | always `443`; add `5432` for DB; add `9098` for Kafka |
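
The egress rule and the replica defaults can be derived mechanically from the intake answers. A minimal sketch, assuming a hypothetical helper `required_egress_ports` and a hypothetical `DEFAULT_REPLICAS` lookup that mirror the table:

```python
from typing import List

# Replica defaults copied from the table above (workload -> env -> replicas).
DEFAULT_REPLICAS = {
    "service": {"dev": 1, "staging": 1, "prod": 2},
    "worker": {"dev": 1, "staging": 1, "prod": 1},
}


def required_egress_ports(has_db: bool, has_kafka: bool) -> List[int]:
    """Return the egress ports: 443 always, 5432 for DB, 9098 for Kafka."""
    ports = [443]
    if has_db:
        ports.append(5432)
    if has_kafka:
        ports.append(9098)
    return ports


print(required_egress_ports(has_db=True, has_kafka=False))
```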

### IRSA Defaults

| Purpose | Name |
|---|---|
| App service account | `{serviceName}-sa` |
| App role | `{serviceName}-irsa` |
| Migration service account | `{serviceName}-db-migrations-sa` |
| Migration role | `{serviceName}-db-mig-irsa` |
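
As an illustration, the four IRSA names can be generated from the service name alone. The function name `irsa_names` and the dictionary keys are hypothetical; only the name patterns come from the table.

```python
from typing import Dict


def irsa_names(service_name: str) -> Dict[str, str]:
    """Derive the default IRSA service account and role names."""
    return {
        "app_service_account": f"{service_name}-sa",
        "app_role": f"{service_name}-irsa",
        "migration_service_account": f"{service_name}-db-migrations-sa",
        # The role uses the shortened `db-mig` form, unlike the service account.
        "migration_role": f"{service_name}-db-mig-irsa",
    }


for purpose, name in irsa_names("be-ms-prb-google").items():
    print(purpose, "->", name)
```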

## Preconditions
1. Ensure all three repos exist locally; clone any missing repo under `/tmp`.
2. Validate GitHub auth: `gh auth status`.
3. If no ticket is provided, ask the user for a ticket ID or create one.
4. If ticket creation fails, use `NO-TICKET` as the ticket slug, ask the user for the real ticket ID, and continue.
5. Stop on auth/clone failures before file generation.

## Jira Rules
- Create ticket only when user did not provide one.
- Tool order: Atlassian MCP first, `acli` fallback.
- If both fail, use `NO-TICKET` as the ticket slug and ask the user for the real ticket ID.

## Requirement Intake (AskUserQuestion)

IMPORTANT: Ask ALL dynamic values via AskUserQuestion. Do NOT assume or hardcode values for: team folder, Jira project, terraform repo, ingress type, tagging config, container tag, subdomain, env var values, secret names. Only use defaults from the chart/runtime defaults tables above.

### Call 1: Identity + Team
Use one grouped `AskUserQuestion` call for:

| Field | Options | Notes |
|---|---|---|
| Service name | free text | must follow naming convention |
| Service type | microservice, task | infer from name prefix |
| Terraform repo | free text (org/repo) | e.g. `sosafe-experience-squads/ts-simulation` |
| Team folder in K8s/Argo repos | free text | e.g. `threat-management`, `simulation` |
| Jira project | free text | e.g. `TIP`, `SIM`, `MSG` |
| Argo project | `experience-squads`, `platform-engineering` | determines namespace |
| Namespace | `exp-squad`, `platform` | must match Argo project |

### Call 2: Infrastructure
Use one grouped `AskUserQuestion` call for:

| Field | Options | Notes |
|---|---|---|
| Database | no DB, DB without migration, DB with migration | DB implies egress `5432` |
| Kafka/MSK | none, read, write, read_write | Kafka implies egress `9098` |
| SQS | none, existing queues only, create queue (+ optional DLQ) | add `sqs.tf` only when selected |
| API exposure | public, private, worker (no ingress) | public = Google/external callers can reach it; private = VPN only |
| Secret strategy | dedicated secret, shared modulith secret | determines `secrets.tf` |

### Call 3: Tagging + Deployment
Use one grouped `AskUserQuestion` call for:

| Field | Options | Notes |
|---|---|---|
| ProductGroup | free text | e.g. `Simulation`, `Threat Management` — check tagging strategy docs |
| Squad | free text | e.g. `Reporting`, `Simulation` |
| Jira board URL (Owner tag) | free text | e.g. `https://sosafegmbh.atlassian.net/browse/TIP` |
| Team tag | free text | e.g. `Simulation`, `Threat Management` |
| Lifecycle | `experimental`, `production`, `legacy` | default `experimental` for new services |
| Environments | `dev only`, `dev+staging`, `dev+staging+prod` | only create files for selected envs |
| Container tag | free text | e.g. `0.1.0` — the version to deploy |
| Ingress subdomain | free text | e.g. `be-ms-prb-google` — used in DNS |

### Call 4: Environment Variables
Use one grouped `AskUserQuestion` call for:

| Field | Notes |
|---|---|
| Secrets to store in AWS Secrets Manager | list of key names, e.g. `GOOGLE_SERVICE_ACCOUNT_KEY` |
| Environment-specific env vars | key=value pairs for dev (ask for staging/prod separately if those envs selected) |
| Sentry DSN | optional, goes in envVars (NOT secrets) |

## Key Lessons Learned

These are corrections from real provisioning experience:

1. **Never use `terraform.workspace`** for secret names. Use a `secret_name_prefix` variable set per environment in tfvars.
2. **`values.yaml`** contains ONLY static shared config (namespace, team, network policies, container port). All env-specific values (replicas, resources, secrets, env vars, ingress, PDB) go in `values-{env}.yaml`.
3. **Public ingress** is required when external services (like Google) call your endpoints. Private ingress is reachable over VPN only.
4. **Sentry DSN is NOT a secret.** It is a public identifier; put it in envVars.
5. **LavaMoat `allowScripts`**: new dependencies may need entries in the `lavamoat` config in `package.json`.
6. **Chart version**: treat the versions in Chart Defaults as a starting point. Always ask the reviewer or check the latest release; don't assume.
7. **`versions/` directory**: only create it when the stack uses Terraform modules (tm-kafka-policy, tm-sqs). Skip it for simple stacks.
8. **Container image must exist in ECR** before the Helm PR's image check can pass. For a new service that check is expected to fail until a version is tagged, so merge order matters.

## Decision Tree

```text
[Start]
  |
  v
[Ticket provided?] --No--> [Ask user / create ticket] --> [Continue]
  |Yes
  v
[Ask all intake questions (Calls 1-4)]
  v
[Service type]
  |-- be-ms-* / fe-ms-* --> [microservice chart]
  |-- be-task-* --> [task chart]
  v
[DB?] --Yes--> [secrets + 5432 egress + optional migrations.tf]
  |No
  v
[Kafka?] --Yes--> [msk.tf + 9098 egress + versions pin]
  |No
  v
[SQS?] --Yes--> [sqs.tf + versions pin]
  |No
  v
[API exposure]
  |-- public --> [exposedAPI true, ingress.public block]
  |-- private --> [exposedAPI true, ingress.private block]
  |-- worker --> [exposedAPI false, no ingress]
  v
[Generate files in 3 repos]
  v
[Create branches + commits + PRs (DO NOT merge)]
  v
[Output: PR links + merge order + runbook]
```
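
The service-type and API-exposure branches of the tree can be sketched as two small lookups. The function names (`service_type`, `exposure_settings`) and the shape of the returned dictionaries are hypothetical; the chart names, aliases, and `exposedAPI` values come from this document.

```python
from typing import Dict, Optional, Union

CHARTS = {
    "microservice": {"name": "nodejs-microservice", "alias": "microservice"},
    "task": {"name": "nodejs-single-instance-task", "alias": "task"},
}


def service_type(service_name: str) -> str:
    """Pick the workload type from the service name prefix."""
    if service_name.startswith(("be-ms-", "fe-ms-")):
        return "microservice"
    if service_name.startswith("be-task-"):
        return "task"
    raise ValueError(f"unsupported service name prefix: {service_name!r}")


def exposure_settings(api_exposure: str) -> Dict[str, Union[bool, Optional[str]]]:
    """Map the API exposure answer to `exposedAPI` and the ingress block."""
    settings = {
        "public": {"exposedAPI": True, "ingress": "public"},
        "private": {"exposedAPI": True, "ingress": "private"},
        "worker": {"exposedAPI": False, "ingress": None},
    }
    if api_exposure not in settings:
        raise ValueError(f"unknown API exposure: {api_exposure!r}")
    return settings[api_exposure]


chart = CHARTS[service_type("be-task-example")]  # hypothetical service name
print(chart["name"], exposure_settings("worker"))
```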

## Repository Implementation

### 1) Terraform repo (user-specified)
Path: `app/stacks/eks_{service_name_snake}`

Always create:
- `variables.tf` — include `secret_name_prefix` variable
- `data.tf`
- `locals.tf`
- `eks.tf`
- `secrets.tf` — use `var.secret_name_prefix` NOT `terraform.workspace`
- `tags.tf`
- `tfvars/base.tfvars` — all values asked in intake
- `tfvars/{env}.tfvars` — per selected env, set `secret_name_prefix`

Conditional:
- `migrations.tf` when DB migration enabled
- `msk.tf` when Kafka enabled
- `sqs.tf` when SQS enabled
- `versions/{env}.json` ONLY when using Terraform modules (Kafka/SQS)

Must update `config/include-stacks.yaml` for each selected env.
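
The file list for a stack follows directly from the intake answers. A minimal sketch, assuming hypothetical helpers `stack_path` and `terraform_files`, and assuming that env names are passed through exactly as the target repo expects them in `tfvars/` and `versions/` (this document does not fix that spelling):

```python
from typing import List

ALWAYS = [
    "variables.tf", "data.tf", "locals.tf", "eks.tf",
    "secrets.tf", "tags.tf", "tfvars/base.tfvars",
]


def stack_path(service_name: str) -> str:
    """Build `app/stacks/eks_{service_name_snake}` from a kebab-case name."""
    return "app/stacks/eks_" + service_name.replace("-", "_")


def terraform_files(envs: List[str], db_migration: bool,
                    kafka: bool, sqs: bool) -> List[str]:
    """List the files to create, relative to the stack path."""
    files = list(ALWAYS)
    files += [f"tfvars/{env}.tfvars" for env in envs]
    if db_migration:
        files.append("migrations.tf")
    if kafka:
        files.append("msk.tf")
    if sqs:
        files.append("sqs.tf")
    # versions/ is only needed when Terraform modules (Kafka/SQS) are used.
    if kafka or sqs:
        files += [f"versions/{env}.json" for env in envs]
    return files


print(stack_path("be-ms-prb-google"))
print(terraform_files(["dev"], db_migration=False, kafka=True, sqs=False))
```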

### 2) `dev-ci-kubernetes-services`
Path: `{teamFolder}/{serviceName}`

Create:
- `Chart.yaml`
- `values.yaml` — ONLY static shared config (namespace, team, nodeRestriction, containerPort, exposedAPI, enableIAM, ciliumNetworkPolicies)
- `values-{env}.yaml` — per selected env: awsAccount, cpu, memory, replicas, containerTag, secrets, envVars, ingress, pdb

### 3) `dev-ci-argocd-deployments`
Create (selected envs only), where `{env}` is the Argo env folder (`development`, `stage`, `production`):
- `{env}/managed/services/{teamFolder}/{serviceName}.yaml`

Value-file mapping:
- development -> `values-dev.yaml`
- stage -> `values-staging.yaml`
- production -> `values-prod.yaml`
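
The folder and value-file mapping is easy to get wrong because the Argo folders and the Helm files use different environment spellings. A hedged sketch, assuming the intake keys are `dev`, `staging`, and `prod`, that `{env}` in the manifest path is the Argo env folder, and a hypothetical helper name `argo_manifests`:

```python
from typing import Dict, List

# Intake env -> Argo folder and Helm values file, per the mapping above.
ENVIRONMENTS = {
    "dev": {"argo_folder": "development", "values_file": "values-dev.yaml"},
    "staging": {"argo_folder": "stage", "values_file": "values-staging.yaml"},
    "prod": {"argo_folder": "production", "values_file": "values-prod.yaml"},
}


def argo_manifests(selected_envs: List[str], team_folder: str,
                   service_name: str) -> Dict[str, str]:
    """Map each Argo manifest path to the Helm values file it should use."""
    manifests = {}
    for env in selected_envs:
        cfg = ENVIRONMENTS[env]  # raises KeyError for an unknown env
        path = (f"{cfg['argo_folder']}/managed/services/"
                f"{team_folder}/{service_name}.yaml")
        manifests[path] = cfg["values_file"]
    return manifests


for path, values in argo_manifests(["dev", "staging"], "simulation",
                                   "be-ms-prb-google").items():
    print(path, "->", values)
```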

## PR Workflow
1. Create branch `{TICKET}-{service-name}-infra` per repo.
2. Prefix every commit message with the ticket ID.
3. Rebase on latest `main` before push.
4. Create PR with `gh pr create`.
5. **NEVER merge PRs automatically** — always return PR URLs to the user.
6. Include merge order in output (Terraform -> Helm -> container tag -> ArgoCD).
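
The per-repo steps above can be expressed as a command list that is built first and run later. This is a sketch under assumptions: the helper `pr_commands` is hypothetical, the remote is assumed to be `origin`, the base branch `main`, and the `{ticket} {title}` commit and PR title format is only one way to satisfy the prefix rule. It deliberately contains no merge command.

```python
from typing import List


def pr_commands(ticket: str, service_name: str,
                title: str, body: str) -> List[List[str]]:
    """Return the git/gh commands for one repo, in execution order."""
    branch = f"{ticket}-{service_name}-infra"
    message = f"{ticket} {title}"  # commit message prefixed with the ticket ID
    return [
        ["git", "checkout", "-b", branch],
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        # Rebase on the latest main before pushing.
        ["git", "fetch", "origin", "main"],
        ["git", "rebase", "origin/main"],
        ["git", "push", "--set-upstream", "origin", branch],
        # Open the PR; merging is left to the user.
        ["gh", "pr", "create", "--base", "main",
         "--title", message, "--body", body],
    ]


for cmd in pr_commands("TIP-123", "be-ms-prb-google",
                       "Add infra for be-ms-prb-google", "Provisioning PR"):
    print(" ".join(cmd))
```

Each entry is an argument list, so it could be passed to `subprocess.run(cmd, check=True)` once the generated files are in place.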

## Merge Order (Important!)

1. **Terraform** — creates IRSA roles, secrets, security groups
2. **Helm** (dev-ci-kubernetes-services) — chart must exist on main before container deploy can find it
3. **Container tag** — tag a version on the service repo to publish to ECR
4. **ArgoCD** — picks up the Helm chart and deploys to the cluster

## Validation Checklist

### Terraform
- `secret_name_prefix` variable used (NOT `terraform.workspace`)
- `include-stacks.yaml` updated only for selected envs
- `versions/` directory only present when using modules
- Tagging values match what user provided

### Helm
- `values.yaml` has ONLY static shared config
- `values-{env}.yaml` has ALL env-specific config
- Correct ingress type (public vs private)
- `containerTag` matches the version user specified
- Sentry DSN in envVars, NOT in secrets

### ArgoCD
- Manifest exists only for selected envs
- `repoURL` has no `.git` suffix
- `project` + `namespace` are coherent

### Networking
- Egress `443` always present
- `5432` only when DB enabled
- `9098` only when Kafka enabled
- Public ingress when external callers need access

## Output Contract
Return:
1. Summary of all intake answers used for generation
2. Files created/modified grouped by repo
3. Ticket ID
4. Branch names
5. **PR URLs** (one per repo)
6. **Merge order** with explanation
7. Deployment runbook
8. Any blockers or manual steps needed

## Next Steps (include in output)

After PRs are created, guide the user through these steps:

### 1. Get PR approvals
- **Terraform PR** → share with Cloud/AppOps team for review (they own the infra repos)
- **Helm PR** → share with Cloud/AppOps team or your team lead
- **ArgoCD PR** → share with Cloud/AppOps team
- Tip: post all 3 PR links in the relevant Slack channel (e.g. `#appops-support`) with a short summary

### 2. Merge in order
Follow the merge order strictly — each step depends on the previous:
1. **Terraform PR** → merge and wait for Terraspace pipeline to apply (check pipeline in GitHub Actions)
2. **Populate AWS Secrets** → after Terraform creates the secret, fill in values via AWS Console or CLI (`aws secretsmanager put-secret-value`)
3. **Helm PR** → merge (chart available on main)
4. **Tag a version** on the service repo → triggers container build + publish to ECR
5. **ArgoCD PR** → merge, then:
   - **Development**: auto-syncs via ArgoCD
   - **Staging/Production**: manual sync required in ArgoCD UI

### 3. Verify deployment
- Check ArgoCD UI for your environment ([Dev](https://argo.sosafe-dev-internal.de/), [Staging](https://argo.sosafe-stage-internal.de/), [Prod](https://argo.sosafe-prod-internal.de/))
- Verify pods are running: the app should appear in the `exp-squad` (or `platform`) namespace
- Check health endpoints: `curl https://{subdomain}.sosafe-{env}-internal.de/health/liveness`
- Check logs in Datadog or `kubectl logs`

### 4. Post-deployment
- Add Sentry DSN once Sentry project is created (update `values-{env}.yaml`)
- Set up Compass component if not done during provisioning
- Update the service's README with the deployed URLs
- If public ingress: update `deployment.json` / Google Workspace config with the production URL

### 5. Troubleshooting common issues
| Issue | Fix |
|---|---|
| Container image not found in ECR | Tag a version on the service repo first, wait for publish pipeline |
| Helm PR check fails on image | Expected for new services — merge anyway, image will exist after tagging |
| ArgoCD sync fails | Check if Helm PR is merged to main first |
| Terraform plan fails | Check `include-stacks.yaml` formatting, variable names |
| Pod CrashLoopBackOff | Check secrets are populated in AWS Secrets Manager |
| 503 from ingress | Check if pod is running and health check passes |
| Push rejected | Request write access to the repo from the org owner |

## Reference Documentation

Always check these for the latest standards before generating files:

| Topic | URL |
|---|---|
| Create a new microservice (full guide) | https://sosafegmbh.atlassian.net/wiki/spaces/ENG/pages/2221966093/Create+a+new+microservice |
| Tagging strategy (ProductGroup, Squad, etc.) | https://sosafegmbh.atlassian.net/wiki/spaces/COPS/pages/2011070513/Tagging+Strategy |
| GitHub repo naming conventions | https://sosafegmbh.atlassian.net/wiki/spaces/ENG/pages/1178861569/GitHub |
| Terraspace repositories per team | https://sosafegmbh.atlassian.net/wiki/spaces/COPS/pages/2151317603/Terraspace+Repositories |
| Infrastructure pipelines (merge + deploy) | https://sosafegmbh.atlassian.net/wiki/spaces/COPS/pages/2310307978/Infrastructure+Pipelines |
| Karpenter node pools | https://sosafegmbh.atlassian.net/wiki/spaces/COPS/pages/3247538203 |
| Helm chart: nodejs-microservice | https://github.com/sosafe-platform-engineering/be-chart-nodejs-microservice |
| Helm chart: nodejs-single-instance-task | https://github.com/sosafe-platform-engineering/be-chart-nodejs-single-instance-task |
| ArgoCD deployments repo | https://github.com/sosafe-platform-engineering/dev-ci-argocd-deployments |
| K8s services repo (Helm values) | https://github.com/sosafe-platform-engineering/dev-ci-kubernetes-services |
| Blueprint example (Terraform) | https://github.com/sosafe-platform-engineering/ts-services/tree/main/_blueprints/eks_be_ms_BLUEPRINT |

### Naming Validation
Before creating any resources, validate the service name against the GitHub naming conventions:
- Schema: `<group>-<type>-<name>` (e.g. `be-ms-prb-google`)
- Allowed characters: lowercase letters, numbers, `-` (kebab-case)
- Groups: `be` (backend), `fe` (frontend), `shared`, `dev`, `ds`, `de`
- Types: `ms` (microservice), `task`, `app`, `lib`, `mf` (microfrontend)
- If the service runs on EKS/K8s as a server, prefer `be-ms-*` even if owned by a frontend team
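
These rules translate into a single regular expression. A minimal sketch, assuming a hypothetical helper `parse_service_name` and assuming the `<name>` part is one or more hyphen-separated segments with no leading or trailing hyphen (the conventions above only state the allowed characters):

```python
import re
from typing import Dict

GROUPS = ("be", "fe", "shared", "dev", "ds", "de")
TYPES = ("ms", "task", "app", "lib", "mf")

# <group>-<type>-<name>, lowercase letters, digits and hyphens only.
NAME_RE = re.compile(
    r"(?P<group>%s)-(?P<type>%s)-(?P<name>[a-z0-9]+(?:-[a-z0-9]+)*)"
    % ("|".join(GROUPS), "|".join(TYPES))
)


def parse_service_name(service_name: str) -> Dict[str, str]:
    """Split a valid service name into group, type and name parts."""
    match = NAME_RE.fullmatch(service_name)
    if match is None:
        raise ValueError(f"invalid service name: {service_name!r}")
    return match.groupdict()


print(parse_service_name("be-ms-prb-google"))
```

Note that this checks the general repo naming schema only. The stricter `be-ms-*`, `fe-ms-*`, and `be-task-*` prefixes from the Conventions table still apply to services provisioned by this skill.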

## Failure Behavior

| Failure | Action |
|---|---|
| Jira creation fails | use `NO-TICKET`, ask user for ticket ID |
| Repo missing locally | clone to `/tmp`; if clone fails, stop with error |
| GitHub auth missing | stop before PR creation |
| Service path already exists | stop and ask: update existing or choose new name |
| Push rejected (no write access) | inform user, suggest requesting access |
| Conflicting intake answers | ask one follow-up clarification, then continue |
