---
name: platform-engineering
description: Provides platform engineering best practices for Internal Developer Platforms (IDPs), golden paths, service catalogs, and developer experience. Use when building developer platforms, configuring Backstage, designing self-service workflows, or when user mentions 'platform engineering', 'backstage', 'golden path', 'IDP', 'developer portal', 'service catalog', 'DevEx', 'platform team', 'self-service'.
type: skill
category: patterns
status: stable
origin: tibsfox
modified: false
first_seen: 2026-02-07
first_path: examples/platform-engineering/SKILL.md
superseded_by: null
---
# Platform Engineering

Best practices for building Internal Developer Platforms (IDPs) that reduce cognitive load, accelerate delivery, and create golden paths for development teams.

## IDP Architecture Layers

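A well-designed IDP separates concerns into distinct layers. Each layer hides the complexity of the layers beneath it from the one above. Governance is not a tier in this stack. It applies policy across all four layers, which is why it appears in the table but not in the diagram.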

```
Developer Interface (Portal / CLI / API)
        |
  Orchestration Layer (Workflows, Templates, Scaffolding)
        |
  Integration Layer (APIs, Plugins, Connectors)
        |
  Resource Layer (Infrastructure, Services, Tools)
```

| Layer | Purpose | Components | Owned By |
|-------|---------|------------|----------|
| Developer Interface | Self-service entry point | Backstage portal, CLI tools, API gateway | Platform team |
| Orchestration | Workflow automation, templating | Scaffolder, Terraform modules, Crossplane | Platform team |
| Integration | Connect tools and services | Backstage plugins, API adapters, webhooks | Platform + tool owners |
| Resource | Actual infrastructure and services | Kubernetes, databases, CI/CD, monitoring | Infrastructure team |
| Governance | Policy enforcement and compliance | OPA, Kyverno, cost policies, security scans | Security + platform team |
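The layering can be made concrete with a hypothetical dependency check. The `Layer` enum and `call_allowed` function are illustrative and not part of any tool. The check assumes strict layering, where each layer calls only the layer directly below it, and it treats governance as a cross-cutting concern that any layer may consult.

```python
from enum import IntEnum


class Layer(IntEnum):
    """IDP layers, ordered from developer-facing (0) down to resources (3)."""

    DEVELOPER_INTERFACE = 0
    ORCHESTRATION = 1
    INTEGRATION = 2
    RESOURCE = 3


# Governance sits outside the stack: any layer may ask it for a policy decision.
GOVERNANCE = "governance"


def call_allowed(caller, callee):
    """Return True if `caller` may depend on `callee` under strict layering."""
    if callee == GOVERNANCE:
        return True
    # Only the layer immediately below the caller is reachable.
    return isinstance(callee, Layer) and callee == caller + 1
```

Relaxed layering, where a layer may skip levels, is also common. The strict variant is shown because it stops the portal from depending directly on resource-level details.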

## Platform Team Topology and Responsibilities

### Team Structure

| Role | Responsibility | Focus Area |
|------|---------------|------------|
| Platform Product Manager | Roadmap, prioritization, user research | Developer needs, adoption metrics |
| Platform Engineer | IDP core, golden paths, automation | Infrastructure abstraction, tooling |
| Developer Advocate | Documentation, onboarding, feedback loops | DevEx, training, communication |
| SRE/Reliability Lead | Platform reliability, SLOs, incident response | Uptime, performance, observability |
| Security Engineer | Policy-as-code, compliance automation | Guardrails, scanning, access control |

### Interaction Model

```
Stream-Aligned Teams (consumers)
        |
        | self-service requests
        v
Platform Team (enablers)
        |
        | golden paths, templates, APIs
        v
Infrastructure / Cloud (resources)
```

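In the Team Topologies model, the platform team is a fundamental team type of its own, alongside stream-aligned, enabling, and complicated-subsystem teams. It interacts with stream-aligned teams mainly in **X-as-a-Service** mode. This reduces their cognitive load by offering curated, opinionated abstractions that they can consume on their own.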

## Backstage: Service Catalog and Developer Portal

### catalog-info.yaml -- Service Registration

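Every service registers itself in the catalog through a `catalog-info.yaml` at the repo root. One file can hold several entities separated by `---`. The example below registers both the service and the API it provides.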

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing and refunds
  annotations:
    github.com/project-slug: acme-corp/payment-service
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/service-id: P1234ABC
    grafana/dashboard-selector: "payment-service"
  tags:
    - java
    - spring-boot
    - payments
  links:
    - url: https://grafana.internal/d/payments
      title: Dashboard
      icon: dashboard
    - url: https://runbooks.internal/payments
      title: Runbook
      icon: docs
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: checkout-system
  providesApis:
    - payment-api
  consumesApis:
    - inventory-api
    - notification-api
  dependsOn:
    - resource:payments-db
    - component:auth-service

---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: payment-api
  description: Payment processing REST API
spec:
  type: openapi
  lifecycle: production
  owner: team-payments
  system: checkout-system
  definition:
    $text: ./openapi.yaml
```
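Relation fields such as `dependsOn` and `providesApis` take entity references in the shorthand form `[<kind>:][<namespace>/]<name>`, and an omitted namespace falls back to `default`. The sketch below expands that shorthand. `parse_entity_ref` is a hypothetical helper, not Backstage's own parser. Its `default_kind` argument stands in for fields like `providesApis`, where the example above writes bare API names.

```python
import re

# Shorthand used in relation fields: [<kind>:][<namespace>/]<name>
_REF = re.compile(
    r"^(?:(?P<kind>[^:/]+):)?(?:(?P<namespace>[^:/]+)/)?(?P<name>[^:/]+)$"
)


def parse_entity_ref(ref, default_kind=None, default_namespace="default"):
    """Expand a shorthand entity reference into (kind, namespace, name).

    Kind and namespace are lowercased; the name is kept as written.
    Raises ValueError for malformed refs or refs with no kind and no default.
    """
    match = _REF.match(ref.strip())
    if not match:
        raise ValueError(f"malformed entity ref: {ref!r}")
    kind = match.group("kind") or default_kind
    if kind is None:
        raise ValueError(f"entity ref {ref!r} needs an explicit kind")
    namespace = match.group("namespace") or default_namespace
    return kind.lower(), namespace.lower(), match.group("name")
```

For instance, `parse_entity_ref("resource:payments-db")` expands to the fully qualified form `resource:default/payments-db`.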

### Service Catalog API -- Querying Components

```bash
# List all components owned by a team
curl -s "https://backstage.internal/api/catalog/entities?filter=kind=component,spec.owner=team-payments" \
  -H "Authorization: Bearer $BACKSTAGE_TOKEN" | jq '.[] | {name: .metadata.name, lifecycle: .spec.lifecycle}'

# Find all services consuming a specific API
curl -s "https://backstage.internal/api/catalog/entities?filter=kind=component,spec.consumesApis=payment-api" \
  -H "Authorization: Bearer $BACKSTAGE_TOKEN" | jq '.[] | .metadata.name'

# Get component details, including resolved relations
curl -s "https://backstage.internal/api/catalog/entities/by-name/component/default/payment-service" \
  -H "Authorization: Bearer $BACKSTAGE_TOKEN" | jq '{
    name: .metadata.name,
    owner: .spec.owner,
    apis: .spec.providesApis,
    dependencies: .spec.dependsOn,
    relations: [.relations[]? | {type, targetRef}]
  }'
```
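The same queries can be scripted with only the Python standard library. `entities_url` and `fetch_entities` are hypothetical helpers. The sketch assumes the catalog filter semantics that the curl examples rely on: conditions inside one `filter` parameter are ANDed, and repeating `filter` ORs the groups together.

```python
import json
import urllib.parse
import urllib.request


def entities_url(base_url, *filters):
    """Build a catalog query URL; each dict becomes one `filter` parameter.

    Conditions inside one dict are comma-joined (ANDed); passing several
    dicts repeats the `filter` parameter (ORed).
    """
    params = [
        ("filter", ",".join(f"{key}={value}" for key, value in f.items()))
        for f in filters
    ]
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/api/catalog/entities?{query}"


def fetch_entities(url, token):
    """GET the URL with a bearer token and decode the JSON array response."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Percent-encoding the `=` and `,` inside the filter value decodes to the same string used in the curl examples. A caller would pass the result of `entities_url` to `fetch_entities` together with the same `BACKSTAGE_TOKEN`.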

## Golden Path Templates

Golden paths are opinionated, pre-configured templates that encode best practices. They give teams a paved road to production.

### Backstage Software Template

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: spring-boot-service
  title: Spring Boot Microservice
  description: Creates a production-ready Spring Boot service with CI/CD, monitoring, and database
  tags:
    - java
    - spring-boot
    - recommended
spec:
  owner: platform-team
  type: service

  parameters:
    - title: Service Details
      required:
        - name
        - owner
        - description
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          ui:autofocus: true
        owner:
          title: Owner Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        description:
          title: Description
          type: string
        javaVersion:
          title: Java Version
          type: string
          default: '21'
          enum: ['17', '21']

    - title: Infrastructure
      properties:
        database:
          title: Database
          type: string
          default: postgresql
          enum: [postgresql, mysql, none]
        cacheLayer:
          title: Cache Layer
          type: string
          default: none
          enum: [redis, none]
        messageBroker:
          title: Message Broker
          type: string
          default: none
          enum: [kafka, rabbitmq, none]

  steps:
    - id: fetch-template
      name: Fetch Skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
          description: ${{ parameters.description }}
          javaVersion: ${{ parameters.javaVersion }}
          database: ${{ parameters.database }}
          cacheLayer: ${{ parameters.cacheLayer }}
          messageBroker: ${{ parameters.messageBroker }}

    - id: create-repo
      name: Create Repository
      action: publish:github
      input:
        repoUrl: github.com?owner=acme-corp&repo=${{ parameters.name }}
        description: ${{ parameters.description }}
        defaultBranch: main
        protectDefaultBranch: true
        requireCodeOwnerReviews: true

    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['create-repo'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

    - id: create-argocd-app
      name: Create ArgoCD Application
      action: argocd:create-resources
      input:
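        appName: ${{ parameters.name }}
        # argoInstance, namespace and path are assumed inputs of the Argo CD
        # scaffolder action; the values shown are placeholders for your setup.
        argoInstance: main
        namespace: ${{ parameters.name }}
        repoUrl: ${{ steps['create-repo'].output.remoteUrl }}
        path: k8s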

  output:
    links:
      - title: Repository
        url: ${{ steps['create-repo'].output.remoteUrl }}
      - title: Service in Catalog
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}
      - title: CI/CD Pipeline
        url: ${{ steps['create-repo'].output.remoteUrl }}/actions
```
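The same golden path can be triggered from a CLI or CI job instead of the portal form. This sketch assumes the Backstage scaffolder's `POST /api/scaffolder/v2/tasks` endpoint, which takes a body with `templateRef` and `values` and returns a task `id`. It also assumes that the template is registered in the `default` namespace. `build_task_payload` and `start_task` are hypothetical helpers. They check values locally against the template's pattern and enums before submitting.

```python
import json
import re
import urllib.request

# Mirrors the template's `name` pattern and enum parameters.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*$")
ALLOWED = {
    "javaVersion": {"17", "21"},
    "database": {"postgresql", "mysql", "none"},
    "cacheLayer": {"redis", "none"},
    "messageBroker": {"kafka", "rabbitmq", "none"},
}
DEFAULTS = {"javaVersion": "21", "database": "postgresql",
            "cacheLayer": "none", "messageBroker": "none"}


def build_task_payload(name, owner, description, **options):
    """Validate inputs against the template's schema and build a task body."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid service name: {name!r}")
    values = {"name": name, "owner": owner, "description": description}
    for key, default in DEFAULTS.items():
        value = options.pop(key, default)
        if value not in ALLOWED[key]:
            raise ValueError(f"{key} must be one of {sorted(ALLOWED[key])}")
        values[key] = value
    if options:
        raise ValueError(f"unknown parameters: {sorted(options)}")
    return {"templateRef": "template:default/spring-boot-service", "values": values}


def start_task(base_url, token, payload):
    """POST the payload to the scaffolder and return the new task id."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/scaffolder/v2/tasks",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

Validating locally gives faster feedback in CI. The scaffolder still applies the template's own schema when the task runs.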

### Golden Path Coverage Matrix

| Category | What the Golden Path Provides | Without Golden Path |
|----------|------------------------------|---------------------|
| Repository | Pre-configured with CI/CD, linting, CODEOWNERS | Manual setup, inconsistent configs |
| CI/CD | Working pipeline from day one | Copy-paste from other repos, broken configs |
| Observability | Dashboards, alerts, SLOs pre-configured | No monitoring until first incident |
| Security | Dependency scanning, SAST, secrets detection | Added retroactively (if at all) |
| Documentation | ADR template, README structure, API docs | Empty README, no docs |
| Infrastructure | Terraform modules, Kubernetes manifests | Hand-crafted YAML, drift between envs |
| Testing | Test framework, coverage gates, fixtures | Ad-hoc test setup, no coverage requirements |
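Repositories tend to drift from the golden path as they age, so a periodic conformance check against the files the template is meant to create can catch gaps. This is a minimal sketch with a hypothetical `EXPECTED` mapping, and most of the file names are illustrative. Two details are not: GitHub looks for `CODEOWNERS` in the repository root, `.github/`, or `docs/`, and TechDocs builds from an `mkdocs.yml` in the directory that `backstage.io/techdocs-ref` points to.

```python
from pathlib import Path

# Hypothetical expectations: each tuple lists acceptable locations for one file.
# Adjust to match what your own golden path templates generate.
EXPECTED = {
    "Repository": [("CODEOWNERS", ".github/CODEOWNERS", "docs/CODEOWNERS"),
                   ("catalog-info.yaml",)],
    "CI/CD": [(".github/workflows/ci.yaml", ".github/workflows/ci.yml")],
    "Documentation": [("README.md",), ("mkdocs.yml",)],
}


def golden_path_gaps(repo_root):
    """Return {category: [missing file]} for each category with a gap."""
    root = Path(repo_root)
    gaps = {}
    for category, requirements in EXPECTED.items():
        # Report the first (preferred) location of each unmet requirement.
        missing = [
            alternatives[0]
            for alternatives in requirements
            if not any((root / path).is_file() for path in alternatives)
        ]
        if missing:
            gaps[category] = missing
    return gaps
```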

## Developer Experience Metrics (SPACE Framework)

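Measure platform effectiveness with the SPACE framework. No single dimension tells the whole story. The framework's authors recommend tracking metrics from at least three dimensions and combining survey data with system telemetry.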

| Dimension | What It Measures | Example Metrics | Collection Method |
|-----------|-----------------|-----------------|-------------------|
| **S**atisfaction and well-being | How developers feel about the platform and their work | NPS, satisfaction survey (1-5) | Quarterly survey |
| **P**erformance | Outcomes of developer work | Deployment frequency, change failure rate | DORA metrics pipeline |
| **A**ctivity | Volume of actions or outputs | Scaffolding requests, API calls, portal visits | Platform telemetry |
| **C**ommunication and collaboration | Quality of collaboration and information flow | Time to first response on platform support | Ticketing system |
| **E**fficiency and flow | Ability to work with minimal friction or delay | Time from commit to deploy, onboarding time | Pipeline metrics |
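NPS in the Satisfaction row is computed from a 0-10 "how likely are you to recommend" question. The score is the percentage of promoters (9 or 10) minus the percentage of detractors (0 to 6). A small sketch of that arithmetic:

```python
def net_promoter_score(scores):
    """NPS from 0-10 answers: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no survey responses")
    if any(not 0 <= s <= 10 for s in scores):
        raise ValueError("scores must be in the range 0-10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)
```

The result ranges from -100 to 100. A single NPS figure does not explain why developers are unhappy, so pair it with free-text survey answers and with metrics from the other dimensions.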

### Key DevEx Metrics Dashboard

```yaml
# Platform DevEx Metrics -- collected via platform telemetry
metrics:
  onboarding:
    time_to_first_deploy:
      target: "< 2 hours"
      description: "Time from new hire to first successful deployment"
      source: "scaffolder + pipeline timestamps"

    time_to_first_commit:
      target: "< 4 hours"
      description: "Time from repo creation to first merged commit"
      source: "github events"

  self_service:
    template_adoption_rate:
      target: "> 80%"
      description: "Percentage of new services using golden path templates"
      source: "backstage scaffolder logs"

    self_service_resolution_rate:
      target: "> 70%"
      description: "Percentage of requests resolved without platform team intervention"
      source: "support tickets vs portal actions"

  reliability:
    platform_availability:
      target: "99.9%"
      description: "Uptime of developer portal, CI/CD, and artifact registry"
      source: "synthetic monitoring"

    mean_time_to_recovery:
      target: "< 30 minutes"
      description: "Time to restore platform services after incident"
      source: "incident management system"

  delivery:
    deployment_frequency:
      target: "multiple per day per team"
      description: "How often teams deploy to production"
      source: "deployment pipeline events"

    lead_time_for_changes:
      target: "< 1 day"
      description: "Time from commit to production"
      source: "git + pipeline timestamps"
```
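The delivery targets above can be computed from deployment events. This sketch assumes a hypothetical `Deployment` record with one commit per deployment. It reports the median lead time because the YAML does not say which statistic its `< 1 day` target refers to.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    commit_time: datetime          # when the change was committed
    deploy_time: datetime          # when it reached production
    caused_failure: bool = False   # did it trigger an incident or rollback?


def delivery_metrics(deployments, window_days):
    """Compute DORA-style delivery metrics over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need deployments and a positive window")
    lead_times = [d.deploy_time - d.commit_time for d in deployments]
    failures = sum(d.caused_failure for d in deployments)
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": failures / len(deployments),
    }


def meets_lead_time_target(metrics, target=timedelta(days=1)):
    """Read the `< 1 day` target as a strict bound on the median lead time."""
    return metrics["median_lead_time"] < target
```

Teams whose deployments batch several commits would compute lead time per commit instead, which usually gives a longer figure.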

## Self-Service Portal Workflow

### Request Flow Architecture

```
Developer submits request via Portal UI
        |
        v
Request Validation (schema check, policy check)
        |
        v
Approval Gate (if required by policy)
        |           |
        | auto      | manual
        v           v
Orchestration Engine (executes workflow)
        |
        +---> Provision Infrastructure (Terraform/Crossplane)
        +---> Configure CI/CD (GitHub Actions / ArgoCD)
        +---> Register in Catalog (Backstage)
        +---> Set Up Monitoring (Grafana / PagerDuty)
        +---> Notify Team (Slack / Email)
        |
        v
Verification (health checks, smoke tests)
        |
        v
Developer notified -- ready to use
```
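The request flow can be sketched as a driver that stops at the first failing stage. Every callable here is a hypothetical hook that you would connect to your policy engine and orchestrator, including `validate`, `needs_approval`, `approve`, the step actions, and `verify`. The sketch leaves out retries and rollback, which a real orchestration engine should provide.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Request:
    """A self-service request as submitted through the portal."""

    kind: str          # e.g. "database"
    environment: str   # e.g. "staging" or "prod"
    log: List[str] = field(default_factory=list)


Check = Callable[[Request], bool]
Step = Tuple[str, Check]  # a named orchestration step; True means success


def run_request(req: Request, validate: Check, needs_approval: Check,
                approve: Check, steps: List[Step], verify: Check) -> str:
    """Move a request through validation, approval, orchestration, verification."""
    if not validate(req):            # schema and policy checks
        return "rejected: validation failed"
    if needs_approval(req):
        if not approve(req):         # manual gate, e.g. a reviewer in the portal
            return "rejected: approval denied"
        req.log.append("approved manually")
    else:
        req.log.append("auto-approved")
    for name, action in steps:       # provision, configure CI/CD, register, ...
        if not action(req):
            return f"failed at step: {name}"
        req.log.append(name)
    if not verify(req):              # health checks, smoke tests
        return "failed: verification"
    return "ready"
```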

### Self-Service Capability Matrix

| Capability | Automation Level | Approval Required | Typical Time |
|------------|-----------------|-------------------|--------------|
| Create new service | Fully automated | No | 5 minutes |
| Provision database | Fully automated | No (dev/staging), Yes (prod) | 10 minutes |
| Add CI/CD pipeline | Fully automated | No | 2 minutes |
| Request cloud credentials | Semi-automated | Yes (security review) | 1 hour |
| Create new environment | Fully automated | No (non-prod), Yes (prod) | 15 minutes |
| Add monitoring/alerts | Fully automated | No | 5 minutes |
| Resize infrastructure | Semi-automated | Only above a cost threshold (cost review) | 30 minutes |
| Decommission service | Automated with safeguards | Yes (owner confirmation) | 10 minutes |
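The "Approval Required" column can be encoded as a policy function so that the portal decides the gate automatically. In this hypothetical sketch the capability keys are invented. The table does not define the cost threshold for resizing, so `COST_REVIEW_THRESHOLD_USD` is a placeholder value.

```python
# Placeholder: the table does not say what the cost threshold is.
COST_REVIEW_THRESHOLD_USD = 500

# Hypothetical capability keys, one per row of the matrix above.
NEVER = {"create_service", "add_cicd", "add_monitoring"}
PROD_ONLY = {"provision_database", "create_environment"}
ALWAYS = {"request_cloud_credentials", "decommission_service"}


def requires_approval(capability, environment="dev", monthly_cost_delta_usd=0.0):
    """Return True if the request must pass a manual approval gate."""
    if capability in NEVER:
        return False
    if capability in PROD_ONLY:
        return environment == "prod"
    if capability in ALWAYS:
        return True
    if capability == "resize_infrastructure":
        # Cost review only when the resize raises spend above the threshold.
        return monthly_cost_delta_usd > COST_REVIEW_THRESHOLD_USD
    raise ValueError(f"unknown capability: {capability!r}")
```

In practice this rule would more likely live in a policy engine such as OPA, which the governance layer already lists, so that it can be versioned and reviewed like other policy-as-code.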

## Platform Engineering Maturity Model

| Level | Name | Characteristics | Capabilities |
|-------|------|----------------|--------------|
| 0 | Ad Hoc | No platform, tribal knowledge | Teams manage their own infra |
| 1 | Reactive | Shared scripts, wiki docs | Basic CI/CD, manual provisioning |
| 2 | Standardized | Golden paths, basic portal | Service templates, catalog, basic self-service |
| 3 | Optimized | Full IDP, metrics-driven | Self-service everything, DevEx metrics, policy-as-code |
| 4 | Strategic | Platform as product, innovation | API-first platform, marketplace, continuous feedback |

### Maturity Assessment Checklist

```
Level 1 --> Level 2:
  [ ] Service catalog exists and is maintained
  [ ] At least 3 golden path templates available
  [ ] Basic developer portal deployed
  [ ] CI/CD standardized across teams

Level 2 --> Level 3:
  [ ] Self-service for >80% of common requests
  [ ] SPACE metrics collected and reviewed monthly
  [ ] Policy-as-code enforced (not advisory)
  [ ] Platform team has dedicated product manager
  [ ] Internal SLOs defined for platform services

Level 3 --> Level 4:
  [ ] API-first platform (all capabilities programmable)
  [ ] Internal developer marketplace for plugins/extensions
  [ ] Continuous developer experience research program
  [ ] Platform economics model (cost attribution per team)
  [ ] Platform contributes to organizational strategy
```
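Scoring the checklist comes down to one rule: a team is at the highest level for which that level's criteria and all lower levels' criteria are met. The sketch below uses hypothetical short keys for the checklist items. It assumes a baseline of Level 1 because the checklist does not distinguish Level 0 from Level 1.

```python
# Hypothetical keys for the checklist items, grouped by the level they unlock.
CRITERIA = {
    2: ["service_catalog", "three_golden_paths", "basic_portal", "standard_cicd"],
    3: ["self_service_80pct", "space_metrics_monthly", "enforced_policy_as_code",
        "platform_pm", "platform_slos"],
    4: ["api_first", "internal_marketplace", "devex_research",
        "cost_attribution", "strategy_input"],
}


def maturity_level(met, baseline=1):
    """Return the highest level reached without skipping any level.

    `met` is a set of criterion keys; a gap at one level caps the result
    even if higher-level criteria are satisfied.
    """
    level = baseline
    for target in sorted(CRITERIA):
        if not all(criterion in met for criterion in CRITERIA[target]):
            break
        level = target
    return level
```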

## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Build it and they will come | No adoption without developer input | Treat platform as product; user research before building |
| Ticket-ops disguised as platform | Self-service portal that just creates tickets | Automate end-to-end; tickets are a smell, not a solution |
| Mandating platform use | Forced adoption breeds resentment and workarounds | Make the golden path the easiest path, not the only path |
| One-size-fits-all templates | Overly rigid templates that don't fit team needs | Composable templates with sensible defaults and escape hatches |
| No feedback loops | Platform team builds in isolation | Regular surveys, office hours, embedded rotations with teams |
| Ignoring developer experience | Technically correct but painful to use | Measure DevEx metrics, optimize for developer happiness |
| Platform team as bottleneck | All changes go through platform team | Self-service with guardrails; teams should not wait on platform |
| Over-abstracting too early | Complex abstraction layers before understanding needs | Start with concrete solutions, abstract when patterns emerge |
| Neglecting documentation | Powerful platform nobody knows how to use | Docs-as-code, TechDocs in Backstage, examples for everything |
| No platform SLOs | Platform reliability treated as best-effort | Define and publish SLOs; treat reliability as part of the platform's product promise |
| Shadow platforms | Teams build their own tooling around the platform | Understand why and address gaps; shadow platforms reveal unmet needs |
| Gold plating the portal | Spending months on portal UI before delivering value | Ship incrementally; a working CLI beats a beautiful but empty portal |

## Platform Engineering Checklist

- [ ] Platform team established with clear product ownership
- [ ] Developer portal deployed (Backstage or equivalent)
- [ ] Service catalog populated with all production services
- [ ] At least 3 golden path templates available and documented
- [ ] Self-service provisioning for common infrastructure (databases, queues, caches)
- [ ] CI/CD pipelines standardized and available via templates
- [ ] Observability stack integrated (dashboards auto-created with new services)
- [ ] Security scanning built into golden paths (not bolted on after)
- [ ] DevEx metrics defined and collected (SPACE framework dimensions)
- [ ] Feedback mechanism active (surveys, office hours, Slack channel)
- [ ] Platform SLOs defined and monitored
- [ ] Documentation maintained in developer portal (TechDocs)
- [ ] Onboarding time measured and optimized (target: first deploy < 2 hours)
- [ ] Cost visibility per team/service available through platform
- [ ] Platform roadmap published and informed by developer feedback
- [ ] Escape hatches documented for when golden paths don't fit
- [ ] Platform reliability meets or exceeds published SLOs
