Build self-service infrastructure that accelerates development: golden paths, developer portals, and reducing cognitive load at scale.

Platform Engineering: Building Internal Developer Platforms

Platform engineering is DevOps evolved. Instead of every team managing infrastructure, platform teams build self-service tools that let developers ship faster without becoming infrastructure experts.

What is Platform Engineering?

Traditional DevOps:

"You build it, you run it"
Every team owns their infrastructure
Duplicate work across teams
Cognitive overload

Platform Engineering:

Platform team provides self-service tools
Developers use golden paths
Centralized best practices
Reduced cognitive load

Example: Instead of each team figuring out Kubernetes, CI/CD, and monitoring on its own, the platform provides a "deploy my app" button that handles all three.

The Core Problem

At scale, infrastructure becomes a bottleneck:

10 product teams × 8 engineers = 80 developers

Each team needs:

Kubernetes cluster
CI/CD pipelines
Database provisioning
Monitoring setup
Secret management
Log aggregation

Without platform: 80 engineers distracted by infrastructure
With platform: 80 engineers building features

Platform engineering = force multiplier

Golden Paths: The Foundation

A golden path is the "opinionated but flexible" way to do something:

Bad: "Here's kubectl, good luck"

# Developer has to know:
kubectl create namespace my-app
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
# Plus: secrets, configmaps, RBAC, network policies...

Good: Golden path with CLI

# Platform-provided tool
platform deploy \
  --app my-service \
  --image ghcr.io/company/my-service:v1.2.3 \
  --env production \
  --replicas 3

# Behind the scenes:
# - Creates namespace
# - Applies standard manifests
# - Configures monitoring/logging
# - Sets up secrets from vault
# - Configures autoscaling

The developer gets about 80% of what they need with zero infrastructure knowledge.
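The "applies standard manifests" step can be sketched as a function that expands the four CLI inputs into Kubernetes objects. This is a minimal illustration, not the actual platform tool: the function name render_manifests, the "app-env" namespace naming, and container port 8080 are assumptions made for the example.

```python
from typing import Any


def render_manifests(app: str, image: str, env: str, replicas: int = 3) -> list[dict[str, Any]]:
    """Expand four inputs into a Namespace, a Deployment, and a Service."""
    namespace = f"{app}-{env}"  # assumed naming convention
    labels = {"app": app, "env": env}
    return [
        {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {"name": namespace, "labels": labels},
        },
        {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": app, "namespace": namespace, "labels": labels},
            "spec": {
                "replicas": replicas,
                "selector": {"matchLabels": {"app": app}},
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {
                        "containers": [
                            {
                                "name": app,
                                "image": image,
                                "ports": [{"containerPort": 8080}],  # assumed port
                            }
                        ]
                    },
                },
            },
        },
        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {"name": app, "namespace": namespace, "labels": labels},
            "spec": {
                "selector": {"app": app},
                "ports": [{"port": 80, "targetPort": 8080}],
            },
        },
    ]


if __name__ == "__main__":
    manifests = render_manifests(
        "my-service", "ghcr.io/company/my-service:v1.2.3", "production"
    )
    for manifest in manifests:
        print(manifest["kind"], manifest["metadata"]["name"])
```

Keeping the templates in one function means a change to the standard (a new label, a different port) reaches every service on its next deploy.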

Building a Platform CLI

// platform-cli: Deploy command
package cmd

import (
    "fmt"

    "github.com/spf13/cobra"
    "platform/internal/k8s"
    "platform/internal/monitoring"
    "platform/internal/vault"
)

var deployCmd = &cobra.Command{
    Use:   "deploy",
    Short: "Deploy an application to Kubernetes",
    RunE: func(cmd *cobra.Command, args []string) error {
        app := cmd.Flag("app").Value.String()
        image := cmd.Flag("image").Value.String()
        env := cmd.Flag("env").Value.String()
        replicas := cmd.Flag("replicas").Value.String()

        // 1. Generate manifests from templates
        manifests := generateManifests(app, image, env, replicas)

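        // 2. Fetch secrets from Vault
        secrets, err := vault.GetSecrets(app, env)
        if err != nil {
            return fmt.Errorf("fetching secrets for %s in %s: %w", app, env, err)
        }

        // 3. Apply to Kubernetes
        if err := k8s.Apply(manifests, secrets); err != nil {
            return fmt.Errorf("applying manifests for %s: %w", app, err)
        }

        // 4. Configure monitoring
        if err := monitoring.Setup(app, env); err != nil {
            return fmt.Errorf("configuring monitoring for %s: %w", app, err)
        }

        // 5. Update service mesh
        if err := configureServiceMesh(app, env); err != nil {
            return fmt.Errorf("configuring service mesh for %s: %w", app, err)
        }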

        fmt.Printf("✅ %s deployed to %s\n", app, env)
        fmt.Printf("🔗 https://%s.%s.company.com\n", app, env)

        return nil
    },
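}

func init() {
    // Register the flags read in RunE. The defaults here are illustrative.
    deployCmd.Flags().String("app", "", "application name")
    deployCmd.Flags().String("image", "", "container image, including tag")
    deployCmd.Flags().String("env", "staging", "target environment")
    deployCmd.Flags().String("replicas", "3", "number of replicas")
    _ = deployCmd.MarkFlagRequired("app")
    _ = deployCmd.MarkFlagRequired("image")
}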

Key principles:

Sensible defaults (replicas=3, autoscaling enabled)
Escape hatches for advanced use cases
Error messages that suggest solutions
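As a rough sketch of the first and third principles, the snippet below fills in defaults and raises errors that tell the developer what to do next. The names DeployConfig, build_config, and DeployConfigError, the list of allowed environments, and the default environment of staging are all assumptions for illustration.

```python
from dataclasses import dataclass

ALLOWED_ENVS = ("staging", "production")  # assumed environment names


class DeployConfigError(ValueError):
    """Raised with a message that tells the developer how to fix the input."""


@dataclass(frozen=True)
class DeployConfig:
    app: str
    image: str
    env: str = "staging"      # assumed default
    replicas: int = 3         # sensible default from the principles above
    autoscaling: bool = True  # enabled unless explicitly overridden


def build_config(app: str, image: str, **overrides) -> DeployConfig:
    """Apply defaults, then validate with actionable error messages."""
    config = DeployConfig(app=app, image=image, **overrides)
    if config.env not in ALLOWED_ENVS:
        raise DeployConfigError(
            f"Unknown environment '{config.env}'. "
            f"Try one of: {', '.join(ALLOWED_ENVS)}."
        )
    if config.replicas < 1:
        raise DeployConfigError(
            f"replicas must be at least 1 (got {config.replicas}). "
            "Omit the option to use the default of 3."
        )
    # Only the last path segment can carry a tag; registry ports do not count.
    if ":" not in config.image.rsplit("/", 1)[-1]:
        raise DeployConfigError(
            f"Image '{config.image}' has no tag. "
            "Pin a version, for example my-service:v1.2.3."
        )
    return config


if __name__ == "__main__":
    print(build_config("my-service", "ghcr.io/company/my-service:v1.2.3"))
    try:
        build_config("my-service", "ghcr.io/company/my-service:v1.2.3", env="prod")
    except DeployConfigError as err:
        print(f"error: {err}")
```

Each message names the bad value and the nearest valid alternative, which is usually enough for a developer to recover without opening a support ticket.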

Developer Portal with Backstage

Backstage (created at Spotify, now a CNCF project) is the de facto standard for developer portals:

# catalog-info.yaml - Service definition
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Payment processing service
  annotations:
    github.com/project-slug: company/payment-service
    pagerduty.com/integration-key: abc123
    grafana/dashboard-selector: "service:payment-service"
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: checkout
  providesApis:
    - payment-api
  consumesApis:
    - user-api
    - fraud-detection-api
  dependsOn:
    - resource:postgres-payments
    - resource:redis-cache

Backstage shows:

Service ownership and dependencies
Live deployment status
Recent deployments and rollbacks
On-call rotation
Runbooks and documentation
Metrics and logs (embedded Grafana)
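The dependency view relies on entity references such as resource:postgres-payments, which follow the shape [kind:][namespace/]name. A small parser makes the convention concrete. This is an illustrative sketch rather than Backstage's own implementation: parse_entity_ref and EntityRef are made-up names, and the default kind is left to the caller.

```python
from typing import NamedTuple


class EntityRef(NamedTuple):
    kind: str
    namespace: str
    name: str


def parse_entity_ref(
    ref: str, default_kind: str, default_namespace: str = "default"
) -> EntityRef:
    """Split '[kind:][namespace/]name' into its three parts."""
    kind, sep, rest = ref.partition(":")
    if not sep:
        # No explicit kind, so the whole string is '[namespace/]name'.
        kind, rest = default_kind, ref
    namespace, sep, name = rest.partition("/")
    if not sep:
        # No explicit namespace, so fall back to the default one.
        namespace, name = default_namespace, rest
    if not kind or not namespace or not name:
        raise ValueError(f"Malformed entity reference: {ref!r}")
    return EntityRef(kind, namespace, name)


if __name__ == "__main__":
    for ref in ["resource:postgres-payments", "resource:redis-cache"]:
        print(parse_entity_ref(ref, default_kind="component"))
    print(parse_entity_ref("user-api", default_kind="api"))
```

A platform team could run a check like this in CI to reject catalog files with malformed references before they reach the portal.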

Custom Backstage Plugin

// packages/plugin-platform/src/components/DeployButton.tsx
import React from 'react';
import { useEntity } from '@backstage/plugin-catalog-react';
import { Button } from '@material-ui/core';

export const DeployButton = () => {
  const { entity } = useEntity();

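  const handleDeploy = async () => {
    const response = await fetch('/api/platform/deploy', {
      method: 'POST',
      // Declare the JSON body so the backend parses it correctly
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        service: entity.metadata.name,
        environment: 'staging',
        image: 'latest',
      }),
    });

    if (response.ok) {
      alert('Deployment started!');
    } else {
      // Surface failures instead of silently ignoring them
      alert(`Deployment failed: ${response.status} ${response.statusText}`);
    }
  };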

  return (
    <Button variant="contained" color="primary" onClick={handleDeploy}>
      Deploy to Staging
    </Button>
  );
};

Infrastructure Provisioning: Terraform Modules

Standardize infrastructure with reusable modules:

# modules/service/main.tf
module "service" {
  source = "github.com/company/terraform-modules//service"

  name        = var.service_name
  environment = var.environment

  # Defaults provided by platform
  container_image = var.image
  replicas        = var.replicas
  cpu_request     = "100m"
  memory_request  = "256Mi"

  # Auto-configured
  monitoring_enabled = true
  logging_enabled    = true
  tracing_enabled    = true

  # Service mesh integration
  istio_enabled = true

  # Database (optional)
  database = var.needs_database ? {
    engine  = "postgres"
    version = "15"
    size    = "db.t3.medium"
  } : null
}

Developer usage:

# teams/payments/main.tf
module "payment_service" {
  source = "../../modules/service"

  service_name   = "payment-service"
  environment    = "production"
  image          = "ghcr.io/company/payment-service:v1.2.3"
  replicas       = 5
  needs_database = true
}

The platform handles the rest of the complexity: networking, security groups, IAM roles, logging, and monitoring.

Self-Service Database Provisioning

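# Platform API for database provisioning
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Hypothetical in-house package, not a public library: thin clients for
# Terraform, Vault, and Slack, plus the two helpers called below.
from platform_lib import terraform, vault, slack
from platform_lib import generate_terraform_config, create_database_dashboard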

app = FastAPI()

class DatabaseRequest(BaseModel):
    team: str
    service: str
    engine: str  # postgres, mysql, mongodb
    environment: str
    size: str = "small"  # small, medium, large

@app.post("/api/databases/provision")
async def provision_database(req: DatabaseRequest):
    # Validate request
    if req.engine not in ["postgres", "mysql", "mongodb"]:
        raise HTTPException(400, "Invalid database engine")

    # Generate Terraform config
    tf_config = generate_terraform_config(
        team=req.team,
        service=req.service,
        engine=req.engine,
        env=req.environment,
        size=req.size
    )

    # Apply Terraform
    result = terraform.apply(tf_config)

    # Store connection info in Vault
    connection_string = result['outputs']['connection_string']
    vault.write(
        path=f"database/{req.team}/{req.service}/{req.environment}",
        data={"connection_string": connection_string}
    )

    # Create monitoring dashboard
    create_database_dashboard(req.service, req.environment)

    # Send notification
    slack.send(
        channel=f"#{req.team}",
        message=f"✅ Database provisioned for {req.service} ({req.environment})"
    )

    return {
        "status": "provisioned",
        "endpoint": result['outputs']['endpoint'],
        "vault_path": f"database/{req.team}/{req.service}/{req.environment}"
    }

From the developer's perspective:

curl -X POST https://platform.company.com/api/databases/provision \
  -H "Content-Type: application/json" \
  -d '{
    "team": "payments",
    "service": "payment-service",
    "engine": "postgres",
    "environment": "staging"
  }'

5 minutes later: database ready, secrets in Vault, monitoring configured.
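The provisioning API above calls a generate_terraform_config helper that is not shown. One plausible shape for it, sketched under assumptions, returns a module call in Terraform's JSON syntax. The size-to-instance-class mapping and the //database module path are hypothetical, chosen to mirror the db.t3.medium value and module source used earlier.

```python
import json

# Assumed mapping from t-shirt sizes to instance classes.
SIZE_TO_INSTANCE_CLASS = {
    "small": "db.t3.small",
    "medium": "db.t3.medium",
    "large": "db.t3.large",
}
SUPPORTED_ENGINES = ("postgres", "mysql", "mongodb")


def generate_terraform_config(
    team: str, service: str, engine: str, env: str, size: str = "small"
) -> dict:
    """Build a Terraform JSON-syntax module call for one database."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unsupported engine '{engine}'. Choose one of: {', '.join(SUPPORTED_ENGINES)}."
        )
    if size not in SIZE_TO_INSTANCE_CLASS:
        raise ValueError(
            f"Unknown size '{size}'. Choose one of: {', '.join(SIZE_TO_INSTANCE_CLASS)}."
        )
    # Terraform identifiers read better with underscores than hyphens.
    module_name = f"{service}_{env}_db".replace("-", "_")
    return {
        "module": {
            module_name: {
                # Hypothetical module path in the platform's module repository.
                "source": "github.com/company/terraform-modules//database",
                "name": f"{service}-{env}",
                "engine": engine,
                "instance_class": SIZE_TO_INSTANCE_CLASS[size],
                "tags": {"team": team, "service": service, "environment": env},
            }
        }
    }


if __name__ == "__main__":
    config = generate_terraform_config(
        team="payments", service="payment-service", engine="postgres", env="staging"
    )
    print(json.dumps(config, indent=2))
```

Developers pick from three sizes instead of dozens of instance classes, and the platform team can change what "small" means in one place.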

Reducing Cognitive Load

Platform engineering is about reducing decisions developers must make:

Example: CI/CD Pipeline

Without platform:

# Developer writes from scratch (100+ lines)
name: CI/CD
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run tests
        run: npm test
      # ... 20 more steps
  build:
    # ... another 30 lines
  deploy:
    # ... another 50 lines

With platform:

# .platform.yaml
pipeline: nodejs-service  # Platform-provided template
tests: npm test
deploy_to:
  - staging
  - production

The platform generates the full pipeline from this file.
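How that expansion might work can be sketched as a function that turns the short spec into a workflow definition with test, build, and chained deploy jobs. The template registry, the generate_pipeline name, the app parameter, and the npm commands are assumptions. The spec is passed as an already-parsed dictionary to keep the example dependency-free.

```python
from typing import Any

# Hypothetical template registry: the commands each template runs.
TEMPLATES: dict[str, dict[str, list[str]]] = {
    "nodejs-service": {"setup": ["npm ci"], "build": ["npm run build"]},
}


def _run(command: str) -> dict[str, str]:
    """Wrap a shell command as a workflow step."""
    return {"run": command}


def generate_pipeline(spec: dict[str, Any], app: str) -> dict[str, Any]:
    """Expand a short platform spec into a full workflow definition."""
    template_name = spec["pipeline"]
    template = TEMPLATES.get(template_name)
    if template is None:
        known = ", ".join(sorted(TEMPLATES))
        raise ValueError(f"Unknown pipeline template '{template_name}'. Available: {known}.")

    checkout = {"uses": "actions/checkout@v4"}
    setup = [_run(cmd) for cmd in template["setup"]]
    jobs: dict[str, Any] = {
        "test": {
            "runs-on": "ubuntu-latest",
            "steps": [checkout, *setup, _run(spec["tests"])],
        },
        "build": {
            "runs-on": "ubuntu-latest",
            "needs": ["test"],
            "steps": [checkout, *setup, *[_run(cmd) for cmd in template["build"]]],
        },
    }

    # Chain the deploy jobs so each environment waits for the previous one.
    previous = "build"
    for env in spec.get("deploy_to", []):
        job_id = f"deploy-{env}"
        jobs[job_id] = {
            "runs-on": "ubuntu-latest",
            "needs": [previous],
            "environment": env,
            "steps": [checkout, _run(f"platform deploy --app {app} --env {env}")],
        }
        previous = job_id

    return {"name": "CI/CD", "on": ["push"], "jobs": jobs}


if __name__ == "__main__":
    spec = {
        "pipeline": "nodejs-service",
        "tests": "npm test",
        "deploy_to": ["staging", "production"],
    }
    pipeline = generate_pipeline(spec, app="payment-service")
    for job_id, job in pipeline["jobs"].items():
        print(job_id, "needs", job.get("needs", []))
```

Because production is listed after staging, its deploy job depends on the staging deploy, so the ordering in the short file is the only sequencing decision the developer makes.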

Measuring Platform Success

Track platform adoption and impact:

# Platform metrics
metrics = {
    # Adoption
    "teams_using_platform": 85,  # % of teams
    "services_on_platform": 120,

    # Velocity
    "avg_deployment_time": "8 min",  # vs 45 min before
    "deployments_per_day": 450,      # vs 120 before

    # Quality
    "incident_mttr": "12 min",       # vs 45 min before
    "deployment_success_rate": 0.98,

    # Developer satisfaction
    "nps_score": 65,  # Net Promoter Score
    "support_tickets_per_week": 8,   # vs 40 before
}
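The before-and-after values in the comments above can be turned into improvement factors, which are easier to compare across metrics than raw numbers. The snippet below is a small worked calculation using only those figures. The improvement_factor helper and the metric keys are illustrative names.

```python
# (before, after, direction) taken from the metric comments above.
BEFORE_AFTER = {
    "avg_deployment_time_min": (45, 8, "lower"),
    "deployments_per_day": (120, 450, "higher"),
    "incident_mttr_min": (45, 12, "lower"),
    "support_tickets_per_week": (40, 8, "lower"),
}


def improvement_factor(before: float, after: float, better: str) -> float:
    """Return how many times better 'after' is than 'before'."""
    if better == "lower":
        return before / after
    if better == "higher":
        return after / before
    raise ValueError(f"better must be 'lower' or 'higher', got {better!r}")


if __name__ == "__main__":
    for name, (before, after, better) in BEFORE_AFTER.items():
        factor = improvement_factor(before, after, better)
        print(f"{name}: {before} -> {after} ({factor:.2f}x better)")
```

With these numbers, deployment time improves by a factor of 5.625 and support tickets by a factor of 5, while deployment frequency and MTTR both improve by 3.75.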

Anti-Patterns to Avoid

1. Building Too Early

Don't build a platform until at least three teams are duplicating the same infrastructure work.

2. No Escape Hatches

# BAD: Can't override defaults
platform deploy --app my-service

# GOOD: Advanced users can customize
platform deploy \
  --app my-service \
  --cpu-limit 2000m \
  --custom-manifest overrides.yaml
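One way an option like --custom-manifest could be applied is a recursive merge of the developer's overrides onto the platform defaults, so that unspecified fields keep their standard values. The sketch below assumes both sides are already parsed into dictionaries. merge_overrides is an illustrative name, and lists are replaced wholesale rather than merged.

```python
from copy import deepcopy
from typing import Any


def merge_overrides(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return defaults with overrides applied, merging nested dicts key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings, so descend instead of replacing.
            merged[key] = merge_overrides(merged[key], value)
        else:
            # Scalars and lists from the override win outright.
            merged[key] = deepcopy(value)
    return merged


if __name__ == "__main__":
    platform_defaults = {
        "replicas": 3,
        "resources": {"requests": {"cpu": "100m", "memory": "256Mi"}},
    }
    developer_overrides = {"resources": {"limits": {"cpu": "2000m"}}}
    print(merge_overrides(platform_defaults, developer_overrides))
```

The defaults are copied rather than mutated, so one service's override cannot leak into the next deploy.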

3. Forcing Migration

Evangelize, don't mandate. Show value first.

4. Ignoring Feedback

Platform teams ARE product teams. Listen to your users (developers).

Real-World Example: Shopify

Shopify's platform team provides:

Shipit: Deploy any service to Kubernetes
Railgun: CI/CD pipeline generator
Vault integration: Automatic secret management
Dev environments: Spin up full stack in one command

Result:

2000+ engineers using platform
10,000+ deployments per day
New service deployed in <30 mins

Getting Started

Phase 1: Identify Pain Points

Survey developers: what's slowing you down?
Common answer: deployment, database setup, secrets

Phase 2: Build One Golden Path

Start small: standardize deployments
Get 3 teams using it
Iterate based on feedback

Phase 3: Expand

Add more golden paths (databases, queues, caches)
Build developer portal
Automate toil

Phase 4: Scale

Self-service everything
Treat platform as a product
Measure and optimize

Conclusion

Platform engineering is not about controlling developers; it is about empowering them. The best platforms are invisible: developers ship features without thinking about infrastructure.

Signs you need platform engineering:

Multiple teams duplicating infrastructure work
Deployments take hours/days
Developers spending >30% time on ops

Benefits:

5-10x faster deployments
Reduced cognitive load
Consistent best practices
Happier developers

