Platform Engineering: Building Internal Developer Platforms
Build self-service infrastructure that accelerates development: golden paths, developer portals, and reducing cognitive load at scale.
Platform engineering is DevOps evolved. Instead of every team managing infrastructure, platform teams build self-service tools that let developers ship faster without becoming infrastructure experts.
What is Platform Engineering?
Traditional DevOps:
- "You build it, you run it"
- Every team owns their infrastructure
- Duplicate work across teams
- Cognitive overload
Platform Engineering:
- Platform team provides self-service tools
- Developers use golden paths
- Centralized best practices
- Reduced cognitive load
Example: Instead of each team figuring out Kubernetes, CI/CD, and monitoring on its own, the platform provides a "deploy my app" button that handles all three.
The Core Problem
At scale, infrastructure becomes a bottleneck:
10 product teams × 8 engineers = 80 developers
Each team needs:
- Kubernetes cluster
- CI/CD pipelines
- Database provisioning
- Monitoring setup
- Secret management
- Log aggregation
Without platform: 80 engineers distracted by infrastructure
With platform: 80 engineers building features
Platform engineering = force multiplier
Golden Paths: The Foundation
A golden path is the "opinionated but flexible" way to do something:
Bad: "Here's kubectl, good luck"
    # Developer has to know:
    kubectl create namespace my-app
    kubectl apply -f deployment.yaml
    kubectl apply -f service.yaml
    kubectl apply -f ingress.yaml
    # Plus: secrets, configmaps, RBAC, network policies...
Good: Golden path with CLI
    # Platform-provided tool
    platform deploy \
      --app my-service \
      --image ghcr.io/company/my-service:v1.2.3 \
      --env production \
      --replicas 3

    # Behind the scenes:
    # - Creates namespace
    # - Applies standard manifests
    # - Configures monitoring/logging
    # - Sets up secrets from vault
    # - Configures autoscaling
Developers get roughly 80% of what they need without any infrastructure knowledge.
Building a Platform CLI
    // platform-cli: Deploy command
    package cmd

    import (
        "fmt"

        "github.com/spf13/cobra"

        "platform/internal/k8s"
        "platform/internal/monitoring"
        "platform/internal/vault"
    )

    // The app, image, env, and replicas flags are assumed to be registered in init().
    // generateManifests and configureServiceMesh live elsewhere in this package.
    var deployCmd = &cobra.Command{
        Use:   "deploy",
        Short: "Deploy an application to Kubernetes",
        RunE: func(cmd *cobra.Command, args []string) error {
            app := cmd.Flag("app").Value.String()
            image := cmd.Flag("image").Value.String()
            env := cmd.Flag("env").Value.String()
            replicas := cmd.Flag("replicas").Value.String()

            // 1. Generate manifests from templates
            manifests := generateManifests(app, image, env, replicas)

            // 2. Fetch secrets from Vault
            secrets, err := vault.GetSecrets(app, env)
            if err != nil {
                return fmt.Errorf("fetching secrets for %s: %w", app, err)
            }

            // 3. Apply to Kubernetes
            if err := k8s.Apply(manifests, secrets); err != nil {
                return fmt.Errorf("applying manifests: %w", err)
            }

            // 4. Configure monitoring
            if err := monitoring.Setup(app, env); err != nil {
                return fmt.Errorf("configuring monitoring: %w", err)
            }

            // 5. Update service mesh
            if err := configureServiceMesh(app, env); err != nil {
                return fmt.Errorf("configuring service mesh: %w", err)
            }

            fmt.Printf("✅ %s deployed to %s\n", app, env)
            fmt.Printf("🔗 https://%s.%s.company.com\n", app, env)
            return nil
        },
    }
Key principles:
- Sensible defaults (replicas=3, autoscaling enabled)
- Escape hatches for advanced use cases
- Error messages that suggest solutions
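The three principles above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the platform's actual implementation: `DeployConfig`, `ConfigError`, `build_config`, and the `KNOWN_ENVS` list are hypothetical names, and the defaults simply mirror the ones listed above.

```python
from dataclasses import dataclass, replace

# Hypothetical list of environments; a real platform would load this from config.
KNOWN_ENVS = ("staging", "production")


@dataclass(frozen=True)
class DeployConfig:
    app: str
    image: str
    env: str
    replicas: int = 3          # sensible default
    autoscaling: bool = True   # sensible default


class ConfigError(ValueError):
    """Error whose message tells the developer how to fix the problem."""


def build_config(app: str, image: str, env: str, **overrides) -> DeployConfig:
    """Validate the required inputs, then apply defaults and explicit overrides."""
    if env not in KNOWN_ENVS:
        raise ConfigError(
            f"Unknown environment '{env}'. Try one of: {', '.join(KNOWN_ENVS)}."
        )
    # Only the last path segment can carry a tag (the registry may have a port).
    if ":" not in image.rsplit("/", 1)[-1]:
        raise ConfigError(
            f"Image '{image}' has no tag. Pin a version, e.g. {image}:v1.2.3."
        )
    base = DeployConfig(app=app, image=image, env=env)
    try:
        # Escape hatch: any field can be overridden explicitly.
        return replace(base, **overrides)
    except TypeError as exc:
        raise ConfigError(f"Unsupported override: {exc}") from exc
```

Calling `build_config("my-service", "ghcr.io/company/my-service:v1.2.3", "production")` yields three replicas with autoscaling on, while passing `replicas=5` overrides only that field.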
Developer Portal with Backstage
Backstage, open-sourced by Spotify and now a CNCF project, is a widely used framework for developer portals. Each service is described in a catalog file:
    # catalog-info.yaml - Service definition
    apiVersion: backstage.io/v1alpha1
    kind: Component
    metadata:
      name: payment-service
      description: Payment processing service
      annotations:
        github.com/project-slug: company/payment-service
        pagerduty.com/integration-key: abc123
        grafana/dashboard-selector: "service:payment-service"
    spec:
      type: service
      lifecycle: production
      owner: payments-team
      system: checkout
      providesApis:
        - payment-api
      consumesApis:
        - user-api
        - fraud-detection-api
      dependsOn:
        - resource:postgres-payments
        - resource:redis-cache
Backstage shows:
- Service ownership and dependencies
- Live deployment status
- Recent deployments and rollbacks
- On-call rotation
- Runbooks and documentation
- Metrics and logs (embedded Grafana)
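To make the ownership and dependency view concrete, here is a rough sketch of how those relations could be read out of the descriptor fields shown earlier. It is not Backstage's own API: the entity is assumed to be already parsed into a plain dict, and `summarize_entity` is a hypothetical helper.

```python
def summarize_entity(entity: dict) -> dict:
    """Collect the owner and relation edges of one catalog entity."""
    spec = entity.get("spec", {})
    name = entity["metadata"]["name"]
    edges = []
    for api in spec.get("providesApis", []):
        edges.append((name, "provides", api))
    for api in spec.get("consumesApis", []):
        edges.append((name, "consumes", api))
    for dep in spec.get("dependsOn", []):
        # References look like "resource:postgres-payments"; split kind from name.
        kind, sep, target = dep.partition(":")
        if not sep:
            kind, target = "unknown", dep
        edges.append((name, f"depends on {kind}", target))
    return {"name": name, "owner": spec.get("owner", "unowned"), "edges": edges}


payment_service = {
    "metadata": {"name": "payment-service"},
    "spec": {
        "owner": "payments-team",
        "providesApis": ["payment-api"],
        "consumesApis": ["user-api", "fraud-detection-api"],
        "dependsOn": ["resource:postgres-payments", "resource:redis-cache"],
    },
}
summary = summarize_entity(payment_service)
```

Each edge is a `(source, relation, target)` tuple, which is enough to draw a dependency graph or to answer "who owns what this service depends on".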
Custom Backstage Plugin
    // packages/plugin-platform/src/components/DeployButton.tsx
    import React from 'react';
    import { useEntity } from '@backstage/plugin-catalog-react';
    import { Button } from '@material-ui/core';

    export const DeployButton = () => {
      // The entity whose catalog page this button is rendered on
      const { entity } = useEntity();

      const handleDeploy = async () => {
        const response = await fetch('/api/platform/deploy', {
          method: 'POST',
          // Declare the JSON body so the backend can parse it
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({
            service: entity.metadata.name,
            environment: 'staging',
            image: 'latest',
          }),
        });

        if (response.ok) {
          alert('Deployment started!');
        } else {
          alert(`Deployment failed (HTTP ${response.status})`);
        }
      };

      return (
        <Button variant="contained" color="primary" onClick={handleDeploy}>
          Deploy to Staging
        </Button>
      );
    };
Infrastructure Provisioning: Terraform Modules
Standardize infrastructure with reusable modules:
    # modules/service/main.tf
    module "service" {
      source = "github.com/company/terraform-modules//service"

      name        = var.service_name
      environment = var.environment

      # Defaults provided by platform
      container_image = var.image
      replicas        = var.replicas
      cpu_request     = "100m"
      memory_request  = "256Mi"

      # Auto-configured
      monitoring_enabled = true
      logging_enabled    = true
      tracing_enabled    = true

      # Service mesh integration
      istio_enabled = true

      # Database (optional)
      database = var.needs_database ? {
        engine  = "postgres"
        version = "15"
        size    = "db.t3.medium"
      } : null
    }
Developer usage:
    # teams/payments/main.tf
    module "payment_service" {
      source = "../../modules/service"

      service_name   = "payment-service"
      environment    = "production"
      image          = "ghcr.io/company/payment-service:v1.2.3"
      replicas       = 5
      needs_database = true
    }
The platform handles the rest of the complexity: networking, security groups, IAM roles, logging, and monitoring.
Self-Service Database Provisioning
    # Platform API for database provisioning
    # Note: terraform, vault, slack, generate_terraform_config, and
    # create_database_dashboard stand for platform-internal helpers (not shown).
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel
    import terraform

    app = FastAPI()

    class DatabaseRequest(BaseModel):
        team: str
        service: str
        engine: str          # postgres, mysql, mongodb
        environment: str
        size: str = "small"  # small, medium, large

    @app.post("/api/databases/provision")
    async def provision_database(req: DatabaseRequest):
        # Validate request
        if req.engine not in ["postgres", "mysql", "mongodb"]:
            raise HTTPException(400, "Invalid database engine")

        # Generate Terraform config
        tf_config = generate_terraform_config(
            team=req.team,
            service=req.service,
            engine=req.engine,
            env=req.environment,
            size=req.size,
        )

        # Apply Terraform
        result = terraform.apply(tf_config)

        # Store connection info in Vault
        vault_path = f"database/{req.team}/{req.service}/{req.environment}"
        connection_string = result['outputs']['connection_string']
        vault.write(path=vault_path, data={"connection_string": connection_string})

        # Create monitoring dashboard
        create_database_dashboard(req.service, req.environment)

        # Send notification
        slack.send(
            channel=f"#{req.team}",
            message=f"✅ Database provisioned for {req.service} ({req.environment})",
        )

        return {
            "status": "provisioned",
            "endpoint": result['outputs']['endpoint'],
            "vault_path": vault_path,
        }
From the developer's perspective:
    curl -X POST https://platform.company.com/api/databases/provision \
      -H 'Content-Type: application/json' \
      -d '{
        "team": "payments",
        "service": "payment-service",
        "engine": "postgres",
        "environment": "staging"
      }'
About five minutes later, the database is ready, its connection string is in Vault, and monitoring is configured.
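The validation and Vault-path logic in the handler can be isolated so it is testable without Terraform or Vault. The sketch below uses only the standard library and makes some assumptions beyond the original handler: it also checks `size` and rejects `/` in path segments, and `validate` and `vault_path` are hypothetical helper names.

```python
from __future__ import annotations

from dataclasses import dataclass

ALLOWED_ENGINES = {"postgres", "mysql", "mongodb"}
ALLOWED_SIZES = {"small", "medium", "large"}


@dataclass(frozen=True)
class DatabaseRequest:
    team: str
    service: str
    engine: str
    environment: str
    size: str = "small"


def validate(req: DatabaseRequest) -> list[str]:
    """Return a list of problems; an empty list means the request is valid."""
    problems = []
    if req.engine not in ALLOWED_ENGINES:
        problems.append(f"engine must be one of {sorted(ALLOWED_ENGINES)}")
    if req.size not in ALLOWED_SIZES:
        problems.append(f"size must be one of {sorted(ALLOWED_SIZES)}")
    # These fields become Vault path segments, so they must not contain '/'.
    for field in ("team", "service", "environment"):
        value = getattr(req, field)
        if not value or "/" in value:
            problems.append(f"{field} must be non-empty and must not contain '/'")
    return problems


def vault_path(req: DatabaseRequest) -> str:
    """Build the secret path using the same layout as the handler."""
    return f"database/{req.team}/{req.service}/{req.environment}"
```

Returning every problem at once, rather than failing on the first, lets the API tell the developer everything that needs fixing in a single response.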
Reducing Cognitive Load
Platform engineering is about reducing decisions developers must make:
Example: CI/CD Pipeline
Without platform:
    # Developer writes from scratch (100+ lines)
    name: CI/CD
    on: [push]
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - name: Run tests
            run: npm test
          # ... 20 more steps
      build:
        # ... another 30 lines
      deploy:
        # ... another 50 lines
With platform:
    # .platform.yaml
    pipeline: nodejs-service  # Platform-provided template
    tests: npm test
    deploy_to:
      - staging
      - production
The platform generates the full pipeline from this file.
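One way to picture that generation step: the short config is expanded into an ordered list of jobs. The sketch below is only an illustration. The `TEMPLATES` registry, its install and build commands, and the job layout are assumptions, and the config is taken to be already parsed into a dict.

```python
# Hypothetical template registry: template name -> commands the platform supplies.
TEMPLATES = {
    "nodejs-service": {"install": "npm ci", "build": "npm run build"},
}


def generate_pipeline(config: dict) -> list[dict]:
    """Expand a parsed .platform.yaml into test, build, and chained deploy jobs."""
    name = config["pipeline"]
    template = TEMPLATES.get(name)
    if template is None:
        raise ValueError(
            f"Unknown pipeline template '{name}'. Available: {sorted(TEMPLATES)}"
        )
    jobs = [
        {"name": "test", "needs": [], "run": [template["install"], config["tests"]]},
        {"name": "build", "needs": ["test"],
         "run": [template["install"], template["build"]]},
    ]
    previous = "build"
    for env in config.get("deploy_to", []):
        job_name = f"deploy-{env}"
        # Illustrative command; the other deploy flags are omitted here.
        jobs.append({"name": job_name, "needs": [previous],
                     "run": [f"platform deploy --env {env}"]})
        previous = job_name  # each environment waits for the one before it
    return jobs


pipeline = generate_pipeline({
    "pipeline": "nodejs-service",
    "tests": "npm test",
    "deploy_to": ["staging", "production"],
})
```

Because each deploy job depends on the previous one, production is only reached after staging succeeds, and the developer never has to encode that ordering by hand.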
Measuring Platform Success
Track platform adoption and impact:
    # Platform metrics
    metrics = {
        # Adoption
        "teams_using_platform": 85,        # % of teams
        "services_on_platform": 120,

        # Velocity
        "avg_deployment_time": "8 min",    # vs 45 min before
        "deployments_per_day": 450,        # vs 120 before

        # Quality
        "incident_mttr": "12 min",         # vs 45 min before
        "deployment_success_rate": 0.98,

        # Developer satisfaction
        "nps_score": 65,                   # Net Promoter Score
        "support_tickets_per_week": 8,     # vs 40 before
    }
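The before/after comments in those metrics are easier to compare as improvement factors. The following sketch computes them from the same numbers. The `COMPARISONS` table and function names are hypothetical, and the durations are written as plain minutes for simplicity.

```python
# (before, after, lower_is_better), taken from the comments in the metrics above.
COMPARISONS = {
    "avg_deployment_minutes": (45, 8, True),
    "deployments_per_day": (120, 450, False),
    "incident_mttr_minutes": (45, 12, True),
    "support_tickets_per_week": (40, 8, True),
}


def improvement(before: float, after: float, lower_is_better: bool = True) -> float:
    """Return how many times better the 'after' value is than the 'before' value."""
    if before <= 0 or after <= 0:
        raise ValueError("before and after must be positive")
    return before / after if lower_is_better else after / before


def improvement_report(comparisons: dict) -> dict:
    """Map each metric name to its improvement factor."""
    return {name: improvement(*values) for name, values in comparisons.items()}


report = improvement_report(COMPARISONS)
```

With these numbers, deployment time improves by a factor of 5.625, deployment frequency and MTTR by 3.75 each, and support tickets by 5.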
Anti-Patterns to Avoid
1. Building Too Early
Don't build a platform until at least three teams are duplicating the same infrastructure work.
2. No Escape Hatches
    # BAD: Can't override defaults
    platform deploy --app my-service

    # GOOD: Advanced users can customize
    platform deploy \
      --app my-service \
      --cpu-limit 2000m \
      --custom-manifest overrides.yaml
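An escape hatch like a custom manifest is often implemented as a deep merge of user overrides onto platform defaults. Here is a minimal sketch of that idea, assuming both sides are already parsed into dicts; `deep_merge` and the manifest keys are illustrative, not the platform's real schema.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied recursively; inputs are not mutated."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists replace the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


platform_defaults = {
    "replicas": 3,
    "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
}
user_overrides = {"resources": {"limits": {"cpu": "2000m"}}}
manifest = deep_merge(platform_defaults, user_overrides)
```

Only the CPU limit changes here. The memory limit and replica count keep their platform defaults, so advanced users customize one setting without giving up the rest of the golden path.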
3. Forcing Migration
Evangelize, don't mandate. Show value first.
4. Ignoring Feedback
Platform teams ARE product teams. Listen to your users (developers).
Real-World Example: Shopify
Shopify's platform team provides:
- Shipit: Deploy any service to Kubernetes
- Railgun: CI/CD pipeline generator
- Vault integration: Automatic secret management
- Dev environments: Spin up full stack in one command
Result:
- 2000+ engineers using platform
- 10,000+ deployments per day
- New service deployed in <30 mins
Getting Started
Phase 1: Identify Pain Points
- Survey developers: what's slowing you down?
- Common answer: deployment, database setup, secrets
Phase 2: Build One Golden Path
- Start small: standardize deployments
- Get 3 teams using it
- Iterate based on feedback
Phase 3: Expand
- Add more golden paths (databases, queues, caches)
- Build developer portal
- Automate toil
Phase 4: Scale
- Self-service everything
- Treat platform as a product
- Measure and optimize
Conclusion
Platform engineering is not about controlling developers—it's about empowering them. The best platforms are invisible: developers ship features without thinking about infrastructure.
Signs you need platform engineering:
- Multiple teams duplicating infrastructure work
- Deployments take hours/days
- Developers spending more than 30% of their time on ops
Benefits:
- 5-10x faster deployments
- Reduced cognitive load
- Consistent best practices
- Happier developers