Production Observability: OpenTelemetry and Distributed Tracing
Implement comprehensive observability with OpenTelemetry: distributed tracing, metrics, and logs in a unified pipeline for production systems.
Logs tell you what happened. Metrics tell you when. Distributed tracing tells you why your system is slow. Here's how to implement production-grade observability with OpenTelemetry.
The Three Pillars of Observability
Modern observability requires all three signals working together:
Logs:
- Events that happened (`User login failed`)
- Good for debugging known issues
- Hard to query at scale

Metrics:
- Aggregated numbers over time (`requests_per_second`)
- Excellent for dashboards and alerts
- Loss of granularity

Traces:
- Request flow across services (API → Auth → Database)
- Essential for distributed systems
- Shows bottlenecks and dependencies
The problem: Most teams use different tools for each (Elasticsearch for logs, Prometheus for metrics, Jaeger for traces). This creates silos.
The solution: OpenTelemetry unifies all three.
What is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral standard for instrumenting applications:
- One SDK for traces, metrics, and logs
- Auto-instrumentation for popular frameworks
- Flexible backends (Jaeger, Tempo, Datadog, etc.)
- CNCF graduated project (production-ready)
Setting Up OpenTelemetry
1. Install the SDK
```typescript
// Node.js/TypeScript example
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'api-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': {
        enabled: false, // Too noisy for most apps
      },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
```
Auto-instrumentation automatically traces:
- HTTP/HTTPS requests
- Database queries (Postgres, MySQL, MongoDB)
- Redis operations
- gRPC calls
- Express/Fastify routes
2. Deploy OpenTelemetry Collector
The collector receives, processes, and exports telemetry data:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  # Add resource attributes
  resource:
    attributes:
      - key: cluster.name
        value: production
        action: upsert

exporters:
  # Traces to Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Metrics to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```
3. Deploy with Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - "--config=/conf/otel-collector-config.yaml"
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            limits:
              memory: "1Gi"
              cpu: "1000m"
            requests:
              memory: "512Mi"
              cpu: "200m"
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
```
Custom Instrumentation
Auto-instrumentation covers 80% of use cases. For the rest, add custom spans:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(userId: string, amount: number) {
  // Create a span for this operation
  return tracer.startActiveSpan('process_payment', async (span) => {
    try {
      // Add attributes for filtering/searching
      span.setAttribute('user.id', userId);
      span.setAttribute('payment.amount', amount);
      span.setAttribute('payment.currency', 'USD');

      // Call Stripe in a child span; await the charge it returns
      const charge = await tracer.startActiveSpan('stripe.create_charge', async (stripeSpan) => {
        try {
          const charge = await stripe.charges.create({
            amount: amount * 100,
            currency: 'usd',
            customer: userId,
          });
          stripeSpan.setAttribute('stripe.charge.id', charge.id);
          return charge;
        } finally {
          stripeSpan.end();
        }
      });

      // Record an event
      span.addEvent('payment_authorized', {
        'charge.id': charge.id,
      });

      // Update database
      await db.query(
        'INSERT INTO payments (user_id, amount, stripe_charge_id) VALUES ($1, $2, $3)',
        [userId, amount, charge.id]
      );

      span.setStatus({ code: SpanStatusCode.OK });
      return { success: true };
    } catch (error) {
      // Record the error
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
Distributed Context Propagation
The magic of distributed tracing is correlating requests across services:
```python
# Python microservice example
from opentelemetry import trace
from opentelemetry.propagate import inject
import requests

tracer = trace.get_tracer(__name__)

def call_downstream_service(user_id: str):
    with tracer.start_as_current_span("call_inventory_service") as span:
        span.set_attribute("user.id", user_id)

        # Inject trace context into HTTP headers
        headers = {}
        inject(headers)

        # This will continue the trace in the downstream service
        response = requests.get(
            'http://inventory-service/check',
            headers=headers,
            params={'user_id': user_id}
        )

        span.set_attribute("inventory.available", response.json()['available'])
        return response.json()
```
How it works:
- Service A creates a trace with TraceID `abc123`
- The TraceID is injected into the HTTP headers (the `traceparent` header)
- Service B extracts the TraceID and continues the trace
- Both spans are linked by the same TraceID
Querying Traces with TraceQL
Grafana Tempo supports TraceQL for powerful trace queries:
```
# Find slow database queries
{ span.db.system = "postgresql" && duration > 1s }

# Find errors in payment service
{ resource.service.name = "payment-service" && status = error }

# Find traces with high latency
{ duration > 5s } | by(resource.service.name)

# Complex: Payment failures with user context
{ resource.service.name = "payment-service" &&
  span.name = "process_payment" &&
  status = error }
| select(span.user.id, span.payment.amount)
```
Real-World Debugging Example
Problem: API response time increased from 200ms → 2s
Traditional approach: Check logs, guess which service is slow
With distributed tracing:
```
# Find slow traces
{ resource.service.name = "api-gateway" && duration > 1s }
```
Trace waterfall shows:
```
api-gateway                [====]                 2000ms
├─ auth-service            [=]                      50ms
├─ user-service            [=]                     100ms
└─ recommendation-svc      [====================] 1800ms
   └─ database             [====================] 1750ms
      └─ SELECT * FROM recommendations WHERE user_id = ?
```
Root cause identified in 30 seconds: Missing database index on user_id.
Sampling Strategies
Tracing everything is expensive. Use intelligent sampling:
```typescript
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  Sampler,
  SamplingDecision,
  SamplingResult,
} from '@opentelemetry/sdk-trace-base';
import { Context, SpanKind, Attributes } from '@opentelemetry/api';

// Sample 10% of traces; child spans follow the parent's decision
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Better: sample based on conditions. Caveat: a head-based sampler decides
// when a span *starts*, so status codes and durations are usually not known
// yet. Use tail-based sampling in the collector for those signals.
class AdaptiveSampler implements Sampler {
  shouldSample(
    _context: Context,
    _traceId: string,
    _spanName: string,
    _spanKind: SpanKind,
    attributes: Attributes
  ): SamplingResult {
    // Always sample spans already carrying an error status code attribute
    if (Number(attributes['http.status_code']) >= 500) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Sample 1% of everything else
    return Math.random() < 0.01
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }

  toString() {
    return 'AdaptiveSampler';
  }
}
```
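A useful property of ratio-based sampling is that the decision can be derived from the trace ID itself, so every service participating in a trace independently reaches the same verdict and you never get half-sampled traces. A simplified sketch of that idea — this mirrors the intent of `TraceIdRatioBasedSampler`, not its exact algorithm:

```typescript
// Deterministic trace-ID ratio sampling: the same trace ID always yields
// the same decision, with no coordination between services.
function shouldSampleTrace(traceId: string, ratio: number): boolean {
  // Treat the lower 8 bytes (16 hex chars) of the 16-byte trace ID as a
  // uniform value in [0, 2^64) and compare it against ratio * 2^64
  const value = BigInt('0x' + traceId.slice(16));
  const bound = BigInt(Math.floor(ratio * 2 ** 32)) * (2n ** 32n);
  return value < bound;
}

const id = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSampleTrace(id, 1.0)); // true: ratio 1.0 samples everything
console.log(shouldSampleTrace(id, 0.0)); // false: ratio 0.0 samples nothing
```

Since trace IDs are generated randomly, roughly `ratio` of all traces fall below the bound, and a downstream service sampling the same ID agrees without ever talking to the upstream one.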
Correlating Logs with Traces
Link logs to traces for full context:
```typescript
import { trace } from '@opentelemetry/api';
import winston from 'winston';

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    // Add trace context to every log; this must run before json() serializes
    winston.format((info) => {
      const span = trace.getActiveSpan();
      if (span) {
        const spanContext = span.spanContext();
        info.trace_id = spanContext.traceId;
        info.span_id = spanContext.spanId;
        info.trace_flags = spanContext.traceFlags;
      }
      return info;
    })(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Now logs include trace IDs
logger.info('Payment processed', { user_id: '123', amount: 99.99 });
// Output: {"level":"info","message":"Payment processed","trace_id":"abc123...","span_id":"def456..."}
```
In Grafana, click a trace → see all related logs automatically.
Cost Optimization
Observability can get expensive. Optimize:
1. Tail-Based Sampling
```yaml
# In the OTel collector
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-successful
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```
2. Attribute Filtering
```yaml
processors:
  attributes:
    actions:
      # Remove high-cardinality attributes
      - key: http.url
        action: delete
      - key: user.email
        action: delete
```
3. Use Compact Exporters
```typescript
// Use OTLP over gRPC (binary protobuf) instead of HTTP/JSON;
// payloads are substantially smaller on the wire
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
```
Alerting on Traces
Create alerts based on trace data:
```yaml
# Prometheus alert from trace-derived span metrics
groups:
  - name: trace_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service_name)
          /
          sum(rate(traces_spanmetrics_calls_total[5m])) by (service_name)
          > 0.05
        annotations:
          summary: "{{ $labels.service_name }} error rate > 5%"
```
Best Practices
- Start with auto-instrumentation - Cover 80% with zero code
- Add custom spans sparingly - Only for business-critical operations
- Use semantic conventions - Standard attribute names for consistency
- Sample intelligently - Capture errors/slow requests, sample the rest
- Correlate signals - Link traces → logs → metrics
- Set SLOs - Use trace data to measure P99 latency
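On the last point, percentile latencies like P99 come straight from span durations. A minimal sketch of the nearest-rank calculation — `durationsMs` here is a hypothetical input standing in for the durations your tracing backend (Tempo, or the spanmetrics connector) would normally aggregate for you:

```typescript
// Compute a latency percentile from a batch of span durations.
// Nearest-rank method: sort ascending, take the value at ceil(p * n) - 1.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) throw new Error('no samples');
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Example: 100 requests, two slow outliers
const samples = Array.from({ length: 98 }, (_, i) => 100 + i); // 100..197 ms
samples.push(1900, 2000);

console.log(percentile(samples, 0.5));  // 149: the median looks healthy
console.log(percentile(samples, 0.99)); // 1900: the tail tells the real story
```

This is exactly why SLOs should target tail percentiles rather than averages: the median hides the users who hit the slow path.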
Conclusion
OpenTelemetry gives you x-ray vision into distributed systems. With traces, you can:
- Debug production issues in minutes (not hours)
- Understand service dependencies
- Optimize slow requests systematically
- Correlate errors across microservices
The upfront investment pays off the first time you debug a production incident.
Need help implementing observability for your platform? Let's talk about your monitoring strategy.