Observability · OpenTelemetry · DevOps · Monitoring

Production Observability: OpenTelemetry and Distributed Tracing

Implement comprehensive observability with OpenTelemetry: distributed tracing, metrics, and logs in a unified pipeline for production systems.

Azynth Team
14 min read


Logs tell you what happened. Metrics tell you when. Distributed tracing tells you why your system is slow. Here's how to implement production-grade observability with OpenTelemetry.

The Three Pillars of Observability

Modern observability requires all three signals working together:

Logs:

  • Discrete events that happened ("User login failed")
  • Good for debugging known issues
  • Hard to query at scale

Metrics:

  • Aggregated numbers over time (requests_per_second)
  • Excellent for dashboards and alerts
  • Lose per-request granularity (aggregation hides individual outliers)

Traces:

  • Request flow across services (API → Auth → Database)
  • Essential for distributed systems
  • Shows bottlenecks and dependencies

The problem: Most teams use different tools for each (Elasticsearch for logs, Prometheus for metrics, Jaeger for traces). This creates silos.

The solution: OpenTelemetry unifies all three.

What is OpenTelemetry?

OpenTelemetry (OTel) is a vendor-neutral standard for instrumenting applications:

  • One SDK for traces, metrics, and logs
  • Auto-instrumentation for popular frameworks
  • Flexible backends (Jaeger, Tempo, Datadog, etc.)
  • CNCF graduated project (production-ready)

Setting Up OpenTelemetry

1. Install the SDK

// Node.js/TypeScript example
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'api-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': {
        enabled: false, // Too noisy for most apps
      },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

Auto-instrumentation automatically traces:

  • HTTP/HTTPS requests
  • Database queries (Postgres, MySQL, MongoDB)
  • Redis operations
  • gRPC calls
  • Express/Fastify routes

2. Deploy OpenTelemetry Collector

The collector receives, processes, and exports telemetry data:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  # Add resource attributes
  resource:
    attributes:
      - key: cluster.name
        value: production
        action: upsert

exporters:
  # Traces to Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Metrics to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

3. Deploy with Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - "--config=/conf/otel-collector-config.yaml"
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            limits:
              memory: "1Gi"
              cpu: "1000m"
            requests:
              memory: "512Mi"
              cpu: "200m"
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318

Custom Instrumentation

Auto-instrumentation covers 80% of use cases. For the rest, add custom spans:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(userId: string, amount: number) {
  // Create a span for this operation
  return tracer.startActiveSpan('process_payment', async (span) => {
    try {
      // Add attributes for filtering/searching
      span.setAttribute('user.id', userId);
      span.setAttribute('payment.amount', amount);
      span.setAttribute('payment.currency', 'USD');

      // Call Stripe in a child span (await the async callback's result)
      const charge = await tracer.startActiveSpan('stripe.create_charge', async (stripeSpan) => {
        const result = await stripe.charges.create({
          amount: amount * 100,
          currency: 'usd',
          customer: userId,
        });
        stripeSpan.setAttribute('stripe.charge.id', result.id);
        stripeSpan.end();
        return result;
      });

      // Record an event
      span.addEvent('payment_authorized', {
        'charge.id': charge.id,
      });

      // Update database
      await db.query(
        'INSERT INTO payments (user_id, amount, stripe_charge_id) VALUES ($1, $2, $3)',
        [userId, amount, charge.id]
      );

      span.setStatus({ code: SpanStatusCode.OK });
      return { success: true };
    } catch (error) {
      // Record the error
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      span.end();
    }
  });
}

Distributed Context Propagation

The magic of distributed tracing is correlating requests across services:

# Python microservice example
from opentelemetry import trace
from opentelemetry.propagate import inject
import requests

tracer = trace.get_tracer(__name__)

def call_downstream_service(user_id: str):
    with tracer.start_as_current_span("call_inventory_service") as span:
        span.set_attribute("user.id", user_id)

        # Inject trace context into HTTP headers
        headers = {}
        inject(headers)

        # This will continue the trace in the downstream service
        response = requests.get(
            'http://inventory-service/check',
            headers=headers,
            params={'user_id': user_id}
        )

        span.set_attribute("inventory.available", response.json()['available'])
        return response.json()

How it works:

  1. Service A creates a trace with TraceID abc123
  2. TraceID injected into HTTP headers (traceparent)
  3. Service B extracts TraceID and continues the trace
  4. Both spans linked by the same TraceID
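
Under the W3C Trace Context standard, that traceparent header is a single string with four dash-separated fields: version, trace ID, parent span ID, and flags. A minimal sketch of parsing one, in plain TypeScript with no OTel dependency (purely illustrative; real services should let the SDK's propagator handle this):

```typescript
// W3C traceparent: version-traceid-parentid-flags
// e.g. "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
interface TraceParent {
  version: string;
  traceId: string;   // 32 hex chars, shared by every span in the trace
  parentId: string;  // 16 hex chars, the span id of the caller
  sampled: boolean;  // low bit of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const m = header.match(/^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // All-zero trace or span ids are invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Service B parses this header, then starts its own spans with the same trace ID and the caller's span ID as parent, which is exactly how the waterfall views later in this post get stitched together.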

Querying Traces with TraceQL

Grafana Tempo supports TraceQL for powerful trace queries:

# Find slow database queries
{ span.db.system = "postgresql" && duration > 1s }

# Find errors in payment service
{ resource.service.name = "payment-service" && status = error }

# Find traces with high latency
{ duration > 5s } | by(resource.service.name)

# Complex: Payment failures with user context
{ resource.service.name = "payment-service" && span.name = "process_payment" && status = error }
| select(span.user.id, span.payment.amount)

Real-World Debugging Example

Problem: API response time increased from 200ms → 2s

Traditional approach: Check logs, guess which service is slow

With distributed tracing:

# Find slow traces
{ resource.service.name = "api-gateway" && duration > 1s }

Trace waterfall shows:

api-gateway               [====] 2000ms
  ├─ auth-service        [=] 50ms
  ├─ user-service        [=] 100ms
  └─ recommendation-svc  [====================] 1800ms
       └─ database       [====================] 1750ms
            └─ SELECT * FROM recommendations WHERE user_id = ?

Root cause identified in 30 seconds: a missing database index on user_id.

Sampling Strategies

Tracing everything is expensive. Use intelligent sampling:

import { TraceIdRatioBasedSampler, ParentBasedSampler, SamplingDecision } from '@opentelemetry/sdk-trace-base';

// Sample 10% of traces
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Better: sample based on conditions.
// Caveat: a head-based sampler runs when a span STARTS, so status codes and
// durations are only visible here if the caller sets them as attributes up
// front. For true latency/error-aware sampling, use tail sampling in the
// collector (see "Tail-Based Sampling" below).
class AdaptiveSampler {
  shouldSample(context: any, traceId: string, spanName: string, spanKind: any, attributes: any) {
    // Always sample errors
    if (attributes['http.status_code'] >= 500) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Always sample requests pre-flagged as slow
    if (attributes['http.duration'] > 1000) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Sample 1% of successful requests
    return Math.random() < 0.01
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }
}

Correlating Logs with Traces

Link logs to traces for full context:

import { trace } from '@opentelemetry/api';
import winston from 'winston';

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    // Add trace context to every log (must run BEFORE json() serializes)
    winston.format((info) => {
      const span = trace.getActiveSpan();
      if (span) {
        const spanContext = span.spanContext();
        info.trace_id = spanContext.traceId;
        info.span_id = spanContext.spanId;
        info.trace_flags = spanContext.traceFlags;
      }
      return info;
    })(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Now logs include trace IDs
logger.info('Payment processed', { user_id: '123', amount: 99.99 });
// Output: {"level":"info","message":"Payment processed","trace_id":"abc123...","span_id":"def456..."}

In Grafana, click a trace → see all related logs automatically.
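
That trace-to-logs jump needs the Loki data source to know where trace IDs live in your log lines. A sketch of provisioning it with a derived field that links each trace_id to Tempo (field names match the JSON log format above; file path and data source UID are placeholders for your setup):

```yaml
# grafana/provisioning/datasources/loki.yaml (illustrative)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          # Pull the trace_id field out of each JSON log line
          matcherRegex: '"trace_id":"(\w+)"'
          # $$ escapes Grafana's env-var expansion in provisioning files
          url: '$${__value.raw}'
          datasourceUid: tempo   # UID of your Tempo data source
```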

Cost Optimization

Observability can get expensive. Optimize:

1. Tail-Based Sampling

# In the OTel collector
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-successful
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

2. Attribute Filtering

processors:
  attributes:
    actions:
      # Remove high-cardinality attributes
      - key: http.url
        action: delete
      - key: user.email
        action: delete

3. Use Compact Exporters

// Use OTLP over gRPC: binary protobuf payloads, substantially smaller than JSON
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

Alerting on Traces

Create alerts based on trace data:

# Prometheus alert from span metrics
groups:
  - name: trace_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service_name)
          /
          sum(rate(traces_spanmetrics_calls_total[5m])) by (service_name)
          > 0.05
        annotations:
          summary: "{{ $labels.service_name }} error rate > 5%"

Best Practices

  1. Start with auto-instrumentation - Cover 80% with zero code
  2. Add custom spans sparingly - Only for business-critical operations
  3. Use semantic conventions - Standard attribute names for consistency
  4. Sample intelligently - Capture errors/slow requests, sample the rest
  5. Correlate signals - Link traces → logs → metrics
  6. Set SLOs - Use trace data to measure P99 latency
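
On the SLO point: P99 latency is just the 99th percentile of observed span durations. A self-contained sketch of the nearest-rank calculation (plain TypeScript, illustrative only; in production you would query this from your span metrics, not compute it by hand):

```typescript
// Nearest-rank percentile over a batch of span durations (ms):
// the smallest value with at least p% of samples at or below it.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) throw new Error('no samples');
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [120, 95, 110, 2400, 130, 105, 98, 101, 99, 115];
console.log(percentile(latencies, 99)); // the one slow outlier dominates P99
```

This is also why averages lie: the mean of those ten samples looks healthy while one in a hundred users waits 2.4 seconds.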

Conclusion

OpenTelemetry gives you x-ray vision into distributed systems. With traces, you can:

  • Debug production issues in minutes (not hours)
  • Understand service dependencies
  • Optimize slow requests systematically
  • Correlate errors across microservices

The upfront investment pays off the first time you debug a production incident.


Need help implementing observability for your platform? Let's talk about your monitoring strategy.
