Production Observability: OpenTelemetry and Distributed Tracing
Implement comprehensive observability with OpenTelemetry: distributed tracing, metrics, and logs in a unified pipeline for production systems.
Logs tell you what happened. Metrics tell you when. Distributed tracing tells you why your system is slow. Here's how to implement production-grade observability with OpenTelemetry.
The Three Pillars of Observability
Modern observability requires all three signals working together:
Logs:
- Events that happened (`User login failed`)
- Good for debugging known issues
- Hard to query at scale

Metrics:
- Aggregated numbers over time (`requests_per_second`)
- Excellent for dashboards and alerts
- Loss of granularity

Traces:
- Request flow across services (API → Auth → Database)
- Essential for distributed systems
- Shows bottlenecks and dependencies
The problem: Most teams use different tools for each (Elasticsearch for logs, Prometheus for metrics, Jaeger for traces). This creates silos.
The solution: OpenTelemetry unifies all three.
What is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral standard for instrumenting applications:
- One SDK for traces, metrics, and logs
- Auto-instrumentation for popular frameworks
- Flexible backends (Jaeger, Tempo, Datadog, etc.)
- CNCF graduated project (production-ready)
Setting Up OpenTelemetry
1. Install the SDK
```typescript
// Node.js/TypeScript example
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'api-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': {
        enabled: false, // Too noisy for most apps
      },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
```
Auto-instrumentation automatically traces:
- HTTP/HTTPS requests
- Database queries (Postgres, MySQL, MongoDB)
- Redis operations
- gRPC calls
- Express/Fastify routes
2. Deploy OpenTelemetry Collector
The collector receives, processes, and exports telemetry data:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  # Add resource attributes
  resource:
    attributes:
      - key: cluster.name
        value: production
        action: upsert

exporters:
  # Traces to Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Metrics to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```
3. Deploy with Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - "--config=/conf/otel-collector-config.yaml"
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            limits:
              memory: "1Gi"
              cpu: "1000m"
            requests:
              memory: "512Mi"
              cpu: "200m"
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
```
Custom Instrumentation
Auto-instrumentation covers 80% of use cases. For the rest, add custom spans:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(userId: string, amount: number) {
  // Create a span for this operation
  return tracer.startActiveSpan('process_payment', async (span) => {
    try {
      // Add attributes for filtering/searching
      span.setAttribute('user.id', userId);
      span.setAttribute('payment.amount', amount);
      span.setAttribute('payment.currency', 'USD');

      // Call Stripe in a child span; await the charge it returns
      const charge = await tracer.startActiveSpan('stripe.create_charge', async (stripeSpan) => {
        try {
          const charge = await stripe.charges.create({
            amount: amount * 100,
            currency: 'usd',
            customer: userId,
          });
          stripeSpan.setAttribute('stripe.charge.id', charge.id);
          return charge;
        } finally {
          stripeSpan.end();
        }
      });

      // Record an event
      span.addEvent('payment_authorized', {
        'charge.id': charge.id,
      });

      // Update database
      await db.query(
        'INSERT INTO payments (user_id, amount, stripe_charge_id) VALUES ($1, $2, $3)',
        [userId, amount, charge.id]
      );

      span.setStatus({ code: SpanStatusCode.OK });
      return { success: true };
    } catch (error) {
      // Record the error
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
Distributed Context Propagation
The magic of distributed tracing is correlating requests across services:
```python
# Python microservice example
from opentelemetry import trace
from opentelemetry.propagate import inject
import requests

tracer = trace.get_tracer(__name__)

def call_downstream_service(user_id: str):
    with tracer.start_as_current_span("call_inventory_service") as span:
        span.set_attribute("user.id", user_id)

        # Inject trace context into HTTP headers
        headers = {}
        inject(headers)

        # This will continue the trace in the downstream service
        response = requests.get(
            'http://inventory-service/check',
            headers=headers,
            params={'user_id': user_id}
        )

        span.set_attribute("inventory.available", response.json()['available'])
        return response.json()
```
How it works:
- Service A creates a trace with TraceID `abc123`
- The TraceID is injected into the HTTP headers (the `traceparent` header)
- Service B extracts the TraceID and continues the trace
- Both spans are linked by the same TraceID
Querying Traces with TraceQL
Grafana Tempo supports TraceQL for powerful trace queries:
```
# Find slow database queries
{ span.db.system = "postgresql" && duration > 1s }

# Find errors in payment service
{ resource.service.name = "payment-service" && status = error }

# Find traces with high latency
{ duration > 5s } | by(resource.service.name)

# Complex: Payment failures with user context
{ resource.service.name = "payment-service" &&
  span.name = "process_payment" &&
  status = error }
| select(span.user.id, span.payment.amount)
```
Real-World Debugging Example
Problem: API response time increased from 200ms → 2s
Traditional approach: Check logs, guess which service is slow
With distributed tracing:
```
# Find slow traces
{ resource.service.name = "api-gateway" && duration > 1s }
```
Trace waterfall shows:
```
api-gateway                [====]                 2000ms
├─ auth-service            [=]                      50ms
├─ user-service            [=]                     100ms
└─ recommendation-svc      [====================] 1800ms
   └─ database             [====================] 1750ms
      └─ SELECT * FROM recommendations WHERE user_id = ?
```
Root cause identified in 30 seconds: Missing database index on user_id.
Sampling Strategies
Tracing everything is expensive. Use intelligent sampling:
```typescript
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  Sampler,
  SamplingDecision,
  SamplingResult,
} from '@opentelemetry/sdk-trace-base';
import { Context, SpanKind, Attributes } from '@opentelemetry/api';

// Sample 10% of traces; child spans follow the parent's decision
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Better: sample based on conditions. Caveat: a head-based sampler decides
// when a span *starts*, so status codes and durations are usually not known
// yet. Use tail-based sampling in the collector for those signals.
class AdaptiveSampler implements Sampler {
  shouldSample(
    _context: Context,
    _traceId: string,
    _spanName: string,
    _spanKind: SpanKind,
    attributes: Attributes
  ): SamplingResult {
    // Always sample spans already carrying an error status code attribute
    if (Number(attributes['http.status_code']) >= 500) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Sample 1% of everything else
    return Math.random() < 0.01
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }

  toString() {
    return 'AdaptiveSampler';
  }
}
```
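A useful property of ratio-based sampling is that the decision can be derived from the trace ID itself, so every service participating in a trace independently reaches the same verdict and you never get half-sampled traces. A simplified sketch of that idea — this mirrors the intent of `TraceIdRatioBasedSampler`, not its exact algorithm:

```typescript
// Deterministic trace-ID ratio sampling: the same trace ID always yields
// the same decision, with no coordination between services.
function shouldSampleTrace(traceId: string, ratio: number): boolean {
  // Treat the lower 8 bytes (16 hex chars) of the 16-byte trace ID as a
  // uniform value in [0, 2^64) and compare it against ratio * 2^64
  const value = BigInt('0x' + traceId.slice(16));
  const bound = BigInt(Math.floor(ratio * 2 ** 32)) * (2n ** 32n);
  return value < bound;
}

const id = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSampleTrace(id, 1.0)); // true: ratio 1.0 samples everything
console.log(shouldSampleTrace(id, 0.0)); // false: ratio 0.0 samples nothing
```

Since trace IDs are generated randomly, roughly `ratio` of all traces fall below the bound, and a downstream service sampling the same ID agrees without ever talking to the upstream one.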
Correlating Logs with Traces
Link logs to traces for full context:
```typescript
import { trace } from '@opentelemetry/api';
import winston from 'winston';

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    // Add trace context to every log; this must run before json() serializes
    winston.format((info) => {
      const span = trace.getActiveSpan();
      if (span) {
        const spanContext = span.spanContext();
        info.trace_id = spanContext.traceId;
        info.span_id = spanContext.spanId;
        info.trace_flags = spanContext.traceFlags;
      }
      return info;
    })(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Now logs include trace IDs
logger.info('Payment processed', { user_id: '123', amount: 99.99 });
// Output: {"level":"info","message":"Payment processed","trace_id":"abc123...","span_id":"def456..."}
```
In Grafana, click a trace → see all related logs automatically.
Cost Optimization
Observability can get expensive. Optimize:
1. Tail-Based Sampling
```yaml
# In the OTel collector
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-successful
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```
2. Attribute Filtering
```yaml
processors:
  attributes:
    actions:
      # Remove high-cardinality attributes
      - key: http.url
        action: delete
      - key: user.email
        action: delete
```
3. Use Compact Exporters
```typescript
// Use OTLP over gRPC (binary protobuf) instead of HTTP/JSON;
// payloads are substantially smaller on the wire
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
```
Alerting on Traces
Create alerts based on trace data:
```yaml
# Prometheus alert from trace-derived span metrics
groups:
  - name: trace_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service_name)
          /
          sum(rate(traces_spanmetrics_calls_total[5m])) by (service_name)
          > 0.05
        annotations:
          summary: "{{ $labels.service_name }} error rate > 5%"
```
Best Practices
- Start with auto-instrumentation - Cover 80% with zero code
- Add custom spans sparingly - Only for business-critical operations
- Use semantic conventions - Standard attribute names for consistency
- Sample intelligently - Capture errors/slow requests, sample the rest
- Correlate signals - Link traces → logs → metrics
- Set SLOs - Use trace data to measure P99 latency
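On the last point, percentile latencies like P99 come straight from span durations. A minimal sketch of the nearest-rank calculation — `durationsMs` here is a hypothetical input standing in for the durations your tracing backend (Tempo, or the spanmetrics connector) would normally aggregate for you:

```typescript
// Compute a latency percentile from a batch of span durations.
// Nearest-rank method: sort ascending, take the value at ceil(p * n) - 1.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) throw new Error('no samples');
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Example: 100 requests, two slow outliers
const samples = Array.from({ length: 98 }, (_, i) => 100 + i); // 100..197 ms
samples.push(1900, 2000);

console.log(percentile(samples, 0.5));  // 149: the median looks healthy
console.log(percentile(samples, 0.99)); // 1900: the tail tells the real story
```

This is exactly why SLOs should target tail percentiles rather than averages: the median hides the users who hit the slow path.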
Conclusion
OpenTelemetry gives you x-ray vision into distributed systems. With traces, you can:
- Debug production issues in minutes (not hours)
- Understand service dependencies
- Optimize slow requests systematically
- Correlate errors across microservices
The upfront investment pays off the first time you debug a production incident.
Need help implementing observability for your platform? Let's talk about your monitoring strategy.