LGTM Stack on Kubernetes

Building Production-Ready Observability: A Deep Dive into the LGTM Stack with OpenTelemetry

Introduction

In today’s cloud-native landscape, observability isn’t just a nice-to-have—it’s essential for understanding system behavior, debugging issues, and ensuring reliability. This blog post explores a complete, production-ready observability platform built on the LGTM Stack (Loki, Grafana, Tempo, Mimir) with OpenTelemetry instrumentation, running on Kubernetes.

Whether you’re a DevOps engineer looking to implement observability, a developer wanting to understand distributed tracing, or a platform architect designing monitoring solutions, this comprehensive guide will walk you through a reference implementation that demonstrates industry best practices.


🎯 Project Overview

What is the LGTM Stack?

The LGTM Stack represents Grafana’s comprehensive observability solution:

  • Loki – Horizontally scalable, multi-tenant log aggregation system
  • Grafana – Feature-rich visualization and dashboarding platform
  • Tempo – High-scale distributed tracing backend
  • Mimir – Long-term storage for Prometheus metrics

Key Features

This implementation provides:

  • Multi-service architecture – Flask frontend → Flask backend → SQLite database
  • Complete telemetry collection – Traces, metrics, and logs
  • Distributed tracing – End-to-end request tracking across service boundaries
  • Log-to-trace correlation – Navigate seamlessly from logs to traces
  • Pre-built dashboards – Production-ready monitoring views
  • OpenTelemetry instrumentation – Standards-based telemetry collection
  • Kubernetes-native – Runs on Kind (Kubernetes in Docker)


🏗️ Architecture Deep Dive

High-Level Architecture

The system is organized into two primary layers:

1. Application Layer

Flask Frontend (2 replicas)

  • Serves as the user-facing web interface
  • Makes HTTP requests to backend services
  • Exports telemetry via OTLP/HTTP to Grafana Alloy (port 4318)
  • Service name: flask-frontend
  • Exposes port 8080 via NodePort (30080)

Flask Backend (1 replica)

  • Provides REST API for data operations
  • Uses SQLite for persistent storage
  • Exports telemetry via OTLP/gRPC to Grafana Alloy (port 4317)
  • Service name: flask-backend
  • Internal ClusterIP service (port 8081)

Key Endpoints:

Frontend:
  GET /              - Home page
  GET /api/users     - Fetch all users (proxies to backend)
  GET /api/users/:id - Fetch user by ID
  GET /api/stats     - Database statistics
  GET /health        - Health check
  GET /ready         - Readiness probe

Backend:
  GET /users         - List all users
  GET /users/:id     - Get user by ID
  POST /users        - Create new user
  DELETE /users/:id  - Delete user
  GET /stats         - Database statistics
  GET /health        - Health check

2. Observability Layer

Grafana Alloy (Unified Collector)

  • Acts as a central telemetry collection point
  • Receives data via OTLP (gRPC: 4317, HTTP: 4318)
  • Scrapes Prometheus metrics from cAdvisor
  • Routes telemetry to appropriate backends
  • Performs batch processing for efficiency

Loki (Log Aggregation)

  • Receives logs from Alloy via Loki push API
  • Stores logs with labels for efficient querying
  • Supports full-text search with LogQL
  • Filesystem-based storage (production should use object storage)

Tempo (Distributed Tracing)

  • Receives traces via OTLP protocol
  • Stores and indexes trace data
  • Supports TraceQL for advanced querying
  • Enables service dependency analysis

Mimir (Metrics Storage)

  • Prometheus-compatible metrics storage
  • Receives metrics via remote write API
  • Supports PromQL queries
  • Provides long-term metric retention

Grafana (Visualization)

  • Pre-configured with datasources (Loki, Tempo, Mimir)
  • Two pre-built dashboards (Overview + Performance)
  • Explore interface for ad-hoc queries
  • Log-to-trace correlation via derived fields

Data Flow Architecture

Metrics Flow

┌─────────────┐
│  Flask App  │
└──────┬──────┘
       │ OTLP
       ▼
┌─────────────┐
│    Alloy    │
└──────┬──────┘
       │ Prometheus Remote Write
       ▼
┌─────────────┐
│    Mimir    │
└──────┬──────┘
       │ PromQL
       ▼
┌─────────────┐
│   Grafana   │
└─────────────┘

Metrics Collected:

  • HTTP request duration (histogram)
  • HTTP request count by status code (counter)
  • Container CPU usage (from cAdvisor)
  • Container memory usage (from cAdvisor)
  • Network I/O statistics
  • Disk I/O statistics

Logs Flow

┌─────────────┐
│  Flask App  │
└──────┬──────┘
       │ OTLP Log Export
       ▼
┌─────────────┐
│    Alloy    │
└──────┬──────┘
       │ Loki Push API
       ▼
┌─────────────┐
│    Loki     │
└──────┬──────┘
       │ LogQL
       ▼
┌─────────────┐
│   Grafana   │
└─────────────┘

Log Structure:

  • Labels: job, level, service_name, trace_id
  • Automatic trace ID extraction for correlation
  • Structured logging with context propagation

Traces Flow

┌─────────────┐
│  Flask App  │
└──────┬──────┘
       │ OTLP Span Export
       ▼
┌─────────────┐
│    Alloy    │
└──────┬──────┘
       │ OTLP/gRPC
       ▼
┌─────────────┐
│    Tempo    │
└──────┬──────┘
       │ TraceQL
       ▼
┌─────────────┐
│   Grafana   │
└─────────────┘

Trace Structure:

Frontend Span (flask-frontend)
  ├─ HTTP GET /api/users
  │   └─ Backend Span (flask-backend)
  │       ├─ HTTP GET /users
  │       └─ SQLite Query Span
  │           └─ SELECT * FROM users
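To make the parent/child structure above concrete without pulling in the OpenTelemetry SDK, here is a toy tracer built from the stdlib alone. It only records name, nesting depth, and duration; a real tracer also propagates IDs and exports spans. The span names mirror the tree above; everything else is illustrative.

```python
import time
from contextlib import contextmanager

spans = []   # (name, depth, duration_ms), appended as each span ends
_stack = []  # current ancestry, mirroring the tree above

@contextmanager
def span(name):
    """Toy span: record name, nesting depth, and wall-clock duration."""
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        _stack.pop()
        spans.append((name, len(_stack), elapsed_ms))

with span("GET /api/users"):                 # frontend span
    with span("GET /users"):                 # backend span
        with span("SELECT * FROM users"):    # SQLite query span
            time.sleep(0.01)

for name, depth, ms in sorted(spans, key=lambda s: s[1]):
    print("  " * depth + f"{name} ({ms:.1f}ms)")
```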

OpenTelemetry Instrumentation

The implementation uses manual instrumentation for complete control:

# Imports (OpenTelemetry SDK, OTLP/gRPC exporters, auto-instrumentation)
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk._logs import LoggerProvider
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlite3 import SQLite3Instrumentor

# `app` (the Flask app) and `otlp_endpoint` are defined elsewhere in the module

# Resource definition
resource = Resource.create({
    "service.name": "flask-backend",
    "service.version": "1.0.0"
})

# Trace provider
trace_provider = TracerProvider(resource=resource)
span_exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
trace_provider.add_span_processor(BatchSpanProcessor(span_exporter))

# Metric provider
metric_reader = PeriodicExportingMetricReader(OTLPMetricExporter())
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])

# Log provider
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(BatchLogRecordProcessor(OTLPLogExporter()))

# Auto-instrumentation for frameworks
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
SQLite3Instrumentor().instrument()

Benefits of Manual Instrumentation:

  • ✅ Complete control over telemetry
  • ✅ Custom span attributes
  • ✅ Granular sampling strategies
  • ✅ Educational value for learning
  • ✅ No operator dependencies

Network Architecture

Port Mappings:

Component        Internal Port   External Port   Protocol
Grafana          3000            3000            HTTP
Flask Frontend   8080            8080            HTTP
Loki             3100            3100            HTTP
Tempo            3200            —               HTTP
Alloy OTLP       4317            —               gRPC
Alloy OTLP       4318            —               HTTP
Mimir            8080            —               HTTP

Service Communication:

  • Frontend → Backend: HTTP (ClusterIP)
  • Apps → Alloy: OTLP (ClusterIP)
  • Alloy → Loki/Tempo/Mimir: Various protocols (ClusterIP)
  • Grafana → All Backends: HTTP/gRPC (ClusterIP)
  • User → Grafana: HTTP (NodePort 30300)
  • User → Frontend: HTTP (NodePort 30080)

💼 Use Cases and Applications

1. Educational and Learning

Scenario: Teams new to observability want hands-on experience

How This Helps:

  • Complete working example of LGTM stack integration
  • Demonstrates OpenTelemetry best practices
  • Shows distributed tracing in action
  • Includes pre-built dashboards to learn from
  • PowerShell automation for easy deployment

Target Audience:

  • Developers learning observability
  • Platform engineers evaluating Grafana stack
  • Students studying distributed systems
  • DevOps teams planning monitoring strategy

2. Development and Testing

Scenario: Development teams need local observability for debugging

How This Helps:

  • Runs entirely on local Kind cluster
  • Minimal resource requirements
  • Easy reset and cleanup
  • Simulates production observability
  • Test monitoring configurations before deployment

Workflow:

# Start development environment
.\scripts\setup-cluster.ps1
.\scripts\deploy.ps1

# Make code changes
# ... edit app/app.py ...

# Rebuild and test
.\scripts\build-app.ps1
kubectl rollout restart deployment/flask-frontend -n app

# Generate test traffic
.\scripts\generate-traffic.ps1

# View results in Grafana
# http://localhost:3000

3. Proof of Concept (PoC)

Scenario: Organizations evaluating Grafana LGTM stack

How This Helps:

  • Production-like architecture
  • Demonstrates key capabilities
  • Shows integration patterns
  • Provides baseline for capacity planning
  • Includes performance testing scripts

Evaluation Points:

  • ✅ Log aggregation and search (Loki)
  • ✅ Distributed tracing (Tempo)
  • ✅ Metrics storage and querying (Mimir)
  • ✅ Unified visualization (Grafana)
  • ✅ OpenTelemetry compatibility
  • ✅ Kubernetes deployment patterns

4. Monitoring Template

Scenario: Teams need a starting point for monitoring infrastructure

How This Helps:

  • Reference implementation for Kubernetes deployments
  • Working Grafana Alloy configuration
  • Dashboard templates (Overview + Performance)
  • RBAC configurations
  • Service discovery patterns

Customization Points:

  • Modify dashboards for specific metrics
  • Add custom endpoints to applications
  • Adjust retention policies
  • Configure alerting rules
  • Scale components based on load

5. Debugging Distributed Systems

Scenario: Production issues require trace analysis

How This Helps:

  • End-to-end request tracing
  • Log-to-trace correlation
  • Service dependency mapping
  • Performance bottleneck identification
  • Error root cause analysis

Example Workflow:

  1. User reports slow API response
  2. Query logs in Grafana for error messages
  3. Click trace ID in log entry
  4. View complete trace showing:
     • Frontend received request (2ms)
     • Backend called (150ms – bottleneck!)
     • Database query executed (145ms – root cause!)
  5. Optimize database query
  6. Verify improvement with metrics dashboard

6. Performance Benchmarking

Scenario: Need to test application under load

How This Helps:

  • Heavy load generation script (10,000 requests)
  • Real-time metrics collection
  • Performance dashboard with p95/p99 percentiles
  • Resource utilization monitoring
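The p95/p99 panels are computed by PromQL's histogram_quantile over cumulative histogram buckets. The interpolation it performs can be sketched in plain Python; this is an illustrative approximation of the idea, not the actual PromQL engine, and the bucket data is made up.

```python
def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile.

    buckets: list of (upper_bound_seconds, cumulative_count),
    sorted by bound and ending with a +Inf bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound
            # Linear interpolation within the bucket, as PromQL does.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative request-duration buckets (seconds, count <= bound)
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (float("inf"), 100)]
print(f"p95 ≈ {histogram_quantile(0.95, buckets):.3f}s")  # → p95 ≈ 0.500s
```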

Load Testing:

# Generate heavy load
.\scripts\generate-heavy-load.ps1

# Monitor in Grafana:
# - Request rates spike
# - Response times increase
# - CPU/Memory usage
# - Error rates
# - Network I/O

7. CI/CD Integration

Scenario: Automated testing needs observability validation

How This Helps:

  • Scriptable deployment
  • Health check endpoints
  • Automated traffic generation
  • Programmatic metric queries

CI/CD Pipeline Example:

test-observability:
  steps:
    - setup-cluster.ps1
    - deploy.ps1
    - test-app.ps1  # Validates all endpoints
    - generate-traffic.ps1
    - validate-metrics.ps1  # Custom script
    - cleanup.ps1 -DeleteCluster

✅ Best Practices Demonstrated

1. OpenTelemetry Standards

✅ Use Semantic Conventions

resource = Resource.create({
    "service.name": "flask-backend",
    "service.version": "1.0.0",
    "deployment.environment": "production"
})

✅ Batch Processing

  • Uses BatchSpanProcessor for traces
  • Uses BatchLogRecordProcessor for logs
  • Reduces network overhead
  • Improves performance

✅ Context Propagation

  • Automatic trace context propagation
  • W3C Trace Context standard
  • Maintains trace across service boundaries

2. Kubernetes Best Practices

✅ Namespace Isolation

observability namespace:  # Monitoring infrastructure
app namespace:            # Application workloads

✅ Health and Readiness Probes

livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /ready
    port: 8080

✅ Resource Requests and Limits

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "500m"

✅ ConfigMap-Based Configuration

  • Alloy configuration in ConfigMap
  • Grafana datasources in ConfigMap
  • Easy updates without image rebuilds

✅ RBAC for Service Discovery

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alloy
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]

3. Observability Best Practices

✅ Log-to-Trace Correlation

derivedFields:
  - datasourceUid: tempo
    matcherRegex: "trace_id=(\\w+)"
    name: TraceID
    url: "$${__value.raw}"

✅ Structured Logging

  • JSON-formatted logs
  • Consistent log levels (INFO, WARNING, ERROR)
  • Include trace IDs in all log entries
  • Use semantic labels
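A minimal stdlib sketch of the bullets above: each record is emitted as one JSON object carrying a level, service name, and trace ID. The field names are illustrative, not the exact schema used by the repo's apps.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object, including a trace_id if present."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "service_name": "flask-backend",  # would come from config in practice
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# trace_id is attached per-record via `extra`; real code would pull it
# from the active span's context.
logger.info("users fetched", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```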

✅ Meaningful Metrics

  • HTTP request duration (histogram for percentiles)
  • Request count by status code
  • Application-specific business metrics
  • Infrastructure metrics (CPU, memory, I/O)

✅ Dashboard Design

  • Overview dashboard for high-level health
  • Detailed performance dashboard
  • Variables for filtering (service selector)
  • Consistent time ranges
  • Logical grouping of panels

4. Development Workflow

✅ Infrastructure as Code

  • All Kubernetes manifests in source control
  • Version-controlled dashboards (JSON)
  • Scripted deployment and cleanup
  • Reproducible environments

✅ Automation Scripts

scripts/
  ├── setup-cluster.ps1      # Cluster creation
  ├── deploy.ps1             # Full stack deployment
  ├── build-app.ps1          # Image building
  ├── generate-traffic.ps1   # Load testing
  ├── test-app.ps1           # Endpoint validation
  └── cleanup.ps1            # Environment cleanup

✅ Documentation

  • README with quick start
  • ARCHITECTURE.md for design details
  • GETTING-STARTED.md for setup guide
  • DEVELOPMENT.md for workflows

5. Scalability Patterns

✅ Horizontal Scaling

replicas: 2  # Frontend can scale horizontally

✅ Service Abstraction

  • ClusterIP services for internal communication
  • NodePort for external access
  • Service discovery via DNS

✅ Separation of Concerns

  • Frontend handles user requests
  • Backend handles data operations
  • Collector handles telemetry routing
  • Backends handle storage

🚀 Production Readiness Checklist

While this implementation is production-like, here’s what you need to add for actual production deployment:

Security 🔒

Current State (Demo):

  • ❌ No authentication between services
  • ❌ No TLS/SSL encryption
  • ❌ Default Grafana credentials (admin/admin)
  • ❌ No network policies
  • ❌ No pod security policies

Production Requirements:

Enable mTLS

# Use service mesh (Istio/Linkerd)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT

Implement Authentication

  • OAuth2/OIDC for Grafana
  • API keys for datasource access
  • Service accounts with least privilege
  • Secret management (Sealed Secrets, Vault)

Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels:
      app: flask-backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: flask-frontend

Pod Security

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

Secrets Management

# Use external secret management
kubectl create secret generic grafana-admin \
  --from-literal=username=admin \
  --from-literal=password=$(openssl rand -base64 32)

Storage 💾

Current State (Demo):

  • ❌ Filesystem storage (local volumes)
  • ❌ No retention policies
  • ❌ EmptyDir volumes (ephemeral)
  • ❌ Single-node storage

Production Requirements:

Object Storage

# Loki configuration
storage_config:
  aws:
    s3: s3://region/bucket-name
    s3forcepathstyle: true

# Tempo configuration
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      region: us-east-1

Retention Policies

# Loki retention
limits_config:
  retention_period: 30d

# Tempo retention
compactor:
  retention: 168h  # 7 days

# Mimir retention
limits:
  max_query_lookback: 8760h  # 1 year

Persistent Volumes

volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 100Gi

Backup Strategy

  • Regular S3 backups
  • Snapshot schedules
  • Disaster recovery testing
  • Cross-region replication

High Availability 🔄

Current State (Demo):

  • ❌ Single-replica backends
  • ❌ No pod disruption budgets
  • ❌ No anti-affinity rules
  • ❌ Single cluster

Production Requirements:

Multiple Replicas

replicas: 3  # Minimum for HA

Pod Disruption Budgets

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: loki-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: loki

Anti-Affinity Rules

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: tempo
      topologyKey: kubernetes.io/hostname

Multi-Zone Deployment

nodeSelector:
  topology.kubernetes.io/zone: us-east-1a

Distributed Components

# Loki in microservices mode
- distributor (3 replicas)
- ingester (3 replicas)
- querier (3 replicas)
- query-frontend (2 replicas)

Monitoring the Monitors 📊

Production Requirements:

Self-Monitoring

# Alloy health
up{job="alloy"}

# Loki ingestion rate
sum(rate(loki_distributor_bytes_received_total[5m]))

# Tempo trace ingestion
sum(rate(tempo_distributor_spans_received_total[5m]))

# Mimir write latency
histogram_quantile(0.99, rate(cortex_request_duration_seconds_bucket[5m]))

Alerting Rules

groups:
  - name: lgtm-stack
    rules:
      - alert: LokiDown
        expr: up{job="loki"} == 0
        for: 5m
        annotations:
          summary: "Loki is down"

      - alert: HighTraceDropRate
        expr: rate(tempo_distributor_spans_dropped_total[5m]) > 100
        for: 5m

Health Endpoints

curl http://loki:3100/ready
curl http://tempo:3200/ready
curl http://mimir:8080/ready
curl http://grafana:3000/api/health

Performance Optimization

Production Requirements:

Resource Limits

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2000m"
    memory: "4Gi"

Horizontal Pod Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    name: flask-frontend
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Caching Strategies


# Query caching in Grafana
[caching]
enabled = true

Sampling Strategies

# Probabilistic head sampling: keep ~10% of traces, decided from the trace ID
sampler = TraceIdRatioBased(0.1)

# Error-aware sampling: a head sampler cannot see the response status, so
# flag error spans and let a tail-based policy (or the backend) keep them
if response.status_code >= 500:
    span.set_attribute("sample.rate", 1.0)
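TraceIdRatioBased makes its keep/drop decision deterministically from the trace ID itself, so every service in a trace reaches the same verdict. Roughly, the SDK compares the low 64 bits of the ID against ratio × 2⁶⁴; the pure-Python sketch below illustrates that idea, not the SDK's exact code.

```python
import random

def sampled(trace_id: int, ratio: float) -> bool:
    """Keep a trace iff its low 64 bits fall below ratio * 2**64 (approx.)."""
    bound = round(ratio * (2**64 - 1))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound

# Simulate 100k random 128-bit trace IDs at a 10% sampling ratio
random.seed(42)
ids = [random.getrandbits(128) for _ in range(100_000)]
kept = sum(sampled(t, 0.1) for t in ids)
print(f"kept {kept}/{len(ids)} ≈ {kept/len(ids):.1%}")  # close to 10%
```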

Compliance and Governance 📜

Production Requirements:

Data Retention Compliance

  • GDPR compliance for log retention
  • Data anonymization
  • Right to deletion implementation

Audit Logging


# Enable Grafana audit logs
[log]
mode = file
level = info

[auth]
audit_log_enabled = true

Access Controls

  • Role-based access control (RBAC)
  • Audit trails for configuration changes
  • Separation of duties

Operational Readiness 🛠️

Production Requirements:

Runbooks and Documentation

  • Incident response procedures
  • Escalation paths
  • Common troubleshooting steps
  • Architecture diagrams

Disaster Recovery Plan

  • Backup and restore procedures
  • Recovery time objectives (RTO)
  • Recovery point objectives (RPO)
  • Regular DR testing

Capacity Planning

Metrics retention: 1 year = X GB
Logs retention: 30 days = Y GB
Traces retention: 7 days = Z GB
Growth rate: 20% per quarter
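The 20% quarterly growth figure compounds to roughly 2.07× per year, which dominates capacity planning quickly. A small sketch projects storage forward; the base figure is a placeholder, just like X/Y/Z in the checklist above.

```python
def project_storage(current_gb: float, quarterly_growth: float, quarters: int) -> float:
    """Compound current usage forward by a per-quarter growth rate."""
    return current_gb * (1 + quarterly_growth) ** quarters

base_gb = 500.0  # hypothetical combined metrics + logs + traces footprint
for q in (4, 8):
    print(f"after {q} quarters: {project_storage(base_gb, 0.20, q):.0f} GB")
# → after 4 quarters: 1037 GB
# → after 8 quarters: 2150 GB
```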

Change Management

  • Version control all configurations
  • Staging environment for testing
  • Rollback procedures
  • Gradual rollouts (canary, blue-green)

🎓 Conclusion

This deep dive into the LGTM Stack with OpenTelemetry has demonstrated:

  1. Complete Architecture – From application instrumentation to data storage and visualization
  2. Data Flows – How metrics, logs, and traces flow through the system
  3. Practical Use Cases – Educational, development, PoC, and production monitoring scenarios
  4. Best Practices – OpenTelemetry standards, Kubernetes patterns, and observability design
  5. Production Checklist – Security, HA, storage, and operational requirements

Key Takeaways

Observability is Multi-Dimensional

  • Metrics tell you what is happening
  • Logs tell you why it happened
  • Traces tell you where it happened
  • Together, they provide complete visibility

Standards Matter

  • OpenTelemetry provides vendor-neutral instrumentation
  • W3C Trace Context enables distributed tracing
  • Prometheus exposition format ensures metrics compatibility

Design for Scale from Day One

  • Separate compute and storage
  • Use distributed architectures
  • Implement proper retention policies
  • Plan for growth

Security is Non-Negotiable

  • mTLS for service communication
  • RBAC for access control
  • Network policies for isolation
  • Secret management for credentials

Automation Enables Agility

  • Infrastructure as Code
  • Automated deployments
  • Scriptable testing
  • Reproducible environments

Resources and References

Repository Structure:

monitoring/
├── app/                    # Flask frontend
├── app-backend/            # Flask backend with SQLite
├── grafana/dashboards/     # Pre-built dashboards
├── k8s/base/               # Kubernetes manifests
├── scripts/                # PowerShell automation
├── ARCHITECTURE.md         # System design reference
├── GETTING-STARTED.md      # Setup guide
├── DEVELOPMENT.md          # Development workflows
└── README.md               # Quick start

Final Thoughts

Observability isn’t just about collecting data—it’s about enabling teams to understand system behavior, debug issues faster, and build more reliable services. The LGTM Stack with OpenTelemetry provides a complete, open-source solution that scales from local development to global production.

This reference implementation demonstrates that building production-grade observability doesn’t require expensive proprietary tools. With the right architecture, proper instrumentation, and adherence to best practices, you can achieve world-class observability using open-source technologies.

Whether you’re just starting your observability journey or looking to modernize existing monitoring infrastructure, the patterns and practices demonstrated here provide a solid foundation for success.

Start small, iterate quickly, and always measure what matters. 🚀

Quick Start Commands

Get started in minutes:

# 1. Create cluster
.\scripts\setup-cluster.ps1

# 2. Deploy stack
.\scripts\deploy.ps1

# 3. Generate traffic
.\scripts\generate-traffic.ps1

# 4. Open Grafana
# http://localhost:3000 (admin/admin)

# 5. Explore!
# - View metrics dashboards
# - Query logs in Explore
# - Trace distributed requests
# - Correlate logs to traces

This comprehensive guide provides everything you need to understand, deploy, and extend a production-ready observability platform. Whether you’re an individual developer learning observability or a platform team deploying at scale, these patterns will serve you well. Happy monitoring! 🎉
