What:
Observability measures a distributed system's internal health based on its external outputs: structured Logs, real-time numeric Metrics, and distributed Traces.
Primary purpose:
Diagnosing root causes of failures, tracing latencies across microservice networks, and keeping system downtime minimal.
Usually used for:
Microservices health checking, cloud cluster monitoring, API gateway profiling, and alert automation.
How should I think about this inside system architectures?
🪵 The Log Microscope
Logs detail exactly what happened inside a single node process at a specific timestamp. Always structure logs as searchable JSON key-values.
📈 The Metric Radar
Low-cost time-series counters. Monitor the four golden signals: Latency, Traffic (QPS), Errors (5xx), and Saturation (CPU/RAM).
🕸️ Distributed Request Tracing
Trace request flows across servers. Gateways generate a `trace_id` header propagated to all downstream RPC calls to group child `spans` together.