Service Mesh Metrics and Monitoring
Effective monitoring of a service mesh combines metrics collection, visualization, and distributed tracing to track the health of both the control plane and the applications within the mesh.^[README.md]
Core Components
Metrics Collection
Prometheus serves as the foundational monitoring system and time series database for the mesh^[README.md]. It records metrics that track the health of Istio and of the applications within the mesh, which can then be visualized using tools like Grafana and Kiali^[README.md].
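As a minimal sketch of how the recorded metrics might be read back, the snippet below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and parses the JSON response. The base URL `http://localhost:9090`, the PromQL expression, and the helper names are assumptions for illustration and do not come from the README; `istio_requests_total` is one of Istio's standard metrics.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# Assumed PromQL: per-destination request rate over the last 5 minutes,
# based on the standard Istio metric istio_requests_total.
REQUEST_RATE_QUERY = "sum(rate(istio_requests_total[5m])) by (destination_service)"


def build_query_url(base_url: str, promql: str) -> str:
    """Return an instant-query URL for the Prometheus /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_request_rates(payload: str) -> Dict[str, float]:
    """Map each destination_service label to its value in a vector result."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    rates: Dict[str, float] = {}
    for sample in data["result"]:
        service = sample["metric"].get("destination_service", "unknown")
        # Each sample value is a [unix_timestamp, "string_value"] pair.
        rates[service] = float(sample["value"][1])
    return rates


if __name__ == "__main__":
    # Hypothetical address, e.g. a locally port-forwarded Prometheus.
    print(build_query_url("http://localhost:9090", REQUEST_RATE_QUERY))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `parse_request_rates` would yield one rate per destination service, assuming Prometheus is reachable at that address.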
For environments using the Prometheus Operator, integration is supported via custom resources^[README.md]. This requires metrics merging to be enabled^[README.md#L103-L104]. A sample ServiceMonitor is available to monitor the Istio control plane, and a sample PodMonitor is provided to scrape metrics from the Envoy proxies^[README.md#L100-L102].
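To give a rough idea of what such a custom resource looks like, the sketch below assembles a skeletal PodMonitor as a Python dictionary and prints it as JSON, which `kubectl apply -f -` also accepts. This is not the sample shipped with the README: the resource name, namespace, scrape interval, and relabeling rule are assumptions, and the address and port relabeling a working configuration would likely need are omitted. The provided sample remains the authoritative version.

```python
import json
from typing import Any, Dict


def envoy_pod_monitor(
    name: str = "envoy-stats-monitor",  # hypothetical resource name
    namespace: str = "istio-system",    # assumed install namespace
) -> Dict[str, Any]:
    """Build a skeletal PodMonitor that targets Envoy sidecar containers."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PodMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Look for pods in every namespace; an empty selector matches all pods.
            "namespaceSelector": {"any": True},
            "selector": {},
            "podMetricsEndpoints": [
                {
                    # Path on which the Envoy sidecar exposes Prometheus metrics.
                    "path": "/stats/prometheus",
                    "interval": "15s",  # assumed scrape interval
                    "relabelings": [
                        {
                            # Keep only targets whose container is the sidecar.
                            "action": "keep",
                            "sourceLabels": ["__meta_kubernetes_pod_container_name"],
                            "regex": "istio-proxy",
                        }
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(envoy_pod_monitor(), indent=2))
```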
Visualization and Dashboards
Grafana is used to configure dashboards and monitor the health of the mesh^[README.md]. The pre-configured dashboards provided in the samples include:
- Mesh Dashboard: An overview of all services^[README.md].
- Service Dashboard: Detailed metrics breakdown for a specific service^[README.md].
- Workload Dashboard: Detailed metrics breakdown for a specific workload^[README.md].
- Performance Dashboard: Monitors mesh-wide resource usage^[README.md].
- Control Plane Dashboard: Monitors the health and performance of the control plane^[README.md].
- WASM Extension Dashboard: Overview of WebAssembly extension runtime and loading state^[README.md].
Kiali acts as an observability console, inferring the topology of the mesh to help users understand its structure and health^[README.md]. It provides detailed metrics and offers basic integration with Grafana for advanced queries^[README.md].
Distributed Tracing
Distributed tracing is used to troubleshoot latency problems and monitor transactions in complex distributed systems^[README.md].
- Jaeger: An open-source, end-to-end distributed tracing system used for root cause analysis, service dependency analysis, and latency optimization^[README.md].
- Zipkin: An alternative to Jaeger that helps gather timing data for troubleshooting latency^[README.md]. It can be deployed as a replacement for Jaeger if the default configuration is adjusted^[README.md].
Related Concepts
- Prometheus
- Grafana
- [[Kiali]]
- [[Jaeger]]
Sources
^[README.md]