This guide describes best practices for adding observability to operators.
In this document, we provide best practices and examples for creating metrics, recording rules and alerts. It is based on the general guidelines in Operator Capability Levels.
Note: For technical documentation of how to add metrics to your operator, please read the Metrics section of the Kubebuilder documentation.
Operator Observability Recommended Components

- Metrics - for monitoring the operator and its operands
- Recording rules - for pre-computing frequently needed or expensive expressions
- Alerts - for notifying users when something goes wrong; each alert should include a runbook_url annotation and an alert runbook that describes it. See additional details below.

Additional components would be Dashboards, Logs and Traces, which are not addressed in this document at this point.
Important: It is highly recommended to separate your monitoring code from your core operator code.
We recommend creating a dedicated /monitoring subfolder that includes all the code for the Operator Observability Recommended Components outlined above. For example, in the memcached-operator.
In your core operator code, only call the functions that update the metric values from the desired locations. For example, in the memcached-operator.
All operators start small. This separation makes it easier for you, as a developer, to maintain both your operator core code and the monitoring code, and helps other stakeholders understand your monitoring code better.
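A minimal sketch of this layout, assuming a Go operator built with controller-runtime (the metric and function names below are illustrative, not taken from the memcached-operator itself):

```go
// monitoring/metrics.go (hypothetical path)
package monitoring

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// memcachedDeploymentSizeUndesired counts how often the deployment size
// differed from the desired size.
var memcachedDeploymentSizeUndesired = prometheus.NewCounter(
	prometheus.CounterOpts{
		Name: "memcached_operator_deployment_size_undesired_count_total",
		Help: "Total number of times the deployment size was not as desired.",
	},
)

// RegisterMetrics adds the operator metrics to the controller-runtime registry,
// which is exposed on the manager's /metrics endpoint.
func RegisterMetrics() {
	metrics.Registry.MustRegister(memcachedDeploymentSizeUndesired)
}

// IncDeploymentSizeUndesired is the only function the core operator code calls.
func IncDeploymentSizeUndesired() {
	memcachedDeploymentSizeUndesired.Inc()
}
```

The core code then only calls monitoring.RegisterMetrics() once at startup and monitoring.IncDeploymentSizeUndesired() from the reconciler where relevant.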
Metrics Guidelines

Metrics Naming

Kubernetes components emit metrics in Prometheus format. This format is structured plain text, designed so that people and machines can both read it.
Your operator users should get the same experience when searching for a metric across Kubernetes operators, resources and custom resources. Operator metric names should therefore be composed of the operator name prefix + the sub-operator name or entity + the metric name, based on the Prometheus naming conventions. For example, in the memcached-operator.

Note: In Prometheus Node Exporter, metrics are separated like this:

node_network_receive_bytes_total
node_network_transmit_bytes_total

In this example, the metrics are separated based on receive and transmit. Please follow the same principle and don't put similar metric details into labels, so the user experience stays fluent.

Example for this in an operator:
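For instance, a hypothetical memcached-operator sketch (the metric names are illustrative) would expose two separately named metrics rather than one metric with a direction label:

```go
package monitoring

import "github.com/prometheus/client_golang/prometheus"

// Preferred: separate metrics, named <operator>_<entity>_<metric>.
var (
	memcachedReceiveBytes = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "memcached_operator_memcached_receive_bytes_total",
		Help: "Total number of bytes received by the managed memcached instances.",
	})
	memcachedTransmitBytes = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "memcached_operator_memcached_transmit_bytes_total",
		Help: "Total number of bytes transmitted by the managed memcached instances.",
	})
)

// Avoid: a single metric that hides the same detail in a label, e.g.
//   memcached_operator_memcached_bytes_total{direction="receive|transmit"}
```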
Metrics Types

Prometheus has four metric types: Gauge, Counter, Histogram and Summary. You can read more about the different types in Understanding metrics types.

- Counter - the value can only increase or reset.
- Gauge - the value can be increased and decreased as needed.

The _total suffix should be used for an accumulating count. If your metric has labels with high cardinality, like pod or container, it usually means that you can aggregate it further, so it will not require the _total suffix.
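For example (a sketch with illustrative metric names), an accumulating count is a Counter with the _total suffix, while a value that rises and falls is a Gauge without it:

```go
package monitoring

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counter: only increases (or resets on restart), hence the _total suffix.
	reconcilesTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "memcached_operator_reconciles_total",
		Help: "Total number of reconcile loops executed by the operator.",
	})

	// Gauge: the current number of managed instances can go up and down,
	// so it does not take the _total suffix.
	managedInstances = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "memcached_operator_managed_instances",
		Help: "Current number of memcached instances managed by the operator.",
	})
)
```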
Metrics Labels

Prometheus labels are used to differentiate the characteristics of the thing that is being measured.

- Additional labels can be added, like the pod name, but try to keep them to a minimum.
- When reporting a metric for a pod or a container, make sure that the namespace is included, in order to be able to uniquely identify it (see the sketch below).
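A minimal sketch of this (illustrative metric and label names), where the namespace label accompanies the pod label:

```go
package monitoring

import "github.com/prometheus/client_golang/prometheus"

// Per-pod metric: "namespace" is included so that each pod can be uniquely identified.
var memcachedPodRestarts = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "memcached_operator_memcached_pod_restarts_total",
		Help: "Total number of restarts of the managed memcached pods.",
	},
	[]string{"namespace", "pod"},
)

// Update example, e.g. from a reconcile loop:
//   memcachedPodRestarts.WithLabelValues(pod.Namespace, pod.Name).Inc()
```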
Metrics Help message

Your operator metrics help message should include the following details:
The Help message can be used to create auto-generated documentation, as is done in KubeVirt with the KubeVirt metrics doc generator.
We recommend auto-generating the metrics documentation and saving it in your operator repository, in a location like /docs/monitoring/, so that users can easily find information about your operator metrics.
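A minimal doc-generation sketch, assuming the metrics are registered in the controller-runtime registry (the output format and target file are illustrative; this is not the KubeVirt generator itself):

```go
package main

import (
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Prints a simple "name: help" listing that can be redirected into a file
// such as docs/monitoring/metrics.md.
func main() {
	families, err := metrics.Registry.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range families {
		fmt.Printf("- %s: %s\n", mf.GetName(), mf.GetHelp())
	}
}
```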
See the Alerts, Metrics and Recording Rules Tests section for metrics testing recommendations.
Prometheus Recording Rules Naming

As per the Prometheus documentation, recording rules allow you to pre-compute frequently needed or computationally expensive expressions and save their result as a new set of time series.
Note: The Prometheus recording rules appear in the Prometheus UI as metrics. In order to easily identify your operator recording rules, their names should usually follow the same naming guidelines as the metrics.
See the Alerts, Metrics and Recording Rules Tests section for recording rules testing recommendations.
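A minimal sketch of a recording rule that keeps the operator's metric-style naming, using the prometheus-operator Go API (the rule name and expression are hypothetical, and field details may vary between prometheus-operator versions):

```go
package monitoring

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// RecordingRules returns the operator's recording rules, typically embedded in
// a PrometheusRule object that the operator creates.
func RecordingRules() []monitoringv1.Rule {
	return []monitoringv1.Rule{
		{
			// Named like a metric: operator prefix + entity + metric name.
			Record: "memcached_operator_memcached_pods_ready_ratio",
			Expr: intstr.FromString(
				`sum(kube_pod_status_ready{condition="true", pod=~"memcached-.*"}) / sum(kube_pod_status_ready{pod=~"memcached-.*"})`,
			),
		},
	}
}
```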
Prometheus Alerts Guidelines

Clear and actionable alerts are a key component of a smooth operational experience and result in a better experience for the end users.
The following guidance aims to align alert naming, severities, labels, etc., in order to avoid alert fatigue for administrators.
Recommended Reading

A list of references on good alerting practices:
Individual operator authors are responsible for writing and maintaining alerting rules for their components, i.e. their operators and operands.
Operator authors should also take into consideration how their components interact with existing monitoring and alerting.
As an example, if your operator deploys a service which creates one or more PersistentVolume resources, and these volumes are expected to be mostly full as part of normal operation, it's likely that this will cause unnecessary KubePersistentVolumeFillingUp alerts to fire. You should work to find a solution to avoid triggering these alerts if they are not actionable.
Alerts Style Guide

- Alert names must be CamelCase, for example: PrometheusRuleFailures.
- Alert names should be prefixed with the name of the component they relate to, for example: AlertmanagerFailedReload. There may be exceptions for broadly scoped alerts, for example: TargetDown.
- Alerts must include a severity label indicating the alert's urgency. Valid severities are critical, warning, or info; see below for guidelines on writing alerts of each severity.
- Alerts must include summary and description annotations. Think of the summary as the first line of a commit message, or an email subject line: it should be brief but informative. The description is the longer, more detailed explanation of the alert.
- Alerts should include a namespace label indicating the source of the alert.
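A minimal sketch of an alert rule that carries these labels and annotations, expressed with the prometheus-operator Go API (the alert name, expression and namespace are hypothetical; field details may vary between prometheus-operator versions):

```go
package monitoring

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// MemcachedOperatorDown follows the style guide: a CamelCase name prefixed with
// the component, a severity label, a namespace label, and summary/description
// annotations. A "for" duration would normally be set as well; its Go type
// differs between prometheus-operator versions, so it is omitted here.
var MemcachedOperatorDown = monitoringv1.Rule{
	Alert: "MemcachedOperatorDown",
	Expr:  intstr.FromString(`absent(up{job="memcached-operator-controller-manager"} == 1)`),
	Labels: map[string]string{
		"severity":  "warning",
		"namespace": "memcached-operator-system",
	},
	Annotations: map[string]string{
		"summary":     "The memcached operator is not running.",
		"description": "No memcached operator pod has reported metrics recently, so memcached custom resources are not being reconciled.",
	},
}
```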
Optional Alerts Labels and Annotations
- Alerts may include a priority label indicating the alert's level of importance and the order in which it should be fixed. Valid priorities are high, medium, or low. The higher the priority, the sooner the alert should be resolved. If an alert doesn't include a priority label, we can assume it is a medium priority alert. This label will usually be used for alerts with warning severity, to indicate the order in which alerts should be addressed, even though they don't require immediate action.
- Alerts may include a runbook_url annotation, which is a link to an alert runbook intended to guide a cluster owner and/or operator through the steps of fixing the problems on clusters that are surfaced by alerts. Runbooks can be saved in your operator repository, in a location like /docs/monitoring/runbooks/, for example at OpenShift Runbooks if your operator is shipped with OpenShift, or at another location that fits your operator.
- Alerts may include a kubernetes_operator_part_of label indicating the operator name. The label name is based on the Kubernetes Recommended Labels.

Critical Alerts

For alerting current and impending disaster situations. These alerts page an SRE. The situation should warrant waking someone in the middle of the night.
Timeline: ~5 minutes.
Reserve critical level alerts only for reporting conditions that may lead to loss of data or inability to deliver service for the cluster as a whole.
Failures of most individual components should not trigger critical level alerts, unless they would result in either of those conditions.
Configure critical level alerts so they fire before the situation becomes irrecoverable.
Expect users to be notified of a critical alert within a short period of time after it fires so they can respond with corrective action quickly.
Example critical alert: KubeAPIDown
```yaml
- alert: KubeAPIDown
  annotations:
    summary: Target disappeared from Prometheus target discovery.
    description: KubeAPI has disappeared from Prometheus target discovery.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeAPIDown.md
  expr: |
    absent(up{job="apiserver"} == 1)
  for: 15m
  labels:
    severity: critical
```
This alert fires if no Kubernetes API server instance has reported metrics successfully in the last 15 minutes.
This is a clear example of a critical control-plane issue that represents a threat to the operability of the cluster as a whole, and likely warrants paging someone.
The alert has clear summary and description annotations, and it links to a runbook with information on investigating and resolving the issue.
The group of critical alerts should be small, very well defined, highly documented, polished and with a high bar set for entry.
Warning Alerts

The vast majority of alerts should use this severity.
Issues at the warning level should be addressed in a timely manner, but don’t pose an immediate threat to the operation of the cluster as a whole.
Timeline: ~60 minutes
If your alert does not meet the criteria in “Critical Alerts” above, it belongs to the warning level or lower.
Use warning level alerts for reporting conditions that may lead to inability to deliver individual features of the cluster, but not service for the cluster as a whole. Most alerts are likely to be warnings.
Configure warning level alerts so that they do not fire until components have sufficient time to try to recover from the interruption automatically.
Expect users to be notified of a warning, but for them not to respond with corrective action immediately.
Example warning alert: ClusterNotUpgradeable
```yaml
- alert: ClusterNotUpgradeable
  annotations:
    summary: One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
    description: In most cases, you will still be able to apply patch releases.
      Reason {{ "{{ with $cluster_operator_conditions := \"cluster_operator_conditions\" | query}}{{range $value := .}}{{if and (eq (label \"name\" $value) \"version\") (eq (label \"condition\" $value) \"Upgradeable\") (eq (label \"endpoint\" $value) \"metrics\") (eq (value $value) 0.0) (ne (len (label \"reason\" $value)) 0) }}{{label \"reason\" $value}}.{{end}}{{end}}{{end}}"}}
      For more information refer to 'oc adm upgrade'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
  expr: |
    max by (name, condition, endpoint) (cluster_operator_conditions{name="version", condition="Upgradeable", endpoint="metrics"} == 0)
  for: 60m
  labels:
    severity: warning
```
This alert fires if one or more operators have not reported their Upgradeable condition as true for more than an hour.
The alert has a clear name and informative summary and description annotations.
The timeline is appropriate for allowing the operator a chance to resolve the issue automatically, avoiding the need to alert an administrator.
Info Alerts

Info level alerts represent situations an administrator should be aware of, but that don't necessarily require any action.
Use these sparingly, and consider instead reporting this information via Kubernetes events (see the sketch below).
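A minimal sketch of reporting such a situation as a Kubernetes event instead of an info alert, assuming a controller-runtime based operator (the struct, reason and message are illustrative):

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// MemcachedReconciler holds an EventRecorder, typically obtained in main.go via
// mgr.GetEventRecorderFor("memcached-operator") and passed to the reconciler.
type MemcachedReconciler struct {
	Recorder record.EventRecorder
}

// reportOOMKilled records an event on the affected pod instead of firing an
// info-level alert.
func (r *MemcachedReconciler) reportOOMKilled(pod *corev1.Pod) {
	r.Recorder.Event(pod, corev1.EventTypeWarning, "OOMKilled",
		"Container was terminated due to an out-of-memory condition.")
}
```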
Example info alert: MultipleContainersOOMKilled
```yaml
- alert: MultipleContainersOOMKilled
  annotations:
    description: Multiple containers were out of memory killed within the past
      15 minutes. There are many potential causes of OOM errors, however issues
      on a specific node or containers breaching their limits is common.
    summary: Containers are being killed due to OOM
  expr: sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m]))
    and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5
  for: 15m
  labels:
    namespace: kube-system
    severity: info
```
This alert fires if multiple containers have been terminated due to out of memory conditions in the last 15 minutes.
This is something the administrator should be aware of, but may not require immediate action.
Alerts, Metrics and Recording Rules Tests

We recommend adding tests for your monitoring code, for example tests that verify that each alert's runbook_url link is valid and that every metric reported for a pod or a container also includes the namespace. A sketch of such a test is shown below.
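A minimal test sketch for the runbook_url check, assuming the operator exposes its alert rules as monitoringv1.Rule values through a hypothetical AlertRules() helper in the monitoring package:

```go
package monitoring

import (
	"net/http"
	"testing"
)

// TestAlertsHaveValidRunbookURLs verifies that every alert defines a
// runbook_url annotation and that the link is reachable.
func TestAlertsHaveValidRunbookURLs(t *testing.T) {
	for _, rule := range AlertRules() { // AlertRules() is a hypothetical helper returning the operator's rules
		if rule.Alert == "" {
			continue // skip recording rules
		}
		url, ok := rule.Annotations["runbook_url"]
		if !ok || url == "" {
			t.Errorf("alert %q is missing the runbook_url annotation", rule.Alert)
			continue
		}
		resp, err := http.Get(url)
		if err != nil {
			t.Errorf("alert %q: runbook_url %q is not reachable: %v", rule.Alert, url, err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			t.Errorf("alert %q: runbook_url %q returned status %d", rule.Alert, url, resp.StatusCode)
		}
	}
}
```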