This page contains administrator information about the Toolforge Prometheus setup and how to manage it.
Setup
There should be a couple of VMs with the puppet role role::wmcs::toolforge::prometheus. The VM instances should be big enough to hold all the metrics we collect (usually more than 100GB of data).
All the configuration (and metrics data) is stored in /srv/prometheus/tools/.
Among other things, this role sets up a systemd service, which is the main thing running on these servers:
    aborrero@tools-prometheus-03:~$ sudo systemctl status prometheus@tools.service
    ● prometheus@tools.service - prometheus server (instance tools)
       Loaded: loaded (/lib/systemd/system/prometheus@tools.service; static; vendor preset: enabled)
       Active: active (running) since Thu 2020-02-13 18:15:02 UTC; 15h ago
     Main PID: 1517 (prometheus)
        Tasks: 21 (limit: 4915)
       Memory: 9.4G
       CGroup: /system.slice/system-prometheus.slice/prometheus@tools.service
               └─1517 /usr/bin/prometheus --storage.tsdb.path /srv/prometheus/tools/metrics --web.listen-addr [..]
Query targets are defined in profile::toolforge::prometheus, in a very long inline YAML block.
NOTE: there is no relationship between these Prometheus servers and the cloudmetrics systems. They collect different metrics and use different Grafana instances to display them.
Redundancy
The HA approach is active/cold-standby. Both VMs collect exactly the same metrics, so there is no need for any specific sync between the VMs for data redundancy.
There is a web proxy with the name prometheus.svc.toolforge.org, created using Horizon, pointing to the active VM. This proxy can be used to inspect the status of Prometheus. Useful links:
Note that this URL is also what our Grafana setup uses to reach this Prometheus instance as a data source.
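As an illustration of inspecting Prometheus status through the proxy, the following is a minimal Python sketch that summarizes scrape target health using the standard Prometheus HTTP API endpoint /api/v1/targets. It assumes the proxy is reachable over HTTPS and exposes the API directly under the base URL; the real deployment may use a path prefix, in which case BASE_URL needs adjusting. The function names are illustrative only.

```python
import json
import urllib.request

# Assumption: the proxy serves the Prometheus HTTP API under this base URL.
# If the instance is published under a path prefix, append it here.
BASE_URL = "https://prometheus.svc.toolforge.org"


def fetch_targets(base_url=BASE_URL, timeout=10):
    """Fetch the parsed JSON response of GET /api/v1/targets."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Count active scrape targets by health state ("up", "down", "unknown")."""
    counts = {}
    for target in payload["data"]["activeTargets"]:
        health = target.get("health", "unknown")
        counts[health] = counts.get(health, 0) + 1
    return counts
```

Calling summarize_targets(fetch_targets()) from a host that can reach the proxy would return a small dictionary of target counts per health state, which is a quick way to spot a large number of failing scrapes.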
Hiera
Some global Hiera keys are needed for Prometheus to be able to query for metrics:

    prometheus_nodes:
      - tools-prometheus-03.tools.eqiad1.wikimedia.cloud
      - tools-prometheus-04.tools.eqiad1.wikimedia.cloud
This key is used to set up host firewall rules and other ACL mechanisms so that the listed nodes are allowed to query for metrics.
Alerts
The Toolforge Prometheus hosts automatically provision their alert rules from the cloud/toolforge/alerts GitLab repository.
Failover
Since all Prometheus VMs collect all the metrics, there is no need to do any specific sync before doing a failover. The data should already be there on the backup/standby server.
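Before pointing the proxy at the standby, it can be reassuring to confirm that both servers really see the same series. The sketch below is one possible check, not an established procedure: it runs the same instant query (for example, up) against both servers through the standard /api/v1/query endpoint and reports label sets present on the active server but missing on the standby. The base URLs of the two servers are placeholders to be supplied by the operator, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def instant_query(base_url, expr, timeout=10):
    """Run an instant query via GET /api/v1/query and return the parsed JSON."""
    query = urllib.parse.urlencode({"query": expr})
    url = f"{base_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def series_keys(payload):
    """Return the set of label sets found in an instant-vector response."""
    return {
        tuple(sorted(sample["metric"].items()))
        for sample in payload["data"]["result"]
    }


def missing_on_standby(active_payload, standby_payload):
    """Label sets returned by the active server but not by the standby."""
    return series_keys(active_payload) - series_keys(standby_payload)
```

An empty result from missing_on_standby(instant_query(active_url, "up"), instant_query(standby_url, "up")) suggests the standby is scraping the same targets; a non-empty result lists the series worth investigating first.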
In case you want to migrate metrics data, for example from an old VM to a new one, there are a couple of caveats: