Monitoring services¶
Monitoring is the art of knowing when something fails, and of gathering as much information as possible to solve the issue.
We use Prometheus as our metrics monitoring backend and Grafana for the dashboards. We use Elasticsearch to store logs and Kibana to search through them.
We use a separate machine for monitoring (called, surprisingly, monitoring) to isolate it from the core services: we don’t want the monitoring workload to impact other services, and vice versa.
Prometheus¶
Prometheus is a monitoring system that stores a backlog of metrics as time series in its database. Prometheus runs on the monitoring machine as prometheus.service.
All the machines in the infrastructure run prometheus-node-exporter.service, which exposes system metrics that Prometheus scrapes periodically.
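To check which machines Prometheus currently fails to scrape, you can ask its HTTP API for the built-in up metric. The snippet below is a minimal sketch, not part of SADM: it assumes Prometheus listens on its default port 9090 on the monitoring machine, and the function names are ours.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Prometheus listens on its default port (9090) on the monitoring machine.
PROMETHEUS_URL = "http://monitoring:9090"


def query_url(expr, base=PROMETHEUS_URL):
    """Build the URL of an instant query against the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def down_targets(response):
    """Return the sorted instances whose `up` sample is 0 in a query response."""
    return sorted(
        sample["metric"].get("instance", "?")
        for sample in response["data"]["result"]
        if sample["value"][1] == "0"  # sample values are returned as strings
    )


def fetch_down_targets():
    """Ask Prometheus which scrape targets are currently unreachable."""
    with urlopen(query_url("up"), timeout=5) as resp:
        return down_targets(json.load(resp))
```

Calling fetch_down_targets() from a machine that can reach Prometheus should return an empty list when every exporter answers.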
Most SADM services come with built-in monitoring and will be monitored as soon as Prometheus is started.
The following endpoints are available for Prometheus to fetch metrics:
http://udb/metrics
http://mdb/metrics
http://concours/metrics
http://masternode:9021
http://presencesync:9030
hfs: each hfs exports its metrics on http://hfsx:9030
workernode: each workernode exports its metrics on http://MACHINE:9020
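When a service looks wrong on a dashboard, it can help to read one of the endpoints above by hand. The following sketch fetches an endpoint and parses the Prometheus text exposition format into a dictionary. It is deliberately simplistic: it ignores timestamps and assumes label values contain no spaces. The helper names are hypothetical.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text exposition format into {series: value}.

    Minimal on purpose: skips comment lines, ignores optional timestamps,
    and assumes label values contain no spaces.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        series, value = line.split()[:2]
        samples[series] = float(value)
    return samples


def scrape(url):
    """Fetch a metrics endpoint and return its parsed samples."""
    with urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For instance, scrape("http://mdb/metrics") returns every sample mdb currently exposes, keyed by metric name and labels.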
TODO: add more information on how to use alerting.
Grafana¶
Grafana is a web service available at http://grafana/ that allows you to visualize the information stored in Prometheus.
To access it, the username is admin and the password is the value of grafana_admin_password in your Ansible inventory. It is also possible to allow guest user access, so that contestants will also be able to see the state of the infrastructure.
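The same credentials also work against the Grafana HTTP API, which is handy to check that the dashboards were installed. This is a minimal sketch assuming basic authentication is left enabled (the Grafana default). The function names are ours, and the password is whatever you set as grafana_admin_password.

```python
import base64
import json
from urllib.request import Request, urlopen

GRAFANA_URL = "http://grafana"


def search_request(password, user="admin", base=GRAFANA_URL):
    """Build an authenticated request listing dashboards via /api/search."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"{base}/api/search?type=dash-db",
        headers={"Authorization": f"Basic {token}"},
    )


def dashboard_titles(password):
    """Return the titles of all dashboards visible to the admin user."""
    with urlopen(search_request(password), timeout=5) as resp:
        return [item["title"] for item in json.load(resp)]
```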
Some built-in dashboards are automatically added to Grafana during the installation, and show the current state of the machines and main services.
Monitoring screen how-to¶
We like to have a giant monitoring screen showing sexy graphs to the roots because it’s cool.
Start multiple instances of chromium --app=http://grafana/ to open monitoring web views.
We look at both the System and Masternode dashboards from Grafana.
Disable the screen saver and DPMS on the monitoring display using:
$ xset -dpms
$ xset s off
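If you set up the screen often, the steps above can be wrapped in a small script. This is only a sketch: it assumes chromium and xset are in the PATH and that it runs inside the X session of the monitoring display. The dashboard URLs are yours to provide.

```python
import subprocess


def kiosk_commands(urls):
    """Return the argv lists to run: one Chromium app window per URL, then xset."""
    commands = [["chromium", f"--app={url}"] for url in urls]
    commands.append(["xset", "-dpms"])  # disable display power management
    commands.append(["xset", "s", "off"])  # disable the screen saver
    return commands


def start_kiosk(urls):
    """Open the monitoring windows and keep the display awake."""
    for argv in kiosk_commands(urls):
        if argv[0] == "chromium":
            subprocess.Popen(argv)  # leave the browser running in the background
        else:
            subprocess.run(argv, check=True)
```

For example, start_kiosk(["http://grafana/"]) opens a single window on the Grafana home page.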
Icinga¶
Icinga aggregates the logs of all the machines in the infrastructure, and stores them in an Elasticsearch database. It allows you to quickly search for failures in the logs of the entire setup, including the diskless machines.
These logs are exported by journalbeat.service, which is installed on all the machines and extracts the logs from the systemd journal.
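Since the logs end up in Elasticsearch, they can also be searched directly through its REST API. The sketch below makes several assumptions you should check against your setup: that Elasticsearch answers on its default port 9200 on the monitoring machine, that Journalbeat writes to indices matching its default journalbeat-* pattern, and that the log text lives in the message field. The function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumptions: default Elasticsearch port on the monitoring machine,
# default Journalbeat index pattern.
ES_URL = "http://monitoring:9200"
INDEX_PATTERN = "journalbeat-*"


def log_search_request(text, size=20, base=ES_URL):
    """Build a _search request for the most recent log lines matching `text`."""
    body = {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {"match": {"message": text}},
    }
    return Request(
        f"{base}/{INDEX_PATTERN}/_search",
        data=json.dumps(body).encode(),  # a request body makes urllib send a POST
        headers={"Content-Type": "application/json"},
    )


def recent_messages(text):
    """Return the message field of the latest matching log entries."""
    with urlopen(log_search_request(text), timeout=5) as resp:
        hits = json.load(resp)["hits"]["hits"]
    return [hit["_source"].get("message", "") for hit in hits]
```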
TODO: add more information on how to use Icinga
Icinga runs on monitoring as icinga.service.