Monitoring

Monitoring is the art of knowing when something fails, and of getting as much information as possible to solve the issue.

We use prometheus as our metrics monitoring backend and grafana for the dashboards. We use elasticsearch to store logs and kibana to search through them.

We use a separate machine for monitoring in order to isolate it from the core services: the monitoring workload should not impact the other services, and vice versa. The machine is installed with the same base Arch Linux configuration as the other servers.

Setup

To make a good monitoring system, mix the following ingredients, in that order:

  1. bootstrap_arch_linux.sh

  2. setup_monitoring.sh

  3. python install.py prometheus

  4. systemctl enable --now prometheus

  5. python install.py grafana

  6. systemctl enable --now grafana
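
Once both services are running, a quick sanity check can confirm they answer; this assumes prometheus and grafana listen on their default ports (9090 and 3000) on the monitoring machine:

$ systemctl status prometheus grafana
$ curl -s http://localhost:9090/-/healthy
$ curl -s http://localhost:3000/api/health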

Monitoring services

Most SADM services come with built-in monitoring and will be monitored as soon as prometheus is started.

The following endpoints are available:

  • http://udb/metrics

  • http://mdb/metrics

  • http://concours/metrics

  • http://masternode:9021

  • http://presencesync:9030

  • hfs: each hfs exports its metrics on http://hfsx:9030

  • workernode: each workernode exports its metrics on http://MACHINE:9020.
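
To check that an exporter actually answers, or that prometheus picked a target up, something along these lines can be used (the prometheus default port 9090 on the monitoring machine is an assumption):

$ curl -s http://mdb/metrics | head
$ curl -s http://localhost:9090/api/v1/targets | python -m json.tool | grep '"health"'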

Grafana configuration

In a nutshell:

  1. Install the grafana package.

  2. Copy the SADM configuration file: etc/grafana/grafana.ini.

  3. Enable and start the grafana service.

  4. Copy the nginx configuration: etc/nginx/services/grafana.nginx.

  5. Open http://grafana/, log in and import the SADM dashboards from etc/grafana.
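
The steps above roughly translate to the following commands, assuming they are run from a checkout of the SADM repository on the monitoring machine and that the nginx destination path matches your layout:

$ pacman -S grafana
$ cp etc/grafana/grafana.ini /etc/grafana/grafana.ini
$ systemctl enable --now grafana
$ cp etc/nginx/services/grafana.nginx /etc/nginx/services/
$ systemctl reload nginx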

Todo

automate the process above

Monitoring screen how-to

Start multiple chromium --app http://grafana/ instances to open monitoring web views.

We look at both the System and Masternode dashboards from grafana.

Disable the screen saver and DPMS on the monitoring display using:

$ xset -dpms
$ xset s off
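
For convenience, the whole thing can be wrapped in a small script run from the monitoring display's X session. This is only a sketch, and the single grafana URL is an example (start one chromium instance per dashboard):

#!/bin/sh
# Keep the monitoring display always on.
xset -dpms
xset s off
# One kiosk-style window per dashboard; duplicate the line with other URLs as needed.
chromium --app http://grafana/ &
wait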

Log monitoring

On the monitoring machine:

$ pacman -S elasticsearch kibana
$ systemctl enable --now elasticsearch kibana
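
Before going further, it can be worth checking that both services answer, assuming the default ports (9200 for elasticsearch, 5601 for kibana):

$ curl -s http://localhost:9200/_cluster/health?pretty
$ curl -sI http://localhost:5601/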

In the kibana web UI, go to the dev tools tab and run:

# Make sure the index isn't there
DELETE /logs

# Create the index
PUT /logs

PUT logs/_mapping
{
  "properties": {
    "REALTIME_TIMESTAMP": {
      "type": "date",
      "format": "epoch_millis"
    }
  }
}

This creates an index called logs and maps REALTIME_TIMESTAMP as a date, so it can be used for time filtering.
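
The same index setup can also be done from a shell, without kibana, by talking to elasticsearch directly (assuming it listens on localhost:9200):

$ curl -X DELETE http://localhost:9200/logs
$ curl -X PUT http://localhost:9200/logs
$ curl -X PUT http://localhost:9200/logs/_mapping \
    -H 'Content-Type: application/json' \
    -d '{"properties": {"REALTIME_TIMESTAMP": {"type": "date", "format": "epoch_millis"}}}'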

Install https://github.com/multun/journal-upload-aggregator on the monitoring server. There is no need to put nginx in front of journal-aggregator: the journal uploaders talk to it directly on its own port. Don’t forget to add the alias in mdb.

On the machines that need to be monitored, create /etc/systemd/journal-upload.conf:

[Upload]
Url=http://journal-aggregator:20200/gateway

If the underlying issue is still not fixed, also create /etc/systemd/system/systemd-journal-upload.service.d/restart.conf to restart the service on failure:

[Service]
Restart=on-failure
RestartSec=4

Then:

$ systemctl enable --now systemd-journal-upload
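
To verify that entries actually reach elasticsearch, check the upload service and watch the document count of the logs index grow (localhost:9200 on the monitoring machine is again an assumption):

$ systemctl status systemd-journal-upload
$ curl -s http://localhost:9200/logs/_count?pretty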

As a useful first query:

not SYSTEMD_USER_SLICE:* and (error or (PRIORITY < 5) or (EXIT_STATUS:* and not EXIT_STATUS:0))

This query keeps entries that look like errors (the message contains "error", the priority is below 5, or the exit status is non-zero) and that do not come from user slices.