Prologin’s 2019 setup

Overview

We had two rooms:

  • Pasteur, 96 machines

  • Masters, 41 machines

Network setup

Pasteur

There were five 48-port switches in Pasteur and four boxes to hold them.

gw.prolo had a single NIC and was the network gateway.

Roots and organizers were on the last two rows.

Masters

There were two 24-port switches (lent by the bocal) and two RHFS (rhfs67 & rhfs89). The room was split into two halves, each connected to one switch and one RHFS. The switches were interconnected, and one of them was also connected to the bocal network to link Pasteur and Masters.

Wifi for organizers

We used a NETGEAR AC1200 to bridge the WLAN with the LAN.

MAC addresses for the organizers’ machines were added to mdb with an IP on the services range.
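
For illustration, such an mdb entry could look like the following (a minimal sketch; the field names and the address range shown are assumptions, not the exact mdb schema):

    # Hypothetical mdb record for an organizer laptop; field names and
    # the address range are illustrative assumptions.
    organizer_machine = {
        "hostname": "orga-laptop-01",
        "mac": "aa:bb:cc:dd:ee:ff",
        "ip": "192.168.250.42",  # an address in the services range (assumed)
        "mtype": "orga",         # distinguishes organizer machines
    }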

Services organization

gw.prolo:

  • bind

  • dhcpd

  • firewall

  • mdb

  • netboot

  • postgresql database for hfs

  • udb

web.prolo:

  • concours

  • masternode

  • postgresql database for concours

  • map

misc.prolo:

  • redmine

  • djraio

  • irc_gatessh

  • sddm-remote

  • spotify

  • wow (World of Warcraft)

monitoring.prolo:

  • elasticsearch

  • grafana

  • kibana

  • prometheus

RHFS:

  • rhfs01 (pasteur)

  • rhfs23 (pasteur)

  • rhfs45 (pasteur)

  • rhfs67 (masters)

  • rhfs89 (masters)

Issues encountered during the event

No major issues this year.

RHFS sync breaks login

The rfs/commit_staging.sh script overwrites /etc/passwd, and while the rsync is running this file does not contain the udb users. This prevented users from logging in, and tools that relied on /etc/passwd failed for already-logged-in users. The file is only repopulated by the udbsync_passwd service, which runs after the rsync has finished.

Impact on contestants: medium

Remediation: See https://github.com/prologin/sadm/issues/169
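
One way to close that window is to never rewrite /etc/passwd in place: build the merged file next to it, then atomically rename it over the old one, so readers always see either the old or the complete new file. A minimal sketch of the pattern in Python (an illustration of the approach, not the actual fix tracked in the issue above):

    import os
    import tempfile

    def atomic_write(path, content):
        # Write to a temporary file in the same directory, then rename it
        # over the target. os.replace() on the same filesystem is atomic,
        # so no reader ever observes a partially written /etc/passwd.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as tmp:
                tmp.write(content)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.chmod(tmp_path, 0o644)  # mkstemp creates files as 0600
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise

With this pattern the file never lacks the udb users, instead of being incomplete for the whole duration of the rsync.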

High network usage, freeze and OOM

After starting a match on concours, the user landed on the match replay page, and an unidentified bug in the match replay stack (Godot web) froze the contestant's system through high network usage. Symptoms were full bandwidth usage of the NIC (>100 MB/s) and high CPU usage. We suspect the code entered a busy loop hammering the NFS client. The freeze prevented us from logging in with SSH, but prometheus-node-exporter still responded, so we could gather metrics. We initially had no clue what was causing the freeze, due to a lack of per-process monitoring, but inspecting frozen machines showed a consistent correlation with opening a replay page on concours; users who never opened such a page never froze.

Unfreezing a machine required either a) rebooting it, risking data loss and filesystem corruption, or b) unplugging the network cable for a few seconds and plugging it back in; after roughly 30 seconds the OOM killer would then kill the browser. Several contestants rebooted their frozen machines without losing data.

Impact on contestants: high, ~10 freezes per hour

Detection: we created a Grafana dashboard to identify systems with abnormally high network bandwidth usage.
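
A minimal sketch of the kind of query behind such a dashboard, run against the Prometheus HTTP API (the monitoring.prolo:9090 address and the 50 MB/s threshold are assumptions; recent prometheus-node-exporter versions name the metric node_network_receive_bytes_total, older ones drop the _total suffix):

    import requests

    PROMETHEUS = "http://monitoring.prolo:9090"  # assumed host:port
    # Per-NIC receive throughput averaged over the last minute, keeping
    # only interfaces above 50 MB/s (the freezes showed >100 MB/s).
    QUERY = 'rate(node_network_receive_bytes_total{device!="lo"}[1m]) > 50e6'

    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        metric = result["metric"]
        mbps = float(result["value"][1]) / 1e6
        print(f"{metric.get('instance')} {metric.get('device')}: {mbps:.0f} MB/s")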

Remediation: fix the web replay, and limit network bandwidth so that SSH stays usable (see the sketch below).
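
For the bandwidth limit, a hedged sketch of what such a cap could look like, using a token bucket filter driven from Python (the eth0 device name and the rate/burst/latency values are illustrative, not values we deployed):

    import subprocess

    # Cap egress on the contestant NIC so a runaway process cannot
    # saturate the link. tbf only shapes outgoing traffic.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", "eth0", "root",
         "tbf", "rate", "100mbit", "burst", "32kbit", "latency", "400ms"],
        check=True,
    )

Note that shaping egress keeps outgoing SSH packets flowing but does not limit NFS reads; capping the receive direction would need ingress policing or an ifb device.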

Heat

We did not have enough fans and the temperature in the rooms was very high.

Next year: ensure each row has a large fan, and put a drink cart in front of the room.