Prologin’s 2019 setup¶
Network setup¶
Pasteur¶
There were five 48-port switches in Pasteur and four boxes to hold them.
gw.prolo had one NIC and was the network gateway.
Roots and organizers were on the last 2 rows.
Masters¶
There were two 24-port switches (lent by the bocal) and two RHFS (67 & 89). The room was split into two parts, each connected to one switch and one RHFS. The switches were interconnected, and one was connected to the bocal network to link Pasteur and Masters.
Wifi for organizers¶
We used a NETGEAR AC1200 to bridge the WLAN with the LAN.
MAC addresses for the organizers’ machines were added to mdb with an IP on the services range.
Services organization¶
gw.prolo:
bind
dhcpd
firewall
mdb
netboot
postgresql database for hfs
udb
web.prolo:
concours
masternode
postgresql database for concours
map
misc.prolo:
redmine
djraio
irc_gatessh
sddm-remote
spotify
wow (World of Warcraft)
monitoring.prolo:
elasticsearch
grafana
kibana
prometheus
RHFS:
rhfs01 (pasteur)
rhfs23 (pasteur)
rhfs45 (pasteur)
rhfs67 (masters)
rhfs89 (masters)
Issues encountered during the event¶
No major issue this year.
RHFS sync breaks login¶
The rfs/commit_staging.sh script overwrites /etc/passwd, and while the rsync
is running, this file does not contain the udb users. This prevented users
from logging in, and for users who were already logged in, tools that relied
on the file failed. The /etc/passwd file is updated by the udbsync_passwd
service, which only runs after the rsync has finished.
Impact on contestants: medium
Remediation: See https://github.com/prologin/sadm/issues/169
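One possible way to close this window, sketched below, is to merge the udb users into the staged file before it replaces /etc/passwd, and to swap the file in with an atomic rename. This is only an illustration under stated assumptions, not necessarily the fix discussed in the linked issue: the helper names merge_passwd and write_atomically are hypothetical and do not come from sadm.

```python
import os
import tempfile


def merge_passwd(base_lines, extra_lines):
    """Return the base entries, then the extra entries whose login is new.

    Hypothetical helper: base_lines would be the system accounts from the
    staged file, extra_lines the accounts normally added by udbsync_passwd.
    """
    merged = [line for line in base_lines if line.strip()]
    seen = {line.split(":", 1)[0] for line in merged}
    for line in extra_lines:
        if not line.strip():
            continue
        login = line.split(":", 1)[0]
        if login not in seen:
            merged.append(line)
            seen.add(login)
    return merged


def write_atomically(path, lines):
    """Write lines to a temporary file next to path, then rename it over path.

    A rename within one filesystem is atomic on POSIX, so readers see either
    the old file or the complete new one, never a partial file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(lines) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this approach the file that lands on the contestant machines already contains the udb users, so logins keep working even if udbsync_passwd has not run yet.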
High network usage, freeze and OOM¶
After starting a match in concours, the user landed on the match replay page,
and an unidentified bug in the match replay stack (Godot web) made the
contestant's system freeze due to high network usage. Symptoms were full
bandwidth usage of the NIC (>100 MB/s) and high CPU usage. We suspect that the
code entered a busy loop hammering the NFS client. This prevented us from
logging in with ssh, but the prometheus-node-exporter still worked and we could
gather logs. We initially had no clue what was causing the freeze, due to a
lack of per-process monitoring, but inspection of the machines while they were
frozen showed a consistent correlation with opening a replay page on concours.
Also, users who did not open such a page did not experience the freeze.
Unfreezing a machine required either (a) rebooting it, at the risk of data loss and filesystem corruption, or (b) unplugging the network cable for a few seconds and plugging it back in, then waiting ~30 seconds for the OOM killer to kill the browser. Several contestants rebooted their machines when they froze, without data loss.
Impact on contestants: high, ~10 freezes per hour
Detection: created a dashboard in grafana to identify systems with abnormally high network bandwidth usage
Remediation: fix web replay, limit network bandwidth to allow ssh
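The detection idea can be sketched as follows: take two readings of a per-host byte counter (such as the node exporter's node_network_receive_bytes_total), derive a rate, and flag hosts above a threshold. The actual grafana query used during the event is not recorded here, so the function names, the sample host names, and the 100 MB/s threshold (taken from the symptom described above) are assumptions for illustration.

```python
def bytes_per_second(prev_bytes, curr_bytes, interval_s):
    """Rate of a monotonically increasing byte counter between two readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # The counter went backwards, e.g. after a reboot: count from zero.
        delta = curr_bytes
    return delta / interval_s


def saturated_hosts(samples, interval_s, threshold_bps=100 * 10**6):
    """Return the sorted host names whose rate reaches threshold_bps.

    samples maps a host name to a (previous, current) pair of counter values
    taken interval_s seconds apart.
    """
    return sorted(
        host
        for host, (prev, curr) in samples.items()
        if bytes_per_second(prev, curr, interval_s) >= threshold_bps
    )


# Hypothetical readings taken 15 seconds apart.
readings = {
    "machine-a": (0, 1_800_000_000),  # 120 MB/s, flagged
    "machine-b": (0, 30_000_000),     # 2 MB/s, not flagged
}
print(saturated_hosts(readings, interval_s=15))
```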
Packet drop on uplink¶
Organizers using gw.prolo as uplink saw packet drops that mainly impacted
DNS queries. Other parts of the network stack were also unstable, but DNS
failures had the most impact, mainly on the radio service, which was querying
external APIs.
Impact on contestants: no impact
Detection: general packet loss, “Server IP address could not be found” error in browsers
Remediation: added retry of network requests
Next year: prepare a secondary uplink in case the main one fails
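The retry remediation could look like the sketch below: a small wrapper that re-runs a failing network call with exponential backoff. The code actually added to the radio service is not shown in this report, so the retry helper, its parameters, and the default delays are assumptions.

```python
import time


def retry(operation, attempts=4, base_delay=0.5, retry_on=(OSError,), sleep=time.sleep):
    """Call operation() until it succeeds or attempts are exhausted.

    The wait between tries doubles each time: base_delay, 2 * base_delay, ...
    DNS resolution failures (socket.gaierror) and urllib.error.URLError are
    both subclasses of OSError, so the default retry_on covers them.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller see the last error.
            sleep(base_delay * 2 ** attempt)
```

A caller would wrap the request in a function, for example `retry(lambda: urllib.request.urlopen(url, timeout=5).read())` after importing urllib.request. Backoff keeps retries from adding more load to an uplink that is already dropping packets.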
Heat¶
We did not have enough fans and the temperature in the rooms was very high.
Next year: ensure each row has a large fan, and put a drink cart at the front of the room.