its-network/docs/disaster_recovery.md
Wachtl Enterprises LLC ccb747c82b Make renderable by mkdocs-material
Move docs into a renderable form for mkdocs. This necessitated a
reordering of things inside this repo.

Signed-off-by: Wachtl Enterprises LLC <tyrolyean@escpe.net>
2025-03-15 22:21:42 +01:00


% ITS Disaster recovery plan

About

This file describes how to perform disaster recovery if everything breaks down. I cannot cover every catastrophic event that may occur, so I cover the reasons that come to my mind for why everything would stop working.

Scenario 1: Hypervisor dies

The current hypervisor (namely acraze.srv.it-syndikat.org) may spontaneously die at any time for any reason. First check whether the server is just hanging in unlock (it probably is).

If the server has died for real this time, the simplest and probably fastest recovery method is to put all hard disks into a similar replacement server and boot from those. If that is not an option, we have daily backups on bringmethehorizon.cuco (the server inside of cuco, which is now virtualized). You can of course restore from there at any point in time.

Restoring should be a pretty fast operation overall: the servers are all connected to a single port on the firewall, and network configuration is handled entirely by the firewall.
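Before pulling disks, it can help to confirm that the hypervisor really is down. The following is a minimal sketch, not part of the existing tooling: it only tests whether a TCP port accepts connections. The hostname comes from this document; the assumption that SSH listens on port 22 is mine. A closed port does not tell you whether the machine is dead or merely waiting in unlock, so a look at the console is still needed.

```python
import socket

# Hostname taken from this document.
HYPERVISOR = "acraze.srv.it-syndikat.org"


def is_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (SSH) is an assumption; adjust it to whatever the host exposes.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Calling `is_port_open(HYPERVISOR)` from a machine inside the network returns `True` if the operating system is up and reachable, in which case neither the disk swap nor the backup restore is needed.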

Scenario 2: The firewall dies

The current firewall (namely sozial.asozial.it-syndikat.org) may spontaneously catch fire and die at any time. To recover from this event, try booting the internal SSD on any hardware with the same number of network ports or more. The OS will detect the interface changes and ask you to reassign them. If that is not an option, I have copied a few router configurations into the resources section of this git repository. Install pfSense on alternative hardware and restore one of these configurations. This process should be fairly straightforward.
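To judge whether replacement hardware has enough network ports, it can be useful to list the interfaces that a saved configuration assigns. The sketch below assumes the usual pfSense backup layout: an XML file with a `pfsense` root element and an `interfaces` element whose children (`wan`, `lan`, `opt1`, ...) each carry an `if` element naming the device. The sample XML and its device names are made up for illustration and are not taken from the files in this repository.

```python
import xml.etree.ElementTree as ET

# Made-up sample, NOT the real ITS configuration.
SAMPLE_CONFIG = """
<pfsense>
  <interfaces>
    <wan><if>igb0</if></wan>
    <lan><if>igb1</if></lan>
  </interfaces>
</pfsense>
"""


def assigned_interfaces(config_xml: str) -> dict:
    """Map each assigned interface name to the device it is bound to."""
    root = ET.fromstring(config_xml)
    if root.tag != "pfsense":
        raise ValueError(f"not a pfSense config: root element is <{root.tag}>")
    result = {}
    for iface in root.findall("./interfaces/*"):
        # A missing <if> element is reported as an empty device name.
        result[iface.tag] = iface.findtext("if", default="")
    return result


interfaces = assigned_interfaces(SAMPLE_CONFIG)
print(f"{len(interfaces)} interfaces assigned: {interfaces}")
```

The number of entries is a lower bound on the ports the new hardware needs. Device names will usually differ on other hardware, which is why pfSense asks you to reassign them.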

Scenario 3: The LDAP server dies

This may happen for a very large number of reasons. If you are relying on LDAP for authorization on hosts, this may be a disaster for you. If you have access to the vaultwarden (which does not rely on LDAP), you can use the recovery root ssh key to ssh into the LDAP machine (currently blacksunempire.srv.it-syndikat.org) and diagnose slapd. This may be resolved by simply restarting slapd.service, or less easily by actually debugging slapd, in which case you should probably contact someone who has knowledge of LDAP. If you don't have access to the vaultwarden, contact someone who does. If you are reading this as a precautionary measure: register inside the vaultwarden and download the ssh key.
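The first diagnostic steps can be sketched as a few ssh invocations. This is a hedged illustration, not an established procedure: the hostname and the unit name `slapd.service` come from this document, while the key file name `./recovery_root_key` is hypothetical, and logging in as `root` is an assumption based on the key being called the recovery root key. The script only prints the commands so they can be reviewed before being run.

```python
import shlex

LDAP_HOST = "blacksunempire.srv.it-syndikat.org"  # from this document
KEY_PATH = "./recovery_root_key"  # hypothetical name for the downloaded key


def ssh_command(key_path: str, remote_command: str, host: str = LDAP_HOST) -> list:
    """Build the argv that runs one command as root over ssh with the given key."""
    return ["ssh", "-i", key_path, f"root@{host}", remote_command]


# Inspect the service state and recent logs first, then try a restart.
STEPS = [
    "systemctl status slapd.service",
    "journalctl -u slapd.service -n 50",
    "systemctl restart slapd.service",
]

for step in STEPS:
    print(shlex.join(ssh_command(KEY_PATH, step)))
```

If the restart does not bring slapd back, the status and journal output are what you will want to hand to whoever debugs it.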