Keep linux services alive on server failures, based on protected NVMesh volumes
nvkeep is an easy to use linux service high availability (HA) framework that can keep any linux service alive in case of server failure. To do this, it creates a scenario which looks to clients on the network like the original server would still be up and running, even though nvkeep has actually moved the protected service to a backup host.
An nvkeep-protected service can be an NFS export, a BeeGFS service, a database server like MySQL or any other systemd-based linux service.
Whether a physical server is still up is detected by a lightweight service named keepalived, which comes with all major Linux distributions. nvkeep includes tools to easily configure keepalived, to monitor the status of the system and to automatically trigger the following actions in case the primary server for a nvkeep-protected Linux service fails:
- Exclusively attach the underlying NVMesh volume of the nvkeep-protected service to the backup host. (Exclusive attachment ensures split brain safety from the storage perspective.)
- Mount the file system of the nvkeep-protected service on the backup host.
- Move the floating IP address of the nvkeep-protected service to the backup host.
- Start the nvkeep-protected service on the backup host.
After this, the nvkeep-protected service is available under its original IP address to the network clients and has access to all its data on the NVMesh volume.
nvkeep requires keepalived v1.3.6 or higher, which is included in RHEL/CentOS 8.0 and higher and in Ubuntu 18.04 and higher.
apt install keepalived
yum install keepalived
The version of keepalived included in RHEL/CentOS 7.x is too old, but it's easy to install a newer version from the official webiste. The official install guide is available here. These are the steps:
yum -y install autoconf automake git openssl-devel libnl3-devel ipset-devel
git clone https://github.com/acassen/keepalived.git
cd keepalived
git checkout v2.1.5
./build_setup
./configure
mkdir -p ~/rpmbuild/SOURCES
make rpm
# install created rpm
cd $(rpm --eval "%{_rpmdir}")/x86_64
yum install keepalived-2.1.5-1.el7.x86_64.rpm
git clone https://github.com/Excelero/nvkeep.git
cd nvkeep
For RHEL/CentOS:
./create_rpm_package.sh
yum -y install ./packaging/RPMS/noarch/nvkeep*.rpm
For Ubuntu:
./create_deb_package.sh
apt -y install ./packaging/nvkeep*.deb
Afterwards, the nvkeep_apply_config
command will guide you through the setup process.
- Setting up a highly available NFS server with nvkeep: See here
- Setting up a highly available BeeGFS file system with nvkeep: See here
All nvkeep tools have a built-in help, which can be viewed by using the -h
argument.
The nvkeep_check_status
command will check if all services which run preferably on the current host, are actually running and also whether any services that run preferably on its peer host, are currently running.
In this example, all hosts are up and running, so host server01
, which is configured as a highly available NFS server, reports that all its services are OK:
server01: nvkeep_check_status -l
[OK] VOLUME: nfsvolume
[OK] MOUNTPOINT: /mnt/nfsexport
[OK] FLOATING IP: ib0 192.168.150.1/24
[OK] SERVICE: nfs-server
- Keepalived log messages are included in the normal system log, e.g.
/var/log/messages
- Events (such as a host becoming the master for a nvkeep-protected service) are in
/var/log/nvkeep/event_handler.log
- Actions that are taken by nvkeep based on received events are in
/var/log/nvkeep/actions.log
Most of the time, you will just have the keepalived
service running and sleep well at night, knowing that keepalived
and nvkeep
ensure your services are running. However, there might be situations where you briefly want to restart the nvkeep-protected service without going through the full failover and failback procedure. This could be the case after you updated a service package to the latest version and now just want to restart it.
It's no problem to just manually restart an nvkeep-protected service like you would usually do without nvkeep
. For example, if the nvkeep-protected service is an nfs-server, then you could do it just like this:
server01: systemctl restart nfs-server
If you ran a full OS update and now would like to reboot the host server01
, it's best to do a graceful service shutdown that allows a clean unmount of the underlying file system and then afterwards stop keepalived
to trigger a failover to the other host.
First let's disable keepalived automatic start on boot to prevent immediate failback after reboot:
server01: systemctl disable keepalived
Now we do a clean shutdown of the nvkeep-protected service and associated resources:
server01: nvkeep_clean_shutdown -l
Stopping localhost preferred services and associated resources...
All done.
Then we stop keepalived to trigger the failover to the backup host and reboot the server:
server01: systemctl stop keepalived
server01: reboot
When server01
is back up, we do a clean shutdown of its resources on the backup host (server02
) and then start keepalived
again to trigger the failback to server01
:
server02: nvkeep_clean_shutdown -p
Stopping localhost preferred services and associated resources...
All done.
server01: systemctl start keepalived
server01: systemctl enable keepalived
That's it. You can now use the nvkeep_check_status
command to confirm that everything is back to normal state.
Enabling keepalived
autostart on system boot will ensure that all nvkeep-protected services come up automatically again after a server failure. However, a server failure also means that something is broken and might need repair before resuming normal operation.
Thus, it is recommended to disable autostart of keepalived, so that you can analyze what lead to the server failure before starting keepalived and the associated nvkeep-protected services again on this host:
server01: systemctl disable keepalived
server02: systemctl disable keepalived