Kubernetes disaster recovery and self-healing

Kubernetes disaster recovery refers to the strategies and actions required to restore service functionality following catastrophic failures, such as the loss of a node or data corruption.^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md] Self-healing describes the cluster's built-in capability to automatically detect and replace failed workloads without human intervention.^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md]

Self-healing mechanisms

Kubernetes maintains service availability through automated self-healing processes. When a worker node fails, the cluster attempts to reschedule the affected workloads onto other healthy nodes to maintain the desired state^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
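This desired-state reconciliation is driven by controllers. A minimal sketch of a Deployment that enables self-healing (the name `web` and image are hypothetical):

```yaml
# Hypothetical Deployment: the Deployment controller keeps three
# replicas running at all times; if a node fails, its pods are
# recreated and scheduled onto the remaining healthy nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
```

Because the desired replica count is declared rather than scripted, no operator action is needed for the controller to restore it after a node loss.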

For example, during a simulated failure of hdss7-21.host.com, pods that were running on the failed node were automatically recreated on hdss7-22.host.com^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].

Disaster recovery workflow

In the event of a physical server failure or severe corruption, the recovery process involves immediate operational steps to prevent service disruption and restore the cluster to a healthy state^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].

Node isolation

The primary recovery step following a server crash is to isolate the failed node. Since the cluster may initially interpret a complete failure as a network partition or temporary outage (leading to repeated reconnection attempts), the administrator must explicitly delete the node object from the cluster^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].

[kubectl delete](<./kubectl-delete.md>) node `<failed-node-hostname>`

Once the node object is deleted, the self-healing mechanisms take over: the affected pods are recreated, and the scheduler places the replacements on the remaining healthy nodes^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
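Progress can be checked with standard kubectl queries (a sketch; run against the cluster after the delete):

```shell
# Confirm the failed node no longer appears in the node list
kubectl get nodes

# Watch replacement pods being scheduled onto the surviving nodes
kubectl get pods -o wide --watch
```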

Infrastructure updates

After a node fails, upstream load balancers may still direct traffic to the offline host, potentially causing errors for end-users. To prevent this, administrators must update infrastructure configurations, such as Nginx or API server gateways, to remove references to the failed server's IP address^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
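As an illustration, an Nginx upstream block for the API server might be edited to drop the failed host. The IPs and upstream name below are hypothetical, modeled on the hdss7-2x naming used earlier:

```nginx
# Hypothetical upstream for the kube-apiserver; the failed
# hdss7-21 entry is commented out until the node is restored.
upstream kube-apiserver {
    # server 10.4.7.21:6443  max_fails=3 fail_timeout=5s;  # failed node
    server 10.4.7.22:6443    max_fails=3 fail_timeout=5s;
}
```

After editing, the configuration must be reloaded (e.g. `nginx -s reload`) for traffic to stop flowing to the failed host.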

Cluster restoration

Once the physical hardware or operating system of the failed node is repaired, the node can be reintroduced to the cluster^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].

  1. Restart Services: Ensure the kubelet and container runtime services are running on the restored node^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
  2. Relabel Node: Re-apply the necessary Kubernetes labels to the node (e.g., node-role.kubernetes.io/master or node-role.kubernetes.io/node) so that the scheduler recognizes it and can assign new workloads^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
  3. Revert Load Balancer: Update the external load balancers to reinstate traffic routing to the restored node^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
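The three steps above can be sketched as shell commands run once the host is back online. The service names, node name, and label keys follow common conventions and the examples earlier in this note; treat them as assumptions, not exact values:

```shell
# 1. Restart the container runtime and kubelet on the restored host
systemctl start docker
systemctl start kubelet

# 2. Re-apply role labels so the scheduler recognizes the node
#    (node name and labels are illustrative)
kubectl label node hdss7-21.host.com node-role.kubernetes.io/master=
kubectl label node hdss7-21.host.com node-role.kubernetes.io/node=

# 3. After reverting the load balancer, confirm the node is Ready
kubectl get nodes
```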

Ideally, the cluster will then rebalance, distributing workloads across both the recovered and the previously active nodes^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].

  • [[Rolling update]]
  • [[Pod]]
  • [[High availability]]

Sources

^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md]