Kubernetes disaster recovery and self-healing¶
Kubernetes disaster recovery refers to the strategies and actions required to restore service functionality following catastrophic failures, such as the loss of a node or data corruption.^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md] Self-healing describes the cluster's built-in capability to automatically detect and replace failed workloads without human intervention.^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md]
Self-healing mechanisms¶
Kubernetes maintains service availability through automated self-healing processes. When a worker node fails, its pods are marked for eviction once the node remains unreachable past the eviction timeout, and the controllers recreate the affected workloads on other healthy nodes to maintain the desired state^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
For example, during a simulated failure of hdss7-21.host.com, pods that were running on the failed node were automatically recreated on hdss7-22.host.com^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
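The failover above can be observed from the command line. A minimal sketch using the example hosts from this scenario; the commands are printed rather than executed so the snippet runs without a live cluster:

```shell
#!/bin/sh
# Example host from the scenario above (assumption: adjust to your cluster).
FAILED_NODE="hdss7-21.host.com"

# Dry run: print the inspection commands; drop the 'echo' against a real cluster.
echo "kubectl get nodes"                     # the failed node reports NotReady
echo "kubectl get pods -o wide"              # pods reappear on hdss7-22.host.com
echo "kubectl describe node $FAILED_NODE"    # node events explain the evictions
```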
Disaster recovery workflow¶
In the event of a physical server failure or severe corruption, the recovery process involves immediate operational steps to prevent service disruption and restore the cluster to a healthy state^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
Node isolation¶
The primary recovery step following a server crash is to isolate the failed node. Since the cluster may initially interpret a complete failure as a network partition or temporary outage (leading to repeated reconnection attempts), the administrator must explicitly delete the node object from the cluster^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
To remove the node object, use [kubectl delete](<./kubectl-delete.md>):

```shell
kubectl delete node <failed-node-hostname>
```
Once the node is deleted, the self-healing mechanisms trigger, and the scheduler begins provisioning replacements on the remaining nodes^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
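A sketch of the isolation step and a follow-up check (the node name is the example host from the failure scenario; printed as a dry run so the snippet needs no cluster):

```shell
#!/bin/sh
FAILED_NODE="hdss7-21.host.com"   # example host; substitute your failed node

# Remove the node object so the cluster stops waiting for it to return
# (dry run: remove 'echo' to execute against a live cluster).
echo "kubectl delete node $FAILED_NODE"

# Replacement pods should now be scheduled onto the surviving nodes.
echo "kubectl get pods -o wide"
```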
Infrastructure updates¶
After a node fails, upstream load balancers may still direct traffic to the offline host, potentially causing errors for end-users. To prevent this, administrators must update infrastructure configurations, such as Nginx or API server gateways, to remove references to the failed server's IP address^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
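For example, in an Nginx layer that fronts the cluster, the failed host's upstream entry would be removed or commented out. The IPs and upstream name below are illustrative, not taken from the source:

```nginx
# Illustrative upstream block; the failed host is taken out of rotation.
upstream kubernetes_backend {
    # server 10.4.7.21:6443;   # failed node, removed until restored
    server 10.4.7.22:6443;     # healthy node
}
```

After editing, reload Nginx (e.g., `nginx -s reload`) so the change takes effect.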
Cluster restoration¶
Once the physical hardware or operating system of the failed node is repaired, the node can be reintroduced to the cluster^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
- Restart Services: Ensure the `kubelet` and container runtime services are running on the restored node^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
- Relabel Node: Re-apply the necessary Kubernetes labels to the node (e.g., `node-role.kubernetes.io/master` or `node-role.kubernetes.io/node`) so that the scheduler recognizes it and can assign new workloads^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
- Revert Load Balancer: Update the external load balancers to reinstate traffic routing to the restored node^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
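The restoration steps above can be sketched as a command sequence (host name and label are illustrative; printed as a dry run so the snippet runs without a cluster):

```shell
#!/bin/sh
RESTORED_NODE="hdss7-21.host.com"   # example host; substitute your restored node

# Dry run: drop the 'echo' prefixes to execute on the live node/cluster.
echo "systemctl start kubelet"                                           # restart the kubelet service
echo "kubectl label node $RESTORED_NODE node-role.kubernetes.io/node="   # relabel for the scheduler
echo "kubectl get nodes --show-labels"                                   # verify Ready status and labels
```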
Workloads can then be distributed across both the recovered and the previously active nodes; note that Kubernetes does not move already-running pods on its own, so redistribution happens as new pods are scheduled (for example, after a rolling update)^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md].
Related concepts¶
- [[Rolling update]]
- [[Pod]]
- [[High availability]]
Sources¶
^[400-devops-06-kubernetes-k8s-paas-05k8scicd.md]