Terraform GKE version management
This document outlines the standard operating procedure for upgrading the Google Kubernetes Engine (GKE) version using Terraform. The process involves infrastructure provisioning via Terraform and workload migration strategies to ensure zero downtime.
Prerequisites
Before initiating the upgrade, verify the authentication status and synchronize the Terraform state with the actual cloud infrastructure^[400-devops__05-Cloud-Provider__GCP升级.md].
- Run `gcloud auth login`^[400-devops__05-Cloud-Provider__GCP升级.md].
- Initialize and refresh the Terraform workspace with `terraform init`, then `terraform refresh`^[400-devops__05-Cloud-Provider__GCP升级.md].
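The prerequisite steps can be sketched as a single shell function. This is a minimal sketch, not the documented tooling: `prepare_workspace` and its `tf_dir` argument are hypothetical names, and the sketch assumes a Terraform release that supports the global `-chdir` option. Newer Terraform releases mark `terraform refresh` as deprecated in favor of `terraform apply -refresh-only`; the sketch keeps the command this procedure names.

```shell
#!/usr/bin/env bash
# Sketch only: prepare_workspace is a hypothetical helper, not part of the original procedure.

# Authenticate, then initialize and refresh the Terraform workspace in tf_dir.
# Stops at the first failing step.
prepare_workspace() {
  local tf_dir="${1:-.}"   # assumption: directory that holds the *.tf files

  gcloud auth login || return 1
  terraform -chdir="$tf_dir" init || return 1
  terraform -chdir="$tf_dir" refresh || return 1
}

# Usage (interactive; gcloud opens a browser for login):
#   prepare_workspace ./terraform
```

Returning early on failure keeps a failed login or a failed `init` from being followed by a refresh against the wrong credentials or an uninitialized workspace.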
Version Selection
Identify the target GKE version by consulting the official release notes^[400-devops__05-Cloud-Provider__GCP升级.md]. You must select a version listed under "No channel (Static)"^[400-devops__05-Cloud-Provider__GCP升级.md], the list that applies to clusters not enrolled in a release channel.
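Before writing the chosen version into Terraform, it can help to check that the string is well formed and that the server currently offers it. The sketch below is one possible check, not part of the original procedure. The helper names are hypothetical, and the usage line assumes that `validMasterVersions` in the output of `gcloud container get-server-config` corresponds to the static (no channel) list; the zone is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical names introduced for illustration.

# Succeeds when the argument looks like a full GKE version, e.g. 1.21.13-gke.900.
is_gke_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+-gke\.[0-9]+$'
}

# Succeeds when version $1 appears in the list read from stdin.
# Entries may be separated by newlines or semicolons.
version_available() {
  tr ';' '\n' | grep -Fxq -- "$1"
}

# Usage (assumption: a zonal cluster in us-central1-a; adjust the location):
#   gcloud container get-server-config --zone us-central1-a \
#     --format="value(validMasterVersions)" | version_available "1.21.13-gke.900"
```

The exact-line match (`grep -Fx`) avoids accepting `1.21.13-gke.900` just because `1.21.13-gke.9000` is in the list.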
Terraform Configuration
- Variable Definition: Update the `gke_version` variable in your configuration files (e.g., `01-variables.tf`)^[400-devops__05-Cloud-Provider__GCP升级.md], for example `variable "gke_version" { default = "1.21.13-gke.900" }`.
- Enable Resource File: Restore the active application configuration by renaming `11-app.tf.back` to `11-app.tf`^[400-devops__05-Cloud-Provider__GCP升级.md].
- Create Temporary Node Pool: Modify the resource name in `11-app.tf` (e.g., `app` -> `temp`), then run `terraform plan` and `terraform apply` to create a temporary node pool^[400-devops__05-Cloud-Provider__GCP升级.md]. This new pool (e.g., `node_pool_1`) will hold workloads during the upgrade process^[400-devops__05-Cloud-Provider__GCP升级.md].
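The rename and the plan/apply pair from the list above can be sketched in shell. This is a sketch under assumptions: the function names and the `upgrade.tfplan` file name are hypothetical, and editing the resource name inside `11-app.tf` is still done by hand. Saving the plan with `-out` and applying that file is a design choice of the sketch, so that what gets applied is what was reviewed.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the plan file name are hypothetical.

# Rename 11-app.tf.back to 11-app.tf inside tf_dir if it is still disabled.
enable_app_config() {
  local tf_dir="${1:-.}"
  if [ -f "$tf_dir/11-app.tf.back" ]; then
    mv "$tf_dir/11-app.tf.back" "$tf_dir/11-app.tf" || return 1
  fi
  if [ ! -f "$tf_dir/11-app.tf" ]; then
    echo "11-app.tf not found in $tf_dir" >&2
    return 1
  fi
}

# Save the plan to a file, then apply exactly that plan.
plan_and_apply() {
  local tf_dir="${1:-.}"
  local plan_file="${2:-upgrade.tfplan}"
  terraform -chdir="$tf_dir" plan -out="$plan_file" || return 1
  terraform -chdir="$tf_dir" apply "$plan_file"
}

# Usage:
#   enable_app_config ./terraform    # then edit the resource name (app -> temp)
#   plan_and_apply ./terraform
```

In practice the two Terraform commands would usually be run separately, with the plan output read in between.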
Workload Migration
Migrate your workloads from the original node pool (app) to the temporary pool (temp) to prepare for the infrastructure upgrade^[400-devops__05-Cloud-Provider__GCP升级.md].
- Update Kubernetes manifests in your configuration directory (e.g., `win-env-project\dev\kube`)^[400-devops__05-Cloud-Provider__GCP升级.md].
- Change the `nodeSelector` from `pool: app` to `pool: temp`^[400-devops__05-Cloud-Provider__GCP升级.md].
- Apply changes iteratively (e.g., for Kafka nodes 1, 2, and 3) with approximately 3-minute intervals to maintain stability^[400-devops__05-Cloud-Provider__GCP升级.md].
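The selector change and the staggered rollout can be sketched as two small shell functions. The helper names and the manifest file names in the usage comment are hypothetical. The `sed` pattern assumes the selector is written as a plain `pool: app` line without quotes or trailing comments, so manifests that deviate from that would need a manual edit.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and manifest file names are hypothetical.

# Rewrite "pool: <from>" to "pool: <to>" in one manifest, keeping indentation.
switch_pool() {
  local from="$1" to="$2" file="$3" tmp status
  tmp="$(mktemp)" || return 1
  sed "s/^\([[:space:]]*pool:[[:space:]]*\)${from}[[:space:]]*\$/\1${to}/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Apply manifests one at a time, pausing <interval> seconds after each.
rolling_apply() {
  local interval="$1"
  shift
  local file
  for file in "$@"; do
    kubectl apply -f "$file" || return 1
    sleep "$interval"
  done
}

# Usage (180 seconds is roughly the 3-minute interval described above):
#   for f in kafka-1.yaml kafka-2.yaml kafka-3.yaml; do switch_pool app temp "$f"; done
#   rolling_apply 180 kafka-1.yaml kafka-2.yaml kafka-3.yaml
```

Stopping at the first failed `kubectl apply` leaves the remaining workloads on their current pool, which is usually the safer state to debug from.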
Infrastructure Upgrade
With workloads safely moved to the temporary pool, upgrade the main infrastructure^[400-devops__05-Cloud-Provider__GCP升级.md].
- Modify the `google_container_node_pool` resource definition in `modules/site/03-node-pool.tf` to target the new GKE version^[400-devops__05-Cloud-Provider__GCP升级.md].
- Run `terraform apply` to upgrade the `app` node pool^[400-devops__05-Cloud-Provider__GCP升级.md].
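Once the apply finishes, one way to confirm the upgrade is to compare the kubelet version reported by the nodes in the pool against the target version. The sketch below is an illustration only: `pool_at_version` is a hypothetical helper, and it assumes the nodes carry a `pool=app` label (inferred from the `nodeSelector` used in the manifests) and report their version with a leading `v`.

```shell
#!/usr/bin/env bash
# Sketch only: pool_at_version is a hypothetical helper; the pool=<name> node label
# is an assumption inferred from the nodeSelector used in the manifests.

# Succeeds when every node labelled pool=<pool> reports kubelet version v<version>.
pool_at_version() {
  local pool="$1" expected="v$2"
  local versions
  versions="$(kubectl get nodes -l "pool=$pool" \
    -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}')" || return 1
  [ -n "$versions" ] || return 1   # no nodes matched the label
  # grep -v selects lines that differ from the expected version; none may remain.
  ! printf '%s\n' "$versions" | grep -Fxvq -- "$expected"
}

# Usage:
#   pool_at_version app 1.21.13-gke.900 && echo "app pool upgraded"
```

Treating an empty node list as a failure avoids reporting success when the label selector matches nothing.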
Restoration and Cleanup
After the infrastructure upgrade is complete, migrate workloads back to the upgraded node pool^[400-devops__05-Cloud-Provider__GCP升级.md].
- Revert the `nodeSelector` in your Kubernetes manifests from `pool: temp` back to `pool: app`^[400-devops__05-Cloud-Provider__GCP升级.md].
- Apply these changes iteratively (e.g., Kafka nodes 1, 2, 3) with 3-minute intervals^[400-devops__05-Cloud-Provider__GCP升级.md].
- Disable the temporary Terraform configuration by renaming `11-app.tf` back to `11-app.tf.back`^[400-devops__05-Cloud-Provider__GCP升级.md].
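A small guard can make the final rename safer by refusing to disable the temporary configuration while any manifest still selects the temporary pool. This is a sketch with hypothetical helper names; it assumes the manifests live in one directory tree and write the selector as a plain `pool: temp` line, and it relies on `grep -r`, which GNU and BSD grep both provide.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Succeeds when no file under manifest_dir still selects the temporary pool.
no_temp_selector() {
  local manifest_dir="$1"
  ! grep -rEq '^[[:space:]]*pool:[[:space:]]*temp[[:space:]]*$' "$manifest_dir"
}

# Rename 11-app.tf back to 11-app.tf.back, but only after the manifests are reverted.
disable_app_config() {
  local tf_dir="$1" manifest_dir="$2"
  if ! no_temp_selector "$manifest_dir"; then
    echo "manifests still reference pool: temp" >&2
    return 1
  fi
  mv "$tf_dir/11-app.tf" "$tf_dir/11-app.tf.back"
}

# Usage (directory names are placeholders):
#   disable_app_config ./terraform ./kube
```

The check only inspects files on disk, so it does not prove that the reverted manifests were actually applied to the cluster.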
Post-Upgrade Tasks
Verify the environment and perform manual adjustments for external services^[400-devops__05-Cloud-Provider__GCP升级.md].
- Update Jenkins VCP network ops IP to ensure connectivity^[400-devops__05-Cloud-Provider__GCP升级.md].
Sources
- 400-devops__05-Cloud-Provider__GCP升级.md
Related Concepts
- Terraform
- [[GKE]]
- [[Node Pool]]
- Blue-Green Deployment