GKE Blue-Green Upgrade Strategy
The GKE Blue-Green Upgrade Strategy is a manual procedure designed to upgrade Google Kubernetes Engine (GKE) node pools with zero downtime. Instead of upgrading the existing nodes in-place, which can interrupt running workloads, this method creates a temporary node pool (Green) to handle traffic during the upgrade of the original pool (Blue).^[400-devops__05-Cloud-Provider__GCP升级.md]
Strategy Overview
This approach treats the upgrade process as a migration between two distinct environments. Workloads are systematically moved from the stable "Blue" environment (the existing node pool named app) to a temporary "Green" environment (a new pool named temp).^[400-devops__05-Cloud-Provider__GCP升级.md]
Once the migration is complete, the original pool is destroyed and recreated with the new GKE version, and workloads are migrated back.^[400-devops__05-Cloud-Provider__GCP升级.md]
Prerequisites
Before starting the upgrade, the environment must be prepared:
- Authentication & Init: Log in to `gcloud` and initialize the Terraform working directory^[400-devops__05-Cloud-Provider__GCP升级.md].
- State Verification: Run `terraform refresh` to ensure the Terraform state matches the actual GCP resources^[400-devops__05-Cloud-Provider__GCP升级.md].
- Version Selection: Check the official release notes to find the latest valid version under the `No channel` (static) channel^[400-devops__05-Cloud-Provider__GCP升级.md].
- Config Activation: Rename `11-app.tf.back` to `11-app.tf` to enable the configuration for the temporary resources^[400-devops__05-Cloud-Provider__GCP升级.md].
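The preparation steps above might be scripted roughly as follows. This is a sketch rather than the project's actual tooling: the function names `prepare_environment` and `activate_temp_config` are hypothetical, and it assumes the commands run from the Terraform working directory that contains `11-app.tf.back`.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: authenticate and sync the Terraform state with GCP.
prepare_environment() {
  gcloud auth login
  terraform init
  # The procedure uses `terraform refresh`; recent Terraform releases
  # deprecate it in favor of `terraform apply -refresh-only`.
  terraform refresh
}

# Hypothetical helper: enable the temporary-pool configuration in directory $1
# (defaults to the current directory).
activate_temp_config() {
  config_dir="${1:-.}"
  if [ ! -f "$config_dir/11-app.tf.back" ]; then
    echo "missing $config_dir/11-app.tf.back" >&2
    return 1
  fi
  mv "$config_dir/11-app.tf.back" "$config_dir/11-app.tf"
}
```

A typical invocation would be `prepare_environment && activate_temp_config .`, followed by the version lookup in the release notes, which remains a manual step.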
Step-by-Step Procedure
1. Deploy Temporary Infrastructure
Modify the Terraform variables to define the target GKE version (e.g., "1.21.13-gke.900")^[400-devops__05-Cloud-Provider__GCP升级.md]. Then, update the application configuration to create a temporary node pool named temp by renaming the resource in 11-app.tf^[400-devops__05-Cloud-Provider__GCP升级.md].
Run the deployment with `terraform plan`, then `terraform apply`. This creates `node_pool_1` (`temp`), which acts as the holding area for pods during the upgrade^[400-devops__05-Cloud-Provider__GCP升级.md].
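One way to make sure the applied changes are exactly the reviewed ones is to save the plan to a file and apply that file. The sketch below assumes hypothetical function names (`plan_temp_pool`, `apply_temp_pool`), a hypothetical plan file name, and that the new nodes carry the `pool=temp` label that the manifests later select on.

```shell
#!/bin/sh
set -eu

# Hypothetical plan file name; override by exporting PLAN_FILE.
PLAN_FILE="${PLAN_FILE:-temp-pool.tfplan}"

# Write the plan to disk so it can be reviewed before anything changes.
plan_temp_pool() {
  terraform plan -out="$PLAN_FILE"
}

# Apply the saved plan, then list the nodes of the temporary pool.
apply_temp_pool() {
  terraform apply "$PLAN_FILE"
  # Assumes the temp pool's nodes are labeled pool=temp.
  kubectl get nodes -l pool=temp
}
```

Running `plan_temp_pool`, inspecting the result with `terraform show "$PLAN_FILE"`, and only then running `apply_temp_pool` keeps a human review between the two commands.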
2. Migrate to Green (Temp)
With the temporary pool ready, migrate workloads from the original pool to the new one.^[400-devops__05-Cloud-Provider__GCP升级.md]
- Update Selectors: Modify the Kubernetes manifests (e.g., in `win-env-project/dev/kube`) to change the `nodeSelector` from `pool: app` to `pool: temp`^[400-devops__05-Cloud-Provider__GCP升级.md].
- Apply Changes: Execute the application script (e.g., `03-apply-kube.sh`) to move the pods^[400-devops__05-Cloud-Provider__GCP升级.md].
- Rolling Update: For sensitive services like Kafka, update the selectors one by one (e.g., kafka1, 2, 3) with a delay of approximately 3 minutes between each to ensure stability^[400-devops__05-Cloud-Provider__GCP升级.md].
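The selector change can be made by hand, but a small helper reduces the chance of missing a manifest. The following is a minimal sketch: `switch_pool` is a hypothetical function, and it assumes each manifest holds the selector on its own line in the form `pool: app` (indentation is preserved, and lines with trailing comments are left untouched).

```shell
#!/bin/sh
set -eu

# Rewrite "pool: <from>" to "pool: <to>" in every manifest passed after the
# first two arguments.
switch_pool() {
  from_pool="$1"
  to_pool="$2"
  shift 2
  for manifest in "$@"; do
    # -i.bak works with both GNU and BSD sed; the backup is removed afterwards.
    sed -i.bak \
      "s/^\([[:space:]]*pool:[[:space:]]*\)${from_pool}[[:space:]]*\$/\1${to_pool}/" \
      "$manifest"
    rm -f "$manifest.bak"
  done
}
```

For example, `switch_pool app temp win-env-project/dev/kube/*.yaml` (the file pattern is an assumption about the directory layout), followed by the existing `03-apply-kube.sh` script to roll the change out.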
3. Upgrade Original Pool (Blue)
Once all workloads are running on the temp pool, the original pool is safe to upgrade.
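Before touching the original pool, it can help to confirm that no application pods are still scheduled on it. The sketch below uses a hypothetical function `pods_on_pool` and assumes the nodes carry the same `pool` label that the manifests select on; GKE also sets a `cloud.google.com/gke-nodepool` label that could be used instead.

```shell
#!/bin/sh
set -eu

# Print the pods in namespace $2 that are scheduled on nodes labeled pool=$1.
pods_on_pool() {
  pool="$1"
  namespace="$2"
  for node in $(kubectl get nodes -l "pool=$pool" -o name); do
    # `-o name` prints "node/<name>"; strip the prefix for the field selector.
    kubectl get pods -n "$namespace" -o name \
      --field-selector "spec.nodeName=${node#node/}"
  done
}
```

An empty result from `pods_on_pool app <namespace>` suggests that namespace has fully moved. The check is scoped to a namespace because DaemonSet and system pods are expected to remain on every node.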
- Upgrade: Modify the `google_container_node_pool` resource (specifically `node_pool`) in `win-env-project/dev/gcloud/modules/site/03-node-pool.tf` to the new GKE version^[400-devops__05-Cloud-Provider__GCP升级.md].
- Terraform Apply: Run `terraform apply` again to update the infrastructure^[400-devops__05-Cloud-Provider__GCP升级.md].
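After the apply finishes, the node versions can be checked before any workload moves back. This is a sketch under two assumptions: the nodes are labeled `pool=app`, and the kubelet reports the GKE version with a leading `v` (for example `v1.21.13-gke.900`). The function name `pool_at_version` is hypothetical.

```shell
#!/bin/sh
set -eu

# Succeed only if every node labeled pool=$1 reports kubelet version v$2.
# Fails when the pool has no nodes at all.
pool_at_version() {
  pool="$1"
  want="v$2"
  versions="$(kubectl get nodes -l "pool=$pool" \
    -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}')"
  [ -n "$versions" ] || return 1
  for got in $versions; do
    [ "$got" = "$want" ] || return 1
  done
}
```

Used as `if pool_at_version app 1.21.13-gke.900; then echo "app pool upgraded"; fi`, it gives a quick gate before step 4.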
4. Migrate Back to Blue
Now that the original app pool is upgraded and running the new GKE version, reverse the migration process.
- Update Selectors: Change the `nodeSelector` in the Kubernetes manifests from `pool: temp` back to `pool: app`^[400-devops__05-Cloud-Provider__GCP升级.md].
- Apply Changes: Run `03-apply-kube.sh` again to move pods back to the upgraded pool^[400-devops__05-Cloud-Provider__GCP升级.md].
- Stagger Updates: Again, update services like Kafka sequentially with 3-minute intervals^[400-devops__05-Cloud-Provider__GCP升级.md].
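The staggered update used in both migration directions can be expressed as a small loop. In this sketch `staggered` is a hypothetical helper that runs a caller-supplied command once per service and sleeps between services; the per-service command (here called `migrate_one` in the usage note) is also hypothetical and would wrap the selector change and apply for a single service.

```shell
#!/bin/sh
set -eu

# Run "$2 <service>" for each remaining argument, sleeping $1 seconds between
# services (but not before the first or after the last one).
staggered() {
  delay="$1"
  step="$2"
  shift 2
  first=1
  for service in "$@"; do
    if [ "$first" -eq 0 ]; then
      sleep "$delay"
    fi
    first=0
    "$step" "$service"
  done
}
```

With the 3-minute interval from the procedure this becomes `staggered 180 migrate_one kafka1 kafka2 kafka3`. A fixed delay is what the procedure prescribes; waiting on `kubectl rollout status` for each service is a possible alternative when the workloads are Deployments or StatefulSets.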
5. Cleanup and Post-Upgrade Tasks
Finalize the environment by removing temporary configurations and performing manual checks.
- Cleanup Terraform: Rename `11-app.tf` back to `11-app.tf.back` to remove the temporary pool configuration from the next apply^[400-devops__05-Cloud-Provider__GCP升级.md].
- Network Adjustment: Manually verify and adjust the VPC network ops IP for Jenkins in the GCP console if necessary^[400-devops__05-Cloud-Provider__GCP升级.md].
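The Terraform cleanup can be guarded so that an existing backup file is never overwritten. A minimal sketch, with the hypothetical function name `deactivate_temp_config` and the assumption that it runs against the Terraform working directory:

```shell
#!/bin/sh
set -eu

# Hypothetical helper: disable the temporary-pool configuration in directory $1
# (defaults to the current directory).
deactivate_temp_config() {
  config_dir="${1:-.}"
  if [ -e "$config_dir/11-app.tf.back" ]; then
    echo "$config_dir/11-app.tf.back already exists; refusing to overwrite" >&2
    return 1
  fi
  mv "$config_dir/11-app.tf" "$config_dir/11-app.tf.back"
}
```

Running `deactivate_temp_config . && terraform plan` afterwards shows what the next apply would change, which should be limited to the temporary resources.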
Related Concepts
- Blue-Green Deployment
- Zero Downtime Deployment
- [[Terraform State Management]]
Sources
^[400-devops__05-Cloud-Provider__GCP升级.md]