GKE Blue-Green Upgrade Strategy

The GKE Blue-Green Upgrade Strategy is a manual procedure for upgrading Google Kubernetes Engine (GKE) node pools without downtime. Rather than upgrading the existing nodes in place, which can interrupt running workloads, it creates a temporary node pool (Green) that hosts the workloads while the original pool (Blue) is upgraded.^[400-devops__05-Cloud-Provider__GCP升级.md]

Strategy Overview

This approach treats the upgrade process as a migration between two distinct environments. Workloads are systematically moved from the stable "Blue" environment (the existing node pool named app) to a temporary "Green" environment (a new pool named temp).^[400-devops__05-Cloud-Provider__GCP升级.md]

Once the migration is complete, the original pool is destroyed and recreated with the new GKE version, and workloads are migrated back.^[400-devops__05-Cloud-Provider__GCP升级.md]

Prerequisites

Before starting the upgrade, the environment must be prepared:

  • Authentication & Init: Log in to gcloud and initialize the Terraform working directory^[400-devops__05-Cloud-Provider__GCP升级.md].
  • State Verification: Run terraform refresh to ensure the Terraform state matches the actual GCP resources^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Version Selection: Check the official GKE release notes for the latest valid version listed for No channel (static)^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Config Activation: Rename 11-app.tf.back to 11-app.tf to enable the configuration for the temporary resources^[400-devops__05-Cloud-Provider__GCP升级.md].
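The preparation steps above can be sketched as a small shell helper. This is only an illustration: the directory argument is an assumption (the source does not say where 11-app.tf.back lives), the run wrapper and DRY_RUN variable are hypothetical conveniences, and terraform -chdir needs Terraform 0.14 or later.

```shell
set -eu

# Print the command when DRY_RUN is 1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1: Terraform working directory (assumed to contain 11-app.tf.back).
prepare_upgrade() {
  tf_dir="$1"
  run gcloud auth login                      # authenticate against GCP
  run terraform -chdir="$tf_dir" init        # initialize the working directory
  run terraform -chdir="$tf_dir" refresh     # sync state with real resources
  run mv "$tf_dir/11-app.tf.back" "$tf_dir/11-app.tf"   # enable the temp config
}
```

Calling prepare_upgrade with the working directory prints the four commands for review; setting DRY_RUN=0 in the environment executes them instead.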

Step-by-Step Procedure

1. Deploy Temporary Infrastructure

Modify the Terraform variables to define the target GKE version (e.g., "1.21.13-gke.900")^[400-devops__05-Cloud-Provider__GCP升级.md]. Then, update the application configuration to create a temporary node pool named temp by renaming the resource in 11-app.tf^[400-devops__05-Cloud-Provider__GCP升级.md].

Run the deployment:

    terraform plan
    terraform apply

This creates the new node pool node_pool_1 (temp), which acts as the holding area for pods during the upgrade^[400-devops__05-Cloud-Provider__GCP升级.md].
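A slightly more cautious variant of the plan and apply step saves the plan to a file and applies exactly that file, then checks that the new nodes have registered. The sketch below assumes that the nodes of the temporary pool carry the label pool=temp (implied by the nodeSelector used later, but not stated outright); the plan file name, the directory argument, and the run wrapper are hypothetical.

```shell
set -eu

# Print the command when DRY_RUN is 1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1: Terraform working directory (assumption, not given in the source).
deploy_temp_pool() {
  tf_dir="$1"
  # Save the plan so that apply executes exactly what was reviewed.
  run terraform -chdir="$tf_dir" plan -out=temp-pool.tfplan
  run terraform -chdir="$tf_dir" apply temp-pool.tfplan
  # Assumed label: nodes of the temporary pool are labeled pool=temp.
  run kubectl get nodes -l pool=temp
}
```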

2. Migrate to Green (Temp)

With the temporary pool ready, migrate workloads from the original pool to the new one.^[400-devops__05-Cloud-Provider__GCP升级.md]

  • Update Selectors: In the Kubernetes manifests (e.g., in win-env-project/dev/kube), change the nodeSelector from pool: app to pool: temp^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Apply Changes: Run the apply script (e.g., 03-apply-kube.sh) so that the pods are rescheduled onto the temp pool^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Rolling Update: For sensitive services such as Kafka, update the selectors one broker at a time (e.g., kafka1, then kafka2, then kafka3), waiting approximately 3 minutes between brokers to ensure stability^[400-devops__05-Cloud-Provider__GCP升级.md].
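The selector change and the staggered Kafka rollout can be sketched in shell as follows. Several things here are assumptions rather than facts from the source: one manifest file per broker named kafka1.yaml, kafka2.yaml, and kafka3.yaml, the selector written as an unquoted pool: app line, and kubectl apply -f standing in for whatever 03-apply-kube.sh actually does.

```shell
set -eu

# Print the command when DRY_RUN is 1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Rewrite "pool: <from>" to "pool: <to>" in one manifest, keeping indentation.
# Only unquoted values on their own line are matched.
switch_pool() {
  from="$1"; to="$2"; file="$3"
  tmp="$(mktemp)"
  sed "s/^\([[:space:]]*pool:[[:space:]]*\)${from}[[:space:]]*\$/\1${to}/" "$file" > "$tmp"
  cat "$tmp" > "$file"   # overwrite in place so file permissions are kept
  rm -f "$tmp"
}

# Move the Kafka brokers one at a time, pausing between brokers.
# $1: current pool, $2: target pool, $3: manifest directory (hypothetical layout).
migrate_kafka() {
  from="$1"; to="$2"; dir="$3"
  for broker in kafka1 kafka2 kafka3; do
    switch_pool "$from" "$to" "$dir/$broker.yaml"
    run kubectl apply -f "$dir/$broker.yaml"
    run sleep 180        # roughly 3 minutes between brokers
  done
}
```

With the arguments app and temp this covers the move to the Green pool; swapping them gives the reverse direction. Note that the manifest edit happens even in dry-run mode, so only the kubectl and sleep calls are previewed.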

3. Upgrade Original Pool (Blue)

Once all workloads are running on the temp pool, the original pool is safe to upgrade.
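One way to check that claim before touching the pool is to list the pods still scheduled on the original nodes. The sketch below assumes the original nodes carry the label pool=app (implied by the nodeSelector, not stated outright); the function names and the run wrapper are hypothetical. DaemonSet pods are expected to remain on every node, so the output needs to be read with that in mind.

```shell
set -eu

# Print the command when DRY_RUN is 1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List the pods scheduled on each node name passed as an argument.
pods_on_nodes() {
  for node in "$@"; do
    run kubectl get pods --all-namespaces -o wide \
      --field-selector "spec.nodeName=$node"
  done
}

# Look up the node names of the original pool and inspect each of them.
# The node lookup is read-only and always runs, even when DRY_RUN is 1.
check_blue_pool() {
  nodes="$(kubectl get nodes -l pool=app \
    -o jsonpath='{.items[*].metadata.name}')"
  # Word splitting is intended here: one argument per node name.
  pods_on_nodes $nodes
}
```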

  • Upgrade: In win-env-project/dev/gcloud/modules/site/03-node-pool.tf, set the google_container_node_pool resource (specifically node_pool) to the new GKE version^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Terraform Apply: Run terraform apply again to update the infrastructure^[400-devops__05-Cloud-Provider__GCP升级.md].
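After the apply finishes, it may be worth confirming that every node in the original pool reports the target version before moving workloads back. The following sketch assumes the nodes are labeled pool=app and that the kubelet reports the GKE version with a leading v (for example v1.21.13-gke.900); both function names are hypothetical.

```shell
set -eu

# Succeeds only if every non-empty version line read from stdin equals the
# target. A leading "v" is ignored on both sides. Empty input counts as failure.
all_at_version() {
  target="${1#v}"
  seen=0
  while IFS= read -r version; do
    [ -n "$version" ] || continue
    seen=1
    [ "${version#v}" = "$target" ] || return 1
  done
  [ "$seen" -eq 1 ]
}

# $1: target GKE version, e.g. 1.21.13-gke.900
check_pool_version() {
  kubectl get nodes -l pool=app \
    -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
    all_at_version "$1"
}
```

check_pool_version exits with a non-zero status if any node is on a different version or if no node matches the label, which makes it usable as a gate before step 4.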

4. Migrate Back to Blue

Now that the original app pool is upgraded and running the new GKE version, reverse the migration process.

  • Update Selectors: Change the nodeSelector in the Kubernetes manifests from pool: temp back to pool: app^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Apply Changes: Run 03-apply-kube.sh again to move pods back to the upgraded pool^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Stagger Updates: Again, update services like Kafka sequentially with 3-minute intervals^[400-devops__05-Cloud-Provider__GCP升级.md].

5. Cleanup and Post-Upgrade Tasks

Finalize the environment by removing temporary configurations and performing manual checks.

  • Cleanup Terraform: Rename 11-app.tf back to 11-app.tf.back so that the temporary pool configuration is excluded from the next apply^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Network Adjustment: Manually verify and, if necessary, adjust the VPC network ops IP for Jenkins in the GCP console^[400-devops__05-Cloud-Provider__GCP升级.md].
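The Terraform cleanup can be sketched as a guarded rename followed by a plan, so the effect is visible before anything is applied. The directory argument, the function name, and the run wrapper are assumptions. Because Terraform no longer sees resources whose configuration file has been renamed away, the plan would normally list the resources from 11-app.tf as scheduled for destruction; that output should be reviewed before running apply.

```shell
set -eu

# Print the command when DRY_RUN is 1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1: Terraform working directory (assumed to contain 11-app.tf).
cleanup_temp_config() {
  tf_dir="$1"
  if [ ! -f "$tf_dir/11-app.tf" ]; then
    echo "11-app.tf not found in $tf_dir" >&2
    return 1
  fi
  # The rename is performed even in dry-run mode; only terraform is previewed.
  mv "$tf_dir/11-app.tf" "$tf_dir/11-app.tf.back"
  run terraform -chdir="$tf_dir" plan
}
```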

Sources

^[400-devops__05-Cloud-Provider__GCP升级.md]