GKE upgrade workflow

The GKE upgrade workflow is a manual process designed to upgrade the Google Kubernetes Engine (GKE) version with minimal downtime. The strategy involves creating a temporary node pool to handle traffic during the upgrade, ensuring service continuity while the primary node pool is updated^[400-devops__05-Cloud-Provider__GCP升级.md].

Prerequisites and Version Selection

Before starting the upgrade, authentication and initialization are required.^[400-devops__05-Cloud-Provider__GCP升级.md]

  • Authentication: Run gcloud auth login to access the GCP project.^[400-devops__05-Cloud-Provider__GCP升级.md]
  • State Refresh: Execute terraform init followed by terraform refresh to confirm that the Terraform state matches the current GCP infrastructure.^[400-devops__05-Cloud-Provider__GCP升级.md]
  • Target Version: From the GKE release notes, pick the specific target version listed under "No channel" (the versions available to clusters that are not enrolled in a release channel).^[400-devops__05-Cloud-Provider__GCP升级.md]
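
The prerequisite steps above can be sketched as a small script. This is a minimal sketch, not the team's actual tooling: the `run` helper, the `DRY_RUN` switch, and the `ZONE` placeholder are assumptions added for illustration, and the `gcloud container get-server-config` call is an optional cross-check that the source does not mention. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the prerequisite steps. Commands are only printed unless DRY_RUN=0.
set -eu

DRY_RUN="${DRY_RUN:-1}"
ZONE="${ZONE:-us-central1-a}"   # placeholder zone (assumption); use the cluster's real location

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

prerequisites() {
  run gcloud auth login
  run terraform init
  run terraform refresh
  # Show the control-plane versions currently offered in this zone,
  # to cross-check the version picked from the release notes.
  run gcloud container get-server-config --zone="$ZONE" --format="yaml(validMasterVersions)"
}

prerequisites
```

Running it with `DRY_RUN=0` executes the commands for real. On recent Terraform releases `terraform refresh` is deprecated in favor of `terraform apply -refresh-only`, so the third command may need adjusting depending on the installed version.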

Workflow Steps

The upgrade process follows a specific sequence to prepare the environment, shift workloads, upgrade the infrastructure, and then shift workloads back.

1. Environment Preparation

First, the Terraform configuration for the new node pool must be enabled and the target version defined.

  1. Restore Configuration: Rename the backup file 11-app.tf.back to 11-app.tf.^[400-devops__05-Cloud-Provider__GCP升级.md]
  2. Set Version: Update the variable gke_version in win-env-project\dev\gcloud\01-variables.tf and win-env-project\dev\gcloud\11-app.tf to the desired version (e.g., "1.21.13-gke.900").^[400-devops__05-Cloud-Provider__GCP升级.md]
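
Setting the version by hand in two files is easy to get wrong, so the edit can be scripted. The sketch below assumes that `gke_version` is declared as a Terraform `variable` block with a `default` value; the source does not show the file contents, so that layout, the `set_gke_version` helper, and the sample file are all hypothetical. It demonstrates the edit on a temporary file rather than on the real configuration.

```shell
#!/bin/sh
set -eu

# Replace the default value inside the `variable "gke_version"` block of a file.
# Assumes the block's closing brace sits at the start of a line.
set_gke_version() {
  file="$1"
  version="$2"
  tmp="$(mktemp)"
  sed "/variable \"gke_version\"/,/^}/ s/\(default[[:space:]]*=[[:space:]]*\)\"[^\"]*\"/\1\"$version\"/" "$file" > "$tmp"
  # Write back through the original file so its permissions are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demonstration on a throwaway file with placeholder content.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
variable "gke_version" {
  default = "OLD-VERSION"
}
variable "other" {
  default = "unchanged"
}
EOF

set_gke_version "$demo" "1.21.13-gke.900"
cat "$demo"
```

Against the real tree, the same function would be called once per file after `11-app.tf.back` has been renamed to `11-app.tf`. Reviewing the result with `git diff` or similar before planning is advisable, since the pattern only matches the layout assumed here.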

2. Temporary Node Pool Creation

A temporary node pool is provisioned to serve as a staging area for pods during the main upgrade.

  1. Modify App Config: In win-env-project\dev\11-app.tf, change the resource name from app to temp. This configuration defines a google_container_node_pool for the temporary environment.^[400-devops__05-Cloud-Provider__GCP升级.md]
  2. Provision: Run terraform plan and terraform apply to create the temporary node pool named node_pool_1 (temp).^[400-devops__05-Cloud-Provider__GCP升级.md]
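
One way to run the provisioning step is to save the plan to a file and apply exactly that file, so that what gets applied is what was reviewed. The following is a sketch under assumptions: the plan file name, the `CLUSTER` and `ZONE` placeholders, and the final `gcloud container node-pools list` verification are illustrative additions, not part of the documented procedure. Commands are printed rather than executed unless `DRY_RUN=0`.

```shell
#!/bin/sh
set -eu

DRY_RUN="${DRY_RUN:-1}"
CLUSTER="${CLUSTER:-my-cluster}"   # placeholder cluster name (assumption)
ZONE="${ZONE:-us-central1-a}"      # placeholder zone (assumption)
PLAN_FILE="temp-pool.tfplan"       # hypothetical plan file name

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

create_temp_pool() {
  # Save the plan so the apply step uses exactly what was reviewed.
  run terraform plan -out="$PLAN_FILE"
  run terraform apply "$PLAN_FILE"
  # Confirm the new pool shows up next to the existing one.
  run gcloud container node-pools list --cluster="$CLUSTER" --zone="$ZONE"
}

create_temp_pool
```

The plan output should show only the temporary node pool being added; any planned change to the existing pool at this stage is a sign that the configuration needs another look before applying.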

3. Migrate Workloads to Temporary Pool

With the temporary pool active, Kubernetes workloads are shifted away from the node pool that will be upgraded.

  1. Update Node Selectors: In win-env-project\dev\kube, modify the nodeSelector from pool: app to pool: temp.^[400-devops__05-Cloud-Provider__GCP升级.md]
  2. Apply Configuration: Use 03-apply-kube.sh to apply these changes.^[400-devops__05-Cloud-Provider__GCP升级.md]
  3. Rolling Update: For specific components like Kafka, switch them one by one (e.g., kafka1, kafka2, kafka3) to the temp pool with approximately 3-minute intervals to manage load.^[400-devops__05-Cloud-Provider__GCP升级.md]
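
The node selector change can be applied mechanically. The sketch below assumes the manifests are YAML files in which the selector appears on its own line as `pool: app`, unquoted; the `switch_pool` helper and the sample manifest are hypothetical, and the demonstration runs against a temporary directory instead of the real `kube` folder.

```shell
#!/bin/sh
set -eu

# Rewrite "pool: <from>" to "pool: <to>" in every YAML file under a path.
# The path may be a directory or a single manifest file.
switch_pool() {
  path="$1"
  from="$2"
  to="$3"
  find "$path" -type f \( -name '*.yaml' -o -name '*.yml' \) | while IFS= read -r f; do
    tmp="$(mktemp)"
    # Match only lines whose key is exactly "pool" and whose value is exactly $from.
    sed "s/^\([[:space:]]*pool:[[:space:]]*\)$from[[:space:]]*\$/\1$to/" "$f" > "$tmp"
    cat "$tmp" > "$f"
    rm -f "$tmp"
  done
}

# Demonstration on a throwaway manifest.
demo="$(mktemp -d)"
printf 'spec:\n  nodeSelector:\n    pool: app\n' > "$demo/kafka1.yaml"

switch_pool "$demo" app temp
cat "$demo/kafka1.yaml"
```

For components that must move one at a time, such as the Kafka brokers, the function can be pointed at a single manifest, followed by `03-apply-kube.sh` and a pause of about three minutes (`sleep 180`) before the next one. Calling it with the arguments reversed (`temp app`) covers the later move back to the app pool.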

4. Perform GKE Upgrade

Once the primary node pool (app) is drained of active workloads, the upgrade can be performed at the infrastructure level.

  1. Upgrade Module: Navigate to win-env-project\dev\gcloud\modules\site\03-node-pool.tf.^[400-devops__05-Cloud-Provider__GCP升级.md]
  2. Update Version: Modify the google_container_node_pool "node_pool" resource to upgrade the GKE version.^[400-devops__05-Cloud-Provider__GCP升级.md]
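
Before changing the version, it can help to confirm that the app pool really is empty of workloads. The following sketch assumes the nodes carry the label `pool=app`, which is what the `nodeSelector` used in this workflow implies; the `pods_on_pool` helper and the `KUBECTL` override are hypothetical names introduced here. The block only defines the function and does not contact a cluster until it is called.

```shell
#!/bin/sh
set -eu

KUBECTL="${KUBECTL:-kubectl}"   # overridable so the function can be exercised with a stub

# List the pods still scheduled on nodes labelled pool=<name>,
# one "namespace name node" row per pod.
pods_on_pool() {
  pool="$1"
  for node in $($KUBECTL get nodes -l "pool=$pool" -o name); do
    node="${node#node/}"   # "-o name" prints "node/<name>"
    $KUBECTL get pods --all-namespaces --no-headers \
      --field-selector "spec.nodeName=$node" \
      -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName
  done
}
```

Calling `pods_on_pool app` against the cluster lists whatever is still running there. DaemonSet and system pods are expected to remain on the nodes, so the output needs to be read rather than simply checked for emptiness; application pods in the list mean the migration to the temp pool is not finished.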

5. Migrate Workloads Back to App Pool

After the upgrade is complete, workloads must be shifted back to the upgraded primary pool.

  1. Restore Node Selectors: Change the nodeSelector in win-env-project\dev\kube back from pool: temp to pool: app.^[400-devops__05-Cloud-Provider__GCP升级.md]
  2. Apply Configuration: Run 03-apply-kube.sh again.^[400-devops__05-Cloud-Provider__GCP升级.md]
  3. Rolling Update: Switch components (e.g., kafka1, kafka2, kafka3) back to the app pool one by one with ~3-minute intervals.^[400-devops__05-Cloud-Provider__GCP升级.md]

6. Cleanup and Post-Upgrade

Finalize the process by removing the temporary configuration and verifying network settings.

  1. Backup Config: Rename 11-app.tf back to 11-app.tf.back to disable the temporary pool resource.^[400-devops__05-Cloud-Provider__GCP升级.md]
  2. Verify Network: Manually check the VPC network settings and Ops IPs and adjust them if necessary, for instance to ensure Jenkins connectivity (referenced via kube.16888dev.com:30100).^[400-devops__05-Cloud-Provider__GCP升级.md]

Sources

^[400-devops__05-Cloud-Provider__GCP升级.md]