Terraform GKE version management

This document outlines the standard operating procedure for upgrading the Google Kubernetes Engine (GKE) version using Terraform. The process involves infrastructure provisioning via Terraform and workload migration strategies to ensure zero downtime.

Prerequisites

Before initiating the upgrade, verify the authentication status and synchronize the Terraform state with the actual cloud infrastructure^[400-devops__05-Cloud-Provider__GCP升级.md].

  • Run gcloud auth login^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Initialize and refresh the Terraform workspace^[400-devops__05-Cloud-Provider__GCP升级.md]:
    terraform init
    terraform refresh
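
The prerequisite checks could be wrapped in a small helper. This is a minimal sketch, not part of the original procedure: the function name `preflight` is hypothetical, and it assumes `gcloud` and `terraform` are on the `PATH` and that the current directory is the Terraform workspace.

```shell
# Pre-flight sketch: confirm an active gcloud account, then sync Terraform state.
preflight() {
  local account
  # Prints only the account that gcloud currently marks as active.
  account="$(gcloud auth list --filter=status:ACTIVE --format='value(account)')"
  if [ -z "$account" ]; then
    echo "No active gcloud account; run 'gcloud auth login' first." >&2
    return 1
  fi
  echo "Active account: $account"

  terraform init || return 1
  # Newer Terraform releases mark `terraform refresh` as deprecated in favor of
  # `terraform apply -refresh-only`; the command from the procedure is kept here.
  terraform refresh || return 1
}
```

Calling `preflight` stops at the first failing step, so a missing login is reported before Terraform touches the state.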
    

Version Selection

Identify the target GKE version by consulting the official release notes^[400-devops__05-Cloud-Provider__GCP升级.md]. You must select a version listed in the "No channel" (static) section^[400-devops__05-Cloud-Provider__GCP升级.md].
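
As a cross-check on the release notes, the versions that GKE currently offers in a location can be listed from the command line. The sketch below is an assumption-laden helper rather than part of the original procedure: the function names are hypothetical, the cluster is assumed to be zonal (use `--region` instead of `--zone` for a regional cluster), and the output is assumed to be a `;`-separated list, which is how gcloud's `value` format usually prints list fields.

```shell
# List the control-plane versions GKE currently offers in a location.
list_gke_versions() {
  local location="$1"   # zone of the cluster (assumed zonal)
  gcloud container get-server-config \
    --zone "$location" \
    --format='value(validMasterVersions)'
}

# Succeed only if the exact target version appears in that list.
version_is_offered() {
  local target="$1" location="$2"
  # Split the ';'-separated list into lines, then match whole lines only.
  list_gke_versions "$location" | tr ';' '\n' | grep -Fxq "$target"
}
```

For example, `version_is_offered "1.21.13-gke.900" "<zone>" && echo ok` prints `ok` only when that exact version string is still offered.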

Terraform Configuration

  1. Variable Definition: Update the gke_version variable in your configuration files (e.g., 01-variables.tf)^[400-devops__05-Cloud-Provider__GCP升级.md].
    variable "gke_version" { default = "1.21.13-gke.900" }
    
  2. Enable Resource File: Restore the active application configuration by renaming 11-app.tf.back to 11-app.tf^[400-devops__05-Cloud-Provider__GCP升级.md].
  3. Create Temporary Node Pool: Modify the resource name in 11-app.tf (e.g., app -> temp) and apply to create a temporary node pool^[400-devops__05-Cloud-Provider__GCP升级.md]. This new pool (e.g., node_pool_1) will hold workloads during the upgrade process^[400-devops__05-Cloud-Provider__GCP升级.md].
    terraform plan
    terraform apply
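
One way to make sure the applied changes are exactly the ones that were reviewed is to save the plan to a file and apply that file. The following is a sketch under assumptions: the function names and the plan file name `temp-pool.tfplan` are hypothetical, and the node label `pool=temp` is inferred from the `nodeSelector` used later in this procedure.

```shell
# Save the plan so it can be reviewed before anything changes.
plan_temp_pool() {
  local plan_file="${1:-temp-pool.tfplan}"   # hypothetical file name
  terraform plan -out="$plan_file"
}

# Apply exactly the saved plan, then wait for the new nodes to become Ready.
apply_temp_pool() {
  local plan_file="${1:-temp-pool.tfplan}"
  terraform apply "$plan_file" || return 1
  # Assumes nodes in the temporary pool carry the label pool=temp.
  kubectl wait --for=condition=Ready nodes -l pool=temp --timeout=600s
}
```

Run `plan_temp_pool`, read the output, and only then run `apply_temp_pool`. The 600-second timeout is an arbitrary illustration value.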
    

Workload Migration

Migrate your workloads from the original node pool (app) to the temporary pool (temp) to prepare for the infrastructure upgrade^[400-devops__05-Cloud-Provider__GCP升级.md].

  • Update Kubernetes manifests in your configuration directory (e.g., win-env-project\dev\kube)^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Change the nodeSelector from pool: app to pool: temp^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Apply changes iteratively (e.g., for Kafka nodes 1, 2, and 3) with approximately 3-minute intervals to maintain stability^[400-devops__05-Cloud-Provider__GCP升级.md].
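
The selector change and the staggered rollout above can be sketched as two small helpers. This is an illustration, not the team's actual tooling: the function names and manifest file names are hypothetical, and the `sed` pattern assumes the selector is written literally as `pool: app` at the end of a line with LF line endings.

```shell
# Rewrite "pool: <from>" to "pool: <to>" in one manifest.
switch_pool() {
  local from="$1" to="$2" file="$3"
  # Write to a temporary file first; this avoids the non-portable `sed -i`.
  sed "s/pool: ${from}\$/pool: ${to}/" "$file" > "${file}.tmp" \
    && mv "${file}.tmp" "$file"
}

# Apply manifests one at a time, pausing between them (not after the last).
rolling_apply() {
  local interval="$1"   # seconds to wait between manifests
  shift
  local manifest first=1
  for manifest in "$@"; do
    if [ "$first" -eq 0 ]; then
      sleep "$interval"
    fi
    first=0
    echo "Applying $manifest"
    kubectl apply -f "$manifest" || return 1
  done
}
```

A run might look like `switch_pool app temp kafka-1.yaml` for each manifest, followed by `rolling_apply 180 kafka-1.yaml kafka-2.yaml kafka-3.yaml`, where 180 seconds matches the roughly 3-minute interval. The loop stops at the first failed apply, so a broken manifest does not cascade to the remaining nodes.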

Infrastructure Upgrade

With workloads safely moved to the temporary pool, upgrade the main infrastructure^[400-devops__05-Cloud-Provider__GCP升级.md].

  • Modify the google_container_node_pool resource definition in modules/site/03-node-pool.tf to target the new GKE version^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Run terraform apply to upgrade the app node pool^[400-devops__05-Cloud-Provider__GCP升级.md].
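
After the apply finishes, it may be worth confirming that the node pool actually reports the target version before moving workloads back. A minimal sketch, assuming a zonal cluster (use `--region` for a regional one); the function name and its argument order are hypothetical.

```shell
# Compare the version reported for a node pool with the expected one.
assert_pool_version() {
  local expected="$1" pool="$2" cluster="$3" location="$4" actual
  actual="$(gcloud container node-pools describe "$pool" \
    --cluster "$cluster" --zone "$location" \
    --format='value(version)')" || return 1
  if [ "$actual" != "$expected" ]; then
    echo "Node pool $pool is on $actual, expected $expected" >&2
    return 1
  fi
  echo "Node pool $pool is on $actual"
}
```

For instance, `assert_pool_version "1.21.13-gke.900" app <cluster> <zone>` returns non-zero if the `app` pool is still on the old version.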

Restoration and Cleanup

After the infrastructure upgrade is complete, migrate workloads back to the upgraded node pool^[400-devops__05-Cloud-Provider__GCP升级.md].

  • Revert the nodeSelector in your Kubernetes manifests from pool: temp back to pool: app^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Apply these changes iteratively (e.g., Kafka nodes 1, 2, 3) with 3-minute intervals^[400-devops__05-Cloud-Provider__GCP升级.md].
  • Disable the temporary Terraform configuration by renaming 11-app.tf back to 11-app.tf.back^[400-devops__05-Cloud-Provider__GCP升级.md].
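
Before disabling the temporary configuration, a quick check that nothing important is still scheduled on the temporary pool can help avoid surprises. The sketch below is an assumption, not part of the original procedure: the function names are hypothetical, nodes are assumed to carry the label `pool=temp`, and the final `terraform plan` is added only as a preview, since the source does not say whether an apply follows the rename.

```shell
# List pods still running on nodes of the given pool.
pods_on_pool() {
  local pool="$1" node
  # `-o name` prints entries such as "node/<name>"; strip the prefix below.
  for node in $(kubectl get nodes -l "pool=${pool}" -o name); do
    kubectl get pods --all-namespaces -o wide --no-headers \
      --field-selector "spec.nodeName=${node#node/}"
  done
}

# Disable the temporary configuration and preview the resulting changes.
disable_temp_config() {
  mv 11-app.tf 11-app.tf.back || return 1
  terraform plan
}
```

`pods_on_pool temp` will usually still list DaemonSet pods, which run on every node; only application pods in that output indicate an incomplete migration.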

Post-Upgrade Tasks

Verify the environment and perform manual adjustments for external services^[400-devops__05-Cloud-Provider__GCP升级.md].

  • Update the Jenkins VPC network ops IP to ensure connectivity^[400-devops__05-Cloud-Provider__GCP升级.md].

Sources

  • 400-devops__05-Cloud-Provider__GCP升级.md