K8s Cluster Upgrade Strategies: Best Practices for your Stateful Workload
Peter Schuurman (@pwschuurman), Software Engineer / Google
2022-08-12
Agenda
● Why Upgrade?
● Stateful Workloads and Upgrades
● Nodepool Upgrade Strategies
● Control Plane Upgrade Strategies
● Upgrade Strategy and Workload Selection
Why Upgrade: Kubernetes Version Lifecycle (source)
Why Upgrade: Modern and Protected
● Version Skew: The Kubernetes Version Skew Policy allows nodes to run up to 2 minor versions behind the control plane
● New Features: New features are introduced in upcoming Kubernetes versions. E.g., StatefulSet MaxUnavailable was introduced in 1.24
● Security Compliance: Organizations following compliance protocols (PCI, HIPAA, FedRAMP) are required to apply security patches within 30 days of availability
● Patch Support: Kubernetes minor versions are maintained for 1 year
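The version skew rule on this slide can be sketched as a small check. This is a minimal illustration, not an official tool: the function names and the default of 2 minor versions are assumptions taken from the slide, and both versions are assumed to share the same major version.

```python
def minor(version: str) -> int:
    """Return the minor component of a 'MAJOR.MINOR[.PATCH]' string, e.g. 'v1.24.3' -> 24."""
    return int(version.lstrip("v").split(".")[1])


def node_within_skew(control_plane: str, node: str, max_skew: int = 2) -> bool:
    """True if the node is not newer than the control plane and is at most
    max_skew minor versions behind it (hypothetical helper; same major assumed)."""
    gap = minor(control_plane) - minor(node)
    return 0 <= gap <= max_skew


if __name__ == "__main__":
    # A 1.22 node under a 1.24 control plane is inside the 2-minor window.
    print(node_within_skew("1.24.3", "1.22.9"))
    # A 1.21 node is outside it and would need upgrading first.
    print(node_within_skew("1.24.3", "1.21.14"))
```

A check like this is useful before a control plane upgrade: any node pool that would fall outside the window after the upgrade has to be upgraded first.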
Why Upgrade: Modern Applications
MariaDB has modernized their architecture by bringing SkySQL to the cloud on Kubernetes. Built using the Kubernetes operator pattern, MariaDB leverages resiliency and maintains high availability during upgrades.
"We have been using containers for many years … Our goal was to simplify the implementation and focus less on lower-level infrastructure, dependencies and instance life-cycle. With Kubernetes, our engineers could leverage the strong momentum from the open source community to drive infrastructure logic and security." (Reference)
Why Upgrade: Upgrade Dimensions
● Application Developer
● Kubernetes Administrator
● Cloud Platform
Why Upgrade: Upgrade Dimensions
● Application Compatibility: Ensuring your application is compatible with an upgraded Kubernetes version
● Kubernetes (node or control plane)
  ● Nodes: Upgrading the operating system, dependent libraries and Kubernetes software of your cluster’s data plane
  ● Control Plane: Upgrading the operating system and Kubernetes software of your cluster’s orchestration layer
Why Upgrade: Key Concerns
● Application Availability
● Cost
● Speed
Nodes: Surge Upgrades
● Application Availability: Suitable for fault-tolerant workloads. Control availability by specifying node maxUnavailable
● Cost: Cost effective
● Speed: Increase upgrade velocity with parallel node surge
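How the surge settings translate into parallel batches can be sketched as follows. This is a simplified model under stated assumptions: the function name is hypothetical, and the batch size is assumed to be maxSurge + maxUnavailable, which is how some managed platforms bound the number of nodes upgraded at once. Real upgrades also drain pods and respect PodDisruptionBudgets, which this sketch ignores.

```python
from typing import List


def surge_batches(nodes: List[str], max_surge: int, max_unavailable: int) -> List[List[str]]:
    """Split a node pool into upgrade batches (hypothetical helper).

    Assumes up to max_surge + max_unavailable nodes are upgraded in parallel.
    """
    size = max_surge + max_unavailable
    if size < 1:
        raise ValueError("max_surge + max_unavailable must be at least 1")
    # Consecutive slices of the node list, each at most `size` long.
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


if __name__ == "__main__":
    pool = ["node-1", "node-2", "node-3", "node-4", "node-5"]
    for batch in surge_batches(pool, max_surge=1, max_unavailable=1):
        print(batch)
```

Raising either setting shortens the upgrade by enlarging each batch; raising maxUnavailable does so at the expense of serving capacity, which is why it is the availability control for stateful workloads.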
Nodes: Blue/Green Upgrades
● Application Availability: Granular control during migration
● Cost: Increased cost with resource pre-provisioning
● Speed: Slow and controlled
Node Upgrade Takeaways
Surge Upgrades
● Application Availability: Rollback scenarios may take more time
● Cost: Lower cost, upgraded node creation occurs just in time
● Speed: Nodes can be upgraded in batches for increased speed
Blue/Green Upgrades
● Application Availability: High degree of application availability
● Cost: Higher cost, upgraded nodes are pre-provisioned
● Speed: Higher control over node migration reduces speed
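The cost difference between the two node strategies can be made concrete by comparing peak node counts. This is a rough sketch with assumed names: it treats a surge upgrade as needing only the surge nodes on top of the pool, and a blue/green upgrade as pre-provisioning a full second pool. Actual billing depends on how long the extra nodes exist, which is not modeled here.

```python
def peak_node_count(strategy: str, pool_size: int, max_surge: int = 1) -> int:
    """Peak number of nodes that exist at once during an upgrade (hypothetical helper)."""
    if strategy == "surge":
        # Upgraded nodes are created just in time, at most max_surge extra at once.
        return pool_size + min(max_surge, pool_size)
    if strategy == "blue-green":
        # The upgraded (green) pool is fully provisioned alongside the old (blue) pool.
        return 2 * pool_size
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    print(peak_node_count("surge", pool_size=10, max_surge=2))
    print(peak_node_count("blue-green", pool_size=10))
```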
Control Plane: Upgrades
● Kubernetes maintains API versions with each minor release
● API schema may change with new minor versions
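Because API versions change between minor releases, one pre-upgrade step is to scan manifests for API versions that the target release no longer serves. The sketch below is illustrative only: the helper name is hypothetical and the table is a small excerpt of known removals, not a complete list.

```python
from typing import Dict, List, Tuple

# Excerpt of (apiVersion, kind) pairs and the minor release (1.x) that stopped serving them.
REMOVED_IN_MINOR: Dict[Tuple[str, str], int] = {
    ("extensions/v1beta1", "Ingress"): 22,
    ("batch/v1beta1", "CronJob"): 25,
    ("policy/v1beta1", "PodDisruptionBudget"): 25,
}


def removed_apis(manifests: List[dict], target_minor: int) -> List[Tuple[str, str]]:
    """Return the (apiVersion, kind) pairs in manifests that are gone at 1.<target_minor>."""
    found = []
    for manifest in manifests:
        key = (manifest.get("apiVersion", ""), manifest.get("kind", ""))
        removed_at = REMOVED_IN_MINOR.get(key)
        if removed_at is not None and removed_at <= target_minor:
            found.append(key)
    return found


if __name__ == "__main__":
    manifests = [
        {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget"},
        {"apiVersion": "apps/v1", "kind": "StatefulSet"},
    ]
    # The beta PodDisruptionBudget must be migrated to policy/v1 before moving to 1.25.
    print(removed_apis(manifests, target_minor=25))
```

PodDisruptionBudgets are a relevant case for stateful workloads, since they are often what protects quorum during node drains.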
Control Plane: Surge Upgrade
● Application Availability: HA control plane setups limit disruptions. Kubernetes minor version rollback is not supported
● Cost: Cost effective
● Speed: Fast
Control Plane: Blue/Green Upgrade
● Application Availability: Granular control over application upgrade. Safe minor version rollback
● Cost: Increased cost over in-place upgrades with cluster pre-provisioning
● Speed: Slow and controlled
Control Plane: Blue/Green Upgrade
● KEP-3335: Introduces building blocks to the StatefulSet API to enable StatefulSet replicas to be moved across clusters
● With Kubernetes Multi-Cluster Services (KEP-1645), applications can maintain connectivity
● Demo
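The replica-by-replica move that KEP-3335 enables can be sketched as a sequence of settings for the two StatefulSets. This is an illustration under assumptions, not the demo from the talk: it assumes the KEP's start-ordinal field (`.spec.ordinals.start`), moves the highest ordinal first, and uses a hypothetical helper name. It does not model moving the underlying data or the order in which the two clusters are updated within a step.

```python
from typing import Dict, List


def migration_steps(replicas: int) -> List[Dict[str, Dict[str, int]]]:
    """Plan a one-replica-at-a-time move from a source to a destination cluster.

    After step k, the source keeps ordinals 0..replicas-k-1 and the destination
    runs ordinals replicas-k..replicas-1, so each ordinal exists in one cluster only.
    """
    steps = []
    for moved in range(1, replicas + 1):
        remaining = replicas - moved
        steps.append({
            # Scaling down the source removes its highest remaining ordinal.
            "source": {"replicas": remaining},
            # The destination starts numbering where the source now ends.
            "destination": {"ordinals_start": remaining, "replicas": moved},
        })
    return steps


if __name__ == "__main__":
    for step in migration_steps(3):
        print(step)
```

Because the ordinal ranges never overlap, pod names stay unique across both clusters, which is what lets a multi-cluster service keep routing to each replica during the move.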
Control Plane Upgrade Takeaways
Surge Upgrades
● Application Availability: Rollback is not possible
● Cost: Lower cost, upgraded control plane creation occurs just in time
● Speed: Control Plane upgrade is fast and scales sub-linearly as cluster size increases
Blue/Green Upgrades
● Application Availability: Applications can be rolled back to a cluster with a known compatible Control Plane
● Cost: Higher cost, cluster pre-provisioned
● Speed: Upgrade speed scales with application migration speed
Takeaways
● Trade-off between business requirements: application availability, speed and cost
● Modern applications update consistently and often
● Kubernetes has the tools to support safe stateful upgrades today, and the community is building new tools to increase this margin of safety

Kubernetes Cluster Upgrade Strategies and Data: Best Practices for your Stateful Workload
