1 SRE and GitOps for Building Robust Kubernetes Platforms Chris Lavery - Senior Site Reliability Engineer Confidential do not distribute
2 Webinar Platform - FAQs Using Zoom • You are in listen-only mode • This webinar is being recorded • A Q&A session will follow the presentation; please use the Q&A panel to submit questions • Hit escape to exit full screen • Please introduce yourself in the chat • Technical issues: please visit Zoom Help https://support.zoom.us/hc/en-us/articles/206175806-Top-Questions
3 Weaveworks: the GitOps company ▪ Weaveworks is backed by solid investors ▪ Weaveworks created the GitOps methodology and tooling to solve our own Kubernetes management, scalability, and reliability requirements ▪ Weaveworks is a key partner with all the major infrastructure and Kubernetes vendors ▪ Weaveworks is deeply committed to the Open Source Community
4 Chris Lavery, Site Reliability Engineer, Weaveworks. Belfast, Northern Ireland. Chris Lavery is a Senior Site Reliability Engineer at Weaveworks (currently on secondment to Deutsche Telekom) where he champions continuous improvement through DevOps/GitOps practices, collaborating with multiple technical and non-technical stakeholders to achieve organisational goals effectively. Chris has experience with high performance computing and modern data center architectures, and is familiar with use cases across several verticals (Telecoms, Fintech, Gaming). Outside of work Chris enjoys cycling, music and a never-ending list of DIY tasks. https://www.linkedin.com/in/christopherlavery https://twitter.com/mrchrislavery github.com/fire-ant
5 Introducing Site Reliability Engineering (SRE) A (basic) definition: ▪ Operations focused (Production Infrastructure, On-call work) ▪ Often Embedded within Development teams to facilitate better operational outcomes ▪ Applying Software Engineering Principles to Operational Problems.
6 Introducing Site Reliability Engineering (SRE) Why SRE? What has changed? ▪ DevOps has progressed the culture of collaboration between development and operations ▪ Cloud computing has standardized/commodified aspects of software businesses, i.e. infrastructure as a service ▪ Consumers/End Users have transitioned to digital services
7 Data Driven Decisions ▪ Modern Distributed Systems have more moving parts ▪ These systems may be a hybrid of managed cloud resources/services and bespoke custom logic ▪ Modern Application Architectures can also be distributed in nature ▪ Increasing Cardinality (more things per thing) can provide better granularity/visibility but at the cost of complexity
8 Data Driven Decisions Observability: ▪ System/Service Metrics (Infrastructure) ▪ Logging (and Events) ▪ Tracing (and Spans) ▪ APM (Application Performance Monitoring)
9 SLIs and SLOs: Dashboard Gauges, Minimum Indicators and Penalty points ▪ SLIs - Service Level Indicators ▪ A measurement derived from one or more underlying metrics to support an SLO ▪ For example: a Yield SLI (the fraction of successful requests) can inform an availability SLA that is based on an SLO of 97.5%
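As a rough illustration of the Yield SLI example, the sketch below computes the fraction of successful requests and compares it with the 97.5% SLO from the slide. The helper names and the request counts are hypothetical, chosen only for this example.

```python
from fractions import Fraction

# SLO from the slide: 97.5% of requests should succeed.
SLO_TARGET = Fraction("0.975")


def yield_sli(successful_requests: int, total_requests: int) -> Fraction:
    """Yield SLI: fraction of requests that succeeded (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return Fraction(successful_requests, total_requests)


def meets_slo(sli: Fraction, slo: Fraction = SLO_TARGET) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli >= slo


# Made-up counts for one measurement window.
sli = yield_sli(successful_requests=9_800, total_requests=10_000)
print(f"yield SLI = {float(sli):.2%}, meets SLO: {meets_slo(sli)}")
```

Exact fractions are used here so that a measurement sitting exactly on the 97.5% threshold is not misjudged because of floating-point rounding.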
10 SLIs and SLOs: Dashboard Gauges, Minimum Indicators and Penalty points ▪ SLOs - Service Level Objectives ▪ The defined threshold that a service should ideally operate above or within, as per the terms of an SLA
11 SLIs and SLOs: Dashboard Gauges, Minimum Indicators and Penalty points ▪ SLAs - Service Level Agreements ▪ An agreement contracted with a customer to set expectations and convey minimum levels of service
12 SLIs and SLOs: Dashboard Gauges, Minimum Indicators and Penalty points ▪ Reliable enough, but no more reliable than it needs to be. ▪ Use your error budget wisely. ▪ Use it to take risks. ▪ An error budget is 1 minus the SLO of the service
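The error budget definition above (1 minus the SLO) can be sketched as a small calculation. This is a minimal example that assumes the 97.5% SLO used earlier in the deck and measures the budget in failed requests; the function names and request counts are hypothetical.

```python
from fractions import Fraction


def error_budget(slo: Fraction) -> Fraction:
    """Error budget as defined on the slide: 1 minus the SLO."""
    return 1 - slo


def remaining_failures(slo: Fraction, total_requests: int, failed_requests: int) -> Fraction:
    """Failed requests still affordable in this window (negative if overspent)."""
    allowed = error_budget(slo) * total_requests
    return allowed - failed_requests


slo = Fraction("0.975")  # the 97.5% example SLO
left = remaining_failures(slo, total_requests=1_000_000, failed_requests=10_000)
print(f"budget = {float(error_budget(slo)):.1%}, failures left = {left}")
```

A positive remainder is budget that can be spent on risk, such as shipping a change sooner; a negative remainder suggests slowing down until reliability recovers.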
13 Uptime: Traditional Focus is on Uptime ▪ If you were down for 1 second per day (about 6 minutes per year), you would breach a five-nines (99.999%) SLA. ▪ This level of uptime is expensive and not often needed. ▪ Set your uptime target to a reasonable level and no better.
Uptime Target    Yearly allowed downtime
99%              3d 15h 39m 29s
99.9%            8h 45m 56s
99.99%           52m 35s
99.999%          5m 15s
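The downtime figures in the table can be reproduced with a short calculation. The sketch below assumes an average year of 365.2425 days and truncates to whole seconds; that convention is inferred from the numbers on the slide rather than stated there, and the function names are illustrative.

```python
from decimal import Decimal

# Assumption: average Gregorian year of 365.2425 days (31,556,952 seconds).
SECONDS_PER_YEAR = Decimal("365.2425") * 24 * 60 * 60


def allowed_downtime_seconds(uptime_percent: str) -> int:
    """Whole seconds of downtime per year permitted by an uptime target."""
    unavailability = (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)
    return int(unavailability * SECONDS_PER_YEAR)  # truncate, do not round


def format_duration(total_seconds: int) -> str:
    """Render seconds as e.g. '8h 45m 56s', omitting leading zero units."""
    days, rem = divmod(total_seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, seconds = divmod(rem, 60)
    parts = []
    if days:
        parts.append(f"{days}d")
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{seconds}s")
    return " ".join(parts)


for target in ("99", "99.9", "99.99", "99.999"):
    print(f"{target}%: {format_duration(allowed_downtime_seconds(target))}")
```

The same numbers show why one second of downtime per day breaches five nines: that adds up to roughly 365 seconds a year, against an allowance of 315.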
14 Service Levels Summary ▪ Service Level Agreements = An agreement with a customer to provide a level of service, with penalties applied if it is not met. ▪ Service Level Objectives = A target the service is expected to achieve, e.g. uptime, response time. ▪ Service Level Indicators = The metric that shows whether an objective is being met, e.g. measured uptime, latency, error rates etc.
15 DORA: DevOps Research and Assessment
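16 DORA: DevOps Research and Assessment ▪ Low Lead Time (minutes are better than hours, hours better than days) ▪ High Deployment Frequency (deploying every few minutes is better than every few hours, which is better than every few days)
17 DORA: DevOps Research and Assessment ▪ Low Lead Time (minutes are better than hours, hours better than days) ▪ High Deployment Frequency (deploying every few minutes is better than every few hours, which is better than every few days) ▪ Change failure rate ▪ Time to restore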
18 DORA: DevOps Research and Assessment ▪ Lead Time + Deployment Frequency = Throughput ▪ Change failure rate + Time to restore = Stability
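To make the four DORA measures concrete, here is a minimal sketch that derives them from a list of deployment records. The `Deployment` record, its field names and the sample data are all hypothetical. Time to restore is measured from deployment to restoration, which is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def lead_time(deploys: List[Deployment]) -> timedelta:
    """Throughput: median time from commit to production."""
    return median(d.deployed_at - d.committed_at for d in deploys)


def deployment_frequency(deploys: List[Deployment], window: timedelta) -> float:
    """Throughput: deployments per day over the observation window."""
    return len(deploys) / (window / timedelta(days=1))


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Stability: share of deployments that caused a failure."""
    return sum(d.failed for d in deploys) / len(deploys)


def time_to_restore(deploys: List[Deployment]) -> Optional[timedelta]:
    """Stability: median time from a failed deployment to restoration."""
    durations = [d.restored_at - d.deployed_at
                 for d in deploys if d.failed and d.restored_at]
    return median(durations) if durations else None


# Made-up sample: four deployments over four days, one of which failed.
t0 = datetime(2023, 1, 2, 9, 0)
day, hour, minute = timedelta(days=1), timedelta(hours=1), timedelta(minutes=1)
deploys = [
    Deployment(t0, t0 + 30 * minute),
    Deployment(t0 + day, t0 + day + hour, failed=True,
               restored_at=t0 + day + hour + 20 * minute),
    Deployment(t0 + 2 * day, t0 + 2 * day + 2 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + hour),
]

print("lead time:", lead_time(deploys))
print("deploys/day:", deployment_frequency(deploys, window=4 * day))
print("change failure rate:", change_failure_rate(deploys))
print("time to restore:", time_to_restore(deploys))
```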
19 DORA: DevOps Research and Assessment
20 GitOps SRE
21 The Principles of GitOps ▪ The entire system is described declaratively ▪ The canonical desired system state is versioned in git ▪ Approved changes can be automatically applied to the system ▪ Software agents ensure correctness and perform actions on divergence in a closed loop
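The closed loop described by these principles can be sketched as a single reconciliation pass: compare the desired state with the actual state, then act on any divergence. This is a toy model only. Plain dictionaries stand in for the git repository and the cluster, and the function names are hypothetical rather than taken from any GitOps tool.

```python
from typing import Dict, List

# Resource name -> version, a deliberately tiny stand-in for real manifests.
State = Dict[str, str]


def diff(desired: State, actual: State) -> List[str]:
    """List the actions needed to move `actual` to `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}={spec}")
        elif actual[name] != spec:
            actions.append(f"update {name}={spec}")
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions


def reconcile(desired: State, actual: State) -> List[str]:
    """One pass of the loop: detect divergence, then correct it in place."""
    actions = diff(desired, actual)
    target = dict(desired)  # copy first, in case both names refer to one dict
    actual.clear()
    actual.update(target)
    return actions


desired = {"web": "v2", "api": "v1"}   # what git declares
actual = {"web": "v1", "cache": "v1"}  # what is currently running
print(reconcile(desired, actual))      # actions taken on divergence
print(reconcile(desired, actual))      # nothing left to do on the next pass
```

A real agent would run this comparison continuously, so manual drift in the running system is corrected the same way as a new commit.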
22 Weave GitOps: Continuous delivery and operations for Kubernetes
23 Embrace Risk Traditional systems focus on availability, such as uptime, but ignore the velocity of deploying new features. An SRE balances both availability and feature velocity. Risk is therefore an acceptable part of the system: it should not be avoided, but it should be managed.
24 Progressive Delivery: Overview ● Progressive delivery is the practice of limiting the audience for your code changes or new feature releases ● It keeps the exposure area to a minimum if a change turns out to be faulty ● Progressive delivery can be implemented through a number of strategies, from A/B testing to canary, Blue/Green, Rolling/Immutable upgrades or Feature flag management ● Make Deployments boring!
25 Progressive Delivery: SRE orientation ● Leverage the data emitted from an instrumented and observable system to reason about whether to proceed or roll back ● Configure the application infrastructure (service meshes, key metrics), SLA (overall downtime or service degradation) and automated operations (rollback/revert strategy) ● Fosters closer collaboration and enables higher velocity through early feedback and deeper insight ● A transferable approach which can be standardised with components and configurations within an organisation
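The proceed-or-roll-back decision can be sketched as a small canary analysis loop. The sketch below is not Flagger's implementation or API. The `CanaryPolicy` fields, their default values and the `success_rate_at` callback (a stand-in for a query against a metrics system) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CanaryPolicy:
    """Hypothetical rollout policy; field names are illustrative only."""
    step_weight: int = 10           # percent of traffic added per healthy check
    max_weight: int = 50            # promote once the canary reaches this share
    min_success_rate: float = 0.99  # SLI threshold the canary must meet
    max_failed_checks: int = 2      # roll back once this many checks have failed


def run_canary(policy: CanaryPolicy,
               success_rate_at: Callable[[int], float]) -> Tuple[str, List[int]]:
    """Shift traffic step by step, checking the SLI after each step.

    Returns the outcome ("promoted" or "rolled back") and the weights tried.
    """
    weight, failed, history = 0, 0, []
    while weight < policy.max_weight:
        candidate = min(weight + policy.step_weight, policy.max_weight)
        history.append(candidate)
        if success_rate_at(candidate) >= policy.min_success_rate:
            weight = candidate  # healthy: keep the larger traffic share
        else:
            failed += 1         # unhealthy: hold the weight and count the failure
            if failed >= policy.max_failed_checks:
                return "rolled back", history
    return "promoted", history


# A canary that stays healthy, and one that degrades above 20% of traffic.
print(run_canary(CanaryPolicy(), lambda weight: 0.999))
print(run_canary(CanaryPolicy(), lambda weight: 0.999 if weight <= 20 else 0.95))
```

Tolerating a small number of failed checks before rolling back is one way to avoid reacting to a single noisy measurement; the right threshold depends on the service.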
26 Progressive Delivery: Flagger ● Cloud Native Open source Progressive delivery Operator ● Broad support for all popular Service meshes, Ingresses and K8s Gateway API ● Designed with GitOps Methodologies in mind and a complementary component to FluxCD ● Commercial UI Integration in Weave GitOps ● Adopters include CNCF projects right through to Enterprise users
27 Progressive Delivery: How it works
28 Progressive Delivery: Monitoring
29 Progressive Delivery: Visualisation
30 Progressive Delivery: Visualisation
31 Progressive Delivery: Visualisation
32 Progressive Delivery: Takeaways ● Progressive delivery is the practice of limiting the audience for your code changes or new feature releases ● It keeps the exposure area to a minimum if a change turns out to be faulty ● Progressive delivery can be implemented through a number of strategies, from A/B testing to canary, Blue/Green or Feature flag management
33 Progressive Delivery: Business Value ● Deployment Frequency is higher with GitOps & Progressive Delivery ● Lead times are shorter with Progressive Delivery ● Change Failure Rate is lower with Progressive Delivery ● Mean Time to Recovery is shorter with GitOps
34 More information ● Whitepaper: Progressive Delivery with GitOps https://bit.ly/3K8oZwU ● Learn about Weave GitOps Assured www.weave.works/product/gitops/ ● Learn more about Weave GitOps Enterprise www.weave.works/enterprise and a 5 min demo https://youtu.be/aqJaHNCz2lM ● Request a personal demo www.weave.works/contact
35 Thank You ● Join our Community https://slack.weave.works ● Contact Us sales@weave.works ● Our products & services www.weave.works
36 Progressive Delivery: Deployment Strategies
Strategy        Positive                                  Negative
Rolling         Zero Downtime                             Longer Deployments
Canary          Low Risk/Early Feedback                   Complex Monitoring/Testing
Blue/Green      Zero Downtime + Fast Rollback             Cost (2 active instances)
Feature Flags   Controlled Release + test in Production   Increased complexity in app management
Immutable       Predictable and Simple process            Increased cost, Data persistence
