In 2010, Netflix officially moved to the cloud on AWS, and one of the biggest lessons they learned was: "The best way to avoid failure is to fail constantly." So they built Chaos Monkey, a tool that randomly kills processes and instances to induce failure. Here's how it helps the Netflix team and everything you need to know about it:

► What is Chaos Monkey?
- Purpose: A tool designed to randomly shut down instances or services within Netflix's architecture.
- Objective: Induce controlled failures to test system resilience.
- Evolution: Chaos Monkey became part of a broader suite called the Simian Army, which included other tools like Chaos Kong and Janitor Monkey for various disaster-recovery tasks.

► How Does Chaos Monkey Work?
1. Dependencies:
- MySQL: Tracks terminations and schedules to ensure control over random failures.
- Spinnaker: An open-source delivery platform used to deploy and terminate services across multiple cloud platforms like AWS, Azure, and GCP.
2. Daily Cron Task:
- Every day, Chaos Monkey looks for services marked as "Chaos Monkey enabled."
- It randomly selects targets and schedules termination tasks, which are then executed via Spinnaker.
3. Tracking Failures:
- MySQL tracks terminations so that randomness doesn't hit the same services too frequently or too rarely.

► Benefits of Chaos Monkey
- Resilience Testing: By randomly killing processes, Netflix ensures their services can recover from unexpected outages.
- Learning Opportunities: Engineers gain deep insights into how their systems behave under stress, learning more from random destruction than they would from planned testing.
- Open-Source Tool: Netflix released Chaos Monkey as open source, allowing companies worldwide to test their systems similarly.

► Who Should Use Chaos Monkey?
- Spinnaker Users: Chaos Monkey only works if you're using Spinnaker, so this is a key requirement.
- Organizations Needing High Resilience: If your infrastructure requires robust testing for resilience and redundancy, Chaos Monkey may be a good fit. However, consider the effort to set up and maintain the tool.
- Teams Ready for Controlled Chaos: Chaos Monkey helps you prepare for unexpected system crashes by simulating failures, but teams need to be ready for the operational stress that comes with it.

Netflix's Chaos Monkey is just one example of how chaos engineering helps organizations build more resilient, failure-proof systems.

References:
- https://lnkd.in/gUr5y3qS
- https://spinnaker.io/
- https://lnkd.in/gxcDPDSx
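To make the mechanics above concrete, here is a minimal Python sketch of the core idea: pick one opted-in instance at random and terminate it. This is not Netflix's implementation (the real tool schedules terminations through Spinnaker and records them in MySQL); the boto3 calls, the `chaos-monkey` opt-in tag, and the dry-run flag are illustrative assumptions.

```python
# Hypothetical, minimal sketch of the Chaos Monkey idea: randomly pick one
# opted-in instance and terminate it. For illustration only.
import random

import boto3

ec2 = boto3.client("ec2")


def find_opted_in_instances():
    """Return running instance IDs tagged chaos-monkey=enabled (assumed opt-in tag)."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-monkey", "Values": ["enabled"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]


def unleash_the_monkey(dry_run=True):
    """Pick one opted-in instance at random and terminate it."""
    victims = find_opted_in_instances()
    if not victims:
        print("No opted-in instances found; the monkey sleeps.")
        return
    target = random.choice(victims)
    if dry_run:
        print(f"[dry run] would terminate {target}")
    else:
        ec2.terminate_instances(InstanceIds=[target])
        print(f"Terminated {target}")


if __name__ == "__main__":
    unleash_the_monkey(dry_run=True)  # flip only in an account you can afford to break
```

Run from a daily cron job, this reproduces the "scheduled random failure" loop in spirit; the tracking and pacing that MySQL provides in the real tool are deliberately omitted here.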
-
After 10 years in Cloud Engineering, I wish someone had told me these truths from day one:

"Embrace boring technology." That shiny new AWS service isn't worth the operational overhead. Master the fundamentals first: EC2, RDS, S3, and IAM.

"Infrastructure as Code isn't optional." Every manual click in the AWS console is technical debt. If you can't recreate your environment from code, you don't own it.

"Security by design, not by accident." Adding security after the fact is 10x harder than building it in. Start with least-privilege IAM from day one (a minimal sketch follows this post).

"Automation saves your sanity, not just time." The goal isn't speed; it's consistency. Manual processes create knowledge silos and single points of failure.

"Document your decisions, not just your code." Write down WHY you chose this architecture. Future you (and your team) will thank you during the inevitable 3 AM incident.

"Plan for failure from the beginning." Every service will fail. Every network will have issues. Design for it, test for it, expect it.

What's the best cloud advice you wish you'd received earlier?
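On the least-privilege point, here is a hedged boto3 sketch of what "day one" can look like: a policy scoped to one bucket and two actions, nothing more. The bucket and policy names are placeholders, not a recommendation for your resources.

```python
# Hypothetical least-privilege starting point: read-only access to a single
# S3 bucket. Names are placeholders.
import json

import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-app-bucket",       # bucket itself (for ListBucket)
                "arn:aws:s3:::example-app-bucket/*",     # objects inside it (for GetObject)
            ],
        }
    ],
}

iam.create_policy(
    PolicyName="example-app-s3-read-only",
    PolicyDocument=json.dumps(policy_document),
)
```

Starting this narrow and widening only when a call fails is far easier than auditing an over-broad `*` policy later.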
-
The AWS downtime this week shook more systems than expected. Here's what you can learn from this real-world case study.

1. Redundancy isn't optional. Even the most reliable platforms can face downtime. Distributing workloads across multiple AZs isn't enough; design for multi-region failover.

2. Visibility can't be one-sided. When any cloud provider goes dark, so do its dashboards. Use independent monitoring and alerting to stay informed when your provider can't (a minimal probe sketch follows below).

3. Recovery plans must be tested. A document isn't a disaster recovery strategy. Inject a little chaos: run failover drills and chaos tests before the real outage does it for you.

4. Dependencies amplify impact. One failing service can ripple across everything. Map critical dependencies and eliminate single points of failure early.

These moments are a powerful reminder that reliability and disaster recovery aren't checkboxes. They're habits built into every design decision.
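For point 2, the simplest form of provider-independent visibility is a probe that runs somewhere outside your primary cloud and hits your public endpoints directly. A minimal sketch, where the endpoint URLs and the alert webhook are placeholder assumptions:

```python
# Provider-independent uptime probe: run this from outside your primary cloud
# (another provider, a VPS, even a Raspberry Pi) on a schedule.
import requests

ENDPOINTS = [
    "https://api.example.com/healthz",   # placeholder
    "https://app.example.com/healthz",   # placeholder
]
ALERT_WEBHOOK = "https://hooks.example.com/oncall"  # e.g., a Slack/PagerDuty webhook


def probe(url, timeout=5):
    """Return True if the endpoint answers 200 within the timeout."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


def main():
    failures = [url for url in ENDPOINTS if not probe(url)]
    if failures:
        # Alert through a channel that does not depend on the failing provider.
        requests.post(ALERT_WEBHOOK, json={"text": f"Probes failing: {failures}"})


if __name__ == "__main__":
    main()
```

The key design choice is that neither the probe's compute nor its alerting path shares fate with the infrastructure being watched.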
-
A Visual Overview of Kubernetes

Containers revolutionized modern application development and deployment. Unlike bulky virtual machines, containers package up just the application code and dependencies, making them lightweight and portable. However, running containers at scale brings challenges. Enter Kubernetes! Kubernetes helps deploy, scale, and manage containerized applications across clusters of machines.

𝗖𝗼𝗿𝗲 𝗞𝘂𝗯𝗲𝗿𝗻𝗲𝘁𝗲𝘀 𝗖𝗼𝗺𝗽𝗼𝗻𝗲𝗻𝘁𝘀
- Control Plane: The brains behind cluster management, handling scheduling, maintaining desired state, rolling updates, etc. Runs on multiple machines for high availability.
- Worker Nodes: The machines that run the containerized applications. Each node has components like kubelet and kube-proxy alongside the application containers.

The smallest deployable units in Kubernetes are Pods. A Pod encapsulates one or more tightly coupled containers that comprise an application. Kubernetes assigns Pods to worker nodes through its API server.

𝗞𝗲𝘆 𝗞𝘂𝗯𝗲𝗿𝗻𝗲𝘁𝗲𝘀 𝗖𝗮𝗽𝗮𝗯𝗶𝗹𝗶𝘁𝗶𝗲𝘀
- Scalability: It's easy to scale applications up and down on demand. Just specify the desired instance count and Kubernetes handles the rest (see the sketch after this post).
- Portability: Applications can run anywhere: on-premises, cloud, hybrid environments, etc. No vendor lock-in!
- Resiliency: Kubernetes restarts failed containers, replaces unhealthy nodes, and maintains desired state, reducing downtime.
- Automation: Manual tasks like rolling updates and rollbacks are automated, freeing teams to focus on development.

𝗧𝗿𝗮𝗱𝗲𝗼𝗳𝗳𝘀
The power of Kubernetes comes with complexity. Installing, configuring, and operating Kubernetes has a steep learning curve. For many teams, it's overkill. Managed Kubernetes services help by handling control plane management, letting teams focus only on applications and pay for just the worker resources used.

𝗜𝘀 𝗞𝘂𝗯𝗲𝗿𝗻𝗲𝘁𝗲𝘀 𝗮 𝗚𝗼𝗼𝗱 𝗙𝗶𝘁? Consider:
- Are you already running containers at meaningful scale?
- Would portability or resiliency resolve production issues?
- Is your team willing to invest in learning and operating Kubernetes?

If you answered yes, Kubernetes may suit your needs. Otherwise, containers without orchestration may still get the job done.

– Subscribe to our weekly newsletter to get a Free System Design PDF (158 pages): https://bit.ly/496keA7
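Here is the scalability point from the capabilities list made concrete: with the official Kubernetes Python client, you declare a new replica count and the control plane converges on it. The deployment name and namespace are placeholder assumptions.

```python
# "Specify the desired instance count, Kubernetes handles the rest":
# scaling a Deployment with the official Python client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="example-web",        # placeholder deployment name
    namespace="default",
    body={"spec": {"replicas": 5}},  # declare desired state; the scheduler does the rest
)
```

Note that this is declarative: you never start or stop individual containers; you state how many replicas should exist, and Kubernetes creates or removes Pods to match.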
-
You dockerized your .NET Web apps. Great, but next you'll face these questions:

- How do you manage the lifecycle of your containers?
- How do you scale them?
- How do you make sure they are always available?
- How do you manage the networking between them?
- How do you make them available to the outside world?

To deal with those, you need 𝗞𝘂𝗯𝗲𝗿𝗻𝗲𝘁𝗲𝘀, the container orchestration platform designed to manage your containers in the cloud.

I started using Kubernetes about 6 years ago when I joined the ACR team at Microsoft, and never looked back. It's the one thing that put me ahead of my peers, given the increasing move to Docker containers and cloud-native development.

Every single team I've joined since then used Azure Kubernetes Service (AKS) because of the impressive things you can do with it, like:

- Quickly scale your app up and down as needed
- Ensure your app is always available
- Automatically distribute traffic between containers
- Roll out updates and changes fast and with zero downtime (see the rolling-update sketch after this post)
- Ensure the resources on all boxes are used efficiently

How to get started? Check out my step-by-step AKS guide for .NET developers here 👇
https://lnkd.in/gBPJT6wv

Keep learning!
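A hedged sketch of the zero-downtime rollout mentioned above: updating a Deployment's image triggers Kubernetes' rolling update, which replaces Pods gradually instead of all at once. It works the same against an AKS cluster as any other; the deployment name, container name, and registry/image tag are placeholders.

```python
# Rolling update via the Kubernetes Python client: patching the Pod template's
# image causes Kubernetes to replace Pods gradually (old ones keep serving
# until new ones pass readiness checks).
from kubernetes import client, config

config.load_kube_config()  # e.g., populated by `az aks get-credentials`
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    # placeholder names; the registry here is an assumed ACR instance
                    {"name": "web", "image": "myregistry.azurecr.io/web:v2"}
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="web", namespace="default", body=patch)
```

If the new version misbehaves, the same mechanism runs in reverse: `kubectl rollout undo deployment/web` restores the previous template.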
-
Friends, please stop sending me AWS outage memes 🥲

If this one hit you or your team, I feel sorry for the on-call engineer standing in front of the screens. Infrastructure failures and the uncertainty: not fun.

Yes, I'm at Amazon now (not on the AWS infra side), but I've worked deeply on AWS across multiple firms. So when the Amazon Web Services (AWS) US-EAST-1 region went down two days ago, it struck me. Even though I wasn't in that firefight, I've run that race before.

If your system was impacted, here's a design-first checklist you'll appreciate. These are the tactics I rely on, and you should too, when major outages hit!

↳ Multi-AZ + Multi-Region Deployment
Run across multiple Availability Zones and more than one Region (e.g., US-WEST-2 and US-EAST-1) so region-level failures don't take you out.

↳ Primary-Secondary Production Environments
Maintain one active ("Primary") environment and a warm standby ("Secondary") you can switch to quickly.

↳ Service Mesh + Retry Logic with Backoff & Jitter
Failures happen. Retries help, but only if idempotency and exponential backoff are built into the core logic (see the sketch after this post).

↳ Chaos Testing & Fault Injection
Regularly simulate failures (DNS, DB, network partitions). If you only test in a green field, you're not battle-ready.

↳ Automated Failover + Clear RTO/RPO Targets
Know how fast you must recover (RTO) and how much data you can afford to lose (RPO). Automate switchovers.

↳ Independent Failure Domains
Don't share dependencies across services. If Service A and Service B share the same DB or Region, they'll fail together.

↳ Distributed CDN + Edge Routing + DNS Redundancy
Don't rely on a single DNS path or single CDN endpoint. Layer redundancy.

↳ Runbooks + Incident Drills + Blameless Culture
When the alert hits, everyone should know their role, escalation path, and rollback plan. Practice often!

--
#systemdesign #aws #amazon #outage #meme #softwareengineering #softwaredevelopment #tech
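For the retry item on the checklist, here is a minimal sketch of exponential backoff with full jitter. The helper name and parameters are illustrative; as the checklist notes, this is only safe to wrap around idempotent operations.

```python
# Exponential backoff with full jitter: retries spread out randomly so a fleet
# of clients does not hammer a recovering service in synchronized waves.
import random
import time


def retry_with_backoff(fn, max_attempts=5, base=0.5, cap=30.0):
    """Call fn(); on failure, sleep up to base * 2^attempt (capped), then retry.

    Usage: retry_with_backoff(lambda: requests.get(url, timeout=2))
    Only use with idempotent operations.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the real error
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))  # full jitter
```

The jitter is the part most teams skip, and it matters most during exactly the kind of region-wide event described above, when thousands of clients retry at once.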
-
AWS has one cool service called Fault Injection Service that lets you inject faults into your infrastructure and see if your application keeps working. This is useful for testing high availability and evaluating whether you'd have downtime if one of AWS's Availability Zones ever went down.

Here are some of the things you can do with Fault Injection Service:

1. You can make all Auto Scaling group EC2 instance launches fail in one AZ for a given time period, ensuring that all EC2 instance launches happen in the other Availability Zones.

2. You can disrupt network connectivity to a particular subnet in an AZ for a given time period and see the repercussions on your system. This is useful for testing services that have a node in each AZ. For example, services like MemoryDB, Amazon OpenSearch Service, and MSK create one node in each AZ for high availability. Using this scenario, you can see how your apps and these services behave when there is a network partition.

3. You can ask it to terminate all instances in a given AZ and check whether your application keeps running.

4. You can ask it to fail all EC2 launch requests for a given role in a given AZ for a period like 30 minutes. This is useful to block something like Karpenter from launching instances in that AZ.

5. You can pause the I/O operations on a given EBS volume and evaluate how your application behaves in that setting.

AWS provides prebuilt templates that you can use to simulate real-life scenarios. One I recently used, AZ Power Interruption, combines the above actions targeting all resources in one AZ. This has made it easier than ever to test your application for zonal failure, which is especially useful for organizations pursuing compliance programs like SOC.

If you ever want to verify that what you've built is indeed highly available, give this service a shot. A minimal start-experiment sketch follows below.
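A short boto3 sketch of kicking off an experiment from a prebuilt template like the AZ Power Interruption scenario mentioned above. The template ID is a placeholder you would copy from the FIS console after creating the template there.

```python
# Start an AWS Fault Injection Service experiment from an existing template.
import boto3

fis = boto3.client("fis")

response = fis.start_experiment(
    experimentTemplateId="EXT123456789012",  # placeholder template ID
    tags={"purpose": "az-failure-drill"},
)
print("Experiment started:", response["experiment"]["id"])
```

The template itself (created beforehand in the console or via `create_experiment_template`) is where the actions, targets, and stop conditions live; stop conditions tied to CloudWatch alarms are what keep a drill from becoming a real outage.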
-
Companies like AWS and Netflix intentionally bring chaos into their systems to improve their reliability. Here is why they do it.

Chaos engineering is a discipline aimed at improving the resilience of distributed systems by intentionally introducing chaos into a system to identify weaknesses and vulnerabilities before they impact users.

► Identifying Weaknesses: Chaos engineering intentionally introduces failures to reveal hidden vulnerabilities in systems.
► Resilience Testing: Chaos engineering simulates real-world failures to assess and enhance system response under stress.
► Failure Mode Analysis: Chaos engineering triggers failures to understand how systems break down and inform mitigation strategies.
► Continuous Improvement: Chaos engineering iteratively tests and refines systems to ensure ongoing resilience against evolving threats.
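As a toy illustration of "intentionally introducing chaos," here is a decorator that randomly fails or delays calls so you can watch how downstream code copes. This is purely illustrative; real chaos engineering tools target infrastructure (instances, networks, zones), not individual functions, and every name here is made up.

```python
# Toy fault injection: wrap a function so some calls fail and all calls get
# random latency, exposing callers that lack timeouts, retries, or fallbacks.
import functools
import random
import time


def chaotic(failure_rate=0.1, max_delay=2.0):
    """Decorator that injects random failures and latency into calls."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise RuntimeError("chaos: injected failure")
            time.sleep(random.uniform(0, max_delay))  # injected latency
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@chaotic(failure_rate=0.2)
def fetch_recommendations(user_id):
    # placeholder downstream call
    return ["movie-1", "movie-2"]
```

Running a test suite against a dependency wrapped like this is a desk-sized version of the same question Chaos Monkey asks in production: does anything upstream actually handle this failing?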