
DevOps Is for Product Engineers, Too


Summary

Lesley Cordero discusses platform engineering as a sociotechnical solution for scaling organizations. She explains the CALMS framework, the "pendulum of tension" between reliability and velocity, and how to transition from reactive to proactive leadership. By focusing on communal learning and distributed power, she shares how to build resilient systems without sacrificing human well-being.

Bio

Lesley Cordero is currently a Staff Software Engineer, Tech Lead at The New York Times. She has spent the majority of her career on edtech teams as an engineer, including Google for Education and other edtech startups.

About the conference

InfoQ Dev Summit Boston software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Lesley Cordero: My name is Lesley Cordero. Welcome to my talk on how DevOps fits into the landscape of product engineering, namely through using it as a practice of driving sociotechnical excellence. I'm currently a staff engineer at the New York Times, specifically focusing on reliability platforms within our wider platform engineering organization. While this talk is about DevOps, as my title indicates, the reason it's applicable to product engineers is because we're all ultimately operating within sociotechnical systems. Regardless of whether you're focused on developing more frontend facing applications versus more purely backend services, we all need to have a consideration for how our individual work is reflected in the wider landscape of our products and services that we're providing our users.

What Are Sociotechnical Systems?

Before diving into what this looks like, let's level set somewhat by defining what sociotechnical even means and how that concept translates to sociotechnical systems. As the name suggests, sociotechnical refers to the ways in which social and technical aspects of an organization relate to one another. In the context of technology companies, you can think of the organization itself as a sociotechnical system. Something that's important to note is that sociotechnical theory doesn't just acknowledge these two aspects, it emphasizes that they're inherently interconnected. When we're thinking about how we change cultures, we have to consider the ways in which these social and technical systems coexist rather than treating them as completely separate and independent from one another. This is where this principle of joint optimization comes in. Joint optimization is the idea that social and technical systems must be designed and improved together, not in isolation.

In practice, organizations often struggle with this balance, whether that's by over-investing in tools and processes at the expense of team dynamics, or by promoting values like collaboration and trust without putting systems in place to actually reinforce them. This tension is completely normal in an organization, especially complex organizations like large enterprises. Navigating this tension is a core responsibility of being a technology leader. I'm not sure how many of us are familiar with the LeadDev organization and conferences, but this is a quote from one of the emcees from their conferences, David Yee. He says, it's our job as leaders to hold things in tension. This quote came during the peak of some mass layoffs.

Naturally, the tone of the conference was a little more somber than we're used to. I think this was his way of calling attention to the fact that it was a really tough time to be in tech. It still is. And that this is the job that leaders ultimately signed up for. Especially when you're operating in large enterprises, where a lot of these types of decisions are out of the hands of the people most impacted by them, the hardest thing about being a technology leader of an individual team is reconciling that while these decisions aren't our fault, they are our responsibility, and we need to act on that. It's a quote that stuck with me quite a lot because it very much falls in line with sociotechnical theory.

When we talk about holding things in tension as technology leaders, we're not just talking about emotional resilience. We're also talking about navigating these complex systems, whether that's the system of the actual organization or the technical systems of our architecture. The reality is that we're almost never dealing with just one or the other. Leadership often feels like we're managing a pendulum, where one side swings towards culture and people, and the other towards tools and processes. Sometimes the swing is slow and manageable. Other times it's reactive and maybe even violent or disruptive, like the instance of mass layoffs.

Either way, our job isn't to hold this pendulum in the center, but to understand the motion and to respond with intention. Every leadership decision, even the small ones, can impact the motion of the pendulum. If we change a tool and suddenly a team's workflow breaks, that's very disruptive. If we restructure a team and processes stop making sense for those teams, that's disruptive too. Now, this pendulum is definitely a simple representation of the internal tensions of sociotechnical systems. We're going to complicate it a little further on, before defining how platform engineering is a sociotechnical strategy for the organizational implications of increasingly complex technical systems.

Elements and Interactions of a Sociotechnical System

This diagram is a common representation of sociotechnical systems that's based on original literature, but let's decompose it further in the context of platforms. We'll go into a more thorough definition of a platform, but for now we'll scope its definition to the smallest unit of a platform, which is ultimately what many organizations consider a platform team. Pulling from the book, "Team Topologies", a team is a stable group of five to nine people who work towards a shared goal as a unit. While there is a genuine meaning behind having a product and platform split in terms of how we organize teams, especially at the enterprise level, I argue that a lot of platform engineering principles and strategies translate quite well to product engineering teams. This opinion is ultimately informed by the fact that I consider platform engineering to be a sociotechnical solution to the organizational problems of scaling our software.

Going back to this diagram, the three highlighted parts here represent the high-level composition of a sociotechnical system. Because the boundaries between social and technical systems can be so ambiguous, we instead represent them as four components. One, the structural patterns and practices that inform how we work. Two, the people and teams who collaborate on these efforts. Three, the architecture and infrastructure that provides our platforms. Four, the operations and processes that enable our work. These are the components that leaders of a team have direct influence over and they'll ultimately be elements that we'll be consistently pulling from throughout this talk.

On the other hand, we have the system representing our external environment. While there might be opportunities here for indirect influence, they're much harder to change. This is where a lot of the tension we talked about earlier comes from. They represent the constraints that we ultimately operate under. Going back to the context of David Yee's quote from earlier, the current state of technology has introduced a lot of constraints over the past few years. There's pressure to do more with less. All these factors have made change much more difficult to enact. Speaking from a personal perspective, these are hard times to be someone who cares about culture especially. Because of that, there are opportunities for anyone to be a leader, regardless of whether you're a leader in title or even a junior engineer.

Those who step up during these hard times, regardless of position or role, are going to ultimately be our technology leaders of the future. While courage is definitely a characteristic of leading during hard times, awareness and intention are what will enable us to address the complexity of our systems. When resources are tight and priorities are constantly shifting, especially in the age of AI where we're seeing new changes every single day, the challenge becomes about building systems that can remain resilient in the face of complexity and change. This is where the idea of organizational sustainability comes in.

Let's define organizational sustainability more concretely. I define sustainability as the continuous practice of operating in a way that enables short-term growth opportunities while ensuring long-term success. There's a lot to unpack here, so let's break it down. First, sustainability is a continuous practice. Even if we spend a lot of upfront time thinking about how to ensure long-term sustainability, circumstances change, as I mentioned. AI is introducing new technology all the time and often quickly, so we need continuous avenues to ensure long-term success. Secondly, enabling short-term growth opportunities. Sometimes those risky short-term growth opportunities are what lead to long-term success. That's ultimately why we're being very responsive to the age of GenAI. We're looking for those opportunities to grow our business in a way that's unique.

Another example from the past is the emergence of bundled products, which has worked really well for some companies, including The Times. I'm sure at least some of us have heard of Wordle, New York Times Cooking, our games in general, and Wirecutter. We love them too, because they were ultimately revolutionary decisions for us. During a period when journalism was very much on the chopping block, our games strategy contributed a lot to our success. We don't want to give that up. Putting on my reliability management hat, we also need to prepare for the risk of those opportunities. This leads us to the component of enabling long-term success. We frequently see companies take their core business for granted in the name of growth. For every successful growth opportunity, many more fail. Preparation for this type of risk is essential.

Now that we've defined the goal, organizational sustainability, let's define the strategy, platform engineering. Using my definition of platform engineering, platform engineering drives organizational sustainability by practicing sociotechnical principles that provide a community-driven support system for product engineers or application developers using our standardized shared platform architecture. These three components form the basis of what it means to provide a platform. One theme throughout this talk will be about how platform engineering can enable us to scale our organizations to enable the growth that our businesses often demand.

As part of that, we'll also need to ask ourselves, at what point is this platform engineering framework necessary? We frequently talk about scaling software, but what does it actually mean to scale an organization? The answer is ultimately that our ability to scale our organization is directly tied to our ability to scale our software. When we think about scaling our software, we have to be intentional about addressing the inevitable complexity that comes with that growth. To address this complexity, we have to bring this intention into how our architecture can enable those needs. We all know that complexity makes development so much harder. It makes things so much harder that as a collective industry, we've evolved the way that we build applications.

For example, the modular monolith has become an increasingly popular architecture style, especially as an intermediate step towards adopting distributed architecture patterns that enable us to work and scale our applications. Just like the way that we've evolved the way that we build applications to embrace new architectural patterns, like microservices, we must evolve the delivery strategies we use to build these new architectural patterns. If architectural patterns are a solution to the technical complexity of scaling our applications, platform engineering is a sociotechnical solution to the organizational complexity of scaling our applications. To summarize it concisely, platform engineering is a sociotechnical solution to the organizational complexity of scaling our applications.

Guiding Principles

We'll spend the rest of this talk decomposing each of these components even further: the principles that guide us, the community-informed leadership that enables product engineers, and the architecture that we use along the way. First, we have the principles that guide the sociotechnical system behind a platform. Having focused on reliability management, the principles that we'll review are heavily influenced by DevOps, particularly because DevOps principles take a strong consideration for both the technical and social components of what it means to develop and operate software.

DevOps is also where platform engineering arose from. It was a response to the difficulty of bridging developers and operations engineers. Going back to the pendulum metaphor, we see developers on one side of the system and the other side representing operations engineers. Platform engineering isn't a replacement for DevOps, but rather a different way of reframing similar problems that technical organizations have seen before. The most critical difference really is that platform engineering applies DevOps principles and practices at scale and across the stack of development.

Let's head into the principles and practices of platform engineering. Some of us with an ops background might have heard of the CALMS framework, which is basically a framework of principles that should be at the core of DevOps organizations. I'll walk through this framework and make sure to highlight the differences between DevOps and platform engineering. Starting off with culture, the CALMS framework tells us that DevOps drives a culture of continuous improvement and reduces silos by intentionally sharing knowledge and feedback. The same is true for platform engineering, but I'll talk about it more directly by putting it in the context of community. In DevOps we often talk about breaking down silos. That's a huge area of tension because information flow is incredibly difficult to manage.

The way that we bridge that is by sharing knowledge. To share knowledge ultimately means to connect and collaborate with one another. Connection and communication are key for preventing the silos that would hinder our ability to make continuous progress and make sustainable technical decisions. When we're talking about organizations, especially as our organizations grow, the most effective way to manifest this culture of sharing is to think about how we can cultivate a strong community that fosters this culture at scale, especially in large enterprises where we might not know everyone in the organization. This is ultimately because the opposite of isolation is to be in community with other people. The reason that this is so important is because, more than anything, learning is the most sustainable advantage. This quote is by Andrew Clay Shafer, and he said this in his talk about sociotechnical systems.

The way that I interpret this is that because our industry is always changing, as I mentioned, with AI changing the way that we work every day, being able to keep up with this change is the biggest advantage that we can give ourselves. To do that, learning needs to be a part of our organization's DNA. While I agree with him, I'd like to modify this to emphasize that communal learning is the most sustainable advantage. While our individual growth is important, if this knowledge isn't being shared intentionally, we risk introducing singular points of knowledge. Just like in our technical systems, where we talk about single points of failure, humans are not supposed to be 100% reliable. We shouldn't be putting anyone in the position of being a singular point of knowledge or failure, because this is ultimately how silos are created and how they become an organizational pattern that hinders our success. In other words, communal learning is what provides the knowledge redundancy needed to sustain both ourselves and the organization.

Next, we have automation, which improves our software delivery process by reducing human error, improving our efficiency, and enabling faster delivery. This means thinking critically about the type of work that doesn't require business-specific knowledge and figuring out whether that work can be consolidated into software that's managed primarily by platform teams. When we do this, we can reduce the cognitive load that engineering users often take on by managing all aspects of their software. The type of work that's important but can be consolidated in an automated or centralized way is work that's repeatable and manual, which is what DevOps often refers to as toil, or what application developers might refer to as boilerplate software.

Another aspect of platform engineering is that we should be explicit about improving efficiency by leaning into third-party solutions, whether that's through vendor solutions or open-source ones. The reason for this is that we need to reduce our own cognitive load and maintenance burden just as much as platform consumers do. Next, we have the lean principle. Earlier, I mentioned the impact of external constraints on sociotechnical systems, and presently, that's been manifesting in an industry-wide increased emphasis on doing more with less.

In other words, the need to be lean. While we've seen this increase in pressure of being lean, the truth of the matter is also that this has always been an external pressure. Resource constraints, time constraints, headcount constraints, these aren't anything that are new. What we can do is change how we respond to those constraints with attention and adaptability. In the context of platform engineering, the lean principle isn't just about reducing waste, it's about continuously improving how we deliver value to our internal users. This means embedding feedback loops into our tooling, processes, and services so that we can iteratively evolve them based on what isn't working for our organizational context.

Then, next, we have measurement. The function of measurement is ultimately to serve feedback loops that determine whether our work is actually having the intended impact. These feedback loops should consist of both quantitative and qualitative signals. The way that this principle connects to sustainability is, again, by eliminating time spent on work that doesn't ultimately lead to business goals. For example, if a tool that we've spent weeks on isn't actually serving our product engineering users, leading to a lack of adoption, we've now invested time that could have directly served our pained end users. We end up missing the ultimate goal of building applications that age well with our evolving product growth opportunities. In other words, these feedback loops keep us on the right path towards the continuous improvement that enables us to build new features while maintaining our existing software. Lastly, we have the sharing principle.

The idea of sharing knowledge was central to the first principle I began with, culture. Instead of restating that principle for platform engineering, I've decided to reframe it in the context of technology leadership. Whereas culture is more reflective of the goal, sharing is more about how we cultivate that culture at scale. When working in highly complex sociotechnical systems, leadership needs to be distributed. Technical decision-making needs to be distributed. It's entirely unreasonable to rely on centralized decision-making because it's impossible for any given leader to have the full context needed to make decisions. While there's certainly a need for some degree of centralized leadership, empowering teams to have ownership over the decisions that impact them is far more sustainable in the long term.

Architecture

Going back to how we define platforms that sustain this sociotechnical excellence, we have our platform architecture, which is the architecture that platform engineers are building to support application development. Within these technical systems, we find ourselves with similar tensions as before. This brings us back to this pendulum of tension. This time, the pendulum of tension is a metaphor that helps us understand how platform engineering sits at the intersection of two critical forces, which is end-user experience and developer experience. This tension mirrors the first pendulum that we talked about, developers versus operations, where each side has historically been optimized at the expense of the other. Platform engineering was born in that gap. Now we find ourselves holding a new kind of tension. At the heart of this pendulum is the tradeoff between reliability and feature delivery.

If we optimize too much for developer speed and convenience, we might compromise the long-term system health and end-user trust. If we prioritize reliability without regard for developer workflows, we risk introducing friction, frustration, and shadow systems. Yet the goal, again, isn't to solve this tension, but to sustain it well.

In fact, some of the most powerful insights in platform engineering often come from engineers who've had the opportunity to swing across the pendulum, from product to platform teams and back. That motion is ultimately what creates empathy and perspective and better design instincts. When engineers understand what it's like to ship features under pressure and manage infrastructure at scale, they also become better stewards of both. When we say that platform engineering sits in tension, we mean that it's orchestrating movement and learning from both ends, and guiding the organization towards a sustainable balance.

In the previous section, we talked about our high-level principles. Next, we'll review some foundational architectural principles that should inform how we architect platforms. The first is to embrace design-driven architecture as a core set of best practices. Intentionality should be an important attribute of the way that we build technology and collaborate with others. This intention should manifest in the way that we design platform systems, whether that's using abstraction or modularity to separate different functional concerns. This principle can definitely be broken down into many pieces. I highly recommend reading Domain-Driven Design. We'll set that aside for now, because we'll talk about it more when we get to the design tensions and how to alleviate them.

Secondly, our architecture has to be complementary to those of our end users. This is where that user versus developer experience tension shows itself. There's definitely value in thinking about where our platform architecture might be heading. Tying back to the last principle about intentionality, we need to design with the future in mind and not necessarily build for it immediately. This is much in the same way that we might design a monolithic application in a way that would enable us to decompose it into a distributed architecture in the future. When we prioritize our work, we should be driven by the needs of application developers in our organization, whose architecture should be a reflection of our pained end users' needs. This might lead to prioritizing and deprioritizing certain domains based on need, whether that's CI/CD, observability, or runtime language support. This is why my first team at The Times was actually an observability-focused team.

In my first version of this talk, I actually mentioned explicitly how I expected my team to change its domain at some point. As we delivered on our observability goals over time, it became very clear that it was time for us to extend our scope to think about our reliability management more holistically. Within the domains or problems that we're trying to solve, we also need to build in a way that's responsive to evolving architecture and developer needs. For example, if we want to improve the runtime experience of developers, we should prioritize the languages that are actually used by them, not just the ones that we as platform engineers want to support first. Ultimately, platform engineering is not here to tell other developers what to do, we're here to support them in what they need to do. This principle enables us to design our platforms so that we see similar benefits of concrete separations of concerns that we often see in end-user facing architectures.

This is where I'll take a moment to talk about some of the common pitfalls that we see within platform engineering, the first being that platform engineering is not equal to infrastructure platforms. I think this is part of why we see a lot of claims that DevOps is dead in lieu of platform engineering: too many of us are operating under the assumption that the only type of shared platform that companies need is one that's limited to infrastructure.

For example, we should also be thinking about how platforms can aid the service and feature development cycle from beginning to end, during the actual development phase. That might mean having language runtime platforms that support development of standardized Node.js services, for example. Each of these can be decomposed even further. Again, tying to principle number two, this decomposition should only happen if there's a genuine need for it. For example, if our organization decides to introduce a new standard language, that's a good time to decompose your runtime platform. In the infrastructure platform context, you might see this by further breaking up into domains like cloud infrastructure, CI/CD, or observability.

As I mentioned at the beginning of this talk, we're all operating within sociotechnical systems. A lot of the technical principles and patterns translate quite well, regardless of where you sit on that platform-to-product engineering spectrum. To complicate this even further, there are even product platforms, which might refer to a specific end-user domain or a core platform. You might also see this referred to as core services, depending on your organizational structure.

Lastly, choose boring technology. This ties back to when I spoke about not building tools from scratch. We can prevent that by not leaning into every cutting-edge opportunity. Some years back, a blog post by Dan McKinley called "Choose Boring Technology" went viral in tech circles. He talked about this idea of innovation tokens and how we need to be intentional about how we spend them. I think this is particularly important because there's so much noise in the artificial intelligence space that it's really hard to figure out how to prioritize testing out one solution over another. We engineers love to play with new toys. Not every proof of concept should be making it to production. I acknowledge, again, that this is tempting given the era that we're in.

Ultimately, recency bias shouldn't be driving decision criteria, it should just be informing it. One of those boring technologies is also documentation. This might be a hot take, but too many internal developer platforms could be replaced by good best practices and standards documentation. No, it's not as exciting, but it's still work that enables us to learn and mature how we build technology. When we think about how we're interacting with GenAI systems as well, technical writing is actually quite important now. Being able to write explicit requirements documentation is crucial for generating the right software out of these different tools. Again, even if months or years later we decide that we do end up needing to build a new tool, it's often still not wasted effort, because that documentation ends up being a pretty good start for design and requirements gathering anyway.

I mentioned earlier some design tensions related to architecture best practices. We're going to review these next, and then transition to our final platform concept, organizational technology leadership, which covers the methods we use to drive organizational change. First, we have what I think is the hardest tension to balance: standardization versus flexibility. The shared nature of a developer platform is an awesome opportunity to reduce the risk of drift. We have to hold that in tension with the flexibility that product engineers might need, especially as our organization grows and the number of technical needs grows with it.

For example, right now, my organization is facing the consequences of building tens of services with an opinionated framework in Go that has since not aged well. Now not only do we have to revisit how we approach runtime support, but we also have to reconcile the tech debt that manifested from a decision made many years ago. I previously said that the opposite of isolation is community. Now the way that we're approaching it is from the standpoint of driving standards with the actual product teams, as well as through our learning communities of practice. In this way, we're able to share and distribute decision-making power, aligning with the sharing principle from the CALMS framework earlier.

Next, we have the tension of simplicity and complexity. As we respond to the evolving needs of our users, complexity becomes harder to manage, because the architecture that supports them is likely subject to change, whether that's beginning to use event-driven communication styles or embracing client-side rendered frontends. This just becomes another area where we need to be intentional. Like tech debt, complexity is inevitable, but we can compartmentalize it somewhat by making sure that the developer-facing interfaces are simple and backwards compatible. This leads to the most common source of complexity in software engineering, which is integrations. We know the common principle of reducing coupling between services. The same applies to platform work.

Integrations are a high risk to sociotechnical excellence because avoiding coupling is incredibly difficult. That's why a huge selling point for some vendors is their integrations, so that we don't have to think about them ourselves on a day-to-day basis. Speaking of vendors, remember our automation principle from earlier? Even though I just spent quite some time talking about design principles for building platforms, I'm also here to say: give yourself permission to not build at all. The decision to build versus buy versus contribute should be our bread and butter. Deciding that we don't want to take on the work of building and maintaining a tool is a very valid one, because, as one of my brilliant mentors once told me, every line of code that we write is a liability, especially in this evolving tech era of AI. Code isn't our bread and butter, especially now. Research, design, and technical decisions are. Engineering is a craft, and we should continue to treat it as one.

Community-Informed Leadership

Lastly, we have organizational leadership. Organizational leadership is where that joint optimization from earlier comes into play. It's the work of taking what we've talked about today and actually applying it to an organizational context. Because of the inherent complexity of this, I'll cover some practices, but in the context of a more defined problem space. I mentioned in the beginning of this talk that I'm a TL for a reliability platforms team. I've thought a lot about what it looks like to build sustainable reliability management experiences. To begin, I'll circle back to the sharing principle of the CALMS framework from earlier. I reframed that principle through a leadership lens, discussing it as a community-informed approach. This is where I'll elaborate on that further. Because there are so many external factors and internal tensions, community-informed leadership is a sustainable model for leading organizations.

First, beginning with this idea of being stewards of sociotechnical excellence. To be a steward of sociotechnical excellence means taking responsibility for the ongoing health of a technology organization or team. In the context of platform engineering and technical leadership, stewardship means cultivating environments where people and systems can thrive together over time. It means honoring inherited knowledge: grasping the system's history and how it came to be instead of just operating on assumptions that don't translate well. In a reliability management context, this can mean the difference between actually understanding why an application service is experiencing an outage and merely treating its symptoms, for example.

If the mental model that we're operating under is only based on recent context, we risk attributing the cause of an outage to factors that are merely symptoms. Honoring inherited knowledge allows teams to see patterns across incidents and understand the tradeoffs that were made by previous engineers, especially given the reality of staff attrition. Ultimately, the understandability that comes out of this is crucial for our effectiveness. Being a steward of sociotechnical excellence also means fostering inclusive dialogue: making space for diverse perspectives and identifying tensions rather than avoiding them. In the case of incident management, the chaos of incidents makes it really easy to overlook this diversity of perspectives, whether that's by job level, title, or specialty. Tight and effective collaboration among folks coming from an operational or a product background is what enables us to mitigate incidents more effectively.

Lastly, it means taking principled action, even and especially when consensus is out of reach. This is ultimately because good leadership isn't about being liked or peacocking. It's about being in service of the people and the systems they depend on. This is particularly important when we are acting as incident commanders. Being able to actually do this in service of people is crucial to getting to the ultimate fix.

At the end of the talk, I'll review some of the consequences of some centralized leadership styles, specifically when the centralized leadership manifests in the form of heroism. While there is a need, again, for centralized leadership, in reality, most organizations actually need a balance of both styles, where central guidance is complemented by distributed decision-making. This is where the concept of distributed leadership comes in. Distributed leadership is about more than just delegation. It's about sharing power intentionally and cultivating trust, and creating structures where decisions are guided by the people who are closest to the work and most impacted by its outcomes. In practice, this looks like teams having autonomy to adapt within guardrails. Product engineers should be shaping platform roadmaps. Incident responders should be codifying operational norms instead of waiting for permission from leadership. This model supports organizational resilience. It not only prevents bottlenecks, but it also guides and builds leadership capabilities and capacity across the system, so that when challenges emerge, leadership isn't just coming from one person, it can come from anywhere.

Ultimately, what this leads to is the idea of being able to lead by example. Because platform engineering work naturally touches many, if not all, parts of our organization, we have a unique opportunity to show what it looks like to operate in a way that achieves excellence without sacrificing people along the way. To lead by example means to embody the values, accountability, and behavior that we hope to see in others. That includes respecting technical boundaries, being transparent about tradeoffs, prioritizing long-term maintainability, and treating internal users as collaborators. It also means modeling how to engage with conflict constructively, how to take responsibility when things do go wrong, and how to center care and integrity even when there's pressure to deliver.

To circle back to my promise of defining organizational leadership, I'm going to review a framework that I frequently pull from when I'm forming a technical vision and strategy. To give more history on how this framework came to be, it was in the context of me wanting to move reliability engineering teams from a reactive state to one that's more proactive and preventative. Time for reactiveness will certainly come, if only because incident management requires it. In the preventative and proactive states, there is an opportunity to minimize the impact and frequency of those reactive times.

For me, this framework has been particularly helpful for addressing chronic problems, problems that are long-lasting and have emerged as a dysfunctional organizational pattern. That said, we'll dive into more specifics. I define three approaches to handling chronic issues. The first, again, is preventative. The second is proactive. The last, which is the one that we want to avoid as much as possible, is reactive. The preventative approach requires us to design processes and systems that prevent these problems in the first place. This ties back to what I mentioned about striking the right balance between standardization and flexibility earlier. By using our collective context to inform how we strike this balance, we can design systems and processes that age and scale well. Obviously, we won't be able to prevent all of our problems this way, but it can reduce the number of problems and keep our teams focused on harder problems that will mature our team quicker.

To do that, we have to have a way of monitoring the health of our team and organization. These are the feedback loops and contexts I've mentioned, and they can take many different forms. The point is that we should make it easy to find the patterns that serve as input for short-term and long-term improvements by building those robust feedback loops. Again, feedback loops ultimately serve the function of communicating context and pain points throughout the team, which is important for ensuring that people are also being heard. Problems will arise, it's inevitable, but this upfront investment that we put into building robust feedback loops will drive our teams towards being in a preventative and proactive state instead of one that's more reactive. To ground this in an example, one robust source of feedback is on-call. The experience of on-call is a great feedback loop for improving our technology and how our team works, because we can learn something from every alert, every page, and every on-call task. All of this is powerful data that can be used to improve ourselves and our systems.
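
As a tiny illustration of treating on-call as a feedback loop, here is a minimal sketch in Python of summarizing a shift's pages to surface noisy, rarely actionable alerts. The alert names and data are made up for the example; in practice this kind of data would come from whatever paging or alerting tool the team already uses.

from collections import Counter

# Hypothetical export of one on-call shift's pages: (alert name, was it actionable?).
pages = [
    ("HighLatency-checkout", False),
    ("HighLatency-checkout", False),
    ("DiskSpaceLow-db", True),
    ("HighLatency-checkout", False),
    ("ErrorRateSpike-search", True),
]

totals = Counter(name for name, _ in pages)
actionable = Counter(name for name, was_actionable in pages if was_actionable)

print("Pages per alert this shift:")
for name, count in totals.most_common():
    # Alerts that page often but are rarely actionable are strong candidates
    # for tuning, automation, or deletion -- that's the feedback loop in action.
    print(f"  {name}: {count} pages, {actionable[name]} actionable")

Even a summary this simple turns every page into data the team can act on during a dedicated on-call shift, rather than noise to be endured.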

Next, we should be strategic about how our teams prevent chronic issues from happening in the first place. Going back to the feedback loops, we should be constantly learning from these feedback loops, over time, building our team's collective knowledge of how to manage reliability effectively and build excellent software. This is really important for morale also, because people feel good when they produce excellent work. There are times where we should also invest early and continuously so that we aren't constantly distracted by systems that are best defined by their inability to be reliable. When we're not in this mode, it's really hard to get here. It involves a degree of trust that the time that we spend up front will pay off later. When I introduced this way of engineering on a previous team, I heard a lot of initial pushback from cross-functional peers who were more concerned with productivity. For those of us who are on teams that feel like rushing is causing a lot of our production problems, but feel stuck in this cycle of rushing, just choose one project to try this out and see how it goes.

Then use that as a model for getting buy-in and driving long-term change. This long-term approach has to be complemented with a short-term approach for when our long-term strategy inevitably falters from time to time, which is why having a strategy for holding our team accountable to itself is important. These strategies and frameworks should be transparent and aligned with our core principles. Decision-making shouldn't be happening in a silo, and our team should feel like they're part of the process. Luckily, the tech industry has come up with one solution to this need: service-level objectives and error budget policies. Service-level objectives, or SLOs, introduce transparency by defining reliability targets. They clarify our collective expectations around what reliability experience we should be providing users. Coupled with error budget policies, which define the measures that we take when we stop meeting those expectations, SLOs introduce an extra layer of accountability that makes teams more resilient to failure.
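
To ground SLOs and error budgets in something concrete, here is a minimal sketch in Python of how a team might compute how much error budget remains and what an error budget policy could say at each level. The SLO target, window, thresholds, and policy actions are illustrative assumptions for this example, not prescriptions from the talk.

from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    target: float           # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int = 28    # rolling evaluation window

    @property
    def error_budget(self) -> float:
        # Fraction of requests allowed to fail within the window.
        return 1.0 - self.target

def budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    # Share of the error budget still unspent: 1.0 = untouched, 0.0 = exhausted, negative = overspent.
    if total_requests == 0:
        return 1.0
    observed_failure_rate = failed_requests / total_requests
    return 1.0 - (observed_failure_rate / slo.error_budget)

def policy_action(remaining: float) -> str:
    # A simple, illustrative error budget policy: the less budget left, the more we slow feature work.
    if remaining <= 0.0:
        return "freeze feature releases; prioritize reliability work"
    if remaining < 0.25:
        return "require extra review for risky changes"
    return "normal feature delivery"

checkout_slo = SLO(name="checkout-availability", target=0.999)
remaining = budget_remaining(checkout_slo, total_requests=2_000_000, failed_requests=1_500)
print(f"{checkout_slo.name}: {remaining:.0%} of error budget remaining")
print("policy:", policy_action(remaining))

In practice the numbers would come from monitoring data rather than hard-coded values, and the policy actions would be agreed on with the product teams the platform serves, in keeping with the sharing principle.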

Now let's talk again about a common source of pain for teams: on-call. Most engineers I know don't look forward to on-call. Incident management can be stressful, especially if the state of our on-call is best described as utter chaos. That shouldn't be the case. If our engineers are dreading their on-call shifts, that's feedback that the state of our on-call is unhealthy and needs actioning. Incidents and on-call noise aren't the only source of pain for on-call shifts. It is unfortunately a very common expectation for on-call engineers, especially product engineers, to balance the work of on-call with their long-term project work. I totally consider this an antipattern. Not only is it unfair, but it introduces instability to our roadmaps if we're depending on on-call engineers to make progress against a deadline.

Because of this, I prefer to avoid it altogether by having what I refer to as dedicated on-call shifts. What I mean by this is that instead of forcing on-call engineers to balance both on-call and long-term project work, we empower engineers who are on-call to take ownership over how they spend their time outside of incidents. Not only does this relieve the stress of needing to manage incidents and project work at the same time, but it also communicates trust in teammates to help make on-call better over time, by providing a steady avenue of creative freedom for them to solve the problems that bother them the most.

The reason I don't consider this a long-term approach is that we won't be able to rely only on dedicated on-call shifts to mitigate our issues. Not every issue or improvement will fit into one on-call shift, which is usually one week. It is an additional layer of reassurance, and it's a powerful tool for keeping our team accountable to itself. Lastly, it shouldn't be one leader holding a team accountable. When we fall into that pattern, we're introducing a huge dependency on our leaders to hold our team to its values. Much like a dependency introduces system vulnerabilities, so does the singular leader. Even if our organization isn't large, we should find ways to reduce our team's dependency on leaders. Maybe that means making sure that our team has strong relationships with skip levels or other leaders in our org. The goal here is ultimately to provide a sustainable leadership model for our team and organization. Again, one where power is distributed so that leaders can use each other's strengths to serve a shared goal of building excellent and healthy teams.

Now we're going to switch over to the proactive state. In the proactive state, a problem has emerged but hasn't caused significant damage yet, and we don't want it to get any worse. Because we proactively monitor for early indicators, we can address issues before they have long-term impact. Again, we need to make sure that our feedback loops capture a range of perspectives. We need to dig into the granularity of our experiences and behavior, because different issues will affect different people and different systems differently. That doesn't mean they aren't just as important. This is best served by having multiple sources of feedback loops, with a stern reminder that too much process can introduce its own set of problems.

More importantly, we should focus on making the feedback actionable. Having worked for companies with tons of bureaucracy, I've seen too many processes become a source of harm instead. Our initial solutions might not end up working. Instead of forcing our team to accommodate the process, we should find opportunities to adjust the process to accommodate the needs of our teams. Even though we should always be looking for new areas of improvement, it's ok, and also essential, that we celebrate the progress that we do make. We should be showing gratitude for the ways that our teams do step up and for the times that people show that they care about preserving sociotechnical excellence. A lot of us have retros, either general team retros or maybe project-specific ones. Make space for explicit celebration during those rituals. Revisit your premortems to see what you identified as a risk that ended up not actually happening. Make space during incident postmortems to celebrate the things that your team did well during what was possibly a really stressful time.

Lastly, the reactive approach. At this point, the chronic issue has already had a negative impact on our team, or our organization, and we're forced into addressing it. Once we've reached the state that an issue has become a chronic issue, it becomes a lot harder to solve. It becomes a lot harder to restore our team's sense of safety and trust. Because at this point, our team has probably lost trust in its leaders, the organization in general, and worst-case scenario, in each other.

Now, instead of taking this moment as a signal to make change happen, what we often see instead is organizations coming to rely on acts of heroism until people reach the point of burnout. When I say heroism and burnout here, I'm not just referring to the type of burnout that we tend to focus on, like overworking. I'm also referring to the type of emotional burnout that happens when someone is in an environment that's unhealthy. It goes without saying that heroism and burnout are not effective strategies for organizational failures. Because that's ultimately what we're asking of people when they're put in that situation: to make up for organizational failures by sacrificing their well-being.

Let's expand on the cultural and organizational consequences of this. Heroism prevents true progress because heroics are Band-Aids on systemic issues. They prevent progress by enabling us to put off the hard work of actually addressing deeper organizational flaws. Like tech debt, it might be effective in the short term, but it's not an effective long-term strategy. Eventually, we have to pay that sociotechnical debt back. The way that we pay it back tends to be in the form of burnout. Not only is that awful for the people who have to experience it, but it's also awful for the actual organization. We all know how awful it is for an engineer to leave. Don't give them more reasons to leave. If we are in a situation where our only choice is to engage in heroics, we should push back and say no if we're in the position to. If we're not in the position to, or are forced into engaging in heroics, take that as information about whether this is the type of environment that you actually want to be a part of.

Lastly, the impact of heroism isn't distributed equally. It looks different depending on your personhood. For some, it's a point of celebration. For others, it's merely an extension of what might already be a psychologically unsafe environment. The second part of this third consequence here is that heroism often leads to disproportionate power between teammates, when our ultimate goal is to distribute power and choice. Earlier I said how leaders of a team should reduce the team dependency on themselves. This applies to everyone in a team or organization. Heroism is dangerous because it redistributes power by putting people in a position where they have to depend on heroes. This is why the part about a strong leadership core is so important. When a team or organization becomes so dependent on one person, especially when that person is a leader, it can start to feel like they're untouchable. Like they're incapable of doing any wrong or being held accountable. What do we do when we get here?

Unfortunately, I'm here to deliver some perhaps obvious but hard truths. When a team reaches this state, leaders are the ones responsible for the organizational failures that got them there. We talk a lot in reliability management about blameless postmortems. The blameless postmortem is definitely a wonderful tool, but it does not apply to centralized leadership. In fact, part of blameless culture is that we shift towards identifying the systemic reasons that caused the problem. While blameless postmortem culture might not agree that leaders are at fault for these issues, it definitely agrees that leadership is responsible for them. This is the thing that I think a lot of technology leaders struggle with more than anything else. It can be really hard to reconcile the fact that most of the problems we solve or work on aren't directly our fault, but that they are our responsibility. And ironically, as much as we don't want to cast blame, it's when we ignore our responsibility as leaders that issues actually start to become our fault.

Another hard truth: sometimes you or your team's leadership core are those leaders, and the onus is on us to take responsibility. Sometimes what that responsibility looks like is holding whoever is in leadership above us accountable in the ways that we have access to. It's also important that we recognize our own role in that organizational failure. This is ultimately because the higher up in leadership we are, the more our flaws have the potential to scale across our organization. Use whatever access you have to act on that responsibility. Sometimes we have to be strategic about when to use the privilege that comes with leadership, but generally speaking, most people tend to underestimate and underutilize that privilege. I think this is true in the context of leadership, but also in the context of the world generally.

Now, I've thrown the word responsibility around a lot, but given no direction on what that looks like. This is because it's wildly complex and conceptual, but I think it can be condensed into three steps. The first is to admit where we went wrong, admit where we played a role in letting it get this way, whether that role was direct or more one of enablement. People will ultimately appreciate us a lot more for being vulnerable about where we went wrong, especially when we follow up with action. This is ultimately because psychological safety is about feeling safe to make mistakes while trusting that we're in an environment that seeks to minimize psychological harm through accountability.

This is the core of what it means to be community-informed. It's not just about feeling safe that we as individuals can make mistakes, it's also about preserving safety in spite of the mistakes that will inevitably end up happening. Which is why the second step is centering the folks who were impacted. This is where we really need to practice empathy. Who was harmed in the process? Who had to step up as a hero or a leader because we didn't? We should thank them and reward them and ask them what they need to rebuild trust in the organization. Lastly, there are the actual changes we follow up with. That means revisiting the preventative and proactive measures that we have in place. We should ask ourselves, where did our processes fail to get us here? What cultural or organizational flaws contributed to that failure? Tying back to what we just talked through in the last slide, when centering those who were impacted, ask for their thoughts on these questions without placing the burden on them.

We can do that by putting our own thought into what changes we think will be impactful, and then asking them, what did we miss? That's because what's not obvious to us might be very obvious to others. We should ask them because the experiences and feedback that they provide are powerful data that we can use to improve our organization. Depending on the situation, what that action looks like can vary widely, but approach it from a systems-thinking angle. This is ultimately because leadership is earned, not owed, and it's earned continuously, because just like the flaws of our organization are felt by our most vulnerable coworkers, so are the flaws of our leaders.

Final Note on Tech Leadership

One final note: I know none of this is easy. All of this is easier said than done. My hardest moments as a team or tech lead were ones where I had to deal with these types of complex issues, whether that's stepping up in light of difficult situations like bias, or driving serious change to a culture that enables burnout. Every single one can take a little bit out of you, but that's the price that we ultimately pay for technology leadership. As technology leaders, we should never lose sight of that privilege, the privilege to cultivate culture, to cultivate community in a way that achieves excellence without sacrificing people along the way. In a world where sacrificing people in the name of business needs is so unfortunately common, especially right now, choose to bring this energy, this way of thinking, into all the ways that you have access to.

Trust me, as scary as it might be, having been in a position where my access to that was very minimal, the heartache that comes with leadership is still one of the best privileges I've ever had. Honor that privilege by being honest with yourself about where you and others might not be doing right by people. This is ultimately why we attend conferences like this: to refine our craft by being exposed to people with experiences that we can learn from. Honor your craft as a technology leader by surrounding yourself with people who can hold you accountable and whom you can do the same for. Leadership can be very lonely, but it doesn't have to be, for our sake and especially for the sake of the people that we have the privilege of leading.

 


Recorded at: Dec 22, 2025
