Key Takeaways
- One Network unifies policy management and networking across diverse environments by treating every service as a manageable endpoint.
- Guided by five principles, the architecture is built on open-source foundations like Envoy, enabling extensibility and integration with first-party and third-party tools through service extensions.
- Policy enforcement in One Network is designed to be consistent and scalable, supporting segmentation, application-wide boundaries, and service-level controls across all network paths.
- Success relies on long-term executive commitment, collaboration across many teams, and a strategy of incremental improvements, rather than a single large rollout.
- Organizations considering a similar approach should ensure strong leadership support and align the initiative with compliance, policy, and multi-cloud strategies, focusing on short-term wins and long-term objectives.
One Network is a unified service networking overlay that simplifies policy management across different services and environments. It aims to provide a single, network-level approach to policy enforcement that works across public and private clouds, multi-cloud setups, and various deployment models.
At QCon San Francisco 2024, I shared the learnings and accomplishments from this seven-year effort.
Background
When Google entered cloud computing, we joined the race midway. Around 2020, our team had an ecosystem of more than 300 products, each running on its own infrastructure and following different network paths. This fast, organic growth left our services poorly integrated, and releasing new features became cumbersome: every new feature had to be implemented separately for each product and network path.
The real challenge was policy management. Policies are the most critical part of cloud infrastructure, controlling everything from security and traffic management to compliance. With so many unique products and network paths, we needed a way to manage policies centrally, without having to modify every individual product. This realization set the stage for the idea of One Network.
Networking is complicated because Google’s infrastructure has its own networking stack, which supports services such as Search and YouTube. As cloud products multiplied, container systems, virtual networks, and service meshes each brought their own networking rules. Google Cloud Platform (GCP) adds virtual networking such as Andromeda, runtimes like Kubernetes (GKE) and Compute Engine (GCE) add their own networking layers, and service meshes run atop this infrastructure, adding yet another abstraction.
As applications are built across these environments, they run on different infrastructures with separate network paths. This creates a complex, n-squared problem, which I often describe as Swiss cheese: something that works in one environment might not work in another. This was the inspiration for One Network, a unified service networking overlay that brings consistency and integration across all these diverse environments.
Overview: One Network
The goal of One Network is to enable uniform policies across services. To do so, we are looking to overcome the complexities of heterogeneous networking, different language runtimes, and the coexistence of monolith services and microservices. These complexities span multiple environments, including public, private, and multi-cloud setups. The idea behind One Network is to simplify the current state of affairs by asking, "Why do I need so many networks? Can I have one network?"
To explain further, One Network relies on a single proxy, one control plane to manage those proxies, and one load balancer that supports runtimes like GKE, GCE, and Borg across different cloud environments. Universal data plane APIs and extensibility APIs let first-party and third-party services plug into the ecosystem, leading to uniform policies across environments.
Initially, many found such a system "too good to be true". Over time, however, people in security, DevOps, networking, and application development roles all saw benefits from it.
One Network is a solution with breadth, depth, and isolation. Its universal nature enables orchestrating large environments and improving observability.
One Network Principles
We have built One Network on five principles:
- Build on a common foundation.
- Everything is treated as a service.
- Combine all paths and support all environments.
- Create an open ecosystem of what we call service extensions, which are essentially pluggable policies.
- Apply and enforce these policies uniformly across all paths.
Principle #1 - Common Foundation
Let’s start with the first principle. The One Network pyramid below explains how we narrow the scope at each layer. At the base, we use the Envoy proxy, an open-source proxy available anywhere, whether on GCP or on-prem. Around Envoy, we build GCP-based load balancers that work for VMs, containers, and serverless.
On top of that, we have the GKE controller and GKE gateway, which use the same infrastructure, but they serve only GKE workloads.

Vertex AI is at the top; the gateway is just an implementation detail. All these layers share the same infrastructure and are managed by a single control plane, Traffic Director. This control plane can be deployed regionally, globally, or specialized per product, but it’s always the same binary and API, enabling control and consistent orchestration everywhere.
Looking at the One Network architecture below, you see a path from left to right: mobile to edge, cloud data center, multi-cloud, and on-prem.

The three main building blocks are the Traffic Director control plane, the open-source xDS APIs that link Traffic Director to the data planes, and the open-source data planes themselves (Envoy or gRPC). The data planes' open-source nature allows us to extend across clouds and environments, not just GCP.
Google invested in Envoy proxy, which was launched in 2016. It is a modern proxy that includes configuration APIs, generic data plane APIs, and external AuthZ and processing hooks, as well as specialized APIs like rate limiting. Envoy supports L4 (TCP) and L7 (HTTP) filters, which can be first-party or third-party services offered via WebAssembly. There are two types of WebAssembly deployments: one linked into Envoy, which changes ownership, and another running out of process. Google invests heavily in the open-source proxy WebAssembly (Wasm) project. Using Wasm filters, you can load first-party and third-party code into the data plane.
These filters support request-based, response-based, or request-response traffic, depending on how you need to process the request. Traffic Director manages the delivery of all filter configurations to data planes.
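To make the filter model concrete, here is a minimal sketch of an HTTP filter written against the proxy-wasm Go SDK (the tetratelabs package path), compiled to Wasm and loaded into an Envoy-based data plane. It only adds a request header; the header name and value are hypothetical, and this is an illustration of the mechanism rather than an actual One Network extension.

```go
// A minimal proxy-wasm HTTP filter sketch. Build with TinyGo (target wasi)
// to produce a .wasm module that an Envoy-based data plane can load.
package main

import (
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm"
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm/types"
)

func main() {
	// Register the VM context when the module is loaded by the proxy.
	proxywasm.SetVMContext(&vmContext{})
}

type vmContext struct{ types.DefaultVMContext }

func (*vmContext) NewPluginContext(contextID uint32) types.PluginContext {
	return &pluginContext{}
}

type pluginContext struct{ types.DefaultPluginContext }

func (*pluginContext) NewHttpContext(contextID uint32) types.HttpContext {
	return &httpContext{}
}

type httpContext struct{ types.DefaultHttpContext }

// OnHttpRequestHeaders runs for every request that traverses the filter.
func (*httpContext) OnHttpRequestHeaders(numHeaders int, endOfStream bool) types.Action {
	// Hypothetical policy: tag the request so downstream services can see
	// that it passed through this extension.
	if err := proxywasm.AddHttpRequestHeader("x-one-network-tag", "checked"); err != nil {
		proxywasm.LogErrorf("failed to add header: %v", err)
	}
	return types.ActionContinue
}
```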
Inside Google, Traffic Director acts as an xDS server, blending dynamic and static configuration. It powers Google’s global load balancer (GSLB), optimizing traffic for services like Search and YouTube.
It handles global routing, backend capacity, and centralized health checks to remove the overhead of n-squared health checks and free up data center throughput. Integrated with autoscaling, it can respond to traffic bursts instantly. When an admin creates a policy, Traffic Director executes it across all data planes.
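As a rough illustration of what an xDS control plane does, the sketch below uses the open-source go-control-plane library: it builds a versioned configuration snapshot for a data-plane node and serves it over the Aggregated Discovery Service. This is the open protocol Traffic Director speaks, not Traffic Director itself; the node ID is a placeholder and the resource lists are left empty where a real control plane would derive clusters, routes, listeners, and endpoints from policy.

```go
// A minimal xDS control-plane sketch using go-control-plane.
package main

import (
	"context"
	"log"
	"net"

	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	cachetypes "github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()

	// The snapshot cache holds per-node, versioned configuration.
	snapshotCache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)

	// Build a configuration snapshot. In a real control plane the cluster
	// and route resources would be derived from the declared policies.
	snapshot, err := cachev3.NewSnapshot("v1", map[resourcev3.Type][]cachetypes.Resource{
		resourcev3.ClusterType: {}, // cluster resources would go here
		resourcev3.RouteType:   {}, // route resources (e.g., traffic splits) would go here
	})
	if err != nil {
		log.Fatalf("building snapshot: %v", err)
	}

	// Push the snapshot to a specific data-plane node ID (hypothetical).
	if err := snapshotCache.SetSnapshot(ctx, "example-node", snapshot); err != nil {
		log.Fatalf("setting snapshot: %v", err)
	}

	// Serve the snapshot over the Aggregated Discovery Service (ADS), which
	// both Envoy and proxyless gRPC data planes can consume.
	srv := serverv3.NewServer(ctx, snapshotCache, nil)
	grpcServer := grpc.NewServer()
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)

	lis, err := net.Listen("tcp", ":18000")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(grpcServer.Serve(lis))
}
```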

Principle #2 - Everything as a Service
The second principle is treating everything as a service. Below is an actual diagram of a real service.

One Network enables you to manage such a service by applying governance, orchestrating policy, and managing its many small, independent services.
Each of these microservices is modeled as a service endpoint. This lets us orchestrate and group service endpoints without application developers having to modify their service implementations; everything is done at the network level.
There are three ways to manage these service endpoints. The first is the classic model: you add a load balancer before a workload, such as a shopping cart service running in multiple regions, and that becomes your service endpoint.
The second model has a producer-consumer relationship. For example, a SaaS provider may build their service on Google Cloud and expose it to customers through a single service endpoint using Private Service Connect.
In this setup, the producer doesn’t have to expose their internal architecture; the consumer only interacts with the endpoint. This pattern is useful when you want to keep implementation details private or when you have shared services within a company that multiple teams need to access. Each consumer can apply their policies to the endpoint, and you can expose as many endpoints as required.
The third type is the headless service, often found in service meshes within a single trust domain. There is no gateway or load balancer; services are simply represented as a group of IP ports.
A clear example of the producer-consumer model is an AI model: the model creator (producer) keeps the inference stack hidden behind a Private Service Connect endpoint, and application consumers connect to it without knowing the internal workings.
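As a purely illustrative sketch (not any Google Cloud API), the Go snippet below models the three endpoint types under one abstraction. The point is that a policy attaches to the endpoint regardless of how the service behind it is exposed; all names and fields are hypothetical.

```go
// Illustrative data model only: three exposure models, one policy attachment point.
package main

import "fmt"

// ExposureModel captures the three ways a service endpoint is surfaced.
type ExposureModel int

const (
	LoadBalanced     ExposureModel = iota // classic: load balancer in front of workloads
	ProducerConsumer                      // single endpoint exposed by a producer (e.g., Private Service Connect)
	Headless                              // mesh-internal: a group of IP:port backends, no gateway
)

// ServiceEndpoint is the unit that policies attach to, independent of model.
type ServiceEndpoint struct {
	Name     string
	Model    ExposureModel
	Backends []string // regions, a PSC attachment, or IP:port pairs depending on the model
	Policies []string // names of attached policies (hypothetical)
}

func main() {
	cart := ServiceEndpoint{
		Name:     "shopping-cart",
		Model:    LoadBalanced,
		Backends: []string{"us-central1", "europe-west1"},
		Policies: []string{"authz-default", "rate-limit-standard"},
	}
	fmt.Printf("%s exposed via model %d with policies %v\n", cart.Name, cart.Model, cart.Policies)
}
```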

Principle #3 - Unify All Paths and Support All Environments
The third principle relates to unifying network paths and environments. The main reason for doing this is to allow the same policies to be applied across all services. To unify paths, we first had to identify them. We generalized the many possible network paths down to eight main types:
- Internet -> Workload
- Internet -> SaaS
- VPC -> Workload
- VPC -> SaaS
- Workload -> Workload
- Workload -> SaaS
- Workload -> Internet
- SaaS -> Internet
For each path, we mapped out the network infrastructure needed to implement policy enforcement. This includes external load balancers for internet-facing traffic, internal load balancers for internal traffic, service meshes, egress proxies, and mobile environments.
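The sketch below, which is illustrative only, shows why this enumeration matters: once the paths form a finite list, applying a policy "everywhere" becomes a loop over known enforcement points rather than a per-product effort. The path-to-infrastructure mapping shown is a simplified assumption, not the exact production mapping.

```go
// Illustrative only: map each generalized path to the infrastructure that
// enforces policy on it, then apply one policy across all paths.
package main

import "fmt"

type Path string

const (
	InternetToWorkload Path = "internet->workload"
	InternetToSaaS     Path = "internet->saas"
	VPCToWorkload      Path = "vpc->workload"
	VPCToSaaS          Path = "vpc->saas"
	WorkloadToWorkload Path = "workload->workload"
	WorkloadToSaaS     Path = "workload->saas"
	WorkloadToInternet Path = "workload->internet"
	SaaSToInternet     Path = "saas->internet"
)

// enforcementPoint names the data-plane infrastructure assumed to carry each path.
var enforcementPoint = map[Path]string{
	InternetToWorkload: "external load balancer",
	InternetToSaaS:     "external load balancer",
	VPCToWorkload:      "internal load balancer",
	VPCToSaaS:          "internal load balancer",
	WorkloadToWorkload: "service mesh",
	WorkloadToSaaS:     "service mesh",
	WorkloadToInternet: "egress proxy",
	SaaSToInternet:     "egress proxy",
}

func main() {
	policy := "deny-unauthenticated" // hypothetical policy name
	for path, infra := range enforcementPoint {
		fmt.Printf("apply %q on %s via %s\n", policy, path, infra)
	}
}
```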

Let’s look at them one at a time. For GKE, services are typically surfaced through the gateway and load balancer. We took Envoy, which started as a standalone deployment, and turned it into a managed load balancer.
We spent over a year hardening it in open source so it could handle internet traffic. There are both global and regional deployments. Global deployments are for customers who need to serve a worldwide audience or want cross-regional capacity, while regional deployments are for those who care about data residency, isolation, or reliability. Both options connect to all runtimes.
The next deployment model is the service mesh. Istio is now one of the most widely used service meshes. What stands out about service meshes is how they break down service-to-service communication into separate concerns: service discovery, traffic management, security, and observability.
Each area can be managed independently. Google’s Cloud Service Mesh is based on Istio, backed by Traffic Director, gateway APIs, and Istio APIs. It works across VMs, containers, and serverless. Long before Istio, Google had its own service mesh using a proxyless approach and a protocol called Stubby. The control plane would configure Stubby directly, so it functioned like a modern service mesh without sidecar proxies.
We’ve exposed this proxyless mesh idea to customers and open source. gRPC uses the same control plane APIs as Envoy, reducing overhead since there’s no proxy to install or manage.
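For an application developer, the proxyless model looks roughly like the sketch below: importing gRPC's xds package registers the xDS resolver and balancer, and the client then receives routing and load-balancing configuration from the control plane named in its bootstrap file. The target name is hypothetical and the credentials are simplified.

```go
// A minimal proxyless gRPC client sketch in Go.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Importing the xds package registers the "xds" resolver and balancer,
	// so routing, load balancing, and policy come from the xDS control plane
	// instead of a sidecar proxy. A bootstrap file (pointed to by the
	// GRPC_XDS_BOOTSTRAP environment variable) names that control plane.
	_ "google.golang.org/grpc/xds"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// "xds:///" tells gRPC to resolve this (hypothetical) service through
	// the control plane rather than DNS.
	conn, err := grpc.DialContext(ctx, "xds:///shopping-cart.example.internal:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Generated service stubs would be used on conn as usual; data-plane
	// behavior (routing, retries, traffic splits) is delivered via xDS.
	log.Printf("connected, state=%v", conn.GetState())
}
```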
A similar but slightly different approach is GKE data plane v2, which uses Cilium and eBPF in the kernel. This simplifies networking for GKE, improves scalability by eliminating sidecars, and provides always-on security and built-in observability. For L7 features, traffic is automatically redirected to an L7 load balancer.
Mobile is another interesting area. Although we didn’t productize this, we experimented with extending One Network to mobile devices.

Due to power constraints, mobile workloads can’t keep persistent connections to the control plane, so the handshake is different, with Traffic Director caching configurations.
We tested this on a simulation of 100 million devices using Envoy Mobile, a library linked into the mobile app. This setup allows for identifying and managing individual devices, delivering configuration, collecting observability data, or shutting down a rogue device if necessary.
Another ongoing project is control plane federation, which is relevant for multi-cloud or on-premises environments where customers run deployments outside GCP. Here, you might have your own Envoy deployment or a proxyless gRPC mesh with a local Istio control plane.
The local control plane handles dynamic configuration and health propagation, so if the connection to Traffic Director is lost, the local deployment keeps running until it reconnects. This setup provides a single management view for thousands of on-prem or multi-cloud deployments.
Bringing all these pieces together, the architecture spans load balancers and service meshes across environments, including mobile and multi-cloud, all unified under a common policy and control framework.
Principle #4 - Service Extension Open Ecosystem
The fourth principle is building an open ecosystem for service extensions. After setting up the backbone, the next step was figuring out how to use it for a policy-driven architecture. That’s where Service Extensions come in. Service Extensions are based on Envoy’s ext_proc filters and enable programmability in the data path. Every API, such as external AuthZ for allow/deny decisions or external processing, is an opportunity to plug in custom policies.
For authorization, some customers want to use their own AuthZ system instead of the default one we provide. With Service Extensions, they can simply plug in their own authorization.
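At the protocol level, such a custom authorization extension is a small gRPC service implementing Envoy's open external authorization API. The sketch below uses the go-control-plane generated stubs; the header-based rule is hypothetical and only illustrates the shape of the callout, not a recommended policy.

```go
// A minimal external authorization (ext_authz) callout sketch.
package main

import (
	"context"
	"log"
	"net"

	authv3 "github.com/envoyproxy/go-control-plane/envoy/service/auth/v3"
	"google.golang.org/genproto/googleapis/rpc/status"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

// authServer implements Envoy's external authorization gRPC API, the same
// open API surface an authorization Service Extension callout would use.
// Depending on the go-control-plane version, you may also need to embed
// authv3.UnimplementedAuthorizationServer.
type authServer struct{}

// Check receives request metadata from the data plane and returns allow/deny.
func (a *authServer) Check(ctx context.Context, req *authv3.CheckRequest) (*authv3.CheckResponse, error) {
	headers := req.GetAttributes().GetRequest().GetHttp().GetHeaders()

	// Hypothetical rule: allow only requests carrying an API key header.
	if headers["x-api-key"] != "" {
		return &authv3.CheckResponse{
			Status: &status.Status{Code: int32(codes.OK)},
		}, nil
	}
	return &authv3.CheckResponse{
		Status: &status.Status{Code: int32(codes.PermissionDenied)},
	}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":9002")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	authv3.RegisterAuthorizationServer(s, &authServer{})
	log.Fatal(s.Serve(lis))
}
```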
Another case is API management. Traditionally, API management needs a dedicated gateway, like Apigee. But with Service Extensions, API management becomes ambient, available wherever you need it, whether at the edge, between services, on egress, or within a service mesh. This shifts API management from a point solution to a capability present throughout the network.
Similarly, for security tools like web application firewalls (WAFs), customers or even competitors can bring their own. The open ecosystem means customers can pick and choose which WAF to use, all on the same infrastructure, without having to build extra components or deal with integration headaches.
Looking at the One Network architecture, now you can see that these extensions can be plugged in at any point - routing, security, API management, or traffic services. There are two main types of service extensions. First, there are Service Extension plugins, which recently went into public preview. They are essentially serverless functions: you provide the code, and we run it for you, often at the edge for quick header manipulation or other lightweight tasks.
Second, there are Service Extension Callouts: SaaS integrations with no restrictions on size or ownership, just a callout to an external service using the open-source generic external processing or external authorization APIs.

There is a PDP (Policy Decision Point) proxy, which lets you plug multiple policies in behind a single proxy, enabling things like policy caching and complex policy manipulation. Each policy is a service managed by AppHub, our services directory.
Looking ahead, we’re considering managing all these extensions through a marketplace with lifecycle management, since the number of available extensions will only grow.
At that point, it’ll be important to have ways to recommend and select the right tools, like choosing between different WAFs, to help customers make informed decisions.
Principle #5 - Apply & Enforce Uniform Policies
The fifth and final principle is about how to apply and enforce uniform policies across services. We identified four main types of orchestration policies.
The first is segmentation. If you start with a flat network but want to create boundaries, you can segment by exposing only certain services and keeping others hidden. This forces traffic through specific chokepoints where policies can be enforced. Since everything is treated as a service, controlling which services are visible or hidden becomes straightforward.
The second type is applying a policy to all paths. In reality, every application is accessible through multiple network paths. For instance, a service might receive both internet and internal traffic. When you need to apply a policy quickly - in response to an incident - you want to be sure it covers every path to the service, without having to track them all manually. The orchestration system handles this, applying the policy programmatically across all paths.

The third type is applying policy at the application boundary. An application defines a perimeter containing its services and workloads. For example, an e-commerce application might include a frontend, a middle tier, and a database, while another application might handle the catalog.
Applications can call each other’s services, but an administrator can set boundary policies within a given application. For instance, only a specific service can access the internet, and everything else is blocked from external communication. This means policies like blocking public IPs or egress traffic need to be enforced on all workloads within that application.
The fourth type is policy delivery at the service level, which is especially useful at scale. Instead of configuring firewalls or settings on thousands of individual VMs, you can group them as a single service, set the policy once, and have it orchestrated across all the backends.
Policy enforcement and administration follow the concepts of policy administration points, decision points, and enforcement points. The One Network data planes act as enforcement points. Policy providers supply service extensions and administration points, allowing customers to define policies at a broader level, such as for an entire application or group of workloads, and orchestrate them from there.
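The Go sketch below is illustrative only, with no real One Network API behind it: it shows the administration-point idea of declaring a policy once at a broad scope and letting an orchestrator expand it into configuration for every enforcement point. All names are hypothetical.

```go
// Illustrative only: fan a policy declared at application scope out to every
// enforcement point (the data planes). In One Network this role is played by
// the control plane.
package main

import "fmt"

// Policy is declared once, at a broad scope, by an administrator.
type Policy struct {
	Scope string // e.g., "application:ecommerce" or "service:shopping-cart"
	Rule  string // e.g., "deny-egress-to-internet" (hypothetical)
}

// EnforcementPoint is any data plane carrying a path to the scoped services.
type EnforcementPoint struct {
	Name string // e.g., "external-lb", "internal-lb", "mesh-sidecar", "egress-proxy"
}

// expand turns one declared policy into per-enforcement-point configuration.
func expand(p Policy, points []EnforcementPoint) []string {
	configs := make([]string, 0, len(points))
	for _, ep := range points {
		configs = append(configs, fmt.Sprintf("%s: enforce %q for %s", ep.Name, p.Rule, p.Scope))
	}
	return configs
}

func main() {
	p := Policy{Scope: "application:ecommerce", Rule: "deny-egress-to-internet"}
	points := []EnforcementPoint{{"external-lb"}, {"internal-lb"}, {"mesh-sidecar"}, {"egress-proxy"}}
	for _, c := range expand(p, points) {
		fmt.Println(c)
	}
}
```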
For example, consider Google's service drain concept.

If you get paged in the middle of the night, the first step is to drain traffic from the affected service or region. This doesn’t bring anything down; it stops sending new traffic there, so you can debug without impacting users. Once the issue is resolved, you can gradually undrain and restore traffic flow.
Traffic comes in through various data planes - application load balancers, service mesh, gRPC proxyless mesh, or Envoy mesh - all targeting a given region. With One Network, when you apply a drain via the xDS API, traffic simultaneously shifts across all these paths. You can move all traffic immediately or shift it gradually in increments, depending on what’s needed.
Another example is CI/CD canary releases. When rolling out a new service version, traffic from different clients - website users through a load balancer, call center agents through an internal load balancer, point-of-sale systems via the service mesh, or even multi-cloud traffic - can be uniformly directed to the new version.
The configuration is provisioned centrally, and the system handles the traffic shift, enabling controlled rollouts and quick mitigation if something goes wrong.
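Mechanically, both drains and canaries come down to weighted routes pushed over xDS. The sketch below builds such a route action with the open-source Envoy route.v3 types from go-control-plane; the cluster names and weights are hypothetical, and a control plane like Traffic Director would deliver the equivalent resource to every data plane on every path.

```go
// A weighted traffic-split sketch using Envoy's route.v3 API types.
package main

import (
	"fmt"

	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// canaryRoute sends canaryPercent of traffic to the new version and the rest
// to the stable version. Pushing the same resource with a weight of zero for
// a target is effectively a drain of that target.
func canaryRoute(canaryPercent uint32) *routev3.RouteAction {
	return &routev3.RouteAction{
		ClusterSpecifier: &routev3.RouteAction_WeightedClusters{
			WeightedClusters: &routev3.WeightedCluster{
				Clusters: []*routev3.WeightedCluster_ClusterWeight{
					{Name: "checkout-v1", Weight: wrapperspb.UInt32(100 - canaryPercent)},
					{Name: "checkout-v2", Weight: wrapperspb.UInt32(canaryPercent)},
				},
			},
		},
	}
}

func main() {
	// Start with a 5% canary; re-pushing with different weights shifts
	// traffic gradually across all data planes at once.
	r := canaryRoute(5)
	for _, c := range r.GetWeightedClusters().GetClusters() {
		fmt.Printf("%s -> %d%%\n", c.GetName(), c.GetWeight().GetValue())
	}
}
```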
One Network of Tomorrow
Looking at where we are today, One Network has come a long way. The architecture is now in place: the central components are finished, and multi-cloud connectivity is up and running. We’ve extended the system to support multi-cloud environments and Google Edge, and work on federation is ongoing.

This has been a multi-year effort. The Envoy One Proxy project began in 2017, while the formal One Network initiative started in 2020.
Senior executives committed early to a long-term vision, and the work has involved more than a dozen teams. So far, we’ve delivered 125 individual projects, and now, most of Google Cloud’s network infrastructure is built on One Network. Since it’s based on open-source technology, it’s possible to integrate other open-source systems.
Is One Network Right for You?
If you’re considering something like One Network, the first question is whether you have executive support. This isn’t the kind of project to take on alone. Organizational priorities matter too. Think about whether policy enforcement and compliance are top concerns for your company. Consider your multi-cloud strategy and how much developer efficiency depends on infrastructure.
Having a long-term vision is essential, but it is just as important to focus on short-term progress. Tackling one project at a time, we gradually improved the network and closed the gaps as we found them rather than aiming for a single big outcome.
Summary
One Network at Google provides a unified architecture for network paths, environments, and policy management. The journey spanned several years and was built on cross-team collaboration and open-source principles.
For organizations considering a similar approach, executive support, clear priorities, and a willingness to iterate are crucial. The work is ongoing, with new frontiers like mobile and federation still in progress, but what has shipped so far demonstrates the value of incrementally building toward a long-term vision. One Network’s story is ultimately about making cloud infrastructure more manageable, adaptable, and ready to support your organization's future goals.