Architectural Patterns in Event-Driven Architectures With AWS Serverless Services
AWS provides a rich set of serverless services that make implementing EDA straightforward and cost-
effective. Key messaging services include Amazon Simple Queue Service (SQS) for reliable point-to-point
queues, Amazon Simple Notification Service (SNS) for pub/sub topics, and Amazon EventBridge for event
buses with filtering and routing logic 4 . Using these managed services, architects can build systems that
automatically scale, are highly available, and only incur costs on a pay-per-use basis. EDA on AWS has broad
use cases, from integrating microservices and third-party SaaS, to replicating data across accounts/regions,
to fan-out processing of streams 5 .
In this chapter, we will explore key architectural patterns in event-driven systems, diving deep into their
concepts, trade-offs, and how to implement them using AWS serverless services. We target an intermediate
to advanced audience of cloud architects and senior engineers, so we assume familiarity with AWS basics
(Lambda, SQS, etc.) and focus on higher-level design decisions. Each pattern will be discussed in the context
of why you would use it, how it works, and what AWS services or features support it. We'll illustrate patterns
with architecture diagrams and real-world examples, highlight anti-patterns and pitfalls to avoid, and map
these practices to the AWS Well-Architected Framework (including the Serverless Lens) to ensure our
designs follow best practices.
The patterns and topics covered include: Domain-Driven Design for aligning events with business
domains; CQRS (Command Query Responsibility Segregation) for separating read and write models;
Event State Propagation (Event-Carried State Transfer) for embedding state in events; Point-to-Point
Messaging and Publish/Subscribe communication styles; Event Streaming for high-volume or replayable
event flows; Choreography vs. Orchestration in distributed workflows; and other advanced patterns like
sagas (for distributed transactions), event sourcing, and the transactional outbox. We will also discuss the
anti-patterns and trade-offs associated with each pattern – knowing when a pattern might introduce
complexity or fail is as important as knowing how to implement it correctly. Finally, we’ll tie these concepts
back to AWS’s Well-Architected Framework and Serverless Lens, and look at some case studies of AWS
customers who successfully implemented event-driven architectures.
By the end of this chapter, you should have a solid understanding of how to leverage AWS serverless
services to realize robust event-driven architectures, how to choose appropriate patterns for your use cases,
and how to avoid common pitfalls. The goal is not to provide exhaustive code (we favor conceptual depth
over long code listings), but rather to equip you with architectural knowledge and practical guidance for
mastering event-driven design on AWS.
In an event-driven architecture, DDD manifests in the form of domain events. A domain event is something
notable that happened in a domain, expressed in business terms – for example, CustomerCreated,
OrderShipped, or InventoryLow. These events act as the lingua franca between bounded contexts.
Rather than one service calling another’s API to perform an action (tight coupling), a service simply emits
an event about what occurred, and other services that care about that event handle it asynchronously. This
decouples the producer of the event from its consumers – the producer doesn’t need to know who (if
anyone) is listening, and consumers can come and go independently as long as they understand the event’s
schema and meaning.
How AWS Serverless Enables DDD & Domain Events: AWS provides the plumbing to publish and subscribe
to events easily, allowing you to implement DDD principles in a microservices architecture. For instance,
Amazon EventBridge can serve as an enterprise event bus that different domain-oriented microservices
publish events to and subscribe from. You might create a separate EventBridge event bus for each major
bounded context (to isolate internal domain events) and also have a central bus for cross-domain
integration events. Each event carries a detail-type and structured data (often JSON) that represents
the domain event. EventBridge rules can then route events from one domain’s bus to another domain’s bus
or service if cross-context communication is needed. This approach was used by City Electric Supply when
they modernized their inventory management system – they implemented an enterprise service bus using
Amazon EventBridge to connect numerous domain-specific services across the company 10 . By mapping
their legacy enterprise domains (like product pricing, warehouse stock, customer data, etc.) to separate
serverless services and exchanging events via the central event bus, they achieved loose coupling and a
unified real-time data exchange between disparate applications.
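To make this concrete, here is a minimal sketch (Python with boto3) of a service publishing a domain event to a custom EventBridge bus. The bus name, source string, and event fields are illustrative assumptions, not values from the case study above.

```python
import json
import boto3

events = boto3.client("events")

def publish_order_placed(order):
    """Publish an OrderPlaced domain event to a hypothetical domain bus."""
    response = events.put_events(
        Entries=[{
            "EventBusName": "orders-domain-bus",  # hypothetical custom bus
            "Source": "com.example.orders",       # identifies the producing bounded context
            "DetailType": "OrderPlaced",          # past-tense domain event name
            "Detail": json.dumps({
                "orderId": order["orderId"],
                "customerId": order["customerId"],
                "totalAmount": order["totalAmount"],
            }),
        }]
    )
    # put_events can partially fail; check the count and retry failed entries
    if response["FailedEntryCount"] > 0:
        raise RuntimeError(f"Failed to publish: {response['Entries']}")
```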
Another common approach is to use Amazon SNS topics for domain events. For example, you might have a
topic named OrderEvents and publish messages for events like OrderPlaced or OrderCancelled.
Multiple microservices in different bounded contexts (e.g., Billing, Shipping, Notifications domains) can
subscribe to this topic (directly or via SQS queues) and react to the events. SNS follows the publish/
subscribe model we’ll discuss later, delivering each message to all subscribers (or to specific subscribers
based on message filtering rules). EventBridge vs SNS is a design choice: SNS topics are straightforward for
broad fan-out, while EventBridge offers more advanced filtering, routing, and cross-account event bus
capabilities. As a rule of thumb, SNS is great for simple pub/sub of events within a single account,
whereas EventBridge excels in complex event routing scenarios (e.g., many event types, multiple target
services, need for content-based filtering, or integrating SaaS events). One AWS expert compares them by
noting that with SNS you might manage multiple topics and subscriber filters for different event types,
whereas with EventBridge you can use a single bus with multiple rules to route events to various
destinations 11 .
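A corresponding SNS sketch, assuming a hypothetical OrderEvents topic: the publisher attaches an eventType message attribute so that subscriptions can filter on it (we return to filter policies in the pub/sub section).

```python
import json
import boto3

sns = boto3.client("sns")

def publish_order_event(event_type: str, payload: dict):
    """Publish a domain event to a hypothetical OrderEvents SNS topic."""
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:OrderEvents",  # placeholder ARN
        Message=json.dumps(payload),
        MessageAttributes={
            # subscribers can filter on this attribute instead of parsing the body
            "eventType": {"DataType": "String", "StringValue": event_type}
        },
    )

publish_order_event("OrderCancelled", {"orderId": "o-123", "reason": "customer request"})
```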
When implementing DDD in an event-driven way, it’s important to define clear schemas for your events
and keep the events tied to business language. AWS EventBridge’s Schema Registry can help by capturing
event schemas (it even can automatically infer schemas from events flowing through the bus). Publishers
and consumers should agree on these schemas – for example, an OrderPlaced event might have fields
like orderId, customerId, items, totalAmount, etc., which all services understand. The AWS
Serverless Lens emphasizes that having a well-defined data contract (event schema) between producers
and consumers is critical for EDA success 12 . This ensures that as your architecture grows, developers can
discover and understand events easily. In fact, some teams create an event catalog – essentially
documentation listing all events, their schemas, producers, and consumers. (An example is the KnowBe4
engineering team, which built an internal event catalog to document domains, services, and events,
improving developer autonomy and comprehension of the event flows 13 14 .)
Best Practice: Treat events as first-class, versioned API contracts for your organization. Use
consistent naming (past-tense verbs for events, e.g. "UserRegistered" ), include semantic
data, and provide versioning for schema evolution. The AWS Well-Architected Framework
recommends Domain-Focused design – services built around specific business contexts with
a common language 6 15 – and this extends to your events. When an event schema
changes, consider backward compatibility (e.g., adding new fields in a non-breaking way or
using version indicators) to avoid breaking consumers. Also leverage schema validation. For
instance, a Lambda or API that publishes events can validate the event against a JSON
Schema before putting it on the bus, to catch mistakes early (one team enforced this by
storing schemas in S3 and validating each event against it before publish 16 17 ).
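A minimal sketch of that validate-before-publish practice, assuming one JSON Schema document per event type stored in a hypothetical S3 bucket and the third-party jsonschema library:

```python
import json
import boto3
from jsonschema import validate, ValidationError  # pip install jsonschema

s3 = boto3.client("s3")
events = boto3.client("events")

SCHEMA_BUCKET = "example-event-schemas"  # hypothetical bucket: one schema per event type

def publish_validated(detail_type: str, detail: dict):
    # fetch the JSON Schema for this event type (e.g., "OrderPlaced.json")
    schema = json.loads(
        s3.get_object(Bucket=SCHEMA_BUCKET, Key=f"{detail_type}.json")["Body"].read()
    )
    try:
        validate(instance=detail, schema=schema)  # fail fast on contract violations
    except ValidationError as err:
        raise ValueError(f"{detail_type} event violates its schema: {err.message}")
    events.put_events(Entries=[{
        "Source": "com.example.orders",
        "DetailType": detail_type,
        "Detail": json.dumps(detail),
    }])
```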
From a deployment perspective, aligning microservices to bounded contexts may also influence how you
organize your AWS accounts or infrastructure. Some organizations put each bounded context in its own
AWS account or at least in isolated stacks, then use EventBridge’s cross-account event buses to route events
between domains. This can enhance isolation and reduce blast radius (one domain’s changes or issues don’t
directly impact others), at the cost of slightly more complex cross-account IAM configuration. Alternatively,
within one account, you might simply use separate EventBridge custom buses for each domain, or use
naming conventions on event types to designate domain ownership (e.g., event names prefixed with the
domain name). The right approach depends on team structure and governance: if you have independent
teams per domain, a multi-account strategy with an event bus per team can reinforce the bounded context
boundaries.
Transactional Outbox Pattern (Ensuring Domain Consistency): One challenge in a DDD microservice
that’s event-driven is ensuring that when something happens (e.g., an order status is updated in a
database), the corresponding event is published reliably. You typically don’t want to update your database
and publish an event in two separate steps without any coordination, because one might succeed and the
other fail (leading to an inconsistent system state). A widely-used solution is the transactional outbox
pattern. In this pattern, when a service processes a command that changes its domain state, it stores a
record of the event in an "outbox" (which is just a table or item in the same database) as part of the same
transaction that commits the state change. A separate process or Lambda function then reads from this
outbox and publishes the events to the event bus (EventBridge or SNS). Because the outbox write is in the
same transaction as the domain change, you never lose events – if the transaction rolls back, neither the
state nor event is saved; if it commits, the event will eventually be delivered. AWS implementations of this
include using DynamoDB Streams (the stream of changes acts as the outbox), or database features like AWS
DMS (Database Migration Service) or Debezium to capture changes from an RDS database and forward
them to Kafka or EventBridge. In fact, AWS has published guidance on achieving atomic database
update + event publish by using such patterns 18 . In summary, DDD in EDA requires careful attention to
consistency and ordering – use patterns like the outbox and/or design events to be idempotent (so that if an
event is delivered twice, it doesn’t break anything) which we will discuss later.
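A sketch of the outbox on DynamoDB, where the domain state change and the outbox record commit atomically in a single TransactWriteItems call (table and attribute names are assumptions); a DynamoDB Streams consumer or EventBridge Pipe would then publish the outbox items:

```python
import json
import uuid
import datetime
import boto3

dynamodb = boto3.client("dynamodb")

def update_order_status(order_id: str, new_status: str):
    """Commit the state change and the outbox record in one transaction."""
    event_id = str(uuid.uuid4())
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    dynamodb.transact_write_items(
        TransactItems=[
            {   # 1) the domain state change
                "Update": {
                    "TableName": "Orders",  # hypothetical domain table
                    "Key": {"orderId": {"S": order_id}},
                    "UpdateExpression": "SET #s = :s",
                    "ExpressionAttributeNames": {"#s": "status"},
                    "ExpressionAttributeValues": {":s": {"S": new_status}},
                }
            },
            {   # 2) the outbox record, in the same transaction
                "Put": {
                    "TableName": "Outbox",  # hypothetical outbox table
                    "Item": {
                        "eventId": {"S": event_id},
                        "detailType": {"S": "OrderStatusChanged"},
                        "occurredAt": {"S": now},
                        "payload": {"S": json.dumps(
                            {"orderId": order_id, "status": new_status})},
                    },
                }
            },
        ]
    )
```

If the transaction fails, neither the state change nor the outbox record is written, which is exactly the atomicity guarantee this pattern exists to provide.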
Anti-Pattern – Layered Architecture over Bounded Contexts: A common anti-pattern is to build services
or APIs that cut across multiple domains (for example, a single service that handles both Customer
management and Order processing) in the name of convenience. This undermines DDD and often leads to
tighter coupling and harder-to-maintain code. Instead, services should strongly align to a single
responsibility or domain. The AWS Well-Architected guidelines warn that if applications span domain
responsibilities or share large domain entity libraries across services, it will increase complexity and
deployment risk 19 9 . Another pitfall is teams structuring around technical layers (UI team, database
team, etc.) rather than domain teams – this often results in blurred ownership of events and data. It’s more
effective to have cross-functional teams own an entire domain (including its data, logic, and emitted events)
so they can iterate quickly and handle domain-specific reliability needs 7 8 .
In summary, DDD provides a conceptual blueprint for designing event-driven microservices: model
around business domains, emit domain events for state changes, keep the contexts loosely coupled
(communicating via events or well-defined APIs), and ensure each context’s data is its own source of truth.
AWS serverless services like Lambda, DynamoDB, and the various messaging services allow each domain
service to be autonomous yet still integrate through events. As we move through other patterns in this
chapter, remember that many of them (CQRS, sagas, etc.) can be viewed through a DDD lens – for example,
CQRS often aligns with splitting the domain model for writes vs. reads, and sagas often coordinate events
across domains. Keeping domain boundaries clean will significantly help in managing complexity as your
event-driven system grows.
In event-driven systems, CQRS frequently goes hand-in-hand with event sourcing (although you can do
CQRS without full event sourcing). The typical flow is: when a command comes in (say, “Create Order” or
“Update Account”), the system will perform the business logic and persist the results as an event (or a series
of events) to an event store. These events represent the changes that occurred. Then, separate event
handlers or projection functions asynchronously update one or more read models (which could be in a
different database optimized for querying, such as a read-optimized relational table, a search index, or a
caching system) in response to those events. The read model thus always has a derived view of the latest
state, ready to be queried. Meanwhile, queries (like “Get Order Details” or “List Accounts”) are served from
the read model, not from the transactional write store. This way, the writes can be optimized for appending
events (fast inserts), and reads can be optimized for lookup and aggregation, without the two contending.
AWS serverless technologies provide a natural fit for CQRS and event-sourced systems. Let’s outline how
one might implement CQRS with AWS services:
• Commands (Write side): Clients (could be front-end apps or other services) issue commands via an
API. An example is sending an HTTP request to an AWS API Gateway endpoint for a certain action.
API Gateway can route to an AWS Lambda function that serves as the command handler. This
Lambda contains the business logic for the command – it might validate input, enforce rules, and
then save the results. In a CQRS+event sourcing setup, instead of updating a dozen fields in multiple
tables, this Lambda would typically record a single source of truth: an event describing what
happened. For example, if the command is “POST /orders” to create an order, the Lambda might
generate an OrderCreated event with all relevant details (a sketch of this flow follows this list).
• Event Store: The event generated by the command handler needs to be saved durably. A common
choice on AWS is Amazon DynamoDB. DynamoDB is a natural event store: it’s a fast, scalable, NoSQL
database that can store events as items (each event could be an item with a type, timestamp,
payload, etc.). It’s fully managed and serverless (scales on demand). An AWS blog suggests using
DynamoDB for an event store because its flexible schema can easily adapt to varying event types,
and it is optimized for high write throughput 20 21 . Another advantage: DynamoDB has
DynamoDB Streams, which provides a real-time stream of changes (i.e., newly inserted events) that
can trigger further processing 20 22 . In a CQRS scenario, each time a new event is written to the
Events table, a DynamoDB Streams event is emitted. We can attach a Lambda function to this stream
to handle event processing.
• Event Processing (Projection to Read Models): For each event on the stream, a Lambda function
(or set of functions) acts as a projection service – transforming the event and updating the read
model(s). For instance, when an OrderCreated event appears, the Lambda could update a
denormalized view of orders for quick querying. If using DynamoDB as an event store, AWS Lambda
supports Event Source Mapping to DynamoDB Streams, meaning AWS will poll the stream and
invoke your Lambda with batches of events automatically 23 24 . The Lambda reads the new events
and applies the necessary transformations to the read model database.
• Read Models (Query side): The read model could be any data store suited to your query patterns. In
a simple case, it might be another DynamoDB table optimized for queries (different key structure or
pre-joined data). Or it could be an Amazon Aurora Serverless (for relational querying) or an Amazon
OpenSearch Service index (for text search and aggregations). In our order example, one read model
might be a “My Orders” table keyed by customer ID for the web app to quickly fetch all orders of a
customer. Another read model could be an OpenSearch index that allows searching orders by
various fields. In fact, in a multi-read-model scenario, you might have several different read
databases for different access patterns. The projection Lambda can update each as needed. For
example, in one AWS reference architecture, the initial implementation had a single read model
(Aurora) for current state, but to scale, they added a second read model (OpenSearch) to serve
different query use cases 25 26 . Each read model gets its own consumer pipeline from the event
store.
• Query Handling: Now, when a client needs to query data (the Query part of CQRS), the request goes
to a different endpoint (e.g., GET /orders/{id} ) which could be handled by another Lambda (or
perhaps directly by services like AWS AppSync GraphQL). This handler reads from the read model
(e.g., queries Aurora or DynamoDB or OpenSearch) and returns the result. Importantly, it does not
call the command data store or the event store; it only uses the precomputed view in the read store.
This ensures that heavy read traffic doesn’t impact write throughput and vice versa.
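The following sketch condenses the write side and one projection into two Lambda handlers, assuming hypothetical EventStore and OrdersByCustomer DynamoDB tables. For brevity it attaches the projection directly to the event store's stream (configured with the NEW_IMAGE view type) rather than going through the Pipes/SNS/SQS fan-out shown in Figure 14.1 below.

```python
import json
import time
import boto3

dynamodb = boto3.resource("dynamodb")
events_table = dynamodb.Table("EventStore")       # hypothetical event store
orders_view = dynamodb.Table("OrdersByCustomer")  # hypothetical read model

def create_order_handler(event, context):
    """Command handler (write side): invoked via API Gateway for POST /orders."""
    body = json.loads(event["body"])
    events_table.put_item(Item={                   # append-only write
        "aggregateId": body["orderId"],            # partition key of the event store
        "sequence": int(time.time() * 1000),       # sort key; a real system would use
                                                   # a per-aggregate version counter
        "type": "OrderCreated",
        "payload": json.dumps(body),
    })
    return {"statusCode": 202, "body": json.dumps({"orderId": body["orderId"]})}

def project_orders_handler(event, context):
    """Projection: attached to the event store's DynamoDB stream; keeps the
    denormalized "My Orders" read model up to date."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        if new_image["type"]["S"] != "OrderCreated":
            continue
        payload = json.loads(new_image["payload"]["S"])
        orders_view.put_item(Item={
            "customerId": payload["customerId"],   # read-model partition key
            "orderId": payload["orderId"],
            "status": "CREATED",
        })
```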
Figure 14.1 – Serverless CQRS and Event Sourcing architecture on AWS. In this example architecture, an API
Gateway routes client requests to Lambda handlers. Commands (top flow) go to a Lambda “command
handler” which writes events to a DynamoDB-based event store. DynamoDB Streams capture these events
(CDC – change data capture) and via EventBridge Pipes feed them to an SNS topic. The SNS topic fans out
events to multiple SQS queues – one per read model. Each read model (e.g., an Aurora database, an
OpenSearch index) has a dedicated Lambda that polls its SQS queue and updates the read model with the
event’s data (after appropriate transformations). Queries (bottom flow) from clients hit Lambda “query
handlers” that fetch data from the read models (Aurora, OpenSearch) directly. This decoupled design allows
the reads to scale independently of writes, and new read models can be added by simply attaching another
queue and processor to the SNS topic 27 28 .
The above figure illustrates a highly decoupled CQRS pipeline using AWS managed services. The use of
Amazon EventBridge Pipes in that architecture is interesting – Pipes can connect a DynamoDB Stream
directly to an SNS topic (with optional filtering or enrichment) in a serverless fashion 29 . This removes the
need to manage a custom Lambda to read the stream and publish to SNS. The SNS acts as a fan-out hub,
implementing a topic-to-queue chaining pattern 30 : each read model service gets events via its own SQS
queue subscribed to the SNS topic. Using SQS in between provides backpressure and reliability – each read
model can process at its own pace without dropping events, since SQS will buffer them. It also allows
individual read models to fail or slow down without impacting others, a key resilient design. As the AWS
community article notes, this decoupling ensures a failure in one component doesn’t cascade, and you can
integrate new consumers (new read models, new services interested in events) without disrupting existing
ones 31 32 . In effect, it’s an embodiment of loose coupling: producers (the event store / Pipe) don’t know
about consumers, and each consumer has an isolated feed of events.
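If you wire this up programmatically rather than through the console, the Pipes API sketch might look like the following (ARNs and the execution role are placeholders; the role needs permission to read the stream and publish to the topic):

```python
import boto3

pipes = boto3.client("pipes")

# DynamoDB Stream (source) -> SNS topic (target), as in Figure 14.1.
pipes.create_pipe(
    Name="event-store-to-sns",
    RoleArn="arn:aws:iam::123456789012:role/pipe-execution-role",
    Source=("arn:aws:dynamodb:us-east-1:123456789012:table/EventStore"
            "/stream/2024-01-01T00:00:00.000"),
    SourceParameters={
        "DynamoDBStreamParameters": {"StartingPosition": "LATEST", "BatchSize": 10}
    },
    Target="arn:aws:sns:us-east-1:123456789012:OrderEvents",
)
```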
Advantages of CQRS (with Event Sourcing): By separating read and write workloads, you can scale and
optimize them independently. For example, writes can be sharded or partitioned easily as they are just
appending events, and reads can be served from replicas or specialized datastores depending on query
patterns. Systems where reads far outnumber writes benefit because the read replica (materialized view)
can handle loads of queries without affecting the ingest of new events 33 . It also improves availability: if
the query side experiences issues or needs maintenance, the command side can still accept writes (the
events queue up to be processed later), and vice versa 34 35 . Moreover, CQRS with event sourcing
inherently provides an audit log (the sequence of events) which can be replayed to rebuild state or feed
new systems. As one source points out, the event store can be used to “replay events if a read datastore
missed them, or to hydrate a new read model” 36 – this is powerful for backfilling data or recovering from
failures (we'll discuss event replay more in the advanced patterns section).
Another benefit is the ability to have multiple different read models for different use cases, as mentioned.
In a monolithic design, trying to craft a single database schema that efficiently serves every query and every
update is hard; CQRS frees you from that by allowing multiple projections. For example, in an e-commerce
system, you might have one read model for operational data (orders, customers), another for analytics
(sales over time, top products – maybe in a time-series DB or data warehouse), and maybe a caching layer
for ultra-fast reads on a website. All of these can be kept up-to-date by subscribing to the event stream of
transactions.
Trade-offs and Challenges: The primary trade-off of CQRS is complexity. You now have more moving
parts: multiple databases, asynchronous processes to keep data in sync, and the need to handle eventual
consistency. In a CQRS system, by design there is a lag (usually small, but non-zero) between a write and the
availability of that change in the read model. If a user creates an order and then immediately tries to fetch
it, the system either must wait until the event has been projected to the read model, or implement a
strategy (like querying the event store or a cache) to handle that very short-term inconsistency. Often in
practice, a slight delay (milliseconds to seconds) is acceptable, but it’s something to design for. Also, error
handling becomes trickier – e.g., if a projection Lambda fails for some reason on an event, how do you
ensure it eventually processes it (we’ll talk about retries and dead-letter queues in the Best Practices
section). Tools like SNS and SQS help here (since SQS will retry deliveries and you can send bad messages to
a DLQ if they continuously fail). Another complexity is schema evolution across multiple models: if you
change the shape of data in events, you need to adjust all the read model updaters accordingly, possibly
with versioning.
It’s worth noting that you can implement CQRS without going full event sourcing. For instance, you might
have a system where commands directly update a database (say, a SQL DB for the write side), and you use a
service like AWS DMS or triggers to capture changes and update a NoSQL read store or cache. That is a
form of CQRS too – separating the models – but using change data capture instead of explicit events. On
AWS, one could use Aurora MySQL with binary log replication feeding into AWS Lambda or Kinesis to
propagate changes. However, embracing explicit events (event sourcing) has advantages like the clear log of
changes and easier debugging/historical analysis.
AWS Example: A concrete AWS example using serverless CQRS is the reference where DynamoDB was the
event store and Aurora and OpenSearch were two read models 25 26 . They used Amazon API Gateway +
Lambda for both command and query entry points 37 38 . The write side Lambda stored events in
DynamoDB 39 , and DynamoDB Streams fed into a pipeline that ultimately delivered events to Lambda
functions updating Aurora (a relational DB) and OpenSearch (for search queries) 23 32 . This kind of
polyglot persistence (different DBs) is made manageable by the decoupling that events provide.
Anti-Pattern – Mixing Read and Write Models Accidentally: A pitfall to avoid is letting the read model
become a back-door to modify data, or using the write model to serve queries out of expedience. Doing so
breaks the separation principle and can reintroduce tight coupling. For example, avoid a scenario where
some queries still hit the primary write database “for freshness” – that undermines the whole point of CQRS
(if you find you need that often, maybe CQRS isn’t worth the complexity for that part of the system). Another
anti-pattern is ignoring the event ordering and idempotency of projections. If your projection consumer
doesn’t handle events in the correct order, your read model could become inconsistent. With DynamoDB
Streams or Kinesis, ordering is usually guaranteed per key or shard, but if you have a Lambda reading from
an SNS fan-out (as in the multi-read-model design), realize that SNS+SQS do not guarantee ordering by
default (unless you use FIFO topics/queues). In practice, ensuring each event carries a sequence number or
timestamp can help your projector detect out-of-order events or duplicates. Designing idempotent
projection updates is crucial – e.g., if the same event is delivered twice, applying it a second time should not
corrupt the read model (perhaps the second attempt finds that the data is already up-to-date and thus does
nothing). Idempotency can be achieved by using primary keys and conditional writes or simply ignoring
events that don’t change state. We’ll revisit idempotency in the best practices section since it’s a key concern
across all event-driven patterns.
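As an illustration, a projection can combine a sequence number with a DynamoDB conditional write so that duplicates and stale out-of-order events are dropped safely (table and attribute names are hypothetical):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("OrdersView")  # hypothetical read model

def apply_event(order_id: str, sequence: int, new_status: str):
    """Idempotent, order-aware projection update: only apply the event if it is
    newer than what the read model already holds."""
    try:
        table.update_item(
            Key={"orderId": order_id},
            UpdateExpression="SET #s = :s, lastSequence = :seq",
            ConditionExpression=(
                "attribute_not_exists(lastSequence) OR lastSequence < :seq"
            ),
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": new_status, ":seq": sequence},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate or older event: safe to drop
        raise
```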
In summary, CQRS (with or without event sourcing) is a powerful pattern for scaling and decoupling the
read/write workloads of your system. AWS serverless services—API Gateway, Lambda, DynamoDB (with
Streams), SNS, SQS, Aurora Serverless, OpenSearch, etc.—form a toolkit that makes it feasible to implement
CQRS without managing servers, letting you focus on the data flow and transformation logic. You get
scalability and reliability out of the box (DynamoDB and SQS are highly durable and scalable services, and
Lambdas can scale out horizontally to handle event processing). Just be mindful of the added complexity:
make sure the benefits outweigh the overhead for your particular use case. Often, CQRS shines in complex
domains or high-scale systems where read and write characteristics diverge significantly (e.g., read-heavy
systems, or cases where you want multiple tailored views of data). In simpler apps, it might be overkill. So,
apply this pattern judiciously.
For example, imagine a Customer service in a retail application. When a customer's profile is updated, we
have two choices for the event:
1. Event Notification style: publish an event like CustomerUpdated containing just a customer ID (and maybe a timestamp), requiring any consumer (say, an Order service that needs customer info) to call the Customer service to get the details of the update.
2. Event-Carried State style: publish an event CustomerUpdated that includes the new state of the customer (e.g., name, email, loyalty status, etc.) in the event itself, so consumers can directly use that data.
With ECST, we choose the second style. The events are richer in information. By including the changed data
(or relevant subset of it) in the event, we achieve loose coupling: consumers don’t need direct knowledge of
or real-time access to the producer’s database or API. They can update their own local state or trigger their
own logic purely based on the event content.
This pattern is particularly useful for propagating state changes across microservices. As one source
defines it, event-carried state transfer is an EDA pattern that utilizes events as a mechanism for state
propagation, rather than relying on synchronous request/response protocols. This decouples services, improves
scalability and reliability, and provides a mechanism for maintaining a consistent view of the system’s state 40 .
Each service can subscribe to the events it cares about and update its own data store or internal state
accordingly, ensuring it has an up-to-date view of relevant information without tight integration. In essence,
state is distributed via events.
AWS Example: Suppose you have a microservice architecture for an e-commerce site:
• Catalog Service – owns product data (titles, descriptions, prices, stock).
• Recommendation Service – suggests related products; needs product info to function.
• Search Service – indexes products for search queries.
If the Catalog updates a product's details or price, using ECST it would publish an event ProductUpdated
containing the product ID and the new details (name, price, etc.). The Recommendation and Search services
subscribe (via SNS or EventBridge). Upon receiving the event, the Search service can update its OpenSearch
index document for that product with the new info, and the Recommendation service can adjust any cached
data or models that included the old product info. Neither needs to call the Catalog service’s API to fetch the
latest data – the event itself carried that state.
On AWS, implementing ECST is straightforward using JSON messages in SNS or EventBridge. The payload
(message body or EventBridge detail field) should include the necessary state. There’s a trade-off in how
much to include: you might include the full new representation of the entity (e.g., entire product record).
This maximizes independence (subscribers never need to call back), but at the cost of larger message sizes
and potential duplication of data across services. In other cases, you might include only the changed fields
or a summary. The key is that subscribers should have enough information to do their work without synchronous calls.
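For instance, a sketch of the Catalog service publishing a state-carrying ProductUpdated event (field names are illustrative):

```python
import json
import boto3

events = boto3.client("events")

def publish_product_updated(product: dict):
    """ECST-style event: the payload embeds the new product state so that the
    Search and Recommendation services never need to call the Catalog API."""
    events.put_events(Entries=[{
        "Source": "com.example.catalog",
        "DetailType": "ProductUpdated",
        "Detail": json.dumps({
            "productId": product["productId"],
            "name": product["name"],
            "price": product["price"],
            "category": product["category"],
            "stock": product["stock"],
        }),
    }])
```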
One consideration on AWS is the message size limits. For instance, an SNS message or EventBridge event
has a maximum size (256 KB for EventBridge). If your entity data is potentially larger (say a big JSON or
binary data), you have to strategize. A known approach if events carry large state is to offload the heavy
payload to a storage service and include a reference. For example, the AWS blog on EventBridge integration
describes a pattern where if an event payload exceeds 256KB, the extra data is written to S3 and the event
carries an S3 URL instead 41 42 . This way, the event still “carries” the state via an indirect reference. The
consumer can decide if it needs to retrieve that data from S3.
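A hedged sketch of that offloading approach (sometimes called the claim-check pattern), with a placeholder bucket name and a conservative size threshold:

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
events = boto3.client("events")

MAX_DETAIL_BYTES = 200_000  # stay safely under EventBridge's 256 KB event limit
PAYLOAD_BUCKET = "example-event-payloads"  # placeholder bucket

def publish_with_claim_check(detail_type: str, payload: dict):
    body = json.dumps(payload)
    if len(body.encode("utf-8")) > MAX_DETAIL_BYTES:
        # offload the heavy payload to S3 and carry a reference instead
        key = f"event-payloads/{uuid.uuid4()}.json"
        s3.put_object(Bucket=PAYLOAD_BUCKET, Key=key, Body=body)
        body = json.dumps({"payloadLocation": f"s3://{PAYLOAD_BUCKET}/{key}"})
    events.put_events(Entries=[{
        "Source": "com.example.catalog",
        "DetailType": detail_type,
        "Detail": body,
    }])
```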
Best Practice: Design your events with sufficient context so that consumers rarely need to
make follow-up queries. This reduces inter-service chatter and latency. If an OrderPlaced
event triggers an email confirmation service, consider putting customer email and order
details in the event so the email service doesn’t need to call the Order service to get them.
However, balance this with data privacy and minimization – don’t blindly dump entire
objects if consumers only need a few fields. Also be mindful of sensitive data: if an event bus
is enterprise-wide, you might not want certain confidential fields traveling widely. Techniques
like encryption of fields or using separate event channels for sensitive data can help.
The main benefit of ECST is resilience and autonomy. Because services can update their own databases
using event data, they can continue to function even if the source service is down, by relying on their locally
stored copies of data that were synced via events. It’s like each service has a cached, eventually-consistent
copy of the data it cares about. This is essentially how distributed caching or materialized views in
microservices are built. For example, a reporting service might maintain its own small database of key info
(populated by events from various sources) so it never hits the live services for queries. This can drastically
reduce inter-service calls and potential bottlenecks, improving overall reliability.
Another benefit is enabling new consumers easily. If a new service is introduced that needs certain data,
you just subscribe it to the relevant events and it can start building a local dataset. Many architectures use a
combination of event-carried state transfer and event sourcing to create a system where new
components can reconstruct the current world by replaying past events. Even if you’re not doing full event
sourcing, having events carry state means a new service could potentially snapshot the latest state of
entities by listening to a stream of events from some point onwards.
Trade-offs of ECST: The big trade-off is data duplication and consistency management. Once data is
copied to multiple services, you have to accept eventual consistency. There is a chance a consumer has
slightly stale data if an event is delayed or if it hasn’t yet processed the latest update. This is usually
acceptable within seconds or minutes depending on domain (and you design around it, e.g., showing a last-
updated timestamp or offering refresh mechanisms for users). Another potential downside is data bloat –
if many services keep copies of the same data, you use more storage overall, and if the data changes
frequently, you are pushing a lot of data through the system. That can increase bandwidth costs and
processing overhead. So be intentional about what you propagate. Use filtering to avoid sending events to
services not interested. AWS EventBridge rules or SNS message filtering can ensure, for instance, that a
service only gets events for certain types or certain entities. Perhaps the Recommendation service only
cares about product category changes but not price changes – you could filter accordingly so it doesn’t do
unnecessary work.
Additionally, ensure robust schema/versioning for the state in events. If the shape of the data evolves (say
new fields added), consumers should handle it gracefully (ignore unknown fields, use defaults for missing
ones, etc.). Using a schema registry like EventBridge Schema Registry or even an OpenAPI/AsyncAPI spec
for events can help coordinate changes.
Anti-Pattern – Thin Events That Force Callbacks: Be wary of publishing events that carry only an identifier when consumers will immediately need the underlying data – every handler then has to call the producer back for details, which reintroduces synchronous coupling (and even then, often a batch of IDs or a periodic sync might be better). As a rule: try to avoid designs where the
first thing an event handler does is invoke another service – it negates the decoupling advantage.
Implementing ECST on AWS – Practical Tips: Using JSON for event payloads is the norm. If using
EventBridge, you have a Detail object in the event where you can put a JSON structure of your data. With
SNS (especially when using SNS -> Lambda or SNS -> SQS), you typically put the JSON in the message body.
Consumers (Lambda functions, etc.) will parse the JSON and use it. Keep the format consistent (e.g.,
possibly leverage JSON schema as mentioned). Consider enabling EventBridge Schema Registry to auto-
capture schemas – AWS can even generate code bindings for your events for strongly-typed languages.
For very high-volume scenarios, you might choose an event streaming approach (like Kinesis or Kafka) with
state in events. For instance, streaming a million small events per second where each has some state.
Kinesis can handle that throughput and consumers can be designed to scale out (with multiple shards). But
note, Kinesis events are limited to 1MB, and generally it’s best to keep events small (a few KB ideally). If you
have extremely large state (like images or big JSON blobs), consider storing them in S3 and only send
references or keys in events – similar to the earlier example of offloading extended data to S3 41 .
Example use case: The AWS Data Sync across accounts scenario. Sometimes you need to propagate data
from one account/region to another (like multi-region active-active systems). Instead of pulling data, you
can use events: account A’s service emits an event with the new/updated data, and an EventBridge rule
forwards that event to account B’s event bus (EventBridge supports cross-bus forwarding). Account B’s
service receives it and updates its local store. This is event-carried state delivering cross-account
consistency in near real-time 43 (EventBridge is often used to replicate data/events across accounts or
even to SaaS apps). The Well-Architected Serverless Lens notes cross-account data replication as a typical
use case for event-driven designs 43 .
In conclusion, Event-Carried State Transfer is about making events self-contained snapshots of state
changes. It is a cornerstone of many event-driven microservice architectures because it maximizes
independence: services share information by publishing it, not by querying each other. When combined
with the other patterns (like pub/sub delivery and event streaming), it creates a robust web of services that
can scale and evolve independently, with the events ensuring everyone has the data they need. Just manage
the consistency expectations and data governance carefully. When done right, this pattern greatly reduces
the need for synchronous integration and can improve the performance of the overall system (e.g., fewer
synchronous waits, more localized reads).
Typical use cases for point-to-point messaging include task queues, work queues, or any scenario where
work needs to be distributed among workers and each piece of work should be handled only once. For example:
• Processing a user-uploaded image (one service produces a message for each image, and one image-processing worker consumes it and performs the task).
• Order fulfillment jobs (each order goes into a queue to be picked up by a fulfillment service).
• Any asynchronous processing where you just need to decouple producer and consumer, but not broadcast to multiple consumers.
AWS’s go-to service for point-to-point messaging is Amazon SQS (Simple Queue Service). SQS provides a
fully managed, serverless queue with at-least-once delivery and scalable throughput. Producers send
messages to an SQS queue, and consumers poll the queue to receive messages. In a serverless context, we
often use AWS Lambda as the consumer – Lambda can be configured with an SQS event source, which
means AWS will poll the queue on your behalf and invoke your Lambda function with messages (in batches)
as they become available. This makes processing queue messages straightforward: you write a Lambda
handler to process one batch of messages; AWS handles reading from the queue, scaling up Lambda
concurrency when there are many messages, and scaling down when the queue is empty.
Point-to-point messaging via SQS is a fundamental decoupling pattern. It introduces a buffer between
producer and consumer which smooths out demand and differences in processing rate. If the consumer is
slow or temporarily overwhelmed, messages pile up in the queue (rather than overwhelming the consumer
or forcing the producer to wait). The consumer can catch up later. This buffering effect is often called a
“shock absorber” in the system 44 – it helps maintain reliability under bursty loads. We saw a real example
in the WellRight case study (later in this chapter) where migrating to an SQS+Lambda model allowed them
to handle bursty traffic: when tens of thousands of events arrived, SQS buffered the load and Lambda
scaled out to process them, completing work in 15 minutes that previously choked their monolith for hours
45 .
Implementing P2P with AWS services:
• Create an SQS queue (a Standard queue by default, which offers high throughput, at-least-once delivery, and best-effort ordering; or a FIFO queue if you need strict ordering and exactly-once processing semantics).
• Producers (a Lambda function, any AWS service, or your own application) send messages to the queue using the AWS SDK, or through an EventBridge Pipe or SNS (we'll discuss combinations in a moment). For example, a Lambda could call the SendMessage API on SQS when it has a unit of work to offload.
• Consumers can be one or multiple Lambda functions listening to that queue. If multiple consumers are attached to the same queue, they effectively compete for messages – only one will get each message (this is the competing consumers pattern for scaling throughput).
• The Lambda processes the message and completes. Upon successful processing, the message is deleted from the queue. If the Lambda errors or times out, SQS will eventually make the message visible again for retry (unless it's a FIFO queue with exactly-once processing, in which case it won't reprocess unless you hook in deduplication and sequencing correctly – but let's stick to standard queues for now). A consumer sketch follows this list.
• You can configure a Dead-Letter Queue (DLQ) for the main queue so that if a message fails processing repeatedly (reaching its max receive count), it is moved to the DLQ for later inspection. This prevents poison-pill messages from endlessly reprocessing and blocking the queue.
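Here is the consumer sketch referenced in the list above: an SQS-triggered Lambda that reports partial batch failures so only the failed messages return to the queue for retry. This assumes the event source mapping is configured with the ReportBatchItemFailures response type; process_task stands in for your (idempotent) business logic.

```python
import json

def handler(event, context):
    """SQS batch consumer with partial batch failure reporting."""
    failures = []
    for record in event["Records"]:
        try:
            task = json.loads(record["body"])
            process_task(task)  # hypothetical business logic
        except Exception:
            # only this message becomes visible again for retry
            failures.append({"itemIdentifier": record["messageId"]})
    # messages not listed here are considered successful and are deleted
    return {"batchItemFailures": failures}

def process_task(task):
    ...  # e.g., generate a thumbnail for task["imageKey"]
```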
One big advantage of SQS’s point-to-point model is delivery guarantees and durability. SQS stores
messages durably across multiple AZs. If a consumer is unavailable, the message sits in the queue for up to
14 days (the default retention period is 4 days, configurable up to 14). This ensures reliability in the system – you don’t lose
tasks just because a service is down temporarily. The Well-Architected Serverless Lens explicitly
recommends SQS for reliable and durable communication between microservices 4 .
Another form of point-to-point messaging on AWS is Amazon EventBridge Pipes configured to connect a
source directly to a target. For example, EventBridge Pipes can pull from a Kinesis stream or DynamoDB
stream and write to an SQS queue or a Lambda, effectively acting as a point-to-point integration without
needing an explicit publisher in code. Under the hood, the Pipe polls the source for you and either buffers
the records into a queue or invokes the target directly.
When to use point-to-point vs pub/sub: If you have exactly one interested party for each message, and
especially if the producer shouldn’t care who processes it, P2P is ideal. For instance, a “thumbnail
generation” microservice might simply push tasks to a queue; one of multiple worker Lambdas will pick it
up. You wouldn’t want that in a pub/sub because you don’t want multiple services generating the same
thumbnail – just one. Conversely, if multiple independent actions should occur for each event (e.g., when an
order is placed, you want to notify customer, update inventory, and start shipment – three different actions),
that calls for pub/sub (or multiple queues/topics). P2P is about one message -> one consumer’s action.
Combining SNS (pub/sub) with SQS for resilience: It’s common to integrate SNS and SQS for even point-
to-point flows to get some of the benefits of both. For example, a single producer publishes to an SNS topic,
and you have exactly one SQS queue subscribed to that topic (with no other subscribers). This might sound
like overkill, but it provides a neat property: the producer can be decoupled via SNS (which can buffer briefly
and fan-out if needed later), and the SQS provides durability plus the ability for a Lambda to handle the
message. AWS actually often recommends using SNS -> SQS chaining even for single consumers, because
SQS allows controlling the rate of consumption and acts as a reliable store in front of Lambdas 46 . For
example, if a burst of 10,000 messages come, SNS would push them to SQS quickly, and then your Lambda
can consume from SQS at a sustainable rate (scaling out as needed). Without SQS, if SNS invoked Lambdas
directly, you might overwhelm Lambdas or run into throttling. With SQS, you have back-pressure capability.
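Setting up that chain is a one-time wiring step; a sketch with placeholder ARNs (the queue's resource policy must also allow SNS to send messages to it):

```python
import boto3

sns = boto3.client("sns")

# Topic-queue chaining: subscribe an SQS queue to an SNS topic with raw message
# delivery, so the consumer receives the original payload without the SNS envelope.
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:OrderEvents",
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:order-processor-queue",
    Attributes={"RawMessageDelivery": "true"},
)
```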
Best Practice: Use SQS queues in front of critical processing workflows to decouple and
protect your system from spikes. Even if you have a pub/sub scenario with SNS or
EventBridge, consider giving each consumer its own SQS queue (subscribed to the topic or
bus) – this way, each consumer has an isolated workload and can scale independently. This
pattern (sometimes called topic-queue chaining) improves resilience 30 . Also, tune your SQS
parameters like the visibility timeout to slightly longer than your longest processing time
(so that if a Lambda is still working on a message, SQS doesn’t give it to another Lambda
prematurely), and set a dead-letter queue to catch messages that never succeed. Always
handle the possibility of duplicate messages (at-least-once means your Lambda could see the
same message again if the first attempt’s ack didn’t go through) by making your processing
idempotent (more on that later).
FIFO (First-In-First-Out) Queues: AWS SQS offers FIFO queues which guarantee first-in-first-out ordering of
messages and exactly-once processing (via deduplication). FIFO is great if order matters – e.g., processing
financial transactions or steps in sequence. However, FIFO queues have lower throughput (as of writing,
300 messages/sec per FIFO queue without batching, or up to 3,000 messages/sec with batching; high-throughput FIFO mode raises these limits further). They also require more careful configuration (like
setting message group IDs to allow parallelism by grouping messages). If you truly need strict ordering and
consistency, SNS and SQS both have FIFO variants (SNS FIFO topics and SQS FIFO queues can work together
for ordered pub/sub) 47 . But note, enabling FIFO introduces more overhead and throughput limits, so use
it only when necessary. Many event-driven systems manage with eventual consistency and don’t need strict
global ordering.
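For completeness, a sketch of publishing to a FIFO queue, with a placeholder queue URL: MessageGroupId scopes ordering (here, per account), and MessageDeduplicationId suppresses duplicates within the five-minute deduplication window.

```python
import json
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/transactions.fifo",
    MessageBody=json.dumps({"accountId": "acct-42", "amount": 100}),
    MessageGroupId="acct-42",                    # ordering scope: one account
    MessageDeduplicationId="txn-20240101-0001",  # idempotent publish
)
```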
Anti-Pattern – Using a Single Queue for Unrelated Messages: Sometimes architects create one “mega-
queue” for all kinds of messages to simplify architecture. This is usually a bad idea because different
message types might have different processing characteristics, and it introduces unnecessary coupling. It’s
better to use separate queues per message type or per consumer type. SQS is cheap and plentiful – you can
create many queues. Grouping unrelated things can lead to one slow message type blocking others if not
handled carefully. Also, avoid turning SQS into a poor-man’s pub/sub by having multiple consumers read
the same queue with the intention that each message is processed by all of them. If multiple consumers are
reading from one queue, they will compete, and only one of them will get each message. To get pub/sub
behavior, the correct way is SNS or EventBridge, not multiple consumers on one queue (unless you’re
scaling horizontally for throughput, but that’s a single logical consumer service with many instances).
Anti-Pattern – Synchronous calls hidden in queue consumers: Another misuse is to put a message on a
queue, then have the consumer immediately call back the producer or another service synchronously for
additional data. This mixes async and sync in a fragile way. If you find yourself doing that, maybe the
message didn’t carry enough information (see ECST above), or the design might be simplified by a direct call
if you truly needed immediate response. Queues are for async fire-and-forget tasks; design the tasks to be
self-contained units of work for the consumer.
In summary, point-to-point messaging via SQS (and similar services) is a backbone of serverless event-
driven workflows. It ensures each message is processed once by one component, allowing you to
distribute work and decouple producers from consumers. AWS Lambda + SQS gives a straightforward way
to implement background processing without managing servers or polling loops. By leveraging queues, you
inherently gain elasticity (as Lambdas scale with the queue depth) and fault tolerance (persistent
messages). Always consider if a given integration is naturally one-to-one – if so, a queue might be the right
choice. If it’s one-to-many, that’s where the next pattern (publish/subscribe) comes in.
Publish/Subscribe Pattern
The Publish/Subscribe (pub/sub) pattern is at the heart of many event-driven systems. In pub/sub, a
message (event) published by a producer is delivered to multiple interested subscribers. The producers and
consumers are decoupled – producers don’t know who the subscribers are, they just publish to a logical
channel (topic or event bus), and the system ensures delivery to all subscribers. This pattern enables fan-out
of events and is ideal for distributing information to many parts of a system simultaneously.
Use cases for pub/sub abound:
• A new user registration event might need to be handled by the welcome email service, the analytics service, and the recommendation engine (three different subscribers).
• In a microservices e-commerce app, an OrderPlaced event could be consumed by the Inventory service (to reserve stock), the Payment service (to charge the customer), the Shipping service (to schedule delivery), and the Notification service (to send a confirmation email).
• System or business events that multiple systems care about – e.g., a price change event could be consumed by billing, by a caching service, and by a UI service to refresh displays.
On AWS, there are two primary ways to implement pub/sub:
• Amazon SNS (Simple Notification Service) topics.
• Amazon EventBridge event buses with rules.
SNS Topics: SNS follows a traditional pub/sub model. You create a topic (e.g., “OrderEvents”). Producers
publish messages to the topic. Subscribers can be of various types: SQS queues, Lambda functions, HTTP
endpoints, email/SMS (for notifications) and more. When a message is published, SNS will push it to all
subscribers. If an SQS queue is a subscriber, the message will be enqueued. If a Lambda is a subscriber,
AWS will invoke the Lambda with the message. SNS is a fully managed fan-out service – it handles all the
routing of copies of the message to each endpoint. One thing to highlight: if you have an SNS topic with 5
subscribers, and you publish one message, SNS essentially makes 5 deliveries (so the publish throughput
and cost scales with number of subscribers).
SNS supports message filtering on subscriptions. This means a subscriber can indicate it only wants
messages that match certain attributes (metadata in the message). For instance, you might have a single
“Orders” topic but have one microservice only interested in eventType = "OrderCanceled" events –
you can attach a filter policy to its subscription so it only receives those and not every order event 11 . This
helps avoid overhead on consumers that only need a subset of events from a busy topic.
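A sketch of attaching such a filter policy to an existing subscription (the subscription ARN is a placeholder; the policy matches the eventType message attribute set by the publisher):

```python
import json
import boto3

sns = boto3.client("sns")

# This subscription will now receive only OrderCanceled events from the topic.
sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:OrderEvents:subscription-id",
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"eventType": ["OrderCanceled"]}),
)
```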
EventBridge Event Buses: EventBridge can similarly deliver an event to multiple targets. You define rules
on a bus, each rule can match certain events (based on event content) and route them to a target (or
multiple targets). For example, on the default event bus you might create a rule “if detail-type is
OrderPlaced, then target Lambda A and Lambda B and SQS Q1” etc. Under the hood, EventBridge will
evaluate incoming events against all rules and invoke the target integrations accordingly for each matching
rule. In effect, this is pub/sub: the publisher puts an event on the bus, and multiple consumers (via rules)
can receive it. One advantage of EventBridge is that it can also route events to other AWS accounts or to
SaaS partner services, making cross-application pub/sub easier. Another advantage is the single bus,
multiple subscriber types model – as Danilo Poccia noted, with SNS you might need multiple topics for
different filtering needs, while with EventBridge you could have one bus and let rules determine routing
11 . This can simplify management when you have lots of event types.
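A sketch of defining such a rule with two targets via the API (names and ARNs are placeholders; each target also needs permission to be invoked by EventBridge):

```python
import json
import boto3

events = boto3.client("events")

# Content-based routing: match OrderPlaced events on a custom bus...
events.put_rule(
    Name="route-order-placed",
    EventBusName="orders-domain-bus",
    EventPattern=json.dumps({
        "source": ["com.example.orders"],
        "detail-type": ["OrderPlaced"],
    }),
)
# ...and fan them out to a Lambda function and an SQS queue in parallel.
events.put_targets(
    Rule="route-order-placed",
    EventBusName="orders-domain-bus",
    Targets=[
        {"Id": "inventory-fn",
         "Arn": "arn:aws:lambda:us-east-1:123456789012:function:inventory"},
        {"Id": "shipping-queue",
         "Arn": "arn:aws:sqs:us-east-1:123456789012:shipping-queue"},
    ],
)
```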
A difference to note: push vs. pull – SNS pushes to endpoints (if the endpoint is an SQS queue, SNS pushes
the message into the queue, and the consumer then pulls from it). EventBridge likewise pushes events to targets
(invoking a Lambda, putting to an SQS queue, etc.). Both SNS and EventBridge offer at-least-once delivery and
generally have high durability. EventBridge has a slight additional latency compared to SNS (since it’s more
complex routing logic). AWS mentions if ultra-low latency is required, SNS might be preferable to
EventBridge because EventBridge’s rule matching can introduce a small delay 48 (small meaning usually
tens of milliseconds, but at scale it can increase). However, EventBridge offers more sophisticated features
(like content-based filtering, transformation of events if needed, etc.).
It’s worth noting you can actually combine them: For example, an EventBridge rule could target an SNS
topic (meaning if an event matches, republish it to SNS subscribers), or conversely use SNS to fan-out to
SQS and each SQS triggers Lambdas, etc. The design will depend on specific needs of filtering,
compatibility, etc.
AWS Example of Pub/Sub: Let’s revisit a part of the earlier CQRS discussion: In the multi-read-model
architecture, they published transformed data to an SNS topic, which then fanned out to multiple SQS
queues (one per read model) 32 . That’s a classic pub/sub scenario – one event triggers updates to multiple
data stores. Another example: say you have an IoT sensor event coming through AWS IoT Core (which can
send messages to EventBridge). That single sensor reading might need to be processed by a monitoring
application, stored in a database, and also triggers an alert if out of range. EventBridge could have three
rules on the IoT events bus: one to put the data in DynamoDB, one to invoke a Lambda for alerting if value
> threshold, one to forward the event to a third-party system via API. Each rule sees the same event and
acts in parallel. The decoupling is such that adding a fourth subscriber is just adding another rule or
subscription, and doesn’t impact the publisher or the other consumers (aside from additional throughput/
cost).
Publish/Subscribe vs Point-to-Point: We’ve explained each separately, but to solidify: Use pub/sub when
multiple independent processes or services need to know about an event. Use point-to-point (queue)
when only one process should handle each message (but you might have many instances of that process
for scalability). Sometimes they mix: for reliability, even pub/sub deliveries often go through an
intermediary queue to each consumer. For instance, an OrderPlaced could go to SNS, then to three SQS
queues (one for Inventory service, one for Shipping service, one for Notification service). Each service then
processes from its queue. That’s pub/sub (one event triggers three deliveries), but each delivery ends up as
point-to-point to a specific service. This hybrid approach is common and gets the best of both: pub/sub
decoupling and queue buffering. AWS reference calls this fan-out to queues a topic-queue chaining pattern
30 .
Advantages of Pub/Sub:
• Loose Coupling at Scale: Publishers and subscribers know nothing about each other. You can add new subscribers without changing the publisher logic. This is great for scaling an architecture as new requirements emerge – e.g., one day product events are also needed by a machine learning service to retrain models; you just attach it to the same events.
• Parallel Processing: Multiple actions can happen in response to one event, in parallel. This can lead to faster end-to-end outcomes than if one service had to call others sequentially. For instance, as soon as an order is placed, multiple things kick off at once (packing, charging, emailing), rather than one after the other.
• Failure Isolation: A slow or failing subscriber doesn’t directly impact the others or the publisher (especially when decoupled with queues). Each subscriber can have its own error handling and retry logic. If one subscriber goes down, others still get the events (the one that is down can catch up from a DLQ or replay later).
• Simplified Communication Model: For the publisher, it’s fire-and-forget. It just emits an event to the topic/bus and doesn’t worry about responses. This fits naturally with serverless, where a Lambda might do some work and then just enqueue an event for others, ending its execution quickly.
Challenges with Pub/Sub:
• Ordering: If two events are related (say OrderCreated then OrderCanceled quickly after), ensuring all subscribers see them in the same order may not be trivial with separate deliveries. If ordering is critical, design for it (use FIFO as mentioned, or design idempotent consumers that can handle out-of-order events by checking timestamps or version numbers).
• Duplicate Processing: With multiple consumers, each event is processed multiple times (once per consumer). That’s intentional, but it means the system as a whole does more work. That’s fine if each consumer addresses a different concern, but it requires making sure each consumer’s side effects are isolated (e.g., you wouldn’t want two services sending two emails to the customer for one OrderPlaced – ensure they have distinct roles).
• Visibility & Debugging: It can become tricky to trace the flow of a single event that triggers many things. Tools like AWS X-Ray and trace IDs can help link events to the actions they trigger. EventBridge supports passing trace context for X-Ray so that you can follow an event from one Lambda to another 49 . This is important in a complex pub/sub system for debugging and monitoring, because the flow is not linear.
• Fan-out Cost: With pub/sub, your cost can grow with each subscriber (e.g., SNS charges per million publishes and per subscriber protocol, and each Lambda invocation is billed). It’s usually trivial at moderate scale, but worth watching if you have dozens of subscribers and very high event volume.
AWS Service Selection – SNS vs EventBridge: They have overlapping use cases. Some guidelines:
• Use SNS when you want simple, high-throughput pub/sub, especially for application-driven events with mostly AWS service targets (Lambda, SQS, etc.). SNS is also required if you need certain target types like mobile push notifications, SMS, or email.
• Use EventBridge when you need sophisticated filtering on event content (beyond SNS's attribute filtering), or you want to easily route events to many different target types including other AWS accounts or SaaS, or you want the schema registry and event replay features. EventBridge is great for integrating heterogeneous systems (e.g., SaaS, third-party, cross-account scenarios).
• Latency: If you need sub-100ms delivery consistently, SNS might be a bit faster; EventBridge can add some tens of milliseconds. If latency isn't critical, the difference is usually negligible from an end-user perspective.
• Ordering: If ordering is needed, SNS offers FIFO topics (with throughput trade-offs); EventBridge does not have a FIFO option as of this writing.
• Developer experience: SNS is a bit simpler conceptually (topics and subscriptions). EventBridge fits into a broader event bus concept that includes AWS service events (S3 events and the like can come into it natively).
Anti-Pattern – Too Fine-Grained Events Flooding Subscribers: In pub/sub, a mistake would be to publish
extremely fine-grained events at high frequency that overwhelm consumers who can’t keep up or don’t
need that level of detail. For example, emitting an event for every single field change in an object rather than
batching or coalescing changes could cause unnecessary load. If consumers only care that "something changed" in an aggregate sense, you might be better off debouncing events, or sending a single "Order XYZ updated" event carrying all the new data rather than separate events for "Order status changed," "Order item added," and "Order total changed" all at once. Overloading the system with events incurs cost and filtering complexity. Design event granularity thoughtfully.
Anti-Pattern – Tight Coupling via Shared Topic Schemas: If multiple event types share the same topic and
subscribers are not filtering properly, you get unintended coupling (subscribers receiving events they don’t
care about, maybe even breaking on unexpected event formats). It might be better to separate topics by
event type category or ensure attribute-based filtering is used to segregate. For instance, having a single
“AllEvents” topic is usually not ideal; it’s better to have domain-specific topics or use an event bus with
clearly defined detail-type attributes and rules per type.
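As a sketch of that segregation on an EventBridge bus, a rule per detail-type keeps each subscriber to exactly the events it cares about. Bus, rule, and target names below are hypothetical, and the SQS target would also need a resource policy allowing EventBridge to send (omitted here):

```python
import json

import boto3

events = boto3.client("events")

# Match only one event type on a domain-specific bus.
events.put_rule(
    Name="order-placed-to-inventory",
    EventBusName="orders-bus",
    EventPattern=json.dumps({
        "source": ["com.example.orders"],
        "detail-type": ["OrderPlaced"],
    }),
)

# Deliver matching events to the Inventory service's queue.
events.put_targets(
    Rule="order-placed-to-inventory",
    EventBusName="orders-bus",
    Targets=[{
        "Id": "inventory-queue",
        "Arn": "arn:aws:sqs:us-east-1:123456789012:inventory-order-events",
    }],
)
```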
In essence, Publish/Subscribe enables the event-driven nature of architectures by allowing one event to
have cascading effects in a decoupled way. AWS’s SNS and EventBridge give us the tools to do pub/sub at
scale with minimal setup. By using pub/sub, we embrace the reality that in a complex system, many things
often need to know about a single fact, and rather than hard-coding those relationships, we let the
infrastructure dynamically route events. This pattern, combined with the others, truly helps in building
systems that are extensible – new features can tap into the event stream without modifying existing code,
just by adding new subscribers.
Event Streaming
In an event streaming architecture, events are not just immediately pushed out to subscribers and then forgotten; instead, they are stored in an ordered log (stream), and consumers can read from that log (often from any point they want). Multiple consumers can independently read the same stream either in real-time or later (even re-reading if needed). This opens up capabilities like event replay, windowed processing, and parallel consumption.
Key AWS services for streaming:
• Amazon Kinesis Data Streams (KDS): A fully managed streaming service on AWS. You create a stream with a number of shards. Producers put records onto the stream (each record up to 1MB, usually smaller). Records are persisted for 24 hours by default (extendable, at additional cost, up to 365 days). Consumers then read from the shards (each consumer is typically an application or Lambda that reads sequentially). Kinesis guarantees ordering per shard and allows multiple consumer applications (via named consumers or the Kinesis Client Library coordinating shared reads). There's also Kinesis Data Firehose (for pushing a stream into destinations like S3), but for event-driven processing we focus on Data Streams, where you handle events yourself.
• Amazon Managed Streaming for Apache Kafka (MSK): AWS's managed Kafka service. Kafka is a very popular open-source streaming platform. MSK (and the newer MSK Serverless option) lets you run Kafka clusters on AWS without the operational burden. Kafka similarly stores events in partitioned topics, with retention.
• Amazon DynamoDB Streams: Not a general event stream service, but it provides a log of changes on a DynamoDB table. In essence, it's a stream of events (the DB change events) that can be consumed by multiple processes or Lambdas. We already saw it used in CQRS. It's limited to changes from DynamoDB, but worth mentioning as streaming in the context of event-driven patterns.
• Amazon EventBridge Archive and Replay: EventBridge isn't exactly a streaming platform, but it introduced a feature to archive events and replay them later to the bus. This gives some streaming-like capabilities (e.g., replaying events from last week to test something, or to bootstrap a new service). However, it's not a continuous high-throughput stream like Kinesis; it's more for backup and occasional replay.
When to use event streaming? Typically:
• High-frequency events: If you have a firehose of events (like clickstream data, IoT sensor readings, log events) where you might be dealing with thousands per second, and multiple consumers (like analytics, monitoring, storage) need those events.
• Ordered processing: If certain computations need events in the exact order they were generated (within a partition or key). For example, computing metrics over time, or applying a series of financial transactions in order. Streaming platforms preserve order within partitions.
• Multiple independent consumers with different speeds: For example, one consumer does real-time alerting on events, another does batch aggregation every hour, and another just stores events to S3 for a data lake. With a stream, each consumer can read the data at its own pace without interfering with others.
• Event sourcing: If you literally treat the event log as the source of truth (an append-only log of state changes), a streaming system can play that role because it durably stores events and allows re-reading to rebuild state.
• Reprocessing / Backtesting: If you need to rewind and reprocess events (e.g., you improved an algorithm and want to run it on the last month of events to compare results), streams with retention or archive capabilities allow that. In contrast, SNS or SQS doesn't store old messages once delivered.
An example: imagine a financial trading platform where price tick events are coming in constantly. You
might use a streaming platform to broadcast those ticks to many consumers – one updating live
dashboards, another computing technical indicators, another storing to a database. If one consumer falls
behind, it can catch up by reading from where it left off (the events are still in the stream). If you want to
replay yesterday’s ticks to test a new strategy, you can because the stream kept them for a day.
Another example: sensor data from IoT devices. AWS IoT Core can directly push device messages to Kinesis. With Kinesis, you could have a Lambda that handles real-time anomalies, a Kinesis Data Analytics (Apache Flink) application that computes running averages, and maybe an EMR or Glue job that periodically reads the stream to load data into a warehouse. All of those can tap into the same stream.
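On the producer side, writing to such a stream is simple; a minimal boto3 sketch (stream and field names are illustrative):

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Partitioning by deviceId keeps each device's readings ordered on one shard.
reading = {"deviceId": "sensor-42", "temperature": 71.3, "ts": 1700000000}
kinesis.put_record(
    StreamName="iot-telemetry",
    Data=json.dumps(reading).encode("utf-8"),
    PartitionKey=reading["deviceId"],
)
```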
AWS Lambda with Kinesis: Lambda can be a consumer for Kinesis streams (similar to SQS, Lambda has an event source integration). It polls shards and invokes your function with a batch of records. One limitation: one Lambda invocation per shard at a time (Lambda scales up to the number of shards, not beyond, because it preserves ordering per shard; the event source mapping's parallelization factor can allow several concurrent batches per shard while still preserving per-partition-key ordering). This works nicely for moderate throughput. If you have extremely high throughput or need very low-latency processing, specialized stream processing frameworks (like Apache Flink or custom EC2 consumers) are sometimes used. But for many serverless architectures, Lambda is great because it abstracts the polling and scaling (just increase shards to scale out horizontally, and Lambda will create more concurrent functions accordingly, up to account limits).
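A consumer sketch under the same assumptions (the stream carries the JSON readings from the producer example above); note that Kinesis record payloads arrive base64-encoded:

```python
import base64
import json

def handler(event, context):
    """Invoked with a batch of records from one shard, in arrival order."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("temperature", 0) > 90:
            print(f"anomaly on {payload['deviceId']}: {payload}")
    # Raising an exception here would make Lambda retry the batch (or a
    # sub-batch, with bisect-on-error enabled), preserving per-shard order.
```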
Comparing streaming to SNS/SQS: One fundamental difference is persistence. SQS deletes a message once one consumer processes it (unless you duplicate it across multiple queues). SNS doesn't store messages at all; it just pushes them out. Kinesis/MSK store events for a window of time, and they can be read multiple times by different consumers, or re-read by the same consumer. So streaming is more powerful in terms of data retention. On the flip side, SNS/SQS are often simpler for one-time delivery semantics (no offsets to manage, etc.).
Ordering and Parallelism: With streaming, you often partition events by some key (e.g., all events for a
given user or device go to the same partition). This guarantees order per key but means one partition is
processed by one consumer instance at a time. If one partition gets very hot, that can be a bottleneck. A
skill in using streaming is choosing a good partition key to balance load evenly while maintaining order for
things that need ordering. In SNS+SQS world, ordering is usually not guaranteed (unless FIFO topics/
queues), but you scale easily; in streaming, you think about partitioning and ordering more explicitly.
Integration with other AWS services: Kinesis can feed into Kinesis Data Firehose for easy loading to S3/
Redshift. There are managed analytics on streams (Kinesis Data Analytics) that allow SQL or Flink jobs to do
time-windowed analysis. Kafka has a whole ecosystem (Kafka Streams, etc.), though MSK is more self-
managed. One neat integration is EventBridge Pipes, which can connect a Kinesis stream as source and for
example send to multiple SQS or Lambda targets – but if you needed multiple targets from one stream,
you’d typically just have multiple consumers reading the stream (Pipes might be more for 1:1 connections
with filtering).
Replaying with EventBridge Archive: If you are using EventBridge as your main bus but want some
streaming-like replay ability, you can turn on archiving for a bus or specific events. Then you can later replay
events (e.g., “replay events from last 2 hours to this bus”). It’s useful for recovery or testing, but keep in
mind EventBridge is not high-volume storage (there are limits on how many events can be replayed per
hour, etc., currently). It’s more of a safety net or debugging tool than a primary event store.
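A hedged boto3 sketch of archive-and-replay (bus and archive names are hypothetical; the replay source is the archive's ARN):

```python
from datetime import datetime, timedelta, timezone

import boto3

events = boto3.client("events")
bus_arn = "arn:aws:events:us-east-1:123456789012:event-bus/orders-bus"

# One-time setup: keep 30 days of everything that hits the bus.
events.create_archive(
    ArchiveName="orders-archive",
    EventSourceArn=bus_arn,
    RetentionDays=30,
)

# Later: replay the last two hours of events back onto the same bus.
now = datetime.now(timezone.utc)
events.start_replay(
    ReplayName="orders-replay-incident",
    EventSourceArn="arn:aws:events:us-east-1:123456789012:archive/orders-archive",
    EventStartTime=now - timedelta(hours=2),
    EventEndTime=now,
    Destination={"Arn": bus_arn},
)
```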
Trade-offs:
• Running a stream (like Kinesis or Kafka) typically requires more upfront capacity planning (number of shards or brokers) and understanding of throughput patterns. It's not as hands-off as SQS scaling (though Kinesis scaling is not too hard, and MSK Serverless tries to abstract some of that).
• Cost can accumulate if you retain large volumes of data (you pay per shard-hour for Kinesis, and for data volume ingested and retrieved).
• Consumer complexity: managing consumer offsets and ensuring each consumer keeps up is an added responsibility. The Kinesis Lambda integration manages checkpointing behind the scenes for you, though, which simplifies it.
• Exactly-once processing is tricky in streaming. Kinesis (and Kafka) deliver at least once, so duplicates are possible if a consumer crashes after processing but before checkpointing. Your processing logic should handle idempotency or deduplication if that's an issue. With Kafka, transactions and exactly-once semantics are advanced features beyond our scope here. Often, idempotent design is the simpler answer.
Example scenario combining patterns: Consider a Data Lake ingestion pipeline: An application
publishes events (say user activity logs) to an EventBridge bus. A rule sends those events to a Kinesis stream
(perhaps for short-term real-time processing). One consumer Lambda does real-time metrics (small scale).
Meanwhile, a Firehose delivery stream subscribed to Kinesis batches and compresses events to S3 every
minute for long-term storage. Additionally, an EventBridge rule could also directly trigger some immediate
alert if a certain critical event arrives (bypassing the stream for immediacy). This shows that you can mix
event routing – not everything must go through one channel. Some events can go both to streaming and to
direct pub/sub targets.
Anti-Pattern – Using streaming when simple messaging suffices: If your use case doesn’t require replay
or multiple independent consumers or high throughput, using a Kafka cluster or a Kinesis stream can be
overkill. For example, if you just need to decouple a couple microservices and have moderate events, SNS or
SQS or EventBridge might be much simpler and cheaper. Sometimes engineers reach for Kafka because it’s
trendy, even when not needed. Remember that streaming platforms introduce their own complexity
(operational and conceptual). So justify it with clear needs like “I need to persist events for X time” or “I have
5 different teams needing access to all events asynchronously” or “I need to window and aggregate events
in real-time.”
Anti-Pattern – Ignoring backpressure in streaming: If one consumer lags far behind (say one analytics job is slow), the data in the stream accumulates. For Kinesis, if a consumer can't keep up before retention expires (24 hours by default), it will start losing data (i.e., it will miss processing events that aged out). Systems should monitor consumer lag (CloudWatch has iterator age metrics for Lambda on Kinesis) and alarm if consumers fall behind. Strategies might include increasing shards (so the workload can parallelize more), adding partitions if using Kafka, or scaling the consumer application. Backpressure in streaming systems is often a sign that either the consumer is under-provisioned or the partitioning is not granular enough. Because streaming is pull-based (consumers pull at their own rate), producers are decoupled from slow consumers (which is good – they don't get blocked), but you must ensure consumers eventually catch up or scale out.
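A minimal sketch of such a lag alarm with boto3, assuming a hypothetical consumer function and an ops SNS topic (Lambda reports IteratorAge in milliseconds):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the consumer is more than 10 minutes behind the stream head.
cloudwatch.put_metric_alarm(
    AlarmName="telemetry-consumer-lagging",
    Namespace="AWS/Lambda",
    MetricName="IteratorAge",
    Dimensions=[{"Name": "FunctionName", "Value": "telemetry-consumer"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=600_000,  # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```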
Anti-Pattern – Poor partition key choice: The flip side is picking a bad partition key that leads to uneven distribution (a hot key), or so many keys that you lose ordering where you needed it. It
requires thoughtful design. For instance, partitioning by user ID might be fine unless one user is extremely
active (if that one user’s events need order and they dominate traffic, that one partition becomes a
bottleneck). Partitioning by time (like all events of the same minute to one partition) is a bad idea because it
concentrates load by time windows. A good partition key yields roughly uniform traffic per shard and
correlates with the unit of independent processing.
In conclusion, Event Streaming is the pattern that underlies high-scale, data-intensive event-driven systems. With AWS serverless options like Kinesis and MSK (Kafka) and their integration with Lambda and other services, you can build robust streaming architectures without managing a lot of infrastructure. Streaming complements the other patterns: you might use streaming for the heavy lifting of data flow and pub/sub messaging for more discrete business events. Many architectures use both – e.g., Kafka for the big-data pipeline, but SNS/EventBridge for business domain events, because those fit more naturally with filtering and fan-out to specific microservices. It's not either/or. The mastery comes in choosing the right tool for each communication pattern in your system.
Choreography vs. Orchestration
• Choreography – There is no central coordinator. Each service listens for events and reacts
accordingly, potentially emitting new events that trigger other services. The logic of the workflow is
distributed across the participants. In a choreographed saga, each service knows what to do when
a certain event arrives (and maybe what event to emit next), but no single entity tells everyone what
to do.
• Orchestration – A central orchestrator (or controller) tells each participant what to do and in what
order. The workflow logic is centrally defined, often in a workflow engine or orchestrator service,
that explicitly calls each service/task in sequence (or parallel) and handles decision points.
It’s analogous to a dance: choreography is like freestyle dancers reacting to each other (decentralized),
orchestration is like having a choreographer or conductor signaling each move (centralized control).
Choreography in an event-driven system: Choreography naturally aligns with the pub/sub event-driven style. For example, imagine an order processing saga in a choreographed approach:
1. Order Service: when an order is placed (perhaps via an API call), it creates an OrderPlaced event.
2. Inventory Service (listening for OrderPlaced): receives the event, reserves items, then emits an InventoryReserved event.
3. Payment Service (listening for OrderPlaced as well): charges the customer, then emits a PaymentSuccessful or PaymentFailed event.
4. Shipping Service: waits for both InventoryReserved and PaymentSuccessful. Perhaps it listens for an OrderReadyToShip event that only gets emitted when those two preceding conditions are met (this can be tricky to synchronize via pure events; one way is for a service to detect when both have happened).
5. If payment fails, the Payment Service emits PaymentFailed; some service listens and reacts by emitting OrderCanceled, which triggers the Inventory Service to release its reservation, and so on.
In pure choreography, each service has rules like "on X event, do Y and maybe emit Z." The communication bus (like EventBridge or SNS) routes events, but doesn't enforce any order beyond what inherently occurs from causality.
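A choreographed participant is typically just an event handler that does its local work and publishes the next fact. A minimal sketch of the Inventory service's handler, assuming a hypothetical orders-bus and a stubbed local transaction:

```python
import json

import boto3

events = boto3.client("events")

def reserve_stock(order_id, items):
    """Placeholder for the service's local transaction."""

def handler(event, context):
    """Reacts to OrderPlaced; neither knows nor cares who listens next."""
    detail = event["detail"]
    reserve_stock(detail["orderId"], detail["items"])

    # Emit what happened, not a command to anyone in particular.
    events.put_events(Entries=[{
        "EventBusName": "orders-bus",
        "Source": "com.example.inventory",
        "DetailType": "InventoryReserved",
        "Detail": json.dumps({"orderId": detail["orderId"]}),
    }])
```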
Orchestration approach for the same: You might use AWS Step Functions (a workflow service) as an orchestrator:
1. An order comes in, and a Step Functions state machine is started (with the order details).
2. The state machine first invokes a task (e.g., a Lambda or API call) for the Inventory Service to reserve stock. If that succeeds, it next calls the Payment Service to charge. If all is good, it then calls Shipping to create a shipment, and so on. If any step fails, the workflow can trigger compensating tasks (e.g., if payment failed after inventory was reserved, call Inventory to release it).
3. The Step Functions workflow thus has the sequence and branching logic encoded in its state machine definition (e.g., using Choice states for branching on success/failure, and Parallel or Map states for any parallel tasks).
4. The services themselves might still produce events or do their internal work, but the orchestrator drives them one by one or in structured concurrency.
AWS Step Functions is a powerful orchestrator with built-in error handling, retries, parallel execution, and
even the ability to integrate directly with many AWS services without writing glue code (via service
integrations).
Choreography Pros:
• Loose coupling: Services are truly independent. They only know about the events, not about who triggered them or what comes next. This aligns well with microservice autonomy.
• Easier to add new subscribers: If a new business capability needs to hook into the process, you can often just subscribe to the relevant events. E.g., adding a notification email when an order is placed is easy by listening to the OrderPlaced event, without touching existing logic.
• Resilience: No single point of failure controlling the process. If one part is down, others still produce/consume events (though the overall business process might not complete, it's not bottlenecked on a central brain).
• Natural fit for simple workflows: If the process is essentially just a chain of reactions, choreography can be very straightforward. For example, an image upload triggers a resizing service, which on completion triggers a thumbnail indexing service, and so on, one after another via events – if each only cares about the immediate previous step's event, it's simple.
Choreography Cons:
• Complexity in understanding flow: As more steps and conditions get added, it can become like "spaghetti" – events flying everywhere with implicit ordering. It might be hard for a new engineer to trace: "An order goes here, triggers that, then triggers those two, which race, and if one fails… how do we handle that?" The logic lives in the interplay of multiple services' event handlers, which can be hard to visualize.
• Lack of global visibility: No single place to see the status of a given transaction (unless you build a monitoring service that correlates events). Each service only knows its piece. Debugging multi-step issues might require checking logs in multiple places and piecing together event timelines (though techniques like a shared correlation ID in all events can help).
• Complex error handling: Especially with sagas, handling failure and compensation in a choreographed way is tricky. If step 3 fails, how do the earlier steps know to undo? Usually, they also have to listen for some sort of cancellation event. E.g., PaymentFailed triggers a listener in Inventory to release stock. This works, but as the number of failure scenarios grows, the event choreography can get convoluted.
• Inadvertent event storms: With many services emitting and reacting, you can get cascades. For instance, a naive approach might have PaymentFailed and InventoryReleased both emitting events that others listen to, some of which might double-trigger something if not careful. Without careful design, you can get duplicate actions or race conditions.
Orchestration Pros:
• Single source of truth for the workflow: The entire saga logic is defined in one place (e.g., a Step Functions state machine). It's easier to reason about the sequence and conditions because they're explicit.
• Central error handling: The orchestrator can catch a failure at step N and decide what to do (retry, compensate, halt, etc.) explicitly. In Step Functions, for instance, you can define fallback states or use the saga pattern (where each step has a corresponding compensation step to call on failure).
• Visibility: Orchestrators often provide execution history. Step Functions, for example, visualizes each execution's path and which steps succeeded or failed, which is great for monitoring and debugging a business process.
• Simpler dependencies: If step C must happen after A and B, the orchestrator ensures that order. In choreography, you might need an event aggregator, or each service would have to internally wait for multiple events. Orchestration can inherently express "first do A, then do B & C in parallel, wait for both, then do D."
• Timeouts and retries: Orchestrators can implement timeouts (if a service call doesn't return in X seconds, treat it as failure) and consistent retry policies in one place. Without an orchestrator, each service in a choreography might implement its own retries, possibly inconsistently.
Orchestration Cons:
• Tighter coupling to the orchestrator: The orchestrator is aware of all steps and participants, making it a form of coupling (not direct service-to-service, but through the orchestrator). If you change a service's API, you have to update the orchestrator's call. Teams might need to coordinate changes more when the flow changes.
• Single point of control (and potential failure): If the orchestrator service goes down, new workflows can't start and existing ones might pause. (AWS Step Functions is highly available, so it is unlikely to "go down," but conceptually it's a central brain.)
• Less flexibility for ad-hoc extension: Adding a new side action (like that notification email example) might require modifying the orchestrator definition to insert a new step. That can be more overhead than just adding a new event listener in a choreographed model.
• Complex orchestrator logic for highly dynamic scenarios: If flows diverge wildly based on conditions or need to handle dozens of optional steps, the state machine can become complex. Sometimes events handle this more naturally by simply not triggering certain flows unless needed (though that can be a double-edged sword).
• Orchestrator overreach: There's a danger of putting too much logic in the orchestrator, to the point where it becomes a monolithic workflow engine in which all decisions happen. This can reduce the autonomy of services (they become passive executors). Ideally, an orchestrator should only coordinate coarse-grained steps and let services encapsulate business rules. If not careful, you can shift logic out of services into the orchestrator, making it very complex.
AWS tools:
• Step Functions is the flagship orchestrator on AWS for serverless workflows – a state machine as a service. You define a state machine (in JSON, or YAML via infrastructure-as-code tools) with states like Task (do something), Choice (if/else), Parallel, Wait, etc. It integrates with Lambda (to execute code) and with many AWS services directly (thanks to AWS service integrations – you can have a state that, say, starts an AWS Glue job or puts an item in DynamoDB, without writing a Lambda to do it). Step Functions has no native saga construct, but you can implement one with Catch handlers around tasks that invoke compensating tasks on failure; Workflow Studio and AWS blog content show how to implement saga compensations in Step Functions effectively.
• Step Functions comes in Standard (durable, can run up to a year, great for long-running processes) and Express (cheaper, high-volume, but capped at 5 minutes of execution and with less debug information) variants. Standard is often used for important orchestrations even when they are short, because you get execution history.
• Amazon SWF (Simple Workflow Service) is an older orchestrator (the predecessor to Step Functions). It's still around, but Step Functions has largely superseded it for most use cases.
• Amazon EventBridge Scheduler can help when you need to delay or schedule future events as part of a flow, but it's more of a utility (like "emit an event 1 hour later").
• AWS SDKs / custom orchestrator: One could build orchestration by writing a dedicated "workflow service" that calls the others (a coordinator service). But using Step Functions is usually simpler and less error-prone than building your own orchestrator logic in code.
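To make the Catch-based saga concrete, here is a minimal state machine definition built as a Python dict and registered with boto3. Every ARN and name is a placeholder, and a real saga would add Retry blocks and more granular error matching:

```python
import json

import boto3

definition = {
    "StartAt": "ReserveInventory",
    "States": {
        "ReserveInventory": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:reserve-inventory",
            "Next": "ChargePayment",
        },
        "ChargePayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-payment",
            "Next": "CreateShipment",
            # If payment fails, undo the inventory reservation.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory"}],
        },
        "CreateShipment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:create-shipment",
            "End": True,
        },
        "ReleaseInventory": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:release-inventory",
            "Next": "OrderFailed",
        },
        "OrderFailed": {"Type": "Fail", "Error": "SagaAborted"},
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="order-saga",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/order-saga-role",
)
```

Whatever the orchestrator coordinates, supplemental reactions can remain event-driven outside of it;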
meanwhile, an OrderPlaced event still went out initially that a Marketing service listens to to record the
new order for analytics – that side doesn’t need to be in the orchestration, it’s just an async follow-on.
Another example: User signup – you might orchestrate verification and profile setup sequentially (maybe
need to create records in multiple systems in a defined order), but also publish an event “UserSignedUp”
that many services listen to (send welcome email, add to mailing list, etc.) without a central coordinator for
those.
Best Practice: For simpler workflows, default to choreographed events – they keep services
decoupled and are easy to extend. But when the process logic is complex or needs
transactional consistency, consider introducing an orchestrator. If you do use an
orchestrator like Step Functions, keep the states high-level. Don’t micro-manage within a
state machine what can be handled inside one service. For example, don’t break a single
service’s job into multiple orchestrated steps if it can be a single task; let the service handle
the sub-steps internally (or via its own internal events). Over-orchestrating can reduce the
benefits of microservices.
Real World AWS Example: A case study often cited is transaction management in e-commerce. At
re:Invent, Amazon’s retail team talked about using orchestration for ensuring an order is either fully
processed or fully canceled (compensating stocks and payments). On the other hand, Amazon’s various
other systems (recommendation, email, etc.) react to events from that process without being in the
orchestration loop. Another example is any user workflow with approvals: Step Functions is great for
orchestrating waiting for an approval (with a human) or a long-running job, which event choreography
alone might not handle as cleanly.
Anti-Pattern – Orchestration in code with lots of waits and polls: Some people implement an
orchestrator by writing a single Lambda that calls Service A, waits (polls) for some result via events or
checks, then calls Service B, etc. This is essentially recreating an orchestrator in code, and might involve
nasty polling loops or holding execution waiting, which is not ideal (could waste execution time or get
complicated managing state). Instead, Step Functions or SWF should be used as they provide native waiting
(like a Task token mechanism to wait for an external callback, or just have a long timeout). Don’t try to
orchestrate via one-shot Lambdas that hope everything completes quickly; use the right tool for durable
coordination if needed.
Anti-Pattern – Over-Choreographing Complex Processes: If you find a process where service A triggers B,
B triggers C, C triggers D, and D might trigger a compensating event back to A on failure – and all of this is
critical to get right – you might be better off with orchestration. Leaving such a complex web entirely
choreographed can lead to issues that are hard to debug or fix (like circular waits if one event is missed or a
race condition). A sign is if you have to implement something like an “event dependency manager” or a lot
of logic in each service to decide if it should act or wait – at that point, orchestrating might simplify things.
Figure 14.2 – Orchestration vs. Choreography. On the left, Orchestration: a central Orchestrator service (e.g., Step Functions) calls Service A, Service B, and Service C in a controlled manner, deciding "what's next" at each step 50 . On the right, Choreography: Services A, B, and C all communicate through an Event Bus (green), each emitting and listening to events. The bus asks "Who is interested in this event?" and delivers accordingly 51 . There is no single controller; instead, each service reacts to the events from others.
In AWS terms, that green bus in the choreography side could be EventBridge or SNS. In orchestration, that
orange orchestrator could be a Step Functions state machine orchestrating calls to A, B, C (which might still
produce events, but the orchestrator is coordinating).
Both patterns can coexist. For instance, you might have a Step Functions workflow that at certain steps
publishes events to notify other systems, effectively mixing orchestration for the main flow with event
choreography for supplemental actions.
Summing up:
• Use choreography for loose, decoupled, many-to-many interactions where each service can largely do its job with minimal awareness of the bigger picture. Great for event-driven integration in general.
• Use orchestration when you need centralized coordination, ordering, and error handling for a multi-step process that spans services. Often suitable for sagas requiring rollback/compensation, or for ensuring critical steps happen in order or not at all.
• Be mindful that orchestration introduces a dependency on an orchestrator, so it should be justified by the complexity it removes from the services themselves.
• Whichever pattern you choose, implement good observability: in choreography, use distributed tracing and correlation IDs on events; in orchestration, leverage the orchestrator's logging, and perhaps still emit events for state changes for external monitoring.
When designed correctly, orchestrators can call services via asynchronous means too (Step Functions can
publish to EventBridge or invoke Lambdas asynchronously), so you can still keep some async behavior but
under a central logic. The key takeaway for an architect is to evaluate the workflow’s requirements: if it’s
relatively simple and expected to evolve (new subscribers), choreography gives agility. If it’s complex and
must be consistent, orchestration gives control.
Additional Advanced Patterns and Best Practices
Beyond the core patterns we’ve discussed, there are several additional patterns and considerations that
often come into play in advanced event-driven architectures. We will cover a few important ones: Saga
pattern (distributed transaction management), Event Sourcing (as a design for system state),
Transactional Outbox (ensuring consistency between state change and event emission), Dead Letter
Queues & Error Handling, Idempotency, and Event Cataloging/Governance. We touched on some of
these earlier, but here we consolidate them as cross-cutting patterns/best practices that seasoned
architects should keep in their toolbox.
Saga Pattern
The Saga pattern refers to managing a business transaction that spans multiple services, without a
traditional two-phase commit. Instead, a saga is composed of a series of local transactions (each in a single
service) and if one fails, the preceding transactions are compensated (undone) by executing defined
compensating actions. We already essentially discussed saga in terms of how one might coordinate it via
choreography or orchestration. To be explicit:
• In a choreographed saga, each service performs its action and if it needs to trigger a rollback, it
emits an event that others listen to for compensations. For instance, service A does step1, emits
event; service B does step2 on event, but if fails, it emits a failure event; service A listens for that and
compensates step1.
• In an orchestrated saga, the orchestrator (Step Functions) would call A (step1), get success, call B
(step2) – if B fails, the orchestrator invokes A’s compensation logic.
The saga pattern ensures eventual consistency: either all steps succeed (making the overall business
operation succeed), or some fail and all completed steps roll back to leave the system as if the saga never
happened (or some defined alternative outcome).
AWS Step Functions can implement saga orchestration by using the Parallel state or Try/Catch with
compensating tasks. For example, AWS has blog content on using Step Functions to coordinate a saga for a
sample application (like processing an e-commerce order, which either completes or cancels if any part
fails).
Best Practice for Sagas: Always define what the compensating action for each step is at design time. If you
reserve inventory, the opposite is to release inventory. If you send an email… well, you can’t "unsend", but
you might send a correction or log something. Not all actions are easily undoable, so sometimes sagas
accept that some side effects (like an email) happen even if later things fail (that might be acceptable or not
depending on context). The key is to ensure the core systems (like inventory, payment) are consistent.
Event Sourcing
We mentioned this alongside CQRS, but let's clarify: Event Sourcing is a pattern where state is not stored as
a latest snapshot but rather derived from the sequence of events. In an event-sourced system, the primary store
is an append-only log of events describing changes to entities. To get the current state of an entity, you
start from a baseline (possibly an empty state or a snapshot) and replay all relevant events.
For example, consider an Account entity with balance. Instead of storing "balance = $100", an event-sourced
system might store events: Deposited $50 , Withdrew $30 , Deposited $80 , etc. From those, you
can compute the balance (which would be $100). The benefit is you have a full audit trail, and you can
reconstruct history or apply events to a new model.
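The derivation is just a fold over the event log; a tiny Python illustration of the account example:

```python
from functools import reduce

events = [("Deposited", 50), ("Withdrew", 30), ("Deposited", 80)]

def apply(balance, event):
    # Each event moves the state forward deterministically.
    kind, amount = event
    return balance + amount if kind == "Deposited" else balance - amount

balance = reduce(apply, events, 0)  # -> 100
```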
Implementing Event Sourcing on AWS: DynamoDB is a fine choice for the event store as we saw. Each
item could be an event with an accountId partition key, a timestamp sort key, and details. Then, current
state could be materialized on demand by scanning events (inefficient for frequent queries unless you store
snapshots periodically) or more typically by maintaining a projection (read model) that is updated by
events (this leads to CQRS, which we covered). A more streaming approach is to use Kinesis or Kafka as the
event store – producers push events to a stream (the log), and consumers can materialize views from it.
Kafka is often used in event sourcing because it inherently stores events with offsets that can be replayed.
MSK (Kafka) or Kinesis could be that central log.
In AWS context, EventBridge could even serve small-scale event sourcing by archiving events and replaying,
but it’s not built for high-rate replay. Kafka/Kinesis are.
Trade-offs: Event sourcing increases storage (you’re keeping all historical events, which could be large). It
also means there’s complexity in versioning events (if your schema of events changes, older events are in
the log in the old format – you have to handle that in your replays or converters). There's also the challenge of eventually consistent projections – the event log is the official state (querying the latest events gives you the truth), but any read model might be slightly behind if it hasn't caught up with the last event.
When to use: event sourcing is powerful if you need auditability, the ability to time-travel or recreate state
for debugging, or if you have complex state that’s easier to manage via events (like many microservices do
this for complex business logic – rather than storing a complicated object, store events and derive the
object’s state as needed). Finance and ledger systems often use event sourcing (a ledger is naturally an
event log). Also, event sourcing goes hand in hand with domain-driven design where domain events are
the core – you literally persist those domain events as your data.
Transactional Outbox
We described the Transactional Outbox earlier in the DDD section. It's about ensuring that when you update your database, you also produce an event, without inconsistency. In practice:
• In a relational DB scenario: have an outbox table. When doing a transaction, insert into the business table and the outbox table together. Later, a process reads the outbox and publishes events. This is a well-known microservices pattern to avoid dual-write problems.
• In DynamoDB: a common trick is to write an item to a stream-enabled table and treat that as the outbox. For example, the Order service could write an "Order" item with a status, etc., and that change appears on DynamoDB Streams, which is effectively the outbox of events (the change events). You can configure a Lambda on the stream to filter for new orders or updated statuses and publish corresponding events to EventBridge or SNS. This way the DB write and the event are effectively coupled (the event comes from the DB change log, so you can't miss it unless the whole write failed, in which case nothing happened anyway). This is a nice approach because you don't even need a separate outbox table – the main table's stream is the outbox.
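A minimal sketch of that stream-to-bus publisher, assuming a hypothetical orders-bus and a table whose partition key pk holds the order ID:

```python
import json

import boto3

events = boto3.client("events")

def handler(event, context):
    """Attached to the Orders table's stream: turns change records into events."""
    entries = []
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # this sketch only publishes for newly created orders
        new_image = record["dynamodb"]["NewImage"]
        entries.append({
            "EventBusName": "orders-bus",
            "Source": "com.example.orders",
            "DetailType": "OrderPlaced",
            "Detail": json.dumps({"orderId": new_image["pk"]["S"]}),
        })
    # put_events accepts at most 10 entries per call, so send in chunks.
    for i in range(0, len(entries), 10):
        events.put_events(Entries=entries[i : i + 10])
```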
Also consider the inbox pattern: less talked about, it is the complement on the consumer side – ensuring a consumer processes each event exactly once even if it's delivered multiple times. This might involve the consumer service keeping track of processed event IDs (an "inbox" table of events it has seen). If a duplicate event arrives, the consumer checks and ignores it. Idempotency keys or event IDs help here. DynamoDB is great for implementing an idempotency-key check (do a conditional put-if-not-exists for a key = eventID; if it succeeds, process; if the key already exists, skip because it was already processed). AWS's idempotency best-practice guidance suggests similar approaches 52 53 .
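A minimal sketch of that inbox check, assuming a hypothetical processed-events table keyed by eventId (a production version would also record a status and a TTL so a crash between claiming and processing can be retried):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-events")

def process_once(event_id, do_work):
    try:
        # Atomically claim the event ID; fails if we've already seen it.
        table.put_item(
            Item={"eventId": event_id},
            ConditionExpression="attribute_not_exists(eventId)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery – skip
        raise
    do_work()
```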
Dead Letter Queues and Error Handling
Dead Letter Queues (DLQs) are a vital pattern for robust event-driven systems. Almost any messaging
service in AWS supports them: SQS queues can have a DLQ for messages that exceed retry attempts, SNS
can forward failed deliveries to a DLQ (for certain subscription types like Lambda), Lambda functions can be
configured with a DLQ or on-failure destination (so if a Lambda invoked asynchronously fails after retries,
the event goes to an SQS or SNS target you specify), and EventBridge also can send failed events to a DLQ
(for EventBridge, you set a dead-letter config for targets).
The purpose is to avoid losing events silently and to avoid poison-pill events blocking pipelines. For example, if one event consistently causes a consumer to crash, that message can get stuck (the queue keeps retrying it, never succeeding). With a DLQ configured after, say, 5 attempts, the message moves to the DLQ and the main queue can move on. You then need an alert on DLQ presence to investigate the bad event. That event might need a fix (maybe it had malformed data or exposed a bug).
Best Practice: Always configure DLQs (or on-failure destinations) for asynchronous processing. Then have a
process to monitor these (CloudWatch Alarms on number of messages in DLQ > 0). Have a plan to reprocess
DLQ messages after fixing the issue (maybe a Lambda that can requeue them or a script). Sometimes if the
error was due to a downstream outage, reprocessing later solves it.
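A bare-bones redrive script under hypothetical queue URLs (SQS also offers a built-in DLQ redrive, which is preferable where available; this sketch ignores message attributes for brevity):

```python
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"
MAIN_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"

while True:
    resp = sqs.receive_message(
        QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2
    )
    messages = resp.get("Messages", [])
    if not messages:
        break
    for msg in messages:
        # Re-enqueue to the main queue, then remove from the DLQ.
        sqs.send_message(QueueUrl=MAIN_URL, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```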
For streaming systems, a "DLQ" isn't automatic – e.g., Kafka doesn't have a DLQ built in, but you can program the consumer to catch exceptions and publish the bad record to a separate topic. For Kinesis + Lambda: if the Lambda errors on a batch, by default it retries continuously (and blocks that shard). But the Lambda event source mapping supports Bisect on Function Error and a maximum retry attempts setting, after which metadata about the failed batch can be sent to an on-failure destination such as an SQS queue. The net is: design for catching problematic events.
Monitoring and Alerting: It’s part of error handling. Use CloudWatch Logs, metrics like Lambda errors,
iterator age (for stream consumers), and so forth to keep an eye on the health of the event system. Also
consider using AWS X-Ray to trace events through multiple services, as mentioned, so you can pinpoint
where a failure in a chain occurred.
Idempotency
We've touched on idempotency multiple times because it is truly crucial. An operation is idempotent if performing it more than once has the same effect as performing it once. In event-driven systems, duplicates can occur:
• SNS may deliver a message to a Lambda twice (rare, but possible).
• SQS may occasionally deliver duplicates (also rare in standard queues, but at-least-once delivery means a non-zero chance).
• EventBridge could carry the same event twice if a retry occurred or a source service sent a duplicate.
• Kinesis or DynamoDB Streams consumers might see duplicates if a Lambda checkpoint didn't update due to a failure.
• Producers might accidentally publish the same event twice.
Thus, consumers should handle duplicates gracefully. Techniques:
• Use a unique event ID (like a GUID or a composite key). When processing, store that ID in a DynamoDB table or memory cache to track seen events; if seen before, skip.
• If the consumer is doing a database change, use conditional writes or idempotent updates. For example, if the event says "set order status to shipped" and it's already shipped, ignore the repeat.
• For processes like "send an email on event," include a unique campaign or message ID so you don't send twice, or check whether one was already sent for that event ID.
• AWS recommends idempotency tokens for API calls – many AWS APIs accept an idempotency token so you can retry without double effect 52 . In internal design, an idempotency key could simply be the event ID, order ID, or another unique value.
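As a sketch of the conditional-write technique, here is an idempotent status transition on a hypothetical orders table: the update only fires on the PAID → SHIPPED transition, so a replayed event is a no-op:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("orders")

def mark_shipped(order_id):
    try:
        table.update_item(
            Key={"orderId": order_id},
            UpdateExpression="SET #s = :shipped",
            ConditionExpression="#s = :paid",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":shipped": "SHIPPED", ":paid": "PAID"},
        )
    except ClientError as e:
        # A duplicate event finds the status already SHIPPED and is ignored.
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
```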
One also must consider exactly-once requirements – some systems try to enforce exactly-once delivery/
processing (like using FIFO queues with deduplication). But in distributed systems, exactly-once is tough;
often it’s easier to do at-least-once + idempotency to achieve effectively-once outcome 54 55 . As the Well-
Architected Framework notes, making operations idempotent simplifies error handling and recovery,
because retries won’t have side effects 56 .
Event Cataloging and Governance
On the AWS side, the Amazon EventBridge Schema Registry can act as a central registry of event schemas, including custom events. There's also the concept of event APIs – treating events as an interface. Some companies formalize their events in AsyncAPI specifications.
Governance patterns might also include:
• Access control: controlling who (which service/team) can publish or subscribe to certain events (EventBridge allows setting resource-based policies on event buses, etc.).
• Multi-tenant event buses: If you operate a platform, you might have separate event buses per tenant or per domain to isolate events.
• Versioning strategy: how do you evolve an event schema? Often by additive changes or new event types, rather than breaking changes.
• Intermediated vs. disintermediated: Some architectures use an event gateway or mesh (an event router layer) versus letting services talk point-to-point. EventBridge can be seen as an intermediary (central router). In contrast, if one service sends directly to another's queue (point-to-point), that's more disintermediated. Using a central bus for all events (like EventBridge) can enforce governance (all events go through a hub). The Solace patterns list mentions event mesh and event gateway as deployment patterns 57 – these basically ensure events can flow between systems even across network boundaries.
One advanced pattern: Event Mesh – using a network of brokers or buses to route events globally (e.g.,
across regions or data centers). AWS EventBridge can do cross-account fairly well, but cross-region requires
custom wiring (though you could in theory send from one region’s bus to another region via an HTTP target
or via a central routing layer). Some companies overlay Kafka clusters or Solace brokers in multiple sites to
form an event mesh. In AWS, one might use a combination of EventBridge and SNS with cross-region
replication (like an SNS topic can have an HTTPS subscription in another region’s API endpoint, or use
EventBridge’s new global endpoints for failover scenarios). This is quite advanced but worth knowing
conceptually – the idea is to treat event routing infrastructure as a mesh where any event from anywhere
can reach interested consumers elsewhere, governed by rules.
Combining Patterns and Final Thoughts
These patterns are not mutually exclusive. A single architecture may use many of them:
• Domain-driven design gives you boundaries and event definitions.
• Within a bounded context, you might use event sourcing for its data, and CQRS to separate how you query that data.
• Those events might be distributed via pub/sub to other contexts, using event-carried state transfer to keep them in sync.
• Some processes might be choreographed via those events; others orchestrated with Step Functions for reliability.
• SQS queues might be used for point-to-point tasks that are part of a bigger event flow (e.g., an event triggers a fan-out, and one target is an SQS queue that a worker processes).
• You ensure all your Lambdas are idempotent and have DLQs set, and you monitor with CloudWatch and X-Ray.
• You maintain an event catalog so everyone in the company knows what events exist and what they contain.
• You apply AWS Well-Architected principles throughout. For example: Security – don't put sensitive PII in events unnecessarily (if events go to many places, consider encryption or redaction); Reliability – use retries, DLQs, and idempotency to handle failures; Performance – design for scaling (SQS decouples throughput, Kinesis shards for parallelism, etc.) and avoid slow consumers blocking producers; Cost Optimization – choose the right service (e.g., an Express Step Function might be cheaper for short, high-volume workflows than Standard) and batch events where possible (Lambda can process a batch from SQS/Kinesis, amortizing cost); Operational Excellence – have good logging and tracing, and manage event rules with infrastructure as code; Sustainability – event-driven designs can be more energy-efficient by avoiding over-provisioning (serverless only uses resources when needed).
In the next sections, we’ll look at how all these patterns and practices align with AWS’s Well-Architected
Framework and then examine some real-world case studies that demonstrate these principles in action.
• Operational Excellence: This pillar is about running and monitoring systems and continually
improving processes. In an event-driven context, one challenge is observability of complex,
asynchronous flows. It’s crucial to have logging, metrics, and tracing for events. Use Amazon
CloudWatch to monitor things like queue lengths, Lambda invocations, error rates, throttles, etc.
Employ AWS X-Ray for tracing event paths (EventBridge integrates with X-Ray to propagate trace
headers, allowing end-to-end tracing across services 49 ). Automate as much as possible:
infrastructure as code (AWS SAM or CloudFormation to define event buses, rules, etc.), and CI/CD for
deploying changes to your event-driven components. For example, treat your EventBridge rules and
Lambda code as version-controlled artifacts. Another aspect is testability: incorporate testing for
your asynchronous flows (e.g., have integration tests that publish a test event and verify all expected
outcomes happened, possibly using X-Ray trace data or checking downstream states). Use DLQs and
alarms as mentioned to promptly detect issues (Operational Excellence means quickly knowing if
something’s wrong). Also implement event replay testing in lower environments – e.g., record a set
of production events (if allowed by data policy) and replay them in a staging environment to see how
the system copes, as a form of continuous resilience testing.
• Security: With decoupled events, ensure that access is controlled. For instance, an EventBridge bus
may receive events from many sources – use IAM policies to restrict who can put events and who can
subscribe. If events cross accounts, use Resource-based policies on EventBridge buses to allow only
specific accounts to send/receive 59 60 . Encrypt sensitive data – by default, services like SQS, SNS,
EventBridge encrypt data at rest (SQS/SNS with KMS optional). In transit, using the AWS services
ensures TLS. But if the event payload itself has sensitive fields (PII, secrets), consider encrypting
those fields at the application level or not including them. Another security aspect is principle of
least privilege for Lambdas consuming events – each consumer Lambda should have IAM
permissions only for the resources it needs (don’t give a broad wildcard policy just for convenience).
If you use AWS API Destinations (EventBridge feature to send events to external HTTP endpoints),
secure those endpoints and use proper auth (EventBridge API Destinations can handle auth for you
in a secure way, rather than embedding credentials in events 61 ). Also, validate events – just
because an event arrived doesn’t mean it’s safe or expected; use schema validation to guard against
malformed input that could break your logic or be abused.
• Reliability: Reliability is a big winner for event-driven architecture if done right. The decoupling via
SQS, SNS, etc., inherently adds resiliency (components can fail independently without bringing the
whole system down). However, you need to manage the retries and DLQs as described to handle
failures gracefully. One Well-Architected best practice is to make operations idempotent 52 54 so
that retries don’t cause bad side effects – we hammered this in earlier sections. Another reliability
tip: use a “fallback” path for unprocessed events. For example, if even after DLQ you want to
recover, consider building a retry handler function that monitors DLQ and maybe triggers a Step
Functions retry mechanism for those events in a controlled manner (with maybe manual approval if
needed for certain events). Also, design bus topologies for reliability: a single EventBridge bus for
an entire org might become a critical point (though managed by AWS to be HA). AWS suggests
patterns like multi-account buses to isolate blast radius 62 63 (for example, each account has a
local bus, with a central bus aggregating high-level events, etc.). Make sure to account for duplicate
event delivery (use message de-duplication if absolutely necessary via FIFO, or handle in logic). For
performance reliability, one notable thing from the Serverless Lens: EventBridge can introduce
latency, so if ultra-low latency is needed for reliability (like real-time trading system), consider SNS/
SQS which may be faster 48 . Use health monitoring – e.g., a heartbeat event or a synthetic test
event through the system to verify it’s working.
• Performance Efficiency: With serverless, performance usually means ensuring the system scales to
handle load and responds within acceptable times. Event-driven architectures excel at naturally
scaling: SNS, SQS, Lambda, etc., all scale automatically to a point. But you should be aware of limits
(concurrency limits on Lambdas, account limits on EventBridge throughput, etc.) and set those
appropriately. Use concurrency controls if needed: e.g., if a consumer can’t handle more than 100
events per second due to an external API limit, you might set a Lambda concurrency limit or use a
smaller batch size from SQS to throttle. Or better, design a buffer. Many event-driven systems
employ the “buffer and throttle” pattern: e.g., put events in SQS, then have a Lambda poll with a
reserved concurrency to ensure processing doesn’t exceed X concurrently – this smooths spikes and
protects downstream (a form of shock absorber 44 ). Performance also involves optimizing the
content of events – keep events lean (avoid huge payloads if not needed). If you have to send large
data (like images), use references as mentioned (S3 links), to not bog down the buses. Another
performance aspect: cold starts in Lambdas – in high-volume event systems, Lambdas may scale up
frequently; using languages like Java or .NET might incur bigger cold start penalties, so consider
lighter runtimes (such as Node.js or Python) for event handlers that need to be highly responsive, or keep them
warm with Provisioned Concurrency if needed (cost trade-off). Also leverage the managed services
for heavy lifting: e.g., if you need to filter events at high volume, EventBridge’s filtering is very
efficient; doing filtering in your own code might be less so.
• Cost Optimization: Serverless is pay-per-use, which generally aligns cost with usage nicely. But be mindful of event volumes: high throughput on EventBridge (charged per million events), SNS, or Lambda invocations can add up. Prefer filtering early to avoid excess processing. For example, if only 1% of events are relevant to a consumer, use an EventBridge rule or SNS filter to deliver only those, rather than sending 100% to a Lambda that drops 99% – that saves cost on Lambda invocations. Use batching where possible: Lambda triggers for SQS/Kinesis can batch multiple messages into one invocation, which is more cost-efficient than one invocation per message; the batch size can be tuned to balance latency against cost. Also consider the serverless data analytics options: for instance, if you have large volumes of streaming data to store, Kinesis Data Firehose can convert and compress it to S3 cheaply, rather than you doing it manually. Deleting unused event rules, queues, and so on is obvious but worth a housekeeping mention, and tools like AWS Cost Anomaly Detection can catch spikes in event-related cost. Finally, compare Standard vs. Express Step Functions for orchestrations: Express is much cheaper for high-volume, short executions, whereas Standard charges per state transition, which adds up if you have many of them (but Standard may be needed for long-running workflows or exactly-once execution semantics). Choose execution types wisely.
• Sustainability: This is the newest pillar. In a serverless EDA context, sustainability largely means doing the optimizations above – cost and performance improvements usually mean using fewer resources, and hence a lower carbon footprint. By decoupling and scaling only as needed, you avoid running big always-on servers, which is inherently more energy-efficient. Reducing data transfer and processing (via filtering and efficient event design) also reduces energy usage, so following best practices in the other pillars often supports sustainability. You can additionally design for intelligent scaling – for example, if certain events don't need immediate processing, aggregate them or schedule processing during off-peak hours where appropriate to improve utilization. Typically, though, serverless abstracts much of this for you.
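To make the "buffer and throttle" and batching points above concrete, here is a minimal sketch of a Python Lambda handler for an SQS trigger. The process_record helper and the queue contents are illustrative assumptions; reserved concurrency and ReportBatchItemFailures are set on the function and event source mapping (via your IaC tool), not in the handler itself.

import json

def process_record(payload: dict) -> None:
    """Hypothetical business logic for one event; replace with your own."""
    # e.g., call a rate-limited downstream API
    ...

def handler(event, context):
    """SQS-triggered handler using partial batch responses.

    With ReportBatchItemFailures enabled on the event source mapping,
    returning the failed message IDs makes Lambda retry only those
    messages, while reserved concurrency on the function caps how many
    workers run in parallel, protecting the downstream system.
    """
    failures = []
    for record in event["Records"]:
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Leave the failed message in the queue for retry (or DLQ).
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

Pairing a batch size tuned to your latency budget with a reserved concurrency limit is what turns SQS into the "shock absorber" described above.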
The AWS Well-Architected Serverless Lens specifically calls out many of these points. For example, it
emphasizes building services around business domains (which we covered via DDD) 64 65 , choosing the
right integration pattern (EventBridge vs SNS vs direct invocation) for reliability and latency 48 , and using
features like X-Ray for observability 49 . It also addresses multi-tenant considerations if you build an event-
driven SaaS (making sure one tenant’s events don’t overwhelm others – often solved by partitioning or
separate channels).
In conclusion, aligning your event-driven architecture with Well-Architected principles means making it
robust, secure, efficient, and transparent. AWS provides many knobs and tools (DLQs, tracing, etc.) – use
them to avoid common pitfalls. Periodically do a Well-Architected Review of your system; these patterns
we’ve covered will come up in such a review (e.g., “how do you ensure idempotency?” or “how do you isolate
failures?”). By proactively applying the patterns like idempotency, DLQs, least privilege, etc., you’ll satisfy
those concerns. An event-driven system that is not well-architected can degenerate into chaos (lost messages, hard-to-debug failures, overload), but one that is well-architected can be the backbone of a very resilient and agile enterprise system.
Applying the AWS Serverless Lens Best Practices
The AWS Serverless Lens is an extension of the Well-Architected Framework focusing on serverless-specific best practices. Much of it overlaps with what we've already discussed, but let's explicitly connect some dots to make the best practices stand out:
• Performance and Scaling (Serverless Lens): One scenario covered in the Serverless Lens is "event-driven architectures" 66 . It describes designing with event sources, routers, and destinations. One best practice: avoid situations where your event processing can't scale as fast as event ingestion. For example, using synchronous invocations (like a Lambda behind API Gateway) to fan out events could bottleneck; instead, use asynchronous services (SNS, EventBridge) that absorb sudden spikes by buffering. The lens even suggests weighing latency trade-offs, such as using SNS+SQS instead of EventBridge when latency is a major concern 48 . That is a best practice we hadn't explicitly stated as such: choose the event medium based on throughput and latency needs. EventBridge adds a bit of latency but offers rich routing; SQS/SNS may be leaner for raw speed.
• Observability (Serverless Lens): The lens emphasizes tracing and structured logging. For events, a useful tip is to include a consistent correlation ID in all logs and events for a given workflow (typically the orchestrator or initial producer generates one) and pass it along – EventBridge and SNS can carry it in the event detail or message attributes. This makes it easy to search logs for every piece of a transaction; see the sketch after this list. Also, use CloudWatch ServiceLens or X-Ray's service map to see how your Lambdas and other services interact.
• Event Schema and Discoverability: The Serverless Lens mentions using EventBridge schemas and code generation 12 . It's a best practice to document and enforce schemas, as we covered. There's even a dedicated AWS capability, the EventBridge Schema Registry, which can discover events on a bus and infer their schemas; from those you can generate type-safe code bindings for Java, Python, and other languages. This prevents subtle bugs (e.g., a consumer expecting a field the producer didn't send or spelled differently).
• Deployment and Versioning: A serverless best practice is to deploy changes safely – for example, if you change an event structure, use feature flags or parallel processing (old and new consumers co-existing) to avoid breaking things. Canary deployments for Lambda help ensure new event-handling code doesn't break consumers. And because events decouple services, you can often deploy one service at a time (just ensure new events are backward compatible or that old consumers ignore new fields).
• Testing and Automation: Use tools like the AWS SAM CLI to generate sample events and test Lambda logic offline, and consider Step Functions Local or SAM's integration-testing capabilities for exercising orchestrations.
• Cost (Serverless Lens): The lens often recommends monitoring high-volume functions for cost spikes. In event systems, one culprit can be a Lambda invoked once per message rather than in batches – if you see a very high invocation count with low average duration, consider increasing the batch size or moving filtering logic upstream so the function isn't called unnecessarily. Also consider a Compute Savings Plan if you have steady usage across Lambdas to reduce cost.
• Continuous Improvement: The lens encourages post-incident analysis – if an event was lost or a bug occurred, add a test or a monitor to catch that scenario. For example, if a particular type of event piled up in a DLQ because a consumer wasn't subscribed (someone forgot to deploy a rule), you might add a CloudWatch alarm to detect buses with a high number of unmatched events (EventBridge exposes metrics such as DeadLetterInvocations when a DLQ is configured, or you can log and analyze unprocessed events). At minimum, ensure every event type either has at least one consumer or is intentionally unconsumed.
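As one way to apply the observability advice above, here is a minimal sketch of propagating a correlation ID through EventBridge events so that every log line in a workflow can be searched by a single ID. The bus name, source, detail-type, and field names are our own illustrative assumptions.

import json
import uuid
import boto3

events = boto3.client("events")

def publish_order_placed(order: dict, correlation_id: str | None = None) -> str:
    """Publish a domain event, stamping a correlation ID into the detail.

    The initial producer generates the ID; each downstream consumer
    copies it from the incoming event into its own logs and any
    outgoing events.
    """
    correlation_id = correlation_id or str(uuid.uuid4())
    detail = {**order, "correlationId": correlation_id}
    events.put_events(Entries=[{
        "Source": "com.example.orders",    # illustrative source name
        "DetailType": "OrderPlaced",       # illustrative detail-type
        "EventBusName": "enterprise-bus",  # illustrative bus name
        "Detail": json.dumps(detail),
    }])
    return correlation_id

def consumer_handler(event, context):
    # An EventBridge-invoked Lambda: log the correlation ID on every line.
    correlation_id = event["detail"].get("correlationId", "unknown")
    print(json.dumps({"correlationId": correlation_id, "msg": "processing"}))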
In summary, apply the Serverless Lens by treating event-driven components with the same rigor as any code: plan for failure, instrument for visibility, minimize blast radius (isolate components), and optimize continuously. AWS Well-Architected reviews specifically ask things like "do you use DLQs?", "do you handle idempotency?", and "do you separate business logic from infrastructure logic in Lambdas?" For event-driven systems, an example of the last point: keep your Lambda handlers lightweight, focused on event processing, and delegate heavy lifting to libraries or separate modules – more of a coding best practice, but one that helps maintainability.
At this point, we have covered a wide range of patterns and best practices. To ground these in reality, let’s
look at a couple of brief case studies where AWS serverless event-driven architectures were implemented,
the patterns they used, and the outcomes achieved.
Case Study 1: City Electric Supply – Modernizing Inventory Management with DDD and EventBridge
City Electric Supply (CES), a large distributor, undertook a modernization of their inventory management
systems using an event-driven, serverless approach aligned with domain-driven design principles. They had
a legacy environment with various applications for products, warehouses, pricing, etc., that needed to
exchange data. The goal was to decouple these and enable real-time data sharing and scalability.
Architecture: CES defined clear business domains (Customer, Product & Price, Supplier, Sale, Warehouse,
etc.) – essentially bounded contexts for their enterprise. Each domain was implemented as a set of
microservices (using AWS Lambda, DynamoDB, etc.) and each had its own data. To connect them, they built
an Enterprise Event Bus using Amazon EventBridge 10 . Whenever something noteworthy happened in
one domain (e.g., a Price updated, or Stock level changed), a domain event would be published on the bus.
Other domains that needed that information subscribed via EventBridge rules filtering by event type or
source.
For example, when an item's price is updated in the Product domain, a ProductPriceUpdated event is put on EventBridge. The Sales domain's services have a rule that captures it and updates their pricing cache for sales orders. Similarly, if stock in a warehouse runs low, the Warehouse domain might emit a StockLow event, and the Purchasing domain listens for it to initiate re-ordering from suppliers.
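A minimal sketch of what publishing such a domain event might look like, along with the rule pattern a consuming domain could use. The bus name, source, and payload fields are our own illustrative assumptions, not CES's actual code.

import json
import boto3

events = boto3.client("events")

def publish_price_updated(sku: str, new_price: str) -> None:
    """Put a ProductPriceUpdated domain event on the enterprise bus."""
    events.put_events(Entries=[{
        "Source": "product",                     # producing domain
        "DetailType": "ProductPriceUpdated",
        "EventBusName": "enterprise-event-bus",  # illustrative name
        "Detail": json.dumps({"sku": sku, "newPrice": new_price}),
    }])

# The Sales domain's EventBridge rule could match on source and detail-type:
SALES_RULE_PATTERN = {
    "source": ["product"],
    "detail-type": ["ProductPriceUpdated"],
}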
They followed a single-bus, multi-account pattern: multiple AWS accounts (or separated environments) produce events to one central bus, which then fans out to consumers in various accounts 62 63 . This gave them centralized governance (the central bus acted like a managed ESB) while still isolating domain implementations.
Patterns Used: Domain-driven design was front and center – events were defined around business
language (like “order fulfilled” not “table XYZ row inserted”). They utilized pub/sub (EventBridge) to
decouple. For cross-team development, they set up an Event Catalog so each team could discover existing
events (avoiding duplication) and know who publishes what. Also, they implemented schema validation for
events as part of the publishing process to maintain quality 16 .
Benefits and Outcome: By moving to an event-driven model, CES achieved far more extensibility – new
systems can integrate simply by subscribing to events, without needing complex point-to-point
integrations. It also improved performance; previously, a change might require synchronous calls between
services or nightly batch jobs. Now events propagate changes near real-time. They reported that using fully
managed services like EventBridge and Lambda reduced their operational overhead for integration. Also, by
decoupling domains, each domain’s team could iterate faster (a change in Warehouse service doesn’t
require touching Sales service, as long as the event contract remains consistent). This autonomy echoes the
Well-Architected best practice of domain-oriented teams 7 8 .
City Electric’s story also underscored the importance of organizational buy-in for EDA: they established
cross-team standards for events and got leadership support to invest in refactoring around domains. It’s a
reminder that introducing these patterns isn’t just a technical effort but also a cultural one.
Case Study 2: WellRight – Scaling Bursty Workloads with Lambda and SQS
WellRight is a provider of corporate wellness programs. They faced a challenge with highly bursty and unpredictable workloads – e.g., a customer might upload a huge batch of user data, or a fitness challenge might suddenly generate millions of events (such as users logging steps). Their old monolithic system struggled with such bursts, leading to slow processing, and it was costly when over-provisioned for peaks 67 68 .
Architecture: They decided to break the monolith into microservices and use an event-driven approach to handle state changes asynchronously, using AWS Lambda and SQS extensively. For example, when a user logs progress in a wellness challenge, instead of the monolith handling everything synchronously (updating the DB, calculating rewards, sending emails), the new system would:
• Update a DynamoDB record for the user's progress.
• Let that DynamoDB update trigger a DynamoDB Stream event, which a Lambda consumes.
• Have the Lambda do some computation (e.g., recalculate the user's points) and then put a message on an SQS queue for further processing, or fan out other events like PointsUpdated (a minimal sketch of this step follows the list below).
• Have other Lambdas (triggered by SQS or EventBridge) handle things like sending a badge/notification if a milestone was hit, or updating team statistics.
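A minimal sketch of the middle step – our reconstruction of the flow described, not WellRight's code. The queue URL, attribute names, and points rule are illustrative assumptions.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/points-updates"  # illustrative

def stream_handler(event, context):
    """Triggered by the DynamoDB Stream on the progress table.

    Recomputes points for each changed item and fans the result out
    to an SQS queue for downstream processing (badges, team stats, ...).
    """
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]
        user_id = new_image["userId"]["S"]   # illustrative key names
        steps = int(new_image["steps"]["N"])
        points = steps // 100                # hypothetical points rule
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(
                {"type": "PointsUpdated", "userId": user_id, "points": points}
            ),
        )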
They essentially broke long chains of work into event-triggered tasks. Many parts used point-to-point
queues to buffer work. For instance, bulk data imports were processed by pushing records to SQS and
letting a fleet of Lambda functions work through them, rather than one server looping through and getting
overwhelmed.
They also used SNS+SQS (pub/sub) in places: e.g., when a user's status changes, the event is published to SNS so that multiple services (such as the reporting and notification services) receive it via their SQS subscriptions.
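Filtering at the SNS subscription keeps irrelevant traffic (and cost) away from each queue. A minimal sketch, with illustrative topic/queue ARNs and attribute names:

import json
import boto3

sns = boto3.client("sns")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:user-status"    # illustrative
REPORTING_Q = "arn:aws:sqs:us-east-1:123456789012:reporting-q"  # illustrative

# Subscribe the reporting queue, but only for the statuses it cares about.
sns.subscribe(
    TopicArn=TOPIC_ARN,
    Protocol="sqs",
    Endpoint=REPORTING_Q,
    Attributes={
        "FilterPolicy": json.dumps({"status": ["inactive", "churned"]})
    },
)

# Publish with a message attribute the filter policy can match on.
sns.publish(
    TopicArn=TOPIC_ARN,
    Message=json.dumps({"userId": "u-123", "status": "inactive"}),
    MessageAttributes={
        "status": {"DataType": "String", "StringValue": "inactive"}
    },
)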
Patterns Used: This case highlights event-carried state propagation (when DynamoDB changed, events propagated state to other parts such as caches and logs), queue-based decoupling (point-to-point SQS to absorb bursts), and an overall choreography style – many services reacting to changes as they occurred, rather than being centrally orchestrated. One might expect they used Step Functions for some workflows (an end-of-challenge process might orchestrate report generation, for example), though the blog focuses more on the asynchronous processing via events.
Importantly, they built with scale in mind: Lambdas can scale up to process SQS messages concurrently,
and DynamoDB scales for writes, so they achieved much higher throughput. In a load test, their serverless
system handled in 15 minutes what used to take hours on the monolith 45 , processing a spike of events
that previously would have swamped the old system. And it did so automatically – Lambda scaled to meet
demand, then scaled down, so they weren’t paying for idle capacity.
They also saw improved cost efficiency – by using event-driven invocation, they eliminated a lot of idle
compute. The blog cites a 70% cost reduction for a particular service after going serverless EDA 69 70 .
This ties to how pay-per-use and decoupling (so you can optimize each piece) can save money.
Resilience also improved: with SQS and Lambda, a giant burst simply queues up and gets processed without crashing the system, whereas before it might crash or backlog jobs on limited threads. The decoupled nature ensures one slow consumer doesn't take down everything – in the worst case, messages queue up and are dealt with slightly later.
Key Takeaways from WellRight: Embracing events allowed them to handle unpredictable loads gracefully.
They had to implement all the good practices – using DLQs to catch failures (since with high volume, some
functions might fail and need reprocessing), using idempotent processing to safely retry. For instance, if
two progress events for the same user come in near-simultaneously, their functions had to handle that
(maybe by locking or designing commutative updates). They likely used DynamoDB’s conditional updates or
atomic counters to manage concurrent updates without conflict (a minimal sketch follows below). This case underscores that event-driven architecture often shines in bursty, high-scale scenarios – it provides a natural way to buffer and distribute load.
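A minimal sketch of the kind of duplicate-safe update this implies, combining a DynamoDB conditional write (for de-duplication) with an atomic counter. The table and attribute names are illustrative assumptions.

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
dedup = dynamodb.Table("processed-events")  # illustrative dedup table
progress = dynamodb.Table("user-progress")  # illustrative data table

def apply_progress_event(user_id: str, event_id: str, points: int) -> bool:
    """Apply a progress event at most once per event ID."""
    try:
        # Record the event ID first; the condition fails if it was seen before.
        dedup.put_item(
            Item={"eventId": event_id},
            ConditionExpression="attribute_not_exists(eventId)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate delivery: safely ignore
        raise
    # Atomic counter: safe under concurrent updates for the same user.
    progress.update_item(
        Key={"userId": user_id},
        UpdateExpression="ADD totalPoints :p",
        ExpressionAttributeValues={":p": points},
    )
    return True

In production you would also want a TTL on the dedup items and a strategy for the small crash window between the two writes (e.g., DynamoDB's transactional write API), but the sketch shows the core idea: assume duplicates and make reprocessing a no-op.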
These case studies demonstrate in concrete terms how the patterns we discussed can be applied. City
Electric’s focus was more on enterprise integration and domain decoupling using an event bus. WellRight’s
focus was on scaling and resilience using asynchronous processing. Both leveraged AWS serverless services
heavily (Lambda, EventBridge, SNS, SQS, DynamoDB Streams), validating that these managed services can
handle production-grade scenarios.
By analyzing such examples, we see that success with EDA comes from combining patterns appropriately:
domain events + pub/sub for decoupling, queues for buffering, idempotency and DLQs for error handling,
and aligning the design to business needs (inventory updates, wellness events, etc. are modeled as events
in the language of those domains). The outcomes are faster development (teams not stepping on each
other), more robust systems (no single bottleneck, ability to handle failure gracefully), and often lower costs
(pay for what you use, shut down when idle).
Summary
In this chapter, we embarked on an in-depth exploration of architectural patterns for event-driven
systems on AWS serverless. We covered a broad landscape – from high-level design philosophies like
Domain-Driven Design, down to nitty-gritty best practices like dead-letter queues and idempotency. Let’s
recap the key points and takeaways:
• Domain-Driven Design (DDD) and Events: We saw that aligning services to business domains and
letting them communicate via domain events leads to a natural separation of concerns. Events
become the “language” between bounded contexts. AWS services like EventBridge and SNS enable
publishing and subscribing to these domain events easily, enabling a truly decoupled architecture.
The use of an enterprise event bus, as in our case study, can connect disparate domain services in a
loosely coupled way, facilitating agility and independent evolution of each domain.
• CQRS and Event Sourcing: By separating write models (commands) and read models (queries),
CQRS can improve scalability and performance. When combined with event sourcing, it offers
powerful auditability and the ability to reconstruct state from history. We saw how AWS DynamoDB
Streams plus Lambda can propagate changes to multiple read models (like Aurora and OpenSearch
in our example) 32 28 , and how SNS and SQS can be used to fan-out updates to these read models
reliably. CQRS brings complexity, but in high-scale systems it pays off by isolating read workload
from write workload.
• Event-Carried State Transfer: This pattern taught us to include necessary data within events to
avoid tight coupling. It improves autonomy of services at the cost of possible data duplication. We
learned that AWS event payload limits might require offloading large data to S3, but for most cases,
carrying essential fields in the event is doable and highly beneficial 40 . The principle is “don’t make
consumers chase the data – give it to them,” thereby reducing synchronous dependencies.
• Event Streaming: We examined how services like Kinesis and Kafka (MSK) allow for high-
throughput, ordered, persistent streams of events. This is crucial for analytics, IoT, and any scenario
where you want multiple consumers or replay capability. Although not every application needs a
streaming platform, knowing when to use one (e.g., when you require retention and independent
consumption of event streams) is part of an architect’s toolkit. Lambda’s integration with Kinesis
makes consuming streams serverless and straightforward, though one must pay attention to shard
scaling and consumer lag.
• Choreography vs. Orchestration: Choreography (each service reacting to events independently) keeps services loosely coupled and easy to extend, but complex processes can become hard to manage. Orchestration (a central coordinator like Step Functions driving the process) brings clarity and centralized control but introduces a coupling point. Depending on the complexity of the workflow (especially sagas requiring rollback), one may choose orchestration for critical flows and choreography for simpler, extensible ones. We saw how Step Functions can implement saga patterns and how a mix-and-match approach can yield the best of both worlds; the diagram earlier in the chapter highlighted the conceptual difference in interaction style.
• The Saga pattern ensures eventually consistent transactions across services by either completing all
or compensating on failures.
• Transactional Outbox guarantees that no data change goes out without an event (and vice versa),
solving the dual-write problem.
• Dead-letter queues, retries, and monitoring were emphasized as the safety net for any production event-driven system – never let errors or misplaced messages disappear silently. Use DLQs 71 and alarm on them (see the sketch after this list).
• Idempotency was a recurring theme: in a distributed, asynchronous world, assume duplicates and
design for duplicate-safe processing 56 . This simplifies error handling tremendously.
• We also touched on governance – schema registries, event catalogs, and access control – important
for maintaining order as the number of events and services grows in an organization.
• Case Studies: Through City Electric Supply and WellRight, we saw real implementations. CES
demonstrated how events and DDD can modernize an integration-heavy enterprise, making it more
decoupled and real-time. WellRight showed the power of serverless events in scaling to meet bursty
workloads and optimizing cost (70% cost reduction through serverless EDA 72 ). These stories
echoed the benefits of applying the patterns we learned: agility, scalability, resilience, and overall
better alignment of tech to business needs.
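As a concrete reminder of that safety net, here is a minimal sketch of attaching a DLQ to an SQS source queue and alarming as soon as anything lands in it. The queue names, maxReceiveCount, and alarm threshold are illustrative assumptions.

import json
import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After 5 failed receives, SQS moves the message to the DLQ instead of
# letting it retry (and fail) forever.
sqs.create_queue(
    QueueName="orders",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)

# Alarm as soon as a single message is parked in the DLQ.
cloudwatch.put_metric_alarm(
    AlarmName="orders-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-dlq"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
)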
In conclusion, mastering event-driven architecture on AWS is about understanding these patterns and
knowing when and how to apply them. It’s like having a toolbox: sometimes you use a hammer (SQS) for a
nail (point-to-point task), sometimes a saw (SNS) to cut across modules (pub/sub broadcast), or a level (Step
Functions) to ensure everything lines up in order (orchestration). And you also carry protective gear (DLQs,
monitoring) to handle the unexpected safely.
For cloud solution architects and senior engineers, the journey is to architect systems that not only meet
functional requirements but also are robust in the face of real-world conditions – variable load, partial
failures, rapid change. Event-driven patterns, especially with AWS serverless services, are proven ways to
achieve that robustness with elegance. They embrace asynchrony and decoupling, which are key to building
scalable distributed systems.
As you design your next system:
• Identify the domains and events that matter (think in events).
• Choose the right pattern for communication (direct vs. mediated, queue vs. topic, etc.).
• Ensure each component is independently scalable and failure-isolated (one failing shouldn't collapse the whole).
• Incorporate the best practices from the start (idempotency, validation, least privilege, etc.), so that as the system grows, it remains healthy and manageable.
Event-driven architecture is not a silver bullet, but it addresses many common problems in complex
systems. With AWS providing a rich serverless ecosystem to implement it – from Lambda, SNS, SQS,
EventBridge to Step Functions and Kinesis – architects have an unprecedented ability to compose powerful
architectures with minimal infrastructure management. The patterns and practices covered in this chapter
serve as a roadmap to navigate these possibilities.
By mastering these, you can design systems that gracefully handle change (both in code and in runtime events), scale effortlessly, and provide a responsive, real-time experience to users and integrators. In a world where businesses demand agility and resilience, event-driven serverless architectures are often the master key that unlocks those attributes.
Take these patterns, experiment with them in your AWS environments, and adapt them to your specific
context. The concepts will remain widely applicable even as services evolve. Whether you’re integrating
legacy systems or building cloud-native microservices, the principles of event-driven design will help you
create architectures that are robust by design.
With this knowledge, you are well-equipped to architect and implement sophisticated event-driven
solutions on AWS – solutions that are aligned with both the needs of the business and the best practices of
modern cloud architecture.
2 3 Best practices for implementing event-driven architectures in your organization | AWS Architecture Blog
[Link]
13 14 16 17 41 42 62 63 Event Driven Architecture using Amazon EventBridge – Part 1 | AWS Cloud Operations Blog
[Link]