
Event-Driven Architecture: Tradeoffs You Need to Accept Before You Commit

Event-driven architecture decouples services and enables scale, but it moves complexity from runtime coupling to data consistency and operational observability.


Two different things called event-driven

Before discussing tradeoffs, you need to know which pattern you are using, because they are fundamentally different:

Event notification: services emit events to signal that something happened. Other services subscribe and react. No one knows who is listening, and the producer does not wait for a response. This is message-based loose coupling — services interact through a broker (Kafka, RabbitMQ, SNS) instead of direct calls.
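To make the decoupling concrete, here is a minimal in-memory sketch of the pattern. The `Broker` class and its `subscribe`/`publish` methods are illustrative stand-ins for a real broker client, not any library's API:

```python
# Event notification sketch: the producer publishes to a broker and never
# learns who consumes. Fire-and-forget, no response expected.
from collections import defaultdict

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer does not know or care who is listening.
        for handler in self._subscribers[topic]:
            handler(event)

broker = Broker()
broker.subscribe("orders", lambda e: print("inventory reacts to", e["type"]))
broker.subscribe("orders", lambda e: print("email reacts to", e["type"]))
broker.publish("orders", {"type": "OrderPlaced", "order_id": 123})
```

Note that adding a third subscriber requires no change to the publishing side — that is the decoupling the pattern buys you.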

Event sourcing: the state of an entity is derived entirely from its event history. Instead of storing the current state (order.status = shipped), you store every event that led to it (OrderPlaced, PaymentConfirmed, ItemPicked, Dispatched). Current state is reconstructed by replaying events.
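A minimal sketch of that replay, using the event names from the example above (the status mapping is illustrative):

```python
# Event sourcing sketch: current state is a fold over the event history.
EVENTS = [
    {"type": "OrderPlaced"},
    {"type": "PaymentConfirmed"},
    {"type": "ItemPicked"},
    {"type": "Dispatched"},
]

STATUS_BY_EVENT = {
    "OrderPlaced": "placed",
    "PaymentConfirmed": "paid",
    "ItemPicked": "picked",
    "Dispatched": "shipped",
}

def replay(events):
    # No stored order.status anywhere: state exists only as the
    # result of replaying the history.
    state = {"status": None, "history": []}
    for event in events:
        state["status"] = STATUS_BY_EVENT[event["type"]]
        state["history"].append(event["type"])
    return state

assert replay(EVENTS)["status"] == "shipped"
```

The same history can be replayed into different projections — that flexibility is the core appeal, and the source of most of the complexity discussed below.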

These are often conflated. You can use event notification without event sourcing. Event sourcing almost always involves event notification, but it adds far more complexity. Choose them independently.

Event notification: the benefits and the traps

Benefits:

  • Services are decoupled. The order service does not know that the inventory, email, and analytics services exist.
  • New consumers can subscribe without modifying the producer.
  • Natural audit log — the event stream is a record of what happened.
  • Temporal decoupling — consumers process at their own pace.

Traps:

Choreography complexity. With direct calls, you can read the code and trace what happens when an order is placed. With event-driven choreography, the flow is distributed across multiple services and topics. Debugging a failed order requires reading logs from five services and correlating by trace ID.

Schema coupling. Services are decoupled at runtime but tightly coupled at the schema level. A producer that renames an event field breaks every consumer. Treat schema registries (Confluent, Apicurio) and backward-compatible schema evolution as first-class concerns from day one.
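One defensive habit that pairs well with a schema registry is the tolerant-reader consumer: ignore unknown fields, and give removed-or-new fields a default so additive, backward-compatible changes do not break you. A sketch, with illustrative field names:

```python
# Tolerant-reader sketch: only truly required fields fail loudly;
# fields added in later schema versions fall back to a default.
def handle_order_confirmed(event: dict) -> str:
    order_id = event["order_id"]             # required: crash if absent
    currency = event.get("currency", "USD")  # added in a later version: default for old events
    # Unknown extra fields in `event` are simply ignored.
    return f"confirm order {order_id} in {currency}"

assert handle_order_confirmed({"order_id": 1}) == "confirm order 1 in USD"
```

This does not save you from a renamed field — nothing does, which is why compatibility checks belong in the registry, not in consumer goodwill.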

No transactional boundary. An event is emitted after a database write, but the write and the emit are not atomic. The service could crash between them. This is where the outbox pattern comes in.

The outbox pattern

Write the event to an outbox table in the same database transaction as the business operation. A separate process polls the outbox and publishes events to the broker, then marks them as published.

BEGIN;
  UPDATE orders SET status = 'confirmed' WHERE id = ?;
  INSERT INTO outbox (event_type, payload, created_at)
    VALUES ('OrderConfirmed', '{"order_id": 123}', NOW());
COMMIT;

The outbox relay publishes OrderConfirmed to Kafka and deletes the outbox row (or marks it sent). Even if the service crashes after the database commit, the event is not lost — the relay will retry on restart.
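A sketch of that relay loop, using SQLite in place of a production database and a plain callable in place of a real broker client (table and column names mirror the SQL above):

```python
# Outbox relay sketch: poll unpublished rows, publish, then mark as sent.
# A crash between publish() and the UPDATE means the row is republished
# on restart -- at-least-once delivery, never silent loss.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY,
    event_type TEXT,
    payload TEXT,
    published INTEGER DEFAULT 0)""")
db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
           ("OrderConfirmed", json.dumps({"order_id": 123})))
db.commit()

def relay_once(conn, publish):
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # may happen twice after a crash
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        conn.commit()

relay_once(db, lambda t, p: print("published", t, p))
```

In practice you would run this loop on a timer or use change-data-capture tooling such as Debezium's outbox support instead of hand-rolling the poller.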

This guarantees at-least-once delivery. Consumers must be idempotent: processing the same event twice should produce the same result as processing it once.
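The standard way to get that idempotency is to give every event a unique ID and record which IDs you have already applied. A sketch — in production the processed-ID set lives in the consumer's database, in the same transaction as the state change:

```python
# Idempotent consumer sketch: a redelivered event is a no-op because
# its ID has already been recorded.
processed_ids = set()
inventory = {"sku-1": 10}

def handle_item_reserved(event):
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: already applied
    inventory[event["sku"]] -= event["qty"]
    processed_ids.add(event["event_id"])

event = {"event_id": "evt-42", "sku": "sku-1", "qty": 3}
handle_item_reserved(event)
handle_item_reserved(event)  # redelivered: no double-decrement
assert inventory["sku-1"] == 7
```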

Event sourcing: when it is worth it

Event sourcing is the right choice when:

  • Audit history is a business requirement, not a nice-to-have. Financial transactions, healthcare records, regulatory domains.
  • Temporal queries are needed. "What was the state of this account at 3pm last Tuesday?"
  • Event replay has value. You can rebuild projections, fix bugs by replaying with corrected logic, or create new read models without changing the event history.

It is not worth it when you just want loose coupling between services. Use event notification for decoupling; reserve event sourcing for the domains where the history itself is the source of truth.

The cost: you must maintain an event store (EventStoreDB, Kafka with compaction, or a purpose-built append-only table). You must manage schema evolution for events that may need to be replayed years later. You must handle snapshot strategies for entities with long event histories (replaying 50,000 events to get current state is expensive without periodic snapshots).
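The snapshot idea is simple even if the operational details are not: persist the derived state at some version, then replay only the events after it. A sketch with a deliberately trivial `apply` function (all names and numbers are illustrative):

```python
# Snapshot sketch: rebuild from (snapshot, version) plus the event tail,
# instead of replaying the entire history every time.
def apply(state, event):
    return state + event["delta"]

def rebuild(snapshot_state, snapshot_version, events):
    state = snapshot_state
    for event in events[snapshot_version:]:  # only the tail after the snapshot
        state = apply(state, event)
    return state

events = [{"delta": 1}] * 50_000
full = rebuild(0, 0, events)             # replays all 50,000 events
snap = rebuild(0, 0, events[:49_000])    # snapshot taken at version 49,000
fast = rebuild(snap, 49_000, events)     # replays only the last 1,000
assert full == fast == 50_000
```

The tradeoff is that snapshots are caches of derived state: if you fix a bug in the fold logic, snapshots built with the old logic must be invalidated and rebuilt.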

The saga pattern for distributed transactions

When a business operation spans multiple services (create order → reserve inventory → charge payment → dispatch), you cannot use a database transaction. The saga pattern coordinates the steps as a sequence of local transactions, each publishing an event that triggers the next step.

If a step fails, compensating transactions undo the previous steps:

OrderCreated → InventoryReserved → PaymentCharged → OrderDispatched
                                  ↓ (payment fails)
                          InventoryReleased → OrderCancelled

Choreography saga: each service listens for events and emits the next event. No central coordinator. Simple to implement, hard to understand as it grows.

Orchestration saga: a saga orchestrator sends commands to services and waits for responses. The flow is explicit and readable. Easier to monitor, debug, and add timeouts. Preferred for complex, long-running sagas.
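The orchestrator's core loop fits in a few lines: run each local step, remember its compensation, and on failure unwind the completed steps in reverse order. A sketch with illustrative step names (a real orchestrator would persist saga state and send commands over the broker, not call functions directly):

```python
# Orchestration saga sketch: each step is (name, action, compensation).
def run_saga(steps):
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception:
            for _, undo in reversed(completed):
                undo()  # compensating transaction for an already-committed step
            return "cancelled"
    return "completed"

def fail(msg):
    raise RuntimeError(msg)

log = []
steps = [
    ("reserve_inventory", lambda: log.append("reserved"),
                          lambda: log.append("released")),
    ("charge_payment",    lambda: fail("card declined"),
                          lambda: log.append("refunded")),
]
assert run_saga(steps) == "cancelled"
assert log == ["reserved", "released"]  # inventory compensated, payment never charged
```

Note that compensations are not rollbacks: the inventory reservation really was committed, and "release" is a new business operation that undoes its effect.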

Observability is not optional

Event-driven systems require distributed tracing. Without it, debugging a failure means manually correlating log timestamps across five services. Invest in:

  • Trace propagation: pass a correlation ID (or OpenTelemetry trace context) in every event header.
  • Dead letter queues: failed events must go somewhere visible, not disappear.
  • Consumer lag monitoring: Kafka consumer group lag tells you whether a consumer is falling behind before users notice.
  • Idempotency logging: log duplicate events so you can audit whether your idempotency logic is working.
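The first item is the one teams most often skip. The rule is simple: the service that starts a flow mints a trace ID, and every event emitted downstream copies the ID from the event it is reacting to. A sketch using `uuid4` as a stand-in for OpenTelemetry trace context:

```python
# Trace propagation sketch: a correlation ID minted at the start of a
# flow is copied into every downstream event's headers, so logs from
# all services can be joined on one ID.
import uuid

def new_event(event_type, payload, parent=None):
    trace_id = parent["headers"]["trace_id"] if parent else str(uuid.uuid4())
    return {"type": event_type, "headers": {"trace_id": trace_id}, "payload": payload}

placed = new_event("OrderPlaced", {"order_id": 123})
reserved = new_event("InventoryReserved", {"order_id": 123}, parent=placed)
assert reserved["headers"]["trace_id"] == placed["headers"]["trace_id"]
```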

The decision

Choose event-driven architecture when the services have genuinely different operational requirements (different scale, different teams, different release cadences) and the decoupling benefit justifies the observability cost.

Avoid it when you are prematurely optimizing a monolith that has not hit its limits yet. The operational overhead of event-driven systems is real — start simple and migrate specific bounded contexts to events when the coupling becomes a delivery problem.