Localization looks simple on a whiteboard. A string goes out, a translation comes back, and the UI updates. The reality is a chain of steps with multiple teams, external vendors, and release deadlines that do not move. When I joined the localization platform work for Riot titles, the pain was not in one broken service. It was in the coupling between services and people.
The pipeline had to ingest thousands of translation units daily, sync with vendors, push updates to tools, and report status to producers. When release season hit, a single slow step would block the entire chain. We were reliable most of the time, but the failures hurt when they mattered most.
What was breaking
The original flow was synchronous. One service called another, waited, and passed the baton. That meant:
- A single vendor delay could stall upstream processing
- Retries amplified load instead of smoothing it
- We had limited visibility into which step was actually stuck
We could scale the services, but scaling the bottleneck was not enough. The problem was in the shape of the workflow, not just in the CPU graphs.
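The shape problem can be sketched in a few lines. This is a toy model, not our actual services: each stage calls the next and blocks until it returns, so total latency is the sum of every hop, and one slow vendor call stalls everything behind it.

```python
import time

def sync_vendor_call(unit: str) -> str:
    # Simulated vendor latency; during release season this grew,
    # and every caller upstream sat blocked waiting on it.
    time.sleep(0.01)
    return f"translated:{unit}"

def process_batch(units: list[str]) -> list[str]:
    # Total time is the sum of every call. A retry here does not
    # smooth load; it doubles the pressure on an already slow vendor.
    return [sync_vendor_call(u) for u in units]

results = process_batch(["menu.play", "menu.quit"])
```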
The switch: events with clear contracts
We moved to an event-driven model using Kafka and SQS. The most important part was not the technology. It was the contract. We documented message types, payload shapes, and the responsibility of each consumer. That clarity gave us two wins: the freedom to scale each stage independently and a shared language across teams.
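A contract like that can be made concrete as a typed, versioned message. The field names below are illustrative, not our actual schema; the point is that every producer and consumer agrees on the shape in code, not in a wiki page.

```python
from dataclasses import dataclass, asdict, field
import json
import uuid

@dataclass(frozen=True)
class TranslationJobRequested:
    """One event type in the contract: emitted after validation,
    consumed by the vendor-sync stage. Fields are illustrative."""
    asset_id: str
    language: str
    asset_type: str
    # Correlation id lets any stage trace this asset end to end.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Schema version, so consumers can evolve without breaking producers.
    version: int = 1

    def to_json(self) -> str:
        return json.dumps(asdict(self))

event = TranslationJobRequested(
    asset_id="champ_dialog_042", language="ko-KR", asset_type="voiceover"
)
payload = json.loads(event.to_json())
```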
A typical lifecycle became:
- Content ingested and validated
- Jobs emitted as events per language and asset type
- Vendors synchronized asynchronously
- Status and progress events published back to UI dashboards
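The fan-out in the second step is the key structural move: one ingested asset becomes many independent jobs. A minimal sketch, with hypothetical names:

```python
from itertools import product

def emit_jobs(asset_id: str, languages: list[str],
              asset_types: list[str]) -> list[dict]:
    # One event per (language, asset type) pair, so each job can fail,
    # retry, and scale independently of its siblings.
    return [
        {"type": "TranslationJobRequested", "asset_id": asset_id,
         "language": lang, "asset_type": at}
        for lang, at in product(languages, asset_types)
    ]

jobs = emit_jobs("patch_strings", ["de-DE", "ja-JP"], ["ui", "subtitle"])
```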
That change made every step observable and recoverable. If a vendor job failed, we could replay that message instead of replaying the entire chain.
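Replay is then a targeted operation on one queue, not a rerun of the pipeline. The sketch below uses in-memory lists as stand-ins for SQS queues, with illustrative field names:

```python
def replay_failed(dead_letter: list[dict], source_queue: list[dict],
                  vendor: str) -> int:
    """Re-enqueue only the messages for one failed vendor,
    leaving the rest of the chain untouched."""
    to_replay = [m for m in dead_letter if m["vendor"] == vendor]
    for msg in to_replay:
        dead_letter.remove(msg)
        # Mark the replay so downstream metrics can distinguish it.
        source_queue.append({**msg, "replayed": True})
    return len(to_replay)

dlq = [{"vendor": "vendor_a", "asset_id": "a1"},
       {"vendor": "vendor_b", "asset_id": "b1"}]
queue: list[dict] = []
count = replay_failed(dlq, queue, "vendor_a")
```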
Reliability was the real goal
Once everything was evented, we focused on the boring details that keep a platform stable:
- Idempotent consumers so replays were safe
- Dead-letter queues with explicit recovery paths
- Correlation IDs so we could trace a single asset across the system
- Clear SLIs for throughput, latency, and backlog depth
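The first two items fit together in one consumer loop. This is a sketch of the discipline, not our production code: dedupe on message id so replays are no-ops, and dead-letter after a bounded number of attempts so poison messages get an explicit recovery path instead of retrying forever.

```python
class IdempotentConsumer:
    """Toy consumer: replay-safe via dedup, DLQ after MAX_ATTEMPTS.
    All names are illustrative."""
    MAX_ATTEMPTS = 3

    def __init__(self) -> None:
        self.seen: set[str] = set()       # in production, a shared store
        self.attempts: dict[str, int] = {}
        self.dead_letter: list[dict] = []
        self.processed: list[dict] = []

    def handle(self, msg: dict) -> str:
        msg_id = msg["message_id"]
        if msg_id in self.seen:
            return "duplicate"            # replay-safe: no side effects
        self.attempts[msg_id] = self.attempts.get(msg_id, 0) + 1
        try:
            self._process(msg)
        except ValueError:
            if self.attempts[msg_id] >= self.MAX_ATTEMPTS:
                self.dead_letter.append(msg)   # explicit recovery path
                return "dead_lettered"
            return "retry"
        self.seen.add(msg_id)
        self.processed.append(msg)
        return "processed"

    def _process(self, msg: dict) -> None:
        # Correlation id in the failure path, so one asset is traceable.
        if msg.get("poison"):
            raise ValueError(f"bad payload, cid={msg['correlation_id']}")

c = IdempotentConsumer()
ok = {"message_id": "m1", "correlation_id": "c1"}
bad = {"message_id": "m2", "correlation_id": "c2", "poison": True}
```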
We also invested in observability. We wired Datadog and SonarCloud into the pipeline, built dashboards per stage, and wrote alerts for backlog growth rather than error spikes. That changed how we responded to incidents. We stopped reacting at the end and started seeing the stress early.
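The backlog-growth idea is simple enough to show directly. A minimal sketch, with an invented threshold: alert when queue depth is trending up over a window, not when any single sample is bad. Noisy spikes that self-correct stay quiet; sustained growth fires early, before the chain is saturated.

```python
def backlog_alert(depths: list[int], window: int = 3,
                  slope_threshold: float = 50.0) -> bool:
    """Fire when average per-interval backlog growth over the last
    `window` intervals exceeds the threshold. Values are illustrative."""
    if len(depths) < window + 1:
        return False                      # not enough samples yet
    recent = depths[-(window + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    return sum(deltas) / len(deltas) > slope_threshold

# A spike that self-corrects vs. a queue that is steadily filling up.
noisy = [100, 180, 90, 110]
growing = [100, 200, 320, 460]
```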
The outcome
The practical impact was obvious during release windows. The pipeline absorbed spikes without cascading failures across the stack. That shift cut turnaround time by roughly 30% and reduced critical incidents by about 40% in the period after the redesign. The team also trusted the system more, which mattered as much as any metric.
What I learned
Event-driven design is not magic. It is a discipline. It forces you to be explicit about what a service owns, what it publishes, and how it fails. If you do not define those boundaries, the events become another form of coupling.
If I had to summarize the approach in one line: make every step visible, make every message replayable, and let each stage scale on its own terms. That is what lets localization feel calm even when the releases are not.