Localization looks simple on a whiteboard. A string goes out, a translation comes back, and the UI updates. The reality is a chain of steps with multiple teams, external vendors, and release deadlines that do not move. When I joined the localization platform work for Riot titles, the pain was not in one broken service. It was in the coupling between services and people.
The pipeline had to ingest thousands of translation units daily, sync with vendors, push updates to tools, and report status to producers. When release season hit, a single slow step would block the entire chain. We were reliable most of the time, but the failures hurt when they mattered most.
What was breaking
The original flow was synchronous. One service called another, waited, and passed the baton. That meant:
- A single vendor delay could stall upstream processing
- Retries amplified load instead of smoothing it
- We had limited visibility into which step was actually stuck
We could scale the services, but scaling the bottleneck was not enough. The problem was in the shape of the workflow, not just in the CPU graphs.
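The shape problem can be sketched in a few lines. This is a toy model, not our actual services: each stage calls the next and blocks until it returns, so total latency is the sum of every hop, and one slow vendor call stalls everything behind it.

```python
import time

def sync_vendor_call(unit: str) -> str:
    # Simulated vendor latency; during release season this grew,
    # and every caller upstream sat blocked waiting on it.
    time.sleep(0.01)
    return f"translated:{unit}"

def process_batch(units: list[str]) -> list[str]:
    # Total time is the sum of every call. A retry here does not
    # smooth load; it doubles the pressure on an already slow vendor.
    return [sync_vendor_call(u) for u in units]

results = process_batch(["menu.play", "menu.quit"])
```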
The switch: events with clear contracts
We moved to an event-driven model using Kafka and SQS. The most important part was not the technology. It was the contract. We documented message types, payload shapes, and the responsibility of each consumer. That clarity gave us two wins: the freedom to scale each stage independently and a shared language across teams.
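A contract like that can be made concrete as a typed, versioned message. The field names below are illustrative, not our actual schema; the point is that every producer and consumer agrees on the shape in code, not in a wiki page.

```python
from dataclasses import dataclass, asdict, field
import json
import uuid

@dataclass(frozen=True)
class TranslationJobRequested:
    """One event type in the contract: emitted after validation,
    consumed by the vendor-sync stage. Fields are illustrative."""
    asset_id: str
    language: str
    asset_type: str
    # Correlation id lets any stage trace this asset end to end.
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Schema version, so consumers can evolve without breaking producers.
    version: int = 1

    def to_json(self) -> str:
        return json.dumps(asdict(self))

event = TranslationJobRequested(
    asset_id="champ_dialog_042", language="ko-KR", asset_type="voiceover"
)
payload = json.loads(event.to_json())
```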
A typical lifecycle became:
- Content ingested and validated
- Jobs emitted as events per language and asset type
- Vendors synchronized asynchronously
- Status and progress events published back to UI dashboards
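The fan-out in the second step is the key structural move: one ingested asset becomes many independent jobs. A minimal sketch, with hypothetical names:

```python
from itertools import product

def emit_jobs(asset_id: str, languages: list[str],
              asset_types: list[str]) -> list[dict]:
    # One event per (language, asset type) pair, so each job can fail,
    # retry, and scale independently of its siblings.
    return [
        {"type": "TranslationJobRequested", "asset_id": asset_id,
         "language": lang, "asset_type": at}
        for lang, at in product(languages, asset_types)
    ]

jobs = emit_jobs("patch_strings", ["de-DE", "ja-JP"], ["ui", "subtitle"])
```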
That change made every step observable and recoverable. If a vendor job failed, we could replay that message instead of replaying the entire chain.
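Replay is then a targeted operation on one queue, not a rerun of the pipeline. The sketch below uses in-memory lists as stand-ins for SQS queues, with illustrative field names:

```python
def replay_failed(dead_letter: list[dict], source_queue: list[dict],
                  vendor: str) -> int:
    """Re-enqueue only the messages for one failed vendor,
    leaving the rest of the chain untouched."""
    to_replay = [m for m in dead_letter if m["vendor"] == vendor]
    for msg in to_replay:
        dead_letter.remove(msg)
        # Mark the replay so downstream metrics can distinguish it.
        source_queue.append({**msg, "replayed": True})
    return len(to_replay)

dlq = [{"vendor": "vendor_a", "asset_id": "a1"},
       {"vendor": "vendor_b", "asset_id": "b1"}]
queue: list[dict] = []
count = replay_failed(dlq, queue, "vendor_a")
```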
Reliability was the real goal
Once everything was evented, we focused on the boring details that keep a platform stable:
- Idempotent consumers so replays were safe
- Dead-letter queues with explicit recovery paths
- Correlation IDs so we could trace a single asset across the system
- Clear SLIs for throughput, latency, and backlog depth
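The first two items fit together in one consumer loop. This is a sketch of the discipline, not our production code: dedupe on message id so replays are no-ops, and dead-letter after a bounded number of attempts so poison messages get an explicit recovery path instead of retrying forever.

```python
class IdempotentConsumer:
    """Toy consumer: replay-safe via dedup, DLQ after MAX_ATTEMPTS.
    All names are illustrative."""
    MAX_ATTEMPTS = 3

    def __init__(self) -> None:
        self.seen: set[str] = set()       # in production, a shared store
        self.attempts: dict[str, int] = {}
        self.dead_letter: list[dict] = []
        self.processed: list[dict] = []

    def handle(self, msg: dict) -> str:
        msg_id = msg["message_id"]
        if msg_id in self.seen:
            return "duplicate"            # replay-safe: no side effects
        self.attempts[msg_id] = self.attempts.get(msg_id, 0) + 1
        try:
            self._process(msg)
        except ValueError:
            if self.attempts[msg_id] >= self.MAX_ATTEMPTS:
                self.dead_letter.append(msg)   # explicit recovery path
                return "dead_lettered"
            return "retry"
        self.seen.add(msg_id)
        self.processed.append(msg)
        return "processed"

    def _process(self, msg: dict) -> None:
        # Correlation id in the failure path, so one asset is traceable.
        if msg.get("poison"):
            raise ValueError(f"bad payload, cid={msg['correlation_id']}")

c = IdempotentConsumer()
ok = {"message_id": "m1", "correlation_id": "c1"}
bad = {"message_id": "m2", "correlation_id": "c2", "poison": True}
```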
We also invested in observability. We wired Datadog and SonarCloud into the pipeline, built dashboards per stage, and wrote alerts for backlog growth rather than error spikes. That changed how we responded to incidents. We stopped reacting at the end and started seeing the stress early.
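The backlog-growth idea is simple enough to show directly. A minimal sketch, with an invented threshold: alert when queue depth is trending up over a window, not when any single sample is bad. Noisy spikes that self-correct stay quiet; sustained growth fires early, before the chain is saturated.

```python
def backlog_alert(depths: list[int], window: int = 3,
                  slope_threshold: float = 50.0) -> bool:
    """Fire when average per-interval backlog growth over the last
    `window` intervals exceeds the threshold. Values are illustrative."""
    if len(depths) < window + 1:
        return False                      # not enough samples yet
    recent = depths[-(window + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    return sum(deltas) / len(deltas) > slope_threshold

# A spike that self-corrects vs. a queue that is steadily filling up.
noisy = [100, 180, 90, 110]
growing = [100, 200, 320, 460]
```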
The outcome
The practical impact was obvious during release windows. The pipeline absorbed spikes without cascading failures across the stack. That shift cut turnaround time by roughly 30% and reduced critical incidents by about 40% in the period after the redesign. The team also trusted the system more, which mattered as much as any metric.
What I learned
Event-driven design is not magic. It is a discipline. It forces you to be explicit about what a service owns, what it publishes, and how it fails. If you do not define those boundaries, the events become another form of coupling.
If I had to summarize the approach in one line: make every step visible, make every message replayable, and let each stage scale on its own terms. That is what lets localization feel calm even when the releases are not.