How Superhuman and Databricks built a 200K QPS inference platform together | AI BriefWire

Superhuman began serving its grammar correction model via Databricks’ Foundation Model API, handling more than 200K QPS with p99 latency under 1 second. Through a close engineering partnership, the two teams optimized runtime performance to deliver a 60% throughput gain while maintaining four nines (99.99%) of availability.

Superhuman, the productivity platform that includes Superhuman, Coda, Superhuman Mail and Superhuman Go, serves over 40 million daily users across dozens of languages. Superhuman's AI communication assistance provides real-time suggestions for correctness, clarity, tone, and style across every surface where people write.

Databricks and Superhuman have been partners for years. The Superhuman team has historically used the Databricks Data Intelligence Platform as the foundation for analytics. But analytics was only half the picture.

Behind many of Superhuman’s real-time suggestions is a highly sophisticated custom AI model served at massive scale. Superhuman runs this model at peak traffic of over 200,000 queries per second, with end-to-end latency under 1 second at p99 and strict four-nines (99.99%) reliability guarantees. Superhuman modernized its serving stack for large language models by adopting Databricks model serving, which required a new kind of partnership built on joint product and engineering work.
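To make the availability target above concrete, here is a minimal sketch of the arithmetic, assuming "four nines" means 99.99% uptime measured over a fixed window. The function name and the 30-day and 365-day windows are illustrative choices, not details from the article.

```python
def downtime_budget_minutes(availability: float, window_days: int) -> float:
    """Return the minutes of downtime allowed in a window at a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    minutes_in_window = window_days * 24 * 60
    return (1.0 - availability) * minutes_in_window


FOUR_NINES = 0.9999  # 99.99%, the target named in the article

# Assumed measurement windows (hypothetical; the article does not specify one).
monthly_budget = downtime_budget_minutes(FOUR_NINES, 30)
yearly_budget = downtime_budget_minutes(FOUR_NINES, 365)

print(f"30-day budget:  {monthly_budget:.2f} minutes")
print(f"365-day budget: {yearly_budget:.2f} minutes")
```

Under these assumptions the budget comes to a little over four minutes per 30 days, which leaves almost no room for manual intervention during an incident.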

Before this migration, Superhuman operated a DIY serving stack built on vLLM, alongside internal tools for training and model management. An internal ML infrastructure team maintained this stack, which supported massive scale, but several pain points were compounding as the team served large language models.

The custom large language model powers grammatical error correction at enormous volume: a peak of more than 200K QPS, with roughly 50 input tokens and 50 output tokens per request. It was pushing the limits of what the L40S GPU-based stack could deliver. Each new iteration of the model required months of manual performance tuning to onboard. Meanwhile, the operational burden was growing, with capacity planning, performance tuning, and autoscaling consuming time from a lean team that needed to focus on model quality and product innovation.
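The workload figures above imply a large aggregate token rate. The following back-of-envelope sketch multiplies them out, treating the approximate values from the article (200K QPS peak, about 50 input and 50 output tokens per request) as exact. The function and constant names are illustrative.

```python
PEAK_QPS = 200_000          # peak queries per second, from the article
INPUT_TOKENS_PER_REQ = 50   # approximate, from the article
OUTPUT_TOKENS_PER_REQ = 50  # approximate, from the article


def tokens_per_second(qps: int, tokens_per_request: int) -> int:
    """Aggregate token rate for a given request rate and per-request token count."""
    return qps * tokens_per_request


input_tps = tokens_per_second(PEAK_QPS, INPUT_TOKENS_PER_REQ)
output_tps = tokens_per_second(PEAK_QPS, OUTPUT_TOKENS_PER_REQ)
total_tps = input_tps + output_tps

print(f"input tokens/s:  {input_tps:,}")
print(f"output tokens/s: {output_tps:,}")
print(f"total tokens/s:  {total_tps:,}")
```

At peak, that is on the order of ten million prompt tokens and ten million generated tokens every second, spread across many short requests rather than a few long ones.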

Superhuman needed a platform partner that could commit to performance and latency SLAs on the serving stack and would co-invest in the engineering required to meet them. Both teams defined target real-time SLOs upfront: sub-second p99 latency and zero quality regression on Superhuman’s internal evaluation harnesses.
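A p99 latency target like the one above can be checked against a batch of measured request latencies. This is a minimal sketch, assuming the nearest-rank percentile definition and a 1000 ms threshold; the article does not say how either team computes p99, and the function names are hypothetical.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(samples_ms: Sequence[float], threshold_ms: float = 1000.0) -> bool:
    """True if the p99 of the samples is strictly under the threshold (assumed 1 second)."""
    return percentile_nearest_rank(samples_ms, 99) < threshold_ms


# One slow outlier in 100 requests still passes; two do not.
print(meets_latency_slo([100.0] * 99 + [5000.0]))
print(meets_latency_slo([100.0] * 98 + [5000.0] * 2))
```

The example shows why p99 is a strict target at high volume: at 200K QPS, the slowest 1% of traffic still amounts to thousands of requests every second.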
