Accelerating LLM Inference with Prompt Caching for Open‑Source Models on Databricks | AI BriefWire

Learn how prompt caching speeds up OSS LLM inference on Databricks and delivers secure, automatic performance gains.

Large language model (LLM) inference often involves repeated prompts—think of the same system or instruction prompt appearing in thousands of requests. Reprocessing that identical prefix for every call wastes compute cycles, inflates latency, and increases costs.
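To make the waste concrete, here is a minimal sketch of prefix reuse. It is a toy model, not how Databricks or any inference engine implements caching: the class name `ToyPrefixCache`, the token lists, and the token counts are all hypothetical, and "processing" is reduced to counting tokens.

```python
class ToyPrefixCache:
    """Toy in-memory prefix cache (hypothetical, for illustration only)."""

    def __init__(self):
        self._store = {}           # prefix tokens -> placeholder "state"
        self.tokens_processed = 0  # stand-in for prefill compute spent

    def _process(self, tokens):
        # Stand-in for the expensive prefill pass over `tokens`.
        self.tokens_processed += len(tokens)
        return len(tokens)

    def run(self, prefix_tokens, suffix_tokens):
        """Process one request; return True if the prefix was a cache hit."""
        key = tuple(prefix_tokens)
        hit = key in self._store
        if not hit:
            # First time this exact prefix is seen: pay for it once.
            self._store[key] = self._process(prefix_tokens)
        # The request-specific suffix is always processed.
        self._process(suffix_tokens)
        return hit


# Assumed sizes: a 1,000-token shared system prompt, three 10-token queries.
system_prompt = ["sys"] * 1000
queries = [["q"] * 10 for _ in range(3)]

cache = ToyPrefixCache()
hits = [cache.run(system_prompt, q) for q in queries]

tokens_without_cache = sum(len(system_prompt) + len(q) for q in queries)
print(hits, cache.tokens_processed, tokens_without_cache)
```

In this toy setup only the first request pays for the shared prefix, so the cached run processes 1,030 tokens where the uncached one would process 3,030.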

Prompt caching can be a powerful technique to raise a model’s quality in specific domains without compromising the model’s token throughput. Queries can share a large domain-specific system prompt, with the compute cost of that shared prompt amortized across all those queries. Frontier models, such as Claude, use system prompts that are many thousands of tokens long under the hood. Furthermore, in our recently published research we showed that automated prompt optimization allows open-source models to surpass frontier-model quality for enterprise tasks.
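The amortization argument can be sketched as simple arithmetic. The function below is an illustrative assumption rather than a Databricks formula: it counts only prefill tokens, assumes every query after the first hits the cache, and ignores cache eviction, lookup overhead, and decoding cost. The prompt and query lengths are made-up numbers.

```python
def avg_prefill_tokens(prefix_len, query_len, n_queries, cached):
    """Average prefill tokens processed per query (hypothetical cost model)."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    if cached:
        # The shared prefix is processed once, then reused by every query.
        total = prefix_len + n_queries * query_len
    else:
        # The shared prefix is reprocessed for every query.
        total = n_queries * (prefix_len + query_len)
    return total / n_queries


# Assumed workload: 4,000-token system prompt, 200-token queries, 1,000 queries.
uncached = avg_prefill_tokens(4000, 200, 1000, cached=False)
amortized = avg_prefill_tokens(4000, 200, 1000, cached=True)
print(uncached, amortized)  # 4200.0 204.0
```

Under these assumptions the per-query prefill cost approaches the query length alone as the number of queries sharing the prompt grows.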

Databricks already provides built-in prompt caching for proprietary models (GPT, Gemini, Claude). We’ve now extended this capability to the open-weights models powering our Foundation Model APIs (FMAPIs) for batch inference, pay-per-token, and provisioned-throughput workloads. It also applies to all higher-level services powered by a foundation model, such as Agent Bricks, Genie, and AI Functions.

Prompt caching is now supported for the following OSS models hosted on Databricks:

We will continue to roll out this feature across our other models. Security is a first‑class concern at Databricks. Prompt caches are isolated, reside only in volatile memory, and are never persisted. Importantly, the caching is implicit: customers do not need to configure anything, because our system is built to cache and reuse prompts automatically to improve throughput.

We rolled out prompt caching to our GPT‑OSS models first and immediately saw measurable gains in one of the large-scale production batch‑inference pipelines:
