Original article excerpt
Server-side extracted preview paragraphs from the original source.
Databricks Knowledge Assistant is now faster: Instructed-Retriever-1 cuts search time over 3x and answer time 2x, with no reconfiguration and no quality tradeoff.
Today we’re announcing a major update that makes Agent Bricks Knowledge Assistant both faster and higher quality. Answer generation time has dropped by 2x, and search time has dropped by more than 3x, bringing Time To First Token (TTFT) to around two seconds.¹ Thus, Knowledge Assistant users will get noticeably faster answers across their use cases, with no reconfiguration required and no tradeoff in quality.
These gains are powered by Instructed-Retriever-1, a retrieval-specialized model built for parallel test-time scaling. Unlike standard agentic retrieval, where an agent works sequentially and reasons over each result before deciding its next step, our approach fans this work out in parallel. Instructed-Retriever-1 is a single model trained for both retrieval stages: query generation to increase recall and reranking to increase precision, run in parallel to keep latency low. In this post, we describe how this approach results in a Pareto-optimal performance, how we train one model to support the full retrieval pipeline, and how we validate performance on realistic enterprise workloads.
Figure: On KARLBench, Knowledge Assistant with Instructed-Retriever-1 improves both search latency and retrieval quality.
Our previous research demonstrated that quality can improve with additional test-time compute. However, most agentic search systems today spend that compute on sequential operations, like tool calls, reason-act loops, and chain-of-thought reasoning. These methods do improve search quality, but they come at the expense of substantially higher latency and cost. For training Instructed-Retriever-1, we take a different route: rather than scaling compute sequentially, we parallelize it during the initial search phase. By broadening the range of retrieved evidence and selecting the most relevant context up front, we achieve highly effective search with significantly lower latency.
Improving the initial search depends heavily on the training harness. Our harness provides the model with user instructions and the precise schema of the underlying retrieval index, and it propagates them to all the subsequent stages of query and filter generation, reranking, and answer generation. We described how this can be achieved in our earlier Instructed Retriever blog, and we use the same search harness in training our Instructed-Retriever-1 model. This approach is especially important for enterprise questions, which often involve domain-specific constraints such as time period, organization, document type, or product area.
Parallel query and filter generation improves candidate-set recall by simultaneously exploring multiple formulations and aspects of the same request. This allows the system to search more broadly while keeping latency low. Broader search creates an aggregation challenge. Different formulations may return overlapping or only partially relevant chunks. To select the most useful context from the merged candidate set, we use a multi-pivot groupwise reranker. Candidates are ranked in parallel groups, each anchored by one or more pivot chunks, and the group rankings are merged into a final ordering. This captures the key benefits of comparing evidence in context while keeping reranking efficient.
