Original article excerpt
Server-side extracted preview paragraphs from the original source.
In this post, we show how pairing Amazon Nova 2 Lite with Anthropic’s Claude Sonnet 4.6 delivers an efficient solution for digitizing scanned documents at scale. We built a two-model pipeline on Amazon Bedrock for digitizing scanned yearbook pages. Amazon Nova 2 Lite handles native multimodal extraction in a single call: detecting photos, extracting visible names with coordinates, and returning page-level metadata. Claude Sonnet 4.6 then performs spatial reasoning to match names to faces based on page layout.
A scanned yearbook page contains 176 printed names, 4 portrait photographs, and zero machine-readable structure linking them. To digitize this page, you need reliable photo detection with bounding boxes and accurate name extraction. You also need a way to determine which name belongs to which face based on page layout.
In this post, we show how pairing Amazon Nova 2 Lite with Anthropic’s Claude Sonnet 4.6 delivers an efficient solution for digitizing scanned documents at scale. We built a two-model pipeline on Amazon Bedrock for digitizing scanned yearbook pages. Amazon Nova 2 Lite handles native multimodal extraction in a single call: detecting photos, extracting visible names with coordinates, and returning page-level metadata. Claude Sonnet 4.6 then performs spatial reasoning to match names to faces based on page layout.
We ran this pipeline against 336 scanned yearbook pages and produced 3,122 name-to-face associations, with 93 percent scoring at or above 0.95 confidence. This two-model approach costs about two-thirds less per page than a single-model alternative that sends the entire task to one vision-language model. See the Cost considerations section for the detailed breakdown.
The pipeline has two stages. Each stage uses a different model, chosen for the specific task it performs.
Figure 1. Two-model pipeline architecture. The scanned page image flows through two sequential stages. In stage 1, Amazon Nova 2 Lite performs native multimodal extraction in a single API call. It detects and classifies photos with bounding boxes, reads visible names on the page and returns their approximate positions, and emits page-level metadata. In stage 2, Claude Sonnet 4.6 performs spatial reasoning to match names to faces using the combined Nova output.
Amazon Nova 2 Lite runs first. Because it handles interleaved text and images natively, a single Converse call returns three things:
