Embed the world: Multimodal AI for searchable aerial imagery at scale | AI BriefWire

In this post, we walk through the problem space, our architecture on Amazon Bedrock and Amazon OpenSearch Serverless, the evaluation methodology we built on OpenStreetMap ground truth, four experiments that compared embedding models, fusion strategies, captioning, and search methods, and the practical guidance you can apply when building a similar system. You’ll learn which design choices move the needle for geospatial semantic search, including why Amazon Nova Multimodal Embeddings delivered the highest F1 scores across both benchmark queries in our evaluation. The work described here evolved into Vexcel Intelligence, a searchable imagery product.
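As a rough illustration of how an F1 score can be computed for a single query against a ground-truth set, here is a minimal sketch. It is not the evaluation code from the original work: the function name, the tile IDs, and the idea of comparing sets of tile identifiers are assumptions made for the example.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for one query.

    retrieved: tile IDs returned by the search system (hypothetical IDs).
    relevant:  tile IDs that the ground truth marks as containing the feature.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four tiles returned, three tiles truly relevant.
retrieved_tiles = ["t1", "t2", "t3", "t4"]
ground_truth_tiles = ["t2", "t3", "t5"]
precision, recall, f1 = precision_recall_f1(retrieved_tiles, ground_truth_tiles)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so a system scores well only if it returns most of the relevant tiles without padding the results with irrelevant ones.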

Turning a library of aerial imagery into a natural-language-searchable knowledge base is a problem that touches every industry that relies on geospatial data: insurance, real estate, government, infrastructure, and agriculture. The traditional path requires either manual tile-by-tile inspection or training a bespoke computer vision model for each new question. Multimodal embeddings, large language model (LLM) captioning, and vector search on AWS offer a faster alternative: index once, then query using natural language.

We worked with Vexcel, an aerial imagery and geospatial data provider that operates one of the largest aerial imagery programs in the world, to evaluate embedding models, fusion strategies, caption integration, and search methods over multi-view aerial imagery. Using its own sensors and a dedicated fleet of aircraft, Vexcel collects high-resolution data across 45+ countries and territories, delivering orthomosaic imagery, oblique imagery from multiple angles, and elevation models. The data exists, and the use cases are numerous, but turning billions of pixels into answers about the real world requires a faster path.

When a customer needs to locate swimming pools in a suburb, identify road networks in a development zone, or count solar panels across a city, someone has to inspect millions of images manually, one map tile at a time. The alternative is training a computer vision model for each feature, which requires labeled data, engineering time, and ongoing retraining. When the next customer wants to find warehouses with graffiti on the side (see Figure 1), the cycle repeats. Semantic search powered by vector embeddings removes this per-feature training step and turns natural-language queries into results in seconds.
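The "index once, then query" idea can be sketched in a few lines. This is a toy in-memory version under stated assumptions, not the production architecture: the three-dimensional vectors and tile IDs are made up, and in a real system the vectors would come from a multimodal embedding model and be stored in a vector database rather than a Python dictionary.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def search(index, query_vec, k=2):
    """Return the k (tile_id, score) pairs most similar to the query vector."""
    scored = [(tile_id, cosine_similarity(vec, query_vec))
              for tile_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical image embeddings, computed once at indexing time.
index = {
    "tile_pool": [0.9, 0.1, 0.0],
    "tile_road": [0.1, 0.9, 0.1],
    "tile_solar": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query such as "swimming pool",
# assumed to live in the same vector space as the image embeddings.
query_vec = [0.8, 0.2, 0.1]
results = search(index, query_vec, k=2)
print(results)
```

Because the text query and the images share one embedding space, a new question only needs a new query vector; nothing is retrained and the index is left untouched.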

Figure 1. A typical oblique image from Vexcel, giving models a rich, 360-degree view of the world

Vexcel had explored this problem through three prior proofs of concept (POCs): an agent-based approach combining imagery with property data, a property embedding system for similarity search, and a tiled multimodal embedding pipeline with LLM-generated captions. The third showed promise but raised key questions: which embedding model to use, how to handle multiple views per location, and whether captions actually improve results or just add cost.
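To make the multiple-views question concrete, one common way to combine several views of the same location is to average their normalized embeddings into a single vector. The excerpt does not say which fusion strategies were compared, so the sketch below is only an assumed example; the view names and two-dimensional vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse_views(view_vectors):
    """Mean-pool per-view embeddings into one unit-length location embedding.

    Each view is normalized first so that no single view dominates the
    average just because its raw embedding has a larger magnitude.
    """
    if not view_vectors:
        raise ValueError("at least one view embedding is required")
    normalized = [l2_normalize(v) for v in view_vectors]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


# Hypothetical embeddings for two views of the same location.
views = {
    "ortho": [1.0, 0.0],
    "oblique_north": [0.0, 2.0],
}
fused = fuse_views(list(views.values()))
print(fused)
```

The alternative is to index every view separately and merge results at query time, which keeps view-specific detail but multiplies the number of vectors to store and search.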
