Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Hugging Face published a guide on training and finetuning multimodal embedding and reranker models with Sentence Transformers. The tutorial covers training on combined text and image data and uses Visual Document Retrieval as its worked example.

Published: Thursday, April 16, 2026 at 2:00 AM

Original article excerpt

As a practical example, I'll walk through finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR), the task of retrieving relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting tomaarsen/Qwen3-VL-Embedding-2B-vdr demonstrates how much performance you can gain by finetuning on your own domain. On my evaluation data, the finetuned model achieves an NDCG@10 of 0.947 compared to the base model's 0.888, and outperforms all existing VDR models I tested against, including models up to 4x its size.
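The NDCG@10 figures quoted above can be made concrete with a small sketch of the metric itself. The excerpt does not say how the author computed it, so the following is an assumption: it uses the common linear-gain form of DCG with a log2 rank discount, and it assumes the input list holds relevance labels for the full ranking of candidates for one query (so the ideal ordering can be derived by sorting). The function names `dcg_at_k` and `ndcg_at_k` are illustrative and do not come from Sentence Transformers.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k positions.

    `relevances` lists the relevance label of each retrieved item,
    in ranked order (position 0 is the top result).
    """
    # Rank is 0-based here, so the discount is log2(rank + 2):
    # the top result is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.

    Assumes `ranked_relevances` covers every candidate for the query,
    so sorting it in descending order gives the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        # No relevant items at all for this query: define the score as 0.
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical query with one relevant page, retrieved at rank 2.
    print(ndcg_at_k([0, 1, 0, 0, 0]))
```

A score of 1.0 means every relevant page is ranked ahead of every irrelevant one within the top 10; a reported dataset-level NDCG@10 is typically the mean of this per-query value.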
