Which tokens does a hybrid model predict better?

A Blog post by Ai2 on Hugging Face

Which kinds of tokens does a model predict well, and which does it not? That question is especially intriguing in the case of hybrids, a language model architecture that’s begun to challenge the standard transformer and that we’ve been investigating with Olmo Hybrid.

Hybrids can match or beat transformers on standard benchmarks, but the headline numbers don’t reveal much about what specific advantages hybrid models have over transformers.

To shed light on these token-level behaviors, we recently ran experiments comparing our own strongest 7B transformer (Olmo 3) and hybrid model (Olmo Hybrid) head-to-head. Specifically, we compare the two models' predictions at a fine-grained level across different types of tokens, the units of information that an LLM takes as input.

Because Olmo 3 and Olmo Hybrid were built to be as alike as possible outside their architectures — closely matched in data, tokenizer, and training recipe — any difference in their predictions mostly reflects the architecture itself. Viewing these differences at the token level allows us to glean insights about the specific strengths of hybrid models over transformers.
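
One way to picture a token-level comparison like this is to take the log-probability each model assigns to the correct token at every position, turn it into a loss, and average the difference within each token category. The sketch below is only an illustration under stated assumptions: the function name, the category labels, and the numbers are hypothetical toy values, not results or code from the experiments described here.

```python
from collections import defaultdict


def mean_loss_gap_by_category(logp_a, logp_b, categories):
    """Average (loss_a - loss_b) per category, where loss = -log p(correct token).

    A positive gap means model B assigns the correct token a higher
    probability than model A for that category.
    All three inputs are assumed to be aligned token by token.
    """
    if not (len(logp_a) == len(logp_b) == len(categories)):
        raise ValueError("inputs must be aligned token by token")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for la, lb, cat in zip(logp_a, logp_b, categories):
        totals[cat] += (-la) - (-lb)  # per-token loss difference
        counts[cat] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}


# Toy, made-up log-probabilities for four tokens (not measured values).
logp_transformer = [-2.0, -1.0, -0.5, -3.0]
logp_hybrid = [-1.5, -1.0, -0.5, -2.0]
token_categories = ["noun", "function", "function", "verb"]

gaps = mean_loss_gap_by_category(logp_transformer, logp_hybrid, token_categories)
print(gaps)
```

Because both models share a tokenizer, the per-token values line up position by position, which is what makes this kind of direct subtraction meaningful.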

Our results show that the hybrid’s advantage is real across many tokens, but not all. Olmo Hybrid is strongest on tokens that carry meaning, such as nouns, verbs, and adjectives, and on tokens that can only be predicted by following what’s going on, like which person a pronoun refers to. But the hybrid’s advantage almost disappears on tokens that simply repeat something already in the input — a word or phrase reproduced verbatim from earlier — where the answer is sitting right there to be looked up. That’s where the transformer’s strength lies.
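
To make the idea of a token that "simply repeats something already in the input" concrete, here is a minimal sketch that flags positions where the n-gram ending at that token has already appeared earlier in the sequence. The n-gram criterion, the function name, and the default of `n=3` are assumptions chosen for illustration; the article does not say how repeated tokens were identified.

```python
def repeated_ngram_positions(tokens, n=3):
    """Return indices of tokens that complete an n-gram seen earlier in the sequence.

    A position is flagged when the n tokens ending at it match an n-gram
    that ended at some earlier position, so the token could in principle
    be looked up from the context instead of being inferred.
    """
    seen = set()
    flagged = []
    for end in range(n - 1, len(tokens)):
        gram = tuple(tokens[end - n + 1 : end + 1])
        if gram in seen:
            flagged.append(end)
        seen.add(gram)
    return flagged


words = "the cat sat on the mat and the cat sat down".split()
copy_positions = repeated_ngram_positions(words, n=3)
print([(i, words[i]) for i in copy_positions])
```

In this toy sentence only the second "sat" is flagged, because "the cat sat" is the one three-word phrase reproduced verbatim from earlier.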

A language model is built from a stack of repeated layers, each one refining its representation of every token using the tokens around it.
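
The stacked-layer picture can be sketched in a few lines. In the toy code below, each layer updates every token's vector by adding the average of the vectors up to and including that token, and the same layer is applied several times in a row. The averaging rule is purely a stand-in assumed for illustration: it is neither attention nor the kind of layer a hybrid uses, and the function names are hypothetical.

```python
def mix_layer(states):
    """One toy layer: add to each token's vector the mean of all vectors so far.

    Only tokens at or before the current position contribute, mirroring
    how a left-to-right language model reads its context.
    """
    out = []
    running = [0.0] * len(states[0])
    for i, vec in enumerate(states):
        running = [r + v for r, v in zip(running, vec)]
        context = [r / (i + 1) for r in running]
        out.append([v + c for v, c in zip(vec, context)])  # residual-style update
    return out


def run_stack(states, num_layers):
    """Apply the same toy layer repeatedly, refining every token's vector."""
    for _ in range(num_layers):
        states = mix_layer(states)
    return states


initial = [[1.0, 0.0], [0.0, 1.0]]  # two tokens, two-dimensional vectors
print(run_stack(initial, 1))
print(run_stack(initial, 2))
```

After each pass the second token's vector carries more information from the first, which is the sense in which a layer refines a representation using its context.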
