Original article excerpt
Server-side extracted preview paragraphs from the original source.
A Blog post by Ai2 on Hugging Face
Which kinds of tokens does a model predict well, and which does it not? That question is especially intriguing in the case of hybrids, a language model architecture that’s begun to challenge the standard transformer and that we’ve been investigating with Olmo Hybrid.
Hybrids can match or beat transformers on standard benchmarks, but the headline numbers don’t reveal much about what specific advantages hybrid models have over transformers.
In an attempt to shed light on these token-level behaviors, we recently conducted experiments comparing our own strongest 7B transformer, Olmo 3, and hybrid model, Olmo Hybrid, head-to-head. Specifically, we compare the differences in model predictions in a fine-grained way across different types of tokens, or units of information that appear as input to an LLM.
Because Olmo 3 and Olmo Hybrid were built to be as alike as possible outside their architectures — closely matched in data, tokenizer, and training recipe — any difference in their predictions mostly reflects the architecture itself. Viewing these differences at the token level allows us to glean insights about the specific strengths of hybrid models over transformers.
Our results show that the hybrid’s advantage is real across many tokens, but not all. Olmo Hybrid is strongest on tokens that carry meaning, such as nouns, verbs, and adjectives, and on tokens that can only be predicted by following what’s going on, like which person a pronoun refers to. But the hybrid’s advantage almost disappears on tokens that simply repeat something already in the input — a word or phrase reproduced verbatim from earlier — where the answer is sitting right there to be looked up. That’s where the transformer’s strength lies.
A language model is built from a stack of repeated layers, each one refining its representation of every token using the tokens around it.
