The new DiffusionGemma open model generates text in parallel — not one token at a time — and is optimized to run on the NVIDIA RTX PRO platform, NVIDIA DGX Spark systems and GeForce RTX GPUs.
Today, Google DeepMind released DiffusionGemma — an experimental open model built for exceptionally fast text generation. NVIDIA has optimized DiffusionGemma to run even faster across NVIDIA GeForce RTX GPUs, the NVIDIA RTX PRO platform and NVIDIA DGX Spark systems, from local PCs to the cloud.
Rather than generating text one word at a time, DiffusionGemma generates multiple words in parallel to output whole blocks of text, opening a new, low-latency frontier for the kind of single-user workloads that developers, researchers and AI enthusiasts run every day.
Almost every large language model (LLM) in wide use today is autoregressive, meaning it generates text one token at a time, with each new token depending on the tokens that came before it. That sequential process is what makes interactive AI feel like it’s typing.
DiffusionGemma takes a different path. Built on the Gemma 4 26B mixture-of-experts architecture, it generates text the way diffusion models generate images: by starting from noise and refining a whole block of text at once. Each step denoises up to 256 tokens in parallel rather than emitting a single token and waiting to compute the next.
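To make the contrast concrete, here is a minimal sketch that counts sequential model passes under each approach. The 256-token block size comes from the description above. The number of refinement steps per block (`steps_per_block`) and both function names are hypothetical, chosen only for illustration, and are not published DiffusionGemma parameters.

```python
import math

BLOCK_SIZE = 256  # from the article: up to 256 tokens denoised per step


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one sequential forward pass per token."""
    return num_tokens


def block_diffusion_passes(
    num_tokens: int,
    steps_per_block: int,  # hypothetical: refinement steps per block (assumed)
    block_size: int = BLOCK_SIZE,
) -> int:
    """Each block of up to `block_size` tokens is refined as a whole,
    taking `steps_per_block` sequential passes per block."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


if __name__ == "__main__":
    n = 512
    print("autoregressive passes:", autoregressive_passes(n))
    print("block-diffusion passes:", block_diffusion_passes(n, steps_per_block=16))
```

Under these assumed numbers, a 512-token response takes 512 sequential passes autoregressively, versus 2 blocks of 16 steps, or 32 passes, with block refinement. Each refinement pass does more work than a single-token pass, so pass count is only a rough proxy for latency.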
The result is a model that thinks in blocks instead of sequentially. For latency-sensitive, single-user work — such as interactive chat, agentic loops or on-device assistants that plan and act — that parallelism translates into responses fast enough to keep pace with how developers think and iterate.
Generating one token at a time is fundamentally a memory-bound problem: each new token requires reading the model's active weights from memory, so a traditional LLM spends most of its time waiting on memory bandwidth, not doing math, which leaves a lot of compute on the table.
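A rough roofline-style estimate illustrates why. The sketch below assumes 16-bit weights, counts one multiply-add (2 FLOPs) per weight per token, and ignores attention, the KV cache and activation traffic. It also ignores the fact that a mixture-of-experts model may read more expert weights as more tokens are processed together. The hardware figures are hypothetical round numbers, not the specifications of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs performed per byte of weights read.

    Each weight read from memory is used for one multiply-add
    (2 FLOPs) for every token processed in the same pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity (FLOPs/byte) at which compute and memory take equal time."""
    return peak_flops / mem_bandwidth


def is_memory_bound(
    tokens_per_pass: int,
    peak_flops: float,
    mem_bandwidth: float,
    bytes_per_weight: float = 2.0,
) -> bool:
    """True when the pass is limited by memory bandwidth rather than math."""
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_weight)
    return intensity < ridge_point(peak_flops, mem_bandwidth)


# Hypothetical accelerator (assumed values for illustration only).
PEAK_FLOPS = 100e12     # 100 TFLOP/s
MEM_BANDWIDTH = 1e12    # 1 TB/s

if __name__ == "__main__":
    for tokens in (1, 256):
        print(
            tokens,
            arithmetic_intensity(tokens),
            is_memory_bound(tokens, PEAK_FLOPS, MEM_BANDWIDTH),
        )
```

With these assumed figures, a single-token pass performs about 1 FLOP per byte read, far below the hardware's ratio of 100, so the math units sit mostly idle. Processing 256 tokens per pass raises that ratio to about 256, which in this simplified model moves the workload past the point where memory bandwidth is the limit.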
