A Blog post by NVIDIA on Hugging Face
In large-scale LLM development, the question is no longer only how much data a model sees, but also whether that data contains enough structured learning signal. General web, code, math, multilingual, and domain data provide a broad base. Task-seeded synthetic Q&A complements them by adding compact, task-structured examples with a clear information need, a constrained response space, and explanations that connect evidence to an answer. In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded synthetic data generation (SDG) improved MMLU-Pro by 1.8, average code by 1.9, commonsense understanding by 1.6, and GPQA by 11.1, while average math remained stable.
This post describes a task-seeded synthetic Q&A generation workflow developed for Nemotron-family training, including Ultra and Super training runs. The workflow uses training splits from broad public task families as capability seeds, generates new task-aligned examples, enriches them with reasoning and relevant knowledge, and filters them into curated synthetic datasets. Held-out evaluation and test data are excluded from generation. Downstream training recipes can then decide how to mix those datasets with the broader corpus.
Figure 1. The task-seeded SDG pipeline ends at curated generated data. Training mixture design and reported evaluations happen downstream.
The generation workflow is a compact loop: collect training-split seeds, normalize heterogeneous task records, generate new examples, enrich answers, and filter the resulting data. In the internal pipeline, we used roughly 70 public task datasets from lm-eval-harness, covering about 700 subtasks. For each task, we used only suitable training splits as SDG seeds; held-out test data was not used for generation, and tasks without suitable training data were excluded from seed collection.
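The first two steps of that loop, collecting training-split seeds and normalizing heterogeneous task records, can be sketched roughly as follows. This is a minimal illustration and not the internal pipeline: the split name, the candidate field names, and the SeedRecord, normalize, and collect_seeds helpers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumption: only splits named here count as "suitable training splits".
SEED_SPLITS = ("train",)

# Assumption: different tasks name the same fields differently.
QUESTION_KEYS = ("question", "query", "prompt")
CHOICES_KEYS = ("choices", "options")
ANSWER_KEYS = ("answer", "label", "target")


@dataclass(frozen=True)
class SeedRecord:
    """Hypothetical common schema for one seed example."""
    task: str
    question: str
    choices: tuple
    answer: str


def _first(record: dict, keys: tuple, default: Any = None) -> Any:
    """Return the value of the first key present in the record."""
    for key in keys:
        if key in record:
            return record[key]
    return default


def normalize(task: str, record: dict) -> Optional[SeedRecord]:
    """Map a task-specific record onto the common schema, or skip it."""
    question = _first(record, QUESTION_KEYS)
    answer = _first(record, ANSWER_KEYS)
    if question is None or answer is None:
        return None  # not usable as a seed
    choices = tuple(str(c) for c in _first(record, CHOICES_KEYS, ()))
    return SeedRecord(task, str(question).strip(), choices, str(answer).strip())


def collect_seeds(tasks: dict) -> list:
    """Gather seeds from training splits only; other splits are never read."""
    seeds = []
    for task, splits in tasks.items():
        for split in SEED_SPLITS:
            for record in splits.get(split, []):
                seed = normalize(task, record)
                if seed is not None:
                    seeds.append(seed)
    return seeds
```

Because the loop iterates only over SEED_SPLITS, a task that ships just a test split contributes no seeds, which mirrors the exclusion rule described above.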
For Nemotron Ultra and Super pretraining, we used a license-compatible subset of the generated data suitable for commercial model training.
One practical formatting choice is to store semantic answer text rather than only option labels when possible. For example, writing the answer as "dirt trapped under the fingernails" gives the model a clearer training signal than writing only "B".
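As a rough sketch of that formatting choice, the helper below resolves an option label to the text of the corresponding choice. The function name resolve_answer_text is hypothetical, integer answers are assumed to be zero-based indices, and every choice in the usage example except the one quoted above is invented for illustration.

```python
import string


def resolve_answer_text(choices, answer):
    """Return the choice text when the answer is a label or index.

    Assumptions: letter labels map A->0, B->1, and so on; integer answers
    are zero-based indices. Anything else is returned unchanged.
    """
    if isinstance(answer, int) and not isinstance(answer, bool):
        if 0 <= answer < len(choices):
            return choices[answer]
        return str(answer)

    text = answer.strip()
    # Accept forms such as "B", "b", "(B)", or "B."
    label = text.strip("().").upper()
    if len(label) == 1 and label in string.ascii_uppercase:
        index = string.ascii_uppercase.index(label)
        if index < len(choices):
            return choices[index]
    return text  # already semantic text, or a label with no matching choice


# Illustrative choices; only the second one comes from the post.
choices = [
    "a recent manicure",
    "dirt trapped under the fingernails",
    "cold weather",
    "nail polish",
]
stored_answer = resolve_answer_text(choices, "B")
```

When no choice list is available, the label is kept as-is, which matches the "when possible" qualifier above.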