A Blog post by NVIDIA on Hugging Face
GR00T N1.7 is a 3B-parameter open reasoning Vision-Language-Action (VLA) model that maps visual observations and natural language instructions to continuous robot actions. It uses an Action Cascade architecture — a dual-system design that separates high-level reasoning from low-level motor control:
Inputs: RGB image frames (any resolution) + language instruction + robot proprioceptive state (joint positions, velocities, end-effector poses)
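To make the three input streams concrete, here is a minimal sketch of how one observation could be bundled before it is handed to a policy. It is illustrative only and is not the GR00T API: the class names (`RGBFrame`, `ProprioState`, `Observation`), the raw-bytes image layout, and the 7-value end-effector pose (position plus quaternion) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

POSE_DIM = 7  # assumed layout: x, y, z + quaternion (qx, qy, qz, qw)


@dataclass
class RGBFrame:
    """Hypothetical container for one RGB image of arbitrary resolution."""
    height: int
    width: int
    pixels: bytes  # row-major, 3 bytes (R, G, B) per pixel

    def __post_init__(self) -> None:
        expected = self.height * self.width * 3
        if len(self.pixels) != expected:
            raise ValueError(f"expected {expected} bytes, got {len(self.pixels)}")


@dataclass
class ProprioState:
    """Hypothetical proprioceptive state: joints plus end-effector poses."""
    joint_positions: List[float]
    joint_velocities: List[float]
    eef_poses: List[List[float]]  # one pose per end effector

    def __post_init__(self) -> None:
        if len(self.joint_positions) != len(self.joint_velocities):
            raise ValueError("positions and velocities must cover the same joints")
        for pose in self.eef_poses:
            if len(pose) != POSE_DIM:
                raise ValueError(f"each pose needs {POSE_DIM} values")

    def flatten(self) -> List[float]:
        """Concatenate positions, velocities, and poses into one flat vector."""
        flat = list(self.joint_positions) + list(self.joint_velocities)
        for pose in self.eef_poses:
            flat.extend(pose)
        return flat


@dataclass
class Observation:
    """Hypothetical bundle of the three input streams described above."""
    frames: List[RGBFrame]
    instruction: str
    state: ProprioState

    def __post_init__(self) -> None:
        if not self.frames:
            raise ValueError("at least one RGB frame is required")
        if not self.instruction.strip():
            raise ValueError("the language instruction must not be empty")
```

A caller would build one `Observation` per control step, for example a single 2x3 test frame with `RGBFrame(2, 3, bytes(18))`, an instruction string, and a `ProprioState`. Validating shapes at construction time surfaces malformed inputs before they reach the model.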