Original article excerpt
Server-side extracted preview paragraphs from the original source.
A Blog post by NVIDIA on Hugging Face
Press SPACE to record, SPACE again to stop. This is a simple VLA: the model decides on its own whether to act based on the context of what you asked, no keyword triggers, no hardcoded logic. If your question needs Gemma to open her eyes, she'll decide to take a photo, interpret it, and answer you with that context in mind. She's not describing the picture, she's answering your actual question using what she saw.