Original article excerpt
Server-side extracted preview paragraphs from the original source.
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
It's the quickest way to stand up a model for tests, evals, or batch generation. (If you're after a managed, production-ready service instead, that's what Inference Endpoints are for — more on when to pick which at the end.)
hf jobs run is docker run for HF infrastructure. We use the official vllm/vllm-openai image, ask for a GPU with --flavor, and expose vLLM's port with --expose:
--expose 8000 routes the container's port through HF's public jobs proxy (see the Serve Models guide for the full reference). The command prints the URL your server is reachable at:
6a381ca1953ed90bfb947332 is your job ID. Keep track of it, we'll need it. We'll use <job_id> as a placeholder for it in the rest of the post.
Give it a couple of minutes to download weights and boot. When the logs show Application startup complete, you're live.
vLLM speaks the OpenAI API, and every request just needs your HF token as a bearer token. The quickest way to hit it is curl: