Introduction
vLLM is a high-throughput, open-source inference and serving engine for large language models (LLMs). It delivers fast, memory-efficient inference through GPU optimizations such as PagedAttention and continuous batching, making it well suited to GPU-based workloads.
In this tutorial, we'll show how to run LLMs with vLLM entirely on your local machine and expose them through a secure public API. This approach lets you run models with GPU acceleration, keep local execution speed, and retain full control over your environment without relying on cloud inference services.
Clarifai Local Runners make this process simple. You can serve AI models or agents directly from your laptop, workstation, or internal server through a secure public API. You don't need to upload your model or manage infrastructure. The Local Runner routes API requests to your machine, executes them locally, and returns the results to the caller, while all computation stays on your hardware.
Let's look at how to set that up.
Running Models Locally with vLLM
The vLLM toolkit in the Clarifai CLI lets you initialize, configure, and run models via vLLM locally while exposing them through a secure public API. You can test, integrate, and iterate directly from your machine without standing up any infrastructure.
Step 1: Prerequisites
Install the Clarifai CLI
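The CLI ships with the Clarifai Python package. A typical install and login looks like this (you'll need a Clarifai Personal Access Token to authenticate):

```bash
# Install the Clarifai Python package, which includes the CLI
pip install --upgrade clarifai

# Authenticate the CLI with your Clarifai account (you will be prompted for a PAT)
clarifai login
```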
vLLM supports models from the Hugging Face Hub. If you're using private repositories, you'll also need a Hugging Face access token.
Step 2: Initialize a Model
Use the Clarifai CLI to scaffold a vLLM-based model directory. This prepares all of the files required for local execution and integration with Clarifai.
If you want to work with a specific model, use the --model-name flag:
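For example (the exact flags can vary by CLI version, so check clarifai model init --help; the toolkit name and the Hugging Face model used below are illustrative):

```bash
# Scaffold a vLLM-based model directory in the current folder
clarifai model init --toolkit vllm

# Or target a specific Hugging Face model with --model-name
clarifai model init --toolkit vllm --model-name Qwen/Qwen2.5-1.5B-Instruct
```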
Note: Some models are large and require significant memory. Make sure your machine meets the model's requirements.
After initialization, the generated folder structure looks like this:
- model.py – Contains the logic that runs the vLLM server locally and handles inference.
- config.yaml – Defines metadata, runtime, checkpoints, and compute settings.
- requirements.txt – Lists Python dependencies.
Step 3: Customize model.py
The scaffold includes a VLLMModel class that extends OpenAIModelClass. It defines how your Local Runner interacts with vLLM's OpenAI-compatible server.
Key methods:
- load_model() – Launches vLLM's local runtime, loads the checkpoints, and connects to the OpenAI-compatible API endpoint.
- predict() – Handles single-prompt inference with optional parameters such as max_tokens, temperature, and top_p. Returns the complete response.
- generate() – Streams generated tokens in real time for interactive outputs.
You can use these implementations as-is or customize them to fit your preferred request/response structures. A trimmed sketch of the class shape is shown below.
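This is not the exact generated file; the import path and method signatures are assumptions, so defer to the model.py the CLI produces for you.

```python
# Rough sketch only -- the scaffolded model.py is more complete, and the import
# path below is an assumption; check the file generated by the Clarifai CLI.
from clarifai.runners.models.openai_class import OpenAIModelClass


class VLLMModel(OpenAIModelClass):
    def load_model(self):
        # Launch the local vLLM OpenAI-compatible server, load the checkpoints
        # declared in config.yaml, and keep a client handle for later requests.
        ...

    def predict(self, prompt: str, max_tokens: int = 512,
                temperature: float = 0.7, top_p: float = 0.95) -> str:
        # Single-prompt inference: forward the prompt and sampling options to
        # the local vLLM endpoint and return the complete response text.
        ...

    def generate(self, prompt: str, max_tokens: int = 512,
                 temperature: float = 0.7, top_p: float = 0.95):
        # Streaming variant: yield chunks of text as vLLM produces them.
        ...
```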
Step 4: Configure config.yaml
The config.yaml file defines the model ID, runtime, checkpoints, and compute metadata:
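A representative layout might look roughly like this; the field names and values are illustrative, so keep whatever the CLI generated for you:

```yaml
# Illustrative sketch -- defer to the config.yaml generated by the CLI.
model:
  id: local-vllm-model        # placeholder model ID
  user_id: your-user-id
  app_id: your-app-id
  model_type_id: text-to-text

checkpoints:
  type: huggingface
  repo_id: Qwen/Qwen2.5-1.5B-Instruct   # Hugging Face repo to pull weights from
  hf_token: your-hf-token               # only needed for private repositories

inference_compute_info:                 # optional for Local Runners (see note below)
  cpu_limit: "2"
  cpu_memory: 8Gi
  num_accelerators: 1
  accelerator_type: ["NVIDIA-*"]
  accelerator_memory: 24Gi
```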
Note: For local execution, inference_compute_info is optional; the model runs entirely on your machine using local CPU/GPU resources. If you deploy on Clarifai's dedicated compute, you can specify accelerators and resource limits.
Step 5: Start the Local Runner
Start a Local Runner that connects to the vLLM runtime:
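From inside the model directory, the runner is started with a single CLI command (the command name is taken from Clarifai's Local Runner workflow; check clarifai model --help if your CLI version differs):

```bash
# Start a Local Runner from the scaffolded model directory
clarifai model local-runner
```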
If any configuration is missing, the CLI will prompt you to define it. After startup, you'll receive a public Clarifai URL for your model. Requests sent to this endpoint are routed securely to your machine, run through vLLM, and return to the caller.
Step 6: Run Inference with the Local Runner
Once your model is running locally and exposed via the Clarifai Local Runner, you can send inference requests using the OpenAI-compatible API or the Clarifai SDK.
OpenAI-Compatible API
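Because the runner exposes an OpenAI-compatible endpoint, you can use the official openai Python client. The base URL and model URL below are placeholders; substitute the values shown in your Clarifai account and Local Runner output.

```python
import os
from openai import OpenAI

# Point the OpenAI client at Clarifai's OpenAI-compatible endpoint (assumed URL)
# and authenticate with your Clarifai Personal Access Token.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

response = client.chat.completions.create(
    # Full Clarifai model URL printed when the Local Runner started (placeholder)
    model="https://clarifai.com/<user_id>/<app_id>/models/<model_id>",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```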
Clarifai Python SDK
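With the Clarifai Python SDK, a minimal call might look like the sketch below. The keyword arguments mirror the predict() signature defined in model.py, so adjust them if you customized that method.

```python
import os
from clarifai.client import Model

# Connect to the model URL printed when the Local Runner started (placeholder)
model = Model(
    url="https://clarifai.com/<user_id>/<app_id>/models/<model_id>",
    pat=os.environ["CLARIFAI_PAT"],
)

# Calls the predict() method defined in model.py with its optional parameters
result = model.predict(prompt="Write a haiku about local inference.", max_tokens=128)
print(result)
```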
You can also experiment with the generate() method for real-time streaming.
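Through the OpenAI-compatible endpoint, streaming maps to the standard stream=True flag, which exercises the same streaming path; the endpoint and model URL remain placeholders as above.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # assumed endpoint, as above
    api_key=os.environ["CLARIFAI_PAT"],
)

# stream=True returns chunks as the local vLLM server generates them
stream = client.chat.completions.create(
    model="https://clarifai.com/<user_id>/<app_id>/models/<model_id>",
    messages=[{"role": "user", "content": "Stream a short poem about GPUs."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```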
Conclusion
Local Runners give you full control over where your models execute, without sacrificing integration, security, or flexibility. You can prototype, test, and serve real workloads on your own hardware, while Clarifai handles routing, authentication, and the public endpoint.
You can try Local Runners for free with the Free Tier, or upgrade to the Developer Plan at $1 per month for the first year to connect up to five Local Runners with unlimited hours.
