vLLM (Virtual Large Language Model): A Comprehensive Guide to Local and Cloud Deployment
vLLM is a fast, open-source library for serving large language models (LLMs). Hosting models yourself offers control over data privacy, customization options, and potentially lower costs compared to relying solely on hosted APIs. This guide details setting up vLLM locally using Docker and deploying it on Google Cloud Run, so you can start small on a CPU and scale up as needed.
Local CPU Setup with Docker
For users without access to high-end GPUs, vLLM offers a CPU-optimized Docker image. This simplifies the process, eliminating the need for manual installation and potential compatibility issues.
Step 1: Building the Docker Image
Begin by cloning the vLLM repository and building with the appropriate Dockerfile (Dockerfile.cpu for x86 CPUs, Dockerfile.arm for ARM-based CPUs such as Apple Silicon Macs):
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.arm -t vllm-cpu --shm-size=4g .  # Or -f Dockerfile.cpu on x86
Step 2: Hugging Face Configuration
Create a Hugging Face account and generate an access token. For gated models such as meta-llama/Llama-3.2-1B-Instruct (a small model that works well for testing), you also need to request access on the model page.
Step 3: Running the Docker Container
Run the following command, replacing <your_hugging_face_token> with your actual token:
docker run -it --rm -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=<your_hugging_face_token>" \
  vllm-cpu --model meta-llama/Llama-3.2-1B-Instruct \
  --dtype float16
The server will start; once you see "Application startup complete," it's ready.
Interacting with the LLM
vLLM's OpenAI API compatibility allows seamless interaction using existing OpenAI client code: simply set the client's base URL to http://localhost:8000/v1. Optional API key authentication can be added via the --api-key flag in the docker run command.
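As a quick sanity check that the server is reachable, you can call the OpenAI-compatible chat completions endpoint directly with curl (a minimal sketch, assuming the meta-llama/Llama-3.2-1B-Instruct model from the command above):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
Any OpenAI-compatible client will work the same way once its base URL points at the server.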
Google Cloud Deployment
Deploying vLLM on Google Cloud offers scalability.
Step 1: Google Cloud Setup
Create a new Google Cloud project (e.g., "vllm-demo") and enable the Artifact Registry service.
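If you prefer the command line to the console, roughly equivalent gcloud commands are shown below (the project ID vllm-demo is an example and must be globally unique; enabling the Cloud Run API here saves a step later):
gcloud projects create vllm-demo
gcloud config set project vllm-demo
gcloud services enable artifactregistry.googleapis.com run.googleapis.com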
Step 2: Create an Artifact Repository
Create a Docker repository named "vllm-cpu" in the Artifact Registry.
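The same repository can be created from Cloud Shell; a sketch, assuming the us-central1 region (use whichever region you plan to deploy in):
gcloud artifacts repositories create vllm-cpu --repository-format=docker --location=us-central1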
Step 3: Build and Push the Docker Image
Use the Cloud Shell to build and push the Docker image:
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.cpu -t vllm-cpu --shm-size=4g .  # Cloud Shell and Cloud Run run on x86, so use Dockerfile.cpu here
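The build alone does not put the image in Artifact Registry; it still has to be tagged with the registry path and pushed. A sketch, assuming the vllm-cpu repository from Step 2, the project ID vllm-demo, and the us-central1 region:
gcloud auth configure-docker us-central1-docker.pkg.dev
docker tag vllm-cpu us-central1-docker.pkg.dev/vllm-demo/vllm-cpu/vllm-cpu
docker push us-central1-docker.pkg.dev/vllm-demo/vllm-cpu/vllm-cpu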
Step 4: Deploy to Cloud Run
Create a Cloud Run service, specifying the pushed image, port 8000, the Hugging Face token as an environment variable, the model name, and sufficient resources (e.g., 16 GiB memory, 4 CPUs). Keep at least one instance alive to minimize cold starts.
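The same configuration can also be expressed as a single gcloud command; a rough sketch with placeholder image path, region, and token (adjust to your project, and note that --allow-unauthenticated makes the endpoint public):
gcloud run deploy vllm-cpu \
  --image us-central1-docker.pkg.dev/vllm-demo/vllm-cpu/vllm-cpu \
  --region us-central1 \
  --port 8000 \
  --cpu 4 \
  --memory 16Gi \
  --min-instances 1 \
  --set-env-vars HUGGING_FACE_HUB_TOKEN=<your_hugging_face_token> \
  --args="--model=meta-llama/Llama-3.2-1B-Instruct,--dtype=float16" \
  --allow-unauthenticated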
Interacting with the Deployed LLM
Update your OpenAI client's base URL to the Cloud Run service URL.
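For a quick test with curl (the service URL is a placeholder; copy the real one from the Cloud Run console):
curl https://<your-cloud-run-service-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'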
Cost Considerations: Keeping a minimum of one instance alive means you are billed even while the service is idle, so set up billing alerts and delete the service when you no longer need it.
GPU Support (Google Cloud): GPU support on Google Cloud Run is available upon request. When GPU support is enabled, the official vllm/vllm-openai:latest image is recommended.
Alternative Hosting (RunPod): Services like RunPod offer simpler deployment but often at a higher cost.
This guide provides a comprehensive overview of vLLM deployment. Remember to choose the setup that best fits your resources and budget. Always carefully monitor your cloud costs.