vLLM (Virtual Large Language Model): A Comprehensive Guide to Local and Cloud Deployment
vLLM is a fast, open-source library for serving large language models (LLMs). Hosting models yourself offers control over data privacy, customization options, and potentially lower costs compared to relying solely on hosted APIs. This guide details setting up vLLM locally using Docker and deploying it on Google Cloud Run, so you can start small on a CPU and scale up as needed.
Local CPU Setup with Docker
For users without access to high-end GPUs, vLLM offers a CPU-optimized Docker image. This simplifies the process, eliminating the need for manual installation and potential compatibility issues.
Step 1: Building the Docker Image
Begin by cloning the vLLM repository and building with the appropriate Dockerfile (Dockerfile.cpu for x86 CPUs, Dockerfile.arm for ARM-based CPUs such as Apple Silicon Macs):
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.arm -t vllm-cpu --shm-size=4g .  # Or -f Dockerfile.cpu on x86
Step 2: Hugging Face Configuration
Create a Hugging Face account and generate an access token. For gated models such as meta-llama/Llama-3.2-1B-Instruct (a small model that works well for testing), you also need to request access on the model page.
Step 3: Running the Docker Container
Run the following command, replacing <your_hugging_face_token> with your actual token:
docker run -it --rm -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=<your_hugging_face_token>" \
  vllm-cpu --model meta-llama/Llama-3.2-1B-Instruct \
  --dtype float16
The server will start; once you see "Application startup complete," it's ready.
Interacting with the LLM
vLLM's OpenAI API compatibility allows seamless interaction using existing OpenAI client code: simply set the client's base URL to http://localhost:8000/v1. Optional API key authentication can be added via the --api-key flag in the docker run command.
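As a quick sanity check that the server is reachable, you can call the OpenAI-compatible chat completions endpoint directly with curl (a minimal sketch, assuming the meta-llama/Llama-3.2-1B-Instruct model from the command above):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
Any OpenAI-compatible client will work the same way once its base URL points at the server.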
Google Cloud Deployment
Deploying vLLM on Google Cloud offers scalability.
Step 1: Google Cloud Setup
Create a new Google Cloud project (e.g., "vllm-demo") and enable the Artifact Registry service.
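If you prefer the command line to the console, roughly equivalent gcloud commands are shown below (the project ID vllm-demo is an example and must be globally unique; enabling the Cloud Run API here saves a step later):
gcloud projects create vllm-demo
gcloud config set project vllm-demo
gcloud services enable artifactregistry.googleapis.com run.googleapis.com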
Step 2: Create an Artifact Repository
Create a Docker repository named "vllm-cpu" in the Artifact Registry.
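The same repository can be created from Cloud Shell; a sketch, assuming the us-central1 region (use whichever region you plan to deploy in):
gcloud artifacts repositories create vllm-cpu --repository-format=docker --location=us-central1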
Step 3: Build and Push the Docker Image
Use the Cloud Shell to build and push the Docker image:
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.cpu -t vllm-cpu --shm-size=4g .  # Cloud Shell and Cloud Run run on x86, so use Dockerfile.cpu here
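The build alone does not put the image in Artifact Registry; it still has to be tagged with the registry path and pushed. A sketch, assuming the vllm-cpu repository from Step 2, the project ID vllm-demo, and the us-central1 region:
gcloud auth configure-docker us-central1-docker.pkg.dev
docker tag vllm-cpu us-central1-docker.pkg.dev/vllm-demo/vllm-cpu/vllm-cpu
docker push us-central1-docker.pkg.dev/vllm-demo/vllm-cpu/vllm-cpu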
Step 4: Deploy to Cloud Run
Create a Cloud Run service, specifying the pushed image, port 8000, the Hugging Face token as an environment variable, the model name, and sufficient resources (e.g., 16 GiB memory, 4 CPUs). Keep at least one instance alive to minimize cold starts.
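The same configuration can also be expressed as a single gcloud command; a rough sketch with placeholder image path, region, and token (adjust to your project, and note that --allow-unauthenticated makes the endpoint public):
gcloud run deploy vllm-cpu \
  --image us-central1-docker.pkg.dev/vllm-demo/vllm-cpu/vllm-cpu \
  --region us-central1 \
  --port 8000 \
  --cpu 4 \
  --memory 16Gi \
  --min-instances 1 \
  --set-env-vars HUGGING_FACE_HUB_TOKEN=<your_hugging_face_token> \
  --args="--model=meta-llama/Llama-3.2-1B-Instruct,--dtype=float16" \
  --allow-unauthenticated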
Interacting with the Deployed LLM
Update your OpenAI client's base URL to the Cloud Run service URL.
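For a quick test with curl (the service URL is a placeholder; copy the real one from the Cloud Run console):
curl https://<your-cloud-run-service-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'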
Cost Considerations: Keeping a minimum of one instance alive means you are billed even while the service is idle, so set up billing alerts and delete the service when you no longer need it.
GPU Support (Google Cloud): GPU support on Google Cloud Run is available upon request. When GPU support is enabled, the official vllm/vllm-openai:latest image is recommended.
Alternative Hosting (RunPod): Services like RunPod offer simpler deployment but often at a higher cost.
This guide provides a comprehensive overview of vLLM deployment. Remember to choose the setup that best fits your resources and budget. Always carefully monitor your cloud costs.