If you've been trying open source models of different sizes, you might be wondering: What is the most efficient way to deploy them?
How do on-demand and serverless providers compare on price, and with so many LLM serving platforms around, is it still worth dealing with the big players like AWS?
I decided to dig into this topic and compare cloud vendors like AWS with newer alternatives like Modal, BentoML, Replicate, Hugging Face endpoints, and Beam.
We will look at metrics such as processing time, cold start latency, and CPU, memory and GPU cost to see which ones are most effective and economical. We will also cover softer metrics such as ease of deployment, developer experience, and community.
We will explore some use cases, such as deploying smaller models on CPUs and running 7-8 billion parameter models on GPUs.
I will also walk through deploying smaller models with AWS Lambda and EFS and compare that with more modern platforms like Modal.
I won't go deep into optimization strategies here, such as using different frameworks or quantization to speed up inference; that is a separate topic entirely.
Instead, this article will focus on how to choose the right deployment option, giving you the opportunity to compare performance in different scenarios and helping you understand the economic costs of deploying small and large LLMs.
When you use an off-the-shelf open source model, there are a lot of easy-to-use API options. I recommend checking this list for some options. You can also choose to self-host – view the Local Inference section in the same list.
However, you may need to use a private, fine-tuned, or less common model.
You can certainly host these models locally, too, but your computer needs enough resources, and you may want to integrate these models into an application running on another server.
This brings us to hosting open source models on demand or via serverless platforms. The idea is that you only pay for the resources you use, whether that's for the time the instance is up (on demand) or per run (serverless).
Serverless and on-demand work somewhat similarly, but serverless scales down faster, so you don't pay for idle resources.
You can check out my doodle below for more comparisons.
In this article, we will compare AWS's EC2 and Lambda with several emerging platforms that have become increasingly popular.
This way, you can better understand what works best.
As a side note, I have not received any compensation from these vendors, so the information I share here is my personal opinion.
If you are a stakeholder, this is a great way to understand the economics of different options and the cost of running inference based on model size and vendor choice.
The first part of the article covers the research, which anyone can follow along with, while the second part covers the technical side of deployment, which you may or may not want to read.
Now, before we get started, I want to comment on LLM inference frameworks, which simplify setting up API endpoints to serve models. There are several open source LLM serving frameworks available, including vLLM, TensorRT-LLM, and TGI, which we can use here.
You can check out some of the more popular frameworks in the LLM Serving Frameworks section of the list I shared earlier (see below).
Some people have measured the performance differences between these frameworks and you should definitely do your own research.
However, in this article, we will use vLLM, which is widely used – unless the model is deployed through the Hugging Face endpoint, which will automatically use TGI for us.
To deploy a smaller transformer model running on the CPU, I just used the Hugging Face pipeline or the transformers library directly.
In the first part, we will look at the efficiency, cost, and performance of the on-demand and serverless options. We will first go through the metrics and then dig into the technical details.
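For reference, here is a minimal sketch of what serving a model with vLLM looks like. The model name and sampling settings below are placeholders, not the exact configuration used in these tests.
<code># Minimal vLLM sketch -- model and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # any 7B-8B model from the Hub
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(["Explain the difference between on-demand and serverless hosting."], params)
print(outputs[0].outputs[0].text)</code>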
Let's first measure the total processing time across platforms when the container is in a warm state (i.e., it has been used within the past few seconds) and there is no concurrency.
We define processing time as the total time it takes to complete the response. Please note that some people may measure first response time, especially when streaming output.
For consistency, I used the same prompt for each test. For the 400M model, I batched 30 texts per call.
You can see the metrics below.
I only ran these tests a few times on each platform on the same day. Ideally, I would have tested them over several days, and I may simply have been unlucky on some of them.
That said, looking at performance, Modal and Beam perform very well on CPU among the serverless providers (shown as light green bars). Unsurprisingly, a 400M model is easier to spin up than an 8B model.
I found that even smaller models (under about 130M parameters) work well with AWS Lambda, especially when the models are cached with EFS.
While I really like Hugging Face endpoints, I found their CPU instances a little unpredictable. However, their AWS GPU instances are very reliable and fast.
Even hosting a 7B model on an L4 instance, I got very fast GPU responses that returned within about 10 seconds, which is something I could not achieve with the serverless providers; they needed a more powerful GPU to match it.
If we choose the A100 GPU, we will see that all providers perform very well for the 7B-8B parameter model and can return a full response in seconds.
The speed is great, of course, but we need to look at other metrics too.
Next, let's dive into cold starts, i.e. how long it takes to respond if the model hasn't been used for a while. Even if you cache the model, it may still need to download the shards, which can add a few seconds.
On-demand services may let you cache models to speed up startup times, which I did not do here, but most serverless providers show you how to cache the model at build time, which reduces cold start latency.
Let's take a look at the metrics for each platform.
Note that I measured the entire processing time during a cold start, so be sure to also check the cold-start-only numbers separately.
As expected, the on-demand services where I did not cache the model, such as BentoML, Hugging Face endpoints, and Baseten, perform worse here.
While the Hugging Face endpoints can perform well once running, you may still experience cold starts that last 30 seconds to 5 minutes, which becomes a problem if you need to scale up and down frequently. They also throw 500 errors until the container is fully running again.
Serverless providers are faster because they are designed to scale quickly by requiring us to cache model weights on first deployment.
Beam performs best on the CPU, followed by Baseten, Modal and Lambda with EFS. Smaller models usually start faster. Lambda shows excellent results with fast processing time and minimal cold start latency for small models with only 125M parameters.
That said, I think Modal or Beam are good choices for smaller models.
Let's turn to pricing. We need to look at the cost of CPU, memory, and GPU resources.
There are some obvious differences between platforms.
Serverless providers are often more expensive because they charge CPU and memory fees in addition to GPU usage. However, they won't charge you for idle time, which can help offset the higher costs.
You can find the Nvidia GPU prices in the picture below.
You should also note SageMaker, which has the highest GPU cost of them all. If you need to stay on AWS, it is better to use EC2 directly.
Let's take a look at CPU pricing as well.
Hugging Face endpoints lead at $0.07 for 2 vCPUs and 4 GB of memory; unfortunately, their CPU instances underperform.
Beam and Modal allow you to adjust the resources you need, which helps minimize costs. For the 400M model, I calculated that only 3GB of memory and 1 core (2 vCPUs) were required on both platforms.
Replicate, on the other hand, forces us to use 4 vCPUs regardless of the model size, making it the most expensive CPU option here.
We will cover some use cases to compare the price and efficiency of all these platforms.
The first case is sporadic use of a 400M model throughout the day, meaning the container scales up and down for every call.
Scaling up and down won't always happen for every call, but we will count it as if it does.
I ran this case study by batching 30 texts per call (using the smaller fine-tuned model) with 250 calls throughout the day. For simplicity, we assume the container cold-starts every time it runs (except for the Hugging Face endpoints).
Serverless providers are a better choice here because we won't pay for idle time like we do on demand. For BentoML, the instance needs to stay idle for at least 5 minutes before it scales down automatically, and for HF endpoints we have to wait 15 minutes.
Side note: if you are not familiar with scaling down to zero, it means telling the platform to scale our instance down automatically when it is allowed to.
They all have different requirements: Baseten and HF endpoints have 15-minute idle windows, while BentoML has 5 minutes.
Since the HF endpoint takes at least 15 minutes to scale down, if we call the function every 5-6 minutes it never gets a chance to scale down, so we have few cold starts but a lot of idle time.
We can see that 17 hours of idle time in the HF case and 18 hours in the BentoML case is inherently inefficient; most of what we pay over the day goes to idle resources.
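To make the idle-time argument concrete, here is a back-of-the-envelope sketch. All the rates below are made-up placeholders, so plug in the actual numbers from each vendor's pricing page.
<code># Back-of-the-envelope comparison -- all rates are illustrative placeholders.
CALLS_PER_DAY = 250
SECONDS_PER_CALL = 10                 # assumed warm processing time per call
ON_DEMAND_RATE_PER_HOUR = 0.50        # placeholder hourly rate for an always-on instance
SERVERLESS_RATE_PER_SECOND = 0.0004   # placeholder combined CPU + memory (+ GPU) rate

on_demand_daily = ON_DEMAND_RATE_PER_HOUR * 24                               # you also pay for idle hours
serverless_daily = SERVERLESS_RATE_PER_SECOND * SECONDS_PER_CALL * CALLS_PER_DAY

print(f"on-demand:  ${on_demand_daily:.2f} per day")
print(f"serverless: ${serverless_daily:.2f} per day")</code>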
A cent or a dollar here and there doesn't seem like much in the first few days, but it adds up over time.
Think about people saving a little money in their savings accounts every day – overpaying here would be the opposite.
But what if we run all 250 calls while the container is in a warm state? How much difference does that make?
Beam seems to be an outlier here, but I think that is because it runs on more CPU than the maximum the other platforms let you use.
In this case, cold starts and idle time disappear. This shows that if you process everything in one go, a persistently running container is the better option; it's much cheaper.
It is worth noting that for Hugging Face endpoints and BentoML, the 400M model did best on a T4 GPU. That setup keeps costs down while significantly cutting processing time.
One thing to note: if you use AWS Lambda with EFS, you will incur additional charges for the NAT gateway, which adds around $1 to $3 per day and makes the total cost higher than shown here.
Now, let's move on to the second case - a larger model with 7B to 8B parameters running on the GPU.
For this case, I tested models in the 7B-8B range, such as Mistral, Gemma, and Llama.
This scenario involves calling the model sporadically, 250 times throughout the day. We assume the container scales up and down for every call, even though that is not always the case.
Just like in the CPU case, we assume the on-demand services run for 24 hours because they don't have time to scale down.
I have made sure to write down the GPU instance we use for each vendor. Please check the bar chart below.
For the serverless providers, I have slightly exaggerated the processing time by multiplying it, but excluded cold starts from the total price calculation.
The actual cost may be lower, but this adjustment errs on the side of caution, since you may also be charged for some of the start-ups.
As we saw in our CPU case, running 250 calls at a time is more cost-effective.
If you run the numbers for the cheapest models from Anthropic and OpenAI and compare them to the cost of self-hosting, you will find that calling them with the same prompts costs far less than hosting your own model.
People call these vendors the McDonald's of LLMs.
We assume open source will be cheaper, but we don't account for the actual unit economics of hosting. These platforms are also subsidized by venture capital. Still, as I mentioned earlier, it is cheaper to access open source models through the off-the-shelf API vendors I linked at the start.
If you want to dig into the detailed calculations, you can view this file. Fair warning – it looks a little messy.
You may have come to your own conclusions so far, but the last thing I want to introduce is user experience.
HF endpoints are very easy to use if you are not a coder, as you can deploy a model from the Hugging Face Hub with a simple click. If you are a bit more technical, you might prefer other options that give you more control.
Replicate has a huge fan base and many public models shared by different people, so there is a community around it. They also have some one-click train-and-deploy flows that make things easier.
However, I found Modal, Beam, and BentoML to have a good developer experience overall. You can deploy directly from the terminal and have the code run on their servers.
For Replicate, if you are deploying your own model, you will need a GPU machine, and for Baseten, you will need to download a library called Truss, which will take some time.
I have collected some of my notes in this table (see below).
If you are keen to try any of these, the table also contains links to starter scripts.
Now that we have covered most of the non-technical aspects, I will walk you through two deployment options for models that perform well on the CPU, AWS Lambda and Modal.
In this section, we will walk through deploying my 400M model fine-tuned for keyword extraction using AWS Lambda with EFS, and compare it with deployment on a newer platform such as Modal.
Both options are serverless, which means we need to cache the model properly at build time so we can access it quickly on subsequent runs. AWS provides a ready-made script that we can easily adjust, and I have prepared a script for Modal here as well.
We will focus on two things: how to deploy the model on each platform, and the key differences in the deployment process.
You can just read through this part, or follow along and deploy yourself.
To follow along, you need git, the AWS CDK, Docker, NodeJS 18, and Python 3.9 installed on your computer. Once you have everything installed, open a new terminal.
Create a new directory if you like, then clone the repository below.
<code>git clone https://github.com/aws-samples/zero-administration-inference-with-aws-lambda-for-hugging-face.git</code>
Enter the created directory.
<code>cd zero-administration-inference-with-aws-lambda-for-hugging-face</code>
You can now open these files in your code editor.
I use VSCode, so I open it like this.
<code>code .</code>
Now we can go into the files we just pulled down and make some adjustments. In the inference folder, you will see two files, sentiment.py and summarization.py.
We can easily change the models in these files to the one we want.
Go to the Hugging Face Hub and find the model you are interested in.
I will use one of my own models.
If you are interested in learning how to build such a model, you can check out my keyword extraction and text classification tutorials.
As you can see, there are two usage options here, but since this script uses pipeline, we can do the same.
<code># inference/summarization.py
import json
from transformers import pipeline

extractor = pipeline("text2text-generation", model="ilsilfverskiold/tech-keywords-extractor")

def handler(event, context):
    # texts should be an array
    texts = event['texts']
    response = {
        "statusCode": 200,
        "body": extractor(texts)[0]
    }
    return response</code>
I changed both scripts to use different models that I typically work with. Once you have finished, make sure to save the scripts.
You can then set up a virtual environment in the terminal.
<code>python3 -m venv venv
source venv/bin/activate</code>
After activating it, install the requirements.
<code>pip install -r requirements.txt</code>
If you run Docker on your computer, you can now deploy it through the terminal.
<code>cdk bootstrap   # first time only, per account and region
cdk deploy</code>
It will create an internet gateway, an EFS file system for caching the models, several Docker-based Lambda functions (hosting the two models from the scripts) with 8 GB of memory and a 600-second timeout, and several IAM roles for Lambda execution, all inside a VPC.
This will take some time.
I was doing this from a small village in Italy, so my internet connection failed me and I ended up renting a GPU machine to deploy from.
After the deployment is complete, you can go to Lambda in the AWS console and look for your new functions. You can test them directly there. The first run will be slower, but once the container is warm, it will be faster.
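If you prefer to test from code rather than the console, a quick sketch with boto3 looks like this; the function name is a placeholder, so copy the real one from the Lambda console.
<code># Invoke the deployed function with a test payload -- FunctionName is a placeholder.
import json
import boto3

client = boto3.client("lambda")
response = client.invoke(
    FunctionName="your-keyword-extractor-function",
    Payload=json.dumps({"texts": ["Serverless GPU platforms are changing how teams ship open source LLMs"]}),
)
print(json.loads(response["Payload"].read()))</code>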
A note here: since the Lambda functions are located in a private subnet (in a VPC), they cannot access the internet directly, which is why AWS creates a NAT gateway for you. Using a NAT gateway is expensive though; it costs about $1-3 per day no matter how much you use it.
We could try putting the Lambda functions in a public subnet, but unfortunately I didn't test that. There may also be a way to get around this by creating a VPC endpoint.
We do need the VPC for EFS so we can cache the models and avoid downloading them on every invocation. Yes, AWS Lambda has a very generous free tier, but we need to watch the other costs that come with adding resources around it.
When you are done, I recommend destroying these resources so you don't pay for the NAT gateway 24/7.
<code>cdk destroy</code>
One more note on this approach: you cannot specify memory and CPU separately. If you need more CPU, you have to increase memory, which can get expensive.
However, I would not completely dismiss AWS Lambda for smaller models with 125M parameters or less, since you can configure the Lambda functions with less memory (see the sketch below).
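As a rough illustration of that, here is how a Docker-based Lambda function's memory can be dialed down in a CDK stack. This is a generic sketch with placeholder names and paths, not the exact stack from the cloned repository.
<code># Generic CDK sketch -- construct ids, paths, and sizes are placeholders.
from aws_cdk import App, Stack, Duration, aws_lambda as lambda_
from constructs import Construct

class InferenceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        lambda_.DockerImageFunction(
            self, "KeywordExtractor",
            code=lambda_.DockerImageCode.from_image_asset("./inference"),
            memory_size=2048,                  # MB -- Lambda scales CPU together with memory
            timeout=Duration.seconds(300),
        )

app = App()
InferenceStack(app, "InferenceStack")
app.synth()</code>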
Modal was created for deploying ML models, which makes this process even simpler. The script we will use here deploys the same model as before, and you can find it here.
When deploying, we can specify memory, CPU, and GPU directly in the function. We can also have an endpoint created for us in the script, which makes it easier to test the model once it is up.
But just because we are using another platform, it doesn't mean it won't cost us some money as well.
Remember the calculations we made before.
To get started, you need a Modal account and python3 installed. Once you have created an account, open a terminal and create a new folder.
<code>mkdir text-extraction && cd text-extraction   # any folder name works</code>
Then we can set up a virtual environment.
<code>python3 -m venv venv
source venv/bin/activate</code>
Use pip to install the Modal package.
<code>pip install modal</code>
With Modal, all resources, environment setup, and execution happen on their platform rather than on your machine, so we don't run into the same problems we had when deploying to AWS.
To authenticate, run this command.
<code>modal setup</code>
Now, if you don't have any files in the folder, create one.
<code>touch text_extraction.py   # name the file whatever you like</code>
You can simply paste in the code from the script linked above; we will also walk through what it does below.
<code>{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ecr:*", "ssm:*", "iam:*", "lambda:*", "s3:*", "ec2:*", "logs:*", "cloudformation:*", "elasticfilesystem:*" ], "Resource": "*" } ] }</code>
Remember, I'm using the same model, you can use another model.
To deploy, just run the following command.
<code>modal deploy text_extraction.py</code>
This script sets up an application called "text-generation" in Modal and builds a Docker image with the required dependencies (huggingface-hub, transformers, and torch).
It installs these dependencies directly in the Modal environment, so you don't have to deal with them locally. The app requests 1 CPU core and 3 GB of memory, which is the setting I used during my testing.
Model caching is handled by @modal.build(), which uses snapshot_download() to pull the model from Hugging Face and save it in /cache. We do this so cold starts are faster.
The @modal.enter() decorator runs on the first call to the TextExtraction class, loading the tokenizer and model from the cached files into memory.
Once the model is loaded, the extract_text() method runs the inference. @modal.web_endpoint sets up a serverless API endpoint that lets you hit extract_text() via a POST request and get the text extraction results back.
The whole process runs in the Modal environment, so we don't have to worry about whether our own computer has enough resources. This matters more for larger models, of course.
After the deployment is complete, you will see something similar to this in the terminal, which contains your endpoints.
You can view this application in the Modal dashboard.
To run this function, you can call the URL you get in the terminal.
<code>curl -X POST "<your-endpoint-url>" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Economics of hosting open source LLMs on serverless platforms"]}'</code>
This does not include authentication; see Modal's documentation to add it.
As you've seen, whichever deployment option you choose, you need to cache the model at build time to ensure faster cold starts after scaling down. If you want to try deploying to any other platform, you can check out all the starter scripts here.
Using a newer platform is not necessarily worse, and it can be much faster to get going. However, sometimes your organization has strict restrictions on which platforms you are allowed to use.
The easier-to-use options may also be slightly more expensive, but the ones I've shown you are not that far off from the cost of using EC2 directly.
If you've read this far, I hope the research I've shared here helps you choose a vendor.
❤