
OpenAI or DIY? Uncovering the true cost of self-hosting large language models

Apr 22, 2024 pm 06:01 PM

Your service has been positioned as "AI-driven" through the integration of large language models. Your website homepage proudly showcases the revolutionary impact of these AI-driven services through interactive demos and case studies. This is also the first mark your company has left in the global GenAI field.

Your small but loyal user base is enjoying an improved customer experience, and you can see potential for future growth. However, as the month enters its third week, you receive an email from OpenAI that catches you off guard: just a week ago you were talking to customers to assess product-market fit (PMF), and now thousands of users are flocking to your site (anything can go viral on social media these days) and crashing your AI-driven service.

As a result, your once-reliable service not only frustrates existing users, but also affects new users.

A quick and obvious solution is to restore service immediately by increasing the usage limit.

However, this temporary solution brings with it a sense of unease. You can't help but feel locked into a reliance on a single vendor, with limited control over your own AI and its associated costs.

"Should I do it myself?" you ask yourself.

You already know that open source large language models (LLMs) have become a reality. On platforms like Hugging Face, thousands of models are available for immediate use, which provides the possibility for natural language processing.

However, the most powerful LLMs you will encounter have billions of parameters, run into hundreds of gigabytes, and require significant effort to scale. In a real-time system that requires low latency, you can't simply plug them into your application as you can with traditional models.

While you may be confident in your team's ability to build the necessary infrastructure, the real concern is the cost implications of this transformation, including:

  • Fine-tuning cost
  • Hosting cost
  • Serving cost

So, the big question is: should you increase the usage limit, or should you go the self-hosted, otherwise known as the "own", route?

Do some calculations using LLaMA 2

First of all, don’t rush. This is a big decision.

If you consult your machine learning (ML) engineer, they will probably tell you that LLaMA 2 is an open-source LLM that seems like a good choice, because on most tasks it performs as well as the GPT-3 you are currently using.

You will also find that the model comes in three sizes - 7 billion, 13 billion, and 70 billion parameters - and you decide to use the largest, the 70-billion-parameter model, to stay competitive with the OpenAI model you are currently using.

LLaMA 2 uses bfloat16 for training, so each parameter consumes 2 bytes. This means the model size will be 140 GB.
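That figure is easy to sanity-check. A minimal sketch, assuming pure bfloat16 weights with no additional buffers:

```python
# Rough model-size estimate: parameter count x bytes per parameter.
params = 70e9          # LLaMA 2 70B
bytes_per_param = 2    # bfloat16
size_gb = params * bytes_per_param / 1e9
print(f"{size_gb:.0f} GB")  # 140 GB
```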

If you think this model is too large to fine-tune, don't worry. With LoRA, you don't need to fine-tune the entire model before deployment.

In fact, you may only need to fine-tune about 0.1% of the total parameters, roughly 70 million, which consume 0.14 GB in bfloat16 representation.

Impressive, right?

To accommodate memory overhead during fine-tuning (e.g., backpropagation, storing activations, storing the dataset), a good rule of thumb is to reserve approximately 5 times the memory consumed by the trainable parameters.

Let's break it down in detail:

When using LoRA, the weights of the LLaMA 2 70B model are frozen, so they add no fine-tuning overhead → memory requirement = 140 GB.

However, in order to tune the LoRA layers, we need to maintain 0.14 GB × 5 = 0.7 GB.

This results in a total memory requirement of approximately 141 GB during fine-tuning.
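The steps above can be sketched as a quick calculation (the 0.1% LoRA fraction and the 5× overhead multiplier are rules of thumb from the text, not exact values):

```python
# Fine-tuning memory with LoRA: frozen base weights + overhead on trainable params.
base_weights_gb = 140.0                    # frozen LLaMA 2 70B weights in bfloat16
trainable_params = 70e9 * 0.001            # ~0.1% of parameters trained via LoRA
trainable_gb = trainable_params * 2 / 1e9  # 0.14 GB in bfloat16
overhead_gb = 5 * trainable_gb             # gradients, optimizer states, activations
total_gb = base_weights_gb + overhead_gb
print(f"{total_gb:.1f} GB")  # 140.7 GB
```

That 140.7 GB is what the text rounds to "approximately 141 GB".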

Assuming you don't currently have training infrastructure, we assume you prefer to use AWS. According to AWS EC2 on-demand pricing, a suitable GPU instance costs about $2.80 per hour, so fine-tuning costs about $67 per day. This is not a huge expense, because fine-tuning does not last for many days.
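As a sketch (the $2.80/hour figure is the assumed on-demand rate for a suitable GPU instance; actual AWS pricing varies by instance type and region):

```python
# Daily fine-tuning cost at an assumed on-demand GPU rate.
hourly_rate_usd = 2.80
daily_cost_usd = hourly_rate_usd * 24
print(f"${daily_cost_usd:.2f} per day")  # $67.20 per day
```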

Artificial intelligence is the opposite of a restaurant: the main cost is in serving rather than preparation

When deploying, you need to keep two sets of weights in memory:

  • The model weights, consuming 140 GB of memory.
  • The LoRA fine-tuned weights, consuming 0.14 GB of memory.

The total is 140.14 GB.

Of course, you can skip gradient computation at inference time, but it is still advisable to maintain about 1.5× this memory, roughly 210 GB, to account for any unexpected overhead.

Again based on AWS EC2 on-demand pricing, GPU compute costs approximately $3.70 per hour, which works out to approximately $90 per day to keep the model in production memory and respond to incoming requests.

This equates to about $2,700 per month.

Another thing to consider is that unexpected failures happen all the time. If you don't have a backup mechanism, your users will stop receiving model predictions. If you want to prevent this from happening, you need to maintain another redundant model in case the first model request fails.

So this would bring your cost to $180 per day, or $5,400 per month. That is already close to your current cost of using OpenAI.
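Putting the serving numbers together (again using the assumed $3.70/hour GPU rate and a 30-day month; actual prices vary):

```python
# Serving cost: one always-on GPU instance, doubled for a redundant replica.
hourly_rate_usd = 3.70
daily_single = hourly_rate_usd * 24   # ~$88.80, rounded to ~$90 in the text
daily_redundant = daily_single * 2    # ~$177.60, rounded to ~$180
monthly = daily_redundant * 30        # ~$5,328, rounded to ~$5,400
print(f"${daily_redundant:,.0f}/day, ${monthly:,.0f}/month")
```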

At what point do the costs of OpenAI and open-source models break even?

If you continue to use OpenAI, here is the number of words you can process per day to match the fine-tuning and serving costs of LLaMA 2 calculated above.

According to OpenAI’s pricing, fine-tuning GPT 3.5 Turbo costs $0.0080 per 1,000 tokens.

Assuming most words are about two tokens, to match the fine-tuning cost of the open-source LLaMA 2 70B model ($67 per day), you would need to feed the OpenAI model approximately 4.15 million words.

Typically, an A4 page holds about 300 words, which means you could feed the model about 14,000 pages of data before matching the open-source fine-tuning cost. That is a huge number.
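A sketch of that break-even arithmetic, using the assumptions from the text (~2 tokens per word, 300 words per page); the exact result is ~4.19M words and ~13,960 pages, which the text rounds to 4.15M words and 14,000 pages:

```python
# Words of fine-tuning data that $67/day buys at OpenAI's fine-tuning price.
daily_budget_usd = 67.0
price_per_1k_tokens = 0.0080     # fine-tuning GPT-3.5 Turbo
tokens = daily_budget_usd / price_per_1k_tokens * 1000  # ~8.375M tokens
words = tokens / 2                                      # ~4.19M words
pages = words / 300                                     # ~13,960 A4 pages
print(f"{words:,.0f} words, about {pages:,.0f} pages")
```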

You probably don't have that much fine-tuning data, so fine-tuning with OpenAI will almost always be the cheaper option.

Another point worth noting is that OpenAI's fine-tuning cost depends not on training time but on the amount of fine-tuning data. This is not the case with open-source models, where the cost depends on both the amount of data and how long you occupy AWS compute resources.

As for the serving cost, according to OpenAI's pricing page, a fine-tuned GPT-3.5 Turbo costs $0.003 per 1,000 input tokens and $0.006 per 1,000 output tokens.

Assuming an average of $0.004 per 1,000 tokens, reaching the $180-per-day cost requires processing approximately 22.2 million words per day through the API.

This equates to over 74,000 pages of data, with 300 words per page.
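The same arithmetic for serving, using the blended $0.004 per 1,000 tokens (exact division gives 22.5M words and 75,000 pages, which the text states as ~22.2M words and over 74,000 pages):

```python
# Daily word volume at which OpenAI serving spend matches $180/day of self-hosting.
daily_budget_usd = 180.0
blended_price_per_1k = 0.004     # average of input and output token prices
tokens = daily_budget_usd / blended_price_per_1k * 1000  # 45M tokens
words = tokens / 2                                       # 22.5M words
pages = words / 300                                      # 75,000 pages
print(f"{words:,.0f} words, about {pages:,.0f} pages")
```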

However, the benefit is that you don’t need to ensure the model is running 24/7 as OpenAI offers pay-per-use pricing.

If your model is never used, you pay nothing.

Summary: When does ownership really make sense?

At first, moving to self-hosted AI may seem like a tempting endeavor. But beware of the hidden costs and headaches that come with it.

Aside from the occasional sleepless night where you wonder why your AI-driven service is down, almost all of the difficulties of managing LLMs in production systems disappear if you use a third-party provider.

This is especially true when "AI" is not your core offering, but merely one component of the service you provide.

For large enterprises, the annual cost of ownership of $65,000 may be a drop in the bucket, but for most enterprises, it is a number that cannot be ignored.

Additionally, we should not forget about other additional expenses such as talent and maintenance, which can easily increase the total cost to over $200,000 to $250,000 per year.

Of course, owning your model from the beginning has its benefits, such as maintaining control over your data and how it is used.

But to make self-hosting financially viable, you will need user request volume well beyond the break-even point of roughly 22.2 million words per day, as well as the resources to manage both the talent and the infrastructure.

For most use cases, it may not be financially worthwhile to have a model instead of using an API.
