
With 100,000 US dollars + 26 days, a low-cost LLM with 100 billion parameters was born

Sep 20, 2023, 03:49 PM

Large language models (LLMs), including decoder-only architectures (such as the GPT and LLaMA series), encoder-only architectures (such as BERT), and encoder-decoder architectures (such as T5) together with their variants, have achieved remarkable success and are widely used in all kinds of language processing and multi-modal tasks.

Despite this success, training an LLM is so expensive that only a few companies can afford it. Moreover, current trends indicate that even larger training datasets will be used in the future, which will further increase the development cost of large models. For example, LLaMA-1 was trained on 1 to 1.4 trillion tokens, while Llama 2 reaches 2 trillion.

Another key challenge in developing an LLM is evaluation. Mainstream evaluation methods fall into two categories: knowledge evaluation (MMLU and C-Eval) and NLP task evaluation. These methods may not truly reflect a model's capabilities because of possible data leakage: some parts of the evaluation datasets may have been used during model training. Furthermore, knowledge-oriented evaluation may not be adequate for assessing the level of intelligence. A fairer and more objective approach is to measure the intelligence quotient (IQ) of the LLM, that is, its ability to generalize to conditions and contexts not seen in the training data.

Growth strategy. To address the training cost problem, institutions including the Beijing Academy of Artificial Intelligence (Zhiyuan) and the Institute of Computing Technology of the Chinese Academy of Sciences have recently made an attempt: training a 100-billion-parameter-level LLM with a growth strategy for the first time. Growth means that the number of parameters is not fixed during training but expands from a smaller model to a larger one.


  • Paper: https://arxiv.org/pdf/2309.03852.pdf

  • Model link: https://huggingface.co/CofeAI/FLM-101B

Figure 1 shows three typical scenarios for growth strategies. Since the FLOPs of an LLM are roughly proportional to its number of parameters, the area between the curve of model parameters over training and the X-axis represents the computational cost of training.



Figure 1(a) shows the standard training strategy without model growth; 1(b) is a linear growth strategy, which saves 50% of the cost; 1(c) is a moderate growth strategy, which saves less than 50% of the cost; 1(d) is an aggressive growth strategy, which saves more than 50% of the cost. This analysis suggests that, to save as much computational cost as possible, an aggressive growth strategy should be adopted.
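As a rough back-of-the-envelope check on this cost argument, the sketch below compares the relative compute of a fixed-size run, a linear growth schedule, and an aggressive growth schedule, under the simplification that per-token training FLOPs are proportional to the parameter count. The token budget and stage splits are illustrative assumptions, not the paper's actual configuration.

```python
# Relative training compute under different growth schedules, using the
# simplification that per-token FLOPs are roughly proportional to the
# parameter count. All numbers below are illustrative assumptions.

TOTAL_TOKENS = 300e9   # assumed total token budget
TARGET_PARAMS = 100e9  # final model size (100B parameters)

def compute_cost(schedule):
    """Approximate cost as the sum of (parameters * tokens) over training phases."""
    return sum(params * tokens for params, tokens in schedule)

# (a) no growth: the full-size model sees every token
no_growth = [(TARGET_PARAMS, TOTAL_TOKENS)]

# (b) linear growth: the parameter count ramps evenly up to the target size,
#     so the average model size over training is about half the final size
linear_growth = [(TARGET_PARAMS * (i + 0.5) / 10, TOTAL_TOKENS / 10) for i in range(10)]

# (d) aggressive growth: most tokens are consumed while the model is still small
aggressive_growth = [
    (16e9, 0.5 * TOTAL_TOKENS),
    (51e9, 0.3 * TOTAL_TOKENS),
    (100e9, 0.2 * TOTAL_TOKENS),
]

baseline = compute_cost(no_growth)
for name, schedule in [("no growth", no_growth),
                       ("linear growth", linear_growth),
                       ("aggressive growth", aggressive_growth)]:
    print(f"{name:17s} relative cost: {compute_cost(schedule) / baseline:.2f}")
# Prints roughly 1.00, 0.50, and 0.43: the more aggressive the growth,
# the larger the saving relative to training the full-size model throughout.
```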

The design of the growth operators in this new study is inspired by MSG from the paper "2x faster language model pre-training via masked structural growth", a complete set of operations covering all four growth dimensions of the Transformer structure. More importantly, MSG grows the model while strictly preserving its function. Therefore, although a small model can learn quickly in a smaller parameter search space, its knowledge can be inherited by the subsequent larger models. This makes it possible for a growth strategy to achieve better performance at the same or lower computational cost.
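The precise MSG operators are defined in the cited paper; the toy sketch below only illustrates the general idea of function-preserving growth in the simplest possible setting: the hidden width of a two-layer MLP is enlarged while the new input columns of the second layer are zeroed, so the widened network computes exactly the same function as before. This is an assumed simplification for illustration, not the paper's operators for full Transformer blocks.

```python
# Toy illustration of function-preserving width growth (not the actual MSG
# operators): widen the hidden layer of a two-layer MLP while keeping the
# network's output unchanged.
import torch
import torch.nn as nn

def grow_hidden_width(fc1: nn.Linear, fc2: nn.Linear, new_hidden: int):
    old_hidden = fc1.out_features
    assert new_hidden > old_hidden
    new_fc1 = nn.Linear(fc1.in_features, new_hidden)
    new_fc2 = nn.Linear(new_hidden, fc2.out_features)
    with torch.no_grad():
        # Copy the old weights; the extra fc1 rows keep their fresh random init.
        new_fc1.weight[:old_hidden] = fc1.weight
        new_fc1.bias[:old_hidden] = fc1.bias
        # Zero the fc2 columns that read the new hidden units, so their
        # (random) activations contribute nothing until training updates them.
        new_fc2.weight[:, :old_hidden] = fc2.weight
        new_fc2.weight[:, old_hidden:] = 0.0
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Sanity check: the output is identical before and after growth.
torch.manual_seed(0)
fc1, fc2 = nn.Linear(8, 16), nn.Linear(16, 8)
x = torch.randn(4, 8)
before = fc2(torch.relu(fc1(x)))
g1, g2 = grow_hidden_width(fc1, fc2, new_hidden=32)
after = g2(torch.relu(g1(x)))
print(torch.allclose(before, after, atol=1e-6))  # True
```

Because the function is preserved at the moment of growth, whatever the small model has already learned carries over intact into the larger model, which is what makes knowledge inheritance across successive stages possible.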

Open-source FLM-101B model. Researchers at Zhiyuan trained an LLM with 101 billion parameters through gradual growth, and they also stated that they will release the model as open source. The architecture of this model is an evolution of FreeLM, so the researchers named it FLM-101B, where F stands for Free.

The FreeLM framework has two pre-training objectives, guided by language signals and teacher signals respectively. In this new research, the two objectives are unified into a common language modeling paradigm.

IQ assessment benchmark. In addition to the low-cost training paradigm, the team made another contribution: a systematic benchmark for assessing the intelligence quotient (IQ) of LLMs.

Previous research has shown that although perplexity (PPL) can reflect the quality of generated text to a certain extent, it is not reliable. On the other hand, the training data of an LLM is so large that it is difficult to distinguish whether the model is merely reciting knowledge or genuinely achieving human-like reasoning, analysis, and generalization, which are what this study defines as the foundation of IQ. Some commonly used benchmarks (MMLU for English and C-Eval for Chinese) are clearly knowledge-oriented and cannot fully reflect a model's level of intelligence.

As a sanity check, the team ran a test: five computer science researchers from world-renowned universities took an exam using C-Eval's chemistry questions. Their accuracy turned out to be almost as low as random guessing, because most of the volunteers had forgotten what they had learned about chemistry. Therefore, benchmarks that emphasize specialized knowledge are not adequate measures of a model's IQ.

To measure the IQ of an LLM comprehensively, the team developed an IQ assessment benchmark that considers four key aspects of intelligence: symbol mapping, rule understanding, pattern mining, and anti-interference.
  • Language is symbolic in nature. Some studies have used symbols rather than category labels to assess the intelligence level of LLMs. Similarly, the team used a symbol-mapping approach to test an LLM's ability to generalize to unseen contexts (a toy sketch of how such a test item can be built is shown after this list).

  • An important aspect of human intelligence is the ability to understand given rules and act on them accordingly. This form of testing is widely used at all levels of human testing, so rule understanding is the second test here.

  • Pattern mining, which involves both induction and deduction, is an important part of intelligence. It has played a crucial role throughout the history of science, and competition problems often require this ability. For these reasons, pattern mining was chosen as the third evaluation aspect.

  • The last, and very important, aspect is anti-interference, which is also one of the core capabilities of intelligence. Studies have pointed out that both language and images are easily disturbed by noise. With this in mind, the team used anti-interference as the final evaluation metric.
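As referenced in the first bullet above, here is a toy sketch of how a symbol-mapping test item might be constructed: the semantic class labels of an ordinary classification question are replaced with arbitrary symbols, so a model can only answer correctly by generalizing from the in-context mapping rather than from memorized label semantics. The helper and the example item below are hypothetical illustrations, not the benchmark's actual code or data.

```python
# Toy construction of a symbol-mapping item (hypothetical, for illustration):
# semantic labels are replaced by arbitrary symbols so the model must rely on
# the mapping given in context, not on memorized label names.
import random

def to_symbol_mapping_item(question: str, labels: list, answer: str):
    symbols = random.sample(["<<@#>>", "<<&*>>", "<<!%>>", "<<?$>>"], k=len(labels))
    mapping = dict(zip(labels, symbols))
    options = "\n".join(f"{mapping[label]} means {label}" for label in labels)
    prompt = (f"{question}\nAnswer with one of the symbols defined below.\n"
              f"{options}\nAnswer:")
    return prompt, mapping[answer]  # prompt text and the expected symbol

prompt, target = to_symbol_mapping_item(
    question="The review 'a joyless, tedious mess' expresses which sentiment?",
    labels=["positive", "negative"],
    answer="negative",
)
print(prompt)
print("expected answer:", target)
```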

Of course, these four aspects are by no means the final word on LLM IQ assessment, but they can serve as a starting point to stimulate follow-up research and are expected to eventually lead to a comprehensive LLM IQ assessment framework.

The main contributions of this study include:
  • The researchers state that this is the first research attempt to use a growth strategy to train an LLM with more than 100 billion parameters from scratch. It is also currently the lowest-cost model at the 100-billion-parameter scale, costing only 100,000 US dollars.

  • By improving the FreeLM training objective, using a promising hyperparameter search method, and applying function-preserving growth, this study addresses the problem of training instability. The researchers believe these methods can also help the broader research community.

  • The researchers also experimentally compared the new model with previous strong models, using both knowledge-oriented benchmarks and the newly proposed systematic IQ assessment benchmark. The experimental results show that FLM-101B is competitive and robust.

  • The team will release model checkpoints, code, related tools, etc., to promote research on bilingual Chinese-English LLMs at the 100-billion-parameter scale.

FLM-101B Design Overview

Architecturally, FLM-101B uses FreeLM as its backbone and integrates xPos. In terms of model size, thanks to the new growth strategy, the researchers obtain models of three sizes (16B, 51B, and 101B) from a single training run.

As for the pre-training settings, FLM-101B inherits the training strategy of FreeLM.

In terms of the growth strategy, instead of the common practice of training models of different sizes independently, the team sequentially trains three models with 16B, 51B, and 101B parameters, each of which inherits the knowledge of the smaller model before it.

As for training hardware, a cluster of 24 DGX-A800 (8×80GB) GPU servers is used; the training time of FLM-101B is less than 26 days. See Tables 1 and 2 below for the multi-parallel strategy and model configurations.


Training Stability of FLM-101B

To address instability problems such as loss divergence and gradient explosion, the researchers propose a promising solution, briefly described as follows.

Loss prediction. The newly proposed method to achieve training stability is as follows:

First, determine the distribution of the data before starting FLM-16B training.

Next, perform a grid search over three hyperparameters: the learning rate, the initialization standard deviation, and the softmax temperature of the output layer. The grid search is performed by running a surrogate model with a hidden-state dimension (i.e., model width) of 256, 2 attention heads, and about 40 million parameters. All other structural hyperparameters and training data of this surrogate model are the same as for FLM-16B. Using data parallelism on 6 nodes, a grid-search run took 24.6 hours, which roughly translates to 6 hours on a 24-node configuration.
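A schematic of this search is sketched below: the three hyperparameters are swept on the small surrogate model and the combination with the lowest loss is kept. The grid values and the surrogate-training function are stand-in assumptions (the dummy objective only exists to make the sketch runnable); they are not the values or code actually used.

```python
# Schematic grid search over learning rate, initialization std, and output
# softmax temperature on the small surrogate model. Grid values and the
# objective below are illustrative stand-ins, not the actual search.
import itertools

learning_rates = [1e-4, 2e-4, 4e-4, 8e-4]
init_stds      = [0.8e-2, 1.6e-2, 3.2e-2]
softmax_temps  = [1.0, 2.0, 4.0]

def train_surrogate_and_get_loss(lr: float, std: float, temp: float) -> float:
    # Dummy stand-in so the sketch runs end to end. In reality this would
    # train the width-256, 2-head, ~40M-parameter surrogate with the given
    # hyperparameters and return its validation loss.
    return abs(lr - 4e-4) / 4e-4 + abs(std - 1.6e-2) / 1.6e-2 + abs(temp - 2.0) / 2.0

best = None
for lr, std, temp in itertools.product(learning_rates, init_stds, softmax_temps):
    loss = train_surrogate_and_get_loss(lr, std, temp)
    if best is None or loss < best[0]:
        best = (loss, lr, std, temp)

print("best (loss, lr, init_std, softmax_temp):", best)
```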

Through this grid search, the researchers found the optimal hyperparameters: learning rate = 4e-4, standard deviation = 1.6e-2, softmax temperature = 2.0.

They then transfer these hyperparameters to the larger models via µP, which gives a seamless training experience and avoids instability. When combined with MSG, FLM-51B and FLM-101B show no divergence during subsequent growth.
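As a much-simplified sketch of what µP-style transfer looks like for the matrix-like ("hidden") weights under Adam, the snippet below rescales the proxy model's learning rate by the inverse width ratio and its initialization standard deviation by the inverse square root of the width ratio. Real µP distinguishes several parameter types (embeddings, output layer, biases), which are omitted here, and the target width shown is a hypothetical number, not FLM's actual configuration.

```python
# Much-simplified µP-style width transfer for hidden (matrix-like) weights
# under Adam: LR scales as 1/width, init std as 1/sqrt(width), relative to
# the values found on the narrow proxy. Real µP has more parameter-type
# specific rules; this is only an illustrative sketch.

PROXY_WIDTH = 256  # width of the grid-search surrogate

def transfer_hidden_hparams(proxy_lr: float, proxy_init_std: float, target_width: int):
    ratio = target_width / PROXY_WIDTH
    return {
        "learning_rate": proxy_lr / ratio,          # Adam LR for hidden weights ~ 1/width
        "init_std": proxy_init_std / ratio ** 0.5,  # init std for hidden weights ~ 1/sqrt(width)
    }

# Hyperparameters found by the grid search on the width-256 surrogate,
# transferred to a hypothetical target width (8192 is an assumed example).
print(transfer_hidden_hparams(proxy_lr=4e-4, proxy_init_std=1.6e-2, target_width=8192))
```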

Figure 2 shows the complete training loss curve.


Mixed precision via bfloat16. Mixed precision is used to save memory and time at runtime; here the team chose bfloat16.
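As a minimal illustration (not the FLM training code), the PyTorch snippet below runs a forward pass under bfloat16 autocast while keeping the parameters and optimizer state in float32; unlike float16, bfloat16 retains float32's exponent range, so no loss scaler is needed.

```python
# Minimal bfloat16 mixed-precision step in PyTorch (illustrative only):
# the forward pass runs in bfloat16 under autocast, while the parameters
# and optimizer state remain in float32.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()   # gradients flow back through the autocast region
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.4f}")
```
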
Benchmark Evaluation

Table 3 compares the performance of FLM-101B with other strong baseline models (the LLAMA series and GLM-130B).


The researchers state that these results indicate that FLM-101B has no advantage in factual knowledge, but that its performance would continue to improve if more training data were used.

Table 4 shows the results of eFLM-16B and the baseline models on the professional-knowledge evaluation.


It turns out that scores on datasets emphasizing professional knowledge do not reflect the intelligence level of an LLM, since certain specific training data can make an overwhelming contribution to them.

Table 5 shows the performance of each stage of the FLM model.


As expected, the performance of FLM improves as the model grows. FLM-101B performs best on almost every task, which means the model inherits knowledge from the previous stage each time it grows.

IQ Experiment

In the experiments, to evaluate the IQ of LLMs more systematically, the team used existing IQ-related datasets with some necessary modifications, and also generated some new synthetic data.

Specifically, the IQ assessment they proposed mainly considers four aspects: symbol mapping, rule understanding, pattern mining, and anti-interference. These tasks have one key thing in common: they all rely on reasoning and generalization in new contexts.

The following tables show the results of the IQ experiment:


These tables show that, on the four IQ evaluation benchmarks, FLM-101B achieves results comparable to GPT-3 and better than GLM-130B, at a much lower computational cost.

Beyond the effect of training data, the researchers speculate that this advantage may stem from the small model in the early stages refining a smaller search space, an advantage that continues to pay off, with enhanced generalization, as the model grows deeper and wider.
