Large models also come in larger and smaller sizes, and their size is measured by the number of parameters. GPT-3 has 175 billion parameters, and Grok-1 is even more impressive, with 314 billion parameters. Of course, there are also slimmer ones like Llama, whose parameter count ranges from only 7 billion to 70 billion.
The 70B mentioned here does not refer to the amount of training data, but to the number of densely packed parameters in the model. These parameters are like small "brain cells": the more there are, the more capacity the model has to capture the intricate relationships in the data, and the better it may perform at tasks. However, a very large number of parameters can also cause problems, especially in large-scale models. These "brain cells" may interact with each other when processing a task, which can make it harder for the model to capture the complex relationships in the data. We therefore need a way to manage the relationships between these parameters during training, and a common method is regularization. The parameters of these large models are also like the "architects" inside the model: through complex algorithms and training processes, this huge language world is built bit by bit. Each parameter has its role, and together they allow the model to understand our language more accurately and give more appropriate answers.
So, how are the parameters in the large model composed?
1. Parameters in the large model
2. Memory requirements for large model parameters
2.1 Memory requirements during the training phase
At any time during training, for each model parameter, there always needs to be enough GPU memory to store:
This means that the following amount of memory is required to store all model states and process data during training: (x + y + 12) * model_size
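The formula above can be tried out with a small sketch. It assumes that x and y are the bytes per parameter used for the weight copy and the gradient copy, and that 12 is the bytes per parameter of optimizer state; the function name and the sample values x = 2 and y = 2 are hypothetical and chosen only for illustration.

```python
def training_state_memory_gb(params_billions, x_bytes, y_bytes, optimizer_bytes=12):
    """Evaluate (x + y + 12) * model_size.

    model_size is given in billions of parameters, so the result is in GB
    (taking 1 GB as 10^9 bytes).
    """
    return (x_bytes + y_bytes + optimizer_bytes) * params_billions


# Hypothetical setting: 2 bytes per weight, 2 bytes per gradient (assumption).
mem_7b = training_state_memory_gb(7, x_bytes=2, y_bytes=2)
print(f"7B model, x=2, y=2: about {mem_7b} GB of model state")
```

This only covers model states; activations come on top of it.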
The inference phase uses a pre-trained LLM to complete tasks such as text generation or translation. Here, memory requirements are typically lower, with the main influencing factors being:
For the same parameter count and type, the inference phase requires no more than a quarter of the memory required by the training phase. For example, a 7B model generally requires 28GB of memory at FP32 (32-bit floating point) precision, 14GB at BF16 precision, and 7GB at int8 precision. This rough estimation method can be applied to other model sizes accordingly.
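The rule of thumb above is simply the parameter count multiplied by the bytes per parameter. A minimal sketch follows; the function and dictionary names are hypothetical, and only the weights are counted, not activations or cache.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(params_billions, precision):
    """Memory for the weights alone, in GB (1 GB taken as 10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: {weight_memory_gb(7, precision)} GB")
```

Calling `weight_memory_gb(70, "bf16")` gives the same kind of estimate for a 70B model.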
Also, fine-tuning an LLM for a specific task requires a higher memory footprint. Fine-tuning typically involves longer training sequences to capture the nuances of the target task, which leads to larger activations as the LLM processes more text data. The backpropagation process requires storing intermediate values for gradient calculations, which are used to update the model's weights during training. This adds a significant memory load compared to inference.
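Specifically, for a large model based on the Transformer, we can try to calculate the memory required for training, where we set: L as the number of layers, b as the training batch size, s as the sequence length, h as the hidden layer dimension, a as the number of attention heads, and p as the precision (bytes per value).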
Here, bshp = b * s * h * p represents the size of the input data. In the linear layer part of the transformer, approximately 9bshp+bsh of space is needed for subsequent activations. In the attention part, self-attention can be expressed as: softmax((XQ)(XK)^T)XV
Then, XQ, XK, and XV each require bshp-sized space. In standard self-attention, the result of multiplying (XQ)(XK)^T is just a b * s * s matrix containing the logits. However, in practice, because a multi-head attention mechanism is used, a separate s * s storage space needs to be established for each head. This means that abssp bytes of space are required, and storing the output of the softmax also requires abssp bytes. After the softmax, an additional abss bytes are generally needed to store the mask, so the attention scores require 2abssp + abss of storage space.
In addition, there are two Norm layers in the transformer, each of which still requires bshp storage space, for a total of 2 bshp.
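So, counting per layer the linear part (9bshp + bsh), the bshp-sized attention tensors (the attention input, XQ, XK, XV, and the input to the output projection, 5bshp in total, plus a bsh-byte dropout mask), the attention scores (2abssp + abss), and the two Norm layers (2bshp), the memory required for training a Transformer-based large model is approximately: L(9bshp + bsh + 5bshp + bsh + 2abssp + abss + 2bshp) = Lbshp[16 + 2/p + (as/h)(2 + 1/p)]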
In other words, the memory required to train a Transformer-based large model is approximately: the number of layers x the training batch size x the sequence length x the hidden layer dimension x the precision (bytes per value) x a factor greater than 16
This may be regarded as a theoretical lower bound on the memory requirements of Transformer-based large models during training.
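To get a feel for the magnitude, the closed form Lbshp[16 + 2/p + (as/h)(2 + 1/p)] can be evaluated directly. The sketch below assumes a = number of attention heads and p = bytes per value; the function name and the sample configuration are hypothetical and do not describe any particular model.

```python
def activation_memory_bytes(num_layers, b, s, h, a, p):
    """Evaluate L*b*s*h*p * [16 + 2/p + (a*s/h) * (2 + 1/p)] in bytes."""
    factor = 16 + 2 / p + (a * s / h) * (2 + 1 / p)
    return num_layers * b * s * h * p * factor


# Hypothetical configuration: 32 layers, batch 1, 2048 tokens,
# hidden size 4096, 32 heads, 2 bytes per value (e.g. BF16).
total = activation_memory_bytes(num_layers=32, b=1, s=2048, h=4096, a=32, p=2)
print(f"approx. {total / 2**30:.1f} GiB of activations")
```

Because the last term grows with a*s/h, doubling the sequence length more than doubles the estimate, which is why long sequences are expensive to train on.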
With the memory requirements for large model parameters in hand, we can further estimate the number of GPUs required for training and inference of large models. However, since an exact estimate of the number of GPUs depends on more parameters, Dr. Walid Soula (https://medium.com/u/e41a20d646a8) has given a simple formula for rough estimation, which also has certain reference value in engineering.
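Total number of GPUs ≈ (model parameters in billions × 18 × 1.25) / GPU memory in GB
Among them, the first term is the model's parameter count in billions, 18 and 1.25 are the constant factors of the formula, and the denominator is the memory of a single GPU in GB.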
As a practical example, assuming that an NVIDIA RTX 4090 GPU with 24GB of VRAM is used, the number of GPUs required to train the 'Llama3 7B' model is approximately:
The total number of GPUs≈(7 * 18 * 1.25)/24, which is approximately equal to 7
For inference, the requirement can be roughly estimated as 1/8 to 1/9 of that of the training stage. Of course, these are only rough estimates in a general sense.
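The worked example above can be reproduced with a few lines. This is only a sketch of the rough formula: the function names are hypothetical, the constants 18 and 1.25 are taken from the example, and the result is rounded up because a fraction of a GPU cannot be used.

```python
import math


def training_memory_gb(params_billions, factor=18, overhead=1.25):
    """Rough training memory in GB: parameters (in billions) * 18 * 1.25."""
    return params_billions * factor * overhead


def estimate_training_gpus(params_billions, gpu_memory_gb):
    """Number of GPUs needed for training, rounded up to a whole GPU."""
    return math.ceil(training_memory_gb(params_billions) / gpu_memory_gb)


gpus = estimate_training_gpus(7, gpu_memory_gb=24)
train_gb = training_memory_gb(7)
# Inference is taken as roughly 1/8 to 1/9 of the training requirement.
infer_low, infer_high = train_gb / 9, train_gb / 8
print(gpus, train_gb, infer_low, infer_high)
```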
Understanding the composition of large model parameters and their requirements for memory and GPUs helps to understand in depth the challenges that distributed training faces in engineering practice.
The implementation process of distributed training strategies can be significantly simplified by adopting frameworks designed for distributed training, such as TensorFlow or PyTorch, which provide rich tools and APIs. By using techniques such as gradient accumulation before updating the model, or using techniques such as gradient compression to reduce the amount of data exchange between nodes, communication costs can be effectively reduced. It is crucial to determine the optimal batch size for distributed training (the parameter b mentioned above); a b value that is too small may increase communication overhead, while a value that is too large may result in insufficient memory.
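The idea behind gradient accumulation can be sketched without any framework. The toy one-parameter model, the data, and all names below are hypothetical; the point is only that averaging the gradients of several micro-batches gives the same update as one large batch, so in a distributed setting the gradients would need to be exchanged once per accumulated step instead of once per micro-batch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w, lr = 0.0, 0.01
micro_batch_size = 2
accum_steps = len(data) // micro_batch_size

# Accumulate scaled gradients over micro-batches without updating w.
accumulated = 0.0
for step in range(accum_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    accumulated += grad_mse(w, micro_batch) / accum_steps

# A single update (and, in a distributed run, a single exchange) at the end.
w_after_accumulation = w - lr * accumulated

# Reference: the same update computed from the full batch at once.
full_batch_grad = grad_mse(w, data)
w_after_full_batch = w - lr * full_batch_grad
```

The equivalence holds here because all micro-batches have the same size; with uneven micro-batches the scaling would have to be weighted accordingly.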
The importance of LLMOps has become increasingly prominent. Regularly monitoring the performance indicators configured for distributed training and adjusting hyperparameters, partitioning strategies, and communication settings to optimize performance are key to improving training efficiency. Implementing a checkpointing mechanism for the model and efficient recovery in the event of failure ensures that the training process continues without having to start from scratch.
In other words, the training/inference of large models is essentially a complex distributed system architecture engineering challenge, such as:
However, in practice, most engineers may not be directly involved in the training work itself, but instead focus on how to make use of the parameters of large models when building applications.
The main focus here is on how to configure three parameters when using a large model to output text: Temperature, Top-K, and Top-P.
The Temperature parameter is often misunderstood as a switch that only controls the creativity of the model, but its deeper role is to adjust the "softness" of the probability distribution. When the Temperature value is set higher, the probability distribution becomes softer and more uniform, which encourages the model to generate more diverse and creative output. Conversely, lower Temperature values make the distribution sharper with more pronounced peaks, so the model tends to produce more predictable output that stays close to the patterns seen in the training data.
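The effect of Temperature on the distribution can be seen with a small sketch that divides the logits by the temperature before applying softmax. The function name and the sample logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)    # lower temperature
neutral = softmax_with_temperature(logits, 1.0)
soft = softmax_with_temperature(logits, 2.0)     # higher temperature
print(sharp, neutral, soft)
```

The most likely token receives the largest share of probability at temperature 0.5 and the smallest share at temperature 2.0.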
The Top-K parameter restricts sampling at each step to the K most likely tokens. In this way, incoherent or meaningless content in the output can be reduced. This strategy strikes a balance between keeping the output as consistent as possible and still allowing a certain degree of creative sampling.
Top-P is another decoding method: given a P value (0≤P≤1), it selects the smallest set of words whose cumulative probability reaches P and samples the output from that set. This allows the number of candidate words to grow or shrink dynamically with the probability distribution of the next word. In particular, when P is 1, Top-P selects all words, which is equivalent to sampling from the entire distribution and produces more diverse output; when P is 0, Top-P selects only the word with the highest probability, which is similar to greedy decoding and makes the output more focused and consistent.
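A minimal sketch of the Top-P selection step is shown below, including the two boundary cases. The function name and the sample probabilities are hypothetical, and ties between equal probabilities are broken by position, which is an arbitrary choice.

```python
def top_p_filter(probs, p):
    """Indices of the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break
    return kept


probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.7)      # stops once 0.5 + 0.3 reaches 0.7
everything = top_p_filter(probs, 1.0)   # the whole distribution
greedy = top_p_filter(probs, 0.0)       # only the most likely token
```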
These three parameters work together to affect the behavior of the model. For example, when setting Temperature=0.8, Top-K=36, and Top-P=0.7, the model first calculates the complete set of unnormalized log probabilities (logits) over the entire vocabulary based on the context. Temperature=0.8 means that each logit is divided by 0.8, which effectively increases the model's confidence in its predictions before normalization. Top-K=36 means keeping the 36 tokens with the highest logits. Then, Top-P=0.7 filters within this Top-K=36 set, keeping tokens in order from high to low probability until the cumulative probability reaches 0.7. Finally, this filtered set is renormalized and used in the subsequent sampling process.
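The steps above can be sketched as a single sampling function. This is an illustration, not the implementation of any particular library: the function name is hypothetical, and normalizing the probabilities over the Top-K set before applying Top-P is an assumption, since implementations differ on that detail.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=36, top_p=0.7, rng=random):
    """Return (sampled token index, list of candidate indices)."""
    # 1. Temperature: divide every logit by the temperature.
    scaled = [x / temperature for x in logits]
    # 2. Top-K: keep the indices of the k largest scaled logits.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]
    # 3. Softmax over the kept tokens (assumption: normalize within Top-K).
    peak = scaled[order[0]]
    exps = [math.exp(scaled[i] - peak) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-P: keep tokens from high to low until the cumulative mass reaches top_p.
    kept, weights, cumulative = [], [], 0.0
    for index, prob in zip(order, probs):
        kept.append(index)
        weights.append(prob)
        cumulative += prob
        if cumulative >= top_p:
            break
    # 5. Sample; random.choices renormalizes the weights internally.
    token = rng.choices(kept, weights=weights, k=1)[0]
    return token, kept


# Hypothetical vocabulary of 100 tokens with gently increasing logits.
vocab_logits = [0.1 * i for i in range(100)]
token, candidates = sample_next_token(vocab_logits, rng=random.Random(0))
```

When one logit dominates, for example `[10.0] + [0.0] * 99`, the candidate set shrinks to that single token, which matches the behavior described for sharply peaked distributions.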
In engineering practice, it is meaningful to understand the parameters of large models. Parameters play a decisive role in large models. They define the behavior, performance, implementation costs, and resource requirements of large models. Understanding the parameters of a large model in engineering means grasping the relationship between the complexity, performance, and capabilities of the model. Properly configuring and optimizing these parameters from the perspective of storage and computing can better select and optimize models in practical applications to adapt to different task requirements and resource constraints.