Generally speaking, the more compute it takes to train a neural network, the better its performance. When scaling up training, a decision must be made: increase the number of model parameters or increase the size of the data set; both factors must be weighed within a fixed compute budget. The advantage of increasing the parameter count is that it raises the model's capacity and expressive power, allowing it to fit the training data better. However, too many parameters can lead to overfitting, making the model perform poorly on unseen data. Expanding the data set, on the other hand, can improve the model's generalization and reduce overfitting.
The key point is this: as long as parameters and data are allocated appropriately, performance can be maximized under a fixed compute budget. Many previous studies have explored the scaling laws of neural language models, and these studies usually concluded that the number of parameters and the number of training tokens should be scaled up one-to-one.
However, previous research on language-model scaling laws was based on Transformers trained on web text scraped from across the internet. This is a very specific data distribution, so a natural question arises: can scaling laws obtained on such web-text data sets generalize to other distributions?
Considering that improving data quality can significantly improve language-model performance, and that scaling laws in reinforcement learning appear to scale with game difficulty, perhaps we can assume that the current language-model scaling law (i.e. Chinchilla) only covers the specific case of web-text data, and that behind it lies a broader scaling law based on the attributes of the training data.
So, which properties of the token-sequence data sets used for training are neural scaling laws sensitive to? In other words, if we want to accurately predict how best to allocate compute during training, which properties of the data should we look at? And is the data dependence of scaling laws merely a theoretical issue, or does it also matter for real-world data sets?
To explore these questions, Rohan Pandey, a researcher at the AI data company Reworkd, carried out a study and arrived at answers; in addition, he showed that gzip, an off-the-shelf compression algorithm, can predict the impact of data complexity on scaling properties.
His research method: use information-theoretic tools to understand the reasons for the data dependence of scaling laws, in a text-data setting where complexity can be controlled intuitively.
The setting he settled on is the probabilistic context-free grammar (PCFG, first proposed by Chomsky in 1956). This setting is relatively natural (it can model natural language, code, and so on), has controllable syntactic complexity, and follows some well-understood information-theoretic principles.
In the experiments, by adjusting the syntactic properties of the PCFG, he generated 6 data sets of different complexity. For each data set, he trained language models of 6 different sizes (from 4.2M to 1.4B parameters) and recorded their results at 6 different amounts of training data (100K to 100M tokens). He then fit a scaling law to each data set and found that the scaling-law parameters varied meaningfully with syntactic complexity. Following previous work on entropy in formal grammars, for the complexity metric he used the median compressibility of the token sequences in each data set, which can easily be computed with gzip.
He found that as the training data becomes harder to compress (i.e. more complex), the compute-optimal frontier of the scaling law gradually shifts its preference from parameter count toward data size. He then measured the compressibility of real-world code and natural-language data sets and found that code is more compressible and is therefore predicted to obey a different scaling law.
The probabilistic context-free grammar (PCFG) is a basic tool in computational linguistics used to model the syntax of natural language. A PCFG extends the standard context-free grammar (CFG) by associating probabilities with its production rules, representing the ambiguity and variability of language in a quantifiable way. These grammars generate trees in which each node represents a syntactic category and each edge represents a production rule used to generate a sentence. When generating sentences from a PCFG, sequences of production rules are sampled probabilistically until all leaf nodes of the tree are terminals (actual lexical tokens).
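As a rough illustration (not the paper's code), here is how a tiny PCFG can be defined and sampled with NLTK; the toy grammar and its probabilities are made up for this example:

```python
# A minimal sketch of PCFG sampling with NLTK; the toy grammar is invented for illustration.
import random
from nltk.grammar import PCFG, Nonterminal

toy_grammar = PCFG.fromstring("""
    S  -> NP VP [1.0]
    NP -> 'the' N [0.7] | N [0.3]
    VP -> V NP [0.6] | V [0.4]
    N  -> 'cat' [0.5] | 'dog' [0.5]
    V  -> 'sees' [1.0]
""")

def sample_sentence(grammar, symbol=None):
    """Expand a symbol by probabilistically choosing productions until only terminals remain."""
    symbol = symbol if symbol is not None else grammar.start()
    if not isinstance(symbol, Nonterminal):
        return [symbol]  # a terminal: an actual lexical token
    productions = grammar.productions(lhs=symbol)
    chosen = random.choices(productions, weights=[p.prob() for p in productions])[0]
    return [tok for part in chosen.rhs() for tok in sample_sentence(grammar, part)]

print(sample_sentence(toy_grammar))  # e.g. ['the', 'cat', 'sees', 'dog']
```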
We can control the syntactic properties of a PCFG to scale the complexity of the resulting text data sets in a natural way. Specifically, the parameters the PCFG-creation function accepts include: the number of terminals, the number of non-terminals, the maximum length of the right-hand side (RHS) of a production rule, and the maximum number of production rules allowed for any non-terminal (if this value is 1, a given non-terminal always produces the same right-hand side). Intuitively, increasing any of these values increases syntactic complexity.
To create a PCFG from these parameters, for each non-terminal he randomly chooses its number of productions (RHS options) and the length of each of those productions, instantiates each production rule by randomly sampling from the terminals and non-terminals, and assigns it a probability (normalized by the non-terminal's total number of RHS options). Then the production rules of all non-terminals are collected and a grammar is instantiated with NLTK's PCFG package.
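A rough sketch of this construction might look as follows; the exact sampling scheme in the paper may differ, and the function name and default parameter values here are only illustrative:

```python
# Sketch of instantiating a random PCFG under the constraints described above.
# The paper's exact sampling scheme may differ; names and defaults are illustrative.
import random
from nltk.grammar import PCFG, Nonterminal, ProbabilisticProduction

def make_random_pcfg(n_terminals=10, n_nonterminals=5, max_rhs_len=3, max_rhs_options=2, seed=0):
    rng = random.Random(seed)
    terminals = [str(i) for i in range(1, n_terminals + 1)]            # integer token IDs as strings
    nonterminals = [Nonterminal(f"NT{i}") for i in range(n_nonterminals)]
    symbols = terminals + nonterminals
    productions = []
    for nt in nonterminals:
        n_options = rng.randint(1, max_rhs_options)                    # number of RHS options for this non-terminal
        for _ in range(n_options):
            rhs_len = rng.randint(1, max_rhs_len)
            rhs = tuple(rng.choice(symbols) for _ in range(rhs_len))   # mix of terminals and non-terminals
            # probability mass is split uniformly over this non-terminal's RHS options
            productions.append(ProbabilisticProduction(nt, rhs, prob=1.0 / n_options))
    return PCFG(nonterminals[0], productions)                          # NT0 serves as the start symbol

grammar = make_random_pcfg()
```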
This grammar (randomly created under the given constraints) is then used to probabilistically sample sentences and build a token-sequence data set. To make it easier to later compare training across different grammars (which generate sentences of different average lengths), he decided to pack the sampled sentences into documents with the same number of tokens: sentences are sampled from the grammar until the context length is filled, and any overflow is simply truncated.
Sentences consist of terminals that are just integers, so they can be treated directly as a language model's token IDs; the sentences are then joined with the unused integer 0 (which effectively plays the role of the period in natural language). To be clear, this is not a matter of generating a string that "looks like" natural language and then tokenizing it: the PCFG generates the token-ID sequences directly. Based on 6 sets of initial grammatical constraints, 6 token-sequence data sets of different complexity can now be generated.
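Packing the sampled sentences into fixed-length documents could then look roughly like this (a sketch built on the sample_sentence helper above, assuming the sampled grammar terminates):

```python
# Sketch: pack sampled sentences into fixed-length documents of token IDs,
# joining sentences with the reserved ID 0 and truncating any overflow.
def build_document(grammar, context_length=1024):
    doc = []
    while len(doc) < context_length:
        sentence = [int(tok) for tok in sample_sentence(grammar)]  # terminals are integer token IDs
        doc.extend(sentence + [0])                                 # 0 plays the role of a period
    return doc[:context_length]                                    # truncate overflow
```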
To estimate the complexity of the generated data sets as well as of real data sets, Rohan Pandey chose to use gzip, a standard compression algorithm.
One advantage of gzip is that it rests on a solid theoretical foundation, which shows that compressibility is inversely proportional to entropy, and entropy is directly proportional to syntactic complexity. Specifically, for each 1000-token sequence in a data set, compress it with gzip and compute the ratio of the compressed size (in bytes) to the original size.
Then the median and standard deviation of these compressibility ratios are computed, confirming that grammars with higher syntactic complexity yield data sets that are harder to compress.
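In code, the measurement could be sketched like this (the exact way tokens are serialized to bytes before compression is an assumption here):

```python
# Sketch of the gzip compressibility measurement: for each ~1000-token chunk,
# compute compressed size / original size, then summarize with median and std.
import gzip
import statistics

def compressibility(token_chunks):
    ratios = []
    for chunk in token_chunks:                      # each chunk: a sequence of ~1000 token IDs
        raw = " ".join(map(str, chunk)).encode("utf-8")
        ratios.append(len(gzip.compress(raw)) / len(raw))
    return statistics.median(ratios), statistics.stdev(ratios)
```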
Table 1 lists the syntactic parameters and measured compression ratios for each grammar.
It can be observed that as the number of non-terminals (syntactic categories), terminals (tokens), RHS options, and RHS lengths increase, the gzip compression ratio also grows, i.e. the data becomes harder to compress.
Figure 1 plots these datasets along with natural language and code data.
It can be seen that, in terms of complexity, some PCFG data sets are close to code data (the more compressible side), while others are close to natural language.
To determine the scaling law of each data set, the researcher trained models of 6 different sizes (4.2M, 8.8M, 20.3M, 59.0M, 275.3M, and 1.4B parameters; Table 6 gives the architectural details) on data subsets of several sizes (100K, 1M, 5M, 20M, 50M, and 100M tokens), and then performed a power-law fit on the resulting losses. Most experiments were run on 4 NVIDIA A100s with 80 GB of VRAM each, using PyTorch FSDP.
As shown in Figure 2, the easier a data set is to compress (the lower its compressibility ratio), the faster the model converges, which matches intuition.
Although this suggests that more compute is needed to model more complex data sets, more evidence is required before we can be certain whether the compute-optimal frontier itself changes directly as a function of data complexity. To establish a non-trivial sensitivity of scaling laws to data complexity, one needs to fit a scaling law for each data set and examine its fitted parameters.
The scaling-law functional form proposed by Hoffmann et al. in 2022 expresses training loss as a function of model and data size:

L(N, D) = E + A/N^α + B/D^β    (1)

where N is the number of model parameters and D is the number of tokens in the training data set. They claim that E is "the entropy of natural text" and that the scaling law is "dataset independent". However, when Rohan Pandey fitted the training results on the PCFG data sets with this function, he found that the scaling law of each data set was very different; see Table 2.
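Such a fit can be sketched as below; Hoffmann et al. fit this form by minimizing a Huber loss in log space, so the plain least-squares fit here is only an illustration, and the initial guesses are arbitrary:

```python
# Hedged sketch: fit L(N, D) = E + A/N^alpha + B/D^beta to observed losses.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

def fit_scaling_law(N_obs, D_obs, loss_obs):
    """N_obs, D_obs, loss_obs: arrays of model sizes, token counts, and final losses."""
    p0 = [1.0, 100.0, 100.0, 0.3, 0.3]  # arbitrary initial guess
    params, _ = curve_fit(chinchilla_loss,
                          (np.asarray(N_obs, float), np.asarray(D_obs, float)),
                          np.asarray(loss_obs, float), p0=p0, maxfev=20000)
    return dict(zip(["E", "A", "B", "alpha", "beta"], params))
```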
From the scaling law, a compute-optimal frontier for the parameter count can be derived (following Kaplan et al. [2020] and Hoffmann et al. [2022]) and simplified to

N_opt(C) ∝ C^(β/(α+β)),   D_opt(C) ∝ C^(α/(α+β))

where C is the compute budget in FLOPs.
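Using the closed-form allocation from Hoffmann et al. (2022), under the usual approximation C ≈ 6ND, this frontier can be computed from the fitted parameters roughly as follows:

```python
# Sketch: compute-optimal allocation of a FLOPs budget C from fitted (A, B, alpha, beta),
# following the closed form in Hoffmann et al. (2022) with C ≈ 6·N·D.
def compute_optimal(C, A, B, alpha, beta):
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6.0) ** (beta / (alpha + beta))
    D_opt = (1.0 / G) * (C / 6.0) ** (alpha / (alpha + beta))
    return N_opt, D_opt
```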
Figure 3 plots Chinchilla's compute-optimal frontier together with the scaling law fitted to each PCFG data set.
It can be seen that as the data becomes harder to compress, the frontier of the fitted scaling law gradually shifts toward favoring data. Chinchilla's one-to-one frontier is crossed somewhere in the interval 0.23 < gzip compressibility < 0.45.
To predict scaling-law parameters from a data set's compressibility, a simple linear regression can be fitted to the scaling-law parameters obtained for each data set. As mentioned before, for a data set D the compressibility H is computed by first calculating the ratio of compressed bits to original bits for each element d, and then averaging over all elements.
Once the lines predicting each parameter (E, A, B, α, β) from H are fitted, each parameter can be redefined as a function of the compressibility rate:

x′(H) = m_x · H + n_x,   for x ∈ {E, A, B, α, β}

where m_x and n_x are the slope and intercept of the fitted linear regression.
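With scipy, these per-parameter regressions can be sketched like so (fit_scaling_law is the helper sketched earlier; names are illustrative):

```python
# Sketch: regress each fitted scaling-law parameter against gzip compressibility H,
# yielding the slope m_x and intercept n_x for each parameter x.
from scipy.stats import linregress

def fit_parameter_lines(H_values, fitted_laws):
    """H_values: compressibility of each data set; fitted_laws: one dict per data set
    with keys E, A, B, alpha, beta (e.g. the output of fit_scaling_law)."""
    lines = {}
    for name in ["E", "A", "B", "alpha", "beta"]:
        reg = linregress(H_values, [law[name] for law in fitted_laws])
        lines[name] = {"m": reg.slope, "n": reg.intercept, "p": reg.pvalue}
    return lines
```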
Table 3 gives these fitted values (with the p-values of the regressions), and Figure 4 visualizes these linear regressions.
The fits are almost all monotonically decreasing, just at different rates, and α and β intersect at H ≈ 0.27. Note that E (the "entropy of natural text" originally treated as a constant) is the only parameter that increases with H (though not significantly).
We can now reparameterize (1) as a function of the compressibility rate H:

L(N, D) = E′(H) + A′(H)/N^α′(H) + B′(H)/D^β′(H)
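Putting the pieces together, a predicted loss as a function of (N, D, H) can be sketched as below, reusing the lines fitted above:

```python
# Sketch: predict loss from model size N, data size D, and gzip compressibility H,
# by replacing each Chinchilla parameter with its linear fit in H.
def loss_given_compressibility(N, D, H, lines):
    p = {name: fit["m"] * H + fit["n"] for name, fit in lines.items()}
    return p["E"] + p["A"] / N**p["alpha"] + p["B"] / D**p["beta"]
```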
However, since the experiments here are fairly small in scale and focused mainly on the PCFG data sets, Pandey extended the formulation: by adjusting Chinchilla, he obtained a data-dependent scaling law in which ε is an adjustment weight for the gzip compressibility of the training data and the primed parameters are the Chinchilla constants.
The experiments above leave open one possibility: that this compressibility measure simply conflates some underlying syntactic property (such as vocabulary size). To address this concern, Figure 5 presents additional results.
It can be seen that when the vocabulary size is held fixed and other syntactic properties are varied (Table 4), gzip compressibility still predicts the changes in scaling-law parameters well (the correlation is even stronger than in the setting where vocabulary size grows).
Figure 6 shows a counterexample found in this empirical study: when the syntactic properties vary widely (Table 5) but the resulting data sets end up with the same gzip compressibility, the scaling-law parameters do not change significantly.
Although no intersection like that in Figure 4 is observed in this equal-vocabulary case, the slope of α is still steeper than that of β (and A steeper than B), showing the same shift toward data as gzip compressibility increases.
Taken together, these results show that scaling laws depend on the training data, and that gzip compressibility is a good predictor of how data complexity affects scaling properties.