


Exploring the impact of vocabulary selection on language model training: A groundbreaking study
How are language models affected by different vocabularies, and how can these effects be balanced?
In a recent experiment, researchers pre-trained and fine-tuned 16 language models, each with a different vocabulary. Twelve of the models used NanoGPT, a small-scale architecture based on GPT-2 SMALL, configured with 12 attention heads, 12 transformer layers, and a word-embedding dimension of 768, and were trained for approximately 400,000 iterations (roughly 10 epochs). The other 4 models were trained on a GPT-2 MEDIUM architecture with 16 attention heads, 24 transformer layers, and a word-embedding dimension of 1024, for 600,000 iterations. All models were pre-trained with NanoGPT on the OpenWebText dataset. For fine-tuning, the researchers used the instruction dataset provided by baize-chatbot and added an additional 20,000 and 500,000 "dictionary" entries to the two types of models respectively.
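For reference, here is a minimal sketch of how the two architecture configurations described above might be written with nanoGPT-style settings. The field names (n_layer, n_head, n_embd, block_size) follow nanoGPT conventions; any value not stated in the article (such as block size or the exact vocabulary size of a given run) is an assumption.

```python
# Sketch of the two architectures described above, in nanoGPT-style settings.
# Values not given in the article (block_size, dropout, etc.) are placeholders.

small_config = dict(
    n_layer=12,        # 12 transformer layers
    n_head=12,         # 12 attention heads
    n_embd=768,        # word-embedding dimension
    block_size=1024,   # nanoGPT default context length (assumed)
    vocab_size=50257,  # varies per experiment (roughly 8,000 .. 100,256)
)

medium_config = dict(
    n_layer=24,
    n_head=16,
    n_embd=1024,
    block_size=1024,   # later reduced to 256 for the MEDIUM runs (see below)
    vocab_size=50257,
)

max_iters_small = 400_000    # roughly 10 epochs over OpenWebText
max_iters_medium = 600_000
```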
In the future, the researchers plan to release the code, pre-trained models, instruction-tuned models, and fine-tuning datasets.
However, due to the lack of GPU sponsors (this is a free open-source project), the researchers are not continuing the work for now in order to reduce costs, although there is room to improve the research further. In the pre-training stage alone, these 16 models required a total of 147 days on 8 GPUs (1,176 single-GPU days), at a cost of US$8,000.
The research results can be summarized as:
- In terms of tokenization, the TokenMonster (50256-strict-nocapcode) vocabulary performs better than the GPT-2 Tokenizer and tiktoken p50k_base on all metrics.
- The optimal vocabulary size is 32,000.
- The simpler the vocabulary, the faster the model will converge, but it will not necessarily produce better results after convergence.
- An increase in the word ratio (the average number of characters corresponding to each token; see the measurement sketch after this list) does not, by itself, negatively affect model quality.
- Vocabularies in which a single token can correspond to multiple words score about 5% lower on the SMLQA (Ground Truth) benchmark, but achieve a 13% higher word ratio.
- With capcode enabled, the model takes longer to learn, but once it converges there appears to be no effect in either direction on the SMLQA (Ground Truth) or SQuAD (Data Extraction) benchmarks.
- Both validation loss and F1 are meaningless metrics when comparing different tokenizers.
- The flaws and complexity of the tokenizer have a greater impact on the model's ability to learn facts than on the model's ability to learn language.
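The word ratio mentioned in the list above is simply the average number of characters represented by each token. A minimal sketch of how it could be measured for the two reference tokenizers using the tiktoken library follows; TokenMonster is omitted because its API is not shown in the article, and the sample file name is hypothetical.

```python
import tiktoken


def chars_per_token(encoding_name: str, text: str) -> float:
    """Average number of characters per token ('word ratio' in the article)."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    return len(text) / len(tokens)


# Any representative text works; "sample_corpus.txt" is a placeholder path.
sample = open("sample_corpus.txt", encoding="utf-8").read()
print("gpt2      :", chars_per_token("gpt2", sample))
print("p50k_base :", chars_per_token("p50k_base", sample))
```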
According to the experimental results, englishcode-32000-consistent performs best. However, as mentioned above, when using TokenMonster vocabularies in which a single token corresponds to multiple words, there is a trade-off between SMLQA (Ground Truth) accuracy and word ratio, which increases the slope of the learning curve. The researchers firmly believe that by forcing 80% of tokens to correspond to a single word and 20% to correspond to multiple words, this trade-off can be minimized and a "best of both worlds" vocabulary can be achieved. The researchers believe this approach would match the performance of a one-word-per-token vocabulary while improving the word ratio by about 50%.
The implication of the last point above is that flaws and complexity in the tokenizer have a greater impact on the model's ability to learn facts than on its language ability.
This phenomenon is an interesting feature discovered during training, and it makes sense when you consider how model training works. The researchers have no evidence to justify this reasoning, but essentially, because the fluency of language is easier to correct during backpropagation than the factuality of language (which is extremely subtle and context-dependent), any improvement in tokenizer efficiency has a knock-on effect that translates directly into improved information fidelity, as seen in the SMLQA (Ground Truth) benchmark. Simply put: a better tokenizer makes a more truthful model, but not necessarily a more fluent one. The other way around: a model with an inefficient tokenizer can still learn to write fluently, but the extra cost of fluency reduces the model's credibility.
The impact of vocabulary size
Before conducting these tests, the researchers believed that 32,000 was the optimal vocabulary size, and the experimental results confirmed this. The 50256-balanced model performs only 1% better than the 32000-balanced model on the SMLQA (Ground Truth) benchmark, while the model size increases by 13%. To demonstrate this point clearly, the researchers ran experiments on several MEDIUM-based models with vocabularies of 24000, 32000, 50256, and 100256 words.
The impact of optimization modes
The researchers tested TokenMonster in three specific optimization modes: balanced, consistent, and strict. The optimization mode affects the way punctuation marks and capcode markers are combined with word tokens. The researchers initially predicted that the consistent mode would perform better (because it is less complex), even though its word ratio (the ratio of characters to tokens) would be slightly lower.
The experimental results seem to confirm the above conjecture, but the researchers also observed some other phenomena. First, the consistent mode appears to perform about 5% better than the balanced mode on the SMLQA (Ground Truth) benchmark. However, the consistent mode performs significantly worse (by 28%) on the SQuAD (Data Extraction) benchmark. That said, the SQuAD benchmark shows large uncertainty (repeated runs give different results) and is not convincing. The researchers did not train balanced and consistent to convergence, so this may simply mean that the consistent mode is easier to learn. In fact, consistent may well do better on SQuAD (Data Extraction) too, as SQuAD is harder to learn and less prone to hallucination.
This is an interesting finding in itself, because it means that there is no obvious problem with combining punctuation and words into a single token. All other tokenizers so far have argued that punctuation should be separated from letters, but as you can see from the results here, words and punctuation can be merged into a single token with no noticeable performance loss. This is also confirmed by 50256-consistent-oneword, a combination that performs on par with 50256-strict-oneword-nocapcode and outperforms p50k_base. 50256-consistent-oneword combines simple punctuation with the word token (which the other two combinations do not).
Enabling capcode in strict mode has clear negative effects. On SMLQA, 50256-strict-oneword-nocapcode scores 21.2 and on SQuAD it scores 23.8, while 50256-strict-oneword scores 16.8 and 20.0 respectively. The reason is obvious: the strict optimization mode prevents capcode markers from being merged with word tokens, so more tokens are needed to represent the same text, resulting in an 8% lower word ratio. In fact, strict-nocapcode is more similar to consistent than to strict mode. Across the various metrics, 50256-consistent-oneword and 50256-strict-oneword-nocapcode are almost equal.
In most cases, the researchers concluded that the model does not have much difficulty learning the meaning of tokens that combine punctuation marks and words. That said, the consistent mode has higher grammatical accuracy and fewer grammatical errors than the balanced mode. Taking everything into consideration, the researchers recommend using the consistent mode across the board; strict mode should only be used with capcode disabled.
Impact on syntax accuracy
As described above, the consistent mode has higher syntax accuracy (fewer syntax errors) than the balanced mode. This is reflected in a very slight negative correlation between word ratio and grammar, as shown in the figure below. Beyond that, the most noteworthy point is that the syntax results for the GPT-2 tokenizer and tiktoken p50k_base are poor (98.1% and 97.5% respectively) compared with TokenMonster's 50256-strict-oneword-nocapcode (98.6% and 98.4%). The researchers initially thought this was just a coincidence, but multiple samplings produced results in the same range. The cause is unclear.
Influence on MTLD
MTLD is used to represent the lexical diversity of the generated sample text. It appears to be closely correlated with the n_embd parameter rather than with features such as vocabulary size, optimization mode, or the maximum number of words per token. This is particularly evident in the 6000-balanced model (n_embd of 864) and the 8000-consistent model (n_embd of 900).
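For readers unfamiliar with the metric, MTLD (Measure of Textual Lexical Diversity) measures how long a stretch of text can remain above a type-token-ratio threshold (conventionally 0.72). Below is a simplified, forward-pass-only sketch assuming whitespace tokenization; the article does not say which implementation the researchers used, and the standard metric averages a forward and a backward pass.

```python
def mtld_forward(words: list[str], ttr_threshold: float = 0.72) -> float:
    """Simplified one-directional MTLD. Higher values = more lexical diversity."""
    factors = 0.0
    types: set[str] = set()
    token_count = 0
    for word in words:
        token_count += 1
        types.add(word.lower())
        ttr = len(types) / token_count       # type-token ratio of the current stretch
        if ttr <= ttr_threshold:             # a full "factor" is completed
            factors += 1.0
            types.clear()
            token_count = 0
    if token_count > 0:                      # credit the partial factor at the end
        ttr = len(types) / token_count
        factors += (1.0 - ttr) / (1.0 - ttr_threshold)
    return len(words) / factors if factors > 0 else float(len(words))


print(mtld_forward("the quick brown fox jumps over the lazy dog".split()))
```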
In the medium-sized model, p50k_base has the highest MTLD, which is 43.85, but also has the lowest grammar score. The reason for this is unclear, but the researchers speculate that the selection of training data may be a bit strange.
Discussion of SQuAD
The purpose of the SQuAD benchmark is to evaluate a model's ability to extract data from a piece of text: the model is given a passage and asked a question whose answer must be found in that passage. The test results are not very meaningful; there are no obvious patterns or correlations, and they are not affected by the overall size of the model. In fact, the 8000-balanced model with 91 million parameters scored higher on SQuAD than the 50256-consistent-oneword model with 354 million parameters. The reason may be that there are not enough examples of this style, or too many question-answer pairs, in the fine-tuning dataset. Or it may simply be a less-than-ideal benchmark.
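A rough sketch of what a SQuAD-style (data extraction) evaluation prompt and a simple containment check could look like is shown below. The exact prompt template and scoring used by the researchers are not given in the article, so both functions are illustrative assumptions.

```python
def build_extraction_prompt(passage: str, question: str) -> str:
    # The answer is required to appear verbatim somewhere in the passage.
    return (
        "Read the following passage and answer the question using only "
        "information found in the passage.\n\n"
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def extraction_correct(model_answer: str, gold_answer: str) -> bool:
    # Loose containment check; real SQuAD scoring normalises punctuation,
    # articles, and whitespace before comparing.
    return gold_answer.strip().lower() in model_answer.strip().lower()
```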
Discussion of SMLQA
The SMLQA benchmark tests "ground truth" by asking common-sense questions with objective answers, such as "Which country's capital is Jakarta?" and "Who wrote the Harry Potter books?"
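In contrast with SQuAD, SMLQA is closed-book: the model receives a factual question with no supporting passage, and the completion is checked against an objective answer. A minimal sketch of such a check follows; the question/answer pairs are hypothetical paraphrases of the examples quoted above, and the containment-based scoring is an assumption, not the benchmark's actual rule.

```python
# Hypothetical examples in the style of the SMLQA questions quoted above.
questions = [
    ("Jakarta is the capital of which country?", "indonesia"),
    ("Who wrote the Harry Potter books?", "rowling"),
]


def smlqa_accuracy(generate, qa_pairs) -> float:
    """`generate` is any callable mapping a prompt string to a completion string."""
    correct = 0
    for question, expected in qa_pairs:
        completion = generate(question).lower()
        if expected in completion:   # simple containment as the truth check
            correct += 1
    return correct / len(qa_pairs)


# Dummy model that always answers "Indonesia": scores 0.5 on the two examples.
print(smlqa_accuracy(lambda q: "Indonesia", questions))
```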
It is worth noting that the two reference tokenizers, GPT-2 Tokenizer and p50k_base, performed very well on this benchmark. The researchers initially thought they had wasted months of time and thousands of dollars, only to find that tiktoken performed better than TokenMonster. However, it turned out that the problem was related to the number of words corresponding to each token. This is particularly evident in the MEDIUM models, as shown in the figure below.
The performance of single-word vocabularies is slightly better than that of TokenMonster's default vocabularies, in which each token can correspond to multiple words.
Another important observation is that when the vocabulary size is below 32,000, the vocabulary size directly affects SMLQA (Ground Truth) performance, even if the model's n_embd parameter is adjusted to compensate for the smaller model size. This is counterintuitive: the researchers originally expected 16000-balanced with an n_embd of 864 (121.34 million parameters) and 8000-consistent with an n_embd of 900 (123.86 million parameters) to outperform 50256-consistent with an n_embd of 768 (123.59 million parameters), but that was not the case; both performed considerably worse (13.7 and 15.1 versus 16.4 for 50256-consistent). However, both "adjusted" models were pre-trained for the same wall-clock time, which meant they completed significantly less pre-training in that time.
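The parameter counts quoted above can be roughly reproduced from the architecture: a GPT-2-style model has about 12·n_layer·n_embd² weights in its transformer blocks, plus vocab_size·n_embd in the (tied) token embedding and block_size·n_embd in the position embedding. A back-of-the-envelope sketch follows; bias and layer-norm terms are ignored and the block size is assumed to be 1024, so the totals differ slightly from the article's figures.

```python
def approx_gpt2_params(n_layer: int, n_embd: int, vocab_size: int,
                       block_size: int = 1024) -> int:
    transformer = 12 * n_layer * n_embd ** 2         # attention + MLP weights
    embeddings = (vocab_size + block_size) * n_embd  # token + position tables
    return transformer + embeddings


for name, n_embd, vocab in [("16000-balanced", 864, 16000),
                            ("8000-consistent", 900, 8000),
                            ("50256-consistent", 768, 50257)]:
    print(name, round(approx_gpt2_params(12, n_embd, vocab) / 1e6, 2), "M")
```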
Small model with 12 attention heads and 12 transformer layers
The researchers trained 12 models on the default NanoGPT architecture, which is based on GPT-2 with 12 attention heads, 12 layers, and an embedding size of 768. None of these models was trained to convergence; simply put, they did not reach their maximum learning capacity. The models were trained for 400,000 iterations, but it appears that 600,000 iterations are needed to reach maximum learning capacity. The reasons are simple: budget constraints, and uncertainty about where the convergence point lies.
Results of the small model:
Pearson correlation of small model:
1. The optimal vocabulary size is reached at 32,000. In the range from 8,000 to 32,000, increasing the vocabulary improves model accuracy. However, when the vocabulary size grows from 32,000 to 50,257, the total number of model parameters increases accordingly, but the accuracy gain is only 1%. Beyond 32,000, the gains diminish rapidly.
2. Poor tokenizer design affects the model's accuracy, but not its grammatical correctness or lexical diversity. In the 90-million to 125-million parameter range, tokenizers with more complex grammatical rules (tokens corresponding to multiple words, combinations of words and punctuation, capcode encoding tokens, and smaller total vocabularies) perform worse on the ground-truth benchmark. However, this more sophisticated tokenizer design had no statistically significant impact on the linguistic diversity or grammatical correctness of the generated text. Even a compact model, such as one with 90 million parameters, can effectively exploit more complex tokens. A more complex vocabulary takes longer to learn, which reduces the time available for acquiring information related to basic facts. Since none of these models was fully trained, whether further training would close the performance gap remains to be seen.
3. Validation loss is not a valid metric for comparing models that use different tokenizers. For a given tokenizer, validation loss is very strongly correlated (0.97 Pearson correlation) with the word ratio (the average number of characters per token). To compare loss values between tokenizers, it may be more useful to measure loss relative to characters rather than tokens, since the loss value is proportional to the average number of characters per token (a conversion sketch follows this list).
4. The F1 score is not suitable as a metric for evaluating these language models, because they are trained to generate responses of variable length (completion is indicated by an end-of-text marker). The longer the text sequence, the harsher the penalty applied by the F1 formula, so F1 favors models that produce shorter responses.
5. All models can be fine-tuned to produce grammatically coherent answers. Although these answers are often incorrect or hallucinated, they are relatively coherent and demonstrate an understanding of the context.
6. As the embedding size increases, the lexical diversity and grammatical accuracy of the generated text increase significantly, and both are slightly negatively correlated with the word ratio. This means that a vocabulary with a higher word ratio makes learning grammar and lexical diversity slightly more difficult.
7. When adjusting for model parameter size, there is no statistically significant correlation between the word ratio and the SMLQA (Ground Truth) or SQuAD (Information Extraction) benchmarks. This means that a tokenizer with a higher word ratio does not negatively impact model performance.
8. Compared with "balanced" vocabularies, "consistent" vocabularies appear to perform slightly better on the SMLQA (Ground Truth) benchmark but much worse on the SQuAD (Information Extraction) benchmark, although more data is needed to confirm this.
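Point 3 above suggests normalizing loss by characters rather than tokens when comparing tokenizers. A small sketch of that conversion follows, assuming the usual mean per-token cross-entropy reported during training; the numbers in the example are illustrative only and are not taken from the article.

```python
def loss_per_char(loss_per_token: float, chars_per_token: float) -> float:
    """Convert mean per-token cross-entropy into a per-character loss,
    so that tokenizers with different word ratios become comparable."""
    return loss_per_token / chars_per_token


# Illustrative numbers only (not from the article):
print(loss_per_char(3.0, 4.0))   # tokenizer A: word ratio 4.0
print(loss_per_char(3.3, 5.0))   # tokenizer B: higher word ratio, lower per-char loss
```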
Medium model with 16 attention heads and 24 transformer layers
Training the models to convergence did significantly reduce the performance gap between simpler and more complex vocabularies. The SMLQA (Ground Truth) and SQuAD (Data Extraction) benchmark results are very close; the main difference is that 50256-consistent has the advantage of a 23.5% higher word ratio than p50k_base. However, for vocabularies with multiple words per token, the performance cost on ground truth is smaller, and this can be addressed using the method discussed at the top of the page. Results for these models:
After 560,000 iterations, all models started to converge, as shown below:
In the next phase, we will use englishcode-32000-consistent to train and benchmark a MEDIUM model. This vocabulary contains 80% word tokens and 20% multi-word tokens.
After training and benchmarking the small models, the researchers realized that the measured results reflected the models' learning speed rather than their learning ability. Furthermore, the researchers had not been exploiting the full computational potential of the GPUs because the default NanoGPT parameters were used. To address this, they chose to study four variants using tokenizers with 50,257 tokens and a medium-sized language model. The batch size was raised from 12 to 36 and the block size reduced from 1024 to 256 to make full use of the 24 GB GPU's VRAM, and 600,000 iterations were performed instead of the 400,000 used for the smaller models. Pre-training each model took an average of just over 18 days, three times the 6 days required for the smaller models.
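A sketch of how the adjusted MEDIUM training settings described above might look as nanoGPT-style hyperparameters; only the batch size, block size, and iteration count come from the article, and the dictionary name is a placeholder.

```python
# Training adjustments for the MEDIUM runs (architecture as sketched earlier).
medium_training = dict(
    batch_size=36,      # raised from nanoGPT's default of 12
    block_size=256,     # reduced from 1024 to fit the 24 GB GPU's VRAM
    max_iters=600_000,  # vs. 400,000 iterations for the small models
)
```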
