The emergence of conversational AI such as ChatGPT has accustomed people to a simple interaction: type in a piece of text, code, or an image, and the chatbot returns the answer you want. Behind this simple interface, however, the AI model performs very complex data processing and computation, and tokenization is one of the most common steps.
In natural language processing, tokenization refers to splitting text input into smaller units called "tokens". These tokens can be words, subwords, or characters, depending on the segmentation strategy and the task. For example, tokenizing the sentence "I like eating apples" might yield the token sequence ["I", "like", "eating", "apples"]. Tokenization is sometimes translated as "word segmentation", but some consider that translation misleading, since the resulting tokens are not necessarily "words" in the everyday sense.
Source: https://towardsdatascience.com/dynamic-word-tokenization-with-regex-tokenizer-801ae839d1cd
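To see what this looks like in practice, here is a small sketch using OpenAI's open-source tiktoken library (chosen purely for illustration; any BPE tokenizer shows the same idea, and the exact token boundaries depend on the vocabulary a model uses):

```python
# Illustrative only: the exact splits depend on which tokenizer vocabulary is loaded.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # the BPE vocabulary used by GPT-3.5/GPT-4-era models
ids = enc.encode("I like eating apples")
print(ids)                                     # a short list of integer token ids
print([enc.decode([i]) for i in ids])          # the text fragment each token covers
```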
The purpose of tokenization is to convert the input data into a form the computer can process and to provide a structured representation for subsequent model training and analysis. This approach makes deep learning research more convenient, but it also brings plenty of trouble. Andrej Karpathy, who recently returned to OpenAI, pointed out several of these problems.
First, Karpathy argues that tokenization introduces complexity: with tokenization, the language model is no longer a fully end-to-end model. Tokenization is a separate stage with its own training and inference procedures, it requires additional libraries, and it increases the complexity of bringing in data from other modalities.
In addition, tokenization makes the model error-prone in certain scenarios, such as text completion: with the text completion API, if your prompt ends with a trailing space, the results you get can be very different.
Picture source: https://blog.scottlogic.com/2021/08/31/a-primer-on-the-openai-api-1.html
Another example: because of tokenization, the otherwise powerful ChatGPT cannot actually spell a word in reverse (the test in question was run on GPT-3.5).
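Both quirks are visible at the tokenizer level, before the model ever sees the input. A minimal sketch, again assuming the tiktoken library (token ids will differ with other vocabularies):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# 1. A trailing space changes the token sequence the model conditions on,
#    which is why completions can change noticeably.
print(enc.encode("Once upon a time"))
print(enc.encode("Once upon a time "))
# 2. The model operates on multi-character tokens, not letters, so "spell this
#    word backwards" asks it to manipulate units it never directly observes.
word = "lollipop"
print([enc.decode([i]) for i in enc.encode(word)])   # a few chunks, not 8 separate letters
```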
There may be many such examples. Karpathy believes that to solve these problems, we must first abandon tokenization.
A new paper published by Meta AI explores this question. Specifically, they proposed a multi-scale decoder architecture called "MEGABYTE" that can perform end-to-end differentiable modeling of sequences exceeding one million bytes.
Paper link: https://arxiv.org/pdf/2305.07185.pdf
Importantly, the paper demonstrates the feasibility of doing away with tokenization, an approach Karpathy described as "promising".
The following are the details of the paper.
Paper overview

As previous articles on machine learning have noted, machine learning appears able to solve many complex problems because it turns those problems into mathematical ones.
NLP follows the same idea: text is "unstructured data" that must first be converted into "structured data", which can then be turned into a mathematical problem, and tokenization is the first step of that conversion.
Because of the high cost of both self-attention and large per-position feed-forward networks, large transformer decoders (LLMs) typically use only a few thousand tokens of context. This severely limits the set of tasks to which LLMs can be applied.
Based on this, researchers from Meta AI proposed a new method for modeling long byte sequences: MEGABYTE. The method divides the byte sequence into fixed-size patches, which play a role similar to tokens.
The MEGABYTE model consists of three parts: a patch embedder, a global module, and a local module.
Crucially, the study found that for many tasks most bytes are relatively easy to predict (e.g., completing a word given its first few characters), which means it is not necessary to use a large neural network for every byte; a much smaller model can be used for intra-patch modeling.
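To make the byte-and-patch view concrete, here is a toy illustration in plain Python (the patch size of 4 is an arbitrary choice for this sketch):

```python
# Treat raw UTF-8 bytes as the token stream and group them into fixed-size patches.
text = "I like eating apples"
byte_ids = list(text.encode("utf-8"))                 # each byte is an integer in 0..255
P = 4                                                 # patch size (arbitrary here)
byte_ids += [0] * ((-len(byte_ids)) % P)              # pad so the length is a multiple of P
patches = [byte_ids[i:i + P] for i in range(0, len(byte_ids), P)]
print(patches)   # [[73, 32, 108, 105], [107, 101, 32, 101], ...]
```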
The MEGABYTE architecture has made three major improvements to the Transformer for long sequence modeling:
1. Sub-quadratic self-attention. Most work on long-sequence models focuses on reducing the quadratic cost of self-attention. By decomposing a long sequence into two shorter sequences and choosing the optimal patch size, MEGABYTE reduces the cost of the self-attention mechanism to O(N^(4/3)), making even long sequences tractable (see the back-of-the-envelope sketch after this list).
2. Per-patch feed-forward layers. In very large models such as GPT-3, more than 98% of FLOPS are used to compute position-wise feed-forward layers. MEGABYTE enables larger, more expressive models at the same cost by using large feed-forward layers per patch (instead of per position). With patch size P, the baseline transformer uses the same feed-forward layer with m parameters P times, while MEGABYTE can use a layer with mP parameters once for the same cost.
3. Parallel decoding. The transformer must perform all calculations serially during generation because the input of each time step is the output of the previous time step. By generating patch representations in parallel, MEGABYTE achieves greater parallelism in the generation process. For example, a MEGABYTE model with 1.5B parameters generates sequences 40% faster than a standard 350M parameter transformer, while also improving perplexity when trained using the same computation.
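As a back-of-the-envelope check of the first improvement, a rough cost accounting (keeping only the attention terms and dropping all constants) looks like this:

```python
# Rough attention cost in arbitrary units, ignoring constants and non-attention FLOPs.
T = 1_000_000                    # sequence length in bytes
P = round(T ** (1 / 3))          # near-optimal patch size is on the order of T^(1/3), here 100
global_cost = (T / P) ** 2       # one global model attending over T/P patches
local_cost = (T / P) * P ** 2    # T/P local models, each attending over P bytes
print(global_cost + local_cost)  # ~2e8, i.e. on the order of T^(4/3)
print(T ** 2)                    # 1e12 for full self-attention over all T bytes
```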
Overall, MEGABYTE allows us to train larger, better-performing models for the same compute budget, scales to very long sequences, and improves generation speed during deployment.
MEGABYTE also contrasts with existing autoregressive models, which typically use some form of tokenization where sequences of bytes are mapped into larger discrete tokens (Sennrich et al., 2015; Ramesh et al., 2021; Hsu et al., 2021). Tokenization complicates preprocessing, multimodal modeling, and transfer to new domains, while hiding useful structure in the model. This means that most SOTA models are not truly end-to-end models. The most widely used tokenization methods require the use of language-specific heuristics (Radford et al., 2019) or loss of information (Ramesh et al., 2021). Therefore, replacing tokenization with an efficient and performant byte model will have many advantages.
The study conducted experiments on MEGABYTE and some powerful baseline models. Experimental results show that MEGABYTE performs comparably to subword models on long-context language modeling, achieves state-of-the-art density estimation perplexity on ImageNet, and allows audio modeling from raw audio files. These experimental results demonstrate the feasibility of large-scale tokenization-free autoregressive sequence modeling.
## Patch embedder
A patch embedder with patch size P maps a byte sequence x_{0:T} into a sequence of K = T/P patch embeddings of dimension P·D_G.
First, each byte is embedded with a lookup table E^global-embed ∈ R^(256×D_G), forming an embedding of size D_G, and positional embeddings are added. The byte embeddings are then reshaped into a sequence of K patch embeddings of dimension P·D_G. To allow autoregressive modeling, the patch sequence is padded at the start with a trainable patch-sized padding embedding (E^global-pad ∈ R^(P×D_G)), and the last patch is removed from the input. This sequence, denoted h^global-in, is the input to the global module.
## Global module
The global module is a decoder-only transformer with model dimension P·D_G that operates on the sequence of K patches. It combines self-attention with causal masking to capture dependencies between patches: it takes the K patch representations h^global-in as input and, by attending over previous patches, outputs updated representations h^global-out.
The final global module output h^global-out contains K patch representations of dimension P·D_G. Each of these is reshaped into a sequence of length P and dimension D_G, where position p uses dimensions p·D_G to (p+1)·D_G. Each position is then mapped to the local module dimension with a projection matrix w^GL of shape D_G×D_L, where D_L is the local module dimension. These are then combined with byte embeddings of size D_L for the next tokens. The local byte embeddings are offset by one with a trainable local padding embedding (E^local-pad ∈ R^D_L), allowing autoregressive modeling within a patch. The result is a tensor h^local-in of shape K×P×D_L.
## Local module
The local module is a smaller decoder-only transformer with model dimension D_L that runs on a single patch k of P elements, where each element is the sum of the global module's output for that position and the embedding of the previous byte in the sequence. K copies of the local module are run on the patches independently, and in parallel during training, computing the representations h^local-out of shape K×P×D_L.
Finally, the vocabulary probability distribution can be computed at each position. The p-th element of the k-th patch corresponds to element t of the complete sequence, where t = k·P + p.
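Putting the three modules together, here is a minimal PyTorch sketch of the forward pass described above. All dimensions, layer counts, and the use of nn.TransformerEncoder with a causal mask are illustrative assumptions, not the paper's actual configuration or implementation.

```python
# A minimal, self-contained sketch of the MEGABYTE forward pass (illustrative only).
import torch
import torch.nn as nn

V, P, D_G, D_L = 256, 4, 32, 16          # byte vocab, patch size, global per-byte dim, local dim

class MegaByteSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.global_embed = nn.Embedding(V, D_G)            # byte lookup table (E^global-embed)
        self.pos_embed = nn.Embedding(1024, D_G)             # positional embeddings
        self.global_pad = nn.Parameter(torch.zeros(P * D_G)) # E^global-pad
        self.local_pad = nn.Parameter(torch.zeros(D_L))      # E^local-pad
        self.local_embed = nn.Embedding(V, D_L)
        self.w_GL = nn.Linear(D_G, D_L)                      # projection to the local dimension
        layer = lambda d: nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.global_model = nn.TransformerEncoder(layer(P * D_G), num_layers=2)
        self.local_model = nn.TransformerEncoder(layer(D_L), num_layers=1)
        self.to_logits = nn.Linear(D_L, V)

    def forward(self, x):                 # x: (B, T) byte ids, T divisible by P
        B, T = x.shape
        K = T // P
        pos = torch.arange(T, device=x.device)
        h = self.global_embed(x) + self.pos_embed(pos)                 # (B, T, D_G)
        patches = h.reshape(B, K, P * D_G)                             # (B, K, P*D_G)
        # pad at the start with E^global-pad and drop the last patch (autoregressive shift)
        g_in = torch.cat([self.global_pad.expand(B, 1, -1), patches[:, :-1]], dim=1)
        g_mask = nn.Transformer.generate_square_subsequent_mask(K).to(x.device)
        g_out = self.global_model(g_in, mask=g_mask)                   # (B, K, P*D_G)
        g_out = self.w_GL(g_out.reshape(B, K, P, D_G))                 # (B, K, P, D_L)
        # local input: global context plus embedding of the previous byte (offset by one)
        b = self.local_embed(x)                                        # (B, T, D_L)
        b = torch.cat([self.local_pad.expand(B, 1, -1), b[:, :-1]], dim=1)
        l_in = (g_out + b.reshape(B, K, P, D_L)).reshape(B * K, P, D_L)
        l_mask = nn.Transformer.generate_square_subsequent_mask(P).to(x.device)
        l_out = self.local_model(l_in, mask=l_mask)                    # (B*K, P, D_L)
        return self.to_logits(l_out).reshape(B, K * P, V)              # next-byte logits

logits = MegaByteSketch()(torch.randint(0, V, (2, 16)))                # (2, 16, 256)
```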
Efficiency analysis
Training efficiency

The researchers analyzed the cost of different architectures when scaling sequence length and model size. As shown in Figure 3 below, the MEGABYTE architecture uses fewer FLOPS than comparably sized transformers and linear transformers across a variety of model sizes and sequence lengths, allowing the use of larger models at the same computational cost.
Generation efficiency
Consider a MEGABYTE model with L_global layers in the global module and L_local layers in the local module, with patch size P, compared against a transformer architecture with L_local + L_global layers. Generating each patch with MEGABYTE requires a sequence of O(L_global + P·L_local) serial operations, whereas the comparable transformer needs O(P·(L_global + L_local)). When L_global >> L_local (the global module has many more layers than the local module), MEGABYTE can reduce inference cost by nearly a factor of P.
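Plugging in hypothetical layer counts makes the saving concrete (the numbers below are made up for illustration):

```python
L_global, L_local, P = 48, 6, 8                        # hypothetical configuration
megabyte_serial_ops = L_global + P * L_local           # 48 + 48 = 96 serial steps per patch
transformer_serial_ops = P * (L_global + L_local)      # 8 * 54 = 432 for the same P bytes
print(transformer_serial_ops / megabyte_serial_ops)    # 4.5x fewer serial steps
```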
Language modeling
The researchers evaluated MEGABYTE's language modeling capability on five datasets that emphasize long-range dependencies: Project Gutenberg (PG-19), Books, Stories, arXiv, and Code. As shown in Table 7 below, MEGABYTE consistently outperforms the baseline transformer and PerceiverAR on all datasets.
The researchers also scaled up training on PG-19. As shown in Table 8 below, MEGABYTE significantly outperforms other byte-level models and is competitive with state-of-the-art models trained on subwords.
Image modeling
The researchers trained a large MEGABYTE model on the ImageNet 64x64 dataset, with 2.7B parameters in the global module and 350M in the local module, trained on 1.4T tokens. They estimate that training the model took less than half the GPU hours needed to reproduce the best PerceiverAR model from Hawthorne et al., 2022. As shown in Table 8 above, MEGABYTE matches PerceiverAR's state-of-the-art performance while using only half of its compute.
The researchers compared three transformer variants (vanilla, PerceiverAR, and MEGABYTE) to test how well they scale to the longer sequences of increasingly large image resolutions. The results are shown in Table 5 below: under this compute-controlled setting, MEGABYTE outperforms the baseline models at all resolutions.
Table 14 below summarizes the precise settings used by each baseline model, including context length and number of latents.
Audio modeling
Audio combines the sequential structure of text with the continuous nature of images, which makes it an interesting application for MEGABYTE. The model in this paper achieves 3.477 bits per byte (bpb), significantly lower than PerceiverAR (3.543) and the vanilla transformer (3.567). Additional ablation results are detailed in Table 10 below.
For more technical details and experimental results, please refer to the original paper.