Ten years ago, the rise of deep learning was driven in part by new algorithms and architectures, a significant increase in available data, and improvements in computing power. Over the past decade, AI and ML models have grown deeper and more complex, with more parameters and more training data; they have become larger and more cumbersome, but have also produced some of the most transformative results in the history of machine learning.
These models are increasingly used in production and business applications, and their efficiency and cost have evolved from secondary considerations into major limiting factors. In response, Google continues to invest heavily in ML efficiency, tackling challenges at four levels: efficient architectures, training efficiency, data efficiency, and inference efficiency. Beyond efficiency, these models also face challenges around factuality, security, privacy, and freshness. This article focuses on Google Research's work on new algorithms that address the challenges above.
The basic research question is: is there a better way to parameterize a model to make it more efficient? In 2022, researchers focused on new techniques that inject external knowledge by augmenting models with retrieved context, on mixture-of-experts systems, and on making Transformers (the heart of most large ML models) more efficient.
In pursuit of higher quality and efficiency, neural models can be augmented with external context from large databases or trainable memory. By leveraging retrieved context, a neural network does not have to memorize as much knowledge in its internal parameters, which leads to better parameter efficiency, interpretability, and factuality.
The paper "Decoupled Context Processing for Context Augmented Language Modeling" explores a simple decoupled encoder-decoder architecture for incorporating external context into language models, which yields significant computational savings in autoregressive language modeling and open-domain question answering. Pre-trained large language models (LLMs) already absorb a large amount of knowledge through self-supervision on big training corpora, but it is unclear how this internal knowledge of the world interacts with the context presented at inference time. Through knowledge-aware fine-tuning (KAFT), researchers incorporate counterfactual and irrelevant contexts into standard supervised datasets, which strengthens the controllability and robustness of LLMs.
Paper address: https://arxiv.org/abs/2210.05758
An encoder-decoder cross-attention mechanism merges the context, allowing context encoding to be decoupled from language model inference and thereby improving the efficiency of context-augmented models.
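The gist of this decoupling can be illustrated with a minimal sketch (the encoder, decoder, and dimensions below are toy stand-ins, not the paper's architecture): the retrieved context is encoded once, and the cached encoder states are reused via cross-attention at every decoding step, so the context is never re-processed as generation proceeds.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                            # model width (illustrative)

def attention(q, K, V):
    """Single-query scaled dot-product attention."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Encode the retrieved context ONCE (a tanh projection stands in for the encoder).
W_enc = rng.normal(size=(d, d)) / np.sqrt(d)
context_tokens = rng.normal(size=(128, d))        # 128 retrieved context tokens
context_states = np.tanh(context_tokens @ W_enc)  # cached; reused below

# Decoding: every step cross-attends into the cached states, which are never
# recomputed, so context processing is decoupled from autoregressive inference.
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
decoder_state = rng.normal(size=(d,))
for step in range(8):
    q = decoder_state @ W_q
    context_summary = attention(q, context_states, context_states)
    # A real decoder would also apply self-attention and an MLP here.
    decoder_state = np.tanh(decoder_state + context_summary)
```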
In the search for modular deep networks, one question is how to design a conceptual database with corresponding computational modules. The researchers proposed a theoretical architecture that stores "remembered events" as sketches in an external LSH table, with a pointer module that processes the sketches.
Using accelerators to retrieve information quickly from large databases is another major challenge for context-augmented models. The researchers developed a TPU-based similarity search algorithm that is aligned with the TPU's performance model and provides analytical guarantees on expected recall, achieving peak performance. Search algorithms typically involve a large number of hyperparameters and design choices, which makes them difficult to tune for new tasks. The researchers therefore propose a new constrained optimization algorithm for automated hyperparameter tuning: with the desired cost or recall fixed as input, it produces tunings that are empirically very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.
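As an illustration of tuning toward a recall constraint (not the paper's constrained-optimization algorithm, and using a brute-force partitioned index instead of a TPU kernel), one can sweep a single cost knob, the number of partitions probed, until a target recall is met:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_parts = 20_000, 32, 64
db = rng.normal(size=(n, d))
queries = rng.normal(size=(100, d))

# Partition the database with randomly chosen centroids (stand-in for k-means).
centroids = db[rng.choice(n, n_parts, replace=False)]
assign = np.argmax(db @ centroids.T, axis=1)

def search(q, nprobe, k=10):
    """Probe the `nprobe` best partitions, then score exhaustively inside them."""
    parts = np.argsort(-(q @ centroids.T))[:nprobe]
    cand = np.where(np.isin(assign, parts))[0]
    return cand[np.argsort(-(db[cand] @ q))[:k]]

def recall_at_k(nprobe, k=10):
    hits = 0
    for q in queries:
        truth = np.argsort(-(db @ q))[:k]             # exact top-k
        hits += len(np.intersect1d(search(q, nprobe, k), truth))
    return hits / (len(queries) * k)

# Pick the cheapest setting that meets the recall constraint.
target = 0.9
for nprobe in range(1, n_parts + 1):
    if recall_at_k(nprobe) >= target:
        print(f"nprobe={nprobe} meets recall@10 >= {target}")
        break
```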
The Mixture-of-Experts (MoE) model has proven to be an effective way to increase the capacity of a neural network without disproportionately increasing its computational cost. The basic idea of MoE is to build a unified network out of many expert sub-networks, where each input is processed by a suitable subset of experts. As a result, compared with a standard neural network, MoE invokes only a small fraction of the whole model for any given input, which yields large efficiency gains, as shown in language model applications such as GLaM.
In the GLaM architecture, each input token is dynamically routed to two of 64 expert networks to make predictions.
For a given input, the routing function decides which experts should be activated. Designing this function is challenging because the researchers want to avoid both under- and over-utilization of each expert. A recent work proposes expert choice routing, a new routing mechanism that, instead of assigning each input token to its top-k experts, assigns each expert its top-k tokens. This automatically balances the load across experts while also naturally allowing an input token to be handled by multiple experts.
Expert choice routing: experts with a predetermined buffer capacity are assigned the top-k tokens, which guarantees load balancing. Each token may be processed by a variable number of experts.
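A minimal sketch of expert choice routing follows; the gating, experts, and capacity are toy stand-ins, but the key inversion is visible: each expert selects its top tokens rather than each token selecting its top experts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, n_experts, capacity = 16, 32, 4, 8   # capacity = tokens kept per expert

tokens = rng.normal(size=(n_tokens, d))
W_gate = rng.normal(size=(d, n_experts)) / np.sqrt(d)
expert_ffns = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]

# Token-expert affinity scores, normalized per token.
logits = tokens @ W_gate
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Expert choice: each EXPERT picks its `capacity` highest-scoring tokens, so the
# load is balanced by construction; a token may be picked by 0..n_experts experts.
outputs = np.zeros_like(tokens)
for e, W_ffn in enumerate(expert_ffns):
    chosen = np.argsort(-probs[:, e])[:capacity]
    outputs[chosen] += probs[chosen, e][:, None] * np.tanh(tokens[chosen] @ W_ffn)

tokens_per_expert = [capacity] * n_experts        # balanced by design
```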
The Transformer, currently the most popular sequence-to-sequence model, has achieved remarkable success on a range of challenging problems from vision to natural language understanding. Its core component is the attention layer, which computes similarities between queries and keys and uses them to build an appropriately weighted combination of values. Although performance is strong, the attention mechanism is not computationally efficient: its complexity is typically quadratic in the length of the input sequence.
As Transformers continue to grow, it is valuable to ask whether there are naturally occurring structures or patterns in learned models that can explain the principles behind efficient attention. To this end, researchers studied the learned embeddings in intermediate MLP layers and found that they are very sparse; for example, the T5-Large model has only about 1% non-zero entries. This sparsity further suggests that one could potentially reduce FLOPs without affecting model performance.
Paper address: https://arxiv.org/pdf/2210.06313.pdf
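The sparsity observation is easy to check on any model by counting non-zero post-activation entries; the toy ReLU layer below is only a stand-in for a trained Transformer MLP, so its numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 256, 1024, 512

x = rng.normal(size=(n_tokens, d_model))
W_in = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)

# Intermediate MLP activations after the ReLU nonlinearity.
h = np.maximum(x @ W_in, 0.0)

sparsity = 1.0 - np.count_nonzero(h) / h.size
print(f"fraction of zero entries: {sparsity:.2%}")
# For a random Gaussian toy layer this is ~50%; the cited finding is that
# trained Transformers are far sparser (on the order of 1% non-zeros),
# which is what makes FLOP reduction possible.
```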
Recent work introduces Treeformer, an alternative to standard attention computation that relies on decision trees. Intuitively, it quickly identifies a small subset of keys relevant to a query and performs the attention operation only on that set. Empirically, Treeformer can reduce the FLOPs of the attention layer by 30x. The same line of work also includes Sequential Attention, a differentiable feature selection method that combines attention with a greedy algorithm. The technique has strong provable guarantees for linear models and scales seamlessly to large embedding models.
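The following sketch conveys the retrieval idea only, with a random-projection tree standing in for Treeformer's learned decision trees: keys are bucketed once, and each query attends only within its own bucket.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, depth = 4096, 64, 6                # 2**depth leaf buckets

keys = rng.normal(size=(n, d))
values = rng.normal(size=(n, d))
planes = rng.normal(size=(depth, d))     # one random split per tree level

def leaf_id(x):
    """Route a vector down a random-projection tree: one bit per level."""
    bits = (planes @ x > 0).astype(int)
    return int("".join(map(str, bits)), 2)

buckets = {}
for i, k in enumerate(keys):
    buckets.setdefault(leaf_id(k), []).append(i)

def sparse_attention(q):
    """Attend only over the keys that fall in the query's leaf."""
    idx = np.array(buckets.get(leaf_id(q), []))
    if idx.size == 0:                    # empty leaf: fall back to full attention
        idx = np.arange(n)
    scores = keys[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values[idx]

out = sparse_attention(rng.normal(size=(d,)))
# FLOPs now scale with the leaf size (~n / 2**depth) rather than with n.
```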
Another way to improve Transformer efficiency is to speed up the softmax computation in the attention layer. Building on work on low-rank approximation of the softmax kernel, the researchers proposed a new class of random features that provides the first "positive and bounded" random-feature approximation of the softmax kernel, with computation that is linear in the sequence length.
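A minimal sketch of the idea, using the classic positive random features phi(x) = exp(w^T x - ||x||^2 / 2) as a stand-in for the paper's "positive and bounded" construction: the softmax kernel is approximated by a feature dot product, which turns attention into a computation that is linear in sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1024, 64, 256                        # sequence length, head dim, #features

Q = rng.normal(size=(n, d)) / d ** 0.25
K = rng.normal(size=(n, d)) / d ** 0.25
V = rng.normal(size=(n, d))

# Positive random features: phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m), w ~ N(0, I).
# Then E[phi(q) . phi(k)] = exp(q . k), i.e. the softmax kernel.
W = rng.normal(size=(m, d))

def phi(X):
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

Qf, Kf = phi(Q), phi(K)

# Linear attention: O(n * m * d) instead of O(n^2 * d).
numer = Qf @ (Kf.T @ V)                        # (n, d)
denom = Qf @ Kf.sum(axis=0)                    # (n,)
attn_approx = numer / denom[:, None]

# Exact softmax attention, for comparison only (quadratic in n).
S = np.exp(Q @ K.T)
attn_exact = (S @ V) / S.sum(axis=1, keepdims=True)
print("mean abs error:", np.abs(attn_approx - attn_exact).mean())
```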
Efficient optimization methods are the cornerstone of modern ML applications, and this is especially important at large scale. In this setting, even first-order adaptive methods like Adam are often expensive and face training-stability challenges. Moreover, these methods are usually agnostic to the neural network's architecture, ignoring its rich structure and leading to low training efficiency. This continually motivates new techniques for optimizing modern neural networks more effectively. Researchers are developing new architecture-aware training techniques; for example, some work on training Transformer networks includes new scale-invariant Transformer architectures and new pruning methods combined with stochastic gradient descent (SGD) to speed up training. With this approach, researchers were able to effectively train BERT with plain SGD for the first time, without the need for adaptivity.
Paper address: https://arxiv.org/pdf/2210.05758.pdf
In addition, the researchers proposed LocoProp, a new method that achieves performance similar to a second-order optimizer while using the same compute and memory resources as a first-order optimizer. LocoProp takes a modular view of neural networks, breaking them down into compositions of layers; each layer is then allowed its own loss function, output targets, and weight regularizer. With this setup, after suitable forward and backward passes, LocoProp proceeds to update each layer's local loss in parallel. In fact, these updates can be shown, both theoretically and empirically, to resemble those of higher-order optimizers. On a deep autoencoder benchmark, LocoProp matches the performance of higher-order optimizers while being considerably faster.
Paper link: https://proceedings.mlr.press/v151/amid22a.html
Similar to backpropagation, LocoProp applies a forward pass to compute activations. In the backward pass, LocoProp sets per-neuron targets for each layer. Finally, LocoProp splits model training into independent problems across layers, where several local updates can be applied to each layer's weights in parallel.
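A heavily simplified sketch of the layer-local idea follows (squared losses, tanh layers, and targets taken from a single gradient step are assumptions for illustration; the paper's losses, targets, and regularizers are more carefully chosen): one forward pass, per-layer targets from a backward pass, then independent local updates for each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [16, 32, 32, 4]                          # toy MLP: three tanh layers
Ws = [rng.normal(size=(a, b)) * 0.3 for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(8, sizes[0]))
y = rng.normal(size=(8, sizes[-1]))
lr_target, lr_local, local_steps = 0.5, 0.01, 5

# Forward pass: cache each layer's input and output.
acts = [x]
for W in Ws:
    acts.append(np.tanh(acts[-1] @ W))

# Backward pass: set a per-layer OUTPUT TARGET by taking a small gradient step
# on the global loss with respect to that layer's activation.
grad = 2 * (acts[-1] - y)                        # d(squared loss)/d(output)
targets = [None] * len(Ws)
for i in reversed(range(len(Ws))):
    targets[i] = acts[i + 1] - lr_target * grad
    grad = (grad * (1 - acts[i + 1] ** 2)) @ Ws[i].T   # push gradient one layer down

# Local phase: each layer independently (hence parallelizable) runs several
# gradient steps on its OWN squared loss toward its target, inputs held fixed.
for i, W in enumerate(Ws):
    a_in, t = acts[i], targets[i]
    for _ in range(local_steps):
        out = np.tanh(a_in @ W)
        W -= lr_local * a_in.T @ ((out - t) * (1 - out ** 2))
```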
Optimizers such as SGD assume that each data point is sampled independently and identically from a distribution. Unfortunately, this is hard to satisfy in real-world settings such as reinforcement learning, where the model (or agent) must learn from data generated by its own predictions. The researchers proposed a new SGD algorithm based on reverse experience replay, which finds optimal solutions in linear dynamical systems, nonlinear dynamical systems, and Q-learning. Furthermore, studies show that an enhanced version of this method, IER, is currently the state-of-the-art and the most stable experience-replay technique on a variety of popular RL benchmarks.
Paper address: https://arxiv.org/pdf/2103.05896.pdf
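The reverse-replay idea can be sketched on a toy streaming linear regression with Markovian covariates (the setting and step sizes are illustrative, not the paper's): gradient updates are applied over each buffer in reverse order.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_true = rng.normal(size=d)

def correlated_stream(n, rho=0.95):
    """Markovian covariates: x_t depends strongly on x_{t-1}, so data are not i.i.d."""
    x = rng.normal(size=d)
    for _ in range(n):
        x = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=d)
        yield x, x @ w_true + 0.1 * rng.normal()

w = np.zeros(d)
buffer, buf_size, lr = [], 64, 0.05
for x, y in correlated_stream(5000):
    buffer.append((x, y))
    if len(buffer) == buf_size:
        # Reverse experience replay: run the SGD updates over the buffer in
        # REVERSE order, which weakens the coupling between consecutive samples.
        for xb, yb in reversed(buffer):
            w -= lr * (xb @ w - yb) * xb
        buffer.clear()

print("parameter error:", np.linalg.norm(w - w_true))
```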
In many tasks, deep neural networks rely heavily on large datasets. Beyond the storage costs and potential security/privacy concerns that come with large datasets, training modern deep neural networks on them also incurs high computational costs. One promising way to address this problem is to select a subset of the data.
The researchers analyzed a subset-selection framework designed to work with arbitrary model families in a practical batch setting. In this setting, the learner can sample one example at a time, accessing both the context and the true label, but to limit overhead it can only update its state (i.e., further train the model weights) once a sufficiently large batch of examples has been selected. The researchers developed an algorithm called IWeS that selects examples by importance sampling, where the sampling probability assigned to each example is based on the entropy of a model trained on previously selected batches. The accompanying theoretical analysis gives bounds on generalization and on sampling rates.
Paper address: https://arxiv.org/pdf/2301.12052.pdf
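A sketch of the selection loop follows, with a toy multinomial logistic model standing in for the learner and entropy-proportional probabilities as a simplified stand-in for IWeS's sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pool, d, n_classes, batch = 2000, 16, 3, 64

X = rng.normal(size=(n_pool, d))
true_W = rng.normal(size=(d, n_classes))
y = np.argmax(X @ true_W + rng.gumbel(size=(n_pool, n_classes)), axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(Xs, ys, weights, steps=200, lr=0.5):
    """Toy weighted multinomial logistic regression (stands in for the learner)."""
    W = np.zeros((d, n_classes))
    onehot = np.eye(n_classes)[ys]
    for _ in range(steps):
        W -= lr * Xs.T @ (weights[:, None] * (softmax(Xs @ W) - onehot)) / len(ys)
    return W

# Seed with a uniformly sampled first batch, then select by importance sampling.
sel_idx = list(rng.choice(n_pool, batch, replace=False))
sel_w = [1.0] * batch
for _ in range(4):                                   # selection rounds
    W = train(X[sel_idx], y[sel_idx], np.array(sel_w))
    probs = softmax(X @ W)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    q = (entropy + 1e-9) / (entropy + 1e-9).sum()    # sampling distribution over pool
    new = rng.choice(n_pool, batch, replace=False, p=q)
    sel_idx += list(new)
    sel_w += list(1.0 / (n_pool * q[new]))           # importance weights
```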
Another problem with training large networks is that they can be highly sensitive to distribution shift between the training data and the data seen at deployment time, especially when the amount of training data is limited and may not cover all deployment-time scenarios. A recent study hypothesized that "extreme simplicity bias" is the key issue behind this brittleness of neural networks and made the hypothesis actionable, leading to two new complementary methods, DAFT and FRR, whose combination yields significantly more robust neural networks. In particular, the two methods use adversarial fine-tuning together with reverse feature prediction to strengthen the learned network.
Paper address: https://arxiv.org/pdf/2006.07710.pdf
It is well established that increasing the size of a neural network can improve its predictive accuracy. However, realizing these gains in the real world is challenging, because the inference cost of large models is prohibitive for deployment. This motivates strategies that improve serving efficiency without sacrificing accuracy. In 2022, researchers studied different strategies to achieve this goal, especially those based on knowledge distillation and adaptive computation.
Distillation
Distillation is a simple and effective model-compression method that greatly broadens the potential applicability of large neural models. Studies have shown that distillation delivers value in a range of practical applications such as ad recommendation. Most uses of distillation apply a basic recipe directly to a given domain, with limited understanding of when and why it should work. Google's research this year looked at tailoring distillation to specific settings and formally examined the factors that determine whether distillation succeeds.
On the algorithmic side, the research developed a principled way to reweight training examples by carefully modeling the noise in the teacher's labels, along with an effective method for sampling a subset of the data to be labeled by the teacher. In "Teacher Guided Training: An Efficient Framework for Knowledge Transfer", Google shows that instead of passively using the teacher to annotate a fixed dataset, the teacher can be used actively to guide the selection of informative samples to annotate. This makes the distillation process shine in limited-data or long-tail settings.
Paper address: https://arxiv.org/pdf/2208.06825.pdf
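A sketch of the active, teacher-guided selection step follows (the entropy criterion and linear teacher below are illustrative stand-ins, not the paper's exact procedure): the annotation budget is spent on the samples the teacher finds most informative, and the student is then distilled on the teacher's soft labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_unlabeled, d, n_classes, budget = 5000, 32, 10, 500

pool = rng.normal(size=(n_unlabeled, d))
teacher_W = rng.normal(size=(d, n_classes))          # stand-in for a large teacher

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

teacher_probs = softmax(pool @ teacher_W)

# Teacher-guided selection: rather than labeling a random fixed set, spend the
# annotation budget on the samples the teacher finds most informative
# (highest predictive entropy here, as a simple stand-in criterion).
entropy = -(teacher_probs * np.log(teacher_probs + 1e-12)).sum(axis=1)
chosen = np.argsort(-entropy)[:budget]

X_distill = pool[chosen]
soft_labels = teacher_probs[chosen]                  # teacher labels for the student

# The student is then trained on (X_distill, soft_labels) with a standard
# distillation loss, e.g. cross-entropy against the teacher's soft labels.
def distill_loss(student_logits, soft_labels):
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -(soft_labels * logp).sum(axis=1).mean()
```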
Google also studied new methods for distilling from cross-encoders (e.g., BERT) to factorized dual-encoders, an important setting for scoring the relevance of a (query, document) pair. The researchers examined the reasons for the performance gap between cross-encoders and dual-encoders, noting that it may result from generalization rather than from a capacity limitation of dual-encoders. Careful construction of the distillation loss function can mitigate this and reduce the gap between cross-encoder and dual-encoder performance. Subsequently, in EmbedDistill, the researchers investigated further improving dual-encoder distillation by matching the embeddings of the teacher model. This strategy can also be used to distill a large dual-encoder model into a small one, where inheriting and freezing the teacher's document embeddings proves very effective.
Paper address: https://arxiv.org/pdf/2301.12005.pdf
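The two ingredients, score distillation plus embedding matching, can be sketched as follows; the random-projection "encoders" and the 0.1 mixing weight are stand-ins, not EmbedDistill's actual models or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim_in, dim_emb = 32, 64, 32

queries = rng.normal(size=(batch, dim_in))
docs = rng.normal(size=(batch, dim_in))

# Stand-ins: the teacher is a cross-encoder scoring the concatenated pair; the
# student is a dual-encoder producing separate query / document embeddings.
teacher_w = rng.normal(size=(2 * dim_in,)) / np.sqrt(2 * dim_in)
teacher_scores = np.tanh(np.concatenate([queries, docs], axis=1) @ teacher_w)
teacher_q_emb = queries @ rng.normal(size=(dim_in, dim_emb)) / np.sqrt(dim_in)

Wq = rng.normal(size=(dim_in, dim_emb)) * 0.1      # student query tower
Wd = rng.normal(size=(dim_in, dim_emb)) * 0.1      # student document tower
student_q = queries @ Wq
student_d = docs @ Wd
student_scores = (student_q * student_d).sum(axis=1)   # dot-product relevance

# (1) Distill the teacher's relevance scores into the dual-encoder student.
score_loss = np.mean((student_scores - teacher_scores) ** 2)
# (2) Embedding-matching term: also pull the student's query embeddings toward
#     the teacher's, which helps close the remaining generalization gap.
embed_loss = np.mean((student_q - teacher_q_emb) ** 2)

total_loss = score_loss + 0.1 * embed_loss         # 0.1 is an illustrative weight
```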
On the theoretical side, the research provides a new perspective on distillation via supervision complexity, a measure of how well the student can predict the teacher's labels. Neural tangent kernel (NTK) theory provides conceptual insights here. The research further shows that distillation can cause the student to underfit points that the teacher model finds difficult to model. Intuitively, this may help the student focus its limited capacity on the samples it can reasonably model.
Paper address: https://arxiv.org/pdf/2301.12245.pdf
While distillation is an effective means of reducing inference cost, it does so uniformly across all samples. Intuitively, however, some easy samples may inherently require less computation than hard ones. The goal of adaptive computation is to design mechanisms that enable such sample-dependent computation.
CALM (Confident Adaptive Language Modeling) introduces controlled early-exit functionality for Transformer-based text generators such as T5.
Paper address: https://arxiv.org/pdf/2207.07061.pdf
In this form of adaptive computation, the model dynamically modifies the number of Transformer layers used at each decoding step. The early exit gate uses a confidence measure with a decision threshold that is calibrated to meet statistical performance guarantees. This way, the model only needs to compute the full stack of decoder layers for the most challenging predictions. Simpler predictions only require computing a few decoder layers. In practice, the model uses about one-third as many layers on average to make predictions, resulting in 2-3x speedups while maintaining the same level of generation quality.
Text generation with a regular language model (top) and with CALM (bottom). CALM attempts to make early predictions; once it is confident enough in the generated content (dark blue tones), it skips ahead to save time.
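A minimal sketch of confidence-gated early exit at decode time follows; the fixed threshold and toy layers are assumptions, whereas CALM calibrates its threshold so that statistical performance guarantees are met.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d, vocab = 12, 64, 1000
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
W_out = rng.normal(size=(d, vocab)) / np.sqrt(d)            # shared output head
threshold = 0.9                                             # exit confidence

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(h):
    """Run decoder layers, exiting early once the prediction looks confident."""
    for i, W in enumerate(layers):
        h = np.tanh(h @ W)
        probs = softmax(h @ W_out)                          # intermediate prediction
        if probs.max() >= threshold:                        # confident: skip the rest
            return probs.argmax(), i + 1
    return probs.argmax(), n_layers

token, layers_used = decode_step(rng.normal(size=(d,)))
print(f"predicted token {token} using {layers_used}/{n_layers} layers")
```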
A popular adaptive-computation mechanism is a cascade of two or more base models. A key question when using cascades is whether to simply use the current model's prediction or to defer the prediction to a downstream model. Learning when to defer requires designing a suitable loss function that can leverage appropriate signals as supervision for the deferral decision. To this end, the researchers formally studied existing loss functions and showed that they may be unsuitable for training samples due to an implicit application of label smoothing. The research shows that this can be mitigated by post-hoc training of the deferral rule, which does not require modifying the model internals in any way.
Paper address: https://openreview.net/pdf?id=_jg6Sf6tuF7
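A minimal sketch of a two-model cascade with a post-hoc deferral rule follows; the synthetic prediction probabilities and the 0.5% accuracy tolerance are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_classes = 2000, 5
labels = rng.integers(0, n_classes, size=n_val)
onehot = np.eye(n_classes)[labels]

# Synthetic stand-ins for held-out predictions: the large model is more accurate.
small_probs = 0.3 * onehot + 0.7 * rng.dirichlet(np.ones(n_classes), size=n_val)
large_probs = 0.6 * onehot + 0.4 * rng.dirichlet(np.ones(n_classes), size=n_val)

small_correct = small_probs.argmax(1) == labels
large_correct = large_probs.argmax(1) == labels
confidence = small_probs.max(1)

def cascade_metrics(tau):
    """Defer to the large model whenever the small model's confidence is below tau."""
    defer = confidence < tau
    accuracy = np.where(defer, large_correct, small_correct).mean()
    return accuracy, defer.mean()

# Post-hoc deferral rule: sweep the threshold on held-out data and keep the
# cheapest one whose accuracy is within 0.5% of always using the large model.
target = large_correct.mean() - 0.005
for tau in np.linspace(0.0, 1.0, 101):
    acc, defer_rate = cascade_metrics(tau)
    if acc >= target:                              # first (cheapest) qualifying threshold
        print(f"threshold={tau:.2f}, deferral rate={defer_rate:.2%}")
        break
```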
For retrieval applications, standard semantic search techniques use a fixed representation for each embedding generated by a large model. That is, the size and capability of the representation are essentially fixed regardless of the downstream task and its associated computing environment or constraints. MRL (Matryoshka Representation Learning) introduces the flexibility to adapt the representation to the deployment environment. When used together with standard approximate nearest-neighbor search techniques such as ScaNN, MRL can reduce computation by up to 16x while preserving the same recall and accuracy metrics.
Paper address: https://openreview.net/pdf?id=9njZa1fm35
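The deployment-side flexibility can be sketched as follows; the random embeddings are stand-ins for an MRL-trained encoder, and brute-force search stands in for an ANN library such as ScaNN.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, full_dim = 10_000, 256
nested_dims = [32, 64, 128, 256]        # Matryoshka granularities

# Stand-in corpus/query embeddings from an MRL-trained encoder.  During training,
# MRL applies the task loss to EACH prefix length in `nested_dims`, so every
# prefix is a valid embedding on its own.
doc_emb = rng.normal(size=(n_docs, full_dim))
query_emb = rng.normal(size=(full_dim,))

def truncate(e, dim):
    """Use only the first `dim` coordinates, renormalized for cosine search."""
    e = e[..., :dim]
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

def top_k(query, docs, dim, k=10):
    """Brute-force stand-in for an ANN library such as ScaNN."""
    scores = truncate(docs, dim) @ truncate(query, dim)
    return np.argsort(-scores)[:k]

# At deployment, pick the cheapest granularity the latency budget allows; a
# low-dimensional pass can also shortlist candidates that a higher-dimensional
# pass then re-ranks.
shortlist = top_k(query_emb, doc_emb, dim=32, k=100)
rerank_scores = truncate(doc_emb[shortlist], 256) @ truncate(query_emb, 256)
reranked = shortlist[np.argsort(-rerank_scores)[:10]]
```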