AIxiv is a column where this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work you would like to share, feel free to submit a contribution or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
The authors of this article are all from the team of Associate Professor Huang Lei at the School of Artificial Intelligence, Beihang University, and the National Key Laboratory of Complex Critical Software Environment. The first author, Ni Yunhao, is a first-year graduate student; the second author, Guo Yuxin, is a third-year graduate student; and the third author, Jia Junlong, is a second-year graduate student. The corresponding author is Associate Professor Huang Lei (homepage: https://huangleibuaa.github.io/).
Neural networks are usually composed of three kinds of layers: linear layers, nonlinear layers (activation functions), and normalization layers. The linear layers hold most of the network's parameters, the nonlinear layers improve the network's expressive power, and the normalization layers (Normalization) are mainly used to stabilize and accelerate training. Little work has studied the expressive power of normalization itself. Batch Normalization, for example, can be viewed as a linear transformation at inference time and introduces no nonlinearity, so researchers have generally assumed that Normalization cannot improve a model's expressive power. However, the paper "On the Nonlinearity of Layer Normalization", recently published at ICML 2024 by the team of Associate Professor Huang Lei at the School of Artificial Intelligence, Beihang University, points out that layer normalization (Layer Normalization, LN) and its computationally reduced variant RMSNorm do have nonlinear expressive power, and discusses in detail the universal approximate classification ability of LN.
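For reference, the contrast can be made concrete with the standard definitions (the notation below follows the common textbook form of BN and LN, not formulas copied from the paper): at inference time BN normalizes with fixed running statistics, so it is an affine function of its input, whereas LN recomputes the mean and variance from the d features of each individual sample, so the statistics themselves depend on the input.

```latex
% BN at inference: \hat{\mu}_i, \hat{\sigma}_i^2 are fixed running statistics, so BN is affine in x
\mathrm{BN}(x)_i = \gamma_i \, \frac{x_i - \hat{\mu}_i}{\sqrt{\hat{\sigma}_i^2 + \epsilon}} + \beta_i

% LN: the statistics are functions of the sample x \in \mathbb{R}^d itself, which breaks affinity
\mu(x) = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2(x) = \frac{1}{d}\sum_{i=1}^{d} \bigl(x_i - \mu(x)\bigr)^2, \qquad
\mathrm{LN}(x)_i = \gamma_i \, \frac{x_i - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}} + \beta_i
```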
- Paper address: https://arxiv.org/abs/2406.01255
This paper mathematically proves the nonlinearity of LN and proposes LN-Net, a simple neural network containing only linear layers and LN. Given enough depth, it can in theory correctly classify any given samples with arbitrary labels. This finding breaks the habitual view that the various normalizations are merely linear transformations with no fitting ability, so that nonlinear layers and normalization layers are no longer disjoint modules of a neural network. Today, with the widespread use of Transformers, LN has become a standard, fixed component, and this research may provide a new theoretical basis for neural network architecture design in this direction; in that sense it is of groundbreaking significance.

The mathematical discovery of LN's nonlinearity

For the nonlinearity analysis, the article does not directly examine the analytic properties of LN itself, but studies the interaction between LN and data in a more practical way. The authors first propose the statistic SSR (Sum of Squares Ratio) to describe the linear separability of samples under two classes. When the samples undergo a linear transformation, the SSR changes accordingly; the minimum SSR attained over all linear transformations of the samples is defined as the LSSR. The article points out that the smaller the LSSR, the stronger the linear separability of the samples. However, when the linear transformation applied to the samples is replaced by the structure "linear transformation - LN - linear transformation", the resulting SSR can be lower than the LSSR, which verifies the nonlinear expressiveness of LN: if LN were linear, then "linear transformation - LN - linear transformation" would also be linear, and the resulting SSR could not fall below the LSSR.

Arbitrary separability of LN in classification problems

Going further, the authors split LN into two steps: centering and scaling. Centering is mathematically a linear transformation, so the nonlinearity of LN lies mainly in the scaling operation (also called spherical projection in the article, which is exactly the operation performed by RMSNorm). Taking the simplest linearly inseparable dataset, the XOR data, as an example, the authors correctly classify its four points using only linear transformations and spherical projection.
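To make the "linear transformation + spherical projection" idea concrete, here is a small, self-contained sketch of separating the four XOR points. The specific weights, bias, and threshold below are illustrative choices of ours, not the construction given in the paper; dividing by the per-sample Euclidean norm matches RMSNorm's scaling up to a constant factor of sqrt(d), so it produces the same directions on the sphere.

```python
import numpy as np

# XOR points and labels: not linearly separable in the input space.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([0, 0, 1, 1])

# One linear layer with a hypothetical choice of weights and bias.
W = np.array([[1., -1.],
              [1.,  1.]])
b = np.array([0., 1.])
H = X @ W.T + b                      # rows: (0,1), (0,3), (-1,2), (1,2)

# Spherical projection (RMSNorm-style scaling, without centering).
Z = H / np.linalg.norm(H, axis=1, keepdims=True)

# After projection, a single linear threshold on the second coordinate suffices:
# both class-0 points land exactly on (0, 1), the class-1 points at height ~0.894.
pred = (Z[:, 1] < 0.95).astype(int)
print(Z.round(3))
print(pred, y)                       # pred matches y: [0 0 1 1]
```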
More generally, the authors propose an algorithm that uses LN and linear layers to correctly classify any number of samples, thereby exploring the universal approximation capability of LN-Net. The construction recasts the layer-by-layer transformation of the neural network as a sample-merging problem, reducing universal approximate classification to sample merging, and shows that for m samples with arbitrary labels, one can construct an LN-Net of depth O(m) that classifies all m samples correctly. The construction also provides a new way to bound the VC dimension of neural networks: on this basis the authors infer that an LN-Net with L normalization layers has VC dimension at least L + 2.

Enhancing LN's nonlinearity and practical applications

Building on the proof of LN's nonlinearity, the authors propose a grouped layer normalization technique (LN-G) to further strengthen the nonlinearity of LN for practical use (a minimal code sketch of the grouping idea is given at the end of this article). From the perspective of the Hessian matrix, they argue mathematically that grouping amplifies the nonlinearity of LN, and they explore the expressive power of LN-G experimentally. On the CIFAR-10 random-label dataset, a model built from ordinary linear layers reaches no more than 20% accuracy, while a network composed of linear layers and LN-G, with no traditional activation function serving as the nonlinear unit, reaches 55.85% accuracy. The authors further study the classification performance of LN-G in convolutional neural networks without activation functions and show experimentally that such activation-free networks do have strong fitting ability. In addition, by analogy with the MLP case, the authors propose LN-G-Position, in which GN acts on the entire sample (a single sample is flattened into a one-dimensional vector before GN is applied). Applying LN-G-Position to a ResNet with no nonlinear layers achieves 86.66% accuracy on CIFAR-10, reflecting the strong expressive power of LN-G-Position. The authors also experiment on Transformers, replacing the original LN with LN-G; the results show that grouped layer normalization effectively improves Transformer performance, demonstrating that the theory is workable in real networks.

In "On the Nonlinearity of Layer Normalization", the authors theoretically prove for the first time the universal classification ability of a model containing only linear layers and LN, and give a lower bound on the VC dimension of such a model at a given depth. The most important significance is that the analysis of the expressive power of deep neural networks takes a large step toward the modern networks that are widely used in practice, which may provide new ideas for future neural network architecture design.
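As referenced above, here is a minimal PyTorch sketch of the grouped layer standardization idea: the features of each sample are split into equal groups and each group is standardized on its own. The function name ln_g, the choice of group count, and the omission of learnable affine parameters are simplifications of ours for illustration, not the paper's exact implementation.

```python
import torch

def ln_g(x: torch.Tensor, num_groups: int, eps: float = 1e-5) -> torch.Tensor:
    """Group layer standardization (LN-G) sketch: split the d features of each
    sample into num_groups groups and standardize each group separately.
    With num_groups == 1 this reduces to plain (unscaled) layer normalization."""
    n, d = x.shape
    assert d % num_groups == 0, "feature dimension must be divisible by num_groups"
    g = x.reshape(n, num_groups, d // num_groups)
    mean = g.mean(dim=-1, keepdim=True)
    var = g.var(dim=-1, unbiased=False, keepdim=True)
    g = (g - mean) / torch.sqrt(var + eps)
    return g.reshape(n, d)

# Example: a "linear layer + LN-G" block with no activation function in between.
x = torch.randn(8, 64)
h = torch.nn.Linear(64, 64)(x)
out = ln_g(h, num_groups=8)
print(out.shape)  # torch.Size([8, 64])
```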