How I wrote a top essay from scratch-AI-php.cn

I recently completed a very satisfying paper. Not only was the whole process enjoyable and memorable, but it also truly achieved "academic impact and industrial output." I believe this article will change the paradigm of differential privacy (DP) deep learning.

Because this experience is so "coincident" (the process is full of coincidences and the conclusion is extremely clever), I would like to share with my classmates my experience from observation-->conception-->evidence-- >Theory-->Complete process of large-scale experiments. I will try to keep this article lightweight and not involve too many technical details.

How I wrote a top essay from scratch

Paper address: arxiv.org/abs/2206.07136

Different from the order presented in the paper, the paper sometimes deliberately places the conclusions Put it at the beginning to attract readers, or introduce the simplified theorem first and put the complete theorem in the appendix. In this article, I want to write down my experience in chronological order (that is, a running account), such as the detours I have taken and the sudden changes in my research. Write down the status of the development for reference by students who have just embarked on the path of scientific research.

1. Literature Reading

The origin of the matter is a paper from Stanford, which has now been recorded in ICLR:

How I wrote a top essay from scratch

Paper address: https://arxiv.org/abs/2110.05679

The article is very well written. In summary, it has three main contributions:

1. In NLP tasks, the accuracy of the DP model is very high, which encourages the application of privacy in language models. (In contrast, DP in CV will cause a very large accuracy deterioration. For example, CIFAR10 currently has 80% accuracy without pre-training under the DP limit, but can easily reach 95% without considering DP; ImageNet's best DP accuracy at the time was less than 50%.)

2. On language models, the larger the model, the better the performance will be. For example, the performance improvement of GPT2 from 400 million parameters to 800 million parameters is obvious, and it has also achieved many SOTAs. (But in CV and recommendation systems, in many cases the performance of larger models will be very poor, even close to random guess. For example, the DP best accuracy of CIFAR10 was previously obtained by four-layer CNN, not ResNet.)

How I wrote a top essay from scratch

In NLP tasks, the larger the DP model, the better the performance [Xuechen et al. 2021]

3. Obtain SOTA super on multiple tasks The parameters are the same: the clipping threshold needs to be set small enough, and the learning rate needs to be larger. (All previous articles were to adjust a clipping threshold for each task, which was time-consuming and labor-intensive. There has never been a clipping threshold=0.1 like this one that runs through all tasks, and the performance is so good.)

Summary of the above I understood it immediately after reading the paper. The content in brackets is not from this paper, but the impression generated by many previous readings. This relies on long-term reading accumulation and a high degree of generalization ability to quickly associate and compare.

In fact, it is difficult for many students to start writing articles because they can only see the content of one article and cannot form a network or make associations with knowledge points in the entire field. On the one hand, students who are just starting out do not read enough and have not yet mastered enough knowledge points. This is especially the case for students who have been taking projects from teachers for a long time and do not propose independently. On the other hand, although the amount of reading is sufficient, it is not summarized from time to time, resulting in the information not being condensed into knowledge or the knowledge not being connected.

Here is the background knowledge of DP deep learning. I will skip the definition of DP for now and it will not affect reading.

The so-called DP deep learning from an algorithmic point of view actually means doing two extra steps: per-sample gradient clipping and Gaussian noise addition; in other words, as long as you follow the gradient according to these two After step processing is completed (the processed gradient is called private gradient), you can use the optimizer however you want, including SGD/Adam.

As for how private the final algorithm is, it is a question in another sub-field, called privacy accounting theory. This field is relatively mature and requires a strong theoretical foundation. Since this article focuses on optimization, it will not be mentioned here.

How I wrote a top essay from scratch

g_i is the gradient of a data point (per-sample gradient), R is the clipping threshold, and sigma is the noise multiplier.

Clip is called clipping function, just like regular gradient clipping. If the gradient is longer than R, it will be cut to R, and if it is less than R, it will not move.

For example, the DP version of SGD is currently used in all papers using the clipping function in the pioneering work of privacy deep learning (Abadi, Martin, et al. "Deep learning with differential privacy."), also known as Abadi's clipping: picture.

But this is completely unnecessary. Following the first principles and starting from privacy accounting theory, in fact, the clipping function only needs to satisfy that the modulus of Clip(g_i)*g_i is less than or equal to R. In other words, Abadi's clipping is just one function that satisfies this condition, but it is by no means the only one.

2. Entry point

There are many shining points in an article, but not all of them can be used by me. I have to judge the greatest contribution based on my own needs and expertise. What is it.

The first two contributions of this article are actually very empirical and difficult to dig into. The last contribution is very interesting. I carefully looked at the ablation study of the hyperparameters and found a point that the original author did not discover: when the clipping threshold is small enough, in fact, the clipping threshold (that is, clipping norm C, in the above formula and R is a variable) has no effect.

How I wrote a top essay from scratch

Longitudinally, C=0.1, 0.4, 1.6 has no difference to DP-Adam [Xuechen et al. 2021].

This aroused my interest, and I felt that there must be some principle behind it. So I handwrote the DP-Adam they used to see why. In fact, it is very simple:

How I wrote a top essay from scratch

If R is small enough, clipping is actually equivalent to normalization! By simply substituting private gradient (1.1), R can be extracted from the clipping part and the noising part respectively:

How I wrote a top essay from scratch

And the form of Adam makes R appear at the same time In the gradient and adaptive step size, once the numerator and denominator cancel, R is gone and the idea is there!

How I wrote a top essay from scratch

Both m and v depend on the gradient, and replacing them with private gradients results in DP-AdamW.

Such a simple substitution proves my first theorem: in DP-AdamW, sufficiently small clipping thresholds are equivalent to each other, and no parameter adjustment is required.

No doubt this is a succinct and interesting observation, but it doesn't make enough sense, so I need to think about what practical use this observation has.

In fact, this means that DP training reduces the parameter adjustment work by an order of magnitude: assuming that the learning rate and R are adjusted to 5 values each (as shown above), then 25 combinations must be tested to find the best Excellent hyperparameters. Now you only need to adjust the learning rate in 5 possibilities, and the parameter adjustment efficiency has been improved several times. This is a very valuable pain point issue for the industry.

The intention is high enough, the mathematics is concise enough, and a good idea has begun to take shape.

3. Simple expansion

If it is only established for Adam/AdamW, the limitations of this work are still too great, so I quickly expanded it to AdamW and other adaptive optimizers, such as AdaGrad. In fact, for all adaptive optimizers, it can be proved that the clipping threshold will be offset, so there is no need to adjust parameters, which greatly increases the richness of the theorem.

There is another interesting little detail here. As we all know, Adam with weight decay is different from AdamW. The latter uses decoupled weight decay. There is an ICLR article on this difference

How I wrote a top essay from scratch

There are two kinds of Adam Add weight decay.

This difference also exists in the DP optimizer. The same is true for Adam. If decoupled weight decay is used, scaling R does not affect the size of weight decay. However, if ordinary weight decay is used, enlarging R by twice is equivalent to reducing weight decay by twice.

4. There is another world

Smart students may have discovered that I have always emphasized the adaptive optimizer. Why don’t I talk about SGD? The answer is that I After writing the theory of DP adaptive optimizer, Google immediately published an article on DP-SGD used in CV and also did an ablation study. However, the rules were completely different from those found in Adam, which left me with a diagonal impression

How I wrote a top essay from scratch

For DP-SGD and when R is small enough, increasing lr by 10 times is equal to increasing R by 10 times [https://arxiv.org/abs/2201.12328].

I was very excited when I saw this article, because it was another paper that proved the effectiveness of small clipping threshold.

In the scientific world, there are often hidden patterns behind consecutive coincidences.

Simply substitute it and find that SGD is easier to analyze than Adam. (1.3) can be approximated as:

How I wrote a top essay from scratch

##Obviously R and It can be proposed, combined with the learning rate, to theoretically prove Google's observation.

“Specifically, when the clipping norm is decreased k times, the learning rate should be increased k times to maintain similar accuracy.”

It’s a pity that Google only sees The phenomenon has not risen to the level of theory. There is also a coincidence here, that is, they drew an ablation study of two scales at the same time in the above picture. Only the left scale can see the diagonal line. There is no conclusion just looking at the right side...

Since there is no adaptive step size, SGD does not ignore R like Adam, but treats R as part of the learning rate, so there is no need to adjust it separately. Anyway, the learning rate is adjusted together if the parameters are adjusted.

Then extend the theory of SGD to momentum, and all optimizers supported by Pytorch have been analyzed.

5. From intuition to rigor

There is an innovative point, but Abadi's clipping is strictly an approximate normalization and cannot be equalized, so there is no way Conclusively analyze convergence.

According to the principle of Doraemon Iron Man Corps, I directly named normalization as the new per-sample gradient clipping function, which replaced the Abadi clipping that has been used in the entire field for 6 years. This is my second an innovative point.

After the previous proof, the new clipping strictly does not require R, so it is called automatic clipping (AUTO-V; V for vanilla).

Since the form is different from Abadi's clipping, the accuracy will be different, and my clipping may have disadvantages.

So I need to write code to test my new method, and this only requires changing one line of code (after all, it is just a picture).

In fact, there are mainly three clipping functions in the direction of DP per-sample gradient clipping. In addition to Abadi's clipping, the two are proposed by me. One is global clipping, and the other is this article. automatic clipping. In my previous work, I already knew how to change clipping in various popular libraries. I put the modification method in the appendix at the end of the article.

After my testing, I found that in the Stanford article, GPT2 has clipped all iterations and all per-sample gradients during the entire training process. In other words, at least in this experiment, Abadi's clipping is completely equivalent to automatic clipping. Although later experiments did lose to SOTA, this has shown that my new method has enough value: a clipping function that does not need to adjust the clipping threshold, and sometimes accuracy will not be sacrificed.

6. Return to abstract thinking

The Stanford article has two major types of language model experiments, one is the generative task using GPT2 as the model, and the other is RoBERTa is a model classification task. Although automatic clipping is equivalent to Abadi's clipping on generation tasks, the accuracy on classification tasks is always a few points worse.

Due to my own academic habits, I will not change the data set at this time and then select our dominant experiments to publish, let alone add tricks (such as data enhancement and magic model modification) some type of). I hope that in a completely fair comparison, I can only compare per-sample gradient clipping and achieve the best moisture-free effect possible.

In fact, in discussions with our collaborators, we found that pure normalization and Abadi's clipping completely discard the information of gradient size. That is to say, for automatic clipping, no matter how big the original gradient is, After clip, R is always as large as R, and Abadi retains size information for gradients smaller than R.

Based on this idea, we made a small but extremely clever change, called AUTO-S clipping (S stands for stable)

How I wrote a top essay from scratch

After integrating R and learning rate, it becomes

How I wrote a top essay from scratch

You can find this small (usually set to 0.01, but actually set to something else) Any positive number will work, it is very robust) and the information of the gradient size can be retained:

How I wrote a top essay from scratch

Based on this algorithm, just change one line and rerun the Stanford code, and you will get the SOTA of six NLP data sets.

How I wrote a top essay from scratch

On the E2E generation task, AUTO-S surpasses all other clipping functions, as well as on the SST2/MNLI/QNLI/QQP classification task.

7. Make a general algorithm

One limitation of Stanford’s article is that it only focuses on NLP, and what a coincidence is that: immediately after Google brushed up ImageNet’s DP Two months after SOTA, Google subsidiary DeepMind released an article on how DP shines in CV, directly increasing the ImageNet accuracy from 48% to 84%!

How I wrote a top essay from scratch

Paper address: https://arxiv.org/abs/2204.13650

In this article, I am the first It took time to look at the choice of optimizer and clipping threshold until I turned to this picture in the appendix:

How I wrote a top essay from scratch

DP-SGD's SOTA on ImageNet also requires clipping. threshold is small enough.

Still the small clipping threshold works best! With three high-quality articles supporting automatic clipping, I already have a strong motivation, and I am more and more sure that my work will be outstanding.

Coincidentally, this article by DeepMind is also a pure experiment without theory, which also led them to almost realize that they can theoretically do not need R. In fact, they are really close to my idea. , they have even discovered that R can be extracted and integrated with the learning rate (interested students can take a look at their formulas (2) and (3)). But the inertia of Abadi's clipping was too great... Even though they figured out the rules, they didn't go any further.

How I wrote a top essay from scratch

DeepMind also found that small clipping threshold works best, but did not understand why.

Inspired by this new work, I started to experiment with CV so that my algorithm can be used by all DP researchers, instead of using one set of methods for NLP and another for CV. .

A good algorithm should be universal and easy to use. Facts have also proved that automatic clipping can also achieve SOTA on CV data sets.

How I wrote a top essay from scratch

8. Theory as bones and experiments as wings

Looking at all the above papers, SOTA has significantly improved and achieved engineering results. It's full, but the theory is completely blank.

When I finished all the experiments, the contribution of this work has exceeded the requirements of a top conference: I will empirically use DP-SGD and DP-Adam generated by small clipping threshold. The influence of parameters is greatly simplified; a new clipping function is proposed without sacrificing computational efficiency, privacy, and without parameter adjustment; the small γ repairs the damage to gradient size information caused by Abadi's clipping and normalization; sufficient NLP and CV experiments have Achieved SOTA accuracy.

I'm not satisfied yet. An optimizer without theoretical support is still unable to make a substantial contribution to deep learning. Dozens of new optimizers are proposed every year, and all of them are discarded in the second year. There are still only a few officially supported by Pytorch and actually used by the industry.

For this reason, my collaborators and I spent an additional two months doing automatic DP-SGD convergence analysis. The process was difficult but the final proof was simplified to the extreme. The conclusion is also very simple: the impact of batch size, learning rate, model size, sample size and other variables on convergence is quantitatively expressed, and it is consistent with all known DP training behaviors.

Specially, we proved that although DP-SGD converges slower than standard SGD, when the iteration tends to infinite, the convergence speed is an order of magnitude. This provides confidence in privacy computation: the DP model converges, albeit late.

9. Car crash...

Finally, the article I wrote for 7 months was finished. Unexpectedly, the coincidences haven’t stopped yet. NeurIPS submitted the paper in May, and the internal modifications were completed on June 14 and released to arXiv. As a result, on June 27, I saw that Microsoft Research Asia (MSRA) published an article that collided with ours. The clipping proposed was exactly the same as our automatic clipping:

How I wrote a top essay from scratch

Exactly the same as our AUTO-S.

Looking carefully, even the proof of convergence is almost the same. And the two groups of us have no intersection. It can be said that a coincidence was born across the Pacific Ocean.

Let’s briefly talk about the differences between the two articles: the other article is more theoretical, for example, it additionally analyzes the convergence of Abadi DP-SGD (I only proved automatic clipping, which is the DP- in their article) NSGD, maybe I don’t know how to adjust DP-SGD); the assumptions used are also somewhat different; and our experiments are more and larger (more than a dozen data sets), and we have more explicitly established Abadi's clipping and normalization, etc. Valence relations, such as Theorem 1 and 2 explain why R can be used without parameter adjustment.

Since we are working at the same time, I am very happy that there are people who agree with each other and can complement each other and jointly promote this algorithm, so that the entire community can believe in this result and benefit from it as soon as possible. Of course, selfishly, I also remind myself that the next article will be accelerated!

10. Summary

Looking back on the creative process of this article, from the starting point, basic skills must be the prerequisite, and another important prerequisite is that I have always kept it in mind. The pain point of adjusting parameters. It's been a long drought, so reading the right article can help you find nectar. As for the process, the core lies in the habit of mathematically theorizing observation. In this work, the ability to implement code is not the most important. I will write another column focusing on another hard-core coding work; the final convergence analysis also relies on my collaborators and my own indomitability. Fortunately, you are not afraid of being late for a good meal, so keep going!

About the author

Bu Zhiqi, graduated from the University of Cambridge with a bachelor's degree and a Ph.D. from the University of Pennsylvania. He is currently a senior research scientist at Amazon AWS AI. His research direction is differential privacy and deep learning, focusing on Optimization algorithms and large-scale computation.

The above is the detailed content of How I wrote a top essay from scratch. For more information, please follow other related articles on the PHP Chinese website!