"In the digital age, problems can be quantitatively evaluated, and machine learning can make more intelligent and efficient optimization around goals."
On April 18, Volcano Engine released a series of cloud products, including its self-developed DPU, and launched a new version of its machine learning platform to help enterprise customers train large AI models more effectively. Yang Zhenyuan, Vice President of ByteDance, shared his understanding of machine learning in a talk titled "Douyin's Machine Learning Practice".
Yang Zhenyuan believes that the core competitiveness of a machine learning system lies in making each experiment fast and cheap. Algorithm engineers can then focus on their own work and keep experimenting by trial and error at very low cost; only in this way can the business iterate and innovate with agility. He said: "The Volcano Engine machine learning platform is unified internally and externally. Volcano Engine customers and Douyin use the same platform. I hope that these technologies polished within the company can serve more customers and support everyone in making intelligent innovations."
The following is the full text of Yang Zhenyuan’s speech:
Good morning! As you all know, Douyin and other businesses are internal customers of Volcano Engine, and they all run on the Volcano Engine cloud. Today I will share some practical experience from inside the company: how Volcano Engine supports Douyin's use of machine learning.
First, why talk about machine learning at all? In what scenarios and under what circumstances should we use a machine learning system? What are the challenges of using machine learning, and how did we solve them?
I think the most important thing about machine learning is to digitize the problem. Digitize it first, so that the problem can be assessed quantitatively. Once it can be evaluated quantitatively, it can be made intelligent and further optimized with machine learning methods.
Some friends have asked me, "Zhenyuan, can you help me build a model?" I would ask what they wanted to use the model for, and it turned out they had not thought it through themselves.
I would like to explain the use of machine learning through a few examples.
Take performance advertising. For merchants, the question is whether they can acquire customers at a reasonable cost. For the platform, the question is whether the most suitable ad can be placed in a given ad slot. How do we evaluate this? It is simple: we just look at the conversion rate, so the goal can be clearly defined.
Once the goal is clearly defined, you can run A/B experiments, determine which method is better, and then use machine learning to optimize further. In the end, it often turns out that manual methods, such as hand-picking which users to show ads to, can hardly beat machine learning.
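As an illustration of how an A/B experiment on conversion rate might be judged, here is a minimal sketch using a two-proportion z-test. The speech does not say which statistical method Douyin uses; the test, the function name `two_proportion_z_test`, and the sample counts are all assumptions chosen for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rate of variant B against variant A.

    Returns (lift, z, p_value), where lift is the absolute difference
    in conversion rate and p_value is two-sided.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: 10,000 impressions per arm.
lift, z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.4f}")
```

The point of the sketch is only that a clearly defined metric turns "which method is better" into a question the data can answer.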
Another example is issuing coupons. Given the same budget, which users should receive coupons so that the platform gains the most long-term retention? This is also a question that can be precisely quantified and evaluated. For such a problem, we can then consider which algorithm and which kind of machine learning to use for optimization.
Transport capacity dispatching is a field everyone is familiar with, and it can also be evaluated quantitatively, through the order rate. If the matching is poor, drivers and passengers are not paired effectively. I won't go into detail about autonomous driving; evaluating results in that field actually involves more dimensions, such as safety, time, and comfort.
To sum up, the core issue is to define the problem clearly: digitize it first, then make it intelligent.
What problems do we run into when we use machine learning to make things intelligent? There are two main ones: it is complicated, and it is expensive.
Why is it complicated? Because the machine learning software stack is very deep. It needs a platform layer, with PyTorch, TensorFlow, and many others; it also involves frameworks, operating systems, and the underlying hardware. Lately, when people meet, they ask each other how many GPU cards they have; if you have none, you are almost too embarrassed to say hello. But in fact, many people do not know how efficiently those cards are being used. So the machine learning software stack is deep and complex, and every layer has to be done correctly and well.
Now for the cost. People are expensive: a very good algorithm engineer costs a lot and is hard to find. Data is expensive too, and high-quality data costs a great deal. As for hardware, everyone knows the price of high-performance GPUs.
So machine learning is complex and expensive. How does Douyin deal with this and make better use of machine learning to help the business grow?
Let me briefly introduce our platforms. We have two main ones: a recommendation and advertising platform, and a general-purpose platform covering CV (computer vision), NLP (natural language processing), and so on.
On the recommendation platform, tens of thousands of models are trained every week, because we have many products and frequently train models for different scenarios. On the CV/NLP platform the number is even larger, at roughly 200,000 models trained per week. In addition, a large number of online services run on these two platforms every day.
Here is an example. Douyin's recommendation system has many models, and one of them requires 15 months of samples to train, which means the training data has to be accumulated continuously over 15 months. That is a very large amount of data. But on our machine learning platform, training this model takes only 5 hours, at a computed cost of only 5,000 yuan. An algorithm engineer can train the model in the morning and run an A/B experiment online in the afternoon, which greatly improves product iteration efficiency.
Whether machine learning is being done well can, I think, be represented by this triangle. The most important corner is the algorithm: if the algorithm leads in effectiveness, it can bring great value to the business. Two things support algorithm effectiveness: hardware ROI and human ROI.
Hardware ROI refers to the cost per model. In market competition, if others spend 10,000 yuan to build one model and you spend 10,000 yuan to build ten comparable models, you are in a winning position. Human ROI is about the strong algorithm engineers you recruit: whether they can reach their full potential depends mainly on whether the system lets them try new ideas easily and quickly.
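The hardware ROI comparison is simple arithmetic, and a small sketch makes it concrete. The 10,000 yuan figures and the 5,000 yuan training run come from the speech; the helper names and the 50,000 yuan budget are hypothetical.

```python
def cost_per_model(total_cost_yuan: float, models_trained: int) -> float:
    """Average spend per trained model."""
    if models_trained <= 0:
        raise ValueError("models_trained must be positive")
    return total_cost_yuan / models_trained


def experiments_within_budget(budget_yuan: float, cost_per_experiment_yuan: float) -> int:
    """How many whole experiments a fixed budget can pay for."""
    return int(budget_yuan // cost_per_experiment_yuan)


others = cost_per_model(10_000, 1)   # one model for 10,000 yuan
ours = cost_per_model(10_000, 10)    # ten comparable models for 10,000 yuan
print(f"cost advantage: {others / ours:.0f}x")

# Hypothetical 50,000 yuan budget at 5,000 yuan per training run.
print(experiments_within_budget(50_000, 5_000))
```

A lower cost per model translates directly into more attempts for the same budget.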
How do we improve hardware ROI? Tidal scheduling and co-location (mixed deployment) are some of the methods we commonly use. In essence, the question is how to improve device utilization, which is also a basic idea of cloud native. We mix different tasks together so that their peaks are staggered, and run them at high utilization through intelligent scheduling. This greatly improves resource utilization and reduces the cost of each experiment.
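To see why staggering peaks raises utilization, consider a toy model with two workloads whose demand peaks at different times. This is only a sketch under assumed numbers: the load profiles are invented, and it does not describe Volcano Engine's actual scheduler.

```python
def utilization(load, capacity):
    """Fraction of provisioned capacity actually used over all time slots."""
    return sum(load) / (capacity * len(load))


# Hypothetical demand per time slot, in arbitrary device units.
online_serving = [30, 20, 20, 40, 80, 100]   # peaks late in the day
offline_training = [90, 100, 100, 70, 30, 10]  # peaks early in the day

combined = [a + b for a, b in zip(online_serving, offline_training)]

# Separate clusters: each must be sized for its own peak.
dedicated_capacity = max(online_serving) + max(offline_training)
# Shared cluster: sized for the peak of the combined load.
shared_capacity = max(combined)

print(f"dedicated: capacity={dedicated_capacity}, "
      f"utilization={utilization(combined, dedicated_capacity):.1%}")
print(f"shared:    capacity={shared_capacity}, "
      f"utilization={utilization(combined, shared_capacity):.1%}")
```

Because the two peaks do not coincide, the shared cluster serves the same total work with less provisioned capacity, which is where the lower cost per experiment comes from.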
Besides hardware cost, another very important point is whether the machine learning infrastructure is easy enough to use. A joke: many mathematicians do not think much of people in computer science, especially deep learning. They say we are "making elixirs": we often cannot explain why our stuff works, so why do we keep running experiments? But from a practical standpoint, we have to keep experimenting and trying. Many new discoveries in this field come from continual attempts.
Making every attempt faster and cheaper is the core competitiveness. It is difficult to arrive at a perfect model in one shot.
What Volcano Engine has to do is build the platform well. As you can see, the entire process of data processing, model training, evaluation, launch, and A/B testing is unified and integrated on the platform. Algorithm engineers do not need to go back and forth coordinating each stage and connecting various business systems; they can focus on their own work.
Let's look at another example. This is a very interesting special effect (TikTok AI painting), and I guess many of you have used it. Around the end of last year, this effect became extremely popular. Guess how much manpower Douyin invested in building it? Many people would not expect this, but only one algorithm engineer was assigned. He wrote some research code on the platform, spent about a week training the model, and after some adjustments it went online.
At the time, the product team estimated a peak traffic of 200 QPS, and we planned for 2,000 QPS at launch. Unexpectedly, that capacity was saturated within a few hours of going live. We quickly scaled out, and within a short time capacity had expanded 10 times to support 20,000 QPS.
Looking at the whole process, very few people were involved, and scaling out was very efficient. Many people say that model training is expensive. In fact, in the long run, the cost of inference will be significantly greater than that of training. On the Volcano Engine platform, inference for the AI painting model is approximately five times faster than with the native PyTorch model. After launch, some targeted optimizations made it faster still, about 10 times faster, which is an order-of-magnitude improvement.
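The claim that inference eventually outweighs training can be illustrated with back-of-the-envelope arithmetic. In the sketch below, only the 2,000 QPS figure and the 5x speedup come from the speech; the training cost, the serving cost per million requests, the assumption of sustained traffic, and the assumption that serving cost falls in proportion to the speedup are all hypothetical.

```python
SECONDS_PER_DAY = 86_400


def inference_cost(qps, days, cost_per_million, speedup=1.0):
    """Cumulative serving cost for sustained traffic.

    Assumes cost falls in direct proportion to the inference speedup.
    """
    requests = qps * SECONDS_PER_DAY * days
    return requests / 1_000_000 * cost_per_million / speedup


def days_until_inference_exceeds_training(training_cost, qps, cost_per_million, speedup=1.0):
    """Days of serving after which cumulative inference cost passes training cost."""
    daily = inference_cost(qps, 1, cost_per_million, speedup)
    return training_cost / daily


TRAINING_COST = 100_000   # hypothetical one-off training cost, in yuan
COST_PER_MILLION = 50     # hypothetical serving cost per million requests, in yuan

baseline = days_until_inference_exceeds_training(TRAINING_COST, 2_000, COST_PER_MILLION)
optimized = days_until_inference_exceeds_training(TRAINING_COST, 2_000, COST_PER_MILLION, speedup=5)
print(f"baseline: {baseline:.1f} days, with 5x faster inference: {optimized:.1f} days")
```

Under these assumed numbers, serving overtakes the one-off training cost within weeks, and every multiple of inference speedup delays that point by the same multiple.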
With this kind of platform support, engineers can quickly try out ideas, whether they are following up on existing progress or pioneering something new.
Finally, you may notice that some apps, such as Douyin, Toutiao, and Dongchedi, display on screen: "Volcano Engine provides computing services." The machine learning platform we have been discussing is unified internally and externally: Volcano Engine customers and Douyin use the same platform. I hope these technologies, polished inside the company, can serve more customers and support everyone in intelligent innovation. Thank you all.