This article looks at how high-concurrency traffic bursts are handled for Honor of Kings, and may offer ideas for solving similar problems.
Honor of Kings is a nationally popular mobile game with a huge user base and a high update frequency. In this business scenario, traffic bursts are very frequent, yet service experience remains crucial, so a CDN is essential. Bandwidth bursts are also common in other scenarios, such as breaking news videos, large-scale live broadcast events, and the release of popular movies, TV series, games, and other applications. At the same time, with the rapid upgrade of home bandwidth and mobile networks, burst bandwidth keeps growing, often reaching the Tb level or even 10 Tb. Handling business bursts quickly and at low cost has become a major challenge for CDNs.
Honor of Kings, the most popular mobile game in China, has hundreds of millions of users and tens of millions of daily active users. How can its business bursts be handled quickly and at low cost? This article starts from that question, discusses the corresponding solution, and summarizes its effects.
Background
In 2007, Tencent's self-built CDN was launched and connected to its first business, Tencent.com. Since then, CDN bandwidth has grown from tens of Gb to tens of Tb. The bandwidth of individual services has also grown: the steady-state bandwidth of most services is several hundred Gb, and some burst services have reached 10 Tb. The rapid upgrade of the network, the explosive growth of mobile users, and the rise of video services, both on-demand and live, have made business bursts more frequent and burst bandwidth higher, which in turn raises the requirements on the CDN.
Benefiting from the rapid growth of Tencent's business, the self-built CDN has successively supported internal businesses such as game downloads, streaming video acceleration, and Spring Festival red envelopes. In 2014, Tencent opened up the CDN's full capabilities as the Tencent Cloud CDN product. In addition to carrying internal business, it began to serve third-party customers, such as Kuaishou on-demand and Douyu Live. All of these services have burst scenarios and strict cost requirements, and Tencent CDN has accumulated rich experience in handling business bursts at low cost. The sections below cover the challenges and problems, the solution, and the effects.
1. Challenges and problems
This section starts from the business characteristics and then analyzes the current challenges and problems.
1. Business characteristics and challenges
The diversity of CDN scenarios makes handling bursts inherently challenging. Burst services are characterized by large volume, diverse scenarios, and irregularity.
a) Large volume: most burst services exceed 1 Tb of bandwidth, and some reach 10 Tb;
b) Diverse scenarios: hit dramas and trending news in on-demand video; esports live broadcasts such as LOL/KPL/DOTA2, sports live broadcasts such as the NBA and the World Cup, and live variety events such as concerts; game downloads such as Honor of Kings in application downloads; red envelope activities and e-commerce promotions in static web page acceleration;
c) Irregular: some bursts are unpredictable and only become known when the event is about to start or has already started, such as breaking news.
Large volume means more resources must be prepared; diverse scenarios mean different resource requirements must be met; irregularity places high demands on expansion efficiency.
2. Current problems
Reserving a large amount of resources just to absorb business bursts is too expensive and wastes resources. Resources are therefore generally reused to cope with bursts. However, reusing resources directly has two problems:
a) Only some resources can be reused: a CDN generally separates platforms and resource usage by business type, mainly because different business types have different resource requirements. For example, on-demand services need more storage, while static page services with many HTTPS requests need more CPU. This separation prevents resources from being fully utilized and makes resource preparation harder. For example, video bursts mainly draw on the video platform's spare capacity (buffer), and the spare capacity of the download and web page platforms cannot be used directly, which limits the size of the available buffer. Even when resources of the same type are reused, coordinating the resources of multiple businesses generally takes more than two days, which is too slow for short-notice bursts;
b) Costs cannot be reduced: for some burst services, such as game application downloads, bandwidth peaks in the morning and at noon. If only the resources of that platform are used, the settlement bandwidth rises significantly and so do costs. The fact that other services peak at different times cannot be exploited to reduce the settlement bandwidth.
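The off-peak effect in b) can be illustrated with a small calculation. This is a minimal sketch with made-up numbers, and it assumes, purely for illustration, that settlement is based on the peak of the bandwidth series; the article does not describe the actual billing rule.

```python
# Hypothetical bandwidth samples (Gb) over six time slots of one day.
game_download = [800, 900, 300, 200, 200, 100]    # peaks in the morning and at noon
video_on_demand = [100, 200, 300, 500, 900, 800]  # peaks in the evening


def peak(series):
    """Settlement bandwidth under the assumed peak-based rule."""
    return max(series)


def combined(a, b):
    """Bandwidth seen when both services share the same resources."""
    return [x + y for x, y in zip(a, b)]


# Each service settled on its own platform.
separate_total = peak(game_download) + peak(video_on_demand)
# Both services settled together on shared resources.
shared_total = peak(combined(game_download, video_on_demand))

print(separate_total, shared_total)
```

Because the two peaks fall in different slots, the shared peak is lower than the sum of the separate peaks, which is the saving that a single-platform deployment cannot capture.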
2. Solution
Tencent Cloud CDN reuses existing resources through virtualization to build a burst pool common to all services, so that all platforms share one buffer of spare capacity. The devices in the burst pool are Docker virtual machines. They come in different specifications and can be used on demand by any business that needs them. The bandwidth reserve in the burst pool reaches 10 Tb, which can meet essentially all business burst needs. When any business has a burst demand, the automated onboarding interface allows expansion from the 10 Tb burst pool to be completed within 10 minutes.
1. Burst pool system architecture
a) Burst pool: a resource pool of Docker virtual machines running on top of each platform's physical machines. Their use of CPU, memory, disk, and other resources is restricted to prevent any impact on the physical machines. The original business is still deployed on the physical machines and needs no adjustment.
b) Automated deployment and monitoring system: automatically predicts demand and expands capacity based on actual business needs. Any burst demand can be met by expansion within 10 minutes. For on-demand and download services, hot files are distributed automatically to reduce back-to-source bandwidth.
c) Scheduling system: because burst services arrive suddenly and in large volume, the through train (a scheduler based on 302 redirects) has an advantage over domain name scheduling systems. Through train scheduling is more flexible and takes effect quickly, at the minute level.
Reporting agents are deployed on both virtual machines and physical machines, and they report business information and server load to the monitoring system every minute. The monitoring system predicts a value from the historical bandwidth and compares it with the current bandwidth. If the current bandwidth exceeds 50% of the predicted value, a burst is considered to be in progress. In proportion to the bandwidth increase, the system automatically brings a corresponding amount of equipment online from the burst pool. For burst events that are planned in advance, operations staff can specify the bandwidth demand, and the system then calculates the equipment required and expands capacity automatically.
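The detection and sizing logic described above might be sketched as follows. This is an illustration under stated assumptions, not Tencent's implementation: the prediction is taken as a plain mean of historical samples, the 50% rule is read as "current bandwidth is more than 50% above the prediction", and all function names and the per-device capacity are hypothetical.

```python
import math

BURST_RATIO = 1.5  # assumption: burst = current bandwidth more than 50% above prediction


def predict_bandwidth(history_gb):
    """Assumed prediction model: the mean of the historical samples."""
    return sum(history_gb) / len(history_gb)


def is_burst(current_gb, predicted_gb, ratio=BURST_RATIO):
    """Compare the current bandwidth with the predicted value."""
    return current_gb > predicted_gb * ratio


def devices_needed(extra_gb, per_device_gb):
    """Number of burst-pool devices needed to carry the extra bandwidth."""
    return math.ceil(extra_gb / per_device_gb)


def plan_expansion(history_gb, current_gb, per_device_gb):
    """Return how many devices to bring online; 0 if no burst is detected."""
    predicted = predict_bandwidth(history_gb)
    if not is_burst(current_gb, predicted):
        return 0
    return devices_needed(current_gb - predicted, per_device_gb)
```

For a burst planned in advance, `devices_needed` can be called directly with the bandwidth demand specified by operations staff, skipping detection.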
The server load information reported at minute granularity gives the monitoring system a basis for scheduling decisions. Using combined information such as the remaining bandwidth of the equipment room, server bandwidth, CPU, and IO, the system decides whether each virtual machine should be enabled in or removed from the through train. On access, the user first sends a request to the through train scheduling system. The through train returns a 302 response whose address is chosen by the scheduling policy and points to the actual CDN resource. The user follows the 302 redirect and fetches the actual content.
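The request flow above can be sketched as a pure function that chooses a node and builds the 302 response. Everything here is hypothetical: the `Node` fields, the 0.8 utilization limit, the "least busy node" policy, and the 503 fallback are assumptions for illustration, since the article does not specify the actual scheduling policy.

```python
from dataclasses import dataclass


@dataclass
class Node:
    host: str         # hypothetical CDN node address
    enabled: bool     # whether the node is currently enabled in the scheduler
    bandwidth: float  # reported utilization, 0.0 to 1.0
    cpu: float
    io: float

    def load(self) -> float:
        # Treat the busiest resource as the node's overall load.
        return max(self.bandwidth, self.cpu, self.io)


def schedule(path, nodes, limit=0.8):
    """Return (status, headers) for a user request to the scheduler."""
    candidates = [n for n in nodes if n.enabled and n.load() < limit]
    if not candidates:
        # Assumed fallback when no node has headroom.
        return 503, {}
    best = min(candidates, key=Node.load)
    return 302, {"Location": f"http://{best.host}{path}"}
```

Because the decision is made per request from the latest reported load, a node can be added or removed simply by flipping `enabled`, without waiting for DNS caches to expire.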
2. Technical optimization
An essential prerequisite for reusing resources through virtualization is that existing businesses are not affected. This requires sufficient isolation of resources such as CPU, disk, and bandwidth. The following are several problems encountered during implementation, with their solutions:
● Accurately control the load of a single machine: excessive load degrades service quality, so the load of each machine must be controlled precisely.
Solution:
a) Quota system: the through train has a quota system that limits the resources each virtual machine can use, including CPU, IO, and bandwidth. The information reported to the monitoring system, combined with the quota system, keeps server load within the specified range at minute granularity.
b) Return 302 for some requests: on top of the CPU, bandwidth, and IO limits, the application decides in real time whether to process a request based on the current load of the host machine. If the load is within the limit, the request is processed directly; if the load exceeds the limit, a 302 is returned so that the user is redirected to the through train's dispatch address. This controls the load precisely while affecting service quality as little as possible. Real-time load control at the program level is an effective supplement to the quota system.
c) Network card flow control: in extreme cases, if the business bandwidth exceeds the configured threshold, the virtual network card actively drops packets to avoid affecting the host machine.
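The per-request decision in b) and the last-resort check in c) might look like the sketch below. The dispatch URL, metric names, and limits are hypothetical, and the packet-drop check only models the decision; in practice that step would be enforced at the virtual network card rather than in application code.

```python
THROUGH_TRAIN_URL = "http://dispatch.example"  # hypothetical dispatch address


def over_limit(load, limits):
    """True if any reported metric exceeds its configured limit."""
    return any(load.get(name, 0.0) > cap for name, cap in limits.items())


def handle_request(path, host_load, limits):
    """Serve locally when within limits, otherwise redirect to the scheduler."""
    if over_limit(host_load, limits):
        return 302, {"Location": THROUGH_TRAIN_URL + path}
    return 200, {}


def nic_should_drop(current_gbps, threshold_gbps):
    """Model of the flow-control decision: drop once the threshold is exceeded."""
    return current_gbps > threshold_gbps
```

The three layers act at different speeds: quotas adjust scheduling at minute granularity, the application check reacts per request, and the network card check bounds the worst case.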
● Limit disk size: Docker cannot limit disk size at the file or directory level on the ext3/ext4 file systems.
Solution:
Tencent Cloud CDN business mostly uses the ext3/ext4 file systems, where Docker can only restrict disk usage per user or user group, but all services currently run directly as root. A loop device is used to solve the disk size limitation. Burst services in the virtual machine use a directory mounted on the loop device, which indirectly caps the disk size and prevents excessive disk usage from affecting other services.
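One common way to set up such a loop-device mount is to create a fixed-size image file, format it, and mount it with the `loop` option. The sketch below only builds the command lines and does not run them; the image path, mount directory, and size are hypothetical, and the article does not show the commands actually used.

```python
def loop_mount_commands(image_path, mount_dir, size):
    """Build the commands for a size-capped directory backed by a loop device.

    The commands are returned rather than executed, because running them
    requires root privileges on the host.
    """
    return [
        ["truncate", "-s", size, image_path],            # create a sparse file of fixed size
        ["mkfs.ext4", "-F", "-q", image_path],           # format the file as ext4
        ["mkdir", "-p", mount_dir],                      # make sure the mount point exists
        ["mount", "-o", "loop", image_path, mount_dir],  # mount it through a loop device
    ]


# Hypothetical paths and size, for illustration only.
for cmd in loop_mount_commands("/data/burst.img", "/data/burst", "20G"):
    print(" ".join(cmd))
```

Writes under the mount directory cannot exceed the image size, so the burst service is capped regardless of which user it runs as. To apply the commands, each list could be passed to `subprocess.run(cmd, check=True)` as root.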
● CPU binding: by default the virtual machine is bound to all CPUs, so a high load on individual CPUs can affect the host machine's business.
Solution:
A script collects the load of every individual CPU in the system once a minute. To avoid frequent adjustments and the influence of glitch data, the 15-minute average is used. The cores with the lowest load are then selected and bound dynamically through the cpuset.cpus configuration file, which minimizes the virtual machine's impact on the host machine's business while making full use of resources.
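The core selection step can be sketched as a sliding window of per-core samples. This is a minimal illustration, not the script used in production: the class and function names are hypothetical, the samples are supplied by the caller rather than read from the system, and writing the result into cpuset.cpus is left out.

```python
from collections import deque

WINDOW_MINUTES = 15  # one sample per minute, averaged over 15 minutes


class CoreLoadTracker:
    """Keep the most recent per-core load samples and rank cores by average."""

    def __init__(self, n_cores, window=WINDOW_MINUTES):
        self.samples = [deque(maxlen=window) for _ in range(n_cores)]

    def record(self, per_core_load):
        """Store one minute's load reading for every core."""
        for history, value in zip(self.samples, per_core_load):
            history.append(value)

    def averages(self):
        return [sum(h) / len(h) if h else 0.0 for h in self.samples]

    def pick_cores(self, count):
        """Return the indices of the `count` least-loaded cores, in ascending order."""
        avg = self.averages()
        ranked = sorted(range(len(avg)), key=lambda i: (avg[i], i))
        return sorted(ranked[:count])


def cpuset_value(cores):
    """Format core indices as a comma-separated list for cpuset.cpus."""
    return ",".join(str(core) for core in cores)
```

Averaging over the window means a one-minute spike on an otherwise idle core does not trigger a rebinding, which matches the goal of avoiding frequent adjustments.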
After the burst pool went online, it efficiently supported many large-scale burst events, such as Honor of Kings downloads, NBA live broadcasts, and KPL/LPL esports live broadcasts, saving 20 million yuan in costs. Building a burst pool on a shared buffer significantly improves burst capacity and reduces costs.
Summary
Tencent Cloud CDN uses Docker technology to reuse resources and build a Tb-level burst pool. The pool supports bursts from live broadcast, on-demand, static content, and other businesses, and it detects burst demand automatically. Resource expansion can be completed within 10 minutes, giving fast rollout at low cost. Resource reuse improves utilization and provides services with a very large burst pool, but the services sharing resources must not affect one another, which requires real-time monitoring of the servers and timely scheduling. Some areas still need improvement: kernel parameters should be isolated per container so that different services can be tuned independently, and because some business clients do not support 302 redirects, the scheduling system also needs to support domain name scheduling.