After Meizu's transformation in 2014-2015 and the resulting surge in sales, we launched more and more Internet services and the user base kept growing. The previous architecture, which scaled within a single data center, can no longer keep up with Meizu's growth. In addition, given the complexity of the domestic network environment, a single data center cannot meet our reliability requirements. In recent years, severed optical cables and data center power outages have been common. For example, Alipay's business was interrupted when its optical fiber was dug up, and WeChat suffered a large-scale outage last year for the same reason. Beyond the risk of a single data center failing, users also have a strong need to connect to a nearby access point.
The biggest challenge of a multi-data-center solution is the set of problems caused by network latency between data centers, such as consistency. The industry already has some mature solutions: Alibaba's unitized solution, which confines each user's business to a single unit; and Tencent's SET solution and Weibo's cross-data-center solution, which mainly centralize writes and provide fast switchover.
Drawing on the solutions above, we reviewed our businesses and sorted them into two types: read-heavy businesses, and businesses where reads and writes are balanced.
Businesses of the first type mostly read and rarely write, so we go as far as classifying them as read-only.
The single-data-center architecture of the application store is as follows:
The access layer is divided into three businesses: the API, the developer community, and the operations backend.
After grading the businesses by availability, the application store (API interface) has the highest availability requirements, while the operations backend and the developer community have slightly lower ones.
Based on this analysis, we only need to provide a multi-data-center solution for the application store (API interface).
The multi-data-center architecture of the application store is as follows:
The deployment in the core data center needs almost no changes. The data in our East China data center is replicated using MySQL's replication feature. List data is read the same way as in the core data center, from the Redis cache. Scheduled tasks periodically load the data from the DB and flush it into Redis.
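The scheduled refresh described above can be sketched roughly as follows. This is a minimal illustration, not our production code: `ListCache` is an in-memory stand-in for Redis, `load_from_db` is a hypothetical function that reads the list data from the local MySQL replica, and the refresh interval is an assumed parameter.

```python
import threading
from typing import Callable, Dict, List


class ListCache:
    """In-memory stand-in for the Redis cache that serves list reads."""

    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}
        self._lock = threading.Lock()

    def replace_all(self, fresh: Dict[str, List[str]]) -> None:
        # Copy first, then swap the whole snapshot under the lock so that
        # readers never observe a half-filled cache.
        snapshot = {key: list(items) for key, items in fresh.items()}
        with self._lock:
            self._data = snapshot

    def get(self, key: str) -> List[str]:
        with self._lock:
            return list(self._data.get(key, []))


def refresh_once(load_from_db: Callable[[], Dict[str, List[str]]],
                 cache: ListCache) -> None:
    """One scheduled run: read the lists from the local replica, flush to cache."""
    cache.replace_all(load_from_db())


def run_refresher(load_from_db: Callable[[], Dict[str, List[str]]],
                  cache: ListCache,
                  interval_s: float,
                  stop: threading.Event) -> None:
    """Repeat refresh_once every interval_s seconds until stop is set."""
    while not stop.is_set():
        refresh_once(load_from_db, cache)
        stop.wait(interval_s)
```

Because each data center refreshes its cache from its own replica, list reads never cross data centers. The cost is that the cache can lag the core data center by up to one refresh interval plus replication delay.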
To ensure data consistency, writes still go to a single point: they are sent across data centers directly to the core data center. There are two types of write. The first goes to the remote data center through a message queue. Even if the network between data centers has a problem, these writes accumulate in MQ, which has almost no effect on the user experience, and the backlog is consumed once the network recovers. The second type needs to know immediately whether the write succeeded, so it is written directly to the database across data centers. If the network fails, this kind of write fails too, and we handle it by degrading the service.
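The two write paths can be sketched as follows. This is a simplified illustration under stated assumptions: `send_to_core` is a hypothetical transport that raises `ConnectionError` when the link between data centers is down, the in-process `deque` stands in for the real message queue, and `fallback` is a hypothetical hook for whatever degraded handling the business chooses.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional


class CrossDcWriter:
    """Sends writes from a remote data center to the single-write core data center."""

    def __init__(self, send_to_core: Callable[[Dict], None]) -> None:
        # send_to_core is a hypothetical transport; it raises ConnectionError
        # when the inter-data-center link is down.
        self._send = send_to_core
        self._backlog: Deque[Dict] = deque()  # stand-in for the message queue

    def write_async(self, record: Dict) -> None:
        """Type 1: enqueue and return; the caller never waits on the link."""
        self._backlog.append(record)
        self.drain()

    def drain(self) -> int:
        """Deliver queued writes in order; stop at the first network failure."""
        delivered = 0
        while self._backlog:
            try:
                self._send(self._backlog[0])
            except ConnectionError:
                break  # keep the record queued and retry on the next drain
            self._backlog.popleft()
            delivered += 1
        return delivered

    def write_sync(self, record: Dict,
                   fallback: Optional[Callable[[Dict], None]] = None) -> bool:
        """Type 2: the caller needs the result now, so write straight through."""
        try:
            self._send(record)
            return True
        except ConnectionError:
            if fallback is not None:
                fallback(record)  # degraded handling chosen by the business
            return False

    @property
    def pending(self) -> int:
        """Number of queued writes not yet delivered to the core data center."""
        return len(self._backlog)
```

In this sketch a record leaves the backlog only after the send succeeds, so a network failure delays queued writes but does not drop them.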
In addition, we use GSLB to schedule traffic across data centers, which is explained in detail later.
Our read-write balanced businesses share an important feature: all data can be partitioned by user, and there is little correlation between different users' data. For example, our synchronization service syncs all the data on a phone (contacts, text messages, settings, Wi-Fi, input method preferences...) to the cloud. When a phone is lost, or has to be reflashed and wiped, the data can be pulled back down at any time, so user data is never lost.
The single-data-center architecture of the synchronization business is as follows:
Our user access interface is also divided into two parts: the API used by mobile phones, and the Web interface, where users can operate on their data directly (for example, modify contacts). Requests received by the web interface are forwarded to back-end services such as contact synchronization, message synchronization, and settings synchronization. The backend services then store the data in different DB shards according to the user's routing information.
A cross-data-center solution is easier to build here: we can route globally by user and send each user to a particular data center.
The cross-data-center architecture is as follows:
We package business data and services into a Unit, and each Unit serves a fixed set of users. Each Unit contains complete data and services and can be deployed independently. Each data center hosts multiple Units, and each user's data has both a local backup and a remote backup. When a data center fails, the backup data can be brought online to keep serving users.
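A minimal sketch of routing users to Units, with failover to the remote backup, might look like this. The names `Unit` and `UnitRouter`, and the hash-modulo mapping from user to Unit, are assumptions made for illustration; a real deployment could equally keep an explicit lookup table.

```python
import zlib
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Unit:
    name: str
    primary_dc: str  # data center that normally serves this unit's users
    backup_dc: str   # remote data center holding the unit's off-site backup


class UnitRouter:
    def __init__(self, units: List[Unit]) -> None:
        if not units:
            raise ValueError("at least one unit is required")
        self._units = list(units)
        self._failed: Set[str] = set()

    def unit_for(self, user_id: str) -> Unit:
        # crc32 is stable across processes, so a user always maps to the
        # same unit. Hash-modulo is an assumption, not the documented scheme.
        index = zlib.crc32(user_id.encode("utf-8")) % len(self._units)
        return self._units[index]

    def mark_failed(self, dc: str) -> None:
        self._failed.add(dc)

    def mark_recovered(self, dc: str) -> None:
        self._failed.discard(dc)

    def serving_dc(self, user_id: str) -> str:
        """Return the data center that should serve this user right now."""
        unit = self.unit_for(user_id)
        if unit.primary_dc not in self._failed:
            return unit.primary_dc
        if unit.backup_dc not in self._failed:
            return unit.backup_dc  # remote backup is brought online
        raise RuntimeError(f"no healthy data center for unit {unit.name}")
```

Since every Unit is self-contained, switching a user to the backup data center only changes where requests are sent; no cross-Unit coordination is needed in this sketch.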
When users access our services through the API, GSLB handles the scheduling. Before calling a business service, the client first asks GSLB where the user's data is located (at any given time, a user's data is served from only one data center), and the request is then scheduled to that data center.
Web requests are a challenge: browsers cannot use GSLB, so user requests cannot be scheduled precisely. Our approach is discussed in the traffic scheduling part below.
Multiple data centers inevitably involve traffic scheduling. The simplest way to schedule traffic is to use a smart DNS service, as shown below:
Smart DNS can only infer which ISP and which region a user belongs to from the IP in the request sent by LocalDNS, and then schedules the user to the data center for that ISP and region. The core of smart DNS is its IP library. There are several disadvantages:
For this reason, we connected specific businesses to a GSLB service:
This approach bypasses DNS resolution. Before initiating a business request, the client first calls our own GSLB service (or HttpDNS). It can pass the user ID so that the service locates the data center holding that user's data, and it then accesses the business service directly by IP.
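This brings several obvious benefits:
* Traffic can be scheduled precisely by IP, UID, or other information.
* DNS hijacking is avoided.
* Users who manually configure their DNS are still scheduled correctly.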
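The client-side flow can be sketched as follows. Everything here is illustrative: `GslbDirectory` is a toy in-memory stand-in for the GSLB service (a real client would call it over the network), and the addresses in the usage example are reserved documentation IPs, not real endpoints.

```python
from typing import Dict, List, Tuple


class GslbDirectory:
    """Toy GSLB: maps a user ID to the data center that holds the user's data."""

    def __init__(self, user_dc: Dict[str, str],
                 dc_ips: Dict[str, List[str]],
                 default_dc: str) -> None:
        self._user_dc = user_dc
        self._dc_ips = dc_ips
        self._default_dc = default_dc

    def resolve(self, uid: str) -> Tuple[str, List[str]]:
        """Return (data center, service IPs); unknown users get the default."""
        dc = self._user_dc.get(uid, self._default_dc)
        return dc, list(self._dc_ips[dc])


def build_request(gslb: GslbDirectory, uid: str,
                  host: str, path: str) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) that target the resolved IP instead of the host name."""
    _, ips = gslb.resolve(uid)
    if not ips:
        raise LookupError(f"GSLB returned no address for uid {uid}")
    # No DNS lookup happens: the URL carries the IP, and the Host header
    # tells the server which site is being requested.
    return f"http://{ips[0]}{path}", {"Host": host}


# Usage with hypothetical data: user "1001" lives in the East China data center.
gslb = GslbDirectory(
    user_dc={"1001": "east"},
    dc_ips={"core": ["192.0.2.10"], "east": ["198.51.100.20"]},
    default_dc="core",
)
url, headers = build_request(gslb, "1001", "sync.example.com", "/contacts")
```

The sketch uses plain HTTP to stay short. Over HTTPS, the client would also have to validate the certificate against the original host name rather than the IP.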
Currently, all of our client access goes through GSLB, including the application store mentioned above and the synchronization API.
But there are also problems. This solution only suits HTTP and HTTPS requests issued by our own clients. It does not suit browser access, because a browser knows nothing about GSLB. So the synchronization API can be scheduled with GSLB, but Web access cannot, and smart DNS alone cannot schedule traffic precisely by user ID.
Based on these considerations, we proposed a third solution that combines GSLB with smart DNS:
The user's request first reaches a server resolved through DNS. Before fetching data, the back-end service queries the GSLB service to find the data center where the user's data is located. If the data is in the local data center, it is returned directly. Otherwise, the request is redirected to the correct data center, where it is issued again.
This solution may cause the domain name shown in the user's browser to change, which affects the user experience. It also cannot prevent domain name hijacking.
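The server-side check in this third solution can be sketched as below. The function and parameter names are hypothetical, as are the per-data-center host names. The sketch uses HTTP status 307, which makes the browser repeat the same method and body at the new location; the status code actually used is an assumption on our part.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)


def handle_web_request(uid: str,
                       path: str,
                       local_dc: str,
                       locate_user: Callable[[str], str],
                       dc_hosts: Dict[str, str],
                       load_local: Callable[[str, str], str]) -> Response:
    """Serve the request locally, or redirect to the data center that owns the data."""
    owner_dc = locate_user(uid)  # backend asks GSLB where this user's data lives
    if owner_dc == local_dc:
        return 200, {}, load_local(uid, path)
    # The data lives elsewhere: send the browser to that data center's domain.
    # This is the point where the domain in the address bar changes.
    location = f"https://{dc_hosts[owner_dc]}{path}"
    return 307, {"Location": location}, ""


# Usage with hypothetical data: this server runs in "core", user "1001" lives in "east".
locations = {"1001": "east", "1002": "core"}
hosts = {"core": "core.example.com", "east": "east.example.com"}
redirected = handle_web_request("1001", "/contacts", "core",
                                locations.__getitem__, hosts,
                                lambda uid, path: f"data for {uid}")
served = handle_web_request("1002", "/contacts", "core",
                            locations.__getitem__, hosts,
                            lambda uid, path: f"data for {uid}")
```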
This article mainly introduces Meizu's multi-data-center disaster recovery plan, the problems encountered during implementation and how we dealt with them, as well as Meizu's core data center migration plan and the solutions to the problems it raised.
What insights and experiences do you have in multi-computer room deployment? Feel free to share in the comments.