1) Service perspective
At its simplest, a service takes a request as input and only needs to return the corresponding output. In practice, many factors affect whether the service responds correctly. For some classic scenarios, the influencing factors can be summarized as follows:
In terms of capacity: exponential growth in business requests can cause a single service to return abnormal output, and a sudden surge in requests combined with insufficient capacity along the entire link leads to service anomalies;
From the perspective of full-link stability: upstream and downstream dependencies, insufficient capacity, and abnormal service configurations are all important factors affecting stability.
3. Fault prevention construction
Having analyzed the fault factors from both the service and the full-link perspectives, we can derive corresponding ideas for fault prevention construction.
So far I have covered the overall analysis and construction ideas. How does vivo actually put them into practice?
We build our guarantees around the entire link, which spans the access layer, business logic layer, middleware layer, storage layer, and infrastructure layer:
1) Unitization: reduce cross-data-center service calls so that the failure of a single data center does not affect services in all data centers;
2) Multiple entry points: in the past, many businesses had only a single access-layer entry point. After building multi-entry capability across IDC and public cloud, an exception at a single entry point has a much smaller impact on overall service access;
3) Overload protection: when business traffic surges suddenly, the access-layer service can proactively reject some of the burst requests according to its configured settings, preventing excessive traffic from overwhelming downstream services;
4) Circuit breaking and degradation: circuit-breaking and degrading calls to dependent services shields callers from abnormal services and avoids an avalanche effect.
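To make items 3) and 4) more concrete, here is a minimal Python sketch. It is not vivo's implementation: the token-bucket algorithm, the class names `TokenBucket` and `CircuitBreaker`, and every parameter (`rate`, `capacity`, `max_failures`, `reset_after`) are assumptions chosen for illustration. The token bucket rejects burst requests once the configured allowance is used up, and the circuit breaker stops calling a failing dependency and returns a degraded fallback until a cool-down period has passed.

```python
import time


class TokenBucket:
    """Overload protection: admit requests only while tokens remain."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second (assumed setting)
        self.capacity = capacity      # maximum burst size (assumed setting)
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # reject the burst request


class CircuitBreaker:
    """Circuit breaking with degradation: after repeated failures, skip the
    dependency and return a fallback until a cool-down period has passed."""

    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds the breaker stays open
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()       # open: degrade without calling
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0               # success closes the breaker
        return result
```

In this sketch an access-layer handler would call `allow()` before forwarding a request, and a business service would wrap each call to a dependency in `call(func, fallback)`. The `clock` parameter exists only so that the behavior can be exercised with a fake clock.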
We have built a fault detection capability based on the entire link, and the proactive fault detection rate currently reaches 90%. It covers client monitoring, server monitoring, and basic monitoring:
1) Client monitoring: a self-built dial-test system monitors the availability of each service by simulating user access from a bypass path;
2) Server monitoring: includes domain name monitoring, log monitoring, and monitoring of calls between services. In terms of implementation, it relies mainly on metrics, logs, and traces;
3) Basic monitoring: monitors the hardware resource usage of hosts, mainly in the form of metrics.
6. Troubleshooting
Troubleshooting mainly includes fault analysis and fault handling.
Fault analysis: linked with the monitoring system to support basic service fault analysis, domain name availability analysis, and so on.
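As a rough illustration of how dial-test results could feed domain name availability analysis, the following Python sketch computes per-domain availability from probe outcomes and lists the domains that fall below a target. The function names, the input shape, the sample data, and the 0.99 target are all hypothetical; the source does not describe vivo's actual data model or thresholds.

```python
from collections import defaultdict


def availability_by_domain(probe_results):
    """Return {domain: fraction of successful probes}.

    probe_results is an iterable of (domain, succeeded) pairs, for example
    the outcomes collected by a dial-test system over one time window.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for domain, succeeded in probe_results:
        totals[domain] += 1
        if succeeded:
            successes[domain] += 1
    return {d: successes[d] / totals[d] for d in totals}


def domains_below_target(probe_results, target):
    """List domains whose measured availability is below the target."""
    measured = availability_by_domain(probe_results)
    return sorted(d for d, a in measured.items() if a < target)


# Hypothetical sample data; the domains are placeholders.
sample = [
    ("a.example.com", True), ("a.example.com", True),
    ("b.example.com", True), ("b.example.com", False),
]
print(domains_below_target(sample, target=0.99))  # prints ['b.example.com']
```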
7. Fault review
Fault review is a very important part of the entire high-availability construction cycle.
We use business-based SLA grading to ensure business stability in a targeted manner, and we record every business fault so that we can improve our capabilities and verify the improvements:
1) Business classification: operation and maintenance resources are very limited, so we cannot guarantee the same SLA for every business, which makes tiered guarantees necessary. Based on each business's impact on reputation and revenue, we divide businesses into four levels: core, important, general, and other. This tiering guides how much operation and maintenance manpower and guarantee effort we invest in each business;
2) Fault record: improve review efficiency, and track online business faults for subsequent analysis to guide business optimization;
3) Fault improvement: run reverse verification based on chaos engineering to confirm whether the improvement measures have taken effect.
This is our practice in fault review. We have also built these capabilities and practices into a platform, through which we manage fault review work.
8. Capacity management
Many online failures are caused by capacity issues. Once capacity resources are in place, availability can be guaranteed to a certain extent. Here we have mainly improved our capabilities in two areas: elastic scaling of resources, and resource delivery and operations management.
Having covered availability capability building, we now turn to availability phase building, which we divide into three stages: the standardization stage, the process stage, and the platform stage.
Why build standardization? Standardization greatly reduces the complexity of business operation and maintenance, and thereby reduces operation and maintenance costs. We have done a lot of standardization work at both the hardware and software levels.
First, we distill the best practices and methods from our operation and maintenance work into process mechanisms and specifications so that business stability is orderly and controllable. These include operation and maintenance "military regulations", the fault response mechanism, public affairs specifications, and large-scale event guarantee specifications.
For example, before the guarantee specifications for large-scale events were established, online failures occurred easily during large operational campaigns or the Spring Festival red envelope campaign. Since the specifications were established in 2018, major guarantee events such as the Spring Festival have run smoothly.
3. Platform and system construction
In terms of platform and system construction, we use the CMDB as the foundation and turn our proven process mechanisms into platforms, such as the change platform, monitoring platform, and service tool platform, to support business stability.
4. Availability results and prospects
By 2022, business stability operation and maintenance as a whole had become orderly and efficient. Business availability rose from three nines to four nines, and the number of businesses meeting the standard grew from 8 to 24. These availability results were achieved mainly through availability capability building and availability phase building:
Availability capability building: fault prevention, fault detection, troubleshooting, and fault review
Availability phase building: the standardization stage, the process stage, and the platform stage
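To put the move from three nines to four nines in concrete terms, the short Python calculation below converts an availability target into the maximum downtime it permits per year. The function name and the 365-day year are assumptions for illustration; the arithmetic itself is simply (1 - availability) multiplied by the length of the period.

```python
def allowed_downtime_minutes(availability, period_days=365):
    """Maximum downtime, in minutes, permitted by an availability target."""
    return (1 - availability) * period_days * 24 * 60


for label, target in [("three nines", 0.999), ("four nines", 0.9999)]:
    print(f"{label}: {allowed_downtime_minutes(target):.1f} minutes per year")
# three nines: 525.6 minutes per year
# four nines: 52.6 minutes per year
```

In other words, going from 99.9% to 99.99% shrinks the yearly downtime budget from roughly 8.8 hours to under one hour.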
Q1: What were the biggest difficulties you encountered while implementing availability construction?
A1: The first is the construction specifications for the underlying technical capabilities. If these specifications are not followed, business availability results become highly uncertain, so the team must define clear standards and, at the same time, put a safety-net mechanism in place;
The second is recognition from upper management. Each business has different demands at different stages, and poor stability affects business reputation and revenue. Once upper management recognizes this, availability construction becomes much easier to promote.
Q2: When implementing the CMDB, besides information such as the development owner and hosts, what other information did your company associate in practice? For example, is middleware information associated?
A2: Many of our systems are currently built on the CMDB, not only the operation and maintenance systems. Middleware services are also integrated and associated with the CMDB; for example, Dubbo in our microservices relies on the CMDB for service discovery and governance.
Lecturer Introduction
Zhou Jiali is the operation and maintenance director of vivo, responsible for the operation and maintenance of vivo's Internet business. Zhou previously worked at Baidu and Tencent and has experience in offline business operation and maintenance in areas such as client, internationalization, and big data algorithms. After joining vivo, Zhou led the construction of business high availability and raised business availability to the 99.99% level.