


Business is growing exponentially: can availability construction stay this stable?
1. Problems and Challenges
- Data-center-level failure risk (companies large and small encounter it: a fiber line cut during excavation, an internal data-center failure, etc.);
- Rapid business growth has significantly increased capacity requirements;
- How to prevent faults from occurring?
- How to find a fault as soon as possible?
- How to resolve a fault quickly?
- After a fault is resolved, how to follow up?
1) Service perspective
A service is nothing more than a request going in and, normally, the corresponding response coming out. In practice, many things affect whether the service responds correctly. For some classic scenarios, the influencing factors can be summarized as follows:
- Capacity: exponential growth in business requests can cause a single service to return abnormal output;
- Service side: a bug in the software itself can crash the service;
- Hardware side: abnormalities caused by the host hardware, the data center, or the network;
- Service layer: Collaborative configuration is required between services. Incorrect configuration settings can also cause full-link abnormalities;
- Upstream and downstream dependencies: Abnormalities in some key services can cause abnormalities across the entire link.
From the perspective of the stability of the entire link: upstream and downstream dependencies, insufficient capacity, and abnormal service configurations are all important factors affecting stability.
3. Fault prevention construction
After analyzing the fault factors from the two perspectives of service and full link, the corresponding ideas for fault prevention construction are:
- Full-link abnormality: analyze which upstream and downstream dependencies are strong and which are weak, and give key services special protection to ensure the stability of the entire link;
- Change exceptions: establish change process specifications and a change management platform;
- Infrastructure exceptions: rely on a high-availability architecture, remove single points of failure, and build good redundancy and disaster recovery.
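The dependency analysis in the first item above can be sketched as a small graph walk. This is only an illustration: the service names, the `CALLS` table, and the `impacted_by` helper are hypothetical and do not come from vivo's systems. Each call is labeled strong (the caller cannot answer without the callee) or weak (the caller can degrade and still respond), and the function follows only strong edges upstream to find which services fail outright when one service is down.

```python
from collections import deque

# Hypothetical call graph: caller -> list of (callee, dependency strength).
# "strong": the caller cannot answer without the callee.
# "weak": the caller can degrade and still respond.
CALLS = {
    "gateway": [("order", "strong"), ("recommend", "weak")],
    "order": [("payment", "strong"), ("coupon", "weak")],
    "recommend": [("profile", "strong")],
    "payment": [],
    "coupon": [],
    "profile": [],
}


def impacted_by(failed, calls):
    """Return the services that fail outright when `failed` is down,
    following only strong dependencies upstream."""
    # Build the reverse graph: callee -> callers that strongly depend on it.
    strong_callers = {}
    for caller, deps in calls.items():
        for callee, strength in deps:
            if strength == "strong":
                strong_callers.setdefault(callee, []).append(caller)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in strong_callers.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted
```

In this toy graph a `payment` outage takes down `order` and `gateway`, while a `profile` outage stops at `recommend` because `gateway` depends on it only weakly. Services with a large strong-dependency blast radius are the ones that would deserve special protection.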
4. Fault prevention
The sections above covered the overall analysis and construction ideas. How does vivo actually do it?
We build safeguards across the entire link, covering the access layer, business logic layer, middleware layer, storage layer, and infrastructure layer:
1) Unitization: reduce service calls across data centers so that the failure of a single data center does not affect services in all data centers;
2) Multiple entrances: in the past, many businesses had only a single access-layer entrance. After building multi-entrance capability across IDC and public cloud, an exception at a single entrance has a smaller impact on overall service access;
3) Overload protection: when business traffic suddenly increases, the access-layer service can actively reject some burst requests according to its settings, preventing excessive request traffic from overwhelming downstream services;
4) Circuit breaking and degradation: circuit-breaking and degrading dependent services shields callers from the impact of abnormal services and avoids the avalanche effect.
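Items 3) and 4) above can be illustrated with a minimal sketch. It is not vivo's implementation: the class names, parameters, and thresholds are assumptions chosen for the example, and a token bucket is just one common way to reject burst traffic. The clock is injectable so the behavior can be checked without waiting.

```python
import time


class TokenBucket:
    """Overload protection: admit about `rate` requests per second on average,
    with bursts up to `capacity`; reject the rest."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller should reject this request


class CircuitBreaker:
    """Circuit breaking with degradation: after `max_failures` consecutive
    failures, skip the dependency for `reset_after` seconds and use the fallback."""

    def __init__(self, max_failures, reset_after, clock=time.monotonic):
        self.max_failures, self.reset_after, self.clock = max_failures, reset_after, clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: do not touch the failing dependency
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the breaker again
        return result
```

An access-layer handler would call `allow()` before forwarding a request, and a business service would wrap each call to a dependency in `call(func, fallback)`, where the fallback returns a degraded result such as a cached or default value.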
5. Fault discovery
We have built fault detection capabilities across the entire link. The proactive fault detection rate currently reaches 90%, covering client monitoring, server monitoring, and basic monitoring:
1) Client monitoring: a self-built synthetic probing (dial-test) system that monitors the availability of each service by simulating user access out of band;
2) Server monitoring: domain name monitoring, log monitoring, and monitoring of calls between services. By implementation method, it consists mainly of metrics, logs, and traces;
3) Basic monitoring: monitoring of host hardware resource usage, mainly in the form of metrics.
6. Troubleshooting
Mainly includes fault analysis and fault handling.
- Fault handling: contingency plan construction, including plan formulation, drills, etc.
7. Fault review
Fault review is a very important part of the entire high-availability construction cycle.
We grade businesses by SLA so that stability guarantees can be targeted, and we record every business fault to improve and verify our capabilities:
1) Business classification: operation and maintenance resources are very limited, so it is impractical to guarantee the same SLA for every business, and tiered guarantees are necessary. Based on the reputation and revenue of each business, we divide businesses into four levels: core, important, general, and other. This guides how much operation and maintenance manpower and guarantee effort is invested in each business;
2) Fault record: improve review efficiency, and track online business faults for subsequent analysis to guide business optimization;
3) Fault improvement: verify retrospectively, using chaos engineering, whether the improvement measures have taken effect.
This is our practice in fault review. We have also implemented these capabilities and practices into the platform and managed the fault review work through the platform.
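To make the link between business levels and fault records concrete, the arithmetic can be sketched as a downtime budget. The four level names come from the text above, but the availability targets assigned to them here are hypothetical, as are the function names; the source does not state per-level targets.

```python
# Hypothetical availability targets per business level; the source names the
# four levels but does not state their targets.
SLA_TARGETS = {
    "core": 0.9999,
    "important": 0.999,
    "general": 0.995,
    "other": 0.99,
}

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(target):
    """Minutes of downtime per year that still meet the availability target."""
    return (1 - target) * MINUTES_PER_YEAR


def remaining_budget(level, fault_minutes):
    """Budget left for a business level after the recorded fault durations."""
    return downtime_budget_minutes(SLA_TARGETS[level]) - sum(fault_minutes)
```

A 99.99% target leaves roughly 52.6 minutes of downtime per year, so two recorded faults of 20 and 15 minutes would leave a business at that level with under 18 minutes for the rest of the year. Tracking faults against such a budget is one way a review platform could prioritize improvement work.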
8. Capacity management
- Resource elastic scalability: Build hybrid cloud-based resource guarantee capabilities to greatly improve resource elasticity;
- Resource delivery, operation and management capabilities: Establish a management mechanism for the entire life cycle of resources to ensure the maximum supply and use efficiency of resources, including budget management, demand management, procurement management, and inventory operation management.
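The idea of hybrid-cloud elasticity can be sketched with a simple sizing calculation. The function names, the 25% headroom default, and the rule of filling the IDC first and bursting the remainder to the public cloud are all assumptions made for illustration, not a description of vivo's scheduler.

```python
import math


def instances_needed(peak_qps, qps_per_instance, headroom=0.25):
    """Instances required to serve peak_qps while keeping `headroom`
    (a fraction of each instance's capacity) in reserve."""
    usable = qps_per_instance * (1 - headroom)
    return math.ceil(peak_qps / usable)


def plan_capacity(peak_qps, qps_per_instance, idc_instances, headroom=0.25):
    """Split the required instances between the IDC and the public cloud:
    use IDC instances first and burst the remainder to the cloud."""
    needed = instances_needed(peak_qps, qps_per_instance, headroom)
    from_idc = min(needed, idc_instances)
    return {"idc": from_idc, "cloud": needed - from_idc}
```

With a peak of 10,000 QPS, 500 QPS per instance, and 20 instances available in the IDC, the plan needs 27 instances in total, so 7 come from the public cloud. When the peak fits within IDC capacity, the cloud share is zero.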
3. Availability phase construction
Building on these availability capabilities, we build availability in three stages: the standardization stage, the process stage, and the platform stage.
1. Standardization stage
Why should we build standardization? Standardization can greatly reduce the complexity of business operation and maintenance, thereby reducing operation and maintenance costs. We have done a lot of standardization work at both the hardware and software levels.
- Hardware level: data center standardization, network standardization (public network, active Internet access, intranet dedicated line);
- Software level: OS standardization, host environment standardization, service catalog standardization, Agent standardization, access-layer nginx cluster standardization, and service capability standardization (middleware services).
First of all, we condense the best practices and methods from the operation and maintenance process into process mechanisms and specifications, so that business stability is orderly and controllable. These include operation and maintenance iron rules ("military regulations"), a fault response mechanism, public affairs specifications, large-scale event guarantee specifications, etc.
For example, before the guarantee specifications for large-scale events were established, online failures occurred easily during large operational campaigns or Spring Festival red envelope activities. Since these specifications were established in 2018, major guaranteed events such as the Spring Festival have run smoothly.
3. Platform and system construction
- Availability capability building: fault prevention, fault discovery, fault handling, fault review
- Availability phase construction: standardization, process/standardization, platform/automation
Q&A
Q1: What are the biggest difficulties encountered during the implementation of availability construction?
A1: The first is the construction specifications for the underlying technical capabilities. Failing to comply with them makes business availability outcomes highly uncertain, so rules and standards must be set for the team, backed by a safety-net mechanism;
The second is recognition from upper management. Each business has different demands at different stages, and when stability is poor it affects the business, its reputation, and its revenue. Once upper management recognizes this, availability construction is easier to promote.
Q2: During the implementation of CMDB, in addition to the development person in charge, host and other information, what other information did your company associate in the actual process? For example, is it related to middleware information?
A2: Many of our systems, not only the operation and maintenance systems, are currently built on the CMDB, and middleware services are also associated with it. For example, dubbo in our microservices also relies on the CMDB for service discovery and governance.
Lecturer Introduction
Zhou Jiali is the operation and maintenance director of vivo, responsible for the operation and maintenance of vivo’s Internet business. Zhou previously worked at Baidu and Tencent, with experience in operation and maintenance of businesses such as client, internationalization, and big data algorithms. After joining vivo, Zhou led the construction of business high availability and raised business availability to the 99.99% level.