Business is growing exponentially, can availability construction be so stable?

Jun 09, 2023 am 12:17 AM

1. Problems and Challenges

Since 2017, vivo's machine count and number of services have grown significantly, as the chart shows. From 2017 to 2022, the number of machines grew roughly fivefold and the number of services more than tenfold.

As scale grows, challenges and complexity inevitably grow with it. At vivo, the typical challenges fall into two groups: change challenges and failure challenges.

1. Change challenges

  • Some changes are still performed manually;
  • A single release takes a relatively long time;
  • There are many large-scale business migrations.

Google SRE has a well-known observation: about 70% of failures are caused by changes. The same holds at vivo, where changes have a large impact on online stability.

2. Failure challenges

  • Data-center-level failure risk (companies large and small run into it: fiber cut during excavation, internal failures in the data center, and so on);
  • Rapid business growth has significantly increased capacity requirements.

To meet these challenges, we organized the work along two dimensions, availability capabilities and availability stages, to keep the business stable.

2. Availability Capability Building

1. Construction based on the full fault life cycle

Our availability capabilities are built around full-life-cycle fault management, covering fault occurrence, discovery, response, recovery, review, and prevention. Three indicators describe this cycle: MTTR is the time from the occurrence of a fault to its recovery; MTTF is the time from recovery to the next fault, that is, the time the service stays stable; and MTBF is the time between consecutive fault occurrences.
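The three indicators can be made concrete with a small calculation. The following is a minimal Python sketch, not vivo's tooling: the incident timestamps are invented for illustration, and MTTF is approximated as total uptime divided by the number of faults, so that MTBF = MTTF + MTTR.

```python
from datetime import datetime, timedelta


def fault_metrics(incidents, window_start, window_end):
    """Compute MTTR, MTTF, MTBF and availability for one observation window.

    incidents: list of (fault_start, recovered_at) pairs inside the window,
    assumed not to overlap each other.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    count = len(incidents)
    total = window_end - window_start
    # Sum of all fault durations (time the business was affected).
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = total - downtime
    mttr = downtime / count   # mean time to repair
    mttf = uptime / count     # mean time to failure (stable time per fault)
    return {
        "mttr": mttr,
        "mttf": mttf,
        "mtbf": mttr + mttf,  # mean time between fault occurrences
        "availability": uptime / total,
    }


# Hypothetical data: three faults in a 30-day window.
START = datetime(2022, 1, 1)
INCIDENTS = [
    (START + timedelta(days=3), START + timedelta(days=3, minutes=30)),
    (START + timedelta(days=12), START + timedelta(days=12, minutes=60)),
    (START + timedelta(days=25), START + timedelta(days=25, minutes=30)),
]
METRICS = fault_metrics(INCIDENTS, START, START + timedelta(days=30))
print(METRICS)
```

Reducing fault frequency raises MTTF, while faster discovery and recovery lower MTTR; both push the availability ratio up.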

Fault management comes down to four questions:

  • How do we prevent faults from occurring?
  • How do we discover a fault as early as possible?
  • How do we recover from a fault quickly?
  • After the fault is recovered, how do we follow up?

Business availability depends on how often faults occur and how long each one affects the business. Reducing fault frequency, locating faults quickly, and shortening fault duration through fast recovery are therefore the guiding ideas behind all of our high-availability work. The measures we have put in place are described below.

2. Fault occurrence analysis

To prevent faults, we must first understand why they occur. This can be viewed from a service perspective and a full-link perspective.

1) Service perspective

A service simply takes a request as input and, normally, returns the corresponding output. In practice, many factors can prevent a service from responding correctly. From some classic scenarios, we have summarized the influencing factors:

  • Capacity: exponential growth in business requests leads to abnormal output from a single service;
  • Service: a bug in the software itself causes the service to crash;
  • Hardware: anomalies caused by host hardware, the data center, or the network.

2) Full-link perspective

  • Capacity layer: a sudden surge in requests combined with insufficient capacity along the link leads to service anomalies;
  • Service layer: services need coordinated configuration, and an incorrect setting can make the whole link abnormal;
  • Upstream and downstream dependencies: anomalies in a few key services can make the whole link abnormal.

From the perspective of the stability of the entire link: upstream and downstream dependencies, insufficient capacity, and abnormal service configurations are all important factors affecting stability.

3. Fault prevention construction

Having analyzed fault factors from the service and full-link perspectives, we arrive at corresponding ideas for fault prevention:

  • Full-link anomalies: analyze which upstream and downstream dependencies are strong and which are weak, and give key services special protection to keep the whole link stable;
  • Change anomalies: establish change process specifications and a change management platform;
  • Infrastructure anomalies: rely on a high-availability architecture, remove single points of failure, and build in redundancy and disaster recovery.
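As an illustration of the dependency analysis mentioned for full-link anomalies, here is a minimal Python sketch. It assumes each call edge has already been labeled "strong" (the caller fails if the dependency fails) or "weak" (the caller can degrade without it). The service names and the graph are hypothetical, not vivo's topology.

```python
from collections import deque

# Hypothetical call graph: service -> list of (dependency, kind).
DEPENDENCIES = {
    "gateway": [("order", "strong"), ("recommend", "weak")],
    "order": [("inventory", "strong"), ("coupon", "weak")],
    "recommend": [("profile", "strong")],
    "inventory": [("mysql", "strong")],
    "coupon": [],
    "profile": [],
    "mysql": [],
}


def critical_services(graph, entry):
    """Return every service reachable from `entry` through strong edges only.

    A failure in any of these services breaks the entry point, so they are
    the candidates for special protection.
    """
    critical = set()
    queue = deque([entry])
    while queue:
        service = queue.popleft()
        for dependency, kind in graph.get(service, []):
            if kind == "strong" and dependency not in critical:
                critical.add(dependency)
                queue.append(dependency)
    return critical


print(sorted(critical_services(DEPENDENCIES, "gateway")))
```

In this toy graph, `profile` is not critical for `gateway` because it is only reached through the weak `recommend` edge, which is exactly the kind of dependency that can be degraded.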

4. Fault prevention

The sections above covered the overall analysis and approach. What does vivo actually do?

We build protection across the entire link, from the access layer through the business logic, middleware, and storage layers down to the infrastructure layer:

1) Unitization: reduce service calls across data centers so that the failure of a single data center does not affect services in all data centers;

2) Multiple entry points: many businesses used to have a single access-layer entry point. With multi-entry capability across IDC and public cloud, an anomaly at one entry point has a smaller impact on overall service access;

3) Overload protection: when traffic surges, the access layer can proactively reject part of the burst according to configured settings, so that excess request traffic does not overwhelm downstream services;

4) Circuit breaking and degradation: breaking the circuit to a failing dependency and degrading it shields callers from the abnormal service and avoids an avalanche effect.
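Overload protection and circuit breaking can be sketched in a few lines. The Python below is illustrative only and is not vivo's implementation: it uses a token bucket for overload protection and a simple failure-count circuit breaker with a fallback for degradation. The class names, thresholds, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class TokenBucket:
    """Overload protection: admit about `rate` requests per second,
    allowing bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # reject the request instead of passing it downstream


class CircuitBreaker:
    """Stop calling a failing dependency and serve a degraded fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the dependency entirely
            # Otherwise half-open: let this one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # (re)open the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

A caller at the access layer would check `bucket.allow()` before forwarding a request, and wrap each call to a dependency as `breaker.call(fetch_live, serve_cached)`, where both callables are supplied by the service.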

5. Fault discovery

We have built fault discovery capabilities across the entire link, and the proactive fault discovery rate currently reaches 90%. They comprise client monitoring, server monitoring, and basic monitoring:

1) Client monitoring: a self-built dial-test (synthetic probing) system monitors the availability of each service by simulating user access from outside the serving path;

2) Server monitoring: includes domain name monitoring, log monitoring, and monitoring of calls between services. In terms of implementation, it relies mainly on metrics, logs, and traces;

3) Basic monitoring: monitors the hardware resource usage of hosts, mainly in the form of metrics.
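To make the client-side probing idea concrete, here is a minimal Python sketch of a synthetic probe. It is an assumption-laden illustration rather than a description of vivo's self-built system: the target names, the number of rounds, and the 0.9 threshold are all hypothetical.

```python
from urllib.error import URLError
from urllib.request import urlopen


def http_probe(url, timeout=3.0):
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, OSError, ValueError):
        # HTTP errors, connection failures, timeouts, or a malformed URL.
        return False


def run_probes(targets, probe=http_probe, rounds=3):
    """Probe each target `rounds` times and return its success ratio."""
    results = {}
    for name, url in targets.items():
        successes = sum(1 for _ in range(rounds) if probe(url))
        results[name] = successes / rounds
    return results


def unhealthy(results, threshold=0.9):
    """Names of targets whose success ratio fell below the threshold."""
    return sorted(name for name, ratio in results.items() if ratio < threshold)
```

The `probe` parameter is injectable so the same loop can drive other checks (for example a TCP connect or a scripted login) and so the logic can be tested without network access.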

6. Troubleshooting

Troubleshooting mainly consists of fault analysis and fault handling.

  • Fault analysis: integrated with the monitoring system to support basic service fault analysis, domain name availability analysis, and so on;
  • Fault handling: building contingency plans, including plan formulation and drills.

7. Fault review

Fault review is an important part of the entire high-availability construction cycle.

We grade SLAs by business so that stability guarantees are targeted, and we record every business fault in order to improve and verify our capability building:

1) Business classification: operations resources are limited, so it is impossible to guarantee the same SLA for every business, which makes tiered guarantees necessary. Based on each business's impact on reputation and revenue, we divide businesses into four levels: core, important, general, and other. The level guides how much operations manpower and protection effort each business receives;

2) Fault records: improve review efficiency and track online business faults for later analysis that guides business optimization;

3) Fault improvement: use chaos engineering to verify, after the fact, whether the improvement measures have taken effect.

This is our practice in fault review. We have also built these capabilities and practices into a platform, through which the fault review work is managed.

8. Capacity management

Many online failures are caused by capacity problems; once capacity resources are in place, availability is guaranteed to a certain extent. Here we have mainly improved two capabilities: elastic resource scaling, and resource delivery and operations management.

  • Elastic resource scaling: build hybrid-cloud-based resource guarantees to greatly improve resource elasticity;

  • Resource delivery and operations management: establish a management mechanism covering the entire resource life cycle, including budget management, demand management, procurement management, and inventory operations management, to maximize resource supply and usage efficiency.
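A rough way to reason about elastic capacity in a hybrid cloud is to size the fleet for peak load and burst whatever the IDC cannot supply to the public cloud. The Python sketch below is a simplified assumption, not vivo's planning model; the function name, the utilization target, and all numbers are hypothetical.

```python
import math


def plan_capacity(peak_qps, qps_per_instance, target_utilization, idc_instances):
    """Split the instances needed at peak between IDC and public cloud.

    target_utilization is the fraction of an instance's throughput we are
    willing to use at peak (headroom is the rest).
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_qps = qps_per_instance * target_utilization
    required = math.ceil(peak_qps / usable_qps)
    from_idc = min(required, idc_instances)
    return {
        "required": required,
        "idc": from_idc,
        "cloud": required - from_idc,  # burst the remainder to public cloud
    }


# Hypothetical numbers: 9000 QPS peak, 400 QPS per instance, 50% target
# utilization, 30 instances available in the IDC.
PLAN = plan_capacity(9000, 400, 0.5, 30)
print(PLAN)
```

With these made-up inputs the plan needs 45 instances, 30 from the IDC and 15 from public cloud.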

3. Availability phase construction

Following availability capability building, we divide availability construction into three stages: the standardization stage, the process stage, and the platform stage.

1. Standardization stage

Why standardize?

Standardization greatly reduces the complexity of business operations and therefore reduces operations costs. We have done a lot of standardization work at both the hardware and software levels.

  • Hardware level: data center standardization and network standardization (public network, active Internet access, intranet dedicated lines);
  • Software level: OS standardization, host environment standardization, service catalog standardization, agent standardization, access-layer nginx cluster standardization, and service capability standardization (middleware services).

2. Process and specification construction

We distill the best practices and methods from day-to-day operations into process mechanisms and specifications so that business stability work is orderly and controllable. These include hard operations rules ("military regulations"), the fault response mechanism, public affairs specifications, and guarantee specifications for large-scale events.

For example, before the guarantee specifications for large-scale events existed, online failures were likely during large operational campaigns or the Spring Festival red envelope campaign. Since we established those specifications in 2018, major guaranteed events such as the Spring Festival have run smoothly.

3. Platform and system construction

In terms of platform and system construction, we use the CMDB as the foundation and turn the process mechanisms that have proven themselves into platforms, such as the change platform, monitoring platform, and service tool platform, to support business stability.

4. Availability results and prospects

By 2022, business stability operations had become orderly and efficient overall. Business availability rose from three nines to four nines, and the number of businesses meeting the standard rose from eight to 24.
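To see what moving from three nines to four nines means in practice, the availability target can be converted into an allowed-downtime budget. The short Python sketch below is plain arithmetic; the function name and the 365-day period are choices made for illustration.

```python
def downtime_budget_minutes(availability, period_days=365):
    """Allowed downtime, in minutes, for a given availability target."""
    return (1 - availability) * period_days * 24 * 60


for label, target in [("three nines", 0.999), ("four nines", 0.9999)]:
    minutes = downtime_budget_minutes(target)
    print(f"{label}: about {minutes:.2f} minutes of downtime per year")
```

Three nines allows about 525.6 minutes (8.76 hours) of downtime per year, while four nines allows about 52.56 minutes, a tenfold tighter budget.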

These availability results come mainly from two lines of work:

  • Availability capability building: fault prevention, fault discovery, fault recovery, fault review;
  • Availability phase construction: standardization, process/specification, platform/automation.

In the future, we will focus on availability guarantees for remote multi-active deployment and for containers and cloud native.

Take the availability guarantee for containers and cloud native as an example. We started out mostly on bare physical machines, later added virtual machines, and then added public cloud, each step reducing our direct dependence on the underlying infrastructure. We are now also adopting containers and cloud native to pool resources into units and schedule them flexibly, further reducing direct dependence on physical hardware. As a result, we need to build high-availability capabilities for several different kinds of infrastructure.

What else can be done to build availability?

I personally think we need to consider not only availability but also business quality and operating costs. Business operations guarantees will then enter a stage of refined operations.

Q&A

Q1: What were the biggest difficulties in implementing availability construction?

A1: The first is the specifications for building the underlying technical capabilities. If these specifications are not followed, business availability results become highly uncertain, so the team needs agreed standards and, at the same time, a fallback mechanism as a safety net;

The second is recognition from upper management. Each business has different demands at different stages, and poor stability affects the business, its reputation, and its revenue. Once upper management recognizes this, availability construction is much easier to promote.

Q2: When implementing the CMDB, besides information such as the development owner and hosts, what other information did your company associate in practice? For example, is middleware information associated?

A2: Many of our systems are currently built on the CMDB, not only the operations systems. Middleware services are also associated with the CMDB; for example, dubbo in our microservices relies on the CMDB for service discovery and governance.

Lecturer Introduction

Zhou Jiali is the operation and maintenance director of vivo, responsible for the operation and maintenance of vivo's Internet business. Zhou previously worked at Baidu and Tencent and has operations experience across client, internationalization, and big data algorithm offline businesses. After joining vivo, Zhou led the construction of business high availability and raised business availability to the 99.99% level.
