Introduction
Operations automation is something we all want, but when we focus solely on automation capabilities, we overlook a key factor that affects whether automation can actually be put into practice: the business architecture, which operations engineers live with day and night and both love and hate.
Because business architecture is one of the key factors that determine the efficiency and quality of operations, I would like to discuss what kind of architectural design is operations-friendly. Drawing on the business architectures I have encountered at Tencent over the years, and on the thinking about non-functional business specifications that goes into operations planning, operations-oriented architecture design can be divided into six major design points.
Point 1: Architecture independence
Any architecture is created to meet specific business requirements. If an architecture meets those requirements while also accommodating the non-functional requirements that operations places on architecture management, we have good reason to call it operations-friendly.
From an operations perspective, these requirements cover four aspects: independent deployment, independent testing, component standardization, and technical decoupling.
Independent deployment means that a single body of source code can be deployed, upgraded, and scaled according to operations-friendly management requirements. Geographical distribution is distinguished through configuration, and calls between services go through interface requests. Independent deployment is also a prerequisite for operations independence.
Independent testing means that operations can verify the availability of a business architecture or service using convenient test cases or tools. Architectures and services with this capability let operations go live independently, without requiring developers or testers to take part in every release or change.
Component standardization means providing good framework support for related technologies within the same company, so that different development teams do not adopt different technology stacks or components and let the company's internal technical architecture get out of control.
This approach limits the uncontrolled growth in the number of objects operations must manage, so operations always keeps control over the production environment. It also frees operations to invest more energy in efficiency and quality work built around standard components.
Technical decoupling means reducing the interdependence between services, including reducing the code's dependence on configuration files. It is the basis for microservices and for achieving independent deployment, independent testing, and component standardization.
Point 2: Deployment friendliness
DevOps devotes a great deal of attention to the technical practice of continuous delivery, aiming to connect development, testing, and operations end to end so that value can be deployed and delivered quickly. Deployment is clearly a very important part of daily operations work: it is planned, highly repetitive, and must be made efficient.
Efficient and reliable deployment requires overall planning, so that operations has comprehensive control during both the deployment and operation phases. Deployment friendliness covers six dimensions:
CMDB configuration
Before each deployment, operations needs a clear picture of the relationships among the application, the architecture, and the business in order to evaluate the overall workload and potential risks.
On the Zhiyun automated operations platform, we manage configuration information such as business relationships, cluster management, operation status, importance levels, and architecture layers as operations management objects in the CMDB (configuration management database). The benefit is clear: storing the configuration information of operations objects centrally provides extensive configuration data and decision support for building automation capabilities such as operations tasks, monitoring, and alerting.
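As a rough illustration of why centralized configuration data helps, the sketch below models a CMDB record with the kinds of fields named above and uses it to list deployment targets before a release. The `CmdbRecord` class, its field names, the `deployment_targets` helper, and the sample hosts are all hypothetical; this is not Zhiyun's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CmdbRecord:
    """One operations object in a hypothetical CMDB (not Zhiyun's real schema)."""
    host: str
    business: str    # business relationship
    cluster: str     # cluster membership
    status: str      # operation status, e.g. "running" or "maintenance"
    importance: int  # importance level, 1 = most critical
    layer: str       # architecture layer, e.g. "access", "logic", "storage"


def deployment_targets(records, business, layer):
    """Return the running hosts of one business layer, most critical first."""
    hits = [r for r in records
            if r.business == business and r.layer == layer and r.status == "running"]
    return sorted(hits, key=lambda r: (r.importance, r.host))


records = [
    CmdbRecord("10.0.0.1", "shop", "sz-1", "running", 2, "logic"),
    CmdbRecord("10.0.0.2", "shop", "sz-1", "maintenance", 2, "logic"),
    CmdbRecord("10.0.0.3", "shop", "sh-1", "running", 1, "logic"),
    CmdbRecord("10.0.0.4", "shop", "sh-1", "running", 1, "storage"),
]
targets = deployment_targets(records, "shop", "logic")
print([r.host for r in targets])
```

Because the query runs against one central store, the same records could in principle feed monitoring and alerting tools as well as deployment tools.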
Environment configuration
In enterprises with little operations standardization, environment configuration is one of the original sins that hinder deployment and delivery efficiency. It is also one of the main operations pain points that containerization aims to solve.
In Tencent's operations practice, the three main environments (development, testing, and production) are managed in a standardized way: the resources and operations associated with each environment are enumerated and managed, and automatic initialization tools are used to put standard environment management into practice.
Dependency management
This addresses the application's dependencies on libraries, the runtime environment, and so on. In Zhiyun's practice, we use package management: dependent library files and environment settings are packaged together with the application and configured through pre- and post-execution scripts, which solves the problem of deploying application software in different environments. The industry also offers lighter-weight containerized delivery, which is a good choice as well.
Deployment method
The principles of continuous delivery call for a reliable, repeatable delivery pipeline, and we plan application deployment firmly around this goal. There are many industry examples to draw on, such as Docker's Build, Ship, and Run, or Zhiyun's configuration description and one-click deployment through a standardized process.
Release self-test
Release self-testing consists of two parts: verification of the released content, and lightweight testing.
Building both capabilities covers the needs of different operations scenarios. For example, during an incremental release, content verification lets operations staff quickly obtain the MD5 of each changed file, or check and compare the related processes, ports, and configuration information, to ensure the reliability of every released change.
Similarly, lightweight testing meets the need to check service availability during a release. This step can verify service connectivity and run a few core test cases.
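The MD5 check on released files described above can be sketched as follows. This is a minimal illustration, assuming the release ships with a manifest that maps relative paths to expected digests; the manifest format, the `verify_release` helper, and the sample file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(root, manifest):
    """Compare deployed files against a manifest of {relative_path: md5}.

    Returns the relative paths that are missing or whose digest differs.
    """
    bad = []
    for rel_path, expected in manifest.items():
        target = Path(root) / rel_path
        if not target.is_file() or file_md5(target) != expected:
            bad.append(rel_path)
    return sorted(bad)


# Demo: one file deployed correctly, one file listed but never shipped.
with tempfile.TemporaryDirectory() as release_dir:
    Path(release_dir, "app.conf").write_bytes(b"port=8080\n")
    manifest = {
        "app.conf": hashlib.md5(b"port=8080\n").hexdigest(),
        "bin/server": "0" * 32,  # placeholder digest for the missing file
    }
    mismatches = verify_release(release_dir, manifest)
print(mismatches)
```

An empty result means every file in the manifest is present with the expected content; anything else is a reason to stop the release before traffic reaches the new version.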
Grayscale release
The "Thirty-Six Strategies for Daily Operation and Maintenance" contains this advice: for irreversible deletions or modifications, delay them where possible or execute them slowly. This is the idea behind grayscale release. Whether the rollout is staged by user, by time, or by server, the aim is to reduce the risk of going live as much as possible. A business architecture that supports grayscale release lowers the risk of the application deployment process and is friendlier to operations.
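One common way to stage a rollout by user is to hash each user ID into a fixed bucket and widen the range of enabled buckets step by step. The sketch below shows that idea only; the `in_grayscale` function and the 100-bucket scheme are assumptions for illustration, not a description of Tencent's tooling.

```python
import hashlib


def in_grayscale(user_id, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


users = range(10_000)
stage_5 = {u for u in users if in_grayscale(u, 5)}    # first wave
stage_20 = {u for u in users if in_grayscale(u, 20)}  # second wave
print(len(stage_5), len(stage_20))
```

Because a user's bucket never changes, everyone in the 5% wave is still included when the rollout widens to 20%, so users do not flip between old and new versions as the release progresses.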
Point 3: Operability
The ideal microservice architecture in the eyes of operations is one that is highly operable and maintainable. Applications or architectures that are not will not only cause trouble for the operations team but also do real harm to its members' career development, because maintaining such an architecture is simply a waste of operations engineers' time and lives.
Operability and maintainability can be summarized in the following seven points, covering operating specifications and management specifications:
Configuration management
In microservice architecture management, we recommend separating application binaries from configuration management to facilitate independent deployment.
The separated application configuration has three management methods:
Due to space limitations, the advantages and disadvantages of these three methods are not discussed here. Each enterprise can choose the configuration management method that suits it best. The key is that every business uses a consistent solution, so that operations can build targeted tools and systems for configuration management.
Version management
One of the eight principles of DevOps continuous delivery is "Put everything into version control". For any operations object, managing it well requires first being able to describe it clearly.
As with source code management, operations also needs scripted management of the objects it handles every day, such as packages, configurations, and scripts, so that when the operations system performs automated tasks it can accurately select the object and version to operate on.
Standard operations
Daily operations involves a large number of highly repetitive tasks. From a lean perspective, this is a huge source of waste: learning costs, operations that add no value, scripts and tools built over and over, the risks of manual execution, and so on.
If unified operating specifications can be established within the enterprise for tasks such as file transfer, remote execution, and application start and stop, these operations become standardized, centralized, and one-click, and the efficiency and quality of operations improve greatly.
Process management
Application installation paths, directory structure, standardized process names, standardized port numbers, start and stop methods, monitoring plans, and so on all fall under process management. Planning process management well as a whole can greatly increase the degree of operations automation and reduce unplanned work.
Space management
Managing disk space usage well keeps business data stored in an orderly way and is also an effective means of reducing unplanned work.
This requires planning in advance: backup strategy, storage plan, capacity warnings, cleanup strategy, and so on, supported by effective tools, so that these tasks no longer plague operations.
Log management
Promoting and implementing log specifications requires close cooperation with R&D. Based on practical experience, the ideal log specifications from an operations point of view should include these requirements:
Once log specifications that meet the above conditions are in place, development, operations, and the business all gain better monitoring and analysis capabilities.
Centralized control
Operations work is inherently prone to being split into separate parts, such as release changes, monitoring and analysis, fault handling, project support, and multi-cloud management. We want a one-stop operations management platform through which all work information can be connected and experience passed on, eliminating the operational risks caused by information silos or manual hand-offs and improving the efficiency and quality of overall operations management and control.
Point 4: Fault tolerance and disaster tolerance
Technical operations at Tencent has four major responsibilities: quality, efficiency, cost, and security. Quality comes first. From an operations perspective, the ideal high-availability architecture design should include the following points:
Load balancing
Whether load balancing is done in software or hardware, operations always hopes that the business architecture is stateless, that routing and addressing are intelligent, and that cluster fault tolerance happens automatically.
In Tencent's years of practice with routing software, software load balancing has been widely used and has contributed greatly to the high availability of business architectures.
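To make the link between stateless services and automatic fault tolerance concrete, here is a minimal sketch of a software balancer that rotates over backends and skips any that are marked unhealthy. The `RoundRobinBalancer` class and the backend names are hypothetical, and health detection itself is assumed to happen elsewhere.

```python
class RoundRobinBalancer:
    """Rotate over backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0

    def mark_down(self, backend):
        """Record that a health check failed for this backend."""
        self._healthy.discard(backend)

    def mark_up(self, backend):
        """Put a known backend back into rotation."""
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is left."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


lb = RoundRobinBalancer(["a", "b", "c"])
first_round = [lb.pick() for _ in range(3)]
lb.mark_down("b")  # e.g. a health check failed
after_failure = [lb.pick() for _ in range(4)]
print(first_round, after_failure)
```

Skipping a failed node is only safe because any backend can serve any request, which is why a stateless business architecture matters so much here.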
Schedulability
In the mobile Internet era, schedulability is an extremely important operations tool for disaster recovery and fault tolerance. When the business hits a fault that cannot be fixed immediately, moving users or services away from the affected area is a tried and tested technique in large-scale operations practice. It is also one of the core operations capabilities that Tencent QQ and WeChat rely on to ensure platform service quality.
By combining technologies such as domain names, VIPs, and access gateways, an architecture that supports scheduling enriches the means available to operations and makes it possible to respond to all kinds of failure scenarios more calmly.
Multi-site active-active
Multi-site active-active deployment is a requirement for high data availability and a prerequisite for schedulability. The technical means of implementing it are not restricted and can differ by business scenario.
For Tencent's practice in social networking, see Zhou Xiaojun's article "Architectural design and efficient operation behind the large-scale scheduling of 200 million QQ users".
Master-slave switching
In database high-availability solutions, master-slave switching is the most common approach to disaster recovery and fault tolerance. Separating reads from writes in the business logic and combining this with intelligent routing to achieve unattended, automated master-slave switching is undoubtedly the best gift architecture design can give a DBA.
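The combination of read/write separation and routing-driven switching can be sketched as below. This is a simplified illustration under stated assumptions: the `ReadWriteRouter` class and node names are hypothetical, statements are classified only by a leading `SELECT`, and real promotion would also have to check replication lag and data consistency.

```python
class ReadWriteRouter:
    """Send writes to the master and reads to replicas; promote on failure."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        self._read_index = 0

    def route(self, sql):
        """Pick a node: SELECT goes to a replica, everything else to the master."""
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self.replicas:
            node = self.replicas[self._read_index % len(self.replicas)]
            self._read_index += 1
            return node
        return self.master

    def fail_over(self):
        """Promote the first replica when the master is reported down."""
        if not self.replicas:
            raise RuntimeError("no replica available for promotion")
        failed = self.master
        self.master = self.replicas.pop(0)
        return failed


router = ReadWriteRouter("db-master", ["db-replica-1", "db-replica-2"])
write_node = router.route("UPDATE users SET name = 'x' WHERE id = 1")
read_node = router.route("SELECT name FROM users WHERE id = 1")
router.fail_over()  # the master was reported down
write_after_failover = router.route("INSERT INTO users (name) VALUES ('y')")
read_after_failover = router.route("SELECT 1")
print(write_node, read_node, write_after_failover, read_after_failover)
```

Since the application only ever asks the router where to send a statement, the switch needs no change in business code, which is what makes unattended switching feasible.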
Flexible availability
"Withstand first, optimize later" is one of Tencent's principles for operating at massive scale, and it also shows the way for high-availability design of business architecture.
How can service availability be preserved as far as possible when traffic suddenly surges? This is an unavoidable question in architecture planning and design. Well-placed flexibility switches, or built-in logic that automatically rejects excess requests, can keep back-end services from collapsing at critical moments and preserve the high availability of the business architecture.
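Logic that automatically rejects excess requests can be as simple as a cap on in-flight work. The sketch below is one possible form of it; the `LoadShedder` class, the `handle` wrapper, the limit of 2, and the degraded reply text are illustrative assumptions.

```python
import threading


class LoadShedder:
    """Reject work beyond a fixed number of in-flight requests."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()
        self.rejected = 0

    def try_acquire(self):
        """Claim a slot if one is free; count and refuse the request otherwise."""
        with self._lock:
            if self._in_flight >= self._max:
                self.rejected += 1
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Give a slot back when a request finishes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder, work):
    """Run `work` if capacity allows; otherwise return a fast degraded reply."""
    if not shedder.try_acquire():
        return "busy, please retry later"
    try:
        return work()
    finally:
        shedder.release()


shedder = LoadShedder(max_in_flight=2)
accepted = [shedder.try_acquire() for _ in range(3)]  # simulate 3 concurrent requests
shedder.release()                                     # one of them finishes
reply = handle(shedder, lambda: "ok")
print(accepted, shedder.rejected, reply)
```

Rejecting quickly keeps the requests that are accepted within normal latency, instead of letting every request slow down until the back end fails.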
Point 5: Quality control
Ensuring and improving business quality is the goal operations strives for, and monitoring is an important technical means of achieving it. Operations hopes that the architecture facilitates quality monitoring and provides data for it, which requires the following:
Metrics
Every architecture must be measured by metrics, and ideally by a single one. As multi-layered business monitoring becomes more and more fine-grained, the number of monitoring metrics grows exponentially. For measuring the architecture itself, therefore, we hope for one unique metric.
Basic monitoring
This refers to low-level metrics for networks, dedicated lines, hosts, systems, and so on. Most of these monitoring points are non-intrusive, and the data is easy to collect.
In enterprises with mature operations automation, most of the alarms generated by basic monitoring are converged. This monitoring data also provides data support and a basis for decisions in higher-level business monitoring, or is packaged into business monitoring data closer to upper-level application scenarios, such as capacity and multi-dimensional metrics.
Component monitoring
Tencent customarily refers to development frameworks, routing services, middleware, and the like collectively as components. This type of monitoring sits between basic monitoring and business monitoring. Operations usually relies on monitoring logic embedded in the components; as the components are rolled out more widely, monitoring coverage increases, and the cost of obtaining data is moderate. For example, by monitoring routing components, operations can obtain status and quality metrics such as request volume and latency for each routed service.
Business monitoring
Business monitoring can be active or passive, and can be implemented intrusively or via a bypass. This type of monitoring requires cooperation from development, in both coding and architecture.
Business monitoring metrics can usually be reduced to three: request volume, success rate, and latency. There are many ways to implement them, including log monitoring, streaming data monitoring, and dial testing (active probing). Business monitoring is high-level monitoring and can often reflect business problems directly. To analyze the root cause of a problem in depth, however, it must be combined with the necessary operations monitoring and management specifications, such as return code definitions and logging protocols. When designing the business architecture, the requirements of operations monitoring and management must be considered in advance and planned for as a whole.
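The three metrics can be derived from request logs once return codes and latency fields follow an agreed specification. The sketch below assumes a hypothetical record format in which `code` is the return code (0 meaning success) and `ms` is the latency in milliseconds; it reports latency as a 95th percentile, which is one common choice rather than anything the article prescribes.

```python
def summarize(records):
    """Compute request volume, success rate, and p95 latency from log records.

    Each record is a dict with a return code (`code`, 0 = success) and a
    latency in milliseconds (`ms`).
    """
    total = len(records)
    if total == 0:
        return {"requests": 0, "success_rate": None, "p95_ms": None}
    ok = sum(1 for r in records if r["code"] == 0)
    latencies = sorted(r["ms"] for r in records)
    # Nearest-rank percentile: rank = ceil(0.95 * total), in integer arithmetic.
    rank = (95 * total + 99) // 100
    return {
        "requests": total,
        "success_rate": ok / total,
        "p95_ms": latencies[rank - 1],
    }


# 19 successful requests (10..28 ms) and one slow failure.
records = [{"code": 0, "ms": 10 + i} for i in range(19)] + [{"code": 500, "ms": 900}]
stats = summarize(records)
print(stats)
```

The example only works because every record uses the same return code convention, which is the point of agreeing on return code definitions and logging protocols up front.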
Full-link monitoring
Basic, component, and business monitoring each focus on individual points. In a distributed architecture, monitoring well also means monitoring the service request link.
Based on a unique transaction ID or the RPC call relationships, the call chain is reconstructed, and monitoring alarms are then triggered by models or events to report the status and quality of the service link. This is an advanced application of monitoring, and it requires planning ahead and instrumenting the code when the business architecture is designed.
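Reconstructing a call chain from a shared transaction ID can be sketched as follows. The span record format (`trace`, `span`, `parent`, `service`, `ok`) and the `build_call_chain` helper are hypothetical; real tracing systems carry far more fields, but the reconstruction step has the same shape.

```python
from collections import defaultdict


def build_call_chain(spans, trace_id):
    """Rebuild the call tree of one request from flat span records.

    Each span is a dict with `trace`, `span`, `parent` (None for the root),
    `service`, and `ok`. Returns (indented_lines, failed_services).
    """
    children = defaultdict(list)
    for s in spans:
        if s["trace"] == trace_id:
            children[s["parent"]].append(s)

    lines, failed = [], []

    def walk(parent_id, depth):
        for s in sorted(children[parent_id], key=lambda x: x["span"]):
            lines.append("  " * depth + s["service"])
            if not s["ok"]:
                failed.append(s["service"])
            walk(s["span"], depth + 1)

    walk(None, 0)
    return lines, failed


# Spans arrive out of order and mixed with other requests, as they would in logs.
spans = [
    {"trace": "t1", "span": 3, "parent": 1, "service": "inventory", "ok": False},
    {"trace": "t1", "span": 1, "parent": None, "service": "gateway", "ok": True},
    {"trace": "t2", "span": 1, "parent": None, "service": "gateway", "ok": True},
    {"trace": "t1", "span": 2, "parent": 1, "service": "order", "ok": True},
    {"trace": "t1", "span": 4, "parent": 2, "service": "payment", "ok": True},
]
lines, failed = build_call_chain(spans, "t1")
print("\n".join(lines))
print("failed:", failed)
```

The list of failed services is the kind of event an alarm rule could act on; none of it is possible unless every service propagates the same ID, which is why the instrumentation has to be planned with the architecture.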
Quality assessment
Promoting monitoring capabilities and improving quality both require a closed management loop, and assessment is a good means to that end. From monitoring coverage, completeness of metrics, and event management mechanisms through to report-based assessment and scoring, operations and development can work together to create a quality management loop with continuous feedback, so that the business architecture keeps evolving and improving.
Point 6: Performance and cost
At Tencent, all technical operations staff carry an important responsibility: ensuring that business operating costs are reasonable. To that end, we need management methods for application throughput performance, business capacity planning, and operating costs.
Throughput performance
In the DevOps continuous delivery methodology, one of the most important non-functional tests during the testing phase is the stress test of architecture throughput, which ensures healthy business capacity after the application goes live.
In Tencent's practice, we not only run performance stress tests during the testing phase but also use the capabilities of routing components to stress-test business modules and business SETs with real requests, establishing a baseline for the business capacity model. This also provides indirect evidence of whether the throughput of the business architecture meets cost assessment requirements, and comparing performance data across businesses drives continuous improvement in architecture performance.
Capacity planning
The English word "capacity" can map to three things: application performance, service capacity, and total business requests. In operations, capacity planning means planning a reasonable service capacity based on the total business requests, on the premise that application performance meets the standard.
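The relationship among the three quantities can be written as a small calculation: given the total requests at peak and the measured performance of one instance, derive the service capacity to provision. The `instances_needed` function, the headroom parameter `target_utilization`, and the sample figures are assumptions chosen for illustration.

```python
import math


def instances_needed(peak_qps, per_instance_qps, target_utilization=0.75):
    """Instances required to serve peak traffic while staying under the target utilization.

    peak_qps           total business requests per second at peak
    per_instance_qps   stress-tested throughput of one instance
    target_utilization assumed share of that throughput we are willing to use
    """
    if peak_qps < 0 or per_instance_qps <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("invalid capacity inputs")
    usable_qps = per_instance_qps * target_utilization
    return math.ceil(peak_qps / usable_qps)


needed = instances_needed(peak_qps=12_000, per_instance_qps=800, target_utilization=0.75)
print(needed)
```

The per-instance figure is only meaningful if it comes from a stress test, which is why the premise that application performance meets the standard comes before the planning itself.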
Operating costs
Reducing operating costs reduces the company's cash outlay, and its value to the enterprise is no less than that of improving quality and efficiency.
Tencent focuses on rich media businesses such as social networking, UGC, cloud computing, games, and video, which consume enormous operating costs in bandwidth and equipment every year. Optimizing operating costs often involves optimizing product features and business architecture. The ideal business architecture design from an operations perspective therefore requires sufficient cost awareness.
Summary
This article presents some personal views on microservice architecture design from an operations perspective. To maximize the value of operations and ensure overall improvement in business quality, efficiency, and cost, business architecture is a hard nut that has to be cracked.
Operations engineers need architectural awareness and should be able to make suggestions or requests about the business architecture from different perspectives. This is also what the DevOps spirit advocates: development and operations working together to keep optimizing toward the best business architecture.