In the first issue, the boss of Yangqingjing expressed many interesting views. Some people left a message saying that it was about operation and maintenance. Guide to persuading people to quit, haha, the guests in this issue will have different opinions. Please keep an open mind, listen to the opinions of hundreds of schools of thought, and make your own career and life plans. As the saying goes, if you listen to both, you will be enlightened, but if you believe only, you will be dark. If you only listen to what suits your ears, there is a high probability that there will be no in-depth thinking and collision, which is a pity.
This is the second issue of the down-to-earth and high-level "Operation and Maintenance Forum", let's start!
In this issue we invite Nie An, the head of operation and maintenance of Zuoyebang. Nie An is a senior industry veteran who has worked for Alibaba, Xiaomi, Didi Di and Zuoyebang have more than 10 years of operation and maintenance/R&D/management experience.
Internet operation and maintenance has experienced pure manual, There are several stages such as standardization, platformization, and digital intelligence, as shown in the figure below. Among them, DevOps is a technology-driven organizational change and a non-professional change.
From the development history of operation and maintenance, we can see several characteristics:
Learning the development history of a certain field allows us to learn from history and take advantage of the trend. .
In the traditional operation and maintenance model, service objects can basically be divided into three layers. The lowest layer is the hardware infrastructure IaaS, which is mainly composed of computing, network, and storage; the middle layer is the software infrastructure, including operating systems, virtualization technology, code frameworks, middleware, etc.; the top layer is the business layer, mainly application services .
The traditional responsibility of operation and maintenance is to assemble industrial products into services through a series of processes, technologies, and methods. Deliver it to users and maintain service operation; it is usually required to achieve multiple-dimensional goals (operability) such as stability, cost, security, and efficiency. To a certain extent, traditional operation and maintenance needs to be attached to the business to generate value; many companies will regard whether they understand the business as one of the main assessments of operation and maintenance workers (dependence). With the popularization of cloud computing and cloud native technology, the traditional operation and maintenance model has encountered many challenges. for example,
As mentioned above, after infrastructure is outsourced to public clouds and cloud native, operation and maintenance responsibilities are transferred to business research and development, and platforms replace labor professionals. sex. Faced with such trends and facts, operation and maintenance practitioners need to make some transformations.
First let’s talk about the organizational structure. In the long term, the organizational form of a company in the cloud native era will consist of the following parts:
The top end user is the enterprise’s Party A customer , are also potential profit-making groups. The business team is responsible for end users, and its roles include product, business, marketing, marketing, etc. Business research and development directly serves the business team, mainly providing SaaS applications/services. Platform research and development serves business research and development, provides various PaaS capabilities, and encapsulates cloud vendors. There will also be some cross-functional organizations, such as cost operation FinOps, efficiency operation EP, administrative team IT, etc.
In the new organizational structure, everyone’s ultimate goal is to do their own thing and serve end users well. The business team pays more attention to business value, and the R&D system focuses on service quality. With the advancement of information technology, the functions currently performed by cross-functional organizations will gradually be decomposed to the platform R&D team. The main method of organizational collaboration will be upgraded from everyone's collaboration to platform self-service. Operations and maintenance have new job goals, namely: The main theme of operation and maintenance is the management platform, resource & technology center, not horizontal collaboration. Operations and maintenance must use high technology leverage, empower business, and help enterprises improve operating efficiency. .
Operation and maintenance transformation, the goal is to provide operation and maintenance management to upper-level teams through a self-service platform Service; the essence is operation and maintenance service OPaS (OP as Service). According to content differences, operation and maintenance work can be divided into two categories: object management and scene management, as shown in the figure below.
Object management is a vertical model that revolves around operating and maintaining objects and building a life cycle management platform. Operation and maintenance objects can be classified according to IaaS resources (machine, network, storage, cloud services), PaaS components (database, cache, MQ, gateway), SaaS applications (business middle platform, business applications), service framework (runtime, Code framework, name service) and other dimensions, the classification granularity of different companies is different. Each type of object has an independent management platform (chimney). The functions of the management platform should cover the complete life cycle of the operation and maintenance object. The key stages include modeling (metadata), delivery/change, monitoring/measurement, offline, etc., which is different from the public Cloud management functions are similar. The goal of object management is to produce vertically complete cloud products and build an internal cloud platform ICSP.
Scenario management is a horizontal mode that manages the life cycle stages of various operation and maintenance objects according to operation and maintenance scenarios. The classification of operation and maintenance scenarios, including delivery/change, monitoring/measurement, multi-cloud, cost, etc., is very close to the working habits of business research and development, covers a few high-frequency scenarios, and is similar in different companies. Each type of operation and maintenance scenario has an independent scenario management platform, such as work order center, data center, FinOps platform, etc. Scenario management is based on object management. The scenario management platform manages operation and maintenance objects by unifying models, aggregating data, orchestrating management and control APIs, etc. The goal of scene management is to provide self-service business management capabilities and build an internal developer platform IDP.
Common ways to generate operation and maintenance objects include self-research, open source construction, external procurement (public cloud), etc. Each operation and maintenance object can be further subdivided into different categories, clusters, instances, etc., with unprecedented scale and complexity. Only by maintaining the isomorphism of the management characteristics of operation and maintenance objects can we build and maintain operation and maintenance services on a large scale and at low cost, thereby realizing large-scale operation and maintenance (technical leverage effect). Therefore, the isomorphism of operation and maintenance objects is the basis of the entire operation and maintenance architecture. premise.
Isomorphic maintenance is aimed at the management characteristics of operation and maintenance objects, not all characteristics. The method of maintaining isomorphism is: controlling increment, repairing inventory, and preventing fission. As shown in the figure below, the platform is used to deliver demand and control increments, to drive governance through measurement to repair inventory, and to prevent large-scale fission of the technical system through standardized service frameworks; platforms and metrics strictly follow specifications, and specifications also require metrics or platforms. input of questions to improve, the three complement each other. Specifications are divided into service specifications (corresponding to service governance), management specifications (corresponding to operation and maintenance control) and other types.
Isomorphism is maintained and relies on an organizational division of labor with clear responsibilities. For example, operation and maintenance focuses on management, stripping off business tools and returning them to business R&D, such as status quo governance, alarm response, and CD; business R&D focuses on business implementation, stripping off the non-business logic of the service framework and handing it over to the infrastructure. Implementation, such as service discovery and traffic control; the infrastructure focuses on middle-end capabilities such as service framework, stripping away management functions and handing them over to operation and maintenance, such as demand delivery, change control, etc. The influence of culture cannot be ignored. Operations and architecture will output concepts and cultivate user habits through communication and guidance, such as not providing SLA commitments for personalized needs and providing out-of-the-box observation capabilities for standard applications.
Based on the isomorphic maintenance of operation and maintenance objects, and upward support for the operation and maintenance service-oriented technology system, a sustainable operation and maintenance architecture is formed, as shown in the figure below. Under the current technical level, operation and maintenance services based on self-service platforms can solve 70% of the needs, and the remaining 30% still require manual labor, such as demand communication, problem troubleshooting, result acceptance, policy compliance, etc. With the advancement of technology and concepts, it is believed that the proportion of operation and maintenance services will further increase.
Note: The service framework in this article includes not only the code framework and code library N years ago, but also the current popular microservice governance. Transition stage, naming is urgent.
Business operation and maintenance, also called application operation and maintenance, is the closest to cloud native and has been hardest hit. In addition to traditional cross-team responsibilities such as specification formulation, process construction, and global management, business operations and maintenance must be transformed in a service-oriented direction. The path is as follows:
Operation and maintenance as service OPaS (OP as Service) is our mid-term transformation, from the perspective of business operation and maintenance The proposed goals pointed out the general direction, but lacked the path and was relatively abstract; after that, OPaS was gradually refined into the operation and maintenance architecture of ICSP IDP, and its scope of application was extended to the entire operation and maintenance team, so that there was a clear path and starting point.
In addition to servitization, business operation and maintenance can also lead the construction of the hyperservice perspective (now renamed as scenario). The DevOps technology puzzle under cloud native is not complete. It has only completed the application computing part, and there are gaps in capabilities in other directions, especially the upward business perspective, department perspective, company perspective, etc., let’s call it Super Service Perspective. From a hyper-service perspective, business R&D personnel usually do not have the ability or motivation to take the lead; department heads or architects can take care of their own departments, but are limited by job responsibilities and find it difficult to expand to the overall situation. On the other hand, the hyper-service perspective is the old battlefield of traditional business operation and maintenance, with unparalleled experience, understanding and cognitive advantages. Business operation and maintenance leads the construction of a hyper-service perspective, which can not only fill the gap in the cloud native field, but also give full play to the professional advantages of business operation and maintenance, and take advantage of the opportunity for transformation. It will be a win-win choice, as shown below.
Super service perspective, including but not limited to:
under cloud native Looking down at the DevOps technology puzzle, there are gaps in capabilities. For example, the support for basic services such as CDN, object storage, MQ, and EMR is not perfect, and it is still in the exploratory period in 2022; from the perspective of operation and maintenance management, as long as it is covered by the service framework (Authentication, discovery, communication, perception, flow control) is radiated, even if it is managed by Cloud Native.
Cloud services, middleware, big data and other operation and maintenance objects, the technology stack is converged and professionally focused. When implementing transformation, operation and maintenance personnel can follow the onion model.
The onion model was first verified in database, big data, middleware and other positions, and later It was taken over and used in cloud services, and it was also successful. For example, our company's cloud service operation and maintenance CloudOps team implements transformation according to the onion model. The details are as follows.
As the business operation Roles such as maintenance, component operation and maintenance, and system operation and maintenance (resource network cloud services) began to participate in development work. The space left for the operation and maintenance development DevOps team gradually became less and less, and the division of labor was unclear during the transformation process. With reference to the prediction of the upgrade of the organizational structure and technical architecture, we have re-adjusted the positioning of OpDev: OpDev should not be a development outsourcing or vassal of operation and maintenance personnel, but should have its own independent services. As a result, the original operation and maintenance platform was split into two parts. One part focused on functional iteration and could not be reused, and was left to the original users to maintain themselves, such as IDP resource console, ICSP scenario management tools, etc.; the other part was public functions. , abstracted as the operation and maintenance middle platform is responsible for OpDev, such as unified account IAM, work order orchestration engine, monitoring indicator collector, etc., as shown below.
The operation and maintenance center is a subset of the original operation and maintenance platform. It does not need to rebuild domain knowledge. It needs to re-do domain abstract modeling and has relatively high requirements for code quality (same as basic components). This is exactly what OpDev is for children. ’s strengths. As responsibilities are centralized and reduced, OpDev must simultaneously slim down and achieve higher leverage.
Briefly share some of our company’s transformation lessons, including
The following is additional content, not the core of this article.
Whether it is a public cloud or an internal K8S platform, there are a large number of demand delivery operations. This type of ToM (ToManager) delivery platform often lacks necessary constraints and can only be open to experienced people.
In order to optimize the division of labor and improve efficiency, the operation and maintenance management plane ToC (ToRD) can be implemented through "work order arrangement and approval"; the workflow/work order itself will be heavily integrated into the best practices of operation and maintenance management. , can be safely opened to research and development. This is an important direction for the servitization of operation and maintenance capabilities. The evolution path of self-service delivery is as follows:
Currently, the communication link from requirements to technical solutions is relatively difficult to self-service or automate. More attempts are needed in the future.
The essence of economics of scale operation and maintenance is marginal cost, which is the interaction of "diminishing marginal cost of operation and maintenance management vs. increasing marginal cost of isomorphic maintenance". As shown in the figure below, when the number of operation and maintenance objects is small, the cost of operation and maintenance management accounts for the majority, such as building platforms and manual operations; when the number of operation and maintenance objects increases, isomorphic maintenance constitutes the main cost; the marginal turning point will be affected by technology and concepts and other environmental factors.
Cloud native technology reduces the difficulty of maintaining isomorphism (promoting the isomorphism maintenance curve to shift to the right) and improves operation and maintenance service capabilities (promoting the The operation and maintenance management curve shifts downward), allowing operation and maintenance personnel to manage more operation and maintenance objects at a lower cost, thus significantly improving production efficiency.
The above is the detailed content of Zuoyebang Nie An: How to transform operation and maintenance, listen to Zuoyebang's OPaS ideas. For more information, please follow other related articles on the PHP Chinese website!