Jingyuan: Operation and Maintenance Geometry

Jun 09, 2023 pm 04:50 PM
Operation and maintenance

Editor's note: Boss Jing led my team when I joined Baidu in 2011, and he is a hard-core veteran. This opportunity was not easy to come by, so I asked him all the questions the industry commonly raises, for the benefit of readers. Boss Jing is free and easy by nature; his jokes and curses are all written down as spoken, and his reasoning is easy to follow. Here is the first issue of the down-to-earth, high-level "Operation and Maintenance Forum". Let's start!

Guest introduction

Jingyuan (first from left): former operation and maintenance architect at Baidu, former head of operation and maintenance at Xiaomi, former CIO of Meicai

Some operation and maintenance personnel report that their companies know very little about the value of operation and maintenance. How did you clearly explain the value of operation and maintenance to the company back then?

First of all, you need to explain clearly to the company the job responsibilities of operation and maintenance (what it does and what it produces) and its key indicators (which measure the output). For example: which operation and maintenance projects have been carried out around stability, security, efficiency and so on, and how they proactively drive the key indicators.

Key indicators include not only service availability but also the server resource compliance rate, service failure data (fault classification, fault response time, mean time to recovery, fault alarm coverage), service security indicators, how long service resources take to become available, and so on.
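As a rough illustration of how the failure indicators above could be derived from fault records, here is a minimal Python sketch. The `Fault` record, its field names and the 30-day reporting window are assumptions made for this example; they are not taken from the interview or from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Fault:
    # Hypothetical fault record; field names are illustrative only.
    started: datetime       # when the service began to fail
    acknowledged: datetime  # when someone started handling it
    recovered: datetime     # when the service was healthy again
    alarmed: bool           # True if monitoring raised an alarm for it


def fault_metrics(faults, window):
    """Summarize faults over a reporting window (assumes faults do not overlap)."""
    count = len(faults)
    downtime = sum((f.recovered - f.started for f in faults), timedelta())
    response = sum((f.acknowledged - f.started for f in faults), timedelta())
    one_minute = timedelta(minutes=1)
    return {
        "availability": 1 - downtime / window,
        "mean_response_minutes": response / count / one_minute if count else 0.0,
        "mean_recovery_minutes": downtime / count / one_minute if count else 0.0,
        "alarm_coverage": sum(f.alarmed for f in faults) / count if count else 1.0,
    }


faults = [
    Fault(datetime(2023, 6, 1, 10, 0), datetime(2023, 6, 1, 10, 5),
          datetime(2023, 6, 1, 10, 30), alarmed=True),
    Fault(datetime(2023, 6, 2, 12, 0), datetime(2023, 6, 2, 12, 15),
          datetime(2023, 6, 2, 13, 30), alarmed=False),
]
report = fault_metrics(faults, window=timedelta(days=30))
print(report)
```

Reporting these numbers per period, rather than a raw fault count, is one way to show the company what operation and maintenance work actually changes.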

For example, build a complete monitoring system:

Monitor server resource usage, find servers whose usage is below standard, and recycle or reallocate their resources; improve resource utilization through virtualization, containerization and other means. Sort out alarm thresholds and standardize the P0, P1, P2 and P3 alarm levels. Have the monitoring system provide alarm merging, intelligent fault-location suggestions, proactive alarm aggregation and time-dimension alarm analysis, so that alarm response and fault location become easier and faster. Sort out the alarms and contingency plans of each service to improve fault response and recovery times, shorten the mean time to recovery, and raise fault alarm coverage.
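To make "alarm merging" with standardized P0 to P3 levels concrete, here is a minimal Python sketch. The `Alarm` fields, the grouping key (service plus check name) and the 300-second merge window are all assumptions chosen for illustration; a real monitoring system would define its own rules.

```python
from dataclasses import dataclass

LEVELS = ("P0", "P1", "P2", "P3")  # P0 is the most severe


@dataclass(frozen=True)
class Alarm:
    # Hypothetical alarm record; field names are illustrative only.
    service: str
    check: str
    level: str
    timestamp: int  # seconds since epoch


def merge_alarms(alarms, window_seconds=300):
    """Merge alarms for the same service and check that fire within
    window_seconds of the group's first alarm, keeping the most severe level."""
    merged = []
    open_groups = {}
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.service, alarm.check)
        group = open_groups.get(key)
        if group is None or alarm.timestamp - group["first_seen"] > window_seconds:
            group = {"service": alarm.service, "check": alarm.check,
                     "level": alarm.level, "first_seen": alarm.timestamp,
                     "count": 0}
            open_groups[key] = group
            merged.append(group)
        group["count"] += 1
        # A lower index in LEVELS means a more severe alarm.
        group["level"] = min(group["level"], alarm.level, key=LEVELS.index)
    return merged


raw = [
    Alarm("pay", "latency", "P2", 0),
    Alarm("pay", "latency", "P0", 120),
    Alarm("web", "disk", "P3", 50),
    Alarm("pay", "latency", "P1", 1000),
]
groups = merge_alarms(raw)
for g in groups:
    print(g["service"], g["check"], g["level"], g["count"])
```

The on-call engineer then receives one notification per group at the highest level seen, instead of one per raw alarm.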

Some in the industry believe that the rise of infrastructure such as cloud and Kubernetes will gradually eliminate operation and maintenance positions. What do you think of this view?

Many years ago, the slogan of our operation and maintenance team was NO Ops, and the blog was noops.me.

People have said for a long time that operation and maintenance positions, or at least some of their responsibilities, will gradually disappear. Take system operation and maintenance as an example. The team used to need 20 people, including server engineers, kernel engineers, network engineers, CDN engineers, and computer room operation and maintenance engineers. After public cloud was introduced, the team had only 4 people: 1 cloud resource administrator, 1 CDN scheduling engineer, 1 network engineer, and 1 kernel engineer. They only need to manage and schedule the resources and services provided by third-party companies.

With the spread of K8s and cloud, and as R&D code engineering keeps maturing, operation and maintenance will be less and less involved in deployment. Once our deployment framework matured, we handed the deployment of second- and third-level services over to R&D as self-service, to save operation and maintenance manpower and improve deployment efficiency.

As technology develops and times change, it is normal for a position to disappear. What deserves thought is how to adjust and plan in time.

In the current environment where enterprises are moving to the cloud on a large scale, what adjustments do you think operation and maintenance personnel should make to better meet the current talent needs?

In the cloud environment, operation and maintenance engineers should move closer to the business and the architecture, broaden their scope, and become the key people who ensure business stability. Anyone who carries on as before, focusing only on monitoring alarms and handling only service deployment changes, will definitely be eliminated.

On the other hand, you can go in the direction of specialization, become an expert in a certain field (monitoring, big data, K8s, database, etc.), and become an operation and maintenance R&D expert.

Life advice: look for more side jobs. Operation and maintenance work is only a small part of life.

AIOps has been hyped for several years, but the buzz has clearly died down recently. Do you think companies should implement AIOps at this stage? What issues should they pay attention to?

Take smart monitoring as an example. I have seen a lot of marketing copy saying that AI should be used to predict faults and locate them intelligently, but I have not seen a single reliable case so far. In an Internet business system, services change quickly, dependencies are complex, and many factors affect faults. If historical data really could predict faults there, it would be better to use it for earthquake prediction: thousands of years of accumulated earthquake data could produce great social value.

The prerequisite for doing AIOps is to really understand AI, including the principles of machine learning and neural networks. There is only as much intelligence as there is human effort behind it; AIOps capability is not a slogan.

Do you think AI capabilities like ChatGPT will be able to solve problems in the operation and maintenance industry in the future?

Fault management is one example: given the faulty equipment, the data and the fault description, and drawing on the knowledge base, the historical fault database and so on, it could offer auxiliary suggestions for the fault (a suggestbot).
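The retrieval half of such a suggestbot can be sketched without any language model. The Python example below ranks historical faults by plain keyword overlap (Jaccard similarity) with a new fault description; this is a deliberately simple stand-in, not how ChatGPT works, and the `history` records and their `symptom` and `fix` fields are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a description and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def suggest(description, history, top_n=3):
    """Return fixes from past faults whose symptoms share words with the
    new description, ordered from most to least similar."""
    query = tokens(description)
    scored = []
    for record in history:
        known = tokens(record["symptom"])
        union = query | known
        score = len(query & known) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, record["fix"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fix for _, fix in scored[:top_n]]


# Hypothetical historical fault database.
history = [
    {"symptom": "disk full on database server", "fix": "rotate logs"},
    {"symptom": "high latency on payment api", "fix": "scale out"},
    {"symptom": "certificate expired", "fix": "renew cert"},
]
suggestions = suggest("database server disk is full", history)
print(suggestions)
```

A model like ChatGPT could sit on top of this kind of lookup to phrase the suggestion, but the quality of the knowledge base and fault history would still determine how useful the answer is.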

BTW, if you can already play with ChatGPT, invest the technology in other areas that can generate more value. Don't always waste it on operation and maintenance...

There is endless debate in many companies about whether the deployment of business programs should be left to R&D or operation and maintenance. What do you think of this issue?

As mentioned before, our second- and third-level services are deployed entirely by R&D, while first-level services are deployed by operation and maintenance and R&D in turn, mainly so that operation and maintenance stays aware of current service changes. When operation and maintenance handles deployment in a company's early days, the focus is on standardizing the online environment and the way services are deployed, so that a deployment system can be built and the service architecture they are responsible for can be kept under control.

Security issues and process issues can be solved entirely by the deployment system. Operation and maintenance should not cling to this work, which has no value and builds up no experience.

What do you most want to say to the (operation and maintenance) industry? Why?

"It is not that physics does not exist; the physics we have in mind may not exist." The operation and maintenance industry may cease to exist as well. For many operation and maintenance people the dream is AIOps or NOOps: either kill this industry yourself, or be killed within it.

When it comes to tool selection, how do you decide whether to develop it yourself, use open source, or use commercial products?

If you have the ability and time, use open source; if ability and time are limited, use commercial products. If you have money and spare time and are very sure of yourself, you can try developing it in-house.

Does your company also have a multi-cloud architecture? Which capabilities do you think should be relied upon by cloud vendors in multi-cloud scenarios and which capabilities should be built in-house?

We have a multi-cloud architecture. Dedicated lines or data transmission capabilities need to be built in-house. Common capabilities that span clouds can also be built in-house, such as monitoring systems, data backup systems, deployment systems and core microservice components; the rest can be left to cloud vendors.

What is your most memorable failure? What inspiration does it have for you?

After so many years in operation and maintenance, we have run into too many weird failures whose root causes are beyond imagination. All we can say is that failures are hard to avoid; we can only try to reduce their frequency, scope of impact and duration.

So performance should be judged not by the number and severity level of failures, but by their impact, the failure response, the recovery time and so on.

Faced with the rapid development of basic technologies, do you have any career planning suggestions for operation and maintenance personnel who have just entered the industry and those who have been in the industry for a long time?

My view is quite extreme~ For those who have just entered the industry, I recommend changing careers as soon as possible! For those who have been in the industry for a long time, moving to another technical career is relatively hard, because they are deeply marked by operation and maintenance. I have seen a great many operation and maintenance people try to switch to other technical roles; most ended up in operation and maintenance R&D or operation and maintenance product manager positions. It is better to find a side job.

What do you think is the difference between traditional operation and maintenance and SRE? What was the thinking behind your team's transformation?

It's already 2023. Talking about this topic is a step backwards, like setting up a NOC monitoring shift for Internet operation and maintenance.

If you are still considering whether to move to SRE, how to do it, and what SRE changes, it is like still debating whether to use 2G or 3G in the 5G era... you will be left behind by the times.

Does it feel like it ended abruptly? Haha, this is the first issue of "Operation and Maintenance Forum". We will keep inviting industry leaders to share. The more opinions differ, the more interesting it gets and the more thinking it triggers. Let's keep an open mind and listen to the opinions of a hundred schools of thought. See you next time!
