Flashcat Lai Wei: How to stabilize the job of operation and maintenance

WBOY
Release: 2023-06-08 18:42:26


Some time ago, the first issue of the forum "Jingyuan - Operation and Maintenance Geometry" and Ma Chi's article "It's time to lay off operations and maintenance collectively" sparked widespread discussion in the industry. Is there really no future for operation and maintenance positions? How do you keep your job secure? In this issue, we interviewed Lai Wei from Kuaimao Nebula. Lai Wei is an entrepreneur who came out of the operation and maintenance world, and anyone who can start a business from there must have deep experience in the industry. How does he see this problem? Let's hear a fresh perspective!

This is the 3rd issue of the down-to-earth yet high-level "Operation and Maintenance Hundreds Forum". Let's start!

Introduce yourself and your current company?

Hello everyone, I am Lai Wei from Kuaimao Nebula. Kuaimao Nebula is a cloud-native intelligent operation and maintenance technology company, composed of the core development team of the open source monitoring tool "Nightingale Monitor". The "Flashcat Platform", a cloud-native monitoring and analysis platform created by Kuaimao Nebula, aims to solve the problems of difficult unified monitoring and slow fault location under cloud-native and hybrid cloud architectures.

If you want to know more about the story behind the founding of Kuaimao Nebula, you can read ITPub's exclusive interview with me, "Ten years of hard work, from front-line engineer to CEO". Corrections are welcome.

Some operation and maintenance veterans report that the company knows very little about the value of operation and maintenance. How do you clearly explain the value of operation and maintenance to the company?

Explaining the value of your work to company management in an easy-to-understand way, and winning their understanding and support, is a common problem for every middle-office and back-office technical team. Fail at it, and the team can lose its jobs very quickly. Explaining the value of operation and maintenance work clearly is even harder.

Judging from my circle of friends, I see posts from time to time urging operation and maintenance staff to be laid off or to change careers:

  • For example, the Swedish horse worker's article "It's time to lay off operations and maintenance collectively", which is inspiring and enlightening, says at the very beginning: "To be blunt: today, with the maturity of cloud native and DevOps, operation and maintenance, as a position and as a team, has completed its historical mission and should retire from the stage."
  • Another example is Boss Jing, who introduced me to the industry. In the first issue of SRETalk, he gave this well-intentioned advice: "With the development of technology and the changes of the times, the death of a position is a normal thing; timely adjustment and planning are what you should focus on."
However, the position of operation and maintenance, and the people behind it, have stood on the edge of elimination time and time again, and have stubbornly come back to life each time, finding a way forward just when things looked darkest. They are often willing to laugh at themselves, actively embrace crises, and dare to seek change. Looking back over the past ten years, cloud computing, cloud native, DevOps, and SRE were all major changes in IT that tried to continuously optimize and improve the field of "big operations and maintenance". The operation and maintenance industry has not died out; it has continued to evolve and take on new meaning.

What does this show? It shows that operation and maintenance is very important, and also that it is difficult! But how do you make this value clear? Let's analyze it from the perspective of positioning, goal setting, and input-output ratio, which I cover in the next question.

What do you think are the most important goals of operation and maintenance work? How did you achieve these goals? How can the value of operation and maintenance be better reflected?

Focusing on the classic operation and maintenance field, the most important job responsibilities are:

  1. Code release and delivery: do a good job on the last mile of delivering value;
  2. Improve the scalability of the architecture and implement it;
  3. Ensure the stability (reliability) of the system and continuously improve it;
  4. While meeting the first three goals, continuously optimize and reduce the operating costs (FinOps) of the system.
If you find that your work does not revolve around the above categories, there are two possibilities: either you are not doing operation and maintenance, or your work goes beyond its scope!

After clarifying the scope of work, or the mission of operation and maintenance, it is relatively easy to set goals, such as:

  1. For code release and delivery, it can be simply measured by the number of releases;
  2. For the scalability of the system, it can be measured by the timeliness of expansion;
  3. For stability, it can be measured by the unavailability duration of core functions;
  4. For system operation costs, we can calculate and track the resource costs and labor costs spent on completing each core transaction.
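
The four measurements above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the `Outage` record, and all sample figures are hypothetical, and the stability calculation assumes that outage windows do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Outage:
    """One period during which a core function was unavailable (hypothetical record)."""
    start: datetime
    end: datetime


def release_count(release_times, since):
    """Delivery: number of releases on or after `since`."""
    return sum(1 for t in release_times if t >= since)


def mean_scale_out_minutes(requests):
    """Scalability: average minutes from a capacity request to capacity being ready."""
    durations = [(ready - asked).total_seconds() / 60 for asked, ready in requests]
    return sum(durations) / len(durations)


def unavailable_minutes(outages):
    """Stability: total unavailability of core functions in minutes.

    Assumes the outage windows do not overlap.
    """
    return sum((o.end - o.start).total_seconds() / 60 for o in outages)


def cost_per_transaction(resource_cost, labor_cost, transactions):
    """Cost: resource plus labor cost spent per core transaction."""
    return (resource_cost + labor_cost) / transactions


# Made-up sample data for one month.
month_start = datetime(2023, 6, 1)
releases = [datetime(2023, 5, 30), datetime(2023, 6, 2), datetime(2023, 6, 5)]
scale_requests = [
    (datetime(2023, 6, 3, 10, 0), datetime(2023, 6, 3, 10, 30)),
    (datetime(2023, 6, 7, 9, 0), datetime(2023, 6, 7, 9, 10)),
]
outages = [Outage(datetime(2023, 6, 4, 2, 0), datetime(2023, 6, 4, 2, 45))]

print("releases this month:", release_count(releases, month_start))
print("mean scale-out minutes:", mean_scale_out_minutes(scale_requests))
print("unavailable minutes:", unavailable_minutes(outages))
print("cost per transaction:", cost_per_transaction(80_000, 120_000, 1_000_000))
```

Tracking the same four numbers period over period is what makes them usable as goals, whatever the exact definitions a team settles on.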

On how to demonstrate the value of operation and maintenance:

First of all, we operation and maintenance people must change our attitude and stance: stand firmly with the business and strive to share the business goals.

Let me give you an example. The HR department is also a back-office department, about as back-office as it gets. Yet the excellent HR people I have come into contact with, whether recruiters or HRBPs, have always positioned themselves first as members of the business department and treated the business department's goals as their own. When the stance is aligned and everyone is on the same side, the value is easy to explain.

Secondly, value is always related to "cost input". If you have built a large operation and maintenance team whose labor cost stands out in the company, you will easily become a "key focus" in the eyes of the boss, and you will also face more demanding challenges from the business side. As the saying goes, an innocent man invites trouble simply by holding a treasure :) Objectively speaking, the resources invested in the operation and maintenance team must match the business income. Too high or too low is unhealthy and not conducive to the development of the team. Therefore, "value creation in operation and maintenance" ultimately comes down to competing on operation and maintenance efficiency.

Finally, value needs both quantitative and qualitative descriptions. On the quantitative side, examples include comparisons with the industry level and data from the company's business department satisfaction survey. On the qualitative side, an example is the team's "sense of presence" in supporting the company's strategic projects.

Do you think AI capabilities like ChatGPT will be able to solve problems in the operation and maintenance industry in the future?

First of all, let’s take a look, what are the core advantages of ChatGPT? ChatGPT has generational innovation in terms of richness of knowledge, natural language understanding capabilities (and context understanding), and content generation capabilities.

Then, let’s analyze what are the core issues in the operation and maintenance industry?

  • Is it a lack of domain knowledge?
  • Is the interaction efficiency low?
  • Is it difficult to output content?

None of the above. The problem the operation and maintenance industry deals with is essentially a systems engineering problem: delivering the value of IT systems quickly, solving scalability, solving stability, and continuously improving the cost-effectiveness of running and maintaining systems.

So far, cloud computing and microservices have brought more substantial change to the operation and maintenance industry. ChatGPT can effectively help with knowledge accumulation in the industry, and may soon replace some junior operation and maintenance architect positions.

When it comes to tool selection, how do you decide whether to develop it yourself, use open source, or use commercial products?

There is no absolute answer to this question. From my personal experience, there are probably the following situations:

Benefits of in-house development:

  1. The psychological sense of autonomy and control will be stronger;
  2. In the short and medium term, it will be more beneficial to the team’s development space;
  3. Able to carry out targeted and flexible design according to one's own actual situation.

Disadvantages of in-house development:

  1. The time cost is very high. Long delays are likely, which will have a negative impact on the development of the business;
  2. The labor cost is high. Taking Beijing as an example, a relatively senior engineer costs about 500,000 a year, and developing operation and maintenance tools in-house until they are mature still requires an investment of two engineers;
  3. Limited by what the R&D staff know, in-house development can easily drift away from industry best practices, which will cause internal tools to lag behind the times in the long run.

Open source and open source secondary development:

The advantage is that it takes effect quickly and can be put into production soon.

There are three disadvantages:

  1. Open source tools generally emphasize flexibility and concentrate on functionality. They usually fall short on productization and user experience, so experience problems appear once they are rolled out quickly;
  2. Anyone who writes code knows from experience that fully reading and understanding someone else's code is about as hard as writing it yourself. Therefore, before an open source project goes into a production environment, enough manpower and time must be invested to master it;
  3. Most secondary development of open source projects ends up diverging from the community mainline, making it impossible to upgrade smoothly to later versions and to enjoy the real dividends of open source.

Use commercial products and solutions:

Advantages:

  1. The time cost advantage is obvious. With commercial products, we can support the development needs of the business quickly and with agility. Above all, the business must not be held up!
  2. In principle, the cost of a commercial product will be several times lower than that of a self-developed one. This cost gap is determined by the business model: the fundamental reason commercial products can be profitable is that product R&D costs (plus sales costs) are diluted as the number of customers increases. Otherwise, the vendor would have no reason or way to exist;
  3. The core competitiveness of commercial products includes domain know-how, a polished product experience, and good technical support and services, which usually means that technical teams using commercial products will earn a better reputation with the company's business side.
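
The dilution argument in point 2 is simple arithmetic, sketched below in Python. The in-house figure uses the numbers quoted earlier in this interview (two engineers at roughly 500,000 a year each). The vendor's R&D cost, sales cost, and customer counts are made-up assumptions for illustration only, and the result is the vendor's cost attributable to each customer, not an actual price.

```python
def in_house_annual_cost(engineers, salary_per_engineer):
    """Annual labor cost of building and maintaining a tool in-house."""
    return engineers * salary_per_engineer


def vendor_cost_per_customer(rd_cost, sales_cost, customers):
    """Vendor's annual R&D plus sales cost, spread evenly across its customers."""
    return (rd_cost + sales_cost) / customers


# Figures from the interview: two engineers at about 500,000 per year each.
in_house = in_house_annual_cost(engineers=2, salary_per_engineer=500_000)

# Hypothetical vendor figures, chosen only to show the dilution effect.
for customers in (10, 50, 200):
    shared = vendor_cost_per_customer(5_000_000, 1_000_000, customers)
    print(f"{customers} customers: {shared:,.0f} per customer vs {in_house:,} in-house")
```

Under these assumed numbers, the per-customer share falls below the in-house labor cost as soon as the vendor has more than six customers, which is the mechanism the point above describes.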

Shortcomings:

  1. The domestic ToB market started late. The biggest obstacles to customers adopting commercial products today are the lack of truly easy-to-use products and the lack of an obvious price advantage;
  2. Many customers carry heavy technical legacy and many customized solutions. Commercial products often cannot match them completely, so customers have to bite the bullet and develop their own.

There is a view in the industry that the rise of infrastructure such as cloud computing and Kubernetes will gradually eliminate operation and maintenance positions. What do you think of this view?

It is true that cloud computing and K8s emerged mainly to improve "operation and maintenance", and they have had a significant impact on how the industry works. For example:

  • ClickOps has gradually given way to IaC
  • Traditional monitoring has been upgraded to a more comprehensive observability system
  • Releases have changed from periodic big-version releases to more agile continuous integration
  • The "old Chinese medicine doctor" style of maintaining open source software has become correctly selecting and using the corresponding cloud service
  • The physical labor of racking machines has become opening instances in a simple console within minutes
  • The expert work of typing commands to configure network routing has become combining and matching the various network products of cloud services
  • Co-locating workloads on physical machines to improve utilization has given way to microservices and cloud-native architecture, where costs decline naturally
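
To make the first item concrete, the core idea behind IaC can be sketched in Python: the desired infrastructure is declared as data, compared with what actually exists, and turned into a list of changes. This is a toy illustration, not how any particular IaC tool is implemented; the `plan` function and the resource names are hypothetical.

```python
def plan(desired, actual):
    """Return the actions needed to move `actual` infrastructure to `desired`.

    Both arguments map a resource name to its configuration.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, actual[name]))
    return actions


# Declared in version control (hypothetical resources).
desired_state = {"web": {"replicas": 3}, "db": {"size": "large"}}
# What is currently running, for example after years of console clicks.
actual_state = {"web": {"replicas": 2}, "cache": {"size": "small"}}

for action in plan(desired_state, actual_state):
    print(action)
```

Because the desired state is plain data, it can be reviewed, versioned, and replayed, which is the practical difference from clicking through a console.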

We see that the connotation of operation and maintenance work has not changed, and the value of the work has not become weaker. The skill tree that operation and maintenance needs to master is being upgraded. If operation and maintenance personnel continue to maintain a sense of crisis, maintain a proactive spirit of seeking change, and focus on serving the business well, they will be able to stay on top of the trend and see bright futures everywhere.

There are many optional monitoring tools. Why do users choose your company's Flashcat platform?

Indeed, there are many open source and commercial monitoring platforms. I have written a blog post on this before, "Comparison of Twelve Major Open Source Monitoring Tools in the Past Twenty Years", which you can refer to.

Coming back to why users choose the Flashcat platform, we need to start with the development trends of monitoring systems and the characteristics of the Flashcat platform. For the trends, you can refer to my earlier blog post "Top Ten Characteristics and Trends of Cloud Native Monitoring". The Flashcat platform is a targeted solution for these trends:

  1. Flashcat serves a wider and more diverse user group: from operation and maintenance engineers to all of R&D, operations, and the CTO/CIO. Flashcat makes monitoring analysis and information gathering simple;
  2. Flashcat is closely linked to business indicators: when the business is affected, Flashcat can always detect it right away and, through deep linkage with IT systems, help the technical team start investigating quickly;
  3. Unified monitoring for cloud native and hybrid cloud: no matter what kind of IT architecture is adopted, you only need one Flashcat platform.

source:51cto.com