Through interviews and invited articles, veterans in the field of operation and maintenance are asked to share their insights and exchange ideas, with a view to forming some forward-looking consensus and helping the industry move forward in a better way.
In this issue, we invite Shao Haiyang from Youpaiyun Technology, a 25-year Linux veteran. Mr. Shao is passionate about technology and has worked his way up step by step, which is a typical growth path for ordinary operation and maintenance personnel. I hope today's interview can give you some inspiration.
This is the 4th issue of the down-to-earth yet high-level "Operation and Maintenance Hundreds Forum". Let's start!
I am Shao Haiyang from Youpaiyun Technology. I have been using Linux since 1998, almost 25 years. I am a veteran Linux system operation and maintenance engineer and architect, an advocate of the "Eight Honors and Eight Disgraces of DevOps", and an amateur writer. I am proficient in (or "guilty of") system optimization and network service management, Linux system customization, and CDN acceleration and security defense. I am good at high-performance Internet network and architecture design, KVM virtualization and OpenStack cloud platforms, K8S container clouds, Ceph distributed storage, and other new technologies. I like to communicate and share, am active in the community, and have long been involved in organizing and promoting open source activities.
Youpaiyun is a company that provides cloud storage, cloud distribution, and cloud processing services. It is also the first professional cloud service provider in China to offer programmable CDN services. Its defining characteristic is uninterrupted 7x24 service all year round, so there are some rules or principles for cloud operation and maintenance, such as:
Ensure stability first, and then optimize
Over-design or premature optimization is likely to lead to more downtime. To avoid this, we must first focus on improving the scalability and high availability of the system. In line with the principle of "first complete, then improve, then perfect", projects are also delivered as "first usable, then easy to use, then pleasant to use".
Provide a reliable test basis and verification over time
Before introducing a new technology into the architecture, it is necessary to ensure that the technology is stable and has been tested sufficiently over a long period. More importantly, the tool chain developed for operation and maintenance engineering must be complete. Being caught off guard by rework or changes in production may itself be the trigger for a failure.
Use controllable automation methods to improve efficiency
Automation methods such as automatic deployment, automatic orchestration, automatic inspection, and automatic upgrade are increasingly used in cloud operation and maintenance. This trend suits the era of cloud computing, but with greater ability comes greater responsibility. Beware of the avalanche and thundering herd effects of automation, and be thorough with grayscale/blue-green deployment and all kinds of testing.
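As a rough illustration of "controllable" automation, the sketch below rolls a change out in growing grayscale batches and stops at the first unhealthy batch, and adds random jitter to delays so that hosts do not all act at the same moment. The batch fractions, the `deploy` and `healthy` callbacks, and all function names are hypothetical and are not taken from Youpaiyun's tooling.

```python
import random


def plan_batches(hosts, fractions=(0.05, 0.25, 1.0)):
    """Split hosts into grayscale batches; fractions are cumulative (assumed values)."""
    batches, done = [], 0
    for frac in fractions:
        # Always advance by at least one host, never past the end of the list.
        upto = min(len(hosts), max(done + 1, round(len(hosts) * frac)))
        if upto > done:
            batches.append(hosts[done:upto])
            done = upto
    return batches


def rollout(hosts, deploy, healthy, fractions=(0.05, 0.25, 1.0)):
    """Deploy batch by batch and stop at the first unhealthy batch.

    Returns (updated_hosts, failed_batch), where failed_batch is None on success.
    """
    updated = []
    for batch in plan_batches(hosts, fractions):
        for host in batch:
            deploy(host)
        updated.extend(batch)
        if not all(healthy(host) for host in batch):
            return updated, batch  # halt here; the remaining hosts are untouched
    return updated, None


def jittered_delay(base_seconds, spread=0.5, rng=random):
    """Randomize a restart or retry delay to soften thundering herd effects."""
    return base_seconds * (1 + rng.uniform(-spread, spread))
```

With 20 hosts and the default fractions, the batches contain 1, 4, and 15 hosts, so a bad change is caught after touching at most a quarter of the fleet.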
Keep it simple, monitor everything
Keep it simple; don't make it too complicated. In addition to the usual alarms for abnormal conditions, business indicators, market indicators, sales data, costs, and so on can all serve as input for trend analysis. Regularly reviewing the peaks and troughs of each trend can help you gain insights.
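A minimal sketch of what "viewing the peaks and troughs" of a trend could mean in code: it simply marks strict local maxima and minima in a list of samples. The function name and the sample values in the usage note are made up for illustration.

```python
def peaks_and_troughs(samples):
    """Return (peaks, troughs) as index lists of strict local extrema."""
    peaks, troughs = [], []
    for i in range(1, len(samples) - 1):
        prev, cur, nxt = samples[i - 1], samples[i], samples[i + 1]
        if cur > prev and cur > nxt:
            peaks.append(i)
        elif cur < prev and cur < nxt:
            troughs.append(i)
    return peaks, troughs
```

For a hypothetical series such as `[120, 80, 60, 90, 200, 340, 310, 150]`, the trough is at index 2 and the peak at index 5. Real monitoring data would usually be smoothed first so that noise is not reported as a peak.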
Budget-oriented operation and maintenance
The operation and maintenance team is usually the biggest spender. Without enough budget, it is hard for a cash-strapped operation and maintenance team to keep up with the company's growing business scale, unless the business has stagnated or is no longer growing explosively. Faced with this challenge, operation and maintenance must learn to cut costs and improve efficiency, increase revenue and reduce expenditure, and use new technologies to improve energy efficiency.
Scenario-oriented intelligent operation and maintenance
Load scenarios vary widely, from high-concurrency processing to video transcoding, and from high-performance parallel computing to massive volumes of network requests. These different load scenarios place different demands on network bandwidth, processing, and IO. Intelligent operation and maintenance requires an in-depth understanding of the business and a reasonable allocation of resources and architecture to meet the needs of different business scenarios.
Continuous integration and release system
Continuous release covers grayscale release, test release, rolling release, rollback, and other scenarios, and each scenario should be controllable.
Ensure that anyone can be replaced
As the saying goes, the camp is built of iron while the soldiers flow through like water, so it is normal for people to come and go. Do a good job of shared document management and of knowledge transfer and sharing among employees. In theory, everyone can be replaced, and no one should become the ceiling of the company.
The company has always actively encouraged employees to improve their skills and grow:
The training within the Youpaiyun operation and maintenance team includes:
For those who have just entered a management position, my suggestion is to sort out the outstanding technical debt, and to take stock of and develop the team's skills in a timely manner. Lay a good foundation first, and you will have more room for progress later. For details, please refer to my talk "Eight Honors and Eight Disgraces of DevOps".
Taking stock of the team's talent and skill tree mainly means working with human resources to place people in the talent nine-box grid (for development or operation and maintenance, replace the performance axis on the left with potential; performance applies to sales). What it tests is the manager's ability to assess employees from all angles and to put them to good use.
Employees are motivated in combination with the company's OKR goal management. Its advantage is that, while aligning everyone on goals, it can also:
The CPU, disk and network IO are not intensive;
Theoretically, excellent software engineers can do some (or even all) of the work of operation and maintenance engineers. Take monitoring the performance of business software: if programmers insert enough hooks or probes into the program, the data can be collected without laborious monitoring by operation and maintenance. Likewise, if programmers consider database and table sharding, high concurrency, and distributed design when designing a program, operation and maintenance only needs to scale out horizontally by adding machines. If only the software did not have so many bugs... there are many ifs. The reality, however, is cruel. Such high-level programmers are too few, especially in China. Everyone is busy implementing business functions and is unwilling to write documents or even comments, let alone think things through so thoroughly. Conversely, operation and maintenance comes into contact with a lot of excellent and mature open source software, from which we can learn how excellent software is designed. For example, excellent programs log in great detail, and we can monitor them through standard syslog or log files. Therefore, senior operation and maintenance personnel will:
Actively participate in up-front planning, work with development on drills, automate deployment, and assist in improving the architecture
Make reasonable requests for requirements and resources, ideally backed by a budget, to prevent problems before they happen
Not necessarily charge into battle themselves, but keep an overview of the whole situation, strategize, and schedule all resources (the role of the operation and maintenance architect)
Lead and unite the team, take a commanding view from above, and implement solutions suited to the circumstances (the role of the software architect)
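The "hooks or probes" that programmers could build into a program, as mentioned above, might look something like the following sketch: a decorator that counts calls and accumulates elapsed time for a function. The `METRICS` store, the `probe` decorator, and `handle_request` are hypothetical names used only for illustration; a real system would export such figures to its monitoring platform.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metrics store: name -> [call_count, total_seconds]
METRICS = defaultdict(lambda: [0, 0.0])


def probe(name):
    """Decorator that counts calls and accumulates wall-clock time under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the call even if the function raised.
                stats = METRICS[name]
                stats[0] += 1
                stats[1] += time.perf_counter() - start
        return wrapper
    return decorator


@probe("handle_request")
def handle_request(payload):
    # Stand-in for real business logic.
    return payload.upper()
```

After a few calls to `handle_request`, `METRICS["handle_request"]` holds the call count and the total time spent, which is the kind of data operation and maintenance would otherwise have to collect from the outside.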
For example, to support 100,000 (10W) concurrent online users, we need redundant bandwidth and redundant servers, that is, twice the amount of each. If these are halved because the budget is insufficient, the consequences and the responsible party are clear. Another example is poor software design: performance monitoring reveals the abnormal indicators, and with them the consequences and the responsible party. Of course, if an alarm is not handled in time, or a failure is caused by human operation, it is reasonable for that to be counted against operation and maintenance. Fault culture means focusing on the problem and on the matter itself, not on the person. Everyone grows up through failures and becomes stronger through reviews.
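The "x 2" redundancy rule in this example can be written down as a small calculation. This is only a sketch: the per-server capacity and per-connection bandwidth used in the usage note are assumed numbers chosen for illustration, not figures from the interview.

```python
import math


def capacity_plan(target_concurrency, per_server_concurrency,
                  kbps_per_connection, redundancy=2):
    """Estimate servers and bandwidth for a concurrency target with redundancy."""
    base_servers = math.ceil(target_concurrency / per_server_concurrency)
    base_mbps = target_concurrency * kbps_per_connection / 1000
    return {
        "servers": base_servers * redundancy,
        "bandwidth_mbps": base_mbps * redundancy,
    }
```

Assuming 5,000 concurrent connections per server and 200 kbps per connection, `capacity_plan(100_000, 5_000, 200)` gives 40 servers and 40,000 Mbps. Calling it with `redundancy=1` shows what a halved budget buys: 20 servers and 20,000 Mbps, with no headroom left for failures or traffic spikes.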
Operation and maintenance automation;
Monitoring normalization;
Log visualization!
This is too long a topic, so I won't go into details here. You can refer to "Enlightenment and Architecture Design of Cloud Operation and Maintenance".
Youpaiyun usually does not reinvent the wheel. We first make good use of existing wheels, or modify them to make them more convenient. Choosing to build something in-house usually means you have a certain development capability, together with some compelling reasons, such as:
Public cloud serves as the IaaS base, container cloud serves as the CaaS middle layer, and cloud native serves as the SaaS application layer. The entire cloud ecosystem is changing with each passing day, and the core functions of the SRE team will focus more on top-level, systematic capacity planning, indicator monitoring, high availability, and distributed elastic design. This calls for cross-platform and cross-department functional complementarity, team collaboration, continuous improvement, and the courage to take responsibility, including:
The value of a team lies in whether it can keep accepting new things and new challenges and play to its strengths, so that it is neither a frog at the bottom of a well nor a frog slowly boiled in warm water. When innovation or disruption arrives, we will still not be left behind by the times.
## Technical field
## Non-technical field
First of all, it is not the job that chooses the person, but the person who chooses the job. If a person is interested in something and has genuinely put in close to 10,000 hours of hard study, he can do almost anything. For example, when I graduated, the emphasis was on versatile, cross-disciplinary talent, and there was no such thing as operation and maintenance. We built (DIY) our own machines and taught ourselves the Linux operating system, and we also learned programming, tinkered with the Internet, and wrote our own programs such as forums and chat rooms. Linux brought us innovative, fun, and excellent open source software every day, which kept our passion alive and let us tinker and learn to our heart's content. When the opportunity came with the rise of the Internet, becoming an operation and maintenance director was a natural step. Beyond that, I have also moved into pre-sales and technical support, gone out into the market, and often done public-speaking training. So a real expert is someone who is willing to learn anything, for whom extra skills are never a burden, and who is an operation and maintenance engineer that understands the business and can also develop.
I think the most important ability is the ability to express yourself and communicate, though this does not exclude the technical knowledge, practical skills, programming skills, and learning ability that operation and maintenance itself requires. Considering that operation and maintenance is still mostly a cost center, turning esoteric and obscure performance and bottleneck indicators into intuitive charts, in order to secure continued investment from upper management, takes skill. Then, when facing your colleagues and peer departments, you also need influence to coordinate and push the work forward. If you can do this, it means that you have the ability to lead, so that you will work at a higher level in everything you do in the future and take an overall view to coordinate and plan the entire project, allocating and controlling goals, personnel, schedules, and resources reasonably.