My team has been doing system optimization since 2003. That year, at the invitation of HP SERVICE, I joined their Haier system optimization team and took responsibility for optimizing the Oracle database. It was the first time I had taken part in optimizing a large-scale system, and at that point I did not even know where to start with a large after-sales service system. I went to Qingdao for the project with a book by Levi's in hand. Through this project I gained a preliminary understanding of Oracle database optimization. Later, I helped HP complete the performance evaluation of the CAF platform used in Huawei's SCM system. Because the project could no longer be optimized, I recommended that the decision-makers stop it in time to avoid wasting more money. HP adopted my suggestion and closed the CAF-based project, and Huawei re-selected Oracle EBS as the basis of its SCM and ERP systems. Since then, our team has grown, taken on more and more optimization projects, and trained a group of system optimization experts.
In 2011, we began to help the State Grid with system optimization. Under the leadership of experts, the first few projects achieved particularly good results. The customer then wanted us to expand the scope and planned a large-scale optimization project that required nearly a hundred DBAs. We recruited dozens of DBAs from many partners, and to ensure the quality of the project we ran several rounds of centralized training for the entire team. In the end, however, the results were very unsatisfactory. The main reason was that the DBAs' abilities were uneven, and most of them had never taken part in a large-scale optimization project. Since that project, I have kept thinking about the problems of the traditional operation and maintenance model, which relies on people and experts, and hoping to find a way to make expert experience go further. This was my original motivation for developing D-SMART, an operation and maintenance knowledge automation system. To build a knowledge automation system, the degree of digitalization in operation and maintenance must be improved. However, the degree of digitalization of IT operation and maintenance in traditional industries is very low. There are several main reasons for this.
Limited resources: Many companies may not have enough resources to invest in R&D and implementation of intelligent operation and maintenance systems, or may think that investing resources in other aspects is more rewarding.
Cultural factors: Some businesses may prefer to rely on human experience rather than automated systems, perhaps because they lack trust in automated systems, or they may believe that expert judgment is more reliable than machines in an emergency.
Technical limitations: Some companies may lack the necessary technical infrastructure to support intelligent operation and maintenance systems, which may require higher costs to upgrade equipment and systems.
Lack of awareness: Some enterprises may not be aware of the potential advantages of digital operations, or may not have enough knowledge and understanding of how to implement digital operations.
Although traditional industries still have various gaps in their understanding of digital operation and maintenance, technology keeps developing and digitalization keeps growing in importance. Intelligent operation and maintenance will therefore become a trend, and an inevitable direction, for the operation and maintenance of future information systems.
Reflecting on our work experience in system optimization and operation and maintenance over the years, inexperienced technical personnel are an important factor leading to poor optimization results. Optimization work requires professional knowledge and skills rather than relying solely on experience. More systematic training may be needed to ensure that all personnel involved in optimization efforts have the necessary skills and knowledge. In addition, the effect of optimization work is also affected by multiple factors, such as system design, data quality and optimization work process.
With the continuous development of technology, many intelligent algorithms and methods are now available that can greatly improve operation and maintenance efficiency and reduce human error. Operation and maintenance knowledge automation tools can provide intelligent analysis and automated operations to help DBAs better manage and optimize the system. If an enterprise has sufficient resources, it can consider introducing these tools and systems. An "operation and maintenance knowledge automation system" combines big data analysis, artificial intelligence and other technologies with expert experience and accumulated work to build a comprehensive operation and maintenance knowledge system, which can help improve the efficiency and quality of operation and maintenance work. Through monitoring indicator systems, health models, operation and maintenance knowledge maps, anomaly detection algorithms and other technologies, such a system can automatically analyze and solve system performance problems, and at the same time provide intelligent optimization suggestions and decision-making support, giving strong backing to the enterprise's operation and maintenance work.
In fact, the most important purpose of developing the D-SMART system is to summarize our team's more than 20 years of experience in IT operation and maintenance and system optimization, so that the experience the team's experts have accumulated over the years is turned into a digital knowledge base that can be automated. Through continuous iteration of the knowledge base, operation and maintenance knowledge keeps accumulating in the platform, which continuously improves its capacity for automated analysis.
The research and development of this system does not rely only on the R&D team. The knowledge tools are developed entirely by DBAs, without the help of ordinary R&D personnel. This is because ordinary R&D personnel do not understand IT operations, databases, and performance optimization. Only DBAs who have done operation and maintenance work can accurately turn experts' ideas into automated tools.
The starting point of the D-SMART system is the indicator system. I think indicators are part of expert experience, and a very important part. Only indicators recognized by experts can be fully interpreted. At present, many database monitoring products provide a large number of indicators that operation and maintenance personnel cannot interpret correctly. Even when these indicators are abnormal, the anomaly may go unnoticed; or, if it is noticed, the staff still cannot tell where in the system the problem lies. The indicators sorted out by experts each have a single, clear meaning that experts can interpret, so each indicator is marked by experts and given specific labels.
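As a rough illustration of what an expert-labelled indicator catalog could look like, here is a minimal Python sketch. It is not D-SMART's actual data model: the class, the field names, the indicator names and the labels are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Indicator:
    """One expert-approved indicator. All field names are hypothetical."""
    name: str
    meaning: str   # the single thing this indicator measures
    unit: str
    labels: frozenset = field(default_factory=frozenset)  # expert-assigned labels


# Hypothetical catalog entries, for illustration only.
CATALOG = {
    ind.name: ind
    for ind in [
        Indicator("active_sessions", "Sessions currently executing work",
                  "count", frozenset({"load", "concurrency"})),
        Indicator("log_sync_wait_ms", "Average wait for a commit to reach disk",
                  "ms", frozenset({"io", "commit"})),
        Indicator("tablespace_used_pct", "Share of tablespace already used",
                  "%", frozenset({"capacity"})),
    ]
}


def indicators_with_label(label):
    """Return the names of all catalog indicators that carry the given label."""
    return sorted(name for name, ind in CATALOG.items() if label in ind.labels)
```

With labels attached, a question such as "which indicators describe I/O?" becomes a simple lookup, for example `indicators_with_label("io")`.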
The second step of D-SMART is the accurate collection of indicators. For an intelligent operation and maintenance system, it is critical that every data point accurately reflects the true state of the database. Much of the collected data must also be processed before it becomes a usable indicator, and these processing algorithms reflect expert experience as well. Through this step, the D-SMART system continuously obtains a digital model of the database's operating status.
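One common kind of processing, shown here only as an assumed example and not as D-SMART's own algorithm, is turning a cumulative counter into a per-second rate. The sketch below assumes samples arrive as `(timestamp_seconds, cumulative_value)` pairs sorted by time, and it drops intervals where the counter went backwards (for instance after an instance restart) instead of reporting a negative rate.

```python
def counter_to_rate(samples):
    """Convert cumulative counter samples into per-second rates.

    samples: list of (timestamp_seconds, cumulative_value), sorted by time.
    Returns a list of (timestamp_seconds, rate_per_second).
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # duplicate or out-of-order sample: no usable interval
        delta = v1 - v0
        if delta < 0:
            continue  # counter reset: skip rather than emit a negative rate
        rates.append((t1, delta / elapsed))
    return rates
```

Skipping a reset interval loses one data point but avoids feeding a misleading spike or negative value into later analysis; other designs (such as treating the new value as the delta) are equally possible.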
The third step is to conduct automated modeling analysis on the collected indicators and log data. We use the health model to determine whether the running status of the database is normal and whether there are risks; we use the performance model to understand the overall performance status of the database; we use the load model to understand the current load situation of the database; we use the fault model to discover possible hidden dangers in the database and provide timely alarms.
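To make the idea of a health model more concrete, the following is a deliberately simple Python sketch of threshold-based scoring. It is an assumption made for illustration: the source does not describe how D-SMART's health model is computed, and the indicator names, thresholds and weights below are invented.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    warning: float   # value at which the indicator becomes a warning
    critical: float  # value at which the indicator becomes critical
    weight: int      # points deducted at critical; half of that at warning


# Hypothetical rules; real thresholds would come from expert experience.
RULES = {
    "active_sessions": Rule(50, 200, 30),
    "log_sync_wait_ms": Rule(10, 50, 40),
    "tablespace_used_pct": Rule(85, 95, 30),
}


def health_score(values):
    """Return (score from 0 to 100, list of (indicator, severity) findings)."""
    score = 100
    findings = []
    for name, rule in RULES.items():
        value = values.get(name)
        if value is None:
            continue  # indicator not collected: no deduction in this sketch
        if value >= rule.critical:
            score -= rule.weight
            findings.append((name, "critical"))
        elif value >= rule.warning:
            score -= rule.weight // 2
            findings.append((name, "warning"))
    return max(score, 0), findings
```

The findings list matters as much as the score: it tells the operator which indicators pulled the score down, which is the part that ties the number back to an interpretable indicator.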
The fourth step is to use the collected data to complete various inspection tasks automatically. For the daily inspection, for example, the system analyzes the previous day's data at midnight every day, identifies risks and hidden dangers, and generates a daily inspection report. You can also define weekly or monthly tasks that analyze the recently collected data and generate inspection reports. Compared with the traditional approach of collecting and analyzing data by hand, this kind of inspection works on far more complete data, and the automated analysis algorithms are also more efficient.
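The daily inspection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not D-SMART's implementation: samples are assumed to be `(datetime, indicator_name, value)` tuples, the report format is invented, and scheduling the job at midnight is left to an external scheduler such as cron.

```python
from datetime import date, datetime, timedelta


def previous_day_window(today):
    """Return (start, end) datetimes covering the day before `today`."""
    start = datetime.combine(today - timedelta(days=1), datetime.min.time())
    return start, start + timedelta(days=1)


def daily_report(samples, thresholds, today):
    """Build a plain-text report of per-indicator peaks for the previous day.

    samples: iterable of (datetime, indicator_name, value).
    thresholds: dict mapping indicator_name to the peak value considered risky.
    """
    start, end = previous_day_window(today)
    peaks = {}
    for ts, name, value in samples:
        if start <= ts < end:  # keep only the previous day's samples
            peaks[name] = max(value, peaks.get(name, value))
    lines = [f"Daily inspection for {start.date().isoformat()}"]
    for name in sorted(peaks):
        limit = thresholds.get(name)
        status = "RISK" if limit is not None and peaks[name] > limit else "ok"
        lines.append(f"{name}: peak={peaks[name]} [{status}]")
    return "\n".join(lines)
```

A weekly or monthly task would follow the same shape with a wider time window.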
Using this data, you can also do a lot of valuable analysis work, such as capacity prediction, performance optimization, and special audits. At the same time, the standardized indicator system lets us build a digital channel of communication between first-line operations and second- and third-line operations. Through a complete indicator set, we can give third-line operations as comprehensive a panoramic view of database operation as possible, so that experts can truly know everything about a system without being on site.
A while ago, my mother, who is over 80 years old, insisted on celebrating my birthday. I have been running around for many years and had not celebrated a birthday for more than ten years. When I put the candles on the cake, I realized that with this birthday I had turned 54, and there was not much time left before retirement. While I can still do something, I want to digitize as much as possible of the experience accumulated over the years, so that it is preserved and I am left with no regrets.