Practical construction and application of an AI-driven event intelligent analysis system

With the widespread adoption of technologies such as virtualization and cloud computing, the scale of IT infrastructure in enterprise data centers has grown rapidly. The volume of hardware and software has grown with it, and failures have become more frequent. Front-line operation and maintenance personnel therefore urgently need more professional and powerful tools to meet these challenges.

In the daily operation and maintenance of data centers, basic monitoring systems and application monitoring systems are usually used to build the fault discovery mechanism. Thresholds are set in advance; when a software or hardware abnormality occurs, the affected indicators exceed their thresholds and trigger alarms. Operations experts are notified immediately and troubleshoot to keep the data center running stably. Such a monitoring mechanism detects and resolves potential problems in time, improving the reliability and availability of the data center.

The event intelligent analysis system is designed to convert alarms into faults and then analyze and handle them.

2. Overall architecture

1. Event intelligent analysis system architecture

The event intelligent analysis system creates a full-process fault handling pipeline of "fault identification, fault analysis, fault handling", and captures the experience of operation and maintenance experts in digital models. When a fault occurs, the system can automatically identify, analyze, and dispose of it, thereby shortening MTTR (Mean Time To Repair).

The event intelligent analysis system introduces AI technology to empower each of its modules. When operation and maintenance experts have not manually established a fault model, AI automatically establishes a fault for the alarm, analyzes it, and provides an analysis plan to assist the experts in analyzing the fault. AI empowerment reduces the modeling workload of operation and maintenance experts and also compensates for blind spots in their experience.

The following is the overall architecture diagram of the event intelligent analysis system:

[Figure: overall architecture diagram of the event intelligent analysis system]

The blue part is the functional module of the event intelligent analysis system, and the orange part is the peripheral system, providing corresponding data or interfaces.

2. Relationship with surrounding systems

Unified event platform: The alert system collects data from each monitoring system (basic monitoring, application monitoring, log monitoring), aggregates it, converts it into a unified format, and sends it to Kafka; the event intelligent analysis system reads all alarm data from Kafka.

Automation platform: Operation and maintenance experts create orchestrations and scripts on the automation platform in advance as the means of handling faults. When fault analysis finds the root cause, the fault can be handled by calling the automation platform interface: tasks are orchestrated and issued for execution, ultimately achieving automatic handling.

CMDB: During fault analysis, the object instance attributes and relationships stored in the CMDB can be used to logically associate alarm instances and disposal instances. In addition, when information about the objects surrounding the alarm object is displayed, the corresponding CMDB object instance data also needs to be associated.

ITSM: Provides work order data such as change orders and incident orders. When a failure occurs, these work order data need to be used for analysis.

Operation and maintenance big data platform: The big data platform provides data cleaning tools that help the event intelligent analysis platform clean the data it needs, and it provides technical support for massive data storage. It is the foundation for the data required by event intelligent analysis and also supplies the data for subsequent AI analysis, including object data from the CMDB, work order data from ITSM, and indicator and alarm data from the monitoring systems.

3. Detailed explanation of functions

1. Fault identification

The main function of fault identification is to establish fault models, which define the rules for converting alarms into faults. Defining fault models also yields a simple classification of faults, such as high CPU usage faults, high memory usage faults, high disk usage faults, and network delay faults. Simply put, a fault model specifies which alarms become a fault. The relationship between alarms and faults can be either 1:1 or n:1; only once the specific fault has been identified can subsequent analysis and handling proceed.

Alarm formatting:

The alarms received from the unified event platform are standardized into the format required by the event intelligent analysis system; some fields need to be supplemented by looking up object instance data in configuration management.

Fault model definition:

The definition of fault scenario model mainly includes basic information, fault rules, analysis and decision-making functions, etc. The specific description is as follows:

1) Basic information includes fault name, belonging object, fault type and fault description;

2) Fault rules can be divided into the following categories:

Keyword rules for alarm matching: fields such as summary and level in the alarm's JSON can be set as conditions, and multiple rules can be combined logically (AND, OR, NOT);
Time rules: immediate execution (a fault instance is generated as soon as an alarm is received), a fixed time window (alarms arriving within a period after the first alarm are aggregated into one fault instance), or a sliding time window (alarms arriving within a period after the most recent alarm are aggregated into one fault instance);
Location rules: the same machine, the same deployment unit, or the same physical subsystem; alarms that meet the conditions within the specified scope are aggregated into one fault instance.

3) Associate the specified analysis decision tree to determine the analysis plan.
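The three rule categories above can be illustrated with a minimal sketch. The names `Alarm`, `matches`, and `aggregate`, and the use of a `host` field as the "same machine" location key, are assumptions made for illustration and are not taken from the system described here.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    host: str      # location key used for the "same machine" rule (assumed field)
    summary: str   # alarm text that keyword rules are matched against
    level: str
    ts: float      # arrival time in seconds


def matches(alarm, all_of=(), any_of=(), none_of=()):
    """Keyword rule: AND (all_of), OR (any_of) and NOT (none_of) on the summary."""
    text = alarm.summary.lower()
    if not all(k.lower() in text for k in all_of):
        return False
    if any_of and not any(k.lower() in text for k in any_of):
        return False
    return not any(k.lower() in text for k in none_of)


def aggregate(alarms, window, sliding=False):
    """Group alarms per host into fault instances (lists of alarms).

    Fixed window: an alarm joins the open fault if it arrives within `window`
    seconds of that fault's first alarm. Sliding window: within `window`
    seconds of that fault's most recent alarm.
    """
    faults = []
    open_fault = {}  # host -> index of the fault still accepting alarms
    for alarm in sorted(alarms, key=lambda a: a.ts):
        idx = open_fault.get(alarm.host)
        if idx is not None:
            ref = faults[idx][-1].ts if sliding else faults[idx][0].ts
            if alarm.ts - ref <= window:
                faults[idx].append(alarm)
                continue
        faults.append([alarm])
        open_fault[alarm.host] = len(faults) - 1
    return faults


alarms = [Alarm("app-01", "CPU usage high", "critical", t) for t in (0, 50, 100)]
cpu_alarms = [a for a in alarms if matches(a, all_of=["cpu"], none_of=["test"])]
print([len(f) for f in aggregate(cpu_alarms, window=60)])
print([len(f) for f in aggregate(cpu_alarms, window=60, sliding=True)])
```

With a 60-second window, the fixed rule splits the three alarms into two fault instances (the third alarm arrives 100 seconds after the first), while the sliding rule keeps them in one because each alarm arrives within 60 seconds of the previous one.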

2. Fault analysis

Fault analysis examines and displays a fault from multiple angles, including related data display, topology display, the analysis decision tree, and knowledge base retrieval. It provides data support that helps operation and maintenance experts quickly find the root cause and handle the fault. The analysis decision tree can be associated with disposal.

Related information display:

1) Alarm analysis: alarm data from the last 48 hours for the physical subsystem corresponding to the alarm object and for other software and hardware objects associated with the deployment unit;

2) Indicator analysis: indicator data from the 2 hours before the failure for the physical subsystem corresponding to the alarm object and for other software and hardware objects associated with the deployment unit;

3) Change analysis: change work order records from the last 48 hours for the system corresponding to the alarm object;

4) Log analysis: application logs and system logs at specified paths for the alarm object and surrounding objects are analyzed and displayed;

5) Link analysis: with the transaction code as the core, the upstream and downstream link data of the transaction codes involved in the alarm object is analyzed and displayed.

Topological structure display:

Taking the physical subsystem as the dimension, the operation and maintenance objects involved in the entire system are displayed in a tree topology, and nodes with alarms are marked in red to alert operation and maintenance experts.

Specific examples are as follows:

[Figure: example of the topology display]

Analysis decision tree:

Data such as CMDB objects and relationships, alarms, indicators, changes, logs, and links are integrated into a customizable and editable analysis decision tree.

Operation and maintenance experts can preset the order in which data is analyzed and the judgment criteria, capturing their experience in the analysis decision tree as a digital model. When a failure occurs, the platform analyzes and judges the relevant data according to the preset analysis decision tree and finally provides the results.

The final leaf nodes of the analysis decision tree can be associated with disposal, ensuring the automated operation of the entire life cycle of "identification-analysis-disposal" of faults.
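A minimal sketch of such a decision tree follows, assuming each internal node holds a check over the collected fault data and each leaf holds a conclusion plus an optional disposal. The class names, the context keys (`recent_changes`, `cpu_pct`), and the disposal names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Leaf:
    conclusion: str
    disposal: Optional[str] = None  # hypothetical name of an automation task


@dataclass
class Node:
    question: str
    check: Callable[[dict], bool]   # judgment criterion preset by the expert
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def analyze(tree, context):
    """Walk the tree against the fault data; return the leaf and the path taken."""
    path = []
    node = tree
    while isinstance(node, Node):
        answer = node.check(context)
        path.append((node.question, answer))
        node = node.yes if answer else node.no
    return node, path


tree = Node(
    "Any change in the last 48 hours?",
    lambda ctx: ctx["recent_changes"] > 0,
    yes=Leaf("Fault likely introduced by a recent change", "rollback_change"),
    no=Node(
        "CPU usage above 90%?",
        lambda ctx: ctx["cpu_pct"] > 90,
        yes=Leaf("CPU saturation", "restart_service"),
        no=Leaf("No root cause found; escalate to an expert"),
    ),
)

leaf, path = analyze(tree, {"recent_changes": 0, "cpu_pct": 95})
print(leaf.conclusion, leaf.disposal, path)
```

Returning the path alongside the leaf keeps the reasoning visible, so an expert can review which checks led to the proposed disposal before it is executed.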

The specific examples are as follows:

[Figure: example of an analysis decision tree]

Knowledge base search:

The data center builds a knowledge base system based on the data on the operation and maintenance big data platform. It mainly collects text data such as emergency plans, incident order processing records, and operation and maintenance expert experience summaries.

When a fault occurs, the fault keyword is used to search the knowledge base by string matching, and the matching text is returned as expert experience. The chapter on AI empowerment discusses using text analysis for related searches, going beyond simple string matching.
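As a rough illustration of string-matching retrieval, the sketch below ranks knowledge base entries by how often the fault keyword occurs in them. The entry layout (`title` and `text` keys) and the function name are assumptions, not the actual knowledge base schema.

```python
def search_knowledge_base(entries, keyword):
    """Return entries whose text contains the keyword, most occurrences first."""
    needle = keyword.casefold()
    scored = [(entry["text"].casefold().count(needle), entry) for entry in entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for equal counts
    return [entry for count, entry in scored if count > 0]


entries = [
    {"title": "Tomcat emergency plan", "text": "If Tomcat hangs, restart Tomcat."},
    {"title": "Disk cleanup record", "text": "Removed old logs under /data."},
    {"title": "Incident review", "text": "Tomcat thread pool was exhausted."},
]
for hit in search_knowledge_base(entries, "tomcat"):
    print(hit["title"])
```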

3. Fault handling

Fault handling follows a pre-defined handling model, which mainly includes disposal decision-making, disposal orchestration, and disposal operations; it relies on the automation platform to orchestrate and execute disposal tasks.

1) Disposal orchestration: an organic combination of a series of disposal operations, needed because some disposals require, for example, isolating the operation and maintenance object and then restarting it. The scripts of the disposal operations are arranged in a process so that they are delivered to specific instance machines and executed in a predetermined order;

2) Disposal operation: a script (shell, Python) encapsulated so that it can be executed on the instance machine and called by a disposal orchestration. A disposal operation is the smallest unit of disposal, such as a Tomcat restart, isolation, or circuit-breaker script.

Fault handling models are mostly the digital form of operation and maintenance experts' experience or of emergency plan documents.

After the fault handling is completed, relevant records of the handling will be recorded according to the process for subsequent review and analysis.
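The relationship between orchestration, operations, and handling records can be sketched as follows. The `execute` callable stands in for a call to the automation platform; its signature, the step names, and the choice to stop at the first failed step are assumptions made for illustration.

```python
def run_orchestration(steps, target, execute):
    """Run disposal operations in order on one instance and record each result.

    `execute(step, target)` is a placeholder for the automation platform call
    and is expected to return a truthy value on success. Execution stops at the
    first failed step so that later steps do not run on an unknown state.
    """
    records = []
    for step in steps:
        ok = bool(execute(step, target))
        records.append({"step": step, "target": target, "ok": ok})
        if not ok:
            break
    return records


def fake_execute(step, target):
    # Stand-in executor for the example: pretends every step succeeds.
    print(f"running {step} on {target}")
    return True


records = run_orchestration(["isolate", "restart_tomcat", "rejoin"], "app-01", fake_execute)
print(records)
```

The returned records correspond to the handling log kept for subsequent review and analysis.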

4. AI Empowerment

AI empowerment aims to minimize manual configuration and reduce the workload of operation and maintenance experts across the whole "identification-analysis-disposal" process. It can also cover what the experts' experience does not, and during the initialization stage it can cover 100% of the alarm types that have occurred historically. The overall principle is to use AI, through automatic modeling, automatic aggregation, and automatic analysis, to build fault models and analysis plans for fault identification and analysis as a reference for operation and maintenance experts, while leaving the final judgment and control to the experts: the algorithm does 99% of the work, and manual review handles the last 1%.

1. Automatic modeling

Reviewing the definition of the fault model in Chapter 3-1, a fault model can be built once the alarm rules, time rules, and location rules are determined together with the analysis decision tree. The time and location rules can default to the most common choices, immediate execution and same machine, and the analysis decision tree can default to the most conventional health check.

Therefore, when building fault models for faults of the same type, the core issue is classifying faults by alarm content. We use keywords from the alarm content to determine the class and then establish a fault model for that class. The problem of automatic modeling thus reduces to finding keywords for alarms and building fault models from them.

The overall logic diagram is as follows:

[Figure: overall logic of automatic modeling]

Historical and real-time alarms are fed to the fault models one by one. If an existing fault model matches, processing of that alarm ends; if none matches, the algorithm computes a keyword for the alarm content, builds a fault model from that keyword, and adds the new model to the fault model list.

Operation and maintenance experts can generalize the fault model and put it online through manual confirmation.
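The match-or-create loop described above can be sketched as below. The model fields, their default values, and the pluggable `extract_keyword` callable are illustrative assumptions; the keyword extractor used in the example (the first two words of the alarm) is a deliberately naive placeholder and not the algorithm used by the system.

```python
def auto_model(alarm_summaries, models, extract_keyword):
    """Feed alarms through existing fault models; create a draft model on a miss."""
    for summary in alarm_summaries:
        text = summary.lower()
        if any(model["keyword"] in text for model in models):
            continue  # an existing fault model matches; this alarm is done
        models.append({
            "keyword": extract_keyword(summary).lower(),
            "time_rule": "immediate",          # default from the text above
            "location_rule": "same_machine",   # default from the text above
            "decision_tree": "health_check",   # default from the text above
            "confirmed": False,                # awaits manual confirmation
        })
    return models


def first_two_words(summary):
    # Naive placeholder extractor, for demonstration only.
    return " ".join(summary.split()[:2])


models = auto_model(
    ["CPU usage high on host-a", "cpu usage high on host-b", "Disk usage exceeds threshold"],
    [],
    first_two_words,
)
print([m["keyword"] for m in models])
```

Keeping `confirmed` false on newly created models reflects the manual confirmation step: a draft model only goes online after an expert approves it.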

This automatic modeling method has the following advantages:

1) Alarms are processed and fault models are built in real time, so models are updated very quickly;

2) Modeling does not rely on the experience of operation and maintenance experts and can be done directly from the alarm content;

3) All historical alarms can be covered, and new types of alarms can be handled in real time;

4) Operation and maintenance experts do not need to do a large amount of model configuration, which saves manpower; they only need to give the final manual confirmation, which improves efficiency while safeguarding the results.

Generally speaking, words that appear frequently in the document being analyzed but rarely across a large collection of documents are more likely to be keywords, so part of the alarm content is processed on this basis. The results are as follows:

[Figure: processing results for the sample alarm content]

Applying the above algorithm to part of the alarm content gives the following results:

[Figures: results of applying the algorithm to part of the alarm content]
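The keyword intuition described in this section, frequent in the alarm at hand but rare across all alarms, is the idea behind TF-IDF weighting. The article does not name the exact algorithm it uses, so the following is only a sketch of a TF-IDF style scorer under that assumption; the tokenizer and the smoothed IDF formula are choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase words of at least two characters; a simplification for the sketch.
    return re.findall(r"[a-z][a-z0-9_]+", text.lower())


def top_keyword(doc, corpus):
    """Return the term of `doc` with the highest TF-IDF score against `corpus`."""
    doc_sets = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(doc))
    total = sum(tf.values())
    n_docs = len(doc_sets)

    def score(term):
        df = sum(term in s for s in doc_sets)         # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency
        return (tf[term] / total) * idf

    return max(tf, key=score)


corpus = [
    "cpu usage high on host",
    "memory usage high on host",
    "disk usage high on host",
    "tomcat down: tomcat process not responding on host",
]
print(top_keyword(corpus[3], corpus))
```

In this toy corpus the word "tomcat" appears twice in the last alarm and in no other alarm, so it scores highest, while common words such as "host" are penalized by their high document frequency.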

2. Automatic fault clustering

Since Google released BERT (Bidirectional Encoder Representations from Transformers), it has topped the rankings and achieved very good results in a variety of text tasks, so it is used here to calculate text similarity, mainly between alarm content and fault descriptions.

We now build the clustering algorithm; the process diagram is as follows:

[Figure: process diagram of the fault clustering algorithm]

The specific steps are as follows:

1) Optionally, set fault descriptions manually as anchors for fault clustering; this step is not required and can be skipped;

2) Clean the alarm information and remove useless characters;

3) Use the BERT model to calculate the text similarity between the alarm summary and the information of every existing fault cluster (the two are judged similar if the similarity exceeds the threshold);

4) If they are similar, assign the alarm to that fault cluster;

5) If the similarity does not exceed the threshold for any cluster, create a new fault cluster for this alarm;

6) Update the fault cluster information in the list with the result of step 4 or step 5;

7) Process the next alarm, starting again from step 2.
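The steps above can be sketched as an online clustering loop. To keep the example runnable without a model, it substitutes a bag-of-words cosine similarity for the BERT similarity described in the article; the `similarity` parameter marks where a BERT-based scorer would be plugged in. The threshold value, the function names, and the choice to keep the first alarm as the cluster description are all assumptions.

```python
import math
import re
from collections import Counter


def clean(text):
    # Step 2: lowercase and replace useless characters, then split into tokens.
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def cosine(a, b):
    # Stand-in for BERT similarity: cosine over token counts.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def cluster_alarms(alarms, threshold=0.5, anchors=(), similarity=cosine):
    # Step 1 (optional): manually set fault descriptions seed the cluster list.
    clusters = [{"description": clean(a), "alarms": []} for a in anchors]
    for alarm in alarms:
        tokens = clean(alarm)
        best, best_sim = None, 0.0
        for cluster in clusters:  # step 3: compare with every existing cluster
            sim = similarity(tokens, cluster["description"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["alarms"].append(alarm)  # step 4: join the most similar cluster
        else:
            clusters.append({"description": tokens, "alarms": [alarm]})  # step 5
    return clusters


result = cluster_alarms([
    "CPU usage high on host-a",
    "CPU usage high on host-b",
    "Disk /data usage above 90%",
])
print([c["alarms"] for c in result])
```

Because clusters are updated as each alarm arrives, the same loop serves both historical replay and real-time alarms. A fuller version might also refresh each cluster's description as alarms join it, which this sketch does not do.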

This algorithm assigns alarms to different types of faults; if no existing type fits, a new type is created. Different fault types can then be given different analysis methods.

The advantages of this algorithm are as follows:

1) Through historical and real-time alarm data, fault classification is automatically performed without supervision, and there is no need to establish a fault model, saving manpower;

2) For real-time alarms, the fault clustering process ensures real-time online updates without the need for regular calculations and model updates;

3) Alarms automatically generate faults or are associated with existing ones, and the faults can in turn be linked to the corresponding emergency plans to obtain fault analysis plans and handling methods.

3. Automatically generate analysis plan

Recall from Chapter 3-2 (Fault analysis) that fault analysis mainly focuses on displaying information about the fault node and surrounding nodes, and that setting up the analysis decision tree requires considerable manual configuration.

With AI empowerment, emergency plans, alarm details, and the information displayed in fault analysis can be used as prompts, and an existing, well-performing large language model can automatically provide a fault analysis plan.

Considering private deployment, candidate large language models include ChatGLM2 and Llama 2. At the implementation stage, different large language models can be selected according to needs and hardware. In this article, LLM is used uniformly to refer to large language models.

The main process diagram is as follows:

[Figure: main process of automatic analysis plan generation]

After the fault is identified, the corresponding real-time alarm and the displayed related data are obtained and combined with the emergency plan data to form a combined prompt. The purpose of the prompt is to obtain better output when querying the LLM.

At the same time, emergency plans and historical alarm data are stored in the faiss vector database in batches, with the amount of text in each batch kept within the LLM token limit. When the combined prompt exceeds the LLM token limit, the prompt is used to query the faiss vector database for the texts with the most similar vectors; these texts, which stay within the token limit, are then sent to the LLM, and the returned result is the fault analysis plan (in text form).
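The retrieval step can be illustrated with a small, dependency-free sketch. It replaces faiss and a real embedding model with a toy bag-of-words vector and a brute-force cosine search, and it approximates token counting by counting whitespace-separated words; all function names and the prompt wording are assumptions. No LLM is called; the sketch only assembles a prompt that fits a token budget.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def count_tokens(text):
    # Crude proxy for the LLM tokenizer: whitespace-separated words.
    return len(text.split())


def build_prompt(alarm, context, chunks, token_limit):
    """Add the most similar plan chunks to the prompt until the budget is used up."""
    header = f"Alarm: {alarm}\nContext: {context}\nRelevant plan excerpts:\n"
    question = "\nSuggest a fault analysis plan."
    budget = token_limit - count_tokens(header) - count_tokens(question)
    query = embed(alarm + " " + context)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    picked = []
    for chunk in ranked:
        cost = count_tokens(chunk)
        if cost > budget:
            break
        picked.append(chunk)
        budget -= cost
    return header + "\n".join(picked) + question


chunks = [
    "Tomcat restart procedure: stop service, clear temp, start service",
    "Disk cleanup procedure: remove old logs",
    "Network failover steps for core switch",
]
print(build_prompt("tomcat process not responding", "host app-01 cpu normal", chunks, 30))
```

In a real deployment the brute-force search would be replaced by a vector index query and the word count by the model's own tokenizer, but the budget logic stays the same: rank by similarity, then stop adding text before the limit is exceeded.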

For specific effects, please refer to the picture below:

[Figure: example of a generated fault analysis plan]

4. Emergency plan retrieval

As an essential manual in the industry, the emergency plan comprehensively records the analysis and disposal steps for the corresponding faults of all systems and all operation and maintenance objects, making it very reliable text data. Emergency plan content is used in many places in this system, so retrieval capabilities for emergency plans are needed, and the knowledge base system can serve as the retrieval base.

The system can provide text retrieval by string matching, keyword retrieval after text analysis, and semantic-level vector similarity retrieval. Whichever method is used, the goal is to obtain the emergency plan text the system requires.

The above search methods can all be processed using the technical means mentioned above, and will not be described again here.

5. Conclusion

The event intelligent analysis system exists to help operation and maintenance experts operate and maintain each system. It provides a series of modeling methods that let experts capture their experience in digital models. As the amount of data (fault samples and operation and maintenance data) grows, AI algorithms can reduce the experts' workload and assist their analytical decisions. Ultimately, we hope to reach a state where operation and maintenance is automated without expert intervention, that is, faults are "self-discovered and maintenance-free".
