


Dragon Lizard System Operation and Maintenance Alliance: How Kindling-OriginX integrates DeepFlow's data to enhance the explanation of network faults
Editor's note: In 2023, the Dragon Lizard Community officially established the System Operation and Maintenance Alliance, co-sponsored by 12 organizations including the Academy of Information and Communications Technology, Alibaba Cloud, ZTE, Fudan University, Tsinghua University, Zhejiang University, Yunguan Qiuhao, Chengyun Digital, Yunshan Network, Inspur Information, Tongxin Software and China Unicom Software Institute. This article is reproduced from Yunguan Qiuhao and introduces how Kindling-OriginX, a member of the System Operation and Maintenance Alliance, combines DeepFlow's complete network data capabilities to automatically generate explainable fault root cause reports.
DeepFlow is an open source project that uses eBPF to provide deep observability for complex cloud infrastructure and cloud native applications. With eBPF, DeepFlow collects fine-grained distributed tracing data together with network and application performance indicators, covering the full link and including rich TCP performance indicators. These capabilities give professional users and network experts powerful support for troubleshooting and locating problems.
Kindling-OriginX is a fault root cause derivation product. Its goal is to give users an interpretable fault root cause report, so that they can directly understand the root cause and use the accompanying reasoning process to verify its accuracy. Network faults are hard to explain simply: it is not enough to tell users which network segment has a problem. Users need more indicators and illustrations to understand what fault occurred on the network and where it occurred.
This article introduces Kindling-OriginX, which combines DeepFlow's complete network data capabilities to automatically generate interpretable fault root cause reports.
Simulating a network fault with soma-chaos
- Inject a simulated 200 ms network delay fault into seat-service.
Next, we first use DeepFlow to track down the 200 ms network fault manually, step by step.
The simplest manual troubleshooting process
Step 1: Use the tracing system to narrow the scope
In a microservice environment, when an interface has a performance problem, the first step is to use the tracing system to check which hop is slow and how it is behaving.
With the tracing system, users can locate the specific Traces. Analysis of the Trace shows that seat-service took a long time to execute and that a slow call to config-service occurred at the same time. In this case, the associated network indicators help pinpoint the source of the network problem.
Step 2: Use the DeepFlow flame graph to determine which network segment the fault occurs in
Enter a trace ID representative of the fault into DeepFlow's flame graph view to see how the Trace behaves at the network level, then analyze the flame graph in depth. With a good understanding of flame graphs and expert network knowledge, you can work out manually from the flame graph that the fault occurred on the caller side, that is, seat-service, in the interval between the syscall and the packet reaching the network card. In other words, the problem lies in the container network segment, which is consistent with the injected fault.
(Picture/DeepFlow network flame graph)
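The reading of the flame graph described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the capture point names, the `slowest_segment` helper and the sample durations are all hypothetical, and none of them come from DeepFlow's API. The idea is that each capture point closer to the caller observes everything the next point observes plus the hop in between, so the difference between neighbours approximates the cost of that hop.

```python
from typing import Dict, List, Tuple

# Hypothetical capture points along one request path, ordered from caller to
# callee. The names are illustrative and are not DeepFlow identifiers.
CAPTURE_POINTS: List[str] = [
    "client_syscall",
    "client_pod_nic",
    "client_node_nic",
    "server_node_nic",
    "server_pod_nic",
    "server_syscall",
]


def slowest_segment(durations_ms: Dict[str, float]) -> Tuple[str, float]:
    """Return the hop between adjacent capture points that adds the most latency.

    durations_ms maps each capture point to the request-response time
    observed at that point.
    """
    worst_hop, worst_cost = "", float("-inf")
    for outer, inner in zip(CAPTURE_POINTS, CAPTURE_POINTS[1:]):
        cost = durations_ms[outer] - durations_ms[inner]
        if cost > worst_cost:
            worst_hop, worst_cost = f"{outer} -> {inner}", cost
    return worst_hop, worst_cost


# Made-up durations shaped like the case in this article.
sample = {
    "client_syscall": 206.0,
    "client_pod_nic": 5.0,
    "client_node_nic": 4.5,
    "server_node_nic": 4.0,
    "server_pod_nic": 3.5,
    "server_syscall": 3.0,
}
hop, cost = slowest_segment(sample)
print(hop, cost)
```

With these made-up numbers the largest gap sits between the caller's syscall and its pod network card, which is the segment the manual analysis points to.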
Step 3: Determine which network indicators are abnormal in the container network
Based on troubleshooting experience, users need to check the network indicators of the seat-service and config-service pods, so they switch to DeepFlow's pod-level network indicator page. There they can see a sudden 200 ms jump in connection establishment latency and a corresponding jump in the RTT indicator.
(Figure/DeepFlow-pod level monitoring indicators)
(Figure/DeepFlow-pod level monitoring indicators)
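Spotting such a jump by eye on a chart can also be approximated in code. The sketch below is a deliberately simple detector, not the method used by DeepFlow or Kindling-OriginX: the function name, the window size, the 100 ms threshold and the sample series are all assumptions chosen for illustration. It reports the first index at which every sample in the following window sits well above every sample in the preceding window.

```python
from typing import List, Optional


def find_step_change(samples_ms: List[float], window: int = 3,
                     min_jump_ms: float = 100.0) -> Optional[int]:
    """Return the first index where the series shifts upward and stays there.

    An index qualifies when the smallest of the `window` samples starting at
    it exceeds the largest of the `window` samples before it by at least
    `min_jump_ms`. Returns None when no such index exists.
    """
    for i in range(window, len(samples_ms) - window + 1):
        before = max(samples_ms[i - window:i])
        after = min(samples_ms[i:i + window])
        if after - before >= min_jump_ms:
            return i
    return None


# Made-up pod-level RTT samples in milliseconds.
rtt = [1.2, 1.1, 1.3, 1.2, 201.4, 201.1, 201.3, 201.2]
print(find_step_change(rtt))
```

Comparing the minimum after the candidate point with the maximum before it keeps a single spike from being reported as a sustained shift.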
Step 4: Eliminate possible interference factors
Experience shows that when a host's CPU or bandwidth is saturated, the virtual network also suffers packet loss and delay. It is therefore necessary to check the CPU and node-level bandwidth of the nodes hosting seat-service and config-service at the time of the fault, to make sure that node-level resources were not saturated.
Use a k8s command to confirm which nodes the two pods are on, then check the corresponding indicators on DeepFlow's node indicator monitoring page. The bps, pps and other indicators of the nodes are found to be within a reasonable range.
(Picture/Find the node where the pod is located through k8s command)
(Figure/DeepFlow-node level monitoring indicators (client))
(Figure/DeepFlow-node level monitoring indicators (server))
Since the node-level network indicators showed no obvious abnormality, the final conclusion is that the pod-level RTT indicator of seat-service was abnormal.
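The elimination step above amounts to a threshold check on each node. A minimal sketch follows, assuming a hypothetical `NodeSample` record and illustrative limits of 90% CPU and 80% link utilization. The article does not state which thresholds count as saturated, so these values and the sample figures are assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    name: str
    cpu_utilisation: float    # fraction of CPU in use, 0.0 to 1.0
    bandwidth_bps: float      # observed throughput in bits per second
    link_capacity_bps: float  # capacity of the node's network link


def is_saturated(node: NodeSample, cpu_limit: float = 0.9,
                 bandwidth_limit: float = 0.8) -> bool:
    """Return True when CPU or link utilization reaches its assumed limit."""
    bandwidth_ratio = node.bandwidth_bps / node.link_capacity_bps
    return node.cpu_utilisation >= cpu_limit or bandwidth_ratio >= bandwidth_limit


# Made-up samples for the nodes hosting the client and server pods.
nodes = [
    NodeSample("client-node", 0.35, 1.2e8, 1.0e10),
    NodeSample("server-node", 0.42, 0.9e8, 1.0e10),
]
saturated = [n.name for n in nodes if is_saturated(n)]
print(saturated or "no node saturated")
```

When the list comes back empty, node-level saturation can be ruled out and attention returns to the pod-level indicators.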
Manual Troubleshooting Summary
After this series of steps, the user can eventually locate the fault, but the process places the following demands on the user:
- Very rich network knowledge
- In-depth understanding of network flame graphs
- Proficiency with the related tools
How Kindling-OriginX combines DeepFlow metrics to produce explainable fault reports
Kindling-OriginX processes and presents DeepFlow data according to different user needs and usage scenarios.
Mirroring the simplest manual troubleshooting process above, troubleshooting with Kindling-OriginX proceeds as follows:
Automatic analysis of each Trace
For the fault at hand, each Trace is analyzed automatically, and the listed Traces are grouped by fault node. The travel-service group is caused by a cascading fault. This article does not focus on cascading faults; if you are interested, see the article on how to deal with microservice cascading faults.
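Grouping Traces by fault node can be pictured as a simple bucketing step once each Trace has a verdict. The sketch below assumes a hypothetical `TraceVerdict` record and made-up trace IDs. It does not show how Kindling-OriginX decides which node is at fault, only how the results might be grouped for display.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TraceVerdict(NamedTuple):
    trace_id: str    # identifier of the analyzed Trace
    fault_node: str  # service judged to be the fault node for this Trace


def group_by_fault_node(verdicts: Iterable[TraceVerdict]) -> Dict[str, List[str]]:
    """Bucket trace IDs under the service identified as their fault node."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for verdict in verdicts:
        groups[verdict.fault_node].append(verdict.trace_id)
    return dict(groups)


# Made-up verdicts shaped like the case in this article.
verdicts = [
    TraceVerdict("t1", "seat-service"),
    TraceVerdict("t2", "seat-service"),
    TraceVerdict("t3", "travel-service"),
]
groups = group_by_fault_node(verdicts)
print(groups)
```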
Review the fault root cause report for the fault node seat-service
Fault root cause conclusion:
For the sub-request 10.244.1.254:50332->10.244.5.79:15679, the RTT indicator shows a delay of about 200 ms.
Fault reasoning verification
Since Kindling-OriginX has already identified that the problem lies in the network path where seat-service calls config-service, it does not need to present all of the data in DeepFlow's flame graph to the user. It only needs to query DeepFlow for the network data related to the seat-service call to config-service.
From DeepFlow's data for the seat-service call to config-service, it automatically determines that the container network of the client pod has a delay of 201 ms.
Kindling-OriginX then emulates expert analysis and further correlates DeepFlow's retransmission and RTT indicators to determine what exactly causes the delay when seat-service calls config-service.
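One way to picture this correlation is as a small rule table over the two indicators. The sketch below is a simplified heuristic and not Kindling-OriginX's actual reasoning engine: the function name, thresholds and labels are assumptions. It rests on the common observation that added path latency raises RTT without necessarily raising retransmissions, while packet loss shows up as retransmissions.

```python
def classify_delay(rtt_delta_ms: float, retrans_ratio_delta: float,
                   rtt_threshold_ms: float = 50.0,
                   retrans_threshold: float = 0.01) -> str:
    """Label the likely cause from changes in RTT and retransmission ratio.

    rtt_delta_ms is the RTT increase over baseline in milliseconds.
    retrans_ratio_delta is the increase in the share of retransmitted packets.
    Both thresholds are illustrative assumptions.
    """
    rtt_up = rtt_delta_ms >= rtt_threshold_ms
    retrans_up = retrans_ratio_delta >= retrans_threshold
    if rtt_up and retrans_up:
        return "added latency with packet loss"
    if rtt_up:
        return "added latency"
    if retrans_up:
        return "packet loss"
    return "no network anomaly"


# Made-up deltas shaped like the 200 ms delay injected in this article.
print(classify_delay(200.0, 0.0))
```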
Kindling-OriginX will also integrate the node’s CPU utilization and bandwidth indicators to eliminate interference factors.
Kindling-OriginX completes the entire fault reasoning in a one-page report, and each data source is trustworthy and verifiable.
Summary
Kindling-OriginX and DeepFlow both use eBPF technology and aim to provide flexible and efficient solutions for users with different needs in different scenarios. We also look forward to seeing the emergence of more domestic products with complementary capabilities in the future.
DeepFlow can provide very complete basic data of the full-link network, making cloud native applications deeply observable, and is very useful for troubleshooting network problems.
Kindling-OriginX combines the North Star troubleshooting indicators it collects through eBPF with AI algorithms and expert experience to build a fault reasoning engine that provides users with interpretable root cause reports.