Editor's note: In 2023, the Dragon Lizard Community officially established the System Operation and Maintenance Alliance, co-sponsored by 12 organizations including the China Academy of Information and Communications Technology, Alibaba Cloud, ZTE, Fudan University, Tsinghua University, Zhejiang University, Yunguan Qiuhao, Chengyun Digital, Yunshan Networks, Inspur Information, Tongxin Software, and the China Unicom Software Institute. This article is reproduced from Yunguan Qiuhao and introduces how Kindling-OriginX, a member of the System Operation and Maintenance Alliance, combines DeepFlow's complete network data to automatically generate explainable fault root cause reports.
DeepFlow is an open source project that leverages eBPF technology to provide observability for complex cloud infrastructure and cloud-native applications. Using eBPF, DeepFlow collects fine-grained distributed tracing data as well as network and application performance metrics, with full-link coverage and rich TCP performance metrics. These capabilities give professional users and network experts powerful support for troubleshooting and problem localization.
Kindling-OriginX is a fault root cause inference product. Its goal is to give users an interpretable fault root cause report, so that they can understand the root cause directly and use the accompanying reasoning process to verify its accuracy. Network faults are hard to explain concisely: it is not enough to simply tell users which network segment has a problem. Users need additional metrics and illustrations to understand what kind of fault occurred on the network and where it occurred.
This article introduces how Kindling-OriginX combines DeepFlow's complete network data to automatically generate interpretable fault root cause reports.
A simulated network fault that adds 200ms of delay is injected into seat-service.
First, let's look at how to identify this 200ms network fault with DeepFlow and what actions that requires.
Step 1: Use the Trace system to narrow the scope
In a microservice environment, when a performance problem occurs on an API, the first step is to use the tracing system to determine which part of the call chain is slow and what its behavior looks like.
With the tracing system, users can precisely locate the relevant Traces. Analysis of the Trace shows that seat-service took a long time to execute and that a long config-service call occurred within it. In this case, correlated network metrics help pinpoint the source of the network problem.
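As a rough illustration of this step, the following minimal Python sketch scans the spans of a single trace and flags the slow calls. The span schema (span_id, parent_id, service, duration_ms) is a hypothetical simplification for illustration, not the data format of any particular tracing backend.

```python
# Minimal sketch: find the slowest spans inside one trace.
# The span dictionaries below use a hypothetical, simplified schema.
from typing import Dict, List

def slow_spans(spans: List[Dict], threshold_ms: float = 100.0) -> List[Dict]:
    """Return spans whose total duration or self time (duration minus child time) exceeds threshold_ms."""
    child_time: Dict[str, float] = {}
    for s in spans:
        if s.get("parent_id") is not None:
            child_time[s["parent_id"]] = child_time.get(s["parent_id"], 0.0) + s["duration_ms"]
    suspects = []
    for s in spans:
        self_time = s["duration_ms"] - child_time.get(s["span_id"], 0.0)
        if self_time >= threshold_ms or s["duration_ms"] >= threshold_ms:
            suspects.append({**s, "self_time_ms": self_time})
    return sorted(suspects, key=lambda s: s["duration_ms"], reverse=True)

# Example: seat-service is slow, and its call to config-service is also slow.
trace = [
    {"span_id": "a", "parent_id": None, "service": "travel-service", "duration_ms": 230.0},
    {"span_id": "b", "parent_id": "a",  "service": "seat-service",   "duration_ms": 225.0},
    {"span_id": "c", "parent_id": "b",  "service": "config-service", "duration_ms": 210.0},
]
for s in slow_spans(trace):
    print(s["service"], round(s["duration_ms"]), "ms total,", round(s["self_time_ms"]), "ms self")
```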
Step 2: Use the DeepFlow flame graph to determine in which network segment the fault occurred
Enter the traceid of a representative faulty request into DeepFlow's flame graph view to see how the Trace behaves at the network level, then analyze the flame graph in depth. With a good understanding of flame graphs and expert-level network knowledge, manual analysis of the flame graph shows that the fault occurred on the caller side, i.e. seat-service, and that the delay arose in the period between the syscall and the packet reaching the network card. In other words, the problem lies in the container network segment (which is consistent with the injected fault).
(Figure/DeepFlow network flame graph)
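As a sketch of how this lookup could be automated rather than done by hand in the UI, the snippet below queries DeepFlow's SQL querier for the spans of one trace. The endpoint, database, table, and field names (flow_log, l7_flow_log, trace_id, and so on) are assumptions loosely based on DeepFlow's querier interface and may differ between versions; treat this as an outline, not a verified integration.

```python
# Sketch: pull the spans of one trace from DeepFlow's SQL querier.
# Endpoint, database, table, field names, and response shape are assumptions.
import requests

DEEPFLOW_QUERIER = "http://deepflow-server:20416/v1/query/"   # assumed address
TRACE_ID = "c0ffee1234"                                        # the faulty trace (placeholder)

sql = (
    "SELECT l7_protocol, request_resource, response_duration, "
    "auto_service_0, auto_service_1 "
    "FROM l7_flow_log "
    f"WHERE trace_id = '{TRACE_ID}'"
)
resp = requests.post(DEEPFLOW_QUERIER, data={"db": "flow_log", "sql": sql}, timeout=10)
resp.raise_for_status()
# Assumed response layout: {"result": {"columns": [...], "values": [...]}}
for row in resp.json().get("result", {}).get("values", []):
    print(row)
```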
Step 3: Determine which network metrics are abnormal in the container network
Based on troubleshooting experience, the next step is to check the network metrics of the seat-service and config-service pods. The user jumps to DeepFlow's pod-level network metrics page, where a sudden jump of about 200ms in connection-establishment delay and a corresponding jump in the RTT metric can be seen.
(Figure/DeepFlow pod-level monitoring metrics)
(Figure/DeepFlow pod-level monitoring metrics)
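To make "a sudden jump in the RTT metric" concrete, here is a minimal, generic sketch that flags points in a metric series that deviate sharply from the recent baseline. It is not DeepFlow or Kindling-OriginX code, just an illustration of the kind of check an engineer (or a tool) performs on the pod-level RTT series; the window size and thresholds are arbitrary.

```python
# Generic sketch: flag sudden jumps in a metric series (e.g. pod-level RTT in ms).
# Illustrative logic only; window and thresholds are arbitrary.
from statistics import mean, pstdev

def sudden_jumps(series, window=10, sigma=3.0, min_delta_ms=50.0):
    """Return indices where the value jumps well above the recent baseline."""
    jumps = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sd = mean(baseline), pstdev(baseline)
        if series[i] - mu > max(sigma * sd, min_delta_ms):
            jumps.append(i)
    return jumps

# RTT hovers around 1-2 ms, then jumps to ~200 ms after the fault is injected.
rtt_ms = [1.2, 1.4, 1.1, 1.3, 1.5, 1.2, 1.4, 1.3, 1.2, 1.4, 201.0, 200.5, 199.8]
print(sudden_jumps(rtt_ms))   # -> [10]  (the first sample after the injected fault)
```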
Step 4: Eliminate possible interference factors
Experience shows that when a host's CPU or bandwidth is saturated, packet loss and delay can also appear in the virtual network, so it is necessary to check the CPU and bandwidth of the nodes where seat-service and config-service were running at the time, to confirm that node-level resources were not saturated.
Use kubectl to confirm which nodes the two pods are running on, then go to DeepFlow's node-level metrics page and check the corresponding metrics. The nodes' bps, pps, and other metrics are all within a reasonable range.
(Figure/Finding the node where the pod is located via a k8s command)
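For readers without the screenshot at hand, roughly the same lookup can be done programmatically with the Kubernetes Python client; the pod names and namespace below are illustrative.

```python
# Sketch: find which node each pod runs on, equivalent to `kubectl get pod -o wide`.
# Pod names and namespace are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

for name in ("seat-service-0", "config-service-0"):        # hypothetical pod names
    pod = v1.read_namespaced_pod(name=name, namespace="default")
    print(f"{name} -> node {pod.spec.node_name}, podIP {pod.status.pod_ip}")
```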
(Figure/DeepFlow node-level monitoring metrics (client))
(Figure/DeepFlow node-level monitoring metrics (server))
Since there was no obvious abnormality in the node-level network metrics, the final conclusion is that the pod-level RTT metric of seat-service is abnormal.
Manual Troubleshooting Summary
After this series of troubleshooting steps, the user can indeed locate the fault, but doing so demands the following from the user:
Very rich network knowledge
An in-depth understanding of network flame graphs
Proficiency with the related tools
Kindling-OriginX: based on different user needs and usage scenarios, Kindling-OriginX processes and presents DeepFlow's data.
Compared with the streamlined manual troubleshooting process above, the troubleshooting process with Kindling-OriginX is as follows:
Automatic analysis of each Trace
For the fault at hand, each Trace is automatically analyzed, and the listed Traces are grouped by fault node. The travel-service entries are caused by cascading faults; cascading faults are not the focus of this article, and interested readers can refer to material on how to handle cascading faults in microservices.
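As a rough sketch of what "automatically analyzing each Trace and grouping by fault node" could look like (illustrative logic only, not Kindling-OriginX's actual implementation; the trace records and the is_cascading flag are hypothetical):

```python
# Illustrative sketch: group analyzed traces by the node (service) identified as faulty.
# The trace records and the is_cascading flag are hypothetical simplifications.
from collections import defaultdict

analyzed_traces = [
    {"trace_id": "t1", "fault_node": "seat-service",   "is_cascading": False},
    {"trace_id": "t2", "fault_node": "seat-service",   "is_cascading": False},
    {"trace_id": "t3", "fault_node": "travel-service", "is_cascading": True},   # cascading fault
]

groups = defaultdict(list)
for t in analyzed_traces:
    groups[t["fault_node"]].append(t)

for node, traces in groups.items():
    cascading = all(t["is_cascading"] for t in traces)
    print(f"{node}: {len(traces)} trace(s){' (cascading)' if cascading else ''}")
```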
Review the fault root cause report where the fault node is seat-service
Fault root cause conclusion:
For the sub-request 10.244.1.254:50332 -> 10.244.5.79:15679, the RTT metric shows a delay of about 200ms.
Fault reasoning verification
Since Kindling-OriginX has already identified that the problem lies in the network on the call from seat-service to config-service, it does not need to present all of DeepFlow's flame graph data to the user. It only needs to interface with DeepFlow and fetch the data relevant to that seat-service-to-config-service network call.
Using DeepFlow's data for the seat-service call to config-service, it is automatically determined that the container network on the client pod side contributes a delay of 201ms.
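As an illustration of how such a number could be derived, the sketch below compares the timestamps of the same request observed at different capture points on the client side. The tap_side tags and field names are assumptions loosely modeled on DeepFlow's observation points, not an exact schema.

```python
# Sketch: estimate the client-side container-network delay from DeepFlow-style spans.
# The tap_side values and fields are assumptions for illustration, not an exact schema.
def client_container_network_delay_ms(spans):
    """Delay between the client process issuing the request (syscall side)
    and the request being observed at the client node's NIC."""
    by_side = {s["tap_side"]: s for s in spans}
    process_side = by_side.get("c-p")    # client process / syscall observation point (assumed tag)
    node_nic_side = by_side.get("c-nd")  # client node NIC observation point (assumed tag)
    if not process_side or not node_nic_side:
        return None
    return node_nic_side["start_time_ms"] - process_side["start_time_ms"]

spans = [
    {"tap_side": "c-p",  "start_time_ms": 1000.0},   # request leaves the client process
    {"tap_side": "c-nd", "start_time_ms": 1201.0},   # request seen at the client node NIC
    {"tap_side": "s-nd", "start_time_ms": 1201.4},   # request seen at the server node NIC
]
print(client_container_network_delay_ms(spans), "ms")   # -> 201.0 ms
```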
Kindling-OriginX emulates expert analysis and further correlates DeepFlow's retransmission and RTT metrics to determine what exactly causes the delay in the call from seat-service to config-service.
Kindling-OriginX also incorporates the nodes' CPU utilization and bandwidth metrics to rule out interference factors.
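A toy sketch of this kind of rule-based correlation is shown below; the thresholds, inputs, and ordering of the checks are invented for illustration, and the real reasoning engine combines far more signals.

```python
# Toy rule-based correlation sketch; thresholds and inputs are illustrative only.
def classify_network_delay(rtt_ms, retrans_per_s, node_cpu_util, node_bw_util,
                           rtt_threshold_ms=100.0, retrans_threshold=10.0, util_threshold=0.9):
    """Attribute an observed call delay to a likely network-level cause."""
    if node_cpu_util >= util_threshold or node_bw_util >= util_threshold:
        return "node resources saturated: virtual-network delay/drops likely a side effect"
    if retrans_per_s >= retrans_threshold:
        return "packet loss / retransmission dominates the delay"
    if rtt_ms >= rtt_threshold_ms:
        return "elevated RTT on the container network dominates the delay"
    return "no obvious network cause; look elsewhere"

# Matches the scenario in this article: node resources healthy, no retransmissions, RTT ~200ms.
print(classify_network_delay(rtt_ms=200.0, retrans_per_s=0.0,
                             node_cpu_util=0.35, node_bw_util=0.20))
```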
Kindling-OriginX completes the entire fault reasoning in a single-page report, and every data source in it is traceable and verifiable.
Kindling-OriginX and DeepFlow are both built on eBPF technology, and both aim to provide flexible and efficient solutions for users with different needs in different scenarios. We also look forward to more domestic products with complementary capabilities emerging in the future.
DeepFlow provides very complete full-link network data, giving cloud-native applications deep observability, and is very useful for troubleshooting network problems.
Kindling-OriginX uses eBPF to collect North Star troubleshooting metrics, and combines AI algorithms with expert experience to build a fault reasoning engine that delivers interpretable root cause reports to users.