Editor's note: In 2023, the Dragon Lizard Community officially established the System Operation and Maintenance Alliance, co-sponsored by 12 organizations including the China Academy of Information and Communications Technology, Alibaba Cloud, ZTE, Fudan University, Tsinghua University, Zhejiang University, Yunguan Qiuhao, Chengyun Digital, Yunshan Networks, Inspur Information, Tongxin Software, and the China Unicom Software Institute. This article is reproduced from Yunguan Qiuhao and introduces how Kindling-OriginX, a member of the System Operation and Maintenance Alliance, combines DeepFlow's complete network data to automatically generate explainable fault root cause reports.
DeepFlow is an open source project that leverages eBPF technology to provide observability for complex cloud infrastructure and cloud-native applications. Using eBPF, DeepFlow collects fine-grained distributed tracing data as well as network and application performance metrics, with full-link coverage and rich TCP performance metrics. These capabilities give professional users and network experts powerful support for troubleshooting and problem localization.
Kindling-OriginX is a fault root cause inference product. Its goal is to give users an interpretable fault root cause report, so that they can understand the root cause directly and use the accompanying reasoning process to verify its accuracy. Network faults are hard to explain concisely: it is not enough to simply tell users which network segment has a problem. Users need additional metrics and illustrations to understand what kind of fault occurred on the network and where it occurred.
This article introduces how Kindling-OriginX combines DeepFlow's complete network data to automatically generate interpretable fault root cause reports.
A simulated network fault that adds 200ms of delay is injected into seat-service.
First, let's look at how to identify this 200ms network fault with DeepFlow and what actions that requires.
Step 1: Use the Trace system to narrow the scope
In a microservice environment, when a performance problem occurs on an API, the first step is to use the tracing system to determine which part of the call chain is slow and what its behavior looks like.
With the tracing system, users can precisely locate the relevant Traces. Analysis of the Trace shows that seat-service took a long time to execute and that a long config-service call occurred within it. In this case, correlated network metrics help pinpoint the source of the network problem.
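As a rough illustration of this step, the following minimal Python sketch scans the spans of a single trace and flags the slow calls. The span schema (span_id, parent_id, service, duration_ms) is a hypothetical simplification for illustration, not the data format of any particular tracing backend.

```python
# Minimal sketch: find the slowest spans inside one trace.
# The span dictionaries below use a hypothetical, simplified schema.
from typing import Dict, List

def slow_spans(spans: List[Dict], threshold_ms: float = 100.0) -> List[Dict]:
    """Return spans whose total duration or self time (duration minus child time) exceeds threshold_ms."""
    child_time: Dict[str, float] = {}
    for s in spans:
        if s.get("parent_id") is not None:
            child_time[s["parent_id"]] = child_time.get(s["parent_id"], 0.0) + s["duration_ms"]
    suspects = []
    for s in spans:
        self_time = s["duration_ms"] - child_time.get(s["span_id"], 0.0)
        if self_time >= threshold_ms or s["duration_ms"] >= threshold_ms:
            suspects.append({**s, "self_time_ms": self_time})
    return sorted(suspects, key=lambda s: s["duration_ms"], reverse=True)

# Example: seat-service is slow, and its call to config-service is also slow.
trace = [
    {"span_id": "a", "parent_id": None, "service": "travel-service", "duration_ms": 230.0},
    {"span_id": "b", "parent_id": "a",  "service": "seat-service",   "duration_ms": 225.0},
    {"span_id": "c", "parent_id": "b",  "service": "config-service", "duration_ms": 210.0},
]
for s in slow_spans(trace):
    print(s["service"], round(s["duration_ms"]), "ms total,", round(s["self_time_ms"]), "ms self")
```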
Step 2: Use the DeepFlow flame graph to determine in which network segment the fault occurred
Enter the traceid of a representative faulty request into DeepFlow's flame graph view to see how the Trace behaves at the network level, then analyze the flame graph in depth. With a good understanding of flame graphs and expert-level network knowledge, manual analysis of the flame graph shows that the fault occurred on the caller side, i.e. seat-service, and that the delay arose in the period between the syscall and the packet reaching the network card. In other words, the problem lies in the container network segment (which is consistent with the injected fault).
(Figure/DeepFlow network flame graph)
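As a sketch of how this lookup could be automated rather than done by hand in the UI, the snippet below queries DeepFlow's SQL querier for the spans of one trace. The endpoint, database, table, and field names (flow_log, l7_flow_log, trace_id, and so on) are assumptions loosely based on DeepFlow's querier interface and may differ between versions; treat this as an outline, not a verified integration.

```python
# Sketch: pull the spans of one trace from DeepFlow's SQL querier.
# Endpoint, database, table, field names, and response shape are assumptions.
import requests

DEEPFLOW_QUERIER = "http://deepflow-server:20416/v1/query/"   # assumed address
TRACE_ID = "c0ffee1234"                                        # the faulty trace (placeholder)

sql = (
    "SELECT l7_protocol, request_resource, response_duration, "
    "auto_service_0, auto_service_1 "
    "FROM l7_flow_log "
    f"WHERE trace_id = '{TRACE_ID}'"
)
resp = requests.post(DEEPFLOW_QUERIER, data={"db": "flow_log", "sql": sql}, timeout=10)
resp.raise_for_status()
# Assumed response layout: {"result": {"columns": [...], "values": [...]}}
for row in resp.json().get("result", {}).get("values", []):
    print(row)
```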
Step 3: Determine which network metrics are abnormal in the container network
Based on troubleshooting experience, the next step is to check the network metrics of the seat-service and config-service pods. The user jumps to DeepFlow's pod-level network metrics page, where a sudden jump of about 200ms in connection-establishment delay and a corresponding jump in the RTT metric can be seen.
(Figure/DeepFlow pod-level monitoring metrics)
(Figure/DeepFlow pod-level monitoring metrics)
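To make "a sudden jump in the RTT metric" concrete, here is a minimal, generic sketch that flags points in a metric series that deviate sharply from the recent baseline. It is not DeepFlow or Kindling-OriginX code, just an illustration of the kind of check an engineer (or a tool) performs on the pod-level RTT series; the window size and thresholds are arbitrary.

```python
# Generic sketch: flag sudden jumps in a metric series (e.g. pod-level RTT in ms).
# Illustrative logic only; window and thresholds are arbitrary.
from statistics import mean, pstdev

def sudden_jumps(series, window=10, sigma=3.0, min_delta_ms=50.0):
    """Return indices where the value jumps well above the recent baseline."""
    jumps = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sd = mean(baseline), pstdev(baseline)
        if series[i] - mu > max(sigma * sd, min_delta_ms):
            jumps.append(i)
    return jumps

# RTT hovers around 1-2 ms, then jumps to ~200 ms after the fault is injected.
rtt_ms = [1.2, 1.4, 1.1, 1.3, 1.5, 1.2, 1.4, 1.3, 1.2, 1.4, 201.0, 200.5, 199.8]
print(sudden_jumps(rtt_ms))   # -> [10]  (the first sample after the injected fault)
```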
Step 4: Eliminate possible interference factors
Experience shows that when a host's CPU or bandwidth is saturated, packet loss and delay can also appear in the virtual network, so it is necessary to check the CPU and bandwidth of the nodes where seat-service and config-service were running at the time, to confirm that node-level resources were not saturated.
Use kubectl to confirm which nodes the two pods are running on, then go to DeepFlow's node-level metrics page and check the corresponding metrics. The nodes' bps, pps, and other metrics are all within a reasonable range.
(Figure/Finding the node where the pod is located via a k8s command)
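For readers without the screenshot at hand, roughly the same lookup can be done programmatically with the Kubernetes Python client; the pod names and namespace below are illustrative.

```python
# Sketch: find which node each pod runs on, equivalent to `kubectl get pod -o wide`.
# Pod names and namespace are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

for name in ("seat-service-0", "config-service-0"):        # hypothetical pod names
    pod = v1.read_namespaced_pod(name=name, namespace="default")
    print(f"{name} -> node {pod.spec.node_name}, podIP {pod.status.pod_ip}")
```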
(Figure/DeepFlow node-level monitoring metrics (client))
(Figure/DeepFlow node-level monitoring metrics (server))
Since there was no obvious abnormality in the node-level network metrics, the final conclusion is that the pod-level RTT metric of seat-service is abnormal.
Manual Troubleshooting Summary
After this series of troubleshooting steps, the user can indeed locate the fault, but doing so demands the following from the user:
Very rich network knowledge
An in-depth understanding of network flame graphs
Proficiency with the related tools
Kindling-OriginX: based on different user needs and usage scenarios, Kindling-OriginX processes and presents DeepFlow's data.
Compared with the streamlined manual troubleshooting process above, the troubleshooting process with Kindling-OriginX is as follows:
Automatic analysis of each Trace
For the fault at hand, each Trace is automatically analyzed, and the listed Traces are grouped by fault node. The travel-service entries are caused by cascading faults; cascading faults are not the focus of this article, and interested readers can refer to material on how to handle cascading faults in microservices.
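As a rough sketch of what "automatically analyzing each Trace and grouping by fault node" could look like (illustrative logic only, not Kindling-OriginX's actual implementation; the trace records and the is_cascading flag are hypothetical):

```python
# Illustrative sketch: group analyzed traces by the node (service) identified as faulty.
# The trace records and the is_cascading flag are hypothetical simplifications.
from collections import defaultdict

analyzed_traces = [
    {"trace_id": "t1", "fault_node": "seat-service",   "is_cascading": False},
    {"trace_id": "t2", "fault_node": "seat-service",   "is_cascading": False},
    {"trace_id": "t3", "fault_node": "travel-service", "is_cascading": True},   # cascading fault
]

groups = defaultdict(list)
for t in analyzed_traces:
    groups[t["fault_node"]].append(t)

for node, traces in groups.items():
    cascading = all(t["is_cascading"] for t in traces)
    print(f"{node}: {len(traces)} trace(s){' (cascading)' if cascading else ''}")
```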
Review the fault root cause report where the fault node is seat-service
Fault root cause conclusion:
For the sub-request 10.244.1.254:50332 -> 10.244.5.79:15679, the RTT metric shows a delay of about 200ms.
Fault reasoning verification
Since Kindling-OriginX has already identified that the problem lies in the network on the call from seat-service to config-service, it does not need to present all of DeepFlow's flame graph data to the user. It only needs to interface with DeepFlow and fetch the data relevant to that seat-service-to-config-service network call.
Using DeepFlow's data for the seat-service call to config-service, it is automatically determined that the container network on the client pod side contributes a delay of 201ms.
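As an illustration of how such a number could be derived, the sketch below compares the timestamps of the same request observed at different capture points on the client side. The tap_side tags and field names are assumptions loosely modeled on DeepFlow's observation points, not an exact schema.

```python
# Sketch: estimate the client-side container-network delay from DeepFlow-style spans.
# The tap_side values and fields are assumptions for illustration, not an exact schema.
def client_container_network_delay_ms(spans):
    """Delay between the client process issuing the request (syscall side)
    and the request being observed at the client node's NIC."""
    by_side = {s["tap_side"]: s for s in spans}
    process_side = by_side.get("c-p")    # client process / syscall observation point (assumed tag)
    node_nic_side = by_side.get("c-nd")  # client node NIC observation point (assumed tag)
    if not process_side or not node_nic_side:
        return None
    return node_nic_side["start_time_ms"] - process_side["start_time_ms"]

spans = [
    {"tap_side": "c-p",  "start_time_ms": 1000.0},   # request leaves the client process
    {"tap_side": "c-nd", "start_time_ms": 1201.0},   # request seen at the client node NIC
    {"tap_side": "s-nd", "start_time_ms": 1201.4},   # request seen at the server node NIC
]
print(client_container_network_delay_ms(spans), "ms")   # -> 201.0 ms
```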
Kindling-OriginX emulates expert analysis and further correlates DeepFlow's retransmission and RTT metrics to determine what exactly causes the delay in the call from seat-service to config-service.
Kindling-OriginX also incorporates the nodes' CPU utilization and bandwidth metrics to rule out interference factors.
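A toy sketch of this kind of rule-based correlation is shown below; the thresholds, inputs, and ordering of the checks are invented for illustration, and the real reasoning engine combines far more signals.

```python
# Toy rule-based correlation sketch; thresholds and inputs are illustrative only.
def classify_network_delay(rtt_ms, retrans_per_s, node_cpu_util, node_bw_util,
                           rtt_threshold_ms=100.0, retrans_threshold=10.0, util_threshold=0.9):
    """Attribute an observed call delay to a likely network-level cause."""
    if node_cpu_util >= util_threshold or node_bw_util >= util_threshold:
        return "node resources saturated: virtual-network delay/drops likely a side effect"
    if retrans_per_s >= retrans_threshold:
        return "packet loss / retransmission dominates the delay"
    if rtt_ms >= rtt_threshold_ms:
        return "elevated RTT on the container network dominates the delay"
    return "no obvious network cause; look elsewhere"

# Matches the scenario in this article: node resources healthy, no retransmissions, RTT ~200ms.
print(classify_network_delay(rtt_ms=200.0, retrans_per_s=0.0,
                             node_cpu_util=0.35, node_bw_util=0.20))
```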
Kindling-OriginX completes the entire fault reasoning in a single-page report, and every data source in it is traceable and verifiable.
Kindling-OriginX and DeepFlow are both built on eBPF technology, and both aim to provide flexible and efficient solutions for users with different needs in different scenarios. We also look forward to more domestic products with complementary capabilities emerging in the future.
DeepFlow provides very complete full-link network data, giving cloud-native applications deep observability, and is very useful for troubleshooting network problems.
Kindling-OriginX uses eBPF to collect North Star troubleshooting metrics, and combines AI algorithms with expert experience to build a fault reasoning engine that delivers interpretable root cause reports to users.