[Nightingale Monitoring] Alarm management, great!

Jun 09, 2023, 08:31 AM
nightingale nightingale monitoring

Monitoring is the method, alerting is the means, and solving the problem is the purpose.

But have you ever run into this kind of confusion? I have collected a lot of indicators, but I don’t know which of them should trigger alarms, how to route those alarms to the right teams or individuals, or how to upgrade (escalate) them.

When I used Prometheus Alertmanager before, I created a DingTalk group for each team, added a bunch of labels, and routed alarms to different groups by matching those labels. Alarm upgrades were mostly done with thresholds, and it was difficult to upgrade the same alarm based on how long it had been firing.

But Nightingale’s alarm rule management is not that complicated (it handles the complicated parts for you), and it is also quite elegant. As I mentioned in "[Nightingale Monitoring] I first met Nightingale, and it’s still strong!": Grafana is better at managing monitoring dashboards, and N9e is better at managing alarm rules.

Today, let’s take a look at how Nightingale does it.

Alarm rules

Before the troops and horses move, the food and fodder go first.

Before alerting, we must first know what we need, that is, which indicators should raise alarms.

For example, at the system level we need to consider CPU, memory, disk, IO, and similar indicators. At the application level we need to consider saturation, failure rate, latency, and so on. At the business level we need to consider how many times a transaction failed, where it failed, and so on.

At different levels, the monitoring indicators and alarm strategies considered will be different.

Nightingale’s alarm rules are divided into built-in rules and custom rules.

The built-in rules are designed to lower the barrier to entry by providing a set of general-purpose rules. The main contents are as follows:

The built-in alarm rules do not take effect until you add them to your own rules. If you like a certain rule, you can clone it into the active rules. For example, I cloned the Linux TIME_WAIT alarm rule into the default business group.

Then go to the alarm rule overview and you will see that a new alarm rule has been added to the default business group.

Does this give you any ideas?

We can create multiple business groups to match our actual organization and then manage each group’s alarm rules separately.

For example, if we have two teams, the front office and the middle office, we can classify their indicators separately.

In principle, an imported rule does not work as-is and requires some additional configuration.

Click on the alarm rule name to enter the configuration page.

We can customize the alarm condition, data source, alarm level, and other settings. The settings configured above are summarized as follows:

  • The data source is local_prometheus, which indicates which cluster the alarm comes from.
  • The alarm condition is that the total number of TIME_WAIT connections is greater than 20000.
  • The alarm level is Level 2, the general-importance level.
  • The rule is evaluated every 15 seconds, and an alarm is triggered if the condition holds continuously for 60 seconds.
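
The interplay between the 15-second evaluation and the 60-second duration is easier to see in code. The following is only a simplified sketch of the idea, not Nightingale’s actual implementation; the class name ThresholdRule and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """Simplified model of a rule with a threshold and a hold duration."""
    threshold: float = 20000          # trigger when the value is greater than this
    hold_seconds: int = 60            # the condition must hold this long
    breach_started_at: Optional[int] = None

    def evaluate(self, now: int, value: float) -> bool:
        """Return True once the condition has held for hold_seconds."""
        if value <= self.threshold:
            self.breach_started_at = None    # condition broken, start over
            return False
        if self.breach_started_at is None:
            self.breach_started_at = now     # first evaluation over the threshold
        return now - self.breach_started_at >= self.hold_seconds


rule = ThresholdRule()
# One hypothetical (timestamp, TIME_WAIT count) sample every 15 seconds.
samples = [(0, 25000), (15, 26000), (30, 19000), (45, 27000),
           (60, 26500), (75, 26800), (90, 27100), (105, 26900)]
fired = [rule.evaluate(ts, value) for ts, value in samples]
```

In this sketch the dip at the 30-second sample resets the timer, so the alarm only fires at the 105-second sample, 60 seconds after the value went back over the threshold.
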
The next step is additional configuration, as follows:

The effective configuration sets the time periods and business groups in which the alarm rule takes effect. The notification configuration sets the notification media, that is, which channels are used and where the alarm is sent when it fires.

The notification configuration also has some additional options:

  • Enable recovery notification: when the alarm recovers, the people responsible are also notified through this channel.
  • Alarm receiving group: the business group that receives the alarm.
  • Observation duration: after the alarm recovers, how long to wait before sending the recovery notification to the business group. This avoids flapping alarms that repeatedly fire and recover.
  • Repeat notification: if the alarm is still unresolved after this interval, it is sent again. Alarm upgrades are not involved here.
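
To make the observation duration concrete, here is a minimal sketch of how waiting before a recovery notification suppresses flapping. It is an illustration under my own assumptions (an alarm that is already active, evaluated at fixed intervals), not Nightingale’s code, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def recovery_notice_time(samples: List[Tuple[int, bool]],
                         observe_seconds: int) -> Optional[int]:
    """Return the timestamp at which a recovery notice would be sent.

    samples is a time-ordered list of (timestamp, is_firing) evaluations
    for an alarm that is assumed to be active already. The notice is sent
    only after the alarm has stayed healthy for observe_seconds.
    """
    healthy_since: Optional[int] = None
    for ts, firing in samples:
        if firing:
            healthy_since = None          # flapped back, restart the observation
        elif healthy_since is None:
            healthy_since = ts            # first healthy evaluation
        if healthy_since is not None and ts - healthy_since >= observe_seconds:
            return ts
    return None


# Hypothetical evaluations: the alarm briefly recovers at 15s, then fires again.
flapping = [(0, True), (15, False), (30, True), (45, False),
            (60, False), (75, False), (90, False), (105, False)]
with_observation = recovery_notice_time(flapping, observe_seconds=60)
without_observation = recovery_notice_time(flapping, observe_seconds=0)
```

With no observation period the short recovery at 15 seconds would already produce a notice, followed by a new alarm at 30 seconds. With a 60-second observation period the notice is held back until the alarm has really stayed quiet.
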

By now you should have a reasonable picture of how common alarm rules are managed.

Besides cloning the built-in alarm rules, we can also create custom alarm rules. The overall configuration is the same as above.

Block Alarm

Generally, the alarms we block are the less important ones.

Under what circumstances would we block an alarm?

For example, while we are releasing an application, some alarms are almost inevitable. In that case we can create blocking rules in advance to avoid generating alarm messages.

Blocking rules are also organized by business group. As follows, we add a new rule that blocks alarms from the message center.

This way, matching alarms are no longer sent during the configured time window.
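
Conceptually, a blocking rule is a tag match combined with a time window. The sketch below shows that idea only; the MuteRule class, the service=message-center tag, and the times are hypothetical and do not reflect how Nightingale stores or evaluates its rules.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class MuteRule:
    """A blocking rule: matching tags plus a fixed time window."""
    tags: Dict[str, str]
    start: datetime
    end: datetime

    def mutes(self, event_tags: Dict[str, str], at: datetime) -> bool:
        """Return True if an event with event_tags at time `at` is blocked."""
        in_window = self.start <= at < self.end
        tags_match = all(event_tags.get(k) == v for k, v in self.tags.items())
        return in_window and tags_match


# Hypothetical release window for the message center.
rule = MuteRule(
    tags={"service": "message-center"},
    start=datetime(2023, 6, 9, 20, 0),
    end=datetime(2023, 6, 9, 22, 0),
)
during_release = rule.mutes({"service": "message-center", "ident": "host-1"},
                            datetime(2023, 6, 9, 21, 0))
after_release = rule.mutes({"service": "message-center"},
                           datetime(2023, 6, 9, 22, 30))
other_service = rule.mutes({"service": "gateway"},
                           datetime(2023, 6, 9, 21, 0))
```

Only the first event is blocked: the second falls outside the window and the third carries a different tag value.
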

Some readers may ask: isn’t it a little tedious to add them one by one?

If it is an active alarm that has been generated, it can be blocked with one click.

If it is a historical alarm, it can also be blocked with one click.

Anything else?

Whatever else you want to block, just add a rule for it yourself!

Alarm upgrade

What should we do if an alarm has not been handled within a period of time?

Either it is not an important alarm, in which case delete the rule, because keeping it is useless.

Or it is an alarm that cannot be resolved, in which case upgrade (escalate) it and let more people know about it.

In Nightingale, alarm upgrades can be implemented with subscription rules.

For example, our configuration is as follows:

If an alarm event with server=notice is not resolved within 1 hour, we upgrade the alarm level to Level 1 and send the alarm information to a higher-level group.
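
The time-based upgrade described here can be sketched as follows. This is only an illustration of the logic, with hypothetical names (AlarmEvent, escalate_if_stale); in Nightingale the behaviour is configured through subscription rules in the UI, not written as code.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class AlarmEvent:
    tags: Dict[str, str]
    severity: int            # 1 is the most severe level
    fired_at: int            # seconds since an arbitrary start
    recovered: bool = False


def escalate_if_stale(event: AlarmEvent, now: int, match_tags: Dict[str, str],
                      after_seconds: int = 3600,
                      target_severity: int = 1) -> Optional[AlarmEvent]:
    """Return an upgraded copy of the event, or None if no upgrade applies."""
    tags_match = all(event.tags.get(k) == v for k, v in match_tags.items())
    stale = not event.recovered and now - event.fired_at >= after_seconds
    if tags_match and stale and event.severity > target_severity:
        return replace(event, severity=target_severity)
    return None


event = AlarmEvent(tags={"server": "notice"}, severity=2, fired_at=0)
too_early = escalate_if_stale(event, now=3599, match_tags={"server": "notice"})
upgraded = escalate_if_stale(event, now=3600, match_tags={"server": "notice"})
```

The upgraded copy would then be routed to the higher-level group, while an event that has recovered or does not carry the matching tag is left alone.
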

These rules can also be grouped and managed by business team.

In addition, Nightingale provides active-alarm and historical-alarm views, where you can see current alarm information and historical alarm records.

Alarm self-healing

The longer you work in operations, the more you find that much of the work is repetitive. Simple, repetitive tasks can be handled by automated scripts, which improves efficiency and, to some extent, reduces the risk of manual mistakes.

Nightingale provides an alarm self-healing function. The function is good, but don’t get greedy with it.

When dealing with an alarm, you must first find the real cause behind it before you can solve the problem. So for alarm self-healing, make sure the automated operation is very low risk and has been tried many times. Do not use an operation like cd /opt/aaa;rm -rf ./ because if the cd fails, the rm still runs, in whatever directory the script started in.
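
As an illustration of what "low risk" can mean for a cleanup script, here is a hedged sketch that refuses to delete anything unless the target is an existing directory strictly inside an allowed root. The function name, the aaa directory, and the use of Python are my own choices for the example; a real self-healing script would need the same kind of guard in whatever language it is written in.

```python
import shutil
import tempfile
from pathlib import Path


def clean_directory(target: str, allowed_root: str) -> int:
    """Delete the contents of target and return how many entries were removed.

    Raises ValueError unless target is an existing directory strictly
    inside allowed_root, so a wrong or missing path deletes nothing.
    """
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()
    if not path.is_dir():
        raise ValueError(f"{path} is not an existing directory")
    if root not in path.parents:
        raise ValueError(f"{path} is not inside {root}")
    removed = 0
    for entry in path.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)      # real subdirectory: remove recursively
        else:
            entry.unlink()            # file or symlink: remove the entry only
        removed += 1
    return removed


# Demonstration in a throwaway directory instead of a real /opt path.
with tempfile.TemporaryDirectory() as root:
    cache = Path(root) / "aaa"
    (cache / "sub").mkdir(parents=True)
    (cache / "old.log").write_text("stale")
    removed = clean_directory(str(cache), root)
    remaining = list(cache.iterdir())
    try:
        clean_directory(str(Path(root) / "missing"), root)
        refused = False
    except ValueError:
        refused = True
```

The important difference from the one-liner is that a missing directory stops the script with an error instead of silently redirecting the delete somewhere else.
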

Nightingale implements alarm self-healing with ibex. Currently, ibex-server has to be deployed separately, while ibex-agent is already integrated into Categraf.

Deploy ibex-server

Go to https://github.com/flashcatcloud/ibex/releases and download the binary package. After downloading and extracting it, you have the following files:

# ll
total 21536
drwxr-xr-x 3 root root     4096 Apr 19 10:44 etc
-rwxr-xr-x 1 root root 16105472 Nov 15  2021 ibex
-rw------- 1 root root  5931963 Jun  3  2022 ibex-1.0.0.tar.gz
drwxr-xr-x 2 root root     4096 Nov 15  2021 sql

Import database:

mysql -uroot -p <sql/ibex.sql

Then modify the etc/server.conf configuration file (in the etc directory listed above), mainly the database settings.

Finally start the server:

nohup ./ibex server &> server.log &

Configure the client

In System Configuration -> Notification Configuration, set the server address for the alarm self-healing module:

Test self-healing

Then go to Alarm Self-healing -> Self-healing Scripts and add a script, as follows:

Save and exit, then click to create a task:

If the settings there need no changes, or once you have changed them, choose to execute immediately:

At this point, what do you think? Looks good, doesn’t it?

Well, it did not work for me. Here I have to complain about this module:

  • Are there any prerequisites for deploying ibex-server?
  • Are there any prerequisites for ibex-agent (categraf)?
  • The self-healing script failed to execute, and there is no specific failure log on either the client or the server.
  • Why is the alarm self-healing configuration entry in N9e V6 placed under the message notification module? Strange.
  • The official documentation for this module is a bit too thin.

So I did not succeed here: the front end reported a timeout.

There are no logs in the backend.

Summary

Currently, Nightingale handles alarm rule management, alarm channel distribution, and alarm suppression and upgrade reasonably completely. In addition, FlashDuty can take in alarms from different clusters, which is enough for most enterprises.

Only the alarm self-healing test did not succeed for me. It is probably related to my environment:

  • The N9e modules as a whole are deployed to K8s using Helm.
  • ibex-server is deployed directly on the host in binary form.

However, the specific cause has not been found, and there is too little troubleshooting information available.
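
With the components split between K8s and a host like this, one thing worth ruling out is plain network reachability between them. The following is a minimal sketch of such a check, not a diagnosis of the failure above. The host name and port in the commented usage line are placeholders; the real addresses are whatever is configured in server.conf and in the Nightingale and Categraf settings.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


# Example call with placeholder values, to be run from inside the relevant pod:
# print(is_reachable("ibex-server.example.internal", 12345))
```

A False result from inside the pod would point at routing, firewall, or listen-address problems rather than at the self-healing script itself.
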
