


The system went down. Trust only the code, not the people.
Dear friends, take my advice: whenever you write a method for others to call, whether it is invoked by an internal system, an external system, or a passive trigger (such as MQ consumption or a callback), be sure to add the necessary condition checks. Do not believe colleagues who say that a condition will definitely be passed, will definitely have a value, or will definitely not be empty. Just before the Chinese New Year I fell for exactly that and caused a production incident, and my year-end bonus was cut roughly in half.
I decided to trust the code itself rather than people's assurances, in order to keep the system highly available and stable. Here are a few small lessons that may help you too.
1. What happened
My business scenario: when business A changes, an MQ message is sent; the application then consumes the message, processes the data, and writes it to Elasticsearch.
(1) I received an exception alarm from business A, reporting that a Redis connection could not be obtained.
(2) It looked odd at first glance. Why a Redis exception? I connected to Redis and found no problem, then checked the Redis cluster and everything was normal. So I let it go, assuming it was a transient network problem.
(3) Then customer service reported in the technical support group that some users were seeing errors. I checked the system immediately and confirmed that the problem was real but intermittent.
(4) Out of habit, I checked a few core components:
- Gateway status, load of the core business Pods, and load of the user center Pods.
- MySQL: memory, CPU, slow SQL, deadlocks, number of connections, etc.
(5) I found slow SQL and long metadata lock times. A full table query on a large table was reading a huge amount of data and running slowly, which held metadata locks for too long and exhausted the database connections.
SELECT xxx, xxx, xxx, xxx FROM a_large_table
(6) I killed several slow sessions right away, but the system was still not fully restored. The database was back to normal, so why? I went back to application monitoring and found that 2 of the 10 user center Pods were abnormal, with CPU and memory exhausted. No wonder the errors were intermittent. I quickly restarted those Pods to restore service first.
(7) With the symptom located, the next step was to find out why the user center Pods had gone down. I started from the following suspicion points:
- a. Is there something wrong with the code that synchronizes data to Elasticsearch? Why can't it connect to Redis?
- b. Could there be so many exceptions that the thread pool queue for sending exception alarm messages filled up and caused an OOM?
- c. Where is an unconditional full table query being run against the large table of business A?
(8) I investigated suspicion point a first. My initial theory was that failing to obtain a Redis connection produced exceptions that were pushed into the thread pool queue, and that the queue then overflowed and caused the OOM. I modified the code along those lines, deployed it, and kept observing, but the same slow SQL and user center overload still occurred. Since no exceptions were reported this time, suspicion point b could also be ruled out.
(9) At this point it was almost certainly suspicion point c. Something was calling the full table query on the large table of business A, which loaded so much data into user center memory that the JVM could not reclaim it in time and the CPU was driven to its limit. Meanwhile, because the table was so large, metadata locks were held too long during the query, connections could not be released in time, and the connection pool was eventually almost exhausted.
(10) So I added the necessary condition checks to the query on the large table of business A, redeployed, and observed production. That finally pinpointed the problem.
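The article does not show the code change made for the thread pool theory. As a rough sketch of one way to keep an alarm queue from growing without limit, assuming the alarms are sent through a `java.util.concurrent.ThreadPoolExecutor`: give the pool a bounded queue and count alarms that are dropped when it is full. The class name `BoundedAlarmSender` and its parameters are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical alarm sender whose backlog can never exceed a fixed capacity. */
final class BoundedAlarmSender {
    /** Number of alarms discarded because the queue was full. */
    final AtomicInteger dropped = new AtomicInteger();
    final ThreadPoolExecutor pool;

    BoundedAlarmSender(int threads, int queueCapacity) {
        pool = new ThreadPoolExecutor(
                threads, threads,                        // fixed number of worker threads
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), // bounded backlog instead of an unbounded queue
                (task, executor) -> dropped.incrementAndGet()); // on overflow: count and discard
    }

    /** Submits an alarm; it is silently dropped (and counted) when the queue is full. */
    void send(Runnable alarm) {
        pool.execute(alarm);
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Dropping alarms is a trade-off: some notifications are lost during a storm, but the alarm path itself cannot exhaust the heap. Exposing the `dropped` counter as a metric makes the loss visible.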
2. Cause of the problem
Changing business table B sends an MQ message (to synchronize the data of business table A to ES). On receiving the message, the consumer queries the related data in business table A and then synchronizes it to Elasticsearch.
But the message sent when business table B changed did not carry the conditions required to query business table A, and I did not validate them either, which resulted in a full table scan of the large table of business A. The reason:
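A certain colleague said, "This condition will definitely be passed, it will definitely have a value, it will definitely not be empty...", and I actually believed him!!!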
Business table B was changing frequently at the time, so more MQ messages were sent and consumed, which triggered more full table scans of the large table of business A. That in turn caused more overly long MySQL metadata locks and, finally, exhaustion of the database connections.
Meanwhile, every full table query returned its results into user center memory, which triggered JVM garbage collection that could not reclaim the memory. In the end, memory and CPU were exhausted.
As for the exception that a Redis connection could not be obtained, it was just a smokescreen: so many MQ events were being sent and consumed that a small number of threads momentarily failed to get a Redis connection.
In the end, I added condition validation to the code that consumes the MQ events, added the necessary condition validation at the query on business table A, redeployed, and the problem was solved.
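The consumer-side check could look roughly like the following. This is only a sketch: the article does not show the real message format, so the event class `TableBChangedEvent`, its fields `businessAId` and `tenantCode`, and the `SyncConsumer` class are all hypothetical stand-ins for whatever conditions the query on business table A actually requires.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical payload of the "table B changed" message; field names are assumptions. */
final class TableBChangedEvent {
    final Long businessAId;   // key needed to look up rows in business table A
    final String tenantCode;  // second required filter, purely illustrative

    TableBChangedEvent(Long businessAId, String tenantCode) {
        this.businessAId = businessAId;
        this.tenantCode = tenantCode;
    }
}

/** Checks an event before any query against business table A is allowed to run. */
final class SyncEventValidator {
    /** Returns the list of problems; an empty list means the event is safe to process. */
    static List<String> validate(TableBChangedEvent event) {
        List<String> problems = new ArrayList<>();
        if (event == null) {
            problems.add("event is null");
            return problems;
        }
        if (event.businessAId == null || event.businessAId <= 0) {
            problems.add("businessAId is missing or not positive");
        }
        if (event.tenantCode == null || event.tenantCode.trim().isEmpty()) {
            problems.add("tenantCode is missing or blank");
        }
        return problems;
    }
}

/** Hypothetical consumer: rejects invalid events instead of querying with empty conditions. */
final class SyncConsumer {
    int processed = 0;
    int rejected = 0;

    void onMessage(TableBChangedEvent event) {
        List<String> problems = SyncEventValidator.validate(event);
        if (!problems.isEmpty()) {
            rejected++;
            // Report the bad message and stop; never fall through to the query.
            System.err.println("Rejected sync event: " + problems);
            return;
        }
        processed++;
        // Here the real consumer would query business table A with both
        // conditions and write the result to Elasticsearch.
    }
}
```

Rejecting the message loudly, rather than quietly running the query with whatever arrived, is the point: the producer's promise that a field "will definitely have a value" is checked by code on every message.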
3. Lessons learned
After this incident, I summed up some lessons to share with you:
(1) Stay alert to production problems. Once one appears, do not let it go; investigate quickly. Stop blaming network jitter: most problems have nothing to do with the network.
(2) A large business table must itself be protected: every query against it must validate the necessary conditions.
(3) When consuming MQ messages, validate the necessary conditions and do not trust any information source.
(4) Never believe colleagues who say, "This condition will definitely be passed, it will definitely have a value, it will definitely not be empty," and so on. To keep the system highly available and stable, trust only the code, not the people.
(5) General troubleshooting order when a problem occurs:
- Database: CPU, deadlocks, slow SQL.
- Application: CPU, memory, and logs of the gateway and core components.
(6) Business observability and alarms are essential and must be comprehensive, so that problems can be discovered and solved faster.
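Lesson (2) can be enforced at the data access layer as well as in the consumer. The sketch below is one possible shape, not the author's actual fix: a small query builder that throws when no usable filter is supplied and always appends a row cap. `GuardedQuery`, `MAX_ROWS`, and the table and column names in the usage note are hypothetical; table and column names are assumed to come from code constants, never from user input, since they are concatenated into the SQL text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds a parameterized SELECT that refuses to run without at least one filter. */
final class GuardedQuery {
    static final int MAX_ROWS = 500; // assumed safety cap, tune per table

    final String sql;
    final List<Object> params;

    private GuardedQuery(String sql, List<Object> params) {
        this.sql = sql;
        this.params = params;
    }

    static GuardedQuery select(String table, String columns, Map<String, Object> filters) {
        // Keep only filters that carry a real value: null and blank strings are dropped.
        Map<String, Object> effective = new LinkedHashMap<>();
        if (filters != null) {
            for (Map.Entry<String, Object> e : filters.entrySet()) {
                Object v = e.getValue();
                boolean blank = v == null
                        || (v instanceof String && ((String) v).trim().isEmpty());
                if (!blank) {
                    effective.put(e.getKey(), v);
                }
            }
        }
        if (effective.isEmpty()) {
            // This is the guard: an unconditional scan of a large table is never built.
            throw new IllegalArgumentException("Refusing unconditional query on " + table);
        }
        StringBuilder sql = new StringBuilder("SELECT ").append(columns)
                .append(" FROM ").append(table).append(" WHERE ");
        List<Object> params = new ArrayList<>();
        boolean first = true;
        for (Map.Entry<String, Object> e : effective.entrySet()) {
            if (!first) {
                sql.append(" AND ");
            }
            sql.append(e.getKey()).append(" = ?"); // values are bound later, not concatenated
            params.add(e.getValue());
            first = false;
        }
        sql.append(" LIMIT ").append(MAX_ROWS);
        return new GuardedQuery(sql.toString(), params);
    }
}
```

For example, calling `GuardedQuery.select("business_a", "id, name", filters)` with `order_id = 7` and a blank `status` yields `SELECT id, name FROM business_a WHERE order_id = ? LIMIT 500` with a single bound parameter, while an empty or all-blank filter map throws before any SQL reaches MySQL.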