Dear friends, take my advice: whenever you write code that exposes methods for others to call, whether it is an internal call, an external call, or a passively triggered call (such as MQ consumption, callback execution, etc.), be sure to add the necessary condition checks. Don't believe colleagues who say "this condition will definitely be passed, it will definitely have a value, it will definitely not be empty", and so on. Just before the Chinese New Year, I believed exactly that, caused a production accident, and had my year-end bonus cut roughly in half.
Since then I have decided to recognize only the code, not the people, in order to keep the system highly available and stable. Here are a few small lessons that may help you too.
My business scenario is: when business A data changes, an MQ message is sent; the application then consumes the MQ message, processes it, and writes the data to Elasticsearch.
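To make the flow concrete, here is a minimal sketch of what such a consumer might look like, assuming a Java application. Every name in it (BusinessAChangeListener, BusinessARepository, EsWriter, BusinessChangeMessage) is a hypothetical placeholder for illustration, not the real system's code:

import java.util.List;

// Hypothetical sketch of the sync flow: an MQ message about a business change triggers
// a query of the business A table, and each result row is written to Elasticsearch.
public class BusinessAChangeListener {

    interface BusinessARepository { List<Object> findByBusinessKey(String key); }
    interface EsWriter { void write(Object row); }               // wraps the Elasticsearch client
    interface BusinessChangeMessage { String getBusinessKey(); }

    private final BusinessARepository repository;
    private final EsWriter esWriter;

    public BusinessAChangeListener(BusinessARepository repository, EsWriter esWriter) {
        this.repository = repository;
        this.esWriter = esWriter;
    }

    // Called by the MQ framework when a change message arrives.
    public void onMessage(BusinessChangeMessage msg) {
        List<Object> rows = repository.findByBusinessKey(msg.getBusinessKey());
        rows.forEach(esWriter::write);                           // sync each row to ES
    }
}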
(1) I received an exception alarm from business A: the application could not obtain a Redis connection.
(2) It seemed a bit strange at first glance. How could it be a Redis exception? I connected to Redis and found no problem, then checked the Redis cluster and everything was normal. So I let it go, assuming it was a transient network issue.
(3) Then, in the technical support group, customer service reported that some users were running into errors. I immediately checked the system and confirmed that sporadic problems did exist.
(4) So, out of habit, I checked a few core components one by one, starting with the database.
(5) In the database I found slow SQL and long metadata lock waits. The main cause was a full table query against a large table: the data volume was huge and execution was slow, so the metadata lock was held for too long and the database connections were eventually exhausted. The query looked like this:
SELECT xxx,xxx,xxx,xxx FROM <a large table>
(6) I immediately killed several slow sessions, but the system was still not fully restored. Why? The database was back to normal, so why not the system? I kept looking at the application monitoring and found that 2 of the 10 Pods in the user center were abnormal, with CPU and memory exhausted. No wonder users were hitting occasional errors. So I quickly restarted those Pods to restore the application first.
(7) The immediate symptom had been dealt with, but I still needed to find out why the user center Pods had died. I started analyzing from several suspicion points (a, b, and c below).
(8) I first investigated suspicion point a. My initial theory was that Redis connections could not be obtained, exceptions piled up in the thread pool queue, the queue overflowed, and that caused the OOM. Following this idea I modified the code, released it, and kept observing, but the same slow SQL and user center crashes still occurred. And since no related abnormality appeared either, suspicion point b could also be ruled out.
(9) At this point it was almost certain that suspicion point c was the cause: a full table query against the large business A table was being triggered, loading so much data into the user center's memory that the JVM could not reclaim it in time, which in turn maxed out the CPU. Meanwhile, because the table is so large, the metadata lock was held for too long during the query, connections could not be released in time, and the connection pool was nearly exhausted.
(10) So I added the necessary condition checks to the query against the large business A table, redeployed, and observed it online. The root cause was finally confirmed:
When business table B changes, an MQ message is sent so that the related business table A data can be synchronized to ES. After receiving the MQ message, the consumer queries the related business table A data and then writes it to Elasticsearch.
But when business table B was changed, the message did not carry the condition needed to query business table A, and I did not verify it either, which resulted in a full table scan of the large business A table (see the sketch below). Why? Because:
Some colleagues said, "This condition will definitely be passed, it will definitely have a value, it will definitely not be empty...", and I actually believed them!!!
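A hedged reconstruction of how that missing check turns into a full table scan (the dynamic SQL below illustrates the pattern only; the class, table, and column names are made up, not the actual mapper code): when the supposedly guaranteed condition is absent, the WHERE clause is silently dropped.

// Hypothetical reconstruction of the unguarded query path.
public class UnguardedQueryBuilder {

    static String buildQuery(String businessKey) {
        StringBuilder sql = new StringBuilder("SELECT xxx, xxx, xxx, xxx FROM business_a_large_table");
        // If the "definitely passed" condition is actually null or empty,
        // no WHERE clause is appended and the statement scans the whole table.
        if (businessKey != null && !businessKey.isEmpty()) {
            sql.append(" WHERE business_key = ?");
        }
        return sql.toString();
    }
}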
Because business table B was changing frequently at the time, more MQ messages were sent and consumed, which triggered more full table scans of the large business A table. That in turn meant more MySQL metadata locks held for too long, and the database connections were eventually exhausted.
At the same time, each of these queries returned the entire large business A table into the user center's memory. JVM garbage collection kept being triggered but could not reclaim the data, and in the end both memory and CPU were exhausted.
As for the exception about Redis connections being unavailable, it was just a red herring: with so many MQ events being sent and consumed, a small number of threads momentarily failed to obtain a Redis connection.
In the end, I added condition verification to the code that consumes the MQ events, and also added the necessary condition checks where the business A table is queried, then redeployed online, and the problem was solved.
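As a rough illustration of that fix, here is a minimal sketch of the two guards, again using purely hypothetical names (GuardedBusinessASync and the same placeholder interfaces as before); the real change is of course specific to the actual mappers and messages:

import java.util.List;

// Guard 1: validate the MQ message before doing any work.
// Guard 2: the query method itself also refuses to run without the required condition.
public class GuardedBusinessASync {

    interface BusinessARepository { List<Object> findByBusinessKey(String key); }
    interface EsWriter { void write(Object row); }
    interface BusinessChangeMessage { String getBusinessKey(); }

    private final BusinessARepository repository;
    private final EsWriter esWriter;

    public GuardedBusinessASync(BusinessARepository repository, EsWriter esWriter) {
        this.repository = repository;
        this.esWriter = esWriter;
    }

    // Reject messages missing the required condition instead of trusting that
    // "it will definitely have a value".
    public void onMessage(BusinessChangeMessage msg) {
        String key = (msg == null) ? null : msg.getBusinessKey();
        if (key == null || key.isBlank()) {
            // Log and drop (or route to a dead-letter queue); never fall through to the query.
            System.err.println("Discarding MQ message without businessKey: " + msg);
            return;
        }
        queryBusinessA(key).forEach(esWriter::write);
    }

    // The query boundary enforces the condition again, independently of the consumer,
    // so no caller can trigger a full table scan by accident.
    private List<Object> queryBusinessA(String businessKey) {
        if (businessKey == null || businessKey.isBlank()) {
            throw new IllegalArgumentException("businessKey is required when querying the large business A table");
        }
        return repository.findByBusinessKey(businessKey);
    }
}

The point of checking in both places is that the large table stays protected even if some future caller forgets to validate before querying.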
After this incident, I also summed up a few lessons to share with you:
(1) Always stay alert to production problems. Once a problem appears, never let it slide; investigate it quickly. And stop blaming network jitter by default: most problems have nothing to do with the network.
(2) A large business table must be protected: every query against it must verify the necessary conditions.
(3) When consuming MQ messages, always verify the necessary conditions and never blindly trust any upstream source (a small test sketch follows below).
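To keep that rule from quietly regressing, one option (a sketch, assuming JUnit 5 and the hypothetical GuardedBusinessASync class sketched earlier) is to pin the guard down with a test:

import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
import org.junit.jupiter.api.Test;

// Hypothetical test: if someone later removes the condition check from the consumer,
// this fails before the change ships.
class GuardedBusinessASyncTest {

    @Test
    void messageWithoutBusinessKeyNeverTouchesTheLargeTable() {
        GuardedBusinessASync sync = new GuardedBusinessASync(
                key -> { throw new AssertionError("large-table query must not run without a key"); },
                row -> { throw new AssertionError("nothing should be written to ES for an invalid message"); });

        // A blank condition must be dropped by the consumer, not turned into a full table scan.
        assertDoesNotThrow(() -> sync.onMessage(() -> "   "));
    }
}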
(4) Never believe colleagues who say, "This condition will definitely be passed, it will definitely have a value, it will definitely not be empty," and so on. To keep the system highly available and stable, we recognize only the code, not the people.
(5) When a problem occurs, follow a general troubleshooting sequence: check the alarm details first, then the core components (database, Redis, MQ), then application monitoring (Pods, CPU, memory), and finally narrow down the remaining suspicion points one by one.
(6) Business observability and alarms are essential and must be comprehensive, so that problems can be discovered and solved faster.