On June 5, Vipshop published its report on the outage of March 29, 2023. A failure in the cooling system at the Nansha IDC took the Vipshop online mall out of service, causing losses in the hundreds of millions (as a lowly ops engineer, I tremble).
For Vipshop, the online mall is the main entry point to its core business. Failures are inevitable, but an outage this long is intolerable. Why does this happen? In the eyes of small-time ops people like us, this kind of accident should not happen at a company of this size. After all, we learn how to do operations by imitating and studying their PPTs.
Yet however advanced the PPTs are, they cannot prevent failures. Why is this?
I personally venture to make some guesses:
There are now all kinds of domestic technology conferences, which invite the CTOs and technical leads of well-known companies to give talks. Judging from the talks, every company is very strong (at least that is how it looks on the PPT). Every time I listen, I feel suddenly enlightened and come away having learned a lot. I admire these companies from the bottom of my heart: their strength, their thinking, their capabilities, their cool teams.
However, a PPT is only a presentation aid; it cannot stand in for reality.
The beautiful PPT is for those who want to look at it. The ugly parts you have to endure alone.
I have seen Vipshop's talk at GOPS before, and the PPT really was great. If you reported to the boss with it, the boss would also feel that the company's technology is really powerful and that the work is being done really well. It gives the boss the illusion that everything is fine.
Then, when something goes wrong, who gets the blame if not you?
The boasts that came out of your own mouth will come back to you to answer for.
The book "Site Reliability Engineering: How Google Runs Production Systems" devotes a lot of space to fault drills. Fault drills improve the reliability and fault tolerance of a system, help the team understand its architecture and how it works, clarify how the modules affect one another, and expose weaknesses and faults in the architecture sooner.
You could say that fault drills are the core of the whole stability effort, because they help the team minimize real failures and respond more efficiently to the problems that do occur.
But is that how it works in practice?
To actually run a fault drill, the fault point must be decided in advance, specific countermeasures must be worked out and written down, a complete plan must be drawn up, and each person's responsibilities and tasks must be described precisely.
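As a rough illustration of that preparation, a drill plan can be written down as data and checked for gaps before the drill starts. This is only a minimal sketch: the names `DrillPlan`, `DrillStep`, and `problems`, and the sample fault point, are hypothetical and not taken from any tool or from Vipshop's report.

```python
from dataclasses import dataclass, field


@dataclass
class DrillStep:
    action: str  # the countermeasure to carry out
    owner: str   # the person responsible for this step


@dataclass
class DrillPlan:
    fault_point: str  # the fault to simulate, decided in advance
    steps: list[DrillStep] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the reasons this plan is not ready to run (empty if ready)."""
        issues = []
        if not self.fault_point.strip():
            issues.append("no fault point defined")
        if not self.steps:
            issues.append("no countermeasures listed")
        for number, step in enumerate(self.steps, start=1):
            if not step.owner.strip():
                issues.append(f"step {number} has no owner")
        return issues


# Hypothetical plan: the second step has nobody assigned to it.
plan = DrillPlan(
    fault_point="cooling failure in the primary data center",
    steps=[
        DrillStep("switch traffic to the standby site", owner="alice"),
        DrillStep("verify the order flow on the standby site", owner=""),
    ],
)
print(plan.problems())  # ['step 2 has no owner']
```

A check like this does not make the drill any cheaper, but it does make a skipped step visible instead of silent.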
This preparatory work alone takes a lot of manpower and resources, so many teams and many people cut steps and measures. They treat the drill as a box to tick and treat faults with a mentality of trusting to luck, pinning their hopes on others not having problems.
Take the public cloud, for example. If you pin your hopes on it, the whole system is stable as long as the public cloud has no problems, but public cloud ≠ completely reliable. Google Cloud, Alibaba Cloud, Tencent Cloud, and others have all had major incidents, and the ones who pay the price are the users themselves.
Therefore, the ops team or SRE team needs to take fault drills seriously. They must not only prepare properly for the drill but also follow the plan closely while it runs, and act promptly to correct any problems it uncovers.
Don't let the drill become a formality, and don't let it become a KPI. Otherwise you will be the next one to be "optimized" away.
Vipshop's incident on March 29 indirectly suggests something else: "multi-active" (duohuo) may be just talk.
As the business grows, the system architecture keeps evolving, because our requirements for high availability keep rising.
For example, you upgrade from a single-machine setup in one data center to an active-standby setup, then to multiple data centers in the same city, and finally to a two-region, three-data-center architecture.
If Vipshop had built multiple data centers in the same city, then even with the simplest same-city active-standby setup it would not have been down for 12 hours.
Let alone with same-city active-active.
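To make the active-standby idea concrete, here is a minimal sketch of the routing decision behind it: traffic goes to the first healthy site in priority order. The `Site` type, the `pick_active` function, and the site names are assumptions made up for illustration; they say nothing about how Vipshop's systems are actually built, and a real setup also needs health checks, data replication, and traffic switching, which are where the cost lies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    healthy: bool  # result of some health check (assumed to exist elsewhere)


def pick_active(sites: list[Site]) -> Optional[Site]:
    """Active-standby: return the first healthy site in priority order."""
    for site in sites:
        if site.healthy:
            return site
    return None  # every site is down: a total outage


# Hypothetical scenario: the primary data center has failed.
sites = [
    Site("primary", healthy=False),
    Site("same-city-standby", healthy=True),
]
active = pick_active(sites)
print(active.name if active else "total outage")  # same-city-standby
```

With only one site in the list, the same failure leaves nothing to fail over to, and the outage lasts as long as the repair does.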
But I am only guessing, with the benefit of hindsight. Maybe they really did a lot of work, or maybe it only looked as if they did.
Everything above ultimately comes down to money, people, and resources. Take multi-active as an example: building same-city disaster recovery requires an investment that is far from trivial. Whenever the head of SRE applies for funding, if the leaders above do not support it (it earns no money, yet it costs so much), everything is in vain.
Leaders need to control costs, and the people under them need money to get things done. When the budget falls short, the work cannot be delivered, and you end up with a beautiful PPT and a terrible reality.
Even if you have ambition, it is useless.
And if something goes wrong, you are the one who gets sacrificed to heaven.
The above is purely fictitious. If there are any similarities, please like it~
In many companies, ops has very little say, ridiculously little, which makes it hard for ops to get things done or move things forward.
However, once a problem occurs, ops is the first to be pushed out in front, so ops has always been the "scapegoat".
So what should we do as ops engineers?
Finally, one last word: don't joke around with production.