The front end is nginx, and the back end is n Docker containers, each running nginx + php-fpm. Any container may fail. When one fails, the front end returns a 502 or 504 error. The front end also has occasional network delays.
Current practices and issues
Health check interval: assuming interval is set to 3s and fall to 2, if a backend goes down right after the last check, requests will still be forwarded to the faulty backend for nearly 6s.
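The arithmetic behind that figure can be sketched as below. This is only an illustration, assuming the checker marks a backend down after `fall` consecutive failed checks spaced `interval` seconds apart, and ignoring the check timeout itself. The function name is hypothetical, not part of any nginx module.

```python
def worst_case_exposure(interval_s, fall):
    """Upper bound (seconds) on how long a dead backend keeps getting traffic.

    Assumption: the backend dies just after a successful check, so the
    first failed check arrives almost one full interval later, and `fall`
    consecutive failures are required before it is marked down.
    """
    return interval_s * fall


print(worst_case_exposure(3, 2))  # 6
```

Lowering interval or fall shrinks this window, at the cost of more check traffic and a higher chance of marking a backend down on a single slow response.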
Health check timeout: assuming timeout is set to 1s, if the front-end network is delayed, all backends time out at once and a 502 is returned directly to the user. But if the timeout is increased, the health check loses most of its value. A backend normally responds within 50ms, so even a 1s timeout can no longer filter out high-load backends.
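A toy model of this trade-off, assuming a check passes when backend response time plus network delay stays under the timeout. The function name and the 800ms "overloaded" figure are made up for illustration; only the 50ms and 1s values come from the question.

```python
def passes_check(backend_ms, network_delay_ms, timeout_ms=1000):
    """Return True if the health check completes before the timeout.

    Simplified assumption: total time = backend response + network delay.
    """
    return backend_ms + network_delay_ms < timeout_ms


# Healthy backend, no network delay: passes.
print(passes_check(50, 0))     # True
# Overloaded backend, 16x slower than normal: still passes with a 1s timeout.
print(passes_check(800, 0))    # True
# Healthy backend, but a 1s front-end network delay: fails.
print(passes_check(50, 1000))  # False
```

In this model a single timeout value cannot separate "backend is slow" from "network is slow", which is the dilemma described above.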
sysguard: same as above. If a backend goes down between two checks, some requests still reach it. And if the backend load fluctuates frequently around the threshold, there may be even more 5xx errors than without health check and sysguard.
Is there any solution?
I don't know what your front-end application scenario looks like, but it sounds like the load is very high. Errors should not be exposed directly to the user. At most, prompt the user to try again later. To go deeper, you still need to look at the technical architecture and find the root of the problem.
Personally, I feel this is unavoidable. When the load is heavy, I optimize the program or add machines to the cluster, but errors will still occur. I just show a vague message.
Taobao also often shows "system busy", and Zhihu every now and then shows "the server raised a question".