The traditional software stack of Taobao's online applications is Nginx + Velocity + Java.
In this stack, Nginx forwards each request to the Java application, which processes the business logic and then renders the data into the final page using Velocity templates.
Introducing Node.js into this stack raises several questions:
- How should the topology of the technology stack be designed, and which deployment method is sound and reasonable?
- Once the project is complete, how should traffic be split so that it is quick and convenient for operations and maintenance?
- When a problem occurs online, how can the danger be resolved as quickly as possible to avoid greater losses?
- How can the health of the application be ensured and managed at the load-balancing scheduling level?
Development
System topology
Following an earlier article in this series, "Thinking and practice on front-end and back-end separation (2): Template exploration based on front-end and back-end separation", Velocity should be replaced by Node.js, so the stack becomes Nginx + Node.js + Java.
This is, of course, the ideal goal. However, introducing a Node.js layer into the traditional stack for the first time is still an experiment. To be safe, we decided to enable the new technology only on the item collection page of Favorites (shoucang.taobao.com/item_collect.htm); other pages continue to use the traditional solution. That is, Nginx inspects the requested page and decides whether the request should be forwarded to Node.js or to Java. In the final structure, therefore, Nginx routes item collection requests to Node.js and all other requests directly to Java.
Deployment plan
This structure looks fine, but new problems are still waiting ahead. In the traditional structure, Nginx and Java are deployed on the same server: Nginx listens on port 80 and communicates with Java, which listens on the high port 7001. With Node.js introduced, we need to run a new process that listens on its own port. Should we deploy Node.js on the same machines as Nginx + Java, or in a separate cluster?
Let’s compare the characteristics of the two methods:
Taobao Favorites is an application with tens of millions of page views (PV) per day and extremely high stability requirements (in fact, online instability is unacceptable for any product). With same-cluster deployment, a release takes only one file distribution and two application restarts, and a rollback takes only one operation on the baseline package. In terms of performance, same-cluster deployment also has some theoretical advantages (although intranet switch bandwidth and latency are already very good). As for a one-to-many or many-to-one relationship between the layers, it could in theory make fuller use of the servers, but compared with the stability requirements this is not an urgent problem to solve. Therefore, for the Favorites transformation, we chose same-cluster deployment.
Grayscale mode
To maximize stability, this transformation did not remove the Velocity code outright. The application cluster has nearly 100 servers, and we introduced traffic gradually at server granularity. In other words, although every server runs both the Java and Node.js processes, whether a server's Nginx has the corresponding forwarding rule determines whether requests for the item collection page on that server are handled by Node.js. The Nginx configuration is:
    location = "/item_collect.htm" {
        proxy_pass http://127.0.0.1:6001; # port the Node.js process listens on
    }
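Only servers that have this Nginx rule hand the corresponding requests to Node.js. Through Nginx configuration, grayscale traffic can be increased or decreased quickly and conveniently at very low cost. If a problem occurs, the Nginx configuration can simply be rolled back, which instantly restores the traditional stack and defuses the danger.
For the first release, we enabled this rule on only two servers, so roughly less than 2% of online traffic was handled by Node.js, while the remaining requests were still rendered by Velocity. We then increased the traffic gradually as conditions allowed, and in the third week the rule was enabled on all servers. At that point, 100% of production traffic for the item collection page was rendered by Node.js (you can view the page source and search for the keyword Node.js).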
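The server-granularity rollout described above can be modeled in a few lines. This is only an illustrative sketch under stated assumptions: the function names `routeRequest` and `nodeTrafficShare` are hypothetical, and the traffic estimate assumes the load balancer spreads requests evenly across servers, which the article does not state. The ports 6001 (Node.js) and 7001 (Java) are the ones given in the text.

```javascript
// Hypothetical model of the per-server grayscale rule; this is not Nginx itself.
const NODE_UPSTREAM = 'http://127.0.0.1:6001'; // Node.js port from the Nginx rule
const JAVA_UPSTREAM = 'http://127.0.0.1:7001'; // Java port from the deployment section

// Mirrors `location = "/item_collect.htm"`: an exact match on the request path,
// effective only on servers where the rule has been added.
function routeRequest(url, ruleEnabled) {
  const path = url.split('?')[0]; // location matching ignores the query string
  if (ruleEnabled && path === '/item_collect.htm') {
    return NODE_UPSTREAM;
  }
  return JAVA_UPSTREAM;
}

// Estimated fraction of item collection traffic handled by Node.js,
// ASSUMING the load balancer distributes requests evenly across servers.
function nodeTrafficShare(enabledServers, totalServers) {
  if (totalServers <= 0) {
    throw new RangeError('totalServers must be positive');
  }
  return enabledServers / totalServers;
}
```

With the rule enabled on 2 of 100 servers, `nodeTrafficShare(2, 100)` evaluates to 0.02, in line with the roughly 2% figure for the first release. Removing the rule from a server sends its item collection requests back to Java, which is what makes the rollback immediate.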
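Turn
The grayscale process was not all smooth sailing. Before switching over all traffic, we ran into problems large and small. Most were specific to the business; the one worth sharing is a pitfall involving a technical detail.
Health check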
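In the traditional architecture, the load-balancing scheduling system sends a GET request every second to a specific URL on port 80 of each server and uses whether the returned HTTP status code is 200 to judge whether that server is working properly. If the request times out after 1 s or the HTTP status code is not 200, no traffic is routed to that server, which avoids online problems.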
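The check described above can be sketched from the balancer's side as follows. The article does not describe how the real load-balancing system is implemented, so this is only a minimal model: the names `classifyProbe` and `probeServer` are hypothetical, and the 1 s budget and port 80 are the values given in the text.

```javascript
const http = require('http');

const CHECK_TIMEOUT_MS = 1000; // 1 s budget, as described in the text

// Decide health from the outcome of one probe: only a timely 200 counts.
function classifyProbe(outcome) {
  if (outcome.timedOut || outcome.error) {
    return false;
  }
  return outcome.statusCode === 200;
}

// Send one GET to port 80 of `host` and report true/false through `callback`.
// `path` is the health-check URL; the callback fires exactly once.
function probeServer(host, path, callback) {
  let finished = false;
  let timer = null;
  const finish = (outcome) => {
    if (finished) return;
    finished = true;
    clearTimeout(timer);
    callback(classifyProbe(outcome));
  };

  const req = http.get({ host: host, port: 80, path: path, agent: false }, (res) => {
    res.resume(); // discard the body so the socket can be released
    finish({ statusCode: res.statusCode });
  });
  req.on('error', (err) => finish({ error: err }));

  // Enforce the overall time budget, then tear the request down.
  timer = setTimeout(() => {
    finish({ timedOut: true });
    req.destroy();
  }, CHECK_TIMEOUT_MS);
}
```

Under this model a slow response and a failed response are indistinguishable: a server that answers 200 after more than one second is taken out of rotation just like one that is down.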
The path of this request is Nginx -> Java -> Nginx, which means that as long as 200 is returned, both Nginx and Java on that server are healthy. After Node.js was introduced, the path became Nginx -> Node.js -> Java -> Node.js -> Nginx. The corresponding code is:
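    var http = require('http');

    // `app` (an Express-style application) and `logger` are defined elsewhere in the application.
    app.get('/status.taobao', function(req, res) {
      // Forward the health check to the Java application on the same machine.
      http.get({
        host: '127.1',
        port: 7001,
        path: '/status.taobao'
      }, function(javaRes) {
        // Relay the status code returned by Java (named javaRes so it does not shadow res).
        res.status(javaRes.statusCode).end();
      }).on('error', function(err) {
        logger.error(err);
        res.status(404).end();
      });
    });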
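During testing, however, we found that when Node.js forwarded these requests, roughly one in every six or seven took several seconds, or even more than ten seconds, to get a response from Java. This makes the load-balancing scheduling system conclude that the server is abnormal and cut off its traffic immediately, even though the server is in fact working normally. This is clearly a serious problem.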
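After some investigation, we found that by default Node.js uses the HTTP Agent class (http.Agent) to create HTTP connections. This class implements a socket connection pool in which the number of connections per host + port pair is capped at 5 by default. In addition, requests issued through the HTTP Agent carry the Connection: Keep-Alive header by default, so connections whose responses had already returned were not released promptly, and subsequent requests had to queue.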
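One way to observe this kind of queueing is to inspect the agent's own bookkeeping. The sketch below relies on the `maxSockets`, `sockets`, and `requests` properties of `http.Agent`; the helper name `agentStats` is hypothetical and is not part of the original code. Note that the default limit depends on the Node.js version: newer releases default `maxSockets` to `Infinity`, so the value reported may differ from the 5 mentioned above.

```javascript
const http = require('http');

// Summarize an agent's pool: its per-host limit, the sockets currently in use,
// and the requests still waiting for a socket (summed over all host:port keys).
function agentStats(agent) {
  const count = (table) =>
    Object.keys(table || {}).reduce((n, key) => n + table[key].length, 0);
  return {
    maxSockets: agent.maxSockets,
    active: count(agent.sockets),
    queued: count(agent.requests),
  };
}
```

Logging `agentStats(http.globalAgent)` while the health check is being exercised would show `queued` rising above zero whenever all sockets for the Java host are occupied, which matches the symptom of some requests waiting several seconds.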
There were three possible solutions. The first is to disable the HTTP Agent by passing the extra option agent: false when calling the get method. The resulting code is:
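    var http = require('http');

    // `app` (an Express-style application) and `logger` are defined elsewhere in the application.
    app.get('/status.taobao', function(req, res) {
      http.get({
        host: '127.1',
        port: 7001,
        agent: false, // bypass the HTTP Agent connection pool for this request
        path: '/status.taobao'
      }, function(javaRes) {
        // Relay the status code returned by Java (named javaRes so it does not shadow res).
        res.status(javaRes.statusCode).end();
      }).on('error', function(err) {
        logger.error(err);
        res.status(404).end();
      });
    });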
The second is to raise the global socket limit on the http object:
http.globalAgent.maxSockets = 1000;
The third is to proactively disconnect as soon as the request returns:
    http.get(options, function(res) {
    }).on("socket", function (socket) {
      // Listen for the socket event and emit agentRemove in its callback
      socket.emit("agentRemove");
    });
In practice we chose the first method. After this adjustment, the health check found no further problems.
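As a variation on the second option, the limit can be scoped to a dedicated agent instead of the global one, so that only calls to the Java application are affected. This is not what the article's authors did; it is a sketch using the standard `http.Agent` constructor, and the names `javaAgent` and `javaStatusOptions` are hypothetical.

```javascript
const http = require('http');

// A dedicated pool for calls to the local Java application (hypothetical name).
// Its limit applies only to requests that pass this agent explicitly.
const javaAgent = new http.Agent({ maxSockets: 1000 });

// Build request options for the Java health-check URL using that pool.
function javaStatusOptions() {
  return {
    host: '127.1',
    port: 7001,
    path: '/status.taobao',
    agent: javaAgent,
  };
}
```

Passing `javaStatusOptions()` to `http.get` leaves `http.globalAgent` untouched, so other outgoing requests in the process keep their default pooling behavior.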
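Conclusion
The practice of combining Node.js with traditional business scenarios has only just begun, and there are still many optimizations worth digging into. For example, once the Java application has been fully centralized, could separate-cluster deployment be considered to improve server utilization? Could releases and rollbacks be made more flexible and controllable? Details like these all deserve further study.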