Recently I had to troubleshoot a network timeout problem. The troubleshooting roughly followed the diagram:
1. Rule out problems in the code logic, possible TCP-related bugs, kernel parameters, and similar issues;
2. While investigating KVM, the timeout was reproduced between different KVM guests on the same host.
Most of the abnormal connections lasted about 1 second. Packet capture analysis showed that the packets in these connections had been retransmitted, and that the retransmission interval was fixed at 1 second.
Why is the retransmission time 1 second here? What do the relevant standards say, and what does the actual implementation do?
This article mainly discusses these questions (based on CentOS, kernel 2.6.32-358).
The retransmission timeout (RTO) is computed by an algorithm from the current network conditions (the round-trip time, RTT). "TCP/IP Illustrated, Volume 1" covers this topic, but its treatment is out of date.
Looking through the RFCs, the latest one on the retransmission timeout is RFC 6298, which updates RFC 1122 and obsoletes RFC 2988.
Here is a brief summary of its content; if you are interested, you can read the RFC itself.
RFC6298
1 It restates the basic method of calculating the RTO:
First there is a time parameter RTO_MIN obtained from the clock (the clock granularity, called G in the RFC)
Initialization (before any RTT sample is available):
    RTO = 1 second
First calculation (first RTT sample R):
    SRTT = R
    RTTVAR = R / 2
    RTO = SRTT + max(RTO_MIN, 4 * RTTVAR)
Later calculations (each subsequent RTT sample R'):
    RTTVAR = (1 - 1/4) * RTTVAR + 1/4 * |SRTT - R'|
    SRTT = (1 - 1/8) * SRTT + 1/8 * R'
    RTO = SRTT + max(RTO_MIN, 4 * RTTVAR)
The RTO is recommended to be at least 1 second. If an upper bound is placed on the RTO, it must be at least 60 seconds
2 For repeated retransmissions of the same packet, Karn's algorithm must be used, which includes the doubling of the RTO seen in the packet captures
In addition, retransmitted packets cannot be used for RTT sampling unless the timestamps option is enabled (with it, the RTT can be measured accurately even for retransmitted packets)
3 When 4*RTTVAR tends to 0, the variance term must be rounded up to RTO_MIN
Experience shows that a finer clock is better; a granularity of 100 ms or less works best
4 RTO timer management
(1) Whenever data is sent (including a retransmission), check whether the timer is running, and start it if it is not. Stop the timer once all outstanding data has been acknowledged
(2) When the timer expires, back off with RTO = RTO * 2
(3) New FALLBACK feature: if the timer expires while waiting for the ACK of a SYN segment and the TCP implementation is using an RTO of less than 3 seconds, the RTO of that connection must be reset to 3 seconds, and the reset RTO is used once regular data transmission begins (that is, after the three-way handshake completes)
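The rules above can be put together in a small sketch. This is only an illustration of the RFC 6298 rules as summarized here, not code from any real TCP stack: the class name `Rfc6298Timer`, its method names, and the default granularity of 1 ms are assumptions made for the example.

```python
class Rfc6298Timer:
    """Illustrative RTO state following the RFC 6298 rules summarized above.

    All times are in seconds. The names are made up for this sketch.
    """

    def __init__(self, granularity=0.001, rto_floor=1.0, rto_ceiling=60.0):
        self.granularity = granularity    # RTO_MIN in the text (G in the RFC)
        self.rto_floor = rto_floor        # recommended lower bound on the RTO
        self.rto_ceiling = rto_ceiling    # optional upper bound, at least 60 s
        self.initial_rto = 1.0            # RTO before any RTT sample exists
        self.rto = self.initial_rto
        self.srtt = None
        self.rttvar = None
        self.syn_timed_out = False

    def on_rtt_sample(self, r):
        """Feed one RTT sample taken from a segment that was not retransmitted."""
        if self.srtt is None:             # first calculation
            self.srtt = r
            self.rttvar = r / 2
        else:                             # later calculations
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - r)
            self.srtt = 0.875 * self.srtt + 0.125 * r
        rto = self.srtt + max(self.granularity, 4 * self.rttvar)
        self.rto = min(max(rto, self.rto_floor), self.rto_ceiling)

    def on_timeout(self, waiting_for_syn_ack=False):
        """Timer expired: remember SYN timeouts and double the RTO."""
        if waiting_for_syn_ack:
            self.syn_timed_out = True
        self.rto = min(self.rto * 2, self.rto_ceiling)

    def on_handshake_complete(self):
        """FALLBACK: data transfer starts with RTO = 3 s if the SYN timed out."""
        if self.syn_timed_out and self.initial_rto < 3.0:
            self.rto = 3.0


timer = Rfc6298Timer()
timer.on_timeout(waiting_for_syn_ack=True)   # SYN lost once, RTO doubles to 2 s
timer.on_handshake_complete()
print(timer.rto)                             # 3.0
```

With the default `rto_floor` of 1 second, a sample such as 0.1 s still yields an RTO of 1 second, which is why the RFC floor dominates on fast networks.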
Sending the SYN packet of the three-way handshake
    01:00:00.129688 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:00:01.129065 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:00:03.129063 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:00:07.129074 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:00:15.129072 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:00:31.129128 IP 172.16.3.14.1868 > 172.16.10.40.80: Flags [S], seq 3774079837, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
The interval doubles, starting from 1 second
It is worth noting that after the fifth retransmission the kernel waits for one more timeout (32 seconds) before notifying the upper layer that the connection timed out, for a total of 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds
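The 63 seconds can be checked with a few lines of arithmetic. The function name `syn_timeout_schedule` is made up for this sketch; the initial RTO of 1 second and the five retransmissions are taken from the capture above.

```python
def syn_timeout_schedule(initial_rto=1.0, retries=5):
    """Return every timer wait for a SYN that is never answered.

    The original SYN plus `retries` retransmissions each start one timer,
    so there are retries + 1 waits, each twice as long as the previous one.
    """
    return [initial_rto * 2 ** i for i in range(retries + 1)]


waits = syn_timeout_schedule()
print(waits)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(sum(waits))  # 63.0
```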
Sending the SYN-ACK packet of the three-way handshake
    01:17:20.084839 IP 172.16.3.15.2535 > 172.16.3.14.80: Flags [S], seq 1297135388, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:17:20.084908 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:17:21.284093 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:17:23.284088 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:17:27.284095 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:17:35.284097 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    01:17:51.284093 IP 172.16.3.14.80 > 172.16.3.15.2535: Flags [S.], seq 1194120443, ack 1297135389, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
The interval doubles, starting from about 1 second
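The gaps can also be read off the capture mechanically. The sketch below takes the timestamps of the SYN-ACK and its retransmissions from the capture above and prints the interval between consecutive packets. The helper name `retransmission_gaps` is made up for this example, and it assumes all timestamps fall on the same day.

```python
from datetime import datetime


def retransmission_gaps(timestamps):
    """Return the gaps in seconds between consecutive tcpdump timestamps."""
    times = [datetime.strptime(ts, "%H:%M:%S.%f") for ts in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# SYN-ACK and its five retransmissions, copied from the capture above.
synack_times = [
    "01:17:20.084908",
    "01:17:21.284093",
    "01:17:23.284088",
    "01:17:27.284095",
    "01:17:35.284097",
    "01:17:51.284093",
]

print([round(gap, 1) for gap in retransmission_gaps(synack_times)])
# [1.2, 2.0, 4.0, 8.0, 16.0]
```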
Normal data packet sending
    01:32:20.443757 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:32:20.644600 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:32:21.046579 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:32:21.850632 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:32:23.458555 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:32:26.674594 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:32:33.106601 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:32:45.970567 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:33:11.698415 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:34:03.154300 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:35:46.065892 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:37:46.065382 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:39:46.064917 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:41:46.064466 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:43:46.064060 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
    01:45:46.063675 IP 172.16.3.15.2548 > 172.16.3.14.80: Flags [P.], seq 3319667389:3319667400, ack 1233846614, win 115, length 11
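The interval doubles starting from 0.2 seconds, up to a maximum of 120 seconds, for a total of 15 retransmissions
It is worth noting that the first packet is sent at 01:32 and the connection is given up at about 01:47 (120 seconds after the last retransmission), which is about 15 minutes and 25 seconds in total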
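The 15 minutes and 25 seconds follow from the same doubling rule once the 120 second cap is applied. A minimal sketch, with the function name `data_timeout_schedule` made up for the example and the values 0.2 s, 15 retransmissions, and 120 s taken from the capture above:

```python
def data_timeout_schedule(initial_rto=0.2, retries=15, rto_max=120.0):
    """Return every timer wait for a data packet that is never acknowledged.

    The original packet plus `retries` retransmissions each start one timer.
    Each wait doubles the previous one but never exceeds rto_max.
    """
    waits = []
    rto = initial_rto
    for _ in range(retries + 1):
        waits.append(rto)
        rto = min(rto * 2, rto_max)
    return waits


waits = data_timeout_schedule()
total = sum(waits)
minutes, seconds = divmod(total, 60)
print(f"{len(waits) - 1} retransmissions, {int(minutes)} min {seconds:.1f} s")
# 15 retransmissions, 15 min 24.6 s
```

Ten of the waits double from 0.2 s to 102.4 s and the remaining six are capped at 120 s, which gives 204.6 + 720 = 924.6 seconds.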
To check whether Linux supports the FALLBACK feature, run a simple test
    # With iptables enabled on the server, the client connects to the server; iptables is disabled within 5 timeouts
    23:35:01.036565 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    23:35:02.036152 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    23:35:04.036126 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    23:35:08.036127 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    23:35:16.036131 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [S], seq 2364912154, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
    23:35:16.036842 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [S.], seq 3634006739, ack 2364912155, win 14600, options [mss 1460], length 0
    23:35:16.036896 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [.], ack 3634006740, win 14600, length 0
    # Next, with iptables enabled on the server again, the client sends a data packet; iptables is disabled within 15 timeouts
    23:35:48.129273 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
    23:35:51.129120 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
    23:35:57.129070 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
    23:36:09.129068 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912155:2364912156, ack 3634006740, win 14600, length 1
    23:36:09.129802 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [.], ack 2364912156, win 14600, length 0
    # Next, with iptables disabled on the server, the client sends a data packet
    23:36:15.217231 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912156:2364912157, ack 3634006740, win 14600, length 1
    23:36:15.217766 IP 172.16.10.40.12345 > 172.16.3.14.6071: Flags [.], ack 2364912157, win 14600, length 0
    # Next, with iptables enabled on the server, the client sends a data packet
    23:36:26.658172 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
    23:36:26.859055 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
    23:36:27.261065 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
    23:36:28.065106 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
    23:36:29.673132 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
    23:36:32.889068 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
    23:36:39.321091 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
    23:36:52.185135 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
    23:37:17.913091 IP 172.16.3.14.6071 > 172.16.10.40.12345: Flags [P.], seq 2364912157:2364912158, ack 3634006740, win 14600, length 1
This test shows that when the SYN had to be retransmitted during the three-way handshake (the handshake took longer than 1 second), the RTO of the data phase starts at 3 seconds (the same holds for a SYN-ACK timeout on the server side)
Then, after one normal RTT sample, the RTO converges back to around 200 ms
Now let's see what happens when timestamps are enabled
    # With iptables enabled on the server, the client connects to the server; iptables is disabled within 5 timeouts
    23:47:47.754316 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336007392 ecr 0,nop,wscale 7], length 0
    23:47:48.754079 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336008392 ecr 0,nop,wscale 7], length 0
    23:47:50.754088 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336010392 ecr 0,nop,wscale 7], length 0
    23:47:54.754083 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336014392 ecr 0,nop,wscale 7], length 0
    23:48:02.754094 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [S], seq 479022248, win 14600, options [mss 1460,sackOK,TS val 2336022392 ecr 0,nop,wscale 7], length 0
    23:48:02.754683 IP 172.16.10.40.12345 > 172.16.3.14.8603: Flags [S.], seq 697602971, ack 479022249, win 14480, options [mss 1460,nop,nop,TS val 4044659641 ecr 2336022392], length 0
    23:48:02.754742 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [.], ack 697602972, win 14600, options [nop,nop,TS val 2336022392 ecr 4044659641], length 0
    # Next, with iptables enabled on the server again, the client sends a data packet; iptables is disabled within 15 timeouts
    23:48:11.944170 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336031582 ecr 4044659641], length 1
    23:48:12.145036 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336031783 ecr 4044659641], length 1
    23:48:12.547084 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336032185 ecr 4044659641], length 1
    23:48:13.351106 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336032989 ecr 4044659641], length 1
    23:48:14.959080 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336034597 ecr 4044659641], length 1
    23:48:18.175092 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336037813 ecr 4044659641], length 1
    23:48:24.607088 IP 172.16.3.14.8603 > 172.16.10.40.12345: Flags [P.], seq 479022249:479022250, ack 697602972, win 14600, options [nop,nop,TS val 2336044245 ecr 4044659641], length 1
You can see that with timestamps enabled, the FALLBACK mechanism that resets the RTO to 3 seconds does not take effect
The actual RTO calculation in Linux still differs from the RFC document. If you follow the RFC to the letter, your RTO estimates will go astray
1 From the captures in the previous section, Linux sets the minimum RTO to 200 ms (on Ubuntu it is even 50 ms, while the RFC recommends 1 second) and the maximum RTO to 120 seconds (the RFC requires at least 60 seconds)
2 According to my analysis of the Linux code, when the RTT jitters severely, the Linux implementation dampens the effect of sharply changing RTT samples, making the RTO trend smoother
This shows up in two fine-tunings:
When the following condition is met
    R' < SRTT and SRTT - R' > RTTVAR
it means that R' has changed too sharply: its distance from the smoothed RTT value is larger than RTTVAR
so Linux uses
    RTTVAR = (1 - 1/32) * RTTVAR + 1/32 * |SRTT - R'|
whereas the RFC document specifies
    RTTVAR = (1 - 1/4) * RTTVAR + 1/4 * |SRTT - R'|
As you can see, compared with the RFC document, the smoothing coefficient is multiplied by 1/8 (1/4 * 1/8 = 1/32). This reduces the influence of R' on RTTVAR, so RTTVAR, and with it the RTO, changes more smoothly
When RTTVAR decreases, the decrease is smoothed so that the RTO does not fall too quickly and produce a steep drop in the trend
    if RTTVAR' < RTTVAR: RTTVAR = (1 - 1/4) * RTTVAR + 1/4 * RTTVAR'
Here RTTVAR' is the current value calculated from the latest RTT, with RTO_MIN applied as its lower limit. It is compared with the RTTVAR from the previous RTT, and when it is smaller, the decrease is smoothed with a coefficient of 1/4
Why is an increase not treated the same way? I think it is because an RTO that grows is harmless, while an RTO that shrinks sharply may cause spurious retransmissions (for details on this term, see the RFC document mentioned above)
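The two fine-tunings can be sketched in floating-point form as follows. This is a simplified re-expression based on my reading of `tcp_rtt_estimator` in the 2.6.32 kernel, not the kernel code itself. The class name `LinuxLikeRttEstimator` and the attribute `var_term` (the value added to SRTT, playing the role of 4 * RTTVAR) are made up for the example, and the sketch assumes exactly one RTT sample per round trip.

```python
class LinuxLikeRttEstimator:
    """Simplified float model of the Linux RTT estimator (times in seconds)."""

    def __init__(self, rto_min=0.2, rto_max=120.0):
        self.rto_min = rto_min
        self.rto_max = rto_max
        self.srtt = None
        self.mdev = 0.0       # mean deviation of the RTT samples
        self.var_term = 0.0   # added to SRTT to form the RTO

    def on_rtt_sample(self, r):
        """Update the estimator with one sample and return the new RTO."""
        if self.srtt is None:
            # First sample: RTO becomes 3 * R unless rto_min is larger than 2R.
            self.srtt = r
            self.mdev = r / 2
            self.var_term = max(4 * self.mdev, self.rto_min)
            return self.rto()

        err = r - self.srtt
        self.srtt += err / 8                 # SRTT = 7/8 SRTT + 1/8 R'
        dev = abs(err) - self.mdev
        if err < 0 and dev > 0:
            dev /= 8                         # fine-tuning 1: gain 1/32, not 1/4
        self.mdev += dev / 4

        candidate = max(4 * self.mdev, self.rto_min)
        if candidate > self.var_term:
            self.var_term = candidate        # increases take effect at once
        else:
            # fine-tuning 2: decreases are smoothed with a 1/4 coefficient
            self.var_term -= (self.var_term - candidate) / 4
        return self.rto()

    def rto(self):
        return min(self.srtt + self.var_term, self.rto_max)


est = LinuxLikeRttEstimator(rto_min=0.02)
print(round(est.on_rtt_sample(0.027), 3))   # 0.081
```

The printed value matches the first-sample rule RTO = 3R for R = 27 ms when rto_min is below 2R.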
Back to the original question: can the RTO be shortened, and how can its value be estimated from the actual Linux implementation?
Obviously the initial RTO value (including FALLBACK) cannot be changed; it is hard-coded
The RTO outside the three-way handshake can be estimated
For the estimate, assume that the network is stable and the RTT stays constant at R (otherwise fine-tunings 1 and 2 make things extremely complicated)
Then SRTT is always R, and RTTVAR is always 0.5R
If RTO_MIN < 2R
    RTO = SRTT + 4 * RTTVAR = 3R
Otherwise
    RTO = SRTT + RTO_MIN = R + RTO_MIN
So just changing the value of RTO_MIN can significantly affect the RTO
RTO_MIN is set through ip route
    [root@localhost.localdomain ~]# ping www.baidu.com
    PING www.a.shifen.com (180.97.33.107) 56(84) bytes of data.
    64 bytes from 180.97.33.107: icmp_seq=1 ttl=51 time=30.8 ms
    64 bytes from 180.97.33.107: icmp_seq=2 ttl=51 time=29.9 ms
    # After obtaining Baidu's IP
    [root@localhost.localdomain ~]# ip route add 180.97.33.108/32 via 172.16.3.1 rto_min 20
    [root@localhost.localdomain ~]# nc www.baidu.com 80
    [root@localhost.localdomain ~]# ss -eipn '( dport =:www )'
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 0 172.16.3.14:14149 180.97.33.108:80 users:(("nc",7162,3)) ino:48057454 sk:ffff88023905adc0
    sack cubic wscale:7,7 rto:81 rtt:27/13.5 cwnd:10 send 4.3Mbps rcv_space:14600
Because RTO_MIN (20 ms) < 2R (54 ms), RTO = 3R = 3 * 27 = 81 ms
If it is an intranet, the RTT is very small
    [root@localhost.localdomain ~]# ip route add 172.16.3.16/32 via 172.16.3.1 rto_min 20
    [root@localhost.localdomain ~]# nc 172.16.3.16 22
    SSH-2.0-OpenSSH_5.3
    [root@localhost.localdomain ~]# ss -eipn '( dport =:22 )'
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 0 172.16.3.14:57578 172.16.3.16:22 users:(("nc",7272,3)) ino:48059707 sk:ffff88023b7c7000
    sack cubic wscale:7,7 rto:21 rtt:1/0.5 ato:40 cwnd:10 send 116.8Mbps rcv_space:14600
Because RTO_MIN (20 ms) > 2R (2 ms), RTO = R + RTO_MIN = 1 + 20 = 21 ms
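Both ss readings can be checked against the stable-network estimate with a one-line function. The name `estimate_rto` is made up for this sketch; it assumes, as in the text, that SRTT = R and RTTVAR = 0.5R, and it works in milliseconds to match the ss output.

```python
def estimate_rto(rtt_ms, rto_min_ms):
    """Estimate the RTO on a stable network where SRTT = R and RTTVAR = R / 2.

    RTO = R + max(4 * RTTVAR, RTO_MIN) = R + max(2R, RTO_MIN)
    """
    return rtt_ms + max(2.0 * rtt_ms, float(rto_min_ms))


print(estimate_rto(27, 20))    # 81.0  (Internet example, RTO_MIN < 2R)
print(estimate_rto(1, 20))     # 21.0  (intranet example, RTO_MIN > 2R)
print(estimate_rto(1, 200))    # 201.0 (same intranet RTT with the default 200 ms)
```

The last line shows why lowering rto_min matters so much on an intranet: with the default minimum, almost the whole RTO is the 200 ms floor.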
If you are confident about the entire intranet network, you can also directly apply it to all connections without setting the target IP, as follows
    ip route change dev eth0 rto_min 20ms
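After changing rto_min, the effect on a connection can be confirmed from the `rto` and `rtt` fields that ss prints. A minimal parsing sketch follows; the function name `parse_ss_timers` is made up for the example, and the sample line is copied from the intranet output above.

```python
import re


def parse_ss_timers(info_line):
    """Extract rto, rtt and rttvar (in ms) from the info line of `ss -i`.

    Returns None if the line does not contain both fields.
    """
    rto = re.search(r"\brto:([\d.]+)", info_line)
    rtt = re.search(r"\brtt:([\d.]+)/([\d.]+)", info_line)
    if rto is None or rtt is None:
        return None
    return {
        "rto": float(rto.group(1)),
        "rtt": float(rtt.group(1)),
        "rttvar": float(rtt.group(2)),
    }


line = "sack cubic wscale:7,7 rto:21 rtt:1/0.5 ato:40 cwnd:10 send 116.8Mbps rcv_space:14600"
print(parse_ss_timers(line))
# {'rto': 21.0, 'rtt': 1.0, 'rttvar': 0.5}
```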
1 Linux's retransmission timeout implementation generally follows the RFC, but with some fine-tuning:
The RFC has only one initial RTO value, 1 second. Linux sets the RTO of packets in the three-way handshake phase to 1 second, and the initial RTO of all other packets to 0.2 seconds
Because the algorithm specified by the RFC is imperfect, the actual Linux implementation dampens the effect of sharply changing RTT samples when the RTT jitters severely, making the RTO trend smoother
2 The SYN retransmission time of a connection cannot be adjusted without recompiling the kernel, but the retransmission time of PSH packets can be adjusted
3 In a relatively stable network with RTT R, if the configured minimum RTO is RTO_MIN, then RTO = 3R when RTO_MIN < 2R, and RTO = R + RTO_MIN otherwise