Saturday, February 17, 2007

Path MTU discovery and MTU troubleshooting

Recently, while debugging some performance issues on a client's site, I came across some very interesting behavior. Some users reported that the site performed very well for a short period of time, but after a while performance became poor enough to render the site unusable. Checking the Apache logfiles for the IP addresses of those clients showed that the requests themselves were not taking an unusual amount of time; instead, the requests were arriving at the webserver at a snail's pace.

Checking at the network level, I saw some strange things happening:

prod-lb01:~# tethereal -R "http.request and ip.addr == (client)"
125.362898 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
125.362922 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
126.612994 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
126.613018 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
129.615113 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
129.615135 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
135.616047 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
135.616066 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
Fragmentation Needed (ICMP Type 3/Code 4)? Why would we need to fragment incoming packets? This should only happen if a packet is bigger than the Maximum Transmission Unit (MTU), and since everything here is connected with Ethernet at a constant 1500-byte MTU, it is odd to see this.

Then I remembered this site is using Linux Virtual Server (LVS) for load balancing incoming requests. LVS can be configured in several ways, but this site is using IP-IP aka LVS-Tun load balancing, which encapsulates the incoming IP packet inside another packet and sends that to the destination server. Since this uses IP encapsulation, each request that hits the load balancer will have additional headers tacked on, to address the packet to the appropriate realserver. It happens to add 20 bytes to the header.
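
As a rough sketch of the arithmetic, the snippet below subtracts the 20-byte outer header from the Ethernet MTU and checks whether a full-size packet still fits. The constant and function names are made up for illustration; only the 1500 and 20 byte figures come from this post.

```python
# Sizes in bytes. The 20-byte figure is the outer IP header that IP-IP
# encapsulation adds, as described above. All names here are illustrative.
ETHERNET_MTU = 1500
IPIP_OVERHEAD = 20


def effective_mtu(link_mtu: int, tunnel_overhead: int) -> int:
    """Largest inner IP packet that still fits on the link once encapsulated."""
    return link_mtu - tunnel_overhead


def needs_fragmentation(packet_len: int, link_mtu: int, tunnel_overhead: int) -> bool:
    """True if the inner packet is too big to be tunneled in one piece."""
    return packet_len > effective_mtu(link_mtu, tunnel_overhead)


print(effective_mtu(ETHERNET_MTU, IPIP_OVERHEAD))              # 1480
print(needs_fragmentation(1500, ETHERNET_MTU, IPIP_OVERHEAD))  # True
```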

Okay, so the effective MTU for requests that go through the load balancer is 1480 due to the encapsulation overhead. Snooping for these ICMP messages at the router, I notice that we're sending out a LOT of them:

(router):~# tcpdump -n -i eth7 "icmp[icmptype] & icmp-unreach != 0 and icmp[icmpcode] & 4 != 0"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
17:07:00.608444 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.288197 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.910215 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.927728 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.391218 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.693094 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.912513 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:03.019852 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:03.398335 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
These ICMP messages are not bad per se; they are part of the Path MTU Discovery process. However, many firewalls indiscriminately block ICMP packets of all kinds. Most of the documentation I found while researching this problem was written from the end user's perspective, i.e., users who had PPPoE or other encapsulated/tunneled connections and had trouble reaching certain websites. Now, with the proliferation of personal firewall hardware and software, some of which may be overzealously configured to block all ICMP (even "good" ICMP like PMTU discovery), this is something that server admins have to worry about too, especially when running a load balancing solution that encapsulates packets.
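
To make the failure mode concrete, here is a toy model of a sender that sets the Don't Fragment bit. It is only a sketch under simplifying assumptions: a single bottleneck hop, no retransmission timers, and a hypothetical function name. It is not how any real TCP stack is implemented.

```python
def send_with_pmtud(packet_len, path_mtu, icmp_delivered, max_tries=4):
    """Simulate repeated attempts to send one DF-flagged packet.

    Returns a list of (attempt, size, outcome) tuples.
    """
    log = []
    size = packet_len
    for attempt in range(1, max_tries + 1):
        if size <= path_mtu:
            log.append((attempt, size, "delivered"))
            return log
        # The bottleneck hop drops the packet and emits ICMP type 3 code 4
        # carrying the MTU it can actually forward.
        if icmp_delivered:
            log.append((attempt, size, f"dropped, sender lowers size to {path_mtu}"))
            size = path_mtu
        else:
            # A firewall ate the ICMP message, so the sender learns nothing
            # and retransmits the same oversized packet.
            log.append((attempt, size, "dropped, ICMP filtered, retransmit"))
    return log


for entry in send_with_pmtud(1500, 1480, icmp_delivered=True):
    print(entry)
for entry in send_with_pmtud(1500, 1480, icmp_delivered=False):
    print(entry)
```

With the ICMP message delivered, the second attempt goes through at 1480 bytes. With it filtered, every attempt repeats the same 1500-byte packet, which is the pattern in the first capture above.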

The research I did on the problem pointed me to the following iptables rule to be added on the router:
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu
This is intended to force the advertised Maximum Segment Size (MSS) to be 40 bytes less than the smallest MTU that the router knows about. However, this didn't work for us (the tcpdump line below looks for any TCP handshakes plus any ICMP unreachable errors):

(router):~# tcpdump -vv -n -i eth7 "(host (client) ) and \
(tcp[tcpflags] & tcp-syn != 0 or icmp[icmptype] & icmp-unreach != 0)"
tcpdump: listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
18:00:17.479661 IP (tos 0x0, ttl 53, id 47601, offset 0, flags [DF], length: 52)
(client).1199 > (server).80: S [tcp sum ok] 2541494183:2541494183(0) win 65535
<mss 1460,nop,wscale 2,nop,nop,sackOK>

18:00:17.479861 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], length: 52)
(server).80 > (client).1199: S [tcp sum ok] 2875112671:2875112671(0) ack 2541494184 win 5840
<mss 1460,nop,nop,sackOK,nop,wscale 7>

18:00:17.771080 IP (tos 0xc0, ttl 63, id 10080, offset 0, flags [none], length: 576)
(server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
for IP (tos 0x0, ttl 52, id 47613, offset 0, flags [DF], length: 1500)
(client).1199 > (server).80: . 546:2006(1460) ack 1 win 64240
It was still negotiating a 1460-byte MSS during the handshake. In hindsight, this makes sense, because the router doesn't really know that the MTU of the load balancer and the realservers is actually smaller than 1500: the router communicates with these machines over their Ethernet interfaces, which are all still set to a 1500-byte MTU. Digging some more into the problem (including the LVS-Tun HOWTO linked above), I found quite a few things mentioned, but no real definitive answers.
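
A small model of the clamp shows why nothing changed. It assumes the clamp lowers the MSS to the route MTU minus 40 bytes (20 for the IPv4 header, 20 for the TCP header) and never raises it; the function name is hypothetical and this is a simplification of what the kernel does.

```python
IPV4_HEADER = 20
TCP_HEADER = 20


def clamp_mss_to_pmtu(advertised_mss: int, route_mtu: int) -> int:
    """Simplified clamp: cap the MSS at the MTU minus IPv4 and TCP headers."""
    return min(advertised_mss, route_mtu - IPV4_HEADER - TCP_HEADER)


# The router only sees 1500-byte Ethernet interfaces, so the clamp is a no-op.
print(clamp_mss_to_pmtu(1460, 1500))  # 1460

# If the router knew about the 1480-byte tunnel MTU, it would pick 1440.
print(clamp_mss_to_pmtu(1460, 1480))  # 1440
```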

I chose to fix this problem by hardcoding the MSS to 1440 at the router, rather than using the "clamp-mss-to-pmtu" setting:
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1440:1536 -j TCPMSS --set-mss 1440
1440 is the normal MSS value of 1460 minus the 20-byte overhead of the encapsulation. This seems to have fixed the problem entirely:
(router):~# tcpdump -vv -n -i eth7 "(host (client) ) and \
(tcp[tcpflags] & tcp-syn != 0 or icmp[icmptype] & icmp-unreach != 0)"
tcpdump: listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
18:02:19.466678 IP (tos 0x0, ttl 53, id 55012, offset 0, flags [DF], length: 52)
(client).1298 > (server).80: S [tcp sum ok] 2863214365:2863214365(0) win 65535
<mss 1460,nop,wscale 2,nop,nop,sackOK>

18:02:19.466886 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], length: 52)
(server).80 > (client).1298: S [tcp sum ok] 2996826059:2996826059(0) ack 2863214366 win 5840
<mss 1440,nop,nop,sackOK,nop,wscale 7>

.... silence!
PS - The reason that I was seeing this very odd behavior - very fast at first, followed by an unusable site?
  • The client website had recently added a search history, which was stored in a browser cookie. Things would go great until enough data was in the cookie to push it up over 1440 bytes.
  • I had configured my home DSL router to discard ICMP many years back and had forgotten about it. My firewall was throwing away the ICMP Fragmentation Needed packets, so my PC never "got the memo" that it needed to send smaller packets!
This actually worked out for the better, though - this site had had reports of odd slowness in the recent past, and hopefully this was the root cause!
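
The cookie threshold can be checked with a short sketch. It splits a request into TCP segments and lists any resulting IP packets that exceed the 1480-byte tunnel MTU. It assumes 40 bytes of IPv4 plus TCP headers with no TCP options, which matches the 1500-byte packet carrying 1460 bytes in the capture above. The function names are illustrative.

```python
def segment_sizes(payload_len: int, mss: int) -> list:
    """Split a payload into TCP segment payload sizes of at most `mss` bytes."""
    sizes = []
    remaining = payload_len
    while remaining > 0:
        chunk = min(remaining, mss)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


def oversized_packets(payload_len: int, mss: int, path_mtu: int = 1480,
                      headers: int = 40) -> list:
    """IP packet sizes that would not fit through the tunnel."""
    packets = [size + headers for size in segment_sizes(payload_len, mss)]
    return [p for p in packets if p > path_mtu]


print(oversized_packets(1440, mss=1460))  # [] : small request, everything fits
print(oversized_packets(1441, mss=1460))  # [1481] : one byte too many
print(oversized_packets(2000, mss=1440))  # [] : fixed by the smaller MSS
```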

EDIT: In the original post I had missed an important option: in the iptables config it is important to use the "-m tcpmss --mss 1440:1536" match. Without it, iptables will force the MSS of ALL traffic to 1440, including that of clients which request a smaller size. That obviously presents a problem for those clients.
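
The effect of the match can be sketched as a plain function. This models only the decision described in the EDIT, not iptables itself; the function and parameter names are made up for the example.

```python
def rewrite_mss(advertised_mss: int, low: int = 1440, high: int = 1536,
                new_mss: int = 1440, use_match: bool = True) -> int:
    """Model of '-m tcpmss --mss low:high -j TCPMSS --set-mss new_mss'."""
    if use_match and not (low <= advertised_mss <= high):
        # Outside the matched range: the rule does not fire, MSS is untouched.
        return advertised_mss
    return new_mss


print(rewrite_mss(1460))                   # 1440, lowered as intended
print(rewrite_mss(1380))                   # 1380, small MSS left alone
print(rewrite_mss(1380, use_match=False))  # 1440, wrongly raised without the match
```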

3 comments:

Anonymous said...

Old post, good post!

Anonymous said...

It would be even better if "-t mangle" were not missing from the iptables statements ;-)

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

and

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1440:1536 -j TCPMSS --set-mss 1440

Deepu said...

Awesome Man... Made me clear to debug :)