John’s Tidbits

Moo - Development, Trouble-shooting and Random thoughts…


iptables evilness

Matt came to me with an interesting problem at Bulletproof this week. We have a dedicated hosting customer who talks to an external application for e-commerce. The IP for this was going to change but they needed to do some testing before the switch. As usual with most enterprise applications, the hostname was hard coded. :(

Matt suggested we do some DNS poisoning or do some transparent proxying using squid or similar. While these would have worked they required firewall changes through three levels of firewalls and extra infrastructure.

So I turned to an evil solution, iptables. :) Most people use DNAT on the inbound connection from the Internet to their internal private network to port forward to internal servers, or perform one-to-one NAT mappings. There is nothing stopping you using it the other way around.

Let’s say that every time someone browses to http://bulletproof.net we want them to hit http://inodes.org instead. All you need to do is use DNAT to translate one IP address into the other.
[code]
animal:~ johnf$ host bulletproof.net
bulletproof.net has address 202.44.98.174
animal:~ johnf$ host inodes.org
inodes.org has address 202.125.41.97
animal:~ johnf$ sudo iptables -t nat -A PREROUTING -d 202.44.98.174 -j DNAT --to-destination 202.125.41.97

[/code]

Now for some testing. A ping looks normal

[code]

animal:~johnf$ ping www.bulletproof.net
PING www.bulletproof.net.au (202.44.98.174) 56(84) bytes of data.
64 bytes from 202.44.98.174: icmp_seq=1 ttl=241 time=198 ms

[/code]

but a tcpdump looks like

[code]

animal:~johnf$ sudo tcpdump -ni eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
16:35:25.099510 IP 211.30.227.143 > 202.125.41.97: icmp 64: echo request seq 1
16:35:25.301712 IP 202.125.41.97 > 211.30.227.143: icmp 64: echo reply seq 1

[/code]

Of course if anyone needs to try and debug this they are going to have a really fun time working out what is going on. :)

If you want to test this yourself you can do it on your own machine using the OUTPUT chain instead of PREROUTING.
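As a rough sketch of that local variant, the snippet below only builds and prints the rule rather than applying it, so it can be reviewed first. The helper name `dnat_rule` is made up for illustration; the addresses are the ones from the example above, and `--to-destination` is the long form of the DNAT target option.

```shell
# Hypothetical helper: print a DNAT rule for the nat table.
#   $1 = chain (PREROUTING for forwarded traffic, OUTPUT for local traffic)
#   $2 = original destination IP
#   $3 = IP the traffic should really go to
dnat_rule() {
    echo "iptables -t nat -A $1 -d $2 -j DNAT --to-destination $3"
}

# Locally generated packets traverse OUTPUT, not PREROUTING.
dnat_rule OUTPUT 202.44.98.174 202.125.41.97
```

Running the printed command as root applies the rule; the same command with `-D` in place of `-A` removes it again once testing is done.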

250!

We’ve just hit 250 registrations for linux.conf.au, only 5 days to go before early bird registrations close.

So here are some interesting stats of the attendee breakdown so far:

By Country

Country Number
Brazil 1
Canada 1
France 1
Ireland 1
Liberia 1
Nigeria 1
China 1
Singapore 1
Spain 1
UK 1
Croatia 4
Germany 4
Japan 4
Romania 9
New Zealand 13
USA 18
Australia 188

Australia by state

State Number
NT 1
TAS 3
WA 19
QLD 20
SA 20
ACT 23
VIC 24
NSW 77

ThinkingLinux ‘06

ThinkingLinux ‘06 was held in Melbourne a few days ago. It was organised by Synergy Plus with sponsorship by Red Hat, Novell and a few others.

I gave a talk on Open Source in the Data Centre. Luckily this talk was after lunch so I got to do some editing in the morning sessions to tweak it more towards a business rather than technical audience. :)

The conference was pretty awesome with interesting talks, ranging from Xen to how wotif.com was started.

Copies of the slides for all the talks should eventually make it onto the conference’s website.

Open Source in the Data Centre

Next Tuesday (17th Oct) I’ll be giving a presentation at Thinking Linux ‘06 in Melbourne.

The talk is entitled Open Source in the Data Centre and I’ll be covering things like

  • Load Balancing “Stuff” (IPVS, keepalived, heartbeat)
  • Monitoring using Nagios and MRTG/rrdtool
  • Authentication with OpenLDAP and FreeRADIUS

and a whole lot of other random things I can fit into 40 minutes.

I choose to blame Pia for putting me in a position to give this talk but only because it’s Jeff’s fault and there isn’t a justblamejdub.com :)

If anyone wants to catch up on the Monday night down in Melbourne then let me know.

I’ll put slides up after the event.

Build your own ISP

I’ve finally gotten around to putting up the slides for the Build your own ISP talk I gave at Software Freedom Day and DEBSIG. You can find them on my Presentations page or via the direct link to the PDF here.

The slides are fairly sparse, as the talk was a bit of a brain dump about random things to do with ISPs. I’m sure someone is going to ask me to give it at SLUG again at some stage :)

TCP Window Scaling and kernel 2.6.17+

So I was tearing my hair out today. I’d installed Ubuntu onto a new Sun X4200 so that I could migrate Bulletproof’s monitoring system to it. (Note you need to use edgy knot-1 for the SAS drives to be supported). Anyway as I was installing packages I was getting speeds like 10kB/s. Normally I would expect 800-1000kB/s.

I did the usual sort of debugging: were there any errors on the switch, was it affecting other servers on the same network, etc etc. Everything looked fine. Our friend tcpdump showed a dump that looked something like this.


[code]
root@oldlace:~# tcpdump -ni bond0 port 80
tcpdump: listening on bond0
1.2.3.4.42501 > 203.16.234.85.80: S 0:0 win 5840 <mss 1460,sackOK,timestamp 94318 0,nop,wscale 6> (DF)
203.16.234.85.80 > 1.2.3.4.42501: S 0:0(0) ack 1 win 5840 <mss 1460,nop,wscale 2> (DF)
1.2.3.4.42501 > 203.16.234.85.80: . ack 1 win 92 (DF)
1.2.3.4.42501 > 203.16.234.85.80: P 1:352(351) ack 1 win 92 (DF)
203.16.234.85.80 > 1.2.3.4.42501: . ack 352 win 1608 (DF)
[/code]

You’ll notice that the server initially advertises a window size of 5840, then suddenly in the first ACK it is advertising a size of 92. This means that the other side can only send 92 bytes before waiting for an ACK!!! Not very conducive to quick WAN transfer speeds.

After a lot of Google searching I discovered these threads on LKML.

Of course what I was missing was the wscale 6, which means that the window was actually 92*2^6 = 5888. That is pretty close to 5840, so why bother with the scaling? Because towards the end of the connection we get 16022*2^6 = 1025408, which doesn’t fit into the 16-bit window field of a TCP header.

So why aren’t things screaming along with this massive window? Well, something in the middle doesn’t like a window scaling factor of 6 and is resetting it to zero, which means the other end thinks the window size really is 92.
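The arithmetic is easy to check in a shell. This is just a sketch, and the function name `scaled_window` is made up for illustration: the real window is the advertised value shifted left by the wscale agreed in the SYN packets.

```shell
# Hypothetical helper: effective window = advertised window << wscale.
#   $1 = window value from the TCP header
#   $2 = window scale factor (wscale) from the SYN
scaled_window() {
    echo $(( $1 << $2 ))
}

scaled_window 92 6       # what the sender meant: 5888
scaled_window 16022 6    # later in the connection: 1025408
scaled_window 92 0       # what the far end sees once wscale is zeroed: 92
```

With the scale factor stripped in transit, the far end stops after 92 bytes and waits for an ACK, which matches the crawl seen during the package install.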

There are 2 quick fixes. First you can simply turn off window scaling altogether by doing

echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

but that limits your window to 64k. Or you can limit the size of your TCP buffers back to pre-2.6.17 kernel values, which means a wscale value of about 2 is used, which is acceptable to most broken routers.

echo "4096 16384 131072" > /proc/sys/net/ipv4/tcp_wmem
echo "4096 87380 174760" > /proc/sys/net/ipv4/tcp_rmem

The original values would have had 4MB in the last column above which is what was allowing these massive windows.
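To see why the smaller buffers lead to a smaller wscale, here is a simplified model of how the scale factor can be derived from the maximum receive buffer. It is an assumption about the general approach, not code taken from the kernel, and `wscale_for_buffer` is a made-up name: halve the buffer until it fits in the 16-bit window field and count the shifts, capped at 14.

```shell
# Simplified model (assumption): smallest shift that makes the maximum
# receive buffer fit into a 16-bit window field, capped at 14.
#   $1 = maximum receive buffer in bytes (third column of tcp_rmem)
wscale_for_buffer() {
    space=$1
    wscale=0
    while [ "$space" -gt 65535 ] && [ "$wscale" -lt 14 ]; do
        space=$(( space >> 1 ))
        wscale=$(( wscale + 1 ))
    done
    echo "$wscale"
}

wscale_for_buffer 174760   # the reduced tcp_rmem maximum above: 2
```

The value a real machine picks also depends on other limits, so treat this only as a way to reason about the relationship between buffer size and wscale.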

In a thread somewhere which I can’t find anymore Dave Miller had a great quote along the lines of

“I refuse to workaround it, window scaling has been part of the protocol since 1999, deal with it.”