John’s Tidbits

Moo - Development, Trouble-shooting and Random thoughts…


SFD2006 - Return to sender

Pia posting about Software freedom day, software freedom day online shop is up, reminded me about something I’ve been meaning to post for a while.

When you send in the address to get your team’s t-shirts and goodies, make sure you get it right!

Last year I helped pack all the goodies that we sent overseas, this was sometime in August if I remember correctly. We needed to put a return address on the packages so I offered the use of Bulletproof’s address.

6 months later the following turned up on our doorstep.





Notice the use of hemp rope and wax seal. This box has been through a lot!

linux.conf.au brings about another change

Being Technical Guru for linux.conf.au 2007 was one of the most amazing experiences I’ve had in recent years. It was a lot of hard work but it was totally worth it. Having a room burst into applause at the penguin dinner when you say your the network guy is pretty unbelievable.

I went up to the Hunter for a week to recover from the conference and as usual after linux.conf.au I did a lot of thinking as to whether it was time to try something new. This time change won out at the end of the day and after 6 years at Bulletproof I decided it was time to move on.

At the beginning of March I started as Director of Engineering at Vquence. Since we are a video company it was decided that we each needed to have our own video on the web.

The past three weeks have been so hectic that Bulletproof already seems a lifetime ago. I’ve been involved in everything from setting up the new office and the corporate infrastructure to product development.

Joining a startup right at the beginning is always an amazing experience. With just a few people on the ground you always get pulled in a few million directions and there is always a new challenge just another five minutes away. I definitely recommend anyone else to jump at the opportunity if it ever presents itself.

iptables evilness

Matt came to me with an interesting problem at Bulletproof this week. We have a dedicated hosting customer who talks to an external application for e-commerce. The IP for this was going to change but they needed to do to some testing before the switch. As usual with most enterprise applications, the hostname was hard coded. :(

Matt suggested we do some DNS poisoning or do some transparent proxying using squid or similar. While these would have worked they required firewall changes through three levels of firewalls and extra infrastructure.

So I turned to an evil solution, iptables. :) Most people use DNAT on the inbound connection from the Internet to their internal private network to port forward to internal servers, or perform one-to-one NAT mappings. There is nothing stopping you using it the other way around.

Lets say that every time someone browses to http://bulletproof.net we want them to hit http://inodes.org instead. All you need to do is use DNAT to translate one IP address into the other.
[code]
animal:~ johnf$ host bulletproof.net
bulletproof.net has address 202.44.98.174
animal:~ johnf$ host inodes.org
inodes.org has address 202.125.41.97
animal:~ johnf$ sudo iptables -t nat -A PREROUTING -d 202.44.98.174 -j DNAT –to 202.125.41.97

[/code]

Now for some testing, a ping looks normal

[code]

animal:~johnf$ ping www.bulletproof.net
PING www.bulletproof.net.au (202.44.98.174) 56(84) bytes of data.
64 bytes from 202.44.98.174: icmp_seq=1 ttl=241 time=198 ms

[/code]

but a tcpdump looks like

[code]

animal:~johnf$ sudo tcpdump -ni eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
16:35:25.099510 IP 211.30.227.143 > 202.125.41.97: icmp 64: echo request seq 1
16:35:25.301712 IP 202.125.41.97 > 211.30.227.143: icmp 64: echo reply seq 1

[/code]

Of course if anyone needs to try and debug this they are going to have a really fun time working out what is going on. :)

If you want to test this yourself you can do it on your own machine using the OUTPUT chain instead of PREROUTING.

ThinkingLinux ‘06

ThinkingLinux ‘06 was held in Melbourne a few days ago. It was organised by Synergy Plus with sponsorship by RedHat. Novel and a few others.

I gave a talk on Open Source in the Data Centre. Luckily this talk was after lunch so I got to do some editing in the morning sessions to tweak it more towards a business rather than technical audience. :)

The conference was pretty awesome with interesting talks, ranging from Xen to how wotif.com was started.

Copies of the slides for all the talks should eventually make it onto the conference’s website.

Open Source in the Data Centre

Next Tuesday (17th Oct) I’ll be giving a presentation at Thinking Linux ‘06 in Melbourne.

The talk is entitled Open Source in the Data Centre and I’ll be covering things like

  • Load Balancing “Stuff” (IPVS, keepalived, heartbeat)
  • Monitoring using Nagios and MRTG/rrdtool
  • Authentication with OpenLDAP anf FreeRADIUS

and a whole lot of other random things I can fit into 40 minutes.

I choose to blame Pia for putting me in a position to give this talk but only because it’s Jeff’s fault and there isn’t a justblamejdub.com :)

If anyone wants to catch up on the Monday night down in Melbourne then let me know.

I’ll put slides up after the event.

TCP Window Scaling and kernel 2.6.17+

So I was tearing my hair out today. I’d installed Ubuntu onto a new Sun X4200 so that I could migrate Bulletproof’s monitoring system to it. (Note you need to use edgy knot-1 for the SAS drives to be supported). Anyway as I was installing packages I was getting speeds like 10kB/s. Normally I would expect 800-1000kB/s.

I did the usual sort of debugging, where there any errors on the switch, was it affecting other servers on the same network etc etc. Everything looked fine. Our friend tcpdump showed a dump that looked something like this.


root@oldlace:~# tcpdump -ni bond0 port 80
tcpdump: listening on bond0
1.2.3.4.42501 > 203.16.234.85.80: S 0:0 win 5840 <mss 1460,sackOK,timestamp 94318 0,nop,wscale 6> (DF)
203.16.234.85.80 > 1.2.3.4.42501: S 0:0(0) ack 1 win 5840<mss 1460,nop,wscale 2> (DF)
1.2.3.4.42501 > 203.16.234.85.80: . ack 1 win 92 (DF)
1.2.3.4.42501 > 203.16.234.85.80: P 1:352(351) ack 1 win 92 (DF)
203.16.234.85.80 > 1.2.3.4.42501: . ack 352 win 1608 (DF)

You’ll notice that the server initially advertises a window size of 5840, then suddenly in the first ACK it is advertising a size of 92. This means that the other side can only send 92 bytes before waiting for an ACK!!! Not very conducive to quick WAN transfer speeds.

After a lot of Google searching I discovered these threads on LKLM

Of course what I was missing was the wscale 6, which means that the windows was actually 92*2^6 = 5888. Which is pretty close to 5840 so why bother with the scaling, because towards the end of the connection we get 16022*2^6 = 1025408 which doesn’t normally fit into a TCP header.

So why aren’t things screaming along with this massive window, well something in the middle doesn’t like a windows scaling factor of 6 and is resetting it to zero. Which means the other end thingk the windows size really is 92.

There are 2 quick fixes. First you can simply turn off windows scaling all together by doing

echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

but that limits your window to 64k. Or you can limit the size of your TCP buffers back to pre 2.6.17 kernel values which means a wscale value of about 2 is used which is acceptable to most broken routers.

echo "4096 16384 131072" > /proc/sys/net/ipv4/tcp_wmem
echo "4096 87380 174760" > /proc/sys/net/ipv4/tcp_rmem

The original values would have had 4MB in the last column above which is what was allowing these massive windows.

In a thread somewhere which I can’t find anymore Dave Miller had a great quote along the lines of

“I refuse to workaround it, window scaling has been part of the protocol since 1999, deal with it.”

VMware Consolidated Backup

The last few months have seen me working at an insane pace at Bulletproof in the lead up to a launch of our latest and greatest product Dedicated Virtual Machine Hosting or DVMH for short. I’ll ramble on a bit more about it after it’s launched but basically it is similar to our existing Managed Dedicated Hosting but running on VMware and with a whole heap of cool features due to the benefits of virtualisation.

Today saw me working with one of these cool features, Consolidated Backup. Basically what this lets you do is have a Windows 2003 server directly plugged into the SAN which can directly see all the VM images sitting in the VMFS LUNs. It then talks to the ESX servers takes a snapshot and makes a copy of it t local disk. Hey presto Disaster Recovery. Well mostly anyway, the restoration aspect isn’t all that crash hot as you’ll see below.

Documentation on performing the backups is a bit scarce. VMware provide some scripts that let you tie it in to some commercial backup products like Legato, Veritas and NetBackup but no real docs on how to do it yourself.

So here are some quick examples. (You can find all these commands in C:\Program Files\VMware\VMware Consolidated Backup Framework

Getting a list of VMs on your ESX farm.
[code]
vcbVmName.exe -h VC_HOST -u USERNAME -p PASSWORD -s any:
[/code]

Backing up a VM
[code]
vcbMounter.exe -h VC_HOST -u USERNAME -p PASSWORD -a moref:MOREF -r DESTINATION -t fullvm -m san
[/code]
where MOREF comes from the list you created above and DESTINATION is a local path on your VCB proxy.

You should then strictly unmount it by doing
[code]
vcbMounter.exe -d DESTINATION
[/code]
but I don’t think this does anymore than delete the files, since the snapshot on the ESX server has already been closed.

The above creates something like this
[code]
catalog
MyVM.nvram
MyVM.vmx
scsi0-0-0-MyVM-s001.vmdk
scsi0-0-0-MyVM-s002.vmdk
scsi0-0-0-MyVM-s003.vmdk
scsi0-0-0-MyVM-s004.vmdk
scsi0-0-0-MyVM-s005.vmdk
scsi0-0-0-MyVM.vmdk
unmount.dat
vmware-1.log
vmware-2.log
vmware-3.log
vmware-4.log
vmware-5.log
vmware.log
[/code]

Mounting a VM image locally
[code]
vmmount.exe -d VMDK -cycleId -sysdl LOCATION
[/code]
VMDK needs to be scsi0-0-0-MyVM.vmdk from above.

You can then unmount it by doing
[code]
vmount.exe -u LOCATION
[/code]

This is nice and easy and really useful means you can now easily backup everything to tape.

Recovery is another matter entirely, apparently in the Beta releases vcbRestore was distributed with Consolidated Backup but in the final release it now only exists on the ESX servers. So you need to move your directory above to one of your ESX boxes. You then do

[code]
vcbRestore -h VC_HOST -u USERNAME -p PASSWORD -s DIRECTORY
[/code]

This will totally replace your existing VM, if you wanted a copy then you should copy the catalog file elsewhere, edit it to change the paths and

[code]
vcbRestore -h VC_HOST -u USERNAME -p PASSWORD -s DIRECTORY -a CATALOG
[/code]

There are a couple more features I haven’t mentioned which you can work out for yourself by using -h. eg File level backups for Windows VMs.

Now all of the above is great but VMware have taken things a step further. With the above if your VM is running VMware Tools the equivalent of a sync is done before the snapshot is taken which effectively gives you slightly better than a crash consistent dump. Though you could still lose some DB data.

So VMware have added some functionality to rectify this. Just before the snapshot is made /usr/sbin/pre-freeze-script or C:\Windows\pre-freeze-script.bat is run and /usr/sbin/post-thaw-script or C:\Windows\post-thaw-script.bat are run afterwards. Taking a snapshot only takes a few minutes so you could use these scripts to stop your database for example.

I highly recommend reading the VMware Consolidated Backup manual for all the extra features I haven’t covered.