TCP Window Scaling and kernel 2.6.17+

So I was tearing my hair out today. I’d installed Ubuntu onto a new Sun X4200 so that I could migrate Bulletproof’s monitoring system to it. (Note you need to use edgy knot-1 for the SAS drives to be supported). Anyway as I was installing packages I was getting speeds like 10kB/s. Normally I would expect 800-1000kB/s.

I did the usual sort of debugging, where there any errors on the switch, was it affecting other servers on the same network etc etc. Everything looked fine. Our friend tcpdump showed a dump that looked something like this.


root@oldlace:~# tcpdump -ni bond0 port 80
tcpdump: listening on bond0
1.2.3.4.42501 > 203.16.234.85.80: S 0:0 win 5840 <mss 1460,sackOK,timestamp 94318 0,nop,wscale 6> (DF)
203.16.234.85.80 > 1.2.3.4.42501: S 0:0(0) ack 1 win 5840<mss 1460,nop,wscale 2> (DF)
1.2.3.4.42501 > 203.16.234.85.80: . ack 1 win 92 (DF)
1.2.3.4.42501 > 203.16.234.85.80: P 1:352(351) ack 1 win 92 (DF)
203.16.234.85.80 > 1.2.3.4.42501: . ack 352 win 1608 (DF)

You’ll notice that the server initially advertises a window size of 5840, then suddenly in the first ACK it is advertising a size of 92. This means that the other side can only send 92 bytes before waiting for an ACK!!! Not very conducive to quick WAN transfer speeds.

After a lot of Google searching I discovered these threads on LKLM

Of course what I was missing was the wscale 6, which means that the windows was actually 92*2^6 = 5888. Which is pretty close to 5840 so why bother with the scaling, because towards the end of the connection we get 16022*2^6 = 1025408 which doesn’t normally fit into a TCP header.

So why aren’t things screaming along with this massive window, well something in the middle doesn’t like a windows scaling factor of 6 and is resetting it to zero. Which means the other end thingk the windows size really is 92.

There are 2 quick fixes. First you can simply turn off windows scaling all together by doing

echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

but that limits your window to 64k. Or you can limit the size of your TCP buffers back to pre 2.6.17 kernel values which means a wscale value of about 2 is used which is acceptable to most broken routers.

echo "4096 16384 131072" > /proc/sys/net/ipv4/tcp_wmem
echo "4096 87380 174760" > /proc/sys/net/ipv4/tcp_rmem

The original values would have had 4MB in the last column above which is what was allowing these massive windows.

In a thread somewhere which I can’t find anymore Dave Miller had a great quote along the lines of

“I refuse to workaround it, window scaling has been part of the protocol since 1999, deal with it.”

VMware Consolidated Backup

The last few months have seen me working at an insane pace at Bulletproof in the lead up to a launch of our latest and greatest product Dedicated Virtual Machine Hosting or DVMH for short. I’ll ramble on a bit more about it after it’s launched but basically it is similar to our existing Managed Dedicated Hosting but running on VMware and with a whole heap of cool features due to the benefits of virtualisation.

Today saw me working with one of these cool features, Consolidated Backup. Basically what this lets you do is have a Windows 2003 server directly plugged into the SAN which can directly see all the VM images sitting in the VMFS LUNs. It then talks to the ESX servers takes a snapshot and makes a copy of it t local disk. Hey presto Disaster Recovery. Well mostly anyway, the restoration aspect isn’t all that crash hot as you’ll see below.

Documentation on performing the backups is a bit scarce. VMware provide some scripts that let you tie it in to some commercial backup products like Legato, Veritas and NetBackup but no real docs on how to do it yourself.

So here are some quick examples. (You can find all these commands in C:Program FilesVMwareVMware Consolidated Backup Framework

Getting a list of VMs on your ESX farm.
[code]
vcbVmName.exe -h VC_HOST -u USERNAME -p PASSWORD -s any:
[/code]

Backing up a VM
[code]
vcbMounter.exe -h VC_HOST -u USERNAME -p PASSWORD -a moref:MOREF -r DESTINATION -t fullvm -m san
[/code]
where MOREF comes from the list you created above and DESTINATION is a local path on your VCB proxy.

You should then strictly unmount it by doing
[code]
vcbMounter.exe -d DESTINATION
[/code]
but I don’t think this does anymore than delete the files, since the snapshot on the ESX server has already been closed.

The above creates something like this
[code]
catalog
MyVM.nvram
MyVM.vmx
scsi0-0-0-MyVM-s001.vmdk
scsi0-0-0-MyVM-s002.vmdk
scsi0-0-0-MyVM-s003.vmdk
scsi0-0-0-MyVM-s004.vmdk
scsi0-0-0-MyVM-s005.vmdk
scsi0-0-0-MyVM.vmdk
unmount.dat
vmware-1.log
vmware-2.log
vmware-3.log
vmware-4.log
vmware-5.log
vmware.log
[/code]

Mounting a VM image locally
[code]
vmmount.exe -d VMDK -cycleId -sysdl LOCATION
[/code]
VMDK needs to be scsi0-0-0-MyVM.vmdk from above.

You can then unmount it by doing
[code]
vmount.exe -u LOCATION
[/code]

This is nice and easy and really useful means you can now easily backup everything to tape.

Recovery is another matter entirely, apparently in the Beta releases vcbRestore was distributed with Consolidated Backup but in the final release it now only exists on the ESX servers. So you need to move your directory above to one of your ESX boxes. You then do

[code]
vcbRestore -h VC_HOST -u USERNAME -p PASSWORD -s DIRECTORY
[/code]

This will totally replace your existing VM, if you wanted a copy then you should copy the catalog file elsewhere, edit it to change the paths and

[code]
vcbRestore -h VC_HOST -u USERNAME -p PASSWORD -s DIRECTORY -a CATALOG
[/code]

There are a couple more features I haven’t mentioned which you can work out for yourself by using -h. eg File level backups for Windows VMs.

Now all of the above is great but VMware have taken things a step further. With the above if your VM is running VMware Tools the equivalent of a sync is done before the snapshot is taken which effectively gives you slightly better than a crash consistent dump. Though you could still lose some DB data.

So VMware have added some functionality to rectify this. Just before the snapshot is made /usr/sbin/pre-freeze-script or C:Windowspre-freeze-script.bat is run and /usr/sbin/post-thaw-script or C:Windowspost-thaw-script.bat are run afterwards. Taking a snapshot only takes a few minutes so you could use these scripts to stop your database for example.

I highly recommend reading the VMware Consolidated Backup manual for all the extra features I haven’t covered.

Hmm Blogging

So I’ve decided to give this blogging thing a go. Lets see how long I keep it up for…