Hardy, exim4, SMTP-AUTH and LDAP… (or debian openssl causes pain)

As most people will know, yesterday's Debian OpenSSL advisory caused a lot of people a lot of pain as they ran around replacing SSH keys and SSL certificates.

While running around fixing up all our servers, most of them in one fell swoop thanks to puppet, I realised two of our servers were still running Edgy. I figured it was high time I moved them to Hardy.

Everything went fairly smoothly with some minor hiccups, except for SMTP-AUTH in exim. We use LDAP-backed SMTP-AUTH, and it just wouldn't work after the upgrade. The following error was appearing in the logs:

ldap_search failed: -7, Bad search filter	

This led to hours upon hours of Google searches, staring at debug messages, and at one stage even resorting to GDB. Eventually, after staring at the debug output harder, it twigged when I saw the following:

perform_ldap_search: ldapdn URL = "ldap:///ou=people,o=vquence?dn?sub?(uid=moo) "

Notice the space just before the closing double quote. It seems that the new openldap libraries don’t like errant spaces in your search filter.
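For the curious, the lookup lives inside the exim SMTP-AUTH authenticator, something along the lines of the sketch below (reconstructed and simplified from memory, so treat the details as illustrative rather than a copy of our config); the fix was simply deleting the stray space before the closing brace:

plain_ldap:
  driver = plaintext
  public_name = PLAIN
  # the errant space after the closing parenthesis, just before the },
  # is what produced "ldap_search failed: -7, Bad search filter"
  server_condition = ${lookup ldap{ldap:///ou=people,o=vquence?dn?sub?(uid=${quote_ldap:$auth2}) }{yes}{no}}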

Now to remember what I was doing yesterday morning before this whole derailment began.

Note: Before anyone comments I will completely deny that during these upgrades I did anything as silly as rm -rf `dpkg -L random-font-package`, no matter what twitter says.

Hardy and password locking

passwd -l root

In Gutsy the above would simply lock the account by placing a ! in front of the password hash in your /etc/shadow file.

In Hardy it now also marks the account as expired, meaning you can't SSH in even if you have SSH keys in place.

Time to go and rebuild my EC2 AMI. 🙁

Update: To get the old behaviour back you can do the following:

passwd -l root
usermod -e "" root
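For reference, the difference shows up in /etc/shadow; the second-to-last field is the account expiry date, which on my Hardy box the new passwd -l was setting (to 1, i.e. the 2nd of January 1970) and which usermod -e "" empties again. With made-up hashes and dates, the two forms look like:

root:!$1$AbCdEfGh$0123456789abcdefghijkl:13900:0:99999:7:::      <- locked only (Gutsy behaviour)
root:!$1$AbCdEfGh$0123456789abcdefghijkl:13900:0:99999:7::1:     <- locked and expired (Hardy behaviour)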

Puppet, Facts and Certificates

I’m currently setting up Puppet at Vquence so that, among other things, we can deploy hosts into Amazon EC2 more easily.

To ensure a minimum setup time on a new server, I wanted the setup to be as simple as:

  • echo 'DAEMON_OPTS="-w 120 --fqdn newserver.vquence.com --server puppetmaster.vquence.com"' > /etc/default/puppet
  • aptitude install puppet

This means that the puppet client will use newserver.vquence.com as the common name in the SSL certificate it creates for itself. On the puppet master the SSL cert name is then used to pick a node rather than the hostname reported by facter.

That way I don't need to worry about setting up /etc/hostname first; even better, /etc/hostname can itself be managed by puppet.
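As a (made-up) illustration of the payoff, the node block keyed off the certificate's CN can manage /etc/hostname itself:

node "newserver.vquence.com" {
    # keep the hostname in step with the name baked into the SSL certificate
    file { "/etc/hostname":
        content => "newserver\n",
        owner   => root,
        group   => root,
        mode    => 644,
    }
}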

You can control this functionality on the puppet master by using the node_name option. From the docs:

    # How the puppetmaster determines the client's identity 
    # and sets the 'hostname' fact for use in the manifest, in particular 
    # for determining which 'node' statement applies to the client. 
    # Possible values are 'cert' (use the subject's CN in the client's 
    # certificate) and 'facter' (use the hostname that the client 
    # reported in its facts)
    # The default value is 'cert'.
    # node_name = cert
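So the default already does what I want. If you ever need to set it explicitly, it lives on the puppetmaster in puppet.conf (the section is named for the daemon in the 0.24.x series):

[puppetmasterd]
    # identify nodes by the CN in their certificate, not the hostname fact
    node_name = cert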

The problem was that the ‘hostname’ fact wasn’t being set. It looks like there was a regression in SVN#1673 when some refactoring was performed.

I’ve filed bug #1133 and you can clone my git repository.

I haven’t included any tests in the patch as I’m not sure how to. The master.rb test already tests this functionality but doesn’t test that the facts object has actually been changed. I think a test on getconfig is probably required but I’m not sure how you would access the facts after calling it.

Update: This patch is now in puppet as of 0.24.3.

Amazon EC2 ruby gem and large user_data

When you create an instance in EC2 you can send Amazon some user data that is accessible by your instance. At Vquence we use this to send a script that gets executed at boot. This script contains some OpenVPN and Puppet RSA keys, so it's approaching 10k in size.
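For those who haven't played with it, the instance pulls the data back down from the metadata service at boot, so the first-boot hook can be as simple as something like:

# grab the user data we passed at launch and run it
curl -s http://169.254.169.254/latest/user-data > /tmp/user-data.sh
sh /tmp/user-data.sh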

Sending this much user data works without any problems when using the Java-based command line tools. However, I was getting the following error when using the EC2 Ruby gem:

/usr/lib/ruby/1.8/net/protocol.rb:133:in `sysread': Connection reset by peer (Errno::ECONNRESET)
	from /usr/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'
	from /usr/lib/ruby/1.8/timeout.rb:56:in `timeout'
	from /usr/lib/ruby/1.8/timeout.rb:76:in `timeout'
	from /usr/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill'
	from /usr/lib/ruby/1.8/net/protocol.rb:116:in `readuntil'
	from /usr/lib/ruby/1.8/net/protocol.rb:126:in `readline'
	from /usr/lib/ruby/1.8/net/http.rb:2020:in `read_status_line'
	from /usr/lib/ruby/1.8/net/http.rb:2009:in `read_new'
	 ... 6 levels...
	from ./lib/ec2helpers.rb:43:in `start_instance'
	from ./ec2-puppet:107
	from ./ec2-puppet:89:in `each_pair'
	from ./ec2-puppet:89

Doing some tcpdumping indicated that after receiving the request Amazon waits for a while and then sends a TCP RESET. Not very nice at all. My next step was to use ngrep to compare the output from the command line tools and the ruby gem. This got nowhere fast since the command line tools use the SOAP API while the ruby gem uses the Query API.

What I did notice, however, is that while the command line tools performed a POST, the Ruby library performed a GET. At this stage I decided to test how much data I could send, so I started trying different user data sizes. The offending amount was around 7.8k, suspiciously close to 8k.

The HTTP/1.1 spec doesn't place an actual limit on URI length but leaves it up to the server:

The HTTP protocol does not place any a priori limit on the length of
a URI. Servers MUST be able to handle the URI of any resource they
serve, and SHOULD be able to handle URIs of unbounded length if they
provide GET-based forms that could generate such URIs. A server
SHOULD return 414 (Request-URI Too Long) status if a URI is longer
than the server can handle (see section 10.4.15).


Note: Servers ought to be cautious about depending on URI lengths
above 255 bytes, because some older client or proxy
implementations might not properly support these lengths.

Apache, for example, limits the request line by default to 8190 bytes, including the method and the protocol. You can change this using the LimitRequestLine directive.

I created a patch to modify the EC2 gem to use a POST instead of a GET, which has no such limitation. You can find the git tree for it at http://inodes.org/~johnf/git/amazon-ec2
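The change is conceptually tiny: send the Query API parameters in the request body instead of on the request line. Stripped of the gem's plumbing (and of request signing, which I've left out entirely; the file name is just a placeholder), it amounts to something like this:

require 'net/https'
require 'cgi'

# Illustrative only: a real Query API call also needs AWSAccessKeyId,
# SignatureVersion, Timestamp, Version and a Signature over the parameters.
params = {
  'Action'   => 'RunInstances',
  'ImageId'  => 'ami-12345678',
  'MinCount' => '1',
  'MaxCount' => '1',
  'UserData' => [File.read('first-boot.sh')].pack('m').gsub("\n", ''), # base64, ~10k is fine here
}

body = params.map { |k, v| "#{CGI.escape(k)}=#{CGI.escape(v)}" }.join('&')

http = Net::HTTP.new('ec2.amazonaws.com', 443)
http.use_ssl = true

# A GET would put all of this on the request line and trip the ~8k limit;
# a POST carries it in the body, which isn't subject to that limit.
response = http.post('/', body,
                     'Content-Type' => 'application/x-www-form-urlencoded')
puts response.code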

Squid and Rails caching

At Vquence our Rails setup looks something like this.

------------     ---------     ------------ 
| Internet |---->| Squid |---->| Mongrels | 
------------     ---------     ------------ 

(Who needs Inkscape when you have ASCII art)

This infrastructure is hosted in the US, and up until recently squid hadn't been doing much of anything except sitting there.

Now, a few months ago when we signed a contract with an Australian customer, we decided we needed to place a squid cache in Australia which would actually cache content, for two reasons: firstly, the US is a long way away and the 300ms latency is really noticeable; secondly, some of our pages involving graphs have long statistical calculations which can take minutes to render. (OK, it's really because no one has had a chance to optimise them yet, but let's pretend that's not the case.) So we changed the above setup for Australian customers to look like the following:

------------     ------------     ------------     ------------
| Internet |---->| Squid AU |---->| Squid US |---->| Mongrels |
------------     ------------     ------------     ------------

We hand out URLs like http://www.client.b2b.vquence.com/widget to Australian customers, and the Rails backend is smart enough to make sure all the URLs look similar (I'll blog about how I did that another time).

Without much time to look into things properly, I did some really nasty things on the AU squid cache to make sure it cached the pages:

refresh_pattern /client/graph  1440    0%    1440    ignore-no-cache ignore-reload
refresh_pattern /client/static 1440    0%    1440    ignore-no-cache ignore-reload
refresh_pattern /client/video  1440    0%    1440    ignore-no-cache ignore-reload

This is evil and breaks a whole heap of RFCs (the 1440 minutes mean objects are treated as fresh for 24 hours, while ignore-no-cache and ignore-reload discard the application's no-cache header and the browser's forced reloads), but it did the trick and got us out of a bind quickly.

A few weeks ago I moved the production site to Rails 2.0, and around this time I noticed that the caching had stopped working. The client was no longer using our services, as their campaign had finished, so it wasn't an urgent concern.

It seems that Rails 2.0 goes one step further to ensure that caches don't cache content; instead of just sending

Cache-Control: no-cache

it now sends

Cache-Control: private, max-age=0, must-revalidate

I tried adding ignore-private, since if you’re breaking some aspects of the RFC you may as well break a couple more, but squid still refused to cache the content. After struggling with this for a bit I decided that the universe was trying to tell me I should actually do things properly.

So with squid set back to its defaults I went exploring how to accomplish this properly. Google wasn't all that helpful at first, since most Rails caching articles talk about caching to static files, as most sites don't put a reverse proxy in front for caching. It turns out, however, that it's fairly simple. In the appropriate actions in your controllers, simply do the following:

class VideoController < ApplicationController

    def vquence
        # tell upstream caches they may keep this page for 8 hours
        expires_in 8.hours, :private => false
        render :template => "videos/vquence"
    end

end

This will send the following header and cache the page for 8 hours.

Cache-Control: max-age=28800
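A quick way to check what the caches actually see (the URL is just an example):

curl -sI http://www.client.b2b.vquence.com/widget | grep -i Cache-Control
# should print: Cache-Control: max-age=28800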

Now everything is much faster!!