Thursday, March 25, 2010

Cache busting in PHP: Part 2

In my previous post, I showed how to borrow a technique from Ruby on Rails for busting the browser cache for a particular file.

If you haven't read that, please check it out and come back. It's OK, I'll wait here. I've got a snack.

Improving on cachedFile()


Back? OK, well I've made some improvements to cachedFile() and thought I'd share them. Here are the new capabilities:

1) The function now extracts the file type from the extension
2) It handles images, and specifies their dimensions for faster and smoother rendering by the browser
3) It caches all the information it calculates about a file for faster performance on subsequent requests

#1 and #2 are pretty straightforward. You can use the function like this:

cachedFile('foo.png');
cachedFile('subdirectory/bar.png','class="buz"');

...and it outputs something like this:

<img src="/images/foo.png?1241452378" width="16" height="15" />
<img src="/images/subdirectory/bar.png?1241452378" width="20" height="17" class="buz" />

I put a cache in your cache so you can cache while you cache


But what about #3? What's this caching business? How can we add caching to a caching function?

Let's back up a bit.

First off, cachedFile() was a bit of a misnomer. This function is really for BUSTING a cache.

1) First, we configured our web server to tell the browser "you can cache these types of files for a whole year - don't ask for them again."
2) Second, we made sure that the browser saw each filename as the combination of the ACTUAL filename, like 'foo.png', and the file's time stamp, resulting in 'foo.png?1241452378' (or something like that). Those numbers represent the last time the file was changed; they're the same time stamp you see on any file on your computer.
3) Third, since the time stamp is automatically pulled from the file, we verified that we can update the file, which will update the time stamp, which will trick the browser into thinking it's never seen that file before, and therefore requesting it again.

The end result: the browser asks for a file once, then never again (at least for a year) - until the moment you change the file. As soon as it's updated, the browser asks for a new copy; until then, it uses the one it cached.
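To see the time stamp trick in isolation, here's a minimal sketch. The helper name bustedUrl() is made up for this example, and a temporary file stands in for a real stylesheet so the snippet can run anywhere; touch() plays the part of "you edited and saved the file."

```php
<?php
// Hypothetical helper: returns the URL with the file's time stamp appended.
function bustedUrl($url, $path) {
    // PHP remembers file stats during a request; clear them so we see changes
    clearstatcache();
    return $url . '?' . filemtime($path);
}

// A temporary file stands in for foo.png (assumption for this demo)
$path = tempnam(sys_get_temp_dir(), 'img');

// Pretend the file was last saved at this moment in the past
touch($path, 1241452378);
echo bustedUrl('/images/foo.png', $path), "\n"; // /images/foo.png?1241452378

// "Edit" the file an hour later: the URL changes, so the browser asks again
touch($path, 1241452378 + 3600);
echo bustedUrl('/images/foo.png', $path), "\n"; // /images/foo.png?1241455978

unlink($path);
```

Same file, new time stamp, new URL. As far as the browser is concerned, that's a file it has never seen.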

So, instead of cachedFile(), we could have called the function browserCacheBuster(). (But we won't, because I think that sounds cheesy.)

Now, this is all great, but the server is doing a bit of work for each file. Like before, each time you ask our function for a file, it has to go and determine the time stamp. In addition, my new features mean that for image files, it has to compute the width and height of the image.

This is all very fast in human terms, but how will it scale? What if you're using cachedFile() to spit out the same image tag several hundred times on the same page?

In that case, it might be nice to remember what you calculated last time. "Foo.png? Oh yeah, I remember him. I wrote down his dimensions and time stamp right here. No need to calculate them again."[1]

Memoization


To make this happen, we're going to use a design pattern called memoization. It works like this:

1) Before you calculate a result or pull it from the database, see if you've already got that result stored in a cache
2) If not, figure out your result and store it in your cache. If so, skip this step.
3) Now you've verified that you've got it in your cache, so return it from there.

For a given input, the first time the function runs, it will check the cache, find nothing, calculate a result from the input, store the result in cache, and return. Every time after that, it will just check the cache, find a result for that input, and return.
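Here's a minimal sketch of those three steps, separate from cachedFile(). The names slowSquare() and memoizedSquare() are invented for illustration, and this version keeps its cache in a static variable rather than the global I use later in this post; either approach should work.

```php
<?php
// Counts how often the "expensive" work really happens (demo only)
$GLOBALS['slowSquareCalls'] = 0;

// Hypothetical stand-in for any expensive calculation or database query
function slowSquare($n) {
    $GLOBALS['slowSquareCalls']++;
    return $n * $n;
}

function memoizedSquare($n) {
    // A static variable keeps its value between calls to this function,
    // but only for as long as the current script is running
    static $cache = array();

    // Steps 1 and 2: check the cache; calculate and store only on a miss
    if (!array_key_exists($n, $cache)) {
        $cache[$n] = slowSquare($n);
    }

    // Step 3: the result is now guaranteed to be in the cache
    return $cache[$n];
}

echo memoizedSquare(12), "\n"; // 144, calculated
echo memoizedSquare(12), "\n"; // 144, straight from the cache
echo $GLOBALS['slowSquareCalls'], "\n"; // 1
```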

Does it matter?


But is there any point in doing this? Are we prematurely optimizing? Maybe. Let's see how much performance gain this really gets us.

I did a little not-very-scientific testing: added some caching to cachedFile(), called it from a loop a few hundred times, and timed the results using PHP's microtime(). I tried this with js, css, and image files, and did five or ten iterations of each.

Not a great sample size, but here's what I found: for .js files, having a cache made the function 2.72 times faster. For .css files, it made it 3.18 times faster. But for image files, having a cache made the function 119.63 times faster!

Clearly, computing those image dimensions is a bit expensive for the server, and we don't want to do it more than necessary.[2] Caching cuts the workload considerably.
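If you want to run a similar test yourself, a timing harness might look roughly like this. It's a sketch, not the exact script I used: timeCalls() is a made-up helper, and the workload here just reads the time stamp of a temporary file. Swap in your own function call to compare cached and uncached versions.

```php
<?php
// Hypothetical helper: runs $callback $iterations times, returns elapsed seconds
function timeCalls($callback, $iterations) {
    $start = microtime(true); // true = return seconds as a float
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($callback);
    }
    return microtime(true) - $start;
}

// A temporary file stands in for a real asset (assumption for this demo)
$tmp = tempnam(sys_get_temp_dir(), 'cb_');

$elapsed = timeCalls(function () use ($tmp) {
    // Clear PHP's own stat cache so every call really hits the file system
    clearstatcache();
    return filemtime($tmp);
}, 500);

unlink($tmp);
printf("500 calls took %.6f seconds\n", $elapsed);
```

Run it several times and average the numbers; a single run can be thrown off by whatever else the server is doing.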

Enough talk - code time


OK, let's see how our function looks with these changes. (The cache is stored in a global variable so it will persist between function calls. To offset this minor sin, I have labeled it clearly and awkwardly to prevent accidental meddling from elsewhere.)

$GLOBAL_cachedFile_cache = array();
function cachedFile($name, $attr=null){
 global $GLOBAL_cachedFile_cache;
 // Key on the name AND the attributes, so the same file requested
 // with different attributes doesn't get the wrong cached tag
 $key = $name . '|' . $attr;
 if (!isset($GLOBAL_cachedFile_cache[$key])){
  $root = $_SERVER['DOCUMENT_ROOT'];
  // Everything after the last dot is the file type
  $filetype = substr($name,strrpos($name,'.')+1);

  /* Configuration options */
  $imgpath = '/images/';
  $csspath = '/stylesheets/';
  $jspath = '/scripts/';

  // Unrecognized file types produce no output
  $output = '';
  switch ($filetype){
   case 'css':
    $output = '<link rel="stylesheet" type="text/css" href="' . $csspath;
    $output .= $name;
    $output .= '?' . filemtime($root . $csspath . $name) . '" ';
    if($attr){
     $output .= $attr . ' ';
    }
    $output .= '/>' . "\n";
    break;
   case 'js':
    $output = '<script type="text/javascript" src="' . $jspath;
    $output .= $name;
    $output .= '?' . filemtime($root . $jspath . $name) . '"';
    $output .= '></script>' . "\n";
    break;
   case 'jpg':
   case 'gif':
   case 'png':
    //This code will get run in any of the three cases above
    $output = '<img src="' . $imgpath . $name;
    $output .= '?' . filemtime($root . $imgpath . $name) . '"';
    // Index 3 of getimagesize() is a ready-made 'width="x" height="y"' string
    $imgsize = getimagesize($root . $imgpath . $name);
    $output .= ' ' . $imgsize[3];
    if($attr){
     $output .= ' ' . $attr;
    }
    $output .= ' />';
    break;
  }
  $GLOBAL_cachedFile_cache[$key] = $output;
 }
 echo $GLOBAL_cachedFile_cache[$key];
}

Magnanimousness


What's that? Want to use this code somewhere? Well, sure. No, you don't have to thank me, or license it, or anything. Just name your kid after me or send me a solid gold pickle.

Humility


And of course, perhaps I did something very stupid here. Well, that's what comments are for.


[1]You might wonder whether this will create problems. After all, if we cache the time stamp, won't we miss the fact that the file has been updated and defeat our purpose? No worries: the cache only lasts as long as the page script is running. So if you update a file while a user is loading the page, they won't see it. But on the next reload, they will.

[2]In fact, it would be reasonable not to do it at all. Lots of factors affect how fast a site performs and how fast it seems, and how quickly it renders is certainly one of them. Specifying image dimensions is meant to help with rendering, but it costs processor time. You'll have to decide what works best for your site.

Saturday, March 20, 2010

Rails caching and cache busting in PHP

Ever wondered how to use browser caching to speed up your page loads?

I was working on a Rails project recently, and noticed something interesting in the documentation:

Using asset timestamps

By default, Rails appends asset's timestamps to all asset paths[1]. This allows you to set a cache-expiration date for the asset far into the future, but still be able to instantly invalidate it by simply updating the file (and hence updating the timestamp, which then updates the URL as the timestamp is part of that, which in turn busts the cache).

It's the responsibility of the web server you use to set the far-future expiration date on cache assets that you need to take advantage of this feature. Here's an example for Apache:

# Asset Expiration
ExpiresActive On
<FilesMatch "\.(ico|gif|jpe?g|png|js|css)$">
ExpiresDefault "access plus 1 year"
</FilesMatch>

As I explained on Stackoverflow (more on that in a moment):

If you look at the source for a Rails page, you'll see what they mean: the path to a stylesheet might be "/stylesheets/scaffold.css?1268228124", where the numbers at the end are the timestamp when the file was last updated.

So it should work like this:

1. The browser says 'give me this page'
2. The server says 'here, and by the way, this stylesheet called scaffold.css?1268228124 can be cached for a year - it's not gonna change.'
3. On reloads, the browser says 'I'm not asking for that css file, because my local copy is still good.'
4. A month later, you edit and save the file, which changes the timestamp, which means that the file is no longer called scaffold.css?1268228124 because the numbers change.
5. When the browser sees that, it says 'I've never seen that file! Give me a copy, please.' The cache is 'busted.'

Bringing it to PHP


Clever! Now how can we borrow that idea in a PHP app?

The first step, of course, is to set the server to tell browsers 'cache these files.' The example config above worked for me[2].

The second step is to append timestamps to your filenames. Here's a first-pass attempt at that:

<link rel="stylesheet" type="text/css" href="/includes/main.css<?php echo '?' . filemtime($_SERVER['DOCUMENT_ROOT'] . '/includes/main.css'); ?>" title="default" />

That basically works - the time stamp is appended to the file name. But it's not nearly as streamlined as the Rails way, for a couple of reasons.

1) You have to type the file name twice - don't repeat yourself!
2) Come to think of it, all your stylesheet links are going to be the same format. Why keep typing in the boilerplate stuff?

In Rails, you'd just do this:

<%= stylesheet_link_tag 'main' %>

Slick! Helper tags like these take a lot of the drudgery out of HTML when you're using Rails.

A loose approximation in PHP could be generalized to handle different file types. For example, you might write a function like this:

function cachedFile($type, $name, $attr=null){
 $root = $_SERVER['DOCUMENT_ROOT'];
 switch ($type){
  case 'css':
   $output = '<link rel="stylesheet" type="text/css" href="/includes/';
   $output .= $name;
   $output .= '?' . filemtime($root.'/includes/'. $name) . '" ';
   if($attr){
    $output .= $attr . ' ';
   }
   $output .= '/>';
   $output .= "\n";
   echo $output;
   break;
  case 'js':
   $output = '<script type="text/javascript" src="/includes/';
   $output .= $name;
   $output .= '?' . filemtime($root.'/includes/'. $name) . '"';
   $output .= '></script>';
   $output .= "\n";
   echo $output;
   break;
 }
}

...which could then be used like this:

cachedFile('css','jquery-ui-1.7.1.custom.css');
cachedFile('css','main.css','title="Default"');
cachedFile('js','jquery-1.4.min.js');

Notice that this function assumes something - that your javascript files and stylesheets will always be in a particular folder. That's part of Rails' "convention over configuration" mentality: if you always do something the same way, you only have to specify it once.

Now, there's still room for improvement. For example, the type could be extracted from the filename, so that's one less argument to pass in. And more file types could be added. But this function already accomplishes several good things:

1) It gets your files to be cached by the browser and to bust the cache when necessary
2) It cuts down on code repetition
3) Naming the function cachedFile makes its purpose obvious
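As for extracting the type from the filename, one possible approach is sketched below. The helper name fileTypeFromName() is made up for this example; it leans on PHP's built-in pathinfo() and lowercases the result so 'Logo.PNG' and 'logo.png' are treated alike.

```php
<?php
// Hypothetical helper, not part of cachedFile() above
function fileTypeFromName($name) {
    // PATHINFO_EXTENSION gives the text after the last dot,
    // or an empty string if the name has no dot at all
    return strtolower(pathinfo($name, PATHINFO_EXTENSION));
}

echo fileTypeFromName('main.css'), "\n";          // css
echo fileTypeFromName('jquery-1.4.min.js'), "\n"; // js
echo fileTypeFromName('Logo.PNG'), "\n";          // png
```

With something like this in place, the switch statement could test the result of fileTypeFromName($name) and the $type argument could go away.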

Now - how can you verify that this is working? I had the same question myself.

As Andy on Stackoverflow pointed out, you can load your page in Firefox, use the Firebug add-on, and look in the "Net" panel as you load the page. For any file that's cached, you should see a status message of 304 Not Modified. For anything that's pulled from the server, you should see 200 OK.

Try it:

1) Load the page to request everything once
2) Reload it to verify that things are being cached
3) Make a trivial change to a cached file, so that its timestamp will change
4) Reload the page again and verify that it was requested
5) Reload the page one last time and verify that it's cached again
6) Set up an elaborate Rube Goldberg machine to pat yourself on the back

(Step 6 is optional.)

Great ideas are worth borrowing


One reason that Rails has become so popular is that it codifies a lot of clever ideas and best practices into easy-to-use shortcuts. You can make a whole app with Rails without ever realizing that it's pulling the trick shown here on your behalf.

You don't have to use Rails, but if you see a great idea, it's always worth asking: "can I borrow this?"

Now go bust some caches!

[1]There's a danger here: notice that the Rails docs say all asset paths. If you set Apache to tell the browser to cache all images, style sheets and scripts for a year, and you only use a cache busting strategy for some of those things, then your visitors won't see updated versions of the others unless they clear their own browser cache manually or do a hard refresh with Ctrl+F5.

[2]I put this information into Apache's main config file, httpd.conf. If you're using a web host, they probably don't give you access to that, but they may have configured Apache to look for .htaccess files in your project folders. If so, you can set caching rules there.

Monday, March 1, 2010

It's not magic

A while back, I had a small epiphany. I'd been asked to create a web form that could send emails with attachments. I already had forms that sent email, but attachments? What were they? In my mind, attachments were the little icons above the email. I had no idea how they were created or sent.

At the same time, I was reading The Code Book, an entertaining and fascinating look at cryptography - the art of sending scrambled messages, to be unscrambled only by the intended recipient.



The simplest, most brain-dead kind of encryption is a Caesar cipher, where you take all the letters in a message and shift them by the same amount. For example, with a Caesar cipher of 1, this:

HELLO WORLD

becomes this:

IFMMP XPSME

A modern cryptanalyst would pee his pants laughing if you used this method for anything serious. But one thing about it works: you can encode, and you can decode, as long as you know the rules.
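Here's a small sketch of those rules in PHP. The function name caesarShift() is my own invention, and it only handles capital letters A to Z; everything else, like the space, passes through untouched.

```php
<?php
// Shift each letter A-Z by $shift places, wrapping around after Z
function caesarShift($text, $shift) {
    $result = '';
    for ($i = 0; $i < strlen($text); $i++) {
        $char = $text[$i];
        $code = ord($char);
        if ($code >= ord('A') && $code <= ord('Z')) {
            $offset = ($code - ord('A') + $shift) % 26;
            // PHP's % can return a negative number, so wrap it back into 0-25
            if ($offset < 0) {
                $offset += 26;
            }
            $char = chr(ord('A') + $offset);
        }
        $result .= $char;
    }
    return $result;
}

$encoded = caesarShift('HELLO WORLD', 1);
echo $encoded, "\n";                  // IFMMP XPSME
echo caesarShift($encoded, -1), "\n"; // HELLO WORLD
```

Decoding is just encoding with the opposite shift, which is exactly the point: same rules, run backwards.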

Now, The Code Book traces the development of encryption methods so complex they'll make your head spin, but all of them are systematic: whatever is encoded can be decoded. You just need to know how.

This applies to any method of encoding information, even if it's not cryptography. For example, Morse Code encodes letters as electrical pulses - not to hide the message, but just to transmit it. How cool - you can encode actual language as beeps!

Which shows us another important idea: you can encode anything as anything else. You can encode the weather forecast with colored socks. You can encode the Constitution with duck calls. As long as there are consistent rules for encoding and decoding, it will work.

Going back to my email attachments, I soon discovered how attachments work: the file data is encoded as text in your email. It ends up looking like gibberish, but fortunately, no human has to read it, because the email program does that for you.

How does it know which parts of the text are text and which parts are files? You tell it. Say you want to attach an image. First, you choose some arbitrary string of characters which will probably never occur in an actual email text. Let's say it's "Woo_hamburgers_for_mayor_in_2050_yeeeeeha." Whenever you use that phrase, it means "I'm about to put in some different content." Each section of content also gets a label about what "MIME type" it is, like "image/jpeg" or "application/pdf". Your email text goes in one section, and your image data goes in another, after being encoded as text.

(In PHP, you can encode the data as text like this: base64_encode($fileContents).**)
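To make that concrete, here's a rough sketch of how such a message body could be assembled by hand. This is an illustration, not my actual attachment code: buildMultipartBody() and its parameters are made-up names, the "image" is just a short string, and nothing is actually sent.

```php
<?php
// Hypothetical helper: builds a two-part email body (text plus one attachment)
function buildMultipartBody($boundary, $text, $fileName, $mimeType, $fileContents) {
    $eol = "\r\n";

    // Section 1: the plain text of the email
    $body  = '--' . $boundary . $eol;
    $body .= 'Content-Type: text/plain; charset="UTF-8"' . $eol . $eol;
    $body .= $text . $eol;

    // Section 2: the file, labeled with its MIME type and encoded as text
    $body .= '--' . $boundary . $eol;
    $body .= 'Content-Type: ' . $mimeType . '; name="' . $fileName . '"' . $eol;
    $body .= 'Content-Transfer-Encoding: base64' . $eol;
    $body .= 'Content-Disposition: attachment; filename="' . $fileName . '"' . $eol . $eol;
    // chunk_split() breaks the base64 text into 76-character lines
    $body .= chunk_split(base64_encode($fileContents));

    // The boundary with two extra dashes at the end means "no more sections"
    $body .= '--' . $boundary . '--' . $eol;
    return $body;
}

$boundary = 'Woo_hamburgers_for_mayor_in_2050_yeeeeeha';
// This header tells the mail program which string separates the sections
$headers = 'MIME-Version: 1.0' . "\r\n"
         . 'Content-Type: multipart/mixed; boundary="' . $boundary . '"';

echo buildMultipartBody($boundary, 'Photo attached!', 'photo.jpg', 'image/jpeg', 'fake image bytes');
```

Print it out and you'll see the gibberish I mentioned, neatly fenced off by the boundary string.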

On the other end, when someone opens your email, their program is smart enough not to show all the scrambled-looking letters, but instead to say, 'hey, he said this part was an image - let's show it like that.' And it gets decoded. The little icons show up. It works!

The main thing is, it's not magic. For me, this turned on a light in my head. I stood next to a fax machine, and pointed my finger at it. "I know what you're doing with your crazy noises," I said. "You're encoding image data as sound!"

And that's how computers work, all the way down. Image information is encoded as characters, which are encoded as ones and zeros, which are encoded as magnetic charges on a disk platter, which are transmitted as electrical pulses in a circuit board.

It's hard to understand. It's hard to actually believe sometimes that everything I'm doing on screen can be represented as ones and zeros. But there's no magic. And if there's no magic, there's nothing to be scared of. If I work hard enough, I can understand a little piece of it - enough to get something done.

----
**For a detailed walkthrough of my attachment code, see this Stackoverflow post.

The Bing Button

From a NY Times article about upcoming Windows Phone 7 devices:

In addition, Microsoft is requiring phone makers to keep basic elements of its user interface, including a physical button to start Web searches on Bing.

Microsoft. Listen. Nobody wants a button for Bing, or Google either, for that matter. This violates three principles about what I want in a smartphone:

  1. Customizable. If a button goes to a web site, or opens a calculator, or gives me a voice prompt, and I can't change that, I'm going to be frustrated.
  2. Neutral. It's my phone, not yours. I'll use Bing or Google or Jeeves or Big Larry's Virus-Laden Search Emporium if I want to. Don't force your product on me.
  3. General purpose. I don't have a "word processor" key on my computer keyboard. I use on-screen menus for that. There are a million programs I could install, and a billion web sites I could visit. Smartphones are smartphones precisely because they share this characteristic. My flip phone has a single calendar program, take it or leave it. With a smartphone, I could install or write my own, or use one on the web. Having a button that does one thing makes this less like a smartphone and more like a calculator - a single-purpose device. I want a browser, not a Binger.

If you can't put your customers' desires above your need to cross-brand, you're going to make lousy products. And your market share will continue to drop.