OSCON Day 3: HTTP Caching

Robert Kaye
Aug. 06, 2005 05:40 PM

Michael Radwin's "HTTP Caching & Cache-Busting for Content Publishers" talk was very much like being smacked with an O'Reilly book and absorbing its contents via impact-osmosis. A little intense, but very informative. This talk was very similar to Ask Bjørn Hansen's talk -- many slides at lightning pace -- one blink and you would've missed something important.

This talk was aimed at people who need to be aware of HTTP caching -- both to build web sites that are more functional and more secure, and to understand how caching proxies treat their content. Caching proxies sit between users and web servers and keep copies of pages they have already fetched, so that the same content does not have to be fetched twice. For instance, AOL uses caching proxies heavily to reduce its overall bandwidth use by cutting the number of times a given page gets fetched from the server. Proper cache handling matters for two reasons: it keeps dynamic web pages from being tripped up by caches, and it lets caches do their job effectively, which in turn reduces the load on your own site.
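To make the idea of "preventing duplicate fetching" concrete, here is a minimal sketch of a URL-keyed cache in Python. It is a toy model, not how any real proxy is implemented: the `ToyCachingProxy` class, the `origin` function and the example URL are all made up for illustration, and it ignores expiry, headers and cookies entirely.

```python
from typing import Callable, Dict


class ToyCachingProxy:
    """Hypothetical URL-keyed cache sitting between clients and an origin server."""

    def __init__(self, fetch_from_origin: Callable[[str], str]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, str] = {}
        self.origin_hits = 0  # how many requests actually reached the origin

    def get(self, url: str) -> str:
        # Only go to the origin server the first time a URL is requested.
        if url not in self._store:
            self.origin_hits += 1
            self._store[url] = self._fetch(url)
        return self._store[url]


def origin(url: str) -> str:
    """Stand-in for the real web server (an assumption for this sketch)."""
    return f"content of {url}"


proxy = ToyCachingProxy(origin)
for _ in range(3):
    proxy.get("http://example.com/logo.png")

print(proxy.origin_hits)  # prints 1: three requests, one origin fetch
```

Because the cache is keyed on the URL alone, it would hand the same stored page to every user who asks for that URL, which is exactly why personalized pages need special care.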

Michael broke web content into three categories:

  • HTML -- changes frequently and should be considered dynamic content
  • CSS & JavaScript -- change less frequently
  • Flash, images and PDFs -- change very infrequently, perhaps never

He argued that each of these types of content should be treated differently when considering caching strategies, and suggested five strategies for dealing with them:

  1. Cache-Control: Most caching proxies will not cache any pages that involve cookies, since cookies are not part of the HTTP specification on caching. If a web application like Gmail uses cookies to identify users and the cache stores pages based on URL, then user Mary might end up accidentally getting a page that was originally cached for user Jane. Given this problem, proxies stay away from caching pages that use cookies, anything that uses HTTP authentication, and any pages whose Cache-Control header marks them as not cacheable by shared caches (for example Cache-Control: private).
  2. Images never expire policy: Since a lot of images on a site (e.g. logos) change very infrequently, it is not a stretch to say that images never expire. By setting the expiry time in the response header to something like 10 years in the future, you can be sure that a caching proxy will not fetch an image from your server unnecessarily. But what happens if you do change the image? Use another filename -- this can be a bit of a pain, but if you're trying to optimize the bandwidth usage of your site, you'll have to jump through plenty of hoops.
  3. Cookie-free TLD: For caching static content that changes more frequently than images, it can be useful to have a separate domain that never uses cookies. For instance, if you use example.com for your site, get example.net as well, and serve static content from there. Some proxies see cookies in use for example.com and then refuse to cache content from its subdomains, such as static.example.com -- using static.example.net decouples your static site from the main site. Plus, requests that do not contain cookies will also be smaller -- fetching a 49-byte, 1-pixel image with a 2K cookie-laden request is silly.
  4. Apache defaults for occasionally changing content: If you're not certain how frequently your content changes, Apache does the right thing by default. No extra work required.
  5. Using URL tags for sensitive content: If you're sending sensitive content over the net and you want to make sure that a caching proxy always does the right thing, consider embedding a user or session id into the URL. If you embed the user ID into the URL, even if the server never bothers to look at it, then a proxy will never mistakenly return a page from the cache to another user. However, if the same user requests the same page over again, the caching proxy can safely return the cached page.
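A few of the strategies above can be sketched in Python. This is only an illustration under stated assumptions, not code from the talk: the function names, the list of file suffixes, the `uid` query parameter and the content-hash naming scheme are all hypothetical choices of mine. The talk only says to use another filename when an image changes; putting a hash of the content in the name is one way to do that.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from urllib.parse import quote

TEN_YEARS = timedelta(days=3650)
# Assumed set of "never changes" file types, based on the categories above.
STATIC_SUFFIXES = (".png", ".gif", ".jpg", ".swf", ".pdf")


def cache_headers(path: str, personalized: bool, now: datetime) -> dict:
    """Pick response headers for a resource (hypothetical helper)."""
    if personalized:
        # Strategy 1: tell shared caches not to store per-user pages.
        return {"Cache-Control": "private"}
    if path.lower().endswith(STATIC_SUFFIXES):
        # Strategy 2: far-future expiry for images, Flash and PDFs.
        return {"Expires": format_datetime(now + TEN_YEARS, usegmt=True)}
    # Strategy 4: otherwise add nothing and rely on the server defaults.
    return {}


def versioned_name(filename: str, content: bytes) -> str:
    """Strategy 2, continued: a changed image gets a new filename."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension, so just append the digest
        return f"{filename}.{digest}"
    return f"{stem}.{digest}.{ext}"


def user_tagged_url(url: str, user_id: str) -> str:
    """Strategy 5: embed the user ID so a cache never mixes up users."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}uid={quote(user_id)}"


now = datetime(2005, 8, 6, 17, 40, tzinfo=timezone.utc)
print(cache_headers("/inbox", personalized=True, now=now))
print(cache_headers("/img/logo.png", personalized=False, now=now))
print(versioned_name("logo.png", b"new logo bytes"))
print(user_tagged_url("http://example.com/inbox", "mary"))
```

In a real deployment these headers would more likely be set in the web server configuration than in application code; the sketch just shows which decision goes with which kind of content.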

This talk covered many more details that I can't really convey here -- if you're interested in finding out all the details on how to deal with caches, Michael suggested picking up a book on the topic.

Robert Kaye is the Mayhem & Chaos Coordinator and creator of MusicBrainz, the music metadata commons.