Optimizing HTTP downloads in Java through conditional GET and compressed streams

by Diego Doval


When obtaining content from websites (either for a server-based app, such as a crawler, or for a client app, such as a newsreader), it's a good idea to avoid downloading content you already have. Additionally, many if not most servers support dynamic stream compression, which has a big impact on download speeds, particularly for text-only data such as an RSS feed (if every aggregator used it, that would help with problems like these). Throughout the Net there are a number of descriptions of how to use some of these elements (for example, here is a good guide for conditional GETs, and here is one for server-side compression), but I haven't seen an example of Java client-side code that pulls all the basic techniques together in one place, so here it is.



Basically there are two things to deal with. The first is the use of the ETag and Last-Modified headers, which allow the server to reply with a 304 (Not Modified) response code, so that we avoid downloading data we've already obtained. The second is that when there is a download, we allow the server to send the content compressed (in GZIP or Deflate/ZLib format) to save download time. For text content, compression can reduce the size of the download by a factor of 5 or more.



Without further ado, here's the code:




...
//create a URL to O'Reilly's Meerkat RSS 1.0 feed
URL sourceURL = new URL("http://www.oreillynet.com/meerkat/?_fl=rss10&t=ALL&c=5209");
//obtain the connection
HttpURLConnection sourceConnection = (HttpURLConnection) sourceURL.openConnection();
//follow redirects on this connection only
//(setFollowRedirects() is static and changes the default for every connection)
sourceConnection.setInstanceFollowRedirects(true);
//allow both GZip and Deflate (ZLib) encodings
sourceConnection.setRequestProperty("Accept-Encoding", "gzip, deflate");

// obtain the ETag from a local store, returns null if not found
String etag = loadETag();

if (etag != null) {
    sourceConnection.addRequestProperty("If-None-Match", etag);
}

// obtain the Last-Modified from a local store, returns null if not found
String lastModified = loadLastModified();
if (lastModified != null) {
    sourceConnection.addRequestProperty("If-Modified-Since", lastModified);
}

//establish connection, get response headers
sourceConnection.connect();

//obtain the encoding returned by the server
String encoding = sourceConnection.getContentEncoding();

//The Content-Type can be used later to determine the nature of the content regardless of compression
String contentType = sourceConnection.getContentType();

//if it returns Not Modified then we already have the content, return
if (sourceConnection.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
    //disconnect() should only be used when you won't
    //connect to the same site in a while,
    //since it disconnects the socket. Only closing
    //the stream on an HTTP 1.1 connection will
    //keep the connection open and waiting, which is
    //faster the next time around
    sourceConnection.disconnect();
    return;
}

//get the last modified & etag and
//store them for the next check
storeLastModified(sourceConnection.getHeaderField("Last-Modified"));
storeETag(sourceConnection.getHeaderField("ETag"));

InputStream resultingInputStream = null;

//create the appropriate stream wrapper based on
//the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    resultingInputStream = new GZIPInputStream(sourceConnection.getInputStream());
}
else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    //Inflater(true) expects raw deflate data with no ZLib header;
    //for a server that sends the ZLib-wrapped form, use new Inflater() instead
    resultingInputStream = new InflaterInputStream(sourceConnection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = sourceConnection.getInputStream();
}

...

//now the stream can be read directly,
//and the data will be of the contentType received above.

Basically, the code sets up the connection and advertises "gzip" and "deflate" as encodings the app accepts. If ETag/Last-Modified values exist, they are added to the request (the If-None-Match header for the ETag and the If-Modified-Since header for Last-Modified), allowing the server to check whether the content has changed since the last download. If the ETag and Last-Modified are null (that is, this is the first time we are downloading the page), they are not included in the request.
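To watch the conditional GET exchange without depending on a live site, here is a minimal sketch that runs against a throwaway local server. It assumes the JDK's bundled com.sun.net.httpserver package is available; the ConditionalGetDemo class, the /feed path and the "v1" ETag are made up for illustration and are not part of the listing above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class ConditionalGetDemo {
    // Hypothetical validator the toy server hands out for its single resource.
    static final String ETAG = "\"v1\"";

    // Starts a local server that answers 304 when If-None-Match equals ETAG.
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/feed", exchange -> {
            String ifNoneMatch = exchange.getRequestHeaders().getFirst("If-None-Match");
            exchange.getResponseHeaders().set("ETag", ETAG);
            if (ETAG.equals(ifNoneMatch)) {
                // The client already has this version: headers only, no body.
                exchange.sendResponseHeaders(HttpURLConnection.HTTP_NOT_MODIFIED, -1);
            } else {
                byte[] body = "<rss/>".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(HttpURLConnection.HTTP_OK, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            }
            exchange.close();
        });
        server.start();
        return server;
    }

    // Issues a GET, sending the stored ETag (if any), and returns the status code.
    static int fetchStatus(URL url, String etag) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        if (etag != null) {
            connection.setRequestProperty("If-None-Match", etag);
        }
        try {
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        try {
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/feed");
            System.out.println(fetchStatus(url, null)); // first visit, no validator sent
            System.out.println(fetchStatus(url, ETAG)); // revisit with the stored ETag
        } finally {
            server.stop(0);
        }
    }
}
```

The first request carries no validator and should come back as a full 200 response; the second sends the ETag back in If-None-Match and should come back as a 304 with no body.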


Once we establish the connection through connect(), we check whether the response code is "not modified"; in that case we simply disconnect() and return. Otherwise we continue by storing the ETag and Last-Modified values returned, and then create the appropriate wrapper (depending on the content encoding returned by the server) for the network stream, so that the content will be decompressed on the fly if necessary.
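One wrinkle with the "deflate" case: HTTP defines that encoding as ZLib-wrapped data, but some servers send raw deflate data with no ZLib header, and an Inflater created with nowrap set to true only understands the raw form. The following is a sketch of one way to cope with both, by peeking at the first two bytes of the stream. The ContentDecoder class and its method names are hypothetical, the compress() helper exists only so the sketch can be tried offline, and the header check is a heuristic rather than a guarantee.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

final class ContentDecoder {
    // Wraps the raw stream according to the Content-Encoding value (may be null).
    static InputStream decode(InputStream raw, String encoding) throws IOException {
        if (encoding == null) {
            return raw;
        }
        if (encoding.equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        if (encoding.equalsIgnoreCase("deflate")) {
            // Peek at two bytes, then push them back so the inflater sees them too.
            PushbackInputStream in = new PushbackInputStream(raw, 2);
            byte[] header = new byte[2];
            int count = 0;
            while (count < 2) {
                int read = in.read(header, count, 2 - count);
                if (read < 0) break;
                count += read;
            }
            if (count > 0) {
                in.unread(header, 0, count);
            }
            // A ZLib header has compression method 8 and a 16-bit value divisible by 31.
            boolean zlibWrapped = count == 2
                    && (header[0] & 0x0F) == 8
                    && (((header[0] & 0xFF) << 8) | (header[1] & 0xFF)) % 31 == 0;
            return new InflaterInputStream(in, new Inflater(!zlibWrapped));
        }
        return raw; // unknown or identity encoding: hand back the stream untouched
    }

    // Reads a stream to its end and decodes the bytes as UTF-8 text.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Test helper: compresses text as gzip, or as deflate with or without the ZLib wrapper.
    static byte[] compress(String text, String encoding, boolean nowrap) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        OutputStream out = encoding.equals("gzip")
                ? new GZIPOutputStream(buffer)
                : new DeflaterOutputStream(buffer, new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap));
        out.write(text.getBytes(StandardCharsets.UTF_8));
        out.close();
        return buffer.toByteArray();
    }
}
```

With a helper like this, the wrapper selection could be reduced to a single call such as ContentDecoder.decode(sourceConnection.getInputStream(), encoding), at the cost of reading two bytes ahead on deflate responses.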


And that's it!


PS: the loadETag(), loadLastModified(), storeLastModified(...) and storeETag(...) methods require a persistence layer of some sort (simple Java serialization would do) to access the ETag and Last-Modified values between runs of the application. Since this is heavily dependent on the application, they are not included.
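For a sense of how small that persistence layer can be, here is one possible sketch that keeps the values in a java.util.Properties file, keyed by URL. This is only an assumption about what a simple store might look like: the ValidatorStore class, its key suffixes and its method signatures (which take the URL as a parameter, unlike the no-argument calls in the listing) are invented for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class ValidatorStore {
    private final Path file;
    private final Properties values = new Properties();

    // Loads any previously saved validators from the given file, if it exists.
    ValidatorStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                values.load(in);
            }
        }
    }

    // Returns the stored ETag for the URL, or null if none was saved.
    String loadETag(String url) {
        return values.getProperty(url + "#etag");
    }

    // Returns the stored Last-Modified value for the URL, or null if none was saved.
    String loadLastModified(String url) {
        return values.getProperty(url + "#last-modified");
    }

    // Saves both validators (either may be null) and writes the file to disk.
    void store(String url, String etag, String lastModified) throws IOException {
        put(url + "#etag", etag);
        put(url + "#last-modified", lastModified);
        try (OutputStream out = Files.newOutputStream(file)) {
            values.store(out, "HTTP cache validators");
        }
    }

    // Properties cannot hold null values, so a null removes the entry instead.
    private void put(String key, String value) {
        if (value == null) {
            values.remove(key);
        } else {
            values.setProperty(key, value);
        }
    }
}
```

Storing the header values as the exact strings the server sent matters here, since they are sent back verbatim in If-None-Match and If-Modified-Since on the next request.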


2 Comments

twuersch
2004-07-22 07:56:57
Compressed streams supported?
I tried your code on the given URL. The encoding of the response is always null. Does the site support compressed streams?
diegod
2004-07-22 08:38:10
Compressed streams supported?
Oops, apparently not (apologies, I assumed it would). However if you try, for example, the URL for my personal weblog's RSS feed at


http://www.dynamicobjects.com/d2r/index.xml


you'll see GZIP encoding in action.