A New, Improved Visualization for Web Server Logs

by Raju Varghese


In my last article I showed how web server logfiles can be visualized as a 3D plot with the help of Perl and gnuplot. In this article we will enhance the plot in several ways. The two main additions are color and an evening-out of the plot.

Access logfiles from a web server need to be filtered before the data is passed on to gnuplot. Listing 5, the Perl script that was used in the previous article, can be used in this one as well. Each line in the access logfile produces one line of output; of the many items in a line of the access logfile, four items are extracted: the timestamp, URL, IP address, and status code of the request. The URL in the output is not the actual URL but its rank in the list of URLs in the file. Similarly, the IP address is the rank of the actual IP address in a sorted list. Both of these are integer numbers. The output file so created can be read directly by gnuplot, as you will see later in this article.
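For readers who do not have the earlier Perl script at hand, the filtering step described above can be sketched roughly as follows. This is a Python sketch, not the author's Listing 5, and it rests on several assumptions: the log is in Common or Combined Log Format, the timestamp is written out as epoch seconds, ranks are 1-based positions in a plain string sort, and the columns are ordered time, IP rank, URL rank, status. The names LOG_RE, rank_map, and filter_log are hypothetical.

```python
import re
from datetime import datetime

# Assumption: Common/Combined Log Format. The article's own Perl filter
# (its Listing 5) is not reproduced here; this is only an illustration.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?:\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3})'
)


def rank_map(values):
    """Map each distinct value to its 1-based position in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def filter_log(lines):
    """Return (epoch_seconds, ip_rank, url_rank, status) for each parsable line."""
    records = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not match the assumed format
        # %b expects English month names (the default C locale).
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        records.append(
            (int(ts.timestamp()), m.group("ip"), m.group("url"), int(m.group("status")))
        )
    # Ranks come from a plain string sort, so IP addresses are ordered
    # lexically rather than numerically.
    ip_rank = rank_map(r[1] for r in records)
    url_rank = rank_map(r[2] for r in records)
    return [(t, ip_rank[ip], url_rank[url], status) for t, ip, url, status in records]
```

Writing each tuple as one whitespace-separated line gives a file with four numeric columns, which is the shape of input the plotting command later in the article expects.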


The 3D plots in the previous article were a bland monochrome; the version of gnuplot available at that time could not handle multiple colors in scatter plots. With the release of gnuplot 4.2 on March 3, 2007, the possibilities have increased. We will display the status code, the fourth dimension of the data, as color.

Color plot
Figure 1. Color scatter plot showing HTTP requests

Code Listing 1 shows the gnuplot commands used to generate Figure 1. All commands except the last two should be self-explanatory. The penultimate command defines a function that returns the color of the dot depending on the status code. It is a nested ternary conditional statement in the syntax needed for gnuplot.

rgb(r) = (r<200)? (0): (r<300)? (12632256): (r==304)? (10526880): (r<400)? (238): (r<500)? (15631086): (16711680)

In pseudo-code it could be written as:

if (statusCode < 200) # 1XX
   return black
else if (statusCode < 300) # 2XX
   return gray
else if (statusCode == 304) # Not modified 
   return darkgray
else if (statusCode < 400) # other 3XX or redirects
   return blue
else if (statusCode < 500) # 4XX including the infamous page-not-found 
   return violet
else # 5XX
   return red

The status code 304 (Not Modified) deserves special treatment because, even though it is in the 3XX group, it is not a redirect. It states that the content was not modified and that the client can continue to use its cached copy. I have therefore treated it like the 2XX status codes but given it a different shade of gray. The table below shows the HTTP status codes and the corresponding color codes as integer and hex numbers.

Status code  Color     Color as integer  Color as hex  Comment
1XX          Black     0                 0x000000      Informational
2XX          Gray      12632256          0xC0C0C0      Successful
304          DarkGray  10526880          0xA0A0A0      Not Modified
3XX          Blue      238               0x0000EE      Redirection
4XX          Violet    15631086          0xEE82EE      Client Error
5XX          Red       16711680          0xFF0000      Server Error
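The integer and hex columns are two spellings of the same 24-bit value: red shifted left by 16 bits, green by 8, and blue in the low byte. The sketch below is in Python rather than gnuplot and is only an illustration. It mirrors the rgb function defined earlier and can be used to check the table; the names pack_rgb and status_color are hypothetical.

```python
def pack_rgb(red, green, blue):
    """Pack three 0-255 channel values into one 24-bit integer (0xRRGGBB)."""
    return (red << 16) | (green << 8) | blue


def status_color(status):
    """Mirror the article's gnuplot rgb() function for an integer HTTP status code."""
    if status < 200:
        return pack_rgb(0x00, 0x00, 0x00)  # 1XX: black
    if status < 300:
        return pack_rgb(0xC0, 0xC0, 0xC0)  # 2XX: gray
    if status == 304:
        return pack_rgb(0xA0, 0xA0, 0xA0)  # 304 Not Modified: dark gray
    if status < 400:
        return pack_rgb(0x00, 0x00, 0xEE)  # other 3XX: blue
    if status < 500:
        return pack_rgb(0xEE, 0x82, 0xEE)  # 4XX: violet
    return pack_rgb(0xFF, 0x00, 0x00)      # 5XX: red


if __name__ == "__main__":
    # Print each sample status code with its color as integer and as hex.
    for code in (100, 200, 304, 302, 404, 500):
        value = status_color(code)
        print(f"{code}: {value} 0x{value:06X}")
```

The order of the tests matters: the check for 304 has to come before the general 3XX test, just as it does in the gnuplot function.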

The last line in Code Listing 1, at the end of this article, is the actual command to draw the scatter plot. It names the input file (gnuplot.inp20070123.txt), which holds the four values for each dot, and specifies the order in which those values are used. The first three columns give the position of the dot; the color is computed from the status code in the fourth column by the function rgb.

splot 'gnuplot.inp20070123.txt' using 1:2:3:(rgb($4)) with dots lc rgb variable

For the benefit of those who have not read the first article, each dot in Figure 1 corresponds to a line (i.e., one HTTP request) from the access logfile of a web server. The three axes are time, IP address, and content. The status code, which is also in every line of the access logfile, is represented by the color of the dot in 3D space. This particular plot looks featureless, but Figure 2 looks sinister and could give a sysadmin sleepless nights. It shows a spider attack; the tall pillar is a concentrated salvo of requests over the whole content space, one that is guaranteed to make the database where the content is stored break into a sweat.

3D plot of a bad day (spider attack)
Figure 2. Spider attack in color

The left flank of the pillar is gray because, at first, all the components could handle the onslaught. The barrage of requests, however, soon brings the database to its knees and the dots turn red (status code 5XX). This affects all requests at that time, as the long shadow of the pillar shows. Nevertheless, the system recovers quite quickly after the attack, and the red color fades away with time (that is, further along the x-axis).

The representation of these plots was inspired by Edward Tufte's credo: simple design, intense content (Reference 1). I hope that this visualization is in that spirit: three dimensions for three columns of the logfile, color for the fourth, and about a million data points. The simplicity of the representation, however, leaves the interpretation of the plots to the viewer. Clustering, as we have seen in Figure 2, is a sure sign of forces at work and deserves attention.
