How to Analyze a Trillion Log Messages?

by Anton Chuvakin

Somebody posted a message to the loganalysis list seeking help with analyzing a trillion log messages. Yes, you heard that right: a trillion. Apart from some naive folks suggesting totally unsuitable vendor solutions, there was one smart post from Jose Nazario (here), which implied that the original poster would need to write some code himself. Why?

Here is why (see also my post to the list): assuming 1 trillion records of 200 bytes each, which is a typical PIX log message size (a bit optimistic, in fact), we are looking at roughly 180TB of uncompressed log data. And we need to analyze it, not just store it (even if we are not exactly sure for what; hopefully the poster himself knows).
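For the curious, here is the back-of-the-envelope arithmetic as a small Python sketch. The 200-byte average is the assumption stated above; the helper names are mine. Note that the product is 200 TB in decimal units, and the "roughly 180TB" figure matches it when counted in binary units (TiB).

```python
# Back-of-the-envelope size estimate for a trillion log messages.
RECORDS = 10**12          # one trillion log messages
BYTES_PER_RECORD = 200    # assumed average size of one PIX log line

total_bytes = RECORDS * BYTES_PER_RECORD


def total_in_tb(n_bytes: int) -> float:
    """Size in decimal terabytes (10**12 bytes)."""
    return n_bytes / 10**12


def total_in_tib(n_bytes: int) -> float:
    """Size in binary tebibytes (2**40 bytes)."""
    return n_bytes / 2**40


print(f"{total_in_tb(total_bytes):.0f} TB (decimal)")
print(f"{total_in_tib(total_bytes):.1f} TiB (binary)")
```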

Thus, I hate (ehh, make it "have" :-)) to admit that Jose is probably right: writing purpose-specific code might be the only way out. About a year ago, there was a discussion titled "parsing logs ultra-fast inline" on the firewall-wizards list about something very similar. We can look up some old posts by Marcus Ranum for useful tips on super-fast but purpose-specific log processing.

For example, here he suggests a few specific data structures to "handle truly ginormous amounts of log data quickly" and concludes that "this approach runs faster than hell on even low-end hardware and can crunch through a lot of logs extremely rapidly." One of the follow-ups really hits the point that I am making here and in my post: "if you put some thought into figuring out what you want to get from your log analysis, you can do it at extremely high speeds." A few more useful tips are added here.
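To make "purpose-specific" a bit more concrete, here is a minimal Python sketch of my own (it is not the data structure from those posts). It assumes we only care about one thing, say how often each source address shows up, and it assumes a made-up `src=` field and log layout rather than the real PIX format. It streams the input once, skips general-purpose parsing, and keeps only a table of counters in memory.

```python
from collections import Counter
from typing import Iterable


def count_field(lines: Iterable[str], key: str = "src=") -> Counter:
    """Count the values that follow `key`, in a single streaming pass."""
    counts: Counter = Counter()
    for line in lines:
        start = line.find(key)
        if start == -1:
            continue                      # line lacks the field; skip it
        start += len(key)
        end = line.find(" ", start)       # value runs up to the next space
        if end == -1:
            end = len(line)               # or to the end of the line
        value = line[start:end].rstrip("\n")
        counts[value] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "Apr 14 04:21:05 fw1 deny src=10.0.0.1 dst=192.168.1.5\n",
    "Apr 14 04:21:06 fw1 deny src=10.0.0.2 dst=192.168.1.5\n",
    "Apr 14 04:21:07 fw1 permit src=10.0.0.1 dst=192.168.1.9\n",
]
print(count_field(sample).most_common(1))
```

Because `count_field` accepts any iterable of lines, an open file object can be passed in directly, so the whole file never has to fit in memory.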

So, there is not much we can do: you are writing some code, buddy :-) And, as far as tips are concerned, here is the "strategy" :-) to solve it:

1. figure out what you want to do

2. write the code to do it

3. run it and wait, wait, wait ... possibly for a long time :-)
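How long is "a long time"? A rough Python sketch, under loudly stated assumptions: the 100 MB/s per-worker read rate is a number I picked for illustration, not a measurement, and the estimate pretends the work scales perfectly across workers.

```python
def scan_days(total_bytes: int, bytes_per_second: float, workers: int = 1) -> float:
    """Days needed for one full pass, assuming perfect scaling across workers."""
    seconds = total_bytes / (bytes_per_second * workers)
    return seconds / 86_400


TOTAL_BYTES = 10**12 * 200       # one trillion records of 200 bytes each
THROUGHPUT = 100 * 10**6         # assumed: 100 MB/s sustained per worker

print(f"1 worker:   {scan_days(TOTAL_BYTES, THROUGHPUT):.1f} days")
print(f"50 workers: {scan_days(TOTAL_BYTES, THROUGHPUT, 50) * 24:.1f} hours")
```

Under these assumptions a single pass takes a bit over three weeks on one worker and under half a day on fifty, which is why step 1 matters: every extra pass over the data costs that much again.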

Indeed, there are many great general purpose log management solutions on the market. However, we all know that there is always that "ginormous" amount of data that calls for custom code, heavily optimized for the task at hand.


John Dalton
2007-04-14 04:21:05
You talk about "custom" like it's a dirty word! General purpose tools develop because of a wide-spread need - but I'd guess that every problem begins with a single instance.

If you know exactly what info you want to get out of your logs, you can potentially do so at high speed, as you say. Also, you can probably parallelise this kind of task without too much difficulty, using an off-the-shelf batch job manager (such as Torque) to split up the jobs. If you have enough storage to keep 180TB of logs around, plus extra space for processing, then running a small cluster to handle log analysis may not be too much of a stress.
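A minimal sketch of that split-and-merge idea, in Python. To be clear about the assumptions: this uses the standard library's `ProcessPoolExecutor` on a single machine as a stand-in for a real batch job manager such as Torque, it assumes the logs are already split into separate files, and the `src=` field and function names are hypothetical. Each worker counts one file independently, and the partial counts are then added together.

```python
import sys
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, List


def count_chunk(path: str, key: str = "src=") -> Counter:
    """Count the values following `key` in one log file (one unit of work)."""
    counts: Counter = Counter()
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            start = line.find(key)
            if start == -1:
                continue                  # line lacks the field; skip it
            value = line[start + len(key):].split(" ", 1)[0].strip()
            if value:
                counts[value] += 1
    return counts


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Add up the per-file counters into one overall result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)             # Counter.update adds the counts
    return total


def analyze(paths: List[str], workers: int = 4) -> Counter:
    """Count each file in a separate process, then merge the results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge_counts(pool.map(count_chunk, paths))


if __name__ == "__main__":
    if len(sys.argv) > 1:                 # usage: script.py chunk1.log chunk2.log ...
        for value, hits in analyze(sys.argv[1:]).most_common(10):
            print(hits, value)
```

The same shape carries over to a cluster: `count_chunk` becomes the batch job run once per file, and `merge_counts` becomes a final step over the jobs' outputs.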

One final important question to consider would be whether you're dealing purely with historical data, or expecting analysis to be ongoing with new data coming in. If the latter, this would strengthen the arguments in favour of a custom solution with dedicated infrastructure.