Filesystem Monitoring: You're Doing It Wrong

by Chris Josephes

This doesn't look good, right?

[Figure: home2-vol.gif, a graph of filesystem usage over time]

Most open source monitoring tools do filesystem health checking by comparing the current percentage of used space against a set value. If the filesystem is 90% full, send out a warning page; if it drops back to 89%, send the all clear.
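As a rough sketch of that kind of check, here is what it might look like in Python using the standard `shutil.disk_usage` call. The function names and the 90% default are illustrative assumptions, not code taken from Nagios or any other tool.

```python
import shutil


def percent_used(path: str) -> float:
    """Return the percentage of used space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_threshold(pct_used: float, warn_at: float = 90.0) -> str:
    """Classic fixed-limit check: at or above warn_at is a warning."""
    return "WARNING" if pct_used >= warn_at else "OK"


if __name__ == "__main__":
    # Check the root filesystem against the hypothetical 90% limit.
    print(check_threshold(percent_used("/")))
```

Note that the check knows nothing about history: a reading of 90% pages someone, and a reading of 89% on the next poll clears it.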

Notice that I said filesystem, and not actual disk. A single disk that's 90% full can be a bad thing, because there are fewer free blocks available for writing, which leads to longer write times and file fragmentation. But not all filesystems are restricted to a single disk: there may be a back-end RAID solution, or the filesystem may be a shared filesystem served over NFS.

Unfortunately, you could end up receiving flapping alert pages when a filesystem hovers between 89% and 90%, even though it still performs fine. Unlike a broken Ethernet cable, a filesystem threshold alert may not be easy to resolve. Sometimes there are files that can't be deleted, or there may not be any additional storage to allocate. You may have a filesystem that sits at 91% full for months simply because a new disk shelf won't arrive until the next budget cycle.

Everything comes down to disk blocks, even SAN and NAS solutions. That brings back the concern regarding fragmentation and performance. But what if your filesystem is a read-only OS image? Or what if it turns out 10% equates to 500 gigabytes on a huge disk appliance? If the filesystem is never being written to, or if the amount of writes equates to 0.001% of the entire filesystem, then where's the fire?
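To put numbers on that, here is a small sketch of the arithmetic. If 10% free equates to 500 gigabytes, the appliance holds 5,000 gigabytes in total. The 20 gigabyte disk is a hypothetical comparison of my own, and decimal gigabytes are assumed throughout.

```python
GB = 10**9  # decimal gigabytes (an assumption; use 2**30 for binary units)


def free_bytes(total_bytes: int, pct_used: float) -> int:
    """Convert a 'percent used' reading into the free space it represents."""
    return round(total_bytes * (100.0 - pct_used) / 100.0)


# Both filesystems would trip the same "90% full" alert.
appliance_free = free_bytes(5000 * GB, 90.0)  # the appliance from the text
small_disk_free = free_bytes(20 * GB, 90.0)   # hypothetical small disk

print(appliance_free // GB, "GB free on the appliance")
print(small_disk_free // GB, "GB free on the small disk")
```

The same percentage hides two very different amounts of remaining space, which is why a single fixed limit says so little on its own.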

What about the inverse? What if your filesystem never reaches 90% full? Can there still be problems?

In the above graph, nobody would have been paged by Nagios or other tools, because the filesystem never reached 90%. For the past few months it averaged 40% full, shot up to 75%, and then went back down. A newly released application was behaving incorrectly, and the issue was caught by the programmer. The next morning he stealthily re-released the application and corrected the issue. Nobody in systems administration noticed until the graph was checked in relation to another issue. Had the programming error never been discovered, the filesystem would have filled up, probably at the most inconvenient time possible for a systems administrator.

I would like to recommend that people developing filesystem or disk monitoring solutions change their way of thinking about filesystem health. Hard limits on allocated space may still be required, but those warnings should be optional. Measuring fullness makes assumptions about block structure that may not be correct.

At the same time, the monitoring system should take the standard deviation of the filesystem percentage over the past 24 hours and compare it to the standard deviation for the past hour. Actually, you'd probably want to take the first 23 of those 24 hours, compute that standard deviation, and compare it to the deviation of the last hour.

If those two deviations aren't close, then there could be radical changes being made to your filesystem that need to be addressed. Maybe files are being added, maybe they are being deleted; either way, it may warrant an investigation. For large filesystems in the terabyte/petabyte range, the percentage value may not be granular enough, so you will need to work with the actual value of free kilobytes or blocks.
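The comparison described above could be sketched roughly as follows. This is one possible reading, not a finished plug-in. It assumes evenly spaced samples of used kilobytes covering 24 hours, oldest first. The name `deviation_check`, the `ratio_limit` of 3, and the `floor` value are all assumptions of this sketch, since "close" is not defined any more precisely here. It also only flags a last hour that is noisier than the baseline, which is a further simplification.

```python
from statistics import pstdev


def deviation_check(samples, recent_count, ratio_limit=3.0, floor=1.0):
    """Compare the spread of the last hour against the earlier 23 hours.

    samples      -- used kilobytes, oldest first, evenly spaced over 24 hours
    recent_count -- how many trailing samples make up the last hour
    ratio_limit  -- hypothetical cutoff for "not close"
    floor        -- minimum baseline deviation, so a perfectly flat
                    baseline does not make every tiny change look huge
    """
    baseline = samples[:-recent_count]
    recent = samples[-recent_count:]
    if len(baseline) < 2 or len(recent) < 2:
        raise ValueError("need at least two samples in each window")

    baseline_sd = pstdev(baseline)
    recent_sd = pstdev(recent)
    suspicious = recent_sd > ratio_limit * max(baseline_sd, floor)
    return baseline_sd, recent_sd, suspicious


# Made-up data: one sample every 5 minutes, so 288 per day and 12 per hour.
quiet_day = [400_000 + 500 * (i % 2) for i in range(288)]

# Same day, but the last hour grows by 50,000 KB on every sample.
runaway_day = quiet_day[:276] + [400_000 + 50_000 * i for i in range(1, 13)]

print(deviation_check(quiet_day, 12))
print(deviation_check(runaway_day, 12))
```

With this made-up data the quiet day is not flagged and the runaway hour is, even though neither one comes anywhere near a 90% limit on a large filesystem.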

I take it back. This isn't a recommendation to monitoring developers; this is a challenge. The first major open source monitoring guy who puts this solution together will have my undivided attention.



7 Comments

Matt
2008-05-02 09:51:34
Well, I know that this type of monitoring is possible with Nagios; you just have to write your plug-ins correctly. They have to be able to recall past data. If you combined the Nagios plug-in with an RRDtool database backend, you could figure out how quickly a filesystem grows and alert on a certain growth percentage as well.


I guess like most things, there is a solution, there's just no /easy/ solution

Jason
2008-05-02 11:01:19
http://www.opennms.org/index.php/Thresholding#Relative_change_threshold


Chris Josephes
2008-05-02 11:23:50
Matt,
I know it's possible to do, but it seems a little strange to write a single Nagios plug-in with a custom back-end storage environment that other plug-ins don't natively use.


Everyone can write a Nagios plug-in that does something, but after years of using Nagios, I think the core itself needs a lot of updating to make plug-in development and flexibility easier.

Chris Josephes
2008-05-02 11:37:20
Jason,
Very close. To be fair, it's probably the closest compared to any other NMS I've seen. But the relativeThreshold requires a starter value of an expected norm. Systems administrators may not have an expected starter value for a given filesystem.


It would be nice if relativeThreshold could compare more than 1 previous sample, though. Otherwise, it could miss an accumulating change that occurs very slowly.


By comparing the standard deviation, the system will make an educated guess, but it has a lot of historical data to go on.


I do think that OpenNMS is probably one of the few open source platforms out there that could easily add this functionality.

Jason
2008-05-05 06:03:14
Chris,


*** FROM THE OpenNMS Discuss list ***


Relative change thresholding is, I think, a step toward what this guy is proposing, but it still operates only on changes between two consecutive samples. The author is suggesting some far more sophisticated statistical analysis involving trend lines and sliding windows and other things that make people scared of taking a statistics class.


I'm thinking Statsd, with a little enhancement, has the potential to be the vehicle for just that kind of functionality in OpenNMS. The data model it uses needs the ability to compare two time periods of different lengths (e.g. the past hour versus the past 24 hours), after which someone would just need to write an AttributeStatisticVisitor that does all the hard work.


-jeff

Chris Josephes
2008-05-05 19:47:29
Jason,


Woo-hoo!


I just might have to get brave and download that source code repository.

Félim Whiteley
2008-05-06 07:04:05
Take a look at Cacti and the Plugin Architecture THOLD plugin. It can do baseline monitoring and allows you to specify the deviation etc. and also the time period you want to look back.


CactiEZ is a prebuilt Cacti system for install with everything already set up.


Félim