Where do you clean the user input?

by Derek Sivers

I learned early on to NEVER trust user-input. I ALWAYS run it through a case-switch or a regex to strip-and-limit characters, or force its type into an integer or whatever I'm expecting it to be.
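To make those three habits concrete, here is a rough sketch of each in PHP. The function names, the allowed values, and the 20-character limit are made up for illustration; treat it as one possible shape, not a recommended library.

```php
<?php
// Hypothetical helpers: names, allowed values, and limits are assumptions.

// 1. Case-switch: only values from a known list get through.
function clean_sort_order($raw)
{
    // switch compares loosely, so reject non-strings before comparing.
    if (!is_string($raw)) {
        return 'asc';
    }
    switch ($raw) {
        case 'asc':
        case 'desc':
            return $raw;
        default:
            return 'asc'; // fall back to a safe default
    }
}

// 2. Regex strip-and-limit: drop unexpected characters, then cap the length.
function clean_username($raw)
{
    if (!is_string($raw)) {
        return '';
    }
    $stripped = preg_replace('/[^A-Za-z0-9_]/', '', $raw);
    return substr((string) $stripped, 0, 20);
}

// 3. Force the type: whatever arrives, an integer comes out.
function clean_quantity($raw)
{
    return is_scalar($raw) ? (int) $raw : 0;
}
```

Each helper returns a value of the expected shape no matter what is passed in, which is exactly the kind of check that ends up repeated everywhere if there is no single agreed place for it.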

But my code was getting cluttered, even down into the little tiny functions, with my constant paranoia about malicious values being passed.

I'm starting to feel like I have padlocks on my silverware drawer, instead of just strong locks on the outside, and comfort and cleanliness inside.

So where do you more experienced programmers choose to clean the user-input? Any advantages or disadvantages you've found from doing it a certain way?

A single user-interface class where ALL incoming data is cleaned and verified before being allowed through? Then stop the paranoia on the inside?

Or is it still wise to stay paranoid and distrust input at all levels?

Please share your experiences or advice, here:


2004-06-01 02:18:15
You are also a user
It's not just users that make mistakes, programmers do too. While you're adding more deadbolts to the door, your "inner child" is pouring custard in the silver drawer.

To protect against yourself, design-by-contract may help. Each method's contract ensures you're in a "known state" before you start, so you can progress to a known state at the end. Even if you don't go as far as DBC, sprinkling assertions throughout your code isn't a bad idea.

It's interesting to look at what DBC does /not/ do, too: you don't expend effort in a method correcting its input, e.g. removing illegal chars. The caller has responsibility for that, and in general, validation gets pushed out towards the user. It's surprising how much cleaner your code becomes when written this way.
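As a small illustration of that idea, here is a sketch in PHP, which has no built-in contract support. The require_that() helper and the apply_discount() function are hypothetical names invented for this example. Note that the method checks its contract and refuses to run; it never tries to repair a bad argument.

```php
<?php
// Hypothetical sketch: require_that() and apply_discount() are assumptions,
// not part of any library.

// Contract check: a violation is a programming error, so throw.
function require_that($condition, $message)
{
    if (!$condition) {
        throw new LogicException($message);
    }
}

function apply_discount($priceCents, $percent)
{
    // Preconditions: the caller is responsible for meeting these.
    require_that(is_int($priceCents) && $priceCents >= 0,
        'price must be a non-negative integer');
    require_that(is_int($percent) && $percent >= 0 && $percent <= 100,
        'percent must be an integer from 0 to 100');

    $result = intdiv($priceCents * (100 - $percent), 100);

    // Postcondition: we end in a known state too.
    require_that($result >= 0 && $result <= $priceCents,
        'discounted price out of range');

    return $result;
}
```

A call such as apply_discount(1000, 150) fails loudly instead of quietly clamping the percentage, which pushes the job of producing a valid percentage back out towards whoever read it from the user.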

2004-06-01 08:09:15
clean input
Can I recommend a book? Writing Solid Code by Steve Maguire covers the question you've asked in a very readable and intelligent way.

2004-06-01 09:06:34
Design by contract
Interesting...I'm only vaguely familiar with Design by Contract, but I like how, as you describe, it tends almost automatically to answer the question of where "data cleansing" should be done.

I'm a fan of styles of programming (I'm reluctant to say "methodologies") which enable the design to flow naturally from one's work. Test Driven Development is one such style, DBC sounds like another.

2004-06-01 14:36:32
client and server
When time allows I do input validation with JavaScript on the client side (assuming web clients) and then again on the server side. When it's one or the other I generally do server-side processing, as it's too easy to bypass client-side code.

To take things to the next level I use a strongly typed server-side language like Java. Coming from a web scripting background and working back into the server this was frustrating at first, but then I began to appreciate the granularity that it gave me. For my own state of mind I prefer memory management provided by a VM . . . buffer overflows are just too easy. I'm liking the Mono implementation of C# as it's shaping up!

I generally stop short of doing additional validation at the DB insert level . . . :-)

I think a healthier analogy might be to think of your data as water. It doesn't hurt to filter it more than once and it's probably a good idea to use the most reliable filter(s) available.

2004-06-02 22:18:46
Stop the paranoia on the inside

I think validation should happen as close to the point of entry as possible. For a form on a website this is either in JavaScript on the client side or in a "presentation layer" on the server side.

Take the following analogy. When you talk to a teller at a bank, he asks you what kind of transaction you want to perform: "deposit" or "withdraw". If you answered "hamburger and large fries", the teller would not dutifully type this into the system and then say "I'm sorry, hamburger and large fries is not something our systems can handle". That would be silly. Instead he would ask you again and, if you still didn't answer appropriately, he would call security.
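In code, the teller might look something like the sketch below. All of the names here (read_transaction_request, apply_transaction, the 'type' and 'amount' fields) are invented for the example, and the 9-digit cap is just an assumption to keep the amount inside integer range.

```php
<?php
// Hypothetical sketch: function and field names are assumptions.

// The "teller": the only place raw input is examined.
// Returns a clean request array, or null if the input is unacceptable.
function read_transaction_request(array $raw)
{
    $type   = isset($raw['type']) ? $raw['type'] : null;
    $amount = isset($raw['amount']) ? $raw['amount'] : null;

    // "hamburger and large fries" stops here.
    if (!in_array($type, array('deposit', 'withdraw'), true)) {
        return null;
    }
    // Digits only, and short enough to fit in an integer.
    if (!is_string($amount) || !ctype_digit($amount) || strlen($amount) > 9) {
        return null;
    }
    $cents = (int) $amount;
    if ($cents === 0) {
        return null;
    }
    return array('type' => $type, 'cents' => $cents);
}

// The "bank system": trusts what the teller hands over, no re-checking.
function apply_transaction($balanceCents, array $request)
{
    return $request['type'] === 'deposit'
        ? $balanceCents + $request['cents']
        : $balanceCents - $request['cents'];
}
```

The inner function stays short because it only ever sees requests that the boundary function built.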

2004-06-03 22:51:24
Naming Conventions
I think your question is too application-specific to have one right answer. However, relying on client-side scripting, as I've seen suggested, is definitely a wrong answer.

One practice that I advocate is to use a strict naming convention to help you distinguish between data that hasn't been filtered and that which has. This at least allows you to focus on two issues while writing your application logic: only using filtered data in critical areas and making sure that unfiltered data cannot possibly be incorrectly named (usually by initializing certain variables).

Just to give a hypothetical example:

$clean = array();
// Only a known-good value is ever copied into $clean.
switch ($_POST['color']) {
    case 'red':
    case 'green':
    case 'blue':
        $clean['color'] = $_POST['color'];
        break;
}

If $clean['color'] exists after this point, you can be sure that it is red, green, or blue. Of course, you could also assign a default value if you need to.
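For the default-value variation, one way it might look is sketched here. The filter_color() wrapper and its $input parameter are hypothetical, added only so the example runs on its own without a real $_POST; the default of 'red' is likewise just an assumption.

```php
<?php
// Hypothetical sketch: filter_color() and the 'red' default are assumptions.
// $input stands in for $_POST so the example is self-contained.
function filter_color(array $input, $default = 'red')
{
    // Initialize first, so nothing unfiltered can end up under this name.
    $clean = array();
    $clean['color'] = $default;

    $allowed = array('red', 'green', 'blue');
    // Strict comparison: only an exact string match is accepted.
    if (isset($input['color']) && in_array($input['color'], $allowed, true)) {
        $clean['color'] = $input['color'];
    }

    return $clean;
}
```

With this version $clean['color'] always exists afterwards, so later code does not need an isset() check before using it.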

I think this practice can give you more freedom in your design decisions while still making it easy to properly filter data. I've seen this question posed a number of times, and people seem to fall into two categories: design your application to have a tough outer shell and a soft chewy center, or be paranoid about data at every step. I'm not really in favor of either of these approaches, since I think security revolves around the developer's clear understanding about the data being used at every step.

The paranoid approach is not more secure - if you don't leverage the trust you can place in some things, you're going to have a weaker implementation that is based on guesswork. The paranoid (well, "blind paranoid" really) approach is also much more tedious and more likely to result in errors.

I may elaborate more on this topic in a future Security Corner in php|architect. Hopefully this helps a bit.

2004-12-17 20:43:04
great quote from Pragmatic Unit Testing
A great quote from the book Pragmatic Unit Testing by Andrew Hunt and David Thomas:

Who's responsible for validating input data?

In many systems, the answer is mixed, or haphazard at best. You can't really trust that any other part of the system has checked the input data, so you have to check it yourself or at least, that aspect of the input data that particularly concerns you. In effect, the data ends up being checked by everyone and no one. Besides being a grotesque violation of the DRY principle [HT00], it wastes a lot of time and energy and we typically don't have that much extra to waste.

In a well-designed system, you establish up-front the parts of the system that need to perform validation, and localize those to a small and well-known part of the system.

So the first question you should ask about a system is, who is supposed to check the validity of input data?

Generally we find the easiest rule to adopt is the "keep the barbarians out at the gate" approach. Check input at the boundaries of the system, and you won't have to duplicate those tests inside the system.

Internal components can trust that if the data has made it this far into the system, then it must be okay.

It's sort of like a hospital operating room or industrial clean room approach. You undergo elaborate cleaning rituals before you or any tools or materials can enter the room, but once there you are assured of a sterile field. If the field becomes contaminated, it's a major catastrophe; you have to re-sterilize the whole environment.