A Survey Schema Puzzler

by chromatic

Suppose you're a college student performing a research project. Part of that research is a survey with several types of questions. To make your life easier, you've decided to create a small web application to ask the questions and record the answers -- this also gives you a SQL database you can query to analyze the results.

(No, I'm not a college student -- a friend is and the question stumped me.)


2006-03-06 18:25:45
I don't see why each answer set isn't simply a collection of records, and I don't think it's taking the business logic (I hate that phrase--your friend isn't in business) out of the database to leave the type--like writing an enum of 1, 2, 3, 4 for the scaling questions, right?--outside of the database proper.

Probably you'd want to put some sort of key on the answers, and capture things like timestamp information, and possibly clickstream information, too.

Where it might get more interesting would be if (as I assume will be done) the answers to the second question in the first situation are used as the questions in the second situation. Then, assuming that people would be giving similar (but not reliably identical) answers, you'd need some method of joining on approximate matches. That's tough--I'd do it outside the database.

2006-03-06 22:18:44
I gave this a little more thought and produced a slightly more detailed answer.

I'm assuming that what you've shown in the examples is representative of what your friend is going to be asking. If so, then, the "question" is really two parts: The idiom being tested, which is data, and the verbage around the idiom in the test, which is not. The idiom being tested is in the database, in a table of of its own, perhaps with an ID column (depending on your taste in design).

That's one table. There's also a table of surveys--that is, groups of "questions"/idioms. Here you have two columns, one the primary key from the idiom table, and one the survey ID number. Together, that's the primary key for this table. If the order of the "questions"/idioms in each survey is important and non-alphabetical, then you could add a column for ordering purposes.

Finally, you've got the two tables of responses. I wouldn't try putting both types of answers into one table. You could do it, especially if you took advantage of (for instance) Oracle's object-oriented features to mix heterogenous question/answer types, but I see no advantage and some downside to this. In each table, I'd start with a session ID (more about this in a minute). For each "question"/idiom usage question, you've got a session ID and the primary key from the idiom table--that's a primary key for this table--along with the survey ID, the yes/no, the free text and the numeric rating. For the usage rating question, just the primary key, the survey ID, and the numeric rating.

(If this is an undergrad project, begin stopping after partially implementing the next suggestion.)

I'd want some sort of session information, too--at least a timestamp, maybe information about the client and its connection, possibly some form of clickstream information: Did anyone backtrack through the test and change an answer? How often is the help screen accessed? Did people utilize the link you might provide to give a dry definition of the idiom to answer the free-text question?

I wouldn't put this into the response table, though--I'd put this into a session information table. There you've got a harder design problem, but you start with that session ID mentioned above.

The other thing that's hard is that, in this case, it looks like your friend would probably like to compare the answers to sub-questions one and three in the first set with that of the second question by joining on the idiom in question. You can do that by generating the second question from sub-question two in the first set, but given that it's free-form text, I'm certain you'll see a lot of similar, but not quite identical, answers, and I bet also a fair number of outliers. It'd be really nice to be able to analyze that whole set, maybe even assigning some sort of quantitative rating to how similar two usage "free responses" might be--maybe even a rating for how close each one comes to some canonical response established by your friend.

Here, I think I'd consider doing some processing outside the database, and building two columns. One establishes what general category the response lies in--join on that to begin your analysis--and the other describes the specifics of the response within that category--therein you'll find a lot of detail, only a little of which will be meaningful. Again, look for clusters and outliers.

2006-04-01 08:24:03
QUESTION (Question Number, Category Number, Question Type Number, Question Text)
ANSWER (Answer Number, Answer Text)
CATEGORY (Category Number, Category Text) – Allows a grouping of questions in general categories and not to represent specific hierarchies of questions
RESPONSE (Session, Question Number, Answer Number, Free Response)
SURVEY (Survey Number, Description,)
SURVEYQUESTION (Survey Number, Question, Number, Sequence Number) – Sequence Number is order of questions on survey
QUESTIONTYPE (Question Type Number, Question Type Text) – Boolean, Free Response, Rating
QUESTIONANSWER (Question Number, Answer Number) – Possible answers for a given question
SUBQUESTION (Question Number, Sub Question Number) --Sub Question Number role name of Question Number, entity used to represent question hierarchy