Blogs as data stores

by brian d foy

I was talking with a friend and for some reason we got on the subject of blogs. We both sometimes write things in our weblogs so we can store the info. Instead of post-its on the monitor, bookmarks in browsers, or to-do items in PDAs, we have blogs.

I already "blog for Google", which is the same thing as the old Usenet practice of posting about some problem I encountered and how I solved it. These entries are not really for discussion, but more for the archives so that the next poor soul can find them. Randal Schwartz tells me this is how it was back in the day when he could read all of Usenet in a half-hour.

Someday, when we get our heads wrapped around unstructured data stores, there may be Perl modules called DBI::bloxsom and DBI::PerlMonks to bring together this stuff. Until then we have blogs and Google, and blogging about blogs.


2004-07-08 10:29:47
Blogs as Knowledge Management
My idea is nothing new: I found this through a trackback: Blogs as File Cabinets, which the author calls "knowledge management". I don't know if there is much management going on, though. :)
2004-07-08 11:32:58
and Wikis
>Instead of post-its on the monitor, bookmarks
>in browsers, or to-do items in PDAs, we have blogs.

I like using a wiki even more, because of the editing interface: from any browser anywhere, I can add something to my to-do list.

2004-07-08 12:38:59
This is not unstructured
Data without structure is random data.

Blogs are not at all unstructured. In fact, most blogs share at least some common structure. They consist of a set of entries each of which can be identified by a title and date. Each entry contains text, often HTML text. Each entry has an author, and that author is usually further broken down into one or more components (name, email, home page, etc).
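As a rough illustration of the common structure described above, here is a minimal Python sketch. The class and field names (`Entry`, `Author`, `posted`, `home_page`) are assumptions made for this example and do not come from any particular blogging tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Author:
    # Components of an author; only the name is required in this sketch.
    name: str
    email: str = ""
    home_page: str = ""


@dataclass
class Entry:
    # An entry is identified by its title and date.
    title: str
    posted: datetime
    text: str  # often HTML text
    author: Author

    def words(self):
        # Crude breakdown of the text into lowercase, whitespace-separated words.
        return self.text.lower().split()


# Hypothetical entry built from this comment's own title and timestamp.
entry = Entry(
    title="This is not unstructured",
    posted=datetime(2004, 7, 8, 12, 38, 59),
    text="Data without structure is random data.",
    author=Author(name="a commenter"),
)
```

Nothing here says the fields are the right ones for a given query; it only shows that title, date, text, and author are already separable pieces.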

If you wanted to go further, you could break down the entry's text into paragraphs, or sentences, or individual words. That's structure.

The bandying about of "unstructured" and "semi-structured" among techies is ridiculous. The former only applies to truly random data, and the latter simply doesn't exist. Something either has structure or it doesn't.

The real key is that the data in blogs may not be structured in an ideal way for database-type queries. For example, it'd be nice if all blog entries were categorized, but most probably aren't. It'd be nice if multiple blogs all shared the same categories, but they don't. This doesn't make the data unstructured; it just means it's not structured the way you want it to be for your hypothetical application.

2004-07-08 12:54:32
This is not unstructured
Perhaps you are thinking too hard about this stuff. I didn't say that blogs were unstructured. In fact, I didn't label anything as unstructured. I only stated that when we have a better understanding of unstructured databases, we might see different sorts of data access.

Furthermore, if I say in this reply that the value of Pi is 3.14, but I also like apple pie on the Fourth of July, I have some information that shows up in an unstructured fashion. There isn't a clear anatomy to the information in this paragraph. You can break things into sentences, but what good is that if I am interested in pie? The information is not organized (or structured) even if the prose is. The format is not the information.

You are thinking too much about the format rather than the organization (structure) of information. The word "unstructured" still means what it has always meant, however you want to view the world.

By the way, random data does exist. I had to collect lots of randomness as a physicist, and I even had a book of random numbers.

"Ridiculous" is knee-jerk reactions and thinking the world is black and white. :)

2004-07-08 22:11:25
and Wikis
I can generally edit my blogs from any web browser too. Indeed, with most of my blogs, I only have a web interface and I can update any post.
2004-07-08 22:30:40
This is not unstructured
It's still all structured. Your sentence about pi and pie has a structure. A sufficiently clever program could extract very specific information from it.

So if you're interested in pie, that clever program could use the structure of your sentence to determine that brian d foy likes pie. Or if you're just interested in mentions of pie, we don't even need a terribly clever program, by modern standards. Full text indexing certainly isn't cutting edge.
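To show how little cleverness finding mentions of pie takes, here is a minimal sketch of a full text index (an inverted index from words to the documents containing them). The function names and the document ids `reply-1` and `reply-2` are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase the text and pull out runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(docs):
    # Map each word to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, word):
    # Return the ids of documents mentioning the word, in sorted order.
    return sorted(index.get(word.lower(), set()))


docs = {
    "reply-1": "The value of Pi is 3.14, but I also like apple pie on the Fourth of July.",
    "reply-2": "I can generally edit my blogs from any web browser too.",
}
index = build_index(docs)
print(search(index, "pie"))  # prints ['reply-1']
```

This only finds mentions. Working out who likes pie would need the sentence-level analysis described above, which this sketch does not attempt.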

Calling it unstructured is ridiculous. The English language very definitely has structure, byzantine as its grammar may be. You can use that structure to extract specific pieces of information from a sentence or paragraph.

Just because it's really hard to do this doesn't mean the information is unstructured; it just means that its structure is very complex.

2004-07-09 05:00:58
Dealing with unstructured data
Interesting, Brian, that you are talking about this while Simon St Laurent is mentioning the Extreme Markup conference. Two different ends of the problem.

I was recently looking at my Google Mail account and the way I now have about thirty labels. With a little effort, a smart piece of software could probably apply those labels to incoming mail based on how I labelled mail in the past, extracting as much as possible from the slight structure of a piece of email and the unstructured nature of the subject and body.
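A very rough sketch of that idea, assuming nothing more than word counts: score each existing label by how often the words of a new message appeared in mail previously filed under it. The function names and sample messages are invented for illustration, and this is not how Google Mail or any real filter works.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase the text and pull out runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def train(labelled_mail):
    # Count how often each word appeared under each label in past mail.
    counts = defaultdict(Counter)
    for text, label in labelled_mail:
        counts[label].update(tokenize(text))
    return counts


def suggest_label(counts, text):
    # Score each label by summing its past counts for the new message's words.
    words = tokenize(text)
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    best = max(scores, key=scores.get, default=None)
    # Suggest nothing when no word has been seen under any label.
    return best if best is not None and scores[best] > 0 else None


# Hypothetical past mail: (subject, label I applied by hand).
past = [
    ("Perl module release on CPAN", "perl"),
    ("Invoice for conference registration", "billing"),
    ("New Perl regex tutorial", "perl"),
]
counts = train(past)
print(suggest_label(counts, "Question about a Perl module"))  # prints perl
```

A real tool would want to discount common words and use headers such as the sender, but even this crude overlap score makes use of the labelling already done by hand.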

We will certainly be able to do a lot better when we can do the same with the items we post to our blogs. I agree that it will be interesting to watch where we go with blogs as data stores.


2004-07-09 08:48:05
This is not unstructured
If you want to apply the word "structured" to a mound of dirt and rocks, then you lose the meaning of the word.

Just because I can break things into paragraphs and sentences doesn't mean I know everything in the text. That may be one structure, but it certainly doesn't do much for full text searches.

If you think that the written language always has structure, you just haven't seen enough of it. People break that structure intentionally and unintentionally. You can't rely on that. Grammar books are full of examples.

You may not like it, but there are data stores that are organized and the information is labelled, and there are data stores that are just big messes of information. I'm calling the second one unstructured, because that's what unstructured means in English.