I used to work for a document management software vendor, and they espoused the value of not putting things in folders but instead indexing them using metadata. In theory this loose coupling gives you greater flexibility and search power, but in practice people (and today's tools) prefer hierarchies. Plus the cost of capturing the metadata is rarely recouped. The directory structure presented here looks straightforward and simple - much better!
Anyway, re OCR, it will be interesting to see what happens with the Tesseract project now that a couple of Googlers are on board. Coupling that with Lucene would get you some way towards a solution.