Data Condoms: Solutions for Private, Remote Search Indexes

by Sid Steward

Related link: http://google.blognewschannel.com/index.php/archives/2006/02/09/privacy-experts-…



The new Google Desktop has privacy watchdogs barking. Enough complaining -- what's your solution? I offer a couple of information condoms.


These are pretty simple-minded ideas, yet each has its merits. Both rest on the fact that a search index is only an abstract of a document's contents. Somebody smarter than me should be able to hatch something better and put this issue to bed.


Loose Word Order


A page of words can say a lot, until you randomize the words. For purposes of search, it is enough to know which page or document contains my search query. So create an index that treats word order very loosely. I won't get readable snippets of text in search results, but I wouldn't mind. How about a thumbshot instead?
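To make the idea concrete, here is a minimal sketch of an index that keeps no word order at all: it records only which words appear in which document. The function names (tokenize, build_index, search) and the sample documents are my own assumptions for illustration, not part of any real product.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and pull out runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it.

    docs is a dict of doc_id -> text. Word order and word counts are
    thrown away, so the page cannot be read back out of the index.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(tokenize(text)):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, for illustration only.
docs = {
    "a.txt": "The quick brown fox",
    "b.txt": "The lazy dog",
}
index = build_index(docs)
hits = search(index, "fox quick")  # query word order does not matter
```

The price is exactly the one mentioned above: with no positions stored, there is nothing to build a readable snippet from, and phrase queries are not possible.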


This script of mine randomizes the text on web pages to give you an idea of how effective this obfuscation is. It chunks words using block-level tags.









From LookLeap.com
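For readers who want to try this kind of shuffling locally, here is a rough sketch. It is not the LookLeap script; the tag list, the regex-based splitting, and the name shuffle_blocks are assumptions of mine, and a real page would deserve a proper HTML parser.

```python
import random
import re

# Assumed list of block-level tags to chunk on; extend as needed.
BLOCK_TAGS = r"</?(?:p|div|li|h[1-6]|td|blockquote|br)\b[^>]*>"


def shuffle_blocks(html, rng=None):
    """Split HTML on block-level tags and shuffle the words in each chunk.

    Returns a list of strings, one per non-empty chunk. Words never
    move from one chunk to another.
    """
    if rng is None:
        rng = random.Random()
    shuffled = []
    for chunk in re.split(BLOCK_TAGS, html, flags=re.IGNORECASE):
        # Drop any remaining inline tags, then split on whitespace.
        words = re.sub(r"<[^>]+>", " ", chunk).split()
        if words:
            rng.shuffle(words)
            shuffled.append(" ".join(words))
    return shuffled


# Hypothetical input, for illustration only.
page = "<p>one two three</p><div>four <b>five</b></div>"
blocks = shuffle_blocks(page, random.Random(0))
```

Each chunk keeps its own vocabulary, so search still works chunk by chunk, but the sentences themselves are gone.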



Index Word Hashes, Not Words


If that's not enough, then consider running each word through a one-way hash before entering it into the index. Be sure to stem the words first. When you go to search this index, stem and hash your query the same way. Salt your hash or get as fancy as you want. This way the server hosting your index never sees the plain words you're storing.
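Here is a minimal sketch of the stem-then-hash idea, under some loud assumptions: toy_stem is a crude suffix stripper standing in for a real stemmer such as Porter's, SECRET_SALT is a made-up key that would have to stay on the client, and the salting is done with HMAC-SHA256 from Python's standard library. All function names are hypothetical.

```python
import hashlib
import hmac
import re
from collections import defaultdict

# Hypothetical secret; it must never be sent to the server holding the index.
SECRET_SALT = b"keep-this-on-the-client"


def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def token(word, salt=SECRET_SALT):
    """Stem a word, then turn it into a salted one-way hash (hex string)."""
    stem = toy_stem(word.lower())
    return hmac.new(salt, stem.encode("utf-8"), hashlib.sha256).hexdigest()


def tokens(text, salt=SECRET_SALT):
    """Hash every word in the text; returns a set of hex strings."""
    return {token(w, salt) for w in re.findall(r"[a-z0-9]+", text.lower())}


def build_hashed_index(docs, salt=SECRET_SALT):
    """Map each hashed stem to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for t in tokens(text, salt):
            index[t].add(doc_id)
    return index


def search(index, query, salt=SECRET_SALT):
    """Stem and hash the query the same way, then intersect the matches."""
    wanted = tokens(query, salt)
    if not wanted:
        return set()
    return set.intersection(*(index.get(t, set()) for t in wanted))


# Hypothetical documents, for illustration only.
docs = {"memo": "Searching private indexes", "note": "Dogs barked loudly"}
index = build_hashed_index(docs)
hits = search(index, "search index")
```

Only the hex strings and document ids would go to the remote host. One caveat worth stating: the host can still see how often each hashed token occurs, so very common words may be guessable from frequency alone.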


"Don't Use It"? Not Good Enough


FWIW, that's my $0.02 on how to solve the remote privacy problem. Shoot these ideas down or invent your own, but please, let's talk about a solution to this issue. "Don't use it" isn't good enough. I want darknet/p2p search!