Subject:   RE: A more difficult search engine
Date:   2005-02-09 21:25:14
From:   corich
Response to: RE: A more difficult search engine

Giff is probably writing for Google by now, but maybe this will help someone else. This is actually not as tough as it might appear. First, add an extra column to the page table (call it page_text) and give it the TEXT datatype so that it isn't tightly constrained size-wise. In MySQL a TEXT column holds up to 64 KB, so use MEDIUMTEXT if your pages can be larger than that. Next, in your spider, add the new code right after the existing tag-stripping lines, as follows:

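/* Try to remove all HTML-tags: */
$buf = strip_tags($buf);
/* preg_replace rather than ereg_replace, since the pattern uses PCRE syntax */
$buf = preg_replace('/&#?\w+;/', '', $buf);
/* the above is for context, here's the new stuff: */
/* escape the text first, since a quote in the page would break the query */
$text = mysql_real_escape_string($buf);
mysql_query("UPDATE page SET page_text = '$text' WHERE page_id = $page_id");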

This stores the entire page text, stripped of tags, in the table as a contiguous string.
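
For anyone trying this on a current PHP install, where the old mysql_* functions are no longer available, here is a minimal sketch of the same store step using PDO with a bound parameter. The helper name clean_page_text, the sample HTML, and the in-memory SQLite database standing in for MySQL are all assumptions made so the example runs on its own; only the page table and its page_id, page_url and page_text columns come from the post above.

```php
<?php
// Hypothetical helper (not from the original spider): strip tags and entities.
function clean_page_text(string $html): string
{
    $text = strip_tags($html);
    // Remove named and numeric entities such as &copy; or &#160;
    return preg_replace('/&#?\w+;/', '', $text);
}

// Assumption: in-memory SQLite stands in for the MySQL database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_url TEXT, page_text TEXT)');
$db->exec("INSERT INTO page (page_id, page_url) VALUES (1, 'http://example.com/')");

// Made-up sample page; the apostrophe would break a hand-built query string.
$buf = "<p>Jerry's <b>search</b> engine&copy;</p>";
$page_id = 1;

// The bound parameters take care of quoting, so no manual escaping is needed.
$stmt = $db->prepare('UPDATE page SET page_text = :text WHERE page_id = :id');
$stmt->execute([':text' => clean_page_text($buf), ':id' => $page_id]);

$stored = $db->query('SELECT page_text FROM page WHERE page_id = 1')->fetchColumn();
```

Pointing the PDO constructor at a MySQL DSN instead should leave the prepare/execute part unchanged, though I have only sketched it against SQLite here.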

The only other step is to make the search engine portion of the project check for the exact search phrase. The simplest way to write an exact-phrase search (and it will find only exact matches) is to replace the search query with something like this:

SELECT p.page_url AS url
FROM page p
WHERE page_text LIKE '%$keyword%'

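This query searches the page table for instances of the keyword phrase within the full text of each page. Escape $keyword (for example with mysql_real_escape_string) before putting it into the query, and keep in mind that any % or _ inside the phrase acts as a LIKE wildcard.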
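
The phrase search can be sketched the same way. This is only an illustration under a few assumptions: like_phrase and find_pages are hypothetical helper names, '!' is an arbitrary choice of escape character, the sample rows are made up, and an in-memory SQLite database again stands in for MySQL so the snippet runs by itself.

```php
<?php
// Hypothetical helper: escape LIKE wildcards so the phrase matches literally.
function like_phrase(string $phrase): string
{
    // '!' is the escape character named in the query's ESCAPE clause;
    // it is escaped first so the later replacements are not doubled.
    $escaped = str_replace(['!', '%', '_'], ['!!', '!%', '!_'], $phrase);
    return '%' . $escaped . '%';
}

// Hypothetical helper: return the URLs of pages containing the exact phrase.
function find_pages(PDO $db, string $phrase): array
{
    $stmt = $db->prepare(
        "SELECT p.page_url AS url FROM page p
         WHERE p.page_text LIKE :pattern ESCAPE '!'"
    );
    $stmt->execute([':pattern' => like_phrase($phrase)]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Assumption: in-memory SQLite with made-up rows, in place of the real table.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_url TEXT, page_text TEXT)');
$insert = $db->prepare('INSERT INTO page (page_url, page_text) VALUES (?, ?)');
$insert->execute(['http://example.com/a', "Jerry's more difficult search engine"]);
$insert->execute(['http://example.com/b', 'A search for a difficult engine']);
$insert->execute(['http://example.com/c', 'Save 100% today']);
$insert->execute(['http://example.com/d', 'Save 1000 today']);

$hits    = find_pages($db, 'difficult search engine');
$percent = find_pages($db, '100%');
```

Only page a contains the three words in that order, so it should be the only hit for the first search. Without the wildcard escaping, the second search would also pick up page d.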

Hope this helps!