Setting up ht://Dig on Mandrake 9

by Uche Ogbuji

I wanted to set up archive searching for the Mailman lists hosted at lists.fourthought.com, which include the various 4Suite lists, the EXSLT lists and the new XPath NG list. It was not a trivial process, so here are the steps I took.


The target server was a Mandrake 9 box which already had Mailman and Apache set up. Getting an RPM for ht://Dig was easy enough: I downloaded the htdig-3.2.0-0.5mdk RPM for i586 from the RPM repository. One rpm -i htdig* later, it was all installed.
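
To see where the package put the pieces used in the rest of these steps, you can query the RPM database. This is only a minimal sketch: it assumes the installed package is named htdig, and the function name htdig_key_files is made up for illustration.

```shell
#!/bin/sh
# List the files from the installed package that the later steps rely on:
# the config file, the search CGI and the indexing script.
# Assumption: the package is registered under the name "htdig".
htdig_key_files() {
    rpm -ql htdig | grep -E '/(htdig\.conf|htsearch|rundig)$'
}

# Usage: htdig_key_files
```

If nothing is printed, the package may be registered under a different name; rpm -qa | grep -i htdig should show it.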


I was able to glean a lot of valuable information from this excellent HOWTO.


I edited the config file (/etc/htdig/htdig.conf). Here are the changes I made, in diff form:



--- /etc/htdig/htdig.conf.orig 2002-11-18 20:35:31.000000000 -0700
+++ /etc/htdig/htdig.conf 2002-11-20 14:32:20.000000000 -0700
@@ -1,3 +1,4 @@
+#See http://www.scrounge.org/linux/htdig.html
#
# Example config file for ht://Dig.
#
@@ -25,7 +26,7 @@
# You could also index all the URLs in a file like so:
# start_url: `${common_dir}/start.url`
#
-start_url: http://localhost
+start_url: http://lists.fourthought.com/mailman/listinfo

#
# This attribute limits the scope of the indexing process. The default is to
@@ -37,7 +38,7 @@
# patterns. As long as URLs contain at least one of the patterns it will be
# seen as part of the scope of the index.
#
-limit_urls_to: ${start_url}
+limit_urls_to: http://lists.fourthought.com/

#
# If there are particular pages that you definitely do NOT want to index, you
@@ -48,7 +49,7 @@
# may not work on your web server. Check the path prefix used on your web
# server.)
#
-exclude_urls: /cgi-bin/ .cgi
+exclude_urls: /cgi-bin/ .cgi subject.html author.html date.html

#
# Since ht://Dig does not (and cannot) parse every document type, this
@@ -66,7 +67,7 @@
# The string htdig will send in every request to identify the robot. Change
# this to your email address.
#
-maintainer: root@localhost
+maintainer: admin@dollar.fourthought.com

#
# The excerpts that are displayed in long results rely on stored information
@@ -140,6 +141,19 @@
# Short short ${common_dir}/short.html
# template_name: long

${common_dir}/${this_base}/header.html
+search_results_header: ${common_dir}/header.html
+search_results_footer: ${common_dir}/footer.html
+nothing_found_file: ${common_dir}/nomatch.html
+syntax_error_file: ${common_dir}/syntax.html
+
+template_map: Long builtin-long ${common_dir}/long.html \
+ Short builtin-short ${common_dir}/short.html \
+ Default default ${common_dir}/long.html
+template_name: Default
+
#
# The following are used to change the text for the page index.
# The defaults are just boring text numbers. These images spice
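
After editing, a quick way to confirm which values are actually in effect is to print the uncommented attributes touched by the diff. A minimal sketch, where the function name show_htdig_settings is hypothetical and the default path is the one used in this article:

```shell
#!/bin/sh
# Print the active (uncommented) settings for the attributes changed in
# the diff above. Pass another config path as the first argument if yours
# is not /etc/htdig/htdig.conf.
show_htdig_settings() {
    conf=${1:-/etc/htdig/htdig.conf}
    grep -E '^(start_url|limit_urls_to|exclude_urls|maintainer|template_name):' "$conf"
}

# Usage: show_htdig_settings
```

Lines that are still commented out with # are deliberately skipped, so an attribute that does not show up has not been set.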


Naturally, you'll need to customize these changes according to your needs. The Mandrake RPM sets up ${common_dir} as /var/www/html/htdig, and places the example files at /usr/share/htdig. I made copies of these files in the common_dir location:



cp -R /usr/share/htdig/ /var/www/html/
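
Since the config changes above point at specific files under ${common_dir}, it can be worth checking that the copy really put them there. The following is a sketch only; the function name check_common_files is invented here, and the file list is taken from the attributes in the diff above.

```shell
#!/bin/sh
# Report any file named in the config changes that is missing from the
# given directory (default: the Mandrake common_dir from this article).
# Prints one line per missing file; returns non-zero if any are missing.
check_common_files() {
    dir=${1:-/var/www/html/htdig}
    missing=0
    for f in header.html footer.html nomatch.html syntax.html long.html short.html; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_common_files /var/www/html/htdig
```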

Then I edited the copies in /var/www/html/htdig to customize the look and feel. Next I had to set up Apache. I edited the virtual host stanza for lists.fourthought.com in /etc/httpd/conf/vhosts/Vhosts.conf to add the following:



ScriptAlias /cgi-bin/ /var/lib/mailman/cgi-bin/
Alias /htdig/ /var/www/html/htdig/

Then I copied the htsearch executable to the Mailman CGI directory:



cp /usr/bin/htsearch /var/lib/mailman/cgi-bin/

Then I restarted Apache:



/etc/init.d/httpd restart

All that remained was to add a search form to each list information page. First I updated the Mailman template for list info pages, /var/lib/mailman/templates/listinfo.html. I made the following changes:



--- /var/lib/mailman/templates/listinfo.html.orig 2002-11-20 15:02:23.000000000 -0700
+++ /var/lib/mailman/templates/listinfo.html 2002-11-20 15:15:51.000000000 -0700
@@ -6,7 +6,6 @@
</HEAD>
<BODY BGCOLOR="#ffffff">

- <MM-Subscribe-Form-Start>
<P>
<TABLE COLS="1" BORDER="0" CELLSPACING="4" CELLPADDING="5">
<TR>
@@ -35,6 +34,44 @@
</p>
</TD>
</TR>
+ <tr>
+ <TD COLSPAN="2" WIDTH="100%" BGCOLOR="#FFF0D0">
+ <B>Search all mailing list archives on lists.fourthought.com</B>
+ </TD>
+ </TR>
+ <TR>
+ <TD COLSPAN="2" WIDTH="100%" BGCOLOR="#9999FF">
+<form method="post" action="/cgi-bin/htsearch">
+<font size="-1">
+Match: <select name="method">
+<option value="and">All
+<option value="or">Any
+<option value="boolean">Boolean
+</select>
+Format: <select name="format">
+<option value="builtin-long">Long
+<option value="builtin-short">Short
+</select>
+Sort by: <select name="sort">
+<option value="score">Score
+<option value="time">Time
+<option value="title">Title
+<option value="revscore">Reverse Score
+<option value="revtime">Reverse Time
+<option value="revtitle">Reverse Title
+</select>
+</font>
+<input type="hidden" name="config" value="htdig">
+<input type="hidden" name="restrict" value="">
+<input type="hidden" name="exclude" value="">
+<br>
+Search:
+<input type="text" size="30" name="words" value="">
+<input type="submit" value="Search">
+</form>
+ </TD>
+ </TR>
+
<TR>
<TD COLSPAN="2" WIDTH="100%" BGCOLOR="#FFF0D0">
<B><FONT COLOR="#000000">Using <MM-List-Name></FONT></B>
@@ -59,6 +96,7 @@
<P>
Subscribe to <MM-List-Name> by filling out the following
form.
+ <MM-Subscribe-Form-Start>
<MM-List-Subscription-Msg>
<ul>
<TABLE BORDER="0" CELLSPACING="2" CELLPADDING="2"

This takes care of mailing lists created from then on, but not existing lists, so I had to modify the list info page for each of those. I did this through the mailing list Web admin UI: go to the list admin Web page for each list, click "Edit the HTML for the public list pages", then "General list information page". Clear the text area in the resulting form and paste in the contents of the template file.


For an example of the result, see the 4Suite mailing list info page.
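
Once an index exists (that is, after rundig has been run at least once), the search CGI can also be exercised from the command line, which helps separate problems in the form from problems in the setup. This is a hedged sketch: it assumes curl is installed and that htsearch accepts GET requests as well as the POST used by the form. The function name search_archives is hypothetical; the parameter names are the ones from the form above.

```shell
#!/bin/sh
# Send a query to htsearch with the same parameters the form submits.
# $1 is the base URL of the site; the remaining arguments are search words.
# -G makes curl append the URL-encoded data to the URL as a query string.
search_archives() {
    base=$1
    shift
    curl -sG "$base/cgi-bin/htsearch" \
        --data-urlencode "config=htdig" \
        --data-urlencode "method=and" \
        --data-urlencode "format=builtin-short" \
        --data-urlencode "words=$*"
}

# Usage: search_archives http://lists.fourthought.com xpath
```

The output is the raw HTML of the results page, so piping it through a pager or grep is usually more comfortable than reading it directly.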


You might also want to set up the indexer to run nightly, by creating a file /etc/cron.daily/htdig with executable permissions and the following content:



#!/bin/sh
rundig
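
If you would rather keep a record of each nightly run, a slightly more defensive variant is possible. The sketch below is not what I run, just one way it could look; the function name run_indexer and the log path /var/log/htdig-rundig.log are assumptions made for illustration.

```shell
#!/bin/sh
# Run a command, append its output and exit status to a log file, and
# return the command's exit status to the caller.
# $1 is the log file; the remaining arguments are the command to run.
run_indexer() {
    log=$1
    shift
    echo "== $(date): starting $*" >>"$log"
    status=0
    "$@" >>"$log" 2>&1 || status=$?
    echo "== $(date): exit status $status" >>"$log"
    return "$status"
}

# In /etc/cron.daily/htdig this would be invoked as (log path is hypothetical):
# run_indexer /var/log/htdig-rundig.log rundig
```

Either way, remember to make the cron file executable, for example with chmod 755 /etc/cron.daily/htdig.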




Any tips of your own on this or other OSS search engines for use with mailing list archives? Please share.