Statistically Improbable Phrases

by Harold Davis

Statistically Improbable Phrases (a/k/a "SIP") is the improbable name Amazon.com gives to one of its search ranking techniques. Here's Amazon's explanation.

In more-or-less plain English, here's how this works. Amazon indexes the "Search Inside" content of the books in its catalog (that is, the books for which publishers provide this content). In many cases, Amazon provides a list of SIPs on the main listing page for the title. For example, Starting an Online Business for Dummies by Greg Holden has a number of linked SIPs listed, including "your online business." These SIPs are phrases that appear with anomalous frequency in the inside content of the cataloged book, compared with the rate at which the same phrase occurs in the universe of books in general. This statistical over-occurrence implies that the SIP is significantly representative of the content of the book.
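To make the idea concrete, here is a minimal sketch in Python. Amazon has not published its actual formula, so everything here is an assumption for illustration: the function names (ngrams, improbable_phrases), the choice of three-word phrases, the simple ratio of the phrase's rate in the book to its rate in a comparison corpus, and the add-one smoothing. The sample texts are made up.

```python
from collections import Counter
import re

def ngrams(text, n=3):
    """Return every n-word phrase in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def improbable_phrases(book, corpus, n=3, top=5):
    """Rank the book's phrases by how over-represented they are.

    Score = (rate of the phrase in the book) / (rate of the phrase in the corpus).
    This ratio is an illustrative assumption, not Amazon's published method.
    """
    book_counts = Counter(ngrams(book, n))
    corpus_counts = Counter()
    for text in corpus:
        corpus_counts.update(ngrams(text, n))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        book_rate = count / book_total
        # Add-one smoothing so phrases absent from the corpus do not divide by zero.
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = book_rate / corpus_rate
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Made-up sample data: one "book" and a tiny comparison corpus.
sample_book = "your online business is one of the best your online business"
sample_corpus = ["one of the best", "one of the worst", "one of the many"]
top_phrases = improbable_phrases(sample_book, sample_corpus, top=1)
```

In this toy data, "your online business" is frequent in the book and absent from the corpus, so it rises to the top, while an everyday phrase like "one of the" sinks because it is common everywhere.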

By clicking one of the SIP links, you get a list of other books in which the SIP occurs, sorted from most to fewest occurrences of the SIP. For example, "Web Analytics" and "E-Commerce for Dummies" have the next highest numbers of occurrences of the SIP "your online business" after "Starting an Online Business for Dummies."
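The ordering described above can be sketched as a simple count-and-sort. This is only an illustration of the visible behavior, not Amazon's implementation; the helper names (count_phrase, rank_books_by_phrase) and the placeholder titles and texts are hypothetical.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    words = phrase.split()
    if not words:
        return 0
    # Allow any run of whitespace between the words of the phrase.
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def rank_books_by_phrase(books, phrase):
    """Return (title, count) pairs for books containing the phrase, most occurrences first."""
    counts = {title: count_phrase(text, phrase) for title, text in books.items()}
    hits = [(title, c) for title, c in counts.items() if c > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Placeholder catalog: titles and texts are invented for the example.
catalog = {
    "Book A": "Your online business grows. Market your online business. Fund your online business.",
    "Book B": "A chapter on your online business.",
    "Book C": "Nothing relevant here.",
}
ranking = rank_books_by_phrase(catalog, "your online business")
```

Books that never use the phrase drop out of the list entirely, which matches the idea that clicking a SIP shows only the books in which it occurs.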

This is a different and somewhat appealing way to use Amazon's search facilities to find books in which the author uses distinctive phrases. In the longer run, the concept has an elegant simplicity (as did the original PageRank algorithm), and may be useful for automated tagging and ranking of content.

Click here for a lively discussion of SIPs in the context of author as phrase maker, and here's a fun discussion and list of adult SIPs on Amazon (over 18 only please click this link).