Personalisation: Guessing user profiles from standard web data

by Dan Zambonini

We're currently rewriting our Personalisation application, which can tailor content, style and process (e.g. the order of a checkout form) to the type of user browsing a site. "Type of user" is a bit vague, so let me expand on that. Nearly all industries -- from banking to medicine, retail to education -- have a history of 'segmenting' their user bases: grouping their customers into generic stereotypes. This is typically based on common demographic attributes, such as location, age or gender, or sometimes on behavioural or psychographic analysis (such as personal interests or brand loyalty).



So, to place an individual user into a pre-defined segment, we need to match their attributes (age, location, etc.) to those of a particular segment. We can obtain the user's attributes through two types of data collection: explicit (asking the user for specific information, possibly even asking them to define a profile) and implicit (using only implied information, without asking any direct questions).



The purpose of such tailoring is (theoretically) to better meet the needs of the user: to provide information and functionality that better suit a specific type of user. Usually, the ultimate goal is to make the process of purchasing something easier and more probable.



Whether or not you agree with the ethical, commercial or practical aspects of transparently building user profiles, it makes for an interesting technical challenge. So, what can we guess about a user without asking them any questions?



Let's take a look at the data we can collect:


  • HTTP Request

    • User Agent String

    • IP Address

    • Referrer



  • JavaScript/Client Side

    • Display Information

    • History Length

    • Cookie Information



  • Click-Stream (the history of clicks on our site)

    • Content viewed

    • Searches performed

    • Timing information






We can expand on these to give us more specific information:


  • HTTP Request

    • User Agent String

      • Browser Product

      • Platform



    • IP Address

      • Location (GeoLocation/IP Lookup)

      • Service Provider (Reverse DNS Lookup)



    • Referrer

      • Referring Website

      • Search Terms (e.g. keywords entered into Google)

      • Marketing/Advert used (e.g. which Google AdWords advert was clicked on)





  • JavaScript/Client Side

    • Display Information

      • Colour Depth

      • Resolution



    • History Length

      • Whether or not the URL was typed directly



    • Cookie Information

      • Previous Visits/Repeat User





  • Click-Stream (the history of clicks on our site)

    • Content viewed

      • Type of content (e.g. subject of content, format of content)

      • Order of content viewed/Path through site



    • Searches performed

      • Keywords

      • Types of search (e.g. boolean)



    • Timing information

      • Date/Time of clicks

      • Time between clicks








Some of these -- such as location and service provider -- are not always accurate (AOL users, roaming users, people using the web at work, and many other complexities), but we're only building a best guess, so let's use all of the information at our disposal.
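To make the HTTP Request branch of the list a little more concrete, here is a minimal Python sketch of pulling a referring site, search keywords, platform and browser out of the raw headers. The function names, the `SEARCH_PARAMS` table and the substring checks are illustrative assumptions rather than part of our application, and real User-Agent strings are messy enough that a dedicated parser would do a better job.

```python
from urllib.parse import urlparse, parse_qs

# Assumed map of search-engine host labels to the query-string parameter
# that carries the keywords. Illustrative only; extend as needed.
SEARCH_PARAMS = {"google": "q", "yahoo": "p", "ask": "q"}


def referrer_signals(referrer):
    """Return the referring host and any search keywords found in the URL."""
    parts = urlparse(referrer)
    host = parts.hostname or ""
    query = parse_qs(parts.query)
    keywords = []
    for engine, param in SEARCH_PARAMS.items():
        if engine in host.split("."):
            keywords = query.get(param, [""])[0].split()
            break
    return {"site": host, "keywords": keywords}


def user_agent_signals(user_agent):
    """Guess platform and browser by crude substring matching (a best guess only)."""
    ua = user_agent.lower()
    if "macintosh" in ua or "mac os" in ua:
        platform = "mac"
    elif "windows" in ua:
        platform = "windows"
    elif "linux" in ua:
        platform = "linux"
    else:
        platform = "unknown"

    # Order matters: many browsers mention other products in their strings.
    if "firefox" in ua:
        browser = "firefox"
    elif "msie" in ua:
        browser = "internet explorer"
    elif "safari" in ua:
        browser = "safari"
    else:
        browser = "unknown"
    return {"platform": platform, "browser": browser}
```

For example, `referrer_signals("http://www.google.com/search?q=flaming+lips+tickets")` yields the site `www.google.com` and the keywords `flaming`, `lips` and `tickets`. An empty or missing referrer simply produces an empty site and no keywords, which is itself a weak hint that the URL may have been typed directly.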



We still need to translate this technical information into user attributes. So what kinds of guesses might we make about a user from this information? Here are some starting points to get you thinking (again, these are not meant to be 100% accurate; read them as "the user is more likely to fall in this segment if they match this data"). Apologies if some of these seem to be based on groan-worthy stereotypes, but much of this comes from real evidence.




  • Affluent Users

    • HTTP Request > User Agent String > Platform > Mac

    • HTTP Request > IP Address > Location > Urban centre

    • JavaScript > Display Information > Resolution > High




  • Female Users

    • HTTP Request > User Agent String > Browser Product > NOT Firefox

    • HTTP Request > Referrer > Referring Website > Yahoo/AOL/Ask Jeeves

    • Click-Stream > Content viewed > Type of content > Female-targeted content




  • Younger Users

    • Click-Stream > Searches performed > Types of search > Boolean search performed

    • JavaScript > Display Information > Resolution > NOT Low

    • Click-Stream > Timing information > Time between clicks > Short





What other rules might exist for matching a particular demographic? I have some thoughts, but these are currently based on my limited view of the world, rather than hard statistics (e.g. if you are browsing early in the morning, are you more likely to be male/young? If you have an old browser version, are you more likely to be less computer literate/older?).
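One way to turn starting-point rules like these into a best guess is to score each segment by the share of its rules that a visitor matches. The following Python sketch assumes the collected data has already been flattened into a `signals` dictionary; every key name, threshold and weight in it is a hypothetical placeholder rather than a measured value, and the rules themselves are only the illustrative ones from the lists above.

```python
from collections import defaultdict

# Each rule is (segment, weight, predicate). All weights are placeholders of 1.0;
# with real statistics they could reflect how predictive each rule actually is.
RULES = [
    ("affluent", 1.0, lambda s: s.get("platform") == "mac"),
    ("affluent", 1.0, lambda s: s.get("location_type") == "urban"),
    ("affluent", 1.0, lambda s: s.get("screen_width", 0) >= 1600),
    # "NOT Firefox" only counts when we actually know the browser.
    ("female", 1.0, lambda s: s.get("browser") not in (None, "firefox")),
    ("female", 1.0, lambda s: s.get("referrer_site") in ("yahoo", "aol", "ask")),
    ("female", 1.0, lambda s: s.get("viewed_female_content", False)),
    ("younger", 1.0, lambda s: s.get("used_boolean_search", False)),
    ("younger", 1.0, lambda s: s.get("screen_width", 0) > 800),
    ("younger", 1.0, lambda s: s.get("seconds_between_clicks", float("inf")) < 5),
]


def score_segments(signals):
    """Return, per segment, the weighted fraction of its rules that matched (0 to 1)."""
    matched = defaultdict(float)
    total = defaultdict(float)
    for segment, weight, predicate in RULES:
        total[segment] += weight
        if predicate(signals):
            matched[segment] += weight
    return {segment: matched[segment] / total[segment] for segment in total}


visitor = {"platform": "mac", "browser": "safari", "screen_width": 1920,
           "used_boolean_search": True}
scores = score_segments(visitor)
```

Missing signals never count in a segment's favour, so a visitor we know nothing about scores zero everywhere. Keeping the rules in a plain list also makes it easy to add new guesses (time of day, browser version) as evidence for them turns up.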



Depending on the type of content on your site, it is sometimes relatively straightforward to use only a small subset of this available information (the type of content viewed) to build up an accurate profile. Amazon is an obvious example. If you browse the music of Kelly Clarkson and Avril Lavigne, the system will probably assume you are a white, middle class teenage girl who is in the middle of a temporary and short-lived attempt to 'rock out', whilst completely missing the point of rebellion. Look at Dixie Chicks and Green Day, and the communist left-wing checkbox will be ticked and an automatic email to the White House triggered. On the other hand, if you've been looking at The Flaming Lips and The Decemberists, then it could assume that you are highly intelligent, witty, with a deep cultural and political understanding of the world and a bright and happy future ahead of you.
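As a rough illustration of profiling from content viewed alone, the Python sketch below tallies the tags attached to each page in a visitor's click-stream and reports each tag's share. The page paths, the tags and the `CONTENT_TAGS` catalogue are all made up for the example; how a real site labels its content, and how it maps those labels onto segments, is left open.

```python
from collections import Counter

# Hypothetical catalogue mapping page paths to content tags.
CONTENT_TAGS = {
    "/music/kelly-clarkson": ["pop-rock"],
    "/music/avril-lavigne": ["pop-rock"],
    "/music/the-flaming-lips": ["indie"],
    "/music/the-decemberists": ["indie"],
}


def content_profile(click_stream):
    """Return each tag's share of all tags seen on the pages viewed."""
    counts = Counter(
        tag for path in click_stream for tag in CONTENT_TAGS.get(path, [])
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing tagged was viewed, so there is nothing to infer
    return {tag: n / total for tag, n in counts.items()}


profile = content_profile([
    "/music/the-flaming-lips",
    "/music/the-decemberists",
    "/music/kelly-clarkson",
    "/about",  # untagged pages are simply ignored
])
```

Here two of the three tagged pages are `indie`, so that tag takes two thirds of the profile and `pop-rock` the remaining third.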