Linking to (Almost) Any Block Element of (Almost) Any Web Page

by Bob DuCharme

Related link: http://www.snee.com/addids



(The following is the introduction from the web page at the URL shown above; see the web page for information on how to use the CGI that does this.)



In the early days of the web, you could link to a specific
point within a web page only if that point had an a element with a
name attribute. Recent releases of the Mozilla, Internet
Explorer, and Opera web browsers, however, let you link to any element that has
an id attribute. (More on this in a weblog posting I did.) I hope that more and more web development tools will
start adding id attributes to more block elements; I'm
trying to get into the habit of adding them to everything I
write.



Meanwhile, I've written a CGI script named addids.cgi ("add IDs") that creates a temporary
copy of any web page you pass to it, with IDs added to block elements
so that you can create links to any block element you like in that
temporary copy. For a web page that doesn't change much (not, for
example, the home page of a newspaper's web site), nearly all
generated IDs will be the same every time a temporary copy is
generated. This means that you can look at a copy created by
addids.cgi, create a URL that links to a specific point within that
copy, and send that link to someone else with reasonable confidence
that it will show them the same point in the document.
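
For illustration, the general idea of deriving stable IDs from page content can be sketched in Python as follows. This is not the code of addids.cgi, whose internals are not shown here; it follows the line-by-line checksum approach that the first comment below attributes to the script. The tag list, the add_ids name, and the 8-character MD5 prefix are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical list of block-level tags to label; the real script's list is not given.
BLOCK_TAGS = ("p", "div", "li", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6")

# Matches an opening block tag and captures its name and its raw attribute text.
OPEN_TAG = re.compile(r"<(%s)\b([^>]*)>" % "|".join(BLOCK_TAGS), re.IGNORECASE)

# Detects an id attribute that is already present in the attribute text.
HAS_ID = re.compile(r"(?:^|\s)id\s*=", re.IGNORECASE)


def add_ids(html):
    """Return html with an id added to each block tag that lacks one.

    The id is derived from a checksum of the line the tag sits on, so a
    line that has not changed yields the same id on every run.
    """
    out = []
    for line in html.splitlines():
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()[:8]

        def label(match):
            name, attrs = match.group(1), match.group(2)
            if HAS_ID.search(attrs):
                return match.group(0)  # leave an existing id alone
            return '<%s id="id%s"%s>' % (name, digest, attrs)

        out.append(OPEN_TAG.sub(label, line))
    return "\n".join(out)
```

Because each id depends only on the content of its line, a page that doesn't change much gets mostly the same IDs each time. In this sketch, two block elements that share a line would also share an id.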



A few random tests show that it works with some slick commercial sites (I linked to stories in the archives so that the examples would last longer): the BBC ("The varying hotel guests in each episode..."), Rolling Stone ("On 1971's Gets Next to You..."), and a Vignette Storyserver-generated Time Magazine article ("Ethiopia: Tackling terror in East Africa." Scroll up for slickness.) For a layout so complex that the CGI messes it up (for example, Wired), there may be a "Print" version of the same story that's easier to link to ("Paper modeling reached the zenith..."). I found that it doesn't always work properly with IE 6.0 under Windows, but it seems to work fine with Firebird 0.7, Mozilla 1.5, and Opera 6.1 under Windows, and with IE 5, IE 5.1, and Safari 1.0 under OS X.




How did it work for you?


4 Comments

bazzargh
2004-01-14 04:50:21
aggressive sessions
Your algorithm - line-by-line checksums - has a couple of problems. A good example of something that's stinkingly broken with it is here:


http://store.apple.com/Apple/WebObjects/ukstore.woa/90801/wo/vf7aNMsZv6ok2qaY8sTSf1bo6gm/0.0.7.1.0.5.21.1.4.1.2.0.0.1.0


(If this is already broken by the time you see it, I just went to the Apple store UK and clicked on iLife '04)


What I originally thought might be wrong with this was that the session info for users would change, breaking checksums all over the shop as URLs change; this would affect many news sites too when they are viewed the next day.


However, this page has a much bigger problem: Apple's CMS is sticking almost all the block elements on one line of text, and with your scheme everything on the same line gets the same id. Interestingly, it seems that Moz will jump to the last instance of the id in these circumstances while IE jumps to the first.


You could try taking checksums of the text in each block only (as opposed to text+tags), and assigning ids block-by-block instead of line-by-line.
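
A minimal Python sketch of that suggestion, for illustration only: it checksums just the text inside each block and produces one id per block rather than per line. The tag list, the class and function names, and the shortened MD5 digest are assumptions, and this is not code from addids.cgi.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical list of tags treated as blocks.
BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3"}


class BlockChecksums(HTMLParser):
    """Collect one id per block element, computed from its text only."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, text fragments) for each block still open
        self.results = []  # (tag, id) pairs, in the order the blocks close

    def handle_starttag(self, tag, attrs):
        # Attributes are never added to the text, so URLs in them are ignored.
        if tag in BLOCK_TAGS:
            self.stack.append((tag, []))

    def handle_data(self, data):
        # Text counts toward every open block that contains it.
        for _, parts in self.stack:
            parts.append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            _, parts = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalise whitespace
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()[:8]
            self.results.append((tag, "id" + digest))


def block_ids(html):
    """Return a list of (tag, id) pairs for the block elements in html."""
    parser = BlockChecksums()
    parser.feed(html)
    parser.close()
    return parser.results
```

With this approach, a session id that changes inside a link's href does not change the block's checksum, and blocks that sit on the same line of source still get different ids as long as their text differs. Blocks with identical text would still collide.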


Good article though. Makes a change from using this technology to pootify or swedish-chef the page ;)



BobDuCharme
2004-01-14 06:23:20
aggressive sessions
Not only do the checksums in the page you reference change; the URL itself does, because it's from a transaction-oriented session that carries state information about the session right there in its URL. This URL won't work for someone trying to use it in another time and place, period, so the inability to let another person link to a point within that page can't be blamed on something in addids.cgi. You mentioned that the link might be "already broken by the time you see it"--the URL was designed to only work once for one person!


This is the reason that my examples came from archive pages on the web sites of the BBC, Rolling Stone, etc., and not from pages on their web sites showing more current news. If I linked to something in the middle of the BBC's interview with Rubens Barrichello at http://news.bbc.co.uk/sport1/hi/motorsport/formula_one/3395221.stm, which now says "Last Updated: Wednesday, 14 January, 2004, 11:53 GMT", I'm not going to be too confident that this URL will still work two weeks from now. We can't count on consistency in dynamically generated pages, and it's in a news site's best interest to change its content as often as possible.


Your second point makes sense--note my wimpy use of the qualifier "(almost) any web page." Maybe a checksum based on the text from the beginning of the tag to the end of the line would be easier to compute and solve the problem you describe.

bazzargh
2004-01-14 06:46:11
aggressive sessions
"This URL won't work for someone trying to use it..."


That's true for the Apple store, but on many other sites a URL with, say, ";jsessionid=blah" in it will still work, but the session ids in the URLs in the page you get back will be for a brand new session. I tried to think of the most broken URL scheme I'd come across recently and picked one that was way /too/ broken :)


Incidentally, one reason I took a deeper interest was that I was writing code today to extract text to be translated from one of our sites (it's small-ish, and not worth the cost of buying software for this). I used block md5-checksums, with and without tags. As you say, it's not too hard, but then we have a luxury you don't: we know our content is XML, so blocks are easier to find.
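
A rough Python sketch of what block checksums with and without tags might look like for XML content. The code described above is not shown in this thread, so the element name p, the function names, and the use of xml.etree.ElementTree are all assumptions for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET


def md5_hex(text):
    """Return the MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def block_checksums(xml_text, block_tag="p"):
    """Return (checksum with tags, checksum of text only) for each block.

    block_tag is a hypothetical element name standing in for whatever
    counts as a block in the real content.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for block in root.iter(block_tag):
        # Serialised element: tags, attributes, and any text trailing the block.
        with_tags = ET.tostring(block, encoding="unicode")
        # Only the character data inside the block, in document order.
        text_only = "".join(block.itertext())
        rows.append((md5_hex(with_tags), md5_hex(text_only)))
    return rows
```

Comparing the two columns shows whether a change touched the wording of a block or only its markup.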

anonymous2
2004-01-14 08:32:58
Internet Explorer
Does that browser understand anything in HTML? Everything I've done that works under any other browser has always failed in that one on the first try; after that you have to make it work under Explorer. Most of the time is spent hating Bill.