how to store articles and interviews in a format to last forever?

by Derek Sivers

So we're going to be converting and archiving decades of amazing interviews with Bob Dylan, Paul Simon, Joni Mitchell, and many other songwriting legends - previously only available on paper.

But what's the most permanent way to store these so they'll be just as usable in 1000 years?

Roll our own method, or is there some XML kinda standard set up for this already?

The Criteria:

  • all interviews are question-answer, so need to mark each as such
  • each question and answer can have multiple paragraphs
  • album names and song titles are mentioned often, and need to be italicized or have quotes around them
  • will need to be marked up as HTML to make a presentable website archive of interviews


A weak mock-up I'm considering:

<intro>
I met with Bob Dylan today. Let's see what he has to say.
</intro>
<interview>
<question>
So Bobby, what are your favorite chords?
</question>
<answer>
The ones that sound like <quote>Eeeeee</quote>. You know like B C D and E. But not A and F.
</answer>
<question>
Is that how you wrote <quote>Blowin' in the Wind?</quote>
</question>
<answer>
Huh? How'd you know that was me?
</answer>
</interview>
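
To see whether a mock-up like this can meet the "presentable website" criterion, here is a minimal Python sketch that turns it into HTML paragraphs. It is only an illustration under assumptions: the wrapping "article" root element, the CSS class names and the choice of <q> for <quote> are all made up here, and multi-paragraph answers are not handled.

```python
import xml.etree.ElementTree as ET
from html import escape

MOCKUP = """<intro>
I met with Bob Dylan today. Let's see what he has to say.
</intro>
<interview>
<question>
So Bobby, what are your favorite chords?
</question>
<answer>
The ones that sound like <quote>Eeeeee</quote>. You know like B C D and E. But not A and F.
</answer>
</interview>"""


def inline_html(element):
    """Render an element's text, turning <quote> children into <q> tags."""
    parts = [escape(element.text or "", quote=False)]
    for child in element:
        text = escape(child.text or "", quote=False)
        parts.append("<q>" + text + "</q>" if child.tag == "quote" else text)
        # The tail is the text that follows the child's closing tag.
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts).strip()


def to_html(mockup):
    # The mock-up has two top-level elements, so wrap it in a single root
    # (the name "article" is an assumption) to make it well-formed XML.
    root = ET.fromstring("<article>" + mockup + "</article>")
    lines = []
    for block in root.iter():  # document order
        if block.tag in ("intro", "question", "answer"):
            lines.append('<p class="%s">%s</p>' % (block.tag, inline_html(block)))
    return "\n".join(lines)


print(to_html(MOCKUP))
```

The point of the sketch is that the element names carry the meaning (question, answer, quote) while the HTML presentation is decided later, so the archive format and the website can change independently.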


Obviously I'm not the first person on earth to archive interviews for web-presentation and long-term use. But I couldn't find any info or recommendations on how to do it.

ANY advice or URLs appreciated. Anyone?


17 Comments

jwenting
2004-05-27 00:12:20
no way
There is no storage format that lasts forever, as everyone in IT should know.


It is only by incredible accident that we are able to partially decipher Linear A and B; the ancient Mayan and Inca languages are all but lost.


Yet these documents are available in a format that anyone can read and that can stand most environmental factors that would destroy almost anything else: they're literally set in stone.


Computer records from as little as 25 years ago are now all but inaccessible.
Even if the hardware exists and works to read them back (in many cases it doesn't), the data is either so corrupted by time as to be impossible to read and/or the software to interpret that data no longer exists.
And that is data that has been meticulously preserved over time by copying it to a new disk every decade or so...


About the longest-lasting medium for document storage is still good old microfilm (unless you want to go back to stone tablets), along with paper, which can last for centuries if properly stored without losing the actual data (whether anyone will be able to understand the language it is written in after that time is always hard to say).

dereksivers
2004-05-27 01:44:09
no way
ok so pretend I said 10 years instead of 1000 years.


the point was not the number of years but a call out to see if there is a common standard markup format for storing articles and interviews.

enriquebuxton
2004-05-27 04:43:41
KISS
Follow the format used by magazines. An interview is not a series of question/answer pairs, it is a conversation with 2 or more people. Describe each speaker at the beginning, then identify who is speaking in each turn. External references used by any speaker (such as a song title) can be expanded in footnotes.
bazzargh
2004-05-27 09:19:59
Gutenberg answer: plain text?
Project Gutenberg use plain text because they want the files to still be usable in years to come (see the second part of "the project gutenberg philosophy".)


We need to have etexts in files a Plain Vanilla search/reader program can deal with; this is not to say there should never be any markup ... The value of Plain Vanilla ASCII is obvious ... so is very much of the value of most of the various markup systems we have in the world. But until some real standards arrive — we would be limiting our options a great deal if we do not keep copies of all etexts in Plain Vanilla ASCII as well. We don't have anything against markup. Not vice versa.


Looking back now, it seems like a good thing they didn't go for HTML 3.2, and have Shakespeare with "blink" tags.


I'm currently involved in a project putting an historical record online, a couple of gigs of hypertext. It was originally prepared some 15 years ago, and has been through two electronic editions in different hypertext formats, neither of which can be fully recovered. In the end they found the last copy of the originals, which contains copious amounts of plain text - what a relief; we can rescue large chunks of what was lost in translation.


We also do some work on public record systems where the docs are supposed to last 100 years. The general rule is to keep the original (not a conversion to standard markup, so nothing is lost), plus a "rendered" copy in a well known plugin-less format - eg tiff - in case you don't have the rendering program, plus a copy of the plain text to allow searches. Anything else is a bonus.

bazzargh
2004-05-27 09:28:53
Gutenberg answer: plain text?
Should have added this in the last paragraph: a link to Public Records Office standards, might be useful.
bazzargh
2004-05-27 09:29:53
Gutenberg answer: plain text?
Well that didn't work... oops!

... wish there was a preview on this site.
bazzargh
2004-05-27 09:30:41
Gutenberg answer: plain text?
hmmm I need markup that'll last 2 seconds, never mind a century.


http://www.pro.gov.uk/recordsmanagement/erecords/default.htm


fugu13
2004-05-27 10:19:03
Use RDF
Seriously. RDF's big advantage is that it is a relatively low-level semantic format. It's very self-describing, tremendously extensible out of the box (unlike many XML formats, particularly home-grown ones), has abundant tools available for its manipulation, and is undergoing a surge of development. For instance, imagining an RDF vocabulary "interview", and leveraging the Dublin Core metadata vocabulary, one might say in a snippet:






Bob Jones
How are you today Mr. Bob?


Jones Bob
I'm fine, thank you, and I'd just like to take this moment to promote my new book, Jones Bob Says Buy This Book

Jones Bob Says Buy This Book


Bob Jones
And that's all the time we have for today, folks! Be sure to catch us next time on Bob Jones Says Something!

Bob Jones Says Something
2004-02-30





interview:interviewer might inherit from dc:creator, interview:says might inherit from dc:description, and interview:statementSeq would of course inherit from rdf:Seq.


Using any of the many RDF tools available for dealing with RDF triples, one could transform this very easily into just about any presentation format imaginable, including ones that had embedded media, such as an audio or video recording of the interview.

fugu13
2004-05-27 10:20:30
Use RDF
oops, that should be:



Bob Jones
And that's all the time we have for today, folks! Be sure to catch us next time on Bob Jones Says Something!

Bob Jones Says Something
2004-02-30


in there.

mdubinko
2004-05-27 10:40:49
Forget, markup--it's all about the data
As I wrote about here:
http://www.onlamp.com/pub/wlg/4780


The data is what's important. You shouldn't need any particular software running to access it, which is the case with more complicated formats.


If you do need RDF, or XForms processing, or whatever, it's easy to add markup to structured text.


.micah

caseydk
2004-05-27 11:06:57
Yep, there sure are

You'll want to check out METS over at the Library of Congress. It's a schema that has all sorts of plugins for other multimedia-based schemas.


It was prototyped for a project that had the goal of providing presentation access to everything the Library owns... and making sure the data was available even after 100 years.


For example, one of the Bob Dylan LPs was archived this way. They digitized the audio into 96 bit sample wave files. Then they scanned the covers of the jacket along with the LP itself. Then they were going to use OCR to ensure that the text was searchable. Therefore, when it was complete, you'd have high quality audio of each side (650MB each), high quality images (20 MB each), and the text. Then, they'd convert the stuff over to lossy formats for presentation.


I used to work on this project, so if you're interested in details, let me know:


Just take my username from above and add "1484 (at) yahoo dot com".


kc

fugu13
2004-05-27 11:59:27
I disagree, sort of
Of course the data should also be provided in "raw" forms -- that's part of what RDF helps enable, actually, the tying together of raw forms. After all, what's the raw form of the interview? The tape recording? The video recording? The edited version? The word for word transcription of the original conversation?


These can all be considered "the" interview, and I advocate keeping all of them around. However, for long term, institutional storage and access, plain text in a file as primary storage as you advocate is probably a bad idea, for several reasons, not least among them management ability, ease of access, ease of adding metadata (in the future somebody might write an article about an interview, or stage a play based on it -- using an RDF storage format of some kind I can add metadata referencing that directly to the information about the interview), et cetera. Go ahead and store it as a plain text file, kept on a server, and as an audio file, and as a video file. Then reference those URIs in the RDF version; it's for this reason I recommend the "central" version be a semantically meaningful one, with other versions and related content tied in using properties -- one can always go from the RDF to the plain text, or the audio, or the video, but not usually vice versa. An RDF version can tie in notes on things of interest referenced in the interview with ease. And an RDF version based on publicly defined vocabularies and publicly available URIs can be used to create a huge network of information.


As a side note, one can break RDF down into "structured text" (though it's often not its best storage format ;-) ) -- just write down the triples using one of the several possible formats :-)
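
As a rough illustration of "writing down the triples", here is a small Python sketch that prints a few statements as N-Triples, one of the line-based RDF text formats. Everything specific in it is an assumption: the example.org URIs and the "interview" vocabulary are hypothetical, and only the Dublin Core namespace is a real, published one.

```python
# Hypothetical vocabulary and document URIs (assumptions for illustration).
IV = "http://example.org/interview#"
DOC = "http://example.org/interviews/1"
# Real Dublin Core element namespace.
DC = "http://purl.org/dc/elements/1.1/"

# Each triple is (subject URI, predicate URI, literal object).
TRIPLES = [
    (DOC, IV + "interviewer", "Bob Jones"),
    (DOC, DC + "title", "Bob Jones Says Something"),
    (DOC + "#s1", IV + "says", "How are you today Mr. Bob?"),
]


def literal(text):
    """Format a plain string as a quoted N-Triples literal."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return '"' + escaped + '"'


def to_ntriples(triples):
    """Serialize (subject, predicate, literal) tuples, one statement per line."""
    return "\n".join("<%s> <%s> %s ." % (s, p, literal(o)) for s, p, o in triples)


print(to_ntriples(TRIPLES))
```

Each output line is a complete statement on its own, so the file stays readable and greppable as plain text even without any RDF tooling.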

fugu13
2004-05-27 12:04:19
to clarify
To clarify something in my above post, one very important thing for "meaningful" storage of something is semantic metadata. RDF enables that in a way plain text really doesn't. Plain text will always require "pre-knowledge enabled parsing" in a way that RDF (or another semantic format) doesn't, unless it already contains every bit of the data that the RDF does, in which case it's probably more advantageous to use the community standard rather than a personally created format that will have unknown holes and flaws, particularly when one is creating a large repository.
fugu13
2004-05-27 12:05:09
above, below, what's the difference ;-) nt
nt
dereksivers
2004-05-27 12:11:21
Forget, markup--it's all about the data
Your post is great! It made one of those moments for me where I realize, "Oh maybe I DON'T need to do all that work!"


I guess as long as the parser would know what to use as a delimiter (the interviewee's name, defined at the top of the file), this would work, too:


-------- file: BobDylan-1984.txt -------


Interview subjects = "Bob Dylan"


INTRO:
I met with Bob Dylan today. Let's see what he has to say.


INTERVIEW:
Q: So Bobby, what are your favorite chords?


Bob Dylan: The ones that sound like "Eeeeee". You know like B C D and E. But not A and F.


Q: Is that how you wrote "Blowin' in the Wind?"


Bob Dylan: Huh? How'd you know that was me?
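
A parser for a file like that could be quite small. The following Python is a minimal sketch, assuming the layout above: subject names in double quotes on the first line, "INTRO:" and "INTERVIEW:" section markers, "Q:" for the interviewer, and blank lines between paragraphs. The function name and the returned structure are made up for illustration.

```python
import re

SAMPLE = '''Interview subjects = "Bob Dylan"

INTRO:
I met with Bob Dylan today. Let's see what he has to say.

INTERVIEW:
Q: So Bobby, what are your favorite chords?

Bob Dylan: The ones that sound like "Eeeeee". You know like B C D and E. But not A and F.

Q: Is that how you wrote "Blowin' in the Wind?"

Bob Dylan: Huh? How'd you know that was me?
'''


def parse_interview(text):
    """Return (subjects, intro_paragraphs, turns).

    turns is a list of (speaker, [paragraph, ...]) tuples.
    """
    header = re.search(r'^Interview subjects\s*=\s*(.+)$', text, re.MULTILINE)
    subjects = re.findall(r'"([^"]+)"', header.group(1)) if header else []
    labels = ["Q"] + subjects  # the delimiters the parser looks for

    intro, turns, section = [], [], None
    # Paragraphs are separated by one or more blank lines.
    for para in re.split(r'\n\s*\n', text):
        para = para.strip()
        if not para or para.startswith("Interview subjects"):
            continue
        if para.startswith("INTRO:"):
            section, para = "intro", para[len("INTRO:"):].strip()
        elif para.startswith("INTERVIEW:"):
            section, para = "interview", para[len("INTERVIEW:"):].strip()
        if not para:
            continue
        if section == "intro":
            intro.append(para)
        elif section == "interview":
            for label in labels:
                if para.startswith(label + ":"):
                    turns.append((label, [para[len(label) + 1:].strip()]))
                    break
            else:
                # No speaker label: a further paragraph of the current turn.
                if turns:
                    turns[-1][1].append(para)
    return subjects, intro, turns


subjects, intro, turns = parse_interview(SAMPLE)
for speaker, paragraphs in turns:
    print(speaker, "->", paragraphs)
```

Because a paragraph without a speaker label is attached to the previous turn, multi-paragraph questions and answers need no extra markup. Listing more than one quoted name on the first line would also cover interviews with several subjects.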

dereksivers
2004-05-27 23:48:59
when NOT to use XML
Interesting article on when NOT to use XML, also called Humans should not have to grok XML.
dereksivers
2004-05-28 00:29:15
reStructuredText
reStructuredText sounds like it should work!