Data Management: Data Preservation and Destruction
Toronto, ON, July 2003
Planning to auction off your old computer on eBay? Or more likely, give it away to charity? Or send your damaged computer out for repairs? You might want to take a closer look at your disk drive first! Have you ever bought a used computer, possibly at a failed dotcom auction (try http://www.murphyauctions.net/bestof.html) or Government Surplus (try http://www.gsa.gov/Portal/browse/channel.jsp?channelId=-13914&channelPage=/channel/default.jsp)? Ever taken a look at what’s on the hard drive? Perhaps, you are Curious George and like to poke around on the hard drive looking for interesting leftovers. There are many stories about people retrieving confidential information from second-hand computers. Racy love letters, current curriculum vitae, and pornographic pictures are just some of the things you can find. Did the former possessor of the hard drive maintain a list of bank account numbers on the computer? Did the previous owner have a list of PINs for debit or credit cards or passwords for other systems? Is their identity worth stealing? Given how many dotcoms went out of business, you wouldn’t have to look long to find hard drives out there with interesting data.
According to two Massachusetts Institute of Technology (MIT) graduate students (Simson L. Garfinkel and Abhi Shelat), companies and individuals are frequently selling or giving away old computer disk drives with sensitive information still on them. They analyzed 158 disk drives purchased through eBay Inc.’s online auction site, at computer stores, salvage companies and swap meets. The two found they could recover and read data from almost three-quarters of the systems. Previous owners had properly sanitized less than 10 percent. The students recovered personal and corporate financial records (including credit card numbers, bank account numbers, dates of transactions and account balances), medical records, love letters, personal e-mail and pornography. (You can find a summary of their report at http://www.computer.org/security/v1n1/garfinkel.htm.)
There are other examples of this kind of behaviour recorded in the popular press. In 2003, a U.S. state auditor found that at least one computer used by AIDS-HIV counselors was ready for sale to the public even though it still contained files on thousands of people.
While embarrassing (and frightening) to the individuals involved, such exposure still is not life-threatening. Over ten years ago, the GAO reported that the Department of Justice disposed of computer assets and the buyer ultimately retrieved information from the hard drives. Apparently, the buyer recovered information dealing with the witness relocation program for the Northeast! But the government doesn’t learn. The GAO reported again in early 2001 that the Department of Energy disposed of computers that the department had not adequately cleaned, exposing data. It might well turn out that foreign governments interested in espionage are buying disk drives!
There’s another way to look at this. Suppose you buy a formatted drive. You start your system, the drive comes up, and you don’t see any files. You run the DOS format command and don’t see any bad blocks. You use that drive for years and never suspect that it had child pornography on it because that data isn’t part of the blocks that you were looking at. But should someone come in and use forensic tools, they would find it since the data is still there.
So, let’s say you want to sell your old computer, but you don’t want snoopers reading all your old e-mail or getting your bank account number. What do you do? Reformat the drive? Before you look at the answers to these questions, let’s talk a little about how information is actually stored on the disk. A hard disk is magnetic media. In other words, your system stores information on the disk by changing the magnetic characteristics of a certain spot on the disk. As the system writes files to the disk, the system also writes the location of the file to another section of the disk: the index or catalogue. When you delete a file, the system only erases information about the location of the file. The information in the file itself remains on the disk. The same is true should you reformat a disk. In this case, the system erases all information about the location of the files but the information on the disk remains.
Casual computer users often assume that reformatting permanently deletes the data stored in a file from the computer’s disk drive. It doesn’t. Instead, most operating systems simply change the file-system metadata to indicate that the file has been deleted, then mark the areas of the hard disk that contain the “deleted” data as available for reuse by other programs. The commonly-used Microsoft commands, such as FORMAT and FDISK, verify the integrity of the disk drive blocks or rewrite the partition table, but do not erase file contents. Moreover, Windows works in the background to actively preserve your deleted files. The Windows Recycle Bin only pretends to delete files: the Recycle Bin subsystem actually moves the whole file to a special directory from where you can easily recover it. Should you empty the Recycle Bin, you still can retrieve the file, because the normal OS-level deletion operation kicks in and simply marks the file area as “ready for reuse.” Assuming another program doesn’t overwrite that data, it remains there undisturbed and you can retrieve and read it using a variety of techniques ranging from simple operating system commands to free and commercial forensic software tools. As well, Windows ME and XP have a System Restore function that saves and can restore certain kinds of files, even when you’ve erased them.
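The split between the catalogue and the file contents can be sketched with a toy model. Everything below is illustrative, not a real file-system API: the `disk`, `catalogue`, and function names are mine.

```python
# Toy model of a disk: raw blocks plus a separate catalogue (index).
# Real file systems are far richer, but the principle is the same.
disk = bytearray(1024)   # the raw medium
catalogue = {}           # maps file name -> (offset, length)

def write_file(name, data, offset):
    disk[offset:offset + len(data)] = data
    catalogue[name] = (offset, len(data))

def delete_file(name):
    # Like a real OS delete: only the catalogue entry disappears;
    # the bytes on the "medium" are untouched.
    del catalogue[name]

write_file("letter.txt", b"Dear bank: my PIN is 1234", 100)
delete_file("letter.txt")

assert "letter.txt" not in catalogue     # gone from the index...
assert b"PIN is 1234" in bytes(disk)     # ...but still on the medium
```

A forensic tool does essentially what the last line does: it scans the raw medium directly, ignoring the catalogue entirely.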
Regardless, in less than a minute with your favourite search engine (mine is currently Vivísimo Document Clustering at http://vivisimo.com/), you can find companies that sell utilities specifically made for recovering data from drives that have been reformatted, hit by viruses, or whatever. There are a number of tools that can read information on disks even when someone has formatted the disk. Table 1 provides a short list of some of these products.
Table 1. Data Recovery Tools
Maybe you already know about such utilities, and you’ve gone to the trouble of filling your entire hard drive with zeros. That’ll take care of the commercial recovery utilities, but someone with a few thousand dollars and the know-how could nonetheless recover your data even then. (Read http://www.cs.auckland.ac.nz/~pgut001/pubs/secure_del.html and http://sxm4.uni-muenster.de/stm-en/ for more information on recovering “deleted” data. Should you really want to recover data without building a scanning-tunnelling microscope yourself, check out http://www.datarecoverycompanies.com/. They claim 80 to 90% of lost data is recoverable and provide a list of companies to help.)
Nevertheless, the only practical way to make information unavailable on the disk is to overwrite the specific parts of the disk that contain the information. While you could do this manually, it is much more efficient to do it with a disk wiping tool. Tools like this overwrite the entire disk, or just the location of a specific file, multiple times to make sure that the information is unreadable. The software takes about 30 to 40 minutes per gigabyte to sanitize or scrub the data. Table 2 provides a short list of vendors selling erasure or disk scrubbing software.
Table 2. Disk Scrubbing Tools
1 This software is released under the GNU General Public License; hence it is free.
One thing to keep in mind when evaluating these tools is to ensure the tool meets the Department of Defense standard (DoD 5220.22-M). This standard requires that your utility overwrite the disk several times: first with zeros, then with ones, then with random data. This makes it much more difficult for anyone to get at the information. These tools do exist, so obviously finding good tools is not the tough part.
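A wiping pass of the kind the standard describes can be sketched in a few lines of Python. This is a sketch only, and the function name is mine: a real tool must also deal with slack space, swap files, journals, and remapped bad blocks.

```python
import os

def wipe_file(path):
    """Overwrite a file in place -- zeros, then ones, then random
    data -- before deleting it. Illustrative only; this is not a
    certified DoD 5220.22-M implementation."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for fill in (b"\x00", b"\xff", None):
            f.seek(0)
            f.write(os.urandom(size) if fill is None else fill * size)
            f.flush()
            os.fsync(f.fileno())   # force each pass onto the device
    os.remove(path)
```

Note that overwriting through the file system only reaches the blocks the file currently occupies; copies left elsewhere on the disk survive, which is why whole-disk scrubbers exist.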
Before you go any further, you have to do the tough part. You have to answer the big question: “How valuable is the data on my hard drive?” It does not matter whether you are looking at data destruction or preservation: this is a key question. All paranoia aside, it’s unlikely that anybody cares enough about the data on a home computer to take the time to recover a zeroed-out drive. On the other hand, should the computer come from a business, hospital, or research lab, it’s another story. The right buyer may pay big money for data on those computers.
Even the U.S. Department of Defense uses risk assessment to decide how to dispose of systems. In the summer of 2001, Deputy Secretary of Defense Paul Wolfowitz rescinded a January 2001 Department of Defense policy calling for destruction of all unclassified computer hard drives leaving departmental custody. Wolfowitz’s new disposal guidance will make more computers available for schools and other worthy organizations. Under this new guidance, the minimum requirement for equipment leaving DoD’s custody or control is for someone to overwrite the computer hard drives, but not completely destroy them. DoD still authorizes degaussing and destruction alternatives when there is a particular concern about data sensitivity on the machine. The long-standing practice of destroying hard drives on computers with classified information remains in place. Should you have an interest, you can find the policy direction, entitled “Disposition of Unclassified DoD Computer Hard Drives,” at http://www.c3i.osd.mil/hottopics.html.
You might also want to check out the Department of Defense’s “National Industrial Security Program Operating Manual” at http://nsi.org/Library/Govt/Nispom.html#link5. In the document, the DoD recommends the following steps to dispose of hard drives that contain “moderately” sensitive information:
Amazingly, these steps are not designed for the very highest levels of security. I can only imagine what they do to sanitize really sensitive data! Perhaps, they drop the drives in a vat of lukewarm Coca-Cola™!
But let’s say you want to preserve the data, not destroy it. Data destruction is just one side of the coin. The other side is digital preservation, that is, the managed activities to ensure continued access to electronic resources. Access is the key factor here: when you cannot use a resource anymore, it is pointless to preserve it. Preservation of data is a key consideration for many organizations. Operational, legal, regulatory and historical needs are just some of the reasons for preserving data. What do you do when you want to preserve the data on the media for an extended period? How long will the data last on the hard drive you just sold or gave away? How long will the tapes you sent to off-site storage continue to hold a charge? Either way, it is important that you know something about the lifetime of various media and other factors affecting data preservation and destruction.
Media Types and Properties
Below is a brief description of widely used storage media.
Microfilm seems, to many, outdated and behind the times. In our new information economy, electronic imaging is, surely, the way to go when you want to preserve information for posterity. Vital records, insurance claims and applications, loan applications and library records, among others, are some of the documents currently imaged.
Nonetheless, microfilm possesses two simple advantages over most other media used for recording information: it is long-lived and readable by humans with little difficulty. These are pretty basic and crucial advantages. You only can retrieve electronic data when you have the appropriate hardware and software to do the retrieval, and only the foolhardy would argue that we will have all of today’s hardware and software around tomorrow. Even when appropriate equipment is available in ten or twenty years, the electronic medium chosen to carry some specific information could well, by then, have deteriorated to the point of being unusable. Unfortunately, we have many examples of this kind to reflect upon. On the other hand, you can, in a pinch, retrieve information on microfilm with an instrument that has been around for centuries and will continue to exist for as long as humanity exists: the magnifying glass.
Without sounding like a Luddite, microfilm is, in my opinion, an eminently suitable medium for the preservation of information. It is durable (depending upon the type of film you choose), long-lived, and relatively inexpensive. But most of all, the information on it is retrievable by the human eye. Only printing on paper or chiselling in stone can match this!
Microfilm is available in a number of types. Silver halide film on a polyester base, processed and stored in accordance with the existing standards, has proven to last more than 1,000 years. That exceeds my need to preserve that “nastygram” from my Bank!
Magnetic tape is a logical media selection. A magnetically coated strip of plastic on which you can encode data, magnetic tape provides relatively inexpensive, large storage capacities. Because tapes are accessed sequentially rather than randomly, access time is slower on tape than on devices like disks or CDs. Tapes are available in a range of sizes and formats. While tape media are still very good for transporting and backing up data, they are not suitable for long-term storage because of the limited lifetime of the media. See Table 3 for an approximate lifetime.
Digital Linear Tape
A recent development in tape storage technology is DLT (Digital Linear Tape). DLT is a cartridge tape and offers significant improvements in data access rates over magnetic tape. Moreover, DLT is durable (medium lifetime 30 years) and has very large capacity (10 to 80 gigabytes of compressed data per volume), with a low cost. There are other variants of high capacity tapes but you may want to question the longevity of such technologies.
CD-ROM (Compact Disk–Read Only Memory)
Optical disk technology is capable of storing large amounts of data that you can read but not alter. CD-ROMs all conform to size and format standards and are well suited for storing software applications, graphics, sound and video.
CD-R (Compact Disk-Recordable)
Based on WORM technology, a CD-R system can store large amounts of data, even though a single CD-R holds only 0.64 gigabytes. CD-R drives have been improved to enable multi-session recording (that is, you can add additional data over time). The “write-once recordable” CD (CD-R) is inexpensive, and CD juke boxes have been available for quite some time.
DVD-R (Digital Versatile Disk-Recordable)
The “write-once recordable” digital versatile disk (DVD-R), considered a replacement for CDs, is on the market. A current single-sided DVD-R holds 4.7 gigabytes of data. The storage capacity will double when double-sided disks become available in the near future. DVD-R drives in juke box configurations with a capacity of 4 terabytes are on the market. Often, these juke boxes are backward compatible so the devices can handle a mixture of CDs and DVDs.
WORM drives read data in a fashion similar to CD-ROM drives, but they also can write data to disk (though this writing is permanent; hence the term, ‘Write-Once-Read-Many’). With WORM, a laser burns holes directly onto the surface of the disk. Since these holes reflect much less light than intact disk areas, the device uses the resultant decrease in beam intensity to denote the data stored on the disk. A WORM drive uses removable media divided into consecutively numbered, fixed-size sectors that the device can access in any order, similar to a hard disk.
Organizations have used WORM optical disks and supporting juke boxes for data archival; WORM capacities can go up to 12 gigabytes per medium. Most WORM drives can store 800 megabytes of data per cartridge, while CD-ROMs hold 640 megabytes. But as CD and DVD become more popular, this technology is now becoming obsolete. Another factor suggesting its demise is that the media and system are more expensive than other media.
Magnetic hard disk
A hard disk, as opposed to a floppy disk, is a magnetic disk that can store large quantities of data. Hard disks or drives come in various sizes and it is not unusual to buy a low end computer with a 40 gigabyte drive. However, hard disk storage is more expensive than other storage media.
Magneto-optical (MO) disks are also a popular choice among the high density media with reasonable storage capacity (5.2 gigabytes on a single MO disk). MO disks are 5 ¼-inches. Data is written on an MO disk by both a laser and a magnet.
MO gives you high-capacity disks perfect for quick file retrieval (access times are in the sub-25ms range) at a low cost per gigabyte.
In addition to capacity, access, legal requirements, maintenance, reliability, cost of media and the system as a whole, I/O speed, and durability of the media, you should consider the expected lifetimes of any selected medium. There is reasonably widespread (though by no means universal) awareness of the fact that digital storage media have severely limited physical lifetimes. The National Media Lab (http://www.nml.org) has published test results for a wide range of tapes, magnetic disks, CD-ROMs, and other media, showing that a tape, disk, or even CD picked at random (that is, without prior evaluation of the vendor or the specific batch of media) is unlikely to have a lifetime of even five years. Notwithstanding their opinion, Table 3 provides, as an example, expected and typical lifetimes for various media.
Table 3. Expected Lifetimes of Selected Media in Years
Table adapted from Dark Ages II (Prentice Hall, 2002) by Bryan Bergeron.
Vendors and media scientists may argue vehemently about such numbers, but accurate estimates are ultimately largely irrelevant, since the physical lifetime of media is rarely the constraining factor for digital preservation. Should any company introduce archival-quality media in the market, it would probably fail, since the media would quickly be made obsolete—despite their physical longevity—by newer media having increased capacity, higher speed, greater convenience, and lower price. This is a natural outgrowth of the exponential improvement in storage density, speed, and cost characterizing digital media development for the past several decades. The market makes older storage media obsolete as newer, better media become available. (Do any of you remember when all you wanted was a 20 megabyte hard drive?) The short lifetimes of eight-inch floppy disks, tape cartridges and reels, hard-sectored disks, and seven-track tapes, among others, demonstrate how quickly storage formats become inaccessible.
In addition, you should read carefully the “fine print” when it comes to statements about media lifetime estimates. DAT DDS tapes, for example, have a 10 year estimated lifetime. What that actually means is that should you use a tape once to record data and put it on a shelf in a temperature and humidity controlled environment, likely you can read it in 10 years. Divide that by some factor when you store it under less than ideal conditions. Divide it again by some factor when you read and write to the media or otherwise handle it.
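The derating buried in that fine print is just division; a minimal sketch follows. The function name and the factor values are my assumptions for illustration, not published figures.

```python
def effective_lifetime(rated_years, storage_factor=1.0, handling_factor=1.0):
    """Derate a vendor's rated media lifetime. The divisors are
    illustrative assumptions: e.g. 2.0 for poor storage conditions,
    2.0 again for frequent reading, writing, and handling."""
    return rated_years / (storage_factor * handling_factor)

# A "10 year" DAT DDS tape, stored badly and handled often:
print(effective_lifetime(10, storage_factor=2.0, handling_factor=2.0))  # 2.5
```

The point of the exercise is not the exact numbers but the habit of asking what conditions the vendor's headline figure actually assumes.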
So, it appears the media itself is vulnerable to decay and obsolescence. The standard lifetime of a particular disk or tape appears to be less than a decade; and you must copy or refresh the data stored on these media at regular intervals. A recent Canadian National Research Council study discussed the effects and implications of long-term commitments in scientific data management, with respect to both selection of data for long-term retention and media obsolescence. Further (and paradoxically), they concluded that data collected before the advent of computers and stored on “archival media” (paper) must be put into electronic form for wide and effective use today. They also concluded that such data can add enormous value to research efforts, particularly for studies examining long-term trends, but are costly to migrate.
The top priority is extending the usability of magnetic and optical media by stabilizing their structure and limiting the ability of internal and external factors to cause deterioration. It is important to identify life expectancy for existing media and to select the best media for storing new files in this digital world.
But media lifetime is only part of your problem. Preserving a printed book for decades or even centuries has been relatively easy. First, paper is usually a very durable material. Second, humans can extract information from a book by a simple process: reading. Third, understanding the information is possible since written languages have not changed totally and there are human experts who can translate the documents into modern language. A friend’s sister specializes in Old English, and I can read Molière’s Tartuffe, written in 17th-century French.
But, electronic resources differ in a fundamental way from printed resources. An application has to interpret every electronic resource before it can be displayed to and understood by humans. You can interpret any string of bits in countless ways, depending on the resource type and the application used. And this application, for instance, Microsoft Word 2002 for Windows, requires an operational environment, that is, hardware, operating system running on the hardware, and drivers and other support software.
If the information technology you use were stable, then preserving data would be a simple task. But our technological infrastructure is changing with ever increasing speed. Technical obsolescence threatens your data in many different ways.
As pointed out, the media you store electronic resources on may become unreadable either because the medium—diskette, tape or CD-ROM—is physically destroyed, or because you can no longer read the media even though it is still physically in good condition.
File formats and compression schemes are also constantly changing. Sometimes there is a real reason for this, for instance compression techniques have improved quickly, improving efficiencies of some data transmissions. But one might cynically suggest that it is all too common for vendors to make changes to force customers to buy new versions of their products. Reluctance to use standards—or to use them properly—may also benefit a company from a marketing and sales point of view.
Advances in computer design have been spectacular, and it seems certain that the current development rate, as specified by Moore’s Law, will not abate during the next 10 to 15 years. Should this Law continue to apply for the next 30 years, our children will have computers that are a million times faster than their current systems. It is almost certain that these machines will do at least the same things that current systems do, but what else will they do? If future computers are speech or vision controlled, could or would future users get accustomed to user interfaces common in 2003?
Some experts have suggested that standards will solve our problems. But you may never see the standardization of some relevant technical features, and technical development also will change the standards we rely on. For example, there are already two very different versions of the JPEG image compression standard, even though the first JPEG version is less than 10 years old. How many JPEG versions will we have 100 years from now? You are probably aware of SGML, HTML, and XML. Maybe even SAML. But are you also familiar with cHTML, HDML, WML, S-HTML, XHTML, VML, SMIL, MathML, ORM-ML, XrML, MNML, QAML, and DAML, among others? Which of these standards, if any, will survive?
Successful long-term storage and preservation of documents and data involves looking at the aforementioned issues. Let’s look in turn and in more detail at the following:
Loss through misplacement
Storage medium/media degradation
Media and hardware obsolescence
Format and software obsolescence
Intentional or Unintentional Destruction
Active files on computers, especially shared data on network servers, are constantly changing. In the normal course of business, you open, close, edit, alter, modify, rename or delete files. They are subject to manipulation in ways that static paper documents, stored in file cabinets, are not. Word-processed documents or e-mail messages may take on a more-or-less permanent form once completed, but financial records, customer databases, and other data compilations by their very nature are dynamic. Even the act of opening files may fundamentally change an important characteristic, such as automatically-generated dates or calculations of interest.
Many computer network administrators will routinely create back-ups of network data. Many administrators keep these backups forever, but others will do exactly what they are supposed to do, and overwrite them in the normal course of business with more current data.
Distinct from the routine churning of data resulting in the inadvertent destruction of data is the potential for intentional destruction. Individuals faced with potential lawsuits or employment dismissal might immediately delete damaging e-mail and word-processing files. Even when the destruction of electronic evidence, inadvertent or wilful, is ultimately unsuccessful, the cost of locating and recovering the data increases substantially. So while the complete physical destruction of electronic data is difficult or impossible, it is relatively easy to render the data inaccessible or to make retrieval too costly.
Organizations must implement security measures to protect records against either deliberate or accidental alteration. Some possibilities include:
compliance and audit programs to ensure security procedures are maintained; and,
providing ‘read-only’ access to the records.
This paper does not deal directly with data security but with data management, with security being an integral component. You can obviously pick up any good security resource and learn how to prevent or recover from accidental or intentional access, misuse or destruction of data.
Loss through Misplacement
Loss through misplacement is pretty straightforward. Even where you have good media and good procedures, they don’t mean a thing when you cannot find the media. So, your company should institute a schedule of on-site and off-site media audits. Find out that media are missing while you can still do something about it, not when it is too late!
Storage Medium/Media Degradation
Magnetic and optical media and the information stored on them degrade over time, which destroys the stored data. There are events or agents that can accelerate the degradation process. Table 4 shows some factors that accelerate the rate of degradation.
Table 4. Degradation Accelerants
Table adapted from Dark Ages II (Prentice Hall, 2002) by Bryan Bergeron.
So, you need to handle and store media properly to get the maximum lifetime. Any factor shown above could reduce the life or effectiveness of your chosen media.
Media and Hardware Obsolescence
Ah, perhaps in the good old days you had a TRS-80 or an Apple Lisa. Those were the days! But as you have figured out: computer technology is subject to on-going technological obsolescence. Hardware and software quickly become outdated as new upgrades and versions come onto the market. Electronic material created under older systems becomes unreadable (and hence inaccessible) in the original form after relatively short periods of time. To illustrate, Table 5 lists some of the hardware that has become obsolete in the short history of the microcomputer.
Table 5. Obsolete Microcomputer Hardware Platforms
Don’t forget about IBM’s Series 1, S/34, S/36, S/38 and AS/400! We could get quite an impressive list of obsolete hardware when we include Amdahl, Burroughs, NCR, Sperry-Rand, and Univac to name a few.
So whole systems become obsolete, but so does the media they use. Media obsolescence manifests itself in several ways:
Often, the act of upgrading to a new computer system means abandoning an old storage medium. The dual problems of short media lifetime and rapid obsolescence have led to the nearly universal recognition that you must refresh or copy digital information to new media every few years. You might think that copying is a relatively straightforward solution to these media problems, but it is not trivial; in particular, the copy process must avoid corrupting documents via compression, encryption, or changes in data format.
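One simple safeguard is to checksum the bit stream before and after each refresh cycle. A minimal sketch in Python follows; the function names are mine, and a real refresh process would log digests and handle media errors as well.

```python
import hashlib
import shutil

def sha1_of(path):
    """Hash a file in chunks so large media images fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def refresh(src, dst):
    """Copy a document to fresh media and verify that the bit stream
    survived unchanged. A sketch: real refresh cycles must also watch
    for lossy compression, encryption, and format conversions, which
    a byte-level checksum cannot see through."""
    shutil.copyfile(src, dst)
    if sha1_of(src) != sha1_of(dst):
        raise IOError("refresh corrupted " + src)
    return sha1_of(dst)
```

A checksum match proves only that the bits are identical; it says nothing about whether any software will still understand them, which is the subject of the next section.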
In addition, as media become denser, each copy cycle aggregates many disks, tapes, or other storage units onto a single new unit of storage; for example, a compact disk or digital versatile disk. This raises the question of how to retain any labelling information and metadata associated with the original media, since you cannot practically fit the contents of the labels of 400 floppy disks on the label of a single CD. So, you must digitize the label information to ensure that it continues to accompany the data it describes. But whereas labels are directly human-readable, digitized information is not; therefore, you must digitize the labels and metadata in a way that lets humans read them at least as easily as the documents they describe. This may seem a relatively trivial aspect of the problem, but it has serious implications.
No matter the digital media format you choose, hardware obsolescence poses a greater problem than the degradation of the storage medium. Storage media have changed quite a bit physically from the metal open-reel tapes used in the 1950s and 1960s, and no one builds drives that can read them anymore. Even should someone build such a drive, you’d have to write custom software to interpret the file formats used back then, if you can find documentation for them.
Ten years ago most computers used 5¼-inch floppies and thirty years ago people used 8 mm reels for video. Not that long ago, organizations used punched cards and paper tape. For all intents and purposes, the hardware to “play” these media formats is obsolete. Dell announced it will no longer provide floppy drives in the systems it sells. Perhaps in five years, you won’t find the hardware to read a floppy.
You cannot count on backwards compatibility across multiple generations of hardware. For critical data, one must maintain a reference machine.
Software and Format Obsolescence
Though media problems are far from trivial, they are but the tip of the iceberg. Far more problematic is the fact that digital documents are generally dependent on application software to make them accessible and meaningful. Copying media correctly at best ensures that you can preserve the original bit stream of a digital document. But you cannot make a stream of bits self-explanatory without the creating application. A bit stream (like any stream of symbols) can represent anything: not just text, but also data, imagery, audio, video, animated graphics, and any other form or format, current or future. Without knowing what is intended, it is impossible to decipher such a stream. In certain restricted cases, you can possibly decode the stream without additional knowledge. For example, when a bit stream is known to represent simple, linear text, you could use cryptanalytic techniques to decode it. But in general, you can make a bit stream intelligible only by running the software that created it, or some closely related software that understands it.
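The point is easy to demonstrate: the same four bytes mean entirely different things depending on the interpretation you impose on them. A small sketch using Python’s standard struct module:

```python
import struct

stream = b"\x41\x42\x43\x44"   # four bytes with no intrinsic meaning

# Three equally valid readings of the identical bit stream:
as_text   = stream.decode("ascii")           # 'ABCD'
as_le_int = struct.unpack("<I", stream)[0]   # 1145258561 (little-endian)
as_be_int = struct.unpack(">I", stream)[0]   # 1094861636 (big-endian)
```

Without knowing which interpretation the creating application intended, all three readings are equally “correct,” and none of them is the document.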
This point cannot be overstated. In a very real sense, digital documents exist only by virtue of software that understands how to access and display them; they come into existence only by virtue of running this software.
When all data are recorded as 0s and 1s, there is, essentially, no object that exists outside of the act of retrieval. The demand for access creates the object, that is, the act of retrieval precipitates the temporary reassembling of 0s and 1s into a meaningful sequence that only the right combination of software and hardware can decode.
As this statement implies, the only reliable way (and often the only possible way) to access the meaning and functionality of a digital document is to run its original software—either the software that created it or some closely related software that understands it. Yet such application software becomes obsolete just as fast as does the digital storage media and media-accessing software. And although you can save obsolete software—and the operating system environment where it runs—as just another bit stream, running that software requires specific computer hardware, which itself becomes obsolete just as quickly. It is therefore not obvious how we can use a digital document’s original software to view the document in the future on some unknown future computer (which, for example, might use quantum rather than binary states to perform its computations). This is the crux of the technical problem of preserving digital documents.
Just as hardware becomes obsolete, so do file formats. Not long ago, you might have used WPD (WordPerfect Document) format to send a document to a friend. Your friend might not have had WordPerfect or a suitable filter and could not open the document without losing data or formatting. Today, you might use Hypertext Markup Language (HTML), Microsoft’s Rich Text Format (RTF) or Adobe’s Portable Document Format (PDF). Who knows what you’ll use in two or three years? While widespread standard formats such as ASCII have excellent long-term prospects, most other file formats, especially proprietary database or word-processing formats, risk losing support: once the software that created the files is no longer maintained and compatible with future operating systems, there may be nothing left to “play” the files.
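One practical consequence is that a file’s extension tells you little; the leading bytes of the bit stream (its “magic number”) are a more reliable clue. The signatures below are real, published magic numbers, but the helper function itself is only an illustrative sketch of the idea:

```python
# Identify a file's format from its "magic number" (leading bytes) rather
# than trusting its extension. The signatures below are genuine published
# values; the helper itself is a minimal sketch.

MAGIC_NUMBERS = {
    b"%PDF-": "PDF",
    b"{\\rtf": "RTF",
    b"PK\x03\x04": "ZIP archive (also used by later XML-based office formats)",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF89a": "GIF image",
}

def identify_format(data: bytes) -> str:
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return "unknown"

print(identify_format(b"%PDF-1.4 sample header"))      # PDF
print(identify_format(b"plain ASCII has no magic"))    # unknown
```

Note that plain text comes back as “unknown”: ASCII deliberately has no signature, which is part of why it ages so well as an archival format.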
So you might lose the original application and the format. But in addition, you might lose the operating system as well. Table 6 shows that operating systems become obsolete too.

Table 6. Obsolete Operating Systems
Yes, you can find people supporting CP/M on the Internet, but face it: it’s dead! And so is the Dumb Operating System! You could add many more obsolete operating systems to this list.
Planning for obsolescence means ensuring that you can copy, reformat or migrate records. This includes hardware, software, operating system and media. Whenever you copy or reformat records, you should use tools such as checksums, message digests or hash digests to confirm the integrity of the data. You should migrate web-based records and their associated metadata as often as necessary to avoid technological obsolescence for as long as you require the records. You should record any preservation actions such as copying, reformatting or migrating data. Any loss of functionality, content or appearance that occurs as a result of reformatting or migration to standard formats should be fully documented as well.
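The checksum step above can be sketched in a few lines: hash the record before the copy, hash it again afterwards, and refuse to accept the copy if the digests differ. This uses only the Python standard library; the function names are illustrative.

```python
import hashlib
import shutil

def file_digest(path, algorithm="sha256"):
    """Compute a hash digest of a file, reading in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_and_verify(src, dst):
    """Copy a record to new media and confirm the bit stream is unchanged."""
    before = file_digest(src)
    shutil.copyfile(src, dst)
    after = file_digest(dst)
    if before != after:
        raise IOError("integrity check failed copying %s to %s" % (src, dst))
    # Record this digest alongside the preservation documentation.
    return after
```

The returned digest belongs in the preservation record, so that anyone inspecting the archive later can prove the bit stream has not changed since the copy was made.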
When designing and building applications, organizations should plan to use software tools and languages that meet accepted (or de facto) standards, and are readily available and fully supported.
You already read that you should store data on relatively long-lasting media that has a reasonably high capacity and is widely supported by storage media vendors. The physical storage of electronic records is as important as the long-term digital format of the records themselves, so you should also institute good environmental control and monitoring of your storage.
But that’s not enough. It is important to assess different preservation strategies in your organization. To truly preserve the data, you may have to perform regular media migrations as evolving standards require. For critical data, you may have to maintain a reference machine with the appropriate software.
Let us assume that an organization has a large collection of Word 97 documents stored on 3½ inch diskettes. They use the documents on occasion, but some documents are still essential and require preservation.
At point 1 in time, the tools the organization needs to use the documents (the Windows 95 operating system and Word 97) are in common use but fading. The diskettes are reliable, even though no one has written new data to them in more than five years.
At point 2 in time, the employees find diskettes difficult to use, and some diskettes begin to reach the end of their useful life. Consequently, the organization decides to copy all documents to a new archive server. They throw away the old diskettes and stop making new ones, since the documents now live on the server. So far, so good.
At point 3 in time, the organization upgrades the old workstations and application programs. Word 97 is no longer in use, but the new text editor can still read documents in this format. Due to staff limitations, no retrospective migration is done except for those documents currently in use. Someone checks the quality of each migration, and the original document is kept in the event there are serious problems.
At point 4 in time, the organization plans an upgrade to yet another hardware and software environment. During the planning stage, someone notices that the new text editor does not support Word 97. A quick check of the archival system shows that there are still a few thousand Word 97 documents left. There is not enough staff to migrate all these documents before the new hardware and software are installed.
Moreover, it is known that the migration will not always give good enough results. The organization decides to acquire an emulator, which enables continued use of the current application that can view, and possibly also edit, Word 97 files. Now the fun starts: you must support the emulator that reads the Word 97 documents. Extrapolate this a little further and you get the picture.
Although all preservation strategies have some shortcomings, you can use them to complement one another. Any organization investing in the archival of electronic resources should test all preservation methods in order to get familiar with them. Following are some preservation methods.
To protect against media obsolescence and deterioration, or to transfer records, you must periodically refresh the media. A refreshing strategy means periodically copying the resources to new storage media. Refreshing is the process of physically copying records from one medium to another. The resource remains the same; not a single bit changes. When and where needed, you can automate the refreshing process.
Refreshing is technically straightforward. The difficulty is in knowing when it is necessary to copy a document; there is no way to check whether you can still read a tape without actually trying to read it.
Valuable records may fail to be transferred to new media or transformed to electronic form because of a lack of resources (funds or appropriate equipment) or lack of motivation. With the extraordinary volumes of data an organization collects, transferring data to new media and managing high-value data sets for active use will increasingly challenge organizations, particularly since the time frames for rescuing old, deteriorating data are frequently quite short.
While every digital archive must copy its documents to new media regularly enough (whatever that means), refreshing fails when used as the sole preservation strategy. Without the application to read the copy, the copy is useless. Consequently, you need to use refreshing in conjunction with another, more capable preservation technique.
Migration is the conversion of the resource from an old platform to a new, supported software or hardware platform. This strategy is the most popular one at the moment and is routinely used in preserving data. As hardware and supporting software move to newer generations, the potentially longer lifetimes of new media become meaningless: you must migrate the data from old technology to new technology. The purpose of migration is to preserve the integrity of digital objects and to retain the ability for clients to retrieve, display and otherwise use them in the face of constantly changing technology. This migration process involves many standards-related issues such as interoperability, interchangeability, data access and management.
Conversion of a WordPerfect 9 document into a Word 2002 XML document sounds like a fairly easy thing to do. But migration is not as easy as it may seem at first blush. You cannot predict how often you need to convert your documents, and you cannot estimate the technical difficulty of any conversion. While you can easily convert most WordPerfect documents to Word, some documents using special WordPerfect features may be very difficult to convert and might be lost.
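A batch-migration pass therefore needs to plan for failure: convert what it can, keep the originals, and log every document the converter cannot handle for manual review. The sketch below stubs out the converter with a hypothetical `convert_wpd_to_docx` function (not a real library call) so the control flow is visible end to end:

```python
# A sketch of a batch-migration pass with loss logging. `convert_wpd_to_docx`
# is a hypothetical stand-in for a real format converter; here it is stubbed
# to "fail" on documents that use special features, mimicking the hard cases.

def convert_wpd_to_docx(doc):
    """Stub converter: rejects documents using unsupported features."""
    if "special-feature" in doc["features"]:
        raise ValueError("unsupported WordPerfect feature")
    return {"name": doc["name"], "format": "docx", "body": doc["body"]}

def migrate(documents):
    """Convert each document; keep a problem list instead of discarding."""
    migrated, problems = [], []
    for doc in documents:
        try:
            migrated.append(convert_wpd_to_docx(doc))
        except ValueError as exc:
            # Keep the original and document the loss for manual review.
            problems.append((doc["name"], str(exc)))
    return migrated, problems

docs = [
    {"name": "memo.wpd", "features": [], "body": "Quarterly memo"},
    {"name": "chart.wpd", "features": ["special-feature"], "body": "..."},
]
ok, failed = migrate(docs)
print(len(ok), "migrated;", len(failed), "flagged for manual review")
# prints: 1 migrated; 1 flagged for manual review
```

The point is the bookkeeping, not the conversion itself: every failure ends up documented rather than silently lost, which matches the documentation requirement discussed earlier.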
Documents whose continued use is crucial to the individuals or organizations that own them are more likely to be included in the migration process, in recognition of their importance, but this does not guarantee that their meaning will not be inadvertently lost or corrupted.
Generally, where you convert from a versatile or complex file format to a simple one, you will lose data. For example, convert a technical document in LaTeX format to HTML, and you will lose every mathematical formula unless you convert the formulae into images. This is certainly possible, and you can do it, but it may take weeks for a single work. While you can do the conversion this way for special cases, this method does not scale well when you have thousands or millions of documents.
Migration is also unpredictable in the sense that you may lose some properties of the converted documents. Some losses you can plan for; others you cannot. You should have very detailed knowledge about the properties of the archived resources to know whether the conversion tool can deal with all the documents’ features. A Word document is not a simple entity consisting only of text; you might have tables, images, hyperlinks, macros, embedded metadata and other things that you also should convert.
Ask anybody who bought into Ellen Feiss’ pitch (http://www.apple.com/switch/ads/ellenfeiss.html) and switched to Apple about migrating data from one platform to another. You just cannot migrate some of those old documents, or the costs make it unattractive. As a right-brained person I can vouch for that! It’s tough being a Mac person in a Windows world.
You may find that you cannot migrate every resource, either because it is totally impossible or because it is cost-prohibitive. Organizations involved in the year 2000 migration found that software available only in binary form is a good example of a resource that you must preserve as is. And even where you had the source code, conversion to a new software platform requires skills that you might not have.
Converting databases also will pose serious problems. Some experts think that relational databases and SQL simplify the handling of databases. Unfortunately, this is not always the case. No matter what the underlying database and query language, you must extract the data to a flat file that you can then load into a new database system. What happens when your organization finally migrates to that new object-oriented database management system? Or whatever technology succeeds your OODBMS?
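The flat-file extraction step can be sketched with the standard library: pull the rows out of a relational table and serialize them to CSV, the lowest common denominator a future database system can reload. The table and column names here are invented for the illustration.

```python
import csv
import io
import sqlite3

# Export a relational table to a flat CSV file -- the baseline step when
# migrating between database systems. Table and columns are illustrative.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, owner TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "Alice", 1200.50), (2, "Bob", 87.10)])

buffer = io.StringIO()          # stands in for a file on the target media
writer = csv.writer(buffer)
cursor = conn.execute("SELECT id, owner, balance FROM accounts ORDER BY id")
writer.writerow([col[0] for col in cursor.description])  # header row
writer.writerows(cursor)                                 # data rows

flat_file = buffer.getvalue()
print(flat_file)
```

What the flat file cannot carry is exactly what makes the problem serious: constraints, triggers, views and stored procedures all live outside the rows and must be rebuilt by hand in the target system.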
In its short history, computer science has become inured to the fact that every new generation of software and hardware technology entails the loss of information as we translate documents between incompatible formats. Paradigm shifts (such as those between networked, hierarchical, relational and object-oriented databases) cause the most serious losses, often requiring a complete redesign of rules, triggers, files, documents, views, schemas, databases or data stores to migrate to the next great thing. Whenever this happens, documents that people are not currently using may well become orphaned by the demise of their original formats; that is, an organization may abandon them to save the cost of migration, while each document you do migrate turns into something unrecognizably different from its original, which is generally lost. Even when migration does not involve a major paradigm shift, it can result in subtle losses of context and content. Yet aside from this ad hoc migration process, there exists no proven strategy for preserving (or translating) documents over time.
So, you might find migration easy, but you also may find it extremely challenging. Badly written and badly tested conversion software may destroy information by inadvertently removing vitally important features from the original data. But in skilled hands, and with sufficient resources, migration may yield good results, provided you can apply it at all.
Perhaps you should base preservation of electronic resources on emulation. This strategy is based on the development of applications that mimic old hardware or software in new hardware or software environments. In other words, emulation is the creation of new software that mimics the operation of the older software and hardware so you can reproduce its performance and preserve the physical presence of the digital object (the 1s and 0s) as well as its content, layout and functionality. Conceivably, you might consider emulation the best solution, since you can save all digital information and distribute it in a special transparent format independent of the software platform and hardware environment used in the process.
You would store resources encapsulated with sufficiently detailed information about the environment where the application was originally designed to work. Based on this information, the digital archive could pick the resource itself and then the emulators and applications the resource requires.
The full potential of the emulation strategy remains to be seen. Small tests have given promising results, but no large-scale or long-range tests have been carried out. Therefore, some experts remain skeptical, while others believe it shows all the signs of success.
Emulation requires a very accurate description of the old environment. Generally, you want to emulate the hardware, as it is more stable than the software. Moreover, detailed enough hardware specifications exist and are widely available. The Transmeta processor offers good proof of this, since it can mimic an Intel processor quite well. However, most emulators have been developed for operating systems, not hardware. You can use Windows emulators running on MacOS quite successfully; in testing, emulators failed only when the original Windows failed as well. Many developers use products like VMware to effortlessly test software in many environments on one machine. Microsoft recently purchased Virtual PC, which allows you to run Windows on an Apple platform.
Should you base your digital archive solely on emulation, you might not find it very user friendly. To read a text document written with a DOS text editor, you might need to learn both DOS and the old command-based text editor. This is an unrealistic requirement. Instead, you may need to develop special viewer applications. You could still store documents in their original format, but additional software would apply migration on the fly to present the resource to the user. Already, there are about 100 viewers that deploy this strategy and can present data in almost any image format. For example, Quick View Plus v6.0 from JASC Software (try http://www.provantage.com/PR_52043.HTM) gives you instant access to over 200 Windows, DOS, Macintosh, and Internet file types. These data viewers let you access data without complex application installation or maintenance.
Using emulation for long-term preservation will require seamless co-operation among a large number of emulators. Since it is not possible to emulate every old platform on a new one, you will need to layer one emulator on top of another: an emulated application running on an emulated operating system on emulated hardware. In the long run, this is only possible if emulator development is a well-controlled activity.
Devising a strategy for preservation is an important step in your data management. But as with data classification, time matters.
Temporal Aspects of Preservation
You can view the practical problem of digital preservation at three different points in time. In the short term, you are facing an urgent need to save digital data that is in imminent danger of becoming unreadable or inaccessible, or to retrieve digital records that are already difficult to access. Yet the often heroic efforts needed to save or retrieve such material may not be generally applicable to preserving digital documents far into the future, and chances are you cannot generalize the techniques employed to solve similar urgent problems that may arise later. These short-term efforts, therefore, do not provide much leverage, in the sense that they are not replicable for different document types, though you may need them for saving crucial records.
In the medium term, organizations must quickly implement policies and technical procedures to prevent digital records from becoming vulnerable to imminent loss in the near future. For the vast bulk of records—those you are creating now, those you created fairly recently, or those you translated into formats and stored on media currently in use—the medium-term issue is to prevent these records from becoming urgent cases of imminent loss within the next few years, as current media, formats, and software evolve and become obsolete.
In the long term, you must develop a truly long-lived solution to data longevity that does not require continual heroic effort or repeated invention of new approaches every time formats, software or hardware platforms, document types, or recordkeeping practices change. Such an approach must be extensible, in recognition of the fact that you cannot predict future changes, and it must not require labour-intensive (and error-prone) translation or examination of individual records. It must handle current and future records of unknown type in a uniform way, while being capable of evolving as necessary.
One of the most notable effects of the digital revolution is the ability to create and modify content at a rate unimaginable in a non-digital world. The enormous challenge of preserving and providing secure storage for this digital content has become the focus for research and product development. Now more than ever, organizations are producing data as digital content and even converting existing material to digital formats, which means that the traditional analog methods for preserving content (such as paper copies) are more costly than storage and preservation on digital media, as well as less environmentally friendly (large quantities of paper).
Businesses will benefit from easier and more organized access to preserved data on various digital media. As more of the digital objects organizations produce come with active functionality or multimedia, more and more organizations will require that their data hold on to its functionality, and that they can access that functionality in the long term, which is only possible with data management. Organizations interested in the preservation of their records and products need certainty about the quality of the media storing their data and the techniques and standards used.
Preservation of electronic resources is a complex technical, organizational and legal problem. Just how complex, nobody knows. Your organization must settle on a technical solution feasible in terms of the societal and organizational responsibilities and of the required implementation costs.
The preservation and management of digital records involves interrelated technical, administrative, procedural, organizational, and policy issues, but a sound technical approach must form the basis for everything else. Preserving digital records may require substantial new investments and commitments by organizations, forcing them to adopt new economic and administrative policies for funding and managing data preservation.
Digital media are vulnerable to loss by two independent mechanisms: the physical media where they are stored are subject to physical decay and obsolescence, and the proper interpretation of the documents themselves is inherently dependent on software.
To ensure that you can access preserved data on media, you should document the following:
File format and its version; for instance Word 2000;
Information about migrations and possible information losses that occurred during them;
Information that supports authentication; for instance an MD5 checksum to prove that no one has changed the resource during the archival period.
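A preservation-metadata record covering the items above might look like the following sketch: the file format and version, the migration history with noted losses, and an integrity digest. The field names are illustrative, not a formal metadata standard.

```python
import hashlib
import json

# A minimal preservation-metadata record: format and version, migration
# history with documented losses, and an MD5 integrity digest. Field names
# and values are invented for the illustration.

payload = b"...the archived bit stream..."

record = {
    "file_format": "Word 2000",
    "migrations": [
        {"date": "2003-05-01",
         "from": "Word 97", "to": "Word 2000",
         "losses": "embedded macros not carried over"},
    ],
    "integrity": {"algorithm": "MD5",
                  "digest": hashlib.md5(payload).hexdigest()},
}

print(json.dumps(record, indent=2))
```

Storing the record in a plain, self-describing text format (here JSON) is itself a preservation choice: the metadata must outlive the proprietary formats it describes.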
Any technical solution you develop must also cope with issues of corruption of information, privacy, authentication, validation, and preservation of intellectual property rights.
Oddly enough, a good data preservation program will deal with data destruction as well. You have to make a decision to preserve data. The consequences of a negative decision are data disuse, destruction or disposal. Preservation and destruction are both parts of managing data in any organization. You need to look at all data and decide who needs it and for how long. When you answer these questions, you will ultimately decide to dispose of some data. This is where we came in. If you use a commercial program to overwrite the hard drive with specific patterns that exercise all the bits on the drive, it will make it extremely difficult to recover anything at all. Could the No Such Agency recover anything from a drive that has been erased using this software? You’d have to ask them; I don’t know. I personally doubt it. But if that’s who you’re worried about reading your data, you probably should grind the drive to dust, and then fuse the dust together with a blowtorch. For the rest of us, there are viable solutions! (If you are looking for a good way to destroy old CDs, check out the web site http://toast.ardant.net/. For your hard drive, you can physically destroy the drive by popping open the drive case and dragging a screwdriver tip over the platters. This will render the data on the disk unrecoverable for practical purposes and will prevent all hand-me-down users and dishonest repair technicians from getting much of value from your old disk drive.)
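As a sketch of the overwrite-with-patterns idea, the following Python applies passes of zeros, ones, and random bytes to a single file before deleting it. A real disk sanitizer works on the raw device and copes with remapped sectors and drive caches, which this illustration does not; the function name is invented for the example.

```python
import os
import tempfile

# Sketch of multi-pass pattern overwriting on a single file. A genuine
# sanitizer operates on the raw device; this only shows the pass structure.

def overwrite_and_delete(path, passes=(b"\x00", b"\xff", None)):
    """Overwrite a file in place with each pattern (None = random), then delete it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for pattern in passes:
            f.seek(0)
            data = os.urandom(size) if pattern is None else pattern * size
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push each pass out to the device
    os.remove(path)

fd, path = tempfile.mkstemp()
os.write(fd, b"bank account numbers and PINs")
os.close(fd)
overwrite_and_delete(path)
print("exists after wipe:", os.path.exists(path))
# prints: exists after wipe: False
```

Even this simplified version shows why ordinary deletion is not destruction: removing a directory entry leaves the data on the platters, while overwriting at least replaces the bits themselves before the space is released.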
Reportedly, DriveSavers, a data-recovery company, gets so many calls from frantic customers that it hired a specialist to talk to them. The specialist qualified for the job based on her previous experience on a suicide prevention hotline. It seems people freak out and are quite desperate or distraught when they think they have lost their precious data. So unless you want someone to commiserate with, you should be proactive in developing a data management program that includes both preservation and destruction.