Archive for the ‘Long-Term Retention and Preservation’ Category

From FAQs for the SNIA report: “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter”

Preservation: Managing information in today’s datacenter, with requirements to safeguard information assets for eDiscovery, litigation evidence, security, and regulatory compliance, requires that many classes of information be preserved from the time of creation. Preservation is a set of services applied to each preservation object: protection and availability; integrity and authenticity controls; security and confidentiality safeguards; an audit log; control of metadata; and other practices. The old IT practice of placing information into an archive when it becomes inactive or expired is tiering, not archive, and no longer works for compliance or litigation support because it only adds cost and risk. Thus, we see products and practices like eMail Archive, Compliance Storage, Preservation Stores, and Database Archives being used to capture and preserve key classes of information and data upon creation.

From FAQs for the SNIA report: “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter”

Archive: The report advocates that IT practices adopt a more consistent usage of the term ‘archive’ to facilitate interaction with other departments within the organization. To the archival, preservation, and records communities, an archive is a specialized repository with preservation services and attributes. Typical IT use of the verb “archiving” actually refers to a practice based on ILM called “tiering,” the migration of inactive, reference, or expired information to a lower tier of storage to reduce cost and improve storage efficiencies. A lower tier of storage is not an archive with preservation-class services. Another IT (and vendor) misuse happens when ‘archive’ is confused with backup. Backup media saved offline or offsite do not constitute an archive (a preservation store with preservation services), nor should backup be confused with tiering.

I have been frustrated lately by the continued misuse of the terms ‘archive’ and ‘archiving’ that I find throughout the data protection and backup industries. I keep trying to teach this principle, so let me offer some additional perspectives. I’ve written extensively about this in the new SNIA report “Building a Terminology Bridge: Guidelines for Digital Information Retention and Preservation in the Datacenter.”

1. Definition: As background, keep in mind that the world, including SNIA, defines an ‘archive’ as a specialized repository with preservation services, generally used to preserve, protect, and secure information and data, and to verify their authenticity and integrity, for the long term. Without preservation services it is just a bit-bucket: not an archive, just a tier. (Always use that test.)

2. History: Dating back to the late 1980s on the mainframe, IT and vendors got into the bad habit of defining ‘archiving’ as the process of moving (migrating) information to a lower tier of storage or to shelf storage. I still remember that the first time I heard ‘archiving’, it referred to removing data from primary storage onto tape and putting it offline. And then there was HSM, with archive as the lowest tier. In the early 1990s STK introduced “deep archive” as the bottom tier of HSM. And in early 2000, some analysts (who will go unnamed) jumped on the idea of “Active Archive”, implying the bottom tier could be on disk and accessible instead of buried on tape. The vendors over the years have found it in their best interests to promote tiering and migration. The backup vendors now seem to find it useful to talk about moving backup data to tape for an archive. That is so wrong, and such a bad practice, because it confuses IT into thinking they have a long-term preservation capability when they absolutely do not. None of these use cases defines a real archive; they are really nothing more than migration and tiering with different policies or requirements.

3. Here is an important point: State (active, inactive, reference, or expired) is independent of retention period. Migration or copying to a preservation store (an archive) has nothing to do with state. This confusion is exactly what we are trying to change. State and retention period are independent variables in ILM-based practices. If a governance policy says make a copy or store all emails classified ‘business critical’ in a compliant store, that has nothing to do with state. If a performance or cost policy says store inactive data or information on secondary storage instead of primary, that is a business rule that uses migration and tiering practices. Stop confusing tiering with archive.

4. Preservation services are essential: Moving data to a lower tier without adding preservation services does not create an archive. It is just a bit-bucket. eMail archive is a great example because all eMail archives begin with an ingestion process that sets in place controls for long-term preservation. The eDiscovery community is now beginning to use eMail archive repositories for their litigation review stores because they need those services to control things like authenticity. A litigation may last 10 years and go through many custodians. So, preservation services are essential.
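To make the test in point 1 and the independence in point 3 concrete, here is a minimal Python sketch. Everything in it is my own illustration, not something defined in the SNIA report: the service names in REQUIRED_SERVICES are simply a paraphrase of the preservation services listed above, and Store, Item, needs_preservation, and storage_tier are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"


# Assumption: an illustrative set of preservation services, paraphrased
# from the services named in the text. A real policy would define its own.
REQUIRED_SERVICES = frozenset(
    {"integrity", "authenticity", "security", "audit_log", "metadata_control"}
)


@dataclass(frozen=True)
class Store:
    name: str
    tier: int
    services: frozenset = frozenset()

    def is_archive(self) -> bool:
        # The test: without preservation services it is a tier, not an archive.
        # The tier number plays no part in the answer.
        return REQUIRED_SERVICES <= self.services


@dataclass(frozen=True)
class Item:
    name: str
    state: State
    classification: str


def needs_preservation(item: Item) -> bool:
    # Governance policy: depends only on classification, never on state.
    return item.classification == "business critical"


def storage_tier(item: Item) -> int:
    # Cost/performance policy: depends only on state (this is tiering).
    return 1 if item.state is State.ACTIVE else 2
```

With these definitions, a tape shelf such as Store("tape shelf", 3) fails the test however low its tier, while an active ‘business critical’ email still needs preservation even though the tiering policy keeps it on primary storage. The two decisions never look at each other’s inputs, which is the whole point.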

I recently prepared some FAQs on the Terminology Bridge report that are applicable to this conversation and I’ll post them separately.

It really is this simple:

  • Use “migration and tiering” instead of “archiving”
  • Do not use the term archive unless you refer to a specialized repository with preservation services

I’d like to explain the many ways that information and data will be lost in a typical datacenter. Note that I say “will be lost.” Data loss is inevitable. Information is lost the more it is handled, copied, moved, replicated, and migrated, and as it ages.

The point is that loss happens. Let me say it plainly: you cannot stop data loss. The key questions are “How much will be lost?”, “Do you care?”, and “What can we do to reduce it?” Here is what I mean by lost.

There are four principal classes of loss.

The first category I call poor storage practices. By this I mean several things. In a relatively large file system with millions to trillions of files distributed across multiple sites, servers, desktops, test databases, DR sites, and remote web servers or service providers, trust me, lots of files will be misplaced and effectively lost by users and the system. Loss occurs if you can’t find it, read it, or interpret it. It doesn’t matter how it was caused. All of these are valid forms of loss.

Additional storage problems come from poor document control practices, such as losing track of versions or ‘official records’, and are compounded if you are using external services. What happens when files are sent offsite to a web host or storage service, and that service is down, corrupted, or goes out of business, and you can’t get your files back? Loss happens. As we move into focusing on Cloud Storage we’ll hear more of this problem surfacing. Remember, you risk fines or other penalties during litigation if information cannot be discovered and produced. This is a cost of loss.

The second class of loss is through poor security practices. The most obvious is when a hacker or employee gets through your firewalls and takes information, views confidential or private information, or changes or damages information. We have all heard countless stories now of lost notebooks or tapes containing millions of records with personally identifiable information. Those all count as forms of loss. One of the worst nightmares in litigation evidence control is when an ex-employee shows up with historical files and emails that you don’t have, since you followed your retention and deletion protocols and permanently deleted them on schedule. They took them while they were employees and now potentially have an advantage. Perhaps the only perspective to have on losing information is one of damage control and recovery. If you think otherwise, consider the next class of challenges.

The third class of loss is through human or operational errors. Human error is the number one cause of damage or loss and we are not likely to change that fact. It manifests in many ways, but the pertinent issue is whether or not your recovery systems work. Here’s the test. Your systems are faithfully backed up. But how often and how thoroughly have you tested recovery? Backup works great when it is write once, read never. But you might be surprised how often recovery is compromised. The alternative is to rebuild information from scratch. Cost estimates to do this vary, ranging from $5k to $50k per megabyte. Factor that thinking into your recovery strategies.
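One way to actually test recovery, rather than assume it, is to record a checksum for every file before backup and compare after a trial restore. The following is a minimal Python sketch under my own assumptions: the function names (sha256_of, build_manifest, verify_restore) are hypothetical, SHA-256 is simply one reasonable digest choice, and the restore itself is whatever your backup product does. This only checks the result.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: Dict[str, str], restored_root: Path) -> Dict[str, List[str]]:
    """Report files that are missing or whose contents changed after restore."""
    missing: List[str] = []
    corrupted: List[str] = []
    for rel, expected in manifest.items():
        candidate = restored_root / rel
        if not candidate.is_file():
            missing.append(rel)
        elif sha256_of(candidate) != expected:
            corrupted.append(rel)
    return {"missing": missing, "corrupted": corrupted}
```

Build the manifest from the live data at backup time, keep it somewhere separate from the backup media, and run verify_restore against a scratch restore on a schedule. An empty report is evidence that recovery works; anything else tells you exactly which files you would have had to rebuild.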

The fourth class of loss is caused by process or practice errors. First: inappropriate deletion processes. Deletion is good. You must delete expired and disposable information when you can; otherwise, all you are doing is driving up operating costs, storage costs, and risk.

But, do it wrong and you may cause ‘spoliation’. Make sure your processes are correct and cleared with legal and then audit them.

Next, mistakes occur and here are two examples:

1st – during litigation evidence processing, if you lose authenticity or damage the chain of custody, the evidence is as good as lost. You may not be able to present it.

2nd – during migration events, many similar failures can happen. It is safe to say that migration causes damage. After two migrations most IT people will openly admit they have lost some portion of the information. Migration data loss is significant. That is why all digital information is at risk long-term. We just don’t have good physical and logical migration practices in place as an industry.

For long-term retention and preservation I strongly urge you to get expert help. Talk to me!

We’ve invited members of the IEEE’s Mass Storage Systems and Technologies workgroup on digital preservation to join with SNIA members in reviewing the requirements for long-term retention and preservation. If you would like to participate in this discussion, please go to the DMF Community’s site and register to access it.

http://community.snia-dmf.org

This is an important conversation as we need to update these requirements and then extend them further as we consider the implications of bringing technologies and architectures to market to solve the two ‘holy grail’ problems of preservation – logical and physical migration. Please participate.