Archive for the ‘Data Management Issues’ Category

From FAQs for the SNIA report: “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter”

Authenticity: defined in a digital retention and preservation context as the practice of verifying that a digital object has not changed. Authenticity attempts to establish that an object is currently the same genuine object that it was “originally” and to verify that it has not changed over time unless that change is known and authorized. (The term integrity is not to be confused with authenticity. The objective of “integrity” is to prevent corruption or damage, and it is defined as the consistency, accuracy, and correctness of stored or transmitted data or information. Integrity and authenticity are both required to preserve information and data assets.) Authenticity verification requires the use of metadata. The critical change for IT practices is that metadata is now very important and must be safeguarded with the same priority as the data itself. IT practices that damage, merge, ignore, or scramble metadata are no longer appropriate.
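As a minimal sketch of why the metadata matters, the snippet below records a SHA-256 digest at ingestion and compares it later. The record fields and function names are hypothetical and not taken from the SNIA report. A matching digest only shows that the bits are unchanged; a real authenticity check would also need provenance and a record of authorized changes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservationRecord:
    """Hypothetical metadata kept alongside a preserved object."""
    object_id: str
    sha256: str       # digest recorded when the object was ingested
    ingested_by: str  # who placed the object in the store


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the object's bytes."""
    return hashlib.sha256(data).hexdigest()


def ingest(object_id: str, data: bytes, ingested_by: str) -> PreservationRecord:
    """Create the metadata record at the time of ingestion."""
    return PreservationRecord(object_id, digest(data), ingested_by)


def is_unchanged(record: PreservationRecord, data: bytes) -> bool:
    """True if the current bytes still match the digest in the metadata."""
    return digest(data) == record.sha256
```

If the metadata record is lost, merged, or scrambled, there is nothing left to compare against, which is the point of safeguarding it as carefully as the data.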

From FAQs for the SNIA report: “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter”

Preservation: managing information in today’s datacenter, with requirements to safeguard information assets for eDiscovery, litigation evidence, security, and regulatory compliance, requires that many classes of information be preserved from the time of creation. Preservation is a set of services that protect information, provide availability, integrity, and authenticity controls, include security and confidentiality safeguards, and include an audit log, control of metadata, and other practices for each preservation object. The old IT practice of placing information into an archive when it becomes inactive or expired is tiering, not archive, and no longer works for compliance or litigation support because it only adds cost and risk. Thus, we see products and practices like eMail Archive, Compliance Storage, Preservation Stores, and Database Archives being used to capture and preserve key classes of information and data upon creation.

From FAQs for the SNIA report: “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter”

Archive:  the report advocates that IT practices adopt a more consistent usage of the term ‘archive’ to facilitate interaction with other departments within the organization. To the archival, preservation, and records communities, an archive is a specialized repository with preservation services and attributes. Typical IT use of the verb “archiving” actually refers to a practice based on ILM called “tiering,” the migration of inactive, reference, or expired information to a lower tier of storage to reduce cost and improve storage efficiencies. A lower tier of storage is not an archive with preservation-class services.  Another IT (and vendor) misuse happens when ‘archive’ is confused with backup. Backup media saved offline or offsite does not constitute an archive (a preservation store with preservation services) nor should backup media be confused with an archive or with tiering.

I have been frustrated lately observing the continued misuse of the terms ‘archive’ and ‘archiving’ throughout the data protection and backup industries. I keep trying to teach this principle, so let me offer some additional perspectives. I’ve written extensively about this in the new SNIA report “Building a Terminology Bridge: Guidelines for Digital Information Retention and Preservation in the Datacenter.”

1. Definition: As background, keep in mind that the world, including SNIA, defines an ‘archive’ as a specialized repository with preservation services, generally used to preserve, protect, verify the authenticity and integrity of, and secure information and data for the long term. Without preservation services it is just a bit-bucket: not an archive, just a tier. (Always use that test.)

2. History: Dating back to the late 1980s on the mainframe, IT and vendors got into the bad habit of defining ‘archiving’ as the process of moving (migrating) information to a lower tier of storage or to shelf storage. I still remember that the first time I heard ‘archiving’, it referred to moving data off primary storage onto tape and putting it offline. And then there was HSM – with archive as the lowest tier. In the early 1990s STK introduced “deep archive” as the bottom tier of HSM. And in early 2000, some analysts (who will go unnamed) jumped on the idea of “Active Archive,” implying the bottom tier could be on disk and accessible instead of buried on tape. The vendors over the years have found it in their best interests to promote tiering and migration. The backup vendors now seem to find it useful to talk about moving backup data to tape for an archive. That is so wrong, and such a bad practice, because it confuses IT into thinking they have a long-term preservation capability when they absolutely do not. None of these use cases defines a real archive; they are really nothing more than migration and tiering with different policies or requirements.

3. Here is an important point.  State is independent of retention period. Migration or copying to a preservation store (an archive) has nothing to do with state. This confusion is exactly what we are trying to change. State and retention period are independent variables in ILM-based practices. If a governance policy says make a copy or store all emails classified ‘business critical’ in a compliant store, that has nothing to do with state. If a performance or cost policy says store inactive data or information on secondary storage instead of primary, that is a business rule that uses migration and tiering practices. Stop confusing tiering with archive.
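The independence of the two decisions can be sketched in a few lines. This is only an illustration under assumed names: the `Item` type, the classification labels, and both rule functions are hypothetical, while the four states are the ones the report uses (active, inactive, reference, expired).

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    REFERENCE = "reference"
    EXPIRED = "expired"


@dataclass
class Item:
    name: str
    state: State
    classification: str  # e.g. "business critical"; labels are hypothetical


def needs_preservation_copy(item: Item) -> bool:
    # Governance rule: looks only at classification, never at state.
    return item.classification == "business critical"


def storage_tier(item: Item) -> str:
    # Cost/performance rule: looks only at state, never at classification.
    return "primary" if item.state is State.ACTIVE else "secondary"
```

An active email and an inactive email that are both classified ‘business critical’ get the same answer from the governance rule and different answers from the tiering rule, because neither rule reads the other's input.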

4. Preservation services are essential: Moving data to a lower tier without adding preservation services is not an archive. It is just a bit-bucket. eMail archive is a great example because all eMail archives begin with an ingestion process that sets in place controls for long-term preservation. The eDiscovery community is now beginning to use eMail archive repositories for their litigation review stores because they need those services to control things like authenticity. A litigation may last 10 years and go through many custodians. So, preservation services are essential.

I recently prepared some FAQs on the Terminology Bridge report that are applicable to this conversation and I’ll post them separately.

It really is this simple:

  • Use “migration and tiering” instead of “archiving”
  • Do not use the term archive unless you refer to a specialized repository with preservation services

The top mantra today in the sales process is “reduce cost, improve efficiency.” It seems that if you want to sell anything it has to meet both criteria. Note the many advertisements we see on the web now that basically claim “storage is free and may actually save you money… ” It is a hard time in vendor-land, but at the same time a healthy time to purge the industry of wrong thinking.

To that end, I keep getting asked where the cost savings opportunities lie and would like to pose a hierarchy as a way to look at the business opportunities. Naturally, several disclaimers and important notes first.

  • “Your mileage will vary…”
  • Each organization has to approach the issue of cost reduction holistically
  • Fixing storage efficiency has a secondary effect of simply resetting the baseline and gaining temporary relief.
  • You cannot solve the cost problem with point solutions (temporary relief again, and probably larger angst when you realize the mistake and waste) – at the root is an organizational set of practice problems
  • The vendors are not going to tell you all these things because they don’t want you to know them!
  • Metrics that are not credited are based on my primary research. The others are from industry sources we are all sharing and propagating so if they are wrong, we are all making the same mistakes at least.

Peterson’s Cost Savings Hierarchy

  1. Deletion: Delete expired data and information as soon as you can. Expired information and data represent ~20-25% of the entire set of storage capacity under management, not counting its level of redundancy. You can get a ‘capex-free’ cost reduction rapidly by deleting expired information. You can keep capacity growth down by continuing to delete information and data as they expire.
    Note 1: The term “expired information and data” refers to one of the 4 information states defined in the report I produced and just published for SNIA, titled “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter.”
    Note 2: You must work with your legal department to define appropriate deletion practices and adhere carefully to those practices. A sudden change in a practice is more likely to make you liable for spoliation than deleting material that shouldn’t have been deleted. Deletion practices are another art I have to write about, but enough for now.
    Note 3: Do not let litigation holds stop you from continuing with deletion practices. That sounds like bad advice, but I submit there is a simple and effective workaround that legal will agree to in an effort to keep costs under control.
  2. Virtualize your storage infrastructure: including secondary storage. This is another form of consolidation, and we had great success in driving cost out with server and storage consolidation in the late 1990s. Why consolidation? Think ‘thin provisioning’. Stop allocating storage to applications and living with 35% to 40% utilization (industry sources). Even better, storage virtualization controllers provide automated migration capabilities so you can now automate tiering. Here are the value proposition metrics:
    – Fix capacity utilization with thin provisioning – 30% to 50% capacity utilization efficiency improvement that provides a short term gain  (<1 yr ROI)
    – The claim is that automated tiering will reduce storage costs 90%.  (Source: IDC 2009)
    – I add that reduced storage cost translates to huge reductions in opex. (See my post on the Cost of Managing Storage.)
  3. Change Backup: With tiering, we can segregate active, inactive, reference, and expired information and data. Stop backing up everything but active data. (Why? Because you already backed it up or have it in your preservation store. Get intelligent about your data protection schemes.) Active information and data occupy only 20-25% of your capacity. You just saved 75% of your backup costs, and backup operations represent 35% to 55% of the IT budget. Go figure! Reducing backup operations costs is a huge win. And there are even more ways to save $$ in your data protection methods. Among them is capacity optimization, but before we go there, chew on these metrics about backup efficiency:
    – Oh, by the way. The Capex to do this is $0.
    – The first time backup-to-tape success rate is 60-70%,
    – 30% of restores from backup fail – why do we accept this?
    – 90% of tape or disk capacity used by traditional tape-oriented backup utilities is redundant so the disk utilization efficiency factor in D2D is less than 5% with RAID and required slack space.  This means the actual cost of backup is  way out of line with its value.
    – 48% of test recoveries based on backup from DR fail (Source: Symantec Disaster Recovery Report 2007)
    Note 1: Do not, let me say that again, do NOT use backup for your archive. Wrong wrong wrong… Those vendors promoting this are only serving themselves and a lazy IT practice carried over from mainframe days.
  4. Add Capacity Optimization: to appropriate points in your practices. The first is backup. The trend is to use data deduplication plus compression to reduce backup redundancy by a factor greater than 90%. It turns out that with this approach, disk-based backup repositories can now operate on par with tape from a cost perspective if you include all those tape operations, upgrades, and offsite media handling expenses in the computation. Clearly, tape still has a role and benefits, especially in energy consumption. (Unless we are talking MAID technology.)
    Note 1: Do tier your data protection repository to reduce cost further. Even better, federate it with disk and tape and don’t forget to delete expired information and data.
  5. Reduce litigation/compliance/eDiscovery/security risk and cost: Risk of fines and litigation expense and the overhead costs of eDiscovery add millions of dollars of overhead cost per week to the typical large enterprise. (They average over 550 ongoing suits at any one point in time with a new one every week.) Reduce this cost and you not only save overhead, but you also profoundly improve the IT infrastructure. Here are some rules to apply:
    – Place copies of business critical and compliant information (corporate email, business docs, legal, accounting, etc.) into your preservation store upon creation – not later, not when they are inactive, and not by migrating them through a hierarchy as in HSM thinking, because that only adds cost and increases risk. That is old thinking in today’s litigious and compliant environment. Do add capacity optimization methods to this repository along with the long list of important preservation services that are needed to preserve data and information long-term.
    – Never back up the preservation store. There, I said it. That’s blasphemy in some camps. But guess what: data protection protocols based on the business requirements and operating policies allow you to define the level of redundancy required to overcome risk. You may decide that high integrity storage (RAID) with integrated remote replication and some protocol for versioning will allow you to never have to back up again. Kill backup if at all possible to reduce cost. (Now if you caught the drift of placing copies in a preservation store on creation, then why are we also spending so much backing up active data? Step back and rethink your information architecture…)
    – Federate disk, tape (and optical if you desire it) in the preservation store and virtualize and tier them so that migration is automated based on business requirements, SLAs, and policies
    – Index content as it is ingested into the preservation store – that will short circuit discovery costs and if your policies are set right, you won’t have to go hunting very far to assure you have the content you are looking for.  Carefully control this metadata as at some point in the near future you will have to produce metadata to verify authenticity in a legal case.
    – Add encryption where appropriate to reduce risk
  6. Create and run an “Electronically Stored Information Risk Assessment” to look holistically at what is next on the list for your organization. Find out where your risks are and reduce them. Use this same approach to flush out the cost centers and reduce them as well. Remember, IT does not have the entire picture of the organization’s needs so doing these exercises at an information governance committee level is appropriate.
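The arithmetic behind step 3 of the hierarchy can be sketched as follows, using the ranges quoted there (active data at 20-25% of capacity, backup operations at 35% to 55% of the IT budget). The function name is hypothetical, and the sketch assumes backup cost scales linearly with the capacity backed up, which is a simplification and not something the figures above establish.

```python
def backup_savings(active_fraction: float, backup_budget_share: float) -> float:
    """Share of the IT budget saved if only active data is backed up.

    Assumes backup cost is proportional to the capacity backed up
    (an illustrative simplification).
    """
    if not (0.0 <= active_fraction <= 1.0 and 0.0 <= backup_budget_share <= 1.0):
        raise ValueError("fractions must be between 0 and 1")
    # Everything that is not active data drops out of the backup set.
    return (1.0 - active_fraction) * backup_budget_share


# Conservative and optimistic ends of the ranges quoted in step 3.
low = backup_savings(active_fraction=0.25, backup_budget_share=0.35)
high = backup_savings(active_fraction=0.20, backup_budget_share=0.55)
print(f"Estimated savings: {low:.0%} to {high:.0%} of the IT budget")
```

Under that assumption the range works out to roughly a quarter to just under half of the IT budget; your mileage will vary with how closely your backup costs actually track capacity.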
Enough for now – you get the picture, I hope. So, what else would you add, or where do you think I’m all wet? Let’s talk.