Archive for June, 2009

From FAQs for the SNIA report: “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter”

Preservation: managing information in today’s datacenter with requirements to safeguard information assets for eDiscovery, litigation evidence, security, and regulatory compliance requires that many classes of information be preserved from time of creation. Preservation is a set of services that protect information, provide availability, integrity, and authenticity controls, include security and confidentiality safeguards, and include an audit log, control of metadata, and other practices for each preservation object. The old IT practice of placing information into an archive when it becomes inactive or expired is tiering, not archive, and no longer works for compliance or litigation support because it only adds cost and risk. Thus, we see products and practices like eMail Archive, Compliance Storage, Preservation Stores, and Database Archives being used to capture and preserve key classes of information and data upon creation.

From FAQs for the SNIA report: “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter”

Archive:  the report advocates that IT practices adopt a more consistent usage of the term ‘archive’ to facilitate interaction with other departments within the organization. To the archival, preservation, and records communities, an archive is a specialized repository with preservation services and attributes. Typical IT use of the verb “archiving” actually refers to a practice based on ILM called “tiering,” the migration of inactive, reference, or expired information to a lower tier of storage to reduce cost and improve storage efficiencies. A lower tier of storage is not an archive with preservation-class services.  Another IT (and vendor) misuse happens when ‘archive’ is confused with backup. Backup media saved offline or offsite does not constitute an archive (a preservation store with preservation services) nor should backup media be confused with an archive or with tiering.

I have been frustrated lately by the continued misuse of the terms ‘archive’ and ‘archiving’ that I find throughout the data protection and backup industries. I keep trying to teach this principle, so let me offer some additional perspectives. I’ve written extensively about this in the new SNIA report “Building a Terminology Bridge: Guidelines for Digital Information Retention and Preservation in the Datacenter.”

1. Definition: As background, keep in mind that the world, including SNIA, defines an ‘archive’ as a specialized repository with preservation services, generally used to preserve, protect, verify authenticity and integrity, and secure information and data for the long term. Without preservation services it is just a bit-bucket: not an archive, just a tier. (Always use that test.)

2. History: Dating back to the late 1980s on the mainframe, IT and vendors got into the bad habit of defining ‘archiving’ as the process of moving (migrating) information to a lower tier of storage or to shelf storage. I still remember that the first time I heard ‘archiving’, it referred to removing data from primary storage onto tape and putting it offline. And then there was HSM, with archive as the lowest tier. In the early 1990s STK introduced “deep archive” as the bottom tier of HSM. And in early 2000, some analysts (who will go unnamed) jumped on the idea of “Active Archive,” implying the bottom tier could be on disk and accessible instead of buried on tape. The vendors over the years have found it in their best interests to promote tiering and migration. The backup vendors now seem to find it useful to talk about moving backup data to tape for an archive. That is wrong and a bad practice, because it confuses IT into thinking they have a long-term preservation capability when they absolutely do not. None of these use cases defines a real archive; they are nothing more than migration and tiering with different policies or requirements.

3. Here is an important point.  State is independent of retention period. Migration or copying to a preservation store (an archive) has nothing to do with state. This confusion is exactly what we are trying to change. State and retention period are independent variables in ILM-based practices. If a governance policy says make a copy or store all emails classified ‘business critical’ in a compliant store, that has nothing to do with state. If a performance or cost policy says store inactive data or information on secondary storage instead of primary, that is a business rule that uses migration and tiering practices. Stop confusing tiering with archive.

4. Preservation services are essential: Moving data to a lower tier without adding preservation services is not an archive. It is just a bit-bucket. eMail archive is a great example because all eMail archives begin with an ingestion process that sets in place controls for long-term preservation. The eDiscovery community is now beginning to use eMail archive repositories for their litigation review stores because they need those services to control things like authenticity. A litigation may last 10 years and go through many custodians. So, preservation services are essential.
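
Points 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set of required services, the field names, and the two-tier layout are hypothetical and are not taken from the SNIA report. The four states are the ones named in the report. The sketch shows that the tiering decision reads only the state, that the preservation decision reads only the classification, and that a repository passes the "archive" test only if it offers preservation services.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    REFERENCE = "reference"
    EXPIRED = "expired"


@dataclass
class Item:
    name: str
    state: State
    business_critical: bool


# Hypothetical minimum set of services; a real policy would define its own.
PRESERVATION_SERVICES = {
    "integrity", "authenticity", "audit_log", "metadata_control", "security",
}


def is_archive(services: set) -> bool:
    """The test from point 1: without preservation services it is just a tier."""
    return PRESERVATION_SERVICES <= services


def place(item: Item) -> dict:
    """Return two independent decisions for one item."""
    # Tiering: a cost/performance rule driven only by state.
    # (Deletion of expired items is left out of this sketch.)
    tier = "primary" if item.state is State.ACTIVE else "secondary"
    # Preservation: a governance rule driven only by classification.
    return {"tier": tier, "copy_to_preservation_store": item.business_critical}
```

Note that an active, business-critical email stays on primary storage and still gets a copy in the preservation store on creation; neither decision waits for the other.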

I recently prepared some FAQs on the Terminology Bridge report that are applicable to this conversation and I’ll post them separately.

It really is this simple:

  • Use “migration and tiering” instead of “archiving”
  • Do not use the term archive unless you refer to a specialized repository with preservation services

The top mantra today in the sales process is “reduce cost, improve efficiency.” It seems that if you want to sell anything it has to meet both criteria. Note the many advertisements we see on the web now that basically claim “storage is free and may actually save you money… ” It is a hard time in vendor-land, but at the same time a healthy time to purge the industry of wrong thinking.

To that end, I keep getting asked where the cost savings opportunities lie and would like to pose a hierarchy as a way to look at the business opportunities. Naturally, several disclaimers and important notes first.

  • “Your mileage will vary…”
  • Each organization has to approach the issue of cost reduction holistically
  • Fixing storage efficiency has a secondary effect of simply resetting the baseline and gaining temporary relief.
  • You cannot solve the cost problem with point solutions (temporary relief again and probably larger angst when you realize the mistake and waste) – at the root is an organizational set of practice problems
  • The vendors are not going to tell you all these things because they don’t want you to know them!
  • Metrics that are not credited are based on my primary research. The others are from industry sources we are all sharing and propagating so if they are wrong, we are all making the same mistakes at least.

Peterson’s Cost Savings Hierarchy

  1. Deletion:  Delete expired data and information as soon as you can. Expired information and data represents ~20-25% of the entire set of storage capacity under management, not counting its level of redundancy. You can get a ‘capex-free’ cost reduction rapidly by deleting expired information. You can keep capacity growth down by continuing to delete information and data as they expire.
    Note 1: The term “expired information and data” is one of 4 information states as defined in the report I produced and just published for SNIA titled: “Building a Terminology Bridge: Guidelines for Retention and Preservation in the Datacenter”
    Note 2: You must work with your legal department to define appropriate deletion practices and adhere carefully to those practices. A sudden change in a practice is more likely to make you liable for spoliation than deleting material that shouldn’t have been deleted.  Deletion practices are another art I have to write about, but enough for now.
    Note 3: Do not let litigation holds stop you from continuing with deletion practices. That sounds like bad advice, but I submit there is a simple and effective workaround that legal will agree to in an effort to keep costs under control.
  2. Virtualize your storage infrastructure: including secondary storage. This is another form of consolidation and we had great success in driving cost out with server and storage consolidation in the late 1990s. Why consolidation? Think ‘thin provisioning’. Stop allocating storage to applications and living with 35% to 40% utilization (industry sources). Even better, storage virtualization controllers provide automated migration capabilities so you can now automate tiering. Here are the value proposition metrics:
    – Fix capacity utilization with thin provisioning – 30% to 50% capacity utilization efficiency improvement that provides a short term gain  (<1 yr ROI)
    – The claim is that automated tiering will reduce storage costs 90%.  (Source: IDC 2009)
    – I add that reduced storage cost translates to huge reductions in opex. (See my post on the Cost of Managing Storage.)
  3. Change Backup:  With tiering, we can segregate active, inactive, reference, and expired information and data. Stop backing up everything but active data. (Why? Because you already backed it up or have it in your preservation store. Get intelligent about your data protection schemes.)  Active information and data occupy only 20-25% of your capacity. You just saved 75% of your backup costs, and backup operations represent 35% to 55% of the IT budget. Go figure!  Reducing backup operations costs is a huge win. And there are even more ways to save $$ in your data protection methods. Among them is capacity optimization, but before we go there chew on these metrics about backup efficiency:
    – Oh, by the way. The Capex to do this is $0.
    – The first time backup-to-tape success rate is 60-70%
    – 30% of restores from backup fail – why do we accept this?
    – 90% of tape or disk capacity used by traditional tape-oriented backup utilities is redundant so the disk utilization efficiency factor in D2D is less than 5% with RAID and required slack space.  This means the actual cost of backup is  way out of line with its value.
    – 48% of test recoveries based on backup from DR fail (Source: Symantec Disaster Recovery Report 2007)
    Note 1: Do not, let me say that again, do NOT use backup for your archive. Wrong wrong wrong… Those vendors promoting this are only self-serving a lazy IT practice carrying over from mainframe days.
  4. Add Capacity Optimization: to appropriate points in your practices. The first is backup. The trend is to use data deduplication plus compression to reduce backup redundancy by more than 90%.  It turns out that with this approach, disk-based backup repositories can now operate on par with tape from a cost perspective if you include all those tape operations, upgrades, and offsite media handling expenses in the computation. Clearly, tape still has a role and benefits, especially in energy consumption. (Unless we are talking MAID technology.)
    Note 1: Do tier your data protection repository to reduce cost further. Even better, federate it with disk and tape and don’t forget to delete expired information and data.
  5. Reduce litigation/compliance/eDiscovery/security risk and cost:  Risk of fines and litigation expense and the overhead costs of eDiscovery add millions of dollars of overhead cost per week to the typical large enterprise. (They average over 550 ongoing suits at any one point in time with a new one every week.) Cost reduce this and you not only save overhead cost, but you profoundly improve the IT infrastructure. Here are some rules to apply:
    – Place copies of business critical and compliant information (corporate email, business docs, legal, accounting, etc.) into your preservation store upon creation – not later, not when they are inactive, do not migrate through a hierarchy such as HSM thinking because it only adds cost and increases risk. That is old-thinking in today’s litigious and compliant environment. Do add capacity optimization methods to this repository along with the long list of important preservation services that are needed to preserve data and information long-term.
    – Never back up the preservation store. There, I said it. That’s blasphemy in some camps.  But, guess what, data protection protocols based on the business requirements and operating policies allow you to define the level of redundancy required to overcome risk. You may decide that high integrity storage (RAID) with integrated remote replication and some protocol for versioning will mean you never have to back up again. Kill backup if at all possible to reduce cost.  (Now if you caught the drift of placing copies in a preservation store on creation, then why are we also spending so much backing up active data? Step back and rethink your information architecture…)
    –  Federate disk, tape (and optical if you desire it) in the preservation store and virtualize and tier them so that migration is automated based on business requirements, SLAs, and policies
    – Index content as it is ingested into the preservation store – that will short circuit discovery costs and if your policies are set right, you won’t have to go hunting very far to assure you have the content you are looking for.  Carefully control this metadata as at some point in the near future you will have to produce metadata to verify authenticity in a legal case.
    – Add encryption where appropriate to reduce risk
  6. Create and run an “Electronically Stored Information Risk Assessment” to look holistically at what is next on the list for your organization. Find out where your risks are and reduce them. Use this same approach to flush out the cost centers and reduce them as well. Remember, IT does not have the entire picture of the organization’s needs so doing these exercises at an information governance committee level is appropriate.
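
The arithmetic behind steps 1, 3, and 4 can be sketched as follows. The fractions are the ones quoted in the hierarchy above; the function names are illustrative only, and the backup figure assumes that backup cost scales with the volume backed up, which your own environment may not bear out.

```python
def capacity_after_deletion(total_tb: float, expired_fraction: float) -> float:
    """Step 1: capacity left once expired information and data are deleted."""
    return total_tb * (1 - expired_fraction)


def backup_volume_reduction(active_fraction: float) -> float:
    """Step 3: share of backup volume avoided by backing up only active data."""
    return 1 - active_fraction


def deduplicated_size(backup_tb: float, redundancy_fraction: float) -> float:
    """Step 4: backup footprint after removing the redundant share."""
    return backup_tb * (1 - redundancy_fraction)


if __name__ == "__main__":
    total_tb = 1000  # hypothetical capacity under management
    for expired in (0.20, 0.25):
        print(f"expired {expired:.0%}: "
              f"{capacity_after_deletion(total_tb, expired):.0f} TB remain")
    for active in (0.20, 0.25):
        print(f"active {active:.0%}: "
              f"backup volume down {backup_volume_reduction(active):.0%}")
    print(f"100 TB of backups at 90% redundancy: "
          f"{deduplicated_size(100, 0.90):.0f} TB stored")
```

With active data at 20-25% of capacity, the avoided backup volume is 75-80%, which is where the 75% figure in step 3 comes from.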
Enough for now – you get the picture, I hope.  So, what else would you add, or where do you think I’m all wet? Let’s talk.

Storage isn’t free. Never will be. Management costs, opex, overwhelm capex expenditures.  I continue to scale the cost of managing storage, CMS, in my research. Depending on organization size and complexity, the scary thing is that I find it is growing again, ranging from $10k to $35k/TB/yr.

Now a new metric. Let me suggest that while I first published this metric in 1992, and continued publishing primary research on it through 1997, we have to look at it differently today. Here’s the math. The acquisition cost of most classes of disk arrays is between $1k and $4k per TB.  That means the annual ratio of CMS to storage cost is still 7x to 10x (same as in 1992, another scary thought). But guess what: that is the wrong way to look at it.

The top problem in storage in the datacenter, as I have published from 1994 through today, is storage expansion. No, it is not hard to add disk drives. The problem is that expansion causes all storage practices, management, and services to have to expand as well to accommodate the new storage. It has a ripple effect. Now add cost of managing storage. Adding 1TB adds ~$25k of incremental cost to large organizations per year.  It is the per year thing that gets you now. Storage doesn’t just go away. It has a life. A better way to look at the CMS is over at least a three year life. Even retirement doesn’t mean capacity reduction; rather, it means replacement, so the cost is ongoing…  But we have to pick a threshold, otherwise this gets ridiculous.  At three years, the real CMS is $30k-$100k/TB  and the factor of Opex to Capex is really 20-30x.  Wow!
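
The three-year math can be checked with a short sketch. The dollar ranges are the ones given above; the function names and the choice to pair the low CMS figure with the low acquisition cost (and high with high) are my assumptions for illustration.

```python
def multi_year_cms(annual_cms_per_tb: float, years: int = 3) -> float:
    """Cost of managing one TB over its assumed service life."""
    return annual_cms_per_tb * years


def opex_to_capex(annual_cms_per_tb: float,
                  acquisition_per_tb: float,
                  years: int = 1) -> float:
    """Ratio of management cost over `years` to the one-time purchase cost."""
    return multi_year_cms(annual_cms_per_tb, years) / acquisition_per_tb


if __name__ == "__main__":
    # (annual CMS $/TB, acquisition $/TB): low end and high end of the ranges
    for cms, capex in ((10_000, 1_000), (35_000, 4_000)):
        print(f"CMS ${cms}/TB/yr, array ${capex}/TB: "
              f"3-yr CMS ${multi_year_cms(cms):,.0f}, "
              f"1-yr ratio {opex_to_capex(cms, capex):.2f}x, "
              f"3-yr ratio {opex_to_capex(cms, capex, 3):.2f}x")
```

The low-end pair gives 10x per year and 30x over three years; the high-end pair gives 8.75x and 26.25x, which lands inside the rounded ranges quoted above.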

The point is that if we recognize the real cost of adding storage to the datacenter, we will be more judicious in its use. If you just stopped and recognized that every TB you add of primary storage will add ~$50k of cost, what would you do?  Buy less? Not necessarily. What you definitely would do is cost-reduce your practices by doing things like deletion, deduplication, and tiering. You can be more efficient in your use of storage, but that is a one time deal. A change in efficiency does not change the shape of the consumption curve. It just resets the baseline.  You still need to cost reduce your practices.
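
A toy model makes the "resets the baseline" point concrete. It assumes compound annual growth in capacity demand and a one-time efficiency gain applied at year zero; the model, the growth rate, and the gain used in the example are hypothetical and are not drawn from my research.

```python
import math


def capacity(base_tb: float, growth: float, years: float,
             efficiency_gain: float = 0.0) -> float:
    """Capacity needed after `years` of compound growth.

    A one-time efficiency gain scales the whole curve down by a constant
    factor; it does not touch the growth rate.
    """
    return base_tb * (1 - efficiency_gain) * (1 + growth) ** years


def years_of_relief(growth: float, efficiency_gain: float) -> float:
    """Time until the improved curve climbs back to the original baseline."""
    return math.log(1 / (1 - efficiency_gain)) / math.log(1 + growth)


if __name__ == "__main__":
    base, growth, gain = 100.0, 0.50, 0.30  # hypothetical inputs
    for year in range(4):
        print(f"year {year}: {capacity(base, growth, year):7.1f} TB without, "
              f"{capacity(base, growth, year, gain):7.1f} TB with the gain")
    print(f"relief lasts about {years_of_relief(growth, gain):.2f} years")
```

Both curves grow by the same factor every year, so the efficiency gain only buys a fixed amount of time. Changing the curve itself takes changes in practice, such as ongoing deletion.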

To summarize, I think that this is the best thing to happen to the datacenter in a long time. Due to budget constraints we are having to pay attention to practices and fix an IT system that is broken and does not scale due to ever growing cost.