Archive for January, 2011

A recent ABA newsletter has an important article that discusses recent case law surrounding rules for ESI preservation and recommends changes that would effectively extend the retention periods for all ESI.

From: “A New Set of Rules for e-Discovery Duties and Sanctions” by Nick Brestoff, published in EDDE Journal, Winter 2011, Volume 2, Issue 1, a publication of the E-Discovery and Digital Evidence Committee, ABA Section of Science & Technology Law

“All ESI preserved in accordance with the Preservation Duty shall not be destroyed or materially altered until four years after the Proceeding is final. If such ESI is not then subject to any other Preservation Duty, it may be destroyed. However, if such ESI is subject to a Preservation Duty arising from any other Proceeding, that ESI shall not be destroyed or altered until one year after all such Proceedings are final.”
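To make the proposed rule concrete, here is a minimal Python sketch of how a records system might compute the earliest date on which preserved ESI could be destroyed. It is only one reading of the quoted language: the function names are hypothetical, a Proceeding that is not yet final is represented as None, and I assume that when other Proceedings are involved the later of the four-year and one-year windows governs.

```python
from datetime import date
from typing import Iterable, Optional


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only reachable for Feb 29 when the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)


def earliest_destruction_date(
    primary_final: date,
    other_finals: Iterable[Optional[date]] = (),
) -> Optional[date]:
    """Earliest date the ESI may be destroyed, or None if that is not yet knowable.

    primary_final: date the Proceeding that triggered preservation became final.
    other_finals:  final dates of any other Proceedings with a Preservation Duty
                   over the same ESI; None means that Proceeding is still open.
    """
    others = list(other_finals)
    if any(final is None for final in others):
        # Still subject to a live Preservation Duty: no destruction date yet.
        return None
    earliest = add_years(primary_final, 4)
    if others:
        # Assumption: the one-year window runs from the last Proceeding to
        # become final, and cannot shorten the original four-year window.
        earliest = max(earliest, add_years(max(others), 1))
    return earliest
```

For a Proceeding final on January 31, 2011 with no other duties, the sketch returns January 31, 2015. Add a second Proceeding that becomes final in June 2016 and the date moves out to June 2017, which is the retention extension the article is recommending.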

Juxtaposed Forces:

The impact is interesting: it is getting harder and harder to delete information and data from the datacenter. On one hand, we have IT’s cost-driven push for capacity optimization in the datacenter; on the other, legal and regulatory governance that extends retention periods and makes it harder to delete expired information and data, driving up costs and energy consumption. In the preservation world we are concerned with both. The paradox is that preservation objects are getting larger and cannot be deduplicated, and because you must keep 3-4 copies of each object distributed for recovery, access, and business continuity, capacity growth is causing real angst.

Thoughts:

  • What screams at me is “classification, classification, classification!”
  • And “deletion” as soon as possible, but without order and organization of information throughout the enterprise, that is hopeless.
  • How do we deal with the load, the cost, the complexity? I see only one path: the practice approach provided by ILM2.0 (for more, go to www.ilm20.org).

Are there others?

Is the Profile of Data in the Datacenter Changing?

According to Jeff Layton’s new article, “Data is Becoming Colder” (Jan 24, 2011), the historical data I have captured a number of times is evolving, most significantly in file access patterns. Let’s see if we can put this in context. Here’s my way of looking at the issue.

Information State has 4 classes: “active, reference, inactive, expired”. Source: “Building a Terminology Bridge: Guidelines for Retention and Preservation Practices in the Datacenter,” 2009

20% of data is active
25% of data is expired
55% of data is reference or inactive
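As a quick way to see what this profile means in capacity terms, here is a small Python sketch that splits a total capacity across the information states using the percentages above. The function name and the 100 TB input are hypothetical, chosen only for illustration.

```python
# Share of data by information state, in percent, from the profile above.
STATE_SHARE_PCT = {
    "active": 20,
    "expired": 25,
    "reference_or_inactive": 55,
}


def capacity_by_state(total_tb: float) -> dict:
    """Split a total capacity (in TB) across information states."""
    if sum(STATE_SHARE_PCT.values()) != 100:
        raise ValueError("state shares must add up to 100%")
    return {state: total_tb * pct / 100 for state, pct in STATE_SHARE_PCT.items()}


# Hypothetical 100 TB datacenter, not a figure from any study.
breakdown = capacity_by_state(100.0)
for state, tb in breakdown.items():
    print(f"{state}: {tb:.1f} TB")
```

On those assumptions, a quarter of the capacity holds expired data that is a candidate for deletion, and only a fifth holds data that is actively used.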

Layton’s article states that the study covers only “CIFS” storage, meaning a subset of user docs and productivity apps. It does not include the majority of storage-consuming business apps. Factor that in. Beyond that, I find data I have a difficult time assimilating. Let me illustrate. (I posted some of the data below for comparison.)

a] Docs are more write oriented – a 100% shift. That is consistent with the other data points showing less read access once docs age. Duh! Of course they are. 76% are never opened by more than one person. All of that follows logically, but what is the big deal? Isn’t it more interesting that there is less collaboration in this community? Isn’t the use case itself the more interesting story, given that it says nothing about the world at large because it is a sample of just NetApp’s internal ops? What about the SharePoint/Google Docs collaboration extranets? Oh, they weren’t included…

b] This one throws me: “Files live an order of magnitude longer (10x). Fewer than 50% are deleted within a day of creation.” …A day of creation? We used to talk about 90 days or longer, not a day. If ~50% are deleted within one day, why is there a discussion? And how is that 10x longer?

In the end, I just don’t get how the conclusion that “Data is getting colder” is derived. What I would expect to see to justify that claim is a comparison across the two studies (3 years apart) showing volume % by state for each study. Show me the shift in the total population statistics, normalized. What is the aging of the file system? Here we need a profile of volume by age, and then a comparison of the two curves, normalized again.
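Here is a rough Python sketch of the kind of normalized comparison I mean. Everything in it is an assumption for illustration: the function names are hypothetical, and the volumes are PLACEHOLDER numbers, not figures from either study.

```python
from typing import Dict


def normalize(volume_by_state: Dict[str, float]) -> Dict[str, float]:
    """Convert absolute volumes per state into shares of the total."""
    total = sum(volume_by_state.values())
    if total <= 0:
        raise ValueError("total volume must be positive")
    return {state: vol / total for state, vol in volume_by_state.items()}


def shift_in_points(earlier: Dict[str, float], later: Dict[str, float]) -> Dict[str, float]:
    """Change in each state's share, in percentage points, between two studies."""
    a, b = normalize(earlier), normalize(later)
    states = sorted(set(a) | set(b))
    return {s: round(100 * (b.get(s, 0.0) - a.get(s, 0.0)), 1) for s in states}


# PLACEHOLDER volumes in TB, invented for this example only.
study_a = {"active": 4.0, "reference_or_inactive": 11.0, "expired": 5.0}
study_b = {"active": 4.0, "reference_or_inactive": 16.0, "expired": 5.0}

for state, points in shift_in_points(study_a, study_b).items():
    print(f"{state}: {points:+.1f} points")
```

Because both populations are normalized first, growth in total capacity between the two studies does not masquerade as a shift in the profile. A claim that data is getting colder would show up as the active share falling and the reference or inactive share rising. The same approach applies to a volume-by-age profile: replace the state labels with age buckets.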

Draw your own conclusions:

===========================  From the paper  ================

About three years ago there was a study from the University of California, Santa Cruz and NetApp that examined CIFS storage within NetApp itself. Part of the storage was deployed in the corporate data center, where it was used by over 1,000 marketing, sales, and finance employees. The second part of the storage was a high-end file server deployed in the engineering data center and used by over 500 engineering employees. Altogether the storage amounted to about 22 TB. During the study approximately 2.25 TB of trace data was obtained (creating data to study data).

In the study they examined the access, usage, and sharing patterns of the storage over a three month period. Using the collected data, they focused their analysis on three items:

  1. Changes in file access patterns and lifetimes since previous studies
  2. Properties of file I/O and file sharing
  3. The relationship between file type and client access patterns

The authors divided their observations into two categories – (1) observations compared to previous studies, and (2) new observations.

Below is the list of results compared to the previous study (taken from the paper with some extra comments added).

  1. Both of our workloads are more write-oriented. Read to write byte ratios have significantly decreased (from 4:1 to 2:1)
  2. Read-write access patterns have increased 30-fold relative to read-only and write-only access patterns.
  3. Most bytes are transferred in longer sequential runs. These runs are an order of magnitude larger (10x).
  4. Most bytes transferred are from larger files. File sizes are up to an order of magnitude larger (10x).
  5. Files live an order of magnitude longer (10x). Fewer than 50% are deleted within a day of creation.
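The read-to-write ratio in item 1 is easier to judge when it is converted into shares of bytes transferred. The following Python sketch does that arithmetic; the function name is hypothetical, and the only inputs are the 4:1 and 2:1 ratios quoted above.

```python
from typing import Tuple


def byte_shares(read_parts: float, write_parts: float) -> Tuple[float, float]:
    """Turn a read:write byte ratio into (read share, write share) of all bytes."""
    total = read_parts + write_parts
    if total <= 0:
        raise ValueError("ratio parts must sum to a positive number")
    return read_parts / total, write_parts / total


old_read, old_write = byte_shares(4, 1)  # earlier studies, 4:1
new_read, new_write = byte_shares(2, 1)  # this study, 2:1

print(f"reads:  {old_read:.1%} -> {new_read:.1%}")
print(f"writes: {old_write:.1%} -> {new_write:.1%}")
```

Reads fall from 80% to about 67% of bytes, and writes rise from 20% to about 33%. Put another way, bytes written per byte read doubles from 0.25 to 0.5, which is one way to arrive at a "100% shift."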

The new observations reported in the study are:

  1. Files are rarely re-opened. Over 66% are re-opened once and 95% fewer than five times.
  2. File re-opens are temporally related. Over 60% of re-opens occur within a minute of the first.
  3. A small fraction of clients account for a large fraction of file activity. Fewer than 1% of clients account for 50% of file requests.
  4. Files are infrequently shared by more than one client. Over 76% of files are never opened by more than one client.
  5. File sharing is rarely concurrent and sharing is usually read-only. Only 5% of files opened by multiple clients are concurrent and 90% of sharing is read-only.
  6. Most file types do not have a common access pattern.

One observation that was mentioned in the paper but not included in either list is that overall file access was random, which indicates the importance of random data performance in the storage medium.

Follow my Work

January 18th, 2011

I’m writing and publishing mostly right now on my two reference model sites,

a] Long-Term Digital Preservation Reference Model :  www.ltdprm.org

b] Information Lifecycle Management 2.0 (ILM2.0) Reference Model: www.ilm20.org

So, instead of lots of blog posts – jump over to either of these sites and participate in the reference model communities we’ve started there and contribute.

New Audit Services

January 18th, 2011

ILM and Cloud Storage Audit Services

We have identified and are offering two important new audit services to our clients. Take a look and let me know what you think and if we can help you.

a] Preservation in the Cloud Assessment and Audit

b] IT and Information Governance in the Cloud Audit & Assessment

LTDP reference model site

ILM2.0 Reference Model

Cloud Adoption Drivers are not what is being Reported:

We talk a lot now about the Cloud and its drivers and barriers. Two new studies shed interesting light on these topics. However, I’m motivated to point out one glaring and important point of interpretation, and it is about cost. Every study says cost, as in “cost reduction,” is the top driver. In my experience, we’ve seen this misleading conclusion many times before. Be careful.

The two reports are:

a] IBM: “Inside the Midmarket: Global Report” (January 2011), a study of cloud adoption in the midmarket

b] IBM: “The evolving role of IT managers and CIOs: Findings from the 2010 IBM Global IT Risk Study”

MY OPINION:

Cost is an “eliminator” in the selection process, not a “selector,” until you get to the end of the purchasing process. (It is usually #7 on the prioritized list of selectors.) Consequently, if you ask a CXO the typical ‘leading’ survey questions, you will get cost back at the top of the list. But be careful: you may interpret this incorrectly and be misled.

Cost has historically proven to be much less of a driver than the vendor community wants to believe, and we have many historical examples of this. I think the same applies to Cloud adoption. It’s an eliminator, and until the “selectors” fall into line, Cloud storage especially will still be too much of a risk to let cost be the primary driver. All I’m saying is that you need to address the real barriers to adoption instead of focusing on cost. Focus instead on proper “customer development” practices.