Archive for the ‘Data Management Issues’ Category

Is the Profile of Data in the Datacenter Changing?

According to Jeff Layton’s new article, “Data is Becoming Colder” (Jan 24, 2011), the historical data I have captured a number of times is evolving, most significantly in file access patterns. Let’s see if we can put this in context. Here’s my way of looking at the issue.

Information State has 4 classes: “active, reference, inactive, expired” – Source: “Building a Terminology Bridge: Guidelines for Retention and Preservation Practices in the Datacenter,” 2009

20% of data is active
25% of data is expired
55% of data is reference or inactive

Layton’s article states that the study covers only “CIFS” storage, meaning it is a subset of user docs and productivity apps. It does not include the majority of storage-consuming business apps. Factor that in. Even then, I find data I have a difficult time assimilating. Let me illustrate. (I posted some of the data below for comparison.)

a] Docs are more write oriented – a 100% shift. This is true based on the other data points showing there is less read access once docs age. Duh! Of course they are. 76% are never opened by more than one person. All that follows logically, but what is the big deal? Isn’t it more interesting that there is less collaboration in this community? Isn’t the use case the more interesting point, given that it says nothing about the world at large? It is a sample of just NetApp’s internal ops. What about the SharePoint/Google Docs collaboration extranets? Oh, they weren’t included…

b] This one throws me: “Files live an order of magnitude longer (10x). Fewer than 50% are deleted within a day of creation.” …A day of creation? We used to talk about 90 days or longer, not a day. If ~50% are deleted within one day, why is there a discussion? And how is that 10x longer?

In the end, I just don’t get how the conclusion that “Data is getting colder” is derived. To justify that claim, I would expect to see a comparison across the two studies (3 years apart) showing volume % by state, by study. Show me the shift in the total population statistics, normalized. What is the aging of the file system? Here we need a profile of volume by age, then a comparison of the two curves, normalized again.
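To make that concrete, here is a minimal sketch of the comparison I would want to see. Everything in it is an assumption for illustration: the age buckets, the function names `normalize` and `shift`, and especially the volumes, which are placeholders invented to exercise the code and are not figures from either study.

```python
from typing import Dict, List

# Hypothetical age buckets (days since last access); not taken from either study.
AGE_BUCKETS: List[str] = ["0-30", "31-90", "91-365", "365+"]


def normalize(volume_by_bucket: Dict[str, float]) -> Dict[str, float]:
    """Convert raw volume per age bucket into a fraction of the total volume."""
    total = sum(volume_by_bucket.get(b, 0.0) for b in AGE_BUCKETS)
    if total <= 0:
        raise ValueError("total volume must be positive")
    return {b: volume_by_bucket.get(b, 0.0) / total for b in AGE_BUCKETS}


def shift(earlier: Dict[str, float], later: Dict[str, float]) -> Dict[str, float]:
    """Change in share per bucket (later minus earlier), in percentage points."""
    a, b = normalize(earlier), normalize(later)
    return {bucket: round((b[bucket] - a[bucket]) * 100, 1) for bucket in AGE_BUCKETS}


# Placeholder volumes in TB, invented purely for illustration.
study_a = {"0-30": 4.0, "31-90": 3.0, "91-365": 2.0, "365+": 1.0}
study_b = {"0-30": 4.0, "31-90": 4.0, "91-365": 6.0, "365+": 6.0}

for bucket, delta in shift(study_a, study_b).items():
    print(f"{bucket:>7} days: {delta:+.1f} points")
```

If data really is getting colder, the shares should move toward the older buckets once both studies are normalized. Raw terabytes alone cannot show that when the totals differ between studies.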

Draw your own conclusions:

===========================  From the paper  ================

About three years ago there was a study from the University of California, Santa Cruz and NetApp that examined CIFS storage within NetApp itself. Part of the storage was deployed in the corporate data center, where it was used by over 1,000 marketing, sales, and finance employees. The second part of the storage was a high-end file server deployed in the engineering data center and used by over 500 engineering employees. Altogether the storage amounted to about 22 TB. During the study approximately 2.25 TB of trace data was obtained (creating data to study data).

In the study they examined the access, usage, and sharing patterns of the storage over a three month period. Using the collected data, they focused their analysis on three items:

  1. Changes in file access patterns and lifetimes since previous studies
  2. Properties of file I/O and file sharing
  3. The relationship between file type and client access patterns

The authors divided their observations into two categories – (1) observations compared to previous studies, and (2) new observations.

Below is the list of results compared to the previous study (taken from the paper with some extra comments added).

  1. Both of our workloads are more write-oriented. Read to write byte ratios have significantly decreased (from 4:1 to 2:1)
  2. Read-write access patterns have increased 30-fold relative to read-only and write-only access patterns.
  3. Most bytes are transferred in longer sequential runs. These runs are an order of magnitude larger (10x).
  4. Most bytes transferred are from larger files. File sizes are up to an order of magnitude larger (10x).
  5. Files live an order of magnitude longer (10x). Fewer than 50% are deleted within a day of creation.

The new observations reported in the study are:

  1. Files are rarely re-opened. Over 66% are re-opened once and 95% fewer than five times.
  2. File re-opens are temporally related. Over 60% of re-opens occur within a minute of the first.
  3. A small fraction of clients account for a large fraction of file activity. Fewer than 1% of clients account for 50% of file requests.
  4. Files are infrequently shared by more than one client. Over 76% of files are never opened by more than one client.
  5. File sharing is rarely concurrent and sharing is usually read-only. Only 5% of files opened by multiple clients are concurrent and 90% of sharing is read-only.
  6. Most file types do not have a common access pattern.

One observation mentioned in the paper but not included in the two lists was that overall file access was random, indicating the importance of the random-access performance of the storage medium.

Follow my Work

January 18th, 2011

Right now I’m writing and publishing mostly on my two reference model sites:

a] Long-Term Digital Preservation Reference Model: www.ltdprm.org

b] Information Lifecycle Management 2.0 (ILM2.0) Reference Model: www.ilm20.org

So, instead of lots of blog posts – jump over to either of these sites and participate in the reference model communities we’ve started there and contribute.

Terminology is the starting point for Information Governance

I strongly urge you to read and distribute this new report from the SNIA, “Building a Terminology Bridge: Guidelines for Digital Information Retention and Preservation Practices in the Datacenter.” It took 2 years to research, develop, vet, and socialize the report, and to educate and build consensus within SNIA alone. The effort tried my patience and fortitude at times. But I’m here for the long run, and this report is a masterful contribution to the industry.

This report is essential for a long list of practices such as these:

  • Digital Preservation (Archive)
  • Cloud Archive
  • Retention Management
  • Risk Management
  • Security
  • Information management
  • ILM, ILM2.0
  • Information Governance
  • Long-term Retention and Deletion
  • Data management

    We encourage review and feedback.

    Blog-roll
    =====================
    Digital Curation Blog: SNIA “Terminology Bridge” report
    By Chris Rusbridge

    =====================

    SNIA Builds a Bridge—to Somewhere Important
    Rick Bauer, Dir.Technology and Education, SNIA
    =====================

    The Billion Year Ultra-Dense Memory Chip

    I love storage technology: the demand for more, cheaper, and faster will never end. Berkeley Lab brings us one of the most interesting technologies yet.

    http://newscenter.lbl.gov/feature-stories/2009/06/03/billion-year-ultra-dense-memory-chip/

    One of the drivers now is long-term preservation. If we had long-term media, we postulate, it would reduce both the rate and the number of required migrations. In any case, the domains of logical and physical migration are where we need to put a lot of effort and R&D; otherwise the costs of preserving information for the long term will overwhelm everything else. This is where NARA is putting its money: developing a long-term storage architecture. It will be fun to watch all this unfold over the next 10 years.

    I’ve been accused of throwing historical IT practices under the bus in my last posts. Well, in my opinion, we should.

    IT practices that confuse, that just don’t meet the business requirements, or that only add cost and complexity need to go away. The times are changing. We saw that clearly with regulatory compliance and eMail. We see it with eDiscovery and litigation review. Many IT practices damage metadata, which in turn damages authenticity. The courts keep getting closer and closer to exposing bad IT practices, and I submit we need to start making improvements somewhere.

    Metadata is a good example. Many IT practices damage, mix, confuse, or just plain ignore the value of metadata. (And, consequently, undermine its use to demonstrate authenticity.) This has to change.
    a) Yes, it wasn’t until 2008 that Sedona recognized metadata in litigation evidence, but now it is important.
    b) Aguilar v. Immigration & Customs Enforcement Div., 2008 U.S. Dist. LEXIS 97018 (Nov. 21, 2008) changed it all again, making certain metadata a key part of litigation evidence.

    Another example is confusing archive and preservation; regulatory compliance hammered that. I believe the IT premise we have to move toward could be framed as “Preservation begins at creation.” The IT practice of archiving at the time information becomes inactive or expired is too late, too costly, too complex, and too risky in the face of litigation and compliance risk.

    Oh, let’s add ‘deletion’ to the list: even the records community is at fault here. The whole idea of ‘disposition after information expires’ is ludicrous for the digital datacenter. I maintain disposition policies must be made up front, consistent with ‘preservation policies begin at creation.’

    This could be a stimulating conversation. Chip in.

    Oh, and I’m far from alone in this opinion. Change is hard. The top barrier is human and cultural on one side and, on the other, resistance from the vendor community, which protects its installed base of revenue by propagating the myth. I can’t blame them. I can only blame the IT community. I really like this quote from the “Backup Blog”: “…Having said that, the biggest obstacle to fixing backup is not technology. It is inertia. It is cultural. It is fear of change. It is ingrained process. It is the fact that we have done things one way for so long that the reason we are doing things has been forgotten…”