Is the Profile of Data in the Datacenter Changing?
According to Jeff Layton’s new article, “Data is Becoming Colder” (Jan 24, 2011), the historical data profile I have cited a number of times is evolving – most significantly in file access patterns. Let’s see if we can put this in context. Here’s my way of looking at the issue.
Information State has 4 classes: “active, reference, inactive, expired” – Source: “Building a Terminology Bridge: Guidelines for Retention and Preservation Practices in the Datacenter,” 2009
20% of data is active
25% of data is expired
55% of data is reference or inactive
Layton’s article notes that the study covers only “CIFS” storage, meaning it is a subset of user docs and productivity apps. It does not include the majority of storage-consuming business apps. Factor that in. Even then, I find data that I have a difficult time assimilating. Let me illustrate. (I posted some of the data below for comparison.)
a] Docs are more write oriented – a 100% shift. This is consistent with the other data points showing that there is less read access once docs age. Duh! Of course they are. 76% are never opened by more than one person. All of that follows logically, but what is the big deal? Isn’t it more interesting that there is less collaboration in this community? And doesn’t the use case say nothing about the world at large, since it is a sample of just NetApp’s internal ops? What about the SharePoint/Google Docs collaboration extranets? Oh, they weren’t included…
b] This one throws me: “Files live an order of magnitude longer (10x). Fewer than 50% are deleted within a day of creation.” …A day of creation? We used to talk about 90 days or longer. Not a day. If ~50% are deleted within one day, why is there a discussion? And how is that 10x longer?
In the end, I just don’t get how the conclusion that “Data is getting colder” is derived. What I would expect to see to justify that claim would be data like a comparison across the two studies (3 years apart) that showed volume % by state by study. Show me the shift in the total population statistics, normalized. What is the aging of the file system – here we need a profile of volume by age. Then a comparison of the two curves, normalized again.
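To make that request concrete, here is a minimal sketch of the comparison I have in mind. Everything in it is an assumption for illustration: the function names are mine, and the volumes are placeholder numbers, not measurements from either study. It normalizes volume by information state for two studies, reports the shift in percentage points, and builds a cumulative volume-by-age curve that could be compared across studies.

```python
def normalize(volume_by_state):
    """Convert absolute volumes (any unit) into percent of the total."""
    total = sum(volume_by_state.values())
    return {state: 100.0 * v / total for state, v in volume_by_state.items()}


def shift(earlier, later):
    """Percentage-point change per state between two normalized profiles."""
    return {state: later[state] - earlier[state] for state in earlier}


def cumulative_share(volume_by_age):
    """Cumulative percent of volume at or below each age bucket.

    volume_by_age is a list of (max_age_days, volume) pairs sorted by age.
    """
    total = sum(v for _, v in volume_by_age)
    running = 0.0
    curve = []
    for max_age_days, volume in volume_by_age:
        running += volume
        curve.append((max_age_days, 100.0 * running / total))
    return curve


# Placeholder volumes in TB. These are NOT numbers from either study.
study_a = {"active": 4.0, "reference_or_inactive": 11.0, "expired": 5.0}
study_b = {"active": 6.0, "reference_or_inactive": 24.0, "expired": 10.0}

profile_a = normalize(study_a)
profile_b = normalize(study_b)
delta = shift(profile_a, profile_b)

for state in study_a:
    print(f"{state:22s} {profile_a[state]:5.1f}% -> "
          f"{profile_b[state]:5.1f}% ({delta[state]:+.1f} pts)")

# Placeholder age profile: (max age in days, TB in that bucket).
print(cumulative_share([(30, 1.0), (90, 1.0), (365, 2.0)]))
```

If the “active” share drops and the “reference or inactive” share grows between the two normalized profiles, and the age curve shifts toward older buckets, then I would accept that data is getting colder. Raw totals alone cannot show that, because the total population grew between studies.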
Draw your own conclusions:
=========================== From the paper ================
About three years ago there was a study from the University of California, Santa Cruz and NetApp that examined CIFS storage within NetApp itself. Part of the storage was deployed in the corporate data center where the storage was used by over 1,000 marketing, sales, and finance employees. The second part of the storage was a high-end file server deployed in the engineering data center and used by over 500 engineering employees. All together the storage amounted to about 22 TB. During the study approximately 2.25 TB worth of trace data was obtained (creating data to study data).
In the study they examined the access, usage, and sharing patterns of the storage over a three month period. Using the collected data, they focused their analysis on three items:
- Changes in file access patterns and lifetimes since previous studies
- Properties of file I/O and file sharing
- The relationship between file type and client access patterns
The authors divided their observations into two categories – (1) observations compared to previous studies, and (2) new observations.
Below is the list of results compared to the previous study (taken from the paper with some extra comments added).
- Both of our workloads are more write-oriented. Read to write byte ratios have significantly decreased (from 4:1 to 2:1)
- Read-write access patterns have increased 30-fold relative to read-only and write-only access patterns.
- Most bytes are transferred in longer sequential runs. These runs are an order of magnitude larger (10x).
- Most bytes transferred are from larger files. File sizes are up to an order of magnitude larger (10x).
- Files live an order of magnitude longer (10x). Fewer than 50% are deleted within a day of creation.
The new observations reported in the study are:
- Files are rarely re-opened. Over 66% are re-opened once and 95% fewer than five times.
- File re-opens are temporally related. Over 60% of re-opens occur within a minute of the first.
- A small fraction of clients account for a large fraction of file activity. Fewer than 1% of clients account for 50% of file requests.
- Files are infrequently shared by more than one client. Over 76% of files are never opened by more than one client.
- File sharing is rarely concurrent and sharing is usually read-only. Only 5% of files opened by multiple clients are concurrent and 90% of sharing is read-only.
- Most file types do not have a common access pattern.
One observation mentioned in the paper but not included in either list was that overall file access was random, indicating the importance of random data performance of the storage medium.