Archive for May, 2008

We’ve invited members of the IEEE’s Mass Storage Systems and Technologies workgroup on digital preservation to join with SNIA members in reviewing the requirements for long-term retention and preservation. If you would like to participate in this discussion, please go to the DMF Community’s site and register to access it.

http://community.snia-dmf.org

This is an important conversation as we need to update these requirements and then extend them further as we consider the implications of bringing technologies and architectures to market to solve the two ‘holy grail’ problems of preservation – logical and physical migration. Please participate.

I am crafting a proposal for the IEEE’s DAPS’08 – Workshop on Digital Archive Preservation and Sustainability at MSST2008 – IEEE Conference on Mass Storage Systems and Technologies, September 22, 2008. http://storageconference.org/daps

In confirming our interest, I responded to the question of what content we, as the SNIA-DMF, would want to see in the workshop. I quickly generated this list and thought I’d share it here since it is indicative of how broad this effort is world-wide.

CONTENT IDEAS:

a. Updates on the progress of the XAM specification and the Self-Describing, Self-Contained Data Format specification in development in SNIA.
b. The work on preservation data stores at CASPAR
c. Use of AIPs from the OAIS model – Reagan Moore’s implementation was a great case study. We need to keep testing how well the AIP model works and keep gathering implementation experience. Where is the British Library on this?
d. Progress on metadata standards that will impact long-term preservation and information management
e. Case studies
f. Parallel experiences – such as the SMPTE’s work on the MXF standard for preservation and interchange
g. NARA’s ERA progress and directions

Wow, I make this list and recognize how much is going on world-wide and it brings to mind how important it is we all coordinate better. Makes me want to create an event….

What other important programs are you concerned with that should be included?

Three years ago we started the work on long-term digital information preservation in the Data Management Forum’s Long-Term Archive and Compliant Storage initiative, LTACSI. One of the first activities we held was a panel discussion at the SNIA’s June 2005 Symposium in Boston. Among the panelists were an archivist, MacKenzie Smith, Associate Director for Technology, MIT Libraries, and a datacenter practitioner, Jim Riggs, PERMS Program Manager, US Army, who has a huge long-term retention challenge. Now, the room was full of about 70 storage ‘geeks’ – the types that frequent symposia such as this. But it was also attended by a few RIM/IT types and CTOs from the handful of emerging archive systems companies like Permabit and Archivas, some email archiving companies, and a contingent of the CAS group from EMC. MacKenzie surprised us all when she told us in clear terms how difficult her work was with today’s storage systems and that the way we looked at ‘archive’ was wrong.

Point 1:

  • Based on feedback we got there, from our engagements with RIM and IT practitioners from ARMA and other groups including the SNIA End-User Council, and then from the important “Long-Term Digital Information Retention Requirements Study” I conducted for SNIA and published in January of 2007, we were continually admonished to stop using the “archive” word as it was too confused.
  • Here is a poignant quote from the survey: “Records retention is different than depositing something in an archive. Archiving is a very problematic word and I would suggest not using it. It suggests dumping records into some bottomless pit where they can be forgotten. (Instead) Ingest (them) into a record keeping environment where they can be permanently preserved for long-term records retention seems better.”

Point 2:

  • Engagements with ARMA’s RIM community and work on regulatory compliance brought out the importance of retention-periods, the setting of retention requirements, and proper disposition (meaning permanent deletion) of expired information to reduce the volume of information being stored long-term.

  • Paradoxically, our requirements survey as well as many informal audience surveys at conferences tell us that approximately 80% of the IT community still doesn’t know the requirements for the information it manages. A gauge of this disconnect can be seen in the many retention-requirements documents produced by RIMs that contain 2000 to 4000 specific record types and retention schedules. IT and IT systems can’t handle that type of granularity. (Thankfully, this thinking is dying out as people start talking and working together – we see classification catching on using just a few buckets.)
  • This gap is very important as it led us to begin the work with ARMA in stating that “Collaboration” is the starting point to “information-centric management” just as setting requirements for that information based on its value to the organization is the starting point for Information Lifecycle Management, ILM, based practices. (See the white paper we co-authored: “Collaboration: the New Standard of Excellence” linked on my publications page.)
  • Think about it now. Retention requirements are the focal issue for legal and RIM. Storing information off in a silo is the focus of IT because they don’t have the authority to delete anything. No wonder we have a disconnect around what archive means. Here are some definitions from their 2007 glossaries that illustrate the difference in thinking:

o ARMA – RIM: (context retention) 1. Used for electronic records, it is the procedure for transferring information from an active file to an inactive file, storage medium, or facility. 2. Act of creating a backup copy of computer files. See also BACKUP

o Society of American Archivists – Archivists: (context computing) – To store data offline.

o SNIA – IT: (context ILM) – (verb) To copy or move data for purposes of retention; to create an archive.

  • OK, I have to say something here about using backup for an ‘archive’. Don’t. Completely wrong thinking.  We’re trying to kill that message everywhere we can.

Point 3:

  • More information is being held long-term by more companies than any of us expected. In the requirements survey, 83% of the 110 companies responding to this question reported that they have to keep some information for more than 50 years.
  • What is long-term? Isn’t it relative? Yes, but we still need a number. Read my discussion on the definition of “long-term” in the requirements study for the details on how this was derived, but for now let me just make the statement. In the LTACSI, we’ve adopted the definition that long-term is the period of time beyond which you start losing data. Today, that number is 10-15 years.

Now you have the background for what I want to say. The point is that we have to shift our thinking to using retention and preservation as the key terms, not archive. Let’s redefine archive similarly to what the digital archivist and library communities did in OAIS as an “electronic archive” defining a type of repository for long-term preservation, not as a verb which the storage community uses to connote “moving data into an electronic archive.” Throw the verb out! It is wrong thinking anyway as the notion of moving information around as it ages just adds cost and complexity. (aha, another discussion thread…)

The beauty of this switch is that it also changes our frame of reference and helps move the organization down the path towards information-centric management. Now, you don’t just say the words and it’s over. There is important work to do:

  • First, IT, RIM, legal, security, and the business groups have to get together and collaborate to identify their information assets, classify them into a manageable number of buckets, and then set the retention requirements. (And, while at it, set the other requirements too, please.) The mantra I teach for this process is “collaborate, identify, classify, requirements, implement, measure, improve”.
  • Second, we need the storage industry to recognize that information services such as ILM, retention, preservation, deletion, etc. require the capability to manage information – not just the data. (See the discussion on the difference between digital information and data to fully appreciate this thought.)
  • Finally, we need a new storage architecture for long-term retention in the datacenter – not just a ‘preservation data store’ or another proprietary silo. And that is the point of this note. With it, “archive” and backup go away and are replaced with retention and preservation.
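To make the “classify into a few buckets, then set retention requirements” step concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the bucket names, the retention periods, and the helper functions are hypothetical and do not come from any SNIA or ARMA specification. The 10-year threshold is simply the low end of the 10-15 year figure quoted above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: low end of the 10-15 year "long-term" figure quoted in the post.
LONG_TERM_THRESHOLD_YEARS = 10


@dataclass(frozen=True)
class RetentionBucket:
    name: str
    retention_years: Optional[int]  # None means "preserve permanently"


# Hypothetical buckets: a handful of classes instead of thousands of record types.
BUCKETS = {
    "transient": RetentionBucket("transient", 1),
    "business": RetentionBucket("business", 7),
    "long_term": RetentionBucket("long_term", 50),
    "permanent": RetentionBucket("permanent", None),
}


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def needs_preservation(bucket: RetentionBucket) -> bool:
    """True if the retention period reaches the long-term threshold."""
    return (bucket.retention_years is None
            or bucket.retention_years >= LONG_TERM_THRESHOLD_YEARS)


def disposition_due(created: date, bucket: RetentionBucket, today: date) -> bool:
    """True once the retention period has expired and the item may be deleted."""
    if bucket.retention_years is None:
        return False  # permanent records are never disposed of
    return today >= add_years(created, bucket.retention_years)
```

The point of the sketch is that retention and disposition become properties of the information’s class, agreed on jointly by RIM, legal, and IT, rather than a side effect of where the data happens to be stored.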

I’ll discuss this architecture in another post titled “Virtualizing the secondary storage tier.”

I just found this blog by my friend Clark Hodge posted April 2007 and wanted to share it and my comments.

Clark Hodge – StorageSwitched

Continuing on the complexity theme (see my short previous post)… Michael Peterson of Strategic Research, while speaking at Storage Networking World, had a couple of quotes that I liked:

“If you want to stop the complexity problem, stop doing it.” It took a little to digest this, I think it really means that if things are too complex – we should step back and decide why.

On reducing complexity – Mike pointed out that “It’s simpler to do it earlier.” Ain’t that the truth. The further along you get, the more things come into play, and if it wasn’t clear in the beginning – it’s going to get muddier as time progresses. Relates well to the “if you don’t have time to do it right the first time, when are you going to find time to fix it” idiom.

And now, to simplify my life – I’m going to stop, and call it a weekend!

——————————–
Clark,
I just found this note and in reading it thought I should add to the work. Here’s how the use of complexity theory applies to our work – or perhaps how I’ve derived a set of rules. In any case, here is how I say this:

First Rule of Complexity: “If you want to solve a complexity problem, Stop Doing It!”

Meaning – a complexity problem usually occurs because the approach we’ve taken to it is wrong, doesn’t scale, or is inefficient by design (or lack of design). Stepping back and looking at the whole and the objectives usually allows a fresh approach. Example: Using backup for data protection. Sure it scales, but it just keeps getting more complex as it scales. As a wise director of DEC’s test lab used to say about backup, “I don’t have a performance problem, I have a money problem!” Now apply this to something like “long-term retention and preservation of petabyte-size repositories” (we used to call this archive). Current approaches just aren’t going to work because we can’t throw the national treasury at the problem…

First Corollary to the first rule: “Automating a bad process will not fix the problem”.

Meaning, just automating backup is the wrong approach. A bad process is still a bad process.

The solution: “stop doing it!” Meaning – stop backing up. Replace backup with ‘data protection’ – using replication-based processes. Now a teaser: we will soon see architectures being promoted (what I call a federated information repository) in which there will be no need for backup since the repositories are self-healing via integrated redundancy and DR. Backup and archive as discrete processes will go away as they are no longer needed. Yes, a prediction, and a good example of eliminating complexity via ‘stopping the old process.’

In wrestling with the topic of intelligence while writing the Oracle white paper on embedded databases, I discovered an interesting and very important distinction we should be making between storage intelligence as a network systems concept and intelligent network elements. Here is a quick set of definitions I’d like to offer.

  • Intelligent, networked storage elements means that systems and services have local, internal data management capabilities that allow these systems or services to operate autonomously, yet in coordination.
  • Storage intelligence, in a network context, is characterized by three essential components: management instrumentation, central management, and intelligent, networked storage elements and services.

Said another way – intelligent network elements do not give you a coordinated and orchestrated solution. Instead, intelligent elements still need to be configured and managed. The result, if this is all you do, is a myriad of independent domains of management. A very incomplete approach, albeit a first step in the process of achieving storage intelligence.

So, if we want to hold up a picture of the future of storage systems management – this is a good one. Only management instrumentation based on SMI-S offers a standards-based approach to coordinating the thousands of network elements through a single point of (centralized) management. The alternative is many islands of proprietary instrumentation and many different management domains. (Read higher cost and higher complexity than the alternative of centralized management and standard instrumentation.) Naturally, I have to plug ILM-based management practices as the right approach for the central management process because it is the only approach based on business requirements. Combine ILM-based management practices with SMI-S based instrumentation and we have the most cost-effective, least complex approach to operating the datacenter’s storage resources. Storage intelligence is a good theme for the storage industry in that it ties the entire set of initiatives we are progressing within SNIA together cohesively.