I have been frustrated lately by the continued misuse of the terms ‘archive’ and ‘archiving’ throughout the data protection and backup industries. I keep trying to teach this principle, so let me offer some additional perspectives. I’ve written extensively about this in the new SNIA report “Building a Terminology Bridge: Guidelines for Digital Information Retention and Preservation in the Datacenter.”
1. Definition: As background, keep in mind that the world, including SNIA, defines an ‘archive’ as a specialized repository with preservation services, generally used to preserve, protect, verify the authenticity and integrity of, and secure information and data for the long term. Without preservation services, it is just a bit-bucket: not an archive, just a tier. (Always use that test.)
2. History: Dating back to the late 1980s on the mainframe, IT and vendors got into the bad habit of defining ‘archiving’ as the process of moving (migrating) information to a lower tier of storage or to shelf storage. I still remember that the first time I heard ‘archiving’, it referred to moving data off primary storage onto tape and taking it offline. And then there was HSM – with archive as the lowest tier. In the early 1990s, STK introduced “deep archive” as the bottom tier of HSM. And in early 2000, some analysts (who will go unnamed) jumped on the idea of “Active Archive”, implying the bottom tier could be on disk and accessible instead of buried on tape. Over the years, vendors have found it in their best interests to promote tiering and migration. The backup vendors now seem to find it useful to talk about moving backup data to tape for an archive. That is wrong, and it is a bad practice because it confuses IT into thinking they have a long-term preservation capability when they absolutely do not. None of these use cases defines a real archive; they are really nothing more than migration and tiering with different policies or requirements.
3. Here is an important point: state is independent of retention period. Migration or copying to a preservation store (an archive) has nothing to do with state, that is, with whether the information is active or inactive. This confusion is exactly what we are trying to change. State and retention period are independent variables in ILM-based practices. If a governance policy says to make a copy of, or store, all emails classified ‘business critical’ in a compliant store, that has nothing to do with state. If a performance or cost policy says to store inactive data or information on secondary storage instead of primary, that is a business rule that uses migration and tiering practices. Stop confusing tiering with archive.
4. Preservation services are essential: Moving data to a lower tier without adding preservation services does not create an archive. It is just a bit-bucket. Email archive is a great example, because all email archives begin with an ingestion process that puts in place the controls for long-term preservation. The eDiscovery community is now beginning to use email archive repositories for its litigation review stores because it needs those services to control things like authenticity. Litigation may last 10 years and go through many custodians. So, preservation services are essential.
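To make the test in point 1 and the independence claim in point 3 concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption, not something taken from the Terminology Bridge report: the `Repository` and `Item` classes, the service names, the reading of the test as "at least one preservation service", and the two policy functions are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical service names, loosely based on the definition in point 1.
PRESERVATION_SERVICES = frozenset(
    {"authenticity", "integrity", "security", "long_term_protection"}
)


@dataclass
class Repository:
    name: str
    services: frozenset = field(default_factory=frozenset)


def is_archive(repo: Repository) -> bool:
    """Assumed reading of the test: no preservation services means a tier."""
    return bool(repo.services & PRESERVATION_SERVICES)


@dataclass
class Item:
    classification: str  # set by governance, e.g. "business critical"
    active: bool         # state: is the item still in active use?


def needs_compliant_copy(item: Item) -> bool:
    """Governance policy: looks only at classification, never at state."""
    return item.classification == "business critical"


def storage_tier(item: Item) -> str:
    """Performance or cost policy: looks only at state (tiering)."""
    return "primary" if item.active else "secondary"


tape_shelf = Repository("tape shelf")
email_archive = Repository("email archive", frozenset({"authenticity", "integrity"}))

print(is_archive(tape_shelf))     # False
print(is_archive(email_archive))  # True
```

The two policy functions deliberately read different fields. A business-critical email gets its compliant copy whether it is active or not, and an inactive item moves to secondary storage whatever its classification. That separation is the whole point: the first decision concerns the archive, the second is only tiering.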
I recently prepared some FAQs on the Terminology Bridge report that are applicable to this conversation and I’ll post them separately.
It really is this simple:
- Use “migration and tiering” instead of “archiving”
- Do not use the term archive unless you refer to a specialized repository with preservation services
Posted in Archive, Data Management Issues, Long-Term Retention and Preservation
We’ve invited members of the IEEE’s Mass Storage Systems and Technologies workgroup on digital preservation to join with SNIA members in reviewing the requirements for long-term retention and preservation. If you would like to participate in this discussion, please go to the DMF Community’s site and register to access it.
http://community.snia-dmf.org
This is an important conversation as we need to update these requirements and then extend them further as we consider the implications of bringing technologies and architectures to market to solve the two ‘holy grail’ problems of preservation – logical and physical migration. Please participate.
Posted in Archive, Data Management Issues, Data Protection, Information Governance & ILM2.0, Long-Term Retention and Preservation, Service Mgmt, Storage Practices
Three years ago we started the work on long-term digital information preservation in the Data Management Forum’s Long-Term Archive and Compliant Storage initiative, LTACSI. One of the first activities we held was a panel discussion at the SNIA’s June 2005 Symposium in Boston. Among the panelists were an archivist, MacKenzie Smith, Associate Director for Technology, MIT Libraries, and a datacenter practitioner, Jim Riggs, PERMS Program Manager, US Army, who has a huge long-term retention challenge. Now, the room was full of about 70 storage ‘geeks’ – the types that frequent symposia such as this. But it was also attended by a few RIM/IT types and CTOs from a handful of emerging archive systems companies like Permabit and Archivas, some email archiving companies, as well as a contingent of the CAS group from EMC. MacKenzie surprised us all when she told us in clear terms how difficult her work was with today’s storage systems and that the way we looked at ‘archive’ was wrong.
Point 1:
- Based on feedback we got there, from our engagements with RIM and IT practitioners from ARMA and other groups including the SNIA End-User Council, and then from the important “Long-Term Digital Information Retention Requirements Study” I conducted for SNIA and published in January of 2007, we were continually admonished to stop using the “archive” word as it was too confused.
- Here is a poignant quote from the survey: “Records retention is different than depositing something in an archive. Archiving is a very problematic word and I would suggest not using it. It suggests dumping records into some bottomless pit where they can be forgotten. (Instead) Ingest (them) into a record keeping environment where they can be permanently preserved for long-term records retention seems better.”
Point 2:
- Engagements with ARMA’s RIM community and work on regulatory compliance brought out the importance of retention periods, the setting of retention requirements, and proper disposition (meaning permanent deletion) of expired information to reduce the volume of information being stored long-term.
- Paradoxically, our requirements survey as well as many informal audience surveys at conferences tell us that approximately 80% of the IT community still doesn’t know the requirements for the information it manages. A gauge of this disconnect can be seen in the many retention-requirements documents produced by RIMs that contain 2000 to 4000 specific record types and retention schedules. IT and IT systems can’t handle that type of granularity. (Thankfully, this thinking is dying out as people start talking and working together – we see classification catching on using just a few buckets.)
- This gap is very important, as it led us to begin the work with ARMA in stating that “Collaboration” is the starting point to “information-centric management”, just as setting requirements for that information based on its value to the organization is the starting point for Information Lifecycle Management (ILM) based practices. (See the white paper we co-authored: “Collaboration: the New Standard of Excellence” linked on my publications page.)
- Think about it now. Retention requirements are the focal issue for legal and RIM. Storing information off into a silo is the focus of IT, because IT doesn’t have the authority to delete anything. No wonder we have a disconnect around what archive means. Here are some definitions from their 2007 glossaries that illustrate the difference in thinking:
o ARMA – RIM: (context retention) 1. Used for electronic records, it is the procedure for transferring information from an active file to an inactive file, storage medium, or facility. 2. Act of creating a backup copy of computer files. See also BACKUP
o Society of American Archivists – Archivists: (context computing) – To store data offline.
o SNIA – IT: (context ILM) – (verb) To copy or move data for purposes of retention; to create an archive.
- OK, I have to say something here about using backup for an ‘archive’. Don’t. Completely wrong thinking. We’re trying to kill that message everywhere we can.
Point 3:
- More information is being held long-term by more companies than any of us expected. In the requirements survey, 83% of the 110 companies responding to this question reported that they have to keep some information over 50 years.
- What is long-term? Isn’t it relative? Yes, but we still need a number. Read my discussion on the definition of “long-term” in the requirements study for the details on how this was derived, but for now let me just make the statement. In the LTACSI, we’ve adopted the definition that long-term is the period of time beyond which you start losing data. Today, that number is 10-15 years.
Now you have the background for what I want to say. The point is that we have to shift our thinking to using retention and preservation as the key terms, not archive. Let’s redefine archive the way the digital archivist and library communities did in OAIS: as an “electronic archive,” a type of repository for long-term preservation, not as a verb that the storage community uses to connote “moving data into an electronic archive.” Throw the verb out! It is wrong thinking anyway, as the notion of moving information around as it ages just adds cost and complexity. (aha, another discussion thread…)
The beauty of this switch is that it also changes our frame of reference and helps move the organization down the path towards information-centric management. Now, you don’t just say the words and it’s over. There is important work to do:
- First, IT, RIM, legal, security, and the business groups have to get together and collaborate to identify their information assets, classify them into a manageable number of buckets, and then set the retention requirements. (And, while at it, set the other requirements too, please.) The mantra I teach for this process is “collaborate, identify, classify, requirements, implement, measure, improve”.
- Second, we need the storage industry to recognize that information services such as ILM, retention, preservation, deletion, etc. require the capability of managing information – not just the data. (See the discussion on the difference between digital information and data to fully appreciate this thought.)
- Finally, we need a new storage architecture for long-term retention in the datacenter – not just a ‘preservation data store’ or another proprietary silo. And that is the point of this note. With it, “archive” and backup go away and are replaced with retention and preservation.
I’ll discuss this architecture in another post titled “Virtualizing the secondary storage tier.”
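As a rough illustration of the first work item above (classify into a few buckets, then set retention requirements), together with the disposition of expired information discussed under Point 2, here is a small Python sketch. The bucket names, retention periods, and record types are made up for the example. Real values would come out of the collaboration between IT, RIM, legal, security, and the business groups.

```python
from datetime import date

# Hypothetical buckets and retention periods, in years.
RETENTION_YEARS = {"long-term": 50, "business": 7, "transient": 1}

# Many specific record types collapse into a few buckets (hypothetical mapping).
RECORD_TYPE_TO_BUCKET = {
    "engineering drawing": "long-term",
    "invoice": "business",
    "purchase order": "business",
    "meeting reminder": "transient",
}


def retention_expiry(created: date, bucket: str) -> date:
    """Date on which the retention period for an item ends."""
    target_year = created.year + RETENTION_YEARS[bucket]
    try:
        return created.replace(year=target_year)
    except ValueError:
        # Created on 29 February and the target year is not a leap year.
        return created.replace(year=target_year, day=28)


def due_for_disposition(record_type: str, created: date, today: date) -> bool:
    """True once the item has outlived its bucket's retention period."""
    bucket = RECORD_TYPE_TO_BUCKET[record_type]
    return today >= retention_expiry(created, bucket)


print(retention_expiry(date(2008, 2, 29), "business"))  # 2015-02-28
print(due_for_disposition("invoice", date(2008, 2, 29), date(2015, 3, 1)))  # True
```

Keeping the buckets few is the design point here: a retention schedule with thousands of record types reduces to one lookup table plus a handful of retention rules that IT systems can actually enforce.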
Posted in Archive, Information Governance & ILM2.0, Long-Term Retention and Preservation, Storage Practices