I get very perturbed when I hear otherwise intelligent developers talk about data backup and archive as though they were an old man’s game. Increasingly, I am told that storage infrastructure today is best designed to be “flat and frictionless” rather than tiered, with no data migration or copying. When data is no longer being accessed or updated, the flat folks argue, it should simply be quiesced on the disk or SSD where it resides. Just power down the drive. You can always stand up a new node next to the powered-down one and continue processing, so there is no need to reclaim space through archiving, or to make duplicate copies for off-site removal as backup. Those activities are so “day before yesterday.”
Science as an Excuse
The origins of these views are obvious. First, there is that entropy thing: a lot of IT folks seem to think that they sound “science-y” when they use theories and laws from hard sciences like physics as metaphors for stuff in their world. They interpret the laws of thermodynamics, especially the law of entropy, incorrectly to mean that systems cannot be made more orderly — that innate entropy (the tendency of systems to move toward an ever more disorganized state) prevents activities like storage tiering and data archiving from ever contributing to the order and optimization of data and storage. This futility translates into a preference for flat infrastructure with no data movement or replication, since such activity supposedly contributes nothing to orderliness or efficiency.
The truth of the second law of thermodynamics is that systems can maintain their orderliness in the face of entropy or can be made more orderly in spite of entropy by adding more energy from somewhere else. The real metaphor here is that folks need to get off of their butts and do the hard work of defining the strategies and policies and implementing the necessary technologies to get the job done. This entropy excuse for doing nothing is a veiled defense of laziness.
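That “added energy” is nothing exotic in practice: it can be as simple as a scheduled job that inspects last-access times and migrates cold files down to an archive tier. Here is a minimal sketch of such a policy in Python — the paths, the 90-day threshold, and the `migrate_cold_files` helper are all illustrative assumptions, not any particular product’s mechanism:

```python
import shutil
import time
from pathlib import Path

# Illustrative policy settings -- define your own as part of the strategy work.
PRIMARY = Path("/mnt/primary")
ARCHIVE = Path("/mnt/archive")
COLD_AFTER_DAYS = 90

def migrate_cold_files(primary: Path, archive: Path, cold_after_days: int) -> list:
    """Move files not accessed within `cold_after_days` to the archive tier,
    preserving the directory layout. Returns the list of archived paths."""
    cutoff = time.time() - cold_after_days * 86400
    moved = []
    # Materialize the file list first so moving files does not disturb the walk.
    for f in list(primary.rglob("*")):
        if f.is_file() and f.stat().st_atime < cutoff:
            dest = archive / f.relative_to(primary)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(dest))
            moved.append(dest)
    return moved
```

Run from cron or a scheduler, a policy like this is exactly the modest, ongoing “energy input” that keeps a storage environment orderly in spite of entropy. (Note that last-access times are unreliable on filesystems mounted with `noatime`; real policies often key on modification time or catalog metadata instead.)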
Useful Life: Four Minutes
Most flat storage folks might respond that their data is too ephemeral to justify all the trouble, especially those working in Big Data and the Internet of Things. Millions of data points are gathered from sensors or social media posts and fed into algorithms that are updated every millisecond. The raw data itself is useful for no more than four minutes in some cases, after which it is replaced by new input. So, after four minutes, it has no value, and we could stand to lose it without problem or inconvenience.
At first glance, the apparently ephemeral nature of such data makes a case for not bothering to archive it or back it up, I suppose. This thinking was reflected in IDC’s Digital Universe report for EMC back in 2014. They said that of the 4.4 zettabytes of data produced in 2014, only 37% would be useful to tag and analyze, suggesting that the rest could go away and no one would be the worse for it. They also noted that for 55% of that total, there was no provision for backup or data protection — and they were okay with that, too.
Looking Into the Future
Truth be told, we have never seriously examined the issue of ephemeral data, or why we bother to collect it at all. Is ephemerality determined by workflow, so that “ephemeral” simply means whatever does not serve the current analytical objective? How do you know what will be important now, or in the future?
Our flirtation with Business Intelligence and Data Mining (remember those?) back in the 1990s suggested that there was a lot of value in keeping original source data, in as close to its raw form as possible — not refined or normalized to work with a specific analytical process or tool. Why? For one thing, the tools themselves constantly improve. Experience has proven that it is useful to return to source data with new algorithms from time to time to see what we missed — any non-intuitive relationships that might be hiding in the data itself. You can’t do that if the raw data is missing or has been transformed in some way. So, usually a case can be made to archive the original data.
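The discipline this implies is simple: land the untouched record in a write-once archive *before* any normalization happens. A minimal Python sketch of the pattern — the function name, file layout, and the toy lowercase “normalization” are all my own illustrative assumptions, not a specific pipeline’s API:

```python
import gzip
import json
import time
from pathlib import Path

def archive_then_transform(record: dict, archive_dir: Path) -> dict:
    """Append the raw record to a compressed, append-only daily archive
    before deriving the working form, so future algorithms can revisit
    the original source data."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    day_file = archive_dir / (time.strftime("%Y-%m-%d") + ".jsonl.gz")
    with gzip.open(day_file, "at", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    # Only after the raw copy is safely written do we normalize.
    # (Lowercasing keys stands in for whatever refinement your tools need.)
    return {k.lower(): v for k, v in record.items()}
```

The point is the ordering, not the particulars: whatever your refinement step does, the unrefined original survives it, and next year’s better algorithm has something to chew on.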
“Crunch all you want. We’ll make more”
The flat storage folk usually respond that powering down the drive and sheltering data in place will preserve the data asset. This, too, is a poorly thought-out position. For one thing, advocates of this approach have no reliable data about the failure rate of powered-down drives when they are restarted after a month, or six months, or more.
Plus there is a conceit or a lack of appreciation of business reality behind the idea that storage nodes can simply be replaced ad infinitum. That is grad student/media lab nonsense. In the real world, there is not sufficient budget to continuously roll out more all-flash storage or another three nodes of VSAN or whatever. The vendors, fond of the old Doritos mantra – “Crunch all you want. We’ll make more.” – love the idea of continuous scaling of storage infrastructure, but even they will tell you that there are fixed limits to the production capacity of the industry and the manageability of infrastructure at scale.
I know everyone loves to hate on latency, but flat infrastructure free of migration and replication voids the entire notion of efficient and intelligent storage. And, by the way, latency is rarely a product of storage I/O. It is more often caused by raw I/O and the way it is serialized through multicore CPUs using sequential I/O processing techniques born of unicore chip architectures. If you don’t get that, go over to the Storage Performance Council and look at DataCore’s latest 5 million-plus IOPS SPC-1 benchmark result, achieved by adding parallel I/O processing to a common Intel multicore chip. Their storage was SAS and SATA connected via a Fibre Channel link — and they weren’t even using half the bandwidth of the link!
Light at the End of the Tunnel
If you are worried about how to guide data between tiers for archive or protection without burning a lot of app server CPU, think about offloading the work to something like StrongLink from StrongBox Data Solutions. I have been testing their platform all summer, and it is already on the right track, combining cognitive processing with any-to-any, workflow-to-storage data management designed to simplify things for the appdev folks whose mental acuity has fallen prey to entropy. Have a look.