You may have heard that some in the industry believe that 99.9% of Big Data is “worthless.” Indeed, I’ve written about the scientists at the Large Hadron Collider who discard the vast majority of the petabytes of data their experiments produce. And we all know about the huge swathes of our SANs wasted on unused data. Organizations confront mountains of extraneous and redundant data. That’s a well-understood problem.
But is it really “worthless”?
I vigorously disagree. Even if data is not usable for the task at hand, it may become usable later for another analyst. And if it is redundant, understanding which processes are causing the repetition might lead to improved business performance. Understanding the whole of a Big Data opportunity means being able to discern which parts of your data set are worthwhile and which are not. That does not mean that what you discard today will be worthless tomorrow. It depends on the questions you ask of the data, how you ask them, and when.
More to the point, as Hal Varian, Google’s Chief Economist, observes, valid predictive analytics, perhaps the most important potential of Big Data, has to start with a random data set, even if it’s a small one. However, as engineers at Google know, to get that truly random data set, that sliver of data needs to come from a massive amount of information. Without a large enough pool of data to draw from, your data set and the analytics built on it become less valid. In other words, even unused, Big Data generates the most valid data sets for modeling.
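To make the sampling point concrete, here is a minimal sketch of one standard way to pull a small, uniformly random sample out of a very large pool without holding the whole pool in memory. It uses reservoir sampling, which is my choice for illustration and not a technique Varian or Google are described as using here. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from stream.

    Hypothetical helper for illustration. Every item in the stream ends up
    in the sample with equal probability, and only k items are kept in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; replace a reservoir entry only if the
            # slot falls inside the reservoir (probability k / (i + 1)).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# A stand-in for a massive pool of records: one million record IDs.
sample = reservoir_sample(range(1_000_000), k=100, seed=0)
print(len(sample))
```

The sample stays small, but every one of the million records had the same chance of landing in it. That is the sense in which the sliver depends on the mass of data behind it, including the parts nobody ever looks at.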
Philip Russom, TDWI’s director of data management, adds another critical aspect of Big Data. He argues that Big Data is discovery oriented: you look for facts you never knew before. He warns that if you over-massage Big Data, you risk eliminating outliers in the data, which might be exactly what you need to find, as in credit card fraud detection.
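A small sketch shows how easily that can happen. The transaction amounts below are invented for illustration, and the three-standard-deviation trimming rule is just one common cleaning step that I am assuming here, not something Russom prescribes. The helper name `trim_outliers` is hypothetical.

```python
import statistics
from typing import List, Sequence


def trim_outliers(values: Sequence[float], z_cutoff: float = 3.0) -> List[float]:
    """Drop values more than z_cutoff standard deviations from the mean.

    Hypothetical cleaning step of the kind often applied before loading
    data into a traditional warehouse.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so there is nothing to trim.
        return list(values)
    return [v for v in values if abs(v - mean) <= z_cutoff * stdev]


# Invented card transactions: 200 everyday purchases between 20 and 34,
# plus two very large charges of the kind a fraud analyst would want to see.
everyday = [20.0 + (i % 15) for i in range(200)]
suspicious = [4800.0, 5200.0]
transactions = everyday + suspicious

cleaned = trim_outliers(transactions)
removed = [v for v in transactions if v not in cleaned]
print(removed)  # the two large charges are the only values trimmed away
```

The cleaned set looks tidy, and the only records it lost are the two an investigator would most want to examine.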
Blithely dismissing 99.9% of Big Data as worthless shows a lack of nuance and insight when it comes to understanding the Big Data problems modern enterprises face. Seeing the world strictly through a traditional database mentality, where only the purest, cleanest, most massaged, and manageably small data set is trusted, leads to bald and false proclamations about the worth of Big Data.
Such a view does not lead to effective predictive analytics. It does not lead forward. It strikes me as purely a defensive position from those who do not have the tools to exploit the enormous worth of Big Data.