In the last installment of "Data Rich and Insight Poor" we discussed the benefits of a platform approach to maximize the return on your investment across the enterprise. Today I'd like to address the need for consistent data quality, both at the onset of a project and as an ongoing governance discipline, to ensure lasting success.
The old axiom is "Garbage In, Garbage Out."
How much truer could this be now that we rely on data being present and consistent to drive complex analysis and algorithmic methods? To use a car analogy: even trace amounts of water in your car's fuel tank cause not only immediate engine trouble such as hesitation or surging, but also long-term damage in the form of corrosion or sensor failure.
So what are we to do with the "fuel" in our ever-expanding data universe? To reduce cost to the enterprise, we need governance and controls in place to ensure that data is acquired and stored in a clean and complete state. This can take the form of required fields to guide humans through data entry, but what do we do when the volumes approach billions of entries and industrial-scale databases?
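To make the idea of required-field controls concrete, here is a minimal sketch of a record validator. The field names ("device_id", "timestamp", "reading") and the blank-string rule are illustrative assumptions, not taken from any particular system:

```python
# Minimal sketch: reject records that arrive with required fields
# missing or blank, before they ever reach storage.
# Field names here are hypothetical placeholders.

REQUIRED_FIELDS = ("device_id", "timestamp", "reading")

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing required field: {field}")
    return problems

clean = {"device_id": "A-17", "timestamp": "2015-01-05T09:00:00Z", "reading": 42.1}
dirty = {"device_id": "", "timestamp": "2015-01-05T09:00:00Z"}

clean_problems = validate_record(clean)   # no issues found
dirty_problems = validate_record(dirty)   # blank device_id, missing reading
```

The same check can sit behind a data entry form or an ingestion API; the point is that the rule is enforced by the system, not by the person typing.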
One of the best strategies involves removing the weakest link in the data entry chain: the human!
By using a machine-to-machine (M2M) approach to data acquisition, large quantities of data can move from sensory input to storage with high fidelity. One of the "hidden" benefits of a direct connection is that when analysis does reveal a data anomaly, you can make certain assumptions about the sensory chain involved. Perhaps the back-haul network is dropping out, or certain environmental conditions cause the sensor to fail? Removing the human element from the list of possibilities removes a myriad of doubts from the investigative process.
At the other end of the "garbage" axiom is the output: today's complex algorithmic approaches use multidimensional equations whose results can be skewed by even a single erroneous data point. Data entry errors, back-haul system dropouts, and mis-decoded or undecoded packets all lead to anomalies in the data load, and any of these problems produces bad results in the analytical output.
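A tiny illustration of how a single bad point skews a result, using hypothetical sensor readings and a plain average (real analytical pipelines are far more complex, but the effect is the same):

```python
# One mis-decoded packet is enough to drag a simple statistic
# far away from reality. Values below are made up for illustration.
import statistics

readings = [10.1, 10.3, 9.9, 10.0, 10.2]   # hypothetical sensor values
corrupted = readings + [9999.0]             # one mis-decoded packet

good_mean = statistics.mean(readings)    # ~ 10.1
bad_mean = statistics.mean(corrupted)    # ~ 1675 -- wildly wrong
```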
What can be done to stem this problem? First, if you are still using data entry clerks or having craft personnel enter data, put technical controls in place to enforce data entry parameters and required fields. If you are using an M2M approach, with the data stream moving from device to device, you can take advantage of technical tools to monitor and address data quality. Consider the capabilities in the release of HANA SP9: Smart Data Quality can help ensure that data loads are properly monitored and managed throughout the stream. < link > At a minimum, being aware of the challenges, monitoring the data traces for spikes or peaks, and excluding them is a start toward cleaner data feeds and better results.
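As a sketch of what "monitoring the data traces for spikes and excluding them" can mean in practice, here is one common technique: a modified z-score filter based on the median. The threshold of 3.5 and the sample values are illustrative assumptions, not settings from any specific product:

```python
# Minimal sketch: drop spikes from a data trace before analysis.
# Uses the median and median absolute deviation (MAD), which are
# far less distorted by the spike itself than a mean/stdev filter.
import statistics

def exclude_spikes(trace, threshold=3.5):
    """Keep only points whose modified z-score is within `threshold`."""
    med = statistics.median(trace)
    mad = statistics.median(abs(x - med) for x in trace)
    if mad == 0:
        return list(trace)  # no spread at all; nothing to exclude
    return [x for x in trace if 0.6745 * abs(x - med) / mad <= threshold]

trace = [10.1, 10.3, 9.9, 10.0, 250.0, 10.2, 9.8]  # one obvious spike
cleaned = exclude_spikes(trace)                     # 250.0 is excluded
```

Excluded points should of course be logged rather than silently dropped, so the investigative trail back to a failing sensor or back-haul segment stays intact.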
Regardless of your approach, the axiom is as true as ever, if not more so: "Garbage In, Garbage Out." Pay attention to the data, or you could be misled by the results.