We have an abundance of data these days. In particular, there is so much data available on the internet that we have trouble determining which we should use and which is the best fit for our particular purposes. Some of the information – one is tempted to say a lot – is lacking in quality. And so the question of data quality has become a major issue.
At the core of data quality is the concept of “representational faithfulness”. This means that the data faithfully represents the underlying facts it is intended to represent. That is key to making the information “fit for purpose”, which is another central concept in data quality.
For example, weather forecasts are only useful if they truly represent the weather that actually occurs. If the forecast is for a sunny day, you plan your family picnic accordingly, and then it pours, the information is not only useless but decidedly misleading, and it has led to a poor decision.
But there is more to data quality than representational faithfulness. The data must also be internally consistent. For example, if some of the data has Sally living in Toronto and some has her living in Montreal, then there is an inconsistency that makes the data less than useful. In addition, the data must be relevant, timely, complete and understandable.
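The consistency example can be sketched in a few lines of Python. This is only an illustration, not a description of any particular data quality product: the function name find_conflicts, the record layout and the extra record for "Raj" are all assumptions made up for the example.

```python
from collections import defaultdict


def find_conflicts(records, key, field):
    """Return {key value: sorted distinct field values} for keys with more than one value.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    seen = defaultdict(set)
    for record in records:
        seen[record[key]].add(record[field])
    # Keep only the keys whose field takes more than one distinct value.
    return {k: sorted(values) for k, values in seen.items() if len(values) > 1}


# Made-up sample data: Sally appears with two different cities.
records = [
    {"name": "Sally", "city": "Toronto"},
    {"name": "Sally", "city": "Montreal"},
    {"name": "Raj", "city": "Ottawa"},
]

conflicts = find_conflicts(records, "name", "city")
print(conflicts)  # {'Sally': ['Montreal', 'Toronto']}
```

A check like this only flags the inconsistency. Deciding which city is actually correct still requires going back to the underlying facts.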
In order to be understandable, data must be defined in some fashion. For example, if we see a number of 23,000, and that’s all we see, we have no idea what that number represents. If a dollar sign is added, then we know that it represents money. If we get further information, say that the amount represents the profit of a corporation for the past year, then we are starting to get some useful information.
In information theory, these additional facts about the data are generally referred to as metadata, or data about data. And so, going back to the concept of information quality, we can only have information quality when the data is true to its metadata, that is, when it actually represents what it is said to represent.
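The 23,000 example can also be sketched in code, showing the same number becoming more understandable as metadata is attached. The class Figure, its field names and the describe function are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Figure:
    """A number plus optional metadata (all field names are assumptions)."""
    value: float
    currency: Optional[str] = None  # e.g. "$"
    meaning: Optional[str] = None   # what the number measures
    period: Optional[str] = None    # the time span it covers


def describe(fig: Figure) -> str:
    """Render the figure together with whatever metadata is available."""
    amount = f"{fig.value:,.0f}"
    if fig.currency:
        amount = fig.currency + amount
    parts = [amount]
    if fig.meaning:
        parts.append(fig.meaning)
    if fig.period:
        parts.append(fig.period)
    return ", ".join(parts)


bare = describe(Figure(23000))
money = describe(Figure(23000, currency="$"))
full = describe(Figure(23000, currency="$", meaning="corporate profit", period="past year"))

print(bare)   # 23,000
print(money)  # $23,000
print(full)   # $23,000, corporate profit, past year
```

The number itself never changes. Only the metadata around it turns it from a bare figure into usable information.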
Establishing and maintaining data quality has become a major industry. Companies are spending vast sums on systems that can evaluate the quality of the data they receive and then monitor that quality during its period of usefulness. Often this is done through the installation of systems that automatically carry out data quality analysis. Most of the major software suppliers, like IBM and SAP, offer solutions that are designed for data quality purposes.
Those of us who use data or information, which is all of us, need to ask ourselves whether the data we see has quality, that is, whether it is fit for purpose. Most of us have learned to do at least a cursory evaluation of the data we receive over the internet, but sometimes more than a cursory evaluation is needed. Sometimes we need to investigate carefully what precautions the provider has taken with regard to data quality. Sometimes we even need to obtain professional assistance in evaluating and reporting on the quality of data that is important to us. It’s a big issue for all of us.
(Author’s Note: I have not attempted to differentiate between data and information, which can be a fuzzy distinction. Also, I have used data in the singular, which may offend some grammatical purists, but my case is that this has become a common form of expression.)