For all the discussion about big data, one problem persists: there is no general agreement on what the term means. For some, it means data available on the internet generally; for others, it is data coming from social media or the Internet of Things. Some use it to refer to unstructured data, while others include structured data such as that held in relational databases.
Sometimes big data is defined by the tools used to analyze it, such as Hadoop or Spark. For others, it relates to data from enterprise systems such as ERP and CRM.
There are lots of definitions around. Wikipedia, for example, says “big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.” Many would say this definition is too narrow.
Webopedia defines it as follows: “Big data is a buzzword, or catch-phrase, meaning a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques.”
This definition is better because it covers both structured and unstructured data, and so encompasses data from traditional business systems as well as internet data such as that from social media. It also refers to massive volume, which is one of the defining characteristics.
A more analytical approach defines big data through the terms Volume, Variety and Velocity, sometimes joined by Variability. These words alone do not clearly define big data, but they do provide a framework for a definition. Volume means very large amounts of data. Variety means data coming from very different sources, from business systems to the Internet of Things. Velocity matters because big data typically arrives quickly and continuously, and there is now a trend toward streaming analytics in recognition of this. Variability means the data changes in volume, format and source.
Wrap these together and we can approach a definition. Big data is very large volumes of structured and unstructured data coming from a variety of sources, such as business systems, social media and the Internet of Things, moving at high velocity and with frequently changing sources, formats and subject matter.
This definition may not be perfect or elegant, but it is at once broad and specific, and it encompasses the generally understood characteristics of big data.