A lot of businesses are desperately keen to transform themselves to become data-centric. The desire often reflects a recognition that many of them sit on large volumes of data and that presently they are not extracting all of the value from this resource. This makes perfect sense but, as always, the devil is in the detail. Finding all the data in a business is hard, so putting it all in one place is a good start, but it's a long way short of what is needed to extract all of the value.
I have been at Semantics 2018 this week, an industry-led conference for data scientists interested in ways of providing structure across disparate data. Amongst the talks was one from Alan Morrison, a senior research fellow at PwC. He argued that to be truly data-centric we should put our existing technology stacks to one side and start again from the data up.
To see why this makes sense we need to remind ourselves of how we ended up with all the data that we have in our businesses. The story starts with software: farm management systems, accounting packages, spreadsheets, word processors, e-mailers and so on. All of these tools create data, different sorts of data, in different formats, and they probably store it in different places. So, we could become a bit more data-centric by bringing the data together and making it possible to query it. This is what happens when you put your data in a relational database.
If we step back a bit and ask the question "what is data?" we can start to see why this is not really data-centric, or at least not data-centric enough. Data is a description of the universe in which we live, or probably more manageably a description of our business and the environment in which it operates. In this respect, Morrison argued that whilst we have a lot of data, half of it is missing.
Well dohhh, of course we don't collect all of the data in the business to provide a comprehensive description of the world. If we did that we would never do any business. OK, but some of this missing data is very important because it describes relationships between the things that we have got data on, and so gives us a more comprehensive picture of the world in which we operate. Really importantly, this picture can be understood by machines, so that humans can stop worrying about every detail of the world and focus on the insights that matter to their business.
Some of this missing data sounds a bit silly. For example, if Widget Co is a customer of yours, this implies that you are a supplier to Widget Co. We can therefore add information about the meanings of customer and supplier to enrich our data. In order to do this, we need to tell the computer about the relationship between the terms supplier and customer. Whilst this relationship is obvious to us, it is not so to the computer until we formally tell it.
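To make the customer and supplier example a bit more concrete, here is a minimal sketch in Python of what "formally telling the computer" could look like. Everything in it is an assumption made for illustration: the relation names customer_of and supplier_to, the name "Our Business" and the helper infer_inverses are all hypothetical, and a real system would more likely use a standard ontology language than a hand-rolled dictionary.

```python
# Facts we already hold, written as (subject, relation, object) triples.
# The names here are illustrative only.
facts = {("Widget Co", "customer_of", "Our Business")}

# The "missing" data: a machine-readable statement that the two
# relations are inverses of one another.
inverse_of = {
    "customer_of": "supplier_to",
    "supplier_to": "customer_of",
}


def infer_inverses(facts, inverse_of):
    """Return the facts plus every fact implied by an inverse relation."""
    enriched = set(facts)
    for subject, relation, obj in facts:
        inverse = inverse_of.get(relation)
        if inverse is not None:
            # Swap subject and object and use the inverse relation.
            enriched.add((obj, inverse, subject))
    return enriched


enriched = infer_inverses(facts, inverse_of)
for triple in sorted(enriched):
    print(triple)
```

The only thing we added by hand was the statement that the two relations are inverses. The fact that Our Business is a supplier to Widget Co was worked out by the machine, and the same statement would apply to every other customer without any further effort from us.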
It's a trivial example, but it illustrates how a slightly different type of data can be added to our lakes of data to give the original data more meaning. This data is more abstract than the data we are used to. It describes concepts and entities as well as the relationships between them, and in this sense it is an extension of the concept of metadata. Metadata is data about the data. It commonly includes things like the method used to collect the data and when it was collected, and it is usually human readable.
In our world the metadata includes all the stuff that is usually included, plus the data about entities, concepts and relationships, and it is machine readable. We generally refer to this metadata as the ontology.
So, data-centric means that you start with the data and you use it to give machines as much information as you can about your world. You can think of this as a digital twin of your world. Now that you have your digital twin you can write applications to pull out the information about it that you need. You can also play games with it to see what would happen if you changed some bits of your world.