Digital Transformation and Data Culture are two of the big hype terms to emerge recently from the tech industry think tanks: very high concept, and intended to resonate in boardrooms globally.
As with most such buzzwords, these phrases are remarkably devoid of actual meaning, and as such can be filled in to mean whatever happens to be in the listener's mind at the time.
However, there is enough here that these terms do have significance, in great part because they point to a reality that makes many managers uncomfortable. Failing to manage the data in an organization is not a failure in the tools used to manage that data - it is a failure of management itself.
Put another way, management has no one else to blame if big data projects fail. To understand why, it's worth clearing up a few common but critical misconceptions.
Your databases are full of valuable data. Nope. Not even close. Most databases are filled with transactional data - in effect, the ghost signatures of events that have taken place in the past. Some of this can be valuable, especially time series data where you have specific metrics that change over time. But much of it exists to support applications, there is a great deal of redundancy within the data, and because each database is a world unto itself, synchronizing these databases with others can be a complex and expensive process that reduces the return on investment of such data analysis efforts.
Your databases are well designed. The bulk of database development occurs before the first actual piece of data is ever entered into a database. The configuration of tables, columns and keys that a database uses is called its schema, and if you dig deep enough into your IT department you will no doubt find a block and line diagram that looks like a cork board on steroids, typically called an Entity Relationship (or ER) diagram.
Yet once that ER diagram gets printed, the reality begins to diverge from it. New tables are added because certain features weren't anticipated, columns get deprecated in favor of other columns, your database admin leaves and a new one takes over, with his own ideas about data modeling. A database schema gets transferred from one system to another with no one understanding why certain structures were chosen, leading to even more complexity.
Finally, columns themselves may have names like "REV" - which could be revolutions or revenue - with no indication of whether this is a raw datapoint or an aggregate measure, no idea what units the measure is in, or whether this is an active field or something that was deprecated years ago.
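The missing context around a column like "REV" is exactly what a data dictionary captures. Below is a minimal sketch in Python; the column names, meanings and fields here are hypothetical, chosen only to illustrate the kind of metadata the schema itself fails to carry.

```python
# A toy data dictionary: the documentation that ambiguous column names
# like "REV" fail to carry. All names and values here are illustrative.
from dataclasses import dataclass

@dataclass
class ColumnDoc:
    name: str        # physical column name as it appears in the schema
    meaning: str     # what the value actually represents
    unit: str        # unit of measure, if any
    kind: str        # "raw" datapoint or "aggregate" measure
    active: bool     # False if the column was deprecated

data_dictionary = {
    "REV": ColumnDoc("REV", "quarterly revenue", "USD", "aggregate", True),
    "RPM": ColumnDoc("RPM", "rotor revolutions per minute", "1/min", "raw", False),
}

def describe(column: str) -> str:
    """Render a column's documentation as a one-line summary."""
    doc = data_dictionary[column]
    status = "active" if doc.active else "deprecated"
    return f"{doc.name}: {doc.meaning} ({doc.unit}, {doc.kind}, {status})"
```

Even a simple registry like this answers the questions the paragraph above raises: revolutions or revenue, raw or aggregate, live or deprecated.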
Your databases are pristine. Until comparatively recently (when sensor data began to overtake direct human input) almost all data within a database was entered by a human being.
Data might be miskeyed. Options may have been missed, fields may have been left blank with no checks in place to catch such bad data. Add to this data systems written by programmers who baked in boundary assumptions - such as using two digits to designate years, because the 1900s were never going to end, or entering a date of 12/31/2099 to indicate an indeterminate time in the future, because of course databases will not be around by 2100.
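A validation pass can at least flag these conventions after the fact. The sketch below assumes two of the patterns mentioned above - the 12/31/2099 "never" placeholder and two-digit years that parsed into the wrong century; the cutoff year is an arbitrary assumption, not a rule.

```python
# Flag sentinel and suspect dates of the kind described above.
# The sentinel set and the 1950 cutoff are illustrative assumptions.
from datetime import date
from typing import Optional

SENTINELS = {date(2099, 12, 31)}  # conventional "indeterminate future" value

def flag_suspect(d: date) -> Optional[str]:
    if d in SENTINELS:
        return "sentinel: indeterminate future"
    if d.year < 1950:  # possibly a two-digit year parsed as 19xx
        return "suspect: possible two-digit year"
    return None  # nothing obviously wrong

records = [date(2099, 12, 31), date(1923, 5, 1), date(2018, 3, 14)]
flags = [flag_suspect(d) for d in records]
```

Note that a flag is only a prompt for a human to look: a 1923 date may be a perfectly legitimate birthdate, which is why this kind of cleansing can never be fully automated.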
All too many databases simply fail to incorporate the fact that things change over time.
This means that either old information gets lost as new information overwrites it, or it means that in an attempt to preserve identity management, information goes out of date.
Your data is trustworthy. This is a newer problem than the others, but it is just as serious.
Most big data systems act as aggregators, but in the process of aggregating they often lose their connection to the source of that data. Companies or divisions merge, and data systems get thrown together, with no comprehensive view towards data harmonization or managing the history (or provenance, as it's known in data circles), primarily because this kind of data about data (metadata) is harder to capture in relational databases. Increasingly, it is hard to tell how trustworthy data is, because the people and processes that initially created that data are now long gone.
Software can fix these data problems. Every large software vendor has its own packaged suite of tools (and has had for years) that uses the AI flavor of the month (from Hadoop to Machine Learning) to analyze databases and "fix" them - performing master data management harmonization, cleansing data, running stochastic or semantic analysis of terminology, or whatever else some smart kids in a half-finished building (usually called "retro") wrote to create a product that would propel them to the payday of an IPO, usually based upon someone's PhD thesis.
A few of them are in fact quite good, albeit very expensive. Most are mediocre, and a few are outright vaporware. Even the best of these solutions will get you only about 80% of the way there, and the reality is that at some point human analysts will need to look at the various edge cases as well as deal with miscategorizations due to poor training data.
In general, when working from existing data, your IT/analytics department is engaged in what has come to be known as forensic data management. It is an effort to reconstruct the data from the past, to ascertain the mindset of designers and programmers, and to make that data sufficiently useful as to glean something, anything, from the mountains of database servers that most companies typically maintain. It is very, very expensive, and to the extent that it will provide value, such analysis should generally be done only in conjunction with rethinking your data culture.
Data managers will gladly let you read/write into their data systems. Data silos occur for a reason. Databases exist to build applications that facilitate processes - often very mission critical processes. Performing regular queries against databases designed for specific tasks could very well cause those databases to grind to a halt, corrupt data integrity, and open a potential vector for hackers.
This is a big part of the reason that most organizations are now exposing services to get at the data. Services can be throttled, provide a means to access some information without potentially exposing protected content, and can be monitored without taking away CPU cycles from the database itself.
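Throttling at the service layer is usually implemented with something like a token bucket, which caps how many requests can reach the database in a given window. Here is a minimal sketch; the class name, capacity and refill rate are all assumptions for illustration, not any particular product's API.

```python
# A toy token-bucket throttle of the kind a data service might place in
# front of a database. Capacity and refill rate are illustrative.
import time

class Throttle:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)      # start with a full bucket
        self.refill = refill_per_sec       # tokens added back per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# With no refill, only the first two of four requests get through.
t = Throttle(capacity=2, refill_per_sec=0.0)
results = [t.allow() for _ in range(4)]
```

The point is architectural rather than algorithmic: requests rejected here cost the database nothing, which is exactly the protection a service facade provides.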
This doesn't even get into the political reasons for data access, which often involve budgets and allocations of personnel and similar issues.
Management's Role in Data Culture
For the most part, corporate management has had a tendency to segment data so that they don't have to deal with it, except in nice digestible reports and the occasional 3D-based dashboards.
Data falls into the technical domains, deep in the back offices where the men and women wear t-shirts and play foosball all day long.
Yet the reality is that while the techies will be the ones creating and maintaining the software, databases and algorithms, they are almost never privy to the most important data - the business requirements that determine what these systems hold.
As organizations become more data-centric, the value of this external data will rise, as it represents a potential competitive advantage. Another reason not to waste your efforts on forensic data analysis is that those analysts are likely to be more needed to determine what competitors' internal data states are likely to be. This means that the more your company can minimize internal efforts that can be automated through digital transformation, the more resources can be placed on acquiring an outward-facing view of the business landscape.
There are several things that can facilitate this process.
Establish data priorities. To the extent possible, build a canonical model that can be filled in piecemeal based upon priorities. Capture those immediately tangible business variables first, then build outward at connection points to new domains of data. For instance, focusing first on contracts can help you identify the resources that are being worked on, and frequently contracts represent the first point at which a given resource is defined. Once this core set is built, move outward, getting manufacturing information, sales data, sensor data and so forth. This organic view makes it easier to construct road maps by which the organization will be able to support data from a given domain.
Use Semantics. Semantic tools are designed to manage metadata - information about information. By taking advantage of a semantic approach, you can establish a more consistent framework for master data management, resource mastering and so forth. This can be augmented with machine learning systems to automate a lot of the classification problems, but semantics gives you the infrastructure on which all this hangs.
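At its core, the semantic approach holds metadata as subject-predicate-object statements (triples) rather than fixed columns, so new facts can be attached without schema changes. The sketch below is a toy triple store in plain Python; the `ex:` and `rdfs:` vocabulary terms are stand-ins for the namespaced identifiers a real semantic stack (RDF, SPARQL) would use.

```python
# A toy triple store: metadata as subject-predicate-object statements.
# Vocabulary terms ("ex:...", "rdfs:label") are illustrative stand-ins.
triples = {
    ("ex:REV", "rdfs:label", "Quarterly revenue"),
    ("ex:REV", "ex:unit", "USD"),
    ("ex:REV", "ex:status", "active"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )

# Ask what unit the REV column is measured in.
units = query(subject="ex:REV", predicate="ex:unit")
```

Because every statement is just another triple, a machine-learning classifier can add its own assertions (say, a proposed category for an undocumented column) alongside human-curated ones, which is the sense in which semantics provides the infrastructure on which the automation hangs.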
Put your librarians to work. Most large organizations have librarians, who have typically performed the role of categorizing and organizing both business and technical metadata. They are the keepers of corporate metadata - controlled vocabularies containing lists of asset types, enumerated lists holding critical business distinctions, curators of descriptive content and so forth. This information, such as country representations, units of measure, lists of authorities and the like, is critical to database representations. They also play a vital part in your data governance/provenance/cleanliness strategies.
Move from business analysts to business modelers. Traditionally, business analysts have served to determine what information needed to be captured within an organization, storing this in data dictionaries or similar structures that are then turned over to IT. Increasingly, however, their role will need to shift from being a fairly passive part of the process to becoming liaisons with both internal and external stakeholders, gathering requirements and fitting them into conceptual models and demonstrable use cases.
These use cases will be critical to moving an organization forward - pick the wrong use cases and your IT department can end up spending valuable time exploring blind alleys. Use case management is also something that should be done in conjunction with your ontologists, information architects or related technical data modelers. The business analysts should then bring their particular domain expertise to bear.
Rethink Ownership. Enterprises are moving away from ownership of systems to ownership of domains of information. This has the effect of changing patterns of governance from being largely an IT function to a business function.
One impact of this is that information organization shifts to areas such as data analytics, manufacturing, personnel, marketing, accounting and so forth, a shift that more accurately follows the life cycle of data. It also cuts down on siloization - by focusing on the governance of information rather than of servers, it makes it harder for different departments to argue that security (which often is code for "we don't want to take the budgetary hit or reassign already overburdened technicians") trumps accessibility.
Become Analytics Driven. One major advantage that comes with this change in approach is that analytics, which is typically the ultimate consumer of data, has much more say in terms of what kind of information it receives. A significant percentage of a data scientist's time is spent simply getting disparate data content into a form that is fit for analysis, and by reducing the overall costs in that process, data scientists can get on with doing what they do best - identifying patterns and trends in historical data that can provide actionable information for decision makers.
See Data as Product. In a Data Culture company, the information that the organization produces has value because of its processing. The data is clean, consistent, unambiguous, timely and conveniently packaged. This makes the data much more valuable, both in straight monetary terms as well as payment in kind, where it would make sense to both companies to exchange (limited) state information in order to get information about the other side. This can go a long way towards reducing the overall cost of data management within either organization, while at the same time keeping proprietary information controlled.
The Digital Transformation process is not one that is going to take place overnight. There will be winners and losers as power and priorities shift, and this means that there will be inertia against change at all levels. It requires that managers, especially, get more closely involved with the information life-cycle, and move out of the compartmentalization between front office and back office work.
Those companies that are able to master what's involved quickly will be both more self-aware and more externally aware, will have far clearer metrics into their particular business functions, and ultimately will be able to carry that information into every aspect of their business, from the manufacture of things to their marketing and sales, at a level that most CEOs can only dream of today.
Kurt Cagle is a contributor for Cognitive World on Forbes. He has worked as an information architect for a number of Fortune 500 companies and Federal and International agencies, and is Principal Ontologist for Semantical LLC, in Issaquah, WA.