Understanding the Data
The first challenge of data integration is to understand the data. This is a problem at two levels.
At the higher level, there is a need to understand the business context and the value of the data: to know which parts of it are important, and which combinations of it can contribute to business deliverables.
At the most basic level, the problem is a lack of data models and documentation. The data is often described only by cryptic entity, attribute, and domain field names. These names meant something to the business subject matter expert, programmer, or database administrator who hurriedly created them to satisfy system constraints or a production deadline, but they are incomprehensible to those conducting the data integration or data migration effort.
As discussed in An Information Architecture Vision: Moving from Data Rich to Information Smart (see References), possibly only 10% of data holdings are documented, and the definitions of entities, attributes, and domain values are often cryptic and incomprehensible. In addition, without overarching data models there tends to be major redundancy, leading to poor data quality.
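A common first remediation is to build a data dictionary that maps cryptic legacy names to agreed business definitions and decoded domain values. The sketch below illustrates the idea in Python; the table name CUST_MST, the column names, and the status codes are hypothetical examples, not drawn from any particular system.

```python
# Minimal sketch of a data dictionary entry for a cryptic legacy table.
# All table, column, and code values are hypothetical examples.
from dataclasses import dataclass


@dataclass
class FieldDefinition:
    legacy_name: str     # cryptic name as it appears in the legacy schema
    business_name: str   # name agreed with the business subject matter expert
    description: str     # plain-language definition
    domain_values: dict  # decoded domain/code values, if any


# Example: decoding the hypothetical legacy table CUST_MST.
cust_mst_dictionary = [
    FieldDefinition("C_NO", "Customer Number", "Unique customer identifier", {}),
    FieldDefinition("ST_CD", "Status Code", "Customer account status",
                    {"A": "Active", "S": "Suspended", "C": "Closed"}),
]
```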
Four main issues are commonly encountered:
• Enterprise-structured data holdings are embedded within individual applications using incoherent tables in separate database files
• In some legacy systems, data may be embedded in the system code or held in flat or indexed file structures that cannot be accessed with current standard query languages such as SQL
• When Commercial Off-The-Shelf (COTS) applications or cloud services are used, often the data structures are not provided and there is only limited data download access
• There is often little or no metadata
Metadata provides context for the data, addressing items such as who created the data and when, its security classification, its data quality, and so on. Metadata is key to effective data integration, especially when redundant data must be consolidated and new metadata generated.
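As an illustration, the sketch below models a minimal metadata record covering the items named above. The class and field names are assumptions made for the example, not a prescribed metadata standard.

```python
# Minimal sketch of a metadata record for a data asset. The attribute set
# (creator, creation date, security classification, quality) follows the
# items named in the text; all names and values are illustrative only.
from dataclasses import dataclass
from datetime import date


@dataclass
class DataAssetMetadata:
    asset_name: str               # e.g., a table, file, or data set
    created_by: str               # who created the data
    created_on: date              # when it was created
    security_classification: str  # e.g., "Public", "Internal", "Confidential"
    quality_rating: str           # assessed data quality, e.g., "High", "Unverified"
    source_system: str            # system of origin, useful when consolidating redundant data


example = DataAssetMetadata(
    asset_name="CUST_MST",
    created_by="Order Management Team",
    created_on=date(2009, 6, 15),
    security_classification="Internal",
    quality_rating="Unverified",
    source_system="Legacy Order System",
)
```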