• Raw, curated, near-real-time data
• Some standardization, few linkages
• Multiple data sources
• Well suited for data science
• Supports exploratory analysis
• Addresses unknown questions and hypotheses
1.4.3.1. Data Lake
The advent of “big data” tools is enabling organizations to detect patterns in data that might otherwise escape human recognition. New tools in machine learning and artificial intelligence, described below, are advancing this capability further. These tools are designed to access large data sets in a range of native formats, but such access generally cannot target mission-critical systems for fear of adverse performance effects. This long-standing performance issue also applies to more traditional reporting and analytics techniques.

To enable state-of-the-art tools to operate, the industry has begun to co-locate large volumes of data in “raw” form. Such a repository is called a data lake; “raw” means that few, if any, operations are performed on the data when it is copied from operational systems into the data lake. As a result, the data lake contains a complete and detailed copy of the organization’s data assets, but stored in a form that is very unfriendly to human readers. It is well suited to support two activities: big data, machine learning, and artificial intelligence operations performed by highly skilled specialists; and extraction of key data into a data warehouse, as described below, without interfering with mission-critical systems. In the absence of a data lake, many modern technological tools cannot be brought to bear, and consequently many opportunities to improve service likely remain hidden.

There is a business need to give authorized users, including executives and analysts, secure access to easy-to-use reporting, dashboarding, and visualization tools. These tools should require few technical skills yet answer basic business questions in near real time, without adversely affecting the performance of operational systems, while drawing on integrated, enterprise-shared data.