What Is a Data Warehouse?

June 25, 2020

Data Warehouse Defined

A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.

A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical capabilities allow organizations to derive valuable business insights from their data to improve decision-making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts. Because of these capabilities, a data warehouse can be considered an organization’s “single source of truth.”


Data Warehouse video

 

A typical data warehouse often includes the following elements:

  • A relational database to store and manage data
  • An extraction, loading, and transformation (ELT) solution for preparing the data for analysis
  • Statistical analysis, reporting, and data mining capabilities
  • Client analysis tools for visualizing and presenting data to business users
  • Other, more sophisticated analytical applications that generate actionable information by applying data science and artificial intelligence (AI) algorithms, or graph and spatial features that enable more kinds of analysis of data at scale

Organizations can also select a solution combining transaction processing, real-time analytics across data warehouses and data lakes, and machine learning in one MySQL Database service—without the complexity, latency, cost, and risk of extract, transform, and load (ETL) duplication.

Benefits of a Data Warehouse

Data warehouses offer the overarching and unique benefit of allowing organizations to analyze large amounts of variant data and extract significant value from it, as well as to keep a historical record.

Four unique characteristics (described by computer scientist William Inmon, who is considered the father of the data warehouse) allow data warehouses to deliver this overarching benefit. According to this definition, data warehouses are

  • Subject-oriented. They can analyze data about a particular subject or functional area (such as sales).
  • Integrated. Data warehouses create consistency among different data types from disparate sources.
  • Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
  • Time-variant. Data warehouse analysis looks at change over time.

A well-designed data warehouse will perform queries very quickly, deliver high data throughput, and provide enough flexibility for end users to “slice and dice” or reduce the volume of data for closer examination to meet a variety of demands—whether at a high level or at a very fine, detailed level. The data warehouse serves as the functional foundation for middleware BI environments that provide end users with reports, dashboards, and other interfaces.

Data Warehouse Architecture

The architecture of a data warehouse is determined by the organization’s specific needs. Common architectures include

  • Simple. All data warehouses share a basic design in which metadata, summary data, and raw data are stored within the central repository of the warehouse. The repository is fed by data sources on one end and accessed by end users for analysis, reporting, and mining on the other end.
  • Simple with a staging area. Operational data must be cleaned and processed before being put in the warehouse. Although this can be done programmatically, many data warehouses add a staging area for data before it enters the warehouse, to simplify data preparation.
  • Hub and spoke. Adding data marts between the central repository and end users allows an organization to customize its data warehouse to serve various lines of business. When the data is ready for use, it is moved to the appropriate data mart.
  • Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly and informally explore new datasets or ways of analyzing data without having to conform to or comply with the formal rules and protocol of the data warehouse.

The Evolution of Data Warehouses—From Data Analytics to AI and Machine Learning

When data warehouses first came onto the scene in the late 1980s, their purpose was to help data flow from operational systems into decision-support systems (DSSs). These early data warehouses required an enormous amount of redundancy. Most organizations had multiple DSS environments that served their various users. Although the DSS environments used much of the same data, the gathering, cleaning, and integration of the data was often replicated for each environment.

As data warehouses became more efficient, they evolved from information stores that supported traditional BI platforms into broad analytics infrastructures that support a wide variety of applications, such as operational analytics and performance management.

Data warehouse iterations have progressed over time to deliver incremental additional value to the enterprise with enterprise data warehouse (EDW).

Step Capability Business Value
1 Transactional reporting Provides relational information to create snapshots of business performance
2 Slice and dice, ad hoc query, BI tools Expands capabilities for deeper insights and more robust analysis
3 Predicting future performance (data mining) Develops visualizations and forward-looking business intelligence
4 Tactical analysis (spatial, statistics) Offers “what-if” scenarios to inform practical decisions based on more comprehensive analysis
5 Stores many months or years of data Stores data for only weeks or months

Supporting each of these five steps has required an increasing variety of datasets. The last three steps in particular create the imperative for an even broader range of data and analytics capabilities.

Today, AI and machine learning are transforming almost every industry, service, and enterprise asset—and data warehouses are no exception. The expansion of big data and the application of new digital technologies are driving change in data warehouse requirements and capabilities.

The autonomous data warehouse is the latest step in this evolution, offering enterprises the ability to extract even greater value from their data while lowering costs and improving data warehouse reliability and performance.

Find out more about autonomous data warehouses and get started with your own autonomous data warehouse.

Data Warehouses, Data Marts, and Operation Data Stores

Though they perform similar roles, data warehouses are different from data marts and operation data stores (ODSs). A data mart performs the same functions as a data warehouse but within a much more limited scope—usually a single department or line of business. This makes data marts easier to establish than data warehouses. However, they tend to introduce inconsistency because it can be difficult to uniformly manage and control data across numerous data marts.

ODSs support only daily operations, so their view of historical data is very limited. Although they work very well as sources of current data and are often used as such by data warehouses, they do not support historically rich queries.

What Is a Cloud Data Warehouse?

A cloud data warehouse uses the cloud to ingest and store data from disparate data sources.

The original data warehouses were built with on-premises servers. These on-premises data warehouses continue to have many advantages today. In many cases, they can offer improved governance, security, data sovereignty, and better latency. However, on-premises data warehouses are not as elastic and they require complex forecasting to determine how to scale the data warehouse for future needs. Managing these data warehouses can also be very complex.

On the other hand, some of the advantages of cloud data warehouses include:

The best cloud data warehouses are fully managed and self-driving, ensuring that even beginners can create and use a data warehouse with only a few clicks. An easy way to start your migration to a cloud data warehouse is to run your cloud data warehouse on-premises, behind your data center firewall which complies with data sovereignty and security requirements.

In addition, most cloud data warehouses follow a pay-as-you-go model, which brings added cost savings to customers.

What Is a Modern Data Warehouse?

Whether they’re part of IT, data engineering, business analytics, or data science teams, different users across the organization have different needs for a data warehouse.

A modern data architecture addresses those different needs by providing a way to manage all data types, workloads, and analysis. It consists of architecture patterns with necessary components integrated to work together in alignment with industry best practices. The modern data warehouse includes:

  • A converged database that simplifies management of all data types and provides different ways to use data
  • Self-service data ingestion and transformation services
  • Support for SQL, machine learning, graph, and spatial processing
  • Multiple analytics options that make it easy to use data without moving it
  • Automated management for simple provisioning, scaling, and administration

A modern data warehouse can efficiently streamline data workflows in a way that other warehouses can’t. This means that everyone, from analysts and data engineers to data scientists and IT teams, can perform their jobs more effectively and pursue the innovative work that moves the organization forward, without countless delays and complexity.

Designing a Data Warehouse

When an organization sets out to design a data warehouse, it must begin by defining its specific business requirements, agreeing on the scope, and drafting a conceptual design. The organization can then create both the logical and physical design for the data warehouse. The logical design involves the relationships between the objects, and the physical design involves the best way to store and retrieve the objects. The physical design also incorporates transportation, backup, and recovery processes.

Any data warehouse design must address the following:

  • Specific data content
  • Relationships within and between groups of data
  • The systems environment that will support the data warehouse
  • The types of data transformations required
  • Data refresh frequency

A primary factor in the design is the needs of the end users. Most end users are interested in performing analysis and looking at data in aggregate, instead of as individual transactions. However, often end users don’t really know what they want until a specific need arises. Thus, the planning process should include enough exploration to anticipate needs. Finally, the data warehouse design should allow room for expansion and evolution to keep pace with the evolving needs of end users.

The Cloud and the Data Warehouse

Data warehouses in the cloud offer the same characteristics and benefits of on-premises data warehouses but with the added benefits of cloud computing―such as flexibility, scalability, agility, security, and reduced costs. Cloud data warehouses allow enterprises to focus solely on extracting value from their data rather than having to build and manage the hardware and software infrastructure to support the data warehouse.

Do I Need a Data Lake?

Organizations use both data lakes and data warehouses for large volumes of data from various sources. The choice of when to use one or the other depends on what the organization intends to do with the data. The following describes how each is best used:

  • Data lakes store an abundance of disparate, unfiltered data to be used later for a particular purpose. Data from line-of-business applications, mobile apps, social media, IoT devices, and more is captured as raw data in a data lake. The structure, integrity, selection, and format of the various datasets is derived at the time of analysis by the person doing the analysis. When organizations need low-cost storage for unformatted, unstructured data from multiple sources that they intend to use for some purpose in the future, a data lake might be the right choice.
  • Data warehouses are specifically intended to analyze data. Analytical processing within a data warehouse is performed on data that has been readied for analysis—gathered, contextualized, and transformed—with the purpose of generating analysis-based insights. Data warehouses are also adept at handling large quantities of data from various sources. When organizations need advanced data analytics or analysis that draws on historical data from multiple sources across their enterprise, a data warehouse is likely the right choice.

Why Not Run Analytics Against Your OLTP Environment?

Data warehouses are relational environments that are used for data analysis, particularly of historical data. Organizations use data warehouses to discover patterns and relationships in their data that develop over time.

In contrast, transactional environments are used to process transactions on an ongoing basis and are commonly used for order entry and financial and retail transactions. They do not build on historical data; in fact, in OLTP environments, historical data is often archived or simply deleted to improve performance.

Data warehouses and OLTP systems differ significantly.

Data Warehouse OLTP System
Workload Accommodates ad hoc queries and data analysis Supports only predefined operations
Data modifications Automatically updates on a regular basis Updates by end users issuing individual statements
Schema design Uses partially denormalized schemas to optimize performance Uses fully normalized schemas to guarantee data consistency
Data scanning Encompasses thousands to millions of rows Accesses only a handful of records at a time
Historical data Stores many months or years of data Stores data for only weeks or months

Zero-Complexity Deployment: The Autonomous Data Warehouse

The most recent iteration of the data warehouse is the autonomous data warehouse, which relies on AI and machine learning to eliminate manual tasks and simplify setup, deployment, and data management. An as-a-service autonomous data warehouse in the cloud requires no human-performed database administration, hardware configuration or management, or software installation.

Creating the data warehouse, backing up, patching and upgrading the database, and expanding or reducing the database are all performed automatically—with the same flexibility, scalability, agility, and reduced costs that cloud platforms offer. The autonomous data warehouse removes complexity, speeds deployment, and frees up resources so organizations can focus on activities that add value to the business.

Oracle Autonomous Data Warehouse

Oracle Autonomous Data Warehouse is an easy-to-use, fully autonomous data warehouse that scales elastically, delivers fast query performance, and requires no database administration. The setup for Oracle Autonomous Data Warehouse is very simple and fast.