by Yuli Vasiliev | Published April 2011
Get the basics of data warehouse development using Oracle Warehouse Builder 11g.
From school, we know that although most math problems are stated in plain English, we still have to form an equation with x, or sometimes a system of equations with x, y, and perhaps more variables, to find a solution. Similarly, in a decision support system, we have to design a set of data objects, such as dimensions and cubes, according to the business questions formulated in plain English, so that we can get those questions answered.
This article focuses on just that: constructing a dimensional environment for answering business questions. In particular, it explains how to move from a set of analytic questions to the data objects needed to answer them, using Oracle Warehouse Builder as the development tool. Since things are best understood by example, the article walks you through building a simple data warehouse.
A typical data warehouse concentrates on sales, helping users answer questions about the state of the business with results retrieved from a sales cube based on Time, Product, or Customer criteria. The example in this article deviates from that practice, however. Here, you'll look at how you might analyze the outgoing traffic of a Website, using information retrieved from a traffic cube based on Geography, Resource, and Time criteria.
Suppose you have a Website whose resources are hosted on more than one server, each of which may differ from the others in the way it stores traffic statistics. The storage types used vary from a relational database to flat files. You need to consolidate traffic statistics across all of the servers so that you can analyze user activity by resource accessed, date and time, and geographical location, answering questions such as the following:
At first glance, it looks like you can use SQL alone to get these questions answered. After all, the CUBE, ROLLUP, and GROUPING SETS extensions to SQL are specifically designed to perform aggregation over multiple dimensions of data. Remember, however, that some of the data here is stored in flat files, which makes a purely SQL-based approach impractical. Moreover, looking at the above questions, you may notice that some require a year of historical data to answer. What this means, in practice, is that you'll also need to access archives containing historical data derived from transaction data and implemented as separate sources. In SQL, managing all those sources and transforming the data in each one into a consistent format for unified query operations would be quite laborious and error-prone.
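For reference, this is the kind of multidimensional aggregation those extensions support when all the data does live in a single relational table. A hypothetical sketch (the traffic_log table and its columns are assumed here purely for illustration):

-- Hypothetical: assumes all traffic data sits in one traffic_log table,
-- which is not the case in this example.
SELECT country, siteresource, SUM(traffic) AS total_traffic
FROM traffic_log
GROUP BY GROUPING SETS ((country), (siteresource), (country, siteresource));

This single query returns totals by country, totals by resource, and totals for each country/resource combination.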
So, briefly, the main tasks here are:
A data warehouse is designed precisely to perform these tasks. Just to recap, a warehouse is a relational database tuned to handle analytic queries rather than transaction processing, and it is kept up to date with periodic refreshes that download a subset of data from the sources through the ETL (extraction, transformation, and loading) process, normally scheduled for a particular day of the week or a predetermined time of the day or night. The extracted data is transformed into a consistent format and loaded into the warehouse target objects. Once populated, the warehouse is typically available for queries through dimensional objects such as cubes and dimensions. Schematically, this might look like the figure below:
Figure 1 Gathering data from disparate sources and transforming it into useful information available to business users.
In particular, for this example, a simple warehouse consisting of a cube with several dimensions is an appropriate solution. Since traffic is the subject matter here, you might want to define outgoing traffic as the cube measure. For simplicity, in this example we will measure outgoing traffic by the size of the resource being accessed. For example, if someone downloads a 1MB file from your site, it's assumed that 1MB of outgoing traffic is generated. This is similar in meaning to how the dollar amount of a purchase depends on the price of the product chosen. While dollar amount is normally a measure of a sales cube, price is a characteristic of the product that is usually used as a dimension in that cube. A similar situation applies here: while outgoing traffic is our traffic cube's measure, resource size is a characteristic of a resource that will be used as a dimension.
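To make the analogy concrete, here is a sketch of how outgoing traffic would be derived by summing the sizes of the resources accessed, using the accesslog and RESOURCES source tables described later in this article:

-- Sketch only: the accesslog and resources tables are described
-- later in this article.
SELECT SUM(r.resourcesize) AS out_traffic
FROM accesslog a
JOIN resources r ON r.siteresource = a.siteresource;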
Moving on to dimensions, the set of dimensions to be used in the cube in this example can be determined by examining the list of questions you have to answer. So, looking over the questions listed at the beginning of this section, you might want to use the following dimensions to organize the data in the cube:
Each traffic record will have specific values for geographic location, resource, and day and time. To clarify, the time value in a traffic record refers to the time the resource was accessed.
The next essential step is to define the levels of aggregation of data for each dimension, organizing those levels into hierarchies. As for the Geography dimension, you might define the following hierarchy of levels (with the highest level listed first):
The hierarchy of levels for the Resource dimension might look like this:
The time dimension might contain the following hierarchy:
In more complex and realistic scenarios, a dimension may contain more than one hierarchy—for example, fiscal versus calendar year. In this particular example, however, each dimension will have only a single hierarchy.
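Hierarchies like these map naturally onto aggregation in queries. Continuing the hypothetical traffic_log sketch from earlier, SQL's ROLLUP produces subtotals at each level of the Geography hierarchy:

-- Hypothetical: totals per country, subtotals per region, and a grand total.
SELECT region, country, SUM(traffic) AS total_traffic
FROM traffic_log
GROUP BY ROLLUP (region, country);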
Now that you've decided what objects you need to have in the warehouse, you can design and build them. This task can be accomplished with Oracle Warehouse Builder, which is part of the standard installation of Oracle Database, starting with Oracle Database 11g Release 1. To enable it, though, some preliminary steps are required:
First of all, you need to unlock the database schemas used by Oracle Warehouse Builder. In Oracle Warehouse Builder 11g Release 1, the OWBSYS schema is used; in 11g Release 2, both OWBSYS and OWBSYS_AUDIT are used. These schemas hold the OWB design and runtime metadata. Unlocking can be done with the following commands, connecting to SQL*Plus as SYS or SYSDBA:
ALTER USER OWBSYS IDENTIFIED BY owbsyspwd ACCOUNT UNLOCK;
ALTER USER OWBSYS_AUDIT IDENTIFIED BY owbsys_auditpwd ACCOUNT UNLOCK;
Next, you must create a Warehouse Builder workspace. A workspace contains the objects for one or more data warehousing projects; in complex environments, you may have several workspaces. (Instructions for creating a workspace are in the Oracle Warehouse Builder Installation and Administration Guide for Windows and Linux. Follow the instructions to create a new workspace with a new user as the workspace owner.)
Now you can launch the Warehouse Builder Design Center, which is the primary graphical user interface of Oracle Warehouse Builder. Click Show Details and connect to the Design Center as the newly created workspace user, with the required host/port/service name or Net Service name.
Before going any further, though, let's outline the set of tasks to accomplish. Broadly described, the tasks to be done in this example are:
The following sections describe how to accomplish the above tasks, implementing the dimensional solution discussed here. Before you proceed, though, you have to decide which implementation model to use. In fact, you have two options: a relational target warehouse, which stores the actual data in relational tables, or a multidimensional warehouse. In the latter case, dimensional data is stored in an Oracle OLAP analytic workspace. This feature is available in Oracle Database 10g and Oracle Database 11g. For this example, the dimensional model will be implemented as a relational target warehouse.
In this initial step, you begin by creating a new project, or configuring the default one, in the OWB Design Center. Then you identify the target schema that will contain the target data objects: the dimensions and cube described earlier in this article.
Assuming you've decided to use the default project MY_PROJECT, let's move on to creating the target schema. The steps below describe the process of creating the target schema and then a target module upon that schema in the Design Center:
As a result of the above steps, the owbtarget schema is created in the database. (Also, the owbtarget user should show up under the Security->Users node in the Globals Navigator.) The next step is to create a target module upon the newly created database schema. As a quick recap, you use modules in Warehouse Builder to organize the objects you’re dealing with into subject-oriented groups. So, the following steps describe how you might build an Oracle module upon the owbtarget database schema:
As a result of the above steps, the TARGET_MDL module should appear under the MY_PROJECT->Databases->Oracle node in the Projects Navigator. If you expand the module node, you’ll see what types of objects you can create under it. Among others, it includes nodes for holding: cubes, dimensions, tables, and external tables.
Here, you will need not only to extract data from disparate sources but also to transform the extracted data so that it can be consolidated into a single data source. Thus, this task usually has the following stages:
What you need to start with, however, is a general strategy for extracting the source data, transforming it, and loading it into the warehouse. That is, you first have to make strategic decisions about how best to implement the task of consolidating data from the data sources.
As far as flat files are concerned, your first decision is probably how you're going to move data from them into the warehouse. The available options are SQL*Loader and external tables. In this particular example, using external tables is the preferable option because the data being extracted from the flat files has to be joined with relational data. If you recall, the example assumes the source data is to be extracted from both database tables and flat files.
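The advantage of an external table is that, once deployed, the flat file data can be queried and joined with regular tables in ordinary SQL. A sketch, assuming the ACCESS_CSV_EXT external table you'll create later in this article and the RESOURCES source table:

-- Sketch: the external table exposes the access.csv rows to SQL, so they
-- can be joined with relational data such as the resources table.
SELECT e.userip, e.siteresource, r.resourcesize
FROM access_csv_ext e
JOIN resources r ON r.siteresource = e.siteresource;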
Your next decision is whether to define a separate source module for the source data objects you're going to use. Although it's generally considered good practice to keep source and target objects in separate modules, for this simple example we will create all objects in a single database module.
Now let’s take a closer look at the data sources that will be accessed.
As mentioned earlier, what we have here is a Website whose resources are hosted on more than one server, each of which differs from the others in the way it stores traffic statistics. For example, one server stores the statistics in flat files and another in the database. The content of a flat file containing real-time data, called, say, access.csv, might look like this:
User IP,Date Time,Site Resource
67.212.160.0,5-Jan-2011 20:04:00,/rdbms/demo/demo.zip
85.172.23.0,8-Jan-2011 12:54:28,/articles/vasiliev_owb.html
80.247.139.0,10-Jan-2011 19:43:31,/tutorials/owb_oracle11gr2.html
As you can see, the above file contains information about the resources accessed by users, stored in comma-separated value (CSV) format. In turn, the server that uses the database in place of flat files might store the same information in an accesslog table with the following structure:
USERIP VARCHAR2(15)
DATETIME DATE
SITERESOURCE VARCHAR2(200)
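If you'd like to experiment with this source outside of Warehouse Builder, a minimal sketch of the accesslog table with one sample row might look like this:

-- Minimal sketch of the accesslog source table described above.
CREATE TABLE accesslog (
  userip       VARCHAR2(15),
  datetime     DATE,
  siteresource VARCHAR2(200)
);
INSERT INTO accesslog VALUES ('85.172.23.0',
  TO_DATE('08-JAN-2011 12:54:28', 'DD-MON-YYYY HH24:MI:SS'),
  '/articles/vasiliev_owb.html');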
As you might guess, in this example, the IP address data is used to identify the geolocation of the user accessing a resource. In particular, it allows you to deduce the geographic location down to the region, country, city, and even the organization the IP address belongs to. To obtain this information from an IP address, you might use one of a number of free or paid-subscription geolocation databases available today. Alternatively, you might use the geolocation information provided by the user during registration, thus relying on the information stored in your own database. In that case, however, you'd probably want to rely on the user's ID rather than the IP address.
For the purpose of this example, we will use a free geolocation database that identifies IP address ranges at the country level: MaxMind's GeoLite Country database. (MaxMind also offers more accurate paid databases with country- and city-level geolocation data.) For more details, you can check out the MaxMind Website.
The GeoLite Country database is distributed as a CSV file containing geographical data for publicly assigned IPv4 addresses, thus allowing you to determine the user's country based on the IP address. To take advantage of this database, you need to download the zipped CSV file, unzip it, and then import the data into your data warehouse. The imported data will then be joined with the Web traffic statistics data obtained from the flat files and the database discussed earlier in this section.
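To give you an idea of the format, a row of the GeoLite Country CSV looks roughly like this (the values shown are illustrative, not taken from an actual release of the database):

"4.0.0.0","4.17.135.31","67108864","68257567","US","United States"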
Examining the structure of the GeoLite Country CSV file, you may notice that aside from the IP address ranges, each of which is assigned to a particular country and defined by its beginning and ending IP addresses in dot-decimal notation, the file also includes the corresponding IP numbers derived from those addresses with the help of the following formula:
IP Number = 16777216*w + 65536*x + 256*y + z
where
IP Address = w.x.y.z
The obvious advantage of using IP numbers rather than raw IP addresses is that IP numbers, being regular decimal numbers, can easily be compared, which simplifies the task of determining which country the corresponding IP address belongs to. The problem, however, is that our traffic statistics data sources store raw IP addresses rather than the numbers derived from them. You will have to transform the Web traffic data so that the result of this transformation includes IP numbers rather than IP addresses.
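For example, applying the formula to the first IP address in the access.csv sample shown earlier, 67.212.160.0, gives:

IP Number = 16777216*67 + 65536*212 + 256*160 + 0 = 1138008064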
The following diagram gives a graphical depiction of transforming and joining data:
Figure 2 Oracle Warehouse Builder extracts, transforms, and joins the source data.
Remember that dimension and cube data are usually derived from more than one data source. For this example, in addition to the traffic statistics and geolocation data, you will also need data sources containing the resource and region information. For that purpose, assume you have two database tables: RESOURCES and REGIONS. Assume the RESOURCES table has the following structure:
SITERESOURCE VARCHAR2(200) PRIMARY KEY
RESOURCESIZE NUMBER(12)
RESOURCEGROUP VARCHAR2(10)
And assume the REGIONS table is defined as follows:
COUNTRYID VARCHAR2(2) PRIMARY KEY
REGION VARCHAR2(2)
The data in the above tables will be joined with the Web traffic statistics and geolocation data.
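Putting it all together, the consolidation you'll implement in Warehouse Builder amounts, conceptually, to a join like the following sketch. Here, ip_to_num is a hypothetical name for the IP-address-to-IP-number transformation function you'll create later in this article, and GEOLOCATION_CSV_EXT is the external table that will expose the GeoLite Country data:

-- Conceptual sketch of the consolidation join; not the actual OWB mapping.
SELECT rg.region, g.countryid, r.resourcegroup, a.siteresource,
       a.datetime, SUM(r.resourcesize) AS out_traffic
FROM accesslog a
JOIN geolocation_csv_ext g
  ON ip_to_num(a.userip) BETWEEN TO_NUMBER(g.startnum) AND TO_NUMBER(g.endnum)
JOIN resources r ON r.siteresource = a.siteresource
JOIN regions rg ON rg.countryid = g.countryid
GROUP BY rg.region, g.countryid, r.resourcegroup, a.siteresource, a.datetime;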
Now that you understand the structure and the meaning of your source data, it’s time to move on and define all the necessary data objects in the Warehouse Builder. We will create the objects needed for the flat files first. The general steps to perform are the following:
To create the flat file module, follow these steps in the Design Center:
Now you can define a new flat file within the newly created flat file module. Let’s start with creating a flat file object for the access.csv file you saw earlier in this section:
Name SQL Type SQL Length
USERIP VARCHAR2 15
DATETIME DATE
SITERESOURCE VARCHAR2 200
At this point, you may want to commit your changes to the repository: select File->Save All.
Repeat the above steps for the GeoIPCountryWhois.csv file that contains the geolocation data, defining the following properties on the Field Properties screen of the wizard:
Name SQL Type SQL Length
STARTIP VARCHAR2 15
ENDIP VARCHAR2 15
STARTNUM VARCHAR2 10
ENDNUM VARCHAR2 10
COUNTRYID VARCHAR2 2
COUNTRYNAME VARCHAR2 100
Once these are done, define external table objects in the target module. These will expose the flat file data as tables in the database. To define an external table upon the ACCESS_CSV_FF flat file object created earlier, follow the steps below:
Repeat the above steps for the GEOLOCATION_CSV_FF flat file object.
Now that you have all of the necessary object definitions created, you have to deploy them to the target schema before you can use them. Another preliminary step is to make sure that the target schema in the database has been granted the privileges to create and drop directories. For that, you can connect to SQL*Plus as SYSDBA and issue the following statements:
GRANT CREATE ANY DIRECTORY TO owbtarget;
GRANT DROP ANY DIRECTORY TO owbtarget;
After that, you can come back to the Design Center and proceed to deploying. The following steps describe how to deploy the external tables:
If the deployment completes successfully, the definitions of the external tables have been created in the target schema within the database, and you can therefore query those tables. To confirm that everything is going as planned so far, it's a good idea to look at the data you can access through the newly deployed tables. The simplest way to do this is to select the Data… command in the popup menu that appears when you right-click the node of an external table in the Projects Navigator.
While you should have no problem with the GEOLOCATION_CSV_EXT table, which contains about 140,000 rows, you may see nothing when it comes to the ACCESS_CSV_EXT data. The first thing to check to determine where the problem lies is ACCESS_CSV_EXT's access parameters, which you can examine through the ALL_EXTERNAL_TABLES data dictionary view. So, connected to SQL*Plus as SYSDBA, you might issue the following query:
SELECT access_parameters FROM all_external_tables
WHERE table_name ='ACCESS_CSV_EXT';
The output should look like this:
records delimited by newline
characterset we8mswin1252
string sizes are in bytes
nobadfile
nodiscardfile
nologfile
fields terminated by ','
notrim
("USERIP" char,
"DATETIME" char,
"SITERESOURCE" char
)
Examining the above, you might notice that the DATETIME field comes with no date mask, which causes a problem when the format of the date data differs from the default. The problem can be fixed with the following ALTER TABLE statement:
ALTER TABLE owbtarget.access_csv_ext ACCESS PARAMETERS
(records delimited by newline
characterset we8mswin1252
string sizes are in bytes
nobadfile
nodiscardfile
nologfile
fields terminated by ','
notrim
("USERIP" char,
"DATETIME" char date_format date mask "dd-mon-yyyy hh24:mi:ss",
"SITERESOURCE" char
)
);
Now, returning to the Design Center, if you click the Execute Query button in the Data-ACCESS_CSV_EXT window, you should see the rows derived from the access.csv file.
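Alternatively, you can verify the fix directly from SQL*Plus with a simple query such as this:

-- Confirm that the dates in access.csv are now parsed correctly.
SELECT userip, TO_CHAR(datetime, 'DD-MON-YYYY HH24:MI:SS') accessed,
       siteresource
FROM owbtarget.access_csv_ext;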
All that is left to complete the task of creating the data source object definitions is to import the metadata for the source database tables ACCESSLOG, RESOURCES, and REGIONS described earlier in this section. To do this, you can follow these steps:
As a result of the above steps, the ACCESSLOG, REGIONS, and RESOURCES objects should appear under the MY_PROJECT->Databases->Oracle->TARGET_MDL->Tables node in the Projects Navigator.
Having created and deployed the source object definitions, let's build the target structure. In particular, you'll need to build a Traffic cube for storing aggregated traffic data. Before moving on to building the cube, though, you'll have to build the dimensions that will make up its edges.
If you recall from the discussion in the beginning of the article, you need to define the following three dimensions to organize the data in the cube: Geography, Resource, and Time. The following steps describe how you might build the Geography dimension and then load data into it:
Region Country
Now you will define an ETL mapping that loads the GEOGRAPHY_DM dimension from the source data. The steps are as follows:
Group Resource
Finally, you need to create a time dimension. The easiest way to do this is to use the Create Time Dimension wizard, which defines both a Time Dimension object and an ETL mapping to load it for you. For details on how to create and populate a time dimension, you can refer to the Creating Time Dimensions section in the Oracle Warehouse Builder Data Modeling, ETL, and Data Quality Guide.
Once you have the dimensions set up, carry out the following steps to define a cube:
RESOURCE_DM GEOGRAPHY_DM TIME_DM
OUT_TRAFFIC with the data type NUMBER
As with the dimensions, the next step is to create a mapping that defines how the source data will be loaded into the cube.
So now, you have to design ETL mappings that will transform the source data and load it into the cube. Here is the list of the transformation operations you’ll need to design:
As mentioned, the above operations must be described in a mapping. Before moving on to creating the mapping, though, let's define the transformation described in the second step above. This transformation will be implemented as a PL/SQL function. The following steps describe how you might do that without leaving the Design Center:
p NUMBER;           -- position of the next dot in the remaining string
ipnum NUMBER;       -- the accumulated IP number
ipstr VARCHAR2(15); -- the not-yet-processed tail of the input address
BEGIN
  ipnum := 0;
  ipstr := ipadd;
  -- Process the first three octets, weighting each one by the
  -- appropriate power of 256 (256^3, 256^2, 256^1).
  FOR i IN 1..3 LOOP
    p := INSTR(ipstr, '.', 1, 1);
    ipnum := TO_NUMBER(SUBSTR(ipstr, 1, p - 1)) * POWER(256, 4 - i) + ipnum;
    ipstr := SUBSTR(ipstr, p + 1);
  END LOOP;
  -- What remains of ipstr is the last octet; add it as is.
  ipnum := ipnum + TO_NUMBER(ipstr);
  RETURN ipnum;
END;
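Once the function has been deployed to the target schema, you might sanity-check it from SQL*Plus. (The name ip_to_num is used here purely for illustration; substitute whatever name you gave the function in the wizard.)

-- 67.212.160.0 should yield 1138008064, matching the formula shown earlier.
SELECT ip_to_num('67.212.160.0') AS ipnum FROM dual;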
Now you can create a mapping in which you then define how data from the source objects will be loaded into the cube:
(INGRP2.USERIP BETWEEN INGRP1.STARTNUM AND INGRP1.ENDNUM)
AND
(INGRP2.SITERESOURCE = INGRP3.SITERESOURCE)
INGRP1.COUNTRYID, INGRP1.SITERESOURCE, INGRP1.DATETIME
TRUNC(INGRP1.DATETIME, 'DD')
RESOURCESIZE to OUT_TRAFFIC
COUNTRYID to GEOGRAPHY_DM_NAME
DATETIME to TIME_DM_DAY_START_DATE
SITERESOURCE to RESOURCE_DM_NAME
By now the mapping canvas should look like the figure below:
Figure 3 The mapping canvas, showing the TRAFFIC_MAP mapping that loads data from the source objects into the cube.
Once you've completed the above steps, you have the TRAFFIC cube populated with the data from the sources in accordance with the logic implemented in the mapping. Practically speaking, however, what you actually have populated is the fact table (the TRAFFIC_TAB table in this particular example). In other words, cube records are stored in the fact table; the cube itself is just a logical representation, or visualization, of the dimensional data used here.
Similarly, dimensions are physically bound to corresponding dimension tables, which store the dimensions' data in the database. Dimension tables are joined to the fact table with foreign keys, making up a model known as a star schema (so called because the diagram of such a schema resembles a star). Oracle Database's query optimizer can apply powerful optimization techniques to star queries (join queries issued against the fact table and the dimension tables joined to it), thus providing efficient performance for the queries that answer business questions.
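For instance, a star query answering "How much outgoing traffic did each country generate in January 2011?" might look like the following sketch. The table and column names here are illustrative; Warehouse Builder generates the actual names from your object definitions:

-- Illustrative star query; the generated names in your schema may differ.
SELECT g.country_name, SUM(t.out_traffic) AS total_traffic
FROM traffic_tab t
JOIN geography_tab g ON g.dimension_key = t.geography_dm
JOIN time_tab m ON m.dimension_key = t.time_dm
WHERE m.day_start_date BETWEEN DATE '2011-01-01' AND DATE '2011-01-31'
GROUP BY g.country_name;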
Business intelligence, the process of gathering information to support decision making, needs a foundation for its environment. A data warehouse provides that foundation: a relational database designed for exactly this purpose, consolidating information gathered from disparate sources and providing business users with access to that information so that they can make better decisions.
As you saw in this article, a small data warehouse may consist of a single cube and just a few dimensions, which make up the edges of that cube. In particular, you looked at an example of how traffic statistics data can be organized into a cube whose edges contain values for Geography, Resource, and Time dimensions.
Yuli Vasiliev is a software developer, freelance author, and consultant currently specializing in open source development, Java technologies, business intelligence (BI), databases, and service-oriented architecture (SOA). He is the author of Oracle Business Intelligence: The Condensed Guide to Analysis and Reporting (Packt, 2010) as well as a series of other books on the Oracle technology.