In this tutorial you learn how to use the new SQL join syntax in the Oracle Database 10g SQL to fill gaps in sparse data.
Approximately 30 minutes
This tutorial covers the following topics:
Overview | |
Prerequisites | |
Syntax | |
Sample of Sparse Data | |
Filling Gaps in Data | |
Example of Period-to-Period Comparison for Multiple Time Levels |
|
Summary |
Place the cursor over this icon to load and view all the screenshots for this tutorial. (Caution: This action loads all screenshots simultaneously, so response time may be slow depending on your Internet connection.)
Note: Alternatively, you can place the cursor over an individual icon in the following steps to load and view only the screenshot associated with that step. You can hide an individual screenshot by clicking it.
Data is usually stored in sparse form. That is, if no value exists for a given time, no row exists in the fact table. However, time series calculations can be performed most easily when data is dense along the time dimension. This is because dense data fills a consistent number of rows for each period, which in turn makes it simple to use the analytic windowing functions with physical offsets. Refer to Chapter 21: Data Warehousing Guide for more information.
To overcome the problem of sparsity, you can use a partitioned outer join to fill the gaps in a time series. Such a join extends the conventional outer join syntax by applying the outer join to each logical partition defined in a query. The Oracle database logically partitions the rows in your query based on the expression you specify in the PARTITION BY clause. The result of a partitioned outer join is a UNION of the outer joins of each of the groups in the logically partitioned table with the table on the other side of the join.
Note that you can use this type of join to fill the gaps in any dimension, not just the time dimension. In this tutorial, you will focus on the time dimension because it is the dimension most frequently used as a basis for comparisons.
Before starting this tutorial, you should:
1. |
Perform the Installing Oracle Database 10g on Windows tutorial. |
2. |
Download and unzip outer_j.zip into your working directory (i.e. c:\wkdir) |
The syntax for partitioned outer join extends the ANSI SQL JOIN clause with the phrase PARTITION BY followed by an expression list. The expressions in the list specify the group to which the outer join is applied. Following are the two forms of syntax typically used for partitioned outer join:
SELECT select_expression FROM table_reference PARTITION BY (expr [, expr ]... ) RIGHT OUTER JOIN table_reference |
SELECT select_expression FROM table_reference LEFT OUTER JOIN table_reference PARTITION BY {expr [,expr ]...) |
Note that FULL OUTER JOIN is not supported with a partitioned outer join. Refer to the Oracle Database 10g SQL Reference for further information regarding syntax and restrictions.
A typical situation with a sparse dimension is shown in the following example, which computes the weekly sales and year-to-date sales for the product "Bounce" for weeks 2030 in 2000 and 2001:
SELECT |
PRODUCT_NAME YEAR WEEK SALES 18 rows selected. |
In this example, you would expect 22 rows of data (11 weeks
each from 2 years) if the data were dense. However you see only 18 rows because
weeks 25 and 26 are missing in 2000, and weeks 26 and 28 are missing in 2001.
Gaps in time series make calculations, such as year-over-year comparisons, difficult to compute. When there are no gaps, you can compare data by referring from one row to another row a fixed distance away using the analytic functions LEAD() and LAG(). For example, if you retrieve month-level data and want to refer to data from 12 months ago, it is convenient to access data 12 rows before the current value. You cannot reliably use the LEAD() and LAG() functions when the number of rows per period (or whatever other dimension is used as the divider) is inconsistent.
How can you fill in the gaps in the preceding example with a partitioned outer join?
You can take the sparse data of our query above and do a partitioned outer join with a dense set of time data. In the query shown below, the original query is aliased as v and the data retrieved from the times table is aliased as t. Here you see 22 rows because there are no gaps in the series. The four added rows each have 0 as their sales value set to 0 by using the NVL() function..
SELECT Product_Name, t.Year, t.Week, NVL(Sales,0) dense_sales |
PRODUCT_NAME YEAR WEEK DENSE_SALES |
Note that in the query above a WHERE condition for weeks between 20 and 30 is placed in the inline view for the time dimension. This step reduces the number of rows handled by the outer join, which saves processing time.
How do you combine this technique with analytic SQL functions
to obtain cumulative sales for the desired weeks?
1. |
Start a SQL *Plus session. Select Start > Programs > Oracle-OraDB10g_home > Application Development > SQL Plus. (Note: This tutorial assumes you have an c:\wkdir folder. If you do not, you will need to create one and unzip the contents of outer_j.zip into this folder. While executing the scripts, paths are specified) |
2. |
Log in as the SH user. Enter SH as the User Name and SH as the Password. Then click OK. |
3. |
From your SQL * Plus session, execute the following script. @c:\wkdir\fg.sql The fg.sql script contains the following: SELECT Product_Name, t.Year, t.Week, Sales, Weekly_ytd_sales FROM (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year, t.Calendar_Week_Number Week, NVL(SUM(Amount_Sold),0) Sales, SUM(SUM(Amount_Sold)) OVER (PARTITION BY p.Prod_Name, t.Calendar_Year ORDER BY t.Calendar_Week_Number) Weekly_ytd_sales FROM Sales s, Times t, Products p WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND t.Calendar_Week_Number BETWEEN 20 AND 30 GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number ) v PARTITION BY (v.Product_Name) RIGHT OUTER JOIN (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year FROM Times WHERE Calendar_Year in (2000, 2001) ) t ON (v.week = t.week AND v.Year = t.Year) WHERE t.Week BETWEEN 20 AND 30 ORDER BY 1, 2, 3; In this query, the weekly year-to-date sales are calculated alongside the weekly sales. The NULL values that the partitioned outer join inserts in making the time series dense are handled in the usual way: the SUM function treats them as 0s.
|
There are queries in which a partitioned outer join will return rows with NULL values, but you may want those rows to hold the most recent non-NULL value in the series. That is, if you want to have NULLs replaced with the first non-NULL value you see as you scan upward in a column.
Inventory tables, which track quantity of units available for various products, are a common case that needs such output. Inventory tables are sparse: like sales tables, they need to only store a row for a product when there is an event. For a sales table the event is a sale, and for the inventory table, the event is a change in quantity available for a product. If you make the inventory's time dimension dense, you want to see a quantity value for each day. The value to output is the most recent non-NULL value. Note that this differs from the prior example with cumulative sales. In that query, the cumulative sum calculation treats NULLs as 0s, so it presents correct values. That approach cannot work with inventory and similar tables because the value to place in rows with NULLs is not a sum.
Here an example is presented of partitioned outer join with an inventory table. It replaces NULLs with the nearest non-NULL value.
In the next task, you use the outer join feature to compare values across time periods. Specifically, you will calculate a year-over-year sales comparison at the week level. The query will return on the same row, for each product, the year-to-date sales for each week of 2001 with that of 2000.
1. |
To improve readability of the query and focus on the partitioned outer join, use a WITH clause to start the query. Execute the following SQL*Plus script: @c:\wkdir\pp.sql The pp.sql script contains the following: WITH v AS (SELECT p.Prod_Name Product_Name, t.Calendar_Year Year, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales FROM Sales s, Times t, Products p WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND p.Prod_name in ('Y Box') AND t.Calendar_Year in (2000,2001) AND t.Calendar_Week_Number BETWEEN 30 AND 40 GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number ) SELECT substr(Product_Name,1,12) Prod, Year, Week, Sales, Weekly_ytd_sales, Weekly_ytd_sales_prior_year FROM (SELECT --Start of year_over_year sales Product_Name, Year, Week, Sales, Weekly_ytd_sales, LAG(Weekly_ytd_sales, 1) OVER (PARTITION BY Product_Name, Week ORDER BY Year) Weekly_ytd_sales_prior_year FROM (SELECT --Start of dense_sales v.Product_Name Product_Name, t.Year Year, t.Week Week, NVL(v.Sales,0) Sales, SUM(NVL(v.Sales,0)) OVER (PARTITION BY v.Product_Name, t.Year ORDER BY t.week) weekly_ytd_sales FROM v PARTITION BY (v.Product_Name) RIGHT OUTER JOIN (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year FROM Times WHERE Calendar_Year IN (2000, 2001) ) t ON (v.week = t.week AND v.Year = t.Year) ) dense_sales ) year_over_year_sales WHERE Year = 2001 AND Week BETWEEN 30 AND 40 ORDER BY 1, 2, 3; In the FROM clause of the in-line view DENSE_SALES, a partitioned outer join of aggregate view v and time view t is used to fill gaps in the sales data along the time dimension. The output of the partitioned outer join is then processed by the analytic function SUM ... OVER to compute the weekly year-to-date sales (the "weekly_ytd_sales" column). Thus, the view DENSE_SALES computes the year-to-date sales data for each week, including those missing in the aggregate view s. The in-line view YEAR_OVER_YEAR_SALES then computes the year ago weekly year-to-date sales using the LAG function. The LAG function labeled "weekly_ytd_sales_prior_year" specifies a PARTITION BY clause that pairs rows for the same week of years 2000 and 2001 into a single partition. An offset of 1 is passed to the LAG function to get the weekly year-to-date sales for the prior year. The outermost query block selects data from YEAR_OVER_YEAR_SALES with the condition yr = 2001, and thus the query returns, for each product, its weekly year-to-date sales in the specified weeks of years 2001 and 2000.
|
Whereas the previous example showed you a way to create comparisons for a single time level, it is even more useful to handle multiple time levels in a single query. For example, you can compare sales versus the prior period at the year, quarter, month, and day levels.
For the next task, you create a query that performs a year-over-year comparison of year-to-date sales for all levels of our time hierarchy.
Several steps are needed to perform this task. The goal is a single query with comparisons at the day, week, month, quarter, and year levels. You will use a materialized view MV_PROD_TIME that holds a hierarchical cube of sales aggregated across TIMES and PRODUCTS. Along with the materialized view, you will create a view on top of it. Also, you create a view of the time dimension to use as an edge of the cube. The time edge is a partition outer joined to the sparse data in the materialized view.
For more information regarding
hierarchical cubes, see the chapter titled "SQL for Aggregation in Data
Warehouses" in the Data Warehousing Reference Guide.
1. |
Create the materialized view. Note that the query is limited to only two products to keep processing time short. Execute the following SQL*Plus script: @c:\wkdir\cm1.sql The cm1.sql script contains the following: CREATE MATERIALIZED VIEW mv_prod_time REFRESH COMPLETE ON DEMAND AS SELECT (CASE WHEN ((GROUPING(calendar_year)=0 ) AND (GROUPING(calendar_quarter_desc)=1 )) THEN (TO_CHAR(calendar_year) || '_0') WHEN ((GROUPING(calendar_quarter_desc)=0 ) AND (GROUPING(calendar_month_desc)=1 )) THEN (TO_CHAR(calendar_quarter_desc) || '_1') WHEN ((GROUPING(calendar_month_desc)=0 ) AND (GROUPING(t.time_id)=1 )) THEN (TO_CHAR(calendar_month_desc) || '_2') ELSE (TO_CHAR(t.time_id) || '_3') END) Hierarchical_Time, calendar_year year, calendar_quarter_desc quarter, calendar_month_desc month, t.time_id day, prod_category cat, prod_subcategory subcat, p.prod_id prod, GROUPING_ID(prod_category, prod_subcategory, p.prod_id, calendar_year, calendar_quarter_desc, calendar_month_desc,t.time_id) gid, GROUPING_ID(prod_category, prod_subcategory, p.prod_id) gid_p, GROUPING_ID(calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id) gid_t, SUM(amount_sold) s_sold, COUNT(amount_sold) c_sold, COUNT(*) cnt FROM SALES s, TIMES t, PRODUCTS p WHERE s.time_id = t.time_id AND p.prod_name in ('Bounce', 'Y Box') AND s.prod_id = p.prod_id GROUP BY ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id), ROLLUP(prod_category, prod_subcategory, p.prod_id); Because the materialized view is limited to two products, it has just over 2200 rows. Note that the Hierarchical_Time column contains string representations of time from all levels of the time hierarchy. The CASE expression used for the Hierarchical_Time column appends a marker (_0, _1, ...) to each date string to denote the time level of the value. _0 represents the year level, _1 is quarters, _2 is months, and _3 is day. Note that the GROUP BY clause is a concatenated ROLLUP that specifies the roll-up hierarchy for the time and product dimensions. The GROUP BY clause determines the hierarchical cube contents.
|
2. |
Create a view CUBE_PROD_TIME with the same definition as the materialized view MV_PROD_TIME. Execute the following SQL*Plus script: @c:\wkdir\cv1.sql The cv1.sql script contains the following: CREATE OR REPLACE VIEW cube_prod_time AS SELECT (CASE WHEN ((GROUPING(calendar_year)=0 ) AND (GROUPING(calendar_quarter_desc)=1 )) THEN (TO_CHAR(calendar_year) || '_0') WHEN ((GROUPING(calendar_quarter_desc)=0 ) AND (GROUPING(calendar_month_desc)=1 )) THEN (TO_CHAR(calendar_quarter_desc) || '_1') WHEN ((GROUPING(calendar_month_desc)=0 ) AND (GROUPING(t.time_id)=1 )) THEN (TO_CHAR(calendar_month_desc) || '_2') ELSE (TO_CHAR(t.time_id) || '_3') END) Hierarchical_Time, calendar_year year, calendar_quarter_desc quarter, calendar_month_desc month, t.time_id day, prod_category cat, prod_subcategory subcat, p.prod_id prod, GROUPING_ID(prod_category, prod_subcategory, p.prod_id, calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id) gid, GROUPING_ID(prod_category, prod_subcategory, p.prod_id) gid_p, GROUPING_ID(calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id) gid_t, SUM(amount_sold) s_sold, COUNT(amount_sold) c_sold, COUNT(*) cnt FROM SALES s, TIMES t, PRODUCTS p WHERE s.time_id = t.time_id AND p.prod_name IN ('Bounce', 'Y Box') AND s.prod_id = p.prod_id GROUP BY ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id), ROLLUP(prod_category, prod_subcategory, p.prod_id);
|
3. |
You create a view EDGE_TIME which is a complete set of date values. EDGE_TIME is the source for filling time gaps with a partitioned outer join. The HIERARCHICAL_TIME column in EDGE_TIME will be used in a partitioned join with the HIERARCHICAL_TIME column in the CUBE_PROD_TIME view. Execute the following SQL*Plus script: @c:\wkdir\cv2.sql The cv2.sql script contains the following: CREATE OR REPLACE VIEW edge_time AS SELECT (CASE WHEN ((GROUPING(calendar_year)=0 ) AND (GROUPING(calendar_quarter_desc)=1 )) THEN (TO_CHAR(calendar_year) || '_0') WHEN ((GROUPING(calendar_quarter_desc)=0 ) AND (GROUPING(calendar_month_desc)=1 )) THEN (TO_CHAR(calendar_quarter_desc) || '_1') WHEN ((GROUPING(calendar_month_desc)=0 ) AND (GROUPING(time_id)=1 )) THEN (TO_CHAR(calendar_month_desc) || '_2') ELSE (TO_CHAR(time_id) || '_3') END) Hierarchical_Time, calendar_year yr, calendar_quarter_number qtr_num, calendar_quarter_desc qtr, calendar_month_number mon_num, calendar_month_desc mon, time_id - TRUNC(time_id, 'YEAR') + 1 day_num, time_id day, GROUPING_ID(calendar_year, calendar_quarter_desc, calendar_month_desc, time_id) gid_t FROM TIMES GROUP BY ROLLUP (calendar_year, (calendar_quarter_desc, calendar_quarter_number), (calendar_month_desc, calendar_month_number), time_id);
|
4. |
You now have the required elements for the comparison query. You can obtain period-to-period comparison calculations at all time levels. It requires applying analytic functions to a hierarchical cube with dense data along the time dimension. Some of the calculations you can achieve for each time level are:
The following example performs all four of these calculations. It uses a partitioned outer join of the CUBE_PROD_TIME and EDGE_TIME views to create an in-line view of dense data called DENSE_CUBE_PROD_TIME. The query then uses the LAG function in the same way as the prior single-level example. The outer WHERE clause specifies time at three levels: the days of August 2001, the entire month, and the entire third quarter of 2001. Note that the last two rows of the results contain the month-level and quarter-level aggregations. Execute the following SQL*Plus script: @c:\wkdir\mt.sql The mt.sql script contains the following: SELECT substr(prod,1,4) prod, substr(Hierarchical_Time,1,12) ht, sales, sales_prior_period, sales - sales_prior_period variance_prior_period, sales_same_period_prior_year, sales - sales_same_period_prior_year variance_same_period_p_year FROM (SELECT cat, subcat, prod, gid_p, gid_t, Hierarchical_Time, yr, qtr, mon, day, sales, LAG(sales, 1) OVER (PARTITION BY gid_p, cat, subcat, prod, gid_t ORDER BY yr, qtr, mon, day) sales_prior_period, LAG(sales, 1) OVER (PARTITION BY gid_p, cat, subcat, prod, gid_t, qtr_num, mon_num, day_num ORDER BY yr) sales_same_period_prior_year FROM (SELECT c.gid, c.cat, c.subcat, c.prod, c.gid_p, t.gid_t, t.yr, t.qtr, t.qtr_num, t.mon, t.mon_num, t.day, t.day_num, t.Hierarchical_Time, NVL(s_sold,0) sales FROM cube_prod_time c PARTITION BY (gid_p, cat, subcat, prod) RIGHT OUTER JOIN edge_time t ON ( c.gid_t = t.gid_t AND c.Hierarchical_Time = t.Hierarchical_Time) ) dense_cube_prod_time ) -- side by side current,prior and prior year sales WHERE prod IN (139) AND gid_p=0 AND -- 1 product and product level data ( (mon IN ('2001-08' ) AND gid_t IN (0, 1) ) OR -- day and month data ( qtr IN ('2001-03' ) AND gid_t IN (3) ) ) -- quarter level data ORDER BY day; |
In many OLAP tasks, it is helpful to define custom members in a dimension. For instance, you might define a specialized time period for analyses. You can use a partitioned outer join to temporarily add a member to a dimension. Note that the new SQL MODEL clause introduced in Oracle Database 10g is suitable for creating more complex scenarios involving new members in dimensions. See the tutorial titled "SQL for Modeling" in the Data Warehousing Reference Guide for more information on this topic.
In this exercise, you define a new member for the TIME
dimension. You create a 13th member of the month level in the TIME
dimension. This 13th month is defined as the summation of the sales for each
product in the first month of each quarter of year 2001. You build this solution
using the views and tables created in the prior example.
1. |
Create a view with the new member added to the appropriate dimension. The view uses a UNION ALL operation to add the new member. To query using the custom member, use a CASE expression and a partitioned outer join. Execute the following SQL*Plus script: @c:\wkdir\cv3.sql The cv3.sql
script contains the following: CREATE OR REPLACE VIEW time_c AS (SELECT * FROM edge_time UNION ALL SELECT '2001-13_2', 2001, 5, '2001-05', 13, '2001-13', null, null, 8 -- <gid_of_mon> FROM DUAL); In the statement shown, the TIME_C view is defined by performing a UNION ALL of the EDGE_TIME view (defined in the prior example) and the user-defined 13th month. UNION ALL specifies the attributes for a 13th month member by doing a SELECT from the DUAL table. Note that the grouping id, column gid_t, is set to 8, and the quarter number is set to 5.
|
2. |
The in-line view of the query shown below performs a partitioned outer join of CUBE_PROD_TIME with TIME_C. This step creates sales data for the 13th month at each level of product aggregation. In the main query, the analytic function SUM is used with a CASE expression to compute the 13th month, which is defined as the summation of the first month's sales of each quarter. Execute the following SQL*Plus script: @c:\wkdir\cv4.sql The cv4.sql script contains the following: SELECT * from ( SELECT substr(cat,1,12) cat, substr(subcat,1,12) subcat, substr(prod,1,9) prod, mon, mon_num, SUM(CASE WHEN mon_num IN (1, 4, 7, 10) THEN s_sold ELSE NULL END) OVER (PARTITION BY gid_p, prod, subcat, cat, yr) sales_month_13 FROM (SELECT c.gid, c.prod, c.subcat, c.cat, gid_p, t.gid_t , t.day, t.mon, t.mon_num, t.qtr, t.yr, NVL(s_sold,0) s_sold FROM cube_prod_time c PARTITION BY (gid_p, prod, subcat, cat) RIGHT OUTER JOIN time_c t ON (c.gid_t = t.gid_t AND c.Hierarchical_Time = t.Hierarchical_Time) ) ) WHERE mon_num=13; The SUM function used in generating these results had a CASE statement to limit the data to months 1, 4, 7, and 10 within each year. Because of the tiny data set, with just two products, the roll-up values of the results are necessarily repetitions of lower-level aggregations. For a more realistic set of roll-up values, you can include more products from the "Game Console" and "Y Box Games" subcategories in the underlying materialized view.
|
In this tutorial, you learned how to:
Fill gaps in data using analytic SQL functions | ||
Replace NULLS with the nearest non-NULL value | ||
Perform a period-to-period comparison | ||
Create a custom member in a dimension |