Reference Data

Reference Data is data that is used in lookups by various processors when checking and improving working data. Examples of Reference Data include:

Each set of Reference Data may be created, edited and managed in OEDQ itself, or may be from an external source. For example, a file that is stored and updated on the internet may be downloaded, snapshotted and used as Reference Data, or you may choose to maintain your own database of Reference Data and perform lookups against this database.

Reference Data that is managed in OEDQ may also be used in processes in the same way as Staged Data. It can be profiled, checked, transformed, matched and so on.

There are two aspects of Reference Data definition: the set of data itself, and the lookup definition onto that data.

When creating a set of Reference Data, the New Reference Data option will create both a set of data (to be managed in OEDQ) and a default lookup definition for that data. The New Lookup option will create a lookup onto an existing set of data, which may be from one of three sources: Reference Data managed in OEDQ, Staged Data, or External Data.

When using the Reference Data in a processor option, there is no difference between Lookups and Reference Data.

Reference Data managed in OEDQ

It is advisable to manage a set of data in OEDQ when it consists of lists or maps that are used to validate values and patterns, will normally be small enough to load into memory (see note below), and may need to be created or updated using results.

Note: As a guide, any Reference Data set with fewer than 50,000 rows should be loadable into memory on an OEDQ server with the recommended minimum of 1 GB of RAM, and so will be marked as loadable when selecting Reference Data for use in processors. Larger Reference Data sets are not loadable into memory by default, but if the server has more memory available, an administrator can change the 50,000-row limit on the server.
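The row-count rule in the note can be pictured with a short sketch. This is illustrative Python, not OEDQ code: the names `DEFAULT_ROW_LIMIT` and `is_loadable` are hypothetical, and the sketch assumes the limit is a plain row-count threshold that an administrator may raise.

```python
DEFAULT_ROW_LIMIT = 50_000  # guide value from the note; the constant name is hypothetical


def is_loadable(row_count: int, row_limit: int = DEFAULT_ROW_LIMIT) -> bool:
    """Return True if a set with row_count rows would be treated as loadable.

    Assumption: a simple row-count comparison. The text does not say what
    happens at exactly the limit, so this sketch follows "fewer than".
    """
    return row_count < row_limit
```

For example, `is_loadable(10_000)` is true under the default limit, while a set of 75,000 rows only becomes loadable once the limit is raised, as in `is_loadable(75_000, row_limit=100_000)`.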

For example, the following types of Reference Data would normally be managed in OEDQ:

A starter pack of Reference Data is shipped with OEDQ, though new Reference Data can be created and modified quickly and easily from your own data, using the Results Browser.

Reference Data Categories

When creating a Reference Data set that is managed by OEDQ, you can optionally assign it a Category.

Categories are used to provide shorter lists of Reference Data sets when selecting Reference Data from processors, where the processor option requires a certain 'type' of Reference Data, such as a list of characters, patterns, or regular expressions.

The following categories are all used by processors in the Processor Library, and are therefore available for selection when creating a Reference Data set. If new processors are created and added to the Processor Library (see Extending OEDQ for details), these may add further categories which will also appear in the list.

Category: Date Formatting
Use: Lists of date formats, for use in processors that need to recognize dates. Used by the following processors: Data Types Profiler, Data Type Check, Convert String to Date.

Category: No Data Handling
Use: Lists of characters that constitute 'No Data', such as whitespace characters. No Data characters may be normalized to NULL values to aid data analysis. Used by the following processors: Reader, Normalize No Data. It is also possible to use Reference Data in this Category when snapshotting data.

Category: Number Formatting
Use: Lists of number formats, for use in processors that need to recognize numbers. Used by the Convert String to Number processor.

Category: Number Bands
Use: Lists of Number Bands, for profiling numeric values across various ranges. Used by the Number Profiler processor.

Category: Parse Base Token Patterns
Use: Lists of Base Token Patterns. Used in the Parse processor when classifying data using a Base Token Check.

Category: Parse Pattern Frequency
Use: Reference Data generated and used by the Parse processor when selecting patterns using their frequency of occurrence in the Reference Data.

Category: Parse Tokenization
Use: Maps of characters to pattern characters, group tags and character types. Used in the Parse processor when tokenizing data.

Category: Pattern Generation
Use: Maps of characters to pattern characters, used to generate patterns from data. Used by the following processors: Patterns Profiler, Pattern Check (and also by the Pattern Check classifier in the Parse processor), Pattern Transform.

Category: Patterns
Use: Lists or maps of character patterns, used to validate or transform data formats. Used by the following processors: Pattern Check (and also by the Pattern Check classifier in the Parse processor), Pattern Transform.

Category: Regular Expressions
Use: Lists of regular expressions. Used by the following processors: RegEx Patterns Profiler, RegEx Check, Email Check, GBR Postcode Format Check.

Staged Data Lookups

Staged Data Lookups are lookups onto an existing set of staged data in the repository (either a Snapshot, or data that has been written from another process).

When setting up a Staged Data Lookup, you must choose which column or columns to use for the lookup, and which columns you want to return.

You may configure several different lookups onto the same data, using different lookup and return columns.
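The idea of several lookups onto the same data, each with its own lookup and return columns, can be sketched as follows. This is illustrative Python rather than anything OEDQ exposes: `build_lookup`, the column names and the sample rows are all invented for the example, and keeping the first row for a duplicate key is an assumption.

```python
from typing import Dict, Iterable, Sequence, Tuple

Row = Dict[str, str]
Key = Tuple[str, ...]


def build_lookup(
    rows: Iterable[Row],
    lookup_columns: Sequence[str],
    return_columns: Sequence[str],
) -> Dict[Key, Key]:
    """Index rows by the lookup columns, keeping only the return columns."""
    index: Dict[Key, Key] = {}
    for row in rows:
        key = tuple(row[col] for col in lookup_columns)
        # Assumption: the first row seen for a key wins.
        index.setdefault(key, tuple(row[col] for col in return_columns))
    return index


# Invented sample data standing in for a staged data set.
staged_rows = [
    {"code": "GB", "name": "United Kingdom", "currency": "GBP"},
    {"code": "FR", "name": "France", "currency": "EUR"},
]

# Two different lookups onto the same rows.
name_by_code = build_lookup(staged_rows, ["code"], ["name"])
code_by_name_and_currency = build_lookup(staged_rows, ["name", "currency"], ["code"])
```

Here `name_by_code` is keyed on a single column and `code_by_name_and_currency` on two, yet both read the same underlying rows, which mirrors configuring separate lookups with different lookup and return columns.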

Staged Data Lookups appear under the Reference Data node in the Project Browser, but with a Staged Data icon to indicate that the lookup is onto Staged Data rather than editable Reference Data or External Data.

External Data Lookups

External Data Lookups are lookups onto data that you have not staged and do not wish to stage; for example, a large data set that exists externally to OEDQ and may be frequently updated.

An External Data Lookup is configured in the same way as a Staged Data Lookup, with selected columns used for the lookup, and selected columns returned. However, the external data set is not staged in the OEDQ repository.

You may configure several different lookups onto the same data, using different lookup and return columns.

External Data Lookups appear under the Reference Data node in the Project Browser, but with the Data Store icon to indicate that the lookup is onto External Data rather than editable Reference Data or Staged Data.

Reference Data Levels

Reference Data may exist at two different levels. System-level Reference Data is globally shared on a server, and may be used in many projects. Project-level Reference Data may only be used in the project where it is stored.

System-level Reference Data

Sets of System-level Reference Data are listed under the OEDQ server in the Project Browser.

On first installation of OEDQ, a starter pack of System-level Reference Data is available for common uses. For example, a No Data Handling map is provided in order to normalize No Data values to NULL values, and a Character Pattern Map is provided in order to drive the way OEDQ generates and assesses patterns in data. This kind of Reference Data will normally be used in a standard way across all projects.
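As a rough picture of what these two starter sets drive, the sketch below normalizes No Data values to NULL (`None` in Python) and generates a pattern from a value using a character map. It is illustrative Python, not OEDQ's implementation: `NO_DATA_CHARS`, `PATTERN_MAP`, `normalize_no_data` and `generate_pattern` are hypothetical names, and the specific mappings (uppercase to `A`, lowercase to `a`, digits to `N`) are assumptions chosen for the example, not the shipped map.

```python
import string
from typing import Dict, Optional

# Hypothetical stand-in for a No Data Handling list of characters.
NO_DATA_CHARS = {" ", "\t", "\n", "\r", "\u00a0"}

# Hypothetical stand-in for a character pattern map (assumed mappings).
PATTERN_MAP: Dict[str, str] = {}
PATTERN_MAP.update({ch: "A" for ch in string.ascii_uppercase})
PATTERN_MAP.update({ch: "a" for ch in string.ascii_lowercase})
PATTERN_MAP.update({ch: "N" for ch in string.digits})


def normalize_no_data(value: Optional[str]) -> Optional[str]:
    """Return None when the value is missing, empty, or only No Data characters."""
    if value is None or all(ch in NO_DATA_CHARS for ch in value):
        return None
    return value


def generate_pattern(value: str) -> str:
    """Map each character to its pattern character; unmapped characters pass through."""
    return "".join(PATTERN_MAP.get(ch, ch) for ch in value)
```

With these assumed mappings, `generate_pattern("AB12 3cd")` yields `"AANN Naa"`, and `normalize_no_data("  \t")` yields `None`. Because both functions read from shared tables, changing a table changes the result everywhere it is used, which is why such Reference Data is normally applied in a standard way across projects.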

It is also possible to copy Project-level Reference Data to the System-level library, if you know that the Reference Data will rarely be modified and will be used in several projects. Otherwise, it is better to manage the Reference Data on a project basis.

Reference Data can be copied to a different area (that is, to a specific project, or to the system level) using Copy and Paste from the Right-click menu.

System-level Reference Data should be treated with care, and modified only if you are sure that you want to apply a global change across the OEDQ instance. Such a change takes effect immediately on all processes that use the Reference Data, and may therefore affect many processes in many projects.

New System-level Reference Data lists and maps may be added by copying the data from a project to the System level, perhaps as extended versions of previous lists or maps with a different name. Users of the former version may then make a conscious decision whether or not to use the extended version.

For a full list of all the System-level Reference Data lists and maps that are shipped with OEDQ, see the Reference Data Library.

Project-level Reference Data

Sets of Project-level Reference Data are listed under the project in the Project Browser.

Project-level Reference Data is used to define business rules for specific projects. Use Project-level Reference Data whenever you wish to change the Reference Data iteratively as you analyze data without affecting processes in other projects, and wherever you have Reference Data that will not be useful to other projects, such as specific rules for a data attribute that does not occur commonly.

You can copy Reference Data between projects, or copy Project-level Reference Data to the System level, from the Right-click menu.

Note: When creating your own Reference Data, if the data needs to be of the data type DATE, it must appear in the Reference Data in ISO format; that is, YYYY-MM-DD HH:mm:ss. This is true even if the data it will be used to check is in a non-ISO format.
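If date values for a Reference Data set are being prepared outside OEDQ, a small conversion step can put them into the ISO layout described in the note. The sketch below is illustrative Python using the standard `datetime` module; `ISO_FORMAT` and `to_reference_date` are hypothetical names, and the source format passed in is whatever layout your own data happens to use.

```python
from datetime import datetime

# strptime/strftime equivalent of YYYY-MM-DD HH:mm:ss
ISO_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_reference_date(value: str, source_format: str) -> str:
    """Re-express a date string in the ISO layout required for DATE values.

    Raises ValueError if value does not match source_format.
    """
    return datetime.strptime(value, source_format).strftime(ISO_FORMAT)
```

For example, `to_reference_date("31/12/2011", "%d/%m/%Y")` returns `"2011-12-31 00:00:00"`; a source value with no time component is given a midnight time by `datetime.strptime`.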

Oracle ® Enterprise Data Quality Help version 9.0
Copyright © 2006,2012, Oracle and/or its affiliates. All rights reserved.