Merge |
Merge is a sub-processor of all matching processors except Link (where no record merging occurs).
Merging records is an optional part of matching, that allows you to create new 'best' output records from the matching process. These 'best' records are constructed from the multiple records in each match group, using a combination of automatic selection rules and manual decisions.
For example, using automatic rules, the match processor might output the Most Common Value for an attribute from the records found in a match group, the Latest Value using an input date (such as a 'last edited by' field), or the First Non-empty Value, preferring one source of data over another.
Any errors in the automatic output selection for an attribute mean that the attribute is marked with a status of Fail. Groups which contain any failed attributes are also marked with a Fail status. These errors can be manually resolved at the Review stage. For example when consolidating two duplicate records from two different systems, if automatic rules fail to select values (for example because there is no value that is 'Most Common' in the records in the match group, the user can choose one value (such as name) from one record, and another (such as email address) from a related record, depending on which is seen to be best. From the Review screens, it is possible to review the output for records that contained selection errors. The Merge Summary results view also shows the number of failed match groups, so that you are aware of how many errors need resolution.
The typical use of the Merge sub-processor varies according to the type of match processor you are using. When deduplicating a data stream, or consolidating a number of data streams, automatic merge rules are typically used to create a set of duplicate-free output records - for example as part of a data migration project, or when re-purposing data, for example, in the preparation of a customer list for a mailshot.
When enhancing a set of data from reference sources, the merge rules might be used to add in data from the matching reference records. In this case, the original working records may be updated with better, trusted information, or perhaps simply appended with the new information.
For the Deduplicate and Consolidate match processors, the Merge configuration has a set of default selection rules, allowing you to create a simple form of de-duplicated output.
For the Enhance processor, the Merge configuration has a set of default selection rules to enhance your working data from your reference data.
All of these default configurations are simple, and designed to create quick output from the matching process. If you need more precisely constructed merged output records, you will need to edit the rules to suit your needs.
To keep the default format for the merged output, keep the Auto Attribute Selection option ticked at the bottom of the dialog. Note that the attributes in the output may still change, as attributes are included for each input attribute. Adding or removing input attributes from the match processor will change the attributes in the default output. To customize the output, untick this box, and add or remove attributes. Note that if you change the output to a custom format, for example, by adding attributes, the Auto Attribute Selection option is automatically de-selected. This means that changing input attributes will not automatically add attributes to the output, though you can still add them manually if required.
The merge rules are set separately for each output attribute that you require.
By default, an output attribute is included for any attributes that have the same name in all of the data streams presented to matching. (If only one data stream is presented to matching, as in De-duplication, this means an output attribute for every input attribute is created.)
Note: The output selector used in the default output format is Most Common Value. This means that from all the records in the match group, the most common value for each attribute will be selected. Where there is no most common value, for example, one record in the group has a FirstName of 'Jhon' and the only other record has 'John', the selector has an option (Use first non-empty if tied) simply to pick the first non-empty value from the records in the group. By default, this is set, but you may wish to unset it and raise an error where there is no most common value. |
Additional output attributes exist for MatchGroup and MatchGroupSize, so that you can cross-reference the merged output with the Match Groups output, in order to maintain a full audit trail of all output selection decisions. A number of other internally generated attributes are available.
There are a number of options available for changing the set of merged output records:
Option |
Description |
Default setting |
Generate Merged Output |
Determines whether or not to generate the merged output (at all) or not. For example, if you have fully developed the matching process, and you are not using the merged output, you can save on performance by not generating it. |
Selected |
Output related records |
Determines whether or not to output merged output records for groups of related records. |
Selected |
Output unrelated records |
Determines whether or not to output unrelated records. |
Selected, for Deduplicate and Consolidate processors. Not Selected, for Enhance and Advanced Match processors. |
To add a new output attribute to the merged output, use the Add button at the bottom of the dialog.
A number of further internally generated output attributes are available:
Attribute Name |
Description |
Attribute Value |
Match_Group_Status |
The status of the Match Group. FAIL, if there were any errors during output selection for the Match Group SUCCESS, if there were no errors during output selection for the Match Group. |
|
Reviewed_Flag |
Reviewed Status |
An indicator on the Match Group to show if the merged output for it has been manually reviewed or not. |
Review_User |
Name of Reviewer |
The user name of the last user to review the merged output for the Match Group. |
Review_Date |
Date of Latest Review |
The date of the last review of the merged output for the Match Group. |
Comment |
Latest Comment |
The latest comment made on the merged output. |
Comment_User |
Name of Latest Comment Provider |
The user name of the user that made the latest comment. |
Comment_Date |
Date of Latest Comment
|
The date of the latest comment. |
Alternatively, to create a new merged output attribute, that is, to merge in data from the records in the match group:
Note that the number of available inputs above varies depending on the number of source data streams that are input to the match processor. In some cases, an output selector might require a specific additional input. For example, the Earliest Value and Latest Value selectors require a date attribute to be used to choose the earliest or latest record from which to select a value.
If you want to select data for output using more complex rules, it is possible to add your own output selectors into the available set. See Extending matching in OEDQ.
An output selection error occurs whenever the output selector for a merged output attribute cannot select a single valid value from the input records in the match group. If a merged output record with no errors cannot be output from a match group (that is, the merged output record contains any output selection errors on any attributes), the match group is flagged as a 'failed' group. The failure is intended as a flag to indicate that the merging of data from the match group requires manual review, so that where a value could not be selected using automatic merging rules, the merged output record can be resolved manually. See Making manual merged output decisions.
Note that wherever values could be correctly selected, these are still output (even on failed groups), so it may still be possible to use merged output records with selection failures.
Ambiguous Selection
If the output selector finds more than one value that could be considered as the output value, and has not been configured to select arbitrarily between them, an output selection error is raised, and no value will be selected for the merged output attribute. For example, an error will be raised if attempting to select the Most Common Value for a Date of Birth output attribute from the following records in a match group, if the Use first non-empty value if tied? option is not ticked:
Record |
Date of Birth |
A |
01/10/1975 |
B |
01/10/1975 |
C |
10/01/1975 |
D |
10/01/1975 |
In the above case, selection is ambiguous because there are two values that occur twice; that is, there is no single most common value.
Disallowing Null values
You may wish to apply a rule to disallow Null values in a merged output attribute, so that you can ensure the completeness of this attribute for downstream processing.
For example, you may want to flag all groups where a Postcode attribute value could not be selected, by unticking the Allow Nulls option on the output selector. If all records in the match group have Null values for the Postcode attribute, a selection error will be raised regardless of the output selector used, as there is no non-Null value to select.
Oracle ® Enterprise Data Quality Help version 9.0
Copyright ©
2006,2011 Oracle and/or its affiliates. All rights reserved.