
Clustering Concept Guide

Clustering is a necessary part of matching. It produces fast match results by making intelligent 'first cuts' through data sets, so that the matching processors do not attempt to compare every record in a table with every other record, a process that would not be feasible in terms of system performance.

Clustering is also vital to Real time matching, to allow new records to be matched against existing records in a system, without the need for OEDQ to hold a synchronized copy of all the records in that system.

Clusters

Rather than attempt to compare all records with each other, matching in OEDQ takes place within Clusters, which are created by a clustering process.

Clustering does not attempt to compare any records in a data set. Rather, it creates clusters for a data set by manipulating one or more of the identifier values used for matching on a record-by-record basis. Records with a common manipulated value (cluster key) fall within the same cluster, and will be compared with each other in matching. Records that are in different clusters will not be compared together, and clusters containing a single record will not be used in matching.

This is illustrated below, for a single cluster on a Name column, using a Make Array from String transformation to cluster all values from the Name identifier that are separated by a space:

The clustering process is therefore crucial to the performance and functionality of the matching process. If the clusters created are too large, matching may have too many comparisons to make, and run slowly. If on the other hand the clusters are too small, some possible matching records may have been missed by failing to cluster similar values identically.
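The effect of clustering on the number of comparisons can be sketched in a few lines of Python. This is an illustrative sketch only, not OEDQ's implementation: the function names, the sample records and the in-memory dictionaries are all assumptions made for the example. It splits a Name value on spaces (in the spirit of Make Array from String), groups records by each resulting cluster key, and counts the pairs that would be compared.

```python
from collections import defaultdict
from itertools import combinations


def make_array_from_string(value, delimiter=" "):
    """Split a value into its separate words; each word becomes a cluster key."""
    return [word for word in value.split(delimiter) if word]


def build_clusters(records, identifier):
    """Group record ids by every cluster key derived from the identifier."""
    clusters = defaultdict(set)
    for record_id, record in records.items():
        for key in make_array_from_string(record[identifier]):
            clusters[key].add(record_id)
    return clusters


def candidate_pairs(clusters):
    """Return the pairs of records that share at least one cluster."""
    pairs = set()
    for members in clusters.values():
        if len(members) > 1:  # clusters with a single record are not used
            pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical sample data
records = {
    1: {"Name": "John Smith"},
    2: {"Name": "Jon Smith"},
    3: {"Name": "John Smyth"},
    4: {"Name": "Mary Jones"},
}

clusters = build_clusters(records, "Name")
pairs = candidate_pairs(clusters)
all_pairs = len(records) * (len(records) - 1) // 2

print(sorted(pairs))
print(f"{len(pairs)} comparisons instead of {all_pairs}")
```

In this sample, records 2 and 3 share no cluster key and are never compared, which shows how clusters that are too narrow can miss a possible match.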

Depending on the matching required, clustering must be capable of producing common cluster keys for records that are slightly different. For example, it is desirable for records containing identifier values such as 'Oracle' and 'Oracle Ltd' to fall within the same cluster, and therefore be compared against each other.

Multiple clusters

OEDQ supports the use of multiple clusters. On a given data set, clustering on a single identifier value is often insufficient, as matches may exist where the identifier value is slightly different. For instance, a Surname cluster with no transformations will be unreliable if some of the data in the Surname field is known to be misspelt, and adding a soundex or metaphone transformation may make the clusters too large. In this case, an additional cluster may be required. If an additional cluster is configured on another identifier, OEDQ will create entirely new clusters, and perform the same matching process on them. For example, you could choose to cluster on the first three characters of a post code, such that all records with a post code beginning CB4 will be in a single cluster for matching purposes.

It is also possible to cluster the same identifier twice, using different clustering configurations with different transformations. To do this, create two clusters on the same identifier, and configure different transformations to work on the identifier values.

The more clusters there are, the more likely matching is to detect matches. However, clusters should still be used sparingly, because each additional cluster adds to the number of record comparisons that matching has to perform, and matching performance may suffer as a result.
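The trade-off between detection and cost can be sketched as follows. This is an illustration under stated assumptions rather than a description of OEDQ internals: the key functions, the upper-casing of surnames, and the sample records are hypothetical. One cluster is built on Surname and another on the first three characters of the Postcode, and the candidate pairs from both are combined.

```python
from collections import defaultdict
from itertools import combinations


def surname_key(record):
    """Surname cluster: upper case only, no phonetic transformation (assumed)."""
    return record["Surname"].upper()


def postcode_key(record):
    """Postcode cluster: first three characters after removing spaces."""
    return record["Postcode"].replace(" ", "").upper()[:3]


def pairs_for(records, key_function):
    """Candidate pairs produced by a single cluster configuration."""
    clusters = defaultdict(list)
    for record_id, record in records.items():
        clusters[key_function(record)].append(record_id)
    pairs = set()
    for members in clusters.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical sample data; record 2 has a misspelt surname
records = {
    1: {"Surname": "Fairgrieve", "Postcode": "CB4 1YG"},
    2: {"Surname": "Fairgreive", "Postcode": "CB4 1YG"},
    3: {"Surname": "Fairgrieve", "Postcode": "SW11 3QB"},
    4: {"Surname": "Foster", "Postcode": "CB4 2AB"},
}

surname_pairs = pairs_for(records, surname_key)
postcode_pairs = pairs_for(records, postcode_key)
combined_pairs = surname_pairs | postcode_pairs

print("Surname cluster only:", sorted(surname_pairs))
print("Postcode cluster only:", sorted(postcode_pairs))
print("Both clusters:", sorted(combined_pairs))
```

Records 1 and 2 are only brought together by the Postcode cluster, but adding that cluster raises the number of comparisons from one to four.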

Composite clusters

Composite clusters allow a more sensitive and efficient way of dividing up data sets into clusters, using multiple identifiers. In this case, different parts of each identifier are combined to create a single cluster key that is used to group records for matching. For example, when matching customers, the following cluster might be configured using a combination of Surname and Postcode identifiers:

For each record in the matching process, therefore, the value for the Surname identifier will be transformed (converted to upper case, all whitespace removed, a metaphone value generated, and trimmed to the first 4 characters), and then concatenated with the transformed value from the Postcode identifier (converted to upper case, all whitespace removed, and trimmed to the first 3 characters).

Note that the concatenation of the identifier values after transformation is not configured, and occurs automatically.

So, using the above cluster configuration, cluster keys will be generated as follows:

Surname	Postcode	Cluster key
Matthews	CB13 7AG	M0SCB1
Foster	CB4 1YG	FSTRCB4
JONES	SW11 3QB	JNSSW1
Jones	sw11 3qb	JNSSW1

This means that the last two records would be in the same cluster, and would be compared with each other for a possible match, but would not be compared with the other two records.
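The key generation above can be approximated in Python. This is a sketch only: the function names are hypothetical, and the Metaphone step is replaced by a lookup of the values shown in the table, because a real Metaphone implementation is outside the scope of the example. It therefore works only for the surnames listed.

```python
# Stand-in for a real Metaphone transformation. The values are copied from
# the table above, so only these three surnames are supported.
METAPHONE_FROM_TABLE = {"MATTHEWS": "M0S", "FOSTER": "FSTR", "JONES": "JNS"}


def remove_whitespace(value):
    """Remove all whitespace characters from a value."""
    return "".join(value.split())


def surname_part(surname):
    """Upper case, remove whitespace, Metaphone (stand-in), first 4 characters."""
    normalized = remove_whitespace(surname.upper())
    return METAPHONE_FROM_TABLE[normalized][:4]


def postcode_part(postcode):
    """Upper case, remove whitespace, first 3 characters."""
    return remove_whitespace(postcode.upper())[:3]


def composite_cluster_key(surname, postcode):
    """Concatenate the transformed identifier values into one cluster key."""
    return surname_part(surname) + postcode_part(postcode)


rows = [
    ("Matthews", "CB13 7AG"),
    ("Foster", "CB4 1YG"),
    ("JONES", "SW11 3QB"),
    ("Jones", "sw11 3qb"),
]

for surname, postcode in rows:
    print(surname, postcode, composite_cluster_key(surname, postcode), sep="\t")
```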

Transformations in clustering

Transformations in clustering allow you to normalize space, case, spelling and other differences between values that are essentially the same, enabling the creation of clusters for records that are only similar, rather than identical, in their identifier values.

For example, a Name identifier may use either a Metaphone or a Soundex transformation during clustering, such that similar-sounding names are included in the same cluster. This allows matching to work where data may have been misspelt. For example, with a Soundex transformation on a Surname identifier, the surnames 'Fairgrieve' and 'Fairgreive' would be in the same cluster, so that matching will compare the records for possible duplicates.
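To see why the two spellings end up in the same cluster, the following is a minimal sketch of the standard American Soundex algorithm in Python. It is written for illustration and is not OEDQ's own Soundex transformation, whose options and edge-case behavior may differ.

```python
def soundex(name):
    """Return the four-character American Soundex code for a name."""
    codes = {}
    for letters, digit in (
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    ):
        for letter in letters:
            codes[letter] = digit

    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]                  # the first letter is kept as-is
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = codes.get(ch, "")         # vowels have no code
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]          # pad with zeros, trim to 4 characters


print(soundex("Fairgrieve"), soundex("Fairgreive"))
```

Both surnames produce the same code, so a cluster keyed on the Soundex value would contain both records.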

The valid transformations for an identifier vary depending on the Identifier Type (for example, there are different transformations for Strings than for Numbers).

For example, the following are some of the transformations available for String identifiers:

Make Array from String (splits up values into each separate word, using a delimiter, and groups by each word value. For example, 'JOHN' and 'SMITH' will be split from the value 'JOHN SMITH' if a space delimiter is used.)
First N Characters (selects the first few characters of a value. For example, 'MATT' from 'MATTHEWS'.)
Generate Initials (generates initials from an identifier value. For example, 'IBM' from 'International Business Machines'.)
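The behavior of these three transformations can be approximated with short Python functions. These are simplified sketches with hypothetical names and signatures; the real OEDQ transformations have their own options, which are not modeled here.

```python
def make_array_from_string(value, delimiter=" "):
    """Split a value into separate words, one cluster key per word."""
    return [word for word in value.split(delimiter) if word]


def first_n_characters(value, n):
    """Select the first n characters of a value."""
    return value[:n]


def generate_initials(value):
    """Build initials from the first letter of each word (upper-cased)."""
    return "".join(word[0].upper() for word in value.split())


print(make_array_from_string("JOHN SMITH"))
print(first_n_characters("MATTHEWS", 4))
print(generate_initials("International Business Machines"))
```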

OEDQ comes with a library of transformations to cover the majority of needs. It is also possible to add custom transformations to the system - see Extending matching in OEDQ.

The options of a transformation allow you to vary the way the identifier value is transformed. The available options are different for each transformation.

For example, the following options may be configured for the First N Characters transformation:

Number of characters (the number of characters to select)
Characters to ignore (the number of characters to skip over before selection)
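As a rough sketch, the two options can be thought of as the start and length of a slice. The parameter names below mirror the option names but are otherwise an assumption made for this example.

```python
def first_n_characters(value, number_of_characters, characters_to_ignore=0):
    """Skip characters_to_ignore characters, then select number_of_characters."""
    start = characters_to_ignore
    return value[start:start + number_of_characters]


print(first_n_characters("MATTHEWS", 4))
print(first_n_characters("MATTHEWS", 4, characters_to_ignore=1))
```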

Using clustering

The 'best' clustering configuration will depend upon the data used in matching, and the requirements of the matching process.

Where many identifiers are used for a given entity, it may be optimal to use clusters on only one or two of the identifiers, for example to cluster people into groups by Surname and approximate Date of Birth (for example, Year of Birth), but without creating clusters on First Name or Post Code, though all these attributes are used in the matching process.

Again, this depends on the source data, and in particular on the quality and completeness of the data in each of the attributes. For accurate matching results, the attributes used by cluster functions require a high degree of quality. In particular the data needs to be complete and correct. Audit and transformation processors may be used prior to matching in order to ensure that attributes that are desirable for clustering are populated with high quality data.

Note that it is common to start with quite a simple clustering configuration (for example, when matching people, group records using the first 5 letters of a Surname attribute, converted to upper case), that yields fairly large clusters (with hundreds of records in many of the groups). After the match process has been further developed, and perhaps applied to the full data sets rather than samples, it is possible to improve performance by making the clustering configuration more sensitive (for example, by grouping records using the first 5 letters of a Surname attribute followed by the first 4 characters of a Postcode attribute). This will have the effect of making the clusters smaller, and reducing the total number of comparisons that need to be performed.
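The effect of making a configuration more sensitive can be estimated by counting the comparisons each configuration implies. The sketch below uses hypothetical records and function names, and assumes that a cluster of n records leads to n(n-1)/2 comparisons within a single data set; the actual number performed by OEDQ may differ.

```python
from collections import Counter


def simple_key(record):
    """First 5 letters of Surname, converted to upper case."""
    return record["Surname"].upper()[:5]


def sensitive_key(record):
    """Simple key followed by the first 4 characters of Postcode."""
    return simple_key(record) + record["Postcode"][:4]


def comparisons(records, key_function):
    """Total pairwise comparisons implied by the cluster sizes."""
    sizes = Counter(key_function(record) for record in records)
    return sum(n * (n - 1) // 2 for n in sizes.values())


# Hypothetical sample data
records = [
    {"Surname": "Matthews", "Postcode": "CB13 7AG"},
    {"Surname": "Matthewson", "Postcode": "CB13 7AG"},
    {"Surname": "Matthews", "Postcode": "SW11 3QB"},
    {"Surname": "Matthew", "Postcode": "SW11 3QB"},
    {"Surname": "Jones", "Postcode": "SW11 3QB"},
]

print("Simple configuration:", comparisons(records, simple_key))
print("Sensitive configuration:", comparisons(records, sensitive_key))
```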

When matching on a number of identifiers, where some of the key identifiers contain blanks and nulls, it is generally better to use multiple clusters rather than a single cluster with large groups.

Note: All No Data (whitespace) characters are always stripped from cluster keys after all user-specified clustering transformations are applied, and before the clustering engine finishes. For example, if you use the Make Array from String transformation, and split data on spaces, the values "Jim<space><carriage return>Jones" and "Jim<space>Jones" would both create the cluster values "Jim" and "Jones". The former would not create the cluster value "<carriage return>Jones". This is in order that the user does not always have to consider how to cope with different forms of whitespace in the data when clustering.
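The stripping behavior described in this note can be imitated as follows. This is a sketch that assumes Python's definition of whitespace is a reasonable stand-in for OEDQ's No Data characters; the function names are hypothetical.

```python
def strip_no_data(key):
    """Remove every whitespace character from a cluster key."""
    return "".join(key.split())


def cluster_keys(value, delimiter=" "):
    """Split on the delimiter, then strip whitespace from each key."""
    keys = (strip_no_data(part) for part in value.split(delimiter))
    return [key for key in keys if key]


print(cluster_keys("Jim \rJones"))  # value containing a carriage return
print(cluster_keys("Jim Jones"))
```

Both calls give the same two cluster keys, so the records fall into the same clusters regardless of the stray carriage return.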

Reviewing the clustering process

In OEDQ, the clusters used in matching can be reviewed in order to ensure they are created to the required level of granularity.

This is possible using the views of clusters created when the match processor has been run with a clustering configuration. The Results Browser displays a view of the clusters generated, with their cluster keys:

The list of clusters can be sorted by the cluster key value in order to see similar groups that possibly ought not to be distinct.

By drilling down, it is possible to see the constituent records within each cluster from each input data set. For example, the 9 records from the Customers data set with a cluster key of 'KLRKEC3' above are:

In this way, an expert user can inspect and tune the clustering process to produce the optimal results in the quickest time. For example, if the clusters are too big, and matching performance is suffering, extra clusters could be created, or a cluster configuration may be made tighter. If, on the other hand, the user can see that some possible matches are in different clusters, the clustering options may need to be changed to widen the clusters.

Note: With some clustering strategies, large clusters are created. This is often the case, for example, if there are a large number of null values for an identifier, creating a large cluster with a NULL cluster key. If a cluster contains more than a configurable number of records, or will lead to a large number of comparisons being performed, it can be skipped to protect the performance of the matching engine. The default maximum size of a cluster is 500 records, and it is also possible to limit the maximum number of comparisons that should be performed for each cluster. To change these options, see Advanced options for match processors.
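The skipping logic described in this note might look like the following sketch. Only the default of 500 records comes from the text above; the function and parameter names, the use of None for the NULL cluster key, and the n(n-1)/2 comparison count are assumptions made for illustration.

```python
MAX_CLUSTER_SIZE = 500  # default maximum cluster size stated in the note


def clusters_to_match(clusters, max_size=MAX_CLUSTER_SIZE, max_comparisons=None):
    """Split clusters into those to match and those skipped as too large."""
    kept, skipped = {}, {}
    for key, members in clusters.items():
        size = len(members)
        comparisons = size * (size - 1) // 2  # assumed pairwise count
        too_big = size > max_size
        too_many = max_comparisons is not None and comparisons > max_comparisons
        if too_big or too_many:
            skipped[key] = members
        else:
            kept[key] = members
    return kept, skipped


# Hypothetical clusters; None stands for the NULL cluster key
clusters = {
    None: list(range(1200)),
    "JNSSW1": [1, 2],
    "FSTRCB4": [3, 4, 5],
}

kept, skipped = clusters_to_match(clusters)
print("Matched clusters:", sorted(kept))
print("Skipped clusters:", list(skipped))
```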