By Yuli Vasiliev | November 2021
Your organization has tons of transaction data—and creates more data every day. There’s invaluable knowledge in those database records! Association analysis lets your organization benefit from this data, extracting useful information from it. This article explains how to take advantage of this technique to identify relationships between items in transaction data and turn them into actionable insights.
After a brief discussion of the basics of association analysis, you'll see some examples of applying association analysis techniques to a sample set of transaction data in Python and in Oracle Database for comparison. This article assumes some knowledge of Python and Oracle Database, such as being able to install libraries with pip.
Association analysis discovers the probability of the co-occurrence of items in a collection. The strength of associations between co-occurring items is expressed through metrics attached to association rules. In essence, association rules are IF-THEN statements that express the probability of relationships between items. An association rule is denoted as X -> Y, where X is the IF component of the rule, called the antecedent, and Y is the THEN component, called the consequent. Or, to put it more plainly, association analysis tells you how likely it is that, if X occurs in a record in the dataset, Y will show up in the same record. In IF-THEN notation, the above rule would be denoted as IF X THEN Y.
Association rules are often used to analyze sales transaction data, helping retailers find relationships between the products that people frequently buy together. For example, the rule IF {bread, butter} THEN {cheese}, which might be found in the transaction data of a grocery store, indicates that shoppers who buy bread and butter in the same shopping trip are likely to also pick up cheese. This knowledge might drive coupons and special pricing, product placement, and inventory management. When the insights are not intuitively obvious, they can provide not only a way to increase revenue but also a competitive advantage.
There are several metrics for evaluating the strength of a rule. The most common association metrics are the following:

- Support: the ratio of transactions that contain a given item or itemset to the total number of transactions
- Confidence: how often the consequent appears in transactions that contain the antecedent
- Lift: the ratio of the rule's observed support to the support expected if the antecedent and the consequent were independent
First, you use the support metric to determine the frequent itemsets in your transaction set (a set of items that occurs in many transactions is considered frequent). The other metrics then evaluate association rules built from those frequent itemsets. The Apriori algorithm, developed by Rakesh Agrawal and Ramakrishnan Srikant in the 1990s, is designed for exactly that: determining frequent itemsets and evaluating association rules on them.
Using the Apriori algorithm consists of performing the following two steps:

1. Determine the frequent itemsets in the transaction set, that is, the itemsets whose support meets a user-specified threshold.
2. Generate association rules for those frequent itemsets and evaluate the rules with metrics such as confidence and lift.
Below is a simple example that illustrates how this algorithm can be applied in practice. The example uses the sample transaction dataset implemented as a Python list of lists, where each nested list represents a set of products found in a single transaction.
transactions = [
['cabbage', 'cucumber'], ['cabbage', 'eggplant', 'cucumber'], ['carrot', 'potato', 'pepper'],
['carrot', 'pepper'], ['carrot', 'tomato'], ['spinach', 'eggplant', 'pumpkin'], ['carrot', 'tomato', 'corn'],
['lettuce', 'leek'], ['cabbage', 'cucumber', 'spinach'], ['corn', 'broccoli', 'tomato'],
['radishes', 'potato'], ['carrot', 'potato'], ['radishes', 'beets', 'potato'], ['cabbage', 'beans'],
['carrot', 'potato', 'pepper'], ['carrot', 'cucumber', 'pepper'], ['artichoke', 'cucumber'],
['cabbage', 'cucumber'], ['carrot', 'beets'], ['onion', 'garlic']
]
This sample dataset consists of 20 transactions, each of which contains two or three items. (In a real transaction dataset, there would likely be millions of transactions, many of them considerably larger; think of a typical store receipt covering a week of grocery shopping for a family.)
If you look through the transactions, you may notice that some products appear more often than others and that some tend to appear together. For example, there are five transactions with cabbage and six transactions with cucumber, four of which contain both cabbage and cucumber. Let's now look at how to calculate the main metrics for the cabbage -> cucumber association rule.
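As a quick check, you can verify these counts directly in Python. The following is a minimal sketch that uses only the transactions list defined above; the variable names are mine and will be reused shortly:

# Count the transactions containing cabbage, cucumber, and both
n_cabbage = sum('cabbage' in t for t in transactions)                   # 5
n_cucumber = sum('cucumber' in t for t in transactions)                 # 6
n_both = sum('cabbage' in t and 'cucumber' in t for t in transactions)  # 4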
As mentioned, the support of an item is the ratio of transactions that include this item to the total number of transactions. So, the support for cabbage in the example being discussed can be calculated as follows:
support(cabbage) = cabbage/total -> 5/20 -> 0.25
For itemsets, the support is the ratio of transactions in which an itemset occurs to the total number of transactions. Thus, you could use the following formula to calculate the support for the cabbage & cucumber itemset defined in the example:
support(cabbage & cucumber) = (cabbage & cucumber)/total -> 4/20 -> 0.2
A support metric falls in the range of 0 to 1. The Apriori algorithm uses a user-specified support threshold to determine if an itemset can be considered frequent. For example, if you set the support threshold to 0.1, any set of items that appear together in at least 10% of all the transactions will be considered frequent.
Once the set of frequent itemsets has been determined, you can evaluate association rules on them. Confidence and lift are the primary metrics. Turning back to the example, confidence can be interpreted as the likelihood that a shopper who buys cabbage also buys cucumber. So, the confidence for the cabbage -> cucumber association rule can be calculated as follows:
confidence(cabbage -> cucumber) = (cabbage & cucumber)/cabbage -> 4/5 -> 0.8
Confidence falls in the range of 0 to 1. This metric is not symmetric: in this example, the confidence for cabbage -> cucumber differs from the confidence for cucumber -> cabbage.
confidence(cucumber -> cabbage) = (cabbage & cucumber)/cucumber -> 4/6 -> 0.67
Confidence is a measure of performance for an association rule, indicating how often the consequent follows the antecedent.
As mentioned, lift is another important metric for evaluating association rules. The lift of the association rule cabbage -> cucumber is the ratio of the observed support for cabbage -> cucumber to the support that would be expected if cabbage and cucumber were independent of each other. It can be calculated as follows:
lift(cabbage -> cucumber) = support(cabbage & cucumber)/(support(cabbage)*support(cucumber)) -> 0.2/(0.25*0.3) -> 2.67
The range of possible values for lift is [0, ∞). A lift greater than 1 indicates that the relationship between the antecedent and the consequent is stronger than would be expected if they were independent.
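To tie the formulas together, here is a minimal sketch that recomputes all three metrics, reusing the counts from the earlier snippet (again, the variable names are illustrative, not part of any library):

total = len(transactions)                                     # 20 transactions

support_cabbage = n_cabbage / total                           # 0.25
support_cucumber = n_cucumber / total                         # 0.30
support_both = n_both / total                                 # 0.20

# confidence(cabbage -> cucumber) = support(both) / support(cabbage)
confidence = support_both / support_cabbage                   # 0.8

# lift(cabbage -> cucumber) = support(both) / (support(cabbage) * support(cucumber))
lift = support_both / (support_cabbage * support_cucumber)    # ~2.67
print(confidence, round(lift, 2))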
A set of association rules can be generated automatically against your transaction data. In Python, this can be done with libraries such as Mlxtend, which provides an implementation of the Apriori algorithm. You can install Mlxtend with pip, as follows:
pip install mlxtend
After that, you can apply the Apriori algorithm to the transaction dataset, which was defined as a Python list of lists earlier. The first step is to transform the dataset into a format suitable for further processing. When using Mlxtend, you need to convert the dataset into a one-hot encoded array of transactions. The following snippet does that, saving the encoded array of transactions in a pandas DataFrame:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# One-hot encode the transactions: one row per transaction,
# one Boolean column per distinct product
encoder = TransactionEncoder()
encoded_array = encoder.fit(transactions).transform(transactions)
df_itemsets = pd.DataFrame(encoded_array, columns=encoder.columns_)
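If you want to confirm that the encoding worked as expected, you can check the dimensions of the resulting DataFrame (this quick check is my addition):

# 20 transactions (rows) by 19 distinct products (columns)
print(df_itemsets.shape)   # (20, 19)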
Next, you can extract the frequent itemsets from the encoded array. This can be done using the apriori() function from the mlxtend.frequent_patterns module, as follows:
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df_itemsets, min_support=0.1, use_colnames=True)
In this particular example, the min_support parameter is set to 0.1 to return the items and itemsets with at least 10% support. As a result, the content of the frequent_itemsets DataFrame should look as follows:
support itemsets
0 0.10 (beets)
1 0.25 (cabbage)
2 0.40 (carrot)
3 0.10 (corn)
4 0.30 (cucumber)
5 0.10 (eggplant)
6 0.20 (pepper)
7 0.25 (potato)
8 0.10 (radishes)
9 0.10 (spinach)
10 0.15 (tomato)
11 0.20 (cucumber, cabbage)
12 0.20 (pepper, carrot)
13 0.15 (carrot, potato)
14 0.10 (carrot, tomato)
15 0.10 (corn, tomato)
16 0.10 (pepper, potato)
17 0.10 (potato, radishes)
18 0.10 (pepper, potato, carrot)
Now that you have the items and itemsets that meet the specified support threshold, you can proceed to the next step of the Apriori algorithm and generate association rules for the itemsets. The Mlxtend library provides the association_rules() function for doing that.
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
If you look at the generated rules, however, you may notice that they include more metrics than support, confidence, and lift. To minimize confusion, if you want to view only those three, you can explicitly select the corresponding columns:
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
The command above should output the following rule set:
antecedents consequents support confidence lift
0 (cucumber) (cabbage) 0.20 0.666667 2.666667
1 (cabbage) (cucumber) 0.20 0.800000 2.666667
2 (pepper) (carrot) 0.20 1.000000 2.500000
3 (carrot) (pepper) 0.20 0.500000 2.500000
4 (potato) (carrot) 0.15 0.600000 1.500000
5 (tomato) (carrot) 0.10 0.666667 1.666667
6 (corn) (tomato) 0.10 1.000000 6.666667
7 (tomato) (corn) 0.10 0.666667 6.666667
8 (pepper) (potato) 0.10 0.500000 2.000000
9 (radishes) (potato) 0.10 1.000000 4.000000
10 (pepper, potato) (carrot) 0.10 1.000000 2.500000
11 (pepper, carrot) (potato) 0.10 0.500000 2.000000
12 (carrot, potato) (pepper) 0.10 0.666667 3.333333
13 (pepper) (carrot, potato) 0.10 0.500000 3.333333
As you can see, a separate rule is generated for each direction of an association, such as cabbage -> cucumber and cucumber -> cabbage. Although support and lift are symmetric, confidence is not: if you swap the antecedent and the consequent of a rule, the confidence may change.
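If you want to scan the strongest associations first, you might sort the rules by lift; this optional step is my addition and is not required for what follows:

# Show the strongest associations at the top
print(rules.sort_values('lift', ascending=False)[['antecedents', 'consequents', 'lift']])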
Now that you have generated a set of association rules against your transaction data, how can you take advantage of it and turn it into actionable insights? The following section discusses how you can make recommendations for customers based on the purchase history summarized in the metrics that compose association rules.
Making recommendations based on the items that customers have already added to their basket is a common technique used by retailers to show customers cross-sell items they might want to purchase. To generate such recommendations, you need to identify the items that customers buy together frequently. This is where association rules come in handy, providing metrics for identifying items frequently bought together.
To start, you might want to filter the rules by a certain metric. For example, you might filter the collection of rules using a threshold on the lift metric. In the following code snippet, the lift threshold is set to 2.5. I suggest you also rename the antecedents and consequents columns to make it clear that the recommendation list for an antecedent is composed of its most common consequents.
recommendations = (rules[rules['lift'] >= 2.5][['antecedents', 'consequents']]
                   .rename(columns={'antecedents': 'items', 'consequents': 'recommendation'})
                   .reset_index(drop=True))
So, the content of the recommendations DataFrame should look as follows:
items recommendation
0 (cucumber) (cabbage)
1 (cabbage) (cucumber)
2 (pepper) (carrot)
3 (carrot) (pepper)
4 (corn) (tomato)
5 (tomato) (corn)
6 (radishes) (potato)
7 (potato, pepper) (carrot)
8 (potato, carrot) (pepper)
9 (pepper) (potato, carrot)
Looking through the rows above, you may notice that there are rows with duplicate values in the items column. For example, you have two rows for pepper, which is not appropriate for a recommendation system. Here is how to combine such duplicates into a single row.
# Convert frozensets to lists so the recommendation lists can be concatenated
recommendations['recommendation'] = recommendations['recommendation'].apply(list)
# Merge rows that share the same antecedent, then drop duplicates within each merged list
recommendations = (recommendations.groupby('items', sort=False)['recommendation']
                   .apply(lambda lists: sum(lists, [])).reset_index())
recommendations['recommendation'] = recommendations['recommendation'].apply(lambda x: list(set(x)))
After that, the rows in the recommendations DataFrame should look similar to the following (the row order may vary):
items recommendation
0 (cucumber) [cabbage]
1 (cabbage) [cucumber]
2 (carrot) [pepper]
3 (pepper) [carrot, potato]
4 (tomato) [corn]
5 (corn) [tomato]
6 (radishes) [potato]
7 (carrot, potato) [pepper]
8 (potato, pepper) [carrot]
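With the table in this shape, a small lookup helper is all you need to serve cross-sell suggestions. The following sketch is my own illustration built on the recommendations DataFrame above; recommend() is a hypothetical helper, not a library function:

# Hypothetical helper: return the recommendation list for a single basket item
def recommend(item):
    mask = recommendations['items'].apply(lambda s: s == frozenset([item]))
    hits = recommendations.loc[mask, 'recommendation']
    return hits.iloc[0] if not hits.empty else []

print(recommend('pepper'))   # e.g., ['carrot', 'potato']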
Oracle Database allows you to perform association analysis with the Apriori algorithm. Use the DBMS_DATA_MINING package, a component of the Oracle Advanced Analytics option to Oracle Database Enterprise Edition. (In Oracle Database 21c, Oracle Data Mining was rebranded as Oracle Machine Learning for SQL. The PL/SQL package name, however, has not changed and remains DBMS_DATA_MINING.)
Here is a simple example of using DBMS_DATA_MINING to generate association rules against a set of transaction data. To begin, you need to have a table or view with transaction data. In its simplest form, such a table might look like the one created below.
CREATE TABLE transactions (
trans_id NUMBER(10),
prod_name VARCHAR2(20)
);
The trans_id column is the case identifier column in the table. In this context, a case is the collection of items that make up a single transaction.
Once you have created the transactions table, you need to populate it with data. For this, you can execute the INSERT statements found in the transactions.sql file accompanying this article.
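The file itself is not reproduced here, but to give a sense of the format, the first two Python transactions from earlier would translate into rows like the following hypothetical excerpt (one row per item in a transaction):

-- Hypothetical excerpt from transactions.sql: one row per item
INSERT INTO transactions (trans_id, prod_name) VALUES (1, 'cabbage');
INSERT INTO transactions (trans_id, prod_name) VALUES (1, 'cucumber');
INSERT INTO transactions (trans_id, prod_name) VALUES (2, 'cabbage');
INSERT INTO transactions (trans_id, prod_name) VALUES (2, 'eggplant');
INSERT INTO transactions (trans_id, prod_name) VALUES (2, 'cucumber');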
You might also want to prepare a market_settings table to override some default settings of the association model that you're going to create next.
CREATE TABLE market_settings AS
  SELECT *
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
  WHERE setting_name LIKE 'ASSO_%';
Then, you can change the default values of the settings found in the market_settings table and build the association model, all in a single PL/SQL block.
BEGIN
  UPDATE market_settings
     SET setting_value = TO_CHAR(0.1)
   WHERE setting_name = DBMS_DATA_MINING.asso_min_support;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'market_model',
    mining_function     => DBMS_DATA_MINING.ASSOCIATION,
    data_table_name     => 'transactions',
    case_id_column_name => 'trans_id',
    target_column_name  => NULL,
    settings_table_name => 'market_settings');
END;
/
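To verify that the model was built, you might query the USER_MINING_MODELS data dictionary view. This check is my addition rather than part of the original workflow; note that the model name is stored in uppercase by default.

-- Optional sanity check: the new model should be listed here
SELECT model_name, mining_function, algorithm
FROM user_mining_models
WHERE model_name = 'MARKET_MODEL';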
You can view the association rules generated for the transaction data found in the transactions table with the help of the DBMS_DATA_MINING.GET_ASSOCIATION_RULES table function, as illustrated in the following example:
SELECT A.attribute_str_value antecedent,
       C.attribute_str_value consequent,
       rule_support support,
       rule_confidence confidence,
       rule_lift lift
FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('market_model')) R,
     TABLE(R.antecedent) A,
     TABLE(R.consequent) C
WHERE rule_lift >= 2.5 AND rule_confidence >= 0.5;
The query above should produce the following output:
ANTECEDENT CONSEQUENT SUPPORT CONFIDENCE LIFT
---------- ---------- ------- -------------- --------------
pepper carrot .2 1 2.5
radishes potato .1 1 4
corn tomato .1 1 6.66666667
potato carrot .1 1 2.5
pepper carrot .1 1 2.5
cabbage cucumber .2 .8 2.66666667
cucumber cabbage .2 .666666667 2.66666667
potato pepper .1 .666666667 3.33333333
carrot pepper .1 .666666667 3.33333333
tomato corn .1 .666666667 6.66666667
carrot pepper .2 .5 2.5
After an item has appeared in a customer basket, which item is going to appear there next? Of course, you cannot say for sure, but you can make a prediction based on the history of customer purchases, using association analysis. In this article, you saw how this type of analysis can be performed on a sample set of transactions using both Python and Oracle Database.
Illustration: Wes Rowell
Yuli Vasiliev (@VasilievYuli) is a programmer, freelance writer, and consultant specializing in open source development, Oracle Database technologies, and natural-language processing (NLP). He is the author of Natural Language Processing with Python and spaCy.