Chapter 10 Market Basket Analysis

Market basket analysis is a type of affinity analysis that can be used to discover co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In the market basket analysis, one seeks to understand the purchase behavior of customers. For example, if you are in a grocery store and you buy a 6-pack beer, you are more likely to buy a bag of chips at the same time than somebody who buys a dozen of eggs.

Each time a customer makes a purchase, it is referred to a transaction. The items bought in that transaction is referred to as an itemset. Market basket analysis investigates the combinations of items that occur together frequently in transactions to understand shopping behavior. This kind of analysis can be very helpful in retail. For example, in a grocery store, the items that are bought together frequently can be put close to each other. In online shopping, a website can recommend the associated items to customers.

10.1 Data

The data file groceries.csv contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9,835 transactions and the items are aggregated to 169 categories, for example, by combining the same grocery of different brands.

The first 5 rows of data are shown below. Each row represents a transaction with different grocery items.

readLines('data/groceries.csv', n = 5)
## [1] "citrus fruit,semi-finished bread,margarine,ready soups"             
## [2] "tropical fruit,yogurt,coffee"                                       
## [3] "whole milk"                                                         
## [4] "pip fruit,yogurt,cream cheese ,meat spreads"                        
## [5] "other vegetables,whole milk,condensed milk,long life bakery product"

Although the data are in the .csv format, they are very different from typical data we see. Usually, we have the same number of columns and even when the number of columns is different, we would have the same information on the same column. Here, it is not the case. For example, if we use read.csv directly, we have

read.csv('data/groceries.csv', nrows=4, header=FALSE, fill = TRUE)
##               V1                  V2            V3           V4
## 1   citrus fruit semi-finished bread     margarine  ready soups
## 2 tropical fruit              yogurt        coffee             
## 3     whole milk                                               
## 4      pip fruit              yogurt cream cheese  meat spreads

Note that although we successfully read the data into 4 variables, it will cause problems in data analysis. This is because we would hope to have a matrix in the following format: Each row would represent a transaction and each column represents a type of grocery. Therefore, we would expect a 9835 by 169 matrix. Since rarely people would buy all 169 items a time, most likely a few items only, such a matrix is expected to be sparse.

The R package arules can be used to conduct market basket analysis. It also includes a function to read such data into a sparse matrix.

grocery <- read.transactions('data/groceries.csv', sep=',')
grocery
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
str(grocery)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  169 obs. of  1 variable:
##   .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables
summary(grocery)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

From the output, we see there are a total of 9,836 transactions and a total of 169 different groceries. The density of the matrix is about 2.6%. The most frequently bought item is whole milk, then other vegetables. We can also see that the majority of transactions has 10 items or less. In fact, there are 2,159 transactions with only one item. But there is also 1 transaction with 32 items.

inspect(grocery[size(grocery)==32])
##     items               
## [1] {beef,              
##      beverages,         
##      butter,            
##      candles,           
##      chicken,           
##      citrus fruit,      
##      cream cheese,      
##      curd,              
##      domestic eggs,     
##      flour,             
##      frankfurter,       
##      ham,               
##      hard cheese,       
##      hygiene articles,  
##      liver loaf,        
##      margarine,         
##      mayonnaise,        
##      other vegetables,  
##      pasta,             
##      roll products,     
##      rolls/buns,        
##      root vegetables,   
##      sausage,           
##      skin care,         
##      soft cheese,       
##      soups,             
##      specialty fat,     
##      sugar,             
##      tropical fruit,    
##      whipped/sour cream,
##      whole milk,        
##      yogurt}

10.2 Association rules

The purpose of market basket analysis is often to identify association rules among items and this can be done through the Apriori algorithm. An association rule implies that if an item A occurs, then item B also occurs with a certain probability. Using the grocery data as an example.

Transaction Items
t1 {citrus fruit,semi-finished bread,margarine,ready soups}
t2 {tropical fruit,yogurt,coffee}
t3 {pip fruit,yogurt,cream cheese ,meat spreads}

In the table above we can see three transactions. Each transaction shows items bought in that transaction. We can represent our items as an item set as follows:

\[I=\{i_1, i_2,..., i_k\}\] where \(k\) is the total number of items such as grocery items.

For the grocery data, we have

\[I=\text{\{citrus fruit, semi-finished bread, margarine, ready soups, tropical fruit, yogurt, coffee ...\}}\]

A transaction is represented by the following expression:

\[T=\{t_1, t_2,..., t_n\}\] where \(n\) is the number items in this transaction. For example,

\[t_1=\text{\{citrus fruit, semi-finished bread, margarine, ready soups\}}\]

Then, an association rule is defined as an implication of the form:

\(X \Rightarrow Y\), where \(X \subset I\), \(Y \subset I\) and \(X \cap Y = 0\)

For example,

\[\text{\{tropical fruit, yogurt\}} \Rightarrow \{coffee\}.\]

However, whether such a rule is plausible or not needs evaluation. To do so, we can use the following measures - support, confidence, lift, and conviction.

10.2.1 Support

Support is an indication of how frequently the itemset appears in the data set.

\[supp(X \& Y)=\dfrac{|X \cup Y|}{n} = P(X \& Y)\]

In other words, it’s the number of transactions with both \(X\) and \(Y\) divided by the total number of transactions. For example, for the grocery data.

  • \(supp(citrus\ fruit)=\frac{814}{9835} = 8.3\%\)

  • \(supp(citrus\ fruit \& margarine)=\frac{78}{9835}= 0.8 \%\)

  • \(supp(beef\ \& \ other\ vegetables \& root\ vegetables)=\frac{78}{9835} = 0.8 \%\)

grocery.text <- readLines('data/groceries.csv')

sum(str_detect(grocery.text, "citrus fruit"))
## [1] 814

sum(str_detect(grocery.text, "citrus fruit") & str_detect(grocery.text, "margarine"))
## [1] 78
sum(str_detect(grocery.text, "beef") & str_detect(grocery.text, "other vegetables") & str_detect(grocery.text, "root vegetables"))
## [1] 78

10.2.2 Confidence

Confidence is an indication of how often the rule has been found to be true. For a rule \(X \Rightarrow Y\), it means given \(X\), what is the probability to have \(Y\) in the same comment/transaction. This is called confidence. Confidence is like a conditional probability and it is an indication of how often the rule has been found to be true.

\[conf(X \Rightarrow Y)=\dfrac{supp(X \cup Y)}{supp(X)} = \frac{P(X \& Y)}{P(X)}.\]

For example, if a shopper bought “citrus fruit”, the confidence for the shopper to also buy “margarine” is

  • \(conf(\text{citrus fruit } \Rightarrow \text{ margarine})=\dfrac{\frac{78}{9835}}{\frac{814}{9835}}= 9.6\%\)
mean(str_detect(grocery.text, "citrus fruit") & str_detect(grocery.text, "margarine")) / mean(str_detect(grocery.text, "citrus fruit"))
## [1] 0.0958231

10.2.3 Lift

The lift of a rule is the ratio of the observed confidence of \(X \Rightarrow Y\) to the support of \(Y\), and is defined as

\[lift(X \Rightarrow Y)=\dfrac{supp(X \cup Y)}{supp(X)supp(Y) }.\]

In other words, the lift is the ratio between observing \(Y\) given \(X\) to observing \(Y\) only. Another way to interpret it is that the ratio between \(X\) and \(Y\) together and \(X\) and \(Y\) independently. If the ratio is greater than 1, then it means it is more likely \(X\) and \(Y\) will be bought together than each individually. Therefore, greater lift values indicate stronger associations.

If the rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.

If the lift is > 1, that lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future data sets.

If the lift is < 1, that lets us know the items are a substitute to each other. This means that the presence of one item has a negative effect on the presence of the other item and vice versa.

  • \(lift(\text{citrus fruit } \Rightarrow \text{ margarine})=1.64\)
mean(str_detect(grocery.text, "citrus fruit") & str_detect(grocery.text, "margarine")) / (mean(str_detect(grocery.text, "citrus fruit")) * mean(str_detect(grocery.text, "margarine")))
## [1] 1.636146

10.2.4 Conviction

The conviction of a rule is defined as

\[conv(X \Rightarrow Y)=\dfrac{1-supp(Y)}{1-conf(X \Rightarrow Y) }\]

It can be interpreted as the ratio of the expected frequency that \(X\) occurs without \(Y\) (that is to say, the frequency that the rule makes an incorrect prediction) if \(X\) and \(Y\) were independent divided by the observed frequency of incorrect predictions. A high value means that the consequent depends strongly on the antecedent.

  • \(conv(\text{citrus fruit } \Rightarrow \text{ margarine})= =1.01\)

In this example, the conviction value of 1.01 shows that the rule \(\text{citrus fruit } \Rightarrow \text{ margarine}\) would be incorrect 1% more often (1.01 times as often) if the association between citrus fruit and margarine was purely random chance.

(1 - mean(str_detect(grocery.text, "citrus fruit") )) /(1 - mean(str_detect(grocery.text, "citrus fruit") & str_detect(grocery.text, "margarine")) / mean(str_detect(grocery.text, "citrus fruit")))
## [1] 1.014441

10.3 Basic visualization

First, we can plot the frequent items using the function itemFrequencyPlot(). The code below plots the top 20 most frequently bought items.

itemFrequencyPlot(grocery, topN=20,
                  type="absolute",
                  horiz=TRUE)

The itemFrequencyPlot() allows us to show the absolute or relative values. If absolute it will plot numeric frequencies of each item independently. If ‘relative’ it will plot how many times these items have appeared as compared to others, as it’s shown in the following plot.

itemFrequencyPlot(grocery, topN=20,
                  type="relative",
                  horiz=TRUE)

“Whole milk” is the best-selling product, followed by other vegetables and rolls/buns.

We can also visualize the transaction data using the level plot. Each row represents one transaction with a dot representing an item in that transaction.

image(grocery[1:200], aspect='fill')

We can also randomly select a sample to visualize.

image(sample(grocery, 100), aspect='fill')


itemLabels(grocery)[168:169]
## [1] "yogurt"   "zwieback"

10.4 Apriori algorithm

The function apriori can be used to identify the association rules. To use the function, we need to make a choice on the support and confidence parameters. If we set these values too low, then the algorithm will take longer to execute, and we will get a lot of rules and most of them will not be useful. On the other hand, if we set these values too high, then the algorithm will get very few rules and miss some meaningful rules. The default behavior is to mine rules with minimum support of 0.1 and minimum confidence of 0.8.

The R generates rules in the following format:

LHS => RHS {A} => {B}

Apriori only creates rules with one item in the RHS (Consequent). The default value in the function parameter for minlen is 1. This means that rules with only one item (i.e., an empty antecedent/LHS) like

{} => {beer}

will be created. These rules mean that no matter what other items are involved, the item in the RHS will appear with the probability given by the rule’s confidence (which equals the support). If you want to avoid these rules then use the argument parameter=list(minlen=2).

If the minimum support is chosen too low for the dataset, then the algorithm will try to create an extremely large set of itemsets/rules. This will result in very long run time and eventually the process will run out of memory. To prevent this, the default maximal length of itemsets/rules is restricted to 10 items (via the parameter element maxlen=10) and the time for checking subsets is limited to 5 seconds (via maxtime=5). The output will show if you hit these limits in the “checking subsets” line of the output. The time limit is only checked when the subset size increases, so it may run significantly longer than what you specify in maxtime. Setting maxtime=0 disables the time limit.

Here we simply set support=0.005 and confidence = 0.25.

rules.set1 <- apriori(grocery, 
                      parameter = list(supp=0.01,
                                       conf=0.25,
                                       minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [170 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules.set1
## set of 170 rules

The above specification leads to 170 rules. Basic information about the rules is given below.

summary(rules.set1)
## set of 170 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 96 74 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.435   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence          lift            count      
##  Min.   :0.01007   Min.   :0.2517   Min.   :0.9932   Min.   : 99.0  
##  1st Qu.:0.01159   1st Qu.:0.2973   1st Qu.:1.5215   1st Qu.:114.0  
##  Median :0.01454   Median :0.3587   Median :1.7784   Median :143.0  
##  Mean   :0.01822   Mean   :0.3703   Mean   :1.8747   Mean   :179.2  
##  3rd Qu.:0.02097   3rd Qu.:0.4253   3rd Qu.:2.1453   3rd Qu.:206.2  
##  Max.   :0.07483   Max.   :0.5862   Max.   :3.2950   Max.   :736.0  
## 
## mining info:
##     data ntransactions support confidence
##  grocery          9835    0.01       0.25

We can show the first few rules. The first rule means that if one buys “hard cheese”, he/she will also buy “whole milk”. The support of it is 0.01, meaning in 1% of the transactions, the two times were bought together. The confidence is 41% meaning that in transactions with “hard cheese”, 41% also bought “whole milk”. The lift is 1.61. This means comparing to overall shoppers who bought “whole milk”, the shoppers who bought “hard cheese” will lift the possibility to buy whole milk 61%.

inspect(rules.set1[1:10])
##      lhs                rhs                support    confidence lift     count
## [1]  {hard cheese}   => {whole milk}       0.01006609 0.4107884  1.607682  99  
## [2]  {butter milk}   => {other vegetables} 0.01037112 0.3709091  1.916916 102  
## [3]  {butter milk}   => {whole milk}       0.01159126 0.4145455  1.622385 114  
## [4]  {ham}           => {whole milk}       0.01148958 0.4414062  1.727509 113  
## [5]  {sliced cheese} => {whole milk}       0.01077783 0.4398340  1.721356 106  
## [6]  {oil}           => {whole milk}       0.01128622 0.4021739  1.573968 111  
## [7]  {onions}        => {other vegetables} 0.01423488 0.4590164  2.372268 140  
## [8]  {onions}        => {whole milk}       0.01209964 0.3901639  1.526965 119  
## [9]  {berries}       => {yogurt}           0.01057448 0.3180428  2.279848 104  
## [10] {berries}       => {other vegetables} 0.01026945 0.3088685  1.596280 101

We can order the rules, too.

inspect(sort(rules.set1)[1:10])
##      lhs                   rhs                support    confidence lift    
## [1]  {other vegetables} => {whole milk}       0.07483477 0.3867578  1.513634
## [2]  {whole milk}       => {other vegetables} 0.07483477 0.2928770  1.513634
## [3]  {rolls/buns}       => {whole milk}       0.05663447 0.3079049  1.205032
## [4]  {yogurt}           => {whole milk}       0.05602440 0.4016035  1.571735
## [5]  {root vegetables}  => {whole milk}       0.04890696 0.4486940  1.756031
## [6]  {root vegetables}  => {other vegetables} 0.04738180 0.4347015  2.246605
## [7]  {yogurt}           => {other vegetables} 0.04341637 0.3112245  1.608457
## [8]  {tropical fruit}   => {whole milk}       0.04229792 0.4031008  1.577595
## [9]  {tropical fruit}   => {other vegetables} 0.03589222 0.3420543  1.767790
## [10] {bottled water}    => {whole milk}       0.03436706 0.3109476  1.216940
##      count
## [1]  736  
## [2]  736  
## [3]  557  
## [4]  551  
## [5]  481  
## [6]  466  
## [7]  427  
## [8]  416  
## [9]  353  
## [10] 338

inspect(sort(rules.set1, by='lift')[1:10])
##      lhs                                  rhs                support   
## [1]  {citrus fruit,other vegetables}   => {root vegetables}  0.01037112
## [2]  {other vegetables,tropical fruit} => {root vegetables}  0.01230300
## [3]  {beef}                            => {root vegetables}  0.01738688
## [4]  {citrus fruit,root vegetables}    => {other vegetables} 0.01037112
## [5]  {root vegetables,tropical fruit}  => {other vegetables} 0.01230300
## [6]  {other vegetables,whole milk}     => {root vegetables}  0.02318251
## [7]  {curd,whole milk}                 => {yogurt}           0.01006609
## [8]  {other vegetables,yogurt}         => {root vegetables}  0.01291307
## [9]  {other vegetables,yogurt}         => {tropical fruit}   0.01230300
## [10] {other vegetables,rolls/buns}     => {root vegetables}  0.01220132
##      confidence lift     count
## [1]  0.3591549  3.295045 102  
## [2]  0.3427762  3.144780 121  
## [3]  0.3313953  3.040367 171  
## [4]  0.5862069  3.029608 102  
## [5]  0.5845411  3.020999 121  
## [6]  0.3097826  2.842082 228  
## [7]  0.3852140  2.761356  99  
## [8]  0.2974239  2.728698 127  
## [9]  0.2833724  2.700550 121  
## [10] 0.2863962  2.627525 120

We can also identify rules that are related to a specific item.

citrus.rules <- subset(rules.set1, items %in% "citrus fruit")
inspect(citrus.rules)
##     lhs                                rhs                support    confidence
## [1] {citrus fruit}                  => {yogurt}           0.02165735 0.2616708 
## [2] {citrus fruit}                  => {other vegetables} 0.02887646 0.3488943 
## [3] {citrus fruit}                  => {whole milk}       0.03050330 0.3685504 
## [4] {citrus fruit,root vegetables}  => {other vegetables} 0.01037112 0.5862069 
## [5] {citrus fruit,other vegetables} => {root vegetables}  0.01037112 0.3591549 
## [6] {citrus fruit,yogurt}           => {whole milk}       0.01026945 0.4741784 
## [7] {citrus fruit,whole milk}       => {yogurt}           0.01026945 0.3366667 
## [8] {citrus fruit,other vegetables} => {whole milk}       0.01301474 0.4507042 
## [9] {citrus fruit,whole milk}       => {other vegetables} 0.01301474 0.4266667 
##     lift     count
## [1] 1.875752 213  
## [2] 1.803140 284  
## [3] 1.442377 300  
## [4] 3.029608 102  
## [5] 3.295045 102  
## [6] 1.855768 101  
## [7] 2.413350 101  
## [8] 1.763898 128  
## [9] 2.205080 128

We can use the arulesViz package to visualize association rules. Let’s begin with a simple scatter plot with different measures of interestingness on the axes (lift and support) and a third measure (confidence) represented by the color of the points.

plot(rules.set1, measure=c("support","lift"), shading="confidence")

The following visualization represents the rules as a graph with items as labeled vertices, and rules represented as vertices connected to items using arrows.

# Graph
plot(rules.set1, method="graph")


plot(citrus.rules, method="graph")

We can also change the graph layout.

plot(citrus.rules, method="graph", control=list(layout=igraph::in_circle()))

We can represent the rules as a grouped matrix-based visualization. The support and lift measures are represented by the size and color of the balloons, respectively.

plot(citrus.rules, method="grouped")

We can also use a parallel coordinates plot.

plot(citrus.rules, method="paracoord", control=list(reorder=TRUE))

The function ruleExplorer() can explore association rules interactively.

10.5 References

  • https://www.kaggle.com/xvivancos/market-basket-analysis
  • Michael Hahsler, Kurt Hornik, and Thomas Reutterer (2006) Implications of probabilistic data modeling for mining association rules. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nuernberger, and W. Gaul, editors, From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598–605. Springer-Verlag.
  • Michael Hahsler, Christian Buchta, Bettina Gruen and Kurt Hornik (2018). arules: Mining Association Rules and Frequent Itemsets. R package version 1.6-1. https://CRAN.R-project.org/package=arules
  • Michael Hahsler, Bettina Gruen and Kurt Hornik (2005), arules - A Computational Environment for Mining Association Rules and Frequent Item Sets. Journal of Statistical Software 14/15. URL: http://dx.doi.org/10.18637/jss.v014.i15.
  • Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, and Christian Buchta (2011), The arules R-package ecosystem: Analyzing interesting patterns from large transaction datasets. Journal of Machine Learning Research, 12:1977–1981. URL: http://jmlr.csail.mit.edu/papers/v12/hahsler11a.html.
  • Michael Hahsler (2018). arulesViz: Visualizing Association Rules and Frequent Itemsets. R package version 1.3-1. https://CRAN.R-project.org/package=arulesViz