--- title: "Market Basket Analyses" author: "Gizem Guleli" date: '2023-02-04' output: html_document: default word_document: default pdf_document: default --- ```{r, echo=T,warning=FALSE,results=FALSE, message=FALSE} library (arules) library (dplyr) library(plyr) library(tidyr) library(tidyverse) library (arulesViz) library (Rcpp) ``` ### DATA The Groceries dataset is a collection of 38765 transactions made at a grocery store. Each transaction contains a list of items purchased by a customer. This dataset can be used for market basket analysis, which is a technique used to identify patterns in customer purchasing behavior. I first read data set and then clean missing values and dublicates row. ```{r Read data} #read and summarize data groceries <- read.csv("Groceries_dataset.csv") head(groceries) summary(groceries) ``` ```{r Clean data} # Remove any duplicates groceries <- distinct(groceries) # Convert the Date column to a date format groceries$Date <- as.Date(groceries$Date, format = "%Y-%m-%d") # Remove any rows with missing values groceries2 <- drop_na(groceries) ###Change the name of variables to make more easy colnames(groceries2)[colnames(groceries2)=="itemDescription"] <- "Item" ###Check last version of dataset head(groceries2) dim(groceries2) str(groceries2) ``` ```{r} ###Order data according to member_number sorted <- groceries2[order(groceries2$Member_number),] ###Group all the items that bought by the same customer on the same date to see set of items itemList <- aggregate(Item ~ Member_number + Date, data = groceries2, FUN = function(x) paste(x, collapse = ",")) head(itemList,5) dim(itemList) ``` ### ANALYSE First of all the original groceries data frame is subsetted to select only the column containing the items , and then converted the resulting data frame of items to a transaction object using read.transaction() function because the transaction object is the preferred format for performing market basket analysis using the arules package in R. By converting the item data into a transaction object, we can use the functions in the arules package to perform association rule mining, which involves finding patterns in the data such as frequent itemsets (groups of items that are frequently purchased together) and association rules. In the end I print the total number of transactions and items in the dataset.As we see from the result there are 14935 transaction and 168 items. ```{r Subset only for items and Convert Transaction } subset <- itemList[,3] write.csv(subset,"subset", quote = FALSE, row.names = TRUE) head(subset) dim(subset) trans1 = read.transactions(file="subset", rm.duplicates= TRUE, format="basket",sep=",",cols=1); head(trans1) dim(trans1) length(trans1) LIST(head(trans1)) # Print the number of transactions and items in the dataset cat("Number of transactions:", length(trans1), "\n") cat("Number of items:", length(itemLabels(trans1)), "\n") itemFrequencyPlot(trans1, topN = 25) ``` Calculates the relative and absolute frequency of occurrence of each item in the transaction dataset trans1. First frequencies are expressed as a proportion between 0 and 1, where 1 represents an item that appears in every transaction and 0 represents an item that does not appear in any transaction while the second (absolute) frequencies are expressed as infinite integers that represent the number of transactions in which the item appears. To see the most occured items I ordered result as descending and results show that whole milk, other vegetables and rollss/buns are the most occured 3 items. ```{r One Dimension Table} head(sort(round(itemFrequency(trans1),3),decreasing = TRUE)) ``` ```{r} head(sort(itemFrequency(trans1, type="absolute"),decreasing = TRUE)) ``` ### Algorithms #### Eclat ```{r} sets<-eclat(trans1, parameter = list( sup = 0.001 , maxlen=15)) class(sets) inspect(head(sets, n = 5)) head(round(support(items(sets), trans1) , 2)) # getting rules sets<-ruleInduction(sets, trans1, confidence=0.08) # screening the part of the rules inspect(head(sets)) summary(sets) plot(sets, jitter = 0) plot(sets[1:50], method="graph") plot(sets[1:50], method="paracoord") ``` #### The Apriori ```{r} # creating rules - standard settings rules.trans1<-apriori(trans1, parameter=list(supp=0.001 , conf=0.08)) summary(rules.trans1) plot(rules.trans1, jitter = 0) plot(rules.trans1[1:50], method="graph") plot(rules.trans1[1:50], method="paracoord") ``` ```{r} # Changing rules - standard settings rules.trans2<-apriori(trans1, parameter=list(supp=0.002 , conf=0.05)) summary(rules.trans2) plot(rules.trans2, jitter = 0.25) plot(rules.trans2[1:20], method="graph") plot(rules.trans2[1:50], method="paracoord") ``` When examining the co-occurrence of items in the dataset, we can see that there are many strong associations between items. For example, if a customer purchases citrus fruit, they are also likely to purchase tropical fruit, root vegetables, other vegetables, and whole milk. The lift values show us that these associations are stronger than what we would expect by chance. The chi-squared test shows that there are significant associations between many of the items in the dataset. The p-values are very small, indicating that we can reject the null hypothesis of independence and conclude that there are associations between items. Overall, the analysis shows that there are many strong associations between items in the groceries dataset, and that these associations are significant. This information can be useful for retailers who want to optimize product placement, promotions, and recommendations for customers based on their purchase history. I set the minimum support to 0.001 and the minimum confidence to 0.08. This means that we are only interested in rules with a support of at least 0.001 (i.e., the rule must appear in at least 0.1% of all transactions) and a confidence of at least 0.08 (i.e., the rule must be correct at least 8% of the time). Then tried it with minimum support to 0.002 and the minimum confidence to 0.05. Then I use the inspect() function to generate a list of association rules based on the specified itemsets, and sort the rules by lift. Finally, convert the list of rules to a data frame and show the top 10 rules. #### Rules for closed itemsets ```{r} trans1.closed<-apriori(trans1, parameter=list(target="closed frequent itemsets", support=0.01)) inspect(head(trans1.closed)) is.closed(trans1.closed) freq.closed<-eclat(trans1, parameter=list(supp=0.001, maxlen=15, target="closed frequent itemsets")) inspect(head(freq.closed)) freq.max<-eclat(trans1, parameter=list(supp=0.001, maxlen=15, target="maximally frequent itemsets")) inspect(head(freq.max)) ``` ```{r} ####similarity & dissimilarity trans.sel<-trans1[,itemFrequency(trans1)>0.05] # selected transations d.jac.i<-dissimilarity(trans.sel, which="items") # Jaccard as default round(d.jac.i,2) ``` ### REFERENCES Data: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset