Transactional data can be used to partition a customer base into ‘segments’ based on predominant purchasing patterns. Typical segments might be ‘convenience shoppers’ (customers who buy a lot of ready meals, pizzas, etc.), ‘cooking from scratch’ (customers who buy a lot of ingredients used when preparing meals yourself, e.g. flour, rice, fresh vegetables), ‘budget’ (customers who buy a lot of low-priced products, e.g. private labels rather than big brands), etc. Note that we do not want this segmentation to reflect different levels of spend within the store (e.g. regular customers, infrequent customers, etc.).

Assume we have ‘item level’ transactions. This means that for each customer we have data records that give us the following information:
• customer identifier
• product purchased (e.g. “Coke Zero 2 Litre” or “Coca Cola Regular 8x330ml”)
• department and category of purchase (e.g. dairy products, milk)
• date and time of purchase
• price of product
• store
Usually we would use 12 months of data for our analysis. This means that we can track each customer’s purchases over a whole year.
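To make the record layout concrete, here is a minimal sketch of how such item-level data might look in pandas, with a helper that restricts it to a 12-month window. The column names, the sample rows, the department and category values and the function `last_12_months` are all made up for illustration; the question itself does not specify a schema.

```python
import pandas as pd

# Hypothetical schema: one row per item purchased.
transactions = pd.DataFrame(
    {
        "customer_id": ["C1", "C1", "C2"],
        "product": [
            "Coke Zero 2 Litre",
            "Coca Cola Regular 8x330ml",
            "Coke Zero 2 Litre",
        ],
        "department": ["soft drinks", "soft drinks", "soft drinks"],  # assumed values
        "category": ["cola", "cola", "cola"],                         # assumed values
        "purchased_at": pd.to_datetime(
            ["2023-01-05 10:15", "2023-06-20 18:40", "2021-03-01 09:00"]
        ),
        "price": [1.99, 4.50, 1.89],
        "store": ["S01", "S01", "S02"],
    }
)


def last_12_months(df, end):
    """Keep rows whose purchase time falls in the 12 months up to `end`."""
    end = pd.Timestamp(end)
    start = end - pd.DateOffset(months=12)
    mask = (df["purchased_at"] > start) & (df["purchased_at"] <= end)
    return df.loc[mask]


# The 2021 purchase falls outside the window and is dropped.
window = last_12_months(transactions, "2023-12-31")
```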
The challenge in finding these segments is to cope with the complexity of the data. We do this using standard methods from multivariate statistics, as described in basic textbooks on the subject (e.g. “Applied Multivariate Statistical Analysis” by Johnson and Wichern). Currently this involves a sequential approach:
Step 1: Within a department (e.g. dairy products) we group products that tend to get purchased by the same customers.
1.1 Please explain how you would prepare the data (e.g. aggregating transactions and creating some appropriate statistical measure of “tend to get purchased by the same customers”) and which statistical methods you would then apply in order to complete this task.
1.2 Which problems might we come across in the data that could lead to misinterpretation of statistical results?
1.3 Why don’t we just use the existing categories (e.g. milk, butter, yoghurt, etc.)?
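For question 1.1, one possible way (certainly not the only one) to quantify “tend to get purchased by the same customers” is a Jaccard similarity between products: the number of customers who bought both products divided by the number who bought at least one of them. The sketch below assumes hypothetical columns `customer_id` and `product` and a made-up dairy sample; the resulting similarity matrix could then be fed into a clustering method of your choice.

```python
import numpy as np
import pandas as pd


def product_jaccard(df):
    """Jaccard similarity between products, based on shared customers."""
    # 1 if the customer bought the product at least once, else 0.
    bought = pd.crosstab(df["customer_id"], df["product"]).gt(0).astype(int)
    m = bought.to_numpy()
    both = m.T @ m                  # customers who bought both products
    counts = np.diag(both)          # customers who bought each product
    either = counts[:, None] + counts[None, :] - both
    sim = both / either             # never 0/0: every product has a buyer
    return pd.DataFrame(sim, index=bought.columns, columns=bought.columns)


# Made-up sample: milk and butter share customers, yoghurt does not.
dairy = pd.DataFrame(
    {
        "customer_id": ["A", "A", "B", "B", "C"],
        "product": ["milk", "butter", "milk", "butter", "yoghurt"],
    }
)
sim = product_jaccard(dairy)
```

Using a binary bought/not-bought indicator rather than raw quantities is a deliberate choice here: it keeps heavy shoppers from dominating the measure.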
Step 2: As a result of step 1 we would expect to find about 15 such product groups within each department, which would give us a total of 200-300 product groups across all departments. We can now calculate total spend or number of transactions within these product groups for each customer.
Ideally, though, we do not want to use more than about 20 variables per customer as the basis for customer segmentation. Our next task is to understand cross-departmental shopping by further reducing the dimensionality of the data set.
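The per-customer totals described here can be sketched with a pandas pivot table. This assumes each item-level row has already been labelled with a `product_group` from step 1; that column name, the group labels and the function `customer_group_matrix` are hypothetical. Since the record layout has no basket identifier, the count below is the number of item lines, which is only a stand-in for “number of transactions”.

```python
import pandas as pd


def customer_group_matrix(df, aggfunc):
    """Customer x product-group table, aggregating `price` with `aggfunc`."""
    return df.pivot_table(
        index="customer_id",
        columns="product_group",
        values="price",
        aggfunc=aggfunc,
        fill_value=0,  # customer never bought from this group
    )


# Made-up sample rows with assumed group labels.
tx = pd.DataFrame(
    {
        "customer_id": ["C1", "C1", "C1", "C2"],
        "product_group": ["dairy_01", "dairy_01", "bakery_03", "dairy_01"],
        "price": [1.20, 0.80, 2.50, 1.20],
    }
)

spend = customer_group_matrix(tx, "sum")      # total spend per group
n_items = customer_group_matrix(tx, "count")  # item lines per group
```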
2.1 Which statistical techniques can we use for this purpose?
2.2 How would you prepare the data?
2.3 What might the result look like?
Step 3: Having reduced the dimensionality of the data, we now have a small set of variables that describe each customer’s predominant purchasing patterns in the way that we need. We now want to find customer segments with similar purchasing behavior.
3.1 Which statistical techniques can we use for this purpose?
3.2 How do we need to transform the data first in order to achieve the results we are looking for?
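For question 3.2, recall the requirement that the segmentation should not reflect different levels of spend. One transformation that might help (an assumption on my part, not necessarily the intended answer) is to convert each customer's spend into shares of their own total, so that a light and a heavy shopper with the same mix look identical. The column names and the function `to_spend_shares` below are hypothetical.

```python
import pandas as pd


def to_spend_shares(spend):
    """Divide each customer's row by its total so that rows sum to 1."""
    totals = spend.sum(axis=1)
    # Customers with zero total spend would give 0/0, so drop them.
    nonzero = totals > 0
    return spend.loc[nonzero].div(totals[nonzero], axis=0)


# Made-up sample: same purchasing mix, very different spend levels.
spend = pd.DataFrame(
    {"convenience": [10.0, 100.0], "scratch_cooking": [30.0, 300.0]},
    index=["light_shopper", "heavy_shopper"],
)
shares = to_spend_shares(spend)
```

After this step both rows hold the same shares, so a distance-based clustering method would treat the two customers as equivalent.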
The above process describes how we currently tackle the problem of segmenting customers based on what they buy.
4. Which other ways of segmenting customers can you think of, both using transactional data only and using additional data sources that might be available?

Hey did you get an answer for that question?