Applied Data Science Notebook in Python for Beginners to Professionals

An end-to-end tutorials on EDA & RFM Analysis - Applied Machine Learning & Data Science

Rule based Segmentation, EDA and RFM Analysis & Visualisation in Python

In [1]:
# Suppress warnings in Jupyter Notebooks

import warnings
warnings.filterwarnings("ignore")
In [2]:
# Load the library
import numpy as np 
import pandas as pd 
import seaborn as sns
from scipy import stats
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import plotly.graph_objects as go
In [3]:
# Load the dataset
df = pd.read_csv('OnlineRetail_sample.csv')
In [4]:
df.head()
Out[4]:
Quantity InvoiceDate UnitPrice CustomerID Country
0 6 12/01/2010 2.55 17850.0 United Kingdom
1 6 12/01/2010 3.39 17850.0 United Kingdom
2 8 12/01/2010 2.75 17850.0 United Kingdom
3 6 12/01/2010 3.39 17850.0 United Kingdom
4 6 12/01/2010 3.39 17850.0 United Kingdom
In [5]:
df.shape
Out[5]:
(541909, 5)
In [6]:
df.describe()
Out[6]:
Quantity UnitPrice CustomerID
count 541909.000000 541909.000000 406829.000000
mean 9.552250 4.611114 15287.690570
std 218.081158 96.759853 1713.600303
min -80995.000000 -11062.060000 12346.000000
25% 1.000000 1.250000 13953.000000
50% 3.000000 2.080000 15152.000000
75% 10.000000 4.130000 16791.000000
max 80995.000000 38970.000000 18287.000000
In [7]:
# How many missing values are in each column and of what data types they are?

print(); print("Database has {} observations (instances) and {} columns (attributes).".format(df.shape[0],df.shape[1]))
print(); print("Missing values in each column:\n{}".format(df.isnull().sum()))
print(); print("Columns data types:\n{}".format(df.dtypes))
Database has 541909 observations (instances) and 5 columns (attributes).

Missing values in each column:
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Columns data types:
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object
In [8]:
# Drop the intances that have null values
df.dropna(subset=['CustomerID'], how='all', inplace= True)
df.shape
Out[8]:
(406829, 5)
In [9]:
# considering the negative quantities as 0 for rfm analysis
df['Quantity']= list(map((lambda x: 0 if(x<0) else x),df['Quantity']))
In [10]:
# creating a new column called price which will quantity x unitprice
df['Price']= df['Quantity'] * df['UnitPrice']
In [11]:
df.head()
Out[11]:
Quantity InvoiceDate UnitPrice CustomerID Country Price
0 6 12/01/2010 2.55 17850.0 United Kingdom 15.30
1 6 12/01/2010 3.39 17850.0 United Kingdom 20.34
2 8 12/01/2010 2.75 17850.0 United Kingdom 22.00
3 6 12/01/2010 3.39 17850.0 United Kingdom 20.34
4 6 12/01/2010 3.39 17850.0 United Kingdom 20.34
In [12]:
df.describe()
Out[12]:
Quantity UnitPrice CustomerID Price
count 406829.000000 406829.000000 406829.000000 406829.000000
mean 12.736791 3.460471 15287.690570 21.904554
std 178.444861 69.315162 1713.600303 305.672001
min 0.000000 0.000000 12346.000000 0.000000
25% 2.000000 1.250000 13953.000000 4.200000
50% 5.000000 1.950000 15152.000000 11.100000
75% 12.000000 3.750000 16791.000000 19.500000
max 80995.000000 38970.000000 18287.000000 168469.600000
In [13]:
# Let's derive a separate column for Date from Invoice Date
df['date'] = pd.to_datetime(df['InvoiceDate']).dt.date
In [14]:
df
Out[14]:
Quantity InvoiceDate UnitPrice CustomerID Country Price date
0 6 12/01/2010 2.55 17850.0 United Kingdom 15.30 2010-12-01
1 6 12/01/2010 3.39 17850.0 United Kingdom 20.34 2010-12-01
2 8 12/01/2010 2.75 17850.0 United Kingdom 22.00 2010-12-01
3 6 12/01/2010 3.39 17850.0 United Kingdom 20.34 2010-12-01
4 6 12/01/2010 3.39 17850.0 United Kingdom 20.34 2010-12-01
... ... ... ... ... ... ... ...
541904 12 12/09/2011 0.85 12680.0 France 10.20 2011-12-09
541905 6 12/09/2011 2.10 12680.0 France 12.60 2011-12-09
541906 4 12/09/2011 4.15 12680.0 France 16.60 2011-12-09
541907 4 12/09/2011 4.15 12680.0 France 16.60 2011-12-09
541908 3 12/09/2011 4.95 12680.0 France 14.85 2011-12-09

406829 rows × 7 columns

In [15]:
# Inorder to fix the date of analysis we will see the date range
print(df['date'].min(), '\t', df['date'].max())
2010-12-01 	 2011-12-09
In [16]:
# Let's consider the analysis date as 2012-01-01 and lets calculate the recency
import datetime as dt

df['no_of_days'] = (dt.date(2012,1,1))-df['date']
df['no_of_days'] = df['no_of_days'].astype('timedelta64[D]')
In [17]:
df.head()
Out[17]:
Quantity InvoiceDate UnitPrice CustomerID Country Price date no_of_days
0 6 12/01/2010 2.55 17850.0 United Kingdom 15.30 2010-12-01 396.0
1 6 12/01/2010 3.39 17850.0 United Kingdom 20.34 2010-12-01 396.0
2 8 12/01/2010 2.75 17850.0 United Kingdom 22.00 2010-12-01 396.0
3 6 12/01/2010 3.39 17850.0 United Kingdom 20.34 2010-12-01 396.0
4 6 12/01/2010 3.39 17850.0 United Kingdom 20.34 2010-12-01 396.0
In [18]:
rfm = df.groupby('CustomerID').agg({'no_of_days': lambda x:x.min(), \
                                    'CustomerID': lambda x: len(x), \
                                    'Price': lambda x: x.sum()})     

rfm.rename(columns={'no_of_days': 'recency', 'CustomerID': 'frequency', 'Price': 'monetary'}, inplace=True)
In [19]:
rfm.head()
Out[19]:
recency frequency monetary
CustomerID
12346.0 348.0 2.0 77183.60
12347.0 25.0 182.0 4310.00
12348.0 98.0 31.0 1797.24
12349.0 41.0 73.0 1757.55
12350.0 333.0 17.0 334.40
In [20]:
rfm.describe()
Out[20]:
recency frequency monetary
count 4372.000000 4372.000000 4372.000000
mean 114.581199 93.053294 2038.290920
std 100.772139 232.471608 8956.019171
min 23.000000 1.000000 0.000000
25% 39.000000 17.000000 303.602500
50% 73.000000 42.000000 665.315000
75% 166.000000 102.000000 1652.790000
max 396.000000 7983.000000 280206.020000
In [ ]:
 

The scores are given as follows: for frequency and monetary 1 is the lowest and 4 the highest while for recency it is the opposite

Recency

In [21]:
rfm["recency_score"] = pd.cut(rfm["recency"], bins=[-1,np.percentile(rfm["recency"], 25),np.percentile(rfm["recency"], 50), 
                                                     np.percentile(rfm["recency"], 75), rfm["recency"].max()],
                                                     labels=[4, 3, 2, 1]).astype("int")

Frequency

In [22]:
rfm["frequency_score"] = pd.cut(rfm["frequency"], bins=[-1,np.percentile(rfm["frequency"], 25),
                                                           np.percentile(rfm["frequency"], 50), 
                                                        np.percentile(rfm["frequency"], 75), rfm["frequency"].max()],
                                                        labels=[1, 2, 3, 4]).astype("int")

Monetary

In [23]:
rfm["monetary_score"] = pd.cut(rfm["monetary"], bins=[-1,np.percentile(rfm["monetary"], 25),
                                                         np.percentile(rfm["monetary"], 50), 
                                                      np.percentile(rfm["monetary"], 75), rfm["monetary"].max()],
                                                      labels=[1, 2, 3, 4])
In [24]:
rfm.head()
Out[24]:
recency frequency monetary recency_score frequency_score monetary_score
CustomerID
12346.0 348.0 2.0 77183.60 1 1 4
12347.0 25.0 182.0 4310.00 4 4 4
12348.0 98.0 31.0 1797.24 2 2 4
12349.0 41.0 73.0 1757.55 3 3 4
12350.0 333.0 17.0 334.40 1 1 2

Creating RFM class and Customer Segments

In [25]:
rfm['RFMClass']=0
rfm['Customer_Seg']=''

j=0
while j <= (len(rfm)-1):
    rfm['RFMClass'][j] = ((rfm['recency_score'].iloc[j])*100) + \
                         ((rfm['frequency_score'].iloc[j])*10) + \
                          (rfm['monetary_score'].iloc[j])
    j+=1
In [26]:
from tabulate import tabulate

# ---------------------------------------------------------
# Create a Reference table for Rule based segmentation
# ---------------------------------------------------------

table= [['R','M','F1','F2','F3','F4'],
        ['4','4','Unsteady','Active','Top customer','Top customer'],
        [' ','3','Unsteady','Active','Top customer','Top customer'],
        [' ','2','Unsteady','Active','Active','Active'],
        [' ','1','Unsteady','Active','Active','Active'],
        ['3','4','Unsteady','Emerging','Emerging','Emerging'],
        [' ','3','Unsteady','Emerging','Emerging','Emerging'],
        [' ','2','Unsteady','Unsteady','Unsteady','Unsteady'],
        [' ','1','Unsteady','Unsteady','Unsteady','Unsteady'],
        ['2','4','At risk','At risk','At risk','At risk'],
        [' ',3,'At risk','At risk','At risk','At risk'],
        [' ','2','Potentially lost','Potentially lost','At risk','At risk'],
        [' ','2','Potentially lost','Potentially lost','At risk','At risk'],
        [' ','1','Potentially lost','Potentially lost','At risk','At risk'],
        ['1','4','Inactive','Inactive','At risk','At risk'],
        [' ','3','Inactive','Inactive','At risk','At risk'],
        [' ','2','Lost','Lost','At risk','At risk'],
        [' ','1','Lost','Lost','At risk','At risk']]

print('Customer Segmentation: ')
print(tabulate(table,headers='firstrow', tablefmt='fancy_grid'))
Customer Segmentation: 
╒═════╤═════╤══════════════════╤══════════════════╤══════════════╤══════════════╕
│ R   │   M │ F1               │ F2               │ F3           │ F4           │
╞═════╪═════╪══════════════════╪══════════════════╪══════════════╪══════════════╡
│ 4   │   4 │ Unsteady         │ Active           │ Top customer │ Top customer │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   3 │ Unsteady         │ Active           │ Top customer │ Top customer │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Unsteady         │ Active           │ Active       │ Active       │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   1 │ Unsteady         │ Active           │ Active       │ Active       │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│ 3   │   4 │ Unsteady         │ Emerging         │ Emerging     │ Emerging     │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   3 │ Unsteady         │ Emerging         │ Emerging     │ Emerging     │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Unsteady         │ Unsteady         │ Unsteady     │ Unsteady     │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   1 │ Unsteady         │ Unsteady         │ Unsteady     │ Unsteady     │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│ 2   │   4 │ At risk          │ At risk          │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   3 │ At risk          │ At risk          │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Potentially lost │ Potentially lost │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Potentially lost │ Potentially lost │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   1 │ Potentially lost │ Potentially lost │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│ 1   │   4 │ Inactive         │ Inactive         │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   3 │ Inactive         │ Inactive         │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Lost             │ Lost             │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   1 │ Lost             │ Lost             │ At risk      │ At risk      │
╘═════╧═════╧══════════════════╧══════════════════╧══════════════╧══════════════╛
In [27]:
# Assign a segment based on RFM
k=0
while k <= (len(rfm)-1):
    if rfm['recency_score'].iloc[k]== 4 and rfm['frequency_score'].iloc[k]>=3 and  rfm['monetary_score'].iloc[k]>=3:
        rfm['Customer_Seg'].iloc[k]= 'Top Customers'
    elif rfm['recency_score'].iloc[k]== 4 and rfm['frequency_score'].iloc[k]>=1 and  rfm['monetary_score'].iloc[k]>=1:
        rfm['Customer_Seg'].iloc[k]= 'Active Customers'
    elif rfm['recency_score'].iloc[k]== 3 and rfm['frequency_score'].iloc[k]>=2 and  rfm['monetary_score'].iloc[k]>=3:
        rfm['Customer_Seg'].iloc[k]= 'Emerging Customers'
    elif rfm['recency_score'].iloc[k]>= 3 and rfm['frequency_score'].iloc[k]==1 and  rfm['monetary_score'].iloc[k]>=1:
        rfm['Customer_Seg'].iloc[k]= 'Unsteady Customers'
    elif rfm['recency_score'].iloc[k]== 3 and rfm['frequency_score'].iloc[k]>=2 and  rfm['monetary_score'].iloc[k]<=2:
        rfm['Customer_Seg'].iloc[k]= 'Unsteady Customers'
    elif rfm['recency_score'].iloc[k]== 2 and rfm['frequency_score'].iloc[k]>=1 and  rfm['monetary_score'].iloc[k]>=3:
        rfm['Customer_Seg'].iloc[k]= 'At Risk'
    elif rfm['recency_score'].iloc[k]== 2 and rfm['frequency_score'].iloc[k]>=3 and  rfm['monetary_score'].iloc[k]<=2:
        rfm['Customer_Seg'].iloc[k]= 'At Risk'
    elif rfm['recency_score'].iloc[k]== 1 and rfm['frequency_score'].iloc[k]>=3 and  rfm['monetary_score'].iloc[k]>=1:
        rfm['Customer_Seg'].iloc[k]= 'At Risk'
    elif rfm['recency_score'].iloc[k]== 2 and rfm['frequency_score'].iloc[k]<=2 and  rfm['monetary_score'].iloc[k]<=2:
        rfm['Customer_Seg'].iloc[k]= 'Potentially lost'
    elif rfm['recency_score'].iloc[k]== 1 and rfm['frequency_score'].iloc[k]<=2 and  rfm['monetary_score'].iloc[k]>=3:
        rfm['Customer_Seg'].iloc[k]= 'Inactive Customers'
    elif rfm['recency_score'].iloc[k]== 1 and rfm['frequency_score'].iloc[k]<=2 and  rfm['monetary_score'].iloc[k]<=2:
        rfm['Customer_Seg'].iloc[k]= 'Lost'
    else:
        rfm['Customer_Seg'].iloc[k]= 'Uncategorized'
    k+=1      
In [28]:
rfm.head()
Out[28]:
recency frequency monetary recency_score frequency_score monetary_score RFMClass Customer_Seg
CustomerID
12346.0 348.0 2.0 77183.60 1 1 4 114 Inactive Customers
12347.0 25.0 182.0 4310.00 4 4 4 444 Top Customers
12348.0 98.0 31.0 1797.24 2 2 4 224 At Risk
12349.0 41.0 73.0 1757.55 3 3 4 334 Emerging Customers
12350.0 333.0 17.0 334.40 1 1 2 112 Lost
In [29]:
data = rfm.groupby("Customer_Seg")["RFMClass"].count()
data
Out[29]:
Customer_Seg
Active Customers      325
At Risk               793
Emerging Customers    624
Inactive Customers     72
Lost                  805
Potentially lost      479
Top Customers         771
Unsteady Customers    503
Name: RFMClass, dtype: int64
In [30]:
# Visualisation
In [31]:
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.subplots(figsize=(10,10))
data.plot.pie(autopct="%.1f%%")
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce8a897588>

Observation: Significant number of customers have been lost and are at Risk

In [32]:
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.subplots(figsize=(18,5))
sns.barplot(y='frequency', x= 'Customer_Seg', data= rfm)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce89157c88>
In [33]:
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.subplots(figsize=(18,5))
sns.barplot(y='frequency',x= 'Customer_Seg', data= rfm)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce890f3c50>
In [34]:
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.subplots(figsize=(18,5))
sns.barplot(y='monetary',x= 'Customer_Seg', data= rfm)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce890751d0>
In [ ]: