Applied Data Science Notebook in Python for Beginners to Professionals¶

An end-to-end tutorials on EDA & RFM Analysis - Applied Machine Learning & Data Science¶

Rule based Segmentation, EDA and RFM Analysis & Visualisation in Python¶

# Suppress warnings in Jupyter Notebooks

import warnings
warnings.filterwarnings("ignore")

# Load the library
import numpy as np 
import pandas as pd 
import seaborn as sns
from scipy import stats
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import plotly.graph_objects as go

# Load the dataset
df = pd.read_csv('OnlineRetail_sample.csv')

df.head()

df.shape

(541909, 5)

df.describe()

# How many missing values are in each column and of what data types they are?

print(); print("Database has {} observations (instances) and {} columns (attributes).".format(df.shape[0],df.shape[1]))
print(); print("Missing values in each column:\n{}".format(df.isnull().sum()))
print(); print("Columns data types:\n{}".format(df.dtypes))

Database has 541909 observations (instances) and 5 columns (attributes).

Missing values in each column:
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Columns data types:
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

# Drop the intances that have null values
df.dropna(subset=['CustomerID'], how='all', inplace= True)
df.shape

(406829, 5)

# considering the negative quantities as 0 for rfm analysis
df['Quantity']= list(map((lambda x: 0 if(x<0) else x),df['Quantity']))

# creating a new column called price which will quantity x unitprice
df['Price']= df['Quantity'] * df['UnitPrice']

df.head()

df.describe()

# Let's derive a separate column for Date from Invoice Date
df['date'] = pd.to_datetime(df['InvoiceDate']).dt.date

df

# Inorder to fix the date of analysis we will see the date range
print(df['date'].min(), '\t', df['date'].max())

2010-12-01 	 2011-12-09

# Let's consider the analysis date as 2012-01-01 and lets calculate the recency
import datetime as dt

df['no_of_days'] = (dt.date(2012,1,1))-df['date']
df['no_of_days'] = df['no_of_days'].astype('timedelta64[D]')

df.head()

rfm = df.groupby('CustomerID').agg({'no_of_days': lambda x:x.min(), \
                                    'CustomerID': lambda x: len(x), \
                                    'Price': lambda x: x.sum()})     

rfm.rename(columns={'no_of_days': 'recency', 'CustomerID': 'frequency', 'Price': 'monetary'}, inplace=True)

rfm.head()

rfm.describe()

The scores are given as follows: for frequency and monetary 1 is the lowest and 4 the highest while for recency it is the opposite¶

Recency¶

rfm["recency_score"] = pd.cut(rfm["recency"], bins=[-1,np.percentile(rfm["recency"], 25),np.percentile(rfm["recency"], 50), 
                                                     np.percentile(rfm["recency"], 75), rfm["recency"].max()],
                                                     labels=[4, 3, 2, 1]).astype("int")

Frequency¶

rfm["frequency_score"] = pd.cut(rfm["frequency"], bins=[-1,np.percentile(rfm["frequency"], 25),
                                                           np.percentile(rfm["frequency"], 50), 
                                                        np.percentile(rfm["frequency"], 75), rfm["frequency"].max()],
                                                        labels=[1, 2, 3, 4]).astype("int")

Monetary¶

rfm["monetary_score"] = pd.cut(rfm["monetary"], bins=[-1,np.percentile(rfm["monetary"], 25),
                                                         np.percentile(rfm["monetary"], 50), 
                                                      np.percentile(rfm["monetary"], 75), rfm["monetary"].max()],
                                                      labels=[1, 2, 3, 4])

rfm.head()

Creating RFM class and Customer Segments¶

rfm['RFMClass']=0
rfm['Customer_Seg']=''

j=0
while j <= (len(rfm)-1):
    rfm['RFMClass'][j] = ((rfm['recency_score'].iloc[j])*100) + \
                         ((rfm['frequency_score'].iloc[j])*10) + \
                          (rfm['monetary_score'].iloc[j])
    j+=1

from tabulate import tabulate

# ---------------------------------------------------------
# Create a Reference table for Rule based segmentation
# ---------------------------------------------------------

table= [['R','M','F1','F2','F3','F4'],
        ['4','4','Unsteady','Active','Top customer','Top customer'],
        [' ','3','Unsteady','Active','Top customer','Top customer'],
        [' ','2','Unsteady','Active','Active','Active'],
        [' ','1','Unsteady','Active','Active','Active'],
        ['3','4','Unsteady','Emerging','Emerging','Emerging'],
        [' ','3','Unsteady','Emerging','Emerging','Emerging'],
        [' ','2','Unsteady','Unsteady','Unsteady','Unsteady'],
        [' ','1','Unsteady','Unsteady','Unsteady','Unsteady'],
        ['2','4','At risk','At risk','At risk','At risk'],
        [' ',3,'At risk','At risk','At risk','At risk'],
        [' ','2','Potentially lost','Potentially lost','At risk','At risk'],
        [' ','2','Potentially lost','Potentially lost','At risk','At risk'],
        [' ','1','Potentially lost','Potentially lost','At risk','At risk'],
        ['1','4','Inactive','Inactive','At risk','At risk'],
        [' ','3','Inactive','Inactive','At risk','At risk'],
        [' ','2','Lost','Lost','At risk','At risk'],
        [' ','1','Lost','Lost','At risk','At risk']]

print('Customer Segmentation: ')
print(tabulate(table,headers='firstrow', tablefmt='fancy_grid'))

Customer Segmentation: 
╒═════╤═════╤══════════════════╤══════════════════╤══════════════╤══════════════╕
│ R   │   M │ F1               │ F2               │ F3           │ F4           │
╞═════╪═════╪══════════════════╪══════════════════╪══════════════╪══════════════╡
│ 4   │   4 │ Unsteady         │ Active           │ Top customer │ Top customer │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   3 │ Unsteady         │ Active           │ Top customer │ Top customer │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Unsteady         │ Active           │ Active       │ Active       │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   1 │ Unsteady         │ Active           │ Active       │ Active       │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│ 3   │   4 │ Unsteady         │ Emerging         │ Emerging     │ Emerging     │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   3 │ Unsteady         │ Emerging         │ Emerging     │ Emerging     │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Unsteady         │ Unsteady         │ Unsteady     │ Unsteady     │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   1 │ Unsteady         │ Unsteady         │ Unsteady     │ Unsteady     │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│ 2   │   4 │ At risk          │ At risk          │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   3 │ At risk          │ At risk          │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Potentially lost │ Potentially lost │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Potentially lost │ Potentially lost │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   1 │ Potentially lost │ Potentially lost │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│ 1   │   4 │ Inactive         │ Inactive         │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   3 │ Inactive         │ Inactive         │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   2 │ Lost             │ Lost             │ At risk      │ At risk      │
├─────┼─────┼──────────────────┼──────────────────┼──────────────┼──────────────┤
│     │   1 │ Lost             │ Lost             │ At risk      │ At risk      │
╘═════╧═════╧══════════════════╧══════════════════╧══════════════╧══════════════╛

# Assign a segment based on RFM
k=0
while k <= (len(rfm)-1):
    if rfm['recency_score'].iloc[k]== 4 and rfm['frequency_score'].iloc[k]>=3 and  rfm['monetary_score'].iloc[k]>=3:
        rfm['Customer_Seg'].iloc[k]= 'Top Customers'
    elif rfm['recency_score'].iloc[k]== 4 and rfm['frequency_score'].iloc[k]>=1 and  rfm['monetary_score'].iloc[k]>=1:
        rfm['Customer_Seg'].iloc[k]= 'Active Customers'
    elif rfm['recency_score'].iloc[k]== 3 and rfm['frequency_score'].iloc[k]>=2 and  rfm['monetary_score'].iloc[k]>=3:
        rfm['Customer_Seg'].iloc[k]= 'Emerging Customers'
    elif rfm['recency_score'].iloc[k]>= 3 and rfm['frequency_score'].iloc[k]==1 and  rfm['monetary_score'].iloc[k]>=1:
        rfm['Customer_Seg'].iloc[k]= 'Unsteady Customers'
    elif rfm['recency_score'].iloc[k]== 3 and rfm['frequency_score'].iloc[k]>=2 and  rfm['monetary_score'].iloc[k]<=2:
        rfm['Customer_Seg'].iloc[k]= 'Unsteady Customers'
    elif rfm['recency_score'].iloc[k]== 2 and rfm['frequency_score'].iloc[k]>=1 and  rfm['monetary_score'].iloc[k]>=3:
        rfm['Customer_Seg'].iloc[k]= 'At Risk'
    elif rfm['recency_score'].iloc[k]== 2 and rfm['frequency_score'].iloc[k]>=3 and  rfm['monetary_score'].iloc[k]<=2:
        rfm['Customer_Seg'].iloc[k]= 'At Risk'
    elif rfm['recency_score'].iloc[k]== 1 and rfm['frequency_score'].iloc[k]>=3 and  rfm['monetary_score'].iloc[k]>=1:
        rfm['Customer_Seg'].iloc[k]= 'At Risk'
    elif rfm['recency_score'].iloc[k]== 2 and rfm['frequency_score'].iloc[k]<=2 and  rfm['monetary_score'].iloc[k]<=2:
        rfm['Customer_Seg'].iloc[k]= 'Potentially lost'
    elif rfm['recency_score'].iloc[k]== 1 and rfm['frequency_score'].iloc[k]<=2 and  rfm['monetary_score'].iloc[k]>=3:
        rfm['Customer_Seg'].iloc[k]= 'Inactive Customers'
    elif rfm['recency_score'].iloc[k]== 1 and rfm['frequency_score'].iloc[k]<=2 and  rfm['monetary_score'].iloc[k]<=2:
        rfm['Customer_Seg'].iloc[k]= 'Lost'
    else:
        rfm['Customer_Seg'].iloc[k]= 'Uncategorized'
    k+=1

rfm.head()

data = rfm.groupby("Customer_Seg")["RFMClass"].count()
data

Customer_Seg
Active Customers      325
At Risk               793
Emerging Customers    624
Inactive Customers     72
Lost                  805
Potentially lost      479
Top Customers         771
Unsteady Customers    503
Name: RFMClass, dtype: int64

# Visualisation

sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.subplots(figsize=(10,10))
data.plot.pie(autopct="%.1f%%")

<matplotlib.axes._subplots.AxesSubplot at 0x7fce8a897588>

Observation: Significant number of customers have been lost and are at Risk

sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.subplots(figsize=(18,5))
sns.barplot(y='frequency', x= 'Customer_Seg', data= rfm)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce89157c88>

sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.subplots(figsize=(18,5))
sns.barplot(y='frequency',x= 'Customer_Seg', data= rfm)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce890f3c50>

sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.subplots(figsize=(18,5))
sns.barplot(y='monetary',x= 'Customer_Seg', data= rfm)

<matplotlib.axes._subplots.AxesSubplot at 0x7fce890751d0>

	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	6	12/01/2010	2.55	17850.0	United Kingdom
1	6	12/01/2010	3.39	17850.0	United Kingdom
2	8	12/01/2010	2.75	17850.0	United Kingdom
3	6	12/01/2010	3.39	17850.0	United Kingdom
4	6	12/01/2010	3.39	17850.0	United Kingdom

	Quantity	UnitPrice	CustomerID
count	541909.000000	541909.000000	406829.000000
mean	9.552250	4.611114	15287.690570
std	218.081158	96.759853	1713.600303
min	-80995.000000	-11062.060000	12346.000000
25%	1.000000	1.250000	13953.000000
50%	3.000000	2.080000	15152.000000
75%	10.000000	4.130000	16791.000000
max	80995.000000	38970.000000	18287.000000

	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	Price
0	6	12/01/2010	2.55	17850.0	United Kingdom	15.30
1	6	12/01/2010	3.39	17850.0	United Kingdom	20.34
2	8	12/01/2010	2.75	17850.0	United Kingdom	22.00
3	6	12/01/2010	3.39	17850.0	United Kingdom	20.34
4	6	12/01/2010	3.39	17850.0	United Kingdom	20.34

	Quantity	UnitPrice	CustomerID	Price
count	406829.000000	406829.000000	406829.000000	406829.000000
mean	12.736791	3.460471	15287.690570	21.904554
std	178.444861	69.315162	1713.600303	305.672001
min	0.000000	0.000000	12346.000000	0.000000
25%	2.000000	1.250000	13953.000000	4.200000
50%	5.000000	1.950000	15152.000000	11.100000
75%	12.000000	3.750000	16791.000000	19.500000
max	80995.000000	38970.000000	18287.000000	168469.600000

	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	Price	date
0	6	12/01/2010	2.55	17850.0	United Kingdom	15.30	2010-12-01
1	6	12/01/2010	3.39	17850.0	United Kingdom	20.34	2010-12-01
2	8	12/01/2010	2.75	17850.0	United Kingdom	22.00	2010-12-01
3	6	12/01/2010	3.39	17850.0	United Kingdom	20.34	2010-12-01
4	6	12/01/2010	3.39	17850.0	United Kingdom	20.34	2010-12-01
...	...	...	...	...	...	...	...
541904	12	12/09/2011	0.85	12680.0	France	10.20	2011-12-09
541905	6	12/09/2011	2.10	12680.0	France	12.60	2011-12-09
541906	4	12/09/2011	4.15	12680.0	France	16.60	2011-12-09
541907	4	12/09/2011	4.15	12680.0	France	16.60	2011-12-09
541908	3	12/09/2011	4.95	12680.0	France	14.85	2011-12-09

	recency	frequency	monetary
CustomerID
12346.0	348.0	2.0	77183.60
12347.0	25.0	182.0	4310.00
12348.0	98.0	31.0	1797.24
12349.0	41.0	73.0	1757.55
12350.0	333.0	17.0	334.40

	recency	frequency	monetary
count	4372.000000	4372.000000	4372.000000
mean	114.581199	93.053294	2038.290920
std	100.772139	232.471608	8956.019171
min	23.000000	1.000000	0.000000
25%	39.000000	17.000000	303.602500
50%	73.000000	42.000000	665.315000
75%	166.000000	102.000000	1652.790000
max	396.000000	7983.000000	280206.020000