Project Business Statistics: E-news Express¶

Marks: 60

Preparation Information¶

  • Developed and Analyzed by: Jerry Gonzalez
  • Cohort: November 2023 - Group D

Problem Statement and Objectives¶

Business Context:¶

The advent of e-news, or electronic news, portals has offered us a great opportunity to quickly get updates on the day-to-day events occurring globally. The information on these portals is retrieved electronically from online databases, processed using a variety of software, and then transmitted to the users. Transmitting news electronically has multiple advantages, such as faster access to content and the ability to utilize different technologies such as audio, graphics, video, and other interactive elements that are either not used or not yet common in traditional newspapers.

E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. With every visitor to the website taking certain actions based on their interest, the company plans to analyze these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express believe that new monthly subscribers have declined compared to the past year because the current webpage is not well designed, in terms of its outline and recommended content, to keep customers engaged long enough to decide to subscribe.

[Companies often analyze user responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.]

Objective¶

The design team of the company has researched and created a new landing page that has a new outline and more relevant content compared to the old page. To test the effectiveness of the new landing page in gathering new subscribers, the Data Science team conducted an experiment by randomly selecting 100 users and dividing them equally into two groups. The existing landing page was served to the first group (the control group) and the new landing page to the second group (the treatment group). Data regarding the interaction of users in both groups with the two versions of the landing page was collected. As a data scientist at E-news Express, you have been asked to explore the data and perform a statistical analysis (at a significance level of 5%) to determine the effectiveness of the new landing page in gathering new subscribers for the news portal by answering the following questions:

  • Do the users spend more time on the new landing page than on the existing landing page?
  • Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?
  • Does the converted status depend on the preferred language?
  • Is the time spent on the new page the same for the different language users?

Data Dictionary¶

The data contains information regarding the interaction of users in both groups with the two versions of the landing page.

  • user_id - Unique user ID of the person visiting the website
  • group - Whether the user belongs to the first group (control) or the second group (treatment)
  • landing_page - Whether the landing page is new or old
  • time_spent_on_the_page - Time (in minutes) spent by the user on the landing page
  • converted - Whether the user gets converted to a subscriber of the news portal or not
  • language_preferred - Language chosen by the user to view the landing page

Problem Definition¶

E-News Express (an electronic news portal) executives are concerned about the decline in monthly subscribers to their portal. Their conjecture is that the decline is due to the current landing page not sufficiently captivating users to convert to a paid subscription. To remedy this issue, the Design Team has developed a new landing page with an updated outline and content. The objective is to evaluate the effectiveness of the new landing page, compared to the old landing page, in acquiring new subscribers.

The Data Science Team conducted an A/B test by randomly assigning 100 users evenly into two groups: a control group that was presented with the old landing page and a treatment group presented with the new landing page. Data was then collected on their user interactions along with their preferred language.

The task is to perform statistical analysis to determine the new landing page's effectiveness in attracting new subscribers. The following are key questions to address:

  1. Do the users spend more time on the new landing page than on the existing landing page?
  2. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?
  3. Does the converted status depend on the preferred language?
  4. Is the time spent on the new page the same for the different language users?

Import all the necessary libraries¶

In [54]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

# Library to help with statistical analysis
import scipy.stats as stats 

Initialize some basic pandas configurations¶

In [55]:
# removing the limit for the number of displayed columns
pd.set_option("display.max_columns", None) # To set column limits replace None with a number
# setting the limit for the number of displayed rows
pd.set_option("display.max_rows", 25) # limit the display to 25 rows
# setting the precision of floating numbers to 2 decimal points
pd.set_option("display.float_format", lambda x: "%.2f" % x)

Load some useful functions¶

In [56]:
# Purpose: Calculate various descriptive statistical values for a specific column in a DataFrame
#
# Prerequisites:
#    Requires the caller to only send data that descriptive statistics can safely be calculated for.
#    A production version would require more extensive data validation checks and more robust exception handling.
#
# Inputs
#    data   : DataFrame object containing rows and columns of data
#    feature: str representing the column name to run statistics on
#
def calculate_statistics(data, feature):

    # Only calculate and print statistics if the feature is a single column string name and data is a DataFrame
    if isinstance(data, pd.DataFrame) and isinstance(feature, str):

        # For the future, would like to use describe() to pull data types for each column,
        # then only perform the calculations and prints if of type int64 or float64

        # Calculate and print various descriptive statistical values
        print(f"Descriptive Statistics for {feature}\n")
        print(f"Mean              : {data[feature].mean():.6f}")
        print(f"Mode              : {data[feature].mode()[0]}")
        print(f"Median            : {data[feature].median()}")
        print(f"Min               : {data[feature].min()}")
        print(f"Max               : {data[feature].max()}")
        print(f"Standard Deviation: {data[feature].std():.6f}")
        print(f"Percentiles       : \n{data[feature].quantile([.25, .50, .75])}")
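The comments above anticipate a dtype guard; below is a minimal sketch of that check, where calculate_statistics_safe is a hypothetical wrapper name (not part of the project spec):

# Hedged sketch: only run the numeric summaries when the column is numeric.
# calculate_statistics_safe is a hypothetical wrapper, not part of the assignment.
from pandas.api.types import is_numeric_dtype

def calculate_statistics_safe(data, feature):
    if isinstance(data, pd.DataFrame) and isinstance(feature, str) and is_numeric_dtype(data[feature]):
        calculate_statistics(data, feature)
    else:
        print(f"Skipping {feature}: not a numeric column of a DataFrame")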
In [59]:
# Purpose: Create histogram plot to visualize the distribution of continuous numerical data
#      by dividing the data into bins and displaying the frequency of observations within each bin.
#      Histogram is useful for understanding the underlying distribution, shape, central tendency, 
#      and spread of the data.
#
# Inputs:
#
#      input_data: DataFrame object containing rows and columns of data
#      feature: str representing the column name to plot a histogram for
#      in_kde: boolean value; True: plot the kde density line; False: do not plot the kde line
#

def histogram(input_data, feature, in_kde=True):
    
    # Only proceed if the feature is a single column string name and data is a DataFrame
    if isinstance(input_data, pd.DataFrame) and isinstance(feature, str):
        
        #stores the x axis left and right buffer size
        buffer=5
            
        #set the x limits based on the minimum and maximum values for x-axis feature
        xmin_value = int(input_data[feature].min())
        xmax_value = int(input_data[feature].max())
        plt.xlim(xmin_value-buffer,xmax_value+buffer)
        
        #plot the histogram (the x limits above add a small buffer on each side)
        sns.histplot(data=input_data, x=feature, kde=in_kde);
        
        #set the title, x and y labels
        plt.title('Histogram of '+ feature)
        plt.xlabel(feature)
        plt.ylabel('Frequency')
        
        plt.show()
In [60]:
# Purpose: Create a Boxplot, for a particular column/feature, to visually summarize the distribution
#      of a continuous numerical variable and to identify potential outliers within the data.
#
# Inputs:
#
#      input_data: DataFrame object containing rows and columns of data
#      feature: str representing the column name to plot a count plot graph for
#      in_vert: False (display horizontal); True (display vertically)
#      in_showfliers: True (show the outliers); False (do not show outliers)
#      in_showlabels: True (show boxplot labels for Max,Q3,Median,Q1, and Min); False (do not show labels)
#

def boxplot(input_data,feature,in_vert=False,in_showfliers=False, in_showlabels=True):
    
    #label translator key
    label_translator = {'caps 0': 'Min', 'caps 1': 'Max', 'whiskers 0': 'Q1', 'whiskers 1': 'Q3','medians 0':'Median', 'boxes 0':'Box'}
    
    # Only proceed if data is a DataFrame
    if isinstance(input_data,pd.DataFrame):
 
        ax = plt.boxplot(input_data[feature], vert=in_vert, showfliers=in_showfliers)
         
        plt.title('Boxplot of '+ feature)
        plt.xlabel(feature)
        
        # Revisit this later to figure out a more robust way to get the min x value for labels.
        # Right now, the median has the smallest x value when in vert mode
        min_x = ax['medians'][0].get_xydata()[0][0] - .05
         
        #show labels for the main boxplot lines (e.g, Max Line, Q1, Q3, Min Line)
        for i in ax.keys():
            for index,line in enumerate(ax[i]):
                
                # for some reason boxes is redundant for Q1. Filter this value out. We're also not interested in 
                # labels for outliers
                if (i != 'boxes' and i != 'fliers'):
                    line_x,line_y=line.get_xydata()[0]
                    
                    #if needed and in vertical mode, show the key boxplot labels
                    #revisit later to get labels working in horizontal mode
                    if (in_showlabels == True and in_vert == True):
                        label_key = f"{i} {index}"
                        label = f"{label_translator[label_key]}:{line_y:.2f}"
                        plt.text(min_x,line_y,label,ha='right',va='center',color='blue',fontsize=9)
        
        plt.show()
In [61]:
# Purpose: Create a countplot; used to visualize counts for categorical data
#
# Inputs:
#
#     input_data: DataFrame object containing rows and columns of data
#     feature: str representing the column name to plot a count plot graph for (category column)
#     show_perc: value from [0,1] indicating the top % values to show in the countplot
#     label_count: True (show count labels); False (show percentage labels)
#
def countplot (input_data, feature, show_perc=1.0, label_count=True):
    if isinstance(input_data, pd.DataFrame) and isinstance(feature, str):
                
        #Set the figure size. However, revisit this later to see if there is a robust way to increase the
        #figure size based on the number of x-axis labels.
        plt.figure(figsize=(10, 6))   
        
        #Perform a total counts, which sorts the list in descending order. Then grab the list of columns
        order_cols = input_data[feature].value_counts().index
        
        #Use the percentage value to determine how many of the top columns to show
        num_to_show = int(len(order_cols)*show_perc)
        
        #Grab the top columns to show in the count plot
        cols_to_show = input_data[feature].value_counts().nlargest(num_to_show).index
                
        #plot the top columns
        cp=sns.countplot(data=input_data,x=feature,order=cols_to_show)
        
        #rotate x labels for better readability
        plt.xticks(rotation=90)
        
        #set the title
        plt_title = f"Countplot for the top {show_perc*100:.0f}% {feature} values"
        plt.title(plt_title)
        
        # Apply some simple label formatting; remove the underscores and replace with a blank space
        cp.set_xlabel(feature.replace('_', ' ').title(), fontsize=15)
        
        #show values for each bar/patch. The labels will either be numerical values or percentages.
        for p in cp.patches:
            
            total_values = len(input_data[feature])
            
            #show count labels
            if label_count == True:
                label = p.get_height() 
            else:
                # show percentage label
                label = "{:.1f}%".format(100 * p.get_height() / total_values)
            
            cp.annotate(
                label, 
                (p.get_x()+p.get_width()/2.,p.get_height()),
                ha='center',
                va='center',
                xytext=(0,5), # set the label offset above the bar
                textcoords='offset points'
            )

        plt.show()
In [62]:
# Purpose: Create Boxplot for multiple variables (x being a categorical value)
#
# Inputs:
#
#     in_data: DataFrame object containing rows and columns of data
#     x_feature: str representing the column name for the x-axis (categorical data)
#     y_feature: str representing the column name for the y-axis
#
def multi_boxplot (in_data, x_feature, y_feature):

    # Only proceed if the features are single column string names and data is a DataFrame
    if isinstance(in_data, pd.DataFrame) and isinstance(x_feature, str) and isinstance(y_feature, str):

        # visualizing the relationship between the two features
        plt.figure(figsize=(12, 5))
        sns.boxplot(data=in_data, x=x_feature, y=y_feature, showmeans=True)
        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        plt.xticks(rotation='vertical')
        plt.xlabel(x_feature, fontsize=15)
        plt.ylabel(y_feature, fontsize=15);
        
        plt.show() 
In [63]:
# Purpose: Create a heatmap which supports multivariate analysis of 2+ numerical features. Heatmaps are
#     useful for identifying correlations between 2 or more variables.
#
# Inputs:
#
#     input_data: DataFrame object containing rows and columns of data
#
def heatmap (input_data):
    if isinstance(input_data, pd.DataFrame):
        plt.figure(figsize=(10,8))
        
        # Select only the numerical columns
        numeric_columns = input_data.select_dtypes(include=np.number)

        sns.heatmap(numeric_columns.corr(),annot=True,cmap='Spectral',vmin=-1,vmax=1)
        plt.show()    
    

Reading the Data into a DataFrame¶

In [64]:
# Read the A/B test data set into a pandas DataFrame object
df = pd.read_csv("./abtest.csv")

Explore the dataset and extract insights using Exploratory Data Analysis¶

  • Data Overview
    • Viewing the first and last few rows of the dataset
    • Checking the shape of the dataset
    • Getting the statistical summary for the variables
  • Check for missing values
  • Check for duplicates
In [65]:
# Verify the data file was read correctly by displaying the first five rows.
df.head()
Out[65]:
user_id group landing_page time_spent_on_the_page converted language_preferred
0 546592 control old 3.48 no Spanish
1 546468 treatment new 7.13 yes English
2 546462 treatment new 4.40 no Spanish
3 546567 control old 3.02 no French
4 546459 treatment new 4.75 yes Spanish
In [66]:
# Verify the entire data file was read correctly by displaying the last five rows.
df.tail()
Out[66]:
user_id group landing_page time_spent_on_the_page converted language_preferred
95 546446 treatment new 5.15 no Spanish
96 546544 control old 6.52 yes English
97 546472 treatment new 7.07 yes Spanish
98 546481 treatment new 6.20 yes Spanish
99 546483 treatment new 5.86 yes English

Observation(s):¶

  • The head and tail output confirms the file was read in correctly.
In [12]:
# Print out the number of rows and columns in the data file.
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
There are 100 rows and 6 columns.
In [13]:
# Print out basic information on the data file.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   user_id                 100 non-null    int64  
 1   group                   100 non-null    object 
 2   landing_page            100 non-null    object 
 3   time_spent_on_the_page  100 non-null    float64
 4   converted               100 non-null    object 
 5   language_preferred      100 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB

Observation(s)¶

  • The user_id is an integer
  • The time_spent_on_the_page is a float
  • The remaining four columns are all of type object.
  • Confirms 100 rows were read in and there are no null values.
In [14]:
# Print out some basic descriptive statistics on the A/B test data
df.describe(include='all').T
Out[14]:
count unique top freq mean std min 25% 50% 75% max
user_id 100.00 NaN NaN NaN 546517.00 52.30 546443.00 546467.75 546492.50 546567.25 546592.00
group 100 2 control 50 NaN NaN NaN NaN NaN NaN NaN
landing_page 100 2 old 50 NaN NaN NaN NaN NaN NaN NaN
time_spent_on_the_page 100.00 NaN NaN NaN 5.38 2.38 0.19 3.88 5.42 7.02 10.71
converted 100 2 yes 54 NaN NaN NaN NaN NaN NaN NaN
language_preferred 100 3 Spanish 34 NaN NaN NaN NaN NaN NaN NaN

Observation¶

  • The average time_spent_on_the_page is 5.38 minutes with a standard deviation of 2.38 minutes.
  • The min and max time_spent_on_the_page are 0.19 min and 10.71 min, respectively.
  • For time_spent_on_the_page, the mean and median are almost equal, and the distance from the min to the 25th percentile is about the same as the distance from the 75th percentile to the max. Therefore, this column likely has a roughly normal (symmetric) distribution; a quick skewness check is sketched below.
  • There are two unique group values with control occurring 50 times out of 100.
  • There are two unique landing_page values with old occurring 50 times out of 100.
  • There are two unique converted values with yes occurring 54 times out of 100.
  • There are three unique language_preferred values with Spanish occurring 34 times out of 100.
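As a quick sanity check of the symmetry observation above, a minimal sketch (skewness near 0 is consistent with an approximately symmetric distribution):

# Hedged sketch: skewness near 0 supports the near-symmetric reading above
print(f"Skewness of time_spent_on_the_page: {df['time_spent_on_the_page'].skew():.3f}")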

Check for Missing Values¶

In [15]:
# Check for missing values.
df.isnull().sum()
Out[15]:
user_id                   0
group                     0
landing_page              0
time_spent_on_the_page    0
converted                 0
language_preferred        0
dtype: int64

Observation¶

  • There are no missing values.

Check for Duplicate Values¶

In [16]:
# Check for duplicate values by counting unique values per column
df.nunique()
Out[16]:
user_id                   100
group                       2
landing_page                2
time_spent_on_the_page     94
converted                   2
language_preferred          3
dtype: int64

Observation(s):¶

  • There are 100 rows (in this sample data set) and the unique key (user_id) has 100 unique values.
  • There are no duplicate values for user_id.
  • There are two distinct group values.
  • There are two distinct landing_page values.
  • There are two distinct converted values.
  • There are three distinct language_preferred values.

Ensure the Control values match the Old landing page and the Treatment values match the New landing page.¶

In [67]:
#The control group should always be given the old landing page. Verify the data reflects this.
df_control = df[(df['group'] == 'control') & (df['landing_page'] != 'old')]
print(f"There are {len(df_control)} control and old value data inconsistencies.")

#The treatment group should always be given the new landing page. Verify the data reflects this.
df_treatment = df[(df['group'] == 'treatment') & (df['landing_page'] != 'new')]
print(f"There are {len(df_treatment)} treatment and new value inconsistencies.")
There are 0 control and old value data inconsistencies.
There are 0 treatment and new value inconsistencies.

Univariate Analysis¶

In [71]:
# group field univariate analysis

#Create a countplot for the group values
countplot(df,'group',show_perc=1.0,label_count=True)

#Print the total number of unique group values
print(f"Number of groups: {df['group'].nunique()}")
Number of groups: 2

Observation(s):¶

  • There are two group values (control and treatment), each with 50 counts.
  • This verifies the sample was split evenly between the control group (old landing page) and the treatment group (new landing page).
In [70]:
# landing_page field univariate analysis

#Create a countplot for the landing_page values
countplot(df,'landing_page',show_perc=1.0,label_count=True)

#Print the total number of unique landing page values
print(f"Number of landing pages: {df['landing_page'].nunique()}")
Number of landing pages: 2

Observation(s):¶

  • There are two landing_page values (old and new), each with 50 counts.
  • This verifies the sampling was split evenly between the old landing page and the new landing page.
In [20]:
# time_spent_on_the_page field univariate analysis
selected_column = 'time_spent_on_the_page'

#calculate univariate statistics
calculate_statistics(df,selected_column) 

#show histogram for time_spent_on_the_page
histogram(df,selected_column)

#show boxplot for time_spent_on_the_page
boxplot(df,selected_column,in_vert=True, in_showfliers=False,in_showlabels=True)
Descriptive Statistics for time_spent_on_the_page

Mean              : 5.377800
Mode              : 0.4
Median            : 5.415
Min               : 0.19
Max               : 10.71
Standard Deviation: 2.378166
Percentiles       : 
0.25   3.88
0.50   5.42
0.75   7.02
Name: time_spent_on_the_page, dtype: float64

Observation(s)¶

  • The time_spent_on_the_page visually follows an approximately normal distribution.
  • The average time_spent_on_the_page is 5.3778 min and the median is 5.42 min.
  • The min and max times spent on the page were 0.19 min and 10.71 min, respectively.
In [73]:
# converted field univariate analysis

#Create a countplot for the converted values
countplot(df,'converted',show_perc=1.0,label_count=True)

#Print the total number of unique converted values
print(f"Number of converted values: {df['converted'].nunique()}")
Number of converted values: 2

Observation(s):¶

  • There are 54 users out of 100 that were converted into a subscription.
  • There are 46 users out of 100 that were not converted into a subscription.
In [74]:
# language_preferred field univariate analysis

#Create a countplot for the language_preferred values
countplot(df,'language_preferred',show_perc=1.0,label_count=True)

#Print the total number of unique language_preferred values
print(f"Number of language_preferred values: {df['language_preferred'].nunique()}")
Number of language_preferred values: 3

Observation(s):¶

  • There are three preferred languages (Spanish, French, and English).
  • Spanish and French each have 34 counts.
  • English has 32 counts.
  • The sample is split fairly evenly across Spanish, French, and English users.

Bivariate Analysis¶

In [75]:
# visualizing the relationship between landing_page and time_spent_on_the_page
multi_boxplot(in_data=df,x_feature="landing_page", y_feature="time_spent_on_the_page")   

Observations:¶

  • The new landing page has a higher median and mean (green triangle) than the old landing page.
  • The new landing page does have some outliers on both ends.
  • The middle 50% of users on the new landing page spend approximately 5 to 7 minutes on the page.
  • The middle 50% of users on the old landing page spend approximately 3 to 6.5 minutes on the page.
  • Based on descriptive statistics, users on average spend more time on the new landing page.
In [76]:
# visualizing the relationship between language_preferred and time_spent_on_the_page
multi_boxplot(in_data=df,x_feature="language_preferred", y_feature="time_spent_on_the_page")   

Observations:¶

  • English and French users have similar distributions, with English users having a slightly higher median and mean.
  • The middle 50% of Spanish users stay engaged for approximately 4.5 to 6.5 minutes.
  • Spanish users (with one outlier as an exception) tend to stay longer on the landing page, but also do not stay on for more than about 9 minutes.
In [77]:
# visualizing the relationship between converted and time_spent_on_the_page
multi_boxplot(in_data=df,x_feature="converted", y_feature="time_spent_on_the_page")   

Observation(s):¶

  • The middle 50% of converted users stayed on the landing page for approximately 5.5 to 7.5 minutes, which is higher than for those who did not convert.
  • Based on descriptive statistics, this suggests that the longer a user spends on the landing page, the more likely they are to convert to a paid subscription.
In [78]:
# Plot the number of converted values (yes or no) for each landing page (old or new)
sns.countplot(data=df, x='landing_page', hue='converted');

Observation(s)¶

  • Based on descriptive statistics, there appears to be a higher number of conversions from the new landing page.
  • Inferential statistics are used below to determine, at a given significance level, whether this is true.

1. Do the users spend more time on the new landing page than the existing landing page?¶

Perform Visual Analysis¶

In [79]:
# visualizing the relationship between the old and new landing page and the total time spent on the page.
multi_boxplot(in_data=df,x_feature="landing_page", y_feature="time_spent_on_the_page")   

Observation(s):¶

  • On average, users spend more time on the new landing page than on the old one.
  • Nearly all users on the new landing page stay on the page for approximately 3.5 minutes or more (with one low outlier).
  • Descriptive statistics suggest the new landing page keeps users on the page longer.
    • Inferential statistics are needed to determine, at the chosen significance level, whether this is true.

Step 1: Define the null and alternate hypotheses¶

Let's frame the null and alternative hypotheses based on the above claim, where $\mu_n$ and $\mu_o$ are the mean times spent on the new and old landing pages, respectively:

$H_0$: $\mu_n = \mu_o$; on average, users spend the same amount of time on the new landing page as on the old landing page.

$H_a$: $\mu_n \gt \mu_o$; on average, users spend more time on the new landing page than on the old landing page.

Step 2: Select Appropriate test¶

Information:¶

  • Need to compare two sample means (old landing page and new landing page)
  • Population standard deviations are unknown
  • By nature of the sampling method, there are two independent populations.
  • Based on the alternative hypothesis, this will be a one-tailed test (using alternative='greater')

Based on the above information, utilize the two-sample independent t-test.
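For reference, with equal_var=False (the choice made in Step 4 below), scipy's ttest_ind computes Welch's statistic, which does not pool the two sample variances:

$t = \frac{\bar{x}_n - \bar{x}_o}{\sqrt{\frac{s_n^2}{n_n} + \frac{s_o^2}{n_o}}}$

where $\bar{x}$, $s^2$, and $n$ denote the sample mean, sample variance, and sample size for each landing page.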

Step 3: Decide the significance level¶

Significance level ($\alpha$) = 0.05 (5%)

Step 4: Collect and prepare data¶

In [81]:
# Collect the time spent values for the new landing page
time_spent_new_landing = df[df['landing_page']=='new']['time_spent_on_the_page']

# Collect the time spent values for the old landing page
time_spent_old_landing= df[df['landing_page']=='old']['time_spent_on_the_page']

# Calculate the standard deviations to determine how to set the equal_var parameter below.
print (f"The standard deviation for time spent on the new landing page is {round(time_spent_new_landing.std(),4)}")
print (f"The standard deviation for time spent on the old landing page is {round(time_spent_old_landing.std(),4)}")
The standard deviation for time spent on the new landing page is 1.817
The standard deviation for time spent on the old landing page is 2.582
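Beyond comparing the standard deviations by eye, here is a minimal sketch of a more formal check using Levene's test (the same scipy function applied in Question 4 below); a small p-value here would further support setting equal_var=False:

# Hedged sketch: formal equality-of-variances check for the two pages
from scipy.stats import levene

levene_stat, levene_p = levene(time_spent_new_landing, time_spent_old_landing)
print(f"Levene test p-value: {levene_p:.4f}")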

Step 5: Calculate the p-value¶

In [82]:
#import the required t-test function
from scipy.stats import ttest_ind

# find the p-value
test_stat, p_value = ttest_ind(time_spent_new_landing, time_spent_old_landing, equal_var = False, alternative = 'greater')
print(f"The p-value is {round(p_value,8)}")
The p-value is 0.00013924

Step 6: Compare the p-value with $\alpha$¶

In [83]:
if (p_value < 0.05):
    print(f'As the p-value {round(p_value,8)} is less than the level of significance; we reject the null hypothesis.')
else:
    print(f'As the p-value {round(p_value,8)} is greater than the level of significance; we fail to reject the null hypothesis.')
As the p-value 0.00013924 is less than the level of significance; we reject the null hypothesis.

Step 7: Draw inference¶

Insight¶

Since the null hypothesis is rejected (i.e., the alternative hypothesis is supported), there is sufficient statistical evidence that users on average spend more time on the new landing page than on the old landing page.

A similar approach can be followed to answer the other questions.

2. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?¶

Perform Visual Analysis¶

In [84]:
# Create a crosstab table for landing page vs converted
df_crosstab = pd.crosstab(df['landing_page'],df['converted'])
df_crosstab
Out[84]:
converted no yes
landing_page
new 17 33
old 29 21
In [85]:
# create a stacked bar plot to compare the distributions of both the categorical features
df_crosstab.plot(kind='bar',stacked =True)
plt.legend()
plt.show()

Observations¶

  • Based on descriptive statistics and visualizations, it does appear that the conversion rate from the new landing page is greater than the conversion rate from the old landing page (a quick rate computation is sketched below).
  • Inferential statistics are used below to confirm the above claim.
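A minimal sketch computing the sample conversion rates directly from the crosstab above:

# Hedged sketch: sample conversion rate per landing page from df_crosstab
conversion_rates = df_crosstab['yes'] / df_crosstab.sum(axis=1)
print(conversion_rates)  # new: 33/50 = 0.66, old: 21/50 = 0.42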

Step 1: Define the null and alternate hypotheses¶

Let's frame the null and alternative hypotheses based on the above claim:

$H_0$: $p_n$ = $p_o$; the proportion of users converted from the new landing page is equal to the proportion of users converted from the old landing page.

$H_a$: $p_n$ > $p_o$; the proportion of users converted from the new landing page is greater than the proportion of users converted from the old landing page.

Assumptions¶

  • Binomial Distribution (converted: yes or no)
  • Based on the above alternative hypothesis, this will be a one-sided test (alternative='larger')
  • Comparing two sample proportions

Determine whether the sample sizes are sufficient:¶

  • Verify that both $np$ and $n(1-p)$ are $\geq 10$:

  • Verify the sample counts for the new landing page:
    • $n_n = 50$
    • $p_n = \frac{33}{50}$
  • $n_n \times p_n = 50 \times \frac{33}{50} = 33 \geq 10$
  • $n_n(1-p_n) = 50 \times (1-\frac{33}{50}) = 50 \times \frac{17}{50} = 17 \geq 10$

  • Verify the sample counts for the old landing page:
    • $n_o = 50$
    • $p_o = \frac{21}{50}$
  • $n_o \times p_o = 50 \times \frac{21}{50} = 21 \geq 10$
  • $n_o(1-p_o) = 50 \times (1-\frac{21}{50}) = 50 \times \frac{29}{50} = 29 \geq 10$
  • Yes, the sample sizes are sufficient (a programmatic check is sketched below).
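A minimal sketch of the same check computed programmatically, using the conversion counts found in Step 4 below (33 of 50 on the new page, 21 of 50 on the old page):

# Hedged sketch: verify n*p and n*(1-p) >= 10 for both landing pages
for page, converted_count, n in [("new", 33, 50), ("old", 21, 50)]:
    p_hat = converted_count / n
    print(f"{page}: n*p = {n * p_hat:.0f}, n*(1-p) = {n * (1 - p_hat):.0f}")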

Step 2: Select Appropriate test¶

  • Based on the above assumptions, utilize the two-sample proportion z-test
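For reference, proportions_ztest (used in Step 5 below) computes the pooled-proportion statistic:

$Z = \frac{\hat{p}_n - \hat{p}_o}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_n} + \frac{1}{n_o}\right)}}$, where $\hat{p} = \frac{x_n + x_o}{n_n + n_o}$ is the pooled conversion proportion.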

Step 3: Decide the significance level¶

Significance level ($\alpha$) = 0.05 (5%)

Step 4: Collect and prepare data¶

In [89]:
# Get the number of people who converted from the new landing page
new_converted = df[df['landing_page']=='new']['converted'].value_counts()['yes']

# Get the number of people who converted from the old landing page
old_converted = df[df['landing_page']=='old']['converted'].value_counts()['yes']

# Create the list of conversion values from the new and old landing pages
conversions = [new_converted, old_converted]

# Get the total number of new observations
new_obs = df[df['landing_page']=='new']['landing_page'].value_counts()['new']

# Get the total number of old observations
old_obs = df[df['landing_page']=='old']['landing_page'].value_counts()['old']

nobs = [new_obs, old_obs]

print(f"Number of new landing page conversions: {new_converted}")
print(f"Number of old landing page conversions: {old_converted}")
print(f"Number of new landing page observations: {new_obs}")
print(f"Number of old landing page observations: {old_obs}")
Number of new landing page conversions: 33
Number of old landing page conversions: 21
Number of new landing page observations: 50
Number of old landing page observations: 50

Step 5: Calculate the p-value¶

In [90]:
#import the required functions
from statsmodels.stats.proportion import proportions_ztest

test_stat, p_value = proportions_ztest(conversions,nobs, alternative='larger')
print(f"p_value is {round(p_value,8)}")
p_value is 0.00802631

Step 6: Compare the p-value with $\alpha$¶

In [91]:
if (p_value < .05):
    print(f"The p_value of {round(p_value,8)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,8)} is greater than .05. Therefore, there is not enough statistical evidence to reject the null hypothesis.")
The p_value of 0.00802631 is less than .05. Therefore, reject the null hypothesis.

Step 7: Draw inference¶

  • Since the null hypothesis was rejected, there is enough statistical evidence to state that the proportion of new landing page conversions is greater than the proportion of old landing page conversions.

3. Does the converted status depend on the preferred language?¶

Perform Visual Analysis¶

In [92]:
# Create a crosstab table for preferred language vs converted
df_crosstab = pd.crosstab(df['language_preferred'],df['converted'])
df_crosstab
Out[92]:
converted no yes
language_preferred
English 11 21
French 19 15
Spanish 16 18
In [93]:
# create a stacked bar plot to compare the distributions of both the categorical features
df_crosstab.plot(kind='bar',stacked =True)
plt.legend()
plt.show()

Observations:¶

  • Based on descriptive statistics and visualizations, English users appear to convert at a higher rate than users of other languages.
  • Inferential statistics are used below to test this claim.

Step 1: Define the null and alternate hypotheses¶

Let's frame the null and alternative hypotheses based on the above claim:

$H_0$: Conversions and preferred language are independent

$H_a$: Conversions and preferred language are not independent

Assumptions:¶

  • Categorical data (converted and preferred language)
  • Random sampling from the population
  • The expected count in each cell of the contingency table is at least 5 (checked in the sketch below)
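A minimal sketch of the expected-count check, reusing the df_crosstab built above (chi2_contingency returns the expected frequencies under independence):

# Hedged sketch: verify all expected cell counts are >= 5
from scipy.stats import chi2_contingency

_, _, _, expected_counts = chi2_contingency(df_crosstab)
print(expected_counts)
print("All expected counts >= 5:", (expected_counts >= 5).all())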

Step 2: Select Appropriate test¶

Based on the above assumptions, select the Chi-Square Test of Independence
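For reference, the test statistic compares the observed cell counts $O_{ij}$ with the expected counts $E_{ij}$ under independence:

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $E_{ij} = \frac{(\text{row}_i \text{ total}) \times (\text{column}_j \text{ total})}{N}$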

Step 3: Decide the significance level¶

Significance level ($\alpha$) = 0.05 (5%)

Step 4: Collect and prepare data¶

  • Use the crosstab variable (df_crosstab) that was already prepared above.

Step 5: Calculate the p-value¶

In [99]:
#import the required functions
from scipy.stats import chi2_contingency

chi, p_value, dof, expected = chi2_contingency(df_crosstab)
print (f"The p_value is {round(p_value,8)}.")
The p_value is 0.21298887.

Step 6: Compare the p-value with $\alpha$¶

In [100]:
if (p_value < .05):
    print(f"The p_value of {round(p_value,8)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,8)} is greater than .05. Therefore, there is not enough statistical evidence to reject the null hypothesis.")
The p_value of 0.21298887 is greater than .05. Therefore, there is not enough statistical evidence to reject the null hypothesis.

Step 7: Draw inference¶

  • Since the null hypothesis was not rejected, we maintain the status quo: there is not enough statistical evidence to conclude that converted status depends on the preferred language.

4. Is the time spent on the new page the same for the different language users?¶

Perform Visual Analysis¶

In [102]:
#create a new dataframe with just the new landing_page data; this will include data from all preferred languages
df_new = df[df['landing_page']=='new']
df_new.head()
Out[102]:
user_id group landing_page time_spent_on_the_page converted language_preferred
1 546468 treatment new 7.13 yes English
2 546462 treatment new 4.40 no Spanish
4 546459 treatment new 4.75 yes Spanish
6 546448 treatment new 5.25 yes French
8 546461 treatment new 10.71 yes French
In [103]:
# visualizing the relationship between preferred languages (from the new landing page) vs time_spent_on_the_page
multi_boxplot(in_data=df_new,x_feature="language_preferred", y_feature="time_spent_on_the_page")     
In [98]:
# Calculate the average time spent on the new landing page for each preferred language
mu = df_new.groupby(['language_preferred'])['time_spent_on_the_page'].mean()
mu
Out[98]:
language_preferred
English   6.66
French    6.20
Spanish   5.84
Name: time_spent_on_the_page, dtype: float64

Observations:¶

  • Based on the above descriptive statistics, it does appear that the time spent on the new landing page differs for at least one language (the Spanish mean is lower than the English and French means).
  • Inferential statistics are used below to determine whether there is sufficient statistical evidence to support this assertion.

Step 1: Define the null and alternate hypotheses¶

Let $\mu_s$, $\mu_e$, $\mu_f$ be the average time spent on the new page for each of the preferred languages (Spanish, English, and French).

$H_0$: $\mu_s$ $=$ $\mu_e$ $=$ $\mu_f$

$H_a$: At least one of the means is not the same.

Assumptions - Part 1 of 2¶

  • Need to determine whether the time spent on the new page is normally distributed.
  • To verify this assumption, the Shapiro-Wilk test (scipy's shapiro function) will be used.

$H_0$: The time spent on the page follows a normal distribution

$H_a$: The time spent on the page does not follow a normal distribution

In [104]:
#import the required functions
from scipy.stats import shapiro

# Run the Shapiro-Wilk test to check the normality assumption

statistic_val, p_value = shapiro(df_new['time_spent_on_the_page'])
print (f"The p_value is {round(p_value,8)}.")
The p_value is 0.80400163.
In [44]:
if (p_value < .05):
    print(f"The p_value of {round(p_value,8)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,8)} is greater than .05. Therefore, do not reject the null hypothesis.")
The p_value of 0.80400163 is greater than .05. Therefore, do not reject the null hypothesis.

Insight¶

  • Since the null hypothesis was not rejected, we keep the null hypothesis as the status quo: the time spent on the page is consistent with a normal distribution.

Need to determine if the group populations have a common variance.

  • To verify this claim, the levene function will be used.

$H_0$: All the population variances are the same.

$H_a$: At least one variance is different from the rest.

In [105]:
# Run Levene's test to check the common-variance assumption
#import the required functions
from scipy.stats import levene

# For each language user, get the time spent on each page
df_new_spanish = df_new[df_new['language_preferred']=='Spanish']['time_spent_on_the_page']
df_new_english = df_new[df_new['language_preferred']=='English']['time_spent_on_the_page']
df_new_french = df_new[df_new['language_preferred']=='French']['time_spent_on_the_page']

# calculate the statistic and p value
statistic_val, p_value = levene (df_new_spanish, df_new_english, df_new_french)
print (f"The p_value is {round (p_value,8)}.")
The p_value is 0.46711358.
In [106]:
if (p_value < .05):
    print(f"The p_value of {round(p_value,8)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,8)} is greater than .05. Therefore, do not reject the null hypothesis.")
The p_value of 0.46711358 is greater than .05. Therefore, do not reject the null hypothesis.

Insight¶

  • Since the null hypothesis was not rejected, we keep the null hypothesis as the status quo: the population variances can be treated as equal.

Assumptions - Part 2 of 2¶

  • Samples are independent simple random samples
  • The group populations are normally distributed
  • The group populations have a common variance

Step 2: Select Appropriate test¶

Based on the above assumptions, utilize a One-Way ANOVA Test
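For reference, the one-way ANOVA statistic compares the variability between the language groups to the variability within them:

$F = \frac{MS_{between}}{MS_{within}} = \frac{\sum_{k} n_k(\bar{x}_k - \bar{x})^2 / (k-1)}{\sum_{k}\sum_{i} (x_{ki} - \bar{x}_k)^2 / (N-k)}$

where $k$ is the number of language groups, $\bar{x}_k$ are the group means, $\bar{x}$ is the overall mean, and $N$ is the total number of observations.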

Step 3: Decide the significance level¶

Significance level ($\alpha$) = 0.05 (5%)

Step 4: Collect and prepare data¶

  • Utilize df_new_spanish, df_new_english, and df_new_french variables created for the levene test.

Step 5: Calculate the p-value¶

In [48]:
from scipy.stats import f_oneway

test_stat, p_value = f_oneway(df_new_spanish, df_new_english, df_new_french)

print (f"The p_value is {round(p_value,8)}.")
The p_value is 0.43204139.

Step 6: Compare the p-value with $\alpha$¶

In [49]:
if (p_value < .05):
    print(f"The p_value of {round(p_value,6)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,6)} is greater than .05. Therefore, do not reject the null hypothesis.")
The p_value of 0.432041 is greater than .05. Therefore, do not reject the null hypothesis.

Step 7: Draw inference¶

Since the null hypothesis was not rejected, we keep the status quo: there is not enough evidence that the average time spent on the new landing page differs across the preferred languages (Spanish, English, and French).

Conclusion and Business Recommendations¶

Initial questions and insights are as follows¶

  • Question 1: Do the users spend more time on the new landing page than on the existing landing page?

    There is sufficient statistical evidence that users on average are spending more time on the new landing page than on the old landing page.

  • Question 2: Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?

    There is enough statistical evidence to state that the proportion of new landing page conversions is greater than the proportion of old landing page conversions.

  • Question 3: Does the converted status depend on the preferred language?

    There is not enough statistical evidence to conclude that the converted status depends on the preferred language; conversion and preferred language appear to be independent.

  • Question 4: Is the time spent on the new page the same for the different language users?

    There was no statistically significant difference in the average time spent on the new landing page across the preferred languages (Spanish, English, and French).

Conclusion¶

  • Based on statistical evidence, the new landing page is more effective in converting users to paid E-news subscriptions.
  • Based on statistical evidence, conversion status is independent of language preferences.

Recommendations¶

  • Immediate: Roll out the new landing page for all users as there is statistical evidence confirming the new landing page converts more users.
  • Localized Content: Although the conversion status was independent of language preferences, recommend investing in tailoring content or messaging to specific language preferences.
  • A/B Testing Iterations: Continue using A/B Testing on other web content refinements and ideas to continue to roll out effective changes.
  • Segmentation Analysis: Invest in acquiring additional user demographic information (gender, age, location, etc.) to tailor content pushes to specific customer segments.
  • User Retention: Invest in user retention data and analytics to understand and increase retention rates. It's more cost effective to retain a subscriber than to convert one.