The advent of e-news, or electronic news, portals has given us a great opportunity to quickly get updates on day-to-day events occurring globally. The information on these portals is retrieved electronically from online databases, processed using a variety of software, and then transmitted to users. Transmitting news electronically has multiple advantages, such as faster access to content and the ability to use audio, graphics, video, and other interactive elements that are either absent from or not yet common in traditional newspapers.
E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. Every visitor to the website takes certain actions based on their interests, and the company plans to analyze these actions to understand user interests and determine how to drive better engagement. The executives at E-news Express believe that the decline in new monthly subscribers compared to the past year is because the current webpage is not designed well enough, in terms of outline and recommended content, to keep customers engaged long enough to decide to subscribe.
Companies often analyze user responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.
The company's design team has researched and created a new landing page with a new outline and more relevant content compared to the old page. To test the effectiveness of the new landing page in gathering new subscribers, the Data Science team conducted an experiment by randomly selecting 100 users and dividing them equally into two groups. The existing landing page was served to the first group (the control group) and the new landing page to the second group (the treatment group). Data regarding the interaction of users in both groups with the two versions of the landing page was collected. As a data scientist at E-news Express, you have been asked to explore the data and perform a statistical analysis (at a significance level of 5%) to determine the effectiveness of the new landing page in gathering new subscribers by answering the following questions:
The data contains information regarding the interaction of users in both groups with the two versions of the landing page.
E-News Express (an electronic news portal) executives are concerned about the decline in monthly subscribers to their portal. Their conjecture is that the decline is due to the current landing page not sufficiently captivating users to convert to a paid subscription. To remedy this, the Design Team has developed a new landing page with an updated outline and content. The objective is to evaluate the effectiveness of the new landing page, compared to the old landing page, in acquiring new subscribers.
The Data Science Team conducted an A/B test by randomly assigning 100 users evenly into two groups: a control group that was presented with the old landing page and a treatment group presented with the new landing page. Data was then collected on their user interactions along with their preferred language.
The task is to perform statistical analysis to determine the new landing page's effectiveness in attracting new subscribers. The following are key questions to address:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library for managing warnings
import warnings
%matplotlib inline
# Library to help with statistical analysis
import scipy.stats as stats
# removing the limit for the number of displayed columns
pd.set_option("display.max_columns", None) # replace None with a number to cap the displayed columns
# limiting the number of displayed rows to 25
pd.set_option("display.max_rows", 25) # replace with None to remove the row limit
# setting the precision of floating numbers to 2 decimal points
pd.set_option("display.float_format", lambda x: "%.2f" % x)
# Purpose: Calculate various descriptive statistics for a specific column in a DataFrame
#
# Prerequisites:
# Requires the caller to only pass data for which descriptive statistics can safely be calculated.
# A production version of this function would need more extensive data validation checks and more
# robust exception handling.
#
# Inputs
#   data   : DataFrame object containing rows and columns of data
#   feature: str representing the column name to run statistics on
#
def calculate_statistics(data, feature):
    # Only calculate and print statistics if the feature is a single column name string and data is a DataFrame
    if isinstance(data, pd.DataFrame) and isinstance(feature, str):
        # Future improvement: use describe() to pull the data type of each column,
        # then only perform the calculations and prints for int64 or float64 columns
        # Calculate and print various descriptive statistics
        print(f"Descriptive Statistics for {feature}\n")
        print(f"Mean : {data[feature].mean():.6f}")
        print(f"Mode : {data[feature].mode()[0]}")
        print(f"Median : {data[feature].median()}")
        print(f"Min : {data[feature].min()}")
        print(f"Max : {data[feature].max()}")
        print(f"Standard Deviation: {data[feature].std():.6f}")
        print(f"Percentiles : \n{data[feature].quantile([.25,.50,.75])}")
# Purpose: Create a histogram plot to visualize the distribution of continuous numerical data
# by dividing the data into bins and displaying the frequency of observations within each bin.
# A histogram is useful for understanding the underlying distribution, shape, central tendency,
# and spread of the data.
#
# Inputs:
#
#   input_data: DataFrame object containing rows and columns of data
#   feature   : str representing the column name to plot a histogram for
#   in_kde    : boolean value; True: plot the kde density line; False: do not plot the kde line
#
def histogram(input_data, feature, in_kde=True):
    # Only proceed if the feature is a single column name string and data is a DataFrame
    if isinstance(input_data, pd.DataFrame) and isinstance(feature, str):
        # stores the x-axis left and right buffer size
        buffer = 5
        # set the x limits based on the minimum and maximum values of the feature
        xmin_value = int(input_data[feature].min())
        xmax_value = int(input_data[feature].max())
        plt.xlim(xmin_value - buffer, xmax_value + buffer)
        # plot the histogram
        sns.histplot(data=input_data, x=feature, kde=in_kde)
        # set the title and the x and y labels
        plt.title('Histogram of ' + feature)
        plt.xlabel(feature)
        plt.ylabel('Frequency')
        plt.show()
# Purpose: Create a boxplot, for a particular column/feature, to visually summarize the distribution
# of a continuous numerical variable and to identify potential outliers within the data.
#
# Inputs:
#
#   input_data   : DataFrame object containing rows and columns of data
#   feature      : str representing the column name to plot a boxplot for
#   in_vert      : False (display horizontally); True (display vertically)
#   in_showfliers: True (show the outliers); False (do not show outliers)
#   in_showlabels: True (show boxplot labels for Max, Q3, Median, Q1, and Min); False (do not show labels)
#
def boxplot(input_data, feature, in_vert=False, in_showfliers=False, in_showlabels=True):
    # label translator key
    label_translator = {'caps 0': 'Min', 'caps 1': 'Max', 'whiskers 0': 'Q1', 'whiskers 1': 'Q3',
                        'medians 0': 'Median', 'boxes 0': 'Box'}
    # Only proceed if data is a DataFrame
    if isinstance(input_data, pd.DataFrame):
        ax = plt.boxplot(input_data[feature], vert=in_vert, showfliers=in_showfliers)
        plt.title('Boxplot of ' + feature)
        plt.xlabel(feature)
        # Revisit this later to figure out a more robust way to get the min x value for labels.
        # Right now, the median has the smallest x value when in vertical mode
        min_x = ax['medians'][0].get_xydata()[0][0] - .05
        # show labels for the main boxplot lines (e.g., Max line, Q1, Q3, Min line)
        for i in ax.keys():
            for index, line in enumerate(ax[i]):
                # 'boxes' is redundant with Q1, and labels for outliers are not of interest,
                # so filter both out
                if i != 'boxes' and i != 'fliers':
                    line_x, line_y = line.get_xydata()[0]
                    # if requested and in vertical mode, show the key boxplot labels
                    # revisit later to get labels working in horizontal mode
                    if in_showlabels and in_vert:
                        label_key = f"{i} {index}"
                        label = f"{label_translator[label_key]}:{line_y:.2f}"
                        plt.text(min_x, line_y, label, ha='right', va='center', color='blue', fontsize=9)
        plt.show()
# Purpose: Create a countplot; used to visualize counts for categorical data
#
# Inputs:
#
#   input_data : DataFrame object containing rows and columns of data
#   feature    : str representing the column name to plot a countplot for (category column)
#   show_perc  : value in [0, 1] indicating the top % of values to show in the countplot
#   label_count: True (show count labels); False (show percentage labels)
#
def countplot(input_data, feature, show_perc=1.0, label_count=True):
    if isinstance(input_data, pd.DataFrame) and isinstance(feature, str):
        # Set the figure size. Revisit this later to see if there is a robust way to scale the
        # figure size based on the number of x-axis labels.
        plt.figure(figsize=(10, 6))
        # Count the values, which sorts the list in descending order, then grab the list of categories
        order_cols = input_data[feature].value_counts().index
        # Use the percentage value to determine how many of the top categories to show
        num_to_show = int(len(order_cols) * show_perc)
        # Grab the top categories to show in the countplot
        cols_to_show = input_data[feature].value_counts().nlargest(num_to_show).index
        # plot the top categories
        cp = sns.countplot(data=input_data, x=feature, order=cols_to_show)
        # rotate x labels for better readability
        plt.xticks(rotation=90)
        # set the title
        plt_title = f"Countplot for the top {show_perc*100}% {feature} values"
        plt.title(plt_title)
        # Apply some simple label formatting; replace underscores with spaces
        cp.set_xlabel(feature.replace('_', ' ').title(), fontsize=15)
        # show values for each bar/patch; the labels are either counts or percentages
        total_values = len(input_data[feature])
        for p in cp.patches:
            if label_count:
                # show count label
                label = p.get_height()
            else:
                # show percentage label
                label = "{:.1f}%".format(100 * p.get_height() / total_values)
            cp.annotate(
                label,
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center',
                va='center',
                xytext=(0, 5),  # set the label offset above the bar
                textcoords='offset points'
            )
        plt.show()
# Purpose: Create a boxplot for multiple variables (x being a categorical value)
#
# Inputs:
#
#   in_data  : DataFrame object containing rows and columns of data
#   x_feature: str representing the column name for the x-axis (categorical data)
#   y_feature: str representing the column name for the y-axis
#
def multi_boxplot(in_data, x_feature, y_feature):
    # Only proceed if the features are single column name strings and data is a DataFrame
    if isinstance(in_data, pd.DataFrame) and isinstance(x_feature, str) and isinstance(y_feature, str):
        # visualizing the relationship between the two features
        plt.figure(figsize=(12, 5))
        sns.boxplot(data=in_data, x=x_feature, y=y_feature, showmeans=True)
        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        plt.xticks(rotation='vertical')
        plt.xlabel(x_feature, fontsize=15)
        plt.ylabel(y_feature, fontsize=15)
        plt.show()
# Purpose: Create a heatmap which supports multivariate analysis of 2+ numerical features. Heatmaps are
# useful for identifying correlations between 2 or more variables.
#
# Inputs:
#
#   input_data: DataFrame object containing rows and columns of data
#
def heatmap(input_data):
    if isinstance(input_data, pd.DataFrame):
        plt.figure(figsize=(10, 8))
        # Select only the numerical columns
        numeric_columns = input_data.select_dtypes(include=np.number)
        sns.heatmap(numeric_columns.corr(), annot=True, cmap='Spectral', vmin=-1, vmax=1)
        plt.show()
# Read the A/B test data set into a pandas DataFrame
df = pd.read_csv("./abtest.csv")
# Verify the data file was read correctly by displaying the first five rows.
df.head()
| | user_id | group | landing_page | time_spent_on_the_page | converted | language_preferred |
|---|---|---|---|---|---|---|
| 0 | 546592 | control | old | 3.48 | no | Spanish |
| 1 | 546468 | treatment | new | 7.13 | yes | English |
| 2 | 546462 | treatment | new | 4.40 | no | Spanish |
| 3 | 546567 | control | old | 3.02 | no | French |
| 4 | 546459 | treatment | new | 4.75 | yes | Spanish |
# Verify the entire data file was read correctly by displaying the last five rows.
df.tail()
| | user_id | group | landing_page | time_spent_on_the_page | converted | language_preferred |
|---|---|---|---|---|---|---|
| 95 | 546446 | treatment | new | 5.15 | no | Spanish |
| 96 | 546544 | control | old | 6.52 | yes | English |
| 97 | 546472 | treatment | new | 7.07 | yes | Spanish |
| 98 | 546481 | treatment | new | 6.20 | yes | Spanish |
| 99 | 546483 | treatment | new | 5.86 | yes | English |
# Print out the number of rows and columns in the data file.
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
There are 100 rows and 6 columns.
# Print out basic information on the data file.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   user_id                 100 non-null    int64
 1   group                   100 non-null    object
 2   landing_page            100 non-null    object
 3   time_spent_on_the_page  100 non-null    float64
 4   converted               100 non-null    object
 5   language_preferred      100 non-null    object
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB
# Print out some basic descriptive statistics on the A/B test data
df.describe(include='all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | 100.00 | NaN | NaN | NaN | 546517.00 | 52.30 | 546443.00 | 546467.75 | 546492.50 | 546567.25 | 546592.00 |
| group | 100 | 2 | control | 50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| landing_page | 100 | 2 | old | 50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| time_spent_on_the_page | 100.00 | NaN | NaN | NaN | 5.38 | 2.38 | 0.19 | 3.88 | 5.42 | 7.02 | 10.71 |
| converted | 100 | 2 | yes | 54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| language_preferred | 100 | 3 | Spanish | 34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# Check for missing values.
df.isnull().sum()
user_id                   0
group                     0
landing_page              0
time_spent_on_the_page    0
converted                 0
language_preferred        0
dtype: int64
# Check for duplicates by counting the unique values per column; 100 unique user_ids means no duplicate users
df.nunique()
user_id                   100
group                       2
landing_page                2
time_spent_on_the_page     94
converted                   2
language_preferred          3
dtype: int64
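As a complementary check (a minimal sketch using pandas' built-in duplicated), fully duplicated rows and repeated user IDs can be counted directly:

# Sketch: a more direct duplicate check than nunique();
# counts fully duplicated rows and repeated user IDs
print(f"Duplicate rows    : {df.duplicated().sum()}")
print(f"Duplicate user_ids: {df['user_id'].duplicated().sum()}")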
# The control group should always be served the old landing page. Verify the data reflects this.
df_control = df[(df['group'] == 'control') & (df['landing_page'] != 'old')]
print(f"There are {len(df_control)} control/old data inconsistencies.")
# The treatment group should always be served the new landing page. Verify the data reflects this.
df_treatment = df[(df['group'] == 'treatment') & (df['landing_page'] != 'new')]
print(f"There are {len(df_treatment)} treatment/new data inconsistencies.")
There are 0 control/old data inconsistencies.
There are 0 treatment/new data inconsistencies.
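Equivalently (a minimal sketch), a single crosstab makes the group/landing_page alignment visible at a glance; the off-diagonal cells should be zero:

# Sketch: cross-tabulate group vs landing_page; off-diagonal cells should be 0
print(pd.crosstab(df['group'], df['landing_page']))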
# Group field univariate analysis
# Create a countplot for the group values
countplot(df,'group',show_perc=1.0,label_count=True)
# Print the total number of unique groups
print(f"Number of groups: {df['group'].nunique()}")
Number of groups: 2
# Landing_page field univariate analysis
# Create a countplot for the landing_page values
countplot(df,'landing_page',show_perc=1.0,label_count=True)
# Print the total number of unique landing pages
print(f"Number of landing pages: {df['landing_page'].nunique()}")
Number of landing pages: 2
# time_spent_on_the_page field univariate analysis
selected_column = 'time_spent_on_the_page'
# calculate univariate statistics
calculate_statistics(df,selected_column)
# show histogram for time_spent_on_the_page
histogram(df,selected_column)
# show boxplot for time_spent_on_the_page
boxplot(df,selected_column,in_vert=True, in_showfliers=False,in_showlabels=True)
Descriptive Statistics for time_spent_on_the_page

Mean : 5.377800
Mode : 0.4
Median : 5.415
Min : 0.19
Max : 10.71
Standard Deviation: 2.378166
Percentiles :
0.25   3.88
0.50   5.42
0.75   7.02
Name: time_spent_on_the_page, dtype: float64
# Converted field univariate analysis
# Create a countplot for the converted values
countplot(df,'converted',show_perc=1.0,label_count=True)
# Print the total number of unique converted values
print(f"Number of converted values: {df['converted'].nunique()}")
Number of converted values: 2
# language_preferred field univariate analysis
# Create a countplot for the language_preferred values
countplot(df,'language_preferred',show_perc=1.0,label_count=True)
# Print the total number of unique language_preferred values
print(f"Number of language_preferred values: {df['language_preferred'].nunique()}")
Number of language_preferred values: 3
# visualizing the relationship between landing_page and time_spent_on_the_page
multi_boxplot(in_data=df,x_feature="landing_page", y_feature="time_spent_on_the_page")
# visualizing the relationship between language_preferred and time_spent_on_the_page
multi_boxplot(in_data=df,x_feature="language_preferred", y_feature="time_spent_on_the_page")
# visualizing the relationship between converted and time_spent_on_the_page
multi_boxplot(in_data=df,x_feature="converted", y_feature="time_spent_on_the_page")
# Plot the number of converted values (yes or no) for each landing page (old or new)
sns.countplot(data=df, x='landing_page', hue='converted');
# visualizing the relationship between the old and new landing page and the total time spent on the page.
multi_boxplot(in_data=df,x_feature="landing_page", y_feature="time_spent_on_the_page")
Let's frame the null and alternative hypotheses based on the above claim. Let $\mu_n$ and $\mu_o$ be the mean time spent by users on the new and old landing pages, respectively.

$H_0$: $\mu_n = \mu_o$; on average, users spend the same amount of time on the new landing page as on the old landing page.

$H_a$: $\mu_n > \mu_o$; on average, users spend more time on the new landing page than on the old landing page.

Based on the above information, use a two-sample independent t-test (one-tailed).

Significance level ($\alpha$) = 0.05
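For reference, a minimal statement of the test statistic in Welch's form (which does not assume equal variances; $\bar{x}$, $s$, and $n$ denote the sample mean, sample standard deviation, and sample size of each group):

$$t = \frac{\bar{x}_n - \bar{x}_o}{\sqrt{\dfrac{s_n^2}{n_n} + \dfrac{s_o^2}{n_o}}}$$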
# Collect the time spent values for the new landing page
time_spent_new_landing = df[df['landing_page']=='new']['time_spent_on_the_page']
# Collect the time spent values for the old landing page
time_spent_old_landing= df[df['landing_page']=='old']['time_spent_on_the_page']
# Calculate the standard deviations to determine how to set the equal_var parameter below.
print (f"The standard deviation for time spent on the new landing page is {round(time_spent_new_landing.std(),4)}")
print (f"The standard deviation for time spent on the old landing page is {round(time_spent_old_landing.std(),4)}")
The standard deviation for time spent on the new landing page is 1.817
The standard deviation for time spent on the old landing page is 2.582
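The standard deviations look noticeably different. As a complementary check (a sketch reusing the two Series above), Levene's test can formally assess the equal-variance assumption before setting equal_var:

# Sketch: formally test equal variances before choosing equal_var below
from scipy.stats import levene
lev_stat, lev_p = levene(time_spent_new_landing, time_spent_old_landing)
print(f"Levene test p-value: {lev_p:.4f}")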
#import the required t-test function
from scipy.stats import ttest_ind
# find the p-value
test_stat, p_value = ttest_ind(time_spent_new_landing, time_spent_old_landing, equal_var = False, alternative = 'greater')
print(f"The p-value is {round(p_value,8)}")
The p-value is 0.00013924
if (p_value < 0.05):
    print(f'As the p-value {round(p_value,8)} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {round(p_value,8)} is greater than the level of significance, we fail to reject the null hypothesis.')
As the p-value 0.00013924 is less than the level of significance, we reject the null hypothesis.
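As a sanity check (a minimal sketch reusing the Series and test_stat defined above), the Welch t-statistic can be reproduced by hand:

# Sketch: reproduce the Welch t-statistic manually (pandas' std uses ddof=1, as scipy does)
n_new, n_old = len(time_spent_new_landing), len(time_spent_old_landing)
s_new, s_old = time_spent_new_landing.std(), time_spent_old_landing.std()
manual_t = (time_spent_new_landing.mean() - time_spent_old_landing.mean()) / np.sqrt(s_new**2/n_new + s_old**2/n_old)
print(f"Manual Welch t-statistic: {manual_t:.4f} (scipy: {test_stat:.4f})")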
Since the null hypothesis is rejected, there is sufficient statistical evidence that users, on average, spend more time on the new landing page than on the old landing page.
A similar approach can be followed to answer the other questions.
# Create a crosstab table for landing page vs converted
df_crosstab = pd.crosstab(df['landing_page'], df['converted'])
df_crosstab
| converted | no | yes |
|---|---|---|
| landing_page | | |
| new | 17 | 33 |
| old | 29 | 21 |
# create a stacked bar plot to compare the distributions of both the categorical features
df_crosstab.plot(kind='bar',stacked =True)
plt.legend()
plt.show()
Let's frame the null and alternative hypotheses based on the above claim. Let $p_n$ and $p_o$ be the conversion proportions for the new and old landing pages, respectively.

$H_0$: $p_n = p_o$; the proportion of users converted from the new landing page is equal to the proportion converted from the old landing page.

$H_a$: $p_n > p_o$; the proportion of users converted from the new landing page is greater than the proportion converted from the old landing page.

Based on the above information, use a two-proportion z-test.

Significance level ($\alpha$) = 0.05
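For reference, a minimal statement of the pooled two-proportion z-statistic (assuming statsmodels' default pooled-variance form), with $x$ the conversion counts and $n$ the group sizes:

$$Z = \frac{\hat{p}_n - \hat{p}_o}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_n} + \dfrac{1}{n_o}\right)}}, \qquad \hat{p} = \frac{x_n + x_o}{n_n + n_o}$$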
# Get the number of people who converted from the new landing page
new_converted = df[df['landing_page']=='new']['converted'].value_counts()['yes']
# Get the number of people who converted from the old landing page
old_converted = df[df['landing_page']=='old']['converted'].value_counts()['yes']
# Create the list of conversion values from the new and old landing pages
conversions = [new_converted, old_converted]
# Get the total number of observations for the new landing page
new_obs = (df['landing_page'] == 'new').sum()
# Get the total number of observations for the old landing page
old_obs = (df['landing_page'] == 'old').sum()
nobs = [new_obs, old_obs]
print(f"Number of new landing page conversions: {new_converted}")
print(f"Number of old landing page conversions: {old_converted}")
print(f"Number of new landing page observations: {new_obs}")
print(f"Number of old landing page observations: {old_obs}")
Number of new landing page conversions: 33
Number of old landing page conversions: 21
Number of new landing page observations: 50
Number of old landing page observations: 50
#import the required functions
from statsmodels.stats.proportion import proportions_ztest
test_stat, p_value = proportions_ztest(conversions,nobs, alternative='larger')
print(f"p_value is {round(p_value,8)}")
p_value is 0.00802631
if (p_value < .05):
    print(f"The p_value of {round(p_value,8)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,8)} is greater than .05. Therefore, there is not enough statistical evidence to reject the null hypothesis.")
The p_value of 0.00802631 is less than .05. Therefore, reject the null hypothesis.
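As a sanity check (a minimal sketch reusing the counts above), the pooled z-statistic can be reproduced by hand:

# Sketch: reproduce the pooled two-proportion z-statistic manually
p_new, p_old = new_converted / new_obs, old_converted / old_obs
p_pool = (new_converted + old_converted) / (new_obs + old_obs)
se = np.sqrt(p_pool * (1 - p_pool) * (1/new_obs + 1/old_obs))
manual_z = (p_new - p_old) / se
print(f"Manual z-statistic: {manual_z:.4f} (statsmodels: {test_stat:.4f})")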
# Create a crosstab table for preferred language vs converted
df_crosstab = pd.crosstab(df['language_preferred'],df['converted'])
df_crosstab
| converted | no | yes |
|---|---|---|
| language_preferred | | |
| English | 11 | 21 |
| French | 19 | 15 |
| Spanish | 16 | 18 |
# create a stacked bar plot to compare the distributions of both the categorical features
df_crosstab.plot(kind='bar',stacked =True)
plt.legend()
plt.show()
Let's frame the null and alternative hypotheses based on the above claim:

$H_0$: Conversion status and preferred language are independent.

$H_a$: Conversion status and preferred language are not independent.

Based on the above, use the Chi-Square Test of Independence.

Significance level ($\alpha$) = 0.05
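For reference, the test compares the observed cell counts $O_{ij}$ with the counts expected under independence:

$$\chi^2 = \sum_{i,j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}$$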
#import the required functions
from scipy.stats import chi2_contingency
chi, p_value, dof, expected = chi2_contingency(df_crosstab)
print (f"The p_value is {round(p_value,8)}.")
The p_value is 0.21298887.
if (p_value < .05):
    print(f"The p_value of {round(p_value,8)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,8)} is greater than .05. Therefore, there is not enough statistical evidence to reject the null hypothesis.")
The p_value of 0.21298887 is greater than .05. Therefore, there is not enough statistical evidence to reject the null hypothesis.
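As a quick validity check (a sketch reusing the expected array returned by chi2_contingency above), the expected counts under independence can be inspected; a common rule of thumb asks for all expected cell counts to be at least 5:

# Sketch: inspect the expected counts returned by chi2_contingency
expected_df = pd.DataFrame(expected, index=df_crosstab.index, columns=df_crosstab.columns)
print(expected_df)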
# create a new dataframe with just the new landing_page data; this will include data from all preferred languages
df_new = df[df['landing_page']=='new']
df_new.head()
| | user_id | group | landing_page | time_spent_on_the_page | converted | language_preferred |
|---|---|---|---|---|---|---|
| 1 | 546468 | treatment | new | 7.13 | yes | English |
| 2 | 546462 | treatment | new | 4.40 | no | Spanish |
| 4 | 546459 | treatment | new | 4.75 | yes | Spanish |
| 6 | 546448 | treatment | new | 5.25 | yes | French |
| 8 | 546461 | treatment | new | 10.71 | yes | French |
# visualizing the relationship between preferred languages (from the new landing page) vs time_spent_on_the_page
multi_boxplot(in_data=df_new,x_feature="language_preferred", y_feature="time_spent_on_the_page")
# Calculate the average time spent on the new landing page for each preferred language
mu = df_new.groupby(['language_preferred'])['time_spent_on_the_page'].mean()
mu
language_preferred
English   6.66
French    6.20
Spanish   5.84
Name: time_spent_on_the_page, dtype: float64
Let $\mu_s$, $\mu_e$, $\mu_f$ be the average time spent on the new page for each of the preferred languages (Spanish, English, and French).

$H_0$: $\mu_s = \mu_e = \mu_f$

$H_a$: At least one of the means is not the same.

One-way ANOVA assumes that the response variable is normally distributed and that the group variances are equal, so verify these assumptions first. For normality (Shapiro-Wilk test):

$H_0$: The time spent on the page follows a normal distribution

$H_a$: The time spent on the page does not follow a normal distribution
#import the required functions
from scipy.stats import shapiro
# Run the Shapiro-Wilk test to check the normality assumption
statistic_val, p_value = shapiro(df_new['time_spent_on_the_page'])
print (f"The p_value is {round(p_value,8)}.")
The p_value is 0.80400163.
if (p_value < .05):
    print(f"The p_value of {round(p_value,8)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,8)} is greater than .05. Therefore, do not reject the null hypothesis.")
The p_value of 0.80400163 is greater than .05. Therefore, do not reject the null hypothesis.
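As a visual companion to the Shapiro-Wilk result (a minimal sketch using scipy's probplot), a Q-Q plot shows how closely the sample quantiles track a normal distribution:

# Sketch: Q-Q plot of time spent on the new landing page against a normal distribution
stats.probplot(df_new['time_spent_on_the_page'], dist="norm", plot=plt)
plt.title("Q-Q plot: time spent on the new landing page")
plt.show()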
Next, determine whether the group populations have a common variance.

$H_0$: All the population variances are the same.

$H_a$: At least one variance is different from the rest.
# import the required function
from scipy.stats import levene
# Run Levene's test to check the validity of the common-variance assumption
# For each preferred language, collect the time spent on the new landing page
df_new_spanish = df_new[df_new['language_preferred']=='Spanish']['time_spent_on_the_page']
df_new_english = df_new[df_new['language_preferred']=='English']['time_spent_on_the_page']
df_new_french = df_new[df_new['language_preferred']=='French']['time_spent_on_the_page']
# calculate the statistic and p-value
statistic_val, p_value = levene(df_new_spanish, df_new_english, df_new_french)
print(f"The p_value is {round(p_value,8)}.")
The p_value is 0.46711358.
if (p_value < .05):
    print(f"The p_value of {round(p_value,8)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,8)} is greater than .05. Therefore, do not reject the null hypothesis.")
The p_value of 0.46711358 is greater than .05. Therefore, do not reject the null hypothesis.
Based on the above results, use a one-way ANOVA test.

Significance level ($\alpha$) = 0.05
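For reference, with $k$ groups and $N$ total observations, the one-way ANOVA statistic compares between-group to within-group variation:

$$F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{\sum_{g=1}^{k} n_g(\bar{x}_g - \bar{x})^2 / (k-1)}{\sum_{g=1}^{k}\sum_{i=1}^{n_g}(x_{gi} - \bar{x}_g)^2 / (N-k)}$$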
from scipy.stats import f_oneway
test_stat, p_value = f_oneway(df_new_spanish, df_new_english, df_new_french)
print (f"The p_value is {round(p_value,8)}.")
The p_value is 0.43204139.
if (p_value < .05):
    print(f"The p_value of {round(p_value,6)} is less than .05. Therefore, reject the null hypothesis.")
else:
    print(f"The p_value of {round(p_value,6)} is greater than .05. Therefore, do not reject the null hypothesis.")
The p_value of 0.432041 is greater than .05. Therefore, do not reject the null hypothesis.
Since the null hypothesis was not rejected, we retain the conclusion that the average time spent on the new landing page is the same for all preferred languages (Spanish, English, and French).
Question 1: Do the users spend more time on the new landing page than on the existing landing page?
There is sufficient statistical evidence that users on average are spending more time on the new landing page than on the old landing page.
Question 2: Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?
There is enough statistical evidence to state that the proportion of new landing page conversions is greater than the proportion of old landing page conversions.
Question 3: Does the converted status depend on the preferred language?
There is not enough statistical evidence to conclude that converted status depends on the preferred language; the data are consistent with conversion and preferred language being independent.
Question 4: Is the time spent on the new page the same for the different language users?
There is no statistically significant difference in the average time spent on the new landing page across the preferred languages (Spanish, English, and French).