4.3 PREDICTIVE ANALYTICS TECHNIQUE¶

4.3.1 PRESENTATION OF DATA¶

The dataset to be analysed using this technique will be imported, read into our dataframe, and displayed below.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Markdown, display

def printTitle(title):
    display(Markdown(("### **"+title+"**").upper()))

title = "TABLE 1 : importing our dataset, reading it into our dataframe, viewing the dataset"
printTitle(title)
data = pd.read_csv('word/insurance.csv')
data

4.3.2 ANALYSIS OF DATA ACCORDING TO RESEARCH QUESTIONS¶

Using the model for prediction, the steps to follow includes:

• Data processing
• Model evaluation
• Predicting insurance premium charge with our model

1. DATA PROCESSING¶

a. DATA CLEANING¶

We need to represent all data numerically. Some of our predictors (sex and smoker) are categorical data and since they have two categories:

sex - male and female
smoker - yes and no

we encode these categories in binary values.

For sex:

0 for male
1 for female

For smoker:

1 for yes
0 for no

Also 'region' column has 4 categories:

data['region'].value_counts()

southeast    364
northwest    325
southwest    325
northeast    324
Name: region, dtype: int64

These will be encoded accordingly:

southeast : 0
northwest : 1
southwest : 2
northeast : 3

data['sex'] = data.sex.map({'male':0,'female':1})
data['smoker'] = data.smoker.map({'yes':1,'no':0})
data['region'] = data.region.map({'southeast' : 0, 'northwest': 1, 'southwest': 2, 'northeast':3})

After converting all categorical data to numerical data we have our dataset:

title =  "table 2 : Our Dataset, After converting all categorical data to numerical data"
printTitle(title)
data

b. DATA VISUALIZATION¶

Here we are going to visualize the relationship between the dependent and the independent variable using scatterplots .

The relationship between charges and each independent variables can be seen below:

import seaborn as sns

col = list(data)
col.remove('charges')

title =  "Graph 1 : Relationship between charges and each independent variables"
printTitle(title)

for i in col:
    sns.pairplot(data, x_vars=[i], y_vars='charges', height=4, aspect=1.5)

c. QUESTIONS ABOUT THE INSURANCE DATASET¶

Is there a relationship between:
- Sex and charges?
- Age and charges?
- Smoker and charges?
- bmi and charges?
- Children and charges?
- regions and charges?

If there is, how strong is that relationship?
What is the effect of each independent variable on the dependent variable?
With this data set, can premium be estimated for a new life insurance policy holder?
How accurate is the prediction?

This questions will be answered in the model evaluation stage

2. MODEL EVALUATION¶

a. PREPARING THE DATASET FOR TRAINING THE MODEL¶

At this stage we select independent variables to used in our model by evaluating their p-values.

We select and use independent variables that have strong relationship with the dependent variable and drop the column or not include in our model the independent variables that have weak or no relationship with the dependent variable.

import statsmodels.formula.api as smf

mod_coef = smf.ols(formula='charges ~ age + sex + bmi + children + smoker + region', data=data).fit()

Hypothesis Testing And P-Values¶

print "These are our independent variables and their corresponding p-values"
mod_coef.pvalues

These are our independent variables and their corresponding p-values

Intercept    4.068838e-33
age          6.243711e-89
sex          6.952873e-01
bmi          8.292445e-31
children     5.983277e-04
smoker       0.000000e+00
region       1.051604e-01
dtype: float64

From the output above each feature have significant p-values except sex and region. Thus we reject the null hypothesis for age, bmi, children, smoker and fail to reject the null hypothesis for sex and region. This is one way to determine which columns are insignificant and should be dropped from a dataset.

'smoker' has the strongest relationship with charges, followed by age, body mass index, children. Sex and region both have very weak relationship with charges therefore will be dropped from our data set and not be used in our model.

Here's a view of our dataset after dropping the sex and region column:

data_ii = data.drop(['sex','region'], axis=1)
title = "table 3 : Dataset after dropping the insignificant independent variables (sex and region column)"
printTitle(title)
data_ii

b. ESTIMATING OR 'LEARINING' THE MODEL COEFFICIENT¶

We use the Statsmodels Library to estimate our model coefficients.

import statsmodels.formula.api as smf
print "Independent variables and their Coefficient"

mod_coef = smf.ols(formula='charges ~ age + bmi + children + smoker', data=data_ii).fit()
mod_coef.params

Independent variables and their Coefficient

Intercept   -12102.769363
age            257.849507
bmi            321.851402
children       473.502316
smoker       23811.399845
dtype: float64

The coefficient of each independent variable to be used in our model can be seen above

c. INTERPRETING THE MODEL COEFFICIENT¶

Generally a unit increase in any of: age, bmi, children, smoker, is associated with a unit increase in premium charges up to the value of the corresponding model coefficient.

More clearly: A Unit increase in body mass index of a policy holder is associated with a 321.851402 unit increase in his/her premium charge. If the model coefficient is negative, then a unit increase will cause a unit decrease in the premium charge.

d. SPLITTING THE DATASET INTO TRAINING AND TESTING SET¶

A higher percentage is used for training the model while a smaller portion of the data is used to test the model. This is implemented below:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = data_ii[['age','bmi','children','smoker']]
y = data_ii.charges

#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

e. TRAINING AND TESTING THE MODEL¶

The model is trained to predict the known outputs/premium charges using the training dataset and later tested using the test dataset. The test data is used to test the accuracy of the model.

# instantiate the model
model = LinearRegression()

#fit Model
model.fit(X_train, y_train)

# make prediction based on the test data
predictions = model.predict(X_test)

f. MODEL EVALUATION: PREDICTION ACCURACY OF TEST DATASET AND R-SQUARED¶

Prediction Accuracy Of The Test Set¶

After training the model with our training dataset, we try to predict our already know dependent variable/premium charge in our test dataset then compare this prediction (predicted test charges) against the actual test charges to evaluate the accuracy of the model.

Below is a table showing Actual Test Dataset Premium Charges against their Predicted Test Dataset Premium Charges:

from pandas import Series, DataFrame

data_iii = {'Actual Test Dataset Premium Charges':y_test, 'Predicted Test Dataset Premium Charges':predictions}
frameData_iii = pd.DataFrame(data_iii)
title =  "table 4 : Test Dataset Premium Charges: Actual Charges and their corresponding Predicted Charges"
printTitle(title)
frameData_iii

# accuracy of the prediction
model_accuracy = model.score(X_test, y_test)
model_accuracy

0.7385564961370813

The accuracy of the model is : 73.86%¶

Visualizing The Relationship Between The Actual Premium Charges In Test Data Set And Their Corresponding Predicted Value Using Our Model¶

sns.distplot(y_test - predictions, axlabel="Test vs Prediction")

title =  'Graph 2 : Relationship Between Actual Premium Charges In Test Dataset VS Their Corresponding Predicted Value (Using Our Model For Prediction)'
printTitle(title)

R-squared¶

To evaluate the overall fit of a linear model, we analyse the R-squared. The proportion of variance in the observed data that is explained by the model is the R-squared. The R-squared is between 0 and 1 and the higher the R-square the better because it means that more variance is explained by the model.

# R-square
model_rsquare = mod_coef.rsquared
model_rsquare

0.74969453034647904

The R-squared value for our model is 74.97%¶

The R-squared is most useful in comparing different models

g. SUMMARY OF THE FITTED MODEL¶

title =  "table 5 : SUMMARY OF THE FITTED MODEL"
printTitle(title)
mod_coef.summary()

3. USING OUR MODEL TO PREDICT INSURANCE PREMIUM CHARGE¶

CASE STUDY:¶

A new policy holder insured his life at Leadway Insurance Company. Among other details provided, the following were relevant to this study:

- Age : 55
- Sex : male
- BMI : 47.843
- Children : 4
- Smoker : no
- Region : Southeast

Determine/predict the premium that should be charged.

IMPLEMENTATION¶

We predict insurance premium charge for a new policy holder following the steps:

Import the Multiple Linear Regression Model from the Python’s Scikit-Learn Library
Instantiate or run the model
Fit the data into our model
Predict for a new policy holder using model coefficient

Recall that region and sex do not have strong relationship with insurance premium charge. Hence for better prediction, they will be excluded from our model.

#import the linear regression model
from sklearn.linear_model import LinearRegression

# x represent the independent variables while y, the dependent variable
X = data_ii[['age','bmi','children','smoker']]
y = data_ii.charges

#instantiate the model
my_model = LinearRegression()

# fit our dataset into our model
my_model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

PREDICT PREMIUM CHARGE FOR A NEW POLICY HOLDER¶

my_model.predict([[55,47.843,4,0]])

array([ 19371.29944809])

	age	sex	bmi	children	smoker	region	charges
0	19	1	27.900	0	1	2	16884.92400
1	18	0	33.770	1	0	0	1725.55230
2	28	0	33.000	3	0	0	4449.46200
3	33	0	22.705	0	0	1	21984.47061
4	32	0	28.880	0	0	1	3866.85520
5	31	1	25.740	0	0	0	3756.62160
6	46	1	33.440	1	0	0	8240.58960
7	37	1	27.740	3	0	1	7281.50560
8	37	0	29.830	2	0	3	6406.41070
9	60	1	25.840	0	0	1	28923.13692
10	25	0	26.220	0	0	3	2721.32080
11	62	1	26.290	0	1	0	27808.72510
12	23	0	34.400	0	0	2	1826.84300
13	56	1	39.820	0	0	0	11090.71780
14	27	0	42.130	0	1	0	39611.75770
15	19	0	24.600	1	0	2	1837.23700
16	52	1	30.780	1	0	3	10797.33620
17	23	0	23.845	0	0	3	2395.17155
18	56	0	40.300	0	0	2	10602.38500
19	30	0	35.300	0	1	2	36837.46700
20	60	1	36.005	0	0	3	13228.84695
21	30	1	32.400	1	0	2	4149.73600
22	18	0	34.100	0	0	0	1137.01100
23	34	1	31.920	1	1	3	37701.87680
24	37	0	28.025	2	0	1	6203.90175
25	59	1	27.720	3	0	0	14001.13380
26	63	1	23.085	0	0	3	14451.83515
27	55	1	32.775	2	0	1	12268.63225
28	23	0	17.385	1	0	1	2775.19215
29	31	0	36.300	2	1	2	38711.00000
...	...	...	...	...	...	...	...
1308	25	1	30.200	0	1	2	33900.65300
1309	41	0	32.200	2	0	2	6875.96100
1310	42	0	26.315	1	0	1	6940.90985
1311	33	1	26.695	0	0	1	4571.41305
1312	34	0	42.900	1	0	2	4536.25900
1313	19	1	34.700	2	1	2	36397.57600
1314	30	1	23.655	3	1	1	18765.87545
1315	18	0	28.310	1	0	3	11272.33139
1316	19	1	20.600	0	0	2	1731.67700
1317	18	0	53.130	0	0	0	1163.46270
1318	35	0	39.710	4	0	3	19496.71917
1319	39	1	26.315	2	0	1	7201.70085
1320	31	0	31.065	3	0	1	5425.02335
1321	62	0	26.695	0	1	3	28101.33305
1322	62	0	38.830	0	0	0	12981.34570
1323	42	1	40.370	2	1	0	43896.37630
1324	31	0	25.935	1	0	1	4239.89265
1325	61	0	33.535	0	0	3	13143.33665
1326	42	1	32.870	0	0	3	7050.02130
1327	51	0	30.030	1	0	0	9377.90470
1328	23	1	24.225	2	0	3	22395.74424
1329	52	0	38.600	2	0	2	10325.20600
1330	57	1	25.740	2	0	0	12629.16560
1331	23	1	33.400	0	0	2	10795.93733
1332	52	1	44.700	3	0	2	11411.68500
1333	50	0	30.970	3	0	1	10600.54830
1334	18	1	31.920	0	0	3	2205.98080
1335	18	1	36.850	0	0	0	1629.83350
1336	21	1	25.800	0	0	2	2007.94500
1337	61	1	29.070	0	1	1	29141.36030

	age	bmi	children	smoker	charges
0	19	27.900	0	1	16884.92400
1	18	33.770	1	0	1725.55230
2	28	33.000	3	0	4449.46200
3	33	22.705	0	0	21984.47061
4	32	28.880	0	0	3866.85520
5	31	25.740	0	0	3756.62160
6	46	33.440	1	0	8240.58960
7	37	27.740	3	0	7281.50560
8	37	29.830	2	0	6406.41070
9	60	25.840	0	0	28923.13692
10	25	26.220	0	0	2721.32080
11	62	26.290	0	1	27808.72510
12	23	34.400	0	0	1826.84300
13	56	39.820	0	0	11090.71780
14	27	42.130	0	1	39611.75770
15	19	24.600	1	0	1837.23700
16	52	30.780	1	0	10797.33620
17	23	23.845	0	0	2395.17155
18	56	40.300	0	0	10602.38500
19	30	35.300	0	1	36837.46700
20	60	36.005	0	0	13228.84695
21	30	32.400	1	0	4149.73600
22	18	34.100	0	0	1137.01100
23	34	31.920	1	1	37701.87680
24	37	28.025	2	0	6203.90175
25	59	27.720	3	0	14001.13380
26	63	23.085	0	0	14451.83515
27	55	32.775	2	0	12268.63225
28	23	17.385	1	0	2775.19215
29	31	36.300	2	1	38711.00000
...	...	...	...	...	...
1308	25	30.200	0	1	33900.65300
1309	41	32.200	2	0	6875.96100
1310	42	26.315	1	0	6940.90985
1311	33	26.695	0	0	4571.41305
1312	34	42.900	1	0	4536.25900
1313	19	34.700	2	1	36397.57600
1314	30	23.655	3	1	18765.87545
1315	18	28.310	1	0	11272.33139
1316	19	20.600	0	0	1731.67700
1317	18	53.130	0	0	1163.46270
1318	35	39.710	4	0	19496.71917
1319	39	26.315	2	0	7201.70085
1320	31	31.065	3	0	5425.02335
1321	62	26.695	0	1	28101.33305
1322	62	38.830	0	0	12981.34570
1323	42	40.370	2	1	43896.37630
1324	31	25.935	1	0	4239.89265
1325	61	33.535	0	0	13143.33665
1326	42	32.870	0	0	7050.02130
1327	51	30.030	1	0	9377.90470
1328	23	24.225	2	0	22395.74424
1329	52	38.600	2	0	10325.20600
1330	57	25.740	2	0	12629.16560
1331	23	33.400	0	0	10795.93733
1332	52	44.700	3	0	11411.68500
1333	50	30.970	3	0	10600.54830
1334	18	31.920	0	0	2205.98080
1335	18	36.850	0	0	1629.83350
1336	21	25.800	0	0	2007.94500
1337	61	29.070	0	1	29141.36030

	Actual Test Dataset Premium Charges	Predicted Test Dataset Premium Charges
559	1646.42970	3898.198488
1087	11353.22760	12846.443859
1020	8798.59300	12853.105439
460	10381.47870	13679.300235
802	2103.08000	1031.772047
298	38746.35510	31954.393745
481	9304.70190	13448.727314
616	11658.11505	11714.565290
763	3070.80870	3236.855885
750	19539.24300	29736.544591
48	12629.89670	11585.216258
547	11538.42100	17481.377439
1143	6338.07560	9265.860910
767	7050.64200	8594.044524
194	1137.46970	3308.567492
424	8968.33000	11033.970339
3	21984.47061	3857.673204
785	6414.17800	7326.006568
443	28287.89766	15331.956885
921	13462.52000	14666.018521
315	9722.76950	12017.315984
725	40932.42950	33057.633223
88	8026.66660	8799.882235
310	8444.47400	9530.193991
471	2203.47185	2040.384142
726	6664.68595	8152.833596
60	8606.21740	9356.798260
1280	8283.68070	10975.469036
705	5375.03800	7832.406983
101	3645.08940	3868.351523
...	...	...
59	5989.52365	9402.198227
13	11090.71780	15013.605405
1329	10325.20600	14568.098817
1124	40904.19950	31302.277246
680	2585.26900	-408.343460
941	9549.56510	15747.532806
639	12949.15540	15159.998123
772	12797.20962	10835.898539
830	13393.75600	14902.968168
322	35491.64000	30230.690258
441	37079.37200	30757.883533
1294	11931.12525	11242.102964
585	4779.60230	5982.438869
256	43921.18370	36921.905144
560	9193.83850	7488.804300
1059	4462.72180	7347.250347
686	7729.64575	8238.284703
72	11741.72600	12237.682686
257	5400.98050	10316.477108
797	4719.52405	3603.701050
2	4449.46200	7019.304209
1220	4718.20355	3324.494982
1045	21880.82000	31813.363057
568	11552.90400	13267.559739
367	8017.06115	7903.237422
708	6113.23105	7082.101172
479	1824.28540	4090.671187
541	3056.38810	4043.759864
117	19107.77960	28547.640157
768	14319.03100	17109.055345

Dep. Variable:	charges	R-squared:	0.750
Model:	OLS	Adj. R-squared:	0.749
Method:	Least Squares	F-statistic:	998.1
Date:	Wed, 17 Jul 2019	Prob (F-statistic):	0.00
Time:	20:40:43	Log-Likelihood:	-13551.
No. Observations:	1338	AIC:	2.711e+04
Df Residuals:	1333	BIC:	2.714e+04
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-1.21e+04	941.984	-12.848	0.000	-1.4e+04	-1.03e+04
age	257.8495	11.896	21.675	0.000	234.512	281.187
bmi	321.8514	27.378	11.756	0.000	268.143	375.559
children	473.5023	137.792	3.436	0.001	203.190	743.814
smoker	2.381e+04	411.220	57.904	0.000	2.3e+04	2.46e+04