import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Markdown, display
def printTitle(title):
    display(Markdown(("### **" + title + "**").upper()))
title = "TABLE 1 : importing our dataset, reading it into our dataframe, viewing the dataset"
printTitle(title)
data = pd.read_csv('word/insurance.csv')
data
To use the model for prediction, the steps to follow include:
• Data processing
• Model evaluation
• Predicting insurance premium charge with our model
We need to represent all data numerically. Some of our predictors (sex and smoker) are categorical, and since each has only two categories, we encode them as binary values.
For sex: male → 0, female → 1.
For smoker: yes → 1, no → 0.
The 'region' column has four categories:
data['region'].value_counts()
These will be encoded accordingly:
data['sex'] = data.sex.map({'male':0,'female':1})
data['smoker'] = data.smoker.map({'yes':1,'no':0})
data['region'] = data.region.map({'southeast' : 0, 'northwest': 1, 'southwest': 2, 'northeast':3})
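One caution about this encoding step: pandas' `Series.map` returns NaN for any value missing from the mapping dict, so a misspelled category silently becomes a missing value. A self-contained sketch of a quick sanity check (the tiny DataFrame here is illustrative, not our insurance data):

```python
import pandas as pd

# Illustrative data: Series.map leaves NaN wherever a value is absent
# from the mapping dict, so we verify that every category was encoded.
df = pd.DataFrame({'sex': ['male', 'female', 'male']})
df['sex'] = df.sex.map({'male': 0, 'female': 1})

# a count of 0 unmapped values means the encoding covered every category
print(df['sex'].isna().sum())
```

Running `data['sex'].isna().sum()` (and the same for smoker and region) after the mappings above gives the same guarantee on our dataset.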
After converting all categorical data to numerical data we have our dataset:
title = "table 2 : Our Dataset, After converting all categorical data to numerical data"
printTitle(title)
data
Here we visualize the relationship between the dependent variable and the independent variables using scatterplots.
The relationship between charges and each independent variable can be seen below:
import seaborn as sns
col = list(data)
col.remove('charges')
title = "Graph 1 : Relationship between charges and each independent variables"
printTitle(title)
for i in col:
    sns.pairplot(data, x_vars=[i], y_vars='charges', height=4, aspect=1.5)
These questions will be answered in the model evaluation stage.
At this stage we select the independent variables to be used in our model by evaluating their p-values.
We keep the independent variables that have a strong relationship with the dependent variable, and drop (or simply exclude from our model) the ones that have a weak or no relationship with it.
import statsmodels.formula.api as smf
mod_coef = smf.ols(formula='charges ~ age + sex + bmi + children + smoker + region', data=data).fit()
print("These are our independent variables and their corresponding p-values")
mod_coef.pvalues
From the output above, each feature has a significant p-value except sex and region. Thus we reject the null hypothesis for age, bmi, children and smoker, and fail to reject it for sex and region. This is one way to determine which columns are insignificant and should be dropped from a dataset.
'smoker' has the strongest relationship with charges, followed by age, body mass index and children. Sex and region both have very weak relationships with charges, so they will be dropped from our dataset and not used in our model.
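The selection step above can also be automated. A minimal, self-contained sketch, using hypothetical p-values shaped like the `mod_coef.pvalues` output (the numbers are illustrative placeholders, not the actual fitted results):

```python
import pandas as pd

# Hypothetical p-values shaped like mod_coef.pvalues (values are
# illustrative placeholders, not the actual fitted results).
pvalues = pd.Series({'Intercept': 0.001, 'age': 1e-90, 'sex': 0.69,
                     'bmi': 1e-30, 'children': 0.0006, 'smoker': 0.0,
                     'region': 0.35})

# keep predictors whose p-value falls below the usual 0.05 threshold
preds = pvalues.drop('Intercept')
significant = preds[preds < 0.05].index.tolist()
print(significant)
```

With our actual `mod_coef.pvalues`, this filter selects age, bmi, children and smoker, matching the manual selection above.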
Here's a view of our dataset after dropping the sex and region column:
data_ii = data.drop(['sex','region'], axis=1)
title = "table 3 : Dataset after dropping the insignificant independent variables (sex and region column)"
printTitle(title)
data_ii
We use the Statsmodels Library to estimate our model coefficients.
import statsmodels.formula.api as smf
print("Independent variables and their Coefficients")
mod_coef = smf.ols(formula='charges ~ age + bmi + children + smoker', data=data_ii).fit()
mod_coef.params
The coefficient of each independent variable to be used in our model can be seen above.
Generally, holding the other predictors constant, a one-unit increase in any of age, bmi, children or smoker is associated with an increase in premium charges equal to the corresponding model coefficient.
More clearly: a one-unit increase in a policy holder's body mass index is associated with a 321.851402 unit increase in his/her premium charge. If a model coefficient were negative, a one-unit increase in that variable would be associated with a corresponding decrease in the premium charge.
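This interpretation can be checked with a quick worked example. The sketch below uses hypothetical coefficients shaped like `mod_coef.params`; only the bmi value is taken from the text above, the others are placeholders for illustration:

```python
# Hypothetical coefficients shaped like mod_coef.params.
# Only the bmi value comes from the fitted model discussed above;
# the rest are placeholders for illustration.
coef = {'Intercept': -12000.0, 'age': 250.0, 'bmi': 321.851402,
        'children': 470.0, 'smoker': 23800.0}

def predicted_charge(age, bmi, children, smoker):
    """Linear-model prediction: intercept plus sum of coefficient * value."""
    return (coef['Intercept'] + coef['age'] * age + coef['bmi'] * bmi
            + coef['children'] * children + coef['smoker'] * smoker)

# A one-unit BMI increase shifts the prediction by exactly coef['bmi']
# (up to floating-point error), with everything else held constant.
delta = predicted_charge(40, 31.0, 2, 0) - predicted_charge(40, 30.0, 2, 0)
print(delta)
```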
A higher percentage is used for training the model while a smaller portion of the data is used to test the model. This is implemented below:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = data_ii[['age','bmi','children','smoker']]
y = data_ii.charges
# split the data: 60% for training, 40% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
The model is trained to predict the known outputs/premium charges using the training dataset and later tested using the test dataset. The test data is used to test the accuracy of the model.
# instantiate the model
model = LinearRegression()
#fit Model
model.fit(X_train, y_train)
# make prediction based on the test data
predictions = model.predict(X_test)
After training the model with our training dataset, we predict the already known dependent variable/premium charges in our test dataset, then compare these predictions (predicted test charges) against the actual test charges to evaluate the accuracy of the model.
Below is a table showing Actual Test Dataset Premium Charges against their Predicted Test Dataset Premium Charges:
data_iii = {'Actual Test Dataset Premium Charges': y_test, 'Predicted Test Dataset Premium Charges': predictions}
frameData_iii = pd.DataFrame(data_iii)
title = "table 4 : Test Dataset Premium Charges: Actual Charges and their corresponding Predicted Charges"
printTitle(title)
frameData_iii
# R-squared of the predictions on the test data (score returns R-squared for regression models)
model_accuracy = model.score(X_test, y_test)
model_accuracy
# plot the distribution of the prediction errors (distplot is deprecated in recent seaborn)
sns.histplot(y_test - predictions, kde=True).set(xlabel="Test vs Prediction")
title = 'Graph 2 : Relationship Between Actual Premium Charges In Test Dataset VS Their Corresponding Predicted Value (Using Our Model For Prediction)'
printTitle(title)
To evaluate the overall fit of a linear model, we analyse the R-squared. The proportion of variance in the observed data that is explained by the model is the R-squared. The R-squared is between 0 and 1 and the higher the R-square the better because it means that more variance is explained by the model.
# R-squared of the statsmodels fit (computed on the full dataset)
model_rsquare = mod_coef.rsquared
model_rsquare
The R-squared is most useful for comparing different models.
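For instance, comparing the training R-squared of a full model against a reduced one shows how little an irrelevant predictor contributes. A self-contained sketch on synthetic data (not the insurance dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example (not the insurance data): y depends on x1 only,
# while x2 is pure noise with no relationship to y.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 3.0 * x1 + rng.normal(scale=0.1, size=200)

X_full = np.column_stack([x1, x2])      # includes the irrelevant feature
X_reduced = x1.reshape(-1, 1)           # drops it

r2_full = LinearRegression().fit(X_full, y).score(X_full, y)
r2_reduced = LinearRegression().fit(X_reduced, y).score(X_reduced, y)

# The two values are nearly identical: the noise column adds almost
# no explained variance, which justifies dropping it.
print(round(r2_full, 3), round(r2_reduced, 3))
```

The same comparison between our full model (with sex and region) and the reduced one motivates keeping only the significant predictors.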
title = "table 5 : SUMMARY OF THE FITTED MODEL"
printTitle(title)
mod_coef.summary()
A new policy holder insured his life at Leadway Insurance Company. Among other details provided, the following were relevant to this study:
- Age : 55
- Sex : male
- BMI : 47.843
- Children : 4
- Smoker : no
- Region : Southeast
Determine/predict the premium that should be charged.
We predict the insurance premium charge for the new policy holder following these steps.
Recall that region and sex do not have a strong relationship with the insurance premium charge. Hence, for better prediction, they are excluded from our model.
#import the linear regression model
from sklearn.linear_model import LinearRegression
# x represent the independent variables while y, the dependent variable
X = data_ii[['age','bmi','children','smoker']]
y = data_ii.charges
#instantiate the model
my_model = LinearRegression()
# fit our dataset into our model
my_model.fit(X,y)
# predict for the new policy holder: age 55, bmi 47.843, 4 children, non-smoker
# (passing a DataFrame with the training column names avoids a feature-name warning)
my_model.predict(pd.DataFrame([[55, 47.843, 4, 0]], columns=X.columns))