Predictive Modelling Assignment
Part A - Linear Regression
Table 01.
Regression results using Credit_Limit as the criterion
Predictor | b | b 95% CI [LL, UL] | Fit
(Intercept) | 10175.45** | [8165.61, 12185.29] |
Customer_Age | 15.76 | [-34.89, 66.40] |
Gender1 | 164.77 | [-763.97, 1093.50] |
Education_Level2 | -602.75 | [-1413.20, 207.71] |
Education_Level3 | -477.35 | [-1179.42, 224.72] |
Marital_Status2 | -1407.91** | [-2343.81, -472.01] |
Marital_Status3 | -359.96 | [-1311.93, 592.01] |
Attrition_Flag1 | -1202.43** | [-1929.44, -475.43] |
Income_Category2 | 831.41* | [50.99, 1611.83] |
Income_Category3 | 4673.50** | [3525.24, 5821.77] |
Income_Category4 | 8998.13** | [7861.04, 10135.22] |
Income_Category5 | 11355.12** | [10041.97, 12668.26] |
Card_Category2 | 12639.14** | [11521.54, 13756.75] |
Months_on_book | 7.04 | [-44.59, 58.67] |
Avg_Utilization_Ratio | -14330.49** | [-15610.30, -13050.69] |
Pay_on_time1 | -5088.31** | [-5876.95, -4299.67] | R2 = .631**, 95% CI [.61, .65]
Note. b represents unstandardized regression weights. LL and UL indicate the lower and upper limits of a confidence interval, respectively.
* indicates p < .05. ** indicates p < .01.
Model Selection:
Using the backward elimination method, it was found that marital status, income category, type of credit card, whether or not the customer has left the bank in the last 12 months, average utilization ratio, and whether or not the monthly balance on the credit card is paid off explain a significant amount of the variance in the credit limit, F(10, 1935) = 330, p < .01, R2 = 0.63, Adj. R2 = 0.628. Table 02 provides the output of the regression analysis.
Table 02.
Regression results using Credit_Limit as the criterion
Predictor | b | b 95% CI [LL, UL] | Fit
(Intercept) | 10857.91** | [9761.27, 11954.56] |
Marital_Status2 | -1402.05** | [-2336.01, -468.09] |
Marital_Status3 | -357.64 | [-1308.14, 592.86] |
Income_Category2 | 754.19* | [49.85, 1458.54] |
Income_Category3 | 4546.58** | [3758.45, 5334.71] |
Income_Category4 | 8852.18** | [8074.93, 9629.43] |
Income_Category5 | 11228.47** | [10209.55, 12247.39] |
Card_Category2 | 12633.38** | [11516.47, 13750.30] |
Avg_Utilization_Ratio | -14298.49** | [-15569.56, -13027.41] |
Pay_on_time1 | -5068.65** | [-5855.55, -4281.76] |
Attrition_Flag1 | -1193.20** | [-1919.08, -467.32] | R2 = .630**, 95% CI [.61, .65]
Note. b represents unstandardized regression weights. LL and UL indicate the lower and upper limits of a confidence interval, respectively.
* indicates p < .05. ** indicates p < .01.
Check for multicollinearity
For the best model, the VIF values are all well below 10 and the tolerance statistics (1/VIF) are all well above 0.2; the average VIF is also very close to 1. Based on these measures we can safely conclude that multicollinearity is not a problem in the model.
Checking the normality of the residuals
A Shapiro-Wilk test was run to check the normality of the residuals. The test showed a significant deviation from normality, W = 0.95011, p < .01. As a further check, a Q-Q plot of the residuals was produced; the resulting plot is shown in Figure 01 and also shows a clear departure from normality. Since the residuals show problems with normality, it is advisable to transform the raw data and re-fit the model (a sketch follows Figure 01).
Figure 01. Q-Q plot of the residuals.
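One way to follow up on the normality problem is to re-fit the final model with a log-transformed outcome and re-check the residuals. The sketch below is illustrative only; it assumes Credit_Limit is strictly positive and re-uses the predictors retained by the backward elimination (see Appendix for the objects referenced).
log.model <- lm(log(Credit_Limit) ~ Marital_Status + Income_Category + Card_Category + Avg_Utilization_Ratio + Pay_on_time + Attrition_Flag, data = my_data)  # log-transformed outcome
shapiro.test(residuals(log.model))  # re-test normality of the residuals
plot(x = log.model, which = 2)  # Q-Q plot for the transformed model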
Cook’s distance
The Cook's distance plot highlights data points that are outliers and have high leverage. Three data points (313, 1137 and 1227) have large Cook's distance values. It is suggested to re-run the regression analysis with these data points excluded and examine what happens to the model performance and to the regression coefficients, as sketched below.
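A minimal sketch of this check follows; it assumes that the labels 313, 1137 and 1227 shown in the Cook's distance plot are row positions in my_data and that best.model is the model fitted in the Appendix.
influential <- c(313, 1137, 1227)  # observations flagged by Cook's distance
reduced.model <- update(best.model, data = my_data[-influential, ])  # re-fit without them
summary(reduced.model)  # compare fit statistics with the original model
round(coef(best.model) - coef(reduced.model), 2)  # change in the regression coefficients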
Conclusion
Based on the above analysis, we can conclude that factors such as marital status, income category, type of credit card, average utilization ratio, and whether or not the monthly balance on the credit card is paid off explain a significant amount of the variance in the credit limit.
Part B - Logistic Regression
Table 03.
Logistic Regression results using Attrition_Flag as the criterion
Predictor | b (SE) | Odds ratio 95% CI [LL, OR, UL]
(Intercept) | -2.26 (0.53)** | [0.03, 0.10, 0.29]
Customer_Age | 0.017 (0.013) | [0.99, 1.01, 1.04]
Gender1 | 0.31 (0.24) | [0.85, 1.36, 2.23]
Education_Level2 | -0.07 (0.21) | [0.61, 0.93, 1.42]
Education_Level3 | 0.12 (0.18) | [0.79, 1.13, 1.62]
Marital_Status2 | -0.21 (0.23) | [0.51, 0.81, 1.29]
Marital_Status3 | -0.19 (0.23) | [0.52, 0.82, 1.32]
Income_Category2 | 0.02 (0.19) | [0.69, 1.02, 1.50]
Income_Category3 | 0.08 (0.31) | [0.59, 1.08, 2.00]
Income_Category4 | 0.39 (0.31) | [0.79, 1.47, 2.75]
Income_Category5 | 0.58 (0.36) | [0.88, 1.79, 3.65]
Card_Category2 | 0.30 (0.34) | [0.67, 1.35, 2.60]
Months_on_book | -0.01 (0.01) | [0.96, 0.98, 1.01]
Credit_Limit | -0.0000345 (0.00001)* | [0.99, 0.99, 0.99]
Avg_Utilization_Ratio | -0.63 (0.40) | [0.23, 0.52, 1.16]
Pay_on_time1 | 1.49 (0.20)** | [3.03, 4.48, 6.65]
Note. b represents the logistic regression coefficient with its standard error in parentheses. R2 = 0.112 (Hosmer and Lemeshow), 0.095 (Cox and Snell), 0.162 (Nagelkerke). Model χ²(15) = 194.56, p < .01.
* indicates p < .05. ** indicates p < .01.
Through the logistic regression, whether or not the monthly balance on the credit card was paid off (Pay_on_time1) was found to be a significant predictor of whether or not the customer has left the bank in the last 12 months (Attrition_Flag). The odds of attrition were 4.48 times higher (95% CI [3.03, 6.65]) when the monthly balance on the credit card was paid off.
Based on the logistic regression model, credit limit and whether or not the monthly balance on the credit card was paid off emerged as significant predictors of attrition. Surprisingly, factors such as customer's age, gender, level of education, marital status, income category, type of credit card, number of months as a credit card customer, and average utilization ratio did not predict attrition.
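As a cross-check, the odds ratios reported in Table 03 are simply the exponentiated logistic regression coefficients; for the Pay_on_time1 estimate (using the rounded coefficient from the table):
exp(1.49)  # approximately 4.44 from the rounded coefficient; Table 03 reports 4.48 from the unrounded estimate
exp(coef(full.model2)["Pay_on_time1"])  # exact odds ratio from the fitted model (see Appendix)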
APPENDIX
Predictive Modelling Assignment
Read the xlsx file and select a random sample of 2000 observations from CreditCard.xlsx to create the dataset
library(readxl)
library(dplyr)
library(apaTables)
set.seed(1)
df <- read_excel("1698850029CreditCard.xlsx") # load the xlsx file from the saved location
my_data <- df %>% sample_n(2000) # Select 2000 random rows to create the dataset
# Convert the categorical variables to factors
my_data$Attrition_Flag <- factor(my_data$Attrition_Flag)
my_data$Gender <- factor(my_data$Gender)
my_data$Education_Level <- factor(my_data$Education_Level, levels = c("1", "2", "3"))
my_data$Marital_Status <- factor(my_data$Marital_Status)
my_data$Income_Category <- factor(my_data$Income_Category, levels =c("1", "2", "3", "4", "5"))
my_data$Card_Category <- factor(my_data$Card_Category)
my_data$Pay_on_time <- factor(my_data$Pay_on_time)
Regression model with all the variables
full.model <- lm(Credit_Limit ~ Customer_Age+Gender+Education_Level+Marital_Status+Attrition_Flag+ Income_Category+Card_Category+Months_on_book+ Avg_Utilization_Ratio+Pay_on_time, data = my_data)
summary(full.model)
apa.reg.table(full.model, filename = "full_model.doc")
Model Selection using Backward elimination procedure
full.model <- lm(Credit_Limit ~ Customer_Age+Gender+Education_Level+Marital_Status+Income_Category+Card_Category+Months_on_book+ Avg_Utilization_Ratio+Pay_on_time+Attrition_Flag, data = my_data)
final.model <- step( object = full.model, direction = "backward")
apa.reg.table(final.model, filename = "final_model.doc")
Best model chosen from Backward elimination
best.model <- lm(Credit_Limit ~ Marital_Status + Income_Category + Card_Category + Avg_Utilization_Ratio + Pay_on_time+ Attrition_Flag, data = my_data)
summary(best.model)
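As an optional check (not part of the original output), the reduced model can be compared with the full model using a partial F-test; a non-significant result supports dropping the eliminated predictors.
anova(best.model, full.model)  # partial F-test: reduced vs. full model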
Check for multicollinearity
library(car)
vif(best.model)
tolerance <- 1/vif(best.model)
tolerance
avg.vif <- mean(vif(best.model))  # average VIF (should be close to 1)
avg.vif
Check for normality of residuals
hist(x = residuals(best.model), xlab = "Value of residual", main = "")  # histogram of the residuals
plot(x = best.model, which = 2)  # Q-Q plot of the residuals (Figure 01)
plot(x = best.model, which = 5)  # residuals vs leverage
shapiro.test(residuals(best.model))  # formal test of normality
Cook’s Distance
plot(x = best.model, which = 4)
Logistic Regression model with all the variables
options(scipen=999, digits = 2)
full.model2 <- glm(Attrition_Flag ~ Customer_Age+Gender+Education_Level+Marital_Status+Income_Category+Card_Category+Months_on_book+Credit_Limit+ Avg_Utilization_Ratio+Pay_on_time, data = my_data, family=binomial)
summary(full.model2)
Testing model significance
modelChi <- full.model2$null.deviance - full.model2$deviance  # improvement in deviance over the null model
chidf <- full.model2$df.null - full.model2$df.residual        # degrees of freedom for the test
chisq.prob <- 1 - pchisq(modelChi, chidf)                     # p-value of the model chi-square
modelChi; chidf; chisq.prob
R2 value
logisticPseudoR2s <- function(LogModel) {
  dev <- LogModel$deviance
  nullDev <- LogModel$null.deviance
  modelN <- length(LogModel$fitted.values)
  R.l <- 1 - dev / nullDev                       # Hosmer and Lemeshow R^2
  R.cs <- 1 - exp(-(nullDev - dev) / modelN)     # Cox and Snell R^2
  R.n <- R.cs / (1 - exp(-(nullDev / modelN)))   # Nagelkerke R^2
  cat("Pseudo R^2 for logistic regression\n")
  cat("Hosmer and Lemeshow R^2 ", round(R.l, 3), "\n")
  cat("Cox and Snell R^2       ", round(R.cs, 3), "\n")
  cat("Nagelkerke R^2          ", round(R.n, 3), "\n")
}
logisticPseudoR2s(full.model2)
Odds ratios and confidence intervals
exp(full.model2$coefficients)
exp(confint(full.model2))
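Equivalently, the odds ratios and their 95% confidence intervals can be collected into a single table (a convenience step, not shown in the original output):
exp(cbind(OR = coef(full.model2), confint(full.model2)))  # odds ratios with 95% CIs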