Linear Regression

Author

Dr. Mohammad Nasir Abdullah

Linear Regression

Regression analysis is a statistical methodology that utilizes the relation between two or more quantitative variables so that a response or outcome variable can be predicted from the other, or others. This methodology is widely used in business, the social and behavioural sciences, the biological sciences, and many other disciplines.

A few examples of applications are:

1) Drug Dosage Determination - Regression analysis helps in determining the appropriate drug dosage for patients based on factors like age, weight, and other physiological parameters. By analysing the relationship between these variables and drug response, doctors can optimize dosage for better treatment outcomes.

2) Disease Progression Prediction - Regression models are used to predict the progression of diseases like diabetes, cancer, or cardiovascular conditions. By analysing historical patient data (for example: biomarkers, genetic factors, lifestyle habits), regression helps estimate the likelihood and pace of disease advancement in individuals.

3) Price Optimization - Regression analysis assists in determining the optimal pricing strategy by examining the relationship between price changes and customer demand. It helps businesses understand how price adjustments affect sales volume and revenue, enabling them to set competitive yet profitable prices.

4) Employee Performance Prediction - The performance of an employee on a job can be predicted from the relationship between performance and scores on a battery of aptitude tests.

Simple Linear Regression

Simple linear regression estimates the linear equation that describes the relationship between two variables, so that one variable can be predicted or estimated from the other. It is used to determine the relationship between one numerical dependent variable and one numerical or categorical independent variable. Like correlation, it measures the strength of association between the variables, but it provides more information: the estimated equation quantifies how much the dependent variable changes per unit change in the independent variable.

Simple linear regression is usually used as a preliminary step before multiple linear regression. As in correlation analysis, linear regression also provides a measure of the relationship, the coefficient of determination (R-squared), which represents the proportion of the variation in the dependent variable explained by the independent variable.
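
The link between correlation and the coefficient of determination can be checked numerically. The sketch below is an illustration on R's built-in mtcars data (the choice of mpg and wt is only an example and is not part of the analysis in this document). In simple linear regression, the R-squared equals the squared Pearson correlation.

```r
# Fit a simple linear regression on a built-in dataset (illustrative only)
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient of determination reported by the model
r_squared <- summary(fit)$r.squared

# Pearson correlation between the same two variables
r <- cor(mtcars$mpg, mtcars$wt)

# With one predictor, R-squared equals the squared correlation
all.equal(r_squared, r^2)   # TRUE
```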

Some examples of research questions for simple linear regression are:

1) Is the amount of calories consumed associated with body weight?
2) Would the amount of exercise in hours predict the change in blood pressure?
3) How much does 15 hours of physical activity change body weight in kilograms?
4) How much will 1 g of salt change blood pressure in mmHg in the Perak population?

Remember!

Linear regression measures the linear association between continuous variables, and it is useful when the dependent variable and the independent variable are clearly defined.

Most importantly, linear regression is applied when predicting the target variable is among the objectives of the analysis.

Assumptions in Simple Linear Regression

The assumptions in simple linear regression with two variables (1 independent variable and 1 dependent variable) are:

1) The values of the independent variable are fixed or non-random (it is treated as a mathematical variable).
2) The independent variable is measured without error (this should be ensured at the research design and data collection stage).
3) For each value of the independent variable, the sub-population of the dependent variable must be normally distributed (that is, the dependent variable is normally distributed at any point of the independent variable).
4) The variances of the sub-populations of the dependent variable are all equal (equal variance of the dependent variable at any point of the independent variable).
5) The means of the sub-populations of the dependent variable must all lie on the same straight line (this is the assumption of linearity).
6) The values of the dependent variable Y are statistically independent.

In summary, the assumptions above can be remembered as L.I.N.E.: L - Linearity, I - Independence, N - Normality, and E - Equal variances.
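
The assumptions can also be written compactly as a model statement. The notation below is a conventional sketch; the symbols for the intercept, the error term, and the error variance are assumptions of this illustration rather than notation taken from the text, and the slope is written as β.

```latex
Y_i = \beta_0 + \beta X_i + \varepsilon_i, \qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \qquad i = 1, \dots, n
```

Here the straight-line mean \beta_0 + \beta X_i expresses linearity (L), the independently distributed errors express independence (I), the normal distribution of the errors expresses normality (N), and the common variance \sigma^2 expresses equal variances (E).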

Types of data required

1 dependent variable - Continuous
1 independent variable - Continuous or categorical

Example 1

In this example, we will use the Oxygen.sav dataset from this link: https://sta334.s3.ap-southeast-1.amazonaws.com/data/Oxygen.sav. This dataset is about cardiovascular risk factors in 1000 males engaged in sedentary occupations. We want to study the relationship between oxygen consumption and the risk factors in this population. The possible risk factors under consideration are systolic blood pressure, total cholesterol, HDL cholesterol, and triglycerides.

Research Questions:

1) Is there a significant relationship between oxygen consumption and systolic blood pressure levels?
2) Does a meaningful relationship exist between oxygen consumption and total cholesterol levels?
3) Is there a notable association between oxygen consumption and HDL cholesterol levels?
4) Do triglyceride levels exhibit a significant connection with oxygen consumption?

Research Hypotheses:

  1. Systolic Blood Pressure:
    • Null Hypothesis (H0): There is no significant linear relationship between oxygen consumption and systolic blood pressure (β = 0).
    • Alternative Hypothesis (H1): There is a significant linear relationship between oxygen consumption and systolic blood pressure (β ≠ 0).
  2. Total Cholesterol:
    • Null Hypothesis (H0): There is no significant linear relationship between oxygen consumption and total cholesterol levels (β = 0).
    • Alternative Hypothesis (H1): There is a significant linear relationship between oxygen consumption and total cholesterol levels (β ≠ 0).
  3. HDL Cholesterol:
    • Null Hypothesis (H0): There is no significant linear relationship between oxygen consumption and HDL cholesterol levels (β = 0).
    • Alternative Hypothesis (H1): There is a significant linear relationship between oxygen consumption and HDL cholesterol levels (β ≠ 0).
  4. Triglycerides:
    • Null Hypothesis (H0): There is no significant linear relationship between oxygen consumption and triglyceride levels (β = 0).
    • Alternative Hypothesis (H1): There is a significant linear relationship between oxygen consumption and triglyceride levels (β ≠ 0).
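
Each pair of hypotheses above is tested with the same statistic. As a sketch in conventional notation (b denotes the estimated slope and SE(b) its standard error; both symbols are assumptions of this illustration):

```latex
t = \frac{b - 0}{\operatorname{SE}(b)} \sim t_{n-2} \quad \text{under } H_0 : \beta = 0
```

With n = 1000 observations, the reference distribution has n - 2 = 998 degrees of freedom, and H0 is rejected at the 5% level when the two-sided p-value is below 0.05.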

Steps to perform simple linear regression

Step 1: Data Exploration

library(foreign)
data1 <- read.spss("https://sta334.s3.ap-southeast-1.amazonaws.com/data/Oxygen.sav", to.data.frame = TRUE)

#Viewing 1st 6 observations
head(data1)
  subj Oxygen Sbp TChol HDL Triglycerides
1  937     32 108   177  47           149
2  838     31  91   146  47           146
3  784     44  99   116  57           153
4  900     32 100   147  51            95
5  611     28 102   176  51            79
6  222     44 103   145  54           195
#getting the mean and standard deviation for each of the variables
#Finding Means for each of the variables
sapply(data1[-1], mean)
       Oxygen           Sbp         TChol           HDL Triglycerides 
       32.724       128.570       209.138        43.990       163.165 
#Finding standard deviation for each of the variables
sapply(data1[-1], sd)
       Oxygen           Sbp         TChol           HDL Triglycerides 
     6.579142     10.745071     37.754713      9.453320     37.179619 

Step 2: Creating a Scatter Diagram for each pair

library(ggplot2)
library(gridExtra)

#Oxygen and SBP
a <- ggplot(data1, aes(x=Sbp, y=Oxygen)) + 
  geom_point(size = 3, bg = "grey88", col="grey12", pch = 21) + 
  geom_smooth(method = "lm", color="grey14",
              lwd = 0.5, lty = 2, se = F) + 
  labs(title = "A Scatter plot showing relationship between \n Oxygen consumption and Systolic Blood Pressure (n=1000)", 
       x = "Systolic Blood Pressure", 
       y = "Oxygen Consumptions") + 
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5, size = 5))

#Oxygen and Total Cholesterol
b <- ggplot(data1, aes(x=TChol, y=Oxygen)) + 
  geom_point(size = 3, bg = "grey88", col="grey12", pch = 21) + 
  geom_smooth(method = "lm", color="grey14",
              lwd = 0.5, lty = 2, se = F) + 
  labs(title = "A Scatter plot showing relationship between \n Oxygen consumption and Total Cholesterol (n=1000)", 
       x = "Total Cholesterol", 
       y = "Oxygen Consumptions") + 
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5, size = 5))

#Oxygen and HDL Cholesterol
c <- ggplot(data1, aes(x=HDL, y=Oxygen)) + 
  geom_point(size = 3, bg = "grey88", col="grey12", pch = 21) + 
  geom_smooth(method = "lm", color="grey14",
              lwd = 0.5, lty = 2, se = F) + 
  labs(title = "A Scatter plot showing relationship between \n Oxygen consumption and HDL Cholesterol (n=1000)", 
       x = "HDL Cholesterol", 
       y = "Oxygen Consumptions") + 
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5, size = 5))

#Oxygen and Triglycerides
d <- ggplot(data1, aes(x=Triglycerides, y=Oxygen)) + 
  geom_point(size = 3, bg = "grey88", col="grey12", pch = 21) + 
  geom_smooth(method = "lm", color="grey14",
              lwd = 0.5, lty = 2, se = F) + 
  labs(title = "A Scatter plot showing relationship between \n Oxygen consumption and Triglycerides (n=1000)", 
       x = "Triglycerides", 
       y = "Oxygen Consumptions") + 
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5, size = 5))

grid.arrange(arrangeGrob(a,b, ncol = 2), 
             arrangeGrob(c, d, ncol = 2), nrow = 2)

From the scatter diagrams, we can see that oxygen consumption is negatively correlated with systolic blood pressure and with total cholesterol, and positively correlated with HDL cholesterol and with triglycerides.

Step 3: Perform simple linear regression for each pair

1) Oxygen Consumption and Systolic Blood Pressure

#Simple linear regression for Oxygen consumption and Systolic Blood Pressure.

OS <- lm(data1$Oxygen~data1$Sbp)
summary(OS)

Call:
lm(formula = data1$Oxygen ~ data1$Sbp)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.1063  -4.4535   0.1093   4.4446  20.6662 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.34954    2.45707  19.271  < 2e-16 ***
data1$Sbp   -0.11376    0.01904  -5.973 3.23e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.468 on 998 degrees of freedom
Multiple R-squared:  0.03452,   Adjusted R-squared:  0.03355 
F-statistic: 35.68 on 1 and 998 DF,  p-value: 3.233e-09

To get the 95% confidence intervals for the model coefficients:

#confidence interval
confint(OS)
                 2.5 %      97.5 %
(Intercept) 42.5279326 52.17114659
data1$Sbp   -0.1511271 -0.07638381

Based on the result, we can conclude that there is a statistically significant linear relationship between oxygen consumption and systolic blood pressure, since the p-value is less than 0.05 [t-statistic (df): -5.973 (998); p-value <0.001]. The R-squared (coefficient of determination) is 0.03452, or 3.45%. This indicates that 3.45% of the total variation in oxygen consumption is explained by systolic blood pressure; the remaining 96.55% is due to factors not included in the model. The beta coefficient for systolic blood pressure (-0.1138; 95% CI: -0.1511, -0.0764) is negative, so oxygen consumption decreases as systolic blood pressure increases, and the low R-squared shows that this relationship is weak.
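
As a hedged illustration of how the reported coefficients are used, the sketch below plugs the rounded estimates from the summary output above into the fitted line and recovers the two-sided p-value from the reported t statistic. The value of 120 mmHg is an arbitrary example input, not a value taken from the data, and the object names are illustrative.

```r
# Rounded estimates copied from the summary(OS) output above
b0 <- 47.34954   # intercept
b1 <- -0.11376   # slope for Sbp

# Hypothetical systolic blood pressure (mmHg), chosen only for illustration
sbp_example <- 120

# Predicted oxygen consumption from the fitted line
oxygen_hat <- b0 + b1 * sbp_example
oxygen_hat   # 33.69834

# Two-sided p-value for the slope from the reported t statistic and df
t_value  <- -5.973
df_resid <- 998
p_value  <- 2 * pt(-abs(t_value), df = df_resid)
p_value      # very small, in line with the p-value reported in the output
```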

Step 4: Residual diagnostic to check the assumptions

We need to make sure the LINE assumptions are met before interpreting the model.

  1. Checking the linearity assumption (L) and the homoscedasticity (equal variance) assumption (E).

The linearity and equality of variances can be checked with a scatter diagram of the residuals against the predicted values. The predicted values should be on the X-axis, and the residual values should be on the Y-axis. (XP-YR for short!)

To interpret the plot, the points should form a roughly elliptical cloud, or be spread evenly across the plot area, with no curved or funnel-shaped pattern.

[Illustration: reference residual plots for 1) overall linearity and 2) equality of variance]

For our current model, we first extract the residuals and predicted values from the model, store them in the resid1 object, and then construct the plot.

# plot(OS$fitted.values, OS$residuals)
# abline(h=0)

resid1 <- data.frame(cbind(residual = OS$residuals, predicted =  OS$fitted.values))

library(ggplot2)
ggplot(resid1, aes(x=predicted, y=residual)) + 
  geom_point(pch=21, size = 3.5, bg="grey88", col="grey56") + 
  stat_ellipse(lty=2, color = "red") +
  theme_minimal() + 
  ggtitle("Predicted vs Residual for linearity and equality of variances")

Based on the scatter diagram of predicted and residual values, there is no peculiar pattern (such as a curve or a funnel shape) that contradicts linearity or equality of variances. Thus, we can conclude that both the linearity assumption and the equality of variances assumption are met.
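
To make the idea of a "peculiar pattern" more concrete, the sketch below simulates two datasets: one with constant error variance and one where the spread of the errors grows with the predictor (a funnel shape). All data, object names, and the helper function spread_ratio are hypothetical and serve only as an illustration; they are not part of the oxygen analysis.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible
n <- 500
x <- runif(n, min = 1, max = 10)

# Errors with constant spread versus spread that grows with x
y_equal  <- 2 + 3 * x + rnorm(n, sd = 2)
y_funnel <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit_equal  <- lm(y_equal ~ x)
fit_funnel <- lm(y_funnel ~ x)

# Ratio of the residual SD in the upper half of the fitted values
# to the residual SD in the lower half
spread_ratio <- function(fit) {
  upper <- fitted(fit) > median(fitted(fit))
  sd(resid(fit)[upper]) / sd(resid(fit)[!upper])
}

spread_ratio(fit_equal)   # expected to be near 1
spread_ratio(fit_funnel)  # expected to be clearly above 1

# Residuals (Y-axis) against predicted values (X-axis)
plot(fitted(fit_funnel), resid(fit_funnel),
     xlab = "Predicted", ylab = "Residual")
abline(h = 0, lty = 2)
```

A ratio close to 1 corresponds to the evenly spread cloud described above, while a ratio well above 1 corresponds to a funnel that would cast doubt on the equal variance assumption.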

  2. Normality of the residual values

To check the normality assumption, we will construct a histogram of the residuals and use the Lilliefors test.

#Histogram of the residual values
ggplot(resid1, aes(x=residual)) +
  geom_histogram(col="grey44", bg="grey88") + 
  ggtitle("Distribution of residual values for \n Oxygen consumption and Systolic Blood Pressure") + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Lilliefor's test of normality
library(nortest)
lillie.test(resid1$residual)

    Lilliefors (Kolmogorov-Smirnov) normality test

data:  resid1$residual
D = 0.018977, p-value = 0.516

Based on the histogram, the residuals appear to be normally distributed. The Lilliefors test is consistent with this graphical result, since the p-value is more than 0.05 [D-statistic: 0.0190; p-value: 0.516], so there is no evidence against normality. Hence, the normality assumption is met.
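
A normal Q-Q plot and the Shapiro-Wilk test are common base R complements to the histogram and the Lilliefors test. They are not used in the analysis above; the sketch below applies them to a model fitted on the built-in mtcars data purely for illustration.

```r
# Illustrative model on a built-in dataset (not the oxygen data)
fit <- lm(mpg ~ wt, data = mtcars)
res <- resid(fit)

# Points lying close to the reference line suggest approximately normal residuals
qqnorm(res)
qqline(res, lty = 2)

# Shapiro-Wilk test: a p-value above 0.05 gives no evidence against normality
sw <- shapiro.test(res)
sw$p.value
```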

Repeat the same procedure for all the other variables.

2) Oxygen consumption and Total Cholesterol

#fit the simple linear regression model
OT <- lm(data1$Oxygen~data1$TChol)
summary(OT)

Call:
lm(formula = data1$Oxygen ~ data1$TChol)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.6550  -3.6920  -0.0228   3.8707  20.7395 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 50.542809   1.022586   49.43   <2e-16 ***
data1$TChol -0.085201   0.004812  -17.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.742 on 998 degrees of freedom
Multiple R-squared:  0.2391,    Adjusted R-squared:  0.2383 
F-statistic: 313.5 on 1 and 998 DF,  p-value: < 2.2e-16
confint(OT)
                  2.5 %      97.5 %
(Intercept) 48.53614403 52.54947409
data1$TChol -0.09464366 -0.07575875

Checking the assumption

  1. Linearity and Equality of Variance
resid2 <- data.frame(cbind(residual = OT$residuals, predicted =  OT$fitted.values))

library(ggplot2)
ggplot(resid2, aes(x=predicted, y=residual)) + 
  geom_point(pch=21, size = 3.5, bg="grey88", col="grey56") + 
  stat_ellipse(lty=2, color = "red") +
  theme_minimal() + 
  ggtitle("Predicted vs Residual for linearity and equality of variances")

  2. Normality of the residuals
#Histogram of the residual values
ggplot(resid2, aes(x=residual)) +
  geom_histogram(col="grey44", bg="grey88") + 
  ggtitle("Distribution of residual values for \n Oxygen consumption and Total Cholesterol") + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3) Oxygen consumption and HDL Cholesterol

#fit the simple linear regression model 
OH <- lm(data1$Oxygen~data1$HDL) 
summary(OH)

Call:
lm(formula = data1$Oxygen ~ data1$HDL)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.6115  -3.6115   0.0751   3.6443  18.4120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.35696    0.83768   19.53   <2e-16 ***
data1$HDL    0.37206    0.01862   19.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.563 on 998 degrees of freedom
Multiple R-squared:  0.2858,    Adjusted R-squared:  0.2851 
F-statistic: 399.4 on 1 and 998 DF,  p-value: < 2.2e-16
confint(OH)
                 2.5 %     97.5 %
(Intercept) 14.7131456 18.0007706
data1$HDL    0.3355282  0.4085974

Checking the assumption

  1. Linearity and Equality of Variance
resid3 <- data.frame(cbind(residual = OH$residuals, predicted =  OH$fitted.values))  
library(ggplot2) 
ggplot(resid3, aes(x=predicted, y=residual)) +    
  geom_point(pch=21, size = 3.5, bg="grey88", col="grey56") +    
  stat_ellipse(lty=2, color = "red") +   
  theme_minimal() +    
  ggtitle("Predicted vs Residual for linearity and equality of variances")

  2. Normality of the residuals
#Histogram of the residual values 
ggplot(resid3, aes(x=residual)) +   
  geom_histogram(col="grey44", bg="grey88") +   
  ggtitle("Distribution of residual values for \n Oxygen consumption and HDL Cholesterol") +    
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4) Oxygen consumption and Triglycerides

#fit the simple linear regression model 
OT2 <- lm(data1$Oxygen~data1$Triglycerides) 
summary(OT2)

Call:
lm(formula = data1$Oxygen ~ data1$Triglycerides)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.5192  -4.1522  -0.0543   4.3903  21.4771 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         27.304999   0.920704  29.657  < 2e-16 ***
data1$Triglycerides  0.033212   0.005502   6.036 2.22e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.465 on 998 degrees of freedom
Multiple R-squared:  0.03523,   Adjusted R-squared:  0.03426 
F-statistic: 36.44 on 1 and 998 DF,  p-value: 2.219e-09
confint(OT2)
                          2.5 %      97.5 %
(Intercept)         25.49826094 29.11173757
data1$Triglycerides  0.02241518  0.04400839

Checking the assumption

  1. Linearity and Equality of Variance
resid4 <- data.frame(cbind(residual = OT2$residuals, predicted =OT2$fitted.values))
library(ggplot2) 
ggplot(resid4, aes(x=predicted, y=residual)) +    
  geom_point(pch=21, size = 3.5, bg="grey88", col="grey56") +    
  stat_ellipse(lty=2, color = "red") +   
  theme_minimal() +    
  ggtitle("Predicted vs Residual for linearity and equality of variances")

  2. Normality of the residuals
#Histogram of the residual values 
ggplot(resid4, aes(x=residual)) +   
  geom_histogram(col="grey44", bg="grey88") +    
  ggtitle("Distribution of residual values for \n Oxygen consumption and Triglycerides") +    
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Presentation of the results

To present the results of simple linear regression analysis, we need to construct the table as below:

Alternatively, we can generate the report more easily using gtsummary package.

#automated table regression results
library(gtsummary)
data1 %>%
  select(Oxygen, Sbp, TChol, HDL, Triglycerides) %>%
  tbl_uvregression(
    method = glm,
    y=Oxygen,
    hide_n= TRUE, 
    pvalue_fun = ~style_pvalue(.x, digits = 3), 
    # add_estimate_to_reference_rows = T,
    # statistic = all_continuous() ~ "{mean} ({sd})"
  ) %>% 
  bold_p() %>%
  bold_labels() %>%
  add_global_p()
Characteristic   Beta    95% CI¹         p-value
Sbp              -0.11   -0.15, -0.08    <0.001
TChol            -0.09   -0.09, -0.08    <0.001
HDL              0.37    0.34, 0.41      <0.001
Triglycerides    0.03    0.02, 0.04      <0.001
¹ CI = Confidence Interval

Based on Table 1, simple linear regressions were performed to analyse the relationship between oxygen consumption and each risk factor: systolic blood pressure, total cholesterol, HDL cholesterol, and triglycerides. All four variables have a statistically significant relationship with oxygen consumption, since every p-value is less than 0.05. The univariable relationship between oxygen consumption and systolic blood pressure was statistically significant [t-statistic (df): -5.973 (998); p-value <0.001]. The coefficient of -0.1138 indicates a negative association: for every 1 mmHg increase in systolic blood pressure, oxygen consumption is estimated to decrease by 0.1138 units.

For total cholesterol, there is a significant relationship with oxygen consumption, since the p-value is less than 0.05 [t-statistic (df): -17.71 (998); p-value < 0.001]. The coefficient of -0.0852 also indicates a negative association: for every 1 mg/dL increase in total cholesterol, oxygen consumption is estimated to decrease by 0.0852 units. Next, for HDL cholesterol, the coefficient of 0.3721 indicates a positive association: for every 1 mg/dL increase in HDL cholesterol, oxygen consumption is estimated to increase by 0.3721 units, and the p-value < 0.001 indicates a statistically significant relationship. Lastly, the coefficient for triglycerides (0.0332) indicates a positive association: for every 1 mg/dL increase in triglycerides, oxygen consumption is estimated to increase by 0.0332 units, and the p-value < 0.001 again indicates a statistically significant relationship. Because each model contains a single predictor, none of these estimates is adjusted for the other risk factors.

The R-squared values for the individual risk factors range from 3.45% (systolic blood pressure) to 28.58% (HDL cholesterol), so each factor on its own explains only a small to modest share of the variability in oxygen consumption.
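
When several univariable models share the same dependent variable, their R-squared values can be collected in one step instead of reading each summary separately. The sketch below is a minimal illustration on the built-in mtcars data, with mpg as a stand-in response and three arbitrarily chosen predictors. For the oxygen data, the same pattern would presumably use "Oxygen" as the response and the four risk factor column names as predictors, assuming data1 is loaded as in Step 1.

```r
# Example predictors in mtcars (stand-ins for the risk factors)
predictors <- c("wt", "hp", "disp")

# Fit one simple linear regression per predictor and keep its R-squared
r_squared <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "mpg"), data = mtcars)
  summary(fit)$r.squared
})

# Percentage of variation in the response explained by each predictor
round(100 * r_squared, 2)
```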

Example 2

In this example, we will use the regression.dta dataset.

library(haven)
data1 <- read_stata("regression.dta")
head(data1)
# A tibble: 6 × 4
   subj waist   fat   rbs
  <dbl> <dbl> <dbl> <dbl>
1     1  74.8  25.7  1.82
2    12  75    43.9  3.64
3    54  75.5  41.9  3.44
4    20  75.9  29.3  2.18
5    17  76.0  43.8  3.63
6    42  76    50.5  4.3 

In this dataset, we would like to find the relationship between abdominal fat thickness (fat) and each of waist circumference (waist) and random blood sugar level (rbs).

Research Questions:

1) Is there a significant relationship between Abdominal Fat Thickness and Waist Circumference?
2) Is there a significant linear relationship between Abdominal Fat Thickness and Random Blood Sugar?

Hypotheses:

1) There is a significant linear relationship between Abdominal Fat Thickness and Waist Circumference.
2) There is a significant linear relationship between Abdominal Fat Thickness and Random Blood Sugar.

In this example, the dependent variable is Abdominal Fat Thickness, and the predictor variables are Waist Circumference and Random Blood Sugar.

Step 1: Data exploration

# gtsummary
library(gtsummary)
meansd <- data1[-1] %>%
  tbl_summary(
    missing = "no", 
    statistic = list(all_continuous() ~ "{mean} ({sd})")
  ) %>%
  add_n() %>%
  modify_header(label = "**Variables**", 
                all_stat_cols() ~ "**Mean (SD)**")
  
medianiqr <- data1[-1] %>%
  tbl_summary(
    missing = "no", 
    statistic = list(all_continuous() ~ "{median} ({IQR})")
    ) %>%
  modify_header(label = "**Variables**", 
                all_stat_cols() ~ "**Median (IQR)**")

tbl_merge(
  tbls        = list(meansd, medianiqr),
  tab_spanner = c("Measure 1", "Measure 2")
)
                                     Measure 1          Measure 2
Variables                       N    Mean (SD)¹    Median (IQR)²
waist circumference (cm)        86   91 (11)       90 (21)
abdominal fat thickness (cm)    86   102 (52)      98 (75)
random blood sugar (mmol/l)     86   10.8 (5.2)    10.9 (8.4)
¹ Mean (SD)
² Median (IQR)

Now, we would like to assess the relationship between the pairs using scatter diagrams.

library(ggplot2)
library(gridExtra)

a <- ggplot(data1, aes(y=fat, x=waist)) +
  geom_point(pch=21, bg="grey88", col="grey30", size = 3) + 
  geom_smooth(method = "lm", se=F, lty = 2, col="lightblue4") +
  labs(
    title = "A Scatter plot for Fat and Waist", 
    x = "Waist Circumference", 
    y = "Abdominal Fat Thickness"
  ) + 
  theme_minimal()+
  theme(plot.title= element_text(hjust=0.5))

b <- ggplot(data1, aes(y=fat, x=rbs)) +
  geom_point(pch=21, bg="grey88", col="grey30", size = 3) + 
  geom_smooth(method = "lm", se=F, lty = 2, col="lightblue4") + 
  labs(
    title = "A Scatter plot for Fat and RBS", 
    x = "Random Blood Sugar", 
    y = "Abdominal Fat Thickness"
  ) + 
  theme_minimal()+
  theme(plot.title= element_text(hjust=0.5))

grid.arrange(arrangeGrob(a,b), nrow = 1)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

Based on the scatter diagrams, both pairs (waist circumference with abdominal fat thickness, and random blood sugar with abdominal fat thickness) show positive correlations.

Step 2: Performing Univariable Analysis (Simple Linear Regression)

#Abdominal Fat Thickness and Waist Circumference
FW <- lm(data1$fat~data1$waist)
summary(FW)

Call:
lm(formula = data1$fat ~ data1$waist)

Residuals:
   Min     1Q Median     3Q    Max 
-49.03 -18.81  -0.51  12.17  79.97 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -266.4461    24.8325  -10.73   <2e-16 ***
data1$waist    4.0328     0.2697   14.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.26 on 84 degrees of freedom
Multiple R-squared:  0.7269,    Adjusted R-squared:  0.7236 
F-statistic: 223.5 on 1 and 84 DF,  p-value: < 2.2e-16
#Abdominal Fat Thickness and Random Blood Sugar
FR <- lm(data1$fat~data1$rbs)
summary(FR)

Call:
lm(formula = data1$fat ~ data1$rbs)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.957 -11.576   2.895  10.194  38.186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.5463     3.2306  -0.479    0.633    
data1$rbs     9.6361     0.2703  35.643   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.99 on 84 degrees of freedom
Multiple R-squared:  0.938, Adjusted R-squared:  0.9372 
F-statistic:  1270 on 1 and 84 DF,  p-value: < 2.2e-16
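
The models above are specified as data1$fat ~ data1$waist. An alternative is to name the columns in the formula and pass the data frame through the data argument, which lets predict() accept new values. The sketch below shows this on the built-in mtcars data as a stand-in; the new weight of 3 and the object names are hypothetical inputs chosen for illustration.

```r
# Formula with column names plus a data argument (illustrative dataset)
fit <- lm(mpg ~ wt, data = mtcars)

# Hypothetical new observation; the column name must match the predictor
new_car <- data.frame(wt = 3)

# Point prediction with a 95% confidence interval for the mean response
pred <- predict(fit, newdata = new_car, interval = "confidence")
pred

# The same point prediction computed by hand from the coefficients
by_hand <- unname(coef(fit)[1] + coef(fit)[2] * 3)
all.equal(unname(pred[1, "fit"]), by_hand)   # TRUE
```

For Example 2, the analogous call would presumably be lm(fat ~ waist, data = data1), which yields the same estimates as the model shown above.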

Step 3: Checking the assumptions

library(performance)

#Checking overall assumption for Fat and Waist
check_model(FW)

Based on the charts produced for the model of Abdominal Fat Thickness and Waist Circumference, we can conclude that the linearity and equality of variances assumptions were met (based on the first panel). The normality of the residuals was also met (last panel: 3rd row, 1st column).

#checking overall assumption for Fat and RBS
library(performance)
check_model(FR)

Based on the charts produced for the model of Abdominal Fat Thickness and Random Blood Sugar, we can conclude that the linearity and equality of variances assumptions were met (based on the first panel). The normality of the residuals was not met (last panel: 3rd row, 1st column). However, since the number of observations (n = 86) is more than 30, the central limit theorem suggests that the sampling distribution of the coefficient estimates is approximately normal, so inference from the model remains reasonable.

Step 4: Presentation of the results

#automated table regression results
library(gtsummary)
data1 %>%
  select(-1) %>%
  tbl_uvregression(
    method = glm,
    y=fat,
    hide_n= TRUE, 
    pvalue_fun = ~style_pvalue(.x, digits = 3), 
    # add_estimate_to_reference_rows = T,
    # statistic = all_continuous() ~ "{mean} ({sd})"
  ) %>% 
  modify_column_unhide(columns = c(statistic, std.error)) %>%
  bold_p() %>%
  bold_labels() %>%
  add_global_p() %>%
  add_n(location = "level") %>%
  modify_caption("Table 2. Association of Abdominal Fat Thickness and Risk Factors (n=86)") %>%
  modify_header(label = "**Risk Factors**") %>%
  modify_footnote(
    ci = "CI = 95% Confidence Interval for Simple Linear Regression", abbreviation = TRUE) 
Table 2. Association of Abdominal Fat Thickness and Risk Factors (n=86)
Risk Factors                   N    Beta   SE¹     Statistic   95% CI¹     p-value
waist circumference (cm)       86   4.0    0.270   15.0        3.5, 4.6    <0.001
random blood sugar (mmol/l)    86   9.6    0.270   35.6        9.1, 10     <0.001
¹ SE = Standard Error, CI = 95% Confidence Interval for Simple Linear Regression

Interpretation:

From Table 2, waist circumference was a fairly strong predictor, explaining 72.7% of the variation in abdominal fat thickness. For every 1 cm increase in waist circumference, abdominal fat thickness increased by an estimated 4.03 units (95% CI: 3.5, 4.6). This suggests that measuring the waist could be a useful screening tool. However, random blood sugar was an even stronger predictor.

Random blood sugar explained 93.8% of the variation in abdominal fat thickness, making it the stronger of the two predictors. Every 1 mmol/l increase in random blood sugar was associated with an estimated 9.64 unit increase in abdominal fat thickness (95% CI: 9.1, 10), indicating a close connection with abdominal fat accumulation. This strength suggests that random blood sugar might be an early and sensitive indicator of hidden fat, offering valuable insights into metabolic health.

Exercise

  1. Perform a simple linear regression with mpg as the dependent variable and wt as the independent variable in the mtcars dataset. Check and interpret the assumptions. Interpret the coefficients and the R-Squared value.
  2. Perform a simple linear regression analysis with Sepal.Length as the dependent variable and Petal.Width as the independent variable in the iris dataset. Check and interpret the assumptions. Interpret the coefficients and the R-squared value.
  3. Using the pefr mlr.dta dataset from this link: https://sta334.s3.ap-southeast-1.amazonaws.com/data/pefr+mlr.dta, perform simple linear regression with pefr as the dependent variable. The independent variables are age, weight, and height. Check and interpret the assumptions, then interpret the coefficients and the R-squared values.