04 - Statistical modeling through regression - DATS

References:

Fox, J, Applied Regression Analysis and Generalized Linear Models, Sage Publications

These notes are not intended for a deep understanding of the topic. They are just a guide to try and survive the 📊 DATS exam.

They contain no rigor and may contain mistakes.

The goal is to model a target (or response) variable $Y$ as a linear combination of some explanatory variables $X$. More precisely, we obtain a prediction of the target, $η$.

$μ = η = Xβ$

$β$ is the vector of coefficients.
$X$ is a matrix where each column is an explanatory variable. The first column is all 1s (it multiplies the intercept).
$η$ only describes the systematic component.
The vector $η$ has the same length as the target variable.
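
As a rough illustration of the objects above, the sketch below builds $X$, $β$ and $η$ in R. The built-in `mtcars` data and the choice of `mpg`, `wt` and `hp` are assumptions made only for this example; they are not part of the course material.

```r
# Assumption: mtcars stands in for the real data; mpg is the target.
fit <- lm(mpg ~ wt + hp, data = mtcars)

X    <- model.matrix(fit)       # design matrix: column of 1s, then wt and hp
beta <- coef(fit)               # estimated coefficient vector
eta  <- as.vector(X %*% beta)   # linear predictor, one value per observation

head(X, 3)     # the first column, "(Intercept)", is all 1s
length(eta)    # same length as the target variable
```

The product `X %*% beta` reproduces what `fitted(fit)` returns, which is the systematic part of the model.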

Part of the variability of the response is not described by $η$, so to get $Y$ we also need to add the remaining part, known as the residuals: $ε_{i} = e_{i} = Y_{i} - \hat{Y}_{i}$. $\hat{Y}_{i}$ is the prediction of the response according to the model.

$β$ is a list of coefficients:

$β = (β_{1}, β_{2}, \dots, β_{p})$

where $p$ is the number of columns of $X$ (the column of 1s plus the explanatory variables) and $β_{1}$ is known as the intercept.

To summarise:

$Y = μ + ε = Xβ + ε$
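
A quick way to convince yourself of this decomposition is to check it numerically. This is a minimal sketch, again assuming the built-in `mtcars` data as a stand-in for the real data set.

```r
# Assumption: mtcars and the chosen variables are only for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

y_hat <- fitted(fit)      # predictions, the systematic part
e     <- residuals(fit)   # residuals, Y minus the predictions

# Prediction plus residual gives back the observed response.
all.equal(unname(y_hat + e), mtcars$mpg)
```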

Model in R

If all explanatory variables are numeric, the process is quite simple. The rules described above all apply.

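In RStudio we proceed as follows:

Y: target variable
X2, X3, ..., Xp: the explanatory variables (X1 is the column of 1s for the intercept, which lm() adds automatically)

m1 <- lm(Y ~ X2 + X3 + ... + Xp)
summary(m1)

summary(m1) prints the estimated coefficients with their standard errors and p-values, together with the R-squared and the overall F-test.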
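
For a version that runs as-is, here is a small sketch on the built-in `mtcars` data. The data set and the variables `mpg`, `wt` and `hp` are assumptions chosen for illustration, standing in for `Y` and the `X` columns.

```r
# Assumption: mpg plays the role of Y; wt and hp are the explanatory variables.
m1 <- lm(mpg ~ wt + hp, data = mtcars)

s <- summary(m1)
s                # full printed summary
coef(s)          # coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared      # proportion of variability explained by the model
```

The intercept appears as the `(Intercept)` row even though it was never written in the formula.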

FISHER-ANOVA:
We can try building a new model with fewer variables.
Let $M$ be the original complete model. Then we can make $m$ a nested model of $M$ (it has the same form as $M$ but with fewer variables).

We can use the Fisher test (F-test) to check whether the two models are equivalent. If they are, it's better to use the smaller one.

Fisher test: H0: models $m$ and $M$ are equivalent (the extra coefficients in $M$ are all zero)

$p > 0.05$ Do not reject H0 --> Keep the smaller model $m$
$p < 0.05$ Reject H0 --> Need to use the bigger model $M$

In RStudio this test is done with the function anova(), passing the two models to compare:

anova(m1, m2)

Read the p-value of the test in the Pr(>F) column of the output.
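
The comparison can be sketched end to end as follows. The `mtcars` data and the specific variables are assumptions for illustration; `M` and `m` mirror the names used for the complete and nested models above.

```r
# Assumption: mtcars stands in for the real data.
M <- lm(mpg ~ wt + hp + qsec, data = mtcars)   # complete model
m <- lm(mpg ~ wt + hp, data = mtcars)          # nested model: qsec dropped

tab <- anova(m, M)   # F-test comparing the two nested models
tab

# The p-value sits in the second row of the Pr(>F) column
# (the first row has no test, so it is NA).
p_value <- tab[["Pr(>F)"]][2]
p_value
```

If `p_value` is above 0.05 there is no evidence that the extra variable improves the fit, so the smaller model `m` would be kept.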

If using glm(), then use the Wald test

Unusual and influential data

Regression outlier: an observation with an unusual value of the outcome variable Y, conditional on the value of the explanatory variable X.

LEVERAGE

An observation with high leverage has an unusual X value and so can pull the regression line towards itself.
Leverage is higher for observations further away from the mean of X.
If an observation has high leverage but still falls on the regression line, this is not a problem, as it does not change the regression coefficients.
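
Leverage can be read off a fitted model with `hatvalues()` from base R. The sketch below uses a one-variable model on the built-in `mtcars` data (an assumption for illustration) to check that the observation furthest from the mean of X is also the one with the highest leverage.

```r
# Assumption: mtcars and the single predictor wt are only for illustration.
fit <- lm(mpg ~ wt, data = mtcars)

h <- hatvalues(fit)                                # leverage of each observation
dist_from_mean <- abs(mtcars$wt - mean(mtcars$wt)) # distance of X from its mean

names(which.max(h))                         # observation with highest leverage
rownames(mtcars)[which.max(dist_from_mean)] # observation furthest from mean of X
```

Leverage depends only on the X values, so it says nothing yet about whether the point sits on or off the regression line.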

INFLUENTIAL OBSERVATIONS:

If an observation has high leverage and also falls far from the regression line, then it is considered influential.
It has both an unusual X value and an unusual Y value.
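
The difference between high leverage alone and real influence can be seen on simulated data. Everything in this sketch (the seed, the true line, the added point at x = 60) is made up for illustration. Cook's distance is not named in these notes; it is used here only as a standard base-R summary of influence.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
base <- data.frame(x = x, y = y)

# Same unusual X in both cases; only Y differs.
on_line  <- rbind(base, data.frame(x = 60, y = 2 + 0.5 * 60))  # follows the trend
off_line <- rbind(base, data.frame(x = 60, y = 0))             # far from the trend

fit_on  <- lm(y ~ x, data = on_line)
fit_off <- lm(y ~ x, data = off_line)

# Slopes: the on-line point barely changes it, the off-line point changes it a lot.
slopes <- c(base     = unname(coef(lm(y ~ x, data = base))["x"]),
            on_line  = unname(coef(fit_on)["x"]),
            off_line = unname(coef(fit_off)["x"]))
slopes

# Leverage of the added point is identical in both fits (it depends on X only),
# but its Cook's distance is much larger when it is off the line.
h_on  <- unname(hatvalues(fit_on)[21])
h_off <- unname(hatvalues(fit_off)[21])
d_on  <- unname(cooks.distance(fit_on)[21])
d_off <- unname(cooks.distance(fit_off)[21])
```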

influencePlot(), from the car package, highlights the most influential observations:

influencePlot(m1, pch=19, id = list(method="noteworthy", n=3))

OUTPUT: