04 - Statistical modeling through regression - DATS

These notes are not intended for a deep understanding of the topic. They are just a guide to try and survive the 📊 DATS exam.

They contain no rigor and may contain mistakes.

The goal is to model a target (or response) variable, Y, as a linear combination of some explanatory variables, X. More precisely, we obtain the linear predictor η, which in the linear model equals the expected value μ of the target and serves as its prediction.

$\mu = \eta = X\beta$

Part of the variability of the response is not described by η, so to recover Y we also need to add the remaining part, the error ε, which is estimated by the residuals: $e_i = Y_i - \hat{Y}_i$. Here $\hat{Y}_i$ is the prediction of the response for observation i according to the model.

β is a list of coefficients:

$\beta = (\beta_1, \beta_2, \dots, \beta_p)$

where p is the number of coefficients (the intercept plus p-1 explanatory variables) and $\beta_1$ is known as the intercept.

To summarise:

$Y = \mu + \varepsilon = X\beta + \varepsilon$
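To make the decomposition concrete, here is a minimal sketch in R. The data are simulated and the coefficient values (1, 2, -0.5) are assumptions picked only for illustration; the point is that the fitted values are X times the estimated coefficients, and adding the residuals gives back Y.

```r
# Simulated data: the "true" coefficients below are illustrative assumptions.
set.seed(1)
n  <- 100
X2 <- rnorm(n)
X3 <- rnorm(n)
Y  <- 1 + 2 * X2 - 0.5 * X3 + rnorm(n, sd = 0.3)

fit <- lm(Y ~ X2 + X3)

X        <- model.matrix(fit)      # design matrix; the column of 1s is the intercept
beta_hat <- coef(fit)              # estimated coefficients
mu_hat   <- drop(X %*% beta_hat)   # linear predictor = fitted values
e        <- Y - mu_hat             # residuals e_i = Y_i - Y_hat_i

# Both checks should return TRUE: the pieces match what lm() stores
all.equal(unname(mu_hat), unname(fitted(fit)))
all.equal(unname(e), unname(residuals(fit)))
```

Note that model.matrix() adds the intercept column automatically, which is why the first coefficient plays the role of the intercept.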

Model in R

If all explanatory variables are numeric, the process is quite simple. The rules described above all stand.

In RStudio we proceed as follows:

m1 <- lm(Y ~ X2 + X3 + ... + Xp)
summary(m1)

The summary of the model reports, for each coefficient, the estimate, its standard error, the t value and the p-value, together with the residual standard error, the R-squared and the overall F test.
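As a runnable version of the call above, here is a small sketch using mtcars, a dataset that ships with R. The choice of response (mpg) and explanatory variables (wt, hp) is an assumption made only for the example.

```r
# mtcars is built into R; the variables used here are illustrative only.
m1 <- lm(mpg ~ wt + hp, data = mtcars)
s  <- summary(m1)

s$coefficients   # estimate, standard error, t value and p-value per coefficient
s$r.squared      # proportion of variability explained by the model
s$sigma          # residual standard error
```

Printing s directly shows the same information in the usual tabular layout.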

FISHER-ANOVA:
We can try making a new model with fewer variables.
Let M be the original complete model. Then we can make m a nested model of M (it has the same form as M but with fewer variables).

We can use the Fisher (F) test to check whether the two models are equivalent. If they are, it's better to use the smaller one.

Fisher test: H0: models m and M are equivalent (the extra coefficients in M are all zero)

In RStudio this test is run with the function anova(), passing the two fitted models, conventionally the smaller one first:

anova(m1, m2)

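Read the p-value of the test: if it is above the chosen significance level (e.g. 0.05), H0 is not rejected and the smaller model is preferred; if it is below, the extra variables in M matter.

If the models are fitted with glm(), the F test is replaced by a deviance (likelihood-ratio) test, anova(m1, m2, test = "Chisq"), or by Wald tests on the coefficients.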
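A minimal sketch of the nested-model comparison, again on the built-in mtcars data. The models and variable names are assumptions for illustration: M is the complete model and m drops qsec.

```r
# Complete model M and nested model m (same form, one variable fewer).
M <- lm(mpg ~ wt + hp + qsec, data = mtcars)
m <- lm(mpg ~ wt + hp, data = mtcars)

cmp <- anova(m, M)               # F test, H0: m and M are equivalent
cmp

p_value <- cmp[["Pr(>F)"]][2]    # the test result sits in the second row
p_value
```

The first row of the table describes the smaller model alone, so its p-value entry is NA; only the second row carries the test.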

Unusual and influential data

Regression outlier: an observation with an unusual value of the outcome variable Y, conditional on the value of the explanatory variable X.
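One common way to look for regression outliers is through studentized residuals, which rstudent() computes in base R. The sketch below is an assumption about how one might screen for them, not a method taken from these notes; the cutoff of 2 is only a rough rule of thumb and the model is illustrative.

```r
# Illustrative model on the built-in mtcars data.
m1 <- lm(mpg ~ wt + hp, data = mtcars)

r_stud <- rstudent(m1)           # one studentized residual per observation

# Rough screening rule (an assumption): flag |studentized residual| > 2
r_stud[abs(r_stud) > 2]
```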

LEVERAGE

INFLUENTIAL OBSERVATIONS:

influencePlot(), from the car package, flags the most influential observations:

influencePlot(m1, pch=19, id = list(method="noteworthy", n=3))

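OUTPUT: a bubble plot of studentized residuals against hat-values (leverage), with circle areas proportional to Cook's distance. The flagged observations are also returned as a table with their studentized residual, hat-value and Cook's distance.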
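The quantities shown by influencePlot() can also be computed directly with base R functions. The sketch below uses an illustrative model on mtcars; the 2p/n leverage cutoff is a common rule of thumb and is an assumption here, not something stated in these notes.

```r
# Illustrative model on the built-in mtcars data.
m1 <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(m1)         # leverage of each observation
d <- cooks.distance(m1)    # influence of each observation on the fit

p <- length(coef(m1))      # number of coefficients, intercept included
n <- nobs(m1)              # number of observations

sum(h)                     # hat-values add up to p

# Rule-of-thumb screening (an assumption): leverage above 2p/n
h[h > 2 * p / n]

# The three observations with the largest Cook's distance
head(sort(d, decreasing = TRUE), 3)
```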