02 - Exploratory Data Analysis - DATS

Statistical analysis of experiments starts with graphical and non-graphical Exploratory Data Analysis (EDA).

EDA always precedes formal (confirmatory) data analysis.

Characteristic:

Univariate Exploratory Data Analysis

We take each column one by one and we summarize it.

How can we summarize different variable types?

Robustness of an indicator

Robustness describes how much an indicator is affected by outliers: the mean is strongly affected by an outlier, while the median is not.

Classic Statistic

classic statistic

A classic statistic or indicator is one that is highly influenced by the presence of one or more outliers.

Robust Statistic

robust statistic

A robust statistic or indicator is one that is NOT highly influenced by the presence of one or more outliers.
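To make the classic vs. robust distinction concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical sample: five measurements clustered around 10.
clean = [9.0, 10.0, 10.0, 11.0, 10.0]
# The same sample with one gross outlier appended.
contaminated = clean + [100.0]

# How far each indicator moves when the outlier is added.
mean_shift = mean(contaminated) - mean(clean)
median_shift = median(contaminated) - median(clean)

print(f"mean:   {mean(clean):.2f} -> {mean(contaminated):.2f}")
print(f"median: {median(clean):.2f} -> {median(contaminated):.2f}")
```

A single extreme value drags the mean far from the bulk of the data, while the median stays where it was.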

Numeric indicators

Central Tendency

Mean

#Classic Statistic

mean ($\mu$)

Simple mean (also called the average, although "average" is not a formally defined mathematical term)
$$\bar{x} = \frac{x_{1} + \dots + x_{n}}{n}$$
It indicates some sort of central trend. It is also known as the expected value: the value that the sample mean approaches as the number of observations tends to infinity.

Quartiles
Median

Median is a #Robust Statistic

It's the second quartile
median = Q2

median or 2° quartile

It's the value that splits the sample in 2 parts:

  • 50% of the observations are less than the median
  • 50% of the observations are greater than the median
Q1 - Q3

#Robust Statistic

Dispersion or spread

The spread of a distribution mostly refers to the #Variance or the #Standard Deviation.

Variance

It's a #Classic Statistic (it's very affected by outliers)

variance ($s_{x}^{2}$ or $\sigma^{2}$)

$$\sigma^{2} = s_{x}^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}$$

Variance has squared units, so it is difficult to interpret. For this reason we use the standard deviation, which has the same units as the data.

Standard Deviation

See also: 04 - Statistica - Idro#Deviazione Standard, 03 - Statistica in Topografia - TP#Deviazione Standard

It's a #Classic Statistic (it's very affected by outliers)

standard deviation ($s_{x}$ or $\sigma$)

$$\sigma = s_{x} = \sqrt{s_{x}^{2}}$$
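The variance and standard deviation definitions above can be sketched directly in Python. The function names and the sample values are hypothetical, and the $n-1$ denominator is assumed, as in the variance formula of these notes.

```python
from math import sqrt

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Square root of the sample variance (same units as the data)."""
    return sqrt(sample_variance(xs))

# Hypothetical sample with mean 5.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(f"variance: {sample_variance(data):.4f}")
print(f"std dev:  {sample_std(data):.4f}")
```

The standard library's `statistics.variance` and `statistics.stdev` use the same $n-1$ convention and can be used to cross-check the result.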

IQR - Inter Quartile Range

It's a #Robust Statistic

inter quartile range ($iqr$)

It indicates how spread apart the observations are.
$$IQR = Q_{3} - Q_{1}$$
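As a rough illustration of the IQR and its robustness, the sketch below uses `statistics.quantiles` from the Python standard library. Several quartile conventions exist; the library's default ("exclusive") method is assumed here, and other tools may return slightly different values. The data are hypothetical.

```python
from statistics import quantiles

def iqr(xs):
    """Inter quartile range: Q3 - Q1 (default 'exclusive' quartile method)."""
    q1, _q2, q3 = quantiles(xs, n=4)
    return q3 - q1

# Hypothetical sample: the integers 1..11.
data = list(range(1, 12))
# Same sample, with the largest value replaced by a gross outlier.
with_outlier = data[:-1] + [1000]

print("IQR:  ", iqr(data), "->", iqr(with_outlier))
print("range:", max(data) - min(data), "->", max(with_outlier) - min(with_outlier))
```

The range (max minus min) explodes when the outlier is introduced, while the IQR is unchanged, which is what makes it a robust measure of spread.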

Skewness and Kurtosis

Skewness
skewness ($\gamma_{2}$)

The skewness is a measure of asymmetry of a distribution.

$\gamma_{2} = 0$: perfect symmetry.

Skewness - 02 - Exploratory Data Analysis - DATS 2024-11-06 12.06.17.excalidraw.png

Kurtosis
kurtosis ($\gamma_{1}$)

The Kurtosis of a distribution measures how far away a distribution is from a Gaussian distribution in terms of peakedness vs flatness.

  • $\gamma_{1} < 0$: rounder shoulders and thinner tails compared to a Gaussian distribution
  • $\gamma_{1} > 0$: a sharper peak and fatter tails compared to a Gaussian distribution
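One common way to compute skewness and kurtosis is through central moments. The sketch below assumes the moment-based (population) estimators and reports kurtosis as excess kurtosis, so that a Gaussian distribution scores 0, in line with the sign convention above. The function names and samples are hypothetical; other estimators with small-sample corrections exist.

```python
def central_moment(xs, k):
    """k-th central moment: average of (x - mean)^k."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** k for x in xs) / n

def skewness(xs):
    """Third central moment scaled by the variance to the power 3/2."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]      # mirror image around 3
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]  # long right tail

print(f"skewness (symmetric):    {skewness(symmetric):.3f}")
print(f"skewness (right-skewed): {skewness(right_skewed):.3f}")
print(f"excess kurtosis (symmetric): {excess_kurtosis(symmetric):.3f}")
```

The symmetric sample has zero skewness, the sample with a long right tail has positive skewness, and the flat, evenly spaced sample has negative excess kurtosis.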

Moments

Moments - 02 - Exploratory Data Analysis - DATS 2024-11-06 12.12.00.excalidraw.png

Numeric charts

Boxplot

Schermata 2024-10-08 alle 15.46.16.png

Boxplot - 02 - Exploratory Data Analysis - DATS 2024-11-06 12.26.23.excalidraw.png

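The boxplot summarizes a distribution at a glance: the line inside the box marks the median (central tendency), the box spans Q1 to Q3 (spread), and points beyond the whiskers are flagged as outliers.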

Assuming there are no outliers, the whiskers indicate the min and the max.
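The numbers behind a boxplot can be sketched without any plotting library. The example below assumes the widely used 1.5 × IQR rule for the whiskers, which is a convention not stated in these notes, and the default quartile method of `statistics.quantiles`. The function name `boxplot_stats` and the data are hypothetical.

```python
from statistics import quantiles

def boxplot_stats(xs, k=1.5):
    """Quartiles, whisker ends and outliers, assuming the k * IQR fence rule."""
    q1, q2, q3 = quantiles(xs, n=4)
    spread = q3 - q1
    lower_fence, upper_fence = q1 - k * spread, q3 + k * spread
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "whisker_low": min(inside),    # lowest observation within the fences
        "whisker_high": max(inside),   # highest observation within the fences
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }

no_outliers = [2, 3, 4, 5, 6, 7, 8]
one_outlier = [2, 3, 4, 5, 6, 7, 30]

print(boxplot_stats(no_outliers))
print(boxplot_stats(one_outlier))
```

With no outliers the whiskers coincide with the min and the max; once an outlier is present, the whisker stops at the last observation inside the fence and the outlier is reported separately.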

Quantile-Quantile plot - QQ plot

The QQ plot is used to see how well a particular data sample follows a particular theoretical distribution.

quantile-quantile plot (qq plot)

A Quantile-Quantile Plot (QQ Plot) is a graph of the standard model quantiles $F^{-1}(q_{i})$, where
$$q_{i} = \frac{i - 0.5}{n}, \quad i = 1, 2, \dots, n$$
versus $x_{(i)}$, $i = 1, 2, \dots, n$, where $x_{(1)} < x_{(2)} < \dots < x_{(n)}$ is the ordered sample data.

If the sample follows the proposed distribution up to location and scale, the points $(F^{-1}(q_{i}), x_{(i)})$ fall approximately on a straight line, whose intercept reflects location and whose slope reflects scale. The plot is therefore also a visual measure of the goodness of fit of the proposed distribution.
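The construction above can be sketched with the standard library alone. The example assumes a standard normal as the theoretical model (via `statistics.NormalDist`) and uses a synthetic sample deliberately built with location 5 and scale 2, so that the fitted line is easy to read. The helper names `qq_points` and `fit_line` are hypothetical.

```python
from statistics import NormalDist

def qq_points(sample, dist=None):
    """Pairs (theoretical quantile, ordered observation) with q_i = (i - 0.5) / n."""
    if dist is None:
        dist = NormalDist()  # standard normal as the assumed model
    ordered = sorted(sample)
    n = len(ordered)
    return [(dist.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

def fit_line(points):
    """Least-squares intercept and slope through the QQ points."""
    n = len(points)
    t_bar = sum(t for t, _ in points) / n
    x_bar = sum(x for _, x in points) / n
    sxy = sum((t - t_bar) * (x - x_bar) for t, x in points)
    stt = sum((t - t_bar) ** 2 for t, _ in points)
    slope = sxy / stt
    return x_bar - slope * t_bar, slope

# Synthetic sample lying exactly on normal quantiles: location 5, scale 2.
z = [NormalDist().inv_cdf((i - 0.5) / 10) for i in range(1, 11)]
sample = [5 + 2 * zi for zi in z]

intercept, slope = fit_line(qq_points(sample))
print(f"intercept (location): {intercept:.3f}")
print(f"slope (scale):        {slope:.3f}")
```

For real data the points scatter around the line; systematic curvature suggests that the proposed distribution is a poor fit.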

Bivariate Exploratory Data Analysis

Two numeric variables: $X$, $Y$

We are interested in:

Correlation

It's dimensionless.
The coefficient of correlation ranges between:

$$-1 \le \rho \le 1$$

If $|\rho| \approx 1$, we can assume an (almost) perfectly linear relation between the variables: $Y = aX + b$

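The sign indicates the direction of the relation: positive when $Y$ tends to increase with $X$, negative when $Y$ tends to decrease as $X$ increases.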

The Pearson coefficient is given by:

$$r_{XY} = \rho = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(X_{i} - \bar{X})(Y_{i} - \bar{Y})}{S_{X} S_{Y}}$$
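A minimal sketch of the Pearson coefficient in plain Python follows, assuming the $n-1$ normalization used for the standard deviation in these notes. The function name `pearson` and the data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
increasing = [2 * xi + 1 for xi in x]   # exact line, positive slope
decreasing = [-xi for xi in x]          # exact line, negative slope
noisy = [2.0, 1.0, 4.0, 3.0, 5.0]       # increasing trend with scatter

print(f"increasing line: {pearson(x, increasing):.3f}")
print(f"decreasing line: {pearson(x, decreasing):.3f}")
print(f"noisy trend:     {pearson(x, noisy):.3f}")
```

Exact lines give a coefficient of magnitude 1, with the sign following the slope, while scatter around the trend pulls the magnitude below 1.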

Charts

❗❗❗ TO BE COMPLETED ❗❗❗ the whole part from page 16 of the slides onward.