02.1 - Input Data Analysis - DATS

In stochastic simulations, we must decide on the input probability distributions from which to generate random variates:

Random variates are observations (draws, realizations) of a random variable from the specified input distributions or processes.
Example: in queueing systems, the random inputs are the interarrival and service times.

Once those input probability distributions are specified, we must have a way to generate random variates from them.
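As an illustration of what generating random variates can look like, here is a minimal sketch of the inverse-transform method for exponential interarrival times. The notes do not prescribe a generation method, so the method, the function name `exponential_variate`, the rate 2.0 and the seed are all assumptions chosen for the example. Python is used only for illustration; the notes themselves work in R.

```python
import math
import random


def exponential_variate(rate, rng):
    """Draw one Exponential(rate) variate by inverting F(x) = 1 - exp(-rate * x)."""
    u = rng.random()                    # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate    # F^{-1}(u); 1 - u lies in (0, 1], so the log is defined


rng = random.Random(42)                 # fixed seed so the run is reproducible
# Hypothetical interarrival times for a queueing model with rate 2.0 (an assumed value).
interarrival_times = [exponential_variate(2.0, rng) for _ in range(10_000)]
sample_mean = sum(interarrival_times) / len(interarrival_times)
print(round(sample_mean, 3))            # should be close to 1 / rate = 0.5
```

The same idea works for any distribution whose distribution function can be inverted in closed form.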

Specifying Univariate Input Distributions

Univariate distribution: a single (scalar) random input
- A simulation may contain many such inputs

Usually, we have real-world observed data. We want to fit a probability distribution to the observed data.

Once this is done, we can generate random variates to drive the simulation.

Choosing Probability Distributions

There are many probability distributions, continuous and discrete.

For a comprehensive list of the main distributions, visit: https://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm

Some common continuous distributions:

Normal Distribution
Exponential Distribution
Logistic Distribution
Continuous Uniform Distribution
Triangular Distribution (not in the assignment)
Weibull Distribution
Gamma Distribution
Pareto Distribution
Lognormal Distribution
Loglogistic Distribution
Erlang-k Distribution
Gumbel Distribution
etc.

Some common discrete distributions:

Discrete Uniform Distribution
Binomial Distribution
Poisson Distribution
Geometric Distribution
Negative Binomial Distribution

To narrow down the candidates:

Take a look at the data
- Continuous vs. discrete
- Look at the range (negative and positive, only positive, ...)
Make a histogram
- Match its shape to a subset of candidate distributions

We still need to estimate the distribution parameters (fit), then test the goodness of fit.
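The "make a histogram" step can be sketched without any plotting library. This is only an illustrative sketch: the helper `histogram_counts`, the choice of 10 equal-width bins and the synthetic exponential data are assumptions, not part of the notes. In R, `hist()` would be the usual tool.

```python
import random


def histogram_counts(data, n_bins):
    """Count observations in n_bins equal-width bins spanning [min(data), max(data)]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum falls exactly on the last edge, so clamp it into the last bin.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    return lo, width, counts


rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(500)]   # synthetic data, for illustration only
lo, width, counts = histogram_counts(data, n_bins=10)
for k, c in enumerate(counts):
    left_edge = lo + k * width
    print(f"{left_edge:6.2f} | {'#' * (c // 5)}")   # one '#' per 5 observations
```

A shape that starts high and decays monotonically, as here, points towards exponential-like candidates; a symmetric bell shape would point towards the normal or logistic distribution.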

Fitting a distribution to data

The first step is always to observe the data. Some things to look at are:

Continuous vs discrete
Range (also called support): finite or infinite at the left and right ends?

Then, steps to take are:

Make a histogram
- Match it to shapes of known probability density functions

With the working sample:

Select a possible distribution
Estimate parameters
Test goodness of fit

With the test sample:

Final validation of selected model
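The working-sample / test-sample procedure above can be sketched as follows. The notes do not say how to split the data or which goodness-of-fit measure to use, so the 70/30 split, the exponential candidate, the maximum-likelihood estimate and the Kolmogorov-Smirnov distance are all assumptions made for this example; the data are synthetic.

```python
import math
import random


def fit_exponential_rate(sample):
    """Maximum-likelihood estimate of the exponential rate: 1 / sample mean."""
    return len(sample) / sum(sample)


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF of sample and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


rng = random.Random(1)
data = [rng.expovariate(0.5) for _ in range(1000)]   # synthetic stand-in for observed data
rng.shuffle(data)
working, test = data[:700], data[700:]               # hypothetical 70/30 split

rate_hat = fit_exponential_rate(working)             # estimate on the working sample
d_test = ks_statistic(test, lambda x: 1.0 - math.exp(-rate_hat * x))
print(round(rate_hat, 3), round(d_test, 3))          # small distance = good fit on unseen data
```

Evaluating the distance on the test sample, which played no part in the estimation, is what makes this a validation rather than a second look at the same fit.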

Several distribution-fitting packages are available in R, and distribution fitting is also provided in MINITAB.

Other resources in RStudio (R packages):

fitdistrplus
vcd
moments

Coefficient of variation

The coefficient of variation of a given distribution is defined as:
$C_{X} = \frac{\sigma_{X}}{\mu_{X}}$
where:

$\sigma_{X}$: standard deviation
$\mu_{X}$: mean

It measures how far the distribution of the sample data is from the exponential distribution, for which $C_{X} = 1$.

In practice:

If $C_{X} > 1 \implies$ log-exponential distribution
If $C_{X} = 1 \implies$ exponential distribution
If $C_{X} < 1 \implies$ Gamma family distribution
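A minimal sketch of estimating the coefficient of variation from a sample, using the sample standard deviation and the sample mean in place of $\sigma_{X}$ and $\mu_{X}$. The two synthetic samples (exponential, and gamma with shape 4) and the helper name `coefficient_of_variation` are assumptions chosen for the illustration.

```python
import random
import statistics


def coefficient_of_variation(sample):
    """Sample coefficient of variation: standard deviation divided by mean."""
    return statistics.stdev(sample) / statistics.mean(sample)


rng = random.Random(7)
# Exponential data: the theoretical value is C_X = 1.
exp_sample = [rng.expovariate(1.0) for _ in range(20_000)]
# Gamma data with shape 4 and scale 0.25: the theoretical value is 1 / sqrt(4) = 0.5.
gamma_sample = [rng.gammavariate(4.0, 0.25) for _ in range(20_000)]

cv_exp = coefficient_of_variation(exp_sample)
cv_gamma = coefficient_of_variation(gamma_sample)
print(round(cv_exp, 2), round(cv_gamma, 2))
```

On real data the estimate is never exactly 1, so the comparison against 1 is a rough guide for choosing candidates rather than a formal test.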

Descriptive statistics methods in R

Variate type: continuous or discrete
- a data sample is always a finite set of values; if it takes very many distinct values, treat it as continuous
Numerical description:
- moments
- mean
- variance and standard deviation
- kurtosis
Rough description in R:
- summary(<dataframe>)
Use the moments library in R to compute the first four sample moments, including skewness and kurtosis
Check for correlation in the sample (acf(), pacf())
Remove the shift (location offset) for exponential-, gamma- and Weibull-type data
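To make the quantities in this list concrete, here is a sketch of the sample skewness, kurtosis and lag-1 autocorrelation written out by hand. It is a Python illustration of the formulas, not the R workflow itself; the helper names are hypothetical, the moments divide by $n$, and the kurtosis is the non-excess version (3 for a normal distribution). The synthetic normal sample is an assumption.

```python
import random
import statistics


def central_moment(sample, k):
    """k-th sample central moment (divides by n, not n - 1)."""
    m = statistics.fmean(sample)
    return sum((x - m) ** k for x in sample) / len(sample)


def skewness(sample):
    """Third standardized moment: 0 for a symmetric distribution."""
    return central_moment(sample, 3) / central_moment(sample, 2) ** 1.5


def kurtosis(sample):
    """Fourth standardized moment (non-excess): 3 for a normal distribution."""
    return central_moment(sample, 4) / central_moment(sample, 2) ** 2


def autocorrelation(sample, lag):
    """Sample autocorrelation at the given lag; near 0 for independent data."""
    m = statistics.fmean(sample)
    n = len(sample)
    num = sum((sample[t] - m) * (sample[t + lag] - m) for t in range(n - lag))
    den = sum((x - m) ** 2 for x in sample)
    return num / den


rng = random.Random(11)
sample = [rng.gauss(10.0, 2.0) for _ in range(20_000)]   # synthetic, independent normal data
skew, kurt, acf1 = skewness(sample), kurtosis(sample), autocorrelation(sample, 1)
print(round(skew, 2), round(kurt, 2), round(acf1, 2))
```

A lag-1 autocorrelation far from 0 would indicate that the observations are not independent, in which case fitting a single univariate distribution is not enough to describe the input.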

Auxiliary Graphic Tools

Probability Plot - P Plot

A Probability Plot is a graphical comparison of an estimate of the true distribution function of the available data $X_{1}, X_{2}, \dots, X_{n}$ with the distribution function of the fitted distribution.

Probability-Probability Plot

A Probability-Probability Plot (PP Plot) is a graph of the model probability $\hat{F}(x_{(i)})$ versus the sample probability
$\tilde{F}_{n}(x_{(i)}) = q_{i} = \frac{i - 0.5}{n}, \quad i = 1, 2, \dots, n$
where $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ is the ordered sample.

If $\hat{F}(x)$ and $\tilde{F}_{n}(x)$ are close together, then the PP plot will be approximately linear with intercept 0 and slope 1.

The linear correlation coefficient of the points in the PP plot is a measure of the goodness of fit of the proposed distribution.

The PP plot graphs two cumulative distribution functions against each other: the theoretical one on the x-axis and the empirical one on the y-axis.
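The PP plot construction can be sketched numerically: compute the pairs (model probability, sample probability) and their linear correlation coefficient. The exponential candidate, its maximum-likelihood fit and the helper names `pp_points` and `correlation` are assumptions for this example; the data are synthetic.

```python
import math
import random


def pp_points(sample, model_cdf):
    """Return (model probability, sample probability) pairs for a PP plot."""
    xs = sorted(sample)
    n = len(xs)
    return [(model_cdf(x), (i - 0.5) / n) for i, x in enumerate(xs, start=1)]


def correlation(pairs):
    """Pearson linear correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(3)
sample = [rng.expovariate(2.0) for _ in range(500)]      # synthetic data
rate_hat = len(sample) / sum(sample)                     # fitted exponential rate (MLE)
points = pp_points(sample, lambda x: 1.0 - math.exp(-rate_hat * x))
r = correlation(points)
print(round(r, 4))                                       # close to 1 when the fit is good
```

Plotting `points` as a scatter plot and comparing it with the diagonal from (0, 0) to (1, 1) gives the visual version of the same check.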

Quantile-Quantile Plot - QQ Plot

The QQ plot is used to see how well a particular data sample follows a particular theoretical distribution.

A Quantile-Quantile Plot (QQ Plot) is a graph of the standard model quantiles $\hat{F}^{-1}(q_{i})$, where
$q_{i} = \frac{i - 0.5}{n}, \quad i = 1, 2, \dots, n$
versus $x_{(i)}, \; i = 1, 2, \dots, n$, where $x_{(1)} < x_{(2)} < \dots < x_{(n)}$ is the ordered sample data.

If the proposed distribution fits the data, the QQ plot will be approximately linear, with the intercept giving the location and the slope giving the scale. The linearity of the plot is also a measure of the goodness of fit of the proposed distribution.
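The location and scale reading of the QQ plot can be checked numerically. The sketch below assumes a normal candidate: it pairs standard normal quantiles with the ordered sample and fits a least-squares line, so the intercept should approximate the mean and the slope the standard deviation. The helper names, the synthetic sample and the least-squares fit are assumptions made for the illustration.

```python
import random
from statistics import NormalDist


def qq_points(sample, standard_quantile):
    """Return (standard model quantile, ordered observation) pairs for a QQ plot."""
    xs = sorted(sample)
    n = len(xs)
    return [(standard_quantile((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]


def least_squares_line(pairs):
    """Intercept and slope of the least-squares line through (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    return my - slope * mx, slope


rng = random.Random(5)
sample = [rng.gauss(10.0, 2.0) for _ in range(2000)]     # synthetic data: location 10, scale 2
points = qq_points(sample, NormalDist().inv_cdf)         # standard normal quantiles on the x-axis
intercept, slope = least_squares_line(points)
print(round(intercept, 2), round(slope, 2))              # roughly the location and the scale
```

Systematic curvature in the plotted points, rather than scatter around a line, would indicate that the candidate family itself is a poor choice.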