02.1 - Input Data Analysis - DATS
In stochastic simulations, we must decide on the input probability distributions from which to generate random variates:
- Random variates = observations (draws, realizations) of a random variable from specified input distributions/processes
- Example: in queueing systems, the inputs are interarrival and service times
Once those input probability distributions are specified, we must have a way to generate random variates from them.
Specifying Univariate Input Distributions
- Univariate distribution: a single (scalar) random input
- A simulation may have many such inputs
Usually, we have real-world observed data. We want to fit a probability distribution to the observed data.
Once this is done, we can generate random variates to drive the simulation.
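As a minimal sketch of what "generating random variates" looks like in R, the block below draws exponential interarrival times in two ways: with the built-in generator and with the inverse-transform method. The arrival rate, sample size, and variable names are assumptions chosen only for illustration.

```r
set.seed(42)   # fixed seed so the run is reproducible

rate <- 2      # hypothetical arrival rate (arrivals per time unit)
n <- 10000     # number of variates to generate

# Built-in generator for exponential variates
interarrival <- rexp(n, rate = rate)

# Same distribution via the inverse-transform method: X = -log(U) / rate,
# where U is uniform on (0, 1)
u <- runif(n)
interarrival_inv <- -log(u) / rate

# Both sample means should be close to the theoretical mean 1 / rate
mean(interarrival)
mean(interarrival_inv)
```

The built-in generator is the usual choice in practice; the inverse-transform version is included only to show where the variates come from.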
Choosing Probability Distributions
There are many probability distributions, continuous and discrete.
For a comprehensive list of the main distributions, visit: https://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
Some common continuous distributions:
- Normal distribution
- Exponential distribution
- Logistic distribution
- Continuous uniform distribution
- Triangular distribution (not in the assignment)
- Weibull distribution
- Gamma distribution
- Pareto distribution
- Lognormal distribution
- Loglogistic distribution
- Erlang-k distribution
- Gumbel distribution
- etc.
Some common discrete distributions:
-
Take a look at the data:
- Continuous vs. discrete
- Look at the range (negative and positive, only positive, ...)
Make a histogram:
- Match it to a subset of candidate distributions for real data
After that, the distribution parameters still have to be estimated (the fit), and the goodness of fit has to be tested.
Fitting distribution to data
The first step is always to observe the data. Some things to look at are:
- Continuous vs discrete
- Range (also support): finite or infinite on right and left end?
Then, steps to take are:
- Make a histogram
- Match it to shapes of known probability density functions
With the working sample:
- Select a possible distribution
- Estimate parameters
- Test goodness of fit
With the test sample:
- Final validation of the selected model
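The steps above can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated gamma values standing in for real observations, the 70/30 split is arbitrary, the candidate family (gamma) is picked by hand, and the Kolmogorov-Smirnov test is just one possible goodness-of-fit test. `fitdistr()` comes from the MASS package that ships with standard R installations.

```r
library(MASS)  # provides fitdistr() for maximum-likelihood fitting

set.seed(1)
# Hypothetical data: 500 service times (stand-in for real observations)
x <- rgamma(500, shape = 2, rate = 0.5)

# Split into a working sample (fitting) and a test sample (final validation)
idx <- sample(seq_along(x), size = 350)
working <- x[idx]
test <- x[-idx]

# Observe the data: range and histogram
range(working)
hist(working, breaks = 30, freq = FALSE, main = "Working sample")

# Select a candidate distribution and estimate its parameters
fit <- fitdistr(working, densfun = "gamma")
shape_hat <- fit$estimate[["shape"]]
rate_hat <- fit$estimate[["rate"]]

# Test goodness of fit on the working sample
ks_working <- ks.test(working, "pgamma", shape = shape_hat, rate = rate_hat)

# Final validation of the selected model on the test sample
ks_test <- ks.test(test, "pgamma", shape = shape_hat, rate = rate_hat)
ks_test$p.value
```

Note that the p-value on the working sample tends to be optimistic, because the parameters were estimated from the same data. That is one reason to keep a separate test sample for the final validation.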
Several distribution-fitting packages are available in R, and distribution fitting is also provided in MINITAB.
Other resources in RStudio:
- fitdistrplus
- vcd
- moments
Coefficient of variation
The coefficient of variation of a given distribution is defined as:
$$CV = \frac{\sigma}{\mu}$$
where:
- $\sigma$ is the standard deviation
- $\mu$ is the mean
It measures how far the distribution selected for the sample data lies from the exponential distribution (for which $CV = 1$).
In practice, the sample $CV$ is compared with 1:
- If $CV \approx 1$: exponential distribution
- If $CV$ is clearly different from 1: log-exponential distribution or Gamma family distribution, depending on which side of 1 it falls
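A short sketch of how the sample coefficient of variation could be computed in R. The helper name `coef_var` and both simulated samples are assumptions made for illustration; the gamma sample is included because a gamma distribution with shape $k$ has $CV = 1/\sqrt{k}$, which gives a known value to compare against.

```r
# Hypothetical helper: sample coefficient of variation (sd / mean)
coef_var <- function(x) sd(x) / mean(x)

set.seed(7)
x_exp <- rexp(2000, rate = 0.5)               # exponential: theoretical CV = 1
x_gam <- rgamma(2000, shape = 4, rate = 2)    # gamma, shape 4: theoretical CV = 0.5

cv_exp <- coef_var(x_exp)
cv_gam <- coef_var(x_gam)

cv_exp   # expected to be close to 1
cv_gam   # expected to be close to 0.5
```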
Descriptive statistics methods in R
- Variate type: continuous or discrete
- A data sample is always a finite set of values; if it takes very many distinct values, treat it as continuous
- Numerical description: moments (mean, variance and standard deviation, kurtosis)
- Rough description in R: summary(<dataframe>)
- Use the moments library in R to compute the first four sample moments, including skewness and kurtosis
- Check for correlation in the sample (acf(), pacf())
- Remove the shift for exponential, gamma, and Weibull type data (subtract the location offset so that the support starts at 0)
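The checks listed above can be sketched with base R only. The sample is simulated (an assumption for illustration), and skewness and kurtosis are computed by hand from the central moments; the moments package offers `skewness()` and `kurtosis()` functions which, to my knowledge, use the same moment-based definitions. Using the sample minimum as the shift is a crude assumption, not a recommended estimator.

```r
set.seed(3)
# Hypothetical sample: Weibull data with an artificial shift of 5
x <- 5 + rweibull(1000, shape = 1.5, scale = 2)

summary(x)                    # rough description: quartiles, mean, min, max

m <- mean(x)                  # first moment
s2 <- var(x)                  # sample variance
s <- sd(x)                    # sample standard deviation

# Moment-based skewness and kurtosis (central moments divided by n)
m2 <- mean((x - m)^2)
skew <- mean((x - m)^3) / m2^1.5
kurt <- mean((x - m)^4) / m2^2

# Correlation check: autocorrelation and partial autocorrelation
ac <- acf(x, lag.max = 10, plot = FALSE)
pac <- pacf(x, lag.max = 10, plot = FALSE)

# Crude shift removal (assumption: the sample minimum approximates the offset)
x_shifted <- x - min(x)
```

For independent observations, the autocorrelations at lags 1 and above should stay close to 0.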
Auxiliary Graphic Tools
Probability Plot - P Plot
A Probability Plot is a graphical comparison of an estimate of the true distribution function of the available data with the distribution function of the fitted distribution.
Probability-Probability Plot
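A Probability-Probability Plot (PP Plot) is a graph of the model probability $\hat{F}(x_{(i)})$ against the sample probability $\tilde{F}_n(x_{(i)})$, for $i = 1, \dots, n$, where $x_{(1)} \le \dots \le x_{(n)}$ are the ordered observations.
If the proposed distribution fits the data well, the points lie close to the straight line with intercept 0 and slope 1.
The linear correlation coefficient of the PP Plot points measures the goodness of fit of the proposed distribution.
The PP plot graphs two distribution functions (CDFs) against each other: the theoretical one on the x-axis and the empirical one on the y-axis.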
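A minimal sketch of a PP plot in base R, assuming an exponential model fitted to simulated data. The plotting position $(i - 0.5)/n$ used for the empirical probabilities is one common convention and is an assumption here; other conventions exist.

```r
set.seed(11)
x <- rexp(300, rate = 1.5)      # hypothetical sample
rate_hat <- 1 / mean(x)         # maximum-likelihood estimate of the rate

xs <- sort(x)                   # ordered observations
n <- length(xs)

empirical <- (seq_len(n) - 0.5) / n        # empirical probabilities (y-axis)
theoretical <- pexp(xs, rate = rate_hat)   # model probabilities (x-axis)

plot(theoretical, empirical,
     xlab = "Theoretical probability", ylab = "Empirical probability",
     main = "PP plot: exponential model")
abline(0, 1)                    # reference line: intercept 0, slope 1

# Linear correlation of the PP plot points as a rough goodness-of-fit measure
pp_cor <- cor(theoretical, empirical)
pp_cor
```

A correlation close to 1 together with points hugging the reference line suggests a reasonable fit. The correlation alone can stay high even when the points bend away from the line, so the plot itself is still worth inspecting.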