02 - Data and Space - TDBM

When modelling transportation patterns, one of the biggest challenges is data collection. There are several ways of collecting data:

Surveys

Surveys are an active data collection process. Data is collected based on pre-defined questions. When carrying out surveys, only a portion of the population of interest is interviewed. This portion is referred to as the sample.

They can be carried out in several ways:

A survey differs from a census: in a census the whole population is interviewed, whereas in a survey the information gathered from the sample is expanded to represent the rest of the population. There are several ways of selecting a sample. More information can be found in [[#Sample Theory]].

Using a questionnaire is highly recommended: it provides standardised, pre-defined questions that are asked to everyone in the same way.

Some surveys focus on opinions and attitudes, such as election polls. Others are concerned with factual characteristics or behaviours, such as transportation habits, housing or consumer spending.

Questions can be asked in an open or closed format. Closed questions are much easier to interpret than open-ended questions. Preferably, for ease of analysis, all questions should have a binary response.

Usually, surveys combine questions of different types.

There are many stages in a survey process:

[[Question you should ask when doing a survey]]

  1. Who did the poll?
  2. Who paid for the poll and why was it done?
  3. How many people were interviewed for the survey?
  4. How were those people chosen?
  5. What area (nation, state, or region) or what group (teachers, lawyers, Democratic voters, etc.) were these people chosen from?
  6. Are the results based on the answers of all the people interviewed?
  7. Who should have been interviewed and was not?
  8. When was the poll done?
  9. How were the interviews conducted?
  10. What about phone-in polls or polls on the Internet?
  11. What is the sampling error for the poll results?
  12. Who’s on first?
  13. What other kinds of factors can skew poll results?
  14. What questions were asked?
  15. In what order were the questions asked?
  16. What about "push polls"?
  17. What other polls have been done on this topic? Do they say the same thing? If they are different, why are they different?
  18. So I've asked all the questions. The answers sound good. The poll is correct, right?
  19. With all these potential problems, should we ever report poll results?
  20. Is this poll worth reporting?

Survey types in transportation systems

example: emef survey

EMEF surveys are carried out every year in Barcelona. About 10,000 people are interviewed by phone. The survey covers all the trips the respondent made on their last working day (the whole day, not a single trip).

Household travel/activity surveys

In the 1950s and 1960s, surveys were conducted on large samples, about 3 to 5% of all households. The goal was to estimate O/D matrices for the region. At the time, TAZs (traffic analysis zones) were much larger because computing capacity was limited.

Today, we only interview 2,000 to 15,000 households (corresponding to about 0.3%). The goal is no longer to obtain O/D matrices, but to get daily activity pattern sequences.

Schermata 2025-03-05 alle 18.25.30.png

Transit on-board surveys

These surveys are usually conducted by transit agencies. Some data may already be known beforehand, like boarding counts, but such data lack information about customer characteristics, like frequency of transit trip-making, vehicle availability, origin-destination...

Transit operator - satisfaction surveys

These surveys quantify the importance of items related to the day-to-day conditions of the transit network. For each item, the respondent selects a score (from 1 to 10, or on the [[Likert Scale]]).

Commercial vehicle surveys

Schermata 2025-03-05 alle 18.30.50.png

Schermata 2025-03-05 alle 18.30.59.png

New sources of mobility data

New sources of data collection include:

These sources are now used to extract OD matrices, but they do not provide modal data for urban trips, since such trips are too short.

Schermata 2025-03-05 alle 18.34.10.png

example - nommon

Nommon is an IT company in Barcelona that produces mobility data from Call Detail Records (CDR) and network probe of the Orange telephone operator to generate OD matrices.

It does not include modal split. It does provide trip length estimation and precise timestamps.

Special generator, visitor surveys

Special generator and visitor surveys are intended for specific demand attraction points like tourist hotspots, big shopping malls and airports.

Parking surveys

Parking surveys ask about distance to final destination, duration of stay, posted prices vs cost for the individual...

Revealed versus stated preference

Sometimes we are interested in knowing the current, factual situation. On other occasions, we want to investigate what people think of a new scenario that does not exist yet.

Revealed preference

revealed preference

A revealed preference is the actual behaviour of the individuals.

When we collect data from surveys on actual travel, we are collecting information on revealed preference.

Stated preference

stated preference

A stated preference is obtained by asking the responders about scenarios that do not yet exist.

Respondents are presented with hypothetical conditions, whose typical variables are strategically defined by means of experimental design, and the choice each individual makes under the different conditions is recorded.

Conditions for valid results are:

Louviere, J.J., Hensher, D.A. and Swait, J. (2000) [[Stated Choice Methods: Analysis and Application]]. Cambridge University Press, Cambridge.

Errors in stated preference data

Bias statement

The respondent answers, consciously or unconsciously, what they think the interviewer wants to hear.

Rationalisation Bias

The respondent tries to be rational in their responses in order to justify their behaviour at the time of the interview.

Political Bias

The respondent answers in order to influence policy decisions based on their beliefs about how they can affect the results of the survey.

No restriction Bias

The respondent does not take into account any restrictions on their behaviour when answering, so the answers are not realistic.

Questionnaire design

Questionnaire design is the process of writing the questions and possible answers.

Questionnaires usually contain:

The general organisation, especially the order of questions, is important.

  • go from general to particular
  • go from less committed to most committed
  • delicate questions must never go at the beginning or at the end
  • socio-economic questions go at the end
  • use transitional phrases to break the monotony and to mark thematic changes
  • first questions are usually strategic: they set the tone and predispose the respondent, so they must be neutral, pleasant and easy
  • avoid questions that condition the response to the following questions
  • when designing filter questions, try not to frustrate people who do not meet certain requirements

The order of the questions can influence non-response.

Survey design

Survey design involves:

Statistical issues

There are some possible errors that may occur when analysing data.

First, let's focus on the difference between [[Accuracy]] and [[Precision]]:

[[Accuracy vs Precision]]

It is very interesting to look at the difference between [[Accuracy]] and [[Precision]]. This difference is well explained by the following diagram:

Accuracy vs Precision 2025-03-05 19.34.35.excalidraw.png

Lack of [[Accuracy]] corresponds to a [[#Systematic error]] (or bias) in sampling
Lack of [[Precision]] corresponds to a [[#Sampling error]].

Sampling errors

There are 2 types of sampling errors that may occur, summarised in the table below:

| Decision \ Truth | $H_0$ true | $H_0$ false |
| --- | --- | --- |
| Accept $H_0$ | Correct decision | [[#Type II error]] |
| Reject $H_0$ | [[#Type I error]] | Correct decision |

These errors occur when we test a hypothesis and the decision we take based on the p-value ends up being incorrect.

Type I error

Type I error occurs when the null hypothesis is true, but the sample leads us to reject it.

Type II error

Type II error occurs when the null hypothesis is false, but the sample leads us to accept it (that is, we fail to reject it).

Sample theory

Sample size

Before any investigation, it is important to determine the sample size relative to the population. Keep in mind that there is always a trade-off between the sample size and the resources available.

Sample size depends on various factors:

Steps in determining the sample size

Statistical terminology

Unit

unit

A unit is a single person, household or business that is intended to answer the survey.

Population

survey population

The population is the collection of units that the survey result should describe or explain. It's the set that includes all measurements of interest to the researcher.

A population is usually associated with a probability distribution. Therefore, we have a [[#Population distribution]]

Target population

target population

The target population is the [[#Population]] we are interested in studying.

Population distribution

population distribution

The population distribution is the probability distribution derived from the information on all elements of a population.

Sample

sample

A sample is a subset of the population for which the survey data is collected.

Sample distribution

sample distribution

The sample distribution is the probability distribution of a [[#Sample]] statistic ($\bar{x}$, $\hat{p}$, $s$).

Sampling frame

sampling frame

Sampling frame is the list of all sampling units from which the sample is drawn.

Sampling scheme

sampling scheme

The sampling scheme is the method of selecting sample units from the [[#Sampling frame]].

Why sampling

Sampling is necessary to get information about large populations.

Sampling schemes

Probability sampling

probability (or random) sampling

Probability (or random) sampling is where each object has a known, non-zero probability of being selected.

It can produce unbiased results (if there is no non-response) and it allows the sampling error to be calculated (if the pairwise selection probabilities are known). It is the most widely accepted sampling method.

It allows application of statistical sampling theory in order to:

And it ensures:

There are several methods in probability sampling:

Simple random sampling

In simple random sampling we simply select responders at random from a given list.

Systematic sampling

In systematic sampling, we select one respondent at random from a given list as the starting point; then, based on how many responses we need, we select every nth unit from the list.
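The two schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the frame of 100 household IDs are hypothetical, and the sampling interval is taken as the integer part of N/n.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units at random from the sampling frame."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Pick a random start, then take every k-th unit, with k = len(frame) // n."""
    rng = random.Random(seed)
    k = len(frame) // n          # sampling interval (assumes n <= len(frame))
    start = rng.randrange(k)     # random starting position in [0, k)
    return [frame[start + i * k] for i in range(n)]

# Hypothetical sampling frame: 100 household IDs
frame = list(range(1, 101))
srs = simple_random_sample(frame, 10, seed=1)
sys_sample = systematic_sample(frame, 10, seed=1)
```

In the systematic sample, consecutive selected IDs are always exactly `k` positions apart, which is what distinguishes it from the simple random sample.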

Stratified sampling

In stratified sampling we divide the population into strata (same age group, same occupation...), then we draw a random sample from every stratum.

Schermata 2025-03-06 alle 12.18.17.png
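A minimal sketch of stratified sampling, assuming proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation rules. The `people` list and the `age_group` attribute are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)      # group units by stratum
    sample = []
    for name in sorted(strata):              # fixed order for reproducibility
        members = strata[name]
        n_h = max(1, round(fraction * len(members)))  # at least one unit per stratum
        sample.extend(rng.sample(members, n_h))
    return sample

# Hypothetical population: 60 "young" and 40 "old" people
people = [{"id": i, "age_group": "young" if i < 60 else "old"} for i in range(100)]
sample = stratified_sample(people, lambda p: p["age_group"], fraction=0.1, seed=0)
```

With a 10% fraction, the sample keeps the 60/40 split of the population: 6 units from one stratum and 4 from the other.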

Cluster sampling

In cluster sampling, we divide the population into clusters, select some clusters at random, then survey every unit in each selected cluster.

Schermata 2025-03-06 alle 12.04.30.png
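Cluster sampling can be sketched in the same way. The clusters here are hypothetical zones, each holding five household IDs; the function name is an assumption for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Select n_clusters clusters at random and keep every unit inside them.

    clusters: dict mapping a cluster name to the list of its units.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical population: 8 zones with 5 households each
zones = {f"zone_{z}": [f"hh_{z}_{i}" for i in range(5)] for z in range(8)}
surveyed = cluster_sample(zones, 3, seed=42)
```

Unlike stratified sampling, only some groups are visited, but within a selected group every unit is surveyed.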

Multi-stage sampling

In multi-stage sampling we apply a combination of [[#Stratified sampling]] and [[#Cluster sampling]].

First, we select some strata at random, then, from the selected strata, we select units at random.

Schermata 2025-03-06 alle 12.21.50.png

Judgment sampling

judgment sampling

Judgment sampling involves choosing objects that it is believed will give accurate results.

Quota sampling

quota samples

Quota samples are based on selecting objects until you have a certain number (the quota) of each type. This appeals to the idea of a “representative” sample, but the result is usually biased. Quota sampling is still widely used (especially for telephone surveys with high non-response levels).

Convenience samples

Convenience samples are obtained by choosing the easiest objects available.

Sampling from a finite population

A simple random sample from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected.

Point estimation

In point estimation we use data from a [[#Sample]] to compute the value of a sample statistic, which we then use as an estimate of the corresponding population statistic.

point estimate

A point estimate is a statistic computed from a sample that gives a single value for the population parameter.

We have a population of size $N$, from which we select a sample of size $n$.

Let $Y$ be the (numeric) target variable. We may be interested, for example, in the population mean ($\mu$) or the population variance ($\sigma^2$).

Let the sample be the set of observations:

$$\{Y_1, \dots, Y_n\}$$

Then, the sample mean is:

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

Starting from $\bar{Y}$, we want a point estimate for the population mean.

It can be proved that the sample mean is an unbiased [[Estimator]] of the population mean, that is, its expected value ($E[\dots]$) is the population mean itself:

$$E[\bar{Y}] = \mu$$

This is not true for every statistic. If, for example, we define a statistic like:
$$\theta = \frac{\min(Y_1, \dots, Y_n) + \max(Y_1, \dots, Y_n)}{2}$$
then its expected value is not known in general:
$$E[\theta] = \text{???}$$
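A small simulation can make the contrast concrete. The sketch below assumes an exponential population with mean 2 (an arbitrary choice made only for illustration) and draws many samples of size 20: the average of the sample means settles near $\mu$, while the average of the statistic $\theta$ does not.

```python
import random
import statistics

rng = random.Random(0)
MU = 2.0                      # true mean of the assumed exponential population
n, replications = 20, 20000

means, midranges = [], []
for _ in range(replications):
    sample = [rng.expovariate(1 / MU) for _ in range(n)]
    means.append(statistics.fmean(sample))               # sample mean
    midranges.append((min(sample) + max(sample)) / 2)    # the statistic theta

avg_mean = statistics.fmean(means)           # close to MU
avg_midrange = statistics.fmean(midranges)   # clearly larger than MU here
```

For a skewed population like this one, $\theta$ is pulled upwards by the maximum, so it would be a biased estimate of the mean. For other population shapes the result can differ, which is why nothing general can be said about $E[\theta]$.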

Inference

As explained in [[#Point estimation]], we have a sample of a larger population. We can calculate some statistic of the sample. Then, we can use it to infer the corresponding statistic of the whole population.

Let's imagine we have several samples:

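We now have many observations of the sample mean. This statistic is itself a random variable. It has been proved that:

theorem

We can prove that:
$$E[\bar{X}] = \mu \qquad V[\bar{X}] = \frac{\sigma^2}{n} \qquad \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \ \text{if } n \text{ is large enough}$$

Then, we want an interval in which the population mean falls when it is estimated from the sample mean. We want a [[#Confidence Interval]] for some $\alpha$ (e.g. $\alpha = 0.05$):

$$\mu \in \left[\bar{X} - z_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \,;\ \bar{X} + z_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}\right]$$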

Confidence Interval

[[Confidence Interval]]

A confidence interval is an interval, computed from a sample statistic, that contains the corresponding population statistic with a given degree of confidence.

Let $\mu$ be the population mean of some variable and let $\bar{X}$ be the mean of a sample of the population. Given a [[Level of significance]] $\alpha$ (such that the interval contains $\mu$ in $(1-\alpha) \cdot 100\%$ of cases), a Confidence Interval for the population mean is:
$$\mu \in \left[\bar{X} - z_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \,;\ \bar{X} + z_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}\right]$$
where $z_{1-\frac{\alpha}{2}}$ is the value of $z$ (where $z$ follows a [[Standard Normal Distribution]]) such that the area under the curve to its left is equal to $1-\frac{\alpha}{2}$.

Let $z$ be the value we are looking for and let $Z$ be any possible value of the [[Standard Normal Distribution]]. We can evaluate the probability:
$$P(Z \le z)$$
as the area under the curve in the interval $(-\infty, z)$. If, for example, we want our estimate to be accurate with a Degree of confidence of 95%, we need to select a range wide enough that the area above it is at least 0.95. Since the curve is symmetric and we want the 95% to be the area in the middle, the area in each tail has to be $\frac{0.05}{2} = 0.025$. So, if we find the value of $z$ such that the area in the interval $(-\infty, -z)$ is 0.025 (equivalently, the area in $(-\infty, z)$ is 0.975), the range $[-z, z]$ ensures that the estimate is accurate at 95%.

To find this value we use statistical tables. For the [[Standard Normal Distribution]] you can read more about them in [[Standard Normal Distribution#Statistical tables]].

Confidence Interval 2025-03-16 19.27.14.excalidraw.png

We use the Standard normal distribution since it is well known. Then, we scale it by a factor that depends on the variance of the variable we are interested in: $\frac{\sigma}{\sqrt{n}}$.

A particular variable of a population has a mean $\mu$ (unknown). We have a sample of the population whose mean for the same variable is $\bar{X}$. Let's suppose we know the population variance, $\sigma^2$. The sample is composed of $n$ elements.

We want an estimate for the population mean with 95% confidence. This means we want $\alpha = 0.05$.

We are basically looking for the value:
$$z_{1-\frac{\alpha}{2}} = z_{1-\frac{0.05}{2}} = z_{1-0.025} = z_{0.975}$$
that is, the value such that the area under the density curve to its left is equal to 0.975.

From the probability table we find that this value is:
$$z = 1.960$$
given a large number of observations.

  • α: [[Level of significance]]
  • The percentage is the [[Degree of confidence]]
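The table lookup and the interval itself can be reproduced with Python's standard library. This is a sketch for the case where $\sigma$ is known; the function name and the numbers in the example (sample mean 30, $\sigma = 8$, $n = 400$) are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, sigma, n, alpha=0.05):
    """Confidence interval for the population mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

z_975 = NormalDist().inv_cdf(0.975)           # the value read from the table
low, high = mean_confidence_interval(x_bar=30.0, sigma=8.0, n=400)
```

Here `z_975` rounds to 1.960, matching the table value, and the interval is roughly 30 ± 0.78.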

We are assuming that we know the variance of the population. This is rarely possible in practice. Actually, the variance, like the mean, also follows a probability distribution (that can be proven to be a ???). For the purposes we are usually interested in, we can simply use the sample variance ($s^2$) in place of the population variance.

Common estimator

Point estimator of population mean

point estimator of population mean

The point estimator for the population Mean $\mu$ is the quantity:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where $\bar{x}$ is the [[#Sample]] mean.

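Point estimator of population variance

point estimator of population standard deviation

The point estimator for the population Standard Deviation $\sigma$ is the quantity:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
The corresponding point estimator for the population variance $\sigma^2$ is $s^2$.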

Point estimator of population proportion

point estimator of population proportion

The point estimator for the population Proportion $p$ is the quantity:
$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where $x$ is a binary variable (1 for true, 0 for false) and $\hat{p}$ is the number of trues over the total.
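The three point estimators can be computed directly with Python's `statistics` module. The data below are invented for illustration: five trip durations in minutes and a binary indicator of whether each respondent used transit.

```python
import statistics

# Hypothetical sample of 5 respondents
trip_minutes = [12.0, 25.0, 18.0, 40.0, 30.0]
used_transit = [1, 0, 0, 1, 1]            # binary variable: 1 = true, 0 = false

x_bar = statistics.mean(trip_minutes)     # point estimate of the population mean
s = statistics.stdev(trip_minutes)        # sample standard deviation (n - 1 divisor)
p_hat = sum(used_transit) / len(used_transit)   # point estimate of the proportion
```

Note that `statistics.stdev` uses the $n-1$ divisor, whereas `statistics.pstdev` divides by $n$ and is meant for a full population.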

Inference of population mean

02 - Data and Space - TDBM 2025-03-18 11.09.46.excalidraw.png

Sample distribution of sample mean

The [[#Sample distribution]] of the sample mean $\bar{x}$ is the probability distribution of all possible values of $\bar{x}$.

In order to define it, we need to know:

Expected value of sample mean
theorem

The expected value of the sampling distribution of $\bar{x}$ is equal to the mean of the population. Thus:
$$E[\bar{x}] = \mu_{\bar{x}} = \mu$$

Standard deviation of sample mean

The [[Standard Deviation]] of the [[#Sample]] mean depends on the size of the population.
Remember:

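INFINITE POPULATION ($N > 500000$)
Then the standard deviation of the sample mean ($\sigma_{\bar{x}}$) is:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is simply the population standard deviation.

FINITE POPULATION AND $\frac{n}{N} \le 0.05$
Then we need to apply the correction factor $\sqrt{\frac{N-n}{N-1}}$:

$$\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}} \, \frac{\sigma}{\sqrt{n}}$$

FINITE POPULATION AND $\frac{n}{N} > 0.05$
Then we need to apply the correction factor $\sqrt{\frac{N-n}{N}}$:

$$\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N}} \, \frac{\sigma}{\sqrt{n}}$$

Two properties follow:

  1. The standard deviation of the sample mean is smaller than the standard deviation of the corresponding population distribution: $\sigma_{\bar{x}} < \sigma$
  2. The standard deviation of the sampling distribution of $\bar{x}$ decreases as the sample size increases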
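The three cases can be collected in one small function. This sketch follows the case split exactly as written in these notes (including the 500000 threshold for treating the population as infinite); the function name and the example numbers are assumptions for illustration.

```python
from math import sqrt

def std_error_mean(sigma, n, N=None):
    """Standard deviation of the sample mean, following the cases in these notes.

    sigma: population standard deviation, n: sample size,
    N: population size (None means the population is treated as infinite).
    """
    base = sigma / sqrt(n)
    if N is None or N > 500_000:
        return base                                # infinite population
    if n / N <= 0.05:
        return sqrt((N - n) / (N - 1)) * base      # finite, small sampling fraction
    return sqrt((N - n) / N) * base                # finite, large sampling fraction

se_infinite = std_error_mean(sigma=10.0, n=100)
se_small_fraction = std_error_mean(sigma=10.0, n=100, N=10_000)
se_large_fraction = std_error_mean(sigma=10.0, n=100, N=1_000)
```

With these numbers the correction is barely visible when the sampling fraction is 1%, and becomes noticeable (about 5% smaller) when the sample is 10% of the population.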

Form of the sample distribution of sample mean

The [[#Sample distribution]] of the sample mean follows different [[Random variables]] distributions depending on the [[#Population distribution]].

Population follows a Normal distribution.
Then the sampling distribution of the sample mean is also normally distributed for any sample size.

$$X \sim N(\mu, \sigma^2) \implies \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

Population is not normally distributed AND sample size is large ($n > 30$).
Then the sampling distribution of the sample mean is also approximately normally distributed, irrespective of the shape of the population distribution. If the population distribution is heavily skewed, we need a sample size of $n > 50$ or more.
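The second case can be checked numerically. The sketch below assumes a strongly skewed exponential population (mean 1, standard deviation 1, chosen only for illustration) and samples of size $n = 50$: the spread of the simulated sample means should be close to $\frac{\sigma}{\sqrt{n}}$, and about 95% of them should fall within $1.96$ standard errors of $\mu$, as a normal distribution would predict.

```python
import random
import statistics
from math import sqrt

rng = random.Random(1)
MU = SIGMA = 1.0            # an exponential population with rate 1 has mean 1 and std 1
n, replications = 50, 10000

sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(replications)
]

observed_se = statistics.stdev(sample_means)   # spread of the simulated sample means
theoretical_se = SIGMA / sqrt(n)               # sigma / sqrt(n)

# Share of sample means within 1.96 standard errors of the population mean
inside = sum(abs(m - MU) <= 1.96 * theoretical_se for m in sample_means) / replications
```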

Inference of population proportion

02 - Data and Space - TDBM 2025-03-18 12.02.00.excalidraw.png

Sample distribution of sample proportion

The [[#Sample distribution]] of the sample proportion $\hat{p}$ is the probability distribution of all possible values of $\hat{p}$.

In order to define it, we need to know:

Expected value of sample proportion
theorem

The expected value of the sampling distribution of $\hat{p}$ is equal to the population proportion. Thus:
$$E[\hat{p}] = p$$

Standard deviation of sample proportion

The [[Standard Deviation]] of the [[#Sample]] proportion depends on the size of the population.
Remember:

INFINITE POPULATION ($N > 500000$)
Then the standard deviation of the sample proportion ($\sigma_{\hat{p}}$) is simply:

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$

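FINITE POPULATION AND $\frac{n}{N} \le 0.05$
Then the standard deviation of the sample proportion ($\sigma_{\hat{p}}$) is (depending on whether we know the value of $p$):

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \qquad p \text{ unknown:} \quad s_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}}$$

FINITE POPULATION AND $\frac{n}{N} > 0.05$
Then the standard deviation of the sample proportion ($\sigma_{\hat{p}}$) is (depending on whether we know the value of $p$):

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \sqrt{\frac{N-n}{N}} \qquad p \text{ unknown:} \quad s_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}} \sqrt{\frac{N-n}{N}}$$

Two properties follow:

  1. The standard deviation of the sample proportion is smaller than the standard deviation of the corresponding population distribution: $\sigma_{\hat{p}} < \sigma$
  2. The standard deviation of the sampling distribution of $\hat{p}$ decreases as the sample size increases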
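For the usual situation where $p$ is unknown, the estimated standard deviation of the sample proportion can be sketched as follows. The function follows the finite-population cases as written in these notes (the $n-1$ divisor, and the correction factor only when $\frac{n}{N} > 0.05$); its name and the example numbers are assumptions for illustration.

```python
from math import sqrt

def std_error_proportion(p_hat, n, N=None):
    """Estimated standard deviation of a sample proportion when p is unknown.

    p_hat: sample proportion, n: sample size,
    N: population size (None means no finite population correction).
    """
    se = sqrt(p_hat * (1 - p_hat) / (n - 1))
    if N is not None and n / N > 0.05:
        se *= sqrt((N - n) / N)        # finite population correction
    return se

se_no_correction = std_error_proportion(p_hat=0.3, n=101)
se_corrected = std_error_proportion(p_hat=0.3, n=101, N=1_000)
```

In the second call the sample is about 10% of the population, so the correction factor applies and the estimated standard deviation shrinks by roughly 5%.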

Form of the sample distribution of sample proportion

The [[#Sample distribution]] of the sample proportion follows a [[Normal distribution]] whenever the sample size $n$ is large enough.

The [[#Sample size]] is large enough when both of the following conditions are met:

❗❗❗❗❗❗❗❗❗❗❗❗
❗❗❗ TO BE COMPLETED ❗❗❗
❗❗❗❗❗❗❗❗❗❗❗❗ from slide 114 to the end