Statistics

Mathematics and Problem Solving

11. Statistics

David Gundry

Learning Outcomes

Understand the basics of probability distributions, data collection and sampling

Recognise common probability distributions
Identify the type of data of a range of variables
Identify dependent and independent variables

Calculate and interpret common descriptive statistics

Appropriately use measures of central tendency to describe data
Calculate Arithmetic Mean
Calculate Median
Calculate Mode
Appropriately use measures of spread to describe data
Calculate Range
Calculate Inter-quartile Range
Calculate Variance
Calculate Standard Deviation
Calculate measures of correlation for paired data

Create statistical models of data

Approximate the parameters to a normal distribution from data
Perform a linear regression to model a linear relationship between variables

Getting Started

Discrete and Continuous

Which of the following are discrete, which are continuous? Give the probability mass function or probability density function for each.

Rolling a biased dice where six is twice as likely as any other number
Discrete. A fair dice would have a uniform probability distribution. In this case, the distribution can be defined by the probability mass function:
\[ P(X = x) = \begin{cases} 2/7 & \text{if } x = 6 \\ 1/7 & \text{otherwise} \end{cases} \]
Python's random.random()
Approximately continuous for most purposes: a uniform distribution on $[0, 1)$ with probability density function $f(x) = 1$ for $0 \le x < 1$ and $0$ otherwise. As the numbers it generates only have a fixed precision, technically this distribution is discrete with a very large number of possible values. Python generates these floats with 53 bits of precision, so this is a discrete distribution with $ 2^{53} $ possible values, each with an approximately uniform probability. The statistical properties of random number generators are important for cryptography.
A random integer 0-100
Discrete. Uniform distribution. If we assume the range is inclusive at the lower end and exclusive at the top end, there are 100 possible values, each with an equal probability:
\[ P(X = x) = 1/100 \]
Flipping a fair coin ten times and counting the number of heads
Discrete. Binomial distribution, where $n = 10$ and $p=0.5$
\[ P(X=x) = ^nC_x \times p^x \times (1-p)^{n-x} \]
Jelenkovic et al (2016) studied the heights of twins in Europe, North America and Australia, and East Asia. Across all cohorts, women born in 1980-1994 had a mean height of 164.7cm and a standard deviation of 7.07cm.
Continuous. Normal distribution with mean $\mu = 164.7$ and standard deviation $\sigma = 7.07$
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-{\frac{1}{2}(\frac{x - \mu}{\sigma})^2}} \]
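The probability mass functions above can be checked numerically. The following is a minimal Python sketch, not part of the original exercise: the names `die_pmf` and `binomial_pmf` are illustrative. It confirms that each mass function sums to 1 over all outcomes, and evaluates the chance of exactly five heads in ten fair coin flips.

```python
from fractions import Fraction
from math import comb

# Biased die: six is twice as likely as any other face.
die_pmf = {face: Fraction(2, 7) if face == 6 else Fraction(1, 7)
           for face in range(1, 7)}


def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials, each succeeding with probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# A valid probability mass function sums to 1 over all possible outcomes.
print(sum(die_pmf.values()))                             # 1
print(sum(binomial_pmf(x, 10, 0.5) for x in range(11)))  # 1.0

# Probability of exactly five heads in ten flips of a fair coin.
print(binomial_pmf(5, 10, 0.5))                          # 0.24609375
```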

Types of Data

For each of the variables below, say whether it is nominal, ordinal, or numeric. What do you think the underlying probability distribution looks like?

Idle temperature of a CPU
Numeric
Gender
Nominal
Age
Numeric
Height
Numeric
Paper size (A3, Foolscap, Letter, A4, etc.)
Nominal usually, but could be ordinal if we are e.g. considering the area of the paper.
Dog breed ordered by average height
Ordinal, because we are treating the dog breeds as ordered

Probability Distributions

Normal Distribution

Go to the online graphing calculator Desmos. This allows you to type in functions and see them plotted on a graph. To the left of the screen you will see boxes labelled '1' and '2': this is where you type the functions to plot. Type in

normaldist(1,1)

Change the arguments to the function to change the parameters of the distribution.

To find the cumulative probability of an area under the graph, click 'Find Cumulative Probability' and then enter lower and upper limits. For example, to find the probability of picking a number from a normal distribution with $\mu = 0$ and $\sigma=1$ between 2 and 3, you would enter the expression 'normaldist(0,1)', and click 'Find Cumulative Probability', entering 2 as the min bound and 3 as the max bound. It would give you the answer of approximately 0.02. Leave one of the bounds blank to default to infinity.

In this way, find the cumulative probability of picking a value $x$ from a normal distribution with parameters $\mu = 0, \sigma = 1$ where

$-1 < x < 1$
$0.68$
$-2 < x < 2$
$0.95$
$-3 < x < 3$
$0.997$
$-\infty < x < \infty$
$1$

Note that as $\sigma = 1$, our $x$ values are expressed in standard deviations.
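If you want to cross-check the Desmos answers, the same cumulative probabilities can be computed in Python. This is a sketch using `NormalDist` from the standard library `statistics` module (Python 3.8 or later); the helper name `probability_between` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def probability_between(dist, lower, upper):
    """P(lower < X < upper), as the difference of two cumulative probabilities."""
    return dist.cdf(upper) - dist.cdf(lower)


# Probability of falling within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(k, round(probability_between(standard_normal, -k, k), 3))

# The worked example from the text: P(2 < x < 3) is approximately 0.02.
print(round(probability_between(standard_normal, 2, 3), 2))
```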

Normal Distribution

Bell curve distributed around mean $\mu$ with standard deviation $\sigma$

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-{\frac{1}{2}(\frac{x - \mu}{\sigma})^2}} \]

Binomial Distribution

Using Desmos as described above, plot a binomial distribution by typing in the function

binomialdist(1,0.5)

Change the parameters to observe the effects on the distribution. The first parameter $n$ is the number of trials. The second parameter $p$ is the probability that a trial will be a success.

Find parameters $n, p$ of a binomial distribution such that the cumulative probability of drawing a value $x$ within the given range is as specified.

$ \sum_{x=0}^{1} \text{binomialdist}(n, p) = 1 $
$n=1, p=0.5$
$ \sum_{x=0}^{1} \text{binomialdist}(n, p) = 0.75 $
$n=2, p=0.5$
$ \sum_{x=0}^{2} \text{binomialdist}(n, p) = 0.5 $
$n=5, p=0.5$
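These answers can be verified by summing the binomial probability mass function over the stated range. The Python sketch below assumes nothing beyond the formula for $P(X=x)$; the function name `binomial_range_probability` is illustrative.

```python
from math import comb


def binomial_range_probability(low, high, n, p):
    """Sum P(X = x) for x from low to high inclusive."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(low, high + 1))


print(binomial_range_probability(0, 1, n=1, p=0.5))  # 1.0
print(binomial_range_probability(0, 1, n=2, p=0.5))  # 0.75
print(binomial_range_probability(0, 2, n=5, p=0.5))  # 0.5
```

Other parameter pairs may satisfy the same conditions; the ones shown are simply the answers given above.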

Binomial Distribution

Probability of number of successes

$n$ trials with $p$ probability of a success

\[ P(X=x) = ^nC_x \times p^x \times (1-p)^{n-x} \]

Statistics

Descriptive statistics

Consider the following data:

\[ X = \{ 4, 6, 7, 11, 12, 14, 15, 18, 19, 28, 35 \} \]

Give a five-number summary of the data, providing $min(X), Q_1, Q_2, Q_3$ and $max(X)$.
$$ min(X) = 4 $$ $$ Q_1 = 7 $$ $$ Q_2 = 14 $$ $$ Q_3 = 19 $$ $$ max(X) = 35$$
Calculate the mean and the range between pairs of quartiles. Use these values to consider whether the data is approximately symmetrical. You may wish to plot these values using a box plot.
$$ Q_2 - Q_1 = 7 $$ $$ Q_3 - Q_2 = 5 $$ $$ Q_3 - Q_1 = 12 $$ $$ mean(X) = 15.36$$
The median (14) is slightly lower than the mean (15.36). The gap between the median and the lower quartile (7) is slightly bigger than the gap between the median and the upper quartile (5). Overall the data is approximately symmetrical. This can be seen by plotting these values using a box plot.
Test for outliers using Tukey's fences with a value of $k = 1.5$.
We can test for outliers using Tukey's Fences with $k=1.5$:
$$ [Q_1 - k(Q_3 - Q_1), Q_3 + k(Q_3 - Q_1)] $$ $$ [7 - 1.5(19 - 7), 19 + 1.5(19 - 7)] $$ $$ [-11, 37] $$
There are no outliers, as all values are within $ [-11, 37] $
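The whole calculation can be reproduced in a few lines of Python. This sketch assumes the quartile convention used in the answers above (the median of the lower and upper halves, leaving out the overall median when the sample size is odd). Other conventions exist and can give slightly different quartiles.

```python
from statistics import mean, median

data = sorted([4, 6, 7, 11, 12, 14, 15, 18, 19, 28, 35])
n = len(data)

# Quartiles as medians of the lower and upper halves; for an odd-sized
# sample the overall median is left out of both halves.
lower_half = data[: n // 2]
upper_half = data[n // 2 + n % 2:]
q1, q2, q3 = median(lower_half), median(data), median(upper_half)

# Tukey's fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged.
k = 1.5
iqr = q3 - q1
fence_low, fence_high = q1 - k * iqr, q3 + k * iqr
outliers = [x for x in data if x < fence_low or x > fence_high]

print(min(data), q1, q2, q3, max(data))  # 4 7 14 19 35
print(round(mean(data), 2))              # 15.36
print(fence_low, fence_high, outliers)   # -11.0 37.0 []
```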

Approximate parameters to normal distribution

You measure the ages of a sample of 5 children in a nursery and get the values: 2, 2, 3, 4, 4. You suspect that the ages of children in the nursery are drawn from a normal distribution.

Approximate the parameters to this distribution.

We need to estimate the mean $\mu$ and the standard deviation. The sample mean $\overline{x}$ serves as the estimate of $\mu$. As this is a sample of children, we need to use the formula for the sample standard deviation $s$:

\[ s = \sqrt{\frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n - 1}} \]

\[ \mu = 3 \quad s = 1 \]
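The same estimate can be obtained with Python's standard library. In this sketch, `statistics.stdev` applies the sample formula (dividing by $n - 1$), and the variable names are illustrative.

```python
from statistics import NormalDist, mean, stdev

ages = [2, 2, 3, 4, 4]

x_bar = mean(ages)   # sample mean, used as the estimate of mu
s = stdev(ages)      # sample standard deviation (divides by n - 1)

# A normal distribution with the estimated parameters.
fitted = NormalDist(mu=x_bar, sigma=s)

print(x_bar, s)  # 3 1.0
```

With only five observations this is a very rough estimate, so treat the fitted distribution as an approximation only.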

Correlation

The online graphing calculator Desmos, as used in previous exercises, can also be used to calculate correlations. For example, consider the following data:

i	1	2	3	4	5	6	7	8	9	10	11
x	12	13	16	18	12	14	16	27	26	29	30
y	11	15	13	17	16	18	19	30	19	27	35

To calculate the Pearson's correlation coefficient for this data, we can use the 'corr' function in Desmos, passing as its arguments two lists. We get an answer of approximately $r = 0.87$.

corr([12, 13, 16, 18, 12, 14, 16, 27, 26, 29, 30],[11, 15, 13, 17, 16, 18, 19, 30, 19, 27, 35])

Calculate the correlation between $x$ and $y$ in the following data set:

i	1	2	3	4	5	6	7	8	9	10	11
x	54	33	66	54	43	66	67	45	23	12	5
y	25	43	37	43	16	53	98	66	34	44	9

$$r = 0.505$$

Pearson's r

Compare each $(x_i, y_i)$ to $ \overline{x} $, $ \overline{y} $

\[ r = \frac{\sum( (x_i - \overline{x})(y_i - \overline{y}))}{\sqrt{\sum(x_i - \overline{x})^2 \sum(y_i - \overline{y})^2}} \]

Value between -1 (perfect negative correlation) and 1 (perfect positive correlation)
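To see how the formula works term by term, here is a minimal Python sketch of Pearson's r applied to the two data sets from this section. The function name `pearson_r` is an illustrative choice; it is a direct transcription of the formula rather than how Desmos implements `corr`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the two means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


example_x = [12, 13, 16, 18, 12, 14, 16, 27, 26, 29, 30]
example_y = [11, 15, 13, 17, 16, 18, 19, 30, 19, 27, 35]
exercise_x = [54, 33, 66, 54, 43, 66, 67, 45, 23, 12, 5]
exercise_y = [25, 43, 37, 43, 16, 53, 98, 66, 34, 44, 9]

print(round(pearson_r(example_x, example_y), 2))    # 0.87
print(round(pearson_r(exercise_x, exercise_y), 3))  # 0.505
```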

Linear Regression

Using the online linear regression calculator on GraphPad, give a formula for a straight line that models the relationship between $x$ and $y$ using the following data.

i	1	2	3	4	5	6	7	8	9	10	11
x	12	13	16	18	12	14	16	27	26	29	30
y	11	15	13	17	16	18	19	30	19	27	35

Enter the $x$ and $y$ values into the calculator. It provides you with the formula under 'Data > Equation'.
With your model, you can now approximate the value of $y$ for an arbitrary value of $x$. What is the predicted value of $y$ when $x=20$?

$$ Y = 0.9127*X + 2.326 $$

When $x = 20$, $y = 20.58$

Linear Regression

Model each point as $Y_i = a + bX_i + \epsilon_i$

Find values for $a$ and $b$ to minimise the total squared error $\sum \epsilon_i^2$

Least Squares Method can be used for this

\[ b = \frac{\sum((x_i - \overline{x})(y_i - \overline{y}))}{\sum(x_i - \overline{x})^2} \]

\[ a = \overline{y} - b\overline{x} \]
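The least squares formulas for $a$ and $b$ can be applied directly to the data from the exercise. The Python sketch below uses illustrative names (`fit_line`, `predict`) and should reproduce the coefficients reported by the online calculator, up to rounding.

```python
xs = [12, 13, 16, 18, 12, 14, 16, 27, 26, 29, 30]
ys = [11, 15, 13, 17, 16, 18, 19, 30, 19, 27, 35]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of products of deviations over sum of squared x deviations.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(xs, ys)


def predict(x):
    """Predicted y for a given x under the fitted model."""
    return a + b * x


print(round(b, 4), round(a, 3), round(predict(20), 2))  # 0.9127 2.326 20.58
```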