Statistics practical

Statistics

Mathematics and Problem Solving

11. Statistics

David Gundry

Learning Outcomes

    Understand the basics of probability distributions, data collection and sampling
    1. Recognise common probability distributions
    2. Identify the type of data of a range of variables
    3. Identify dependent and independent variables
    Calculate and interpret common descriptive statistics
    1. Appropriately use measures of central tendency to describe data
    2. Calculate Arithmetic Mean
    3. Calculate Median
    4. Calculate Mode
    5. Appropriately use measures of spread to describe data
    6. Calculate Range
    7. Calculate Inter-quartile Range
    8. Calculate Variance
    9. Calculate Standard Deviation
    10. Calculate measures of correlation for paired data
    Create statistical models of data
    1. Approximate the parameters to a normal distribution from data
    2. Perform a linear regression to model a linear relationship between variables

Getting Started

Discrete and Continuous

Which of the following are discrete, which are continuous? Give the probability mass function or probability density function for each.

  1. Rolling a biased dice where six is twice as likely as any other number

    Discrete. A fair dice would have a uniform probability distribution. In this case, the distribution can be defined by the probability mass function:

    \[ P(X = x) = \begin{cases} 2/7 & x = 6 \\ 1/7 & \text{otherwise} \end{cases} \]
  2. Python's random.random()

    Approximately continuous for most purposes. Because the numbers it generates have a fixed precision, the distribution is technically discrete, with a very large number of possible values. Roughly, if the generated number were stored with 32 bits, this would be a discrete distribution with \( 2^{32} \) possible values, each with an approximately uniform probability. (CPython's random.random() actually returns floats with 53 bits of precision in the range $[0, 1)$.) The statistical properties of random number generators are important for cryptography.

  3. A random integer 0-100

    Discrete. Uniform distribution. If we assume the range is inclusive at the lower end and exclusive at the upper end, there are 100 possible values (0 to 99), each with an equal probability:

    \[ P(X = x) = 1/100 \]
  4. Flipping a fair coin ten times and counting the number of heads

    Discrete. Binomial distribution, where $n = 10$ and $p=0.5$

    \[ P(X=x) = ^nC_x \times p^x \times (1-p)^{n-x} \]
  5. Jelenkovic et al (2016) studied the heights of twins in Europe, North America and Australia, and East Asia. Across all cohorts, women born in 1980-1994 had a mean height of 164.7cm and a standard deviation of 7.07cm.

    Continuous. Normal distribution with mean $\mu = 164.7$ and standard deviation $\sigma = 7.07$

    \[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-{\frac{1}{2}(\frac{x - \mu}{\sigma})^2}} \]
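
As an optional check on the biased dice answer above, the following Python sketch encodes its probability mass function and simulates rolls. The names `pmf` and `roll` are illustrative and not part of the practical; the sampling relies on the standard library's `random.choices` with weights.

```python
import random
from fractions import Fraction

# PMF of the biased dice: six is twice as likely as each other face.
pmf = {face: Fraction(2 if face == 6 else 1, 7) for face in range(1, 7)}


def roll(rng=random):
    """Draw one face according to the PMF (illustrative helper)."""
    faces = list(pmf)
    weights = [float(pmf[face]) for face in faces]
    return rng.choices(faces, weights=weights, k=1)[0]


if __name__ == "__main__":
    # A valid PMF must sum to exactly 1.
    print("total probability:", sum(pmf.values()))
    rolls = [roll() for _ in range(70_000)]
    # The observed share of sixes should be close to 2/7.
    print("share of sixes:", rolls.count(6) / len(rolls))
```

Using `Fraction` keeps the probabilities exact, so the check that they sum to 1 does not depend on floating-point rounding.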

Types of Data

For each of the variables below, say whether they are nominal, ordinal, or numeric. What do you think the underlying probability distribution looks like?

  1. Idle temperature of a CPU
    Numeric
  2. Gender
    Nominal
  3. Age
    Numeric
  4. Height
    Numeric
  5. Paper size (A3, Foolscap, Letter, A4, etc.)
    Nominal usually, but could be ordinal if we are e.g. considering the area of the paper.
  6. Dog breed ordered by average height
    Ordinal, because we are treating the dog breeds as ordered

Probability Distributions

Normal Distribution

Go to the online graphing calculator Desmos. This allows you to type in functions and see them plotted on a graph. To the left of the screen you will see boxes labelled '1' and '2': this is where you type the functions to plot. Type in

normaldist(1,1)

Change the arguments to the function to change the parameters of the distribution.

To find the probability that a value falls in a given interval (the area under the curve between two bounds), click 'Find Cumulative Probability' and then enter lower and upper limits. For example, to find the probability of picking a number between 2 and 3 from a normal distribution with $\mu = 0$ and $\sigma=1$, you would enter the expression 'normaldist(0,1)', click 'Find Cumulative Probability', and enter 2 as the min bound and 3 as the max bound. It would give you the answer of approximately 0.02. Leave one of the bounds blank to leave that side unbounded (it defaults to infinity).

In this way, find the cumulative probability of picking a value $x$ from a normal distribution with parameters $\mu = 0, \sigma = 1$ where

  1. $-1 < x < 1$
    $0.68$
  2. $-2 < x < 2$
    $0.95$
  3. $-3 < x < 3$
    $0.997$
  4. $-\infty < x < \infty$
    $1$

Note that as $\sigma = 1$, our $x$ values are expressed in standard deviations.
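
The same interval probabilities can be cross-checked outside Desmos. The sketch below is one way to do it in Python, assuming the standard identity that expresses the normal cumulative distribution function through the error function `math.erf`; the helper names `normal_cdf` and `normal_prob` are made up for this example.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def normal_prob(lower, upper, mu=0.0, sigma=1.0):
    """P(lower < X < upper); pass -math.inf or math.inf for an open end."""
    return normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)


if __name__ == "__main__":
    # Probability of falling within k standard deviations of the mean.
    for k in (1, 2, 3):
        print(k, round(normal_prob(-k, k), 3))
    # The worked example from the text: between 2 and 3.
    print(round(normal_prob(2, 3), 3))
```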

Normal Distribution

Bell curve distributed around mean $\mu$ with standard deviation $\sigma$

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-{\frac{1}{2}(\frac{x - \mu}{\sigma})^2}} \]

Binomial Distribution

Using Desmos as described above, plot a binomial distribution by typing in the function

binomialdist(1,0.5)

Change the parameters to observe the effects on the distribution. The first parameter $n$ is the number of trials. The second parameter $p$ is the probability that a trial will be a success.

Find parameters $n, p$ of a binomial distribution such that the cumulative probability of drawing a value $x$ in the given range is as specified.

  1. \( \sum_{x=0}^{1} \text{binomialdist}(n, p) = 1 \)
    $n=1, p=0.5$
  2. \( \sum_{x=0}^{1} \text{binomialdist}(n, p) = 0.75 \)
    $n=2, p=0.5$
  3. \( \sum_{x=0}^{2} \text{binomialdist}(n, p) = 0.5 \)
    $n=5, p=0.5$

Binomial Distribution

Probability of number of successes

$n$ trials with $p$ probability of a success

\[ P(X=x) = ^nC_x \times p^x \times (1-p)^{n-x} \]
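
The formula above translates almost directly into Python. This is a minimal sketch using `math.comb` for $^nC_x$; the function names `binomial_pmf` and `binomial_range` are illustrative only. It can be used to check candidate values of $n$ and $p$ for the exercise.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x): probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_range(low, high, n, p):
    """P(low <= X <= high), summing the PMF over the inclusive range."""
    return sum(binomial_pmf(x, n, p) for x in range(low, high + 1))


if __name__ == "__main__":
    # Candidate parameters (n, p) for the three exercise ranges.
    print(binomial_range(0, 1, n=1, p=0.5))
    print(binomial_range(0, 1, n=2, p=0.5))
    print(binomial_range(0, 2, n=5, p=0.5))
```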

Statistics

Descriptive statistics

Consider the following data:

\[ X = \{ 4, 6, 7, 11, 12, 14, 15, 18, 19, 28, 35 \} \]
  1. Give a five-number summary of the data, providing $min(X), Q_1, Q_2, Q_3$ and $max(X)$.
    $$ min(X) = 4 $$ $$ Q_1 = 7 $$ $$ Q_2 = 14 $$ $$ Q_3 = 19 $$ $$ max(X) = 35$$
  2. Calculate the mean and the range between pairs of quartiles. Use these values to consider whether the data is approximately symmetrical. You may wish to plot these values using a box plot.
    $$ Q_2 - Q_1 = 7 $$ $$ Q_3 - Q_2 = 5 $$ $$ Q_3 - Q_1 = 12 $$ $$ mean(X) = 169/11 \approx 15.36$$

    The median (14) is slightly lower than the mean (about 15.36), and the gap $Q_2 - Q_1 = 7$ is slightly larger than $Q_3 - Q_2 = 5$. Overall the data is approximately symmetrical. This is clear if we plot these values using a box plot:

    [Box plot of $X$; axis ticks at 5, 10, 15, 20, 25, 30, 35]
  3. Test for outliers using Tukey's fences with a value of $k = 1.5$.

    We can test for outliers using Tukey's Fences with $k=1.5$:

    $$ [Q_1 - k(Q_3 - Q_1), Q_3 + k(Q_3 - Q_1)] $$ $$ [7 - 1.5(19 - 7), 19 + 1.5(19 - 7)] $$ $$ [-11, 37] $$

    There are no outliers, as all values are within $ [-11, 37] $

Approximate parameters to normal distribution

You measure the ages of a sample of 5 children in a nursery and get the values: 2, 2, 3, 4, 4. You suspect that the ages of children in the nursery are drawn from a normal distribution.

Approximate the parameters to this distribution.

We need to estimate the mean $\mu$ and the standard deviation. The sample mean $\overline{x}$ is our estimate of $\mu$. As this is a sample of children, we use the formula for the sample standard deviation $s$:

\[ s = \sqrt{\frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n - 1}} \]
\[ \mu = 3 \quad s = 1 \]
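
As a quick check of these values, the sketch below estimates both parameters with Python's `statistics` module, where `statistics.stdev` uses the $n - 1$ denominator of the sample formula. The function name `estimate_normal_params` is invented for this example.

```python
import statistics


def estimate_normal_params(sample):
    """Estimate (mu, s) from a sample: sample mean and sample std dev."""
    mu = statistics.mean(sample)
    s = statistics.stdev(sample)  # divides by n - 1, not n
    return mu, s


if __name__ == "__main__":
    ages = [2, 2, 3, 4, 4]
    mu, s = estimate_normal_params(ages)
    print("mean:", mu, "sample standard deviation:", s)
```

For comparison, `statistics.pstdev` divides by $n$ and would be appropriate only if the five children were the whole population rather than a sample.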

Correlation

The online graphing calculator Desmos, as used in previous exercises, can also be used to calculate correlations. For example, consider the following data:

i    1   2   3   4   5   6   7   8   9  10  11
x   12  13  16  18  12  14  16  27  26  29  30
y   11  15  13  17  16  18  19  30  19  27  35

To calculate Pearson's correlation coefficient for this data, we can use the 'corr' function in Desmos, passing two lists as its arguments. We get an answer of approximately $r = 0.87$.

corr([12, 13, 16, 18, 12, 14, 16, 27, 26, 29, 30],[11, 15, 13, 17, 16, 18, 19, 30, 19, 27, 35])

Calculate the correlation between $x$ and $y$ in the following data set:

i    1   2   3   4   5   6   7   8   9  10  11
x   54  33  66  54  43  66  67  45  23  12   5
y   25  43  37  43  16  53  98  66  34  44   9
$$r = 0.505$$

Pearson's r

Compare each $(x_i, y_i)$ to \( \overline{x} \), \( \overline{y} \)

\[ r = \frac{\sum( (x_i - \overline{x})(y_i - \overline{y}))}{\sqrt{\sum(x_i - \overline{x})^2 \sum(y_i - \overline{y})^2}} \]

Value between -1 (perfect negative correlation) and 1 (perfect positive correlation)
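
The formula for $r$ can also be computed by hand in a few lines of Python. The sketch below follows the formula term by term rather than calling a library routine; `pearson_r` is an illustrative name, and the data are the lists from the Desmos `corr` example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


if __name__ == "__main__":
    x = [12, 13, 16, 18, 12, 14, 16, 27, 26, 29, 30]
    y = [11, 15, 13, 17, 16, 18, 19, 30, 19, 27, 35]
    print(round(pearson_r(x, y), 2))
```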

Linear Regression

Using the online linear regression calculator on GraphPad, give a formula for a straight line that models the relationship between $x$ and $y$ using the following data.

i    1   2   3   4   5   6   7   8   9  10  11
x   12  13  16  18  12  14  16  27  26  29  30
y   11  15  13  17  16  18  19  30  19  27  35
  1. Enter the $x$ and $y$ values into the calculator. It provides you with the formula under 'Data > Equation'
  2. With your model, you can now approximate the value of $y$ for an arbitrary value of $x$. What is the predicted value of $y$ when $x=20$?
$$ Y = 0.9127*X + 2.326 $$

When $x = 20$, $y = 20.58$

Linear Regression

Model each point as $Y_i = a + bX_i + \epsilon_i$

Find values for $a$ and $b$ to minimise the total squared error $\sum \epsilon_i^2$

Least Squares Method can be used for this

\[ b = \frac{\sum((x_i - \overline{x})(y_i - \overline{y}))}{\sum(x_i - \overline{x})^2} \]
\[ a = \overline{y} - b\overline{x} \]
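
The least squares formulas for $a$ and $b$ can be applied directly to the data from the exercise. The following is a minimal sketch, not a replacement for the GraphPad calculator; `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx            # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


def predict(x, a, b):
    """Predicted y for a given x under the fitted line."""
    return a + b * x


if __name__ == "__main__":
    x = [12, 13, 16, 18, 12, 14, 16, 27, 26, 29, 30]
    y = [11, 15, 13, 17, 16, 18, 19, 30, 19, 27, 35]
    a, b = fit_line(x, y)
    print("intercept:", round(a, 3), "slope:", round(b, 4))
    print("prediction at x = 20:", round(predict(20, a, b), 2))
```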