Statistics
In this lecture we will see how to apply statistics and probability to judge how well a hypothesis is supported by data. The mathematics we will cover forms the foundation of statistics for experimental research.
This lecture aims to give you an intuitive understanding of why inferential statistical tests work the way they do, and practical tools for planning and analysing your own research.
Introduction
Inferential statistics allow us to make inferences from a sample of data. Most often this takes the form of a hypothesis test: we collect some data to infer whether a given hypothesis is likely to be true.
The fundamental problem of statistics is that we want to describe a population of potential measurements based on a limited amount of data. As it is impractical or impossible to collect all the data we might need, we need to make judgements based on probability.
For most types of quantitative research we start by collecting data from a sample that is representative of the population. We perform some statistical analysis on that data in order to generalise our findings from the sample to the wider population with some level of confidence. As we cannot be sure that the results of our sample are the same as for the overall population that it is meant to represent, we express our results in terms of probabilities. We use inferential statistics to allow us to make these generalisations based on sound mathematical principles.
Any inferential statistical analysis relies on certain assumptions being correct. One necessary assumption is that the sample is representative of the population. We usually need to assume that the population was sampled at random, meaning each individual has an equal chance of being included in the sample. When we interpret the results of a statistical test, we need to make sure we understand the assumptions that it makes and be able to judge whether these assumptions are reasonable.
Learning Outcomes
- Calculate the z statistic
- Calculate the t statistic
- Interpret a test statistic by comparing it to critical values
Overview
We will start by familiarising ourselves with the core concepts and terminology that are used in inferential statistics. We will then build an intuitive understanding of how inferential statistical tests work by unpacking two basic statistical tests, the Z test and Student's T test. Finally, we will discuss how to choose between different inferential statistics.
Core Concepts
Sample
The data that we use for inferential statistics comes from a sample of a wider population. For this sample, we can calculate various statistics. Where our sampled variable is drawn from a normal distribution, we need the mean and the standard deviation of the sample, as these are enough to approximate the probability distribution that the sample came from.
Hypothesis
Our goal is to support or refute particular claims about the world (or the system under test). These claims take the form of hypotheses. Hypotheses can be understood in terms of probability as disjoint (mutually exclusive) events.
We have one (or more) alternate hypotheses that describe the situation or event we are expecting to find or observe. In addition to our alternate hypotheses, we state a null hypothesis, which describes the situation or event in which there is no effect, no difference, or nothing unusual going on.
For example, imagine that we are investigating the intelligence of bears and we have a particular bear who we think is pretty clever. To turn this into a testable hypothesis, we would state the hypothesis that our bear is smarter than the average bear. This is our alternate hypothesis - the effect we are expecting to find. Our null hypothesis would be that there is no difference in intelligence between our bear and the average bear.
Most hypotheses are of the form that such-and-such a group or such-and-such an individual differ on some variable compared to the wider population, or that two groups are different from each other on some variable. In terms of probability distributions, we are asking whether the samples are likely to be drawn from the same probability distribution, or whether they likely come from different probability distributions.
Test Statistic
Most inferential statistics test for differences between two or more sets of data. However, what counts as a difference? It is not enough to test whether there is a difference, because small differences are likely to arise due to chance variation. Rather than saying how big the difference is, we want to say how unexpected it is. Inferential test statistics express how unexpected an observed difference is.
Each test statistic is expressed on a different scale, and they cannot be directly compared. To interpret an inferential statistic, we consider the probability distribution of that test statistic. We calculate the probability of observing a test statistic of at least that size.
P Value
The result at the end of our statistical analysis is a p value.
The p value is the probability of observing a test statistic at least as extreme as the one we calculated, assuming that the null hypothesis is true.
The smaller the p value, the less plausible it is that the difference we observed arose by chance alone.
The p value is not the probability that the null hypothesis is true; it is the probability of data at least as extreme as ours, given the null hypothesis.
Alpha
The alpha level, or significance level, is a threshold we choose before running the test: it is the probability of a false positive (rejecting the null hypothesis when it is in fact true) that we are willing to accept, conventionally 0.05.
If the p value is less than alpha, we reject the null hypothesis and describe the result as statistically significant; otherwise we fail to reject the null hypothesis.
Multiple Testing
We can think of running a hypothesis test against an alpha (significance level) of 0.05 as like rolling a 20-sided die. There is a 1 in 20 chance (0.05) of rolling a 20. Here rolling a 20 represents a false positive: the situation where we incorrectly conclude that we observed an effect.
From probability, we can calculate that the probability of getting at least one 20 if we roll the die twice is $1 - (19/20)^2 \approx 0.0975$, nearly double the chance for a single roll.
In any situation where we are testing (effectively) the same hypothesis multiple times - and where we would report having detected an effect based on any individual test finding an effect - we need to reduce our alpha to adjust for this increased probability of a type 1 error (false positive). (This is also why it is important for null results to be published - if only positive results are published, there will be an inflated false-positive rate.)
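As a minimal sketch (not from the lecture) of how quickly this inflation grows, and of the simple Bonferroni adjustment that divides alpha by the number of tests:

```python
# Sketch: family-wise error rate for repeated independent tests at alpha = 0.05,
# and the Bonferroni-adjusted per-test alpha. The numbers of tests are illustrative.
alpha = 0.05

for n_tests in (1, 2, 5, 10, 20):
    # Probability of at least one false positive across n independent tests,
    # analogous to rolling at least one 20 in n rolls of a 20-sided die.
    fwer = 1 - (1 - alpha) ** n_tests
    # Bonferroni correction: test each hypothesis at alpha / n_tests instead.
    adjusted_alpha = alpha / n_tests
    print(f"{n_tests:>2} tests: P(at least one false positive) = {fwer:.3f}, "
          f"Bonferroni-adjusted alpha = {adjusted_alpha:.4f}")
```

Note that the Bonferroni correction is only one (conservative) way of adjusting for multiple testing; the lecture does not prescribe a particular method.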
Degrees of Freedom
The degrees of freedom for a statistic is the number of values used in its calculation that are free to vary once the value of the statistic is fixed. This can be confusing to understand, so let's consider a simple example with the calculation of the arithmetic mean of a sample of four numbers:

$$\bar{x} = \frac{3 + 4 + 6 + 7}{4} = 5$$

Here we have four values, or data points, involved in the calculation: 3, 4, 6, and 7. Were we to report the mean of this set of numbers (which, as we can see, is 5), we could say that the mean is 5 with 3 degrees of freedom. To understand why there are three degrees of freedom, imagine for a moment that you do not know what data this statistic was calculated from.
Now, assuming that you know the mean is 5 and that the sample size was 4 (i.e. we found the mean of four numbers), only three of the four data points are free to vary independently. This is because, while three of the values could have been any number, the fourth value is completely determined by the other three and the mean. In other words, if I gave you three values and the mean, you could work out the fourth value. Algebraically, you can see this as the following problem, where you can solve for $x$:

$$\frac{3 + 4 + 6 + x}{4} = 5$$

Thus the value of $x$ must be 7; given the mean and the other three values, it is not free to vary. In general, the mean of a sample of size $n$ has $n - 1$ degrees of freedom.
Understanding Inferential Statistics
Basic Intuition
A basic inferential statistic is the ratio of the difference we observed to the typical difference we would expect. The larger the observed difference is relative to the expected one, the larger our inferential statistic.
The observed and expected values vary depending on our experiment. We might be testing whether an individual is significantly different from a group, in which case our observed value would be the individual measurement and the expected value would be the group mean. The observed difference is then the difference between our individual and the group mean.
The typical difference is also called the standard error, and we will see some examples of how it is calculated for different probability distributions below.
Fish Example
Imagine we know that on average fish are 150 units long. We measure a fish and find it is 175 units long. We can calculate the difference that we have observed by comparing what we observed (175) to what we expected (150).
We have a difference of 25, but to determine whether this difference is interesting we must compare it against the typical variation in fish size. We need to know how much fish sizes typically vary, i.e. the standard deviation of fish size measurements.
Let us assume the standard deviation of fish sizes is 10 units.
Then 175 is over 2 standard deviations away from the mean. As 95% of the data falls within 2 standard deviations of the mean, our fish is in at least the most extreme 5% of fish sizes. An interesting fish!
(Fish images by Vecteezy)
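Putting numbers to this (the standard deviation of 10 units is the assumption made above, not a measured value), the calculation is simply:

$$\frac{\text{observed} - \text{expected}}{\text{standard deviation}} = \frac{175 - 150}{10} = 2.5$$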
Sampling Distributions
The sampling distribution is the statistical distribution that your observed value comes from. For example if we sample individuals, our sampling distribution is the same as our population distribution. In the example above, we sampled individual fish, so our sampling distribution was the same as our population distribution, which was a normal distribution.
However, often the observed value is not an individual measurement, but the mean of a group of measurements. Means of values drawn from a given distribution have a different distribution to the values themselves. Consider that it is much more likely to get a single extreme value than to average several values and still get an extreme mean. Therefore, when we sample group means of size $n$, our sampling distribution is not the same as the population distribution: it is narrower.
According to the Central Limit Theorem, for large sample sizes, the sampling distribution for means approximates a normal distribution with a mean equal to the population mean and a standard deviation equal to $\sigma / \sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is the group size.
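A minimal simulation, with made-up parameters, can make this concrete: the standard deviation of the simulated sample means comes out close to $\sigma / \sqrt{n}$.

```python
import numpy as np

# Simulate many samples of size n from a normal population and look at the
# distribution of their means. Parameters (mu, sigma, n) are illustrative.
rng = np.random.default_rng(0)
mu, sigma, n = 150, 10, 25

sample_means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())   # close to mu (150)
print("sd of sample means:  ", sample_means.std())    # close to sigma / sqrt(n)
print("sigma / sqrt(n):     ", sigma / np.sqrt(n))    # = 2.0
```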
Standard Error
Standard error is the standard deviation of our sampling distribution. For our fish example, our standard error would be the standard deviation of our data (assuming our individuals are normally distributed).
When comparing group means, we need to adjust our standard error, as our sampling distribution is no longer the same as the population distribution. It is a distribution formed of means of groups of size $n$, so the standard error becomes

$$SE = \frac{s}{\sqrt{n}}$$

where $s$ is the sample standard deviation.
When comparing differences in group means, again we have a different sampling distribution, so we need a different formula for our standard error:

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

where $s_1$, $s_2$ are the standard deviations and $n_1$, $n_2$ the sizes of the two groups.
Z Statistic
Let us begin with our basic formula for an inferential statistic:

$$\text{test statistic} = \frac{\text{observed difference}}{\text{standard error}}$$

Consider the situation where we observe a single data point and we are comparing it to the population mean. Our observation is a data point, $x$, our expected value is the population mean, $\mu$, and our standard error is the population standard deviation, $\sigma$. This gives us the z statistic:

$$z = \frac{x - \mu}{\sigma}$$

For a sample mean $\bar{x}$ compared to the population mean, the standard error is $\sigma / \sqrt{n}$ instead, giving $z = (\bar{x} - \mu) / (\sigma / \sqrt{n})$.
Z Distribution
Imagine we randomly sampled a normal distribution and calculated z statistics for single values, for sample means, or for comparing two sample means using the formulas above.
If we generate a random sample, calculate $z$, and repeat this many times, the z values we obtain form their own distribution.
As it happens, the distribution of $z$ is the standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1.
Now we can compare the value we have computed for $z$ against this distribution and ask how likely we would be to observe a value at least as extreme as ours if the null hypothesis were true.
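As a short sketch (with made-up numbers, echoing the fish example), this is how the z statistic and its two-tailed p value could be computed using SciPy's standard normal distribution:

```python
from scipy import stats

# Single observation compared against a known population mean and standard deviation.
x, mu, sigma = 175, 150, 10   # illustrative values

z = (x - mu) / sigma          # z = 2.5

# Two-tailed p value: probability of a value at least this extreme in either tail.
p = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")
```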
T Statistic
We rarely know the population standard deviation, and thus we cannot often use a z test. Instead, we estimate the population standard deviation using the sample standard deviation, $s$.
Because of the error inherent in using an approximation of the standard deviation, the values that we derive are different. This is most evident when sample sizes are smaller than about 30. When we use the approximation of the standard deviation, the resulting statistic is called a t statistic; for a sample mean compared against a population mean it is

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
Like Z tests, T tests assume the data follow a normal distribution, which makes them parametric tests. They similarly assume representative, randomly selected samples and similar variance between groups.
T Distribution
As the t statistic is calculated from an estimated rather than a known standard deviation, its sampling distribution is not quite the standard normal distribution: the t distribution has heavier tails, particularly for small samples, reflecting the extra uncertainty in the estimate.
When the distribution of the t statistic is plotted for increasing degrees of freedom, it gets closer and closer to the standard normal distribution; at around 30 or more degrees of freedom the two are practically indistinguishable.
Critical Thresholds
For each test statistic, we can calculate the value that corresponds to a particular probability. For example, we can calculate the z value for which the probability of generating a value at least as extreme is 0.05. If our significance level is set to 0.05, we can then compare our calculated test statistic against this critical value: if our statistic is more extreme, the result is significant at that level.
Find t tables at t-tables.net
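Instead of printed tables, critical values can also be looked up in software. A brief sketch using SciPy (alpha of 0.05, two-tailed, degrees of freedom chosen for illustration):

```python
from scipy import stats

alpha = 0.05

# Two-tailed critical value for the z distribution (2.5% in each tail), ~1.96.
z_crit = stats.norm.ppf(1 - alpha / 2)

# Corresponding t critical values: note how they approach the z value as the
# degrees of freedom increase.
for df in (5, 10, 30, 100):
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    print(f"df = {df:>3}: critical t = {t_crit:.3f}")

print(f"critical z = {z_crit:.3f}")
```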
Effect Size
Effect size is usually calculated using the formula for Cohen's $d$:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s}$$

This is the observed difference between the samples as a proportion of the standard deviation. If the standard deviation varies between groups, the pooled standard deviation must be used:

$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

where $s_1$ and $s_2$ are the standard deviations of the two groups and $n_1$ and $n_2$ are their sample sizes.
Effect size is a unitless measure that can be used to compare the effect sizes of different studies. A very approximate rule of thumb, where no more specific guidance is available, is:
| Cohen's $d$ | Description |
|---|---|
| 0.2 | Small |
| 0.5 | Medium |
| 0.8 | Large |
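A small sketch, with made-up data, of computing Cohen's $d$ using the pooled standard deviation formula given above (the function name and sample values are illustrative):

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d for two groups, using the pooled standard deviation."""
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    n1, n2 = len(g1), len(g2)
    s1, s2 = g1.std(ddof=1), g2.std(ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (g1.mean() - g2.mean()) / pooled_sd

# Illustrative data: d comes out around 0.63, a medium effect by the table above.
print(cohens_d([5, 6, 7, 8, 9], [4, 5, 6, 7, 8]))
```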
Inferential Statistics: 5 Steps
To determine whether sample means come from the same population, we follow five steps using an inferential statistic:
1. State Hypothesis
State the null hypothesis $H_0$ and the alternate hypothesis $H_1$. Results are stated in terms of the probability of obtaining the observed data if the null hypothesis were true.
2. Level of Significance
Decide on the alpha level, or level of significance. This is the probability of observing a result by chance, or of observing a result if the two samples were drawn from the same population. This is usually set to 0.05.
3. Computing Test Statistic
Use a statistical test formula to derive an appropriate test statistic (e.g. $z$ or $t$).
4. Obtain Critical Value
A value derived from the distribution of the test statistic for a given number of degrees of freedom and the chosen significance level.
5. Reject or Fail to Reject
The calculated value is compared to the critical value to determine whether the difference is significant enough to reject $H_0$ at the predetermined level of significance.
If the (absolute value of the) test statistic is greater than the critical value, then we reject $H_0$; otherwise we fail to reject $H_0$.
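As an end-to-end sketch of the five steps, here is a one-sample t test with made-up data (the sample values and hypothesised mean are illustrative, not from the lecture):

```python
from scipy import stats

# Made-up sample of measurements; H0 says the population mean is 300.
sample = [312, 295, 330, 305, 318, 299, 325, 310, 302, 321]
mu_0 = 300

# 1. State hypotheses: H0: mean = 300, H1: mean != 300 (two-tailed).
# 2. Level of significance.
alpha = 0.05

# 3. Compute the test statistic (and p value).
t_stat, p_value = stats.ttest_1samp(sample, mu_0)

# 4. Obtain the critical value for df = n - 1 at the chosen alpha.
df = len(sample) - 1
t_crit = stats.t.ppf(1 - alpha / 2, df)

# 5. Reject or fail to reject H0.
decision = "reject H0" if abs(t_stat) > t_crit else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, critical t = {t_crit:.3f} -> {decision}")
```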
Choosing Inferential Statistics
Independent or Paired
Is the data in the two (or more) sets paired, or is every data point independent of every other? That is, can you match records between the sets? If yes, they are paired. For example, a before-and-after test with the same respondents gives paired data.
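A brief sketch of how this choice plays out in SciPy (the scores are made up): paired data use a paired test on the within-pair differences, while unrelated groups use an independent test.

```python
from scipy import stats

# Before/after scores for the same five respondents: paired data.
before = [12, 15, 11, 14, 13]
after  = [14, 17, 12, 15, 15]

# Paired test: compares the differences within each matched pair.
print(stats.ttest_rel(before, after))

# If the two sets of scores came from unrelated groups instead, an independent
# test would be appropriate.
print(stats.ttest_ind(before, after))
```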
Parametric or Non-Parametric
Most statistical tests assume your data are normally distributed. If they are, use a parametric test; if not, use a non-parametric test.
How do we decide? Test for normality using the Shapiro-Wilk or the Kolmogorov-Smirnov test; Shapiro-Wilk is generally considered the better of the two.
For both tests, if $p < 0.05$ the data deviate significantly from a normal distribution, and a non-parametric test should be used.
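A short sketch of running these normality checks in SciPy (the data are randomly generated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=5, size=40)   # made-up, normally distributed data

# Shapiro-Wilk test for normality.
w, p = stats.shapiro(data)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")

# Kolmogorov-Smirnov test against a normal distribution fitted to the sample.
print(stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1))))

# In both cases, p < 0.05 would suggest the data deviate from normality,
# pointing towards a non-parametric test.
```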
One or Two Tailed
You must decide whether to run a one or a two-tailed test. Ideally this should be a principled decision based on the nature of the hypothesis.
When determining the critical value for a statistic - that is, what value counts as sufficiently unexpected to be deemed statistically significant - you need to decide whether you are looking for values that are either unexpectedly large or small, or whether you are only looking for values that are unexpectedly large, or whether you are only looking for values that are unexpectedly small. In other words, does any extreme value count, or does it only count if it is extreme in a particular direction?
Although the significance level is the same (e.g. 0.05) whether you are running a one- or two-tailed test, the critical values will be lower in a one-tailed test than in a two-tailed test. As you will only accept results in one direction, there is no possibility of a false positive in the other direction; all of the chance of a false positive is concentrated in one tail.
In the slides you can see two graphs. The first represents a two-tailed test: if the highlighted area adds up to 0.95, there is a 0.05 chance of a value falling in one of the two tails that are not highlighted. The second graph represents a one-tailed test: now the area where significance is achieved is concentrated on one side, so the critical threshold is lower for the same significance level.
Choosing one- or two-tailed tests
Your default choice should be to use a two-tailed test. Even when you have a directed hypothesis, a two-tailed test will be less likely to make a type 1 error (false positive), so it is the conservative option.
However, if you hypothesise that your dependent and independent variables have a clear directional relationship, you might choose to use a one-tailed test. This might be because there are strong theoretical reasons to expect a particular directed result. Or it might be that a result in the direction other than the one you expect would not be meaningful or interesting. For example, if you test whether a game is more enjoyable than a control condition, it might be that finding the opposite (the control is more enjoyable than the game) just demonstrates that your study is flawed rather than providing any meaningful conclusions.
You do not need a strong principled reason to use a one-tailed test. However, because significance is easier to achieve with a one-tailed test, readers may be suspicious if its use is not justified, particularly if your results only just achieve significance. If you have not pre-registered your analysis, a reader might suspect you chose a one-tailed test only after you saw the direction of the results. Doing this significantly increases the risk of a type 1 error and is bad statistical practice.
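In practice, the tail choice is usually a parameter of the test function. A sketch with made-up enjoyment ratings, using the `alternative` argument available in recent versions of SciPy:

```python
from scipy import stats

game    = [7.1, 6.8, 7.5, 7.9, 6.9, 7.4]   # illustrative ratings
control = [6.5, 6.9, 6.2, 6.8, 6.4, 6.6]

# Two-tailed (the default): is there a difference in either direction?
print(stats.ttest_ind(game, control))

# One-tailed: is the game rated higher than the control?
print(stats.ttest_ind(game, control, alternative="greater"))
```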
Chi-Squared Test
Used to test whether categorical variables are independent.
Do observed frequencies differ from expected frequencies in a statistically significant way?
A Chi-squared goodness of fit test is used to test whether a set of observed frequencies differs from a given set of expected frequencies.
A Chi-squared test of independence is used to test whether two categorical variables are associated with each other or independent.
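A minimal sketch of both variants in SciPy, using made-up frequency counts:

```python
from scipy import stats

# Goodness of fit: do observed die-roll counts differ from equal expected counts?
observed = [18, 22, 16, 14, 19, 31]        # counts for faces 1-6 (illustrative)
print(stats.chisquare(observed))           # expected frequencies default to uniform

# Test of independence: a made-up 2x2 contingency table of two categorical variables.
table = [[30, 10],
         [20, 25]]
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```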