Probability
Almost any application of mathematics to solve real world problems involves probability at some point. Probability is extensively used when we work with data, such as in machine learning.
Learning Outcomes
- Derive probabilities of an event based on the assumptions of classical probability
- Apply the addition rule to find probabilities of unions of events
- Apply the multiplication rule to find probabilities of co-occurrence of events
- Apply the conditional probability formula to find conditional probabilities
- Apply Bayes' rule
Introduction
In all domains of life we experience events whose outcome is uncertain. Some uncertainties are resolved fairly quickly; we may not even pay much attention to them. Perhaps you turned on a kettle this morning to make a cup of tea. Yet, until the kettle was making noise and the water started to steam, you could not, of course, be certain that it would work; it might have broken. Turning on the kettle this morning was an experiment. It was a procedure by which you discovered something unknown ("is my kettle still working?").
Whether or not a kettle is working depends on the state of the electronics which ultimately is determined by the laws of physics. In principle, using the tools of mathematics we might be able to give a model of the kettle that is deterministic (produces a fixed answer). However, it is generally easier and more useful to model the uncertainty itself. What is the probability that the kettle will be working? How, for instance, does the probability of the kettle working change in response to different events we might observe in the world?
Experiments
An experiment is any situation (real or hypothetical) where the outcome is uncertain. The outcome of an experiment is random or stochastic. In other words (based on what you know), you cannot reliably predict it in advance. For example, when you click the 'Roll Dice' button below, you are unable to reliably predict what number it will show. The 'make request' button makes an HTTP request to the server this website is hosted on and measures the time taken to get a response. It's impossible to predict exactly what the response time will be: another experiment.
It is often valuable to be able to reason about the outcome of experiments, for example if you were betting on the outcome of the dice roll or optimising a webserver.
When we reason about experiments using the mathematics of probability, we are building a model of reality that can only approximate a real-world situation. For example, I might choose to model flipping a coin as having two possible outcomes: heads or tails. Yet in the real world this modelling assumption might be violated - the coin might land perfectly balanced on its side.
Overview
We will begin by understanding the basic terms used in probability: Outcomes, Events, and Sample Spaces. We will then see an approach known as Classical Probability.
We will then turn to the fundamental principles of probability, starting with the three Kolmogorov axioms that are the foundation of the modern mathematics of probability. We will see some of the implications of these axioms.
We will go on to see the basics of how we can work with probabilities. We will see probability trees, which are a tool that will help us to reason about more complicated situations.
Events
Probability is built using concepts from set theory. For the rest of this lecture, we will be applying set theory operations such as \( \cup \) and \( \cap \). If you are not comfortable with what these mean, revise the lectures on Set Theory.
Events and Sample Spaces
A sample space, \( \Omega \), is defined as the set of all possible outcomes of an experiment. Imagine this like a dartboard. When you perform an experiment, or sample the sample space (or observe a random variable) you are throwing a dart at this dartboard. You will always get exactly one outcome on the dartboard.
An event, \( E \), is a set of outcomes. An event might be a singleton set (i.e. containing only a single outcome). If so it is called an elementary event. Alternatively, it might contain multiple outcomes.
For example, if I perform the experiment of rolling a fair dice, my sample space would be \( \Omega = \{ 1, 2, 3, 4, 5, 6 \} \). Every subset of $\Omega$ is an event. For example, the set \( \{ 2, 4, 6 \} \) is the event where an even number is rolled; \( \{1, 5, 6\} \) is the event where either a 1, 5, or 6 is rolled; and so on.
Examples of events
Here we are going to look at three different ways of formalising the same experiment.
Experiment A: Imagine we flip two coins and look at both coins. There are four possible outcomes of our experiment. Assuming both coins are fair, each outcome has an equal probability. The sample space for this experiment is: \( \Omega_A = \{ (H,H), (H,T), (T,H), (T,T) \} \)
Experiment B: However, now imagine we flip two coins and count the number of heads. This is a slightly different experiment. Here our sample space contains three possible outcomes of our experiment, and these outcomes do not have equal probabilities: \( \Omega_B = \{ 0, 1, 2 \} \)
Experiment C: Finally, now imagine we flip two coins and just look to see whether we get at least one head. We have a different sample space again: \( \Omega_C = \{ \text{at least one head}, \text{no heads} \} \). These outcomes do not have equal probabilities.
Imagine we want to consider the likelihood we get at least one head. We can use any of the experiments described here to do this. While we are asking the same question of all three experiments, in each case we will be interested in an event which contains different outcomes. The probability of each event would be the same, however.
Classical probability
There is a simple way to work out probabilities if all outcomes in our sample space are equally likely. Classical Probability defines the probability of an event as the proportion of favourable outcomes (outcomes where the event occurs) out of the total number of outcomes possible.
We will write the probability of an event $E$ as $P(E)$, where $P$ is the probability function. Below, $\Omega$ is the sample space and \( \#E \) denotes the number of outcomes in the event \( E \): $$P(E) = \frac{\#E}{\#\Omega}$$
Classical probability examples
Consider the experiment of rolling a fair dice. Assuming our sample space contains six outcomes which are all equally likely, we can work out the probability of rolling a particular number to be $1/6$
To do this, let our sample space \( \Omega = \{1, 2, 3, 4, 5, 6\} \). In other words, we assume that the only outcomes from our experiment can be rolling one of six numbers. We can see that $\#\Omega = 6$. Imagine we are considering the elementary event where we roll a 4. Let event \( E = \{4\} \). We can see that $\#E = 1$. Thus \( P(E) = \frac{\#E}{\#\Omega} = \frac{1}{6} \). Thus, the probability of rolling a 4 on a six sided dice is $1/6$.
Consider the example above of flipping two coins and getting at least one head. We can calculate the probability using Classical Probability if we model the situation as Experiment A, because in Experiment A we assume that each outcome is equally likely.
Let our sample space \( \Omega_A = \{ (H,H), (H, T), (T, H), (T,T) \} \). We can see that $\#\Omega_A = 4$. Let $H$ be the event where we get at least one head: \( H = \{ (H,H), (H, T), (T, H) \} \). We can see that $\#H = 3$. Thus \( P(H) = \frac{\#H}{\#\Omega_A} = \frac{3}{4} \). Thus, the probability of getting at least one head when flipping two coins is $3/4$.
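If you want to sanity-check this result numerically, here is a minimal Python sketch (our own illustration, not part of the lecture; the function name `at_least_one_head` is made up) that estimates the probability by repeatedly simulating the experiment:

```python
import random

def at_least_one_head(trials=100_000):
    """Estimate P(at least one head) when flipping two fair coins."""
    hits = 0
    for _ in range(trials):
        flips = [random.choice("HT") for _ in range(2)]  # one outcome from Omega_A
        if "H" in flips:  # the event H = {(H,H), (H,T), (T,H)}
            hits += 1
    return hits / trials

print(at_least_one_head())  # typically prints a value close to 0.75
```

As the number of trials grows, the estimate converges on the classical answer of $3/4$.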
Kolmogorov Axioms
Axioms are basic assumptions that are used as a foundation for proofs and deductions. The three Kolmogorov axioms provide a foundation for probability.
First Axiom
The first axiom describes what a probability is: it is a real number (i.e. with decimal places) that is not less than 0.
"The probability of an event is a non-negative real number."
We represent probabilities as numbers (specifically, real numbers). The smallest possible probability is 0, and no event can have a probability less than this. Formally, for every event \( E \): \( P(E) \in \mathbb{R} \) and \( P(E) \geq 0 \).
Second Axiom
The second axiom says that what it means to be an experiment is that there will be an outcome. This axiom also defines what number will be used to represent certainty.
"The probability that at least one elementary event in the sample space will occur is one."
If \( \Omega \) is a sample space, then it is certain that at least one elementary event in \( \Omega \) will occur. In probability, certainty is represented as a probability of 1. Formally: \( P(\Omega) = 1 \).
Third axiom
The third axiom states that we can add the probabilities of certain kinds of events.
"The probability of any countable sequence of disjoint (i.e. mutually exclusive) events \( E_1, E_2, E_3 \dots \) is equal to the sum of the probabilities of the individual events."
In other words, we can add up the probabilities of events to find the probability that any of those events happens. Specifically, this applies to events that are mutually exclusive (i.e. non-overlapping); it does not apply to events that can co-occur. $$P\left( \bigcup_{i=1}^{\infty} E_i \right) = \sum_{i=1}^{\infty} P(E_i)$$
If this is hard to read, we can simplify the notation for the example where we only have two mutually exclusive events \( A \) and \( B \): $$P(A \cup B) = P(A) + P(B)$$
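As a small worked instance of this simplified form (our own example, using the fair dice from earlier): the events \( \{1\} \) and \( \{3\} \) are mutually exclusive, so

$$P(\{1\} \cup \{3\}) = P(\{1\}) + P(\{3\}) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$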
Deck of Cards
What is the probability of drawing a heart from a 52 card deck?
A 52 card deck contains 4 suits (one of which being hearts), 13 cards of each suit, and no jokers. As the question does not say we draw multiple cards, we will assume we draw a single card. We can represent this situation with a sample space $\Omega$ that contains 52 possible outcomes. The event, which we will call $H$, where a heart is drawn is a subset containing 13 of these outcomes (as there are 13 cards that are hearts).
This situation is quite tractable to work out by counting and using classical probability: \( P(H) = \frac{\#H}{\#\Omega} = \frac{13}{52} = \frac{1}{4} \).
Now let us work this out based on Kolmogorov's axioms. Let \( \Omega = \{ c_1, ..., c_{52} \} \) be a sample space (where $c_n$ is the outcome of drawing a particular card, and there are 52 such cards). Let $C_n$ be the elementary event \( \{c_n\} \).
As the deck is perfectly shuffled, we will assume that the probability of each elementary event is the same, i.e. we are no more likely to draw one card than any other. Thus $P(C_n) = k$, for all $n$. Here $k$ is some constant.
As we know $\# \Omega = 52$, it follows that there are 52 elementary events \( C_1, \dots, C_{52} \), as there must be one elementary event per outcome. Elementary events must be mutually exclusive as they each contain only a single outcome. According to Kolmogorov's third axiom, we can add together the probabilities of mutually exclusive events to get the probability of any of them occurring. Thus the probability of drawing any card will be $52k$, where $k$ is the constant introduced above.
As $P(\Omega) = 1$, we can work out that: $$P(\Omega) = 52k = 1, \quad \text{thus} \quad k = \frac{1}{52}$$
Let \( H_1, \dots, H_{13} \) be the elementary events where a heart is drawn. As these are elementary events, we know they are mutually exclusive. We know that drawing any particular heart is as likely as drawing any other card, thus we know $P(H_n) = P(C_n) = 1/52$.
Finally, we can apply the formula given in the third axiom to these mutually exclusive events: $$P(H) = P(H_1 \cup \dots \cup H_{13}) = \sum_{n=1}^{13} P(H_n) = 13 \times \frac{1}{52} = \frac{1}{4}$$
Implications of Kolmogorov Axioms
The probabilities of events are monotonic.
If \( A \subseteq B \) then \( P(A) \leq P(B) \)
Proof
We can write $B$ as the union of two non-overlapping subsets: \( B = (B \setminus A) \cup (A \cap B) \). Further, as we assume \( A \subseteq B \), we have $A \cap B = A$. Thus $B = A \cup (B \setminus A)$.
Let $C = B \setminus A$. Thus, using Kolmogorov's third axiom, we can write the probability of $B$ as the sum of the probabilities of the two mutually exclusive events $A$ and $C$: $$P(B) = P(A) + P(C)$$
From the first axiom we know that probabilities are non-negative real numbers, thus \( P(C) \geq 0 \).
As we are adding a non-negative number to $P(A)$, we conclude that $P(B)$ must be greater than or equal to $P(A)$. Equivalently, $P(A)$ must be less than or equal to $P(B)$.
Numeric Bound
We know that probability is non-negative, i.e. it must be at least 0. We also know that the probability of the sample space \( \Omega \) is 1. The sample space is the set of all possible outcomes, therefore all other events must be subsets of the sample space. Therefore, from the property of monotonicity, all events must have a probability in the range 0 to 1: $$0 \leq P(E) \leq 1$$
Complement Rule
The complement of an event \( A \) is the event where $A$ does not occur. For example, if you roll a dice and say that $A$ is the event where an odd number is rolled, then the complement of $A$, written \( A^c \), is the event where an odd number is not rolled. In this example, this is equivalent to rolling an even number.
The complement of an event, \( A^c \), is defined as the set of all outcomes in the sample space $\Omega$ that are not in $A$. It follows that the union of an event and its complement equals the sample space, and thus their probabilities sum to \( P(\Omega) \), which is 1.
The probability of an event and its complement add up to 1: $$P(A) + P(A^c) = 1$$
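For instance (our own quick example): if \( A \) is the event of rolling a 6 on a fair dice, then

$$P(A^c) = 1 - P(A) = 1 - \frac{1}{6} = \frac{5}{6}$$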
Working with Probabilities
Addition Property
It is common to want to find the likelihood that either event $E_1$ or event $E_2$ occurs. This means we are interested in the event formed by the union $E_1 \cup E_2$. To find this, we can add the probabilities of $E_1$ and $E_2$ and subtract the probability of their intersection (if we did not subtract it, we would count the outcomes in the intersection twice, once for each event).
If events are sets of outcomes, the set-theoretic union of two events is the event that either event occurs
Addition Rule:
The combined probability of two events $A$ and $B$ is the sum of the probabilities of those events, less the probability that those events co-occur: $$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
We will see how to work out the probability of the intersection of two events below.
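In the meantime, here is a minimal Python sketch (our own illustration; the sets and the helper `p` are made up for this example) that checks the addition rule by counting outcomes on the dice sample space, where the intersection can be found directly:

```python
# Check P(A ∪ B) = P(A) + P(B) - P(A ∩ B) by counting dice outcomes.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # an even number is rolled
B = {4, 5, 6}  # a number greater than 3 is rolled

def p(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return len(event) / len(omega)

lhs = p(A | B)                # direct: P(A ∪ B) = P({2, 4, 5, 6})
rhs = p(A) + p(B) - p(A & B)  # addition rule
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)               # both 0.666...
```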
Intersection
We often want to know when two events both occur. For example, we might want to know the probability of rolling two dice and getting a six on both. For two events \( E_1 \) and $E_2$, the event formed by their intersection (i.e. $E_1 \cap E_2$) is the set of outcomes where both events occur.
Where events $E_1$ and $E_2$ are independent, the probability of the intersection \( E_1 \cap E_2 \) is given by the formula $$P(E_1 \cap E_2) = P(E_1) \times P(E_2)$$
Two events are independent if knowing that one event has occurred gives no information about the outcome of the other event.
Two sixes example
Say we roll two dice and want to work out the probability of getting two sixes. In this case, our sample space might be formed by every pair of values on the two dice: \( \Omega = \{ (1,1), (1,2), \dots, (6,6) \} \).
Consider this in classical probability. There are $6 \times 6 = 36$ such pairs. There are only 6 outcomes where dice $A$ rolls a six, and 6 outcomes where dice $B$ rolls a six. There is only one outcome where both dice roll a six. The probability of rolling two sixes is therefore \( \frac{1}{36} \).
Notice that $1/36 = 1/6 \times 1/6$. We can multiply probabilities to find the probability that two events co-occur, so long as these events are independent of one another. This is the multiplication rule for independent events, stated above.
As getting sixes on two dice are independent events, we can apply this formula to calculate this probability. The probability of each independent event is \( \frac{1}{6} \), thus: $$P(\text{two sixes}) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$
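Here is a minimal Python check of this (our own sketch; `two_sixes` is a made-up name), comparing a simulation against the multiplication rule:

```python
import random

def two_sixes(trials=100_000):
    """Estimate P(both dice show a six) by simulation."""
    hits = sum(
        1 for _ in range(trials)
        if random.randint(1, 6) == 6 and random.randint(1, 6) == 6
    )
    return hits / trials

print(two_sixes())        # typically close to 1/36 ≈ 0.0278
print((1 / 6) * (1 / 6))  # multiplication rule for independent events
```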
General Multiplication Rule
The general multiplication rule is: $$P(E_1 \cap E_2) = P(E_1 \mid E_2) \times P(E_2)$$
We saw a special case of this formula above, where the two events are independent. This is the case when \( P(E_1 \mid E_2) = P(E_1) \). That is, when the conditional probability of $E_1$ given $E_2$ is the same as the probability of $E_1$. In other words, events are independent when knowing that $E_2$ has happened makes no difference to the probability of \( E_1 \).
A good example of independent events is rolling two dice. Knowing the outcome on one dice makes no difference to the outcomes on the other dice. In contrast, if you were screening for a disease, getting a positive test result would affect the probability that you really had the disease.
Conditional Probability
Often we have partial information about the outcome of an experiment. For example, when we turn on the kettle in the morning, we see that it lights up before we can tell whether or not it is heating. Based on this, we might unconsciously update our expectation about whether the kettle is likely to work. We can exclude all those outcomes where the kettle is not working because the fuse has blown, for example.
Imagine that we know that an event \( A \) has happened. We want to know the probability of event $E$. We write this as $P(E \mid A)$ ("the probability of $E$ given $A$"). Effectively, what we have done is restrict our sample space for this question. We are now interested only in outcomes in the subset $A$, and only in those outcomes in event $E$ that are also in \( A \).
If all outcomes are equally likely, we can work this out using classical probability. In this case, a conditional probability will be the proportion of outcomes in our event out of our restricted sample space: \( P(E \mid A) = \frac{\#(E \cap A)}{\#A} \). More generally, even when we do not assume all outcomes are equally likely, we have \( P(E \mid A) = \frac{P(E \cap A)}{P(A)} \). This formula is obtained by dividing the top and bottom of the above fraction by \( \#\Omega \) and substituting for probabilities.
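As an illustration (our own example, not from the lecture), a quick Python sketch that computes a conditional probability on the dice sample space, once by restricting the sample space and once via the general formula:

```python
# Given that the roll is even, what is the probability it is greater than 3?
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the condition: an even number was rolled
E = {4, 5, 6}  # the event of interest: a number greater than 3

# Restricted sample space: #(E ∩ A) / #A
print(len(E & A) / len(A))  # 2/3

# General formula: P(E | A) = P(E ∩ A) / P(A)
def p(event):
    return len(event) / len(omega)

print(p(E & A) / p(A))  # also 2/3
```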
Multiple Conditions
Bayes' Rule
In science we often have hypotheses. We can consider competing hypotheses to be disjoint events. For example, imagine we do not know whether someone has passed their exam. Hypothesis \( H \) might be the event that they pass the exam, and $H^c$ the event that they fail the exam. In Bayesian probability, the probability that a hypothesis is true before we collect evidence is called the prior probability, or \( P(H) \). Perhaps from previous years we know that the pass rate is 80%.
There is often evidence that we can observe that might tell us that \( H \) is more or less likely. For example, we might know that the student attended all of their lectures. Let $E$ be the event that refers to all outcomes where the student attended lectures. The probability \( P(E) \) is the likelihood of observing this evidence, i.e. the likelihood that a random student will attend lectures. Perhaps we know that 75% of students attend all lectures.
Finally, we might also know how likely we are to observe this evidence were our hypothesis true, or \( P(E \mid H) \). In our example, this is how likely it is that a student attended all lectures given that they passed the exam. We might observe that 90% of students who pass the exam attended all of their lectures. Thus we have: $$P(H) = 0.8, \quad P(E) = 0.75, \quad P(E \mid H) = 0.9$$
From this we can calculate a new probability for our hypothesis that takes into account the evidence, \( P(H \mid E) \). This is called the posterior probability. In our example, rather than guessing that there is an 80% chance a student will pass their exam, we can make a more accurate estimate by incorporating our knowledge that the student attended all lectures. This can be achieved through a rearrangement of some of the formulas given above. The general multiplication rule lets us write \( P(H \cap E) \) in two ways: $$P(H \mid E) \, P(E) = P(H \cap E) = P(E \mid H) \, P(H)$$
Thus $$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)}$$
This gives us the formula for a posterior probability \( P(H \mid E) \) in terms of our prior probability $P(H)$, the probability of the evidence $P(E)$, and the probability of the evidence given our hypothesis $P(E \mid H)$. This formula is called Bayes' Rule. In other words, we can express $P(A \mid B)$ in terms of $P(B \mid A)$, if we know $P(A)$ and \( P(B) \).
To solve our example above, when $$P(H) = 0.8, \quad P(E) = 0.75, \quad P(E \mid H) = 0.9$$
We have $$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)} = \frac{0.9 \times 0.8}{0.75} = 0.96$$
The probability that a student who attended their lectures will pass the exam is therefore 96%.
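This calculation is easy to express in code. A minimal Python sketch (our own; the function name `posterior` is made up) applying Bayes' Rule to the numbers above:

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# P(H) = 0.8, P(E | H) = 0.9, P(E) = 0.75
print(posterior(prior=0.8, likelihood=0.9, evidence=0.75))  # 0.96
```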
This was for a single hypothesis \( H \). By substituting in the law of total probability, we can extend Bayes' Rule to any finite and collectively exhaustive set of disjoint hypotheses \( H_1, H_2, \dots, H_n \) that partition our sample space.
Thus $$P(H_i \mid E) = \frac{P(E \mid H_i) \, P(H_i)}{\sum_{j=1}^{n} P(E \mid H_j) \, P(H_j)}$$
Law of Total Probability
A partition is a way of subdividing a set into partition sets \( B_1, B_2, \dots, B_n \) that are disjoint (i.e. there is no overlap between $B_i$ and $B_j$, or in other words \( B_i \cap B_j = \emptyset \) for \( i \neq j \)) and collectively exhaustive (i.e. together they contain everything in the set). A set and its complement are one example of a partition, but partitions may also involve more than two sets.
For example, when rolling a dice, the two events \( odd \) and $even$ form a partition; you always roll an odd or even number, and you can never roll both at the same time. The events \( \{1, 2\}, \{3, 4\}, \{5, 6\} \) also form a partition of this sample space. This partition is shown in the Venn diagram below, and the event \( even \) is shown overlapping with each partition set.
A set can be divided across these partition sets (i.e. it might intersect with multiple of them). The set is the union of all of these parts: $$A = (A \cap B_1) \cup (A \cap B_2) \cup \dots \cup (A \cap B_n)$$
If this set is an event, the parts \( A \cap B_i \) are mutually exclusive events (because the partition sets are disjoint). Thus we can simply sum their probabilities (usually we would need to take away the probability of the intersection, but here it is 0): $$P(A) = P(A \cap B_1) + P(A \cap B_2) + \dots + P(A \cap B_n)$$
Finally, recall the intersection identity for probabilities above: $$P(A \cap B_i) = P(A \mid B_i) \, P(B_i)$$
Thus we can express the probability of an event in a partitioned sample space with the following law, which sums the probabilities of the intersections of the event with each partition set.
Law of Total Probability
If \( B_1, B_2, \dots \) is a partition of the sample space \( \Omega \), then for any event \( A \): $$P(A) = \sum_i P(A \mid B_i) \, P(B_i)$$
Law of Total Probability Example
You have 2 bags of balls that each contain 70 balls. Bag 1 has 50 red balls and 20 black balls. Bag 2 has 30 red balls and 40 black balls. You pick a bag at random and pick a random ball from that bag. What is the probability that you pick a red ball?
We can formalise the question. Let $R$ be the event of drawing a red ball. Let $B_n$ be the event of picking bag $n$. From the question we know: $$P(B_1) = P(B_2) = \frac{1}{2}, \quad P(R \mid B_1) = \frac{50}{70} = \frac{5}{7}, \quad P(R \mid B_2) = \frac{30}{70} = \frac{3}{7}$$
Solving the question requires us to find $P(R)$. We can proceed by substituting the above values into the Law of Total Probability: $$P(R) = P(R \mid B_1) \, P(B_1) + P(R \mid B_2) \, P(B_2) = \frac{5}{7} \times \frac{1}{2} + \frac{3}{7} \times \frac{1}{2} = \frac{8}{14} = \frac{4}{7}$$
The probability of drawing a red ball from the randomly chosen bag is $4/7$.
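We can check this by simulating the two-stage experiment directly. A minimal Python sketch (our own; the names are made up):

```python
import random

def red_ball_probability(trials=100_000):
    """Simulate: pick a bag at random, then a ball at random from that bag."""
    bag1 = ["red"] * 50 + ["black"] * 20
    bag2 = ["red"] * 30 + ["black"] * 40
    reds = 0
    for _ in range(trials):
        bag = random.choice([bag1, bag2])  # P(B_1) = P(B_2) = 1/2
        if random.choice(bag) == "red":    # P(R | B_n) by uniform draw
            reds += 1
    return reds / trials

print(red_ball_probability())  # typically close to 4/7 ≈ 0.571
```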
Probability Trees
A decision tree is a tree representing a sequence of decisions. For example, we might pick between two bags at random and then pick a ball at random out of the bag. In this case, we would also be able to assign probabilities to each of our decisions, making it a probability tree.
Probability trees are visualisations that make it easy to work out otherwise complex probabilities. The choices in a decision tree are mutually exclusive, meaning that in a probability tree we can add the probabilities of branches together to get the probability that at least one of those branches will be taken (the union of those events). If you walk down a probability tree, each probability is conditional upon its parent branch being reached; this means we can multiply down branches to work out the probability of taking a particular path through the tree.
Decision Tree
A way of representing a sequence of Decisions
Probability Tree
In a probability tree, we can think of each vertex as being an event.
$P(V_i)$ is the likelihood that a random walk through the probability tree will include $V_i$
Multiplying down branches
To work out the absolute probability of a set of events we can multiply down the tree
Remember the Multiplication Rule: \( P(E_1 \cap E_2) = P(E_1 \mid E_2) \times P(E_2) \)
Let $D_1, \dots, D_n$ be the path from $root(T)$ to $V_i$: $$P(V_i) = P(D_1) \times P(D_2 \mid D_1) \times \dots \times P(D_n \mid D_1 \cap \dots \cap D_{n-1})$$
This gives the probability of walking down the probability tree following a given path $D$.
Adding across branches
Branches are mutually exclusive, so we can add them together to get their combined probability
Remember the Addition Rule: for mutually exclusive events, \( P(A \cup B) = P(A) + P(B) \)
Disease Screening Example
Imagine you have been screened for a disease and get a positive result. We know certain statistics about the disease, given below. What is the probability you have the disease?
- \( P(D) = 0.1 \) (prior probability = prevalence)
- \( P(T \mid D) = 0.9 \) (sensitivity)
- \( P(T^c \mid D^c) = 0.8 \) (specificity)
Here we want to work out $P(D \mid T)$, for which we can use Bayes' rule. As you can see from the formula below, we have everything we need except for $P(T)$: $$P(D \mid T) = \frac{P(T \mid D) \, P(D)}{P(T)}$$ We can work this out either using a probability tree or using the Law of Total Probability, introduced above.
Using a Probability Tree
We can construct a probability tree of this situation, as shown on the slide. In this diagram I have partitioned event $T$ in two for clarity when representing it in the probability tree: $T_1 = T \cap D$ and $T_2 = T \cap D^c$. These partition $T$, meaning $T = T_1 \cup T_2$.
While only some of the edges in the diagram are labelled with conditional probabilities, where events are complementary we know their probabilities must sum to one. For instance, we have $P(T_1 \mid D) = 0.9$. This is the probability of getting a positive test result given that we have the disease. The only other possibility, given that we have the disease, is that we get a negative test result, therefore $P(T^c \mid D) = 0.1$. (Observe that $0.9 + 0.1 = 1$.)
We can find $P(T_1)$ and $P(T_2)$ by multiplying down the tree.
$$P(T_1) = 0.1 \times 0.9 = 0.09$$ $$P(T_2) = 0.9 \times 0.2 = 0.18$$ The addition rule then tells us that $P(T) = P(T_1) + P(T_2) = 0.27$.
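The same multiply-down, add-across procedure can be written as a small program. A minimal Python sketch (our own representation; the nested-dict layout and names like `path_probability` are made up for illustration):

```python
# Each child node carries the conditional probability of taking that branch.
tree = {
    "D":  {"p": 0.1, "T1": {"p": 0.9}, "neg": {"p": 0.1}},  # have disease
    "Dc": {"p": 0.9, "T2": {"p": 0.2}, "neg": {"p": 0.8}},  # no disease
}

def path_probability(path):
    """Multiply conditional probabilities down one path of the tree."""
    node, prob = tree, 1.0
    for step in path:
        node = node[step]
        prob *= node["p"]
    return prob

p_t1 = path_probability(["D", "T1"])   # 0.1 * 0.9 = 0.09
p_t2 = path_probability(["Dc", "T2"])  # 0.9 * 0.2 = 0.18
print(p_t1 + p_t2)                     # add across branches: P(T) = 0.27
```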
Using the Law of Total Probability
Our hypotheses $D$ and $D^c$ form a partition, as they are disjoint (they cannot both be true) and collectively exhaustive (either you do or do not have the disease). As our hypotheses partition our sample space, we can rewrite $P(T)$ as a sum of terms of the form $P(T \mid H) \, P(H)$ using the Law of Total Probability: $$P(T) = P(T \mid D) \, P(D) + P(T \mid D^c) \, P(D^c) = 0.9 \times 0.1 + 0.2 \times 0.9 = 0.27$$
Finding $P(D \mid T)$
Now we know $P(T)$, we can find $P(D \mid T)$ using Bayes' Rule: $$P(D \mid T) = \frac{P(T \mid D) \, P(D)}{P(T)} = \frac{0.9 \times 0.1}{0.27} = \frac{1}{3}$$ So even after a positive test result, the probability of actually having the disease is only $1/3$, or about 33%.
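A simulation makes it intuitive why the posterior is only a third: most positive tests come from the large disease-free group. A minimal Python sketch (our own; `screening_posterior` is a made-up name):

```python
import random

def screening_posterior(trials=100_000):
    """Estimate P(D | T): count diseased cases among simulated positive tests."""
    positives = diseased_positives = 0
    for _ in range(trials):
        has_disease = random.random() < 0.1                # prevalence P(D)
        if has_disease:
            positive = random.random() < 0.9               # sensitivity P(T | D)
        else:
            positive = random.random() < 0.2               # 1 - specificity
        if positive:
            positives += 1
            diseased_positives += has_disease
    return diseased_positives / positives

print(screening_posterior())  # typically close to 1/3
```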
Markov Chain
A similar concept to a probability tree is a Markov Chain. Here the requirement that the graph is a tree is relaxed. We need to interpret the vertices slightly differently. They are not events (as they can occur more than once) - rather they are states in a system. The system transitions through a sequence of states according to the transition probabilities that label each edge.
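For a concrete feel, here is a minimal Python sketch of a two-state Markov Chain (our own toy example; the states and transition probabilities are made up):

```python
import random

# transitions[s] maps each possible next state to its transition probability.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def walk(start, steps):
    """Transition through a sequence of states using the edge probabilities."""
    state, path = start, [start]
    for _ in range(steps):
        options = transitions[state]
        state = random.choices(list(options), weights=list(options.values()))[0]
        path.append(state)
    return path

print(walk("sunny", 10))  # e.g. ['sunny', 'sunny', 'rainy', ...]
```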