Probability
Almost any application of mathematics to solve real world problems involves probability at some point. Probability is extensively used when we work with data, such as in machine learning.
Learning Outcomes
- Derive probabilities of an event based on the assumptions of classical probability
- Apply the addition rule to find probabilities of unions of events
- Apply the multiplication rule to find probabilities of co-occurrence of events
- Apply the conditional probability formula to find conditional probabilities
- Apply Bayes' rule
Introduction
In all domains of life we experience events whose outcome is uncertain. Some uncertainties are resolved fairly quickly; we may not even pay much attention to them. Perhaps you turned on a kettle this morning to make a cup of tea. Yet, until the kettle was making noise and the water started to steam, you could not, of course, be certain that it would work; it might have broken. Turning on the kettle this morning was an experiment. It was a procedure by which you discovered something unknown ("is my kettle still working?").
Whether or not a kettle is working depends on the state of the electronics which ultimately is determined by the laws of physics. In principle, using the tools of mathematics we might be able to give a model of the kettle that is deterministic (produces a fixed answer). However, it is generally easier and more useful to model the uncertainty itself. What is the probability that the kettle will be working? How, for instance, does the probability of the kettle working change in response to different events we might observe in the world?
Experiments
An experiment is any situation (real or hypothetical) where the outcome is uncertain. The outcome of an experiment is random or stochastic. In other words (based on what you know), you cannot reliably predict it in advance. For example, when you click the 'Roll Dice' button below, you are unable to reliably predict what number it will show. The 'make request' button makes an HTTP request to the server this website is hosted on and measures the time taken to get a response. It's impossible to predict exactly what the response time will be: another experiment.
It is often valuable to be able to reason about the outcome of experiments, for example if you were betting on the outcome of the dice roll or optimising a webserver.
When we reason about experiments using the mathematics of probability, we are building a model of reality that can only approximate a real-world situation. For example, I might choose to model flipping a coin as having two possible outcomes: heads or tails. Yet in the real world this modelling assumption might be violated - the coin might land perfectly balanced on its side.
Overview
We will begin by understanding the basic terms used in probability: Outcomes, Events, and Sample Spaces. We will then see an approach known as Classical Probability.
We will then turn to the fundamental principles of probability, starting with the three Kolmogorov axioms that are the foundation of the modern mathematics of probability. We will see some of the implications of these axioms.
We will go on to see the basics of how we can work with probabilities. We will see probability trees, which are a tool that will help us to reason about more complicated situations.
Events
Probability is built using concepts from set theory. For the rest of this lecture, we will be applying set theory operations such as \( \cup \) and \( \cap \). If you are not comfortable with what these mean, revise the lectures on Set Theory.
Events and Sample Spaces
A sample space, \( \Omega \), is defined as the set of all possible outcomes of an experiment. Imagine this like a dartboard. When you perform an experiment, or sample the sample space (or observe a random variable) you are throwing a dart at this dartboard. You will always get exactly one outcome on the dartboard.
An event, \( E \), is a set of outcomes. An event might be a singleton set (i.e. containing only a single outcome). If so it is called an elementary event. Alternatively, it might contain multiple outcomes.
For example, if I perform the experiment of rolling a fair dice, my sample space would be \( \Omega = \{ 1, 2, 3, 4, 5, 6 \} \). Every subset of $\Omega$ is an event. For example, the set \( \{ 2, 4, 6 \} \) is the event where an even number is rolled; \( \{1, 5, 6\} \) is the event where either a 1, 5, or 6 is rolled; and so on.
Examples of events
Here we are going to look at three different ways of formalising the same experiment.
Experiment A: Imagine we flip two coins and look at both coins. There are four possible outcomes of our experiment. Assuming both coins are fair, each outcome has an equal probability. The sample space for this experiment is: \( \Omega_A = \{ (H,H), (H,T), (T,H), (T,T) \} \)
Experiment B: However, now imagine we flip two coins and count the number of heads. This is a slightly different experiment. Here our sample space contains three possible outcomes of our experiment, and these outcomes do not have equal probabilities: \( \Omega_B = \{ 0, 1, 2 \} \)
Experiment C: Finally, now imagine we flip two coins and just look to see whether we get at least one head. We have a different sample space again: \( \Omega_C = \{ \text{at least one head}, \text{no heads} \} \). These outcomes do not have equal probabilities.
Imagine we want to consider the likelihood we get at least one head. We can use any of the experiments described here to do this. While we are asking the same question of all three experiments, in each case we will be interested in an event which contains different outcomes. The probability of each event would be the same, however.
Classical probability
There is a simple way to work out probabilities if all outcomes in our sample space are equally likely. Classical Probability defines the probability of an event as the proportion of favourable outcomes (outcomes where the event occurs) out of the total number of outcomes possible.
We will write the probability of an event $E$ as $P(E)$, where $P$ is the probability function. Below, $\Omega$ is the sample space and \( \#E \) denotes the number of outcomes in the event \( E \): $$P(E) = \frac{\#E}{\#\Omega}$$
Classical probability examples
Consider the experiment of rolling a fair dice. Assuming our sample space contains six outcomes which are all equally likely, we can work out the probability of rolling a particular number to be $1/6$
To do this, let our sample space \( \Omega = \{1, 2, 3, 4, 5, 6\} \). In other words, we assume that the only outcomes from our experiment can be rolling one of six numbers. We can see that $\#\Omega = 6$. Imagine we are considering the elementary event where we roll a 4. Let event \( E = \{4\} \). We can see that $\#E = 1$. Thus \( P(E) = \frac{\#E}{\#\Omega} = \frac{1}{6} \). Thus, the probability of rolling a 4 on a six sided dice is $1/6$.
Consider the example above of flipping two coins and getting at least one head. We can calculate the probability using Classical Probability if we model the situation as Experiment A, because in Experiment A we assume that each outcome is equally likely.
Let our sample space \( \Omega_A = \{ (H,H), (H, T), (T, H), (T,T) \} \). We can see that $\#\Omega_A = 4$. Let $H$ be the event where we get at least one head: \( H = \{ (H,H), (H, T), (T, H) \} \). We can see that $\#H = 3$. Thus \( P(H) = \frac{\#H}{\#\Omega_A} = \frac{3}{4} \). Thus, the probability of getting at least one head when flipping two coins is $3/4$.
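If you want to sanity-check this result numerically, here is a minimal Python sketch (our own illustration, not part of the lecture; the function name `at_least_one_head` is made up) that estimates the probability by repeatedly simulating the experiment:

```python
import random

def at_least_one_head(trials=100_000):
    """Estimate P(at least one head) when flipping two fair coins."""
    hits = 0
    for _ in range(trials):
        flips = [random.choice("HT") for _ in range(2)]  # one outcome from Omega_A
        if "H" in flips:  # the event H = {(H,H), (H,T), (T,H)}
            hits += 1
    return hits / trials

print(at_least_one_head())  # typically prints a value close to 0.75
```

As the number of trials grows, the estimate converges on the classical answer of $3/4$.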
Kolmogorov Axioms
Axioms are basic assumptions that are used as a foundation for proofs and deductions. The three Kolmogorov axioms provide a foundation for probability.
First Axiom
The first axiom describes what a probability is: it is a real number (i.e. with decimal places) that is not less than 0.
"The probability of an event is a non-negative real number."
We represent probabilities as numbers (specifically, real numbers). The smallest possible probability is 0, and no event can have a probability less than this. Formally, for every event \( E \): \( P(E) \in \mathbb{R} \) and \( P(E) \geq 0 \).
Second Axiom
The second axiom says that what it means to be an experiment is that there will be an outcome. This axiom also defines what number will be used to represent certainty.
"The probability that at least one elementary event in the sample space will occur is one."
If \( \Omega \) is a sample space, then it is certain that at least one elementary event in \( \Omega \) will occur. In probability, certainty is represented as a probability of 1. Formally: \( P(\Omega) = 1 \).
Third axiom
The third axiom states that we can add the probabilities of certain kinds of events.
"The probability of any countable sequence of disjoint (i.e. mutually exclusive) events \( E_1, E_2, E_3 \dots \) is equal to the sum of the probabilities of the individual events."
In other words, we can add up the probabilities of events to find the probability that any of those events happens. Specifically, this applies to events that are mutually exclusive (i.e. non-overlapping); it does not apply to events that can co-occur. $$P\left( \bigcup_{i=1}^{\infty} E_i \right) = \sum_{i=1}^{\infty} P(E_i)$$
If this is hard to read, we can simplify the notation for the example where we only have two mutually exclusive events \( A \) and \( B \): $$P(A \cup B) = P(A) + P(B)$$
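As a small worked instance of this simplified form (our own example, using the fair dice from earlier): the events \( \{1\} \) and \( \{3\} \) are mutually exclusive, so

$$P(\{1\} \cup \{3\}) = P(\{1\}) + P(\{3\}) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$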
Deck of Cards
What is the probability of drawing a heart from a 52 card deck?
A 52 card deck contains 4 suits (one of which being hearts), 13 cards of each suit, and no jokers. As the question does not say we draw multiple cards, we will assume we draw a single card. We can represent this situation with a sample space $\Omega$ that contains 52 possible outcomes. The event, which we will call $H$, where a heart is drawn is a subset containing 13 of these outcomes (as there are 13 cards that are hearts).
This situation is quite tractable to work out by counting and using classical probability: \( P(H) = \frac{\#H}{\#\Omega} = \frac{13}{52} = \frac{1}{4} \).
Now let us work this out based on Kolmogorov's axioms. Let \( \Omega = \{ c_1, ..., c_{52} \} \) be a sample space (where $c_n$ is the outcome of drawing a particular card, and there are 52 such cards). Let $C_n$ be the elementary event \( \{c_n\} \).
As the deck is perfectly shuffled, we will assume that the probability of each elementary event is the same, i.e. we are no more likely to draw one card than any other. Thus $P(C_n) = k$, for all $n$. Here $k$ is some constant.
As we know $\# \Omega = 52$, it follows that there are 52 elementary events \( C_1, \dots, C_{52} \), as there must be one elementary event per outcome. Elementary events must be mutually exclusive as they each contain only a single outcome. According to Kolmogorov's third axiom, we can add together the probabilities of mutually exclusive events to get the probability of any of them occurring. Thus the probability of drawing any card will be $52k$, where $k$ is the constant introduced above.
As $P(\Omega) = 1$, we can work out that: $$P(\Omega) = 52k = 1, \quad \text{thus} \quad k = \frac{1}{52}$$
Let \( H_1, \dots, H_{13} \) be the elementary events where a heart is drawn. As these are elementary events, we know they are mutually exclusive. We know that drawing any particular heart is as likely as drawing any other card, thus we know $P(H_n) = P(C_n) = 1/52$.
Finally, we can apply the formula given in the third axiom to these mutually exclusive events: $$P(H) = P(H_1 \cup \dots \cup H_{13}) = \sum_{n=1}^{13} P(H_n) = 13 \times \frac{1}{52} = \frac{1}{4}$$
Implications of Kolmogorov Axioms
The probabilities of events are monotonic.
If \( A \subseteq B \) then \( P(A) \leq P(B) \)
Proof
We can write $B$ as the union of two non-overlapping subsets: \( B = (B \setminus A) \cup (A \cap B) \). Further, as we assume \( A \subseteq B \), we have $A \cap B = A$. Thus $B = A \cup (B \setminus A)$.
Let $C = B \setminus A$. Thus, using Kolmogorov's third axiom, we can write the probability of $B$ as the sum of the probabilities of the two mutually exclusive events $A$ and $C$: $$P(B) = P(A) + P(C)$$
From the first axiom we know that probabilities are non-negative real numbers, thus \( P(C) \geq 0 \).
As we are adding a non-negative number to $P(A)$, we conclude that $P(B)$ must be greater than or equal to $P(A)$. Equivalently, $P(A)$ must be less than or equal to $P(B)$.
Numeric Bound
We know that probability is non-negative, i.e. it must be at least 0. We also know that the probability of the sample space \( \Omega \) is 1. The sample space is the set of all possible outcomes, therefore all other events must be subsets of the sample space. Therefore, from the property of monotonicity, all events must have a probability in the range 0 to 1: $$0 \leq P(E) \leq 1$$
Complement Rule
The complement of an event \( A \) is the event where $A$ does not occur. For example, if you roll a dice and say that $A$ is the event where an odd number is rolled, then the complement of $A$, written \( A^c \), is the event where an odd number is not rolled. In this example, this is equivalent to rolling an even number.
The complement of an event, \( A^c \), is defined as the set of all outcomes in the sample space $\Omega$ that are not in $A$. It follows that the union of an event and its complement equals the sample space, and thus their probabilities sum to \( P(\Omega) \), which is 1.
The probability of an event and its complement add up to 1: $$P(A) + P(A^c) = 1$$
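For instance (our own quick example): if \( A \) is the event of rolling a 6 on a fair dice, then

$$P(A^c) = 1 - P(A) = 1 - \frac{1}{6} = \frac{5}{6}$$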
Working with Probabilities
Addition Property
It is common to want to find the likelihood that either event $E_1$ or event $E_2$ occurs. This means we are interested in the event formed by the union $E_1 \cup E_2$. To find this, we can add the probabilities of $E_1$ and $E_2$ and subtract the probability of their intersection (if we did not subtract it, we would count the outcomes in the intersection twice, once for each event).
If events are sets of outcomes, the set-theoretic union of two events is the event that either event occurs
Addition Rule:
The combined probability of two events $A$ and $B$ is the sum of the probabilities of those events, less the probability that those events co-occur: $$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
We will see how to work out the probability of the intersection of two events below.
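In the meantime, here is a minimal Python sketch (our own illustration; the sets and the helper `p` are made up for this example) that checks the addition rule by counting outcomes on the dice sample space, where the intersection can be found directly:

```python
# Check P(A ∪ B) = P(A) + P(B) - P(A ∩ B) by counting dice outcomes.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # an even number is rolled
B = {4, 5, 6}  # a number greater than 3 is rolled

def p(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return len(event) / len(omega)

lhs = p(A | B)                # direct: P(A ∪ B) = P({2, 4, 5, 6})
rhs = p(A) + p(B) - p(A & B)  # addition rule
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)               # both 0.666...
```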
Intersection
We often want to know when two events both occur. For example, we might want to know the probability of rolling two dice and getting a six on both. For two events \( E_1 \) and $E_2$, the event formed by their intersection (i.e. $E_1 \cap E_2$) is the set of outcomes where both events occur.
Where events $E_1$ and $E_2$ are independent, the probability of the intersection \( E_1 \cap E_2 \) is given by the formula $$P(E_1 \cap E_2) = P(E_1) \times P(E_2)$$
Two events are independent if knowing that one event has occurred gives no information about the outcome of the other event.
Two sixes example
Say we roll two dice and want to work out the probability of getting two sixes. In this case, our sample space might be formed by every pair of values on the two dice: \( \Omega = \{ (1,1), (1,2), \dots, (6,6) \} \).
Consider this in classical probability. There are $6 \times 6 = 36$ such pairs. There are only 6 outcomes where dice $A$ rolls a six, and 6 outcomes where dice $B$ rolls a six. There is only one outcome where both dice roll a six. The probability of rolling two sixes is therefore \( \frac{1}{36} \).
Notice that $1/36 = 1/6 \times 1/6$. We can multiply probabilities to find the probability that two events co-occur, so long as these events are independent of one another. This is the multiplication rule for independent events, stated above.
As getting sixes on two dice are independent events, we can apply this formula to calculate this probability. The probability of each independent event is \( \frac{1}{6} \), thus: $$P(\text{two sixes}) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$
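Here is a minimal Python check of this (our own sketch; `two_sixes` is a made-up name), comparing a simulation against the multiplication rule:

```python
import random

def two_sixes(trials=100_000):
    """Estimate P(both dice show a six) by simulation."""
    hits = sum(
        1 for _ in range(trials)
        if random.randint(1, 6) == 6 and random.randint(1, 6) == 6
    )
    return hits / trials

print(two_sixes())        # typically close to 1/36 ≈ 0.0278
print((1 / 6) * (1 / 6))  # multiplication rule for independent events
```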
General Multiplication Rule
The general multiplication rule is: $$P(E_1 \cap E_2) = P(E_1 \mid E_2) \times P(E_2)$$
We saw a special case of this formula above, where the two events are independent. This is the case when \( P(E_1 \mid E_2) = P(E_1) \). That is, when the conditional probability of $E_1$ given $E_2$ is the same as the probability of $E_1$. In other words, events are independent when knowing that $E_2$ has happened makes no difference to the probability of \( E_1 \).
A good example of independent events is rolling two dice. Knowing the outcome on one dice makes no difference to the outcomes on the other dice. In contrast, if you were screening for a disease, getting a positive test result would affect the probability that you really had the disease.
Conditional Probability
Often we have partial information about the outcome of an experiment. For example, when we turn on the kettle in the morning, we see that it lights up before we can tell whether or not it is heating. Based on this, we might unconsciously update our expectation about whether the kettle is likely to work. We can exclude all those outcomes where the kettle is not working because the fuse has blown, for example.
Imagine that we know that an event \( A \) has happened. We want to know the probability of event $E$. We write this as $P(E \mid A)$ ("the probability of $E$ given $A$"). Effectively, what we have done is restrict our sample space for this question. We are now interested only in outcomes in the subset $A$, and only in those outcomes in event $E$ that are also in \( A \).
If all outcomes are equally likely, we can work this out using classical probability. In this case, a conditional probability will be the proportion of outcomes in our event out of our restricted sample space: \( P(E \mid A) = \frac{\#(E \cap A)}{\#A} \). More generally, even when we do not assume all outcomes are equally likely, we have \( P(E \mid A) = \frac{P(E \cap A)}{P(A)} \). This formula is obtained by dividing the top and bottom of the above fraction by \( \#\Omega \) and substituting for probabilities.
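As an illustration (our own example, not from the lecture), a quick Python sketch that computes a conditional probability on the dice sample space, once by restricting the sample space and once via the general formula:

```python
# Given that the roll is even, what is the probability it is greater than 3?
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # the condition: an even number was rolled
E = {4, 5, 6}  # the event of interest: a number greater than 3

# Restricted sample space: #(E ∩ A) / #A
print(len(E & A) / len(A))  # 2/3

# General formula: P(E | A) = P(E ∩ A) / P(A)
def p(event):
    return len(event) / len(omega)

print(p(E & A) / p(A))  # also 2/3
```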
Multiple Conditions
Bayes' Rule
In science we often have hypotheses. We can consider competing hypotheses to be disjoint events. For example, imagine we do not know whether someone has passed their exam. Hypothesis \( H \) might be the event that they pass the exam, and $H^c$ the event that they fail the exam. In Bayesian probability, the probability that a hypothesis is true before we collect evidence is called the prior probability, or \( P(H) \). Perhaps from previous years we know that the pass rate is 80%.
There is often evidence that we can observe that might tell us that \( H \) is more or less likely. For example, we might know that the student attended all of their lectures. Let $E$ be the event that refers to all outcomes where the student attended lectures. The probability \( P(E) \) is the likelihood of observing this evidence, i.e. the likelihood that a random student will attend lectures. Perhaps we know that 75% of students attend all lectures.
Finally, we might also know how likely we are to observe this evidence were our hypothesis true, or \( P(E \mid H) \). In our example, this is how likely it is that a student attended all lectures given that they passed the exam. We might observe that 90% of students who pass the exam attended all of their lectures. Thus we have: $$P(H) = 0.8, \quad P(E) = 0.75, \quad P(E \mid H) = 0.9$$
From this we can calculate a new probability for our hypothesis that takes into account the evidence, \( P(H \mid E) \). This is called the posterior probability. In our example, rather than guessing that there is an 80% chance a student will pass their exam, we can make a more accurate estimate by incorporating our knowledge that the student attended all lectures. This can be achieved through a rearrangement of some of the formulas given above. The general multiplication rule lets us write \( P(H \cap E) \) in two ways: $$P(H \mid E) \, P(E) = P(H \cap E) = P(E \mid H) \, P(H)$$
Thus $$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)}$$
This gives us the formula for a posterior probability \( P(H \mid E) \) in terms of our prior probability $P(H)$, the probability of the evidence $P(E)$, and the probability of the evidence given our hypothesis $P(E \mid H)$. This formula is called Bayes' Rule. In other words, we can express $P(A \mid B)$ in terms of $P(B \mid A)$, if we know $P(A)$ and \( P(B) \).
To solve our example above, when $$P(H) = 0.8, \quad P(E) = 0.75, \quad P(E \mid H) = 0.9$$
We have $$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)} = \frac{0.9 \times 0.8}{0.75} = 0.96$$
The probability that a student who attended their lectures will pass the exam is therefore 96%.
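This calculation is easy to express in code. A minimal Python sketch (our own; the function name `posterior` is made up) applying Bayes' Rule to the numbers above:

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# P(H) = 0.8, P(E | H) = 0.9, P(E) = 0.75
print(posterior(prior=0.8, likelihood=0.9, evidence=0.75))  # 0.96
```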
This was for a single hypothesis \( H \). By substituting in the law of total probability, we can extend Bayes' Rule to any finite and collectively exhaustive set of disjoint hypotheses \( H_1, H_2, \dots, H_n \) that partition our sample space.
Thus $$P(H_i \mid E) = \frac{P(E \mid H_i) \, P(H_i)}{\sum_{j=1}^{n} P(E \mid H_j) \, P(H_j)}$$
Law of Total Probability
A partition is a way of subdividing a set into partition sets \( B_1, B_2, \dots, B_n \) that are disjoint (i.e. there is no overlap between $B_i$ and $B_j$, or in other words \( B_i \cap B_j = \emptyset \) for \( i \neq j \)) and collectively exhaustive (i.e. together they contain everything in the set). A set and its complement are one example of a partition, but partitions may also involve more than two sets.
For example, when rolling a dice, the two events \( odd \) and $even$ form a partition; you always roll an odd or even number, and you can never roll both at the same time. The events \( \{1, 2\}, \{3, 4\}, \{5, 6\} \) also form a partition of this sample space. This partition is shown in the Venn diagram below, and the event \( even \) is shown overlapping with each partition set.
A set can be divided across these partition sets (i.e. it might intersect with multiple of them). The set is the union of all of these parts: $$A = (A \cap B_1) \cup (A \cap B_2) \cup \dots \cup (A \cap B_n)$$
If this set is an event, the parts \( A \cap B_i \) are mutually exclusive events (because the partition sets are disjoint). Thus we can simply sum their probabilities (usually we would need to take away the probability of the intersection, but here it is 0): $$P(A) = P(A \cap B_1) + P(A \cap B_2) + \dots + P(A \cap B_n)$$
Finally, recall the intersection identity for probabilities above: $$P(A \cap B_i) = P(A \mid B_i) \, P(B_i)$$
Thus we can express the probability of an event in a partitioned sample space with the following law, which sums the probabilities of the intersections of the event with each partition set.
Law of Total Probability
If \( B_1, B_2, \dots \) is a partition of the sample space \( \Omega \), then for any event \( A \): $$P(A) = \sum_i P(A \mid B_i) \, P(B_i)$$
Law of Total Probability Example
You have 2 bags of balls that each contain 70 balls. Bag 1 has 50 red balls and 20 black balls. Bag 2 has 30 red balls and 40 black balls. You pick a bag at random and pick a random ball from that bag. What is the probability that you pick a red ball?
We can formalise the question. Let $R$ be the event of drawing a red ball. Let $B_n$ be the event of picking bag $n$. From the question we know: $$P(B_1) = P(B_2) = \frac{1}{2}, \quad P(R \mid B_1) = \frac{50}{70} = \frac{5}{7}, \quad P(R \mid B_2) = \frac{30}{70} = \frac{3}{7}$$
Solving the question requires us to find $P(R)$. We can proceed by substituting the above values into the Law of Total Probability: $$P(R) = P(R \mid B_1) \, P(B_1) + P(R \mid B_2) \, P(B_2) = \frac{5}{7} \times \frac{1}{2} + \frac{3}{7} \times \frac{1}{2} = \frac{8}{14} = \frac{4}{7}$$
The probability of drawing a red ball from the randomly chosen bag is $4/7$.
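We can check this by simulating the two-stage experiment directly. A minimal Python sketch (our own; the names are made up):

```python
import random

def red_ball_probability(trials=100_000):
    """Simulate: pick a bag at random, then a ball at random from that bag."""
    bag1 = ["red"] * 50 + ["black"] * 20
    bag2 = ["red"] * 30 + ["black"] * 40
    reds = 0
    for _ in range(trials):
        bag = random.choice([bag1, bag2])  # P(B_1) = P(B_2) = 1/2
        if random.choice(bag) == "red":    # P(R | B_n) by uniform draw
            reds += 1
    return reds / trials

print(red_ball_probability())  # typically close to 4/7 ≈ 0.571
```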
Probability Trees
A decision tree is a tree representing a sequence of decisions. For example, we might pick between two bags at random and then pick a ball at random out of the bag. In this case, we would also be able to assign probabilities to each of our decisions, making it a probability tree.
Probability trees are visualisations that make it easy to work out otherwise complex probabilities. The choices in a decision tree are mutually exclusive, meaning that in a probability tree we can add the probabilities of branches together to get the probability that at least one of those branches will be taken (the union of those events). If you walk down a probability tree, each probability is conditional upon its parent branch being reached; this means we can multiply down branches to work out the probability of taking a particular path through the tree.
Decision Tree
A way of representing a sequence of Decisions
Probability Tree
In a probability tree, we can think of each vertex as being an event.
$P(V_i)$ is the likelihood that a random walk through the probability tree will include $V_i$
Multiplying down branches
To work out the absolute probability of a set of events we can multiply down the tree
Remember the Multiplication Rule: \( P(E_1 \cap E_2) = P(E_1 \mid E_2) \times P(E_2) \)
Let $D_1, \dots, D_n$ be the path from $root(T)$ to $V_i$: $$P(V_i) = P(D_1) \times P(D_2 \mid D_1) \times \dots \times P(D_n \mid D_1 \cap \dots \cap D_{n-1})$$
This gives the probability of walking down the probability tree following a given path $D$.
Adding across branches
Branches are mutually exclusive, so we can add them together to get their combined probability
Remember the Addition Rule: for mutually exclusive events, \( P(A \cup B) = P(A) + P(B) \)
Disease Screening Example
Imagine you have been screened for a disease and get a positive result. We know certain statistics about the disease, given below. What is the probability you have the disease?
- \( P(D) = 0.1 \) (prior probability = prevalence)
- \( P(T \mid D) = 0.9 \) (sensitivity)
- \( P(T^c \mid D^c) = 0.8 \) (specificity)
Here we want to work out $P(D \mid T)$, for which we can use Bayes' rule. As you can see from the formula below, we have everything we need except for $P(T)$: $$P(D \mid T) = \frac{P(T \mid D) \, P(D)}{P(T)}$$ We can work this out either using a probability tree or using the Law of Total Probability, introduced above.
Using a Probability Tree
We can construct a probability tree of this situation, as shown on the slide. In this diagram I have partitioned event $T$ in two for clarity when representing it in the probability tree: $T_1 = T \cap D$ and $T_2 = T \cap D^c$. These partition $T$, meaning $T = T_1 \cup T_2$.
While only some of the edges in the diagram are labelled with conditional probabilities, where events are complementary we know their probabilities must sum to one. For instance, we have $P(T_1 \mid D) = 0.9$. This is the probability of getting a positive test result given that we have the disease. The only other possibility, given that we have the disease, is that we get a negative test result, therefore $P(T^c \mid D) = 0.1$. (Observe that $0.9 + 0.1 = 1$.)
We can find $P(T_1)$ and $P(T_2)$ by multiplying down the tree.
$$P(T_1) = 0.1 \times 0.9 = 0.09$$ $$P(T_2) = 0.9 \times 0.2 = 0.18$$ The addition rule then tells us that $P(T) = P(T_1) + P(T_2) = 0.27$.
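The same multiply-down, add-across procedure can be written as a small program. A minimal Python sketch (our own representation; the nested-dict layout and names like `path_probability` are made up for illustration):

```python
# Each child node carries the conditional probability of taking that branch.
tree = {
    "D":  {"p": 0.1, "T1": {"p": 0.9}, "neg": {"p": 0.1}},  # have disease
    "Dc": {"p": 0.9, "T2": {"p": 0.2}, "neg": {"p": 0.8}},  # no disease
}

def path_probability(path):
    """Multiply conditional probabilities down one path of the tree."""
    node, prob = tree, 1.0
    for step in path:
        node = node[step]
        prob *= node["p"]
    return prob

p_t1 = path_probability(["D", "T1"])   # 0.1 * 0.9 = 0.09
p_t2 = path_probability(["Dc", "T2"])  # 0.9 * 0.2 = 0.18
print(p_t1 + p_t2)                     # add across branches: P(T) = 0.27
```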
Using the Law of Total Probability
Our hypotheses $D$ and $D^c$ form a partition, as they are disjoint (they cannot both be true) and collectively exhaustive (either you do or do not have the disease). As our hypotheses partition our sample space, we can rewrite $P(T)$ as a sum of terms of the form $P(T \mid H) \, P(H)$ using the Law of Total Probability: $$P(T) = P(T \mid D) \, P(D) + P(T \mid D^c) \, P(D^c) = 0.9 \times 0.1 + 0.2 \times 0.9 = 0.27$$
Finding $P(D \mid T)$
Now we know $P(T)$, we can find $P(D \mid T)$ using Bayes' Rule: $$P(D \mid T) = \frac{P(T \mid D) \, P(D)}{P(T)} = \frac{0.9 \times 0.1}{0.27} = \frac{1}{3}$$ So even after a positive test result, the probability of actually having the disease is only $1/3$, or about 33%.
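A simulation makes it intuitive why the posterior is only a third: most positive tests come from the large disease-free group. A minimal Python sketch (our own; `screening_posterior` is a made-up name):

```python
import random

def screening_posterior(trials=100_000):
    """Estimate P(D | T): count diseased cases among simulated positive tests."""
    positives = diseased_positives = 0
    for _ in range(trials):
        has_disease = random.random() < 0.1                # prevalence P(D)
        if has_disease:
            positive = random.random() < 0.9               # sensitivity P(T | D)
        else:
            positive = random.random() < 0.2               # 1 - specificity
        if positive:
            positives += 1
            diseased_positives += has_disease
    return diseased_positives / positives

print(screening_posterior())  # typically close to 1/3
```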
Markov Chain
A similar concept to a probability tree is a Markov Chain. Here the requirement that the graph is a tree is relaxed. We need to interpret the vertices slightly differently. They are not events (as they can occur more than once) - rather they are states in a system. The system transitions through a sequence of states according to the transition probabilities that label each edge.
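For a concrete feel, here is a minimal Python sketch of a two-state Markov Chain (our own toy example; the states and transition probabilities are made up):

```python
import random

# transitions[s] maps each possible next state to its transition probability.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def walk(start, steps):
    """Transition through a sequence of states using the edge probabilities."""
    state, path = start, [start]
    for _ in range(steps):
        options = transitions[state]
        state = random.choices(list(options), weights=list(options.values()))[0]
        path.append(state)
    return path

print(walk("sunny", 10))  # e.g. ['sunny', 'sunny', 'rainy', ...]
```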