Probability
This practical starts with some questions to help you understand the basics of probability. Below this there is a tutorial that will walk you through the mathematics of a simple machine learning algorithm for classifying messages based on a set of examples.
Getting Started
Independent Events
Two events are independent if knowing the outcome of one event does not change the probability of the other event.
Which of the following involve independent events?
- Rolling two dice: Independent. The result rolled on one die does not affect the other die.
- Drawing two cards from a deck and looking at their suit: Dependent. After we have drawn one card, the proportions of cards in the deck have changed. If the first card was a heart, for example, there would be fewer hearts left in the deck.
- Generating two random numbers one after the other: Depends on our assumptions about our random number generation method. If we used a pseudorandom number generation algorithm the events would be close to independent, but not entirely independent. Pseudorandom number generators cannot provide truly random outputs, so in principle knowing one generated number changes the expectation of what number you would see next.
- The probability of two successive requests to a webserver failing: Likely not independent in certain conditions. For example, if the webserver is experiencing high load, the first request failing might make it more likely for the second to fail. Whether we treat these events as dependent or independent depends on our modelling assumptions.
Work out Probabilities
Give the probability for each of the events and experiments described below.
- Flip a coin twice and get two heads
Let a sample space \( \Omega = \{(H, H), (H, T), (T, H), (T, T)\} \) and an event \( E = \{(H, H)\} \)
$P(E) = 1/4$
- Draw a face card that is a spade from a 52 card deck
We could approach this question by counting the outcomes in the event where a card is drawn that is both a spade and a face card. We find that there are 3. There are 52 possible outcomes in the sample space. If we assume that all outcomes are equally likely, we can calculate that the probability is $3/52$.
- Draw a card that is both an even number and a red card from a 52 card deck
Let each outcome in sample space $\Omega$ be drawing one of 52 cards. Let $R$ be the event where a red card has been drawn. As there are 26 red cards in a deck, $\#R = 26$. Let $E$ be the event where an even number has been drawn. By counting we find that $\#E = 20$.
We could approach this question by counting the even-numbered red cards (of which there are 10), and thus conclude based on classical probability that the probability of drawing an even-numbered red card $P(E \cap R) = 10/52$
Alternatively, we could work out the probabilities $P(R) = 26/52$ and $P(E) = 20/52$. Given that these two events are independent, we can use the multiplication rule for independent events
\[ P(R \cap E) = P(R) \times P(E) = (26/52) \times (20/52) \]
If we were not sure if the events were independent, we would use the general multiplication rule:
\[ P(R \cap E) = P(R \mid E) \times P(E) \]
To find $P(R \mid E)$ we need to consider the probability of drawing a red card given that an even numbered card has been drawn. Considering only the even numbered cards in a deck, we find that 10 of them are red. Thus, based on classical probability, $P(R \mid E) = 10/20 = 1/2$. This is expected, as we know that half the cards in a deck are red. We can then substitute this value into the formula above with what we already know.
\[ P(R \cap E) = (1/2) \times (20/52) = 10/52 \]
This gives us the same answer as above because $R$ and $E$ are independent events.
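To check the counting argument, a short script can enumerate the deck (a sketch; the rank and suit names are my own choices, and only the numbered cards 2–10 count as "even"):

```python
from fractions import Fraction

# Enumerate a 52-card deck as (rank, suit) pairs. Ranks 2-10 are the
# numbered cards; only these count as "even" for this question.
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

red = {c for c in deck if c[1] in ("hearts", "diamonds")}
even = {c for c in deck if c[0].isdigit() and int(c[0]) % 2 == 0}

p_red = Fraction(len(red), len(deck))            # 26/52
p_even = Fraction(len(even), len(deck))          # 20/52
p_both = Fraction(len(red & even), len(deck))    # 10/52

# Independence: P(R and E) equals P(R) times P(E)
assert p_both == p_red * p_even
```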
- Draw two face cards in a row from a 52 card deck, without replacing the cards.
Trying to solve this by counting the outcomes in the event we are interested in would be time consuming. Instead, we will work it out using the rules of probability.
We can calculate that the probability of our first card being a face card as $P(F_1) = 12/52$. The probability of our second card being a face card will be different. This time there are only 11 face cards out of 51 remaining cards. Thus $P(F_2 \mid F_1) = 11/51$. Applying the general multiplication rule, we calculate $P(F_1 \cap F_2) = (12/52) \times (11/51) = 132/2652 = 11/221$
- You draw a card. Each time you draw a red card you draw again. What is the chance of getting 3 red cards?
As in the above question we work out:
\[ P(R_1) = 26/52 \]
\[ P(R_2 \mid R_1) = 25/51 \]
\[ P(R_3 \mid R_1, R_2) = 24/50 \]
From this we can apply the basic rules we saw in the lecture to work out $P(R_2 \cap R_1)$
\[ P(R_2 \cap R_1) = P(R_2 \mid R_1) \times P(R_1) \]
\[ = (25/51) \times (26/52) = 25/102 \]
And then $P(R_3 \cap (R_1 \cap R_2))$
\[ P(R_3 \cap (R_2 \cap R_1)) = P(R_3 \mid (R_2 \cap R_1)) \times P(R_2 \cap R_1) \]
\[ = P(R_3 \mid R_1, R_2) \times (25/102) \]
\[ = (24/50) \times (25/102) = 2/17 \]
Notice that this is the same as
\[ P(R_1) \times P(R_2 \mid R_1) \times P(R_3 \mid R_1, R_2) \]
\[ = (26/52) \times (25/51) \times (24/50) = 2/17 \]
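The step-by-step product rule used above can be written as a small helper (a sketch; `p_run` is a hypothetical name of my own):

```python
from fractions import Fraction

# Probability of drawing k "successes" in a row without replacement:
# multiply the conditional probabilities step by step (general product rule).
def p_run(successes, deck_size, k):
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(successes - i, deck_size - i)
    return p

# Two face cards in a row: (12/52) x (11/51) = 11/221
assert p_run(12, 52, 2) == Fraction(11, 221)

# Three red cards in a row: (26/52) x (25/51) x (24/50) = 2/17
assert p_run(26, 52, 3) == Fraction(2, 17)
```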
Probability questions
Let $P(A) = 0.5$, $P(B) = 0.25$, $P(B \mid A) = 0.1$
- Find $P(A \cap B)$
\[ P(A \cap B) = P(B \mid A) \times P(A) \]
\[ P(A \cap B) = 0.1 \times 0.5 = 0.05 \]
- Find $P(A \cup B)$
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
\[ P(A \cup B) = 0.5 + 0.25 - 0.05 = 0.7 \]
- Find $P(B \mid A)$ using the conditional probability formula
\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} \]
\[ P(B \mid A) = \frac{0.05}{0.5} = 0.1 \]
This agrees with the given value of $P(B \mid A)$.
Let $P(H) = 0.1$, $P(E) = 0.2$, $P(E \mid H) = 0.9$
- Find \( P(H \mid E) \)
\[ P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)} = \frac{0.9 \times 0.1}{0.2} = 0.45 \]
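The exercise sets above can be checked numerically (a sketch; the helper function names are my own):

```python
# Multiplication rule, addition rule and Bayes' rule as small helpers.
def p_and(p_b_given_a, p_a):
    # P(A and B) = P(B|A) * P(A)
    return p_b_given_a * p_a

def p_or(p_a, p_b, p_a_and_b):
    # P(A or B) = P(A) + P(B) - P(A and B)
    return p_a + p_b - p_a_and_b

def bayes(p_h, p_e, p_e_given_h):
    # P(H|E) = P(E|H) * P(H) / P(E)
    return p_e_given_h * p_h / p_e

# First set: P(A) = 0.5, P(B) = 0.25, P(B|A) = 0.1
assert abs(p_and(0.1, 0.5) - 0.05) < 1e-12        # P(A and B)
assert abs(p_or(0.5, 0.25, 0.05) - 0.7) < 1e-12   # P(A or B)

# Second set: P(H) = 0.1, P(E) = 0.2, P(E|H) = 0.9
assert abs(bayes(0.1, 0.2, 0.9) - 0.45) < 1e-12
```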
Using Bayes' Rule
We can use the Law of Total Probability with Bayes' rule (see slide) to work out the posterior probabilities of a set of hypotheses $H_1, \dots, H_n$, where these hypotheses are disjoint (cannot co-occur) and collectively exhaustive.
For more details, see the lecture notes.
Let the prior probabilities of our two hypotheses be $P(H_1) = 0.5$, $P(H_2) = 0.5$. Further, let the likelihoods be $P(E \mid H_1) = 0.8$ and $P(E \mid H_2) = 0.2$.
- Find \( P(H_1 \mid E) \) and $P(H_2 \mid E)$
\[ P(E \mid H_1) \times P(H_1) = 0.8 \times 0.5 = 0.4 \]
\[ P(E \mid H_2) \times P(H_2) = 0.2 \times 0.5 = 0.1 \]
\[ P(H_1 \mid E) = \frac{0.4}{0.4 + 0.1} = 0.8 \]
\[ P(H_2 \mid E) = \frac{0.1}{0.4 + 0.1} = 0.2 \]
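The normalisation step above generalises to any number of disjoint, exhaustive hypotheses (a sketch; `posteriors` is a hypothetical helper of my own):

```python
# Posterior probabilities over disjoint, exhaustive hypotheses:
# P(H_i | E) = P(E | H_i) P(H_i) / sum_j P(E | H_j) P(H_j)
def posteriors(priors, likelihoods):
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)            # P(E), by the Law of Total Probability
    return [j / total for j in joint]

post = posteriors([0.5, 0.5], [0.8, 0.2])
assert abs(post[0] - 0.8) < 1e-12
assert abs(post[1] - 0.2) < 1e-12
```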
Machine Learning
Our goal here is to create a (very) simple machine learning algorithm for identifying strings matching a given statistical pattern. This is a kind of classification task. We want to classify whether a string that we have never seen before is similar to our training set. This could be used for spam detection (do the messages match the patterns of spam messages). Here we are going to try detecting whether a string is likely to be an English word.
To simplify matters for now, we will work only with messages made of the letters A, B, C, and D. We will assume that every message is four letters long. This keeps the process tractable to work out by hand. However, the interactive demos allow you to have a longer alphabet, and to use large training sets. Use these demos to see how the same approach scales to a real-world use case.
We have a collection of strings that we want to learn from. These might represent spam messages, or English words, or whatever it is we are modelling. This is the training data that we want to use to parametrise our models.
CABA CBCB CACB CBCC
We have no training data for non-target strings (e.g. non-spam messages). For simplicity, we will assume that we have no knowledge about these strings, and thus that, as far as we know, they are generated completely at random. We could easily extend the approach here to incorporate more sophisticated models of non-target strings if we wanted to.
The basic idea is that we will devise a model that describes a mechanism for how strings are generated. This model will depend on several probabilities. We can work out these probabilities from frequencies in our training dataset.
First attempt
We are trying to classify strings as either English or not English. We will describe two models, $M_1$, a model that generates 'English' (in inverted commas), and $R_1$, a model that generates random messages. On the assumption that only these two models could have been used, we will then work out the probability that a given message was generated by $M_1$ and not $R_1$.
Let us say that strings are generated by a weighted random choice for each successive letter. In other words, for each character in the string, a random letter is picked. Each letter has a probability of being selected that is independent of where that letter occurs, or of any other character.
If we know nothing about the distribution of letters, we start with the assumption that any letter is as likely as any other. This will be our modelling assumption for when words are not 'English', giving us the model $R_1$.
By looking at our training data, we can pick probabilities based on the frequencies with which letters appear.
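For example, using the four training strings given above, the letter frequencies can be counted directly (a sketch; note that a demo run with its own training set will give different numbers):

```python
from collections import Counter

# Estimate P(letter) for model M_1 from frequencies in the training data.
training = ["CABA", "CBCB", "CACB", "CBCC"]

counts = Counter(letter for word in training for letter in word)
total = sum(counts.values())                 # 16 letters in total
probs = {letter: counts[letter] / total for letter in "ABCD"}

# e.g. C appears 8 times out of 16 letters
assert probs["C"] == 0.5
assert probs["D"] == 0.0
```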
We have now parametrised our model $M_1$ with probabilities based on our training data. We can now use $M_1$ to generate a random 'English' word. Remember, our model says that each letter is chosen independently based on the probability of selecting each letter. Note that how realistic this generated word is (how much it looks to us like 'English') will be limited by the simplistic modelling assumptions we have made.
We can also use our parametrised model $M_1$ to determine the probability that a given word is 'English'. The basic idea is to see how likely model $M_1$ is to generate a given word and compare that to the probability of getting that word using model $R_1$.
For example, if we have a four letter word CBCB, we can work out the probability of generating this word based on model $M_1$ as
\[ P(CBCB \mid M_1) = P(C) \times P(B) \times P(C) \times P(B) = 0.4375 \times 0.25 \times 0.4375 \times 0.25 \approx 0.0120 \]
Now we have $P(CBCB \mid M_1)$, which tells us how likely it is that $M_1$ generates CBCB. However, what we want is $P(M_1 \mid CBCB)$. In other words, how likely is it that the model used was $M_1$ if the string observed was CBCB? We can use Bayes' rule to find this.
\[ P(M_1 \mid CBCB) = \frac{P(CBCB \mid M_1) \times P(M_1)}{P(CBCB)} \]
$P(M_1)$ is the prior probability that model $M_1$ was used. In other words, it is our initial expectation of the likelihood that the message is 'English'. We might know or guess that 50% of messages are 'English'. In this example, our prior probability would be set to $P(M_1) = 0.5$.
To use Bayes' rule, we need to work out $P(CBCB)$, the probability of generating CBCB. We know the probability for generating this given model $M_1$. We need to consider the possibility that the message is not 'English'.
One of our modelling assumptions is that strings are either generated by $R_1$ (they are random), or they are generated by $M_1$ (they are 'English').
We can work out the probability of generating this word if letters were chosen at random (model $R_1$) as
\[ P(CBCB \mid R_1) = (1/4)^4 = 0.00390625 \]
As $R_1$ and $M_1$ are the only possibilities (given our modelling assumptions), they form a partition. Thus we can use the Law of Total Probability:
\[ P(CBCB) = P(CBCB \mid M_1) \times P(M_1) + P(CBCB \mid R_1) \times P(R_1) \]
This gives us everything we need to calculate $P(M_1 \mid CBCB)$.
- Calculate the probability that CBCB is 'English'. In other words, calculate $P(M_1 \mid CBCB)$.
First Attempt Demo
Letter Probabilities
Letter | Probability |
---|---|
A | 0.31250000 |
B | 0.25000000 |
C | 0.43750000 |
D | 0.00000000 |
Test
Match: 0.000%

String | P(string \| model) | P(string \| random) | P(model \| string) |
---|---|---|---|
CBCB | 0.01196289 | 0.00390625 | 0.75384615 |
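The first-attempt calculation can be reproduced with a short script (a sketch; the letter probabilities are taken from the demo's letter probability table, the prior is 0.5, and the function names are my own):

```python
from math import prod

# Letter probabilities for model M_1, as reported in the demo table.
letter_p = {"A": 0.3125, "B": 0.25, "C": 0.4375, "D": 0.0}

def p_given_m1(s):
    # Each letter is chosen independently under M_1.
    return prod(letter_p[c] for c in s)

def p_given_r1(s, alphabet_size=4):
    # Under R_1 every letter is equally likely.
    return (1 / alphabet_size) ** len(s)

def posterior(s, prior=0.5):
    # Bayes' rule, with P(s) from the Law of Total Probability.
    pm = p_given_m1(s) * prior
    pr = p_given_r1(s) * (1 - prior)
    return pm / (pm + pr)

assert abs(p_given_m1("CBCB") - 0.01196289) < 1e-7
assert abs(posterior("CBCB") - 0.75384615) < 1e-7
```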
Prompts for reflection
You may want to consider the answers to these questions before you move on.
- Compare your results to the results in the demo. Is the demo correct? (It's possible I've made an error implementing it.)
- You may want to experiment with a larger alphabet (such as ABCDEFGHIJKLMNOPQRSTUVWXYZ), and/or with a larger training set.
- Where do you think the main limitations in the model are? What information does the model not represent that you think would help improve the model's predictive power?
Second Attempt
To improve on this, we might recognise that different letters are more or less likely to appear in different positions.
Let $P(A_n)$ be the probability of letter $A$ appearing at position $n$. Initially, we might assume that each letter is equally likely in any given position. If we have four letters, each of these probabilities would be $0.25$.
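Positional probabilities can again be estimated from the training strings (a sketch using the four example strings from the text; a demo run with other training data will give different numbers):

```python
from collections import Counter

# Estimate P(letter at position n) for model M_2.
training = ["CABA", "CBCB", "CACB", "CBCC"]
word_len = 4

positional = [Counter(word[n] for word in training) for n in range(word_len)]
probs = [{letter: counts[letter] / len(training) for letter in "ABCD"}
         for counts in positional]

# Every training word here starts with C, so P(C at position 1) = 1.
assert probs[0]["C"] == 1.0
# Position 2 is A or B with equal frequency.
assert probs[1]["A"] == 0.5 and probs[1]["B"] == 0.5
```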
Now, as above, we can use our new model, $M_2$, to generate 'English' words.
As above, we can work out the probability that the string CBCB is generated on the assumption that model $M_2$ is used:
\[ P(CBCB \mid M_2) = P(C_1) \times P(B_2) \times P(C_3) \times P(B_4) \]
We can also work out the probability that the string CBCB is generated on the assumption that messages are generated at random. Let us call this model $R_2$:
\[ P(CBCB \mid R_2) = (1/4)^4 = 0.00390625 \]
In our second attempt, we are assuming that the messages we are classifying were either generated by $M_2$ or by $R_2$. As above, we can use the Law of Total Probability and Bayes' rule to work out $P(M_2 \mid CBCB)$.
Second Attempt Demo
Letter Probabilities
Letter | Probability |
---|---|
A_1 | 0.25000000 |
B_1 | 0.00000000 |
C_1 | 0.75000000 |
D_1 | 0.00000000 |
A_2 | 0.50000000 |
B_2 | 0.50000000 |
C_2 | 0.00000000 |
D_2 | 0.00000000 |
A_3 | 0.00000000 |
B_3 | 0.25000000 |
C_3 | 0.75000000 |
D_3 | 0.00000000 |
A_4 | 0.50000000 |
B_4 | 0.25000000 |
C_4 | 0.25000000 |
D_4 | 0.00000000 |
Test
Match: 100.000%

String | P(string \| model) | P(string \| random) | P(model \| string) |
---|---|---|---|
CBCB | 0.07031250 | 0.00390625 | 0.94736842 |
Detecting English words
Investigate how good the interactive demos are at detecting English words. For this you will need to set the alphabet to ABCDEFGHIJKLMNOPQRSTUVWXYZ.
- Put some English words into the training set to calculate the model probabilities from. Then experiment with English and non-English words in the test set.
- The accuracy of the model on the test set (shown above the 'test' table) tells you how many strings in the test set were detected as matching the model. A string matches if the probability it was generated by the model exceeds the threshold probability. The higher this threshold, the more conservative the model will be.
- The quality of the model is significantly affected by the training data. More training data will improve the ability of the model to correctly identify English words, though only up to a point. Paste a word list into the training set (for very large word lists you might need to wait a minute for it to create the model). You can find wordlists on the web, or follow one of these links (1000 most common US-English words, Large Word lists).
- You can evaluate the model as a tool for checking English sentences. However, it only works with one word per line. Collect together some sentences (Random Sentence Generator). Replace each space with a new line, and paste the list of words into the test set. Look at the accuracy or match rate. As you work through the rest of this practical, compare the models in this way.
Third Attempt
Now we might consider that different letters are more or less likely depending on the letter that appears in front of them.
We need to find the probabilities of the letters appearing in the first position: \( P(A_1), P(B_1), \dots \). And we need to find the conditional probability of each letter occurring in position $n$ for every letter that could have occurred in position $n-1$. This can be represented in a probability tree. As each letter depends only on the previous letter, a better representation is as a Markov chain.
Using this diagram, we can generate a random 'English' word by randomly walking down the probability tree or Markov chain according to the transition probabilities.
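The first-letter and transition probabilities for $M_3$ can be counted in the same way as before (a sketch; `p_next` is a hypothetical name, and the four example strings are used as training data):

```python
from collections import Counter, defaultdict

# Estimate P(first letter) and P(letter at n+1 | letter at n) for model M_3.
training = ["CABA", "CBCB", "CACB", "CBCC"]

first = Counter(word[0] for word in training)
p_first = {c: n / len(training) for c, n in first.items()}

transitions = defaultdict(Counter)   # keyed by (position, previous letter)
for word in training:
    for n in range(len(word) - 1):
        transitions[(n, word[n])][word[n + 1]] += 1

def p_next(position, prev, nxt):
    counts = transitions[(position, prev)]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

assert p_first["C"] == 1.0          # every training word here starts with C
assert p_next(0, "C", "A") == 0.5   # C at position 1 is followed by A half the time
```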
Now we can also calculate the probability of generating the string CBCB assuming model $M_3$ is used:
\[ P(CBCB \mid M_3) = P(C_1) \times P(B_2 \mid C_1) \times P(C_3 \mid B_2) \times P(B_4 \mid C_3) \]
We can also calculate the probability assuming strings are randomly generated: $P(CBCB \mid R_3)$. From here we can calculate $P(M_3 \mid CBCB)$.
Third Attempt Demo
Letter Probabilities
Letter | Probability |
---|---|
A_1 | 0.25000000 |
B_1 | 0.00000000 |
C_1 | 0.75000000 |
D_1 | 0.00000000 |
A_2\|A_1 | 0.00000000 |
B_2\|A_1 | 1.00000000 |
C_2\|A_1 | 0.00000000 |
D_2\|A_1 | 0.00000000 |
A_2\|B_1 | 0.00000000 |
B_2\|B_1 | 0.00000000 |
C_2\|B_1 | 0.00000000 |
D_2\|B_1 | 0.00000000 |
A_2\|C_1 | 0.66666667 |
B_2\|C_1 | 0.33333333 |
C_2\|C_1 | 0.00000000 |
D_2\|C_1 | 0.00000000 |
A_2\|D_1 | 0.00000000 |
B_2\|D_1 | 0.00000000 |
C_2\|D_1 | 0.00000000 |
D_2\|D_1 | 0.00000000 |
A_3\|A_2 | 0.00000000 |
B_3\|A_2 | 0.50000000 |
C_3\|A_2 | 0.50000000 |
D_3\|A_2 | 0.00000000 |
A_3\|B_2 | 0.00000000 |
B_3\|B_2 | 0.00000000 |
C_3\|B_2 | 1.00000000 |
D_3\|B_2 | 0.00000000 |
A_3\|C_2 | 0.00000000 |
B_3\|C_2 | 0.00000000 |
C_3\|C_2 | 0.00000000 |
D_3\|C_2 | 0.00000000 |
A_3\|D_2 | 0.00000000 |
B_3\|D_2 | 0.00000000 |
C_3\|D_2 | 0.00000000 |
D_3\|D_2 | 0.00000000 |
A_4\|A_3 | 0.00000000 |
B_4\|A_3 | 0.00000000 |
C_4\|A_3 | 0.00000000 |
D_4\|A_3 | 0.00000000 |
A_4\|B_3 | 1.00000000 |
B_4\|B_3 | 0.00000000 |
C_4\|B_3 | 0.00000000 |
D_4\|B_3 | 0.00000000 |
A_4\|C_3 | 0.33333333 |
B_4\|C_3 | 0.33333333 |
C_4\|C_3 | 0.33333333 |
D_4\|C_3 | 0.00000000 |
A_4\|D_3 | 0.00000000 |
B_4\|D_3 | 0.00000000 |
C_4\|D_3 | 0.00000000 |
D_4\|D_3 | 0.00000000 |
Test
Match: 100.000%

String | P(string \| model) | P(string \| random) | P(model \| string) |
---|---|---|---|
CBCB | 0.08333333 | 0.00390625 | 0.95522388 |
Fourth Attempt
Next, we might reason that it doesn't matter where a pair of letters appears. We might want to learn a model $M_4$ defined by the probabilities of transitioning between each pair of letters.
We construct a Markov model with a node for each letter and an edge between each pair of letters. The probabilities labelling these edges can be found from the frequencies with which certain pairs of letters appear in the dataset.
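Pooling pairs from every position gives the transition probabilities for $M_4$ (a sketch, again using the four example strings; the demo's numbers depend on its own training set):

```python
from collections import Counter, defaultdict

# Estimate position-independent transition probabilities P(next | previous)
# for model M_4 by pooling letter pairs from every position.
training = ["CABA", "CBCB", "CACB", "CBCC"]

pair_counts = defaultdict(Counter)
for word in training:
    for prev, nxt in zip(word, word[1:]):
        pair_counts[prev][nxt] += 1

p_trans = {prev: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
           for prev, counts in pair_counts.items()}

# In these strings, 'B' is followed by 'C' in 2 of its 3 occurrences
# as a previous letter.
assert abs(p_trans["B"]["C"] - 2 / 3) < 1e-12
```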
From here, the same process can be followed as with the previous modelling attempts to work out $P(CBCB \mid M_4)$.
Fourth Attempt Demo
Letter Probabilities
Letter | Probability |
---|---|
A_1 | 0.25000000 |
B_1 | 0.00000000 |
C_1 | 0.75000000 |
D_1 | 0.00000000 |
A\|A | 0.00000000 |
B\|A | 0.66666667 |
C\|A | 0.33333333 |
D\|A | 0.00000000 |
A\|B | 0.33333333 |
B\|B | 0.00000000 |
C\|B | 0.66666667 |
D\|B | 0.00000000 |
A\|C | 0.50000000 |
B\|C | 0.33333333 |
C\|C | 0.16666667 |
D\|C | 0.00000000 |
A\|D | 0.00000000 |
B\|D | 0.00000000 |
C\|D | 0.00000000 |
D\|D | 0.00000000 |
Test
Match: 100.000%

String | P(string \| model) | P(string \| random) | P(model \| string) |
---|---|---|---|
CBCB | 0.05555556 | 0.00390625 | 0.93430657 |
Suitability for English word detection
After you have played with these models for some time, try and answer these questions about their ability to generate and detect English words:
- If you were to write a non-English word that the model would detect as English, what features would you include?
- What sorts of English words does the model fail to identify, and why? Try predicting in advance whether a given word will be matched as English or not.
- Are there contexts or words for which the earlier models are more accurate than the later models? Why is this?
- In some of the models, shorter words are less likely to be detected as English than longer words. Why do you think this is?
- How would you develop or adapt these models to improve accuracy?