Probability
This practical starts with some questions to help you understand the basics of probability. Below this there is a tutorial that will walk you through the mathematics of a simple machine learning algorithm for classifying messages based on a set of examples.
Getting Started
Independent Events
Two events are independent if knowing the outcome of one event does not change the probability of the other event.
Which of the following involve independent events?
- Rolling two dice: Independent. The result rolled on one die does not affect the other die.
- Drawing two cards from a deck and looking at their suit: Dependent. After we have drawn one card, the proportions of cards in the deck have changed. If the first card was a heart, for example, there would be fewer hearts left in the deck.
- Generating two random numbers one after the other: Depends on our assumptions about our random number generation method. If we used a pseudorandom number generation algorithm the events would be close to independent, but not entirely independent. Pseudorandom number generators cannot provide truly random outputs, so in principle knowing one generated number changes the expectation of what number you would see next.
- The probability of two successive requests to a webserver failing: Likely not independent in certain conditions. For example, if the webserver is experiencing high load, the first request failing might make it more likely for the second to fail. Whether we treat these events as dependent or independent depends on our modelling assumptions.
Work out Probabilities
Give the probability for each of the events and experiments described below.
- Flip a coin twice and get two heads
Let a sample space \( \Omega = \{(H, H), (H, T), (T, H), (T, T)\} \) and an event \( E = \{(H, H)\} \)
$P(E) = 1/4$
- Draw a face card that is a spade from a 52 card deck
We could approach this question by counting the outcomes in the event where a card is drawn that is both a spade and a face card. We find that there are 3. There are 52 possible outcomes in the sample space. If we assume that all outcomes are equally likely, we can calculate that the probability is $3/52$.
- Draw a card that is both an even number and a red card from a 52 card deck
Let each outcome in sample space $\Omega$ be drawing one of 52 cards. Let $R$ be the event where a red card has been drawn. As there are 26 red cards in a deck, $\#R = 26$. Let $E$ be the event where an even number has been drawn. By counting we find that $\#E = 20$.
We could approach this question by counting the even-numbered red cards (of which there are 10), and thus conclude based on classical probability that the probability of drawing an even-numbered red card $P(E \cap R) = 10/52$
Alternatively, we could work out the probabilities $P(R) = 26/52$ and $P(E) = 20/52$. Given that these two events are independent, we can use the multiplication rule for independent events
\[ P(R \cap E) = P(R) \times P(E) = (26/52) \times (20/52) \]
If we were not sure if the events were independent, we would use the general multiplication rule:
\[ P(R \cap E) = P(R \mid E) \times P(E) \]
To find $P(R \mid E)$ we need to consider the probability of drawing a red card given that an even numbered card has been drawn. Considering only the even numbered cards in a deck, we find that 10 of them are red. Thus, based on classical probability, $P(R \mid E) = 10/20 = 1/2$. This is expected, as we know that half the cards in a deck are red. We can then substitute this value into the formula above with what we already know.
\[ P(R \cap E) = (1/2) \times (20/52) = 10/52 \]
This gives us the same answer as above because $R$ and $E$ are independent events.
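To check the counting argument, a short script can enumerate the deck (a sketch; the rank and suit names are my own choices, and only the numbered cards 2–10 count as "even"):

```python
from fractions import Fraction

# Enumerate a 52-card deck as (rank, suit) pairs. Ranks 2-10 are the
# numbered cards; only these count as "even" for this question.
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

red = {c for c in deck if c[1] in ("hearts", "diamonds")}
even = {c for c in deck if c[0].isdigit() and int(c[0]) % 2 == 0}

p_red = Fraction(len(red), len(deck))            # 26/52
p_even = Fraction(len(even), len(deck))          # 20/52
p_both = Fraction(len(red & even), len(deck))    # 10/52

# Independence: P(R and E) equals P(R) times P(E)
assert p_both == p_red * p_even
```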
- Draw two face cards in a row from a 52 card deck, without replacing the cards.
Trying to solve this by counting the outcomes in the event we are interested in would be time consuming. Instead, we will work it out using the rules of probability.
We can calculate that the probability of our first card being a face card as $P(F_1) = 12/52$. The probability of our second card being a face card will be different. This time there are only 11 face cards out of 51 remaining cards. Thus $P(F_2 \mid F_1) = 11/51$. Applying the general multiplication rule, we calculate $P(F_1 \cap F_2) = (12/52) \times (11/51) = 132/2652 = 11/221$
- You draw a card. Each time you draw a red card you draw again. What is the chance of getting 3 red cards?
As in the above question we work out:
\[ P(R_1) = 26/52 \]
\[ P(R_2 \mid R_1) = 25/51 \]
\[ P(R_3 \mid R_1, R_2) = 24/50 \]
From this we can apply the basic rules we saw in the lecture to work out $P(R_2 \cap R_1)$
\[ P(R_2 \cap R_1) = P(R_2 \mid R_1) \times P(R_1) \]
\[ = (25/51) \times (26/52) = 25/102 \]
And then $P(R_3 \cap (R_1 \cap R_2))$
\[ P(R_3 \cap (R_2 \cap R_1)) = P(R_3 \mid (R_2 \cap R_1)) \times P(R_2 \cap R_1) \]
\[ = P(R_3 \mid R_1, R_2) \times (25/102) \]
\[ = (24/50) \times (25/102) = 2/17 \]
Notice that this is the same as
\[ P(R_1) \times P(R_2 \mid R_1) \times P(R_3 \mid R_1, R_2) \]
\[ = (26/52) \times (25/51) \times (24/50) = 2/17 \]
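The step-by-step product rule used above can be written as a small helper (a sketch; `p_run` is a hypothetical name of my own):

```python
from fractions import Fraction

# Probability of drawing k "successes" in a row without replacement:
# multiply the conditional probabilities step by step (general product rule).
def p_run(successes, deck_size, k):
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(successes - i, deck_size - i)
    return p

# Two face cards in a row: (12/52) x (11/51) = 11/221
assert p_run(12, 52, 2) == Fraction(11, 221)

# Three red cards in a row: (26/52) x (25/51) x (24/50) = 2/17
assert p_run(26, 52, 3) == Fraction(2, 17)
```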
Probability questions
Let $P(A) = 0.5$, $P(B) = 0.25$, $P(B \mid A) = 0.1$
- Find $P(A \cap B)$
\[ P(A \cap B) = P(B \mid A) \times P(A) \]
\[ P(A \cap B) = 0.1 \times 0.5 = 0.05 \]
- Find $P(A \cup B)$
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
\[ P(A \cup B) = 0.5 + 0.25 - 0.05 = 0.7 \]
- Find $P(B \mid A)$ using the conditional probability formula
\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} \]
\[ P(B \mid A) = \frac{0.05}{0.5} = 0.1 \]
This agrees with the given value of $P(B \mid A)$.
Let $P(H) = 0.1$, $P(E) = 0.2$, $P(E \mid H) = 0.9$
- Find \( P(H \mid E) \)
\[ P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)} = \frac{0.9 \times 0.1}{0.2} = 0.45 \]
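The exercise sets above can be checked numerically (a sketch; the helper function names are my own):

```python
# Multiplication rule, addition rule and Bayes' rule as small helpers.
def p_and(p_b_given_a, p_a):
    # P(A and B) = P(B|A) * P(A)
    return p_b_given_a * p_a

def p_or(p_a, p_b, p_a_and_b):
    # P(A or B) = P(A) + P(B) - P(A and B)
    return p_a + p_b - p_a_and_b

def bayes(p_h, p_e, p_e_given_h):
    # P(H|E) = P(E|H) * P(H) / P(E)
    return p_e_given_h * p_h / p_e

# First set: P(A) = 0.5, P(B) = 0.25, P(B|A) = 0.1
assert abs(p_and(0.1, 0.5) - 0.05) < 1e-12        # P(A and B)
assert abs(p_or(0.5, 0.25, 0.05) - 0.7) < 1e-12   # P(A or B)

# Second set: P(H) = 0.1, P(E) = 0.2, P(E|H) = 0.9
assert abs(bayes(0.1, 0.2, 0.9) - 0.45) < 1e-12
```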
Using Bayes' Rule
We can use the Law of Total Probability with Bayes' rule (see slide) to work out the posterior probabilities of a set of hypotheses $H_1, \dots, H_n$, where these hypotheses are disjoint (cannot co-occur) and collectively exhaustive.
For more details, see the lecture notes.
Let the prior probabilities of our two hypotheses be $P(H_1) = 0.5$, $P(H_2) = 0.5$. Further, let the likelihoods be $P(E \mid H_1) = 0.8$ and $P(E \mid H_2) = 0.2$.
- Find \( P(H_1 \mid E) \) and $P(H_2 \mid E)$
\[ P(E \mid H_1) \times P(H_1) = 0.8 \times 0.5 = 0.4 \]
\[ P(E \mid H_2) \times P(H_2) = 0.2 \times 0.5 = 0.1 \]
\[ P(H_1 \mid E) = \frac{0.4}{0.4 + 0.1} = 0.8 \]
\[ P(H_2 \mid E) = \frac{0.1}{0.4 + 0.1} = 0.2 \]
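The normalisation step above generalises to any number of disjoint, exhaustive hypotheses (a sketch; `posteriors` is a hypothetical helper of my own):

```python
# Posterior probabilities over disjoint, exhaustive hypotheses:
# P(H_i | E) = P(E | H_i) P(H_i) / sum_j P(E | H_j) P(H_j)
def posteriors(priors, likelihoods):
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)            # P(E), by the Law of Total Probability
    return [j / total for j in joint]

post = posteriors([0.5, 0.5], [0.8, 0.2])
assert abs(post[0] - 0.8) < 1e-12
assert abs(post[1] - 0.2) < 1e-12
```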
Machine Learning
Our goal here is to create a (very) simple machine learning algorithm for identifying strings matching a given statistical pattern. This is a kind of classification task. We want to classify whether a string that we have never seen before is similar to our training set. This could be used for spam detection (do the messages match the patterns of spam messages). Here we are going to try detecting whether a string is likely to be an English word.
To simplify matters for now, we will work only with messages made of the letters A, B, C, and D. We will assume that every message is four letters long. This keeps the process tractable to work out by hand. However, the interactive demos allow you to have a longer alphabet, and to use large training sets. Use these demos to see how the same approach scales to a real-world use case.
We have a collection of strings that we want to learn from. These might represent spam messages, or English words, or whatever it is we are modelling. This is the training data that we want to use to parametrise our models.
CABA CBCB CACB CBCC
We have no training data for non-target strings (e.g. non-spam messages). For simplicity, we will assume that we have no knowledge about these strings, and thus that, as far as we know, they are generated completely at random. We could easily extend the approach here to incorporate more sophisticated models of non-target strings if we wanted to.
The basic idea is that we will devise a model that describes a mechanism for how strings are generated. This model will depend on several probabilities. We can work out these probabilities from frequencies in our training dataset.
First attempt
We are trying to classify strings as either English or not English. We will describe two models, $M_1$, a model that generates 'English' (in inverted commas), and $R_1$, a model that generates random messages. On the assumption that only these two models could have been used, we will then work out the probability that a given message was generated by $M_1$ and not $R_1$.
Let us say that strings are generated by a weighted random choice for each successive letter. In other words, for each character in the string, a random letter is picked. Each letter has a probability of being selected that is independent of where that letter occurs, or of any other character.
If we know nothing about the distribution of letters, we start with the assumption that any letter is as likely as any other. This will be our modelling assumption for when words are not 'English', giving us the model $R_1$.
By looking at our training data, we can pick probabilities based on the frequencies with which letters appear.
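For example, using the four training strings given above, the letter frequencies can be counted directly (a sketch; note that a demo run with its own training set will give different numbers):

```python
from collections import Counter

# Estimate P(letter) for model M_1 from frequencies in the training data.
training = ["CABA", "CBCB", "CACB", "CBCC"]

counts = Counter(letter for word in training for letter in word)
total = sum(counts.values())                 # 16 letters in total
probs = {letter: counts[letter] / total for letter in "ABCD"}

# e.g. C appears 8 times out of 16 letters
assert probs["C"] == 0.5
assert probs["D"] == 0.0
```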
We have now parametrised our model $M_1$ with probabilities based on our training data. We can now use $M_1$ to generate a random 'English' word. Remember, our model says that each letter is chosen independently based on the probability of selecting each letter. Note that how realistic this generated word is (how much it looks to us like 'English') will be limited by the simplistic modelling assumptions we have made.
We can also use our parametrised model $M_1$ to determine the probability that a given word is 'English'. The basic idea is to see how likely model $M_1$ is to generate a given word and compare that to the probability of getting that word using model $R_1$.
For example, if we have a four letter word CBCB, we can work out the probability of generating this word based on model $M_1$ as
\[ P(CBCB \mid M_1) = P(C) \times P(B) \times P(C) \times P(B) = 0.4375 \times 0.25 \times 0.4375 \times 0.25 \approx 0.0120 \]
Now we have $P(CBCB \mid M_1)$, which tells us how likely it is that $M_1$ generates CBCB. However, what we want is $P(M_1 \mid CBCB)$. In other words, how likely is it that the model used was $M_1$ if the string observed was CBCB? We can use Bayes' rule to find this.
\[ P(M_1 \mid CBCB) = \frac{P(CBCB \mid M_1) \times P(M_1)}{P(CBCB)} \]
$P(M_1)$ is the prior probability that model $M_1$ was used. In other words, it is our initial expectation of the likelihood that the message is 'English'. We might know or guess that 50% of messages are 'English'. In this example, our prior probability would be set to $P(M_1) = 0.5$.
To use Bayes' rule, we need to work out $P(CBCB)$, the probability of generating CBCB. We know the probability for generating this given model $M_1$. We need to consider the possibility that the message is not 'English'.
One of our modelling assumptions is that strings are either generated by $R_1$ (they are random), or they are generated by $M_1$ (they are 'English').
We can work out the probability of generating this word if letters were chosen at random (model $R_1$) as
\[ P(CBCB \mid R_1) = (1/4)^4 = 0.00390625 \]
As $R_1$ and $M_1$ are the only possibilities (given our modelling assumptions), they form a partition. Thus we can use the Law of Total Probability:
\[ P(CBCB) = P(CBCB \mid M_1) \times P(M_1) + P(CBCB \mid R_1) \times P(R_1) \]
This gives us everything we need to calculate $P(M_1 \mid CBCB)$.
- Calculate the probability that CBCB is 'English'. In other words, calculate $P(M_1 \mid CBCB)$.
First Attempt Demo
Letter Probabilities
Letter | Probability |
---|---|
A | 0.31250000 |
B | 0.25000000 |
C | 0.43750000 |
D | 0.00000000 |
Test
Match: 0.000%

String | P(string \| model) | P(string \| random) | P(model \| string) |
---|---|---|---|
CBCB | 0.01196289 | 0.00390625 | 0.75384615 |
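The first-attempt calculation can be reproduced with a short script (a sketch; the letter probabilities are taken from the demo's letter probability table, the prior is 0.5, and the function names are my own):

```python
from math import prod

# Letter probabilities for model M_1, as reported in the demo table.
letter_p = {"A": 0.3125, "B": 0.25, "C": 0.4375, "D": 0.0}

def p_given_m1(s):
    # Each letter is chosen independently under M_1.
    return prod(letter_p[c] for c in s)

def p_given_r1(s, alphabet_size=4):
    # Under R_1 every letter is equally likely.
    return (1 / alphabet_size) ** len(s)

def posterior(s, prior=0.5):
    # Bayes' rule, with P(s) from the Law of Total Probability.
    pm = p_given_m1(s) * prior
    pr = p_given_r1(s) * (1 - prior)
    return pm / (pm + pr)

assert abs(p_given_m1("CBCB") - 0.01196289) < 1e-7
assert abs(posterior("CBCB") - 0.75384615) < 1e-7
```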
Prompts for reflection
You may want to consider the answers to these questions before you move on.
- Compare your results to the results in the demo. Is the demo correct? (It's possible I've made an error implementing it.)
- You may want to experiment with a larger alphabet (such as ABCDEFGHIJKLMNOPQRSTUVWXYZ), and/or with a larger training set.
- Where do you think the main limitations in the model are? What information does the model not represent that you think would help improve the model's predictive power?
Second Attempt
To improve on this, we might recognise that different letters are more or less likely to appear in different positions.
Let $P(A_n)$ be the probability of letter $A$ appearing at position $n$. Initially, we might assume that each letter is equally likely in any given position. If we have four letters, each of these probabilities would be $0.25$.
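Positional probabilities can again be estimated from the training strings (a sketch using the four example strings from the text; a demo run with other training data will give different numbers):

```python
from collections import Counter

# Estimate P(letter at position n) for model M_2.
training = ["CABA", "CBCB", "CACB", "CBCC"]
word_len = 4

positional = [Counter(word[n] for word in training) for n in range(word_len)]
probs = [{letter: counts[letter] / len(training) for letter in "ABCD"}
         for counts in positional]

# Every training word here starts with C, so P(C at position 1) = 1.
assert probs[0]["C"] == 1.0
# Position 2 is A or B with equal frequency.
assert probs[1]["A"] == 0.5 and probs[1]["B"] == 0.5
```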
Now, as above, we can use our new model, $M_2$, to generate 'English' words.
As above, we can work out the probability that the string CBCB is generated on the assumption that model $M_2$ is used:
\[ P(CBCB \mid M_2) = P(C_1) \times P(B_2) \times P(C_3) \times P(B_4) \]
We can also work out the probability that the string CBCB is generated on the assumption that messages are generated at random. Let us call this model $R_2$:
\[ P(CBCB \mid R_2) = (1/4)^4 = 0.00390625 \]
In our second attempt, we are assuming that the messages we are classifying were either generated by $M_2$ or by $R_2$. As above, we can use the Law of Total Probability and Bayes' rule to work out $P(M_2 \mid CBCB)$.
Second Attempt Demo
Letter Probabilities
Letter | Probability |
---|---|
A_1 | 0.25000000 |
B_1 | 0.00000000 |
C_1 | 0.75000000 |
D_1 | 0.00000000 |
A_2 | 0.50000000 |
B_2 | 0.50000000 |
C_2 | 0.00000000 |
D_2 | 0.00000000 |
A_3 | 0.00000000 |
B_3 | 0.25000000 |
C_3 | 0.75000000 |
D_3 | 0.00000000 |
A_4 | 0.50000000 |
B_4 | 0.25000000 |
C_4 | 0.25000000 |
D_4 | 0.00000000 |
Test
Match: 100.000%

String | P(string \| model) | P(string \| random) | P(model \| string) |
---|---|---|---|
CBCB | 0.07031250 | 0.00390625 | 0.94736842 |
Detecting English words
Investigate how good the interactive demos are at detecting English words. For this you will need to set the alphabet to ABCDEFGHIJKLMNOPQRSTUVWXYZ.
- Put some English words into the training set to calculate the model probabilities from. Then experiment with English and non-English words in the test set.
- The accuracy of the model on the test set (shown above the 'test' table) tells you how many strings in the test set were detected as matching the model. A string matches if the probability it was generated by the model exceeds the threshold probability. The higher this threshold, the more conservative the model will be.
- The quality of the model is significantly affected by the training data. More training data will improve the ability of the model to correctly identify English words, though only up to a point. Paste a word list into the training set (for very large word lists you might need to wait a minute for it to create the model). You can find wordlists on the web, or follow one of these links (1000 most common US-English words, Large Word lists).
- You can evaluate the model as a tool for checking English sentences. However, it only works with one word per line. Collect together some sentences (Random Sentence Generator). Replace each space with a new line, and paste the list of words into the test set. Look at the accuracy or match rate. As you work through the rest of this practical, compare the models in this way.
Third Attempt
Now we might consider that different letters are more or less likely depending on the letter that appears in front of them.
We need to find the probabilities of the letters appearing in the first position: \( P(A_1), P(B_1), \dots \). And we need to find the conditional probability of each letter occurring in position $n$ for every letter that could have occurred in position $n-1$. This can be represented in a probability tree. As each letter depends only on the previous letter, a better representation is as a Markov chain.
Using this diagram, we can generate a random 'English' word by randomly walking down the probability tree or Markov chain according to the transition probabilities.
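The first-letter and transition probabilities for $M_3$ can be counted in the same way as before (a sketch; `p_next` is a hypothetical name, and the four example strings are used as training data):

```python
from collections import Counter, defaultdict

# Estimate P(first letter) and P(letter at n+1 | letter at n) for model M_3.
training = ["CABA", "CBCB", "CACB", "CBCC"]

first = Counter(word[0] for word in training)
p_first = {c: n / len(training) for c, n in first.items()}

transitions = defaultdict(Counter)   # keyed by (position, previous letter)
for word in training:
    for n in range(len(word) - 1):
        transitions[(n, word[n])][word[n + 1]] += 1

def p_next(position, prev, nxt):
    counts = transitions[(position, prev)]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

assert p_first["C"] == 1.0          # every training word here starts with C
assert p_next(0, "C", "A") == 0.5   # C at position 1 is followed by A half the time
```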
Now we can also calculate the probability of generating the string CBCB assuming model $M_3$ is used:
\[ P(CBCB \mid M_3) = P(C_1) \times P(B_2 \mid C_1) \times P(C_3 \mid B_2) \times P(B_4 \mid C_3) \]
We can also calculate the probability assuming strings are randomly generated: $P(CBCB \mid R_3)$. From here we can calculate $P(M_3 \mid CBCB)$.
Third Attempt Demo
Letter Probabilities
Letter | Probability |
---|---|
A_1 | 0.25000000 |
B_1 | 0.00000000 |
C_1 | 0.75000000 |
D_1 | 0.00000000 |
A_2\|A_1 | 0.00000000 |
B_2\|A_1 | 1.00000000 |
C_2\|A_1 | 0.00000000 |
D_2\|A_1 | 0.00000000 |
A_2\|B_1 | 0.00000000 |
B_2\|B_1 | 0.00000000 |
C_2\|B_1 | 0.00000000 |
D_2\|B_1 | 0.00000000 |
A_2\|C_1 | 0.66666667 |
B_2\|C_1 | 0.33333333 |
C_2\|C_1 | 0.00000000 |
D_2\|C_1 | 0.00000000 |
A_2\|D_1 | 0.00000000 |
B_2\|D_1 | 0.00000000 |
C_2\|D_1 | 0.00000000 |
D_2\|D_1 | 0.00000000 |
A_3\|A_2 | 0.00000000 |
B_3\|A_2 | 0.50000000 |
C_3\|A_2 | 0.50000000 |
D_3\|A_2 | 0.00000000 |
A_3\|B_2 | 0.00000000 |
B_3\|B_2 | 0.00000000 |
C_3\|B_2 | 1.00000000 |
D_3\|B_2 | 0.00000000 |
A_3\|C_2 | 0.00000000 |
B_3\|C_2 | 0.00000000 |
C_3\|C_2 | 0.00000000 |
D_3\|C_2 | 0.00000000 |
A_3\|D_2 | 0.00000000 |
B_3\|D_2 | 0.00000000 |
C_3\|D_2 | 0.00000000 |
D_3\|D_2 | 0.00000000 |
A_4\|A_3 | 0.00000000 |
B_4\|A_3 | 0.00000000 |
C_4\|A_3 | 0.00000000 |
D_4\|A_3 | 0.00000000 |
A_4\|B_3 | 1.00000000 |
B_4\|B_3 | 0.00000000 |
C_4\|B_3 | 0.00000000 |
D_4\|B_3 | 0.00000000 |
A_4\|C_3 | 0.33333333 |
B_4\|C_3 | 0.33333333 |
C_4\|C_3 | 0.33333333 |
D_4\|C_3 | 0.00000000 |
A_4\|D_3 | 0.00000000 |
B_4\|D_3 | 0.00000000 |
C_4\|D_3 | 0.00000000 |
D_4\|D_3 | 0.00000000 |
Test
Match: 100.000%

String | P(string \| model) | P(string \| random) | P(model \| string) |
---|---|---|---|
CBCB | 0.08333333 | 0.00390625 | 0.95522388 |
Fourth Attempt
Next, we might reason that it doesn't matter where a pair of letters appears. We might want to learn a model $M_4$ defined by the probabilities of transitioning between each pair of letters.
We construct a Markov model with a node for each letter and an edge between each pair of letters. The probabilities labelling these edges can be found from the frequencies with which certain pairs of letters appear in the dataset.
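Pooling pairs from every position gives the transition probabilities for $M_4$ (a sketch, again using the four example strings; the demo's numbers depend on its own training set):

```python
from collections import Counter, defaultdict

# Estimate position-independent transition probabilities P(next | previous)
# for model M_4 by pooling letter pairs from every position.
training = ["CABA", "CBCB", "CACB", "CBCC"]

pair_counts = defaultdict(Counter)
for word in training:
    for prev, nxt in zip(word, word[1:]):
        pair_counts[prev][nxt] += 1

p_trans = {prev: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
           for prev, counts in pair_counts.items()}

# In these strings, 'B' is followed by 'C' in 2 of its 3 occurrences
# as a previous letter.
assert abs(p_trans["B"]["C"] - 2 / 3) < 1e-12
```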
From here, the same process can be followed as with the previous modelling attempts to work out $P(CBCB \mid M_4)$.
Fourth Attempt Demo
Letter Probabilities
Letter | Probability |
---|---|
A_1 | 0.25000000 |
B_1 | 0.00000000 |
C_1 | 0.75000000 |
D_1 | 0.00000000 |
A\|A | 0.00000000 |
B\|A | 0.66666667 |
C\|A | 0.33333333 |
D\|A | 0.00000000 |
A\|B | 0.33333333 |
B\|B | 0.00000000 |
C\|B | 0.66666667 |
D\|B | 0.00000000 |
A\|C | 0.50000000 |
B\|C | 0.33333333 |
C\|C | 0.16666667 |
D\|C | 0.00000000 |
A\|D | 0.00000000 |
B\|D | 0.00000000 |
C\|D | 0.00000000 |
D\|D | 0.00000000 |
Test
Match: 100.000%

String | P(string \| model) | P(string \| random) | P(model \| string) |
---|---|---|---|
CBCB | 0.05555556 | 0.00390625 | 0.93430657 |
Suitability for English word detection
After you have played with these models for some time, try and answer these questions about their ability to generate and detect English words:
- If you were to write a non-English word that the model would detect as English, what features would you include?
- What sorts of English words does the model fail to identify, and why? Try predicting in advance whether a given word will be matched as English or not.
- Are there contexts or words for which the earlier models are more accurate than the later models? Why is this?
- In some of the models, shorter words are less likely to be detected as English than longer words. Why do you think this is?
- How would you develop or adapt these models to improve accuracy?