Set Theory
In this lecture, we are going to learn the fundamentals and notation of set theory. Set theory is of foundational importance in mathematics. However, it is also critical in understanding relational databases and other data structures. Thus through this lecture we are going to ground our discussion of set theory in its application to relational databases.
Our goal through this lecture will be to formalise a database-like data structure and introduce basic database operations. Along the way we will learn about some key concepts in mathematics, including:
- Sets and set theory
- Types
- Tuples
- Set theory operations
Learning Outcomes
- Notate sets, tuples and types
- Identify common types and classes of numbers
- Use common set theory relations
- Recognise and use the names and symbols of common set theory operations
- Apply the above operators to pairs of simple operands
- Simplify arbitrarily complicated set theory expressions
Introduction
Databases are everywhere. In almost every domain of life where there is a need to store and access data you will find that data stored in some kind of computer database. As a computer scientist, software engineer or games developer, you will need databases. Whether you are building a social network, collecting scientific data, managing assets, or tracking customer data, databases are an essential tool.
Relational databases store data in tables. Many other data structures exist: lists, arrays, graphs, and more. Set theory gives us tools to mathematically formalise many different kinds of data structure.
Why Set Theory?
Set Theory is the mathematical underpinning of relational databases. Understanding the basics of set theory will help you understand the fundamental operations on databases such as JOIN
, UNION
, IN
, EXISTS
, and more.
As computer scientists we want to be able to write algorithms for creating, searching, sorting, and manipulating databases. To write and program algorithms we need to formalise the data structures they use. We might want to:
- Combine two different versions of a database table
- Match data between different tables
- Reason about whether records exist
This is what database management software allows us to do by applying concepts from set theory.
Last week we looked at propositional logic. This allows us to model human reasoning. However it is limited in only dealing with propositions. Propositional logic alone cannot tell us what is happening inside a proposition. For example, consider the following proposition:
Growltiger is a cat
If we just analysed this as a proposition $p$, we would not have learned anything. However, if we think about this statement, it seems to have two important parts: "Growltiger", being Growltiger, a thing. And "cat", which is a category or type of thing. If we represent Growltiger as $g$, and the set of cats as $C$, we can formalise the above statement as:
We can interpret this as meaning Growltiger is an element of the set of cats, or Growltiger is a cat. To do this, we are making use of a powerful branch of mathematics called set theory. Set theory is used whenever we want to reason about collections of things. As above, we can use it to construct propositions which we can reason about the truth value of.
Overview
Databases draw on mathematical concepts from set theory, and set theory will be our focus for this lecture. This will also be a necessary foundation for subsequent lectures where we will continue to build on set theory concepts and notation.
We will start off by learning how we can formalise a database using the concept of a set. We will also look into the concept of a type.
Once we have done this we will see how set theory allows us to reason about such sets. We will see how sets can be combined or intersected, how sets can contain other sets, and more.
The Set
The set is a basic mathematical object, like the number. Arguably, sets are even more fundamental than numbers, as numbers can themselves be defined as particular sets.
A set is an unordered collection of elements without repetition. We write out sets by listing the elements it contains within curly braces. For example, this set of elements:
\[ \{ a, b, c\} \]At the moment, the labels of our elements don't mean anything in particular. Set theory is not concerned with what these elements are but rather the structures in which they are contained. Set theory purists will say that these elements are just other sets. Sets containing sets, perhaps. To the set theorist, it's sets all the way down.
We might visualise the contents of this set as the following table. Each element of the set is a new record in our table, giving us the simplest possible database table. While this doesn't look particularly interesting right now, we will get to more interesting looking sets later.
set |
---|
a |
b |
c |
The set with no elements in it is called the empty set and is written $\emptyset$ or \( \{ \} \). A set with exactly one element is called a singleton. Because the elements of a set can be other sets, thus it is quite possible to define a set such as:
Properties of Sets
There are two important properties of sets to remember:
Unordered
Sets are unordered, meaning that the order in which we write the elements of a set makes no difference. We could write a set as \( \{a, b, c\} \) and we can write exactly the same set as \( \{c, b, a\} \). Imagine a database table: it doesn't matter which order we write the rows in the table, we still have the same data in the database.
No Repetition
Second, sets cannot contain repeated elements. All elements in a set are unique in that set. If I attempted to write one, such as \( \{a, a\} \), I still would not have a set, I would have a misleading bunch of symbols on a page. Similarly in databases, we do not have repeated records: you would never be able to tell them apart! Database records always differ in at least one variable (and we give each record a unique ID for this purpose).
Extensional and Intensional Set Definitions
The extension of a set is the elements it contains. You can define a set by giving its extension, such as:
This is also called an extensional definition.
We will see in a later lecture how you can also give an intensional definition of a set. This does not list its elements, but instead gives a rule for what counts as being in the set. For example, the following is a set of cats:
Tuples
Numbers in Set Theory
If you were wondering how you could possibly found mathematics on simple rules about collections of things, consider the example of the von Neumann integers. This is a way of defining the natural numbers using just sets and the empty set:
Types
So what are these things in our sets? Yes, $a$, $b$, and $c$ above were labels, but what are they labels for?
In so-called naive set theory, you can put anything you like into a set. Unfortunately, naive set theory falls into paradoxes. One way out of these paradoxes is to introduce types. Types are extremely important in computer science. If you have done any programming, you will likely already be familiar with the concept of a type. You might have seen code in that looks like the following:
let myNumber: number = 4
Here a variable called myNumber
is declared that has the type number
. This means the variable can only store numbers. If you try to put anything else into the variable, you will get a compilation error; that is, doing so is invalid within the programming language.
Similarly, with our typed version of set theory, we will declare sets as being of a particular type. For now, we are going to assume the following types. We are not going to worry at this point about what these elements really are (e.g. whether the numbers just labels for special nested sets).
Common Types
name | symbol | elements | description |
---|---|---|---|
natural numbers | \( \mathbb{N} \) | \( \{ 1, 2, 3, 4, 5, \dots \} \) | 1 and every subsequent number to positive infinity |
real numbers | \( \mathbb{R} \) | \( \{ 0.001020\dots, 0, -24.242, \dots \} \) | Any number on a number line between negative and positive infinity, to an infinite number of decimal places |
boolean truth values | \( \{ 0, 1 \} \) | \( \{ 0, 1 \} \) | True or false |
strings | \( \Sigma^* \) | \( \{a, b, abc, aabcbdc, \dots \} \) | Any string of characters from the alphabet $\Sigma$ |
Let's apply one of these types. For example, let's describe the set $A$, which will contain natural numbers. We will notate this as follows:
$ A : \mathbb{N} $ |
---|
1 |
2 |
3 |
Note that because we have defined $A$ as containing only natural numbers we cannot later change our minds and permit something that is not a natural number into $A$. (Just as attempting something similar in a programming language would give us a compilation error). The power of this approach, which is fully realised in a branch of mathematics called Type Theory, is that we can infer all sorts of things from these types. For example, say we take two elements of \( \mathbb{N} \) and add them together. We know that the answer will also be an element of \( \mathbb{N} \). However, for now, we will not delve into Type Theory and will merely gerry-rig types into set theory.
Product Types
Let us take two types, the type of and the type of .
\( \mathbb{N} \) and \( \mathbb{N} \) are types. \( \mathbb{N} \times \mathbb{N} \) is also a type. This type consists of pairs of elements, where the first member of each pair is from \( \mathbb{N} \) and the second member of each pair is from \( \mathbb{N} \). All possible such pairs are members of the type \( \mathbb{N} \times \mathbb{N} \)
\[ \mathbb{N} \times \mathbb{N} = \{ (1, 1), (2, 2), (3, 3), ...\} \]\( \mathbb{N} \) | \( \mathbb{N} \) |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
... |
Here we see a new structure. The structure $(a, b)$ is called a tuple. Tuples are ordered pairs of elements, though you can also have triples, quadruples, and more generally n-tuples containing $n$ elements in order. You can construct these purely out of sets if you want to; the bracket notation can be seen as a notational convenience.
We could construct the product type out of more than two types at a time, for example \( \mathbb{N} \times \mathbb{R} \times \{0, 1\} \). It would contain an infinite number of elements, including, for example:
\[ \{ (1, 0.01, 1), (1, 23.344, 0), \dots \} \]Set Types
If $A$ is a type, then \( \mathbb{P}(A) \) is the type of sets of type $A$.
Let's see an example using the type of
Thus the type \( \mathbb{P}(\mathbb{N}) \) is the type of sets of natural numbers. \( \mathbb{P}(\mathbb{N}) \) contains, for instance \( \emptyset, \{1\}, \{1,2 \} \) and (infinitely) many more.
Databases
Now we can start to define sets that look a bit more like databases. In these sets we can store different types of data.
We might imagine that we have a set of type \( \mathbb{N} \times \Sigma^* \times \{ 0, 1\} \) that looks like the following:
\( \mathbb{N} \) | \( \Sigma^* \) | \( \{ 0, 1 \} \) |
---|---|---|
0 | John | 1 |
1 | Mary | 1 |
2 | Bill | 0 |
... | ... | ... |
Written in set notation, this would be:
Now we can formalise a database table as a set using set theory. Next we will see what we can do with sets. We will see ways of reasoning about them. These will help us to understand the mathematical underpinnings of databases.
Set Elements
Element-of
Sets (with the exception of the empty set) contain elements. For example, the set \( \{ a, b, c \} \) has as $a$ as an element. I can write this statement formally using set theory notation (here $ \in $ is to be read "is an element of"):
The above statement is a proposition. It evaluates to either $\top$ or $\bot$. Because $a$ is indeed in \( \{a, b, c\} \), the above proposition is true.
Applying Set Theory Propositions
We can formalise a table in a relational database as a set of tuples. The attributes stored in the table dictate the type of the set. Consider the following database table. It is defined by columns or attributes (the types of data stored about each record), and rows or records (the particular data stored)
\( \mathbb{N} \) | \( \Sigma^* \) | \( \mathbb{R} \) |
---|---|---|
1 | 'John' | 2.5 |
To formalise this table as a set, we would describe a set $D$ of type \( \mathbb{N} \times \Sigma^* \times \mathbb{R} \). Let $D$ describe the table above. Thus $D$ is a set with one element.
Each row of the table is called a record or a tuple. For example, we know that \( (1, 'John', 2.5) \in D \), as this tuple appears in the table above. However, \( (2, 'Bill', 1.2) \not \in D \) because this record does not appear in the table above.
Set theory lets us formalise propositions - true or false statements. We can then incorporate these into propositional logic. For instance, how might we formalise the following statement in set theory and propositional logic?
"If Growltiger is a cat then Growltiger is sleepy"
First, let us define the following non-logical symbols:
- $g$ (Growltiger),
- $C$ (the set of cats), and
- $S$ (the set of sleepy things):
Now we can express the above proposition as a statement of propositional logic and set theory:
Comparing Sets
So far we have defined what a set is. To recap, a set is an unordered collection of unique elements. Now we are going to look at what we can do with sets. In particular, how can we reason about them?
We are going to see the following relations:
Name | Symbol | Usage | Description | Example |
---|---|---|---|---|
element of | $\in$ | $a \in A$ | $a$ is an element of $A$ | \( a \in \{ a \} \) |
subset | $\subseteq$ | $A \subseteq B$ | $A$ only contains elements that are also in $B$ | \( \{ a \} \subseteq \{ a, b \} \) |
superset | $\supseteq$ | $A \supseteq B$ | $A$ contains all the elements in $B$ | \( \{ a, b \} \supseteq \{ a, b \} \) |
equals | $=$ | $A = B$ | $A$ and $B$ are the same set: they contain exactly the same elements | \( \{ a \} = \{ a \} \) |
not equals | $\neq$ | $A \neq B$ | $A$ and $B$ are not the same set: they contain different elements | \( \{ a \} \neq \{ a, b \} \) |
proper subset | $\subset$ | $A \subset B$ | $A$ only contains elements that are also in $B$, but not all of them | \( \{ a \} \subset \{ a, b \} \) |
proper superset | $\supset$ | $A \supset B$ | $A$ contains all the elements in $B$ and at least one more | \( \{ a, b \} \supset \{ a \} \) |
Subset and Superset
Subset and superset express whether one set contains another. A subset is a set contained by another set. A superset is a set that contains another set. Consider for instance the following diagram representing two sets labelled $T$ and $S$. Set $S$ is contained within the set $T$:
Here $S$ is a subset of $T$ (written $S \subseteq T$) because all the elements of $S$ must also be elements of $T$, give how it is drawn. Equivalently, $T$ is a superset of $S$ (written $T \supseteq S)$, because $T$ contains all the elements that $S$ does. This gives us our two most common relations between sets:
By putting a slash through the subset or superset symbol, we express its negation, e.g. 'not subset' ($\not \subseteq$), 'not superset' ($\not \supseteq$), etc.
Proper Subset and Proper Superset
From here we can define two more relations. When a set is a subset of another set but is not equal to that set, it is called a proper subset. Similarly, when a set is a superset of another set but is not equal to it, it is called a proper superset.
To write these we use the proper subset and proper superset symbols:
Proof example
Relating Sets
Imagine we are managing the database for a website when the hard drive of the server fails. We lose the data it contains. Luckily, we have kept regular backups of our database. Unfortunately, we do not know when these backups were taken. We have narrowed down the correct one to two versions that we will call $A$ and $B$.
uid | username | password_sha1_hash |
---|---|---|
1 | john | 5f4dcc3b5aa765d61d8327deb882cf99 |
2 | mary | 25f9e794323b453885f5181f1b624d0b |
3 | bill | e8375d7cd983efcbf956da5937050ffc |
uid | username | password_sha1_hash |
---|---|---|
1 | john | 5f4dcc3b5aa765d61d8327deb882cf99 |
2 | mary | 25f9e794323b453885f5181f1b624d0b |
We can express following true propositions about these two sets.
We know that to avoid data loss we need to restore database $A$ as this is a proper superset of $B$, written $A \supset B$.
Set Operations
So far we can define sets and say whether one set contains another. We can say a lot more about sets when we combine sets using the operators of set theory. The following operators take two sets as arguments and return a set, usually with different elements.
Name | Symbol | Usage | Description | Example |
---|---|---|---|---|
union | $\cup$ | $A \cup A$ | Returns all elements in either $A$ or $B$ | \( \{ a \} \cup \{ b \} = \{ a, b \} \) |
intersection | $\cap$ | $A \cap B$ | Returns only elements in both $A$ and $B$ | \( \{ a, b \} \cap \{ b, c \} = \{ b \} \) |
set difference | $\setminus$ | $A \setminus B$ | Returns elements from $A$ except those in $B$ | \( \{ a, b \} \setminus \{ b \} = \{ a \} \) |
cardinality | $\#$ | $\#(A)$ | Returns the number of elements in $A$ | \( \#(\{ a, b \}) = 2 \) |
power set | \( \mathbb{P} \) | \( \mathbb{P} \) | Returns the set of all subsets of $A$ | \( \mathbb{P}(\{ a, b \} = \{ \emptyset, \{a\}, \{b\}, \{a, b\} \} \) |
cartesian product | $\times$ | $A \times B$ | Returns all pairs $(a, b)$ where $a \in A$ and $b \in B$ | \( \{ a \} \times \{ 1 \} = \{(a, 1), (a, 2)\} \) |
In general we will assume that we always apply these operations to sets of the same type.
Union
Intersection
Generalised Operations
Set Difference
Cardinality
Power Set
Cartesian Product
Real-world examples
Imagine John and Mary have each seen many movies. They want to watch something together,
Let us model this situation in set theory. We will describe two sets, $J$ (the movies John has seen) and $M$ (the movies Mary has seen). Assume that we can ask John and Mary to write us out a list. Further, let $U$ be the universal set of all movies (that John and mary would consider watching). If we were building a movie recommendation app, we might get this list from IMDB.
John | Mary |
---|---|
Yes Man (2008) | The Paperboy (2012) |
Underdog (2007) | Nowhere to Run (1993) |
Where the Heart Is (2000) | I, Robot (2004) |
Mean Streets (1973) | The Voices (2014) |
I, Robot (2004) | Bangkok Dangerous (2008) |
Control (2007) | Tequila Sunrise (1988) |
... | ... |
Say John is willing to rewatch a movie he has already seen, but Mary does not want to rewatch a movie. If so, we need to pick a movie from the set $U \setminus M$, the set of movies Mary has not watched. If neither John nor Mary is willing to rewatch a movie, we would need to select a movie from the set $U \setminus (J \cup M)$. If John and Mary both are wanting to watch something familiar, they would need to pick from $J \cap M$.
Being able to formalise the logic behind picking a movie, we are better placed to, for example, write a recommendation algorithm.
Imagine we are developing a photo management app that stores photographs in the cloud. This app can be used on multiple devices simultaneously, e.g. on a mobile and a laptop. These devices might not always have an internet connection. We could easily find each device has made changes to its local database. New photographs may have been taken, or old photographs deleted. We want our app to behave appropriately when these two devices have been synced with the cloud storage server again.
Phone
image uuid | filename | date/time |
---|---|---|
69ba4909-2d27-4824-b7a8-0801b8103ba2 | IMG_1203.png | 22/6/23 14:23 |
58f4a45c-a054-4de9-8315-d8701a170d90 | IMG_1204.png | 23/6/23 10:25 |
Laptop
image uuid | filename | date/time |
---|---|---|
69ba4909-2d27-4824-b7a8-0801b8103ba2 | IMG_1203.png | 22/6/23 14:23 |
40ccddbd-2b7c-4324-8623-6d3b8c5f1e3a | IMG_1204.png | 23/6/23 12:01 |
We have two tables of photographs, which we will label $Mobile$ and $Laptop$. These two tables have got out of sync and we want our app to be able to deal with this.
The obvious thing we might want to do in this situation is to find all the photos that exist on either the mobile or the laptop. In other words, we want to include everything from both databases. Perhaps we would want to upload all of these photos to our cloud storage provider.
To find the database table that contains all the elements of two database tables, we are performing a union operation.
image uuid | filename | date/time |
---|---|---|
69ba4909-2d27-4824-b7a8-0801b8103ba2 | IMG_1203.png | 22/6/23 14:23 |
40ccddbd-2b7c-4324-8623-6d3b8c5f1e3a | IMG_1204.png | 23/6/23 12:01 |
58f4a45c-a054-4de9-8315-d8701a170d90 | IMG_1204.png | 23/6/23 10:25 |
$A \cup B$ returns a set containing all the elements in either of the sets $A$ and $B$. For instance,
Let's say we want to minimise the amount of user data we are storing in the cloud. After all, cloud storage costs money. We might decide that it is safe to delete those photos that exist on all of a users devices.
We would be able to find these photos by using the intersection operation.
image uuid | filename | date/time |
---|---|---|
69ba4909-2d27-4824-b7a8-0801b8103ba2 | IMG_1203.png | 22/6/23 14:23 |
$A \cap B$ returns a set containing only those elements that are both in set $A$ and in set $B$. For instance,
Let's say we want to copy the new photos from the mobile onto the laptop directly. For reasons of efficiency, we naturally don't want to copy any that already exist on the laptop. That is, we want to create a table of all of the photos on the mobile except those that exist on the laptop.
To do this, we need the set difference operation:
image uuid | filename | date/time |
---|---|---|
69ba4909-2d27-4824-b7a8-0801b8103ba2 | IMG_1203.png | 22/6/23 14:23 |
$A \setminus B$ returns a set containing the elements in set $A$ taking away any that also appear in set $B$. For instance,
Our cloud photo management app is expensive to run, so we want to bill customers for the number of photographs they store. Here, of course, we need to count all of the photos on any device, so we first find the union.
The cardinality of a set is the number of elements it contains. For example, the following set has a cardinality of 3: