Conditional Probability


Welcome back to my blog, where we dive deep into making complex topics both understandable and engaging. Today, we're tackling a subject that can be as intriguing as it is challenging: Conditional Probability.

The basics of Probability with Two Dice 🎲

Imagine rolling two distinct dice. Each die can land on a number from 1 to 6, so when rolling two dice, our possible outcomes (or sample space) look like this:

[(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
 (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
 (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
 (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
 (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
 (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)]

Now, what if we want to find the probability that the sum of these two dice is 4? Out of all the combinations, these ones add up to 4:

[(1, 3), (2, 2), (3, 1)]

Since each outcome is equally likely, the probability is simply the number of favorable outcomes (summing to 4) divided by the total number of outcomes:

P = \frac{3}{36} = \frac{1}{12}
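If you'd like to verify this by brute force, here is a minimal Python sketch (names like `sample_space` and `favorable` are just illustrative) that enumerates all 36 outcomes and counts the ones summing to 4:

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely outcomes for two distinct dice
sample_space = list(product(range(1, 7), repeat=2))

# Outcomes whose faces sum to 4: (1, 3), (2, 2), (3, 1)
favorable = [outcome for outcome in sample_space if sum(outcome) == 4]

print(Fraction(len(favorable), len(sample_space)))  # 1/12
```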

Adding a condition - How it changes things

But let’s change the scenario. Suppose I tell you that the first die rolled a 2. This additional information changes our calculation. Why? Because it narrows down our sample space to only those outcomes where the first die is 2:

[(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6)]

In this smaller sample space, only one outcome sums to 4:

[(2, 2)]

So, the probability now changes to:

P = \frac{1}{6}

This demonstrates how additional information, or a condition, can alter probabilities.
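The same brute-force check works for the conditional case: first shrink the sample space to outcomes where the first die shows 2, then count again. A minimal sketch, mirroring the one above:

```python
from itertools import product
from fractions import Fraction

sample_space = list(product(range(1, 7), repeat=2))

# Condition on the first die showing a 2: only 6 outcomes remain
conditioned_space = [outcome for outcome in sample_space if outcome[0] == 2]

# Within the reduced space, only (2, 2) sums to 4
favorable = [outcome for outcome in conditioned_space if sum(outcome) == 4]

print(Fraction(len(favorable), len(conditioned_space)))  # 1/6
```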

Defining Conditional Probability

The conditional probability of "E given F" is the probability that E occurs given that F has already occurred. This is known as conditioning on F. Mathematically, it's defined as:

P(E|F) = \frac{\text{count of } E \text{ and } F \text{ happening}}{\text{count of } F \text{ happening}} = \frac{\frac{\text{count of } E \text{ and } F \text{ happening}}{\text{count of sample space}}}{\frac{\text{count of } F \text{ happening}}{\text{count of sample space}}} = \frac{P(E \cap F)}{P(F)}
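Plugging the dice example back into this definition, with E as "the sum is 4" and F as "the first die is 2", we recover the same answer we got by counting:

P(E | F) = \frac{P(E \cap F)}{P(F)} = \frac{1/36}{6/36} = \frac{1}{6}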

Chain rule in Probability

Building on this concept, we can express the joint probability of two events as a product of a conditional probability and the probability of the condition. This is known as the chain rule:

P(E \cap F) = P(F)P(E|F)
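With the dice from earlier, for example, the chain rule gives the probability that the first die shows 2 and the sum is 4:

P(\text{first die is 2} \cap \text{sum is 4}) = P(\text{first die is 2}) \, P(\text{sum is 4} | \text{first die is 2}) = \frac{6}{36} \cdot \frac{1}{6} = \frac{1}{36}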

Netflix and Learn

When trying to calculate the probability of someone watching a movie on Netflix, a common first thought might be to consider just two options: watching or not watching.

This would imply a probability of \frac{1}{2}, as if flipping a coin. However, this approach is flawed because it assumes each option is equally likely, which isn't the case in real-life scenarios. The decision to watch a movie involves various factors, making it far from a simple 50/50 choice.

A more realistic approach

A better approach is to think of Netflix as a platform conducting countless experiments with its vast user base. Each time a user decides to watch or not watch a movie, it's an experiment contributing to a large dataset.

P(E=\text{watching movie M}) = \lim_{n \to \infty} \frac{n(E)}{n}

Netflix already has this data, so it can simply plug in the counts and estimate the probability of watching a movie M.

Now, let's add a twist. Suppose we know that a person has already watched a movie on Netflix. How does this information affect the probability of them watching another movie? This is where conditional probability comes into play.

In this scenario, the relevant equation is:

P(M_1 | M_2) = \frac{P(M_1 \cap M_2)}{P(M_2)} = \frac{\frac{n(M_1 \cap M_2)}{n(\text{netflix users})}}{\frac{n(M_2)}{n(\text{netflix users})}} = \frac{n(M_1 \cap M_2)}{n(M_2)}

Here, M_2 is the movie already watched, and we're trying to find the likelihood of M_1 being watched next.

This formula tells us that the probability of someone watching another movie can change dramatically once we know they've already watched a particular movie. Essentially, past behavior influences future choices.
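As a rough sketch of what this estimation could look like in code (the counts below are made-up placeholders, not real Netflix data):

```python
from fractions import Fraction

# Hypothetical counts -- placeholders, not real Netflix data
n_users = 1_000_000        # n(netflix users): total users observed
n_watched_m2 = 200_000     # n(M2): users who watched M2
n_watched_both = 50_000    # n(M1 ∩ M2): users who watched both M1 and M2

# P(M1 ∩ M2) and P(M2), both relative to all users
p_joint = Fraction(n_watched_both, n_users)
p_m2 = Fraction(n_watched_m2, n_users)

# The n(netflix users) terms cancel, leaving n(M1 ∩ M2) / n(M2)
p_m1_given_m2 = p_joint / p_m2
print(p_m1_given_m2)  # 1/4, i.e. 25% of M2 watchers also watched M1
```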

Law of Total Probability

Suppose you have a background event B and, conditioned on it, another event E. What would be the probability of event E?


There are two roads that merge when computing the probability of event E, depending on whether or not B occurs:

  1. Probability of E when B occurs: P(E | B)
  2. Probability of E when B does not occur: P(E | B^{C})

Now, using the chain rule, we can combine these two roads to get what we wanted: P(E) = P(E|B)P(B) + P(E|B^{C})P(B^{C})
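As a quick sanity check, here is a short sketch that verifies the law of total probability on the dice example, taking E as "the sum is 4" and B as "the first die shows 2"; both sides come out to 1/12:

```python
from itertools import product
from fractions import Fraction

sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under equally likely outcomes."""
    return Fraction(sum(1 for o in sample_space if event(o)), len(sample_space))

def cond_prob(event, given):
    """P(event | given), computed by restricting the sample space."""
    restricted = [o for o in sample_space if given(o)]
    return Fraction(sum(1 for o in restricted if event(o)), len(restricted))

E = lambda o: sum(o) == 4   # the sum is 4
B = lambda o: o[0] == 2     # the first die shows 2
not_B = lambda o: not B(o)

lhs = prob(E)
rhs = cond_prob(E, B) * prob(B) + cond_prob(E, not_B) * prob(not_B)
print(lhs, rhs)  # 1/12 1/12
```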

Bayes' Theorem

With the essential tools in hand, let's think about a problem.

Detecting Spam Emails πŸ“§

Suppose your inbox is flooded with spam emails. You would like to detect and delete them. You go through your inbox and notice a pattern: a bunch of the spam emails start with the word "Dear".

You would like to find the probability of an email being spam given that it starts with "Dear", i.e. P(Spam | Starts with "Dear"). We have observed that our email starts with "Dear"; now, what is the probability that it is spam? This is inherently a difficult thing to solve directly. We could cluster the emails in our database that start with "Dear" and then count the spam emails in that cluster, but that is not the right thing to do, because spam emails might not begin with "Dear".

But when you flip the problem, P(Starts with "Dear" | Spam), it becomes much easier. You cluster the spam emails first, then count the number of emails that start with "Dear".

This highlights an intrinsic feature of our universe: sometimes inferring from an observation is difficult.

This still does not solve our problem. We still want to predict how likely it is for an email that starts with the word "Dear" to be spam.

This is where Bayes' Theorem comes in: P(F | E) = \frac{P(E \cap F)}{P(E)}

Let's plug the chain rule into the numerator: P(F | E) = \frac{P(E|F)P(F)}{P(E)}

Let's plug the law of total probability into the denominator: P(F | E) = \frac{P(E|F)P(F)}{P(E|F)P(F) + P(E|F^{C})P(F^{C})}

With Bayes' Theorem, we can go from one form of conditional probability to the other using the chain rule and the law of total probability.
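To tie this back to the spam example, here is a small sketch applying Bayes' Theorem end to end; the three input probabilities are made-up placeholders you would estimate from your own inbox, not real statistics:

```python
# Hypothetical estimates -- placeholders, not real inbox statistics
p_spam = 0.3                  # P(Spam)
p_dear_given_spam = 0.4       # P(Starts with "Dear" | Spam): easy to count from the spam cluster
p_dear_given_not_spam = 0.05  # P(Starts with "Dear" | Not Spam)

# Law of total probability for the denominator: P(Starts with "Dear")
p_dear = p_dear_given_spam * p_spam + p_dear_given_not_spam * (1 - p_spam)

# Bayes' Theorem: flip P(Dear | Spam) into P(Spam | Dear)
p_spam_given_dear = p_dear_given_spam * p_spam / p_dear
print(round(p_spam_given_dear, 3))  # 0.774
```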

References:

Stanford CS109: Probability for Computer Scientists

Acknowledgement

Thanks to Udbhas Mitra for an initial review of the blog post.