Bayesian methods: Concepts

Python notebook: https://github.com/daviskregers/data-science-recap/blob/main/14-naive-bayes-spam-classifier.ipynb

Remember Bayes' theorem?

  • \(P(A|B) = \frac{P(A)P(B|A)}{P(B)}\)
  • Let's use it for machine learning! I want a spam classifier.
  • Example: How would we express the probability of an email being spam if it contains the word "free"?
  • \(P(Spam|Free)=\frac{P(Spam)P(Free|Spam)}{P(Free)}\)
  • The numerator is the joint probability of a message being spam and containing the word "free" (subtly different from the conditional probability we're after).
  • The denominator is the overall probability of an email containing the word "free"; by the law of total probability, this equals \(P(Free|Spam)P(Spam) + P(Free|Not Spam)P(Not Spam)\).
  • So together, this ratio is the fraction of emails containing the word "free" that are spam (see the worked example below).
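
For concreteness, here is a minimal worked computation of that ratio with made-up numbers (the prior and likelihoods below are purely illustrative, not from any real corpus):

```python
# Hypothetical figures: 40% of training emails are spam, "free" appears in
# 30% of spam emails and in 2% of non-spam ("ham") emails.
p_spam = 0.4
p_free_given_spam = 0.30
p_free_given_ham = 0.02

# Denominator: overall probability of seeing "free" (law of total probability)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem
p_spam_given_free = p_spam * p_free_given_spam / p_free
print(f"P(Spam|Free) = {p_spam_given_free:.3f}")  # ~0.909
```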

What about all the other words?

  • We can construct \(P(Spam | Word)\) for every meaningful word we encounter during training
  • Then multiply these per-word probabilities together when analyzing a new email to estimate the probability of it being spam (a sketch follows this list).
  • This assumes the presence of each word is independent of the others, which is one reason this is called "Naive" Bayes.
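
A hand-rolled sketch of that naive multiplication, with hypothetical per-word likelihoods (real implementations also smooth for unseen words; log-probabilities are used here to avoid floating-point underflow):

```python
import math

# Hypothetical prior and per-word likelihoods "learned" from a training set
p_spam, p_ham = 0.4, 0.6
p_word_given_spam = {"free": 0.30, "money": 0.20, "meeting": 0.01}
p_word_given_ham = {"free": 0.02, "money": 0.03, "meeting": 0.15}

def spam_probability(words):
    # Naive assumption: multiply per-word likelihoods as if independent
    # (done in log space, then normalized back to a probability)
    log_spam = math.log(p_spam) + sum(math.log(p_word_given_spam[w]) for w in words)
    log_ham = math.log(p_ham) + sum(math.log(p_word_given_ham[w]) for w in words)
    return 1 / (1 + math.exp(log_ham - log_spam))

print(spam_probability(["free", "money"]))  # high -> likely spam (~0.985)
print(spam_probability(["meeting"]))        # low  -> likely ham  (~0.043)
```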

Sounds like a lot of work.

  • Scikit-learn to the rescue!
  • The CountVectorizer converts each email into a vector of word counts, and MultinomialNB does all the heavy lifting for Naive Bayes.
  • We'll train it on known sets of spam and "ham" (non-spam) emails.
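
A minimal sketch of that workflow, using a tiny made-up training set in place of the real spam/ham corpus from the notebook (the emails, labels, and test message below are all hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set standing in for the real spam/ham corpus
emails = [
    "free money, claim your prize now",
    "free offer, limited time only",
    "meeting rescheduled to tomorrow",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each email into a vector of word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(emails)

# MultinomialNB estimates the per-word probabilities for each class
classifier = MultinomialNB()
classifier.fit(counts, labels)

# Classify a new email: vectorize with the SAME fitted vocabulary
test = vectorizer.transform(["free prize waiting for you"])
print(classifier.predict(test))  # -> ['spam']
```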