
TF / IDF

The Python file is available here:

https://github.com/daviskregers/data-science-recap/blob/main/28-spark-wikipedia-search-tf-idf.py


TF-IDF

  • Stands for Term Frequency and Inverse Document Frequency
  • Important data for search - it figures out which terms are most relevant to a document

  • Term Frequency just measures how often a word occurs in a document

    • A word that occurs frequently is probably important to that document's meaning
  • Document Frequency is how often a word occurs across an entire set of documents, i.e., all of Wikipedia or every web page

    • This tells us about common words that appear everywhere no matter what the topic, like "a", "the", "and", etc.
  • So a measure of the relevancy of a word to a document might be $$\frac{\text{Term Frequency}}{\text{Document Frequency}}$$

  • Or $$\text{Term Frequency} \times \text{Inverse Document Frequency}$$
  • That is, take how often the word appears in a document, divided by how often it appears across all documents. That gives you a measure of how important and unique this word is for this document (see the sketch after this list).
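
As a concrete illustration of that ratio, here is a minimal pure-Python sketch; the toy corpus and helper names are invented for this example:

```python
# Toy corpus: three tiny "documents", each already reduced to a bag of words.
docs = [
    ["the", "quick", "fox", "jumps"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "naps"],
]

def term_frequency(word, doc):
    # How often the word occurs in this one document
    return doc.count(word)

def document_frequency(word, docs):
    # How many documents in the whole set contain the word
    return sum(1 for doc in docs if word in doc)

def relevance(word, doc, docs):
    # Term frequency over document frequency, as in the formula above
    return term_frequency(word, doc) / document_frequency(word, docs)

print(relevance("the", docs[0], docs))  # 1/3 - "the" is everywhere, low score
print(relevance("fox", docs[0], docs))  # 1/1 - unique to this doc, high score
```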

TF-IDF in practice

  • We actually use the log of the IDF, since word frequencies are distributed exponentially. That gives us a better weighting of a word's overall popularity.
  • TF-IDF assumes a document is just a "bag of words"
    • Parsing documents into a bag of words can be most of the work
    • Words can be represented as a hash value (a number) for efficiency
    • What about synonyms? Various tenses? Abbreviations? Capitalizations? Misspellings?
  • Doing this at scale is the hard part
    • That's where Spark comes in!
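
As a sketch of what this looks like in Spark, below is a minimal example using MLlib's HashingTF (which hashes each word to a number, as mentioned above) and IDF (which applies the log-scaled weighting). The input path "subset-small.tsv" and its tab-separated field layout are assumptions for illustration, not necessarily what the linked script uses:

```python
from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import HashingTF, IDF

conf = SparkConf().setMaster("local").setAppName("SparkTFIDF")
sc = SparkContext(conf=conf)

# Assumed layout: each line of the TSV holds one article; field 1 is the
# article name and field 3 is its body text.
rawData = sc.textFile("subset-small.tsv")
fields = rawData.map(lambda x: x.split("\t"))
documents = fields.map(lambda x: x[3].split(" "))  # bag of words per document
documentNames = fields.map(lambda x: x[1])

# HashingTF represents each word as a hash value and counts occurrences
# into a 100,000-bucket sparse vector per document.
hashingTF = HashingTF(100000)
tf = hashingTF.transform(documents)

# IDF down-weights words that appear in many documents, using the
# log-scaled formulation discussed above.
tf.cache()
idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)
```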

Using TF-IDF

  • A very simple search algorithm could be:
    • Compute TF-IDF for every word in a corpus
    • For a given search word, sort the documents by their TF-IDF score for that word
    • Display the results
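
Continuing the Spark sketch above, that search loop might look like the following; "Gettysburg" is just an example query:

```python
# Hash the search word to find which bucket it lands in.
queryTF = hashingTF.transform(["Gettysburg"])
queryHash = int(queryTF.indices[0])

# Pull out each document's TF-IDF score for that one bucket...
scores = tfidf.map(lambda vector: float(vector[queryHash]))

# ...pair the scores with document names, and display the top result.
results = scores.zip(documentNames)
print(results.max())  # (highest TF-IDF score, document name)
```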