
TF / IDF

The Python file is available here:

https://github.com/daviskregers/data-science-recap/blob/main/28-spark-wikipedia-search-tf-idf.py


TF-IDF

  • Stands for Term Frequency and Inverse Document Frequency
  • Important data for search - it figures out which terms are most relevant to a document

  • Term Frequency just measures how often a word occurs in a document

    • A word that occurs frequently is probably important to that document's meaning
  • Document Frequency is how often a word occurs across an entire set of documents, i.e., all of Wikipedia or every web page

    • This tells us about common words that appear everywhere no matter what the topic, like "a", "the", "and", etc.
  • So a measure of the relevancy of a word to a document might be $$\frac{\text{Term Frequency}}{\text{Document Frequency}}$$

  • Or $$\text{Term Frequency} \times \text{Inverse Document Frequency}$$
  • That is, take how often the word appears in a document, divided by how often it appears across all documents. That gives you a measure of how important and unique this word is for this document (see the sketch after this list).
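
As a concrete illustration of that ratio, here is a minimal pure-Python sketch; the toy corpus and helper names are invented for this example:

```python
# Toy corpus: three tiny "documents", each already reduced to a bag of words.
docs = [
    ["the", "quick", "fox", "jumps"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "naps"],
]

def term_frequency(word, doc):
    # How often the word occurs in this one document
    return doc.count(word)

def document_frequency(word, docs):
    # How many documents in the whole set contain the word
    return sum(1 for doc in docs if word in doc)

def relevance(word, doc, docs):
    # Term frequency over document frequency, as in the formula above
    return term_frequency(word, doc) / document_frequency(word, docs)

print(relevance("the", docs[0], docs))  # 1/3 - "the" is everywhere, low score
print(relevance("fox", docs[0], docs))  # 1/1 - unique to this doc, high score
```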

TF-IDF in practice

  • We actually use the log of the IDF, since word frequencies are distributed exponentially. That gives us a better weighting of a word's overall popularity.
  • TF-IDF assumes a document is just a "bag of words"
    • Parsing documents into a bag of words can be most of the work
    • Words can be represented as a hash value (a number) for efficiency
    • What about synonyms? Various tenses? Abbreviations? Capitalizations? Misspellings?
  • Doing this at scale is the hard part
    • That's where Spark comes in!
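
As a sketch of what this looks like in Spark, below is a minimal example using MLlib's HashingTF (which hashes each word to a number, as mentioned above) and IDF (which applies the log-scaled weighting). The input path "subset-small.tsv" and its tab-separated field layout are assumptions for illustration, not necessarily what the linked script uses:

```python
from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import HashingTF, IDF

conf = SparkConf().setMaster("local").setAppName("SparkTFIDF")
sc = SparkContext(conf=conf)

# Assumed layout: each line of the TSV holds one article; field 1 is the
# article name and field 3 is its body text.
rawData = sc.textFile("subset-small.tsv")
fields = rawData.map(lambda x: x.split("\t"))
documents = fields.map(lambda x: x[3].split(" "))  # bag of words per document
documentNames = fields.map(lambda x: x[1])

# HashingTF represents each word as a hash value and counts occurrences
# into a 100,000-bucket sparse vector per document.
hashingTF = HashingTF(100000)
tf = hashingTF.transform(documents)

# IDF down-weights words that appear in many documents, using the
# log-scaled formulation discussed above.
tf.cache()
idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)
```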

Using TF-IDF

  • A very simple search algorithm could be:
    • Compute TF-IDF for every word in a corpus
    • For a given search word, sort the documents by their TF-IDF score for that word
    • Display the results
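
Continuing the Spark sketch above, that search loop might look like the following; "Gettysburg" is just an example query:

```python
# Hash the search word to find which bucket it lands in.
queryTF = hashingTF.transform(["Gettysburg"])
queryHash = int(queryTF.indices[0])

# Pull out each document's TF-IDF score for that one bucket...
scores = tfidf.map(lambda vector: float(vector[queryHash]))

# ...pair the scores with document names, and display the top result.
results = scores.zip(documentNames)
print(results.max())  # (highest TF-IDF score, document name)
```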