Term Frequency-Inverse Document Frequency (Tf-idf)

Term Frequency-Inverse Document Frequency (Tf-idf) is a natural language processing technique used for converting textual data into numerical form. The principle is that the Tf-idf value of a word increases with the number of times the word appears in a document, but is offset by the number of documents in the data that contain the word.

In the previous blog post I explained the bag of words model for converting textual data into numerical form. It has some drawbacks: all words have the same importance, and no semantic information is preserved. The Tf-idf model addresses the equal-importance problem by weighting each word by how informative it is.

Creating Tf-idf

Let’s make the Tf-idf model concrete with the following example.

paragraph = he read books as one would breath air. books are loyal friend as human. he watch movies as he breath air.

  • Document 1 – “he read books as one would breath air”
  • Document 2 – “books are loyal friend as human”
  • Document 3 – “he watch movies as he breath air”

import re
import heapq

import nltk
import numpy as np

# NLTK's tokenizers need the Punkt data, downloaded once via nltk.download().
paragraph = '''he read books as one would breath air.
               books are loyal friend as human.
               he watch movies as he breath air.'''

Steps

The first three steps are the same as in the bag of words model.

1. Tokenize sentences

Split the paragraph into sentences, treating each sentence as one document, and make a list of all the unique words they contain.

  • “he”
  • “read”
  • “books”
  • “as”
  • “one”
  • “would”
  • “breath”
  • “air”
  • “are”
  • “loyal”
  • “friend”
  • “human”
  • “watch”
  • “movies”
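
# Split into sentences, then lower-case and strip punctuation so "." is not counted as a word.
data = [re.sub(r'\W+', ' ', s.lower()).strip() for s in nltk.sent_tokenize(paragraph)]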

2. Creating word histogram

The next step is to score the words in each sentence. To create a histogram of words, we iterate through each sentence and count how often each word occurs. Let’s put the most frequent words at the top.

  • “he” = 3
  • “as” = 3
  • “books” = 2
  • “breath” = 2
  • “air” = 2
  • “read” = 1
  • “loyal” = 1
  • “friend” = 1
  • “are” = 1
  • “one” = 1
  • “would” = 1
  • “human” = 1
  • “watch” = 1
  • “movies” = 1

# Count how often each word occurs across all documents.
up_count = {}
for d in data:
    words = nltk.word_tokenize(d)
    for word in words:
        if word not in up_count:
            up_count[word] = 1
        else:
            up_count[word] += 1
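
As a cross-check on the counts, here is a minimal sketch of the same histogram using only the standard library. It assumes the three sentences are already lower-cased and free of punctuation, and it uses plain whitespace splitting in place of nltk.word_tokenize, which is only a fair substitute for text this clean. The names docs and word_counts are made up for this sketch.

```python
from collections import Counter

# The three example documents, already lower-cased and without punctuation.
docs = [
    "he read books as one would breath air",
    "books are loyal friend as human",
    "he watch movies as he breath air",
]

# Whitespace splitting stands in for nltk.word_tokenize (an assumption
# that only holds for text this simple).
word_counts = Counter(word for doc in docs for word in doc.split())

# most_common() lists the words from most to least frequent.
for word, count in word_counts.most_common():
    print(f"{word}: {count}")
```

Counter does the same bookkeeping as the if/else in the loop above, so the two approaches should agree on every word.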

3. Selecting best features

The purpose of putting the most frequent words at the top is to see which words matter most, so that we can get rid of unwanted words. Here we are working with 3 sentences containing only 14 unique words, but a vocabulary built from 50 million words would be very hard to analyze. Furthermore, unwanted words can affect the result.

Several words are tied at a single occurrence, so which of them are kept depends on how ties are broken. This walkthrough continues with the following eight features:

“books”, “read”, “as”, “breath”, “air”, “he”, “loyal”, “friend”.

# Keep the 8 most frequent words; words with equal counts keep their first-seen order.
top_features = heapq.nlargest(8, up_count, key=up_count.get)
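
Because many words in this tiny example occur only once, it is worth checking what the selection returns when counts are tied. The sketch below hard-codes the word counts of the three example sentences in order of first appearance, so it needs no NLTK. The dictionary counts and the variable top are assumptions for illustration. It relies on the documented behavior that heapq.nlargest is equivalent to sorting in descending order and slicing.

```python
import heapq

# Word counts for the three example sentences, in order of first appearance.
counts = {
    "he": 3, "read": 1, "books": 2, "as": 3, "one": 1, "would": 1,
    "breath": 2, "air": 2, "are": 1, "loyal": 1, "friend": 1,
    "human": 1, "watch": 1, "movies": 1,
}

# Select the eight words with the highest counts.
top = heapq.nlargest(8, counts, key=counts.get)

# The first five entries are the words that occur more than once.
print("occur more than once:", top[:5])
# The last three slots are filled from the words that occur exactly once.
print("tied at one occurrence:", top[5:])
```

Which once-only words fill the last slots depends on their order in the dictionary, so on a real corpus it is safer to pick the cutoff where the counts do not tie.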

4. Calculate Term Frequency

Term frequency is the ratio of the number of occurrences of a word in a document to the total number of words in that document.

  • Document 1 : “books” => (1/8) = 0.125 , “read” => (1/8) = 0.125 …
  • Document 2 : “books” => (1/6) = 0.167 , “are” => (1/6) = 0.167 …
  • Document 3 : “as” => (1/7) = 0.143 , “he” => (2/7) = 0.286 …
tf = {}
for word in top_features:
    doc_tf = []
    for d in data:
        frequency = 0
        for w in nltk.word_tokenize(d):
            if word == w:
                frequency += 1
        tf_word = frequency/len(nltk.word_tokenize(d))
        doc_tf.append(tf_word)
    tf[word] = doc_tf
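
The definition of term frequency can also be checked with a few lines of plain Python. This is a minimal sketch under the same assumptions as before: clean, lower-cased sentences and whitespace splitting instead of nltk.word_tokenize. The helper name term_frequency is made up for this example.

```python
# The three example documents, already lower-cased and without punctuation.
docs = [
    "he read books as one would breath air",
    "books are loyal friend as human",
    "he watch movies as he breath air",
]

def term_frequency(word, doc):
    """Occurrences of `word` divided by the total number of words in `doc`."""
    tokens = doc.split()
    return tokens.count(word) / len(tokens)

# Term frequency of two words in each of the three documents.
for number, doc in enumerate(docs, start=1):
    books_tf = round(term_frequency("books", doc), 3)
    he_tf = round(term_frequency("he", doc), 3)
    print(f"Document {number}: books={books_tf}, he={he_tf}")
```

A word that does not occur in a document simply gets a term frequency of 0 there.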

5. Calculate Inverse Document Frequency

Inverse document frequency is a measure of how informative a term is across the whole collection. We need the idf value because term frequency alone is not sufficient to judge the importance of a word: a word that appears in every document tells us little about any one of them.

It is defined as the log of the ratio of the total number of documents to the number of documents containing that particular word. The values below use a base-10 logarithm.

In our example there are 3 documents:

  • “books” => log(3/2) = 0.176
  • “read” => log(3/1) = 0.477
  • “as” => log(3/3) = 0
  • “breath” => log(3/2) = 0.176
  • “air” => log(3/2) = 0.176
  • “he” => log(3/3) = 0
  • “loyal” => log(3/1) = 0.477
  • “friend” => log(3/1) = 0.477

From this we can see that the idf value is 0 for words like “as” and “he”, which appear in every document, and higher for rarer words like “read”, “loyal” and “friend”, which are therefore treated as more important. This is how the Tf-idf model overcomes the equal-weighting drawback of the bag of words model.

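idf = {}
for word in top_features:
    # Number of documents that contain the word (at least 1 for every selected word).
    doc_count = 0
    for d in data:
        if word in nltk.word_tokenize(d):
            doc_count += 1
    # Base-10 log of (total documents / documents containing the word).
    idf[word] = np.log10(len(data) / doc_count)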

6. Building a Model

Tf-idf (Document, Word) = Tf (Document, Word) * idf (Word)

Let’s calculate Tf-idf for document 1. The words “loyal” and “friend” do not occur in document 1, so their term frequency there is 0.

  • Tf-idf(“books”) = 0.125*0.176 = 0.022
  • Tf-idf(“read”) = 0.125*0.477 = 0.060
  • Tf-idf(“as”) = 0.125*0 = 0
  • Tf-idf(“breath”) = 0.125*0.176 = 0.022
  • Tf-idf(“air”) = 0.125*0.176 = 0.022
  • Tf-idf(“he”) = 0.125*0 = 0
  • Tf-idf(“loyal”) = 0*0.477 = 0
  • Tf-idf(“friend”) = 0*0.477 = 0

Repeating this for every document gives the whole Tf-idf model:

tfidf_matrix = []
for word in tf.keys():
    tfidf = []
    for value in tf[word]:
        score = value * idf[word]
        tfidf.append(score)
    tfidf_matrix.append(tfidf)

# Finishing the Tf-idf model: the matrix has one row per word so far,
# so transpose it to get one row per document.
X = np.asarray(tfidf_matrix)
X = np.transpose(X)

Now we have numerical data that can be fed to a machine learning algorithm.

Summary –

The bag of words model just creates a set of vectors containing the counts of word occurrences in each document, while the Tf-idf model also captures which words are more important and which are less important.
