Documentation

Documentation on Various Techniques

The basic computation of tf-idf is defined in Section 6.2.1 here.

There are various weighting schemes, but the essential idea is this: tf-idf measures the importance of a word in a document, relative to a collection of documents.

term frequency
==========
The term frequency, tf, of a term in a document is the number of times the term appears in that document.
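
As a minimal sketch of the definition above, the raw count can be computed with Python's `collections.Counter`. The function name `term_frequency` and the whitespace tokenization are assumptions made for illustration; a real pipeline would likely normalize case and punctuation first.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Return the raw count of `term` in a tokenized document.

    `document_tokens` is assumed to be a list of already-tokenized words.
    """
    return Counter(document_tokens)[term]


# Hypothetical example document, split on whitespace for simplicity.
doc = "the cat sat on the mat".split()
print(term_frequency("the", doc))  # 2
print(term_frequency("dog", doc))  # 0 (Counter returns 0 for missing keys)
```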

inverse document frequency
==================
The idf, inverse document frequency, of term t, is computed relative to a collection of documents as:

idf(for term t) = log((number of documents in collection) / (number of documents in collection containing term t))

So the idf of a rare term (one found in few documents) is high, whereas the idf of a term found in many documents is low.
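
The idf formula might be sketched as follows. The function name `inverse_document_frequency`, the natural logarithm, and the choice to raise an error for a term found in no document are all assumptions here; the formula above does not fix the log base, and a different base only rescales every idf by a constant factor.

```python
import math


def inverse_document_frequency(term, collection):
    """Return log(N / df), where N is the number of documents and
    df is the number of documents containing `term`.

    `collection` is assumed to be a list of tokenized documents.
    """
    n_docs = len(collection)
    n_containing = sum(1 for doc in collection if term in doc)
    if n_containing == 0:
        # Assumption: the formula is undefined here, so fail loudly.
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_docs / n_containing)


# Hypothetical three-document collection.
collection = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "rare", "bird"],
]
print(inverse_document_frequency("the", collection))   # 0.0, term is in every document
print(inverse_document_frequency("rare", collection))  # log(3), roughly 1.0986
```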

tf-idf
======
tf-idf provides a way to weight the importance of each term in each document. Equation 6.8 in the above reference gives:

tf-idf(of term t in document d) = (term frequency for the term in document d) x (idf of term t).
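
Putting the two pieces together, one possible sketch of the product is shown below. The name `tf_idf`, the natural logarithm, and returning 0.0 for a term that occurs nowhere in the collection are illustrative assumptions, not something the reference prescribes.

```python
import math
from collections import Counter


def tf_idf(term, document, collection):
    """Return tf * idf for `term` in `document`, relative to `collection`.

    Documents are assumed to be lists of tokens.
    """
    tf = Counter(document)[term]
    df = sum(1 for doc in collection if term in doc)
    if df == 0:
        # Assumption: treat a term unseen in the collection as weight 0.
        return 0.0
    return tf * math.log(len(collection) / df)


# Hypothetical collection of three tokenized documents.
collection = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
d = collection[0]
print(tf_idf("the", d, collection))  # 0.0: tf is 2, but idf is log(3/3) = 0
print(tf_idf("mat", d, collection))  # 1 * log(3/1), roughly 1.0986
print(tf_idf("cat", d, collection))  # 1 * log(3/2), roughly 0.4055
```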

Interpretation
=========

The tf-idf weights the importance of term t in document d. It is:

1. Highest when t occurs many times within a small number of documents in the collection (thus lending high discriminating power to these documents)

2. Lower when the term appears fewer times in a document, or occurs in many documents in the collection (so it is less important for discrimination)

3. Lowest when t occurs in virtually all documents in the collection, since its idf then approaches log(1) = 0.
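
The three cases above can be illustrated on a toy collection. This is only a sketch: the helper `tf_idf_weights`, the example documents, and the natural logarithm are assumptions chosen so that each case shows up clearly.

```python
import math
from collections import Counter


def tf_idf_weights(document, collection):
    """Return a dict mapping each term in `document` to its tf-idf weight.

    Assumes `document` is itself a member of `collection`, so every term
    has a document frequency of at least 1.
    """
    n_docs = len(collection)
    weights = {}
    for term, tf in Counter(document).items():
        df = sum(1 for doc in collection if term in doc)
        weights[term] = tf * math.log(n_docs / df)
    return weights


# Hypothetical four-document collection.
docs = [
    "the zebra zebra zebra river".split(),
    "the river bank".split(),
    "the city".split(),
    "the road".split(),
]
weights = tf_idf_weights(docs[0], docs)

# Case 1: "zebra" occurs 3 times, in 1 of 4 documents -> 3 * log(4), highest.
# Case 2: "river" occurs once, in 2 of 4 documents    -> 1 * log(2), lower.
# Case 3: "the" occurs in all 4 documents             -> 1 * log(1) = 0, lowest.
for term in sorted(weights, key=weights.get, reverse=True):
    print(f"{term}: {weights[term]:.4f}")
```

Sorting by weight lists `zebra` first and `the` last, which matches the ordering described in the list.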