Provides a Python class to easily calculate the TF-IDF scores of a collection of documents produced over time. The class can also create an HTML page visualizing how documents' topics vary over time.
The file example.py shows a small example based on Kenneth Lay's sent messages e-mails, taken from the Enron e-mail archive.
The usage is fairly simple:
from tempo_tfidf import TempoTFIDF
dates = ['2007-05-23', '2008-04-23']
docs = ['This is one doc', 'This is the doc with a strange word']
scorer = TempoTFIDF()
doc_scores = scorer.score_documents(docs, dates, time_unit='month')
scorer.visualize(doc_scores, path='visualize.html')
Scoring functionality requires only packages available in most base Python installs. However, jinja2 is required for the HTML visualization. The Pattern package -- only available for Python 2.x -- allows for additional text cleaning, potentially improving the output.
See requirements.txt for complete details.
This package uses a modified version of the standard TF-IDF equation.
Term frequency is the raw count of word w in time unit t:
f_w,t = count(f_t)
Inverse document frequency is defined as one plus the log of the number of time units N divided by the number of time units w appears plus one:
if_w = 1 + log( (N / (count(w_T) + 1)) )
As usual, w's score in time t is:
TFIDF_w,t = f_w,t * if_w
Viegas, Fernanda B., Scott Golder, and Judith Donath. "Visualizing Email Content: Portraying Relationships from Conversational Histories." Proceedings of ACM CHI, April 2006.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge UP: 2008. Ch. 6, but esp. p. 128.