Goal

  • Gain intuition for different notions of similarity and practice finding similar documents.
  • Explore the tradeoffs of representing documents using raw word counts versus TF-IDF.
  • Explore the behavior of different distance metrics by looking at the Wikipedia pages most similar to President Obama’s page.

File Description

  • The .zip file is the data file.
    • people_wiki.csv.zip (unzips to people_wiki.csv) contains 59,071 pages with 3 features: URL, name, and text.
  • .json files
    • people_wiki_map_index_to_word.json
  • .npz files
    • people_wiki_tf_idf.npz
    • people_wiki_word_count.npz
  • description files
    • The .ipynb file is the solution to Week 2, programming assignment 1.
      • nearest-neighbors-features-and-metrics_blank.ipynb
    • The .html file is the HTML version of the .ipynb file.
      • nearest-neighbors-features-and-metrics_blank.html
    • .py
      • nearest-neighbors-features-and-metrics_blank.py
    • file
      • nearest-neighbors-features-and-metrics_blank

Snapshot

  • Recommended: open the .md file inside the folder.
  • Open the .html file in a browser for a quick look.

Algorithm

  • k-NN with word count and TF-IDF features
  • Distance metrics compared: Euclidean and cosine
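
To make the difference between the two metrics concrete, here is a minimal sketch of brute-force nearest neighbor search using only the Python standard library. It is not the notebook's code: the function names (`euclidean_distance`, `cosine_distance`, `k_nearest`) and the dict-based sparse vectors are assumptions chosen for illustration, and the toy vectors are made up.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors stored as {word: weight} dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on direction, not on vector length."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention assumed here for empty vectors
    return 1.0 - dot / (norm_a * norm_b)


def k_nearest(query, corpus, k, distance):
    """Brute-force k-NN: rank every document in corpus by distance to the query."""
    ranked = sorted(corpus.items(), key=lambda item: distance(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical toy vectors: one document points the same way as the query
# but is much shorter; the other is similar in length but tilted.
query = {"x": 10.0, "y": 20.0}
corpus = {
    "same_direction_short": {"x": 1.0, "y": 2.0},
    "close_but_tilted": {"x": 10.0, "y": 16.0},
}
nearest_euclidean = k_nearest(query, corpus, 1, euclidean_distance)
nearest_cosine = k_nearest(query, corpus, 1, cosine_distance)
```

With these toy vectors the two metrics disagree: Euclidean distance prefers the document of similar length, while cosine distance prefers the one pointing in the same direction regardless of length.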

Implementation details

  • Extract word count vectors
  • Find nearest neighbors using word count vectors
  • Interpreting the nearest neighbors
  • Extract the TF-IDF vectors
  • Find nearest neighbors using TF-IDF vectors
  • Choosing metrics
  • Problem with cosine distances: tweets vs. long articles
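
The first and fourth steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the notebook's implementation: the tokenizer is a plain lowercase whitespace split, the IDF weight is the common `log(N / df)` variant (the assignment may use a different formula), and the helper names `word_counts` and `tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Raw word count vector as a {word: count} mapping (whitespace tokenizer)."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Map {name: text} to {name: {word: tf * idf}} with idf = log(N / df).

    N is the number of documents and df is the number of documents
    that contain the word. This IDF variant is an assumption.
    """
    counts = {name: word_counts(text) for name, text in docs.items()}
    n_docs = len(docs)
    # Count each word once per document in which it appears.
    doc_freq = Counter(word for c in counts.values() for word in c)
    return {
        name: {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in c.items()}
        for name, c in counts.items()
    }


# Hypothetical toy corpus, not taken from people_wiki.csv.
docs = {
    "a": "the senator the law",
    "b": "the singer the album",
    "c": "the senator",
}
weights = tf_idf(docs)
```

In this toy corpus the word "the" appears in every document, so its TF-IDF weight is zero, while the rarer words "law" and "album" receive the largest weights. That is the intuition behind preferring TF-IDF to raw counts: words common to all documents stop dominating the distance computation.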