A set of tools for creating topic modeling datasets, performing topic modeling, etc.

clintpgeorge/gaur

gaur

This project provides libraries for creating datasets for topic modeling or text classification. It also includes a pure-Python implementation of the collapsed Gibbs sampling algorithm for the Latent Dirichlet Allocation (LDA) topic model. Caveat: it is not written to handle large datasets.
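
For readers unfamiliar with the technique, the following is a minimal sketch of collapsed Gibbs sampling for LDA in pure Python. It is not the code in lda_gibbs.py; the function name `lda_collapsed_gibbs`, its parameters, and the hyperparameter defaults are assumptions made for illustration. Documents are assumed to be lists of integer word ids.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, eta=0.01, num_iters=50, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (hypothetical helper).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns the topic assignments and the two count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Random initialization of the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + eta)
                    / (topic_total[t] + vocab_size * eta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return assignments, doc_topic, topic_word
```

Each sweep visits every token once, so the cost per iteration grows with the total number of tokens times the number of topics, which is one reason a pure-Python sampler of this kind is unsuitable for large datasets.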

Currently, it supports building datasets from articles downloaded from the English Wikipedia. The user specifies the Wikipedia categories of interest, and the associated articles are downloaded and assembled into a dataset. The project uses the MediaWiki API to query and download the articles in a Wikipedia category.
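
As a rough illustration of the kind of query involved, the sketch below lists the page titles in a category using the MediaWiki API's `list=categorymembers` module. It is not taken from download_wikipedia_articles.py; the helper names `build_params` and `list_category_pages` and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(category, limit=50, cmcontinue=None):
    """Build query parameters for one categorymembers request (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page",        # skip subcategories and files
        "cmlimit": str(limit),
        "format": "json",
    }
    if cmcontinue:
        # Continuation token returned by the previous request.
        params["cmcontinue"] = cmcontinue
    return params


def list_category_pages(category, limit=50):
    """Return the titles of all pages in a category, following continuation."""
    titles = []
    cmcontinue = None
    while True:
        query = urllib.parse.urlencode(build_params(category, limit, cmcontinue))
        request = urllib.request.Request(
            f"{API_URL}?{query}",
            headers={"User-Agent": "gaur-readme-example/0.1"},  # placeholder value
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        titles.extend(member["title"] for member in data["query"]["categorymembers"])
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            return titles
```

Calling `list_category_pages("Machine learning")` would return the article titles in that category; fetching the article text itself requires further requests that are not shown here.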

Usage

  • To download the Wikipedia articles, see download_wikipedia_articles.py
  • To build a topic modeling dataset (in the LDA-C format), see build_ldac_corpus.py
  • To run the LDA collapsed Gibbs sampling algorithm, see lda_gibbs.py and lda_gibbs_test.py
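
For reference, the LDA-C format stores one document per line as `M term_id:count term_id:count ...`, where `M` is the number of unique terms in the document. The sketch below shows one way tokenized documents could be converted to such lines; it is not the code in build_ldac_corpus.py, and the function name `to_ldac` is hypothetical.

```python
from collections import Counter


def to_ldac(tokenized_docs):
    """Convert tokenized documents to LDA-C lines (hypothetical helper).

    Returns (vocab, lines), where vocab maps each token to an integer id
    assigned in order of first appearance.
    """
    vocab = {}
    lines = []
    for tokens in tokenized_docs:
        counts = Counter()
        for token in tokens:
            # Assign the next free id to tokens seen for the first time.
            term_id = vocab.setdefault(token, len(vocab))
            counts[term_id] += 1
        fields = [str(len(counts))]
        fields += [f"{term_id}:{n}" for term_id, n in sorted(counts.items())]
        lines.append(" ".join(fields))
    return vocab, lines


docs = [["topic", "model", "topic"], ["model", "data"]]
vocab, lines = to_ldac(docs)
print("\n".join(lines))
# 2 0:2 1:1
# 2 1:1 2:1
```

A real pipeline would typically also write the vocabulary to a separate file so that term ids can be mapped back to words.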

Dependencies
