Commit

piskvorky#13 update of documentation
git-svn-id: https://my-svn.assembla.com/svn/gensim/trunk@80 92d0401f-a546-4972-9173-107b360ed7e5
piskvorky committed Mar 18, 2010
1 parent 2012f8e commit 7a73a43
Showing 8 changed files with 62 additions and 43 deletions.
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -1,4 +1,4 @@
recursive-include docs *
include COPYING
include COPYING.LESSER

include ez_setup.py
1 change: 1 addition & 0 deletions docs/src/apiref.rst
@@ -8,6 +8,7 @@ Modules:
.. toctree::
:maxdepth: 0

interfaces
utils
matutils
corpora/bleicorpus
8 changes: 8 additions & 0 deletions docs/src/interfaces.rst
@@ -0,0 +1,8 @@
:mod:`interfaces`
==================

.. automodule:: gensim.interfaces
:synopsis: Core gensim interfaces
:members:
:inherited-members:

39 changes: 21 additions & 18 deletions docs/src/intro.rst
@@ -19,31 +19,33 @@ Design
------

Gensim includes the following features:
* Memory independence -- there is no need for the whole text corpus (or any
intermediate term-document matrices) to reside fully in RAM at any one time.
* Provides implementations for several popular topic inference algorithms,
including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA),
and makes adding new ones simple.
* Contains I/O wrappers and converters around several popular data formats.
* Allows similarity queries across documents in their latent, topical representation.

* Memory independence -- there is no need for the whole text corpus (or any
intermediate term-document matrices) to reside fully in RAM at any one time.
* Provides implementations for several popular topic inference algorithms,
including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA),
and makes adding new ones simple.
* Contains I/O wrappers and converters around several popular data formats.
* Allows similarity queries across documents in their latent, topical representation.

Creation of gensim was motivated by a perceived lack of available, scalable software
frameworks that realize topic modeling, and/or their overwhelming internal complexity.
You can read more about the motivation in our `LREC 2010 workshop paper <http://www.fi.muni.cz/~sojka/lrec2010/dml_lrec.pdf>`_.

The principal design objectives behind gensim are:
1. Straightforward interfaces and low API learning curve for developers,
facilitating modifications and rapid prototyping.
2. Memory independence with respect to the size of the input corpus; all intermediate
steps and algorithms operate in a streaming fashion, processing one document
at a time.

1. Straightforward interfaces and low API learning curve for developers,
facilitating modifications and rapid prototyping.
2. Memory independence with respect to the size of the input corpus; all intermediate
steps and algorithms operate in a streaming fashion, processing one document
at a time.


Availability
------------
Gensim is licensed under the OSI-approved `GNU LGPL license <http://www.gnu.org/licenses/lgpl.html>`_
and can be downloaded either from its `SVN repository <http://my-trac.assembla.com/gensim>`_
or from the `Python Package Index <TODO>`_.
or from the `Python Package Index <http://pypi.python.org/pypi/gensim>`_.

.. http://my-trac.assembla.com/gensim/browser/trunk/COPYING.LESSER
@@ -73,12 +75,13 @@ The whole gensim package revolves around the concepts of :term:`corpus`, :term:`
In the Vector Space Model (VSM), each document is represented by an
array of features. For example, a single feature may be thought of as a
question-answer pair:
1. How many times does the word *splonge* appear in the document? Zero.
2. How many paragraphs does the document consist of? Two.
3. How many fonts does the document use? Five.

1. How many times does the word *splonge* appear in the document? Zero.
2. How many paragraphs does the document consist of? Two.
3. How many fonts does the document use? Five.

The question is usually represented only by its integer id, so that the
representation becomes a series of pairs: ``(1, 0.0), (2, 2.0), (3, 5.0)``.
representation of a document becomes a series of pairs: ``(1, 0.0), (2, 2.0), (3, 5.0)``.
If we know all the questions in advance, we may leave them implicit
and simply write ``(0.0, 2.0, 5.0)``.
This sequence of answers can be thought of as a high-dimensional (in our case 3-dimensional)
@@ -102,7 +105,7 @@ The whole gensim package revolves around the concepts of :term:`corpus`, :term:`
to another (or, in other words, from one vector space to another).
Both the initial and target representations are
still vectors -- they only differ in what the questions and answers are.
The transformation is automatically learned from the :term:`training corpus`, without human
The transformation is automatically learned from the training :term:`corpus`, without human
supervision, and in hopes that the final document representation will be more compact
and more useful (with similar documents having similar representations)
than the initial one. The transformation process is also sometimes called
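
The sparse and dense document representations described in this part of the introduction can be sketched as follows. This is only an illustration, not code from gensim: the helper names `sparse_to_dense` and `dense_to_sparse` are hypothetical, and the 1-based feature ids simply follow the question numbering used in the example.

```python
def sparse_to_dense(doc, num_features):
    """Expand (feature_id, value) pairs into a dense list of answers.

    Feature ids are assumed to be 1-based, matching the numbered questions.
    """
    dense = [0.0] * num_features
    for feature_id, value in doc:
        dense[feature_id - 1] = value
    return dense


def dense_to_sparse(vec):
    """Keep only the non-zero answers, paired with their 1-based feature ids."""
    return [(i + 1, value) for i, value in enumerate(vec) if value != 0.0]


# The three question-answer pairs from the text: splonge count, paragraphs, fonts.
doc = [(1, 0.0), (2, 2.0), (3, 5.0)]
dense = sparse_to_dense(doc, 3)
sparse = dense_to_sparse(dense)
```

Dropping zero-valued features in `dense_to_sparse` is an assumption made here for compactness; the text itself lists the zero answer explicitly.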
1 change: 0 additions & 1 deletion docs/src/matutils.rst
@@ -6,4 +6,3 @@
:members:
:inherited-members:


30 changes: 16 additions & 14 deletions setup.py
@@ -25,20 +25,20 @@
for topical similarity and so on.
Gensim includes the following features:
* Memory independence -- there is no need for the whole text corpus (or any
intermediate term-document matrices) to reside fully in RAM at any one time.
* Provides implementations for several popular topic inference algorithms,
including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA),
and makes adding new ones simple.
* Contains I/O wrappers and converters around several popular data formats.
* Allows similarity queries across documents in their latent, topical representation.
* Memory independence -- there is no need for the whole text corpus (or any
intermediate term-document matrices) to reside fully in RAM at any one time.
* Provides implementations for several popular topic inference algorithms,
including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA),
and makes adding new ones simple.
* Contains I/O wrappers and converters around several popular data formats.
* Allows similarity queries across documents in their latent, topical representation.
The principal design objectives behind gensim are:
1. Straightforward interfaces and low API learning curve for developers,
facilitating modifications and rapid prototyping.
2. Memory independence with respect to the size of the input corpus; all intermediate
steps and algorithms operate in a streaming fashion, processing one document
at a time.
1. Straightforward interfaces and low API learning curve for developers,
facilitating modifications and rapid prototyping.
2. Memory independence with respect to the size of the input corpus; all intermediate
steps and algorithms operate in a streaming fashion, processing one document
at a time.
"""


@@ -58,7 +58,9 @@ def read(fname):
package_dir = {'': 'src'},
packages = find_packages('src'),

author = 'Radim Rehurek', # there is a bug in python2.5, preventing distutils from using non-ascii characters :( http://bugs.python.org/issue2562
# there is a bug in python2.5, preventing distutils from using non-ascii characters :(
author = 'Radim Rehurek',
# author = u'Radim Řehůřek', # <- should really be this.. see http://bugs.python.org/issue2562
author_email = 'radimrehurek@seznam.cz',
url = 'http://nlp.fi.muni.cz/projekty/gensim',
download_url = 'http://pypi.python.org/pypi/gensim',
@@ -82,6 +84,6 @@ def read(fname):

include_package_data = True,

entry_points = "",
entry_points = {},

)
18 changes: 12 additions & 6 deletions src/gensim/interfaces.py
@@ -17,18 +17,21 @@

class CorpusABC(utils.SaveLoad):
"""
Interface for corpora. A 'corpus' is simply an iterable, where each
Interface for corpora. A *corpus* is simply an iterable, where each
iteration step yields one document. A document is a list of (fieldId, fieldValue)
2-tuples.
See the corpora module for some example corpus implementations.
See the corpora package for some example corpus implementations.
Note that although a default len() method is provided, it is very inefficient
(performs a linear scan through the corpus to determine its length). Wherever
the corpus size is needed and known in advance (or at least doesn't change so
that it can be cached), the len() method should be overridden.
"""
def __iter__(self):
"""
Iterate over the corpus, yielding one document at a time.
"""
raise NotImplementedError('cannot instantiate abstract base class')


@@ -50,7 +53,7 @@ class TransformationABC(utils.SaveLoad):
a sparse document via the dictionary notation [] and returns another sparse
document in its stead.
See the tfidfmodel module for an example of a transformation.
See the :mod:`tfidfmodel` module for an example of a transformation.
"""
class TransformedCorpus(CorpusABC):
def __init__(self, fnc, corpus):
@@ -64,14 +67,17 @@ def __iter__(self):
yield self.fnc(doc)
#endclass TransformedCorpus

def __getitem__(self):
def __getitem__(self, vec):
"""
Transform vector from one vector space into another.
"""
raise NotImplementedError('cannot instantiate abstract base class')


def apply(self, corpus):
"""
Helper function used in derived classes. Applies the transformation to
a whole corpus (as opposed to a single document) and returns another corpus.
Apply the transformation to a whole corpus (as opposed to a single document)
and return the result as another corpus.
"""
return TransformationABC.TransformedCorpus(self.__getitem__, corpus)
#endclass TransformationABC
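
The corpus and transformation interfaces in this file can be illustrated with a small standalone sketch. It deliberately does not import gensim: `ListCorpus`, `TransformedCorpus` and `ScaleTransformation` are hypothetical stand-ins that only mimic the pattern shown in the diff (a corpus is an iterable of sparse documents; a transformation maps one document via `[]` and a whole corpus via `apply`).

```python
class ListCorpus:
    """Hypothetical corpus: each iteration step yields one sparse document."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for doc in self.docs:
            yield doc

    def __len__(self):
        # The size is known in advance here, so no linear scan is needed.
        return len(self.docs)


class TransformedCorpus:
    """Lazily applies a function to every document of a wrapped corpus."""

    def __init__(self, fnc, corpus):
        self.fnc = fnc
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.fnc(doc)


class ScaleTransformation:
    """Hypothetical transformation: multiplies every field value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, vec):
        # Transform a single sparse document into another sparse document.
        return [(field_id, self.factor * value) for field_id, value in vec]

    def apply(self, corpus):
        # Transform a whole corpus, one document at a time.
        return TransformedCorpus(self.__getitem__, corpus)


corpus = ListCorpus([[(1, 1.0), (3, 2.0)], [(2, 4.0)]])
double = ScaleTransformation(2.0)
single = double[[(1, 1.0), (3, 2.0)]]
transformed = list(double.apply(corpus))
```

Because `TransformedCorpus` only wraps the original iterable, documents are transformed one at a time during iteration, which matches the streaming design goal stated in the introduction.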
6 changes: 3 additions & 3 deletions src/gensim/utils.py
@@ -76,15 +76,15 @@ class SaveLoad(object):
@classmethod
def load(cls, fname):
"""
Load a previously saved object from file (also see save()).
Load a previously saved object from file (also see `save`).
"""
logging.info("loading %s object from %s" % (cls.__name__, fname))
return cPickle.load(open(fname))


def save(self, fname):
"""
Save the object to file via pickling (also see load()).
Save the object to file via pickling (also see `load`).
"""
logging.info("saving %s object to %s" % (self.__class__.__name__, fname))
f = open(fname, 'w')
@@ -100,7 +100,7 @@ def identity(p):
def dictFromCorpus(corpus):
"""
Scan corpus for all word ids that appear in it, then construct and return a mapping
which maps each wordId -> str(wordId).
which maps each ``wordId -> str(wordId)``.
This function is used whenever *words* need to be displayed (as opposed to just
their ids) but no wordId->word mapping was provided. The resulting mapping
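
The behaviour described in this docstring can be sketched as below. The body of `dictFromCorpus` is not visible in this hunk, so this is an assumption-based illustration under a hypothetical name (`dict_from_corpus`), not the function's actual implementation.

```python
def dict_from_corpus(corpus):
    """Scan a corpus of sparse documents and map each word id to its string form."""
    word_ids = set()
    for doc in corpus:
        # Each document is a list of (wordId, value) pairs.
        word_ids.update(word_id for word_id, _ in doc)
    return {word_id: str(word_id) for word_id in word_ids}


corpus = [[(1, 1.0), (3, 2.0)], [(3, 1.0), (7, 4.0)]]
id2word = dict_from_corpus(corpus)
```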