piskvorky#13 update of documentation

git-svn-id: https://my-svn.assembla.com/svn/gensim/trunk@80 92d0401f-a546-4972-9173-107b360ed7e5
Zezo360 · Mar 18, 2010 · 7a73a43
1 parent 2012f8e
commit 7a73a43

Showing 8 changed files with 62 additions and 43 deletions.
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,4 +1,4 @@
 recursive-include docs *
 include COPYING
 include COPYING.LESSER
-
+include ez_setup.py
diff --git a/docs/src/apiref.rst b/docs/src/apiref.rst
@@ -8,6 +8,7 @@ Modules:
 .. toctree::
     :maxdepth: 0
 
+    interfaces
     utils
     matutils
     corpora/bleicorpus

diff --git a/docs/src/interfaces.rst b/docs/src/interfaces.rst
@@ -0,0 +1,8 @@
+:mod:`interfaces`
+==================
+
+.. automodule:: gensim.interfaces
+    :synopsis: Core gensim interfaces
+    :members:
+    :inherited-members:
+
diff --git a/docs/src/intro.rst b/docs/src/intro.rst
@@ -19,31 +19,33 @@ Design
 ------
 
 Gensim includes the following features:
-    * Memory independence -- there is no need for the whole text corpus (or any 
-      intermediate term-document matrices) to reside fully in RAM at any one time.
-    * Provides implementations for several popular topic inference algorithms, 
-      including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA), 
-      and makes adding new ones simple.
-    * Contains I/O wrappers and converters around several popular data formats.
-    * Allows similarity queries across documents in their latent, topical representation.
+
+* Memory independence -- there is no need for the whole text corpus (or any 
+  intermediate term-document matrices) to reside fully in RAM at any one time.
+* Provides implementations for several popular topic inference algorithms, 
+  including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA), 
+  and makes adding new ones simple.
+* Contains I/O wrappers and converters around several popular data formats.
+* Allows similarity queries across documents in their latent, topical representation.
 
 Creation of gensim was motivated by a perceived lack of available, scalable software 
 frameworks that realize topic modeling, and/or their overwhelming internal complexity. 
 You can read more about the motivation in our `LREC 2010 workshop paper <http://www.fi.muni.cz/~sojka/lrec2010/dml_lrec.pdf>`_.
 
 The principal design objectives behind gensim are:
-    1. Straightforward interfaces and low API learning curve for developers, 
-       facilitating modifications and rapid prototyping.
-    2. Memory independence with respect to the size of the input corpus; all intermediate 
-       steps and algorithms operate in a streaming fashion, processing one document 
-       at a time.
+
+1. Straightforward interfaces and low API learning curve for developers, 
+   facilitating modifications and rapid prototyping.
+2. Memory independence with respect to the size of the input corpus; all intermediate 
+   steps and algorithms operate in a streaming fashion, processing one document 
+   at a time.
 
 
 Availability
 ------------
 Gensim is licensed under the OSI-approved `GNU LGPL license <http://www.gnu.org/licenses/lgpl.html>`_ 
 and can be downloaded either from its `SVN repository <http://my-trac.assembla.com/gensim>`_
-or from the `Python Package Index <TODO>`_. 
+or from the `Python Package Index <http://pypi.python.org/pypi/gensim>`_. 
 
 .. http://my-trac.assembla.com/gensim/browser/trunk/COPYING.LESSER
 
@@ -73,12 +75,13 @@ The whole gensim package revolves around the concepts of :term:`corpus`, :term:`
         In the Vector Space Model (VSM), each document is represented by an 
         array of features. For example, a single feature may be thought of as a 
         question-answer pair:
-            1. How many times does the word *splonge* appear in the document? Zero.
-            2. How many paragraphs does the document consist of? Two.
-            3. How many fonts does the document use? Five.
+
+        1. How many times does the word *splonge* appear in the document? Zero.
+        2. How many paragraphs does the document consist of? Two.
+        3. How many fonts does the document use? Five.
 
         The question is usually represented only by its integer id, so that the
-        representation becomes a series of pairs: ``(1, 0.0), (2, 2.0), (3, 5.0)``.
+        representation of a document becomes a series of pairs: ``(1, 0.0), (2, 2.0), (3, 5.0)``.
         If we know all the questions in advance, we may leave them implicit 
         and simply write ``(0.0, 2.0, 5.0)``.
         This sequence of answers can be thought of as a high-dimensional (in our case 3-dimensional)
@@ -102,7 +105,7 @@ The whole gensim package revolves around the concepts of :term:`corpus`, :term:`
         to another (or, in other words, from one vector space to another). 
         Both the initial and target representations are
         still vectors -- they only differ in what the questions and answers are.
-        The transformation is automatically learned from the :term:`training corpus`, without human
+        The transformation is automatically learned from the training :term:`corpus`, without human
         supervision, and in hopes that the final document representation will be more compact
         and more useful (with similar documents having similar representations) 
         than the initial one. The transformation process is also sometimes called 
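
Note on the glossary text in this hunk: the two document representations it describes, a series of ``(id, answer)`` pairs and the implicit form ``(0.0, 2.0, 5.0)``, can be illustrated with a minimal sketch. The helper names `to_dense` and `to_sparse` are hypothetical and not part of gensim; feature ids are assumed to be 1-based, as in the example above.

```python
def to_dense(sparse_doc, num_features):
    """Expand (feature_id, value) pairs into a plain list of answers.

    Assumes 1-based feature ids, as in the glossary example; features
    that are not mentioned in sparse_doc default to 0.0.
    """
    dense = [0.0] * num_features
    for feature_id, value in sparse_doc:
        dense[feature_id - 1] = float(value)
    return dense


def to_sparse(dense_doc):
    """Attach an explicit 1-based question id to every answer."""
    return [(i + 1, value) for i, value in enumerate(dense_doc)]


# The document from the glossary: splonge count, paragraphs, fonts.
doc = [(1, 0.0), (2, 2.0), (3, 5.0)]
dense = to_dense(doc, num_features=3)
```

With the questions known in advance, `dense` holds only the answers, and `to_sparse(dense)` recovers the original pairs.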

diff --git a/docs/src/matutils.rst b/docs/src/matutils.rst
@@ -6,4 +6,3 @@
     :members:
     :inherited-members:
 
-
diff --git a/setup.py b/setup.py
@@ -25,20 +25,20 @@
 for topical similarity and so on.
 
 Gensim includes the following features:
-    * Memory independence -- there is no need for the whole text corpus (or any 
-      intermediate term-document matrices) to reside fully in RAM at any one time.
-    * Provides implementations for several popular topic inference algorithms, 
-      including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA), 
-      and makes adding new ones simple.
-    * Contains I/O wrappers and converters around several popular data formats.
-    * Allows similarity queries across documents in their latent, topical representation.
+* Memory independence -- there is no need for the whole text corpus (or any 
+  intermediate term-document matrices) to reside fully in RAM at any one time.
+* Provides implementations for several popular topic inference algorithms, 
+  including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA), 
+  and makes adding new ones simple.
+* Contains I/O wrappers and converters around several popular data formats.
+* Allows similarity queries across documents in their latent, topical representation.
 
 The principal design objectives behind gensim are:
-    1. Straightforward interfaces and low API learning curve for developers, 
-       facilitating modifications and rapid prototyping.
-    2. Memory independence with respect to the size of the input corpus; all intermediate 
-       steps and algorithms operate in a streaming fashion, processing one document 
-       at a time.
+1. Straightforward interfaces and low API learning curve for developers, 
+   facilitating modifications and rapid prototyping.
+2. Memory independence with respect to the size of the input corpus; all intermediate 
+   steps and algorithms operate in a streaming fashion, processing one document 
+   at a time.
 """
 
 
@@ -58,7 +58,9 @@ def read(fname):
     package_dir = {'': 'src'},
     packages = find_packages('src'),
 
-    author = 'Radim Rehurek', # there is a bug in python2.5, preventing distutils from using non-ascii characters :( http://bugs.python.org/issue2562 
+    # there is a bug in python2.5, preventing distutils from using non-ascii characters :(
+    author = 'Radim Rehurek', 
+    # author = u'Radim Řehůřek', # <- should really be this.. see http://bugs.python.org/issue2562
     author_email = 'radimrehurek@seznam.cz',
     url = 'http://nlp.fi.muni.cz/projekty/gensim',
     download_url = 'http://pypi.python.org/pypi/gensim',
@@ -82,6 +84,6 @@ def read(fname):
 
     include_package_data = True,
 
-    entry_points = "",
+    entry_points = {},
 
 )
diff --git a/src/gensim/interfaces.py b/src/gensim/interfaces.py
@@ -17,18 +17,21 @@
 
 class CorpusABC(utils.SaveLoad):
     """
-    Interface for corpora. A 'corpus' is simply an iterable, where each 
+    Interface for corpora. A *corpus* is simply an iterable, where each 
     iteration step yields one document. A document is a list of (fieldId, fieldValue)
     2-tuples.
     
-    See the corpora module for some example corpus implementations.
+    See the corpora package for some example corpus implementations.
     
     Note that although a default len() method is provided, it is very inefficient
     (performs a linear scan through the corpus to determine its length). Wherever 
     the corpus size is needed and known in advance (or at least doesn't change so 
     that it can be cached), the len() method should be overridden.
     """
     def __iter__(self):
+        """
+        Iterate over the corpus, yielding one document at a time.
+        """
         raise NotImplementedError('cannot instantiate abstract base class')
 
 
@@ -50,7 +53,7 @@ class TransformationABC(utils.SaveLoad):
     a sparse document via the dictionary notation [] and returns another sparse
     document in its stead.
     
-    See the tfidfmodel module for an example of a transformation.
+    See the :mod:`tfidfmodel` module for an example of a transformation.
     """
     class TransformedCorpus(CorpusABC):
         def __init__(self, fnc, corpus):
@@ -64,14 +67,17 @@ def __iter__(self):
                 yield self.fnc(doc) 
     #endclass TransformedCorpus
 
-    def __getitem__(self):
+    def __getitem__(self, vec):
+        """
+        Transform vector from one vector space into another.
+        """
         raise NotImplementedError('cannot instantiate abstract base class')
 
 
     def apply(self, corpus):
         """
-        Helper function used in derived classes. Applies the transformation to 
-        a whole corpus (as opposed to a single document) and returns another corpus.
+        Apply the transformation to a whole corpus (as opposed to a single document) 
+        and return the result as another corpus.
         """
         return TransformationABC.TransformedCorpus(self.__getitem__, corpus)
 #endclass TransformationABC
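
Note on the interfaces documented in this file: the sketch below shows the pattern the docstrings describe, a corpus as an iterable of sparse documents and a transformation applied per document via `[]` or lazily to a whole corpus via `apply`. It is a standalone illustration that does not import gensim; `ListCorpus`, `ScaleTransformation` and the scaling behaviour are hypothetical stand-ins, and the code is written for Python 3 rather than the Python 2.5 targeted by this commit.

```python
class ListCorpus:
    """Hypothetical corpus: each iteration step yields one document,
    a list of (fieldId, fieldValue) 2-tuples."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for doc in self.docs:
            yield doc


class TransformedCorpus:
    """Wraps a corpus and applies fnc to each document while streaming."""

    def __init__(self, fnc, corpus):
        self.fnc = fnc
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.fnc(doc)


class ScaleTransformation:
    """Hypothetical transformation: multiplies every field value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, vec):
        # Transform a single sparse document into another sparse document.
        return [(field_id, value * self.factor) for field_id, value in vec]

    def apply(self, corpus):
        # Nothing is computed here; documents are transformed one at a
        # time when the returned corpus is iterated.
        return TransformedCorpus(self.__getitem__, corpus)


corpus = ListCorpus([[(0, 1.0), (2, 3.0)], [(1, 2.0)]])
double = ScaleTransformation(2.0)
single = double[[(0, 1.0)]]
streamed = double.apply(corpus)
```

Because `apply` returns a wrapper instead of a list, the transformed corpus can be iterated repeatedly without holding all documents in memory, which matches the streaming design objective stated in the introduction.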

diff --git a/src/gensim/utils.py b/src/gensim/utils.py
@@ -76,15 +76,15 @@ class SaveLoad(object):
     @classmethod
     def load(cls, fname):
         """
-        Load a previously saved object from file (also see save()).
+        Load a previously saved object from file (also see `save`).
         """
         logging.info("loading %s object from %s" % (cls.__name__, fname))
         return cPickle.load(open(fname))
 
 
     def save(self, fname):
         """
-        Save the object to file via pickling (also see load()).
+        Save the object to file via pickling (also see `load`).
         """
         logging.info("saving %s object to %s" % (self.__class__.__name__, fname))
         f = open(fname, 'w')
@@ -100,7 +100,7 @@ def identity(p):
 def dictFromCorpus(corpus):
     """
     Scan corpus for all word ids that appear in it, then construct and return a mapping
-    which maps each wordId -> str(wordId).
+    which maps each ``wordId -> str(wordId)``.
     
     This function is used whenever *words* need to be displayed (as opposed to just 
     their ids) but no wordId->word mapping was provided. The resulting mapping