FIX memory usage in DictVectorizer.fit
No need to materialize X in memory if it's a generator.
Fixes scikit-learn#2171.

fit_transform still needs fixing; it should be single pass.
larsmans committed Jul 22, 2013
1 parent 93deda5 commit 9c2ec56
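
For illustration, here is a minimal usage sketch of the behaviour described in the commit message: fit can now consume a generator of feature dicts directly, without first copying it into a list. The data and variable names below are made up for the example, not taken from the commit.

    from sklearn.feature_extraction import DictVectorizer

    # A generator of feature dicts; after this change, fit() iterates over it
    # directly instead of materializing it as a list first.
    records = ({"word": w, "length": len(w)} for w in ["foo", "bar", "baz"])

    v = DictVectorizer()
    v.fit(records)           # a single pass over the generator
    print(v.feature_names_)  # something like ['length', 'word=bar', 'word=baz', 'word=foo']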
Showing 1 changed file with 5 additions and 2 deletions.
sklearn/feature_extraction/dict_vectorizer.py: 5 additions & 2 deletions
@@ -101,8 +101,6 @@ def fit(self, X, y=None):
         -------
         self
         """
-        X = _tosequence(X)
-
         # collect all the possible feature names
         feature_names = set()
         for x in X:
@@ -134,6 +132,11 @@ def fit_transform(self, X, y=None):
         -------
         Xa : {array, sparse matrix}
             Feature vectors; always 2-d.
+        Notes
+        -----
+        Because this method requires two passes over X, it materializes X in
+        memory.
         """
+        X = _tosequence(X)
         self.fit(X)
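
The commit message notes that fit_transform should eventually be single pass. Purely as an illustration of that idea, and not the scikit-learn implementation, a single-pass version could assign column indices on the fly while collecting sparse-matrix entries; the sketch below handles only numeric feature values.

    import scipy.sparse as sp

    def single_pass_fit_transform(X):
        """Build a CSR matrix from an iterable of numeric feature dicts in one pass."""
        vocabulary = {}      # feature name -> column index, assigned on the fly
        rows, cols, values = [], [], []
        n_samples = 0

        for x in X:          # X may be a generator; it is traversed exactly once
            for name, value in x.items():
                j = vocabulary.setdefault(name, len(vocabulary))
                rows.append(n_samples)
                cols.append(j)
                values.append(value)
            n_samples += 1

        Xa = sp.coo_matrix((values, (rows, cols)),
                           shape=(n_samples, len(vocabulary))).tocsr()
        return Xa, vocabulary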
