FIX memory usage in DictVectorizer.fit
No need to materialize X in memory if it's a generator.
Fixes scikit-learn#2171.

fit_transform still needs fixing; it should be single pass.
larsmans committed Jul 22, 2013
1 parent 93deda5 commit 9c2ec56
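
For illustration, here is a minimal usage sketch of the behaviour described in the commit message: fit can now consume a generator of feature dicts directly, without first copying it into a list. The data and variable names below are made up for the example, not taken from the commit.

    from sklearn.feature_extraction import DictVectorizer

    # A generator of feature dicts; after this change, fit() iterates over it
    # directly instead of materializing it as a list first.
    records = ({"word": w, "length": len(w)} for w in ["foo", "bar", "baz"])

    v = DictVectorizer()
    v.fit(records)           # a single pass over the generator
    print(v.feature_names_)  # something like ['length', 'word=bar', 'word=baz', 'word=foo']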
Showing 1 changed file with 5 additions and 2 deletions.
sklearn/feature_extraction/dict_vectorizer.py: 5 additions & 2 deletions
@@ -101,8 +101,6 @@ def fit(self, X, y=None):
         -------
         self
         """
-        X = _tosequence(X)
-
         # collect all the possible feature names
         feature_names = set()
         for x in X:
@@ -134,6 +132,11 @@ def fit_transform(self, X, y=None):
         -------
         Xa : {array, sparse matrix}
             Feature vectors; always 2-d.
+        Notes
+        -----
+        Because this method requires two passes over X, it materializes X in
+        memory.
         """
+        X = _tosequence(X)
         self.fit(X)
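
The commit message notes that fit_transform should eventually be single pass. Purely as an illustration of that idea, and not the scikit-learn implementation, a single-pass version could assign column indices on the fly while collecting sparse-matrix entries; the sketch below handles only numeric feature values.

    import scipy.sparse as sp

    def single_pass_fit_transform(X):
        """Build a CSR matrix from an iterable of numeric feature dicts in one pass."""
        vocabulary = {}      # feature name -> column index, assigned on the fly
        rows, cols, values = [], [], []
        n_samples = 0

        for x in X:          # X may be a generator; it is traversed exactly once
            for name, value in x.items():
                j = vocabulary.setdefault(name, len(vocabulary))
                rows.append(n_samples)
                cols.append(j)
                values.append(value)
            n_samples += 1

        Xa = sp.coo_matrix((values, (rows, cols)),
                           shape=(n_samples, len(vocabulary))).tocsr()
        return Xa, vocabulary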
