
PERF: set multiindex labels with a coerced dtype (GH8456) #8676

Merged 2 commits into pandas-dev:master on Oct 30, 2014

Conversation

@jreback (Contributor) commented Oct 29, 2014

closes #8456

CLN: move coerce_indexer_dtype to common
@jreback added the Bug, MultiIndex, and Performance (Memory or execution speed) labels Oct 29, 2014
@jreback added this to the 0.15.1 milestone Oct 29, 2014
@jreback (Contributor, Author) commented Oct 29, 2014

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche

return indexer.astype('int16')
elif l < _int32_max:
return indexer.astype('int32')
return indexer.astype('int64')
shoyer (Member) commented:

Taking a look at this again, this operation will always copy the input array, even if it already has the right dtype.

It would be a little better to do something like this:

    l = len(categories)
    if l < _int8_max:
        dtype = 'int8'
    ...
    return np.array(indexer, copy=False, dtype=dtype)
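
For reference, a fuller, self-contained sketch of what the suggestion amounts to (the _int*_max bounds and the helper name here are illustrative, not the exact pandas internals):

    import numpy as np

    # illustrative bounds; pandas keeps similar module-level constants
    _int8_max = np.iinfo(np.int8).max
    _int16_max = np.iinfo(np.int16).max
    _int32_max = np.iinfo(np.int32).max

    def coerce_indexer_dtype(indexer, categories):
        """Coerce the indexer to the smallest integer dtype that can hold
        len(categories) codes, avoiding a copy when the dtype already matches."""
        l = len(categories)
        if l < _int8_max:
            dtype = 'int8'
        elif l < _int16_max:
            dtype = 'int16'
        elif l < _int32_max:
            dtype = 'int32'
        else:
            dtype = 'int64'
        # copy=False here means "copy only if needed" (NumPy < 2.0 semantics):
        # an indexer that already has the target dtype is returned as-is,
        # whereas .astype() copies unconditionally by default
        return np.array(indexer, copy=False, dtype=dtype)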

@jreback (Contributor, Author) commented Oct 29, 2014

@shoyer both done

@@ -4361,7 +4370,7 @@ def insert(self, loc, item):
        lev_loc = level.get_loc(k)

        new_levels.append(level)
-       new_labels.append(np.insert(labels, loc, lev_loc))
+       new_labels.append(np.insert(labels.astype('int64'), loc, lev_loc))
shoyer (Member) commented:

maybe a good idea to add copy=False here

jreback (Contributor, Author) commented:

done; I fixed these in a slightly different way (there are _ensure_int* functions).

@cache_readonly
def nbytes(self):
""" return the number of bytes in the underlying data """
level_nbytes = sum(( i.nbytes for i in self.levels ))
shoyer (Member) commented:

note: nested ( is also unnecessary here, but harmless :).

jreback (Contributor, Author) commented:

this IS necessary, each level is an array.

shoyer (Member) commented:

Test it out: sum(x for x in [1, 2, 3])

sum(i.nbytes for i in self.levels) implicitly uses a generator comprehension: http://legacy.python.org/dev/peps/pep-0289/

When a generator comprehension is the single argument to a function, the parentheses around it are optional.
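
A tiny illustration of the syntax point (the byte counts are made up):

    nbytes_per_level = [16, 24]   # stand-in for the .nbytes of each level

    # all three are equivalent; the extra parentheses/brackets are optional
    # when the generator comprehension is the sole argument of the call
    sum((x for x in nbytes_per_level))   # nested parens, as in the PR code
    sum(x for x in nbytes_per_level)     # bare generator comprehension
    sum([x for x in nbytes_per_level])   # list comprehension (builds a temporary list)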

jreback (Contributor, Author) commented:

That would be true, but each element is a level, which is an ndarray itself:

In [1]: i = pd.MultiIndex.from_product([list('ab'),range(3)])

In [2]: i.levels
Out[2]: FrozenList([[u'a', u'b'], [0, 1, 2]])

In [4]: sum([ x.nbytes for x in i.levels ])    
Out[4]: 40

In [7]: i.levels[0]
Out[7]: Index([u'a', u'b'], dtype='object')

shoyer (Member) commented:

not sure I understand your point from this example? I get the same thing without the nested brackets:

In [6]: sum(x.nbytes for x in i.levels)
Out[6]: 40

This is really a matter of Python syntax, which is independent of the nature of the arguments.

jreback (Contributor, Author) commented:

I see, yeah... I have always used the [ ]; it doesn't make a difference in this case.

@shoyer (Member) commented Oct 30, 2014

looks pretty good to me!

note: not really sure why pandas has _ensure_int8 functions (and the like)... I would think np.asarray(x, 'int8') would be succinct enough!
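
For what it's worth, the copy behaviour behind that suggestion can be checked directly; np.asarray only converts when a dtype change is actually required, while .astype() copies by default:

    import numpy as np

    codes = np.array([0, 1, 2, 1], dtype='int8')

    a = np.asarray(codes, dtype='int8')   # dtype already matches: no copy
    b = codes.astype('int8')              # copies unconditionally by default

    print(a is codes)   # True
    print(b is codes)   # False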

@jreback modified the milestones: 0.15.1, 0.15.2 Oct 30, 2014
jreback added a commit that referenced this pull request Oct 30, 2014
PERF: set multiindex labels with a coerced dtype (GH8456)
@jreback merged commit 5d22bd1 into pandas-dev:master Oct 30, 2014
@behzadnouri (Contributor) commented:

@jreback
please consider reverting this change; casting to smaller integer types comes with the risk of overflow/wraparound, plus a performance cost anywhere there is a call to com._ensure_int64 or where numpy internally has to up-cast. As an example, indexing into a sorted multi-index has been impacted by this commit, because numpy has to up-cast before doing the binary search:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_xs_mi_ix                               |   8.5303 |   0.6206 |  13.7452 |
series_xs_mi_ix                              |   8.0659 |   0.5600 |  14.4023 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [c11e75c] : PERF: set multiindex labels with coerced dtype (GH8456)
Base   [6bbb39e] : Merge pull request #8675 from pydata/setitem

so it will change O(log(n)) performance to O(n). (Down-casting the scalar value is not an option here because of the risk of overflow, so the entire array has to be up-cast.)

there are also many places in the code that explicitly call into com._ensure_int64, and this commit will impact their performance as well.

also, practically, down-casting has a negative impact on memory usage: many operations generate a temporary up-cast version of the array (down-casting the other operand, even when viable, is risky or costly to do safely), so there end up being two copies of the same values in memory instead of one.
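
A rough, hypothetical illustration of that memory point (sizes are for a 10-million-element labels array; the exact code path in pandas may differ):

    import numpy as np

    n = 10_000_000
    codes = np.zeros(n, dtype='int8')     # down-cast labels: ~10 MB

    # any operation that needs int64 codes first materializes an up-cast copy,
    # so both arrays coexist in memory for the duration of that operation
    codes64 = codes.astype('int64')       # ~80 MB extra, plus the copy time

    print(codes.nbytes, codes64.nbytes)   # 10000000 vs 80000000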

@jreback (Contributor, Author) commented Dec 13, 2014

@behzadnouri these are only for indexers. The risk of overflow is 0. Yet you gain enormous memory savings.

If you see a perf impact then it would be much easier to NOT upcast when doing the indexing.

I would say that there are some negative perf implications of this, but I suspect they can be fixed and/or outweighed by holding a smaller-dtype indexer.
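
Back-of-the-envelope arithmetic for the savings being claimed here, assuming each level is small enough for int8 codes and a hypothetical 10-million-row index:

    import numpy as np

    n_rows = 10_000_000          # rows in the MultiIndex
    n_levels = 2                 # one labels/codes array per level

    int64_bytes = n_levels * n_rows * np.dtype('int64').itemsize
    int8_bytes = n_levels * n_rows * np.dtype('int8').itemsize

    print(int64_bytes // 2**20, 'MiB vs', int8_bytes // 2**20, 'MiB')
    # 152 MiB vs 19 MiB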

@jreback (Contributor, Author) commented Dec 13, 2014

pls create a new issue for this, with a specific example of the problem. Keep in mind that the benchmarks you are showing are for a scalar indexing operation, which is not very useful in and of itself. (That said, if it can be fixed, great.)

@behzadnouri (Contributor) commented:

@jreback as in the last paragraph of my comment, this will cause more memory usage.
Memory profile results added in #9073.

@jreback (Contributor, Author) commented Dec 13, 2014

you will have to prove it on several use cases, e.g. we need more than a single example. I agree that there are some casting issues, but I suspect those can be addressed.

@behzadnouri (Contributor) commented:

@jreback below is the proof; any time the code hits the lines below with a smaller integer type, it creates a copy of the same data, and that costs both memory and performance:

$ find -maxdepth 4 -iname '*.py' -exec grep -IrHn -i 'ensure_int64.*index' '{}' \;
./pandas/core/groupby.py:3692:        sorter, _ = _algos.groupsort_indexer(com._ensure_int64(group_index),
./pandas/core/groupby.py:3708:    group_index = com._ensure_int64(group_index)
./pandas/core/generic.py:1828:                indexer = com._ensure_int64(indexer)
./pandas/core/common.py:729:        indexer = _ensure_int64(indexer)
./pandas/core/common.py:768:        indexer = _ensure_int64(indexer)
./pandas/core/common.py:822:    indexer = _ensure_int64(indexer)
./pandas/core/common.py:971:    return _ensure_int64(indexer)
./pandas/core/common.py:980:    return _ensure_int64(indexer)
./pandas/core/frame.py:4125:        labels = com._ensure_int64(frame.index.labels[level])
./pandas/core/series.py:1123:            labels = com._ensure_int64(self.index.labels[level])
./pandas/core/index.py:1896:            left_lev_indexer = com._ensure_int64(left_lev_indexer)
./pandas/tseries/index.py:464:                index = tslib.tz_localize_to_utc(com._ensure_int64(index), tz,
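
A quick way to see the cost at such a call site (np.asarray stands in here for an _ensure_int64-style helper):

    import numpy as np

    codes8 = np.zeros(1_000_000, dtype='int8')
    codes64 = np.zeros(1_000_000, dtype='int64')

    out8 = np.asarray(codes8, dtype='int64')    # up-cast: allocates a full copy
    out64 = np.asarray(codes64, dtype='int64')  # already int64: returned as-is

    print(out8 is codes8)     # False -> a new 8 MB array was allocated
    print(out64 is codes64)   # True  -> nothing was copied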

@jreback (Contributor, Author) commented Dec 13, 2014

you are missing the point. These should in theory be removable. This business of conversions to int64 is just a distraction; these can be indexed via a smaller indexer.
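
As a small sketch of that last point, NumPy's take and fancy indexing accept small integer dtypes directly, so in principle no int64 coercion is needed just to index:

    import numpy as np

    values = np.arange(10) * 10
    small_idx = np.array([0, 3, 7], dtype='int8')

    print(values.take(small_idx))   # [ 0 30 70]
    print(values[small_idx])        # same result, no explicit up-cast required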

Labels: Bug, MultiIndex, Performance (Memory or execution speed performance)

Linked issue (may be closed by this pull request): PERF: optimize memory space for factorize / FrozenList (storage for MultiIndex)

3 participants