
PERF: set multiindex labels with a coerced dtype (GH8456) #8676

Merged 2 commits into pandas-dev:master on Oct 30, 2014

Conversation

@jreback (Contributor) commented Oct 29, 2014

closes #8456

CLN: move coerce_indexer_dtype to common
@jreback added the Bug, MultiIndex, and Performance (Memory or execution speed) labels Oct 29, 2014
@jreback added this to the 0.15.1 milestone Oct 29, 2014
@jreback (Contributor, Author) commented Oct 29, 2014

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche

return indexer.astype('int16')
elif l < _int32_max:
return indexer.astype('int32')
return indexer.astype('int64')
shoyer (Member) commented:

Taking a look at this again, this operation will always copy the input array, even if it already has the right dtype.

It would be a little better to do something like this:

    l = len(categories)
    if l < _int8_max:
        dtype = 'int8'
    ...
    return np.array(indexer, copy=False, dtype=dtype)
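
For reference, a fuller, self-contained sketch of what the suggestion amounts to (the _int*_max bounds and the helper name here are illustrative, not the exact pandas internals):

    import numpy as np

    # illustrative bounds; pandas keeps similar module-level constants
    _int8_max = np.iinfo(np.int8).max
    _int16_max = np.iinfo(np.int16).max
    _int32_max = np.iinfo(np.int32).max

    def coerce_indexer_dtype(indexer, categories):
        """Coerce the indexer to the smallest integer dtype that can hold
        len(categories) codes, avoiding a copy when the dtype already matches."""
        l = len(categories)
        if l < _int8_max:
            dtype = 'int8'
        elif l < _int16_max:
            dtype = 'int16'
        elif l < _int32_max:
            dtype = 'int32'
        else:
            dtype = 'int64'
        # copy=False here means "copy only if needed" (NumPy < 2.0 semantics):
        # an indexer that already has the target dtype is returned as-is,
        # whereas .astype() copies unconditionally by default
        return np.array(indexer, copy=False, dtype=dtype)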

@jreback (Contributor, Author) commented Oct 29, 2014

@shoyer both done

@@ -4361,7 +4370,7 @@ def insert(self, loc, item):
        lev_loc = level.get_loc(k)

        new_levels.append(level)
-       new_labels.append(np.insert(labels, loc, lev_loc))
+       new_labels.append(np.insert(labels.astype('int64'), loc, lev_loc))
shoyer (Member) commented:

maybe a good idea to add copy=False here

jreback (Contributor, Author) commented:

done; I fixed these in a slightly different way (there are _ensure_int* functions).

@cache_readonly
def nbytes(self):
""" return the number of bytes in the underlying data """
level_nbytes = sum(( i.nbytes for i in self.levels ))
shoyer (Member) commented:

note: nested ( is also unnecessary here, but harmless :).

jreback (Contributor, Author) commented:

this IS necessary, each level is an array.

shoyer (Member) commented:

Test it out: sum(x for x in [1, 2, 3])

sum(i.nbytes for i in self.levels) implicitly uses a generator comprehension: http://legacy.python.org/dev/peps/pep-0289/

When a generator comprehension is the single argument to a function, the parentheses around it are optional.
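
A tiny illustration of the syntax point (the byte counts are made up):

    nbytes_per_level = [16, 24]   # stand-in for the .nbytes of each level

    # all three are equivalent; the extra parentheses/brackets are optional
    # when the generator comprehension is the sole argument of the call
    sum((x for x in nbytes_per_level))   # nested parens, as in the PR code
    sum(x for x in nbytes_per_level)     # bare generator comprehension
    sum([x for x in nbytes_per_level])   # list comprehension (builds a temporary list)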

jreback (Contributor, Author) commented:

That would be true, but each element is a level, which is an ndarray itself:

In [1]: i = pd.MultiIndex.from_product([list('ab'),range(3)])

In [2]: i.levels
Out[2]: FrozenList([[u'a', u'b'], [0, 1, 2]])

In [4]: sum([ x.nbytes for x in i.levels ])    
Out[4]: 40

In [7]: i.levels[0]
Out[7]: Index([u'a', u'b'], dtype='object')

shoyer (Member) commented:

not sure I understand your point from this example? I get the same thing without the nested brackets:

In [6]: sum(x.nbytes for x in i.levels)
Out[6]: 40

This is really a matter of Python syntax, which is independent of the nature of the arguments.

jreback (Contributor, Author) commented:

I see, yeah... I have always used the [ ]; it doesn't make a difference in this case.

@shoyer (Member) commented Oct 30, 2014

looks pretty good to me!

note: not really sure why pandas has _ensure_int8 functions (and the like)... I would think np.asarray(x, 'int8') would be succinct enough!
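
For what it's worth, the copy behaviour behind that suggestion can be checked directly; np.asarray only converts when a dtype change is actually required, while .astype() copies by default:

    import numpy as np

    codes = np.array([0, 1, 2, 1], dtype='int8')

    a = np.asarray(codes, dtype='int8')   # dtype already matches: no copy
    b = codes.astype('int8')              # copies unconditionally by default

    print(a is codes)   # True
    print(b is codes)   # False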

@jreback modified the milestones: 0.15.1, 0.15.2 Oct 30, 2014
jreback added a commit that referenced this pull request Oct 30, 2014
PERF: set multiindex labels with a coerced dtype (GH8456)
@jreback merged commit 5d22bd1 into pandas-dev:master Oct 30, 2014
@behzadnouri (Contributor) commented:

@jreback
please consider reverting this change; casting to smaller integer types comes with the risk of overflow/wraparound, plus a performance cost anywhere there is a call to com._ensure_int64 or where numpy internally has to up-cast. As an example, indexing into a sorted multi-index has been impacted by this commit, because numpy has to up-cast before doing the binary search:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_xs_mi_ix                               |   8.5303 |   0.6206 |  13.7452 |
series_xs_mi_ix                              |   8.0659 |   0.5600 |  14.4023 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [c11e75c] : PERF: set multiindex labels with coerced dtype (GH8456)
Base   [6bbb39e] : Merge pull request #8675 from pydata/setitem

so it will change O(log(n)) performance to O(n). (Down-casting the scalar value is not an option here because of the risk of overflow, so the entire array has to be up-cast.)

there are also many places in the code that explicitly call into com._ensure_int64, and this commit will impact their performance as well.

also, practically, down-casting has a negative impact on memory usage: many operations generate a temporary up-cast version of the array (down-casting the other operand, even when viable, is risky or costly to do safely), so there end up being two copies of the same values in memory instead of one.
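
A rough, hypothetical illustration of that memory point (sizes are for a 10-million-element labels array; the exact code path in pandas may differ):

    import numpy as np

    n = 10_000_000
    codes = np.zeros(n, dtype='int8')     # down-cast labels: ~10 MB

    # any operation that needs int64 codes first materializes an up-cast copy,
    # so both arrays coexist in memory for the duration of that operation
    codes64 = codes.astype('int64')       # ~80 MB extra, plus the copy time

    print(codes.nbytes, codes64.nbytes)   # 10000000 vs 80000000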

@jreback (Contributor, Author) commented Dec 13, 2014

@behzadnouri these are only for indexers. The risk of overflow is 0. Yet you gain enormous memory savings.

If you see a perf impact then it would be much easier to NOT upcast when doing the indexing.

I would say that there are some negative perf implications of this, but I suspect they can be fixed and/or outweighed by holding a smaller-dtype indexer.
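
Back-of-the-envelope arithmetic for the savings being claimed here, assuming each level is small enough for int8 codes and a hypothetical 10-million-row index:

    import numpy as np

    n_rows = 10_000_000          # rows in the MultiIndex
    n_levels = 2                 # one labels/codes array per level

    int64_bytes = n_levels * n_rows * np.dtype('int64').itemsize
    int8_bytes = n_levels * n_rows * np.dtype('int8').itemsize

    print(int64_bytes // 2**20, 'MiB vs', int8_bytes // 2**20, 'MiB')
    # 152 MiB vs 19 MiB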

@jreback (Contributor, Author) commented Dec 13, 2014

pls create a new issue for this, with a specific example of the problem. Keep in mind that the benchmarks you are showing are for a scalar indexing operation, which is not very useful in and of itself. (That said, if it can be fixed, great.)

@behzadnouri (Contributor) commented:

@jreback as in the last paragraph of my comment, this will cause more memory usage.
Memory profile results added in #9073.

@jreback (Contributor, Author) commented Dec 13, 2014

you will have to prove it on several use cases, e.g. we need more than a single example. I agree that there are some casting issues, but I suspect those can be addressed.

@behzadnouri (Contributor) commented:

@jreback below is the proof; any time the code hits the lines below with a smaller integer type, it creates a copy of the same data, and that costs both memory and performance:

$ find -maxdepth 4 -iname '*.py' -exec grep -IrHn -i 'ensure_int64.*index' '{}' \;
./pandas/core/groupby.py:3692:        sorter, _ = _algos.groupsort_indexer(com._ensure_int64(group_index),
./pandas/core/groupby.py:3708:    group_index = com._ensure_int64(group_index)
./pandas/core/generic.py:1828:                indexer = com._ensure_int64(indexer)
./pandas/core/common.py:729:        indexer = _ensure_int64(indexer)
./pandas/core/common.py:768:        indexer = _ensure_int64(indexer)
./pandas/core/common.py:822:    indexer = _ensure_int64(indexer)
./pandas/core/common.py:971:    return _ensure_int64(indexer)
./pandas/core/common.py:980:    return _ensure_int64(indexer)
./pandas/core/frame.py:4125:        labels = com._ensure_int64(frame.index.labels[level])
./pandas/core/series.py:1123:            labels = com._ensure_int64(self.index.labels[level])
./pandas/core/index.py:1896:            left_lev_indexer = com._ensure_int64(left_lev_indexer)
./pandas/tseries/index.py:464:                index = tslib.tz_localize_to_utc(com._ensure_int64(index), tz,
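
A quick way to see the cost at such a call site (np.asarray stands in here for an _ensure_int64-style helper):

    import numpy as np

    codes8 = np.zeros(1_000_000, dtype='int8')
    codes64 = np.zeros(1_000_000, dtype='int64')

    out8 = np.asarray(codes8, dtype='int64')    # up-cast: allocates a full copy
    out64 = np.asarray(codes64, dtype='int64')  # already int64: returned as-is

    print(out8 is codes8)     # False -> a new 8 MB array was allocated
    print(out64 is codes64)   # True  -> nothing was copied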

@jreback (Contributor, Author) commented Dec 13, 2014

you are missing the point. These should in theory be removable. This business of conversions to int64 is just a distraction; these can be indexed via a smaller indexer.
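
As a small sketch of that last point, NumPy's take and fancy indexing accept small integer dtypes directly, so in principle no int64 coercion is needed just to index:

    import numpy as np

    values = np.arange(10) * 10
    small_idx = np.array([0, 3, 7], dtype='int8')

    print(values.take(small_idx))   # [ 0 30 70]
    print(values[small_idx])        # same result, no explicit up-cast required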

Labels: Bug, MultiIndex, Performance (Memory or execution speed performance)

Linked issue (may be closed by this pull request): PERF: optimize memory space for factorize / FrozenList (storage for MultiIndex)

3 participants