PERF: set multiindex labels with a coerced dtype (GH8456) #8676
CLN: move coerce_indexer_dtype to common
cc @shoyer
    return indexer.astype('int16')
elif l < _int32_max:
    return indexer.astype('int32')
return indexer.astype('int64')
Taking a look at this again, this operation will always copy the input array, even if it already has the right dtype.
It would be a little better to do something like this:
l = len(categories)
if l < _int8_max:
    dtype = 'int8'
...
return np.array(indexer, copy=False, dtype=dtype)
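Filling the suggestion out as a self-contained sketch (the _int*_max bounds and the demo values here are assumptions, not the pandas internals; np.asarray is used because numpy >= 2.0 made copy=False raise when a copy is unavoidable, whereas asarray keeps the old copy-only-if-needed behavior):

import numpy as np

_int8_max = np.iinfo(np.int8).max
_int16_max = np.iinfo(np.int16).max
_int32_max = np.iinfo(np.int32).max

def coerce_indexer_dtype(indexer, categories):
    # pick the smallest integer dtype that can hold any position
    # into `categories`
    l = len(categories)
    if l < _int8_max:
        dtype = 'int8'
    elif l < _int16_max:
        dtype = 'int16'
    elif l < _int32_max:
        dtype = 'int32'
    else:
        dtype = 'int64'
    # copies only when the dtype actually changes
    return np.asarray(indexer, dtype=dtype)

idx = np.arange(5, dtype='int8')
assert coerce_indexer_dtype(idx, range(10)) is idx  # no copy was made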
@shoyer both done
@@ -4361,7 +4370,7 @@ def insert(self, loc, item):
             lev_loc = level.get_loc(k)

         new_levels.append(level)
-        new_labels.append(np.insert(labels, loc, lev_loc))
+        new_labels.append(np.insert(labels.astype('int64'), loc, lev_loc))
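Presumably the upcast is there because labels can now be stored in a small integer dtype while the inserted location need not fit in it; a toy illustration (values made up):

import numpy as np

labels = np.array([0, 1, 2], dtype='int8')
lev_loc = 300                    # does not fit in int8
# without the upcast, np.insert would coerce 300 into int8,
# wrapping or raising depending on the numpy version
out = np.insert(labels.astype('int64'), 1, lev_loc)
print(out, out.dtype)            # [  0 300   1   2] int64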
maybe a good idea to add copy=False here
done, I fixed these in a slightly different way (there are _ensure_int* functions).
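A standalone sketch of what such an _ensure_int* helper looks like (an illustration of the pattern, not the actual pandas internals):

import numpy as np

def ensure_int64(arr):
    # cast up to int64 only when needed; an array that is already
    # int64 passes through without a copy
    arr = np.asarray(arr)
    return arr if arr.dtype == np.int64 else arr.astype(np.int64)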
@cache_readonly
def nbytes(self):
    """ return the number of bytes in the underlying data """
    level_nbytes = sum(( i.nbytes for i in self.levels ))
note: nested ( is also unnecessary here, but harmless :).
this IS necessary, each level is an array.
Test it out: sum(x for x in [1, 2, 3])
sum(i.nbytes for i in self.levels)
implicitly does the generator comprehension: http://legacy.python.org/dev/peps/pep-0289/
If you have a single argument to a function, the parentheses around generator comprehensions are optional.
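A quick check of the syntax point with toy values:

nums = [1, 2, 3]
sum(x for x in nums)      # the call's own parentheses suffice
sum((x for x in nums))    # extra parens: same result, just redundant
# with more than one argument, the generator needs its own parentheses:
max((x for x in nums), default=0)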
That would be true, but each element is a level, which is an ndarray itself.
In [1]: i = pd.MultiIndex.from_product([list('ab'),range(3)])
In [2]: i.levels
Out[2]: FrozenList([[u'a', u'b'], [0, 1, 2]])
In [4]: sum([ x.nbytes for x in i.levels ])
Out[4]: 40
In [7]: i.levels[0]
Out[7]: Index([u'a', u'b'], dtype='object')
not sure I understand your point from this example? I get the same thing without the nested brackets:
In [6]: sum(x.nbytes for x in i.levels)
Out[6]: 40
This is really a matter of Python syntax, which is independent of the nature of the arguments.
I see, yeah... I have always used the [ ]; doesn't make a difference in this case.
looks pretty good to me! note: not really sure why pandas has ...
PERF: set multiindex labels with a coerced dtype (GH8456)
@jreback
so it will change ..., and there are also many places in the code that explicitly call into .... Also, practically, down-casting has a negative impact on memory usage, because many operations generate a temporary version of the array with upcast types (even when viable, down-casting the other operand is risky or costly to do safely), so there will then be two copies of the same values in memory instead of only one, and that definitely increases memory usage.
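A sketch of the cost being described, with made-up sizes:

import numpy as np

labels = np.zeros(1_000_000, dtype='int8')   # ~1 MB of labels
tmp = labels.astype('int64')                 # ~8 MB temporary upcast
# while tmp is alive, memory holds both copies (~9 MB total)
print(labels.nbytes, tmp.nbytes)             # 1000000 8000000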
@behzadnouri these are only for indexers. The risk of overflow is 0. Yet you gain enormous memory savings. If you see a perf impact then it would be much easier to NOT upcast when doing the indexing. I would say that there are some negative perf implications of this, but I suspect they can be fixed and/or outweighed by holding a smaller dtype of indexer.
pls create a new issue for this, with a specific example of the problem. keep in mind that the benchmarks you are showing are for a scalar indexing operation, which is not very useful in and of itself (that said, if it can be fixed, great).
you will have to prove it on several use cases, e.g. need more than a single example. I agree that there are some casting issues, but I suspect those can be addressed.
@jreback below is the proof; any time the code hits the lines below with a smaller integer type, it creates a copy of the same data, and that costs both memory and performance:
you are missing the point. These should, in theory, be removable. This business of conversions to int64 is just a distraction. These can be indexed via a smaller indexer.
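For what it's worth, NumPy's take and fancy indexing accept any integer dtype directly, so indexing with a small indexer needs no manual upcast (toy arrays):

import numpy as np

values = np.arange(10) * 10
idx = np.array([3, 1, 2], dtype='int8')
print(values.take(idx))   # [30 10 20]
print(values[idx])        # the same via fancy indexing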
closes #8456