
PERF: improve MultiIndex get_loc performance #16346

Merged
use infer_dtype_from_scalar
jorisvandenbossche committed May 16, 2017
commit 3bd0404f2582403bc660facd5d1932b306b624e4
4 changes: 2 additions & 2 deletions pandas/core/dtypes/cast.py
@@ -333,7 +333,7 @@ def maybe_promote(dtype, fill_value=np.nan):
    return dtype, fill_value


def infer_dtype_from_scalar(val, pandas_dtype=False):
def infer_dtype_from_scalar(val, pandas_dtype=False, use_datetimetz=True):
Contributor:

You don't need this extra parameter; instead you can pass pandas_dtype=True. What is the issue?

Member Author:

pandas_dtype=True does not return a np.datetime64 but our DatetimeTZDtype, and then Periods also get converted, which I don't need either.
So the combination I need, i.e. tz-aware timestamps as their numpy dtype instead of as a pandas extension type or as object, but Periods kept as objects, is not possible with the current options. This is a bit of a strange combination, but it is a consequence of how those values are returned from the Index values (datetime64 without tz, but Periods as objects), which is how they are hashed.

But indeed, if we add this, ignore_tz is probably a better name
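The behaviour the author describes can be seen with public pandas API: taking .values of a tz-aware DatetimeIndex drops the timezone and yields plain datetime64[ns] (converted to UTC), which is the representation the hashing code ultimately sees. A minimal illustration:

```python
import numpy as np
import pandas as pd

# .values on a tz-aware DatetimeIndex returns a plain datetime64[ns]
# ndarray (in UTC); the timezone information is gone.
idx = pd.DatetimeIndex(["2012-01-01"], tz="Europe/Brussels")
print(idx.values.dtype)  # datetime64[ns]
```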

    """
    interpret the dtype from a scalar

@@ -368,7 +368,7 @@ def infer_dtype_from_scalar(val, pandas_dtype=False):

    elif isinstance(val, (np.datetime64, datetime)):
        val = tslib.Timestamp(val)
        if val is tslib.NaT or val.tz is None:
        if val is tslib.NaT or val.tz is None or not use_datetimetz:
            dtype = np.dtype('M8[ns]')
        else:
            if pandas_dtype:
18 changes: 3 additions & 15 deletions pandas/core/util/hashing.py
@@ -5,7 +5,6 @@

import numpy as np
from pandas._libs import hashing
from pandas.compat import string_and_binary_types, text_type
from pandas.core.dtypes.generic import (
ABCMultiIndex,
ABCIndexClass,
@@ -14,6 +13,7 @@
from pandas.core.dtypes.common import (
is_categorical_dtype, is_list_like)
from pandas.core.dtypes.missing import isnull
from pandas.core.dtypes.cast import infer_dtype_from_scalar


# 16 byte long hashing key
@@ -317,20 +317,8 @@ def _hash_scalar(val, encoding='utf8', hash_key=None):
        # this is to be consistent with the _hash_categorical implementation
        return np.array([np.iinfo(np.uint64).max], dtype='u8')

Contributor:

If you need to handle datetime w/tz directly (IOW, we basically ignore it), then I would:

if getattr(val, 'tzinfo', None) is not None:
    val = val.tz_localize(None)

I suppose an option to ignore tz is fine for infer_dtype_from_scalar, but if you add it I would rename, document and test.

Member Author:

Yes, I can certainly do that check here as well.

It is maybe better to keep the custom logic here, as the keyword added to infer_dtype_from_scalar would not be used anywhere else.

Contributor:

Yeah, I think it is better locally.

    if isinstance(val, string_and_binary_types + (text_type,)):
        vals = np.array([val], dtype=object)
    else:
        vals = np.array([val])

    if vals.dtype == np.object_:
        from pandas import Timestamp, Timedelta, Period, Interval
        if isinstance(val, (Timestamp, Timedelta)):
            vals = np.array([val.value])
        elif isinstance(val, (Period, Interval)):
            pass
        else:
            from pandas import Index
            vals = Index(vals).values
    dtype, val = infer_dtype_from_scalar(val, use_datetimetz=False)
    vals = np.array([val], dtype=dtype)
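The idea behind the simplified path can be sketched with public API: instead of special-casing each scalar type, represent the scalar as a 1-element array of an inferred numpy dtype and hash that array. The real patch uses the internal infer_dtype_from_scalar; scalar_to_array below is a hypothetical stand-in covering a few types only.

```python
import numpy as np
import pandas as pd

def scalar_to_array(val):
    # hypothetical helper approximating dtype inference for a scalar
    if isinstance(val, pd.Timestamp):
        # a tz-aware Timestamp converts to its naive UTC datetime64 value
        return np.array([val.to_datetime64()], dtype="M8[ns]")
    if isinstance(val, pd.Timedelta):
        return np.array([val.to_timedelta64()], dtype="m8[ns]")
    if isinstance(val, (str, bytes)):
        return np.array([val], dtype=object)
    return np.array([val])

# a tz-aware timestamp hashes the same as its naive UTC equivalent
aware = pd.Timestamp("2012-01-01", tz="UTC")
naive = pd.Timestamp("2012-01-01")
h1 = pd.util.hash_array(scalar_to_array(aware))
h2 = pd.util.hash_array(scalar_to_array(naive))
print(h1[0] == h2[0])  # True: same UTC instant, same hash
```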

    return hash_array(vals, hash_key=hash_key, encoding=encoding,
                      categorize=False)
4 changes: 2 additions & 2 deletions pandas/tests/util/test_hashing.py
@@ -81,13 +81,13 @@ def test_hash_tuples(self):

    def test_hash_tuple(self):
        # test equivalence between hash_tuples and hash_tuple
        for tup in [(1, 'one'), (1, np.nan)]:
        for tup in [(1, 'one'), (1, np.nan), (1.0, pd.NaT, 'A')]:
            result = hash_tuple(tup)
            expected = hash_tuples([tup])[0]
            assert result == expected

    def test_hash_scalar(self):
        for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"),
        for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"),
                    pd.Timestamp("2012-01-01", tz='Europe/Brussels'),
                    pd.Period('2012-01-01', freq='D'), pd.Timedelta('1 days'),
                    pd.Interval(0, 1), np.nan, pd.NaT, None]: