Closed
Description
Reproducible example:
import pandas as pd
from darts import TimeSeries
data = {
    "val1": {
        pd.Timestamp("2020-09-08 00:00:00+0200", tz="CET", freq="H"): 0.0,
        pd.Timestamp("2020-09-08 01:00:00+0200", tz="CET", freq="H"): 0.0,
        pd.Timestamp("2020-09-08 02:00:00+0200", tz="CET", freq="H"): 0.0,
        pd.Timestamp("2020-09-08 03:00:00+0200", tz="CET", freq="H"): 0.0,
        pd.Timestamp("2020-09-08 04:00:00+0200", tz="CET", freq="H"): 0.0,
    }
}
df = pd.DataFrame.from_dict(data)
X = TimeSeries.from_dataframe(df, value_cols=None)
df_utc = df.tz_convert("UTC").tz_localize(None)
X_utc = TimeSeries.from_dataframe(df_utc, value_cols=None)
It seems the types of the time index of X and X_utc are different. For example, X_utc.plot() works, but X.plot() complains:
TypeError: Plotting requires coordinates to be numeric, boolean, or dates of type numpy.datetime64, datetime.datetime, cftime.datetime or pandas.Interval. Received data of type object instead.
Activity
tomasvanpottelbergh commented on Jul 7, 2022
This could probably be solved by setting a DateFormatter, as described here. What format should we then use? Currently it seems to be a default by matplotlib that depends on the values and length of the series.
strakehyr commented on Jul 20, 2022
This error is also raised when plotting if the series has a pandas DatetimeIndex.
pni-mft commented on Apr 13, 2023
#1343 #1052
This bug is still not resolved if, for example, you use an index crossing a DST time change:
data = {
    "val1": {
        pd.Timestamp("2022-10-30 00:00:00+0200", tz="CET", freq="H"): 0.0,
        pd.Timestamp("2022-10-30 01:00:00+0200", tz="CET", freq="H"): 0.0,
        pd.Timestamp("2022-10-30 02:00:00+0200", tz="CET", freq="H"): 0.0,
        pd.Timestamp("2022-10-30 02:00:00+0100", tz="CET", freq="H"): 0.0,
        pd.Timestamp("2022-10-30 03:00:00+0100", tz="CET", freq="H"): 0.0,
    }
}
You then receive this error:
The provided DatetimeIndex was associated with a timezone, which is currently not supported by xarray. To avoid unexpected behaviour, the tz information was removed. Consider calling ts.time_index.tz_localize(CET) when exporting the results. To plot the series with the right time steps, consider setting the matplotlib.pyplot rcParams['timezone'] parameter to automatically convert the time axis back to the original timezone.
ValueError: The time index of the provided DataArray is missing the freq attribute, and the frequency could not be directly inferred. This probably comes from inconsistent date frequencies with missing dates. If you know the actual frequency, try setting fill_missing_dates=True, freq=actual_frequency. If not, try setting fill_missing_dates=True, freq=None to see if a frequency can be inferred.
Traceback (most recent call last):
  File "/home/azureuser/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3433, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-05686c705678>", line 1, in <module>
    X = TimeSeries.from_dataframe(df, value_cols=None)
  File "/home/azureuser/.local/lib/python3.8/site-packages/darts/timeseries.py", line 714, in from_dataframe
    return cls.from_xarray(
  File "/home/azureuser/.local/lib/python3.8/site-packages/darts/timeseries.py", line 444, in from_xarray
    return cls(xa_)
  File "/home/azureuser/.local/lib/python3.8/site-packages/darts/timeseries.py", line 174, in __init__
    raise_if(
  File "/home/azureuser/.local/lib/python3.8/site-packages/darts/logging.py", line 104, in raise_if
    raise_if_not(not condition, message, logger)
  File "/home/azureuser/.local/lib/python3.8/site-packages/darts/logging.py", line 78, in raise_if_not
    raise ValueError(message)
ValueError: The time index of the provided DataArray is missing the freq attribute, and the frequency could not be directly inferred. This probably comes from inconsistent date frequencies with missing dates. If you know the actual frequency, try setting fill_missing_dates=True, freq=actual_frequency. If not, try setting fill_missing_dates=True, freq=None to see if a frequency can be inferred.
Following the error message's advice I set:
X = TimeSeries.from_dataframe(df, value_cols=None, freq="H")
and then I get the following error:
The provided DatetimeIndex was associated with a timezone, which is currently not supported by xarray. To avoid unexpected behaviour, the tz information was removed. Consider calling ts.time_index.tz_localize(CET) when exporting the results. To plot the series with the right time steps, consider setting the matplotlib.pyplot rcParams['timezone'] parameter to automatically convert the time axis back to the original timezone.
Traceback (most recent call last):
  File "/home/azureuser/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3433, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-760a72fd8006>", line 1, in <module>
    X = TimeSeries.from_dataframe(df, value_cols=None, freq="H")
  File "/home/azureuser/.local/lib/python3.8/site-packages/darts/timeseries.py", line 714, in from_dataframe
    return cls.from_xarray(
  File "/home/azureuser/.local/lib/python3.8/site-packages/darts/timeseries.py", line 379, in from_xarray
    xa_ = cls._restore_xarray_from_frequency(xa, freq=freq)
  File "/home/azureuser/.local/lib/python3.8/site-packages/darts/timeseries.py", line 4453, in _restore_xarray_from_frequency
    resampled_time_index = resampled_time_index.asfreq(freq)
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/series.py", line 5403, in asfreq
    return super().asfreq(
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 7697, in asfreq
    return asfreq(
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/resample.py", line 2096, in asfreq
    new_obj = obj.reindex(dti, method=method, fill_value=fill_value)
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/series.py", line 4672, in reindex
    return super().reindex(**kwargs)
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 4966, in reindex
    return self._reindex_axes(
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 4986, in _reindex_axes
    obj = obj._reindex_with_indexers(
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5032, in _reindex_with_indexers
    new_data = new_data.reindex_indexer(
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 679, in reindex_indexer
    self.axes[axis]._validate_can_reindex(indexer)
  File "/home/azureuser/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4107, in _validate_can_reindex
    raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels
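The final ValueError can likely be reproduced with pandas alone. The following is a minimal sketch, assuming that removing the timezone leaves the local wall-clock labels in place; the variable names are illustrative and nothing here is taken from the darts source:

```python
import pandas as pd

# Five consecutive hours across the end of DST on 2022-10-30, built in UTC
# and converted to CET so that no ambiguous local time has to be parsed.
idx = pd.date_range(
    "2022-10-29 22:00", periods=5, freq=pd.offsets.Hour(), tz="UTC"
).tz_convert("CET")

# Dropping the tz keeps the local wall-clock labels, so 02:00 appears twice.
local_naive = idx.tz_localize(None)
series = pd.Series([0.0] * 5, index=local_naive)

try:
    series.asfreq(pd.offsets.Hour())
    raised = False
except ValueError:
    # pandas refuses to reindex onto a regular grid when labels are duplicated.
    raised = True
```

The tz-aware index is unique; only the tz-naive copy contains the repeated label.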
madtoinou commented on Apr 13, 2023
Hi,
Thank you for sharing a code snippet to reproduce the bug and detailing the error messages.
I am afraid that this kind of corner case should be handled before creating the TimeSeries. Since pd.Timestamp("2022-10-30 02:00:00+0200", tz="CET") and pd.Timestamp("2022-10-30 02:00:00+0100", tz="CET") correspond to the exact same timestamp in local time (correct me if I am wrong), they induce conflicting indexes in the resulting TimeSeries. Which one of the two values would you retain for pd.Timestamp("2022-10-30 02:00:00")? If there is any convention for treating this kind of situation (or if you have an idea for a solution), I invite you to open a new issue so that we can track the bug and implement/discuss a fix.
pni-mft commented on Apr 13, 2023
Hi,
Thanks for the quick reply!
Yes, this does indeed result in a conflicting index if you are using a tz-naive index. If you change them to UTC the issue is resolved, but there are lots of situations where the hour of day is relevant when forecasting, e.g. consumption, traffic, or anything related to human activities, since we are not programmed to follow UTC time (unfortunately).
So in these cases in local time it is necessary to have a duplicate index. Some places call them hour 2a and 2b (imo. a terrible convention). The challenge is that if you are for example comparing human behaviour, you need the hours to match up in local time. So we need to compare CET=8am with CET=8am from the day before in order to forecast realistic human related activities.
Another option is to just drop one of the hours in the fall, during the change back from DST, and to add an hour in the other direction in the spring.
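A rough sketch of that drop-and-fill option in plain pandas is shown below. The helper names to_regular_local and hourly_cet are hypothetical, and keeping the first of the two repeated hours is an arbitrary choice made only for illustration:

```python
import pandas as pd

HOUR = pd.offsets.Hour()

def to_regular_local(series):
    """Hypothetical helper: tz-naive local series on a regular hourly grid."""
    local = series.copy()
    # Keep the local wall-clock labels and discard the tz.
    local.index = local.index.tz_localize(None)
    # Fall: the repeated hour shows up twice, keep only its first occurrence.
    local = local[~local.index.duplicated(keep="first")]
    # Spring: the skipped hour is inserted as a NaN row.
    return local.asfreq(HOUR)

def hourly_cet(start_utc, periods):
    """Hypothetical helper: hourly CET series with values 0, 1, 2, ..."""
    idx = pd.date_range(start_utc, periods=periods, freq=HOUR, tz="UTC")
    return pd.Series(range(periods), index=idx.tz_convert("CET"), dtype="float64")

fall = to_regular_local(hourly_cet("2022-10-29 22:00", 5))    # crosses the end of DST
spring = to_regular_local(hourly_cet("2022-03-26 23:00", 4))  # crosses the start of DST
```

In the fall case one observation is discarded, and in the spring case a missing value is introduced that still has to be imputed, so this approach trades accuracy for a regular index.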
I would actually prefer to keep both of these in a timezone-aware format rather than a timezone-naive format and, for example, be forced to forecast 25 hourly values on this day and 23 hourly prices in the spring. From a purity perspective this would be the most accurate solution. However, I do not know how much refactoring it would require, as forecast time steps currently only support an integer; to do this you would need to support something like an interval, such as 1 day where the day is timezone aware.
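To make the 25 and 23 hour days concrete, here is a small sketch that converts a one-local-day horizon into a number of hourly steps. The function hours_in_local_day is a hypothetical name, and the sketch assumes that local midnight is never ambiguous in the chosen timezone, which holds for CET:

```python
import pandas as pd

def hours_in_local_day(day, tz="CET"):
    """Hypothetical helper: number of hourly steps in one local calendar day."""
    start = pd.Timestamp(day).tz_localize(tz)
    # Advance the wall clock by one calendar day, then localize again, so the
    # end point is local midnight of the next day regardless of DST.
    end = (pd.Timestamp(day) + pd.Timedelta(days=1)).tz_localize(tz)
    return int((end - start) / pd.Timedelta(hours=1))

fall_steps = hours_in_local_day("2022-10-30")    # day on which DST ends
spring_steps = hours_in_local_day("2022-03-27")  # day on which DST starts
normal_steps = hours_in_local_day("2022-06-01")
```

A forecasting horizon expressed this way would yield 25, 23, and 24 steps respectively for the three days above.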
However, as a quick fix for this specific problem, it does seem that xarray may have some way of handling this specific conversion to/from a dataframe:
Xarray converts the index to nanoseconds since the Unix epoch (midnight UTC on January 1, 1970). It also retains the original timezone-aware DatetimeIndex in the dataset. And when converting back from xarray it also provides a dataframe with the original tz-aware datetime index (however it adds a name for some reason).
Here is a snippet demonstrating xarray's built-in functionality:
We of course cannot test this on the TimeSeries shown above, since the conversion does not go through. However, going back to the original example shows that the hidden step of making the index tz-naive raises issues as well:
Where of course the final pandas equality test fails due to the missing tz awareness. Perhaps there is a way of constructing/reconstructing the dataframe while keeping the tz aware datetime index somewhere in a similar way to which xarray handles this?
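One conceivable way to do this, sketched here in plain pandas rather than through any xarray or darts API, is to store a tz-naive UTC index together with the original timezone and to rebuild the tz-aware index on export. The functions strip_tz and restore_tz are hypothetical names used only for illustration:

```python
import pandas as pd

def strip_tz(index):
    """Hypothetical helper: return a tz-naive UTC index plus the original tz."""
    return index.tz_convert("UTC").tz_localize(None), index.tz

def restore_tz(naive_utc_index, tz):
    """Hypothetical helper: rebuild the tz-aware index from the stored pieces."""
    return naive_utc_index.tz_localize("UTC").tz_convert(tz)

# Hourly CET index across the end of DST on 2022-10-30.
original = pd.date_range(
    "2022-10-29 22:00", periods=5, freq=pd.offsets.Hour(), tz="UTC"
).tz_convert("CET")

naive_utc, tz = strip_tz(original)
restored = restore_tz(naive_utc, tz)
```

Because the naive index is kept in UTC rather than in local time, it stays unique and evenly spaced across the DST change, and the round trip gives back the original tz-aware index.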
In the future, it would also be an added benefit to do indexing through this tz-aware index. Then we could imagine scenarios with a forecasting horizon that depends on a timedelta with a frequency rather than a simple integer number of steps. It is of course a corner case, but seeing as DST changes also happen on different days depending on your location (Europe vs North America), there are at least 4 days of the year, from a global perspective, that could potentially need this type of more flexible forecasting.
I think I will create an issue referring to this bug, specifically for handling the DST time change, as well as perhaps a Feature Request about the challenges of forecasting these days. In general, I think this would be a great feature: so many other time-series-related headaches are already handled by darts, so why not this one too?