
Support Point-in-time Data Operation (#343)
* add period ops class

* black format

* add pit data read

* fix bug in period ops

* update ops runnable

* update PIT test example

* black format

* update PIT test

* update tets_PIT

* update code format

* add check_feature_exist

* black format

* optimize the PIT Algorithm

* fix bug

* update example

* update test_PIT name

* add pit collector

* black format

* fix bugs

* fix try

* fix bug & add dump_pit.py

* Successfully run and understand PIT

* Add some docs and remove a bug

* mv crypto collector

* black format

* Run succesfully after merging master

* Pass test and fix code

* remove useless PIT code

* fix PYlint

* Rename

Co-authored-by: Young <afe.young@gmail.com>
bxdd and you-n-g authored Mar 10, 2022
1 parent 3a911bc commit faa99f3
Showing 19 changed files with 1,459 additions and 141 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -11,6 +11,8 @@
Recent released features
| Feature | Status |
| -- | ------ |
| Point-in-Time database | :hammer: [Released](https://github.com/microsoft/qlib/pull/343) on Mar 10, 2022 |
| Arctic Provider Backend & Orderbook data example | :hammer: [Released](https://github.com/microsoft/qlib/pull/744) on Jan 17, 2022 |
| Meta-Learning-based framework & DDG-DA | :chart_with_upwards_trend: :hammer: [Released](https://github.com/microsoft/qlib/pull/743) on Jan 10, 2022 |
| Planning-based portfolio optimization | :hammer: [Released](https://github.com/microsoft/qlib/pull/754) on Dec 28, 2021 |
@@ -95,9 +97,8 @@ For more details, please refer to our paper ["Qlib: An AI-oriented Quantitative
# Plans
New features under development (ordered by estimated release time).
Your feedback about these features is very important.
| Feature | Status |
| -- | ------ |
| Point-in-Time database | Under review: https://github.com/microsoft/qlib/pull/343 |
<!-- | Feature | Status | -->
<!-- | -- | ------ | -->

# Framework of Qlib

133 changes: 133 additions & 0 deletions docs/advanced/PIT.rst
@@ -0,0 +1,133 @@
.. _pit:

============================
(P)oint-(I)n-(T)ime Database
============================
.. currentmodule:: qlib


Introduction
------------
Point-in-time data is a very important consideration when performing any sort of historical market analysis.

For example, let’s say we are backtesting a trading strategy and we are using the past five years of historical data as our input.
Our model is assumed to trade once a day, at the market close, and we’ll say we are calculating the trading signal for 1 January 2020 in our backtest. At that point, we should only have data for 1 January 2020, 31 December 2019, 30 December 2019 etc.

In financial data (especially financial reports), the same piece of data may be amended multiple times over time. If we only use the latest version in a historical backtest, data leakage will occur.
The Point-in-Time database is designed to solve this problem by making sure users get the right version of the data at any historical timestamp. It keeps the behavior of online trading and historical backtesting consistent.
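For example, a company's Q3 2019 figure may be announced in October 2019 and amended the following March; a backtest running on a date in between must see the October figure, not the amendment. The snippet below is a minimal sketch of this selection rule in plain Python (it is not part of Qlib's API, and the numbers are illustrative):

.. code-block:: python

    from datetime import date

    # (publish_date, period, value) records for one field of one instrument,
    # sorted by publish date. A period can appear more than once when the
    # statement is amended.
    records = [
        (date(2019, 10, 16), "2019Q3", 0.2558),  # first announcement
        (date(2020, 3, 25), "2019Q3", 0.2601),   # later amendment (illustrative)
    ]

    def pit_value(records, period, query_date):
        """Latest value for `period` published on or before `query_date`."""
        visible = [v for pub, p, v in records if p == period and pub <= query_date]
        return visible[-1] if visible else None

    print(pit_value(records, "2019Q3", date(2019, 11, 1)))  # 0.2558, amendment not visible yet
    print(pit_value(records, "2019Q3", date(2020, 4, 1)))   # 0.2601, amended value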



Data Preparation
----------------

Qlib provides a crawler to help users download financial data, and a converter to dump the data into Qlib format.
Please follow `scripts/data_collector/pit/README.md` to download and convert data.


File-based design for PIT data
------------------------------

Qlib provides a file-based storage for PIT data.

Each feature file contains four columns: `date`, `period`, `value` and `_next`.
Each row corresponds to a statement.

The meaning of each column in a file named like `XXX_a.data`:
- `date`: the statement's date of publication.
- `period`: the period of the statement (e.g. quarterly in most markets).
  - If it is an annual period, it will be an integer corresponding to the year.
  - If it is a quarterly period, it will be an integer like `<year><index of quarter>`. The last two decimal digits represent the index of the quarter; the remaining digits represent the year.
- `value`: the described value.
- `_next`: the byte index of the next occurrence of the same period (i.e. a later amendment); `0xFFFFFFFF` if there is none.

Besides the feature data, an index file `XXX_a.index` is included to speed up querying.

The statements are sorted by `date` in ascending order from the beginning of the file.

.. code-block:: python
# the data format from XXXX.data
array([(20070428, 200701, 0.090219 , 4294967295),
(20070817, 200702, 0.13933 , 4294967295),
(20071023, 200703, 0.24586301, 4294967295),
(20080301, 200704, 0.3479 , 80),
(20080313, 200704, 0.395989 , 4294967295),
(20080422, 200801, 0.100724 , 4294967295),
(20080828, 200802, 0.24996801, 4294967295),
(20081027, 200803, 0.33412001, 4294967295),
(20090325, 200804, 0.39011699, 4294967295),
(20090421, 200901, 0.102675 , 4294967295),
(20090807, 200902, 0.230712 , 4294967295),
(20091024, 200903, 0.30072999, 4294967295),
(20100402, 200904, 0.33546099, 4294967295),
(20100426, 201001, 0.083825 , 4294967295),
(20100812, 201002, 0.200545 , 4294967295),
(20101029, 201003, 0.260986 , 4294967295),
(20110321, 201004, 0.30739301, 4294967295),
(20110423, 201101, 0.097411 , 4294967295),
(20110831, 201102, 0.24825101, 4294967295),
(20111018, 201103, 0.318919 , 4294967295),
(20120323, 201104, 0.4039 , 420),
(20120411, 201104, 0.403925 , 4294967295),
(20120426, 201201, 0.112148 , 4294967295),
(20120810, 201202, 0.26484701, 4294967295),
(20121026, 201203, 0.370487 , 4294967295),
(20130329, 201204, 0.45004699, 4294967295),
(20130418, 201301, 0.099958 , 4294967295),
(20130831, 201302, 0.21044201, 4294967295),
(20131016, 201303, 0.30454299, 4294967295),
(20140325, 201304, 0.394328 , 4294967295),
(20140425, 201401, 0.083217 , 4294967295),
(20140829, 201402, 0.16450299, 4294967295),
(20141030, 201403, 0.23408499, 4294967295),
(20150421, 201404, 0.319612 , 4294967295),
(20150421, 201501, 0.078494 , 4294967295),
(20150828, 201502, 0.137504 , 4294967295),
(20151023, 201503, 0.201709 , 4294967295),
(20160324, 201504, 0.26420501, 4294967295),
(20160421, 201601, 0.073664 , 4294967295),
(20160827, 201602, 0.136576 , 4294967295),
(20161029, 201603, 0.188062 , 4294967295),
(20170415, 201604, 0.244385 , 4294967295),
(20170425, 201701, 0.080614 , 4294967295),
(20170728, 201702, 0.15151 , 4294967295),
(20171026, 201703, 0.25416601, 4294967295),
(20180328, 201704, 0.32954201, 4294967295),
(20180428, 201801, 0.088887 , 4294967295),
(20180802, 201802, 0.170563 , 4294967295),
(20181029, 201803, 0.25522 , 4294967295),
(20190329, 201804, 0.34464401, 4294967295),
(20190425, 201901, 0.094737 , 4294967295),
(20190713, 201902, 0. , 1040),
(20190718, 201902, 0.175322 , 4294967295),
(20191016, 201903, 0.25581899, 4294967295)],
dtype=[('date', '<u4'), ('period', '<u4'), ('value', '<f8'), ('_next', '<u4')])
# - each row contains 20 bytes
# The data format of XXXX.index. It consists of two parts:
# 1) the start year of the data, so the first part of the file will be like
2007
# 2) the remaining index data will look like the array below
# - Each value is the **byte index** of the first record of a period.
# - e.g. the records at byte 60 and byte 80 both correspond to period 200704, so only the byte index
#   of the first occurrence (i.e. 60) is recorded; the following entry (100) belongs to the next period.
array([ 0, 20, 40, 60, 100,
120, 140, 160, 180, 200,
220, 240, 260, 280, 300,
320, 340, 360, 380, 400,
440, 460, 480, 500, 520,
540, 560, 580, 600, 620,
640, 660, 680, 700, 720,
740, 760, 780, 800, 820,
840, 860, 880, 900, 920,
940, 960, 980, 1000, 1020,
1060, 4294967295], dtype=uint32)
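
The binary layouts above can be inspected directly with NumPy. This is a minimal exploration sketch, not Qlib's own reader: the file names are hypothetical (the real files are produced by `dump_pit.py`), and it assumes the leading start year in the `.index` file is stored as a uint32 like the offsets, consistent with the example above.

.. code-block:: python

    import numpy as np

    # Record layout of a PIT .data file, as shown in the example above
    # (date uint32, period uint32, value float64, _next uint32 -> 20 bytes per row).
    DATA_DTYPE = np.dtype([("date", "<u4"), ("period", "<u4"), ("value", "<f8"), ("_next", "<u4")])

    # Hypothetical file names; real PIT files live under the Qlib data directory.
    data = np.fromfile("roewa_q.data", dtype=DATA_DTYPE)
    index = np.fromfile("roewa_q.index", dtype="<u4")

    start_year, offsets = index[0], index[1:]  # first value is the start year, the rest are byte offsets
    print(start_year)                          # e.g. 2007
    print(data[:5])                            # first five statements, sorted by publication date
    print(data[offsets[3] // DATA_DTYPE.itemsize])  # first record of the 4th period (200704 above)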
Known limitations
-----------------

- Currently, the PIT database is designed for quarterly or annual factors, which covers the fundamental data of financial reports in most markets.
  Qlib leverages the file name to identify the type of the data: a file named like `XXX_q.data` contains quarterly data, while a file named like `XXX_a.data` contains annual data.
- The calculation on PIT data is not yet performed in an optimal way; there is great potential to improve the performance of PIT data calculation.
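
Querying PIT data
-----------------

PIT fields are referenced with a `$$` prefix, which maps to the `PFeature` class added in this commit (see `qlib/data/base.py` below). The sketch here is hedged: the field name `$$roewa_q` and the exact expression syntax are assumptions for illustration; `tests/test_PIT.py` in this PR is the authoritative usage example.

.. code-block:: python

    import qlib
    from qlib.data import D

    # The provider_uri must point to a Qlib data directory that already
    # contains dumped PIT files (see the Data Preparation section above).
    qlib.init(provider_uri="~/.qlib/qlib_data/cn_data")

    # "$$roewa_q" is a hypothetical quarterly PIT field; "$$" selects PFeature / LocalPITProvider.
    df = D.features(
        instruments=["sh600519"],
        fields=["$$roewa_q"],
        start_time="2019-01-01",
        end_time="2019-07-19",
        freq="day",
    )
    print(df.head())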
1 change: 1 addition & 0 deletions docs/index.rst
@@ -53,6 +53,7 @@ Document Structure
Online & Offline mode <advanced/server.rst>
Serialization <advanced/serial.rst>
Task Management <advanced/task_management.rst>
Point-In-Time database <advanced/PIT.rst>

.. toctree::
:maxdepth: 3
32 changes: 13 additions & 19 deletions qlib/config.py
@@ -92,6 +92,7 @@ def set_conf_from_C(self, config_c):
"calendar_provider": "LocalCalendarProvider",
"instrument_provider": "LocalInstrumentProvider",
"feature_provider": "LocalFeatureProvider",
"pit_provider": "LocalPITProvider",
"expression_provider": "LocalExpressionProvider",
"dataset_provider": "LocalDatasetProvider",
"provider": "LocalProvider",
@@ -108,7 +109,6 @@ def set_conf_from_C(self, config_c):
"provider_uri": "",
# cache
"expression_cache": None,
"dataset_cache": None,
"calendar_cache": None,
# for simple dataset cache
"local_cache_path": None,
@@ -171,6 +171,18 @@ def set_conf_from_C(self, config_c):
"default_exp_name": "Experiment",
},
},
"pit_record_type": {
"date": "I", # uint32
"period": "I", # uint32
"value": "d", # float64
"index": "I", # uint32
},
"pit_record_nan": {
"date": 0,
"period": 0,
"value": float("NAN"),
"index": 0xFFFFFFFF,
},
# Default config for MongoDB
"mongo": {
"task_url": "mongodb://localhost:27017/",
@@ -184,46 +196,28 @@

MODE_CONF = {
"server": {
# data provider config
"calendar_provider": "LocalCalendarProvider",
"instrument_provider": "LocalInstrumentProvider",
"feature_provider": "LocalFeatureProvider",
"expression_provider": "LocalExpressionProvider",
"dataset_provider": "LocalDatasetProvider",
"provider": "LocalProvider",
# config it in qlib.init()
"provider_uri": "",
# redis
"redis_host": "127.0.0.1",
"redis_port": 6379,
"redis_task_db": 1,
"kernels": NUM_USABLE_CPU,
# cache
"expression_cache": DISK_EXPRESSION_CACHE,
"dataset_cache": DISK_DATASET_CACHE,
"local_cache_path": Path("~/.cache/qlib_simple_cache").expanduser().resolve(),
"mount_path": None,
},
"client": {
# data provider config
"calendar_provider": "LocalCalendarProvider",
"instrument_provider": "LocalInstrumentProvider",
"feature_provider": "LocalFeatureProvider",
"expression_provider": "LocalExpressionProvider",
"dataset_provider": "LocalDatasetProvider",
"provider": "LocalProvider",
# config it in user's own code
"provider_uri": "~/.qlib/qlib_data/cn_data",
# cache
# Using parameter 'remote' to announce the client is using server_cache, and the writing access will be disabled.
# Disable cache by default. Avoid introduce advanced features for beginners
"expression_cache": None,
"dataset_cache": None,
# SimpleDatasetCache directory
"local_cache_path": Path("~/.cache/qlib_simple_cache").expanduser().resolve(),
"calendar_cache": None,
# client config
"kernels": NUM_USABLE_CPU,
"mount_path": None,
"auto_mount": False, # The nfs is already mounted on our server[auto_mount: False].
# The nfs should be auto-mounted by qlib on other
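As a side note, the `pit_record_type` format characters added above follow Python's `struct` conventions ("I" = uint32, "d" = float64) and line up with the 20-byte record layout documented in `docs/advanced/PIT.rst`. A small standalone sketch (not Qlib code) to illustrate:

```python
import struct

# "<" = little-endian with no padding, so one record is 4 + 4 + 8 + 4 = 20 bytes.
RECORD_FMT = "<IIdI"  # date, period, value, _next
print(struct.calcsize(RECORD_FMT))  # 20

# Pack one statement and read it back; 0xFFFFFFFF is the "no next record" marker
# (see pit_record_nan["index"] above).
raw = struct.pack(RECORD_FMT, 20070428, 200701, 0.090219, 0xFFFFFFFF)
print(struct.unpack(RECORD_FMT, raw))  # (20070428, 200701, 0.090219, 4294967295)
```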
1 change: 1 addition & 0 deletions qlib/data/__init__.py
@@ -15,6 +15,7 @@
LocalCalendarProvider,
LocalInstrumentProvider,
LocalFeatureProvider,
LocalPITProvider,
LocalExpressionProvider,
LocalDatasetProvider,
ClientCalendarProvider,
59 changes: 47 additions & 12 deletions qlib/data/base.py
@@ -6,12 +6,20 @@
from __future__ import print_function

import abc

import pandas as pd
from ..log import get_module_logger


class Expression(abc.ABC):
"""Expression base class"""
"""
Expression base class
Expression is designed to handle the calculation of data with the format below:
data with two dimensions for each instrument,
- feature
- time: it could be an observation time or a period time.
- period time is designed for the Point-in-Time database. For example, the period time may be 2014Q4, and its value can be observed multiple times (different values may be observed at different times due to amendments).
"""

def __str__(self):
return type(self).__name__
@@ -124,8 +132,18 @@ def __ror__(self, other):

return Or(other, self)

def load(self, instrument, start_index, end_index, freq):
def load(self, instrument, start_index, end_index, *args):
"""load feature
This function is responsible for loading feature/expression based on the expression engine.
The concrete implementation is separated into two parts:
1) caching data, handling errors.
- This part is shared by all expressions and implemented in Expression.
2) processing and calculating data based on the specific expression.
- This part differs between expressions and is implemented in each expression.
The expression engine is shared by different data.
Different data will provide different extra information via `args`.
Parameters
----------
@@ -135,8 +153,15 @@ def load(self, instrument, start_index, end_index, freq):
feature start index [in calendar].
end_index : str
feature end index [in calendar].
freq : str
feature frequency.
*args may contain the following information:
1) if it is used in the basic expression engine, it contains the following arguments
freq : str
feature frequency.
2) if it is used with PIT data, it contains the following arguments
cur_pit:
it is designed for the point-in-time data.
Returns
----------
@@ -146,26 +171,26 @@
from .cache import H # pylint: disable=C0415

# cache
args = str(self), instrument, start_index, end_index, freq
if args in H["f"]:
return H["f"][args]
cache_key = str(self), instrument, start_index, end_index, *args
if cache_key in H["f"]:
return H["f"][cache_key]
if start_index is not None and end_index is not None and start_index > end_index:
raise ValueError("Invalid index range: {} {}".format(start_index, end_index))
try:
series = self._load_internal(instrument, start_index, end_index, freq)
series = self._load_internal(instrument, start_index, end_index, *args)
except Exception as e:
get_module_logger("data").debug(
f"Loading data error: instrument={instrument}, expression={str(self)}, "
f"start_index={start_index}, end_index={end_index}, freq={freq}. "
f"start_index={start_index}, end_index={end_index}, args={args}. "
f"error info: {str(e)}"
)
raise
series.name = str(self)
H["f"][args] = series
H["f"][cache_key] = series
return series

@abc.abstractmethod
def _load_internal(self, instrument, start_index, end_index, freq):
def _load_internal(self, instrument, start_index, end_index, *args) -> pd.Series:
raise NotImplementedError("This function must be implemented in your newly defined feature")

@abc.abstractmethod
@@ -225,6 +250,16 @@ def get_extended_window_size(self):
return 0, 0


class PFeature(Feature):
def __str__(self):
return "$$" + self._name

def _load_internal(self, instrument, start_index, end_index, cur_time):
from .data import PITD # pylint: disable=C0415

return PITD.period_feature(instrument, str(self), start_index, end_index, cur_time)


class ExpressionOps(Expression):
"""Operator Expression