Initial support for Catboost #377

Open
@hcho3

Description

We would like to add support for Catboost models. Users of Treelite should be able to load Catboost models and run prediction.

Overview

Catboost has a custom target encoding method to encode categorical data, and produces special kinds of decision trees called oblivious trees. See the Catboost paper for more details.

In general, a target encoder is a function that takes a categorical input and produces a numeric output. The function is an "encoding" in the sense that the categorical input is encoded as a real number. The advantage of target encoding is that every decision tree can rely exclusively on simple tests of the form [feature] < [threshold].
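To make the idea concrete, here is a minimal sketch of a generic target encoder. It uses plain mean-target encoding (each category is replaced by the average target value seen with it), which is an assumption chosen for illustration and is not Catboost's custom flavor. The names FitMeanTargetEncoder and GoLeft are hypothetical.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: plain mean-target encoding, NOT Catboost's custom variant.
// Maps each category to the mean of the target values observed with it.
std::unordered_map<std::string, double> FitMeanTargetEncoder(
    const std::vector<std::pair<std::string, double>>& samples) {
  std::unordered_map<std::string, double> sum;
  std::unordered_map<std::string, int> count;
  for (const auto& [category, target] : samples) {
    sum[category] += target;
    count[category] += 1;
  }
  std::unordered_map<std::string, double> encoder;
  for (const auto& [category, total] : sum) {
    encoder[category] = total / count[category];
  }
  return encoder;
}

// Once the category is encoded as a number, a tree node only needs
// the numerical test [feature] < [threshold].
bool GoLeft(double encoded_feature, double threshold) {
  return encoded_feature < threshold;
}
```

After encoding, the tree never sees the categorical value itself, only the number it was mapped to.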

The challenge is that Catboost uses a custom flavor of target encoding. The goal, therefore, is to abstract away as much complexity as possible.

Proposed Design

The Treelite model spec

template <typename ThresholdType, typename LeafOutputType>
class ModelImpl : public Model {
 public:
  /*! \brief member trees */
  std::vector<Tree<ThresholdType, LeafOutputType>> trees;
  // ... remaining members omitted ...
};

should be updated to include an optional field to store the target encoding function. The target encoding component should be a lookup table of the form

(categorical_feature_id, categorical_value) -> [ numerical vector ]
(categorical_feature_id, categorical_value) -> [ numerical vector ]
(categorical_feature_id, categorical_value) -> [ numerical vector ]
...

where each possible categorical value is mapped to a vector of length 1 or greater.

Catboost uses CityHash to convert string categories into int64, so the target encoding field must allow both int64 and float32 types for the categorical input.
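One possible shape for such a lookup table is sketched below. Every name in it (CategoricalValue, TargetEncoderTable, Insert, Lookup) is hypothetical and not part of the Treelite API. It assumes the categorical key can be modeled as a std::variant over int64 and float32, and that an ordered map is acceptable for storage.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <variant>
#include <vector>

// All names here are hypothetical; none of them exist in Treelite.
// A categorical value is either a hashed string (int64) or a raw float32 value.
using CategoricalValue = std::variant<std::int64_t, float>;

class TargetEncoderTable {
 public:
  // Register the numerical vector for one (feature id, categorical value) pair.
  void Insert(int feature_id, CategoricalValue value, std::vector<float> encoding) {
    table_[{feature_id, value}] = std::move(encoding);
  }

  // Returns nullptr when the pair was never registered.
  const std::vector<float>* Lookup(int feature_id, CategoricalValue value) const {
    auto it = table_.find({feature_id, value});
    return it == table_.end() ? nullptr : &it->second;
  }

 private:
  // std::variant provides operator<, so the pair can serve as an ordered key.
  std::map<std::pair<int, CategoricalValue>, std::vector<float>> table_;
};
```

A caller would pass std::int64_t{...} for hashed string categories and a float for numeric ones. Because the variant tag is part of the key, an int64 entry and a float entry under the same feature stay distinct.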

Scope

  • Catboost allows users to save models in two formats: FlatBuffer and JSON. For the initial version, we'll support only the JSON format.
  • Initially, we'll convert oblivious trees into regular decision trees. We may add an ObliviousTree class to the Treelite model spec in the future.
  • In addition, we'll support only the simple_ctr configuration, where the target encoding function takes a single categorical feature at a time. We won't support the combination_ctr configuration, where multiple categorical features are fed into the target encoder.

TODOs

  • Add the target encoder to the Treelite model spec
  • Implement the deserializer for the Catboost JSON model. The deserializer will be placed in src/frontend.
  • Update GTIL to support inference with Catboost models.
  • Update the C codegen to support text inputs and target encoding. I expect this step to be challenging, given the complexity of the C codegen.
