Improve runtimes by 'pushing up' common CASE statements into precomputed values #2580

Open
@RobinL

Description

Currently, in Splink, comparison functions (e.g., cosine_sim) are evaluated multiple times within CASE statements during the predict() process. For example:

CASE
    WHEN cosine_sim(l, r) > 0.9 THEN 1
    WHEN cosine_sim(l, r) > 0.8 THEN 2
    WHEN cosine_sim(l, r) > 0.7 THEN 3
    ...
END

This results in repeated computation of cosine_sim(l, r) for each threshold, which impacts performance.

Proposed Enhancement:

Where functions are being repeatedly computed, automatically refactor the logic to precompute the comparison value once and reuse it in the CASE statement. For example:

WITH precomputed_cosine AS (
    SELECT cosine_sim(l, r) AS precompute_cosine_value, ...
)
SELECT
    CASE
        WHEN precompute_cosine_value > 0.9 THEN 1
        WHEN precompute_cosine_value > 0.8 THEN 2
        WHEN precompute_cosine_value > 0.7 THEN 3
        ...
    END

This would make it easier to use more granular thresholds without additional overhead, and would let users more easily approximate match weights as a linear function rather than bucketed values (by simply using a large number of thresholds).
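
A minimal sketch of the rewrite described above, to show that the two query shapes return the same levels. It is not Splink's implementation: `build_case`, `toy_sim`, the `pairs` table and the `sim_value` column are all hypothetical names made up for illustration, and SQLite is used only because it ships with Python. The sketch checks equivalence of results, not speed; whether a given backend really evaluates the function once per row in the second form depends on its query optimizer.

```python
import sqlite3


def build_case(expr: str, thresholds: list[float]) -> str:
    """Build a CASE expression that buckets `expr`, highest threshold first."""
    ordered = sorted(thresholds, reverse=True)
    whens = "\n".join(
        f"    WHEN {expr} > {t} THEN {level}"
        for level, t in enumerate(ordered, start=1)
    )
    return f"CASE\n{whens}\nEND"


def toy_sim(a: str, b: str) -> float:
    """Hypothetical stand-in for cosine_sim: Jaccard similarity of character sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


conn = sqlite3.connect(":memory:")
conn.create_function("toy_sim", 2, toy_sim)
conn.execute("CREATE TABLE pairs (l TEXT, r TEXT)")
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [("abc", "abc"), ("abcde", "abcdef"), ("abc", "abcd"), ("abc", "xyz")],
)

thresholds = [0.9, 0.8, 0.7]

# Current shape: the function call is repeated in every WHEN branch.
repeated_sql = (
    f"SELECT {build_case('toy_sim(l, r)', thresholds)} AS level "
    "FROM pairs ORDER BY rowid"
)

# Proposed shape: compute the value once in a CTE, then compare the column.
precomputed_sql = (
    "WITH precomputed AS ("
    "SELECT rowid AS rid, toy_sim(l, r) AS sim_value FROM pairs) "
    f"SELECT {build_case('sim_value', thresholds)} AS level "
    "FROM precomputed ORDER BY rid"
)

repeated_levels = [row[0] for row in conn.execute(repeated_sql)]
precomputed_levels = [row[0] for row in conn.execute(precomputed_sql)]

print(repeated_levels)     # [1, 2, 3, None]
print(precomputed_levels)  # [1, 2, 3, None]
```

Because the CASE expression is generated from a list, passing a longer `thresholds` list (for example 100 evenly spaced values) only adds cheap column comparisons in the precomputed form, which is the granular-threshold use case mentioned above.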
