Improve runtimes by 'pushing up' common CASE statements into precomputed values #2580
Open
Description
Currently, in Splink, comparison functions (e.g., cosine_sim) are evaluated multiple times within CASE statements during the predict() process. For example:
CASE
WHEN cosine_sim(l, r) > 0.9 THEN 1
WHEN cosine_sim(l, r) > 0.8 THEN 2
WHEN cosine_sim(l, r) > 0.7 THEN 3
...
END
This results in cosine_sim(l, r) being recomputed for each threshold that is checked, which hurts performance.
Proposed Enhancement:
Where a function is being computed repeatedly, automatically refactor the logic to precompute the comparison value once and reuse it in the CASE statement. For example:
WITH precomputed_cosine AS (
SELECT cosine_sim(l, r) AS precompute_cosine_value, ...
)
SELECT
CASE
WHEN precompute_cosine_value > 0.9 THEN 1
WHEN precompute_cosine_value > 0.8 THEN 2
WHEN precompute_cosine_value > 0.7 THEN 3
...
END
This would make it easier to use more granular thresholds without additional overhead, and would let users more easily approximate match weights as a linear function rather than as buckets (by simply using a large number of thresholds).
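As a rough illustration of the effect described above, here is a minimal sketch that counts how often a comparison function is called under each query shape. It does not use Splink or its backends: it assumes SQLite via Python's standard sqlite3 module, and `toy_sim`, the `pairs` table, and the `scored` table are hypothetical names introduced only for this example. A temporary table stands in for the CTE so that the single evaluation per row does not depend on how a given engine chooses to inline CTEs.

```python
import sqlite3
from difflib import SequenceMatcher

calls = {"n": 0}  # counts invocations of the comparison function


def toy_sim(left, right):
    """Hypothetical stand-in for an expensive comparison such as cosine_sim."""
    calls["n"] += 1
    return SequenceMatcher(None, left, right).ratio()


con = sqlite3.connect(":memory:")
con.create_function("toy_sim", 2, toy_sim)
con.execute("CREATE TABLE pairs (id INTEGER PRIMARY KEY, l TEXT, r TEXT)")
con.executemany(
    "INSERT INTO pairs (l, r) VALUES (?, ?)",
    [("alice", "alice"), ("alice", "alicia"), ("alice", "bob")],
)

# Current shape: the function appears in every WHEN branch.
REPEATED = """
SELECT id,
       CASE
           WHEN toy_sim(l, r) > 0.9 THEN 1
           WHEN toy_sim(l, r) > 0.8 THEN 2
           WHEN toy_sim(l, r) > 0.7 THEN 3
           ELSE 0
       END AS level
FROM pairs
ORDER BY id
"""

# Proposed shape: the CASE only reads a value computed once per row.
PRECOMPUTED = """
SELECT id,
       CASE
           WHEN sim_value > 0.9 THEN 1
           WHEN sim_value > 0.8 THEN 2
           WHEN sim_value > 0.7 THEN 3
           ELSE 0
       END AS level
FROM scored
ORDER BY id
"""

calls["n"] = 0
repeated_levels = con.execute(REPEATED).fetchall()
repeated_calls = calls["n"]

calls["n"] = 0
con.execute(
    "CREATE TEMP TABLE scored AS "
    "SELECT id, toy_sim(l, r) AS sim_value FROM pairs"
)
precomputed_levels = con.execute(PRECOMPUTED).fetchall()
precomputed_calls = calls["n"]

print("same levels:", repeated_levels == precomputed_levels)
print("calls (repeated):", repeated_calls)
print("calls (precomputed):", precomputed_calls)
```

In this sketch the precomputed form calls the function once per row, while the repeated form calls it once for every WHEN branch reached before a match. How much a real backend benefits would depend on whether its optimizer already eliminates the common subexpression, so this should be read as a demonstration of the idea rather than a benchmark.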