Randomly changing results from predictions #2530

sanmayshelat · 2024-11-27T21:59:40Z

Hi,

I have a saved model whose parameters (generally) look expected/explainable. I load it with linker = Linker(df, settings="model_name.json", db_api=db_api) and run df_predict = linker.inference.predict(threshold_match_weight=2).as_pandas_dataframe(). However, each time I restart the kernel, df_predict comes out with a slightly different length. This doesn't happen when I re-run linker = Linker(df... in the same session, or even if I create a separate instance like linker2 = Linker(df.... Any suggestions on why this might be happening?

I'm using Splink=4.0.5 and Python=3.12.3.

Answered by ADBond

ADBond · 2024-11-28T10:45:59Z

Maintainer

Hmmm, that is puzzling. Nothing immediately leaps to mind as the likely cause - if you are not retraining the model then predict() should give the same set of results consistently for the same input data.

Any further details you are able to provide (the .json file, or your comparisons, or the script you are running, backend) might point to some clues. Are there any possibly non-deterministic functions in your comparisons (or blocking rules)? And is the input data definitely unchanged between these runs?

One thing that might help narrow it down is to try and investigate the record pairs that are not always appearing - if you save the left+right ids of your df_predict table to file you can compare across runs and then see if there is anything noteworthy about them (and perhaps share any details you are able to here).
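
As a rough illustration of that comparison, here is a minimal sketch using only the Python standard library. It assumes the prediction table exposes the pair identifiers as unique_id_l and unique_id_r (this depends on the unique id column name in your settings), and the helper names save_pair_ids, load_pair_ids and unstable_pairs are hypothetical, not part of Splink.

```python
import csv
from pathlib import Path


def save_pair_ids(pairs, path):
    """Write (left_id, right_id) pairs to a CSV file, one pair per row."""
    with Path(path).open("w", newline="") as f:
        writer = csv.writer(f)
        # Column names are an assumption about the prediction output.
        writer.writerow(["unique_id_l", "unique_id_r"])
        # Sort so the file contents do not depend on row order.
        writer.writerows(sorted((str(l), str(r)) for l, r in pairs))


def load_pair_ids(path):
    """Read the pairs back as a set of (left_id, right_id) string tuples."""
    with Path(path).open(newline="") as f:
        reader = csv.DictReader(f)
        return {(row["unique_id_l"], row["unique_id_r"]) for row in reader}


def unstable_pairs(path_a, path_b):
    """Return the pairs that appear in only one of the two runs."""
    a, b = load_pair_ids(path_a), load_pair_ids(path_b)
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}
```

In each kernel session you could pass something like df_predict[["unique_id_l", "unique_id_r"]].itertuples(index=False, name=None) to save_pair_ids with a different file name per run, then call unstable_pairs on two of the files and look up the returned ids in the input data.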

sanmayshelat · 2024-12-01T19:32:25Z

Author

Thanks for getting back to me so quickly! It appears that the randomness is coming from a data manipulation step in Polars. I was able to get repeatable results using Pandas end to end or using a saved, post-manipulation dataset. I will look into what it is in Polars that's causing the randomness.
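
For anyone hitting something similar, one way to confirm that the data handed to the Linker really differs between kernel restarts is to fingerprint its rows in each session and compare the digests. The sketch below is standard-library only; fingerprint_rows is a hypothetical helper, and it assumes the values in each row have a stable repr().

```python
import hashlib


def fingerprint_rows(rows, ordered=False):
    """Return a SHA-256 hex digest of an iterable of row tuples.

    With ordered=False the rows are sorted first, so the digest only
    changes when the content changes, not when the row order does.
    """
    encoded = [repr(tuple(row)).encode("utf-8") for row in rows]
    if not ordered:
        encoded.sort()
    digest = hashlib.sha256()
    for item in encoded:
        digest.update(item)
        # Separator keeps adjacent rows from running together.
        digest.update(b"\n")
    return digest.hexdigest()
```

With a pandas frame the rows could come from df.itertuples(index=False, name=None), and with a Polars frame from df.iter_rows(). If the unordered digest changes between sessions, the manipulation step is producing different content; if only the ordered digest changes, it is producing the same rows in a different order. Digests are only comparable when the rows are produced the same way in each session.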

RobinL Dec 1, 2024
Maintainer

Thanks for the update, glad you resolved the problem.
