Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement correlated random number generation #1069

Merged
merged 40 commits into from
Oct 3, 2022
Merged

Conversation

mtazzari
Copy link
Contributor

@mtazzari mtazzari commented Jul 5, 2022

This Fix #1068 following specifications in #910

TODO:

  • perform quantitative checks on correlated rng
  • update all docstrings
  • check performance
  • write release notes

Check flow:

  • if correlations are defined in meta-data/model_settings.json:
    • if all peril corr groups have all correlation_value=0 --> do NOT compute correlations.
    • if at least one peril corr group has correlation_value>0 --> compute correlations.
    • if runs/losses.../input/correlations.bin is not present ---> raise ERROR
    • if --ignore-correlation is passed to gulpy --> do NOT compute correlations, regardless of correlation definitions.
  • if correlations are NOT defined in meta-data/model_settings.json --> do NOT compute correlations

Release notes feature title

This PR introduces the possibility to generate correlated random samples for items in the same peril correlation group. Specifications of this feature are at #910

@mtazzari mtazzari added enhancement New feature or request feature A main feature, captured on the backlog labels Jul 5, 2022
@mtazzari mtazzari self-assigned this Jul 5, 2022
@mtazzari mtazzari changed the base branch from master to develop July 5, 2022 17:21
@mtazzari mtazzari marked this pull request as draft July 5, 2022 17:22
@mtazzari mtazzari changed the title Implement correlated random number generation (wip) Implement correlated random number generation Jul 5, 2022
@mtazzari mtazzari changed the title (wip) Implement correlated random number generation Implement correlated random number generation Jul 15, 2022
@mtazzari mtazzari marked this pull request as ready for review July 20, 2022 09:42
@mtazzari mtazzari requested a review from sstruzik August 2, 2022 10:59
Copy link
Contributor

@sstruzik sstruzik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beside the few small details this is excellent.

oasislmf/computation/generate/files.py Outdated Show resolved Hide resolved
oasislmf/pytools/gul/manager.py Outdated Show resolved Hide resolved
oasislmf/pytools/gul/manager.py Outdated Show resolved Hide resolved
Returns: (List[str]) the filtered columns
"""
for col in VALID_OASIS_GROUP_COLS:
if col not in list(exposure_df_columns) + VALID_OASIS_GROUP_COLS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is always False as col in VALID_OASIS_GROUP_COLS are always in list(exposure_df_columns) + VALID_OASIS_GROUP_COLS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is merely legacy code that has been moved to another area but happy to chance this:

for col in group_id_cols:

Copy link
Contributor

@sambles sambles Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sstruzik is right here, there is an issue with the logic and will always evaluate to False

its checking for valid group_id columns by looking over the list VALID_OASIS_GROUP_COLS instead of checking the input given to the function from group_id_cols

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group_id_cols.remove(col)

peril_correlation_group = 'peril_correlation_group'
if peril_correlation_group not in group_id_cols and correlations is True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually just do "if correlation", is there a reason to use "if correlation is True"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correlations is a bool however, if we just have the if correlations it will pass if correlations is merely not None. Therefore it is safer to explicitly state if correlations is True. You can run the following code to see the difference:

one = 1

if one:
  print("one")

if one is True:
  print("two")

]


def process_group_id_cols(group_id_cols: List[str], exposure_df_columns: List[str], correlations: bool) -> List[str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one of my dislike example about typing.
I'm not sure what you are checking later (see comment about ln 74) but exposure_df_columns doesn't need to be a list. As you cast it as a list in your code so for example here you could pass df.columns directly and there no need to type.

For correlations, naming would be more explicit. has_correlation_groups or is_correlated that the current choice with bool
plus it wouldn't need to be a bool if you did if correlations instead of if correlations is True.
In the end all the typing limit the function potential use and in my opinion is more confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now removed typing and changed the correlations with has_correlation_groups

return group_id_cols


def hash_with_correlations(gul_inputs_df: pd.DataFrame, hashing_columns: List[str]) -> pd.DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing specific to correlation in this function, it is just hashing based on some columns. and it is identical as the code we have at the end of get_gul_input_items. So the name is misleading.
The hashing itself should be done twice in two part of the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been changed to hash_group_id for the function name

)
correlation_input_items = get_correlation_input_items(
model_settings_path=self.model_settings_json,
gul_inputs_df=gul_inputs_df
)

correlations: bool = establish_correlations(model_settings_path=self.model_settings_json)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should have to read twice model_settings_path. get_correlation_input_items should return all the information that you need. So I would remove establish_correlations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model settings is only read once now

break


# it is assumed that correlations are False for now, correlations for group ID hashing are assessed later on in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid hashing twice, you could do the merge with the correlation data in this function instead of doing it after calling it. That should remove your chicken and egg problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hashing is only performed once now

@@ -409,6 +410,27 @@ def get_model_settings(model_settings_fp, key=None, validate=True):
return model_settings if not key else model_settings.get(key)


def establish_correlations(model_settings_path: str) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to remove see comment in the calling function

@johcarter
Copy link
Contributor

piwind ci results have been updated in https://github.com/OasisLMF/OasisPiWind/tree/update/feature-correlated_rng. This branch should pass against those updated expected results.

A fix is still needed to proceed with model run when no correlation settings are present (fix error on missing correlations.bin). @maxwellflitton is on the case.

@johcarter
Copy link
Contributor

All working as expected

@sambles
Copy link
Contributor

sambles commented Oct 3, 2022

@sambles sambles merged commit 3d7f8a5 into develop Oct 3, 2022
@sambles sambles deleted the feature/correlated_rng branch October 4, 2022 09:57
@awsbuild awsbuild added this to the 1.27.0rc1 milestone Oct 6, 2022
sambles added a commit that referenced this pull request Jan 12, 2023
* [gulpy] first implementation

* [gulpy] implementing correlated rng

* [wip] implementing correlated rng

* [wip]

* [gulpy] working implementation of the correlated random values

* minor cleanup

* [gulpy] Update docstrings for random  module functions

* [gulpy] remove unused generate_correlated_hash

* [gulpy] introduce --ignore-correlation flag

* set hashed_group_id to True by default, cleanup

* adding haahing patch

* adding haahing patch

* [gulpy] minor cleanup files.py parameter on same line

* [gulpy] run correlation only if rho>0

* updating hashing

* [gulpy] improve flow depending on corr definitions

* Disable GroupID hashing for acceptance tests (#1094)

* Update expected acceptance tests

* Revert "Update expected acceptance tests"

This reverts commit ad0907f.

* Default "hashed_group_id" to false in exposure run

* Move hashed_group_id=F default from "RunExposure" to "RunFmTest"

* Fix/pip compile (#1097)

* Only install pip-tools before pip-compile

* Try pinning flake8

* Revert "Try pinning flake8"

This reverts commit d845d5b.

* Try pinning virtualenv

* add --upgrade to pip install pip-tools

* Fix test_get_dataframe__from_csv_file__set_col_defaults_option_and_use_defaults_ and run with falsifying example

* Remove falsifying example

Co-authored-by: Marco Tazzari <6020226+mtazzari@users.noreply.github.com>

* Update group_id_cols default in get_gul_input_items

* Hashing investigation (#1096)

* adding haahing patch

* adding haahing patch

* updating hashing

* Update oasislmf/preparation/gul_inputs.py

Co-authored-by: Marco Tazzari <6020226+mtazzari@users.noreply.github.com>

* [gul_inputs] bugfix don't modify inplace

* Update test_summaries.py to not rely on "loc_id" as default for group_id_cols

* Always create a correlations.bin, if missing model_settings file is blank (#1101)

* adding peril_correlation_group for valid_oasis_group_cols

* appending peril_correlation_group to columns if correlations group is present

* adding peril_correlation_group column to hashing of group IDs if correlations groups are used and hashing group IDs is done

* updating hashing group ID

* updating to accomodate non-correlations

* fixxing run

* fixing empty correlations df write header if empty correlations

* Remove empty file

* Add missing defaults to get_gul_input_items (backwards compatible)

* Fix Group_id valid column check

* Force retest

Co-authored-by: maxwellflitton <maxwellflitton@gmail.com>
Co-authored-by: sambles <sambles@users.noreply.github.com>
Co-authored-by: Sam Gamble <hexadessa@gmail.com>
@awsbuild awsbuild modified the milestones: 1.27.0rc1, 1.27.0 Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request feature A main feature, captured on the backlog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement correlated random number generation in gulpy
6 participants