
Implement correlated random number generation #1069

Merged
merged 40 commits on Oct 3, 2022
Changes from 1 commit of 40
eefce12
[gulpy] first implementation
mtazzari Jul 5, 2022
a8f89de
[gulpy] implementing correlated rng
mtazzari Jul 5, 2022
d8215d8
[wip] implementing correlated rng
mtazzari Jul 6, 2022
614dd3d
[wip]
mtazzari Jul 12, 2022
d407089
[gulpy] working implementation of the correlated random values
mtazzari Jul 15, 2022
579910b
Merge branch 'develop' into feature/correlated_rng
mtazzari Jul 15, 2022
5417f9b
minor cleanup
mtazzari Jul 15, 2022
5ad10f9
[gulpy] Update docstrings for random module functions
mtazzari Jul 22, 2022
4f564a1
Merge branch 'develop' into feature/correlated_rng
mtazzari Aug 1, 2022
2066aa2
[gulpy] remove unused generate_correlated_hash
mtazzari Aug 2, 2022
f0311c0
[gulpy] introduce --ignore-correlation flag
mtazzari Aug 2, 2022
1709cee
set hashed_group_id to True by default, cleanup
mtazzari Aug 3, 2022
2222be2
adding haahing patch
maxwellflitton Aug 8, 2022
2c0d5e3
adding haahing patch
maxwellflitton Aug 8, 2022
6621208
Merge branch 'develop' into hashing-investigation
mtazzari Aug 10, 2022
e8cf544
Merge branch 'develop' into feature/correlated_rng
mtazzari Aug 11, 2022
bb05858
[gulpy] minor cleanup files.py parameter on same line
mtazzari Aug 11, 2022
e593f0c
[gulpy] run correlation only if rho>0
mtazzari Aug 11, 2022
fbf1689
updating hashing
maxwellflitton Aug 11, 2022
d611869
[gulpy] improve flow depending on corr definitions
mtazzari Aug 12, 2022
82fb79c
Disable GroupID hashing for acceptance tests (#1094)
sambles Aug 12, 2022
fee427e
Update group_id_cols default in get_gul_input_items
mtazzari Aug 12, 2022
45f8779
Hashing investigation (#1096)
maxwellflitton Aug 12, 2022
461621f
[gul_inputs] bugfix don't modify inplace
mtazzari Aug 12, 2022
77245a3
Update test_summaries.py to not rely on "loc_id" as default for group…
sambles Aug 12, 2022
bf500be
Always create a correlations.bin, if missing model_settings file is b…
sambles Aug 23, 2022
fd210e2
Merge branch 'develop' of https://github.com/OasisLMF/OasisLMF into h…
maxwellflitton Aug 24, 2022
2b3d202
adding peril_correlation_group for valid_oasis_group_cols
maxwellflitton Sep 5, 2022
f7cb1ab
adding peril_correlation_group for valid_oasis_group_cols
maxwellflitton Sep 5, 2022
7d772fb
appending peril_correlation_group to columns if correlations group is…
maxwellflitton Sep 5, 2022
e6ac89e
adding peril_correlation_group column to hashing of group IDs if corr…
maxwellflitton Sep 7, 2022
823add8
updating hashing group ID
maxwellflitton Sep 15, 2022
ff9f568
updating to accomodate non-correlations
maxwellflitton Sep 20, 2022
32a5f66
fixxing run
maxwellflitton Sep 21, 2022
d60deb2
fixing empty correlations df write header if empty correlations
maxwellflitton Sep 21, 2022
1fc5651
Merge branch 'develop' into feature/correlated_rng
sambles Oct 3, 2022
e95dac6
Remove empty file
sambles Oct 3, 2022
62c805f
Add missing defaults to get_gul_input_items (backwards compatible)
sambles Oct 3, 2022
5f6933a
Fix Group_id valid column check
sambles Oct 3, 2022
44e43f2
Force retest
sambles Oct 3, 2022
adding peril_correlation_group column to hashing of group IDs if correlations groups are used and hashing group IDs is done
maxwellflitton committed Sep 7, 2022
commit e6ac89e944af08036ca53a6ca7f6e9a71413e250
20 changes: 16 additions & 4 deletions oasislmf/computation/generate/files.py
@@ -8,6 +8,8 @@
import json
import os
from pathlib import Path
from typing import List
import pandas as pd

from .keys import GenerateKeys, GenerateKeysDeterministic
from ..base import ComputationStep
@@ -72,7 +74,10 @@
GULSummaryXrefFile,
FMSummaryXrefFile
)
from oasislmf.preparation.correlations import get_correlation_input_items
from oasislmf.preparation.correlations import get_correlation_input_items, map_data
from oasislmf.preparation.gul_inputs import process_group_id_cols, hash_with_correlations
# from oasislmf.preparation.correlations import map_data
from oasislmf.utils.data import establish_correlations


class GenerateFiles(ComputationStep):
@@ -230,16 +235,24 @@ def run(self):
gul_inputs_df = get_gul_input_items(
location_df,
keys_df,
output_dir=self._get_output_dir(),
exposure_profile=location_profile,
group_id_cols=group_id_cols,
hashed_group_id=self.hashed_group_id,
hashed_group_id=self.hashed_group_id
)
correlation_input_items = get_correlation_input_items(
model_settings_path=self.model_settings_json,
gul_inputs_df=gul_inputs_df
)

correlations: bool = establish_correlations(model_settings_path=self.model_settings_json)
Review comment (Contributor):
I don't think you should have to read model_settings_path twice. get_correlation_input_items should return all the information that you need, so I would remove establish_correlations.

Reply (Contributor):
Model settings is only read once now.

group_id_cols: List[str] = process_group_id_cols(group_id_cols=group_id_cols,
exposure_df_columns=list(location_df),
correlations=correlations)

if self.hashed_group_id is True and correlations is True:
gul_inputs_df = pd.merge(gul_inputs_df, correlation_input_items, on="item_id")
gul_inputs_df = hash_with_correlations(gul_inputs_df=gul_inputs_df, hashing_columns=group_id_cols)
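For illustration, a minimal sketch of what the merge above does, using toy frames (hypothetical values; it is assumed here that get_correlation_input_items returns one row per item_id carrying its peril_correlation_group, which is what the join relies on):

import pandas as pd

# Hypothetical stand-ins for the real inputs.
gul_inputs_df = pd.DataFrame({"item_id": [1, 2, 3], "coverage_id": [10, 10, 20]})
correlation_input_items = pd.DataFrame({"item_id": [1, 2, 3],
                                        "peril_correlation_group": [1, 1, 2]})

# The join attaches a peril_correlation_group to every gul input row, so the
# subsequent hash_with_correlations call can include that column in the
# group_id hash.
gul_inputs_df = pd.merge(gul_inputs_df, correlation_input_items, on="item_id")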

# If not in det. loss gen. scenario, write exposure summary file
if summarise_exposure:
write_exposure_summary(
@@ -265,7 +278,6 @@
oasis_files_prefixes=files_prefixes['gul'],
chunksize=self.write_chunksize,
)

gul_summary_mapping = get_summary_mapping(gul_inputs_df, oed_hierarchy)
write_mapping_file(gul_summary_mapping, target_dir)

89 changes: 58 additions & 31 deletions oasislmf/preparation/gul_inputs.py
@@ -10,6 +10,7 @@
import sys
import warnings
from collections import OrderedDict
from typing import List

import pandas as pd

@@ -44,12 +45,61 @@
pd.options.mode.chained_assignment = None
warnings.simplefilter(action='ignore', category=FutureWarning)

VALID_OASIS_GROUP_COLS = [
'item_id',
'peril_id',
'coverage_id',
'coverage_type_id',
'peril_correlation_group'
]


def process_group_id_cols(group_id_cols: List[str], exposure_df_columns: List[str], correlations: bool) -> List[str]:
Review comment (Contributor):
This is one example of why I dislike the typing here. I'm not sure what you are checking later (see the comment about ln 74), but exposure_df_columns doesn't need to be a list: since you cast it to a list in your code, you could pass df.columns directly and there is no need for the annotation.
For correlations, a more explicit name such as has_correlation_groups or is_correlated would be clearer than the current choice with a bool. It also wouldn't need to be a bool if you did "if correlations" instead of "if correlations is True".
In the end, all the typing limits the function's potential uses and, in my opinion, is more confusing.

Reply (Contributor):
Typing is now removed, and correlations has been renamed to has_correlation_groups.

"""
Cleans out columns that are not valid Oasis group columns.

Valid group id columns must either
1. exist in the location file, or
2. be listed as a valid internal column

Args:
group_id_cols: (List[str]) the ID columns that are going to be filtered
exposure_df_columns: (List[str]) the columns in the exposure dataframe
correlations: (bool) if True, we are hashing with correlations in mind, so the
"peril_correlation_group" column is added

Returns: (List[str]) the filtered columns
"""
for col in VALID_OASIS_GROUP_COLS:
if col not in list(exposure_df_columns) + VALID_OASIS_GROUP_COLS:
Review comment (Contributor):
I think this is always False, as col in VALID_OASIS_GROUP_COLS is always in list(exposure_df_columns) + VALID_OASIS_GROUP_COLS.

Reply (Contributor):
This is merely legacy code that has been moved to another area, but happy to change this to:

for col in group_id_cols:

Reply (Contributor, @sambles, Oct 3, 2022):
@sstruzik is right here; there is an issue with the logic and it will always evaluate to False.
It's checking for valid group_id columns by looping over the list VALID_OASIS_GROUP_COLS instead of checking the input given to the function from group_id_cols.

warnings.warn('Column {} not found in loc file, or a valid internal oasis column'.format(col))
group_id_cols.remove(col)

peril_correlation_group = 'peril_correlation_group'
if peril_correlation_group not in group_id_cols and correlations is True:
Review comment (Contributor):
I usually just do "if correlations"; is there a reason to use "if correlations is True"?

Reply (Contributor):
correlations is a bool; however, if we just have "if correlations" it will pass whenever correlations is merely not None. Therefore it is safer to explicitly state "if correlations is True". You can run the following code to see the difference:

one = 1

if one:
    print("one")

if one is True:
    print("two")

group_id_cols.append(peril_correlation_group)
return group_id_cols
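A brief usage sketch of process_group_id_cols with hypothetical inputs. Note that, as the reviewers point out above, the filtering loop in this commit iterates over VALID_OASIS_GROUP_COLS rather than over group_id_cols, so the invalid-column branch never fires here; only the conditional append of peril_correlation_group takes effect:

# Hypothetical inputs, for illustration only.
group_id_cols = ["PortNumber", "AccNumber", "NotARealColumn"]
exposure_columns = ["PortNumber", "AccNumber", "LocNumber"]

out = process_group_id_cols(group_id_cols=group_id_cols,
                            exposure_df_columns=exposure_columns,
                            correlations=True)
# As written in this commit, 'NotARealColumn' survives (the removal branch is
# unreachable) and 'peril_correlation_group' is appended:
# ['PortNumber', 'AccNumber', 'NotARealColumn', 'peril_correlation_group']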


def hash_with_correlations(gul_inputs_df: pd.DataFrame, hashing_columns: List[str]) -> pd.DataFrame:
Review comment (Contributor):
There is nothing specific to correlation in this function; it is just hashing based on some columns, and it is identical to the code we have at the end of get_gul_input_items, so the name is misleading. The hashing itself shouldn't be done twice in two parts of the code.

Reply (Contributor):
The function name has been changed to hash_group_id.

"""
Creates a hash for the group ID field for the input data frame.

Args:
gul_inputs_df: (pd.DataFrame) the gul inputs whose group_id field will be rewritten with a hash
hashing_columns: (List[str]) the list of columns used in the hashing algorithm

Returns: (pd.DataFrame) the gul_inputs_df with the new hashed group_id
"""
gul_inputs_df["group_id"] = (pd.util.hash_pandas_object(gul_inputs_df[hashing_columns],
index=False).to_numpy() >> 33)
return gul_inputs_df
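A small illustration of the hashing above: rows that agree on the hashing columns receive the same group_id (a toy frame with hypothetical values):

import pandas as pd

df = pd.DataFrame({
    "item_id": [1, 2, 3],
    "coverage_id": [10, 10, 20],
    "peril_correlation_group": [1, 1, 2],
    "group_id": [0, 0, 0],
})
hashed = hash_with_correlations(gul_inputs_df=df,
                                hashing_columns=["coverage_id", "peril_correlation_group"])
# Rows 1 and 2 share identical values in the hashing columns, so they receive
# the same hashed group_id; row 3 gets a different one.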


@oasis_log
def get_gul_input_items(
exposure_df,
keys_df,
output_dir,
exposure_profile=get_default_exposure_profile(),
group_id_cols=["PortNumber", "AccNumber", "LocNumber"],
hashed_group_id=True
@@ -148,35 +198,12 @@ def get_gul_input_items(
# Remove any duplicate column names used to assign group_id
group_id_cols = list(set(group_id_cols))

# Ignore any column names used to assign group_id that are missing or not supported
# Valid group id columns can be either
# 1. exist in the location file
# 2. be listed as a useful internal col
valid_oasis_group_cols = [
'item_id',
'peril_id',
'coverage_id',
'coverage_type_id',
'peril_correlation_group'
]
for col in group_id_cols:
if col not in list(exposure_df.columns) + valid_oasis_group_cols:
warnings.warn('Column {} not found in loc file, or a valid internal oasis column'.format(col))
group_id_cols.remove(col)

# here we check to see if the correlation file is here, if it is then we need to add the "peril_correlation_group" to the valid_oasis_group_cols
peril_correlation_group = 'peril_correlation_group'
correlations_files = [
f"{output_dir}/correlations.csv",
f"{output_dir}/correlations.bin",
]
for file_path in correlations_files:
if os.path.exists(path=file_path):
if peril_correlation_group not in group_id_cols:
group_id_cols.append(peril_correlation_group)
break


# it is assumed that correlations are False for now, correlations for group ID hashing are assessed later on in
Review comment (Contributor):
To avoid hashing twice, you could do the merge with the correlation data in this function instead of doing it after calling it. That should remove your chicken-and-egg problem.

Reply (Contributor):
Hashing is only performed once now.

# the process to re-hash the group ID with the correlation "peril_correlation_group" column. This is because
# the correlations are established later in the process, leading to a chicken-and-egg problem
group_id_cols = process_group_id_cols(group_id_cols=group_id_cols,
exposure_df_columns=list(exposure_df.columns),
correlations=False)

# Should list of column names used to group_id be empty, revert to
# default
@@ -186,7 +213,7 @@
# Only add group col if not internal oasis col
missing_group_id_cols = []
for col in group_id_cols:
if col in valid_oasis_group_cols:
if col in VALID_OASIS_GROUP_COLS:
pass
elif col not in exposure_df_gul_inputs_cols:
missing_group_id_cols.append(col)
22 changes: 22 additions & 0 deletions oasislmf/utils/data.py
@@ -44,6 +44,7 @@

from chardet.universaldetector import UniversalDetector
from tabulate import tabulate
from typing import List, Optional

import numpy as np
import pandas as pd
@@ -409,6 +410,27 @@ def get_model_settings(model_settings_fp, key=None, validate=True):
return model_settings if not key else model_settings.get(key)


def establish_correlations(model_settings_path: str) -> bool:
Review comment (Contributor):
To remove; see the comment in the calling function.

"""
Checks the model settings to see if correlations are present.

Args:
model_settings_path: (str) path to the model setting JSON file

Returns: (bool) True if correlation settings are present, False if not
"""
model_settings_raw_data: dict = get_model_settings(model_settings_fp=model_settings_path)
correlations: Optional[List[dict]] = model_settings_raw_data.get("correlation_settings")

if correlations is None:
return False
if not isinstance(correlations, list):
return False
if len(correlations) == 0:
return False
return True
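The decision reduces to a simple check on the parsed settings. A minimal sketch of the same rule, applied to an already-parsed dict so the file loading and validation in get_model_settings are out of the picture (hypothetical helper, for illustration only):

from typing import List, Optional

def has_correlation_settings(model_settings: dict) -> bool:
    # Same rule as establish_correlations: the "correlation_settings" entry
    # must be present, be a list, and be non-empty.
    correlations: Optional[List[dict]] = model_settings.get("correlation_settings")
    return isinstance(correlations, list) and len(correlations) > 0

print(has_correlation_settings({"correlation_settings": [{"peril_correlation_group": 1}]}))  # True
print(has_correlation_settings({}))                                                          # False
print(has_correlation_settings({"correlation_settings": []}))                                # False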


def detect_encoding(filepath):
"""
Given a path to a CSV of unknown encoding
Empty file added run_test.py
Empty file.