Implement correlated random number generation #1069
@@ -10,6 +10,7 @@
 import sys
 import warnings
 from collections import OrderedDict
+from typing import List

 import pandas as pd
@@ -44,12 +45,61 @@
 pd.options.mode.chained_assignment = None
 warnings.simplefilter(action='ignore', category=FutureWarning)

+VALID_OASIS_GROUP_COLS = [
+    'item_id',
+    'peril_id',
+    'coverage_id',
+    'coverage_type_id',
+    'peril_correlation_group'
+]
+
+
+def process_group_id_cols(group_id_cols: List[str], exposure_df_columns: List[str], correlations: bool) -> List[str]:
Review comment: this is one of my disliked examples of typing. For `correlations`, a more explicit name would be better — `has_correlation_groups` or `is_correlated` rather than the current choice with a bare `bool`.

Reply: Now removed typing and changed the …
+    """
+    Cleans out columns that are not valid Oasis group columns.
+
+    Valid group ID columns must either:
+    1. exist in the location file, or
+    2. be listed as a useful internal column.
+
+    Args:
+        group_id_cols: (List[str]) the ID columns that are going to be filtered
+        exposure_df_columns: (List[str]) the columns in the exposure dataframe
+        correlations: (bool) if set to True, means that we are hashing with correlations in mind, therefore the
+                      "peril_correlation_group" column is added
+
+    Returns: (List[str]) the filtered columns
+    """
+    for col in VALID_OASIS_GROUP_COLS:
+        if col not in list(exposure_df_columns) + VALID_OASIS_GROUP_COLS:
Review comment (@sstruzik): I think this is always False, as any `col` in `VALID_OASIS_GROUP_COLS` is always in `list(exposure_df_columns) + VALID_OASIS_GROUP_COLS`.

Reply: this is merely legacy code that has been moved to another area, but happy to change this: OasisLMF/oasislmf/preparation/gul_inputs.py Line 156 in 1c59d93

Review comment: @sstruzik is right here, there is an issue with the logic and the condition will always evaluate to False — it's checking for valid group_id columns by looping over the wrong list (it should iterate over `group_id_cols`, not `VALID_OASIS_GROUP_COLS`).
+            warnings.warn('Column {} not found in loc file, or a valid internal oasis column'.format(col))
+            group_id_cols.remove(col)
+
+    peril_correlation_group = 'peril_correlation_group'
+    if peril_correlation_group not in group_id_cols and correlations is True:
Review comment: I usually just do `if correlations:` — is there a reason to use `if correlations is True`?

Reply: truthiness and identity with `True` are not the same thing, e.g.:

    one = 1
    if one:
        print("one")    # prints, because 1 is truthy
    if one is True:
        print("two")    # does not print, because the int 1 is not the object True
+        group_id_cols.append(peril_correlation_group)
+    return group_id_cols
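As the review notes, the membership test above can never fail, because it loops over `VALID_OASIS_GROUP_COLS` itself. A minimal corrected sketch (assuming the intent is to filter the caller's `group_id_cols`, as the reply suggests) would iterate over a copy of the requested columns instead:

```python
import warnings
from typing import List

VALID_OASIS_GROUP_COLS = [
    'item_id', 'peril_id', 'coverage_id', 'coverage_type_id',
    'peril_correlation_group',
]


def process_group_id_cols(group_id_cols: List[str],
                          exposure_df_columns: List[str],
                          correlations: bool) -> List[str]:
    # Loop over a copy of the *requested* columns so removal during
    # iteration is safe and unknown columns can actually be caught
    for col in list(group_id_cols):
        if col not in list(exposure_df_columns) + VALID_OASIS_GROUP_COLS:
            warnings.warn('Column {} not found in loc file, or a valid internal oasis column'.format(col))
            group_id_cols.remove(col)

    if correlations and 'peril_correlation_group' not in group_id_cols:
        group_id_cols.append('peril_correlation_group')
    return group_id_cols
```

With this version, `process_group_id_cols(['PortNumber', 'Bogus'], ['PortNumber'], True)` warns about `Bogus`, drops it, and appends `peril_correlation_group`.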
+def hash_with_correlations(gul_inputs_df: pd.DataFrame, hashing_columns: List[str]) -> pd.DataFrame:

Review comment: There is nothing specific to correlation in this function; it is just hashing based on some columns, and it is identical to the code we have at the end of `get_gul_input_items`. So the name is misleading.

Reply: This has been changed to …
+    """
+    Creates a hash for the group ID field for the input data frame.
+
+    Args:
+        gul_inputs_df: (pd.DataFrame) the gul inputs that are going to have the group_id field rewritten with a hash
+        hashing_columns: (List[str]) the list of columns used in the hashing algorithm
+
+    Returns: (pd.DataFrame) the gul_inputs_df with the new hashed group_id
+    """
+    gul_inputs_df["group_id"] = (pd.util.hash_pandas_object(gul_inputs_df[hashing_columns],
+                                                            index=False).to_numpy() >> 33)
+    return gul_inputs_df
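For reference, the hash above can be exercised on its own; this sketch reproduces the function body with a small hypothetical frame (the column names here are illustrative, not from the PR's test data):

```python
from typing import List

import pandas as pd


def hash_with_correlations(gul_inputs_df: pd.DataFrame, hashing_columns: List[str]) -> pd.DataFrame:
    # Row-wise hash of the selected columns; the right shift by 33 bits
    # keeps the uint64 hash values within a 31-bit positive range
    gul_inputs_df["group_id"] = (pd.util.hash_pandas_object(gul_inputs_df[hashing_columns],
                                                            index=False).to_numpy() >> 33)
    return gul_inputs_df


df = pd.DataFrame({
    "PortNumber": [1, 1, 2],
    "AccNumber": ["A1", "A1", "B9"],
    "peril_correlation_group": [1, 1, 2],
})
df = hash_with_correlations(df, ["PortNumber", "AccNumber", "peril_correlation_group"])
# identical rows receive identical group IDs
assert df["group_id"].iloc[0] == df["group_id"].iloc[1]
```

`index=False` matters here: it makes the hash depend only on the column values, so duplicate rows map to the same group.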
 @oasis_log
 def get_gul_input_items(
     exposure_df,
     keys_df,
     output_dir,
     exposure_profile=get_default_exposure_profile(),
     group_id_cols=["PortNumber", "AccNumber", "LocNumber"],
     hashed_group_id=True
@@ -148,35 +198,12 @@ def get_gul_input_items(
     # Remove any duplicate column names used to assign group_id
     group_id_cols = list(set(group_id_cols))

-    # Ignore any column names used to assign group_id that are missing or not supported
-    # Valid group id columns can be either
-    # 1. exist in the location file
-    # 2. be listed as a useful internal col
-    valid_oasis_group_cols = [
-        'item_id',
-        'peril_id',
-        'coverage_id',
-        'coverage_type_id',
-        'peril_correlation_group'
-    ]
-    for col in group_id_cols:
-        if col not in list(exposure_df.columns) + valid_oasis_group_cols:
-            warnings.warn('Column {} not found in loc file, or a valid internal oasis column'.format(col))
-            group_id_cols.remove(col)
-
-    # here we check to see if the correlation file is here, if it is then we need to add the "peril_correlation_group" to the valid_oasis_group_cols
-    peril_correlation_group = 'peril_correlation_group'
-    correlations_files = [
-        f"{output_dir}/correlations.csv",
-        f"{output_dir}/correlations.bin",
-    ]
-    for file_path in correlations_files:
-        if os.path.exists(path=file_path):
-            if peril_correlation_group not in group_id_cols:
-                group_id_cols.append(peril_correlation_group)
-            break
-
+    # it is assumed that correlations are False for now, correlations for group ID hashing are assessed later on in
Review comment: to avoid hashing twice, you could do the merge with the correlation data in this function instead of doing it after calling it. That should remove your chicken-and-egg problem.

Reply: hashing is only performed once now.
+    # the process to re-hash the group ID with the correlation "peril_correlation_group" column name. This is because
+    # the correlations is achieved later in the process, leading to a chicken-and-egg problem
+    group_id_cols = process_group_id_cols(group_id_cols=group_id_cols,
+                                          exposure_df_columns=list(exposure_df.columns),
+                                          correlations=False)

     # Should list of column names used to group_id be empty, revert to
     # default
@@ -186,7 +213,7 @@ def get_gul_input_items(
     # Only add group col if not internal oasis col
     missing_group_id_cols = []
     for col in group_id_cols:
-        if col in valid_oasis_group_cols:
+        if col in VALID_OASIS_GROUP_COLS:
             pass
         elif col not in exposure_df_gul_inputs_cols:
             missing_group_id_cols.append(col)
— Second file in the diff —
@@ -44,6 +44,7 @@
 from chardet.universaldetector import UniversalDetector
 from tabulate import tabulate
+from typing import List, Optional

 import numpy as np
 import pandas as pd

@@ -409,6 +410,27 @@ def get_model_settings(model_settings_fp, key=None, validate=True):
     return model_settings if not key else model_settings.get(key)
+def establish_correlations(model_settings_path: str) -> bool:

Review comment: to remove — see comment in the calling function.
+    """
+    Checks the model settings to see if correlations are present.
+
+    Args:
+        model_settings_path: (str) path to the model settings JSON file
+
+    Returns: (bool) True if correlations are present, False if not
+    """
+    model_settings_raw_data: dict = get_model_settings(model_settings_fp=model_settings_path)
+    correlations: Optional[List[dict]] = model_settings_raw_data.get("correlation_settings")
+
+    if correlations is None:
+        return False
+    if not isinstance(correlations, list):
+        return False
+    if len(correlations) == 0:
+        return False
+    return True
 def detect_encoding(filepath):
     """
     Given a path to a CSV of unknown encoding
Review comment: I don't think you should have to read model_settings_path twice. `get_correlation_input_items` should return all the information that you need, so I would remove `establish_correlations`.

Reply: model settings is only read once now.
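Per that review, once the settings dict is already in memory the same check can run without re-reading the file. A minimal sketch (the function name `correlations_present` and the sample dicts are hypothetical; `correlation_settings` is the key used in the PR):

```python
from typing import List, Optional


def correlations_present(model_settings: dict) -> bool:
    # Same logic as establish_correlations, but applied to an
    # already-loaded settings dict so the JSON file is read only once
    correlations: Optional[List[dict]] = model_settings.get("correlation_settings")
    return isinstance(correlations, list) and len(correlations) > 0


assert correlations_present({"correlation_settings": [{"peril_correlation_group": 1}]})
assert not correlations_present({})
assert not correlations_present({"correlation_settings": []})
```

Collapsing the three early returns into one boolean expression keeps the missing-key, wrong-type, and empty-list cases behaving exactly as in the PR's version.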