Pipeline setup step 2: Fill forecast.py #93

Open
wants to merge 5 commits into main
Conversation

@Fuhan-Yang (Contributor) commented on Jan 10, 2025:

This PR adds a forecast function and runs the forecasts sequentially over the time frame given in config.yaml. The results are saved as parquet data to the designated path. However, this function hasn't been tested yet because of several errors:

  1. The error "type Float64 is incompatible with expected type String" is raised in split_train_test. This happens when the uptake data is a single polars DataFrame instead of a list of uptake data frames.

  2. NaNs in the training data prevent fitting LinearIncidenceUptakeModel; checks or handling methods (imputing or dropping the NAs) are needed to ensure fitting can run.
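
A minimal sketch of the sequential loop described above; the config keys (`data.clean_path`, `data.save_path`, `timeframe.interval`) and the `run_forecast` stub are illustrative assumptions, not the actual forecast.py code:

```python
import polars as pl
import yaml


def run_forecast(data: pl.DataFrame, config: dict, forecast_date) -> pl.DataFrame:
    """Placeholder for the per-date forecast added in this PR (fit model, project uptake)."""
    raise NotImplementedError


def main() -> None:
    with open("config.yaml") as f:
        config = yaml.safe_load(f)
    data = pl.read_parquet(config["data"]["clean_path"])  # assumed key
    dates = pl.date_range(
        config["timeframe"]["start"],
        config["timeframe"]["end"],
        interval=config["timeframe"]["interval"],  # assumed key
        eager=True,
    )
    # Run the forecasts sequentially and save the combined results as parquet.
    pl.concat([run_forecast(data, config, d) for d in dates]).write_parquet(
        config["data"]["save_path"]  # the "designated path" from the config
    )
```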

@Fuhan-Yang requested review from swo and eschrom on January 10, 2025.

@eschrom (Collaborator) left a comment:

Similar to my review for #92, I think we should get this to pass pre-commit and root out errors before merging.

If split_train_test breaks when given a single polars data frame, let's fix it. It looks like preprocess.py intends to always return a single data frame, not a list. This can easily remain true even after preprocess.py knows how to process multiple data sets, because it can just concatenate them into one before returning it. So we can just edit split_train_test to expect only a single data frame.
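
A rough sketch of that shape, assuming polars and a date-based split; the column name and signatures here are illustrative, not the existing preprocess.py code:

```python
import datetime as dt
import polars as pl


def preprocess(raw_frames: list[pl.DataFrame]) -> pl.DataFrame:
    """However many raw data sets come in, concatenate them and return one frame."""
    return pl.concat(raw_frames)


def split_train_test(
    data: pl.DataFrame, forecast_start: dt.date
) -> tuple[pl.DataFrame, pl.DataFrame]:
    """Expect a single data frame and split it on the forecast start date."""
    train = data.filter(pl.col("date") < forecast_start)
    test = data.filter(pl.col("date") >= forecast_start)
    return train, test
```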

Likewise, if NaNs in the data cause problems, let's find out where they come from and fix them if possible. I don't recall missing values in the raw data - I wonder if there are further issues lurking in our code that cause the NaNs.
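
For tracking those down, a quick polars check can help (this is a generic snippet, not project code; note that polars treats float NaN separately from null):

```python
import polars as pl


def rows_with_missing(data: pl.DataFrame) -> pl.DataFrame:
    """Print per-column null counts and return the rows that contain any null."""
    print(data.null_count())
    # Float NaNs are not nulls in polars; if the frame has float columns, also check
    # data.select(pl.col(pl.Float64).is_nan().any()).
    return data.filter(pl.any_horizontal(pl.all().is_null()))
```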

Normally, such things could be separate issues for separate PRs, but if they prevent us from testing the content of this PR properly, then I think they should be solved here.

Please also see my separate comment on the way to increment the forecast date.

cumulative_projections.with_columns(estimate_type=pl.lit("cumulative")),
incident_projections.with_columns(estimate_type=pl.lit("incident")),
]
)

Collaborator:

Same question I had on #92 - why combine cumulative and incident uptake into a single data frame? We could just return the cumulative data and derive the incident data in a future step if needed.
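
For what it's worth, deriving the incident series from the cumulative one later could look roughly like this (the column and group names are placeholders):

```python
import polars as pl


def to_incident(cumulative: pl.DataFrame, groups: list[str]) -> pl.DataFrame:
    """Derive incident uptake as the within-group difference of cumulative uptake."""
    return cumulative.with_columns(
        estimate=pl.coalesce(
            pl.col("estimate").diff().over(groups),  # first row per group diffs to null
            pl.col("estimate"),                      # so fall back to the cumulative value
        ),
        estimate_type=pl.lit("incident"),
    )
```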

Contributor (author):

Fixed in 862faf8

clean_data,
grouping_factors=config["groups"],
forecast_start=config["timeframe"]["start"],
forecast_end=forecast_date,

Collaborator:

Shouldn't it be forecast_start that is being incremented?

As written here, with forecast_end being incremented, we are asking questions like:

  • suppose we have data through Nov 1, 2024 - how well can we estimate uptake through May 1, 2025?
  • suppose we have data through Nov 1, 2024 - how well can we estimate uptake through May 8, 2025?
  • suppose we have data through Nov 1, 2024 - how well can we estimate uptake through May 15, 2025?
  • and so on...

These questions do not require fitting the model multiple times in a loop! Just fit it once for the latest forecast_end date (e.g. May 29, 2025) and look up the projections "along the way" for all the intermediate dates (May 22, May 15, May 8, May 1, etc.).

But instead, I think we had intended to increment forecast_start to ask questions like:

  • suppose we have data through Nov 1, 2024 - how well can we estimate uptake through May 29, 2025?
  • suppose we have data through Nov 8, 2024 - how well can we estimate uptake through May 29, 2025?
  • suppose we have data through Nov 15, 2024 - how well can we estimate uptake through May 29, 2025?
  • and so on...

In this case, because the starting date for forecasting keeps getting pushed later, the training data set keeps expanding, so the model does need to be refit multiple times in a loop.

So the latter is what we should be doing here.
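
A sketch of that rolling loop, using the split_train_test shape sketched earlier; the fit/predict interface shown here is an assumption, not the actual LinearIncidenceUptakeModel API:

```python
import datetime as dt
import polars as pl


def rolling_forecasts(
    clean_data: pl.DataFrame,
    forecast_starts: list[dt.date],
    forecast_end: dt.date,
) -> pl.DataFrame:
    """Refit once per forecast_start (training data grows); forecast_end stays fixed."""
    results = []
    for forecast_start in forecast_starts:
        train, _ = split_train_test(clean_data, forecast_start)
        model = LinearIncidenceUptakeModel().fit(train)            # assumed interface
        projections = model.predict(forecast_start, forecast_end)  # assumed interface
        results.append(projections.with_columns(forecast_start=pl.lit(forecast_start)))
    return pl.concat(results)
```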

Contributor (author):

Fixed in 857c080

@Fuhan-Yang requested a review from eschrom on January 13, 2025.

@Fuhan-Yang (Contributor, author) commented:


For now, I would suggest merging the PRs once the questions are well addressed. Once the entire pipeline is merged, we can test it and open new PRs if needed!
