
Add missing reference model components to delay_group_lpmf and generated quantities #147

Merged: 60 commits merged into develop from feature-missing-reference-function on Sep 20, 2022

Conversation

@seabbs (Collaborator) commented Jul 31, 2022

As pointed out in #138, I have now made things a bit complicated for us (@adrian-lison and myself) in terms of implementing the missing reference model. This PR aims to add a simplistic version of the missing reference model to delay_group_lpmf and to the generated quantities, with the idea that we can build an improved version on top of this implementation in the future.

So far I have implemented:

  • An if/else option in delay_group_lpmf based on whether the missingness model is required
  • An allocation function to go from all (i.e. up to dmax) observations to just observed observations
  • An allocation function to apply missing reference effects to log observations
  • A function to map (using a lookup) from observations/expectation by report and reference to observations/expectations by report.
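The report-date mapping in the last bullet can be sketched as follows. This is an illustrative Python sketch, not the actual Stan implementation; the function name and indexing are assumptions, with the lookup being report date = reference date + delay.

```python
# Illustrative sketch (names are hypothetical, not the actual Stan code):
# observations indexed by (reference date, delay) are aggregated into
# totals by report date using the lookup report = reference + delay.

def obs_by_report(obs, t):
    """obs[ref][d]: count with reference date ref reported after delay d.
    Returns totals by report date for report dates 0..t-1."""
    totals = [0] * t
    for ref, counts in enumerate(obs):
        for d, n in enumerate(counts):
            rep = ref + d  # lookup: report date = reference date + delay
            if rep < t:
                totals[rep] += n
    return totals

obs = [[2, 1, 0],   # reference date 0: 2 reported same day, 1 with delay 1
       [4, 0, 0]]   # reference date 1: 4 reported same day
print(obs_by_report(obs, t=3))  # → [2, 5, 0]
```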

Before being marked as not a draft this PR needs:

  • A lookup to be implemented in enw_missing()
  • Generated quantities to be updated with missing model
  • Filter for only observations with complete reports by reference date (efficiently)
  • The addition of a basic regression test for the missing data model
  • Unit tests for the newly added functions.
  • Make sure it reproduces simulated missingness

This PR implements part of #43.

@seabbs seabbs added the enhancement New feature or request label Jul 31, 2022
@seabbs seabbs marked this pull request as ready for review August 1, 2022 15:00

@sbfnk (Collaborator) left a comment:

The indexing is savage but I couldn't spot any flaws.

Review comment on inst/stan/functions/delay_lpmf.stan (outdated, resolved)

Base automatically changed from feature-missing-support-code to develop August 6, 2022 12:04
@seabbs (Collaborator, Author) commented Aug 10, 2022

Did a little more work on this and fixed a small bug in the ordering of data for the simulation that may have been causing some issues. Sadly, the issue with not recovering missingness proportions has not been resolved and appears to be present across multiple nowcast dates (though not always as an underestimate).

I have now:

  • Checked on several other snapshots (for example see below).
  • Added enw_incidence_to_cumulative() to the package and renamed enw_new_reports() to match
  • Added unit tests for both these functions.
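As a rough sketch of what enw_incidence_to_cumulative() does (a Python stand-in with hypothetical inputs, not the R implementation): incidence counts by delay are accumulated along the delay dimension for each reference date.

```python
# Sketch of converting incidence by delay into cumulative reports for
# each reference date; illustrative only, not the package's R code.
from itertools import accumulate

def incidence_to_cumulative(incidence):
    """incidence[ref][d]: new reports at delay d for reference date ref."""
    return [list(accumulate(row)) for row in incidence]

inc = [[3, 1, 2],
       [5, 0, 4]]
print(incidence_to_cumulative(inc))  # → [[3, 4, 6], [5, 5, 9]]
```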

I still need to:

  • Update the news item
  • Add internal functions for the lookup code
  • Add unit tests for the lookup code
  • Add a regression test for epinowcast using the missing data model
  • Add simulator to the package
  • Add unit tests for simulator
  • More tests of simulation recovery
  • Update the model definition vignette
  • Unit test internal missing stan functions

Example nowcast (with a daily random walk on the proportion missing)

Note: it is not entirely clear what is driving the day of the week effect seen here.

[Screenshots: three nowcast plots, 2022-08-10]

Example nowcast (fixed proportion missing)

[Screenshots: three nowcast plots, 2022-08-10]

Example nowcast (fixed proportion missing for a different date)

[Screenshots: three nowcast plots, 2022-08-10]

@adrian-lison (Collaborator) commented:

So I was in a pedantic mood and made some changes to the in-code model documentation to make it a bit more intuitive (at least for me - hoping my brain is not too strange so others would profit as well 😆 )

@adrian-lison (Collaborator) commented Aug 31, 2022

Unfortunately, no solution has been found yet for the wrong missingness proportions...

But it really looks like we are leaking observations with missing reference times in the beginning, see the plot of imp_obs below (dots are realized / ground truth observations, accounting for the missingness):
[Plot: imp_obs]

Corresponding missingness:
[Plot: prop_missing]

EDIT: not fully true; the above could also be a result of the lack of signal for the first dmax time points.

@adrian-lison (Collaborator) commented:

This is still a riddle. Using print statements in Stan, and copying the results into R (as a replacement for debugging), I have verified that no observations from imp_obs get lost further downstream:

  • imp_obs is fully distributed across delays
  • the correct shares of cases with missing and non-missing reference dates are allocated, summing to the overall expected observations
  • observations with a missing reference date are correctly allocated to reporting times
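The consistency checks above can be sketched numerically (illustrative numbers only; this is not the Stan code): expected observations are split into missing and non-missing reference shares, each distributed over delays, and both splits must sum back to the total.

```python
# Hypothetical numbers illustrating the allocation checks: splitting
# expected observations by missingness and delay should conserve totals.
imp_obs = 100.0                  # expected observations, one reference date
prop_missing = 0.3               # share with a missing reference date
delay_pmf = [0.5, 0.3, 0.2]      # reporting delay distribution (sums to 1)

non_missing = [imp_obs * (1 - prop_missing) * p for p in delay_pmf]
missing = [imp_obs * prop_missing * p for p in delay_pmf]

total = sum(non_missing) + sum(missing)
assert abs(total - imp_obs) < 1e-9  # nothing leaks in the allocation
```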

@adrian-lison (Collaborator) commented:

Since I am running out of possible implementation-level explanations for the observed behavior, I am starting to think about the model specification...

What seems to happen (though I am not sure it is the only problem) is that, if specified flexibly, the share of cases with a missing reference date can compensate for a bad fit of imp_obs by producing too many cases with missing reference dates and distributing them over the delays. Since the likelihood for the cases with missing reference dates is less sharp (it integrates over delays), the model can get away with this, especially if case numbers are low; see the examples below.
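The "less sharp" likelihood can be illustrated with a toy example (hypothetical numbers): cases with a known reference date constrain each (reference, delay) cell individually, while cases with a missing reference date are only observed as a sum over delays, so many different delay allocations fit the marginal total equally well.

```python
# Toy illustration: a missing-reference observation only pins down the
# sum over delays, not the delay-specific counts.
obs_by_delay = [12, 6, 2]        # known reference: each cell is observed
obs_total = sum(obs_by_delay)    # missing reference: only this is observed

# A quite different allocation is indistinguishable at the marginal level:
alt_allocation = [8, 8, 4]
assert sum(alt_allocation) == obs_total
```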

My question is whether this is maybe an artefact of the parametric delay distribution not fitting super well to this real-world dataset. Perhaps we should do some tests with simulated data where everything is well-behaved.
I could also feed the Germany dataset with the simulated missingness into my model implementation (since it has a non-parametric reporting delay) and see what happens.

Example 1 (40% missingness)

[Plots: imp_obs and missing proportion, 40% missingness]
(note that the dashed line is the realized missing proportion through binomial sampling in the simulation)
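The binomial sampling referred to here can be sketched as follows (illustrative, not the actual simulation code): each case is independently flagged as having a missing reference date with probability prop_missing, so the realised proportion fluctuates around the target rather than matching it exactly.

```python
# Sketch of binomial thinning for the missingness simulation; the case
# counts and probability are hypothetical.
import random

random.seed(42)
cases = [120, 95, 140, 80]       # cases per reference date
prop_missing = 0.4

# Flag each case as missing-reference independently with prob 0.4.
missing = [sum(random.random() < prop_missing for _ in range(n)) for n in cases]
realised = [m / n for m, n in zip(missing, cases)]
print(realised)  # fluctuates around 0.4 (the dashed line in the plots)
```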

It even looks as if the share of cases with missing reference is also capturing some of the weekday effect... This would be quite weird, and I haven't seen something like that with my models.

Example 2 (10% missingness)

[Plots: imp_obs and missing proportion, 10% missingness]
This basically illustrates that the problem does not go away even if missingness is small.

@seabbs (Collaborator, Author) left a comment:

@adrian-lison changes look good. Updates to model documentation are a big improvement.

@seabbs (Collaborator, Author) commented Sep 1, 2022

Unfortunately, no solution has been found yet for the wrong missingness proportions...

I thought that this issue might be due to how we define the random walk (i.e. it starts from t = 0, where data are sparse) versus the issue found elsewhere? As we discussed, it's not totally clear to me what to do about this with the current implementation, and it perhaps points to a flaw that should be corrected.

@seabbs (Collaborator, Author) commented Sep 1, 2022

Since I am running out of possible implementation-level explanations for the observed behaviour, I am starting to think about the model specification...

The model being overly flexible in the wrong places is a sensible suggestion. If we moved testing to the implementations that also have a flexible expectation model, would that perhaps help explore this?

the model can get away with that especially if case numbers are low, see the examples below.

Could this also be a tension between picking a more extreme growth rate and the right missingness proportion? The log-normal random walk may limit the amount of change over time required to fit this data well.
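As a sketch of the flexibility at issue (the logit parameterisation, starting value, and step size here are assumptions for illustration, not the package's exact model): a daily random walk on the transformed missing proportion, where the step standard deviation controls how fast the proportion can move.

```python
# Hypothetical random walk on the logit of the missing proportion.
import math
import random

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

random.seed(1)
t, sigma = 30, 0.1               # days; daily step standard deviation
x = math.log(0.4 / 0.6)          # start near a proportion of 0.4
prop_missing = []
for _ in range(t):
    x += random.gauss(0, sigma)  # daily random-walk step
    prop_missing.append(inv_logit(x))

# A small sigma keeps the proportion near its start; a large sigma lets
# it absorb misfit from elsewhere in the model.
assert all(0 < p < 1 for p in prop_missing)
```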

My question is whether this is maybe an artefact of the parametric delay distribution not fitting super well to this real-world dataset. Perhaps we should do some tests with simulated data where everything is well-behaved.
I could also feed the Germany dataset with the simulated missingness into my model implementation (since it has a non-parametric reporting delay) and see what happens.

This is a good suggestion and could well be the root cause. Both options for exploring are sensible. In general, we are crying out for a model simulator to ease testing and boost confidence.

Something that also concerns me is that perhaps this is highlighting a bug elsewhere, perhaps in the model specification or the allocation of delays? I was fairly confident that was all working as intended, but it hasn't recently been tested on clean simulated data (i.e. not since initial development).

It even looks as if the share of cases with missing reference is also capturing some of the weekday effect... This would be quite weird, and I haven't seen something like that with my models.

I found this quite concerning, and it reinforces my worry that there is a bug elsewhere. It seems quite hard to explain or understand.

In terms of moving this forward, I wonder if we should merge this into develop with a big warning pointing back at this PR, then work up #152 and #154 to MVP status and merge them, and after that circle back to this before release. More eyes on this would likely help, and it's starting to get quite weighty.

(This is a great investigation, by the way. Can you put the exploratory code for this somewhere so we can explore more in the future?)

@seabbs (Collaborator, Author) commented Sep 1, 2022

Proposed tasks to get an MVP to merge:

  • Fix missing_reference where multiple groups are present
  • Add experimental warning
  • Add Adrian to news items

@adrian-lison (Collaborator) commented:

Hey Sam, thanks for your replies! I agree that adding this to develop may be a good idea. I think I can work on the open tasks as you suggested above.

I can add my investigation code to the example script we already have.

For my own models, I have written a simulator that creates linelists based on a renewal process, which can then be used to create nowcasting datasets. I could maybe create a PR for this, although it would take more work to get it into the style of epinowcast (e.g. switching from dplyr to data.table).

@seabbs (Collaborator, Author) commented Sep 1, 2022

So I think I have covered the minimum ground required to merge this as is.

Can add my investigation code to the example script we already have.

I think this would be useful.

For my own models, I have written a simulator that creates linelists based on a renewal process, which can then be used to create nowcasting datasets. I could maybe create a PR for this, although it would be more work to get it into the style of epinowcast (need to switch from dplyr to data.table e.g.).

Yeah, I think having something in the repository we can play with would be great, regardless of implementation style. Perhaps let's add it to the repo and put it in .Rbuildignore to avoid having to add its dependencies to Suggests?

@seabbs seabbs requested a review from adrian-lison September 1, 2022 20:10
@seabbs seabbs merged commit 298a769 into develop Sep 20, 2022
@seabbs seabbs deleted the feature-missing-reference-function branch September 20, 2022 09:23
@seabbs (Collaborator, Author) commented Sep 20, 2022

As discussed I have gone ahead and merged this into develop as an experimental feature.
