
new Difference In Differences notebook #424

Merged — 21 commits, Oct 6, 2022

Conversation

drbenvincent (Contributor)

This adds a new notebook covering the difference in differences approach to causal inference.


review-notebook-app bot commented Sep 24, 2022

lucianopaz commented on 2022-09-24T19:41:11Z
----------------------------------------------------------------

Another way of saying this is that treatment would be fully determined by time, so there is no way to dissociate the changes in the pre and post outcome measures as being caused by treatment or time.

Doesn’t this actually mean that time and observed treatment outcomes are confounded or undetermined? I don’t remember the exact term from causal inference, but I think that time doesn’t determine outcome as much as you cannot disambiguate


drbenvincent commented on 2022-09-25T10:02:49Z
----------------------------------------------------------------

Changed to "disambiguate"

review-notebook-app bot commented Sep 24, 2022

lucianopaz commented on 2022-09-24T19:41:12Z
----------------------------------------------------------------

Typo: first let’s define a Python function


drbenvincent commented on 2022-09-25T09:49:27Z
----------------------------------------------------------------

resolved in an upcoming commit

@lucianopaz (Collaborator) left a comment

It looks very nice @drbenvincent! I left two comments on things that could be clarified.
On the other hand, I would have loved it if you had shown the results of an alternative, simpler model: for example, one that only looks at the treatment group outcome, or some other naive method that ignores treatment time or neglects the pre-test observations. Those should return biased or highly uncertain estimates of treatment effects, while the difference in differences estimate should be unbiased and have low variance.
If you feel that doing that is out of scope for an example, that's fine.
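The comparison described here can be sketched in a few lines (a hypothetical simulation with made-up numbers, not code from the notebook): both groups share a trend but differ in baseline, so a naive post-only comparison absorbs the baseline gap while the DiD contrast cancels it.

```python
import numpy as np

rng = np.random.default_rng(42)

true_effect = 0.5   # assumed causal effect of treatment
baseline_gap = 1.0  # treatment group starts higher (no randomisation)
trend = 1.0         # shared trend, so parallel trends holds
n = 1000

ctrl_pre = rng.normal(0.0, 0.1, n)
ctrl_post = rng.normal(trend, 0.1, n)
treat_pre = rng.normal(baseline_gap, 0.1, n)
treat_post = rng.normal(baseline_gap + trend + true_effect, 0.1, n)

# Naive estimate: post-treatment group difference only -> absorbs the gap
naive = treat_post.mean() - ctrl_post.mean()

# DiD estimate: difference of the pre/post differences -> the gap cancels
did = (treat_post.mean() - treat_pre.mean()) - (ctrl_post.mean() - ctrl_pre.mean())

print(f"naive: {naive:.2f}, DiD: {did:.2f}")  # naive ≈ 1.5, DiD ≈ 0.5
```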

review-notebook-app bot commented Sep 25, 2022

Armavica commented on 2022-09-25T05:33:13Z
----------------------------------------------------------------

typo: you want to know the causal imapact

typo: "What would the post-treatment outcome be of the treatment group be [...]"


drbenvincent commented on 2022-09-25T09:29:17Z
----------------------------------------------------------------

Thanks. Fixed in an upcoming commit

review-notebook-app bot commented Sep 25, 2022

Armavica commented on 2022-09-25T05:33:14Z
----------------------------------------------------------------

I don't really understand the difference between the "treatment" and "group" variables in this graph, because belonging to a given group strictly determines receiving the treatment or not. Moreover, shouldn't the pre-treatment variable be included in this graph? Or maybe this is what "group" means?

I actually don't understand why four variables are necessary on this DAG. I am sure that there are several valid ways to do this, but my initial thought was to reason only in terms of the pre-treatment metric before_i, the post-treatment metric after_i, and a binary indicator treated_i describing whether the patient received the treatment. We have before -> treatment, before -> after, and treatment -> after. Then we would simply have the regression after_i = before_i + trend + Delta * treated_i. We can add Gaussian observation noise on both before and after, and counterfactuals can also be computed. Overall, I have the impression that this 3-variable model matches the results of this notebook. Is there something that I am missing? Or is the more complex model only needed for more complex datasets, for example when we have measurements at intermediate time points?
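The 3-variable regression proposed above can be sketched with ordinary least squares (a hypothetical illustration with made-up parameter values, not the notebook's PyMC model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

treated = rng.integers(0, 2, n).astype(float)  # binary treatment indicator
before = rng.normal(0.5 * treated, 0.1)        # baseline may depend on group
trend, delta = 1.0, 0.5                        # assumed trend and effect size
after = before + trend + delta * treated + rng.normal(0, 0.1, n)

# Fit after_i = b * before_i + trend + Delta * treated_i by least squares
X = np.column_stack([before, np.ones(n), treated])
coef, *_ = np.linalg.lstsq(X, after, rcond=None)
# coef recovers roughly [1.0, 1.0, 0.5]: slope on before, trend, Delta
```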


drbenvincent commented on 2022-09-25T09:48:51Z
----------------------------------------------------------------

Hi

The DAG is the same as Fig 18.2 from The Effect https://theeffectbook.net/ch-DifferenceinDifference.html

In an upcoming commit I've added explanation for why treatment and group are different. Namely, the treatment group is only treated after the intervention time.

I think a 3-variable setup could match the results, but you're thinking statistically, not causally. There is a bit of a mindset shift involved. The 4-variable model captures the actual causal relationships. The idea is that if you intervene (with the do operator) then all the child nodes should be updated accordingly.

Armavica commented on 2022-09-25T15:52:09Z
----------------------------------------------------------------

Thank you for your answer. I think that my main confusion comes from the inclusion of "time" as an explicit variable of this DAG, but it might only be a concern with the DiD method and not with your work. From several sources (the DAG wikipedia article, the Statistical Rethinking book, etc.) I understood that the nodes of a causal model describe events taking place at specific times, and that arrows mean "... has a direct causal influence on ...". Causality ensuring that arrows are only drawn from earlier events/nodes to later events/nodes, the graph is acyclic by construction. Time is an intrinsic component of the model because it imposes the direction of the arrows. But including an explicit node "time" breaks this framework and I don't understand what this kind of DAG means anymore.

Indeed, what should "time has a direct causal influence on ..." mean? And what about the opposite: could one ever draw an arrow towards the "time" node, or is the "time" node special in the sense that no arrow can point to it? And what ensures that the rest of the graph is acyclic, if arrows do not have this intrinsic constraint of going from an earlier event to a later event?

I think a 3-variable setup could match the results, but you're thinking statistically, not causally. There is a bit of a mindset shift involved. The 4-variable model captures the actual causal relationships. The idea is that if you intervene (with the do operator) then all the child nodes should be updated accordingly.

This could very well be the case: I am also learning. But I fail to understand why this 3-variable model would not capture the causal relationships, or would not allow to update child nodes after an intervention. Changing the initial metric has causal repercussions on the treatment decision as well as on the final value, and changing the treatment variable has causal repercussions on the final value. All of the arrows in this model are causal because of the direction of time.

However, in the 4-variable model with the "time" node, because the initial metric and final metric are represented with the same variable "mu" (node "outcome"), how would one describe an intervention on the initial metric with the do operator? Also, is it really causal to have "treatment" and "group" point at "outcome" if "outcome" can represent both the initial and the final value, even though "treatment" and "group" cannot causally influence the initial value?

I apologize if my questions don't make sense or if they are trivial, and I also understand that this notebook is not meant to be a causal inference textbook, so please feel free to ignore my comments if they are too "philosophical" or irrelevant!

review-notebook-app bot commented Sep 25, 2022

Armavica commented on 2022-09-25T05:33:15Z
----------------------------------------------------------------

typo?: dummary: dummy? summary? binary? :)


drbenvincent commented on 2022-09-25T09:50:03Z
----------------------------------------------------------------

Thanks. Fixed in an upcoming commit

review-notebook-app bot commented Sep 25, 2022

Armavica commented on 2022-09-25T05:33:15Z
----------------------------------------------------------------

Line #46.    t = np.array([0, 1])

Maybe rename this variable to avoid switching between this and np.linspace(-0.5, 1.5, 1000)?


drbenvincent commented on 2022-09-25T10:00:33Z
----------------------------------------------------------------

Did a bit of rearranging with the time variables


Armavica commented on 2022-09-25T05:33:16Z
----------------------------------------------------------------

Line #20.    df["y"] += rng.normal(0, 0.1, size=df.shape[0])

to make it reproducible

And perhaps define sigma = 0.1 as observation error with the other "true parameters" above?
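The reproducibility suggestion amounts to drawing the observation noise from the seeded generator defined at the top of the notebook; a minimal sketch (the seed value and sigma here are illustrative):

```python
import numpy as np

RANDOM_SEED = 8927  # hypothetical; the notebook defines its own seed
rng = np.random.default_rng(RANDOM_SEED)

sigma = 0.1  # observation error, defined alongside the other "true parameters"
noise = rng.normal(0, sigma, size=5)

# A fresh generator with the same seed reproduces identical draws
rng2 = np.random.default_rng(RANDOM_SEED)
assert np.allclose(noise, rng2.normal(0, sigma, size=5))
```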



Armavica commented on 2022-09-25T05:33:17Z
----------------------------------------------------------------

Line #2.        idata = pm.sample(random_seed=RANDOM_SEED)

to make it reproducible



Armavica commented on 2022-09-25T05:33:17Z
----------------------------------------------------------------

typo: its inputs


review-notebook-app bot commented Sep 25, 2022

Armavica commented on 2022-09-25T05:33:18Z
----------------------------------------------------------------

Line #1.    t = np.linspace(-0.5, 1.5, 100)

Already defined like this earlier, but shadowed in the meantime by t = np.array([0, 1])


@Armavica (Member)

Armavica commented Sep 25, 2022

Thank you for the notebook which was an interesting read! I found a few typos and I think similarly to @lucianopaz I wonder about a simpler model which I think is equivalent to this method, but I may be missing something (see my second comment). I don't really understand the distinction between group and treatment in this particular example, but perhaps this distinction is useful for more complex data?
Another lint: you use "difference in differences" 15 times but "differenceS in differences" 4 times, which I am not sure is intentional?


@drbenvincent (Contributor Author)

Thanks for the feedback @lucianopaz and @Armavica. I think I've addressed all the typos and points of confusion.

In terms of a simpler model, I think I might come back and look at that another time. My focus at the moment (when I have time for these notebooks) is to lay down a number of notebooks relating to causal inference, and then loop back around at some point when I've got more feedback and my understanding has increased.


@Armavica (Member)

It looks like you define a RANDOM_SEED and a seeded prng in the beginning of the notebook but never use them later, or am I missing something?

@drbenvincent (Contributor Author)

It looks like you define a RANDOM_SEED and a seeded prng in the beginning of the notebook but never use them later, or am I missing something?

Thanks, I forgot the style guide... I should be using rng.normal, which uses the random seed, rather than scipy.stats.norm.

@drbenvincent (Contributor Author)

No, these are good questions and I am certainly not at expert level yet. And I may have been a bit hasty with my reply about statistical vs causal thinking. (Although that might be an interesting thing to write about separately)

Your concerns seem to be focussed on 'unfolding nodes over time' vs not doing that. I think there are different conventions about this. Maybe it makes sense if you just consider two discrete points in time (pre/post) but it's less workable if you want to evaluate over continuous (or many discrete values of) time.

I don't know if I can fully articulate a decent answer at this point in my learning, but a few points:

  • The way in which the time node is thought about is expanded on in this video https://www.youtube.com/watch?v=ggYnrOGG97o
  • I think we can separate different ideas here. As you say, an edge can be thought of as representing time because of temporal precedence. But that seems like a slightly different notion from something changing as a function of time. It doesn't seem that tricky if you just think of $time \rightarrow treatment$ as describing $treatment = f(time)$, for example.
  • Yes, it's a fair point that the post-intervention outcome can be causally influenced by the pre-intervention outcome, and that it would make sense to represent this as outcomes unfolded over time. But like I say, I think this is a convention thing, and different approaches might be better suited in different situations.
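The reading of $time \rightarrow treatment$ as $treatment = f(time)$ can be made concrete (a minimal sketch with assumed names, matching the point that the treatment group is only treated after the intervention time):

```python
import numpy as np

intervention_time = 0.5  # assumed; any threshold between pre and post works

def treatment(t, group):
    """Treatment is fully determined by time and group: only units in the
    treatment group observed after the intervention are treated."""
    return (np.asarray(group) == 1) & (np.asarray(t) >= intervention_time)

t = np.array([0.0, 0.0, 1.0, 1.0])
group = np.array([0, 1, 0, 1])
print(treatment(t, group).astype(int))  # → [0 0 0 1]
```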

I think that's all I've got at this point. Onwards with the learning. Like I say, I hope to loop back around to my notebooks in the future to disambiguate and further clarify. But yes, I have to not fall into the trap of trying to write a textbook chapter!

@Armavica (Member)

Yes, perhaps it is just a convention, in which case whatever seems more intuitive for every particular modeler and model should work :) Thank you for this discussion in any case, it was interesting to learn about this framework!

@twiecki (Member)

twiecki commented Sep 29, 2022

Just read through it and could follow everything nicely. One thought I had was that instead of just a linear model it could be e.g. a GP to model a more complex time-series. Of course that's outside the scope of this post, and I'm not sure if it should even be mentioned.

@drbenvincent (Contributor Author)

Just read through it and could follow everything nicely. One thought I had was that instead of just a linear model it could be e.g. a GP to model a more complex time-series. Of course that's outside the scope of this post, and I'm not sure if it should even be mentioned.

Thanks.

I think GPs could perhaps be more relevant in an interrupted time series type design, so I'll not mention them here.


lucianopaz commented on 2022-09-29T20:05:25Z
----------------------------------------------------------------

I thought that you were going to use the minimum wage and unemployment dataset here. I’m confused, why did you have to talk about that dataset if you weren’t going to use it?


review-notebook-app bot commented Sep 29, 2022

lucianopaz commented on 2022-09-29T20:05:25Z
----------------------------------------------------------------

I’m a bit fuzzy about the DiD model. If you had in fact been able to randomise the group membership, what would have changed in the model? Am I correct in thinking that the DiD only assumes that the intercept changes? The slope is assumed to be shared, and then there’s the effect size, so the only thing left is the intercept. This sounds like I’m missing something else. Could it be that DiD also gives unbiased estimates of treatment effects even if you have unbalanced recordings between the groups for the different times (e.g. more control measurements in pre, more treatment measurements in post)?


drbenvincent commented on 2022-10-06T07:15:40Z
----------------------------------------------------------------

I’m a bit fuzzy about the DiD model. If you had in fact been able to randomise the group membership, what would have changed in the model? 

If there was randomisation then we are outside of the realm of quasi-experiments. So you could do something more like A/B tests in that case.

drbenvincent commented on 2022-10-06T07:22:58Z
----------------------------------------------------------------

Am I correct in thinking that the DiD only assumes that the intercept changes? The slope is assumed to be shared, and then there’s the effect size, so the only thing left is the intercept. This sounds like I’m missing something else.

The parallel trends assumption assumes that both groups are evolving identically _in the absence of treatment_. But in the presence of treatment there is some unknown deflection of the treatment group, so there is a free parameter for the slope of the treatment group.

Then you also have a free parameter for an intercept difference. This can be interpreted as capturing differences in the groups (prior to treatment) which might be expected because of the lack of randomisation.
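These two free parameters can be written into a linear predictor (a sketch with illustrative parameter values and names, not the notebook's exact parameterisation): a shared trend, a free intercept offset for the non-randomised group difference, and a deflection that applies only where treatment is active.

```python
def mu(t, group, treated, beta0=1.0, trend=1.0, group_diff=0.3, delta=0.5):
    """Expected outcome: shared trend, intercept offset for the treatment
    group, and a deflection delta only where treatment is active.
    Parameter values are illustrative, not from the notebook."""
    return beta0 + trend * t + group_diff * group + delta * treated

# The DiD contrast recovers delta: the trend and intercept offset cancel
pre, post = 0.0, 1.0
did = (mu(post, 1, 1) - mu(pre, 1, 0)) - (mu(post, 0, 0) - mu(pre, 0, 0))
print(did)  # → 0.5
```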

drbenvincent commented on 2022-10-06T07:30:35Z
----------------------------------------------------------------

Could it be that DiD also gives unbiased estimates of treatment effects even if you have unbalanced recordings between the groups for the different times (e.g. more control measurements in pre, more treatment measurements in post)?

As far as I understand it, the assumption is that you have panel / repeated measures data, i.e. you have data points for each unit at both pre- and post-treatment times. But I think there's nothing in the model formulation at the moment which relies upon this; at the moment you could have unbalanced recordings between the groups for the different times.

But that does actually seem odd to me now that you bring it up. Wouldn't it be more sane to have a hierarchical model where you are modelling the slopes and intercepts at the unit and the population level and you draw your conclusions from the population level parameters? What do you think? Maybe that's worth a note and perhaps a different follow up hierarchical difference in differences example notebook?

EDIT: hierarchical difference in differences already exists (e.g. https://arxiv.org/abs/1910.07017). This paper doesn't look like the most easily accessible for practitioners, so maybe a follow up notebook on that could be useful?


@drbenvincent (Contributor Author)

lucianopaz commented on 2022-09-29T20:05:25Z
----------------------------------------------------------------

I thought that you were going to use the minimum wage and unemployment dataset here. I’m confused, why did you have to talk about that dataset if you weren’t going to use it?

I just wanted to give a tangible example in case some readers found the rest of the introduction too abstract. But I could always add another one or two examples if you think that having just one sets up the wrong expectations?

@twiecki twiecki merged commit ba37aaa into pymc-devs:main Oct 6, 2022
@drbenvincent drbenvincent deleted the DiD branch October 6, 2022 19:23