new Difference In Differences notebook #424
Conversation
lucianopaz commented on 2022-09-24T19:41:11Z:
Another way of saying this is that treatment would be fully determined by time, so there is no way to dissociate the changes in the pre and post outcome measures as being caused by treatment or time. Doesn’t this actually mean that time and observed treatment outcomes are confounded or undetermined? I don’t remember the exact term from causal inference, but I think that time doesn’t determine outcome as much as you cannot disambiguate

drbenvincent commented on 2022-09-25T10:02:49Z:
Changed to "disambiguate"
lucianopaz commented on 2022-09-24T19:41:12Z:
Typo: first let’s define a Python function

drbenvincent commented on 2022-09-25T09:49:27Z:
Resolved in an upcoming commit
It looks very nice @drbenvincent! I left two comments on things that could be clarified.
On the other hand, I would have loved it if you had shown the results of an alternative, simpler model. For example, one that only looks at the treatment group outcome, or some other naive method that ignores treatment time or neglects the pre-test observations. Those should return biased or highly uncertain estimates of treatment effects, while the difference in differences should be unbiased and have low variance.
If you feel that doing that is out of scope of an example, it’s fine.
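Purely as a sketch of this suggestion (this is not code from the notebook, and all data-generating numbers here are invented), the two naive estimators the reviewer mentions could be contrasted with the classic 2x2 difference in differences on simulated data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: both groups share a time trend, the treatment group
# starts at a different baseline, and the true treatment effect is 0.5.
n = 1000
trend, group_offset, true_effect = 1.0, 0.3, 0.5

pre_control = 1.0 + rng.normal(0, 0.1, n)
post_control = 1.0 + trend + rng.normal(0, 0.1, n)
pre_treated = 1.0 + group_offset + rng.normal(0, 0.1, n)
post_treated = 1.0 + group_offset + trend + true_effect + rng.normal(0, 0.1, n)

# Naive estimate 1: pre/post change in the treatment group alone
# (confounds the time trend with the treatment effect).
naive_pre_post = post_treated.mean() - pre_treated.mean()

# Naive estimate 2: post-treatment difference between groups
# (confounds the baseline group difference with the treatment effect).
naive_between = post_treated.mean() - post_control.mean()

# Difference in differences: the group difference in pre/post changes.
did = (post_treated.mean() - pre_treated.mean()) - (
    post_control.mean() - pre_control.mean()
)

print(f"naive pre/post: {naive_pre_post:.2f}")  # ~ trend + effect = 1.5
print(f"naive between:  {naive_between:.2f}")   # ~ offset + effect = 0.8
print(f"diff-in-diff:   {did:.2f}")             # ~ effect = 0.5
```

Both naive estimates land away from the true effect of 0.5, while the difference in differences recovers it, which is the bias the reviewer describes.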
Armavica commented on 2022-09-25T05:33:13Z:
typo: you want to know the causal im
typo: "What would the post-treatment outcome

drbenvincent commented on 2022-09-25T09:29:17Z:
Thanks. Fixed in an upcoming commit
Armavica commented on 2022-09-25T05:33:14Z:
I don't really understand the difference between the "treatment" and "group" variables in this graph, because belonging to a given group strictly determines receiving a treatment or not. Moreover, shouldn't the pre-treatment variable be included in this graph? But maybe this is what "group" means?
I actually don't understand why four variables are necessary on this DAG. I am sure that there are several valid ways to do this, but my initial thought was to reason only in terms of the pre-treatment metric

drbenvincent commented on 2022-09-25T09:48:51Z:
Hi. The DAG is the same as Fig 18.2 from The Effect: https://theeffectbook.net/ch-DifferenceinDifference.html
In an upcoming commit I've added an explanation of why treatment and group are different. Namely, the treatment group is only treated after the intervention time.
I think a 3-variable setup could match the results, but you're thinking statistically, not causally. There is a bit of a mindset shift involved. The 4-variable model captures the actual causal relationships. The idea is that if you intervene (with the do operator) then all the child nodes should be updated accordingly.
Armavica commented on 2022-09-25T15:52:09Z:
Thank you for your answer. I think that my main confusion comes from the inclusion of "time" as an explicit variable of this DAG, but it might only be a concern with the DiD method and not with your work. From several sources (the DAG Wikipedia article, the Statistical Rethinking book, etc.) I understood that the nodes of a causal model describe events taking place at specific times, and that arrows mean "... has a direct causal influence on ...". Since causality ensures that arrows are only drawn from earlier events/nodes to later events/nodes, the graph is acyclic by construction. Time is an intrinsic component of the model because it imposes the direction of the arrows. But including an explicit node "time" breaks this framework and I don't understand what this kind of DAG means anymore.
Indeed, what should "time has a direct causal influence on ..." mean? And what about the opposite: could one ever draw an arrow towards the "time" node, or is the "time" node special in the sense that no arrow can point to it? And what ensures that the rest of the graph is acyclic, if arrows do not have this intrinsic constraint of going from an earlier event to a later event?
> I think a 3-variable setup could match the results, but you're thinking statistically, not causally. There is a bit of a mindset shift involved. The 4-variable model captures the actual causal relationships. The idea is that if you intervene (with the do operator) then all the child nodes should be updated accordingly.
This could very well be the case: I am also learning. But I fail to understand why this 3-variable model would not capture the causal relationships, or would not allow to update child nodes after an intervention. Changing the initial metric has causal repercussions on the treatment decision as well as on the final value, and changing the treatment variable has causal repercussions on the final value. All of the arrows in this model are causal because of the direction of time. However, in the 4-variable model with the "time" node, because the initial metric and final metric are represented with the same variable "mu" (node "outcome"), how would one describe an intervention on the initial metric with the do operator? Also, is it really causal to have "treatment" and "group" point at "outcome" if "outcome" can represent both the initial and the final value, even though "treatment" and "group" cannot causally influence the initial value?
I apologize if my questions don't make sense or if they are trivial, and I also understand that this notebook is not meant to be a causal inference textbook, so please feel free to ignore my comments if they are too "philosophical" or irrelevant!
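As a side note on the do-operator point raised in this thread, a toy structural-equation reading of the 4-variable DAG (time and group determine treatment; time, group, and treatment determine outcome) might look like the following. All coefficient values are invented for illustration and are not the notebook's model:

```python
def treatment(group: int, time: int, intervention_time: int = 1) -> int:
    """The treatment group is only treated after the intervention time."""
    return int(group == 1 and time >= intervention_time)


def outcome(group: int, time: int, treat: int, effect: float = 0.5) -> float:
    """Outcome depends on time (trend), group (baseline offset), and treatment."""
    return 1.0 + 1.0 * time + 0.3 * group + effect * treat


# Observationally, treatment is fully determined by its parents (group, time):
obs = outcome(group=0, time=1, treat=treatment(group=0, time=1))

# Under do(treatment=1) we sever the arrows into `treatment` and set it by
# hand; the child node `outcome` then updates accordingly:
intervened = outcome(group=0, time=1, treat=1)

print(obs, intervened, intervened - obs)  # gap of 0.5, the treatment effect
```

This is one way to make "all the child nodes should be updated accordingly" concrete: the intervention replaces the structural equation for treatment but leaves the equation for outcome intact.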
Armavica commented on 2022-09-25T05:33:15Z:
typo?: dummary: dummy? summary? binary? :)

drbenvincent commented on 2022-09-25T09:50:03Z:
Thanks. Fixed in an upcoming commit
Armavica commented on 2022-09-25T05:33:15Z:
Line #46: t = np.array([0, 1]). Maybe rename this variable to avoid switching between this and

drbenvincent commented on 2022-09-25T10:00:33Z:
Did a bit of rearranging with the time variables
Armavica commented on 2022-09-25T05:33:16Z:
Line #20: df["y"] += rng.normal(0, 0.1, size=df.shape[0]) to make it reproducible. And perhaps define
Armavica commented on 2022-09-25T05:33:17Z:
Line #2: idata = pm.sample(random_seed=RANDOM_SEED) to make it reproducible
Armavica commented on 2022-09-25T05:33:17Z:
typo: its inputs
Armavica commented on 2022-09-25T05:33:18Z:
Line #1: t = np.linspace(-0.5, 1.5, 100). Already defined like this earlier but shadowed in the meantime by
Thank you for the notebook, which was an interesting read! I found a few typos and, similarly to @lucianopaz, I wonder about a simpler model which I think is equivalent to this method, but I may be missing something (see my second comment). I don't really understand the distinction between
Thanks for the feedback @lucianopaz and @Armavica. I think I've addressed all the typos and points of confusion. In terms of a simpler model, I think I might come back and look at that another time. My focus at the moment (when I have time for these notebooks) is to try to lay down a number of notebooks relating to causal inference, and then loop back around at some point when I've got more feedback and my understanding has increased.
It looks like you define a
Thanks, I forgot the style guide... I should be using
No, these are good questions and I am certainly not at expert level yet. And I may have been a bit hasty with my reply about statistical vs causal thinking. (Although that might be an interesting thing to write about separately.) Your concerns seem to be focussed on 'unfolding nodes over time' vs not doing that. I think there are different conventions about this. Maybe it makes sense if you just consider two discrete points in time (pre/post), but it's less workable if you want to evaluate over continuous (or many discrete values of) time. I don't know if I can yet fully articulate a decent answer at this point in my learning, but a few points:
I think that's all I've got at this point. Onwards with the learning. Like I say, I hope to loop back around to my notebooks in the future to disambiguate and further clarify. But yes, I have to not fall into the trap of trying to write a textbook chapter!
Yes, perhaps it is just a convention, in which case whatever seems more intuitive for every particular modeler and model should work :) Thank you for this discussion in any case, it was interesting to learn about this framework! |
Just read through it and could follow everything nicely. One thought I had was that instead of just a linear model it could be e.g. a GP to model a more complex time-series. Of course that's outside the scope of this post, and I'm not sure if it should even be mentioned. |
Thanks. I think GPs could be more relevant in an interrupted time series type design, perhaps. So I think I'll not mention it here.
lucianopaz commented on 2022-09-29T20:05:25Z:
I thought that you were going to use the minimum wage and unemployment dataset here. I’m confused: why did you have to talk about that dataset if you weren’t going to use it?
lucianopaz commented on 2022-09-29T20:05:25Z:
I’m a bit fuzzy about the DiD model. If you had in fact been able to randomise the group membership, what would have changed in the model? Am I correct in thinking that the DiD only assumes that the intercept changes? The slope is assumed to be shared, and then there’s the effect size, so the only thing left is the intercept. This sounds like I’m missing something else. Could it be that DiD also gives unbiased estimates of treatment effects even if you have unbalanced recordings between the groups for the different times (e.g. more control measurements in pre, more treatment measurements in post)?

drbenvincent commented on 2022-10-06T07:15:40Z:
> If you had in fact been able to randomise the group membership, what would have changed in the model?

If there was randomisation then we are outside the realm of quasi-experiments. So you could do something more like A/B tests in that case.

drbenvincent commented on 2022-10-06T07:22:58Z:
> Am I correct in thinking that the DiD only assumes that the intercept changes? The slope is assumed to be shared, and then there’s the effect size, so the only thing left is the intercept. This sounds like I’m missing something else.

The parallel trends assumption assumes that both groups are evolving identically _in the absence of treatment_. But in the presence of treatment there is some unknown deflection of the treatment group, so there is a free parameter for the slope of the treatment group. Then you also have a free parameter for an intercept difference. This can be interpreted as capturing differences in the groups (prior to treatment) which might be expected because of the lack of randomisation.
drbenvincent commented on 2022-10-06T07:30:35Z:
> Could it be that DiD also gives unbiased estimates of treatment effects even if you have unbalanced recordings between the groups for the different times (e.g. more control measurements in pre, more treatment measurements in post)?

As far as I understand it, the assumption is that you have panel / repeated measures data, i.e. you have data points for each unit at both pre- and post-treatment times. But I think there's nothing in the model formulation at the moment which relies upon this. At the moment you could have unbalanced recordings between the groups for the different times.

But that does actually seem odd to me now that you bring it up. Wouldn't it be more sane to have a hierarchical model where you model the slopes and intercepts at the unit and the population level, and draw your conclusions from the population-level parameters? What do you think? Maybe that's worth a note, and perhaps a follow-up hierarchical difference in differences example notebook?

EDIT: hierarchical difference in differences already exists (e.g. https://arxiv.org/abs/1910.07017). This paper doesn't look like the most accessible for practitioners, so maybe a follow-up notebook on that could be useful?
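The parameterisation described in this thread (a shared trend, an intercept offset for the treatment group, and a free deflection of the treated group after the intervention) can be sketched as an ordinary least-squares fit on simulated data. This is an illustrative sketch, not the notebook's PyMC model, and all coefficient values are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
group = rng.integers(0, 2, n)   # 0 = control, 1 = treatment
time = rng.integers(0, 2, n)    # 0 = pre, 1 = post
treated = group * time          # treatment group is only treated post-intervention
# Invented ground truth: intercept 1.0, shared trend 1.0,
# group intercept offset 0.3, treatment effect 0.5.
y = (1.0 + 1.0 * time + 0.3 * group + 0.5 * treated
     + rng.normal(0, 0.1, n))

# OLS on the design matrix [1, time, group, group*time]; the coefficient on
# the interaction term is the difference-in-differences estimate.
X = np.column_stack([np.ones(n), time, group, treated])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta.round(2))  # last coefficient recovers ~0.5
```

The interaction coefficient is the free "deflection" parameter, while the group coefficient absorbs the pre-existing (non-randomised) difference between the groups.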
I just wanted to give a tangible example in case some readers found the rest of the introduction too abstract. But I could always add another one or two examples if you think that having just the one sets up the wrong expectations?
This adds a new notebook covering the difference in differences approach to causal inference.