Consider this chapter more about the mechanics of matching when you have exact and approximate matching situations. That is, D and X are independent of one another conditional on the propensity score, or D ⊥ X | p(X). From this we also obtain the balancing property of the propensity score, which states that conditional on the propensity score, the distribution of the covariates is the same for treatment as it is for control group units. So stratifying across a probability is going to reduce that dimensionality problem. Nor are there any 14-year-old male passengers in first class. The joint density is defined as fxy(u,t). Ex ante, voters expect the candidate to choose some policy and they expect the candidate to win with probability P(xe,ye), where xe and ye are the policies chosen by Democrats and Republicans, respectively. Could there be unobserved determinants of both poverty and public assistance? And they ultimately find no evidence for discontinuities in employment at age 65 (Figure 30). To understand this, we need to learn more about the powerful notation that Splawa-Neyman [1923] developed, called potential outcomes. Potential outcomes. After all, he does not control for any economic factors, which surely affect both poverty and the amount of resources allocated to out-relief. Notice that the solid black line is negative and the slope from the bivariate regression is positive. Now, we need to use the fact that, for uncorrelated random variables, the variance of the sum is the sum of the variances. But each effect size is only about half the size of the true effect. One can use both Zi and the interaction terms as instruments for the treatment Di. The randomization inference method based on Fisher's sharp null, which will be discussed in this section, can improve upon these problems of leverage, in addition to the aforementioned reasons to consider it. In fact, large sample sizes are characteristic features of the RDD. This is most likely caused by our sample being too small relative to the size of our covariate matrix. He collected over fifty experimental (lab and field) articles from the American Economic Association's flagship journals: American Economic Review, American Economic Journal: Applied, and American Economic Journal: Economic Policy. And the second line rearranges it so that we get two terms: the estimated ATT plus the average difference in the stochastic terms for the matched sample. It looks scarier than it really is. But I would like to discuss a more innocent possibility, one that requires no conspiracy theories and yet is so basic a problem that it is in fact more worrisome. Otherwise, how do you know if you've conditioned on a collider or a noncollider? Estimated association between pauperism growth rates and public assistance. The regression anatomy theorem is based on earlier work by Frisch and Waugh [1933] and Lovell [1963].23 I find the theorem more intuitive when I think through a specific example and offer up some data visualization. If so, then increased testing could create frictions throughout the sexual network itself that would slow an epidemic. Charles Darwin, in his On the Origin of Species, summarized this by saying Natura non facit saltum, or nature does not make jumps. Or to use a favorite phrase of mine from growing up in Mississippi, if you see a turtle on a fencepost, you know he didn't get there by himself. As you can see in each of these distance formulas, there are sometimes going to be matching discrepancies.
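To make the balancing property concrete, here is a minimal R sketch on simulated data (all variable names and parameters are my own assumptions, not the book's): estimate a propensity score, stratify on its quintiles, and compare the covariate across treatment and control within strata.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)                          # a single covariate
d <- rbinom(n, 1, plogis(x))           # treatment probability rises with x
ps <- glm(d ~ x, family = binomial(link = "logit"))$fitted.values

# Raw imbalance in x between treatment and control
mean(x[d == 1]) - mean(x[d == 0])

# Within propensity-score quintiles, x should be roughly balanced
strata <- cut(ps, quantile(ps, 0:5 / 5), include.lowest = TRUE)
tapply(seq_len(n), strata, function(i) {
  mean(x[i][d[i] == 1]) - mean(x[i][d[i] == 0])
})
```

The within-stratum differences should be much smaller than the raw difference, which is the balancing property at work.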
For instance, the critique does not apply to stratification based on the propensity score [Rosenbaum and Rubin, 1983], regression adjustment, or inverse probability weighting. This will make the proof a little less cumbersome: Now that we have made these substitutions, let's rearrange the letters by redefining ATE as a weighted average of all conditional expectations. Now, substituting our definitions, we get the following: And the decomposition ends. It follows that E[eifi] = 0. Likewise, he may have the causality backwards; perhaps increased poverty causes communities to increase relief, and not merely the other way around. I recommend the package rddensity,11 which you can install for R as well.12 These packages are based on Cattaneo et al. Karl Marx was interested in the transition of society from capitalism to socialism [Needleman and Needleman, 1969]. They were then divided into crews of three to five participants who worked together and met frequently with an NSW counselor to discuss grievances with the program and performance. Learning about the propensity score is particularly valuable given that it appears to have a very long half-life. Reprinted from Mark Hoekstra, "The Effect of Attending the Flagship State University on Earnings: A Discontinuity-Based Approach," The Review of Economics and Statistics, 91:4 (November, 2009), pp. Simple: you rank these test statistics, fit the true effect into that ranking, count the number of fake test statistics that dominate the real one, and divide that number by all possible combinations. We will first discuss some basic modeling choices that researchers often make, some trivial, some important. If we used Zi and all its interactions, the estimated first stage would be: We would also construct analogous first stages for the interaction terms. If we wanted to forgo estimating the full IV model, we might estimate the reduced form only. And with a rejection threshold of, for instance, 0.05, a randomization inference test will falsely reject the sharp null less than 5 percent of the time. The second data set is hospital discharge records for California, Florida, and New York. Leverage. Collider bias is a difficult concept to understand at first, so I've included a couple of examples to help you sort through it. They show up constantly. Lalonde [1986] is an interesting study both because he is evaluating the NSW program and because he is evaluating commonly used econometric methods from that time. The constant variance assumption may not be realistic; it must be determined on a case-by-case basis. 15 See Angrist and Pischke [2009], 80-81. Once we have calculated the slope estimate β1, we can compute the intercept value, β0, as β0 = ȳ − β1x̄. This assumption is called continuity, and what it formally means is that the expected potential outcomes are continuous at the cutoff. In some respects, CIA is somewhat advanced because it requires deep institutional knowledge to say with confidence that no such unobserved confounder exists. Authors used this to construct an estimate of age in quarters at date of interview. Table 44. They are as follows: 1. There are two generally accepted kinds of RDD studies. Adam Smith wrote about the causes of the wealth of nations [Smith, 2003]. He used three samples of the Current Population Survey (CPS) and three samples of the Panel Survey of Income Dynamics (PSID) for this nonexperimental control group data, but I will use just one for each. And since we don't know with certainty that the CEF is linear, this is actually a nice argument to at least consider.
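As a quick illustration of the slope and intercept formulas, here is a minimal R sketch on simulated data (parameter values are my own choices): the slope is the sample covariance of x and y over the sample variance of x, and the intercept is backed out from the means.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)      # true intercept 2, true slope 3

b1 <- cov(x, y) / var(x)       # slope estimate
b0 <- mean(y) - b1 * mean(x)   # intercept: b0 = ybar - b1 * xbar
c(b0, b1)
coef(lm(y ~ x))                # lm() reproduces the closed-form answer
```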
Algebraic Properties of OLS. Notice for the moment that a unit's treatment status is exclusively determined by the assignment rule. Well, the average incentive in her experiment was worth about a day's wage. The method dates back about sixty years to Donald Campbell, an educational psychologist, who wrote several studies using it,1 beginning with Thistlethwaite and Campbell [1960]. In a wonderful article on the history of thought around RDD, Cook [2008] documents its social evolution. The administrative data comes from large Texas cities, a large county in California, the state of Florida, and several other cities and counties where racial bias has been reported. A fuzzy RDD represents a discontinuous jump in the probability of treatment when X > c0. It only requires that it be known, precise, and free of manipulation. What does this mean? First, we begin with covariates X and make a copy called X*. This law says that an unconditional expectation can be written as the unconditional average of the CEF. But, while that is hyperbole, for reasons we will soon see, it is nonetheless the case that the instrumental variables (IV) design is potentially one of the most important research designs ever devised. The common support assumption requires that for each stratum, there exist observations in both the treatment and control group, but as you can see, there are not any 12-year-old male passengers in first class. We've known about the problems of nonrandom sample selection for decades [Heckman, 1979]. You could say that there are six steps to randomization inference: (1) the choice of the sharp null, (2) the construction of the null, (3) the picking of a different treatment vector, (4) the calculation of the corresponding test statistic for that new treatment vector, (5) the randomization over step 3 as you cycle through a number of new treatment vectors (ideally all possible combinations), and (6) the calculation of the exact p-value; a sketch of these steps in code follows below. The formula for the ATU is as follows: ATU = E[Y1 − Y0 | D = 0]. Depending on the research question, one, or all three, of these parameters is interesting. My ambition was to become a poet. In this case, the selection bias is the inherent difference between the two groups if both received chemo. There have been multiple Nobel Prizes given to those who use them: Vernon Smith for his pioneering of laboratory experiments in 2002, and more recently, Abhijit Banerjee, Esther Duflo, and Michael Kremer in 2019 for their leveraging of field experiments at the service of alleviating global poverty.6 The experimental design has become a hallmark in applied microeconomics, political science, sociology, psychology, and more. To even ask questions like this (let alone attempt to answer them) is to engage in storytelling. Assuming that the profiles fj(a) and gj(a) are continuous at age 65 (i.e., the continuity assumption necessary for identification), then any discontinuity in y is due to insurance. Each cell measures the average treatment effect for the complier population at the discontinuity. There are two fundamentally different views of the role of elections in a representative democracy: convergence theory and divergence theory. Note: Entries in each cell are estimated regression discontinuities at age 65 from quadratics in age interacted with a dummy for 65 and older. Other controls such as gender, race, education, region, and sample year are also included.
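Here is a minimal R sketch of those six steps under the sharp null of no effect for any unit (all data simulated; in practice you would use your actual outcome and treatment vectors). The test statistic is the absolute difference in mean outcomes, and the p-value is the rank of the observed statistic in the randomization distribution.

```r
set.seed(1)
n <- 100
d <- sample(rep(c(0, 1), n / 2))     # actual treatment assignment
y <- 1 + 0.5 * d + rnorm(n)          # observed outcomes

# Steps 1-2: under the sharp null, every unit's outcome is fixed
# across re-randomizations, so y never changes below
t_obs <- abs(mean(y[d == 1]) - mean(y[d == 0]))

# Steps 3-5: draw new treatment vectors and recompute the test statistic
t_fake <- replicate(1000, {
  d_new <- sample(d)                 # a different treatment vector
  abs(mean(y[d_new == 1]) - mean(y[d_new == 0]))
})

# Step 6: exact p-value from the rank of the true statistic
mean(c(t_obs, t_fake) >= t_obs)
```

With all possible combinations this p-value is exact; cycling through 1,000 random draws, as here, approximates it.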
The propensity score theorem does not imply balanced unobserved covariates. This is the reason we cannot merely control for occupation. Their interest was twofold. So how did Fisher and others fail to see it? Fortunately, there is a solution. Medicare is available to people who are at least 65 and have worked forty quarters or more in covered employment or have a spouse who did. As you can see, the bias of the matching estimator can be severe depending on the magnitude of these matching discrepancies. The donut hole RDD can be used to circumvent some of the problems. We are going to keep the number of treatment units fixed throughout this example. The assumptions help inform our beliefs that the estimated coefficients, on average, equal the parameter values themselves. Squaring it will, after all, eliminate all negative values of the mistake so that everything is a positive value. This also protects against outliers and is represented as a difference in quantiles: we could look at the median, the 25th quantile, the 75th quantile, or anything along the unit interval. So to help make RDD concrete, let's first look at a couple of pictures. Figure 14. Everything we're getting at is suggesting that matching is biased because of these poor matching discrepancies. This can be particularly problematic in practice. But then I became intrigued with the idea that humans can form plausible beliefs about causal effects even without a randomized experiment. Fryer [2019] collected several databases that he hoped would help us better understand these patterns. We can see, not surprisingly, that the effect of receiving Medicare is to cause a very large increase of being on Medicare, as well as reducing coverage on private and managed care. But let's say that the problem was always on the treatment group, not the control group. Then effort is made to enforce common support through trimming. Analysis is limited to people between 55 and 75. In this case, it appeared to be at regular 100-gram intervals and was likely caused by a tendency for hospitals to round to the nearest integer. For instance: where α0 = E(u). This creates a potential identification problem in interpreting the discontinuity in y for any one group. I find it helpful to visualize things. With the following information, we can both fill in missing counterfactuals so as to satisfy Fisher's sharp null and calculate a corresponding test statistic based on this treatment assignment. Now before we commence, let's review what this DAG is telling us. Young [2019] shows that in finite samples, it is common for some observations to experience concentrated leverage. So what exactly is going on in this visualization? Regressions illustrating confounding bias with simulated gender disparity. Note that the reason for the common support assumption is that we are weighting the data; without common support, we cannot calculate the relevant weights. Table 12 shows only the observed outcomes for the treatment and control groups. A covariate is usually a random variable assigned to the individual units prior to treatment. But these are special cases only. That more or less summarizes what we want to discuss regarding the linear regression. And in instances of ties, we simply take the average over all tied units. Let's look at the diagram in Figure 22, which illustrates the similarities and differences between the two designs.
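To show the mechanics of nearest-neighbor matching on a single covariate, here is a minimal R sketch with simulated data (names and parameters are my own). Each treated unit is matched, with replacement, to the control unit with the closest X; averaging over tied controls, as described above, is omitted here for brevity.

```r
set.seed(1)
n <- 500
d <- rbinom(n, 1, 0.3)
x <- rnorm(n) + 0.5 * d              # covariate differs by treatment status
y <- 1 + 0.8 * d + 0.5 * x + rnorm(n)

treated  <- which(d == 1)
controls <- which(d == 0)

# For each treated unit, find the control unit with the closest X
match_j <- sapply(treated, function(i) {
  controls[which.min(abs(x[controls] - x[i]))]
})

# ATT: average difference between each treated unit and its matched control
mean(y[treated] - y[match_j])
```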
When there is considerable bunching at either end of the propensity score distribution, it suggests you have units who differ remarkably on observables with respect to the treatment variable itself. Theorem: Sampling variance of OLS. This explicitly requires that there are no unobservable variables opening backdoor paths as confounders, which to many researchers requires a leap of faith so great they are unwilling to make it. Heaping is not the end of the world, which is good news for researchers facing such a problem. It's not contained in most data sets, as it measures things like intelligence, contentiousness, mood stability, motivation, family dynamics, and other environmental factors; hence, it is unobserved in the picture. So if you're like me, try this. All methods used for RDD are ways of handling the bias from extrapolation as cleanly as possible. Remember: if we know x, we know w. So: the derivation follows, where the penultimate equality uses the fifth assumption so that the variance of ui does not depend on xi. Distribution of 1,000 95% confidence intervals, with darker region representing those estimates that incorrectly reject the null. Let's examine a real-world example around the problem of gender discrimination in labor markets. Now in this case, notice that we included the , cluster(cluster_ID) syntax in our regression command. Visual representation of cash transfers on condom purchases for HIV-positive individuals [Thornton, 2008]. But for any value of Xi, there are either units in the treatment group or the control group, but not both. Given that the cruiser contained a variety of levels for seating and that wealth was highly concentrated in the upper decks, it's easy to see why wealth might have a leg up for survival. But before using the propensity score methods for estimating treatment effects, let's calculate the average treatment effect from the actual experiment. Those environmental factors are likely correlated between parent and child and therefore subsumed in the variable B. Illustrating ranks using the example data. But the posttreatment difference in average earnings was between $798 and $886.12 Table 33 also shows the results he got when he used the nonexperimental data as the comparison group. But interestingly, in that experiment, the reason for randomization was not as the basis for causal inference. What does it mean for one unit's covariate to be close to someone else's? But then, so is a naïve multivariate regression in such cases. Figure 19. Subclassification is a method of satisfying the backdoor criterion by weighting differences in means by strata-specific weights. This added layer of uncertainty is often called inference. Then once we condition on individual age and gender, it's entirely likely that we will not have the information necessary to calculate differences within strata, and therefore be unable to calculate the strata-specific weights that we need for subclassification. But notice M, the stop itself. The problem is that our stratifying variable has too many dimensions, and as a result, we have sparseness in some cells because the sample is too small. And fourth, the bias term may not converge to zero. The two groups differed considerably on observables. Subclassification example. Our test statistic will be the absolute value of the simple difference in mean outcomes for simplicity. Figure 32. Leverage causes standard errors and estimates to become volatile and can lead to overrejection.
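Since the text mentions enforcing common support through trimming, here is a minimal R sketch (all data simulated, variable names my own). It applies the simple rule of thumb of keeping only units with estimated propensity scores between 0.1 and 0.9, in the spirit of Crump et al.; treat the exact bounds as an illustrative assumption, not a recommendation.

```r
set.seed(1)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$d <- rbinom(n, 1, plogis(df$x1 + df$x2))
df$y <- 1 + df$d + df$x1 + rnorm(n)

# Estimate propensity scores with a logit
ps_model  <- glm(d ~ x1 + x2, family = binomial(link = "logit"), data = df)
df$pscore <- predict(ps_model, type = "response")

# Trim to enforce common support (0.1 and 0.9 are illustrative bounds)
trimmed <- subset(df, pscore >= 0.1 & pscore <= 0.9)
nrow(trimmed) / nrow(df)    # share of the sample retained
```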
In fact, it is the best such guess at y among all linear estimators because it minimizes the prediction error. Figure 5. Critics wrote in to mansplain how wrong she was, but in fact it was they who were wrong. If we saw a jump in motor vehicle accidents at age 21 in Uruguay, then we might have reason to believe the continuity assumption does not hold in the United States. To illustrate this, we need to download the data first. This holds for each i in {1, . . . , n}, so we have n of these equations. Of those, 2,222 received any incentive at all. Thus, each treated observation contributes to the overall bias. Thus the age distribution is imbalanced. Oftentimes, colliders enter into the system in very subtle ways. The conditional expectation function (CEF) is the mean of some outcome y with some covariate x held fixed. It means that, absent the treatment itself, the expected potential outcomes would've remained a smooth function of X even as X passes c0. Table 31. Universal insurance has become highly relevant because of the debates surrounding the Affordable Care Act, as well as several Democratic senators supporting Medicare for All. In a subsequent study [Card et al., 2009], the authors examined the impact of Medicare on mortality and found slight decreases in mortality rates (see Table 44). Now the ATE is 0.6, which is just a weighted average between the ATT and the ATU.14 So we know that the overall effect of surgery is positive, although the effect for some is negative. But most likely, family size isn't random, because so many people choose the number of children to have in their family, instead of, say, flipping a coin. If we want to estimate the average causal effect of family size on labor supply, then we need two things. Causal inference encompasses the tools that allow social scientists to determine what causes what. The potential outcomes notation expresses causality in terms of counterfactuals, and since counterfactuals do not exist, questions about causal effects must to some degree be unanswerable. But we'd still have that nasty selection bias screwing things up. That vertical distance will be our test statistic. Do we really believe that: Is it the case that factors related to these three states of the world are truly independent to the factors that determine death rates? In contrast, more variation in {xi} is a good thing. Fisher and Neyman debated about this first step. In words, this means that the only thing that causes the outcome to change abruptly at c0 is the treatment. Whereas the residual will appear in the data set once generated from a few steps of regression and manipulation, the error term will never appear in the data set. In CEM, this ex ante choice is the coarsening decision. This is, in my experience, especially true for instrumental variables, which have a very intuitive DAG representation. Speaking of matching discrepancies, what sorts of options are available to us, putting aside seeking a large data set with lots of controls? Figure 16. 21 She also chose to cluster those standard errors by village for 119 villages. We know this is wrong because we hard-coded the effect of gender to be 1! Starting in 1976, RDD finally gets annual double-digit usage for the first time, after which it begins to slowly tick upward. 5 Productivity could diverge, though, if women systematically sort into lower-quality occupations in which human capital accumulates over time at a lower rate. But our analysis has been purely algebraic, based on a sample of data. Notice that Σi(xi − x̄)2/n is the sample variance in x.
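For readers working in R rather than Stata, a roughly analogous way to get cluster-robust standard errors uses the sandwich and lmtest packages. This is a minimal sketch on simulated data with a shared within-cluster shock (all names my own), not a replication of the regression in the text.

```r
library(sandwich)
library(lmtest)

set.seed(1)
n  <- 500
df <- data.frame(cluster_ID = rep(1:50, each = 10), x = rnorm(n))
df$y <- 1 + 0.5 * df$x + rnorm(50)[df$cluster_ID] + rnorm(n)  # common shock per cluster

model <- lm(y ~ x, data = df)
coeftest(model, vcov = vcovCL(model, cluster = ~cluster_ID))  # clustered SEs
coeftest(model)                                               # naive SEs, for comparison
```

Because of the common shock, the naive standard errors here will typically be too small relative to the clustered ones.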
The assignment variable may itself independently affect the outcome via the X → Y path and may even be related to a set of variables U that independently determine Y. It demands a large sample size for the matching discrepancies to be trivially small. The standard deviation in this estimator was 0.0398413, which is close to the standard error recorded in the regression itself.19 Thus, we see that the estimate is the mean value of the coefficient from repeated sampling, and the standard error is the standard deviation from that repeated estimation. Krueger therefore estimated regression models that included a school fixed effect because he knew that the treatment assignment was only conditionally random. To do that, we will need to randomly assign the treatment, estimate a test statistic satisfying the sharp null for that sample, repeat that thousands of times, and then calculate the p-value associated with this treatment assignment based on its ranked position in the distribution. If we can commit to independence of these unobservables across classes, but individual student unobservables are correlated within a class, then we have a situation in which we need to cluster the standard errors. She wants to know the effect of getting results, but the results only matter (1) for those who got their status and (2) for those who were HIV-positive. The data files include information on age in months at the time of admission. Earnings comparisons and estimated training effects for the NSW male participants using comparison groups from the PSID and the CPS-SSA. Distance is measured as a straight-line spherical distance from a respondent's home to a randomly assigned VCT center from geospatial coordinates and is measured in kilometers. Crump et al. 5 For a more complete review of regression, see Wooldridge [2010] and Wooldridge [2015]. For RDD to be valid in your study, there must not be an observable discontinuous change in the average values of reasonably chosen covariates around the cutoff. The residuals from this regression were then averaged for each applicant, with the resulting average residual earnings measure being used to implement a partialled-out future earnings variable according to the Frisch-Waugh-Lovell theorem. If the NSW program increased earnings by approximately $900, then that is roughly what we should find if the other econometric estimators do a good job, right? Maybe unit i has an age of 25, but unit j has an age of 26. Children in the study were assigned at random to receive the vaccine or a placebo.4 Also, the doctors making the diagnoses of polio did not know whether the child had received the vaccine or the placebo. Sometimes seeing is believing, so let's look at this together. In fact, clustering on the running variable can actually be substantially worse than heteroskedastic-robust standard errors. The proof of the propensity score theorem is fairly straightforward, as it's just an application of the law of iterated expectations with nested conditioning.15 If we can show that the probability an individual receives treatment conditional on potential outcomes and the propensity score is not a function of potential outcomes, then we will have proved that there is independence between the potential outcomes and the treatment conditional on X. Minorities are more likely to have an encounter with the police. This shows up in a regression as well.
The rate of lung cancer incidence appeared to be increasing. Then combining this new assumption, E(u | x) = E(u) (the nontrivial assumption to make), with E(u) = 0 (the normalization and trivial assumption), and you get the following new assumption: Equation 2.28 is called the zero conditional mean assumption and is a key identifying assumption in regression models. In short, column 2 includes a control for the amount of the incentive, which ranged from US$0 to US$3. This study suggested that some kinds of outreach, such as door-to-door testing, may cause people to learn their type, particularly when bundled with incentives, but simply having been incentivized to learn one's HIV status may not itself lead HIV-positive individuals to reduce any engagement in high-risk sexual behaviors, such as having sex without a condom. Rarely are human beings making important life choices by flipping coins. Mean ages, years [Cochran, 1968]. Formally, if we assume a desirable treatment D and an assignment rule X ≥ c0, then we expect individuals will sort into D by choosing X such that X ≥ c0, so long as they're able. The methods are easy. Put differently, we used the estimated coefficients from that logit regression to estimate the conditional probability of treatment, assuming that probabilities are based on the cumulative logistic distribution, where X is the exogenous covariates we are including in the model; a minimal code sketch of this step follows below. It's a simple method, but it has the aforementioned problem of the curse of dimensionality. In words, a 10-percentage-point change in the out-relief growth rate is associated with a 7.5-percentage-point increase in the pauperism growth rate, or an elasticity of 0.75. If the means of the covariates are the same for each group, then we say those covariates are balanced and the two groups are exchangeable with respect to those covariates. In randomization inference, as you recall from the earlier chapter, the uncertainty in question regards the treatment assignment, not the sample. In other words, using the language we've been using up until now, it's unlikely that E(u | X) = E(u) = 0. Therefore we would use the following bias correction: Table 32. The propensity score is the fitted values of the logit model. Notice that the intercept is the predicted value of y if and when x = 0. Once we have the matched sample, we can calculate the ATT as the average of Yi − Yi(j) over the treated units, where Yi(j) is the matched control group unit to i. Simple: it was just noise, pure and simple. Peirce and Jastrow [1885] used several treatments, and they used physical randomization so that participants couldn't guess what would happen next. Carl Friedrich Gauss proposed a method that could successfully predict Ceres's next location using data on its prior location.
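Here is a minimal R sketch of that step on simulated data (all names and parameters my own): fit a logit of treatment on covariates, take the fitted probabilities as the propensity score, and form simple inverse probability weights. This illustrates the mechanics only, not the particular specification used in the study.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(-0.5 + x))     # treatment probability rises with x
y <- 2 + 1 * d + 0.7 * x + rnorm(n)

# Propensity score: fitted values from the logit
logit <- glm(d ~ x, family = binomial(link = "logit"))
ps <- logit$fitted.values

# Simple inverse probability weights for the ATE
w <- ifelse(d == 1, 1 / ps, 1 / (1 - ps))
weighted.mean(y[d == 1], w[d == 1]) - weighted.mean(y[d == 0], w[d == 0])
```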
If you go to the website and type 2901 choose 2222 you get a truncated number of combinations beginning 6150566109498251513699280333307718471623795043419269261826403 . . . , a number running to hundreds of digits. Notice that ability is a random draw from the standard normal distribution. So how do we close a backdoor path? I wasn't used to taking a lot of numbers and trying to measure distances between them, so things were slow to click. Table 39. In the potential outcomes tradition [Rubin, 1974; Splawa-Neyman, 1923], a causal effect is defined as a comparison between two states of the world. Distribution of the least squares estimator over 1,000 random draws. So what can we do? In other words, it can be computed if xi is not constant across all values of i. King and Nielsen write: "The more balanced the data, or the more balance it becomes by [trimming] some of the observations through matching, the more likely propensity score matching will degrade inferences" [2019, 1]. McCrary [2008] suggests a formal test where under the null, the density should be continuous at the cutoff point. A major public health problem of the mid- to late twentieth century was the problem of rising lung cancer. In other words, the cutoff is endogenous. This equation can't be directly estimated because we never observe P. Propensity score methods. The following Monte Carlo simulation will estimate OLS on a sample of data 1,000 times; a sketch of it appears below. This therefore makes the difference μ0(Xi) − μ0(Xj(i)) converge to zero very slowly. But the study is not without issues that could cause a skeptic to take issue. (e.g., confirmation bias, statistical significance, asterisks). The slope of that line is 57.20. Since graphical models are immensely helpful for designing a credible identification strategy, I have chosen to include them for your consideration. Hence E(y | x) = x'β. This likely would've involved the school's general counsel, careful plans to de-identify the data, agreements on data storage, and many other assurances that students' names and identities were never released and could not be identified. Barnow et al. [1981] and Dale and Krueger [2002] called it selection on observables. The sixth line integrates the joint density over the support of x, which is equal to the marginal density of y. We begin with equation 2.29 and pass the summation operator through. For emphasis we will call ȳ the sample average. Notes 1 This brief history will focus on the development of the potential outcomes model. Under assumptions 1 and 2, we get unbiasedness: the expected value of the slope estimator equals β1. To show this, write the slope estimator, as before, as β1 + Σi(xi − x̄)ui / Σi(xi − x̄)2, where we are treating the xi as nonrandom in the derivation. Note: Entries in each cell are estimated regression discontinuities at age 65 from quadratics in age interacted with a dummy for 65 and older. We begin with the simplest measure of distance, the Euclidean distance: the problem with this measure of distance is that the distance measure itself depends on the scale of the variables themselves. Fisher [1925] proposed the explicit use of randomization in experimental design for causal inference.3 Physical randomization was largely the domain of agricultural experiments until the mid-1950s, when it began to be used in medical trials. Let's first see her main results in Figure 14.
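A minimal R sketch of that Monte Carlo exercise (the true parameters are my own choices): draw a fresh sample, run OLS, store the slope, and repeat 1,000 times. The mean of the estimates should sit near the true slope, and their standard deviation approximates the standard error.

```r
set.seed(1)
betas <- replicate(1000, {
  x <- rnorm(100)
  y <- 1 + 2 * x + rnorm(100)   # true intercept 1, true slope 2
  coef(lm(y ~ x))[["x"]]        # keep the estimated slope
})

mean(betas)   # centered near the true slope of 2 (unbiasedness)
sd(betas)     # spread across repeated samples (the standard error)
hist(betas)   # distribution of the least squares estimator
```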
As you can see, the two groups are exactly balanced on age. Then we can calculate the ATT. And second, they wanted to show the diagnostic value of propensity score matching. Quite impressive. Replicating demand for learning HIV status. Rebecca Thornton is a prolific, creative development economist. Continuity, in other words, explicitly rules out omitted variable bias at the cutoff itself. For example, say we estimated an 8.2% return to schooling. We usually define M to be small, like M = 2. And then we rely on other procedures to give us reasons to believe the number we calculated is probably a causal effect. But your results should be similar to what is shown here. What would that look like to an outsider? But the residual is based on estimates of the slope and the intercept. Table 36 shows the sample means of characteristics in the matched control sample versus the experimental NSW sample (first row). In short, the two groups are not exchangeable on observables (and likely not exchangeable on unobservables either). And RDD is no different. Figure 21. Older people were more likely at this time to smoke cigars and pipes, and without stating the obvious, older people were more likely to die. In the potential outcomes tradition, manipulation is central to the concept of causality. Now let's think for a second about what Hoekstra is finding. But what if there are many covariates? First, there are the shared background factors, B. Let's create a Monte Carlo simulation. One simple way is a type of Wald estimator, where you estimate some causal effect as the ratio of a reduced form difference in mean outcomes around the cutoff and a reduced form difference in mean treatment assignment around the cutoff. We define independent events two ways. We'll keep the assumption ordering we've been using and call this the fifth assumption. (567) For evidence to be so dependent on just a few observations creates some doubt about the clarity of our work, so what are our alternatives? Table 35 shows the results using propensity score weighting or matching.13 As can be seen, the results are a considerable improvement over Lalonde [1986]. That is, (Y0, Y1) ⊥ D | X. Challenges to Identification. The requirement for RDD to estimate a causal effect is the continuity assumption. Sometimes the discrepancies are small, sometimes zero, sometimes large. The equation to do so can be compactly written as: We've seen a problem that arises with subclassification: in a finite sample, subclassification becomes less feasible as the number of covariates grows, because as K grows, the data becomes sparse. Let {yi: i = 1, . . . , n} be a random sample of size n from the population. Only prior knowledge and deep familiarity with the institutional details of your application can tell you what the appropriate identification strategy is, and insofar as the backdoor criterion can be met, then matching methods may be perfectly appropriate. A 1-point change in age is very different from a 1-point change in log income, not to mention that we are now measuring distance in two, not one, dimensions. The authors comment on what might be going on: Figure 36.
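To see the scale problem concretely, here is a minimal R sketch (hypothetical numbers of my own): the plain Euclidean distance is dominated by whichever covariate has larger units, while dividing each difference by that covariate's standard deviation puts the dimensions on a comparable footing.

```r
# Hypothetical covariates on very different scales
X   <- data.frame(age = c(25, 26, 60), log_inc = c(10.1, 12.0, 10.2))
sds <- sapply(X, sd)

xi <- as.numeric(X[1, ])   # unit i
xj <- as.numeric(X[2, ])   # unit j

sqrt(sum((xi - xj)^2))            # plain Euclidean distance, scale-dependent
sqrt(sum(((xi - xj) / sds)^2))    # normalized Euclidean distance
```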
Each patient has the following two potential outcomes, where a potential outcome is defined as post-treatment life span in years: a potential outcome in a world where they received surgery and a potential outcome where they had instead received chemo. Replication exercise. where ȳ is the average of the n numbers {yi: i = 1, . . . , n}. There were different sample sizes from study to study, which can be confusing. The problem with estimating this model, though, is that insurance coverage is endogenous: cov(u,C) ≠ 0. Given that these agencies have considerable discretion in whom they release data to, it is likely that certain groups will have more trouble than others in acquiring the data. The McCrary density test is used to check whether units are sorting on the running variable. They were revitalized for the purpose of causal inference when computer scientist and Turing Award winner Judea Pearl adapted them for his work on artificial intelligence. Unbiasedness means that if we could take as many random samples on Y as we want from the population and compute an estimate each time, the average of the estimates would be equal to β1. That is because they are not potential outcomes; they are the realized, actual, historical, empirical (however you want to say it) outcomes that unit i experienced. The last column, labeled adjusted, is weighted least squares. Because β1 is a constant, it does not affect the sampling variance of its estimator. With the estimated propensity score in hand, Dehejia and Wahba [1999] estimated the treatment effect on real earnings in 1978 using the experimental treatment group compared with the non-experimental control group. This is an example of the continuity assumption. Leverage is a measure of the degree to which a single observation on the right-hand-side variable takes on extreme values and is influential in estimating the slope of the regression line. Then E(Yi | Xi, Di = 0) can be estimated by matching. And insofar as that ability increases one's marginal product, then we expect those individuals to earn more in the workforce regardless of whether they had in fact attended the state flagship. So let's see what the bias-correction method looks like; a small code sketch follows below. Furthermore, maybe by the same logic, cigarette smoking has such a low mortality rate because cigarette smokers are younger on average. Estimated effect of D on Y using OLS controlling for a linear running variable. Does there exist some omitted variable wherein the outcome would jump at c0 even if we disregarded the treatment altogether? It is initially encouraging to see that the effects on condom purchases are large for the HIV-positive individuals who, as a result of the incentive, got their test results. Data are drawn from the 1999-2003 NHIS, and for each characteristic, authors show the incidence rate at age 63-64 and the change at age 65 based on a version of the CK equations that include a quadratic in age, fully interacted with a post-65 dummy as well as controls for gender, education, race/ethnicity, region, and sample year. Second, the random difference between β1 and the estimate of it is due to this linear function of the unobservables. The reduced form would regress the outcome Y onto the instrument and the running variable. But, let's imagine that the people in room B had successfully sorted themselves into room A.
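Here is a minimal R sketch of the bias-correction idea on simulated data (names my own, in the spirit of the Abadie-Imbens adjustment): after nearest-neighbor matching, each pair's difference is adjusted by the difference in a fitted control-group regression μ̂0(X) evaluated at the treated unit's and the matched unit's covariate values.

```r
set.seed(1)
n <- 500
d <- rbinom(n, 1, 0.3)
x <- rnorm(n) + 0.5 * d
y <- 1 + 0.8 * d + 0.5 * x + rnorm(n)

treated  <- which(d == 1)
controls <- which(d == 0)
match_j  <- sapply(treated, function(i) controls[which.min(abs(x[controls] - x[i]))])

# mu0_hat: outcome model fit on control units only
mu0  <- lm(y ~ x, subset = d == 0)
mu0f <- function(z) predict(mu0, newdata = data.frame(x = z))

# Bias-corrected ATT: subtract the fitted difference mu0(Xi) - mu0(Xj(i))
mean((y[treated] - y[match_j]) - (mu0f(x[treated]) - mu0f(x[match_j])))
```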
This study was an early one to show that not only does college matter for long-term earnings, but the sort of college you attend, even among public universities, matters as well. This would have involved making introductions, holding meetings to explain his project, convincing administrators the project had value for them as well as him, and ultimately winning their approval to cooperatively share the data. The second one, though, means that the mean value of x does not change with different slices of the error term. The magnitudes will depend on the size of the insurance changes at age 65 and on the associated causal effects (β1 and β2). But, we have two backdoor paths, one of which is D ← U1 → Y. And that's where things get pessimistic. Because only 34 percent of the control group participants went to a center to learn their HIV status, it is impressive that receiving any money caused a 43-percentage-point increase in learning one's HIV status. ATE is equal to the weighted sum of conditional average expectations, ATT and ATU; the short sketch below verifies this numerically. That number is measuring the spreading out of underlying errors themselves. From this simple definition of a treatment effect come three different parameters that are often of interest to researchers. Don't you think those two groups are probably pretty similar to one another on observable and unobservable characteristics? Usually, though, it's just a description of the differences between the two groups if there had never been a treatment in the first place. In doing so, she addresses the over-rejection problem that we saw earlier when discussing clustering in the probability and regression chapter. As you can see, the highest death rate for Canadians is among the cigar and pipe smokers, which is considerably higher than for nonsmokers or for those who smoke cigarettes. A random process is a process that can be repeated many times with different outcomes each time. Finally, there seems to be some aesthetic preference for these types of placebo-based inference, as many people find them intuitive. It and synthetic control are probably two of the most visually intensive designs you'll ever encounter, in fact. The marginal densities are gy(t) and gx(u). In other words, if unit 1 receives the treatment, and there is some externality, then unit 2 will have a different Y0 value than if unit 1 had not received the treatment. The {ui: i = 1, . . . , n} are the unobserved errors. Distribution of the least squares estimator over 1,000 random draws. It's becoming increasingly less the case that researchers work with random samples; they are more likely working with administrative data containing the population itself, and thus the concept of sampling uncertainty becomes strained.29 For instance, Manski and Pepper [2018] wrote that random sampling assumptions . . . My path to economics was not linear. While it is possible that observing the same unit over time will not resolve the bias, there are still many applications where it can, and that's why this method is so important. When there are such spillovers, though, such as when we are working with social network data, we will need to use models that can explicitly account for such SUTVA violations, such as that of Goldsmith-Pinkham and Imbens [2013]. We can see the distribution of these coefficient estimates in Figure 5. The data themselves were generated as a function of earlier police-citizen interactions.
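A tiny R sketch (with made-up potential outcomes) confirming that the ATE is the share-weighted average of the ATT and the ATU. The vectors y1 and y0 are hypothetical potential outcomes that we could never jointly observe in real data.

```r
y1 <- c(5, 9, 3, 8, 6, 7, 4, 8)   # hypothetical potential outcome under treatment
y0 <- c(4, 6, 5, 7, 1, 6, 6, 2)   # hypothetical potential outcome under control
d  <- c(1, 1, 1, 0, 0, 0, 1, 0)   # who was actually treated

delta <- y1 - y0
ate <- mean(delta)
att <- mean(delta[d == 1])
atu <- mean(delta[d == 0])
p   <- mean(d)                     # share treated

ate
p * att + (1 - p) * atu            # identical to the ATE
```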
Specifically, insofar as there exists a conditioning strategy that will satisfy the backdoor criterion, then you can use that strategy to identify some causal effect. Notice the role that extrapolation plays in estimating treatment effects with sharp RDD. But for now, it's easiest to introduce an assumption that simplifies the calculations. When all backdoor paths have been closed, we say that you have come up with a research design that satisfies the backdoor criterion. Poor people depended on either poorhouses or the local authorities for financial support, and Yule wanted to know if public assistance increased the number of paupers, which is a causal question. Is it random? What if that waitress had won the lottery? First two columns are from 1997-2003 NHIS and last two columns are from 1992-2003 NHIS. There is no degrees-of-freedom correction, in other words, when using samples to calculate means. So let's go. But if unit i is just above c0, then Di = 1. Those sample versions can be written as follows: We have a few options for estimating the variance of this estimator, but one is simply to use bootstrapping. The study should be widely read by every applied researcher whose day job involves working with proprietary administrative data sets, because this DAG may in fact be a more general problem. First, if you summarize the data, you'll see that the fitted values are produced both using Stata's Predict command and manually using the Generate command. Imbens and Rubin [2015] define this as the average difference on a log scale by treatment status. Table 21. Stratify the data into four groups: young males, young females, old males, old females. Calculate the weighted average survival rate using the strata weights; a sketch of this calculation appears below. Then: You probably use LIE all the time and didn't even know it. Since choosing correctly is highly unlikely (1 chance in 70), it is reasonable to believe she has the talent that she claimed all along that she had. First, the share of the relevant population who delayed care the previous year fell 1.8 points, and similarly for the share who did not get care at all in the previous year. This is sometimes also called exogeneity. Let's review this with some code so that you can better understand what these four steps actually entail. Notice in this example that we cannot implement exact matching because none of the treatment group units has an exact match in the control group. For instance, the mortality rate per 100,000 from cancer of the lungs in males reached 80 to 100 per 100,000 by 1980 in Canada, England, and Wales. Because the presence of β0 (the intercept term) always allows us this flexibility. Thus, the people in the control group are older, and since wages typically rise with age, we may suspect that part of the reason their average earnings are higher ($11,075 vs. $11,101) is because the control group is older. Notice the large discontinuous jump in motor vehicle death rates at age 21. E( · ) is the expected value operator discussed earlier. Athey and Imbens are part of a growing trend of economists using randomization-based methods for inferring the probability that an estimated coefficient is not simply a result of chance.
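Here is a minimal R sketch of those steps on simulated data (all names and numbers are my own illustrative choices): stratify on the two binary covariates, take within-stratum differences in survival rates, and weight them by each stratum's share of the sample.

```r
set.seed(1)
n  <- 2000
df <- data.frame(d    = rbinom(n, 1, 0.5),
                 male = rbinom(n, 1, 0.5),
                 old  = rbinom(n, 1, 0.4))
df$survived <- rbinom(n, 1, plogis(0.3 * df$d - 0.5 * df$male - 0.4 * df$old))

df$stratum <- interaction(df$male, df$old)   # the four groups

# Within-stratum differences in survival, weighted by stratum shares
diffs <- tapply(seq_len(n), df$stratum, function(i) {
  s <- df[i, ]
  mean(s$survived[s$d == 1]) - mean(s$survived[s$d == 0])
})
shares <- table(df$stratum) / n
sum(diffs * shares)    # subclassification estimate of the treatment effect
```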
And third, there's the effect that parental education has on family earnings, I, which in turn affects how much schooling the child receives. It is all of the determinants of our outcome not captured by our model. It's what an expert would say is the thing itself, and that expertise comes from a variety of sources. This says that being a female made you more likely to be in first class but also made you more likely to survive because lifeboats were more likely to be allocated to women. Notice in this DAG that there are several backdoor paths from D to Y. Here's where the study gets even more intriguing. It is tempting to say that 8.2% is an unbiased estimate of the return to schooling, but that's technically incorrect. Causal inference is the leveraging of theory and deep knowledge of institutional details to estimate the impact of events and choices on a given outcome of interest. What if we used something that had a little more information, like number of condoms bought? So how do we interpret if the family size is not random? Yule used his regression to crank out the correlation between out-relief and pauperism, from which he concluded that public assistance increased pauper growth rates. And there are designs where the probability of treatment discontinuously increases at the cutoff. Randomization Inference. Athey and Imbens [2017a], in their chapter on randomized experiments, note that in randomization-based inference, "uncertainty in estimates arises naturally from the random assignment of the treatments, rather than from hypothesized sampling from a large population" (73). Let's look at the first iterations of this in Table 20. Hoekstra's partner, the state flagship university, sent the university admissions data directly to a state office in which employers submit unemployment insurance tax reports. So what exactly have we done? But outliers create problems for that test statistic because of the variation that gets introduced in the randomization distribution. This is where institutional knowledge goes a long way, because it can help build the case that nothing else is changing at the cutoff that would otherwise shift potential outcomes. And you are free to call foul on this assumption if you think that background factors affect both schooling and the child's own productivity, which itself should affect wages. For instance, Thornton's total sample was 2,901 participants. Bootstrapped p-values are random draws from the sample that are then used to conduct inference. It requires in-depth knowledge of the data-generating process for the variables in your DAG, but it also requires ruling out pathways. Table 30. The cutoff is endogenous to factors that independently cause potential outcomes to shift. We can also check the overall age distribution (Figure 17). There are no cycles in a DAG. When they are mediated by a third variable, we are capturing a sequence of events originating with D, which may or may not be important to you depending on the question you're asking. Therefore, the RDD does not have common support, which is one of the reasons we rely on extrapolation for our estimation. Bad controls are not the only kind of collider bias to be afraid of, though.
The purpose of the book is to allow researchers to understand causal inference and work with their data to answer relevant questions in the area. But sometimes this is impossible, and therefore there are matching discrepancies. Lee et al. [2004] present a model, which I've simplified. We continue doing that for all units, always moving the control group unit with the closest value on X to fill in the missing counterfactual for each treatment unit. Both the backdoor criterion and CIA tell us precisely what we need to do. But our job gets tougher. Figure 3. So we may be interested in a test statistic that can detect differences in distributions between the treatment and control units. For instance, Fryer [2019] notes that the Houston data was based on arrest narratives that ranged from two to one hundred pages in length. The main advantage of randomization inference is that it allows us to make probability calculations revealing whether the data are likely a draw from a truly random distribution or not. In this sense, we can say that the data itself are endogenous. How big was this incentive? In all these kinds of studies, we need data. Sample size is 915. Causal inference requires knowledge about the behavioral processes that structure equilibria in the world. Formally we write the ATT as ATT = E[Y1 − Y0 | D = 1]. The final parameter of interest is called the average treatment effect for the control group, or untreated group. ADA scores are then linked to election returns data during that period. But what happens when we use least squares with clustered data? Step 1: Write down a formula for the estimator. When we fit the model using a least squares regression controlling for the running variable, we estimate a causal effect though there isn't one. Sample sizes are 4,017,325 (California), 2,793,547 (Florida), and 3,121,721 (New York). In her follow-up survey, Thornton asked all individuals, regardless of whether they learned their HIV status, about condom purchases, which lets her examine the effect of a cash transfer on condom purchases. Humans have always been interested in stories exploring counterfactuals. There are two different languages for saying the same thing. Sometimes there is a discontinuity, but it's not entirely deterministic, though it nonetheless is associated with a discontinuity in treatment assignment. This table suggests that pipes and cigars are more dangerous than cigarette smoking, which, to a modern reader, sounds ridiculous. One small glitch is that our estimator of σ is not unbiased.26 This will not matter for our purposes: the estimator is called the standard error of the regression, which means that it is an estimate of the standard deviation of the error in the regression. From previous derivations we finally get the result, which completes the proof. What if we were to simply compare the average post-surgery life span for the two groups? A sketch of that naive comparison appears below. Adjusting the intercept has no effect on the β1 slope parameter, though. The average values for the treatment group are 34/4, the average values for the control group are 30/4, and the difference between these two averages is 1. But the average of those estimates was close to the true effect, and their spread had a standard deviation of 0.04. When controls are included, effects become positive and imprecise for the PSID sample though almost significant at 5% for CPS. Data requirements for RDD.
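A minimal R sketch of that naive comparison (hypothetical potential outcomes, numbers my own): because sicker patients may select into one treatment, the simple difference in observed means equals the ATT plus a selection-bias term, and so can diverge from the ATE.

```r
y1 <- c(7, 5, 5, 7, 4, 10, 1, 5)   # hypothetical life span if surgery
y0 <- c(1, 6, 1, 8, 2, 1, 10, 6)   # hypothetical life span if chemo
d  <- c(1, 1, 1, 1, 0, 0, 0, 0)    # who actually received surgery

# Naive comparison: observed treated mean minus observed control mean
sdo <- mean(y1[d == 1]) - mean(y0[d == 0])

ate <- mean(y1 - y0)

# Selection bias: difference in y0 across groups, i.e., the inherent
# difference between the two groups had both received chemo
sel <- mean(y0[d == 1]) - mean(y0[d == 0])

c(sdo = sdo, ate = ate, selection_bias = sel)
```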
Basically, if it's hard to find good matches with an X that has a large dimension, then you will need a lot of observations as a result. So what if you think that there should be an arrow from B to Y? But whatever you do, don't cluster on the running variable, as that is nearly an unambiguously bad idea. That is, let's assume that there is always someone in the control group for a given combination of gender and age, but there isn't always for the treatment group. Table 37. [2010]. We will discuss this in greater detail later. This is because the conditional expectation of E[X | X] = X. Jump up, jump up, and get down! It's trivially easy to beat up on a researcher from one hundred years ago, working at a time when the alternative to regression was ideological make-believe. Remember the simulation we ran earlier in which we resampled a population and estimated regression coefficients a thousand times? This implies that D ⊥ X | p(X). This is something we can directly test, but note the implication: conditional on the propensity score, treatment and control should on average be the same with respect to X. Keep your eyes peeled. But this ensured a high degree of balance on the covariates, as can be seen from the output from the cem command itself; a rough manual sketch of the coarsening idea follows below. There are 8 × 7 × 6 × 5 = 1,680 ways to choose a first cup, a second cup, a third cup, and a fourth cup, in order. But, recall the independence assumption. Let's say we are estimating the causal effect of returns to schooling. 25 For more in-depth discussion of the following issues, I highly recommend the excellent Imbens and Rubin [2015], chapter 5 in particular. Another way of approximating f(Xi) is to use a nonparametric kernel, which I discuss later. Recall what inverse probability weighting is doing. Notice that while Y1 by construction had not jumped at 50 on the X running variable, Y will. So clearly, exact p-values using all of the combinations won't work. First, recall what we've emphasized repeatedly.
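To close, here is a rough manual R sketch of the coarsening idea behind CEM (all data and bin choices are my own illustrative assumptions; the actual cem command automates and refines this). Coarsen a continuous covariate into ex ante bins, exact-match on the bins, then weight within-bin differences by bin shares.

```r
set.seed(1)
n  <- 1000
df <- data.frame(age = runif(n, 18, 65), d = rbinom(n, 1, 0.5))
df$y <- 1 + 0.5 * df$d + 0.05 * df$age + rnorm(n)

# Step 1: the ex ante coarsening decision (illustrative break points)
df$age_bin <- cut(df$age, breaks = c(18, 25, 35, 50, 65), include.lowest = TRUE)

# Steps 2-3: within-bin differences in means, weighted by bin shares
# (bins lacking both treated and control units would be dropped
#  to enforce common support)
est <- sapply(levels(df$age_bin), function(b) {
  s <- df[df$age_bin == b, ]
  c(diff  = mean(s$y[s$d == 1]) - mean(s$y[s$d == 0]),
    share = nrow(s) / nrow(df))
})
sum(est["diff", ] * est["share", ])
```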