cluster_names entry
By default, this package always provides heteroskedasticity-robust standard errors. However, in difference-in-differences applications, treatment is often assigned to groups of individuals (e.g., a change in state-wide policy treats all individuals in a state simultaneously). If those groups are also subject to common shocks, the estimation errors are correlated within each cluster, and standard errors that ignore this correlation will tend to be too small.
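The effect of common shocks on standard errors can be illustrated with a small base-R simulation. This is only a sketch and does not use DiDforBigData's own estimator: the setup (50 groups of 20 individuals, group-level treatment, a zero true effect) and all variable names are assumptions chosen for illustration. It compares the conventional OLS standard error with a hand-computed cluster-robust sandwich standard error, without any small-sample correction.

```r
set.seed(1)

# Hypothetical design: G groups of n individuals, treatment assigned by group
G = 50
n = 20
g = rep(1:G, each = n)                   # group id for each individual
D = rep(c(0, 1), length.out = G)[g]      # every member of a group shares treatment status
shock = rnorm(G)[g]                      # common shock shared within each group
Y = 1 + 0 * D + shock + rnorm(G * n)     # true treatment effect is zero

fit = lm(Y ~ D)
X = model.matrix(fit)
e = resid(fit)

# Conventional standard error, which assumes independent errors
se_iid = sqrt(diag(vcov(fit)))[["D"]]

# Cluster-robust sandwich: (X'X)^-1 [ sum_g s_g s_g' ] (X'X)^-1,
# where s_g is the sum of x_i * e_i over the individuals in group g
XtX_inv = solve(crossprod(X))
scores = rowsum(X * e, g)                # one row of summed scores per group
V_cluster = XtX_inv %*% crossprod(scores) %*% XtX_inv
se_cluster = sqrt(diag(V_cluster))[2]

print(c(iid = se_iid, cluster = unname(se_cluster)))
```

In this design the clustered standard error comes out well above the conventional one, because the conventional formula treats the 1,000 observations as independent even though the shocks vary only across the 50 groups.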
Fortunately, this is easy to fix: if the groups within which the estimation errors are correlated are known to the researcher, we just need to cluster standard errors by group. DiDforBigData requires only that the list of variable names, varnames, is modified to include a cluster_names entry. For example, varnames$cluster_names = c("group1","group2") tells it to use multi-way clustering that accounts for common shocks to each of the observable groups, "group1" and "group2".
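As a minimal sketch, a complete varnames list with two-way clustering might be assembled as follows. The columns "group1" and "group2" are assumed to exist in the input data, and the other entries simply reuse the column names from the simulated data used later in this section. The commented-out call uses a hypothetical data set named mydata.

```r
# Variable names expected by DiD(); each value must match a column in the data
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"

# Multi-way clustering: one entry per grouping variable (assumed column names)
varnames$cluster_names = c("group1", "group2")

str(varnames)

# With a data set `mydata` (hypothetical) containing these columns:
# did = DiD(inputdata = mydata, varnames = varnames, min_event = -1, max_event = 3)
```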
Note: When estimating a regression that combines multiple treatment cohorts and/or multiple event times, it is necessary to always cluster on unit (individual). DiDforBigData adds this clustering by default.
There is an option in SimDiD() using clusters=TRUE to group individuals into bins that are differentially selected for treatment and that also face common shocks within each bin:
sim = SimDiD(sample_size=1000, clusters = TRUE)
simdata = sim$simdata
print(simdata)
#> id year cohort Y cluster
#> 1: 1 2003 2007 8.266417 9
#> 2: 1 2004 2007 12.657646 9
#> 3: 1 2005 2007 9.373441 9
#> 4: 1 2006 2007 10.851528 9
#> 5: 1 2007 2007 9.306055 9
#> ---
#> 10996: 1000 2009 2010 10.317600 6
#> 10997: 1000 2010 2010 9.623423 6
#> 10998: 1000 2011 2010 11.509660 6
#> 10999: 1000 2012 2010 9.718423 6
#> 11000: 1000 2013 2010 11.042858 6
print(sim$true_ATT[cohort=="Average"])
#> cohort event ATTge
#> 1: Average 0 1.500668
#> 2: Average 1 2.500668
#> 3: Average 2 3.250501
#> 4: Average 3 4.250501
#> 5: Average 4 5.000000
#> 6: Average 5 6.000000
#> 7: Average 6 7.000000
We set up the varnames to prepare for estimation:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
We check the usual standard errors, which are clustered on unit based on varnames$id_name by default:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -1, max_event=3)
print(did$results_average)
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> 1: -1 -1 0.000000 0.00000000 1503 749
#> 2: 0 -1 1.654672 0.07813674 1503 749
#> 3: 1 -1 2.710981 0.09333436 1503 749
#> 4: 2 -1 3.319803 0.11998209 1002 499
#> 5: 3 -1 4.597428 0.12376782 752 499
Next, we cluster on the "cluster" variable by adding it to the varnames and re-estimating:
varnames$cluster_names = "cluster"
did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3)
print(did$results_average)
#> EventTime BaseEvent ATTe ATTe_SE Ncontrol Ntreated
#> 1: -1 -1 0.000000 0.00000000 1503 749
#> 2: 0 -1 1.654672 0.08714482 1503 749
#> 3: 1 -1 2.710981 0.11885828 1503 749
#> 4: 2 -1 3.319803 0.13355642 1002 499
#> 5: 3 -1 4.597428 0.25745379 752 499
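To make the comparison between the two sets of results concrete, the short sketch below copies the ATTe_SE values printed above for event times 0 through 3 and computes the ratio of the cluster-adjusted standard errors to the default unit-clustered ones. The vector names are chosen here for illustration only.

```r
event_time = 0:3

# ATTe_SE from the default run (clustered on unit only)
se_unit = c(0.07813674, 0.09333436, 0.11998209, 0.12376782)

# ATTe_SE after adding varnames$cluster_names = "cluster"
se_cluster = c(0.08714482, 0.11885828, 0.13355642, 0.25745379)

comparison = data.frame(
  EventTime = event_time,
  se_unit = se_unit,
  se_cluster = se_cluster,
  ratio = round(se_cluster / se_unit, 2)
)
print(comparison)
```

The point estimates (ATTe) are identical across the two runs; only the standard errors change. Every ratio exceeds one, and the largest increase occurs at event time 3, where the clustered standard error is roughly twice the default one.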