Hello! Tamara Broderick, Ryan Giordano and I have a new working paper out!! It's called "An Automatic Finite-Sample Robustness Metric: Can Dropping a Little Data Change Conclusions?" https://arxiv.org/abs/2011.14999 
Here comes the paper thread!!! Aaaaaah!!!
We propose a way to measure the dependence of research findings on the particular realisation of the sample. We find that several results from big papers in empirical micro can be overturned by dropping less than 1% of the data -- sometimes just 1-10 data points -- even when samples are large.
We do find some results are robust, and can simulate totally robust results, so this thing genuinely varies in practice. You might think that the sensitive cases must have outliers or spec problems, but they don't need to, and our headline application has binary outcome data!
So what is this metric? Ok, for any given result (sign of an effect, significance, whatever) we ask if there is a small set of observations in the sample with a large influence on that result, in the sense that removing them would overturn it.
This is like asking how bad it could be for your result if a small percentage of the data set was lost, or a small percentage of your population of interest was not in the sample.
Such exclusions are pretty common, both b/c of the practical difficulty of sampling the real world perfectly at random (and of humans processing the data perfectly), and b/c the world is always changing across space and time, even if only a little.
Typically it's not reasonable to assume these deviations from the ideal thought experiment are random: there's usually a reason you can't perfectly sample people or places, or interpret everyone's data intelligibly, or predict the future!
So we want to know if there's a small number of highly influential points in the sample, capable of overturning our result if dropped. Finding them exactly is possible in principle, but it's usually computationally prohibitive -- there are just too many combinations to cycle through.
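To get a feel for the scale of that exact search, here's a back-of-the-envelope count (my own toy numbers, not from the paper) of how many candidate subsets a brute-force leave-k-out check would have to refit:

```python
# Toy count of the subsets an exact leave-k-out search would have to refit.
# The sample size and 1% threshold below are illustrative, not from the paper.
from math import comb

N = 16_000                     # a moderately sized sample
k = int(0.01 * N)              # dropping up to 1% of it, i.e. 160 points
n_subsets = sum(comb(N, j) for j in range(1, k + 1))
print(f"roughly 10^{len(str(n_subsets)) - 1} candidate subsets")
```

Even at a million refits per second, that search would never finish.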
We develop an approximation to the influence of removing any given set of points. It's a Taylor expansion type of thing, but what's exciting is YOU CAN *ALWAYS* CHECK IF THIS APPROXIMATION WORKED IN *EVERY* GIVEN SAMPLE! So it's not the usual bullshit "trust my big math" thing.
Our approach identifies these approximate-high-influence points, so you can always remove them, re-run the analysis once, and see if the result changes. Whatever you get is an exact lower bound on the true sensitivity, since at worst we missed out on some higher-influence points.
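For concreteness, here's a minimal sketch of that workflow for plain OLS, written from the generic influence-function formula rather than taken from the paper's code; the function and variable names (like drop_and_refit) are mine, not the package's.

```python
# Minimal OLS sketch of the approximate-influence, drop-and-refit check.
# My illustration of the generic first-order formula, not the paper's code.
import numpy as np

def ols_fit(X, y):
    """OLS coefficients and residuals."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta, y - X @ beta

def coef_influence(X, resid, j):
    """First-order effect of each observation on coefficient j:
    removing observation n shifts beta[j] by roughly -influence[n]."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return (X @ XtX_inv[:, j]) * resid

def drop_and_refit(X, y, j, n_drop):
    """Drop the n_drop points predicted to push beta[j] down the most,
    refit exactly once, and return (original, refit) estimates of beta[j]."""
    beta, resid = ols_fit(X, y)
    infl = coef_influence(X, resid, j)
    keep = np.argsort(infl)[:len(y) - n_drop]   # discard the top n_drop influences
    beta_refit, _ = ols_fit(X[keep], y[keep])
    return beta[j], beta_refit[j]
```

The refit at the end is the "check": if the sign (or significance) actually flips in the re-run, the reported sensitivity is real and not an artefact of the Taylor approximation.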
In our applications we almost always achieve the claimed reversal (tho we discuss exceptions in the paper, and it seems like having true parameter values lying near the boundaries of their spaces is a problem even if you transform the parameter).
Now of course if you'd like some big math, we do have some big math for you. We formally derive the approximation for Z estimators (like GMM, OLS, IV, MLE) under regularity conditions.
We have explicit bounds on the approximation error for OLS and IV - it's small relative to the real change in the result. We show our metric is a semi-norm on the Influence Function, linking it to standard errors and gross error sensitivity, which are different norms on the IF.
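In my own notation (which may differ from the paper's, sign and scaling conventions included), the basic object looks like this: for a Z-estimator, the first-order effect of dropping a set S of observations is a sum of their empirical influence functions.

```latex
% Sketch in my own notation; conventions may differ from the paper's.
\[
  \hat\theta \ \text{solves} \ \frac{1}{N}\sum_{n=1}^{N} g(d_n,\hat\theta)=0,
  \qquad
  \widehat{\mathrm{IF}}_n \;=\; -\,\hat A^{-1} g(d_n,\hat\theta),
  \qquad
  \hat A \;=\; \frac{1}{N}\sum_{m=1}^{N}\frac{\partial g(d_m,\hat\theta)}{\partial\theta^{\top}} .
\]
% First-order effect of dropping the observations in a set S:
\[
  \hat\theta_{-S} \;\approx\; \hat\theta \;-\; \frac{1}{N}\sum_{n\in S}\widehat{\mathrm{IF}}_n .
\]
```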
Why are some analyses so sensitive? It turns out to be linked to the signal to noise ratio, where the signal is the strength of the empirical result, and the noise is large when the influence function is "big" in a specific sense.
For OLS, the value of the influence function for each data point is just that point's sample regression error times its leverage. One or the other is not enough. You need both at once, on the same point. That's part of why you can't eyeball this thing in the outcome or errors.
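In symbols (again my notation, sketching the same point): the OLS influence function is the residual scaled by a leverage-type weight in x_n, and the approximate drop-set shift just adds these up.

```latex
% OLS specialisation, my notation: residual times a leverage-type weight on x_n.
\[
  \widehat{\mathrm{IF}}_n \;=\; \Big(\tfrac{1}{N} X^{\top}X\Big)^{-1} x_n\,\hat\varepsilon_n ,
  \qquad
  \hat\theta_{-S} \;\approx\; \hat\theta \;-\; \big(X^{\top}X\big)^{-1}\sum_{n\in S} x_n\,\hat\varepsilon_n .
\]
```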
Wouldn't that "noise" show up in standard errors? No, because standard errors are divided through by root-N. Big N can make SEs small even when the noise is large. That's also why SEs disappear asymptotically, but our metric won't. Important as we move into the "big data" space.
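A back-of-the-envelope way to see the scaling (my heuristic, assuming the linear approximation holds for a scalar coefficient): the standard error shrinks with N, but the shift from dropping the most influential fraction alpha of points converges to a fixed tail expectation.

```latex
% Heuristic scaling comparison, assuming the first-order approximation.
\[
  \mathrm{SE}(\hat\theta) \;\approx\; \frac{\mathrm{sd}(\widehat{\mathrm{IF}})}{\sqrt{N}} \;\longrightarrow\; 0,
  \qquad
  \hat\theta - \hat\theta_{-S_\alpha} \;\approx\; \frac{1}{N}\sum_{n \in S_\alpha}\widehat{\mathrm{IF}}_n
  \;\longrightarrow\; \mathbb{E}\big[\mathrm{IF}\,\mathbf{1}\{\mathrm{IF} \ge q_{1-\alpha}\}\big] ,
\]
% where S_alpha collects the alpha*N largest influences and q_{1-alpha} is the
% corresponding quantile: the first term vanishes as N grows, the second does not.
```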
Also, this noise reflects a distributional shape component that SEs don't, but that is NOT just about outliers: we show that this sensitivity to 1% of the sample can arise even in perfectly specified OLS inference on Gaussian data, and it also arises in practice on binary data.
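Here's a toy simulation of that Gaussian case, reusing the same recipe as the sketch above (my own numbers and seed, not the paper's application): the model is perfectly specified, there are no outliers in any meaningful sense, and yet dropping the most influential 1% of points can flip the sign of a weak slope.

```python
# Toy simulation (mine, not from the paper): well-specified Gaussian OLS with a
# weak signal, where dropping ~1% of the sample can flip the slope's sign.
import numpy as np

rng = np.random.default_rng(0)
N = 20_000
x = rng.normal(size=N)
y = 0.02 * x + rng.normal(size=N)            # true slope 0.02, noise sd 1

X = np.column_stack([np.ones(N), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# First-order influence of each observation on the slope (coefficient 1).
infl = (X @ np.linalg.inv(X.T @ X)[:, 1]) * resid

# Drop the 1% of points whose removal pushes the slope down the most, refit once.
keep = np.argsort(infl)[:N - N // 100]
beta_dropped = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])

print(f"full-sample slope:        {beta[1]: .4f}")
print(f"slope after dropping 1%:  {beta_dropped[1]: .4f}")   # typically negative here
```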
This links up to something we were discussing on twitter earlier this year: what's intuitively wrong with running an OLS linear reg when the X-Y data scatter is basically vertical? Well, many things, but one of them is that the signal to noise ratio is *probably* quite low.
The fact that this sensitivity can arise even when nothing is formally "wrong" with the classical inference can feel weird, because we are used to thinking of our SEs and performance metrics like bias, test size, etc as capturing all the uncertainty we have -- but they don't!
Classical procedures are only designed to capture one type of uncertainty: the variation in an estimator's finite-sample value across perfectly random resamplings of the exact same population.
But this hypothetical perfect resampling experiment doesn't really capture all the relevant uncertainty about results in social science. We're not physicists or crop yield analysts, so we shouldn't expect their statistical tools to be suitable for us.
We need to ask about data-dependence in ways that make more sense given how we actually generate and use empirical results!
*Bayesian whisper* also wouldn't you rather know about the dependence of a research result on the sample you HAVE rather than the dependence you could imagine having in some hypothetical thought experiment based on a resampling exercise you could never do? You would, come on!!
But let me be super clear: my own bayesian papers don't escape this problem and you should check out the paper if you want to see me dunk on myself for several pages. (My hunch is that I have been using overly weak priors.)
We think you should report our metric. We definitely don't think you should abandon results that aren't robust, but non-robustness should prompt a desire to understand the estimation procedure more deeply, and promote caution in generalizing the results too broadly.
We wrote you an R package to compute and report it automatically! The package currently calls Python under the hood, so you need Python installed; future versions will be able to do OLS and IV without it. https://github.com/rgiordan/zaminfluence
We really hope this paper can be part of a broader conversation about empirical social science that leads us all to try out new ways of interrogating our data and understanding our conclusions a lot more deeply.
Don't think of this new metric as "yet another thing you have to report (sigh)" but as yet another tool to illuminate the way in which your procedure is constructing information about the population from your sample. And what could be more important than that!? Nothing!
FIN!!!!