Something that comes up time and time again in my work is confusion between (1) *prediction* analyses and (2) *association* analyses. It’s easy to confuse the two. A thread highlighting the differences 🧵 1/N #DataScience #MedTwitter #OrthoTwitter #AcademicTwitter
Prediction and association analyses are related, and can inform each other, but focus on different things. Part of the confusion is that the same sort of statistical model can be used for both (e.g. logistic regression). Don’t confuse the algorithm with the goal of the analysis.
Another confusing point is that people colloquially refer to independent variables as “predictors”, which leads people to think they’re doing prediction, when often they’re not.
Prediction focuses on a practical task: creating a model that, when applied to *new data*, prognosticates well according to one or more metrics. What makes it prediction is testing overall predictive ability on an independent sample.
The right stat to measure predictive performance depends on the circumstance (e.g. continuous or binary outcome, rare or common outcome). But again, the emphasis with prediction is on the overall predictive performance of the model on *new data* not used in model formation.
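For instance, a minimal sketch of how the metric follows the outcome type (scikit-learn, with made-up illustrative numbers, not real data):

```python
# Sketch: pick the metric to match the outcome type (illustrative values only).
from sklearn.metrics import roc_auc_score, mean_squared_error

# Binary outcome: discrimination on held-out data, e.g. AUC
y_test_binary = [0, 1, 0, 1, 1]
predicted_prob = [0.2, 0.7, 0.4, 0.9, 0.6]
print("AUC:", roc_auc_score(y_test_binary, predicted_prob))

# Continuous outcome: error on held-out data, e.g. RMSE
y_test_cont = [3.1, 5.0, 2.2]
predicted = [2.8, 4.6, 2.5]
print("RMSE:", mean_squared_error(y_test_cont, predicted) ** 0.5)
```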
A note on terminology — in the context of prediction, we often refer to the independent variables used to do the predicting as “features”. Prediction is less about quantifying which features matter and how much, and, again, more about overall model performance on new data.
The important thing with prediction is that the model must be trained on data that is *not used* to evaluate predictive performance. If you don’t do this, you’re cheating, and aren’t doing prediction. The number of papers I've seen claiming prediction without any held-out evaluation is alarming IMO.
Practically speaking, we often only have one dataset, so we randomly split it into training and test sets (i.e., build the model on the training set, then apply it to the test set to measure performance).
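A minimal sketch of that split with scikit-learn (the file and column names here are hypothetical, and features are assumed numeric):

```python
# Sketch: random train/test split -- fit on the training set, evaluate on the test set only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("cohort.csv")                     # hypothetical dataset
X, y = df.drop(columns="outcome"), df["outcome"]   # hypothetical outcome column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out AUC: {auc:.3f}")
```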
Personally, when I have enough data, I like taking the last, say, month of generated data and holding that aside as a second test set. This mimics reality -- we would at some point start using the model, and this sort of simulates that.
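A sketch of that temporal holdout, assuming the data sit in a pandas DataFrame with a date column (column names hypothetical):

```python
# Sketch: hold out the most recent month as a second, time-based test set.
import pandas as pd

df = pd.read_csv("cohort.csv", parse_dates=["visit_date"])  # hypothetical date column
cutoff = df["visit_date"].max() - pd.DateOffset(months=1)

temporal_test = df[df["visit_date"] > cutoff]    # last month: never touched during training
earlier = df[df["visit_date"] <= cutoff]         # split this into train/test as usual
```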
Be wary of info bleeding from the training set into the test set, i.e. things that render them not independent. It can happen in subtle ways that will give you inflated predictive power. Whenever I get too-good-to-be-true predictive performance, I assume this is happening.
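One common, subtle leak is fitting preprocessing (scaling, imputation) on all of the data before splitting. A sketch of one way to guard against that, keeping preprocessing inside a scikit-learn Pipeline (synthetic data here, just for illustration):

```python
# Sketch: preprocessing inside a Pipeline is re-fit on each training fold only,
# so the scaler never "sees" the held-out fold.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic stand-in
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc"))
```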
In the context of prediction, even though we're focused on overall predictive performance, it’s still important to take a look at individual feature importances, e.g. the magnitude of coefficients in logistic regression. This gives us a sense of what is mechanically driving predictions.
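Continuing the split sketch above, one way to peek at this with a fitted scikit-learn logistic regression (meaningful mainly if the features are on comparable scales, e.g. standardized):

```python
# Sketch: coefficients as rough feature importances -- what drives predictions,
# NOT unbiased associations (see the next tweet).
import pandas as pd

coefs = pd.Series(model.coef_[0], index=X_train.columns)
print(coefs.sort_values(key=abs, ascending=False).head(10))
```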
But BEWARE! If you’ve thrown in a large number of features, particularly ones that are correlated with each other, you CANNOT interpret these feature importances as unbiased associations.
Particularly among physicians, I find this is the easiest and most common slip-up to make, because it feels natural to interpret features this way. Don’t be tempted! This is why most prediction analyses are not association analyses.
The basic intuition is that if you include many highly correlated variables, it’s somewhat arbitrary which of them the model emphasizes and places high weight on, so you cannot trust the weights as associations.
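A toy sketch of that intuition, with two nearly duplicate features (synthetic data; how the weight gets split between them across refits is essentially noise):

```python
# Sketch: near-collinear features -- the model's split of weight between them is unstable,
# even when overall predictive performance is not.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = x1 + rng.normal(scale=0.01, size=2000)           # near-duplicate of x1
y = (x1 + rng.normal(size=2000) > 0).astype(int)
X = np.column_stack([x1, x2])

for seed in range(3):                                 # refit on bootstrap resamples
    idx = np.random.default_rng(seed).integers(0, 2000, size=2000)
    coef = LogisticRegression(C=1e6, max_iter=5000).fit(X[idx], y[idx]).coef_[0]
    print(coef.round(2), "sum:", coef.sum().round(2)) # compare how the weight is split across refits
```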
Of course, if you do find high predictive performance among a set of features, somewhere among them, there are likely interesting associations. It requires detective work, carefully evaluating what is correlated to tease apart and credibly claim an unbiased association.
With association analyses, the focus isn’t on overall model fit, nor how it performs out-of-sample on new data. Instead, we focus on one or more independent variables, and their individual relationships with the dependent variable of interest (e.g., magnitude of a coefficient).
For association analyses, most people just run one big model and report in-sample stats as they relate to the relevant independent variables. You don’t typically see testing on independent samples. But doing so can speak to the robustness of the associations, which can be valuable.
For example, with a big enough dataset, you could imagine randomly splitting the data in half, performing your analyses identically on each half, and comparing the magnitudes/signs of the relevant associations — similarities would point to robustness.
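A sketch of that split-half check, using statsmodels for the association model (dataset and variable names hypothetical):

```python
# Sketch: fit the same association model on two random halves and compare the
# coefficient of interest across halves.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cohort.csv")                       # hypothetical dataset
half_a = df.sample(frac=0.5, random_state=0)
half_b = df.drop(half_a.index)

formula = "outcome ~ exposure + age + sex"           # hypothetical variables
for name, half in [("A", half_a), ("B", half_b)]:
    fit = smf.logit(formula, data=half).fit(disp=0)
    print(name, fit.params["exposure"], fit.conf_int().loc["exposure"].values)
```

Similar magnitudes and signs across the halves would point to robustness; big swings would be a flag.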
And of course, repeatedly finding similar associations in altogether independent datasets speaks even more to the underlying robustness of the relationships at hand.
Feature selection is important here, e.g., to ensure everything relevant is included and to avoid including highly overlapping variables. Though we don’t typically call this feature selection, we call it “picking covariates” and "eliminating confounding" or something similar.
This is especially important because, in this context, we care about the individual variables’ relationships to the dependent variable. Indeed, we want to be mindful that we’ve included relevant confounders, without which we are liable to find associations that may be spurious.
Association analyses are important when we cannot conduct RCTs and/or only have retrospective, observational data. But we should be frank that association analyses with retrospective, observational data are typically not able to measure causal effects.
Try as we might, these sorts of association analyses can, at best, hint at causal effects, particularly when they are found consistently across different environments and datasets, have been fully vetted by domain experts, and relevant confounding has been ruled out.
But there is no substitute for a proper RCT; the specter of unmeasured confounding indeed looms large. From a pure epistemological standpoint, it’s nearly impossible to rule out unmeasured confounding without an RCT.
Of course, I’m an economist, and our entire discipline is focused on enumerating *assumptions* under which such associations may be credibly claimed as causal. But the plethora of methods and circumstances in which it's credible to do so... that’s for another thread. FIN. 25/25