Thread by @PhDemetri: Apropos of nothing, here is how I would structure a data science class [...]

Demetri

PhDemetri

Apropos of nothing, here is how I would structure a data science class:

0) Ethics

In this class, you might learn something that could hurt someone in unseen and unknown ways. It's important that we understand what kinds of questions we should ask ourselves before building a model.

1) Point predictions

Assuming the class is about prediction and a little about inference, it makes sense to start with the mean and median. They are the simplest predictions we can make and are extended by regression.

Topics: CLT, sampling variance, confidence intervals
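A minimal sketch of these topics, assuming simulated data: it computes the mean and median as point predictions, then uses the CLT to put an approximate 95% confidence interval around the mean. The function name `mean_confidence_interval` and the exponential sample are made up for illustration.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% CI based on the CLT."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical skewed data: exponential draws with true mean 2.0.
sample = [rng.expovariate(0.5) for _ in range(500)]
mean, lower, upper = mean_confidence_interval(sample)
print(f"mean={mean:.3f}, approx. 95% CI=({lower:.3f}, {upper:.3f})")
print(f"median={statistics.median(sample):.3f}")
```

The data are skewed, so the mean and median disagree, which is a handy opening for talking about which point prediction you actually want.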

2) Regression

We learned that the mean minimizes the squared error, and the median the absolute error. Let's extend that to the regression case now.

Topics: MLE, optimization, loss functions
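The link between loss functions and point predictions can be checked numerically. The sketch below, using made-up numbers, searches a grid of constant predictions and shows that squared loss picks out the mean while absolute loss picks out the median. The helper names are hypothetical.

```python
import statistics

def squared_loss(c, data):
    """Total squared error of predicting the constant c for every point."""
    return sum((x - c) ** 2 for x in data)

def absolute_loss(c, data):
    """Total absolute error of predicting the constant c for every point."""
    return sum(abs(x - c) for x in data)

data = [1.0, 2.0, 2.5, 4.0, 10.0]  # made-up values with one outlier
# Candidate constant predictions on a grid from 0.00 to 12.00.
grid = [i / 100 for i in range(0, 1201)]
best_squared = min(grid, key=lambda c: squared_loss(c, data))
best_absolute = min(grid, key=lambda c: absolute_loss(c, data))
print("squared loss minimizer:", best_squared, "mean:", statistics.fmean(data))
print("absolute loss minimizer:", best_absolute, "median:", statistics.median(data))
```

Regression keeps the same losses and replaces the constant with a function of the features.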

3) Model validation

So we've built a model, but how are we going to tell if it is any good? Here, we would talk about the difference between training, testing, and validation sets.

I would spend lots of time on this. It's important to let students know that whatever choices they make about the model after seeing data are part of the modelling process. Drop correlated features? You need to validate that.

Topics: Cross validation, the bootstrap, estimation of training error optimism, AIC, other loss functions, simulation, and different ways of measuring how much impact a variable had when added to the model. And that is just off the top of my head.
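As one concrete instance of the validation ideas above, here is a minimal k-fold cross validation sketch for a one-feature least-squares line, assuming simulated data. The names `fit_line` and `k_fold_mse` are hypothetical; the point is that every data-dependent step happens inside the loop, on training rows only.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Any step that looks at the data (here, only the fit) sees training rows only.
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
        fold_errors.append(statistics.fmean(errors))
    return statistics.fmean(fold_errors)

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]  # simulated, noise sd = 1
cv_mse = k_fold_mse(xs, ys)
print(f"5-fold CV MSE: {cv_mse:.3f}")
```

If you dropped features or tuned anything based on the data, that step would belong inside the loop too, next to the call to `fit_line`.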

4) More regression

Few classification problems are actually classification problems. Here is where we would introduce logistic regression and discuss when classification is and is not a good idea.

Topics: Proper scoring rules, sensitivity/specificity and when they make sense.
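To illustrate why scoring rules matter, here is a small sketch with made-up labels and predicted probabilities. Two hypothetical models have identical accuracy at a 0.5 threshold, yet the Brier score and log loss (both proper scoring rules) tell them apart. All names and numbers are invented for the example.

```python
import math

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def log_loss(probs, labels):
    """Mean negative log-likelihood of the observed outcomes."""
    total = sum(math.log(p if y == 1 else 1 - p) for p, y in zip(probs, labels))
    return -total / len(labels)

def accuracy(probs, labels, threshold=0.5):
    """Share of correct labels after thresholding the probabilities."""
    hits = sum((p >= threshold) == (y == 1) for p, y in zip(probs, labels))
    return hits / len(labels)

labels = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up outcomes
confident = [0.9, 0.8, 0.9, 0.1, 0.2, 0.1, 0.4, 0.6]    # hypothetical model A
hedging = [0.6, 0.55, 0.6, 0.45, 0.4, 0.45, 0.4, 0.6]   # hypothetical model B

for name, probs in [("confident", confident), ("hedging", hedging)]:
    print(name, accuracy(probs, labels), brier_score(probs, labels), log_loss(probs, labels))
```

Thresholding throws away the difference between 0.55 and 0.9; the scoring rules keep it.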

I don't think that would be one topic per week. The bootstrap in particular has a few variants, and if I wanted to spend time talking about bootstrap confidence intervals alone that could be a whole lecture.
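For a flavour of one variant, here is a minimal sketch of the percentile bootstrap confidence interval, applied to the median of simulated data. This is only the simplest variant; the function name `percentile_bootstrap_ci` and its defaults are assumptions for illustration.

```python
import random
import statistics

def percentile_bootstrap_ci(sample, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample with replacement, take quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Empirical alpha/2 and 1 - alpha/2 quantiles of the bootstrap estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

rng = random.Random(42)
sample = [rng.expovariate(1.0) for _ in range(200)]  # simulated skewed data
lower, upper = percentile_bootstrap_ci(sample, statistics.median)
print(f"median={statistics.median(sample):.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

The same function works for any statistic you can compute on a resample, which is most of the appeal.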

So I would probably leave out neural nets, SVMs, etc. Maybe they would come right at the end, but once you have cut your teeth on linear models, the other algorithms are just drop-in replacements.

Except if you want to do inference, of course.

Anyway, my point is that I think if we let linear models be the workhorses, and introduce non-linearity via splines or similar methods, we can spend more time on the less sexy but more important stuff.
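As a rough sketch of what "non-linearity via splines" can look like while staying inside a linear model: the example below builds a linear spline (hinge) basis with one knot and fits it by ordinary least squares. The data are simulated, the knot location is assumed known, and all function names are hypothetical.

```python
import random

def hinge_basis(x, knots):
    """Linear spline basis: intercept, x, and one hinge max(0, x - k) per knot."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]

def solve_linear_system(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (aug[r][n] - rest) / aug[r][r]
    return coef

def fit_least_squares(rows, ys):
    """Ordinary least squares via the normal equations (X'X) coef = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(300)]
# Simulated target with a kink at x = 5: flat at 1, then slope 2.
ys = [1.0 + 2.0 * max(0.0, x - 5.0) + rng.gauss(0, 0.5) for x in xs]
knots = [5.0]  # assumed known here; in practice knot choice needs validating too
coef = fit_least_squares([hinge_basis(x, knots) for x in xs], ys)
print("intercept, slope, hinge:", [round(c, 2) for c in coef])
```

The model is still linear in its coefficients, so everything about loss functions and validation carries over unchanged.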
