Thread by @WomenInStat, Let’s talk data preparation! Data in the real world is usually pretty [...]

Women in Statistics and Data Science

WomenInStat

Let’s talk data preparation! Data in the real world is usually pretty dirty, and data cleaning may seem like a chore, but it is a vital step for any modeling down the road. Disclaimer: this is by no means comprehensive, but it is how I like to think about the big picture steps.

Step 1: Make sure your data file came with a data dictionary. It will be your new best friend, one who just so happens to know everything there is to know about your dataset.

Step 2: Once you get comfortable with the contents of your data, the tidying begins. And with tidying comes the Tidyverse, a collection of R packages, aka my holy grail. I personally like to begin by selecting my rows and columns of interest.

Step 2.1: Sometimes, I will try to do this in small bite-sized pieces first, like a quick pilot study to make the data more digestible. This could be isolating maybe 5 individuals with my features of interest and getting comfortable with them.
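Steps 2 and 2.1 could look something like the sketch below. The `patients` data frame, its column names, and the age cutoff are all made up for illustration; swap in whatever your data dictionary says matters.

```r
library(dplyr)

# Hypothetical toy data: one row per individual, one column per feature.
patients <- data.frame(
  id     = 1:8,
  age    = c(34, 51, 29, 62, 45, 38, 57, 41),
  bmi    = c(22.1, 27.8, -4, 31.2, 99, 24.5, 26.0, 29.3),
  smoker = c(0, 1, 0, 0, 1, 0, 1, 0),
  notes  = letters[1:8]
)

# Keep the rows (filter) and columns (select) of interest.
of_interest <- patients %>%
  filter(age >= 30) %>%      # assumed inclusion rule, for illustration only
  select(id, age, bmi)

# Bite-sized pilot: the first 5 individuals with just those features.
pilot <- of_interest %>%
  slice_head(n = 5)

glimpse(pilot)
```

Once the pipeline behaves on the pilot, the same steps can be run on the full dataset.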

Step 2.2: This is where I also want to talk about the awesomeness of the new Tidyverse Skills e-Book by @mirnas22 @rdpeng @stephaniehicks @Shannon_E_Ellis . Do yourself a favor and get your hands on it, especially if you are new to the Tidyverse: https://leanpub.com/tidyverseskillsdatascience.

Step 3: Let’s say my rows are individuals and my columns are features. Now I like to inspect columns of interest and this is where the piping %>% starts to go crazy. Here, I mutate my columns according to my needs: recode, rename, reorder, change data type, you name it.
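Here is a rough sketch of that kind of pipe chain. The data, the column names, and the 1/2 coding for sex are assumptions invented for this example, not something from a real data dictionary.

```r
library(dplyr)

# Hypothetical raw data: sex is coded 1/2 and age was read in as text.
patients <- data.frame(
  id  = 1:4,
  SEX = c(1, 2, 2, 1),
  age = c("34", "51", "29", "62"),
  stringsAsFactors = FALSE
)

cleaned <- patients %>%
  rename(sex = SEX) %>%                      # rename
  mutate(
    sex = case_when(                         # recode (assumed coding)
      sex == 1 ~ "male",
      sex == 2 ~ "female"
    ),
    sex = factor(sex),                       # change data type: character -> factor
    age = as.integer(age)                    # change data type: character -> integer
  ) %>%
  relocate(age, .after = id)                 # reorder columns

str(cleaned)
```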

Step 3.1: You can mutate your missing data too, but some decision making must go into this step. For example, let’s say my variable for BMI has two types of missing data: a value of -4 indicates “not available” and a value of 99 indicates “not evaluated.” 1/2

If you want to treat both of these as “missing,” you can mutate them to NA; if not, you can keep them as is. CAVEAT: it depends on your data, and there are different types of missing data! I recommend reading this section of Flexible Imputation of Missing Data before proceeding: https://stefvanbuuren.name/fimd/sec-MCAR.html 2/2
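For the BMI example, one way to do the conversion is sketched below. The toy data is hypothetical, and the extra `bmi_missing_reason` column is just one option for holding on to the distinction between the two codes; whether you need it depends on your data.

```r
library(dplyr)

# Hypothetical data using the codes from the example:
# -4 = "not available", 99 = "not evaluated".
patients <- data.frame(
  id  = 1:6,
  bmi = c(22.1, -4, 27.8, 99, 31.2, 24.5)
)

recoded <- patients %>%
  mutate(
    # Record why a value is missing before overwriting it
    # (mutate evaluates its arguments in order, so bmi is still the raw value here).
    bmi_missing_reason = case_when(
      bmi == -4 ~ "not available",
      bmi == 99 ~ "not evaluated",
      TRUE      ~ NA_character_
    ),
    # Treat both codes as missing.
    bmi = if_else(bmi %in% c(-4, 99), NA_real_, bmi)
  )

recoded
```

If only one of the codes should count as missing, `na_if(bmi, -4)` converts just that code and leaves the other untouched.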

Step 3.2: Converting your missing data to NAs is helpful because some packages may not recognize the way your missing data is coded, especially if you choose to impute your data. I like to use the mice R package for imputation.
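A minimal imputation sketch with mice, using the small `nhanes` example dataset that ships with the package. The choice of predictive mean matching (`"pmm"`), 5 imputations, and the seed are illustrative assumptions, not recommendations for your data.

```r
library(mice)

# nhanes: example data bundled with mice; bmi, hyp and chl contain NAs.
colSums(is.na(nhanes))

# Create 5 imputed datasets with predictive mean matching.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Pull out the first completed dataset to take a look.
completed <- mice::complete(imp, 1)
head(completed)
```

Looking at a single completed dataset is fine for a sanity check; for actual modeling, mice is designed to fit the model on every imputed dataset and pool the results (`with()` and `pool()`).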

Step 4: Your data is looking good; variables are coded the way you want them, and missing data has been handled. You can now do other things like normalizing, scaling, and combining if needed. Here I usually find myself having to create composite variables from my base variables.
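As a small sketch of this step: the data below is made up, BMI computed from weight and height stands in for a composite variable, and a z-score stands in for scaling.

```r
library(dplyr)

# Hypothetical base variables.
patients <- data.frame(
  id        = 1:5,
  weight_kg = c(60, 82, 71, 95, 68),
  height_m  = c(1.65, 1.80, 1.72, 1.78, 1.60)
)

prepared <- patients %>%
  mutate(
    bmi      = weight_kg / height_m^2,          # composite variable from two base variables
    weight_z = as.numeric(scale(weight_kg))     # standardize: mean 0, sd 1
  )

prepared
```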

Step 5: It's nice to ensure that your data cleaning is going well by visualizing your data along the way. Do a quick histogram to check that each variable's distribution looks right. Pull up some summary stats to verify that your variables made it through cleaning as intended.
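A quick check along those lines might look like this. The data is hypothetical, and the bin width is an arbitrary choice for the example.

```r
library(ggplot2)

# Hypothetical cleaned data: the missing-data codes have already been converted to NA.
patients <- data.frame(
  bmi = c(22.1, NA, 27.8, NA, 31.2, 24.5, 26.0, 29.3)
)

# Summary stats: a leftover -4 or 99 would show up in the min or max.
summary(patients$bmi)
bmi_range <- range(patients$bmi, na.rm = TRUE)

# Quick histogram of the distribution.
bmi_hist <- ggplot(patients, aes(x = bmi)) +
  geom_histogram(binwidth = 2, na.rm = TRUE)

print(bmi_hist)
```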

You’ve made it from dirty to clean—Hoorah! This was a simple guide and I appreciate you following along. We in the biz like to say "garbage in, garbage out" but if you keep it tidy, it'll be mighty!
