In Spring 2020 and Fall 2020, I taught two sections of R labs for introductory statistics. One section was in formula syntax, the other in tidy(verse) syntax. I'm writing a paper about the experience, but for now, a thread 🧵
There's debate in the stat ed community about which syntax is best for teaching intro stat. My syntax comparison cheatsheet shows how to do the same task in three main syntaxes: base (dollar sign), formula, and tidyverse, and I've taught full-semester courses in each.
(That cheatsheet needs an update! The formula stuff isn't showing ggformula, which is now the standard for graphics in formula syntax. And, I'd probably remove qplot() for the ggplot2 stuff.)
The problem is, teaching a full-semester course in formula syntax and then another full-semester course in tidyverse syntax doesn't give an easy comparison. Memory fades. Packages change. And when you produce materials, you tend to pick tasks that are easy to do in the syntax you're teaching.
Teaching formula and tidyverse head to head allowed me to make a better comparison, but it was HARD. I basically doubled my prep time, because I needed to make materials in each syntax.
When I began the experiment, I was teaching in person. This meant I spent double the time writing pre-labs (one in each syntax), but my in-classroom time was essentially the same. It was an additional cognitive lift for me to code-switch (if you will), but it was doable.
But just a few weeks into the first semester of the experiment, we moved to online teaching. I took a flipped classroom approach where students watch pre-lab videos and follow along with the associated RMarkdown document.
This means that in addition to preparing the pre-lab documents in each syntax, I was also recording two separate sets of videos. The side benefit is that if you want to see how I introduce topics, or compare my approaches in the two syntaxes, all those videos are available!
A more organized version of the whole thing, mapping topics to videos and RMarkdown documents, is here: https://www.amelia.mn/STAT220labs
I did pre- and post-surveys of students, and will be examining the data to see if I can find anything interesting there. I'm also planning to analyze YouTube analytics and http://rstudio.cloud usage for differences.
But if you've come to this thread saying, "Amelia! I have to teach R labs *this week.* What syntax should I use?" my gut instinct is to say "formula."
In the formula labs, we loaded mosaic and ggformula every week, and if you look at my one-page "All the R you need for STAT 220 - formula" you'll see extreme consistency between the lines of code.
tally(~marital_status, data = GSS, format = "proportion")

gf_boxplot(highest_year_of_school_completed ~ labor_force_status, data = GSS)

t.test(highest_year_of_school_completed ~ born_in_us, data = GSS)

aov(highest_year_of_school_completed ~ labor_force_status, data = GSS)
The tidy labs were as consistent as I could make them. We loaded tidyverse and infer each week, and while infer is great, its developers are still working out some of the kinks. Tidyverse code is also more verbose, which can be challenging.
GSS %>%
  group_by(marital_status) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))

ggplot(GSS) + geom_boxplot(aes(x = marital_status, y = highest_year_of_school_completed))

(can't get as many commands in one tweet!)
GSS %>%
  drop_na(born_in_us) %>%
  t_test(
    response = highest_year_of_school_completed,
    explanatory = born_in_us,
    order = c("No", "Yes")
  )

aov(Age ~ marital_status, data = GSS)
The biggest thing is to *be consistent.* I spent a ton of time developing materials and doing my best to make everything in a particular syntax consistent. I don't think base R is the thing to teach in intro stats, but if you are going to use it, it always needs to look the same.
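For reference, here is a rough sketch of what "always looks the same" could mean in base (dollar sign) syntax, covering the same four tasks as above. This is only an illustration, not taken from my lab materials: it uses the built-in mtcars data as a stand-in for GSS, and the object names (cars, props, tt, fit) are made up for the example.

```r
# Stand-in data: mtcars ships with R (mpg = miles per gallon,
# cyl = number of cylinders, am = transmission type coded 0/1).
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))
cars$cyl <- factor(cars$cyl)

# Proportions of one categorical variable
props <- prop.table(table(cars$cyl))
props

# Side-by-side boxplots
boxplot(cars$mpg ~ cars$cyl)

# Two-sample t-test
tt <- t.test(cars$mpg ~ cars$am)
tt

# One-way ANOVA
fit <- aov(cars$mpg ~ cars$cyl)
summary(fit)
```

Every variable is referred to as data$variable, so at least the pattern stays the same from line to line.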
Here's another head-to-head comparison. First, formula syntax.

boot <- do(1000) * mean(~highest_year_of_school_completed, data = resample(GSS))

gf_histogram(~mean, data = boot)
Versus tidyverse syntax

boot <- GSS %>%
  specify(response = highest_year_of_school_completed) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

ggplot(boot) + geom_histogram(aes(x = stat))
There are strong and weak parts to both chunks. Formula syntax is really concise, but it leans on the magical * operator. Tidyverse syntax is more verbose, but that makes it easier to see all the pieces. I think writing this in base R would be pretty painful for intro students.
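To make the base R comparison concrete, here is one way the same bootstrap might look. Treat it as a sketch under assumptions: it uses the built-in mtcars data instead of GSS, the seed is arbitrary, and boot_means is just a name picked for the example.

```r
# Stand-in data: mtcars ships with R; mpg plays the role of
# highest_year_of_school_completed from the GSS examples.
set.seed(220)  # arbitrary seed so the resamples are reproducible

# Draw 1000 bootstrap resamples (same size as the data, with replacement)
# and record the mean of each one.
boot_means <- replicate(
  1000,
  mean(sample(mtcars$mpg, size = length(mtcars$mpg), replace = TRUE))
)

# Histogram of the bootstrap distribution of the mean
hist(boot_means)
```

It's short, but students have to juggle replicate(), sample(), replace = TRUE, and the dollar sign all at once, which is a lot for week one.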
You can follow @AmeliaMN.