Thread by Jan Kabátek (@JanKabatek)

Here's a bunch of @Stata tips for handling large datasets (millions of observations).

I wish I had known them when I was starting with admin data analysis...

1) Memory is often an issue, so store your data efficiently!

- use the 'compress' command to recast your variables into the smallest data types that hold them without loss

- declare the correct data types when generating variables
('gen byte var1 = 1')
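
A minimal sketch of both points, using the auto dataset that ships with Stata (the variable names 'flag' and 'id' are made up for illustration):

```stata
sysuse auto, clear
* Check the dataset size before and after compressing
describe, short
compress
describe, short
* Declare a small type up front instead of the default float
gen byte flag = 1
* Use long for integer IDs: float is only exact up to 16,777,216
gen long id = _n
```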

2) Merging does not need to take forever!

- be aware that the 'merge' command sorts both the 'master' and 'using' data on the matching variables

- you can save a lot of time by running the merge command on datasets that are already sorted!
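
A rough sketch of a pre-sorted merge, again on the auto dataset. The 'sorted' option tells merge to skip its own sort. It assumes both files really are sorted on the key, and Stata should complain if they are not. The tempfile name 'prices' is just for illustration.

```stata
* Build a sorted 'using' file
sysuse auto, clear
keep make price
sort make
tempfile prices
save `prices'

* Sort the 'master' data on the same key, then merge without re-sorting
sysuse auto, clear
keep make mpg
sort make
merge 1:1 make using `prices', sorted
```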

3) Command 'joinby' does what 'merge m:m' should have been doing all along

- that is, within each group defined by the matching variables, it forms all pairwise combinations between 'master' and 'using'
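
A toy example with made-up data (one family, two parents, three children) to show what that looks like:

```stata
* 'using' data: three children in family 1
clear
input family_id child_id
1 1
1 2
1 3
end
tempfile children
save `children'

* 'master' data: two parents in family 1
clear
input family_id parent_id
1 1
1 2
end

* Every parent is paired with every child of the same family: 2 x 3 = 6 rows
joinby family_id using `children'
count
```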

4) If you're working with spell-level data, learn to use Stata's native date & time functions

- these will allow you to store the time data efficiently and avoid transformation/approximation errors
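
A small sketch with made-up spell dates. Daily dates are stored as integer day counts, so spell lengths are simple subtractions. Datetimes need 'double', otherwise they get rounded.

```stata
clear
input str10 start_str str10 end_str
"2019-01-15" "2019-03-01"
end
* date() returns days since 01jan1960
gen long start_date = date(start_str, "YMD")
gen long end_date = date(end_str, "YMD")
format start_date end_date %td
* Spell length in days, with no approximation involved
gen long spell_days = end_date - start_date
* Datetimes are milliseconds since 1960 and must be stored as double
gen double start_ts = clock("2019-01-15 08:30:00", "YMD hms")
format start_ts %tc
list
```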

5) Factor variables (i.var) can help you overcome memory constraints in regressions, because the indicator variables are created on the fly rather than stored in the dataset
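
For instance, on the auto dataset (just an illustration, the model itself is arbitrary):

```stata
sysuse auto, clear
* No need to generate and store a dummy for each category of rep78
regress price mpg i.rep78
* Interactions work the same way, without creating new variables
regress price i.foreign##c.mpg
```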

6) When estimating regression models with if conditions, it is often faster to drop all the irrelevant data first:
preserve
keep if var1 == 1
reg y x, robust
restore

I'll add more when I think of something relevant.
Feel free to chime in too.

7) Echoing what others suggested:

If you can't load the full dataset due to memory constraints, you can adjust the 'use' command to load only a subset of the data

use var1 if var1 == 1 using "data.dta"

Loading a subset of variables is fast.
Loading with if conditions is slow.
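
A sketch of the cheaper options, using a temporary copy of the auto dataset as a stand-in for a big file (the tempfile name 'cars' is made up):

```stata
sysuse auto, clear
tempfile cars
save `cars'

* Inspect what is in the file without loading it
describe using `cars'
* Load only the variables you need
use make price using `cars', clear
* Load only a range of observations, e.g. for testing code
use make price in 1/10 using `cars', clear
```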

8) @Stata is actually pretty good for handling large data.

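R caps the number of rows in a data frame at ~2.1bn (2^31 - 1), which can be restrictive. AFAIK Stata/MP's limit is far higher (the documented maximum is about 1.1 trillion observations), though other Stata editions top out at ~2.1bn too.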

R packages exist to bypass the cap, but these may not be available in secure environments.

9) Creating graphs can take forever, so rather than running the graph commands over and over again, use the Graph Editor to make stylistic changes to the graphs you have already made.

The gr_edit commands can be useful too: https://twitter.com/JanKabatek/status/1295970998999584768
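
Related sketch: saving the graph in Stata's .gph format lets you reopen and restyle it later without redrawing it from the data. The file names here are made up.

```stata
sysuse auto, clear
scatter price mpg
graph save "price_mpg.gph", replace

* Later: reopen the stored graph instead of re-running the plotting command
graph use "price_mpg.gph"
* Edit it in the Graph Editor if needed, then export
graph export "price_mpg.png", replace
```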
