Thread by @WomenInStat, Today I will be talking about some of the data structures we [...]

Women in Statistics and Data Science

Today I will be talking about some of the data structures we use regularly when doing data science work. I will start with numpy's ndarray.

What is an ndarray? It's numpy's abstraction for describing an array, or a group of numbers. In math terms, "array" is a catch-all term for vectors, matrices, and their higher-dimensional generalizations. Behind the scenes, an ndarray essentially describes a block of memory using several key attributes:

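* pointer: the memory address of the first byte in the array
* type (`dtype`): the kind of elements in the array, such as floats or ints
* shape: the size of each dimension of the array (e.g., 5 x 5 x 5)
* strides: the number of bytes to skip in each dimension to reach the next element along that dimension
* flags: information about the memory layout, such as whether the array is contiguous or writeable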
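
To make these attributes concrete, here is a minimal sketch that inspects them on a made-up 5 x 5 x 5 array of 64-bit floats (the array `a` and its contents are just for illustration):

```python
import numpy as np

# Hypothetical array: 5 x 5 x 5, each element an 8-byte float.
a = np.zeros((5, 5, 5), dtype=np.float64)

print(a.ctypes.data)  # pointer: address of the first byte (changes from run to run)
print(a.dtype)        # type: float64
print(a.shape)        # shape: (5, 5, 5)
print(a.strides)      # strides: (200, 40, 8) bytes per step along each dimension
print(a.flags)        # flags: C_CONTIGUOUS, OWNDATA, WRITEABLE, ...
```

The strides follow from the shape and element size: moving one step along the last dimension skips 8 bytes, along the middle dimension 5 * 8 = 40 bytes, and along the first dimension 5 * 5 * 8 = 200 bytes.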

The "strides" attribute here is key. It allows you to subset or view data *without* copying it, which saves both time and memory. In this example, `x` and `y` share memory, even though they aren't exactly the same array! This is very helpful when working with "big data."
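
A minimal sketch of the idea, assuming `y` is a strided slice of `x` (the values here are made up for illustration):

```python
import numpy as np

x = np.arange(10, dtype=np.int64)  # hypothetical data: 0..9, 8 bytes per element
y = x[::2]                         # every other element of x

print(x.strides)                # (8,)  -> step 8 bytes to reach the next element
print(y.strides)                # (16,) -> same memory, but step 16 bytes instead
print(np.shares_memory(x, y))   # True: no data was copied
print(y.base is x)              # True: y is a view onto x
```

Only the shape and strides differ between `x` and `y`; the underlying buffer is the same.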

This is why, if you've ever modified a slice of a numpy array, you ended up modifying the original array too!
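
Here is a small sketch of that behavior, along with one common way to avoid it when you really do want independent data (the names `original`, `view`, and `safe` are just for illustration):

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

# A basic slice is a view, so writing through it changes the original.
view = original[1:4]
view[0] = 99
print(original[1])   # original[1] is now 99

# An explicit copy owns its own memory, so the original is left alone.
safe = original[1:4].copy()
safe[0] = -1
print(original[1])   # still 99
```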

The `strides` attribute is not only relevant when slicing arrays. Transposes, reshapes, and other operations also take advantage of it to avoid copying large amounts of data. Stay tuned for the next thread on vectorization.
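
As a rough sketch, you can watch the strides change while the data stays put (the 2 x 3 array `m` is a made-up example). One caveat worth knowing: a reshape returns a view only when the new shape can be expressed with strides over the existing memory, and otherwise numpy falls back to a copy.

```python
import numpy as np

m = np.arange(6, dtype=np.int64).reshape(2, 3)  # hypothetical 2 x 3 array
print(m.strides)   # (24, 8)

t = m.T            # transpose: just swaps the strides
print(t.strides)   # (8, 24)

r = m.reshape(3, 2)  # reshape of contiguous data: new shape and strides, same memory
print(r.strides)     # (16, 8)

print(np.shares_memory(m, t), np.shares_memory(m, r))  # True True

# Flattening the transposed view cannot be described by a single stride,
# so numpy has to copy here.
c = t.reshape(6)
print(np.shares_memory(m, c))  # False
```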
