In data science, we often have a lot of data and want to parallelize our algorithms so they finish faster. For example, when you call a sklearn function, you can set the `n_jobs` parameter to “enable parallelism.” What does this mean? Thread:
Modern laptops have multi-core CPUs. Each core is a small processing unit in its own right. My MacBook Pro has 4 cores. If we enable parallelism when calling data science library functions, we can reduce the overall runtime by spreading the computation across all the cores!
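As a quick check, a minimal sketch using only Python's standard library will tell you how many cores your own machine reports:

```python
import os

# Number of logical cores the OS reports for this machine.
# (On my MacBook Pro this would print 4 physical cores' worth or more,
# depending on hyper-threading.)
n_cores = os.cpu_count()
print(f"This machine reports {n_cores} cores")
```

This is the same number that caps how much a library's parallelism can help on your machine.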
Each core can only run one process at a time. Every process has a separate address space — which means if you spawn a new process and change a variable, its value is *not* changed in the parent process.
“Multiprocessing” is the act of splitting a program or computation into multiple processes so that multiple cores can be leveraged. This is a great strategy for parallelizing “CPU-bound” computation, i.e. computation that spends most of its time doing work on the CPU.
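Here's a minimal sketch of that idea using Python's `multiprocessing.Pool`. The prime-counting function is just a made-up stand-in for any CPU-heavy work; each call runs in a separate worker process, so separate cores can chew on separate chunks:

```python
from multiprocessing import Pool

def count_primes(limit):
    """Naively count primes below `limit` -- deliberately CPU-heavy work."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # Each list element is handled by a separate worker process,
    # so the four calls can run on four different cores at once.
    with Pool(processes=4) as pool:
        results = pool.map(count_primes, [20_000] * 4)
    print(results)
```

Note the `if __name__ == "__main__":` guard, which is required on platforms that spawn (rather than fork) worker processes.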
A relevant example of a CPU-bound workload is training a logistic regression model with an iterative solver. This involves a lot of matrix multiplications to converge on a set of optimal parameters.
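To make that concrete, here's a toy gradient-descent logistic regression in NumPy (a simplified sketch, not sklearn's actual solver, and the data is randomly generated). Every iteration is dominated by matrix multiplications, which is exactly why the CPU stays busy the whole time:

```python
import numpy as np

# Synthetic, nearly separable data: labels come from a linear rule plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + rng.normal(scale=0.1, size=1000) > 0).astype(float)

w = np.zeros(20)
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w)))    # matrix multiply: predicted probabilities
    grad = X.T @ (p - y) / len(y)     # matrix multiply: gradient of the loss
    w -= 0.5 * grad                   # gradient-descent step

accuracy = ((X @ w > 0) == (y > 0.5)).mean()
print(f"train accuracy: {accuracy:.2f}")
```

Those per-iteration matrix products are what libraries parallelize across cores.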
You can further parallelize a process into multiple threads. You can see this in the Activity Monitor if you have a Mac — in my example, the Spotify process is running 45 threads!
Why would a process want to run many threads? In the Spotify case, maybe there’s a thread to render the album cover art. Maybe there’s another thread to listen to keystrokes (if I want to change a song).
The number of processes *running at any instant* is bounded by the number of cores you have. But in principle, a process can have as many threads as it wants. Unlike processes, threads within the same process share the same memory address space.
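You can see the shared-vs-separate address space difference directly. In this sketch, a thread's write to a variable is visible to the parent, while a child process only modifies its own private copy:

```python
import threading
from multiprocessing import Process

shared = {"value": 0}

def bump():
    shared["value"] += 1

# A thread shares the parent's address space, so the change is visible here.
t = threading.Thread(target=bump)
t.start()
t.join()
print(shared["value"])  # 1

if __name__ == "__main__":
    # A child process gets its own copy of memory, so the parent's
    # `shared` dict is untouched even though bump() ran in the child.
    p = Process(target=bump)
    p.start()
    p.join()
    print(shared["value"])  # still 1
```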
In practice, not all computation is CPU-bound. Maybe you want to collect a dataset of news articles so you can do data science on it later. That computation involves requesting and downloading a lot of text.
Suppose you have a process that sends an HTTP request to http://nytimes.com for an article and downloads the article text in response. You might find that the majority of the process’s runtime is spent waiting for the HTTP response, so your CPU sits idle!
“Multithreading” is a great way to speed up the download of many articles, a “network-bound” form of computation. If each thread requests one article, a thread that is waiting on its response can be swapped off the processor to let another run, keeping CPU usage high.
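Here's a sketch with `concurrent.futures.ThreadPoolExecutor`. To keep it self-contained I fake the download with `time.sleep` (standing in for the network wait) and use made-up example.com URLs, but the timing effect is the real one: eight 0.2-second "downloads" finish in roughly 0.2 seconds total instead of 1.6:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_article(url):
    """Stand-in for an HTTP request: the thread mostly waits,
    just like a real download waiting on the network."""
    time.sleep(0.2)
    return f"text of {url}"

urls = [f"https://example.com/article/{i}" for i in range(8)]

start = time.perf_counter()
# While one thread waits on its "response", another gets the CPU.
with ThreadPoolExecutor(max_workers=8) as pool:
    articles = list(pool.map(fetch_article, urls))
elapsed = time.perf_counter() - start

print(f"fetched {len(articles)} articles in {elapsed:.2f}s")
```

Because the work is waiting rather than computing, the threads overlap almost perfectly.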
Since threads share the same address space, context switches between them are cheap (no swapping in a new memory map, etc.). You may find that in many I/O-bound (non-CPU-bound) programs, multithreading speedups outperform multiprocessing speedups.
So under the hood of these data science library functions, both multiprocessing and multithreading are used. You can read more about sklearn’s parallelism here: https://scikit-learn.org/dev/computing/parallelism.html
In my next thread, I will discuss parallelism in Python specifically, where the Global Interpreter Lock (GIL) limits how much multithreading can help.