Pruning is a technique where model parameters are set to zero. With finetuning, models can be pruned heavily with very little drop in accuracy. Pruning is supposed to reduce both the size of the model and the number of operations required to execute it. But does it? 1/
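Here's a minimal sketch of what magnitude pruning looks like in NumPy (the function name and the 50% sparsity level are just illustrative, not from any specific paper):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    # Zero out the `sparsity` fraction of smallest-magnitude weights.
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) >= threshold)

w = np.random.randn(4, 4).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.5)
print((w_pruned == 0).mean())  # roughly half the weights are now zero
```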
Let's say you set a few weights of a Float32 model to zero and save it. Without any special compression during saving, the size of the model will remain the same since it takes the same number of bits to store a floating point zero as any other floating point number (32 bits). 2/
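A quick way to convince yourself, using NumPy's uncompressed .npy format as a stand-in for "saving without special compression":

```python
import os
import numpy as np

dense = np.random.randn(1000, 1000).astype(np.float32)
pruned = dense.copy()
pruned[np.abs(pruned) < 1.0] = 0.0        # zero out a large fraction of the weights

np.save("dense.npy", dense)
np.save("pruned.npy", pruned)
# Both files are ~4 MB: a float32 zero still occupies 4 bytes on disk.
print(os.path.getsize("dense.npy"), os.path.getsize("pruned.npy"))
```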
Will inference be faster? Not really. Multiplying by zero is usually cheap, but the zero weight still has to be fetched from memory, and memory access takes far more time and energy than the arithmetic itself. So your inference speed will not improve much. 3/
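A rough illustration with a dense matmul (sizes and the zeroing threshold are arbitrary):

```python
import time
import numpy as np

x = np.random.randn(512, 4096).astype(np.float32)
w_dense = np.random.randn(4096, 4096).astype(np.float32)
w_mostly_zero = w_dense * (np.abs(w_dense) > 2.0)   # ~95% zeros, same shape/dtype

for name, w in [("dense", w_dense), ("mostly zero", w_mostly_zero)]:
    start = time.perf_counter()
    for _ in range(10):
        x @ w
    # Both take about the same time: every weight is still fetched and multiplied.
    print(name, round(time.perf_counter() - start, 3))
```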
Many methods have been proposed to overcome this problem. The simplest is to not store the zero weights at all and instead store an additional "table" that keeps track of where they were. Custom hardware can take advantage of this layout and execute such networks efficiently. 4/
Say your network has 200 zero weights. Each table entry needs at least 8 bits (enough to address 256 unique positions), so instead of a 32-bit zero you store an 8-bit index: roughly a 4x memory saving per zero weight. That's before accounting for the overhead of reading the table and the custom logic needed to perform inference. 5/
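A toy sketch of that scheme (my own illustration matching the 200-zeros / 8-bit-index arithmetic above; real sparse formats like CSR or the relative indexing in EIE differ in the details):

```python
import numpy as np

def compress(w_flat: np.ndarray):
    # 1 byte per zero position; only valid because this layer has <= 256 weights.
    zero_pos = np.flatnonzero(w_flat == 0).astype(np.uint8)
    nonzero_vals = w_flat[w_flat != 0]                 # 4 bytes per remaining value
    return nonzero_vals, zero_pos

def decompress(nonzero_vals, zero_pos, n):
    w = np.zeros(n, dtype=np.float32)
    keep = np.ones(n, dtype=bool)
    keep[zero_pos] = False
    w[keep] = nonzero_vals                             # refill the non-pruned slots
    return w

w = np.random.randn(256).astype(np.float32)
w[np.random.choice(256, size=200, replace=False)] = 0.0   # 200 zero weights
vals, table = compress(w)
print(vals.nbytes + table.nbytes, "bytes vs", w.nbytes, "bytes dense")
assert np.array_equal(decompress(vals, table, w.size), w)
```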
Moreover, there is a delicate balance between the number of zero weights and how large your zero-weight table becomes. More zero weights = larger table = diminishing returns. I remember seeing a formula for this, but cannot find it. Will share it when I do. 6/
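Not the missing formula, just a back-of-envelope sketch: assuming 32-bit weights and an 8-bit table entry per zero, the compression ratio can never exceed 32/8 = 4x, because the table grows along with the number of zeros.

```python
# Rough back-of-envelope under the assumptions above (32-bit values, 8-bit indices).
def compression_ratio(n_weights, n_zeros, value_bits=32, index_bits=8):
    dense = n_weights * value_bits
    sparse = (n_weights - n_zeros) * value_bits + n_zeros * index_bits
    return dense / sparse

for sparsity in (0.5, 0.9, 0.99):
    print(sparsity, round(compression_ratio(10_000, int(10_000 * sparsity)), 2))
# 0.5 -> 1.6x, 0.9 -> 3.08x, 0.99 -> 3.88x (approaching but never exceeding 4x)
```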
Furthermore, due to this weird zero-table layout, optimized SIMD instructions can no longer be used directly to execute your network. Which is why, as I mentioned previously, custom hardware/software is needed to run these networks. 7/
So basically, having zeros in the weights should improve inference speed and memory use. In reality, it doesn't do much without custom compression techniques and custom hardware/software. Examples of such custom hardware include SCNN and EIE. 8/
To solve these problems, researchers have come up with "structured sparsity". Usually the positions of pruned weights are random (which is exactly what causes the problems above): the sparsity is unstructured. If we instead remove whole groups of weights (structured), some of these problems can be avoided. 9/
A group of weights can be a whole row/column of a filter, a filter channel, a block of adjacent weights, etc. However, as you might expect, removing large chunks of weights rather than a few individual ones hurts model accuracy much more. src: https://arxiv.org/pdf/1705.08922.pdf 10/
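For intuition, here's a minimal sketch of one structured variant: pruning whole conv filters ranked by L1 norm (the shapes and the 50% ratio are mine, roughly in the spirit of the filter-pruning papers):

```python
import numpy as np

def prune_filters(conv_w: np.ndarray, ratio: float) -> np.ndarray:
    """conv_w has shape (out_channels, in_channels, kH, kW)."""
    l1 = np.abs(conv_w).sum(axis=(1, 2, 3))            # one score per output filter
    n_prune = int(ratio * conv_w.shape[0])
    drop = np.argsort(l1)[:n_prune]                    # filters with the smallest L1
    pruned = conv_w.copy()
    pruned[drop] = 0.0                                 # zero out whole filters
    return pruned

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_pruned = prune_filters(w, ratio=0.5)
# Since entire filters are zero, they can simply be removed, turning the layer into
# a smaller dense conv that ordinary SIMD kernels handle well.
```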
On the other hand, models with structured sparsity can be executed with optimized SIMD instructions on general-purpose hardware, and the zero-weight table can be made smaller since each entry now points to a whole group of weights instead of a single weight! 11/
That was an overview of pruning and structured sparsity. Lots of research is still needed to improve the accuracy of "structured sparsity" models. The papers listed above, as well as https://arxiv.org/pdf/2003.03033.pdf and https://arxiv.org/abs/1902.09574 ( @sarahookr), are good intros to pruning/sparsity