My favourite ARM instruction is LDM—load multiple. A thread.
If you have a pointer to a fixed-sized heap object made of, for example, four words, you can load all of its contents into registers in one go:
ldm r4, {r0, r1, r2, r3}
And ldm takes as many registers as you want. The instruction encodes the register list as a 16-bit mask, one bit for each of the 16 registers.
This is ideal for a load-store architecture: load many registers in one go, perform some computation only on registers, and store the results back into memory.
The opposite of LDM is STM—store multiple. With these two you can copy large blocks of memory:
ldm r0!, {r4-r11}
stm r1!, {r4-r11}

This loads eight words from the address in r0 and stores them to the address in r1.
The bang (!) writes the updated address back to the base register, advancing it by the number of words transferred, so you can do this in a loop.
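As a rough sketch of such a loop (A32, GNU assembler syntax): this assumes r2 holds a byte count that is a non-zero multiple of 32. The label name and the use of r2 as a counter are my own choices for illustration.

```asm
copy_loop:
    ldm   r0!, {r4-r11}    @ load 8 words from [r0], advance r0 by 32 bytes
    stm   r1!, {r4-r11}    @ store them to [r1], advance r1 by 32 bytes
    subs  r2, r2, #32      @ 32 bytes copied; update the flags
    bne   copy_loop        @ repeat until the count reaches zero
```

Since r4-r11 are callee-saved under the standard calling convention, a real routine would have to save and restore them around the loop.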
ARM's pop instruction is simply an alias for LDM with the stack pointer as the base register and writeback enabled. These two are exactly the same:
ldm sp!, {r0-r4}
pop {r0-r4}

And the push instruction is an alias for an STM variant (STMDB).
This way you can push many registers onto the stack, or pop them off, in one go. And if you replace SP with another register you can implement efficient stacks on the heap, for example, to implement shadow stacks: https://en.wikipedia.org/wiki/Shadow_stack
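A minimal sketch of a stack kept in another register, assuming r9 has already been set up to point just past the end of a heap buffer (the choice of r9 is arbitrary and only for illustration):

```asm
    stmdb r9!, {r0-r3}     @ "push": decrement r9 by 16, then store r0-r3
    @ ... r0-r3 can be reused here ...
    ldmia r9!, {r0-r3}     @ "pop": load r0-r3, then increment r9 by 16
```

This is the same full-descending scheme that push and pop use with SP, just with a different base register.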
With such versatile push and pop instructions you can have short and efficient function prologues and epilogues, which is especially important for ARM where arguments and return address are passed in registers.
Save frame pointer and return address in one go, a fairly standard prologue:
push {fp, lr}

Restore both and return (epilogue):
pop {fp, lr}
bx lr
Even better, restore both and return in one go!
pop {fp, pc}

This works because the value of the return address (LR) is popped into the program counter register (PC), so you don't need an explicit return.
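Put together, a small non-leaf function might look like the sketch below (A32, GNU assembler syntax). The names add_one_after and helper are made up for illustration, and helper is assumed to follow the standard calling convention and return its result in r0.

```asm
add_one_after:
    push  {fp, lr}         @ save frame pointer and return address
    mov   fp, sp           @ set up this function's frame pointer
    bl    helper           @ the call overwrites lr, hence the save above
    add   r0, r0, #1       @ use helper's result
    pop   {fp, pc}         @ restore fp and return in one instruction
```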
Or, consider this prologue:
push {r0, r1, r2, r3, fp, lr}

This saves the frame pointer and return address, and spills the four argument registers onto the stack (in case their address is taken).
Or, consider the same prologue:
push {r0, r1, r2, r3, fp, lr}

This saves FP and LR, and allocates four words on the stack for local variables. Who cares about the contents of r0-r3?
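Either way, the matching epilogue has to step over those four words before restoring FP and returning. A sketch, assuming nothing else is left on the stack at that point:

```asm
    add   sp, sp, #16      @ discard the four words pushed from r0-r3
    pop   {fp, pc}         @ restore fp and return
```

The four words sit at the lowest addresses because LDM and STM always place lower-numbered registers at lower addresses.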
Unfortunately, when it was time to design ARM64 there were some difficult trade-offs. The decision was made to double the number of registers to 32. I remember reading a paper saying that this improves performance by 6% across the board.
With 32 registers, a register bit-mask would take up all 32 bits of a 32-bit instruction, leaving no room for the opcode or the base register. So, instead, ARM64 has LDP and STP: load pair and store pair, which are the spiritual successors of LDM and STM.
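For comparison, a typical ARM64 prologue and epilogue built on the pair instructions might look like the sketch below. Here x29 is the frame pointer and x30 is the link register; the function name is made up.

```asm
my_function:
    stp   x29, x30, [sp, #-16]!   // push FP and LR, pre-decrementing sp by 16
    mov   x29, sp                 // set up the frame pointer
    // ... function body ...
    ldp   x29, x30, [sp], #16     // pop FP and LR, post-incrementing sp by 16
    ret                           // return via x30
```

On ARM64 the program counter is not a general-purpose register, so the "pop into PC" trick is gone and an explicit ret is needed.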
You can follow @keleshev.