The Power of Pre-loading
When I was running mbw on an ARM platform, specifically a SABRE Lite
development board, I noticed that the memcpy supplied by the default libc
is much faster than a word-by-word copy. This is actually not a surprise,
but I was wondering how I could achieve similar speed. After some
searching, I found that ARM provides an excellent document,
What is the fastest way to copy memory on a Cortex-A8, which concludes that
a NEON memory copy with PLD is the fastest. I repeat the code below for your
convenience. As you can see, the PLD instruction tries to pre-fetch data from
memory. Note the offset in the PLD instruction: it means that
the PLD is actually preparing data for a later round of the VLDM and VSTM
pair. The hope is that the processor can overlap preparing data for the next
round of the copy with the current copy instructions, so that when the next
round starts, the required data is already in the cache. The following
NEON instructions first load the data from the source into eight registers,
d0 to d7, and then store the data to the destination memory. The exclamation
marks after the r1 and r0 registers cause the source
and destination addresses in r1 and r0 to be incremented automatically after
each load and store.
NEONCopyPLD:
    PLD [r1, #0xc0]
    VLDM r1!, {d0-d7}
    VSTM r0!, {d0-d7}
    SUBS r2, r2, #0x40
    BGE NEONCopyPLD

As expected, NEONCopyPLD does achieve higher memory bandwidth, but it
is still not comparable with the libc memcpy. Of course I was curious
about the reason, so I compiled mbw with static linking and disassembled
the binary to find out why. The standard library memcpy uses ldm and stm
instructions, which also operate on multiple general-purpose registers.
The code below demonstrates the basic idea and skips a lot of checks
on alignment and length. The main difference is that it uses multiple pld
instructions before actually copying the data. My understanding (I could be wrong)
is that the multiple pld instructions use separate execution units
of the processor to pre-fetch data from memory while the ldm and stm instructions
are copying data. A single pld may finish too early, so that the pre-fetch
unit sits idle instead of preparing the subsequent data. Please let me know
if you know exactly what is going on under the hood. Note that the pld
instruction does not trigger a synchronous data abort if the address to be
pre-fetched cannot be translated to a physical address.
void *memcpy(void *dest, const void *src, size_t n)
{
    void *ret = dest;
    __asm__ __volatile__(
        "push {r3-r10}        \n"
        "1:                   \n"
        "pld [%1]             \n"
        "pld [%1, #28]        \n"
        "pld [%1, #60]        \n"
        "pld [%1, #92]        \n"
        "pld [%1, #124]       \n"
        "ldmia %1!, {r3-r10}  \n"
        "stmia %0!, {r3-r10}  \n"
        "ldmia %1!, {r3-r10}  \n"
        "stmia %0!, {r3-r10}  \n"
        "subs %2, %2, #0x40   \n"
        "bge 1b               \n"
        "pop {r3-r10}         \n"
        : "+r"(dest), "+r"(src), "+r"(n)
        :
        : "cc", "memory"
    );
    return ret;
}

The end result is that I achieve similar speed as the standard C library memcpy.
Actually it is a little faster, since the code skips so many checks. You may
download the mbw benchmark and try it out yourself. This experiment shows the
importance of memory pre-fetching, and that is where the title comes from.