Updated Support for AArch64 (markdown)

JayDDee
2023-10-28 17:06:04 -04:00
parent 83e8b5f7e1
commit 0266c0d224

@@ -61,8 +61,12 @@ Verthash is a mystery, it only produces rejects on ARM even with no targtetted c
There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode and so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
Multiplications are implemented differently, particularly widening multiplication, where the product is twice the bit width of the sources.
X86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3, leaving two zero-extended 64-bit source integers. With ARM the source arguments are packed into a smaller vector and the product is widened to 64 bits upon multiplication:
`uint64x2_t = uint32x2_t * uint32x2_t`
Most uses are in the x86_64 format, requiring a workaround for ARM.
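
The lane difference can be illustrated with a scalar model in portable C. This is only a sketch: the function names are hypothetical, plain arrays stand in for vector registers, and it models the documented behavior of `_mm_mul_epu32` and `vmull_u32` rather than the workaround actually used in the code. On real NEON hardware the lane selection step could plausibly be done by narrowing the 64-bit lanes with `vmovn_u64` before the multiply.

```c
#include <stdint.h>

/* Scalar model of x86_64 _mm_mul_epu32: multiplies 32-bit lanes 0 and 2
   of each 4-lane source and ignores lanes 1 and 3. */
static void mul_epu32_model(const uint32_t a[4], const uint32_t b[4],
                            uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
}

/* Scalar model of NEON vmull_u32: the sources are packed 2-lane vectors
   and lanes 0 and 1 are multiplied, widening to 64 bits. */
static void vmull_u32_model(const uint32_t a[2], const uint32_t b[2],
                            uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
}

/* Hypothetical workaround: gather lanes 0 and 2 into a packed pair first,
   then use the ARM-style widening multiply. */
static void mul_epu32_via_vmull(const uint32_t a[4], const uint32_t b[4],
                                uint64_t out[2])
{
    uint32_t a_even[2] = { a[0], a[2] };
    uint32_t b_even[2] = { b[0], b[2] };
    vmull_u32_model(a_even, b_even, out);
}
```

The extra gather step before the multiply is the cost of the workaround: data laid out for x86_64 has to be repacked on every call.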
NEON has some fancy load instructions that combine a load with another operation like byte swap. These may provide optimizations that SSE can't.
Exploring these is part of the longer term plans once the existing problems are solved and ARM code is up to the same level of optimization as x86_64.
NEON has no blend instruction but can emulate one compatible with x86_64 blendv using boolean algebra, but not very efficiently.
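
As a rough sketch of the boolean algebra involved, the following portable C models a byte-wise blend with `blendv` semantics, where the top bit of each mask byte selects the second source. The function name is hypothetical and the loop stands in for vector operations; it is not the implementation used in the code.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a byte-wise blendv: for each byte, take b[i] when the
   top bit of mask[i] is set, otherwise take a[i]. */
static void blendv_bytes_model(const uint8_t *a, const uint8_t *b,
                               const uint8_t *mask, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Expand the sign bit of the mask byte to 0x00 or 0xFF. */
        uint8_t m = (uint8_t)(0u - (unsigned)(mask[i] >> 7));
        /* Select with AND, AND-NOT and OR. */
        out[i] = (uint8_t)((a[i] & (uint8_t)~m) | (b[i] & m));
    }
}
```

The mask expansion plus three logical operations per element is what makes this emulation slower than a single native blend.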