Updated Support for AArch64 (markdown)

JayDDee
2023-10-28 17:06:32 -04:00
parent 0266c0d224
commit f50d91d762

@@ -62,7 +62,9 @@ There are a few cases where translating from SSE2 to NEON is difficult or the wor
Multiplications are implemented differently, particularly widening multiplication, where the product is twice the bit width of the sources.
x86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3, leaving two zero-extended 64-bit source integers. With ARM the source arguments are packed into a smaller vector and the product is widened to 64 bits upon multiplication:
`uint64x2_t = uint32x2_t * uint32x2_t`
Most uses are in the x86_64 format, requiring a workaround for ARM.
NEON has some fancy load instructions that combine a load with another operation like byte swap. These may provide optimizations that SSE can't.
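
The lane difference can be sketched in C as below. This is only an illustration, not code from this project: the function names are hypothetical, the scalar functions are models of the two behaviours rather than real intrinsics, and the NEON path assumes little-endian lane order. One possible workaround is to narrow each 64-bit lane to its low 32 bits with `vmovn_u64` (which drops lanes 1 & 3) and then do the widening multiply with `vmull_u32`.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar model of the x86_64 behaviour (as in _mm_mul_epu32):
   multiply 32-bit lanes 0 and 2, ignore lanes 1 and 3. */
static void mul_x86_model(const uint32_t a[4], const uint32_t b[4],
                          uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
}

/* Scalar model of the ARM behaviour (as in vmull_u32):
   two packed 32-bit lanes in, two 64-bit products out. */
static void mul_arm_model(const uint32_t a[2], const uint32_t b[2],
                          uint64_t out[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
}

/* Hypothetical workaround: get the x86_64 result from the ARM-style
   multiply by first packing lanes 0 and 2 into a 2-lane vector. */
static void mul_x86_on_arm(const uint32_t a[4], const uint32_t b[4],
                           uint64_t out[2])
{
#if defined(__ARM_NEON)
    /* View each input as 2 x 64 bit, keep the low 32 bits of each lane. */
    uint32x2_t a_lo = vmovn_u64(vreinterpretq_u64_u32(vld1q_u32(a)));
    uint32x2_t b_lo = vmovn_u64(vreinterpretq_u64_u32(vld1q_u32(b)));
    vst1q_u64(out, vmull_u32(a_lo, b_lo));
#else
    /* No NEON available: do the same packing with the scalar model. */
    const uint32_t a_lo[2] = { a[0], a[2] };
    const uint32_t b_lo[2] = { b[0], b[2] };
    mul_arm_model(a_lo, b_lo, out);
#endif
}

/* Checks that the workaround matches the x86_64 model and that a plain
   ARM-style multiply of lanes 0 & 1 does not. */
static void check_workaround(void)
{
    const uint32_t a[4] = { 0xFFFFFFFFu, 7u, 3u, 9u };
    const uint32_t b[4] = { 0xFFFFFFFFu, 5u, 4u, 11u };
    uint64_t x86[2], arm[2], fixed[2];

    mul_x86_model(a, b, x86);
    mul_arm_model(a, b, arm);      /* uses lanes 0 & 1 only */
    mul_x86_on_arm(a, b, fixed);

    assert(fixed[0] == x86[0] && fixed[1] == x86[1]);
    assert(arm[1] != x86[1]);      /* 7*5 vs 3*4 */
}
```

The extra narrowing step is the cost of keeping the x86_64 data layout. Code written for ARM from the start could keep the sources packed as `uint32x2_t` and skip it.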