From ffea6989fe50c5c24fa97f8c63c2a60ea3127137 Mon Sep 17 00:00:00 2001
From: JayDDee
Date: Sat, 28 Oct 2023 17:07:57 -0400
Subject: [PATCH] Updated Support for AArch64 (markdown)

---
 Support-for-AArch64.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Support-for-AArch64.md b/Support-for-AArch64.md
index 87330c7..abc1387 100644
--- a/Support-for-AArch64.md
+++ b/Support-for-AArch64.md
@@ -58,7 +58,7 @@ Some notable observation about the problems observed:
 
 Verthash is a mystery, it only produces rejects on ARM even with no targtetted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was always successful. Verthash data file creation and verification work.
 
-There are a few cases where translating from SSE2 to NEON is diffiult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shulffle bytes any which way.
+There are a few cases where translating from SSE2 to NEON is diffiult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word & sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
 
 Multiplications are implemented differently, particularly widening multiplcatiom where the product is twice the bit width of the souces. X86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3 leaving 2 zero extended 64 bit source integers. With ARM the source arguments are packed into a smaller vector and the product is widened to 64 bits upon multiplication:
 
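
The lane difference described in the widening-multiplication paragraph can be modelled in plain, portable C. This is only an illustrative sketch and is not taken from the project's source: all function names are hypothetical, and the scalar code stands in for the SSE2 intrinsic `_mm_mul_epu32` and the NEON intrinsic `vmull_u32`. The last function assumes a little-endian lane layout, where narrowing each 64 bit element to its low half (as NEON's `vmovn_u64` does) selects 32 bit lanes 0 & 2.

```c
#include <stdint.h>

/* Scalar model of SSE2 _mm_mul_epu32 (hypothetical name).
 * Sources are 4 x 32 bit lanes; only lanes 0 & 2 are multiplied,
 * lanes 1 & 3 are ignored. Result is 2 x 64 bit products. */
void mul_epu32_model(uint64_t out[2], const uint32_t a[4], const uint32_t b[4])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
}

/* Scalar model of NEON vmull_u32 (hypothetical name).
 * Sources are 2 x 32 bit lanes packed in a half-width vector;
 * lanes 0 & 1 are multiplied and widened to 2 x 64 bit products. */
void vmull_u32_model(uint64_t out[2], const uint32_t a[2], const uint32_t b[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
}

/* Keep 32 bit lanes 0 & 2 of a 4-lane vector, i.e. the low half of each
 * 64 bit element on a little-endian layout (assumed to match vmovn_u64). */
void narrow_even_lanes(uint32_t out[2], const uint32_t v[4])
{
    out[0] = v[0];
    out[1] = v[2];
}

/* One possible way to get the x86_64 result from the ARM-style multiply:
 * narrow both sources first, then do the widening multiply. */
void mul_epu32_via_vmull(uint64_t out[2], const uint32_t a[4], const uint32_t b[4])
{
    uint32_t na[2], nb[2];
    narrow_even_lanes(na, a);
    narrow_even_lanes(nb, b);
    vmull_u32_model(out, na, nb);
}
```

Calling `mul_epu32_model` and `mul_epu32_via_vmull` on the same 4-lane inputs should give identical products, while calling `vmull_u32_model` directly on the first two lanes multiplies lane 1 instead of lane 2. The extra narrowing step is the kind of translation overhead the text refers to.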