Updated Support for AArch64 (markdown)

2025-09-17 23:44:27 +00:00 · 2023-10-26 15:41:40 -04:00
parent 793b667a0e
commit 72e54f6759
1 changed files with 21 additions and 7 deletions
--- a/Support-for-AArch64.md
+++ b/Support-for-AArch64.md
@@ -25,15 +25,14 @@ Follow normal Linux build procedure but add "-flax-vector-conversions" to CFLAGS
 The miner compiles and runs on Raspberry Pi 4B, and compiles for all versions of armv8 with or without AES or SHA2 or both.

 What works:
-* All algorithms ecept Verthash and Hodl should be working.
+* All algorithms except Verthash and Hodl should be working.
 * Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON, Allium also for AES.
-* Unoptimized: Sha256dt, sha256t, Blake2s.
 * Yespower, Yescrypt, Scrypt are working with slow sha256.
-* X17 is the only X* to be optimized in this realease.
+* X17 is the only X* to be optimized in this release.
 * MinotaurX is partially optimized.
-* AES & SHA2 are enabled but untested.
-* Other algos are not optimized for ARM and not tested.
-* stratum+ssl and stratum+tcp are working, GBT is untested.
+* AES & SHA2 are enabled but untested; expectations are low.
+* Other algos are not optimized for ARM and not tested but expected to work.
+* stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
 * CPU and SW feature detection and reporting is working, algo features in progress, CPU brand not yet implemented.
 * CPU temperature and clock frequency is working.
 * cpu-affinity & threads are working.
@@ -43,7 +42,7 @@ Known problems:
 * No detection of ARM architecture minor version number.
 * NEON may not be displayed in algo features for some algos that may support it.
 * Algos may show support for NEON even if it's disabled or not yet implemented.
-* AES & SHA2 are enabled but untested.* Sha256 & Sha512 Parallel N-way are disabled. They work on X86_64. 
+* AES & SHA2 are enabled but untested. Sha256 & Sha512 Parallel N-way are disabled. They work on X86_64. 
 * Sha256dt, Sha256t, Sha256d unoptimized.
 * Scryptn2 optimizations disabled due to Sha256 issues.
 * X17, MinotaurX are partially optimized.
@@ -54,3 +53,18 @@ Known problems:
 * SWIFFTX: Multiple issues with NEON, using unoptimized.
 * Remaining algos are not yet optimized for NEON but should work unoptimized. 

+Some notable observations about the problems:
+
+In general, linear (1-way) vectorization is working and parallel (n-way) vectorization is not. Parallel vectors are only working for Keccak. They don't work for Sha or Blake, although linear vectoring works well for Blake small and large.
+
+I expected linear vectoring to be the bigger challenge because of lane shuffling, which isn't necessary for n-way. N-way is simple arithmetic and logic instructions that mostly (mult is an issue) translate directly to the ARM architecture. N-way does require data shuffling on entry and exit to interleave the data, so the issue may be there. ARM uses some 2-way code which wasn't previously implemented on x86_64. However, the new 2-way code works on x86_64; it is the same code that doesn't work on ARM.
+
+Verthash is a mystery: it only produces rejects on ARM even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was successful. Verthash data file creation and verification work.
+
+There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
+
+Multiplications are implemented differently, particularly widening multiplication, where the product is twice the bit width of the sources.
+
+NEON has some fancy load instructions that combine load with another operation like byte swap. These may provide optimizations that SSE can't.
+Exploring these is part of the longer term plans once the existing problems are solved and ARM code is up to the same level of optimization as x86_64.
+For widening multiplication, x86_64 pre-widens the source data, while ARM operates on the packed data and then widens the product.
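
The widening-multiplication difference described in the added text can be illustrated with a minimal sketch. It is not taken from the miner's source: the function name `widening_mul_2x32` is hypothetical, and the choice of `_mm_mul_epu32` (SSE2) and `vmull_u32` (NEON) as the matching pair is an assumption made for illustration. A plain C fallback is included so the sketch builds on other targets.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the miner): multiply two pairs of
   unsigned 32-bit values and store the two 64-bit products. */
void widening_mul_2x32(const uint32_t a[2], const uint32_t b[2],
                       uint64_t out[2])
{
#if defined(__aarch64__)
    /* NEON: the sources stay packed as 2 x 32 bits; the instruction
       itself widens the product to 2 x 64 bits. */
    uint64x2_t p = vmull_u32(vld1_u32(a), vld1_u32(b));
    vst1q_u64(out, p);
#elif defined(__SSE2__)
    /* SSE2: each source is first placed in its own 64-bit lane
       (pre-widened); _mm_mul_epu32 multiplies the low 32 bits of
       each 64-bit lane. */
    __m128i va = _mm_set_epi32(0, (int)a[1], 0, (int)a[0]);
    __m128i vb = _mm_set_epi32(0, (int)b[1], 0, (int)b[0]);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
#else
    /* Portable fallback: plain C widening multiply. */
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
#endif
}
```

Both vector paths produce the same two products. The difference is in how the inputs must be laid out beforehand, which is where a direct SSE2-to-NEON translation may need extra shuffling.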