Updated Support for AArch64 (markdown)

2026-02-22 16:33:08 +00:00 · 2023-10-26 15:41:40 -04:00
parent 793b667a0e
commit 72e54f6759
1 changed files with 21 additions and 7 deletions
--- a/Support-for-AArch64.md
+++ b/Support-for-AArch64.md
@@ -25,15 +25,14 @@ Follow normal Linux build procedure but add "-flax-vector-conversions" to CFLAGS
 The miner compiles and runs on Raspberry Pi 4B, and compiles for all versions of armv8 with or without AES or SHA2 or both.
 What works:
-* All algorithms ecept Verthash and Hodl should be working.
+* All algorithms except Verthash and Hodl should be working.
 * Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON, Allium also for AES.
 * Unoptimized: Sha256dt, sha256t, Blake2s.
 * Yespower, Yescrypt, Scrypt are working with slow sha256.
-* X17 is the only X* to be optimized in this realease.
+* X17 is the only X* to be optimized in this release.
 * MinotaurX is partially optimized.
-* AES & SHA2 are enabled but untested.
+* AES & SHA2 are enabled but untested, expectations are low.
-* Other algos are not optimized for ARM and not tested.
+* Other algos are not optimized for ARM and not tested but expected to work.
-* stratum+ssl and stratum+tcp are working, GBT is untested.
+* stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
 * CPU and SW feature detection and reporting is working, algo features in progress, CPU brand not yet implemented.
 * CPU temperature and clock frequency are working.
 * cpu-affinity & threads are working.
@@ -43,7 +42,7 @@ Known problems:
 * No detection of ARM architecture minor version number.
 * NEON may not be displayed in algo features for some algos that may support it.
 * Algos may show support for NEON even if it's disabled or not yet implemented.
-* AES & SHA2 are enabled but untested.* Sha256 & Sha512 Parallel N-way are disabled. They work on X86_64. 
+* AES & SHA2 are enabled but untested. Sha256 & Sha512 Parallel N-way are disabled. They work on X86_64. 
 * Sha256dt, Sha256t, Sha256d unoptimized.
 * Scryptn2 optimizations disabled due to Sha256 issues.
 * X17, MinotaurX are partially optimized.
@@ -54,3 +53,18 @@ Known problems:
 * SWIFFTX: Multiple issues with NEON, using unoptimized.
 * Remaining algos are not yet optimized for NEON but should work unoptimized. 
 Some notable observations about the problems:
 In general, linear (1-way) vectorization is working and parallel (n-way) vectorization is not. Parallel vectors are only working for Keccak. They don't work for Sha or Blake, although linear vectoring works well for Blake small and large.
 I expected linear vectoring to be the bigger challenge because of lane shuffling. N-way doesn't need lane shuffling: it is simple arithmetic and logic instructions that mostly translate directly to the ARM architecture (mult is an issue). N-way does require data shuffling on entry and exit to interleave the data, so the issue may be there. ARM uses some 2-way, which wasn't previously implemented on x86_64. However, the new 2-way code works on x86_64; the same code doesn't work on ARM.
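The entry and exit shuffling that n-way needs can be pictured with a scalar model. This is only a sketch, not the miner's code: the function names, the fixed 4 lanes, and the 32-bit word size are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* assumption: 4 x 32-bit lanes per 128-bit vector */

/* Entry shuffle: word i of input l lands in lane l of vector i,
 * so one vector instruction can process the same word of all inputs. */
void interleave_4x32(uint32_t *dst, const uint32_t *const src[LANES],
                     size_t words)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < words; i++)
        for (size_t l = 0; l < LANES; l++)
            dst[i * LANES + l] = src[l][i];
}

/* Exit shuffle: the inverse, splitting lanes back into separate outputs. */
void deinterleave_4x32(uint32_t *const dst[LANES], const uint32_t *src,
                       size_t words)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < words; i++)
        for (size_t l = 0; l < LANES; l++)
            dst[l][i] = src[i * LANES + l];
}
```

A round trip through both functions should return the original data, which makes a scalar model like this a possible reference when checking a vectorized interleave on a new architecture.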
 Verthash is a mystery: it only produces rejects on ARM even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was successful. Verthash data file creation and verification work.
 There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
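To make the two operations being compared concrete, here is a scalar sketch of what each one computes. Both function names are hypothetical and neither is taken from the miner; the code models the results only, not how either instruction set implements them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit reversal within one byte: bit 0 swaps with bit 7, 1 with 6, ... */
uint8_t reverse_bits8(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0u) >> 4) | ((b & 0x0Fu) << 4)); /* swap nibbles */
    b = (uint8_t)(((b & 0xCCu) >> 2) | ((b & 0x33u) << 2)); /* swap pairs */
    b = (uint8_t)(((b & 0xAAu) >> 1) | ((b & 0x55u) << 1)); /* swap bits */
    return b;
}

/* Arbitrary byte shuffle of a 16-byte vector: dst[i] = src[idx[i]].
 * Indices are masked to 0..15; a temporary allows dst to alias src. */
void shuffle_bytes16(uint8_t dst[16], const uint8_t src[16],
                     const uint8_t idx[16])
{
    uint8_t tmp[16];
    assert(dst != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < 16; i++)
        tmp[i] = src[idx[i] & 15u];
    for (size_t i = 0; i < 16; i++)
        dst[i] = tmp[i];
}
```

With an index table such as {3,2,1,0, 7,6,5,4, ...} the shuffle performs a 32-bit byte swap, which is the kind of fixed permutation a programmable shuffle is typically used for.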
 Multiplications are implemented differently, particularly widening multiplication where the product is twice the bit width of the sources.
 NEON has some fancy load instructions that combine load with another operation like byte swap. These may provide optimizations that SSE can't.
 Exploring these is part of the longer term plans once the existing problems are solved and the ARM code is up to the same level of optimization as x86_64.
 For widening multiplication, x86_64 uses pre-widened source data while ARM operates on the packed data then widens the product.
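The difference in operand layout for a 32x32 to 64-bit widening multiply can be sketched in scalar C. This is a model under assumptions, not the miner's code: the first function mirrors the layout used by the SSE2 intrinsic `_mm_mul_epu32`, the second mirrors the NEON intrinsic `vmull_u32`, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86_64 style: each source already occupies a 64-bit lane and only
 * its low 32 bits take part in the multiply. */
void mul_prewidened_2x64(uint64_t prod[2], const uint64_t a[2],
                         const uint64_t b[2])
{
    assert(prod != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < 2; i++)
        prod[i] = (a[i] & 0xFFFFFFFFu) * (b[i] & 0xFFFFFFFFu);
}

/* ARM style: sources are packed 32-bit lanes, and the product is
 * widened to 64 bits by the multiply itself. */
void mul_widening_2x32(uint64_t prod[2], const uint32_t a[2],
                       const uint32_t b[2])
{
    assert(prod != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < 2; i++)
        prod[i] = (uint64_t)a[i] * b[i];
}
```

Both functions yield the same products for the same 32-bit values. What differs is where the inputs live, so code written around one layout needs its data rearranged before it can use the other.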