diff --git a/Support-for-AArch64.md b/Support-for-AArch64.md
index bc205e3..b0294eb 100644
--- a/Support-for-AArch64.md
+++ b/Support-for-AArch64.md
@@ -25,15 +25,14 @@ Follow normal Linux build procedure but add "-flax-vector-conversions" to CFLAGS
 The miner compiles and runs on Raspberry Pi 4B, and compiles for all version of armv8 with our without AES or SHA2 or both.
 
 What works:
-* All algorithms ecept Verthash and Hodl should be working.
+* All algorithms except Verthash and Hodl should be working.
 * Allium, Lyra2z, Lyraz330, Argon2d are fully optimzed for NEON, Allium also for AES.
-* Unoptimized: Sha256dt, sha256t, Blake2s.
 * Yespower, Yescrypt, Scrypt are working with slow sha256.
-* X17 is the only X* to be optimized in this realease.
+* X17 is the only X* to be optimized in this release.
 * MinotaurX is partially optimized.
-* AES & SHA2 are enabled but untested.
-* Other algos are not optimized for ARM and not tested.
-* stratum+ssl and stratum+tcp are working, GBT is untested.
+* AES & SHA2 are enabled but untested, expectations are low.
+* Other algos are not optimized for ARM and not tested but expected to work.
+* stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
 * CPU and SW feature detection and reporting is working, algo features in progress, CPU brand not yet implemented.
 * CPU temperature and clock frequency is working.
 * cpu-affinity & threads are working.
@@ -43,7 +42,7 @@ Known problems:
 * No detection of ARM architecture minor version number.
 * NEON may not be displayed in algo features for some algos that may support it.
 * Algos may show support for NEON even if it's disabled or not yet implemented.
-* AES & SHA2 are enabled but untested.* Sha256 & Sha512 Parallel N-way are disabled. They work on X86_64.
+* AES & SHA2 are enabled but untested. Sha256 & Sha512 Parallel N-way are disabled. They work on X86_64.
 * Sha256dt, Sha256t, Sha256d unoptimized.
 * Scryptn2 optimzations disabled due to Sha256 issues.
 * X17, MinotaurX are partially optimized.
@@ -54,3 +53,18 @@ Known problems:
 * SWIFFTX: Multiple issues with NEON,using unoptimized.
 * Remaining algos are not yet optimized for NEON but should work unoptimized.
 
+Some notable observations about these problems:
+
+In general, linear (1-way) vectorization is working and parallel (n-way) vectorization is not. Parallel vectors only work for Keccak. They don't work for Sha or Blake, although linear vectoring works well for Blake small and large.
+
+I expected linear vectoring to be the bigger challenge because of lane shuffling. N-way doesn't need lane shuffling; it uses simple arithmetic and logic instructions that mostly translate directly to the ARM architecture (mult is an issue). N-way does require data shuffling on entry and exit to interleave the data, so the issue may be there. ARM uses some 2-way, which wasn't previously implemented on x86_64. However, the new 2-way code works on x86_64; the same code doesn't work on ARM.
+
+Verthash is a mystery: it only produces rejects on ARM, even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was successful. Verthash data file creation and verification work.
+
+There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
+
+Multiplications are implemented differently, particularly widening multiplication, where the product is twice the bit width of the sources.
+
+NEON has some fancy load instructions that combine load with another operation like byte swap. These may provide optimizations that SSE can't.
+Exploring these is part of the longer term plans once the existing problems are solved and ARM code is up to the same level of optimization as x86_64.
+For widening multiplication, x86_64 uses pre-widened source data, while ARM operates on the packed data and then widens the product.
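
The widening-multiply difference described in the last added line can be sketched in portable scalar C. This is only an illustration under stated assumptions, not code from the miner: the function names are hypothetical, and the mapping to `_mm_mul_epu32` (SSE2) and to `vmull_u32` / `vmovn_u64` (NEON) reflects how those intrinsics are documented, not how this project necessarily uses them. A little-endian lane layout is assumed.

```c
#include <stdint.h>

/* Scalar model of SSE2 _mm_mul_epu32 (assumed mapping): each operand is
   viewed as four 32-bit lanes, but only lanes 0 and 2 (the low half of
   each 64-bit element) are multiplied. The sources are "pre-widened". */
static void mul_prewidened(const uint32_t a[4], const uint32_t b[4],
                           uint64_t product[2])
{
    product[0] = (uint64_t)a[0] * b[0];
    product[1] = (uint64_t)a[2] * b[2];
}

/* Scalar model of NEON vmovn_u64 (assumed mapping): keep the low 32 bits
   of each 64-bit element, i.e. 32-bit lanes 0 and 2 on little-endian. */
static void narrow_u64_to_u32(const uint32_t v[4], uint32_t packed[2])
{
    packed[0] = v[0];
    packed[1] = v[2];
}

/* Scalar model of NEON vmull_u32 (assumed mapping): two packed 32-bit
   lanes in, two 64-bit products out. The widening happens in the result. */
static void mul_packed_widening(const uint32_t a[2], const uint32_t b[2],
                                uint64_t product[2])
{
    product[0] = (uint64_t)a[0] * b[0];
    product[1] = (uint64_t)a[1] * b[1];
}

/* Hypothetical helper: reproduce the pre-widened x86_64 behaviour using
   only the packed ARM-style primitive, at the cost of two extra
   narrowing steps. */
static void mul_prewidened_via_packed(const uint32_t a[4],
                                      const uint32_t b[4],
                                      uint64_t product[2])
{
    uint32_t pa[2], pb[2];
    narrow_u64_to_u32(a, pa);
    narrow_u64_to_u32(b, pb);
    mul_packed_widening(pa, pb, product);
}
```

On AArch64 the last helper would plausibly correspond to `vmull_u32(vmovn_u64(a), vmovn_u64(b))`. The extra narrowing is the kind of translation overhead the notes above describe when code written for the x86_64 data layout is ported unchanged.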