From bee1aa1aa19e9e6107e80234168077579952fb42 Mon Sep 17 00:00:00 2001
From: JayDDee <jayddee246@gmail.com>
Date: Sat, 25 May 2024 13:31:44 -0400
Subject: [PATCH] Updated Support for AArch64 (markdown)

---
 Support-for-AArch64.md | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/Support-for-AArch64.md b/Support-for-AArch64.md
index c8571ee..f6c5e64 100644
--- a/Support-for-AArch64.md
+++ b/Support-for-AArch64.md
@@ -14,7 +14,9 @@ Requirements:
 
 **cpuminer-opt-23.15 is released**
 
-ARM support is now feature complete, on par with SSE4.2 & AES_NI.
+Update: v24.2 is released with more NEON optimizations, some requiring the SHA3 extension.
+
+ARM support is now feature complete, on par with x86_64 with SSE4.2, AES_NI & SHA extensions.
 
 Development environment:
 *  Orange Pi 5 Plus 16 GB, Rockchip 8 core CPU with AES & SHA2
@@ -25,7 +27,7 @@ Secondary environment:
 * Mac Mini M2
 * MacOS 14.1 Sonoma
 * UTM/Qemu VM emulator
-* Ubuntu Mate 22.04 VM guest
+* Ubuntu Mate 22.04 VM guest (update: Ubuntu Mate 24.04)
 
 Compile with:
 
@@ -36,18 +38,19 @@ Specific achitectures and features can be compiled using examples in armbuild-al
 
 The miner has been tested on Raspberry Pi 4B, Orange Pi 5 Plus, and Mac Mini from a Linux VM.
 It compiles for all minor versions of armv8.x with or without AES, or SHA2, or both.
+Update: compiles with armv9 and sha3.
 
 Known problems:
 
-* Verthash algo is not working.
+* Verthash algo is not working. (update: problem found, will be fixed in v24.3)
 * MacOS is not working natively, workaround with linux VM.
-* CPU and feature detection and reporting is incomplete.
+* CPU and feature detection and reporting is incomplete. (update: will be fixed in v24.3)
 * Some algorithms too difficult to test with a CPU are not optimized for NEON.
 
 Short term plan:
 
-* Figure out what's going on with verthash.
-* Detection of ARM CPU model and architecture minor version.
+* Figure out what's going on with verthash. Update: problem was using ROR for 32 bit rotation.
+* Detection of ARM CPU model and architecture minor version.
 * Find NEON optimization opportunities that exploit it's architecture and instruction set.
 * Apply lessons learned to x86_64.
 
@@ -55,19 +58,24 @@ Long term:
 
 * SHA512, x86_64 & AArch64.
 * ARM SVE
-* x86_64 AVX10
+* x86_64 AVX10 (update: initial support in v24.2)
 * RISC-V
 
 Some notable observations about the problems observed:
 
-Verthash is a mystery, it only produces rejects on ARM even with no targtetted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was always successful. Verthash data file creation and verification work. Verthash has one unique feature in the data-file. No other algo has that and no other algo fails with unoptimized code.
+The problem with verthash was the use of the scalar ROR instruction for bit rotation. The documentation says it supports 64 & 32 bit registers but not how. It is assumed "registers" means hardware dependent: 32 bit works on A32 only and 64 bit on A64 only.
 
 Multiplications are implemented differently, particularly widening multiplcatiom where the product is twice the bit width of the souces.
 X86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3 leaving 2 zero extended 64 bit source integers. With ARM the source arguments are packed into a smaller vector and the product is widened to 64 bits upon multiplication:
 
 `uint64x2_t = uint32x2_t * uint32x2_t`
 
-Most uses are the x86_64 format requiring a workaround for ARM. The curent workaround seems to be functioning correctly where needed.
+Most widening multiplications use the x86_64 format, requiring a workaround for ARM. The current workaround seems to be functioning correctly where needed, but with significant extra overhead.
 
-NEON has no blend instruction but can emulate one compatible with x86_64 blendv using boolean algebra, but not very efficiently.
+SHA512 support in cpuminer-opt is not assured. It is little used and may not be worth the effort. On x86_64 it looks like an enlarged clone of sha256, with 128 bit operations replaced with equivalent 256 bit ops.
+AArch64 implements sha512 using 128 bit registers, splitting each 256 bit operation over two 128 bit instructions, which reduces the performance gain.
 
+SVE doesn't seem to be usable for hashing. It uses Vector Agnostic Programming, which abstracts the logical vector size from the vector register size. This creates run time overhead to determine the HW register size, overhead that doesn't exist for highly optimised NEON code.
+
+The biggest MacOS problems seem to be with JSON, possibly a configure issue choosing whether to use the system or local version of JSON.
+Other than missing GMP, most problems occur at link or load time.
\ No newline at end of file