Updated Support for AArch64 (markdown)

JayDDee
2023-11-17 14:56:10 -05:00
parent 5578981065
commit ca10f0bc62

@@ -1,4 +1,4 @@
Development is progressing faster than expected to provide support for ARM 64-bit CPUs using the AArch64 architecture.
Support for AArch64 with AES and SHA2 is at release candidate status.
This is provided as source code only and may be built on native Linux by following the existing procedure, subject to any modifications described below.
@@ -12,11 +12,11 @@ Requirements:
## Status
**cpuminer-opt-23.11 is released, all users should upgrade**
Highlights from this release:
Important fixes to x25x, hmq1725.
Most SHA3 algos now optimized with 2-way for NEON.
Development environment:
* Orange Pi 5 Plus 16 GB, Rockchip 8 core CPU with AES & SHA2
@@ -40,12 +40,9 @@ The miner has been tested on Raspberry Pi 4B, Orange Pi 5 Plus, and Mac Mini fro
It compiles for all minor versions of armv8.x, with or without AES, SHA2, or both.
What works:
* All algorithms except Verthash should be working.
* Most algorithms are working with NEON optimizations.
* AES is working for Shavite and Echo (a NEON AES sketch follows this list).
* SHA is working for all algos.
* stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
* CPU and SW feature detection and reporting is partially implemented.
* CPU temperature and clock frequency reporting is working (native Linux).
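
As a hedged illustration of what the AES support for Shavite & Echo involves (the helper below is illustrative, not code from the miner): ARMv8's `vaese`/`vaesmc` xor the round key *before* SubBytes/ShiftRows, while x86's `_mm_aesenc_si128` adds it *after* MixColumns, so porting SSE2-style AES code means feeding AESE a zero key and xoring the real round key in afterwards.

```c
#include <arm_neon.h>

// Illustrative helper, not from the cpuminer-opt source; build with
// -march=armv8-a+crypto (or +aes). Emulates the x86 _mm_aesenc_si128( x, k )
// round: AESE xors its key operand before SubBytes/ShiftRows, while x86
// AESENC adds the round key after MixColumns, so AESE is given a zero key
// and the real round key is xored in at the end.
static inline uint8x16_t aesenc_neon( uint8x16_t x, uint8x16_t round_key )
{
   x = vaeseq_u8( x, vdupq_n_u8( 0 ) );   // SubBytes + ShiftRows
   x = vaesmcq_u8( x );                   // MixColumns
   return veorq_u8( x, round_key );       // AddRoundKey, x86 ordering
}
```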
@@ -56,18 +53,17 @@ Known problems:
* MacOS is not working natively; the workaround is a Linux VM.
* CPU and SW feature detection and reporting is incomplete.
* Groestl, Fugue: multiple issues, not AES related; using unoptimized code.
* Algos not mentioned have either been deferred or have not been analyzed. They may or may not work on ARM.
Short term plan:
* Continue propagating x17 optimizations to the rest of the X family.
* Figure out what's going on with Verthash.
* Complete any other work needed to bring parity with SSE2.
* Performance testing.
* End of Beta phase.
* Full support for ARM.
Medium term:
* Verthash
* Groestl & Fugue AES.
* Detection of ARM CPU model and architecture minor version.
* Find NEON optimization opportunities that exploit its architecture and instruction set.
@@ -84,14 +80,14 @@ Some notable observations about the problems observed:
Verthash is a mystery: it only produces rejects on ARM, even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was always successful. Verthash data file creation and verification work. Verthash has one unique feature, its data file; no other algo has that, and no other algo fails with unoptimized code.
There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode, so no programmable shuffle instruction. The only shuffling I can find is sub-vector word & sub-word bit shift, rotate & reverse. Notably, SSE2 can't do bit reversal but can shuffle bytes any which way. Also notably, Groestl AES, despite not working, is currently slower on ARM than the SPH version.
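
One concrete case of this asymmetry (an illustrative sketch, not from the source tree): byte-swapping every 32-bit lane, which hashing code does constantly, takes a programmable shuffle with a mask constant on x86 (`_mm_shuffle_epi8`), but is a single fixed-pattern REV32 on NEON; the arbitrary byte permutations that `_mm_shuffle_epi8` also provides are the part with no direct counterpart here.

```c
#include <arm_neon.h>

// Illustrative helper, not from the cpuminer-opt source: byte-swap each
// 32-bit lane of a 128-bit vector. On x86 this takes a programmable
// shuffle with a mask constant ( _mm_shuffle_epi8 ); on NEON it is one
// of the fixed rearrangement patterns, a single REV32 instruction.
static inline uint32x4_t bswap_32x4( uint32x4_t x )
{
   return vreinterpretq_u32_u8( vrev32q_u8( vreinterpretq_u8_u32( x ) ) );
}
```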
Multiplications are implemented differently, particularly widening multiplication, where the product is twice the bit width of the sources.
x86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect, x86_64 assumes the data is pre-widened and discards lanes 1 & 3, leaving 2 zero-extended 64-bit source integers. With ARM, the source arguments are packed into a smaller vector and the product is widened to 64 bits upon multiplication:
`uint64x2_t = uint32x2_t * uint32x2_t`
Most uses are the x86_64 format, requiring a workaround for ARM. The current workaround seems to be functioning correctly where needed.
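
A sketch of the idea behind such a workaround (illustrative; the helper name and exact packing used in the miner are assumptions): on little-endian AArch64 a narrowing move extracts exactly lanes 0 & 2 of the 32-bit view, after which `vmull_u32` widens on multiply.

```c
#include <arm_neon.h>

// Illustrative emulation of SSE2 _mm_mul_epu32 semantics on NEON; the
// miner's actual workaround may differ. _mm_mul_epu32 multiplies the
// unsigned 32-bit integers in lanes 0 & 2, giving two 64-bit products.
static inline uint64x2_t mul_epu32_neon( uint32x4_t a, uint32x4_t b )
{
   // On little-endian the low 32 bits of each 64-bit lane are 32-bit
   // lanes 0 & 2, so a narrowing move extracts exactly those lanes.
   uint32x2_t a02 = vmovn_u64( vreinterpretq_u64_u32( a ) );
   uint32x2_t b02 = vmovn_u64( vreinterpretq_u64_u32( b ) );
   return vmull_u32( a02, b02 );   // product widened to 64 bits
}
```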
NEON has some fancy load instructions that combine a load with another operation, like a byte swap. These may provide optimizations that SSE can't.
Exploring these is part of the longer-term plan, once the existing problems are solved and the ARM code is up to the same level of optimization as x86_64.
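
The structure loads are one such family (a hedged sketch; the helper and its use case are illustrative, not from the source): LD2 de-interleaves as it loads, which would take a plain load plus several shuffle/unpack steps with SSE2.

```c
#include <arm_neon.h>

// Illustrative use of a NEON structure load, not from the cpuminer-opt
// source. LD2 loads eight 32-bit words and de-interleaves them in one
// instruction; SSE2 would need a plain load plus shuffle/unpack steps.
static inline void load_deinterleave_2x32( const uint32_t *p,
                                           uint32x4_t *even, uint32x4_t *odd )
{
   uint32x4x2_t v = vld2q_u32( p );
   *even = v.val[0];   // p[0], p[2], p[4], p[6]
   *odd  = v.val[1];   // p[1], p[3], p[5], p[7]
}
```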