mirror of
https://github.com/JayDDee/cpuminer-opt.git
synced 2025-09-17 23:44:27 +00:00
Updated Support for AArch64 (markdown)
@@ -1,4 +1,4 @@
Development is progressing faster than expected to provide support for 64-bit ARM CPUs using the AArch64 architecture.

Support for AArch64 with AES and SHA2 is at release candidate status.

This is provided as source code only and may be built on native Linux by following the existing procedure, subject to any modifications described below.
@@ -12,11 +12,11 @@ Requirements:
## Status

**cpuminer-opt-23.10 is released, all users should upgrade**
**cpuminer-opt-23.11 is released, all users should upgrade**

Highlights from this release:
* Important fixes to Scryptn2 and Skein.
* x17 & MinotaurX are mostly optimized; only AES Groestl & Fugue remain unoptimized.
* Important fixes to x25x and hmq1725.
* Most SHA3 algos are now optimized with 2-way for NEON.

Development environment:
* Orange Pi 5 Plus 16 GB, Rockchip 8-core CPU with AES & SHA2
@@ -40,12 +40,9 @@ The miner has been tested on Raspberry Pi 4B, Orange Pi 5 Plus, and Mac Mini fro
It compiles for all minor versions of armv8.x, with or without AES, SHA2, or both.

What works:
* All algorithms except Verthash should be working.
* Allium, Lyra2z, Lyra2z330 and Argon2d are fully optimized for NEON; Allium uses unoptimized AES.
* All Scrypt & SHA256 algos are fully optimized to make use of SHA2.
* X17 and MinotaurX are mostly optimized for NEON & AES; the changes also help SSE2.
* Skein & Skein2 are fully optimized for NEON, Skein also for SHA2.
* AES is working for Shavite & Echo, not for Groestl & Fugue.
* Most algorithms are working with NEON optimizations.
* SHA is working for all algos.
* stratum+ssl and stratum+tcp are working; GBT is untested but expected to work.
* CPU and SW feature detection and reporting is partially implemented.
* CPU temperature and clock frequency reporting is working (native Linux).
@@ -56,18 +53,17 @@ Known problems:
* macOS is not working natively; the workaround is to use a Linux VM.
* CPU and feature detection and reporting is incomplete.
* Groestl, Fugue: multiple issues not related to AES; the unoptimized code is used for now.
* Algos not mentioned have either been deferred or have not been analyzed. They may or may not work on ARM.

Short term plan:

* Continue propagating x17 optimizations to the rest of the X family.
* Figure out what's going on with Verthash.
* Complete any other work needed to reach parity with SSE2.
* Performance testing.
* End of Beta phase.
* Full support for ARM.

Medium term:

* Verthash
* Groestl & Fugue AES.
* Detection of ARM CPU model and architecture minor version.
* Find NEON optimization opportunities that exploit its architecture and instruction set.
@@ -84,14 +80,14 @@ Some notable observations about the problems observed:
Verthash is a mystery: it only produces rejects on ARM, even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. It was tried with both -O3 and -O2. In all other cases, falling back to C was always successful. Verthash data file creation and verification work. Verthash has one unique feature, the data file; no other algo has that, and no other algo fails with unoptimized code.

There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode and so no programmable shuffle instruction. The only shuffling I can find is sub-vector word and sub-word bit shift, rotate and reverse. Conversely, SSE2 can't do bit reversal but can shuffle bytes any which way. Notably, Groestl AES, despite not working, is currently slower on ARM than the SPH version.

Multiplication is implemented differently, particularly widening multiplication, where the product is twice the bit width of the sources.
x86_64 operates on 32-bit lanes 0 & 2, while ARM operates on lanes 0 & 1 of the source data. In effect, x86_64 assumes the data is pre-widened and discards lanes 1 & 3, leaving two zero-extended 64-bit source integers. With ARM, the source arguments are packed into a smaller vector and the product is widened to 64 bits upon multiplication:

`uint64x2_t = uint32x2_t * uint32x2_t`

Most uses are in the x86_64 format, requiring a workaround for ARM. The current workaround seems to be functioning correctly where needed.

NEON has some fancy load instructions that combine a load with another operation, such as byte swap. These may provide optimizations that SSE can't.
Exploring these is part of the longer-term plans, once the existing problems are solved and the ARM code is up to the same level of optimization as x86_64.