Mirror of https://github.com/JayDDee/cpuminer-opt.git, synced 2025-09-17 23:44:27 +00:00
Updated Support for AArch64 (markdown)
Follow normal Linux build procedure but add "-flax-vector-conversions" to CFLAGS

The miner compiles and runs on Raspberry Pi 4B, and compiles for all versions of armv8 with or without AES or SHA2 or both.

What works:
* All algorithms except Verthash and Hodl should be working.
* Allium, Lyra2z, Lyra2z330 and Argon2d are fully optimized for NEON; Allium is also optimized for AES.
* Unoptimized: Sha256dt, Sha256t, Blake2s.
* Yespower, Yescrypt and Scrypt are working with slow Sha256.
* X17 is the only X* algo optimized in this release.
* MinotaurX is partially optimized.
* AES & SHA2 are enabled but untested; expectations are low.
* Other algos are not optimized for ARM and not tested, but are expected to work.
* stratum+ssl and stratum+tcp are working; GBT is untested but expected to work.
* CPU and SW feature detection and reporting are working, algo features are in progress, and CPU brand is not yet implemented.
* CPU temperature and clock frequency are working.
* cpu-affinity & threads are working.
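
The CPU/SW split in the feature report can be approximated on AArch64 Linux as shown below. This is a minimal sketch, not cpuminer-opt's code, and the `arm_features` struct and both function names are hypothetical. It makes two assumptions. First, it reads the Linux auxiliary vector (`getauxval(AT_HWCAP)` with the `HWCAP_ASIMD`, `HWCAP_AES` and `HWCAP_SHA2` bits) to learn what the CPU supports. Second, it reads the Arm C Language Extensions (ACLE) predefined macros to learn what the build was compiled to use. On any other platform both reports come back empty.

```c
#include <stdbool.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP; the libc also pulls in HWCAP_* bits */
#endif

/* Hypothetical summary of the features a miner might report. */
struct arm_features {
    bool neon;   /* Advanced SIMD (ASIMD) */
    bool aes;    /* AES instructions */
    bool sha2;   /* SHA-256 instructions */
};

/* "CPU features": what the processor running this binary supports.
   Assumes Linux on AArch64; elsewhere every flag stays false. */
struct arm_features detect_cpu_features(void)
{
    struct arm_features f = { false, false, false };
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP_ASIMD)
    unsigned long hw = getauxval(AT_HWCAP);
    f.neon = (hw & HWCAP_ASIMD) != 0;
    f.aes  = (hw & HWCAP_AES)   != 0;
    f.sha2 = (hw & HWCAP_SHA2)  != 0;
#endif
    return f;
}

/* "SW features": what the compiler was allowed to emit, from ACLE macros. */
struct arm_features detect_build_features(void)
{
    struct arm_features f = { false, false, false };
#if defined(__ARM_NEON)
    f.neon = true;
#endif
#if defined(__ARM_FEATURE_AES)
    f.aes = true;
#endif
#if defined(__ARM_FEATURE_SHA2)
    f.sha2 = true;
#endif
    return f;
}
```

In this sketch a feature counts as usable only when both reports agree. That mirrors code that guards its SHA2 paths with `__ARM_FEATURE_SHA2`: a build compiled without the macro will not take those paths even on a CPU that has the instructions.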

Known problems:
* No detection of ARM architecture minor version number.
* NEON may not be displayed in algo features for some algos that may support it.
* Algos may show support for NEON even if it's disabled or not yet implemented.
* AES & SHA2 are enabled but untested.
* Sha256 & Sha512 parallel N-way are disabled. They work on x86_64.
* Sha256dt, Sha256t and Sha256d are unoptimized.
* Scryptn2 optimizations are disabled due to Sha256 issues.
* X17 and MinotaurX are partially optimized.
* SWIFFTX: multiple issues with NEON, using unoptimized code.
* Remaining algos are not yet optimized for NEON but should work unoptimized.

Some notable observations about the problems:

In general, linear (1-way) vectorization is working and parallel (n-way) vectorization is not. Parallel vectors work only for Keccak. They don't work for Sha or Blake, although linear vectoring works well for Blake small and large.
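
The difference between the two layouts can be sketched in portable C, with a 4-lane array standing in for one 128-bit register. This is an illustration under assumptions, not cpuminer-opt's code. The names `vec128`, `g_half_step`, `rotate_lanes` and `diagonalize` are hypothetical, and the arithmetic is the generic add-xor-rotate step used in Blake-style mixing functions. In the linear layout the four lanes hold four words of one hash state. In the n-way layout, lane *i* of every register belongs to a separate hash *i*.

```c
#include <stdint.h>

#define LANES 4  /* four 32-bit lanes = one 128-bit NEON or SSE2 register */

/* Scalar stand-in for a 128-bit vector register. */
typedef struct { uint32_t l[LANES]; } vec128;

static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Lane-wise a += b; d = rotr(d ^ a, r).  No lane reads another lane,
   so this maps onto plain vector add, xor and shift instructions in
   either layout. */
void g_half_step(vec128 *a, const vec128 *b, vec128 *d, unsigned r)
{
    for (int i = 0; i < LANES; i++) {
        a->l[i] += b->l[i];
        d->l[i] = rotr32(d->l[i] ^ a->l[i], r);
    }
}

/* Rotate lanes left by k: output lane i takes input lane (i + k) % 4.
   On hardware this is a lane-permuting shuffle, not arithmetic. */
vec128 rotate_lanes(vec128 v, unsigned k)
{
    vec128 out;
    for (unsigned i = 0; i < LANES; i++)
        out.l[i] = v.l[(i + k) % LANES];
    return out;
}

/* Linear (1-way) layout only: move the diagonals of the 4x4 state into
   columns before the diagonal half of a round.  The n-way layout never
   needs this, because each lane already holds a whole, separate state. */
void diagonalize(vec128 *b, vec128 *c, vec128 *d)
{
    *b = rotate_lanes(*b, 1);
    *c = rotate_lanes(*c, 2);
    *d = rotate_lanes(*d, 3);
}
```

Both layouts use the same `g_half_step`. Only the linear layout also needs `diagonalize`, which is where lane shuffling enters.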

I expected linear vectoring to be the bigger challenge because of lane shuffling, which n-way doesn't need. N-way is simple arithmetic and logic instructions that mostly translate directly to the ARM architecture (multiplication is an issue). N-way does require data shuffling on entry and exit to interleave the data, so the issue may be there. ARM uses some 2-way code, which wasn't previously implemented on x86_64. However, the new 2-way code works on x86_64, while the same code doesn't work on ARM.
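
The entry and exit interleaving for 2-way 64-bit processing amounts to a transposition, sketched below for two independent buffers. The function names are hypothetical. The layout, with word *i* of lane 0 followed by word *i* of lane 1, is an assumption about how such data is typically arranged. It is not a description of cpuminer-opt's own interleave routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleave two independent buffers for 2-way 64-bit processing:
   dst[2*i] = word i of lane 0, dst[2*i + 1] = word i of lane 1.
   dst must hold 2 * nwords words and must not overlap the sources. */
void interleave_2x64(uint64_t *dst, const uint64_t *src0,
                     const uint64_t *src1, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        dst[2 * i]     = src0[i];
        dst[2 * i + 1] = src1[i];
    }
}

/* Inverse: split interleaved data back into per-lane buffers. */
void deinterleave_2x64(uint64_t *dst0, uint64_t *dst1,
                       const uint64_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        dst0[i] = src[2 * i];
        dst1[i] = src[2 * i + 1];
    }
}
```

A round trip through both functions should return the original buffers unchanged, which makes a cheap self-test when hunting interleave bugs on a new architecture. On AArch64 the same transposition could plausibly be done with the structure load/store intrinsics `vld2q_u64`/`vst2q_u64`. This sketch stays in portable C.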

Verthash is a mystery: it only produces rejects on ARM, even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with both -O3 and -O2. In all other cases, falling back to C was always successful. Verthash data file creation and verification work.

There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode and therefore no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate and reverse. Notably, SSE2 can't do bit reversal but can shuffle bytes any which way.
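
To make the two operations concrete, here are scalar models of each, with hypothetical function names. `shuffle_bytes_16` follows the documented semantics of x86 `_mm_shuffle_epi8` (PSHUFB, which is formally an SSSE3 rather than an SSE2 instruction). `reverse_bits_16` follows AArch64 RBIT as exposed by `vrbitq_u8`.

```c
#include <stdint.h>

/* Scalar model of PSHUFB (_mm_shuffle_epi8): each output byte is selected
   by the low 4 bits of its index byte; an index with the high bit set
   produces zero.  out must not alias in. */
void shuffle_bytes_16(uint8_t out[16], const uint8_t in[16],
                      const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (idx[i] & 0x80) ? 0 : in[idx[i] & 0x0F];
}

/* Scalar model of AArch64 RBIT on bytes (vrbitq_u8): reverse the bit
   order within each byte, e.g. 0x01 -> 0x80. */
void reverse_bits_16(uint8_t out[16], const uint8_t in[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t b = in[i], r = 0;
        for (int k = 0; k < 8; k++) {
            r = (uint8_t)((r << 1) | (b & 1));  /* shift in lowest bit of b */
            b >>= 1;
        }
        out[i] = r;
    }
}
```

AArch64 also has a table-lookup instruction, TBL (`vqtbl1q_u8`), which returns zero for any index of 16 or more. It agrees with PSHUFB for indices 0 to 15 and for indices with the high bit set, so it may be worth evaluating as a substitute in some of the difficult cases.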

Multiplications are implemented differently, particularly widening multiplication, where the product is twice the bit width of the sources.

NEON has some fancy load instructions that combine a load with another operation, like byte swap. These may provide optimizations that SSE can't.

Exploring these is part of the longer-term plans, once the existing problems are solved and the ARM code is up to the same optimization level as x86_64.
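
For reference, here is what a byte-swapping load computes, shown as a scalar model for the common case of reading big-endian 32-bit message words as SHA-256 does. `load_be32` is a hypothetical name. The code uses shifts, so it produces the same result on little- and big-endian hosts.

```c
#include <stddef.h>
#include <stdint.h>

/* Read n big-endian 32-bit words from a byte buffer (any alignment)
   into host-order words, i.e. a load combined with a byte swap on a
   little-endian CPU. */
void load_be32(uint32_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const uint8_t *p = src + 4 * i;
        dst[i] = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
               | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }
}
```

On AArch64, a separate-instruction equivalent for four words at a time is `vld1q_u8` followed by `vrev32q_u8`, with the result reinterpreted as `uint32x4_t`.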

For widening multiplication, x86_64 pre-widens the source data, while ARM operates on the packed data and then widens the product.
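
The two conventions can be modeled in scalar C as follows, with hypothetical function names. `mul_epu32_model` follows the semantics of SSE2 `_mm_mul_epu32` (PMULUDQ), and `vmull_u32_model` follows NEON `vmull_u32` (UMULL). The third function shows one plausible translation between them. It first narrows the even 32-bit lanes into packed form, which is what NEON `vmovn_u64` does, and then applies the packed widening multiply.

```c
#include <stdint.h>

/* x86 style (_mm_mul_epu32): the 32-bit sources sit in the low half of
   each 64-bit lane, i.e. lanes 0 and 2 of a 4x32 vector ("pre-widened");
   lanes 1 and 3 are ignored. */
void mul_epu32_model(uint64_t out[2], const uint32_t a[4], const uint32_t b[4])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
}

/* NEON style (vmull_u32): two packed 32-bit sources are multiplied and
   the products are widened to 64 bits. */
void vmull_u32_model(uint64_t out[2], const uint32_t a[2], const uint32_t b[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
}

/* Emulating the x86 form on the NEON model: narrow (keep the low 32 bits
   of each 64-bit lane), then do the packed widening multiply. */
void mul_epu32_via_vmull(uint64_t out[2], const uint32_t a[4], const uint32_t b[4])
{
    const uint32_t pa[2] = { a[0], a[2] };
    const uint32_t pb[2] = { b[0], b[2] };
    vmull_u32_model(out, pa, pb);
}
```

The extra narrowing step is the kind of per-instruction overhead that can make a direct SSE2-to-NEON translation slower than code written for NEON's packed layout from the start.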