mirror of
https://github.com/JayDDee/cpuminer-opt.git
synced 2025-09-17 23:44:27 +00:00
Updated Support for AArch64 (markdown)
@@ -11,12 +11,10 @@ Requirements:
## Status
**cpuminer-opt-23.5 is released with the following...**
**cpuminer-opt-23.6 is released, all users should upgrade**
2-way parallel hash will be implemented on applicable algorithms for NEON and will also benefit x86_64 CPUs without AVX2.
Errata: sha256dt, sha256t & sha256d were released with optimizations enabled, therefore they don't work.
Development environment:
* Raspberry Pi-4B 8 GB
* Ubuntu (Mate) 22.04 Raspberry Pi image
@@ -28,13 +26,12 @@ The miner compiles and runs on Raspberry Pi 4B, and compiles for all versions of
What works:
* All algorithms except Verthash and Hodl should be working.
* Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON, Allium also for AES.
* Yespower, Yescrypt are fully optimized; Scrypt, ScryptN2 partially optimized.
* Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON, Allium also for AES, untested.
* Yespower, Yescrypt, Scrypt, ScryptN2 are fully optimized; SHA is enabled but untested.
* Sha256dt, Sha256t, Sha256d are fully optimized, SHA is enabled but untested.
* X17 is the only X* to be optimized in this release.
* MinotaurX is partially optimized.
* AES & SHA2 are enabled but untested, problems are likely.
* Other algos are not optimized for ARM and not tested but expected to work.
* AES & SHA2 are enabled but untested
* stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
* CPU and SW feature detection and reporting is working, algo features in progress, CPU brand not yet implemented.
* CPU temperature and clock frequency reporting are working.
@@ -45,16 +42,15 @@ Known problems:
* No detection of ARM architecture minor version number.
* NEON may not be displayed in algo features for some algos that may support it.
* Algos may show support for NEON even if it's disabled or not yet implemented.
* AES & SHA2 are enabled but untested. Sha256 & Sha512 Parallel N-way are disabled. They work on X86_64.
* Sha256dt, Sha256t, Sha256d not working.
* Scryptn2 optimizations disabled due to Sha256 issues.
* AES & SHA2 are enabled but untested.
* Several parallel hash functions are disabled on ARM although they work on x86_64.
* X17, MinotaurX are partially optimized.
* Blake256, Blake512, Blake2s, Blake2b N-way parallel hash not working, using linear when possible, unoptimized otherwise.
* Simd: Multiple issues with NEON, using unoptimized.
* Luffa: NEON not working, using unoptimized.
* Fugue: Multiple issues with NEON & AES, using unoptimized.
* SWIFFTX: Multiple issues with NEON, using unoptimized.
* Remaining algos are not yet optimized for NEON but should work unoptimized.
* Algos not mentioned have either been deferred or have not been analyzed. They may or may not work on ARM.
Some notable observations about the problems observed:
@@ -63,8 +59,8 @@ Verthash is a mystery, it only produces rejects on ARM even with no targeted c
There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
Multiplications are implemented differently, particularly widening multiplication where the product is twice the bit width of the sources.
X86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3 leaving 2 zero extended 64 bit integers. With ARM the source arguments are packed into a smaller vector (uint32x2_t * uint32x2_t = uint64x2_t) and the product is widened to 64 bits.
X86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3 leaving 2 zero extended 64 bit source integers. With ARM the source arguments are packed into a smaller vector (uint32x2_t * uint32x2_t = uint64x2_t) and the product is widened to 64 bits upon multiplication. Most uses are the x86_64 format, requiring a workaround for ARM.
NEON has some fancy load instructions that combine load with another operation like byte swap. These may provide optimizations that SSE can't.
Exploring these is part of the longer term plans once the existing problems are solved and ARM code is up to the same level of optimization as x86_64.
x86_64 pre-widens source data while ARM operates on the packed data then widens the product.