mirror of
https://github.com/JayDDee/cpuminer-opt.git
synced 2025-09-17 23:44:27 +00:00
Updated Support for AArch64 (markdown)
@@ -29,10 +29,11 @@ The miner compiles and runs on Raspberry Pi 4B, and compiles for all version of
What works:
* All algorithms except Verthash and Hodl should be working.
* Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON, Allium also for AES.
* Yespower, Yescrypt, Scrypt are working with slow sha256.
* Yespower, Yescrypt are fully optimized, Scrypt, scryptn2 partially optimized.
* X17 is the only X* to be optimized in this release.
* MinotaurX is partially optimized.
* AES & SHA2 are enabled but untested, expectations are low.
* AES & SHA2 are enabled but untested, problems are likely.
* Other algos are not optimized for ARM and not tested but expected to work.
* stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
* CPU and SW feature detection and reporting is working, algo features in progress, CPU brand not yet implemented.
@@ -57,15 +58,12 @@ Known problems:
Some notable observations about the problems:

In general linear (1-way) vectorization is working and parallel vectoring (n-way) is not. Parallel vectors are only working for Keccak. They don't work for SHA or Blake, although linear vectoring works well for Blake small and large.

I expected linear vectoring to be the bigger challenge due to lane shuffling, which isn't necessary for n-way. N-way is simple arithmetic and logic instructions that mostly (mult is an issue) translate directly to the ARM architecture. N-way does require data shuffling on entry and exit to interleave the data, so the issue may be there. ARM uses some 2-way which wasn't previously implemented on x86_64. However, the new 2-way code works on x86_64, the same code that doesn't work on ARM.
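
To make the entry and exit shuffling concrete, here is a minimal scalar sketch of a 4-way interleave in plain C. It is not the miner's code: the names `interleave_4x32`, `deinterleave_4x32`, `nway_step`, `scalar_step`, and the sizes `LANES` and `WORDS` are hypothetical, and the round function is a stand-in. It only illustrates the layout n-way code expects, where word w of input l sits at index `w * LANES + l`.

```c
#include <stdint.h>

#define LANES 4  /* hypothetical n-way width for this sketch */
#define WORDS 8  /* 32-bit words per input, kept small for the demo */

/* Entry shuffle: word w of input l lands at dst[w * LANES + l], so each
 * group of LANES consecutive words would fill one vector register. */
static void interleave_4x32(uint32_t *dst, const uint32_t *src)
{
    for (int w = 0; w < WORDS; w++)
        for (int l = 0; l < LANES; l++)
            dst[w * LANES + l] = src[l * WORDS + w];
}

/* Exit shuffle: the exact inverse of interleave_4x32. */
static void deinterleave_4x32(uint32_t *dst, const uint32_t *src)
{
    for (int w = 0; w < WORDS; w++)
        for (int l = 0; l < LANES; l++)
            dst[l * WORDS + w] = src[w * LANES + l];
}

/* Stand-in round function (assumption): rotate left by 7, add a constant. */
static uint32_t scalar_step(uint32_t x)
{
    return ((x << 7) | (x >> 25)) + 0x9e3779b9u;
}

/* n-way step: the same operation on every word, no lane crossing. */
static void nway_step(uint32_t *v)
{
    for (int i = 0; i < WORDS * LANES; i++)
        v[i] = scalar_step(v[i]);
}

/* Returns 1 if interleave -> n-way step -> deinterleave matches the
 * scalar (1-way) result for every word of every input, else 0. */
static int nway_matches_scalar(void)
{
    uint32_t in[LANES * WORDS], vec[LANES * WORDS], out[LANES * WORDS];

    for (int i = 0; i < LANES * WORDS; i++)
        in[i] = 0x01010101u * (uint32_t)(i + 1);

    interleave_4x32(vec, in);
    nway_step(vec);
    deinterleave_4x32(out, vec);

    for (int i = 0; i < LANES * WORDS; i++)
        if (out[i] != scalar_step(in[i]))
            return 0;
    return 1;
}
```

A check like `nway_matches_scalar()` run on both x86_64 and AArch64 could help separate a bad interleave from a bad arithmetic translation, since the step itself never crosses lanes.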

Verthash is a mystery: it only produces rejects on ARM even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was always successful. Verthash data file creation and verification work.

There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.

Multiplications are implemented differently, particularly widening multiplication where the product is twice the bit width of the sources.
x86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3, leaving 2 zero-extended 64-bit integers. On ARM the source arguments are packed into a smaller vector (uint32x2_t * uint32x2_t = uint64x2_t) and the product is widened to 64 bits.
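
The lane difference can be modelled in scalar C. The sketch below is an illustration under stated assumptions, not the miner's implementation: `mul_epu32_model` mimics the lane selection of the x86_64 intrinsic `_mm_mul_epu32`, `vmull_u32_model` mimics NEON's `vmull_u32`, and `mul_epu32_via_vmull` is a hypothetical bridge that first narrows lanes 0 & 2 into a packed pair. On real NEON hardware that narrowing step might be done with something like `vmovn_u64`, which costs an extra instruction per operand.

```c
#include <stdint.h>

/* Scalar model of x86_64 _mm_mul_epu32: four 32-bit lanes per operand,
 * only lanes 0 and 2 are multiplied, lanes 1 and 3 are ignored. */
static void mul_epu32_model(uint64_t out[2],
                            const uint32_t a[4], const uint32_t b[4])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
}

/* Scalar model of NEON vmull_u32: two packed 32-bit lanes per operand,
 * lanes 0 and 1 are multiplied into two 64-bit products. */
static void vmull_u32_model(uint64_t out[2],
                            const uint32_t a[2], const uint32_t b[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
}

/* Hypothetical bridge: pack lanes 0 and 2 (the low half of each 64-bit
 * element) into a 2-lane operand, then do the ARM-style widening multiply. */
static void mul_epu32_via_vmull(uint64_t out[2],
                                const uint32_t a[4], const uint32_t b[4])
{
    const uint32_t a_lo[2] = { a[0], a[2] };
    const uint32_t b_lo[2] = { b[0], b[2] };
    vmull_u32_model(out, a_lo, b_lo);
}
```

Feeding the same 4-lane inputs to `mul_epu32_model` and `mul_epu32_via_vmull` should give identical products, while passing the unpacked data straight to `vmull_u32_model` multiplies lane 1 instead of lane 2.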

NEON has some fancy load instructions that combine load with another operation like byte swap. These may provide optimizations that SSE can't.
Exploring these is part of the longer term plans once the existing problems are solved and ARM code is up to the same level of optimization as x86_64.