Updated Support for AArch64 (markdown)

JayDDee
2023-10-28 16:51:48 -04:00
parent a49ab5ab4d
commit f3c651bcef

@@ -11,12 +11,10 @@ Requirements:
## Status
**cpuminer-opt-23.5 is released with the following...**
**cpuminer-opt-23.6 is released, all users should upgrade**
2 way parallel hash will be implemented on applicable algorithms for NEON and will also benefit x86_64 CPUs without AVX2.
Errata: sha256dt, sha256t & sha256d were released with optimizations enabled, therefore they don't work.
Development environment:
* Raspberry Pi-4B 8 GB
* Ubuntu (Mate) 22.04 Raspberry Pi image
@@ -28,13 +26,12 @@ The miner compiles and runs on Raspberry Pi 4B, and compiles for all version of
What works:
* All algorithms except Verthash and Hodl should be working.
* Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON, Allium also for AES.
* Yespower, Yescrypt are fully optimized, Scrypt, ScryptN2 partially optimized.
* Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON, Allium also for AES (untested).
* Yespower, Yescrypt, Scrypt, ScryptN2 are fully optimized, SHA is enabled but untested.
* Sha256dt, Sha256t, Sha256d are fully optimized, SHA is enabled but untested.
* X17 is the only X* to be optimized in this release.
* MinotaurX is partially optimized.
* AES & SHA2 are enabled but untested, problems are likely.
* Other algos are not optimized for ARM and not tested but expected to work.
* AES & SHA2 are enabled but untested
* stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
* CPU and SW feature detection and reporting is working, algo features in progress, CPU brand not yet implemented.
* CPU temperature and clock frequency is working.
@@ -45,16 +42,15 @@ Known problems:
* No detection of ARM architecture minor version number.
* NEON may not be displayed in algo features for some algos that may support it.
* Algos may show support for NEON even if it's disabled or not yet implemented.
* AES & SHA2 are enabled but untested. Sha256 & Sha512 Parallel N-way are disabled. They work on X86_64.
* Sha256dt, Sha256t, Sha256d not working.
* ScryptN2 optimizations disabled due to Sha256 issues.
* AES & SHA2 are enabled but untested.
* Several parallel hash functions are disabled on ARM although they work on x86_64.
* X17, MinotaurX are partially optimized.
* Blake256, Blake512, Blake2s, Blake2b N-way parallel hash not working, using linear when possible, unoptimized otherwise.
* Simd: Multiple issues with NEON, using unoptimized.
* Luffa: NEON not working, using unoptimized.
* Fugue: Multiple issues with NEON & AES, using unoptimized.
* SWIFFTX: Multiple issues with NEON, using unoptimized.
* Remaining algos are not yet optimized for NEON but should work unoptimized.
* Algos not mentioned have either been deferred or have not been analyzed. They may or may not work on ARM.
Some notable observations about the problems:
@@ -63,8 +59,8 @@ Verthash is a mystery, it only produces rejects on ARM even with no targeted c
There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
Multiplications are implemented differently, particularly widening multiplication where the product is twice the bit width of the sources.
X86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3 leaving 2 zero extended 64 bit integers. ARM the source arguments are packed into a smaller vector (uint32x2_t * uint32x2_t = uint64x2_t) and the product is widened 64 bits.
X86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3, leaving 2 zero extended 64 bit source integers. With ARM the source arguments are packed into a smaller vector (uint32x2_t * uint32x2_t = uint64x2_t) and the product is widened to 64 bits upon multiplication. Most uses are in the x86_64 format, requiring a workaround for ARM.
NEON has some fancy load instructions that combine load with another operation like byte swap. These may provide optimizations that SSE can't.
Exploring these is part of the longer term plans once the existing problems are solved and ARM code is up to the same level of optimization as x86_64.
x86_64 expects pre-widened source data while ARM operates on the packed data, then widens the product.
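
The lane difference in widening multiplication can be illustrated with a portable scalar model. This is only a sketch, not the code used in cpuminer-opt: the function names are hypothetical, plain arrays stand in for vector registers, and the intrinsics named in the comments (`_mm_mul_epu32`, `vmull_u32`, `vmovn_u64`) are mentioned only to show which behaviour each function imitates. The workaround shown, narrowing lanes 0 & 2 before the multiply, is one possible approach and is an assumption about how such a translation could be done.

```c
#include <stdint.h>

/* x86_64 style, modeled on SSE2 _mm_mul_epu32: each source holds
 * 4 x 32-bit lanes, only lanes 0 & 2 are multiplied and lanes 1 & 3
 * are ignored. The result is 2 x 64-bit products. */
static void mul_widen_x86_style(const uint32_t a[4], const uint32_t b[4],
                                uint64_t prod[2])
{
    prod[0] = (uint64_t)a[0] * b[0];
    prod[1] = (uint64_t)a[2] * b[2];
}

/* ARM style, modeled on NEON vmull_u32: each source holds 2 packed
 * 32-bit lanes (uint32x2_t), lanes 0 & 1 are multiplied and the
 * product is widened to 2 x 64 bits (uint64x2_t). */
static void mul_widen_arm_style(const uint32_t a[2], const uint32_t b[2],
                                uint64_t prod[2])
{
    prod[0] = (uint64_t)a[0] * b[0];
    prod[1] = (uint64_t)a[1] * b[1];
}

/* Hypothetical workaround: data laid out in the x86_64 format is first
 * narrowed (lanes 0 & 2 packed into a 2-lane vector, similar in effect
 * to vmovn_u64 on the 64-bit view), then multiplied ARM style. The
 * extra narrowing step is the cost of the translation. */
static void mul_widen_arm_workaround(const uint32_t a[4], const uint32_t b[4],
                                     uint64_t prod[2])
{
    const uint32_t a_packed[2] = { a[0], a[2] };
    const uint32_t b_packed[2] = { b[0], b[2] };
    mul_widen_arm_style(a_packed, b_packed, prod);
}
```

Feeding the same 4-lane data to `mul_widen_x86_style` and `mul_widen_arm_workaround` gives identical products, while calling `mul_widen_arm_style` directly on the unpacked data multiplies lanes 0 & 1 instead and produces a different second product.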