Updated Support for AArch64 (markdown)

JayDDee
2023-10-28 04:35:12 -04:00
parent 8a986e82d8
commit a49ab5ab4d

@@ -29,10 +29,11 @@ The miner compiles and runs on Raspberry Pi 4B, and compiles for all version of
What works:
* All algorithms except Verthash and Hodl should be working.
* Allium, Lyra2z, Lyra2z330, Argon2d are fully optimized for NEON, Allium also for AES.
* Yespower, Yescrypt, Scrypt are working with slow sha256.
* Yespower, Yescrypt are fully optimized, Scrypt, scryptn2 partially optimized.
* X17 is the only X* to be optimized in this release.
* MinotaurX is partially optimized.
* AES & SHA2 are enabled but untested, expectations are low.
* AES & SHA2 are enabled but untested, problems are likely.
* Other algos are not optimized for ARM and not tested but expected to work.
* stratum+ssl and stratum+tcp are working, GBT is untested but expected to work.
* CPU and SW feature detection and reporting is working, algo features in progress, CPU brand not yet implemented.
@@ -57,15 +58,12 @@ Known problems:
Some notable observations about the problems:
In general, linear (1-way) vectorization is working and parallel (n-way) vectorization is not. Parallel vectors only work for Keccak. They don't work for SHA or Blake, although linear vectoring works well for both small and large Blake.
I expected linear vectoring to be the bigger challenge because of its lane shuffling. N-way doesn't need that shuffling: it is simple arithmetic and logic instructions that mostly translate directly to the ARM architecture (mult is an issue). N-way does require data shuffling on entry and exit to interleave the data, so the issue may be there. ARM uses some 2-way, which wasn't previously implemented on x86_64. However, the new 2-way code works on x86_64; the same code doesn't work on ARM.
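To illustrate the entry and exit shuffling that n-way needs, here is a minimal scalar sketch of 2-way interleaving of 32-bit words. The function names `intrlv_2x32_sketch` and `dintrlv_2x32_sketch` are hypothetical and are not taken from the miner source; the real code would use vector instructions, but the data layout is the same idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interleave two independent 32-bit word streams so
   that each vector lane holds one stream:
   dst = { a[0], b[0], a[1], b[1], ... }.
   dst must have room for 2 * nwords words. */
void intrlv_2x32_sketch(uint32_t *dst, const uint32_t *a,
                        const uint32_t *b, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        dst[2 * i]     = a[i];   /* lane 0 */
        dst[2 * i + 1] = b[i];   /* lane 1 */
    }
}

/* Hypothetical helper: undo the interleave on exit, splitting the lanes
   back into two separate streams of nwords words each. */
void dintrlv_2x32_sketch(uint32_t *a, uint32_t *b,
                         const uint32_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        a[i] = src[2 * i];
        b[i] = src[2 * i + 1];
    }
}
```

A round trip through both functions should return the original streams, which makes it a cheap first check when n-way results differ between x86_64 and ARM.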
Verthash is a mystery: it only produces rejects on ARM even with no targeted code, only compiled C. The same C source works on x86_64 but not on AArch64. Tried with -O3 & -O2. In all other cases falling back to C was successful. Verthash data file creation and verification work.
There are a few cases where translating from SSE2 to NEON is difficult or the workaround kills performance. NEON, being RISC, has no microcode so no programmable shuffle instruction. The only shuffling I can find is sub-vector word, sub-word bit, shift, rotate & reverse. Notably SSE2 can't do bit reversal but can shuffle bytes any which way.
Multiplications are implemented differently, particularly widening multiplication where the product is twice the bit width of the sources.
x86_64 operates on lanes 0 & 2 while ARM operates on lanes 0 & 1 of the source data. In effect x86_64 assumes the data is pre-widened and discards lanes 1 & 3, leaving 2 zero-extended 64-bit integers. On ARM the source arguments are packed into a smaller vector (uint32x2_t * uint32x2_t = uint64x2_t) and the product is widened to 64 bits.
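The lane difference can be modelled in plain C. This is a sketch under assumptions: the function names are hypothetical, and I am taking the instructions described above to be the ones exposed as `_mm_mul_epu32` on SSE2 and `vmull_u32` on NEON, which the text does not name. The last function shows one possible way to get the x86_64 result from the ARM form by packing lanes 0 & 2 first; it is not necessarily how the miner handles it.

```c
#include <stdint.h>

/* Scalar model of the x86_64 behaviour: multiply 32-bit lanes 0 and 2 of
   each 4-lane source, ignoring lanes 1 and 3. */
void mul_widen_x86_model(uint64_t out[2],
                         const uint32_t a[4], const uint32_t b[4])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
}

/* Scalar model of the ARM behaviour: the sources are packed 2-lane
   vectors (uint32x2_t) and both lanes are widened (uint64x2_t). */
void mul_widen_arm_model(uint64_t out[2],
                         const uint32_t a[2], const uint32_t b[2])
{
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[1] * b[1];
}

/* Assumed bridge: pack lanes 0 and 2 into 2-lane vectors, then use the
   ARM-style multiply to reproduce the x86_64 result. */
void mul_widen_x86_on_arm_model(uint64_t out[2],
                                const uint32_t a[4], const uint32_t b[4])
{
    const uint32_t pa[2] = { a[0], a[2] };
    const uint32_t pb[2] = { b[0], b[2] };
    mul_widen_arm_model(out, pa, pb);
}
```

Feeding the same four-lane inputs to the first and third functions should give identical products, while the second function applied to the low two lanes differs in its second result. That mismatch is the one to look for when a direct translation of a widening multiply misbehaves.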
NEON has some fancy load instructions that combine a load with another operation like byte swap. These may provide optimizations that SSE can't.
Exploring these is part of the longer term plans once the existing problems are solved and the ARM code is up to the same level of optimization as x86_64.