Jay D Dee
2024-09-13 14:14:57 -04:00
parent 47e24b50e8
commit 8e91bfbe19
16 changed files with 2727 additions and 1880 deletions

View File

@@ -32,6 +32,14 @@
// Intrinsics automatically promote from REX to VEX when AVX is available
// but ASM needs to be done manually.
//
// APX supports EGPR which adds 16 more GPRs and 3-operand instructions.
// This may affect ASM that includes instructions that are superseded by APX
// versions and are therefore incompatible with APX.
// As a result GCC-14 disables EGPR in inline ASM by default; it can be enabled
// with "-mapx-inline-asm-use-gpr32".
//TODO
// Some ASM functions may need to be updated to support EGPR with APX.
//
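// A hedged sketch (hypothetical helper, not from this codebase) of the interaction:
// by default GCC-14 keeps inline ASM operands in the legacy r0-r15, so a plain "r"
// constraint stays encodable; with "-mapx-inline-asm-use-gpr32" the allocator may
// also hand the ASM an EGPR (r16-r31), so every instruction in the ASM string must
// then have an APX compatible encoding.
static inline uint64_t asm_add64( uint64_t a, const uint64_t b )
{
   // "r" may resolve to r16-r31 only when EGPR use is enabled for inline ASM.
   __asm__( "addq %1, %0" : "+r"(a) : "r"(b) );
   return a;
}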
///////////////////////////////////////////////////////////////////////////////
// New architecturally agnostic syntax:
@@ -164,7 +172,7 @@ typedef union
// necessary the cvt, set, or set1 intrinsics can be used allowing the
// compiler to exploit new features to produce optimum code.
// Currently only used internally and by Luffa.
// It also has implications for the APX EGPR feature.
#define v128_mov64 _mm_cvtsi64_si128
#define v128_mov32 _mm_cvtsi32_si128
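// A hedged usage sketch (hypothetical helper, not part of this file): move a 64 bit
// scalar into the low lane of a 128 bit vector; _mm_cvtsi64_si128 zero extends, so
// the upper 64 bits are cleared.
static inline __m128i v128_from_u64( const uint64_t a )
{
   return v128_mov64( (long long)a );
}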

View File

@@ -125,7 +125,7 @@ static inline __m512i mm512_perm_128( const __m512i v, const int c )
// Pseudo constants.
#define m512_zero _mm512_setzero_si512()
// use asm to avoid compiler warning for uninitialized local
static inline __m512i mm512_neg1_fn()
{
__m512i v;
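// A sketch of the remainder, assuming the usual all-ones idiom the comment above
// describes: vpternlogq with immediate 0xff writes all ones to its output-only
// operand, so the compiler never sees a read of the uninitialized local.
asm( "vpternlogq $0xff, %0, %0, %0\n\t" : "=v"(v) );
return v;
}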

View File

@@ -10,7 +10,18 @@
// This code is not used anywhere and likely never will be. Its intent was
// to support 2 way parallel hashing using MMX, or NEON, for 32 bit hash
// functions, but it was never implemented.
//
//
// MMX is being deprecated by compilers; all intrinsics will be converted to use SSE
// registers and instructions. MMX will still be available using ASM.
// For backward compatibility it's likely the compiler won't allow mixing explicit SSE
// with promoted MMX. It is therefore preferable to implement all 64 bit vector code
// using explicit SSE with the upper 64 bits being ignored.
// Using SSE for 64 bit vectors will complicate loading arrays from memory, which will
// always load 128 bits. Odd indexes will need to be extracted from the upper 64 bits
// of the even index SSE register (see the sketch below).
// In most cases the existing 4x32 SSE code can be used with 2 lanes being ignored,
// making this file obsolete.
#define v64_t __m64
#define v64u32_t v64_t
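// A hedged sketch (hypothetical helper, not part of this file) of the load scheme
// described above: a 128 bit load always fetches an even/odd pair of 64 bit elements,
// so an odd index has to be shifted down from the upper half of the load made at the
// preceding even index.
static inline __m128i v64_load( const uint64_t *p, const int i )
{
   __m128i v = _mm_loadu_si128( (const __m128i*)( p + ( i & ~1 ) ) );
   // even index: element is already in the low 64 bits, the upper lane is ignored.
   // odd index:  element is in the upper 64 bits, shift it down to the low lane.
   return ( i & 1 ) ? _mm_srli_si128( v, 8 ) : v;
}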

25
simd-utils/simd-sve.h Normal file
View File

@@ -0,0 +1,25 @@
// Placeholder for now.
//
// This file will hold AArch64 SVE code, a replacement for NEON that uses vector length
// agnostic instructions. This means the same code can be used on CPUs with different
// SVE vector register lengths. This is not good for vectorized hashing.
// Optimum hashing is sensitive to the vector register length, with different code
// used for different register sizes. On X86_64 the vector length is tied to the CPU
// feature making it simple and efficient to handle different lengths although it
// results in multiple executables. Theoretically SVE could use a single executable for
// any vector length.
//
// With the SVE vector length only known at run time it results in run time overhead
// to test the vector length. Theoretically it could be tested at program loading and
// appropriate libraries loaded. However I don't know if this can be done and if so
// how to do it.
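// A hedged sketch (hypothetical helper, not part of this file) of the run time test:
// ACLE exposes the register size through counters such as svcntb(), the number of
// bytes in one SVE vector register.
#include <arm_sve.h>
static inline uint64_t sve_vector_bits( void )
{
   return svcntb() * 8;   // bytes per register converted to bits
}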
//
// SVE is not expected to be used for 128 bit vectors as it does not provide any
// advantages over NEON. However, it may be implemented for testing purposes
// because CPUs with registers larger than 128 bits are currently very rare and are
// very expensive server class CPUs.
//
// N-way parallel hashing could be the best use of SVE, using the same code for all
// vector lengths with the only variable being the number of lanes. This will still
// require run time checking but should be lighter than substituting functions.
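// A hedged sketch (hypothetical function, not part of this file, reusing the
// <arm_sve.h> include from the sketch above) of lane count agnostic code: the
// predicate and the loop stride adapt to whatever register length the CPU
// implements, so the same code runs unchanged for any SVE vector length.
static void sve_add_u32( uint32_t *c, const uint32_t *a, const uint32_t *b,
                         const uint64_t n )
{
   for ( uint64_t i = 0; i < n; i += svcntw() )     // svcntw() = 32 bit lanes per vector
   {
      svbool_t pg = svwhilelt_b32_u64( i, n );      // mask off lanes past the end
      svuint32_t va = svld1_u32( pg, a + i );
      svuint32_t vb = svld1_u32( pg, b + i );
      svst1_u32( pg, c + i, svadd_u32_x( pg, va, vb ) );
   }
}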