v3.19.6

v3.19.5
v3.19.4
2025-09-17 23:44:27 +00:00 · 2022-02-21 23:14:24 -05:00 · 2022-01-30 20:59:54 -05:00 · 2022-01-12 21:08:25 -05:00 · 2022-01-07 12:07:38 -05:00 · 2021-12-30 16:28:24 -05:00
73 changed files with 5719 additions and 5794 deletions
--- a/22
+++ b/22
@@ -32,14 +32,26 @@ but different package names.
 $ sudo apt-get install build-essential automake libssl-dev libcurl4-openssl-dev libjansson-dev libgmp-dev zlib1g-dev git

 SHA support on AMD Ryzen CPUs requires gcc version 5 or higher and
-openssl 1.1.0e or higher. Add one of the following to CFLAGS for SHA
-support depending on your CPU and compiler version:
+openssl 1.1.0e or higher.

-"-march=native" is always the best choice
+znver1 and znver2 should be recognized on most recent versions of GCC and
+znver3 is expected with GCC 11. GCC 11 also includes rocketlake support.
+In the meantime here are some suggestions to compile with new CPUs:

-"-march=znver1" for Ryzen 1000 & 2000 series, znver2 for 3000.
+"-march=native" is usually the best choice, used by build.sh.

-"-msha"  Add SHA to other tuning options
+"-march=znver2 -mvaes" can be used for Ryzen 5000 if znver3 is not recognized.
+
+"-mcascadelake -msha" or
+"-mcometlake -mavx512 -msha" can be used for Rocket Lake.
+
+Features can also be added individually:
+
+"-msha" adds support for HW accelerated sha256.
+
+"-mavx512" adds support for 512 bit vectors
+
+"-mvaes" adds support for parallel AES

 Additional instructions for static compilalation can be found here:
 https://lxadm.com/Static_compilation_of_cpuminer
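
To confirm what a given set of CFLAGS actually enabled, one option is to look at the feature macros that GCC and Clang predefine for each -m option. The sketch below is not part of cpuminer-opt; the `FEAT_*` values and the function names are hypothetical and exist only for this illustration.

```c
#include <stdio.h>

/* Hypothetical bit values for this sketch only; they are not the
   flags used by cpuminer-opt. */
enum { FEAT_AVX2 = 1, FEAT_SHA = 2, FEAT_AVX512F = 4, FEAT_VAES = 8 };

/* Report which instruction sets the compiler was allowed to use,
   based on the macros GCC and Clang predefine for each -m option. */
static unsigned compiled_features( void )
{
    unsigned mask = 0;
#if defined(__AVX2__)
    mask |= FEAT_AVX2;
#endif
#if defined(__SHA__)
    mask |= FEAT_SHA;
#endif
#if defined(__AVX512F__)
    mask |= FEAT_AVX512F;
#endif
#if defined(__VAES__)
    mask |= FEAT_VAES;
#endif
    return mask;
}

/* Print one 0/1 column per feature. */
static void print_compiled_features( void )
{
    const unsigned m = compiled_features();
    printf( "AVX2:%d SHA:%d AVX512F:%d VAES:%d\n",
            !!( m & FEAT_AVX2 ), !!( m & FEAT_SHA ),
            !!( m & FEAT_AVX512F ), !!( m & FEAT_VAES ) );
}
```

Compiling this with the same CFLAGS as the miner and calling `print_compiled_features()` shows whether, for example, "-march=znver2 -mvaes" really turned on both SHA and VAES with the installed compiler.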
--- a/79
+++ b/79
@@ -40,7 +40,7 @@ $ mkdir $HOME/usr/lib
   version available in the repositories.

 Download the following source code packages from their respective and
-respected download locations, copy them to ~/usr/lib/ and uncompress them. 
+respected download locations, copy them to $HOME/usr/lib/ and uncompress them. 

 openssl: https://github.com/openssl/openssl/releases

@@ -149,85 +149,10 @@ Copy cpuminer.exe to the release directory, compress and copy the release direct

 Run cpuminer

-In a command windows change directories to the unzipped release folder. to get a list of all options:
+In a command window, change directories to the unzipped release folder. To get a list of all options:

 cpuminer.exe --help

 Command options are specific to where you mine. Refer to the pool's instructions on how to set them.


-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-Create a link to the locally compiled version of gmp.h
-
-$ ln -s $LOCAL_LIB/gmp-version/gmp.h ./gmp.h
-
-Edit configure.ac to fix lipthread package name.
-
-sed -i 's/"-lpthread"/"-lpthreadGC2"/g' configure.ac
-
-
-7. Compile
-
-you can use the default compile if you intend to use cpuminer-opt on the
-same CPU and the virtual machine supports that architecture.
-
-./build.sh
-
-Otherwise you can compile manually while setting options in CFLAGS.
-
-Some common options:
-
-To compile for a specific CPU architecture:
-
-CFLAGS="-O3 -march=znver1 -Wall" ./configure --with-curl
-
-This will compile for AMD Ryzen.
-
-You can compile more generically for a set of specific CPU features
-if you know what features you want:
-
-CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure --with-curl
-
-This will compile for an older CPU that does not have AVX.
-
-You can find several examples in build-allarch.sh
-
-If you have a CPU with more than 64 threads and Windows 7 or higher you
-can enable the CPU Groups feature:
-
-D_WIN32_WINNT==0x0601
-
-Once you have run configure successfully run make with n CPU threads:
-
-make -j n
-
-Copy cpuminer.exe to the release directory, compress and copy the release
-directory to a Windows system and run cpuminer.exe from the command line.
-
-Run cpuminer
-
-In a command windows change directories to the unzipped release folder.
-to get a list of all options:
-
-cpuminer.exe --help
-
-Command options are specific to where you mine. Refer to the pool's
-instructions on how to set them.
--- a/Makefile.am
+++ b/Makefile.am
@@ -21,6 +21,7 @@ cpuminer_SOURCES = \
  api.c \
  sysinfos.c \
  algo-gate-api.c\
+  malloc-huge.c \
  algo/argon2/argon2a/argon2a.c \
  algo/argon2/argon2a/ar2/argon2.c \
  algo/argon2/argon2a/ar2/opt.c \
@@ -171,6 +172,7 @@ cpuminer_SOURCES = \
  algo/sha/hmac-sha256-hash-4way.c \
  algo/sha/sha256d.c \
  algo/sha/sha2.c \
+  algo/sha/sha256d-4way.c \
  algo/sha/sha256t-gate.c \
  algo/sha/sha256t-4way.c \
  algo/sha/sha256t.c \
--- a/README.txt
+++ b/README.txt
@@ -18,14 +18,14 @@ error to find the fastest one that works. Pay attention to
 the features listed at cpuminer startup to ensure you are mining at
 optimum speed using the best available features.

-Architecture names and compile options used are only provided for Intel
-Core series. Budget CPUs like Pentium and Celeron are often missing some
-features.
+Architecture names and compile options used are only provided for 
+mainstream desktop CPUs. Budget CPUs like Pentium and Celeron are often
+missing some features. Check your CPU.

-AMD CPUs older than Piledriver, including Athlon x2 and Phenom II x4, are not
-supported by cpuminer-opt due to an incompatible implementation of SSE2 on
-these CPUs. Some algos may crash the miner with an invalid instruction.
-Users are recommended to use an unoptimized miner such as cpuminer-multi.
+Support for AMD CPUs older than Ryzen is incomplete and without specific 
+recommendations. Find the best fit. CPUs older than Piledriver, including
+Athlon x2 and Phenom II x4, are not supported by cpuminer-opt due to an
+incompatible implementation of SSE2 on these CPUs. 

 More information for Intel and AMD CPU architectures and their features
 can be found on Wikipedia.
@@ -34,26 +34,21 @@ https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

 https://en.wikipedia.org/wiki/List_of_AMD_CPU_microarchitectures

+File name                      Architecture name

-Exe file name                Compile flags              Arch name
+cpuminer-sse2.exe              Core2, Nehalem, generic x86_64 with SSE2   
+cpuminer-aes-sse42.exe         Westmere
+cpuminer-avx.exe               Sandybridge, Ivybridge
+cpuminer-avx2.exe              Haswell, Skylake, Kabylake, Coffeelake, Cometlake
+cpuminer-avx2-sha.exe          AMD Zen1, Zen2
+cpuminer-avx2-sha-vaes.exe     Intel Alderlake*, AMD Zen3
+cpuminer-avx512.exe            Intel HEDT Skylake-X, Cascadelake
+cpuminer-avx512-sha-vaes.exe   Icelake, Tigerlake, Rocketlake

-cpuminer-sse2.exe            "-msse2"                   Core2, Nehalem   
-cpuminer-aes-sse42.exe       "-march=westmere"          Westmere
-cpuminer-avx.exe             "-march=corei7-avx"        Sandybridge, Ivybridge
-cpuminer-avx2.exe            "-march=core-avx2 -maes"   Haswell(1)
-cpuminer-avx512.exe          "-march=skylake-avx512"    Skylake-X, Cascadelake
-cpuminer-avx512-sha.exe      "-march=cascadelake -msha" Rocketlake(2)
-cpuminer-avx512-sha-vaes.exe "-march=icelake-client"    Icelake, Tigerlake(3)
-cpuminer-zen.exe             "-march=znver1"            AMD Zen1, Zen2
-cpuminer-zen3.exe            "-march=znver2 -mvaes"     Zen3(4)
-
-(1) Haswell includes Broadwell, Skylake, Kabylake, Coffeelake & Cometlake. 
-(2) Rocketlake build uses cascadelake+sha as a workaround until Rocketlake
-    compiler support is avalable.
-(3) Icelake & Tigerlake are only available on some laptops. Mining with a
-    laptop is not recommended.
-(4) Zen3 build uses zen2+vaes as a workaround until Zen3 compiler support is
-    available. Zen2 CPUs should use Zen1 build.
+* Alderlake is a hybrid architecture. With the E-cores disabled it may be
+  possible to enable AVX512 on the P-cores and use the avx512-sha-vaes
+  build. This is not officially supported by Intel at time of writing.
+  Check for current information.

 Notes about included DLL files:

@@ -66,8 +61,7 @@ https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source

 Some DLL filess may already be installed on the system by Windows or third
 party packages. They often will work and may be used instead of the included
-file. Without a compelling reason to do so it's recommended to use the included
-files as they are packaged.
+file. 

 If you like this software feel free to donate:

--- a/104
+++ b/104
@@ -65,10 +65,112 @@ If not what makes it happen or not happen?
 Change Log
 ----------

+v3.19.6
+
+#363: Fixed a stratum bug where the first job may be ignored, delaying the start of hashing.
+Fixed handling of nonce exhaust when hashing a fast algo with extranonce disabled
+Small optimization to Shavite.
+
+v3.19.5
+
+Enhanced stratum-keepalive preemptively resets the stratum connection
+before the server does, to avoid lost shares.
+
+Added build-msys2.sh script for easier compiling on Windows, see Wiki for details.
+
+X16RT: eliminate unnecessary recalculations of the hash order.
+
+Fix a few compiler warnings.
+
+Fixed log colour error when a block is solved.
+
+v3.19.4
+
+#359: Fix verthash memory allocation for non-hugepages, broken in v3.19.3.
+
+New option stratum-keepalive prevents stratum timeouts when no shares are
+submitted for several minutes due to high difficulty.
+
+Fixed a bug displaying optimizations for some algos.
+
+v3.19.3
+
+Linux: Faster verthash (+25%), scryptn2 (+2%) when huge pages are available.
+
+Small speed up for Hamsi AVX2 & AVX512, Keccak AVX512.
+
+v3.19.2
+
+Fixed log displaying incorrect memory usage for scrypt, broken in v3.19.1.
+
+Reduce log noise when replies to submitted shares are lost due to stratum errors.
+
+Fugue prehash optimization for X16r family AVX2 & AVX512.
+
+Small speed improvement for Hamsi AVX2 & AVX512.
+
+Win: With CPU groups enabled the number of CPUs displayed in the ASCII art
+affinity map is the number of CPUs in a CPU group, was number of CPUs up to 64.
+
+v3.19.1
+
+Changes to Windows binaries package:
+ - builds for CPUs with AVX or lower have CPU groups disabled,
+ - zen3 build renamed to avx2-sha-vaes to support Alderlake as well as Zen3,
+ - zen build renamed to avx2-sha, supports Zen1 & Zen2,
+ - avx512-sha build removed, Rocketlake CPUs can use avx512-sha-vaes,
+ - see README.txt for compatibility details.
+
+Fixed a few compiler warnings that are new in GCC 11.
+Other minor fixes.
+
+v3.19.0
+
+Windows binaries now built with support for CPU groups, requires Windows 7.
+
+Changes to cpu-affinity:
+  - PR#346: Fixed incorrect CPU affinity on Windows built for CPU groups,
+  - added support for CPU affinity for up to 256 threads or CPUs,
+  - streamlined code for more efficient initialization of miner threads,
+  - precise affining of each miner thread to a specific CPU,
+  - added an option to disable CPU affinity with "--cpu-affinity 0"
+
+Faster sha256t with AVX512 & AVX2.
+
+Added stratum error count to stats log, reported only when non-zero.
+
+v3.18.2
+
+Issue #342, fixed Groestl AES on Windows, broken in v3.18.0.
+
+AVX512 for sha256d.
+
+SSE42 and AVX may now be displayed as mining features at startup.
+This is hard coded for each algo, and is only implemented for scrypt
+at this time as it is the only algo with significant performance differences
+with those features.
+
+Fixed an issue where a high hashrate algo could cause excessive invalid hash
+rate log reports when starting up in benchmark mode.
+
+v3.18.1
+
+More speed for scrypt:
+ - additional scryptn2 optimizations for all CPU architectures,
+ - AVX2 is now used by default on CPUs with SHA but not AVX512,
+ - scrypt:1024 performance lost in v3.18.0 is restored,
+ - AVX512 & AVX2 improvements to scrypt:1024.
+
+Big speedup for SwiFFTx AVX2 & SSE4.1: x22i +55%, x25x +22%.
+
+Issue #337: fixed a problem that could display negative stats values in the
+first summary report if the report was forced prematurely due to a stratum
+diff change. The stats will still be invalid but should display zeros.
+
 v3.18.0

 Complete rewrite of Scrypt code, optimized for large N factor (scryptn2):
-  - AVX512 & SHA support for SHA256, AVX512 has priority,
+  - AVX512 & SHA support for sha256, AVX512 has priority,
  - up to 50% increase in hashrate,
  - memory requirements reduced 30-60% depending on CPU architecture,
  - memory usage displayed at startup,
--- a/algo-gate-api.h
+++ b/algo-gate-api.h
@@ -97,7 +97,6 @@ typedef  uint32_t set_t;
 #define SHA_OPT       0x20   // Zen1, Icelake (sha256)
 #define AVX512_OPT    0x40   // Skylake-X (AVX512[F,VL,DQ,BW])
 #define VAES_OPT      0x80   // Icelake (VAES & AVX512)
-#define VAES256_OPT   0x100  // Zen3 (VAES without AVX512)


 // return set containing all elements from sets a & b
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-portable-x86.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-portable-x86.h
@@ -344,7 +344,7 @@ static size_t
 detect_cpu(void) {
 	//union { uint8_t s[12]; uint32_t i[3]; } vendor_string;
 	//cpu_vendors_x86 vendor = cpu_nobody;
-	x86_regs regs;
+	x86_regs regs; regs.eax = regs.ebx = regs.ecx = 0;
 	uint32_t max_level, max_ext_level;
 	size_t cpu_flags = 0;
 #if defined(X86ASM_AVX) || defined(X86_64ASM_AVX)
@@ -460,4 +460,4 @@ get_top_cpuflag_desc(size_t flag) {
 	#endif
 #endif

-#endif /* defined(CPU_X86) || defined(CPU_X86_64) */
+#endif /* defined(CPU_X86) || defined(CPU_X86_64) */
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix-basic.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix-basic.h
@@ -4,11 +4,12 @@ typedef void (FASTCALL *scrypt_ROMixfn)(scrypt_mix_word_t *X/*[chunkWords]*/, sc
 #endif

 /* romix pre/post nop function */
+/*
 static void asm_calling_convention
 scrypt_romix_nop(scrypt_mix_word_t *blocks, size_t nblocks) {
 	(void)blocks; (void)nblocks;
 }
-
+*/
 /* romix pre/post endian conversion function */
 static void asm_calling_convention
 scrypt_romix_convert_endian(scrypt_mix_word_t *blocks, size_t nblocks) {
--- a/algo/argon2/argon2d/argon2d/opt.c
+++ b/algo/argon2/argon2d/argon2d/opt.c
@@ -37,6 +37,13 @@

 #if defined(__AVX512F__)

+static inline __m512i blamka( __m512i x, __m512i y )
+{
+    __m512i xy = _mm512_mul_epu32( x, y );
+    return _mm512_add_epi64( _mm512_add_epi64( x, y ),
+                             _mm512_add_epi64( xy, xy ) );
+}
+
 static void fill_block( __m512i *state, const block *ref_block,
                       block *next_block, int with_xor )
 {
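
For readers less familiar with the intrinsics, the `blamka` helper added above can be modeled one 64-bit lane at a time. `_mm512_mul_epu32` multiplies only the low 32 bits of each lane, so each lane ends up holding x + y + 2 * lo32(x) * lo32(y) modulo 2^64. The scalar function below is an illustrative sketch; `blamka_scalar` is a hypothetical name and is not in the source tree.

```c
#include <stdint.h>

/* Scalar model of a single 64-bit lane of the AVX512 blamka() helper.
   Only the low 32 bits of each operand take part in the multiply,
   mirroring _mm512_mul_epu32. */
static inline uint64_t blamka_scalar( uint64_t x, uint64_t y )
{
    const uint64_t xy = (uint64_t)(uint32_t)x * (uint64_t)(uint32_t)y;
    return ( x + y ) + ( xy + xy );   /* unsigned math wraps modulo 2^64 */
}
```

For instance, `blamka_scalar( 3, 5 )` is 3 + 5 + 2*15 = 38, and bits above the low 32 of either operand affect only the additions, not the product.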
--- a/algo/argon2/argon2d/blake2/blamka-round-opt.h
+++ b/algo/argon2/argon2d/blake2/blamka-round-opt.h
@@ -328,9 +328,7 @@ static BLAKE2_INLINE __m128i fBlaMka(__m128i x, __m128i y) {

 #include <immintrin.h>

-#define ROR64(x, n) _mm512_ror_epi64((x), (n))
-
-static __m512i muladd(__m512i x, __m512i y)
+static inline __m512i muladd(__m512i x, __m512i y)
 {
    __m512i z = _mm512_mul_epu32(x, y);
    return _mm512_add_epi64(_mm512_add_epi64(x, y), _mm512_add_epi64(z, z));
@@ -344,8 +342,8 @@ static __m512i muladd(__m512i x, __m512i y)
        D0 = _mm512_xor_si512(D0, A0); \
        D1 = _mm512_xor_si512(D1, A1); \
 \
-        D0 = ROR64(D0, 32); \
-        D1 = ROR64(D1, 32); \
+        D0 = _mm512_ror_epi64(D0, 32); \
+        D1 = _mm512_ror_epi64(D1, 32); \
 \
        C0 = muladd(C0, D0); \
        C1 = muladd(C1, D1); \
@@ -353,8 +351,8 @@ static __m512i muladd(__m512i x, __m512i y)
        B0 = _mm512_xor_si512(B0, C0); \
        B1 = _mm512_xor_si512(B1, C1); \
 \
-        B0 = ROR64(B0, 24); \
-        B1 = ROR64(B1, 24); \
+        B0 = _mm512_ror_epi64(B0, 24); \
+        B1 = _mm512_ror_epi64(B1, 24); \
    } while ((void)0, 0)

 #define G2(A0, B0, C0, D0, A1, B1, C1, D1) \
@@ -365,8 +363,8 @@ static __m512i muladd(__m512i x, __m512i y)
        D0 = _mm512_xor_si512(D0, A0); \
        D1 = _mm512_xor_si512(D1, A1); \
 \
-        D0 = ROR64(D0, 16); \
-        D1 = ROR64(D1, 16); \
+        D0 = _mm512_ror_epi64(D0, 16); \
+        D1 = _mm512_ror_epi64(D1, 16); \
 \
        C0 = muladd(C0, D0); \
        C1 = muladd(C1, D1); \
@@ -374,8 +372,8 @@ static __m512i muladd(__m512i x, __m512i y)
        B0 = _mm512_xor_si512(B0, C0); \
        B1 = _mm512_xor_si512(B1, C1); \
 \
-        B0 = ROR64(B0, 63); \
-        B1 = ROR64(B1, 63); \
+        B0 = _mm512_ror_epi64(B0, 63); \
+        B1 = _mm512_ror_epi64(B1, 63); \
    } while ((void)0, 0)

 #define DIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1) \
@@ -417,11 +415,10 @@ static __m512i muladd(__m512i x, __m512i y)

 #define SWAP_HALVES(A0, A1) \
    do { \
-        __m512i t0, t1; \
-        t0 = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(1, 0, 1, 0)); \
-        t1 = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(3, 2, 3, 2)); \
-        A0 = t0; \
-        A1 = t1; \
+        __m512i t; \
+        t = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(1, 0, 1, 0)); \
+        A1 = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(3, 2, 3, 2)); \
+        A0 = t; \
    } while((void)0, 0)

 #define SWAP_QUARTERS(A0, A1) \
--- a/algo/blake/decred-gate.c
+++ b/algo/blake/decred-gate.c
@@ -8,7 +8,7 @@ uint32_t *decred_get_nonceptr( uint32_t *work_data )
   return &work_data[ DECRED_NONCE_INDEX ];
 }

-double decred_calc_network_diff( struct work* work )
+long double decred_calc_network_diff( struct work* work )
 {
   // sample for diff 43.281 : 1c05ea29
   // todo: endian reversed on longpoll could be zr5 specific...
@@ -16,7 +16,7 @@ double decred_calc_network_diff( struct work* work )
   uint32_t bits = ( nbits & 0xffffff );
   int16_t shift = ( swab32(nbits) & 0xff ); // 0x1c = 28
   int m;
-   double d = (double)0x0000ffff / (double)bits;
+   long double d = (long double)0x0000ffff / (long double)bits;

   for ( m = shift; m < 29; m++ )
       d *= 256.0;
@@ -25,7 +25,7 @@ double decred_calc_network_diff( struct work* work )
   if ( shift == 28 )
       d *= 256.0; // testnet
   if ( opt_debug_diff )
-       applog( LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", d,
+       applog( LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", (double)d,
                           shift, bits );
   return net_diff;
 }
@@ -70,7 +70,10 @@ void decred_be_build_stratum_request( char *req, struct work *work,
         rpc_user, work->job_id, xnonce2str, ntimestr, noncestr );
   free(xnonce2str);
 }
+
+#if !defined(min)
 #define min(a,b) (a>b ? (b) :(a))
+#endif

 void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
 {
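
The arithmetic in `decred_calc_network_diff` can be checked against the sample in its own comment (diff 43.281 for 1c05ea29). The sketch below is a simplified model and rests on assumptions: it takes `nbits` already in the form 0x1c05ea29, with the exponent in the top byte, and leaves out the byte swapping and the testnet adjustment of the original. `compact_to_diff` is a hypothetical name.

```c
#include <stdint.h>

/* Convert a compact target (exponent in the top byte, 24-bit mantissa
   below it) to a difficulty relative to 0x1d00ffff. */
static long double compact_to_diff( uint32_t nbits )
{
    const uint32_t mantissa = nbits & 0x00ffffff;
    const int exponent = (int)( nbits >> 24 );

    if ( mantissa == 0 )
        return 0.0L;                       /* avoid dividing by zero */

    long double d = (long double)0x0000ffff / (long double)mantissa;

    /* Each byte the exponent sits below 29 makes the target 256 times
       smaller, so the difficulty is 256 times larger, and vice versa. */
    for ( int m = exponent; m < 29; m++ )
        d *= 256.0L;
    for ( int m = exponent; m > 29; m-- )
        d /= 256.0L;

    return d;
}
```

With 0x1c05ea29 the mantissa is 0x05ea29 and the exponent is 28, so the result is 65535 / 387625 * 256, roughly 43.281, which agrees with the sample. Passing the value as `(double)d` to a `%f` format, as the patch does, is needed because `%f` expects a `double`, not a `long double`.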
--- a/algo/fugue/fugue-aesni.h
+++ b/algo/fugue/fugue-aesni.h
@@ -37,12 +37,23 @@ typedef struct

 } hashState_fugue __attribute__ ((aligned (64)));

+
+// These functions are deprecated, use the lower case macro aliases that use
+// the standard interface. This will be cleaned up at a later date.
 HashReturn fugue512_Init(hashState_fugue *state, int hashbitlen);

 HashReturn fugue512_Update(hashState_fugue *state, const void *data, DataLength databitlen);

 HashReturn fugue512_Final(hashState_fugue *state, void *hashval);

+#define fugue512_init( state ) \
+   fugue512_Init( state, 512 )
+#define fugue512_update( state, data, len ) \
+   fugue512_Update( state, data, (len)<<3 )
+#define fugue512_final \
+   fugue512_Final
+
+
 HashReturn fugue512_full(hashState_fugue *hs, void *hashval, const void *data, DataLength databitlen);

 #endif // AES
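
The lower-case aliases added above differ from the deprecated functions mainly in units: `fugue512_update` takes a length in bytes and shifts it left by 3 to produce the bit length that `fugue512_Update` expects. The stand-alone sketch below shows the same wrapper pattern with hypothetical stand-ins (`demo_state`, `demo_Update`, `demo_update`) rather than the real Fugue code.

```c
#include <stdint.h>

/* Hypothetical stand-in for a hash state; it only counts input bits. */
typedef struct { uint64_t bits_absorbed; } demo_state;

/* Old-style interface: the length is given in bits. */
static int demo_Update( demo_state *s, const void *data, uint64_t databitlen )
{
    (void)data;                    /* payload is ignored in this sketch */
    s->bits_absorbed += databitlen;
    return 0;
}

/* New-style alias: the length is given in bytes and converted to bits.
   The operand is widened before the shift so large byte counts do not
   overflow a 32-bit int; the macro in the patch shifts (len) as-is. */
#define demo_update( state, data, len ) \
   demo_Update( state, data, (uint64_t)(len) << 3 )
```

Calling `demo_update( &s, buf, 80 )` on a zeroed state leaves `s.bits_absorbed` at 640, the bit length of an 80-byte block header.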
--- a/algo/hamsi/hamsi-hash-4way.c
+++ b/algo/hamsi/hamsi-hash-4way.c
@@ -545,31 +545,33 @@ static const sph_u32 T512[64][16] = {
 #define sE   c7
 #define sF   m7

-
 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

 // Hamsi 8 way AVX512 

+// Intel says _mm512_movepi64_mask has (1L/1T) timing while
+// _mm512_cmplt_epi64_mask has (3L/1T) timing. However, when tested hashing X13
+// on an i9-9940x, cmplt with zero was 3% faster than movepi.
+
 #define INPUT_BIG8 \
 do { \
-  __m512i db = *buf; \
-  const uint64_t *tp = (uint64_t*)&T512[0][0];  \
-  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = m512_zero; \
+  __m512i db = _mm512_ror_epi64( *buf, 1 ); \
+  const __m512i zero = m512_zero; \
+  const uint64_t *tp = (const uint64_t*)T512; \
+  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = zero; \
  for ( int u = 0; u < 64; u++ ) \
  { \
-     __m512i dm = _mm512_and_si512( db, m512_one_64 ) ; \
-     dm = mm512_negate_32( _mm512_or_si512( dm, \
-                                          _mm512_slli_epi64( dm, 32 ) ) ); \
-     m0 = mm512_xorand( m0, dm, m512_const1_64( tp[0] ) ); \
-     m1 = mm512_xorand( m1, dm, m512_const1_64( tp[1] ) ); \
-     m2 = mm512_xorand( m2, dm, m512_const1_64( tp[2] ) ); \
-     m3 = mm512_xorand( m3, dm, m512_const1_64( tp[3] ) ); \
-     m4 = mm512_xorand( m4, dm, m512_const1_64( tp[4] ) ); \
-     m5 = mm512_xorand( m5, dm, m512_const1_64( tp[5] ) ); \
-     m6 = mm512_xorand( m6, dm, m512_const1_64( tp[6] ) ); \
-     m7 = mm512_xorand( m7, dm, m512_const1_64( tp[7] ) ); \
+     const __mmask8 dm = _mm512_cmplt_epi64_mask( db, zero ); \
+     m0 = _mm512_mask_xor_epi64( m0, dm, m0, m512_const1_64( tp[0] ) ); \
+     m1 = _mm512_mask_xor_epi64( m1, dm, m1, m512_const1_64( tp[1] ) ); \
+     m2 = _mm512_mask_xor_epi64( m2, dm, m2, m512_const1_64( tp[2] ) ); \
+     m3 = _mm512_mask_xor_epi64( m3, dm, m3, m512_const1_64( tp[3] ) ); \
+     m4 = _mm512_mask_xor_epi64( m4, dm, m4, m512_const1_64( tp[4] ) ); \
+     m5 = _mm512_mask_xor_epi64( m5, dm, m5, m512_const1_64( tp[5] ) ); \
+     m6 = _mm512_mask_xor_epi64( m6, dm, m6, m512_const1_64( tp[6] ) ); \
+     m7 = _mm512_mask_xor_epi64( m7, dm, m7, m512_const1_64( tp[7] ) ); \
+     db = _mm512_ror_epi64( db, 1 ); \
     tp += 8; \
-     db = _mm512_srli_epi64( db, 1 ); \
  } \
 } while (0)

@@ -609,199 +611,192 @@ do { \

 #define READ_STATE_BIG8(sc) \
 do { \
-   c0 = sc->h[0x0]; \
-   c1 = sc->h[0x1]; \
-   c2 = sc->h[0x2]; \
-   c3 = sc->h[0x3]; \
-   c4 = sc->h[0x4]; \
-   c5 = sc->h[0x5]; \
-   c6 = sc->h[0x6]; \
-   c7 = sc->h[0x7]; \
+   c0 = sc->h[0]; \
+   c1 = sc->h[1]; \
+   c2 = sc->h[2]; \
+   c3 = sc->h[3]; \
+   c4 = sc->h[4]; \
+   c5 = sc->h[5]; \
+   c6 = sc->h[6]; \
+   c7 = sc->h[7]; \
 } while (0)

 #define WRITE_STATE_BIG8(sc) \
 do { \
-   sc->h[0x0] = c0; \
-   sc->h[0x1] = c1; \
-   sc->h[0x2] = c2; \
-   sc->h[0x3] = c3; \
-   sc->h[0x4] = c4; \
-   sc->h[0x5] = c5; \
-   sc->h[0x6] = c6; \
-   sc->h[0x7] = c7; \
+   sc->h[0] = c0; \
+   sc->h[1] = c1; \
+   sc->h[2] = c2; \
+   sc->h[3] = c3; \
+   sc->h[4] = c4; \
+   sc->h[5] = c5; \
+   sc->h[6] = c6; \
+   sc->h[7] = c7; \
 } while (0)

-
 #define ROUND_BIG8( alpha ) \
 do { \
   __m512i t0, t1, t2, t3; \
-   s0 = _mm512_xor_si512( s0, alpha[ 0] ); \
-   s1 = _mm512_xor_si512( s1, alpha[ 1] ); \
-   s2 = _mm512_xor_si512( s2, alpha[ 2] ); \
-   s3 = _mm512_xor_si512( s3, alpha[ 3] ); \
-   s4 = _mm512_xor_si512( s4, alpha[ 4] ); \
-   s5 = _mm512_xor_si512( s5, alpha[ 5] ); \
-   s6 = _mm512_xor_si512( s6, alpha[ 6] ); \
-   s7 = _mm512_xor_si512( s7, alpha[ 7] ); \
-   s8 = _mm512_xor_si512( s8, alpha[ 8] ); \
-   s9 = _mm512_xor_si512( s9, alpha[ 9] ); \
-   sA = _mm512_xor_si512( sA, alpha[10] ); \
-   sB = _mm512_xor_si512( sB, alpha[11] ); \
-   sC = _mm512_xor_si512( sC, alpha[12] ); \
-   sD = _mm512_xor_si512( sD, alpha[13] ); \
-   sE = _mm512_xor_si512( sE, alpha[14] ); \
-   sF = _mm512_xor_si512( sF, alpha[15] ); \
+   s0 = _mm512_xor_si512( s0, alpha[ 0] ); /* m0 */ \
+   s1 = _mm512_xor_si512( s1, alpha[ 1] ); /* c0 */ \
+   s2 = _mm512_xor_si512( s2, alpha[ 2] ); /* m1 */ \
+   s3 = _mm512_xor_si512( s3, alpha[ 3] ); /* c1 */ \
+   s4 = _mm512_xor_si512( s4, alpha[ 4] ); /* c2 */ \
+   s5 = _mm512_xor_si512( s5, alpha[ 5] ); /* m2 */ \
+   s6 = _mm512_xor_si512( s6, alpha[ 6] ); /* c3 */ \
+   s7 = _mm512_xor_si512( s7, alpha[ 7] ); /* m3 */ \
+   s8 = _mm512_xor_si512( s8, alpha[ 8] ); /* m4 */ \
+   s9 = _mm512_xor_si512( s9, alpha[ 9] ); /* c4 */ \
+   sA = _mm512_xor_si512( sA, alpha[10] ); /* m5 */ \
+   sB = _mm512_xor_si512( sB, alpha[11] ); /* c5 */ \
+   sC = _mm512_xor_si512( sC, alpha[12] ); /* c6 */ \
+   sD = _mm512_xor_si512( sD, alpha[13] ); /* m6 */ \
+   sE = _mm512_xor_si512( sE, alpha[14] ); /* c7 */ \
+   sF = _mm512_xor_si512( sF, alpha[15] ); /* m7 */ \
 \
-  SBOX8( s0, s4, s8, sC ); \
-  SBOX8( s1, s5, s9, sD ); \
-  SBOX8( s2, s6, sA, sE ); \
-  SBOX8( s3, s7, sB, sF ); \
+  SBOX8( s0, s4, s8, sC ); /* ( m0, c2, m4, c6 ) */ \
+  SBOX8( s1, s5, s9, sD ); /* ( c0, m2, c4, m6 ) */ \
+  SBOX8( s2, s6, sA, sE ); /* ( m1, c3, m5, c7 ) */ \
+  SBOX8( s3, s7, sB, sF ); /* ( c1, m3, c5, m7 ) */ \
 \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s4, 4 ), \
-                                        _mm512_bslli_epi128( s5, 4 ) ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sD, 4 ), \
-                                        _mm512_bslli_epi128( sE, 4 ) ); \
+  s4 = mm512_swap64_32( s4 ); \
+  s5 = mm512_swap64_32( s5 ); \
+  sD = mm512_swap64_32( sD ); \
+  sE = mm512_swap64_32( sE ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s4, s5 ); \
+  t3 = _mm512_mask_blend_epi32( 0xaaaa, sD, sE ); \
  L8( s0, t1, s9, t3 ); \
-  s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, _mm512_bslli_epi128( t1, 4 ) ); \
-  s5 = _mm512_mask_blend_epi32( 0x5555, s5, _mm512_bsrli_epi128( t1, 4 ) ); \
-  sD = _mm512_mask_blend_epi32( 0xaaaa, sD, _mm512_bslli_epi128( t3, 4 ) ); \
-  sE = _mm512_mask_blend_epi32( 0x5555, sE, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s4 = _mm512_mask_blend_epi32( 0x5555, s4, t1 ); \
+  s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, t1 ); \
+  sD = _mm512_mask_blend_epi32( 0x5555, sD, t3 ); \
+  sE = _mm512_mask_blend_epi32( 0xaaaa, sE, t3 ); \
 \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s5, 4 ), \
-                                        _mm512_bslli_epi128( s6, 4 ) ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sE, 4 ), \
-                                        _mm512_bslli_epi128( sF, 4 ) ); \
+  s6 = mm512_swap64_32( s6 ); \
+  sF = mm512_swap64_32( sF ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s5, s6 ); \
+  t3 = _mm512_mask_blend_epi32( 0xaaaa, sE, sF ); \
  L8( s1, t1, sA, t3 ); \
-  s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, _mm512_bslli_epi128( t1, 4 ) ); \
-  s6 = _mm512_mask_blend_epi32( 0x5555, s6, _mm512_bsrli_epi128( t1, 4 ) ); \
-  sE = _mm512_mask_blend_epi32( 0xaaaa, sE, _mm512_bslli_epi128( t3, 4 ) ); \
-  sF = _mm512_mask_blend_epi32( 0x5555, sF, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s5 = _mm512_mask_blend_epi32( 0x5555, s5, t1 ); \
+  s6 = _mm512_mask_blend_epi32( 0xaaaa, s6, t1 ); \
+  sE = _mm512_mask_blend_epi32( 0x5555, sE, t3 ); \
+  sF = _mm512_mask_blend_epi32( 0xaaaa, sF, t3 ); \
 \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s6, 4 ), \
-                                        _mm512_bslli_epi128( s7, 4 ) ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sF, 4 ), \
-                                        _mm512_bslli_epi128( sC, 4 ) ); \
+  s7 = mm512_swap64_32( s7 ); \
+  sC = mm512_swap64_32( sC ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s6, s7 ); \
+  t3 = _mm512_mask_blend_epi32( 0xaaaa, sF, sC ); \
  L8( s2, t1, sB, t3 ); \
-  s6 = _mm512_mask_blend_epi32( 0xaaaa, s6, _mm512_bslli_epi128( t1, 4 ) ); \
-  s7 = _mm512_mask_blend_epi32( 0x5555, s7, _mm512_bsrli_epi128( t1, 4 ) ); \
-  sF = _mm512_mask_blend_epi32( 0xaaaa, sF, _mm512_bslli_epi128( t3, 4 ) ); \
-  sC = _mm512_mask_blend_epi32( 0x5555, sC, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s6 = _mm512_mask_blend_epi32( 0x5555, s6, t1 ); \
+  s7 = _mm512_mask_blend_epi32( 0xaaaa, s7, t1 ); \
+  sF = _mm512_mask_blend_epi32( 0x5555, sF, t3 ); \
+  sC = _mm512_mask_blend_epi32( 0xaaaa, sC, t3 ); \
+  s6 = mm512_swap64_32( s6 ); \
+  sF = mm512_swap64_32( sF ); \
 \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s7, 4 ), \
-                                        _mm512_bslli_epi128( s4, 4 ) ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sC, 4 ), \
-                                        _mm512_bslli_epi128( sD, 4 ) ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s7, s4 ); \
+  t3 = _mm512_mask_blend_epi32( 0xaaaa, sC, sD ); \
  L8( s3, t1, s8, t3 ); \
-  s7 = _mm512_mask_blend_epi32( 0xaaaa, s7, _mm512_bslli_epi128( t1, 4 ) ); \
-  s4 = _mm512_mask_blend_epi32( 0x5555, s4, _mm512_bsrli_epi128( t1, 4 ) ); \
-  sC = _mm512_mask_blend_epi32( 0xaaaa, sC, _mm512_bslli_epi128( t3, 4 ) ); \
-  sD = _mm512_mask_blend_epi32( 0x5555, sD, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s7 = _mm512_mask_blend_epi32( 0x5555, s7, t1 ); \
+  s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, t1 ); \
+  sC = _mm512_mask_blend_epi32( 0x5555, sC, t3 ); \
+  sD = _mm512_mask_blend_epi32( 0xaaaa, sD, t3 ); \
+  s7 = mm512_swap64_32( s7 ); \
+  sC = mm512_swap64_32( sC ); \
 \
-  t0 = _mm512_mask_blend_epi32( 0xaaaa, s0, _mm512_bslli_epi128( s8, 4 ) ); \
+  t0 = _mm512_mask_blend_epi32( 0xaaaa, s0, mm512_swap64_32( s8 ) ); \
  t1 = _mm512_mask_blend_epi32( 0xaaaa, s1, s9 ); \
-  t2 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s2, 4 ), sA ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s3, 4 ), \
-                                        _mm512_bslli_epi128( sB, 4 ) ); \
+  t2 = _mm512_mask_blend_epi32( 0xaaaa, mm512_swap64_32( s2 ), sA ); \
+  t3 = _mm512_mask_blend_epi32( 0x5555, s3, sB ); \
+  t3 = mm512_swap64_32( t3 ); \
  L8( t0, t1, t2, t3 ); \
+  t3 = mm512_swap64_32( t3 ); \
  s0 = _mm512_mask_blend_epi32( 0x5555, s0, t0 ); \
-  s8 = _mm512_mask_blend_epi32( 0x5555, s8, _mm512_bsrli_epi128( t0, 4 ) ); \
+  s8 = _mm512_mask_blend_epi32( 0x5555, s8, mm512_swap64_32( t0 ) ); \
  s1 = _mm512_mask_blend_epi32( 0x5555, s1, t1 ); \
  s9 = _mm512_mask_blend_epi32( 0xaaaa, s9, t1 ); \
-  s2 = _mm512_mask_blend_epi32( 0xaaaa, s2, _mm512_bslli_epi128( t2, 4 ) ); \
+  s2 = _mm512_mask_blend_epi32( 0xaaaa, s2, mm512_swap64_32( t2 ) ); \
  sA = _mm512_mask_blend_epi32( 0xaaaa, sA, t2 ); \
-  s3 = _mm512_mask_blend_epi32( 0xaaaa, s3, _mm512_bslli_epi128( t3, 4 ) ); \
-  sB = _mm512_mask_blend_epi32( 0x5555, sB, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s3 = _mm512_mask_blend_epi32( 0xaaaa, s3, t3 ); \
+  sB = _mm512_mask_blend_epi32( 0x5555, sB, t3 ); \
 \
-  t0 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s4, 4 ), sC ); \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s5, 4 ), \
-                                        _mm512_bslli_epi128( sD, 4 ) ); \
-  t2 = _mm512_mask_blend_epi32( 0xaaaa, s6, _mm512_bslli_epi128( sE, 4 ) ); \
+  t0 = _mm512_mask_blend_epi32( 0xaaaa, s4, sC ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s5, sD ); \
+  t2 = _mm512_mask_blend_epi32( 0xaaaa, s6, sE ); \
  t3 = _mm512_mask_blend_epi32( 0xaaaa, s7, sF ); \
  L8( t0, t1, t2, t3 ); \
-  s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, _mm512_bslli_epi128( t0, 4 ) ); \
+  s4 = _mm512_mask_blend_epi32( 0x5555, s4, t0 ); \
  sC = _mm512_mask_blend_epi32( 0xaaaa, sC, t0 ); \
-  s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, _mm512_bslli_epi128( t1, 4 ) ); \
-  sD = _mm512_mask_blend_epi32( 0x5555, sD, _mm512_bsrli_epi128( t1, 4 ) ); \
+  s5 = _mm512_mask_blend_epi32( 0x5555, s5, t1 ); \
+  sD = _mm512_mask_blend_epi32( 0xaaaa, sD, t1 ); \
  s6 = _mm512_mask_blend_epi32( 0x5555, s6, t2 ); \
-  sE = _mm512_mask_blend_epi32( 0x5555, sE, _mm512_bsrli_epi128( t2, 4 ) ); \
+  sE = _mm512_mask_blend_epi32( 0xaaaa, sE, t2 ); \
  s7 = _mm512_mask_blend_epi32( 0x5555, s7, t3 ); \
  sF = _mm512_mask_blend_epi32( 0xaaaa, sF, t3 ); \
+  s4 = mm512_swap64_32( s4 ); \
+  s5 = mm512_swap64_32( s5 ); \
+  sD = mm512_swap64_32( sD ); \
+  sE = mm512_swap64_32( sE ); \
 } while (0)

 #define P_BIG8 \
 do { \
   __m512i alpha[16]; \
+   const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
   for( int i = 0; i < 16; i++ ) \
      alpha[i] = m512_const1_64( ( (uint64_t*)alpha_n )[i] ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)1 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (1ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)2 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (2ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)3 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (3ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)4 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (4ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)5 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (5ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
 } while (0)

 #define PF_BIG8 \
 do { \
   __m512i alpha[16]; \
+   const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
   for( int i = 0; i < 16; i++ ) \
      alpha[i] = m512_const1_64( ( (uint64_t*)alpha_f )[i] ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)1 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 1ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)2 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 2ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)3 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 3ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)4 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 4ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)5 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 5ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)6 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 6ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)7 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 7ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)8 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 8ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)9 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 9ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)10 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( (10ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)11 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( (11ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
 } while (0)

 #define T_BIG8 \
 do { /* order is important */ \
-   c7 = sc->h[ 0x7 ] = _mm512_xor_si512( sc->h[ 0x7 ], sB ); \
-   c6 = sc->h[ 0x6 ] = _mm512_xor_si512( sc->h[ 0x6 ], sA ); \
-   c5 = sc->h[ 0x5 ] = _mm512_xor_si512( sc->h[ 0x5 ], s9 ); \
-   c4 = sc->h[ 0x4 ] = _mm512_xor_si512( sc->h[ 0x4 ], s8 ); \
-   c3 = sc->h[ 0x3 ] = _mm512_xor_si512( sc->h[ 0x3 ], s3 ); \
-   c2 = sc->h[ 0x2 ] = _mm512_xor_si512( sc->h[ 0x2 ], s2 ); \
-   c1 = sc->h[ 0x1 ] = _mm512_xor_si512( sc->h[ 0x1 ], s1 ); \
-   c0 = sc->h[ 0x0 ] = _mm512_xor_si512( sc->h[ 0x0 ], s0 ); \
+   c7 = sc->h[ 7 ] = _mm512_xor_si512( sc->h[ 7 ], sB ); /* c5 */ \
+   c6 = sc->h[ 6 ] = _mm512_xor_si512( sc->h[ 6 ], sA ); /* m5 */ \
+   c5 = sc->h[ 5 ] = _mm512_xor_si512( sc->h[ 5 ], s9 ); /* c4 */ \
+   c4 = sc->h[ 4 ] = _mm512_xor_si512( sc->h[ 4 ], s8 ); /* m4 */ \
+   c3 = sc->h[ 3 ] = _mm512_xor_si512( sc->h[ 3 ], s3 ); /* c1 */ \
+   c2 = sc->h[ 2 ] = _mm512_xor_si512( sc->h[ 2 ], s2 ); /* m1 */ \
+   c1 = sc->h[ 1 ] = _mm512_xor_si512( sc->h[ 1 ], s1 ); /* c0 */ \
+   c0 = sc->h[ 0 ] = _mm512_xor_si512( sc->h[ 0 ], s0 ); /* m0 */ \
 } while (0)

 void hamsi_8way_big( hamsi_8way_big_context *sc, __m512i *buf, size_t num )
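Review note on P_BIG8 / PF_BIG8: the rewrite reads the first alpha word once into A0 and then XORs the round index into its upper 32 bits before each round. A scalar sketch of that per-round value follows; the helper name `round_alpha0` is hypothetical and not part of the patch.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: word 0 of alpha for
   round r, i.e. the round index XORed into the upper 32 bits of A0.
   Round 0 leaves A0 unchanged, matching the initial broadcast loop. */
static inline uint64_t round_alpha0( uint64_t a0, unsigned r )
{
   return ( (uint64_t)r << 32 ) ^ a0;
}
```

In the macros this value is then broadcast to every 64-bit lane with m512_const1_64 (or m256_const1_64 in the 4-way version).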
@@ -838,7 +833,6 @@ void hamsi_8way_big_final( hamsi_8way_big_context *sc, __m512i *buf )
   WRITE_STATE_BIG8( sc );
 }

-
 void hamsi512_8way_init( hamsi_8way_big_context *sc )
 {
   sc->partial_len = 0;
@@ -888,13 +882,12 @@ void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )
 #define INPUT_BIG \
 do { \
  __m256i db = *buf; \
-  const uint64_t *tp = (uint64_t*)&T512[0][0];  \
-  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = m256_zero; \
-  for ( int u = 0; u < 64; u++ ) \
+  const __m256i zero = m256_zero; \
+  const uint64_t *tp = (const uint64_t*)T512;  \
+  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = zero; \
+  for ( int u = 63; u >= 0; u-- ) \
  { \
-     __m256i dm = _mm256_and_si256( db, m256_one_64 ) ; \
-     dm = mm256_negate_32( _mm256_or_si256( dm, \
-                         _mm256_slli_epi64( dm, 32 ) ) ); \
+     __m256i dm = _mm256_cmpgt_epi64( zero, _mm256_slli_epi64( db, u ) ); \
     m0 = _mm256_xor_si256( m0, _mm256_and_si256( dm, \
                                          m256_const1_64( tp[0] ) ) ); \
     m1 = _mm256_xor_si256( m1, _mm256_and_si256( dm, \
@@ -912,7 +905,6 @@ do { \
     m7 = _mm256_xor_si256( m7, _mm256_and_si256( dm, \
                                          m256_const1_64( tp[7] ) ) ); \
     tp += 8; \
-     db = _mm256_srli_epi64( db, 1 ); \
  } \
 } while (0)

@@ -961,47 +953,28 @@ do { \

 #define READ_STATE_BIG(sc) \
 do { \
-   c0 = sc->h[0x0]; \
-   c1 = sc->h[0x1]; \
-   c2 = sc->h[0x2]; \
-   c3 = sc->h[0x3]; \
-   c4 = sc->h[0x4]; \
-   c5 = sc->h[0x5]; \
-   c6 = sc->h[0x6]; \
-   c7 = sc->h[0x7]; \
+   c0 = sc->h[0]; \
+   c1 = sc->h[1]; \
+   c2 = sc->h[2]; \
+   c3 = sc->h[3]; \
+   c4 = sc->h[4]; \
+   c5 = sc->h[5]; \
+   c6 = sc->h[6]; \
+   c7 = sc->h[7]; \
 } while (0)

 #define WRITE_STATE_BIG(sc) \
 do { \
-   sc->h[0x0] = c0; \
-   sc->h[0x1] = c1; \
-   sc->h[0x2] = c2; \
-   sc->h[0x3] = c3; \
-   sc->h[0x4] = c4; \
-   sc->h[0x5] = c5; \
-   sc->h[0x6] = c6; \
-   sc->h[0x7] = c7; \
+   sc->h[0] = c0; \
+   sc->h[1] = c1; \
+   sc->h[2] = c2; \
+   sc->h[3] = c3; \
+   sc->h[4] = c4; \
+   sc->h[5] = c5; \
+   sc->h[6] = c6; \
+   sc->h[7] = c7; \
 } while (0)

-/*
-#define s0   m0
-#define s1   c0
-#define s2   m1
-#define s3   c1
-#define s4   c2
-#define s5   m2
-#define s6   c3
-#define s7   m3
-#define s8   m4
-#define s9   c4
-#define sA   m5
-#define sB   c5
-#define sC   c6
-#define sD   m6
-#define sE   c7
-#define sF   m7
-*/
-
 #define ROUND_BIG( alpha ) \
 do { \
   __m256i t0, t1, t2, t3; \
@@ -1027,151 +1000,145 @@ do { \
  SBOX( s2, s6, sA, sE ); \
  SBOX( s3, s7, sB, sF ); \
 \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s4, 4 ), \
-                           _mm256_bslli_epi128( s5, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( sD, 4 ), \
-                           _mm256_bslli_epi128( sE, 4 ), 0xAA ); \
+  s4 = mm256_swap64_32( s4 ); \
+  s5 = mm256_swap64_32( s5 ); \
+  sD = mm256_swap64_32( sD ); \
+  sE = mm256_swap64_32( sE ); \
+  t1 = _mm256_blend_epi32( s4, s5, 0xaa ); \
+  t3 = _mm256_blend_epi32( sD, sE, 0xaa ); \
  L( s0, t1, s9, t3 ); \
-  s4 = _mm256_blend_epi32( s4, _mm256_bslli_epi128( t1, 4 ), 0xAA );\
-  s5 = _mm256_blend_epi32( s5, _mm256_bsrli_epi128( t1, 4 ), 0x55 );\
-  sD = _mm256_blend_epi32( sD, _mm256_bslli_epi128( t3, 4 ), 0xAA );\
-  sE = _mm256_blend_epi32( sE, _mm256_bsrli_epi128( t3, 4 ), 0x55 );\
+  s4 = _mm256_blend_epi32( s4, t1, 0x55 ); \
+  s5 = _mm256_blend_epi32( s5, t1, 0xaa ); \
+  sD = _mm256_blend_epi32( sD, t3, 0x55 ); \
+  sE = _mm256_blend_epi32( sE, t3, 0xaa ); \
 \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s5, 4 ), \
-                           _mm256_bslli_epi128( s6, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( sE, 4 ), \
-                           _mm256_bslli_epi128( sF, 4 ), 0xAA ); \
+  s6 = mm256_swap64_32( s6 ); \
+  sF = mm256_swap64_32( sF ); \
+  t1 = _mm256_blend_epi32( s5, s6, 0xaa ); \
+  t3 = _mm256_blend_epi32( sE, sF, 0xaa ); \
  L( s1, t1, sA, t3 ); \
-  s5 = _mm256_blend_epi32( s5, _mm256_bslli_epi128( t1, 4 ), 0xAA );\
-  s6 = _mm256_blend_epi32( s6, _mm256_bsrli_epi128( t1, 4 ), 0x55 );\
-  sE = _mm256_blend_epi32( sE, _mm256_bslli_epi128( t3, 4 ), 0xAA );\
-  sF = _mm256_blend_epi32( sF, _mm256_bsrli_epi128( t3, 4 ), 0x55 );\
+  s5 = _mm256_blend_epi32( s5, t1, 0x55 ); \
+  s6 = _mm256_blend_epi32( s6, t1, 0xaa ); \
+  sE = _mm256_blend_epi32( sE, t3, 0x55 ); \
+  sF = _mm256_blend_epi32( sF, t3, 0xaa ); \
 \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s6, 4 ), \
-                           _mm256_bslli_epi128( s7, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( sF, 4 ), \
-                           _mm256_bslli_epi128( sC, 4 ), 0xAA ); \
+  s7 = mm256_swap64_32( s7 ); \
+  sC = mm256_swap64_32( sC ); \
+  t1 = _mm256_blend_epi32( s6, s7, 0xaa ); \
+  t3 = _mm256_blend_epi32( sF, sC, 0xaa ); \
  L( s2, t1, sB, t3 ); \
-  s6 = _mm256_blend_epi32( s6, _mm256_bslli_epi128( t1, 4 ), 0xAA );\
-  s7 = _mm256_blend_epi32( s7, _mm256_bsrli_epi128( t1, 4 ), 0x55 );\
-  sF = _mm256_blend_epi32( sF, _mm256_bslli_epi128( t3, 4 ), 0xAA );\
-  sC = _mm256_blend_epi32( sC, _mm256_bsrli_epi128( t3, 4 ), 0x55 );\
+  s6 = _mm256_blend_epi32( s6, t1, 0x55 ); \
+  s7 = _mm256_blend_epi32( s7, t1, 0xaa ); \
+  sF = _mm256_blend_epi32( sF, t3, 0x55 ); \
+  sC = _mm256_blend_epi32( sC, t3, 0xaa ); \
+  s6 = mm256_swap64_32( s6 ); \
+  sF = mm256_swap64_32( sF ); \
 \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s7, 4 ), \
-                           _mm256_bslli_epi128( s4, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( sC, 4 ), \
-                           _mm256_bslli_epi128( sD, 4 ), 0xAA ); \
+  t1 = _mm256_blend_epi32( s7, s4, 0xaa ); \
+  t3 = _mm256_blend_epi32( sC, sD, 0xaa ); \
  L( s3, t1, s8, t3 ); \
-  s7 = _mm256_blend_epi32( s7, _mm256_bslli_epi128( t1, 4 ), 0xAA );\
-  s4 = _mm256_blend_epi32( s4, _mm256_bsrli_epi128( t1, 4 ), 0x55 );\
-  sC = _mm256_blend_epi32( sC, _mm256_bslli_epi128( t3, 4 ), 0xAA );\
-  sD = _mm256_blend_epi32( sD, _mm256_bsrli_epi128( t3, 4 ), 0x55 );\
+  s7 = _mm256_blend_epi32( s7, t1, 0x55 ); \
+  s4 = _mm256_blend_epi32( s4, t1, 0xaa ); \
+  sC = _mm256_blend_epi32( sC, t3, 0x55 ); \
+  sD = _mm256_blend_epi32( sD, t3, 0xaa ); \
+  s7 = mm256_swap64_32( s7 ); \
+  sC = mm256_swap64_32( sC ); \
 \
-  t0 = _mm256_blend_epi32( s0, _mm256_bslli_epi128( s8, 4 ), 0xAA ); \
-  t1 = _mm256_blend_epi32( s1, s9, 0xAA ); \
-  t2 = _mm256_blend_epi32( _mm256_bsrli_epi128( s2, 4 ), sA, 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( s3, 4 ), \
-                           _mm256_bslli_epi128( sB, 4 ), 0xAA ); \
+  t0 = _mm256_blend_epi32( s0, mm256_swap64_32( s8 ), 0xaa ); \
+  t1 = _mm256_blend_epi32( s1, s9, 0xaa ); \
+  t2 = _mm256_blend_epi32( mm256_swap64_32( s2 ), sA, 0xaa ); \
+  t3 = _mm256_blend_epi32( s3, sB, 0x55 ); \
+  t3 = mm256_swap64_32( t3 ); \
  L( t0, t1, t2, t3 ); \
+  t3 = mm256_swap64_32( t3 ); \
  s0 = _mm256_blend_epi32( s0, t0, 0x55 ); \
-  s8 = _mm256_blend_epi32( s8, _mm256_bsrli_epi128( t0, 4 ), 0x55 ); \
+  s8 = _mm256_blend_epi32( s8, mm256_swap64_32( t0 ), 0x55 ); \
  s1 = _mm256_blend_epi32( s1, t1, 0x55 ); \
-  s9 = _mm256_blend_epi32( s9, t1, 0xAA ); \
-  s2 = _mm256_blend_epi32( s2, _mm256_bslli_epi128( t2, 4 ), 0xAA ); \
-  sA = _mm256_blend_epi32( sA, t2, 0xAA ); \
-  s3 = _mm256_blend_epi32( s3, _mm256_bslli_epi128( t3, 4 ), 0xAA ); \
-  sB = _mm256_blend_epi32( sB, _mm256_bsrli_epi128( t3, 4 ), 0x55 ); \
+  s9 = _mm256_blend_epi32( s9, t1, 0xaa ); \
+  s2 = _mm256_blend_epi32( s2, mm256_swap64_32( t2 ), 0xaa ); \
+  sA = _mm256_blend_epi32( sA, t2, 0xaa ); \
+  s3 = _mm256_blend_epi32( s3, t3, 0xaa ); \
+  sB = _mm256_blend_epi32( sB, t3, 0x55 ); \
 \
-  t0 = _mm256_blend_epi32( _mm256_bsrli_epi128( s4, 4 ), sC, 0xAA ); \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s5, 4 ), \
-                           _mm256_bslli_epi128( sD, 4 ), 0xAA ); \
-  t2 = _mm256_blend_epi32( s6, _mm256_bslli_epi128( sE, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( s7, sF, 0xAA ); \
+  t0 = _mm256_blend_epi32( s4, sC, 0xaa ); \
+  t1 = _mm256_blend_epi32( s5, sD, 0xaa ); \
+  t2 = _mm256_blend_epi32( s6, sE, 0xaa ); \
+  t3 = _mm256_blend_epi32( s7, sF, 0xaa ); \
  L( t0, t1, t2, t3 ); \
-  s4 = _mm256_blend_epi32( s4, _mm256_bslli_epi128( t0, 4 ), 0xAA ); \
-  sC = _mm256_blend_epi32( sC, t0, 0xAA ); \
-  s5 = _mm256_blend_epi32( s5, _mm256_bslli_epi128( t1, 4 ), 0xAA ); \
-  sD = _mm256_blend_epi32( sD, _mm256_bsrli_epi128( t1, 4 ), 0x55 ); \
+  s4 = _mm256_blend_epi32( s4, t0, 0x55 ); \
+  sC = _mm256_blend_epi32( sC, t0, 0xaa ); \
+  s5 = _mm256_blend_epi32( s5, t1, 0x55 ); \
+  sD = _mm256_blend_epi32( sD, t1, 0xaa ); \
  s6 = _mm256_blend_epi32( s6, t2, 0x55 ); \
-  sE = _mm256_blend_epi32( sE, _mm256_bsrli_epi128( t2, 4 ), 0x55 ); \
+  sE = _mm256_blend_epi32( sE, t2, 0xaa ); \
  s7 = _mm256_blend_epi32( s7, t3, 0x55 ); \
-  sF = _mm256_blend_epi32( sF, t3, 0xAA ); \
+  sF = _mm256_blend_epi32( sF, t3, 0xaa ); \
+  s4 = mm256_swap64_32( s4 ); \
+  s5 = mm256_swap64_32( s5 ); \
+  sD = mm256_swap64_32( sD ); \
+  sE = mm256_swap64_32( sE ); \
 } while (0)

 #define P_BIG \
 do { \
   __m256i alpha[16]; \
+   const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
   for( int i = 0; i < 16; i++ ) \
      alpha[i] = m256_const1_64( ( (uint64_t*)alpha_n )[i] ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)1 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (1ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)2 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (2ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)3 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (3ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)4 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (4ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)5 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (5ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
 } while (0)

 #define PF_BIG \
 do { \
   __m256i alpha[16]; \
+   const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
   for( int i = 0; i < 16; i++ ) \
      alpha[i] = m256_const1_64( ( (uint64_t*)alpha_f )[i] ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)1 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 1ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)2 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 2ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)3 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 3ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)4 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 4ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)5 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 5ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)6 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 6ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)7 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 7ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)8 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 8ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)9 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 9ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)10 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( (10ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)11 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( (11ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
 } while (0)

 #define T_BIG \
 do { /* order is important */ \
-   c7 = sc->h[ 0x7 ] = _mm256_xor_si256( sc->h[ 0x7 ], sB ); \
-   c6 = sc->h[ 0x6 ] = _mm256_xor_si256( sc->h[ 0x6 ], sA ); \
-   c5 = sc->h[ 0x5 ] = _mm256_xor_si256( sc->h[ 0x5 ], s9 ); \
-   c4 = sc->h[ 0x4 ] = _mm256_xor_si256( sc->h[ 0x4 ], s8 ); \
-   c3 = sc->h[ 0x3 ] = _mm256_xor_si256( sc->h[ 0x3 ], s3 ); \
-   c2 = sc->h[ 0x2 ] = _mm256_xor_si256( sc->h[ 0x2 ], s2 ); \
-   c1 = sc->h[ 0x1 ] = _mm256_xor_si256( sc->h[ 0x1 ], s1 ); \
-   c0 = sc->h[ 0x0 ] = _mm256_xor_si256( sc->h[ 0x0 ], s0 ); \
+   c7 = sc->h[ 7 ] = _mm256_xor_si256( sc->h[ 7 ], sB ); \
+   c6 = sc->h[ 6 ] = _mm256_xor_si256( sc->h[ 6 ], sA ); \
+   c5 = sc->h[ 5 ] = _mm256_xor_si256( sc->h[ 5 ], s9 ); \
+   c4 = sc->h[ 4 ] = _mm256_xor_si256( sc->h[ 4 ], s8 ); \
+   c3 = sc->h[ 3 ] = _mm256_xor_si256( sc->h[ 3 ], s3 ); \
+   c2 = sc->h[ 2 ] = _mm256_xor_si256( sc->h[ 2 ], s2 ); \
+   c1 = sc->h[ 1 ] = _mm256_xor_si256( sc->h[ 1 ], s1 ); \
+   c0 = sc->h[ 0 ] = _mm256_xor_si256( sc->h[ 0 ], s0 ); \
 } while (0)

 void hamsi_big( hamsi_4way_big_context *sc, __m256i *buf, size_t num )
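Review note on INPUT_BIG: the old loop masked the low bit of db and shifted db right by one on every iteration. The new loop leaves db untouched, shifts the wanted bit into the sign position and compares against zero. The scalar model below covers one 64-bit lane and is only a sketch; `mask_old` and `mask_new` are hypothetical names, and the signed cast assumes the usual two's complement conversion.

```c
#include <stdint.h>

/* Old form: all-ones if bit i of db is set, where i is the iteration
   count of the right-shifting loop. */
static inline uint64_t mask_old( uint64_t db, int i )
{
   return (uint64_t)0 - ( ( db >> i ) & 1 );
}

/* New form: shift the wanted bit into the sign position, then do a
   signed "0 > x" compare, as _mm256_cmpgt_epi64( zero, x ) does for
   each 64-bit lane. */
static inline uint64_t mask_new( uint64_t db, int u )
{
   return ( (int64_t)( db << u ) < 0 ) ? ~(uint64_t)0 : 0;
}
```

With u running from 63 down to 0, mask_new( db, u ) tests bit 63 - u, so the bits are still consumed from least to most significant and tp advances in the same order as before.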
--- a/algo/hodl/sha512-avx.h
+++ b/algo/hodl/sha512-avx.h
@@ -45,6 +45,6 @@ void sha512Compute32b_parallel(
        uint64_t *data[SHA512_PARALLEL_N],
        uint64_t *digest[SHA512_PARALLEL_N]);

-void sha512ProcessBlock(Sha512Context *context);
+void sha512ProcessBlock(Sha512Context contexti[2] );

 #endif
--- a/algo/keccak/keccak-hash-4way.c
+++ b/algo/keccak/keccak-hash-4way.c
@@ -53,7 +53,8 @@ static const uint64_t RC[] = {
 #define WRITE_STATE(sc)

 #define MOV64(d, s)      (d = s)
-#define XOR64_IOTA       XOR64
+#define XOR64_IOTA       XOR
+

 #define LPAR   (
 #define RPAR   )
@@ -71,14 +72,15 @@ static const uint64_t RC[] = {
 // Targetted macros, keccak-macros.h is included for each target.

 #define DECL64(x)          __m512i x
-#define XOR64(d, a, b)     (d = _mm512_xor_si512(a,b))
+#define XOR(d, a, b)     (d = _mm512_xor_si512(a,b))
+#define XOR64 XOR
 #define AND64(d, a, b)     (d = _mm512_and_si512(a,b))
 #define OR64(d, a, b)      (d = _mm512_or_si512(a,b))
 #define NOT64(d, s)        (d = _mm512_xor_si512(s,m512_neg1))
 #define ROL64(d, v, n)     (d = mm512_rol_64(v, n))
 #define XOROR(d, a, b, c)  (d = mm512_xoror(a, b, c))
 #define XORAND(d, a, b, c) (d = mm512_xorand(a, b, c))
-
+#define XOR3( d, a, b, c ) (d = mm512_xor3( a, b, c ))

 #include "keccak-macros.c"

@@ -236,6 +238,7 @@ keccak512_8way_close(void *cc, void *dst)
 #undef INPUT_BUF
 #undef DECL64
 #undef XOR64
+#undef XOR
 #undef AND64
 #undef OR64
 #undef NOT64
@@ -243,7 +246,7 @@ keccak512_8way_close(void *cc, void *dst)
 #undef KECCAK_F_1600
 #undef XOROR
 #undef XORAND
-
+#undef XOR3
 #endif  // AVX512

 // AVX2
@@ -255,13 +258,15 @@ keccak512_8way_close(void *cc, void *dst)
 } while (0)

 #define DECL64(x)        __m256i x
-#define XOR64(d, a, b)   (d = _mm256_xor_si256(a,b))
+#define XOR(d, a, b)    (d = _mm256_xor_si256(a,b))
+#define XOR64 XOR
 #define AND64(d, a, b)   (d = _mm256_and_si256(a,b))
 #define OR64(d, a, b)    (d = _mm256_or_si256(a,b))
 #define NOT64(d, s)      (d = _mm256_xor_si256(s,m256_neg1))
 #define ROL64(d, v, n)   (d = mm256_rol_64(v, n))
 #define XOROR(d, a, b, c) (d = _mm256_xor_si256(a, _mm256_or_si256(b, c)))
 #define XORAND(d, a, b, c) (d = _mm256_xor_si256(a, _mm256_and_si256(b, c)))
+#define XOR3( d, a, b, c ) (d = mm256_xor3( a, b, c ))

 #include "keccak-macros.c"

@@ -421,6 +426,7 @@ keccak512_4way_close(void *cc, void *dst)
 #undef INPUT_BUF
 #undef DECL64
 #undef XOR64
+#undef XOR
 #undef AND64
 #undef OR64
 #undef NOT64
@@ -428,5 +434,6 @@ keccak512_4way_close(void *cc, void *dst)
 #undef KECCAK_F_1600
 #undef XOROR
 #undef XORAND
+#undef XOR3

 #endif  // AVX2
--- a/algo/keccak/keccak-macros.c
+++ b/algo/keccak/keccak-macros.c
@@ -1,6 +1,19 @@
 #ifdef TH_ELT
 #undef TH_ELT
 #endif
+
+#define TH_ELT(t, c0, c1, c2, c3, c4, d0, d1, d2, d3, d4)   do { \
+    DECL64(tt0); \
+    DECL64(tt1); \
+    XOR3( tt0, d0, d1, d4 ); \
+    XOR( tt1, d2, d3 ); \
+    XOR( tt0, tt0, tt1 ); \
+    ROL64( tt0, tt0, 1 ); \
+    XOR3( tt1, c0, c1, c4 ); \
+    XOR3( tt0, tt0, c2, c3 ); \
+    XOR( t, tt0, tt1 ); \
+} while (0)
+/*
 #define TH_ELT(t, c0, c1, c2, c3, c4, d0, d1, d2, d3, d4)   do { \
                DECL64(tt0); \
                DECL64(tt1); \
@@ -17,7 +30,7 @@
                XOR64(tt2, tt2, tt3); \
                XOR64(t, tt0, tt2); \
        } while (0)
-
+*/
 #ifdef THETA
 #undef THETA
 #endif
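Review note on TH_ELT: the new macro computes the same theta element using three-input XORs. A scalar sketch of the operation order follows; `rol64`, `xor3` and `th_elt` are hypothetical stand-ins for the ROL64, XOR3 and TH_ELT macros. Whether mm512_xor3 or mm256_xor3 maps to a single instruction depends on their definitions, which are not shown in this diff.

```c
#include <stdint.h>

static inline uint64_t rol64( uint64_t v, unsigned n )
{
   return ( v << n ) | ( v >> ( 64 - n ) );   /* valid for 0 < n < 64 */
}

static inline uint64_t xor3( uint64_t a, uint64_t b, uint64_t c )
{
   return a ^ b ^ c;
}

/* Hypothetical scalar equivalent of the new TH_ELT, same step order:
   t = rol64( d0^d1^d2^d3^d4, 1 ) ^ ( c0^c1^c2^c3^c4 ) */
static inline uint64_t th_elt( uint64_t c0, uint64_t c1, uint64_t c2,
                               uint64_t c3, uint64_t c4,
                               uint64_t d0, uint64_t d1, uint64_t d2,
                               uint64_t d3, uint64_t d4 )
{
   uint64_t tt0 = xor3( d0, d1, d4 );
   uint64_t tt1 = d2 ^ d3;
   tt0 ^= tt1;
   tt0 = rol64( tt0, 1 );
   tt1 = xor3( c0, c1, c4 );
   tt0 = xor3( tt0, c2, c3 );
   return tt0 ^ tt1;
}
```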
--- a/algo/lyra2/allium-4way.c
+++ b/algo/lyra2/allium-4way.c
@@ -69,7 +69,6 @@ void allium_16way_hash( void *state, const void *input )
   intrlv_8x64( vhashB, hash8, hash9, hash10, hash11, hash12, hash13, hash14,
                hash15, 256 );
   
-//   rintrlv_8x32_8x64( vhashA, vhash, 256 );
   keccak256_8way_update( &ctx.keccak, vhashA, 32 );
   keccak256_8way_close( &ctx.keccak, vhashA);
   keccak256_8way_init( &ctx.keccak );
@@ -284,7 +283,7 @@ void allium_8way_hash( void *hash, const void *input )
   blake256_8way_close( &ctx.blake, vhashA );

   dintrlv_8x32( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
-                     vhashA, 256 );
+                 vhashA, 256 );
   intrlv_4x64( vhashA, hash0, hash1, hash2, hash3, 256 );
   intrlv_4x64( vhashB, hash4, hash5, hash6, hash7, 256 );

--- a/algo/lyra2/lyra2-gate.c
+++ b/algo/lyra2/lyra2-gate.c
@@ -188,7 +188,7 @@ bool register_allium_algo( algo_gate_t* gate )
  gate->hash      = (void*)&allium_hash;
 #endif
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT
-	              | VAES_OPT | VAES256_OPT;
+	                   | VAES_OPT;
  opt_target_factor = 256.0;
  return true;
 };
--- a/algo/lyra2/lyra2z-4way.c
+++ b/algo/lyra2/lyra2z-4way.c
@@ -49,7 +49,7 @@ void lyra2z_16way_hash( void *state, const void *input )

    dintrlv_16x32( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
              hash8, hash9, hash10, hash11 ,hash12, hash13, hash14, hash15,
-               vhash, 256 );
+              vhash, 256 );

    intrlv_2x256( vhash, hash0, hash1, 256 );
    LYRA2Z_2WAY( lyra2z_16way_matrix, vhash, 32, vhash, 32, 8, 8, 8 );
--- a/algo/ripemd/lbry-gate.c
+++ b/algo/ripemd/lbry-gate.c
@@ -4,7 +4,7 @@
 #include <string.h>
 #include <stdio.h>

-double lbry_calc_network_diff( struct work *work )
+long double lbry_calc_network_diff( struct work *work )
 {
        // sample for diff 43.281 : 1c05ea29
        // todo: endian reversed on longpoll could be zr5 specific...
@@ -12,7 +12,7 @@ double lbry_calc_network_diff( struct work *work )
   uint32_t nbits = swab32( work->data[ LBRY_NBITS_INDEX ] );
   uint32_t bits = (nbits & 0xffffff);
   int16_t shift = (swab32(nbits) & 0xff); // 0x1c = 28
-   double d = (double)0x0000ffff / (double)bits;
+   long double d = (long double)0x0000ffff / (long double)bits;

   for (int m=shift; m < 29; m++) d *= 256.0;
   for (int m=29; m < shift; m++) d /= 256.0;
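Review note on lbry_calc_network_diff: the lines visible in this hunk can be exercised on their own with the sketch below. `diff_from_nbits` is a hypothetical standalone function; it assumes nbits is already in host order, with the exponent in the top byte and the mantissa in the low 24 bits, and it does not guard against a zero mantissa.

```c
#include <stdint.h>

/* Hypothetical standalone version of the visible arithmetic. */
static long double diff_from_nbits( uint32_t nbits )
{
   uint32_t bits  = nbits & 0xffffff;          /* 24-bit mantissa      */
   int      shift = ( nbits >> 24 ) & 0xff;    /* exponent, 0x1c = 28  */
   long double d  = (long double)0x0000ffff / (long double)bits;

   for ( int m = shift; m < 29; m++ ) d *= 256.0L;
   for ( int m = 29; m < shift; m++ ) d /= 256.0L;
   return d;
}
```

For the sample in the source comment, diff_from_nbits( 0x1c05ea29 ) comes out at roughly 43.281.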
--- a/algo/ripemd/sph_ripemd.c
+++ b/algo/ripemd/sph_ripemd.c
@@ -35,6 +35,7 @@

 #include "sph_ripemd.h"

+#if 0
 /*
 * Round functions for RIPEMD (original).
 */
@@ -46,6 +47,7 @@ static const sph_u32 oIV[5] = {
 	SPH_C32(0x67452301), SPH_C32(0xEFCDAB89),
 	SPH_C32(0x98BADCFE), SPH_C32(0x10325476)
 };
+#endif

 /*
 * Round functions for RIPEMD-128 and RIPEMD-160.
@@ -63,6 +65,8 @@ static const sph_u32 IV[5] = {

 #define ROTL    SPH_ROTL32

+#if 0
+
 /* ===================================================================== */
 /*
 * RIPEMD (original hash, deprecated).
@@ -479,7 +483,7 @@ sph_ripemd_comp(const sph_u32 msg[16], sph_u32 val[4])
 * One round of RIPEMD-128. The data must be aligned for 32-bit access.
 */
 static void
-ripemd128_round(const unsigned char *data, sph_u32 r[5])
+ripemd128_round(const unsigned char *data, sph_u32 r[4])
 {
 #if SPH_LITTLE_FAST

@@ -539,6 +543,8 @@ sph_ripemd128_comp(const sph_u32 msg[16], sph_u32 val[4])
 #undef RIPEMD128_IN
 }

+#endif
+
 /* ===================================================================== */
 /*
 * RIPEMD-160.
--- a/algo/ripemd/sph_ripemd.h
+++ b/algo/ripemd/sph_ripemd.h
@@ -84,6 +84,7 @@
 * can be cloned by copying the context (e.g. with a simple
 * <code>memcpy()</code>).
 */
+#if 0
 typedef struct {
 #ifndef DOXYGEN_IGNORE
 	unsigned char buf[64];    /* first field, for alignment */
@@ -204,6 +205,8 @@ void sph_ripemd128_close(void *cc, void *dst);
 */
 void sph_ripemd128_comp(const sph_u32 msg[16], sph_u32 val[4]);

+#endif
+
 /* ===================================================================== */

 /**
--- a/algo/scrypt/scrypt-core-4way.c
+++ b/algo/scrypt/scrypt-core-4way.c
--- a/algo/scrypt/scrypt.c
+++ b/algo/scrypt/scrypt.c
@@ -28,13 +28,13 @@
 */

 #include "algo-gate-api.h"
-
 #include <stdlib.h>
 #include <string.h>
 #include <inttypes.h>
 #include "algo/sha/sha-hash-4way.h"
 #include "algo/sha/sha256-hash.h"
 #include <mm_malloc.h>
+#include "malloc-huge.h"

 static const uint32_t keypad[12] = {
 	0x80000000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x00000280
@@ -55,11 +55,25 @@ static const uint32_t sha256_initial_state[8] =
  0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
 };

-static int scrypt_throughput = 0;
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#define SCRYPT_THROUGHPUT 16
+
+#elif defined(__AVX2__)
+
+#define SCRYPT_THROUGHPUT 8
+
+#else
+
+#define SCRYPT_THROUGHPUT 4
+
+#endif
+
+// static int scrypt_throughput = 0;

 static int scratchbuf_size = 0;

-static __thread char *scratchbuf = NULL;
+static __thread uint32_t *scratchbuf = NULL;

 // change this to a constant to be used directly  as input state arg
 // vectors still need an init function.
@@ -146,6 +160,119 @@ static inline void PBKDF2_SHA256_128_32(uint32_t *tstate, uint32_t *ostate,
      output[i] = bswap_32( ostate[i] );
 }

+#if defined(__SHA__)
+
+static inline void HMAC_SHA256_80_init_SHA_2BUF( const uint32_t *key0, 
+                    const uint32_t *key1, uint32_t *tstate0, uint32_t *tstate1,
+                    uint32_t *ostate0, uint32_t *ostate1 )
+{
+   uint32_t ihash0[8], ihash1[8], pad0[16], pad1[16];
+   int i;
+
+   memcpy( pad0, key0 + 16, 16 );
+   memcpy( pad0 + 4, keypad, 48 );
+   memcpy( pad1, key1 + 16, 16 );
+   memcpy( pad1 + 4, keypad, 48 );
+
+   sha256_ni2way_transform_le( tstate0, tstate1, pad0, pad1,
+                               tstate0, tstate1 );
+
+   memcpy( ihash0, tstate0, 32 );
+   memcpy( ihash1, tstate1, 32 );
+
+   for ( i = 0; i < 8; i++ )
+   {
+      pad0[i] = ihash0[i] ^ 0x5c5c5c5c;
+      pad1[i] = ihash1[i] ^ 0x5c5c5c5c;
+   }
+   for ( ; i < 16; i++ ) pad0[i] = pad1[i] = 0x5c5c5c5c;
+
+   sha256_ni2way_transform_le( ostate0, ostate1, pad0, pad1,
+                               sha256_initial_state, sha256_initial_state );
+
+   for ( i = 0; i < 8; i++ )
+   {
+      pad0[i] = ihash0[i] ^ 0x36363636;
+      pad1[i] = ihash1[i] ^ 0x36363636;
+   }
+   for ( ; i < 16; i++ ) pad0[i] = pad1[i] = 0x36363636;
+
+   sha256_ni2way_transform_le( tstate0, tstate1, pad0, pad1, 
+                               sha256_initial_state, sha256_initial_state );
+}
+
+static inline void PBKDF2_SHA256_80_128_SHA_2BUF( const uint32_t *tstate0,
+            const uint32_t *tstate1, uint32_t *ostate0, uint32_t *ostate1,
+            const uint32_t *salt0, const uint32_t *salt1, uint32_t *output0,
+            uint32_t *output1 )
+{
+   uint32_t istate0[8], istate1[8], ostateb0[8], ostateb1[8];
+   uint32_t ibuf0[16], obuf0[16], ibuf1[16], obuf1[16];
+   int i, j;
+
+   sha256_ni2way_transform_le( istate0, istate1, salt0, salt1,
+                               tstate0, tstate1 );
+
+   memcpy( ibuf0, salt0 + 16, 16 );
+   memcpy( ibuf0 + 5, innerpad, 44 );
+   memcpy( obuf0 + 8, outerpad, 32 );
+   memcpy( ibuf1, salt1 + 16, 16 );
+   memcpy( ibuf1 + 5, innerpad, 44 );
+   memcpy( obuf1 + 8, outerpad, 32 );
+
+   for ( i = 0; i < 4; i++ )
+   {
+      memcpy( obuf0, istate0, 32 );
+      memcpy( obuf1, istate1, 32 );
+      ibuf0[4] = ibuf1[4] = i + 1;
+
+      sha256_ni2way_transform_le( obuf0, obuf1, ibuf0, ibuf1,
+                                  obuf0, obuf1 );
+      sha256_ni2way_transform_le( ostateb0, ostateb1, obuf0, obuf1,
+                                  ostate0, ostate1 );
+      
+      for ( j = 0; j < 8; j++ )
+      {
+         output0[ 8*i + j ] = bswap_32( ostateb0[j] );
+         output1[ 8*i + j ] = bswap_32( ostateb1[j] );
+      }
+   }
+}
+
+static inline void PBKDF2_SHA256_128_32_SHA_2BUF( uint32_t *tstate0,
+                    uint32_t *tstate1, uint32_t *ostate0, uint32_t *ostate1,
+                    const uint32_t *salt0, const uint32_t *salt1,
+                    uint32_t *output0, uint32_t *output1 )
+{
+   uint32_t buf0[16], buf1[16];
+   int i;
+
+   sha256_ni2way_transform_be( tstate0, tstate1, salt0, salt1,
+                               tstate0, tstate1 );   
+   sha256_ni2way_transform_be( tstate0, tstate1, salt0+16, salt1+16,
+                               tstate0, tstate1 );
+   sha256_ni2way_transform_le( tstate0, tstate1, finalblk, finalblk,
+                               tstate0, tstate1 );
+
+   memcpy( buf0, tstate0, 32 );
+   memcpy( buf0 + 8, outerpad, 32 );
+   memcpy( buf1, tstate1, 32 );
+   memcpy( buf1 + 8, outerpad, 32 );
+
+   sha256_ni2way_transform_le( ostate0, ostate1, buf0, buf1,
+                               ostate0, ostate1 );
+
+   for ( i = 0; i < 8; i++ )
+   {
+      output0[i] = bswap_32( ostate0[i] );
+      output1[i] = bswap_32( ostate1[i] );
+   }
+}
+
+
+
+#endif
+
 #ifdef HAVE_SHA256_4WAY

 static const uint32_t keypad_4way[4 * 12] = {
@@ -596,15 +723,11 @@ static inline void PBKDF2_SHA256_128_32_16way( uint32_t *tstate,

 #endif // AVX512

-//#if defined(USE_ASM) && defined(__x86_64__)
-
 #define SCRYPT_MAX_WAYS 12
 #define HAVE_SCRYPT_3WAY 1
-//int scrypt_best_throughput();
 void scrypt_core(uint32_t *X, uint32_t *V, int N);
 void scrypt_core_3way(uint32_t *X, uint32_t *V, int N);

-//#if defined(USE_AVX2)
 #if defined(__AVX2__)
 #undef SCRYPT_MAX_WAYS
 #define SCRYPT_MAX_WAYS 24
@@ -614,40 +737,39 @@ void scrypt_core_6way(uint32_t *X, uint32_t *V, int N);

 #ifndef SCRYPT_MAX_WAYS
 #define SCRYPT_MAX_WAYS 1
-//#define scrypt_best_throughput() 1
 #endif

 #include "scrypt-core-4way.h"

-static bool scrypt_N_1_1_256(const uint32_t *input, uint32_t *output,
-	uint32_t *midstate, unsigned char *scratchpad, int N, int thr_id )
+/*
+static bool scrypt_N_1_1_256( const uint32_t *input, uint32_t *output,
+                              uint32_t *midstate, int N, int thr_id )
 {
 	uint32_t tstate[8], ostate[8];
 	uint32_t X[32];
-	uint32_t *V = (uint32_t*)scratchpad;
 	
 	memcpy(tstate, midstate, 32);
 	HMAC_SHA256_80_init(input, tstate, ostate);
 	PBKDF2_SHA256_80_128(tstate, ostate, input, X);

-   scrypt_core_simd128( X, V, N );  // woring
+   scrypt_core_simd128( X, scratchbuf, N );  // working
 //   scrypt_core_1way( X, V, N );  // working
 //   scrypt_core(X, V, N);

 	PBKDF2_SHA256_128_32(tstate, ostate, X, output);
   return true;
 }
+*/

-#if defined(__AVX2__)
+#if ( SCRYPT_THROUGHPUT == 8 )

 static int scrypt_N_1_1_256_8way( const uint32_t *input, uint32_t *output,
-           uint32_t *midstate, unsigned char *scratchpad, int N, int thrid )
+                                  uint32_t *midstate, int N, int thrid )
 {
-   uint32_t _ALIGN(128) tstate[8 * 8];
-   uint32_t _ALIGN(128) ostate[8 * 8];
-   uint32_t _ALIGN(128) W[8 * 32];
-   uint32_t _ALIGN(128) X[8 * 32];
-   uint32_t *V = (uint32_t*)scratchpad;
+   uint32_t _ALIGN(128) tstate[ 8*8 ];
+   uint32_t _ALIGN(128) ostate[ 8*8 ];
+   uint32_t _ALIGN(128) W[ 8*32 ];
+   uint32_t _ALIGN(128) X[ 8*32 ];

   intrlv_8x32( W, input,    input+ 20, input+ 40, input+ 60,
                   input+80, input+100, input+120, input+140, 640 );
@@ -658,53 +780,45 @@ static int scrypt_N_1_1_256_8way( const uint32_t *input, uint32_t *output,
   PBKDF2_SHA256_80_128_8way( tstate, ostate, W, W );

   dintrlv_8x32( X, X+32, X+64, X+96, X+128, X+160, X+192, X+224, W, 1024 );
+
+   if ( opt_param_n > 0x4000 )
+   {
+      scrypt_core_simd128_3buf( X,     scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_simd128_3buf( X+ 96, scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_simd128_2buf( X+192, scratchbuf, N );
+   }
+   else
+   {
+      intrlv_2x128( W,     X,     X+ 32, 1024 );
+      intrlv_2x128( W+ 64, X+ 64, X+ 96, 1024 );
+      intrlv_2x128( W+128, X+128, X+160, 1024 );
+      intrlv_2x128( W+192, X+192, X+224, 1024 );
+      scrypt_core_2way_simd128( (__m256i*) W,      (__m256i*)scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_2way_simd128( (__m256i*)(W+ 64), (__m256i*)scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_2way_simd128( (__m256i*)(W+128), (__m256i*)scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_2way_simd128( (__m256i*)(W+192), (__m256i*)scratchbuf, N );
+      dintrlv_2x128( X,     X+ 32, W,     1024 );
+      dintrlv_2x128( X+ 64, X+ 96, W+ 64, 1024 );
+      dintrlv_2x128( X+128, X+160, W+128, 1024 );
+      dintrlv_2x128( X+192, X+224, W+192, 1024 );
+   }

+      

   // SCRYPT CORE

-
-   // AVX512
-
-/*
-   // AVX512 16 way working
-   intrlv_16x32( W, X, X+32, X+64, X+96, X+128, X+160, X+192, X+224,
-                    X+256, X+256+32, X+256+64, X+256+96, X+256+128,
-                    X+256+160, X+256+192, X+256+224, 1024 );
-
-   scrypt_core_16way( (__m512i*)W , (__m512i*)V, N );
-
-   dintrlv_16x32( X, X+32, X+64, X+96, X+128, X+160, X+192, X+224,
-                  X+256, X+256+32, X+256+64, X+256+96, X+256+128, 
-                  X+256+160, X+256+192, X+256+224, W, 1024 );
-*/
-/*
-   // AVX512 working
-   intrlv_4x32( W,     X,     X+ 32, X+ 64, X+ 96, 1024 );
-   intrlv_4x32( W+128, X+128, X+160, X+192, X+224, 1024 );
-   scrypt_core_simd128_4way( (__m128i*)W, (__m128i*)V, N ); 
-   if ( work_restart[thrid].restart ) return 0;
-   scrypt_core_simd128_4way( (__m128i*)(W+128), (__m128i*)V, N );
-   dintrlv_4x32( X,     X+ 32, X+ 64, X+ 96, W,     1024 );
-   dintrlv_4x32( X+128, X+160, X+192, X+224, W+128, 1024 );
-*/   
-/*
-   // AVX512, not working, very slow
-   intrlv_4x128( W,     X,     X+ 32, X+ 64, X+ 96, 1024 );
-   intrlv_4x128( W+128, X+128, X+160, X+192, X+224, 1024 );
-   scrypt_core_4way_simd128( (__m512i*)W,      (__m512i*)V, N );
-   if ( work_restart[thrid].restart ) return 0;
-   scrypt_core_4way_simd128( (__m512i*)(W+128), (__m512i*)V, N );
-   dintrlv_4x128( X,     X+ 32, X+ 64, X+ 96, W,     1024 );
-   dintrlv_4x128( X+128, X+160, X+192, X+224, W+128, 1024 );
-*/
-
  // AVX2

-/*
+
   // AVX2   
   // disable de/interleave for testing.
-   scrypt_core_8way( (__m256i*)W , (__m256i*)V, N );
-*/
+//   scrypt_core_8way( (__m256i*)W , (__m256i*)V, N );
+

 /*
   // AVX2 working
@@ -714,23 +828,18 @@ static int scrypt_N_1_1_256_8way( const uint32_t *input, uint32_t *output,
   intrlv_2x128( W+192, X+192, X+224, 1024 );

   // working
-//   scrypt_core_2way_simd128_3buf( (__m256i*) W,      (__m256i*)V, N );
+//   scrypt_core_2way_simd128_2buf( (__m256i*) W,      (__m256i*)V, N );
 //   if ( work_restart[thrid].restart ) return 0;
-//   scrypt_core_2way_simd128( (__m256i*)(W+192),      (__m256i*)V, N );
+//   scrypt_core_2way_simd128_2buf( (__m256i*)(W+128), (__m256i*)V, N );

   // working
-   scrypt_core_2way_simd128_2buf( (__m256i*) W,      (__m256i*)V, N );
+   scrypt_core_2way_simd128( (__m256i*) W,      (__m256i*)V, N );
   if ( work_restart[thrid].restart ) return 0;
-   scrypt_core_2way_simd128_2buf( (__m256i*)(W+128), (__m256i*)V, N );
-
-   // working
-//   scrypt_core_2way_simd128( (__m256i*) W,      (__m256i*)V, N );
-//   if ( work_restart[thrid].restart ) return 0;
-//   scrypt_core_2way_simd128( (__m256i*)(W+ 64), (__m256i*)V, N );
-//   if ( work_restart[thrid].restart ) return 0;
-//   scrypt_core_2way_simd128( (__m256i*)(W+128), (__m256i*)V, N );
-//   if ( work_restart[thrid].restart ) return 0;
-//   scrypt_core_2way_simd128( (__m256i*)(W+192), (__m256i*)V, N );
+   scrypt_core_2way_simd128( (__m256i*)(W+ 64), (__m256i*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_2way_simd128( (__m256i*)(W+128), (__m256i*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_2way_simd128( (__m256i*)(W+192), (__m256i*)V, N );

   dintrlv_2x128( X,     X+ 32, W,     1024 );
   dintrlv_2x128( X+ 64, X+ 96, W+ 64, 1024 );
@@ -745,18 +854,10 @@ static int scrypt_N_1_1_256_8way( const uint32_t *input, uint32_t *output,
   intrlv_2x32( W+128, X+128, X+160, 1024 );
   intrlv_2x32( W+192, X+192, X+224, 1024 );

-   // working, deprecated, not up to data
-//   scrypt_core_simd128_2way_4buf( (uint64_t*)W,  (uint64_t*)V, N );
-
-     // deprecated, not up to date
-//   scrypt_core_simd128_2way_3buf( (uint64_t*)  W,       (uint64_t*)V, N );
-//   if ( work_restart[thrid].restart ) return 0;
-//   scrypt_core_simd128_2way(      (uint64_t*)( W+192 ), (uint64_t*)V, N );
-
   // working
-//   scrypt_core_simd128_2way_2buf( (uint64_t*)  W,       (uint64_t*)V, N );
-//   if ( work_restart[thrid].restart ) return 0;
-//   scrypt_core_simd128_2way_2buf( (uint64_t*)( W+128 ), (uint64_t*)V, N );
+   scrypt_core_simd128_2way_2buf( (uint64_t*)  W,       (uint64_t*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_simd128_2way_2buf( (uint64_t*)( W+128 ), (uint64_t*)V, N );

 //   scrypt_core_simd128_2way( (uint64_t*)  W,       (uint64_t*)V, N );
 //   if ( work_restart[thrid].restart ) return 0;
@@ -813,19 +914,13 @@ static int scrypt_N_1_1_256_8way( const uint32_t *input, uint32_t *output,
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128_2buf( X+192, V, N );
 */
-
+/**************
   scrypt_core_simd128_3buf( X,     V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128_3buf( X+ 96, V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128_2buf( X+192, V, N );
-
-/*
-   // SSE2 working
-   scrypt_core_simd128_4buf( X,     V, N );
-   if ( work_restart[thrid].restart ) return 0;
-   scrypt_core_simd128_4buf( X+128, V, N );
-*/
+*************/


   if ( work_restart[thrid].restart ) return 0;
@@ -842,16 +937,15 @@ static int scrypt_N_1_1_256_8way( const uint32_t *input, uint32_t *output,

 #endif  // AVX2

-#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#if ( SCRYPT_THROUGHPUT == 16 )

 static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,
-           uint32_t *midstate, unsigned char *scratchpad, int N, int thrid )
+                                   uint32_t *midstate, int N, int thrid )
 {
   uint32_t _ALIGN(128) tstate[ 16*8 ];
   uint32_t _ALIGN(128) ostate[ 16*8 ];
   uint32_t _ALIGN(128) W[ 16*32 ]; 
   uint32_t _ALIGN(128) X[ 16*32 ];
-   uint32_t *V = (uint32_t*)scratchpad;

   intrlv_16x32( W, input,     input+ 20, input+ 40, input+ 60,
                    input+ 80, input+100, input+120, input+140,
@@ -868,6 +962,39 @@ static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,
                  W, 1024 );


+   if ( opt_param_n > 0x4000 )
+   {
+      scrypt_core_simd128_3buf( X,     scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_simd128_3buf( X+ 96, scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_simd128_2buf( X+192, scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_simd128_3buf( X+256, scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_simd128_3buf( X+352, scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_simd128_2buf( X+448, scratchbuf, N );
+   }
+   else
+   {
+      intrlv_4x128( W,     X,     X+ 32, X+ 64, X+ 96, 1024 );
+      intrlv_4x128( W+128, X+128, X+160, X+192, X+224, 1024 );
+      intrlv_4x128( W+256, X+256, X+288, X+320, X+352, 1024 );
+      intrlv_4x128( W+384, X+384, X+416, X+448, X+480, 1024 );
+      scrypt_core_4way_simd128( (__m512i*) W,      (__m512i*)scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_4way_simd128( (__m512i*)(W+128), (__m512i*)scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_4way_simd128( (__m512i*)(W+256), (__m512i*)scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_4way_simd128( (__m512i*)(W+384), (__m512i*)scratchbuf, N );
+      dintrlv_4x128( X,     X+ 32, X+ 64, X+ 96, W,     1024 );
+      dintrlv_4x128( X+128, X+160, X+192, X+224, W+128, 1024 );
+      dintrlv_4x128( X+256, X+288, X+320, X+352, W+256, 1024 );
+      dintrlv_4x128( X+384, X+416, X+448, X+480, W+384, 1024 );
+   }
+
   // SCRYPT CORE


@@ -888,23 +1015,40 @@ static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,
   // AVX512 working
   intrlv_4x32( W,     X,     X+ 32, X+ 64, X+ 96, 1024 );
   intrlv_4x32( W+128, X+128, X+160, X+192, X+224, 1024 );
+   intrlv_4x32( W+256,     X+256,     X+256+ 32, X+256+ 64, X+256+ 96, 1024 );
+   intrlv_4x32( W+256+128, X+256+128, X+256+160, X+256+192, X+256+224, 1024 );
   scrypt_core_simd128_4way( (__m128i*)W, (__m128i*)V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128_4way( (__m128i*)(W+128), (__m128i*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_simd128_4way( (__m128i*)(W+256), (__m128i*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_simd128_4way( (__m128i*)(W+256+128), (__m128i*)V, N );
   dintrlv_4x32( X,     X+ 32, X+ 64, X+ 96, W,     1024 );
   dintrlv_4x32( X+128, X+160, X+192, X+224, W+128, 1024 );
+   dintrlv_4x32( X+256,     X+256+ 32, X+256+ 64, X+256+ 96, W+256,     1024 );
+   dintrlv_4x32( X+256+128, X+256+160, X+256+192, X+256+224, W+256+128, 1024 );
 */
 /*
-   // AVX512, not working, very slow
+   // AVX512, working
   intrlv_4x128( W,     X,     X+ 32, X+ 64, X+ 96, 1024 );
   intrlv_4x128( W+128, X+128, X+160, X+192, X+224, 1024 );
+   intrlv_4x128( W+256,     X+256,     X+256+ 32, X+256+ 64, X+256+ 96, 1024 );
+   intrlv_4x128( W+256+128, X+256+128, X+256+160, X+256+192, X+256+224, 1024 );
   scrypt_core_4way_simd128( (__m512i*)W,      (__m512i*)V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_4way_simd128( (__m512i*)(W+128), (__m512i*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_4way_simd128( (__m512i*)(W+256),   (__m512i*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_4way_simd128( (__m512i*)(W+256+128), (__m512i*)V, N );
   dintrlv_4x128( X,     X+ 32, X+ 64, X+ 96, W,     1024 );
   dintrlv_4x128( X+128, X+160, X+192, X+224, W+128, 1024 );
+   dintrlv_4x128( X+256,     X+256+ 32, X+256+ 64, X+256+ 96, W+256,     1024 );
+   dintrlv_4x128( X+256+128, X+256+160, X+256+192, X+256+224, W+256+128, 1024 );
 */

+
  // AVX2

 /*
@@ -919,16 +1063,19 @@ static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,
   intrlv_2x128( W+ 64, X+ 64, X+ 96, 1024 );
   intrlv_2x128( W+128, X+128, X+160, 1024 );
   intrlv_2x128( W+192, X+192, X+224, 1024 );
-
-   // working
-//   scrypt_core_2way_simd128_3buf( (__m256i*) W,      (__m256i*)V, N );
-//   if ( work_restart[thrid].restart ) return 0;
-//   scrypt_core_2way_simd128( (__m256i*)(W+192),      (__m256i*)V, N );
+   intrlv_2x128( W+256,     X+256,     X+256+ 32, 1024 );
+   intrlv_2x128( W+256+ 64, X+256+ 64, X+256+ 96, 1024 );
+   intrlv_2x128( W+256+128, X+256+128, X+256+160, 1024 );
+   intrlv_2x128( W+256+192, X+256+192, X+256+224, 1024 );

   // working
   scrypt_core_2way_simd128_2buf( (__m256i*) W,      (__m256i*)V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_2way_simd128_2buf( (__m256i*)(W+128), (__m256i*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_2way_simd128_2buf( (__m256i*)(W+256),      (__m256i*)V, N );
+   if ( work_restart[thrid].restart ) return 0;
+   scrypt_core_2way_simd128_2buf( (__m256i*)(W+256+128), (__m256i*)V, N );

   // working
 //   scrypt_core_2way_simd128( (__m256i*) W,      (__m256i*)V, N );
@@ -938,11 +1085,23 @@ static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,
 //   scrypt_core_2way_simd128( (__m256i*)(W+128), (__m256i*)V, N );
 //   if ( work_restart[thrid].restart ) return 0;
 //   scrypt_core_2way_simd128( (__m256i*)(W+192), (__m256i*)V, N );
+//   if ( work_restart[thrid].restart ) return 0;
+//   scrypt_core_2way_simd128( (__m256i*)(W+256),      (__m256i*)V, N );
+//   if ( work_restart[thrid].restart ) return 0;
+//   scrypt_core_2way_simd128( (__m256i*)(W+256+ 64), (__m256i*)V, N );
+//   if ( work_restart[thrid].restart ) return 0;
+//   scrypt_core_2way_simd128( (__m256i*)(W+256+128), (__m256i*)V, N );
+//   if ( work_restart[thrid].restart ) return 0;
+//   scrypt_core_2way_simd128( (__m256i*)(W+256+192), (__m256i*)V, N );

   dintrlv_2x128( X,     X+ 32, W,     1024 );
   dintrlv_2x128( X+ 64, X+ 96, W+ 64, 1024 );
   dintrlv_2x128( X+128, X+160, W+128, 1024 );
   dintrlv_2x128( X+192, X+224, W+192, 1024 );
+   dintrlv_2x128( X+256,     X+256+ 32, W+256,     1024 );
+   dintrlv_2x128( X+256+ 64, X+256+ 96, W+256+ 64, 1024 );
+   dintrlv_2x128( X+256+128, X+256+160, W+256+128, 1024 );
+   dintrlv_2x128( X+256+192, X+256+224, W+256+192, 1024 );
 */

 /*
@@ -952,18 +1111,13 @@ static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,
   intrlv_2x32( W+128, X+128, X+160, 1024 );
   intrlv_2x32( W+192, X+192, X+224, 1024 );

-   // working, deprecated, not up to data
-//   scrypt_core_simd128_2way_4buf( (uint64_t*)W,  (uint64_t*)V, N );
-
-     // deprecated, not up to date
-//   scrypt_core_simd128_2way_3buf( (uint64_t*)  W,       (uint64_t*)V, N );
-//   if ( work_restart[thrid].restart ) return 0;
-//   scrypt_core_simd128_2way(      (uint64_t*)( W+192 ), (uint64_t*)V, N );
-
   // working
 //   scrypt_core_simd128_2way_2buf( (uint64_t*)  W,       (uint64_t*)V, N );
 //   if ( work_restart[thrid].restart ) return 0;
 //   scrypt_core_simd128_2way_2buf( (uint64_t*)( W+128 ), (uint64_t*)V, N );
+//   scrypt_core_simd128_2way_2buf( (uint64_t*)  W,       (uint64_t*)V, N );
+//   if ( work_restart[thrid].restart ) return 0;
+//   scrypt_core_simd128_2way_2buf( (uint64_t*)( W+128 ), (uint64_t*)V, N );

 //   scrypt_core_simd128_2way( (uint64_t*)  W,       (uint64_t*)V, N );
 //   if ( work_restart[thrid].restart ) return 0;
@@ -1043,7 +1197,7 @@ static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128_2buf( X+448, V, N );
 */
-
+/***************
   scrypt_core_simd128_3buf( X,     V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128_3buf( X+ 96, V, N );
@@ -1055,17 +1209,7 @@ static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,
   scrypt_core_simd128_3buf( X+352, V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128_2buf( X+448, V, N );
-
-/*
-   // SSE2 working
-   scrypt_core_simd128_4buf( X,     V, N );
-   if ( work_restart[thrid].restart ) return 0;
-   scrypt_core_simd128_4buf( X+128, V, N );
-   if ( work_restart[thrid].restart ) return 0;
-   scrypt_core_simd128_4buf( X+256, V, N );
-   if ( work_restart[thrid].restart ) return 0;
-   scrypt_core_simd128_4buf( X+384, V, N );
-*/
+********************/
 /*
   scrypt_core_3way( X,     V, N );
   if ( work_restart[thrid].restart ) return 0;
@@ -1100,15 +1244,37 @@ static int scrypt_N_1_1_256_16way( const uint32_t *input, uint32_t *output,

 #endif // AVX512

-#if defined(__SHA__)
+#if 0
+static int scrypt_N_1_1_256_sha_2buf( const uint32_t *input, uint32_t *output,
+                                      uint32_t *midstate, int N, int thrid )
+{
+    uint32_t _ALIGN(128) tstate[ 2*8 ];
+    uint32_t _ALIGN(128) ostate[ 2*8 ];
+    uint32_t _ALIGN(128) W[ 2*32 ];
+
+    memcpy( tstate,    midstate, 32 );
+    memcpy( tstate+ 8, midstate, 32 );
+
+    HMAC_SHA256_80_init_SHA_2BUF( input, input+20, tstate, tstate+8,
+                                  ostate, ostate+8 );
+    PBKDF2_SHA256_80_128_SHA_2BUF( tstate, tstate+8, ostate, ostate+8,
+                                   input, input+20,  W, W+32 );
+
+    scrypt_core_simd128_2buf( W, scratchbuf, N );
+    if ( work_restart[thrid].restart ) return 0;
+
+    PBKDF2_SHA256_128_32_SHA_2BUF( tstate, tstate+8, ostate, ostate+8, W, W+32,
+                                   output, output+8 );
+
+   return 1;
+}

 static int scrypt_N_1_1_256_4way_sha( const uint32_t *input, uint32_t *output,
-           uint32_t *midstate, unsigned char *scratchpad, int N, int thrid )
+           uint32_t *midstate, int N, int thrid )
 {
    uint32_t _ALIGN(128) tstate[4 * 8];
    uint32_t _ALIGN(128) ostate[4 * 8];
    uint32_t _ALIGN(128) W[4 * 32];
-    uint32_t *V = (uint32_t*)scratchpad;

    memcpy( tstate,    midstate, 32 );
    memcpy( tstate+ 8, midstate, 32 );
@@ -1139,9 +1305,9 @@ static int scrypt_N_1_1_256_4way_sha( const uint32_t *input, uint32_t *output,
 */

   // working, double buffered linear simd
-   scrypt_core_simd128_2buf( W, V, N );
+   scrypt_core_simd128_2buf( W, scratchbuf, N );
   if ( work_restart[thrid].restart ) return 0;
-   scrypt_core_simd128_2buf( W+64, V, N );
+   scrypt_core_simd128_2buf( W+64, scratchbuf, N );

 /*
   scrypt_core_simd128_3buf( W, V, N );
@@ -1149,8 +1315,6 @@ static int scrypt_N_1_1_256_4way_sha( const uint32_t *input, uint32_t *output,
   scrypt_core_simd128( W+96, V, N );
 */

-   // working
-//   scrypt_core_simd128_4buf( W, V, N );

   if ( work_restart[thrid].restart ) return 0;

@@ -1164,18 +1328,15 @@ static int scrypt_N_1_1_256_4way_sha( const uint32_t *input, uint32_t *output,

   return 1;
 }
+#endif

-#else
-
-#ifdef HAVE_SHA256_4WAY
+#if ( SCRYPT_THROUGHPUT == 4 )
 static int scrypt_N_1_1_256_4way( const uint32_t *input,	uint32_t *output,
-           uint32_t *midstate, unsigned char *scratchpad, int N, int thrid )
+           uint32_t *midstate, int N, int thrid )
 {
-   uint32_t _ALIGN(128) tstate[4 * 8];
-   uint32_t _ALIGN(128) ostate[4 * 8];
-   uint32_t _ALIGN(128) W[4 * 32];
-   uint32_t _ALIGN(128) X[4 * 32];
-   uint32_t *V = (uint32_t*)scratchpad;
+   uint32_t _ALIGN(128) tstate[ 4*8 ];
+   uint32_t _ALIGN(128) ostate[ 4*8 ];
+   uint32_t _ALIGN(128) W[ 4*32 ];

   intrlv_4x32( W, input, input+20, input+40, input+60, 640 );
   for ( int i = 0; i < 8; i++ )
@@ -1184,7 +1345,21 @@ static int scrypt_N_1_1_256_4way( const uint32_t *input,	uint32_t *output,
   HMAC_SHA256_80_init_4way(W, tstate, ostate);
   PBKDF2_SHA256_80_128_4way(tstate, ostate, W, W);

-   dintrlv_4x32( X, X+32, X+64, X+96, W, 1024 );
+   if ( opt_param_n > 0x4000 )
+   {
+      uint32_t _ALIGN(128) X[ 4*32 ];
+      dintrlv_4x32( X, X+32, X+64, X+96, W, 1024 );
+      scrypt_core_simd128_2buf( X, scratchbuf, N );
+      if ( work_restart[thrid].restart ) return 0;
+      scrypt_core_simd128_2buf( X+64, scratchbuf, N );
+      intrlv_4x32( W, X, X+32, X+64, X+96, 1024 );
+   }
+   else
+      scrypt_core_4way( (__m128i*)W, (__m128i*)scratchbuf, N );
+
+
+
+//   dintrlv_4x32( X, X+32, X+64, X+96, W, 1024 );

 ////// SCRYPT_CORE   

@@ -1202,35 +1377,23 @@ static int scrypt_N_1_1_256_4way( const uint32_t *input,	uint32_t *output,
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128( X+96, V, N );
 */
-   
+/*   
   // working, double buffered linear simd, best for n2
   scrypt_core_simd128_2buf( X, V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128_2buf( X+64, V, N );
-  
+*/  
 /*
   scrypt_core_simd128_3buf( X, V, N );
   if ( work_restart[thrid].restart ) return 0;
   scrypt_core_simd128( X+96, V, N );
 */
   
-   // working
-//   scrypt_core_simd128_4buf( X, V, N );
-
-
-/* 
-   // original
-   scrypt_core(X + 0 * 32, V, N);
-	scrypt_core(X + 1 * 32, V, N);
-	scrypt_core(X + 2 * 32, V, N);
-	scrypt_core(X + 3 * 32, V, N);
-*/
-
 ////////////////////////////////

   if ( work_restart[thrid].restart ) return 0;

-   intrlv_4x32( W, X, X+32, X+64, X+96, 1024 );
+//   intrlv_4x32( W, X, X+32, X+64, X+96, 1024 );

   PBKDF2_SHA256_128_32_4way(tstate, ostate, W, W);

@@ -1238,58 +1401,73 @@ static int scrypt_N_1_1_256_4way( const uint32_t *input,	uint32_t *output,

   return 1;
 }
-#endif /* HAVE_SHA256_4WAY */
+#endif   // SCRYPT_THROUGHPUT == 4

-#endif // SHA
+//#endif // SHA

 extern int scanhash_scrypt( struct work *work, uint32_t max_nonce,
                            uint64_t *hashes_done, struct thr_info *mythr )
 {
+   uint32_t _ALIGN(64) hash[  8*SCRYPT_THROUGHPUT ];
+   uint32_t _ALIGN(64) data[ 20*SCRYPT_THROUGHPUT ];
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
-	uint32_t data[SCRYPT_MAX_WAYS * 20], hash[SCRYPT_MAX_WAYS * 8];
-	uint32_t midstate[8];
-	uint32_t n = pdata[19] - 1;
+   uint32_t midstate[8];
+   uint32_t n = pdata[19] - 1;
   int thr_id = mythr->id;  
-   int throughput = scrypt_throughput;
-	int i;
+   int i;
   volatile uint8_t *restart = &(work_restart[thr_id].restart);
 	
-	for ( i = 0; i < throughput; i++ )
-		memcpy( data + i * 20, pdata, 80 );
+   for ( i = 0; i < SCRYPT_THROUGHPUT; i++ )
+      memcpy( data + i * 20, pdata, 80 );

   sha256_transform_le( midstate, data, sha256_initial_state );

-	do {
+   do {
      bool rc = true;
-		for ( i = 0; i < throughput; i++ ) data[ i*20 + 19 ] = ++n;
+      for ( i = 0; i < SCRYPT_THROUGHPUT; i++ ) data[ i*20 + 19 ] = ++n;

-#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
-      if ( throughput == 16 )
-         rc = scrypt_N_1_1_256_16way( data, hash, midstate, scratchbuf,
-                                      opt_param_n, thr_id );
-      else
-#endif
-#if defined(__AVX2__)      
-      if ( throughput == 8 )      
-         rc = scrypt_N_1_1_256_8way( data, hash, midstate, scratchbuf,
-                                     opt_param_n, thr_id );
-      else
-#endif
-      if ( throughput == 4 )
-#if defined(__SHA__)
-         rc = scrypt_N_1_1_256_4way_sha( data, hash, midstate, scratchbuf,
-                                         opt_param_n, thr_id );
+//#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#if ( SCRYPT_THROUGHPUT == 16 )
+//      if ( SCRYPT_THROUGHPUT == 16 )
+         rc = scrypt_N_1_1_256_16way( data, hash, midstate, opt_param_n,
+                                      thr_id );
+//      else
+//#endif
+//#if defined(__AVX2__)      
+#elif ( SCRYPT_THROUGHPUT == 8 )
+//         if ( SCRYPT_THROUGHPUT == 8 )      
+         rc = scrypt_N_1_1_256_8way( data, hash, midstate, opt_param_n,
+                                     thr_id );
+//      else
+//#endif
+#elif ( SCRYPT_THROUGHPUT == 4 )
+//      if ( SCRYPT_THROUGHPUT == 4 ) // slower on Ryzen than 8way
+//#if defined(__SHA__)
+//         rc = scrypt_N_1_1_256_4way_sha( data, hash, midstate, opt_param_n,
+//                                         thr_id );
+//#else
+         rc = scrypt_N_1_1_256_4way( data, hash, midstate, opt_param_n,
+                                     thr_id );
 #else
-         rc = scrypt_N_1_1_256_4way( data, hash, midstate, scratchbuf,
-                                     opt_param_n, thr_id );
-#endif
-      else
-         rc = scrypt_N_1_1_256( data, hash, midstate, scratchbuf,
-                                opt_param_n, thr_id );

+#error "Invalid SCRYPT_THROUGHPUT"
+
+#endif
+/*
+#if defined(__SHA__)
+      else
+      if ( SCRYPT_THROUGHPUT == 2 )  // slower on Ryzen than 4way_sha & 8way
+         rc = scrypt_N_1_1_256_sha_2buf( data, hash, midstate, opt_param_n,
+                                         thr_id );
+#endif         
+      else  // should never get here
+         rc = scrypt_N_1_1_256( data, hash, midstate, opt_param_n, thr_id );
+*/
+
+      // test the hash
      if ( rc )
-      for ( i = 0; i < throughput; i++ )
+      for ( i = 0; i < SCRYPT_THROUGHPUT; i++ )
      {
         if ( unlikely( valid_hash( hash + i*8, ptarget ) && !opt_benchmark ) )
         {
@@ -1301,7 +1479,7 @@ extern int scanhash_scrypt( struct work *work, uint32_t max_nonce,
      }


-   } while ( likely( ( n < ( max_nonce - throughput ) ) && !(*restart) ) );
+   } while ( likely( ( n < ( max_nonce - SCRYPT_THROUGHPUT ) ) && !(*restart) ) );
 	
 	*hashes_done = n - pdata[19];
 	pdata[19] = n;
@@ -1310,38 +1488,62 @@ extern int scanhash_scrypt( struct work *work, uint32_t max_nonce,

 bool scrypt_miner_thread_init( int thr_id )
 {
-   scratchbuf = _mm_malloc( scratchbuf_size, 128 );
+   scratchbuf = malloc_hugepages( scratchbuf_size );
   if ( scratchbuf )
-      return true;
+   {
+      if ( opt_debug )
+         applog( LOG_NOTICE, "Thread %u is using huge pages", thr_id );
+   }
+   else
+      scratchbuf = _mm_malloc( scratchbuf_size, 128 );
+   
+   if ( scratchbuf ) return true;
+   
   applog( LOG_ERR, "Thread %u: Scrypt buffer allocation failed", thr_id );
-   return false; 
+   return false;
 }

 bool register_scrypt_algo( algo_gate_t* gate )
 {
-#if defined(__SHA__)
-   gate->optimizations = SSE2_OPT | SHA_OPT;
-#else
-   gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
-#endif
+//#if defined(__SHA__)
+//   gate->optimizations = SSE2_OPT | SHA_OPT;
+//#else
+   gate->optimizations = SSE2_OPT | SSE42_OPT | AVX_OPT | AVX2_OPT | AVX512_OPT;
+//#endif
   gate->miner_thread_init =(void*)&scrypt_miner_thread_init;
   gate->scanhash         = (void*)&scanhash_scrypt;
   opt_target_factor = 65536.0;
   opt_param_n = opt_param_n ? opt_param_n : 1024;
   applog( LOG_INFO,"Scrypt paramaters: N= %d, R= 1", opt_param_n );

+// SCRYPT_THROUGHPUT can be defined at compile time and used to replace
+// SCRYPT_MAX_WAYS to reduce memory usage.
+
 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
-   scrypt_throughput = 16;
-   scratchbuf_size = opt_param_n * 3 * 128;  // 3 buf
+//   scrypt_throughput = 16;
+   if ( opt_param_n > 0x4000 )
+      scratchbuf_size = opt_param_n * 3 * 128;  // 3 buf
+   else      
+      scratchbuf_size = opt_param_n * 4 * 128;  // 4 way
+
+/* SHA is slower than AVX2 on Ryzen
 #elif defined(__SHA__)
   scrypt_throughput = 4;
   scratchbuf_size = opt_param_n * 2 * 128;  // 2 buf
+*/
+
 #elif defined(__AVX2__)
-   scrypt_throughput = 8;   
-   scratchbuf_size = opt_param_n * 3 * 128;  // 3 buf
+//   scrypt_throughput = 8;   
+   if ( opt_param_n > 0x4000 )
+      scratchbuf_size = opt_param_n * 3 * 128;  // 3 buf
+   else
+      scratchbuf_size = opt_param_n * 2 * 128;  // 2 way
 #else
-   scrypt_throughput = 4;
+//   scrypt_throughput = 4;
+   if ( opt_param_n > 0x4000 )
   scratchbuf_size = opt_param_n * 2 * 128;  // 2 buf
+   else
+   scratchbuf_size = opt_param_n * 4 * 128;  // 4 way
 #endif

   char t_units[4] = {0};
@@ -1351,9 +1553,8 @@ bool register_scrypt_algo( algo_gate_t* gate )

   format_number_si( &t_size, t_units );
   format_number_si( &d_size, d_units );
-   
   applog( LOG_INFO,"Throughput %d/thr, Buffer %.0f %siB/thr, Total %.0f %siB\n",
-          scrypt_throughput, t_size, t_units, d_size, d_units );
+          SCRYPT_THROUGHPUT, t_size, t_units, d_size, d_units );

   return true;
 };
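Review note on the scratch buffer sizing in register_scrypt_algo above: the per-thread size now depends on both N and the build's lane count. The sketch below restates that selection as a single function so the cases are easier to compare. The helper name scrypt_scratch_bytes is hypothetical and not part of the patch. It also assumes that 16 lanes corresponds to the AVX512 branch, 8 lanes to the AVX2 branch, and 4 lanes to the fallback branch, whereas the patch itself selects on CPU feature macros.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: per-thread scratch buffer
   size in bytes for a given N and lane count, mirroring the #if chain in
   register_scrypt_algo. Assumes 16 lanes = AVX512 build, 8 lanes = AVX2
   build, anything else = the 4 lane fallback. */
static size_t scrypt_scratch_bytes( uint32_t n, int throughput )
{
   const size_t block = 128;   /* multiplier used by the patch for R = 1 */
   size_t bufs;

   if ( n > 0x4000 )
      /* large N: linear SIMD cores, 3 buffers (2 in the 4 lane build) */
      bufs = ( throughput == 4 ) ? 2 : 3;
   else if ( throughput == 16 )
      bufs = 4;                /* 4 way interleaved core */
   else if ( throughput == 8 )
      bufs = 2;                /* 2 way interleaved core */
   else
      bufs = 4;                /* 4 way interleaved core */

   return (size_t)n * bufs * block;
}
```

Under these assumptions the default N of 1024 needs 256 KiB per thread with 8 lanes and 512 KiB per thread with 4 or 16 lanes. The threshold is strict, so N equal to 0x4000 still takes the interleaved path.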
--- a/algo/sha/sha-hash-4way.h
+++ b/algo/sha/sha-hash-4way.h
@@ -51,7 +51,6 @@ typedef struct {
   __m128i buf[64>>2];
   __m128i val[8];
   uint32_t count_high, count_low;
-   bool initialized;
 } sha256_4way_context __attribute__ ((aligned (64)));

 void sha256_4way_init( sha256_4way_context *sc );
@@ -63,6 +62,12 @@ void sha256_4way_transform_le( __m128i *state_out,  const __m128i *data,
                            const __m128i *state_in );
 void sha256_4way_transform_be( __m128i *state_out,  const __m128i *data,
                            const __m128i *state_in );
+void sha256_4way_prehash_3rounds( __m128i *state_mid, __m128i *X,
+                                   const __m128i *W, const __m128i *state_in );
+void sha256_4way_final_rounds( __m128i *state_out, const __m128i *data,
+        const __m128i *state_in, const __m128i *state_mid, const __m128i *X );
+int sha256_4way_transform_le_short( __m128i *state_out, const __m128i *data,
+                                     const __m128i *state_in );

 #endif  // SSE2

@@ -74,7 +79,6 @@ typedef struct {
   __m256i buf[64>>2];
   __m256i val[8];
   uint32_t count_high, count_low;
-   bool initialized;
 } sha256_8way_context __attribute__ ((aligned (128)));

 void sha256_8way_init( sha256_8way_context *sc );
@@ -86,6 +90,13 @@ void sha256_8way_transform_le( __m256i *state_out, const __m256i *data,
 void sha256_8way_transform_be( __m256i *state_out, const __m256i *data,
                               const __m256i *state_in );

+void sha256_8way_prehash_3rounds( __m256i *state_mid, __m256i *X,
+                                 const __m256i *W, const __m256i *state_in );
+void sha256_8way_final_rounds( __m256i *state_out, const __m256i *data,
+        const __m256i *state_in, const __m256i *state_mid, const __m256i *X );
+int sha256_8way_transform_le_short( __m256i *state_out, const __m256i *data,
+                                     const __m256i *state_in );
+
 #endif  // AVX2

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
@@ -96,7 +107,6 @@ typedef struct {
   __m512i buf[64>>2];
   __m512i val[8];
   uint32_t count_high, count_low;
-   bool initialized;
 } sha256_16way_context __attribute__ ((aligned (128)));

 void sha256_16way_init( sha256_16way_context *sc );
@@ -107,10 +117,13 @@ void sha256_16way_transform_le( __m512i *state_out, const __m512i *data,
                             const __m512i *state_in );
 void sha256_16way_transform_be( __m512i *state_out, const __m512i *data,
                             const __m512i *state_in );
-void sha256_16way_prehash_3rounds( __m512i *state_mid, const __m512i *W,
-                             const __m512i *state_in );
+void sha256_16way_prehash_3rounds( __m512i *state_mid, __m512i *X,
+                                  const __m512i *W, const __m512i *state_in );
 void sha256_16way_final_rounds( __m512i *state_out, const __m512i *data,
-                          const __m512i *state_in, const __m512i *state_mid );
+        const __m512i *state_in, const __m512i *state_mid, const __m512i *X );
+
+int sha256_16way_transform_le_short( __m512i *state_out, const __m512i *data,
+                                     const __m512i *state_in );

 #endif // AVX512

--- a/algo/sha/sha2.c
+++ b/algo/sha/sha2.c
@@ -8,7 +8,7 @@
 * any later version.  See COPYING for more details.
 */

-#include "algo-gate-api.h"
+#include "sha256d-4way.h"

 #include <string.h>
 #include <inttypes.h>
@@ -181,6 +181,8 @@ static const uint32_t sha256d_hash1[16] = {
 };

 // this performs the entire hash all over again, why?
+// because main function only does 56 rounds.
+
 static void sha256d_80_swap(uint32_t *hash, const uint32_t *data)
 {
 	uint32_t S[16];
@@ -492,7 +494,7 @@ static inline void sha256d_ms(uint32_t *hash, uint32_t *W,
 void sha256d_ms_4way(uint32_t *hash,  uint32_t *data,
 	const uint32_t *midstate, const uint32_t *prehash);

-static inline int scanhash_sha256d_4way( struct work *work,
+static inline int scanhash_sha256d_4way_pooler( struct work *work,
             uint32_t max_nonce, uint64_t *hashes_done, struct thr_info *mythr )
 {
        uint32_t *pdata = work->data;
@@ -553,7 +555,7 @@ static inline int scanhash_sha256d_4way( struct work *work,
 void sha256d_ms_8way(uint32_t *hash,  uint32_t *data,
 	const uint32_t *midstate, const uint32_t *prehash);

-static inline int scanhash_sha256d_8way( struct work *work,
+static inline int scanhash_sha256d_8way_pooler( struct work *work,
            uint32_t max_nonce, uint64_t *hashes_done, struct thr_info *mythr )
 {
        uint32_t *pdata = work->data;
@@ -609,11 +611,11 @@ static inline int scanhash_sha256d_8way( struct work *work,

 #endif /* HAVE_SHA256_8WAY */

-int scanhash_sha256d( struct work *work,
-	uint32_t max_nonce, uint64_t *hashes_done, struct thr_info *mythr )
+int scanhash_sha256d_pooler( struct work *work, uint32_t max_nonce,
+                             uint64_t *hashes_done, struct thr_info *mythr )
 {
-        uint32_t *pdata = work->data;
-        uint32_t *ptarget = work->target;
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
 	uint32_t _ALIGN(128) data[64];
 	uint32_t _ALIGN(32) hash[8];
 	uint32_t _ALIGN(32) midstate[8];
@@ -624,12 +626,12 @@ int scanhash_sha256d( struct work *work,
   int thr_id = mythr->id;  // thr_id arg is deprecated

 #ifdef HAVE_SHA256_8WAY
-	if (sha256_use_8way())
-		return scanhash_sha256d_8way( work,	max_nonce, hashes_done, mythr );
+	if ( sha256_use_8way() )
+		return scanhash_sha256d_8way_pooler( work, max_nonce, hashes_done, mythr );
 #endif
 #ifdef HAVE_SHA256_4WAY
-	if (sha256_use_4way())
-		return scanhash_sha256d_4way( work,	max_nonce, hashes_done, mythr );
+	if ( sha256_use_4way() )
+		return scanhash_sha256d_4way_pooler( work, max_nonce, hashes_done, mythr );
 #endif
 	
 	memcpy(data, pdata + 16, 64);
@@ -690,9 +692,16 @@ int scanhash_SHA256d( struct work *work, const uint32_t max_nonce,

 bool register_sha256d_algo( algo_gate_t* gate )
 {
-   gate->optimizations = SSE2_OPT | AVX2_OPT;
-   gate->scanhash = (void*)&scanhash_sha256d;
-//   gate->hash     = (void*)&sha256d;
+   gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
+#if defined(SHA256D_16WAY)
+   gate->scanhash = (void*)&scanhash_sha256d_16way;
+//#elif defined(SHA256D_8WAY)
+//   gate->scanhash = (void*)&scanhash_sha256d_8way;
+#else
+   gate->scanhash = (void*)&scanhash_sha256d_pooler;
+//   gate->scanhash = (void*)&scanhash_sha256d_4way;
+#endif
+//   gate->hash     = (void*)&sha256d;
   return true;
 };

--- a/algo/sha/sha256-hash-4way.c
+++ b/algo/sha/sha256-hash-4way.c
--- a/algo/sha/sha256-hash.h
+++ b/algo/sha/sha256-hash.h
@@ -53,4 +53,8 @@ void sha256_ni2way_transform_be( uint32_t *out_X, uint32_t*out_Y,
 #define sha256_transform_be sph_sha256_transform_be

 #endif
+
+// The SHA extension can't do only 3 rounds, so always use the sph version
+#define sha256_prehash_3rounds sph_sha256_prehash_3rounds
+
 #endif
--- a/algo/sha/sha256d-4way.c
+++ b/algo/sha/sha256d-4way.c
@@ -1,4 +1,4 @@
-#include "sha256t-gate.h"
+#include "sha256d-4way.h"
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
@@ -10,13 +10,14 @@
 int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
+   __m512i  vdata[32]    __attribute__ ((aligned (128)));
   __m512i  block[16]    __attribute__ ((aligned (64)));
-   __m512i  hash32[8]    __attribute__ ((aligned (32)));
-   __m512i  initstate[8] __attribute__ ((aligned (32)));
-   __m512i  midstate[8]  __attribute__ ((aligned (32)));
-   __m512i  midstate2[8] __attribute__ ((aligned (32)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m512i  vdata[20]    __attribute__ ((aligned (32)));
+   __m512i  hash32[8]    __attribute__ ((aligned (64)));
+   __m512i  initstate[8] __attribute__ ((aligned (64)));
+   __m512i  midstate1[8] __attribute__ ((aligned (64)));
+   __m512i  midstate2[8] __attribute__ ((aligned (64)));
+   __m512i  mexp_pre[16] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -36,6 +37,14 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_512( vdata+16 + 5, 10 );
+   vdata[16+15] = m512_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_512( block + 9, 6 );
+   block[15] = m512_const1_32( 32*8 ); // bit count
+   
   // initialize state
   initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
@@ -46,43 +55,36 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
   initstate[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
   initstate[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );

-   // hash first 64 byte block of data
-   sha256_16way_transform_le( midstate, vdata, initstate );
+   sha256_16way_transform_le( midstate1, vdata, initstate );

   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_16way_prehash_3rounds( midstate2, vdata + 16, midstate );
+   sha256_16way_prehash_3rounds( midstate2, mexp_pre, vdata+16, midstate1 );

   do
   {
      // 1. final 16 bytes of data, with padding
-      memcpy_512( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_512( block + 5, 10 );
-      block[15] = m512_const1_32( 80*8 ); // bit count
-      sha256_16way_final_rounds( hash32, block, midstate, midstate2 );
+      sha256_16way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                 mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_512( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_512( block + 9, 6 );
-      block[15] = m512_const1_32( 32*8 ); // bit count
-      sha256_16way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm512_block_bswap_32( hash32, hash32 );
-
-      for ( int lane = 0; lane < 16; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( sha256_16way_transform_le_short( hash32, block, initstate ) )
      {
-         extr_lane_16x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm512_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 16; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_16x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
-       *noncev = _mm512_add_epi32( *noncev, sixteen );
-       n += 16;
+      }
+      *noncev = _mm512_add_epi32( *noncev, sixteen );
+      n += 16;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
@@ -96,12 +98,14 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
 int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m256i  block[16]    __attribute__ ((aligned (64)));
+   __m256i  vdata[32]    __attribute__ ((aligned (64)));
+   __m256i  block[16]    __attribute__ ((aligned (32)));
   __m256i  hash32[8]    __attribute__ ((aligned (32)));
   __m256i  initstate[8] __attribute__ ((aligned (32)));
-   __m256i  midstate[8]  __attribute__ ((aligned (32)));
+   __m256i  midstate1[8] __attribute__ ((aligned (32)));
+   __m256i  midstate2[8] __attribute__ ((aligned (32)));
+   __m256i  mexp_pre[16] __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m256i  vdata[20]    __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -116,10 +120,18 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
   const __m256i eight = m256_const1_32( 8 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m256_const1_32( pdata[i] );
+      vdata[i] = m256_const1_32( pdata[i] );

   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_256( vdata+16 + 5, 10 );
+   vdata[16+15] = m256_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_256( block + 9, 6 );
+   block[15] = m256_const1_32( 32*8 ); // bit count
+   
   // initialize state
   initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
@@ -130,36 +142,33 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
   initstate[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
   initstate[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );

-   // hash first 64 bytes of data
-   sha256_8way_transform_le( midstate, vdata, initstate );
+   sha256_8way_transform_le( midstate1, vdata, initstate );
+   
+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_8way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );

   do
   {
      // 1. final 16 bytes of data, with padding
-      memcpy_256( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_256( block + 5, 10 );
-      block[15] = m256_const1_32( 80*8 ); // bit count
-      sha256_8way_transform_le( hash32, block, midstate );
+      sha256_8way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_256( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_256( block + 9, 6 );
-      block[15] = m256_const1_32( 32*8 ); // bit count
-      sha256_8way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm256_block_bswap_32( hash32, hash32 );
-
-      for ( int lane = 0; lane < 8; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( unlikely(
+               sha256_8way_transform_le_short( hash32, block, initstate ) ) )
      {
-         extr_lane_8x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm256_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 8; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_8x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
       }
       *noncev = _mm256_add_epi32( *noncev, eight );
@@ -177,12 +186,14 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
 int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m128i  block[16]    __attribute__ ((aligned (64)));
-   __m128i  hash32[8]    __attribute__ ((aligned (32)));
-   __m128i  initstate[8] __attribute__ ((aligned (32)));
-   __m128i  midstate[8]  __attribute__ ((aligned (32)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m128i  vdata[20]    __attribute__ ((aligned (32)));
+   __m128i  vdata[32]     __attribute__ ((aligned (64)));
+   __m128i  block[16]     __attribute__ ((aligned (32)));
+   __m128i  hash32[8]     __attribute__ ((aligned (32)));
+   __m128i  initstate[8]  __attribute__ ((aligned (32)));
+   __m128i  midstate1[8]  __attribute__ ((aligned (32)));
+   __m128i  midstate2[8]  __attribute__ ((aligned (32)));
+   __m128i  mexp_pre[16]  __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8]  __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -201,6 +212,14 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_128( vdata+16 + 5, 10 );
+   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_128( block + 9, 6 );
+   block[15] = m128_const1_32( 32*8 ); // bit count
+
   // initialize state
   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
@@ -212,39 +231,36 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );

   // hash first 64 bytes of data
-   sha256_4way_transform_le( midstate, vdata, initstate );
+   sha256_4way_transform_le( midstate1, vdata, initstate );
+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_4way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );

   do
   {
      // 1. final 16 bytes of data, with padding
-      memcpy_128( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_128( block + 5, 10 );
-      block[15] = m128_const1_32( 80*8 ); // bit count
-      sha256_4way_transform_le( hash32, block, midstate );
+      sha256_4way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_128( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_128( block + 9, 6 );
-      block[15] = m128_const1_32( 32*8 ); // bit count
-      sha256_4way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm128_block_bswap_32( hash32, hash32 );
-
-      for ( int lane = 0; lane < 4; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( unlikely(
+              sha256_4way_transform_le_short( hash32, block, initstate ) ) )
      {
-         extr_lane_4x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm128_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 4; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_4x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
-       *noncev = _mm_add_epi32( *noncev, four );
-       n += 4;
+      }
+      *noncev = _mm_add_epi32( *noncev, four );
+      n += 4;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
@@ -253,3 +269,20 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,

 #endif

+/*
+bool register_sha256d_algo( algo_gate_t* gate )
+{
+   gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
+#if defined(SHA256D_16WAY)
+   gate->scanhash = (void*)&scanhash_sha256d_16way;
+#elif defined(SHA256D_8WAY)
+   gate->scanhash = (void*)&scanhash_sha256d_8way;
+#elif defined(SHA256D_4WAY)
+   gate->scanhash = (void*)&scanhash_sha256d_4way;
+#endif
+   
+//   gate->hash     = (void*)&sha256d;
+   return true;
+};
+*/
+
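Note on the sha256d-4way.c hunks above: they set the constant padding words once, before the nonce loop, and reuse the state left by the first 64 header bytes for every nonce. Below is a minimal scalar sketch of that idea. All `demo_*` names are hypothetical and exist only for illustration. It is not the vectorized `sha256_Nway` code, and it leaves out the 3-round prehash and the cached message expansion (`midstate2`, `mexp_pre`).

```c
#include <stdint.h>
#include <string.h>

static const uint32_t DEMO_K[64] = {
   0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
   0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
   0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
   0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
   0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
   0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
   0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
   0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2 };

static const uint32_t DEMO_IV[8] = {
   0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
   0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19 };

#define DEMO_ROTR( x, n ) ( ( (x) >> (n) ) | ( (x) << ( 32 - (n) ) ) )

// One SHA-256 compression of a 16-word block. Writes only out[0..7];
// out may alias w16 or in because both are copied before being overwritten.
static void demo_transform( uint32_t *out, const uint32_t *w16,
                            const uint32_t *in )
{
   uint32_t W[64], s[8];
   memcpy( W, w16, 64 );
   for ( int i = 16; i < 64; i++ )
   {
      uint32_t s0 = DEMO_ROTR( W[i-15],  7 ) ^ DEMO_ROTR( W[i-15], 18 ) ^ ( W[i-15] >>  3 );
      uint32_t s1 = DEMO_ROTR( W[i- 2], 17 ) ^ DEMO_ROTR( W[i- 2], 19 ) ^ ( W[i- 2] >> 10 );
      W[i] = W[i-16] + s0 + W[i-7] + s1;
   }
   memcpy( s, in, 32 );
   for ( int i = 0; i < 64; i++ )
   {
      uint32_t S1  = DEMO_ROTR( s[4], 6 ) ^ DEMO_ROTR( s[4], 11 ) ^ DEMO_ROTR( s[4], 25 );
      uint32_t ch  = ( s[4] & s[5] ) ^ ( ~s[4] & s[6] );
      uint32_t t1  = s[7] + S1 + ch + DEMO_K[i] + W[i];
      uint32_t S0  = DEMO_ROTR( s[0], 2 ) ^ DEMO_ROTR( s[0], 13 ) ^ DEMO_ROTR( s[0], 22 );
      uint32_t maj = ( s[0] & s[1] ) ^ ( s[0] & s[2] ) ^ ( s[1] & s[2] );
      uint32_t t2  = S0 + maj;
      s[7] = s[6];  s[6] = s[5];  s[5] = s[4];  s[4] = s[3] + t1;
      s[3] = s[2];  s[2] = s[1];  s[1] = s[0];  s[0] = t1 + t2;
   }
   for ( int i = 0; i < 8; i++ ) out[i] = in[i] + s[i];
}

typedef struct
{
   uint32_t midstate[8];  // state after the first 64 header bytes
   uint32_t tail[16];     // second block: header words 16..19 plus padding
   uint32_t block[16];    // input of the second hash: 8 hash words plus padding
} demo_job;

// Everything that does not depend on the nonce is done once per job.
static void demo_job_init( demo_job *j, const uint32_t *header20 )
{
   demo_transform( j->midstate, header20, DEMO_IV );
   memset( j->tail, 0, sizeof j->tail );
   memcpy( j->tail, header20 + 16, 16 );
   j->tail[ 4] = 0x80000000;   // end-of-message marker
   j->tail[15] = 80*8;         // bit count of the 80-byte header
   memset( j->block, 0, sizeof j->block );
   j->block[ 8] = 0x80000000;
   j->block[15] = 32*8;        // bit count of the 32-byte first hash
}

// Per nonce: two compressions, no memcpy and no padding setup.
static void demo_hash_nonce( uint32_t *hash, demo_job *j, uint32_t nonce )
{
   j->tail[3] = nonce;                                // header word 19
   demo_transform( j->block, j->tail, j->midstate );  // fills block[0..7]
   demo_transform( hash, j->block, DEMO_IV );
}
```

Because `demo_transform` writes only the first 8 words of its output, the padding stored in `block[8..15]` survives across iterations. This is the same reason the patched loops can pass `block` directly as the output of the final-rounds call.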
--- a/algo/sha/sha256d-4way.h
+++ b/algo/sha/sha256d-4way.h
@@ -0,0 +1,46 @@
+#ifndef __SHA256D_4WAY_H__
+#define __SHA256D_4WAY_H__ 1
+
+#include <stdint.h>
+#include "algo-gate-api.h"
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define SHA256D_16WAY 1
+#elif defined(__AVX2__)
+  #define SHA256D_8WAY 1
+#else
+  #define SHA256D_4WAY 1
+#endif
+
+bool register_sha256d_algo( algo_gate_t* gate );
+
+#if defined(SHA256D_16WAY)
+
+int scanhash_sha256d_16way( struct work *work, uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr );
+#endif
+
+#if defined(SHA256D_8WAY)
+
+int scanhash_sha256d_8way( struct work *work, uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr );
+#endif
+
+#if defined(SHA256D_4WAY)
+
+int scanhash_sha256d_4way( struct work *work, uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr );
+#endif
+
+
+/*
+#if defined(__SHA__)
+
+int scanhash_sha256d( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr );
+
+#endif
+*/
+
+#endif
+
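Note on sha256d-4way.h above: the header picks one vector width at compile time from the compiler's feature macros, and `register_sha256d_algo` then stores a single function pointer in the gate. The sketch below shows that pattern in isolation. `demo_gate_t`, `demo_scanhash_fn` and the `demo_*` functions are hypothetical stand-ins, not the real `algo_gate_t` API, and only two widths are modelled (16-way and a fallback), matching the branches that are active in the patched `sha2.c`.

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical, simplified signature; the real scanhash takes a work struct.
typedef int (*demo_scanhash_fn)( uint32_t first_nonce, uint32_t max_nonce );

typedef struct
{
   demo_scanhash_fn scanhash;
   int              lanes;     // nonces tested per loop iteration
} demo_gate_t;

// Same feature test the header uses for the 16-way build.
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
  #define DEMO_16WAY 1
#endif

#if defined(DEMO_16WAY)

// Returns the number of loop iterations needed to cover the nonce range.
static int demo_scanhash_16way( uint32_t first_nonce, uint32_t max_nonce )
{
   return (int)( ( max_nonce - first_nonce ) / 16 );
}

#else

static int demo_scanhash_fallback( uint32_t first_nonce, uint32_t max_nonce )
{
   return (int)( max_nonce - first_nonce );
}

#endif

// Only the selected variant is compiled, so there is no run-time branch.
static void demo_register( demo_gate_t *gate )
{
#if defined(DEMO_16WAY)
   gate->scanhash = demo_scanhash_16way;
   gate->lanes    = 16;
#else
   gate->scanhash = demo_scanhash_fallback;
   gate->lanes    = 1;
#endif
}
```

The trade-off of this design is that the binary is tied to the feature set it was compiled for. That is presumably why the project ships separate builds per CPU architecture.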
--- a/algo/sha/sha256t-4way.c
+++ b/algo/sha/sha256t-4way.c
@@ -10,13 +10,14 @@
 int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
+   __m512i  vdata[32]    __attribute__ ((aligned (128)));
   __m512i  block[16]    __attribute__ ((aligned (64)));
-   __m512i  hash32[8]    __attribute__ ((aligned (32)));
-   __m512i  initstate[8] __attribute__ ((aligned (32)));
-   __m512i  midstate[8]  __attribute__ ((aligned (32)));
-   __m512i  midstate2[8] __attribute__ ((aligned (32)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m512i  vdata[20]    __attribute__ ((aligned (32)));
+   __m512i  hash32[8]    __attribute__ ((aligned (64)));
+   __m512i  initstate[8] __attribute__ ((aligned (64)));
+   __m512i  midstate1[8] __attribute__ ((aligned (64)));
+   __m512i  midstate2[8] __attribute__ ((aligned (64)));
+   __m512i  mexp_pre[16] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -31,12 +32,19 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
   const __m512i sixteen = m512_const1_32( 16 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m512_const1_32( pdata[i] );
+      vdata[i] = m512_const1_32( pdata[i] );

   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

-   // initialize state
+   vdata[16+4] = last_byte;
+   memset_zero_512( vdata+16 + 5, 10 );
+   vdata[16+15] = m512_const1_32( 80*8 ); // bit count
+   
+   block[ 8] = last_byte;
+   memset_zero_512( block + 9, 6 );
+   block[15] = m512_const1_32( 32*8 ); // bit count
+   
   initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
   initstate[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
@@ -46,47 +54,40 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
   initstate[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
   initstate[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );

-   // hash first 64 byte block of data
-   sha256_16way_transform_le( midstate, vdata, initstate );
-
+   sha256_16way_transform_le( midstate1, vdata, initstate );
+   
   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_16way_prehash_3rounds( midstate2, vdata + 16, midstate );
+   sha256_16way_prehash_3rounds( midstate2, mexp_pre, vdata+16, midstate1 );

   do
   {
-      // 1. final 16 bytes of data, with padding
-      memcpy_512( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_512( block + 5, 10 );  
-      block[15] = m512_const1_32( 80*8 ); // bit count
-      sha256_16way_final_rounds( hash32, block, midstate, midstate2 );
+      // 1. final 16 bytes of data, pre-padded
+      sha256_16way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                 mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_512( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_512( block + 9, 6 );
-      block[15] = m512_const1_32( 32*8 ); // bit count
-      sha256_16way_transform_le( hash32, block, initstate );
+      sha256_16way_transform_le( block, block, initstate );

      // 3. 32 byte hash from 2.
-      memcpy_512( block, hash32, 8 );
-      sha256_16way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm512_block_bswap_32( hash32, hash32 );    
-
-      for ( int lane = 0; lane < 16; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( unlikely(
+               sha256_16way_transform_le_short( hash32, block, initstate ) ) )
      {
-         extr_lane_16x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm512_block_bswap_32( hash32, hash32 );    
+
+         for ( int lane = 0; lane < 16; lane++ )
+         if ( hash32_d7[ lane ] <= targ32_d7 )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_16x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
-       *noncev = _mm512_add_epi32( *noncev, sixteen );
-       n += 16;
+      }
+      *noncev = _mm512_add_epi32( *noncev, sixteen );
+      n += 16;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
@@ -101,12 +102,14 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
 int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m256i  block[16]    __attribute__ ((aligned (64)));
+   __m256i  vdata[32]    __attribute__ ((aligned (64)));
+   __m256i  block[16]    __attribute__ ((aligned (32)));
   __m256i  hash32[8]    __attribute__ ((aligned (32)));
   __m256i  initstate[8] __attribute__ ((aligned (32)));
-   __m256i  midstate[8]  __attribute__ ((aligned (32)));
+   __m256i  midstate1[8] __attribute__ ((aligned (32)));
+   __m256i  midstate2[8] __attribute__ ((aligned (32)));
+   __m256i  mexp_pre[16] __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m256i  vdata[20]    __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -121,10 +124,18 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,
   const __m256i eight = m256_const1_32( 8 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m256_const1_32( pdata[i] );
+      vdata[i] = m256_const1_32( pdata[i] );

   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_256( vdata+16 + 5, 10 );
+   vdata[16+15] = m256_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_256( block + 9, 6 );
+   block[15] = m256_const1_32( 32*8 ); // bit count
+   
   // initialize state
   initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
@@ -135,44 +146,40 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,
   initstate[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
   initstate[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );

-   // hash first 64 bytes of data
-   sha256_8way_transform_le( midstate, vdata, initstate );
+   sha256_8way_transform_le( midstate1, vdata, initstate );

+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_8way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );
+   
   do
   {
      // 1. final 16 bytes of data, with padding
-      memcpy_256( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_256( block + 5, 10 );
-      block[15] = m256_const1_32( 80*8 ); // bit count
-      sha256_8way_transform_le( hash32, block, midstate );
+      sha256_8way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_256( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_256( block + 9, 6 );
-      block[15] = m256_const1_32( 32*8 ); // bit count
-      sha256_8way_transform_le( hash32, block, initstate );
+      sha256_8way_transform_le( block, block, initstate );

      // 3. 32 byte hash from 2.
-      memcpy_256( block, hash32, 8 );
-      sha256_8way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm256_block_bswap_32( hash32, hash32 );
-
-      for ( int lane = 0; lane < 8; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( unlikely(
+               sha256_8way_transform_le_short( hash32, block, initstate ) ) )
      {
-         extr_lane_8x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm256_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 8; lane++ )
+         if ( hash32_d7[ lane ] <= targ32_d7 )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_8x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
-       *noncev = _mm256_add_epi32( *noncev, eight );
-       n += 8;
+      }
+      *noncev = _mm256_add_epi32( *noncev, eight );
+      n += 8;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
@@ -181,17 +188,110 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,

 #endif

+
 #if defined(SHA256T_4WAY)

+// Optimizations are slower with AVX/SSE2
+// https://github.com/JayDDee/cpuminer-opt/issues/344
+/*
+int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m128i  vdata[32]     __attribute__ ((aligned (64)));
+   __m128i  block[16]     __attribute__ ((aligned (32)));
+   __m128i  hash32[8]     __attribute__ ((aligned (32)));
+   __m128i  initstate[8]  __attribute__ ((aligned (32)));
+   __m128i  midstate1[8]  __attribute__ ((aligned (32)));
+   __m128i  midstate2[8]  __attribute__ ((aligned (32)));
+   __m128i  mexp_pre[16]  __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8]  __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 = (uint32_t*)&( hash32[7] );
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   uint32_t n = first_nonce;
+   __m128i *noncev = vdata + 19;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m128i last_byte = m128_const1_32( 0x80000000 );
+   const __m128i four = m128_const1_32( 4 );
+
+   for ( int i = 0; i < 19; i++ )
+       vdata[i] = m128_const1_32( pdata[i] );
+
+   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
+
+   vdata[16+4] = last_byte;
+   memset_zero_128( vdata+16 + 5, 10 );
+   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_128( block + 9, 6 );
+   block[15] = m128_const1_32( 32*8 ); // bit count
+   
+   // initialize state
+   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
+   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
+   initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
+   initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
+   initstate[4] = m128_const1_64( 0x510E527F510E527F );
+   initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
+   initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
+   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
+
+   // hash first 64 bytes of data
+   sha256_4way_transform_le( midstate1, vdata, initstate );
+
+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_4way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );
+
+   do
+   {
+      // 1. final 16 bytes of data, with padding
+      sha256_4way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );
+
+      // 2. 32 byte hash from 1.
+      sha256_4way_transform_le( block, block, initstate );
+
+      // 3. 32 byte hash from 2.
+      if ( unlikely(
+              sha256_4way_transform_le_short( hash32, block, initstate ) ) )
+      {   
+         // byte swap final hash for testing
+         mm128_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 4; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+         {
+            extr_lane_4x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
+         }
+      }
+      *noncev = _mm_add_epi32( *noncev, four );
+      n += 4;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+*/
+
 int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m128i  block[16]    __attribute__ ((aligned (64)));
+   __m128i  vdata[32]    __attribute__ ((aligned (64)));
+   __m128i  block[16]    __attribute__ ((aligned (32)));
   __m128i  hash32[8]    __attribute__ ((aligned (32)));
   __m128i  initstate[8] __attribute__ ((aligned (32)));
   __m128i  midstate[8]  __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m128i  vdata[20]    __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -210,6 +310,14 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_128( vdata+16 + 5, 10 );
+   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_128( block + 9, 6 );
+   block[15] = m128_const1_32( 32*8 ); // bit count
+   
   // initialize state
   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
@@ -225,25 +333,9 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,

   do
   {
-      // 1. final 16 bytes of data, with padding
-      memcpy_128( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_128( block + 5, 10 );
-      block[15] = m128_const1_32( 80*8 ); // bit count
-      sha256_4way_transform_le( hash32, block, midstate );
-
-      // 2. 32 byte hash from 1.
-      memcpy_128( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_128( block + 9, 6 );
-      block[15] = m128_const1_32( 32*8 ); // bit count
-      sha256_4way_transform_le( hash32, block, initstate );
-
-      // 3. 32 byte hash from 2.
-      memcpy_128( block, hash32, 8 );
-      sha256_4way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
+      sha256_4way_transform_le( block,  vdata+16, midstate  );
+      sha256_4way_transform_le( block,  block,    initstate );
+      sha256_4way_transform_le( hash32, block,    initstate );
      mm128_block_bswap_32( hash32, hash32 );

      for ( int lane = 0; lane < 4; lane++ )
@@ -264,5 +356,6 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
   return 0;
 }

+
 #endif

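The hunks above move the SHA-256 padding out of the nonce loop: the terminator word, the zero fill and the bit count are written once into `vdata+16` and `block`, and each transform then overwrites only the leading words. A minimal scalar sketch of that layout for a single lane follows. The helper names are hypothetical, and the terminator value 0x80000000 is an assumption about `last_byte`, whose definition is not shown in this part of the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for one lane; all names here are hypothetical. */

/* Second block of an 80-byte header: 16 data bytes, terminator, zeros,
   then the message length in bits. Only word 3 (the nonce) varies. */
static void init_tail_block( uint32_t blk[16], const uint32_t tail[4] )
{
   memcpy( blk, tail, 4 * sizeof(uint32_t) );
   blk[4] = 0x80000000u;                        /* assumed last_byte value */
   memset( blk + 5, 0, 10 * sizeof(uint32_t) );
   blk[15] = 80 * 8;                            /* bit count */
}

/* Padding that follows a 32-byte digest when it is hashed again.
   Words 0..7 are left alone; a transform would write the digest there. */
static void init_digest_padding( uint32_t blk[16] )
{
   blk[8] = 0x80000000u;
   memset( blk + 9, 0, 6 * sizeof(uint32_t) );
   blk[15] = 32 * 8;                            /* bit count */
}

/* Set up once, then touch only the nonce word per iteration. */
static uint32_t scan_sketch( uint32_t first_nonce, uint32_t count )
{
   const uint32_t tail[4] = { 1, 2, 3, first_nonce };
   uint32_t blk[16];
   init_tail_block( blk, tail );
   for ( uint32_t n = first_nonce; n != first_nonce + count; n++ )
   {
      blk[3] = n;        /* a real loop would run the transforms here */
      assert( blk[4] == 0x80000000u && blk[15] == 640 );
   }
   return blk[3];
}
```

Because the first and second transforms write their 8-word result into `block[0..7]`, the padding in `block[8..15]` survives from one iteration to the next, which is what lets the `memcpy_128` and `memset_zero_128` calls leave the loop.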
--- a/algo/sha/sph_sha2.c
+++ b/algo/sha/sph_sha2.c
@@ -702,6 +702,36 @@ memcpy( state_out, state_in, 32 );

 }

+void sph_sha256_prehash_3rounds( uint32_t *state_out, const uint32_t *data,
+                                 const uint32_t *state_in )
+{
+   uint32_t t1, t2, X_xor_Y, Y_xor_Z = state_in[1] ^ state_in[2];
+   memcpy( state_out, state_in, 32 );
+
+   t1 = state_out[7] + BSG2_1( state_out[4] )
+      + CH( state_out[4], state_out[5], state_out[6] ) + 0x428A2F98 + data[0];
+   t2 = BSG2_0( state_out[0] )
+      + MAJ( state_out[0], state_out[1], state_out[2] );
+   Y_xor_Z = X_xor_Y;
+   state_out[3] += t1;
+   state_out[7] = t1 + t2;
+
+   t1 = state_out[6] + BSG2_1( state_out[3] )
+      + CH( state_out[3], state_out[4], state_out[5] ) + 0x71374491 + data[1];
+   t2 = BSG2_0( state_out[7] )
+      + MAJ( state_out[7], state_out[0], state_out[1] );
+   Y_xor_Z = X_xor_Y;
+   state_out[2] += t1;
+   state_out[6] = t1 + t2;
+
+   t1 = state_out[5] + BSG2_1( state_out[2] )
+      + CH( state_out[2], state_out[3], state_out[4] ) + 0xB5C0FBCF + data[2];
+   t2 = BSG2_0( state_out[6] )
+      + MAJ( state_out[6], state_out[7], state_out[0] );
+   state_out[1] += t1;
+   state_out[5] = t1 + t2;
+}
+
 /* see sph_sha2.h */
 void
 sph_sha224_init(void *cc)
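The new `sph_sha256_prehash_3rounds` relies on the first three SHA-256 rounds consuming only message words 0 to 2. In the second block of an 80-byte header those words stay fixed while the nonce, at word 3, changes, so the three rounds can be computed once per work unit. The sketch below shows the idea with a textbook round function. It is an illustration under stated assumptions, not the patch's code: `sha256_round` and `prehash_3rounds_sketch` are hypothetical names, and the sketch rotates the working variables after every round, whereas the patch leaves them in place and shifts their roles, so the two output layouts differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static inline uint32_t rotr32( uint32_t x, int n )
{
   return ( x >> n ) | ( x << ( 32 - n ) );
}

/* One standard SHA-256 round on working variables s[0..7] = a..h. */
static void sha256_round( uint32_t s[8], uint32_t k, uint32_t w )
{
   uint32_t a = s[0], b = s[1], c = s[2], e = s[4], f = s[5], g = s[6];
   uint32_t t1 = s[7] + ( rotr32( e, 6 ) ^ rotr32( e, 11 ) ^ rotr32( e, 25 ) )
               + ( ( e & f ) ^ ( ~e & g ) ) + k + w;
   uint32_t t2 = ( rotr32( a, 2 ) ^ rotr32( a, 13 ) ^ rotr32( a, 22 ) )
               + ( ( a & b ) ^ ( a & c ) ^ ( b & c ) );
   memmove( s + 1, s, 7 * sizeof(uint32_t) );   /* b..h = old a..g */
   s[4] += t1;                                   /* e = d + t1      */
   s[0] = t1 + t2;                               /* a = t1 + t2     */
}

/* Hypothetical counterpart of the prehash: rounds 0..2 read only w[0..2]. */
static void prehash_3rounds_sketch( uint32_t out[8], const uint32_t w[16],
                                    const uint32_t in[8] )
{
   static const uint32_t K[3] = { 0x428A2F98, 0x71374491, 0xB5C0FBCF };
   memcpy( out, in, 8 * sizeof(uint32_t) );
   for ( int i = 0; i < 3; i++ )
      sha256_round( out, K[i], w[i] );
}

/* Check that a change to the nonce word leaves the prehash untouched. */
static int nonce_independent( const uint32_t in[8], uint32_t w[16] )
{
   uint32_t r1[8], r2[8];
   prehash_3rounds_sketch( r1, w, in );
   w[3] ^= 0xdeadbeefu;                          /* vary the nonce word */
   prehash_3rounds_sketch( r2, w, in );
   return memcmp( r1, r2, sizeof r1 ) == 0;
}
```

A caller would run the prehash once when new work arrives and then start each nonce's transform at round 3 from the saved state.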
--- a/algo/sha/sph_sha2.h
+++ b/algo/sha/sph_sha2.h
@@ -215,6 +215,9 @@ void sph_sha256_transform_le( uint32_t *state_out, const uint32_t *data,
 void sph_sha256_transform_be( uint32_t *state_out, const uint32_t *data,
                              const uint32_t *state_in );

+void sph_sha256_prehash_3rounds( uint32_t *state_out, const uint32_t *data,
+                                 const uint32_t *state_in );
+

 #if SPH_64

--- a/algo/shabal/shabal-hash-4way.c
+++ b/algo/shabal/shabal-hash-4way.c
@@ -62,32 +62,34 @@ extern "C"{
 #if defined(__AVX2__)

 #define DECL_STATE8   \
-   __m256i A00, A01, A02, A03, A04, A05, A06, A07, \
-           A08, A09, A0A, A0B; \
+   __m256i A0, A1, A2, A3, A4, A5, A6, A7, \
+           A8, A9, AA, AB; \
   __m256i B0, B1, B2, B3, B4, B5, B6, B7, \
           B8, B9, BA, BB, BC, BD, BE, BF; \
   __m256i C0, C1, C2, C3, C4, C5, C6, C7, \
           C8, C9, CA, CB, CC, CD, CE, CF; \
   __m256i M0, M1, M2, M3, M4, M5, M6, M7, \
           M8, M9, MA, MB, MC, MD, ME, MF; \
+   const __m256i FIVE  = _mm256_set1_epi32( 5 ); \
+   const __m256i THREE = _mm256_set1_epi32( 3 ); \
   sph_u32 Wlow, Whigh;

 #define READ_STATE8(state) do \
 { \
   if ( (state)->state_loaded ) \
   { \
-      A00 = (state)->A[0]; \
-      A01 = (state)->A[1]; \
-      A02 = (state)->A[2]; \
-      A03 = (state)->A[3]; \
-      A04 = (state)->A[4]; \
-      A05 = (state)->A[5]; \
-      A06 = (state)->A[6]; \
-      A07 = (state)->A[7]; \
-      A08 = (state)->A[8]; \
-      A09 = (state)->A[9]; \
-      A0A = (state)->A[10]; \
-      A0B = (state)->A[11]; \
+      A0 = (state)->A[0]; \
+      A1 = (state)->A[1]; \
+      A2 = (state)->A[2]; \
+      A3 = (state)->A[3]; \
+      A4 = (state)->A[4]; \
+      A5 = (state)->A[5]; \
+      A6 = (state)->A[6]; \
+      A7 = (state)->A[7]; \
+      A8 = (state)->A[8]; \
+      A9 = (state)->A[9]; \
+      AA = (state)->A[10]; \
+      AB = (state)->A[11]; \
      B0 = (state)->B[0]; \
      B1 = (state)->B[1]; \
      B2 = (state)->B[2]; \
@@ -124,18 +126,18 @@ extern "C"{
   else \
   { \
       (state)->state_loaded = true; \
-       A00 = m256_const1_64( 0x20728DFD20728DFD ); \
-       A01 = m256_const1_64( 0x46C0BD5346C0BD53 ); \
-       A02 = m256_const1_64( 0xE782B699E782B699 ); \
-       A03 = m256_const1_64( 0x5530463255304632 ); \
-       A04 = m256_const1_64( 0x71B4EF9071B4EF90 ); \
-       A05 = m256_const1_64( 0x0EA9E82C0EA9E82C ); \
-       A06 = m256_const1_64( 0xDBB930F1DBB930F1 ); \
-       A07 = m256_const1_64( 0xFAD06B8BFAD06B8B ); \
-       A08 = m256_const1_64( 0xBE0CAE40BE0CAE40 ); \
-       A09 = m256_const1_64( 0x8BD144108BD14410 ); \
-       A0A = m256_const1_64( 0x76D2ADAC76D2ADAC ); \
-       A0B = m256_const1_64( 0x28ACAB7F28ACAB7F ); \
+       A0 = m256_const1_64( 0x20728DFD20728DFD ); \
+       A1 = m256_const1_64( 0x46C0BD5346C0BD53 ); \
+       A2 = m256_const1_64( 0xE782B699E782B699 ); \
+       A3 = m256_const1_64( 0x5530463255304632 ); \
+       A4 = m256_const1_64( 0x71B4EF9071B4EF90 ); \
+       A5 = m256_const1_64( 0x0EA9E82C0EA9E82C ); \
+       A6 = m256_const1_64( 0xDBB930F1DBB930F1 ); \
+       A7 = m256_const1_64( 0xFAD06B8BFAD06B8B ); \
+       A8 = m256_const1_64( 0xBE0CAE40BE0CAE40 ); \
+       A9 = m256_const1_64( 0x8BD144108BD14410 ); \
+       AA = m256_const1_64( 0x76D2ADAC76D2ADAC ); \
+       AB = m256_const1_64( 0x28ACAB7F28ACAB7F ); \
       B0 = m256_const1_64( 0xC1099CB7C1099CB7 ); \
       B1 = m256_const1_64( 0x07B385F307B385F3 ); \
       B2 = m256_const1_64( 0xE7442C26E7442C26 ); \
@@ -174,18 +176,18 @@ extern "C"{
 } while (0)

 #define WRITE_STATE8(state)   do { \
-      (state)->A[0] = A00; \
-      (state)->A[1] = A01; \
-      (state)->A[2] = A02; \
-      (state)->A[3] = A03; \
-      (state)->A[4] = A04; \
-      (state)->A[5] = A05; \
-      (state)->A[6] = A06; \
-      (state)->A[7] = A07; \
-      (state)->A[8] = A08; \
-      (state)->A[9] = A09; \
-      (state)->A[10] = A0A; \
-      (state)->A[11] = A0B; \
+      (state)->A[0] = A0; \
+      (state)->A[1] = A1; \
+      (state)->A[2] = A2; \
+      (state)->A[3] = A3; \
+      (state)->A[4] = A4; \
+      (state)->A[5] = A5; \
+      (state)->A[6] = A6; \
+      (state)->A[7] = A7; \
+      (state)->A[8] = A8; \
+      (state)->A[9] = A9; \
+      (state)->A[10] = AA; \
+      (state)->A[11] = AB; \
      (state)->B[0] = B0; \
      (state)->B[1] = B1; \
      (state)->B[2] = B2; \
@@ -284,8 +286,8 @@ do { \

 #define XOR_W8 \
 do { \
-   A00 = _mm256_xor_si256( A00, _mm256_set1_epi32( Wlow ) ); \
-   A01 = _mm256_xor_si256( A01, _mm256_set1_epi32( Whigh ) ); \
+   A0 = _mm256_xor_si256( A0, _mm256_set1_epi32( Wlow ) ); \
+   A1 = _mm256_xor_si256( A1, _mm256_set1_epi32( Whigh ) ); \
 } while (0)

 #define SWAP_BC8 \
@@ -314,66 +316,65 @@ do { \
            _mm256_andnot_si256( xb3, xb2 ), \
            _mm256_mullo_epi32( mm256_xor3( xa0, xc, \
               _mm256_mullo_epi32( mm256_rol_32( xa1, 15 ), \
-                                   _mm256_set1_epi32(5UL) ) ), \
-               _mm256_set1_epi32(3UL) ) ) ); \
+                                   FIVE ) ), THREE ) ) ); \
   xb0 = mm256_xnor( xa0, mm256_rol_32( xb0, 1 ) ); \
 } while (0)

 #define PERM_STEP_0_8   do { \
-      PERM_ELT8(A00, A0B, B0, BD, B9, B6, C8, M0); \
-      PERM_ELT8(A01, A00, B1, BE, BA, B7, C7, M1); \
-      PERM_ELT8(A02, A01, B2, BF, BB, B8, C6, M2); \
-      PERM_ELT8(A03, A02, B3, B0, BC, B9, C5, M3); \
-      PERM_ELT8(A04, A03, B4, B1, BD, BA, C4, M4); \
-      PERM_ELT8(A05, A04, B5, B2, BE, BB, C3, M5); \
-      PERM_ELT8(A06, A05, B6, B3, BF, BC, C2, M6); \
-      PERM_ELT8(A07, A06, B7, B4, B0, BD, C1, M7); \
-      PERM_ELT8(A08, A07, B8, B5, B1, BE, C0, M8); \
-      PERM_ELT8(A09, A08, B9, B6, B2, BF, CF, M9); \
-      PERM_ELT8(A0A, A09, BA, B7, B3, B0, CE, MA); \
-      PERM_ELT8(A0B, A0A, BB, B8, B4, B1, CD, MB); \
-      PERM_ELT8(A00, A0B, BC, B9, B5, B2, CC, MC); \
-      PERM_ELT8(A01, A00, BD, BA, B6, B3, CB, MD); \
-      PERM_ELT8(A02, A01, BE, BB, B7, B4, CA, ME); \
-      PERM_ELT8(A03, A02, BF, BC, B8, B5, C9, MF); \
+      PERM_ELT8(A0, AB, B0, BD, B9, B6, C8, M0); \
+      PERM_ELT8(A1, A0, B1, BE, BA, B7, C7, M1); \
+      PERM_ELT8(A2, A1, B2, BF, BB, B8, C6, M2); \
+      PERM_ELT8(A3, A2, B3, B0, BC, B9, C5, M3); \
+      PERM_ELT8(A4, A3, B4, B1, BD, BA, C4, M4); \
+      PERM_ELT8(A5, A4, B5, B2, BE, BB, C3, M5); \
+      PERM_ELT8(A6, A5, B6, B3, BF, BC, C2, M6); \
+      PERM_ELT8(A7, A6, B7, B4, B0, BD, C1, M7); \
+      PERM_ELT8(A8, A7, B8, B5, B1, BE, C0, M8); \
+      PERM_ELT8(A9, A8, B9, B6, B2, BF, CF, M9); \
+      PERM_ELT8(AA, A9, BA, B7, B3, B0, CE, MA); \
+      PERM_ELT8(AB, AA, BB, B8, B4, B1, CD, MB); \
+      PERM_ELT8(A0, AB, BC, B9, B5, B2, CC, MC); \
+      PERM_ELT8(A1, A0, BD, BA, B6, B3, CB, MD); \
+      PERM_ELT8(A2, A1, BE, BB, B7, B4, CA, ME); \
+      PERM_ELT8(A3, A2, BF, BC, B8, B5, C9, MF); \
   } while (0)

 #define PERM_STEP_1_8   do { \
-      PERM_ELT8(A04, A03, B0, BD, B9, B6, C8, M0); \
-      PERM_ELT8(A05, A04, B1, BE, BA, B7, C7, M1); \
-      PERM_ELT8(A06, A05, B2, BF, BB, B8, C6, M2); \
-      PERM_ELT8(A07, A06, B3, B0, BC, B9, C5, M3); \
-      PERM_ELT8(A08, A07, B4, B1, BD, BA, C4, M4); \
-      PERM_ELT8(A09, A08, B5, B2, BE, BB, C3, M5); \
-      PERM_ELT8(A0A, A09, B6, B3, BF, BC, C2, M6); \
-      PERM_ELT8(A0B, A0A, B7, B4, B0, BD, C1, M7); \
-      PERM_ELT8(A00, A0B, B8, B5, B1, BE, C0, M8); \
-      PERM_ELT8(A01, A00, B9, B6, B2, BF, CF, M9); \
-      PERM_ELT8(A02, A01, BA, B7, B3, B0, CE, MA); \
-      PERM_ELT8(A03, A02, BB, B8, B4, B1, CD, MB); \
-      PERM_ELT8(A04, A03, BC, B9, B5, B2, CC, MC); \
-      PERM_ELT8(A05, A04, BD, BA, B6, B3, CB, MD); \
-      PERM_ELT8(A06, A05, BE, BB, B7, B4, CA, ME); \
-      PERM_ELT8(A07, A06, BF, BC, B8, B5, C9, MF); \
+      PERM_ELT8(A4, A3, B0, BD, B9, B6, C8, M0); \
+      PERM_ELT8(A5, A4, B1, BE, BA, B7, C7, M1); \
+      PERM_ELT8(A6, A5, B2, BF, BB, B8, C6, M2); \
+      PERM_ELT8(A7, A6, B3, B0, BC, B9, C5, M3); \
+      PERM_ELT8(A8, A7, B4, B1, BD, BA, C4, M4); \
+      PERM_ELT8(A9, A8, B5, B2, BE, BB, C3, M5); \
+      PERM_ELT8(AA, A9, B6, B3, BF, BC, C2, M6); \
+      PERM_ELT8(AB, AA, B7, B4, B0, BD, C1, M7); \
+      PERM_ELT8(A0, AB, B8, B5, B1, BE, C0, M8); \
+      PERM_ELT8(A1, A0, B9, B6, B2, BF, CF, M9); \
+      PERM_ELT8(A2, A1, BA, B7, B3, B0, CE, MA); \
+      PERM_ELT8(A3, A2, BB, B8, B4, B1, CD, MB); \
+      PERM_ELT8(A4, A3, BC, B9, B5, B2, CC, MC); \
+      PERM_ELT8(A5, A4, BD, BA, B6, B3, CB, MD); \
+      PERM_ELT8(A6, A5, BE, BB, B7, B4, CA, ME); \
+      PERM_ELT8(A7, A6, BF, BC, B8, B5, C9, MF); \
   } while (0)

 #define PERM_STEP_2_8   do { \
-      PERM_ELT8(A08, A07, B0, BD, B9, B6, C8, M0); \
-      PERM_ELT8(A09, A08, B1, BE, BA, B7, C7, M1); \
-      PERM_ELT8(A0A, A09, B2, BF, BB, B8, C6, M2); \
-      PERM_ELT8(A0B, A0A, B3, B0, BC, B9, C5, M3); \
-      PERM_ELT8(A00, A0B, B4, B1, BD, BA, C4, M4); \
-      PERM_ELT8(A01, A00, B5, B2, BE, BB, C3, M5); \
-      PERM_ELT8(A02, A01, B6, B3, BF, BC, C2, M6); \
-      PERM_ELT8(A03, A02, B7, B4, B0, BD, C1, M7); \
-      PERM_ELT8(A04, A03, B8, B5, B1, BE, C0, M8); \
-      PERM_ELT8(A05, A04, B9, B6, B2, BF, CF, M9); \
-      PERM_ELT8(A06, A05, BA, B7, B3, B0, CE, MA); \
-      PERM_ELT8(A07, A06, BB, B8, B4, B1, CD, MB); \
-      PERM_ELT8(A08, A07, BC, B9, B5, B2, CC, MC); \
-      PERM_ELT8(A09, A08, BD, BA, B6, B3, CB, MD); \
-      PERM_ELT8(A0A, A09, BE, BB, B7, B4, CA, ME); \
-      PERM_ELT8(A0B, A0A, BF, BC, B8, B5, C9, MF); \
+      PERM_ELT8(A8, A7, B0, BD, B9, B6, C8, M0); \
+      PERM_ELT8(A9, A8, B1, BE, BA, B7, C7, M1); \
+      PERM_ELT8(AA, A9, B2, BF, BB, B8, C6, M2); \
+      PERM_ELT8(AB, AA, B3, B0, BC, B9, C5, M3); \
+      PERM_ELT8(A0, AB, B4, B1, BD, BA, C4, M4); \
+      PERM_ELT8(A1, A0, B5, B2, BE, BB, C3, M5); \
+      PERM_ELT8(A2, A1, B6, B3, BF, BC, C2, M6); \
+      PERM_ELT8(A3, A2, B7, B4, B0, BD, C1, M7); \
+      PERM_ELT8(A4, A3, B8, B5, B1, BE, C0, M8); \
+      PERM_ELT8(A5, A4, B9, B6, B2, BF, CF, M9); \
+      PERM_ELT8(A6, A5, BA, B7, B3, B0, CE, MA); \
+      PERM_ELT8(A7, A6, BB, B8, B4, B1, CD, MB); \
+      PERM_ELT8(A8, A7, BC, B9, B5, B2, CC, MC); \
+      PERM_ELT8(A9, A8, BD, BA, B6, B3, CB, MD); \
+      PERM_ELT8(AA, A9, BE, BB, B7, B4, CA, ME); \
+      PERM_ELT8(AB, AA, BF, BC, B8, B5, C9, MF); \
   } while (0)

 #define APPLY_P8 \
@@ -397,42 +398,42 @@ do { \
    PERM_STEP_0_8; \
    PERM_STEP_1_8; \
    PERM_STEP_2_8; \
-    A0B = _mm256_add_epi32( A0B, C6 ); \
-    A0A = _mm256_add_epi32( A0A, C5 ); \
-    A09 = _mm256_add_epi32( A09, C4 ); \
-    A08 = _mm256_add_epi32( A08, C3 ); \
-    A07 = _mm256_add_epi32( A07, C2 ); \
-    A06 = _mm256_add_epi32( A06, C1 ); \
-    A05 = _mm256_add_epi32( A05, C0 ); \
-    A04 = _mm256_add_epi32( A04, CF ); \
-    A03 = _mm256_add_epi32( A03, CE ); \
-    A02 = _mm256_add_epi32( A02, CD ); \
-    A01 = _mm256_add_epi32( A01, CC ); \
-    A00 = _mm256_add_epi32( A00, CB ); \
-    A0B = _mm256_add_epi32( A0B, CA ); \
-    A0A = _mm256_add_epi32( A0A, C9 ); \
-    A09 = _mm256_add_epi32( A09, C8 ); \
-    A08 = _mm256_add_epi32( A08, C7 ); \
-    A07 = _mm256_add_epi32( A07, C6 ); \
-    A06 = _mm256_add_epi32( A06, C5 ); \
-    A05 = _mm256_add_epi32( A05, C4 ); \
-    A04 = _mm256_add_epi32( A04, C3 ); \
-    A03 = _mm256_add_epi32( A03, C2 ); \
-    A02 = _mm256_add_epi32( A02, C1 ); \
-    A01 = _mm256_add_epi32( A01, C0 ); \
-    A00 = _mm256_add_epi32( A00, CF ); \
-    A0B = _mm256_add_epi32( A0B, CE ); \
-    A0A = _mm256_add_epi32( A0A, CD ); \
-    A09 = _mm256_add_epi32( A09, CC ); \
-    A08 = _mm256_add_epi32( A08, CB ); \
-    A07 = _mm256_add_epi32( A07, CA ); \
-    A06 = _mm256_add_epi32( A06, C9 ); \
-    A05 = _mm256_add_epi32( A05, C8 ); \
-    A04 = _mm256_add_epi32( A04, C7 ); \
-    A03 = _mm256_add_epi32( A03, C6 ); \
-    A02 = _mm256_add_epi32( A02, C5 ); \
-    A01 = _mm256_add_epi32( A01, C4 ); \
-    A00 = _mm256_add_epi32( A00, C3 ); \
+    AB = _mm256_add_epi32( AB, C6 ); \
+    AA = _mm256_add_epi32( AA, C5 ); \
+    A9 = _mm256_add_epi32( A9, C4 ); \
+    A8 = _mm256_add_epi32( A8, C3 ); \
+    A7 = _mm256_add_epi32( A7, C2 ); \
+    A6 = _mm256_add_epi32( A6, C1 ); \
+    A5 = _mm256_add_epi32( A5, C0 ); \
+    A4 = _mm256_add_epi32( A4, CF ); \
+    A3 = _mm256_add_epi32( A3, CE ); \
+    A2 = _mm256_add_epi32( A2, CD ); \
+    A1 = _mm256_add_epi32( A1, CC ); \
+    A0 = _mm256_add_epi32( A0, CB ); \
+    AB = _mm256_add_epi32( AB, CA ); \
+    AA = _mm256_add_epi32( AA, C9 ); \
+    A9 = _mm256_add_epi32( A9, C8 ); \
+    A8 = _mm256_add_epi32( A8, C7 ); \
+    A7 = _mm256_add_epi32( A7, C6 ); \
+    A6 = _mm256_add_epi32( A6, C5 ); \
+    A5 = _mm256_add_epi32( A5, C4 ); \
+    A4 = _mm256_add_epi32( A4, C3 ); \
+    A3 = _mm256_add_epi32( A3, C2 ); \
+    A2 = _mm256_add_epi32( A2, C1 ); \
+    A1 = _mm256_add_epi32( A1, C0 ); \
+    A0 = _mm256_add_epi32( A0, CF ); \
+    AB = _mm256_add_epi32( AB, CE ); \
+    AA = _mm256_add_epi32( AA, CD ); \
+    A9 = _mm256_add_epi32( A9, CC ); \
+    A8 = _mm256_add_epi32( A8, CB ); \
+    A7 = _mm256_add_epi32( A7, CA ); \
+    A6 = _mm256_add_epi32( A6, C9 ); \
+    A5 = _mm256_add_epi32( A5, C8 ); \
+    A4 = _mm256_add_epi32( A4, C7 ); \
+    A3 = _mm256_add_epi32( A3, C6 ); \
+    A2 = _mm256_add_epi32( A2, C5 ); \
+    A1 = _mm256_add_epi32( A1, C4 ); \
+    A0 = _mm256_add_epi32( A0, C3 ); \
 } while (0)

 #define INCR_W8   do { \
@@ -659,32 +660,34 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)


 #define DECL_STATE   \
-	__m128i A00, A01, A02, A03, A04, A05, A06, A07, \
-	        A08, A09, A0A, A0B; \
+	__m128i A0, A1, A2, A3, A4, A5, A6, A7, \
+	        A8, A9, AA, AB; \
 	__m128i B0, B1, B2, B3, B4, B5, B6, B7, \
 	        B8, B9, BA, BB, BC, BD, BE, BF; \
 	__m128i C0, C1, C2, C3, C4, C5, C6, C7, \
 	        C8, C9, CA, CB, CC, CD, CE, CF; \
 	__m128i M0, M1, M2, M3, M4, M5, M6, M7, \
 	        M8, M9, MA, MB, MC, MD, ME, MF; \
-	sph_u32 Wlow, Whigh;
+   const __m128i FIVE  = _mm_set1_epi32( 5 ); \
+   const __m128i THREE = _mm_set1_epi32( 3 ); \
+   sph_u32 Wlow, Whigh;

 #define READ_STATE(state) do \
 { \
   if ( (state)->state_loaded ) \
   { \
-      A00 = (state)->A[0]; \
-		A01 = (state)->A[1]; \
-		A02 = (state)->A[2]; \
-		A03 = (state)->A[3]; \
-		A04 = (state)->A[4]; \
-		A05 = (state)->A[5]; \
-		A06 = (state)->A[6]; \
-		A07 = (state)->A[7]; \
-		A08 = (state)->A[8]; \
-		A09 = (state)->A[9]; \
-		A0A = (state)->A[10]; \
-		A0B = (state)->A[11]; \
+      A0 = (state)->A[0]; \
+		A1 = (state)->A[1]; \
+		A2 = (state)->A[2]; \
+		A3 = (state)->A[3]; \
+		A4 = (state)->A[4]; \
+		A5 = (state)->A[5]; \
+		A6 = (state)->A[6]; \
+		A7 = (state)->A[7]; \
+		A8 = (state)->A[8]; \
+		A9 = (state)->A[9]; \
+		AA = (state)->A[10]; \
+		AB = (state)->A[11]; \
 		B0 = (state)->B[0]; \
 		B1 = (state)->B[1]; \
 		B2 = (state)->B[2]; \
@@ -721,18 +724,18 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
   else \
   { \
       (state)->state_loaded = true; \
-       A00 = m128_const1_64( 0x20728DFD20728DFD ); \
-       A01 = m128_const1_64( 0x46C0BD5346C0BD53 ); \
-       A02 = m128_const1_64( 0xE782B699E782B699 ); \
-       A03 = m128_const1_64( 0x5530463255304632 ); \
-       A04 = m128_const1_64( 0x71B4EF9071B4EF90 ); \
-       A05 = m128_const1_64( 0x0EA9E82C0EA9E82C ); \
-       A06 = m128_const1_64( 0xDBB930F1DBB930F1 ); \
-       A07 = m128_const1_64( 0xFAD06B8BFAD06B8B ); \
-       A08 = m128_const1_64( 0xBE0CAE40BE0CAE40 ); \
-       A09 = m128_const1_64( 0x8BD144108BD14410 ); \
-       A0A = m128_const1_64( 0x76D2ADAC76D2ADAC ); \
-       A0B = m128_const1_64( 0x28ACAB7F28ACAB7F ); \
+       A0 = m128_const1_64( 0x20728DFD20728DFD ); \
+       A1 = m128_const1_64( 0x46C0BD5346C0BD53 ); \
+       A2 = m128_const1_64( 0xE782B699E782B699 ); \
+       A3 = m128_const1_64( 0x5530463255304632 ); \
+       A4 = m128_const1_64( 0x71B4EF9071B4EF90 ); \
+       A5 = m128_const1_64( 0x0EA9E82C0EA9E82C ); \
+       A6 = m128_const1_64( 0xDBB930F1DBB930F1 ); \
+       A7 = m128_const1_64( 0xFAD06B8BFAD06B8B ); \
+       A8 = m128_const1_64( 0xBE0CAE40BE0CAE40 ); \
+       A9 = m128_const1_64( 0x8BD144108BD14410 ); \
+       AA = m128_const1_64( 0x76D2ADAC76D2ADAC ); \
+       AB = m128_const1_64( 0x28ACAB7F28ACAB7F ); \
       B0 = m128_const1_64( 0xC1099CB7C1099CB7 ); \
       B1 = m128_const1_64( 0x07B385F307B385F3 ); \
       B2 = m128_const1_64( 0xE7442C26E7442C26 ); \
@@ -771,18 +774,18 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
 } while (0)

 #define WRITE_STATE(state)   do { \
-		(state)->A[0] = A00; \
-		(state)->A[1] = A01; \
-		(state)->A[2] = A02; \
-		(state)->A[3] = A03; \
-		(state)->A[4] = A04; \
-		(state)->A[5] = A05; \
-		(state)->A[6] = A06; \
-		(state)->A[7] = A07; \
-		(state)->A[8] = A08; \
-		(state)->A[9] = A09; \
-		(state)->A[10] = A0A; \
-		(state)->A[11] = A0B; \
+		(state)->A[0] = A0; \
+		(state)->A[1] = A1; \
+		(state)->A[2] = A2; \
+		(state)->A[3] = A3; \
+		(state)->A[4] = A4; \
+		(state)->A[5] = A5; \
+		(state)->A[6] = A6; \
+		(state)->A[7] = A7; \
+		(state)->A[8] = A8; \
+		(state)->A[9] = A9; \
+		(state)->A[10] = AA; \
+		(state)->A[11] = AB; \
 		(state)->B[0] = B0; \
 		(state)->B[1] = B1; \
 		(state)->B[2] = B2; \
@@ -881,8 +884,8 @@ do { \

 #define XOR_W \
 do { \
-   A00 = _mm_xor_si128( A00, _mm_set1_epi32( Wlow ) ); \
-   A01 = _mm_xor_si128( A01, _mm_set1_epi32( Whigh ) ); \
+   A0 = _mm_xor_si128( A0, _mm_set1_epi32( Wlow ) ); \
+   A1 = _mm_xor_si128( A1, _mm_set1_epi32( Whigh ) ); \
 } while (0)


@@ -931,66 +934,66 @@ do { \
   xa0 = _mm_xor_si128( xm, _mm_xor_si128( xb1, _mm_xor_si128(  \
            _mm_andnot_si128( xb3, xb2 ), \
            _mm_mullo_epi32( _mm_xor_si128( xa0, _mm_xor_si128( xc, \
-               _mm_mullo_epi32(  mm128_rol_32( xa1, 15 ), _mm_set1_epi32(5UL) ) \
-                   ) ), _mm_set1_epi32(3UL) ) ) ) ); \
+               _mm_mullo_epi32(  mm128_rol_32( xa1, 15 ), FIVE ) \
+                   ) ), THREE ) ) ) ); \
   xb0 = mm128_not( _mm_xor_si128( xa0, mm128_rol_32( xb0, 1 ) ) ); \
 } while (0)

 #define PERM_STEP_0   do { \
-		PERM_ELT(A00, A0B, B0, BD, B9, B6, C8, M0); \
-		PERM_ELT(A01, A00, B1, BE, BA, B7, C7, M1); \
-		PERM_ELT(A02, A01, B2, BF, BB, B8, C6, M2); \
-		PERM_ELT(A03, A02, B3, B0, BC, B9, C5, M3); \
-		PERM_ELT(A04, A03, B4, B1, BD, BA, C4, M4); \
-		PERM_ELT(A05, A04, B5, B2, BE, BB, C3, M5); \
-		PERM_ELT(A06, A05, B6, B3, BF, BC, C2, M6); \
-		PERM_ELT(A07, A06, B7, B4, B0, BD, C1, M7); \
-		PERM_ELT(A08, A07, B8, B5, B1, BE, C0, M8); \
-		PERM_ELT(A09, A08, B9, B6, B2, BF, CF, M9); \
-		PERM_ELT(A0A, A09, BA, B7, B3, B0, CE, MA); \
-		PERM_ELT(A0B, A0A, BB, B8, B4, B1, CD, MB); \
-		PERM_ELT(A00, A0B, BC, B9, B5, B2, CC, MC); \
-		PERM_ELT(A01, A00, BD, BA, B6, B3, CB, MD); \
-		PERM_ELT(A02, A01, BE, BB, B7, B4, CA, ME); \
-		PERM_ELT(A03, A02, BF, BC, B8, B5, C9, MF); \
+		PERM_ELT(A0, AB, B0, BD, B9, B6, C8, M0); \
+		PERM_ELT(A1, A0, B1, BE, BA, B7, C7, M1); \
+		PERM_ELT(A2, A1, B2, BF, BB, B8, C6, M2); \
+		PERM_ELT(A3, A2, B3, B0, BC, B9, C5, M3); \
+		PERM_ELT(A4, A3, B4, B1, BD, BA, C4, M4); \
+		PERM_ELT(A5, A4, B5, B2, BE, BB, C3, M5); \
+		PERM_ELT(A6, A5, B6, B3, BF, BC, C2, M6); \
+		PERM_ELT(A7, A6, B7, B4, B0, BD, C1, M7); \
+		PERM_ELT(A8, A7, B8, B5, B1, BE, C0, M8); \
+		PERM_ELT(A9, A8, B9, B6, B2, BF, CF, M9); \
+		PERM_ELT(AA, A9, BA, B7, B3, B0, CE, MA); \
+		PERM_ELT(AB, AA, BB, B8, B4, B1, CD, MB); \
+		PERM_ELT(A0, AB, BC, B9, B5, B2, CC, MC); \
+		PERM_ELT(A1, A0, BD, BA, B6, B3, CB, MD); \
+		PERM_ELT(A2, A1, BE, BB, B7, B4, CA, ME); \
+		PERM_ELT(A3, A2, BF, BC, B8, B5, C9, MF); \
 	} while (0)

 #define PERM_STEP_1   do { \
-		PERM_ELT(A04, A03, B0, BD, B9, B6, C8, M0); \
-		PERM_ELT(A05, A04, B1, BE, BA, B7, C7, M1); \
-		PERM_ELT(A06, A05, B2, BF, BB, B8, C6, M2); \
-		PERM_ELT(A07, A06, B3, B0, BC, B9, C5, M3); \
-		PERM_ELT(A08, A07, B4, B1, BD, BA, C4, M4); \
-		PERM_ELT(A09, A08, B5, B2, BE, BB, C3, M5); \
-		PERM_ELT(A0A, A09, B6, B3, BF, BC, C2, M6); \
-		PERM_ELT(A0B, A0A, B7, B4, B0, BD, C1, M7); \
-		PERM_ELT(A00, A0B, B8, B5, B1, BE, C0, M8); \
-		PERM_ELT(A01, A00, B9, B6, B2, BF, CF, M9); \
-		PERM_ELT(A02, A01, BA, B7, B3, B0, CE, MA); \
-		PERM_ELT(A03, A02, BB, B8, B4, B1, CD, MB); \
-		PERM_ELT(A04, A03, BC, B9, B5, B2, CC, MC); \
-		PERM_ELT(A05, A04, BD, BA, B6, B3, CB, MD); \
-		PERM_ELT(A06, A05, BE, BB, B7, B4, CA, ME); \
-		PERM_ELT(A07, A06, BF, BC, B8, B5, C9, MF); \
+		PERM_ELT(A4, A3, B0, BD, B9, B6, C8, M0); \
+		PERM_ELT(A5, A4, B1, BE, BA, B7, C7, M1); \
+		PERM_ELT(A6, A5, B2, BF, BB, B8, C6, M2); \
+		PERM_ELT(A7, A6, B3, B0, BC, B9, C5, M3); \
+		PERM_ELT(A8, A7, B4, B1, BD, BA, C4, M4); \
+		PERM_ELT(A9, A8, B5, B2, BE, BB, C3, M5); \
+		PERM_ELT(AA, A9, B6, B3, BF, BC, C2, M6); \
+		PERM_ELT(AB, AA, B7, B4, B0, BD, C1, M7); \
+		PERM_ELT(A0, AB, B8, B5, B1, BE, C0, M8); \
+		PERM_ELT(A1, A0, B9, B6, B2, BF, CF, M9); \
+		PERM_ELT(A2, A1, BA, B7, B3, B0, CE, MA); \
+		PERM_ELT(A3, A2, BB, B8, B4, B1, CD, MB); \
+		PERM_ELT(A4, A3, BC, B9, B5, B2, CC, MC); \
+		PERM_ELT(A5, A4, BD, BA, B6, B3, CB, MD); \
+		PERM_ELT(A6, A5, BE, BB, B7, B4, CA, ME); \
+		PERM_ELT(A7, A6, BF, BC, B8, B5, C9, MF); \
 	} while (0)

 #define PERM_STEP_2   do { \
-		PERM_ELT(A08, A07, B0, BD, B9, B6, C8, M0); \
-		PERM_ELT(A09, A08, B1, BE, BA, B7, C7, M1); \
-		PERM_ELT(A0A, A09, B2, BF, BB, B8, C6, M2); \
-		PERM_ELT(A0B, A0A, B3, B0, BC, B9, C5, M3); \
-		PERM_ELT(A00, A0B, B4, B1, BD, BA, C4, M4); \
-		PERM_ELT(A01, A00, B5, B2, BE, BB, C3, M5); \
-		PERM_ELT(A02, A01, B6, B3, BF, BC, C2, M6); \
-		PERM_ELT(A03, A02, B7, B4, B0, BD, C1, M7); \
-		PERM_ELT(A04, A03, B8, B5, B1, BE, C0, M8); \
-		PERM_ELT(A05, A04, B9, B6, B2, BF, CF, M9); \
-		PERM_ELT(A06, A05, BA, B7, B3, B0, CE, MA); \
-		PERM_ELT(A07, A06, BB, B8, B4, B1, CD, MB); \
-		PERM_ELT(A08, A07, BC, B9, B5, B2, CC, MC); \
-		PERM_ELT(A09, A08, BD, BA, B6, B3, CB, MD); \
-		PERM_ELT(A0A, A09, BE, BB, B7, B4, CA, ME); \
-		PERM_ELT(A0B, A0A, BF, BC, B8, B5, C9, MF); \
+		PERM_ELT(A8, A7, B0, BD, B9, B6, C8, M0); \
+		PERM_ELT(A9, A8, B1, BE, BA, B7, C7, M1); \
+		PERM_ELT(AA, A9, B2, BF, BB, B8, C6, M2); \
+		PERM_ELT(AB, AA, B3, B0, BC, B9, C5, M3); \
+		PERM_ELT(A0, AB, B4, B1, BD, BA, C4, M4); \
+		PERM_ELT(A1, A0, B5, B2, BE, BB, C3, M5); \
+		PERM_ELT(A2, A1, B6, B3, BF, BC, C2, M6); \
+		PERM_ELT(A3, A2, B7, B4, B0, BD, C1, M7); \
+		PERM_ELT(A4, A3, B8, B5, B1, BE, C0, M8); \
+		PERM_ELT(A5, A4, B9, B6, B2, BF, CF, M9); \
+		PERM_ELT(A6, A5, BA, B7, B3, B0, CE, MA); \
+		PERM_ELT(A7, A6, BB, B8, B4, B1, CD, MB); \
+		PERM_ELT(A8, A7, BC, B9, B5, B2, CC, MC); \
+		PERM_ELT(A9, A8, BD, BA, B6, B3, CB, MD); \
+		PERM_ELT(AA, A9, BE, BB, B7, B4, CA, ME); \
+		PERM_ELT(AB, AA, BF, BC, B8, B5, C9, MF); \
 	} while (0)

 #define APPLY_P \
@@ -1014,42 +1017,42 @@ do { \
    PERM_STEP_0; \
    PERM_STEP_1; \
    PERM_STEP_2; \
-    A0B = _mm_add_epi32( A0B, C6 ); \
-    A0A = _mm_add_epi32( A0A, C5 ); \
-    A09 = _mm_add_epi32( A09, C4 ); \
-    A08 = _mm_add_epi32( A08, C3 ); \
-    A07 = _mm_add_epi32( A07, C2 ); \
-    A06 = _mm_add_epi32( A06, C1 ); \
-    A05 = _mm_add_epi32( A05, C0 ); \
-    A04 = _mm_add_epi32( A04, CF ); \
-    A03 = _mm_add_epi32( A03, CE ); \
-    A02 = _mm_add_epi32( A02, CD ); \
-    A01 = _mm_add_epi32( A01, CC ); \
-    A00 = _mm_add_epi32( A00, CB ); \
-    A0B = _mm_add_epi32( A0B, CA ); \
-    A0A = _mm_add_epi32( A0A, C9 ); \
-    A09 = _mm_add_epi32( A09, C8 ); \
-    A08 = _mm_add_epi32( A08, C7 ); \
-    A07 = _mm_add_epi32( A07, C6 ); \
-    A06 = _mm_add_epi32( A06, C5 ); \
-    A05 = _mm_add_epi32( A05, C4 ); \
-    A04 = _mm_add_epi32( A04, C3 ); \
-    A03 = _mm_add_epi32( A03, C2 ); \
-    A02 = _mm_add_epi32( A02, C1 ); \
-    A01 = _mm_add_epi32( A01, C0 ); \
-    A00 = _mm_add_epi32( A00, CF ); \
-    A0B = _mm_add_epi32( A0B, CE ); \
-    A0A = _mm_add_epi32( A0A, CD ); \
-    A09 = _mm_add_epi32( A09, CC ); \
-    A08 = _mm_add_epi32( A08, CB ); \
-    A07 = _mm_add_epi32( A07, CA ); \
-    A06 = _mm_add_epi32( A06, C9 ); \
-    A05 = _mm_add_epi32( A05, C8 ); \
-    A04 = _mm_add_epi32( A04, C7 ); \
-    A03 = _mm_add_epi32( A03, C6 ); \
-    A02 = _mm_add_epi32( A02, C5 ); \
-    A01 = _mm_add_epi32( A01, C4 ); \
-    A00 = _mm_add_epi32( A00, C3 ); \
+    AB = _mm_add_epi32( AB, C6 ); \
+    AA = _mm_add_epi32( AA, C5 ); \
+    A9 = _mm_add_epi32( A9, C4 ); \
+    A8 = _mm_add_epi32( A8, C3 ); \
+    A7 = _mm_add_epi32( A7, C2 ); \
+    A6 = _mm_add_epi32( A6, C1 ); \
+    A5 = _mm_add_epi32( A5, C0 ); \
+    A4 = _mm_add_epi32( A4, CF ); \
+    A3 = _mm_add_epi32( A3, CE ); \
+    A2 = _mm_add_epi32( A2, CD ); \
+    A1 = _mm_add_epi32( A1, CC ); \
+    A0 = _mm_add_epi32( A0, CB ); \
+    AB = _mm_add_epi32( AB, CA ); \
+    AA = _mm_add_epi32( AA, C9 ); \
+    A9 = _mm_add_epi32( A9, C8 ); \
+    A8 = _mm_add_epi32( A8, C7 ); \
+    A7 = _mm_add_epi32( A7, C6 ); \
+    A6 = _mm_add_epi32( A6, C5 ); \
+    A5 = _mm_add_epi32( A5, C4 ); \
+    A4 = _mm_add_epi32( A4, C3 ); \
+    A3 = _mm_add_epi32( A3, C2 ); \
+    A2 = _mm_add_epi32( A2, C1 ); \
+    A1 = _mm_add_epi32( A1, C0 ); \
+    A0 = _mm_add_epi32( A0, CF ); \
+    AB = _mm_add_epi32( AB, CE ); \
+    AA = _mm_add_epi32( AA, CD ); \
+    A9 = _mm_add_epi32( A9, CC ); \
+    A8 = _mm_add_epi32( A8, CB ); \
+    A7 = _mm_add_epi32( A7, CA ); \
+    A6 = _mm_add_epi32( A6, C9 ); \
+    A5 = _mm_add_epi32( A5, C8 ); \
+    A4 = _mm_add_epi32( A4, C7 ); \
+    A3 = _mm_add_epi32( A3, C6 ); \
+    A2 = _mm_add_epi32( A2, C5 ); \
+    A1 = _mm_add_epi32( A1, C4 ); \
+    A0 = _mm_add_epi32( A0, C3 ); \
 } while (0)

 #define INCR_W   do { \
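For readers tracing the `PERM_ELT` / `PERM_ELT8` change, the two statements shown in those hunks can be modelled per lane with plain 32-bit arithmetic. The sketch below is a scalar illustration only. The function name is hypothetical, and it covers just the two statements visible in this diff, not any part of the macro that lies outside the hunk context. `FIVE` and `THREE` mirror the constants that the patch now declares once in `DECL_STATE` and `DECL_STATE8` instead of rebuilding them with `_mm_set1_epi32` inside every element.

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t rol32( uint32_t x, int n )
{
   return ( x << n ) | ( x >> ( 32 - n ) );
}

/* Scalar model of the xa0 and xb0 updates shown in PERM_ELT. */
static void perm_elt_tail_sketch( uint32_t *xa0, uint32_t xa1, uint32_t *xb0,
                                  uint32_t xb1, uint32_t xb2, uint32_t xb3,
                                  uint32_t xc, uint32_t xm )
{
   const uint32_t FIVE = 5, THREE = 3;   /* declared once, as in DECL_STATE */
   *xa0 = xm ^ xb1 ^ ( ~xb3 & xb2 )
        ^ ( ( *xa0 ^ xc ^ ( rol32( xa1, 15 ) * FIVE ) ) * THREE );
   *xb0 = ~( *xa0 ^ rol32( *xb0, 1 ) );
}

/* Small self-check with hand-computed values. */
static void perm_elt_selfcheck( void )
{
   uint32_t a = 0, b = 0;
   perm_elt_tail_sketch( &a, 1, &b, 0, 0, 0, 0, 0 );
   assert( a == 0x00078000u );   /* ( (1 << 15) * 5 ) * 3 */
   assert( b == 0xFFF87FFFu );   /* ~a, since rol32( 0, 1 ) == 0 */
}
```

The multiplications wrap modulo 2^32, which matches the low-half result that `_mm_mullo_epi32` and `_mm256_mullo_epi32` keep per element.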
--- a/algo/shavite/shavite-hash-2way.c
+++ b/algo/shavite/shavite-hash-2way.c
@@ -18,10 +18,13 @@ static const uint32_t IV512[] =
        0xE275EADE, 0x502D9FCD, 0xB9357178, 0x022A4B9A
 };

-
+/*
 #define mm256_ror2x256hi_1x32( a, b ) \
   _mm256_blend_epi32( mm256_shuflr128_32( a ), \
                       mm256_shuflr128_32( b ), 0x88 )
+*/
+
+//#define mm256_ror2x256hi_1x32( a, b ) _mm256_alignr_epi8( b, a, 4 )

 #if defined(__VAES__)

@@ -127,24 +130,24 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
     
     // round 2, 6, 10

-     k00 = _mm256_xor_si256( k00, mm256_ror2x256hi_1x32( k12, k13 ) );
+     k00 = _mm256_xor_si256( k00, _mm256_alignr_epi8( k13, k12, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( p3, k00 ), zero );
-     k01 = _mm256_xor_si256( k01, mm256_ror2x256hi_1x32( k13, k00 ) );
+     k01 = _mm256_xor_si256( k01, _mm256_alignr_epi8( k00, k13, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k01 ), zero );
-     k02 = _mm256_xor_si256( k02, mm256_ror2x256hi_1x32( k00, k01 ) );
+     k02 = _mm256_xor_si256( k02, _mm256_alignr_epi8( k01, k00, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k02 ), zero );
-     k03 = _mm256_xor_si256( k03, mm256_ror2x256hi_1x32( k01, k02 ) );
+     k03 = _mm256_xor_si256( k03, _mm256_alignr_epi8( k02, k01, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k03 ), zero );

     p2 = _mm256_xor_si256( p2, x );

-     k10 = _mm256_xor_si256( k10, mm256_ror2x256hi_1x32( k02, k03 ) );
+     k10 = _mm256_xor_si256( k10, _mm256_alignr_epi8( k03, k02, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( p1, k10 ), zero );
-     k11 = _mm256_xor_si256( k11, mm256_ror2x256hi_1x32( k03, k10 ) );
+     k11 = _mm256_xor_si256( k11, _mm256_alignr_epi8( k10, k03, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k11 ), zero );
-     k12 = _mm256_xor_si256( k12, mm256_ror2x256hi_1x32( k10, k11 ) );
+     k12 = _mm256_xor_si256( k12, _mm256_alignr_epi8( k11, k10, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k12 ), zero );
-     k13 = _mm256_xor_si256( k13, mm256_ror2x256hi_1x32( k11, k12 ) );
+     k13 = _mm256_xor_si256( k13, _mm256_alignr_epi8( k12, k11, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k13 ), zero );

     p0 = _mm256_xor_si256( p0, x );
@@ -183,24 +186,24 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )

     // round 4, 8, 12

-     k00 = _mm256_xor_si256( k00, mm256_ror2x256hi_1x32( k12, k13 ) );
+     k00 = _mm256_xor_si256( k00, _mm256_alignr_epi8( k13, k12, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( p1, k00 ), zero );
-     k01 = _mm256_xor_si256( k01, mm256_ror2x256hi_1x32( k13, k00 ) );
+     k01 = _mm256_xor_si256( k01, _mm256_alignr_epi8( k00, k13, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k01 ), zero );
-     k02 = _mm256_xor_si256( k02, mm256_ror2x256hi_1x32( k00, k01 ) );
+     k02 = _mm256_xor_si256( k02, _mm256_alignr_epi8( k01, k00, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k02 ), zero );
-     k03 = _mm256_xor_si256( k03, mm256_ror2x256hi_1x32( k01, k02 ) );
+     k03 = _mm256_xor_si256( k03, _mm256_alignr_epi8( k02, k01, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k03 ), zero );

     p0 = _mm256_xor_si256( p0, x );

-     k10 = _mm256_xor_si256( k10, mm256_ror2x256hi_1x32( k02, k03 ) );
+     k10 = _mm256_xor_si256( k10, _mm256_alignr_epi8( k03, k02, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( p3, k10 ), zero );
-     k11 = _mm256_xor_si256( k11, mm256_ror2x256hi_1x32( k03, k10 ) );
+     k11 = _mm256_xor_si256( k11, _mm256_alignr_epi8( k10, k03, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k11 ), zero );
-     k12 = _mm256_xor_si256( k12, mm256_ror2x256hi_1x32( k10, k11 ) );
+     k12 = _mm256_xor_si256( k12, _mm256_alignr_epi8( k11, k10, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k12 ), zero );
-     k13 = _mm256_xor_si256( k13, mm256_ror2x256hi_1x32( k11, k12 ) );
+     k13 = _mm256_xor_si256( k13, _mm256_alignr_epi8( k12, k11, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k13 ), zero );

     p2 = _mm256_xor_si256( p2, x );
--- a/algo/shavite/shavite-hash-4way.c
+++ b/algo/shavite/shavite-hash-4way.c
@@ -11,10 +11,6 @@ static const uint32_t IV512[] =
        0xE275EADE, 0x502D9FCD, 0xB9357178, 0x022A4B9A
 };

-#define mm512_ror2x512hi_1x32( a, b ) \
-   _mm512_mask_blend_epi32( 0x8888, mm512_shuflr128_32( a ), \
-                                    mm512_shuflr128_32( b ) )
-
 static void
 c512_4way( shavite512_4way_context *ctx, const void *msg )
 {
@@ -106,24 +102,24 @@ c512_4way( shavite512_4way_context *ctx, const void *msg )
     
     // round 2, 6, 10

-     K0 = _mm512_xor_si512( K0, mm512_ror2x512hi_1x32( K6, K7 ) );
+     K0 = _mm512_xor_si512( K0, _mm512_alignr_epi8( K7, K6, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( P3, K0 ), m512_zero );
-     K1 = _mm512_xor_si512( K1, mm512_ror2x512hi_1x32( K7, K0 ) );
+     K1 = _mm512_xor_si512( K1, _mm512_alignr_epi8( K0, K7, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K1 ), m512_zero );
-     K2 = _mm512_xor_si512( K2, mm512_ror2x512hi_1x32( K0, K1 ) );
+     K2 = _mm512_xor_si512( K2, _mm512_alignr_epi8( K1, K0, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K2 ), m512_zero );
-     K3 = _mm512_xor_si512( K3, mm512_ror2x512hi_1x32( K1, K2 ) );
+     K3 = _mm512_xor_si512( K3, _mm512_alignr_epi8( K2, K1, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K3 ), m512_zero );

     P2 = _mm512_xor_si512( P2, X );

-     K4 = _mm512_xor_si512( K4, mm512_ror2x512hi_1x32( K2, K3 ) );
+     K4 = _mm512_xor_si512( K4, _mm512_alignr_epi8( K3, K2, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( P1, K4 ), m512_zero );
-     K5 = _mm512_xor_si512( K5, mm512_ror2x512hi_1x32( K3, K4 ) );
+     K5 = _mm512_xor_si512( K5, _mm512_alignr_epi8( K4, K3, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K5 ), m512_zero );
-     K6 = _mm512_xor_si512( K6, mm512_ror2x512hi_1x32( K4, K5 ) );
+     K6 = _mm512_xor_si512( K6, _mm512_alignr_epi8( K5, K4, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K6 ), m512_zero );
-     K7 = _mm512_xor_si512( K7, mm512_ror2x512hi_1x32( K5, K6 ) );
+     K7 = _mm512_xor_si512( K7, _mm512_alignr_epi8( K6, K5, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K7 ), m512_zero );

     P0 = _mm512_xor_si512( P0, X );
@@ -162,24 +158,24 @@ c512_4way( shavite512_4way_context *ctx, const void *msg )

     // round 4, 8, 12

-     K0 = _mm512_xor_si512( K0, mm512_ror2x512hi_1x32( K6, K7 ) );
+     K0 = _mm512_xor_si512( K0, _mm512_alignr_epi8( K7, K6, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( P1, K0 ), m512_zero );
-     K1 = _mm512_xor_si512( K1, mm512_ror2x512hi_1x32( K7, K0 ) );
+     K1 = _mm512_xor_si512( K1, _mm512_alignr_epi8( K0, K7, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K1 ), m512_zero );
-     K2 = _mm512_xor_si512( K2, mm512_ror2x512hi_1x32( K0, K1 ) );
+     K2 = _mm512_xor_si512( K2, _mm512_alignr_epi8( K1, K0, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K2 ), m512_zero );
-     K3 = _mm512_xor_si512( K3, mm512_ror2x512hi_1x32( K1, K2 ) );
+     K3 = _mm512_xor_si512( K3, _mm512_alignr_epi8( K2, K1, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K3 ), m512_zero );

     P0 = _mm512_xor_si512( P0, X );

-     K4 = _mm512_xor_si512( K4, mm512_ror2x512hi_1x32( K2, K3 ) );
+     K4 = _mm512_xor_si512( K4, _mm512_alignr_epi8( K3, K2, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( P3, K4 ), m512_zero );
-     K5 = _mm512_xor_si512( K5, mm512_ror2x512hi_1x32( K3, K4 ) );
+     K5 = _mm512_xor_si512( K5, _mm512_alignr_epi8( K4, K3, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K5 ), m512_zero );
-     K6 = _mm512_xor_si512( K6, mm512_ror2x512hi_1x32( K4, K5 ) );
+     K6 = _mm512_xor_si512( K6, _mm512_alignr_epi8( K5, K4, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K6 ), m512_zero );
-     K7 = _mm512_xor_si512( K7, mm512_ror2x512hi_1x32( K5, K6 ) );
+     K7 = _mm512_xor_si512( K7, _mm512_alignr_epi8( K6, K5, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K7 ), m512_zero );

     P2 = _mm512_xor_si512( P2, X );
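Reviewer note on the hunks above: the removed `mm512_ror2x512hi_1x32( a, b )` macro and its replacement `_mm512_alignr_epi8( b, a, 4 )` appear to compute the same thing in every 128-bit lane, namely `{ b[0], a[3], a[2], a[1] }`. The scalar sketch below models both forms on 32-bit elements to make that visible. It is only a model, not the project's code: `mm512_shuflr128_32` is not defined in this diff, so the sketch assumes it rotates each 128-bit lane right by one 32-bit element, and all `model_*` names and the `v512_model` type are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4   /* a 512-bit vector holds four 128-bit lanes */

/* Scalar stand-in for __m512i viewed as sixteen 32-bit elements. */
typedef struct { uint32_t e[ 4 * LANES ]; } v512_model;

/* ASSUMPTION: mm512_shuflr128_32 rotates each 128-bit lane right by one
   32-bit element, so element i receives element (i+1) mod 4 of its lane. */
static v512_model model_shuflr128_32( v512_model v )
{
   v512_model r;
   for ( int l = 0; l < LANES; l++ )
      for ( int i = 0; i < 4; i++ )
         r.e[ 4*l + i ] = v.e[ 4*l + ( ( i + 1 ) & 3 ) ];
   return r;
}

/* Models _mm512_mask_blend_epi32( k, x, y ): take y where the mask bit
   is set, x otherwise. */
static v512_model model_mask_blend_epi32( uint16_t k, v512_model x,
                                          v512_model y )
{
   v512_model r;
   for ( int i = 0; i < 4 * LANES; i++ )
      r.e[i] = ( ( k >> i ) & 1 ) ? y.e[i] : x.e[i];
   return r;
}

/* Models _mm512_alignr_epi8( hi, lo, 4 ) at 32-bit granularity: in each
   lane, concatenate hi:lo and shift right by one 32-bit element. */
static v512_model model_alignr_1x32( v512_model hi, v512_model lo )
{
   v512_model r;
   for ( int l = 0; l < LANES; l++ )
   {
      for ( int i = 0; i < 3; i++ )
         r.e[ 4*l + i ] = lo.e[ 4*l + i + 1 ];
      r.e[ 4*l + 3 ] = hi.e[ 4*l ];
   }
   return r;
}

/* Returns 1 when the removed macro form and the alignr form agree on a
   simple test pattern, 0 otherwise. */
static int check_4way_equivalence( void )
{
   v512_model a, b;
   for ( int i = 0; i < 4 * LANES; i++ )
   {
      a.e[i] = 0x1000u + (uint32_t)i;
      b.e[i] = 0x2000u + (uint32_t)i;
   }
   v512_model old_form = model_mask_blend_epi32( 0x8888,
                            model_shuflr128_32( a ), model_shuflr128_32( b ) );
   v512_model new_form = model_alignr_1x32( b, a );
   return memcmp( old_form.e, new_form.e, sizeof old_form.e ) == 0;
}
```

Calling `check_4way_equivalence()` from a small `main` is enough to exercise the model. If the assumption about `mm512_shuflr128_32` holds, the single `alignr` instruction replaces two shuffles and a blend per key word.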
--- a/algo/shavite/sph-shavite-aesni.c
+++ b/algo/shavite/sph-shavite-aesni.c
@@ -59,30 +59,6 @@ static const sph_u32 IV512[] = {
 	C32(0xE275EADE), C32(0x502D9FCD), C32(0xB9357178), C32(0x022A4B9A)
 };

-// Partially rotate elements in two 128 bit vectors a & b as one 256 bit vector
-// and return the rotated 128 bit vector a.
-// a[3:0] = { b[0], a[3], a[2], a[1] }
-#if defined(__SSSE3__)
-
-#define mm128_ror256hi_1x32( a, b )  _mm_alignr_epi8( b, a, 4 )
-
-#else  // SSE2
-
-#define mm128_ror256hi_1x32( a, b ) \
-   _mm_or_si128( _mm_srli_si128( a,  4 ), \
-                 _mm_slli_si128( b, 12 ) )
-
-#endif
-
-/*
-#if defined(__AVX2__)
-// 2 way version of above
-// a[7:0] = { b[4], a[7], a[6], a[5], b[0], a[3], a[2], a[1] }
-#define mm256_ror2x256hi_1x32( a, b ) \
-   _mm256_blend_epi32( mm256_ror256_1x32( a ), \
-                       mm256_rol256_3x32( b ), 0x88 )
-#endif
-*/

 static void
 c512( sph_shavite_big_context *sc, const void *msg )
@@ -190,31 +166,31 @@ c512( sph_shavite_big_context *sc, const void *msg )

      // round 2, 6, 10

-      k00 = _mm_xor_si128( k00, mm128_ror256hi_1x32( k12, k13 ) );
+      k00 = _mm_xor_si128( k00, _mm_alignr_epi8( k13, k12, 4 ) );
      x = _mm_xor_si128( p3, k00 );
      x = _mm_aesenc_si128( x, zero );
-      k01 = _mm_xor_si128( k01, mm128_ror256hi_1x32( k13, k00 ) );
+      k01 = _mm_xor_si128( k01, _mm_alignr_epi8( k00, k13, 4 ) );
      x = _mm_xor_si128( x, k01 );
      x = _mm_aesenc_si128( x, zero );
-      k02 = _mm_xor_si128( k02, mm128_ror256hi_1x32( k00, k01 ) );
+      k02 = _mm_xor_si128( k02, _mm_alignr_epi8( k01, k00, 4 ) );
      x = _mm_xor_si128( x, k02 );
      x = _mm_aesenc_si128( x, zero );
-      k03 = _mm_xor_si128( k03, mm128_ror256hi_1x32( k01, k02 ) );
+      k03 = _mm_xor_si128( k03, _mm_alignr_epi8( k02, k01, 4 ) );
      x = _mm_xor_si128( x, k03 );
      x = _mm_aesenc_si128( x, zero );

      p2 = _mm_xor_si128( p2, x );

-      k10 = _mm_xor_si128( k10, mm128_ror256hi_1x32( k02, k03 ) );
+      k10 = _mm_xor_si128( k10, _mm_alignr_epi8( k03, k02, 4 ) );
      x = _mm_xor_si128( p1, k10 );
      x = _mm_aesenc_si128( x, zero );
-      k11 = _mm_xor_si128( k11, mm128_ror256hi_1x32( k03, k10 ) );
+      k11 = _mm_xor_si128( k11, _mm_alignr_epi8( k10, k03, 4 ) );
      x = _mm_xor_si128( x, k11 );
      x = _mm_aesenc_si128( x, zero );
-      k12 = _mm_xor_si128( k12, mm128_ror256hi_1x32( k10, k11 ) );
+      k12 = _mm_xor_si128( k12, _mm_alignr_epi8( k11, k10, 4 ) );
      x = _mm_xor_si128( x, k12 );
      x = _mm_aesenc_si128( x, zero );
-      k13 = _mm_xor_si128( k13, mm128_ror256hi_1x32( k11, k12 ) );
+      k13 = _mm_xor_si128( k13, _mm_alignr_epi8( k12, k11, 4 ) );
      x = _mm_xor_si128( x, k13 );
      x = _mm_aesenc_si128( x, zero );

@@ -262,31 +238,31 @@ c512( sph_shavite_big_context *sc, const void *msg )

      // round 4, 8, 12

-      k00 = _mm_xor_si128( k00, mm128_ror256hi_1x32( k12, k13 ) );
+      k00 = _mm_xor_si128( k00, _mm_alignr_epi8( k13, k12, 4 ) );
      x = _mm_xor_si128( p1, k00 );
      x = _mm_aesenc_si128( x, zero );
-      k01 = _mm_xor_si128( k01, mm128_ror256hi_1x32( k13, k00 ) );
+      k01 = _mm_xor_si128( k01, _mm_alignr_epi8( k00, k13, 4 ) );
      x = _mm_xor_si128( x, k01 );
      x = _mm_aesenc_si128( x, zero );
-      k02 = _mm_xor_si128( k02, mm128_ror256hi_1x32( k00, k01 ) );
+      k02 = _mm_xor_si128( k02, _mm_alignr_epi8( k01, k00, 4 ) );
      x = _mm_xor_si128( x, k02 );
      x = _mm_aesenc_si128( x, zero );
-      k03 = _mm_xor_si128( k03, mm128_ror256hi_1x32( k01, k02 ) );
+      k03 = _mm_xor_si128( k03, _mm_alignr_epi8( k02, k01, 4 ) );
      x = _mm_xor_si128( x, k03 );
      x = _mm_aesenc_si128( x, zero );

      p0 = _mm_xor_si128( p0, x );

-      k10 = _mm_xor_si128( k10, mm128_ror256hi_1x32( k02, k03 ) );
+      k10 = _mm_xor_si128( k10, _mm_alignr_epi8( k03, k02, 4 ) );
      x = _mm_xor_si128( p3, k10 );
      x = _mm_aesenc_si128( x, zero );
-      k11 = _mm_xor_si128( k11, mm128_ror256hi_1x32( k03, k10 ) );
+      k11 = _mm_xor_si128( k11, _mm_alignr_epi8( k10, k03, 4 ) );
      x = _mm_xor_si128( x, k11 );
      x = _mm_aesenc_si128( x, zero );
-      k12 = _mm_xor_si128( k12, mm128_ror256hi_1x32( k10, k11 ) );
+      k12 = _mm_xor_si128( k12, _mm_alignr_epi8( k11, k10, 4 ) );
      x = _mm_xor_si128( x, k12 );
      x = _mm_aesenc_si128( x, zero );
-      k13 = _mm_xor_si128( k13, mm128_ror256hi_1x32( k11, k12 ) );
+      k13 = _mm_xor_si128( k13, _mm_alignr_epi8( k12, k11, 4 ) );
      x = _mm_xor_si128( x, k13 );
      x = _mm_aesenc_si128( x, zero );

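Reviewer note on this file: the removed SSE2 fallback, `_mm_or_si128( _mm_srli_si128( a, 4 ), _mm_slli_si128( b, 12 ) )`, and the SSSE3 form now written inline, `_mm_alignr_epi8( b, a, 4 )`, both yield `{ b[0], a[3], a[2], a[1] }` as the removed comment states. The byte-level sketch below models the three intrinsics in portable C so the equivalence can be checked without SSSE3 hardware. The `v128_model` type and every `model_*` or `check_*` name are hypothetical and exist only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for __m128i as 16 bytes, byte 0 least significant. */
typedef struct { uint8_t b[16]; } v128_model;

/* Models _mm_srli_si128( v, n ): shift the whole vector right by n bytes,
   filling with zeros. */
static v128_model model_srli_si128( v128_model v, int n )
{
   v128_model r;
   memset( &r, 0, sizeof r );
   for ( int i = 0; i + n < 16; i++ ) r.b[i] = v.b[ i + n ];
   return r;
}

/* Models _mm_slli_si128( v, n ): shift the whole vector left by n bytes,
   filling with zeros. */
static v128_model model_slli_si128( v128_model v, int n )
{
   v128_model r;
   memset( &r, 0, sizeof r );
   for ( int i = n; i < 16; i++ ) r.b[i] = v.b[ i - n ];
   return r;
}

/* Models _mm_or_si128( x, y ). */
static v128_model model_or_si128( v128_model x, v128_model y )
{
   v128_model r;
   for ( int i = 0; i < 16; i++ ) r.b[i] = x.b[i] | y.b[i];
   return r;
}

/* Models _mm_alignr_epi8( hi, lo, n ) for 0 <= n <= 16: concatenate hi:lo
   into 32 bytes and return bytes n .. n+15. */
static v128_model model_alignr_epi8( v128_model hi, v128_model lo, int n )
{
   uint8_t cat[32];
   v128_model r;
   memcpy( cat, lo.b, 16 );
   memcpy( cat + 16, hi.b, 16 );
   memcpy( r.b, cat + n, 16 );
   return r;
}

/* Returns 1 when the removed SSE2 fallback and the alignr form agree on
   a simple test pattern, 0 otherwise. */
static int check_alignr_equivalence( void )
{
   v128_model a, b;
   for ( int i = 0; i < 16; i++ )
   {
      a.b[i] = (uint8_t)( 0x10 + i );
      b.b[i] = (uint8_t)( 0xA0 + i );
   }
   v128_model old_form = model_or_si128( model_srli_si128( a, 4 ),
                                         model_slli_si128( b, 12 ) );
   v128_model new_form = model_alignr_epi8( b, a, 4 );
   return memcmp( old_form.b, new_form.b, 16 ) == 0;
}
```

Note the operand order: the old macro took `( a, b )` while `_mm_alignr_epi8` takes the high half first, which is why every call site in this diff swaps its two arguments. Dropping the SSE2 fallback is also consistent with the `sph_shavite.c` and `sph_shavite.h` guards in this commit, which now require `__SSSE3__` alongside `__AES__`.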
--- a/algo/shavite/sph_shavite.c
+++ b/algo/shavite/sph_shavite.c
@@ -35,7 +35,7 @@

 #include "sph_shavite.h"

-#if !defined(__AES__)
+#if !(defined(__AES__) && defined(__SSSE3__))

 #ifdef __cplusplus
 extern "C"{
--- a/algo/shavite/sph_shavite.h
+++ b/algo/shavite/sph_shavite.h
@@ -263,7 +263,7 @@ void sph_shavite384_addbits_and_close(
 	void *cc, unsigned ub, unsigned n, void *dst);

 //Don't call these directly from application code, use the macros below.
-#ifdef __AES__
+#if defined(__AES__) && defined(__SSSE3__)

 void sph_shavite512_aesni_init(void *cc);
 void sph_shavite512_aesni(void *cc, const void *data, size_t len);
--- a/algo/sm3/sph_sm3.h
+++ b/algo/sm3/sph_sm3.h
@@ -74,7 +74,7 @@ typedef struct {

 void sm3_init(sm3_ctx_t *ctx);
 void sm3_update(sm3_ctx_t *ctx, const unsigned char* data, size_t data_len);
-void sm3_final(sm3_ctx_t *ctx, unsigned char digest[SM3_DIGEST_LENGTH]);
+void sm3_final(sm3_ctx_t *ctx, unsigned char *digest);
 void sm3_compress(uint32_t digest[8], const unsigned char block[SM3_BLOCK_SIZE]);
 void sm3(const unsigned char *data, size_t datalen,
 	unsigned char digest[SM3_DIGEST_LENGTH]);
--- a/algo/swifftx/inttypes.h
+++ b/algo/swifftx/inttypes.h
@@ -18,16 +18,20 @@
 #ifndef __INTTYPES_H_
 #define __INTTYPES_H_

+#include <stdint.h>
+
 /* Use [u]intN_t if you need exactly N bits.
  XXX - doesn't handle the -mint8 option.  */

 typedef signed char swift_int8_t;
 typedef unsigned char swift_uint8_t;

- typedef int swift_int16_t;
+ typedef int32_t swift_int16_t;
+// typedef int swift_int16_t;
 typedef unsigned int swift_uint16_t;

- typedef long swift_int32_t;
+ typedef int32_t swift_int32_t;
+// typedef long swift_int32_t;
 typedef unsigned long swift_uint32_t;

 typedef long long swift_int64_t;
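Reviewer note on the typedef change above: `long` is only guaranteed to be at least 32 bits wide and is 64 bits on common LP64 Linux targets, whereas `int32_t` from `<stdint.h>` is exactly 32 bits wherever it exists. That is presumably the motivation for the switch, although the diff does not say so. A minimal sketch for checking the widths on a given build target, using hypothetical helper names that are not part of the patch:

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: width of a type in bits,
   counting any padding bits. */
#define WIDTH_BITS( T ) ( sizeof( T ) * CHAR_BIT )

/* int32_t has exactly 32 value bits and no padding, so this is always 32. */
static size_t int32_width( void ) { return WIDTH_BITS( int32_t ); }

/* long is at least 32 bits; on LP64 platforms it is typically 64. */
static size_t long_width( void )  { return WIDTH_BITS( long ); }
```

Printing both values on the target platform shows whether the old `typedef long swift_int32_t;` was wider than its name suggests. Note that `swift_int16_t` is also mapped to `int32_t` here, matching the 32-bit `int` it replaces rather than narrowing it to 16 bits.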
--- a/algo/swifftx/swifftx-4way.c
+++ b/algo/swifftx/swifftx-4way.c
@@ -1,912 +0,0 @@
-///////////////////////////////////////////////////////////////////////////////////////////////
-//
-//  SWIFFTX ANSI C OPTIMIZED 32BIT IMPLEMENTATION FOR NIST SHA-3 COMPETITION
-//
-//  SWIFFTX.c
-//
-//  October 2008
-//
-//  This is the source file of the OPTIMIZED 32BIT implementation of SWIFFTX hash function.
-//  SWIFFTX is a candidate function for SHA-3 NIST competition.
-//  More details about SWIFFTX can be found in the accompanying submission documents.
-//
-///////////////////////////////////////////////////////////////////////////////////////////////
-#include "swifftx.h"
-// See the remarks concerning compatibility issues inside stdint.h.
-#include "stdint.h"
-// Remove this while using gcc:
-//#include "stdbool.h"
-#include <memory.h>
-
-///////////////////////////////////////////////////////////////////////////////////////////////
-// Constants and static tables portion.
-///////////////////////////////////////////////////////////////////////////////////////////////
-
-// In SWIFFTX we work over Z_257, so this is the modulus and the arithmetic is performed modulo
-// this number.
-#define FIELD_SIZE 257
-
-// The size of FFT we use:
-#define N 64
-
-#define LOGN 6
-
-#define EIGHTH_N (N / 8)
-
-// The number of FFTS done on the input.
-#define M (SWIFFTX_INPUT_BLOCK_SIZE / 8)   // 32
-
-// Omega is the 128th root of unity in Z_257.
-// We choose w = 42.
-#define OMEGA 42
-
-// The size of the inner FFT lookup table:
-#define W 8
-
-// Calculates the sum and the difference of two numbers.
-//
-// Parameters:
-// - A: the first operand. After the operation stores the sum of the two operands.
-// - B: the second operand. After the operation stores the difference between the first and the
-//   second operands.
-#define ADD_SUB_4WAY( A, B ) \
-{ \
-  __m128i temp = B; \
-  B = _mm_sub_epi32( A, B ); \
-  A = _mm_add_epi32( A, temp ); \
-}
-
-
-//#define ADD_SUB(A, B) {register int temp = (B); B = ((A) - (B)); A = ((A) + (temp));}
-
-// Quickly reduces an integer modulo 257.
-//
-// Parameters:
-// - A: the input.
-
-#define Q_REDUCE( A ) ( _mm_sub_epi32( \
-                               _mm_and_epi32( A, m128_const1_32( 0xff ) ), \
-                               _mm_srli_epi32( A, 8 ) ) )
-
-//#define Q_REDUCE(A) (((A) & 0xff) - ((A) >> 8))
-
-// Since we need to do the setup only once, this is the indicator variable:
-static bool wasSetupDone = false;
-
-// This array stores the powers of omegas that correspond to the indices, which are the input
-// values. Known also as the "outer FFT twiddle factors".
-swift_int16_t multipliers[N];
-
-// This array stores the powers of omegas, multiplied by the corresponding values.
-// We store this table to save computation time.
-//
-// To calculate the intermediate value of the compression function (the first out of two
-// stages), we multiply the k-th bit of x_i by w^[(2i + 1) * k]. {x_i} is the input to the
-// compression function, i is between 0 and 31, x_i is a 64-bit value.
-// One can see the formula for this (intermediate) stage in the SWIFFT FSE 2008 paper --
-// formula (2), section 3, page 6.
-swift_int16_t fftTable[256 * EIGHTH_N];
-
-// The A's we use in SWIFFTX shall be random elements of Z_257.
-// We generated these A's from the decimal expansion of PI as follows:  we converted each
-// triple of digits into a decimal number d. If d < (257 * 3) we used (d % 257) for the next A
-// element, otherwise move to the next triple of digits in the expansion. This guarntees that
-// the A's are random, provided that PI digits are.
-const swift_int16_t As[3 * M * N] =
-{141,  78, 139,  75, 238, 205, 129, 126,  22, 245, 197, 169, 142, 118, 105,  78,
-  50, 149,  29, 208, 114,  34,  85, 117,  67, 148,  86, 256,  25,  49, 133,  93,
-  95,  36,  68, 231, 211, 102, 151, 128, 224, 117, 193,  27, 102, 187,   7, 105,
-  45, 130, 108, 124, 171, 151, 189, 128, 218, 134, 233, 165,  14, 201, 145, 134,
-  52, 203,  91,  96, 197,  69, 134, 213, 136,  93,   3, 249, 141,  16, 210,  73,
-   6,  92,  58,  74, 174,   6, 254,  91, 201, 107, 110,  76, 103,  11,  73,  16,
-  34, 209,   7, 127, 146, 254,  95, 176,  57,  13, 108, 245,  77,  92, 186, 117,
- 124,  97, 105, 118,  34,  74, 205, 122, 235,  53,  94, 238, 210, 227, 183,  11,
- 129, 159, 105, 183, 142, 129,  86,  21, 137, 138, 224, 223, 190, 188, 179, 188,
- 256,  25, 217, 176,  36, 176, 238, 127, 160, 210, 155, 148, 132,   0,  54, 127,
- 145,   6,  46,  85, 243,  95, 173, 123, 178, 207, 211, 183, 224, 173, 146,  35,
-  71, 114,  50,  22, 175,   1,  28,  19, 112, 129,  21,  34, 161, 159, 115,  52,
-   4, 193, 211,  92, 115,  49,  59, 217, 218,  96,  61,  81,  24, 202, 198,  89,
-  45, 128,   8,  51, 253,  87, 171,  35,   4, 188, 171,  10,   3, 137, 238,  73,
-  19, 208, 124, 163, 103, 177, 155, 147,  46,  84, 253, 233, 171, 241, 211, 217,
- 159,  48,  96,  79, 237,  18, 171, 226,  99,   1,  97, 195, 216, 163, 198,  95,
-   0, 201,  65, 228,  21, 153, 124, 230,  44,  35,  44, 108,  85, 156, 249, 207,
-  26, 222, 131,   1,  60, 242, 197, 150, 181,  19, 116, 213,  75,  98, 124, 240,
- 123, 207,  62, 255,  60, 143, 187, 157, 139,   9,  12, 104,  89,  49, 193, 146,
- 104, 196, 181,  82, 198, 253, 192, 191, 255, 122, 212, 104,  47,  20, 132, 208,
-  46, 170,   2,  69, 234,  36,  56, 163,  28, 152, 104, 238, 162,  56,  24,  58,
-  38, 150, 193, 254, 253, 125, 173,  35,  73, 126, 247, 239, 216,   6, 199,  15,
-  90,  12,  97, 122,   9,  84, 207, 127, 219,  72,  58,  30,  29, 182,  41, 192,
- 235, 248, 237,  74,  72, 176, 210, 252,  45,  64, 165,  87, 202, 241, 236, 223,
- 151, 242, 119, 239,  52, 112, 169,  28,  13,  37, 160,  60, 158,  81, 133,  60,
-  16, 145, 249, 192, 173, 217, 214,  93, 141, 184,  54,  34, 161, 104, 157,  95,
-  38, 133, 218, 227, 211, 181,   9,  66, 137, 143,  77,  33, 248, 159,   4,  55,
- 228,  48,  99, 219, 222, 184,  15,  36, 254, 256, 157, 237,  87, 139, 209, 113,
- 232,  85, 126, 167, 197, 100, 103, 166,  64, 225, 125, 205, 117, 135,  84, 128,
- 231, 112,  90, 241,  28,  22, 210, 147, 186,  49, 230,  21, 108,  39, 194,  47,
- 123, 199, 107, 114,  30, 210, 250, 143,  59, 156, 131, 133, 221,  27,  76,  99,
- 208, 250,  78,  12, 211, 141,  95,  81, 195, 106,   8, 232, 150, 212, 205, 221,
-  11, 225,  87, 219, 126, 136, 137, 180, 198,  48,  68, 203, 239, 252, 194, 235,
- 142, 137, 174, 172, 190, 145, 250, 221, 182, 204,   1, 195, 130, 153,  83, 241,
- 161, 239, 211, 138,  11, 169, 155, 245, 174,  49,  10, 166,  16, 130, 181, 139,
- 222, 222, 112,  99, 124,  94,  51, 243, 133, 194, 244, 136,  35, 248, 201, 177,
- 178, 186, 129, 102,  89, 184, 180,  41, 149,  96, 165,  72, 225, 231, 134, 158,
- 199,  28, 249,  16, 225, 195,  10, 210, 164, 252, 138,   8,  35, 152, 213, 199,
-  82, 116,  97, 230,  63, 199, 241,  35,  79, 120,  54, 174,  67, 112,   1,  76,
-  69, 222, 194,  96,  82,  94,  25, 228, 196, 145, 155, 136, 228, 234,  46, 101,
- 246,  51, 103, 166, 246,  75,   9, 200, 161,   4, 108,  35, 129, 168, 208, 144,
-  50,  14,  13, 220,  41, 132, 122, 127, 194,   9, 232, 234, 107,  28, 187,   8,
-  51, 141,  97, 221, 225,   9, 113, 170, 166, 102, 135,  22, 231, 185, 227, 187,
- 110, 145, 251, 146,  76,  22, 146, 228,   7,  53,  64,  25,  62, 198, 130, 190,
- 221, 232, 169,  64, 188, 199, 237, 249, 173, 218, 196, 191,  48, 224,   5, 113,
- 100, 166, 160,  21, 191, 197,  61, 162, 149, 171, 240, 183, 129, 231, 123, 204,
- 192, 179, 134,  15,  47, 161, 142, 177, 239, 234, 186, 237, 231,  53, 208,  95,
- 146,  36, 225, 231,  89, 142,  93, 248, 137, 124,  83,  39,  69,  77,  89, 208,
- 182,  48,  85, 147, 244, 164, 246,  68,  38, 190, 220,  35, 202,  91, 157, 151,
- 201, 240, 185, 218,   4, 152,   2, 132, 177,  88, 190, 196, 229,  74, 220, 135,
- 137, 196,  11,  47,   5, 251, 106, 144, 163,  60, 222, 127,  52,  57, 202, 102,
-  64, 140, 110, 206,  23, 182,  39, 245,   1, 163, 157, 186, 163,  80,   7, 230,
-  44, 249, 176, 102, 164, 125, 147, 120,  18, 191, 186, 125,  64,  65, 198, 157,
- 164, 213,  95,  61,  13, 181, 208,  91, 242, 197, 158,  34,  98, 169,  91,  14,
-  17,  93, 157,  17,  65,  30, 183,   6, 139,  58, 255, 108, 100, 136, 209, 144,
- 164,   6, 237,  33, 210, 110,  57, 126, 197, 136, 125, 244, 165, 151, 168,   3,
- 143, 251, 247, 155, 136, 130,  88,  14,  74, 121, 250, 133,  21, 226, 185, 232,
- 118, 132,  89,  64, 204, 161,   2,  70, 224, 159,  35, 204, 123, 180,  13,  52,
- 231,  57,  25,  78,  66,  69,  97,  42, 198,  84, 176,  59,   8, 232, 125, 134,
- 193,   2, 232, 109, 216,  69,  90, 142,  32,  38, 249,  37,  75, 180, 184, 188,
-  19,  47, 120,  87, 146,  70, 232, 120, 191,  45,  33,  38,  19, 248, 110, 110,
-  44,  64,   2,  84, 244, 228, 252, 228, 170, 123,  38, 144, 213, 144, 171, 212,
- 243,  87, 189,  46, 128, 110,  84,  77,  65, 183,  61, 184, 101,  44, 168,  68,
-  14, 106, 105,   8, 227, 211, 166,  39, 152,  43,  52, 254, 197,  55, 119,  89,
- 168,  65,  53, 138, 177,  56, 219,   0,  58, 121, 148,  18,  44, 100, 215, 103,
- 145, 229, 117, 196,  91,  89, 113, 143, 172, 239, 249, 184, 154,  39, 112,  65,
- 204,  42,  84,  38, 155, 151, 151,  16, 100,  87, 174, 162, 145, 147, 149, 186,
- 237, 145, 134, 144, 198, 235, 213, 163,  48, 230,  24,  47,  57,  71, 127,   0,
- 150, 219,  12,  81, 197, 150, 131,  13, 169,  63, 175, 184,  48, 235,  65, 243,
- 149, 200, 163, 254, 202, 114, 247,  67, 143, 250, 126, 228,  80, 130, 216, 214,
-  36,   2, 230,  33, 119, 125,   3, 142, 237, 100,   3, 152, 197, 174, 244, 129,
- 232,  30, 206, 199,  39, 210, 220,  43, 237, 221, 201,  54, 179,  42,  28, 133,
- 246, 203, 198, 177,   0,  28, 194,  85, 223, 109, 155, 147, 221,  60, 133, 108,
- 157, 254,  26,  75, 157, 185,  49, 142,  31, 137,  71,  43,  63,  64, 237, 148,
- 237, 172, 159, 160, 155, 254, 234, 224, 140, 193, 114, 140,  62, 109, 136,  39,
- 255,   8, 158, 146, 128,  49, 222,  96,  57, 209, 180, 249, 202, 127, 113, 231,
-  78, 178,  46,  33, 228, 215, 104,  31, 207, 186,  82,  41,  42,  39, 103, 119,
- 123, 133, 243, 254, 238, 156,  90, 186,  37, 212,  33, 107, 252,  51, 177,  36,
- 237,  76, 159, 245,  93, 214,  97,  56, 190,  38, 160,  94, 105, 222, 220, 158,
-  49,  16, 191,  52, 120,  87, 179,   2,  27, 144, 223, 230, 184,   6, 129, 227,
-  69,  47, 215, 181, 162, 139,  72, 200,  45, 163, 159,  62,   2, 221, 124,  40,
- 159, 242,  35, 208, 179, 166,  98,  67, 178,  68, 143, 225, 178, 146, 187, 159,
-  57,  66, 176, 192, 236, 250, 168, 224, 122,  43, 159, 120, 133, 165, 122,  64,
-  87,  74, 161, 241,   9,  87,  90,  24, 255, 113, 203, 220,  57, 139, 197, 159,
-  31, 151,  27, 140,  77, 162,   7,  27,  84, 228, 187, 220,  53, 126, 162, 242,
-  84, 181, 223, 103,  86, 177, 207,  31, 140,  18, 207, 256, 201, 166,  96,  23,
- 233, 103, 197,  84, 161,  75,  59, 149, 138, 154, 119,  92,  16,  53, 116,  97,
- 220, 114,  35,  45,  77, 209,  40, 196,  71,  22,  81, 178, 110,  14,   3, 180,
- 110, 129, 112,  47,  18,  61, 134,  78,  73,  79, 254, 232, 125, 180, 205,  54,
- 220, 119,  63,  89, 181,  52,  77, 109, 151,  77,  80, 207, 144,  25,  20,   6,
- 208,  47, 201, 206, 192,  14,  73, 176, 256, 201, 207,  87, 216,  60,  56,  73,
-  92, 243, 179, 113,  49,  59,  55, 168, 121, 137,  69, 154,  95,  57, 187,  47,
- 129,   4,  15,  92,   6, 116,  69, 196,  48, 134,  84,  81, 111,  56,  38, 176,
- 239,   6, 128,  72, 242, 134,  36, 221,  59,  48, 242,  68, 130, 110, 171,  89,
-  13, 220,  48,  29,   5,  75, 104, 233,  91, 129, 105, 162,  44, 113, 163, 163,
-  85, 147, 190, 111, 197,  80, 213, 153,  81,  68, 203,  33, 161, 165,  10,  61,
- 120, 252,   0, 205,  28,  42, 193,  64,  39,  37,  83, 175,   5, 218, 215, 174,
- 128, 121, 231,  11, 150, 145, 135, 197, 136,  91, 193,   5, 107,  88,  82,   6,
-   4, 188, 256,  70,  40,   2, 167,  57, 169, 203, 115, 254, 215, 172,  84,  80,
- 188, 167,  34, 137,  43, 243,   2,  79, 178,  38, 188, 135, 233, 194, 208,  13,
-  11, 151, 231, 196,  12, 122, 162,  56,  17, 114, 191, 207,  90, 132,  64, 238,
- 187,   6, 198, 176, 240,  88, 118, 236,  15, 226, 166,  22, 193, 229,  82, 246,
- 213,  64,  37,  63,  31, 243, 252,  37, 156,  38, 175, 204, 138, 141, 211,  82,
- 106, 217,  97, 139, 153,  56, 129, 218, 158,   9,  83,  26,  87, 112,  71,  21,
- 250,   5,  65, 141,  68, 116, 231, 113,  10, 218,  99, 205, 201,  92, 157,   4,
-  97,  46,  49, 220,  72, 139, 103, 171, 149, 129, 193,  19,  69, 245,  43,  31,
-  58,  68,  36, 195, 159,  22,  54,  34, 233, 141, 205, 100, 226,  96,  22, 192,
-  41, 231,  24,  79, 234, 138,  30, 120, 117, 216, 172, 197, 172, 107,  86,  29,
- 181, 151,   0,   6, 146, 186,  68,  55,  54,  58, 213, 182,  60, 231,  33, 232,
-  77, 210, 216, 154,  80,  51, 141, 122,  68, 148, 219, 122, 254,  48,  64, 175,
-  41, 115,  62, 243, 141,  81, 119, 121,   5,  68, 121,  88, 239,  29, 230,  90,
- 135, 159,  35, 223, 168, 112,  49,  37, 146,  60, 126, 134,  42, 145, 115,  90,
-  73, 133, 211,  86, 120, 141, 122, 241, 127,  56, 130,  36, 174,  75,  83, 246,
- 112,  45, 136, 194, 201, 115,   1, 156, 114, 167, 208,  12, 176, 147,  32, 170,
- 251, 100, 102, 220, 122, 210,   6,  49,  75, 201,  38, 105, 132, 135, 126, 102,
-  13, 121,  76, 228, 202,  20,  61, 213, 246,  13, 207,  42, 148, 168,  37, 253,
-  34,  94, 141, 185,  18, 234, 157, 109, 104,  64, 250, 125,  49, 236,  86,  48,
- 196,  77,  75, 237, 156, 103, 225,  19, 110, 229,  22,  68, 177,  93, 221, 181,
- 152, 153,  61, 108, 101,  74, 247, 195, 127, 216,  30, 166, 168,  61,  83, 229,
- 120, 156,  96, 120, 201, 124,  43,  27, 253, 250, 120, 143,  89, 235, 189, 243,
- 150,   7, 127, 119, 149, 244,  84, 185, 134,  34, 128, 193, 236, 234, 132, 117,
- 137,  32, 145, 184,  44, 121,  51,  76,  11, 228, 142, 251,  39,  77, 228, 251,
-  41,  58, 246, 107, 125, 187,   9, 240,  35,   8,  11, 162, 242, 220, 158, 163,
-   2, 184, 163, 227, 242,   2, 100, 101,   2,  78, 129,  34,  89,  28,  26, 157,
-  79,  31, 107, 250, 194, 156, 186,  69, 212,  66,  41, 180, 139,  42, 211, 253,
- 256, 239,  29, 129, 104, 248, 182,  68,   1, 189,  48, 226,  36, 229,   3, 158,
-  41,  53, 241,  22, 115, 174,  16, 163, 224,  19, 112, 219, 177, 233,  42,  27,
- 250, 134,  18,  28, 145, 122,  68,  34, 134,  31, 147,  17,  39, 188, 150,  76,
-  45,  42, 167, 249,  12,  16,  23, 182,  13,  79, 121,   3,  70, 197, 239,  44,
-  86, 177, 255,  81,  64, 171, 138, 131,  73, 110,  44, 201, 254, 198, 146,  91,
-  48,   9, 104,  31,  29, 161, 101,  31, 138, 180, 231, 233,  79, 137,  61, 236,
- 140,  15, 249, 218, 234, 119,  99, 195, 110, 137, 237, 207,   8,  31,  45,  24,
-  90, 155, 203, 253, 192, 203,  65, 176, 210, 171, 142, 214, 220, 122, 136, 237,
- 189, 186, 147,  40,  80, 254, 173,  33, 191,  46, 192,  26, 108, 255, 228, 205,
-  61,  76,  39, 107, 225, 126, 228, 182, 140, 251, 143, 134, 252, 168, 221,   8,
- 185,  85,  60, 233, 147, 244,  87, 137,   8, 140,  96,  80,  53,  45, 175, 160,
- 124, 189, 112,  37, 144,  19,  70,  17, 170, 242,   2,   3,  28,  95, 120, 199,
- 212,  43,   9, 117,  86, 151, 101, 241, 200, 145, 241,  19, 178,  69, 204, 197,
- 227, 166,  94,   7, 193,  45, 247, 234,  19, 187, 212, 212, 236, 125,  33,  95,
- 198, 121, 122, 103,  77, 155, 235,  49,  25, 237, 249,  11, 162,   7, 238,  24,
-  16, 150, 129,  25, 152,  17,  42,  67, 247, 162,  77, 154,  31, 133,  55, 137,
-  79, 119, 153,  10,  86,  28, 244, 186,  41, 169, 106,  44,  10,  49, 110, 179,
-  32, 133, 155, 244,  61,  70, 131, 168, 170,  39, 231, 252,  32,  69,  92, 238,
- 239,  35, 132, 136, 236, 167,  90,  32, 123,  88,  69,  22,  20,  89, 145, 166,
-  30, 118,  75,   4,  49,  31, 225,  54,  11,  50,  56, 191, 246,   1, 187,  33,
- 119, 107, 139,  68,  19, 240, 131,  55,  94, 113,  31, 252,  12, 179, 121,   2,
- 120, 252,   0,  76,  41,  80, 185,  42,  62, 121, 105, 159, 121, 109, 111,  98,
-   7, 118,  86,  29, 210,  70, 231, 179, 223, 229, 164,  70,  62,  47,   0, 206,
- 204, 178, 168, 120, 224, 166,  99,  25, 103,  63, 246, 224, 117, 204,  75, 124,
- 140, 133, 110, 110, 222,  88, 151, 118,  46,  37,  22, 143, 158,  40,   2,  50,
- 153,  94, 190, 199,  13, 198, 127, 211, 180,  90, 183,  98,   0, 142, 210, 154,
- 100, 187,  67, 231, 202, 100, 198, 235, 252, 160, 247, 124, 247,  14, 121, 221,
-  57,  88, 253, 243, 185,  89,  45, 249, 221, 194, 108, 175, 193, 119,  50, 141,
- 223, 133, 136,  64, 176, 250, 129, 100, 124,  94, 181, 159,  99, 185, 177, 240,
- 135,  42, 103,  52, 202, 208, 143, 186, 193, 103, 154, 237, 102,  88, 225, 161,
-  50, 188, 191, 109,  12,  87,  19, 227, 247, 183,  13,  52, 205, 170, 205, 146,
-  89, 160,  18, 105, 192,  73, 231, 225, 184, 157, 252, 220,  61,  59, 169, 183,
- 221,  20, 141,  20, 158, 101, 245,   7, 245, 225, 118, 137,  84,  55,  19,  27,
- 164, 110,  35,  25, 202,  94, 150,  46,  91, 152, 130,   1,   7,  46,  16, 237,
- 171, 109,  19, 200,  65,  38,  10, 213,  70,  96, 126, 226, 185, 225, 181,  46,
-  10, 165,  11, 123,  53, 158,  22, 147,  64,  22, 227,  69, 182, 237, 197,  37,
-  39,  49, 186, 223, 139, 128,  55,  36, 166, 178, 220,  20,  98, 172, 166, 253,
-  45,   0, 120, 180, 189, 185, 158, 159, 196,   6, 214,  79, 141,  52, 156, 107,
-   5, 109, 142, 159,  33,  64, 190, 133,  95, 132,  95, 202, 160,  63, 186,  23,
- 231, 107, 163,  33, 234,  15, 244,  77, 108,  49,  51,   7, 164,  87, 142,  99,
- 240, 202,  47, 256, 118, 190, 196, 178, 217,  42,  39, 153,  21, 192, 232, 202,
-  14,  82, 179,  64, 233,   4, 219,  10, 133,  78,  43, 144, 146, 216, 202,  81,
-  71, 252,   8, 201,  68, 256,  85, 233, 164,  88, 176,  30,   5, 152, 126, 179,
- 249,  84, 140, 190, 159,  54, 118,  98,   2, 159,  27, 133,  74, 121, 239, 196,
-  71, 149, 119, 135, 102,  20,  87, 112,  44,  75, 221,   3, 151, 158,   5,  98,
- 152,  25,  97, 106,  63, 171, 240,  79, 234, 240, 230,  92,  76,  70, 173, 196,
-  36, 225, 218, 133,  64, 240, 150,  41, 146,  66, 133,  51, 134,  73, 170, 238,
- 140,  90,  45,  89,  46, 147,  96, 169, 174, 174, 244, 151,  90,  40,  32,  74,
-  38, 154, 246,  57,  31,  14, 189, 151,  83, 243, 197, 183, 220, 185,  53, 225,
-  51, 106, 188, 208, 222, 248,  93,  13,  93, 215, 131,  25, 142, 185, 113, 222,
- 131, 215, 149,  50, 159,  85,  32,   5, 205, 192,   2, 227,  42, 214, 197,  42,
- 126, 182,  68, 123, 109,  36, 237, 179, 170, 199,  77, 256,   5, 128, 214, 243,
- 137, 177, 170, 253, 179, 180, 153, 236, 100, 196, 216, 231, 198,  37, 192,  80,
- 121, 221, 246,   1,  16, 246,  29,  78,  64, 148, 124,  38,  96, 125,  28,  20,
-  48,  51,  73, 187, 139, 208,  98, 253, 221, 188,  84, 129,   1, 205,  95, 205,
- 117,  79,  71, 126, 134, 237,  19, 184, 137, 125, 129, 178, 223,  54, 188, 112,
-  30,   7, 225, 228, 205, 184, 233,  87, 117,  22,  58,  10,   8,  42,   2, 114,
- 254,  19,  17,  13, 150,  92, 233, 179,  63,  12,  60, 171, 127,  35,  50,   5,
- 195, 113, 241,  25, 249, 184, 166,  44, 221,  35, 151, 116,   8,  54, 195,  89,
- 218, 186, 132,   5,  41,  89, 226, 177,  11,  41,  87, 172,   5,  23,  20,  59,
- 228,  94,  76,  33, 137,  43, 151, 221,  61, 232,   4, 120,  93, 217,  80, 228,
- 228,   6,  58,  25,  62,  84,  91,  48, 209,  20, 247, 243,  55, 106,  80,  79,
- 235,  34,  20, 180, 146,   2, 236,  13, 236, 206, 243, 222, 204,  83, 148, 213,
- 214, 117, 237,  98,   0,  90, 204, 168,  32,  41, 126,  67, 191,  74,  27, 255,
-  26,  75, 240, 113, 185, 105, 167, 154, 112,  67, 151,  63, 161, 134, 239, 176,
-  42,  87, 249, 130,  45, 242,  17, 100, 107, 120, 212, 218, 237,  76, 231, 162,
- 175, 172, 118, 155,  92,  36, 124,  17, 121,  71,  13,   9,  82, 126, 147, 142,
- 218, 148, 138,  80, 163, 106, 164, 123, 140, 129,  35,  42, 186, 154, 228, 214,
-  75,  73,   8, 253,  42, 153, 232, 164,  95,  24, 110,  90, 231, 197,  90, 196,
-  57, 164, 252, 181,  31,   7,  97, 256,  35,  77, 200, 212,  99, 179,  92, 227,
-  17, 180,  49, 176,   9, 188,  13, 182,  93,  44, 128, 219, 134,  92, 151,   6,
-  23, 126, 200, 109,  66,  30, 140, 180, 146, 134,  67, 200,   7,   9, 223, 168,
- 186, 221,   3, 154, 150, 165,  43,  53, 138,  27,  86, 213, 235, 160,  70,   2,
- 240,  20,  89, 212,  84, 141, 168, 246, 183, 227,  30, 167, 138, 185, 253,  83,
-  52, 143, 236,  94,  59,  65,  89, 218, 194, 157, 164, 156, 111,  95, 202, 168,
- 245, 256, 151,  28, 222, 194,  72, 130, 217, 134, 253,  77, 246, 100,  76,  32,
- 254, 174, 182, 193,  14, 237,  74,   1,  74,  26, 135, 216, 152, 208, 112,  38,
- 181,  62,  25,  71,  61, 234, 254,  97, 191,  23,  92, 256, 190, 205,   6,  16,
- 134, 147, 210, 219, 148,  59,  73, 185,  24, 247, 174, 143, 116, 220, 128, 144,
- 111, 126, 101,  98, 130, 136, 101, 102,  69, 127,  24, 168, 146, 226, 226, 207,
- 176, 122, 149, 254, 134, 196,  22, 151, 197,  21,  50, 205, 116, 154,  65, 116,
- 177, 224, 127,  77, 177, 159, 225,  69, 176,  54, 100, 104, 140,   8,  11, 126,
-  11, 188, 185, 159, 107,  16, 254, 142,  80,  28,   5, 157, 104,  57, 109,  82,
- 102,  80, 173, 242, 238, 207,  57, 105, 237, 160,  59, 189, 189, 199,  26,  11,
- 190, 156,  97, 118,  20,  12, 254, 189, 165, 147, 142, 199,   5, 213,  64, 133,
- 108, 217, 133,  60,  94,  28, 116, 136,  47, 165, 125,  42, 183, 143,  14, 129,
- 223,  70, 212, 205, 181, 180,   3, 201, 182,  46,  57, 104, 239,  60,  99, 181,
- 220, 231,  45,  79, 156,  89, 149, 143, 190, 103, 153,  61, 235,  73, 136,  20,
-  89, 243,  16, 130, 247, 141, 134,  93,  80,  68,  85,  84,   8,  72, 194,   4,
- 242, 110,  19, 133, 199,  70, 172,  92, 132, 254,  67,  74,  36,  94,  13,  90,
- 154, 184,   9, 109, 118, 243, 214,  71,  36,  95,   0,  90, 201, 105, 112, 215,
-  69, 196, 224, 210, 236, 242, 155, 211,  37, 134,  69, 113, 157,  97,  68,  26,
- 230, 149, 219, 180,  20,  76, 172, 145, 154,  40, 129,   8,  93,  56, 162, 124,
- 207, 233, 105,  19,   3, 183, 155, 134,   8, 244, 213,  78, 139,  88, 156,  37,
-  51, 152, 111, 102, 112, 250, 114, 252, 201, 241, 133,  24, 136, 153,   5,  90,
- 210, 197, 216,  24, 131,  17, 147, 246,  13,  86,   3, 253, 179, 237, 101, 114,
- 243, 191, 207,   2, 220, 133, 244,  53,  87, 125, 154, 158, 197,  20,   8,  83,
-  32, 191,  38, 241, 204,  22, 168,  59, 217, 123, 162,  82,  21,  50, 130,  89,
- 239, 253, 195,  56, 253,  74, 147, 125, 234, 199, 250,  28,  65, 193,  22, 237,
- 193,  94,  58, 229, 139, 176,  69,  42, 179, 164, 150, 168, 246, 214,  86, 174,
-  59, 117,  15,  19,  76,  37, 214, 238, 153, 226, 154,  45, 109, 114, 198, 107,
-  45,  70, 238, 196, 142, 252, 244,  71, 123, 136, 134, 188,  99, 132,  25,  42,
- 240,   0, 196,  33,  26, 124, 256, 145,  27, 102, 153,  35,  28, 132, 221, 167,
- 138, 133,  41, 170,  95, 224,  40, 139, 239, 153,   1, 106, 255, 106, 170, 163,
- 127,  44, 155, 232, 194, 119, 232, 117, 239, 143, 108,  41,   3,   9, 180, 256,
- 144, 113, 133, 200,  79,  69, 128, 216,  31,  50, 102, 209, 249, 136, 150, 154,
- 182,  51, 228,  39, 127, 142,  87,  15,  94,  92, 187, 245,  31, 236,  64,  58,
- 114,  11,  17, 166, 189, 152, 218,  34, 123,  39,  58,  37, 153,  91,  63, 121,
-  31,  34,  12, 254, 106,  96, 171,  14, 155, 247, 214,  69,  24,  98,   3, 204,
- 202, 194, 207,  30, 253,  44, 119,  70,  14,  96,  82, 250,  63,   6, 232,  38,
-  89, 144, 102, 191,  82, 254,  20, 222,  96, 162, 110,   6, 159,  58, 200, 226,
-  98, 128,  42,  70,  84, 247, 128, 211, 136,  54, 143, 166,  60, 118,  99, 218,
-  27, 193,  85,  81, 219, 223,  46,  41,  23, 233, 152, 222,  36, 236,  54, 181,
-  56,  50,   4, 207, 129,  92,  78,  88, 197, 251, 131, 105,  31, 172,  38, 131,
-  19, 204, 129,  47, 227, 106, 202, 183,  23,   6,  77, 224, 102, 147,  11, 218,
- 131, 132,  60, 192, 208, 223, 236,  23, 103, 115,  89,  18, 185, 171,  70, 174,
- 139,   0, 100, 160, 221,  11, 228,  60,  12, 122, 114,  12, 157, 235, 148,  57,
-  83,  62, 173, 131, 169, 126,  85,  99,  93, 243,  81,  80,  29, 245, 206,  82,
- 236, 227, 166,  14, 230, 213, 144,  97,  27, 111,  99, 164, 105, 150,  89, 111,
- 252, 118, 140, 232, 120, 183, 137, 213, 232, 157, 224,  33, 134, 118, 186,  80,
- 159,   2, 186, 193,  54, 242,  25, 237, 232, 249, 226, 213,  90, 149,  90, 160,
- 118,  69,  64,  37,  10, 183, 109, 246,  30,  52, 219,  69, 189,  26, 116, 220,
-  50, 244, 243, 243, 139, 137, 232,  98,  38,  45, 256, 143, 171, 101,  73, 238,
- 123,  45, 194, 167, 250, 123,  12,  29, 136, 237, 141,  21,  89,  96, 199,  44,
-   8, 214, 208,  17, 113,  41, 137,  26, 166, 155,  89,  85,  54,  58,  97, 160,
-  50, 239,  58,  71,  21, 157, 139,  12,  37, 198, 182, 131, 149, 134,  16, 204,
- 164, 181, 248, 166,  52, 216, 136, 201,  37, 255, 187, 240,   5, 101, 147, 231,
-  14, 163, 253, 134, 146, 216,   8,  54, 224,  90, 220, 195,  75, 215, 186,  58,
-  71, 204, 124, 105, 239,  53,  16,  85,  69, 163, 195, 223,  33,  38,  69,  88,
-  88, 203,  99,  55, 176,  13, 156, 204, 236,  99, 194, 134,  75, 247, 126, 129,
- 160, 124, 233, 206, 139, 144, 154,  45, 233,  51, 206,  61,  60,  55, 205, 107,
-  84, 108,  96, 188, 203,  31,  89,  20, 115, 144, 137,  90, 237,  78, 231, 185,
- 120, 217,   1, 176, 169,  30, 155, 176, 100, 113,  53,  42, 193, 108,  14, 121,
- 176, 158, 137,  92, 178,  44, 110, 249, 108, 234,  94, 101, 128,  12, 250, 173,
-  72, 202, 232,  66, 139, 152, 189,  18,  32, 197,   9, 238, 246,  55, 119, 183,
- 196, 119, 113, 247, 191, 100, 200, 245,  46,  16, 234, 112, 136, 116, 232,  48,
- 176, 108,  11, 237,  14, 153,  93, 177, 124,  72,  67, 121, 135, 143,  45,  18,
-  97, 251, 184, 172, 136,  55, 213,   8, 103,  12, 221, 212,  13, 160, 116,  91,
- 237, 127, 218, 190, 103, 131,  77,  82,  36, 100,  22, 252,  79,  69,  54,  26,
-  65, 182, 115, 142, 247,  20,  89,  81, 188, 244,  27, 120, 240, 248,  13, 230,
-  67, 133,  32, 201, 129,  87,   9, 245,  66,  88, 166,  34,  46, 184, 119, 218,
- 144, 235, 163,  40, 138, 134, 127, 217,  64, 227, 116,  67,  55, 202, 130,  48,
- 199,  42, 251, 112, 124, 153, 123, 194, 243,  49, 250,  12,  78, 157, 167, 134,
- 210,  73, 156, 102,  21,  88, 216, 123,  45,  11, 208,  18,  47, 187,  20,  43,
-   3, 180, 124,   2, 136, 176,  77, 111, 138, 139,  91, 225, 126,   8,  74, 255,
-  88, 192, 193, 239, 138, 204, 139, 194, 166, 130, 252, 184, 140, 168,  30, 177,
- 121,  98, 131, 124,  69, 171,  75,  49, 184,  34,  76, 122, 202, 115, 184, 253,
- 120, 182,  33, 251,   1,  74, 216, 217, 243, 168,  70, 162, 119, 158, 197, 198,
-  61,  89,   7,   5,  54, 199, 211, 170,  23, 226,  44, 247, 165, 195,   7, 225,
-  91,  23,  50,  15,  51, 208, 106,  94,  12,  31,  43, 112, 146, 139, 246, 182,
- 113,   1,  97,  15,  66,   2,  51,  76, 164, 184, 237, 200, 218, 176,  72,  98,
-  33, 135,  38, 147, 140, 229,  50,  94,  81, 187, 129,  17, 238, 168, 146, 203,
- 181,  99, 164,   3, 104,  98, 255, 189, 114, 142,  86, 102, 229, 102,  80, 129,
-  64,  84,  79, 161,  81, 156, 128, 111, 164, 197,  18,  15,  55, 196, 198, 191,
-  28, 113, 117,  96, 207, 253,  19, 158, 231,  13,  53, 130, 252, 211,  58, 180,
- 212, 142,   7, 219,  38,  81,  62, 109, 167, 113,  33,  56,  97, 185, 157, 130,
- 186, 129, 119, 182, 196,  26,  54, 110,  65, 170, 166, 236,  30,  22, 162,   0,
- 106,  12, 248,  33,  48,  72, 159,  17,  76, 244, 172, 132,  89, 171, 196,  76,
- 254, 166,  76, 218, 226,   3,  52, 220, 238, 181, 179, 144, 225,  23,   3, 166,
- 158,  35, 228, 154, 204,  23, 203,  71, 134, 189,  18, 168, 236, 141, 117, 138,
-   2, 132,  78,  57, 154,  21, 250, 196, 184,  40, 161,  40,  10, 178, 134, 120,
- 132, 123, 101,  82, 205, 121,  55, 140, 231,  56, 231,  71, 206, 246, 198, 150,
- 146, 192,  45, 105, 242,   1, 125,  18, 176,  46, 222, 122,  19,  80, 113, 133,
- 131, 162,  81,  51,  98, 168, 247, 161, 139,  39,  63, 162,  22, 153, 170,  92,
-  91, 130, 174, 200,  45, 112,  99, 164, 132, 184, 191, 186, 200, 167,  86, 145,
- 167, 227, 130,  44,  12, 158, 172, 249, 204,  17,  54, 249,  16, 200,  21, 174,
-  67, 223, 105, 201,  50,  36, 133, 203, 244, 131, 228,  67,  29, 195,  91,  91,
-  55, 107, 167, 154, 170, 137, 218, 183, 169,  61,  99, 175, 128,  23, 142, 183,
-  66, 255,  59, 187,  66,  85, 212, 109, 168,  82,  16,  43,  67, 139, 114, 176,
- 216, 255, 130,  94, 152,  79, 183,  64, 100,  23, 214,  82,  34, 230,  48,  15,
- 242, 130,  50, 241,  81,  32,   5, 125, 183, 182, 184,  99, 248, 109, 159, 210,
- 226,  61, 119, 129,  39, 149,  78, 214, 107,  78, 147, 124, 228,  18, 143, 188,
-  84, 180, 233, 119,  64,  39, 158, 133, 177, 168,   6, 150,  80, 117, 150,  56,
-  49,  72,  49,  37,  30, 242,  49, 142,  33, 156,  34,  44,  44,  72,  58,  22,
- 249,  46, 168,  80,  25, 196,  64, 174,  97, 179, 244, 134, 213, 105,  63, 151,
-  21,  90, 168,  90, 245,  28, 157,  65, 250, 232, 188,  27,  99, 160, 156, 127,
-  68, 193,  10,  80, 205,  36, 138, 229,  12, 223,  70, 169, 251,  41,  48,  94,
-  41, 177,  99, 256, 158,   0,   6,  83, 231, 191, 120, 135, 157, 146, 218, 213,
- 160,   7,  47, 234,  98, 211,  79, 225, 179,  95, 175, 105, 185,  79, 115,   0,
- 104,  14,  65, 124,  15, 188,  52,   9, 253,  27, 132, 137,  13, 127,  75, 238,
- 185, 253,  33,   8,  52, 157, 164,  68, 232, 188,  69,  28, 209, 233,   5, 129,
- 216,  90, 252, 212,  33, 200, 222,   9, 112,  15,  43,  36, 226, 114,  15, 249,
- 217,   8, 148,  22, 147,  23, 143,  67, 222, 116, 235, 250, 212, 210,  39, 142,
- 108,  64, 209,  83,  73,  66,  99,  34,  17,  29,  45, 151, 244, 114,  28, 241,
- 144, 208, 146, 179, 132,  89, 217, 198, 252, 219, 205, 165,  75, 107,  11, 173,
-  76,   6, 196, 247, 152, 216, 248,  91, 209, 178,  57, 250, 174,  60,  79, 123,
-  18, 135,   9, 241, 230, 159, 184,  68, 156, 251, 215,   9, 113, 234,  75, 235,
- 103, 194, 205, 129, 230,  45,  96,  73, 157,  20, 200, 212, 212, 228, 161,   7,
- 231, 228, 108,  43, 198,  87, 140, 140,   4, 182, 164,   3,  53, 104, 250, 213,
-  85,  38,  89,  61,  52, 187,  35, 204,  86, 249, 100,  71, 248, 213, 163, 215,
-  66, 106, 252, 129,  40, 111,  47,  24, 186, 221,  85, 205, 199, 237, 122, 181,
-  32,  46, 182, 135,  33, 251, 142,  34, 208, 242, 128, 255,   4, 234,  15,  33,
- 167, 222,  32, 186, 191,  34, 255, 244,  98, 240, 228, 204,  30, 142,  32,  70,
-  69,  83, 110, 151,  10, 243, 141,  21, 223,  69,  61,  37,  59, 209, 102, 114,
- 223,  33, 129, 254, 255, 103,  86, 247, 235,  72, 126, 177, 102, 226, 102,  30,
- 149, 221,  62, 247, 251, 120, 163, 173,  57, 202, 204,  24,  39, 106, 120, 143,
- 202, 176, 191, 147,  37,  38,  51, 133,  47, 245, 157, 132, 154,  71, 183, 111,
-  30, 180,  18, 202,  82,  96, 170,  91, 157, 181, 212, 140, 256,   8, 196, 121,
- 149,  79,  66, 127, 113,  78,   4, 197,  84, 256, 111, 222, 102,  63, 228, 104,
- 136, 223,  67, 193,  93, 154, 249,  83, 204, 101, 200, 234,  84, 252, 230, 195,
-  43, 140, 120, 242,  89,  63, 166, 233, 209,  94,  43, 170, 126,   5, 205,  78,
- 112,  80, 143, 151, 146, 248, 137, 203,  45, 183,  61,   1, 155,   8, 102,  59,
-  68, 212, 230,  61, 254, 191, 128, 223, 176, 123, 229,  27, 146, 120,  96, 165,
- 213,  12, 232,  40, 186, 225,  66, 105, 200, 195, 212, 110, 237, 238, 151,  19,
-  12, 171, 150,  82,   7, 228,  79,  52,  15,  78,  62,  43,  21, 154, 114,  21,
-  12, 212, 256, 232, 125, 127,   5,  51,  37, 252, 136,  13,  47, 195, 168, 191,
- 231,  55,  57, 251, 214, 116,  15,  86, 210,  41, 249, 242, 119,  27, 250, 203,
- 107,  69,  90,  43, 206, 154, 127,  54, 100,  78, 187,  54, 244, 177, 234, 167,
- 202, 136, 209, 171,  69, 114, 133, 173,  26, 139,  78, 141, 128,  32, 124,  39,
-  45, 218,  96,  68,  90,  44,  67,  62,  83, 190, 188, 256, 103,  42, 102,  64,
- 249,   0, 141,  11,  61,  69,  70,  66, 233, 237,  29, 200, 251, 157,  71,  51,
-  64, 133, 113,  76,  35, 125,  76, 137, 217, 145,  35,  69, 226, 180,  56, 249,
- 156, 163, 176, 237,  81,  54,  85, 169, 115, 211, 129,  70, 248,  40, 252, 192,
- 194, 101, 247,   8, 181, 124, 217, 191, 194,  93,  99, 127, 117, 177, 144, 151,
- 228, 121,  32,  11,  89,  81,  26,  29, 183,  76, 249, 132, 179,  70,  34, 102,
-  20,  66,  87,  63, 124, 205, 174, 177,  87, 219,  73, 218,  91,  87, 176,  72,
-  15, 211,  47,  61, 251, 165,  39, 247, 146,  70, 150,  57,   1, 212,  36, 162,
-  39,  38,  16, 216,   3,  50, 116, 200,  32, 234,  77, 181, 155,  19,  90, 188,
-  36,   6, 254,  46,  46, 203,  25, 230, 181, 196,   4, 151, 225,  65, 122, 216,
- 168,  86, 158, 131, 136,  16,  49, 102, 233,  64, 154,  88, 228,  52, 146,  69,
-  93, 157, 243, 121,  70, 209, 126, 213,  88, 145, 236,  65,  70,  96, 204,  47,
-  10, 200,  77,   8, 103, 150,  48, 153,   5,  37,  52, 235, 209,  31, 181, 126,
-  83, 142, 224, 140,   6,  32, 200, 171, 160, 179, 115, 229,  75, 194, 208,  39,
-  59, 223,  52, 247,  38, 197, 135,   1,   6, 189, 106, 114, 168,   5, 211, 222,
-  44,  63,  90, 160, 116, 172, 170, 133, 125, 138,  39, 131,  23, 178,  10, 214,
-  36,  93,  28,  59,  68,  17, 123,  25, 255, 184, 204, 102, 194, 214, 129,  94,
- 159, 245, 112, 141,  62,  11,  61, 197, 124, 221, 205,  11,  79,  71, 201,  54,
-  58, 150,  29, 121,  87,  46, 240, 201,  68,  20, 194, 209,  47, 152, 158, 174,
- 193, 164, 120, 255, 216, 165, 247,  58,  85, 130, 220,  23, 122, 223, 188,  98,
-  21,  70,  72, 170, 150, 237,  76, 143, 112, 238, 206, 146, 215, 110,   4, 250,
-  68,  44, 174, 177,  30,  98, 143, 241, 180, 127, 113,  48,   0,   1, 179, 199,
-  59, 106, 201, 114,  29,  86, 173, 133, 217,  44, 200, 141, 107, 172,  16,  60,
-  82,  58, 239,  94, 141, 234, 186, 235, 109, 173, 249, 139, 141,  59, 100, 248,
-  84, 144,  49, 160,  51, 207, 164, 103,  74,  97, 146, 202, 193, 125, 168, 134,
- 236, 111, 135, 121,  59, 145, 168, 200, 181, 173, 109,   2, 255,   6,   9, 245,
-  90, 202, 214, 143, 121,  65,  85, 232, 132,  77, 228,  84,  26,  54, 184,  15,
- 161,  29, 177,  79,  43,   0, 156, 184, 163, 165,  62,  90, 179,  93,  45, 239,
-   1,  16, 120, 189, 127,  47,  74, 166,  20, 214, 233, 226,  89, 217, 229,  26,
- 156,  53, 162,  60,  21,   3, 192,  72, 111,  51,  53, 101, 181, 208,  88,  82,
- 179, 160, 219, 113, 240, 108,  43, 224, 162, 147,  62,  14,  95,  81, 205,   4,
- 160, 177, 225, 115,  29,  69, 235, 168, 148,  29, 128, 114, 124, 129, 172, 165,
- 215, 231, 214,  86, 160,  44, 157,  91, 248, 183,  73, 164,  56, 181, 162,  92,
- 141, 118, 127, 240, 196,  77,   0,   9, 244,  79, 250, 100, 195,  25, 255,  85,
-  94,  35, 212, 137, 107,  34, 110,  20, 200, 104,  17,  32, 231,  43, 150, 159,
- 231, 216, 223, 190, 226, 109, 162, 197,  87,  92, 224,  11, 111,  73,  60, 225,
- 238,  73, 246, 169,  19, 217, 119,  38, 121, 118,  70,  82,  99, 241, 110,  67,
-  31,  76, 146, 215, 124, 240,  31, 103, 139, 224,  75, 160,  31,  78,  93,   4,
-  64,   9, 103, 223,   6, 227, 119,  85, 116,  81,  21,  43,  46, 206, 234, 132,
-  85,  99,  22, 131, 135,  97,  86,  13, 234, 188,  21,  14,  89, 169, 207, 238,
- 219, 177, 190,  72, 157,  41, 114, 140,  92, 141, 186,   1,  63, 107, 225, 184,
- 118, 150, 153, 254, 241, 106, 120, 210, 104, 144, 151, 161,  88, 206, 125, 164,
-  15, 211, 173,  49, 146, 241,  71,  36,  58, 201,  46,  27,  33, 187,  91, 162,
- 117,  19, 210, 213, 187,  97, 193,  50, 190, 114, 217,  60,  61, 167, 207, 213,
- 213,  53, 135,  34, 156,  91, 115, 119,  46,  99, 242,   1,  90,  52, 198, 227,
- 201,  91, 216, 146, 210,  82, 121,  38,  73, 133, 182, 193, 132, 148, 246,  75,
- 109, 157, 179, 113, 176, 134, 205, 159, 148,  58, 103, 171, 132, 156, 133, 147,
- 161, 231,  39, 100, 175,  97, 125,  28, 183, 129, 135, 191, 202, 181,  29, 218,
-  43, 104, 148, 203, 189, 204,   4, 182, 169,   1, 134, 122, 141, 202,  13, 187,
- 177, 112, 162,  35, 231,   6,   8, 241,  99,   6, 191,  45, 113, 113, 101, 104};
-
-// The S-Box we use for further linearity breaking.
-// We created it by taking the digits of decimal expansion of e.
-// The code that created it can be found in 'ProduceRandomSBox.c'.
-unsigned char SBox[256] = {
-//0     1    2      3     4    5     6     7      8    9     A      B    C     D     E     F
-0x7d, 0xd1, 0x70, 0x0b, 0xfa, 0x39, 0x18, 0xc3, 0xf3, 0xbb, 0xa7, 0xd4, 0x84, 0x25, 0x3b, 0x3c,   // 0
-0x2c, 0x15, 0x69, 0x9a, 0xf9, 0x27, 0xfb, 0x02, 0x52, 0xba, 0xa8, 0x4b, 0x20, 0xb5, 0x8b, 0x3a,   // 1
-0x88, 0x8e, 0x26, 0xcb, 0x71, 0x5e, 0xaf, 0xad, 0x0c, 0xac, 0xa1, 0x93, 0xc6, 0x78, 0xce, 0xfc,   // 2
-0x2a, 0x76, 0x17, 0x1f, 0x62, 0xc2, 0x2e, 0x99, 0x11, 0x37, 0x65, 0x40, 0xfd, 0xa0, 0x03, 0xc1,   // 3
-0xca, 0x48, 0xe2, 0x9b, 0x81, 0xe4, 0x1c, 0x01, 0xec, 0x68, 0x7a, 0x5a, 0x50, 0xf8, 0x0e, 0xa3,   // 4
-0xe8, 0x61, 0x2b, 0xa2, 0xeb, 0xcf, 0x8c, 0x3d, 0xb4, 0x95, 0x13, 0x08, 0x46, 0xab, 0x91, 0x7b,   // 5
-0xea, 0x55, 0x67, 0x9d, 0xdd, 0x29, 0x6a, 0x8f, 0x9f, 0x22, 0x4e, 0xf2, 0x57, 0xd2, 0xa9, 0xbd,   // 6
-0x38, 0x16, 0x5f, 0x4c, 0xf7, 0x9e, 0x1b, 0x2f, 0x30, 0xc7, 0x41, 0x24, 0x5c, 0xbf, 0x05, 0xf6,   // 7
-0x0a, 0x31, 0xa5, 0x45, 0x21, 0x33, 0x6b, 0x6d, 0x6c, 0x86, 0xe1, 0xa4, 0xe6, 0x92, 0x9c, 0xdf,   // 8
-0xe7, 0xbe, 0x28, 0xe3, 0xfe, 0x06, 0x4d, 0x98, 0x80, 0x04, 0x96, 0x36, 0x3e, 0x14, 0x4a, 0x34,   // 9
-0xd3, 0xd5, 0xdb, 0x44, 0xcd, 0xf5, 0x54, 0xdc, 0x89, 0x09, 0x90, 0x42, 0x87, 0xff, 0x7e, 0x56,   // A
-0x5d, 0x59, 0xd7, 0x23, 0x75, 0x19, 0x97, 0x73, 0x83, 0x64, 0x53, 0xa6, 0x1e, 0xd8, 0xb0, 0x49,   // B
-0x3f, 0xef, 0xbc, 0x7f, 0x43, 0xf0, 0xc9, 0x72, 0x0f, 0x63, 0x79, 0x2d, 0xc0, 0xda, 0x66, 0xc8,   // C
-0x32, 0xde, 0x47, 0x07, 0xb8, 0xe9, 0x1d, 0xc4, 0x85, 0x74, 0x82, 0xcc, 0x60, 0x51, 0x77, 0x0d,   // D
-0xaa, 0x35, 0xed, 0x58, 0x7c, 0x5b, 0xb9, 0x94, 0x6e, 0x8d, 0xb1, 0xc5, 0xb7, 0xee, 0xb6, 0xae,   // E
-0x10, 0xe0, 0xd6, 0xd9, 0xe5, 0x4f, 0xf1, 0x12, 0x00, 0xd0, 0xf4, 0x1a, 0x6f, 0x8a, 0xb3, 0xb2 }; // F
-
-///////////////////////////////////////////////////////////////////////////////////////////////
-//
-//	Helper functions definition portion.
-//
-///////////////////////////////////////////////////////////////////////////////////////////////
-
-// Don't vectorize, move decl to header file
-
-// Translates an input array with values in base 257 to output array with values in base 256.
-// Returns the carry bit.
-//
-// Parameters:
-// - input: the input array of size EIGHTH_N. Each value in the array is a number in Z_257.
-//          The MSB is assumed to be the last one in the array.
-// - output: the input array encoded in base 256.
-//
-// Returns:
-// - The carry bit (MSB).
-swift_int16_t TranslateToBase256(swift_int32_t input[EIGHTH_N], unsigned char output[EIGHTH_N]);
-
-// Translates an input integer into the range (-FIELD_SIZE / 2) <= result <= (FIELD_SIZE / 2).
-//
-// Parameters:
-// - x: the input integer.
-//
-// Returns:
-// - The result, which equals (x MOD FIELD_SIZE), such that |result| <= (FIELD_SIZE / 2).
-int Center(int x);
-
-// Calculates bit reversal permutation.
-//
-// Parameters:
-// - input: the input to reverse.
-// - numOfBits: the number of bits in the input to reverse.
-//
-// Returns:
-// - The resulting number, which is obtained from the input by reversing its bits.
-int ReverseBits(int input, int numOfBits);
-
-// Initializes the FFT fast lookup table.
-// Shall be called only once.
-void InitializeSWIFFTX();
-
-// Calculates the FFT.
-//
-// Parameters:
-// - input: the input to the FFT.
-// - output: the resulting output.
-void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output);
-
-///////////////////////////////////////////////////////////////////////////////////////////////
-// Helper functions implementation portion.
-///////////////////////////////////////////////////////////////////////////////////////////////
-
-// Don't vectorize, delete this copy.
-
-swift_int16_t TranslateToBase256(swift_int32_t input[EIGHTH_N], unsigned char output[EIGHTH_N])
-{
-	swift_int32_t pairs[EIGHTH_N / 2];
-	int i;
-
-	for (i = 0; i < EIGHTH_N; i += 2)
-	{
-		// input[i] + 257 * input[i + 1]
-		pairs[i >> 1] = input[i] + input[i + 1] + (input[i + 1] << 8);
-	}
-
-	for (i = (EIGHTH_N / 2) - 1; i > 0; --i)
-	{
-		int j;
-
-		for (j = i - 1; j < (EIGHTH_N / 2) - 1; ++j)
-		{
-			// pairs[j + 1] * 513, because 257^2 = 513 % 256^2.
-			register swift_int32_t temp = pairs[j] + pairs[j + 1] + (pairs[j + 1] << 9);
-			pairs[j] = temp & 0xffff;
-			pairs[j + 1] += (temp >> 16);
-		}
-	}
-
-	for (i = 0; i < EIGHTH_N; i += 2)
-	{
-		output[i] = (unsigned char) (pairs[i >> 1] & 0xff);
-		output[i + 1] = (unsigned char) ((pairs[i >> 1] >> 8) & 0xff);
-	}
-
-	return (pairs[EIGHTH_N/2 - 1] >> 16);
-}
-
-int Center(int x)
-{
-	int result = x % FIELD_SIZE;
-
-	if (result > (FIELD_SIZE / 2))
-		result -= FIELD_SIZE;
-
-	if (result < (FIELD_SIZE / -2))
-		result += FIELD_SIZE;
-
-	return result;
-}
-
-int ReverseBits(int input, int numOfBits)
-{
-	register int reversed = 0;
-
-	for (input |= numOfBits; input > 1; input >>= 1)
-		reversed = (reversed << 1) | (input & 1);
-
-	return reversed;
-}
-
-void InitializeSWIFFTX()
-{
-	int i, j, k, x;
-	// The powers of OMEGA
-	int omegaPowers[2 * N];
-	omegaPowers[0] = 1;
-
-	if (wasSetupDone)
-		return;
-
-	for (i = 1; i < (2 * N); ++i)
-	{
-		omegaPowers[i] = Center(omegaPowers[i - 1] * OMEGA);
-	}
-
-	for (i = 0; i < (N / W); ++i)
-	{
-		for (j = 0; j < W; ++j)
-		{
-			multipliers[(i << 3) + j] = omegaPowers[ReverseBits(i, N / W) * (2 * j + 1)];
-		}
-	}
-
-	for (x = 0; x < 256; ++x)
-	{
-		for (j = 0; j < 8; ++j)
-		{
-			register int temp = 0;
-			for (k = 0; k < 8; ++k)
-			{
-				temp += omegaPowers[(EIGHTH_N * (2 * j + 1) * ReverseBits(k, W)) % (2 * N)]
-					  * ((x >> k) & 1);
-			}
-
-			fftTable[(x << 3) + j] = Center(temp);
-		}
-	}
-
-	wasSetupDone = true;
-}
-
-// input should be deinterleaved in contiguos memory
-// output and F are 4x32
-// multipliers & fftTable are scalar 16
-
-
-void FFT_4way(const unsigned char input[EIGHTH_N], swift_int32_t *output)
-{
-	swift_int16_t *mult = multipliers;
-   m128_swift_int32_t F[64];
-
-   for (int i = 0; i < 8; i++)
-   {
-      int j = i<<3;
-
-// Need to isolate bytes in input, 8 bytes per lane.
-// Each iteration of the loop process one input vector
-// Each lane reads a different index to ffttable.
-
-// deinterleave the input!
-
-// load table with 4 lanes from different indexes into fftTable
-// extract bytes into m128 4x16
-// mutiply by vectorized mult
-
-// input[lane][byte]
-
-      __m128i table;
-      table = _mm_set_epi32( fftTable[ input[3][i] ],
-                             fftTable[ input[2][i] ],
-                             fftTable[ input[1][i] ],
-                             fftTable[ input[0][i] ] );
-
-      F[i  ] = _mm_mullo_epi32( mm128_const1_32( mult[j+0] ), table );
-
-      table = _mm_set_epi32( fftTable[ input[3][i+1] ]
-                             fftTable[ input[2][i+1] ]
-                             fftTable[ input[1][i+1] ]
-                             fftTable[ input[0][i+1] ] );
-
-      F[i+8] = _mm_mullo_epi32( mm128_const1_32( mult[j+0] ), table );
-
-
-      m128_swift_int16_t *table = &( fftTable[input[i] << 3] );
-
-      F[i   ] = _mm_mullo_epi32( mm128_const1_32( mult[j+0] ),
-                                 mm128_const1_32( table[0] ) );
-      F[i+ 8] = _mm_mullo_epi32( mm128_const1_32( mult[j+1] ),
-                                 mm128_const1_32( table[1] ) );
-      F[i+16] = _mm_mullo_epi32( mm128_const1_32( mult[j+2] ),
-                                 mm128_const1_32( table[2] ) );
-      F[i+24] = _mm_mullo_epi32( mm128_const1_32( mult[j+3] ),
-                                 mm128_const1_32( table[3] ) );
-      F[i+32] = _mm_mullo_epi32( mm128_const1_32( mult[j+4] ),
-                                 mm128_const1_32( table[4] ) );
-      F[i+40] = _mm_mullo_epi32( mm128_const1_32( mult[j+5] ),
-                                 mm128_const1_32( table[5] ) );
-      F[i+48] = _mm_mullo_epi32( mm128_const1_32( mult[j+6] ),
-                                 mm128_const1_32( table[6] ) );
-      F[i+56] = _mm_mullo_epi32( mm128_const1_32( mult[j+7] ),
-                                 mm128_const1_32( table[7] ) );
-   }
-
-
-   for ( int i = 0; i < 8; i++ )
-   {
-      int j = i<<3;
-      ADD_SUB_4WAY( F[j  ], F[j+1] );
-      ADD_SUB_4WAY( F[j+2], F[j+3] );
-      ADD_SUB_4WAY( F[j+4], F[j+5] );
-      ADD_SUB_4WAY( F[j+6], F[j+7] );
-
-      F[j+3] = _mm_slli_epi32( F[j+3], 4 );
-      F[j+7] = _mm_slli_epi32( F[j+7], 4 );
-
-      ADD_SUB_4WAY( F[j  ], F[j+2] );
-      ADD_SUB_4WAY( F[j+1], F[j+3] );
-      ADD_SUB_4WAY( F[j+4], F[j+6] );
-      ADD_SUB_4WAY( F[j+5], F[j+7] );
-
-      F[j+5] = _mm_slli_epi32( F[j+5], 2 );
-      F[j+6] = _mm_slli_epi32( F[j+6], 4 );
-      F[j+7] = _mm_slli_epi32( F[j+7], 6 );
-
-      ADD_SUB_4WAY( F[j  ], F[j+4] );
-      ADD_SUB_4WAY( F[j+1], F[j+5] );
-      ADD_SUB_4WAY( F[j+2], F[j+6] );
-      ADD_SUB_4WAY( F[j+3], F[j+7] );
-
-      output[i   ] = Q_REDUCE_4WAY( F[j  ] );
-      output[i+ 8] = Q_REDUCE_4WAY( F[j+1] );
-      output[i+16] = Q_REDUCE_4WAY( F[j+2] );
-      output[i+24] = Q_REDUCE_4WAY( F[j+3] );
-      output[i+32] = Q_REDUCE_4WAY( F[j+4] );
-      output[i+40] = Q_REDUCE_4WAY( F[j+5] );
-      output[i+48] = Q_REDUCE_4WAY( F[j+6] );
-      output[i+56] = Q_REDUCE_4WAY( F[j+7] );
-   }
-}
-
-// Calculates the FFT part of SWIFFT.
-// We divided the SWIFFT calculation into two, because that way we could save 2 computations of
-// the FFT part, since in the first stage of SWIFFTX the difference between the first 3 SWIFFTs
-// is only the A's part.
-//
-// Parameters:
-// - input: the input to FFT.
-// - m: the input size divided by 8. The function performs m FFTs.
-// - output: will store the result.
-void SWIFFTFFT(const unsigned char *input, int m, swift_int32_t *output)
-{
-	int i;
-
-	for (i = 0;
-		 i < m;
-		 i++, input += EIGHTH_N, output += N)
-	{
-		FFT(input, output);
-	}
-}
-
-// Calculates the 'sum' part of SWIFFT, including the base change at the end.
-// We divided the SWIFFT calculation into two, because that way we could save 2 computations of
-// the FFT part, since in the first stage of SWIFFTX the difference between the first 3 SWIFFTs
-// is only the A's part.
-//
-// Parameters:
-// - input: the input. Of size 64 * m.
-// - m: the input size divided by 64.
-// - output: will store the result.
-// - a: the coefficients in the sum. Of size 64 * m.
-void SWIFFTSum(const swift_int32_t *input, int m, unsigned char *output, const swift_int16_t *a)
-{
-	int i, j;
-	swift_int32_t result[N];
-	register swift_int16_t carry = 0;
-
-	for (j = 0; j < N; ++j)
-	{
-		register swift_int32_t sum = 0;
-		const register swift_int32_t *f = input + j;
-		const register swift_int16_t *k = a + j;
-
-		for (i = 0; i < m; i++, f += N,k += N)
-		{
-			sum += (*f) * (*k);
-		}
-
-		result[j] = sum;
-	}
-
-	for (j = 0; j < N; ++j)
-	{
-		result[j] = ((FIELD_SIZE << 22) + result[j]) % FIELD_SIZE;
-	}
-
-	for (j = 0; j < 8; ++j)
-	{
-		int register carryBit = TranslateToBase256(result + (j << 3), output + (j << 3));
-		carry |= carryBit << j;
-	}
-
-	output[N] = carry;
-}
-
-
-// On entry input is interleaved 4x64. SIZE is *4 lanes / 8 bytes,
-// multiply by 2.
-
-
-void ComputeSingleSWIFFTX_4way( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
-                          unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE],
-						  bool doSmooth)
-{
-	int i;
-	// Will store the result of the FFT parts:
-   m128_swift_int32_t fftOut[N * M];
-//   swift_int32_t fftOut[N * M];
-	unsigned char intermediate[N * 3 + 8];
-	unsigned char carry0,carry1,carry2;
-
-	// Do the three SWIFFTS while remembering the three carry bytes (each carry byte gets
-	// overriden by the following SWIFFT):
-
-	// 1. Compute the FFT of the input - the common part for the first 3 SWIFFTs:
-	SWIFFTFFT(input, M, fftOut);
-
-	// 2. Compute the sums of the 3 SWIFFTs, each using a different set of coefficients:
-
-	// 2a. The first SWIFFT:
-	SWIFFTSum(fftOut, M, intermediate, As);
-	// Remember the carry byte:
-	carry0 = intermediate[N];
-
-	// 2b. The second one:
-	SWIFFTSum(fftOut, M, intermediate + N, As + (M * N));
-	carry1 = intermediate[2 * N];
-
-	// 2c. The third one:
-	SWIFFTSum(fftOut, M, intermediate + (2 * N), As + 2 * (M * N));
-	carry2 = intermediate[3 * N];
-
-	//2d. Put three carry bytes in their place
-	intermediate[3 * N] = carry0;
-	intermediate[(3 * N) + 1] = carry1;
-	intermediate[(3 * N) + 2] = carry2;
-
-	// Padding  intermediate output with 5 zeroes.
-	memset(intermediate + (3 * N) + 3, 0, 5);
-
-	// Apply the S-Box:
-	for (i = 0; i < (3 * N) + 8; ++i)
-	{
-		intermediate[i] = SBox[intermediate[i]];
-	}
-
-	// 3. The final and last SWIFFT:
-	SWIFFTFFT(intermediate, 3 * (N/8) + 1, fftOut);
-	SWIFFTSum(fftOut,       3 * (N/8) + 1, output, As);
-
-	if (doSmooth)
-	{
-		unsigned char sum[N];
-		register int i, j;
-		memset(sum, 0, N);
-
-		for (i = 0; i < (N + 1) * 8; ++i)
-		{
-			register const swift_int16_t *AsRow;
-			register int AShift;
-
-			if  (!(output[i >> 3] & (1 << (i & 7))))
-			{
-				continue;
-			}
-
-			AsRow = As + N * M + (i & ~(N - 1)) ;
-			AShift = i & 63;
-
-			for (j = AShift; j < N; ++j)
-			{
-				sum[j] += AsRow[j - AShift];
-			}
-
-			for(j = 0; j < AShift; ++j)
-			{
-				sum[j] -= AsRow[N - AShift + j];
-			}
-		}
-
-		for (i = 0; i < N; ++i)
-		{
-			output[i] = sum[i];
-		}
-
-		output[N] = 0;
-	}
-}
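Review note on the arithmetic shared by the removed 4-way prototype above and the vectorized FFT in swifftx.c: both depend on two scalar building blocks, the ADD_SUB butterfly and the Q_REDUCE fold modulo 257. The sketch below restates them in plain C, assuming the macro definitions shown in swifftx.c (minus the register keyword). The helper names mod257 and q_reduce_is_congruent are hypothetical and exist only for illustration. It also assumes that right-shifting a negative int is an arithmetic shift, which is implementation-defined in C but is what GCC and Clang do.

```c
#include <assert.h>

/* Scalar forms of the two macros, following their definitions in swifftx.c. */
#define ADD_SUB(A, B) { int temp = (B); B = ((A) - (B)); A = ((A) + (temp)); }
#define Q_REDUCE(A) (((A) & 0xff) - ((A) >> 8))

/* Hypothetical helper: canonical residue of x modulo 257, in 0..256. */
static int mod257(int x)
{
    int r = x % 257;
    return r < 0 ? r + 257 : r;
}

/* Hypothetical helper: returns 1 if Q_REDUCE(x) is congruent to x
 * modulo 257 for every x in [lo, hi], and 0 otherwise.
 * Assumes an arithmetic right shift for negative values. */
static int q_reduce_is_congruent(int lo, int hi)
{
    for (int x = lo; x <= hi; ++x)
        if (mod257(Q_REDUCE(x)) != mod257(x))
            return 0;
    return 1;
}

/* Hypothetical helper: one butterfly step, (a, b) -> (a + b, a - b). */
static void butterfly(int *a, int *b)
{
    int x = *a, y = *b;
    ADD_SUB(x, y);
    assert(x == *a + *b && y == *a - *b);
    *a = x;
    *b = y;
}
```

Q_REDUCE works because 256 is congruent to -1 modulo 257, so a value x = 256*h + l folds to l - h. The folded value is congruent to x but is not guaranteed to lie in 0..256, which is why the check compares canonical residues rather than raw values.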
--- a/algo/swifftx/swifftx.c
+++ b/algo/swifftx/swifftx.c
@@ -18,6 +18,8 @@
 //#include "stdbool.h"
 #include <memory.h>

+#include "simd-utils.h"
+
 ///////////////////////////////////////////////////////////////////////////////////////////////
 // Constants and static tables portion.
 ///////////////////////////////////////////////////////////////////////////////////////////////
@@ -49,20 +51,20 @@
 // - A: the first operand. After the operation stores the sum of the two operands.
 // - B: the second operand. After the operation stores the difference between the first and the
 //   second operands.
-#define ADD_SUB(A, B) {register int temp = (B); B = ((A) - (B)); A = ((A) + (temp));}
+//#define ADD_SUB(A, B) {register int temp = (B); B = ((A) - (B)); A = ((A) + (temp));}

 // Quickly reduces an integer modulo 257.
 //
 // Parameters:
 // - A: the input.
-#define Q_REDUCE(A) (((A) & 0xff) - ((A) >> 8))
+//#define Q_REDUCE(A) (((A) & 0xff) - ((A) >> 8))

 // Since we need to do the setup only once, this is the indicator variable:
 static bool wasSetupDone = false;

 // This array stores the powers of omegas that correspond to the indices, which are the input
 // values. Known also as the "outer FFT twiddle factors".
-swift_int16_t multipliers[N];
+swift_int16_t multipliers[N] __attribute__ ((aligned (64)));

 // This array stores the powers of omegas, multiplied by the corresponding values.
 // We store this table to save computation time.
@@ -72,14 +74,14 @@ swift_int16_t multipliers[N];
 // compression function, i is between 0 and 31, x_i is a 64-bit value.
 // One can see the formula for this (intermediate) stage in the SWIFFT FSE 2008 paper --
 // formula (2), section 3, page 6.
-swift_int16_t fftTable[256 * EIGHTH_N];
+swift_int16_t fftTable[256 * EIGHTH_N] __attribute__ ((aligned (64)));

 // The A's we use in SWIFFTX shall be random elements of Z_257.
 // We generated these A's from the decimal expansion of PI as follows:  we converted each
 // triple of digits into a decimal number d. If d < (257 * 3) we used (d % 257) for the next A
 // element, otherwise move to the next triple of digits in the expansion. This guarntees that
 // the A's are random, provided that PI digits are.
-const swift_int16_t As[3 * M * N] =
+const swift_int16_t As[3 * M * N] __attribute__ ((aligned (64))) =
 {141,  78, 139,  75, 238, 205, 129, 126,  22, 245, 197, 169, 142, 118, 105,  78,
  50, 149,  29, 208, 114,  34,  85, 117,  67, 148,  86, 256,  25,  49, 133,  93,
  95,  36,  68, 231, 211, 102, 151, 128, 224, 117, 193,  27, 102, 187,   7, 105,
@@ -602,21 +604,14 @@ void InitializeSWIFFTX()
 	int omegaPowers[2 * N];
 	omegaPowers[0] = 1;

-	if (wasSetupDone)
-		return;
+	if (wasSetupDone) return;

 	for (i = 1; i < (2 * N); ++i)
-	{
 		omegaPowers[i] = Center(omegaPowers[i - 1] * OMEGA);
-	}

 	for (i = 0; i < (N / W); ++i)
-	{
 		for (j = 0; j < W; ++j)
-		{
 			multipliers[(i << 3) + j] = omegaPowers[ReverseBits(i, N / W) * (2 * j + 1)];
-		}
-	}

 	for (x = 0; x < 256; ++x)
 	{
@@ -624,10 +619,8 @@ void InitializeSWIFFTX()
 		{
 			register int temp = 0;
 			for (k = 0; k < 8; ++k)
-			{
 				temp += omegaPowers[(EIGHTH_N * (2 * j + 1) * ReverseBits(k, W)) % (2 * N)]
 					  * ((x >> k) & 1);
-			}

 			fftTable[(x << 3) + j] = Center(temp);
 		}
@@ -636,9 +629,203 @@ void InitializeSWIFFTX()
 	wasSetupDone = true;
 }

+// In the original code the F matrix is rotated, so it was not arranged
+// the same as all the other data. Rearranging F to match all the other
+// data made vectorizing possible. The compiler probably could have
+// auto-vectorized with proper data organisation.
+// Also, the custom 16 bit data types of the original code are now all
+// 32 bit int32_t regardless of the type name.
+//
 void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
 {
-	swift_int16_t *mult = multipliers;
+#if defined(__AVX2__)
+
+   __m256i F[8] __attribute__ ((aligned (64)));
+   __m256i *mul = (__m256i*)multipliers;
+   __m256i *out = (__m256i*)output;
+   __m256i *tbl = (__m256i*)&( fftTable[ input[0] << 3 ] );
+
+   F[0] = _mm256_mullo_epi32( mul[0], *tbl );
+   tbl = (__m256i*)&( fftTable[ input[1] << 3 ] );
+   F[1] = _mm256_mullo_epi32( mul[1], *tbl );
+   tbl = (__m256i*)&( fftTable[ input[2] << 3 ] );
+   F[2] = _mm256_mullo_epi32( mul[2], *tbl );
+   tbl = (__m256i*)&( fftTable[ input[3] << 3 ] );
+   F[3] = _mm256_mullo_epi32( mul[3], *tbl );
+   tbl = (__m256i*)&( fftTable[ input[4] << 3 ] );
+   F[4] = _mm256_mullo_epi32( mul[4], *tbl );
+   tbl = (__m256i*)&( fftTable[ input[5] << 3 ] );
+   F[5] = _mm256_mullo_epi32( mul[5], *tbl );
+   tbl = (__m256i*)&( fftTable[ input[6] << 3 ] );
+   F[6] = _mm256_mullo_epi32( mul[6], *tbl );
+   tbl = (__m256i*)&( fftTable[ input[7] << 3 ] );
+   F[7] = _mm256_mullo_epi32( mul[7], *tbl );
+
+   #define ADD_SUB( a, b ) \
+   { \
+      __m256i tmp = b; \
+      b = _mm256_sub_epi32( a, b ); \
+      a = _mm256_add_epi32( a, tmp ); \
+   }
+   
+   ADD_SUB( F[0], F[1] );
+   ADD_SUB( F[2], F[3] );
+   ADD_SUB( F[4], F[5] );
+   ADD_SUB( F[6], F[7] );
+
+   F[3] = _mm256_slli_epi32( F[3], 4 );
+   F[7] = _mm256_slli_epi32( F[7], 4 );
+
+   ADD_SUB( F[0], F[2] );
+   ADD_SUB( F[1], F[3] );
+   ADD_SUB( F[4], F[6] );
+   ADD_SUB( F[5], F[7] );  
+
+   F[5] = _mm256_slli_epi32( F[5], 2 );
+   F[6] = _mm256_slli_epi32( F[6], 4 );
+   F[7] = _mm256_slli_epi32( F[7], 6 );
+
+   ADD_SUB( F[0], F[4] );
+   ADD_SUB( F[1], F[5] );
+   ADD_SUB( F[2], F[6] );
+   ADD_SUB( F[3], F[7] );
+
+   #undef ADD_SUB
+
+#if defined (__AVX512VL__) && defined(__AVX512BW__)   
+
+   const __m256i mask = _mm256_movm_epi8( 0x11111111 );
+
+#else
+
+   const __m256i mask = m256_const1_32( 0x000000ff );
+
+#endif
+
+   #define Q_REDUCE( a ) \
+       _mm256_sub_epi32( _mm256_and_si256( a, mask ), \
+                         _mm256_srai_epi32( a, 8 ) )
+
+   out[0] = Q_REDUCE( F[0] );  
+   out[1] = Q_REDUCE( F[1] );                        
+   out[2] = Q_REDUCE( F[2] );                        
+   out[3] = Q_REDUCE( F[3] );                        
+   out[4] = Q_REDUCE( F[4] );                        
+   out[5] = Q_REDUCE( F[5] );                        
+   out[6] = Q_REDUCE( F[6] );                        
+   out[7] = Q_REDUCE( F[7] );
+
+   #undef Q_REDUCE
+
+#elif defined(__SSE4_1__)
+
+   __m128i F[16] __attribute__ ((aligned (64)));
+   __m128i *mul = (__m128i*)multipliers;
+   __m128i *out = (__m128i*)output;
+   __m128i *tbl = (__m128i*)&( fftTable[ input[0] << 3 ] );
+
+   F[ 0] = _mm_mullo_epi32( mul[ 0], tbl[0] );
+   F[ 1] = _mm_mullo_epi32( mul[ 1], tbl[1] );
+   tbl = (__m128i*)&( fftTable[ input[1] << 3 ] );
+   F[ 2] = _mm_mullo_epi32( mul[ 2], tbl[0] );
+   F[ 3] = _mm_mullo_epi32( mul[ 3], tbl[1] );
+   tbl = (__m128i*)&( fftTable[ input[2] << 3 ] );
+   F[ 4] = _mm_mullo_epi32( mul[ 4], tbl[0] );
+   F[ 5] = _mm_mullo_epi32( mul[ 5], tbl[1] );
+   tbl = (__m128i*)&( fftTable[ input[3] << 3 ] );
+   F[ 6] = _mm_mullo_epi32( mul[ 6], tbl[0] );
+   F[ 7] = _mm_mullo_epi32( mul[ 7], tbl[1] );
+   tbl = (__m128i*)&( fftTable[ input[4] << 3 ] );
+   F[ 8] = _mm_mullo_epi32( mul[ 8], tbl[0] );
+   F[ 9] = _mm_mullo_epi32( mul[ 9], tbl[1] );
+   tbl = (__m128i*)&( fftTable[ input[5] << 3 ] );
+   F[10] = _mm_mullo_epi32( mul[10], tbl[0] );
+   F[11] = _mm_mullo_epi32( mul[11], tbl[1] );
+   tbl = (__m128i*)&( fftTable[ input[6] << 3 ] );
+   F[12] = _mm_mullo_epi32( mul[12], tbl[0] );
+   F[13] = _mm_mullo_epi32( mul[13], tbl[1] );
+   tbl = (__m128i*)&( fftTable[ input[7] << 3 ] );
+   F[14] = _mm_mullo_epi32( mul[14], tbl[0] );
+   F[15] = _mm_mullo_epi32( mul[15], tbl[1] );
+
+   #define ADD_SUB( a, b ) \
+   { \
+      __m128i tmp = b; \
+      b = _mm_sub_epi32( a, b ); \
+      a = _mm_add_epi32( a, tmp ); \
+   }
+
+   ADD_SUB( F[ 0], F[ 2] );
+   ADD_SUB( F[ 1], F[ 3] );
+   ADD_SUB( F[ 4], F[ 6] );
+   ADD_SUB( F[ 5], F[ 7] );
+   ADD_SUB( F[ 8], F[10] );
+   ADD_SUB( F[ 9], F[11] );
+   ADD_SUB( F[12], F[14] );
+   ADD_SUB( F[13], F[15] );
+
+   F[ 6] = _mm_slli_epi32( F[ 6], 4 );
+   F[ 7] = _mm_slli_epi32( F[ 7], 4 );
+   F[14] = _mm_slli_epi32( F[14], 4 );
+   F[15] = _mm_slli_epi32( F[15], 4 );
+
+   ADD_SUB( F[ 0], F[ 4] );
+   ADD_SUB( F[ 1], F[ 5] );
+   ADD_SUB( F[ 2], F[ 6] );
+   ADD_SUB( F[ 3], F[ 7] );
+   ADD_SUB( F[ 8], F[12] );
+   ADD_SUB( F[ 9], F[13] );
+   ADD_SUB( F[10], F[14] );
+   ADD_SUB( F[11], F[15] );
+
+   F[10] = _mm_slli_epi32( F[10], 2 );
+   F[11] = _mm_slli_epi32( F[11], 2 );
+   F[12] = _mm_slli_epi32( F[12], 4 );
+   F[13] = _mm_slli_epi32( F[13], 4 );
+   F[14] = _mm_slli_epi32( F[14], 6 );
+   F[15] = _mm_slli_epi32( F[15], 6 );
+   
+   ADD_SUB( F[ 0], F[ 8] );
+   ADD_SUB( F[ 1], F[ 9] );
+   ADD_SUB( F[ 2], F[10] );
+   ADD_SUB( F[ 3], F[11] );
+   ADD_SUB( F[ 4], F[12] );
+   ADD_SUB( F[ 5], F[13] );
+   ADD_SUB( F[ 6], F[14] );
+   ADD_SUB( F[ 7], F[15] );
+
+   #undef ADD_SUB
+
+   const __m128i mask = m128_const1_32( 0x000000ff );
+
+   #define Q_REDUCE( a ) \
+      _mm_sub_epi32( _mm_and_si128( a, mask ), _mm_srai_epi32( a, 8 ) ) 
+
+   out[ 0] = Q_REDUCE( F[ 0] );
+   out[ 1] = Q_REDUCE( F[ 1] );
+   out[ 2] = Q_REDUCE( F[ 2] );
+   out[ 3] = Q_REDUCE( F[ 3] );
+   out[ 4] = Q_REDUCE( F[ 4] );
+   out[ 5] = Q_REDUCE( F[ 5] );
+   out[ 6] = Q_REDUCE( F[ 6] );
+   out[ 7] = Q_REDUCE( F[ 7] );
+   out[ 8] = Q_REDUCE( F[ 8] );
+   out[ 9] = Q_REDUCE( F[ 9] );
+   out[10] = Q_REDUCE( F[10] );
+   out[11] = Q_REDUCE( F[11] );
+   out[12] = Q_REDUCE( F[12] );
+   out[13] = Q_REDUCE( F[13] );
+   out[14] = Q_REDUCE( F[14] );
+   out[15] = Q_REDUCE( F[15] );
+
+   #undef Q_REDUCE
+
+#else   // < SSE4.1
+   
+   swift_int16_t *mult = multipliers;
+
+   // First loop unrolling:
+	register swift_int16_t *table = &(fftTable[input[0] << 3]);

 /*
   swift_int32_t F[64];
@@ -666,11 +853,8 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
                F50, F51, F52, F53, F54, F55, F56, F57, F58, F59,
                F60, F61, F62, F63;
   
-   // First loop unrolling:
-	register swift_int16_t *table = &(fftTable[input[0] << 3]);
-
-	F0 = mult[0] * table[0];
-	F8 = mult[1] * table[1];
+	F0  = mult[0] * table[0];
+	F8  = mult[1] * table[1];
 	F16 = mult[2] * table[2];
 	F24 = mult[3] * table[3];
 	F32 = mult[4] * table[4];
@@ -678,90 +862,93 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
 	F48 = mult[6] * table[6];
 	F56 = mult[7] * table[7];

-	mult += 8;
 	table = &(fftTable[input[1] << 3]);

-	F1 = mult[0] * table[0];
-	F9 = mult[1] * table[1];
-	F17 = mult[2] * table[2];
-	F25 = mult[3] * table[3];
-	F33 = mult[4] * table[4];
-	F41 = mult[5] * table[5];
-	F49 = mult[6] * table[6];
-	F57 = mult[7] * table[7];
+	F1  = mult[ 8] * table[0];
+	F9  = mult[ 9] * table[1];
+	F17 = mult[10] * table[2];
+	F25 = mult[11] * table[3];
+	F33 = mult[12] * table[4];
+	F41 = mult[13] * table[5];
+	F49 = mult[14] * table[6];
+	F57 = mult[15] * table[7];

-	mult += 8;
 	table = &(fftTable[input[2] << 3]);

-	F2 = mult[0] * table[0];
-	F10 = mult[1] * table[1];
-	F18 = mult[2] * table[2];
-	F26 = mult[3] * table[3];
-	F34 = mult[4] * table[4];
-	F42 = mult[5] * table[5];
-	F50 = mult[6] * table[6];
-	F58 = mult[7] * table[7];
+	F2  = mult[16] * table[0];
+	F10 = mult[17] * table[1];
+	F18 = mult[18] * table[2];
+	F26 = mult[19] * table[3];
+	F34 = mult[20] * table[4];
+	F42 = mult[21] * table[5];
+	F50 = mult[22] * table[6];
+	F58 = mult[23] * table[7];

-	mult += 8;
 	table = &(fftTable[input[3] << 3]);

-	F3 = mult[0] * table[0];
-	F11 = mult[1] * table[1];
-	F19 = mult[2] * table[2];
-	F27 = mult[3] * table[3];
-	F35 = mult[4] * table[4];
-	F43 = mult[5] * table[5];
-	F51 = mult[6] * table[6];
-	F59 = mult[7] * table[7];
+	F3  = mult[24] * table[0];
+	F11 = mult[25] * table[1];
+	F19 = mult[26] * table[2];
+	F27 = mult[27] * table[3];
+	F35 = mult[28] * table[4];
+	F43 = mult[29] * table[5];
+	F51 = mult[30] * table[6];
+	F59 = mult[31] * table[7];

-	mult += 8;
 	table = &(fftTable[input[4] << 3]);

-	F4 = mult[0] * table[0];
-	F12 = mult[1] * table[1];
-	F20 = mult[2] * table[2];
-	F28 = mult[3] * table[3];
-	F36 = mult[4] * table[4];
-	F44 = mult[5] * table[5];
-	F52 = mult[6] * table[6];
-	F60 = mult[7] * table[7];
+	F4  = mult[32] * table[0];
+	F12 = mult[33] * table[1];
+	F20 = mult[34] * table[2];
+	F28 = mult[35] * table[3];
+	F36 = mult[36] * table[4];
+	F44 = mult[37] * table[5];
+	F52 = mult[38] * table[6];
+	F60 = mult[39] * table[7];

-	mult += 8;
 	table = &(fftTable[input[5] << 3]);

-	F5 = mult[0] * table[0];
-	F13 = mult[1] * table[1];
-	F21 = mult[2] * table[2];
-	F29 = mult[3] * table[3];
-	F37 = mult[4] * table[4];
-	F45 = mult[5] * table[5];
-	F53 = mult[6] * table[6];
-	F61 = mult[7] * table[7];
+	F5  = mult[40] * table[0];
+	F13 = mult[41] * table[1];
+	F21 = mult[42] * table[2];
+	F29 = mult[43] * table[3];
+	F37 = mult[44] * table[4];
+	F45 = mult[45] * table[5];
+	F53 = mult[46] * table[6];
+	F61 = mult[47] * table[7];

-	mult += 8;
 	table = &(fftTable[input[6] << 3]);

-	F6 = mult[0] * table[0];
-	F14 = mult[1] * table[1];
-	F22 = mult[2] * table[2];
-	F30 = mult[3] * table[3];
-	F38 = mult[4] * table[4];
-	F46 = mult[5] * table[5];
-	F54 = mult[6] * table[6];
-	F62 = mult[7] * table[7];
+	F6  = mult[48] * table[0];
+	F14 = mult[49] * table[1];
+	F22 = mult[50] * table[2];
+	F30 = mult[51] * table[3];
+	F38 = mult[52] * table[4];
+	F46 = mult[53] * table[5];
+	F54 = mult[54] * table[6];
+	F62 = mult[55] * table[7];

-	mult += 8;
 	table = &(fftTable[input[7] << 3]);

-	F7 = mult[0] * table[0];
-	F15 = mult[1] * table[1];
-	F23 = mult[2] * table[2];
-	F31 = mult[3] * table[3];
-	F39 = mult[4] * table[4];
-	F47 = mult[5] * table[5];
-	F55 = mult[6] * table[6];
-	F63 = mult[7] * table[7];
+	F7  = mult[56] * table[0];
+	F15 = mult[57] * table[1];
+	F23 = mult[58] * table[2];
+	F31 = mult[59] * table[3];
+	F39 = mult[60] * table[4];
+	F47 = mult[61] * table[5];
+	F55 = mult[62] * table[6];
+	F63 = mult[63] * table[7];

+   #define ADD_SUB( a, b ) \
+   { \
+      int temp = b; \
+      b = a - b; \
+      a = a + temp; \
+   }
+   
+   #define Q_REDUCE( a ) \
+      ( ( (a) & 0xff ) - ( (a) >> 8 ) )
+   
 /*

   for ( int i = 0; i < 8; i++ )
@@ -800,7 +987,6 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   }
 */

-
 	// Second loop unrolling:
 	// Iteration 0:
 	ADD_SUB(F0, F1);
@@ -1057,6 +1243,11 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
 	output[47] = Q_REDUCE(F61);
 	output[55] = Q_REDUCE(F62);
 	output[63] = Q_REDUCE(F63);
+
+   #undef ADD_SUB
+   #undef Q_REDUCE
+
+#endif  // AVX2 elif SSE4.1 else
 }

 // Calculates the FFT part of SWIFFT.
@@ -1086,24 +1277,66 @@ void SWIFFTFFT(const unsigned char *input, int m, swift_int32_t *output)
 // - m: the input size divided by 64.
 // - output: will store the result.
 // - a: the coefficients in the sum. Of size 64 * m.
-void SWIFFTSum(const swift_int32_t *input, int m, unsigned char *output, const swift_int16_t *a)
+void SWIFFTSum( const swift_int32_t *input, int m, unsigned char *output,
+                const swift_int16_t *a )
 {
 	int i, j;
-	swift_int32_t result[N];
+	swift_int32_t result[N] __attribute__ ((aligned (64)));
 	register swift_int16_t carry = 0;

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+   __m512i *res = (__m512i*)result;
+   for ( j = 0; j < N/16; ++j )
+   {
+      __m512i sum = _mm512_setzero_si512();
+      const __m512i *f = (__m512i*)input + j;
+      const __m512i *k = (__m512i*)a + j;
+      for ( i = 0; i < m; i++, f += N/16, k += N/16 )
+         sum = _mm512_add_epi32( sum, _mm512_mullo_epi32( *f, *k ) );
+      res[j] = sum;
+   }
+
+#elif defined(__AVX2__)
+
+   __m256i *res = (__m256i*)result;
+   for ( j = 0; j < N/8; ++j )
+   {
+      __m256i sum = _mm256_setzero_si256();
+      const __m256i *f = (__m256i*)input + j;
+      const __m256i *k = (__m256i*)a + j;
+      for ( i = 0; i < m; i++, f += N/8, k += N/8 )
+         sum = _mm256_add_epi32( sum, _mm256_mullo_epi32( *f, *k ) );
+      res[j] = sum;
+   }
+
+#elif defined(__SSE4_1__)
+
+   __m128i *res = (__m128i*)result;
+   for ( j = 0; j < N/4; ++j )
+   {
+      __m128i sum = _mm_setzero_si128();
+      const __m128i *f = (__m128i*)input + j;
+      const __m128i *k = (__m128i*)a + j;
+      for ( i = 0; i < m; i++, f += N/4, k += N/4 )
+         sum = _mm_add_epi32( sum, _mm_mullo_epi32( *f, *k ) );
+      res[j] = sum;
+   }
+
+#else
+
 	for (j = 0; j < N; ++j)
 	{
 		register swift_int32_t sum = 0;
 		const register swift_int32_t *f = input + j;
 		const register swift_int16_t *k = a + j;
-
 		for (i = 0; i < m; i++, f += N,k += N)
 			sum += (*f) * (*k);
-
 		result[j] = sum;
 	}

+#endif
+
 	for (j = 0; j < N; ++j)
 		result[j] = ((FIELD_SIZE << 22) + result[j]) % FIELD_SIZE;

@@ -1116,14 +1349,15 @@ void SWIFFTSum(const swift_int32_t *input, int m, unsigned char *output, const s
 	output[N] = carry;
 }

+/*
 void ComputeSingleSWIFFTX_smooth(unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
                          unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE],
 						  bool doSmooth)
 {
 	int i;
 	// Will store the result of the FFT parts:
-	swift_int32_t fftOut[N * M];
-	unsigned char intermediate[N * 3 + 8];
+	swift_int32_t fftOut[N * M] __attribute__ ((aligned (64)));
+	unsigned char intermediate[N * 3 + 8] __attribute__ ((aligned (64)));
 	unsigned char carry0,carry1,carry2;

 	// Do the three SWIFFTS while remembering the three carry bytes (each carry byte gets
@@ -1193,51 +1427,50 @@ void ComputeSingleSWIFFTX_smooth(unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
 		output[N] = 0;
 	}
 }
+*/

-void ComputeSingleSWIFFTX( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
-                           unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE] )
+void ComputeSingleSWIFFTX( unsigned char *input, unsigned char *output )
 {
   int i;
   // Will store the result of the FFT parts:
-   swift_int32_t fftOut[N * M];
-   unsigned char intermediate[N * 3 + 8];
+   swift_int32_t fftOut[N * M] __attribute__ ((aligned (64)));
+   unsigned char sum[ N*3 + 8 ] __attribute__ ((aligned (64)));
   unsigned char carry0,carry1,carry2;

   // Do the three SWIFFTS while remembering the three carry bytes (each carry byte gets
   // overriden by the following SWIFFT):

   // 1. Compute the FFT of the input - the common part for the first 3 SWIFFTs:
-   SWIFFTFFT(input, M, fftOut);
+   SWIFFTFFT( input, M, fftOut );

   // 2. Compute the sums of the 3 SWIFFTs, each using a different set of coefficients:

   // 2a. The first SWIFFT:
-   SWIFFTSum(fftOut, M, intermediate, As);
-   // Remember the carry byte:
-   carry0 = intermediate[N];
+   SWIFFTSum( fftOut, M, sum,       As         );
+   carry0 = sum[N];

   // 2b. The second one:
-   SWIFFTSum(fftOut, M, intermediate + N, As + (M * N));
-   carry1 = intermediate[2 * N];
+   SWIFFTSum( fftOut, M, sum + N,   As +   M*N );
+   carry1 = sum[ 2*N ];

   // 2c. The third one:
-   SWIFFTSum(fftOut, M, intermediate + (2 * N), As + 2 * (M * N));
-   carry2 = intermediate[3 * N];
+   SWIFFTSum( fftOut, M, sum + 2*N, As + 2*M*N );
+   carry2 = sum[ 3*N ];

   //2d. Put three carry bytes in their place
-   intermediate[3 * N] = carry0;
-   intermediate[(3 * N) + 1] = carry1;
-   intermediate[(3 * N) + 2] = carry2;
+   sum[ 3*N     ] = carry0;
+   sum[ 3*N + 1 ] = carry1;
+   sum[ 3*N + 2 ] = carry2;

   // Padding  intermediate output with 5 zeroes.
-   memset(intermediate + (3 * N) + 3, 0, 5);
+   memset( sum + 3*N + 3, 0, 5 );

   // Apply the S-Box:
   for ( i = 0; i < (3 * N) + 8; ++i )
-      intermediate[i] = SBox[intermediate[i]];
+      sum[i] = SBox[ sum[i] ];

   // 3. The final and last SWIFFT:
-   SWIFFTFFT(intermediate, 3 * (N/8) + 1, fftOut);
-   SWIFFTSum(fftOut,       3 * (N/8) + 1, output, As);
-
+   SWIFFTFFT( sum, 3 * (N/8) + 1, fftOut );
+   SWIFFTSum( fftOut,       3 * (N/8) + 1, sum, As );
+   memcpy( output, sum, SWIFFTX_OUTPUT_BLOCK_SIZE - 1 );
 }
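Note on Q_REDUCE: both the scalar macro and the SIMD variants in this file compute ( a & 0xff ) - ( a >> 8 ). A minimal scalar sketch of why that is a valid partial reduction follows, assuming FIELD_SIZE is 257 (the SWIFFT reference modulus; the value is not shown in this diff). Since 256 is congruent to -1 modulo 257, writing a = 256*h + l gives a congruent to l - h. The function names q_reduce, mod257 and q_reduce_is_congruent are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

// Assumption: FIELD_SIZE is 257, as in the SWIFFT reference parameters.
#define FIELD_SIZE 257

// Same arithmetic as the scalar Q_REDUCE macro: low byte minus the rest.
// Relies on an arithmetic right shift for negative values, which is what
// gcc and clang do on x86 (it is implementation-defined in ISO C).
int32_t q_reduce( int32_t a )
{
   return ( a & 0xff ) - ( a >> 8 );
}

// Canonical non-negative residue, used for comparison only.
int32_t mod257( int32_t a )
{
   int32_t r = a % FIELD_SIZE;
   return r < 0 ? r + FIELD_SIZE : r;
}

// Returns 1 if q_reduce(a) is congruent to a modulo 257 for all a in
// [lo, hi], otherwise 0.
int q_reduce_is_congruent( int32_t lo, int32_t hi )
{
   assert( lo <= hi );
   for ( int32_t a = lo; a <= hi; a++ )
      if ( mod257( q_reduce( a ) ) != mod257( a ) ) return 0;
   return 1;
}
```

The result of q_reduce is not a canonical residue (it can be negative, e.g. for an input of 256), which matches the later step in SWIFFTSum that adds a multiple of FIELD_SIZE before taking the remainder.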
--- a/algo/swifftx/swifftx.h
+++ b/algo/swifftx/swifftx.h
@@ -61,11 +61,10 @@ void ComputeSingleSWIFFT(unsigned char *input, unsigned short m,
 //
 // Returns:
 // - Success value.
-void ComputeSingleSWIFFTX( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
-                           unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE] );
+void ComputeSingleSWIFFTX( unsigned char *input, unsigned char *output );

-void ComputeSingleSWIFFTX_smooth( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
-	            unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE], bool doSmooth);
+//void ComputeSingleSWIFFTX_smooth( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
+//	            unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE], bool doSmooth);

 // Calculates the powers of OMEGA and generates the bit reversal permutation.
 // You must call this function before doing SWIFFT/X, otherwise you will get zeroes everywhere.
--- a/algo/verthash/Verthash.c
+++ b/algo/verthash/Verthash.c
@@ -10,6 +10,7 @@
 #include "algo-gate-api.h"
 #include "Verthash.h"
 #include "mm_malloc.h"
+#include "malloc-huge.h"

 //-----------------------------------------------------------------------------
 // Verthash info management
@@ -84,10 +85,17 @@ int verthash_info_init(verthash_info_t* info, const char* file_name)
    }

    // Allocate data
-    info->data = (uint8_t *)_mm_malloc( fileSize, 64 );
-    if (!info->data)
+    info->data = (uint8_t *)malloc_hugepages( fileSize );
+    if ( info->data )
    {
-        fclose(fileMiningData);
+       if ( !opt_quiet ) applog( LOG_INFO, "Verthash data is using huge pages");
+    }
+    else
+       info->data = (uint8_t *)_mm_malloc( fileSize, 64 );
+
+    if ( !info->data )
+    {
+        fclose( fileMiningData );
        // Memory allocation fatal error.
        return 2;
    }
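Note on the allocation fallback: the hunk above tries malloc_hugepages() first and falls back to _mm_malloc() with 64-byte alignment. The sketch below shows one way such a pattern can be written on Linux. It is not the project's malloc-huge.h implementation: try_hugepages and alloc_data are hypothetical names, the 2 MiB huge page size is an assumption, and posix_memalign stands in for _mm_malloc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

// Hypothetical stand-in for malloc_hugepages(). Returns NULL when the
// platform has no MAP_HUGETLB or no huge pages can be mapped.
void *try_hugepages( size_t size )
{
#if defined(MAP_HUGETLB) && defined(MAP_ANONYMOUS)
   const size_t page = (size_t)2 << 20;          // assumed 2 MiB huge pages
   const size_t rounded = ( size + page - 1 ) & ~( page - 1 );
   void *p = mmap( NULL, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0 );
   return p == MAP_FAILED ? NULL : p;
#else
   (void)size;
   return NULL;
#endif
}

// Try huge pages first, then fall back to a 64-byte aligned allocation.
// *used_huge reports which path succeeded, so the caller can log it and
// later release the buffer with the matching call (munmap or free).
void *alloc_data( size_t size, int *used_huge )
{
   assert( used_huge != NULL );
   void *p = try_hugepages( size );
   *used_huge = ( p != NULL );
   if ( !p && posix_memalign( &p, 64, size ) != 0 )
      p = NULL;                                  // both paths failed
   return p;
}
```

Recording which allocator produced the buffer matters in a sketch like this because a mapping obtained with mmap must not be passed to free.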
--- a/algo/verthash/tiny_sha3/sha3-4way.c
+++ b/algo/verthash/tiny_sha3/sha3-4way.c
@@ -29,16 +29,11 @@ void sha3_4way_keccakf( __m256i st[25] )
   for ( r = 0; r < KECCAKF_ROUNDS; r++ )
   {
      // Theta
-      bc[0] = _mm256_xor_si256( st[0],
-                           mm256_xor4( st[5], st[10], st[15], st[20] ) );
-      bc[1] = _mm256_xor_si256( st[1],
-                           mm256_xor4( st[6], st[11], st[16], st[21] ) );
-      bc[2] = _mm256_xor_si256( st[2],
-                           mm256_xor4( st[7], st[12], st[17], st[22] ) );
-      bc[3] = _mm256_xor_si256( st[3],
-                           mm256_xor4( st[8], st[13], st[18], st[23] ) );
-      bc[4] = _mm256_xor_si256( st[4],
-                           mm256_xor4( st[9], st[14], st[19], st[24] ) );
+      bc[0] = mm256_xor3( st[0], st[5], mm256_xor3( st[10], st[15], st[20] ) );
+      bc[1] = mm256_xor3( st[1], st[6], mm256_xor3( st[11], st[16], st[21] ) );
+      bc[2] = mm256_xor3( st[2], st[7], mm256_xor3( st[12], st[17], st[22] ) );
+      bc[3] = mm256_xor3( st[3], st[8], mm256_xor3( st[13], st[18], st[23] ) );
+      bc[4] = mm256_xor3( st[4], st[9], mm256_xor3( st[14], st[19], st[24] ) );

      for ( i = 0; i < 5; i++ )
      {
@@ -89,17 +84,13 @@ void sha3_4way_keccakf( __m256i st[25] )
      //  Chi
      for ( j = 0; j < 25; j += 5 )
      {
-         memcpy( bc, &st[ j ], 5*32 );
-         st[ j   ] = _mm256_xor_si256( st[ j   ],
-                                       _mm256_andnot_si256( bc[1], bc[2] ) );
-         st[ j+1 ] = _mm256_xor_si256( st[ j+1 ],
-                                       _mm256_andnot_si256( bc[2], bc[3] ) );
-         st[ j+2 ] = _mm256_xor_si256( st[ j+2 ],
-                                       _mm256_andnot_si256( bc[3], bc[4] ) );
-         st[ j+3 ] = _mm256_xor_si256( st[ j+3 ],
-                                       _mm256_andnot_si256( bc[4], bc[0] ) );
-         st[ j+4 ] = _mm256_xor_si256( st[ j+4 ],
-                                       _mm256_andnot_si256( bc[0], bc[1] ) );
+         bc[0] = st[j];
+         bc[1] = st[j+1];
+         st[ j   ] = mm256_xorandnot( st[ j   ], st[j+1], st[j+2] );
+         st[ j+1 ] = mm256_xorandnot( st[ j+1 ], st[j+2], st[j+3] );
+         st[ j+2 ] = mm256_xorandnot( st[ j+2 ], st[j+3], st[j+4] );
+         st[ j+3 ] = mm256_xorandnot( st[ j+3 ], st[j+4], bc[0] );
+         st[ j+4 ] = mm256_xorandnot( st[ j+4 ], bc[0], bc[1] );
      }

      //  Iota
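Note on the Chi change: the new code saves only st[j] and st[j+1] instead of copying the whole 5-lane row. A scalar sketch of why that is enough follows, using plain uint64_t lanes instead of __m256i. The helper xorandnot is a scalar analogue of what mm256_xorandnot appears to compute, a ^ (~b & c), judging from the removed lines; all function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Scalar analogue of the 3-operand helper: a ^ ( ~b & c ).
uint64_t xorandnot( uint64_t a, uint64_t b, uint64_t c )
{
   return a ^ ( ~b & c );
}

// Reference Chi on one 5-lane row, reading from a full copy of the row.
void chi_row_copy( uint64_t st[5] )
{
   uint64_t bc[5];
   memcpy( bc, st, sizeof bc );
   for ( int i = 0; i < 5; i++ )
      st[i] = xorandnot( bc[i], bc[ (i+1) % 5 ], bc[ (i+2) % 5 ] );
}

// In-place Chi: lanes are written in ascending order, so only lanes 0
// and 1 are overwritten before their original values are needed again.
void chi_row_inplace( uint64_t st[5] )
{
   const uint64_t s0 = st[0], s1 = st[1];
   st[0] = xorandnot( st[0], st[1], st[2] );
   st[1] = xorandnot( st[1], st[2], st[3] );
   st[2] = xorandnot( st[2], st[3], st[4] );
   st[3] = xorandnot( st[3], st[4], s0 );
   st[4] = xorandnot( st[4], s0, s1 );
}

// Returns 1 when both variants produce the same row, otherwise 0.
int chi_rows_match( const uint64_t row[5] )
{
   assert( row != NULL );
   uint64_t a[5], b[5];
   memcpy( a, row, sizeof a );
   memcpy( b, row, sizeof b );
   chi_row_copy( a );
   chi_row_inplace( b );
   return memcmp( a, b, sizeof a ) == 0;
}
```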
--- a/algo/verthash/verthash-gate.c
+++ b/algo/verthash/verthash-gate.c
@@ -127,7 +127,7 @@ bool register_verthash_algo( algo_gate_t* gate )
 {
  opt_target_factor = 256.0;
  gate->scanhash  = (void*)&scanhash_verthash;
-  gate->optimizations = AVX2_OPT;
+  gate->optimizations = SSE42_OPT | AVX2_OPT;
   
  const char *verthash_data_file = opt_data_file ? opt_data_file
                                                 : default_verthash_data_file;
--- a/algo/x16/x16r-4way.c
+++ b/algo/x16/x16r-4way.c
@@ -60,7 +60,14 @@ void x16r_8way_prehash( void *vdata, void *pdata )
      case HAMSI:
         mm512_bswap32_intrlv80_8x64( vdata, pdata );
         hamsi512_8way_init( &x16r_ctx.hamsi );
-         hamsi512_8way_update( &x16r_ctx.hamsi, vdata, 64 );
+         hamsi512_8way_update( &x16r_ctx.hamsi, vdata, 72 );
+      break;
+      case FUGUE:
+         mm128_bswap32_80( edata, pdata );
+         fugue512_init( &x16r_ctx.fugue );
+         fugue512_update( &x16r_ctx.fugue, edata, 76 );
+         intrlv_8x64( vdata, edata, edata, edata, edata,
+                             edata, edata, edata, edata, 640 );
      break;
      case SHABAL:
         mm256_bswap32_intrlv80_8x32( vdata2, pdata );
@@ -306,7 +313,7 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
         break;
         case HAMSI:
            if ( i == 0 )
-               hamsi512_8way_update( &ctx.hamsi, input + (64<<3), 16 );
+               hamsi512_8way_update( &ctx.hamsi, input + (72<<3), 8 );
            else
            {
               intrlv_8x64( vhash, in0, in1, in2, in3, in4, in5, in6, in7,
@@ -319,14 +326,43 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
                          hash7, vhash );
         break;
         case FUGUE:
-             fugue512_full( &ctx.fugue, hash0, in0, size );
-             fugue512_full( &ctx.fugue, hash1, in1, size );
-             fugue512_full( &ctx.fugue, hash2, in2, size );
-             fugue512_full( &ctx.fugue, hash3, in3, size );
-             fugue512_full( &ctx.fugue, hash4, in4, size );
-             fugue512_full( &ctx.fugue, hash5, in5, size );
-             fugue512_full( &ctx.fugue, hash6, in6, size );
-             fugue512_full( &ctx.fugue, hash7, in7, size );
+            if ( i == 0 )
+            {
+               fugue512_update( &ctx.fugue, in0 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash0 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in1 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash1 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in2 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash2 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in3 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash3 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in4 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash4 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in5 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash5 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in6 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash6 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in7 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash7 );
+            }
+            else
+            {
+               fugue512_full( &ctx.fugue, hash0, in0, size );
+               fugue512_full( &ctx.fugue, hash1, in1, size );
+               fugue512_full( &ctx.fugue, hash2, in2, size );
+               fugue512_full( &ctx.fugue, hash3, in3, size );
+               fugue512_full( &ctx.fugue, hash4, in4, size );
+               fugue512_full( &ctx.fugue, hash5, in5, size );
+               fugue512_full( &ctx.fugue, hash6, in6, size );
+               fugue512_full( &ctx.fugue, hash7, in7, size );
+            }
         break;
         case SHABAL:
             intrlv_8x32( vhash, in0, in1, in2, in3, in4, in5, in6, in7,
@@ -347,25 +383,25 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
            {
               sph_whirlpool( &ctx.whirlpool, in0 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash0 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in1 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash1 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in2 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash2 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in3 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash3 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in4 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash4 ); 
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in5 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash5 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in6 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash6 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in7 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash7 );
            }
@@ -532,7 +568,13 @@ void x16r_4way_prehash( void *vdata, void *pdata )
      case HAMSI:
         mm256_bswap32_intrlv80_4x64( vdata, pdata );
         hamsi512_4way_init( &x16r_ctx.hamsi );
-         hamsi512_4way_update( &x16r_ctx.hamsi, vdata, 64 );
+         hamsi512_4way_update( &x16r_ctx.hamsi, vdata, 72 );
+      break;
+      case FUGUE:
+         mm128_bswap32_80( edata, pdata );
+         fugue512_init( &x16r_ctx.fugue );
+         fugue512_update( &x16r_ctx.fugue, edata, 76 );
+         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
      break;
      case SHABAL:
         mm128_bswap32_intrlv80_4x32( vdata2, pdata );
@@ -734,7 +776,7 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
   	    break;
         case HAMSI:
            if ( i == 0 )
-               hamsi512_4way_update( &ctx.hamsi, input + (64<<2), 16 );
+               hamsi512_4way_update( &ctx.hamsi, input + (72<<2), 8 );
            else
            {
               intrlv_4x64( vhash, in0, in1, in2, in3, size<<3 );
@@ -745,10 +787,27 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
            dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
         break;
         case FUGUE:
-             fugue512_full( &ctx.fugue, hash0, in0, size );
-             fugue512_full( &ctx.fugue, hash1, in1, size );
-             fugue512_full( &ctx.fugue, hash2, in2, size );
-             fugue512_full( &ctx.fugue, hash3, in3, size );
+            if ( i == 0 )
+            {
+               fugue512_update( &ctx.fugue, in0 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash0 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in1 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash1 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in2 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash2 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in3 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash3 );
+            }
+            else
+            {
+               fugue512_full( &ctx.fugue, hash0, in0, size );
+               fugue512_full( &ctx.fugue, hash1, in1, size );
+               fugue512_full( &ctx.fugue, hash2, in2, size );
+               fugue512_full( &ctx.fugue, hash3, in3, size );
+            }
         break;
         case SHABAL:
             intrlv_4x32( vhash, in0, in1, in2, in3, size<<3 );
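Note on the FUGUE prehash: when Fugue is the first function in the hash order, the diff absorbs the first 76 bytes of the 80-byte header once in the prehash and then, per lane, only updates with the last 4 bytes and restores the saved state. The sketch below illustrates that init/update/final pattern with a toy streaming hash (64-bit FNV-1a). It does not implement Fugue; toy_ctx, toy_init, toy_update, toy_final, hash_full and hash_from_midstate are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Toy streaming hash standing in for fugue512. It only illustrates the
// midstate pattern, not the real algorithm.
typedef struct { uint64_t h; } toy_ctx;

void toy_init( toy_ctx *c ) { c->h = 0xcbf29ce484222325ULL; }

void toy_update( toy_ctx *c, const void *data, size_t len )
{
   assert( c != NULL );
   const uint8_t *p = (const uint8_t*)data;
   for ( size_t i = 0; i < len; i++ )
      c->h = ( c->h ^ p[i] ) * 0x100000001b3ULL;
}

uint64_t toy_final( const toy_ctx *c ) { return c->h; }

// One-shot hash of a complete 80-byte header.
uint64_t hash_full( const uint8_t header[80] )
{
   toy_ctx c;
   toy_init( &c );
   toy_update( &c, header, 80 );
   return toy_final( &c );
}

// Midstate variant: the first 76 bytes were absorbed once per work unit,
// so each nonce only costs a state copy plus a 4-byte update.
uint64_t hash_from_midstate( const toy_ctx *midstate, const uint8_t nonce[4] )
{
   toy_ctx c = *midstate;            // restore the saved prehash state
   toy_update( &c, nonce, 4 );
   return toy_final( &c );
}
```

In this sketch the state copy plays the role of the memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) ) calls between lanes.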
--- a/algo/x16/x16r-gate.c
+++ b/algo/x16/x16r-gate.c
@@ -62,8 +62,7 @@ bool register_x16r_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16r;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  x16_r_s_getAlgoString = (void*)&x16r_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -81,8 +80,7 @@ bool register_x16rv2_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rv2;
  gate->hash      = (void*)&x16rv2_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  x16_r_s_getAlgoString = (void*)&x16r_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -100,8 +98,7 @@ bool register_x16s_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16r;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  x16_r_s_getAlgoString = (void*)&x16s_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -234,8 +231,7 @@ bool register_x16rt_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rt;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  opt_target_factor = 256.0;
  return true;
 };
@@ -252,8 +248,7 @@ bool register_x16rt_veil_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rt;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  gate->build_extraheader = (void*)&veil_build_extraheader;
  opt_target_factor = 256.0;
  return true;
@@ -292,8 +287,7 @@ bool register_x21s_algo( algo_gate_t* gate )
  gate->hash              = (void*)&x21s_hash;
  gate->miner_thread_init = (void*)&x21s_thread_init;
 #endif
-  gate->optimizations     = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                    VAES_OPT | VAES256_OPT;
+  gate->optimizations  = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  x16_r_s_getAlgoString   = (void*)&x16s_getAlgoString;
  opt_target_factor = 256.0;
  return true;
--- a/algo/x16/x16rt-4way.c
+++ b/algo/x16/x16rt-4way.c
@@ -24,15 +24,15 @@ int scanhash_x16rt_8way( struct work *work, uint32_t max_nonce,
   if ( bench )   ptarget[7] = 0x0cff;

   static __thread uint32_t s_ntime = UINT32_MAX;
-   uint32_t ntime = bswap_32( pdata[17] );
-   if ( s_ntime != ntime )
+   uint32_t masked_ntime = bswap_32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
   {
-      x16rt_getTimeHash( ntime, &timeHash );
+      x16rt_getTimeHash( masked_ntime, &timeHash );
      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
-      s_ntime = ntime;
+      s_ntime = masked_ntime;
      if ( opt_debug && !thr_id )
          applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
-                               x16r_hash_order, ntime, timeHash );
+                            x16r_hash_order, bswap_32( pdata[17] ), timeHash );
   }

   x16r_8way_prehash( vdata, pdata );
@@ -78,15 +78,15 @@ int scanhash_x16rt_4way( struct work *work, uint32_t max_nonce,
   if ( bench )  ptarget[7] = 0x0cff;

   static __thread uint32_t s_ntime = UINT32_MAX;
-   uint32_t ntime = bswap_32( pdata[17] );
-   if ( s_ntime != ntime )
+   uint32_t masked_ntime = bswap_32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
   {
-      x16rt_getTimeHash( ntime, &timeHash );
+      x16rt_getTimeHash( masked_ntime, &timeHash );
      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
-      s_ntime = ntime;
+      s_ntime = masked_ntime;
      if ( opt_debug && !thr_id )
          applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
-                               x16r_hash_order, ntime, timeHash );
+                            x16r_hash_order, bswap_32( pdata[17] ), timeHash );
   }

   x16r_4way_prehash( vdata, pdata );
--- a/algo/x16/x16rt.c
+++ b/algo/x16/x16rt.c
@@ -20,15 +20,15 @@ int scanhash_x16rt( struct work *work, uint32_t max_nonce,
   mm128_bswap32_80( edata, pdata );

   static __thread uint32_t s_ntime = UINT32_MAX;
-   uint32_t ntime = swab32( pdata[17] );
-   if ( s_ntime != ntime )
+   uint32_t masked_ntime = swab32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
   {
-      x16rt_getTimeHash( ntime, &timeHash );
+      x16rt_getTimeHash( masked_ntime, &timeHash );
      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
-      s_ntime = ntime;
+      s_ntime = masked_ntime;
      if ( opt_debug && !thr_id )
          applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
-                               x16r_hash_order, ntime, timeHash );
+                        x16r_hash_order, swab32( pdata[17] ), timeHash );
   }
   
   x16r_prehash( edata, pdata );
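Note on the ntime mask: the three x16rt scanhash variants above now AND the byte-swapped ntime with 0xffffff80 before comparing it with the cached value. This clears the low 7 bits, rounding the timestamp down to a multiple of 128 seconds, so the hash order is only recomputed when ntime crosses a 128-second boundary. A minimal standalone sketch of that rounding follows; the helper names mask_ntime and same_time_window are illustrative and do not appear in the patch.

```c
#include <stdint.h>

/* Hypothetical helper: clear the low 7 bits of a host-order timestamp,
   rounding it down to a multiple of 128 seconds, as the patched
   scanhash functions do with "& 0xffffff80". */
static uint32_t mask_ntime( uint32_t ntime )
{
   return ntime & 0xffffff80u;
}

/* Hypothetical helper: returns 1 when two timestamps fall into the same
   128-second window, i.e. when they produce the same masked value. */
static int same_time_window( uint32_t a, uint32_t b )
{
   return mask_ntime( a ) == mask_ntime( b );
}
```

For example, timestamps 128 and 255 share a window, while 127 and 128 do not.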
--- a/algo/x17/sonoa-gate.c
+++ b/algo/x17/sonoa-gate.c
@@ -12,7 +12,7 @@ bool register_sonoa_algo( algo_gate_t* gate )
  init_sonoa_ctx();
  gate->hash      = (void*)&sonoa_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  return true;
 };

--- a/algo/x17/x17-gate.c
+++ b/algo/x17/x17-gate.c
@@ -11,7 +11,7 @@ bool register_x17_algo( algo_gate_t* gate )
 #else
  gate->hash      = (void*)&x17_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  return true;
 };

--- a/algo/x17/xevan-gate.c
+++ b/algo/x17/xevan-gate.c
@@ -12,7 +12,7 @@ bool register_xevan_algo( algo_gate_t* gate )
  init_xevan_ctx();
  gate->hash      = (void*)&xevan_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  opt_target_factor = 256.0;
  return true;
 };
--- a/algo/x22/x22i-gate.c
+++ b/algo/x22/x22i-gate.c
@@ -31,8 +31,8 @@ bool register_x22i_algo( algo_gate_t* gate )

 #endif

-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT
-                      | AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | SSE42_OPT | AES_OPT | AVX2_OPT | SHA_OPT
+                      | AVX512_OPT | VAES_OPT;
  return true;
 };

@@ -48,8 +48,8 @@ bool register_x25x_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x25x;
  gate->hash      = (void*)&x25x_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT |
-                        AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | SSE42_OPT | AES_OPT | AVX2_OPT | SHA_OPT |
+                        AVX512_OPT | VAES_OPT;
  return true;
 };

--- a/build-allarch.sh
+++ b/build-allarch.sh
@@ -4,128 +4,97 @@
 # during develpment. However the information contained may provide compilation
 # tips to users.

-rm cpuminer-avx512-sha-vaes cpuminer-avx512-sha cpuminer-avx512 cpuminer-avx2 cpuminer-avx cpuminer-aes-sse42 cpuminer-sse42 cpuminer-ssse3 cpuminer-sse2 cpuminer-zen cpuminer-zen3  > /dev/null
+rm cpuminer-avx512-sha-vaes cpuminer-avx512 cpuminer-avx2 cpuminer-avx cpuminer-aes-sse42 cpuminer-sse42 cpuminer-ssse3 cpuminer-sse2 cpuminer-zen cpuminer-zen3  > /dev/null

-# Icelake AVX512 SHA VAES
+# AVX512 SHA VAES: Intel Core Icelake, Rocketlake
 make distclean || echo clean
 rm -f config.status
 ./autogen.sh || echo done
 CFLAGS="-O3 -march=icelake-client -Wall -fno-common" ./configure --with-curl
+#CFLAGS="-O3 -march=rocketlake -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-avx512-sha-vaes.exe
 strip -s cpuminer
 mv cpuminer cpuminer-avx512-sha-vaes

-# Rocketlake AVX512 SHA AES
+# AVX512 AES: Intel Core HEDT Skylake-X, Cascadelake
 make clean || echo clean
 rm -f config.status
-CFLAGS="-O3 -march=cascadelake -msha -Wall -fno-common" ./configure --with-curl
-#CFLAGS="-O3 -march=skylake-avx512 -msha -Wall -fno-common" ./configure --with-curl
-# CFLAGS="-O3 -march=rocketlake -Wall -fno-common" ./configure --with-curl
+CFLAGS="-O3 -march=skylake-avx512 -maes -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-avx512-sha.exe
-strip -s cpuminer
-mv cpuminer cpuminer-avx512-sha
-
-# Slylake-X AVX512 AES
-make clean || echo clean
-rm -f config.status
-CFLAGS="-O3 -march=skylake-avx512 -Wall -fno-common" ./configure --with-curl
-make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-avx512.exe
 strip -s cpuminer
 mv cpuminer cpuminer-avx512

-# Haswell AVX2 AES
+# AVX2 SHA VAES: Intel Alderlake, AMD Zen3
+make clean || echo done
+rm -f config.status
+# vaes doesn't include aes
+CFLAGS="-O3 -maes -mavx2 -msha -mvaes -Wall -fno-common" ./configure --with-curl
+make -j 8
+strip -s cpuminer
+mv cpuminer cpuminer-avx2-sha-vaes
+
+# AVX2 SHA AES: AMD Zen1
+make clean || echo done
+rm -f config.status
+#CFLAGS="-O3 -march=znver1 -maes -Wall -fno-common" ./configure --with-curl
+CFLAGS="-O3 -maes -mavx2 -msha -Wall -fno-common" ./configure --with-curl
+make -j 8
+strip -s cpuminer
+mv cpuminer cpuminer-avx2-sha
+
+# AVX2 AES: Intel Haswell..Cometlake
 make clean || echo clean
 rm -f config.status
 # GCC 9 doesn't include AES with core-avx2
 CFLAGS="-O3 -march=core-avx2 -maes -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-avx2.exe
 strip -s cpuminer
 mv cpuminer cpuminer-avx2

-# Sandybridge AVX AES
+# AVX AES: Intel Sandybridge, Ivybridge
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -march=corei7-avx -maes -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-avx.exe
 strip -s cpuminer
 mv cpuminer cpuminer-avx

-# Westmere SSE4.2 AES
+# SSE4.2 AES: Intel Westmere
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -march=westmere -maes -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-aes-sse42.exe
 strip -s cpuminer
 mv cpuminer cpuminer-aes-sse42

-# Nehalem SSE4.2
+# SSE4.2: Intel Nehalem
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -march=corei7 -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-sse42.exe
 strip -s cpuminer
 mv cpuminer cpuminer-sse42

-# Core2 SSSE3
+# SSSE3: Intel Core2
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -march=core2 -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-ssse3.exe
 strip -s cpuminer
 mv cpuminer cpuminer-ssse3

-# Generic SSE2
+# SSE2
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -msse2 -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-sse2.exe
 strip -s cpuminer
 mv cpuminer cpuminer-sse2

-# AMD Zen1 AVX2 SHA
-make clean || echo done
-rm -f config.status
-CFLAGS="-O3 -march=znver1 -Wall -fno-common" ./configure --with-curl
-make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-zen.exe
-strip -s cpuminer
-mv cpuminer cpuminer-zen
-
-# AMD Zen3 AVX2 SHA VAES
-make clean || echo done
-rm -f config.status
-CFLAGS="-O3 -march=znver2 -mvaes -Wall -fno-common" ./configure --with-curl
-# CFLAGS="-O3 -march=znver3 -Wall -fno-common" ./configure --with-curl
-make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe cpuminer-zen3.exe
-strip -s cpuminer
-mv cpuminer cpuminer-zen3
-
-# Native to current CPU
+# Native to host CPU
 make clean || echo done
 rm -f config.status
 CFLAGS="-O3 -march=native -Wall -fno-common" ./configure --with-curl
 make -j 8
-strip -s cpuminer.exe
 strip -s cpuminer

--- a/build-msys2.sh
+++ b/build-msys2.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+#
+# Compile on Windows using MSYS2 and MinGW.
+
+make distclean || echo clean
+rm -f config.status
+./autogen.sh || echo done
+CFLAGS="-O3 --param=evrp-mode=legacy -march=native -Wall -D_WIN32_WINNT=0x0601" ./configure --with-curl
+make -j 4
+strip -s cpuminer
--- a/clean-all.sh
+++ b/clean-all.sh
@@ -2,8 +2,8 @@
 #
 # make clean and rm all the targetted executables.

-rm cpuminer-avx512-sha-vaes cpuminer-avx512-sha cpuminer-avx512 cpuminer-avx2 cpuminer-avx cpuminer-aes-sse42 cpuminer-sse2 cpuminer-zen cpuminer-sse42 cpuminer-ssse3 cpuminer-zen3 > /dev/null
+rm cpuminer-avx512-sha-vaes cpuminer-avx512 cpuminer-avx2 cpuminer-avx cpuminer-aes-sse42 cpuminer-sse2 cpuminer-avx2-sha cpuminer-sse42 cpuminer-ssse3 cpuminer-avx2-sha-vaes > /dev/null

-rm cpuminer-avx512-sha-vaes.exe cpuminer-avx512-sha.exe cpuminer-avx512.exe cpuminer-avx2.exe cpuminer-avx.exe cpuminer-aes-sse42.exe cpuminer-sse2.exe cpuminer-zen.exe  cpuminer-sse42.exe cpuminer-ssse3.exe cpuminer-zen3.exe > /dev/null
+rm cpuminer-avx512-sha-vaes.exe cpuminer-avx512-sha.exe cpuminer-avx512.exe cpuminer-avx2.exe cpuminer-avx.exe cpuminer-aes-sse42.exe cpuminer-sse2.exe cpuminer-avx2-sha.exe cpuminer-sse42.exe cpuminer-ssse3.exe cpuminer-avx2-sha-vaes.exe > /dev/null

 make distclean > /dev/null
--- a/compat.h
+++ b/compat.h
@@ -3,6 +3,10 @@

 #ifdef WIN32

+#if _WIN32_WINNT==0x0601    // Windows 7
+ #define WINDOWS_CPU_GROUPS_ENABLED 1
+#endif
+
 #include <windows.h>
 #include <time.h>

--- a/20
+++ b/20
@@ -1,6 +1,6 @@
 #! /bin/sh
 # Guess values for system-dependent variables and create Makefiles.
-# Generated by GNU Autoconf 2.69 for cpuminer-opt 3.18.0.
+# Generated by GNU Autoconf 2.69 for cpuminer-opt 3.19.6.
 #
 #
 # Copyright (C) 1992-1996, 1998-2012 Free Software Foundation, Inc.
@@ -577,8 +577,8 @@ MAKEFLAGS=
 # Identity of this package.
 PACKAGE_NAME='cpuminer-opt'
 PACKAGE_TARNAME='cpuminer-opt'
-PACKAGE_VERSION='3.18.0'
-PACKAGE_STRING='cpuminer-opt 3.18.0'
+PACKAGE_VERSION='3.19.6'
+PACKAGE_STRING='cpuminer-opt 3.19.6'
 PACKAGE_BUGREPORT=''
 PACKAGE_URL=''

@@ -1332,7 +1332,7 @@ if test "$ac_init_help" = "long"; then
  # Omit some internal or obsolete options to make the list less imposing.
  # This message is too long to be a string in the A/UX 3.1 sh.
  cat <<_ACEOF
-\`configure' configures cpuminer-opt 3.18.0 to adapt to many kinds of systems.
+\`configure' configures cpuminer-opt 3.19.6 to adapt to many kinds of systems.

 Usage: $0 [OPTION]... [VAR=VALUE]...

@@ -1404,7 +1404,7 @@ fi

 if test -n "$ac_init_help"; then
  case $ac_init_help in
-     short | recursive ) echo "Configuration of cpuminer-opt 3.18.0:";;
+     short | recursive ) echo "Configuration of cpuminer-opt 3.19.6:";;
   esac
  cat <<\_ACEOF

@@ -1509,7 +1509,7 @@ fi
 test -n "$ac_init_help" && exit $ac_status
 if $ac_init_version; then
  cat <<\_ACEOF
-cpuminer-opt configure 3.18.0
+cpuminer-opt configure 3.19.6
 generated by GNU Autoconf 2.69

 Copyright (C) 2012 Free Software Foundation, Inc.
@@ -2012,7 +2012,7 @@ cat >config.log <<_ACEOF
 This file contains any messages produced by compilers while
 running configure, to aid debugging if configure makes a mistake.

-It was created by cpuminer-opt $as_me 3.18.0, which was
+It was created by cpuminer-opt $as_me 3.19.6, which was
 generated by GNU Autoconf 2.69.  Invocation command line was

  $ $0 $@
@@ -2993,7 +2993,7 @@ fi

 # Define the identity of the package.
 PACKAGE='cpuminer-opt'
- VERSION='3.18.0'
+ VERSION='3.19.6'


 cat >>confdefs.h <<_ACEOF
@@ -6690,7 +6690,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
 # report actual input values of CONFIG_FILES etc. instead of their
 # values after options handling.
 ac_log="
-This file was extended by cpuminer-opt $as_me 3.18.0, which was
+This file was extended by cpuminer-opt $as_me 3.19.6, which was
 generated by GNU Autoconf 2.69.  Invocation command line was

  CONFIG_FILES    = $CONFIG_FILES
@@ -6756,7 +6756,7 @@ _ACEOF
 cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
 ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
 ac_cs_version="\\
-cpuminer-opt config.status 3.18.0
+cpuminer-opt config.status 3.19.6
 configured by $0, generated by GNU Autoconf 2.69,
  with options \\"\$ac_cs_config\\"

--- a/configure.ac
+++ b/configure.ac
@@ -1,4 +1,4 @@
-AC_INIT([cpuminer-opt], [3.18.0])
+AC_INIT([cpuminer-opt], [3.19.6])

 AC_PREREQ([2.59c])
 AC_CANONICAL_SYSTEM
--- a/cpu-miner.c
+++ b/cpu-miner.c
@@ -3,7 +3,7 @@
 * Copyright 2012-2014 pooler
 * Copyright 2014 Lucas Jones
 * Copyright 2014-2016 Tanguy Pruvot
- * Copyright 2016-2020 Jay D Dee
+ * Copyright 2016-2021 Jay D Dee
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License as published by the Free
@@ -115,28 +115,24 @@ int opt_param_n = 0;
 int opt_param_r = 0;
 int opt_n_threads = 0;
 bool opt_sapling = false;
-
-// Windows doesn't support 128 bit affinity mask.
-// Need compile time and run time test.
-#if defined(__linux) && defined(GCC_INT128)  
-#define AFFINITY_USES_UINT128 1
-static uint128_t opt_affinity = -1;
-static bool affinity_uses_uint128 = true;
-#else
-static uint64_t opt_affinity = -1;
-static bool affinity_uses_uint128 = false;
-#endif
-
+static uint64_t opt_affinity = 0xFFFFFFFFFFFFFFFFULL;  // default, use all cores
 int opt_priority = 0;  // deprecated
 int num_cpus = 1;
-int num_cpugroups = 1;
-char *rpc_url = NULL;;
+int num_cpugroups = 1;  // For Windows
+#define max_cpus 256   // max for affinity
+char *rpc_url = NULL;
 char *rpc_userpass = NULL;
 char *rpc_user, *rpc_pass;
 char *short_url = NULL;
 char *coinbase_address;
 char *opt_data_file = NULL;
 bool opt_verify = false;
+static bool opt_stratum_keepalive = false;
+static struct timeval stratum_keepalive_timer;
+// Stratum typically times out in 5 minutes or 300 seconds
+#define stratum_keepalive_timeout 180  // 3 minutes
+static struct timeval stratum_reset_time;
+

 // pk_buffer_size is used as a version selector by b58 code, therefore
 // it must be set correctly to work.
@@ -166,6 +162,7 @@ uint32_t accepted_share_count = 0;
 uint32_t rejected_share_count = 0;
 uint32_t stale_share_count = 0;
 uint32_t solved_block_count = 0;
+uint32_t stratum_errors = 0;
 double *thr_hashrates;
 double global_hashrate = 0.;
 double total_hashes = 0.;
@@ -196,7 +193,6 @@ int default_api_listen = 4048;
 static struct   timeval session_start;
 static struct   timeval five_min_start;
 static uint64_t session_first_block = 0;
-static double   latency_sum = 0.;
 static uint64_t submit_sum  = 0;
 static uint64_t accept_sum  = 0;
 static uint64_t stale_sum  = 0;
@@ -227,18 +223,25 @@ char*  lp_id;

 static void   workio_cmd_free(struct workio_cmd *wc);

-static void format_affinity_map( char *map_str, uint64_t map )
-{
-   int n = num_cpus < 64 ? num_cpus : 64;
-   int i;
+// array mapping thread to cpu
+static uint8_t thread_affinity_map[ max_cpus ];

+// display affinity mask graphically
+static void format_affinity_mask( char *mask_str, uint64_t mask )
+{
+#if defined(WINDOWS_CPU_GROUPS_ENABLED)
+   int n = num_cpus / num_cpugroups;
+#else
+   int n = num_cpus < 64 ? num_cpus : 64;
+#endif
+   int i;
   for ( i = 0; i < n; i++ )
   {
-      if ( map & 1 )  map_str[i] = '!';
-      else            map_str[i] = '.';
-      map >>= 1;
+      if ( mask & 1 )  mask_str[i] = '!';
+      else             mask_str[i] = '.';
+      mask >>= 1;
   }
-   memset( &map_str[i], 0, 64 - i );
+   memset( &mask_str[i], 0, 64 - i );
 }

 #ifdef __linux /* Linux specific policy and affinity management */
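Note on format_affinity_mask: bit i of the mask selects cpu i, and the function renders the low n bits as '!' (enabled) or '.' (disabled), least significant bit first. Below is a minimal standalone sketch of the same rendering. It assumes the cpu count is passed as a parameter instead of being read from num_cpus, and, unlike the patch, it clears a 65-byte buffer so the result is always NUL terminated. The function name is illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Render the low n bits (clamped to 0..64) of mask into out, LSB first.
   out must hold at least 65 bytes; it is always NUL terminated. */
static void render_affinity_mask( char *out, uint64_t mask, int n )
{
   int i;
   if ( n > 64 ) n = 64;
   if ( n < 0 )  n = 0;
   for ( i = 0; i < n; i++ )
   {
      out[i] = ( mask & 1 ) ? '!' : '.';
      mask >>= 1;
   }
   memset( &out[i], 0, (size_t)( 65 - i ) );  /* clear tail and terminator */
}
```

With a mask of 0x5 and n = 4, cpus 0 and 2 are marked enabled.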
@@ -260,93 +263,70 @@ static inline void drop_policy(void)
 #define pthread_setaffinity_np(tid,sz,s) {} /* only do process affinity */
 #endif

-// Linux affinity can use int128.
-#if AFFINITY_USES_UINT128
-static void affine_to_cpu_mask( int id, uint128_t mask )
-#else
-static void affine_to_cpu_mask( int id, uint64_t mask )
-#endif
+static void affine_to_cpu( struct thr_info *thr )
 {
+   int thread = thr->id;
   cpu_set_t set;
   CPU_ZERO( &set );
-   uint8_t ncpus = (num_cpus > 256) ? 256 : num_cpus;       
-
-   for ( uint8_t i = 0; i < ncpus; i++ ) 
-   {
-      // cpu mask
-#if AFFINITY_USES_UINT128
-      if( ( mask & ( (uint128_t)1 << i ) ) )  CPU_SET( i, &set );
-#else
-      if( (ncpus > 64) || ( mask & (1 << i) ) )  CPU_SET( i, &set );
-#endif
-   }
-   if ( id == -1 )
-   {
-      // process affinity
-      sched_setaffinity(0, sizeof(&set), &set);
-   }
-   else
-   {
-      // thread only
-      pthread_setaffinity_np(thr_info[id].pth, sizeof(&set), &set);
-   }
+   CPU_SET( thread_affinity_map[ thread ], &set );
+   if ( opt_debug )
+      applog( LOG_INFO, "Binding thread %d to cpu %d",
+                        thread, thread_affinity_map[ thread ] );
+   pthread_setaffinity_np( thr->pth, sizeof(set), &set );
 }

 #elif defined(WIN32) /* Windows */
+
 static inline void drop_policy(void) { }

 // Windows CPU groups to manage more than 64 CPUs.
-static void affine_to_cpu_mask( int id, uint64_t mask )
+// Each thread is bound to a single cpu taken from thread_affinity_map.
+static void affine_to_cpu( struct thr_info *thr )
 {
-   bool success;
+   int thread = thr->id;
   unsigned long last_error;    
-//   BOOL success;
-//   DWORD last_error;
+   bool ok;

-   if ( id == -1 )
-      success = SetProcessAffinityMask( GetCurrentProcess(), mask );
+#if defined(WINDOWS_CPU_GROUPS_ENABLED)
+   unsigned long group_size = GetActiveProcessorCount( 0 );
+   unsigned long group      = thread / group_size;
+   unsigned long cpu        = thread_affinity_map[ thread % group_size ];

-// Are Windows CPU Groups supported?
-#if _WIN32_WINNT==0x0601
-   else if ( num_cpugroups == 1 )
-	   success = SetThreadAffinityMask( GetCurrentThread(), mask );
-   else
-   {
-	   // Find the correct cpu group
-	   int cpu = id % num_cpus;
-	   int group;
-	   for( group = 0; group < num_cpugroups; group++ )
-	   {
-	      int cpus = GetActiveProcessorCount( group );
- 	      if ( cpu < cpus )  break;
-  	      cpu -= cpus;
-      }
+   GROUP_AFFINITY affinity;
+   affinity.Group = group;
+   affinity.Mask = 1ULL << cpu;

-	   if (opt_debug)
-         applog(LOG_DEBUG, "Binding thread %d to cpu %d on cpu group %d (mask %x)",
-               id, cpu, group, (1ULL << cpu));
+   if ( opt_debug )
+      applog( LOG_INFO, "Binding thread %d to cpu %d in cpu group %d",
+                        thread, cpu, group );
+
+   ok = SetThreadGroupAffinity( GetCurrentThread(), &affinity, NULL );

-	   GROUP_AFFINITY affinity;
-	   affinity.Group = group;
-	   affinity.Mask = 1ULL << cpu;
-	   success = SetThreadGroupAffinity( GetCurrentThread(), &affinity, NULL );
-   }
 #else
-   else 
-      success = SetThreadAffinityMask( GetCurrentThread(), mask );
+
+   unsigned long cpu = thread_affinity_map[ thread ];
+   uint64_t mask = 1ULL << cpu;
+
+   if ( opt_debug )
+      applog( LOG_INFO, "Binding thread %d to cpu %d", thread, cpu );
+
+   ok = SetThreadAffinityMask( GetCurrentThread(), mask );
+
 #endif

-   if (!success)
+   if ( !ok )
   {
-	   last_error = GetLastError();
-	   applog(LOG_WARNING, "affine_to_cpu_mask for %u returned %x",
-               id, last_error);
+      last_error = GetLastError();
+      applog( LOG_WARNING, "affine_to_cpu for %d returned 0x%lx",
+                           thread, last_error );
   }
-}
+}   

 #else
+
 static inline void drop_policy(void) { }
-static void affine_to_cpu_mask(int id, unsigned long mask) { }
+static void affine_to_cpu( struct thr_info *thr ) { }
+
 #endif

 // not very useful, just index the arrray directly.
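Note on the CPU-group mapping: with WINDOWS_CPU_GROUPS_ENABLED, the new affine_to_cpu derives the group from thread / group_size and the cpu from thread_affinity_map[ thread % group_size ], then builds a one-bit mask. The portable sketch below reproduces only that arithmetic, so it can be checked without the Windows API. The struct and function names are illustrative, and the group size is passed in rather than read from GetActiveProcessorCount( 0 ).

```c
#include <stdint.h>

/* Illustrative result type, not part of the patch. */
struct cpu_binding
{
   unsigned long group;   /* processor group index */
   unsigned long cpu;     /* cpu index inside the group */
   uint64_t      mask;    /* single-bit affinity mask for that cpu */
};

/* Mirrors the arithmetic of the WINDOWS_CPU_GROUPS_ENABLED branch.
   affinity_map plays the role of thread_affinity_map; group_size must be
   non-zero and every map entry used must be below 64. */
static struct cpu_binding bind_thread( int thread,
                                       const uint8_t *affinity_map,
                                       unsigned long group_size )
{
   struct cpu_binding b;
   b.group = (unsigned long)thread / group_size;
   b.cpu   = affinity_map[ (unsigned long)thread % group_size ];
   b.mask  = 1ULL << b.cpu;
   return b;
}
```

With an identity map and a group size of 64, thread 70 lands on cpu 6 of group 1.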
@@ -1067,9 +1047,17 @@ void report_summary_log( bool force )

 #endif

-   if ( !( force && ( submit_sum || ( et.tv_sec > 5 ) ) )
-     && ( et.tv_sec < 300 ) )
-      return;
+   if ( !( force && ( submit_sum || ( et.tv_sec > 5 ) ) ) )
+   {
+      if ( et.tv_sec < 300 )
+         return;
+      if ( ( s_get_ptr != s_put_ptr ) && ( et.tv_sec < 360 ) )
+         return;
+   }
+   
+//   if ( !( force && ( submit_sum || ( et.tv_sec > 5 ) ) )
+//     && ( et.tv_sec < 300 ) )
+//      return;
   
   // collect and reset periodic counters
   pthread_mutex_lock( &stats_lock );
@@ -1112,19 +1100,17 @@ void report_summary_log( bool force )
   applog( LOG_BLUE, "%s: %s", algo_names[ opt_algo ], short_url );
   applog2( LOG_NOTICE, "Periodic Report     %s        %s", et_str, upt_str );
   applog2( LOG_INFO, "Share rate        %.2f/min     %.2f/min",
-            submit_rate, (double)submitted_share_count*60. /
-            ( (double)uptime.tv_sec + (double)uptime.tv_usec / 1e6 ) );
+            submit_rate, safe_div( (double)submitted_share_count*60.,
+              ( (double)uptime.tv_sec + (double)uptime.tv_usec / 1e6 ), 0. ) );
   applog2( LOG_INFO, "Hash rate       %7.2f%sh/s   %7.2f%sh/s   (%.2f%sh/s)",
            shrate, shr_units, sess_hrate, sess_hr_units, ghrate, ghr_units );

   if ( accepted_share_count < submitted_share_count )
   {
-      double lost_ghrate = uptime.tv_sec == 0 ? 0.
-                : target_diff
-                       * (double)(submitted_share_count - accepted_share_count )
-                  / (double)uptime.tv_sec;
-      double lost_shrate = share_time == 0. ? 0.
-               : target_diff  * (double)(submits - accepts ) / share_time;
+      double lost_ghrate = safe_div( target_diff
+                    * (double)(submitted_share_count - accepted_share_count ),
+                    (double)uptime.tv_sec, 0. );
+      double lost_shrate = safe_div( target_diff * (double)(submits - accepts ), share_time, 0. );
      char lshr_units[4] = {0};
      char lghr_units[4] = {0};
      scale_hash_for_display( &lost_shrate, lshr_units );
@@ -1161,17 +1147,23 @@ void report_summary_log( bool force )
      applog2( prio, "Blocks Solved   %7d      %7d",
               solved, solved_block_count );
   }
+   if ( stratum_errors )
+      applog2( LOG_INFO, "Stratum resets               %7d", stratum_errors );
+
   applog2( LOG_INFO, "Hi/Lo Share Diff  %.5g /  %.5g",
            highest_share, lowest_share );

   int mismatch = submitted_share_count
         - ( accepted_share_count + stale_share_count + rejected_share_count );
+
   if ( mismatch )
   {
-      if ( mismatch != 1 )
-         applog2(LOG_MINR, "Count mismatch: %d, stats may be inaccurate", mismatch );
-      else
-         applog2(LOG_INFO, CL_LBL "Count mismatch, submitted share may still be pending" CL_N );
+      if ( stratum_errors )
+         applog2( LOG_MINR, "Count mismatch: %d, stats may be inaccurate",
+                            mismatch );
+      else if ( !opt_quiet )
+         applog2( LOG_INFO, CL_LBL
+                  "Count mismatch, submitted share may still be pending" CL_N );
   }
 }

@@ -1287,7 +1279,6 @@ static int share_result( int result, struct work *work,
      else          reject_sum++;
   }
   submit_sum++;
-   latency_sum += latency;

   pthread_mutex_unlock( &stats_lock );

@@ -1303,9 +1294,9 @@ static int share_result( int result, struct work *work,
     else              rcol = CL_LRD;
   }

-   applog( LOG_INFO, "%d %s%s %s%s %s%s %s%s" CL_WHT ", %.3f sec (%dms)",
+   applog( LOG_INFO, "%d %s%s %s%s %s%s %s%s%s, %.3f sec (%dms)",
           my_stats.share_count, acol, ares, scol, sres, rcol, rres, bcol,
-           bres, share_time, latency );
+           bres, CL_N, share_time, latency );

   if ( unlikely( opt_debug || !result || solved ) )
   {
@@ -2008,6 +1999,10 @@ void set_work_data_big_endian( struct work *work )
 // calculate net diff from nbits.
 double std_calc_network_diff( struct work* work )
 {
+   uint32_t nbits = work->data[ algo_gate.nbits_index ];
+   uint32_t shift = nbits & 0xff;
+   uint32_t bits = bswap_32( nbits ) & 0x00ffffff;
+/*
   // sample for diff 43.281 : 1c05ea29
   // todo: endian reversed on longpoll could be zr5 specific...
   int nbits_index = algo_gate.nbits_index;
@@ -2015,15 +2010,17 @@ double std_calc_network_diff( struct work* work )
                                  : swab32( work->data[ nbits_index ] );
   uint32_t bits  = ( nbits & 0xffffff );
   int16_t  shift = ( swab32(nbits) & 0xff ); // 0x1c = 28
+*/
+
   int m;
-   double d = (double)0x0000ffff / (double)bits;
+   long double d = (long double)0x0000ffff / (long double)bits;
   for ( m = shift; m < 29; m++ )
       d *= 256.0;
   for ( m = 29; m < shift; m++ )
       d /= 256.0;
   if ( opt_debug_diff )
-      applog(LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", d, shift, bits);
-   return d;
+      applog(LOG_DEBUG, "net diff: %8f -> shift %u, bits %08x", (double)d, shift, bits);
+   return (double)d;
 }

 void std_get_new_work( struct work* work, struct work* g_work, int thr_id,
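Note on std_calc_network_diff: the function splits nbits into an exponent byte (shift) and a 24-bit mantissa (bits), computes 0xffff / bits, and scales the result by 256 for every step between shift and 29. The sketch below applies the same arithmetic to an nbits value already in its usual compact form (exponent in the top byte), which sidesteps the byte order of work->data. That input convention, the zero-mantissa guard and the function name are assumptions made for illustration. It reproduces the sample from the original comment: 0x1c05ea29 gives a difficulty of about 43.281.

```c
#include <stdint.h>

/* Illustrative: network difficulty from a compact nbits value such as
   0x1c05ea29 (top byte = exponent, low 24 bits = mantissa).
   Returns 0 when the mantissa is zero to avoid dividing by zero. */
static double net_diff_from_compact( uint32_t compact )
{
   uint32_t shift = compact >> 24;
   uint32_t bits  = compact & 0x00ffffffu;
   uint32_t m;
   long double d;

   if ( bits == 0 ) return 0.;
   d = (long double)0x0000ffff / (long double)bits;
   for ( m = shift; m < 29; m++ )  d *= 256.0;   /* smaller exponent: harder */
   for ( m = 29; m < shift; m++ )  d /= 256.0;   /* larger exponent: easier */
   return (double)d;
}
```

The compact value 0x1d00ffff (exponent 29, mantissa 0xffff) evaluates to a difficulty of 1.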
@@ -2083,7 +2080,8 @@ static void stratum_gen_work( struct stratum_ctx *sctx, struct work *g_work )
                           / ( opt_target_factor * opt_diff_factor );
   diff_to_hash( g_work->target, g_work->targetdiff );

-   // Increment extranonce2
+   // Pre increment extranonce2 in case of being called again before receiving
+   // a new job
   for ( int t = 0;
         t < sctx->xnonce2_size && !( ++sctx->job.xnonce2[t] );
         t++ );
@@ -2103,20 +2101,12 @@ static void stratum_gen_work( struct stratum_ctx *sctx, struct work *g_work )

   pthread_mutex_unlock( &stats_lock );

-   if ( !opt_quiet )
-   {
-      int mismatch = submitted_share_count
-         - ( accepted_share_count + stale_share_count + rejected_share_count );
-      if ( mismatch )
-         applog(LOG_INFO, CL_LBL "%d Submitted share pending, maybe stale" CL_N, submitted_share_count );
-   }
-
   if ( stratum_diff != sctx->job.diff )
      applog( LOG_BLUE, "New Stratum Diff %g, Block %d, Job %s",
                        sctx->job.diff, sctx->block_height, g_work->job_id );
   else if ( last_block_height != sctx->block_height )
-      applog( LOG_BLUE, "New Block %d, Job %s",
-                        sctx->block_height, g_work->job_id );
+      applog( LOG_BLUE, "New Block %d, Net diff %.5g, Job %s",
+                        sctx->block_height, net_diff, g_work->job_id );
   else if ( g_work->job_id && new_job )
      applog( LOG_BLUE, "New Work: Block %d, Net diff %.5g, Job %s",
                         sctx->block_height, net_diff, g_work->job_id );
@@ -2124,7 +2114,7 @@ static void stratum_gen_work( struct stratum_ctx *sctx, struct work *g_work )
   {
      unsigned char *xnonce2str = bebin2hex( g_work->xnonce2,
                                             g_work->xnonce2_len );
-      applog( LOG_INFO, "Extranonce2 %s, Block %d, Job %s",
+      applog( LOG_INFO, "Extranonce2 0x%s, Block %d, Job %s",
                        xnonce2str, sctx->block_height, g_work->job_id );
      free( xnonce2str );
   }
@@ -2169,11 +2159,10 @@ static void stratum_gen_work( struct stratum_ctx *sctx, struct work *g_work )
               uint64_t net_ttf =
                    ( last_block_height - session_first_block ) == 0 ? 0
                    : et.tv_sec / ( last_block_height - session_first_block );
-               if ( net_diff && net_ttf )
+               if ( net_diff > 0. && net_ttf )
               {
                  double net_hr = nd / net_ttf;
                  char net_hr_units[4] = {0};
-
                  scale_hash_for_display ( &net_hr, net_hr_units );
                  applog2( LOG_INFO, "Net hash rate (est) %.2f %sh/s",
                                     net_hr, net_hr_units );
@@ -2182,6 +2171,17 @@ static void stratum_gen_work( struct stratum_ctx *sctx, struct work *g_work )
         }  // hr > 0
      } // !quiet
   }  // new diff/block
+
+   if ( new_job && !( opt_quiet || stratum_errors ) )
+   {
+      int mismatch = submitted_share_count - ( accepted_share_count
+                                             + stale_share_count
+                                             + rejected_share_count );
+      if ( mismatch )
+         applog( LOG_INFO,
+                 CL_LBL "%d Submitted share pending, maybe stale" CL_N,
+                 submitted_share_count );
+   }
 }

 static void *miner_thread( void *userdata )
@@ -2240,53 +2240,13 @@ static void *miner_thread( void *userdata )
 	   if ( opt_priority == 0 )
 	      drop_policy();
   }
+
   // CPU thread affinity
-   if ( num_cpus > 1 )
-   {
-#if AFFINITY_USES_UINT128
-      // Default affinity
-      if ( (opt_affinity == (uint128_t)(-1) ) && opt_n_threads > 1 )
-      {  
-         affine_to_cpu_mask( thr_id, (uint128_t)1 << (thr_id % num_cpus) );
-         if ( opt_debug )
-            applog( LOG_INFO, "Binding thread %d to cpu %d.",
-                    thr_id, thr_id % num_cpus,
-	                 u128_hi64( (uint128_t)1 << (thr_id % num_cpus) ),
-		              u128_lo64( (uint128_t)1 << (thr_id % num_cpus) ) );
-      }
-#else
-      if ( ( opt_affinity == -1 ) && ( opt_n_threads > 1 ) ) 
-      {
-         affine_to_cpu_mask( thr_id, 1 << (thr_id % num_cpus) );
-         if (opt_debug)
-            applog( LOG_DEBUG, "Binding thread %d to cpu %d.",
-                thr_id, thr_id % num_cpus, 1 << (thr_id % num_cpus)) ;
-      }
-#endif
-      else   // Custom affinity
-      {
-         affine_to_cpu_mask( thr_id, opt_affinity );
-         if ( opt_debug )
-         {
-#if AFFINITY_USES_UINT128
-            if ( num_cpus > 64 )
-               applog( LOG_INFO, "Binding thread %d to mask %016llx %016llx",
-                                thr_id, u128_hi64( opt_affinity ), 
-                                        u128_lo64( opt_affinity ) );
-            else
-               applog( LOG_INFO, "Binding thread %d to mask %016llx",
-                                 thr_id, opt_affinity );
-#else
-            applog( LOG_INFO, "Binding thread %d to mask %016llx",
-                                 thr_id, opt_affinity );
-#endif
-         }
-      }
-   }  // num_cpus > 1
+   if ( opt_affinity && num_cpus > 1 )   affine_to_cpu( mythr );

   if ( !algo_gate.miner_thread_init( thr_id ) )
   {
-      applog( LOG_ERR, "FAIL: thread %u failed to initialize", thr_id );
+      applog( LOG_ERR, "FAIL: thread %d failed to initialize", thr_id );
      exit (1);
   }

@@ -2314,10 +2274,24 @@ static void *miner_thread( void *userdata )
          {
             while ( unlikely( stratum_down ) )
                sleep( 1 );
-             if ( *nonceptr >= end_nonce )
-                stratum_gen_work( &stratum, &g_work );
+             if ( unlikely( ( *nonceptr >= end_nonce )
+                         && !work_restart[thr_id].restart ) )
+             {
+                if ( opt_extranonce )
+                   stratum_gen_work( &stratum, &g_work );
+                else
+                {
+                   if ( !thr_id )
+                   {
+                      applog( LOG_WARNING, "nonce range exhausted, extranonce not subscribed" );
+                      applog( LOG_WARNING, "waiting for new work...");
+                   }
+                   while ( !work_restart[thr_id].restart )
+                      sleep ( 1 );
+                }
+             }
          }
-          else
+          else  // GBT or getwork
          {
             pthread_rwlock_wrlock( &g_work_lock );

@@ -2328,8 +2302,7 @@ static void *miner_thread( void *userdata )
                if ( unlikely( !get_work( mythr, &g_work ) ) )
                {
                   pthread_rwlock_unlock( &g_work_lock );
-		             applog( LOG_ERR, "work retrieval failed, exiting "
-		                              "mining thread %d", thr_id );
+		             applog( LOG_ERR, "work retrieval failed, exiting miner thread %d", thr_id );
 		             goto out;
 	             }
                g_work_time = time(NULL);
@@ -2492,18 +2465,21 @@ static void *miner_thread( void *userdata )
             timeval_subtract( &uptime, &total_hashes_time, &session_start ); 
             double hashrate = safe_div( total_hashes, uptime.tv_sec, 0. );

-             scale_hash_for_display( &hashrate,  hr_units );
-             sprintf( hr, "%.2f", hashrate );
+             if ( hashrate > 0. )
+             {
+                scale_hash_for_display( &hashrate,  hr_units );
+                sprintf( hr, "%.2f", hashrate );
 #if (defined(_WIN64) || defined(__WINDOWS__) || defined(_WIN32))
-             applog( LOG_NOTICE, "Total: %s %sH/s", hr, hr_units );
+                applog( LOG_NOTICE, "Total: %s %sH/s", hr, hr_units );
 #else
-             float lo_freq = 0., hi_freq = 0.;
-             linux_cpu_hilo_freq( &lo_freq, &hi_freq );
-             applog( LOG_NOTICE,
+                float lo_freq = 0., hi_freq = 0.;
+                linux_cpu_hilo_freq( &lo_freq, &hi_freq );
+                applog( LOG_NOTICE,
                     "Total: %s %sH/s, Temp: %dC, Freq: %.3f/%.3f GHz",
                     hr, hr_units, (uint32_t)cpu_temp(0), lo_freq / 1e6,
                     hi_freq / 1e6 );
 #endif
+             }
          }
       }  // benchmark

@@ -2631,7 +2607,7 @@ start:
 	       if (!opt_quiet)
          {
 	         char netinfo[64] = { 0 };
-	         if (net_diff > 0.)
+	         if ( net_diff > 0. )
            {
 	 	         sprintf(netinfo, ", diff %.3f", net_diff);
 	         }
@@ -2770,6 +2746,18 @@ void std_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
          sctx->job.final_sapling_hash );
 }

+// Loop is out of order:
+//
+//   connect/reconnect
+//   handle message
+//   get new message
+//
+// change to
+//   connect/reconnect
+//   get new message
+//   handle message
+
+
 static void *stratum_thread(void *userdata )
 {
   struct thr_info *mythr = (struct thr_info *) userdata;
@@ -2787,7 +2775,9 @@ static void *stratum_thread(void *userdata )
      if ( unlikely( stratum_need_reset ) )
      {
          stratum_need_reset = false;
+          gettimeofday( &stratum_reset_time, NULL );
          stratum_down = true;
+          stratum_errors++;
          stratum_disconnect( &stratum );
          if ( strcmp( stratum.url, rpc_url ) )
          {
@@ -2796,7 +2786,7 @@ static void *stratum_thread(void *userdata )
 	          applog(LOG_BLUE, "Connection changed to %s", short_url);
          }
          else 
-	          applog(LOG_WARNING, "Stratum connection reset");
+	          applog(LOG_BLUE, "Stratum connection reset");
          // reset stats queue as well
          restart_threads();
          if ( s_get_ptr != s_put_ptr ) s_get_ptr = s_put_ptr = 0;
@@ -2805,6 +2795,7 @@ static void *stratum_thread(void *userdata )
      while ( !stratum.curl )
      {
         stratum_down = true;
+         restart_threads();
         pthread_rwlock_wrlock( &g_work_lock );
         g_work_time = 0;
         pthread_rwlock_unlock( &g_work_lock );
@@ -2826,17 +2817,13 @@ static void *stratum_thread(void *userdata )
         else
         {
            stratum_down = false;
-            restart_threads();
            applog(LOG_BLUE,"Stratum connection established" );
+            if ( stratum.new_job )   // prime first job
+               stratum_gen_work( &stratum, &g_work );
         }
      }

-      report_summary_log( ( stratum_diff != stratum.job.diff )
-                       && ( stratum_diff != 0. ) );
-      
-      if ( stratum.new_job )
-         stratum_gen_work( &stratum, &g_work );
-
+      // Wait for new message from server
      if ( likely( stratum_socket_full( &stratum, opt_timeout ) ) )
      {
         if ( likely( s = stratum_recv_line( &stratum ) ) )
@@ -2859,6 +2846,53 @@ static void *stratum_thread(void *userdata )
 //         stratum_disconnect( &stratum );
      }

+      report_summary_log( ( stratum_diff != stratum.job.diff )
+                       && ( stratum_diff != 0. ) );
+
+      if ( !stratum_need_reset )
+      {
+         // Is keepalive needed? Mutex would normally be required but that
+         // would block any attempt to submit a share. A share is more
+         // important even if it messes up the keepalive.
+
+         if ( opt_stratum_keepalive )
+         {
+            struct timeval now, et;
+            gettimeofday( &now, NULL );
+            // any shares submitted since last keepalive?
+            if ( last_submit_time.tv_sec > stratum_keepalive_timer.tv_sec )
+               memcpy( &stratum_keepalive_timer, &last_submit_time,
+                       sizeof (struct timeval) );
+
+            timeval_subtract( &et, &now, &stratum_keepalive_timer );
+
+            if ( et.tv_sec > stratum_keepalive_timeout )
+            {
+                double diff = stratum.job.diff * 0.5;
+                stratum_keepalive_timer = now;
+                if ( !opt_quiet )
+                   applog( LOG_BLUE,
+                           "Stratum keepalive requesting lower difficulty" );
+                stratum_suggest_difficulty( &stratum, diff );
+            }
+
+            if ( last_submit_time.tv_sec > stratum_reset_time.tv_sec )
+              timeval_subtract( &et, &now, &last_submit_time );
+            else
+              timeval_subtract( &et, &now, &stratum_reset_time );
+
+            if ( et.tv_sec > stratum_keepalive_timeout + 60 )
+            {
+               applog( LOG_NOTICE, "No shares submitted, resetting stratum connection" );
+               stratum_need_reset = true;
+               stratum_keepalive_timer = now;
+            }
+         } // stratum_keepalive
+
+         if ( stratum.new_job && !stratum_need_reset )
+            stratum_gen_work( &stratum, &g_work );
+
+      } // stratum_need_reset
   }  // loop
 out:
  return NULL;
@@ -2897,13 +2931,15 @@ static bool cpu_capability( bool display_only )
     bool algo_has_sse2    = set_incl( SSE2_OPT,    algo_features );
     bool algo_has_aes     = set_incl( AES_OPT,     algo_features );
     bool algo_has_sse42   = set_incl( SSE42_OPT,   algo_features );
+     bool algo_has_avx     = set_incl( AVX_OPT,     algo_features );
     bool algo_has_avx2    = set_incl( AVX2_OPT,    algo_features );
     bool algo_has_avx512  = set_incl( AVX512_OPT,  algo_features );
     bool algo_has_sha     = set_incl( SHA_OPT,     algo_features );
     bool algo_has_vaes    = set_incl( VAES_OPT,    algo_features );
-     bool algo_has_vaes256 = set_incl( VAES256_OPT, algo_features );
     bool use_aes;
     bool use_sse2;
+     bool use_sse42;
+     bool use_avx;
     bool use_avx2;
     bool use_avx512;
     bool use_sha;
@@ -2973,18 +3009,20 @@ static bool cpu_capability( bool display_only )
     else if ( sw_has_aes    )    printf( "  AES"   );
     if      ( sw_has_sha    )    printf( " SHA"    );

-     printf("\nAlgo features:");
-     if ( algo_features == EMPTY_SET ) printf( " None" );
-     else
+     if ( !display_only )
     {
-        if      ( algo_has_avx512  )  printf( " AVX512" );
-        else if ( algo_has_avx2    )  printf( " AVX2  " );
-        else if ( algo_has_sse42   )  printf( " SSE4.2" );
-        else if ( algo_has_sse2    )  printf( " SSE2  " );
-        if      ( algo_has_vaes ||
-                  algo_has_vaes256 )  printf( " VAES"   );
-        else if ( algo_has_aes     )  printf( "  AES"   );
-        if      ( algo_has_sha     )  printf( " SHA"    );
+        printf("\nAlgo features:");
+        if ( algo_features == EMPTY_SET ) printf( " None" );
+        else
+        {
+           if      ( algo_has_avx512 )  printf( " AVX512" );
+           else if ( algo_has_avx2   )  printf( " AVX2  " );
+           else if ( algo_has_sse42  )  printf( " SSE4.2" );
+           else if ( algo_has_sse2   )  printf( " SSE2  " );
+           if      ( algo_has_vaes   )  printf( " VAES"   );
+           else if ( algo_has_aes    )  printf( "  AES"   );
+           if      ( algo_has_sha    )  printf( " SHA"    );
+        }
     }
     printf("\n");

@@ -3019,14 +3057,15 @@ static bool cpu_capability( bool display_only )

     // Determine mining options
     use_sse2   = cpu_has_sse2   && algo_has_sse2;
+     use_sse42  = cpu_has_sse42  && sw_has_sse42  && algo_has_sse42;
+     use_avx    = cpu_has_avx    && sw_has_avx    && algo_has_avx;
     use_aes    = cpu_has_aes    && sw_has_aes    && algo_has_aes;
     use_avx2   = cpu_has_avx2   && sw_has_avx2   && algo_has_avx2;
     use_avx512 = cpu_has_avx512 && sw_has_avx512 && algo_has_avx512;
     use_sha    = cpu_has_sha    && sw_has_sha    && algo_has_sha;
-     use_vaes   = cpu_has_vaes   && sw_has_vaes   && ( algo_has_vaes
-                                                    || algo_has_vaes256 );
-     use_none = !( use_sse2 || use_aes || use_avx512 || use_avx2 ||
-                   use_sha || use_vaes );
+     use_vaes   = cpu_has_vaes   && sw_has_vaes   && algo_has_vaes;
+     use_none = !( use_sse2 || use_sse42 || use_avx || use_aes || use_avx512
+                || use_avx2 || use_sha || use_vaes );

     // Display best options
     printf( "\nStarting miner with" );
@@ -3035,6 +3074,8 @@ static bool cpu_capability( bool display_only )
     {
        if      ( use_avx512 ) printf( " AVX512" );
        else if ( use_avx2   ) printf( " AVX2"   );
+        else if ( use_avx    ) printf( " AVX"    );
+        else if ( use_sse42  ) printf( " SSE42"  );
        else if ( use_sse2   ) printf( " SSE2"   );
        if      ( use_vaes   ) printf( " VAES"   );
        else if ( use_aes    ) printf( " AES"    );
@@ -3123,7 +3164,7 @@ void parse_arg(int key, char *arg )
 {
 	char *p;
 	int v, i;
-	uint64_t ul;
+//	uint64_t ul;
 	double d;

 	switch( key )
@@ -3434,24 +3475,14 @@ void parse_arg(int key, char *arg )
 		break;
 #endif
 	case 1020:  // cpu-affinity
-		p = strstr(arg, "0x");
-		if ( p )
-			ul = strtoull( p, NULL, 16 );
-		else
-			ul = atoll( arg );
-#if AFFINITY_USES_UINT128
-// replicate the low 64 bits to make a full 128 bit mask if there are more
-// than 64 CPUs, otherwise zero extend the upper half.
-         opt_affinity = (uint128_t)ul;
-         if ( num_cpus > 64 )
-            opt_affinity |= opt_affinity << 64;
-#else
-         opt_affinity = ul;
-#endif
-		break;
+      p = strstr( arg, "0x" );
+      opt_affinity = p ? strtoull( p, NULL, 16 )
+                       : atoll( arg );
+      break;
 	case 1021:  // cpu-priority
 		v = atoi(arg);
-		if (v < 0 || v > 5)	/* sanity check */
+      applog(LOG_NOTICE,"--cpu-priority is deprecated and will be removed from a future release");
+      if (v < 0 || v > 5)	/* sanity check */
 			show_usage_and_exit(1);
 		opt_priority = v;
 		break;
@@ -3487,14 +3518,18 @@ void parse_arg(int key, char *arg )
 		break;
 	case 1024:
 		opt_randomize = true;
-		break;
+      applog(LOG_NOTICE,"--randomize is deprecated and will be removed from a future release");
+      break;
   case 1027:  // data-file
      opt_data_file = strdup( arg );
      break;
   case 1028:  // verify
      opt_verify = true;
      break;
-	case 'V':
+   case 1029:  // stratum-keepalive
+      opt_stratum_keepalive = true;
+      break;
+   case 'V':
      display_cpu_capability();
      exit(0);
 	case 'h':
@@ -3551,20 +3586,18 @@ static void parse_cmdline(int argc, char *argv[])
   while (1)
   {
 #if HAVE_GETOPT_LONG
-	key = getopt_long(argc, argv, short_options, options, NULL);
+      key = getopt_long(argc, argv, short_options, options, NULL);
 #else
-	key = getopt(argc, argv, short_options);
+      key = getopt(argc, argv, short_options);
 #endif
-	if (key < 0)
-		break;
-
-	parse_arg(key, optarg);
+      if ( key < 0 )   break;
+      parse_arg( key, optarg );
   }
-   if (optind < argc)
+   if ( optind < argc )
   {
-	fprintf(stderr, "%s: unsupported non-option argument -- '%s'\n",
-		argv[0], argv[optind]);
-        show_usage_and_exit(1);
+      fprintf( stderr, "%s: unsupported non-option argument -- '%s'\n",
+		                 argv[0], argv[optind]);
+      show_usage_and_exit(1);
   }
 }

@@ -3628,26 +3661,21 @@ int main(int argc, char *argv[])
 	rpc_user = strdup("");
 	rpc_pass = strdup("");

-   parse_cmdline(argc, argv);
-
 #if defined(WIN32)
-//	SYSTEM_INFO sysinfo;
-//	GetSystemInfo(&sysinfo);
-//	num_cpus = sysinfo.dwNumberOfProcessors;
-// What happens if GetActiveProcessorGroupCount called if groups not enabled?

 // Are Windows CPU Groups supported?
-#if _WIN32_WINNT==0x0601
+#if defined(WINDOWS_CPU_GROUPS_ENABLED)
 	num_cpus = 0;
 	num_cpugroups = GetActiveProcessorGroupCount();
-	for(  i = 0; i < num_cpugroups; i++ )
+	for( i = 0; i < num_cpugroups; i++ )
 	{
- 	   int cpus = GetActiveProcessorCount(i);
+ 	   int cpus = GetActiveProcessorCount( i );
 	   num_cpus += cpus;

 	   if (opt_debug)
-         applog(LOG_DEBUG, "Found %d cpus on cpu group %d", cpus, i);
+         applog( LOG_INFO, "Found %d CPUs in CPU group %d", cpus, i );
 	}
+
 #else
   SYSTEM_INFO sysinfo;
   GetSystemInfo(&sysinfo);
@@ -3663,21 +3691,22 @@ int main(int argc, char *argv[])
 #else
 	num_cpus = 1;
 #endif
-	if (num_cpus < 1)
-		num_cpus = 1;

-   if (!opt_n_threads)
-      opt_n_threads = num_cpus;
+   if ( num_cpus < 1 )
+      num_cpus = 1;
+   opt_n_threads = num_cpus;
+
+   parse_cmdline( argc, argv );

   if ( opt_algo == ALGO_NULL )
   {
-      fprintf(stderr, "%s: no algo supplied\n", argv[0]);
+      fprintf( stderr, "%s: No algo parameter specified\n", argv[0] );
      show_usage_and_exit(1);
   }

   // need to register to get algo optimizations for cpu capabilities
-   // but that causes register logs before cpu capabilities is output.
-   // Would need to split register into 2 parts. First part sets algo
+   // but that causes registration logs before cpu capabilities is output.
+   // Would need to split register function into 2 parts. First part sets algo
   // optimizations but no logging, second part does any logging.   
   if ( !register_algo_gate( opt_algo, &algo_gate ) )  exit(1);

@@ -3721,9 +3750,6 @@ int main(int argc, char *argv[])
         return 1;
 	}

-   // All options must be set before starting the gate
-//   if ( !register_algo_gate( opt_algo, &algo_gate ) )  exit(1);
-
   if ( coinbase_address )
   {
      pk_script_size = address_to_script( pk_script, pk_buffer_size,
@@ -3735,8 +3761,6 @@ int main(int argc, char *argv[])
      }
   }

-//   if ( !check_cpu_capability() ) exit(1);
-
 	pthread_mutex_init( &stats_lock, NULL );
   pthread_rwlock_init( &g_work_lock, NULL );
 	pthread_mutex_init( &stratum.sock_lock, NULL );
@@ -3806,44 +3830,28 @@ int main(int argc, char *argv[])
 	}
 #endif

-// To be confirmed with more than 64 cpus
-   if ( opt_affinity != -1 )
+   if ( opt_affinity && num_cpus > max_cpus )
   {
-      if ( !affinity_uses_uint128 && num_cpus > 64 )
-      {
-          applog(LOG_WARNING,"Setting CPU affinity with more than 64 CPUs is only");
-          applog(LOG_WARNING,"available on Linux. Using default affinity.");
-          opt_affinity = -1;
-      }
-/*
-      else	
-      {
-         affine_to_cpu_mask( -1, opt_affinity );
-         if ( !opt_quiet )
-         {
-#if AFFINITY_USES_UINT128
-            if ( num_cpus > 64 )
-               applog(LOG_DEBUG, "Binding process to cpu mask %x",
-                      u128_hi64( opt_affinity ), u128_lo64( opt_affinity ) );
-            else 
-               applog(LOG_DEBUG, "Binding process to cpu mask %x",
-                      opt_affinity );
-#else
-               applog(LOG_DEBUG, "Binding process to cpu mask %x",
-                      opt_affinity );
-#endif
-         }
-      }
-*/
-   }
-
-   if ( !opt_quiet && ( opt_n_threads < num_cpus ) )
-   {
-      char affinity_map[64];
-      format_affinity_map( affinity_map, opt_affinity );
-      applog( LOG_INFO, "CPU affinity [%s]", affinity_map );
+      applog( LOG_WARNING, "More than %d CPUs, CPU affinity is disabled",
+                            max_cpus );
+      opt_affinity = 0ULL;
   }
   
+   if ( opt_affinity )
+   {
+      for ( int thr = 0, cpu = 0; thr < opt_n_threads; thr++, cpu++ )
+      {
+         while ( !( ( opt_affinity >> ( cpu&63 ) ) & 1ULL ) ) cpu++;   
+         thread_affinity_map[ thr ] = cpu % num_cpus;
+      }
+      if ( !opt_quiet )
+      {
+         char affinity_mask[64];
+         format_affinity_mask( affinity_mask, opt_affinity );
+         applog( LOG_INFO, "CPU affinity [%s]", affinity_mask );
+      }
+   }
+    
 #ifdef HAVE_SYSLOG_H
 	if (use_syslog)
 		openlog("cpuminer", LOG_PID, LOG_USER);
@@ -3904,6 +3912,8 @@ int main(int argc, char *argv[])
      if ( opt_debug )
         applog(LOG_INFO,"Creating stratum thread");

+      stratum.new_job = false;  // just to make sure
+
      /* init stratum thread info */
 		stratum_thr_id = opt_n_threads + 2;
 		thr = &thr_info[stratum_thr_id];
@@ -3941,7 +3951,7 @@ int main(int argc, char *argv[])
 			return 1;
 		}
      if ( !opt_quiet )
-         applog( LOG_INFO,"API listnening to %s:%d", opt_api_allow,
+         applog( LOG_INFO,"API listening to %s:%d", opt_api_allow,
                                                     opt_api_listen );
   }

@@ -3970,6 +3980,9 @@ int main(int argc, char *argv[])
   gettimeofday( &last_submit_time, NULL );
   memcpy( &five_min_start, &last_submit_time, sizeof (struct timeval) );
   memcpy( &session_start, &last_submit_time, sizeof (struct timeval) );
+   memcpy( &stratum_keepalive_timer, &last_submit_time, sizeof (struct timeval) );
+   memcpy( &stratum_reset_time, &last_submit_time, sizeof (struct timeval) );
+   memcpy( &total_hashes_time, &last_submit_time, sizeof (struct timeval) );
   pthread_mutex_unlock( &stats_lock );

   applog( LOG_INFO, "%d of %d miner threads started using '%s' algorithm",
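
Review note: the new affinity code in main() walks the bits of opt_affinity and assigns each miner thread the next CPU whose bit is set, wrapping when the threads outnumber the set bits. A minimal standalone sketch of that loop follows. The function name build_affinity_map and its parameters are hypothetical and not part of the patch, and the mask is assumed to be non-zero, as the surrounding `if ( opt_affinity )` guarantees.

```c
#include <stdint.h>

// Hypothetical standalone version of the mapping loop in main().
// mask must be non-zero, otherwise the inner while loop never terminates.
static void build_affinity_map( uint64_t mask, int num_cpus, int n_threads,
                                int *map )
{
   for ( int thr = 0, cpu = 0; thr < n_threads; thr++, cpu++ )
   {
      // skip forward to the next CPU whose bit is set, wrapping at bit 64
      while ( !( ( mask >> ( cpu & 63 ) ) & 1ULL ) ) cpu++;
      map[ thr ] = cpu % num_cpus;
   }
}
```

With mask 0x5, 4 CPUs and 4 threads the map comes out as 0, 2, 0, 2. The wrap happens at bit 64 rather than at num_cpus, so when num_cpus does not divide 64 a wrapped thread can land on a CPU outside the mask (mask 0x5 with 6 CPUs puts the third thread on CPU 4).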
--- a/malloc-huge.c
+++ b/malloc-huge.c
@@ -0,0 +1,36 @@
+#include "malloc-huge.h"
+#include "miner.h"
+
+#define HUGEPAGE_SIZE_2M  (2 * 1024 * 1024)
+
+void *malloc_hugepages( size_t size )
+{
+#if !(defined(MAP_HUGETLB) && defined(MAP_ANON))
+//   applog( LOG_WARNING, "Huge pages not available" );
+   return NULL;
+#else
+
+   if ( size < HUGEPAGE_MIN_ALLOC )
+   {
+//	   applog( LOG_WARNING, "Block too small for huge pages: %zu bytes", size );
+	   return NULL;
+   }
+
+   const size_t hugepage_mask = (size_t)HUGEPAGE_SIZE_2M - 1;
+   void *p = NULL;
+   int flags =
+   #ifdef MAP_NOCORE
+                MAP_NOCORE |
+   #endif
+		          MAP_HUGETLB | MAP_ANON | MAP_PRIVATE;
+
+   // round size up to next page boundary
+   size = ( size + hugepage_mask ) & (~hugepage_mask);
+   
+   p = mmap( NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0 );
+   if ( p == MAP_FAILED )
+      p = NULL;
+   return p;
+#endif
+}
+
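
Review note: malloc_hugepages() rounds the request up to a whole number of 2 MiB pages before calling mmap(). The mask arithmetic can be checked in isolation with a sketch like the one below. The helper name round_up_2m is hypothetical and not part of the patch, and the trick relies on the page size being a power of two.

```c
#include <stddef.h>

#define HUGEPAGE_SIZE_2M  ( (size_t)2 * 1024 * 1024 )

// Round size up to the next multiple of 2 MiB.
// Adding (page - 1) and clearing the low bits works because the page size
// is a power of two.
static size_t round_up_2m( size_t size )
{
   const size_t mask = HUGEPAGE_SIZE_2M - 1;
   return ( size + mask ) & ~mask;
}
```

A size that is already a multiple of 2 MiB is returned unchanged, and one extra byte pushes the result to the next page.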
--- a/malloc-huge.h
+++ b/malloc-huge.h
@@ -0,0 +1,24 @@
+#if !(defined(MALLOC_HUGE__))
+#define MALLOC_HUGE__
+
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+
+#ifdef __unix__
+#include <sys/mman.h>
+#endif
+
+#if defined(MAP_HUGETLB)
+
+// Minimum block size 6 MiB to use huge pages
+#define HUGEPAGE_MIN_ALLOC    (6 * 1024 * 1024)
+
+#endif
+
+// Attempt to allocate memory backed by 2 MiB pages, returns NULL on failure.
+void *malloc_hugepages( size_t size );
+
+#endif
+
--- a/miner.h
+++ b/miner.h
@@ -466,6 +466,7 @@ void stratum_disconnect(struct stratum_ctx *sctx);
 bool stratum_subscribe(struct stratum_ctx *sctx);
 bool stratum_authorize(struct stratum_ctx *sctx, const char *user, const char *pass);
 bool stratum_handle_method(struct stratum_ctx *sctx, const char *s);
+bool stratum_suggest_difficulty( struct stratum_ctx *sctx, double diff );


 extern bool aes_ni_supported;
@@ -823,6 +824,7 @@ Options:\n\
                          qubit         Qubit\n\
                          scrypt        scrypt(1024, 1, 1) (default)\n\
                          scrypt:N      scrypt(N, 1, 1)\n\
+                          scryptn2      scrypt(1048576, 1, 1)\n\
                          sha256d       Double SHA-256\n\
                          sha256q       Quad SHA-256, Pyrite (PYE)\n\
                          sha256t       Triple SHA-256, Onecoin (OC)\n\
@@ -868,9 +870,9 @@ Options:\n\
                          yespowerr16   Yenten (YTN)\n\
                          yespower-b2b  generic yespower + blake2b\n\
                          zr5           Ziftr\n\
-  -N, --param-n         N parameter for scrypt based algos\n\
-  -R, --param-r         R parameter for scrypt based algos\n\
-  -K, --param-key       Key (pers) parameter for algos that use it\n\
+  -N, --param-n=N       N parameter for scrypt based algos\n\
+  -R, --param-r=N       R parameter for scrypt based algos\n\
+  -K, --param-key=STRING  Key (pers) parameter for algos that use it\n\
  -o, --url=URL         URL of mining server\n\
  -O, --userpass=U:P    username:password pair for mining server\n\
  -u, --user=USERNAME   username for mining server\n\
@@ -885,19 +887,19 @@ Options:\n\
  -T, --timeout=N       timeout for long poll and stratum (default: 300 seconds)\n\
  -s, --scantime=N      upper bound on time spent scanning current work when\n\
                          long polling is unavailable, in seconds (default: 5)\n\
-      --randomize       Randomize scan range start to reduce duplicates\n\
-  -f, --diff-factor     Divide req. difficulty by this factor (std is 1.0)\n\
-  -m, --diff-multiplier Multiply difficulty by this factor (std is 1.0)\n\
-      --hash-meter      Display thread hash rates\n\
+      --randomize       randomize scan range (deprecated)\n\
+  -f, --diff-factor=N   divide req. difficulty by this factor (std is 1.0)\n\
+  -m, --diff-multiplier=N multiply difficulty by this factor (std is 1.0)\n\
+      --hash-meter      display thread hash rates\n\
      --coinbase-addr=ADDR  payout address for solo mining\n\
      --coinbase-sig=TEXT  data to insert in the coinbase when possible\n\
      --no-longpoll     disable long polling support\n\
      --no-getwork      disable getwork support\n\
      --no-gbt          disable getblocktemplate support\n\
      --no-stratum      disable X-Stratum support\n\
-      --no-extranonce   disable Stratum extranonce support\n\
+      --no-extranonce   disable Stratum extranonce subscribe\n\
      --no-redirect     ignore requests to change the URL of the mining server\n\
-  -q, --quiet           disable per-thread hashmeter output\n\
+  -q, --quiet           reduce log verbosity\n\
      --no-color        disable colored output\n\
  -D, --debug           enable debug output\n\
  -P, --protocol-dump   verbose dump of protocol-level activities\n"
@@ -909,16 +911,17 @@ Options:\n\
  -B, --background      run the miner in the background\n\
      --benchmark       run in offline benchmark mode\n\
      --cpu-affinity    set process affinity to cpu core(s), mask 0x3 for cores 0 and 1\n\
-      --cpu-priority    set process priority (default: 0 idle, 2 normal to 5 highest)\n\
+      --cpu-priority    set process priority (default: 0 idle, 2 normal to 5 highest) (deprecated)\n\
  -b, --api-bind=address[:port]   IP address for the miner API, default port is 4048)\n\
-      --api-remote      Allow remote control\n\
-      --max-temp=N      Only mine if cpu temp is less than specified value (linux)\n\
-      --max-rate=N[KMG] Only mine if net hashrate is less than specified value\n\
-      --max-diff=N      Only mine if net difficulty is less than specified value\n\
+      --api-remote      allow remote control\n\
+      --max-temp=N      only mine if cpu temp is less than specified value (linux)\n\
+      --max-rate=N[KMG] only mine if net hashrate is less than specified value\n\
+      --max-diff=N      only mine if net difficulty is less than specified value\n\
  -c, --config=FILE     load a JSON-format configuration file\n\
-      --data-file       path and name of data file\n\
+      --data-file=FILE  path and name of data file\n\
      --verify          enable additional time consuming start up tests\n\
-  -V, --version         display version information and exit\n\
+      --stratum-keepalive  prevent disconnects when difficulty is too high\n\
+  -V, --version         display version and CPU information and exit\n\
  -h, --help            display this help text and exit\n\
 ";

@@ -987,6 +990,7 @@ static struct option const options[] = {
        { "userpass", 1, NULL, 'O' },
        { "data-file", 1, NULL, 1027 },
        { "verify", 0, NULL, 1028 },
+        { "stratum-keepalive", 0, NULL, 1029 },
        { "version", 0, NULL, 'V' },
        { 0, 0, 0, 0 }
 };
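
Review note: the --stratum-keepalive option enables two timers in stratum_thread(). One asks the pool for a lower difficulty after a quiet period, and one resets the connection if no share has been submitted for 60 seconds longer than that period. A simplified sketch of the decision logic, in whole seconds instead of struct timeval, is shown below. The function keepalive_check, the KA_* flags and the 180 second timeout used in the usage note are hypothetical; the patch does not show the value of stratum_keepalive_timeout.

```c
#include <stdint.h>

#define KA_SUGGEST_DIFF 1   // ask the pool for a lower difficulty
#define KA_RESET        2   // drop and re-establish the connection

// All times are whole seconds. *keepalive_timer is updated in place,
// mirroring how the patch updates stratum_keepalive_timer.
static int keepalive_check( int64_t now, int64_t last_submit,
                            int64_t reset_time, int64_t *keepalive_timer,
                            int64_t timeout )
{
   int action = 0;

   // a share submitted since the last keepalive restarts the timer
   if ( last_submit > *keepalive_timer )
      *keepalive_timer = last_submit;

   if ( now - *keepalive_timer > timeout )
   {
      *keepalive_timer = now;
      action |= KA_SUGGEST_DIFF;
   }

   // idle time is measured from the later of last share and last reset
   int64_t idle_since = last_submit > reset_time ? last_submit : reset_time;
   if ( now - idle_since > timeout + 60 )
   {
      *keepalive_timer = now;
      action |= KA_RESET;
   }
   return action;
}
```

With a timeout of 180 and all timers starting at 0, a call at t=100 does nothing, a call at t=181 returns KA_SUGGEST_DIFF, and a call at t=241 returns KA_RESET because no share arrived in the meantime.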
--- a/simd-utils/intrlv.h
+++ b/simd-utils/intrlv.h
--- a/simd-utils/simd-128.h
+++ b/simd-utils/simd-128.h
@@ -272,9 +272,19 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )

 #endif

+// Mask making
+
+// Equivalent of AVX512 _mm_movepi64_mask & _mm_movepi32_mask.
+// Returns 2 or 4 bit integer mask from MSB of 64 or 32 bit elements.
+
+#define mm_movmask_64( v ) \
+   _mm_movemask_pd( _mm_castsi128_pd( v ) )
+
+#define mm_movmask_32( v ) \
+   _mm_movemask_ps( _mm_castsi128_ps( v ) )


-// Diagonal blend: d = s3[3], s2[2], s1[1], s0[0] ||
+// Diagonal blend

 // Blend 4 32 bit elements from 4 vectors

@@ -284,7 +294,7 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )
  mm_blend_epi32( _mm_blend_epi32( s3, s2, 0x4 ), \
                  _mm_blend_epi32( s1, s0, 0x1 ), 0x3 )

-#elif defined(__SSE4_1)
+#elif defined(__SSE4_1__)

 #define mm128_diagonal_32( v3, v2, v1, v0 ) \
  mm_blend_epi16( _mm_blend_epi16( s3, s2, 0x30 ), \
@@ -401,6 +411,16 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )
 #define mm128_rol_16( v, c ) \
   _mm_or_si128( _mm_slli_epi16( v, c ), _mm_srli_epi16( v, 16-(c) ) )

+// Limited 2 input shuffle
+#define mm128_shuffle2_64( a, b, c ) \
+   _mm_castpd_si128( _mm_shuffle_pd( _mm_castsi128_pd( a ), \
+                                     _mm_castsi128_pd( b ), c ) )
+
+#define mm128_shuffle2_32( a, b, c ) \
+   _mm_castps_si128( _mm_shuffle_ps( _mm_castsi128_ps( a ), \
+                                     _mm_castsi128_ps( b ), c ) )
+
+
 //
 // Rotate vector elements accross all lanes

@@ -532,9 +552,8 @@ static inline void mm128_block_bswap_32( __m128i *d, const __m128i *s )
 #if defined(__SSSE3__)

 // Function macro with two inputs and one output, inputs are preserved.
-// Returns modified first arg.
 // Two input functions are not available without SSSE3. Use procedure
-// belowe instead.
+// macros below instead.

 #define mm128_shufl2r_64( v1, v2 )     _mm_alignr_epi8( v2, v1, 8 )
 #define mm128_shufl2l_64( v1, v2 )     _mm_alignr_epi8( v1, v2, 8 )
@@ -548,12 +567,11 @@ static inline void mm128_block_bswap_32( __m128i *d, const __m128i *s )
 #define mm128_shufl2r_8( v1, v2 )      _mm_alignr_epi8( v2, v1, 8 )
 #define mm128_shufl2l_8( v1, v2 )      _mm_alignr_epi8( v1, v2, 8 )

-// Procedure macroswith 2 inputs and 2 outputs, inputs are destroyed.
-// Returns both modified args in place.
+// Procedure macros with 2 inputs and 2 outputs, input args are overwritten.

 // These macros retain the vrol/vror name for now to avoid
 // confusion with the shufl2r/shuffle2l function macros above.
-// These may be renamed to something like shufl2r2 for 2 1nputs and
+// These may be renamed to something like shufl2r2 for 2 inputs and
 // 2 outputs, ie SHUFfLe 2 inputs Right with 2 outputs.

 #define mm128_vror256_64( v1, v2 ) \
--- a/simd-utils/simd-256.h
+++ b/simd-utils/simd-256.h
@@ -1,7 +1,7 @@
 #if !defined(SIMD_256_H__)
 #define SIMD_256_H__ 1

-#if defined(__AVX2__)
+//#if defined(__AVX2__)

 /////////////////////////////////////////////////////////////////////
 //
@@ -14,7 +14,9 @@
 // is limited because 256 bit vectors are less likely to be used when 512
 // is available.

-// Used instead if casting.
+#if defined(__AVX__)
+
+// Used instead of casting.
 typedef union
 {
   __m256i m256;
@@ -23,6 +25,28 @@ typedef union
   uint32_t u32[8];
 } __attribute__ ((aligned (32))) m256_ovly;

+//
+// Pointer casting
+
+// p = any aligned pointer
+// returns p as pointer to vector type, not very useful
+#define castp_m256i(p) ((__m256i*)(p))
+
+// p = any aligned pointer
+// returns *p, watch your pointer arithmetic
+#define cast_m256i(p) (*((__m256i*)(p)))
+
+// p = any aligned pointer, i = scaled array index
+// returns value p[i]
+#define casti_m256i(p,i) (((__m256i*)(p))[(i)])
+
+// p = any aligned pointer, o = scaled offset
+// returns pointer p+o
+#define casto_m256i(p,o) (((__m256i*)(p))+(o))
+
+#endif
+#if defined(__AVX2__)
+

 // Move integer to low element of vector, other elements are set to zero.
 #define mm256_mov64_256( i ) _mm256_castsi128_si256( mm128_mov64_128( i ) )
@@ -91,26 +115,6 @@ static inline __m256i mm256_neg1_fn()
 #define mm128_extr_lo128_256( v ) _mm256_castsi256_si128( v )
 #define mm128_extr_hi128_256( v ) _mm256_extracti128_si256( v, 1 )

-//
-// Pointer casting
-
-// p = any aligned pointer
-// returns p as pointer to vector type, not very useful
-#define castp_m256i(p) ((__m256i*)(p))
-
-// p = any aligned pointer
-// returns *p, watch your pointer arithmetic
-#define cast_m256i(p) (*((__m256i*)(p)))
-
-// p = any aligned pointer, i = scaled array index
-// returns value p[i]
-#define casti_m256i(p,i) (((__m256i*)(p))[(i)])
-
-// p = any aligned pointer, o = scaled offset
-// returns pointer p+o
-#define casto_m256i(p,o) (((__m256i*)(p))+(o))
-
-
 //
 // Memory functions
 // n = number of 256 bit (32 byte) vectors
@@ -229,6 +233,18 @@ static inline void memcpy_256( __m256i *dst, const __m256i *src, const int n )

 #endif

+// Mask making
+
+// Equivalent of AVX512 _mm256_movepi64_mask & _mm256_movepi32_mask.
+// Returns 4 or 8 bit integer mask from MSB of 64 or 32 bit elements.
+
+#define mm256_movmask_64( v ) \
+   _mm256_movemask_pd( _mm256_castsi256_pd( v ) )
+
+#define mm256_movmask_32( v ) \
+   _mm256_movemask_ps( _mm256_castsi256_ps( v ) )
+
+
 // Diagonal blending

 // Blend 4 64 bit elements from 4 vectors
@@ -401,6 +417,16 @@ static inline void memcpy_256( __m256i *dst, const __m256i *src, const int n )
 //
 // Rotate elements within each 128 bit lane of 256 bit vector.

+// Limited 2 input shuffle
+#define mm256_shuffle2_64( a, b, c ) \
+   _mm256_castpd_si256( _mm256_shuffle_pd( _mm256_castsi256_pd( a ), \
+                                           _mm256_castsi256_pd( b ), c ) )
+
+#define mm256_shuffle2_32( a, b, c ) \
+   _mm256_castps_si256( _mm256_shuffle_ps( _mm256_castsi256_ps( a ), \
+                                           _mm256_castsi256_ps( b ), c ) )
+
+
 #define mm256_swap128_64( v )  _mm256_shuffle_epi32( v, 0x4e )
 #define mm256_shuflr128_64 mm256_swap128_64
 #define mm256_shufll128_64 mm256_swap128_64
@@ -481,20 +507,6 @@ static inline __m256i mm256_shuflr128_x8( const __m256i v, const int c )
   v2 = _mm256_xor_si256( v1, v2 ); \
   v1 = _mm256_xor_si256( v1, v2 );

-#define mm256_vror512_128( v1, v2 ) \
-do { \
-   __m256i t = _mm256_permute2x128( v1, v2, 0x03 ); \
-   v1 = _mm256_permute2x128( v2, v1, 0x21 ); \
-   v2 = t; \
-} while(0)
-
-#define mm256_vrol512_128( v1, v2 ) \
-do { \
-   __m256i t = _mm256_permute2x128( v1, v2, 0x03 ); \
-   v2 = _mm256_permute2x128( v2, v1, 0x21 ); \
-   v1 = t; \
-} while(0)
-
 #endif // __AVX2__
 #endif // SIMD_256_H__
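
The mask-making comment in simd-256.h describes a 4 or 8 bit integer mask built from the most significant bit of each 64 or 32 bit element. A scalar sketch of that behaviour is given here as a reference for checking the vector versions. The function names are hypothetical and not part of the source, and plain arrays stand in for the __m256i vectors.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model, not the project's code: bit i of the result is the most
// significant bit of element i, with element 0 being the lowest element.
unsigned movmask_64_model( const uint64_t v[4] )
{
   unsigned mask = 0;
   for ( int i = 0; i < 4; i++ )
      mask |= (unsigned)( v[i] >> 63 ) << i;
   return mask;
}

unsigned movmask_32_model( const uint32_t v[8] )
{
   unsigned mask = 0;
   for ( int i = 0; i < 8; i++ )
      mask |= (unsigned)( v[i] >> 31 ) << i;
   return mask;
}

// Small self check with hand computed masks.
void movmask_model_selfcheck( void )
{
   const uint64_t a[4] = { 0x8000000000000000ULL, 1,
                           0xffffffffffffffffULL, 0 };
   const uint32_t b[8] = { 0x80000000u, 0, 0, 0x7fffffffu,
                           0xffffffffu, 0, 0, 0x80000001u };
   assert( movmask_64_model( a ) == 0x5 );    // elements 0 and 2
   assert( movmask_32_model( b ) == 0x91 );   // elements 0, 4 and 7
}
```

The result is a plain integer, so it can be tested or used as a table index directly without converting back to a vector type.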

--- a/simd-utils/simd-512.h
+++ b/simd-utils/simd-512.h
@@ -493,7 +493,7 @@ static inline __m512i mm512_shufll_32( const __m512i v )
 static inline __m512i mm512_shuflr_x64( const __m512i v, const int n )
 { return _mm512_alignr_epi64( v, v, n ); }

-static inline __m512i mm512_shufll_x32( const __m512i v, const int n )
+static inline __m512i mm512_shuflr_x32( const __m512i v, const int n )
 { return _mm512_alignr_epi32( v, v, n ); }

 #define mm512_shuflr_16( v ) \
@@ -535,7 +535,6 @@ static inline __m512i mm512_shufll_x32( const __m512i v, const int n )

 // Rotate 256 bit lanes by one 64 bit element
 #define mm512_shuflr256_64( v )     _mm512_permutex_epi64( v, 0x39 )
-
 #define mm512_shufll256_64( v )     _mm512_permutex_epi64( v, 0x93 )

 // Rotate 256 bit lanes by one 32 bit element
@@ -582,8 +581,17 @@ static inline __m512i mm512_shufll_x32( const __m512i v, const int n )
                     0x0e0d0c0b0a090807, 0x060504030201001f ) )

 //
-// Shuffle-roate elements within 128 bit lanes of 512 bit vector.
+// Shuffle/rotate elements within 128 bit lanes of 512 bit vector.
 
+// Limited 2 input, 1 output shuffle within 128 bit lanes.
+#define mm512_shuffle2_64( a, b, c ) \
+   _mm512_castpd_si512( _mm512_shuffle_pd( _mm512_castsi512_pd( a ), \
+                                           _mm512_castsi512_pd( b ), c ) )
+
+#define mm512_shuffle2_32( a, b, c ) \
+   _mm512_castps_si512( _mm512_shuffle_ps( _mm512_castsi512_ps( a ), \
+                                           _mm512_castsi512_ps( b ), c ) )
+
 // Swap 64 bits in each 128 bit lane
 #define mm512_swap128_64( v )   _mm512_shuffle_epi32( v, 0x4e )
 #define mm512_shuflr128_64  mm512_swap128_64
@@ -611,9 +619,7 @@ static inline __m512i mm512_shuflr128_8( const __m512i v, const int c )
 // shufl2r is 2 input ...
 // Drop macros? They can easily be rebuilt using shufl2 functions

-// add shuflr shufll functions performing rotate, returning first arg
-// They're faster than doing both, when both not needed.
-
+// 2 input, 1 output
 // Shuffle concatenated { v1, v2 } right or left by 256 bits and return
 // rotated v1
 // visually confusing for shufl2r because of arg order. First arg is always
@@ -631,76 +637,5 @@ static inline __m512i mm512_shuflr128_8( const __m512i v, const int c )
 #define mm512_shufl2r_32( v1, v2 )     _mm512_alignr_epi32( v2, v1, 1 )
 #define mm512_shufl2l_32( v1, v2 )     _mm512_alignr_epi32( v1, v2, 1 )

-// Rotate elements from 2 512 bit vectors in place, source arguments
-//  are overwritten.
-
-#define mm512_swap1024_512( v1, v2 ) \
-   v1 = _mm512_xor_si512( v1, v2 ); \
-   v2 = _mm512_xor_si512( v1, v2 ); \
-   v1 = _mm512_xor_si512( v1, v2 );
-#define mm512_shufl2l_512 mm512_swap1024_512 \
-#define mm512_shufl2r_512 mm512_swap1024_512 \
-
-// Deprecated, will be removed. Use shufl2 functions instead. Leave them as is
-// for now.
-//  Rotate elements from 2 512 bit vectors in place, both source arguments
-//  are updated.
-
-#define mm512_vror1024_256( v1, v2 ) \
-do { \
-   __m512i t = _mm512_alignr_epi64( v1, v2, 4 ); \
-   v1 = _mm512_alignr_epi64( v2, v1, 4 ); \
-   v2 = t; \
-} while(0)
-
-#define mm512_vrol1024_256( v1, v2 ) \
-do { \
-   __m512i t = _mm512_alignr_epi64( v1, v2, 4 ); \
-   v2 = _mm512_alignr_epi64( v2, v1, 4 ); \
-   v1 = t; \
-} while(0)
-
-#define mm512_vror1024_128( v1, v2 ) \
-do { \
-   __m512i t = _mm512_alignr_epi64( v1, v2, 2 ); \
-   v1 = _mm512_alignr_epi64( v2, v1, 2 ); \
-   v2 = t; \
-} while(0)
-
-#define mm512_vrol1024_128( v1, v2 ) \
-do { \
-   __m512i t = _mm512_alignr_epi64( v1, v2, 6 ); \
-   v2 = _mm512_alignr_epi64( v2, v1, 6 ); \
-   v1 = t; \
-} while(0)
-
-#define mm512_vror1024_64( v1, v2 ) \
-do { \
-   __m512i t = _mm512_alignr_epi64( v1, v2, 1 ); \
-   v1 = _mm512_alignr_epi64( v2, v1, 1 ); \
-   v2 = t; \
-} while(0)
-
-#define mm512_vrol1024_64( v1, v2 ) \
-do { \
-   __m512i t = _mm512_alignr_epi64( v1, v2, 7 ); \
-   v2 = _mm512_alignr_epi64( v2, v1, 7 ); \
-   v1 = t; \
-} while(0)
-
-#define mm512_vror1024_32( v1, v2 ) \
-do { \
-   __m512i t = _mm512_alignr_epi32( v1, v2, 1 ); \
-   v1 = _mm512_alignr_epi32( v2, v1, 1 ); \
-   v2 = t; \
-} while(0)
-
-#define mm512_vrol1024_32( v1, v2 ) \
-do { \
-   __m512i t = _mm512_alignr_epi32( v1, v2, 15 ); \
-   v2 = _mm512_alignr_epi32( v2, v1, 15 ); \
-   v1 = t; \
-} while(0)
-
 #endif // AVX512
 #endif // SIMD_512_H__
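
The argument order of the alignr-based shuffles in simd-512.h is easy to misread, so a scalar model may help when reviewing the rename to mm512_shuflr_x32. The sketch assumes the documented behaviour of _mm512_alignr_epi32( a, b, n ): concatenate a (high) and b (low), shift right by n 32 bit elements, and keep the low 512 bits. The function names are hypothetical, and plain arrays stand in for __m512i.

```c
#include <assert.h>
#include <stdint.h>

// Model of mm512_shuflr_x32( v, n ), i.e. _mm512_alignr_epi32( v, v, n ):
// element i of the result comes from element ( i + n ) mod 16 of the source.
void shuflr_x32_model( uint32_t dst[16], const uint32_t src[16], int n )
{
   for ( int i = 0; i < 16; i++ )
      dst[i] = src[ ( i + n ) & 15 ];
}

// Model of mm512_shufl2r_32( v1, v2 ), i.e. _mm512_alignr_epi32( v2, v1, 1 ):
// v1 moves down by one element and the lowest element of v2 fills the top.
void shufl2r_32_model( uint32_t dst[16], const uint32_t v1[16],
                       const uint32_t v2[16] )
{
   for ( int i = 0; i < 15; i++ )
      dst[i] = v1[ i + 1 ];
   dst[15] = v2[0];
}

// Self check with element values equal to their index.
void shuflr_model_selfcheck( void )
{
   uint32_t v1[16], v2[16], r[16];
   for ( int i = 0; i < 16; i++ ) { v1[i] = i; v2[i] = 100 + i; }

   shuflr_x32_model( r, v1, 3 );
   assert( r[0] == 3 && r[12] == 15 && r[13] == 0 && r[15] == 2 );

   shufl2r_32_model( r, v1, v2 );
   assert( r[0] == 1 && r[14] == 15 && r[15] == 100 );
}
```

In this model data moves toward element 0, which matches the "right" naming used for the function built on _mm512_alignr_epi32( v, v, n ).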
--- a/simd-utils/simd-int.h
+++ b/simd-utils/simd-int.h
@@ -2,22 +2,21 @@
 #define SIMD_INT_H__ 1

 // Endian byte swap
-#define bswap_64( a ) __builtin_bswap64( a )
-#define bswap_32( a ) __builtin_bswap32( a )
+#define bswap_64    __builtin_bswap64
+#define bswap_32    __builtin_bswap32
+
+// Bit rotation
+#define rol64       __rolq
+#define ror64       __rorq
+#define rol32       __rold
+#define ror32       __rord

 // Safe division, integer or floating point. For floating point it's as  
-// safe as 0. is precisely zero.
-// Returns safe_result if division by zero.
+// safe as 0 is precisely zero.
+// Returns safe_result if division by zero, typically zero.
 #define safe_div( dividend, divisor, safe_result ) \
   ( (divisor) == 0 ? safe_result : ( (dividend) / (divisor) )  )

-// Aliases with familiar names for built in bit rotate instructions
-#define rol64( a, n )   _lrotl( a, n )  
-#define ror64( a, n )   _lrotr( a, n )
-#define rol32( a, n )   _rotl( a, n )
-#define ror32( a, n )   _rotr( a, n )
-#define rol16( a, n )   _rotwl( a, n )
-#define ror16( a, n )   _rotwr( a, n )

 ///////////////////////////////////////
 // 
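
The rol/ror aliases above map to __rolq, __rorq, __rold and __rord, which are assumed here to be the GCC x86 rotate intrinsics made available through x86intrin.h. A portable sketch of the same operations follows for comparison on other compilers or targets. The _portable names are hypothetical and not part of the source.

```c
#include <assert.h>
#include <stdint.h>

// Portable rotates. Masking the count keeps every shift below the type
// width, so a count of 0 is well defined.
static inline uint32_t rol32_portable( uint32_t x, unsigned n )
{  n &= 31;  return ( x << n ) | ( x >> ( ( 32 - n ) & 31 ) );  }

static inline uint32_t ror32_portable( uint32_t x, unsigned n )
{  n &= 31;  return ( x >> n ) | ( x << ( ( 32 - n ) & 31 ) );  }

static inline uint64_t rol64_portable( uint64_t x, unsigned n )
{  n &= 63;  return ( x << n ) | ( x >> ( ( 64 - n ) & 63 ) );  }

static inline uint64_t ror64_portable( uint64_t x, unsigned n )
{  n &= 63;  return ( x >> n ) | ( x << ( ( 64 - n ) & 63 ) );  }

// Self check with hand computed values.
void rotate_selfcheck( void )
{
   assert( rol32_portable( 0x80000001u, 1 ) == 0x00000003u );
   assert( ror32_portable( 0x00000003u, 1 ) == 0x80000001u );
   assert( rol64_portable( 0x8000000000000001ULL, 4 ) == 0x18ULL );
   assert( ror64_portable( 0x18ULL, 4 ) == 0x8000000000000001ULL );
   assert( rol32_portable( 0x12345678u, 0 ) == 0x12345678u );
}
```

Optimizing compilers commonly recognize this pattern and emit a single rotate instruction, though that is not guaranteed for every compiler or optimization level.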
--- a/sysinfos.c
+++ b/sysinfos.c
@@ -209,7 +209,7 @@ static inline void cpu_getname(char *outbuf, size_t maxsz)
 {
   memset(outbuf, 0, maxsz);
 #ifdef WIN32
-   char brand[0xC0] = { 0 };
+   char brand[256] = { 0 };
   int output[4] = { 0 }, ext;
   cpuid(0x80000000, output);
   ext = output[0];
--- a/util.c
+++ b/util.c
@@ -1658,7 +1658,7 @@ static bool stratum_parse_extranonce(struct stratum_ctx *sctx, json_t *params, i
 	pthread_mutex_unlock(&sctx->work_lock);

   if ( !opt_quiet ) /* pool dynamic change */
-      applog( LOG_INFO, "Stratum extranonce1= %s, extranonce2 size= %d",
+      applog( LOG_INFO, "Stratum extranonce1 0x%s, extranonce2 size %d",
         xnonce1, xn2_size);

 	return true;
@@ -1846,6 +1846,25 @@ out:
 	return ret;
 }

+bool stratum_suggest_difficulty( struct stratum_ctx *sctx, double diff )
+{
+   // %f can expand to over 300 characters for a very large double, so size
+   // the buffer for the worst case and bound the write.
+   char s[400];
+   bool rc = true;
+
+   // response is handled separately, what ID?
+   snprintf( s, sizeof s, "{\"id\": 1, \"method\": \"mining.suggest_difficulty\", \"params\": [\"%f\"]}", diff );
+   if ( !stratum_send_line( sctx, s ) )
+   {
+      applog(LOG_WARNING,"stratum.suggest_difficulty send failed");
+      rc = false;
+   }
+   return rc;
+}
+
+
+
 /**
 * Extract bloc height     L H... here len=3, height=0x1333e8
 * "...0000000000ffffffff2703e83313062f503253482f043d61105408"
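
The request line built by stratum_suggest_difficulty has a fixed part of 64 characters plus the %f expansion of diff, so its length depends on the value passed in. A minimal sketch of bounded formatting with truncation detection is shown here. format_suggest_difficulty is a hypothetical helper, not a function in util.c, and it only builds the string without sending it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: writes the request into buf and returns false when
// the output did not fit, in which case buf must not be sent.
bool format_suggest_difficulty( char *buf, size_t size, double diff )
{
   int len = snprintf( buf, size,
      "{\"id\": 1, \"method\": \"mining.suggest_difficulty\", "
      "\"params\": [\"%f\"]}", diff );
   return len >= 0 && (size_t)len < size;
}

// Self check: a normal value fits, a tiny buffer reports truncation.
void format_selfcheck( void )
{
   char big[128], small[16];

   assert( format_suggest_difficulty( big, sizeof big, 1.5 ) );
   assert( strcmp( big, "{\"id\": 1, \"method\": "
                        "\"mining.suggest_difficulty\", "
                        "\"params\": [\"1.500000\"]}" ) == 0 );

   assert( !format_suggest_difficulty( small, sizeof small, 1.5 ) );
}
```

Checking the return value of snprintf matters because a truncated line would be malformed JSON, which a pool is likely to reject.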
--- a/winbuild-cross.sh
+++ b/winbuild-cross.sh
@@ -16,18 +16,18 @@ export MINGW_LIB="/usr/x86_64-w64-mingw32/lib"
 export GCC_MINGW_LIB="/usr/lib/gcc/x86_64-w64-mingw32/9.3-win32"
 # used by GCC
 export LDFLAGS="-L$LOCAL_LIB/curl/lib/.libs -L$LOCAL_LIB/gmp/.libs -L$LOCAL_LIB/openssl"
+# Support for Windows 7 CPU groups, AES sometimes not included in -march
+export DEFAULT_CFLAGS="-maes -O3 -Wall -D_WIN32_WINNT=0x0601"
+export DEFAULT_CFLAGS_OLD="-O3 -Wall"

 # make link to local gmp header file.
 ln -s $LOCAL_LIB/gmp/gmp.h ./gmp.h

-# edit configure to fix pthread lib name for Windows.
-#sed -i 's/"-lpthread"/"-lpthreadGC2"/g' configure.ac
-
 # make release directory and copy selected DLLs.

 rm -rf release > /dev/null
-
 mkdir release
+
 cp README.txt release/
 cp README.md release/
 cp RELEASE_NOTES release/
@@ -41,74 +41,59 @@ cp $LOCAL_LIB/curl/lib/.libs/libcurl-4.dll release/

 # Start building...

-# Icelake AVX512 SHA VAES
+# AVX512 SHA VAES: Intel Core Icelake, Rocketlake
 ./clean-all.sh || echo clean
 rm -f config.status
 ./autogen.sh || echo done
-CFLAGS="-O3 -march=icelake-client -Wall" ./configure $CONFIGURE_ARGS
+CFLAGS="-march=icelake-client $DEFAULT_CFLAGS" ./configure $CONFIGURE_ARGS
 make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-avx512-sha-vaes.exe

-# Rocketlake AVX512 SHA AES
+# AVX512 AES: Intel Core HEDT Skylake-X, Cascadelake
 make clean || echo clean
 rm -f config.status
-CFLAGS="-O3 -march=cascadelake -msha -Wall" ./configure $CONFIGURE_ARGS
-#CFLAGS="-O3 -march=rocketlake -Wall" ./configure $CONFIGURE_ARGS
-make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe release/cpuminer-avx512-sha.exe
-
-# Zen1 AVX2 AES SHA
-make clean || echo clean
-rm -f config.status
-CFLAGS="-O3 -march=znver1 -Wall" ./configure $CONFIGURE_ARGS
-make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe release/cpuminer-zen.exe
-
-# Zen3 AVX2 SHA VAES
-make clean || echo clean
-rm -f config.status
-CFLAGS="-O3 -march=znver2 -mvaes -Wall" ./configure $CONFIGURE_ARGS
-# CFLAGS="-O3 -march=znver3 -Wall" ./configure $CONFIGURE_ARGS
-make -j 8
-strip -s cpuminer.exe
-mv cpuminer.exe release/cpuminer-zen3.exe
-
-# Slylake-X AVX512 AES
-# mingw won't compile avx512 without -fno-asynchronous-unwind-tables
-make clean || echo clean
-rm -f config.status
-CFLAGS="-O3 -march=skylake-avx512 -Wall" ./configure $CONFIGURE_ARGS
-#CFLAGS="-O3 -march=skylake-avx512 -Wall -fno-asynchronous-unwind-tables" ./configure $CONFIGURE_ARGS
+CFLAGS="-march=skylake-avx512 $DEFAULT_CFLAGS" ./configure $CONFIGURE_ARGS
 make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-avx512.exe

-# Haswell AVX2 AES
+# AVX2 SHA VAES: Intel Alderlake, AMD Zen3
+make clean || echo done
+rm -f config.status
+CFLAGS="-mavx2 -msha -mvaes $DEFAULT_CFLAGS" ./configure $CONFIGURE_ARGS
+make -j 8
+strip -s cpuminer.exe
+mv cpuminer.exe release/cpuminer-avx2-sha-vaes.exe
+
+# AVX2 AES SHA: AMD Zen1
 make clean || echo clean
 rm -f config.status
-# GCC 9 doesn't include AES in -march=core-avx2
-CFLAGS="-O3 -march=core-avx2 -maes -Wall" ./configure $CONFIGURE_ARGS
+CFLAGS="-march=znver1 $DEFAULT_CFLAGS" ./configure $CONFIGURE_ARGS
+make -j 8
+strip -s cpuminer.exe
+mv cpuminer.exe release/cpuminer-avx2-sha.exe
+
+# AVX2 AES: Intel Core Haswell, Skylake, Kabylake, Coffeelake, Cometlake
+make clean || echo clean
+rm -f config.status
+CFLAGS="-march=core-avx2 $DEFAULT_CFLAGS" ./configure $CONFIGURE_ARGS
 make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-avx2.exe

-# Sandybridge AVX AES
+# AVX AES: Intel Sandybridge, Ivybridge
 make clean || echo clean
 rm -f config.status
-# -march=corei7-avx still includes aes, but just in case
-CFLAGS="-O3 -march=corei7-avx -maes -Wall" ./configure $CONFIGURE_ARGS 
+CFLAGS="-march=corei7-avx -maes $DEFAULT_CFLAGS_OLD" ./configure $CONFIGURE_ARGS 
 make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-avx.exe

-# Westmere SSE4.2 AES
+# SSE4.2 AES: Intel Westmere
 make clean || echo clean
 rm -f config.status
-CFLAGS="-O3 -march=westmere -maes -Wall" ./configure $CONFIGURE_ARGS
-#CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure $CONFIGURE_ARGS
+CFLAGS="-march=westmere -maes $DEFAULT_CFLAGS_OLD" ./configure $CONFIGURE_ARGS
 make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-aes-sse42.exe
@@ -116,7 +101,7 @@ mv cpuminer.exe release/cpuminer-aes-sse42.exe
 # Nehalem SSE4.2
 #make clean || echo clean
 #rm -f config.status
-#CFLAGS="-O3 -march=corei7 -Wall" ./configure $CONFIGURE_ARGS
+#CFLAGS="$DEFAULT_CFLAGS_OLD -march=corei7" ./configure $CONFIGURE_ARGS
 #make 
 #strip -s cpuminer.exe
 #mv cpuminer.exe release/cpuminer-sse42.exe
@@ -124,7 +109,7 @@ mv cpuminer.exe release/cpuminer-aes-sse42.exe
 # Core2 SSSE3
 #make clean || echo clean
 #rm -f config.status
-#CFLAGS="-O3 -march=core2 -Wall" ./configure $CONFIGURE_ARGS
+#CFLAGS="$DEFAULT_CFLAGS_OLD -march=core2" ./configure $CONFIGURE_ARGS
 #make 
 #strip -s cpuminer.exe
 #mv cpuminer.exe release/cpuminer-ssse3.exe
@@ -133,9 +118,16 @@ mv cpuminer.exe release/cpuminer-aes-sse42.exe
 # Generic SSE2
 make clean || echo clean
 rm -f config.status
-CFLAGS="-O3 -msse2 -Wall" ./configure $CONFIGURE_ARGS
+CFLAGS="-msse2 $DEFAULT_CFLAGS_OLD" ./configure $CONFIGURE_ARGS
 make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-sse2.exe
 make clean || echo clean

+# Native with CPU groups enabled
+make clean || echo clean
+rm -f config.status
+CFLAGS="-march=native $DEFAULT_CFLAGS" ./configure $CONFIGURE_ARGS
+make -j 8
+strip -s cpuminer.exe
+
Author	SHA1	Message	Date
Jay D Dee	5b678d2481	v3.19.6	2022-02-21 23:14:24 -05:00
Jay D Dee	90137b391e	v3.19.5	2022-01-30 20:59:54 -05:00
Jay D Dee	8727d79182	v3.19.4	2022-01-12 21:08:25 -05:00
Jay D Dee	17ccbc328f	v3.19.3	2022-01-07 12:07:38 -05:00
Jay D Dee	0e3945ddb5	v3.19.2	2021-12-30 16:28:24 -05:00
Jay D Dee	7d2ef7973d	v3.19.1	2021-11-20 00:46:01 -05:00
Jay D Dee	e6fd9b1d69	v3.19.0	2021-11-10 21:33:44 -05:00
Jay D Dee	1a234cbe53	v3.18.2	2021-10-19 22:35:36 -04:00
Jay D Dee	47cc5dcff5	v3.18.1	2021-10-10 22:50:19 -04:00