v3.20.3

v3.20.2
v3.20.1
2025-09-17 23:44:27 +00:00 · 2022-10-21 23:12:18 -04:00 · 2022-08-01 20:21:05 -04:00 · 2022-07-26 18:36:40 -04:00 · 2022-07-17 13:30:50 -04:00 · 2022-07-10 11:04:00 -04:00
125 changed files with 7587 additions and 15543 deletions
--- a/79
+++ b/79
@@ -40,7 +40,7 @@ $ mkdir $HOME/usr/lib
   version available in the repositories.

 Download the following source code packages from their respective and
-respected download locations, copy them to ~/usr/lib/ and uncompress them. 
+respected download locations, copy them to $HOME/usr/lib/ and uncompress them. 

 openssl: https://github.com/openssl/openssl/releases

@@ -149,85 +149,10 @@ Copy cpuminer.exe to the release directory, compress and copy the release direct

 Run cpuminer

-In a command windows change directories to the unzipped release folder. to get a list of all options:
+In a command window, change directories to the unzipped release folder. To get a list of all options:

 cpuminer.exe --help

 Command options are specific to where you mine. Refer to the pool's instructions on how to set them.


-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-Create a link to the locally compiled version of gmp.h
-
-$ ln -s $LOCAL_LIB/gmp-version/gmp.h ./gmp.h
-
-Edit configure.ac to fix lipthread package name.
-
-sed -i 's/"-lpthread"/"-lpthreadGC2"/g' configure.ac
-
-
-7. Compile
-
-you can use the default compile if you intend to use cpuminer-opt on the
-same CPU and the virtual machine supports that architecture.
-
-./build.sh
-
-Otherwise you can compile manually while setting options in CFLAGS.
-
-Some common options:
-
-To compile for a specific CPU architecture:
-
-CFLAGS="-O3 -march=znver1 -Wall" ./configure --with-curl
-
-This will compile for AMD Ryzen.
-
-You can compile more generically for a set of specific CPU features
-if you know what features you want:
-
-CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure --with-curl
-
-This will compile for an older CPU that does not have AVX.
-
-You can find several examples in build-allarch.sh
-
-If you have a CPU with more than 64 threads and Windows 7 or higher you
-can enable the CPU Groups feature:
-
-D_WIN32_WINNT==0x0601
-
-Once you have run configure successfully run make with n CPU threads:
-
-make -j n
-
-Copy cpuminer.exe to the release directory, compress and copy the release
-directory to a Windows system and run cpuminer.exe from the command line.
-
-Run cpuminer
-
-In a command windows change directories to the unzipped release folder.
-to get a list of all options:
-
-cpuminer.exe --help
-
-Command options are specific to where you mine. Refer to the pool's
-instructions on how to set them.
--- a/Makefile.am
+++ b/Makefile.am
@@ -21,6 +21,7 @@ cpuminer_SOURCES = \
  api.c \
  sysinfos.c \
  algo-gate-api.c\
+  malloc-huge.c \
  algo/argon2/argon2a/argon2a.c \
  algo/argon2/argon2a/ar2/argon2.c \
  algo/argon2/argon2a/ar2/opt.c \
@@ -204,7 +205,6 @@ cpuminer_SOURCES = \
  algo/verthash/tiny_sha3/sha3.c \
  algo/verthash/tiny_sha3/sha3-4way.c \
  algo/whirlpool/sph_whirlpool.c \
-  algo/whirlpool/whirlpool-hash-4way.c \
  algo/whirlpool/whirlpool-gate.c \
  algo/whirlpool/whirlpool.c \
  algo/whirlpool/whirlpoolx.c \
@@ -284,11 +284,9 @@ cpuminer_SOURCES = \
  algo/x22/x22i-gate.c \
  algo/x22/x25x.c \
  algo/x22/x25x-4way.c \
-  algo/yescrypt/yescrypt.c \
-  algo/yescrypt/yescrypt-best.c \
  algo/yespower/yespower-gate.c \
  algo/yespower/yespower-blake2b.c \
-  algo/yespower/crypto/blake2b-yp.c \
+  algo/yespower/crypto/hmac-blake2b.c \
  algo/yespower/yescrypt-r8g.c \
  algo/yespower/yespower-opt.c

--- a/README.md
+++ b/README.md
@@ -40,17 +40,25 @@ Requirements
 Intel Core2 and newer and AMD equivalents. Further optimizations are available
 on some algorithms for CPUs with AES, AVX, AVX2, SHA, AVX512 and VAES.

-Older CPUs are supported by cpuminer-multi by TPruvot but at reduced
-performance.
+32 bit CPUs are not supported.
+Other CPU architectures such as ARM, Raspberry Pi, RISC-V, Xeon Phi, etc,
+are not supported.

-ARM and Aarch64 CPUs are not supported.
+Mobile CPUs, such as those in laptop computers, are not recommended because
+they aren't designed for the extreme heat of operating at full load for
+extended periods of time.
+
+Older CPUs and ARM architecture may be supported by cpuminer-multi by TPruvot.

 2. 64 bit Linux or Windows OS. Ubuntu and Fedora based distributions,
 including Mint and Centos, are known to work and have all dependencies
 in their repositories. Others may work but may require more effort. Older
 versions such as Centos 6 don't work due to missing features. 
-64 bit Windows OS is supported with mingw_w64 and msys or pre-built binaries.

+Windows 7 or newer is supported with mingw_w64 and msys or using the pre-built
+binaries. WindowsXP 64 bit is YMMV.
+
+FreeBSD is not actively tested but should work, YMMV.
 MacOS, OSx and Android are not supported.

 3. Stratum pool supporting stratum+tcp:// or stratum+ssl:// protocols or
--- a/README.txt
+++ b/README.txt
@@ -1,12 +1,22 @@
 This file is included in the Windows binary package. Compile instructions
 for Linux and Windows can be found in RELEASE_NOTES.

-This package is officially avalable only from:
+cpuminer-opt is open source and free of any fees. Many forks exist that are
+closed source and contain usage fees. Support open source free software.
+
+This package is officially available only from:
+
 https://github.com/JayDDee/cpuminer-opt
+
 No other sources should be trusted.

 cpuminer is a console program that is executed from a DOS or Powershell
-prompt. There is no GUI and no mouse support.
+command prompt. There is no GUI and no mouse support.
+
+New users are encouraged to consult the cpuminer-opt Wiki for detailed
+information on usage:
+
+https://github.com/JayDDee/cpuminer-opt/wiki

 Miner programs are often flagged as malware by antivirus programs. This is
 a false positive, they are flagged simply because they are cryptocurrency 
@@ -18,14 +28,14 @@ error to find the fastest one that works. Pay attention to
 the features listed at cpuminer startup to ensure you are mining at
 optimum speed using the best available features.

-Architecture names and compile options used are only provided for Intel
-Core series. Budget CPUs like Pentium and Celeron are often missing some
-features.
+Architecture names and compile options used are only provided for 
+mainstream desktop CPUs. Budget CPUs like Pentium and Celeron are often
+missing some features. Check your CPU.

-AMD CPUs older than Piledriver, including Athlon x2 and Phenom II x4, are not
-supported by cpuminer-opt due to an incompatible implementation of SSE2 on
-these CPUs. Some algos may crash the miner with an invalid instruction.
-Users are recommended to use an unoptimized miner such as cpuminer-multi.
+Support for AMD CPUs older than Ryzen is incomplete and without specific 
+recommendations. Find the best fit. CPUs older than Piledriver, including
+Athlon x2 and Phenom II x4, are not supported by cpuminer-opt due to an
+incompatible implementation of SSE2 on these CPUs. 

 More information for Intel and AMD CPU architectures and their features
 can be found on Wikipedia.
@@ -34,26 +44,20 @@ https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

 https://en.wikipedia.org/wiki/List_of_AMD_CPU_microarchitectures

+File name                      Architecture name

-Exe file name                Compile flags              Arch name
+cpuminer-sse2.exe              Core2, Nehalem, generic x86_64 with SSE2   
+cpuminer-aes-sse42.exe         Westmere
+cpuminer-avx.exe               Sandybridge, Ivybridge
+cpuminer-avx2.exe              Haswell, Skylake, Kabylake, Coffeelake, Cometlake
+cpuminer-avx2-sha.exe          AMD Zen1, Zen2
+cpuminer-avx2-sha-vaes.exe     Intel Alderlake*, AMD Zen3
+cpuminer-avx512.exe            Intel HEDT Skylake-X, Cascadelake
+cpuminer-avx512-sha-vaes.exe   AMD Zen4, Intel Rocketlake, Icelake

-cpuminer-sse2.exe            "-msse2"                   Core2, Nehalem   
-cpuminer-aes-sse42.exe       "-march=westmere"          Westmere
-cpuminer-avx.exe             "-march=corei7-avx"        Sandybridge, Ivybridge
-cpuminer-avx2.exe            "-march=core-avx2 -maes"   Haswell(1)
-cpuminer-avx512.exe          "-march=skylake-avx512"    Skylake-X, Cascadelake
-cpuminer-avx512-sha.exe      "-march=cascadelake -msha" Rocketlake(2)
-cpuminer-avx512-sha-vaes.exe "-march=icelake-client"    Icelake, Tigerlake(3)
-cpuminer-zen.exe             "-march=znver1"            AMD Zen1, Zen2
-cpuminer-zen3.exe            "-march=znver2 -mvaes"     Zen3(4)
-
-(1) Haswell includes Broadwell, Skylake, Kabylake, Coffeelake & Cometlake. 
-(2) Rocketlake build uses cascadelake+sha as a workaround until Rocketlake
-    compiler support is avalable.
-(3) Icelake & Tigerlake are only available on some laptops. Mining with a
-    laptop is not recommended.
-(4) Zen3 build uses zen2+vaes as a workaround until Zen3 compiler support is
-    available. Zen2 CPUs should use Zen1 build.
+* Alderlake is a hybrid architecture with a mix of E-cores & P-cores. Although
+  the P-cores can support AVX512 the E-cores can't so Intel decided to disable
+  AVX512 on the P-cores.
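
The file name table above can be read as a top-down lookup: take the most specific build whose required features the CPU has. The sketch below illustrates that reading only. The `F_*` feature bits and `pick_build` are hypothetical names introduced here, not cpuminer-opt identifiers, and the priority order is an assumption; the README itself recommends trial and error to find the fastest build that works.

```c
#include <stddef.h>

/* Hypothetical feature bits, for illustration only. */
enum {
    F_SSE2   = 1 << 0,
    F_AES    = 1 << 1,
    F_SSE42  = 1 << 2,
    F_AVX    = 1 << 3,
    F_AVX2   = 1 << 4,
    F_SHA    = 1 << 5,
    F_VAES   = 1 << 6,
    F_AVX512 = 1 << 7
};

/* True when every bit in `need` is set in `f`. */
static int has(unsigned f, unsigned need) { return (f & need) == need; }

/* Returns the most specific matching file name from the table,
   or NULL when not even SSE2 is present. Assumed priority order. */
const char *pick_build(unsigned f)
{
    if (has(f, F_AVX512 | F_SHA | F_VAES)) return "cpuminer-avx512-sha-vaes.exe";
    if (has(f, F_AVX512))                  return "cpuminer-avx512.exe";
    if (has(f, F_AVX2 | F_SHA | F_VAES))   return "cpuminer-avx2-sha-vaes.exe";
    if (has(f, F_AVX2 | F_SHA))            return "cpuminer-avx2-sha.exe";
    if (has(f, F_AVX2))                    return "cpuminer-avx2.exe";
    if (has(f, F_AVX))                     return "cpuminer-avx.exe";
    if (has(f, F_AES | F_SSE42))           return "cpuminer-aes-sse42.exe";
    if (has(f, F_SSE2))                    return "cpuminer-sse2.exe";
    return NULL;
}
```

The feature list printed at cpuminer startup remains the authoritative check that the chosen build is using the best available features.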

 Notes about included DLL files:

@@ -64,10 +68,11 @@ source code obtained from the author's official repository. The exact
 procedure is documented in the build instructions for Windows:
 https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source

-Some DLL filess may already be installed on the system by Windows or third
-party packages. They often will work and may be used instead of the included
-file. Without a compelling reason to do so it's recommended to use the included
-files as they are packaged.
+Some included DLL files may already be installed on the system by Windows or
+third party packages. They often will work and may be used instead of the
+included version of the files.
+
+

 If you like this software feel free to donate:

--- a/124
+++ b/124
@@ -22,7 +22,7 @@ required.
 Compile Instructions
 --------------------

-See INSTALL_LINUX or INSTALL_WINDOWS for compile instruuctions
+See INSTALL_LINUX or INSTALL_WINDOWS for compile instructions

 Requirements
 ------------
@@ -65,7 +65,127 @@ If not what makes it happen or not happen?
 Change Log
 ----------

-v3.8.2
+v3.20.3
+
+Faster c11 algo: AVX512 6%, AVX2 4%, AVX2+VAES 15%.
+Faster AVX2+VAES for anime 14%, hmq1725 6%.
+Small optimizations to Luffa AVX2 & AVX512.
+
+v3.20.2
+
+Bit rotation optimizations to Blake256, Blake512, Blake2b, Blake2s & Lyra2-blake2b for SSE2 & AVX2.
+Removed old unused yescrypt library and other unused code.
+
+v3.20.1
+
+sph_blake2b optimized 1-way SSSE3 & AVX2.
+Removed duplicate Blake2b used by Power2b algo, will now use optimized sph_blake2b.
+Removed imprecise hash & target display from rejected share log.
+Share and target difficulty is now displayed only for low difficulty shares.
+Updated configure.ac to check for AVX512 asm support.
+Small optimization to Lyra2 SSE2.
+
+v3.20.0
+
+#375 Fixed segfault in algos using Groestl VAES due to use of uninitialized data.
+
+v3.19.9
+
+More Blake256, Blake512, Luffa & Cubehash prehash optimizations.
+Relaxed some excessively strict data alignment that was negatively affecting performance.
+
+v3.19.8
+
+#370 "stratum+ssl", in addition to "stratum+tcps", is now recognized as a valid
+url protocol specifier for requesting a secure stratum connection.
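
As a rough illustration of the #370 change, the check for a secure connection now has to accept two scheme prefixes. This is a minimal sketch, not the miner's actual parser: `has_prefix` and `url_wants_tls` are hypothetical names, and the comparison is case-sensitive by assumption.

```c
#include <stdbool.h>
#include <string.h>

/* True when `s` begins with the literal prefix `p`. */
static bool has_prefix(const char *s, const char *p)
{
    return strncmp(s, p, strlen(p)) == 0;
}

/* Hypothetical helper: does the URL request a secure stratum connection?
   Both specifiers named in the change log entry are accepted. */
bool url_wants_tls(const char *url)
{
    return has_prefix(url, "stratum+ssl://")
        || has_prefix(url, "stratum+tcps://");
}
```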
+
+The full url, including the protocol, is now displayed in the stratum connect
+log and the periodic summary log.
+
+Small optimizations to Cubehash, AVX2 & AVX512.
+
+Byte order and prehash optimizations for Blake256 & Blake512, AVX2 & AVX512.
+
+v3.19.7
+
+#369 Fixed time limited mining, --time-limit.
+Fixed a potential compile error when using optimization below -O3.
+
+v3.19.6
+
+#363 Fixed a stratum bug where the first job may be ignored, delaying start of hashing.
+Fixed handling of nonce exhaust when hashing a fast algo with extranonce disabled.
+Small optimization to Shavite.
+
+v3.19.5
+
+Enhanced stratum-keepalive preemptively resets the stratum connection
+before the server does, to avoid lost shares.
+
+Added build-msys2.sh shell script for easier compiling on Windows, see Wiki for details.
+
+X16RT: eliminate unnecessary recalculations of the hash order.
+
+Fix a few compiler warnings.
+
+Fixed log colour error when a block is solved.
+
+v3.19.4
+
+#359: Fix verthash memory allocation for non-hugepages, broken in v3.19.3.
+
+New option stratum-keepalive prevents stratum timeouts when no shares are
+submitted for several minutes due to high difficulty.
+
+Fixed a bug displaying optimizations for some algos.
+
+v3.19.3
+
+Linux: Faster verthash (+25%), scryptn2 (+2%) when huge pages are available.
+
+Small speed up for Hamsi AVX2 & AVX512, Keccak AVX512.
+
+v3.19.2
+
+Fixed log displaying incorrect memory usage for scrypt, broken in v3.19.1.
+
+Reduce log noise when replies to submitted shares are lost due to stratum errors.
+
+Fugue prehash optimization for X16r family AVX2 & AVX512.
+
+Small speed improvement for Hamsi AVX2 & AVX512.
+
+Win: With CPU groups enabled the number of CPUs displayed in the ASCII art
+affinity map is the number of CPUs in a CPU group, was number of CPUs up to 64.
+
+v3.19.1
+
+Changes to Windows binaries package:
+ - builds for CPUs with AVX or lower have CPU groups disabled,
+ - zen3 build renamed to avx2-sha-vaes to support Alderlake as well as Zen3,
+ - zen build renamed to avx2-sha, supports Zen1 & Zen2,
+ - avx512-sha build removed, Rocketlake CPUs can use avx512-sha-vaes,
+ - see README.txt for compatibility details.
+
+Fixed a few compiler warnings that are new in GCC 11.
+Other minor fixes.
+
+v3.19.0
+
+Windows binaries now built with support for CPU groups, requires Windows 7.
+
+Changes to cpu-affinity:
+  - PR#346: Fixed incorrect CPU affinity on Windows built for CPU groups,
+  - added support for CPU affinity for up to 256 threads or CPUs,
+  - streamlined code for more efficient initialization of miner threads,
+  - precise affining of each miner thread to a specific CPU,
+  - added an option to disable CPU affinity with "--cpu-affinity 0"
+
+Faster sha256t with AVX512 & AVX2.
+
+Added stratum error count to stats log, reported only when non-zero.
+
+v3.18.2

 Issue #342, fixed Groestl AES on Windows, broken in v3.18.0.

--- a/algo-gate-api.c
+++ b/algo-gate-api.c
@@ -67,7 +67,6 @@ void do_nothing   () {}
 bool return_true  () { return true;  }
 bool return_false () { return false; }
 void *return_null () { return NULL;  }
-void call_error   () { printf("ERR: Uninitialized function pointer\n"); }

 void algo_not_tested()
 {
@@ -95,7 +94,8 @@ int null_scanhash()
   return 0;
 }

-// Default generic scanhash can be used in many cases.
+// Default generic scanhash can be used in many cases. Not to be used when
+// prehashing can be done or when byte swapping the data can be avoided.
 int scanhash_generic( struct work *work, uint32_t max_nonce,
                      uint64_t *hashes_done, struct thr_info *mythr )
 {
@@ -152,6 +152,9 @@ int scanhash_4way_64in_32out( struct work *work, uint32_t max_nonce,
   const bool bench = opt_benchmark;

   mm256_bswap32_intrlv80_4x64( vdata, pdata );
+   // Overwrite byte swapped nonce with original byte order for proper
+   // incrementing. The nonce only needs to be byte swapped if it is to be
+   // submitted.
   *noncev = mm256_intrlv_blend_32(
                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
   do
@@ -371,15 +374,11 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
    case ALGO_X22I:         rc = register_x22i_algo          ( gate ); break;
    case ALGO_X25X:         rc = register_x25x_algo          ( gate ); break;
    case ALGO_XEVAN:        rc = register_xevan_algo         ( gate ); break;
-    case ALGO_YESCRYPT:     rc = register_yescrypt_05_algo   ( gate ); break;
-//    case ALGO_YESCRYPT:      register_yescrypt_algo      ( gate ); break;
-    case ALGO_YESCRYPTR8:   rc = register_yescryptr8_05_algo ( gate ); break;
-//    case ALGO_YESCRYPTR8:    register_yescryptr8_algo    ( gate ); break;
+    case ALGO_YESCRYPT:     rc = register_yescrypt_algo      ( gate ); break;
+    case ALGO_YESCRYPTR8:   rc = register_yescryptr8_algo    ( gate ); break;
    case ALGO_YESCRYPTR8G:  rc = register_yescryptr8g_algo   ( gate ); break;
-    case ALGO_YESCRYPTR16:  rc = register_yescryptr16_05_algo( gate ); break;
-//    case ALGO_YESCRYPTR16:   register_yescryptr16_algo   ( gate ); break;
-    case ALGO_YESCRYPTR32:  rc = register_yescryptr32_05_algo( gate ); break;
-//    case ALGO_YESCRYPTR32:   register_yescryptr32_algo   ( gate ); break;
+    case ALGO_YESCRYPTR16:  rc = register_yescryptr16_algo   ( gate ); break;
+    case ALGO_YESCRYPTR32:  rc = register_yescryptr32_algo   ( gate ); break;
    case ALGO_YESPOWER:     rc = register_yespower_algo      ( gate ); break;
    case ALGO_YESPOWERR16:  rc = register_yespowerr16_algo   ( gate ); break;
    case ALGO_YESPOWER_B2B: rc = register_yespower_b2b_algo  ( gate ); break;
--- a/algo-gate-api.h
+++ b/algo-gate-api.h
@@ -97,7 +97,6 @@ typedef  uint32_t set_t;
 #define SHA_OPT       0x20   // Zen1, Icelake (sha256)
 #define AVX512_OPT    0x40   // Skylake-X (AVX512[F,VL,DQ,BW])
 #define VAES_OPT      0x80   // Icelake (VAES & AVX512)
-#define VAES256_OPT   0x100  // Zen3 (VAES without AVX512)


 // return set containing all elements from sets a & b
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-portable-x86.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-portable-x86.h
@@ -344,7 +344,7 @@ static size_t
 detect_cpu(void) {
 	//union { uint8_t s[12]; uint32_t i[3]; } vendor_string;
 	//cpu_vendors_x86 vendor = cpu_nobody;
-	x86_regs regs;
+	x86_regs regs; regs.eax = regs.ebx = regs.ecx = 0;
 	uint32_t max_level, max_ext_level;
 	size_t cpu_flags = 0;
 #if defined(X86ASM_AVX) || defined(X86_64ASM_AVX)
@@ -460,4 +460,4 @@ get_top_cpuflag_desc(size_t flag) {
 	#endif
 #endif

-#endif /* defined(CPU_X86) || defined(CPU_X86_64) */
+#endif /* defined(CPU_X86) || defined(CPU_X86_64) */
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix-basic.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix-basic.h
@@ -4,11 +4,12 @@ typedef void (FASTCALL *scrypt_ROMixfn)(scrypt_mix_word_t *X/*[chunkWords]*/, sc
 #endif

 /* romix pre/post nop function */
+/*
 static void asm_calling_convention
 scrypt_romix_nop(scrypt_mix_word_t *blocks, size_t nblocks) {
 	(void)blocks; (void)nblocks;
 }
-
+*/
 /* romix pre/post endian conversion function */
 static void asm_calling_convention
 scrypt_romix_convert_endian(scrypt_mix_word_t *blocks, size_t nblocks) {
--- a/algo/argon2/argon2d/argon2d/opt.c
+++ b/algo/argon2/argon2d/argon2d/opt.c
@@ -37,6 +37,13 @@

 #if defined(__AVX512F__)

+static inline __m512i blamka( __m512i x, __m512i y )
+{
+    __m512i xy = _mm512_mul_epu32( x, y );
+    return _mm512_add_epi64( _mm512_add_epi64( x, y ),
+                             _mm512_add_epi64( xy, xy ) );
+}
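
For readers less familiar with the intrinsics, the added `blamka` helper computes, in each 64-bit lane, `x + y + 2 * lo32(x) * lo32(y)`, because `_mm512_mul_epu32` multiplies only the low 32 bits of each lane. A scalar sketch of one lane follows; `blamka_scalar` is a name made up for this illustration.

```c
#include <stdint.h>

/* One 64-bit lane of the vector helper: the product uses only the low
   32 bits of each operand, and all arithmetic wraps modulo 2^64. */
uint64_t blamka_scalar(uint64_t x, uint64_t y)
{
    uint64_t xy = (x & 0xffffffffULL) * (y & 0xffffffffULL);
    return x + y + 2 * xy;
}
```

The vector version forms `2 * xy` as `xy + xy`, which is the same value modulo 2^64.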
+
 static void fill_block( __m512i *state, const block *ref_block,
                       block *next_block, int with_xor )
 {
--- a/algo/argon2/argon2d/blake2/blamka-round-opt.h
+++ b/algo/argon2/argon2d/blake2/blamka-round-opt.h
@@ -328,9 +328,7 @@ static BLAKE2_INLINE __m128i fBlaMka(__m128i x, __m128i y) {

 #include <immintrin.h>

-#define ROR64(x, n) _mm512_ror_epi64((x), (n))
-
-static __m512i muladd(__m512i x, __m512i y)
+static inline __m512i muladd(__m512i x, __m512i y)
 {
    __m512i z = _mm512_mul_epu32(x, y);
    return _mm512_add_epi64(_mm512_add_epi64(x, y), _mm512_add_epi64(z, z));
@@ -344,8 +342,8 @@ static __m512i muladd(__m512i x, __m512i y)
        D0 = _mm512_xor_si512(D0, A0); \
        D1 = _mm512_xor_si512(D1, A1); \
 \
-        D0 = ROR64(D0, 32); \
-        D1 = ROR64(D1, 32); \
+        D0 = _mm512_ror_epi64(D0, 32); \
+        D1 = _mm512_ror_epi64(D1, 32); \
 \
        C0 = muladd(C0, D0); \
        C1 = muladd(C1, D1); \
@@ -353,8 +351,8 @@ static __m512i muladd(__m512i x, __m512i y)
        B0 = _mm512_xor_si512(B0, C0); \
        B1 = _mm512_xor_si512(B1, C1); \
 \
-        B0 = ROR64(B0, 24); \
-        B1 = ROR64(B1, 24); \
+        B0 = _mm512_ror_epi64(B0, 24); \
+        B1 = _mm512_ror_epi64(B1, 24); \
    } while ((void)0, 0)

 #define G2(A0, B0, C0, D0, A1, B1, C1, D1) \
@@ -365,8 +363,8 @@ static __m512i muladd(__m512i x, __m512i y)
        D0 = _mm512_xor_si512(D0, A0); \
        D1 = _mm512_xor_si512(D1, A1); \
 \
-        D0 = ROR64(D0, 16); \
-        D1 = ROR64(D1, 16); \
+        D0 = _mm512_ror_epi64(D0, 16); \
+        D1 = _mm512_ror_epi64(D1, 16); \
 \
        C0 = muladd(C0, D0); \
        C1 = muladd(C1, D1); \
@@ -374,8 +372,8 @@ static __m512i muladd(__m512i x, __m512i y)
        B0 = _mm512_xor_si512(B0, C0); \
        B1 = _mm512_xor_si512(B1, C1); \
 \
-        B0 = ROR64(B0, 63); \
-        B1 = ROR64(B1, 63); \
+        B0 = _mm512_ror_epi64(B0, 63); \
+        B1 = _mm512_ror_epi64(B1, 63); \
    } while ((void)0, 0)

 #define DIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1) \
@@ -417,11 +415,10 @@ static __m512i muladd(__m512i x, __m512i y)

 #define SWAP_HALVES(A0, A1) \
    do { \
-        __m512i t0, t1; \
-        t0 = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(1, 0, 1, 0)); \
-        t1 = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(3, 2, 3, 2)); \
-        A0 = t0; \
-        A1 = t1; \
+        __m512i t; \
+        t = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(1, 0, 1, 0)); \
+        A1 = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(3, 2, 3, 2)); \
+        A0 = t; \
    } while((void)0, 0)

 #define SWAP_QUARTERS(A0, A1) \
--- a/algo/blake/blake-hash-4way.h
+++ b/algo/blake/blake-hash-4way.h
@@ -49,6 +49,20 @@ extern "C"{

 #define SPH_SIZE_blake512   512

+/////////////////////////
+//
+//  Blake-256 1 way SSE2
+
+void  blake256_transform_le( uint32_t *H, const uint32_t *buf,
+                             const uint32_t T0, const uint32_t T1 );
+
+/////////////////////////
+//
+//  Blake-512 1 way SSE2
+
+void  blake512_transform_le( uint64_t *H, const uint64_t *buf,
+                             const uint64_t T0, const uint64_t T1 );
+
 //////////////////////////
 //
 //   Blake-256 4 way SSE2
@@ -98,6 +112,12 @@ typedef blake_8way_small_context blake256_8way_context;
 void blake256_8way_init(void *cc);
 void blake256_8way_update(void *cc, const void *data, size_t len);
 void blake256_8way_close(void *cc, void *dst);
+void blake256_8way_update_le(void *cc, const void *data, size_t len);
+void blake256_8way_close_le(void *cc, void *dst);
+void blake256_8way_round0_prehash_le( void *midstate, const void *midhash,
+                                       const void *data );
+void blake256_8way_final_rounds_le( void *final_hash, const void *midstate,
+                                     const void *midhash, const void *data );

 // 14 rounds, blake, decred
 typedef blake_8way_small_context blake256r14_8way_context;
@@ -128,6 +148,12 @@ void blake512_4way_update( void *cc, const void *data, size_t len );
 void blake512_4way_close( void *cc, void *dst );
 void blake512_4way_full( blake_4way_big_context *sc, void * dst,
                         const void *data, size_t len );
+void blake512_4way_full_le( blake_4way_big_context *sc, void * dst,
+                            const void *data, size_t len );
+void blake512_4way_prehash_le( blake_4way_big_context *sc, __m256i *midstate,
+                               const void *data );
+void blake512_4way_final_le( blake_4way_big_context *sc, void *hash,
+                             const __m256i nonce, const __m256i *midstate );

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

@@ -148,6 +174,14 @@ typedef blake_16way_small_context blake256_16way_context;
 void blake256_16way_init(void *cc);
 void blake256_16way_update(void *cc, const void *data, size_t len);
 void blake256_16way_close(void *cc, void *dst);
+// Expects data in little endian order, no byte swap needed
+void blake256_16way_update_le(void *cc, const void *data, size_t len);
+void blake256_16way_close_le(void *cc, void *dst);
+void blake256_16way_round0_prehash_le( void *midstate, const void *midhash,
+                                       const void *data );
+void blake256_16way_final_rounds_le( void *final_hash, const void *midstate,
+                                     const void *midhash, const void *data );
+

 // 14 rounds, blake, decred
 typedef blake_16way_small_context blake256r14_16way_context;
@@ -180,7 +214,12 @@ void blake512_8way_update( void *cc, const void *data, size_t len );
 void blake512_8way_close( void *cc, void *dst );
 void blake512_8way_full( blake_8way_big_context *sc, void * dst,
                        const void *data, size_t len );
-void blake512_8way_hash_le80( void *hash, const void *data );
+void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,
+                            const void *data, size_t len );
+void blake512_8way_prehash_le( blake_8way_big_context *sc, __m512i *midstate,
+                               const void *data );
+void blake512_8way_final_le( blake_8way_big_context *sc, void *hash,
+                             const __m512i nonce, const __m512i *midstate );

 #endif  // AVX512
 #endif  // AVX2
--- a/algo/blake/blake256-hash-4way.c
+++ b/algo/blake/blake256-hash-4way.c
--- a/algo/blake/blake2b-hash-4way.c
+++ b/algo/blake/blake2b-hash-4way.c
@@ -52,6 +52,180 @@ static const uint8_t sigma[12][16] =
 };


+#define Z00   0
+#define Z01   1
+#define Z02   2
+#define Z03   3
+#define Z04   4
+#define Z05   5
+#define Z06   6
+#define Z07   7
+#define Z08   8
+#define Z09   9
+#define Z0A   A
+#define Z0B   B
+#define Z0C   C
+#define Z0D   D
+#define Z0E   E
+#define Z0F   F
+
+#define Z10   E
+#define Z11   A
+#define Z12   4
+#define Z13   8
+#define Z14   9
+#define Z15   F
+#define Z16   D
+#define Z17   6
+#define Z18   1
+#define Z19   C
+#define Z1A   0
+#define Z1B   2
+#define Z1C   B
+#define Z1D   7
+#define Z1E   5
+#define Z1F   3
+
+#define Z20   B
+#define Z21   8
+#define Z22   C
+#define Z23   0
+#define Z24   5
+#define Z25   2
+#define Z26   F
+#define Z27   D
+#define Z28   A
+#define Z29   E
+#define Z2A   3
+#define Z2B   6
+#define Z2C   7
+#define Z2D   1
+#define Z2E   9
+#define Z2F   4
+
+#define Z30   7
+#define Z31   9
+#define Z32   3
+#define Z33   1
+#define Z34   D
+#define Z35   C
+#define Z36   B
+#define Z37   E
+#define Z38   2
+#define Z39   6
+#define Z3A   5
+#define Z3B   A
+#define Z3C   4
+#define Z3D   0
+#define Z3E   F
+#define Z3F   8
+
+#define Z40   9
+#define Z41   0
+#define Z42   5
+#define Z43   7
+#define Z44   2
+#define Z45   4
+#define Z46   A
+#define Z47   F
+#define Z48   E
+#define Z49   1
+#define Z4A   B
+#define Z4B   C
+#define Z4C   6
+#define Z4D   8
+#define Z4E   3
+#define Z4F   D
+
+#define Z50   2
+#define Z51   C
+#define Z52   6
+#define Z53   A
+#define Z54   0
+#define Z55   B
+#define Z56   8
+#define Z57   3
+#define Z58   4
+#define Z59   D
+#define Z5A   7
+#define Z5B   5
+#define Z5C   F
+#define Z5D   E
+#define Z5E   1
+#define Z5F   9
+
+#define Z60   C
+#define Z61   5
+#define Z62   1
+#define Z63   F
+#define Z64   E
+#define Z65   D
+#define Z66   4
+#define Z67   A
+#define Z68   0
+#define Z69   7
+#define Z6A   6
+#define Z6B   3
+#define Z6C   9
+#define Z6D   2
+#define Z6E   8
+#define Z6F   B
+
+#define Z70   D
+#define Z71   B
+#define Z72   7
+#define Z73   E
+#define Z74   C
+#define Z75   1
+#define Z76   3
+#define Z77   9
+#define Z78   5
+#define Z79   0
+#define Z7A   F
+#define Z7B   4
+#define Z7C   8
+#define Z7D   6
+#define Z7E   2
+#define Z7F   A
+
+#define Z80   6
+#define Z81   F
+#define Z82   E
+#define Z83   9
+#define Z84   B
+#define Z85   3
+#define Z86   0
+#define Z87   8
+#define Z88   C
+#define Z89   2
+#define Z8A   D
+#define Z8B   7
+#define Z8C   1
+#define Z8D   4
+#define Z8E   A
+#define Z8F   5
+
+#define Z90   A
+#define Z91   2
+#define Z92   8
+#define Z93   4
+#define Z94   7
+#define Z95   6
+#define Z96   1
+#define Z97   5
+#define Z98   F
+#define Z99   B
+#define Z9A   9
+#define Z9B   E
+#define Z9C   3
+#define Z9D   C
+#define Z9E   D
+#define Z9F   0
+
+#define Mx(r, i)    Mx_(Z ## r ## i)
+#define Mx_(n)      Mx__(n)
+#define Mx__(n)     M ## n
+
 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

 #define B2B8W_G(a, b, c, d, x, y) \
@@ -214,11 +388,11 @@ void blake2b_8way_final( blake2b_8way_ctx *ctx, void *out )
 #define B2B_G(a, b, c, d, x, y) \
 { \
   v[a] = _mm256_add_epi64( _mm256_add_epi64( v[a], v[b] ), x ); \
-	v[d] = mm256_ror_64( _mm256_xor_si256( v[d], v[a] ), 32 ); \
+	v[d] = mm256_swap64_32( _mm256_xor_si256( v[d], v[a] ) ); \
 	v[c] = _mm256_add_epi64( v[c], v[d] ); \
-	v[b] = mm256_ror_64( _mm256_xor_si256( v[b], v[c] ), 24 ); \
+	v[b] = mm256_shuflr64_24( _mm256_xor_si256( v[b], v[c] ) ); \
 	v[a] = _mm256_add_epi64( _mm256_add_epi64( v[a], v[b] ), y ); \
-	v[d] = mm256_ror_64( _mm256_xor_si256( v[d], v[a] ), 16 ); \
+	v[d] = mm256_shuflr64_16( _mm256_xor_si256( v[d], v[a] ) ); \
 	v[c] = _mm256_add_epi64( v[c], v[d] ); \
 	v[b] = mm256_ror_64( _mm256_xor_si256( v[b], v[c] ), 63 ); \
 }
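
The rotations replaced in `B2B_G` are by 32, 24 and 16 bits, all whole-byte amounts, so each can be done by moving bytes instead of shifting and OR-ing; the 63-bit rotation is not byte aligned and keeps the generic form. The implementations of `mm256_swap64_32`, `mm256_shuflr64_24` and `mm256_shuflr64_16` are not shown in this diff, so the scalar sketch below only demonstrates the equivalence they presumably rely on. All function names in it are hypothetical.

```c
#include <stdint.h>

/* Generic rotate right, valid for n in 1..63. */
uint64_t ror64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* Rotation by 32 is just an exchange of the two 32-bit halves. */
uint64_t swap_halves(uint64_t x)
{
    return (x << 32) | (x >> 32);
}

/* Rotation right by whole bytes as a byte permutation:
   result byte i is source byte (i + nbytes) mod 8. */
uint64_t ror64_bytes(uint64_t x, unsigned nbytes)
{
    uint64_t r = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint64_t byte = (x >> (8 * ((i + nbytes) % 8))) & 0xff;
        r |= byte << (8 * i);
    }
    return r;
}
```

On SIMD hardware a byte permutation of this kind maps to a single shuffle instruction, which is why byte-aligned rotations are worth special-casing.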
--- a/algo/blake/blake2s-hash-4way.c
+++ b/algo/blake/blake2s-hash-4way.c
@@ -108,11 +108,11 @@ do { \
   uint8_t s0 = sigma0; \
   uint8_t s1 = sigma1; \
   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ s0 ] ); \
-   d = mm128_ror_32( _mm_xor_si128( d, a ), 16 ); \
+   d = mm128_swap32_16( _mm_xor_si128( d, a ) ); \
   c = _mm_add_epi32( c, d ); \
   b = mm128_ror_32( _mm_xor_si128( b, c ), 12 ); \
   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ s1 ] ); \
-   d = mm128_ror_32( _mm_xor_si128( d, a ),  8 ); \
+   d = mm128_shuflr32_8( _mm_xor_si128( d, a ) ); \
   c = _mm_add_epi32( c, d ); \
   b = mm128_ror_32( _mm_xor_si128( b, c ),  7 ); \
 } while(0)
@@ -320,11 +320,11 @@ do { \
   uint8_t s0 = sigma0; \
   uint8_t s1 = sigma1; \
   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), m[ s0 ] ); \
-   d = mm256_ror_32( _mm256_xor_si256( d, a ), 16 ); \
+   d = mm256_swap32_16( _mm256_xor_si256( d, a ) ); \
   c = _mm256_add_epi32( c, d ); \
   b = mm256_ror_32( _mm256_xor_si256( b, c ), 12 ); \
   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), m[ s1 ] ); \
-   d = mm256_ror_32( _mm256_xor_si256( d, a ),  8 ); \
+   d = mm256_shuflr32_8( _mm256_xor_si256( d, a ) ); \
   c = _mm256_add_epi32( c, d ); \
   b = mm256_ror_32( _mm256_xor_si256( b, c ),  7 ); \
 } while(0)
--- a/algo/blake/blake512-hash-4way.c
+++ b/algo/blake/blake512-hash-4way.c
@@ -314,10 +314,11 @@ static const sph_u64 CB[16] = {

 // Blake-512 8 way AVX512

-#define GB_8WAY(m0, m1, c0, c1, a, b, c, d)   do { \
+#define GB_8WAY( m0, m1, c0, c1, a, b, c, d ) \
+{ \
   a = _mm512_add_epi64( _mm512_add_epi64( _mm512_xor_si512( \
                 _mm512_set1_epi64( c1 ), m0 ), b ), a ); \
-   d = mm512_ror_64( _mm512_xor_si512( d, a ), 32 ); \
+   d = mm512_swap64_32( _mm512_xor_si512( d, a ) ); \
   c = _mm512_add_epi64( c, d ); \
   b = mm512_ror_64( _mm512_xor_si512( b, c ), 25 ); \
   a = _mm512_add_epi64( _mm512_add_epi64( _mm512_xor_si512( \
@@ -325,9 +326,10 @@ static const sph_u64 CB[16] = {
   d = mm512_ror_64( _mm512_xor_si512( d, a ), 16 ); \
   c = _mm512_add_epi64( c, d ); \
   b = mm512_ror_64( _mm512_xor_si512( b, c ), 11 ); \
-} while (0)
+}

-#define ROUND_B_8WAY(r)   do { \
+#define ROUND_B_8WAY( r ) \
+{ \
   GB_8WAY(Mx(r, 0), Mx(r, 1), CBx(r, 0), CBx(r, 1), V0, V4, V8, VC); \
   GB_8WAY(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD); \
   GB_8WAY(Mx(r, 4), Mx(r, 5), CBx(r, 4), CBx(r, 5), V2, V6, VA, VE); \
@@ -336,13 +338,13 @@ static const sph_u64 CB[16] = {
   GB_8WAY(Mx(r, A), Mx(r, B), CBx(r, A), CBx(r, B), V1, V6, VB, VC); \
   GB_8WAY(Mx(r, C), Mx(r, D), CBx(r, C), CBx(r, D), V2, V7, V8, VD); \
   GB_8WAY(Mx(r, E), Mx(r, F), CBx(r, E), CBx(r, F), V3, V4, V9, VE); \
-   } while (0)
+}

 #define DECL_STATE64_8WAY \
   __m512i H0, H1, H2, H3, H4, H5, H6, H7; \
   uint64_t T0, T1;

-#define COMPRESS64_8WAY( buf )   do \
+#define COMPRESS64_8WAY( buf ) \
 { \
  __m512i M0, M1, M2, M3, M4, M5, M6, M7; \
  __m512i M8, M9, MA, MB, MC, MD, ME, MF; \
@@ -361,14 +363,10 @@ static const sph_u64 CB[16] = {
  V9 = m512_const1_64( CB1 );  \
  VA = m512_const1_64( CB2 );  \
  VB = m512_const1_64( CB3 );  \
-  VC = _mm512_xor_si512( _mm512_set1_epi64( T0 ), \
-                         m512_const1_64( CB4 ) );  \
-  VD = _mm512_xor_si512( _mm512_set1_epi64( T0 ), \
-                         m512_const1_64( CB5 ) );  \
-  VE = _mm512_xor_si512( _mm512_set1_epi64( T1 ), \
-                         m512_const1_64( CB6 ) );  \
-  VF = _mm512_xor_si512( _mm512_set1_epi64( T1 ), \
-                         m512_const1_64( CB7 ) );  \
+  VC = _mm512_set1_epi64( T0 ^ CB4 ); \
+  VD = _mm512_set1_epi64( T0 ^ CB5 ); \
+  VE = _mm512_set1_epi64( T1 ^ CB6 ); \
+  VF = _mm512_set1_epi64( T1 ^ CB7 ); \
  shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637, \
                                0x28292a2b2c2d2e2f, 0x2021222324252627, \
                                0x18191a1b1c1d1e1f, 0x1011121314151617, \
@@ -413,7 +411,7 @@ static const sph_u64 CB[16] = {
  H5 = mm512_xor3( VD, V5, H5 ); \
  H6 = mm512_xor3( VE, V6, H6 ); \
  H7 = mm512_xor3( VF, V7, H7 ); \
-} while (0)
+}

 void blake512_8way_compress( blake_8way_big_context *sc )
 { 
@@ -435,14 +433,10 @@ void blake512_8way_compress( blake_8way_big_context *sc )
  V9 = m512_const1_64( CB1 );
  VA = m512_const1_64( CB2 );
  VB = m512_const1_64( CB3 );
-  VC = _mm512_xor_si512( _mm512_set1_epi64( sc->T0 ),
-                            m512_const1_64( CB4 ) );
-  VD = _mm512_xor_si512( _mm512_set1_epi64( sc->T0 ),
-                            m512_const1_64( CB5 ) );
-  VE = _mm512_xor_si512( _mm512_set1_epi64( sc->T1 ),
-                            m512_const1_64( CB6 ) );
-  VF = _mm512_xor_si512( _mm512_set1_epi64( sc->T1 ),
-                            m512_const1_64( CB7 ) );
+  VC = _mm512_set1_epi64( sc->T0 ^ CB4 );
+  VD = _mm512_set1_epi64( sc->T0 ^ CB5 );
+  VE = _mm512_set1_epi64( sc->T1 ^ CB6 );
+  VF = _mm512_set1_epi64( sc->T1 ^ CB7 );

  shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637,
                                0x28292a2b2c2d2e2f, 0x2021222324252627,
@@ -493,6 +487,307 @@ void blake512_8way_compress( blake_8way_big_context *sc )
  sc->H[7] = mm512_xor3( VF, V7, sc->H[7] );
 }

+// Temporary: won't be used once prehash is implemented.
+void blake512_8way_compress_le( blake_8way_big_context *sc )
+{
+  __m512i M0, M1, M2, M3, M4, M5, M6, M7;
+  __m512i M8, M9, MA, MB, MC, MD, ME, MF;
+  __m512i V0, V1, V2, V3, V4, V5, V6, V7;
+  __m512i V8, V9, VA, VB, VC, VD, VE, VF;
+
+  V0 = sc->H[0];
+  V1 = sc->H[1];
+  V2 = sc->H[2];
+  V3 = sc->H[3];
+  V4 = sc->H[4];
+  V5 = sc->H[5];
+  V6 = sc->H[6];
+  V7 = sc->H[7];
+  V8 = m512_const1_64( CB0 );
+  V9 = m512_const1_64( CB1 );
+  VA = m512_const1_64( CB2 );
+  VB = m512_const1_64( CB3 );
+  VC = _mm512_set1_epi64( sc->T0 ^ CB4 );
+  VD = _mm512_set1_epi64( sc->T0 ^ CB5 );
+  VE = _mm512_set1_epi64( sc->T1 ^ CB6 );
+  VF = _mm512_set1_epi64( sc->T1 ^ CB7 );
+
+  M0 = sc->buf[ 0];
+  M1 = sc->buf[ 1];
+  M2 = sc->buf[ 2];
+  M3 = sc->buf[ 3];
+  M4 = sc->buf[ 4];
+  M5 = sc->buf[ 5];
+  M6 = sc->buf[ 6];
+  M7 = sc->buf[ 7];
+  M8 = sc->buf[ 8];
+  M9 = sc->buf[ 9];
+  MA = sc->buf[10];
+  MB = sc->buf[11];
+  MC = sc->buf[12];
+  MD = sc->buf[13];
+  ME = sc->buf[14];
+  MF = sc->buf[15];
+
+  ROUND_B_8WAY(0);
+  ROUND_B_8WAY(1);
+  ROUND_B_8WAY(2);
+  ROUND_B_8WAY(3);
+  ROUND_B_8WAY(4);
+  ROUND_B_8WAY(5);
+  ROUND_B_8WAY(6);
+  ROUND_B_8WAY(7);
+  ROUND_B_8WAY(8);
+  ROUND_B_8WAY(9);
+  ROUND_B_8WAY(0);
+  ROUND_B_8WAY(1);
+  ROUND_B_8WAY(2);
+  ROUND_B_8WAY(3);
+  ROUND_B_8WAY(4);
+  ROUND_B_8WAY(5);
+
+  sc->H[0] = mm512_xor3( V8, V0, sc->H[0] );
+  sc->H[1] = mm512_xor3( V9, V1, sc->H[1] );
+  sc->H[2] = mm512_xor3( VA, V2, sc->H[2] );
+  sc->H[3] = mm512_xor3( VB, V3, sc->H[3] );
+  sc->H[4] = mm512_xor3( VC, V4, sc->H[4] );
+  sc->H[5] = mm512_xor3( VD, V5, sc->H[5] );
+  sc->H[6] = mm512_xor3( VE, V6, sc->H[6] );
+  sc->H[7] = mm512_xor3( VF, V7, sc->H[7] );
+}
+
+// Together with final_le, forms a full hash in 2 parts from little-endian data.
+// All variables are hard-coded for 80 bytes per lane.
+void blake512_8way_prehash_le( blake_8way_big_context *sc, __m512i *midstate,
+                               const void *data )
+{
+   __m512i V0, V1, V2, V3, V4, V5, V6, V7;
+   __m512i V8, V9, VA, VB, VC, VD, VE, VF;
+
+   // initial hash
+   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+
+   // fill buffer
+   memcpy_512( sc->buf, (__m512i*)data, 80>>3 );
+   sc->buf[10] = m512_const1_64( 0x8000000000000000ULL );
+   sc->buf[11] = 
+   sc->buf[12] = m512_zero;
+   sc->buf[13] = m512_one_64;
+   sc->buf[14] = m512_zero;
+   sc->buf[15] = m512_const1_64( 80*8 );
+
+   // build working variables
+   V0 = sc->H[0];
+   V1 = sc->H[1];
+   V2 = sc->H[2];
+   V3 = sc->H[3];
+   V4 = sc->H[4];
+   V5 = sc->H[5];
+   V6 = sc->H[6];
+   V7 = sc->H[7];
+   V8 = m512_const1_64( CB0 );
+   V9 = m512_const1_64( CB1 );
+   VA = m512_const1_64( CB2 );
+   VB = m512_const1_64( CB3 );
+   VC = _mm512_set1_epi64( CB4 ^ 0x280ULL );
+   VD = _mm512_set1_epi64( CB5 ^ 0x280ULL );
+   VE = _mm512_set1_epi64( CB6 );
+   VF = _mm512_set1_epi64( CB7 );
+
+   // round 0
+   GB_8WAY( sc->buf[ 0], sc->buf[ 1], CB0, CB1, V0, V4, V8, VC );
+   GB_8WAY( sc->buf[ 2], sc->buf[ 3], CB2, CB3, V1, V5, V9, VD );
+   GB_8WAY( sc->buf[ 4], sc->buf[ 5], CB4, CB5, V2, V6, VA, VE );
+   GB_8WAY( sc->buf[ 6], sc->buf[ 7], CB6, CB7, V3, V7, VB, VF );
+
+   // Do half of G4, skip the nonce
+   // GB_8WAY( sc->buf[ 8], sc->buf[ 9], CBx(0, 8), CBx(0, 9), V0, V5, VA, VF );
+
+   V0 = _mm512_add_epi64( _mm512_add_epi64( _mm512_xor_si512( 
+                       _mm512_set1_epi64( CB9 ), sc->buf[ 8] ), V5 ), V0 ); 
+   VF = mm512_swap64_32( _mm512_xor_si512( VF, V0 ) ); 
+   VA = _mm512_add_epi64( VA, VF ); 
+   V5 = mm512_ror_64( _mm512_xor_si512( V5, VA ), 25 );
+   V0 = _mm512_add_epi64( V0, V5 );
+   
+   GB_8WAY( sc->buf[10], sc->buf[11], CBA, CBB, V1, V6, VB, VC );
+   GB_8WAY( sc->buf[12], sc->buf[13], CBC, CBD, V2, V7, V8, VD );
+   GB_8WAY( sc->buf[14], sc->buf[15], CBE, CBF, V3, V4, V9, VE );
+   
+   // round 1
+   // G1   
+//   GB_8WAY(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD);
+   V1 = _mm512_add_epi64( V1, _mm512_xor_si512( _mm512_set1_epi64( CB8 ),
+           sc->buf[ 4] ) );
+
+   // G2
+//   GB_8WAY(Mx(1, 4), Mx(1, 5), CBx(1, 4), CBx(1, 5), V2, V6, VA, VE);
+   V2 = _mm512_add_epi64( V2, V6 ); 
+
+   // G3
+//   GB_8WAY(Mx(r, 6), Mx(r, 7), CBx(r, 6), CBx(r, 7), V3, V7, VB, VF);
+   V3 = _mm512_add_epi64( V3, _mm512_add_epi64( _mm512_xor_si512(
+                 _mm512_set1_epi64( CB6 ), sc->buf[13] ), V7 ) );
+
+   // save midstate for second part
+   midstate[ 0] = V0;
+   midstate[ 1] = V1;
+   midstate[ 2] = V2;
+   midstate[ 3] = V3;
+   midstate[ 4] = V4;
+   midstate[ 5] = V5;
+   midstate[ 6] = V6;
+   midstate[ 7] = V7;
+   midstate[ 8] = V8;
+   midstate[ 9] = V9;
+   midstate[10] = VA;
+   midstate[11] = VB;
+   midstate[12] = VC;
+   midstate[13] = VD;
+   midstate[14] = VE;
+   midstate[15] = VF;
+} 
+
+// Pick up where prehash_le left off; the nonce is needed now.
+void blake512_8way_final_le( blake_8way_big_context *sc, void *hash,
+                             const __m512i nonce, const __m512i *midstate )
+{
+   __m512i M0, M1, M2, M3, M4, M5, M6, M7;
+   __m512i M8, M9, MA, MB, MC, MD, ME, MF;
+   __m512i V0, V1, V2, V3, V4, V5, V6, V7;
+   __m512i V8, V9, VA, VB, VC, VD, VE, VF;
+   __m512i h[8] __attribute__ ((aligned (64)));
+
+   // Load data with new nonce
+   M0 = sc->buf[ 0];
+   M1 = sc->buf[ 1];
+   M2 = sc->buf[ 2];
+   M3 = sc->buf[ 3];
+   M4 = sc->buf[ 4];
+   M5 = sc->buf[ 5];
+   M6 = sc->buf[ 6];
+   M7 = sc->buf[ 7];
+   M8 = sc->buf[ 8];
+   M9 = nonce;
+   MA = sc->buf[10];
+   MB = sc->buf[11];
+   MC = sc->buf[12];
+   MD = sc->buf[13];
+   ME = sc->buf[14];
+   MF = sc->buf[15];
+
+   V0 = midstate[ 0];
+   V1 = midstate[ 1];
+   V2 = midstate[ 2];
+   V3 = midstate[ 3];
+   V4 = midstate[ 4];
+   V5 = midstate[ 5];
+   V6 = midstate[ 6];
+   V7 = midstate[ 7];
+   V8 = midstate[ 8];
+   V9 = midstate[ 9];
+   VA = midstate[10];
+   VB = midstate[11];
+   VC = midstate[12];
+   VD = midstate[13];
+   VE = midstate[14];
+   VF = midstate[15];
+
+   // finish round 0 with the nonce now available 
+   V0 = _mm512_add_epi64( V0, _mm512_xor_si512(
+                                       _mm512_set1_epi64( CB8 ), M9 ) ); 
+   VF = mm512_ror_64( _mm512_xor_si512( VF, V0 ), 16 );
+   VA = _mm512_add_epi64( VA, VF );
+   V5 = mm512_ror_64( _mm512_xor_si512( V5, VA ), 11 );
+  
+   // Round 1
+   // G0
+   GB_8WAY(Mx(1, 0), Mx(1, 1), CBx(1, 0), CBx(1, 1), V0, V4, V8, VC);
+
+   // G1
+//   GB_8WAY(Mx(1, 2), Mx(1, 3), CBx(1, 2), CBx(1, 3), V1, V5, V9, VD);
+//   V1 = _mm512_add_epi64( V1, _mm512_xor_si512( _mm512_set1_epi64( c1 ), m0 ) );
+
+   V1 = _mm512_add_epi64( V1, V5 );   
+   VD = mm512_swap64_32( _mm512_xor_si512( VD, V1 ) );
+   V9 = _mm512_add_epi64( V9, VD );
+   V5 = mm512_ror_64( _mm512_xor_si512( V5, V9 ), 25 );
+   V1 = _mm512_add_epi64( V1, _mm512_add_epi64( _mm512_xor_si512(
+                 _mm512_set1_epi64( CBx(1,2) ), Mx(1,3) ), V5 ) );   
+   VD = mm512_ror_64( _mm512_xor_si512( VD, V1 ), 16 );
+   V9 = _mm512_add_epi64( V9, VD );
+   V5 = mm512_ror_64( _mm512_xor_si512( V5, V9 ), 11 );
+
+   // G2
+//   GB_8WAY(Mx(1, 4), Mx(1, 5), CBx(1, 4), CBx(1, 5), V2, V6, VA, VE);
+//   V2 = _mm512_add_epi64( V2, V6 );
+   V2 = _mm512_add_epi64( V2, _mm512_xor_si512( 
+                 _mm512_set1_epi64( CBF ), M9 ) );
+   VE = mm512_swap64_32( _mm512_xor_si512( VE, V2 ) );
+   VA = _mm512_add_epi64( VA, VE );
+   V6 = mm512_ror_64( _mm512_xor_si512( V6, VA ), 25 );
+   V2 = _mm512_add_epi64( V2, _mm512_add_epi64( _mm512_xor_si512(
+                 _mm512_set1_epi64( CB9 ), MF ), V6 ) );
+   VE = mm512_ror_64( _mm512_xor_si512( VE, V2 ), 16 );
+   VA = _mm512_add_epi64( VA, VE );
+   V6 = mm512_ror_64( _mm512_xor_si512( V6, VA ), 11 );
+
+   // G3
+//   GB_8WAY(Mx(1, 6), Mx(1, 7), CBx(1, 6), CBx(1, 7), V3, V7, VB, VF);
+//   V3 = _mm512_add_epi64( V3, _mm512_add_epi64( _mm512_xor_si512( 
+//                 _mm512_set1_epi64( CBx(1, 7) ), Mx(1, 6) ), V7 ) ); 
+
+   VF = mm512_swap64_32( _mm512_xor_si512( VF, V3 ) ); 
+   VB = _mm512_add_epi64( VB, VF ); 
+   V7 = mm512_ror_64( _mm512_xor_si512( V7, VB ), 25 );
+   V3 = _mm512_add_epi64( V3, _mm512_add_epi64( _mm512_xor_si512(
+                 _mm512_set1_epi64( CBx(1, 6) ), Mx(1, 7) ), V7 ) ); 
+   VF = mm512_ror_64( _mm512_xor_si512( VF, V3 ), 16 ); 
+   VB = _mm512_add_epi64( VB, VF ); 
+   V7 = mm512_ror_64( _mm512_xor_si512( V7, VB ), 11 );
+
+   // G4, G5, G6, G7
+   GB_8WAY(Mx(1, 8), Mx(1, 9), CBx(1, 8), CBx(1, 9), V0, V5, VA, VF);
+   GB_8WAY(Mx(1, A), Mx(1, B), CBx(1, A), CBx(1, B), V1, V6, VB, VC);
+   GB_8WAY(Mx(1, C), Mx(1, D), CBx(1, C), CBx(1, D), V2, V7, V8, VD);
+   GB_8WAY(Mx(1, E), Mx(1, F), CBx(1, E), CBx(1, F), V3, V4, V9, VE);
+
+   // remaining rounds  
+   ROUND_B_8WAY(2);
+   ROUND_B_8WAY(3);
+   ROUND_B_8WAY(4);
+   ROUND_B_8WAY(5);
+   ROUND_B_8WAY(6);
+   ROUND_B_8WAY(7);
+   ROUND_B_8WAY(8);
+   ROUND_B_8WAY(9);
+   ROUND_B_8WAY(0);
+   ROUND_B_8WAY(1);
+   ROUND_B_8WAY(2);
+   ROUND_B_8WAY(3);
+   ROUND_B_8WAY(4);
+   ROUND_B_8WAY(5);
+
+   h[0] = mm512_xor3( V8, V0, sc->H[0] );
+   h[1] = mm512_xor3( V9, V1, sc->H[1] );
+   h[2] = mm512_xor3( VA, V2, sc->H[2] );
+   h[3] = mm512_xor3( VB, V3, sc->H[3] );
+   h[4] = mm512_xor3( VC, V4, sc->H[4] );
+   h[5] = mm512_xor3( VD, V5, sc->H[5] );
+   h[6] = mm512_xor3( VE, V6, sc->H[6] );
+   h[7] = mm512_xor3( VF, V7, sc->H[7] );
+
+   // bswap final hash
+   mm512_block_bswap_64( (__m512i*)hash, h );
+}
+
 void blake512_8way_init( blake_8way_big_context *sc )
 {
   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
@@ -678,6 +973,73 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
   mm512_block_bswap_64( (__m512i*)dst, sc->H );
 }
   
+void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,
+                        const void *data, size_t len )
+{
+
+// init
+
+   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+
+   sc->T0 = sc->T1 = 0;
+   sc->ptr = 0;
+
+// update
+
+   memcpy_512( sc->buf, (__m512i*)data, len>>3 );
+   sc->ptr = len;
+   if ( len == 128 )
+   {
+      if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
+            sc->T1 = sc->T1 + 1;
+      blake512_8way_compress_le( sc );
+      sc->ptr = 0;
+   }
+
+// close
+
+   size_t ptr64 = sc->ptr >> 3;
+   unsigned bit_len;
+   uint64_t th, tl;
+
+   bit_len = sc->ptr << 3;
+   sc->buf[ptr64] = m512_const1_64( 0x8000000000000000ULL );
+   tl = sc->T0 + bit_len;
+   th = sc->T1;
+
+   if ( ptr64 == 0 )
+   {
+   sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
+   sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
+   }
+   else if ( sc->T0 == 0 )
+   {
+   sc->T0 = 0xFFFFFFFFFFFFFC00ULL + bit_len;
+   sc->T1 = sc->T1 - 1;
+   }
+   else
+      sc->T0 -= 1024 - bit_len;
+
+   memset_zero_512( sc->buf + ptr64 + 1, 13 - ptr64 );
+   sc->buf[13] = m512_one_64;
+   sc->buf[14] = m512_const1_64( th );
+   sc->buf[15] = m512_const1_64( tl );
+
+   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
+       sc->T1 = sc->T1 + 1;
+
+   blake512_8way_compress_le( sc );
+
+   mm512_block_bswap_64( (__m512i*)dst, sc->H );
+}
+
 void
 blake512_8way_update(void *cc, const void *data, size_t len)
 {
@@ -694,20 +1056,22 @@ blake512_8way_close(void *cc, void *dst)

 // Blake-512 4 way

-#define GB_4WAY(m0, m1, c0, c1, a, b, c, d)   do { \
+#define GB_4WAY(m0, m1, c0, c1, a, b, c, d) \
+{ \
   a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
                 _mm256_set1_epi64x( c1 ), m0 ), b ), a ); \
-   d = mm256_ror_64( _mm256_xor_si256( d, a ), 32 ); \
+   d = mm256_swap64_32( _mm256_xor_si256( d, a ) ); \
   c = _mm256_add_epi64( c, d ); \
   b = mm256_ror_64( _mm256_xor_si256( b, c ), 25 ); \
   a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
                 _mm256_set1_epi64x( c0 ), m1 ), b ), a ); \
-   d = mm256_ror_64( _mm256_xor_si256( d, a ), 16 ); \
+   d = mm256_shuflr64_16( _mm256_xor_si256( d, a ) ); \
   c = _mm256_add_epi64( c, d ); \
   b = mm256_ror_64( _mm256_xor_si256( b, c ), 11 ); \
-} while (0)
+}

-#define ROUND_B_4WAY(r)   do { \
+#define ROUND_B_4WAY(r) \
+{ \
 	GB_4WAY(Mx(r, 0), Mx(r, 1), CBx(r, 0), CBx(r, 1), V0, V4, V8, VC); \
 	GB_4WAY(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD); \
 	GB_4WAY(Mx(r, 4), Mx(r, 5), CBx(r, 4), CBx(r, 5), V2, V6, VA, VE); \
@@ -716,13 +1080,13 @@ blake512_8way_close(void *cc, void *dst)
 	GB_4WAY(Mx(r, A), Mx(r, B), CBx(r, A), CBx(r, B), V1, V6, VB, VC); \
 	GB_4WAY(Mx(r, C), Mx(r, D), CBx(r, C), CBx(r, D), V2, V7, V8, VD); \
 	GB_4WAY(Mx(r, E), Mx(r, F), CBx(r, E), CBx(r, F), V3, V4, V9, VE); \
-	} while (0)
+}

 #define DECL_STATE64_4WAY \
 	__m256i H0, H1, H2, H3, H4, H5, H6, H7; \
 	uint64_t T0, T1;

-#define COMPRESS64_4WAY   do \
+#define COMPRESS64_4WAY \
 { \
  __m256i M0, M1, M2, M3, M4, M5, M6, M7; \
  __m256i M8, M9, MA, MB, MC, MD, ME, MF; \
@@ -741,14 +1105,10 @@ blake512_8way_close(void *cc, void *dst)
  V9 = m256_const1_64( CB1 );  \
  VA = m256_const1_64( CB2 );  \
  VB = m256_const1_64( CB3 );  \
-  VC = _mm256_xor_si256( _mm256_set1_epi64x( T0 ), \
-                         m256_const1_64( CB4 ) );  \
-  VD = _mm256_xor_si256( _mm256_set1_epi64x( T0 ), \
-                         m256_const1_64( CB5 ) );  \
-  VE = _mm256_xor_si256( _mm256_set1_epi64x( T1 ), \
-                         m256_const1_64( CB6 ) );  \
-  VF = _mm256_xor_si256( _mm256_set1_epi64x( T1 ), \
-                         m256_const1_64( CB7 ) );  \
+  VC = _mm256_set1_epi64x( T0 ^ CB4 ); \
+  VD = _mm256_set1_epi64x( T0 ^ CB5 ); \
+  VE = _mm256_set1_epi64x( T1 ^ CB6 ); \
+  VF = _mm256_set1_epi64x( T1 ^ CB7 ); \
  shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617, \
                                0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
  M0 = _mm256_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
@@ -791,7 +1151,7 @@ blake512_8way_close(void *cc, void *dst)
  H5 = mm256_xor3( VD, V5, H5 ); \
  H6 = mm256_xor3( VE, V6, H6 ); \
  H7 = mm256_xor3( VF, V7, H7 ); \
-} while (0)
+}


 void blake512_4way_compress( blake_4way_big_context *sc )
@@ -869,6 +1229,221 @@ void blake512_4way_compress( blake_4way_big_context *sc )
  sc->H[7] = mm256_xor3( VF, V7, sc->H[7] );
 }

+void blake512_4way_prehash_le( blake_4way_big_context *sc, __m256i *midstate,
+                               const void *data )
+{
+   __m256i V0, V1, V2, V3, V4, V5, V6, V7;
+   __m256i V8, V9, VA, VB, VC, VD, VE, VF;
+
+   // initial hash
+   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
+   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
+   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
+   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
+   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
+   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
+   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
+   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
+   
+   // fill buffer
+   memcpy_256( sc->buf, (__m256i*)data, 80>>3 );
+   sc->buf[10] = m256_const1_64( 0x8000000000000000ULL );
+   sc->buf[11] = m256_zero;
+   sc->buf[12] = m256_zero;
+   sc->buf[13] = m256_one_64;
+   sc->buf[14] = m256_zero;
+   sc->buf[15] = m256_const1_64( 80*8 );
+
+   // build working variables
+   V0 = sc->H[0];
+   V1 = sc->H[1];
+   V2 = sc->H[2];
+   V3 = sc->H[3];
+   V4 = sc->H[4];
+   V5 = sc->H[5];
+   V6 = sc->H[6];
+   V7 = sc->H[7];
+   V8 = m256_const1_64( CB0 );
+   V9 = m256_const1_64( CB1 );
+   VA = m256_const1_64( CB2 );
+   VB = m256_const1_64( CB3 );
+   VC = _mm256_set1_epi64x( CB4 ^ 0x280ULL );
+   VD = _mm256_set1_epi64x( CB5 ^ 0x280ULL );
+   VE = _mm256_set1_epi64x( CB6 );
+   VF = _mm256_set1_epi64x( CB7 );
+
+   // round 0
+   GB_4WAY( sc->buf[ 0], sc->buf[ 1], CB0, CB1, V0, V4, V8, VC );
+   GB_4WAY( sc->buf[ 2], sc->buf[ 3], CB2, CB3, V1, V5, V9, VD );
+   GB_4WAY( sc->buf[ 4], sc->buf[ 5], CB4, CB5, V2, V6, VA, VE );
+   GB_4WAY( sc->buf[ 6], sc->buf[ 7], CB6, CB7, V3, V7, VB, VF );
+
+   // Do half of G4, skip the nonce
+   V0 = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256(
+                       _mm256_set1_epi64x( CB9 ), sc->buf[ 8] ), V5 ), V0 );
+   VF = mm256_swap64_32( _mm256_xor_si256( VF, V0 ) );
+   VA = _mm256_add_epi64( VA, VF );
+   V5 = mm256_ror_64( _mm256_xor_si256( V5, VA ), 25 );
+   V0 = _mm256_add_epi64( V0, V5 );
+
+   GB_4WAY( sc->buf[10], sc->buf[11], CBA, CBB, V1, V6, VB, VC );
+   GB_4WAY( sc->buf[12], sc->buf[13], CBC, CBD, V2, V7, V8, VD );
+   GB_4WAY( sc->buf[14], sc->buf[15], CBE, CBF, V3, V4, V9, VE );
+
+   // round 1
+   // G1   
+   V1 = _mm256_add_epi64( V1, _mm256_xor_si256( _mm256_set1_epi64x( CB8 ),
+           sc->buf[ 4] ) );
+
+   // G2
+   V2 = _mm256_add_epi64( V2, V6 );
+
+   // G3
+   V3 = _mm256_add_epi64( V3, _mm256_add_epi64( _mm256_xor_si256(
+                 _mm256_set1_epi64x( CB6 ), sc->buf[13] ), V7 ) );
+
+   // save midstate for second part
+   midstate[ 0] = V0;
+   midstate[ 1] = V1;
+   midstate[ 2] = V2;
+   midstate[ 3] = V3;
+   midstate[ 4] = V4;
+   midstate[ 5] = V5;
+   midstate[ 6] = V6;
+   midstate[ 7] = V7;
+   midstate[ 8] = V8;
+   midstate[ 9] = V9;
+   midstate[10] = VA;
+   midstate[11] = VB;
+   midstate[12] = VC;
+   midstate[13] = VD;
+   midstate[14] = VE;
+   midstate[15] = VF;
+}
+
+void blake512_4way_final_le( blake_4way_big_context *sc, void *hash,
+                             const __m256i nonce, const __m256i *midstate )
+{
+   __m256i M0, M1, M2, M3, M4, M5, M6, M7;
+   __m256i M8, M9, MA, MB, MC, MD, ME, MF;
+   __m256i V0, V1, V2, V3, V4, V5, V6, V7;
+   __m256i V8, V9, VA, VB, VC, VD, VE, VF;
+   __m256i h[8] __attribute__ ((aligned (64)));
+
+   // Load data with new nonce
+   M0 = sc->buf[ 0];
+   M1 = sc->buf[ 1];
+   M2 = sc->buf[ 2];
+   M3 = sc->buf[ 3];
+   M4 = sc->buf[ 4];
+   M5 = sc->buf[ 5];
+   M6 = sc->buf[ 6];
+   M7 = sc->buf[ 7];
+   M8 = sc->buf[ 8];
+   M9 = nonce;
+   MA = sc->buf[10];
+   MB = sc->buf[11];
+   MC = sc->buf[12];
+   MD = sc->buf[13];
+   ME = sc->buf[14];
+   MF = sc->buf[15];
+
+   V0 = midstate[ 0];
+   V1 = midstate[ 1];
+   V2 = midstate[ 2];
+   V3 = midstate[ 3];
+   V4 = midstate[ 4];
+   V5 = midstate[ 5];
+   V6 = midstate[ 6];
+   V7 = midstate[ 7];
+   V8 = midstate[ 8];
+   V9 = midstate[ 9];
+   VA = midstate[10];
+   VB = midstate[11];
+   VC = midstate[12];
+   VD = midstate[13];
+   VE = midstate[14];
+   VF = midstate[15];
+
+   // finish round 0, with the nonce now available 
+   V0 = _mm256_add_epi64( V0, _mm256_xor_si256(
+                                       _mm256_set1_epi64x( CB8 ), M9 ) );
+   VF = mm256_shuflr64_16( _mm256_xor_si256( VF, V0 ) );
+   VA = _mm256_add_epi64( VA, VF );
+   V5 = mm256_ror_64( _mm256_xor_si256( V5, VA ), 11 );
+
+   // Round 1
+   // G0
+   GB_4WAY(Mx(1, 0), Mx(1, 1), CBx(1, 0), CBx(1, 1), V0, V4, V8, VC);
+
+   // G1
+   V1 = _mm256_add_epi64( V1, V5 );
+   VD = mm256_swap64_32( _mm256_xor_si256( VD, V1 ) );
+   V9 = _mm256_add_epi64( V9, VD );
+   V5 = mm256_ror_64( _mm256_xor_si256( V5, V9 ), 25 );
+   V1 = _mm256_add_epi64( V1, _mm256_add_epi64( _mm256_xor_si256(
+                 _mm256_set1_epi64x( CBx(1,2) ), Mx(1,3) ), V5 ) );
+   VD = mm256_shuflr64_16( _mm256_xor_si256( VD, V1 ) );
+   V9 = _mm256_add_epi64( V9, VD );
+   V5 = mm256_ror_64( _mm256_xor_si256( V5, V9 ), 11 );
+
+   // G2
+   V2 = _mm256_add_epi64( V2, _mm256_xor_si256(
+                 _mm256_set1_epi64x( CBF ), M9 ) );
+   VE = mm256_swap64_32( _mm256_xor_si256( VE, V2 ) );
+   VA = _mm256_add_epi64( VA, VE );
+   V6 = mm256_ror_64( _mm256_xor_si256( V6, VA ), 25 );
+   V2 = _mm256_add_epi64( V2, _mm256_add_epi64( _mm256_xor_si256(
+                 _mm256_set1_epi64x( CB9 ), MF ), V6 ) );
+   VE = mm256_shuflr64_16( _mm256_xor_si256( VE, V2 ) );
+   VA = _mm256_add_epi64( VA, VE );
+   V6 = mm256_ror_64( _mm256_xor_si256( V6, VA ), 11 );
+
+   // G3
+   VF = mm256_swap64_32( _mm256_xor_si256( VF, V3 ) );
+   VB = _mm256_add_epi64( VB, VF );
+   V7 = mm256_ror_64( _mm256_xor_si256( V7, VB ), 25 );
+   V3 = _mm256_add_epi64( V3, _mm256_add_epi64( _mm256_xor_si256(
+                 _mm256_set1_epi64x( CBx(1, 6) ), Mx(1, 7) ), V7 ) );
+   VF = mm256_shuflr64_16( _mm256_xor_si256( VF, V3 ) );
+   VB = _mm256_add_epi64( VB, VF );
+   V7 = mm256_ror_64( _mm256_xor_si256( V7, VB ), 11 );
+
+   // G4, G5, G6, G7
+   GB_4WAY(Mx(1, 8), Mx(1, 9), CBx(1, 8), CBx(1, 9), V0, V5, VA, VF);
+   GB_4WAY(Mx(1, A), Mx(1, B), CBx(1, A), CBx(1, B), V1, V6, VB, VC);
+   GB_4WAY(Mx(1, C), Mx(1, D), CBx(1, C), CBx(1, D), V2, V7, V8, VD);
+   GB_4WAY(Mx(1, E), Mx(1, F), CBx(1, E), CBx(1, F), V3, V4, V9, VE);
+
+   ROUND_B_4WAY(2);
+   ROUND_B_4WAY(3);
+   ROUND_B_4WAY(4);
+   ROUND_B_4WAY(5);
+   ROUND_B_4WAY(6);
+   ROUND_B_4WAY(7);
+   ROUND_B_4WAY(8);
+   ROUND_B_4WAY(9);
+   ROUND_B_4WAY(0);
+   ROUND_B_4WAY(1);
+   ROUND_B_4WAY(2);
+   ROUND_B_4WAY(3);
+   ROUND_B_4WAY(4);
+   ROUND_B_4WAY(5);
+
+   h[0] = mm256_xor3( V8, V0, sc->H[0] );
+   h[1] = mm256_xor3( V9, V1, sc->H[1] );
+   h[2] = mm256_xor3( VA, V2, sc->H[2] );
+   h[3] = mm256_xor3( VB, V3, sc->H[3] );
+   h[4] = mm256_xor3( VC, V4, sc->H[4] );
+   h[5] = mm256_xor3( VD, V5, sc->H[5] );
+   h[6] = mm256_xor3( VE, V6, sc->H[6] );
+   h[7] = mm256_xor3( VF, V7, sc->H[7] );
+
+   // bswap final hash
+   mm256_block_bswap_64( (__m256i*)hash, h );
+}
+
+
 void blake512_4way_init( blake_4way_big_context *sc )
 {
   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
--- a/algo/blake/decred-gate.c
+++ b/algo/blake/decred-gate.c
@@ -8,7 +8,7 @@ uint32_t *decred_get_nonceptr( uint32_t *work_data )
   return &work_data[ DECRED_NONCE_INDEX ];
 }

-double decred_calc_network_diff( struct work* work )
+long double decred_calc_network_diff( struct work* work )
 {
   // sample for diff 43.281 : 1c05ea29
   // todo: endian reversed on longpoll could be zr5 specific...
@@ -16,7 +16,7 @@ double decred_calc_network_diff( struct work* work )
   uint32_t bits = ( nbits & 0xffffff );
   int16_t shift = ( swab32(nbits) & 0xff ); // 0x1c = 28
   int m;
-   double d = (double)0x0000ffff / (double)bits;
+   long double d = (long double)0x0000ffff / (long double)bits;

   for ( m = shift; m < 29; m++ )
       d *= 256.0;
@@ -25,7 +25,7 @@ double decred_calc_network_diff( struct work* work )
   if ( shift == 28 )
       d *= 256.0; // testnet
   if ( opt_debug_diff )
-       applog( LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", d,
+       applog( LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", (double)d,
                           shift, bits );
   return net_diff;
 }
@@ -70,7 +70,10 @@ void decred_be_build_stratum_request( char *req, struct work *work,
         rpc_user, work->job_id, xnonce2str, ntimestr, noncestr );
   free(xnonce2str);
 }
+
+#if !defined(min)
 #define min(a,b) (a>b ? (b) :(a))
+#endif

 void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
 {
--- a/algo/blake/sph_blake.c
+++ b/algo/blake/sph_blake.c
@@ -630,6 +630,69 @@ static const sph_u64 CB[16] = {
 		H7 ^= S3 ^ V7 ^ VF; \
 	} while (0)

+#define COMPRESS32_LE   do { \
+      sph_u32 M0, M1, M2, M3, M4, M5, M6, M7; \
+      sph_u32 M8, M9, MA, MB, MC, MD, ME, MF; \
+      sph_u32 V0, V1, V2, V3, V4, V5, V6, V7; \
+      sph_u32 V8, V9, VA, VB, VC, VD, VE, VF; \
+      V0 = H0; \
+      V1 = H1; \
+      V2 = H2; \
+      V3 = H3; \
+      V4 = H4; \
+      V5 = H5; \
+      V6 = H6; \
+      V7 = H7; \
+      V8 = S0 ^ CS0; \
+      V9 = S1 ^ CS1; \
+      VA = S2 ^ CS2; \
+      VB = S3 ^ CS3; \
+      VC = T0 ^ CS4; \
+      VD = T0 ^ CS5; \
+      VE = T1 ^ CS6; \
+      VF = T1 ^ CS7; \
+      M0 = *((uint32_t*)(buf +  0)); \
+      M1 = *((uint32_t*)(buf +  4)); \
+      M2 = *((uint32_t*)(buf +  8)); \
+      M3 = *((uint32_t*)(buf + 12)); \
+      M4 = *((uint32_t*)(buf + 16)); \
+      M5 = *((uint32_t*)(buf + 20)); \
+      M6 = *((uint32_t*)(buf + 24)); \
+      M7 = *((uint32_t*)(buf + 28)); \
+      M8 = *((uint32_t*)(buf + 32)); \
+      M9 = *((uint32_t*)(buf + 36)); \
+      MA = *((uint32_t*)(buf + 40)); \
+      MB = *((uint32_t*)(buf + 44)); \
+      MC = *((uint32_t*)(buf + 48)); \
+      MD = *((uint32_t*)(buf + 52)); \
+      ME = *((uint32_t*)(buf + 56)); \
+      MF = *((uint32_t*)(buf + 60)); \
+      ROUND_S(0); \
+      ROUND_S(1); \
+      ROUND_S(2); \
+      ROUND_S(3); \
+      ROUND_S(4); \
+      ROUND_S(5); \
+      ROUND_S(6); \
+      ROUND_S(7); \
+      if (BLAKE32_ROUNDS == 14) { \
+      ROUND_S(8); \
+      ROUND_S(9); \
+      ROUND_S(0); \
+      ROUND_S(1); \
+      ROUND_S(2); \
+      ROUND_S(3); \
+      } \
+      H0 ^= S0 ^ V0 ^ V8; \
+      H1 ^= S1 ^ V1 ^ V9; \
+      H2 ^= S2 ^ V2 ^ VA; \
+      H3 ^= S3 ^ V3 ^ VB; \
+      H4 ^= S0 ^ V4 ^ VC; \
+      H5 ^= S1 ^ V5 ^ VD; \
+      H6 ^= S2 ^ V6 ^ VE; \
+      H7 ^= S3 ^ V7 ^ VF; \
+   } while (0)
+
 #endif

 #if SPH_64
@@ -843,6 +906,45 @@ blake32(sph_blake_small_context *sc, const void *data, size_t len)
 	sc->ptr = ptr;
 }

+static void
+blake32_le(sph_blake_small_context *sc, const void *data, size_t len)
+{
+   unsigned char *buf;
+   size_t ptr;
+   DECL_STATE32
+
+   buf = sc->buf;
+   ptr = sc->ptr;
+
+   if (len < (sizeof sc->buf) - ptr) {
+      memcpy(buf + ptr, data, len);
+      ptr += len;
+      sc->ptr = ptr;
+      return;
+   }
+
+   READ_STATE32(sc);
+   while (len > 0) {
+      size_t clen;
+
+      clen = (sizeof sc->buf) - ptr;
+      if (clen > len)
+         clen = len;
+      memcpy(buf + ptr, data, clen);
+      ptr += clen;
+      data = (const unsigned char *)data + clen;
+      len -= clen;
+      if (ptr == sizeof sc->buf) {
+         if ((T0 = SPH_T32(T0 + 512)) < 512)
+            T1 = SPH_T32(T1 + 1);
+         COMPRESS32_LE;
+         ptr = 0;
+      }
+   }
+   WRITE_STATE32(sc);
+   sc->ptr = ptr;
+}
+
 static void
 blake32_close(sph_blake_small_context *sc,
 	unsigned ub, unsigned n, void *dst, size_t out_size_w32)
@@ -1050,6 +1152,12 @@ sph_blake256(void *cc, const void *data, size_t len)
 	blake32(cc, data, len);
 }

+void
+sph_blake256_update_le(void *cc, const void *data, size_t len)
+{
+   blake32_le(cc, data, len);
+}
+
 /* see sph_blake.h */
 void
 sph_blake256_close(void *cc, void *dst)
--- a/algo/blake/sph_blake.h
+++ b/algo/blake/sph_blake.h
@@ -198,6 +198,7 @@ void sph_blake256_init(void *cc);
 * @param len    the input data length (in bytes)
 */
 void sph_blake256(void *cc, const void *data, size_t len);
+void sph_blake256_update_le(void *cc, const void *data, size_t len);

 /**
 * Terminate the current BLAKE-256 computation and output the result into
--- a/algo/blake/sph_blake2b.c
+++ b/algo/blake/sph_blake2b.c
@@ -30,18 +30,11 @@
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
-
+#include "simd-utils.h"
 #include "algo/sha/sph_types.h"
 #include "sph_blake2b.h"

-// Cyclic right rotation.
-
-#ifndef ROTR64
-#define ROTR64(x, y)  (((x) >> (y)) ^ ((x) << (64 - (y))))
-#endif
-
 // Little-endian byte access.
-
 #define B2B_GET64(p)                            \
 	(((uint64_t) ((uint8_t *) (p))[0]) ^        \
 	(((uint64_t) ((uint8_t *) (p))[1]) << 8) ^  \
@@ -52,47 +45,143 @@
 	(((uint64_t) ((uint8_t *) (p))[6]) << 48) ^ \
 	(((uint64_t) ((uint8_t *) (p))[7]) << 56))

-// G Mixing function.
+#if defined(__AVX2__)

-#define B2B_G(a, b, c, d, x, y) {   \
-	v[a] = v[a] + v[b] + x;         \
-	v[d] = ROTR64(v[d] ^ v[a], 32); \
-	v[c] = v[c] + v[d];             \
-	v[b] = ROTR64(v[b] ^ v[c], 24); \
-	v[a] = v[a] + v[b] + y;         \
-	v[d] = ROTR64(v[d] ^ v[a], 16); \
-	v[c] = v[c] + v[d];             \
-	v[b] = ROTR64(v[b] ^ v[c], 63); }
+#define BLAKE2B_G( Sa, Sb, Sc, Sd, Se, Sf, Sg, Sh ) \
+{ \
+  V[0] = _mm256_add_epi64( V[0], _mm256_add_epi64( V[1], \
+              _mm256_set_epi64x( m[ sigmaR[ Sg ] ], m[ sigmaR[ Se ] ], \
+                                 m[ sigmaR[ Sc ] ], m[ sigmaR[ Sa ] ] ) ) ); \
+  V[3] = mm256_swap64_32( _mm256_xor_si256( V[3], V[0] ) ); \
+  V[2] = _mm256_add_epi64( V[2], V[3] ); \
+  V[1] = mm256_shuflr64_24( _mm256_xor_si256( V[1], V[2] ) ); \
+\
+  V[0] = _mm256_add_epi64( V[0], _mm256_add_epi64( V[1], \
+              _mm256_set_epi64x( m[ sigmaR[ Sh ] ], m[ sigmaR[ Sf ] ], \
+                                 m[ sigmaR[ Sd ] ], m[ sigmaR[ Sb ] ] ) ) ); \
+  V[3] = mm256_shuflr64_16( _mm256_xor_si256( V[3], V[0] ) ); \
+  V[2] = _mm256_add_epi64( V[2], V[3] ); \
+  V[1] = mm256_ror_64( _mm256_xor_si256( V[1], V[2] ), 63 ); \
+}
+
+#define BLAKE2B_ROUND( R ) \
+{ \
+  __m256i *V = (__m256i*)v; \
+  const uint8_t *sigmaR = sigma[R]; \
+  BLAKE2B_G(  0,  1,  2,  3,  4,  5,  6,  7 ); \
+  V[3] = mm256_shufll_64( V[3] ); \
+  V[2] = mm256_swap_128( V[2] ); \
+  V[1] = mm256_shuflr_64( V[1] ); \
+  BLAKE2B_G(  8,  9, 10, 11, 12, 13, 14, 15 ); \
+  V[3] = mm256_shuflr_64( V[3] ); \
+  V[2] = mm256_swap_128( V[2] ); \
+  V[1] = mm256_shufll_64( V[1] ); \
+}
+
+#elif defined(__SSE2__)
+// always true
+
+#define BLAKE2B_G( Va, Vb, Vc, Vd, Sa, Sb, Sc, Sd ) \
+{ \
+   Va = _mm_add_epi64( Va, _mm_add_epi64( Vb, \
+                 _mm_set_epi64x( m[ sigmaR[ Sc ] ], m[ sigmaR[ Sa ] ] ) ) ); \
+   Vd = mm128_swap64_32( _mm_xor_si128( Vd, Va ) ); \
+   Vc = _mm_add_epi64( Vc, Vd ); \
+   Vb = mm128_shuflr64_24( _mm_xor_si128( Vb, Vc ) ); \
+\
+   Va = _mm_add_epi64( Va, _mm_add_epi64( Vb, \
+                 _mm_set_epi64x( m[ sigmaR[ Sd ] ], m[ sigmaR[ Sb ] ] ) ) ); \
+   Vd = mm128_shuflr64_16( _mm_xor_si128( Vd, Va ) ); \
+   Vc = _mm_add_epi64( Vc, Vd ); \
+   Vb = mm128_ror_64( _mm_xor_si128( Vb, Vc ), 63 ); \
+}
+
+#define BLAKE2B_ROUND( R ) \
+{ \
+   __m128i *V = (__m128i*)v; \
+   __m128i V2, V3, V6, V7; \
+   const uint8_t *sigmaR = sigma[R]; \
+   BLAKE2B_G( V[0], V[2], V[4], V[6], 0, 1, 2, 3 ); \
+   BLAKE2B_G( V[1], V[3], V[5], V[7], 4, 5, 6, 7 ); \
+   V2 = mm128_shufl2r_64( V[2], V[3] ); \
+   V3 = mm128_shufl2r_64( V[3], V[2] ); \
+   V6 = mm128_shufl2l_64( V[6], V[7] ); \
+   V7 = mm128_shufl2l_64( V[7], V[6] ); \
+   BLAKE2B_G( V[0], V2, V[5], V6,  8,  9, 10, 11 ); \
+   BLAKE2B_G( V[1], V3, V[4], V7, 12, 13, 14, 15 ); \
+   V[2] = mm128_shufl2l_64( V2, V3 ); \
+   V[3] = mm128_shufl2l_64( V3, V2 ); \
+   V[6] = mm128_shufl2r_64( V6, V7 ); \
+   V[7] = mm128_shufl2r_64( V7, V6 ); \
+}
+
+#else
+// never used, SSE2 is always available
+
+#ifndef ROTR64
+#define ROTR64(x, y)  (((x) >> (y)) ^ ((x) << (64 - (y))))
+#endif
+
+#define BLAKE2B_G( R, Va, Vb, Vc, Vd, Sa, Sb ) \
+{ \
+   Va = Va + Vb + m[ sigma[R][Sa] ]; \
+   Vd = ROTR64( Vd ^ Va, 32 ); \
+   Vc = Vc + Vd; \
+   Vb = ROTR64( Vb ^ Vc, 24 ); \
+\
+   Va = Va + Vb + m[ sigma[R][Sb] ]; \
+   Vd = ROTR64( Vd ^ Va, 16 ); \
+   Vc = Vc + Vd; \
+   Vb = ROTR64( Vb ^ Vc, 63 ); \
+}
+
+#define BLAKE2B_ROUND( R ) \
+{ \
+   BLAKE2B_G( R, v[ 0], v[ 4], v[ 8], v[12],  0,  1 ); \
+   BLAKE2B_G( R, v[ 1], v[ 5], v[ 9], v[13],  2,  3 ); \
+   BLAKE2B_G( R, v[ 2], v[ 6], v[10], v[14],  4,  5 ); \
+   BLAKE2B_G( R, v[ 3], v[ 7], v[11], v[15],  6,  7 ); \
+   BLAKE2B_G( R, v[ 0], v[ 5], v[10], v[15],  8,  9 ); \
+   BLAKE2B_G( R, v[ 1], v[ 6], v[11], v[12], 10, 11 ); \
+   BLAKE2B_G( R, v[ 2], v[ 7], v[ 8], v[13], 12, 13 ); \
+   BLAKE2B_G( R, v[ 3], v[ 4], v[ 9], v[14], 14, 15 ); \
+}
+
+#endif

 // Initialization Vector.

-static const uint64_t blake2b_iv[8] = {
+static const uint64_t blake2b_iv[8] __attribute__ ((aligned (32))) =
+{
 	0x6A09E667F3BCC908, 0xBB67AE8584CAA73B,
 	0x3C6EF372FE94F82B, 0xA54FF53A5F1D36F1,
 	0x510E527FADE682D1, 0x9B05688C2B3E6C1F,
 	0x1F83D9ABFB41BD6B, 0x5BE0CD19137E2179
 };

+static const uint8_t sigma[12][16] __attribute__ ((aligned (32))) =
+{
+      { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+      { 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 },
+      { 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4 },
+      { 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8 },
+      { 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13 },
+      { 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9 },
+      { 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11 },
+      { 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10 },
+      { 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5 },
+      { 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0 },
+      { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+      { 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 }
+};
+
 // Compression function. "last" flag indicates last block.

 static void blake2b_compress( sph_blake2b_ctx *ctx, int last )
 {
-	const uint8_t sigma[12][16] = {
-		{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
-		{ 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 },
-		{ 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4 },
-		{ 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8 },
-		{ 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13 },
-		{ 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9 },
-		{ 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11 },
-		{ 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10 },
-		{ 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5 },
-		{ 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0 },
-		{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
-		{ 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 }
-	};
-	int i;
-	uint64_t v[16], m[16];
+	uint64_t v[16] __attribute__ ((aligned (32)));
+   uint64_t m[16] __attribute__ ((aligned (32)));
+   int i;

 	for (i = 0; i < 8; i++) {           // init work variables
 		v[i] = ctx->h[i];
@@ -106,16 +195,8 @@ static void blake2b_compress( sph_blake2b_ctx *ctx, int last )
 	for (i = 0; i < 16; i++)            // get little-endian words
 		m[i] = B2B_GET64(&ctx->b[8 * i]);

-	for (i = 0; i < 12; i++) {          // twelve rounds
-		B2B_G( 0, 4,  8, 12, m[sigma[i][ 0]], m[sigma[i][ 1]]);
-		B2B_G( 1, 5,  9, 13, m[sigma[i][ 2]], m[sigma[i][ 3]]);
-		B2B_G( 2, 6, 10, 14, m[sigma[i][ 4]], m[sigma[i][ 5]]);
-		B2B_G( 3, 7, 11, 15, m[sigma[i][ 6]], m[sigma[i][ 7]]);
-		B2B_G( 0, 5, 10, 15, m[sigma[i][ 8]], m[sigma[i][ 9]]);
-		B2B_G( 1, 6, 11, 12, m[sigma[i][10]], m[sigma[i][11]]);
-		B2B_G( 2, 7,  8, 13, m[sigma[i][12]], m[sigma[i][13]]);
-		B2B_G( 3, 4,  9, 14, m[sigma[i][14]], m[sigma[i][15]]);
-	}
+	for (i = 0; i < 12; i++)
+      BLAKE2B_ROUND( i );   

 	for( i = 0; i < 8; ++i )
 		ctx->h[i] ^= v[i] ^ v[i + 8];
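Review note: the vector G macros above replace the scalar ROTR64 calls for the 32-, 24- and 16-bit rotations with mm256_swap64_32 / mm256_shuflr64_24 / mm256_shuflr64_16 (and their mm128 counterparts). The scalar sketch below illustrates why that substitution is plausible: a rotation by 32 bits is a swap of 32-bit halves, and a rotation by any multiple of 8 bits is a byte permutation. The helper names here are hypothetical and are not taken from simd-utils.h; the actual definitions of the mm256_* helpers are not shown in this diff.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar 64-bit rotate right, valid for 0 < n < 64. */
static inline uint64_t rotr64( uint64_t x, unsigned n )
{
   return ( x >> n ) | ( x << ( 64 - n ) );
}

/* Rotation by 32 bits expressed as a swap of the two 32-bit halves. */
static inline uint64_t swap64_32( uint64_t x )
{
   return ( x >> 32 ) | ( x << 32 );
}

/* Rotation right by nbytes * 8 bits expressed as a byte permutation:
   byte i of the result is byte (i + nbytes) mod 8 of the input,
   counting bytes from the least significant end. */
static inline uint64_t rotr64_bytes( uint64_t x, unsigned nbytes )
{
   uint64_t r = 0;
   for ( unsigned i = 0; i < 8; i++ )
   {
      uint64_t byte = ( x >> ( 8 * ( ( i + nbytes ) & 7 ) ) ) & 0xff;
      r |= byte << ( 8 * i );
   }
   return r;
}

/* Hypothetical self-check: the shuffle forms match the generic rotate. */
static void rotation_self_check( void )
{
   const uint64_t x = 0x0123456789ABCDEFULL;
   assert( swap64_32( x )       == rotr64( x, 32 ) );
   assert( rotr64_bytes( x, 3 ) == rotr64( x, 24 ) );
   assert( rotr64_bytes( x, 2 ) == rotr64( x, 16 ) );
}
```

The 63-bit rotation is not a whole number of bytes, which is consistent with the macros keeping a generic rotate (mm256_ror_64 / mm128_ror_64) for that step.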
--- a/algo/bmw/bmw512-hash-4way.c
+++ b/algo/bmw/bmw512-hash-4way.c
@@ -594,9 +594,6 @@ void bmw512_2way_close( bmw_2way_big_context *ctx, void *dst )
 #define rb6(x)    mm256_rol_64( x, 43 ) 
 #define rb7(x)    mm256_rol_64( x, 53 ) 

-#define rol_off_64( M, j ) \
-   mm256_rol_64( M[ (j) & 0xF ], ( (j) & 0xF ) + 1 )
-
 #define add_elt_b( mj0, mj3, mj10, h, K ) \
  _mm256_xor_si256( h, _mm256_add_epi64( K, \
              _mm256_sub_epi64( _mm256_add_epi64( mj0, mj3 ), mj10 ) ) )
@@ -732,41 +729,58 @@ void compress_big( const __m256i *M, const __m256i H[16], __m256i dH[16] )
   qt[15] = _mm256_add_epi64( sb0( Wb15), H[ 0] ); 

   __m256i mj[16];
-   for ( i = 0; i < 16; i++ )
-      mj[i] = rol_off_64( M, i );

-   qt[16] = add_elt_b( mj[ 0], mj[ 3], mj[10], H[ 7],
-              (const __m256i)_mm256_set1_epi64x( 16 * 0x0555555555555555ULL ) );
-   qt[17] = add_elt_b( mj[ 1], mj[ 4], mj[11], H[ 8],
-              (const __m256i)_mm256_set1_epi64x( 17 * 0x0555555555555555ULL ) );
-   qt[18] = add_elt_b( mj[ 2], mj[ 5], mj[12], H[ 9],
-              (const __m256i)_mm256_set1_epi64x( 18 * 0x0555555555555555ULL ) );
-   qt[19] = add_elt_b( mj[ 3], mj[ 6], mj[13], H[10],
-              (const __m256i)_mm256_set1_epi64x( 19 * 0x0555555555555555ULL ) );
-   qt[20] = add_elt_b( mj[ 4], mj[ 7], mj[14], H[11],
-              (const __m256i)_mm256_set1_epi64x( 20 * 0x0555555555555555ULL ) );
-   qt[21] = add_elt_b( mj[ 5], mj[ 8], mj[15], H[12],
-              (const __m256i)_mm256_set1_epi64x( 21 * 0x0555555555555555ULL ) );
-   qt[22] = add_elt_b( mj[ 6], mj[ 9], mj[ 0], H[13],
-              (const __m256i)_mm256_set1_epi64x( 22 * 0x0555555555555555ULL ) );
-   qt[23] = add_elt_b( mj[ 7], mj[10], mj[ 1], H[14],
-              (const __m256i)_mm256_set1_epi64x( 23 * 0x0555555555555555ULL ) );
-   qt[24] = add_elt_b( mj[ 8], mj[11], mj[ 2], H[15],
-              (const __m256i)_mm256_set1_epi64x( 24 * 0x0555555555555555ULL ) );
-   qt[25] = add_elt_b( mj[ 9], mj[12], mj[ 3], H[ 0],
-              (const __m256i)_mm256_set1_epi64x( 25 * 0x0555555555555555ULL ) );
-   qt[26] = add_elt_b( mj[10], mj[13], mj[ 4], H[ 1],
-              (const __m256i)_mm256_set1_epi64x( 26 * 0x0555555555555555ULL ) );
-   qt[27] = add_elt_b( mj[11], mj[14], mj[ 5], H[ 2],
-              (const __m256i)_mm256_set1_epi64x( 27 * 0x0555555555555555ULL ) );
-   qt[28] = add_elt_b( mj[12], mj[15], mj[ 6], H[ 3],
-              (const __m256i)_mm256_set1_epi64x( 28 * 0x0555555555555555ULL ) );
-   qt[29] = add_elt_b( mj[13], mj[ 0], mj[ 7], H[ 4],
-              (const __m256i)_mm256_set1_epi64x( 29 * 0x0555555555555555ULL ) );
-   qt[30] = add_elt_b( mj[14], mj[ 1], mj[ 8], H[ 5],
-              (const __m256i)_mm256_set1_epi64x( 30 * 0x0555555555555555ULL ) );
-   qt[31] = add_elt_b( mj[15], mj[ 2], mj[ 9], H[ 6],
-              (const __m256i)_mm256_set1_epi64x( 31 * 0x0555555555555555ULL ) );
+   mj[ 0] = mm256_rol_64( M[ 0],  1 );
+   mj[ 1] = mm256_rol_64( M[ 1],  2 );
+   mj[ 2] = mm256_rol_64( M[ 2],  3 );
+   mj[ 3] = mm256_rol_64( M[ 3],  4 );
+   mj[ 4] = mm256_rol_64( M[ 4],  5 );
+   mj[ 5] = mm256_rol_64( M[ 5],  6 );
+   mj[ 6] = mm256_rol_64( M[ 6],  7 );
+   mj[ 7] = mm256_rol_64( M[ 7],  8 );
+   mj[ 8] = mm256_rol_64( M[ 8],  9 );
+   mj[ 9] = mm256_rol_64( M[ 9], 10 );
+   mj[10] = mm256_rol_64( M[10], 11 );
+   mj[11] = mm256_rol_64( M[11], 12 );
+   mj[12] = mm256_rol_64( M[12], 13 );
+   mj[13] = mm256_rol_64( M[13], 14 );
+   mj[14] = mm256_rol_64( M[14], 15 );
+   mj[15] = mm256_rol_64( M[15], 16 );
+
+   __m256i K = _mm256_set1_epi64x( 16 * 0x0555555555555555ULL );
+   const __m256i Kincr = _mm256_set1_epi64x( 0x0555555555555555ULL );
+
+   qt[16] = add_elt_b( mj[ 0], mj[ 3], mj[10], H[ 7], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[17] = add_elt_b( mj[ 1], mj[ 4], mj[11], H[ 8], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[18] = add_elt_b( mj[ 2], mj[ 5], mj[12], H[ 9], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[19] = add_elt_b( mj[ 3], mj[ 6], mj[13], H[10], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[20] = add_elt_b( mj[ 4], mj[ 7], mj[14], H[11], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[21] = add_elt_b( mj[ 5], mj[ 8], mj[15], H[12], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[22] = add_elt_b( mj[ 6], mj[ 9], mj[ 0], H[13], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[23] = add_elt_b( mj[ 7], mj[10], mj[ 1], H[14], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[24] = add_elt_b( mj[ 8], mj[11], mj[ 2], H[15], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[25] = add_elt_b( mj[ 9], mj[12], mj[ 3], H[ 0], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[26] = add_elt_b( mj[10], mj[13], mj[ 4], H[ 1], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[27] = add_elt_b( mj[11], mj[14], mj[ 5], H[ 2], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[28] = add_elt_b( mj[12], mj[15], mj[ 6], H[ 3], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[29] = add_elt_b( mj[13], mj[ 0], mj[ 7], H[ 4], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[30] = add_elt_b( mj[14], mj[ 1], mj[ 8], H[ 5], K );
+   K = _mm256_add_epi64( K, Kincr );
+   qt[31] = add_elt_b( mj[15], mj[ 2], mj[ 9], H[ 6], K );

   qt[16] = _mm256_add_epi64( qt[16], expand1_b( qt, 16 ) );
   qt[17] = _mm256_add_epi64( qt[17], expand1_b( qt, 17 ) );
@@ -1034,9 +1048,6 @@ bmw512_4way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
 #define r8b6(x)    mm512_rol_64( x, 43 )
 #define r8b7(x)    mm512_rol_64( x, 53 )

-#define rol8w_off_64( M, j ) \
-   mm512_rol_64( M[ (j) & 0xF ], ( (j) & 0xF ) + 1 )
-
 #define add_elt_b8( mj0, mj3, mj10, h, K ) \
  _mm512_xor_si512( h, _mm512_add_epi64( K, \
              _mm512_sub_epi64( _mm512_add_epi64( mj0, mj3 ), mj10 ) ) )
@@ -1171,41 +1182,58 @@ void compress_big_8way( const __m512i *M, const __m512i H[16],
   qt[15] = _mm512_add_epi64( s8b0( W8b15), H[ 0] );

   __m512i mj[16];
-   for ( i = 0; i < 16; i++ )
-      mj[i] = rol8w_off_64( M, i );
+ 
+   mj[ 0] = mm512_rol_64( M[ 0],  1 );
+   mj[ 1] = mm512_rol_64( M[ 1],  2 );
+   mj[ 2] = mm512_rol_64( M[ 2],  3 );
+   mj[ 3] = mm512_rol_64( M[ 3],  4 );
+   mj[ 4] = mm512_rol_64( M[ 4],  5 );
+   mj[ 5] = mm512_rol_64( M[ 5],  6 );
+   mj[ 6] = mm512_rol_64( M[ 6],  7 );
+   mj[ 7] = mm512_rol_64( M[ 7],  8 );
+   mj[ 8] = mm512_rol_64( M[ 8],  9 );
+   mj[ 9] = mm512_rol_64( M[ 9], 10 );
+   mj[10] = mm512_rol_64( M[10], 11 );
+   mj[11] = mm512_rol_64( M[11], 12 );
+   mj[12] = mm512_rol_64( M[12], 13 );
+   mj[13] = mm512_rol_64( M[13], 14 );
+   mj[14] = mm512_rol_64( M[14], 15 );
+   mj[15] = mm512_rol_64( M[15], 16 );

-   qt[16] = add_elt_b8( mj[ 0], mj[ 3], mj[10], H[ 7],
-              (const __m512i)_mm512_set1_epi64( 16 * 0x0555555555555555ULL ) );
-   qt[17] = add_elt_b8( mj[ 1], mj[ 4], mj[11], H[ 8],
-              (const __m512i)_mm512_set1_epi64( 17 * 0x0555555555555555ULL ) );
-   qt[18] = add_elt_b8( mj[ 2], mj[ 5], mj[12], H[ 9],
-              (const __m512i)_mm512_set1_epi64( 18 * 0x0555555555555555ULL ) );
-   qt[19] = add_elt_b8( mj[ 3], mj[ 6], mj[13], H[10],
-              (const __m512i)_mm512_set1_epi64( 19 * 0x0555555555555555ULL ) );
-   qt[20] = add_elt_b8( mj[ 4], mj[ 7], mj[14], H[11],
-              (const __m512i)_mm512_set1_epi64( 20 * 0x0555555555555555ULL ) );
-   qt[21] = add_elt_b8( mj[ 5], mj[ 8], mj[15], H[12],
-              (const __m512i)_mm512_set1_epi64( 21 * 0x0555555555555555ULL ) );
-   qt[22] = add_elt_b8( mj[ 6], mj[ 9], mj[ 0], H[13],
-              (const __m512i)_mm512_set1_epi64( 22 * 0x0555555555555555ULL ) );
-   qt[23] = add_elt_b8( mj[ 7], mj[10], mj[ 1], H[14],
-              (const __m512i)_mm512_set1_epi64( 23 * 0x0555555555555555ULL ) );
-   qt[24] = add_elt_b8( mj[ 8], mj[11], mj[ 2], H[15],
-              (const __m512i)_mm512_set1_epi64( 24 * 0x0555555555555555ULL ) );
-   qt[25] = add_elt_b8( mj[ 9], mj[12], mj[ 3], H[ 0],
-              (const __m512i)_mm512_set1_epi64( 25 * 0x0555555555555555ULL ) );
-   qt[26] = add_elt_b8( mj[10], mj[13], mj[ 4], H[ 1],
-              (const __m512i)_mm512_set1_epi64( 26 * 0x0555555555555555ULL ) );
-   qt[27] = add_elt_b8( mj[11], mj[14], mj[ 5], H[ 2],
-              (const __m512i)_mm512_set1_epi64( 27 * 0x0555555555555555ULL ) );
-   qt[28] = add_elt_b8( mj[12], mj[15], mj[ 6], H[ 3],
-              (const __m512i)_mm512_set1_epi64( 28 * 0x0555555555555555ULL ) );
-   qt[29] = add_elt_b8( mj[13], mj[ 0], mj[ 7], H[ 4],
-              (const __m512i)_mm512_set1_epi64( 29 * 0x0555555555555555ULL ) );
-   qt[30] = add_elt_b8( mj[14], mj[ 1], mj[ 8], H[ 5],
-              (const __m512i)_mm512_set1_epi64( 30 * 0x0555555555555555ULL ) );
-   qt[31] = add_elt_b8( mj[15], mj[ 2], mj[ 9], H[ 6],
-              (const __m512i)_mm512_set1_epi64( 31 * 0x0555555555555555ULL ) );
+   __m512i K = _mm512_set1_epi64( 16 * 0x0555555555555555ULL );
+   const __m512i Kincr = _mm512_set1_epi64( 0x0555555555555555ULL );
+
+   qt[16] = add_elt_b8( mj[ 0], mj[ 3], mj[10], H[ 7], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[17] = add_elt_b8( mj[ 1], mj[ 4], mj[11], H[ 8], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[18] = add_elt_b8( mj[ 2], mj[ 5], mj[12], H[ 9], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[19] = add_elt_b8( mj[ 3], mj[ 6], mj[13], H[10], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[20] = add_elt_b8( mj[ 4], mj[ 7], mj[14], H[11], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[21] = add_elt_b8( mj[ 5], mj[ 8], mj[15], H[12], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[22] = add_elt_b8( mj[ 6], mj[ 9], mj[ 0], H[13], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[23] = add_elt_b8( mj[ 7], mj[10], mj[ 1], H[14], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[24] = add_elt_b8( mj[ 8], mj[11], mj[ 2], H[15], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[25] = add_elt_b8( mj[ 9], mj[12], mj[ 3], H[ 0], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[26] = add_elt_b8( mj[10], mj[13], mj[ 4], H[ 1], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[27] = add_elt_b8( mj[11], mj[14], mj[ 5], H[ 2], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[28] = add_elt_b8( mj[12], mj[15], mj[ 6], H[ 3], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[29] = add_elt_b8( mj[13], mj[ 0], mj[ 7], H[ 4], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[30] = add_elt_b8( mj[14], mj[ 1], mj[ 8], H[ 5], K );
+   K = _mm512_add_epi64( K, Kincr );
+   qt[31] = add_elt_b8( mj[15], mj[ 2], mj[ 9], H[ 6], K );

   qt[16] = _mm512_add_epi64( qt[16], expand1_b8( qt, 16 ) );
   qt[17] = _mm512_add_epi64( qt[17], expand1_b8( qt, 17 ) );
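Review note: both compress_big hunks replace sixteen broadcast constants of the form j * 0x0555555555555555 (j = 16..31) with a running K that is incremented by Kincr after each use. A minimal scalar sketch of that equivalence, one 64-bit lane only, is given below; the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BMW_K_STEP 0x0555555555555555ULL   /* constant taken from the diff */

/* Old form: one multiplication per index j, 16 <= j <= 31. */
static inline uint64_t k_by_multiply( unsigned j )
{
   return (uint64_t)j * BMW_K_STEP;
}

/* New form: start at 16 * step, then add the step once per index. */
static inline uint64_t k_by_running_add( unsigned j )
{
   uint64_t k = 16 * BMW_K_STEP;
   for ( unsigned i = 16; i < j; i++ )
      k += BMW_K_STEP;
   return k;
}

/* Hypothetical self-check: both forms agree for every index in use. */
static void bmw_k_self_check( void )
{
   for ( unsigned j = 16; j <= 31; j++ )
      assert( k_by_multiply( j ) == k_by_running_add( j ) );
}
```

Unsigned 64-bit arithmetic wraps modulo 2^64 in C, so the two forms would agree even if a product overflowed; for j up to 31 no overflow occurs.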
--- a/algo/cubehash/cube-hash-2way.c
+++ b/algo/cubehash/cube-hash-2way.c
@@ -54,14 +54,12 @@ static void transform_4way( cube_4way_context *sp )
        x5 = _mm512_add_epi32( x1, x5 );
        x6 = _mm512_add_epi32( x2, x6 );
        x7 = _mm512_add_epi32( x3, x7 );
-        y0 = x0;
-        y1 = x1;
-        x0 = mm512_rol_32( x2, 7 );
-        x1 = mm512_rol_32( x3, 7 );
-        x2 = mm512_rol_32( y0, 7 );
-        x3 = mm512_rol_32( y1, 7 );
-        x0 = _mm512_xor_si512( x0, x4 );
-        x1 = _mm512_xor_si512( x1, x5 );
+        y0 = mm512_rol_32( x2, 7 );
+        y1 = mm512_rol_32( x3, 7 );
+        x2 = mm512_rol_32( x0, 7 );
+        x3 = mm512_rol_32( x1, 7 );
+        x0 = _mm512_xor_si512( y0, x4 );
+        x1 = _mm512_xor_si512( y1, x5 );
        x2 = _mm512_xor_si512( x2, x6 );
        x3 = _mm512_xor_si512( x3, x7 );
        x4 = mm512_swap128_64( x4 );
@@ -72,15 +70,13 @@ static void transform_4way( cube_4way_context *sp )
        x5 = _mm512_add_epi32( x1, x5 );
        x6 = _mm512_add_epi32( x2, x6 );
        x7 = _mm512_add_epi32( x3, x7 );
-        y0 = x0;
-        y1 = x2;
-        x0 = mm512_rol_32( x1, 11 );
-        x1 = mm512_rol_32( y0, 11 );
-        x2 = mm512_rol_32( x3, 11 );
-        x3 = mm512_rol_32( y1, 11 );
-        x0 = _mm512_xor_si512( x0, x4 );
+        y0 = mm512_rol_32( x1, 11 );
+        x1 = mm512_rol_32( x0, 11 );
+        y1 = mm512_rol_32( x3, 11 );
+        x3 = mm512_rol_32( x2, 11 );
+        x0 = _mm512_xor_si512( y0, x4 );
        x1 = _mm512_xor_si512( x1, x5 );
-        x2 = _mm512_xor_si512( x2, x6 );
+        x2 = _mm512_xor_si512( y1, x6 );
        x3 = _mm512_xor_si512( x3, x7 );
        x4 = mm512_swap64_32( x4 );
        x5 = mm512_swap64_32( x5 );
@@ -131,83 +127,67 @@ static void transform_4way_2buf( cube_4way_2buf_context *sp )
    {
        x4 = _mm512_add_epi32( x0, x4 );
        y4 = _mm512_add_epi32( y0, y4 );
-        tx0 = x0;
-        ty0 = y0;
        x5 = _mm512_add_epi32( x1, x5 );
        y5 = _mm512_add_epi32( y1, y5 );
-        tx1 = x1;
-        ty1 = y1;
-        x0 = mm512_rol_32( x2, 7 );
-        y0 = mm512_rol_32( y2, 7 );
+        tx0 = mm512_rol_32( x2, 7 );
+        ty0 = mm512_rol_32( y2, 7 );
+        tx1 = mm512_rol_32( x3, 7 );
+        ty1 = mm512_rol_32( y3, 7 );
        x6 = _mm512_add_epi32( x2, x6 );
-        y6 = _mm512_add_epi32( y2, y6 );
-        x1 = mm512_rol_32( x3, 7 );
-        y1 = mm512_rol_32( y3, 7 );
+        y6 = _mm512_add_epi32( y2, y6 ); 
        x7 = _mm512_add_epi32( x3, x7 );
        y7 = _mm512_add_epi32( y3, y7 );
-
-
-        x2 = mm512_rol_32( tx0, 7 );
-        y2 = mm512_rol_32( ty0, 7 );
-        x0 = _mm512_xor_si512( x0, x4 );
-        y0 = _mm512_xor_si512( y0, y4 );
+        x2 = mm512_rol_32( x0, 7 );
+        y2 = mm512_rol_32( y0, 7 );
+        x3 = mm512_rol_32( x1, 7 );
+        y3 = mm512_rol_32( y1, 7 );
+        x0 = _mm512_xor_si512( tx0, x4 );
+        y0 = _mm512_xor_si512( ty0, y4 );
+        x1 = _mm512_xor_si512( tx1, x5 );
+        y1 = _mm512_xor_si512( ty1, y5 );
        x4 = mm512_swap128_64( x4 );
-        x3 = mm512_rol_32( tx1, 7 );
-        y3 = mm512_rol_32( ty1, 7 );
        y4 = mm512_swap128_64( y4 );
-
-        x1 = _mm512_xor_si512( x1, x5 );
-        y1 = _mm512_xor_si512( y1, y5 );
        x5 = mm512_swap128_64( x5 );
+        y5 = mm512_swap128_64( y5 );
        x2 = _mm512_xor_si512( x2, x6 );
        y2 = _mm512_xor_si512( y2, y6 );
-        y5 = mm512_swap128_64( y5 );
        x3 = _mm512_xor_si512( x3, x7 );
        y3 = _mm512_xor_si512( y3, y7 );
-
        x6 = mm512_swap128_64( x6 );
+        y6 = mm512_swap128_64( y6 );
+        x7 = mm512_swap128_64( x7 );
+        y7 = mm512_swap128_64( y7 );
        x4 = _mm512_add_epi32( x0, x4 );
        y4 = _mm512_add_epi32( y0, y4 );
-        y6 = mm512_swap128_64( y6 );
        x5 = _mm512_add_epi32( x1, x5 );
        y5 = _mm512_add_epi32( y1, y5 );
-        x7 = mm512_swap128_64( x7 );
+        tx0 = mm512_rol_32( x1, 11 );
+        ty0 = mm512_rol_32( y1, 11 );
+        tx1 = mm512_rol_32( x3, 11 );
+        ty1 = mm512_rol_32( y3, 11 );
        x6 = _mm512_add_epi32( x2, x6 );
        y6 = _mm512_add_epi32( y2, y6 );
-        tx0 = x0;
-        ty0 = y0;
-        y7 = mm512_swap128_64( y7 );
-        tx1 = x2;
-        ty1 = y2;
-        x0 = mm512_rol_32( x1, 11 );
-        y0 = mm512_rol_32( y1, 11 );
-
        x7 = _mm512_add_epi32( x3, x7 );
        y7 = _mm512_add_epi32( y3, y7 );
-
-        x1 = mm512_rol_32( tx0, 11 );
-        y1 = mm512_rol_32( ty0, 11 );
-        x0 = _mm512_xor_si512( x0, x4 );
-        x4 = mm512_swap64_32( x4 );
-        y0 = _mm512_xor_si512( y0, y4 );
-        x2 = mm512_rol_32( x3, 11 );
-        y4 = mm512_swap64_32( y4 );
-        y2 = mm512_rol_32( y3, 11 );
+        x1 = mm512_rol_32( x0, 11 );
+        y1 = mm512_rol_32( y0, 11 );
+        x3 = mm512_rol_32( x2, 11 );
+        y3 = mm512_rol_32( y2, 11 );
+        x0 = _mm512_xor_si512( tx0, x4 );
+        y0 = _mm512_xor_si512( ty0, y4 );
        x1 = _mm512_xor_si512( x1, x5 );
-        x5 = mm512_swap64_32( x5 );
        y1 = _mm512_xor_si512( y1, y5 );
-        x3 = mm512_rol_32( tx1, 11 );
+        x4 = mm512_swap64_32( x4 );
+        y4 = mm512_swap64_32( y4 );
+        x5 = mm512_swap64_32( x5 );
        y5 = mm512_swap64_32( y5 );
-        y3 = mm512_rol_32( ty1, 11 );
-
-        x2 = _mm512_xor_si512( x2, x6 );
-        x6 = mm512_swap64_32( x6 );
-        y2 = _mm512_xor_si512( y2, y6 );
-        y6 = mm512_swap64_32( y6 );
+        x2 = _mm512_xor_si512( tx1, x6 );
+        y2 = _mm512_xor_si512( ty1, y6 );
        x3 = _mm512_xor_si512( x3, x7 );
-        x7 = mm512_swap64_32( x7 );
        y3 = _mm512_xor_si512( y3, y7 );
-
+        x6 = mm512_swap64_32( x6 );
+        y6 = mm512_swap64_32( y6 );
+        x7 = mm512_swap64_32( x7 );
        y7 = mm512_swap64_32( y7 );
    }

@@ -241,14 +221,6 @@ int cube_4way_init( cube_4way_context *sp, int hashbitlen, int rounds,
    sp->rounds    = rounds;
    sp->pos       = 0;

-    h[ 0] = m512_const1_128( iv[0] );
-    h[ 1] = m512_const1_128( iv[1] );
-    h[ 2] = m512_const1_128( iv[2] );
-    h[ 3] = m512_const1_128( iv[3] );
-    h[ 4] = m512_const1_128( iv[4] );
-    h[ 5] = m512_const1_128( iv[5] );
-    h[ 6] = m512_const1_128( iv[6] );
-    h[ 7] = m512_const1_128( iv[7] );
    h[ 0] = m512_const1_128( iv[0] );
    h[ 1] = m512_const1_128( iv[1] );
    h[ 2] = m512_const1_128( iv[2] );
@@ -489,33 +461,29 @@ static void transform_2way( cube_2way_context *sp )
        x5 = _mm256_add_epi32( x1, x5 );
        x6 = _mm256_add_epi32( x2, x6 );
        x7 = _mm256_add_epi32( x3, x7 );
-        y0 = x0;
-        y1 = x1;
-        ROL2( x0, x1, x2, x3, 7 );
-        ROL2( x2, x3, y0, y1, 7 );
-        x0 = _mm256_xor_si256( x0, x4 );
+        ROL2( y0, y1, x2, x3, 7 );
+        ROL2( x2, x3, x0, x1, 7 );
+        x0 = _mm256_xor_si256( y0, x4 );
+        x1 = _mm256_xor_si256( y1, x5 );
+        x2 = _mm256_xor_si256( x2, x6 );
+        x3 = _mm256_xor_si256( x3, x7 );
        x4 = mm256_swap128_64( x4 );
-        x1 = _mm256_xor_si256( x1, x5 );
-        x2 = _mm256_xor_si256( x2, x6 );
        x5 = mm256_swap128_64( x5 );
-        x3 = _mm256_xor_si256( x3, x7 );
-        x4 = _mm256_add_epi32( x0, x4 );
        x6 = mm256_swap128_64( x6 );
-        y0 = x0;
-        x5 = _mm256_add_epi32( x1, x5 );
        x7 = mm256_swap128_64( x7 );
+        x4 = _mm256_add_epi32( x0, x4 );
+        x5 = _mm256_add_epi32( x1, x5 );
        x6 = _mm256_add_epi32( x2, x6 );
-        y1 = x2;
-        ROL2( x0, x1, x1, y0, 11 );
        x7 = _mm256_add_epi32( x3, x7 );
-        ROL2( x2, x3, x3, y1, 11 );
-        x0 = _mm256_xor_si256( x0, x4 );
-        x4 = mm256_swap64_32( x4 );
+        ROL2( y0, x1, x1, x0, 11 );
+        ROL2( y1, x3, x3, x2, 11 );
+        x0 = _mm256_xor_si256( y0, x4 );
        x1 = _mm256_xor_si256( x1, x5 );
-        x5 = mm256_swap64_32( x5 );
-        x2 = _mm256_xor_si256( x2, x6 );
-        x6 = mm256_swap64_32( x6 );
+        x2 = _mm256_xor_si256( y1, x6 );
        x3 = _mm256_xor_si256( x3, x7 );
+        x4 = mm256_swap64_32( x4 );
+        x5 = mm256_swap64_32( x5 );
+        x6 = mm256_swap64_32( x6 );
        x7 = mm256_swap64_32( x7 );
    }

@@ -540,14 +508,6 @@ int cube_2way_init( cube_2way_context *sp, int hashbitlen, int rounds,
    sp->rounds    = rounds;
    sp->pos       = 0;

-    h[ 0] = m256_const1_128( iv[0] );
-    h[ 1] = m256_const1_128( iv[1] );
-    h[ 2] = m256_const1_128( iv[2] );
-    h[ 3] = m256_const1_128( iv[3] );
-    h[ 4] = m256_const1_128( iv[4] );
-    h[ 5] = m256_const1_128( iv[5] );
-    h[ 6] = m256_const1_128( iv[6] );
-    h[ 7] = m256_const1_128( iv[7] );
    h[ 0] = m256_const1_128( iv[0] );
    h[ 1] = m256_const1_128( iv[1] );
    h[ 2] = m256_const1_128( iv[2] );
@@ -560,7 +520,6 @@ int cube_2way_init( cube_2way_context *sp, int hashbitlen, int rounds,
    return 0;
 }

-
 int cube_2way_update( cube_2way_context *sp, const void *data, size_t size )
 {
    const int len = size >> 4;
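Review note: the transform hunks above reorder the rotate-and-swap step so that the rotated values go straight into the temporaries instead of first copying x0/x1 and rotating afterwards. The scalar sketch below models one 32-bit lane of the first half-round (rotate by 7) and checks that both orderings produce the same state. It is an illustration under that single-lane assumption, not the vector code, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static inline uint32_t rol32( uint32_t x, unsigned n )
{
   return ( x << n ) | ( x >> ( 32 - n ) );
}

/* Old ordering: save x0 and x1, rotate in place, then xor. */
static void rotate_swap_old( uint32_t x[8] )
{
   uint32_t y0 = x[0], y1 = x[1];
   x[0] = rol32( x[2], 7 );
   x[1] = rol32( x[3], 7 );
   x[2] = rol32( y0, 7 );
   x[3] = rol32( y1, 7 );
   x[0] ^= x[4];
   x[1] ^= x[5];
   x[2] ^= x[6];
   x[3] ^= x[7];
}

/* New ordering: rotate x2 and x3 into temporaries, so x0 and x1 stay
   readable and no plain copies are needed. */
static void rotate_swap_new( uint32_t x[8] )
{
   uint32_t y0 = rol32( x[2], 7 );
   uint32_t y1 = rol32( x[3], 7 );
   x[2] = rol32( x[0], 7 );
   x[3] = rol32( x[1], 7 );
   x[0] = y0 ^ x[4];
   x[1] = y1 ^ x[5];
   x[2] ^= x[6];
   x[3] ^= x[7];
}

/* Returns 1 when both orderings give the same result for a sample state. */
static int rotate_swap_forms_agree( void )
{
   uint32_t a[8] = { 0x01234567, 0x89abcdef, 0xdeadbeef, 0x0badf00d,
                     0x13579bdf, 0x2468ace0, 0xffffffff, 0x00000001 };
   uint32_t b[8];
   memcpy( b, a, sizeof a );
   rotate_swap_old( a );
   rotate_swap_new( b );
   return memcmp( a, b, sizeof a ) == 0;
}
```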
--- a/algo/cubehash/cubehash_sse2.h
+++ b/algo/cubehash/cubehash_sse2.h
@@ -15,11 +15,11 @@

 struct _cubehashParam
 {
+    __m128i _ALIGN(64) x[8];  // aligned for __m512i
    int hashlen;           // __m128i
    int rounds;
    int blocksize;         // __m128i
    int pos;	           // number of __m128i read into x from current block
-    __m128i _ALIGN(64) x[8];  // aligned for __m256i
 };

 typedef struct _cubehashParam cubehashParam;
--- a/algo/fugue/fugue-aesni.h
+++ b/algo/fugue/fugue-aesni.h
@@ -37,12 +37,23 @@ typedef struct

 } hashState_fugue __attribute__ ((aligned (64)));

+
+// These functions are deprecated; use the lowercase macro aliases, which follow
+// the standard interface. This will be cleaned up at a later date.
 HashReturn fugue512_Init(hashState_fugue *state, int hashbitlen);

 HashReturn fugue512_Update(hashState_fugue *state, const void *data, DataLength databitlen);

 HashReturn fugue512_Final(hashState_fugue *state, void *hashval);

+#define fugue512_init( state ) \
+   fugue512_Init( state, 512 )
+#define fugue512_update( state, data, len ) \
+   fugue512_Update( state, data, (len)<<3 )
+#define fugue512_final \
+   fugue512_Final
+
+
 HashReturn fugue512_full(hashState_fugue *hs, void *hashval, const void *data, DataLength databitlen);

 #endif // AES
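Review note: the new lowercase aliases wrap the existing entry points, fixing the hash length at 512 and converting a byte count to a bit count with (len)<<3. The sketch below mirrors that macro pattern against stand-in functions so the conversion can be seen in isolation. Everything prefixed demo_ is hypothetical; the real fugue512_* functions and the DataLength type are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hash state. */
typedef struct
{
   int      hashbitlen;
   uint64_t bits_absorbed;
} demo_state;

/* Stand-ins for the bit-length based interface. */
static int demo_Init( demo_state *s, int hashbitlen )
{
   s->hashbitlen    = hashbitlen;
   s->bits_absorbed = 0;
   return 0;
}

static int demo_Update( demo_state *s, const void *data, uint64_t databitlen )
{
   (void)data;                       /* payload is ignored in this sketch */
   s->bits_absorbed += databitlen;
   return 0;
}

/* Byte-length based aliases, same shape as the macros added in the diff. */
#define demo_init( state ) \
   demo_Init( state, 512 )
#define demo_update( state, data, len ) \
   demo_Update( state, data, (len)<<3 )
```

With these definitions, demo_update( &st, buf, 80 ) forwards a length of 640 bits to demo_Update.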
--- a/algo/groestl/aes_ni/hash-groestl.c
+++ b/algo/groestl/aes_ni/hash-groestl.c
@@ -156,14 +156,12 @@ int groestl512_full( hashState_groestl* ctx, void* output,
   }
   ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
   ctx->buf_ptr = 0;
-   ctx->rem_ptr = 0;

   // --- update ---
   
   const int len = (int)databitlen / 128;
   const int hashlen_m128i = ctx->hashlen / 16;   // bytes to __m128i
   const int hash_offset = SIZE512 - hashlen_m128i;
-   int rem = ctx->rem_ptr;
   uint64_t blocks = len / SIZE512;
   __m128i* in = (__m128i*)input;

@@ -175,8 +173,8 @@ int groestl512_full( hashState_groestl* ctx, void* output,
   // copy any remaining data to buffer, it may already contain data
   // from a previous update for a midstate precalc
   for ( i = 0; i < len % SIZE512; i++ )
-       ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
-   i += rem;    // use i as rem_ptr in final
+       ctx->buffer[ i ] = in[ ctx->buf_ptr + i ];
+   // use i as rem_ptr in final

   //--- final ---

--- a/algo/groestl/aes_ni/hash-groestl256.c
+++ b/algo/groestl/aes_ni/hash-groestl256.c
@@ -227,12 +227,10 @@ int groestl256_full( hashState_groestl256* ctx,
  ((u64*)ctx->chaining)[COLS-1] = U64BIG((u64)LENGTH);
  INIT256( ctx->chaining );
  ctx->buf_ptr = 0;
-  ctx->rem_ptr = 0;

   const int len = (int)databitlen / 128;
   const int hashlen_m128i = ctx->hashlen / 16;   // bytes to __m128i
   const int hash_offset = SIZE256 - hashlen_m128i;
-   int rem = ctx->rem_ptr;
   int blocks = len / SIZE256;
   __m128i* in = (__m128i*)input;

@@ -245,7 +243,7 @@ int groestl256_full( hashState_groestl256* ctx,

   // cryptonight has 200 byte input, an odd number of __m128i
   // remainder is only 8 bytes, ie u64.
-   if ( databitlen % 128 !=0 )
+   if ( databitlen % 128 != 0 )
   {
      // must be cryptonight, copy 64 bits of data
      *(uint64_t*)(ctx->buffer) = *(uint64_t*)(&in[ ctx->buf_ptr ] );
@@ -255,8 +253,8 @@ int groestl256_full( hashState_groestl256* ctx,
   {
      // Copy any remaining data to buffer for final transform
      for ( i = 0; i < len % SIZE256; i++ )
-          ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
-      i += rem;   // use i as rem_ptr in final
+          ctx->buffer[ i ] = in[ ctx->buf_ptr + i ];
+      // use i as rem_ptr in final
   }

   //--- final ---
--- a/algo/groestl/groestl256-hash-4way.c
+++ b/algo/groestl/groestl256-hash-4way.c
@@ -50,7 +50,6 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
   const int len = (int)datalen >> 4;
   const int hashlen_m128i = 32 >> 4;   // bytes to __m128i
   const int hash_offset = SIZE256 - hashlen_m128i;
-   int rem = ctx->rem_ptr;
   uint64_t blocks = len / SIZE256;
   __m512i* in = (__m512i*)input;
   int i;
@@ -67,7 +66,6 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
  // The only non-zero in the IV is len. It can be hard coded.
  ctx->chaining[ 3 ] = m512_const2_64( 0, 0x0100000000000000 );
  ctx->buf_ptr = 0;
-  ctx->rem_ptr = 0;
   
   // --- update ---

@@ -76,11 +74,10 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
      TF512_4way( ctx->chaining, &in[ i * SIZE256 ] );
   ctx->buf_ptr = blocks * SIZE256;

-   // copy any remaining data to buffer, it may already contain data
-   // from a previous update for a midstate precalc
+   // copy any remaining data to buffer 
   for ( i = 0; i < len % SIZE256; i++ )
-       ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
-   i += rem;    // use i as rem_ptr in final
+       ctx->buffer[ i ] = in[ ctx->buf_ptr + i ];
+   // use i as rem_ptr in final

   //--- final ---

@@ -206,7 +203,6 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
   const int len = (int)datalen >> 4;
   const int hashlen_m128i = 32 >> 4;   // bytes to __m128i
   const int hash_offset = SIZE256 - hashlen_m128i;
-   int rem = ctx->rem_ptr;
   uint64_t blocks = len / SIZE256;
   __m256i* in = (__m256i*)input;
   int i;
@@ -223,7 +219,6 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
   // The only non-zero in the IV is len. It can be hard coded.
   ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
   ctx->buf_ptr = 0;
-   ctx->rem_ptr = 0;

   // --- update ---

@@ -232,11 +227,10 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
      TF512_2way( ctx->chaining, &in[ i * SIZE256 ] );
   ctx->buf_ptr = blocks * SIZE256;

-   // copy any remaining data to buffer, it may already contain data
-   // from a previous update for a midstate precalc
+   // copy any remaining data to buffer
   for ( i = 0; i < len % SIZE256; i++ )
-       ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
-   i += rem;    // use i as rem_ptr in final
+       ctx->buffer[ i ] = in[ ctx->buf_ptr + i ];
+   // use i as rem_ptr in final

   //--- final ---

--- a/algo/groestl/groestl512-hash-4way.c
+++ b/algo/groestl/groestl512-hash-4way.c
@@ -99,7 +99,6 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
   memset_zero_512( ctx->buffer, SIZE512 );
   ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );
   ctx->buf_ptr = 0;
-   ctx->rem_ptr = 0;

   // --- update ---

@@ -108,8 +107,7 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
   ctx->buf_ptr = blocks * SIZE512;

   for ( i = 0; i < len % SIZE512; i++ )
-       ctx->buffer[ ctx->rem_ptr + i ] = in[ ctx->buf_ptr + i ];
-   i += ctx->rem_ptr;
+       ctx->buffer[ i ] = in[ ctx->buf_ptr + i ];

   // --- close ---

@@ -222,7 +220,6 @@ int groestl512_2way_full( groestl512_2way_context* ctx, void* output,
   memset_zero_256( ctx->buffer, SIZE512 );
   ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
   ctx->buf_ptr = 0;
-   ctx->rem_ptr = 0;

   // --- update ---

@@ -231,8 +228,7 @@ int groestl512_2way_full( groestl512_2way_context* ctx, void* output,
   ctx->buf_ptr = blocks * SIZE512;

   for ( i = 0; i < len % SIZE512; i++ )
-       ctx->buffer[ ctx->rem_ptr + i ] = in[ ctx->buf_ptr + i ];
-   i += ctx->rem_ptr;
+       ctx->buffer[ i ] = in[ ctx->buf_ptr + i ];

   // --- close ---

--- a/algo/hamsi/hamsi-hash-4way.c
+++ b/algo/hamsi/hamsi-hash-4way.c
@@ -545,31 +545,33 @@ static const sph_u32 T512[64][16] = {
 #define sE   c7
 #define sF   m7

-
 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

 // Hamsi 8 way AVX512 

+// Intel says _mm512_movepi64_mask has (1L/1T) timing while
+// _mm512_cmplt_epi64_mask has (3L/1T) timing. However, when tested hashing X13
+// on an i9-9940X, cmplt with zero was 3% faster than movepi.
+
 #define INPUT_BIG8 \
 do { \
-  __m512i db = *buf; \
-  const uint64_t *tp = (uint64_t*)&T512[0][0];  \
-  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = m512_zero; \
+  __m512i db = _mm512_ror_epi64( *buf, 1 ); \
+  const __m512i zero = m512_zero; \
+  const uint64_t *tp = (const uint64_t*)T512; \
+  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = zero; \
  for ( int u = 0; u < 64; u++ ) \
  { \
-     __m512i dm = _mm512_and_si512( db, m512_one_64 ) ; \
-     dm = mm512_negate_32( _mm512_or_si512( dm, \
-                                          _mm512_slli_epi64( dm, 32 ) ) ); \
-     m0 = mm512_xorand( m0, dm, m512_const1_64( tp[0] ) ); \
-     m1 = mm512_xorand( m1, dm, m512_const1_64( tp[1] ) ); \
-     m2 = mm512_xorand( m2, dm, m512_const1_64( tp[2] ) ); \
-     m3 = mm512_xorand( m3, dm, m512_const1_64( tp[3] ) ); \
-     m4 = mm512_xorand( m4, dm, m512_const1_64( tp[4] ) ); \
-     m5 = mm512_xorand( m5, dm, m512_const1_64( tp[5] ) ); \
-     m6 = mm512_xorand( m6, dm, m512_const1_64( tp[6] ) ); \
-     m7 = mm512_xorand( m7, dm, m512_const1_64( tp[7] ) ); \
+     const __mmask8 dm = _mm512_cmplt_epi64_mask( db, zero ); \
+     m0 = _mm512_mask_xor_epi64( m0, dm, m0, m512_const1_64( tp[0] ) ); \
+     m1 = _mm512_mask_xor_epi64( m1, dm, m1, m512_const1_64( tp[1] ) ); \
+     m2 = _mm512_mask_xor_epi64( m2, dm, m2, m512_const1_64( tp[2] ) ); \
+     m3 = _mm512_mask_xor_epi64( m3, dm, m3, m512_const1_64( tp[3] ) ); \
+     m4 = _mm512_mask_xor_epi64( m4, dm, m4, m512_const1_64( tp[4] ) ); \
+     m5 = _mm512_mask_xor_epi64( m5, dm, m5, m512_const1_64( tp[5] ) ); \
+     m6 = _mm512_mask_xor_epi64( m6, dm, m6, m512_const1_64( tp[6] ) ); \
+     m7 = _mm512_mask_xor_epi64( m7, dm, m7, m512_const1_64( tp[7] ) ); \
+     db = _mm512_ror_epi64( db, 1 ); \
     tp += 8; \
-     db = _mm512_srli_epi64( db, 1 ); \
  } \
 } while (0)

@@ -609,199 +611,192 @@ do { \

 #define READ_STATE_BIG8(sc) \
 do { \
-   c0 = sc->h[0x0]; \
-   c1 = sc->h[0x1]; \
-   c2 = sc->h[0x2]; \
-   c3 = sc->h[0x3]; \
-   c4 = sc->h[0x4]; \
-   c5 = sc->h[0x5]; \
-   c6 = sc->h[0x6]; \
-   c7 = sc->h[0x7]; \
+   c0 = sc->h[0]; \
+   c1 = sc->h[1]; \
+   c2 = sc->h[2]; \
+   c3 = sc->h[3]; \
+   c4 = sc->h[4]; \
+   c5 = sc->h[5]; \
+   c6 = sc->h[6]; \
+   c7 = sc->h[7]; \
 } while (0)

 #define WRITE_STATE_BIG8(sc) \
 do { \
-   sc->h[0x0] = c0; \
-   sc->h[0x1] = c1; \
-   sc->h[0x2] = c2; \
-   sc->h[0x3] = c3; \
-   sc->h[0x4] = c4; \
-   sc->h[0x5] = c5; \
-   sc->h[0x6] = c6; \
-   sc->h[0x7] = c7; \
+   sc->h[0] = c0; \
+   sc->h[1] = c1; \
+   sc->h[2] = c2; \
+   sc->h[3] = c3; \
+   sc->h[4] = c4; \
+   sc->h[5] = c5; \
+   sc->h[6] = c6; \
+   sc->h[7] = c7; \
 } while (0)

-
 #define ROUND_BIG8( alpha ) \
 do { \
   __m512i t0, t1, t2, t3; \
-   s0 = _mm512_xor_si512( s0, alpha[ 0] ); \
-   s1 = _mm512_xor_si512( s1, alpha[ 1] ); \
-   s2 = _mm512_xor_si512( s2, alpha[ 2] ); \
-   s3 = _mm512_xor_si512( s3, alpha[ 3] ); \
-   s4 = _mm512_xor_si512( s4, alpha[ 4] ); \
-   s5 = _mm512_xor_si512( s5, alpha[ 5] ); \
-   s6 = _mm512_xor_si512( s6, alpha[ 6] ); \
-   s7 = _mm512_xor_si512( s7, alpha[ 7] ); \
-   s8 = _mm512_xor_si512( s8, alpha[ 8] ); \
-   s9 = _mm512_xor_si512( s9, alpha[ 9] ); \
-   sA = _mm512_xor_si512( sA, alpha[10] ); \
-   sB = _mm512_xor_si512( sB, alpha[11] ); \
-   sC = _mm512_xor_si512( sC, alpha[12] ); \
-   sD = _mm512_xor_si512( sD, alpha[13] ); \
-   sE = _mm512_xor_si512( sE, alpha[14] ); \
-   sF = _mm512_xor_si512( sF, alpha[15] ); \
+   s0 = _mm512_xor_si512( s0, alpha[ 0] ); /* m0 */ \
+   s1 = _mm512_xor_si512( s1, alpha[ 1] ); /* c0 */ \
+   s2 = _mm512_xor_si512( s2, alpha[ 2] ); /* m1 */ \
+   s3 = _mm512_xor_si512( s3, alpha[ 3] ); /* c1 */ \
+   s4 = _mm512_xor_si512( s4, alpha[ 4] ); /* c2 */ \
+   s5 = _mm512_xor_si512( s5, alpha[ 5] ); /* m2 */ \
+   s6 = _mm512_xor_si512( s6, alpha[ 6] ); /* c3 */ \
+   s7 = _mm512_xor_si512( s7, alpha[ 7] ); /* m3 */ \
+   s8 = _mm512_xor_si512( s8, alpha[ 8] ); /* m4 */ \
+   s9 = _mm512_xor_si512( s9, alpha[ 9] ); /* c4 */ \
+   sA = _mm512_xor_si512( sA, alpha[10] ); /* m5 */ \
+   sB = _mm512_xor_si512( sB, alpha[11] ); /* c5 */ \
+   sC = _mm512_xor_si512( sC, alpha[12] ); /* c6 */ \
+   sD = _mm512_xor_si512( sD, alpha[13] ); /* m6 */ \
+   sE = _mm512_xor_si512( sE, alpha[14] ); /* c7 */ \
+   sF = _mm512_xor_si512( sF, alpha[15] ); /* m7 */ \
 \
-  SBOX8( s0, s4, s8, sC ); \
-  SBOX8( s1, s5, s9, sD ); \
-  SBOX8( s2, s6, sA, sE ); \
-  SBOX8( s3, s7, sB, sF ); \
+  SBOX8( s0, s4, s8, sC ); /* ( m0, c2, m4, c6 ) */ \
+  SBOX8( s1, s5, s9, sD ); /* ( c0, m2, c4, m6 ) */ \
+  SBOX8( s2, s6, sA, sE ); /* ( m1, c3, m5, c7 ) */ \
+  SBOX8( s3, s7, sB, sF ); /* ( c1, m3, c5, m7 ) */ \
 \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s4, 4 ), \
-                                        _mm512_bslli_epi128( s5, 4 ) ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sD, 4 ), \
-                                        _mm512_bslli_epi128( sE, 4 ) ); \
+  s4 = mm512_swap64_32( s4 ); \
+  s5 = mm512_swap64_32( s5 ); \
+  sD = mm512_swap64_32( sD ); \
+  sE = mm512_swap64_32( sE ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s4, s5 ); \
+  t3 = _mm512_mask_blend_epi32( 0xaaaa, sD, sE ); \
  L8( s0, t1, s9, t3 ); \
-  s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, _mm512_bslli_epi128( t1, 4 ) ); \
-  s5 = _mm512_mask_blend_epi32( 0x5555, s5, _mm512_bsrli_epi128( t1, 4 ) ); \
-  sD = _mm512_mask_blend_epi32( 0xaaaa, sD, _mm512_bslli_epi128( t3, 4 ) ); \
-  sE = _mm512_mask_blend_epi32( 0x5555, sE, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s4 = _mm512_mask_blend_epi32( 0x5555, s4, t1 ); \
+  s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, t1 ); \
+  sD = _mm512_mask_blend_epi32( 0x5555, sD, t3 ); \
+  sE = _mm512_mask_blend_epi32( 0xaaaa, sE, t3 ); \
 \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s5, 4 ), \
-                                        _mm512_bslli_epi128( s6, 4 ) ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sE, 4 ), \
-                                        _mm512_bslli_epi128( sF, 4 ) ); \
+  s6 = mm512_swap64_32( s6 ); \
+  sF = mm512_swap64_32( sF ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s5, s6 ); \
+  t3 = _mm512_mask_blend_epi32( 0xaaaa, sE, sF ); \
  L8( s1, t1, sA, t3 ); \
-  s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, _mm512_bslli_epi128( t1, 4 ) ); \
-  s6 = _mm512_mask_blend_epi32( 0x5555, s6, _mm512_bsrli_epi128( t1, 4 ) ); \
-  sE = _mm512_mask_blend_epi32( 0xaaaa, sE, _mm512_bslli_epi128( t3, 4 ) ); \
-  sF = _mm512_mask_blend_epi32( 0x5555, sF, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s5 = _mm512_mask_blend_epi32( 0x5555, s5, t1 ); \
+  s6 = _mm512_mask_blend_epi32( 0xaaaa, s6, t1 ); \
+  sE = _mm512_mask_blend_epi32( 0x5555, sE, t3 ); \
+  sF = _mm512_mask_blend_epi32( 0xaaaa, sF, t3 ); \
 \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s6, 4 ), \
-                                        _mm512_bslli_epi128( s7, 4 ) ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sF, 4 ), \
-                                        _mm512_bslli_epi128( sC, 4 ) ); \
+  s7 = mm512_swap64_32( s7 ); \
+  sC = mm512_swap64_32( sC ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s6, s7 ); \
+  t3 = _mm512_mask_blend_epi32( 0xaaaa, sF, sC ); \
  L8( s2, t1, sB, t3 ); \
-  s6 = _mm512_mask_blend_epi32( 0xaaaa, s6, _mm512_bslli_epi128( t1, 4 ) ); \
-  s7 = _mm512_mask_blend_epi32( 0x5555, s7, _mm512_bsrli_epi128( t1, 4 ) ); \
-  sF = _mm512_mask_blend_epi32( 0xaaaa, sF, _mm512_bslli_epi128( t3, 4 ) ); \
-  sC = _mm512_mask_blend_epi32( 0x5555, sC, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s6 = _mm512_mask_blend_epi32( 0x5555, s6, t1 ); \
+  s7 = _mm512_mask_blend_epi32( 0xaaaa, s7, t1 ); \
+  sF = _mm512_mask_blend_epi32( 0x5555, sF, t3 ); \
+  sC = _mm512_mask_blend_epi32( 0xaaaa, sC, t3 ); \
+  s6 = mm512_swap64_32( s6 ); \
+  sF = mm512_swap64_32( sF ); \
 \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s7, 4 ), \
-                                        _mm512_bslli_epi128( s4, 4 ) ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sC, 4 ), \
-                                        _mm512_bslli_epi128( sD, 4 ) ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s7, s4 ); \
+  t3 = _mm512_mask_blend_epi32( 0xaaaa, sC, sD ); \
  L8( s3, t1, s8, t3 ); \
-  s7 = _mm512_mask_blend_epi32( 0xaaaa, s7, _mm512_bslli_epi128( t1, 4 ) ); \
-  s4 = _mm512_mask_blend_epi32( 0x5555, s4, _mm512_bsrli_epi128( t1, 4 ) ); \
-  sC = _mm512_mask_blend_epi32( 0xaaaa, sC, _mm512_bslli_epi128( t3, 4 ) ); \
-  sD = _mm512_mask_blend_epi32( 0x5555, sD, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s7 = _mm512_mask_blend_epi32( 0x5555, s7, t1 ); \
+  s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, t1 ); \
+  sC = _mm512_mask_blend_epi32( 0x5555, sC, t3 ); \
+  sD = _mm512_mask_blend_epi32( 0xaaaa, sD, t3 ); \
+  s7 = mm512_swap64_32( s7 ); \
+  sC = mm512_swap64_32( sC ); \
 \
-  t0 = _mm512_mask_blend_epi32( 0xaaaa, s0, _mm512_bslli_epi128( s8, 4 ) ); \
+  t0 = _mm512_mask_blend_epi32( 0xaaaa, s0, mm512_swap64_32( s8 ) ); \
  t1 = _mm512_mask_blend_epi32( 0xaaaa, s1, s9 ); \
-  t2 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s2, 4 ), sA ); \
-  t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s3, 4 ), \
-                                        _mm512_bslli_epi128( sB, 4 ) ); \
+  t2 = _mm512_mask_blend_epi32( 0xaaaa, mm512_swap64_32( s2 ), sA ); \
+  t3 = _mm512_mask_blend_epi32( 0x5555, s3, sB ); \
+  t3 = mm512_swap64_32( t3 ); \
  L8( t0, t1, t2, t3 ); \
+  t3 = mm512_swap64_32( t3 ); \
  s0 = _mm512_mask_blend_epi32( 0x5555, s0, t0 ); \
-  s8 = _mm512_mask_blend_epi32( 0x5555, s8, _mm512_bsrli_epi128( t0, 4 ) ); \
+  s8 = _mm512_mask_blend_epi32( 0x5555, s8, mm512_swap64_32( t0 ) ); \
  s1 = _mm512_mask_blend_epi32( 0x5555, s1, t1 ); \
  s9 = _mm512_mask_blend_epi32( 0xaaaa, s9, t1 ); \
-  s2 = _mm512_mask_blend_epi32( 0xaaaa, s2, _mm512_bslli_epi128( t2, 4 ) ); \
+  s2 = _mm512_mask_blend_epi32( 0xaaaa, s2, mm512_swap64_32( t2 ) ); \
  sA = _mm512_mask_blend_epi32( 0xaaaa, sA, t2 ); \
-  s3 = _mm512_mask_blend_epi32( 0xaaaa, s3, _mm512_bslli_epi128( t3, 4 ) ); \
-  sB = _mm512_mask_blend_epi32( 0x5555, sB, _mm512_bsrli_epi128( t3, 4 ) ); \
+  s3 = _mm512_mask_blend_epi32( 0xaaaa, s3, t3 ); \
+  sB = _mm512_mask_blend_epi32( 0x5555, sB, t3 ); \
 \
-  t0 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s4, 4 ), sC ); \
-  t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s5, 4 ), \
-                                        _mm512_bslli_epi128( sD, 4 ) ); \
-  t2 = _mm512_mask_blend_epi32( 0xaaaa, s6, _mm512_bslli_epi128( sE, 4 ) ); \
+  t0 = _mm512_mask_blend_epi32( 0xaaaa, s4, sC ); \
+  t1 = _mm512_mask_blend_epi32( 0xaaaa, s5, sD ); \
+  t2 = _mm512_mask_blend_epi32( 0xaaaa, s6, sE ); \
  t3 = _mm512_mask_blend_epi32( 0xaaaa, s7, sF ); \
  L8( t0, t1, t2, t3 ); \
-  s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, _mm512_bslli_epi128( t0, 4 ) ); \
+  s4 = _mm512_mask_blend_epi32( 0x5555, s4, t0 ); \
  sC = _mm512_mask_blend_epi32( 0xaaaa, sC, t0 ); \
-  s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, _mm512_bslli_epi128( t1, 4 ) ); \
-  sD = _mm512_mask_blend_epi32( 0x5555, sD, _mm512_bsrli_epi128( t1, 4 ) ); \
+  s5 = _mm512_mask_blend_epi32( 0x5555, s5, t1 ); \
+  sD = _mm512_mask_blend_epi32( 0xaaaa, sD, t1 ); \
  s6 = _mm512_mask_blend_epi32( 0x5555, s6, t2 ); \
-  sE = _mm512_mask_blend_epi32( 0x5555, sE, _mm512_bsrli_epi128( t2, 4 ) ); \
+  sE = _mm512_mask_blend_epi32( 0xaaaa, sE, t2 ); \
  s7 = _mm512_mask_blend_epi32( 0x5555, s7, t3 ); \
  sF = _mm512_mask_blend_epi32( 0xaaaa, sF, t3 ); \
+  s4 = mm512_swap64_32( s4 ); \
+  s5 = mm512_swap64_32( s5 ); \
+  sD = mm512_swap64_32( sD ); \
+  sE = mm512_swap64_32( sE ); \
 } while (0)

 #define P_BIG8 \
 do { \
   __m512i alpha[16]; \
+   const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
   for( int i = 0; i < 16; i++ ) \
      alpha[i] = m512_const1_64( ( (uint64_t*)alpha_n )[i] ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)1 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (1ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)2 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (2ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)3 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (3ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)4 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (4ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)5 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m512_const1_64( (5ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
 } while (0)

 #define PF_BIG8 \
 do { \
   __m512i alpha[16]; \
+   const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
   for( int i = 0; i < 16; i++ ) \
      alpha[i] = m512_const1_64( ( (uint64_t*)alpha_f )[i] ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)1 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 1ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)2 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 2ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)3 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 3ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)4 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 4ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)5 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 5ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)6 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 6ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)7 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 7ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)8 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 8ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)9 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( ( 9ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)10 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( (10ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( (uint64_t)11 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m512_const1_64( (11ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
 } while (0)

 #define T_BIG8 \
 do { /* order is important */ \
-   c7 = sc->h[ 0x7 ] = _mm512_xor_si512( sc->h[ 0x7 ], sB ); \
-   c6 = sc->h[ 0x6 ] = _mm512_xor_si512( sc->h[ 0x6 ], sA ); \
-   c5 = sc->h[ 0x5 ] = _mm512_xor_si512( sc->h[ 0x5 ], s9 ); \
-   c4 = sc->h[ 0x4 ] = _mm512_xor_si512( sc->h[ 0x4 ], s8 ); \
-   c3 = sc->h[ 0x3 ] = _mm512_xor_si512( sc->h[ 0x3 ], s3 ); \
-   c2 = sc->h[ 0x2 ] = _mm512_xor_si512( sc->h[ 0x2 ], s2 ); \
-   c1 = sc->h[ 0x1 ] = _mm512_xor_si512( sc->h[ 0x1 ], s1 ); \
-   c0 = sc->h[ 0x0 ] = _mm512_xor_si512( sc->h[ 0x0 ], s0 ); \
+   c7 = sc->h[ 7 ] = _mm512_xor_si512( sc->h[ 7 ], sB ); /* c5 */ \
+   c6 = sc->h[ 6 ] = _mm512_xor_si512( sc->h[ 6 ], sA ); /* m5 */ \
+   c5 = sc->h[ 5 ] = _mm512_xor_si512( sc->h[ 5 ], s9 ); /* c4 */ \
+   c4 = sc->h[ 4 ] = _mm512_xor_si512( sc->h[ 4 ], s8 ); /* m4 */ \
+   c3 = sc->h[ 3 ] = _mm512_xor_si512( sc->h[ 3 ], s3 ); /* c1 */ \
+   c2 = sc->h[ 2 ] = _mm512_xor_si512( sc->h[ 2 ], s2 ); /* m1 */ \
+   c1 = sc->h[ 1 ] = _mm512_xor_si512( sc->h[ 1 ], s1 ); /* c0 */ \
+   c0 = sc->h[ 0 ] = _mm512_xor_si512( sc->h[ 0 ], s0 ); /* m0 */ \
 } while (0)

 void hamsi_8way_big( hamsi_8way_big_context *sc, __m512i *buf, size_t num )
@@ -838,7 +833,6 @@ void hamsi_8way_big_final( hamsi_8way_big_context *sc, __m512i *buf )
   WRITE_STATE_BIG8( sc );
 }

-
 void hamsi512_8way_init( hamsi_8way_big_context *sc )
 {
   sc->partial_len = 0;
@@ -888,13 +882,12 @@ void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )
 #define INPUT_BIG \
 do { \
  __m256i db = *buf; \
-  const uint64_t *tp = (uint64_t*)&T512[0][0];  \
-  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = m256_zero; \
-  for ( int u = 0; u < 64; u++ ) \
+  const __m256i zero = m256_zero; \
+  const uint64_t *tp = (const uint64_t*)T512;  \
+  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = zero; \
+  for ( int u = 63; u >= 0; u-- ) \
  { \
-     __m256i dm = _mm256_and_si256( db, m256_one_64 ) ; \
-     dm = mm256_negate_32( _mm256_or_si256( dm, \
-                         _mm256_slli_epi64( dm, 32 ) ) ); \
+     __m256i dm = _mm256_cmpgt_epi64( zero, _mm256_slli_epi64( db, u ) ); \
     m0 = _mm256_xor_si256( m0, _mm256_and_si256( dm, \
                                          m256_const1_64( tp[0] ) ) ); \
     m1 = _mm256_xor_si256( m1, _mm256_and_si256( dm, \
@@ -912,7 +905,6 @@ do { \
     m7 = _mm256_xor_si256( m7, _mm256_and_si256( dm, \
                                          m256_const1_64( tp[7] ) ) ); \
     tp += 8; \
-     db = _mm256_srli_epi64( db, 1 ); \
  } \
 } while (0)

@@ -961,47 +953,28 @@ do { \

 #define READ_STATE_BIG(sc) \
 do { \
-   c0 = sc->h[0x0]; \
-   c1 = sc->h[0x1]; \
-   c2 = sc->h[0x2]; \
-   c3 = sc->h[0x3]; \
-   c4 = sc->h[0x4]; \
-   c5 = sc->h[0x5]; \
-   c6 = sc->h[0x6]; \
-   c7 = sc->h[0x7]; \
+   c0 = sc->h[0]; \
+   c1 = sc->h[1]; \
+   c2 = sc->h[2]; \
+   c3 = sc->h[3]; \
+   c4 = sc->h[4]; \
+   c5 = sc->h[5]; \
+   c6 = sc->h[6]; \
+   c7 = sc->h[7]; \
 } while (0)

 #define WRITE_STATE_BIG(sc) \
 do { \
-   sc->h[0x0] = c0; \
-   sc->h[0x1] = c1; \
-   sc->h[0x2] = c2; \
-   sc->h[0x3] = c3; \
-   sc->h[0x4] = c4; \
-   sc->h[0x5] = c5; \
-   sc->h[0x6] = c6; \
-   sc->h[0x7] = c7; \
+   sc->h[0] = c0; \
+   sc->h[1] = c1; \
+   sc->h[2] = c2; \
+   sc->h[3] = c3; \
+   sc->h[4] = c4; \
+   sc->h[5] = c5; \
+   sc->h[6] = c6; \
+   sc->h[7] = c7; \
 } while (0)

-/*
-#define s0   m0
-#define s1   c0
-#define s2   m1
-#define s3   c1
-#define s4   c2
-#define s5   m2
-#define s6   c3
-#define s7   m3
-#define s8   m4
-#define s9   c4
-#define sA   m5
-#define sB   c5
-#define sC   c6
-#define sD   m6
-#define sE   c7
-#define sF   m7
-*/
-
 #define ROUND_BIG( alpha ) \
 do { \
   __m256i t0, t1, t2, t3; \
@@ -1027,151 +1000,145 @@ do { \
  SBOX( s2, s6, sA, sE ); \
  SBOX( s3, s7, sB, sF ); \
 \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s4, 4 ), \
-                           _mm256_bslli_epi128( s5, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( sD, 4 ), \
-                           _mm256_bslli_epi128( sE, 4 ), 0xAA ); \
+  s4 = mm256_swap64_32( s4 ); \
+  s5 = mm256_swap64_32( s5 ); \
+  sD = mm256_swap64_32( sD ); \
+  sE = mm256_swap64_32( sE ); \
+  t1 = _mm256_blend_epi32( s4, s5, 0xaa ); \
+  t3 = _mm256_blend_epi32( sD, sE, 0xaa ); \
  L( s0, t1, s9, t3 ); \
-  s4 = _mm256_blend_epi32( s4, _mm256_bslli_epi128( t1, 4 ), 0xAA );\
-  s5 = _mm256_blend_epi32( s5, _mm256_bsrli_epi128( t1, 4 ), 0x55 );\
-  sD = _mm256_blend_epi32( sD, _mm256_bslli_epi128( t3, 4 ), 0xAA );\
-  sE = _mm256_blend_epi32( sE, _mm256_bsrli_epi128( t3, 4 ), 0x55 );\
+  s4 = _mm256_blend_epi32( s4, t1, 0x55 ); \
+  s5 = _mm256_blend_epi32( s5, t1, 0xaa ); \
+  sD = _mm256_blend_epi32( sD, t3, 0x55 ); \
+  sE = _mm256_blend_epi32( sE, t3, 0xaa ); \
 \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s5, 4 ), \
-                           _mm256_bslli_epi128( s6, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( sE, 4 ), \
-                           _mm256_bslli_epi128( sF, 4 ), 0xAA ); \
+  s6 = mm256_swap64_32( s6 ); \
+  sF = mm256_swap64_32( sF ); \
+  t1 = _mm256_blend_epi32( s5, s6, 0xaa ); \
+  t3 = _mm256_blend_epi32( sE, sF, 0xaa ); \
  L( s1, t1, sA, t3 ); \
-  s5 = _mm256_blend_epi32( s5, _mm256_bslli_epi128( t1, 4 ), 0xAA );\
-  s6 = _mm256_blend_epi32( s6, _mm256_bsrli_epi128( t1, 4 ), 0x55 );\
-  sE = _mm256_blend_epi32( sE, _mm256_bslli_epi128( t3, 4 ), 0xAA );\
-  sF = _mm256_blend_epi32( sF, _mm256_bsrli_epi128( t3, 4 ), 0x55 );\
+  s5 = _mm256_blend_epi32( s5, t1, 0x55 ); \
+  s6 = _mm256_blend_epi32( s6, t1, 0xaa ); \
+  sE = _mm256_blend_epi32( sE, t3, 0x55 ); \
+  sF = _mm256_blend_epi32( sF, t3, 0xaa ); \
 \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s6, 4 ), \
-                           _mm256_bslli_epi128( s7, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( sF, 4 ), \
-                           _mm256_bslli_epi128( sC, 4 ), 0xAA ); \
+  s7 = mm256_swap64_32( s7 ); \
+  sC = mm256_swap64_32( sC ); \
+  t1 = _mm256_blend_epi32( s6, s7, 0xaa ); \
+  t3 = _mm256_blend_epi32( sF, sC, 0xaa ); \
  L( s2, t1, sB, t3 ); \
-  s6 = _mm256_blend_epi32( s6, _mm256_bslli_epi128( t1, 4 ), 0xAA );\
-  s7 = _mm256_blend_epi32( s7, _mm256_bsrli_epi128( t1, 4 ), 0x55 );\
-  sF = _mm256_blend_epi32( sF, _mm256_bslli_epi128( t3, 4 ), 0xAA );\
-  sC = _mm256_blend_epi32( sC, _mm256_bsrli_epi128( t3, 4 ), 0x55 );\
+  s6 = _mm256_blend_epi32( s6, t1, 0x55 ); \
+  s7 = _mm256_blend_epi32( s7, t1, 0xaa ); \
+  sF = _mm256_blend_epi32( sF, t3, 0x55 ); \
+  sC = _mm256_blend_epi32( sC, t3, 0xaa ); \
+  s6 = mm256_swap64_32( s6 ); \
+  sF = mm256_swap64_32( sF ); \
 \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s7, 4 ), \
-                           _mm256_bslli_epi128( s4, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( sC, 4 ), \
-                           _mm256_bslli_epi128( sD, 4 ), 0xAA ); \
+  t1 = _mm256_blend_epi32( s7, s4, 0xaa ); \
+  t3 = _mm256_blend_epi32( sC, sD, 0xaa ); \
  L( s3, t1, s8, t3 ); \
-  s7 = _mm256_blend_epi32( s7, _mm256_bslli_epi128( t1, 4 ), 0xAA );\
-  s4 = _mm256_blend_epi32( s4, _mm256_bsrli_epi128( t1, 4 ), 0x55 );\
-  sC = _mm256_blend_epi32( sC, _mm256_bslli_epi128( t3, 4 ), 0xAA );\
-  sD = _mm256_blend_epi32( sD, _mm256_bsrli_epi128( t3, 4 ), 0x55 );\
+  s7 = _mm256_blend_epi32( s7, t1, 0x55 ); \
+  s4 = _mm256_blend_epi32( s4, t1, 0xaa ); \
+  sC = _mm256_blend_epi32( sC, t3, 0x55 ); \
+  sD = _mm256_blend_epi32( sD, t3, 0xaa ); \
+  s7 = mm256_swap64_32( s7 ); \
+  sC = mm256_swap64_32( sC ); \
 \
-  t0 = _mm256_blend_epi32( s0, _mm256_bslli_epi128( s8, 4 ), 0xAA ); \
-  t1 = _mm256_blend_epi32( s1, s9, 0xAA ); \
-  t2 = _mm256_blend_epi32( _mm256_bsrli_epi128( s2, 4 ), sA, 0xAA ); \
-  t3 = _mm256_blend_epi32( _mm256_bsrli_epi128( s3, 4 ), \
-                           _mm256_bslli_epi128( sB, 4 ), 0xAA ); \
+  t0 = _mm256_blend_epi32( s0, mm256_swap64_32( s8 ), 0xaa ); \
+  t1 = _mm256_blend_epi32( s1, s9, 0xaa ); \
+  t2 = _mm256_blend_epi32( mm256_swap64_32( s2 ), sA, 0xaa ); \
+  t3 = _mm256_blend_epi32( s3, sB, 0x55 ); \
+  t3 = mm256_swap64_32( t3 ); \
  L( t0, t1, t2, t3 ); \
+  t3 = mm256_swap64_32( t3 ); \
  s0 = _mm256_blend_epi32( s0, t0, 0x55 ); \
-  s8 = _mm256_blend_epi32( s8, _mm256_bsrli_epi128( t0, 4 ), 0x55 ); \
+  s8 = _mm256_blend_epi32( s8, mm256_swap64_32( t0 ), 0x55 ); \
  s1 = _mm256_blend_epi32( s1, t1, 0x55 ); \
-  s9 = _mm256_blend_epi32( s9, t1, 0xAA ); \
-  s2 = _mm256_blend_epi32( s2, _mm256_bslli_epi128( t2, 4 ), 0xAA ); \
-  sA = _mm256_blend_epi32( sA, t2, 0xAA ); \
-  s3 = _mm256_blend_epi32( s3, _mm256_bslli_epi128( t3, 4 ), 0xAA ); \
-  sB = _mm256_blend_epi32( sB, _mm256_bsrli_epi128( t3, 4 ), 0x55 ); \
+  s9 = _mm256_blend_epi32( s9, t1, 0xaa ); \
+  s2 = _mm256_blend_epi32( s2, mm256_swap64_32( t2 ), 0xaa ); \
+  sA = _mm256_blend_epi32( sA, t2, 0xaa ); \
+  s3 = _mm256_blend_epi32( s3, t3, 0xaa ); \
+  sB = _mm256_blend_epi32( sB, t3, 0x55 ); \
 \
-  t0 = _mm256_blend_epi32( _mm256_bsrli_epi128( s4, 4 ), sC, 0xAA ); \
-  t1 = _mm256_blend_epi32( _mm256_bsrli_epi128( s5, 4 ), \
-                           _mm256_bslli_epi128( sD, 4 ), 0xAA ); \
-  t2 = _mm256_blend_epi32( s6, _mm256_bslli_epi128( sE, 4 ), 0xAA ); \
-  t3 = _mm256_blend_epi32( s7, sF, 0xAA ); \
+  t0 = _mm256_blend_epi32( s4, sC, 0xaa ); \
+  t1 = _mm256_blend_epi32( s5, sD, 0xaa ); \
+  t2 = _mm256_blend_epi32( s6, sE, 0xaa ); \
+  t3 = _mm256_blend_epi32( s7, sF, 0xaa ); \
  L( t0, t1, t2, t3 ); \
-  s4 = _mm256_blend_epi32( s4, _mm256_bslli_epi128( t0, 4 ), 0xAA ); \
-  sC = _mm256_blend_epi32( sC, t0, 0xAA ); \
-  s5 = _mm256_blend_epi32( s5, _mm256_bslli_epi128( t1, 4 ), 0xAA ); \
-  sD = _mm256_blend_epi32( sD, _mm256_bsrli_epi128( t1, 4 ), 0x55 ); \
+  s4 = _mm256_blend_epi32( s4, t0, 0x55 ); \
+  sC = _mm256_blend_epi32( sC, t0, 0xaa ); \
+  s5 = _mm256_blend_epi32( s5, t1, 0x55 ); \
+  sD = _mm256_blend_epi32( sD, t1, 0xaa ); \
  s6 = _mm256_blend_epi32( s6, t2, 0x55 ); \
-  sE = _mm256_blend_epi32( sE, _mm256_bsrli_epi128( t2, 4 ), 0x55 ); \
+  sE = _mm256_blend_epi32( sE, t2, 0xaa ); \
  s7 = _mm256_blend_epi32( s7, t3, 0x55 ); \
-  sF = _mm256_blend_epi32( sF, t3, 0xAA ); \
+  sF = _mm256_blend_epi32( sF, t3, 0xaa ); \
+  s4 = mm256_swap64_32( s4 ); \
+  s5 = mm256_swap64_32( s5 ); \
+  sD = mm256_swap64_32( sD ); \
+  sE = mm256_swap64_32( sE ); \
 } while (0)

 #define P_BIG \
 do { \
   __m256i alpha[16]; \
+   const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
   for( int i = 0; i < 16; i++ ) \
      alpha[i] = m256_const1_64( ( (uint64_t*)alpha_n )[i] ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)1 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (1ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)2 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (2ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)3 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (3ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)4 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (4ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)5 << 32 ) \
-                            ^ ( (uint64_t*)alpha_n )[0] ); \
+   alpha[0] = m256_const1_64( (5ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
 } while (0)

 #define PF_BIG \
 do { \
   __m256i alpha[16]; \
+   const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
   for( int i = 0; i < 16; i++ ) \
      alpha[i] = m256_const1_64( ( (uint64_t*)alpha_f )[i] ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)1 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 1ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)2 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 2ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)3 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 3ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)4 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 4ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)5 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 5ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)6 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 6ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)7 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 7ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)8 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 8ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)9 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( ( 9ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)10 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( (10ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( (uint64_t)11 << 32 ) \
-                            ^ ( (uint64_t*)alpha_f )[0] ); \
+   alpha[0] = m256_const1_64( (11ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
 } while (0)

 #define T_BIG \
 do { /* order is important */ \
-   c7 = sc->h[ 0x7 ] = _mm256_xor_si256( sc->h[ 0x7 ], sB ); \
-   c6 = sc->h[ 0x6 ] = _mm256_xor_si256( sc->h[ 0x6 ], sA ); \
-   c5 = sc->h[ 0x5 ] = _mm256_xor_si256( sc->h[ 0x5 ], s9 ); \
-   c4 = sc->h[ 0x4 ] = _mm256_xor_si256( sc->h[ 0x4 ], s8 ); \
-   c3 = sc->h[ 0x3 ] = _mm256_xor_si256( sc->h[ 0x3 ], s3 ); \
-   c2 = sc->h[ 0x2 ] = _mm256_xor_si256( sc->h[ 0x2 ], s2 ); \
-   c1 = sc->h[ 0x1 ] = _mm256_xor_si256( sc->h[ 0x1 ], s1 ); \
-   c0 = sc->h[ 0x0 ] = _mm256_xor_si256( sc->h[ 0x0 ], s0 ); \
+   c7 = sc->h[ 7 ] = _mm256_xor_si256( sc->h[ 7 ], sB ); \
+   c6 = sc->h[ 6 ] = _mm256_xor_si256( sc->h[ 6 ], sA ); \
+   c5 = sc->h[ 5 ] = _mm256_xor_si256( sc->h[ 5 ], s9 ); \
+   c4 = sc->h[ 4 ] = _mm256_xor_si256( sc->h[ 4 ], s8 ); \
+   c3 = sc->h[ 3 ] = _mm256_xor_si256( sc->h[ 3 ], s3 ); \
+   c2 = sc->h[ 2 ] = _mm256_xor_si256( sc->h[ 2 ], s2 ); \
+   c1 = sc->h[ 1 ] = _mm256_xor_si256( sc->h[ 1 ], s1 ); \
+   c0 = sc->h[ 0 ] = _mm256_xor_si256( sc->h[ 0 ], s0 ); \
 } while (0)

 void hamsi_big( hamsi_4way_big_context *sc, __m256i *buf, size_t num )
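Review note on the hamsi hunk above: each round constant used to be built by re-reading `( (uint64_t*)alpha_f )[0]`, and the new lines XOR a precomputed `A0` instead. The sketch below assumes `A0` holds that same first 64-bit word of `alpha_f`, since its definition is not part of this hunk. The table values and the names `demo_alpha`, `load_a0`, `round_const_old`, `round_const_new` and `check_round_constants` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the first two 32-bit words of alpha_f. */
static const uint32_t demo_alpha[2] = { 0x01234567u, 0x89abcdefu };

/* Read the first 64-bit word; memcpy avoids the pointer-cast aliasing. */
static uint64_t load_a0( void )
{
    uint64_t w;
    memcpy( &w, demo_alpha, sizeof w );
    return w;
}

/* Old form: the table is re-read for every round constant. */
static uint64_t round_const_old( uint64_t round )
{
    return ( round << 32 ) ^ load_a0();
}

/* New form: the word is loaded once and passed in, like A0 (assumed). */
static uint64_t round_const_new( uint64_t round, uint64_t a0 )
{
    return ( round << 32 ) ^ a0;
}

/* Rounds 7..11 are the ones visible in the hunk. */
static void check_round_constants( void )
{
    const uint64_t a0 = load_a0();
    for ( uint64_t r = 7; r <= 11; r++ )
        assert( round_const_new( r, a0 ) == round_const_old( r ) );
}
```

If that assumption about `A0` holds, the change is behaviour-preserving and only removes the repeated cast and load from the macro body.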
--- a/algo/heavy/sph_hefty1.c
+++ b/algo/heavy/sph_hefty1.c
@@ -1,382 +0,0 @@
-/*
- * HEFTY1 cryptographic hash function
- *
- * Copyright (c) 2014, dbcc14 <BM-NBx4AKznJuyem3dArgVY8MGyABpihRy5>
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice, this
- *    list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright notice,
- *    this list of conditions and the following disclaimer in the documentation
- *    and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
- * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- * The views and conclusions contained in the software and documentation are those
- * of the authors and should not be interpreted as representing official policies,
- * either expressed or implied, of the FreeBSD Project.
- */
-
-#include <assert.h>
-#include <string.h>
-
-#ifdef _MSC_VER
-#define inline __inline
-#endif
-
-#include "sph_hefty1.h"
-
-#define Min(A, B) (A <= B ? A : B)
-#define RoundFunc(ctx, A, B, C, D, E, F, G, H, W, K)                    \
-    {                                                                   \
-        /* To thwart parallelism, Br modifies itself each time it's     \
-         * called.  This also means that calling it in different        \
-         * orders yeilds different results.  In C the order of          \
-         * evaluation of function arguments and + operands are          \
-         * unspecified (and depends on the compiler), so we must make   \
-         * the order of Br calls explicit.                              \
-         */                                                             \
-        uint32_t brG = Br(ctx, G);                                      \
-        uint32_t tmp1 = Ch(E, Br(ctx, F), brG) + H + W + K;             \
-        uint32_t tmp2 = tmp1 + Sigma1(Br(ctx, E));                      \
-        uint32_t brC = Br(ctx, C);                                      \
-        uint32_t brB = Br(ctx, B);                                      \
-        uint32_t tmp3 = Ma(Br(ctx, A), brB, brC);                       \
-        uint32_t tmp4 = tmp3 + Sigma0(Br(ctx, A));                      \
-        H = G;                                                          \
-        G = F;                                                          \
-        F = E;                                                          \
-        E = D + Br(ctx, tmp2);                                          \
-        D = C;                                                          \
-        C = B;                                                          \
-        B = A;                                                          \
-        A = tmp2 + tmp4;                                                \
-    }                                                                   \
-
-/* Nothing up my sleeve constants */
-const static uint32_t K[64] = {
-    0x428a2f98UL, 0x71374491UL, 0xb5c0fbcfUL, 0xe9b5dba5UL,
-    0x3956c25bUL, 0x59f111f1UL, 0x923f82a4UL, 0xab1c5ed5UL,
-    0xd807aa98UL, 0x12835b01UL, 0x243185beUL, 0x550c7dc3UL,
-    0x72be5d74UL, 0x80deb1feUL, 0x9bdc06a7UL, 0xc19bf174UL,
-    0xe49b69c1UL, 0xefbe4786UL, 0x0fc19dc6UL, 0x240ca1ccUL,
-    0x2de92c6fUL, 0x4a7484aaUL, 0x5cb0a9dcUL, 0x76f988daUL,
-    0x983e5152UL, 0xa831c66dUL, 0xb00327c8UL, 0xbf597fc7UL,
-    0xc6e00bf3UL, 0xd5a79147UL, 0x06ca6351UL, 0x14292967UL,
-    0x27b70a85UL, 0x2e1b2138UL, 0x4d2c6dfcUL, 0x53380d13UL,
-    0x650a7354UL, 0x766a0abbUL, 0x81c2c92eUL, 0x92722c85UL,
-    0xa2bfe8a1UL, 0xa81a664bUL, 0xc24b8b70UL, 0xc76c51a3UL,
-    0xd192e819UL, 0xd6990624UL, 0xf40e3585UL, 0x106aa070UL,
-    0x19a4c116UL, 0x1e376c08UL, 0x2748774cUL, 0x34b0bcb5UL,
-    0x391c0cb3UL, 0x4ed8aa4aUL, 0x5b9cca4fUL, 0x682e6ff3UL,
-    0x748f82eeUL, 0x78a5636fUL, 0x84c87814UL, 0x8cc70208UL,
-    0x90befffaUL, 0xa4506cebUL, 0xbef9a3f7UL, 0xc67178f2UL
-};
-
-/* Initial hash values */
-const static uint32_t H[HEFTY1_STATE_WORDS] = {
-    0x6a09e667UL,
-    0xbb67ae85UL,
-    0x3c6ef372UL,
-    0xa54ff53aUL,
-    0x510e527fUL,
-    0x9b05688cUL,
-    0x1f83d9abUL,
-    0x5be0cd19UL
-};
-
-static inline uint32_t Rr(uint32_t X, uint8_t n)
-{
-    return (X >> n) | (X << (32 - n));
-}
-
-static inline uint32_t Ch(uint32_t E, uint32_t F, uint32_t G)
-{
-    return (E & F) ^ (~E & G);
-}
-
-static inline uint32_t Sigma1(uint32_t E)
-{
-    return Rr(E, 6) ^ Rr(E, 11) ^ Rr(E, 25);
-}
-
-static inline uint32_t sigma1(uint32_t X)
-{
-    return Rr(X, 17) ^ Rr(X, 19) ^ (X >> 10);
-}
-
-static inline uint32_t Ma(uint32_t A, uint32_t B, uint32_t C)
-{
-    return (A & B) ^ (A & C) ^ (B & C);
-}
-
-static inline uint32_t Sigma0(uint32_t A)
-{
-    return Rr(A, 2) ^ Rr(A, 13) ^ Rr(A, 22);
-}
-
-static inline uint32_t sigma0(uint32_t X)
-{
-    return Rr(X, 7) ^ Rr(X, 18) ^ (X >> 3);
-}
-
-static inline uint32_t Reverse32(uint32_t n)
-{
-    #if BYTE_ORDER == LITTLE_ENDIAN
-        return n << 24 | (n & 0x0000ff00) << 8 | (n & 0x00ff0000) >> 8 | n >> 24;
-    #else
-        return n;
-    #endif
-}
-
-static inline uint64_t Reverse64(uint64_t n)
-{
-    #if BYTE_ORDER == LITTLE_ENDIAN
-        uint32_t a = n >> 32;
-        uint32_t b = (n << 32) >> 32;
-
-        return (uint64_t)Reverse32(b) << 32 | Reverse32(a);
-    #else
-        return n;
-    #endif
-}
-
-/* Smoosh byte into nibble */
-static inline uint8_t Smoosh4(uint8_t X)
-{
-    return (X >> 4) ^ (X & 0xf);
-}
-
-/* Smoosh 32-bit word into 2-bits */
-static inline uint8_t Smoosh2(uint32_t X)
-{
-    uint16_t w = (X >> 16) ^ (X & 0xffff);
-    uint8_t n = Smoosh4((w >> 8) ^ (w & 0xff));
-    return (n >> 2) ^ (n & 0x3);
-}
-
-static void Mangle(uint32_t *S)
-{
-    uint32_t *R = S;
-    uint32_t *C = &S[1];
-
-    uint8_t r0 = Smoosh4(R[0] >> 24);
-    uint8_t r1 = Smoosh4(R[0] >> 16);
-    uint8_t r2 = Smoosh4(R[0] >> 8);
-    uint8_t r3 = Smoosh4(R[0] & 0xff);
-
-    int i;
-
-    /* Diffuse */
-    uint32_t tmp = 0;
-    for (i = 0; i < HEFTY1_SPONGE_WORDS - 1; i++) {
-        uint8_t r = Smoosh2(tmp);
-        switch (r) {
-        case 0:
-            C[i] ^= Rr(R[0], i + r0);
-            break;
-        case 1:
-            C[i] += Rr(~R[0], i + r1);
-            break;
-        case 2:
-            C[i] &= Rr(~R[0], i + r2);
-            break;
-        case 3:
-            C[i] ^= Rr(R[0], i + r3);
-            break;
-        }
-        tmp ^= C[i];
-    }
-
-    /* Compress */
-    tmp = 0;
-    for (i = 0; i < HEFTY1_SPONGE_WORDS - 1; i++)
-        if (i % 2)
-            tmp ^= C[i];
-        else
-            tmp += C[i];
-    R[0] ^= tmp;
-}
-
-static void Absorb(uint32_t *S, uint32_t X)
-{
-    uint32_t *R = S;
-    R[0] ^= X;
-    Mangle(S);
-}
-
-static uint32_t Squeeze(uint32_t *S)
-{
-    uint32_t Y = S[0];
-    Mangle(S);
-    return Y;
-}
-
-/* Branch, compress and serialize function */
-static inline uint32_t Br(HEFTY1_CTX *ctx, uint32_t X)
-{
-    uint32_t R = Squeeze(ctx->sponge);
-
-    uint8_t r0 = R >> 8;
-    uint8_t r1 = R & 0xff;
-
-    uint32_t Y = 1 << (r0 % 32);
-
-    switch (r1 % 4)
-    {
-    case 0:
-        /* Do nothing */
-        break;
-    case 1:
-        return X & ~Y;
-    case 2:
-        return X | Y;
-    case 3:
-        return X ^ Y;
-    }
-
-    return X;
-}
-
-static void HashBlock(HEFTY1_CTX *ctx)
-{
-    uint32_t A, B, C, D, E, F, G, H;
-    uint32_t W[HEFTY1_BLOCK_BYTES];
-
-    assert(ctx);
-
-    A = ctx->h[0];
-    B = ctx->h[1];
-    C = ctx->h[2];
-    D = ctx->h[3];
-    E = ctx->h[4];
-    F = ctx->h[5];
-    G = ctx->h[6];
-    H = ctx->h[7];
-
-    int t = 0;
-    for (; t < 16; t++) {
-        W[t] = Reverse32(((uint32_t *)&ctx->block[0])[t]); /* To host byte order */
-        Absorb(ctx->sponge, W[t] ^ K[t]);
-    }
-
-    for (t = 0; t < 16; t++) {
-        Absorb(ctx->sponge, D ^ H);
-        RoundFunc(ctx, A, B, C, D, E, F, G, H, W[t], K[t]);
-    }
-    for (t = 16; t < 64; t++) {
-        Absorb(ctx->sponge, H + D);
-        W[t] = sigma1(W[t - 2]) + W[t - 7] + sigma0(W[t - 15]) + W[t - 16];
-        RoundFunc(ctx, A, B, C, D, E, F, G, H, W[t], K[t]);
-    }
-
-    ctx->h[0] += A;
-    ctx->h[1] += B;
-    ctx->h[2] += C;
-    ctx->h[3] += D;
-    ctx->h[4] += E;
-    ctx->h[5] += F;
-    ctx->h[6] += G;
-    ctx->h[7] += H;
-
-    A = 0;
-    B = 0;
-    C = 0;
-    D = 0;
-    E = 0;
-    F = 0;
-    G = 0;
-    H = 0;
-
-    memset(W, 0, sizeof(W));
-}
-
-/* Public interface */
-
-void HEFTY1_Init(HEFTY1_CTX *ctx)
-{
-    assert(ctx);
-
-    memcpy(ctx->h, H, sizeof(ctx->h));
-    memset(ctx->block, 0, sizeof(ctx->block));
-    ctx->written = 0;
-    memset(ctx->sponge, 0, sizeof(ctx->sponge));
-}
-
-void HEFTY1_Update(HEFTY1_CTX *ctx, const void *buf, size_t len)
-{
-    assert(ctx);
-
-    uint64_t read = 0;
-    while (len) {
-        size_t end = (size_t)(ctx->written % HEFTY1_BLOCK_BYTES);
-        size_t count = Min(len, HEFTY1_BLOCK_BYTES - end);
-        memcpy(&ctx->block[end], &((unsigned char *)buf)[read], count);
-        len -= count;
-        read += count;
-        ctx->written += count;
-        if (!(ctx->written % HEFTY1_BLOCK_BYTES))
-            HashBlock(ctx);
-    }
-}
-
-void HEFTY1_Final(unsigned char *digest, HEFTY1_CTX *ctx)
-{
-    assert(digest);
-    assert(ctx);
-
-    /* Pad message (FIPS 180 Section 5.1.1) */
-    size_t used = (size_t)(ctx->written % HEFTY1_BLOCK_BYTES);
-    ctx->block[used++] = 0x80; /* Append 1 to end of message */
-    if (used > HEFTY1_BLOCK_BYTES - 8) {
-        /* We have already written into the last 64bits, so
-         * we must continue into the next block. */
-        memset(&ctx->block[used], 0, HEFTY1_BLOCK_BYTES - used);
-        HashBlock(ctx);
-        used = 0; /* Create a new block (below) */
-    }
-
-    /* All remaining bits to zero */
-    memset(&ctx->block[used], 0, HEFTY1_BLOCK_BYTES - 8 - used);
-
-    /* The last 64bits encode the length (in network byte order) */
-    uint64_t *len = (uint64_t *)&ctx->block[HEFTY1_BLOCK_BYTES - 8];
-    *len = Reverse64(ctx->written*8);
-
-    HashBlock(ctx);
-
-    /* Convert back to network byte order */
-    int i = 0;
-    for (; i < HEFTY1_STATE_WORDS; i++)
-        ctx->h[i] = Reverse32(ctx->h[i]);
-
-    memcpy(digest, ctx->h, sizeof(ctx->h));
-    memset(ctx, 0, sizeof(HEFTY1_CTX));
-}
-
-unsigned char* HEFTY1(const unsigned char *buf, size_t len, unsigned char *digest)
-{
-    HEFTY1_CTX ctx;
-    static unsigned char m[HEFTY1_DIGEST_BYTES];
-
-    if (!digest)
-        digest = m;
-
-    HEFTY1_Init(&ctx);
-    HEFTY1_Update(&ctx, buf, len);
-    HEFTY1_Final(digest, &ctx);
-
-    return digest;
-}
--- a/algo/heavy/sph_hefty1.h
+++ b/algo/heavy/sph_hefty1.h
@@ -1,66 +0,0 @@
-/*
- * HEFTY1 cryptographic hash function
- *
- * Copyright (c) 2014, dbcc14 <BM-NBx4AKznJuyem3dArgVY8MGyABpihRy5>
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice, this
- *    list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright notice,
- *    this list of conditions and the following disclaimer in the documentation
- *    and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
- * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- * The views and conclusions contained in the software and documentation are those
- * of the authors and should not be interpreted as representing official policies,
- * either expressed or implied, of the FreeBSD Project.
- */
-
-#ifndef __HEFTY1_H__
-#define __HEFTY1_H__
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-#ifndef WIN32
-#include <sys/types.h>
-#endif
-
-#include <inttypes.h>
-
-#define HEFTY1_DIGEST_BYTES 32
-#define HEFTY1_BLOCK_BYTES 64
-#define HEFTY1_STATE_WORDS 8
-#define HEFTY1_SPONGE_WORDS 4
-
-typedef struct HEFTY1_CTX {
-    uint32_t h[HEFTY1_STATE_WORDS];
-    uint8_t  block[HEFTY1_BLOCK_BYTES];
-    uint64_t written;
-    uint32_t sponge[HEFTY1_SPONGE_WORDS];
-} HEFTY1_CTX;
-
-void HEFTY1_Init(HEFTY1_CTX *cxt);
-void HEFTY1_Update(HEFTY1_CTX *cxt, const void *data, size_t len);
-void HEFTY1_Final(unsigned char *digest, HEFTY1_CTX *cxt);
-unsigned char* HEFTY1(const unsigned char *data, size_t len, unsigned char *digest);
-
-#ifdef __cplusplus
-}
-#endif
-
-#endif /* __HEFTY1_H__ */
--- a/algo/hodl/sha512-avx.h
+++ b/algo/hodl/sha512-avx.h
@@ -45,6 +45,6 @@ void sha512Compute32b_parallel(
        uint64_t *data[SHA512_PARALLEL_N],
        uint64_t *digest[SHA512_PARALLEL_N]);

-void sha512ProcessBlock(Sha512Context *context);
+void sha512ProcessBlock(Sha512Context contexti[2] );

 #endif
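Review note on the prototype change above: in C a parameter declared as an array is adjusted to a pointer, so `Sha512Context contexti[2]` and `Sha512Context *context` declare the same function type. The `[2]` documents that two contexts are expected but is not enforced by the compiler. A minimal sketch, using a hypothetical `DemoContext` and `demo_process` because the real `Sha512Context` layout is not shown in this hunk:

```c
#include <stdint.h>

/* Hypothetical stand-in; not the real Sha512Context. */
typedef struct { uint64_t h[8]; } DemoContext;

/* Declared with array syntax, like the new prototype. */
void demo_process( DemoContext ctx[2] );

/* Defined with pointer syntax: a compatible declaration of the same function. */
void demo_process( DemoContext *ctx )
{
    ctx[0].h[0] += 1;   /* first of the two contexts  */
    ctx[1].h[0] += 2;   /* second of the two contexts */
}
```

Callers compile unchanged either way; the practical effect of the new spelling is to tell the reader that the function touches `ctx[0]` and `ctx[1]`.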
--- a/algo/keccak/keccak-hash-4way.c
+++ b/algo/keccak/keccak-hash-4way.c
@@ -53,7 +53,8 @@ static const uint64_t RC[] = {
 #define WRITE_STATE(sc)

 #define MOV64(d, s)      (d = s)
-#define XOR64_IOTA       XOR64
+#define XOR64_IOTA       XOR
+

 #define LPAR   (
 #define RPAR   )
@@ -71,14 +72,15 @@ static const uint64_t RC[] = {
 // Targetted macros, keccak-macros.h is included for each target.

 #define DECL64(x)          __m512i x
-#define XOR64(d, a, b)     (d = _mm512_xor_si512(a,b))
+#define XOR(d, a, b)     (d = _mm512_xor_si512(a,b))
+#define XOR64 XOR
 #define AND64(d, a, b)     (d = _mm512_and_si512(a,b))
 #define OR64(d, a, b)      (d = _mm512_or_si512(a,b))
 #define NOT64(d, s)        (d = _mm512_xor_si512(s,m512_neg1))
 #define ROL64(d, v, n)     (d = mm512_rol_64(v, n))
 #define XOROR(d, a, b, c)  (d = mm512_xoror(a, b, c))
 #define XORAND(d, a, b, c) (d = mm512_xorand(a, b, c))
-
+#define XOR3( d, a, b, c ) (d = mm512_xor3( a, b, c ))

 #include "keccak-macros.c"

@@ -236,6 +238,7 @@ keccak512_8way_close(void *cc, void *dst)
 #undef INPUT_BUF
 #undef DECL64
 #undef XOR64
+#undef XOR
 #undef AND64
 #undef OR64
 #undef NOT64
@@ -243,7 +246,7 @@ keccak512_8way_close(void *cc, void *dst)
 #undef KECCAK_F_1600
 #undef XOROR
 #undef XORAND
-
+#undef XOR3
 #endif  // AVX512

 // AVX2
@@ -255,13 +258,15 @@ keccak512_8way_close(void *cc, void *dst)
 } while (0)

 #define DECL64(x)        __m256i x
-#define XOR64(d, a, b)   (d = _mm256_xor_si256(a,b))
+#define XOR(d, a, b)    (d = _mm256_xor_si256(a,b))
+#define XOR64 XOR
 #define AND64(d, a, b)   (d = _mm256_and_si256(a,b))
 #define OR64(d, a, b)    (d = _mm256_or_si256(a,b))
 #define NOT64(d, s)      (d = _mm256_xor_si256(s,m256_neg1))
 #define ROL64(d, v, n)   (d = mm256_rol_64(v, n))
 #define XOROR(d, a, b, c) (d = _mm256_xor_si256(a, _mm256_or_si256(b, c)))
 #define XORAND(d, a, b, c) (d = _mm256_xor_si256(a, _mm256_and_si256(b, c)))
+#define XOR3( d, a, b, c ) (d = mm256_xor3( a, b, c ))

 #include "keccak-macros.c"

@@ -421,6 +426,7 @@ keccak512_4way_close(void *cc, void *dst)
 #undef INPUT_BUF
 #undef DECL64
 #undef XOR64
+#undef XOR
 #undef AND64
 #undef OR64
 #undef NOT64
@@ -428,5 +434,6 @@ keccak512_4way_close(void *cc, void *dst)
 #undef KECCAK_F_1600
 #undef XOROR
 #undef XORAND
+#undef XOR3

 #endif  // AVX2
--- a/algo/keccak/keccak-macros.c
+++ b/algo/keccak/keccak-macros.c
@@ -1,6 +1,19 @@
 #ifdef TH_ELT
 #undef TH_ELT
 #endif
+
+#define TH_ELT(t, c0, c1, c2, c3, c4, d0, d1, d2, d3, d4)   do { \
+    DECL64(tt0); \
+    DECL64(tt1); \
+    XOR3( tt0, d0, d1, d4 ); \
+    XOR( tt1, d2, d3 ); \
+    XOR( tt0, tt0, tt1 ); \
+    ROL64( tt0, tt0, 1 ); \
+    XOR3( tt1, c0, c1, c4 ); \
+    XOR3( tt0, tt0, c2, c3 ); \
+    XOR( t, tt0, tt1 ); \
+} while (0)
+/*
 #define TH_ELT(t, c0, c1, c2, c3, c4, d0, d1, d2, d3, d4)   do { \
                DECL64(tt0); \
                DECL64(tt1); \
@@ -17,7 +30,7 @@
                XOR64(tt2, tt2, tt3); \
                XOR64(t, tt0, tt2); \
        } while (0)
-
+*/
 #ifdef THETA
 #undef THETA
 #endif
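Review note on the new `TH_ELT` above: it can be checked against the plain theta formula, `t = rol64( d0^d1^d2^d3^d4, 1 ) ^ c0^c1^c2^c3^c4`, by instantiating the macro for a scalar target. The scalar definitions of `DECL64`, `XOR`, `XOR3` and `ROL64` below are assumptions made for this sketch (the diff only defines them for `__m512i` and `__m256i`), and `th_elt_ref` and `th_elt_macro` are hypothetical helper names.

```c
#include <stdint.h>

/* Hypothetical scalar target: one 64-bit lane instead of a SIMD vector. */
#define DECL64(x)           uint64_t x
#define XOR(d, a, b)        (d = (a) ^ (b))
#define XOR3(d, a, b, c)    (d = (a) ^ (b) ^ (c))
#define ROL64(d, v, n)      (d = ((v) << (n)) | ((v) >> (64 - (n))))

/* Macro body copied from the hunk above. */
#define TH_ELT(t, c0, c1, c2, c3, c4, d0, d1, d2, d3, d4)   do { \
    DECL64(tt0); \
    DECL64(tt1); \
    XOR3( tt0, d0, d1, d4 ); \
    XOR( tt1, d2, d3 ); \
    XOR( tt0, tt0, tt1 ); \
    ROL64( tt0, tt0, 1 ); \
    XOR3( tt1, c0, c1, c4 ); \
    XOR3( tt0, tt0, c2, c3 ); \
    XOR( t, tt0, tt1 ); \
} while (0)

/* Reference: column parities written out directly. */
static uint64_t th_elt_ref( const uint64_t c[5], const uint64_t d[5] )
{
    uint64_t dp = d[0] ^ d[1] ^ d[2] ^ d[3] ^ d[4];
    uint64_t cp = c[0] ^ c[1] ^ c[2] ^ c[3] ^ c[4];
    return ( ( dp << 1 ) | ( dp >> 63 ) ) ^ cp;
}

/* Same value computed through the macro. */
static uint64_t th_elt_macro( const uint64_t c[5], const uint64_t d[5] )
{
    uint64_t t;
    TH_ELT( t, c[0], c[1], c[2], c[3], c[4], d[0], d[1], d[2], d[3], d[4] );
    return t;
}
```

The new macro issues three `XOR3` and three `XOR` operations plus the rotate. Whether that is a real saving depends on how `mm512_xor3` and `mm256_xor3` are implemented, which this diff does not show.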
--- a/algo/luffa/luffa-hash-2way.c
+++ b/algo/luffa/luffa-hash-2way.c
@@ -62,186 +62,66 @@ static const uint32 CNS_INIT[128] __attribute((aligned(64))) = {

 #define cns4w(i)  m512_const1_128( ( (__m128i*)CNS_INIT)[i] )

-#define ADD_CONSTANT4W(a,b,c0,c1)\
-    a = _mm512_xor_si512(a,c0);\
-    b = _mm512_xor_si512(b,c1);
+#define ADD_CONSTANT4W( a, b, c0, c1 ) \
+    a = _mm512_xor_si512( a, c0 ); \
+    b = _mm512_xor_si512( b, c1 );

 #define MULT24W( a0, a1 ) \
-do { \
+{ \
  __m512i b = _mm512_xor_si512( a0, \
                     _mm512_maskz_shuffle_epi32( 0xbbbb, a1, 16 ) ); \
-  a0 = _mm512_or_si512( _mm512_bsrli_epi128(  b, 4 ), \
-                        _mm512_bslli_epi128( a1,12 ) ); \
-  a1 = _mm512_or_si512( _mm512_bsrli_epi128( a1, 4 ), \
-                        _mm512_bslli_epi128(  b,12 ) ); \
-} while(0)
+  a0 = _mm512_alignr_epi8( a1,  b, 4 ); \
+  a1 = _mm512_alignr_epi8(  b, a1, 4 ); \
+}

-/*
-#define MULT24W( a0, a1, mask ) \
-do { \
-  __m512i b = _mm512_xor_si512( a0, \
-                   _mm512_shuffle_epi32( _mm512_and_si512(a1,mask), 16 ) ); \
-  a0 = _mm512_or_si512( _mm512_bsrli_epi128(b,4), _mm512_bslli_epi128(a1,12) );\
-  a1 = _mm512_or_si512( _mm512_bsrli_epi128(a1,4), _mm512_bslli_epi128(b,12) );\
-} while(0)
-*/
-
-// confirm pointer arithmetic
-// ok but use array indexes
-#define STEP_PART4W(x,c0,c1,t)\
-    SUBCRUMB4W(*x,*(x+1),*(x+2),*(x+3),*t);\
-    SUBCRUMB4W(*(x+5),*(x+6),*(x+7),*(x+4),*t);\
-    MIXWORD4W(*x,*(x+4),*t,*(t+1));\
-    MIXWORD4W(*(x+1),*(x+5),*t,*(t+1));\
-    MIXWORD4W(*(x+2),*(x+6),*t,*(t+1));\
-    MIXWORD4W(*(x+3),*(x+7),*t,*(t+1));\
-    ADD_CONSTANT4W(*x, *(x+4), c0, c1);
-
-#define SUBCRUMB4W(a0,a1,a2,a3,t)\
-    t  = a0;\
+#define SUBCRUMB4W( a0, a1, a2, a3 ) \
+{ \
+    __m512i t = a0; \
    a0 = mm512_xoror( a3, a0, a1 ); \
-    a2 = _mm512_xor_si512(a2,a3);\
+    a2 = _mm512_xor_si512( a2, a3 ); \
    a1 = _mm512_ternarylogic_epi64( a1, a3, t, 0x87 ); /* a1 xnor (a3 & t) */ \
    a3 = mm512_xorand( a2, a3, t ); \
-    a2 = mm512_xorand( a1, a2, a0);\
-    a1 = _mm512_or_si512(a1,a3);\
-    a3 = _mm512_xor_si512(a3,a2);\
-    t  = _mm512_xor_si512(t,a1);\
-    a2 = _mm512_and_si512(a2,a1);\
-    a1 = mm512_xnor(a1,a0);\
-    a0 = t;
+    a2 = mm512_xorand( a1, a2, a0); \
+    a1 = _mm512_or_si512( a1, a3 ); \
+    a3 = _mm512_xor_si512( a3, a2 ); \
+    t  = _mm512_xor_si512( t, a1 ); \
+    a2 = _mm512_and_si512( a2, a1 ); \
+    a1 = mm512_xnor( a1, a0 ); \
+    a0 = t; \
+}

-/*
-#define SUBCRUMB4W(a0,a1,a2,a3,t)\
-    t  = _mm512_load_si512(&a0);\
-    a0 = _mm512_or_si512(a0,a1);\
-    a2 = _mm512_xor_si512(a2,a3);\
-    a1 = _mm512_andnot_si512(a1, m512_neg1 );\
-    a0 = _mm512_xor_si512(a0,a3);\
-    a3 = _mm512_and_si512(a3,t);\
-    a1 = _mm512_xor_si512(a1,a3);\
-    a3 = _mm512_xor_si512(a3,a2);\
-    a2 = _mm512_and_si512(a2,a0);\
-    a0 = _mm512_andnot_si512(a0, m512_neg1 );\
-    a2 = _mm512_xor_si512(a2,a1);\
-    a1 = _mm512_or_si512(a1,a3);\
-    t  = _mm512_xor_si512(t,a1);\
-    a3 = _mm512_xor_si512(a3,a2);\
-    a2 = _mm512_and_si512(a2,a1);\
-    a1 = _mm512_xor_si512(a1,a0);\
-    a0 = _mm512_load_si512(&t);
-*/
+#define MIXWORD4W( a, b ) \
+    b = _mm512_xor_si512( a, b ); \
+    a = _mm512_xor_si512( b, _mm512_rol_epi32( a, 2 ) ); \
+    b = _mm512_xor_si512( a, _mm512_rol_epi32( b, 14 ) ); \
+    a = _mm512_xor_si512( b, _mm512_rol_epi32( a, 10 ) ); \
+    b = _mm512_rol_epi32( b, 1 );

-#define MIXWORD4W(a,b,t1,t2)\
-    b  = _mm512_xor_si512(a,b);\
-    t1 = _mm512_slli_epi32(a,2);\
-    t2 = _mm512_srli_epi32(a,30);\
-    a  = mm512_xoror( b, t1, t2 ); \
-    t1 = _mm512_slli_epi32(b,14);\
-    t2 = _mm512_srli_epi32(b,18);\
-    b  = _mm512_or_si512(t1,t2);\
-    b  = mm512_xoror( a, t1, t2 ); \
-    t1 = _mm512_slli_epi32(a,10);\
-    t2 = _mm512_srli_epi32(a,22);\
-    a  = mm512_xoror( b, t1, t2 ); \
-    t1 = _mm512_slli_epi32(b,1);\
-    t2 = _mm512_srli_epi32(b,31);\
-    b  = _mm512_or_si512(t1,t2);
+#define STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, c0, c1 ) \
+    SUBCRUMB4W( x0, x1, x2, x3 ); \
+    SUBCRUMB4W( x5, x6, x7, x4 ); \
+    MIXWORD4W( x0, x4 ); \
+    MIXWORD4W( x1, x5 ); \
+    MIXWORD4W( x2, x6 ); \
+    MIXWORD4W( x3, x7 ); \
+    ADD_CONSTANT4W( x0, x4, c0, c1 );

-/*
-#define MIXWORD4W(a,b,t1,t2)\
-    b  = _mm512_xor_si512(a,b);\
-    t1 = _mm512_slli_epi32(a,2);\
-    t2 = _mm512_srli_epi32(a,30);\
-     a = _mm512_or_si512(t1,t2);\
-    a  = _mm512_xor_si512(a,b);\
-    t1 = _mm512_slli_epi32(b,14);\
-    t2 = _mm512_srli_epi32(b,18);\
-    b  = _mm512_or_si512(t1,t2);\
-    b  = _mm512_xor_si512(a,b);\
-    t1 = _mm512_slli_epi32(a,10);\
-    t2 = _mm512_srli_epi32(a,22);\
-    a  = _mm512_or_si512(t1,t2);\
-    a  = _mm512_xor_si512(a,b);\
-    t1 = _mm512_slli_epi32(b,1);\
-    t2 = _mm512_srli_epi32(b,31);\
-    b  = _mm512_or_si512(t1,t2);
-*/
-
-#define STEP_PART24W(a0,a1,t0,t1,c0,c1,tmp0,tmp1)\
-    a1 = _mm512_shuffle_epi32(a1,147);\
-    t0 = _mm512_load_si512(&a1);\
-    a1 = _mm512_unpacklo_epi32(a1,a0);\
-    t0 = _mm512_unpackhi_epi32(t0,a0);\
-    t1 = _mm512_shuffle_epi32(t0,78);\
-    a0 = _mm512_shuffle_epi32(a1,78);\
-    SUBCRUMB4W(t1,t0,a0,a1,tmp0);\
-    t0 = _mm512_unpacklo_epi32(t0,t1);\
-    a1 = _mm512_unpacklo_epi32(a1,a0);\
-    a0 = _mm512_load_si512(&a1);\
-    a0 = _mm512_unpackhi_epi64(a0,t0);\
-    a1 = _mm512_unpacklo_epi64(a1,t0);\
-    a1 = _mm512_shuffle_epi32(a1,57);\
-    MIXWORD4W(a0,a1,tmp0,tmp1);\
-    ADD_CONSTANT4W(a0,a1,c0,c1);
-
-#define NMLTOM7684W(r0,r1,r2,s0,s1,s2,s3,p0,p1,p2,q0,q1,q2,q3)\
-    s2 = _mm512_load_si512(&r1);\
-    q2 = _mm512_load_si512(&p1);\
-    r2 = _mm512_shuffle_epi32(r2,216);\
-    p2 = _mm512_shuffle_epi32(p2,216);\
-    r1 = _mm512_unpacklo_epi32(r1,r0);\
-    p1 = _mm512_unpacklo_epi32(p1,p0);\
-    s2 = _mm512_unpackhi_epi32(s2,r0);\
-    q2 = _mm512_unpackhi_epi32(q2,p0);\
-    s0 = _mm512_load_si512(&r2);\
-    q0 = _mm512_load_si512(&p2);\
-    r2 = _mm512_unpacklo_epi64(r2,r1);\
-    p2 = _mm512_unpacklo_epi64(p2,p1);\
-    s1 = _mm512_load_si512(&s0);\
-    q1 = _mm512_load_si512(&q0);\
-    s0 = _mm512_unpackhi_epi64(s0,r1);\
-    q0 = _mm512_unpackhi_epi64(q0,p1);\
-    r2 = _mm512_shuffle_epi32(r2,225);\
-    p2 = _mm512_shuffle_epi32(p2,225);\
-    r0 = _mm512_load_si512(&s1);\
-    p0 = _mm512_load_si512(&q1);\
-    s0 = _mm512_shuffle_epi32(s0,225);\
-    q0 = _mm512_shuffle_epi32(q0,225);\
-    s1 = _mm512_unpacklo_epi64(s1,s2);\
-    q1 = _mm512_unpacklo_epi64(q1,q2);\
-    r0 = _mm512_unpackhi_epi64(r0,s2);\
-    p0 = _mm512_unpackhi_epi64(p0,q2);\
-    s2 = _mm512_load_si512(&r0);\
-    q2 = _mm512_load_si512(&p0);\
-    s3 = _mm512_load_si512(&r2);\
-    q3 = _mm512_load_si512(&p2);
-
-#define MIXTON7684W(r0,r1,r2,r3,s0,s1,s2,p0,p1,p2,p3,q0,q1,q2)\
-    s0 = _mm512_load_si512(&r0);\
-    q0 = _mm512_load_si512(&p0);\
-    s1 = _mm512_load_si512(&r2);\
-    q1 = _mm512_load_si512(&p2);\
-    r0 = _mm512_unpackhi_epi32(r0,r1);\
-    p0 = _mm512_unpackhi_epi32(p0,p1);\
-    r2 = _mm512_unpackhi_epi32(r2,r3);\
-    p2 = _mm512_unpackhi_epi32(p2,p3);\
-    s0 = _mm512_unpacklo_epi32(s0,r1);\
-    q0 = _mm512_unpacklo_epi32(q0,p1);\
-    s1 = _mm512_unpacklo_epi32(s1,r3);\
-    q1 = _mm512_unpacklo_epi32(q1,p3);\
-    r1 = _mm512_load_si512(&r0);\
-    p1 = _mm512_load_si512(&p0);\
-    r0 = _mm512_unpackhi_epi64(r0,r2);\
-    p0 = _mm512_unpackhi_epi64(p0,p2);\
-    s0 = _mm512_unpackhi_epi64(s0,s1);\
-    q0 = _mm512_unpackhi_epi64(q0,q1);\
-    r1 = _mm512_unpacklo_epi64(r1,r2);\
-    p1 = _mm512_unpacklo_epi64(p1,p2);\
-    s2 = _mm512_load_si512(&r0);\
-    q2 = _mm512_load_si512(&p0);\
-    s1 = _mm512_load_si512(&r1);\
-    q1 = _mm512_load_si512(&p1);
+#define STEP_PART24W( a0, a1, t0, t1, c0, c1 ) \
+    a1 = _mm512_shuffle_epi32( a1, 147 ); \
+    t0 = _mm512_load_si512( &a1 ); \
+    a1 = _mm512_unpacklo_epi32( a1, a0 ); \
+    t0 = _mm512_unpackhi_epi32( t0, a0 ); \
+    t1 = _mm512_shuffle_epi32( t0, 78 ); \
+    a0 = _mm512_shuffle_epi32( a1, 78 ); \
+    SUBCRUMB4W( t1, t0, a0, a1 ); \
+    t0 = _mm512_unpacklo_epi32( t0, t1 ); \
+    a1 = _mm512_unpacklo_epi32( a1, a0 ); \
+    a0 = _mm512_load_si512( &a1 ); \
+    a0 = _mm512_unpackhi_epi64( a0, t0 ); \
+    a1 = _mm512_unpacklo_epi64( a1, t0 ); \
+    a1 = _mm512_shuffle_epi32( a1, 57 ); \
+    MIXWORD4W( a0, a1 ); \
+    ADD_CONSTANT4W( a0, a1, c0, c1 );

 #define NMLTOM10244W(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
    s1 = _mm512_load_si512(&r3);\
@@ -279,8 +159,7 @@ void rnd512_4way( luffa_4way_context *state, __m512i *msg )
    __m512i t0, t1;
    __m512i *chainv = state->chainv;
    __m512i msg0, msg1;
-    __m512i tmp[2];
-    __m512i x[8];
+    __m512i x0, x1, x2, x3, x4, x5, x6, x7;

    t0 = mm512_xor3( chainv[0], chainv[2], chainv[4] );
    t1 = mm512_xor3( chainv[1], chainv[3], chainv[5] );
@@ -372,42 +251,30 @@ void rnd512_4way( luffa_4way_context *state, __m512i *msg )
    chainv[7] = _mm512_rol_epi32( chainv[7], 3 );
    chainv[9] = _mm512_rol_epi32( chainv[9], 4 );

-    NMLTOM10244W( chainv[0], chainv[2], chainv[4], chainv[6],
-                x[0], x[1], x[2], x[3],
-                chainv[1],chainv[3],chainv[5],chainv[7],
-                x[4], x[5], x[6], x[7] );
+    NMLTOM10244W( chainv[0], chainv[2], chainv[4], chainv[6], x0, x1, x2, x3,
+                  chainv[1], chainv[3], chainv[5], chainv[7], x4, x5, x6, x7 );

-    STEP_PART4W( &x[0], cns4w( 0), cns4w( 1), &tmp[0] );
-    STEP_PART4W( &x[0], cns4w( 2), cns4w( 3), &tmp[0] );
-    STEP_PART4W( &x[0], cns4w( 4), cns4w( 5), &tmp[0] );
-    STEP_PART4W( &x[0], cns4w( 6), cns4w( 7), &tmp[0] );
-    STEP_PART4W( &x[0], cns4w( 8), cns4w( 9), &tmp[0] );
-    STEP_PART4W( &x[0], cns4w(10), cns4w(11), &tmp[0] );
-    STEP_PART4W( &x[0], cns4w(12), cns4w(13), &tmp[0] );
-    STEP_PART4W( &x[0], cns4w(14), cns4w(15), &tmp[0] );
+    STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, cns4w( 0), cns4w( 1) );
+    STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, cns4w( 2), cns4w( 3) );
+    STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, cns4w( 4), cns4w( 5) );
+    STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, cns4w( 6), cns4w( 7) );
+    STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, cns4w( 8), cns4w( 9) );
+    STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, cns4w(10), cns4w(11) );
+    STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, cns4w(12), cns4w(13) );
+    STEP_PART4W( x0, x1, x2, x3, x4, x5, x6, x7, cns4w(14), cns4w(15) );

-    MIXTON10244W( x[0], x[1], x[2], x[3],
-                chainv[0], chainv[2], chainv[4],chainv[6],
-                x[4], x[5], x[6], x[7],
-                chainv[1],chainv[3],chainv[5],chainv[7]);
+    MIXTON10244W( x0, x1, x2, x3, chainv[0], chainv[2], chainv[4], chainv[6],
+                  x4, x5, x6, x7, chainv[1], chainv[3], chainv[5], chainv[7] );

    /* Process last 256-bit block */
-    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(16), cns4w(17),
-                tmp[0], tmp[1] );
-    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(18), cns4w(19),
-                tmp[0], tmp[1] );
-    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(20), cns4w(21),
-                tmp[0], tmp[1] );
-    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(22), cns4w(23),
-                tmp[0], tmp[1] );
-    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(24), cns4w(25),
-                tmp[0], tmp[1] );
-    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(26), cns4w(27),
-                tmp[0], tmp[1] );
-    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(28), cns4w(29),
-                tmp[0], tmp[1] );
-    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(30), cns4w(31),
-                tmp[0], tmp[1] );
+    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(16), cns4w(17) );
+    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(18), cns4w(19) );
+    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(20), cns4w(21) );
+    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(22), cns4w(23) );
+    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(24), cns4w(25) );
+    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(26), cns4w(27) );
+    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(28), cns4w(29) );
+    STEP_PART24W( chainv[8], chainv[9], t0, t1, cns4w(30), cns4w(31) );
 }

 void finalization512_4way( luffa_4way_context *state, uint32 *b )
@@ -683,10 +550,11 @@ int luffa_4way_update_close( luffa_4way_context *state,

 #define cns(i)  m256_const1_128( ( (__m128i*)CNS_INIT)[i] )

-#define ADD_CONSTANT(a,b,c0,c1)\
-    a = _mm256_xor_si256(a,c0);\
-    b = _mm256_xor_si256(b,c1);
+#define ADD_CONSTANT( a, b, c0, c1 ) \
+    a = _mm256_xor_si256( a, c0 ); \
+    b = _mm256_xor_si256( b, c1 );

+/*
 #define MULT2( a0, a1, mask ) \
 do { \
  __m256i b = _mm256_xor_si256( a0, \
@@ -694,127 +562,83 @@ do { \
  a0 = _mm256_or_si256( _mm256_srli_si256(b,4), _mm256_slli_si256(a1,12) ); \
  a1 = _mm256_or_si256( _mm256_srli_si256(a1,4), _mm256_slli_si256(b,12) );  \
 } while(0)
+*/

-#define STEP_PART(x,c0,c1,t)\
-    SUBCRUMB(*x,*(x+1),*(x+2),*(x+3),*t);\
-    SUBCRUMB(*(x+5),*(x+6),*(x+7),*(x+4),*t);\
-    MIXWORD(*x,*(x+4),*t,*(t+1));\
-    MIXWORD(*(x+1),*(x+5),*t,*(t+1));\
-    MIXWORD(*(x+2),*(x+6),*t,*(t+1));\
-    MIXWORD(*(x+3),*(x+7),*t,*(t+1));\
-    ADD_CONSTANT(*x, *(x+4), c0, c1);
+#define MULT2( a0, a1, mask ) \
+{ \
+  __m256i b = _mm256_xor_si256( a0, \
+                 _mm256_shuffle_epi32( _mm256_and_si256( a1, mask ), 16 ) ); \
+  a0 = _mm256_alignr_epi8( a1,  b, 4 ); \
+  a1 = _mm256_alignr_epi8(  b, a1, 4 ); \
+}

-#define SUBCRUMB(a0,a1,a2,a3,t)\
-    t  = a0;\
-    a0 = _mm256_or_si256(a0,a1);\
-    a2 = _mm256_xor_si256(a2,a3);\
-    a1 = mm256_not( a1 );\
-    a0 = _mm256_xor_si256(a0,a3);\
-    a3 = _mm256_and_si256(a3,t);\
-    a1 = _mm256_xor_si256(a1,a3);\
-    a3 = _mm256_xor_si256(a3,a2);\
-    a2 = _mm256_and_si256(a2,a0);\
-    a0 = mm256_not( a0 );\
-    a2 = _mm256_xor_si256(a2,a1);\
-    a1 = _mm256_or_si256(a1,a3);\
-    t  = _mm256_xor_si256(t,a1);\
-    a3 = _mm256_xor_si256(a3,a2);\
-    a2 = _mm256_and_si256(a2,a1);\
-    a1 = _mm256_xor_si256(a1,a0);\
-    a0 = t;\
+#define SUBCRUMB( a0, a1, a2, a3 ) \
+{ \
+    __m256i t = a0; \
+    a0 = _mm256_or_si256( a0, a1 ); \
+    a2 = _mm256_xor_si256( a2, a3 ); \
+    a1 = mm256_not( a1 ); \
+    a0 = _mm256_xor_si256( a0, a3 ); \
+    a3 = _mm256_and_si256( a3, t ); \
+    a1 = _mm256_xor_si256( a1, a3 ); \
+    a3 = _mm256_xor_si256( a3, a2 ); \
+    a2 = _mm256_and_si256( a2, a0 ); \
+    a0 = mm256_not( a0 ); \
+    a2 = _mm256_xor_si256( a2, a1 ); \
+    a1 = _mm256_or_si256(  a1, a3 ); \
+    t  = _mm256_xor_si256(  t, a1 ); \
+    a3 = _mm256_xor_si256( a3, a2 ); \
+    a2 = _mm256_and_si256( a2, a1 ); \
+    a1 = _mm256_xor_si256( a1, a0 ); \
+    a0 = t; \
+}

-#define MIXWORD(a,b,t1,t2)\
-    b  = _mm256_xor_si256(a,b);\
-    t1 = _mm256_slli_epi32(a,2);\
-    t2 = _mm256_srli_epi32(a,30);\
-     a = _mm256_or_si256(t1,t2);\
-    a  = _mm256_xor_si256(a,b);\
-    t1 = _mm256_slli_epi32(b,14);\
-    t2 = _mm256_srli_epi32(b,18);\
-    b  = _mm256_or_si256(t1,t2);\
-    b  = _mm256_xor_si256(a,b);\
-    t1 = _mm256_slli_epi32(a,10);\
-    t2 = _mm256_srli_epi32(a,22);\
-    a  = _mm256_or_si256(t1,t2);\
-    a  = _mm256_xor_si256(a,b);\
-    t1 = _mm256_slli_epi32(b,1);\
-    t2 = _mm256_srli_epi32(b,31);\
-    b  = _mm256_or_si256(t1,t2);
+#define MIXWORD( a, b ) \
+{ \
+    __m256i t1, t2; \
+    b  = _mm256_xor_si256( a, b ); \
+    t1 = _mm256_slli_epi32( a,  2 ); \
+    t2 = _mm256_srli_epi32( a, 30 ); \
+    a  = _mm256_or_si256( t1, t2 ); \
+    a  = _mm256_xor_si256( a, b ); \
+    t1 = _mm256_slli_epi32( b, 14 ); \
+    t2 = _mm256_srli_epi32( b, 18 ); \
+    b  = _mm256_or_si256( t1, t2 ); \
+    b  = _mm256_xor_si256( a, b ); \
+    t1 = _mm256_slli_epi32( a, 10 ); \
+    t2 = _mm256_srli_epi32( a, 22 ); \
+    a  = _mm256_or_si256( t1, t2 ); \
+    a  = _mm256_xor_si256( a, b ); \
+    t1 = _mm256_slli_epi32( b,  1 ); \
+    t2 = _mm256_srli_epi32( b, 31 ); \
+    b  = _mm256_or_si256( t1, t2 ); \
+}

-#define STEP_PART2(a0,a1,t0,t1,c0,c1,tmp0,tmp1)\
-    a1 = _mm256_shuffle_epi32(a1,147);\
-    t0 = _mm256_load_si256(&a1);\
-    a1 = _mm256_unpacklo_epi32(a1,a0);\
-    t0 = _mm256_unpackhi_epi32(t0,a0);\
-    t1 = _mm256_shuffle_epi32(t0,78);\
-    a0 = _mm256_shuffle_epi32(a1,78);\
-    SUBCRUMB(t1,t0,a0,a1,tmp0);\
-    t0 = _mm256_unpacklo_epi32(t0,t1);\
-    a1 = _mm256_unpacklo_epi32(a1,a0);\
-    a0 = _mm256_load_si256(&a1);\
-    a0 = _mm256_unpackhi_epi64(a0,t0);\
-    a1 = _mm256_unpacklo_epi64(a1,t0);\
-    a1 = _mm256_shuffle_epi32(a1,57);\
-    MIXWORD(a0,a1,tmp0,tmp1);\
-    ADD_CONSTANT(a0,a1,c0,c1);
+#define STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, c0, c1 ) \
+    SUBCRUMB( x0, x1, x2, x3 ); \
+    SUBCRUMB( x5, x6, x7, x4 ); \
+    MIXWORD( x0, x4 ); \
+    MIXWORD( x1, x5 ); \
+    MIXWORD( x2, x6 ); \
+    MIXWORD( x3, x7 ); \
+    ADD_CONSTANT( x0, x4, c0, c1 );

-#define NMLTOM768(r0,r1,r2,s0,s1,s2,s3,p0,p1,p2,q0,q1,q2,q3)\
-    s2 = _mm256_load_si256(&r1);\
-    q2 = _mm256_load_si256(&p1);\
-    r2 = _mm256_shuffle_epi32(r2,216);\
-    p2 = _mm256_shuffle_epi32(p2,216);\
-    r1 = _mm256_unpacklo_epi32(r1,r0);\
-    p1 = _mm256_unpacklo_epi32(p1,p0);\
-    s2 = _mm256_unpackhi_epi32(s2,r0);\
-    q2 = _mm256_unpackhi_epi32(q2,p0);\
-    s0 = _mm256_load_si256(&r2);\
-    q0 = _mm256_load_si256(&p2);\
-    r2 = _mm256_unpacklo_epi64(r2,r1);\
-    p2 = _mm256_unpacklo_epi64(p2,p1);\
-    s1 = _mm256_load_si256(&s0);\
-    q1 = _mm256_load_si256(&q0);\
-    s0 = _mm256_unpackhi_epi64(s0,r1);\
-    q0 = _mm256_unpackhi_epi64(q0,p1);\
-    r2 = _mm256_shuffle_epi32(r2,225);\
-    p2 = _mm256_shuffle_epi32(p2,225);\
-    r0 = _mm256_load_si256(&s1);\
-    p0 = _mm256_load_si256(&q1);\
-    s0 = _mm256_shuffle_epi32(s0,225);\
-    q0 = _mm256_shuffle_epi32(q0,225);\
-    s1 = _mm256_unpacklo_epi64(s1,s2);\
-    q1 = _mm256_unpacklo_epi64(q1,q2);\
-    r0 = _mm256_unpackhi_epi64(r0,s2);\
-    p0 = _mm256_unpackhi_epi64(p0,q2);\
-    s2 = _mm256_load_si256(&r0);\
-    q2 = _mm256_load_si256(&p0);\
-    s3 = _mm256_load_si256(&r2);\
-    q3 = _mm256_load_si256(&p2);
-
-#define MIXTON768(r0,r1,r2,r3,s0,s1,s2,p0,p1,p2,p3,q0,q1,q2)\
-    s0 = _mm256_load_si256(&r0);\
-    q0 = _mm256_load_si256(&p0);\
-    s1 = _mm256_load_si256(&r2);\
-    q1 = _mm256_load_si256(&p2);\
-    r0 = _mm256_unpackhi_epi32(r0,r1);\
-    p0 = _mm256_unpackhi_epi32(p0,p1);\
-    r2 = _mm256_unpackhi_epi32(r2,r3);\
-    p2 = _mm256_unpackhi_epi32(p2,p3);\
-    s0 = _mm256_unpacklo_epi32(s0,r1);\
-    q0 = _mm256_unpacklo_epi32(q0,p1);\
-    s1 = _mm256_unpacklo_epi32(s1,r3);\
-    q1 = _mm256_unpacklo_epi32(q1,p3);\
-    r1 = _mm256_load_si256(&r0);\
-    p1 = _mm256_load_si256(&p0);\
-    r0 = _mm256_unpackhi_epi64(r0,r2);\
-    p0 = _mm256_unpackhi_epi64(p0,p2);\
-    s0 = _mm256_unpackhi_epi64(s0,s1);\
-    q0 = _mm256_unpackhi_epi64(q0,q1);\
-    r1 = _mm256_unpacklo_epi64(r1,r2);\
-    p1 = _mm256_unpacklo_epi64(p1,p2);\
-    s2 = _mm256_load_si256(&r0);\
-    q2 = _mm256_load_si256(&p0);\
-    s1 = _mm256_load_si256(&r1);\
-    q1 = _mm256_load_si256(&p1);\
+#define STEP_PART2( a0, a1, t0, t1, c0, c1 ) \
+    a1 = _mm256_shuffle_epi32( a1, 147 ); \
+    t0 = _mm256_load_si256( &a1 ); \
+    a1 = _mm256_unpacklo_epi32( a1, a0 ); \
+    t0 = _mm256_unpackhi_epi32( t0, a0 ); \
+    t1 = _mm256_shuffle_epi32( t0, 78 ); \
+    a0 = _mm256_shuffle_epi32( a1, 78 ); \
+    SUBCRUMB( t1, t0, a0, a1 ); \
+    t0 = _mm256_unpacklo_epi32( t0, t1 ); \
+    a1 = _mm256_unpacklo_epi32( a1, a0 ); \
+    a0 = _mm256_load_si256( &a1 ); \
+    a0 = _mm256_unpackhi_epi64( a0, t0 ); \
+    a1 = _mm256_unpacklo_epi64( a1, t0 ); \
+    a1 = _mm256_shuffle_epi32( a1, 57 ); \
+    MIXWORD( a0, a1 ); \
+    ADD_CONSTANT( a0, a1, c0, c1 );

 #define NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
    s1 = _mm256_load_si256(&r3);\
@@ -857,9 +681,8 @@ void rnd512_2way( luffa_2way_context *state, __m256i *msg )
    __m256i t0, t1;
    __m256i *chainv = state->chainv;
    __m256i msg0, msg1;
-    __m256i tmp[2];
-    __m256i x[8];
-    const __m256i MASK = m256_const1_i128( 0x00000000ffffffff );
+    __m256i x0, x1, x2, x3, x4, x5, x6, x7;
+    const __m256i MASK = m256_const1_i128( 0xffffffff );

    t0 = chainv[0];
    t1 = chainv[1];
@@ -958,42 +781,30 @@ void rnd512_2way( luffa_2way_context *state, __m256i *msg )
    chainv[7] = mm256_rol_32( chainv[7], 3 );
    chainv[9] = mm256_rol_32( chainv[9], 4 );

-    NMLTOM1024( chainv[0], chainv[2], chainv[4], chainv[6],
-                x[0], x[1], x[2], x[3],
-                chainv[1],chainv[3],chainv[5],chainv[7],
-                x[4], x[5], x[6], x[7] );
+    NMLTOM1024( chainv[0], chainv[2], chainv[4], chainv[6], x0, x1, x2, x3,
+                chainv[1], chainv[3], chainv[5], chainv[7], x4, x5, x6, x7 );

-    STEP_PART( &x[0], cns( 0), cns( 1), &tmp[0] );
-    STEP_PART( &x[0], cns( 2), cns( 3), &tmp[0] );
-    STEP_PART( &x[0], cns( 4), cns( 5), &tmp[0] );
-    STEP_PART( &x[0], cns( 6), cns( 7), &tmp[0] );
-    STEP_PART( &x[0], cns( 8), cns( 9), &tmp[0] );
-    STEP_PART( &x[0], cns(10), cns(11), &tmp[0] );
-    STEP_PART( &x[0], cns(12), cns(13), &tmp[0] );
-    STEP_PART( &x[0], cns(14), cns(15), &tmp[0] );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 0), cns( 1) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 2), cns( 3) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 4), cns( 5) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 6), cns( 7) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 8), cns( 9) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(10), cns(11) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(12), cns(13) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(14), cns(15) );

-    MIXTON1024( x[0], x[1], x[2], x[3],
-                chainv[0], chainv[2], chainv[4],chainv[6],
-                x[4], x[5], x[6], x[7],
-                chainv[1],chainv[3],chainv[5],chainv[7]);
+    MIXTON1024( x0, x1, x2, x3, chainv[0], chainv[2], chainv[4], chainv[6],
+                x4, x5, x6, x7, chainv[1], chainv[3], chainv[5], chainv[7]);

    /* Process last 256-bit block */
-    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(16), cns(17),
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(18), cns(19),
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(20), cns(21),
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(22), cns(23),
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(24), cns(25),
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(26), cns(27),
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(28), cns(29),
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(30), cns(31),
-                tmp[0], tmp[1] );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(16), cns(17) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(18), cns(19) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(20), cns(21) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(22), cns(23) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(24), cns(25) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(26), cns(27) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(28), cns(29) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(30), cns(31) );
 }

 /***************************************************/
--- a/algo/luffa/luffa_for_sse2.c
+++ b/algo/luffa/luffa_for_sse2.c
@@ -30,19 +30,6 @@
  a1 = _mm_or_si128( _mm_srli_si128(a1,4), _mm_slli_si128(b,12) );  \
 } while(0)

-/*
-static inline __m256i mult2_avx2( a )
-{ 
-   __m128 a0, a0, b;
-   a0 = mm128_extractlo_256( a );
-   a1 = mm128_extracthi_256( a );
-   b =  _mm_xor_si128( a0, _mm_shuffle_epi32( _mm_and_si128(a1,MASK), 16 ) );
-   a0 = _mm_or_si128( _mm_srli_si128(b,4), _mm_slli_si128(a1,12) );
-   a1 = _mm_or_si128( _mm_srli_si128(a1,4), _mm_slli_si128(b,12) );
-   return mm256_concat_128( a1, a0 );
-}
-*/
-
 #define STEP_PART(x,c,t)\
    SUBCRUMB(*x,*(x+1),*(x+2),*(x+3),*t);\
    SUBCRUMB(*(x+5),*(x+6),*(x+7),*(x+4),*t);\
--- a/algo/lyra2/allium-4way.c
+++ b/algo/lyra2/allium-4way.c
@@ -13,8 +13,7 @@

 #if defined (ALLIUM_16WAY)  

-typedef struct {
-   blake256_16way_context     blake;
+typedef union {
   keccak256_8way_context    keccak;
   cube_4way_2buf_context    cube;
   skein256_8way_context     skein;
@@ -25,41 +24,31 @@ typedef struct {
 #endif
 } allium_16way_ctx_holder;

-static __thread allium_16way_ctx_holder allium_16way_ctx;
-
-bool init_allium_16way_ctx()
-{
-   keccak256_8way_init( &allium_16way_ctx.keccak );
-   skein256_8way_init( &allium_16way_ctx.skein );
-   return true;
-}
-
-void allium_16way_hash( void *state, const void *input )
+static void allium_16way_hash( void *state, const void *midstate_vars, 
+                               const void *midhash, const void *block )
 {
   uint32_t vhash[16*8] __attribute__ ((aligned (128)));
   uint32_t vhashA[16*8] __attribute__ ((aligned (64)));
   uint32_t vhashB[16*8] __attribute__ ((aligned (64)));
-   uint32_t hash0[8] __attribute__ ((aligned (64)));
-   uint32_t hash1[8] __attribute__ ((aligned (64)));
-   uint32_t hash2[8] __attribute__ ((aligned (64)));
-   uint32_t hash3[8] __attribute__ ((aligned (64)));
-   uint32_t hash4[8] __attribute__ ((aligned (64)));
-   uint32_t hash5[8] __attribute__ ((aligned (64)));
-   uint32_t hash6[8] __attribute__ ((aligned (64)));
-   uint32_t hash7[8] __attribute__ ((aligned (64)));
-   uint32_t hash8[8] __attribute__ ((aligned (64)));
-   uint32_t hash9[8] __attribute__ ((aligned (64)));
-   uint32_t hash10[8] __attribute__ ((aligned (64)));
-   uint32_t hash11[8] __attribute__ ((aligned (64)));
-   uint32_t hash12[8] __attribute__ ((aligned (64)));
-   uint32_t hash13[8] __attribute__ ((aligned (64)));
-   uint32_t hash14[8] __attribute__ ((aligned (64)));
-   uint32_t hash15[8] __attribute__ ((aligned (64)));
+   uint32_t hash0[8] __attribute__ ((aligned (32)));
+   uint32_t hash1[8] __attribute__ ((aligned (32)));
+   uint32_t hash2[8] __attribute__ ((aligned (32)));
+   uint32_t hash3[8] __attribute__ ((aligned (32)));
+   uint32_t hash4[8] __attribute__ ((aligned (32)));
+   uint32_t hash5[8] __attribute__ ((aligned (32)));
+   uint32_t hash6[8] __attribute__ ((aligned (32)));
+   uint32_t hash7[8] __attribute__ ((aligned (32)));
+   uint32_t hash8[8] __attribute__ ((aligned (32)));
+   uint32_t hash9[8] __attribute__ ((aligned (32)));
+   uint32_t hash10[8] __attribute__ ((aligned (32)));
+   uint32_t hash11[8] __attribute__ ((aligned (32)));
+   uint32_t hash12[8] __attribute__ ((aligned (32)));
+   uint32_t hash13[8] __attribute__ ((aligned (32)));
+   uint32_t hash14[8] __attribute__ ((aligned (32)));
+   uint32_t hash15[8] __attribute__ ((aligned (32)));
   allium_16way_ctx_holder ctx __attribute__ ((aligned (64)));

-   memcpy( &ctx, &allium_16way_ctx, sizeof(allium_16way_ctx) );
-   blake256_16way_update( &ctx.blake, input + (64<<4), 16 );
-   blake256_16way_close( &ctx.blake, vhash );
+   blake256_16way_final_rounds_le( vhash, midstate_vars, midhash, block );

   dintrlv_16x32( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                  hash8, hash9, hash10, hash11, hash12, hash13, hash14, hash15,
@@ -69,7 +58,7 @@ void allium_16way_hash( void *state, const void *input )
   intrlv_8x64( vhashB, hash8, hash9, hash10, hash11, hash12, hash13, hash14,
                hash15, 256 );
   
-//   rintrlv_8x32_8x64( vhashA, vhash, 256 );
+   keccak256_8way_init( &ctx.keccak );
   keccak256_8way_update( &ctx.keccak, vhashA, 32 );
   keccak256_8way_close( &ctx.keccak, vhashA);
   keccak256_8way_init( &ctx.keccak );
@@ -152,6 +141,7 @@ void allium_16way_hash( void *state, const void *input )
   intrlv_8x64( vhashB, hash8, hash9, hash10, hash11, hash12, hash13, hash14,
                hash15, 256 );

+   skein256_8way_init( &ctx.skein );
   skein256_8way_update( &ctx.skein, vhashA, 32 );
   skein256_8way_close( &ctx.skein, vhashA );
   skein256_8way_init( &ctx.skein );
@@ -199,6 +189,7 @@ void allium_16way_hash( void *state, const void *input )
   groestl256_full( &ctx.groestl, state+416, hash13, 256 );
   groestl256_full( &ctx.groestl, state+448, hash14, 256 );
   groestl256_full( &ctx.groestl, state+480, hash15, 256 );
+
 #endif
 }

@@ -206,35 +197,72 @@ int scanhash_allium_16way( struct work *work, uint32_t max_nonce,
                             uint64_t *hashes_done, struct thr_info *mythr )
 {
   uint32_t hash[8*16] __attribute__ ((aligned (128)));
-   uint32_t vdata[20*16] __attribute__ ((aligned (64)));
+   uint32_t midstate_vars[16*16] __attribute__ ((aligned (64)));
+   __m512i block0_hash[8] __attribute__ ((aligned (64)));
+   __m512i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t phash[8] __attribute__ ((aligned (32))) = 
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
   uint32_t n = first_nonce;
   const uint32_t last_nonce = max_nonce - 16;
-   __m512i  *noncev = (__m512i*)vdata + 19;   // aligned
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
+   const __m512i sixteen = m512_const1_32( 16 );

   if ( bench ) ( (uint32_t*)ptarget )[7] = 0x0000ff;

-   mm512_bswap32_intrlv80_16x32( vdata, pdata );
-   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
-                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
+   // Prehash first block.
+   blake256_transform_le( phash, pdata, 512, 0 );

-   blake256_16way_init( &allium_16way_ctx.blake );
-   blake256_16way_update( &allium_16way_ctx.blake, vdata, 64 );
+   // Interleave hash for second block prehash.
+   block0_hash[0] = _mm512_set1_epi32( phash[0] );
+   block0_hash[1] = _mm512_set1_epi32( phash[1] );
+   block0_hash[2] = _mm512_set1_epi32( phash[2] );
+   block0_hash[3] = _mm512_set1_epi32( phash[3] );
+   block0_hash[4] = _mm512_set1_epi32( phash[4] );
+   block0_hash[5] = _mm512_set1_epi32( phash[5] );
+   block0_hash[6] = _mm512_set1_epi32( phash[6] );
+   block0_hash[7] = _mm512_set1_epi32( phash[7] );
+
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces, add padding.
+   block_buf[ 0] = _mm512_set1_epi32( pdata[16] );
+   block_buf[ 1] = _mm512_set1_epi32( pdata[17] );
+   block_buf[ 2] = _mm512_set1_epi32( pdata[18] );
+   block_buf[ 3] =
+             _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
+                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+ 1, n );
+   block_buf[ 4] = m512_const1_32( 0x80000000 );
+   block_buf[ 5] =
+   block_buf[ 6] = 
+   block_buf[ 7] = 
+   block_buf[ 8] = 
+   block_buf[ 9] = 
+   block_buf[10] = 
+   block_buf[11] = 
+   block_buf[12] = m512_zero;
+   block_buf[13] = m512_one_32;
+   block_buf[14] = m512_zero;
+   block_buf[15] = m512_const1_32( 80*8 );
+
+   // Partially prehash second block without touching nonces in block_buf[3].
+   blake256_16way_round0_prehash_le( midstate_vars, block0_hash, block_buf );

   do {
-     allium_16way_hash( hash, vdata );
+     allium_16way_hash( hash, midstate_vars, block0_hash, block_buf );

     for ( int lane = 0; lane < 16; lane++ ) 
     if ( unlikely( valid_hash( hash+(lane<<3), ptarget ) && !bench ) )
     {
-         pdata[19] = bswap_32( n + lane );
-         submit_solution( work, hash+(lane<<3), mythr );
+        pdata[19] = n + lane;
+        submit_solution( work, hash+(lane<<3), mythr );
     }
-     *noncev = _mm512_add_epi32( *noncev, m512_const1_32( 16 ) );
+     block_buf[ 3] = _mm512_add_epi32( block_buf[ 3], sixteen ); 
     n += 16;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart) );
   pdata[19] = n;
@@ -244,8 +272,7 @@ int scanhash_allium_16way( struct work *work, uint32_t max_nonce,

 #elif defined (ALLIUM_8WAY)  

-typedef struct {
-   blake256_8way_context     blake;
+typedef union {
   keccak256_4way_context    keccak;
   cube_2way_context         cube;
   skein256_4way_context     skein;
@@ -256,19 +283,11 @@ typedef struct {
 #endif
 } allium_8way_ctx_holder;

-static __thread allium_8way_ctx_holder allium_8way_ctx;
-
-bool init_allium_8way_ctx()
-{
-   keccak256_4way_init( &allium_8way_ctx.keccak );
-   skein256_4way_init( &allium_8way_ctx.skein );
-   return true;
-}
-
-void allium_8way_hash( void *hash, const void *input )
+static void allium_8way_hash( void *hash, const void *midstate_vars,
+                               const void *midhash, const void *block )
 {
   uint64_t vhashA[4*8] __attribute__ ((aligned (64)));
-   uint64_t vhashB[4*8] __attribute__ ((aligned (64)));
+   uint64_t vhashB[4*8] __attribute__ ((aligned (32)));
   uint64_t *hash0 = (uint64_t*)hash;
   uint64_t *hash1 = (uint64_t*)hash+ 4;
   uint64_t *hash2 = (uint64_t*)hash+ 8;
@@ -279,15 +298,14 @@ void allium_8way_hash( void *hash, const void *input )
   uint64_t *hash7 = (uint64_t*)hash+28;
   allium_8way_ctx_holder ctx __attribute__ ((aligned (64))); 

-   memcpy( &ctx, &allium_8way_ctx, sizeof(allium_8way_ctx) );
-   blake256_8way_update( &ctx.blake, input + (64<<3), 16 );
-   blake256_8way_close( &ctx.blake, vhashA );
+   blake256_8way_final_rounds_le( vhashA, midstate_vars, midhash, block );

   dintrlv_8x32( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
-                     vhashA, 256 );
+                 vhashA, 256 );
   intrlv_4x64( vhashA, hash0, hash1, hash2, hash3, 256 );
   intrlv_4x64( vhashB, hash4, hash5, hash6, hash7, 256 );

+   keccak256_4way_init( &ctx.keccak );
   keccak256_4way_update( &ctx.keccak, vhashA, 32 );
   keccak256_4way_close( &ctx.keccak, vhashA );
   keccak256_4way_init( &ctx.keccak );
@@ -306,7 +324,6 @@ void allium_8way_hash( void *hash, const void *input )
   LYRA2RE( hash6, 32, hash6, 32, hash6, 32, 1, 8, 8 );
   LYRA2RE( hash7, 32, hash7, 32, hash7, 32, 1, 8, 8 );

-
   intrlv_2x128( vhashA, hash0, hash1, 256 );
   intrlv_2x128( vhashB, hash2, hash3, 256 );
   cube_2way_full( &ctx.cube, vhashA, 256, vhashA, 32 );
@@ -333,6 +350,7 @@ void allium_8way_hash( void *hash, const void *input )
   intrlv_4x64( vhashA, hash0, hash1, hash2, hash3, 256 );
   intrlv_4x64( vhashB, hash4, hash5, hash6, hash7, 256 );

+   skein256_4way_init( &ctx.skein );
   skein256_4way_update( &ctx.skein, vhashA, 32 );
   skein256_4way_close( &ctx.skein, vhashA );
   skein256_4way_init( &ctx.skein );
@@ -341,8 +359,8 @@ void allium_8way_hash( void *hash, const void *input )

 #if defined(__VAES__)

-   uint64_t vhashC[4*2] __attribute__ ((aligned (64)));
-   uint64_t vhashD[4*2] __attribute__ ((aligned (64)));
+   uint64_t vhashC[4*2] __attribute__ ((aligned (32)));
+   uint64_t vhashD[4*2] __attribute__ ((aligned (32)));
   
   rintrlv_4x64_2x128( vhashC, vhashD, vhashA, 256 );
   groestl256_2way_full( &ctx.groestl, vhashC, vhashC, 32 );
@@ -377,36 +395,72 @@ int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
                             uint64_t *hashes_done, struct thr_info *mythr )
 {
   uint64_t hash[4*8] __attribute__ ((aligned (64)));
-   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   uint32_t midstate_vars[16*8] __attribute__ ((aligned (64)));
+   __m256i block0_hash[8] __attribute__ ((aligned (64)));
+   __m256i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t phash[8] __attribute__ ((aligned (32))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
   uint64_t *ptarget = (uint64_t*)work->target;
   const uint32_t first_nonce = pdata[19];
   const uint32_t last_nonce = max_nonce - 8;
   uint32_t n = first_nonce;
-   __m256i  *noncev = (__m256i*)vdata + 19;   // aligned
   const int thr_id = mythr->id;  
   const bool bench = opt_benchmark;
+   const __m256i eight = m256_const1_32( 8 );

-   mm256_bswap32_intrlv80_8x32( vdata, pdata );
-   *noncev = _mm256_set_epi32( n+7, n+6, n+5, n+4, n+3, n+2, n+1, n );
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0 );

-   blake256_8way_init( &allium_8way_ctx.blake );
-   blake256_8way_update( &allium_8way_ctx.blake, vdata, 64 );
+   block0_hash[0] = _mm256_set1_epi32( phash[0] );
+   block0_hash[1] = _mm256_set1_epi32( phash[1] );
+   block0_hash[2] = _mm256_set1_epi32( phash[2] );
+   block0_hash[3] = _mm256_set1_epi32( phash[3] );
+   block0_hash[4] = _mm256_set1_epi32( phash[4] );
+   block0_hash[5] = _mm256_set1_epi32( phash[5] );
+   block0_hash[6] = _mm256_set1_epi32( phash[6] );
+   block0_hash[7] = _mm256_set1_epi32( phash[7] );
+
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces and add padding.
+   block_buf[ 0] = _mm256_set1_epi32( pdata[16] );
+   block_buf[ 1] = _mm256_set1_epi32( pdata[17] );
+   block_buf[ 2] = _mm256_set1_epi32( pdata[18] );
+   block_buf[ 3] =
+            _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+ 1, n );
+   block_buf[ 4] = m256_const1_32( 0x80000000 );
+   block_buf[ 5] =
+   block_buf[ 6] =
+   block_buf[ 7] =
+   block_buf[ 8] =
+   block_buf[ 9] =
+   block_buf[10] =
+   block_buf[11] =
+   block_buf[12] = m256_zero;
+   block_buf[13] = m256_one_32;
+   block_buf[14] = m256_zero;
+   block_buf[15] = m256_const1_32( 80*8 );
+
+   // Partially prehash second block without touching nonces
+   blake256_8way_round0_prehash_le( midstate_vars, block0_hash, block_buf );

   do {
-     allium_8way_hash( hash, vdata );
+     allium_8way_hash( hash, midstate_vars, block0_hash, block_buf );

     for ( int lane = 0; lane < 8; lane++ )
     {
        const uint64_t *lane_hash = hash + (lane<<2);
        if ( unlikely( valid_hash( lane_hash, ptarget ) && !bench ) )
        {
-           pdata[19] = bswap_32( n + lane );
+           pdata[19] = n + lane;
           submit_solution( work, lane_hash, mythr );
        }
     }
     n += 8;
-     *noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
+     block_buf[ 3] = _mm256_add_epi32( block_buf[ 3], eight );
   } while ( likely( (n <= last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
--- a/algo/lyra2/lyra2-gate.c
+++ b/algo/lyra2/lyra2-gate.c
@@ -132,11 +132,11 @@ bool register_lyra2z_algo( algo_gate_t* gate )
 #if defined(LYRA2Z_16WAY)
  gate->miner_thread_init = (void*)&lyra2z_16way_thread_init;
  gate->scanhash   = (void*)&scanhash_lyra2z_16way;
-  gate->hash       = (void*)&lyra2z_16way_hash;
+//  gate->hash       = (void*)&lyra2z_16way_hash;
 #elif defined(LYRA2Z_8WAY)
  gate->miner_thread_init = (void*)&lyra2z_8way_thread_init;
  gate->scanhash   = (void*)&scanhash_lyra2z_8way;
-  gate->hash       = (void*)&lyra2z_8way_hash;
+//  gate->hash       = (void*)&lyra2z_8way_hash;
 #elif defined(LYRA2Z_4WAY)
  gate->miner_thread_init = (void*)&lyra2z_4way_thread_init;
  gate->scanhash   = (void*)&scanhash_lyra2z_4way;
@@ -175,20 +175,16 @@ bool register_lyra2h_algo( algo_gate_t* gate )
 bool register_allium_algo( algo_gate_t* gate )
 {
 #if defined (ALLIUM_16WAY)
-  gate->miner_thread_init = (void*)&init_allium_16way_ctx;
  gate->scanhash  = (void*)&scanhash_allium_16way;
-  gate->hash      = (void*)&allium_16way_hash;
 #elif defined (ALLIUM_8WAY)
-  gate->miner_thread_init = (void*)&init_allium_8way_ctx;
  gate->scanhash  = (void*)&scanhash_allium_8way;
-  gate->hash      = (void*)&allium_8way_hash;
 #else
  gate->miner_thread_init = (void*)&init_allium_ctx;
  gate->scanhash  = (void*)&scanhash_allium;
  gate->hash      = (void*)&allium_hash;
 #endif
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT
-	              | VAES_OPT | VAES256_OPT;
+	                   | VAES_OPT;
  opt_target_factor = 256.0;
  return true;
 };
--- a/algo/lyra2/lyra2-gate.h
+++ b/algo/lyra2/lyra2-gate.h
@@ -99,14 +99,14 @@ bool init_lyra2rev2_ctx();

 #if defined(LYRA2Z_16WAY)

-void lyra2z_16way_hash( void *state, const void *input );
+//void lyra2z_16way_hash( void *state, const void *input );
 int scanhash_lyra2z_16way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr );
 bool lyra2z_16way_thread_init();

 #elif defined(LYRA2Z_8WAY)

-void lyra2z_8way_hash( void *state, const void *input );
+//void lyra2z_8way_hash( void *state, const void *input );
 int scanhash_lyra2z_8way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr );
 bool lyra2z_8way_thread_init();
@@ -163,17 +163,13 @@ bool register_allium_algo( algo_gate_t* gate );

 #if defined(ALLIUM_16WAY)

-void allium_16way_hash( void *state, const void *input );
 int scanhash_allium_16way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr );
-bool init_allium_16way_ctx();

 #elif defined(ALLIUM_8WAY)

-void allium_8way_hash( void *state, const void *input );
 int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr );
-bool init_allium_8way_ctx();

 #else

--- a/algo/lyra2/lyra2z-4way.c
+++ b/algo/lyra2/lyra2z-4way.c
@@ -14,42 +14,32 @@ bool lyra2z_16way_thread_init()
 return ( lyra2z_16way_matrix = _mm_malloc( 2*LYRA2Z_MATRIX_SIZE, 64 ) );
 }

-static __thread blake256_16way_context l2z_16way_blake_mid;
-
-void lyra2z_16way_midstate( const void* input )
-{
-       blake256_16way_init( &l2z_16way_blake_mid );
-       blake256_16way_update( &l2z_16way_blake_mid, input, 64 );
-}
-
-void lyra2z_16way_hash( void *state, const void *input )
+static void lyra2z_16way_hash( void *state, const void *midstate_vars,
+                        const void *midhash, const void *block )
 {
    uint32_t vhash[8*16] __attribute__ ((aligned (128)));
-    uint32_t hash0[8] __attribute__ ((aligned (64)));
-    uint32_t hash1[8] __attribute__ ((aligned (64)));
-    uint32_t hash2[8] __attribute__ ((aligned (64)));
-    uint32_t hash3[8] __attribute__ ((aligned (64)));
-    uint32_t hash4[8] __attribute__ ((aligned (64)));
-    uint32_t hash5[8] __attribute__ ((aligned (64)));
-    uint32_t hash6[8] __attribute__ ((aligned (64)));
-    uint32_t hash7[8] __attribute__ ((aligned (64)));
-    uint32_t hash8[8] __attribute__ ((aligned (64)));
-    uint32_t hash9[8] __attribute__ ((aligned (64)));
-    uint32_t hash10[8] __attribute__ ((aligned (64)));
-    uint32_t hash11[8] __attribute__ ((aligned (64)));
-    uint32_t hash12[8] __attribute__ ((aligned (64)));
-    uint32_t hash13[8] __attribute__ ((aligned (64)));
-    uint32_t hash14[8] __attribute__ ((aligned (64)));
-    uint32_t hash15[8] __attribute__ ((aligned (64)));
-    blake256_16way_context ctx_blake __attribute__ ((aligned (64)));
+    uint32_t hash0[8] __attribute__ ((aligned (32)));
+    uint32_t hash1[8] __attribute__ ((aligned (32)));
+    uint32_t hash2[8] __attribute__ ((aligned (32)));
+    uint32_t hash3[8] __attribute__ ((aligned (32)));
+    uint32_t hash4[8] __attribute__ ((aligned (32)));
+    uint32_t hash5[8] __attribute__ ((aligned (32)));
+    uint32_t hash6[8] __attribute__ ((aligned (32)));
+    uint32_t hash7[8] __attribute__ ((aligned (32)));
+    uint32_t hash8[8] __attribute__ ((aligned (32)));
+    uint32_t hash9[8] __attribute__ ((aligned (32)));
+    uint32_t hash10[8] __attribute__ ((aligned (32)));
+    uint32_t hash11[8] __attribute__ ((aligned (32)));
+    uint32_t hash12[8] __attribute__ ((aligned (32)));
+    uint32_t hash13[8] __attribute__ ((aligned (32)));
+    uint32_t hash14[8] __attribute__ ((aligned (32)));
+    uint32_t hash15[8] __attribute__ ((aligned (32)));

-    memcpy( &ctx_blake, &l2z_16way_blake_mid, sizeof l2z_16way_blake_mid );
-    blake256_16way_update( &ctx_blake, input + (64*16), 16 );
-    blake256_16way_close( &ctx_blake, vhash );
+    blake256_16way_final_rounds_le( vhash, midstate_vars, midhash, block );

    dintrlv_16x32( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
              hash8, hash9, hash10, hash11 ,hash12, hash13, hash14, hash15,
-               vhash, 256 );
+              vhash, 256 );

    intrlv_2x256( vhash, hash0, hash1, 256 );
    LYRA2Z_2WAY( lyra2z_16way_matrix, vhash, 32, vhash, 32, 8, 8, 8 );
@@ -97,40 +87,74 @@ void lyra2z_16way_hash( void *state, const void *input )
 int scanhash_lyra2z_16way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr )
 {
-   uint64_t hash[4*16] __attribute__ ((aligned (128)));
-   uint32_t vdata[20*16] __attribute__ ((aligned (64)));
+   uint32_t hash[8*16] __attribute__ ((aligned (128)));
+   uint32_t midstate_vars[16*16] __attribute__ ((aligned (64)));
+   __m512i block0_hash[8] __attribute__ ((aligned (64)));
+   __m512i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t phash[8] __attribute__ ((aligned (64))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
   uint32_t n = first_nonce;
   const uint32_t last_nonce = max_nonce - 16;
-   __m512i  *noncev = (__m512i*)vdata + 19;   // aligned
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
+   const __m512i sixteen = m512_const1_32( 16 );

-   if ( bench )   ptarget[7] = 0x0000ff;
+   if ( bench ) ( (uint32_t*)ptarget )[7] = 0x0000ff;

-   mm512_bswap32_intrlv80_16x32( vdata, pdata );
-   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0 );
+
+   block0_hash[0] = _mm512_set1_epi32( phash[0] );
+   block0_hash[1] = _mm512_set1_epi32( phash[1] );
+   block0_hash[2] = _mm512_set1_epi32( phash[2] );
+   block0_hash[3] = _mm512_set1_epi32( phash[3] );
+   block0_hash[4] = _mm512_set1_epi32( phash[4] );
+   block0_hash[5] = _mm512_set1_epi32( phash[5] );
+   block0_hash[6] = _mm512_set1_epi32( phash[6] );
+   block0_hash[7] = _mm512_set1_epi32( phash[7] );
+
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces and add padding.
+   block_buf[ 0] = _mm512_set1_epi32( pdata[16] );
+   block_buf[ 1] = _mm512_set1_epi32( pdata[17] );
+   block_buf[ 2] = _mm512_set1_epi32( pdata[18] );
+   block_buf[ 3] =
+             _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
-   lyra2z_16way_midstate( vdata );
+   block_buf[ 4] = m512_const1_32( 0x80000000 );
+   block_buf[ 5] =
+   block_buf[ 6] =
+   block_buf[ 7] =
+   block_buf[ 8] =
+   block_buf[ 9] =
+   block_buf[10] =
+   block_buf[11] =
+   block_buf[12] = m512_zero;
+   block_buf[13] = m512_one_32;
+   block_buf[14] = m512_zero;
+   block_buf[15] = m512_const1_32( 80*8 );
+
+   // Partially prehash second block without touching nonces in block_buf[3].
+   blake256_16way_round0_prehash_le( midstate_vars, block0_hash, block_buf );

   do {
-      lyra2z_16way_hash( hash, vdata );
-
-      for ( int lane = 0; lane < 16; lane++ )
-      {
-        const uint64_t *lane_hash = hash + (lane<<2);
-        if ( unlikely( valid_hash( lane_hash, ptarget ) && !bench ) )
-        {
-           pdata[19] = bswap_32( n + lane );
-           submit_solution( work, lane_hash, mythr );
-        }
-      }
-      *noncev = _mm512_add_epi32( *noncev, m512_const1_32( 16 ) );
-      n += 16;
-   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
+     lyra2z_16way_hash( hash, midstate_vars, block0_hash, block_buf );

+     for ( int lane = 0; lane < 16; lane++ )
+     if ( unlikely( valid_hash( hash+(lane<<3), ptarget ) && !bench ) )
+     {
+        pdata[19] = n + lane;
+        submit_solution( work, hash+(lane<<3), mythr );
+     }
+     block_buf[ 3] = _mm512_add_epi32( block_buf[ 3], sixteen );
+     n += 16;
+   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart) );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
   return 0;
@@ -145,30 +169,20 @@ bool lyra2z_8way_thread_init()
 return ( lyra2z_8way_matrix = _mm_malloc( LYRA2Z_MATRIX_SIZE, 64 ) );
 }

-static __thread blake256_8way_context l2z_8way_blake_mid;
-
-void lyra2z_8way_midstate( const void* input )
-{
-       blake256_8way_init( &l2z_8way_blake_mid );
-       blake256_8way_update( &l2z_8way_blake_mid, input, 64 );
-}
-
-void lyra2z_8way_hash( void *state, const void *input )
+static void lyra2z_8way_hash( void *state, const void *midstate_vars,
+                       const void *midhash, const void *block )
 {
     uint32_t hash0[8] __attribute__ ((aligned (64)));
-     uint32_t hash1[8] __attribute__ ((aligned (64)));
-     uint32_t hash2[8] __attribute__ ((aligned (64)));
-     uint32_t hash3[8] __attribute__ ((aligned (64)));
-     uint32_t hash4[8] __attribute__ ((aligned (64)));
-     uint32_t hash5[8] __attribute__ ((aligned (64)));
-     uint32_t hash6[8] __attribute__ ((aligned (64)));
-     uint32_t hash7[8] __attribute__ ((aligned (64)));
+     uint32_t hash1[8] __attribute__ ((aligned (32)));
+     uint32_t hash2[8] __attribute__ ((aligned (32)));
+     uint32_t hash3[8] __attribute__ ((aligned (32)));
+     uint32_t hash4[8] __attribute__ ((aligned (32)));
+     uint32_t hash5[8] __attribute__ ((aligned (32)));
+     uint32_t hash6[8] __attribute__ ((aligned (32)));
+     uint32_t hash7[8] __attribute__ ((aligned (32)));
     uint32_t vhash[8*8] __attribute__ ((aligned (64)));
-     blake256_8way_context ctx_blake __attribute__ ((aligned (64)));

-     memcpy( &ctx_blake, &l2z_8way_blake_mid, sizeof l2z_8way_blake_mid );
-     blake256_8way_update( &ctx_blake, input + (64*8), 16 );
-     blake256_8way_close( &ctx_blake, vhash );
+     blake256_8way_final_rounds_le( vhash, midstate_vars, midhash, block );

     dintrlv_8x32( hash0, hash1, hash2, hash3,
                   hash4, hash5, hash6, hash7, vhash, 256 );
@@ -182,7 +196,6 @@ void lyra2z_8way_hash( void *state, const void *input )
     LYRA2Z( lyra2z_8way_matrix, hash6, 32, hash6, 32, hash6, 32, 8, 8, 8 );
     LYRA2Z( lyra2z_8way_matrix, hash7, 32, hash7, 32, hash7, 32, 8, 8, 8 );

-
     memcpy( state,     hash0, 32 );
     memcpy( state+ 32, hash1, 32 );
     memcpy( state+ 64, hash2, 32 );
@@ -197,43 +210,78 @@ int scanhash_lyra2z_8way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr )
 {
   uint64_t hash[4*8] __attribute__ ((aligned (64)));
-   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   uint32_t midstate_vars[16*8] __attribute__ ((aligned (64)));
+   __m256i block0_hash[8] __attribute__ ((aligned (64)));
+   __m256i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t phash[8] __attribute__ ((aligned (32))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
+   uint64_t *ptarget = (uint64_t*)work->target;
   const uint32_t first_nonce = pdata[19];
   const uint32_t last_nonce = max_nonce - 8;
   uint32_t n = first_nonce;
-   __m256i  *noncev = (__m256i*)vdata + 19;   // aligned
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
+   const __m256i eight = m256_const1_32( 8 );

-   if ( bench )  ptarget[7] = 0x0000ff;
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0 );

-   mm256_bswap32_intrlv80_8x32( vdata, pdata );
-   *noncev = _mm256_set_epi32( n+7, n+6, n+5, n+4, n+3, n+2, n+1, n );
-   lyra2z_8way_midstate( vdata );
+   block0_hash[0] = _mm256_set1_epi32( phash[0] );
+   block0_hash[1] = _mm256_set1_epi32( phash[1] );
+   block0_hash[2] = _mm256_set1_epi32( phash[2] );
+   block0_hash[3] = _mm256_set1_epi32( phash[3] );
+   block0_hash[4] = _mm256_set1_epi32( phash[4] );
+   block0_hash[5] = _mm256_set1_epi32( phash[5] );
+   block0_hash[6] = _mm256_set1_epi32( phash[6] );
+   block0_hash[7] = _mm256_set1_epi32( phash[7] );
+
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces and add padding.
+   block_buf[ 0] = _mm256_set1_epi32( pdata[16] );
+   block_buf[ 1] = _mm256_set1_epi32( pdata[17] );
+   block_buf[ 2] = _mm256_set1_epi32( pdata[18] );
+   block_buf[ 3] =
+            _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
+   block_buf[ 4] = m256_const1_32( 0x80000000 );
+   block_buf[ 5] =
+   block_buf[ 6] =
+   block_buf[ 7] =
+   block_buf[ 8] =
+   block_buf[ 9] =
+   block_buf[10] =
+   block_buf[11] =
+   block_buf[12] = m256_zero;
+   block_buf[13] = m256_one_32;
+   block_buf[14] = m256_zero;
+   block_buf[15] = m256_const1_32( 80*8 );
+
+   // Partially prehash second block without touching nonces
+   blake256_8way_round0_prehash_le( midstate_vars, block0_hash, block_buf );

   do {
-      lyra2z_8way_hash( hash, vdata );
+     lyra2z_8way_hash( hash, midstate_vars, block0_hash, block_buf );

-      for ( int lane = 0; lane < 8; lane++ )
-      {
+     for ( int lane = 0; lane < 8; lane++ )
+     {
        const uint64_t *lane_hash = hash + (lane<<2);
        if ( unlikely( valid_hash( lane_hash, ptarget ) && !bench ) )
        {
-           pdata[19] = bswap_32( n + lane );
+           pdata[19] = n + lane;
           submit_solution( work, lane_hash, mythr );
        }
-      }
-      *noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
-      n += 8;
-   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart) );
+     }
+     n += 8;
+     block_buf[ 3] = _mm256_add_epi32( block_buf[ 3], eight );
+   } while ( likely( (n <= last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
   return 0;
 }

-
 #elif defined(LYRA2Z_4WAY)
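Review note: both scanhash functions above now build the second 64-byte BLAKE-256 block by hand instead of letting the update/close calls pad it. The following is a minimal scalar sketch of that layout for one lane, assuming an 80-byte block header held as 20 32-bit words. The function names are hypothetical and not part of the patch; only the word positions and constants are taken from the block_buf assignments in the hunks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical scalar stand-in for the vectored block_buf[] setup.
// data[] is the 80-byte header as 20 words; words 0..15 form the first
// block (prehashed once), words 16..19 start the second block.
void build_second_block( uint32_t block[16], const uint32_t data[20],
                         uint32_t nonce )
{
   memset( block, 0, 16 * sizeof(uint32_t) );
   block[ 0] = data[16];
   block[ 1] = data[17];
   block[ 2] = data[18];
   block[ 3] = nonce;        // the only word that changes per hash
   block[ 4] = 0x80000000;   // first padding bit after 80 bytes of data
   block[13] = 1;            // final padding bit before the length field
   block[15] = 80 * 8;       // message length in bits (low word)
}

// Usage sketch: words 5..12 and 14 stay zero, matching the patch.
void second_block_selfcheck( void )
{
   uint32_t data[20], blk[16];
   for ( int i = 0; i < 20; i++ ) data[i] = 100 + i;
   build_second_block( blk, data, 42 );
   assert( blk[0] == 116 && blk[3] == 42 );
   assert( blk[4] == 0x80000000u && blk[13] == 1 && blk[15] == 640 );
   for ( int i = 5; i <= 12; i++ ) assert( blk[i] == 0 );
   assert( blk[14] == 0 );
}
```

Because only word 3 differs between hashes, everything that does not depend on it can be computed once outside the loop, which appears to be the purpose of the round0_prehash calls.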


--- a/algo/lyra2/sponge-2way.c
+++ b/algo/lyra2/sponge-2way.c
@@ -261,7 +261,7 @@ inline void reducedDuplexRowSetup_2way( uint64_t *State, uint64_t *rowIn,
 // overlap it's unified.
 // As a result normal is Nrows-2 / Nrows.
 // for 4 rows: 1 unified, 2 overlap, 1 normal.
-// for 8 rows: 1 unified, 2 overlap, 56 normal.
+// for 8 rows: 1 unified, 2 overlap, 5 normal.

 static inline void reducedDuplexRow_2way_normal( uint64_t *State,
                   uint64_t *rowIn, uint64_t *rowInOut0, uint64_t *rowInOut1,
@@ -283,6 +283,15 @@ static inline void reducedDuplexRow_2way_normal( uint64_t *State,
   for ( i = 0; i < nCols; i++ )
   {
     //Absorbing "M[prev] [+] M[row*]"
+     io0 = _mm512_load_si512( inout0    );
+     io1 = _mm512_load_si512( inout0 +1 );
+     io2 = _mm512_load_si512( inout0 +2 );
+
+     io0 = _mm512_mask_load_epi64( io0, 0xf0, inout1    );
+     io1 = _mm512_mask_load_epi64( io1, 0xf0, inout1 +1 );
+     io2 = _mm512_mask_load_epi64( io2, 0xf0, inout1 +2 );
+
+/*
     io0 = _mm512_mask_blend_epi64( 0xf0,
                                    _mm512_load_si512( (__m512i*)inout0 ),
                                    _mm512_load_si512( (__m512i*)inout1 ) );
@@ -292,6 +301,7 @@ static inline void reducedDuplexRow_2way_normal( uint64_t *State,
     io2 = _mm512_mask_blend_epi64( 0xf0,
                                    _mm512_load_si512( (__m512i*)inout0 +2 ),
                                    _mm512_load_si512( (__m512i*)inout1 +2 ) );
+*/

     state0 = _mm512_xor_si512( state0, _mm512_add_epi64( in[0], io0 ) );
     state1 = _mm512_xor_si512( state1, _mm512_add_epi64( in[1], io1 ) );
@@ -359,6 +369,15 @@ static inline void reducedDuplexRow_2way_overlap( uint64_t *State,
   for ( i = 0; i < nCols; i++ )
   {
     //Absorbing "M[prev] [+] M[row*]"
+     io0.v512 = _mm512_load_si512( inout0    );
+     io1.v512 = _mm512_load_si512( inout0 +1 );
+     io2.v512 = _mm512_load_si512( inout0 +2 );
+
+     io0.v512 = _mm512_mask_load_epi64( io0.v512, 0xf0, inout1    );
+     io1.v512 = _mm512_mask_load_epi64( io1.v512, 0xf0, inout1 +1 );
+     io2.v512 = _mm512_mask_load_epi64( io2.v512, 0xf0, inout1 +2 );
+
+/*
     io0.v512 = _mm512_mask_blend_epi64( 0xf0,
                                  _mm512_load_si512( (__m512i*)inout0 ),
                                  _mm512_load_si512( (__m512i*)inout1 ) );
@@ -368,27 +387,12 @@ static inline void reducedDuplexRow_2way_overlap( uint64_t *State,
     io2.v512 = _mm512_mask_blend_epi64( 0xf0,
                                  _mm512_load_si512( (__m512i*)inout0 +2 ),
                                  _mm512_load_si512( (__m512i*)inout1 +2 ) );
+*/

     state0 = _mm512_xor_si512( state0, _mm512_add_epi64( in[0], io0.v512 ) );
     state1 = _mm512_xor_si512( state1, _mm512_add_epi64( in[1], io1.v512 ) );
     state2 = _mm512_xor_si512( state2, _mm512_add_epi64( in[2], io2.v512 ) );
     
-/* 
-     io.v512[0] = _mm512_mask_blend_epi64( 0xf0,
-                                  _mm512_load_si512( (__m512i*)inout0 ),
-                                  _mm512_load_si512( (__m512i*)inout1 ) );
-     io.v512[1] = _mm512_mask_blend_epi64( 0xf0,
-                                  _mm512_load_si512( (__m512i*)inout0 +1 ),
-                                  _mm512_load_si512( (__m512i*)inout1 +1 ) );
-     io.v512[2] = _mm512_mask_blend_epi64( 0xf0,
-                                  _mm512_load_si512( (__m512i*)inout0 +2 ),
-                                  _mm512_load_si512( (__m512i*)inout1 +2 ) );
-
-     state0 = _mm512_xor_si512( state0, _mm512_add_epi64( in[0], io.v512[0] ) );
-     state1 = _mm512_xor_si512( state1, _mm512_add_epi64( in[1], io.v512[1] ) );
-     state2 = _mm512_xor_si512( state2, _mm512_add_epi64( in[2], io.v512[2] ) );
-*/
-
     //Applies the reduced-round transformation f to the sponge's state
     LYRA_ROUND_2WAY_AVX512( state0, state1, state2, state3 );

@@ -415,22 +419,6 @@ static inline void reducedDuplexRow_2way_overlap( uint64_t *State,
          io2.v512 = _mm512_mask_blend_epi64( 0xf0, io2.v512, out[2] );
       }

-/*
-       if ( rowOut == rowInOut0 )
-       {
-          io.v512[0] = _mm512_mask_blend_epi64( 0x0f, io.v512[0], out[0] );
-          io.v512[1] = _mm512_mask_blend_epi64( 0x0f, io.v512[1], out[1] );
-          io.v512[2] = _mm512_mask_blend_epi64( 0x0f, io.v512[2], out[2] );
-
-       }
-       if ( rowOut == rowInOut1 )
-       {
-          io.v512[0] = _mm512_mask_blend_epi64( 0xf0, io.v512[0], out[0] );
-          io.v512[1] = _mm512_mask_blend_epi64( 0xf0, io.v512[1], out[1] );
-          io.v512[2] = _mm512_mask_blend_epi64( 0xf0, io.v512[2], out[2] );
-       }
-*/
-
       //M[rowInOut][col] = M[rowInOut][col] XOR rotW(rand)
       t0 = _mm512_permutex_epi64( state0, 0x93 );
       t1 = _mm512_permutex_epi64( state1, 0x93 );
@@ -444,12 +432,23 @@ static inline void reducedDuplexRow_2way_overlap( uint64_t *State,
                                 _mm512_mask_blend_epi64( 0x11, t2, t1 ) );
     }

+/*
+      casti_m256i( inout0, 0 ) = _mm512_castsi512_si256( io0.v512 );
+      casti_m256i( inout0, 2 ) = _mm512_castsi512_si256( io1.v512 );
+      casti_m256i( inout0, 4 ) = _mm512_castsi512_si256( io2.v512 );
+     _mm512_mask_store_epi64( inout1,    0xf0, io0.v512 );
+     _mm512_mask_store_epi64( inout1 +1, 0xf0, io1.v512 );
+     _mm512_mask_store_epi64( inout1 +2, 0xf0, io2.v512 );
+*/
+
+      
      casti_m256i( inout0, 0 ) = io0.v256lo;
      casti_m256i( inout1, 1 ) = io0.v256hi;
      casti_m256i( inout0, 2 ) = io1.v256lo;
      casti_m256i( inout1, 3 ) = io1.v256hi;
      casti_m256i( inout0, 4 ) = io2.v256lo;
      casti_m256i( inout1, 5 ) = io2.v256hi;
+
 /*     
     _mm512_mask_store_epi64( inout0,    0x0f, io.v512[0] );
     _mm512_mask_store_epi64( inout1,    0xf0, io.v512[0] );
--- a/algo/lyra2/sponge.h
+++ b/algo/lyra2/sponge.h
@@ -97,11 +97,11 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
 // returns void, updates all args
 #define G_4X64(a,b,c,d) \
   a = _mm256_add_epi64( a, b ); \
-   d = mm256_ror_64( _mm256_xor_si256( d, a ), 32 ); \
+   d = mm256_swap64_32( _mm256_xor_si256( d, a ) ); \
   c = _mm256_add_epi64( c, d ); \
-   b = mm256_ror_64( _mm256_xor_si256( b, c ), 24 ); \
+   b = mm256_shuflr64_24( _mm256_xor_si256( b, c ) ); \
   a = _mm256_add_epi64( a, b ); \
-   d = mm256_ror_64( _mm256_xor_si256( d, a ), 16 ); \
+   d = mm256_shuflr64_16( _mm256_xor_si256( d, a ) ); \
   c = _mm256_add_epi64( c, d ); \
   b = mm256_ror_64( _mm256_xor_si256( b, c ), 63 );

@@ -137,11 +137,11 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
 // returns void, all args updated
 #define G_2X64(a,b,c,d) \
   a = _mm_add_epi64( a, b ); \
-   d = mm128_ror_64( _mm_xor_si128( d, a), 32 ); \
+   d = mm128_swap64_32( _mm_xor_si128( d, a) ); \
   c = _mm_add_epi64( c, d ); \
-   b = mm128_ror_64( _mm_xor_si128( b, c ), 24 ); \
+   b = mm128_shuflr64_24( _mm_xor_si128( b, c ) ); \
   a = _mm_add_epi64( a, b ); \
-   d = mm128_ror_64( _mm_xor_si128( d, a ), 16 ); \
+   d = mm128_shuflr64_16( _mm_xor_si128( d, a ) ); \
   c = _mm_add_epi64( c, d ); \
   b = mm128_ror_64( _mm_xor_si128( b, c ), 63 );

@@ -150,12 +150,10 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   G_2X64( s1, s3, s5, s7 ); \
   mm128_vrol256_64( s6, s7 ); \
   mm128_vror256_64( s2, s3 ); \
-   mm128_swap256_128( s4, s5 ); \
-   G_2X64( s0, s2, s4, s6 ); \
-   G_2X64( s1, s3, s5, s7 ); \
+   G_2X64( s0, s2, s5, s6 ); \
+   G_2X64( s1, s3, s4, s7 ); \
   mm128_vror256_64( s6, s7 ); \
-   mm128_vrol256_64( s2, s3 ); \
-   mm128_swap256_128( s4, s5 );
+   mm128_vrol256_64( s2, s3 );

 #define LYRA_12_ROUNDS_AVX(s0,s1,s2,s3,s4,s5,s6,s7) \
   LYRA_ROUND_AVX(s0,s1,s2,s3,s4,s5,s6,s7) \
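Review note: the G_4X64 and G_2X64 hunks swap the generic rotates by 32, 24 and 16 bits for swap64_32 and shuflr64_24/16. Those rotate amounts are whole bytes, so the rotate can be expressed as a byte permutation. The definitions of the mm256/mm128 macros are not shown in this diff, so the scalar functions below are hypothetical stand-ins that only illustrate why the substitution preserves the result.

```c
#include <assert.h>
#include <stdint.h>

// Generic 64-bit rotate right, valid for 0 < c < 64.
uint64_t ror64( uint64_t w, unsigned c )
{
   return ( w >> c ) | ( w << ( 64 - c ) );
}

// Rotate by 32 bits is just a swap of the two 32-bit halves.
uint64_t swap64_32( uint64_t w )
{
   return ( w >> 32 ) | ( w << 32 );
}

// Rotate right by nbytes whole bytes, written as a byte permutation:
// output byte i takes input byte (i + nbytes) mod 8, byte 0 being the
// least significant.
uint64_t shuflr64_bytes( uint64_t w, unsigned nbytes )
{
   uint64_t r = 0;
   for ( unsigned i = 0; i < 8; i++ )
   {
      uint64_t b = ( w >> ( 8 * ( ( i + nbytes ) & 7 ) ) ) & 0xff;
      r |= b << ( 8 * i );
   }
   return r;
}

// Usage sketch: the byte-wise forms agree with the generic rotate.
void rotate_selfcheck( void )
{
   const uint64_t w = 0x0123456789abcdefULL;
   assert( swap64_32( w )         == ror64( w, 32 ) );
   assert( shuflr64_bytes( w, 3 ) == ror64( w, 24 ) );
   assert( shuflr64_bytes( w, 2 ) == ror64( w, 16 ) );
}
```

The rotate by 63 is not a whole number of bytes, which is consistent with the last line of each macro keeping the generic ror_64 form.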
--- a/algo/quark/anime-4way.c
+++ b/algo/quark/anime-4way.c
@@ -15,7 +15,8 @@

 #if defined (ANIME_8WAY)

-typedef struct {
+union _anime_8way_context_overlay
+{
    blake512_8way_context   blake;
    bmw512_8way_context     bmw;
 #if defined(__VAES__)
@@ -26,23 +27,9 @@ typedef struct {
    jh512_8way_context      jh;
    skein512_8way_context   skein;
    keccak512_8way_context  keccak;
-} anime_8way_ctx_holder;
+} __attribute__ ((aligned (64)));

-anime_8way_ctx_holder anime_8way_ctx __attribute__ ((aligned (64)));
-
-void init_anime_8way_ctx()
-{
-     blake512_8way_init( &anime_8way_ctx.blake );
-     bmw512_8way_init( &anime_8way_ctx.bmw );
-#if defined(__VAES__)
-     groestl512_4way_init( &anime_8way_ctx.groestl, 64 );
-#else
-     init_groestl( &anime_8way_ctx.groestl, 64 );
-#endif
-     skein512_8way_init( &anime_8way_ctx.skein );
-     jh512_8way_init( &anime_8way_ctx.jh );
-     keccak512_8way_init( &anime_8way_ctx.keccak );
-}
+typedef union _anime_8way_context_overlay anime_8way_context_overlay;

 void anime_8way_hash( void *state, const void *input )
 {
@@ -65,17 +52,14 @@ void anime_8way_hash( void *state, const void *input )
    __m512i* vhB = (__m512i*)vhashB;
    __m512i* vhC = (__m512i*)vhashC;
    const __m512i bit3_mask = m512_const1_64( 8 );
-    const __m512i zero = _mm512_setzero_si512();
    __mmask8 vh_mask;
-    anime_8way_ctx_holder ctx;
-    memcpy( &ctx, &anime_8way_ctx, sizeof(anime_8way_ctx) );
+    anime_8way_context_overlay ctx __attribute__ ((aligned (64)));

    bmw512_8way_full( &ctx.bmw, vhash, input, 80 );

    blake512_8way_full( &ctx.blake, vhash, vhash, 64 );

-    vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
-                                       zero );
+    vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );

 #if defined(__VAES__)

@@ -152,8 +136,7 @@ void anime_8way_hash( void *state, const void *input )
    jh512_8way_update( &ctx.jh, vhash, 64 );
    jh512_8way_close( &ctx.jh, vhash );

-    vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
-                                       zero );
+    vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );

    if ( ( vh_mask & 0xff ) != 0xff )
       blake512_8way_full( &ctx.blake, vhashA, vhash, 64 );
@@ -168,8 +151,7 @@ void anime_8way_hash( void *state, const void *input )

    skein512_8way_full( &ctx.skein, vhash, vhash, 64 );

-    vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ), 
-                                       zero );
+    vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );

    if ( ( vh_mask & 0xff ) != 0xff )
    {
@@ -237,14 +219,20 @@ int scanhash_anime_8way( struct work *work, uint32_t max_nonce,

 #elif defined (ANIME_4WAY)

-typedef struct {
+union _anime_4way_context_overlay
+{
    blake512_4way_context  blake;
    bmw512_4way_context    bmw;
    hashState_groestl      groestl;
    jh512_4way_context     jh;
    skein512_4way_context  skein;
    keccak512_4way_context keccak;
-} anime_4way_ctx_holder;
+#if defined(__VAES__)
+    groestl512_2way_context groestl2;
+#endif
+} __attribute__ ((aligned (64)));
+
+typedef union _anime_4way_context_overlay anime_4way_context_overlay;

 void anime_4way_hash( void *state, const void *input )
 {
@@ -262,7 +250,7 @@ void anime_4way_hash( void *state, const void *input )
    int h_mask;
    const __m256i bit3_mask = m256_const1_64( 8 );
    const __m256i zero = _mm256_setzero_si256();
-    anime_4way_ctx_holder ctx;
+    anime_4way_context_overlay ctx __attribute__ ((aligned (64)));

    bmw512_4way_init( &ctx.bmw );
    bmw512_4way_update( &ctx.bmw, input, 80 );
@@ -293,7 +281,18 @@ void anime_4way_hash( void *state, const void *input )

    mm256_blend_hash_4x64( vh, vhA, vhB, vh_mask );

-    dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
+#if defined(__VAES__)
+
+   rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+   groestl512_2way_full( &ctx.groestl2, vhashA, vhashA, 64 );
+   groestl512_2way_full( &ctx.groestl2, vhashB, vhashB, 64 );
+
+   rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
+   dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );

    groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
    groestl512_full( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
@@ -302,6 +301,8 @@ void anime_4way_hash( void *state, const void *input )

    intrlv_4x64( vhash, hash0, hash1, hash2, hash3, 512 );

+#endif
+
    jh512_4way_init( &ctx.jh );
    jh512_4way_update( &ctx.jh, vhash, 64 );
    jh512_4way_close( &ctx.jh, vhash );
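Review note: this file changes the context holder from a struct, copied from a pre-initialised global, to a union declared on the stack. A rough sketch of the space effect follows, using made-up context types; the real context sizes are not shown in this diff. A union is only safe here if each stage fully reinitialises its member before use, which the init and _full calls in the hunks appear to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins for three hash contexts of different sizes.
typedef struct { uint64_t h[8];  uint8_t buf[128]; } ctx_a;
typedef struct { uint64_t h[16]; uint8_t buf[64];  } ctx_b;
typedef struct { uint64_t s[25]; }                   ctx_c;

// Old style: every context has its own storage.
struct ctx_holder  { ctx_a a; ctx_b b; ctx_c c; } __attribute__ ((aligned (64)));

// New style: all contexts overlay the same storage, so only one can be
// live at a time.
union  ctx_overlay { ctx_a a; ctx_b b; ctx_c c; } __attribute__ ((aligned (64)));

size_t holder_size( void )  { return sizeof( struct ctx_holder ); }
size_t overlay_size( void ) { return sizeof( union ctx_overlay ); }

// Usage sketch: the overlay is as large as its biggest member (rounded
// up to the alignment) and smaller than the struct holding all three.
void overlay_selfcheck( void )
{
   assert( overlay_size() >= sizeof( ctx_c ) );
   assert( overlay_size() % 64 == 0 );
   assert( overlay_size() < holder_size() );
}
```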
--- a/algo/quark/hmq1725-4way.c
+++ b/algo/quark/hmq1725-4way.c
@@ -13,6 +13,7 @@
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/simd/nist.h"
 #include "algo/shavite/sph_shavite.h"
+#include "algo/shavite/shavite-hash-2way.h"
 #include "algo/simd/simd-hash-2way.h"
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
@@ -64,14 +65,14 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   uint32_t vhashA[16<<3] __attribute__ ((aligned (64)));
   uint32_t vhashB[16<<3] __attribute__ ((aligned (64)));
   uint32_t vhashC[16<<3] __attribute__ ((aligned (64)));
-   uint32_t hash0 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash1 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash2 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash3 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash4 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash5 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash6 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash7 [16]    __attribute__ ((aligned (64)));
+   uint32_t hash0 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash1 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash2 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash3 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash4 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash5 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash6 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash7 [16]    __attribute__ ((aligned (32)));
   hmq1725_8way_context_overlay ctx __attribute__ ((aligned (64)));
   __mmask8 vh_mask;
   const __m512i vmask = m512_const1_64( 24 );
@@ -98,8 +99,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3,
                           hash4, hash5, hash6,  hash7 );

-   vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
-                                       m512_zero );
+   vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );

   // A
 #if defined(__VAES__)
@@ -154,8 +154,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   keccak512_8way_update( &ctx.keccak, vhash, 64 );
   keccak512_8way_close( &ctx.keccak, vhash );

-   vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
-                                       m512_zero );
+   vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );

   // A
   if ( ( vh_mask & 0xff ) != 0xff )
@@ -174,8 +173,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   cube_4way_full( &ctx.cube, vhashB, 512, vhashB, 64 );

   rintrlv_4x128_8x64( vhash, vhashA, vhashB, 512 );
-   vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
-                                       m512_zero );
+   vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );

   if ( likely( ( vh_mask & 0xff ) != 0xff ) )
   {
@@ -223,8 +221,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   simd512_4way_full( &ctx.simd, vhashB, vhashB, 64 );

   rintrlv_4x128_8x64( vhash, vhashA, vhashB, 512 );
-   vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
-                                       m512_zero );
+   vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
   dintrlv_8x64_512( hash0, hash1, hash2, hash3,
                     hash4, hash5, hash6, hash7, vhash );
   // 4x32 for haval
@@ -302,8 +299,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)

   blake512_8way_full( &ctx.blake, vhash, vhash, 64 );

-   vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
-                                       m512_zero );
+   vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );

   // A
 #if defined(__VAES__)
@@ -374,8 +370,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)

   intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3,
                           hash4, hash5, hash6, hash7 );
-   vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
-                                       m512_zero );
+   vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );

     // A   
 #if defined(__VAES__)
@@ -455,8 +450,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)

   intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3,
                           hash4, hash5, hash6, hash7 );
-   vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
-                                       m512_zero );
+   vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );

   if ( hash0[0] & mask )
      fugue512_full( &ctx.fugue, hash0, hash0, 64 );
@@ -520,8 +514,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   sha512_8way_update( &ctx.sha512, vhash, 64 );
   sha512_8way_close( &ctx.sha512, vhash );

-   vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
-                                       m512_zero );
+   vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
   dintrlv_8x64_512( hash0, hash1, hash2, hash3,
                     hash4, hash5, hash6, hash7, vhash );

@@ -625,6 +618,7 @@ union _hmq1725_4way_context_overlay
    cube_2way_context       cube2;
    sph_shavite512_context  shavite;
    hashState_sd            sd;
+    shavite512_2way_context shavite2;
    simd_2way_context       simd;
    hashState_echo          echo;
    hamsi512_4way_context   hamsi;
@@ -633,19 +627,23 @@ union _hmq1725_4way_context_overlay
    sph_whirlpool_context   whirlpool;
    sha512_4way_context     sha512;
    haval256_5_4way_context haval;
+#if defined(__VAES__)
+    groestl512_2way_context groestl2;
+    echo_2way_context       echo2;
+#endif
 } __attribute__ ((aligned (64)));

 typedef union _hmq1725_4way_context_overlay hmq1725_4way_context_overlay;

 extern void hmq1725_4way_hash(void *state, const void *input)
 {
-   uint32_t hash0 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash1 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash2 [16]    __attribute__ ((aligned (64)));
-   uint32_t hash3 [16]    __attribute__ ((aligned (64)));
   uint32_t vhash [16<<2] __attribute__ ((aligned (64)));
   uint32_t vhashA[16<<2] __attribute__ ((aligned (64)));
   uint32_t vhashB[16<<2] __attribute__ ((aligned (64)));
+   uint32_t hash0 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash1 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash2 [16]    __attribute__ ((aligned (32)));
+   uint32_t hash3 [16]    __attribute__ ((aligned (32)));
   hmq1725_4way_context_overlay ctx __attribute__ ((aligned (64)));
   __m256i vh_mask;     
   int h_mask;
@@ -750,15 +748,10 @@ extern void hmq1725_4way_hash(void *state, const void *input)

    mm256_blend_hash_4x64( vh, vhA, vhB, vh_mask );

-    dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
+    rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );

-    shavite512_full( &ctx.shavite, hash0, hash0, 64 );
-    shavite512_full( &ctx.shavite, hash1, hash1, 64 );
-    shavite512_full( &ctx.shavite, hash2, hash2, 64 );
-    shavite512_full( &ctx.shavite, hash3, hash3, 64 );
-
-    intrlv_2x128_512( vhashA, hash0, hash1 );
-    intrlv_2x128_512( vhashB, hash2, hash3 );
+    shavite512_2way_full( &ctx.shavite2, vhashA, vhashA, 64 );
+    shavite512_2way_full( &ctx.shavite2, vhashB, vhashB, 64 );

    simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
    simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );
@@ -795,6 +788,17 @@ extern void hmq1725_4way_hash(void *state, const void *input)

    mm256_blend_hash_4x64( vh, vhA, vhB, vh_mask );

+#if defined(__VAES__)
+
+   rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+   echo_2way_full( &ctx.echo2, vhashA, 512, vhashA, 64 );
+   echo_2way_full( &ctx.echo2, vhashB, 512, vhashB, 64 );
+
+   rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+    
    dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
    
    echo_full( &ctx.echo, (BitSequence *)hash0, 512,
@@ -807,7 +811,9 @@ extern void hmq1725_4way_hash(void *state, const void *input)
                    (const BitSequence *)hash3, 64 );

    intrlv_4x64( vhash, hash0, hash1, hash2, hash3, 512 );
-     
+
+#endif
+
    blake512_4way_full( &ctx.blake, vhash, vhash, 64 );

    dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
@@ -939,6 +945,17 @@ extern void hmq1725_4way_hash(void *state, const void *input)

   mm256_blend_hash_4x64( vh, vhA, vhB, vh_mask );

+#if defined(__VAES__)
+
+   rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+   groestl512_2way_full( &ctx.groestl2, vhashA, vhashA, 64 );
+   groestl512_2way_full( &ctx.groestl2, vhashB, vhashB, 64 );
+
+   rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+   
   dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );

   groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -948,6 +965,8 @@ extern void hmq1725_4way_hash(void *state, const void *input)

   intrlv_4x64( vhash, hash0, hash1, hash2, hash3, 512 );

+#endif
+
   sha512_4way_init( &ctx.sha512 ); 
   sha512_4way_update( &ctx.sha512, vhash, 64 );
   sha512_4way_close( &ctx.sha512, vhash ); 
--- a/algo/quark/quark-4way.c
+++ b/algo/quark/quark-4way.c
@@ -68,7 +68,6 @@ void quark_8way_hash( void *state, const void *input )
    quark_8way_ctx_holder ctx;
    const uint32_t mask = 8;
    const __m512i bit3_mask = m512_const1_64( mask );
-    const __m512i zero = _mm512_setzero_si512();

    memcpy( &ctx, &quark_8way_ctx, sizeof(quark_8way_ctx) );

@@ -76,9 +75,7 @@ void quark_8way_hash( void *state, const void *input )

    bmw512_8way_full( &ctx.bmw, vhash, vhash, 64 );
    
-    vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
-                                       zero );
-
+    vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );
    
 #if defined(__VAES__)

@@ -154,8 +151,7 @@ void quark_8way_hash( void *state, const void *input )
    jh512_8way_update( &ctx.jh, vhash, 64 );
    jh512_8way_close( &ctx.jh, vhash );

-    vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
-                                       zero );
+    vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );

    if ( ( vh_mask & 0xff ) != 0xff )
       blake512_8way_full( &ctx.blake, vhashA, vhash, 64 );
@@ -169,8 +165,7 @@ void quark_8way_hash( void *state, const void *input )

    skein512_8way_full( &ctx.skein, vhash, vhash, 64 );

-    vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
-                                       zero );
+    vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );

    if ( ( vh_mask & 0xff ) != 0xff )
    {
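Review note: the anime, hmq1725 and quark hunks all replace an AND followed by a compare-with-zero with a single _mm512_testn_epi64_mask call. The sketch below is a scalar model of both forms over eight 64-bit lanes, written without intrinsics so it runs on any CPU. The function names are hypothetical; the per-lane rule (set mask bit i when a[i] AND b[i] is zero) follows the documented behaviour of the intrinsic.

```c
#include <assert.h>
#include <stdint.h>

// Model of the old form:
//   _mm512_cmpeq_epi64_mask( _mm512_and_si512( a, b ), zero )
uint8_t cmpeq_and_zero_mask( const uint64_t a[8], const uint64_t b[8] )
{
   uint8_t m = 0;
   for ( int i = 0; i < 8; i++ )
   {
      const uint64_t t = a[i] & b[i];          // explicit AND result
      if ( t == 0 ) m |= (uint8_t)( 1u << i ); // compare with zero
   }
   return m;
}

// Model of the new form: _mm512_testn_epi64_mask( a, b )
uint8_t testn_mask( const uint64_t a[8], const uint64_t b[8] )
{
   uint8_t m = 0;
   for ( int i = 0; i < 8; i++ )
      if ( !( a[i] & b[i] ) ) m |= (uint8_t)( 1u << i );
   return m;
}

// Usage sketch with bit3_mask = 8 in every lane, as in quark and anime.
void testn_selfcheck( void )
{
   const uint64_t a[8] = { 8, 0, 15, 7, 24, 16, 1, 9 };
   const uint64_t b[8] = { 8, 8, 8, 8, 8, 8, 8, 8 };
   // Bit 3 is clear in lanes 1, 3, 5 and 6.
   assert( testn_mask( a, b ) == 0x6a );
   assert( testn_mask( a, b ) == cmpeq_and_zero_mask( a, b ) );
}
```

Both forms yield the same mask, so the change also makes the separate zero vector unnecessary, which matches the removal of the zero constant in the 8-way functions.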
--- a/algo/radiogatun/sph_radiogatun.c
+++ b/algo/radiogatun/sph_radiogatun.c
--- a/algo/radiogatun/sph_radiogatun.h
+++ b/algo/radiogatun/sph_radiogatun.h
@@ -1,186 +0,0 @@
-/* $Id: sph_radiogatun.h 226 2010-06-16 17:28:08Z tp $ */
-/**
- * RadioGatun interface.
- *
- * RadioGatun has been published in: G. Bertoni, J. Daemen, M. Peeters
- * and G. Van Assche, "RadioGatun, a belt-and-mill hash function",
- * presented at the Second Cryptographic Hash Workshop, Santa Barbara,
- * August 24-25, 2006. The main Web site, containing that article, the
- * reference code and some test vectors, appears to be currently located
- * at the following URL: http://radiogatun.noekeon.org/
- *
- * The presentation article does not specify endianness or padding. The
- * reference code uses the following conventions, which we also apply
- * here:
- * <ul>
- * <li>The input message is an integral number of sequences of three
- * words. Each word is either a 32-bit of 64-bit word (depending on
- * the version of RadioGatun).</li>
- * <li>Input bytes are decoded into words using little-endian
- * convention.</li>
- * <li>Padding consists of a single bit of value 1, using little-endian
- * convention within bytes (i.e. for a byte-oriented input, a single
- * byte of value 0x01 is appended), then enough bits of value 0 to finish
- * the current block.</li>
- * <li>Output consists of 256 bits. Successive output words are encoded
- * with little-endian convention.</li>
- * </ul>
- * These conventions are very close to those we use for PANAMA, which is
- * a close ancestor or RadioGatun.
- *
- * RadioGatun is actually a family of functions, depending on some
- * internal parameters. We implement here two functions, with a "belt
- * length" of 13, a "belt width" of 3, and a "mill length" of 19. The
- * RadioGatun[32] version uses 32-bit words, while the RadioGatun[64]
- * variant uses 64-bit words.
- *
- * Strictly speaking, the name "RadioGatun" should use an acute accent
- * on the "u", which we omitted here to keep strict ASCII-compatibility
- * of this file.
- *
- * ==========================(LICENSE BEGIN)============================
- *
- * Copyright (c) 2007-2010  Projet RNRT SAPHIR
- *
- * Permission is hereby granted, free of charge, to any person obtaining
- * a copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice shall be
- * included in all copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- * ===========================(LICENSE END)=============================
- *
- * @file     sph_radiogatun.h
- * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
- */
-
-#ifndef SPH_RADIOGATUN_H__
-#define SPH_RADIOGATUN_H__
-
-#include <stddef.h>
-#include "algo/sha/sph_types.h"
-
-/**
- * Output size (in bits) for RadioGatun[32].
- */
-#define SPH_SIZE_radiogatun32   256
-
-/**
- * This structure is a context for RadioGatun[32] computations: it
- * contains intermediate values and some data from the last entered
- * block. Once a RadioGatun[32] computation has been performed, the
- * context can be reused for another computation.
- *
- * The contents of this structure are private. A running RadioGatun[32]
- * computation can be cloned by copying the context (e.g. with a
- * simple <code>memcpy()</code>).
- */
-typedef struct {
-#ifndef DOXYGEN_IGNORE
-	unsigned char data[156];   /* first field, for alignment */
-	unsigned data_ptr;
-	sph_u32 a[19], b[39];
-#endif
-} sph_radiogatun32_context;
-
-/**
- * Initialize a RadioGatun[32] context. This process performs no
- * memory allocation.
- *
- * @param cc   the RadioGatun[32] context (pointer to a
- *             <code>sph_radiogatun32_context</code>)
- */
-void sph_radiogatun32_init(void *cc);
-
-/**
- * Process some data bytes. It is acceptable that <code>len</code> is zero
- * (in which case this function does nothing).
- *
- * @param cc     the RadioGatun[32] context
- * @param data   the input data
- * @param len    the input data length (in bytes)
- */
-void sph_radiogatun32(void *cc, const void *data, size_t len);
-
-/**
- * Terminate the current RadioGatun[32] computation and output the
- * result into the provided buffer. The destination buffer must be wide
- * enough to accomodate the result (32 bytes). The context is
- * automatically reinitialized.
- *
- * @param cc    the RadioGatun[32] context
- * @param dst   the destination buffer
- */
-void sph_radiogatun32_close(void *cc, void *dst);
-
-#if SPH_64
-
-/**
- * Output size (in bits) for RadioGatun[64].
- */
-#define SPH_SIZE_radiogatun64   256
-
-/**
- * This structure is a context for RadioGatun[64] computations: it
- * contains intermediate values and some data from the last entered
- * block. Once a RadioGatun[64] computation has been performed, the
- * context can be reused for another computation.
- *
- * The contents of this structure are private. A running RadioGatun[64]
- * computation can be cloned by copying the context (e.g. with a
- * simple <code>memcpy()</code>).
- */
-typedef struct {
-#ifndef DOXYGEN_IGNORE
-	unsigned char data[312];   /* first field, for alignment */
-	unsigned data_ptr;
-	sph_u64 a[19], b[39];
-#endif
-} sph_radiogatun64_context;
-
-/**
- * Initialize a RadioGatun[64] context. This process performs no
- * memory allocation.
- *
- * @param cc   the RadioGatun[64] context (pointer to a
- *             <code>sph_radiogatun64_context</code>)
- */
-void sph_radiogatun64_init(void *cc);
-
-/**
- * Process some data bytes. It is acceptable that <code>len</code> is zero
- * (in which case this function does nothing).
- *
- * @param cc     the RadioGatun[64] context
- * @param data   the input data
- * @param len    the input data length (in bytes)
- */
-void sph_radiogatun64(void *cc, const void *data, size_t len);
-
-/**
- * Terminate the current RadioGatun[64] computation and output the
- * result into the provided buffer. The destination buffer must be wide
- * enough to accomodate the result (32 bytes). The context is
- * automatically reinitialized.
- *
- * @param cc    the RadioGatun[64] context
- * @param dst   the destination buffer
- */
-void sph_radiogatun64_close(void *cc, void *dst);
-
-#endif
-
-#endif
--- a/algo/ripemd/lbry-gate.c
+++ b/algo/ripemd/lbry-gate.c
@@ -4,7 +4,7 @@
 #include <string.h>
 #include <stdio.h>

-double lbry_calc_network_diff( struct work *work )
+long double lbry_calc_network_diff( struct work *work )
 {
        // sample for diff 43.281 : 1c05ea29
        // todo: endian reversed on longpoll could be zr5 specific...
@@ -12,7 +12,7 @@ double lbry_calc_network_diff( struct work *work )
   uint32_t nbits = swab32( work->data[ LBRY_NBITS_INDEX ] );
   uint32_t bits = (nbits & 0xffffff);
   int16_t shift = (swab32(nbits) & 0xff); // 0x1c = 28
-   double d = (double)0x0000ffff / (double)bits;
+   long double d = (long double)0x0000ffff / (long double)bits;

   for (int m=shift; m < 29; m++) d *= 256.0;
   for (int m=29; m < shift; m++) d /= 256.0;
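Note on the hunk above: a minimal sketch of the same difficulty arithmetic in `long double`, checked against the sample in the comment (`1c05ea29`, about 43.281). The helper name `compact_to_diff` is hypothetical, and it takes the compact target as a plain host-order value; the `swab32` handling from `lbry_calc_network_diff` is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical helper: converts a compact target such as 0x1c05ea29
   into a difficulty. The high byte is the exponent, the low 24 bits
   are the mantissa. */
static long double compact_to_diff( uint32_t compact )
{
   uint32_t mantissa = compact & 0xffffff;
   int exponent = (int)( compact >> 24 );      /* 0x1c = 28 */

   if ( mantissa == 0 ) return 0.0L;           /* avoid division by zero */

   long double d = (long double)0x0000ffff / (long double)mantissa;

   /* Scale by 256 for every step the exponent is away from 29. */
   for ( int m = exponent; m < 29; m++ ) d *= 256.0L;
   for ( int m = 29; m < exponent; m++ ) d /= 256.0L;
   return d;
}
```

With this sketch, `compact_to_diff( 0x1d00ffff )` is exactly 1, which is the usual difficulty-1 reference point for this encoding.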
--- a/algo/ripemd/sph_ripemd.c
+++ b/algo/ripemd/sph_ripemd.c
@@ -35,6 +35,7 @@

 #include "sph_ripemd.h"

+#if 0
 /*
 * Round functions for RIPEMD (original).
 */
@@ -46,6 +47,7 @@ static const sph_u32 oIV[5] = {
 	SPH_C32(0x67452301), SPH_C32(0xEFCDAB89),
 	SPH_C32(0x98BADCFE), SPH_C32(0x10325476)
 };
+#endif

 /*
 * Round functions for RIPEMD-128 and RIPEMD-160.
@@ -63,6 +65,8 @@ static const sph_u32 IV[5] = {

 #define ROTL    SPH_ROTL32

+#if 0
+
 /* ===================================================================== */
 /*
 * RIPEMD (original hash, deprecated).
@@ -479,7 +483,7 @@ sph_ripemd_comp(const sph_u32 msg[16], sph_u32 val[4])
 * One round of RIPEMD-128. The data must be aligned for 32-bit access.
 */
 static void
-ripemd128_round(const unsigned char *data, sph_u32 r[5])
+ripemd128_round(const unsigned char *data, sph_u32 r[4])
 {
 #if SPH_LITTLE_FAST

@@ -539,6 +543,8 @@ sph_ripemd128_comp(const sph_u32 msg[16], sph_u32 val[4])
 #undef RIPEMD128_IN
 }

+#endif
+
 /* ===================================================================== */
 /*
 * RIPEMD-160.
--- a/algo/ripemd/sph_ripemd.h
+++ b/algo/ripemd/sph_ripemd.h
@@ -84,6 +84,7 @@
 * can be cloned by copying the context (e.g. with a simple
 * <code>memcpy()</code>).
 */
+#if 0
 typedef struct {
 #ifndef DOXYGEN_IGNORE
 	unsigned char buf[64];    /* first field, for alignment */
@@ -204,6 +205,8 @@ void sph_ripemd128_close(void *cc, void *dst);
 */
 void sph_ripemd128_comp(const sph_u32 msg[16], sph_u32 val[4]);

+#endif
+
 /* ===================================================================== */

 /**
--- a/algo/scrypt/scrypt.c
+++ b/algo/scrypt/scrypt.c
@@ -34,6 +34,7 @@
 #include "algo/sha/sha-hash-4way.h"
 #include "algo/sha/sha256-hash.h"
 #include <mm_malloc.h>
+#include "malloc-huge.h"

 static const uint32_t keypad[12] = {
 	0x80000000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x00000280
@@ -1487,11 +1488,19 @@ extern int scanhash_scrypt( struct work *work, uint32_t max_nonce,

 bool scrypt_miner_thread_init( int thr_id )
 {
-   scratchbuf = _mm_malloc( scratchbuf_size, 128 );
+   scratchbuf = malloc_hugepages( scratchbuf_size );
   if ( scratchbuf )
-      return true;
+   {
+      if ( opt_debug )
+         applog( LOG_NOTICE, "Thread %d is using huge pages", thr_id );
+   }
+   else
+      scratchbuf = _mm_malloc( scratchbuf_size, 128 );
+
+   if ( scratchbuf ) return true;
+
   applog( LOG_ERR, "Thread %u: Scrypt buffer allocation failed", thr_id );
-   return false; 
+   return false;
 }

 bool register_scrypt_algo( algo_gate_t* gate )
@@ -1544,7 +1553,6 @@ bool register_scrypt_algo( algo_gate_t* gate )

   format_number_si( &t_size, t_units );
   format_number_si( &d_size, d_units );
-   
   applog( LOG_INFO,"Throughput %d/thr, Buffer %.0f %siB/thr, Total %.0f %siB\n",
          SCRYPT_THROUGHPUT, t_size, t_units, d_size, d_units );
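Note on the `scrypt_miner_thread_init` hunk above: the pattern is to try a huge page allocation first and fall back to an ordinary aligned allocation. The sketch below shows one way to do that on Linux. It is not the contents of `malloc-huge.h`, which this diff does not show; `scratch_t`, `scratch_alloc`, `scratch_free` and the 2 MiB huge page size are all assumptions made for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

#define ASSUMED_HUGEPAGE_SIZE ( (size_t)2 << 20 )  /* 2 MiB, an assumption */

/* Hypothetical handle that remembers how the buffer was obtained. */
typedef struct { void *ptr; size_t size; bool huge; } scratch_t;

static scratch_t scratch_alloc( size_t size )
{
   scratch_t s = { NULL, 0, false };
#ifdef MAP_HUGETLB
   /* Round up so that munmap later gets a huge page multiple. */
   size_t hsize = ( size + ASSUMED_HUGEPAGE_SIZE - 1 )
                & ~( ASSUMED_HUGEPAGE_SIZE - 1 );
   void *p = mmap( NULL, hsize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0 );
   if ( p != MAP_FAILED )
   {
      s.ptr = p;  s.size = hsize;  s.huge = true;
      return s;
   }
#endif
   /* Fallback: 128 byte aligned heap memory, size rounded to the alignment. */
   size_t asize = ( size + 127 ) & ~(size_t)127;
   s.ptr = aligned_alloc( 128, asize );
   s.size = s.ptr ? asize : 0;
   return s;
}

static void scratch_free( scratch_t *s )
{
   if ( !s->ptr ) return;
   if ( s->huge ) munmap( s->ptr, s->size );
   else           free( s->ptr );
   s->ptr = NULL;
   s->size = 0;
}
```

Keeping the `huge` flag next to the pointer matters because the two allocations must be released differently.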

--- a/algo/sha/sha-hash-4way.h
+++ b/algo/sha/sha-hash-4way.h
@@ -62,6 +62,12 @@ void sha256_4way_transform_le( __m128i *state_out,  const __m128i *data,
                            const __m128i *state_in );
 void sha256_4way_transform_be( __m128i *state_out,  const __m128i *data,
                            const __m128i *state_in );
+void sha256_4way_prehash_3rounds( __m128i *state_mid, __m128i *X,
+                                   const __m128i *W, const __m128i *state_in );
+void sha256_4way_final_rounds( __m128i *state_out, const __m128i *data,
+        const __m128i *state_in, const __m128i *state_mid, const __m128i *X );
+int sha256_4way_transform_le_short( __m128i *state_out, const __m128i *data,
+                                     const __m128i *state_in );

 #endif  // SSE2

@@ -84,10 +90,12 @@ void sha256_8way_transform_le( __m256i *state_out, const __m256i *data,
 void sha256_8way_transform_be( __m256i *state_out, const __m256i *data,
                               const __m256i *state_in );

-void sha256_8way_prehash_3rounds( __m256i *state_mid, const __m256i *W,
-                             const __m256i *state_in );
+void sha256_8way_prehash_3rounds( __m256i *state_mid, __m256i *X,
+                                 const __m256i *W, const __m256i *state_in );
 void sha256_8way_final_rounds( __m256i *state_out, const __m256i *data,
-                          const __m256i *state_in, const __m256i *state_mid );
+        const __m256i *state_in, const __m256i *state_mid, const __m256i *X );
+int sha256_8way_transform_le_short( __m256i *state_out, const __m256i *data,
+                                     const __m256i *state_in );

 #endif  // AVX2

@@ -109,10 +117,13 @@ void sha256_16way_transform_le( __m512i *state_out, const __m512i *data,
                             const __m512i *state_in );
 void sha256_16way_transform_be( __m512i *state_out, const __m512i *data,
                             const __m512i *state_in );
-void sha256_16way_prehash_3rounds( __m512i *state_mid, const __m512i *W,
-                             const __m512i *state_in );
+void sha256_16way_prehash_3rounds( __m512i *state_mid, __m512i *X,
+                                  const __m512i *W, const __m512i *state_in );
 void sha256_16way_final_rounds( __m512i *state_out, const __m512i *data,
-                          const __m512i *state_in, const __m512i *state_mid );
+        const __m512i *state_in, const __m512i *state_mid, const __m512i *X );
+
+int sha256_16way_transform_le_short( __m512i *state_out, const __m512i *data,
+                                     const __m512i *state_in );

 #endif // AVX512
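Note on the `prehash_3rounds` / `final_rounds` pairs declared above: in the second block of an 80-byte header the nonce is word 3, so SHA-256 rounds 0 to 2 depend only on words 0 to 2 and can be computed once per work unit. The scalar sketch below shows that split and nothing more. All function names are hypothetical, and the extra `X` buffer of precomputed message expansion terms from the real prototypes is not modelled.

```c
#include <stdint.h>
#include <string.h>

static const uint32_t K[64] = {
   0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,
   0x923f82a4,0xab1c5ed5,0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,
   0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,0xe49b69c1,0xefbe4786,
   0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
   0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,
   0x06ca6351,0x14292967,0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,
   0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,0xa2bfe8a1,0xa81a664b,
   0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
   0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,
   0x5b9cca4f,0x682e6ff3,0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,
   0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 };

#define ROTR(x,n) ( ( (x) >> (n) ) | ( (x) << ( 32 - (n) ) ) )

/* Expand a 16-word block into the 64-word message schedule. */
static void sketch_expand( uint32_t W[64], const uint32_t block[16] )
{
   memcpy( W, block, 64 );
   for ( int i = 16; i < 64; i++ )
   {
      uint32_t s0 = ROTR( W[i-15], 7 ) ^ ROTR( W[i-15], 18 ) ^ ( W[i-15] >> 3 );
      uint32_t s1 = ROTR( W[i-2], 17 ) ^ ROTR( W[i-2], 19 ) ^ ( W[i-2] >> 10 );
      W[i] = W[i-16] + s0 + W[i-7] + s1;
   }
}

/* Run rounds [first, last) on the working variables v[0..7] = a..h. */
static void sketch_rounds( uint32_t v[8], const uint32_t W[64],
                           int first, int last )
{
   for ( int i = first; i < last; i++ )
   {
      uint32_t S1  = ROTR( v[4], 6 ) ^ ROTR( v[4], 11 ) ^ ROTR( v[4], 25 );
      uint32_t ch  = ( v[4] & v[5] ) ^ ( ~v[4] & v[6] );
      uint32_t t1  = v[7] + S1 + ch + K[i] + W[i];
      uint32_t S0  = ROTR( v[0], 2 ) ^ ROTR( v[0], 13 ) ^ ROTR( v[0], 22 );
      uint32_t maj = ( v[0] & v[1] ) ^ ( v[0] & v[2] ) ^ ( v[1] & v[2] );
      uint32_t t2  = S0 + maj;
      v[7] = v[6];  v[6] = v[5];  v[5] = v[4];  v[4] = v[3] + t1;
      v[3] = v[2];  v[2] = v[1];  v[1] = v[0];  v[0] = t1 + t2;
   }
}

/* Nonce-independent part: rounds 0..2 read only block words 0..2. */
static void sketch_prehash_3rounds( uint32_t mid[8], const uint32_t block[16],
                                    const uint32_t state_in[8] )
{
   uint32_t W[64] = { block[0], block[1], block[2] };
   memcpy( mid, state_in, 32 );
   sketch_rounds( mid, W, 0, 3 );
}

/* Per-nonce part: resume at round 3, then add the input state. */
static void sketch_final_rounds( uint32_t out[8], const uint32_t block[16],
                          const uint32_t state_in[8], const uint32_t mid[8] )
{
   uint32_t W[64], v[8];
   sketch_expand( W, block );
   memcpy( v, mid, 32 );
   sketch_rounds( v, W, 3, 64 );
   for ( int i = 0; i < 8; i++ ) out[i] = state_in[i] + v[i];
}

/* Reference: the unsplit transform, all 64 rounds in one call. */
static void sketch_transform( uint32_t out[8], const uint32_t block[16],
                              const uint32_t state_in[8] )
{
   uint32_t W[64], v[8];
   sketch_expand( W, block );
   memcpy( v, state_in, 32 );
   sketch_rounds( v, W, 0, 64 );
   for ( int i = 0; i < 8; i++ ) out[i] = state_in[i] + v[i];
}
```

Calling `sketch_prehash_3rounds` once and `sketch_final_rounds` for each nonce gives the same state as `sketch_transform` on every block that differs only in word 3.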

--- a/algo/sha/sha2.c
+++ b/algo/sha/sha2.c
@@ -611,11 +611,11 @@ static inline int scanhash_sha256d_8way_pooler( struct work *work,

 #endif /* HAVE_SHA256_8WAY */

-int scanhash_sha256d_pooler( struct work *work,
-	uint32_t max_nonce, uint64_t *hashes_done, struct thr_info *mythr )
+int scanhash_sha256d_pooler( struct work *work, uint32_t max_nonce,
+                             uint64_t *hashes_done, struct thr_info *mythr )
 {
-        uint32_t *pdata = work->data;
-        uint32_t *ptarget = work->target;
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
 	uint32_t _ALIGN(128) data[64];
 	uint32_t _ALIGN(32) hash[8];
 	uint32_t _ALIGN(32) midstate[8];
@@ -626,12 +626,12 @@ int scanhash_sha256d_pooler( struct work *work,
   int thr_id = mythr->id;  // thr_id arg is deprecated

 #ifdef HAVE_SHA256_8WAY
-	if (sha256_use_8way())
-		return scanhash_sha256d_8way_pooler( work,	max_nonce, hashes_done, mythr );
+	if ( sha256_use_8way() )
+		return scanhash_sha256d_8way_pooler( work, max_nonce, hashes_done, mythr );
 #endif
 #ifdef HAVE_SHA256_4WAY
-	if (sha256_use_4way())
-		return scanhash_sha256d_4way_pooler( work,	max_nonce, hashes_done, mythr );
+	if ( sha256_use_4way() )
+		return scanhash_sha256d_4way_pooler( work, max_nonce, hashes_done, mythr );
 #endif
 	
 	memcpy(data, pdata + 16, 64);
@@ -695,8 +695,11 @@ bool register_sha256d_algo( algo_gate_t* gate )
   gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
 #if defined(SHA256D_16WAY)
   gate->scanhash = (void*)&scanhash_sha256d_16way;
+//#elif defined(SHA256D_8WAY)
+//   gate->scanhash = (void*)&scanhash_sha256d_8way;
 #else
   gate->scanhash = (void*)&scanhash_sha256d_pooler;
+//   gate->scanhash = (void*)&scanhash_sha256d_4way;
 #endif
   //   gate->hash     = (void*)&sha256d;
   return true;
--- a/algo/sha/sha256-hash-4way.c
+++ b/algo/sha/sha256-hash-4way.c
--- a/algo/sha/sha256d-4way.c
+++ b/algo/sha/sha256d-4way.c
@@ -10,13 +10,14 @@
 int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
+   __m512i  vdata[32]    __attribute__ ((aligned (128)));
   __m512i  block[16]    __attribute__ ((aligned (64)));
-   __m512i  hash32[8]    __attribute__ ((aligned (32)));
-   __m512i  initstate[8] __attribute__ ((aligned (32)));
-   __m512i  midstate1[8] __attribute__ ((aligned (32)));
-   __m512i  midstate2[8] __attribute__ ((aligned (32)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m512i  vdata[20]    __attribute__ ((aligned (32)));
+   __m512i  hash32[8]    __attribute__ ((aligned (64)));
+   __m512i  initstate[8] __attribute__ ((aligned (64)));
+   __m512i  midstate1[8] __attribute__ ((aligned (64)));
+   __m512i  midstate2[8] __attribute__ ((aligned (64)));
+   __m512i  mexp_pre[16] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -36,6 +37,14 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_512( vdata+16 + 5, 10 );
+   vdata[16+15] = m512_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_512( block + 9, 6 );
+   block[15] = m512_const1_32( 32*8 ); // bit count
+
   // initialize state
   initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
@@ -49,39 +58,33 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
   sha256_16way_transform_le( midstate1, vdata, initstate );

   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_16way_prehash_3rounds( midstate2, vdata + 16, midstate1 );
+   sha256_16way_prehash_3rounds( midstate2, mexp_pre, vdata+16, midstate1 );

   do
   {
      // 1. final 16 bytes of data, with padding
-      memcpy_512( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_512( block + 5, 10 );
-      block[15] = m512_const1_32( 80*8 ); // bit count
-      sha256_16way_final_rounds( hash32, block, midstate1, midstate2 );
+      sha256_16way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                 mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_512( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_512( block + 9, 6 );
-      block[15] = m512_const1_32( 32*8 ); // bit count
-      sha256_16way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm512_block_bswap_32( hash32, hash32 );
-
-      for ( int lane = 0; lane < 16; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( sha256_16way_transform_le_short( hash32, block, initstate ) )
      {
-         extr_lane_16x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm512_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 16; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_16x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
-       *noncev = _mm512_add_epi32( *noncev, sixteen );
-       n += 16;
+      }
+      *noncev = _mm512_add_epi32( *noncev, sixteen );
+      n += 16;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
@@ -95,13 +98,14 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
 int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m256i  block[16]    __attribute__ ((aligned (64)));
+   __m256i  vdata[32]    __attribute__ ((aligned (64)));
+   __m256i  block[16]    __attribute__ ((aligned (32)));
   __m256i  hash32[8]    __attribute__ ((aligned (32)));
   __m256i  initstate[8] __attribute__ ((aligned (32)));
   __m256i  midstate1[8] __attribute__ ((aligned (32)));
   __m256i  midstate2[8] __attribute__ ((aligned (32)));
+   __m256i  mexp_pre[16] __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m256i  vdata[20]    __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -120,6 +124,14 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,

   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_256( vdata+16 + 5, 10 );
+   vdata[16+15] = m256_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_256( block + 9, 6 );
+   block[15] = m256_const1_32( 32*8 ); // bit count
+
   // initialize state
   initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
@@ -133,35 +145,30 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
   sha256_8way_transform_le( midstate1, vdata, initstate );
   
   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_8way_prehash_3rounds( midstate2, vdata + 16, midstate1 );
+   sha256_8way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );

   do
   {
      // 1. final 16 bytes of data, with padding
-      memcpy_256( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_256( block + 5, 10 );
-      block[15] = m256_const1_32( 80*8 ); // bit count
-      sha256_8way_final_rounds( hash32, block, midstate1, midstate2 );
+      sha256_8way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_256( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_256( block + 9, 6 );
-      block[15] = m256_const1_32( 32*8 ); // bit count
-      sha256_8way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm256_block_bswap_32( hash32, hash32 );
-
-      for ( int lane = 0; lane < 8; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( unlikely(
+               sha256_8way_transform_le_short( hash32, block, initstate ) ) )
      {
-         extr_lane_8x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm256_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 8; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_8x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
       }
       *noncev = _mm256_add_epi32( *noncev, eight );
@@ -179,12 +186,14 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
 int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m128i  block[16]    __attribute__ ((aligned (64)));
-   __m128i  hash32[8]    __attribute__ ((aligned (32)));
-   __m128i  initstate[8] __attribute__ ((aligned (32)));
-   __m128i  midstate[8]  __attribute__ ((aligned (32)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m128i  vdata[20]    __attribute__ ((aligned (32)));
+   __m128i  vdata[32]     __attribute__ ((aligned (64)));
+   __m128i  block[16]     __attribute__ ((aligned (32)));
+   __m128i  hash32[8]     __attribute__ ((aligned (32)));
+   __m128i  initstate[8]  __attribute__ ((aligned (32)));
+   __m128i  midstate1[8]  __attribute__ ((aligned (32)));
+   __m128i  midstate2[8]  __attribute__ ((aligned (32)));
+   __m128i  mexp_pre[16]  __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8]  __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -203,6 +212,14 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_128( vdata+16 + 5, 10 );
+   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_128( block + 9, 6 );
+   block[15] = m128_const1_32( 32*8 ); // bit count
+
   // initialize state
   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
@@ -214,39 +231,36 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );

   // hash first 64 bytes of data
-   sha256_4way_transform_le( midstate, vdata, initstate );
+   sha256_4way_transform_le( midstate1, vdata, initstate );
+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_4way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );

   do
   {
      // 1. final 16 bytes of data, with padding
-      memcpy_128( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_128( block + 5, 10 );
-      block[15] = m128_const1_32( 80*8 ); // bit count
-      sha256_4way_transform_le( hash32, block, midstate );
+      sha256_4way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_128( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_128( block + 9, 6 );
-      block[15] = m128_const1_32( 32*8 ); // bit count
-      sha256_4way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm128_block_bswap_32( hash32, hash32 );
-
-      for ( int lane = 0; lane < 4; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( unlikely(
+              sha256_4way_transform_le_short( hash32, block, initstate ) ) )
      {
-         extr_lane_4x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm128_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 4; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_4x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
-       *noncev = _mm_add_epi32( *noncev, four );
-       n += 4;
+      }
+      *noncev = _mm_add_epi32( *noncev, four );
+      n += 4;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
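Note on the sha256d changes above: the padding words never change between nonces, so they are written once before the loop and the per-iteration `memcpy`/`memset` calls go away. The scalar sketch below shows the two block layouts for a single lane. The function names are hypothetical, and `0x80000000` is an assumed per-lane value for `last_byte`, whose definition is outside this diff.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_PAD_BIT 0x80000000u   /* assumed value of last_byte */

/* Second block of an 80-byte header. Words 0..3 carry the last 16 data
   bytes (word 3 is the nonce) and are left untouched here. */
static void sketch_init_tail_block( uint32_t blk[16] )
{
   blk[4] = SKETCH_PAD_BIT;
   memset( blk + 5, 0, 10 * sizeof(uint32_t) );
   blk[15] = 80 * 8;                 /* bit count */
}

/* Block that wraps a 32-byte hash. Words 0..7 are overwritten by the
   first hash on every iteration, the padding in 8..15 stays put. */
static void sketch_init_hash_block( uint32_t blk[16] )
{
   blk[8] = SKETCH_PAD_BIT;
   memset( blk + 9, 0, 6 * sizeof(uint32_t) );
   blk[15] = 32 * 8;                 /* bit count */
}
```

Because the first hash writes only words 0 to 7 of the second buffer, the padding set up once stays valid for the whole nonce range.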
--- a/algo/sha/sha256d-4way.h
+++ b/algo/sha/sha256d-4way.h
@@ -6,12 +6,10 @@

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
  #define SHA256D_16WAY 1
-/*
 #elif defined(__AVX2__)
  #define SHA256D_8WAY 1
 #else
  #define SHA256D_4WAY 1
-*/
 #endif

 bool register_sha256d_algo( algo_gate_t* gate );
@@ -21,7 +19,7 @@ bool register_sha256d_algo( algo_gate_t* gate );
 int scanhash_sha256d_16way( struct work *work, uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr );
 #endif
-/*
+
 #if defined(SHA256D_8WAY)

 int scanhash_sha256d_8way( struct work *work, uint32_t max_nonce,
@@ -33,7 +31,7 @@ int scanhash_sha256d_8way( struct work *work, uint32_t max_nonce,
 int scanhash_sha256d_4way( struct work *work, uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr );
 #endif
-*/
+

 /*
 #if defined(__SHA__)
--- a/algo/sha/sha256t-4way.c
+++ b/algo/sha/sha256t-4way.c
@@ -10,13 +10,14 @@
 int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
+   __m512i  vdata[32]    __attribute__ ((aligned (128)));
   __m512i  block[16]    __attribute__ ((aligned (64)));
-   __m512i  hash32[8]    __attribute__ ((aligned (32)));
-   __m512i  initstate[8] __attribute__ ((aligned (32)));
-   __m512i  midstate1[8] __attribute__ ((aligned (32)));
-   __m512i  midstate2[8] __attribute__ ((aligned (32)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m512i  vdata[20]    __attribute__ ((aligned (32)));
+   __m512i  hash32[8]    __attribute__ ((aligned (64)));
+   __m512i  initstate[8] __attribute__ ((aligned (64)));
+   __m512i  midstate1[8] __attribute__ ((aligned (64)));
+   __m512i  midstate2[8] __attribute__ ((aligned (64)));
+   __m512i  mexp_pre[16] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -36,7 +37,14 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

-   // initialize state
+   vdata[16+4] = last_byte;
+   memset_zero_512( vdata+16 + 5, 10 );
+   vdata[16+15] = m512_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_512( block + 9, 6 );
+   block[15] = m512_const1_32( 32*8 ); // bit count
+
   initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
   initstate[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
@@ -49,43 +57,37 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
   sha256_16way_transform_le( midstate1, vdata, initstate );
   
   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_16way_prehash_3rounds( midstate2, vdata + 16, midstate1 );
+   sha256_16way_prehash_3rounds( midstate2, mexp_pre, vdata+16, midstate1 );

   do
   {
-      // 1. final 16 bytes of data, with padding
-      memcpy_512( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_512( block + 5, 10 );  
-      block[15] = m512_const1_32( 80*8 ); // bit count
-      sha256_16way_final_rounds( hash32, block, midstate1, midstate2 );
+      // 1. final 16 bytes of data, pre-padded
+      sha256_16way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                 mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_512( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_512( block + 9, 6 );
-      block[15] = m512_const1_32( 32*8 ); // bit count
-      sha256_16way_transform_le( hash32, block, initstate );
+      sha256_16way_transform_le( block, block, initstate );

      // 3. 32 byte hash from 2.
-      memcpy_512( block, hash32, 8 );
-      sha256_16way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm512_block_bswap_32( hash32, hash32 );    
-
-      for ( int lane = 0; lane < 16; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( unlikely(
+               sha256_16way_transform_le_short( hash32, block, initstate ) ) )
      {
-         extr_lane_16x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm512_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 16; lane++ )
+         if ( hash32_d7[ lane ] <= targ32_d7 )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_16x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
-       *noncev = _mm512_add_epi32( *noncev, sixteen );
-       n += 16;
+      }
+      *noncev = _mm512_add_epi32( *noncev, sixteen );
+      n += 16;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
@@ -100,13 +102,14 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
 int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m256i  block[16]    __attribute__ ((aligned (64)));
+   __m256i  vdata[32]    __attribute__ ((aligned (64)));
+   __m256i  block[16]    __attribute__ ((aligned (32)));
   __m256i  hash32[8]    __attribute__ ((aligned (32)));
   __m256i  initstate[8] __attribute__ ((aligned (32)));
   __m256i  midstate1[8] __attribute__ ((aligned (32)));
   __m256i  midstate2[8] __attribute__ ((aligned (32)));
+   __m256i  mexp_pre[16] __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m256i  vdata[20]    __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -125,6 +128,14 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,

   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_256( vdata+16 + 5, 10 );
+   vdata[16+15] = m256_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_256( block + 9, 6 );
+   block[15] = m256_const1_32( 32*8 ); // bit count
+
   // initialize state
   initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
@@ -138,43 +149,37 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,
   sha256_8way_transform_le( midstate1, vdata, initstate );

   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_8way_prehash_3rounds( midstate2, vdata + 16, midstate1 );
+   sha256_8way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );
   
   do
   {
      // 1. final 16 bytes of data, with padding
-      memcpy_256( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_256( block + 5, 10 );
-      block[15] = m256_const1_32( 80*8 ); // bit count
-      sha256_8way_final_rounds( hash32, block, midstate1, midstate2 );
+      sha256_8way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );

      // 2. 32 byte hash from 1.
-      memcpy_256( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_256( block + 9, 6 );
-      block[15] = m256_const1_32( 32*8 ); // bit count
-      sha256_8way_transform_le( hash32, block, initstate );
+      sha256_8way_transform_le( block, block, initstate );

      // 3. 32 byte hash from 2.
-      memcpy_256( block, hash32, 8 );
-      sha256_8way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
-      mm256_block_bswap_32( hash32, hash32 );
-
-      for ( int lane = 0; lane < 8; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      if ( unlikely(
+               sha256_8way_transform_le_short( hash32, block, initstate ) ) )
      {
-         extr_lane_8x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         // byte swap final hash for testing
+         mm256_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 8; lane++ )
+         if ( hash32_d7[ lane ] <= targ32_d7 )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_8x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
-       *noncev = _mm256_add_epi32( *noncev, eight );
-       n += 8;
+      }
+      *noncev = _mm256_add_epi32( *noncev, eight );
+      n += 8;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
   *hashes_done = n - first_nonce;
@@ -183,17 +188,110 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,

 #endif

+
 #if defined(SHA256T_4WAY)

+// The optimized 4-way version below is slower with AVX/SSE2, so it is disabled.
+// https://github.com/JayDDee/cpuminer-opt/issues/344
+/*
+int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m128i  vdata[32]     __attribute__ ((aligned (64)));
+   __m128i  block[16]     __attribute__ ((aligned (32)));
+   __m128i  hash32[8]     __attribute__ ((aligned (32)));
+   __m128i  initstate[8]  __attribute__ ((aligned (32)));
+   __m128i  midstate1[8]  __attribute__ ((aligned (32)));
+   __m128i  midstate2[8]  __attribute__ ((aligned (32)));
+   __m128i  mexp_pre[16]  __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8]  __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 = (uint32_t*)&( hash32[7] );
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   uint32_t n = first_nonce;
+   __m128i *noncev = vdata + 19;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m128i last_byte = m128_const1_32( 0x80000000 );
+   const __m128i four = m128_const1_32( 4 );
+
+   for ( int i = 0; i < 19; i++ )
+       vdata[i] = m128_const1_32( pdata[i] );
+
+   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
+
+   vdata[16+4] = last_byte;
+   memset_zero_128( vdata+16 + 5, 10 );
+   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_128( block + 9, 6 );
+   block[15] = m128_const1_32( 32*8 ); // bit count
+
+   // initialize state
+   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
+   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
+   initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
+   initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
+   initstate[4] = m128_const1_64( 0x510E527F510E527F );
+   initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
+   initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
+   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
+
+   // hash first 64 bytes of data
+   sha256_4way_transform_le( midstate1, vdata, initstate );
+
+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_4way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );
+
+   do
+   {
+      // 1. final 16 bytes of data, with padding
+      sha256_4way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );
+
+      // 2. 32 byte hash from 1.
+      sha256_4way_transform_le( block, block, initstate );
+
+      // 3. 32 byte hash from 2.
+      if ( unlikely(
+              sha256_4way_transform_le_short( hash32, block, initstate ) ) )
+      {
+         // byte swap final hash for testing
+         mm128_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 4; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+         {
+            extr_lane_4x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
+         }
+      }
+      *noncev = _mm_add_epi32( *noncev, four );
+      n += 4;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+*/
+
 int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m128i  block[16]    __attribute__ ((aligned (64)));
+   __m128i  vdata[32]    __attribute__ ((aligned (64)));
+   __m128i  block[16]    __attribute__ ((aligned (32)));
   __m128i  hash32[8]    __attribute__ ((aligned (32)));
   __m128i  initstate[8] __attribute__ ((aligned (32)));
   __m128i  midstate[8]  __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   __m128i  vdata[20]    __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
@@ -212,6 +310,14 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

+   vdata[16+4] = last_byte;
+   memset_zero_128( vdata+16 + 5, 10 );
+   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_128( block + 9, 6 );
+   block[15] = m128_const1_32( 32*8 ); // bit count
+
   // initialize state
   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
@@ -227,25 +333,9 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,

   do
   {
-      // 1. final 16 bytes of data, with padding
-      memcpy_128( block, vdata + 16, 4 );
-      block[ 4] = last_byte;
-      memset_zero_128( block + 5, 10 );
-      block[15] = m128_const1_32( 80*8 ); // bit count
-      sha256_4way_transform_le( hash32, block, midstate );
-
-      // 2. 32 byte hash from 1.
-      memcpy_128( block, hash32, 8 );
-      block[ 8] = last_byte;
-      memset_zero_128( block + 9, 6 );
-      block[15] = m128_const1_32( 32*8 ); // bit count
-      sha256_4way_transform_le( hash32, block, initstate );
-
-      // 3. 32 byte hash from 2.
-      memcpy_128( block, hash32, 8 );
-      sha256_4way_transform_le( hash32, block, initstate );
-
-      // byte swap final hash for testing
+      sha256_4way_transform_le( block,  vdata+16, midstate  );
+      sha256_4way_transform_le( block,  block,    initstate );
+      sha256_4way_transform_le( hash32, block,    initstate );
      mm128_block_bswap_32( hash32, hash32 );

      for ( int lane = 0; lane < 4; lane++ )
@@ -266,5 +356,6 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
   return 0;
 }

+
 #endif

--- a/algo/shabal/shabal-hash-4way.c
+++ b/algo/shabal/shabal-hash-4way.c
@@ -62,8 +62,8 @@ extern "C"{
 #if defined(__AVX2__)

 #define DECL_STATE8   \
-   __m256i A00, A01, A02, A03, A04, A05, A06, A07, \
-           A08, A09, A0A, A0B; \
+   __m256i A0, A1, A2, A3, A4, A5, A6, A7, \
+           A8, A9, AA, AB; \
   __m256i B0, B1, B2, B3, B4, B5, B6, B7, \
           B8, B9, BA, BB, BC, BD, BE, BF; \
   __m256i C0, C1, C2, C3, C4, C5, C6, C7, \
@@ -78,18 +78,18 @@ extern "C"{
 { \
   if ( (state)->state_loaded ) \
   { \
-      A00 = (state)->A[0]; \
-      A01 = (state)->A[1]; \
-      A02 = (state)->A[2]; \
-      A03 = (state)->A[3]; \
-      A04 = (state)->A[4]; \
-      A05 = (state)->A[5]; \
-      A06 = (state)->A[6]; \
-      A07 = (state)->A[7]; \
-      A08 = (state)->A[8]; \
-      A09 = (state)->A[9]; \
-      A0A = (state)->A[10]; \
-      A0B = (state)->A[11]; \
+      A0 = (state)->A[0]; \
+      A1 = (state)->A[1]; \
+      A2 = (state)->A[2]; \
+      A3 = (state)->A[3]; \
+      A4 = (state)->A[4]; \
+      A5 = (state)->A[5]; \
+      A6 = (state)->A[6]; \
+      A7 = (state)->A[7]; \
+      A8 = (state)->A[8]; \
+      A9 = (state)->A[9]; \
+      AA = (state)->A[10]; \
+      AB = (state)->A[11]; \
      B0 = (state)->B[0]; \
      B1 = (state)->B[1]; \
      B2 = (state)->B[2]; \
@@ -126,18 +126,18 @@ extern "C"{
   else \
   { \
       (state)->state_loaded = true; \
-       A00 = m256_const1_64( 0x20728DFD20728DFD ); \
-       A01 = m256_const1_64( 0x46C0BD5346C0BD53 ); \
-       A02 = m256_const1_64( 0xE782B699E782B699 ); \
-       A03 = m256_const1_64( 0x5530463255304632 ); \
-       A04 = m256_const1_64( 0x71B4EF9071B4EF90 ); \
-       A05 = m256_const1_64( 0x0EA9E82C0EA9E82C ); \
-       A06 = m256_const1_64( 0xDBB930F1DBB930F1 ); \
-       A07 = m256_const1_64( 0xFAD06B8BFAD06B8B ); \
-       A08 = m256_const1_64( 0xBE0CAE40BE0CAE40 ); \
-       A09 = m256_const1_64( 0x8BD144108BD14410 ); \
-       A0A = m256_const1_64( 0x76D2ADAC76D2ADAC ); \
-       A0B = m256_const1_64( 0x28ACAB7F28ACAB7F ); \
+       A0 = m256_const1_64( 0x20728DFD20728DFD ); \
+       A1 = m256_const1_64( 0x46C0BD5346C0BD53 ); \
+       A2 = m256_const1_64( 0xE782B699E782B699 ); \
+       A3 = m256_const1_64( 0x5530463255304632 ); \
+       A4 = m256_const1_64( 0x71B4EF9071B4EF90 ); \
+       A5 = m256_const1_64( 0x0EA9E82C0EA9E82C ); \
+       A6 = m256_const1_64( 0xDBB930F1DBB930F1 ); \
+       A7 = m256_const1_64( 0xFAD06B8BFAD06B8B ); \
+       A8 = m256_const1_64( 0xBE0CAE40BE0CAE40 ); \
+       A9 = m256_const1_64( 0x8BD144108BD14410 ); \
+       AA = m256_const1_64( 0x76D2ADAC76D2ADAC ); \
+       AB = m256_const1_64( 0x28ACAB7F28ACAB7F ); \
       B0 = m256_const1_64( 0xC1099CB7C1099CB7 ); \
       B1 = m256_const1_64( 0x07B385F307B385F3 ); \
       B2 = m256_const1_64( 0xE7442C26E7442C26 ); \
@@ -176,18 +176,18 @@ extern "C"{
 } while (0)

 #define WRITE_STATE8(state)   do { \
-      (state)->A[0] = A00; \
-      (state)->A[1] = A01; \
-      (state)->A[2] = A02; \
-      (state)->A[3] = A03; \
-      (state)->A[4] = A04; \
-      (state)->A[5] = A05; \
-      (state)->A[6] = A06; \
-      (state)->A[7] = A07; \
-      (state)->A[8] = A08; \
-      (state)->A[9] = A09; \
-      (state)->A[10] = A0A; \
-      (state)->A[11] = A0B; \
+      (state)->A[0] = A0; \
+      (state)->A[1] = A1; \
+      (state)->A[2] = A2; \
+      (state)->A[3] = A3; \
+      (state)->A[4] = A4; \
+      (state)->A[5] = A5; \
+      (state)->A[6] = A6; \
+      (state)->A[7] = A7; \
+      (state)->A[8] = A8; \
+      (state)->A[9] = A9; \
+      (state)->A[10] = AA; \
+      (state)->A[11] = AB; \
      (state)->B[0] = B0; \
      (state)->B[1] = B1; \
      (state)->B[2] = B2; \
@@ -286,8 +286,8 @@ do { \

 #define XOR_W8 \
 do { \
-   A00 = _mm256_xor_si256( A00, _mm256_set1_epi32( Wlow ) ); \
-   A01 = _mm256_xor_si256( A01, _mm256_set1_epi32( Whigh ) ); \
+   A0 = _mm256_xor_si256( A0, _mm256_set1_epi32( Wlow ) ); \
+   A1 = _mm256_xor_si256( A1, _mm256_set1_epi32( Whigh ) ); \
 } while (0)

 #define SWAP_BC8 \
@@ -321,60 +321,60 @@ do { \
 } while (0)

 #define PERM_STEP_0_8   do { \
-      PERM_ELT8(A00, A0B, B0, BD, B9, B6, C8, M0); \
-      PERM_ELT8(A01, A00, B1, BE, BA, B7, C7, M1); \
-      PERM_ELT8(A02, A01, B2, BF, BB, B8, C6, M2); \
-      PERM_ELT8(A03, A02, B3, B0, BC, B9, C5, M3); \
-      PERM_ELT8(A04, A03, B4, B1, BD, BA, C4, M4); \
-      PERM_ELT8(A05, A04, B5, B2, BE, BB, C3, M5); \
-      PERM_ELT8(A06, A05, B6, B3, BF, BC, C2, M6); \
-      PERM_ELT8(A07, A06, B7, B4, B0, BD, C1, M7); \
-      PERM_ELT8(A08, A07, B8, B5, B1, BE, C0, M8); \
-      PERM_ELT8(A09, A08, B9, B6, B2, BF, CF, M9); \
-      PERM_ELT8(A0A, A09, BA, B7, B3, B0, CE, MA); \
-      PERM_ELT8(A0B, A0A, BB, B8, B4, B1, CD, MB); \
-      PERM_ELT8(A00, A0B, BC, B9, B5, B2, CC, MC); \
-      PERM_ELT8(A01, A00, BD, BA, B6, B3, CB, MD); \
-      PERM_ELT8(A02, A01, BE, BB, B7, B4, CA, ME); \
-      PERM_ELT8(A03, A02, BF, BC, B8, B5, C9, MF); \
+      PERM_ELT8(A0, AB, B0, BD, B9, B6, C8, M0); \
+      PERM_ELT8(A1, A0, B1, BE, BA, B7, C7, M1); \
+      PERM_ELT8(A2, A1, B2, BF, BB, B8, C6, M2); \
+      PERM_ELT8(A3, A2, B3, B0, BC, B9, C5, M3); \
+      PERM_ELT8(A4, A3, B4, B1, BD, BA, C4, M4); \
+      PERM_ELT8(A5, A4, B5, B2, BE, BB, C3, M5); \
+      PERM_ELT8(A6, A5, B6, B3, BF, BC, C2, M6); \
+      PERM_ELT8(A7, A6, B7, B4, B0, BD, C1, M7); \
+      PERM_ELT8(A8, A7, B8, B5, B1, BE, C0, M8); \
+      PERM_ELT8(A9, A8, B9, B6, B2, BF, CF, M9); \
+      PERM_ELT8(AA, A9, BA, B7, B3, B0, CE, MA); \
+      PERM_ELT8(AB, AA, BB, B8, B4, B1, CD, MB); \
+      PERM_ELT8(A0, AB, BC, B9, B5, B2, CC, MC); \
+      PERM_ELT8(A1, A0, BD, BA, B6, B3, CB, MD); \
+      PERM_ELT8(A2, A1, BE, BB, B7, B4, CA, ME); \
+      PERM_ELT8(A3, A2, BF, BC, B8, B5, C9, MF); \
   } while (0)

 #define PERM_STEP_1_8   do { \
-      PERM_ELT8(A04, A03, B0, BD, B9, B6, C8, M0); \
-      PERM_ELT8(A05, A04, B1, BE, BA, B7, C7, M1); \
-      PERM_ELT8(A06, A05, B2, BF, BB, B8, C6, M2); \
-      PERM_ELT8(A07, A06, B3, B0, BC, B9, C5, M3); \
-      PERM_ELT8(A08, A07, B4, B1, BD, BA, C4, M4); \
-      PERM_ELT8(A09, A08, B5, B2, BE, BB, C3, M5); \
-      PERM_ELT8(A0A, A09, B6, B3, BF, BC, C2, M6); \
-      PERM_ELT8(A0B, A0A, B7, B4, B0, BD, C1, M7); \
-      PERM_ELT8(A00, A0B, B8, B5, B1, BE, C0, M8); \
-      PERM_ELT8(A01, A00, B9, B6, B2, BF, CF, M9); \
-      PERM_ELT8(A02, A01, BA, B7, B3, B0, CE, MA); \
-      PERM_ELT8(A03, A02, BB, B8, B4, B1, CD, MB); \
-      PERM_ELT8(A04, A03, BC, B9, B5, B2, CC, MC); \
-      PERM_ELT8(A05, A04, BD, BA, B6, B3, CB, MD); \
-      PERM_ELT8(A06, A05, BE, BB, B7, B4, CA, ME); \
-      PERM_ELT8(A07, A06, BF, BC, B8, B5, C9, MF); \
+      PERM_ELT8(A4, A3, B0, BD, B9, B6, C8, M0); \
+      PERM_ELT8(A5, A4, B1, BE, BA, B7, C7, M1); \
+      PERM_ELT8(A6, A5, B2, BF, BB, B8, C6, M2); \
+      PERM_ELT8(A7, A6, B3, B0, BC, B9, C5, M3); \
+      PERM_ELT8(A8, A7, B4, B1, BD, BA, C4, M4); \
+      PERM_ELT8(A9, A8, B5, B2, BE, BB, C3, M5); \
+      PERM_ELT8(AA, A9, B6, B3, BF, BC, C2, M6); \
+      PERM_ELT8(AB, AA, B7, B4, B0, BD, C1, M7); \
+      PERM_ELT8(A0, AB, B8, B5, B1, BE, C0, M8); \
+      PERM_ELT8(A1, A0, B9, B6, B2, BF, CF, M9); \
+      PERM_ELT8(A2, A1, BA, B7, B3, B0, CE, MA); \
+      PERM_ELT8(A3, A2, BB, B8, B4, B1, CD, MB); \
+      PERM_ELT8(A4, A3, BC, B9, B5, B2, CC, MC); \
+      PERM_ELT8(A5, A4, BD, BA, B6, B3, CB, MD); \
+      PERM_ELT8(A6, A5, BE, BB, B7, B4, CA, ME); \
+      PERM_ELT8(A7, A6, BF, BC, B8, B5, C9, MF); \
   } while (0)

 #define PERM_STEP_2_8   do { \
-      PERM_ELT8(A08, A07, B0, BD, B9, B6, C8, M0); \
-      PERM_ELT8(A09, A08, B1, BE, BA, B7, C7, M1); \
-      PERM_ELT8(A0A, A09, B2, BF, BB, B8, C6, M2); \
-      PERM_ELT8(A0B, A0A, B3, B0, BC, B9, C5, M3); \
-      PERM_ELT8(A00, A0B, B4, B1, BD, BA, C4, M4); \
-      PERM_ELT8(A01, A00, B5, B2, BE, BB, C3, M5); \
-      PERM_ELT8(A02, A01, B6, B3, BF, BC, C2, M6); \
-      PERM_ELT8(A03, A02, B7, B4, B0, BD, C1, M7); \
-      PERM_ELT8(A04, A03, B8, B5, B1, BE, C0, M8); \
-      PERM_ELT8(A05, A04, B9, B6, B2, BF, CF, M9); \
-      PERM_ELT8(A06, A05, BA, B7, B3, B0, CE, MA); \
-      PERM_ELT8(A07, A06, BB, B8, B4, B1, CD, MB); \
-      PERM_ELT8(A08, A07, BC, B9, B5, B2, CC, MC); \
-      PERM_ELT8(A09, A08, BD, BA, B6, B3, CB, MD); \
-      PERM_ELT8(A0A, A09, BE, BB, B7, B4, CA, ME); \
-      PERM_ELT8(A0B, A0A, BF, BC, B8, B5, C9, MF); \
+      PERM_ELT8(A8, A7, B0, BD, B9, B6, C8, M0); \
+      PERM_ELT8(A9, A8, B1, BE, BA, B7, C7, M1); \
+      PERM_ELT8(AA, A9, B2, BF, BB, B8, C6, M2); \
+      PERM_ELT8(AB, AA, B3, B0, BC, B9, C5, M3); \
+      PERM_ELT8(A0, AB, B4, B1, BD, BA, C4, M4); \
+      PERM_ELT8(A1, A0, B5, B2, BE, BB, C3, M5); \
+      PERM_ELT8(A2, A1, B6, B3, BF, BC, C2, M6); \
+      PERM_ELT8(A3, A2, B7, B4, B0, BD, C1, M7); \
+      PERM_ELT8(A4, A3, B8, B5, B1, BE, C0, M8); \
+      PERM_ELT8(A5, A4, B9, B6, B2, BF, CF, M9); \
+      PERM_ELT8(A6, A5, BA, B7, B3, B0, CE, MA); \
+      PERM_ELT8(A7, A6, BB, B8, B4, B1, CD, MB); \
+      PERM_ELT8(A8, A7, BC, B9, B5, B2, CC, MC); \
+      PERM_ELT8(A9, A8, BD, BA, B6, B3, CB, MD); \
+      PERM_ELT8(AA, A9, BE, BB, B7, B4, CA, ME); \
+      PERM_ELT8(AB, AA, BF, BC, B8, B5, C9, MF); \
   } while (0)

 #define APPLY_P8 \
@@ -398,42 +398,42 @@ do { \
    PERM_STEP_0_8; \
    PERM_STEP_1_8; \
    PERM_STEP_2_8; \
-    A0B = _mm256_add_epi32( A0B, C6 ); \
-    A0A = _mm256_add_epi32( A0A, C5 ); \
-    A09 = _mm256_add_epi32( A09, C4 ); \
-    A08 = _mm256_add_epi32( A08, C3 ); \
-    A07 = _mm256_add_epi32( A07, C2 ); \
-    A06 = _mm256_add_epi32( A06, C1 ); \
-    A05 = _mm256_add_epi32( A05, C0 ); \
-    A04 = _mm256_add_epi32( A04, CF ); \
-    A03 = _mm256_add_epi32( A03, CE ); \
-    A02 = _mm256_add_epi32( A02, CD ); \
-    A01 = _mm256_add_epi32( A01, CC ); \
-    A00 = _mm256_add_epi32( A00, CB ); \
-    A0B = _mm256_add_epi32( A0B, CA ); \
-    A0A = _mm256_add_epi32( A0A, C9 ); \
-    A09 = _mm256_add_epi32( A09, C8 ); \
-    A08 = _mm256_add_epi32( A08, C7 ); \
-    A07 = _mm256_add_epi32( A07, C6 ); \
-    A06 = _mm256_add_epi32( A06, C5 ); \
-    A05 = _mm256_add_epi32( A05, C4 ); \
-    A04 = _mm256_add_epi32( A04, C3 ); \
-    A03 = _mm256_add_epi32( A03, C2 ); \
-    A02 = _mm256_add_epi32( A02, C1 ); \
-    A01 = _mm256_add_epi32( A01, C0 ); \
-    A00 = _mm256_add_epi32( A00, CF ); \
-    A0B = _mm256_add_epi32( A0B, CE ); \
-    A0A = _mm256_add_epi32( A0A, CD ); \
-    A09 = _mm256_add_epi32( A09, CC ); \
-    A08 = _mm256_add_epi32( A08, CB ); \
-    A07 = _mm256_add_epi32( A07, CA ); \
-    A06 = _mm256_add_epi32( A06, C9 ); \
-    A05 = _mm256_add_epi32( A05, C8 ); \
-    A04 = _mm256_add_epi32( A04, C7 ); \
-    A03 = _mm256_add_epi32( A03, C6 ); \
-    A02 = _mm256_add_epi32( A02, C5 ); \
-    A01 = _mm256_add_epi32( A01, C4 ); \
-    A00 = _mm256_add_epi32( A00, C3 ); \
+    AB = _mm256_add_epi32( AB, C6 ); \
+    AA = _mm256_add_epi32( AA, C5 ); \
+    A9 = _mm256_add_epi32( A9, C4 ); \
+    A8 = _mm256_add_epi32( A8, C3 ); \
+    A7 = _mm256_add_epi32( A7, C2 ); \
+    A6 = _mm256_add_epi32( A6, C1 ); \
+    A5 = _mm256_add_epi32( A5, C0 ); \
+    A4 = _mm256_add_epi32( A4, CF ); \
+    A3 = _mm256_add_epi32( A3, CE ); \
+    A2 = _mm256_add_epi32( A2, CD ); \
+    A1 = _mm256_add_epi32( A1, CC ); \
+    A0 = _mm256_add_epi32( A0, CB ); \
+    AB = _mm256_add_epi32( AB, CA ); \
+    AA = _mm256_add_epi32( AA, C9 ); \
+    A9 = _mm256_add_epi32( A9, C8 ); \
+    A8 = _mm256_add_epi32( A8, C7 ); \
+    A7 = _mm256_add_epi32( A7, C6 ); \
+    A6 = _mm256_add_epi32( A6, C5 ); \
+    A5 = _mm256_add_epi32( A5, C4 ); \
+    A4 = _mm256_add_epi32( A4, C3 ); \
+    A3 = _mm256_add_epi32( A3, C2 ); \
+    A2 = _mm256_add_epi32( A2, C1 ); \
+    A1 = _mm256_add_epi32( A1, C0 ); \
+    A0 = _mm256_add_epi32( A0, CF ); \
+    AB = _mm256_add_epi32( AB, CE ); \
+    AA = _mm256_add_epi32( AA, CD ); \
+    A9 = _mm256_add_epi32( A9, CC ); \
+    A8 = _mm256_add_epi32( A8, CB ); \
+    A7 = _mm256_add_epi32( A7, CA ); \
+    A6 = _mm256_add_epi32( A6, C9 ); \
+    A5 = _mm256_add_epi32( A5, C8 ); \
+    A4 = _mm256_add_epi32( A4, C7 ); \
+    A3 = _mm256_add_epi32( A3, C6 ); \
+    A2 = _mm256_add_epi32( A2, C5 ); \
+    A1 = _mm256_add_epi32( A1, C4 ); \
+    A0 = _mm256_add_epi32( A0, C3 ); \
 } while (0)

 #define INCR_W8   do { \
@@ -660,8 +660,8 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)


 #define DECL_STATE   \
-	__m128i A00, A01, A02, A03, A04, A05, A06, A07, \
-	        A08, A09, A0A, A0B; \
+	__m128i A0, A1, A2, A3, A4, A5, A6, A7, \
+	        A8, A9, AA, AB; \
 	__m128i B0, B1, B2, B3, B4, B5, B6, B7, \
 	        B8, B9, BA, BB, BC, BD, BE, BF; \
 	__m128i C0, C1, C2, C3, C4, C5, C6, C7, \
@@ -676,18 +676,18 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
 { \
   if ( (state)->state_loaded ) \
   { \
-      A00 = (state)->A[0]; \
-		A01 = (state)->A[1]; \
-		A02 = (state)->A[2]; \
-		A03 = (state)->A[3]; \
-		A04 = (state)->A[4]; \
-		A05 = (state)->A[5]; \
-		A06 = (state)->A[6]; \
-		A07 = (state)->A[7]; \
-		A08 = (state)->A[8]; \
-		A09 = (state)->A[9]; \
-		A0A = (state)->A[10]; \
-		A0B = (state)->A[11]; \
+      A0 = (state)->A[0]; \
+		A1 = (state)->A[1]; \
+		A2 = (state)->A[2]; \
+		A3 = (state)->A[3]; \
+		A4 = (state)->A[4]; \
+		A5 = (state)->A[5]; \
+		A6 = (state)->A[6]; \
+		A7 = (state)->A[7]; \
+		A8 = (state)->A[8]; \
+		A9 = (state)->A[9]; \
+		AA = (state)->A[10]; \
+		AB = (state)->A[11]; \
 		B0 = (state)->B[0]; \
 		B1 = (state)->B[1]; \
 		B2 = (state)->B[2]; \
@@ -724,18 +724,18 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
   else \
   { \
       (state)->state_loaded = true; \
-       A00 = m128_const1_64( 0x20728DFD20728DFD ); \
-       A01 = m128_const1_64( 0x46C0BD5346C0BD53 ); \
-       A02 = m128_const1_64( 0xE782B699E782B699 ); \
-       A03 = m128_const1_64( 0x5530463255304632 ); \
-       A04 = m128_const1_64( 0x71B4EF9071B4EF90 ); \
-       A05 = m128_const1_64( 0x0EA9E82C0EA9E82C ); \
-       A06 = m128_const1_64( 0xDBB930F1DBB930F1 ); \
-       A07 = m128_const1_64( 0xFAD06B8BFAD06B8B ); \
-       A08 = m128_const1_64( 0xBE0CAE40BE0CAE40 ); \
-       A09 = m128_const1_64( 0x8BD144108BD14410 ); \
-       A0A = m128_const1_64( 0x76D2ADAC76D2ADAC ); \
-       A0B = m128_const1_64( 0x28ACAB7F28ACAB7F ); \
+       A0 = m128_const1_64( 0x20728DFD20728DFD ); \
+       A1 = m128_const1_64( 0x46C0BD5346C0BD53 ); \
+       A2 = m128_const1_64( 0xE782B699E782B699 ); \
+       A3 = m128_const1_64( 0x5530463255304632 ); \
+       A4 = m128_const1_64( 0x71B4EF9071B4EF90 ); \
+       A5 = m128_const1_64( 0x0EA9E82C0EA9E82C ); \
+       A6 = m128_const1_64( 0xDBB930F1DBB930F1 ); \
+       A7 = m128_const1_64( 0xFAD06B8BFAD06B8B ); \
+       A8 = m128_const1_64( 0xBE0CAE40BE0CAE40 ); \
+       A9 = m128_const1_64( 0x8BD144108BD14410 ); \
+       AA = m128_const1_64( 0x76D2ADAC76D2ADAC ); \
+       AB = m128_const1_64( 0x28ACAB7F28ACAB7F ); \
       B0 = m128_const1_64( 0xC1099CB7C1099CB7 ); \
       B1 = m128_const1_64( 0x07B385F307B385F3 ); \
       B2 = m128_const1_64( 0xE7442C26E7442C26 ); \
@@ -774,18 +774,18 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
 } while (0)

 #define WRITE_STATE(state)   do { \
-		(state)->A[0] = A00; \
-		(state)->A[1] = A01; \
-		(state)->A[2] = A02; \
-		(state)->A[3] = A03; \
-		(state)->A[4] = A04; \
-		(state)->A[5] = A05; \
-		(state)->A[6] = A06; \
-		(state)->A[7] = A07; \
-		(state)->A[8] = A08; \
-		(state)->A[9] = A09; \
-		(state)->A[10] = A0A; \
-		(state)->A[11] = A0B; \
+		(state)->A[0] = A0; \
+		(state)->A[1] = A1; \
+		(state)->A[2] = A2; \
+		(state)->A[3] = A3; \
+		(state)->A[4] = A4; \
+		(state)->A[5] = A5; \
+		(state)->A[6] = A6; \
+		(state)->A[7] = A7; \
+		(state)->A[8] = A8; \
+		(state)->A[9] = A9; \
+		(state)->A[10] = AA; \
+		(state)->A[11] = AB; \
 		(state)->B[0] = B0; \
 		(state)->B[1] = B1; \
 		(state)->B[2] = B2; \
@@ -884,8 +884,8 @@ do { \

 #define XOR_W \
 do { \
-   A00 = _mm_xor_si128( A00, _mm_set1_epi32( Wlow ) ); \
-   A01 = _mm_xor_si128( A01, _mm_set1_epi32( Whigh ) ); \
+   A0 = _mm_xor_si128( A0, _mm_set1_epi32( Wlow ) ); \
+   A1 = _mm_xor_si128( A1, _mm_set1_epi32( Whigh ) ); \
 } while (0)


@@ -940,60 +940,60 @@ do { \
 } while (0)

 #define PERM_STEP_0   do { \
-		PERM_ELT(A00, A0B, B0, BD, B9, B6, C8, M0); \
-		PERM_ELT(A01, A00, B1, BE, BA, B7, C7, M1); \
-		PERM_ELT(A02, A01, B2, BF, BB, B8, C6, M2); \
-		PERM_ELT(A03, A02, B3, B0, BC, B9, C5, M3); \
-		PERM_ELT(A04, A03, B4, B1, BD, BA, C4, M4); \
-		PERM_ELT(A05, A04, B5, B2, BE, BB, C3, M5); \
-		PERM_ELT(A06, A05, B6, B3, BF, BC, C2, M6); \
-		PERM_ELT(A07, A06, B7, B4, B0, BD, C1, M7); \
-		PERM_ELT(A08, A07, B8, B5, B1, BE, C0, M8); \
-		PERM_ELT(A09, A08, B9, B6, B2, BF, CF, M9); \
-		PERM_ELT(A0A, A09, BA, B7, B3, B0, CE, MA); \
-		PERM_ELT(A0B, A0A, BB, B8, B4, B1, CD, MB); \
-		PERM_ELT(A00, A0B, BC, B9, B5, B2, CC, MC); \
-		PERM_ELT(A01, A00, BD, BA, B6, B3, CB, MD); \
-		PERM_ELT(A02, A01, BE, BB, B7, B4, CA, ME); \
-		PERM_ELT(A03, A02, BF, BC, B8, B5, C9, MF); \
+		PERM_ELT(A0, AB, B0, BD, B9, B6, C8, M0); \
+		PERM_ELT(A1, A0, B1, BE, BA, B7, C7, M1); \
+		PERM_ELT(A2, A1, B2, BF, BB, B8, C6, M2); \
+		PERM_ELT(A3, A2, B3, B0, BC, B9, C5, M3); \
+		PERM_ELT(A4, A3, B4, B1, BD, BA, C4, M4); \
+		PERM_ELT(A5, A4, B5, B2, BE, BB, C3, M5); \
+		PERM_ELT(A6, A5, B6, B3, BF, BC, C2, M6); \
+		PERM_ELT(A7, A6, B7, B4, B0, BD, C1, M7); \
+		PERM_ELT(A8, A7, B8, B5, B1, BE, C0, M8); \
+		PERM_ELT(A9, A8, B9, B6, B2, BF, CF, M9); \
+		PERM_ELT(AA, A9, BA, B7, B3, B0, CE, MA); \
+		PERM_ELT(AB, AA, BB, B8, B4, B1, CD, MB); \
+		PERM_ELT(A0, AB, BC, B9, B5, B2, CC, MC); \
+		PERM_ELT(A1, A0, BD, BA, B6, B3, CB, MD); \
+		PERM_ELT(A2, A1, BE, BB, B7, B4, CA, ME); \
+		PERM_ELT(A3, A2, BF, BC, B8, B5, C9, MF); \
 	} while (0)

 #define PERM_STEP_1   do { \
-		PERM_ELT(A04, A03, B0, BD, B9, B6, C8, M0); \
-		PERM_ELT(A05, A04, B1, BE, BA, B7, C7, M1); \
-		PERM_ELT(A06, A05, B2, BF, BB, B8, C6, M2); \
-		PERM_ELT(A07, A06, B3, B0, BC, B9, C5, M3); \
-		PERM_ELT(A08, A07, B4, B1, BD, BA, C4, M4); \
-		PERM_ELT(A09, A08, B5, B2, BE, BB, C3, M5); \
-		PERM_ELT(A0A, A09, B6, B3, BF, BC, C2, M6); \
-		PERM_ELT(A0B, A0A, B7, B4, B0, BD, C1, M7); \
-		PERM_ELT(A00, A0B, B8, B5, B1, BE, C0, M8); \
-		PERM_ELT(A01, A00, B9, B6, B2, BF, CF, M9); \
-		PERM_ELT(A02, A01, BA, B7, B3, B0, CE, MA); \
-		PERM_ELT(A03, A02, BB, B8, B4, B1, CD, MB); \
-		PERM_ELT(A04, A03, BC, B9, B5, B2, CC, MC); \
-		PERM_ELT(A05, A04, BD, BA, B6, B3, CB, MD); \
-		PERM_ELT(A06, A05, BE, BB, B7, B4, CA, ME); \
-		PERM_ELT(A07, A06, BF, BC, B8, B5, C9, MF); \
+		PERM_ELT(A4, A3, B0, BD, B9, B6, C8, M0); \
+		PERM_ELT(A5, A4, B1, BE, BA, B7, C7, M1); \
+		PERM_ELT(A6, A5, B2, BF, BB, B8, C6, M2); \
+		PERM_ELT(A7, A6, B3, B0, BC, B9, C5, M3); \
+		PERM_ELT(A8, A7, B4, B1, BD, BA, C4, M4); \
+		PERM_ELT(A9, A8, B5, B2, BE, BB, C3, M5); \
+		PERM_ELT(AA, A9, B6, B3, BF, BC, C2, M6); \
+		PERM_ELT(AB, AA, B7, B4, B0, BD, C1, M7); \
+		PERM_ELT(A0, AB, B8, B5, B1, BE, C0, M8); \
+		PERM_ELT(A1, A0, B9, B6, B2, BF, CF, M9); \
+		PERM_ELT(A2, A1, BA, B7, B3, B0, CE, MA); \
+		PERM_ELT(A3, A2, BB, B8, B4, B1, CD, MB); \
+		PERM_ELT(A4, A3, BC, B9, B5, B2, CC, MC); \
+		PERM_ELT(A5, A4, BD, BA, B6, B3, CB, MD); \
+		PERM_ELT(A6, A5, BE, BB, B7, B4, CA, ME); \
+		PERM_ELT(A7, A6, BF, BC, B8, B5, C9, MF); \
 	} while (0)

 #define PERM_STEP_2   do { \
-		PERM_ELT(A08, A07, B0, BD, B9, B6, C8, M0); \
-		PERM_ELT(A09, A08, B1, BE, BA, B7, C7, M1); \
-		PERM_ELT(A0A, A09, B2, BF, BB, B8, C6, M2); \
-		PERM_ELT(A0B, A0A, B3, B0, BC, B9, C5, M3); \
-		PERM_ELT(A00, A0B, B4, B1, BD, BA, C4, M4); \
-		PERM_ELT(A01, A00, B5, B2, BE, BB, C3, M5); \
-		PERM_ELT(A02, A01, B6, B3, BF, BC, C2, M6); \
-		PERM_ELT(A03, A02, B7, B4, B0, BD, C1, M7); \
-		PERM_ELT(A04, A03, B8, B5, B1, BE, C0, M8); \
-		PERM_ELT(A05, A04, B9, B6, B2, BF, CF, M9); \
-		PERM_ELT(A06, A05, BA, B7, B3, B0, CE, MA); \
-		PERM_ELT(A07, A06, BB, B8, B4, B1, CD, MB); \
-		PERM_ELT(A08, A07, BC, B9, B5, B2, CC, MC); \
-		PERM_ELT(A09, A08, BD, BA, B6, B3, CB, MD); \
-		PERM_ELT(A0A, A09, BE, BB, B7, B4, CA, ME); \
-		PERM_ELT(A0B, A0A, BF, BC, B8, B5, C9, MF); \
+		PERM_ELT(A8, A7, B0, BD, B9, B6, C8, M0); \
+		PERM_ELT(A9, A8, B1, BE, BA, B7, C7, M1); \
+		PERM_ELT(AA, A9, B2, BF, BB, B8, C6, M2); \
+		PERM_ELT(AB, AA, B3, B0, BC, B9, C5, M3); \
+		PERM_ELT(A0, AB, B4, B1, BD, BA, C4, M4); \
+		PERM_ELT(A1, A0, B5, B2, BE, BB, C3, M5); \
+		PERM_ELT(A2, A1, B6, B3, BF, BC, C2, M6); \
+		PERM_ELT(A3, A2, B7, B4, B0, BD, C1, M7); \
+		PERM_ELT(A4, A3, B8, B5, B1, BE, C0, M8); \
+		PERM_ELT(A5, A4, B9, B6, B2, BF, CF, M9); \
+		PERM_ELT(A6, A5, BA, B7, B3, B0, CE, MA); \
+		PERM_ELT(A7, A6, BB, B8, B4, B1, CD, MB); \
+		PERM_ELT(A8, A7, BC, B9, B5, B2, CC, MC); \
+		PERM_ELT(A9, A8, BD, BA, B6, B3, CB, MD); \
+		PERM_ELT(AA, A9, BE, BB, B7, B4, CA, ME); \
+		PERM_ELT(AB, AA, BF, BC, B8, B5, C9, MF); \
 	} while (0)

 #define APPLY_P \
@@ -1017,42 +1017,42 @@ do { \
    PERM_STEP_0; \
    PERM_STEP_1; \
    PERM_STEP_2; \
-    A0B = _mm_add_epi32( A0B, C6 ); \
-    A0A = _mm_add_epi32( A0A, C5 ); \
-    A09 = _mm_add_epi32( A09, C4 ); \
-    A08 = _mm_add_epi32( A08, C3 ); \
-    A07 = _mm_add_epi32( A07, C2 ); \
-    A06 = _mm_add_epi32( A06, C1 ); \
-    A05 = _mm_add_epi32( A05, C0 ); \
-    A04 = _mm_add_epi32( A04, CF ); \
-    A03 = _mm_add_epi32( A03, CE ); \
-    A02 = _mm_add_epi32( A02, CD ); \
-    A01 = _mm_add_epi32( A01, CC ); \
-    A00 = _mm_add_epi32( A00, CB ); \
-    A0B = _mm_add_epi32( A0B, CA ); \
-    A0A = _mm_add_epi32( A0A, C9 ); \
-    A09 = _mm_add_epi32( A09, C8 ); \
-    A08 = _mm_add_epi32( A08, C7 ); \
-    A07 = _mm_add_epi32( A07, C6 ); \
-    A06 = _mm_add_epi32( A06, C5 ); \
-    A05 = _mm_add_epi32( A05, C4 ); \
-    A04 = _mm_add_epi32( A04, C3 ); \
-    A03 = _mm_add_epi32( A03, C2 ); \
-    A02 = _mm_add_epi32( A02, C1 ); \
-    A01 = _mm_add_epi32( A01, C0 ); \
-    A00 = _mm_add_epi32( A00, CF ); \
-    A0B = _mm_add_epi32( A0B, CE ); \
-    A0A = _mm_add_epi32( A0A, CD ); \
-    A09 = _mm_add_epi32( A09, CC ); \
-    A08 = _mm_add_epi32( A08, CB ); \
-    A07 = _mm_add_epi32( A07, CA ); \
-    A06 = _mm_add_epi32( A06, C9 ); \
-    A05 = _mm_add_epi32( A05, C8 ); \
-    A04 = _mm_add_epi32( A04, C7 ); \
-    A03 = _mm_add_epi32( A03, C6 ); \
-    A02 = _mm_add_epi32( A02, C5 ); \
-    A01 = _mm_add_epi32( A01, C4 ); \
-    A00 = _mm_add_epi32( A00, C3 ); \
+    AB = _mm_add_epi32( AB, C6 ); \
+    AA = _mm_add_epi32( AA, C5 ); \
+    A9 = _mm_add_epi32( A9, C4 ); \
+    A8 = _mm_add_epi32( A8, C3 ); \
+    A7 = _mm_add_epi32( A7, C2 ); \
+    A6 = _mm_add_epi32( A6, C1 ); \
+    A5 = _mm_add_epi32( A5, C0 ); \
+    A4 = _mm_add_epi32( A4, CF ); \
+    A3 = _mm_add_epi32( A3, CE ); \
+    A2 = _mm_add_epi32( A2, CD ); \
+    A1 = _mm_add_epi32( A1, CC ); \
+    A0 = _mm_add_epi32( A0, CB ); \
+    AB = _mm_add_epi32( AB, CA ); \
+    AA = _mm_add_epi32( AA, C9 ); \
+    A9 = _mm_add_epi32( A9, C8 ); \
+    A8 = _mm_add_epi32( A8, C7 ); \
+    A7 = _mm_add_epi32( A7, C6 ); \
+    A6 = _mm_add_epi32( A6, C5 ); \
+    A5 = _mm_add_epi32( A5, C4 ); \
+    A4 = _mm_add_epi32( A4, C3 ); \
+    A3 = _mm_add_epi32( A3, C2 ); \
+    A2 = _mm_add_epi32( A2, C1 ); \
+    A1 = _mm_add_epi32( A1, C0 ); \
+    A0 = _mm_add_epi32( A0, CF ); \
+    AB = _mm_add_epi32( AB, CE ); \
+    AA = _mm_add_epi32( AA, CD ); \
+    A9 = _mm_add_epi32( A9, CC ); \
+    A8 = _mm_add_epi32( A8, CB ); \
+    A7 = _mm_add_epi32( A7, CA ); \
+    A6 = _mm_add_epi32( A6, C9 ); \
+    A5 = _mm_add_epi32( A5, C8 ); \
+    A4 = _mm_add_epi32( A4, C7 ); \
+    A3 = _mm_add_epi32( A3, C6 ); \
+    A2 = _mm_add_epi32( A2, C5 ); \
+    A1 = _mm_add_epi32( A1, C4 ); \
+    A0 = _mm_add_epi32( A0, C3 ); \
 } while (0)

 #define INCR_W   do { \
--- a/algo/shavite/shavite-hash-2way.c
+++ b/algo/shavite/shavite-hash-2way.c
@@ -18,10 +18,13 @@ static const uint32_t IV512[] =
        0xE275EADE, 0x502D9FCD, 0xB9357178, 0x022A4B9A
 };

-
+/*
 #define mm256_ror2x256hi_1x32( a, b ) \
   _mm256_blend_epi32( mm256_shuflr128_32( a ), \
                       mm256_shuflr128_32( b ), 0x88 )
+*/
+
+//#define mm256_ror2x256hi_1x32( a, b ) _mm256_alignr_epi8( b, a, 4 )

 #if defined(__VAES__)

@@ -127,24 +130,24 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
     
     // round 2, 6, 10

-     k00 = _mm256_xor_si256( k00, mm256_ror2x256hi_1x32( k12, k13 ) );
+     k00 = _mm256_xor_si256( k00, _mm256_alignr_epi8( k13, k12, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( p3, k00 ), zero );
-     k01 = _mm256_xor_si256( k01, mm256_ror2x256hi_1x32( k13, k00 ) );
+     k01 = _mm256_xor_si256( k01, _mm256_alignr_epi8( k00, k13, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k01 ), zero );
-     k02 = _mm256_xor_si256( k02, mm256_ror2x256hi_1x32( k00, k01 ) );
+     k02 = _mm256_xor_si256( k02, _mm256_alignr_epi8( k01, k00, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k02 ), zero );
-     k03 = _mm256_xor_si256( k03, mm256_ror2x256hi_1x32( k01, k02 ) );
+     k03 = _mm256_xor_si256( k03, _mm256_alignr_epi8( k02, k01, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k03 ), zero );

     p2 = _mm256_xor_si256( p2, x );

-     k10 = _mm256_xor_si256( k10, mm256_ror2x256hi_1x32( k02, k03 ) );
+     k10 = _mm256_xor_si256( k10, _mm256_alignr_epi8( k03, k02, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( p1, k10 ), zero );
-     k11 = _mm256_xor_si256( k11, mm256_ror2x256hi_1x32( k03, k10 ) );
+     k11 = _mm256_xor_si256( k11, _mm256_alignr_epi8( k10, k03, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k11 ), zero );
-     k12 = _mm256_xor_si256( k12, mm256_ror2x256hi_1x32( k10, k11 ) );
+     k12 = _mm256_xor_si256( k12, _mm256_alignr_epi8( k11, k10, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k12 ), zero );
-     k13 = _mm256_xor_si256( k13, mm256_ror2x256hi_1x32( k11, k12 ) );
+     k13 = _mm256_xor_si256( k13, _mm256_alignr_epi8( k12, k11, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k13 ), zero );

     p0 = _mm256_xor_si256( p0, x );
@@ -183,24 +186,24 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )

     // round 4, 8, 12

-     k00 = _mm256_xor_si256( k00, mm256_ror2x256hi_1x32( k12, k13 ) );
+     k00 = _mm256_xor_si256( k00, _mm256_alignr_epi8( k13, k12, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( p1, k00 ), zero );
-     k01 = _mm256_xor_si256( k01, mm256_ror2x256hi_1x32( k13, k00 ) );
+     k01 = _mm256_xor_si256( k01, _mm256_alignr_epi8( k00, k13, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k01 ), zero );
-     k02 = _mm256_xor_si256( k02, mm256_ror2x256hi_1x32( k00, k01 ) );
+     k02 = _mm256_xor_si256( k02, _mm256_alignr_epi8( k01, k00, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k02 ), zero );
-     k03 = _mm256_xor_si256( k03, mm256_ror2x256hi_1x32( k01, k02 ) );
+     k03 = _mm256_xor_si256( k03, _mm256_alignr_epi8( k02, k01, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k03 ), zero );

     p0 = _mm256_xor_si256( p0, x );

-     k10 = _mm256_xor_si256( k10, mm256_ror2x256hi_1x32( k02, k03 ) );
+     k10 = _mm256_xor_si256( k10, _mm256_alignr_epi8( k03, k02, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( p3, k10 ), zero );
-     k11 = _mm256_xor_si256( k11, mm256_ror2x256hi_1x32( k03, k10 ) );
+     k11 = _mm256_xor_si256( k11, _mm256_alignr_epi8( k10, k03, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k11 ), zero );
-     k12 = _mm256_xor_si256( k12, mm256_ror2x256hi_1x32( k10, k11 ) );
+     k12 = _mm256_xor_si256( k12, _mm256_alignr_epi8( k11, k10, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k12 ), zero );
-     k13 = _mm256_xor_si256( k13, mm256_ror2x256hi_1x32( k11, k12 ) );
+     k13 = _mm256_xor_si256( k13, _mm256_alignr_epi8( k12, k11, 4 ) );
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k13 ), zero );

     p2 = _mm256_xor_si256( p2, x );
--- a/algo/shavite/shavite-hash-4way.c
+++ b/algo/shavite/shavite-hash-4way.c
@@ -11,10 +11,6 @@ static const uint32_t IV512[] =
        0xE275EADE, 0x502D9FCD, 0xB9357178, 0x022A4B9A
 };

-#define mm512_ror2x512hi_1x32( a, b ) \
-   _mm512_mask_blend_epi32( 0x8888, mm512_shuflr128_32( a ), \
-                                    mm512_shuflr128_32( b ) )
-
 static void
 c512_4way( shavite512_4way_context *ctx, const void *msg )
 {
@@ -106,24 +102,24 @@ c512_4way( shavite512_4way_context *ctx, const void *msg )
     
     // round 2, 6, 10

-     K0 = _mm512_xor_si512( K0, mm512_ror2x512hi_1x32( K6, K7 ) );
+     K0 = _mm512_xor_si512( K0, _mm512_alignr_epi8( K7, K6, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( P3, K0 ), m512_zero );
-     K1 = _mm512_xor_si512( K1, mm512_ror2x512hi_1x32( K7, K0 ) );
+     K1 = _mm512_xor_si512( K1, _mm512_alignr_epi8( K0, K7, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K1 ), m512_zero );
-     K2 = _mm512_xor_si512( K2, mm512_ror2x512hi_1x32( K0, K1 ) );
+     K2 = _mm512_xor_si512( K2, _mm512_alignr_epi8( K1, K0, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K2 ), m512_zero );
-     K3 = _mm512_xor_si512( K3, mm512_ror2x512hi_1x32( K1, K2 ) );
+     K3 = _mm512_xor_si512( K3, _mm512_alignr_epi8( K2, K1, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K3 ), m512_zero );

     P2 = _mm512_xor_si512( P2, X );

-     K4 = _mm512_xor_si512( K4, mm512_ror2x512hi_1x32( K2, K3 ) );
+     K4 = _mm512_xor_si512( K4, _mm512_alignr_epi8( K3, K2, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( P1, K4 ), m512_zero );
-     K5 = _mm512_xor_si512( K5, mm512_ror2x512hi_1x32( K3, K4 ) );
+     K5 = _mm512_xor_si512( K5, _mm512_alignr_epi8( K4, K3, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K5 ), m512_zero );
-     K6 = _mm512_xor_si512( K6, mm512_ror2x512hi_1x32( K4, K5 ) );
+     K6 = _mm512_xor_si512( K6, _mm512_alignr_epi8( K5, K4, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K6 ), m512_zero );
-     K7 = _mm512_xor_si512( K7, mm512_ror2x512hi_1x32( K5, K6 ) );
+     K7 = _mm512_xor_si512( K7, _mm512_alignr_epi8( K6, K5, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K7 ), m512_zero );

     P0 = _mm512_xor_si512( P0, X );
@@ -162,24 +158,24 @@ c512_4way( shavite512_4way_context *ctx, const void *msg )

     // round 4, 8, 12

-     K0 = _mm512_xor_si512( K0, mm512_ror2x512hi_1x32( K6, K7 ) );
+     K0 = _mm512_xor_si512( K0, _mm512_alignr_epi8( K7, K6, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( P1, K0 ), m512_zero );
-     K1 = _mm512_xor_si512( K1, mm512_ror2x512hi_1x32( K7, K0 ) );
+     K1 = _mm512_xor_si512( K1, _mm512_alignr_epi8( K0, K7, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K1 ), m512_zero );
-     K2 = _mm512_xor_si512( K2, mm512_ror2x512hi_1x32( K0, K1 ) );
+     K2 = _mm512_xor_si512( K2, _mm512_alignr_epi8( K1, K0, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K2 ), m512_zero );
-     K3 = _mm512_xor_si512( K3, mm512_ror2x512hi_1x32( K1, K2 ) );
+     K3 = _mm512_xor_si512( K3, _mm512_alignr_epi8( K2, K1, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K3 ), m512_zero );

     P0 = _mm512_xor_si512( P0, X );

-     K4 = _mm512_xor_si512( K4, mm512_ror2x512hi_1x32( K2, K3 ) );
+     K4 = _mm512_xor_si512( K4, _mm512_alignr_epi8( K3, K2, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( P3, K4 ), m512_zero );
-     K5 = _mm512_xor_si512( K5, mm512_ror2x512hi_1x32( K3, K4 ) );
+     K5 = _mm512_xor_si512( K5, _mm512_alignr_epi8( K4, K3, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K5 ), m512_zero );
-     K6 = _mm512_xor_si512( K6, mm512_ror2x512hi_1x32( K4, K5 ) );
+     K6 = _mm512_xor_si512( K6, _mm512_alignr_epi8( K5, K4, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K6 ), m512_zero );
-     K7 = _mm512_xor_si512( K7, mm512_ror2x512hi_1x32( K5, K6 ) );
+     K7 = _mm512_xor_si512( K7, _mm512_alignr_epi8( K6, K5, 4 ) );
     X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K7 ), m512_zero );

     P2 = _mm512_xor_si512( P2, X );
--- a/algo/shavite/sph-shavite-aesni.c
+++ b/algo/shavite/sph-shavite-aesni.c
@@ -59,30 +59,6 @@ static const sph_u32 IV512[] = {
 	C32(0xE275EADE), C32(0x502D9FCD), C32(0xB9357178), C32(0x022A4B9A)
 };

-// Partially rotate elements in two 128 bit vectors a & b as one 256 bit vector
-// and return the rotated 128 bit vector a.
-// a[3:0] = { b[0], a[3], a[2], a[1] }
-#if defined(__SSSE3__)
-
-#define mm128_ror256hi_1x32( a, b )  _mm_alignr_epi8( b, a, 4 )
-
-#else  // SSE2
-
-#define mm128_ror256hi_1x32( a, b ) \
-   _mm_or_si128( _mm_srli_si128( a,  4 ), \
-                 _mm_slli_si128( b, 12 ) )
-
-#endif
-
-/*
-#if defined(__AVX2__)
-// 2 way version of above
-// a[7:0] = { b[4], a[7], a[6], a[5], b[0], a[3], a[2], a[1] }
-#define mm256_ror2x256hi_1x32( a, b ) \
-   _mm256_blend_epi32( mm256_ror256_1x32( a ), \
-                       mm256_rol256_3x32( b ), 0x88 )
-#endif
-*/

 static void
 c512( sph_shavite_big_context *sc, const void *msg )
@@ -190,31 +166,31 @@ c512( sph_shavite_big_context *sc, const void *msg )

      // round 2, 6, 10

-      k00 = _mm_xor_si128( k00, mm128_ror256hi_1x32( k12, k13 ) );
+      k00 = _mm_xor_si128( k00, _mm_alignr_epi8( k13, k12, 4 ) );
      x = _mm_xor_si128( p3, k00 );
      x = _mm_aesenc_si128( x, zero );
-      k01 = _mm_xor_si128( k01, mm128_ror256hi_1x32( k13, k00 ) );
+      k01 = _mm_xor_si128( k01, _mm_alignr_epi8( k00, k13, 4 ) );
      x = _mm_xor_si128( x, k01 );
      x = _mm_aesenc_si128( x, zero );
-      k02 = _mm_xor_si128( k02, mm128_ror256hi_1x32( k00, k01 ) );
+      k02 = _mm_xor_si128( k02, _mm_alignr_epi8( k01, k00, 4 ) );
      x = _mm_xor_si128( x, k02 );
      x = _mm_aesenc_si128( x, zero );
-      k03 = _mm_xor_si128( k03, mm128_ror256hi_1x32( k01, k02 ) );
+      k03 = _mm_xor_si128( k03, _mm_alignr_epi8( k02, k01, 4 ) );
      x = _mm_xor_si128( x, k03 );
      x = _mm_aesenc_si128( x, zero );

      p2 = _mm_xor_si128( p2, x );

-      k10 = _mm_xor_si128( k10, mm128_ror256hi_1x32( k02, k03 ) );
+      k10 = _mm_xor_si128( k10, _mm_alignr_epi8( k03, k02, 4 ) );
      x = _mm_xor_si128( p1, k10 );
      x = _mm_aesenc_si128( x, zero );
-      k11 = _mm_xor_si128( k11, mm128_ror256hi_1x32( k03, k10 ) );
+      k11 = _mm_xor_si128( k11, _mm_alignr_epi8( k10, k03, 4 ) );
      x = _mm_xor_si128( x, k11 );
      x = _mm_aesenc_si128( x, zero );
-      k12 = _mm_xor_si128( k12, mm128_ror256hi_1x32( k10, k11 ) );
+      k12 = _mm_xor_si128( k12, _mm_alignr_epi8( k11, k10, 4 ) );
      x = _mm_xor_si128( x, k12 );
      x = _mm_aesenc_si128( x, zero );
-      k13 = _mm_xor_si128( k13, mm128_ror256hi_1x32( k11, k12 ) );
+      k13 = _mm_xor_si128( k13, _mm_alignr_epi8( k12, k11, 4 ) );
      x = _mm_xor_si128( x, k13 );
      x = _mm_aesenc_si128( x, zero );

@@ -262,31 +238,31 @@ c512( sph_shavite_big_context *sc, const void *msg )

      // round 4, 8, 12

-      k00 = _mm_xor_si128( k00, mm128_ror256hi_1x32( k12, k13 ) );
+      k00 = _mm_xor_si128( k00, _mm_alignr_epi8( k13, k12, 4 ) );
      x = _mm_xor_si128( p1, k00 );
      x = _mm_aesenc_si128( x, zero );
-      k01 = _mm_xor_si128( k01, mm128_ror256hi_1x32( k13, k00 ) );
+      k01 = _mm_xor_si128( k01, _mm_alignr_epi8( k00, k13, 4 ) );
      x = _mm_xor_si128( x, k01 );
      x = _mm_aesenc_si128( x, zero );
-      k02 = _mm_xor_si128( k02, mm128_ror256hi_1x32( k00, k01 ) );
+      k02 = _mm_xor_si128( k02, _mm_alignr_epi8( k01, k00, 4 ) );
      x = _mm_xor_si128( x, k02 );
      x = _mm_aesenc_si128( x, zero );
-      k03 = _mm_xor_si128( k03, mm128_ror256hi_1x32( k01, k02 ) );
+      k03 = _mm_xor_si128( k03, _mm_alignr_epi8( k02, k01, 4 ) );
      x = _mm_xor_si128( x, k03 );
      x = _mm_aesenc_si128( x, zero );

      p0 = _mm_xor_si128( p0, x );

-      k10 = _mm_xor_si128( k10, mm128_ror256hi_1x32( k02, k03 ) );
+      k10 = _mm_xor_si128( k10, _mm_alignr_epi8( k03, k02, 4 ) );
      x = _mm_xor_si128( p3, k10 );
      x = _mm_aesenc_si128( x, zero );
-      k11 = _mm_xor_si128( k11, mm128_ror256hi_1x32( k03, k10 ) );
+      k11 = _mm_xor_si128( k11, _mm_alignr_epi8( k10, k03, 4 ) );
      x = _mm_xor_si128( x, k11 );
      x = _mm_aesenc_si128( x, zero );
-      k12 = _mm_xor_si128( k12, mm128_ror256hi_1x32( k10, k11 ) );
+      k12 = _mm_xor_si128( k12, _mm_alignr_epi8( k11, k10, 4 ) );
      x = _mm_xor_si128( x, k12 );
      x = _mm_aesenc_si128( x, zero );
-      k13 = _mm_xor_si128( k13, mm128_ror256hi_1x32( k11, k12 ) );
+      k13 = _mm_xor_si128( k13, _mm_alignr_epi8( k12, k11, 4 ) );
      x = _mm_xor_si128( x, k13 );
      x = _mm_aesenc_si128( x, zero );
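
Reviewer note on the hunks above: each `mm128_ror256hi_1x32( a, b )` call becomes `_mm_alignr_epi8( b, a, 4 )`, with the operands swapped, and the SSE2 fallback is dropped. The following is a minimal scalar sketch, not the project's code, that models both forms on plain byte arrays to check that they yield the lane order documented in the removed comment (`a[3:0] = { b[0], a[3], a[2], a[1] }`). The names `vec128`, `model_alignr`, `model_sse2_fallback`, `from_lanes`, `lane` and `ror256hi_models_agree` are hypothetical helpers introduced for illustration, and `model_alignr` only covers shift counts from 0 to 16.

```c
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for a 128-bit vector: 16 bytes, byte 0 least significant. */
typedef struct { uint8_t b[16]; } vec128;

/* Model of _mm_alignr_epi8( hi, lo, n ) for 0 <= n <= 16: concatenate hi:lo
   into 32 bytes, shift right by n bytes, keep the low 16 bytes. */
static vec128 model_alignr( vec128 hi, vec128 lo, int n )
{
    uint8_t cat[32];
    vec128 r;
    memcpy( cat, lo.b, 16 );
    memcpy( cat + 16, hi.b, 16 );
    memcpy( r.b, cat + n, 16 );
    return r;
}

/* Model of the removed SSE2 fallback:
   _mm_or_si128( _mm_srli_si128( a, 4 ), _mm_slli_si128( b, 12 ) ) */
static vec128 model_sse2_fallback( vec128 a, vec128 b )
{
    vec128 r;
    int i;
    for ( i = 0; i < 16; i++ )
    {
        uint8_t from_a = ( i + 4 < 16 ) ? a.b[ i + 4 ]  : 0;  /* a >> 4 bytes  */
        uint8_t from_b = ( i >= 12 )    ? b.b[ i - 12 ] : 0;  /* b << 12 bytes */
        r.b[i] = from_a | from_b;
    }
    return r;
}

/* Build a vector from four 32-bit lanes, lane 0 lowest (little endian). */
static vec128 from_lanes( uint32_t l0, uint32_t l1, uint32_t l2, uint32_t l3 )
{
    uint32_t l[4] = { l0, l1, l2, l3 };
    vec128 v;
    int i;
    for ( i = 0; i < 16; i++ )
        v.b[i] = (uint8_t)( l[ i / 4 ] >> ( 8 * ( i % 4 ) ) );
    return v;
}

/* Read 32-bit lane i back out of the byte array. */
static uint32_t lane( vec128 v, int i )
{
    return  (uint32_t)v.b[ 4*i ]
         | ((uint32_t)v.b[ 4*i + 1 ] <<  8)
         | ((uint32_t)v.b[ 4*i + 2 ] << 16)
         | ((uint32_t)v.b[ 4*i + 3 ] << 24);
}

/* Returns 1 if both models agree and produce { a1, a2, a3, b0 } in lanes 0..3. */
int ror256hi_models_agree( void )
{
    vec128 a = from_lanes( 0x03020100u, 0x07060504u, 0x0B0A0908u, 0x0F0E0D0Cu );
    vec128 b = from_lanes( 0x13121110u, 0x17161514u, 0x1B1A1918u, 0x1F1E1D1Cu );
    vec128 x = model_alignr( b, a, 4 );       /* new form: operands swapped */
    vec128 y = model_sse2_fallback( a, b );   /* old SSE2 form              */

    return memcmp( x.b, y.b, 16 ) == 0
        && lane( x, 0 ) == 0x07060504u        /* a[1] */
        && lane( x, 1 ) == 0x0B0A0908u        /* a[2] */
        && lane( x, 2 ) == 0x0F0E0D0Cu        /* a[3] */
        && lane( x, 3 ) == 0x13121110u;       /* b[0] */
}
```

Calling `ror256hi_models_agree()` from a small `main` is enough to run the check. Since `_mm_alignr_epi8` is an SSSE3 instruction and the SSE2 fallback is gone, this is presumably why the guards in `sph_shavite.c` and `sph_shavite.h` now require `__SSSE3__` in addition to `__AES__`.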

--- a/algo/shavite/sph_shavite.c
+++ b/algo/shavite/sph_shavite.c
@@ -35,7 +35,7 @@

 #include "sph_shavite.h"

-#if !defined(__AES__)
+#if !(defined(__AES__) && defined(__SSSE3__))

 #ifdef __cplusplus
 extern "C"{
--- a/algo/shavite/sph_shavite.h
+++ b/algo/shavite/sph_shavite.h
@@ -263,7 +263,7 @@ void sph_shavite384_addbits_and_close(
 	void *cc, unsigned ub, unsigned n, void *dst);

 //Don't call these directly from application code, use the macros below.
-#ifdef __AES__
+#if defined(__AES__) && defined(__SSSE3__)

 void sph_shavite512_aesni_init(void *cc);
 void sph_shavite512_aesni(void *cc, const void *data, size_t len);
--- a/algo/sm3/sph_sm3.h
+++ b/algo/sm3/sph_sm3.h
@@ -74,7 +74,7 @@ typedef struct {

 void sm3_init(sm3_ctx_t *ctx);
 void sm3_update(sm3_ctx_t *ctx, const unsigned char* data, size_t data_len);
-void sm3_final(sm3_ctx_t *ctx, unsigned char digest[SM3_DIGEST_LENGTH]);
+void sm3_final(sm3_ctx_t *ctx, unsigned char *digest);
 void sm3_compress(uint32_t digest[8], const unsigned char block[SM3_BLOCK_SIZE]);
 void sm3(const unsigned char *data, size_t datalen,
 	unsigned char digest[SM3_DIGEST_LENGTH]);
--- a/algo/swifftx/swifftx-4way.c
+++ b/algo/swifftx/swifftx-4way.c
@@ -1,912 +0,0 @@
-///////////////////////////////////////////////////////////////////////////////////////////////
-//
-//  SWIFFTX ANSI C OPTIMIZED 32BIT IMPLEMENTATION FOR NIST SHA-3 COMPETITION
-//
-//  SWIFFTX.c
-//
-//  October 2008
-//
-//  This is the source file of the OPTIMIZED 32BIT implementation of SWIFFTX hash function.
-//  SWIFFTX is a candidate function for SHA-3 NIST competition.
-//  More details about SWIFFTX can be found in the accompanying submission documents.
-//
-///////////////////////////////////////////////////////////////////////////////////////////////
-#include "swifftx.h"
-// See the remarks concerning compatibility issues inside stdint.h.
-#include "stdint.h"
-// Remove this while using gcc:
-//#include "stdbool.h"
-#include <memory.h>
-
-///////////////////////////////////////////////////////////////////////////////////////////////
-// Constants and static tables portion.
-///////////////////////////////////////////////////////////////////////////////////////////////
-
-// In SWIFFTX we work over Z_257, so this is the modulus and the arithmetic is performed modulo
-// this number.
-#define FIELD_SIZE 257
-
-// The size of FFT we use:
-#define N 64
-
-#define LOGN 6
-
-#define EIGHTH_N (N / 8)
-
-// The number of FFTS done on the input.
-#define M (SWIFFTX_INPUT_BLOCK_SIZE / 8)   // 32
-
-// Omega is the 128th root of unity in Z_257.
-// We choose w = 42.
-#define OMEGA 42
-
-// The size of the inner FFT lookup table:
-#define W 8
-
-// Calculates the sum and the difference of two numbers.
-//
-// Parameters:
-// - A: the first operand. After the operation stores the sum of the two operands.
-// - B: the second operand. After the operation stores the difference between the first and the
-//   second operands.
-#define ADD_SUB_4WAY( A, B ) \
-{ \
-  __m128i temp = B; \
-  B = _mm_sub_epi32( A, B ); \
-  A = _mm_add_epi32( A, temp ); \
-}
-
-
-//#define ADD_SUB(A, B) {register int temp = (B); B = ((A) - (B)); A = ((A) + (temp));}
-
-// Quickly reduces an integer modulo 257.
-//
-// Parameters:
-// - A: the input.
-
-#define Q_REDUCE( A ) ( _mm_sub_epi32( \
-                               _mm_and_epi32( A, m128_const1_32( 0xff ) ), \
-                               _mm_srli_epi32( A, 8 ) ) )
-
-//#define Q_REDUCE(A) (((A) & 0xff) - ((A) >> 8))
-
-// Since we need to do the setup only once, this is the indicator variable:
-static bool wasSetupDone = false;
-
-// This array stores the powers of omegas that correspond to the indices, which are the input
-// values. Known also as the "outer FFT twiddle factors".
-swift_int16_t multipliers[N];
-
-// This array stores the powers of omegas, multiplied by the corresponding values.
-// We store this table to save computation time.
-//
-// To calculate the intermediate value of the compression function (the first out of two
-// stages), we multiply the k-th bit of x_i by w^[(2i + 1) * k]. {x_i} is the input to the
-// compression function, i is between 0 and 31, x_i is a 64-bit value.
-// One can see the formula for this (intermediate) stage in the SWIFFT FSE 2008 paper --
-// formula (2), section 3, page 6.
-swift_int16_t fftTable[256 * EIGHTH_N];
-
-// The A's we use in SWIFFTX shall be random elements of Z_257.
-// We generated these A's from the decimal expansion of PI as follows:  we converted each
-// triple of digits into a decimal number d. If d < (257 * 3) we used (d % 257) for the next A
-// element, otherwise move to the next triple of digits in the expansion. This guarntees that
-// the A's are random, provided that PI digits are.
-const swift_int16_t As[3 * M * N] =
-{141,  78, 139,  75, 238, 205, 129, 126,  22, 245, 197, 169, 142, 118, 105,  78,
-  50, 149,  29, 208, 114,  34,  85, 117,  67, 148,  86, 256,  25,  49, 133,  93,
-  95,  36,  68, 231, 211, 102, 151, 128, 224, 117, 193,  27, 102, 187,   7, 105,
-  45, 130, 108, 124, 171, 151, 189, 128, 218, 134, 233, 165,  14, 201, 145, 134,
-  52, 203,  91,  96, 197,  69, 134, 213, 136,  93,   3, 249, 141,  16, 210,  73,
-   6,  92,  58,  74, 174,   6, 254,  91, 201, 107, 110,  76, 103,  11,  73,  16,
-  34, 209,   7, 127, 146, 254,  95, 176,  57,  13, 108, 245,  77,  92, 186, 117,
- 124,  97, 105, 118,  34,  74, 205, 122, 235,  53,  94, 238, 210, 227, 183,  11,
- 129, 159, 105, 183, 142, 129,  86,  21, 137, 138, 224, 223, 190, 188, 179, 188,
- 256,  25, 217, 176,  36, 176, 238, 127, 160, 210, 155, 148, 132,   0,  54, 127,
- 145,   6,  46,  85, 243,  95, 173, 123, 178, 207, 211, 183, 224, 173, 146,  35,
-  71, 114,  50,  22, 175,   1,  28,  19, 112, 129,  21,  34, 161, 159, 115,  52,
-   4, 193, 211,  92, 115,  49,  59, 217, 218,  96,  61,  81,  24, 202, 198,  89,
-  45, 128,   8,  51, 253,  87, 171,  35,   4, 188, 171,  10,   3, 137, 238,  73,
-  19, 208, 124, 163, 103, 177, 155, 147,  46,  84, 253, 233, 171, 241, 211, 217,
- 159,  48,  96,  79, 237,  18, 171, 226,  99,   1,  97, 195, 216, 163, 198,  95,
-   0, 201,  65, 228,  21, 153, 124, 230,  44,  35,  44, 108,  85, 156, 249, 207,
-  26, 222, 131,   1,  60, 242, 197, 150, 181,  19, 116, 213,  75,  98, 124, 240,
- 123, 207,  62, 255,  60, 143, 187, 157, 139,   9,  12, 104,  89,  49, 193, 146,
- 104, 196, 181,  82, 198, 253, 192, 191, 255, 122, 212, 104,  47,  20, 132, 208,
-  46, 170,   2,  69, 234,  36,  56, 163,  28, 152, 104, 238, 162,  56,  24,  58,
-  38, 150, 193, 254, 253, 125, 173,  35,  73, 126, 247, 239, 216,   6, 199,  15,
-  90,  12,  97, 122,   9,  84, 207, 127, 219,  72,  58,  30,  29, 182,  41, 192,
- 235, 248, 237,  74,  72, 176, 210, 252,  45,  64, 165,  87, 202, 241, 236, 223,
- 151, 242, 119, 239,  52, 112, 169,  28,  13,  37, 160,  60, 158,  81, 133,  60,
-  16, 145, 249, 192, 173, 217, 214,  93, 141, 184,  54,  34, 161, 104, 157,  95,
-  38, 133, 218, 227, 211, 181,   9,  66, 137, 143,  77,  33, 248, 159,   4,  55,
- 228,  48,  99, 219, 222, 184,  15,  36, 254, 256, 157, 237,  87, 139, 209, 113,
- 232,  85, 126, 167, 197, 100, 103, 166,  64, 225, 125, 205, 117, 135,  84, 128,
- 231, 112,  90, 241,  28,  22, 210, 147, 186,  49, 230,  21, 108,  39, 194,  47,
- 123, 199, 107, 114,  30, 210, 250, 143,  59, 156, 131, 133, 221,  27,  76,  99,
- 208, 250,  78,  12, 211, 141,  95,  81, 195, 106,   8, 232, 150, 212, 205, 221,
-  11, 225,  87, 219, 126, 136, 137, 180, 198,  48,  68, 203, 239, 252, 194, 235,
- 142, 137, 174, 172, 190, 145, 250, 221, 182, 204,   1, 195, 130, 153,  83, 241,
- 161, 239, 211, 138,  11, 169, 155, 245, 174,  49,  10, 166,  16, 130, 181, 139,
- 222, 222, 112,  99, 124,  94,  51, 243, 133, 194, 244, 136,  35, 248, 201, 177,
- 178, 186, 129, 102,  89, 184, 180,  41, 149,  96, 165,  72, 225, 231, 134, 158,
- 199,  28, 249,  16, 225, 195,  10, 210, 164, 252, 138,   8,  35, 152, 213, 199,
-  82, 116,  97, 230,  63, 199, 241,  35,  79, 120,  54, 174,  67, 112,   1,  76,
-  69, 222, 194,  96,  82,  94,  25, 228, 196, 145, 155, 136, 228, 234,  46, 101,
- 246,  51, 103, 166, 246,  75,   9, 200, 161,   4, 108,  35, 129, 168, 208, 144,
-  50,  14,  13, 220,  41, 132, 122, 127, 194,   9, 232, 234, 107,  28, 187,   8,
-  51, 141,  97, 221, 225,   9, 113, 170, 166, 102, 135,  22, 231, 185, 227, 187,
- 110, 145, 251, 146,  76,  22, 146, 228,   7,  53,  64,  25,  62, 198, 130, 190,
- 221, 232, 169,  64, 188, 199, 237, 249, 173, 218, 196, 191,  48, 224,   5, 113,
- 100, 166, 160,  21, 191, 197,  61, 162, 149, 171, 240, 183, 129, 231, 123, 204,
- 192, 179, 134,  15,  47, 161, 142, 177, 239, 234, 186, 237, 231,  53, 208,  95,
- 146,  36, 225, 231,  89, 142,  93, 248, 137, 124,  83,  39,  69,  77,  89, 208,
- 182,  48,  85, 147, 244, 164, 246,  68,  38, 190, 220,  35, 202,  91, 157, 151,
- 201, 240, 185, 218,   4, 152,   2, 132, 177,  88, 190, 196, 229,  74, 220, 135,
- 137, 196,  11,  47,   5, 251, 106, 144, 163,  60, 222, 127,  52,  57, 202, 102,
-  64, 140, 110, 206,  23, 182,  39, 245,   1, 163, 157, 186, 163,  80,   7, 230,
-  44, 249, 176, 102, 164, 125, 147, 120,  18, 191, 186, 125,  64,  65, 198, 157,
- 164, 213,  95,  61,  13, 181, 208,  91, 242, 197, 158,  34,  98, 169,  91,  14,
-  17,  93, 157,  17,  65,  30, 183,   6, 139,  58, 255, 108, 100, 136, 209, 144,
- 164,   6, 237,  33, 210, 110,  57, 126, 197, 136, 125, 244, 165, 151, 168,   3,
- 143, 251, 247, 155, 136, 130,  88,  14,  74, 121, 250, 133,  21, 226, 185, 232,
- 118, 132,  89,  64, 204, 161,   2,  70, 224, 159,  35, 204, 123, 180,  13,  52,
- 231,  57,  25,  78,  66,  69,  97,  42, 198,  84, 176,  59,   8, 232, 125, 134,
- 193,   2, 232, 109, 216,  69,  90, 142,  32,  38, 249,  37,  75, 180, 184, 188,
-  19,  47, 120,  87, 146,  70, 232, 120, 191,  45,  33,  38,  19, 248, 110, 110,
-  44,  64,   2,  84, 244, 228, 252, 228, 170, 123,  38, 144, 213, 144, 171, 212,
- 243,  87, 189,  46, 128, 110,  84,  77,  65, 183,  61, 184, 101,  44, 168,  68,
-  14, 106, 105,   8, 227, 211, 166,  39, 152,  43,  52, 254, 197,  55, 119,  89,
- 168,  65,  53, 138, 177,  56, 219,   0,  58, 121, 148,  18,  44, 100, 215, 103,
- 145, 229, 117, 196,  91,  89, 113, 143, 172, 239, 249, 184, 154,  39, 112,  65,
- 204,  42,  84,  38, 155, 151, 151,  16, 100,  87, 174, 162, 145, 147, 149, 186,
- 237, 145, 134, 144, 198, 235, 213, 163,  48, 230,  24,  47,  57,  71, 127,   0,
- 150, 219,  12,  81, 197, 150, 131,  13, 169,  63, 175, 184,  48, 235,  65, 243,
- 149, 200, 163, 254, 202, 114, 247,  67, 143, 250, 126, 228,  80, 130, 216, 214,
-  36,   2, 230,  33, 119, 125,   3, 142, 237, 100,   3, 152, 197, 174, 244, 129,
- 232,  30, 206, 199,  39, 210, 220,  43, 237, 221, 201,  54, 179,  42,  28, 133,
- 246, 203, 198, 177,   0,  28, 194,  85, 223, 109, 155, 147, 221,  60, 133, 108,
- 157, 254,  26,  75, 157, 185,  49, 142,  31, 137,  71,  43,  63,  64, 237, 148,
- 237, 172, 159, 160, 155, 254, 234, 224, 140, 193, 114, 140,  62, 109, 136,  39,
- 255,   8, 158, 146, 128,  49, 222,  96,  57, 209, 180, 249, 202, 127, 113, 231,
-  78, 178,  46,  33, 228, 215, 104,  31, 207, 186,  82,  41,  42,  39, 103, 119,
- 123, 133, 243, 254, 238, 156,  90, 186,  37, 212,  33, 107, 252,  51, 177,  36,
- 237,  76, 159, 245,  93, 214,  97,  56, 190,  38, 160,  94, 105, 222, 220, 158,
-  49,  16, 191,  52, 120,  87, 179,   2,  27, 144, 223, 230, 184,   6, 129, 227,
-  69,  47, 215, 181, 162, 139,  72, 200,  45, 163, 159,  62,   2, 221, 124,  40,
- 159, 242,  35, 208, 179, 166,  98,  67, 178,  68, 143, 225, 178, 146, 187, 159,
-  57,  66, 176, 192, 236, 250, 168, 224, 122,  43, 159, 120, 133, 165, 122,  64,
-  87,  74, 161, 241,   9,  87,  90,  24, 255, 113, 203, 220,  57, 139, 197, 159,
-  31, 151,  27, 140,  77, 162,   7,  27,  84, 228, 187, 220,  53, 126, 162, 242,
-  84, 181, 223, 103,  86, 177, 207,  31, 140,  18, 207, 256, 201, 166,  96,  23,
- 233, 103, 197,  84, 161,  75,  59, 149, 138, 154, 119,  92,  16,  53, 116,  97,
- 220, 114,  35,  45,  77, 209,  40, 196,  71,  22,  81, 178, 110,  14,   3, 180,
- 110, 129, 112,  47,  18,  61, 134,  78,  73,  79, 254, 232, 125, 180, 205,  54,
- 220, 119,  63,  89, 181,  52,  77, 109, 151,  77,  80, 207, 144,  25,  20,   6,
- 208,  47, 201, 206, 192,  14,  73, 176, 256, 201, 207,  87, 216,  60,  56,  73,
-  92, 243, 179, 113,  49,  59,  55, 168, 121, 137,  69, 154,  95,  57, 187,  47,
- 129,   4,  15,  92,   6, 116,  69, 196,  48, 134,  84,  81, 111,  56,  38, 176,
- 239,   6, 128,  72, 242, 134,  36, 221,  59,  48, 242,  68, 130, 110, 171,  89,
-  13, 220,  48,  29,   5,  75, 104, 233,  91, 129, 105, 162,  44, 113, 163, 163,
-  85, 147, 190, 111, 197,  80, 213, 153,  81,  68, 203,  33, 161, 165,  10,  61,
- 120, 252,   0, 205,  28,  42, 193,  64,  39,  37,  83, 175,   5, 218, 215, 174,
- 128, 121, 231,  11, 150, 145, 135, 197, 136,  91, 193,   5, 107,  88,  82,   6,
-   4, 188, 256,  70,  40,   2, 167,  57, 169, 203, 115, 254, 215, 172,  84,  80,
- 188, 167,  34, 137,  43, 243,   2,  79, 178,  38, 188, 135, 233, 194, 208,  13,
-  11, 151, 231, 196,  12, 122, 162,  56,  17, 114, 191, 207,  90, 132,  64, 238,
- 187,   6, 198, 176, 240,  88, 118, 236,  15, 226, 166,  22, 193, 229,  82, 246,
- 213,  64,  37,  63,  31, 243, 252,  37, 156,  38, 175, 204, 138, 141, 211,  82,
- 106, 217,  97, 139, 153,  56, 129, 218, 158,   9,  83,  26,  87, 112,  71,  21,
- 250,   5,  65, 141,  68, 116, 231, 113,  10, 218,  99, 205, 201,  92, 157,   4,
-  97,  46,  49, 220,  72, 139, 103, 171, 149, 129, 193,  19,  69, 245,  43,  31,
-  58,  68,  36, 195, 159,  22,  54,  34, 233, 141, 205, 100, 226,  96,  22, 192,
-  41, 231,  24,  79, 234, 138,  30, 120, 117, 216, 172, 197, 172, 107,  86,  29,
- 181, 151,   0,   6, 146, 186,  68,  55,  54,  58, 213, 182,  60, 231,  33, 232,
-  77, 210, 216, 154,  80,  51, 141, 122,  68, 148, 219, 122, 254,  48,  64, 175,
-  41, 115,  62, 243, 141,  81, 119, 121,   5,  68, 121,  88, 239,  29, 230,  90,
- 135, 159,  35, 223, 168, 112,  49,  37, 146,  60, 126, 134,  42, 145, 115,  90,
-  73, 133, 211,  86, 120, 141, 122, 241, 127,  56, 130,  36, 174,  75,  83, 246,
- 112,  45, 136, 194, 201, 115,   1, 156, 114, 167, 208,  12, 176, 147,  32, 170,
- 251, 100, 102, 220, 122, 210,   6,  49,  75, 201,  38, 105, 132, 135, 126, 102,
-  13, 121,  76, 228, 202,  20,  61, 213, 246,  13, 207,  42, 148, 168,  37, 253,
-  34,  94, 141, 185,  18, 234, 157, 109, 104,  64, 250, 125,  49, 236,  86,  48,
- 196,  77,  75, 237, 156, 103, 225,  19, 110, 229,  22,  68, 177,  93, 221, 181,
- 152, 153,  61, 108, 101,  74, 247, 195, 127, 216,  30, 166, 168,  61,  83, 229,
- 120, 156,  96, 120, 201, 124,  43,  27, 253, 250, 120, 143,  89, 235, 189, 243,
- 150,   7, 127, 119, 149, 244,  84, 185, 134,  34, 128, 193, 236, 234, 132, 117,
- 137,  32, 145, 184,  44, 121,  51,  76,  11, 228, 142, 251,  39,  77, 228, 251,
-  41,  58, 246, 107, 125, 187,   9, 240,  35,   8,  11, 162, 242, 220, 158, 163,
-   2, 184, 163, 227, 242,   2, 100, 101,   2,  78, 129,  34,  89,  28,  26, 157,
-  79,  31, 107, 250, 194, 156, 186,  69, 212,  66,  41, 180, 139,  42, 211, 253,
- 256, 239,  29, 129, 104, 248, 182,  68,   1, 189,  48, 226,  36, 229,   3, 158,
-  41,  53, 241,  22, 115, 174,  16, 163, 224,  19, 112, 219, 177, 233,  42,  27,
- 250, 134,  18,  28, 145, 122,  68,  34, 134,  31, 147,  17,  39, 188, 150,  76,
-  45,  42, 167, 249,  12,  16,  23, 182,  13,  79, 121,   3,  70, 197, 239,  44,
-  86, 177, 255,  81,  64, 171, 138, 131,  73, 110,  44, 201, 254, 198, 146,  91,
-  48,   9, 104,  31,  29, 161, 101,  31, 138, 180, 231, 233,  79, 137,  61, 236,
- 140,  15, 249, 218, 234, 119,  99, 195, 110, 137, 237, 207,   8,  31,  45,  24,
-  90, 155, 203, 253, 192, 203,  65, 176, 210, 171, 142, 214, 220, 122, 136, 237,
- 189, 186, 147,  40,  80, 254, 173,  33, 191,  46, 192,  26, 108, 255, 228, 205,
-  61,  76,  39, 107, 225, 126, 228, 182, 140, 251, 143, 134, 252, 168, 221,   8,
- 185,  85,  60, 233, 147, 244,  87, 137,   8, 140,  96,  80,  53,  45, 175, 160,
- 124, 189, 112,  37, 144,  19,  70,  17, 170, 242,   2,   3,  28,  95, 120, 199,
- 212,  43,   9, 117,  86, 151, 101, 241, 200, 145, 241,  19, 178,  69, 204, 197,
- 227, 166,  94,   7, 193,  45, 247, 234,  19, 187, 212, 212, 236, 125,  33,  95,
- 198, 121, 122, 103,  77, 155, 235,  49,  25, 237, 249,  11, 162,   7, 238,  24,
-  16, 150, 129,  25, 152,  17,  42,  67, 247, 162,  77, 154,  31, 133,  55, 137,
-  79, 119, 153,  10,  86,  28, 244, 186,  41, 169, 106,  44,  10,  49, 110, 179,
-  32, 133, 155, 244,  61,  70, 131, 168, 170,  39, 231, 252,  32,  69,  92, 238,
- 239,  35, 132, 136, 236, 167,  90,  32, 123,  88,  69,  22,  20,  89, 145, 166,
-  30, 118,  75,   4,  49,  31, 225,  54,  11,  50,  56, 191, 246,   1, 187,  33,
- 119, 107, 139,  68,  19, 240, 131,  55,  94, 113,  31, 252,  12, 179, 121,   2,
- 120, 252,   0,  76,  41,  80, 185,  42,  62, 121, 105, 159, 121, 109, 111,  98,
-   7, 118,  86,  29, 210,  70, 231, 179, 223, 229, 164,  70,  62,  47,   0, 206,
- 204, 178, 168, 120, 224, 166,  99,  25, 103,  63, 246, 224, 117, 204,  75, 124,
- 140, 133, 110, 110, 222,  88, 151, 118,  46,  37,  22, 143, 158,  40,   2,  50,
- 153,  94, 190, 199,  13, 198, 127, 211, 180,  90, 183,  98,   0, 142, 210, 154,
- 100, 187,  67, 231, 202, 100, 198, 235, 252, 160, 247, 124, 247,  14, 121, 221,
-  57,  88, 253, 243, 185,  89,  45, 249, 221, 194, 108, 175, 193, 119,  50, 141,
- 223, 133, 136,  64, 176, 250, 129, 100, 124,  94, 181, 159,  99, 185, 177, 240,
- 135,  42, 103,  52, 202, 208, 143, 186, 193, 103, 154, 237, 102,  88, 225, 161,
-  50, 188, 191, 109,  12,  87,  19, 227, 247, 183,  13,  52, 205, 170, 205, 146,
-  89, 160,  18, 105, 192,  73, 231, 225, 184, 157, 252, 220,  61,  59, 169, 183,
- 221,  20, 141,  20, 158, 101, 245,   7, 245, 225, 118, 137,  84,  55,  19,  27,
- 164, 110,  35,  25, 202,  94, 150,  46,  91, 152, 130,   1,   7,  46,  16, 237,
- 171, 109,  19, 200,  65,  38,  10, 213,  70,  96, 126, 226, 185, 225, 181,  46,
-  10, 165,  11, 123,  53, 158,  22, 147,  64,  22, 227,  69, 182, 237, 197,  37,
-  39,  49, 186, 223, 139, 128,  55,  36, 166, 178, 220,  20,  98, 172, 166, 253,
-  45,   0, 120, 180, 189, 185, 158, 159, 196,   6, 214,  79, 141,  52, 156, 107,
-   5, 109, 142, 159,  33,  64, 190, 133,  95, 132,  95, 202, 160,  63, 186,  23,
- 231, 107, 163,  33, 234,  15, 244,  77, 108,  49,  51,   7, 164,  87, 142,  99,
- 240, 202,  47, 256, 118, 190, 196, 178, 217,  42,  39, 153,  21, 192, 232, 202,
-  14,  82, 179,  64, 233,   4, 219,  10, 133,  78,  43, 144, 146, 216, 202,  81,
-  71, 252,   8, 201,  68, 256,  85, 233, 164,  88, 176,  30,   5, 152, 126, 179,
- 249,  84, 140, 190, 159,  54, 118,  98,   2, 159,  27, 133,  74, 121, 239, 196,
-  71, 149, 119, 135, 102,  20,  87, 112,  44,  75, 221,   3, 151, 158,   5,  98,
- 152,  25,  97, 106,  63, 171, 240,  79, 234, 240, 230,  92,  76,  70, 173, 196,
-  36, 225, 218, 133,  64, 240, 150,  41, 146,  66, 133,  51, 134,  73, 170, 238,
- 140,  90,  45,  89,  46, 147,  96, 169, 174, 174, 244, 151,  90,  40,  32,  74,
-  38, 154, 246,  57,  31,  14, 189, 151,  83, 243, 197, 183, 220, 185,  53, 225,
-  51, 106, 188, 208, 222, 248,  93,  13,  93, 215, 131,  25, 142, 185, 113, 222,
- 131, 215, 149,  50, 159,  85,  32,   5, 205, 192,   2, 227,  42, 214, 197,  42,
- 126, 182,  68, 123, 109,  36, 237, 179, 170, 199,  77, 256,   5, 128, 214, 243,
- 137, 177, 170, 253, 179, 180, 153, 236, 100, 196, 216, 231, 198,  37, 192,  80,
- 121, 221, 246,   1,  16, 246,  29,  78,  64, 148, 124,  38,  96, 125,  28,  20,
-  48,  51,  73, 187, 139, 208,  98, 253, 221, 188,  84, 129,   1, 205,  95, 205,
- 117,  79,  71, 126, 134, 237,  19, 184, 137, 125, 129, 178, 223,  54, 188, 112,
-  30,   7, 225, 228, 205, 184, 233,  87, 117,  22,  58,  10,   8,  42,   2, 114,
- 254,  19,  17,  13, 150,  92, 233, 179,  63,  12,  60, 171, 127,  35,  50,   5,
- 195, 113, 241,  25, 249, 184, 166,  44, 221,  35, 151, 116,   8,  54, 195,  89,
- 218, 186, 132,   5,  41,  89, 226, 177,  11,  41,  87, 172,   5,  23,  20,  59,
- 228,  94,  76,  33, 137,  43, 151, 221,  61, 232,   4, 120,  93, 217,  80, 228,
- 228,   6,  58,  25,  62,  84,  91,  48, 209,  20, 247, 243,  55, 106,  80,  79,
- 235,  34,  20, 180, 146,   2, 236,  13, 236, 206, 243, 222, 204,  83, 148, 213,
- 214, 117, 237,  98,   0,  90, 204, 168,  32,  41, 126,  67, 191,  74,  27, 255,
-  26,  75, 240, 113, 185, 105, 167, 154, 112,  67, 151,  63, 161, 134, 239, 176,
-  42,  87, 249, 130,  45, 242,  17, 100, 107, 120, 212, 218, 237,  76, 231, 162,
- 175, 172, 118, 155,  92,  36, 124,  17, 121,  71,  13,   9,  82, 126, 147, 142,
- 218, 148, 138,  80, 163, 106, 164, 123, 140, 129,  35,  42, 186, 154, 228, 214,
-  75,  73,   8, 253,  42, 153, 232, 164,  95,  24, 110,  90, 231, 197,  90, 196,
-  57, 164, 252, 181,  31,   7,  97, 256,  35,  77, 200, 212,  99, 179,  92, 227,
-  17, 180,  49, 176,   9, 188,  13, 182,  93,  44, 128, 219, 134,  92, 151,   6,
-  23, 126, 200, 109,  66,  30, 140, 180, 146, 134,  67, 200,   7,   9, 223, 168,
- 186, 221,   3, 154, 150, 165,  43,  53, 138,  27,  86, 213, 235, 160,  70,   2,
- 240,  20,  89, 212,  84, 141, 168, 246, 183, 227,  30, 167, 138, 185, 253,  83,
-  52, 143, 236,  94,  59,  65,  89, 218, 194, 157, 164, 156, 111,  95, 202, 168,
- 245, 256, 151,  28, 222, 194,  72, 130, 217, 134, 253,  77, 246, 100,  76,  32,
- 254, 174, 182, 193,  14, 237,  74,   1,  74,  26, 135, 216, 152, 208, 112,  38,
- 181,  62,  25,  71,  61, 234, 254,  97, 191,  23,  92, 256, 190, 205,   6,  16,
- 134, 147, 210, 219, 148,  59,  73, 185,  24, 247, 174, 143, 116, 220, 128, 144,
- 111, 126, 101,  98, 130, 136, 101, 102,  69, 127,  24, 168, 146, 226, 226, 207,
- 176, 122, 149, 254, 134, 196,  22, 151, 197,  21,  50, 205, 116, 154,  65, 116,
- 177, 224, 127,  77, 177, 159, 225,  69, 176,  54, 100, 104, 140,   8,  11, 126,
-  11, 188, 185, 159, 107,  16, 254, 142,  80,  28,   5, 157, 104,  57, 109,  82,
- 102,  80, 173, 242, 238, 207,  57, 105, 237, 160,  59, 189, 189, 199,  26,  11,
- 190, 156,  97, 118,  20,  12, 254, 189, 165, 147, 142, 199,   5, 213,  64, 133,
- 108, 217, 133,  60,  94,  28, 116, 136,  47, 165, 125,  42, 183, 143,  14, 129,
- 223,  70, 212, 205, 181, 180,   3, 201, 182,  46,  57, 104, 239,  60,  99, 181,
- 220, 231,  45,  79, 156,  89, 149, 143, 190, 103, 153,  61, 235,  73, 136,  20,
-  89, 243,  16, 130, 247, 141, 134,  93,  80,  68,  85,  84,   8,  72, 194,   4,
- 242, 110,  19, 133, 199,  70, 172,  92, 132, 254,  67,  74,  36,  94,  13,  90,
- 154, 184,   9, 109, 118, 243, 214,  71,  36,  95,   0,  90, 201, 105, 112, 215,
-  69, 196, 224, 210, 236, 242, 155, 211,  37, 134,  69, 113, 157,  97,  68,  26,
- 230, 149, 219, 180,  20,  76, 172, 145, 154,  40, 129,   8,  93,  56, 162, 124,
- 207, 233, 105,  19,   3, 183, 155, 134,   8, 244, 213,  78, 139,  88, 156,  37,
-  51, 152, 111, 102, 112, 250, 114, 252, 201, 241, 133,  24, 136, 153,   5,  90,
- 210, 197, 216,  24, 131,  17, 147, 246,  13,  86,   3, 253, 179, 237, 101, 114,
- 243, 191, 207,   2, 220, 133, 244,  53,  87, 125, 154, 158, 197,  20,   8,  83,
-  32, 191,  38, 241, 204,  22, 168,  59, 217, 123, 162,  82,  21,  50, 130,  89,
- 239, 253, 195,  56, 253,  74, 147, 125, 234, 199, 250,  28,  65, 193,  22, 237,
- 193,  94,  58, 229, 139, 176,  69,  42, 179, 164, 150, 168, 246, 214,  86, 174,
-  59, 117,  15,  19,  76,  37, 214, 238, 153, 226, 154,  45, 109, 114, 198, 107,
-  45,  70, 238, 196, 142, 252, 244,  71, 123, 136, 134, 188,  99, 132,  25,  42,
- 240,   0, 196,  33,  26, 124, 256, 145,  27, 102, 153,  35,  28, 132, 221, 167,
- 138, 133,  41, 170,  95, 224,  40, 139, 239, 153,   1, 106, 255, 106, 170, 163,
- 127,  44, 155, 232, 194, 119, 232, 117, 239, 143, 108,  41,   3,   9, 180, 256,
- 144, 113, 133, 200,  79,  69, 128, 216,  31,  50, 102, 209, 249, 136, 150, 154,
- 182,  51, 228,  39, 127, 142,  87,  15,  94,  92, 187, 245,  31, 236,  64,  58,
- 114,  11,  17, 166, 189, 152, 218,  34, 123,  39,  58,  37, 153,  91,  63, 121,
-  31,  34,  12, 254, 106,  96, 171,  14, 155, 247, 214,  69,  24,  98,   3, 204,
- 202, 194, 207,  30, 253,  44, 119,  70,  14,  96,  82, 250,  63,   6, 232,  38,
-  89, 144, 102, 191,  82, 254,  20, 222,  96, 162, 110,   6, 159,  58, 200, 226,
-  98, 128,  42,  70,  84, 247, 128, 211, 136,  54, 143, 166,  60, 118,  99, 218,
-  27, 193,  85,  81, 219, 223,  46,  41,  23, 233, 152, 222,  36, 236,  54, 181,
-  56,  50,   4, 207, 129,  92,  78,  88, 197, 251, 131, 105,  31, 172,  38, 131,
-  19, 204, 129,  47, 227, 106, 202, 183,  23,   6,  77, 224, 102, 147,  11, 218,
- 131, 132,  60, 192, 208, 223, 236,  23, 103, 115,  89,  18, 185, 171,  70, 174,
- 139,   0, 100, 160, 221,  11, 228,  60,  12, 122, 114,  12, 157, 235, 148,  57,
-  83,  62, 173, 131, 169, 126,  85,  99,  93, 243,  81,  80,  29, 245, 206,  82,
- 236, 227, 166,  14, 230, 213, 144,  97,  27, 111,  99, 164, 105, 150,  89, 111,
- 252, 118, 140, 232, 120, 183, 137, 213, 232, 157, 224,  33, 134, 118, 186,  80,
- 159,   2, 186, 193,  54, 242,  25, 237, 232, 249, 226, 213,  90, 149,  90, 160,
- 118,  69,  64,  37,  10, 183, 109, 246,  30,  52, 219,  69, 189,  26, 116, 220,
-  50, 244, 243, 243, 139, 137, 232,  98,  38,  45, 256, 143, 171, 101,  73, 238,
- 123,  45, 194, 167, 250, 123,  12,  29, 136, 237, 141,  21,  89,  96, 199,  44,
-   8, 214, 208,  17, 113,  41, 137,  26, 166, 155,  89,  85,  54,  58,  97, 160,
-  50, 239,  58,  71,  21, 157, 139,  12,  37, 198, 182, 131, 149, 134,  16, 204,
- 164, 181, 248, 166,  52, 216, 136, 201,  37, 255, 187, 240,   5, 101, 147, 231,
-  14, 163, 253, 134, 146, 216,   8,  54, 224,  90, 220, 195,  75, 215, 186,  58,
-  71, 204, 124, 105, 239,  53,  16,  85,  69, 163, 195, 223,  33,  38,  69,  88,
-  88, 203,  99,  55, 176,  13, 156, 204, 236,  99, 194, 134,  75, 247, 126, 129,
- 160, 124, 233, 206, 139, 144, 154,  45, 233,  51, 206,  61,  60,  55, 205, 107,
-  84, 108,  96, 188, 203,  31,  89,  20, 115, 144, 137,  90, 237,  78, 231, 185,
- 120, 217,   1, 176, 169,  30, 155, 176, 100, 113,  53,  42, 193, 108,  14, 121,
- 176, 158, 137,  92, 178,  44, 110, 249, 108, 234,  94, 101, 128,  12, 250, 173,
-  72, 202, 232,  66, 139, 152, 189,  18,  32, 197,   9, 238, 246,  55, 119, 183,
- 196, 119, 113, 247, 191, 100, 200, 245,  46,  16, 234, 112, 136, 116, 232,  48,
- 176, 108,  11, 237,  14, 153,  93, 177, 124,  72,  67, 121, 135, 143,  45,  18,
-  97, 251, 184, 172, 136,  55, 213,   8, 103,  12, 221, 212,  13, 160, 116,  91,
- 237, 127, 218, 190, 103, 131,  77,  82,  36, 100,  22, 252,  79,  69,  54,  26,
-  65, 182, 115, 142, 247,  20,  89,  81, 188, 244,  27, 120, 240, 248,  13, 230,
-  67, 133,  32, 201, 129,  87,   9, 245,  66,  88, 166,  34,  46, 184, 119, 218,
- 144, 235, 163,  40, 138, 134, 127, 217,  64, 227, 116,  67,  55, 202, 130,  48,
- 199,  42, 251, 112, 124, 153, 123, 194, 243,  49, 250,  12,  78, 157, 167, 134,
- 210,  73, 156, 102,  21,  88, 216, 123,  45,  11, 208,  18,  47, 187,  20,  43,
-   3, 180, 124,   2, 136, 176,  77, 111, 138, 139,  91, 225, 126,   8,  74, 255,
-  88, 192, 193, 239, 138, 204, 139, 194, 166, 130, 252, 184, 140, 168,  30, 177,
- 121,  98, 131, 124,  69, 171,  75,  49, 184,  34,  76, 122, 202, 115, 184, 253,
- 120, 182,  33, 251,   1,  74, 216, 217, 243, 168,  70, 162, 119, 158, 197, 198,
-  61,  89,   7,   5,  54, 199, 211, 170,  23, 226,  44, 247, 165, 195,   7, 225,
-  91,  23,  50,  15,  51, 208, 106,  94,  12,  31,  43, 112, 146, 139, 246, 182,
- 113,   1,  97,  15,  66,   2,  51,  76, 164, 184, 237, 200, 218, 176,  72,  98,
-  33, 135,  38, 147, 140, 229,  50,  94,  81, 187, 129,  17, 238, 168, 146, 203,
- 181,  99, 164,   3, 104,  98, 255, 189, 114, 142,  86, 102, 229, 102,  80, 129,
-  64,  84,  79, 161,  81, 156, 128, 111, 164, 197,  18,  15,  55, 196, 198, 191,
-  28, 113, 117,  96, 207, 253,  19, 158, 231,  13,  53, 130, 252, 211,  58, 180,
- 212, 142,   7, 219,  38,  81,  62, 109, 167, 113,  33,  56,  97, 185, 157, 130,
- 186, 129, 119, 182, 196,  26,  54, 110,  65, 170, 166, 236,  30,  22, 162,   0,
- 106,  12, 248,  33,  48,  72, 159,  17,  76, 244, 172, 132,  89, 171, 196,  76,
- 254, 166,  76, 218, 226,   3,  52, 220, 238, 181, 179, 144, 225,  23,   3, 166,
- 158,  35, 228, 154, 204,  23, 203,  71, 134, 189,  18, 168, 236, 141, 117, 138,
-   2, 132,  78,  57, 154,  21, 250, 196, 184,  40, 161,  40,  10, 178, 134, 120,
- 132, 123, 101,  82, 205, 121,  55, 140, 231,  56, 231,  71, 206, 246, 198, 150,
- 146, 192,  45, 105, 242,   1, 125,  18, 176,  46, 222, 122,  19,  80, 113, 133,
- 131, 162,  81,  51,  98, 168, 247, 161, 139,  39,  63, 162,  22, 153, 170,  92,
-  91, 130, 174, 200,  45, 112,  99, 164, 132, 184, 191, 186, 200, 167,  86, 145,
- 167, 227, 130,  44,  12, 158, 172, 249, 204,  17,  54, 249,  16, 200,  21, 174,
-  67, 223, 105, 201,  50,  36, 133, 203, 244, 131, 228,  67,  29, 195,  91,  91,
-  55, 107, 167, 154, 170, 137, 218, 183, 169,  61,  99, 175, 128,  23, 142, 183,
-  66, 255,  59, 187,  66,  85, 212, 109, 168,  82,  16,  43,  67, 139, 114, 176,
- 216, 255, 130,  94, 152,  79, 183,  64, 100,  23, 214,  82,  34, 230,  48,  15,
- 242, 130,  50, 241,  81,  32,   5, 125, 183, 182, 184,  99, 248, 109, 159, 210,
- 226,  61, 119, 129,  39, 149,  78, 214, 107,  78, 147, 124, 228,  18, 143, 188,
-  84, 180, 233, 119,  64,  39, 158, 133, 177, 168,   6, 150,  80, 117, 150,  56,
-  49,  72,  49,  37,  30, 242,  49, 142,  33, 156,  34,  44,  44,  72,  58,  22,
- 249,  46, 168,  80,  25, 196,  64, 174,  97, 179, 244, 134, 213, 105,  63, 151,
-  21,  90, 168,  90, 245,  28, 157,  65, 250, 232, 188,  27,  99, 160, 156, 127,
-  68, 193,  10,  80, 205,  36, 138, 229,  12, 223,  70, 169, 251,  41,  48,  94,
-  41, 177,  99, 256, 158,   0,   6,  83, 231, 191, 120, 135, 157, 146, 218, 213,
- 160,   7,  47, 234,  98, 211,  79, 225, 179,  95, 175, 105, 185,  79, 115,   0,
- 104,  14,  65, 124,  15, 188,  52,   9, 253,  27, 132, 137,  13, 127,  75, 238,
- 185, 253,  33,   8,  52, 157, 164,  68, 232, 188,  69,  28, 209, 233,   5, 129,
- 216,  90, 252, 212,  33, 200, 222,   9, 112,  15,  43,  36, 226, 114,  15, 249,
- 217,   8, 148,  22, 147,  23, 143,  67, 222, 116, 235, 250, 212, 210,  39, 142,
- 108,  64, 209,  83,  73,  66,  99,  34,  17,  29,  45, 151, 244, 114,  28, 241,
- 144, 208, 146, 179, 132,  89, 217, 198, 252, 219, 205, 165,  75, 107,  11, 173,
-  76,   6, 196, 247, 152, 216, 248,  91, 209, 178,  57, 250, 174,  60,  79, 123,
-  18, 135,   9, 241, 230, 159, 184,  68, 156, 251, 215,   9, 113, 234,  75, 235,
- 103, 194, 205, 129, 230,  45,  96,  73, 157,  20, 200, 212, 212, 228, 161,   7,
- 231, 228, 108,  43, 198,  87, 140, 140,   4, 182, 164,   3,  53, 104, 250, 213,
-  85,  38,  89,  61,  52, 187,  35, 204,  86, 249, 100,  71, 248, 213, 163, 215,
-  66, 106, 252, 129,  40, 111,  47,  24, 186, 221,  85, 205, 199, 237, 122, 181,
-  32,  46, 182, 135,  33, 251, 142,  34, 208, 242, 128, 255,   4, 234,  15,  33,
- 167, 222,  32, 186, 191,  34, 255, 244,  98, 240, 228, 204,  30, 142,  32,  70,
-  69,  83, 110, 151,  10, 243, 141,  21, 223,  69,  61,  37,  59, 209, 102, 114,
- 223,  33, 129, 254, 255, 103,  86, 247, 235,  72, 126, 177, 102, 226, 102,  30,
- 149, 221,  62, 247, 251, 120, 163, 173,  57, 202, 204,  24,  39, 106, 120, 143,
- 202, 176, 191, 147,  37,  38,  51, 133,  47, 245, 157, 132, 154,  71, 183, 111,
-  30, 180,  18, 202,  82,  96, 170,  91, 157, 181, 212, 140, 256,   8, 196, 121,
- 149,  79,  66, 127, 113,  78,   4, 197,  84, 256, 111, 222, 102,  63, 228, 104,
- 136, 223,  67, 193,  93, 154, 249,  83, 204, 101, 200, 234,  84, 252, 230, 195,
-  43, 140, 120, 242,  89,  63, 166, 233, 209,  94,  43, 170, 126,   5, 205,  78,
- 112,  80, 143, 151, 146, 248, 137, 203,  45, 183,  61,   1, 155,   8, 102,  59,
-  68, 212, 230,  61, 254, 191, 128, 223, 176, 123, 229,  27, 146, 120,  96, 165,
- 213,  12, 232,  40, 186, 225,  66, 105, 200, 195, 212, 110, 237, 238, 151,  19,
-  12, 171, 150,  82,   7, 228,  79,  52,  15,  78,  62,  43,  21, 154, 114,  21,
-  12, 212, 256, 232, 125, 127,   5,  51,  37, 252, 136,  13,  47, 195, 168, 191,
- 231,  55,  57, 251, 214, 116,  15,  86, 210,  41, 249, 242, 119,  27, 250, 203,
- 107,  69,  90,  43, 206, 154, 127,  54, 100,  78, 187,  54, 244, 177, 234, 167,
- 202, 136, 209, 171,  69, 114, 133, 173,  26, 139,  78, 141, 128,  32, 124,  39,
-  45, 218,  96,  68,  90,  44,  67,  62,  83, 190, 188, 256, 103,  42, 102,  64,
- 249,   0, 141,  11,  61,  69,  70,  66, 233, 237,  29, 200, 251, 157,  71,  51,
-  64, 133, 113,  76,  35, 125,  76, 137, 217, 145,  35,  69, 226, 180,  56, 249,
- 156, 163, 176, 237,  81,  54,  85, 169, 115, 211, 129,  70, 248,  40, 252, 192,
- 194, 101, 247,   8, 181, 124, 217, 191, 194,  93,  99, 127, 117, 177, 144, 151,
- 228, 121,  32,  11,  89,  81,  26,  29, 183,  76, 249, 132, 179,  70,  34, 102,
-  20,  66,  87,  63, 124, 205, 174, 177,  87, 219,  73, 218,  91,  87, 176,  72,
-  15, 211,  47,  61, 251, 165,  39, 247, 146,  70, 150,  57,   1, 212,  36, 162,
-  39,  38,  16, 216,   3,  50, 116, 200,  32, 234,  77, 181, 155,  19,  90, 188,
-  36,   6, 254,  46,  46, 203,  25, 230, 181, 196,   4, 151, 225,  65, 122, 216,
- 168,  86, 158, 131, 136,  16,  49, 102, 233,  64, 154,  88, 228,  52, 146,  69,
-  93, 157, 243, 121,  70, 209, 126, 213,  88, 145, 236,  65,  70,  96, 204,  47,
-  10, 200,  77,   8, 103, 150,  48, 153,   5,  37,  52, 235, 209,  31, 181, 126,
-  83, 142, 224, 140,   6,  32, 200, 171, 160, 179, 115, 229,  75, 194, 208,  39,
-  59, 223,  52, 247,  38, 197, 135,   1,   6, 189, 106, 114, 168,   5, 211, 222,
-  44,  63,  90, 160, 116, 172, 170, 133, 125, 138,  39, 131,  23, 178,  10, 214,
-  36,  93,  28,  59,  68,  17, 123,  25, 255, 184, 204, 102, 194, 214, 129,  94,
- 159, 245, 112, 141,  62,  11,  61, 197, 124, 221, 205,  11,  79,  71, 201,  54,
-  58, 150,  29, 121,  87,  46, 240, 201,  68,  20, 194, 209,  47, 152, 158, 174,
- 193, 164, 120, 255, 216, 165, 247,  58,  85, 130, 220,  23, 122, 223, 188,  98,
-  21,  70,  72, 170, 150, 237,  76, 143, 112, 238, 206, 146, 215, 110,   4, 250,
-  68,  44, 174, 177,  30,  98, 143, 241, 180, 127, 113,  48,   0,   1, 179, 199,
-  59, 106, 201, 114,  29,  86, 173, 133, 217,  44, 200, 141, 107, 172,  16,  60,
-  82,  58, 239,  94, 141, 234, 186, 235, 109, 173, 249, 139, 141,  59, 100, 248,
-  84, 144,  49, 160,  51, 207, 164, 103,  74,  97, 146, 202, 193, 125, 168, 134,
- 236, 111, 135, 121,  59, 145, 168, 200, 181, 173, 109,   2, 255,   6,   9, 245,
-  90, 202, 214, 143, 121,  65,  85, 232, 132,  77, 228,  84,  26,  54, 184,  15,
- 161,  29, 177,  79,  43,   0, 156, 184, 163, 165,  62,  90, 179,  93,  45, 239,
-   1,  16, 120, 189, 127,  47,  74, 166,  20, 214, 233, 226,  89, 217, 229,  26,
- 156,  53, 162,  60,  21,   3, 192,  72, 111,  51,  53, 101, 181, 208,  88,  82,
- 179, 160, 219, 113, 240, 108,  43, 224, 162, 147,  62,  14,  95,  81, 205,   4,
- 160, 177, 225, 115,  29,  69, 235, 168, 148,  29, 128, 114, 124, 129, 172, 165,
- 215, 231, 214,  86, 160,  44, 157,  91, 248, 183,  73, 164,  56, 181, 162,  92,
- 141, 118, 127, 240, 196,  77,   0,   9, 244,  79, 250, 100, 195,  25, 255,  85,
-  94,  35, 212, 137, 107,  34, 110,  20, 200, 104,  17,  32, 231,  43, 150, 159,
- 231, 216, 223, 190, 226, 109, 162, 197,  87,  92, 224,  11, 111,  73,  60, 225,
- 238,  73, 246, 169,  19, 217, 119,  38, 121, 118,  70,  82,  99, 241, 110,  67,
-  31,  76, 146, 215, 124, 240,  31, 103, 139, 224,  75, 160,  31,  78,  93,   4,
-  64,   9, 103, 223,   6, 227, 119,  85, 116,  81,  21,  43,  46, 206, 234, 132,
-  85,  99,  22, 131, 135,  97,  86,  13, 234, 188,  21,  14,  89, 169, 207, 238,
- 219, 177, 190,  72, 157,  41, 114, 140,  92, 141, 186,   1,  63, 107, 225, 184,
- 118, 150, 153, 254, 241, 106, 120, 210, 104, 144, 151, 161,  88, 206, 125, 164,
-  15, 211, 173,  49, 146, 241,  71,  36,  58, 201,  46,  27,  33, 187,  91, 162,
- 117,  19, 210, 213, 187,  97, 193,  50, 190, 114, 217,  60,  61, 167, 207, 213,
- 213,  53, 135,  34, 156,  91, 115, 119,  46,  99, 242,   1,  90,  52, 198, 227,
- 201,  91, 216, 146, 210,  82, 121,  38,  73, 133, 182, 193, 132, 148, 246,  75,
- 109, 157, 179, 113, 176, 134, 205, 159, 148,  58, 103, 171, 132, 156, 133, 147,
- 161, 231,  39, 100, 175,  97, 125,  28, 183, 129, 135, 191, 202, 181,  29, 218,
-  43, 104, 148, 203, 189, 204,   4, 182, 169,   1, 134, 122, 141, 202,  13, 187,
- 177, 112, 162,  35, 231,   6,   8, 241,  99,   6, 191,  45, 113, 113, 101, 104};
-
-// The S-Box we use for further linearity breaking.
-// We created it by taking the digits of the decimal expansion of e.
-// The code that created it can be found in 'ProduceRandomSBox.c'.
-unsigned char SBox[256] = {
-//0     1    2      3     4    5     6     7      8    9     A      B    C     D     E     F
-0x7d, 0xd1, 0x70, 0x0b, 0xfa, 0x39, 0x18, 0xc3, 0xf3, 0xbb, 0xa7, 0xd4, 0x84, 0x25, 0x3b, 0x3c,   // 0
-0x2c, 0x15, 0x69, 0x9a, 0xf9, 0x27, 0xfb, 0x02, 0x52, 0xba, 0xa8, 0x4b, 0x20, 0xb5, 0x8b, 0x3a,   // 1
-0x88, 0x8e, 0x26, 0xcb, 0x71, 0x5e, 0xaf, 0xad, 0x0c, 0xac, 0xa1, 0x93, 0xc6, 0x78, 0xce, 0xfc,   // 2
-0x2a, 0x76, 0x17, 0x1f, 0x62, 0xc2, 0x2e, 0x99, 0x11, 0x37, 0x65, 0x40, 0xfd, 0xa0, 0x03, 0xc1,   // 3
-0xca, 0x48, 0xe2, 0x9b, 0x81, 0xe4, 0x1c, 0x01, 0xec, 0x68, 0x7a, 0x5a, 0x50, 0xf8, 0x0e, 0xa3,   // 4
-0xe8, 0x61, 0x2b, 0xa2, 0xeb, 0xcf, 0x8c, 0x3d, 0xb4, 0x95, 0x13, 0x08, 0x46, 0xab, 0x91, 0x7b,   // 5
-0xea, 0x55, 0x67, 0x9d, 0xdd, 0x29, 0x6a, 0x8f, 0x9f, 0x22, 0x4e, 0xf2, 0x57, 0xd2, 0xa9, 0xbd,   // 6
-0x38, 0x16, 0x5f, 0x4c, 0xf7, 0x9e, 0x1b, 0x2f, 0x30, 0xc7, 0x41, 0x24, 0x5c, 0xbf, 0x05, 0xf6,   // 7
-0x0a, 0x31, 0xa5, 0x45, 0x21, 0x33, 0x6b, 0x6d, 0x6c, 0x86, 0xe1, 0xa4, 0xe6, 0x92, 0x9c, 0xdf,   // 8
-0xe7, 0xbe, 0x28, 0xe3, 0xfe, 0x06, 0x4d, 0x98, 0x80, 0x04, 0x96, 0x36, 0x3e, 0x14, 0x4a, 0x34,   // 9
-0xd3, 0xd5, 0xdb, 0x44, 0xcd, 0xf5, 0x54, 0xdc, 0x89, 0x09, 0x90, 0x42, 0x87, 0xff, 0x7e, 0x56,   // A
-0x5d, 0x59, 0xd7, 0x23, 0x75, 0x19, 0x97, 0x73, 0x83, 0x64, 0x53, 0xa6, 0x1e, 0xd8, 0xb0, 0x49,   // B
-0x3f, 0xef, 0xbc, 0x7f, 0x43, 0xf0, 0xc9, 0x72, 0x0f, 0x63, 0x79, 0x2d, 0xc0, 0xda, 0x66, 0xc8,   // C
-0x32, 0xde, 0x47, 0x07, 0xb8, 0xe9, 0x1d, 0xc4, 0x85, 0x74, 0x82, 0xcc, 0x60, 0x51, 0x77, 0x0d,   // D
-0xaa, 0x35, 0xed, 0x58, 0x7c, 0x5b, 0xb9, 0x94, 0x6e, 0x8d, 0xb1, 0xc5, 0xb7, 0xee, 0xb6, 0xae,   // E
-0x10, 0xe0, 0xd6, 0xd9, 0xe5, 0x4f, 0xf1, 0x12, 0x00, 0xd0, 0xf4, 0x1a, 0x6f, 0x8a, 0xb3, 0xb2 }; // F
-
-///////////////////////////////////////////////////////////////////////////////////////////////
-//
-//	Helper functions definition portion.
-//
-///////////////////////////////////////////////////////////////////////////////////////////////
-
-// Don't vectorize, move decl to header file
-
-// Translates an input array with values in base 257 to output array with values in base 256.
-// Returns the carry bit.
-//
-// Parameters:
-// - input: the input array of size EIGHTH_N. Each value in the array is a number in Z_257.
-//          The MSB is assumed to be the last one in the array.
-// - output: the input array encoded in base 256.
-//
-// Returns:
-// - The carry bit (MSB).
-swift_int16_t TranslateToBase256(swift_int32_t input[EIGHTH_N], unsigned char output[EIGHTH_N]);
-
-// Translates an input integer into the range (-FIELD_SIZE / 2) <= result <= (FIELD_SIZE / 2).
-//
-// Parameters:
-// - x: the input integer.
-//
-// Returns:
-// - The result, which equals (x MOD FIELD_SIZE), such that |result| <= (FIELD_SIZE / 2).
-int Center(int x);
-
-// Calculates bit reversal permutation.
-//
-// Parameters:
-// - input: the input to reverse.
-// - numOfBits: a power of two, 2^b, where b is the number of low bits of the input to reverse.
-//
-// Returns:
-// - The resulting number, which is obtained from the input by reversing its bits.
-int ReverseBits(int input, int numOfBits);
-
-// Initializes the FFT fast lookup table.
-// Shall be called only once.
-void InitializeSWIFFTX();
-
-// Calculates the FFT.
-//
-// Parameters:
-// - input: the input to the FFT.
-// - output: the resulting output.
-void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output);
-
-///////////////////////////////////////////////////////////////////////////////////////////////
-// Helper functions implementation portion.
-///////////////////////////////////////////////////////////////////////////////////////////////
-
-// Don't vectorize, delete this copy.
-
-swift_int16_t TranslateToBase256(swift_int32_t input[EIGHTH_N], unsigned char output[EIGHTH_N])
-{
-	swift_int32_t pairs[EIGHTH_N / 2];
-	int i;
-
-	for (i = 0; i < EIGHTH_N; i += 2)
-	{
-		// input[i] + 257 * input[i + 1]
-		pairs[i >> 1] = input[i] + input[i + 1] + (input[i + 1] << 8);
-	}
-
-	for (i = (EIGHTH_N / 2) - 1; i > 0; --i)
-	{
-		int j;
-
-		for (j = i - 1; j < (EIGHTH_N / 2) - 1; ++j)
-		{
-			// Adds pairs[j + 1] * 513, because 257^2 mod 256^2 = 513.
-			register swift_int32_t temp = pairs[j] + pairs[j + 1] + (pairs[j + 1] << 9);
-			pairs[j] = temp & 0xffff;
-			pairs[j + 1] += (temp >> 16);
-		}
-	}
-
-	for (i = 0; i < EIGHTH_N; i += 2)
-	{
-		output[i] = (unsigned char) (pairs[i >> 1] & 0xff);
-		output[i + 1] = (unsigned char) ((pairs[i >> 1] >> 8) & 0xff);
-	}
-
-	return (pairs[EIGHTH_N/2 - 1] >> 16);
-}
-
-int Center(int x)
-{
-	int result = x % FIELD_SIZE;
-
-	if (result > (FIELD_SIZE / 2))
-		result -= FIELD_SIZE;
-
-	if (result < (FIELD_SIZE / -2))
-		result += FIELD_SIZE;
-
-	return result;
-}
-
-int ReverseBits(int input, int numOfBits)
-{
-	register int reversed = 0;
-
-	for (input |= numOfBits; input > 1; input >>= 1)
-		reversed = (reversed << 1) | (input & 1);
-
-	return reversed;
-}
-
-void InitializeSWIFFTX()
-{
-	int i, j, k, x;
-	// The powers of OMEGA
-	int omegaPowers[2 * N];
-	omegaPowers[0] = 1;
-
-	if (wasSetupDone)
-		return;
-
-	for (i = 1; i < (2 * N); ++i)
-	{
-		omegaPowers[i] = Center(omegaPowers[i - 1] * OMEGA);
-	}
-
-	for (i = 0; i < (N / W); ++i)
-	{
-		for (j = 0; j < W; ++j)
-		{
-			multipliers[(i << 3) + j] = omegaPowers[ReverseBits(i, N / W) * (2 * j + 1)];
-		}
-	}
-
-	for (x = 0; x < 256; ++x)
-	{
-		for (j = 0; j < 8; ++j)
-		{
-			register int temp = 0;
-			for (k = 0; k < 8; ++k)
-			{
-				temp += omegaPowers[(EIGHTH_N * (2 * j + 1) * ReverseBits(k, W)) % (2 * N)]
-					  * ((x >> k) & 1);
-			}
-
-			fftTable[(x << 3) + j] = Center(temp);
-		}
-	}
-
-	wasSetupDone = true;
-}
-
-// input should be deinterleaved in contiguous memory
-// output and F are 4x32
-// multipliers & fftTable are scalar 16
-
-
-void FFT_4way(const unsigned char input[EIGHTH_N], swift_int32_t *output)
-{
-	swift_int16_t *mult = multipliers;
-   m128_swift_int32_t F[64];
-
-   for (int i = 0; i < 8; i++)
-   {
-      int j = i<<3;
-
-// Need to isolate bytes in input, 8 bytes per lane.
-// Each iteration of the loop processes one input vector.
-// Each lane reads a different index into fftTable.
-
-// deinterleave the input!
-
-// load table with 4 lanes from different indexes into fftTable
-// extract bytes into m128 4x16
-// multiply by vectorized mult
-
-// input[lane][byte]
-
-      __m128i table;
-      table = _mm_set_epi32( fftTable[ input[3][i] ],
-                             fftTable[ input[2][i] ],
-                             fftTable[ input[1][i] ],
-                             fftTable[ input[0][i] ] );
-
-      F[i  ] = _mm_mullo_epi32( mm128_const1_32( mult[j+0] ), table );
-
-      table = _mm_set_epi32( fftTable[ input[3][i+1] ],
-                             fftTable[ input[2][i+1] ],
-                             fftTable[ input[1][i+1] ],
-                             fftTable[ input[0][i+1] ] );
-
-      F[i+8] = _mm_mullo_epi32( mm128_const1_32( mult[j+0] ), table );
-
-
-      m128_swift_int16_t *table = &( fftTable[input[i] << 3] );
-
-      F[i   ] = _mm_mullo_epi32( mm128_const1_32( mult[j+0] ),
-                                 mm128_const1_32( table[0] ) );
-      F[i+ 8] = _mm_mullo_epi32( mm128_const1_32( mult[j+1] ),
-                                 mm128_const1_32( table[1] ) );
-      F[i+16] = _mm_mullo_epi32( mm128_const1_32( mult[j+2] ),
-                                 mm128_const1_32( table[2] ) );
-      F[i+24] = _mm_mullo_epi32( mm128_const1_32( mult[j+3] ),
-                                 mm128_const1_32( table[3] ) );
-      F[i+32] = _mm_mullo_epi32( mm128_const1_32( mult[j+4] ),
-                                 mm128_const1_32( table[4] ) );
-      F[i+40] = _mm_mullo_epi32( mm128_const1_32( mult[j+5] ),
-                                 mm128_const1_32( table[5] ) );
-      F[i+48] = _mm_mullo_epi32( mm128_const1_32( mult[j+6] ),
-                                 mm128_const1_32( table[6] ) );
-      F[i+56] = _mm_mullo_epi32( mm128_const1_32( mult[j+7] ),
-                                 mm128_const1_32( table[7] ) );
-   }
-
-
-   for ( int i = 0; i < 8; i++ )
-   {
-      int j = i<<3;
-      ADD_SUB_4WAY( F[j  ], F[j+1] );
-      ADD_SUB_4WAY( F[j+2], F[j+3] );
-      ADD_SUB_4WAY( F[j+4], F[j+5] );
-      ADD_SUB_4WAY( F[j+6], F[j+7] );
-
-      F[j+3] = _mm_slli_epi32( F[j+3], 4 );
-      F[j+7] = _mm_slli_epi32( F[j+7], 4 );
-
-      ADD_SUB_4WAY( F[j  ], F[j+2] );
-      ADD_SUB_4WAY( F[j+1], F[j+3] );
-      ADD_SUB_4WAY( F[j+4], F[j+6] );
-      ADD_SUB_4WAY( F[j+5], F[j+7] );
-
-      F[j+5] = _mm_slli_epi32( F[j+5], 2 );
-      F[j+6] = _mm_slli_epi32( F[j+6], 4 );
-      F[j+7] = _mm_slli_epi32( F[j+7], 6 );
-
-      ADD_SUB_4WAY( F[j  ], F[j+4] );
-      ADD_SUB_4WAY( F[j+1], F[j+5] );
-      ADD_SUB_4WAY( F[j+2], F[j+6] );
-      ADD_SUB_4WAY( F[j+3], F[j+7] );
-
-      output[i   ] = Q_REDUCE_4WAY( F[j  ] );
-      output[i+ 8] = Q_REDUCE_4WAY( F[j+1] );
-      output[i+16] = Q_REDUCE_4WAY( F[j+2] );
-      output[i+24] = Q_REDUCE_4WAY( F[j+3] );
-      output[i+32] = Q_REDUCE_4WAY( F[j+4] );
-      output[i+40] = Q_REDUCE_4WAY( F[j+5] );
-      output[i+48] = Q_REDUCE_4WAY( F[j+6] );
-      output[i+56] = Q_REDUCE_4WAY( F[j+7] );
-   }
-}
-
-// Calculates the FFT part of SWIFFT.
-// We divided the SWIFFT calculation into two, because that way we could save 2 computations of
-// the FFT part, since in the first stage of SWIFFTX the difference between the first 3 SWIFFTs
-// is only the A's part.
-//
-// Parameters:
-// - input: the input to FFT.
-// - m: the input size divided by 8. The function performs m FFTs.
-// - output: will store the result.
-void SWIFFTFFT(const unsigned char *input, int m, swift_int32_t *output)
-{
-	int i;
-
-	for (i = 0;
-		 i < m;
-		 i++, input += EIGHTH_N, output += N)
-	{
-		FFT(input, output);
-	}
-}
-
-// Calculates the 'sum' part of SWIFFT, including the base change at the end.
-// We divided the SWIFFT calculation into two, because that way we could save 2 computations of
-// the FFT part, since in the first stage of SWIFFTX the difference between the first 3 SWIFFTs
-// is only the A's part.
-//
-// Parameters:
-// - input: the input. Of size 64 * m.
-// - m: the input size divided by 64.
-// - output: will store the result.
-// - a: the coefficients in the sum. Of size 64 * m.
-void SWIFFTSum(const swift_int32_t *input, int m, unsigned char *output, const swift_int16_t *a)
-{
-	int i, j;
-	swift_int32_t result[N];
-	register swift_int16_t carry = 0;
-
-	for (j = 0; j < N; ++j)
-	{
-		register swift_int32_t sum = 0;
-		const register swift_int32_t *f = input + j;
-		const register swift_int16_t *k = a + j;
-
-		for (i = 0; i < m; i++, f += N,k += N)
-		{
-			sum += (*f) * (*k);
-		}
-
-		result[j] = sum;
-	}
-
-	for (j = 0; j < N; ++j)
-	{
-		result[j] = ((FIELD_SIZE << 22) + result[j]) % FIELD_SIZE;
-	}
-
-	for (j = 0; j < 8; ++j)
-	{
-		int register carryBit = TranslateToBase256(result + (j << 3), output + (j << 3));
-		carry |= carryBit << j;
-	}
-
-	output[N] = carry;
-}
-
-
-// On entry input is interleaved 4x64. SIZE is *4 lanes / 8 bytes,
-// multiply by 2.
-
-
-void ComputeSingleSWIFFTX_4way( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
-                          unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE],
-						  bool doSmooth)
-{
-	int i;
-	// Will store the result of the FFT parts:
-   m128_swift_int32_t fftOut[N * M];
-//   swift_int32_t fftOut[N * M];
-	unsigned char intermediate[N * 3 + 8];
-	unsigned char carry0,carry1,carry2;
-
-	// Do the three SWIFFTs while remembering the three carry bytes (each carry byte gets
-	// overwritten by the following SWIFFT):
-
-	// 1. Compute the FFT of the input - the common part for the first 3 SWIFFTs:
-	SWIFFTFFT(input, M, fftOut);
-
-	// 2. Compute the sums of the 3 SWIFFTs, each using a different set of coefficients:
-
-	// 2a. The first SWIFFT:
-	SWIFFTSum(fftOut, M, intermediate, As);
-	// Remember the carry byte:
-	carry0 = intermediate[N];
-
-	// 2b. The second one:
-	SWIFFTSum(fftOut, M, intermediate + N, As + (M * N));
-	carry1 = intermediate[2 * N];
-
-	// 2c. The third one:
-	SWIFFTSum(fftOut, M, intermediate + (2 * N), As + 2 * (M * N));
-	carry2 = intermediate[3 * N];
-
-	// 2d. Put the three carry bytes in their place:
-	intermediate[3 * N] = carry0;
-	intermediate[(3 * N) + 1] = carry1;
-	intermediate[(3 * N) + 2] = carry2;
-
-	// Pad the intermediate output with 5 zeros.
-	memset(intermediate + (3 * N) + 3, 0, 5);
-
-	// Apply the S-Box:
-	for (i = 0; i < (3 * N) + 8; ++i)
-	{
-		intermediate[i] = SBox[intermediate[i]];
-	}
-
-	// 3. The final and last SWIFFT:
-	SWIFFTFFT(intermediate, 3 * (N/8) + 1, fftOut);
-	SWIFFTSum(fftOut,       3 * (N/8) + 1, output, As);
-
-	if (doSmooth)
-	{
-		unsigned char sum[N];
-		register int i, j;
-		memset(sum, 0, N);
-
-		for (i = 0; i < (N + 1) * 8; ++i)
-		{
-			register const swift_int16_t *AsRow;
-			register int AShift;
-
-			if  (!(output[i >> 3] & (1 << (i & 7))))
-			{
-				continue;
-			}
-
-			AsRow = As + N * M + (i & ~(N - 1)) ;
-			AShift = i & 63;
-
-			for (j = AShift; j < N; ++j)
-			{
-				sum[j] += AsRow[j - AShift];
-			}
-
-			for(j = 0; j < AShift; ++j)
-			{
-				sum[j] -= AsRow[N - AShift + j];
-			}
-		}
-
-		for (i = 0; i < N; ++i)
-		{
-			output[i] = sum[i];
-		}
-
-		output[N] = 0;
-	}
-}
--- a/algo/swifftx/swifftx.c
+++ b/algo/swifftx/swifftx.c
@@ -604,21 +604,14 @@ void InitializeSWIFFTX()
 	int omegaPowers[2 * N];
 	omegaPowers[0] = 1;

-	if (wasSetupDone)
-		return;
+	if (wasSetupDone) return;

 	for (i = 1; i < (2 * N); ++i)
-	{
 		omegaPowers[i] = Center(omegaPowers[i - 1] * OMEGA);
-	}

 	for (i = 0; i < (N / W); ++i)
-	{
 		for (j = 0; j < W; ++j)
-		{
 			multipliers[(i << 3) + j] = omegaPowers[ReverseBits(i, N / W) * (2 * j + 1)];
-		}
-	}

 	for (x = 0; x < 256; ++x)
 	{
@@ -626,10 +619,8 @@ void InitializeSWIFFTX()
 		{
 			register int temp = 0;
 			for (k = 0; k < 8; ++k)
-			{
 				temp += omegaPowers[(EIGHTH_N * (2 * j + 1) * ReverseBits(k, W)) % (2 * N)]
 					  * ((x >> k) & 1);
-			}

 			fftTable[(x << 3) + j] = Center(temp);
 		}
@@ -703,18 +694,18 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)

 #if defined (__AVX512VL__) && defined(__AVX512BW__)   

-   #define Q_REDUCE( a ) \
-       _mm256_sub_epi32( _mm256_and_si256( a, \
-                 _mm256_movm_epi8( 0x11111111 ) ), _mm256_srai_epi32( a, 8 ) ) 
+   const __m256i mask = _mm256_movm_epi8( 0x11111111 );

-#else   
+#else

-   #define Q_REDUCE( a ) \
-       _mm256_sub_epi32( _mm256_and_si256( a, \
-                   m256_const1_32( 0x000000ff ) ), _mm256_srai_epi32( a, 8 ) ) 
+   const __m256i mask = m256_const1_32( 0x000000ff );

 #endif
-                          
+
+   #define Q_REDUCE( a ) \
+       _mm256_sub_epi32( _mm256_and_si256( a, mask ), \
+                         _mm256_srai_epi32( a, 8 ) )
+
   out[0] = Q_REDUCE( F[0] );  
   out[1] = Q_REDUCE( F[1] );                        
   out[2] = Q_REDUCE( F[2] );                        
@@ -805,9 +796,10 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)

   #undef ADD_SUB

+   const __m128i mask = m128_const1_32( 0x000000ff );
+
   #define Q_REDUCE( a ) \
-      _mm_sub_epi32( _mm_and_si128( a, \
-                   m128_const1_32( 0x000000ff ) ), _mm_srai_epi32( a, 8 ) ) 
+      _mm_sub_epi32( _mm_and_si128( a, mask ), _mm_srai_epi32( a, 8 ) ) 

   out[ 0] = Q_REDUCE( F[ 0] );
   out[ 1] = Q_REDUCE( F[ 1] );
@@ -1357,6 +1349,7 @@ void SWIFFTSum( const swift_int32_t *input, int m, unsigned char *output,
 	output[N] = carry;
 }

+/*
 void ComputeSingleSWIFFTX_smooth(unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
                          unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE],
 						  bool doSmooth)
@@ -1434,51 +1427,50 @@ void ComputeSingleSWIFFTX_smooth(unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
 		output[N] = 0;
 	}
 }
+*/

-void ComputeSingleSWIFFTX( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
-                           unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE] )
+void ComputeSingleSWIFFTX( unsigned char *input, unsigned char *output )
 {
   int i;
   // Will store the result of the FFT parts:
   swift_int32_t fftOut[N * M] __attribute__ ((aligned (64)));
-   unsigned char intermediate[N * 3 + 8] __attribute__ ((aligned (64)));
+   unsigned char sum[ N*3 + 8 ] __attribute__ ((aligned (64)));
   unsigned char carry0,carry1,carry2;

   // Do the three SWIFFTS while remembering the three carry bytes (each carry byte gets
   // overriden by the following SWIFFT):

   // 1. Compute the FFT of the input - the common part for the first 3 SWIFFTs:
-   SWIFFTFFT(input, M, fftOut);
+   SWIFFTFFT( input, M, fftOut );

   // 2. Compute the sums of the 3 SWIFFTs, each using a different set of coefficients:

   // 2a. The first SWIFFT:
-   SWIFFTSum(fftOut, M, intermediate, As);
-   // Remember the carry byte:
-   carry0 = intermediate[N];
+   SWIFFTSum( fftOut, M, sum,       As         );
+   carry0 = sum[N];

   // 2b. The second one:
-   SWIFFTSum(fftOut, M, intermediate + N, As + (M * N));
-   carry1 = intermediate[2 * N];
+   SWIFFTSum( fftOut, M, sum + N,   As +   M*N );
+   carry1 = sum[ 2*N ];

   // 2c. The third one:
-   SWIFFTSum(fftOut, M, intermediate + (2 * N), As + 2 * (M * N));
-   carry2 = intermediate[3 * N];
+   SWIFFTSum( fftOut, M, sum + 2*N, As + 2*M*N );
+   carry2 = sum[ 3*N ];

   //2d. Put three carry bytes in their place
-   intermediate[3 * N] = carry0;
-   intermediate[(3 * N) + 1] = carry1;
-   intermediate[(3 * N) + 2] = carry2;
+   sum[ 3*N     ] = carry0;
+   sum[ 3*N + 1 ] = carry1;
+   sum[ 3*N + 2 ] = carry2;

   // Padding  intermediate output with 5 zeroes.
-   memset(intermediate + (3 * N) + 3, 0, 5);
+   memset( sum + 3*N + 3, 0, 5 );

   // Apply the S-Box:
   for ( i = 0; i < (3 * N) + 8; ++i )
-      intermediate[i] = SBox[intermediate[i]];
+      sum[i] = SBox[ sum[i] ];

   // 3. The final and last SWIFFT:
-   SWIFFTFFT(intermediate, 3 * (N/8) + 1, fftOut);
-   SWIFFTSum(fftOut,       3 * (N/8) + 1, output, As);
-
+   SWIFFTFFT( sum, 3 * (N/8) + 1, fftOut );
+   SWIFFTSum( fftOut,       3 * (N/8) + 1, sum, As );
+   memcpy( output, sum, SWIFFTX_OUTPUT_BLOCK_SIZE - 1 );
 }
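
Note on the Q_REDUCE change in the swifftx.c hunks above: the macro now reads a precomputed `mask` instead of rebuilding the constant on every use, and the arithmetic is unchanged. As a rough illustration of what that arithmetic does, here is a scalar sketch. It assumes the SWIFFT field size of 257 and relies on 256 being congruent to -1 modulo 257. The helper names `q_reduce_scalar` and `mod257` are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

/* Scalar counterpart of the vectorised Q_REDUCE macro (hypothetical name).
   Writing a = 256*hi + lo with lo = a & 0xff and hi = a >> 8, and using
   256 = -1 (mod 257), gives a = lo - hi (mod 257). The result is only
   partially reduced: it is congruent to a, not necessarily in [0, 256]. */
static inline int32_t q_reduce_scalar( int32_t a )
{
   /* Assumes >> on a negative int32_t is an arithmetic shift, as on GCC and
      Clang, which matches the behaviour of _mm_srai_epi32. */
   return ( a & 0xff ) - ( a >> 8 );
}

/* Canonical residue in [0, 256], used here only to check congruence. */
static inline int32_t mod257( int32_t a )
{
   int32_t r = a % 257;
   return r < 0 ? r + 257 : r;
}
```

Comparing `mod257( q_reduce_scalar( a ) )` with `mod257( a )` over a range of inputs is a quick way to confirm that the mask-and-subtract step preserves the residue.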
--- a/algo/swifftx/swifftx.h
+++ b/algo/swifftx/swifftx.h
@@ -61,11 +61,10 @@ void ComputeSingleSWIFFT(unsigned char *input, unsigned short m,
 //
 // Returns:
 // - Success value.
-void ComputeSingleSWIFFTX( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
-                           unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE] );
+void ComputeSingleSWIFFTX( unsigned char *input, unsigned char *output );

-void ComputeSingleSWIFFTX_smooth( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
-	            unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE], bool doSmooth);
+//void ComputeSingleSWIFFTX_smooth( unsigned char input[SWIFFTX_INPUT_BLOCK_SIZE],
+//	            unsigned char output[SWIFFTX_OUTPUT_BLOCK_SIZE], bool doSmooth);

 // Calculates the powers of OMEGA and generates the bit reversal permutation.
 // You must call this function before doing SWIFFT/X, otherwise you will get zeroes everywhere.
--- a/algo/verthash/Verthash.c
+++ b/algo/verthash/Verthash.c
@@ -10,6 +10,7 @@
 #include "algo-gate-api.h"
 #include "Verthash.h"
 #include "mm_malloc.h"
+#include "malloc-huge.h"

 //-----------------------------------------------------------------------------
 // Verthash info management
@@ -84,10 +85,17 @@ int verthash_info_init(verthash_info_t* info, const char* file_name)
    }

    // Allocate data
-    info->data = (uint8_t *)_mm_malloc( fileSize, 64 );
-    if (!info->data)
+    info->data = (uint8_t *)malloc_hugepages( fileSize );
+    if ( info->data )
    {
-        fclose(fileMiningData);
+       if ( !opt_quiet ) applog( LOG_INFO, "Verthash data is using huge pages");
+    }
+    else
+       info->data = (uint8_t *)_mm_malloc( fileSize, 64 );
+
+    if ( !info->data )
+    {
+        fclose( fileMiningData );
        // Memory allocation fatal error.
        return 2;
    }
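
Note on the allocation change above: the patch tries `malloc_hugepages()` first and falls back to `_mm_malloc( fileSize, 64 )` when it returns NULL. The contents of malloc-huge.h are not shown in this diff, so the following is only a sketch of the same try-then-fall-back pattern on Linux. `try_hugepages` and `alloc_dataset` are hypothetical names, the 2 MiB huge page size is an assumption, and `posix_memalign` stands in for `_mm_malloc`.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical stand-in for malloc_hugepages(): request an anonymous
   huge-page mapping and return NULL when the kernel cannot provide one. */
static void *try_hugepages( size_t size )
{
#if defined(MAP_HUGETLB)
   const size_t huge = (size_t)2 << 20;           /* assumed 2 MiB page size */
   size_t rounded = ( size + huge - 1 ) & ~( huge - 1 );
   void *p = mmap( NULL, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0 );
   return p == MAP_FAILED ? NULL : p;
#else
   (void)size;
   return NULL;
#endif
}

/* Try huge pages first, then fall back to a 64-byte aligned heap block.
   *used_huge reports which path was taken so the caller can log it. */
static void *alloc_dataset( size_t size, int *used_huge )
{
   void *p = try_hugepages( size );
   *used_huge = ( p != NULL );
   if ( !p && posix_memalign( &p, 64, size ) != 0 )
      p = NULL;
   return p;
}
```

One consequence of this design is that the two paths need different release calls (`munmap` for the mapping, `free` for the heap block), so a caller would have to remember which one succeeded.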
--- a/algo/verthash/tiny_sha3/sha3-4way.c
+++ b/algo/verthash/tiny_sha3/sha3-4way.c
@@ -29,16 +29,11 @@ void sha3_4way_keccakf( __m256i st[25] )
   for ( r = 0; r < KECCAKF_ROUNDS; r++ )
   {
      // Theta
-      bc[0] = _mm256_xor_si256( st[0],
-                           mm256_xor4( st[5], st[10], st[15], st[20] ) );
-      bc[1] = _mm256_xor_si256( st[1],
-                           mm256_xor4( st[6], st[11], st[16], st[21] ) );
-      bc[2] = _mm256_xor_si256( st[2],
-                           mm256_xor4( st[7], st[12], st[17], st[22] ) );
-      bc[3] = _mm256_xor_si256( st[3],
-                           mm256_xor4( st[8], st[13], st[18], st[23] ) );
-      bc[4] = _mm256_xor_si256( st[4],
-                           mm256_xor4( st[9], st[14], st[19], st[24] ) );
+      bc[0] = mm256_xor3( st[0], st[5], mm256_xor3( st[10], st[15], st[20] ) );
+      bc[1] = mm256_xor3( st[1], st[6], mm256_xor3( st[11], st[16], st[21] ) );
+      bc[2] = mm256_xor3( st[2], st[7], mm256_xor3( st[12], st[17], st[22] ) );
+      bc[3] = mm256_xor3( st[3], st[8], mm256_xor3( st[13], st[18], st[23] ) );
+      bc[4] = mm256_xor3( st[4], st[9], mm256_xor3( st[14], st[19], st[24] ) );

      for ( i = 0; i < 5; i++ )
      {
@@ -89,17 +84,13 @@ void sha3_4way_keccakf( __m256i st[25] )
      //  Chi
      for ( j = 0; j < 25; j += 5 )
      {
-         memcpy( bc, &st[ j ], 5*32 );
-         st[ j   ] = _mm256_xor_si256( st[ j   ],
-                                       _mm256_andnot_si256( bc[1], bc[2] ) );
-         st[ j+1 ] = _mm256_xor_si256( st[ j+1 ],
-                                       _mm256_andnot_si256( bc[2], bc[3] ) );
-         st[ j+2 ] = _mm256_xor_si256( st[ j+2 ],
-                                       _mm256_andnot_si256( bc[3], bc[4] ) );
-         st[ j+3 ] = _mm256_xor_si256( st[ j+3 ],
-                                       _mm256_andnot_si256( bc[4], bc[0] ) );
-         st[ j+4 ] = _mm256_xor_si256( st[ j+4 ],
-                                       _mm256_andnot_si256( bc[0], bc[1] ) );
+         bc[0] = st[j];
+         bc[1] = st[j+1];
+         st[ j   ] = mm256_xorandnot( st[ j   ], st[j+1], st[j+2] );
+         st[ j+1 ] = mm256_xorandnot( st[ j+1 ], st[j+2], st[j+3] );
+         st[ j+2 ] = mm256_xorandnot( st[ j+2 ], st[j+3], st[j+4] );
+         st[ j+3 ] = mm256_xorandnot( st[ j+3 ], st[j+4], bc[0] );
+         st[ j+4 ] = mm256_xorandnot( st[ j+4 ], bc[0], bc[1] );
      }

      //  Iota
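
Note on the Chi rewrite above: the old code copied all five lanes of a row into `bc` before updating them, while the new code saves only `st[j]` and `st[j+1]`. The scalar sketch below, using plain `uint64_t` lanes instead of `__m256i`, illustrates why two temporaries are enough: every other lane is read before it is overwritten. The function names are hypothetical, and the expression `a ^ (~b & c)` is assumed to be what `mm256_xorandnot( a, b, c )` computes, consistent with the `_mm256_andnot_si256` calls it replaces.

```c
#include <stdint.h>
#include <string.h>

/* Reference Chi on one 5-lane row: snapshot the whole row, then update. */
static void chi_row_copy( uint64_t st[5] )
{
   uint64_t bc[5];
   memcpy( bc, st, sizeof bc );
   for ( int i = 0; i < 5; i++ )
      st[i] = bc[i] ^ ( ~bc[ (i+1) % 5 ] & bc[ (i+2) % 5 ] );
}

/* In-place variant mirroring the patched code. Lanes 2..4 are still
   unmodified when they are read; only lanes 0 and 1 are overwritten
   before the last two updates need them, so only they are saved. */
static void chi_row_inplace( uint64_t st[5] )
{
   const uint64_t s0 = st[0], s1 = st[1];
   st[0] ^= ~st[1] & st[2];
   st[1] ^= ~st[2] & st[3];
   st[2] ^= ~st[3] & st[4];
   st[3] ^= ~st[4] & s0;
   st[4] ^= ~s0    & s1;
}
```

Running both functions on the same input row and comparing the results is a simple check that the reduced-copy form is equivalent.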
--- a/algo/verthash/verthash-gate.c
+++ b/algo/verthash/verthash-gate.c
@@ -127,7 +127,7 @@ bool register_verthash_algo( algo_gate_t* gate )
 {
  opt_target_factor = 256.0;
  gate->scanhash  = (void*)&scanhash_verthash;
-  gate->optimizations = AVX2_OPT;
+  gate->optimizations = SSE42_OPT | AVX2_OPT;
   
  const char *verthash_data_file = opt_data_file ? opt_data_file
                                                 : default_verthash_data_file;
--- a/algo/whirlpool/md-helper-4way.c
+++ b/algo/whirlpool/md-helper-4way.c
@@ -1,291 +0,0 @@
-/* $Id: md_helper.c 216 2010-06-08 09:46:57Z tp $ */
-/*
- * This file contains some functions which implement the external data
- * handling and padding for Merkle-Damgard hash functions which follow
- * the conventions set out by MD4 (little-endian) or SHA-1 (big-endian).
- *
- * API: this file is meant to be included, not compiled as a stand-alone
- * file. Some macros must be defined:
- *   RFUN   name for the round function
- *   HASH   "short name" for the hash function
- *   BE32   defined for big-endian, 32-bit based (e.g. SHA-1)
- *   LE32   defined for little-endian, 32-bit based (e.g. MD5)
- *   BE64   defined for big-endian, 64-bit based (e.g. SHA-512)
- *   LE64   defined for little-endian, 64-bit based (no example yet)
- *   PW01   if defined, append 0x01 instead of 0x80 (for Tiger)
- *   BLEN   if defined, length of a message block (in bytes)
- *   PLW1   if defined, length is defined on one 64-bit word only (for Tiger)
- *   PLW4   if defined, length is defined on four 64-bit words (for WHIRLPOOL)
- *   SVAL   if defined, reference to the context state information
- *
- * BLEN is used when a message block is not 16 (32-bit or 64-bit) words:
- * this is used for instance for Tiger, which works on 64-bit words but
- * uses 512-bit message blocks (eight 64-bit words). PLW1 and PLW4 are
- * ignored if 32-bit words are used; if 64-bit words are used and PLW1 is
- * set, then only one word (64 bits) will be used to encode the input
- * message length (in bits), otherwise two words will be used (as in
- * SHA-384 and SHA-512). If 64-bit words are used and PLW4 is defined (but
- * not PLW1), four 64-bit words will be used to encode the message length
- * (in bits). Note that regardless of those settings, only 64-bit message
- * lengths are supported (in bits): messages longer than 2 Exabytes will be
- * improperly hashed (this is unlikely to happen soon: 2 Exabytes is about
- * 2 millions Terabytes, which is huge).
- *
- * If CLOSE_ONLY is defined, then this file defines only the sph_XXX_close()
- * function. This is used for Tiger2, which is identical to Tiger except
- * when it comes to the padding (Tiger2 uses the standard 0x80 byte instead
- * of the 0x01 from original Tiger).
- *
- * The RFUN function is invoked with two arguments, the first pointing to
- * aligned data (as a "const void *"), the second being state information
- * from the context structure. By default, this state information is the
- * "val" field from the context, and this field is assumed to be an array
- * of words ("sph_u32" or "sph_u64", depending on BE32/LE32/BE64/LE64).
- * from the context structure. The "val" field can have any type, except
- * for the output encoding which assumes that it is an array of "sph_u32"
- * values. By defining NO_OUTPUT, this last step is deactivated; the
- * includer code is then responsible for writing out the hash result. When
- * NO_OUTPUT is defined, the third parameter to the "close()" function is
- * ignored.
- *
- * ==========================(LICENSE BEGIN)============================
- *
- * Copyright (c) 2007-2010  Projet RNRT SAPHIR
- * 
- * Permission is hereby granted, free of charge, to any person obtaining
- * a copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- * 
- * The above copyright notice and this permission notice shall be
- * included in all copies or substantial portions of the Software.
- * 
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- * ===========================(LICENSE END)=============================
- *
- * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
- */
-
-#ifdef _MSC_VER
-#pragma warning (disable: 4146)
-#endif
-
-#undef SPH_XCAT
-#define SPH_XCAT(a, b)     SPH_XCAT_(a, b)
-#undef SPH_XCAT_
-#define SPH_XCAT_(a, b)    a ## b
-
-#undef SPH_BLEN
-#undef SPH_WLEN
-#if defined BE64 || defined LE64
-#define SPH_BLEN    128U
-#define SPH_WLEN      8U
-#else
-#define SPH_BLEN     64U
-#define SPH_WLEN      4U
-#endif
-
-#ifdef BLEN
-#undef SPH_BLEN
-#define SPH_BLEN    BLEN
-#endif
-
-#undef SPH_MAXPAD
-#if defined PLW1
-#define SPH_MAXPAD   (SPH_BLEN - SPH_WLEN)
-#elif defined PLW4
-#define SPH_MAXPAD   (SPH_BLEN - (SPH_WLEN << 2))
-#else
-#define SPH_MAXPAD   (SPH_BLEN - (SPH_WLEN << 1))
-#endif
-
-#undef SPH_VAL
-#undef SPH_NO_OUTPUT
-#ifdef SVAL
-#define SPH_VAL         SVAL
-#define SPH_NO_OUTPUT   1
-#else
-#define SPH_VAL   sc->val
-#endif
-
-#ifndef CLOSE_ONLY
-
-#ifdef SPH_UPTR
-static void
-SPH_XCAT(HASH, _short)( void *cc, const void *data, size_t len )
-#else
-void
-HASH ( void *cc, const void *data, size_t len )
-#endif
-{
-   SPH_XCAT( HASH, _context ) *sc;
-   __m256i *vdata = (__m256i*)data;
-   size_t ptr;
-
-   sc = cc;
-   ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
-   while ( len > 0 )
-   {
-      size_t clen;
-      clen = SPH_BLEN - ptr;
-      if ( clen > len )
-         clen = len;
-      memcpy_256( sc->buf + (ptr>>3), vdata, clen>>3 );
-      vdata = vdata + (clen>>3);
-      ptr += clen;
-      len -= clen;
-      if ( ptr == SPH_BLEN )
-      {
-         RFUN( sc->buf, SPH_VAL );
-         ptr = 0;
-      }
-         sc->count += clen;
-   }
-}
-
-#ifdef SPH_UPTR
-void
-HASH (void *cc, const void *data, size_t len)
-{
-   SPH_XCAT(HASH, _context) *sc;
-   __m256i *vdata = (__m256i*)data;
-   unsigned ptr;
-
-   if ( len < (2 * SPH_BLEN) )
-   {
-      SPH_XCAT(HASH, _short)(cc, data, len);
-      return;
-   }
-   sc = cc;
-   ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
-   if ( ptr > 0 )
-   {
-      unsigned t;
-      t = SPH_BLEN - ptr;
-      SPH_XCAT( HASH, _short )( cc, data, t );
-      vdata = vdata + (t>>3);
-      len -= t;
-   }
-   SPH_XCAT( HASH, _short )( cc, data, len );
-}
-#endif
-
-#endif
-
-/*
- * Perform padding and produce result. The context is NOT reinitialized
- * by this function.
- */
-static void
-SPH_XCAT( HASH, _addbits_and_close )(void *cc, 	unsigned ub, unsigned n,
-          void *dst, unsigned rnum )
-{
-    SPH_XCAT(HASH, _context) *sc;
-    unsigned ptr, u;
-    sc = cc;
-    ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
-
-//uint64_t *b= (uint64_t*)sc->buf;
-//uint64_t *s= (uint64_t*)sc->state;
-//printf("Vptr 1= %u\n", ptr);
-//printf("VBuf %016llx %016llx %016llx %016llx\n", b[0], b[4], b[8], b[12] );
-//printf("VBuf %016llx %016llx %016llx %016llx\n", b[16], b[20], b[24], b[28] );
-
-#ifdef PW01
-    sc->buf[ptr>>3] = _mm256_set1_epi64x( 0x100 >> 8 );
-//    sc->buf[ptr++] = 0x100 >> 8;
-#else
-// need to overwrite exactly one byte
-//    sc->buf[ptr>>3] = _mm256_set_epi64x( 0, 0, 0, 0x80 );
-    sc->buf[ptr>>3] = _mm256_set1_epi64x( 0x80 );
-//    ptr++;
-#endif
-    ptr += 8;
-
-//printf("Vptr 2= %u\n", ptr);
-//printf("VBuf %016llx %016llx %016llx %016llx\n", b[0], b[4], b[8], b[12] );
-//printf("VBuf %016llx %016llx %016llx %016llx\n", b[16], b[20], b[24], b[28] );
-
-    if ( ptr > SPH_MAXPAD )
-    {
-         memset_zero_256( sc->buf + (ptr>>3), (SPH_BLEN - ptr) >> 3 );
-         RFUN( sc->buf, SPH_VAL );
-         memset_zero_256( sc->buf, SPH_MAXPAD >> 3 );
-    }
-    else
-    {
-         memset_zero_256( sc->buf + (ptr>>3), (SPH_MAXPAD - ptr) >> 3 );
-    }
-#if defined BE64
-#if defined PLW1
-    sc->buf[ SPH_MAXPAD>>3 ] =
-                 mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
-#elif defined PLW4
-    memset_zero_256( sc->buf + (SPH_MAXPAD>>3), ( 2 * SPH_WLEN ) >> 3 );
-    sc->buf[ (SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ] =
-                mm256_bswap_64( _mm256_set1_epi64x( sc->count >> 61 ) );
-    sc->buf[ (SPH_MAXPAD + 3 * SPH_WLEN ) >> 3 ] =
-                mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
-#else
-    sc->buf[ ( SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ] =
-               mm256_bswap_64( _mm256_set1_epi64x( sc->count >> 61 ) );
-    sc->buf[ ( SPH_MAXPAD + 3 * SPH_WLEN ) >> 3 ] =
-               mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
-#endif  // PLW
-#else  // LE64
-#if defined PLW1
-    sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
-#elif defined PLW4
-    sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
-    sc->buf[ ( SPH_MAXPAD + SPH_WLEN ) >> 3 ] =
-                       _mm256_set1_epi64x( c->count >> 61 );
-    memset_zero_256( sc->buf + ( ( SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ),
-                       2 * SPH_WLEN );
-#else
-    sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
-    sc->buf[ ( SPH_MAXPAD + SPH_WLEN ) >> 3 ] =
-                          _mm256_set1_epi64x( sc->count >> 61 );
-#endif // PLW
-
-#endif // LE64
-
-//printf("Vptr 3= %u\n", ptr);
-//printf("VBuf   %016llx %016llx %016llx %016llx\n", b[0], b[4], b[8], b[12] );
-//printf("VBuf   %016llx %016llx %016llx %016llx\n", b[16], b[20], b[24], b[28] );
-    RFUN( sc->buf, SPH_VAL );
-
-//printf("Vptr after= %u\n", ptr);
-//printf("VState %016llx %016llx %016llx %016llx\n", s[0], s[4], s[8], s[12] );
-//printf("VState %016llx %016llx %016llx %016llx\n", s[16], s[20], s[24], s[28] );
-
-#ifdef SPH_NO_OUTPUT
-    (void)dst;
-    (void)rnum;
-    (void)u;
-#else
-    for ( u = 0; u < rnum; u ++ )
-    {
-#if defined BE64
-       ((__m256i*)dst)[u] = mm256_bswap_64( sc->val[u] );
-#else  // LE64
-       ((__m256i*)dst)[u] = sc->val[u];
-#endif
-    }
-#endif
-}
-
-static void
-SPH_XCAT( HASH, _mdclose )( void *cc, void *dst, unsigned rnum )
-{
-   SPH_XCAT( HASH, _addbits_and_close )( cc, 0, 0, dst, rnum );
-}
--- a/algo/whirlpool/whirlpool-hash-4way.c
+++ b/algo/whirlpool/whirlpool-hash-4way.c
--- a/algo/whirlpool/whirlpool-hash-4way.h
+++ b/algo/whirlpool/whirlpool-hash-4way.h
@@ -1,108 +0,0 @@
-/* $Id: sph_whirlpool.h 216 2010-06-08 09:46:57Z tp $ */
-/**
- * WHIRLPOOL interface.
- *
- * WHIRLPOOL knows three variants, dubbed "WHIRLPOOL-0" (original
- * version, published in 2000, studied by NESSIE), "WHIRLPOOL-1"
- * (first revision, 2001, with a new S-box) and "WHIRLPOOL" (current
- * version, 2003, with a new diffusion matrix, also described as "plain
- * WHIRLPOOL"). All three variants are implemented here.
- *
- * The original WHIRLPOOL (i.e. WHIRLPOOL-0) was published in: P. S. L.
- * M. Barreto, V. Rijmen, "The Whirlpool Hashing Function", First open
- * NESSIE Workshop, Leuven, Belgium, November 13--14, 2000.
- *
- * The current WHIRLPOOL specification and a reference implementation
- * can be found on the WHIRLPOOL web page:
- * http://paginas.terra.com.br/informatica/paulobarreto/WhirlpoolPage.html
- *
- * ==========================(LICENSE BEGIN)============================
- *
- * Copyright (c) 2007-2010  Projet RNRT SAPHIR
- *
- * Permission is hereby granted, free of charge, to any person obtaining
- * a copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice shall be
- * included in all copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- * ===========================(LICENSE END)=============================
- *
- * @file     sph_whirlpool.h
- * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
- */
-
-#ifndef WHIRLPOOL_HASH_4WAY_H__
-#define WHIRLPOOL_HASH_4WAY_H__
-
-#ifdef __AVX2__
-
-#include <stddef.h>
-#include "algo/sha/sph_types.h"
-#include "simd-utils.h"
-
-/**
- * Output size (in bits) for WHIRLPOOL.
- */
-#define SPH_SIZE_whirlpool   512
-
-/**
- * Output size (in bits) for WHIRLPOOL-0.
- */
-#define SPH_SIZE_whirlpool0   512
-
-/**
- * Output size (in bits) for WHIRLPOOL-1.
- */
-#define SPH_SIZE_whirlpool1   512
-
-typedef struct {
-    __m256i buf[8] __attribute__ ((aligned (64)));
-    __m256i state[8];
-    sph_u64 count;
-} whirlpool_4way_context;
-
-void whirlpool_4way_init( void *cc );
-
-void whirlpool_4way( void *cc, const void *data, size_t len );
-
-void whirlpool_4way_close( void *cc, void *dst );
-
-/**
- * WHIRLPOOL-0 uses the same structure than plain WHIRLPOOL.
- */
-typedef whirlpool_4way_context whirlpool0_4way_context;
-
-#define whirlpool0_4way_init whirlpool_4way_init
-
-void whirlpool0_4way( void *cc, const void *data, size_t len );
-
-void whirlpool0_4way_close( void *cc, void *dst );
-
-/**
- * WHIRLPOOL-1 uses the same structure than plain WHIRLPOOL.
- */
-typedef whirlpool_4way_context whirlpool1_4way_context;
-
-#define whirlpool1_4way_init whirlpool_4way_init
-
-void whirlpool1_4way(void *cc, const void *data, size_t len);
-
-void whirlpool1_4way_close(void *cc, void *dst);
-
-#endif
-
-#endif
--- a/algo/x11/c11-4way.c
+++ b/algo/x11/c11-4way.c
@@ -12,6 +12,7 @@
 #include "algo/cubehash/cube-hash-2way.h"
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/shavite/sph_shavite.h"
+#include "algo/shavite/shavite-hash-2way.h"
 #include "algo/simd/simd-hash-2way.h"
 #include "algo/echo/aes_ni/hash_api.h"
 #if defined(__VAES__)
@@ -22,15 +23,15 @@

 #if defined (C11_8WAY)

-typedef struct {
+union _c11_8way_context_overlay
+{
    blake512_8way_context   blake;
    bmw512_8way_context     bmw;
    skein512_8way_context   skein;
    jh512_8way_context      jh;
    keccak512_8way_context  keccak;
    luffa_4way_context      luffa;
-    cube_4way_context       cube;
-    simd_4way_context       simd;
+    cube_4way_2buf_context   cube;
 #if defined(__VAES__)
    groestl512_4way_context groestl;
    shavite512_4way_context shavite;
@@ -40,32 +41,14 @@ typedef struct {
    sph_shavite512_context  shavite;
    hashState_echo          echo;
 #endif
-} c11_8way_ctx_holder;
+    simd_4way_context       simd;
+} __attribute__ ((aligned (64)));
+typedef union _c11_8way_context_overlay c11_8way_context_overlay;

-c11_8way_ctx_holder c11_8way_ctx;
+static __thread __m512i c11_8way_midstate[16] __attribute__((aligned(64)));
+static __thread blake512_8way_context blake512_8way_ctx __attribute__((aligned(64)));

-void init_c11_8way_ctx()
-{
-     blake512_8way_init( &c11_8way_ctx.blake );
-     bmw512_8way_init( &c11_8way_ctx.bmw );
-     skein512_8way_init( &c11_8way_ctx.skein );
-     jh512_8way_init( &c11_8way_ctx.jh );
-     keccak512_8way_init( &c11_8way_ctx.keccak );
-     luffa_4way_init( &c11_8way_ctx.luffa, 512 );
-     cube_4way_init( &c11_8way_ctx.cube, 512, 16, 32 );
-     simd_4way_init( &c11_8way_ctx.simd, 512 );
-#if defined(__VAES__)
-     groestl512_4way_init( &c11_8way_ctx.groestl, 64 );
-     shavite512_4way_init( &c11_8way_ctx.shavite );
-     echo_4way_init( &c11_8way_ctx.echo, 512 );
-#else
-     init_groestl( &c11_8way_ctx.groestl, 64 );
-     sph_shavite512_init( &c11_8way_ctx.shavite );
-     init_echo( &c11_8way_ctx.echo, 512 );
-#endif
-}
-
-void c11_8way_hash( void *state, const void *input )
+int c11_8way_hash( void *state, const void *input, int thr_id )
 {
     uint64_t vhash[8*8] __attribute__ ((aligned (128)));
     uint64_t vhashA[4*8] __attribute__ ((aligned (64)));     
@@ -78,24 +61,19 @@ void c11_8way_hash( void *state, const void *input )
     uint64_t hash5[8] __attribute__ ((aligned (64)));
     uint64_t hash6[8] __attribute__ ((aligned (64)));
     uint64_t hash7[8] __attribute__ ((aligned (64)));
-     c11_8way_ctx_holder ctx;
-     memcpy( &ctx, &c11_8way_ctx, sizeof(c11_8way_ctx) );
+     c11_8way_context_overlay ctx;

-     // 1 Blake 4way
-     blake512_8way_update( &ctx.blake, input, 80 );
-     blake512_8way_close( &ctx.blake, vhash );
-
-     // 2 Bmw
-     bmw512_8way_update( &ctx.bmw, vhash, 64 );
-     bmw512_8way_close( &ctx.bmw, vhash );
+     blake512_8way_final_le( &blake512_8way_ctx, vhash, casti_m512i( input, 9 ),
+                             c11_8way_midstate );

+     bmw512_8way_full( &ctx.bmw, vhash, vhash, 64 );
+     
 #if defined(__VAES__)

     rintrlv_8x64_4x128( vhashA, vhashB, vhash, 512 );

-     groestl512_4way_update_close( &ctx.groestl, vhashA, vhashA, 512 );
-     groestl512_4way_init( &ctx.groestl, 64 );
-     groestl512_4way_update_close( &ctx.groestl, vhashB, vhashB, 512 );
+     groestl512_4way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_4way_full( &ctx.groestl, vhashB, vhashB, 64 );

     rintrlv_4x128_8x64( vhash, vhashA, vhashB, 512 );

@@ -104,21 +82,14 @@ void c11_8way_hash( void *state, const void *input )
     dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                   vhash );

-     update_and_final_groestl( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
-     memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
-     memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
-     memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
-     memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash4, (char*)hash4, 512 );
-     memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash5, (char*)hash5, 512 );
-     memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash6, (char*)hash6, 512 );
-     memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash7, (char*)hash7, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash4, (char*)hash4, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash5, (char*)hash5, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash6, (char*)hash6, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash7, (char*)hash7, 512 );

     intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
                  hash7 );
@@ -126,83 +97,56 @@ void c11_8way_hash( void *state, const void *input )
 #endif

     // 4 JH
+     jh512_8way_init( &ctx.jh );
     jh512_8way_update( &ctx.jh, vhash, 64 );
     jh512_8way_close( &ctx.jh, vhash );

     // 5 Keccak
+     keccak512_8way_init( &ctx.keccak );
     keccak512_8way_update( &ctx.keccak, vhash, 64 );
     keccak512_8way_close( &ctx.keccak, vhash );

     // 6 Skein
-     skein512_8way_update( &ctx.skein, vhash, 64 );
-     skein512_8way_close( &ctx.skein, vhash );
+     skein512_8way_full( &ctx.skein, vhash, vhash, 64 );

     rintrlv_8x64_4x128( vhashA, vhashB, vhash, 512 );

-     luffa_4way_update_close( &ctx.luffa, vhashA, vhashA, 64 );
-     luffa_4way_init( &ctx.luffa, 512 );
-     luffa_4way_update_close( &ctx.luffa, vhashB, vhashB, 64 );
-
-     cube_4way_update_close( &ctx.cube, vhashA, vhashA, 64 );
-     cube_4way_init( &ctx.cube, 512, 16, 32 );
-     cube_4way_update_close( &ctx.cube, vhashB, vhashB, 64 );
+     luffa512_4way_full( &ctx.luffa, vhashA, vhashA, 64 );
+     luffa512_4way_full( &ctx.luffa, vhashB, vhashB, 64 );

+     cube_4way_2buf_full( &ctx.cube, vhashA, vhashB, 512, vhashA, vhashB, 64 );
+     
 #if defined(__VAES__)

-     shavite512_4way_update_close( &ctx.shavite, vhashA, vhashA, 64 );
-     shavite512_4way_init( &ctx.shavite );
-     shavite512_4way_update_close( &ctx.shavite, vhashB, vhashB, 64 );
+     shavite512_4way_full( &ctx.shavite, vhashA, vhashA, 64 );
+     shavite512_4way_full( &ctx.shavite, vhashB, vhashB, 64 );

 #else
     
     dintrlv_4x128_512( hash0, hash1, hash2, hash3, vhashA );
     dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhashB );

-     sph_shavite512( &ctx.shavite, hash0, 64 );
-     sph_shavite512_close( &ctx.shavite, hash0 );
-     memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash1, 64 );
-     sph_shavite512_close( &ctx.shavite, hash1 );
-     memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash2, 64 );
-     sph_shavite512_close( &ctx.shavite, hash2 );
-     memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash3, 64 );
-     sph_shavite512_close( &ctx.shavite, hash3 );
-     memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash4, 64 );
-     sph_shavite512_close( &ctx.shavite, hash4 );
-     memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash5, 64 );
-     sph_shavite512_close( &ctx.shavite, hash5 );
-     memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash6, 64 );
-     sph_shavite512_close( &ctx.shavite, hash6 );
-     memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash7, 64 );
-     sph_shavite512_close( &ctx.shavite, hash7 );
-
+     shavite512_full( &ctx.shavite, hash0, hash0, 64 );
+     shavite512_full( &ctx.shavite, hash1, hash1, 64 );
+     shavite512_full( &ctx.shavite, hash2, hash2, 64 );
+     shavite512_full( &ctx.shavite, hash3, hash3, 64 );
+     shavite512_full( &ctx.shavite, hash4, hash4, 64 );
+     shavite512_full( &ctx.shavite, hash5, hash5, 64 );
+     shavite512_full( &ctx.shavite, hash6, hash6, 64 );
+     shavite512_full( &ctx.shavite, hash7, hash7, 64 );
+     
     intrlv_4x128_512( vhashA, hash0, hash1, hash2, hash3 );
     intrlv_4x128_512( vhashB, hash4, hash5, hash6, hash7 );

 #endif

-     simd_4way_update_close( &ctx.simd, vhashA, vhashA, 512 );
-     simd_4way_init( &ctx.simd, 512 );
-     simd_4way_update_close( &ctx.simd, vhashB, vhashB, 512 );
+     simd512_4way_full( &ctx.simd, vhashA, vhashA, 64 );
+     simd512_4way_full( &ctx.simd, vhashB, vhashB, 64 );

 #if defined(__VAES__)

-     echo_4way_update_close( &ctx.echo, vhashA, vhashA, 512 );
-     echo_4way_init( &ctx.echo, 512 );
-     echo_4way_update_close( &ctx.echo, vhashB, vhashB, 512 );
+     echo_4way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_4way_full( &ctx.echo, vhashB, 512, vhashB, 64 );

     dintrlv_4x128_512( hash0, hash1, hash2, hash3, vhashA );
     dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhashB );
@@ -212,29 +156,22 @@ void c11_8way_hash( void *state, const void *input )
     dintrlv_4x128_512( hash0, hash1, hash2, hash3, vhashA );
     dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhashB );
     
-     update_final_echo( &ctx.echo, (BitSequence *)hash0,
-                       (const BitSequence *) hash0, 512 );
-     memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash1,
-                       (const BitSequence *) hash1, 512 );
-     memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash2,
-                       (const BitSequence *) hash2, 512 );
-     memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash3,
-                       (const BitSequence *) hash3, 512 );
-     memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash4,
-                       (const BitSequence *) hash4, 512 );
-     memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash5,
-                       (const BitSequence *) hash5, 512 );
-     memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash6,
-                       (const BitSequence *) hash6, 512 );
-     memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash7,
-                       (const BitSequence *) hash7, 512 );
+     echo_full( &ctx.echo, (BitSequence *)hash0, 512,
+                     (const BitSequence *)hash0, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash1, 512,
+                     (const BitSequence *)hash1, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash2, 512,
+                     (const BitSequence *)hash2, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash3, 512,
+                     (const BitSequence *)hash3, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash4, 512,
+                     (const BitSequence *)hash4, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash5, 512,
+                     (const BitSequence *)hash5, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash6, 512,
+                     (const BitSequence *)hash6, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash7, 512,
+                     (const BitSequence *)hash7, 64 );

 #endif

@@ -246,225 +183,223 @@ void c11_8way_hash( void *state, const void *input )
     memcpy( state+160, hash5, 32 );
     memcpy( state+192, hash6, 32 );
     memcpy( state+224, hash7, 32 );
+
+     return 1;
 }

 int scanhash_c11_8way( struct work *work, uint32_t max_nonce,
                   uint64_t *hashes_done, struct thr_info *mythr )
 {
-     uint32_t hash[8*8] __attribute__ ((aligned (128)));
-     uint32_t vdata[24*8] __attribute__ ((aligned (64)));
-     uint32_t *pdata = work->data;
-     uint32_t *ptarget = work->target;
-     uint32_t n = pdata[19];
-     const uint32_t first_nonce = pdata[19];
-     int thr_id = mythr->id;   
-     __m512i  *noncev = (__m512i*)vdata + 9;   // aligned
-     const uint32_t Htarg = ptarget[7];
+   uint32_t hash[8*8] __attribute__ ((aligned (128)));
+   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   __m128i edata[5] __attribute__ ((aligned (64)));
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   __m512i  *noncev = (__m512i*)vdata + 9;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const uint32_t targ32_d7 = ptarget[7];
+   const __m512i eight = m512_const1_64( 8 );
+   const bool bench = opt_benchmark;

-     max_nonce -= 8;
+   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
+   edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
+   edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
+   edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );
+   edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );

-     mm512_bswap32_intrlv80_8x64( vdata, pdata );
+   mm512_intrlv80_8x64( vdata, edata );
+   *noncev = _mm512_add_epi32( *noncev, _mm512_set_epi32(
+                            0, 7, 0, 6, 0, 5, 0, 4, 0, 3, 0, 2, 0, 1, 0, 0 ) );
+   blake512_8way_prehash_le( &blake512_8way_ctx, c11_8way_midstate, vdata );

-     do
-     {
-        *noncev = mm512_intrlv_blend_32( mm512_bswap_32(
-        _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
-                          n+3, 0, n+2, 0, n+1, 0, n,   0 ) ), *noncev );
-
-        c11_8way_hash( hash, vdata );
-        pdata[19] = n;
-
-        for ( int i = 0; i < 8; i++ )
-        if ( ( ( hash+(i<<3) )[7] <= Htarg )
-             && fulltest( hash+(i<<3), ptarget ) && !opt_benchmark )
-        {
-           pdata[19] = n+i;
-           submit_solution( work, hash+(i<<3), mythr );
-        }
-        n += 8;
-     } while ( ( n < max_nonce ) && !work_restart[thr_id].restart );
-     *hashes_done = n - first_nonce;
-     return 0;
+   do
+   {
+      if ( likely( c11_8way_hash( hash, vdata, thr_id ) ) )
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( ( ( hash + ( lane << 3 ) )[7] <= targ32_d7 )
+           && valid_hash( hash +( lane << 3 ), ptarget ) && !bench )
+      {
+         pdata[19] = n + lane;
+         submit_solution( work, hash + ( lane << 3 ), mythr );
+      }
+      *noncev = _mm512_add_epi32( *noncev, eight );
+      n += 8;
+   } while ( ( n < last_nonce ) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
 }
     
 #elif defined (C11_4WAY)

-typedef struct {
+union _c11_4way_context_overlay
+{
    blake512_4way_context   blake;
    bmw512_4way_context     bmw;
+#if defined(__VAES__)
+    groestl512_2way_context groestl;
+    echo512_2way_context    echo;
+#else
    hashState_groestl       groestl;
-    skein512_4way_context   skein;
-    jh512_4way_context      jh;    
-    keccak512_4way_context  keccak;    
-    luffa_2way_context      luffa;
-    cubehashParam           cube;
-    sph_shavite512_context  shavite;
-    simd_2way_context       simd;
    hashState_echo          echo;
-} c11_4way_ctx_holder;
+#endif
+    skein512_4way_context   skein;
+    jh512_4way_context      jh;
+    keccak512_4way_context  keccak;
+    luffa_2way_context      luffa;
+    cube_2way_context       cube;
+    shavite512_2way_context shavite;
+    simd_2way_context       simd;
+};
+typedef union _c11_4way_context_overlay c11_4way_context_overlay;

-c11_4way_ctx_holder c11_4way_ctx;
+static __thread __m256i c11_4way_midstate[16] __attribute__((aligned(64)));
+static __thread blake512_4way_context blake512_4way_ctx __attribute__((aligned(64)));

-void init_c11_4way_ctx()
-{
-     blake512_4way_init( &c11_4way_ctx.blake );
-     bmw512_4way_init( &c11_4way_ctx.bmw );
-     init_groestl( &c11_4way_ctx.groestl, 64 );
-     skein512_4way_init( &c11_4way_ctx.skein );
-     jh512_4way_init( &c11_4way_ctx.jh );
-     keccak512_4way_init( &c11_4way_ctx.keccak );
-     luffa_2way_init( &c11_4way_ctx.luffa, 512 );
-     cubehashInit( &c11_4way_ctx.cube, 512, 16, 32 );
-     sph_shavite512_init( &c11_4way_ctx.shavite );
-     simd_2way_init( &c11_4way_ctx.simd, 512 );
-     init_echo( &c11_4way_ctx.echo, 512 );
-}
-
-void c11_4way_hash( void *state, const void *input )
+int c11_4way_hash( void *state, const void *input, int thr_id )
 {
     uint64_t hash0[8] __attribute__ ((aligned (64)));
     uint64_t hash1[8] __attribute__ ((aligned (64)));
     uint64_t hash2[8] __attribute__ ((aligned (64)));
     uint64_t hash3[8] __attribute__ ((aligned (64)));
     uint64_t vhash[8*4] __attribute__ ((aligned (64)));
+     uint64_t vhashA[8*2] __attribute__ ((aligned (64)));
     uint64_t vhashB[8*2] __attribute__ ((aligned (64)));
-     c11_4way_ctx_holder ctx;
-     memcpy( &ctx, &c11_4way_ctx, sizeof(c11_4way_ctx) );
+     c11_4way_context_overlay ctx;

-     // 1 Blake 4way
-     blake512_4way_update( &ctx.blake, input, 80 );
-     blake512_4way_close( &ctx.blake, vhash );
+     blake512_4way_final_le( &blake512_4way_ctx, vhash, casti_m256i( input, 9 ),
+                             c11_4way_midstate );

-     // 2 Bmw
+     bmw512_4way_init( &ctx.bmw );
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
     bmw512_4way_close( &ctx.bmw, vhash );
+     
+#if defined(__VAES__)

-     // Serial
-     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );

-     // 3 Groestl
-     update_and_final_groestl( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
-     memcpy( &ctx.groestl, &c11_4way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
-     memcpy( &ctx.groestl, &c11_4way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
-     memcpy( &ctx.groestl, &c11_4way_ctx.groestl, sizeof(hashState_groestl) );
-     update_and_final_groestl( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );

-     // 4way
-     intrlv_4x64( vhash, hash0, hash1, hash2, hash3, 512 );
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );

-     // 4 JH
+#else
+
+     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
+
+     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
+
+     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );
+
+#endif
+     
+     jh512_4way_init( &ctx.jh );
     jh512_4way_update( &ctx.jh, vhash, 64 );
     jh512_4way_close( &ctx.jh, vhash );

-     // 5 Keccak
+     keccak512_4way_init( &ctx.keccak );
     keccak512_4way_update( &ctx.keccak, vhash, 64 );
     keccak512_4way_close( &ctx.keccak, vhash );

-     // 6 Skein
-     skein512_4way_update( &ctx.skein, vhash, 64 );
-     skein512_4way_close( &ctx.skein, vhash );
+     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

-     // Serial
-     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );

-     // 7 Luffa
-     intrlv_2x128( vhash, hash0, hash1, 512 );
-     intrlv_2x128( vhashB, hash2, hash3, 512 );
-     luffa_2way_update_close( &ctx.luffa, vhash, vhash, 64 );
-     luffa_2way_init( &ctx.luffa, 512 );
-     luffa_2way_update_close( &ctx.luffa, vhashB, vhashB, 64 );
-     dintrlv_2x128( hash0, hash1, vhash, 512 );
-     dintrlv_2x128( hash2, hash3, vhashB, 512 );
+     luffa512_2way_full( &ctx.luffa, vhashA, vhashA, 64 );
+     luffa512_2way_full( &ctx.luffa, vhashB, vhashB, 64 );

-     // 8 Cubehash
-     cubehashUpdateDigest( &ctx.cube, (byte*)hash0, (const byte*) hash0, 64 );
-     memcpy( &ctx.cube, &c11_4way_ctx.cube, sizeof(cubehashParam) );
-     cubehashUpdateDigest( &ctx.cube, (byte*)hash1, (const byte*) hash1, 64 );
-     memcpy( &ctx.cube, &c11_4way_ctx.cube, sizeof(cubehashParam) );
-     cubehashUpdateDigest( &ctx.cube, (byte*)hash2, (const byte*) hash2, 64 );
-     memcpy( &ctx.cube, &c11_4way_ctx.cube, sizeof(cubehashParam) );
-     cubehashUpdateDigest( &ctx.cube, (byte*)hash3, (const byte*) hash3, 64 );
+     cube_2way_full( &ctx.cube, vhashA, 512, vhashA, 64 );
+     cube_2way_full( &ctx.cube, vhashB, 512, vhashB, 64 );

-     // 9 Shavite
-     sph_shavite512( &ctx.shavite, hash0, 64 );
-     sph_shavite512_close( &ctx.shavite, hash0 );
-     memcpy( &ctx.shavite, &c11_4way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash1, 64 );
-     sph_shavite512_close( &ctx.shavite, hash1 );
-     memcpy( &ctx.shavite, &c11_4way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash2, 64 );
-     sph_shavite512_close( &ctx.shavite, hash2 );
-     memcpy( &ctx.shavite, &c11_4way_ctx.shavite,
-             sizeof(sph_shavite512_context) );
-     sph_shavite512( &ctx.shavite, hash3, 64 );
-     sph_shavite512_close( &ctx.shavite, hash3 );
+     shavite512_2way_full( &ctx.shavite, vhashA, vhashA, 64 );
+     shavite512_2way_full( &ctx.shavite, vhashB, vhashB, 64 );

-     // 10 Simd
-     intrlv_2x128( vhash, hash0, hash1, 512 );
-     intrlv_2x128( vhashB, hash2, hash3, 512 );
-     simd_2way_update_close( &ctx.simd, vhash, vhash, 512 );
-     simd_2way_init( &ctx.simd, 512 );
-     simd_2way_update_close( &ctx.simd, vhashB, vhashB, 512 );
-     dintrlv_2x128( hash0, hash1, vhash, 512 );
-     dintrlv_2x128( hash2, hash3, vhashB, 512 );
+     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
+     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

-     // 11 Echo
-     update_final_echo( &ctx.echo, (BitSequence *)hash0,
-                       (const BitSequence *) hash0, 512 );
-     memcpy( &ctx.echo, &c11_4way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash1,
-                       (const BitSequence *) hash1, 512 );
-     memcpy( &ctx.echo, &c11_4way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash2,
-                       (const BitSequence *) hash2, 512 );
-     memcpy( &ctx.echo, &c11_4way_ctx.echo, sizeof(hashState_echo) );
-     update_final_echo( &ctx.echo, (BitSequence *)hash3,
-                       (const BitSequence *) hash3, 512 );
+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     dintrlv_2x128_512( hash0, hash1, vhashA );
+     dintrlv_2x128_512( hash2, hash3, vhashB );
+
+#else
+
+     dintrlv_2x128_512( hash0, hash1, vhashA );
+     dintrlv_2x128_512( hash2, hash3, vhashB );
+
+     echo_full( &ctx.echo, (BitSequence *)hash0, 512,
+                     (const BitSequence *)hash0, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash1, 512,
+                     (const BitSequence *)hash1, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash2, 512,
+                     (const BitSequence *)hash2, 64 );
+     echo_full( &ctx.echo, (BitSequence *)hash3, 512,
+                     (const BitSequence *)hash3, 64 );
+
+#endif

     memcpy( state,    hash0, 32 );
     memcpy( state+32, hash1, 32 );
     memcpy( state+64, hash2, 32 );
     memcpy( state+96, hash3, 32 );
+
+     return 1;
 }

 int scanhash_c11_4way( struct work *work, uint32_t max_nonce,
                   uint64_t *hashes_done, struct thr_info *mythr )
 {
-     uint32_t hash[4*8] __attribute__ ((aligned (64)));
-     uint32_t vdata[24*4] __attribute__ ((aligned (64)));
-     uint32_t *pdata = work->data;
-     uint32_t *ptarget = work->target;
-     uint32_t n = pdata[19];
-     const uint32_t first_nonce = pdata[19];
-     int thr_id = mythr->id;  // thr_id arg is deprecated
-     __m256i  *noncev = (__m256i*)vdata + 9;   // aligned
-     const uint32_t Htarg = ptarget[7];
+   uint32_t hash[8*4] __attribute__ ((aligned (128)));
+   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
+   __m128i edata[5] __attribute__ ((aligned (32)));
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   __m256i  *noncev = (__m256i*)vdata + 9;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const uint32_t targ32_d7 = ptarget[7];
+   const __m256i four = m256_const1_64( 4 );
+   const bool bench = opt_benchmark;

-     mm256_bswap32_intrlv80_4x64( vdata, pdata );
+   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
+   edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
+   edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
+   edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );
+   edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );

-     do
-     {
-        *noncev = mm256_intrlv_blend_32( mm256_bswap_32(
-             _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ) ), *noncev );
+   mm256_intrlv80_4x64( vdata, edata );

-        c11_4way_hash( hash, vdata );
-        pdata[19] = n;
+   *noncev = _mm256_add_epi32( *noncev, _mm256_set_epi32(
+                                           0, 3, 0, 2, 0, 1, 0, 0 ) );
+   blake512_4way_prehash_le( &blake512_4way_ctx, c11_4way_midstate, vdata );

-        for ( int i = 0; i < 4; i++ )
-        if ( ( ( hash+(i<<3) )[7] <= Htarg )
-            && fulltest( hash+(i<<3), ptarget ) && !opt_benchmark )
-        {
-           pdata[19] = n+i;
-           submit_solution( work, hash+(i<<3), mythr );
-        }
-        n += 4;
-     } while ( ( n < max_nonce ) && !work_restart[thr_id].restart );
-     *hashes_done = n - first_nonce + 1;
-     return 0;
+   do
+   {
+      if ( likely( c11_4way_hash( hash, vdata, thr_id ) ) )
+      for ( int lane = 0; lane < 4; lane++ )
+      if ( ( ( hash + ( lane << 3 ) )[7] <= targ32_d7 )
+           && valid_hash( hash +( lane << 3 ), ptarget ) && !bench )
+      {
+         pdata[19] = n + lane;
+         submit_solution( work, hash + ( lane << 3 ), mythr );
+      }
+      *noncev = _mm256_add_epi32( *noncev, four );
+      n += 4;
+   } while ( ( n < last_nonce ) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
 }

 #endif
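Review note on the scanhash changes in c11-4way.c above: the new loops seed the nonce lanes once (first_nonce plus offsets 0..3 in the 4-way path, 0..7 in the 8-way path) and then add a constant to every lane per iteration, rather than rebuilding and byte-swapping the nonce vector on each pass. The following is a minimal scalar sketch of that pattern, assuming plain `uint32_t` lanes in place of the `__m256i` / `__m512i` vectors. `LANES`, `nonce_lanes_init`, `nonce_lanes_step` and `scan_count` are hypothetical names used only for illustration; they are not part of cpuminer-opt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4   /* illustrative lane count; the 8-way path would use 8 */

/* Hypothetical helper: seed lane i with first_nonce + i, once before the loop. */
static void nonce_lanes_init( uint32_t lanes[LANES], uint32_t first_nonce )
{
   for ( size_t i = 0; i < LANES; i++ )
      lanes[i] = first_nonce + (uint32_t)i;
}

/* Hypothetical helper: advance every lane by LANES, once per iteration. */
static void nonce_lanes_step( uint32_t lanes[LANES] )
{
   for ( size_t i = 0; i < LANES; i++ )
      lanes[i] += LANES;
}

/* Mirrors the loop shape in the diff: the scalar counter n tracks lane 0.
   Returns the number of nonces covered (the value stored in *hashes_done). */
static uint32_t scan_count( uint32_t first_nonce, uint32_t last_nonce )
{
   uint32_t lanes[LANES];
   uint32_t n = first_nonce;

   nonce_lanes_init( lanes, first_nonce );
   do
   {
      assert( lanes[0] == n );   /* lane 0 always equals the scalar counter */
      /* ... hash all lanes and test each result against the target here ... */
      nonce_lanes_step( lanes );
      n += LANES;
   } while ( n < last_nonce );
   return n - first_nonce;
}
```

Because the loop is a do/while, at least one batch of `LANES` nonces is always hashed, and the count returned is a multiple of `LANES`.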
--- a/algo/x11/c11-gate.c
+++ b/algo/x11/c11-gate.c
@@ -3,11 +3,9 @@
 bool register_c11_algo( algo_gate_t* gate )
 {
 #if defined (C11_8WAY)
-  init_c11_8way_ctx();
  gate->scanhash  = (void*)&scanhash_c11_8way;
  gate->hash      = (void*)&c11_8way_hash;
 #elif defined (C11_4WAY)
-  init_c11_4way_ctx();
  gate->scanhash  = (void*)&scanhash_c11_4way;
  gate->hash      = (void*)&c11_4way_hash;
 #else
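Review note on dropping the `init_c11_8way_ctx()` and `init_c11_4way_ctx()` calls from registration: as visible in `c11_4way_hash` above, the hash function now declares a local context overlay union and each stage either calls a `*_full` function or an explicit `*_init`, so no pre-initialized global template has to be prepared when the algo is registered. The sketch below illustrates that idea with two toy stages. `stage_a_ctx`, `stage_b_ctx`, `ctx_overlay`, `stage_a_full`, `stage_b_full` and `chain_hash` are hypothetical stand-ins and do not correspond to real cpuminer-opt functions or types.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-stage contexts of different sizes. */
typedef struct { uint32_t state[4];  } stage_a_ctx;
typedef struct { uint32_t state[16]; } stage_b_ctx;

/* Stages run strictly one after another, so their contexts can share storage. */
typedef union
{
   stage_a_ctx a;
   stage_b_ctx b;
} ctx_overlay;

/* The overlay is only as large as its largest member. */
static_assert( sizeof(ctx_overlay) == sizeof(stage_b_ctx),
               "overlay should not be larger than the biggest context" );

/* Toy "full" function: init, update and close in a single call. */
static void stage_a_full( stage_a_ctx *ctx, uint32_t *out, const uint32_t *in )
{
   memset( ctx, 0, sizeof *ctx );                 /* init on every call */
   for ( int i = 0; i < 4; i++ )
      ctx->state[i] = in[i] + 1;
   memcpy( out, ctx->state, 4 * sizeof(uint32_t) );
}

/* Same shape; reads all input before writing, so out may alias in. */
static void stage_b_full( stage_b_ctx *ctx, uint32_t *out, const uint32_t *in )
{
   memset( ctx, 0, sizeof *ctx );
   for ( int i = 0; i < 4; i++ )
      ctx->state[i] = in[i] * 2;
   memcpy( out, ctx->state, 4 * sizeof(uint32_t) );
}

static void chain_hash( uint32_t out[4], const uint32_t in[4] )
{
   ctx_overlay ctx;   /* left uninitialized; each stage initializes its own view */
   uint32_t buf[4];

   stage_a_full( &ctx.a, buf, in );
   stage_b_full( &ctx.b, buf, buf );
   memcpy( out, buf, sizeof buf );
}
```

The trade-off in this sketch is a small per-call init cost in exchange for a smaller stack footprint and no shared template that has to be copied before every hash.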
--- a/algo/x11/c11-gate.h
+++ b/algo/x11/c11-gate.h
@@ -14,14 +14,14 @@
 bool register_c11_algo( algo_gate_t* gate );
 #if defined(C11_8WAY)

-void c11_8way_hash( void *state, const void *input );
+int c11_8way_hash( void *state, const void *input, int thr_id );
 int scanhash_c11_8way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
-void init_c11_8way_ctx();
+//void init_c11_8way_ctx();

 #elif defined(C11_4WAY)

-void c11_4way_hash( void *state, const void *input );
+int c11_4way_hash( void *state, const void *input, int thr_id );
 int scanhash_c11_4way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
 void init_c11_4way_ctx();
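Review note on the prototype change from `void` to `int` with an added `thr_id` argument: it lets the scan loop guard the per-lane target check on the result of the hash call, as `scanhash_c11_4way` and `scanhash_c11_8way` do with `likely( ... )`. In this diff the c11 hash functions always return 1. The following is a minimal sketch of the calling convention, assuming nonzero means "the output is usable". `toy_hash` and `toy_scan` are hypothetical names, and the abort condition is invented purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hash: returns 1 when *out holds a usable result, 0 when abandoned. */
static int toy_hash( uint32_t *out, uint32_t nonce, int thr_id )
{
   (void)thr_id;                      /* unused in this toy version */
   if ( nonce % 5 == 0 )              /* invented abort condition */
      return 0;
   *out = nonce ^ 0x5a5a5a5au;
   return 1;
}

/* Counts nonces in [first, last) whose hash completed and met the target.
   The target check is skipped entirely when the hash call returns 0. */
static int toy_scan( uint32_t first, uint32_t last, uint32_t target, int thr_id )
{
   int found = 0;

   assert( first <= last );
   for ( uint32_t n = first; n < last; n++ )
   {
      uint32_t h;
      if ( toy_hash( &h, n, thr_id ) )
         if ( h <= target )
            found++;
   }
   return found;
}
```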
--- a/algo/x16/x16r-4way.c
+++ b/algo/x16/x16r-4way.c
@@ -16,7 +16,8 @@

 #if defined (X16R_8WAY)

-// Perform midstate prehash of hash functions with block size <= 72 bytes.
+// Perform midstate prehash of hash functions with block size <= 72 bytes,
+// 76 bytes for hash functions that operate on 32 bit data.

 void x16r_8way_prehash( void *vdata, void *pdata )
 {
@@ -44,23 +45,48 @@ void x16r_8way_prehash( void *vdata, void *pdata )
         skein512_8way_update( &x16r_ctx.skein, vdata, 64 );
      break;
      case LUFFA:
+      {
+         hashState_luffa ctx_luffa;
         mm128_bswap32_80( edata, pdata );
-         intrlv_4x128( vdata2, edata, edata, edata, edata, 640 );
-         luffa_4way_init( &x16r_ctx.luffa, 512 );
-         luffa_4way_update( &x16r_ctx.luffa, vdata2, 64 );
-         rintrlv_4x128_8x64( vdata, vdata2, vdata2, 640 );
+         intrlv_8x64( vdata, edata, edata, edata, edata,
+                             edata, edata, edata, edata, 640 );            
+         init_luffa( &ctx_luffa, 512 );
+         update_luffa( &ctx_luffa, (const BitSequence*)edata, 64 );
+         intrlv_4x128( x16r_ctx.luffa.buffer, ctx_luffa.buffer,
+                  ctx_luffa.buffer, ctx_luffa.buffer, ctx_luffa.buffer, 512 );
+         intrlv_4x128( x16r_ctx.luffa.chainv, ctx_luffa.chainv,
+                  ctx_luffa.chainv, ctx_luffa.chainv, ctx_luffa.chainv, 1280 );
+         x16r_ctx.luffa.hashbitlen = ctx_luffa.hashbitlen;
+         x16r_ctx.luffa.rembytes = ctx_luffa.rembytes;
+      }
      break;
      case CUBEHASH:
+      {
+         cubehashParam ctx_cube;
         mm128_bswap32_80( edata, pdata );
-         intrlv_4x128( vdata2, edata, edata, edata, edata, 640 );
-         cube_4way_init( &x16r_ctx.cube, 512, 16, 32 );
-         cube_4way_update( &x16r_ctx.cube, vdata2, 64 );
-         rintrlv_4x128_8x64( vdata, vdata2, vdata2, 640 );
+         intrlv_8x64( vdata, edata, edata, edata, edata,
+                             edata, edata, edata, edata, 640 );            
+         cubehashInit( &ctx_cube, 512, 16, 32 );
+         cubehashUpdate( &ctx_cube, (const byte*)edata, 64 );
+         x16r_ctx.cube.hashlen = ctx_cube.hashlen;
+         x16r_ctx.cube.rounds = ctx_cube.rounds;
+         x16r_ctx.cube.blocksize = ctx_cube.blocksize;
+         x16r_ctx.cube.pos = ctx_cube.pos;
+         intrlv_4x128( x16r_ctx.cube.h, ctx_cube.x, ctx_cube.x, ctx_cube.x,
+                                        ctx_cube.x, 1024 );
+      }
      break;
      case HAMSI:
         mm512_bswap32_intrlv80_8x64( vdata, pdata );
         hamsi512_8way_init( &x16r_ctx.hamsi );
-         hamsi512_8way_update( &x16r_ctx.hamsi, vdata, 64 );
+         hamsi512_8way_update( &x16r_ctx.hamsi, vdata, 72 );
+      break;
+      case FUGUE:
+         mm128_bswap32_80( edata, pdata );
+         fugue512_init( &x16r_ctx.fugue );
+         fugue512_update( &x16r_ctx.fugue, edata, 76 );
+         intrlv_8x64( vdata, edata, edata, edata, edata,
+                             edata, edata, edata, edata, 640 );
      break;
      case SHABAL:
         mm256_bswap32_intrlv80_8x32( vdata2, pdata );
@@ -87,14 +113,14 @@ void x16r_8way_prehash( void *vdata, void *pdata )
 int x16r_8way_hash_generic( void* output, const void* input, int thrid )
 {
   uint32_t vhash[20*8] __attribute__ ((aligned (128)));
-   uint32_t hash0[20] __attribute__ ((aligned (64)));
-   uint32_t hash1[20] __attribute__ ((aligned (64)));
-   uint32_t hash2[20] __attribute__ ((aligned (64)));
-   uint32_t hash3[20] __attribute__ ((aligned (64)));
-   uint32_t hash4[20] __attribute__ ((aligned (64)));
-   uint32_t hash5[20] __attribute__ ((aligned (64)));
-   uint32_t hash6[20] __attribute__ ((aligned (64)));
-   uint32_t hash7[20] __attribute__ ((aligned (64)));
+   uint32_t hash0[20] __attribute__ ((aligned (16)));
+   uint32_t hash1[20] __attribute__ ((aligned (16)));
+   uint32_t hash2[20] __attribute__ ((aligned (16)));
+   uint32_t hash3[20] __attribute__ ((aligned (16)));
+   uint32_t hash4[20] __attribute__ ((aligned (16)));
+   uint32_t hash5[20] __attribute__ ((aligned (16)));
+   uint32_t hash6[20] __attribute__ ((aligned (16)));
+   uint32_t hash7[20] __attribute__ ((aligned (16)));
   x16r_8way_context_overlay ctx;
   memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
   void *in0 = (void*) hash0;
@@ -137,7 +163,7 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
            {
               intrlv_8x64( vhash, in0, in1, in2, in3, in4, in5, in6, in7,
                            size<<3 );
-            bmw512_8way_update( &ctx.bmw, vhash, size );
+               bmw512_8way_update( &ctx.bmw, vhash, size );
            }
            bmw512_8way_close( &ctx.bmw, vhash );
            dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6,
@@ -306,7 +332,7 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
         break;
         case HAMSI:
            if ( i == 0 )
-               hamsi512_8way_update( &ctx.hamsi, input + (64<<3), 16 );
+               hamsi512_8way_update( &ctx.hamsi, input + (72<<3), 8 );
            else
            {
               intrlv_8x64( vhash, in0, in1, in2, in3, in4, in5, in6, in7,
@@ -319,14 +345,43 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
                          hash7, vhash );
         break;
         case FUGUE:
-             fugue512_full( &ctx.fugue, hash0, in0, size );
-             fugue512_full( &ctx.fugue, hash1, in1, size );
-             fugue512_full( &ctx.fugue, hash2, in2, size );
-             fugue512_full( &ctx.fugue, hash3, in3, size );
-             fugue512_full( &ctx.fugue, hash4, in4, size );
-             fugue512_full( &ctx.fugue, hash5, in5, size );
-             fugue512_full( &ctx.fugue, hash6, in6, size );
-             fugue512_full( &ctx.fugue, hash7, in7, size );
+            if ( i == 0 )
+            {
+               fugue512_update( &ctx.fugue, in0 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash0 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in1 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash1 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in2 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash2 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in3 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash3 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in4 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash4 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in5 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash5 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in6 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash6 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in7 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash7 );
+            }
+            else
+            {
+               fugue512_full( &ctx.fugue, hash0, in0, size );
+               fugue512_full( &ctx.fugue, hash1, in1, size );
+               fugue512_full( &ctx.fugue, hash2, in2, size );
+               fugue512_full( &ctx.fugue, hash3, in3, size );
+               fugue512_full( &ctx.fugue, hash4, in4, size );
+               fugue512_full( &ctx.fugue, hash5, in5, size );
+               fugue512_full( &ctx.fugue, hash6, in6, size );
+               fugue512_full( &ctx.fugue, hash7, in7, size );
+            }
         break;
         case SHABAL:
             intrlv_8x32( vhash, in0, in1, in2, in3, in4, in5, in6, in7,
@@ -347,25 +402,25 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
            {
               sph_whirlpool( &ctx.whirlpool, in0 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash0 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in1 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash1 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in2 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash2 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in3 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash3 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in4 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash4 ); 
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in5 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash5 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in6 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash6 );
-               memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
+               memcpy( &ctx, &x16r_ctx, sizeof(sph_whirlpool_context) );
               sph_whirlpool( &ctx.whirlpool, in7 + 64, 16 );
               sph_whirlpool_close( &ctx.whirlpool, hash7 );
            }
@@ -440,7 +495,7 @@ int scanhash_x16r_8way( struct work *work, uint32_t max_nonce,
 {
   uint32_t hash[16*8] __attribute__ ((aligned (128)));
   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
-   uint32_t bedata1[2] __attribute__((aligned(64)));
+   uint32_t bedata1[2];
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -464,7 +519,7 @@ int scanhash_x16r_8way( struct work *work, uint32_t max_nonce,
      s_ntime = ntime;

      if ( opt_debug && !thr_id )
-          applog( LOG_INFO, "hash order %s (%08x)", x16r_hash_order, ntime );
+          applog( LOG_INFO, "Hash order %s Ntime %08x", x16r_hash_order, ntime );
   }

   x16r_8way_prehash( vdata, pdata );
@@ -516,23 +571,44 @@ void x16r_4way_prehash( void *vdata, void *pdata )
         skein512_4way_prehash64( &x16r_ctx.skein, vdata );
      break;
      case LUFFA:
+      {
+         hashState_luffa ctx_luffa;
         mm128_bswap32_80( edata, pdata );
-         intrlv_2x128( vdata2, edata, edata, 640 );
-         luffa_2way_init( &x16r_ctx.luffa, 512 );
-         luffa_2way_update( &x16r_ctx.luffa, vdata2, 64 );
-         rintrlv_2x128_4x64( vdata, vdata2, vdata2, 640 );
-         break;
+         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
+         init_luffa( &ctx_luffa, 512 );
+         update_luffa( &ctx_luffa, (const BitSequence*)edata, 64 );
+         intrlv_2x128( x16r_ctx.luffa.buffer, ctx_luffa.buffer,
+                                              ctx_luffa.buffer, 512 );
+         intrlv_2x128( x16r_ctx.luffa.chainv, ctx_luffa.chainv,
+                                              ctx_luffa.chainv, 1280 );
+         x16r_ctx.luffa.hashbitlen = ctx_luffa.hashbitlen;
+         x16r_ctx.luffa.rembytes = ctx_luffa.rembytes;
+      }
+      break;
      case CUBEHASH:
+      {
+         cubehashParam ctx_cube;
         mm128_bswap32_80( edata, pdata );
-         intrlv_2x128( vdata2, edata, edata, 640 );
-         cube_2way_init( &x16r_ctx.cube, 512, 16, 32 );
-         cube_2way_update( &x16r_ctx.cube, vdata2, 64 );
-         rintrlv_2x128_4x64( vdata, vdata2, vdata2, 640 );
+         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
+         cubehashInit( &ctx_cube, 512, 16, 32 );
+         cubehashUpdate( &ctx_cube, (const byte*)edata, 64 );
+         x16r_ctx.cube.hashlen = ctx_cube.hashlen;
+         x16r_ctx.cube.rounds = ctx_cube.rounds;
+         x16r_ctx.cube.blocksize = ctx_cube.blocksize;
+         x16r_ctx.cube.pos = ctx_cube.pos;
+         intrlv_2x128( x16r_ctx.cube.h, ctx_cube.x, ctx_cube.x, 1024 );
+      }
      break;
      case HAMSI:
         mm256_bswap32_intrlv80_4x64( vdata, pdata );
         hamsi512_4way_init( &x16r_ctx.hamsi );
-         hamsi512_4way_update( &x16r_ctx.hamsi, vdata, 64 );
+         hamsi512_4way_update( &x16r_ctx.hamsi, vdata, 72 );
+      break;
+      case FUGUE:
+         mm128_bswap32_80( edata, pdata );
+         fugue512_init( &x16r_ctx.fugue );
+         fugue512_update( &x16r_ctx.fugue, edata, 76 );
+         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
      break;
      case SHABAL:
         mm128_bswap32_intrlv80_4x32( vdata2, pdata );
@@ -554,10 +630,10 @@ void x16r_4way_prehash( void *vdata, void *pdata )
 int x16r_4way_hash_generic( void* output, const void* input, int thrid )
 {
   uint32_t vhash[20*4] __attribute__ ((aligned (128)));
-   uint32_t hash0[20] __attribute__ ((aligned (64)));
-   uint32_t hash1[20] __attribute__ ((aligned (64)));
-   uint32_t hash2[20] __attribute__ ((aligned (64)));
-   uint32_t hash3[20] __attribute__ ((aligned (64)));
+   uint32_t hash0[20] __attribute__ ((aligned (32)));
+   uint32_t hash1[20] __attribute__ ((aligned (32)));
+   uint32_t hash2[20] __attribute__ ((aligned (32)));
+   uint32_t hash3[20] __attribute__ ((aligned (32)));
   x16r_4way_context_overlay ctx;
   memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
   void *in0 = (void*) hash0;
@@ -734,7 +810,7 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
   	    break;
         case HAMSI:
            if ( i == 0 )
-               hamsi512_4way_update( &ctx.hamsi, input + (64<<2), 16 );
+               hamsi512_4way_update( &ctx.hamsi, input + (72<<2), 8 );
            else
            {
               intrlv_4x64( vhash, in0, in1, in2, in3, size<<3 );
@@ -745,10 +821,27 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
            dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
         break;
         case FUGUE:
-             fugue512_full( &ctx.fugue, hash0, in0, size );
-             fugue512_full( &ctx.fugue, hash1, in1, size );
-             fugue512_full( &ctx.fugue, hash2, in2, size );
-             fugue512_full( &ctx.fugue, hash3, in3, size );
+            if ( i == 0 )
+            {
+               fugue512_update( &ctx.fugue, in0 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash0 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in1 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash1 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in2 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash2 );
+               memcpy( &ctx, &x16r_ctx, sizeof(hashState_fugue) );
+               fugue512_update( &ctx.fugue, in3 + 76, 4 );
+               fugue512_final( &ctx.fugue, hash3 );
+             }
+             else
+             {
+                fugue512_full( &ctx.fugue, hash0, in0, size );
+                fugue512_full( &ctx.fugue, hash1, in1, size );
+                fugue512_full( &ctx.fugue, hash2, in2, size );
+                fugue512_full( &ctx.fugue, hash3, in3, size );
+             }
         break;
         case SHABAL:
             intrlv_4x32( vhash, in0, in1, in2, in3, size<<3 );
@@ -831,7 +924,7 @@ int scanhash_x16r_4way( struct work *work, uint32_t max_nonce,
 {
   uint32_t hash[16*4] __attribute__ ((aligned (64)));
   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
-   uint32_t bedata1[2] __attribute__((aligned(64)));
+   uint32_t bedata1[2];
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -854,7 +947,7 @@ int scanhash_x16r_4way( struct work *work, uint32_t max_nonce,
      x16_r_s_getAlgoString( (const uint8_t*)bedata1, x16r_hash_order );
      s_ntime = ntime;
      if ( opt_debug && !thr_id )
-         applog( LOG_INFO, "hash order %s (%08x)", x16r_hash_order, ntime );
+         applog( LOG_INFO, "Hash order %s Ntime %08x", x16r_hash_order, ntime );
   }

   x16r_4way_prehash( vdata, pdata );
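Note on the partial context copies above: several hunks in this file replace memcpy( &ctx, &x16r_ctx, sizeof(ctx) ) with a copy sized to a single member, such as sizeof(hashState_fugue) or sizeof(sph_whirlpool_context). Since all members of a union start at offset 0, restoring only the bytes of the member in use is enough. The sketch below shows the idea with hypothetical stand-in types (small_ctx, large_ctx, ctx_overlay, restore_small); none of these names come from cpuminer-opt.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for two hash contexts of different sizes. */
typedef struct { uint32_t state[4];  } small_ctx;
typedef struct { uint32_t state[64]; } large_ctx;

/* Overlay union: only one member is live at any time. */
typedef union
{
   small_ctx small;
   large_ctx large;
} ctx_overlay;

/* Restore only the bytes that belong to the small member. Every union
   member starts at offset 0, so copying sizeof(small_ctx) bytes from the
   start of the saved union restores that member and nothing else. */
static void restore_small( ctx_overlay *dst, const ctx_overlay *saved )
{
   assert( dst != NULL && saved != NULL );
   memcpy( dst, saved, sizeof(small_ctx) );
}
```

The bytes beyond sizeof(small_ctx) are left as they were, which is harmless as long as the next operation only uses the small member.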
--- a/algo/x16/x16r-gate.c
+++ b/algo/x16/x16r-gate.c
@@ -62,8 +62,7 @@ bool register_x16r_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16r;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  x16_r_s_getAlgoString = (void*)&x16r_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -81,8 +80,7 @@ bool register_x16rv2_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rv2;
  gate->hash      = (void*)&x16rv2_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  x16_r_s_getAlgoString = (void*)&x16r_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -100,8 +98,7 @@ bool register_x16s_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16r;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  x16_r_s_getAlgoString = (void*)&x16s_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -234,8 +231,7 @@ bool register_x16rt_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rt;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  opt_target_factor = 256.0;
  return true;
 };
@@ -252,8 +248,7 @@ bool register_x16rt_veil_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rt;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  gate->build_extraheader = (void*)&veil_build_extraheader;
  opt_target_factor = 256.0;
  return true;
@@ -292,8 +287,7 @@ bool register_x21s_algo( algo_gate_t* gate )
  gate->hash              = (void*)&x21s_hash;
  gate->miner_thread_init = (void*)&x21s_thread_init;
 #endif
-  gate->optimizations     = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
-	                    VAES_OPT | VAES256_OPT;
+  gate->optimizations  = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  x16_r_s_getAlgoString   = (void*)&x16s_getAlgoString;
  opt_target_factor = 256.0;
  return true;
--- a/algo/x16/x16rt-4way.c
+++ b/algo/x16/x16rt-4way.c
@@ -24,15 +24,15 @@ int scanhash_x16rt_8way( struct work *work, uint32_t max_nonce,
   if ( bench )   ptarget[7] = 0x0cff;

   static __thread uint32_t s_ntime = UINT32_MAX;
-   uint32_t ntime = bswap_32( pdata[17] );
-   if ( s_ntime != ntime )
+   uint32_t masked_ntime = bswap_32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
   {
-      x16rt_getTimeHash( ntime, &timeHash );
+      x16rt_getTimeHash( masked_ntime, &timeHash );
      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
-      s_ntime = ntime;
-      if ( opt_debug && !thr_id )
-          applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
-                               x16r_hash_order, ntime, timeHash );
+      s_ntime = masked_ntime;
+      if ( !thr_id )
+          applog( LOG_INFO, "Hash order %s, Ntime %08x, time hash %08x",
+                            x16r_hash_order, bswap_32( pdata[17] ), timeHash );
   }

   x16r_8way_prehash( vdata, pdata );
@@ -78,15 +78,15 @@ int scanhash_x16rt_4way( struct work *work, uint32_t max_nonce,
   if ( bench )  ptarget[7] = 0x0cff;

   static __thread uint32_t s_ntime = UINT32_MAX;
-   uint32_t ntime = bswap_32( pdata[17] );
-   if ( s_ntime != ntime )
+   uint32_t masked_ntime = bswap_32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
   {
-      x16rt_getTimeHash( ntime, &timeHash );
+      x16rt_getTimeHash( masked_ntime, &timeHash );
      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
-      s_ntime = ntime;
-      if ( opt_debug && !thr_id )
-          applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
-                               x16r_hash_order, ntime, timeHash );
+      s_ntime = masked_ntime;
+      if ( !thr_id )
+          applog( LOG_INFO, "Hash order %s, Ntime %08x, time hash %08x",
+                            x16r_hash_order, bswap_32( pdata[17] ), timeHash );
   }

   x16r_4way_prehash( vdata, pdata );
--- a/algo/x16/x16rt.c
+++ b/algo/x16/x16rt.c
@@ -20,15 +20,15 @@ int scanhash_x16rt( struct work *work, uint32_t max_nonce,
   mm128_bswap32_80( edata, pdata );

   static __thread uint32_t s_ntime = UINT32_MAX;
-   uint32_t ntime = swab32( pdata[17] );
-   if ( s_ntime != ntime )
+   uint32_t masked_ntime = swab32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
   {
-      x16rt_getTimeHash( ntime, &timeHash );
+      x16rt_getTimeHash( masked_ntime, &timeHash );
      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
-      s_ntime = ntime;
+      s_ntime = masked_ntime;
      if ( opt_debug && !thr_id )
          applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
-                               x16r_hash_order, ntime, timeHash );
+                        x16r_hash_order, swab32( pdata[17] ), timeHash );
   }
   
   x16r_prehash( edata, pdata );
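Note on the ntime masking above: the x16rt hunks now apply & 0xffffff80 before comparing against the cached s_ntime and before calling x16rt_getTimeHash. Clearing the low 7 bits groups timestamps into 128-second windows, so the hash order is only recomputed when ntime crosses a window boundary. A minimal sketch of that caching logic follows; mask_ntime and hash_order_changed are hypothetical helpers written for illustration, not functions in cpuminer-opt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask taken from the diff: clears the low 7 bits of the timestamp,
   giving 128-second windows. */
#define NTIME_MASK 0xffffff80u

/* Hypothetical helper: map a timestamp to the start of its window. */
static uint32_t mask_ntime( uint32_t ntime )
{
   return ntime & NTIME_MASK;
}

/* Hypothetical helper: returns 1 and updates the cache when ntime falls
   in a new window, 0 when the cached hash order is still valid. */
static int hash_order_changed( uint32_t *cached, uint32_t ntime )
{
   assert( cached != NULL );
   const uint32_t masked = mask_ntime( ntime );
   if ( *cached == masked ) return 0;
   *cached = masked;
   return 1;
}
```

Initialising the cache to UINT32_MAX, as the diff does with s_ntime, guarantees a recompute on the first call because a masked value always has its low 7 bits clear.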
--- a/algo/x16/x16rv2-4way.c
+++ b/algo/x16/x16rv2-4way.c
@@ -45,14 +45,14 @@ static __thread x16rv2_8way_context_overlay x16rv2_ctx;
 int x16rv2_8way_hash( void* output, const void* input, int thrid )
 {
   uint32_t vhash[24*8] __attribute__ ((aligned (128)));
-   uint32_t hash0[24] __attribute__ ((aligned (64)));
-   uint32_t hash1[24] __attribute__ ((aligned (64)));
-   uint32_t hash2[24] __attribute__ ((aligned (64)));
-   uint32_t hash3[24] __attribute__ ((aligned (64)));
-   uint32_t hash4[24] __attribute__ ((aligned (64)));
-   uint32_t hash5[24] __attribute__ ((aligned (64)));
-   uint32_t hash6[24] __attribute__ ((aligned (64)));
-   uint32_t hash7[24] __attribute__ ((aligned (64)));
+   uint32_t hash0[24] __attribute__ ((aligned (32)));
+   uint32_t hash1[24] __attribute__ ((aligned (32)));
+   uint32_t hash2[24] __attribute__ ((aligned (32)));
+   uint32_t hash3[24] __attribute__ ((aligned (32)));
+   uint32_t hash4[24] __attribute__ ((aligned (32)));
+   uint32_t hash5[24] __attribute__ ((aligned (32)));
+   uint32_t hash6[24] __attribute__ ((aligned (32)));
+   uint32_t hash7[24] __attribute__ ((aligned (32)));
   x16rv2_8way_context_overlay ctx;
   memcpy( &ctx, &x16rv2_ctx, sizeof(ctx) );
   void *in0 = (void*) hash0;
@@ -706,11 +706,11 @@ inline void padtiger512( uint32_t* hash )

 int x16rv2_4way_hash( void* output, const void* input, int thrid )
 {
-   uint32_t hash0[20] __attribute__ ((aligned (64)));
-   uint32_t hash1[20] __attribute__ ((aligned (64)));
-   uint32_t hash2[20] __attribute__ ((aligned (64)));
-   uint32_t hash3[20] __attribute__ ((aligned (64)));
   uint32_t vhash[20*4] __attribute__ ((aligned (64)));
+   uint32_t hash0[20] __attribute__ ((aligned (32)));
+   uint32_t hash1[20] __attribute__ ((aligned (32)));
+   uint32_t hash2[20] __attribute__ ((aligned (32)));
+   uint32_t hash3[20] __attribute__ ((aligned (32)));
   x16rv2_4way_context_overlay ctx;
   memcpy( &ctx, &x16rv2_ctx, sizeof(ctx) );
   void *in0 = (void*) hash0;
@@ -1054,8 +1054,8 @@ int scanhash_x16rv2_4way( struct work *work, uint32_t max_nonce,
   uint32_t hash[4*16] __attribute__ ((aligned (64)));
   uint32_t vdata[24*4] __attribute__ ((aligned (64)));
   uint32_t vdata32[20*4] __attribute__ ((aligned (64)));
-   uint32_t edata[20] __attribute__ ((aligned (64)));
-   uint32_t bedata1[2] __attribute__((aligned(64)));
+   uint32_t edata[20];
+   uint32_t bedata1[2];
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -1068,7 +1068,6 @@ int scanhash_x16rv2_4way( struct work *work, uint32_t max_nonce,

   if ( bench )  ptarget[7] = 0x0fff;
   
-
   bedata1[0] = bswap_32( pdata[1] );
   bedata1[1] = bswap_32( pdata[2] );

--- a/algo/x17/sonoa-4way.c
+++ b/algo/x17/sonoa-4way.c
@@ -63,14 +63,14 @@ int sonoa_8way_hash( void *state, const void *input, int thr_id )
     uint64_t vhash[8*8] __attribute__ ((aligned (128)));
     uint64_t vhashA[8*8] __attribute__ ((aligned (64)));
     uint64_t vhashB[8*8] __attribute__ ((aligned (64)));
-     uint64_t hash0[8] __attribute__ ((aligned (64)));
-     uint64_t hash1[8] __attribute__ ((aligned (64)));
-     uint64_t hash2[8] __attribute__ ((aligned (64)));
-     uint64_t hash3[8] __attribute__ ((aligned (64)));
-     uint64_t hash4[8] __attribute__ ((aligned (64)));
-     uint64_t hash5[8] __attribute__ ((aligned (64)));
-     uint64_t hash6[8] __attribute__ ((aligned (64)));
-     uint64_t hash7[8] __attribute__ ((aligned (64)));
+     uint64_t hash0[8] __attribute__ ((aligned (32)));
+     uint64_t hash1[8] __attribute__ ((aligned (32)));
+     uint64_t hash2[8] __attribute__ ((aligned (32)));
+     uint64_t hash3[8] __attribute__ ((aligned (32)));
+     uint64_t hash4[8] __attribute__ ((aligned (32)));
+     uint64_t hash5[8] __attribute__ ((aligned (32)));
+     uint64_t hash6[8] __attribute__ ((aligned (32)));
+     uint64_t hash7[8] __attribute__ ((aligned (32)));
     sonoa_8way_context_overlay ctx;

 // 1
@@ -1150,13 +1150,13 @@ typedef union _sonoa_4way_context_overlay sonoa_4way_context_overlay;

 int sonoa_4way_hash( void *state, const void *input, int thr_id )
 {
-     uint64_t hash0[8] __attribute__ ((aligned (64)));
-     uint64_t hash1[8] __attribute__ ((aligned (64)));
-     uint64_t hash2[8] __attribute__ ((aligned (64)));
-     uint64_t hash3[8] __attribute__ ((aligned (64)));
     uint64_t vhash[8*4] __attribute__ ((aligned (64)));
     uint64_t vhashA[8*4] __attribute__ ((aligned (64)));
     uint64_t vhashB[8*4] __attribute__ ((aligned (64)));
+     uint64_t hash0[8] __attribute__ ((aligned (32)));
+     uint64_t hash1[8] __attribute__ ((aligned (32)));
+     uint64_t hash2[8] __attribute__ ((aligned (32)));
+     uint64_t hash3[8] __attribute__ ((aligned (32)));
     sonoa_4way_context_overlay ctx;

 // 1
--- a/algo/x17/sonoa-gate.c
+++ b/algo/x17/sonoa-gate.c
@@ -12,7 +12,7 @@ bool register_sonoa_algo( algo_gate_t* gate )
  init_sonoa_ctx();
  gate->hash      = (void*)&sonoa_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  return true;
 };

--- a/algo/x17/x17-4way.c
+++ b/algo/x17/x17-4way.c
@@ -58,23 +58,27 @@ union _x17_8way_context_overlay
 } __attribute__ ((aligned (64)));
 typedef union _x17_8way_context_overlay x17_8way_context_overlay;

+static __thread __m512i x17_8way_midstate[16] __attribute__((aligned(64)));
+static __thread blake512_8way_context blake512_8way_ctx __attribute__((aligned(64)));
+
 int x17_8way_hash( void *state, const void *input, int thr_id )
 {
     uint64_t vhash[8*8] __attribute__ ((aligned (128)));
     uint64_t vhashA[8*8] __attribute__ ((aligned (64)));
     uint64_t vhashB[8*8] __attribute__ ((aligned (64)));
-     uint64_t hash0[8] __attribute__ ((aligned (64)));
-     uint64_t hash1[8] __attribute__ ((aligned (64)));
-     uint64_t hash2[8] __attribute__ ((aligned (64)));
-     uint64_t hash3[8] __attribute__ ((aligned (64)));
-     uint64_t hash4[8] __attribute__ ((aligned (64)));
-     uint64_t hash5[8] __attribute__ ((aligned (64)));
-     uint64_t hash6[8] __attribute__ ((aligned (64)));
-     uint64_t hash7[8] __attribute__ ((aligned (64)));
+     uint64_t hash0[8] __attribute__ ((aligned (32)));
+     uint64_t hash1[8] __attribute__ ((aligned (32)));
+     uint64_t hash2[8] __attribute__ ((aligned (32)));
+     uint64_t hash3[8] __attribute__ ((aligned (32)));
+     uint64_t hash4[8] __attribute__ ((aligned (32)));
+     uint64_t hash5[8] __attribute__ ((aligned (32)));
+     uint64_t hash6[8] __attribute__ ((aligned (32)));
+     uint64_t hash7[8] __attribute__ ((aligned (32)));
     x17_8way_context_overlay ctx;

-     blake512_8way_full( &ctx.blake, vhash, input, 80 );
-
+     blake512_8way_final_le( &blake512_8way_ctx, vhash, casti_m512i( input, 9 ),
+                             x17_8way_midstate );
+     
     bmw512_8way_full( &ctx.bmw, vhash, vhash, 64 );

 #if defined(__VAES__)
@@ -122,9 +126,6 @@ int x17_8way_hash( void *state, const void *input, int thr_id )

     cube_4way_2buf_full( &ctx.cube, vhashA, vhashB, 512, vhashA, vhashB, 64 );
     
-//     cube_4way_full( &ctx.cube, vhashA, 512, vhashA, 64 );
-//     cube_4way_full( &ctx.cube, vhashB, 512, vhashB, 64 );
-
 #if defined(__VAES__)

     shavite512_4way_full( &ctx.shavite, vhashA, vhashA, 64 );
@@ -237,6 +238,57 @@ int x17_8way_hash( void *state, const void *input, int thr_id )
     return 1;
 }

+int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t hash32[8*8] __attribute__ ((aligned (128)));
+   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+   __m128i edata[5] __attribute__ ((aligned (64)));
+   uint32_t *hash32_d7 = &(hash32[7*8]);
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   __m512i  *noncev = (__m512i*)vdata + 9;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const uint32_t targ32_d7 = ptarget[7];
+   const __m512i eight = m512_const1_64( 8 );
+   const bool bench = opt_benchmark;
+
+   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
+   edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
+   edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
+   edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );
+   edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );
+
+   mm512_intrlv80_8x64( vdata, edata );
+   *noncev = _mm512_add_epi32( *noncev, _mm512_set_epi32(
+                                    0,7, 0,6, 0,5, 0,4, 0,3, 0,2, 0,1, 0,0 ) );
+   blake512_8way_prehash_le( &blake512_8way_ctx, x17_8way_midstate, vdata );
+   
+   do
+   {
+      if ( likely( x17_8way_hash( hash32, vdata, thr_id ) ) )
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( unlikely( ( hash32_d7[ lane ] <= targ32_d7 ) && !bench ) )
+      {
+         extr_lane_8x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      *noncev = _mm512_add_epi32( *noncev, eight );
+      n += 8;
+   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
 #elif defined(X17_4WAY)

 union _x17_4way_context_overlay
@@ -266,18 +318,24 @@ union _x17_4way_context_overlay
 };  
 typedef union _x17_4way_context_overlay x17_4way_context_overlay;

+static __thread __m256i x17_4way_midstate[16] __attribute__((aligned(64)));
+static __thread blake512_4way_context blake512_4way_ctx __attribute__((aligned(64)));
+
 int x17_4way_hash( void *state, const void *input, int thr_id )
 {
     uint64_t vhash[8*4] __attribute__ ((aligned (64)));
     uint64_t vhashA[8*4] __attribute__ ((aligned (64)));
     uint64_t vhashB[8*4] __attribute__ ((aligned (64)));
-     uint64_t hash0[8] __attribute__ ((aligned (64)));
-     uint64_t hash1[8] __attribute__ ((aligned (64)));
-     uint64_t hash2[8] __attribute__ ((aligned (64)));
-     uint64_t hash3[8] __attribute__ ((aligned (64)));
+     uint64_t hash0[8] __attribute__ ((aligned (32)));
+     uint64_t hash1[8] __attribute__ ((aligned (32)));
+     uint64_t hash2[8] __attribute__ ((aligned (32)));
+     uint64_t hash3[8] __attribute__ ((aligned (32)));
     x17_4way_context_overlay ctx;

-     blake512_4way_full( &ctx.blake, vhash, input, 80 );
+     blake512_4way_final_le( &blake512_4way_ctx, vhash, casti_m256i( input, 9 ),
+                             x17_4way_midstate );
+     
+//     blake512_4way_full( &ctx.blake, vhash, input, 80 );

     bmw512_4way_init( &ctx.bmw );
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
@@ -393,4 +451,54 @@ int x17_4way_hash( void *state, const void *input, int thr_id )
     return 1;
 }

+int scanhash_x17_4way( struct work *work, uint32_t max_nonce,
+                   uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t hash32[8*4] __attribute__ ((aligned (128)));
+   uint32_t vdata[20*4] __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   __m128i edata[5] __attribute__ ((aligned (32)));
+   uint32_t *pdata = work->data;
+   uint32_t *hash32_d7 = &(hash32[7*4]);
+   const uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   __m256i  *noncev = (__m256i*)vdata + 9;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const uint32_t targ32_d7 = ptarget[7];
+   const __m256i four = m256_const1_64( 4 );
+   const bool bench = opt_benchmark;
+
+   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
+   edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
+   edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
+   edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );
+   edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );
+
+   mm256_intrlv80_4x64( vdata, edata );
+   *noncev = _mm256_add_epi32( *noncev, _mm256_set_epi32( 0,3,0,2, 0,1,0,0 ) );
+   blake512_4way_prehash_le( &blake512_4way_ctx, x17_4way_midstate, vdata );
+
+   do
+   {
+      if ( likely( x17_4way_hash( hash32, vdata, thr_id ) ) )
+      for ( int lane = 0; lane < 4; lane++ )
+      if ( unlikely( ( hash32_d7[ lane ] <= targ32_d7 ) && !bench ) )
+      {
+         extr_lane_4x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      *noncev = _mm256_add_epi32( *noncev, four );
+      n += 4;
+   } while ( ( n < last_nonce ) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
 #endif
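Note on the prehash/final split above: the new scanhash_x17_8way and scanhash_x17_4way call a blake512 prehash once before the nonce loop, and the hash function then finishes from the saved midstate on each iteration. The general idea is to do the work that does not depend on the nonce only once. The toy model below illustrates this with a simple word-mixing function; it is NOT Blake-512 and does not reflect how blake512_8way_prehash_le divides the work. All names (toy_mix, toy_hash_full, toy_prehash, toy_final) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy mixing step (FNV-1a style), standing in for a real compression
   function. Assumption for illustration only. */
static uint32_t toy_mix( uint32_t h, uint32_t w )
{
   h ^= w;
   h *= 0x01000193u;
   return h;
}

/* Reference path: absorb all 20 words of an 80-byte header. */
static uint32_t toy_hash_full( const uint32_t header[20] )
{
   assert( header != NULL );
   uint32_t h = 0x811c9dc5u;
   for ( int i = 0; i < 20; i++ ) h = toy_mix( h, header[i] );
   return h;
}

/* Prehash: absorb words 0..18, which stay constant across the nonce loop. */
static uint32_t toy_prehash( const uint32_t header[20] )
{
   assert( header != NULL );
   uint32_t h = 0x811c9dc5u;
   for ( int i = 0; i < 19; i++ ) h = toy_mix( h, header[i] );
   return h;
}

/* Final: resume from the midstate and absorb only the nonce (word 19). */
static uint32_t toy_final( uint32_t midstate, uint32_t nonce )
{
   return toy_mix( midstate, nonce );
}
```

In a scan loop, toy_prehash would run once per work unit and toy_final once per nonce, giving the same result as toy_hash_full at a fraction of the per-nonce cost.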
--- a/algo/x17/x17-gate.c
+++ b/algo/x17/x17-gate.c
@@ -3,15 +3,16 @@
 bool register_x17_algo( algo_gate_t* gate )
 {
 #if defined (X17_8WAY)
-  gate->scanhash  = (void*)&scanhash_8way_64in_32out;
+  gate->scanhash  = (void*)&scanhash_x17_8way;
  gate->hash      = (void*)&x17_8way_hash;
 #elif defined (X17_4WAY)
-  gate->scanhash  = (void*)&scanhash_4way_64in_32out;
+  gate->scanhash  = (void*)&scanhash_x17_4way;
+//  gate->scanhash  = (void*)&scanhash_4way_64in_32out;
  gate->hash      = (void*)&x17_4way_hash;
 #else
  gate->hash      = (void*)&x17_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  return true;
 };

--- a/algo/x17/x17-gate.h
+++ b/algo/x17/x17-gate.h
@@ -14,10 +14,15 @@ bool register_x17_algo( algo_gate_t* gate );

 #if defined(X17_8WAY)

+int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr );
+
 int x17_8way_hash( void *state, const void *input, int thr_id );

 #elif defined(X17_4WAY)

+int scanhash_x17_4way( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr );
 int x17_4way_hash( void *state, const void *input, int thr_id );

 #endif
--- a/algo/x17/xevan-4way.c
+++ b/algo/x17/xevan-4way.c
@@ -62,14 +62,14 @@ int xevan_8way_hash( void *output, const void *input, int thr_id )
     uint64_t vhash[16<<3] __attribute__ ((aligned (128)));
     uint64_t vhashA[16<<3] __attribute__ ((aligned (64)));
     uint64_t vhashB[16<<3] __attribute__ ((aligned (64)));
-     uint64_t hash0[16] __attribute__ ((aligned (64)));
-     uint64_t hash1[16] __attribute__ ((aligned (64)));
-     uint64_t hash2[16] __attribute__ ((aligned (64)));
-     uint64_t hash3[16] __attribute__ ((aligned (64)));
-     uint64_t hash4[16] __attribute__ ((aligned (64)));
-     uint64_t hash5[16] __attribute__ ((aligned (64)));
-     uint64_t hash6[16] __attribute__ ((aligned (64)));
-     uint64_t hash7[16] __attribute__ ((aligned (64)));
+     uint64_t hash0[16] __attribute__ ((aligned (32)));
+     uint64_t hash1[16] __attribute__ ((aligned (32)));
+     uint64_t hash2[16] __attribute__ ((aligned (32)));
+     uint64_t hash3[16] __attribute__ ((aligned (32)));
+     uint64_t hash4[16] __attribute__ ((aligned (32)));
+     uint64_t hash5[16] __attribute__ ((aligned (32)));
+     uint64_t hash6[16] __attribute__ ((aligned (32)));
+     uint64_t hash7[16] __attribute__ ((aligned (32)));
     const int dataLen = 128;
     xevan_8way_context_overlay ctx __attribute__ ((aligned (64)));

@@ -430,13 +430,13 @@ typedef union _xevan_4way_context_overlay xevan_4way_context_overlay;

 int xevan_4way_hash( void *output, const void *input, int thr_id )
 {
-     uint64_t hash0[16] __attribute__ ((aligned (64)));
-     uint64_t hash1[16] __attribute__ ((aligned (64)));
-     uint64_t hash2[16] __attribute__ ((aligned (64)));
-     uint64_t hash3[16] __attribute__ ((aligned (64)));
     uint64_t vhash[16<<2] __attribute__ ((aligned (64)));
     uint64_t vhashA[16<<2] __attribute__ ((aligned (64)));
     uint64_t vhashB[16<<2] __attribute__ ((aligned (64)));
+     uint64_t hash0[16] __attribute__ ((aligned (32)));
+     uint64_t hash1[16] __attribute__ ((aligned (32)));
+     uint64_t hash2[16] __attribute__ ((aligned (32)));
+     uint64_t hash3[16] __attribute__ ((aligned (32)));
     const int dataLen = 128;
     xevan_4way_context_overlay ctx __attribute__ ((aligned (64)));

--- a/algo/x17/xevan-gate.c
+++ b/algo/x17/xevan-gate.c
@@ -12,7 +12,7 @@ bool register_xevan_algo( algo_gate_t* gate )
  init_xevan_ctx();
  gate->hash      = (void*)&xevan_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  opt_target_factor = 256.0;
  return true;
 };
--- a/algo/x20/x20r-gate.c
+++ b/algo/x20/x20r-gate.c
@@ -1,34 +0,0 @@
-#include "x20r-gate.h"
-
-void getAlgoString( const uint8_t* prevblock, char *output )
-{
-    char *sptr = outpuit;
-
-    for ( int j = 0; j < X20R_HASH_FUNC_COUNT; j++ )
-    {
-        char b = (19 - j) >> 1; // 16 ascii hex chars, reversed
-        uint8_t algoDigit = (j & 1) ? prevblock[b] & 0xF : prevblock[b] >> 4;
-        if (algoDigit >= 10)
-            sprintf(sptr, "%c", 'A' + (algoDigit - 10));
-         else
-            sprintf(sptr, "%u", (uint32_t) algoDigit);
-        sptr++;
-     }
-     *sptr = '\0';
-}
-
-bool register_x20r_algo( algo_gate_t* gate )
-{
-#if defined (X20R_4WAY)
-  gate->scanhash  = (void*)&scanhash_x20r_4way;
-  gate->hash      = (void*)&x20r_4way_hash;
-#else
-  gate->scanhash  = (void*)&scanhash_x20r;
-  gate->hash      = (void*)&x20r_hash;
-#endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT;
-  x20_r_s_getAlgoString = (void*)&x20r_getAlgoString;
-  opt_target_factor = 256.;
-  return true;
-};
-
--- a/algo/x20/x20r-gate.h
+++ b/algo/x20/x20r-gate.h
@@ -1,58 +0,0 @@
-#ifndef X20R_GATE_H__
-#define X20R_GATE_H__ 1
-
-#include "algo-gate-api.h"
-#include <stdint.h>
-
-/*
-#if defined(__AVX2__) && defined(__AES__)
-  #define X20R_4WAY
-#endif
-*/
-
-enum x20r_Algo {
-        BLAKE = 0,
-        BMW,
-        GROESTL,
-        JH,
-        KECCAK,
-        SKEIN,
-        LUFFA,
-        CUBEHASH,
-        SHAVITE,
-        SIMD,
-        ECHO,
-        HAMSI,
-        FUGUE,
-        SHABAL,
-        WHIRLPOOL,
-        SHA_512,
-        HAVAL,      // 256-bits output
-        GOST,
-        RADIOGATUN, // 256-bits output
-        PANAMA,     // 256-bits output
-        X20R_HASH_FUNC_COUNT
-};
-
-void (*x20_r_s_getAlgoString) ( const uint8_t*, char* );
-
-void x20r_getAlgoString( const uint8_t* prevblock, char *output );
-
-bool register_xi20r_algo( algo_gate_t* gate );
-
-#if defined(X20R_4WAY)
-
-void x20r_4way_hash( void *state, const void *input );
-
-int scanhash_x20r_4way( struct work *work, uint32_t max_nonce,
-                        uint64_t *hashes_done, struct thr_info *mythr );
-
-#endif
-
-void x20rhash( void *state, const void *input );
-
-int scanhash_x20r( struct work *work, uint32_t max_nonce,
-                   uint64_t *hashes_done, struct thr_info *mythr );
-
-#endif
-
--- a/algo/x20/x20r.c
+++ b/algo/x20/x20r.c
@@ -1,252 +0,0 @@
-#include "x20r-gate.h"
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-
-#include "algo/blake/sph_blake.h"
-#include "algo/bmw/sph_bmw.h"
-#include "algo/jh/sph_jh.h"
-#include "algo/keccak/sph_keccak.h"
-#include "algo/skein/sph_skein.h"
-#include "algo/shavite/sph_shavite.h"
-#include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
-#include "algo/shabal/sph_shabal.h"
-#include "algo/whirlpool/sph_whirlpool.h"
-#include "algo/haval/sph-haval.h"
-#include "algo/radiogatun/sph_radiogatun.h"
-#include "algo/panama/sph_panama.h"
-#include "algo/gost/sph_gost.h"
-#include "algo/sha/sph_sha2.h"
-#if defined(__AES__)
-  #include "algo/echo/aes_ni/hash_api.h"
-  #include "algo/groestl/aes_ni/hash-groestl.h"
-#else
-  #include "algo/groestl/sph_groestl.h"
-  #include "algo/echo/sph_echo.h"
-#endif 
-#include "algo/luffa/luffa_for_sse2.h"
-#include "algo/cubehash/cubehash_sse2.h"
-#include "algo/simd/nist.h"
-
-
-static __thread uint32_t s_ntime = UINT32_MAX;
-static __thread char hashOrder[X20R_HASH_FUNC_COUNT + 1] = { 0 };
-
-union _x20r_context_overlay
-{
-    sph_blake512_context     blake;
-    sph_bmw512_context       bmw;
-#if defined(__AES__)
-    hashState_groestl        groestl;
-    hashState_echo           echo;
-#else
-    sph_groestl512_context   groestl;
-    sph_echo512_context      echo;
-#endif
-    sph_skein512_context     skein;
-    sph_jh512_context        jh;
-    sph_keccak512_context    keccak;
-    hashState_luffa          luffa;
-    cubehashParam            cube;
-    hashState_sd             simd;
-    sph_shavite512_context   shavite;
-    sph_hamsi512_context     hamsi;
-    sph_fugue512_context     fugue;
-    sph_shabal512_context    shabal;
-    sph_whirlpool_context    whirlpool;
-    sph_sha512_context       sha512;
-    sph_haval256_5_context   haval;
-    sph_gost512_context      gost;
-    sph_radiogatun64_context radiogatun;
-    sph_panama_context       panama;
-};
-typedef union _x20r_context_overlay x20r_context_overlay;
-
-void x20r_hash(void* output, const void* input)
-{
-   uint32_t _ALIGN(128) hash[64/4];
-   x20r_context_overlay ctx;
-   void *in = (void*) input;
-   int size = 80;
-
-   if ( s_ntime == UINT32_MAX )
-   {
-	const uint8_t* in8 = (uint8_t*) input;
-	x20_r_s_getAlgoString(&in8[4], hashOrder);
-   }
-
-   for (int i = 0; i < 20; i++)
-   {
-	const char elem = hashOrder[i];
-	const uint8_t algo = elem >= 'A' ? elem - 'A' + 10 : elem - '0';
-
-	switch ( algo )
-       	{
-	   case BLAKE:
-		sph_blake512_init(&ctx.blake);
-		sph_blake512(&ctx.blake, in, size);
-		sph_blake512_close(&ctx.blake, hash);
-		break;
-	   case BMW:
-		sph_bmw512_init(&ctx.bmw);
-		sph_bmw512(&ctx.bmw, in, size);
-		sph_bmw512_close(&ctx.bmw, hash);
-		break;
-	   case GROESTL:
-#if defined(__AES__)
-                init_groestl( &ctx.groestl, 64 );
-                update_and_final_groestl( &ctx.groestl, (char*)hash,
-                                         (const char*)in, size<<3 );
-#else
-                sph_groestl512_init(&ctx.groestl);
-                sph_groestl512(&ctx.groestl, in, size);
-                sph_groestl512_close(&ctx.groestl, hash);
-#endif
-                break;
-           case SKEIN:
-		sph_skein512_init(&ctx.skein);
-		sph_skein512(&ctx.skein, in, size);
-		sph_skein512_close(&ctx.skein, hash);
-		break;
-	   case JH:
-		sph_jh512_init(&ctx.jh);
-		sph_jh512(&ctx.jh, in, size);
-		sph_jh512_close(&ctx.jh, hash);
-		break;
-	   case KECCAK:
-		sph_keccak512_init(&ctx.keccak);
-		sph_keccak512(&ctx.keccak, in, size);
-		sph_keccak512_close(&ctx.keccak, hash);
-		break;
-	   case LUFFA:
-                init_luffa( &ctx.luffa, 512 );
-                update_and_final_luffa( &ctx.luffa, (BitSequence*)hash,
-                                        (const BitSequence*)in, size );
-		break;
-           case CUBEHASH:
-                cubehashInit( &ctx.cube, 512, 16, 32 );
-                cubehashUpdateDigest( &ctx.cube, (byte*) hash,
-                                      (const byte*)in, size );
-		break;
-	   case SHAVITE:
-		sph_shavite512_init(&ctx.shavite);
-		sph_shavite512(&ctx.shavite, in, size);
-		sph_shavite512_close(&ctx.shavite, hash);
-		break;
-           case SIMD:
-                init_sd( &ctx.simd, 512 );
-                update_final_sd( &ctx.simd, (BitSequence *)hash,
-                                 (const BitSequence *)in, size<<3 );
-			break;
-           case ECHO:
-#if defined(__AES__)
-                init_echo( &ctx.echo, 512 );
-                update_final_echo ( &ctx.echo, (BitSequence *)hash,
-                                    (const BitSequence *)in, size<<3 );
-#else
-	        sph_echo512_init(&ctx.echo);
-	        sph_echo512(&ctx.echo, in, size);
-	        sph_echo512_close(&ctx.echo, hash);
-#endif
-		break;
-	   case HAMSI:
-		sph_hamsi512_init(&ctx.hamsi);
-		sph_hamsi512(&ctx.hamsi, in, size);
-		sph_hamsi512_close(&ctx.hamsi, hash);
-		break;
-	   case FUGUE:
-		sph_fugue512_init(&ctx.fugue);
-		sph_fugue512(&ctx.fugue, in, size);
-		sph_fugue512_close(&ctx.fugue, hash);
-		break;
-	   case SHABAL:
-		sph_shabal512_init(&ctx.shabal);
-		sph_shabal512(&ctx.shabal, in, size);
-		sph_shabal512_close(&ctx.shabal, hash);
-		break;
-	   case WHIRLPOOL:
-		sph_whirlpool_init(&ctx.whirlpool);
-		sph_whirlpool(&ctx.whirlpool, in, size);
-		sph_whirlpool_close(&ctx.whirlpool, hash);
-		break;
-	   case SHA_512:
-                sph_sha512_Init( &ctx.sha512 );
-                sph_sha512( &ctx.sha512, in, size );
-                sph_sha512_close( &ctx.sha512, hash );
-		break;
-	   case HAVAL:
-		sph_haval256_5_init(&ctx.haval);
-		sph_haval256_5(&ctx.haval, in, size);
-		sph_haval256_5_close(&ctx.haval, hash);
-		memset(&hash[8], 0, 32);
-		break;
-	   case GOST:
-		sph_gost512_init(&ctx.gost);
-		sph_gost512(&ctx.gost, in, size);
-		sph_gost512_close(&ctx.gost, hash);
-		break;
-	   case RADIOGATUN:
-		sph_radiogatun64_init(&ctx.radiogatun);
-		sph_radiogatun64(&ctx.radiogatun, in, size);
-		sph_radiogatun64_close(&ctx.radiogatun, hash);
-		memset(&hash[8], 0, 32);
-		break;
-	   case PANAMA:
-		sph_panama_init(&ctx.panama);
-		sph_panama(&ctx.panama, in, size);
-		sph_panama_close(&ctx.panama, hash);
-		memset(&hash[8], 0, 32);
-		break;
-	}
-   in = (void*) hash;
-   size = 64;
-   }
-   memcpy(output, hash, 32);
-}
-
-int scanhash_x20r( struct work *work, uint32_t max_nonce,
-	           uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t _ALIGN(128) hash32[8];
-   uint32_t _ALIGN(128) endiandata[20];
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   const uint32_t Htarg = ptarget[7];
-   const uint32_t first_nonce = pdata[19];
-   uint32_t nonce = first_nonce;
-   int thr_id = mythr->id;
-   volatile uint8_t *restart = &(work_restart[thr_id].restart);
-
-   for (int k=0; k < 19; k++)
-	be32enc( &endiandata[k], pdata[k] );
-
-   if ( s_ntime != pdata[17] )
-   {
-	uint32_t ntime = swab32(pdata[17]);
-	x20_r_s_getAlgoString( (const char*) (&endiandata[1]), hashOrder );
-	s_ntime = ntime;
-	if (opt_debug && !thr_id) applog(LOG_DEBUG, "hash order %s (%08x)", hashOrder, ntime);
-   }
-
-   if ( opt_benchmark )
-	ptarget[7] = 0x0cff;
-
-   do {
-	be32enc( &endiandata[19], nonce );
-	x20r_hash( hash32, endiandata );
-
-	if ( hash32[7] <= Htarg && fulltest( hash32, ptarget ) )
-  	{
-        pdata[19] = nonce;
-        submit_solution( work, hash32, mythr );
-	}
-	nonce++;
-
-   } while (nonce < max_nonce && !(*restart));
-
-   pdata[19] = nonce;
-   *hashes_done = pdata[19] - first_nonce + 1;
-   return 0;
-}
--- a/algo/x22/x22i-4way.c
+++ b/algo/x22/x22i-4way.c
@@ -21,7 +21,6 @@
 #include "algo/tiger/sph_tiger.h"
 #include "algo/lyra2/lyra2.h"
 #include "algo/gost/sph_gost.h"
-#include "algo/swifftx/swifftx.h"
 #if defined(__VAES__)
  #include "algo/groestl/groestl512-hash-4way.h"
  #include "algo/shavite/shavite-hash-4way.h"
--- a/algo/x22/x22i-gate.c
+++ b/algo/x22/x22i-gate.c
@@ -31,8 +31,8 @@ bool register_x22i_algo( algo_gate_t* gate )

 #endif

-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT
-                      | AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | SSE42_OPT | AES_OPT | AVX2_OPT | SHA_OPT
+                      | AVX512_OPT | VAES_OPT;
  return true;
 };

@@ -48,8 +48,9 @@ bool register_x25x_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x25x;
  gate->hash      = (void*)&x25x_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT |
-                        AVX512_OPT | VAES_OPT | VAES256_OPT;
+  gate->optimizations = SSE2_OPT | SSE42_OPT | AES_OPT | AVX2_OPT | SHA_OPT |
+                        AVX512_OPT | VAES_OPT;
+  InitializeSWIFFTX();
  return true;
 };

--- a/algo/x22/x22i-gate.h
+++ b/algo/x22/x22i-gate.h
@@ -5,6 +5,7 @@
 #include "simd-utils.h"
 #include <stdint.h>
 #include <unistd.h>
+#include "algo/swifftx/swifftx.h"

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
  #define X22I_8WAY 1
--- a/algo/x22/x25x-4way.c
+++ b/algo/x22/x25x-4way.c
@@ -24,7 +24,6 @@
 #include "algo/tiger/sph_tiger.h"
 #include "algo/lyra2/lyra2.h"
 #include "algo/gost/sph_gost.h"
-#include "algo/swifftx/swifftx.h"
 #include "algo/panama/panama-hash-4way.h"
 #include "algo/lanehash/lane.h"
 #if defined(__VAES__)
@@ -102,6 +101,9 @@ union _x25x_8way_ctx_overlay
 };
 typedef union _x25x_8way_ctx_overlay x25x_8way_ctx_overlay;

+static __thread __m512i x25x_8way_midstate[16] __attribute__((aligned(64)));
+static __thread blake512_8way_context blake512_8way_ctx __attribute__((aligned(64)));
+
 int x25x_8way_hash( void *output, const void *input, int thrid )
 {
   uint64_t vhash[8*8] __attribute__ ((aligned (128)));
@@ -118,9 +120,9 @@ int x25x_8way_hash( void *output, const void *input, int thrid )
   uint64_t vhashB[8*8] __attribute__ ((aligned (64)));
   x25x_8way_ctx_overlay ctx __attribute__ ((aligned (64)));

-   blake512_8way_init( &ctx.blake );
-   blake512_8way_update( &ctx.blake, input, 80 );
-   blake512_8way_close( &ctx.blake, vhash );
+   blake512_8way_final_le( &blake512_8way_ctx, vhash, casti_m512i( input, 9 ),
+                                                x25x_8way_midstate );
+
   dintrlv_8x64_512( hash0[0], hash1[0], hash2[0], hash3[0],
                     hash4[0], hash5[0], hash6[0], hash7[0], vhash );

@@ -271,7 +273,6 @@ int x25x_8way_hash( void *output, const void *input, int thrid )
   intrlv_8x64_512( vhash, hash0[10], hash1[10], hash2[10], hash3[10],
                           hash4[10], hash5[10], hash6[10], hash7[10] );

-   
 #else

   init_echo( &ctx.echo, 512 );
@@ -558,6 +559,7 @@ int scanhash_x25x_8way( struct work *work, uint32_t max_nonce,
 {
   uint32_t hash[8*8] __attribute__ ((aligned (128)));
   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   __m128i edata[5] __attribute__ ((aligned (64)));
   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
   uint32_t *hashd7 = &(hash[7*8]);
   uint32_t *pdata = work->data;
@@ -569,15 +571,20 @@ int scanhash_x25x_8way( struct work *work, uint32_t max_nonce,
   const int thr_id = mythr->id;
   const uint32_t targ32 = ptarget[7];
   const bool bench = opt_benchmark;
-
+   const __m512i eight = m512_const1_64( 8 );
   if ( bench )  ptarget[7] = 0x08ff;

-   InitializeSWIFFTX();
+   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) ); 
+   edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );   
+   edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );   
+   edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );   
+   edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );   
+
+   mm512_intrlv80_8x64( vdata, edata );
+   *noncev = _mm512_add_epi32( *noncev, _mm512_set_epi32(
+                       0, 7, 0, 6, 0, 5, 0, 4, 0, 3, 0, 2, 0, 1, 0, 0 ) );
+   blake512_8way_prehash_le( &blake512_8way_ctx, x25x_8way_midstate, vdata ); 

-   mm512_bswap32_intrlv80_8x64( vdata, pdata );
-   *noncev = mm512_intrlv_blend_32(
-              _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
-                                n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
   do
   {
      if ( x25x_8way_hash( hash, vdata, thr_id ) );
@@ -588,12 +595,11 @@ int scanhash_x25x_8way( struct work *work, uint32_t max_nonce,
         extr_lane_8x32( lane_hash, hash, lane, 256 );
         if ( likely( valid_hash( lane_hash, ptarget ) ) )
         {
-            pdata[19] = bswap_32( n + lane );
+            pdata[19] = n + lane;
            submit_solution( work, lane_hash, mythr );
         }
      }
-      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+      *noncev = _mm512_add_epi32( *noncev, eight );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -637,8 +643,12 @@ union _x25x_4way_ctx_overlay
    panama_4way_context     panama;
    blake2s_4way_state      blake2s;
 };
+
 typedef union _x25x_4way_ctx_overlay x25x_4way_ctx_overlay;

+static __thread __m256i x25x_4way_midstate[16] __attribute__((aligned(64)));
+static __thread blake512_4way_context blake512_4way_ctx __attribute__((aligned(64)));
+
 int x25x_4way_hash( void *output, const void *input, int thrid )
 {
   uint64_t vhash[8*4] __attribute__ ((aligned (128)));
@@ -651,7 +661,9 @@ int x25x_4way_hash( void *output, const void *input, int thrid )
   uint64_t vhashB[8*4] __attribute__ ((aligned (64)));
   x25x_4way_ctx_overlay ctx __attribute__ ((aligned (64)));

-   blake512_4way_full( &ctx.blake, vhash, input, 80 );
+   blake512_4way_final_le( &blake512_4way_ctx, vhash, casti_m256i( input, 9 ),
+                                                x25x_4way_midstate );
+   
   dintrlv_4x64_512( hash0[0], hash1[0], hash2[0], hash3[0], vhash );

   bmw512_4way_init( &ctx.bmw );
@@ -905,6 +917,7 @@ int scanhash_x25x_4way( struct work* work, uint32_t max_nonce,
   uint32_t hash[8*4] __attribute__ ((aligned (64)));
   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+   __m128i edata[5] __attribute__ ((aligned (64)));
   uint32_t *hashd7 = &(hash[ 7*4 ]);
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
@@ -914,15 +927,22 @@ int scanhash_x25x_4way( struct work* work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32 = ptarget[7];
+   const __m256i four = m256_const1_64( 4 );
   const bool bench = opt_benchmark;

   if ( bench ) ptarget[7] = 0x08ff;

-   InitializeSWIFFTX();
+   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
+   edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
+   edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
+   edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );
+   edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );

-   mm256_bswap32_intrlv80_4x64( vdata, pdata );
-   *noncev = mm256_intrlv_blend_32(
-                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
+   mm256_intrlv80_4x64( vdata, edata );
+   *noncev = _mm256_add_epi32( *noncev, _mm256_set_epi32(
+                                                 0, 3, 0, 2, 0, 1, 0, 0 ) );
+   blake512_4way_prehash_le( &blake512_4way_ctx, x25x_4way_midstate, vdata );
+   
   do
   {
      if ( x25x_4way_hash( hash, vdata, thr_id ) )
@@ -932,12 +952,11 @@ int scanhash_x25x_4way( struct work* work, uint32_t max_nonce,
         extr_lane_4x32( lane_hash, hash, lane, 256 );
         if ( valid_hash( lane_hash, ptarget ) )
         {
-            pdata[19] = bswap_32( n + lane );
+            pdata[19] = n + lane;
            submit_solution( work, lane_hash, mythr );
         }
      }
-      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+      *noncev = _mm256_add_epi32( *noncev, four );
      n += 4;
   } while ( likely( ( n <= last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
--- a/algo/yescrypt/yescrypt-best.c
+++ b/algo/yescrypt/yescrypt-best.c
@@ -1,5 +0,0 @@
-#ifdef __SSE2__
-#include "yescrypt-simd.c"
-#else
-#include "yescrypt-opt.c"
-#endif
--- a/algo/yescrypt/yescrypt-platform.h
+++ b/algo/yescrypt/yescrypt-platform.h
@@ -1,213 +0,0 @@
-/*-
- * Copyright 2013,2014 Alexander Peslyak
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted.
- *
- * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
-
-#ifdef MAP_ANON
-#include <sys/mman.h>
-#endif
-
-#include "yescrypt.h"
-#define HUGEPAGE_THRESHOLD		(12 * 1024 * 1024)
-
-#ifdef __x86_64__
-#define HUGEPAGE_SIZE			(2 * 1024 * 1024)
-#else
-#undef HUGEPAGE_SIZE
-#endif
-
-/*
-static __inline uint32_t
-le32dec(const void *pp)
-{
-	const uint8_t *p = (uint8_t const *)pp;
-
-	return ((uint32_t)(p[0]) + ((uint32_t)(p[1]) << 8) +
-	    ((uint32_t)(p[2]) << 16) + ((uint32_t)(p[3]) << 24));
-}
-
-static __inline void
-le32enc(void *pp, uint32_t x)
-{
-	uint8_t * p = (uint8_t *)pp;
-
-	p[0] = x & 0xff;
-	p[1] = (x >> 8) & 0xff;
-	p[2] = (x >> 16) & 0xff;
-	p[3] = (x >> 24) & 0xff;
-}
-*/
-
-static void *
-alloc_region(yescrypt_region_t * region, size_t size)
-{
-	size_t base_size = size;
-	uint8_t * base, * aligned;
-#ifdef MAP_ANON
-	int flags =
-#ifdef MAP_NOCORE
-	    MAP_NOCORE |
-#endif
-	    MAP_ANON | MAP_PRIVATE;
-#if defined(MAP_HUGETLB) && defined(HUGEPAGE_SIZE)
-	size_t new_size = size;
-	const size_t hugepage_mask = (size_t)HUGEPAGE_SIZE - 1;
-	if (size >= HUGEPAGE_THRESHOLD && size + hugepage_mask >= size) {
-		flags |= MAP_HUGETLB;
-/*
- * Linux's munmap() fails on MAP_HUGETLB mappings if size is not a multiple of
- * huge page size, so let's round up to huge page size here.
- */
-		new_size = size + hugepage_mask;
-		new_size &= ~hugepage_mask;
-	}
-	base = mmap(NULL, new_size, PROT_READ | PROT_WRITE, flags, -1, 0);
-	if (base != MAP_FAILED) {
-		base_size = new_size;
-	} else
-	if (flags & MAP_HUGETLB) {
-		flags &= ~MAP_HUGETLB;
-		base = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
-	}
-
-#else
-	base = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
-#endif
-	if (base == MAP_FAILED)
-		base = NULL;
-	aligned = base;
-#elif defined(HAVE_POSIX_MEMALIGN)
-	if ((errno = posix_memalign((void **)&base, 64, size)) != 0)
-		base = NULL;
-	aligned = base;
-#else
-	base = aligned = NULL;
-	if (size + 63 < size) {
-		errno = ENOMEM;
-	} else if ((base = malloc(size + 63)) != NULL) {
-		aligned = base + 63;
-		aligned -= (uintptr_t)aligned & 63;
-	}
-#endif
-	region->base = base;
-	region->aligned = aligned;
-	region->base_size = base ? base_size : 0;
-	region->aligned_size = base ? size : 0;
-	return aligned;
-}
-
-static __inline void
-init_region(yescrypt_region_t * region)
-{
-	region->base = region->aligned = NULL;
-	region->base_size = region->aligned_size = 0;
-}
-
-static int
-free_region(yescrypt_region_t * region)
-{
-	if (region->base) {
-#ifdef MAP_ANON
-		if (munmap(region->base, region->base_size))
-			return -1;
-#else
-		free(region->base);
-#endif
-	}
-	init_region(region);
-	return 0;
-}
-
-int yescrypt_init_shared(yescrypt_shared_t * shared, const uint8_t * param, size_t paramlen,
-    uint64_t N, uint32_t r, uint32_t p, yescrypt_init_shared_flags_t flags, uint32_t mask,
-    uint8_t * buf, size_t buflen)
-{
-	yescrypt_shared1_t* shared1 = &shared->shared1;
-	yescrypt_shared_t dummy, half1, half2;
-	uint8_t salt[32];
-
-	if (flags & YESCRYPT_SHARED_PREALLOCATED) {
-		if (!shared1->aligned || !shared1->aligned_size)
-			return -1;
-	} else {
-		init_region(shared1);
-	}
-	shared->mask1 = 1;
-	if (!param && !paramlen && !N && !r && !p && !buf && !buflen)
-		return 0;
-
-	init_region(&dummy.shared1);
-	dummy.mask1 = 1;
-	if (yescrypt_kdf(&dummy, shared1,
-	    param, paramlen, NULL, 0, N, r, p, 0,
-	    YESCRYPT_RW | YESCRYPT_PARALLEL_SMIX | __YESCRYPT_INIT_SHARED_1,
-	    salt, sizeof(salt), 0 ) )
-		goto out;
-
-	half1 = half2 = *shared;
-	half1.shared1.aligned_size /= 2;
-	half2.shared1.aligned = (void*) ((size_t)half2.shared1.aligned + half1.shared1.aligned_size);
-	half2.shared1.aligned_size = half1.shared1.aligned_size;
-	N /= 2;
-
-	if (p > 1 && yescrypt_kdf(&half1, &half2.shared1,
-	    param, paramlen, salt, sizeof(salt), N, r, p, 0,
-	    YESCRYPT_RW | YESCRYPT_PARALLEL_SMIX | __YESCRYPT_INIT_SHARED_2,
-	    salt, sizeof(salt), 0 ))
-		goto out;
-
-	if (yescrypt_kdf(&half2, &half1.shared1,
-	    param, paramlen, salt, sizeof(salt), N, r, p, 0,
-	    YESCRYPT_RW | YESCRYPT_PARALLEL_SMIX | __YESCRYPT_INIT_SHARED_1,
-	    salt, sizeof(salt), 0))
-		goto out;
-
-	if (yescrypt_kdf(&half1, &half2.shared1,
-	    param, paramlen, salt, sizeof(salt), N, r, p, 0,
-	    YESCRYPT_RW | YESCRYPT_PARALLEL_SMIX | __YESCRYPT_INIT_SHARED_1,
-	    buf, buflen, 0))
-		goto out;
-
-	shared->mask1 = mask;
-
-	return 0;
-
-out:
-	if (!(flags & YESCRYPT_SHARED_PREALLOCATED))
-		free_region(shared1);
-	return -1;
-}
-
-int
-yescrypt_free_shared(yescrypt_shared_t * shared)
-{
-	return free_region(&shared->shared1);
-}
-
-int
-yescrypt_init_local(yescrypt_local_t * local)
-{
-	init_region(local);
-	return 0;
-}
-
-int
-yescrypt_free_local(yescrypt_local_t * local)
-{
-	return free_region(local);
-}
--- a/algo/yescrypt/yescrypt-simd.c
+++ b/algo/yescrypt/yescrypt-simd.c
--- a/algo/yescrypt/yescrypt.c
+++ b/algo/yescrypt/yescrypt.c
@@ -1,488 +0,0 @@
-/*-
- * Copyright 2013,2014 Alexander Peslyak
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted.
- *
- * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
-
-#include <stdint.h>
-#include <string.h>
-#include <stdio.h>
-
-#include "compat.h"
-
-#include "yescrypt.h"
-#include "algo/sha/hmac-sha256-hash.h"
-#include "algo-gate-api.h"
-
-#define BYTES2CHARS(bytes) \
-	((((bytes) * 8) + 5) / 6)
-
-#define HASH_SIZE 32 /* bytes */
-#define HASH_LEN BYTES2CHARS(HASH_SIZE) /* base-64 chars */
-#define YESCRYPT_FLAGS (YESCRYPT_RW | YESCRYPT_PWXFORM)
-
-static const char * const itoa64 =
-	"./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
-
-static uint8_t* encode64_uint32(uint8_t* dst, size_t dstlen, uint32_t src, uint32_t srcbits)
-{
-	uint32_t bit;
-
-	for (bit = 0; bit < srcbits; bit += 6) {
-		if (dstlen < 1)
-			return NULL;
-		*dst++ = itoa64[src & 0x3f];
-		dstlen--;
-		src >>= 6;
-	}
-
-	return dst;
-}
-
-static uint8_t* encode64(uint8_t* dst, size_t dstlen, const uint8_t* src, size_t srclen)
-{
-	size_t i;
-
-	for (i = 0; i < srclen; ) {
-		uint8_t * dnext;
-		uint32_t value = 0, bits = 0;
-		do {
-			value |= (uint32_t)src[i++] << bits;
-			bits += 8;
-		} while (bits < 24 && i < srclen);
-		dnext = encode64_uint32(dst, dstlen, value, bits);
-		if (!dnext)
-			return NULL;
-		dstlen -= dnext - dst;
-		dst = dnext;
-	}
-
-	return dst;
-}
-
-static int decode64_one(uint32_t* dst, uint8_t src)
-{
-	const char * ptr = strchr(itoa64, src);
-	if (ptr) {
-		*dst = (uint32_t) (ptr - itoa64);
-		return 0;
-	}
-	*dst = 0;
-	return -1;
-}
-
-static const uint8_t* decode64_uint32(uint32_t* dst, uint32_t dstbits, const uint8_t* src)
-{
-	uint32_t bit;
-	uint32_t value;
-
-	value = 0;
-	for (bit = 0; bit < dstbits; bit += 6) {
-		uint32_t one;
-		if (decode64_one(&one, *src)) {
-			*dst = 0;
-			return NULL;
-		}
-		src++;
-		value |= one << bit;
-	}
-
-	*dst = value;
-	return src;
-}
-
-uint8_t* yescrypt_r(const yescrypt_shared_t* shared, yescrypt_local_t* local,
-    const uint8_t* passwd, size_t passwdlen, const uint8_t* setting,
-    uint8_t* buf, size_t buflen, int thrid )
-{
-	uint8_t hash[HASH_SIZE];
-	const uint8_t * src, * salt;
-	uint8_t * dst;
-	size_t prefixlen, saltlen, need;
-	uint8_t version;
-	uint64_t N;
-	uint32_t r, p;
-	yescrypt_flags_t flags = YESCRYPT_WORM;
-
-	printf("pass1 ...");
-	fflush(stdout);
-
-	if (setting[0] != '$' || setting[1] != '7') {
-		printf("died$7 ...");
-		fflush(stdout);
-		return NULL;
-	}
-
-	printf("died80 ...");
-	fflush(stdout);
-
-	src = setting + 2;
-
-	printf("hello '%p'\n", (char *)src);
-	fflush(stdout);
-
-	switch ((version = *src)) {
-	case '$':
-		printf("died2 ...");
-		fflush(stdout);
-		break;
-	case 'X':
-		src++;
-		flags = YESCRYPT_RW;
-		printf("died3 ...");
-		fflush(stdout);
-		break;
-	default:
-		printf("died4 ...");
-		fflush(stdout);
-		return NULL;
-	}
-
-	printf("pass2 ...");
-	fflush(stdout);
-
-	if (*src != '$') {
-		uint32_t decoded_flags;
-		if (decode64_one(&decoded_flags, *src)) {
-			printf("died5 ...");
-			fflush(stdout);
-			return NULL;
-		}
-		flags = decoded_flags;
-		if (*++src != '$') {
-			printf("died6 ...");
-			fflush(stdout);
-			return NULL;
-		}
-	}
-
-	src++;
-
-	{
-		uint32_t N_log2;
-		if (decode64_one(&N_log2, *src)) {
-			printf("died7 ...");
-			return NULL;
-		}
-		src++;
-		N = (uint64_t)1 << N_log2;
-	}
-
-	src = decode64_uint32(&r, 30, src);
-	if (!src) {
-		printf("died6 ...");
-		return NULL;
-	}
-
-	src = decode64_uint32(&p, 30, src);
-	if (!src) {
-		printf("died7 ...");
-		return NULL;
-	}
-
-	prefixlen = src - setting;
-
-	salt = src;
-	src = (uint8_t *)strrchr((char *)salt, '$');
-	if (src)
-		saltlen = src - salt;
-	else
-		saltlen = strlen((char *)salt);
-
-	need = prefixlen + saltlen + 1 + HASH_LEN + 1;
-	if (need > buflen || need < saltlen) {
-		printf("'%d %d %d'", (int) need, (int) buflen, (int) saltlen);
-		printf("died8killbuf ...");
-		fflush(stdout);
-		return NULL;
-	}
-
-	if ( yescrypt_kdf( shared, local, passwd, passwdlen, salt, saltlen, N, r, p,
-            0, flags, hash, sizeof(hash), thrid ) == -1 )
-   {
-		printf("died10 ...");
-		fflush(stdout);
-		return NULL;
-	}
-
-	dst = buf;
-	memcpy(dst, setting, prefixlen + saltlen);
-	dst += prefixlen + saltlen;
-	*dst++ = '$';
-
-	dst = encode64(dst, buflen - (dst - buf), hash, sizeof(hash));
-	/* Could zeroize hash[] here, but yescrypt_kdf() doesn't zeroize its
-	 * memory allocations yet anyway. */
-	if (!dst || dst >= buf + buflen) { /* Can't happen */
-		printf("died11 ...");
-		return NULL;
-	}
-
-	*dst = 0; /* NUL termination */
-
-	printf("died12 ...");
-	fflush(stdout);
-
-	return buf;
-}
-
-uint8_t* yescrypt(const uint8_t* passwd, const uint8_t* setting, int thrid )
-{
-	static uint8_t buf[4 + 1 + 5 + 5 + BYTES2CHARS(32) + 1 + HASH_LEN + 1];
-	yescrypt_shared_t shared;
-	yescrypt_local_t local;
-	uint8_t * retval;
-
-	if (yescrypt_init_shared(&shared, NULL, 0,
-	    0, 0, 0, YESCRYPT_SHARED_DEFAULTS, 0, NULL, 0))
-		return NULL;
-	if (yescrypt_init_local(&local)) {
-		yescrypt_free_shared(&shared);
-		return NULL;
-	}
-	retval = yescrypt_r(&shared, &local,
-	    passwd, 80, setting, buf, sizeof(buf), thrid );
-	//printf("hashse='%s'\n", (char *)retval);
-	if (yescrypt_free_local(&local)) {
-		yescrypt_free_shared(&shared);
-		return NULL;
-	}
-	if (yescrypt_free_shared(&shared))
-		return NULL;
-	return retval;
-}
-
-uint8_t* yescrypt_gensalt_r(uint32_t N_log2, uint32_t r, uint32_t p, yescrypt_flags_t flags,
-    const uint8_t* src, size_t srclen, uint8_t* buf, size_t buflen)
-{
-	uint8_t * dst;
-	size_t prefixlen = 3 + 1 + 5 + 5;
-	size_t saltlen = BYTES2CHARS(srclen);
-	size_t need;
-
-	if (p == 1)
-		flags &= ~YESCRYPT_PARALLEL_SMIX;
-
-	if (flags) {
-		if (flags & ~0x3f)
-			return NULL;
-
-		prefixlen++;
-		if (flags != YESCRYPT_RW)
-			prefixlen++;
-	}
-
-	need = prefixlen + saltlen + 1;
-	if (need > buflen || need < saltlen || saltlen < srclen)
-		return NULL;
-
-	if (N_log2 > 63 || ((uint64_t)r * (uint64_t)p >= (1U << 30)))
-		return NULL;
-
-	dst = buf;
-	*dst++ = '$';
-	*dst++ = '7';
-	if (flags) {
-		*dst++ = 'X'; /* eXperimental, subject to change */
-		if (flags != YESCRYPT_RW)
-			*dst++ = itoa64[flags];
-	}
-	*dst++ = '$';
-
-	*dst++ = itoa64[N_log2];
-
-	dst = encode64_uint32(dst, buflen - (dst - buf), r, 30);
-	if (!dst) /* Can't happen */
-		return NULL;
-
-	dst = encode64_uint32(dst, buflen - (dst - buf), p, 30);
-	if (!dst) /* Can't happen */
-		return NULL;
-
-	dst = encode64(dst, buflen - (dst - buf), src, srclen);
-	if (!dst || dst >= buf + buflen) /* Can't happen */
-		return NULL;
-
-	*dst = 0; /* NUL termination */
-
-	return buf;
-}
-
-uint8_t* yescrypt_gensalt(uint32_t N_log2, uint32_t r, uint32_t p, yescrypt_flags_t flags,
-    const uint8_t * src, size_t srclen)
-{
-	static uint8_t buf[4 + 1 + 5 + 5 + BYTES2CHARS(32) + 1];
-	return yescrypt_gensalt_r(N_log2, r, p, flags, src, srclen,
-	    buf, sizeof(buf));
-}
-
-static int yescrypt_bsty(const uint8_t * passwd, size_t passwdlen,
-    const uint8_t * salt, size_t saltlen, uint64_t N, uint32_t r, uint32_t p,
-    uint8_t * buf, size_t buflen, int thrid )
-{
-	static __thread int initialized = 0;
-	static __thread yescrypt_shared_t shared;
-	static __thread yescrypt_local_t local;
-	int retval;
-	if (!initialized) {
-/* "shared" could in fact be shared, but it's simpler to keep it private
- * along with "local".  It's dummy and tiny anyway. */
-		if (yescrypt_init_shared(&shared, NULL, 0,
-		    0, 0, 0, YESCRYPT_SHARED_DEFAULTS, 0, NULL, 0))
-			return -1;
-		if (yescrypt_init_local(&local)) {
-			yescrypt_free_shared(&shared);
-			return -1;
-		}
-		initialized = 1;
-	}
-	retval = yescrypt_kdf(&shared, &local,
-	    passwd, passwdlen, salt, saltlen, N, r, p, 0, YESCRYPT_FLAGS,
-	    buf, buflen, thrid );
-#if 0
-	if (yescrypt_free_local(&local)) {
-		yescrypt_free_shared(&shared);
-		return -1;
-	}
-	if (yescrypt_free_shared(&shared))
-		return -1;
-	initialized = 0;
-#endif
-	return retval;
-}
-
-// scrypt parameters initialized at run time.
-uint64_t YESCRYPT_N;
-uint32_t YESCRYPT_R;
-uint32_t YESCRYPT_P;
-char *yescrypt_client_key = NULL;
-int yescrypt_client_key_len = 0;
-
-/* main hash 80 bytes input */
-int yescrypt_hash( const char *input, char *output, uint32_t len, int thrid )
-{
-   return yescrypt_bsty( (uint8_t*)input, len, (uint8_t*)input, len, YESCRYPT_N,
-                  YESCRYPT_R, YESCRYPT_P, (uint8_t*)output, 32, thrid );
-}
-
-/* for util.c test */
-int yescrypthash(void *output, const void *input, int thrid)
-{
-	return yescrypt_hash((char*) input, (char*) output, 80, thrid);
-}
-
-int scanhash_yescrypt( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t _ALIGN(64) vhash[8];
-   uint32_t _ALIGN(64) endiandata[20];
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce;
-   uint32_t n = first_nonce;
-   int thr_id = mythr->id; 
-
-   for ( int k = 0; k < 19; k++ )
-      be32enc( &endiandata[k], pdata[k] );
-   endiandata[19] = n;
-   do {
-      if ( yescrypt_hash((char*) endiandata, (char*) vhash, 80, thr_id ) )
-      if unlikely( valid_hash( vhash, ptarget ) && !opt_benchmark )
-      {
-          be32enc( pdata+19, n );
-          submit_solution( work, vhash, mythr );
-      }
-      endiandata[19] = ++n;
-   } while ( n < last_nonce && !work_restart[thr_id].restart );
-   *hashes_done = n - first_nonce;
-   pdata[19] = n;
-   return 0;
-}
-
-void yescrypt_gate_base(algo_gate_t *gate )
-{
-   gate->optimizations = SSE2_OPT | SHA_OPT;
-   gate->scanhash   = (void*)&scanhash_yescrypt;
-   gate->hash       = (void*)&yescrypt_hash;
-   opt_target_factor = 65536.0;
-}
-
-bool register_yescrypt_algo( algo_gate_t* gate )
-{
-   yescrypt_gate_base( gate );
-
-   if ( opt_param_n )  YESCRYPT_N = opt_param_n;
-   else                YESCRYPT_N = 2048;
-
-   if ( opt_param_r )  YESCRYPT_R = opt_param_r;
-   else                YESCRYPT_R = 8;
- 
-   if ( opt_param_key ) 
-   {   
-     yescrypt_client_key = opt_param_key;
-     yescrypt_client_key_len = strlen( opt_param_key );
-   }
-   else
-   {   
-     yescrypt_client_key = NULL;
-     yescrypt_client_key_len = 0;
-   }
-
-   YESCRYPT_P = 1;
-
-   applog( LOG_NOTICE,"Yescrypt parameters: N= %d, R= %d", YESCRYPT_N,
-                                                            YESCRYPT_R );
-   if ( yescrypt_client_key )
-     applog( LOG_NOTICE,"Key= \"%s\"\n", yescrypt_client_key );
-
-   return true;
-}
-
-bool register_yescryptr8_algo( algo_gate_t* gate )
-{
-   yescrypt_gate_base( gate );
-   yescrypt_client_key = "Client Key";
-   yescrypt_client_key_len = 10;
-   YESCRYPT_N = 2048;
-   YESCRYPT_R = 8;
-   YESCRYPT_P = 1;
-   return true;
-}
-
-bool register_yescryptr16_algo( algo_gate_t* gate )
-{
-   yescrypt_gate_base( gate );
-   yescrypt_client_key = "Client Key";
-   yescrypt_client_key_len = 10;
-   YESCRYPT_N = 4096;   
-   YESCRYPT_R = 16;   
-   YESCRYPT_P = 1;   
-   return true;
-}
-
-bool register_yescryptr32_algo( algo_gate_t* gate )
-{
-   yescrypt_gate_base( gate );
-   yescrypt_client_key = "WaviBanana";
-   yescrypt_client_key_len = 10;
-   YESCRYPT_N = 4096;
-   YESCRYPT_R = 32;
-   YESCRYPT_P = 1;
-   return true;
-}
-
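Note on the removed scanhash_yescrypt(): the loop above encodes the first 19 header words once, then varies only the nonce, hashes the 80-byte header, and compares the result with the target. A minimal sketch of that general shape follows. It is not the removed code: toy_hash() is a hypothetical stand-in (FNV-1a, not yescrypt), the names store_be32, meets_target and scan_nonces are invented for illustration, byte-order handling is simplified, and the sketch stops at the first hit instead of submitting and continuing. It also omits the work-restart and benchmark checks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian byte order (same idea as be32enc). */
static void store_be32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v >> 24);
    dst[1] = (uint8_t)(v >> 16);
    dst[2] = (uint8_t)(v >> 8);
    dst[3] = (uint8_t)v;
}

/* Hypothetical stand-in for the real hash: FNV-1a over the input,
 * expanded into 8 words. NOT yescrypt; it only lets the loop run. */
static void toy_hash(const uint8_t *in, size_t len, uint32_t out[8])
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= in[i];
        h *= 16777619u;
    }
    for (uint32_t i = 0; i < 8; i++) {
        out[i] = h;
        h = h * 16777619u + i;
    }
}

/* True when hash <= target, comparing from the most significant
 * word (index 7) down to the least significant (index 0). */
static bool meets_target(const uint32_t hash[8], const uint32_t target[8])
{
    for (int i = 7; i >= 0; i--) {
        if (hash[i] < target[i]) return true;
        if (hash[i] > target[i]) return false;
    }
    return true;
}

/* Try nonces in [first, last). On success store the nonce in *found.
 * *hashes_done receives the number of hashes computed either way. */
static bool scan_nonces(const uint32_t header_words[19],
                        const uint32_t target[8],
                        uint32_t first, uint32_t last,
                        uint32_t *found, uint64_t *hashes_done)
{
    uint8_t buf[80];
    uint32_t hash[8];

    /* The first 76 bytes never change, so encode them once. */
    for (int k = 0; k < 19; k++)
        store_be32(buf + 4 * k, header_words[k]);

    *hashes_done = 0;
    for (uint32_t n = first; n < last; n++) {
        store_be32(buf + 76, n);          /* only the nonce varies */
        toy_hash(buf, sizeof buf, hash);
        (*hashes_done)++;
        if (meets_target(hash, target)) {
            *found = n;
            return true;
        }
    }
    return false;
}
```

With a target of all 0xFFFFFFFF words every hash qualifies, so the first nonce is returned after one hash. With an all-zero target the toy hash never qualifies and the whole range is scanned.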
--- a/algo/yescrypt/yescrypt.h
+++ b/algo/yescrypt/yescrypt.h
@@ -1,382 +0,0 @@
-/*-
- * Copyright 2009 Colin Percival
- * Copyright 2013,2014 Alexander Peslyak
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in the
- *    documentation and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- *
- * This file was originally written by Colin Percival as part of the Tarsnap
- * online backup system.
- */
-
-#ifndef YESCRYPT_H
-#define YESCRYPT_H
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-#include <stdint.h>
-#include <stdlib.h> /* for size_t */
-#include <stdbool.h>
-#include "miner.h"
-
-//#define  __SSE4_1__
-
-int yescrypt_hash(const char* input, char* output, uint32_t len, int thrid );
-
-int yescrypthash(void *output, const void *input, int thrid );
-
-/**
- * crypto_scrypt(passwd, passwdlen, salt, saltlen, N, r, p, buf, buflen):
- * Compute scrypt(passwd[0 .. passwdlen - 1], salt[0 .. saltlen - 1], N, r,
- * p, buflen) and write the result into buf.  The parameters r, p, and buflen
- * must satisfy r * p < 2^30 and buflen <= (2^32 - 1) * 32.  The parameter N
- * must be a power of 2 greater than 1.
- *
- * Return 0 on success; or -1 on error.
- *
- * MT-safe as long as buf is local to the thread.
- */
-extern int crypto_scrypt(const uint8_t * __passwd, size_t __passwdlen,
-    const uint8_t * __salt, size_t __saltlen,
-    uint64_t __N, uint32_t __r, uint32_t __p,
-    uint8_t * __buf, size_t __buflen);
-
-/**
- * Internal type used by the memory allocator.  Please do not use it directly.
- * Use yescrypt_shared_t and yescrypt_local_t as appropriate instead, since
- * they might differ from each other in a future version.
- */
-typedef struct {
-	void * base, * aligned;
-	size_t base_size, aligned_size;
-} yescrypt_region_t;
-
-/**
- * Types for shared (ROM) and thread-local (RAM) data structures.
- */
-typedef yescrypt_region_t yescrypt_shared1_t;
-typedef struct {
-	yescrypt_shared1_t shared1;
-	uint32_t mask1;
-} yescrypt_shared_t;
-typedef yescrypt_region_t yescrypt_local_t;
-
-/**
- * Possible values for yescrypt_init_shared()'s flags argument.
- */
-typedef enum {
-	YESCRYPT_SHARED_DEFAULTS = 0,
-	YESCRYPT_SHARED_PREALLOCATED = 0x100
-} yescrypt_init_shared_flags_t;
-
-/**
- * Possible values for the flags argument of yescrypt_kdf(),
- * yescrypt_gensalt_r(), yescrypt_gensalt().  These may be OR'ed together,
- * except that YESCRYPT_WORM and YESCRYPT_RW are mutually exclusive.
- * Please refer to the description of yescrypt_kdf() below for the meaning of
- * these flags.
- */
-typedef enum {
-/* public */
-	YESCRYPT_WORM = 0,
-	YESCRYPT_RW = 1,
-	YESCRYPT_PARALLEL_SMIX = 2,
-	YESCRYPT_PWXFORM = 4,
-/* private */
-	__YESCRYPT_INIT_SHARED_1 = 0x10000,
-	__YESCRYPT_INIT_SHARED_2 = 0x20000,
-	__YESCRYPT_INIT_SHARED = 0x30000
-} yescrypt_flags_t;
-
-extern char *yescrypt_client_key;
-extern int yescrypt_client_key_len;
-
-
-#define YESCRYPT_KNOWN_FLAGS \
-	(YESCRYPT_RW | YESCRYPT_PARALLEL_SMIX | YESCRYPT_PWXFORM | \
-	__YESCRYPT_INIT_SHARED)
-
-/**
- * yescrypt_init_shared(shared, param, paramlen, N, r, p, flags, mask,
- *     buf, buflen):
- * Optionally allocate memory for and initialize the shared (ROM) data
- * structure.  The parameters N, r, and p must satisfy the same conditions as
- * with crypto_scrypt().  param and paramlen specify a local parameter with
- * which the ROM is seeded.  If buf is not NULL, then it is used to return
- * buflen bytes of message digest for the initialized ROM (the caller may use
- * this to verify that the ROM has been computed in the same way that it was on
- * a previous run).
- *
- * Return 0 on success; or -1 on error.
- *
- * If bit YESCRYPT_SHARED_PREALLOCATED in flags is set, then memory for the
- * ROM is assumed to have been preallocated by the caller, with
- * shared->shared1.aligned being the start address of the ROM and
- * shared->shared1.aligned_size being its size (which must be consistent with
- * N, r, and p).  This may be used e.g. when the ROM is to be placed in a SysV
- * shared memory segment allocated by the caller.
- *
- * mask controls the frequency of ROM accesses by yescrypt_kdf().  Normally it
- * should be set to 1, to interleave RAM and ROM accesses, which works well
- * when both regions reside in the machine's RAM anyway.  Other values may be
- * used e.g. when the ROM is memory-mapped from a disk file.  Recommended mask
- * values are powers of 2 minus 1 or minus 2.  Here's the effect of some mask
- * values:
- * mask	value	ROM accesses in SMix 1st loop	ROM accesses in SMix 2nd loop
- *	0		0				1/2
- *	1		1/2				1/2
- *	2		0				1/4
- *	3		1/4				1/4
- *	6		0				1/8
- *	7		1/8				1/8
- *	14		0				1/16
- *	15		1/16				1/16
- *	1022		0				1/1024
- *	1023		1/1024				1/1024
- *
- * Actual computation of the ROM contents may be avoided, if you don't intend
- * to use a ROM but need a dummy shared structure, by calling this function
- * with NULL, 0, 0, 0, 0, YESCRYPT_SHARED_DEFAULTS, 0, NULL, 0 for the
- * arguments starting with param and on.
- *
- * MT-safe as long as shared is local to the thread.
- */
-extern int yescrypt_init_shared(yescrypt_shared_t * __shared,
-    const uint8_t * __param, size_t __paramlen,
-    uint64_t __N, uint32_t __r, uint32_t __p,
-    yescrypt_init_shared_flags_t __flags, uint32_t __mask,
-    uint8_t * __buf, size_t __buflen);
-
-/**
- * yescrypt_free_shared(shared):
- * Free memory that had been allocated with yescrypt_init_shared().
- *
- * Return 0 on success; or -1 on error.
- *
- * MT-safe as long as shared is local to the thread.
- */
-extern int yescrypt_free_shared(yescrypt_shared_t * __shared);
-
-/**
- * yescrypt_init_local(local):
- * Initialize the thread-local (RAM) data structure.  Actual memory allocation
- * is currently fully postponed until a call to yescrypt_kdf() or yescrypt_r().
- *
- * Return 0 on success; or -1 on error.
- *
- * MT-safe as long as local is local to the thread.
- */
-extern int yescrypt_init_local(yescrypt_local_t * __local);
-
-/**
- * yescrypt_free_local(local):
- * Free memory that may have been allocated for an initialized thread-local
- * (RAM) data structure.
- *
- * Return 0 on success; or -1 on error.
- *
- * MT-safe as long as local is local to the thread.
- */
-extern int yescrypt_free_local(yescrypt_local_t * __local);
-
-/**
- * yescrypt_kdf(shared, local, passwd, passwdlen, salt, saltlen,
- *     N, r, p, t, flags, buf, buflen):
- * Compute scrypt(passwd[0 .. passwdlen - 1], salt[0 .. saltlen - 1], N, r,
- * p, buflen), or a revision of scrypt as requested by flags and shared, and
- * write the result into buf.  The parameters N, r, p, and buflen must satisfy
- * the same conditions as with crypto_scrypt().  t controls computation time
- * while not affecting peak memory usage.  shared and flags may request
- * special modes as described below.  local is the thread-local data
- * structure, allowing to preserve and reuse a memory allocation across calls,
- * thereby reducing its overhead.
- *
- * Return 0 on success; or -1 on error.
- *
- * t controls computation time.  t = 0 is optimal in terms of achieving the
- * highest area-time for ASIC attackers.  Thus, higher computation time, if
- * affordable, is best achieved by increasing N rather than by increasing t.
- * However, if the higher memory usage (which goes along with higher N) is not
- * affordable, or if fine-tuning of the time is needed (recall that N must be a
- * power of 2), then t = 1 or above may be used to increase time while staying
- * at the same peak memory usage.  t = 1 increases the time by 25% and
- * decreases the normalized area-time to 96% of optimal.  (Of course, in
- * absolute terms the area-time increases with higher t.  It's just that it
- * would increase slightly more with higher N*r rather than with higher t.)
- * t = 2 increases the time by another 20% and decreases the normalized
- * area-time to 89% of optimal.  Thus, these two values are reasonable to use
- * for fine-tuning.  Values of t higher than 2 result in further increase in
- * time while reducing the efficiency much further (e.g., down to around 50% of
- * optimal for t = 5, which runs 3 to 4 times slower than t = 0, with exact
- * numbers varying by the flags settings).
- *
- * Classic scrypt is available by setting t = 0 and flags to YESCRYPT_WORM and
- * passing a dummy shared structure (see the description of
- * yescrypt_init_shared() above for how to produce one).  In this mode, the
- * thread-local memory region (RAM) is first sequentially written to and then
- * randomly read from.  This algorithm is friendly towards time-memory
- * tradeoffs (TMTO), available both to defenders (albeit not in this
- * implementation) and to attackers.
- *
- * Setting YESCRYPT_RW adds extra random reads and writes to the thread-local
- * memory region (RAM), which makes TMTO a lot less efficient.  This may be
- * used to slow down the kinds of attackers who would otherwise benefit from
- * classic scrypt's efficient TMTO.  Since classic scrypt's TMTO allows not
- * only for the tradeoff, but also for a decrease of attacker's area-time (by
- * up to a constant factor), setting YESCRYPT_RW substantially increases the
- * cost of attacks in area-time terms as well.  Yet another benefit of it is
- * that optimal area-time is reached at an earlier time than with classic
- * scrypt, and t = 0 actually corresponds to this earlier completion time,
- * resulting in quicker hash computations (and thus in higher request rate
- * capacity).  Due to these properties, YESCRYPT_RW should almost always be
- * set, except when compatibility with classic scrypt or TMTO-friendliness are
- * desired.
- *
- * YESCRYPT_PARALLEL_SMIX moves parallelism that is present with p > 1 to a
- * lower level as compared to where it is in classic scrypt.  This reduces
- * flexibility for efficient computation (for both attackers and defenders) by
- * requiring that, short of resorting to TMTO, the full amount of memory be
- * allocated as needed for the specified p, regardless of whether that
- * parallelism is actually being fully made use of or not.  (For comparison, a
- * single instance of classic scrypt may be computed in less memory without any
- * CPU time overhead, but in more real time, by not making full use of the
- * parallelism.)  This may be desirable when the defender has enough memory
- * with sufficiently low latency and high bandwidth for efficient full parallel
- * execution, yet the required memory size is high enough that some likely
- * attackers might end up being forced to choose between using higher latency
- * memory than they could use otherwise (waiting for data longer) or using TMTO
- * (waiting for data more times per one hash computation).  The area-time cost
- * for other kinds of attackers (who would use the same memory type and TMTO
- * factor or no TMTO either way) remains roughly the same, given the same
- * running time for the defender.  In the TMTO-friendly YESCRYPT_WORM mode, as
- * long as the defender has enough memory that is just as fast as the smaller
- * per-thread regions would be, doesn't expect to ever need greater
- * flexibility (except possibly via TMTO), and doesn't need backwards
- * compatibility with classic scrypt, there are no other serious drawbacks to
- * this setting.  In the YESCRYPT_RW mode, which is meant to discourage TMTO,
- * this new approach to parallelization makes TMTO less inefficient.  (This is
- * an unfortunate side-effect of avoiding some random writes, as we have to in
- * order to allow for parallel threads to access a common memory region without
- * synchronization overhead.)  Thus, in this mode this setting poses an extra
- * tradeoff of its own (higher area-time cost for a subset of attackers vs.
- * better TMTO resistance).  Setting YESCRYPT_PARALLEL_SMIX also changes the
- * way the running time is to be controlled from N*r*p (for classic scrypt) to
- * N*r (in this modification).  All of this applies only when p > 1.  For
- * p = 1, this setting is a no-op.
- *
- * Passing a real shared structure, with ROM contents previously computed by
- * yescrypt_init_shared(), enables the use of ROM and requires YESCRYPT_RW for
- * the thread-local RAM region.  In order to allow for initialization of the
- * ROM to be split into a separate program, the shared->shared1.aligned and
- * shared->shared1.aligned_size fields may be set by the caller of
- * yescrypt_kdf() manually rather than with yescrypt_init_shared().
- *
- * local must be initialized with yescrypt_init_local().
- *
- * MT-safe as long as local and buf are local to the thread.
- */
-extern int yescrypt_kdf(const yescrypt_shared_t * __shared,
-    yescrypt_local_t * __local,
-    const uint8_t * __passwd, size_t __passwdlen,
-    const uint8_t * __salt, size_t __saltlen,
-    uint64_t __N, uint32_t __r, uint32_t __p, uint32_t __t,
-    yescrypt_flags_t __flags,
-    uint8_t * __buf, size_t __buflen, int thrid);
-
-/**
- * yescrypt_r(shared, local, passwd, passwdlen, setting, buf, buflen):
- * Compute and encode an scrypt or enhanced scrypt hash of passwd given the
- * parameters and salt value encoded in setting.  If the shared structure is
- * not dummy, a ROM is used and YESCRYPT_RW is required.  Otherwise, whether to
- * use the YESCRYPT_WORM (classic scrypt) or YESCRYPT_RW (time-memory tradeoff
- * discouraging modification) is determined by the setting string.  shared and
- * local must be initialized as described above for yescrypt_kdf().  buf must
- * be large enough (as indicated by buflen) to hold the encoded hash string.
- *
- * Return the encoded hash string on success; or NULL on error.
- *
- * MT-safe as long as local and buf are local to the thread.
- */
-extern uint8_t * yescrypt_r(const yescrypt_shared_t * __shared,
-    yescrypt_local_t * __local,
-    const uint8_t * __passwd, size_t __passwdlen,
-    const uint8_t * __setting,
-    uint8_t * __buf, size_t __buflen, int thrid);
-
-/**
- * yescrypt(passwd, setting):
- * Compute and encode an scrypt or enhanced scrypt hash of passwd given the
- * parameters and salt value encoded in setting.  Whether to use the
- * YESCRYPT_WORM (classic scrypt) or YESCRYPT_RW (time-memory tradeoff
- * discouraging modification) is determined by the setting string.
- *
- * Return the encoded hash string on success; or NULL on error.
- *
- * This is a crypt(3)-like interface, which is simpler to use than
- * yescrypt_r(), but it is not MT-safe, it does not allow for the use of a ROM,
- * and it is slower than yescrypt_r() for repeated calls because it allocates
- * and frees memory on each call.
- *
- * MT-unsafe.
- */
-extern uint8_t * yescrypt(const uint8_t * __passwd, const uint8_t * __setting, int thrid );
-
-/**
- * yescrypt_gensalt_r(N_log2, r, p, flags, src, srclen, buf, buflen):
- * Generate a setting string for use with yescrypt_r() and yescrypt() by
- * encoding into it the parameters N_log2 (which is to be set to base 2
- * logarithm of the desired value for N), r, p, flags, and a salt given by src
- * (of srclen bytes).  buf must be large enough (as indicated by buflen) to
- * hold the setting string.
- *
- * Return the setting string on success; or NULL on error.
- *
- * MT-safe as long as buf is local to the thread.
- */
-extern uint8_t * yescrypt_gensalt_r(
-    uint32_t __N_log2, uint32_t __r, uint32_t __p,
-    yescrypt_flags_t __flags,
-    const uint8_t * __src, size_t __srclen,
-    uint8_t * __buf, size_t __buflen);
-
-/**
- * yescrypt_gensalt(N_log2, r, p, flags, src, srclen):
- * Generate a setting string for use with yescrypt_r() and yescrypt().  This
- * function is the same as yescrypt_gensalt_r() except that it uses a static
- * buffer and thus is not MT-safe.
- *
- * Return the setting string on success; or NULL on error.
- *
- * MT-unsafe.
- */
-extern uint8_t * yescrypt_gensalt(
-    uint32_t __N_log2, uint32_t __r, uint32_t __p,
-    yescrypt_flags_t __flags,
-    const uint8_t * __src, size_t __srclen);
-
-#ifdef __cplusplus
-}
-#endif
-
-#endif
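Note on the parameter constraints documented in the removed header: N must be a power of 2 greater than 1, and r * p must be below 2^30. The sketch below checks those two conditions and estimates the size of the main per-thread memory region as 128 * r * N bytes, the classic scrypt figure. The function names params_valid and approx_ram_bytes are hypothetical, rejecting r = 0 or p = 0 is an assumption not stated in the header, and the estimate ignores the smaller auxiliary buffers an implementation may allocate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Check the documented constraints: N is a power of 2 greater than 1,
 * and r * p < 2^30. Rejecting zero r or p is an assumption. */
static bool params_valid(uint64_t N, uint32_t r, uint32_t p)
{
    if (N < 2 || (N & (N - 1)) != 0)
        return false;
    if (r == 0 || p == 0)
        return false;
    return (uint64_t)r * p < (UINT64_C(1) << 30);
}

/* Approximate size in bytes of the main per-thread region,
 * using the classic scrypt figure of 128 * r * N.
 * No overflow check: intended for small, already validated values. */
static uint64_t approx_ram_bytes(uint64_t N, uint32_t r)
{
    return UINT64_C(128) * r * N;
}
```

Applied to the parameter sets registered in the removed yescrypt.c, this gives about 2 MiB for N = 2048, r = 8; 8 MiB for N = 4096, r = 16; and 16 MiB for N = 4096, r = 32, per mining thread.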
Author	SHA1	Message	Date
Jay D Dee	bd84f199fe	v3.20.3	2022-10-21 23:12:18 -04:00
Jay D Dee	58030e2788	v3.20.2	2022-08-01 20:21:05 -04:00
Jay D Dee	1321ac474c	v3.20.1	2022-07-26 18:36:40 -04:00
Jay D Dee	40d07c0097	v3.20.0	2022-07-17 13:30:50 -04:00
Jay D Dee	f552f2b1e8	v3.19.9	2022-07-10 11:04:00 -04:00
Jay D Dee	26b8927632	v3.19.8	2022-05-27 18:12:30 -04:00
Jay D Dee	db76d3865f	v3.19.7	2022-04-02 12:44:57 -04:00
Jay D Dee	5b678d2481	v3.19.6	2022-02-21 23:14:24 -05:00
Jay D Dee	90137b391e	v3.19.5	2022-01-30 20:59:54 -05:00
Jay D Dee	8727d79182	v3.19.4	2022-01-12 21:08:25 -05:00
Jay D Dee	17ccbc328f	v3.19.3	2022-01-07 12:07:38 -05:00
Jay D Dee	0e3945ddb5	v3.19.2	2021-12-30 16:28:24 -05:00
Jay D Dee	7d2ef7973d	v3.19.1	2021-11-20 00:46:01 -05:00
Jay D Dee	e6fd9b1d69	v3.19.0	2021-11-10 21:33:44 -05:00