v3.23.1

v3.23.0
v3.22.3
2025-09-17 23:44:27 +00:00 · 2023-09-13 11:48:52 -04:00 · 2023-08-30 20:15:48 -04:00 · 2023-06-14 11:07:40 -04:00 · 2023-04-06 13:38:37 -04:00 · 2023-03-24 18:29:42 -04:00
97 changed files with 14050 additions and 6420 deletions
--- a/156
+++ b/156
@@ -1,158 +1,4 @@
-Instructions for compiling cpuminer-opt for Windows.
-
-These intructions are out of date. Please consult the wiki for
-the latest:
+Please consult the wiki for Windows compile instructions.

 https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source

-Windows compilation using Visual Studio is not supported. Mingw64 is
-used on a Linux system (bare metal or virtual machine) to cross-compile
-cpuminer-opt executable binaries for Windows.
-
-These instructions were written for Debian and Ubuntu compatible distributions
-but should work on other major distributions as well. However some of the
-package names or file paths may be different.
-
-It is assumed a Linux system is already available and running. And the user
-has enough Linux knowledge to find and install packages and follow these
-instructions.
-
-First it is a good idea to create new user specifically for cross compiling.
-It keeps all mingw stuff contained and isolated from the rest of the system.
-
-Step by step...
-
-1. Install necessary packages from the distribution's repositories.
-
-Refer to Linux compile instructions and install required packages.
-
-Additionally, install mingw-w64.
-
-sudo apt-get install mingw-w64 libz-mingw-w64-dev
-
-
-2. Create a local library directory for packages to be compiled in the next
-   step. Suggested location is $HOME/usr/lib/
-
-$ mkdir $HOME/usr/lib
-
-3. Download and build other packages for mingw that don't have a mingw64
-   version available in the repositories.
-
-Download the following source code packages from their respective and
-respected download locations, copy them to $HOME/usr/lib/ and uncompress them. 
-
-openssl: https://github.com/openssl/openssl/releases
-
-curl: https://github.com/curl/curl/releases
-
-gmp: https://gmplib.org/download/gmp/
-
-In most cases the latest version is ok but it's safest to download the same major and minor version as included in your distribution. The following uses versions from Ubuntu 20.04. Change version numbers as required.
-
-Run the following commands or follow the supplied instructions. Do not run "make install" unless you are using /usr/lib, which isn't recommended.
-
-Some instructions insist on running "make check". If make check fails it may still work, YMMV.
-
-You can speed up "make" by using all CPU cores available with "-j n" where n is the number of CPU threads you want to use.
-
-openssl:
-
-$ ./Configure mingw64 shared --cross-compile-prefix=x86_64-w64-mingw32-
-$ make
-
-Make may fail with an ld error, just ensure libcrypto-1_1-x64.dll is created.
-
-curl:
-
-$ ./configure --with-winssl --with-winidn --host=x86_64-w64-mingw32
-$ make
-
-gmp:
-
-$ ./configure --host=x86_64-w64-mingw32
-$ make
-
-4. Tweak the environment.
-
-This step is required everytime you login or the commands can be added to .bashrc.
-
-Define some local variables to point to local library.
-
-$ export LOCAL_LIB="$HOME/usr/lib"
-
-$ export LDFLAGS="-L$LOCAL_LIB/curl/lib/.libs -L$LOCAL_LIB/gmp/.libs -L$LOCAL_LIB/openssl"
-
-$ export CONFIGURE_ARGS="--with-curl=$LOCAL_LIB/curl --with-crypto=$LOCAL_LIB/openssl --host=x86_64-w64-mingw32"
-
-Adjust for gcc version:
-
-$ export GCC_MINGW_LIB="/usr/lib/gcc/x86_64-w64-mingw32/9.3-win32"
-
-Create a release directory and copy some dll files previously built. This can be done outside of cpuminer-opt and only needs to be done once. If the release directory is in cpuminer-opt directory it needs to be recreated every time a source package is decompressed.
-
-$ mkdir release
-$ cp /usr/x86_64-w64-mingw32/lib/zlib1.dll release/
-$ cp /usr/x86_64-w64-mingw32/lib/libwinpthread-1.dll release/
-$ cp $GCC_MINGW_LIB/libstdc++-6.dll release/
-$ cp $GCC_MINGW_LIB/libgcc_s_seh-1.dll release/
-$ cp $LOCAL_LIB/openssl/libcrypto-1_1-x64.dll release/
-$ cp $LOCAL_LIB/curl/lib/.libs/libcurl-4.dll release/
-
-The following steps need to be done every time a new source package is
-opened.
-
-5. Download cpuminer-opt
-
-Download the latest source code package of cpumuner-opt to your desired
-location. .zip or .tar.gz, your choice.
-
-https://github.com/JayDDee/cpuminer-opt/releases
-
-Decompress and change to the cpuminer-opt directory.
-
-6. compile
-
-Create a link to the locally compiled version of gmp.h
-
-$ ln -s $LOCAL_LIB/gmp-version/gmp.h ./gmp.h
-
-$ ./autogen.sh
-
-Configure the compiler for the CPU architecture of the host machine:
-
-CFLAGS="-O3 -march=native -Wall" ./configure $CONFIGURE_ARGS
-
-or cross compile for a specific CPU architecture:
-
-CFLAGS="-O3 -march=znver1 -Wall" ./configure $CONFIGURE_ARGS
-
-This will compile for AMD Ryzen.
-
-You can compile more generically for a set of specific CPU features if you know what features you want:
-
-CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure $CONFIGURE_ARGS
-
-This will compile for an older CPU that does not have AVX.
-
-You can find several examples in README.txt
-
-If you have a CPU with more than 64 threads and Windows 7 or higher you can enable the CPU Groups feature by adding the following to CFLAGS:
-
-"-D_WIN32_WINNT=0x0601"
-
-Once you have run configure successfully run the compiler with n CPU threads:
-
-$ make -j n
-
-Copy cpuminer.exe to the release directory, compress and copy the release directory to a Windows system and run cpuminer.exe from the command line.
-
-Run cpuminer
-
-In a command windows change directories to the unzipped release folder. To get a list of all options:
-
-cpuminer.exe --help
-
-Command options are specific to where you mine. Refer to the pool's instructions on how to set them.
-
-
--- a/Makefile.am
+++ b/Makefile.am
@@ -175,6 +175,8 @@ cpuminer_SOURCES = \
  algo/sha/sha256t.c \
  algo/sha/sha256q-4way.c \
  algo/sha/sha256q.c \
+  algo/sha/sha512256d-4way.c \
+  algo/sha/sha256dt.c \
  algo/shabal/sph_shabal.c \
  algo/shabal/shabal-hash-4way.c \
  algo/shavite/sph_shavite.c \
--- a/66
+++ b/66
@@ -65,6 +65,45 @@ If not what makes it happen or not happen?
 Change Log
 ----------

+v3.23.1
+
+#349: Fix sha256t low difficulty shares and low effective hash rate.
+Faster sha256dt:  AVX512 +7%, SHA +200%, AVX2 +5%.
+Faster blakecoin & vanilla: AVX2 +30%, AVX512 +110%.
+Other small improvements and code cleanup.
+
+v3.23.0
+
+#398: Prevent GBT fallback to Getwork on network error.
+#398: Prevent excessive logs while conditional mining is paused during solo mining.
+Fix a false start if stratum doesn't immediately send a new job after connecting.
+Tweak diagonal shuffle in Blake2b & Blake256 1-way SIMD to reduce latency.
+CPUID support for AVX10.
+Initial changes to AVX2 targeted code in preparation for AVX10.
+Code cleanup and miscellaneous small improvements.
+
+v3.22.3
+
+Data interleaving and byte swap optimizations with AVX2, AVX512 & AVX512VBMI.
+Faster Luffa with AVX2 & AVX512.
+Other small optimizations.
+Some code cleanup.
+
+v3.22.2
+
+Added sha512256d & sha256dt algos.
+Fixed intermittent invalid shares with lyra2v2 AVX512.
+Removed application limits on the number of CPUs and threads; HW and OS limits still apply.
+Added a log warning if more threads are defined than active CPUs in affinity mask.
+Improved merkle tree memory management for stratum.
+Added transaction count to New Work log.
+Other small improvements.
+
+v3.22.1
+
+#393 fixed segfault in GBT, regression from v3.22.0.
+More efficient 32 bit data interleaving.
+
 v3.22.0

 Stratum: faster netdiff calculation.
@@ -182,40 +221,29 @@ v3.19.5

 Enhanced stratum-keepalive preemptively resets the stratum connection
 before the server to avoid lost shares.
-
 Added build-msys2.sh shell script for easier compiling on Windows, see Wiki for details.
-
 X16RT: eliminate unnecessary recalculations of the hash order.
-
 Fix a few compiler warnings.
-
 Fixed log colour error when a block is solved.

 v3.19.4

 #359: Fix verthash memory allocation for non-hugepages, broken in v3.19.3.
-
 New option stratum-keepalive prevents stratum timeouts when no shares are
 submitted for several minutes due to high difficulty.
-
 Fixed a bug displaying optimizations for some algos.

 v3.19.3

 Linux: Faster verthash (+25%), scryptn2 (+2%) when huge pages are available.
-
 Small speed up for Hamsi AVX2 & AVX512, Keccak AVX512.

 v3.19.2

 Fixed log displaying incorrect memory usage for scrypt, broken in v3.19.1.
-
 Reduce log noise when replies to submitted shares are lost due to stratum errors.
-
 Fugue prehash optimization for X16r family AVX2 & AVX512.
-
 Small speed improvement for Hamsi AVX2 & AVX512.
-
 Win: With CPU groups enabled the number of CPUs displayed in the ASCII art
 affinity map is the number of CPUs in a CPU group, was number of CPUs up to 64.

@@ -227,7 +255,6 @@ Changes to Windows binaries package:
 - zen build renamed to avx2-sha, supports Zen1 & Zen2,
 - avx512-sha build removed, Rocketlake CPUs can use avx512-sha-vaes,
 - see README.txt for compatibility details.
-
 Fixed a few compiler warnings that are new in GCC 11.
 Other minor fixes.

@@ -241,22 +268,17 @@ Changes to cpu-affinity:
  - streamlined code for more efficient initialization of miner threads,
  - precise affining of each miner thread to a specific CPU,
  - added an option to disable CPU affinity with "--cpu-affinity 0"
-
 Faster sha256t with AVX512 & AVX2.
-
 Added stratum error count to stats log, reported only when non-zero.

 v3.18.2

 Issue #342, fixed Groestl AES on Windows, broken in v3.18.0.
-
 AVX512 for sha256d.
-
 SSE42 and AVX may now be displayed as mining features at startup.
 This is hard coded for each algo, and is only implemented for scrypt
 at this time as it is the only algo with significant performance differences
 with those features.
-
 Fixed an issue where a high hashrate algo could cause excessive invalid hash
 rate log reports when starting up in benchmark mode.

@@ -267,9 +289,7 @@ More speed for scrypt:
 - AVX2 is now used by default on CPUS with SHA but not AVX512,
 - scrypt:1024 performance lost in v3.18.0 is restored,
 - AVX512 & AVX2 improvements to scrypt:1024.
-
 Big speedup for SwiFFTx AVX2 & SSE4.1: x22i +55%, x25x +22%.
-
 Issue #337: fixed a problem that could display negative stats values in the
 first summary report if the report was forced prematurely due to a stratum
 diff change. The stats will still be invalid but should display zeros.
@@ -282,26 +302,19 @@ Complete rewrite of Scrypt code, optimized for large N factor (scryptn2):
  - memory requirements reduced 30-60% depending on CPU architecture,
  - memory usage displayed at startup,
  - scrypt, default N=1024 (LTC), will likely perform slower.
-
 Improved stale share detection and handling for Scrypt with large N factor:
  - abort and discard partially computed hash when new work is detected,
  - quicker response to new job, less time wasted mining stale job.
-
 Improved stale share handling for all algorithms:
  - report possible stale share when new work received with a previously
    submitted share still pending,
  - when new work is detected report the submission of an already completed,
    otherwise valid, but likely stale, share,
  - fixed incorrect block height in stale share log.
-
 Small performance improvements to sha, bmw, cube & hamsi for AVX512 & AVX2.
-
 When stratum disconnects miner threads go to idle until reconnected.
-
 Colour changes to some logs.
-
 Some low level function name changes for clarity and consistency.
-
 The reference hashrate in the summary log and the benchmark total hashrate
 are now the mean hashrate for the session. 

@@ -414,7 +427,6 @@ Fixed neoscrypt BUG log.
 v3.14.3

 #265: more mutex changes to reduce blocking with high thread count.
-
 #267: fixed hodl algo potential memory alignment issue,
      add warning when thread count is not valid for mining hodl algo.

--- a/algo-gate-api.c
+++ b/algo-gate-api.c
@@ -171,7 +171,7 @@ int scanhash_4way_64in_32out( struct work *work, uint32_t max_nonce,
         }
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n <= last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -227,7 +227,7 @@ int scanhash_8way_64in_32out( struct work *work, uint32_t max_nonce,
         }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -337,9 +337,11 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
    case ALGO_QUBIT:        rc = register_qubit_algo         ( gate ); break;
    case ALGO_SCRYPT:       rc = register_scrypt_algo        ( gate ); break;
    case ALGO_SHA256D:      rc = register_sha256d_algo       ( gate ); break;
+    case ALGO_SHA256DT:     rc = register_sha256dt_algo      ( gate ); break;
    case ALGO_SHA256Q:      rc = register_sha256q_algo       ( gate ); break;
    case ALGO_SHA256T:      rc = register_sha256t_algo       ( gate ); break;
    case ALGO_SHA3D:        rc = register_sha3d_algo         ( gate ); break;
+    case ALGO_SHA512256D:   rc = register_sha512256d_algo    ( gate ); break;
    case ALGO_SHAVITE3:     rc = register_shavite_algo       ( gate ); break;
    case ALGO_SKEIN:        rc = register_skein_algo         ( gate ); break;
    case ALGO_SKEIN2:       rc = register_skein2_algo        ( gate ); break;
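
The two nonce hunks in this file only swap the project's `m256_const1_64` / `m512_const1_64` helpers for the standard broadcast intrinsics; the arithmetic looks unchanged. As a rough, portable illustration of what the 4-way add does, the scalar model below assumes that each 64-bit lane carries the nonce in its upper 32 bits (suggested by the `64in_32out` naming, not confirmed here). The helper name `add_epi32_bcast64` is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of _mm256_add_epi32( v, _mm256_set1_epi64x( inc64 ) ):
   the 64-bit constant is broadcast to every 64-bit lane, then added as
   independent 32-bit elements, so no carry crosses a 32-bit boundary.
   Hypothetical helper, not part of cpuminer-opt. */
static void add_epi32_bcast64( uint32_t *v, size_t lanes64, uint64_t inc64 )
{
   assert( v != NULL );
   const uint32_t lo = (uint32_t)( inc64 & 0xffffffffu );
   const uint32_t hi = (uint32_t)( inc64 >> 32 );
   for ( size_t i = 0; i < lanes64; i++ )
   {
      v[ 2*i     ] += lo;   /* even element: low half of the 64-bit lane  */
      v[ 2*i + 1 ] += hi;   /* odd element: high half, assumed nonce slot */
   }
}
```

With `inc64 = 0x0000000400000000` and four lanes, every odd 32-bit element advances by 4 and every even element is left alone, which matches `n += 4` in the loop.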
--- a/algo-gate-api.h
+++ b/algo-gate-api.h
@@ -94,10 +94,13 @@ typedef  uint32_t set_t;
 #define SSE42_OPT        4
 #define AVX_OPT          8   // Sandybridge
 #define AVX2_OPT      0x10   // Haswell, Zen1
-#define SHA_OPT       0x20   // Zen1, Icelake (sha256)
-#define AVX512_OPT    0x40   // Skylake-X (AVX512[F,VL,DQ,BW])
-#define VAES_OPT      0x80   // Icelake (VAES & AVX512)
+#define SHA_OPT       0x20   // Zen1, Icelake (deprecated)
+#define AVX512_OPT    0x40   // Skylake-X, Zen4 (AVX512[F,VL,DQ,BW])
+#define VAES_OPT      0x80   // Icelake, Zen3

+// AVX10 does not have explicit algo features:
+//  AVX10_512 is compatible with AVX512 + VAES
+//  AVX10_256 is compatible with AVX2 + VAES

 // return set containing all elements from sets a & b
 inline set_t set_union ( set_t a, set_t b ) { return a | b; }
@@ -264,6 +267,8 @@ void std_get_new_work( struct work *work, struct work *g_work, int thr_id,
                       uint32_t* end_nonce_ptr );

 void sha256d_gen_merkle_root( char *merkle_root, struct stratum_ctx *sctx );
+void sha256_gen_merkle_root ( char *merkle_root, struct stratum_ctx *sctx );
+// OpenSSL sha256 deprecated
 void SHA256_gen_merkle_root ( char *merkle_root, struct stratum_ctx *sctx );

 bool std_le_work_decode( struct work *work );
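
The new header comment says AVX10 gets no flag of its own and is instead treated as a combination of existing ones. A minimal sketch of that mapping follows, reusing the flag values and the `set_union` body from this header; `avx10_equiv_features` and `features_satisfied` are hypothetical helpers invented for illustration, not functions from cpuminer-opt.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t set_t;

/* Flag values copied from the header in this diff. */
#define AVX2_OPT      0x10
#define AVX512_OPT    0x40
#define VAES_OPT      0x80

/* Same body as the header's set_union. */
static inline set_t set_union( set_t a, set_t b ) { return a | b; }

/* Hypothetical: map an AVX10 vector width to the existing feature
   flags, following the comment added in this diff. */
static set_t avx10_equiv_features( int vector_bits )
{
   assert( vector_bits == 256 || vector_bits == 512 );
   return vector_bits == 512 ? set_union( AVX512_OPT, VAES_OPT )
                             : set_union( AVX2_OPT,   VAES_OPT );
}

/* Hypothetical: true if every feature in 'need' is present in 'have'. */
static int features_satisfied( set_t have, set_t need )
{
   return ( have & need ) == need;
}
```

Under this reading, an AVX10_256 CPU would satisfy an algo that asks for `AVX2_OPT | VAES_OPT` but not one that asks for `AVX512_OPT`.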
--- a/algo/blake/blake-hash-4way.h
+++ b/algo/blake/blake-hash-4way.h
@@ -1,60 +1,15 @@
-/* $Id: sph_blake.h 252 2011-06-07 17:55:14Z tp $ */
-/**
- * BLAKE interface. BLAKE is a family of functions which differ by their
- * output size; this implementation defines BLAKE for output sizes 224,
- * 256, 384 and 512 bits. This implementation conforms to the "third
- * round" specification.
- *
- * ==========================(LICENSE BEGIN)============================
- *
- * Copyright (c) 2007-2010  Projet RNRT SAPHIR
- * 
- * Permission is hereby granted, free of charge, to any person obtaining
- * a copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- * 
- * The above copyright notice and this permission notice shall be
- * included in all copies or substantial portions of the Software.
- * 
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- * ===========================(LICENSE END)=============================
- *
- * @file     sph_blake.h
- * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
- */
-
-#ifndef __BLAKE_HASH_4WAY__
-#define __BLAKE_HASH_4WAY__ 1
-
-#ifdef __cplusplus
-extern "C"{
-#endif
+#ifndef BLAKE_HASH_4WAY__
+#define BLAKE_HASH_4WAY__ 1

 #include <stddef.h>
-#include "algo/sha/sph_types.h"
 #include "simd-utils.h"

-#define SPH_SIZE_blake256   256
-
-#define SPH_SIZE_blake512   512
-
 /////////////////////////
 //
 //  Blake-256 1 way SSE2

 void  blake256_transform_le( uint32_t *H, const uint32_t *buf,
-                             const uint32_t T0, const uint32_t T1 );
+                             const uint32_t T0, const uint32_t T1, int rounds );

 /////////////////////////
 //
@@ -75,13 +30,13 @@ typedef struct {
   int rounds;   // 14 for blake, 8 for blakecoin & vanilla
 } blake_4way_small_context __attribute__ ((aligned (64)));

-// Default, 14 rounds, blake, decred
+// Default, 14 rounds
 typedef blake_4way_small_context blake256_4way_context;
 void blake256_4way_init(void *ctx);
 void blake256_4way_update(void *ctx, const void *data, size_t len);
 void blake256_4way_close(void *ctx, void *dst);

-// 14 rounds, blake, decred
+// 14 rounds
 typedef blake_4way_small_context blake256r14_4way_context;
 void blake256r14_4way_init(void *cc);
 void blake256r14_4way_update(void *cc, const void *data, size_t len);
@@ -103,7 +58,7 @@ typedef struct {
   __m256i buf[16] __attribute__ ((aligned (64)));
   __m256i H[8];
   size_t ptr;
-   sph_u32 T0, T1;
+   uint32_t T0, T1;
   int rounds;   // 14 for blake, 8 for blakecoin & vanilla
 } blake_8way_small_context;

@@ -117,7 +72,7 @@ void blake256_8way_close_le(void *cc, void *dst);
 void blake256_8way_round0_prehash_le( void *midstate, const void *midhash,
                                      void *data );
 void blake256_8way_final_rounds_le( void *final_hash, const void *midstate,
-                                     const void *midhash, const void *data );
+                    const void *midhash, const void *data, const int rounds );

 // 14 rounds, blake, decred
 typedef blake_8way_small_context blake256r14_8way_context;
@@ -138,7 +93,7 @@ typedef struct {
   __m256i H[8];
   __m256i S[4];   
   size_t ptr;
-   sph_u64 T0, T1;
+   uint64_t T0, T1;
 } blake_4way_big_context __attribute__ ((aligned (128)));

 typedef blake_4way_big_context blake512_4way_context;
@@ -180,7 +135,7 @@ void blake256_16way_close_le(void *cc, void *dst);
 void blake256_16way_round0_prehash_le( void *midstate, const void *midhash,
                                       void *data );
 void blake256_16way_final_rounds_le( void *final_hash, const void *midstate,
-                                     const void *midhash, const void *data );
+                     const void *midhash, const void *data, const int rounds );


 // 14 rounds, blake, decred
@@ -204,7 +159,7 @@ typedef struct {
   __m512i H[8];
   __m512i S[4];
   size_t ptr;
-   sph_u64 T0, T1;
+   uint64_t T0, T1;
 } blake_8way_big_context __attribute__ ((aligned (128)));

 typedef blake_8way_big_context blake512_8way_context;
@@ -224,8 +179,4 @@ void blake512_8way_final_le( blake_8way_big_context *sc, void *hash,
 #endif  // AVX512
 #endif  // AVX2

-#ifdef __cplusplus
-}
-#endif
-
 #endif  // BLAKE_HASH_4WAY_H__
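
The header now threads a `rounds` argument through `blake256_transform_le` and the `final_rounds_le` functions, with 14 rounds for blake and 8 for blakecoin & vanilla according to the struct comment. BLAKE-256 defines 10 message permutations, so rounds past the tenth reuse the early ones. The sketch below lists which permutation each round uses; `blake256_round_order` is a hypothetical helper written for this note, not project code.

```c
#include <assert.h>
#include <stddef.h>

/* BLAKE-256 defines 10 message permutations; round r uses r % 10. */
#define BLAKE256_NUM_SIGMA 10

/* Hypothetical helper: write the permutation index used by each round
   into 'order' and return the number of rounds written. */
static size_t blake256_round_order( int rounds, int *order, size_t cap )
{
   assert( rounds == 8 || rounds == 14 );
   assert( order != NULL && cap >= (size_t)rounds );
   for ( int r = 0; r < rounds; r++ )
      order[r] = r % BLAKE256_NUM_SIGMA;
   return (size_t)rounds;
}
```

For 14 rounds the order is 0 through 9 followed by 0, 1, 2, 3; for 8 rounds it stops after permutation 7.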
--- a/algo/blake/blake256-hash-4way.c
+++ b/algo/blake/blake256-hash-4way.c
@@ -40,26 +40,6 @@

 #include "blake-hash-4way.h"

-#ifdef __cplusplus
-extern "C"{
-#endif
-
-#if SPH_SMALL_FOOTPRINT && !defined SPH_SMALL_FOOTPRINT_BLAKE
-#define SPH_SMALL_FOOTPRINT_BLAKE   1
-#endif
-
-#if SPH_SMALL_FOOTPRINT_BLAKE
-#define SPH_COMPACT_BLAKE_32   1
-#endif
-
-#if SPH_64 && (SPH_SMALL_FOOTPRINT_BLAKE || !SPH_64_TRUE)
-#define SPH_COMPACT_BLAKE_64   1
-#endif
-
-#ifdef _MSC_VER
-#pragma warning (disable: 4146)
-#endif
-
 // Blake-256

 static const uint32_t IV256[8] =
@@ -68,7 +48,7 @@ static const uint32_t IV256[8] =
 	0x510E527F, 0x9B05688C,	0x1F83D9AB, 0x5BE0CD19
 };

-#if SPH_COMPACT_BLAKE_32 || SPH_COMPACT_BLAKE_64
+#if 0

 // Blake-256 4 & 8 way, Blake-512 4 way

@@ -273,41 +253,27 @@ static const unsigned sigma[16][16] = {
 #define CSx_(n)     CSx__(n)
 #define CSx__(n)    CS ## n

-#define CS0   SPH_C32(0x243F6A88)
-#define CS1   SPH_C32(0x85A308D3)
-#define CS2   SPH_C32(0x13198A2E)
-#define CS3   SPH_C32(0x03707344)
-#define CS4   SPH_C32(0xA4093822)
-#define CS5   SPH_C32(0x299F31D0)
-#define CS6   SPH_C32(0x082EFA98)
-#define CS7   SPH_C32(0xEC4E6C89)
-#define CS8   SPH_C32(0x452821E6)
-#define CS9   SPH_C32(0x38D01377)
-#define CSA   SPH_C32(0xBE5466CF)
-#define CSB   SPH_C32(0x34E90C6C)
-#define CSC   SPH_C32(0xC0AC29B7)
-#define CSD   SPH_C32(0xC97C50DD)
-#define CSE   SPH_C32(0x3F84D5B5)
-#define CSF   SPH_C32(0xB5470917)
-
-#if SPH_COMPACT_BLAKE_32
-
-static const sph_u32 CS[16] = {
-	SPH_C32(0x243F6A88), SPH_C32(0x85A308D3),
-	SPH_C32(0x13198A2E), SPH_C32(0x03707344),
-	SPH_C32(0xA4093822), SPH_C32(0x299F31D0),
-	SPH_C32(0x082EFA98), SPH_C32(0xEC4E6C89),
-	SPH_C32(0x452821E6), SPH_C32(0x38D01377),
-	SPH_C32(0xBE5466CF), SPH_C32(0x34E90C6C),
-	SPH_C32(0xC0AC29B7), SPH_C32(0xC97C50DD),
-	SPH_C32(0x3F84D5B5), SPH_C32(0xB5470917)
-};
-
-#endif
+#define CS0   0x243F6A88
+#define CS1   0x85A308D3
+#define CS2   0x13198A2E
+#define CS3   0x03707344
+#define CS4   0xA4093822
+#define CS5   0x299F31D0
+#define CS6   0x082EFA98
+#define CS7   0xEC4E6C89
+#define CS8   0x452821E6
+#define CS9   0x38D01377
+#define CSA   0xBE5466CF
+#define CSB   0x34E90C6C
+#define CSC   0xC0AC29B7
+#define CSD   0xC97C50DD
+#define CSE   0x3F84D5B5
+#define CSF   0xB5470917

 /////////////////////////////////////////
 //
 // Blake-256 1 way SIMD
+// Only used for prehash, otherwise 4way is used with SSE2.

 #define BLAKE256_ROUND( r ) \
 { \
@@ -327,32 +293,33 @@ static const sph_u32 CS[16] = {
   V3 = mm128_shuflr32_8( _mm_xor_si128( V3, V0 ) ); \
   V2 = _mm_add_epi32( V2, V3 ); \
   V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 7 ); \
-   V3 = mm128_shufll_32( V3 ); \
-   V2 = mm128_swap_64( V2 ); \
-   V1 = mm128_shuflr_32( V1 ); \
+   V0 = mm128_shufll_32( V0 ); \
+   V3 = mm128_swap_64( V3 ); \
+   V2 = mm128_shuflr_32( V2 ); \
   V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
-                           _mm_set_epi32( CSx( r, F ) ^ Mx( r, E ), \
-                                          CSx( r, D ) ^ Mx( r, C ), \
+                           _mm_set_epi32( CSx( r, D ) ^ Mx( r, C ), \
                                          CSx( r, B ) ^ Mx( r, A ), \
-                                          CSx( r, 9 ) ^ Mx( r, 8 ) ) ) ); \
+                                          CSx( r, 9 ) ^ Mx( r, 8 ), \
+                                          CSx( r, F ) ^ Mx( r, E ) ) ) ); \
   V3 = mm128_swap32_16( _mm_xor_si128( V3, V0 ) ); \
   V2 = _mm_add_epi32( V2, V3 ); \
   V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 12 ); \
   V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
-                           _mm_set_epi32( CSx( r, E ) ^ Mx( r, F ), \
-                                          CSx( r, C ) ^ Mx( r, D ), \
+                           _mm_set_epi32( CSx( r, C ) ^ Mx( r, D ), \
                                          CSx( r, A ) ^ Mx( r, B ), \
-                                          CSx( r, 8 ) ^ Mx( r, 9 ) ) ) ); \
+                                          CSx( r, 8 ) ^ Mx( r, 9 ), \
+                                          CSx( r, E ) ^ Mx( r, F ) ) ) ); \
   V3 = mm128_shuflr32_8( _mm_xor_si128( V3, V0 ) ); \
   V2 = _mm_add_epi32( V2, V3 ); \
   V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 7 ); \
-   V3 = mm128_shuflr_32( V3 ); \
-   V2 = mm128_swap_64( V2 ); \
-   V1 = mm128_shufll_32( V1 ); \
+   V0 = mm128_shuflr_32( V0 ); \
+   V3 = mm128_swap_64( V3 ); \
+   V2 = mm128_shufll_32( V2 ); \
 }

+// Default is 14 rounds, blakecoin & vanilla are 8.
 void blake256_transform_le( uint32_t *H, const uint32_t *buf,
-                            const uint32_t T0, const uint32_t T1 )
+                            const uint32_t T0, const uint32_t T1, int rounds )
 {
   __m128i V0, V1, V2, V3;
   uint32_t M0, M1, M2, M3, M4, M5, M6, M7, M8, M9, MA, MB, MC, MD, ME, MF;
@@ -385,12 +352,15 @@ void blake256_transform_le( uint32_t *H, const uint32_t *buf,
   BLAKE256_ROUND( 5 );
   BLAKE256_ROUND( 6 );
   BLAKE256_ROUND( 7 );
-   BLAKE256_ROUND( 8 );
-   BLAKE256_ROUND( 9 );
-   BLAKE256_ROUND( 0 );
-   BLAKE256_ROUND( 1 );
-   BLAKE256_ROUND( 2 );
-   BLAKE256_ROUND( 3 );
+   if ( rounds > 8 )     // 14
+   {
+      BLAKE256_ROUND( 8 );
+      BLAKE256_ROUND( 9 );
+      BLAKE256_ROUND( 0 );
+      BLAKE256_ROUND( 1 );
+      BLAKE256_ROUND( 2 );
+      BLAKE256_ROUND( 3 );
+   }
   casti_m128i( H, 0 ) = mm128_xor3( casti_m128i( H, 0 ), V0, V2 );
   casti_m128i( H, 1 ) = mm128_xor3( casti_m128i( H, 1 ), V1, V3 );
 }
@@ -413,34 +383,6 @@ void blake256_transform_le( uint32_t *H, const uint32_t *buf,
   b = mm128_ror_32( _mm_xor_si128( b, c ), 7 ); \
 }

-#if SPH_COMPACT_BLAKE_32
-
-// Not used
-#if 0
-
-#define ROUND_S_4WAY(r)   do { \
-	GS_4WAY(M[sigma[r][0x0]], M[sigma[r][0x1]], \
-		CS[sigma[r][0x0]], CS[sigma[r][0x1]], V0, V4, V8, VC); \
-	GS_4WAY(M[sigma[r][0x2]], M[sigma[r][0x3]], \
-		CS[sigma[r][0x2]], CS[sigma[r][0x3]], V1, V5, V9, VD); \
-	GS_4WAY(M[sigma[r][0x4]], M[sigma[r][0x5]], \
-		CS[sigma[r][0x4]], CS[sigma[r][0x5]], V2, V6, VA, VE); \
-	GS_4WAY(M[sigma[r][0x6]], M[sigma[r][0x7]], \
-		CS[sigma[r][0x6]], CS[sigma[r][0x7]], V3, V7, VB, VF); \
-	GS_4WAY(M[sigma[r][0x8]], M[sigma[r][0x9]], \
-		CS[sigma[r][0x8]], CS[sigma[r][0x9]], V0, V5, VA, VF); \
-	GS_4WAY(M[sigma[r][0xA]], M[sigma[r][0xB]], \
-		CS[sigma[r][0xA]], CS[sigma[r][0xB]], V1, V6, VB, VC); \
-	GS_4WAY(M[sigma[r][0xC]], M[sigma[r][0xD]], \
-		CS[sigma[r][0xC]], CS[sigma[r][0xD]], V2, V7, V8, VD); \
-	GS_4WAY(M[sigma[r][0xE]], M[sigma[r][0xF]], \
-		CS[sigma[r][0xE]], CS[sigma[r][0xF]], V3, V4, V9, VE); \
-} while (0)
-
-#endif
-
-#else
-
 #define ROUND_S_4WAY(r) \
 { \
 	GS_4WAY(Mx(r, 0), Mx(r, 1), CSx(r, 0), CSx(r, 1), V0, V4, V8, VC); \
@@ -453,8 +395,6 @@ void blake256_transform_le( uint32_t *H, const uint32_t *buf,
 	GS_4WAY(Mx(r, E), Mx(r, F), CSx(r, E), CSx(r, F), V3, V4, V9, VE); \
 }

-#endif
-
 #define DECL_STATE32_4WAY \
 	__m128i H0, H1, H2, H3, H4, H5, H6, H7; \
        uint32_t T0, T1;
@@ -485,56 +425,6 @@ void blake256_transform_le( uint32_t *H, const uint32_t *buf,
 		(state)->T1 = T1; \
 	} while (0)

-#if SPH_COMPACT_BLAKE_32
-// not used
-#if 0
-#define COMPRESS32_4WAY( rounds )   do { \
-	__m128i M[16]; \
-	__m128i V0, V1, V2, V3, V4, V5, V6, V7; \
-	__m128i V8, V9, VA, VB, VC, VD, VE, VF; \
-	unsigned r; \
-	V0 = H0; \
-	V1 = H1; \
-	V2 = H2; \
-	V3 = H3; \
-	V4 = H4; \
-	V5 = H5; \
-	V6 = H6; \
-	V7 = H7; \
-   V8 = _mm_xor_si128( S0, _mm_set1_epi32( CS0 ) ); \
-   V9 = _mm_xor_si128( S1, _mm_set1_epi32( CS1 ) ); \
-   VA = _mm_xor_si128( S2, _mm_set1_epi32( CS2 ) ); \
-   VB = _mm_xor_si128( S3, _mm_set1_epi32( CS3 ) ); \
-   VC = _mm_xor_si128( _mm_set1_epi32( T0 ), _mm_set1_epi32( CS4 ) ); \
-   VD = _mm_xor_si128( _mm_set1_epi32( T0 ), _mm_set1_epi32( CS5 ) ); \
-   VE = _mm_xor_si128( _mm_set1_epi32( T1 ), _mm_set1_epi32( CS6 ) ); \
-   VF = _mm_xor_si128( _mm_set1_epi32( T1 ), _mm_set1_epi32( CS7 ) ); \
-   mm128_block_bswap_32( M, buf ); \
-   mm128_block_bswap_32( M+8, buf+8 ); \
-	for (r = 0; r < rounds; r ++) \
-		ROUND_S_4WAY(r); \
-        H0 = _mm_xor_si128( _mm_xor_si128( \
-                                   _mm_xor_si128( S0, V0 ), V8 ), H0 ); \
-        H1 = _mm_xor_si128( _mm_xor_si128( \
-                                   _mm_xor_si128( S1, V1 ), V9 ), H1 ); \
-        H2 = _mm_xor_si128( _mm_xor_si128( \
-                                   _mm_xor_si128( S2, V2 ), VA ), H2 ); \
-        H3 = _mm_xor_si128( _mm_xor_si128( \
-                                   _mm_xor_si128( S3, V3 ), VB ), H3 ); \
-        H4 = _mm_xor_si128( _mm_xor_si128( \
-                                   _mm_xor_si128( S0, V4 ), VC ), H4 ); \
-        H5 = _mm_xor_si128( _mm_xor_si128( \
-                                   _mm_xor_si128( S1, V5 ), VD ), H5 ); \
-        H6 = _mm_xor_si128( _mm_xor_si128( \
-                                   _mm_xor_si128( S2, V6 ), VE ), H6 ); \
-        H7 = _mm_xor_si128( _mm_xor_si128( \
-                                   _mm_xor_si128( S3, V7 ), VF ), H7 ); \
-	} while (0)
-#endif
-
-#else
-
-// current impl

 #if defined(__SSSE3__)

@@ -598,10 +488,10 @@ void blake256_transform_le( uint32_t *H, const uint32_t *buf,
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m128_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m128_const1_64( 0x85A308D385A308D3 ); \
-   VA = m128_const1_64( 0x13198A2E13198A2E ); \
-   VB = m128_const1_64( 0x0370734403707344 ); \
+   V8 = _mm_set1_epi64x( 0x243F6A88243F6A88 ); \
+   V9 = _mm_set1_epi64x( 0x85A308D385A308D3 ); \
+   VA = _mm_set1_epi64x( 0x13198A2E13198A2E ); \
+   VB = _mm_set1_epi64x( 0x0370734403707344 ); \
   VC = _mm_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm_set1_epi32( T1 ^ 0x082EFA98 ); \
@@ -634,8 +524,6 @@ void blake256_transform_le( uint32_t *H, const uint32_t *buf,
   H7 = _mm_xor_si128( _mm_xor_si128( VF, V7 ), H7 ); \
 }

-#endif
-
 #if defined (__AVX2__)

 /////////////////////////////////
@@ -922,7 +810,7 @@ void blake256_transform_le( uint32_t *H, const uint32_t *buf,

 #define DECL_STATE32_8WAY \
   __m256i H0, H1, H2, H3, H4, H5, H6, H7; \
-   sph_u32 T0, T1;
+   uint32_t T0, T1;

 #define READ_STATE32_8WAY(state) \
 do { \
@@ -958,7 +846,6 @@ do { \
   __m256i M8, M9, MA, MB, MC, MD, ME, MF; \
   __m256i V0, V1, V2, V3, V4, V5, V6, V7; \
   __m256i V8, V9, VA, VB, VC, VD, VE, VF; \
-   __m256i shuf_bswap32; \
   V0 = H0; \
   V1 = H1; \
   V2 = H2; \
@@ -967,16 +854,16 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m256_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m256_const1_64( 0x85A308D385A308D3 ); \
-   VA = m256_const1_64( 0x13198A2E13198A2E ); \
-   VB = m256_const1_64( 0x0370734403707344 ); \
+   V8 = _mm256_set1_epi64x( 0x243F6A88243F6A88 ); \
+   V9 = _mm256_set1_epi64x( 0x85A308D385A308D3 ); \
+   VA = _mm256_set1_epi64x( 0x13198A2E13198A2E ); \
+   VB = _mm256_set1_epi64x( 0x0370734403707344 ); \
   VC = _mm256_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm256_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm256_set1_epi32( T1 ^ 0x082EFA98 ); \
   VF = _mm256_set1_epi32( T1 ^ 0xEC4E6C89 ); \
-   shuf_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b, 0x1415161710111213, \
-                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
+   const __m256i shuf_bswap32 = mm256_set2_64( \
+                               0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
   M0 = _mm256_shuffle_epi8( * buf    , shuf_bswap32 ); \
   M1 = _mm256_shuffle_epi8( *(buf+ 1), shuf_bswap32 ); \
   M2 = _mm256_shuffle_epi8( *(buf+ 2), shuf_bswap32 ); \
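
The `shuf_bswap32` change replaces a four-word control vector with a two-word pattern. Assuming `mm256_set2_64` repeats its two 64-bit arguments in both 128-bit lanes (the name suggests this, but its definition is not in this diff), the result should be identical, because `_mm256_shuffle_epi8` shuffles each 128-bit lane separately and only uses the low 4 bits of each control byte as the index. The scalar model below is a sketch of that instruction's behaviour for checking the equivalence; both function names are invented here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm256_shuffle_epi8: each 16-byte lane is shuffled
   on its own; bit 7 of a control byte zeroes the output byte, otherwise
   the low 4 bits pick a source byte from the same lane. */
static void shuffle_epi8_256( uint8_t dst[32], const uint8_t src[32],
                              const uint8_t ctl[32] )
{
   assert( dst != NULL && src != NULL && ctl != NULL );
   uint8_t out[32];
   for ( int i = 0; i < 32; i++ )
   {
      const int lane = i & 0x10;   /* 0 for the low lane, 16 for the high */
      out[i] = ( ctl[i] & 0x80 ) ? 0 : src[ lane + ( ctl[i] & 0x0f ) ];
   }
   memcpy( dst, out, 32 );
}

/* Store four 64-bit control words as 32 little-endian bytes, w[0] lowest. */
static void ctl_from_u64( uint8_t ctl[32], const uint64_t w[4] )
{
   for ( int i = 0; i < 32; i++ )
      ctl[i] = (uint8_t)( w[ i / 8 ] >> ( 8 * ( i % 8 ) ) );
}
```

Feeding this model the old control words (upper lane `0x1c1d1e1f18191a1b, 0x1415161710111213`) and the repeated lower-lane pattern gives the same 32 output bytes, a byte swap of every 32-bit word.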
@@ -1001,7 +888,7 @@ do { \
   ROUND_S_8WAY(5); \
   ROUND_S_8WAY(6); \
   ROUND_S_8WAY(7); \
-   if (rounds == 14) \
+   if (rounds > 8) \
   { \
      ROUND_S_8WAY(8); \
      ROUND_S_8WAY(9); \
@@ -1034,10 +921,10 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m256_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m256_const1_64( 0x85A308D385A308D3 ); \
-   VA = m256_const1_64( 0x13198A2E13198A2E ); \
-   VB = m256_const1_64( 0x0370734403707344 ); \
+   V8 = _mm256_set1_epi64x( 0x243F6A88243F6A88 ); \
+   V9 = _mm256_set1_epi64x( 0x85A308D385A308D3 ); \
+   VA = _mm256_set1_epi64x( 0x13198A2E13198A2E ); \
+   VB = _mm256_set1_epi64x( 0x0370734403707344 ); \
   VC = _mm256_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm256_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm256_set1_epi32( T1 ^ 0x082EFA98 ); \
@@ -1066,7 +953,7 @@ do { \
   ROUND_S_8WAY(5); \
   ROUND_S_8WAY(6); \
   ROUND_S_8WAY(7); \
-   if (rounds == 14) \
+   if (rounds > 8) \
   { \
      ROUND_S_8WAY(8); \
      ROUND_S_8WAY(9); \
@@ -1100,23 +987,23 @@ void blake256_8way_round0_prehash_le( void *midstate, const void *midhash,
   V[ 5] = H[5];
   V[ 6] = H[6];
   V[ 7] = H[7];
-   V[ 8] = m256_const1_32( CS0 );
-   V[ 9] = m256_const1_32( CS1 );
-   V[10] = m256_const1_32( CS2 );
-   V[11] = m256_const1_32( CS3 );
-   V[12] = m256_const1_32( CS4 ^ 0x280 );
-   V[13] = m256_const1_32( CS5 ^ 0x280 );
-   V[14] = m256_const1_32( CS6 );
-   V[15] = m256_const1_32( CS7 );
+   V[ 8] = _mm256_set1_epi32( CS0 );
+   V[ 9] = _mm256_set1_epi32( CS1 );
+   V[10] = _mm256_set1_epi32( CS2 );
+   V[11] = _mm256_set1_epi32( CS3 );
+   V[12] = _mm256_set1_epi32( CS4 ^ 0x280 );
+   V[13] = _mm256_set1_epi32( CS5 ^ 0x280 );
+   V[14] = _mm256_set1_epi32( CS6 );
+   V[15] = _mm256_set1_epi32( CS7 );

 // M[ 0:3 ] contain new message data including unique nonces in M[ 3].
 // M[ 5:12, 14 ] are always zero and not needed or used.
-// M[ 4], M[ 13], M[15] are constant and are initialized here.
+// M[ 4], M[13], M[15] are constant and are initialized here.
 // M[ 5] is a special case, used as a cache for (M[13] ^ CSC).

-   M[ 4] = m256_const1_32( 0x80000000 );
+   M[ 4] = _mm256_set1_epi32( 0x80000000 );
   M[13] = m256_one_32;
-   M[15] = m256_const1_32( 80*8 );
+   M[15] = _mm256_set1_epi32( 80*8 );

   M[ 5] =_mm256_xor_si256( M[13], _mm256_set1_epi32( CSC ) );

@@ -1176,7 +1063,7 @@ void blake256_8way_round0_prehash_le( void *midstate, const void *midhash,
 }

 void blake256_8way_final_rounds_le( void *final_hash, const void *midstate,
-                                     const void *midhash, const void *data )
+                     const void *midhash, const void *data, const int rounds )
 {
   __m256i *H = (__m256i*)final_hash;
   const __m256i *h = (const __m256i*)midhash;
@@ -1270,16 +1157,18 @@ void blake256_8way_final_rounds_le( void *final_hash, const void *midstate,
   ROUND256_8WAY_5;
   ROUND256_8WAY_6;
   ROUND256_8WAY_7;
-   ROUND256_8WAY_8;
-   ROUND256_8WAY_9;
-   ROUND256_8WAY_0;
-   ROUND256_8WAY_1;
-   ROUND256_8WAY_2;
-   ROUND256_8WAY_3;
+   if ( rounds > 8 )
+   {
+      ROUND256_8WAY_8;
+      ROUND256_8WAY_9;
+      ROUND256_8WAY_0;
+      ROUND256_8WAY_1;
+      ROUND256_8WAY_2;
+      ROUND256_8WAY_3;
+   }

   const __m256i shuf_bswap32 =
-                  m256_const_64( 0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                 0x0c0d0e0f08090a0b, 0x0405060700010203 );
+                  mm256_set2_64( 0x0c0d0e0f08090a0b, 0x0405060700010203 );

   H[0] = _mm256_shuffle_epi8( mm256_xor3( V8, V0, h[0] ), shuf_bswap32 );
   H[1] = _mm256_shuffle_epi8( mm256_xor3( V9, V1, h[1] ), shuf_bswap32 );
@@ -1579,7 +1468,7 @@ void blake256_8way_final_rounds_le( void *final_hash, const void *midstate,

 #define DECL_STATE32_16WAY \
   __m512i H0, H1, H2, H3, H4, H5, H6, H7; \
-   sph_u32 T0, T1;
+   uint32_t T0, T1;

 #define READ_STATE32_16WAY(state) \
 do { \
@@ -1615,7 +1504,8 @@ do { \
   __m512i M8, M9, MA, MB, MC, MD, ME, MF; \
   __m512i V0, V1, V2, V3, V4, V5, V6, V7; \
   __m512i V8, V9, VA, VB, VC, VD, VE, VF; \
-   __m512i shuf_bswap32; \
+   const __m512i shuf_bswap32 = mm512_bcast_m128( _mm_set_epi64x( \
+                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ) ); \
   V0 = H0; \
   V1 = H1; \
   V2 = H2; \
@@ -1624,18 +1514,14 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m512_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m512_const1_64( 0x85A308D385A308D3 ); \
-   VA = m512_const1_64( 0x13198A2E13198A2E ); \
-   VB = m512_const1_64( 0x0370734403707344 ); \
+   V8 = _mm512_set1_epi64( 0x243F6A88243F6A88 ); \
+   V9 = _mm512_set1_epi64( 0x85A308D385A308D3 ); \
+   VA = _mm512_set1_epi64( 0x13198A2E13198A2E ); \
+   VB = _mm512_set1_epi64( 0x0370734403707344 ); \
   VC = _mm512_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm512_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm512_set1_epi32( T1 ^ 0x082EFA98 ); \
   VF = _mm512_set1_epi32( T1 ^ 0xEC4E6C89 ); \
-   shuf_bswap32 = m512_const_64( 0x3c3d3e3f38393a3b, 0x3435363730313233, \
-                                 0x2c2d2e2f28292a2b, 0x2425262720212223, \
-                                 0x1c1d1e1f18191a1b, 0x1415161710111213, \
-                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
   M0 = _mm512_shuffle_epi8( * buf    , shuf_bswap32 ); \
   M1 = _mm512_shuffle_epi8( *(buf+ 1), shuf_bswap32 ); \
   M2 = _mm512_shuffle_epi8( *(buf+ 2), shuf_bswap32 ); \
@@ -1693,10 +1579,10 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m512_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m512_const1_64( 0x85A308D385A308D3 ); \
-   VA = m512_const1_64( 0x13198A2E13198A2E ); \
-   VB = m512_const1_64( 0x0370734403707344 ); \
+   V8 = _mm512_set1_epi64( 0x243F6A88243F6A88 ); \
+   V9 = _mm512_set1_epi64( 0x85A308D385A308D3 ); \
+   VA = _mm512_set1_epi64( 0x13198A2E13198A2E ); \
+   VB = _mm512_set1_epi64( 0x0370734403707344 ); \
   VC = _mm512_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm512_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm512_set1_epi32( T1 ^ 0x082EFA98 ); \
@@ -1763,23 +1649,23 @@ void blake256_16way_round0_prehash_le( void *midstate, const void *midhash,
   V[ 5] = H[5];
   V[ 6] = H[6];
   V[ 7] = H[7];
-   V[ 8] = m512_const1_32( CS0 );
-   V[ 9] = m512_const1_32( CS1 );
-   V[10] = m512_const1_32( CS2 );
-   V[11] = m512_const1_32( CS3 );
-   V[12] = m512_const1_32( CS4 ^ 0x280 );
-   V[13] = m512_const1_32( CS5 ^ 0x280 );
-   V[14] = m512_const1_32( CS6 );
-   V[15] = m512_const1_32( CS7 );
+   V[ 8] = _mm512_set1_epi32( CS0 );
+   V[ 9] = _mm512_set1_epi32( CS1 );
+   V[10] = _mm512_set1_epi32( CS2 );
+   V[11] = _mm512_set1_epi32( CS3 );
+   V[12] = _mm512_set1_epi32( CS4 ^ 0x280 );
+   V[13] = _mm512_set1_epi32( CS5 ^ 0x280 );
+   V[14] = _mm512_set1_epi32( CS6 );
+   V[15] = _mm512_set1_epi32( CS7 );

 // M[ 0:3 ] contain new message data including unique nonces in M[ 3].   
 // M[ 5:12, 14 ] are always zero and not needed or used, except M[5] as noted.
 // M[ 4], M[ 13], M[15] are constant and are initialized here.
 // M[ 5] is a special case, used as a cache for (M[13] ^ CSC).
   
-   M[ 4] = m512_const1_32( 0x80000000 );
+   M[ 4] = _mm512_set1_epi32( 0x80000000 );
   M[13] = m512_one_32;
-   M[15] = m512_const1_32( 80*8 );
+   M[15] = _mm512_set1_epi32( 80*8 );

   M[ 5] =_mm512_xor_si512( M[13], _mm512_set1_epi32( CSC ) );

@@ -1841,8 +1727,9 @@ void blake256_16way_round0_prehash_le( void *midstate, const void *midhash,
                         _mm512_xor_si512( _mm512_set1_epi32( CSE ), M[15] ) );
 }

+// Default is 14 rounds, blakecoin & vanilla are 8.
 void blake256_16way_final_rounds_le( void *final_hash, const void *midstate,
-                                     const void *midhash, const void *data )
+                     const void *midhash, const void *data, const int rounds )
 {
   __m512i *H = (__m512i*)final_hash;
   const __m512i *h = (const __m512i*)midhash;
@@ -1947,19 +1834,20 @@ void blake256_16way_final_rounds_le( void *final_hash, const void *midstate,
   ROUND256_16WAY_5;
   ROUND256_16WAY_6;
   ROUND256_16WAY_7;
-   ROUND256_16WAY_8;
-   ROUND256_16WAY_9;
-   ROUND256_16WAY_0;
-   ROUND256_16WAY_1;
-   ROUND256_16WAY_2;
-   ROUND256_16WAY_3;
+   if ( rounds > 8 )
+   {
+      ROUND256_16WAY_8;
+      ROUND256_16WAY_9;
+      ROUND256_16WAY_0;
+      ROUND256_16WAY_1;
+      ROUND256_16WAY_2;
+      ROUND256_16WAY_3;
+   }

   // Byte swap final hash
   const __m512i shuf_bswap32 =
-                  m512_const_64( 0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                 0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                 0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                 0x0c0d0e0f08090a0b, 0x0405060700010203 );
+                  mm512_bcast_m128( _mm_set_epi64x(
+                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

   H[0] = _mm512_shuffle_epi8( mm512_xor3( V8, V0, h[0] ), shuf_bswap32 );
   H[1] = _mm512_shuffle_epi8( mm512_xor3( V9, V1, h[1] ), shuf_bswap32 );
@@ -1981,14 +1869,14 @@ static void
 blake32_4way_init( blake_4way_small_context *ctx, const uint32_t *iv,
                   const uint32_t *salt, int rounds )
 {
-   casti_m128i( ctx->H, 0 ) = m128_const1_64( 0x6A09E6676A09E667 );
-   casti_m128i( ctx->H, 1 ) = m128_const1_64( 0xBB67AE85BB67AE85 );
-   casti_m128i( ctx->H, 2 ) = m128_const1_64( 0x3C6EF3723C6EF372 );
-   casti_m128i( ctx->H, 3 ) = m128_const1_64( 0xA54FF53AA54FF53A );
-   casti_m128i( ctx->H, 4 ) = m128_const1_64( 0x510E527F510E527F );
-   casti_m128i( ctx->H, 5 ) = m128_const1_64( 0x9B05688C9B05688C );
-   casti_m128i( ctx->H, 6 ) = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   casti_m128i( ctx->H, 7 ) = m128_const1_64( 0x5BE0CD195BE0CD19 );
+   casti_m128i( ctx->H, 0 ) = _mm_set1_epi64x( 0x6A09E6676A09E667 );
+   casti_m128i( ctx->H, 1 ) = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
+   casti_m128i( ctx->H, 2 ) = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
+   casti_m128i( ctx->H, 3 ) = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
+   casti_m128i( ctx->H, 4 ) = _mm_set1_epi64x( 0x510E527F510E527F );
+   casti_m128i( ctx->H, 5 ) = _mm_set1_epi64x( 0x9B05688C9B05688C );
+   casti_m128i( ctx->H, 6 ) = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   casti_m128i( ctx->H, 7 ) = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );
   ctx->T0 = ctx->T1 = 0;
   ctx->ptr = 0;
   ctx->rounds = rounds;
@@ -2018,7 +1906,7 @@ blake32_4way( blake_4way_small_context *ctx, const void *data,
      size_t clen = ( sizeof ctx->buf ) - bptr;

      if ( clen > blen )
-	 clen = blen;
+         clen = blen;
      memcpy( buf + vptr, data, clen );
      bptr += clen;
      data = (const unsigned char *)data + clen;
@@ -2059,13 +1947,13 @@ blake32_4way_close( blake_4way_small_context *ctx, unsigned ub, unsigned n,
   else
      ctx->T0 -= 512 - bit_len;

-   buf[vptr] = m128_const1_64( 0x0000008000000080 );
+   buf[vptr] = _mm_set1_epi64x( 0x0000008000000080 );

   if ( vptr < 12 )
   {
      memset_zero_128( buf + vptr + 1, 13 - vptr  );
      buf[ 13 ] = _mm_or_si128( buf[ 13 ],
-                                m128_const1_64( 0x0100000001000000ULL ) );
+                                _mm_set1_epi64x( 0x0100000001000000ULL ) );
      buf[ 14 ] = _mm_set1_epi32( bswap_32( th ) );
      buf[ 15 ] = _mm_set1_epi32( bswap_32( tl ) );
      blake32_4way( ctx, buf + vptr, 64 - ptr );
@@ -2078,7 +1966,7 @@ blake32_4way_close( blake_4way_small_context *ctx, unsigned ub, unsigned n,
      ctx->T1 = 0xFFFFFFFFUL;
      memset_zero_128( buf, 56>>2 );
      buf[ 13 ] = _mm_or_si128( buf[ 13 ],
-                                m128_const1_64( 0x0100000001000000ULL ) );
+                                _mm_set1_epi64x( 0x0100000001000000ULL ) );
      buf[ 14 ] = _mm_set1_epi32( bswap_32( th ) );
      buf[ 15 ] = _mm_set1_epi32( bswap_32( tl ) );
      blake32_4way( ctx, buf, 64 );
@@ -2091,20 +1979,20 @@ blake32_4way_close( blake_4way_small_context *ctx, unsigned ub, unsigned n,

 // Blake-256 8 way

-static const sph_u32 salt_zero_8way_small[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
+static const uint32_t salt_zero_8way_small[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };

 static void
-blake32_8way_init( blake_8way_small_context *sc, const sph_u32 *iv,
-                   const sph_u32 *salt, int rounds )
+blake32_8way_init( blake_8way_small_context *sc, const uint32_t *iv,
+                   const uint32_t *salt, int rounds )
 {
-   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E6676A09E667 );
-   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE85BB67AE85 );
-   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF3723C6EF372 );
-   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53AA54FF53A );
-   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527F510E527F );
-   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C9B05688C );
-   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9AB1F83D9AB );
-   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD195BE0CD19 );
+   casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
+   casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
+   casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
+   casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
+   casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527F510E527F );
+   casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C9B05688C );
+   casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );
   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
   sc->rounds = rounds;
@@ -2142,8 +2030,8 @@ blake32_8way( blake_8way_small_context *sc, const void *data, size_t len )
      len -= clen;
      if ( ptr == buf_size )
      {
-          if ( ( T0 = SPH_T32(T0 + 512) ) < 512 )
-                T1 = SPH_T32(T1 + 1);
+          if ( ( T0 = T0 + 512 ) < 512 )
+                T1 = T1 + 1;
          COMPRESS32_8WAY( sc->rounds );
          ptr = 0;
      }
@@ -2159,23 +2047,23 @@ blake32_8way_close( blake_8way_small_context *sc, unsigned ub, unsigned n,
   __m256i buf[16];
   size_t ptr;
   unsigned bit_len;
-   sph_u32 th, tl;
+   uint32_t th, tl;

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = m256_const1_64( 0x0000008000000080ULL );
+   buf[ptr>>2] = _mm256_set1_epi64x( 0x0000008000000080ULL );
   tl = sc->T0 + bit_len;
   th = sc->T1;

   if ( ptr == 0 )
   {
-        sc->T0 = SPH_C32(0xFFFFFE00UL);
-        sc->T1 = SPH_C32(0xFFFFFFFFUL);
+        sc->T0 = 0xFFFFFE00UL;
+        sc->T1 = 0xFFFFFFFFUL;
   }
   else if ( sc->T0 == 0 )
   {
-        sc->T0 = SPH_C32(0xFFFFFE00UL) + bit_len;
-        sc->T1 = SPH_T32(sc->T1 - 1);
+        sc->T0 = 0xFFFFFE00UL + bit_len;
+        sc->T1 = sc->T1 - 1;
   }
   else
        sc->T0 -= 512 - bit_len;
@@ -2185,7 +2073,7 @@ blake32_8way_close( blake_8way_small_context *sc, unsigned ub, unsigned n,
       memset_zero_256( buf + (ptr>>2) + 1, (52 - ptr) >> 2 );
       if ( out_size_w32 == 8 )
           buf[52>>2] = _mm256_or_si256( buf[52>>2],
-                                m256_const1_64( 0x0100000001000000ULL ) );
+                                _mm256_set1_epi64x( 0x0100000001000000ULL ) );
       *(buf+(56>>2)) = _mm256_set1_epi32( bswap_32( th ) );
       *(buf+(60>>2)) = _mm256_set1_epi32( bswap_32( tl ) );
       blake32_8way( sc, buf + (ptr>>2), 64 - ptr );
@@ -2194,11 +2082,11 @@ blake32_8way_close( blake_8way_small_context *sc, unsigned ub, unsigned n,
   {
       memset_zero_256( buf + (ptr>>2) + 1, (60-ptr) >> 2 );
       blake32_8way( sc, buf + (ptr>>2), 64 - ptr );
-       sc->T0 = SPH_C32(0xFFFFFE00UL);
-       sc->T1 = SPH_C32(0xFFFFFFFFUL);
+       sc->T0 = 0xFFFFFE00UL;
+       sc->T1 = 0xFFFFFFFFUL;
       memset_zero_256( buf, 56>>2 );
       if ( out_size_w32 == 8 )
-           buf[52>>2] = m256_const1_64( 0x0100000001000000ULL );
+           buf[52>>2] = _mm256_set1_epi64x( 0x0100000001000000ULL );
       *(buf+(56>>2)) = _mm256_set1_epi32( bswap_32( th ) );
       *(buf+(60>>2)) = _mm256_set1_epi32( bswap_32( tl ) );
       blake32_8way( sc, buf, 64 );
@@ -2238,8 +2126,8 @@ blake32_8way_le( blake_8way_small_context *sc, const void *data, size_t len )
      len -= clen;
      if ( ptr == buf_size )
      {
-          if ( ( T0 = SPH_T32(T0 + 512) ) < 512 )
-                T1 = SPH_T32(T1 + 1);
+          if ( ( T0 = T0 + 512 ) < 512 )
+                T1 = T1 + 1;
          COMPRESS32_8WAY_LE( sc->rounds );
          ptr = 0;
      }
@@ -2255,23 +2143,23 @@ blake32_8way_close_le( blake_8way_small_context *sc, unsigned ub, unsigned n,
   __m256i buf[16];
   size_t ptr;
   unsigned bit_len;
-   sph_u32 th, tl;
+   uint32_t th, tl;

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = m256_const1_32( 0x80000000 );
+   buf[ptr>>2] = _mm256_set1_epi32( 0x80000000 );
   tl = sc->T0 + bit_len;
   th = sc->T1;

   if ( ptr == 0 )
   {
-        sc->T0 = SPH_C32(0xFFFFFE00UL);
-        sc->T1 = SPH_C32(0xFFFFFFFFUL);
+        sc->T0 = 0xFFFFFE00UL;
+        sc->T1 = 0xFFFFFFFFUL;
   }
   else if ( sc->T0 == 0 )
   {
-        sc->T0 = SPH_C32(0xFFFFFE00UL) + bit_len;
-        sc->T1 = SPH_T32(sc->T1 - 1);
+        sc->T0 = 0xFFFFFE00UL + bit_len;
+        sc->T1 = sc->T1 - 1;
   }
   else
        sc->T0 -= 512 - bit_len;
@@ -2289,8 +2177,8 @@ blake32_8way_close_le( blake_8way_small_context *sc, unsigned ub, unsigned n,
   {
       memset_zero_256( buf + (ptr>>2) + 1, (60-ptr) >> 2 );
       blake32_8way_le( sc, buf + (ptr>>2), 64 - ptr );
-       sc->T0 = SPH_C32(0xFFFFFE00UL);
-       sc->T1 = SPH_C32(0xFFFFFFFFUL);
+       sc->T0 = 0xFFFFFE00UL;
+       sc->T1 = 0xFFFFFFFFUL;
       memset_zero_256( buf, 56>>2 );
       if ( out_size_w32 == 8 )
           buf[52>>2] = m256_one_32;
@@ -2309,17 +2197,17 @@ blake32_8way_close_le( blake_8way_small_context *sc, unsigned ub, unsigned n,
 //Blake-256 16 way AVX512

 static void
-blake32_16way_init( blake_16way_small_context *sc, const sph_u32 *iv,
-                   const sph_u32 *salt, int rounds )
+blake32_16way_init( blake_16way_small_context *sc, const uint32_t *iv,
+                   const uint32_t *salt, int rounds )
 {
-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E6676A09E667 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE85BB67AE85 );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF3723C6EF372 );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53AA54FF53A );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527F510E527F );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C9B05688C );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9AB1F83D9AB );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD195BE0CD19 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E6676A09E667 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527F510E527F );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C9B05688C );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );
   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
   sc->rounds = rounds;
@@ -2372,11 +2260,11 @@ blake32_16way_close( blake_16way_small_context *sc, unsigned ub, unsigned n,
   __m512i buf[16];
   size_t ptr;
   unsigned bit_len;
-   sph_u32 th, tl;
+   uint32_t th, tl;

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = m512_const1_64( 0x0000008000000080ULL );
+   buf[ptr>>2] = _mm512_set1_epi64( 0x0000008000000080ULL );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -2398,7 +2286,7 @@ blake32_16way_close( blake_16way_small_context *sc, unsigned ub, unsigned n,
       memset_zero_512( buf + (ptr>>2) + 1, (52 - ptr) >> 2 );
       if ( out_size_w32 == 8 )
           buf[52>>2] = _mm512_or_si512( buf[52>>2],
-                                m512_const1_64( 0x0100000001000000ULL ) );
+                                _mm512_set1_epi64( 0x0100000001000000ULL ) );
       buf[56>>2] = _mm512_set1_epi32( bswap_32( th ) );
       buf[60>>2] = _mm512_set1_epi32( bswap_32( tl ) );
       blake32_16way( sc, buf + (ptr>>2), 64 - ptr );
@@ -2411,7 +2299,7 @@ blake32_16way_close( blake_16way_small_context *sc, unsigned ub, unsigned n,
       sc->T1 = 0xFFFFFFFFUL;
       memset_zero_512( buf, 56>>2 );
       if ( out_size_w32 == 8 )
-          buf[52>>2] = m512_const1_64( 0x0100000001000000ULL );
+          buf[52>>2] = _mm512_set1_epi64( 0x0100000001000000ULL );
       buf[56>>2] = _mm512_set1_epi32( bswap_32( th ) );
       buf[60>>2] = _mm512_set1_epi32( bswap_32( tl ) );
       blake32_16way( sc, buf, 64 );
@@ -2469,11 +2357,11 @@ blake32_16way_close_le( blake_16way_small_context *sc, unsigned ub, unsigned n,
   __m512i buf[16];
   size_t ptr;
   unsigned bit_len;
-   sph_u32 th, tl;
+   uint32_t th, tl;

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = m512_const1_32( 0x80000000 );
+   buf[ptr>>2] = _mm512_set1_epi32( 0x80000000 );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -2579,8 +2467,6 @@ blake256r8_16way_close(void *cc, void *dst)

 #endif // AVX512

-
-
 // Blake-256 4 way

 // default 14 rounds, backward copatibility
@@ -2715,9 +2601,3 @@ blake256r8_8way_close(void *cc, void *dst)
 }

 #endif
-
-#ifdef __cplusplus
-}
-#endif
-
-//#endif
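
Note on the shuf_bswap32 changes in the file above: the full-width shuffle constants are replaced by a 128-bit pattern repeated in every lane. This appears safe because _mm256_shuffle_epi8 and _mm512_shuffle_epi8 shuffle each 128-bit lane independently and use only the low 4 bits of each control byte (bit 7 zeroes the output byte). The sketch below is a scalar model of that behaviour, not code from the repository. The helper names are hypothetical, and it assumes mm256_set2_64 repeats its two 64-bit arguments in both lanes.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a lane-wise byte shuffle (the documented behaviour of
 * _mm256_shuffle_epi8): each 16-byte lane is shuffled independently,
 * bit 7 of a control byte zeroes the output byte, and otherwise only
 * the low 4 bits select a source byte from the same lane. */
static void shuffle_lanes_model( uint8_t *dst, const uint8_t *src,
                                 const uint8_t *ctl, int lanes )
{
   for ( int l = 0; l < lanes; l++ )
      for ( int i = 0; i < 16; i++ )
      {
         const uint8_t c = ctl[ 16*l + i ];
         dst[ 16*l + i ] = ( c & 0x80 ) ? 0 : src[ 16*l + ( c & 0x0f ) ];
      }
}

/* Hypothetical helper: store a 64-bit constant as little-endian bytes. */
static void put_le64( uint8_t *p, uint64_t v )
{
   for ( int i = 0; i < 8; i++ )
      p[i] = (uint8_t)( v >> ( 8*i ) );
}

/* Returns 1 if the old 256-bit mask and the repeated 128-bit mask give
 * the same result, and that result is a 32-bit byte swap; 0 otherwise. */
static int bswap32_masks_agree( void )
{
   uint8_t old_ctl[32], new_ctl[32], src[32], a[32], b[32];

   /* Old constant, lowest 64 bits first. */
   put_le64( old_ctl +  0, 0x0405060700010203ULL );
   put_le64( old_ctl +  8, 0x0c0d0e0f08090a0bULL );
   put_le64( old_ctl + 16, 0x1415161710111213ULL );
   put_le64( old_ctl + 24, 0x1c1d1e1f18191a1bULL );

   /* New constant: the low 128 bits repeated in both lanes. */
   memcpy( new_ctl,      old_ctl, 16 );
   memcpy( new_ctl + 16, old_ctl, 16 );

   for ( int i = 0; i < 32; i++ )
      src[i] = (uint8_t)( 0xa0 + i );

   shuffle_lanes_model( a, src, old_ctl, 2 );
   shuffle_lanes_model( b, src, new_ctl, 2 );

   if ( memcmp( a, b, 32 ) != 0 )
      return 0;

   /* Each 4-byte word must come out with its bytes reversed. */
   for ( int w = 0; w < 8; w++ )
      for ( int k = 0; k < 4; k++ )
         if ( a[ 4*w + k ] != src[ 4*w + 3 - k ] )
            return 0;
   return 1;
}
```

Calling bswap32_masks_agree() from a small test program is enough to check the model. The same argument extends to the four lanes of the 512-bit shuffle.
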
--- a/algo/blake/blake2b-hash-4way.c
+++ b/algo/blake/blake2b-hash-4way.c
@@ -252,14 +252,14 @@ static void blake2b_8way_compress( blake2b_8way_ctx *ctx, int last )
   v[ 5] = ctx->h[5];
   v[ 6] = ctx->h[6];
   v[ 7] = ctx->h[7];
-   v[ 8] = m512_const1_64( 0x6A09E667F3BCC908 );
-   v[ 9] = m512_const1_64( 0xBB67AE8584CAA73B );
-   v[10] = m512_const1_64( 0x3C6EF372FE94F82B );
-   v[11] = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   v[12] = m512_const1_64( 0x510E527FADE682D1 );
-   v[13] = m512_const1_64( 0x9B05688C2B3E6C1F );
-   v[14] = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   v[15] = m512_const1_64( 0x5BE0CD19137E2179 );
+   v[ 8] = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   v[ 9] = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   v[10] = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   v[11] = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   v[12] = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   v[13] = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   v[14] = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   v[15] = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   v[12] = _mm512_xor_si512( v[12], _mm512_set1_epi64( ctx->t[0] ) );
   v[13] = _mm512_xor_si512( v[13], _mm512_set1_epi64( ctx->t[1] ) );
@@ -310,16 +310,16 @@ int blake2b_8way_init( blake2b_8way_ctx *ctx )
 {
   size_t i;

-   ctx->h[0] = m512_const1_64( 0x6A09E667F3BCC908 );
-   ctx->h[1] = m512_const1_64( 0xBB67AE8584CAA73B );
-   ctx->h[2] = m512_const1_64( 0x3C6EF372FE94F82B );
-   ctx->h[3] = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   ctx->h[4] = m512_const1_64( 0x510E527FADE682D1 );
-   ctx->h[5] = m512_const1_64( 0x9B05688C2B3E6C1F );
-   ctx->h[6] = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   ctx->h[7] = m512_const1_64( 0x5BE0CD19137E2179 );
+   ctx->h[0] = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   ctx->h[1] = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   ctx->h[2] = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   ctx->h[3] = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   ctx->h[4] = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   ctx->h[5] = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   ctx->h[6] = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   ctx->h[7] = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

-   ctx->h[0] = _mm512_xor_si512( ctx->h[0], m512_const1_64( 0x01010020 ) );
+   ctx->h[0] = _mm512_xor_si512( ctx->h[0], _mm512_set1_epi64( 0x01010020 ) );

   ctx->t[0] = 0;
   ctx->t[1] = 0;
@@ -419,14 +419,14 @@ static void blake2b_4way_compress( blake2b_4way_ctx *ctx, int last )
   v[ 5] = ctx->h[5];
   v[ 6] = ctx->h[6];
   v[ 7] = ctx->h[7];
-   v[ 8] = m256_const1_64( 0x6A09E667F3BCC908 );
-   v[ 9] = m256_const1_64( 0xBB67AE8584CAA73B );
-   v[10] = m256_const1_64( 0x3C6EF372FE94F82B );
-   v[11] = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   v[12] = m256_const1_64( 0x510E527FADE682D1 );
-   v[13] = m256_const1_64( 0x9B05688C2B3E6C1F );
-   v[14] = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   v[15] = m256_const1_64( 0x5BE0CD19137E2179 );
+   v[ 8] = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   v[ 9] = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   v[10] = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   v[11] = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   v[12] = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   v[13] = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   v[14] = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   v[15] = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );

   v[12] = _mm256_xor_si256( v[12], _mm256_set1_epi64x( ctx->t[0] ) );
   v[13] = _mm256_xor_si256( v[13], _mm256_set1_epi64x( ctx->t[1] ) );
@@ -477,16 +477,16 @@ int blake2b_4way_init( blake2b_4way_ctx *ctx )
 {
 	size_t i;

-   ctx->h[0] = m256_const1_64( 0x6A09E667F3BCC908 );
-   ctx->h[1] = m256_const1_64( 0xBB67AE8584CAA73B );
-   ctx->h[2] = m256_const1_64( 0x3C6EF372FE94F82B );
-   ctx->h[3] = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   ctx->h[4] = m256_const1_64( 0x510E527FADE682D1 );
-   ctx->h[5] = m256_const1_64( 0x9B05688C2B3E6C1F );
-   ctx->h[6] = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   ctx->h[7] = m256_const1_64( 0x5BE0CD19137E2179 );
+   ctx->h[0] = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   ctx->h[1] = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   ctx->h[2] = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   ctx->h[3] = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   ctx->h[4] = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   ctx->h[5] = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   ctx->h[6] = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   ctx->h[7] = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );

-   ctx->h[0] = _mm256_xor_si256( ctx->h[0], m256_const1_64( 0x01010020 ) );
+   ctx->h[0] = _mm256_xor_si256( ctx->h[0], _mm256_set1_epi64x( 0x01010020 ) );

 	ctx->t[0] = 0;
 	ctx->t[1] = 0;
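
Note on the 0x01010020 constant XORed into h[0] in both init functions above: it matches the first BLAKE2b parameter word from RFC 7693 for an unkeyed 32-byte digest (fanout 1, depth 1, key length 0, digest length 0x20). The sketch below only illustrates that arithmetic for a single lane. The function names are hypothetical and do not exist in the repository.

```c
#include <stdint.h>

/* First BLAKE2b parameter word for sequential hashing, per RFC 7693:
 * fanout = 1 and depth = 1 give 0x01010000, the key length sits in
 * byte 1 and the digest length in byte 0. */
static uint64_t blake2b_param_word( unsigned outlen, unsigned keylen )
{
   return 0x01010000ULL ^ ( (uint64_t)keylen << 8 ) ^ (uint64_t)outlen;
}

/* Value of h[0] for one lane after initialisation. */
static uint64_t blake2b_h0( unsigned outlen, unsigned keylen )
{
   const uint64_t iv0 = 0x6A09E667F3BCC908ULL;   /* BLAKE2b IV[0] */
   return iv0 ^ blake2b_param_word( outlen, keylen );
}
```

With outlen = 32 and keylen = 0 the parameter word is 0x01010020, the value the vectorised init broadcasts to every lane.
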
--- a/algo/blake/blake2s-hash-4way.c
+++ b/algo/blake/blake2s-hash-4way.c
@@ -62,14 +62,14 @@ int blake2s_4way_init( blake2s_4way_state *S, const uint8_t outlen )

   memset( S, 0, sizeof( blake2s_4way_state ) );

-   S->h[0] = m128_const1_64( 0x6A09E6676A09E667ULL );
-   S->h[1] = m128_const1_64( 0xBB67AE85BB67AE85ULL );
-   S->h[2] = m128_const1_64( 0x3C6EF3723C6EF372ULL );
-   S->h[3] = m128_const1_64( 0xA54FF53AA54FF53AULL );
-   S->h[4] = m128_const1_64( 0x510E527F510E527FULL );
-   S->h[5] = m128_const1_64( 0x9B05688C9B05688CULL );
-   S->h[6] = m128_const1_64( 0x1F83D9AB1F83D9ABULL );
-   S->h[7] = m128_const1_64( 0x5BE0CD195BE0CD19ULL );
+   S->h[0] = _mm_set1_epi64x( 0x6A09E6676A09E667ULL );
+   S->h[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85ULL );
+   S->h[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372ULL );
+   S->h[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53AULL );
+   S->h[4] = _mm_set1_epi64x( 0x510E527F510E527FULL );
+   S->h[5] = _mm_set1_epi64x( 0x9B05688C9B05688CULL );
+   S->h[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9ABULL );
+   S->h[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19ULL );
   
 //   for( int i = 0; i < 8; ++i )
 //      S->h[i] = _mm_set1_epi32( blake2s_IV[i] );
@@ -90,18 +90,18 @@ int blake2s_4way_compress( blake2s_4way_state *S, const __m128i* block )
   memcpy_128( m, block, 16 );
   memcpy_128( v, S->h, 8 );

-   v[ 8] = m128_const1_64( 0x6A09E6676A09E667ULL );
-   v[ 9] = m128_const1_64( 0xBB67AE85BB67AE85ULL );
-   v[10] = m128_const1_64( 0x3C6EF3723C6EF372ULL );
-   v[11] = m128_const1_64( 0xA54FF53AA54FF53AULL );
+   v[ 8] = _mm_set1_epi64x( 0x6A09E6676A09E667ULL );
+   v[ 9] = _mm_set1_epi64x( 0xBB67AE85BB67AE85ULL );
+   v[10] = _mm_set1_epi64x( 0x3C6EF3723C6EF372ULL );
+   v[11] = _mm_set1_epi64x( 0xA54FF53AA54FF53AULL );
   v[12] = _mm_xor_si128( _mm_set1_epi32( S->t[0] ),
-                          m128_const1_64( 0x510E527F510E527FULL ) );
+                          _mm_set1_epi64x( 0x510E527F510E527FULL ) );
   v[13] = _mm_xor_si128( _mm_set1_epi32( S->t[1] ),
-                          m128_const1_64( 0x9B05688C9B05688CULL ) );
+                          _mm_set1_epi64x( 0x9B05688C9B05688CULL ) );
   v[14] = _mm_xor_si128( _mm_set1_epi32( S->f[0] ),
-                          m128_const1_64( 0x1F83D9AB1F83D9ABULL ) );
+                          _mm_set1_epi64x( 0x1F83D9AB1F83D9ABULL ) );
   v[15] = _mm_xor_si128( _mm_set1_epi32( S->f[1] ),
-                          m128_const1_64( 0x5BE0CD195BE0CD19ULL ) );
+                          _mm_set1_epi64x( 0x5BE0CD195BE0CD19ULL ) );

 #define G4W( sigma0, sigma1, a, b, c, d ) \
 do { \
@@ -269,21 +269,21 @@ int blake2s_8way_compress( blake2s_8way_state *S, const __m256i *block )
   memcpy_256( m, block, 16 );
   memcpy_256( v, S->h, 8 );

-   v[ 8] = m256_const1_64( 0x6A09E6676A09E667ULL );
-   v[ 9] = m256_const1_64( 0xBB67AE85BB67AE85ULL );
-   v[10] = m256_const1_64( 0x3C6EF3723C6EF372ULL );
-   v[11] = m256_const1_64( 0xA54FF53AA54FF53AULL );
+   v[ 8] = _mm256_set1_epi64x( 0x6A09E6676A09E667ULL );
+   v[ 9] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85ULL );
+   v[10] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372ULL );
+   v[11] = _mm256_set1_epi64x( 0xA54FF53AA54FF53AULL );
   v[12] = _mm256_xor_si256( _mm256_set1_epi32( S->t[0] ),
-                          m256_const1_64( 0x510E527F510E527FULL ) );
+                          _mm256_set1_epi64x( 0x510E527F510E527FULL ) );

   v[13] = _mm256_xor_si256( _mm256_set1_epi32( S->t[1] ),
-                          m256_const1_64( 0x9B05688C9B05688CULL ) );
+                          _mm256_set1_epi64x( 0x9B05688C9B05688CULL ) );

   v[14] = _mm256_xor_si256( _mm256_set1_epi32( S->f[0] ),
-                          m256_const1_64( 0x1F83D9AB1F83D9ABULL ) );
+                          _mm256_set1_epi64x( 0x1F83D9AB1F83D9ABULL ) );

   v[15] = _mm256_xor_si256( _mm256_set1_epi32( S->f[1] ),
-                          m256_const1_64( 0x5BE0CD195BE0CD19ULL ) );
+                          _mm256_set1_epi64x( 0x5BE0CD195BE0CD19ULL ) );

 /*
   v[ 8] = _mm256_set1_epi32( blake2s_IV[0] );
@@ -391,14 +391,14 @@ int blake2s_8way_init( blake2s_8way_state *S, const uint8_t outlen )
   memset( P->personal, 0, sizeof( P->personal ) );

   memset( S, 0, sizeof( blake2s_8way_state ) );
-   S->h[0] = m256_const1_64( 0x6A09E6676A09E667ULL );
-   S->h[1] = m256_const1_64( 0xBB67AE85BB67AE85ULL );
-   S->h[2] = m256_const1_64( 0x3C6EF3723C6EF372ULL );
-   S->h[3] = m256_const1_64( 0xA54FF53AA54FF53AULL );
-   S->h[4] = m256_const1_64( 0x510E527F510E527FULL );
-   S->h[5] = m256_const1_64( 0x9B05688C9B05688CULL );
-   S->h[6] = m256_const1_64( 0x1F83D9AB1F83D9ABULL );
-   S->h[7] = m256_const1_64( 0x5BE0CD195BE0CD19ULL );
+   S->h[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667ULL );
+   S->h[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85ULL );
+   S->h[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372ULL );
+   S->h[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53AULL );
+   S->h[4] = _mm256_set1_epi64x( 0x510E527F510E527FULL );
+   S->h[5] = _mm256_set1_epi64x( 0x9B05688C9B05688CULL );
+   S->h[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9ABULL );
+   S->h[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19ULL );


 //   for( int i = 0; i < 8; ++i )
@@ -510,21 +510,21 @@ int blake2s_16way_compress( blake2s_16way_state *S, const __m512i *block )
   memcpy_512( m, block, 16 );
   memcpy_512( v, S->h, 8 );

-   v[ 8] = m512_const1_64( 0x6A09E6676A09E667ULL );
-   v[ 9] = m512_const1_64( 0xBB67AE85BB67AE85ULL );
-   v[10] = m512_const1_64( 0x3C6EF3723C6EF372ULL );
-   v[11] = m512_const1_64( 0xA54FF53AA54FF53AULL );
+   v[ 8] = _mm512_set1_epi64( 0x6A09E6676A09E667ULL );
+   v[ 9] = _mm512_set1_epi64( 0xBB67AE85BB67AE85ULL );
+   v[10] = _mm512_set1_epi64( 0x3C6EF3723C6EF372ULL );
+   v[11] = _mm512_set1_epi64( 0xA54FF53AA54FF53AULL );
   v[12] = _mm512_xor_si512( _mm512_set1_epi32( S->t[0] ),
-                          m512_const1_64( 0x510E527F510E527FULL ) );
+                          _mm512_set1_epi64( 0x510E527F510E527FULL ) );

   v[13] = _mm512_xor_si512( _mm512_set1_epi32( S->t[1] ),
-                          m512_const1_64( 0x9B05688C9B05688CULL ) );
+                          _mm512_set1_epi64( 0x9B05688C9B05688CULL ) );

   v[14] = _mm512_xor_si512( _mm512_set1_epi32( S->f[0] ),
-                          m512_const1_64( 0x1F83D9AB1F83D9ABULL ) );
+                          _mm512_set1_epi64( 0x1F83D9AB1F83D9ABULL ) );

   v[15] = _mm512_xor_si512( _mm512_set1_epi32( S->f[1] ),
-                          m512_const1_64( 0x5BE0CD195BE0CD19ULL ) );
+                          _mm512_set1_epi64( 0x5BE0CD195BE0CD19ULL ) );


 #define G16W( sigma0, sigma1, a, b, c, d) \
@@ -589,14 +589,14 @@ int blake2s_16way_init( blake2s_16way_state *S, const uint8_t outlen )
   memset( P->personal, 0, sizeof( P->personal ) );

   memset( S, 0, sizeof( blake2s_16way_state ) );
-   S->h[0] = m512_const1_64( 0x6A09E6676A09E667ULL );
-   S->h[1] = m512_const1_64( 0xBB67AE85BB67AE85ULL );
-   S->h[2] = m512_const1_64( 0x3C6EF3723C6EF372ULL );
-   S->h[3] = m512_const1_64( 0xA54FF53AA54FF53AULL );
-   S->h[4] = m512_const1_64( 0x510E527F510E527FULL );
-   S->h[5] = m512_const1_64( 0x9B05688C9B05688CULL );
-   S->h[6] = m512_const1_64( 0x1F83D9AB1F83D9ABULL );
-   S->h[7] = m512_const1_64( 0x5BE0CD195BE0CD19ULL );
+   S->h[0] = _mm512_set1_epi64( 0x6A09E6676A09E667ULL );
+   S->h[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85ULL );
+   S->h[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372ULL );
+   S->h[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53AULL );
+   S->h[4] = _mm512_set1_epi64( 0x510E527F510E527FULL );
+   S->h[5] = _mm512_set1_epi64( 0x9B05688C9B05688CULL );
+   S->h[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9ABULL );
+   S->h[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19ULL );

   uint32_t *p = ( uint32_t * )( P );

--- a/algo/blake/blake512-hash-4way.c
+++ b/algo/blake/blake512-hash-4way.c
@@ -1,62 +1,22 @@
-/* $Id: blake.c 252 2011-06-07 17:55:14Z tp $ */
-/*
- * BLAKE implementation.
- *
- * ==========================(LICENSE BEGIN)============================
- *
- * Copyright (c) 2007-2010  Projet RNRT SAPHIR
- *
- * Permission is hereby granted, free of charge, to any person obtaining
- * a copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice shall be
- * included in all copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- * ===========================(LICENSE END)=============================
- *
- * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
- */
-
 #if defined (__AVX2__)

 #include <stddef.h>
 #include <string.h>
 #include <limits.h>
-
 #include "blake-hash-4way.h"

-#ifdef __cplusplus
-extern "C"{
-#endif
-
-#ifdef _MSC_VER
-#pragma warning (disable: 4146)
-#endif
-
 // Blake-512 common
   
 /*
-static const sph_u64 IV512[8] = {
-	SPH_C64(0x6A09E667F3BCC908), SPH_C64(0xBB67AE8584CAA73B),
-	SPH_C64(0x3C6EF372FE94F82B), SPH_C64(0xA54FF53A5F1D36F1),
-	SPH_C64(0x510E527FADE682D1), SPH_C64(0x9B05688C2B3E6C1F),
-	SPH_C64(0x1F83D9ABFB41BD6B), SPH_C64(0x5BE0CD19137E2179)
+static const uint64_t IV512[8] =
+{
+  0x6A09E667F3BCC908, 0xBB67AE8584CAA73B,
+  0x3C6EF372FE94F82B, 0xA54FF53A5F1D36F1,
+  0x510E527FADE682D1, 0x9B05688C2B3E6C1F,
+  0x1F83D9ABFB41BD6B, 0x5BE0CD19137E2179
 };

-static const sph_u64 salt_zero_big[4] = { 0, 0, 0, 0 };
+static const uint64_t salt_zero_big[4] = { 0, 0, 0, 0 };

 static const unsigned sigma[16][16] = {
 	{  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
@@ -77,15 +37,15 @@ static const unsigned sigma[16][16] = {
 	{  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 }
 };

-static const sph_u64 CB[16] = {
-   SPH_C64(0x243F6A8885A308D3), SPH_C64(0x13198A2E03707344),
-   SPH_C64(0xA4093822299F31D0), SPH_C64(0x082EFA98EC4E6C89),
-   SPH_C64(0x452821E638D01377), SPH_C64(0xBE5466CF34E90C6C),
-   SPH_C64(0xC0AC29B7C97C50DD), SPH_C64(0x3F84D5B5B5470917),
-   SPH_C64(0x9216D5D98979FB1B), SPH_C64(0xD1310BA698DFB5AC),
-   SPH_C64(0x2FFD72DBD01ADFB7), SPH_C64(0xB8E1AFED6A267E96),
-   SPH_C64(0xBA7C9045F12C7F99), SPH_C64(0x24A19947B3916CF7),
-   SPH_C64(0x0801F2E2858EFC16), SPH_C64(0x636920D871574E69)
+static const uint64_t CB[16] = {
+   0x243F6A8885A308D3, 0x13198A2E03707344,
+   0xA4093822299F31D0, 0x082EFA98EC4E6C89,
+   0x452821E638D01377, 0xBE5466CF34E90C6C,
+   0xC0AC29B7C97C50DD, 0x3F84D5B5B5470917,
+   0x9216D5D98979FB1B, 0xD1310BA698DFB5AC,
+   0x2FFD72DBD01ADFB7, 0xB8E1AFED6A267E96,
+   0xBA7C9045F12C7F99, 0x24A19947B3916CF7,
+   0x0801F2E2858EFC16, 0x636920D871574E69

 */

@@ -350,7 +310,6 @@ static const sph_u64 CB[16] = {
  __m512i M8, M9, MA, MB, MC, MD, ME, MF; \
  __m512i V0, V1, V2, V3, V4, V5, V6, V7; \
  __m512i V8, V9, VA, VB, VC, VD, VE, VF; \
-  __m512i shuf_bswap64; \
  V0 = H0; \
  V1 = H1; \
  V2 = H2; \
@@ -359,18 +318,16 @@ static const sph_u64 CB[16] = {
  V5 = H5; \
  V6 = H6; \
  V7 = H7; \
-  V8 = m512_const1_64( CB0 );  \
-  V9 = m512_const1_64( CB1 );  \
-  VA = m512_const1_64( CB2 );  \
-  VB = m512_const1_64( CB3 );  \
+  V8 = _mm512_set1_epi64( CB0 );  \
+  V9 = _mm512_set1_epi64( CB1 );  \
+  VA = _mm512_set1_epi64( CB2 );  \
+  VB = _mm512_set1_epi64( CB3 );  \
  VC = _mm512_set1_epi64( T0 ^ CB4 ); \
  VD = _mm512_set1_epi64( T0 ^ CB5 ); \
  VE = _mm512_set1_epi64( T1 ^ CB6 ); \
  VF = _mm512_set1_epi64( T1 ^ CB7 ); \
-  shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637, \
-                                0x28292a2b2c2d2e2f, 0x2021222324252627, \
-                                0x18191a1b1c1d1e1f, 0x1011121314151617, \
-                                0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
+  const __m512i shuf_bswap64 = mm512_bcast_m128( _mm_set_epi64x( \
+                                   0x08090a0b0c0d0e0f, 0x0001020304050607 ) ); \
  M0 = _mm512_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
  M1 = _mm512_shuffle_epi8( *(buf+ 1), shuf_bswap64 ); \
  M2 = _mm512_shuffle_epi8( *(buf+ 2), shuf_bswap64 ); \
@@ -419,7 +376,6 @@ void blake512_8way_compress( blake_8way_big_context *sc )
  __m512i M8, M9, MA, MB, MC, MD, ME, MF;
  __m512i V0, V1, V2, V3, V4, V5, V6, V7;
  __m512i V8, V9, VA, VB, VC, VD, VE, VF;
-  __m512i shuf_bswap64;

  V0 = sc->H[0];
  V1 = sc->H[1];
@@ -429,19 +385,17 @@ void blake512_8way_compress( blake_8way_big_context *sc )
  V5 = sc->H[5];
  V6 = sc->H[6];
  V7 = sc->H[7];
-  V8 = m512_const1_64( CB0 );
-  V9 = m512_const1_64( CB1 );
-  VA = m512_const1_64( CB2 );
-  VB = m512_const1_64( CB3 );
+  V8 = _mm512_set1_epi64( CB0 );
+  V9 = _mm512_set1_epi64( CB1 );
+  VA = _mm512_set1_epi64( CB2 );
+  VB = _mm512_set1_epi64( CB3 );
  VC = _mm512_set1_epi64( sc->T0 ^ CB4 );
  VD = _mm512_set1_epi64( sc->T0 ^ CB5 );
  VE = _mm512_set1_epi64( sc->T1 ^ CB6 );
  VF = _mm512_set1_epi64( sc->T1 ^ CB7 );

-  shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637,
-                                0x28292a2b2c2d2e2f, 0x2021222324252627,
-                                0x18191a1b1c1d1e1f, 0x1011121314151617,
-                                0x08090a0b0c0d0e0f, 0x0001020304050607 );
+  const __m512i shuf_bswap64 = mm512_bcast_m128( _mm_set_epi64x( 
+                                   0x08090a0b0c0d0e0f, 0x0001020304050607 ) );

  M0 = _mm512_shuffle_epi8( sc->buf[ 0], shuf_bswap64 );
  M1 = _mm512_shuffle_epi8( sc->buf[ 1], shuf_bswap64 );
@@ -503,10 +457,10 @@ void blake512_8way_compress_le( blake_8way_big_context *sc )
  V5 = sc->H[5];
  V6 = sc->H[6];
  V7 = sc->H[7];
-  V8 = m512_const1_64( CB0 );
-  V9 = m512_const1_64( CB1 );
-  VA = m512_const1_64( CB2 );
-  VB = m512_const1_64( CB3 );
+  V8 = _mm512_set1_epi64( CB0 );
+  V9 = _mm512_set1_epi64( CB1 );
+  VA = _mm512_set1_epi64( CB2 );
+  VB = _mm512_set1_epi64( CB3 );
  VC = _mm512_set1_epi64( sc->T0 ^ CB4 );
  VD = _mm512_set1_epi64( sc->T0 ^ CB5 );
  VE = _mm512_set1_epi64( sc->T1 ^ CB6 );
@@ -565,23 +519,23 @@ void blake512_8way_prehash_le( blake_8way_big_context *sc, __m512i *midstate,
   __m512i V8, V9, VA, VB, VC, VD, VE, VF;

   // initial hash
-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   // fill buffer
   memcpy_512( sc->buf, (__m512i*)data, 80>>3 );
-   sc->buf[10] = m512_const1_64( 0x8000000000000000ULL );
+   sc->buf[10] = _mm512_set1_epi64( 0x8000000000000000ULL );
   sc->buf[11] = 
   sc->buf[12] = m512_zero;
   sc->buf[13] = m512_one_64;
   sc->buf[14] = m512_zero;
-   sc->buf[15] = m512_const1_64( 80*8 );
+   sc->buf[15] = _mm512_set1_epi64( 80*8 );

   // build working variables
   V0 = sc->H[0];
@@ -592,10 +546,10 @@ void blake512_8way_prehash_le( blake_8way_big_context *sc, __m512i *midstate,
   V5 = sc->H[5];
   V6 = sc->H[6];
   V7 = sc->H[7];
-   V8 = m512_const1_64( CB0 );
-   V9 = m512_const1_64( CB1 );
-   VA = m512_const1_64( CB2 );
-   VB = m512_const1_64( CB3 );
+   V8 = _mm512_set1_epi64( CB0 );
+   V9 = _mm512_set1_epi64( CB1 );
+   VA = _mm512_set1_epi64( CB2 );
+   VB = _mm512_set1_epi64( CB3 );
   VC = _mm512_set1_epi64( CB4 ^ 0x280ULL );
   VD = _mm512_set1_epi64( CB5 ^ 0x280ULL );
   VE = _mm512_set1_epi64( CB6 );
@@ -790,14 +744,14 @@ void blake512_8way_final_le( blake_8way_big_context *sc, void *hash,

 void blake512_8way_init( blake_8way_big_context *sc )
 {
-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -861,7 +815,7 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>3] = m512_const1_64( 0x80 );
+   buf[ptr>>3] = _mm512_set1_epi64( 0x80 );
   tl = sc->T0 + bit_len;
   th = sc->T1;
   if (ptr == 0 )
@@ -882,9 +836,9 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )
   {
       memset_zero_512( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
       buf[104>>3] = _mm512_or_si512( buf[104>>3],
-                                 m512_const1_64( 0x0100000000000000ULL ) );
-       buf[112>>3] = m512_const1_64( bswap_64( th ) );
-       buf[120>>3] = m512_const1_64( bswap_64( tl ) );
+                                 _mm512_set1_epi64( 0x0100000000000000ULL ) );
+       buf[112>>3] = _mm512_set1_epi64( bswap_64( th ) );
+       buf[120>>3] = _mm512_set1_epi64( bswap_64( tl ) );

       blake64_8way( sc, buf + (ptr>>3), 128 - ptr );
   }
@@ -896,9 +850,9 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )
       sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
       sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
       memset_zero_512( buf, 112>>3 );
-       buf[104>>3] = m512_const1_64( 0x0100000000000000ULL );
-       buf[112>>3] = m512_const1_64( bswap_64( th ) );
-       buf[120>>3] = m512_const1_64( bswap_64( tl ) );
+       buf[104>>3] = _mm512_set1_epi64( 0x0100000000000000ULL );
+       buf[112>>3] = _mm512_set1_epi64( bswap_64( th ) );
+       buf[120>>3] = _mm512_set1_epi64( bswap_64( tl ) );

       blake64_8way( sc, buf, 128 );
   }
@@ -912,14 +866,14 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
   
 // init

-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -943,7 +897,7 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
   uint64_t th, tl;

   bit_len = sc->ptr << 3;
-   sc->buf[ptr64] = m512_const1_64( 0x80 );
+   sc->buf[ptr64] = _mm512_set1_epi64( 0x80 );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -961,9 +915,9 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
      sc->T0 -= 1024 - bit_len;

   memset_zero_512( sc->buf + ptr64 + 1, 13 - ptr64 );
-   sc->buf[13] = m512_const1_64( 0x0100000000000000ULL );
-   sc->buf[14] = m512_const1_64( bswap_64( th ) );
-   sc->buf[15] = m512_const1_64( bswap_64( tl ) );
+   sc->buf[13] = _mm512_set1_epi64( 0x0100000000000000ULL );
+   sc->buf[14] = _mm512_set1_epi64( bswap_64( th ) );
+   sc->buf[15] = _mm512_set1_epi64( bswap_64( tl ) );

   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
       sc->T1 = sc->T1 + 1;
@@ -979,14 +933,14 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,

 // init

-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -1010,7 +964,7 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,
   uint64_t th, tl;

   bit_len = sc->ptr << 3;
-   sc->buf[ptr64] = m512_const1_64( 0x8000000000000000ULL );
+   sc->buf[ptr64] = _mm512_set1_epi64( 0x8000000000000000ULL );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -1029,8 +983,8 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,

   memset_zero_512( sc->buf + ptr64 + 1, 13 - ptr64 );
   sc->buf[13] = m512_one_64;
-   sc->buf[14] = m512_const1_64( th );
-   sc->buf[15] = m512_const1_64( tl );
+   sc->buf[14] = _mm512_set1_epi64( th );
+   sc->buf[15] = _mm512_set1_epi64( tl );

   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
       sc->T1 = sc->T1 + 1;
@@ -1092,7 +1046,6 @@ blake512_8way_close(void *cc, void *dst)
  __m256i M8, M9, MA, MB, MC, MD, ME, MF; \
  __m256i V0, V1, V2, V3, V4, V5, V6, V7; \
  __m256i V8, V9, VA, VB, VC, VD, VE, VF; \
-  __m256i shuf_bswap64; \
  V0 = H0; \
  V1 = H1; \
  V2 = H2; \
@@ -1101,16 +1054,16 @@ blake512_8way_close(void *cc, void *dst)
  V5 = H5; \
  V6 = H6; \
  V7 = H7; \
-  V8 = m256_const1_64( CB0 );  \
-  V9 = m256_const1_64( CB1 );  \
-  VA = m256_const1_64( CB2 );  \
-  VB = m256_const1_64( CB3 );  \
+  V8 = _mm256_set1_epi64x( CB0 );  \
+  V9 = _mm256_set1_epi64x( CB1 );  \
+  VA = _mm256_set1_epi64x( CB2 );  \
+  VB = _mm256_set1_epi64x( CB3 );  \
  VC = _mm256_set1_epi64x( T0 ^ CB4 ); \
  VD = _mm256_set1_epi64x( T0 ^ CB5 ); \
  VE = _mm256_set1_epi64x( T1 ^ CB6 ); \
  VF = _mm256_set1_epi64x( T1 ^ CB7 ); \
-  shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617, \
-                                0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
+  const __m256i shuf_bswap64 = mm256_bcast_m128( _mm_set_epi64x( \
+                             0x08090a0b0c0d0e0f, 0x0001020304050607 ) ); \
  M0 = _mm256_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
  M1 = _mm256_shuffle_epi8( *(buf+ 1), shuf_bswap64 ); \
  M2 = _mm256_shuffle_epi8( *(buf+ 2), shuf_bswap64 ); \
@@ -1160,7 +1113,6 @@ void blake512_4way_compress( blake_4way_big_context *sc )
  __m256i M8, M9, MA, MB, MC, MD, ME, MF;
  __m256i V0, V1, V2, V3, V4, V5, V6, V7;
  __m256i V8, V9, VA, VB, VC, VD, VE, VF;
-  __m256i shuf_bswap64;

  V0 = sc->H[0];
  V1 = sc->H[1];
@@ -1170,20 +1122,20 @@ void blake512_4way_compress( blake_4way_big_context *sc )
  V5 = sc->H[5];
  V6 = sc->H[6];
  V7 = sc->H[7];
-  V8 = m256_const1_64( CB0 );
-  V9 = m256_const1_64( CB1 );
-  VA = m256_const1_64( CB2 );
-  VB = m256_const1_64( CB3 );
+  V8 = _mm256_set1_epi64x( CB0 );
+  V9 = _mm256_set1_epi64x( CB1 );
+  VA = _mm256_set1_epi64x( CB2 );
+  VB = _mm256_set1_epi64x( CB3 );
  VC = _mm256_xor_si256( _mm256_set1_epi64x( sc->T0 ),
-                             m256_const1_64( CB4 ) );
+                             _mm256_set1_epi64x( CB4 ) );
  VD = _mm256_xor_si256( _mm256_set1_epi64x( sc->T0 ),
-                             m256_const1_64( CB5 ) );
+                             _mm256_set1_epi64x( CB5 ) );
  VE = _mm256_xor_si256( _mm256_set1_epi64x( sc->T1 ),
-                             m256_const1_64( CB6 ) );
+                             _mm256_set1_epi64x( CB6 ) );
  VF = _mm256_xor_si256( _mm256_set1_epi64x( sc->T1 ),
-                             m256_const1_64( CB7 ) );
-  shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617,
-                                0x08090a0b0c0d0e0f, 0x0001020304050607 );
+                             _mm256_set1_epi64x( CB7 ) );
+  const __m256i shuf_bswap64 = mm256_bcast_m128( _mm_set_epi64x(
+                                    0x08090a0b0c0d0e0f, 0x0001020304050607 ) );

  M0 = _mm256_shuffle_epi8( sc->buf[ 0], shuf_bswap64 );
  M1 = _mm256_shuffle_epi8( sc->buf[ 1], shuf_bswap64 );
@@ -1236,23 +1188,23 @@ void blake512_4way_prehash_le( blake_4way_big_context *sc, __m256i *midstate,
   __m256i V8, V9, VA, VB, VC, VD, VE, VF;

   // initial hash
-   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
-   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
-   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
-   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
-   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
+   casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
   
   // fill buffer
   memcpy_256( sc->buf, (__m256i*)data, 80>>3 );
-   sc->buf[10] = m256_const1_64( 0x8000000000000000ULL );
+   sc->buf[10] = _mm256_set1_epi64x( 0x8000000000000000ULL );
   sc->buf[11] = m256_zero;
   sc->buf[12] = m256_zero;
   sc->buf[13] = m256_one_64;
   sc->buf[14] = m256_zero;
-   sc->buf[15] = m256_const1_64( 80*8 );
+   sc->buf[15] = _mm256_set1_epi64x( 80*8 );

   // build working variables
   V0 = sc->H[0];
@@ -1263,10 +1215,10 @@ void blake512_4way_prehash_le( blake_4way_big_context *sc, __m256i *midstate,
   V5 = sc->H[5];
   V6 = sc->H[6];
   V7 = sc->H[7];
-   V8 = m256_const1_64( CB0 );
-   V9 = m256_const1_64( CB1 );
-   VA = m256_const1_64( CB2 );
-   VB = m256_const1_64( CB3 );
+   V8 = _mm256_set1_epi64x( CB0 );
+   V9 = _mm256_set1_epi64x( CB1 );
+   VA = _mm256_set1_epi64x( CB2 );
+   VB = _mm256_set1_epi64x( CB3 );
   VC = _mm256_set1_epi64x( CB4 ^ 0x280ULL );
   VD = _mm256_set1_epi64x( CB5 ^ 0x280ULL );
   VE = _mm256_set1_epi64x( CB6 );
@@ -1446,14 +1398,14 @@ void blake512_4way_final_le( blake_4way_big_context *sc, void *hash,

 void blake512_4way_init( blake_4way_big_context *sc )
 {
-   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
-   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
-   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
-   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
-   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
+   casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -1494,7 +1446,7 @@ blake64_4way( blake_4way_big_context *sc, const void *data, size_t len)
 	   if ( ptr == buf_size )
      {
 		   if ( (T0 = T0 + 1024 ) < 1024 )
-			   T1 = SPH_T64(T1 + 1);
+			   T1 = T1 + 1;
 	   	COMPRESS64_4WAY;
 		   ptr = 0;
 	   }
@@ -1513,7 +1465,7 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>3] = m256_const1_64( 0x80 );
+   buf[ptr>>3] = _mm256_set1_epi64x( 0x80 );
   tl = sc->T0 + bit_len;
   th = sc->T1;
   if (ptr == 0 )
@@ -1535,9 +1487,9 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )
   {
       memset_zero_256( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
       buf[104>>3] = _mm256_or_si256( buf[104>>3],
-                                 m256_const1_64( 0x0100000000000000ULL ) );
-       buf[112>>3] = m256_const1_64( bswap_64( th ) );
-       buf[120>>3] = m256_const1_64( bswap_64( tl ) );
+                                 _mm256_set1_epi64x( 0x0100000000000000ULL ) );
+       buf[112>>3] = _mm256_set1_epi64x( bswap_64( th ) );
+       buf[120>>3] = _mm256_set1_epi64x( bswap_64( tl ) );

       blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
   }
@@ -1546,12 +1498,12 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )
       memset_zero_256( buf + (ptr>>3) + 1, (120 - ptr) >> 3 );

       blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
-       sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
-       sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
+       sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
+       sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
       memset_zero_256( buf, 112>>3 ); 
-       buf[104>>3] = m256_const1_64( 0x0100000000000000ULL );
-       buf[112>>3] = m256_const1_64( bswap_64( th ) );
-       buf[120>>3] = m256_const1_64( bswap_64( tl ) );
+       buf[104>>3] = _mm256_set1_epi64x( 0x0100000000000000ULL );
+       buf[112>>3] = _mm256_set1_epi64x( bswap_64( th ) );
+       buf[120>>3] = _mm256_set1_epi64x( bswap_64( tl ) );

       blake64_4way( sc, buf, 128 );
   }
@@ -1565,14 +1517,14 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,

 // init

-   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
-   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
-   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
-   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
-   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
+   casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -1596,7 +1548,7 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,
   uint64_t th, tl;

   bit_len = sc->ptr << 3;
-   sc->buf[ptr64] = m256_const1_64( 0x80 );
+   sc->buf[ptr64] = _mm256_set1_epi64x( 0x80 );
   tl = sc->T0 + bit_len;
   th = sc->T1;
   if ( sc->ptr == 0 )
@@ -1613,9 +1565,9 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,
        sc->T0 -= 1024 - bit_len;

   memset_zero_256( sc->buf + ptr64 + 1, 13 - ptr64 );
-   sc->buf[13] = m256_const1_64( 0x0100000000000000ULL );
-   sc->buf[14] = m256_const1_64( bswap_64( th ) );
-   sc->buf[15] = m256_const1_64( bswap_64( tl ) );
+   sc->buf[13] = _mm256_set1_epi64x( 0x0100000000000000ULL );
+   sc->buf[14] = _mm256_set1_epi64x( bswap_64( th ) );
+   sc->buf[15] = _mm256_set1_epi64x( bswap_64( tl ) );

   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
       sc->T1 = sc->T1 + 1;
@@ -1637,8 +1589,4 @@ blake512_4way_close(void *cc, void *dst)
   blake64_4way_close( cc, dst );
 }

-#ifdef __cplusplus
-}
-#endif
-
 #endif
--- a/algo/blake/blakecoin-4way.c
+++ b/algo/blake/blakecoin-4way.c
@@ -4,7 +4,149 @@
 #include <stdint.h>
 #include <memory.h>

-#if defined (BLAKECOIN_4WAY)
+#define rounds 8
+
+#if defined (BLAKECOIN_16WAY)
+
+int scanhash_blakecoin_16way( struct work *work, uint32_t max_nonce,
+                         uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t hash32[8*16] __attribute__ ((aligned (64)));
+   uint32_t midstate_vars[16*16] __attribute__ ((aligned (64)));
+   __m512i block0_hash[8] __attribute__ ((aligned (64)));
+   __m512i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 =  (uint32_t*)&( ((__m512i*)hash32)[7] );
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   uint32_t phash[8] __attribute__ ((aligned (64))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
+   uint32_t n = pdata[19];
+   const uint32_t first_nonce = (const uint32_t) n;
+   const uint32_t last_nonce = max_nonce - 16;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m512i sixteen = _mm512_set1_epi32( 16 );
+
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0, rounds );
+
+   block0_hash[0] = _mm512_set1_epi32( phash[0] );
+   block0_hash[1] = _mm512_set1_epi32( phash[1] );
+   block0_hash[2] = _mm512_set1_epi32( phash[2] );
+   block0_hash[3] = _mm512_set1_epi32( phash[3] );
+   block0_hash[4] = _mm512_set1_epi32( phash[4] );
+   block0_hash[5] = _mm512_set1_epi32( phash[5] );
+   block0_hash[6] = _mm512_set1_epi32( phash[6] );
+   block0_hash[7] = _mm512_set1_epi32( phash[7] );
+
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces.
+   block_buf[0] = _mm512_set1_epi32( pdata[16] );
+   block_buf[1] = _mm512_set1_epi32( pdata[17] );
+   block_buf[2] = _mm512_set1_epi32( pdata[18] );
+   block_buf[3] =
+             _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
+                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
+
+   // Partially prehash second block without touching nonces in block_buf[3].
+   blake256_16way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
+
+   do {
+      blake256_16way_final_rounds_le( hash32, midstate_vars, block0_hash,
+                                      block_buf, rounds );
+      for ( int lane = 0; lane < 16; lane++ )
+      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      {
+         extr_lane_16x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      block_buf[3] = _mm512_add_epi32( block_buf[3], sixteen );
+      n += 16;
+   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#elif defined (BLAKECOIN_8WAY)
+
+int scanhash_blakecoin_8way( struct work *work, uint32_t max_nonce,
+                         uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t hash32[8*8] __attribute__ ((aligned (64)));
+   uint32_t midstate_vars[16*8] __attribute__ ((aligned (32)));
+   __m256i block0_hash[8] __attribute__ ((aligned (32)));
+   __m256i block_buf[16] __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 =  (uint32_t*)&( ((__m256i*)hash32)[7] );
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   uint32_t phash[8] __attribute__ ((aligned (32))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
+   uint32_t n = pdata[19];
+   const uint32_t first_nonce = (const uint32_t) n;
+   const uint32_t last_nonce = max_nonce - 8;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m256i eight = _mm256_set1_epi32( 8 );
+
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0, rounds );
+
+   block0_hash[0] = _mm256_set1_epi32( phash[0] );
+   block0_hash[1] = _mm256_set1_epi32( phash[1] );
+   block0_hash[2] = _mm256_set1_epi32( phash[2] );
+   block0_hash[3] = _mm256_set1_epi32( phash[3] );
+   block0_hash[4] = _mm256_set1_epi32( phash[4] );
+   block0_hash[5] = _mm256_set1_epi32( phash[5] );
+   block0_hash[6] = _mm256_set1_epi32( phash[6] );
+   block0_hash[7] = _mm256_set1_epi32( phash[7] );
+
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces.
+   block_buf[0] = _mm256_set1_epi32( pdata[16] );
+   block_buf[1] = _mm256_set1_epi32( pdata[17] );
+   block_buf[2] = _mm256_set1_epi32( pdata[18] );
+   block_buf[3] = _mm256_set_epi32( n+7, n+6, n+5, n+4, n+3, n+2, n+1, n );
+
+   // Partially prehash second block without touching nonces in block_buf[3].
+   blake256_8way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
+
+   do {
+      blake256_8way_final_rounds_le( hash32, midstate_vars, block0_hash,
+                                     block_buf, rounds );
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      {
+         extr_lane_8x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      block_buf[3] = _mm256_add_epi32( block_buf[3], eight );
+      n += 8;
+   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#elif defined (BLAKECOIN_4WAY)

 blake256r8_4way_context blakecoin_4w_ctx;

@@ -61,7 +203,8 @@ int scanhash_blakecoin_4way( struct work *work, uint32_t max_nonce,

 #endif

-#if defined(BLAKECOIN_8WAY)
+#if 0
+//#if defined(BLAKECOIN_8WAY)

 blake256r8_8way_context blakecoin_8w_ctx;

@@ -78,11 +221,84 @@ void blakecoin_8way_hash( void *state, const void *input )
                   state+160, state+192, state+224, vhash, 256 );
 }

+/*
+int scanhash_blakecoin_8way( struct work *work, uint32_t max_nonce,
+                             uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t hash32[8*8] __attribute__ ((aligned (64)));
+   uint32_t midstate_vars[16*8] __attribute__ ((aligned (64)));
+   __m256i block0_hash[8] __attribute__ ((aligned (64)));
+   __m256i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
+   uint32_t phash[8] __attribute__ ((aligned (32))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = (uint32_t*)work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m256i eight = _mm256_set1_epi32( 8 );
+
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0, 8 );
+
+   block0_hash[0] = _mm256_set1_epi32( phash[0] );
+   block0_hash[1] = _mm256_set1_epi32( phash[1] );
+   block0_hash[2] = _mm256_set1_epi32( phash[2] );
+   block0_hash[3] = _mm256_set1_epi32( phash[3] );
+   block0_hash[4] = _mm256_set1_epi32( phash[4] );
+   block0_hash[5] = _mm256_set1_epi32( phash[5] );
+   block0_hash[6] = _mm256_set1_epi32( phash[6] );
+   block0_hash[7] = _mm256_set1_epi32( phash[7] );
+
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces.
+   block_buf[0] = _mm256_set1_epi32( pdata[16] );
+   block_buf[1] = _mm256_set1_epi32( pdata[17] );
+   block_buf[2] = _mm256_set1_epi32( pdata[18] );
+   block_buf[3] = _mm256_set_epi32( n+7, n+6, n+5, n+4, n+3, n+2, n+1, n );
+
+   // Partially prehash second block without touching nonces
+   blake256_8way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
+
+   do {
+      blake256_8way_final_rounds_le( hash32, midstate_vars, block0_hash, 
+                                     block_buf );
+
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( hash32_d7[ lane ] <= targ32_d7 )
+      {
+         extr_lane_8x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      block_buf[3] = _mm256_add_epi32( block_buf[3], eight );
+      n += 8;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+*/
+
 int scanhash_blakecoin_8way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr )
 {
   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
-   uint32_t hash[8*8] __attribute__ ((aligned (32)));
+   uint32_t hash32[8*8] __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   blake256r8_8way_context ctx __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 =  (uint32_t*)&( ((__m256i*)hash32)[7] );
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -101,15 +317,22 @@ int scanhash_blakecoin_8way( struct work *work, uint32_t max_nonce,
      *noncev = mm256_bswap_32( _mm256_set_epi32( n+7, n+6, n+5, n+4,
                                                  n+3, n+2, n+1, n ) );
      pdata[19] = n;
-      blakecoin_8way_hash( hash, vdata );

-      for ( int i = 0; i < 8; i++ )
-      if (  (hash+(i<<3))[7] <= HTarget && fulltest( hash+(i<<3), ptarget )
-          && !opt_benchmark )
+      memcpy( &ctx, &blakecoin_8w_ctx, sizeof ctx );
+      blake256r8_8way_update( &ctx, (const void*)vdata + (64<<3), 16 );
+      blake256r8_8way_close( &ctx, hash32 );
+
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( hash32_d7[ lane ] <= HTarget )
      {
-          pdata[19] = n+i;
-          submit_solution( work, hash+(i<<3), mythr );
+         extr_lane_8x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !opt_benchmark ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
      }
+
      n += 8;
   } while ( (n < max_nonce) && !work_restart[thr_id].restart );

--- a/algo/blake/blakecoin-gate.c
+++ b/algo/blake/blakecoin-gate.c
@@ -4,10 +4,10 @@
 // vanilla uses default gen merkle root, otherwise identical to blakecoin
 bool register_vanilla_algo( algo_gate_t* gate )
 {
-#if defined(BLAKECOIN_8WAY)
+#if defined(BLAKECOIN_16WAY)
+  gate->scanhash  = (void*)&scanhash_blakecoin_16way;
+#elif defined(BLAKECOIN_8WAY)
  gate->scanhash  = (void*)&scanhash_blakecoin_8way;
-  gate->hash      = (void*)&blakecoin_8way_hash;
-
 #elif defined(BLAKECOIN_4WAY)
  gate->scanhash  = (void*)&scanhash_blakecoin_4way;
  gate->hash      = (void*)&blakecoin_4way_hash;
@@ -15,14 +15,14 @@ bool register_vanilla_algo( algo_gate_t* gate )
  gate->scanhash = (void*)&scanhash_blakecoin;
  gate->hash     = (void*)&blakecoinhash;
 #endif
-  gate->optimizations = SSE42_OPT | AVX2_OPT;
+  gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
  return true;
 }

 bool register_blakecoin_algo( algo_gate_t* gate )
 {
  register_vanilla_algo( gate );
-  gate->gen_merkle_root = (void*)&SHA256_gen_merkle_root;
+  gate->gen_merkle_root = (void*)&sha256_gen_merkle_root;
  return true;
 }

--- a/algo/blake/blakecoin-gate.h
+++ b/algo/blake/blakecoin-gate.h
@@ -1,30 +1,36 @@
-#ifndef __BLAKECOIN_GATE_H__
-#define __BLAKECOIN_GATE_H__ 1
+#ifndef BLAKECOIN_GATE_H__
+#define BLAKECOIN_GATE_H__ 1

 #include "algo-gate-api.h"
 #include <stdint.h>

-#if defined(__SSE4_2__)
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define BLAKECOIN_16WAY
+#elif defined(__AVX2__)
+  #define BLAKECOIN_8WAY
+#elif defined(__SSE2__)  // always true
  #define BLAKECOIN_4WAY
 #endif
-#if defined(__AVX2__)
-  #define BLAKECOIN_8WAY
-#endif

-#if defined (BLAKECOIN_8WAY)
-void blakecoin_8way_hash(void *state, const void *input);
+#if defined (BLAKECOIN_16WAY)
+int scanhash_blakecoin_16way( struct work *work, uint32_t max_nonce,
+                         uint64_t *hashes_done, struct thr_info *mythr );
+
+#elif defined (BLAKECOIN_8WAY)
+//void blakecoin_8way_hash(void *state, const void *input);
 int scanhash_blakecoin_8way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
-#endif

-#if defined (BLAKECOIN_4WAY)
+#elif defined (BLAKECOIN_4WAY)
 void blakecoin_4way_hash(void *state, const void *input);
 int scanhash_blakecoin_4way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
-#endif
+#else  // never used

 void blakecoinhash( void *state, const void *input );
 int scanhash_blakecoin( struct work *work, uint32_t max_nonce,
                      uint64_t *hashes_done, struct thr_info *mythr );

 #endif
+
+#endif
--- a/algo/blake/blakecoin.c
+++ b/algo/blake/blakecoin.c
@@ -1,6 +1,6 @@
 #include "blakecoin-gate.h"

-#if !defined(BLAKECOIN_8WAY) && !defined(BLAKECOIN_4WAY)
+#if !defined(BLAKECOIN_16WAY) && !defined(BLAKECOIN_8WAY) && !defined(BLAKECOIN_4WAY)

 #define BLAKE32_ROUNDS 8
 #include "sph_blake.h"
@@ -12,7 +12,6 @@ void blakecoin_close(void *cc, void *dst);
 #include <string.h>
 #include <stdint.h>
 #include <memory.h>
-#include <openssl/sha.h>

 // context management is staged for efficiency.
 // 1. global initial ctx cached on startup
@@ -35,8 +34,8 @@ void blakecoinhash( void *state, const void *input )
 	uint8_t hash[64] __attribute__ ((aligned (32)));
 	uint8_t *ending = (uint8_t*) input + 64;

-        // copy cached midstate
-        memcpy( &ctx, &blake_mid_ctx, sizeof ctx );
+   // copy cached midstate
+   memcpy( &ctx, &blake_mid_ctx, sizeof ctx );
 	blakecoin( &ctx, ending, 16 );
 	blakecoin_close( &ctx, hash );
 	memcpy( state, hash, 32 );
@@ -45,8 +44,8 @@ void blakecoinhash( void *state, const void *input )
 int scanhash_blakecoin( struct work *work, uint32_t max_nonce,
                        uint64_t *hashes_done, struct thr_info *mythr )
 {
-        uint32_t *pdata = work->data;
-        uint32_t *ptarget = work->target;
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
 	const uint32_t first_nonce = pdata[19];
 	uint32_t HTarget = ptarget[7];
   int thr_id = mythr->id;  // thr_id arg is deprecated
@@ -60,10 +59,10 @@ int scanhash_blakecoin( struct work *work, uint32_t max_nonce,
 		HTarget = 0x7f;

 	// we need big endian data...
-        for (int kk=0; kk < 19; kk++) 
-                be32enc(&endiandata[kk], ((uint32_t*)pdata)[kk]);
+   for (int kk=0; kk < 19; kk++) 
+      be32enc(&endiandata[kk], ((uint32_t*)pdata)[kk]);

-        blake_midstate_init( endiandata );
+   blake_midstate_init( endiandata );

 #ifdef DEBUG_ALGO
 	applog(LOG_DEBUG,"[%d] Target=%08x %08x", thr_id, ptarget[6], ptarget[7]);
--- a/algo/blake/sph_blake2b.c
+++ b/algo/blake/sph_blake2b.c
@@ -64,6 +64,22 @@
  V[1] = mm256_ror_64( _mm256_xor_si256( V[1], V[2] ), 63 ); \
 }

+// Pivoting about V[1] instead of V[0] reduces latency.
+#define BLAKE2B_ROUND( R ) \
+{ \
+  __m256i *V = (__m256i*)v; \
+  const uint8_t *sigmaR = sigma[R]; \
+  BLAKE2B_G(  0,  1,  2,  3,  4,  5,  6,  7 ); \
+  V[0] = mm256_shufll_64( V[0] ); \
+  V[3] = mm256_swap_128( V[3] ); \
+  V[2] = mm256_shuflr_64( V[2] ); \
+  BLAKE2B_G( 14, 15,  8,  9, 10, 11, 12, 13 ); \
+  V[0] = mm256_shuflr_64( V[0] ); \
+  V[3] = mm256_swap_128( V[3] ); \
+  V[2] = mm256_shufll_64( V[2] ); \
+}
+
+/*
 #define BLAKE2B_ROUND( R ) \
 { \
  __m256i *V = (__m256i*)v; \
@@ -77,6 +93,7 @@
  V[2] = mm256_swap_128( V[2] ); \
  V[1] = mm256_shufll_64( V[1] ); \
 }
+*/

 #elif defined(__SSE2__)
 // always true
--- a/algo/bmw/bmw256-hash-4way.c
+++ b/algo/bmw/bmw256-hash-4way.c
@@ -451,22 +451,22 @@ static const __m128i final_s[16] =
 */
 void bmw256_4way_init( bmw256_4way_context *ctx )
 {
-   ctx->H[ 0] = m128_const1_64( 0x4041424340414243 );
-   ctx->H[ 1] = m128_const1_64( 0x4445464744454647 );
-   ctx->H[ 2] = m128_const1_64( 0x48494A4B48494A4B );
-   ctx->H[ 3] = m128_const1_64( 0x4C4D4E4F4C4D4E4F );
-   ctx->H[ 4] = m128_const1_64( 0x5051525350515253 );
-   ctx->H[ 5] = m128_const1_64( 0x5455565754555657 );
-   ctx->H[ 6] = m128_const1_64( 0x58595A5B58595A5B );
-   ctx->H[ 7] = m128_const1_64( 0x5C5D5E5F5C5D5E5F );
-   ctx->H[ 8] = m128_const1_64( 0x6061626360616263 );
-   ctx->H[ 9] = m128_const1_64( 0x6465666764656667 );
-   ctx->H[10] = m128_const1_64( 0x68696A6B68696A6B );
-   ctx->H[11] = m128_const1_64( 0x6C6D6E6F6C6D6E6F );
-   ctx->H[12] = m128_const1_64( 0x7071727370717273 );
-   ctx->H[13] = m128_const1_64( 0x7475767774757677 );
-   ctx->H[14] = m128_const1_64( 0x78797A7B78797A7B );
-   ctx->H[15] = m128_const1_64( 0x7C7D7E7F7C7D7E7F );
+   ctx->H[ 0] = _mm_set1_epi64x( 0x4041424340414243 );
+   ctx->H[ 1] = _mm_set1_epi64x( 0x4445464744454647 );
+   ctx->H[ 2] = _mm_set1_epi64x( 0x48494A4B48494A4B );
+   ctx->H[ 3] = _mm_set1_epi64x( 0x4C4D4E4F4C4D4E4F );
+   ctx->H[ 4] = _mm_set1_epi64x( 0x5051525350515253 );
+   ctx->H[ 5] = _mm_set1_epi64x( 0x5455565754555657 );
+   ctx->H[ 6] = _mm_set1_epi64x( 0x58595A5B58595A5B );
+   ctx->H[ 7] = _mm_set1_epi64x( 0x5C5D5E5F5C5D5E5F );
+   ctx->H[ 8] = _mm_set1_epi64x( 0x6061626360616263 );
+   ctx->H[ 9] = _mm_set1_epi64x( 0x6465666764656667 );
+   ctx->H[10] = _mm_set1_epi64x( 0x68696A6B68696A6B );
+   ctx->H[11] = _mm_set1_epi64x( 0x6C6D6E6F6C6D6E6F );
+   ctx->H[12] = _mm_set1_epi64x( 0x7071727370717273 );
+   ctx->H[13] = _mm_set1_epi64x( 0x7475767774757677 );
+   ctx->H[14] = _mm_set1_epi64x( 0x78797A7B78797A7B );
+   ctx->H[15] = _mm_set1_epi64x( 0x7C7D7E7F7C7D7E7F );


 //   for ( int i = 0; i < 16; i++ )
@@ -529,7 +529,7 @@ bmw32_4way_close(bmw_4way_small_context *sc, unsigned ub, unsigned n,

   buf = sc->buf;
   ptr = sc->ptr;
-   buf[ ptr>>2 ] = m128_const1_64( 0x0000008000000080 );
+   buf[ ptr>>2 ] = _mm_set1_epi64x( 0x0000008000000080 );
   ptr += 4;
   h = sc->H;

@@ -959,22 +959,22 @@ static const __m256i final_s8[16] =

 void bmw256_8way_init( bmw256_8way_context *ctx )
 {
-   ctx->H[ 0] = m256_const1_64( 0x4041424340414243 );
-   ctx->H[ 1] = m256_const1_64( 0x4445464744454647 );
-   ctx->H[ 2] = m256_const1_64( 0x48494A4B48494A4B );
-   ctx->H[ 3] = m256_const1_64( 0x4C4D4E4F4C4D4E4F );
-   ctx->H[ 4] = m256_const1_64( 0x5051525350515253 );
-   ctx->H[ 5] = m256_const1_64( 0x5455565754555657 );
-   ctx->H[ 6] = m256_const1_64( 0x58595A5B58595A5B );
-   ctx->H[ 7] = m256_const1_64( 0x5C5D5E5F5C5D5E5F );
-   ctx->H[ 8] = m256_const1_64( 0x6061626360616263 );
-   ctx->H[ 9] = m256_const1_64( 0x6465666764656667 );
-   ctx->H[10] = m256_const1_64( 0x68696A6B68696A6B );
-   ctx->H[11] = m256_const1_64( 0x6C6D6E6F6C6D6E6F );
-   ctx->H[12] = m256_const1_64( 0x7071727370717273 );
-   ctx->H[13] = m256_const1_64( 0x7475767774757677 );
-   ctx->H[14] = m256_const1_64( 0x78797A7B78797A7B );
-   ctx->H[15] = m256_const1_64( 0x7C7D7E7F7C7D7E7F );
+   ctx->H[ 0] = _mm256_set1_epi64x( 0x4041424340414243 );
+   ctx->H[ 1] = _mm256_set1_epi64x( 0x4445464744454647 );
+   ctx->H[ 2] = _mm256_set1_epi64x( 0x48494A4B48494A4B );
+   ctx->H[ 3] = _mm256_set1_epi64x( 0x4C4D4E4F4C4D4E4F );
+   ctx->H[ 4] = _mm256_set1_epi64x( 0x5051525350515253 );
+   ctx->H[ 5] = _mm256_set1_epi64x( 0x5455565754555657 );
+   ctx->H[ 6] = _mm256_set1_epi64x( 0x58595A5B58595A5B );
+   ctx->H[ 7] = _mm256_set1_epi64x( 0x5C5D5E5F5C5D5E5F );
+   ctx->H[ 8] = _mm256_set1_epi64x( 0x6061626360616263 );
+   ctx->H[ 9] = _mm256_set1_epi64x( 0x6465666764656667 );
+   ctx->H[10] = _mm256_set1_epi64x( 0x68696A6B68696A6B );
+   ctx->H[11] = _mm256_set1_epi64x( 0x6C6D6E6F6C6D6E6F );
+   ctx->H[12] = _mm256_set1_epi64x( 0x7071727370717273 );
+   ctx->H[13] = _mm256_set1_epi64x( 0x7475767774757677 );
+   ctx->H[14] = _mm256_set1_epi64x( 0x78797A7B78797A7B );
+   ctx->H[15] = _mm256_set1_epi64x( 0x7C7D7E7F7C7D7E7F );
   ctx->ptr       = 0;
   ctx->bit_count = 0;
 }
@@ -1030,7 +1030,7 @@ void bmw256_8way_close( bmw256_8way_context *ctx, void *dst )

   buf = ctx->buf;
   ptr = ctx->ptr;
-   buf[ ptr>>2 ] = m256_const1_64( 0x0000008000000080 );
+   buf[ ptr>>2 ] = _mm256_set1_epi64x( 0x0000008000000080 );
   ptr += 4;
   h = ctx->H;

@@ -1460,22 +1460,22 @@ static const __m512i final_s16[16] =

 void bmw256_16way_init( bmw256_16way_context *ctx )
 {
-   ctx->H[ 0] = m512_const1_64( 0x4041424340414243 );
-   ctx->H[ 1] = m512_const1_64( 0x4445464744454647 );
-   ctx->H[ 2] = m512_const1_64( 0x48494A4B48494A4B );
-   ctx->H[ 3] = m512_const1_64( 0x4C4D4E4F4C4D4E4F );
-   ctx->H[ 4] = m512_const1_64( 0x5051525350515253 );
-   ctx->H[ 5] = m512_const1_64( 0x5455565754555657 );
-   ctx->H[ 6] = m512_const1_64( 0x58595A5B58595A5B );
-   ctx->H[ 7] = m512_const1_64( 0x5C5D5E5F5C5D5E5F );
-   ctx->H[ 8] = m512_const1_64( 0x6061626360616263 );
-   ctx->H[ 9] = m512_const1_64( 0x6465666764656667 );
-   ctx->H[10] = m512_const1_64( 0x68696A6B68696A6B );
-   ctx->H[11] = m512_const1_64( 0x6C6D6E6F6C6D6E6F );
-   ctx->H[12] = m512_const1_64( 0x7071727370717273 );
-   ctx->H[13] = m512_const1_64( 0x7475767774757677 );
-   ctx->H[14] = m512_const1_64( 0x78797A7B78797A7B );
-   ctx->H[15] = m512_const1_64( 0x7C7D7E7F7C7D7E7F );
+   ctx->H[ 0] = _mm512_set1_epi64( 0x4041424340414243 );
+   ctx->H[ 1] = _mm512_set1_epi64( 0x4445464744454647 );
+   ctx->H[ 2] = _mm512_set1_epi64( 0x48494A4B48494A4B );
+   ctx->H[ 3] = _mm512_set1_epi64( 0x4C4D4E4F4C4D4E4F );
+   ctx->H[ 4] = _mm512_set1_epi64( 0x5051525350515253 );
+   ctx->H[ 5] = _mm512_set1_epi64( 0x5455565754555657 );
+   ctx->H[ 6] = _mm512_set1_epi64( 0x58595A5B58595A5B );
+   ctx->H[ 7] = _mm512_set1_epi64( 0x5C5D5E5F5C5D5E5F );
+   ctx->H[ 8] = _mm512_set1_epi64( 0x6061626360616263 );
+   ctx->H[ 9] = _mm512_set1_epi64( 0x6465666764656667 );
+   ctx->H[10] = _mm512_set1_epi64( 0x68696A6B68696A6B );
+   ctx->H[11] = _mm512_set1_epi64( 0x6C6D6E6F6C6D6E6F );
+   ctx->H[12] = _mm512_set1_epi64( 0x7071727370717273 );
+   ctx->H[13] = _mm512_set1_epi64( 0x7475767774757677 );
+   ctx->H[14] = _mm512_set1_epi64( 0x78797A7B78797A7B );
+   ctx->H[15] = _mm512_set1_epi64( 0x7C7D7E7F7C7D7E7F );
   ctx->ptr       = 0;
   ctx->bit_count = 0;
 }
@@ -1531,7 +1531,7 @@ void bmw256_16way_close( bmw256_16way_context *ctx, void *dst )

   buf = ctx->buf;
   ptr = ctx->ptr;
-   buf[ ptr>>2 ] = m512_const1_64( 0x0000008000000080 );
+   buf[ ptr>>2 ] = _mm512_set1_epi64( 0x0000008000000080 );
   ptr += 4;
   h = ctx->H;

--- a/algo/bmw/bmw512-hash-4way.c
+++ b/algo/bmw/bmw512-hash-4way.c
@@ -896,22 +896,22 @@ static const __m256i final_b[16] =
 static void
 bmw64_4way_init( bmw_4way_big_context *sc, const sph_u64 *iv )
 {
-   sc->H[ 0] = m256_const1_64( 0x8081828384858687 );
-   sc->H[ 1] = m256_const1_64( 0x88898A8B8C8D8E8F );
-   sc->H[ 2] = m256_const1_64( 0x9091929394959697 );
-   sc->H[ 3] = m256_const1_64( 0x98999A9B9C9D9E9F );
-   sc->H[ 4] = m256_const1_64( 0xA0A1A2A3A4A5A6A7 );
-   sc->H[ 5] = m256_const1_64( 0xA8A9AAABACADAEAF );
-   sc->H[ 6] = m256_const1_64( 0xB0B1B2B3B4B5B6B7 );
-   sc->H[ 7] = m256_const1_64( 0xB8B9BABBBCBDBEBF );
-   sc->H[ 8] = m256_const1_64( 0xC0C1C2C3C4C5C6C7 );
-   sc->H[ 9] = m256_const1_64( 0xC8C9CACBCCCDCECF );
-   sc->H[10] = m256_const1_64( 0xD0D1D2D3D4D5D6D7 );
-   sc->H[11] = m256_const1_64( 0xD8D9DADBDCDDDEDF );
-   sc->H[12] = m256_const1_64( 0xE0E1E2E3E4E5E6E7 );
-   sc->H[13] = m256_const1_64( 0xE8E9EAEBECEDEEEF );
-   sc->H[14] = m256_const1_64( 0xF0F1F2F3F4F5F6F7 );
-   sc->H[15] = m256_const1_64( 0xF8F9FAFBFCFDFEFF );
+   sc->H[ 0] = _mm256_set1_epi64x( 0x8081828384858687 );
+   sc->H[ 1] = _mm256_set1_epi64x( 0x88898A8B8C8D8E8F );
+   sc->H[ 2] = _mm256_set1_epi64x( 0x9091929394959697 );
+   sc->H[ 3] = _mm256_set1_epi64x( 0x98999A9B9C9D9E9F );
+   sc->H[ 4] = _mm256_set1_epi64x( 0xA0A1A2A3A4A5A6A7 );
+   sc->H[ 5] = _mm256_set1_epi64x( 0xA8A9AAABACADAEAF );
+   sc->H[ 6] = _mm256_set1_epi64x( 0xB0B1B2B3B4B5B6B7 );
+   sc->H[ 7] = _mm256_set1_epi64x( 0xB8B9BABBBCBDBEBF );
+   sc->H[ 8] = _mm256_set1_epi64x( 0xC0C1C2C3C4C5C6C7 );
+   sc->H[ 9] = _mm256_set1_epi64x( 0xC8C9CACBCCCDCECF );
+   sc->H[10] = _mm256_set1_epi64x( 0xD0D1D2D3D4D5D6D7 );
+   sc->H[11] = _mm256_set1_epi64x( 0xD8D9DADBDCDDDEDF );
+   sc->H[12] = _mm256_set1_epi64x( 0xE0E1E2E3E4E5E6E7 );
+   sc->H[13] = _mm256_set1_epi64x( 0xE8E9EAEBECEDEEEF );
+   sc->H[14] = _mm256_set1_epi64x( 0xF0F1F2F3F4F5F6F7 );
+   sc->H[15] = _mm256_set1_epi64x( 0xF8F9FAFBFCFDFEFF );
   sc->ptr = 0;
   sc->bit_count = 0;
 }
@@ -967,7 +967,7 @@ bmw64_4way_close(bmw_4way_big_context *sc, unsigned ub, unsigned n,

   buf = sc->buf;
   ptr = sc->ptr;
-   buf[ ptr>>3 ] = m256_const1_64( 0x80 );
+   buf[ ptr>>3 ] = _mm256_set1_epi64x( 0x80 );
   ptr += 8;
   h = sc->H;

@@ -1379,22 +1379,22 @@ static const __m512i final_b8[16] =
 void bmw512_8way_init( bmw512_8way_context *ctx )
 //bmw64_4way_init( bmw_4way_big_context *sc, const sph_u64 *iv )
 {
-   ctx->H[ 0] = m512_const1_64( 0x8081828384858687 );
-   ctx->H[ 1] = m512_const1_64( 0x88898A8B8C8D8E8F );
-   ctx->H[ 2] = m512_const1_64( 0x9091929394959697 );
-   ctx->H[ 3] = m512_const1_64( 0x98999A9B9C9D9E9F );
-   ctx->H[ 4] = m512_const1_64( 0xA0A1A2A3A4A5A6A7 );
-   ctx->H[ 5] = m512_const1_64( 0xA8A9AAABACADAEAF );
-   ctx->H[ 6] = m512_const1_64( 0xB0B1B2B3B4B5B6B7 );
-   ctx->H[ 7] = m512_const1_64( 0xB8B9BABBBCBDBEBF );
-   ctx->H[ 8] = m512_const1_64( 0xC0C1C2C3C4C5C6C7 );
-   ctx->H[ 9] = m512_const1_64( 0xC8C9CACBCCCDCECF );
-   ctx->H[10] = m512_const1_64( 0xD0D1D2D3D4D5D6D7 );
-   ctx->H[11] = m512_const1_64( 0xD8D9DADBDCDDDEDF );
-   ctx->H[12] = m512_const1_64( 0xE0E1E2E3E4E5E6E7 );
-   ctx->H[13] = m512_const1_64( 0xE8E9EAEBECEDEEEF );
-   ctx->H[14] = m512_const1_64( 0xF0F1F2F3F4F5F6F7 );
-   ctx->H[15] = m512_const1_64( 0xF8F9FAFBFCFDFEFF );
+   ctx->H[ 0] = _mm512_set1_epi64( 0x8081828384858687 );
+   ctx->H[ 1] = _mm512_set1_epi64( 0x88898A8B8C8D8E8F );
+   ctx->H[ 2] = _mm512_set1_epi64( 0x9091929394959697 );
+   ctx->H[ 3] = _mm512_set1_epi64( 0x98999A9B9C9D9E9F );
+   ctx->H[ 4] = _mm512_set1_epi64( 0xA0A1A2A3A4A5A6A7 );
+   ctx->H[ 5] = _mm512_set1_epi64( 0xA8A9AAABACADAEAF );
+   ctx->H[ 6] = _mm512_set1_epi64( 0xB0B1B2B3B4B5B6B7 );
+   ctx->H[ 7] = _mm512_set1_epi64( 0xB8B9BABBBCBDBEBF );
+   ctx->H[ 8] = _mm512_set1_epi64( 0xC0C1C2C3C4C5C6C7 );
+   ctx->H[ 9] = _mm512_set1_epi64( 0xC8C9CACBCCCDCECF );
+   ctx->H[10] = _mm512_set1_epi64( 0xD0D1D2D3D4D5D6D7 );
+   ctx->H[11] = _mm512_set1_epi64( 0xD8D9DADBDCDDDEDF );
+   ctx->H[12] = _mm512_set1_epi64( 0xE0E1E2E3E4E5E6E7 );
+   ctx->H[13] = _mm512_set1_epi64( 0xE8E9EAEBECEDEEEF );
+   ctx->H[14] = _mm512_set1_epi64( 0xF0F1F2F3F4F5F6F7 );
+   ctx->H[15] = _mm512_set1_epi64( 0xF8F9FAFBFCFDFEFF );
   ctx->ptr = 0;
   ctx->bit_count = 0;
 }
@@ -1448,7 +1448,7 @@ void bmw512_8way_close( bmw512_8way_context *ctx, void *dst )

   buf = ctx->buf;
   ptr = ctx->ptr;
-   buf[ ptr>>3 ] = m512_const1_64( 0x80 );
+   buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
   ptr += 8;
   h = ctx->H;

@@ -1483,22 +1483,22 @@ void bmw512_8way_full( bmw512_8way_context *ctx, void *out, const void *data,

 // Init

-   H[ 0] = m512_const1_64( 0x8081828384858687 );
-   H[ 1] = m512_const1_64( 0x88898A8B8C8D8E8F );
-   H[ 2] = m512_const1_64( 0x9091929394959697 );
-   H[ 3] = m512_const1_64( 0x98999A9B9C9D9E9F );
-   H[ 4] = m512_const1_64( 0xA0A1A2A3A4A5A6A7 );
-   H[ 5] = m512_const1_64( 0xA8A9AAABACADAEAF );
-   H[ 6] = m512_const1_64( 0xB0B1B2B3B4B5B6B7 );
-   H[ 7] = m512_const1_64( 0xB8B9BABBBCBDBEBF );
-   H[ 8] = m512_const1_64( 0xC0C1C2C3C4C5C6C7 );
-   H[ 9] = m512_const1_64( 0xC8C9CACBCCCDCECF );
-   H[10] = m512_const1_64( 0xD0D1D2D3D4D5D6D7 );
-   H[11] = m512_const1_64( 0xD8D9DADBDCDDDEDF );
-   H[12] = m512_const1_64( 0xE0E1E2E3E4E5E6E7 );
-   H[13] = m512_const1_64( 0xE8E9EAEBECEDEEEF );
-   H[14] = m512_const1_64( 0xF0F1F2F3F4F5F6F7 );
-   H[15] = m512_const1_64( 0xF8F9FAFBFCFDFEFF );
+   H[ 0] = _mm512_set1_epi64( 0x8081828384858687 );
+   H[ 1] = _mm512_set1_epi64( 0x88898A8B8C8D8E8F );
+   H[ 2] = _mm512_set1_epi64( 0x9091929394959697 );
+   H[ 3] = _mm512_set1_epi64( 0x98999A9B9C9D9E9F );
+   H[ 4] = _mm512_set1_epi64( 0xA0A1A2A3A4A5A6A7 );
+   H[ 5] = _mm512_set1_epi64( 0xA8A9AAABACADAEAF );
+   H[ 6] = _mm512_set1_epi64( 0xB0B1B2B3B4B5B6B7 );
+   H[ 7] = _mm512_set1_epi64( 0xB8B9BABBBCBDBEBF );
+   H[ 8] = _mm512_set1_epi64( 0xC0C1C2C3C4C5C6C7 );
+   H[ 9] = _mm512_set1_epi64( 0xC8C9CACBCCCDCECF );
+   H[10] = _mm512_set1_epi64( 0xD0D1D2D3D4D5D6D7 );
+   H[11] = _mm512_set1_epi64( 0xD8D9DADBDCDDDEDF );
+   H[12] = _mm512_set1_epi64( 0xE0E1E2E3E4E5E6E7 );
+   H[13] = _mm512_set1_epi64( 0xE8E9EAEBECEDEEEF );
+   H[14] = _mm512_set1_epi64( 0xF0F1F2F3F4F5F6F7 );
+   H[15] = _mm512_set1_epi64( 0xF8F9FAFBFCFDFEFF );

 // Update

@@ -1530,7 +1530,7 @@ void bmw512_8way_full( bmw512_8way_context *ctx, void *out, const void *data,
   __m512i h1[16], h2[16];
   size_t u, v;

-   buf[ ptr>>3 ] = m512_const1_64( 0x80 );
+   buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
   ptr += 8;

   if (  ptr > (buf_size - 8) )
--- a/algo/cubehash/cube-hash-2way.c
+++ b/algo/cubehash/cube-hash-2way.c
@@ -221,14 +221,14 @@ int cube_4way_init( cube_4way_context *sp, int hashbitlen, int rounds,
    sp->rounds    = rounds;
    sp->pos       = 0;

-    h[ 0] = m512_const1_128( iv[0] );
-    h[ 1] = m512_const1_128( iv[1] );
-    h[ 2] = m512_const1_128( iv[2] );
-    h[ 3] = m512_const1_128( iv[3] );
-    h[ 4] = m512_const1_128( iv[4] );
-    h[ 5] = m512_const1_128( iv[5] );
-    h[ 6] = m512_const1_128( iv[6] );
-    h[ 7] = m512_const1_128( iv[7] );
+    h[ 0] = mm512_bcast_m128( iv[0] );
+    h[ 1] = mm512_bcast_m128( iv[1] );
+    h[ 2] = mm512_bcast_m128( iv[2] );
+    h[ 3] = mm512_bcast_m128( iv[3] );
+    h[ 4] = mm512_bcast_m128( iv[4] );
+    h[ 5] = mm512_bcast_m128( iv[5] );
+    h[ 6] = mm512_bcast_m128( iv[6] );
+    h[ 7] = mm512_bcast_m128( iv[7] );

    return 0;
 }
@@ -259,11 +259,11 @@ int cube_4way_close( cube_4way_context *sp, void *output )

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
-                                 m512_const2_64( 0, 0x0000000000000080 ) );
+                         mm512_bcast128lo_64( 0x0000000000000080 ) );
    transform_4way( sp );

    sp->h[7] = _mm512_xor_si512( sp->h[7],
-                                 m512_const2_64( 0x0000000100000000, 0 ) );
+                         mm512_bcast128hi_64( 0x0000000100000000 ) );

    for ( i = 0; i < 10; ++i ) 
       transform_4way( sp );
@@ -283,14 +283,14 @@ int cube_4way_full( cube_4way_context *sp, void *output,  int hashbitlen,
    sp->rounds    = 16;
    sp->pos       = 0;

-    h[ 0] = m512_const1_128( iv[0] );
-    h[ 1] = m512_const1_128( iv[1] );
-    h[ 2] = m512_const1_128( iv[2] );
-    h[ 3] = m512_const1_128( iv[3] );
-    h[ 4] = m512_const1_128( iv[4] );
-    h[ 5] = m512_const1_128( iv[5] );
-    h[ 6] = m512_const1_128( iv[6] );
-    h[ 7] = m512_const1_128( iv[7] );
+    h[ 0] = mm512_bcast_m128( iv[0] );
+    h[ 1] = mm512_bcast_m128( iv[1] );
+    h[ 2] = mm512_bcast_m128( iv[2] );
+    h[ 3] = mm512_bcast_m128( iv[3] );
+    h[ 4] = mm512_bcast_m128( iv[4] );
+    h[ 5] = mm512_bcast_m128( iv[5] );
+    h[ 6] = mm512_bcast_m128( iv[6] );
+    h[ 7] = mm512_bcast_m128( iv[7] );

    const int len = size >> 4;
    const __m512i *in = (__m512i*)data;
@@ -310,11 +310,11 @@ int cube_4way_full( cube_4way_context *sp, void *output,  int hashbitlen,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
-                                    m512_const2_64( 0, 0x0000000000000080 ) );
+                         mm512_bcast128lo_64( 0x0000000000000080 ) );
    transform_4way( sp );

    sp->h[7] = _mm512_xor_si512( sp->h[7],
-                                    m512_const2_64( 0x0000000100000000, 0 ) );
+                         mm512_bcast128hi_64( 0x0000000100000000 ) );

    for ( i = 0; i < 10; ++i )
       transform_4way( sp );
@@ -336,14 +336,14 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
    sp->rounds    = 16;
    sp->pos       = 0;

-    h1[0] = h0[0] = m512_const1_128( iv[0] );
-    h1[1] = h0[1] = m512_const1_128( iv[1] );
-    h1[2] = h0[2] = m512_const1_128( iv[2] );
-    h1[3] = h0[3] = m512_const1_128( iv[3] );
-    h1[4] = h0[4] = m512_const1_128( iv[4] );
-    h1[5] = h0[5] = m512_const1_128( iv[5] );
-    h1[6] = h0[6] = m512_const1_128( iv[6] );
-    h1[7] = h0[7] = m512_const1_128( iv[7] );
+    h1[0] = h0[0] = mm512_bcast_m128( iv[0] );
+    h1[1] = h0[1] = mm512_bcast_m128( iv[1] );
+    h1[2] = h0[2] = mm512_bcast_m128( iv[2] );
+    h1[3] = h0[3] = mm512_bcast_m128( iv[3] );
+    h1[4] = h0[4] = mm512_bcast_m128( iv[4] );
+    h1[5] = h0[5] = mm512_bcast_m128( iv[5] );
+    h1[6] = h0[6] = mm512_bcast_m128( iv[6] );
+    h1[7] = h0[7] = mm512_bcast_m128( iv[7] );

    const int len = size >> 4;
    const __m512i *in0 = (__m512i*)data0;
@@ -365,13 +365,13 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
    }

    // pos is zero for 64 byte data, 1 for 80 byte data.
-    __m512i tmp = m512_const2_64( 0, 0x0000000000000080 );
+    __m512i tmp = mm512_bcast128lo_64( 0x0000000000000080 );
    sp->h0[ sp->pos ] = _mm512_xor_si512( sp->h0[ sp->pos ], tmp );
    sp->h1[ sp->pos ] = _mm512_xor_si512( sp->h1[ sp->pos ], tmp );

    transform_4way_2buf( sp );

-    tmp = m512_const2_64( 0x0000000100000000, 0 );
+    tmp = mm512_bcast128hi_64( 0x0000000100000000 );
    sp->h0[7] = _mm512_xor_si512( sp->h0[7], tmp );
    sp->h1[7] = _mm512_xor_si512( sp->h1[7], tmp );

@@ -384,7 +384,6 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
    return 0;
 }

-
 int cube_4way_update_close( cube_4way_context *sp, void *output,
                               const void *data, size_t size )
 {
@@ -406,11 +405,11 @@ int cube_4way_update_close( cube_4way_context *sp, void *output,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
-                                    m512_const2_64( 0, 0x0000000000000080 ) );
+                          mm512_bcast128lo_64( 0x0000000000000080 ) );
    transform_4way( sp );

    sp->h[7] = _mm512_xor_si512( sp->h[7],
-                                    m512_const2_64( 0x0000000100000000, 0 ) );
+                          mm512_bcast128hi_64( 0x0000000100000000 ) );

    for ( i = 0; i < 10; ++i )
       transform_4way( sp );
@@ -424,21 +423,6 @@ int cube_4way_update_close( cube_4way_context *sp, void *output,

 // 2 way 128 

-// This isn't expected to be used with AVX512 so HW rotate intruction
-// is assumed not avaiable.
-// Use double buffering to optimize serial bit rotations. Full double
-// buffering isn't practical because it needs twice as many registers
-// with AVX2 having only half as many as AVX512.
-#define ROL2( out0, out1, in0, in1, c ) \
-{ \
- __m256i t0 = _mm256_slli_epi32( in0, c ); \
- __m256i t1 = _mm256_slli_epi32( in1, c ); \
- out0 = _mm256_srli_epi32( in0, 32-(c) ); \
- out1 = _mm256_srli_epi32( in1, 32-(c) ); \
- out0 = _mm256_or_si256( out0, t0 ); \
- out1 = _mm256_or_si256( out1, t1 ); \
-}
-
 static void transform_2way( cube_2way_context *sp )
 {
    int r;
@@ -461,8 +445,10 @@ static void transform_2way( cube_2way_context *sp )
        x5 = _mm256_add_epi32( x1, x5 );
        x6 = _mm256_add_epi32( x2, x6 );
        x7 = _mm256_add_epi32( x3, x7 );
-        ROL2( y0, y1, x2, x3, 7 );
-        ROL2( x2, x3, x0, x1, 7 );
+        y0 = mm256_rol_32( x2, 7 );
+        y1 = mm256_rol_32( x3, 7 );
+        x2 = mm256_rol_32( x0, 7 );
+        x3 = mm256_rol_32( x1, 7 );
        x0 = _mm256_xor_si256( y0, x4 );
        x1 = _mm256_xor_si256( y1, x5 );
        x2 = _mm256_xor_si256( x2, x6 );
@@ -475,8 +461,10 @@ static void transform_2way( cube_2way_context *sp )
        x5 = _mm256_add_epi32( x1, x5 );
        x6 = _mm256_add_epi32( x2, x6 );
        x7 = _mm256_add_epi32( x3, x7 );
-        ROL2( y0, x1, x1, x0, 11 );
-        ROL2( y1, x3, x3, x2, 11 );
+        y0 = mm256_rol_32( x1, 11 );
+        x1 = mm256_rol_32( x0, 11 );
+        y1 = mm256_rol_32( x3, 11 );
+        x3 = mm256_rol_32( x2, 11 );
        x0 = _mm256_xor_si256( y0, x4 );
        x1 = _mm256_xor_si256( x1, x5 );
        x2 = _mm256_xor_si256( y1, x6 );
@@ -508,14 +496,14 @@ int cube_2way_init( cube_2way_context *sp, int hashbitlen, int rounds,
    sp->rounds    = rounds;
    sp->pos       = 0;

-    h[ 0] = m256_const1_128( iv[0] );
-    h[ 1] = m256_const1_128( iv[1] );
-    h[ 2] = m256_const1_128( iv[2] );
-    h[ 3] = m256_const1_128( iv[3] );
-    h[ 4] = m256_const1_128( iv[4] );
-    h[ 5] = m256_const1_128( iv[5] );
-    h[ 6] = m256_const1_128( iv[6] );
-    h[ 7] = m256_const1_128( iv[7] );
+    h[ 0] = mm256_bcast_m128( iv[0] );
+    h[ 1] = mm256_bcast_m128( iv[1] );
+    h[ 2] = mm256_bcast_m128( iv[2] );
+    h[ 3] = mm256_bcast_m128( iv[3] );
+    h[ 4] = mm256_bcast_m128( iv[4] );
+    h[ 5] = mm256_bcast_m128( iv[5] );
+    h[ 6] = mm256_bcast_m128( iv[6] );
+    h[ 7] = mm256_bcast_m128( iv[7] );
    
    return 0;
 }
@@ -546,13 +534,14 @@ int cube_2way_close( cube_2way_context *sp, void *output )

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                                   m256_const2_64( 0, 0x0000000000000080 ) );
+                                   mm256_bcast128lo_64( 0x0000000000000080 ) );
    transform_2way( sp );

    sp->h[7] = _mm256_xor_si256( sp->h[7],
-                                   m256_const2_64( 0x0000000100000000, 0 ) );
+                                   mm256_bcast128hi_64( 0x0000000100000000 ) );

-    for ( i = 0; i < 10; ++i )           transform_2way( sp );
+    for ( i = 0; i < 10; ++i )
+       transform_2way( sp );

    memcpy( hash, sp->h, sp->hashlen<<5 );
    return 0;
@@ -579,13 +568,14 @@ int cube_2way_update_close( cube_2way_context *sp, void *output,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                                    m256_const2_64( 0, 0x0000000000000080 ) );
+                                    mm256_bcast128lo_64( 0x0000000000000080 ) );
    transform_2way( sp );

    sp->h[7] = _mm256_xor_si256( sp->h[7],
-                                    m256_const2_64( 0x0000000100000000, 0 ) );
+                                    mm256_bcast128hi_64( 0x0000000100000000 ) );

-    for ( i = 0; i < 10; ++i )    transform_2way( sp );
+    for ( i = 0; i < 10; ++i )
+       transform_2way( sp );

    memcpy( hash, sp->h, sp->hashlen<<5 );
    return 0;
@@ -602,14 +592,14 @@ int cube_2way_full( cube_2way_context *sp, void *output, int hashbitlen,
    sp->rounds    = 16;
    sp->pos       = 0;

-    h[ 0] = m256_const1_128( iv[0] );
-    h[ 1] = m256_const1_128( iv[1] );
-    h[ 2] = m256_const1_128( iv[2] );
-    h[ 3] = m256_const1_128( iv[3] );
-    h[ 4] = m256_const1_128( iv[4] );
-    h[ 5] = m256_const1_128( iv[5] );
-    h[ 6] = m256_const1_128( iv[6] );
-    h[ 7] = m256_const1_128( iv[7] );
+    h[ 0] = mm256_bcast_m128( iv[0] );
+    h[ 1] = mm256_bcast_m128( iv[1] );
+    h[ 2] = mm256_bcast_m128( iv[2] );
+    h[ 3] = mm256_bcast_m128( iv[3] );
+    h[ 4] = mm256_bcast_m128( iv[4] );
+    h[ 5] = mm256_bcast_m128( iv[5] );
+    h[ 6] = mm256_bcast_m128( iv[6] );
+    h[ 7] = mm256_bcast_m128( iv[7] );

    const int len = size >> 4;
    const __m256i *in = (__m256i*)data;
@@ -629,13 +619,14 @@ int cube_2way_full( cube_2way_context *sp, void *output, int hashbitlen,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                                    m256_const2_64( 0, 0x0000000000000080 ) );
+                                    mm256_bcast128lo_64( 0x0000000000000080 ) );
    transform_2way( sp );

    sp->h[7] = _mm256_xor_si256( sp->h[7],
-                                    m256_const2_64( 0x0000000100000000, 0 ) );
+                                    mm256_bcast128hi_64( 0x0000000100000000 ) );

-    for ( i = 0; i < 10; ++i )    transform_2way( sp );
+    for ( i = 0; i < 10; ++i )
+       transform_2way( sp );

    memcpy( hash, sp->h, sp->hashlen<<5 );
    return 0;
--- a/algo/cubehash/cubehash_sse2.c
+++ b/algo/cubehash/cubehash_sse2.c
@@ -32,7 +32,7 @@ static void transform( cubehashParam *sp )
    { 
        x1 = _mm512_add_epi32( x0, x1 );
        x0 = mm512_swap_256( x0 );
-        x0 = mm512_rol_32(  x0, 7 );
+        x0 = mm512_rol_32( x0, 7 );
        x0 = _mm512_xor_si512( x0, x1 );
        x1 = mm512_swap128_64( x1 );
        x1 = _mm512_add_epi32( x0, x1 );
@@ -58,19 +58,18 @@ static void transform( cubehashParam *sp )
    { 
        x2 = _mm256_add_epi32( x0, x2 );
        x3 = _mm256_add_epi32( x1, x3 );
-        y0 = x0;
-        x0 = mm256_rol_32( x1, 7 );
-        x1 = mm256_rol_32( y0, 7 );
-        x0 = _mm256_xor_si256( x0, x2 );
-        x1 = _mm256_xor_si256( x1, x3 );
+        y0 = mm256_rol_32( x1, 7 );
+        y1 = mm256_rol_32( x0, 7 );
+        x0 = _mm256_xor_si256( y0, x2 );
+        x1 = _mm256_xor_si256( y1, x3 );
        x2 = mm256_swap128_64( x2 );
        x3 = mm256_swap128_64( x3 );
        x2 = _mm256_add_epi32( x0, x2 );
        x3 = _mm256_add_epi32( x1, x3 );
-        y0 = mm256_swap_128( x0 );
-        y1 = mm256_swap_128( x1 );
-        x0 = mm256_rol_32( y0, 11 );
-        x1 = mm256_rol_32( y1, 11 );
+        x0 = mm256_swap_128( x0 );
+        x1 = mm256_swap_128( x1 );
+        x0 = mm256_rol_32( x0, 11 );
+        x1 = mm256_rol_32( x1, 11 );
        x0 = _mm256_xor_si256( x0, x2 );
        x1 = _mm256_xor_si256( x1, x3 );
        x2 = mm256_swap64_32( x2 );
@@ -94,47 +93,48 @@ static void transform( cubehashParam *sp )
    x6 = _mm_load_si128( (__m128i*)sp->x + 6 );
    x7 = _mm_load_si128( (__m128i*)sp->x + 7 );

-    for (r = 0; r < rounds; ++r) {
-	x4 = _mm_add_epi32(x0, x4);
-	x5 = _mm_add_epi32(x1, x5);
-	x6 = _mm_add_epi32(x2, x6);
-	x7 = _mm_add_epi32(x3, x7);
-	y0 = x2;
-	y1 = x3;
-	y2 = x0;
-	y3 = x1;
-	x0 = _mm_xor_si128(_mm_slli_epi32(y0, 7), _mm_srli_epi32(y0, 25));
-	x1 = _mm_xor_si128(_mm_slli_epi32(y1, 7), _mm_srli_epi32(y1, 25));
-	x2 = _mm_xor_si128(_mm_slli_epi32(y2, 7), _mm_srli_epi32(y2, 25));
-	x3 = _mm_xor_si128(_mm_slli_epi32(y3, 7), _mm_srli_epi32(y3, 25));
-	x0 = _mm_xor_si128(x0, x4);
-	x1 = _mm_xor_si128(x1, x5);
-	x2 = _mm_xor_si128(x2, x6);
-	x3 = _mm_xor_si128(x3, x7);
-	x4 = _mm_shuffle_epi32(x4, 0x4e);
-	x5 = _mm_shuffle_epi32(x5, 0x4e);
-	x6 = _mm_shuffle_epi32(x6, 0x4e);
-	x7 = _mm_shuffle_epi32(x7, 0x4e);
-	x4 = _mm_add_epi32(x0, x4);
-	x5 = _mm_add_epi32(x1, x5);
-	x6 = _mm_add_epi32(x2, x6);
-	x7 = _mm_add_epi32(x3, x7);
-	y0 = x1;
-	y1 = x0;
-	y2 = x3;
-	y3 = x2;
-	x0 = _mm_xor_si128(_mm_slli_epi32(y0, 11), _mm_srli_epi32(y0, 21));
-	x1 = _mm_xor_si128(_mm_slli_epi32(y1, 11), _mm_srli_epi32(y1, 21));
-	x2 = _mm_xor_si128(_mm_slli_epi32(y2, 11), _mm_srli_epi32(y2, 21));
-	x3 = _mm_xor_si128(_mm_slli_epi32(y3, 11), _mm_srli_epi32(y3, 21));
-	x0 = _mm_xor_si128(x0, x4);
-	x1 = _mm_xor_si128(x1, x5);
-	x2 = _mm_xor_si128(x2, x6);
-	x3 = _mm_xor_si128(x3, x7);
-	x4 = _mm_shuffle_epi32(x4, 0xb1);
-	x5 = _mm_shuffle_epi32(x5, 0xb1);
-	x6 = _mm_shuffle_epi32(x6, 0xb1);
-	x7 = _mm_shuffle_epi32(x7, 0xb1);
+    for ( r = 0; r < rounds; ++r )
+    {
+       x4 = _mm_add_epi32( x0, x4 );
+       x5 = _mm_add_epi32( x1, x5 );
+       x6 = _mm_add_epi32( x2, x6 );
+       x7 = _mm_add_epi32( x3, x7 );
+       y0 = x2;
+       y1 = x3;
+       y2 = x0;
+       y3 = x1;
+       x0 = mm128_rol_32( y0, 7 );
+       x1 = mm128_rol_32( y1, 7 );
+       x2 = mm128_rol_32( y2, 7 );
+       x3 = mm128_rol_32( y3, 7 );
+       x0 = _mm_xor_si128( x0, x4 );
+       x1 = _mm_xor_si128( x1, x5 );
+       x2 = _mm_xor_si128( x2, x6 );
+       x3 = _mm_xor_si128( x3, x7 );
+       x4 = _mm_shuffle_epi32( x4, 0x4e );
+       x5 = _mm_shuffle_epi32( x5, 0x4e );
+       x6 = _mm_shuffle_epi32( x6, 0x4e );
+       x7 = _mm_shuffle_epi32( x7, 0x4e );
+       x4 = _mm_add_epi32( x0, x4 );
+       x5 = _mm_add_epi32( x1, x5 );
+       x6 = _mm_add_epi32( x2, x6 );
+       x7 = _mm_add_epi32( x3, x7 );
+       y0 = x1;
+       y1 = x0;
+       y2 = x3;
+       y3 = x2;
+       x0 = mm128_rol_32( y0, 11 );
+       x1 = mm128_rol_32( y1, 11 );
+       x2 = mm128_rol_32( y2, 11 );
+       x3 = mm128_rol_32( y3, 11 );
+       x0 = _mm_xor_si128( x0, x4 );
+       x1 = _mm_xor_si128( x1, x5 );
+       x2 = _mm_xor_si128( x2, x6 );
+       x3 = _mm_xor_si128( x3, x7 );
+       x4 = _mm_shuffle_epi32( x4, 0xb1 );
+       x5 = _mm_shuffle_epi32( x5, 0xb1 );
+       x6 = _mm_shuffle_epi32( x6, 0xb1 );
+       x7 = _mm_shuffle_epi32( x7, 0xb1 );
    }

    _mm_store_si128( (__m128i*)sp->x,     x0 );
@@ -180,25 +180,25 @@ int cubehashInit(cubehashParam *sp, int hashbitlen, int rounds, int blockbytes)
    if ( hashbitlen == 512 )
    {

-       x[0] = m128_const_64( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
-       x[1] = m128_const_64( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
-       x[2] = m128_const_64( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
-       x[3] = m128_const_64( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
-       x[4] = m128_const_64( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
-       x[5] = m128_const_64( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
-       x[6] = m128_const_64( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
-       x[7] = m128_const_64( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
+       x[0] = _mm_set_epi64x( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
+       x[1] = _mm_set_epi64x( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
+       x[2] = _mm_set_epi64x( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
+       x[3] = _mm_set_epi64x( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
+       x[4] = _mm_set_epi64x( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
+       x[5] = _mm_set_epi64x( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
+       x[6] = _mm_set_epi64x( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
+       x[7] = _mm_set_epi64x( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
    }
    else
    {
-       x[0] = m128_const_64( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
-       x[1] = m128_const_64( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
-       x[2] = m128_const_64( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
-       x[3] = m128_const_64( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
-       x[4] = m128_const_64( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
-       x[5] = m128_const_64( 0x93CB628565C892FD, 0x5FA2560309392549 );
-       x[6] = m128_const_64( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
-       x[7] = m128_const_64( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
+       x[0] = _mm_set_epi64x( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
+       x[1] = _mm_set_epi64x( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
+       x[2] = _mm_set_epi64x( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
+       x[3] = _mm_set_epi64x( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
+       x[4] = _mm_set_epi64x( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
+       x[5] = _mm_set_epi64x( 0x93CB628565C892FD, 0x5FA2560309392549 );
+       x[6] = _mm_set_epi64x( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
+       x[7] = _mm_set_epi64x( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
    }   

    return SUCCESS;
@@ -234,10 +234,10 @@ int cubehashDigest( cubehashParam *sp, byte *digest )

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
-                                      m128_const_64( 0, 0x80 ) );
+                                      _mm_set_epi64x( 0, 0x80 ) );
    transform( sp );

-    sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
+    sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );
    transform( sp );
    transform( sp );
    transform( sp );
@@ -279,10 +279,10 @@ int cubehashUpdateDigest( cubehashParam *sp, byte *digest,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
-                                      m128_const_64( 0, 0x80 ) );
+                                      _mm_set_epi64x( 0, 0x80 ) );
    transform( sp );

-    sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
+    sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );

    transform( sp );
    transform( sp );
@@ -313,25 +313,25 @@ int cubehash_full( cubehashParam *sp, byte *digest, int hashbitlen,
    if ( hashbitlen == 512 )
    {

-       x[0] = m128_const_64( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
-       x[1] = m128_const_64( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
-       x[2] = m128_const_64( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
-       x[3] = m128_const_64( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
-       x[4] = m128_const_64( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
-       x[5] = m128_const_64( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
-       x[6] = m128_const_64( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
-       x[7] = m128_const_64( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
+       x[0] = _mm_set_epi64x( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
+       x[1] = _mm_set_epi64x( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
+       x[2] = _mm_set_epi64x( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
+       x[3] = _mm_set_epi64x( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
+       x[4] = _mm_set_epi64x( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
+       x[5] = _mm_set_epi64x( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
+       x[6] = _mm_set_epi64x( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
+       x[7] = _mm_set_epi64x( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
    }
    else
    {
-       x[0] = m128_const_64( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
-       x[1] = m128_const_64( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
-       x[2] = m128_const_64( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
-       x[3] = m128_const_64( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
-       x[4] = m128_const_64( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
-       x[5] = m128_const_64( 0x93CB628565C892FD, 0x5FA2560309392549 );
-       x[6] = m128_const_64( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
-       x[7] = m128_const_64( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
+       x[0] = _mm_set_epi64x( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
+       x[1] = _mm_set_epi64x( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
+       x[2] = _mm_set_epi64x( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
+       x[3] = _mm_set_epi64x( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
+       x[4] = _mm_set_epi64x( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
+       x[5] = _mm_set_epi64x( 0x93CB628565C892FD, 0x5FA2560309392549 );
+       x[6] = _mm_set_epi64x( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
+       x[7] = _mm_set_epi64x( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
    }


@@ -358,10 +358,10 @@ int cubehash_full( cubehashParam *sp, byte *digest, int hashbitlen,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
-                                      m128_const_64( 0, 0x80 ) );
+                                      _mm_set_epi64x( 0, 0x80 ) );
    transform( sp );

-    sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
+    sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );

    transform( sp );
    transform( sp );
--- a/algo/echo/aes_ni/hash.c
+++ b/algo/echo/aes_ni/hash.c
@@ -566,16 +566,16 @@ HashReturn echo_full( hashState_echo *state, BitSequence *hashval,
         state->uHashSize = 256;
         state->uBlockLength = 192;
         state->uRounds = 8;
-         state->hashsize = m128_const_64( 0, 0x100 );
-         state->const1536 = m128_const_64( 0, 0x600 );
+         state->hashsize = _mm_set_epi64x( 0, 0x100 );
+         state->const1536 = _mm_set_epi64x( 0, 0x600 );
         break;

      case 512:
         state->uHashSize = 512;
         state->uBlockLength = 128;
         state->uRounds = 10;
-         state->hashsize = m128_const_64( 0, 0x200 );
-         state->const1536 = m128_const_64( 0, 0x400 );
+         state->hashsize = _mm_set_epi64x( 0, 0x200 );
+         state->const1536 = _mm_set_epi64x( 0, 0x400 );
         break;

      default:
--- a/algo/echo/echo-hash-4way.c
+++ b/algo/echo/echo-hash-4way.c
@@ -162,9 +162,9 @@ void echo_4way_compress( echo_4way_context *ctx, const __m512i *pmsg,
  unsigned int r, b, i, j;
  __m512i t1, t2, s2, k1;
  __m512i _state[4][4], _state2[4][4], _statebackup[4][4]; 
-  __m512i one = m512_one_128;
-  __m512i mul2mask = m512_const2_64( 0, 0x00001b00 );
-  __m512i lsbmask  = m512_const1_32( 0x01010101 ); 
+  const __m512i one = mm512_bcast128lo_64( 1 ); 
+  const __m512i mul2mask = mm512_bcast128lo_64( 0x00001b00 );
+  const __m512i lsbmask  = _mm512_set1_epi32( 0x01010101 ); 

  _state[ 0 ][ 0 ] = ctx->state[ 0 ][ 0 ];
  _state[ 0 ][ 1 ] = ctx->state[ 0 ][ 1 ];
@@ -264,16 +264,16 @@ int echo_4way_init( echo_4way_context *ctx, int nHashSize )
 		ctx->uHashSize = 256;
 		ctx->uBlockLength = 192;
 		ctx->uRounds = 8;
-		ctx->hashsize = m512_const2_64( 0, 0x100 );
-		ctx->const1536 = m512_const2_64( 0, 0x600 );
+      ctx->hashsize = mm512_bcast128lo_64( 0x100 );
+      ctx->const1536 = mm512_bcast128lo_64( 0x600 );
 		break;

 	case 512:
 		ctx->uHashSize = 512;
 		ctx->uBlockLength = 128;
 		ctx->uRounds = 10;
-		ctx->hashsize = m512_const2_64( 0, 0x200 );
-		ctx->const1536 = m512_const2_64( 0, 0x400);
+      ctx->hashsize = mm512_bcast128lo_64( 0x200 );
+      ctx->const1536 = mm512_bcast128lo_64( 0x400 );
 		break;

 	default:
@@ -305,7 +305,7 @@ int echo_4way_update_close( echo_4way_context *state, void *hashval,
   {
      echo_4way_compress( state, data, 1 );
      state->processed_bits = 1024;
-      remainingbits = m512_const2_64( 0, -1024 );
+      remainingbits = mm512_bcast128lo_64( -1024 );
      vlen = 0;
   }
   else
@@ -313,13 +313,15 @@ int echo_4way_update_close( echo_4way_context *state, void *hashval,
      vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
      memcpy_512( state->buffer, data, vlen );
      state->processed_bits += (unsigned int)( databitlen );
-      remainingbits = m512_const2_64( 0, (uint64_t)databitlen );
+      remainingbits = mm512_bcast128lo_64( (uint64_t)databitlen );
   }

-   state->buffer[ vlen ] = m512_const2_64( 0, 0x80 );
+   state->buffer[ vlen ] = mm512_bcast128lo_64( 0x80 );
   memset_zero_512( state->buffer + vlen + 1, vblen - vlen - 2 );
-   state->buffer[ vblen-2 ] = m512_const2_64( (uint64_t)state->uHashSize << 48, 0 );
-   state->buffer[ vblen-1 ] = m512_const2_64( 0, state->processed_bits);
+   state->buffer[ vblen-2 ] =
+           mm512_bcast128hi_64( (uint64_t)state->uHashSize << 48 );
+   state->buffer[ vblen-1 ] =
+           mm512_bcast128lo_64( state->processed_bits );

   state->k = _mm512_add_epi64( state->k, remainingbits );
   state->k = _mm512_sub_epi64( state->k, state->const1536 );
@@ -352,16 +354,16 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
         ctx->uHashSize = 256;
         ctx->uBlockLength = 192;
         ctx->uRounds = 8;
-         ctx->hashsize = m512_const2_64( 0, 0x100 );
-         ctx->const1536 = m512_const2_64( 0, 0x600 );
+         ctx->hashsize = mm512_bcast128lo_64( 0x100 );
+         ctx->const1536 = mm512_bcast128lo_64( 0x600 );
         break;

      case 512:
         ctx->uHashSize = 512;
         ctx->uBlockLength = 128;
         ctx->uRounds = 10;
-         ctx->hashsize = m512_const2_64( 0, 0x200 );
-         ctx->const1536 = m512_const2_64( 0, 0x400 );
+         ctx->hashsize = mm512_bcast128lo_64( 0x200 );
+         ctx->const1536 = mm512_bcast128lo_64( 0x400 );
         break;

      default:
@@ -388,7 +390,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   {
      echo_4way_compress( ctx, data, 1 );
      ctx->processed_bits = 1024;
-      remainingbits = m512_const2_64( 0, -1024 );
+      remainingbits = mm512_bcast128lo_64( -1024 );
      vlen = 0;
   }
   else
@@ -396,14 +398,14 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
      vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
      memcpy_512( ctx->buffer, data, vlen );
      ctx->processed_bits += (unsigned int)( databitlen );
-      remainingbits = m512_const2_64( 0, databitlen );
+      remainingbits = mm512_bcast128lo_64( databitlen );
   }

-   ctx->buffer[ vlen ] = m512_const2_64( 0, 0x80 );
+   ctx->buffer[ vlen ] = mm512_bcast128lo_64( 0x80 );
   memset_zero_512( ctx->buffer + vlen + 1, vblen - vlen - 2 );
   ctx->buffer[ vblen-2 ] =
-                     m512_const2_64( (uint64_t)ctx->uHashSize << 48, 0 );
-   ctx->buffer[ vblen-1 ] = m512_const2_64( 0, ctx->processed_bits);
+               mm512_bcast128hi_64( (uint64_t)ctx->uHashSize << 48 );
+   ctx->buffer[ vblen-1 ] = mm512_bcast128lo_64( ctx->processed_bits );

   ctx->k = _mm512_add_epi64( ctx->k, remainingbits );
   ctx->k = _mm512_sub_epi64( ctx->k, ctx->const1536 );
@@ -425,9 +427,9 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,

 // AVX2 + VAES

-#define mul2mask_2way   m256_const2_64( 0, 0x0000000000001b00 ) 
+#define mul2mask_2way   mm256_bcast128lo_64( 0x0000000000001b00 ) 

-#define lsbmask_2way    m256_const1_32( 0x01010101 ) 
+#define lsbmask_2way    _mm256_set1_epi32( 0x01010101 ) 

 #define ECHO_SUBBYTES4_2WAY( state, j ) \
   state[0][j] = _mm256_aesenc_epi128( state[0][j], k1 ); \
@@ -467,8 +469,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   t1 = _mm256_and_si256( t1, lsbmask_2way ); \
   t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
   s2 = _mm256_xor_si256( s2, t2 );\
-   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], \
-                              _mm256_xor_si256( s2, state1[ 1 ][ j1 ] ) ); \
+   state2[ 0 ][ j ] = mm256_xor3( state2[ 0 ][ j ], s2, state1[ 1 ][ j1 ] ); \
   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], s2 ); \
   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], state1[ 1 ][ j1 ] ); \
   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3 ][ j ], state1[ 1 ][ j1 ] ); \
@@ -478,8 +479,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
   s2 = _mm256_xor_si256( s2, t2 ); \
   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], state1[ 2 ][ j2 ] ); \
-   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], \
-                            _mm256_xor_si256( s2, state1[ 2 ][ j2 ] ) ); \
+   state2[ 1 ][ j ] = mm256_xor3( state2[ 1 ][ j ], s2, state1[ 2 ][ j2 ] ); \
   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], s2 ); \
   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3][ j ], state1[ 2 ][ j2 ] ); \
   s2 = _mm256_add_epi8( state1[ 3 ][ j3 ], state1[ 3 ][ j3 ] ); \
@@ -489,8 +489,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   s2 = _mm256_xor_si256( s2, t2 ); \
   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], state1[ 3 ][ j3 ] ); \
   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], state1[ 3 ][ j3 ] ); \
-   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], \
-                            _mm256_xor_si256( s2, state1[ 3 ][ j3] ) ); \
+   state2[ 2 ][ j ] = mm256_xor3( state2[ 2 ][ j ], s2, state1[ 3 ][ j3] ); \
   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3 ][ j ], s2 ); \
 } while(0)

@@ -679,16 +678,16 @@ int echo_2way_init( echo_2way_context *ctx, int nHashSize )
                        ctx->uHashSize = 256;
                        ctx->uBlockLength = 192;
                        ctx->uRounds = 8;
-                        ctx->hashsize = m256_const2_64( 0, 0x100 );
-                        ctx->const1536 = m256_const2_64( 0, 0x600 );
+                        ctx->hashsize = mm256_bcast128lo_64( 0x100 );
+                        ctx->const1536 = mm256_bcast128lo_64( 0x600 );
                        break;

                case 512:
                        ctx->uHashSize = 512;
                        ctx->uBlockLength = 128;
                        ctx->uRounds = 10;
-                        ctx->hashsize = m256_const2_64( 0, 0x200 );
-                        ctx->const1536 = m256_const2_64( 0, 0x400 );
+                        ctx->hashsize = mm256_bcast128lo_64( 0x200 );
+                        ctx->const1536 = mm256_bcast128lo_64( 0x400 );
                        break;

                default:
@@ -720,20 +719,20 @@ int echo_2way_update_close( echo_2way_context *state, void *hashval,
   {
      echo_2way_compress( state, data, 1 );
      state->processed_bits = 1024;
-      remainingbits = m256_const2_64( 0, -1024 );
+      remainingbits = mm256_bcast128lo_64( -1024 );
      vlen = 0;
   }
   else
   {
      memcpy_256( state->buffer, data, vlen );
      state->processed_bits += (unsigned int)( databitlen );
-      remainingbits = m256_const2_64( 0, databitlen );
+      remainingbits = mm256_bcast128lo_64( databitlen );
   }

-   state->buffer[ vlen ] = m256_const2_64( 0, 0x80 );
+   state->buffer[ vlen ] = mm256_bcast128lo_64( 0x80 );
   memset_zero_256( state->buffer + vlen + 1, vblen - vlen - 2 );
-   state->buffer[ vblen-2 ] = m256_const2_64( (uint64_t)state->uHashSize << 48, 0 );
-   state->buffer[ vblen-1 ] = m256_const2_64( 0, state->processed_bits );
+   state->buffer[ vblen-2 ] = mm256_bcast128hi_64( (uint64_t)state->uHashSize << 48 );
+   state->buffer[ vblen-1 ] = mm256_bcast128lo_64( state->processed_bits );

   state->k = _mm256_add_epi64( state->k, remainingbits );
   state->k = _mm256_sub_epi64( state->k, state->const1536 );
@@ -766,16 +765,16 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
         ctx->uHashSize = 256;
         ctx->uBlockLength = 192;
         ctx->uRounds = 8;
-         ctx->hashsize = m256_const2_64( 0, 0x100 );
-         ctx->const1536 = m256_const2_64( 0, 0x600 );
+         ctx->hashsize = mm256_bcast128lo_64( 0x100 );
+         ctx->const1536 = mm256_bcast128lo_64( 0x600 );
         break;

      case 512:
         ctx->uHashSize = 512;
         ctx->uBlockLength = 128;
         ctx->uRounds = 10;
-         ctx->hashsize = m256_const2_64( 0, 0x200 );
-         ctx->const1536 = m256_const2_64( 0, 0x400 );
+         ctx->hashsize = mm256_bcast128lo_64( 0x200 );
+         ctx->const1536 = mm256_bcast128lo_64( 0x400 );
         break;

      default:
@@ -798,7 +797,7 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
   {
      echo_2way_compress( ctx, data, 1 );
      ctx->processed_bits = 1024;
-      remainingbits = m256_const2_64( 0, -1024 );
+      remainingbits = mm256_bcast128lo_64( -1024 );
      vlen = 0;
   }
   else
@@ -806,13 +805,13 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
      vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
      memcpy_256( ctx->buffer, data, vlen );
      ctx->processed_bits += (unsigned int)( databitlen );
-      remainingbits = m256_const2_64( 0, databitlen );
+      remainingbits = mm256_bcast128lo_64( databitlen );
   }

-   ctx->buffer[ vlen ] = m256_const2_64( 0, 0x80 );
+   ctx->buffer[ vlen ] = mm256_bcast128lo_64( 0x80 );
   memset_zero_256( ctx->buffer + vlen + 1, vblen - vlen - 2 );
-   ctx->buffer[ vblen-2 ] = m256_const2_64( (uint64_t)ctx->uHashSize << 48, 0 );
-   ctx->buffer[ vblen-1 ] = m256_const2_64( 0, ctx->processed_bits );
+   ctx->buffer[ vblen-2 ] = mm256_bcast128hi_64( (uint64_t)ctx->uHashSize << 48 );
+   ctx->buffer[ vblen-1 ] = mm256_bcast128lo_64( ctx->processed_bits );

   ctx->k = _mm256_add_epi64( ctx->k, remainingbits );
   ctx->k = _mm256_sub_epi64( ctx->k, ctx->const1536 );
--- a/algo/fugue/fugue-aesni.c
+++ b/algo/fugue/fugue-aesni.c
@@ -33,11 +33,11 @@ MYALIGN const unsigned long long _supermix4b[]	= {0x07020d08080e0d0d, 0x07070908
 MYALIGN const unsigned long long _supermix4c[]	= {0x0706050403020000, 0x0302000007060504};
 MYALIGN const unsigned long long _supermix7a[]	= {0x010c0b060d080702, 0x0904030e03000104};
 MYALIGN const unsigned long long _supermix7b[]	= {0x8080808080808080, 0x0504070605040f06};
-MYALIGN const unsigned long long _k_n[] = {0x4E4E4E4E4E4E4E4E, 0x1B1B1B1B0E0E0E0E};
-MYALIGN const unsigned char _shift_one_mask[]   = {7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2};
-MYALIGN const unsigned char _shift_four_mask[]  = {13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8};
-MYALIGN const unsigned char _shift_seven_mask[] = {10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5};
-MYALIGN const unsigned char _aes_shift_rows[]   = {0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11};
+//MYALIGN const unsigned long long _k_n[] = {0x4E4E4E4E4E4E4E4E, 0x1B1B1B1B0E0E0E0E};
+//MYALIGN const unsigned char _shift_one_mask[]   = {7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2};
+//MYALIGN const unsigned char _shift_four_mask[]  = {13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8};
+//MYALIGN const unsigned char _shift_seven_mask[] = {10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5};
+//MYALIGN const unsigned char _aes_shift_rows[]   = {0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11};
 MYALIGN const unsigned int _inv_shift_rows[] = {0x070a0d00, 0x0b0e0104, 0x0f020508, 0x0306090c};
 MYALIGN const unsigned int _mul2mask[] = {0x1b1b0000, 0x00000000, 0x00000000, 0x00000000};
 MYALIGN const unsigned int _mul4mask[] = {0x2d361b00, 0x00000000, 0x00000000, 0x00000000};
@@ -131,7 +131,7 @@ MYALIGN const unsigned int _IV512[] = {
   t1 = _mm_srli_epi16(t0, 6);\
   t1 = _mm_and_si128(t1, M128(_lsbmask2));\
   t3 = _mm_xor_si128(t3, _mm_shuffle_epi8(M128(_mul2mask), t1));\
-   t0  = _mm_xor_si128(t4, _mm_shuffle_epi8(M128(_mul4mask), t1))
+   t0 = _mm_xor_si128(t4, _mm_shuffle_epi8(M128(_mul4mask), t1))

 /*
 #define PRESUPERMIX(x, t1, s1, s2, t2)\
--- a/algo/groestl/aes_ni/groestl-intr-aes.h
+++ b/algo/groestl/aes_ni/groestl-intr-aes.h
@@ -139,7 +139,7 @@ static const __m128i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003 };
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm_xor_si128(a0, TEMP0);\
  MUL2(a1, b0, b1);\
@@ -237,7 +237,7 @@ static const __m128i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003 };
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm_xor_si128(a0, TEMP0);\
  MUL2(a1, b0, b1);\
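
Note on the 0x1b constant in the hunks above: the surrounding comments describe MUL2 as doubling x_i with "1B" held in a register. The sketch below is a scalar model of that doubling in GF(2^8), assuming the AES reduction polynomial (0x11b) and a compare / add / and / xor sequence like the one visible in the 2-way MUL2 macro further down. The function names are hypothetical and do not appear in the repository.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a single byte lane of MUL2 (hypothetical helper name). */
static uint8_t gf256_double( uint8_t x )
{
   /* cmpgt_epi8( 0, x ): all ones when the top bit of x is set */
   uint8_t mask = ( x & 0x80 ) ? 0xff : 0x00;
   /* add_epi8( x, x ): shift left by one, the carry out of the byte is dropped */
   uint8_t doubled = (uint8_t)( x << 1 );
   /* and with 0x1b, then xor: reduce only when the dropped bit was set */
   return (uint8_t)( doubled ^ ( mask & 0x1b ) );
}

/* The same operation on eight packed bytes, which is why the constant is
   0x1b repeated in every byte of the 64-bit element. */
static uint64_t gf256_double_x8( uint64_t v )
{
   const uint64_t k = 0x1b1b1b1b1b1b1b1bULL;
   uint64_t top     = v & 0x8080808080808080ULL;           /* top bit of each byte  */
   uint64_t mask    = ( top >> 7 ) * 0xff;                 /* 0xff where it was set */
   uint64_t doubled = ( v << 1 ) & 0xfefefefefefefefeULL;  /* per-byte shift left   */
   return doubled ^ ( mask & k );
}
```

Only the way b1 is built changes in these hunks, so the doubling itself should behave as before.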
--- a/algo/groestl/aes_ni/groestl256-intr-aes.h
+++ b/algo/groestl/aes_ni/groestl256-intr-aes.h
@@ -128,7 +128,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm_xor_si128(a0, TEMP0);\
  MUL2(a1, b0, b1);\
@@ -226,7 +226,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm_xor_si128(a0, TEMP0);\
  MUL2(a1, b0, b1);\
@@ -275,7 +275,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
 */
 #define ROUND(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* AddRoundConstant */\
-  b1 = m128_const_64( 0xffffffffffffffff, 0 ); \
+  b1 = _mm_set_epi64x( 0xffffffffffffffff, 0 ); \
  a0 = _mm_xor_si128( a0, casti_m128i( round_const_l0, i ) ); \
  a1 = _mm_xor_si128( a1, b1 ); \
  a2 = _mm_xor_si128( a2, b1 ); \
--- a/algo/groestl/aes_ni/hash-groestl.c
+++ b/algo/groestl/aes_ni/hash-groestl.c
@@ -31,7 +31,7 @@ HashReturn_gr init_groestl( hashState_groestl* ctx, int hashlen )
  }

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+  ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );

  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;
@@ -48,7 +48,7 @@ HashReturn_gr reinit_groestl( hashState_groestl* ctx )
     ctx->chaining[i] = _mm_setzero_si128();
     ctx->buffer[i]   = _mm_setzero_si128();
  }
-  ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+  ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );
  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;

@@ -116,7 +116,7 @@ HashReturn_gr final_groestl( hashState_groestl* ctx, void* output )
   else
   {
       // add first padding
-       ctx->buffer[rem_ptr] = m128_const_64( 0, 0x80 );
+       ctx->buffer[rem_ptr] = _mm_set_epi64x( 0, 0x80 );
       // add zero padding
       for ( i = rem_ptr + 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = _mm_setzero_si128();
@@ -148,7 +148,7 @@ int groestl512_full( hashState_groestl* ctx, void* output,
      ctx->chaining[i] = _mm_setzero_si128();
      ctx->buffer[i]   = _mm_setzero_si128();
   }
-   ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+   ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );
   ctx->buf_ptr = 0;

   // --- update ---
@@ -182,7 +182,7 @@ int groestl512_full( hashState_groestl* ctx, void* output,
   else
   {
       // add first padding
-       ctx->buffer[i] = m128_const_64( 0, 0x80 );
+       ctx->buffer[i] = _mm_set_epi64x( 0, 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = _mm_setzero_si128();
@@ -239,7 +239,7 @@ HashReturn_gr update_and_final_groestl( hashState_groestl* ctx, void* output,
   else
   {
       // add first padding
-       ctx->buffer[i] = m128_const_64( 0, 0x80 );
+       ctx->buffer[i] = _mm_set_epi64x( 0, 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = _mm_setzero_si128();
--- a/algo/groestl/aes_ni/hash-groestl256.c
+++ b/algo/groestl/aes_ni/hash-groestl256.c
@@ -46,7 +46,7 @@ HashReturn_gr reinit_groestl256(hashState_groestl256* ctx)
     ctx->buffer[i]   = _mm_setzero_si128();
  }

-  ctx->chaining[ 3 ] = m128_const_64( 0, 0x0100000000000000 );
+  ctx->chaining[ 3 ] = _mm_set_epi64x( 0, 0x0100000000000000 );

  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;
--- a/algo/groestl/groestl256-hash-4way.c
+++ b/algo/groestl/groestl256-hash-4way.c
@@ -33,8 +33,7 @@ int groestl256_4way_init( groestl256_4way_context* ctx, uint64_t hashlen )
  }

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 3 ] = m512_const2_64( 0, 0x0100000000000000 );
-
+  ctx->chaining[ 3 ] = mm512_bcast128lo_64( 0x0100000000000000 );
  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;

@@ -51,9 +50,6 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
   __m512i* in = (__m512i*)input;
   int i;

-//  if (ctx->chaining == NULL || ctx->buffer == NULL)
-//    return 1;
-
  for ( i = 0; i < SIZE256; i++ )
  {
     ctx->chaining[i] = m512_zero;
@@ -61,7 +57,7 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
  }

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 3 ] = m512_const2_64( 0, 0x0100000000000000 );
+  ctx->chaining[ 3 ] = mm512_bcast128lo_64( 0x0100000000000000 );
  ctx->buf_ptr = 0;
   
   // --- update ---
@@ -83,18 +79,18 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
   if ( i == SIZE256 - 1 )
   {        
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 ); 
+       ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 ); 
   }   
   else
   {
       // add first padding
-       ctx->buffer[i] = m512_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE256 - 1; i++ )
           ctx->buffer[i] = m512_zero;

       // add length padding, second last byte is zero unless blocks > 255
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
   }

   // digest final padding block and do output transform
@@ -140,18 +136,18 @@ int groestl256_4way_update_close( groestl256_4way_context* ctx, void* output,
   if ( i == SIZE256 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
   }
   else
   {
       // add first padding
-       ctx->buffer[i] = m512_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE256 - 1; i++ )
           ctx->buffer[i] = m512_zero;

       // add length padding, second last byte is zero unless blocks > 255
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
   }

 // digest final padding block and do output transform
@@ -186,7 +182,7 @@ int groestl256_2way_init( groestl256_2way_context* ctx, uint64_t hashlen )
  }

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
+  ctx->chaining[ 3 ] = mm256_bcast128lo_64( 0x0100000000000000 );

  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;
@@ -211,7 +207,7 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
   }

   // The only non-zero in the IV is len. It can be hard coded.
-   ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
+   ctx->chaining[ 3 ] = mm256_bcast128lo_64( 0x0100000000000000 );
   ctx->buf_ptr = 0;

   // --- update ---
@@ -233,18 +229,18 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
   if ( i == SIZE256 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-      ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+      ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
   }
   else
   {
       // add first padding
-       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE256 - 1; i++ )
           ctx->buffer[i] = m256_zero;

       // add length padding, second last byte is zero unless blocks > 255
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
   }

   // digest final padding block and do output transform
@@ -289,23 +285,22 @@ int groestl256_2way_update_close( groestl256_2way_context* ctx, void* output,
   if ( i == SIZE256 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
   }
   else
   {
       // add first padding
-       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE256 - 1; i++ )
           ctx->buffer[i] = m256_zero;

       // add length padding, second last byte is zero unless blocks > 255
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
   }

 // digest final padding block and do output transform
   TF512_2way( ctx->chaining, ctx->buffer );
-
   OF512_2way( ctx->chaining );

   // store hash result in output 
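
Note on the constructor renames in this file: the definitions of mm512_bcast128lo_64, mm512_bcast128hi_64 and mm512_set2_64 are not part of this diff. Judging only from the substitutions above, they appear to fill the low or high 64-bit half of every 128-bit lane. The sketch below models that assumed layout with a plain array; the type and the model_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 512-bit vector as eight 64-bit elements, lane[0] least significant. */
typedef struct { uint64_t lane[8]; } vec512_model;

/* ( hi, lo ) repeated in each of the four 128-bit lanes, matching how the
   removed m512_const2_64( hi, lo ) calls are used above. */
static vec512_model model_set2_64( uint64_t hi, uint64_t lo )
{
   vec512_model v;
   for ( int i = 0; i < 8; i += 2 )
   {
      v.lane[ i     ] = lo;
      v.lane[ i + 1 ] = hi;
   }
   return v;
}

/* Assumed meaning: value in the low half of each 128-bit lane, high half zero. */
static vec512_model model_bcast128lo_64( uint64_t x )
{
   return model_set2_64( 0, x );
}

/* Assumed meaning: value in the high half of each 128-bit lane, low half zero. */
static vec512_model model_bcast128hi_64( uint64_t x )
{
   return model_set2_64( x, 0 );
}
```

Under that reading, the first padding element (0x80) stays in the low half and the block count (blocks << 56) stays in the high half, the same positions the removed calls used.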
--- a/algo/groestl/groestl256-intr-4way.h
+++ b/algo/groestl/groestl256-intr-4way.h
@@ -165,7 +165,7 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b ); \
+  b1 = _mm512_set1_epi64( 0x1b1b1b1b1b1b1b1b ); \
  MUL2( a0, b0, b1 ); \
  a0 = _mm512_xor_si512( a0, TEMP0 ); \
  MUL2( a1, b0, b1 ); \
@@ -205,116 +205,18 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
  b1 = _mm512_xor_si512( b1, a4 ); \
 }/*MixBytes*/

-
-#if 0
-#define MixBytes(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
-  /* t_i = a_i + a_{i+1} */\
-  b6 = a0;\
-  b7 = a1;\
-  a0 = _mm512_xor_si512(a0, a1);\
-  b0 = a2;\
-  a1 = _mm512_xor_si512(a1, a2);\
-  b1 = a3;\
-  a2 = _mm512_xor_si512(a2, a3);\
-  b2 = a4;\
-  a3 = _mm512_xor_si512(a3, a4);\
-  b3 = a5;\
-  a4 = _mm512_xor_si512(a4, a5);\
-  b4 = a6;\
-  a5 = _mm512_xor_si512(a5, a6);\
-  b5 = a7;\
-  a6 = _mm512_xor_si512(a6, a7);\
-  a7 = _mm512_xor_si512(a7, b6);\
-  \
-  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
-  b0 = _mm512_xor_si512(b0, a4);\
-  b6 = _mm512_xor_si512(b6, a4);\
-  b1 = _mm512_xor_si512(b1, a5);\
-  b7 = _mm512_xor_si512(b7, a5);\
-  b2 = _mm512_xor_si512(b2, a6);\
-  b0 = _mm512_xor_si512(b0, a6);\
-  /* spill values y_4, y_5 to memory */\
-  TEMP0 = b0;\
-  b3 = _mm512_xor_si512(b3, a7);\
-  b1 = _mm512_xor_si512(b1, a7);\
-  TEMP1 = b1;\
-  b4 = _mm512_xor_si512(b4, a0);\
-  b2 = _mm512_xor_si512(b2, a0);\
-  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
-  b0 = a0;\
-  b5 = _mm512_xor_si512(b5, a1);\
-  b3 = _mm512_xor_si512(b3, a1);\
-  b1 = a1;\
-  b6 = _mm512_xor_si512(b6, a2);\
-  b4 = _mm512_xor_si512(b4, a2);\
-  TEMP2 = a2;\
-  b7 = _mm512_xor_si512(b7, a3);\
-  b5 = _mm512_xor_si512(b5, a3);\
-  \
-  /* compute x_i = t_i + t_{i+3} */\
-  a0 = _mm512_xor_si512(a0, a3);\
-  a1 = _mm512_xor_si512(a1, a4);\
-  a2 = _mm512_xor_si512(a2, a5);\
-  a3 = _mm512_xor_si512(a3, a6);\
-  a4 = _mm512_xor_si512(a4, a7);\
-  a5 = _mm512_xor_si512(a5, b0);\
-  a6 = _mm512_xor_si512(a6, b1);\
-  a7 = _mm512_xor_si512(a7, TEMP2);\
-  \
-  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
-  /* compute w_i : add y_{i+4} */\
-  b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b );\
-  MUL2(a0, b0, b1);\
-  a0 = _mm512_xor_si512(a0, TEMP0);\
-  MUL2(a1, b0, b1);\
-  a1 = _mm512_xor_si512(a1, TEMP1);\
-  MUL2(a2, b0, b1);\
-  a2 = _mm512_xor_si512(a2, b2);\
-  MUL2(a3, b0, b1);\
-  a3 = _mm512_xor_si512(a3, b3);\
-  MUL2(a4, b0, b1);\
-  a4 = _mm512_xor_si512(a4, b4);\
-  MUL2(a5, b0, b1);\
-  a5 = _mm512_xor_si512(a5, b5);\
-  MUL2(a6, b0, b1);\
-  a6 = _mm512_xor_si512(a6, b6);\
-  MUL2(a7, b0, b1);\
-  a7 = _mm512_xor_si512(a7, b7);\
-  \
-  /* compute v_i : double w_i      */\
-  /* add to y_4 y_5 .. v3, v4, ... */\
-  MUL2(a0, b0, b1);\
-  b5 = _mm512_xor_si512(b5, a0);\
-  MUL2(a1, b0, b1);\
-  b6 = _mm512_xor_si512(b6, a1);\
-  MUL2(a2, b0, b1);\
-  b7 = _mm512_xor_si512(b7, a2);\
-  MUL2(a5, b0, b1);\
-  b2 = _mm512_xor_si512(b2, a5);\
-  MUL2(a6, b0, b1);\
-  b3 = _mm512_xor_si512(b3, a6);\
-  MUL2(a7, b0, b1);\
-  b4 = _mm512_xor_si512(b4, a7);\
-  MUL2(a3, b0, b1);\
-  MUL2(a4, b0, b1);\
-  b0 = TEMP0;\
-  b1 = TEMP1;\
-  b0 = _mm512_xor_si512(b0, a3);\
-  b1 = _mm512_xor_si512(b1, a4);\
-}/*MixBytes*/
-#endif
+#define MASK_NOT( a )  _mm512_mask_ternarylogic_epi64( a, 0xaa, a, a, 1 )

 #define ROUND(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* AddRoundConstant */\
-  b1 = m512_const2_64( 0xffffffffffffffff, 0 ); \
-  a0 = _mm512_xor_si512( a0, m512_const1_128( round_const_l0[i] ) );\
-  a1 = _mm512_xor_si512( a1, b1 );\
-  a2 = _mm512_xor_si512( a2, b1 );\
-  a3 = _mm512_xor_si512( a3, b1 );\
-  a4 = _mm512_xor_si512( a4, b1 );\
-  a5 = _mm512_xor_si512( a5, b1 );\
-  a6 = _mm512_xor_si512( a6, b1 );\
-  a7 = _mm512_xor_si512( a7, m512_const1_128( round_const_l7[i] ) );\
+  a0 = _mm512_xor_si512( a0, mm512_bcast_m128( round_const_l0[i] ) );\
+  a1 = MASK_NOT( a1 ); \
+  a2 = MASK_NOT( a2 ); \
+  a3 = MASK_NOT( a3 ); \
+  a4 = MASK_NOT( a4 ); \
+  a5 = MASK_NOT( a5 ); \
+  a6 = MASK_NOT( a6 ); \
+  a7 = _mm512_xor_si512( a7, mm512_bcast_m128( round_const_l7[i] ) );\
  \
  /* ShiftBytes + SubBytes (interleaved) */\
  b0 = _mm512_xor_si512( b0, b0 );\
@@ -450,7 +352,7 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
 * outputs: (i0-7) = (0|S)
 */
 #define Matrix_Transpose_O_B(i0, i1, i2, i3, i4, i5, i6, i7, t0){\
-  t0 = _mm512_xor_si512( t0, t0 );\
+  t0 = m512_zero;\
  i1 = i0;\
  i3 = i2;\
  i5 = i4;\
@@ -481,11 +383,11 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,

 void TF512_4way( __m512i* chaining, __m512i* message )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m512i TEMP0;
-  static __m512i TEMP1;
-  static __m512i TEMP2;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i TEMP0;
+  __m512i TEMP1;
+  __m512i TEMP2;

  /* load message into registers xmm12 - xmm15 */
  xmm12 = message[0];
@@ -547,11 +449,11 @@ void TF512_4way( __m512i* chaining, __m512i* message )

 void OF512_4way( __m512i* chaining )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m512i TEMP0;
-  static __m512i TEMP1;
-  static __m512i TEMP2;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i TEMP0;
+  __m512i TEMP1;
+  __m512i TEMP2;

  /* load CV into registers xmm8, xmm10, xmm12, xmm14 */
  xmm8 = chaining[0];
@@ -637,7 +539,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  j = _mm256_cmpgt_epi8(j, i );\
  i = _mm256_add_epi8(i, i);\
  j = _mm256_and_si256(j, k);\
-  i = _mm256_xor_si256(i, j);\
+  i = mm256_xorand( i, j, k );\
 }

 #define MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
@@ -648,7 +550,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  b0 = a2;\
  a1 = _mm256_xor_si256(a1, a2);\
  b1 = a3;\
-  a2 = _mm256_xor_si256(a2, a3);\
+  TEMP2 = _mm256_xor_si256(a2, a3);\
  b2 = a4;\
  a3 = _mm256_xor_si256(a3, a4);\
  b3 = a5;\
@@ -660,34 +562,20 @@ static const __m256i SUBSH_MASK7_2WAY =
  a7 = _mm256_xor_si256(a7, b6);\
  \
  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
-  b0 = _mm256_xor_si256(b0, a4);\
-  b6 = _mm256_xor_si256(b6, a4);\
-  b1 = _mm256_xor_si256(b1, a5);\
-  b7 = _mm256_xor_si256(b7, a5);\
-  b2 = _mm256_xor_si256(b2, a6);\
-  b0 = _mm256_xor_si256(b0, a6);\
-  /* spill values y_4, y_5 to memory */\
-  TEMP0 = b0;\
-  b3 = _mm256_xor_si256(b3, a7);\
-  b1 = _mm256_xor_si256(b1, a7);\
-  TEMP1 = b1;\
-  b4 = _mm256_xor_si256(b4, a0);\
-  b2 = _mm256_xor_si256(b2, a0);\
-  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
-  b0 = a0;\
-  b5 = _mm256_xor_si256(b5, a1);\
-  b3 = _mm256_xor_si256(b3, a1);\
-  b1 = a1;\
-  b6 = _mm256_xor_si256(b6, a2);\
-  b4 = _mm256_xor_si256(b4, a2);\
-  TEMP2 = a2;\
-  b7 = _mm256_xor_si256(b7, a3);\
-  b5 = _mm256_xor_si256(b5, a3);\
-  \
+  TEMP0 = mm256_xor3( b0, a4, a6 ); \
+  TEMP1 = mm256_xor3( b1, a5, a7 ); \
+  b2 = mm256_xor3( b2, a6, a0 ); \
+  b0 = a0; \
+  b3 = mm256_xor3( b3, a7, a1 ); \
+  b1 = a1; \
+  b6 = mm256_xor3( b6, a4, TEMP2 ); \
+  b4 = mm256_xor3( b4, a0, TEMP2 ); \
+  b7 = mm256_xor3( b7, a5, a3 ); \
+  b5 = mm256_xor3( b5, a1, a3 ); \
  /* compute x_i = t_i + t_{i+3} */\
  a0 = _mm256_xor_si256(a0, a3);\
  a1 = _mm256_xor_si256(a1, a4);\
-  a2 = _mm256_xor_si256(a2, a5);\
+  a2 = _mm256_xor_si256( TEMP2, a5);\
  a3 = _mm256_xor_si256(a3, a6);\
  a4 = _mm256_xor_si256(a4, a7);\
  a5 = _mm256_xor_si256(a5, b0);\
@@ -696,7 +584,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m256_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm256_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2_2WAY(a0, b0, b1);\
  a0 = _mm256_xor_si256(a0, TEMP0);\
  MUL2_2WAY(a1, b0, b1);\
@@ -738,15 +626,15 @@ static const __m256i SUBSH_MASK7_2WAY =

 #define ROUND_2WAY(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* AddRoundConstant */\
-  b1 = m256_const2_64( 0xffffffffffffffff, 0 ); \
-  a0 = _mm256_xor_si256( a0, m256_const1_128( round_const_l0[i] ) );\
+  b1 = mm256_bcast_m128( mm128_mask_32( m128_neg1, 0x3 ) ); \
+  a0 = _mm256_xor_si256( a0, mm256_bcast_m128( round_const_l0[i] ) );\
  a1 = _mm256_xor_si256( a1, b1 );\
  a2 = _mm256_xor_si256( a2, b1 );\
  a3 = _mm256_xor_si256( a3, b1 );\
  a4 = _mm256_xor_si256( a4, b1 );\
  a5 = _mm256_xor_si256( a5, b1 );\
  a6 = _mm256_xor_si256( a6, b1 );\
-  a7 = _mm256_xor_si256( a7, m256_const1_128( round_const_l7[i] ) );\
+  a7 = _mm256_xor_si256( a7, mm256_bcast_m128( round_const_l7[i] ) );\
  \
  /* ShiftBytes + SubBytes (interleaved) */\
  b0 = _mm256_xor_si256( b0, b0 );\
@@ -769,7 +657,6 @@ static const __m256i SUBSH_MASK7_2WAY =
  \
  /* MixBytes */\
  MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7);\
-\
 }

 /* 10 rounds, P and Q in parallel */
@@ -850,7 +737,7 @@ static const __m256i SUBSH_MASK7_2WAY =
 }/**/

 #define Matrix_Transpose_O_B_2way(i0, i1, i2, i3, i4, i5, i6, i7, t0){\
-  t0 = _mm256_xor_si256( t0, t0 );\
+  t0 = m256_zero;\
  i1 = i0;\
  i3 = i2;\
  i5 = i4;\
@@ -874,11 +761,11 @@ static const __m256i SUBSH_MASK7_2WAY =

 void TF512_2way( __m256i* chaining, __m256i* message )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m256i TEMP0;
-  static __m256i TEMP1;
-  static __m256i TEMP2;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i TEMP0;
+  __m256i TEMP1;
+  __m256i TEMP2;

  /* load message into registers xmm12 - xmm15 */
  xmm12 = message[0];
@@ -940,11 +827,11 @@ void TF512_2way( __m256i* chaining, __m256i* message )
  
 void OF512_2way( __m256i* chaining )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m256i TEMP0;
-  static __m256i TEMP1;
-  static __m256i TEMP2;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i TEMP0;
+  __m256i TEMP1;
+  __m256i TEMP2;

  /* load CV into registers xmm8, xmm10, xmm12, xmm14 */
  xmm8 = chaining[0];
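
Note on MASK_NOT in the 4-way ROUND macro above: it replaces an XOR against a constant whose high 64 bits are all ones in each 128-bit lane. The sketch below is a scalar model of why the two forms should agree. With imm8 = 1 the ternary-logic table outputs 1 only when all three input bits are 0, which is NOT when the three inputs are the same value, and mask 0xaa selects the odd 64-bit elements. All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise three-input lookup: at each bit position the index
   ( a << 2 ) | ( b << 1 ) | c selects one bit of imm8. */
static uint64_t ternlog64( uint64_t a, uint64_t b, uint64_t c, uint8_t imm8 )
{
   uint64_t r = 0;
   for ( int bit = 0; bit < 64; bit++ )
   {
      unsigned idx = (unsigned)( ( ( ( a >> bit ) & 1 ) << 2 )
                               | ( ( ( b >> bit ) & 1 ) << 1 )
                               |   ( ( c >> bit ) & 1 ) );
      r |= (uint64_t)( ( imm8 >> idx ) & 1 ) << bit;
   }
   return r;
}

/* One 64-bit element of MASK_NOT: elements selected by mask 0xaa are
   inverted, the others pass through unchanged. */
static uint64_t model_mask_not_lane( uint64_t x, int lane )
{
   const uint8_t k = 0xaa;
   return ( ( k >> lane ) & 1 ) ? ternlog64( x, x, x, 0x01 ) : x;
}

/* The removed form: xor with ( hi = all ones, lo = 0 ) in every 128-bit lane. */
static uint64_t model_xor_const_lane( uint64_t x, int lane )
{
   return x ^ ( ( lane & 1 ) ? ~0ULL : 0ULL );
}
```

If the model holds, the new form needs no all-ones constant, which leaves b1 free during AddRoundConstant.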
--- a/algo/groestl/groestl512-hash-4way.c
+++ b/algo/groestl/groestl512-hash-4way.c
@@ -25,8 +25,7 @@ int groestl512_4way_init( groestl512_4way_context* ctx, uint64_t hashlen )
  memset_zero_512( ctx->buffer, SIZE512 );

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );
-
+  ctx->chaining[ 6 ] = mm512_bcast128hi_64( 0x0200000000000000 );
  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;

@@ -61,14 +60,14 @@ int groestl512_4way_update_close( groestl512_4way_context* ctx, void* output,
   if ( i == SIZE512 - 1 )
   {        
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
   }   
   else
   {
-       ctx->buffer[i] = m512_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m512_zero;
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
   }

   TF1024_4way( ctx->chaining, ctx->buffer );
@@ -94,7 +93,7 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,

   memset_zero_512( ctx->chaining, SIZE512 );
   memset_zero_512( ctx->buffer, SIZE512 );
-   ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );
+   ctx->chaining[ 6 ] = mm512_bcast128hi_64( 0x0200000000000000 );
   ctx->buf_ptr = 0;

   // --- update ---
@@ -113,14 +112,14 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
   if ( i == SIZE512 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
   }
   else
   {
-       ctx->buffer[i] = m512_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m512_zero;
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
   }

   TF1024_4way( ctx->chaining, ctx->buffer );
@@ -143,7 +142,7 @@ int groestl512_2way_init( groestl512_2way_context* ctx, uint64_t hashlen )
  memset_zero_256( ctx->buffer, SIZE512 );

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
+  ctx->chaining[ 6 ] = mm256_bcast128hi_64( 0x0200000000000000 );

  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;
@@ -179,14 +178,14 @@ int groestl512_2way_update_close( groestl512_2way_context* ctx, void* output,
   if ( i == SIZE512 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
   }
   else
   {
-       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m256_zero;
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
   }

   TF1024_2way( ctx->chaining, ctx->buffer );
@@ -212,7 +211,7 @@ int groestl512_2way_full( groestl512_2way_context* ctx, void* output,

   memset_zero_256( ctx->chaining, SIZE512 );
   memset_zero_256( ctx->buffer, SIZE512 );
-   ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
+   ctx->chaining[ 6 ] = mm256_bcast128hi_64( 0x0200000000000000 );
   ctx->buf_ptr = 0;

   // --- update ---
@@ -231,14 +230,14 @@ int groestl512_2way_full( groestl512_2way_context* ctx, void* output,
   if ( i == SIZE512 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
   }
   else
   {
-       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m256_zero;
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
   }

   TF1024_2way( ctx->chaining, ctx->buffer );
--- a/algo/groestl/groestl512-intr-4way.h
+++ b/algo/groestl/groestl512-intr-4way.h
@@ -174,7 +174,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b ); \
+  b1 = _mm512_set1_epi64( 0x1b1b1b1b1b1b1b1b ); \
  MUL2( a0, b0, b1 ); \
  a0 = _mm512_xor_si512( a0, TEMP0 ); \
  MUL2( a1, b0, b1 ); \
@@ -238,7 +238,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
  for ( round_counter = 0; round_counter < 14; round_counter += 2 ) \
  { \
    /* AddRoundConstant P1024 */\
-    xmm8 = _mm512_xor_si512( xmm8, m512_const1_128( \
+    xmm8 = _mm512_xor_si512( xmm8, mm512_bcast_m128( \
             casti_m128i( round_const_p, round_counter ) ) ); \
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm8  = _mm512_shuffle_epi8( xmm8,  SUBSH_MASK0 ); \
@@ -253,7 +253,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
    SUBMIX(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
    \
     /* AddRoundConstant P1024 */\
-    xmm0 = _mm512_xor_si512( xmm0, m512_const1_128( \
+    xmm0 = _mm512_xor_si512( xmm0, mm512_bcast_m128( \
             casti_m128i( round_const_p, round_counter+1 ) ) ); \
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm0 = _mm512_shuffle_epi8( xmm0, SUBSH_MASK0 );\
@@ -282,7 +282,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
    xmm12 = _mm512_xor_si512( xmm12, xmm1 );\
    xmm13 = _mm512_xor_si512( xmm13, xmm1 );\
    xmm14 = _mm512_xor_si512( xmm14, xmm1 );\
-    xmm15 = _mm512_xor_si512( xmm15, m512_const1_128( \
+    xmm15 = _mm512_xor_si512( xmm15, mm512_bcast_m128( \
                 casti_m128i( round_const_q, round_counter ) ) ); \
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm8  = _mm512_shuffle_epi8( xmm8,  SUBSH_MASK1 );\
@@ -305,7 +305,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
    xmm4 = _mm512_xor_si512( xmm4, xmm9 );\
    xmm5 = _mm512_xor_si512( xmm5, xmm9 );\
    xmm6 = _mm512_xor_si512( xmm6, xmm9 );\
-    xmm7 = _mm512_xor_si512( xmm7, m512_const1_128( \
+    xmm7 = _mm512_xor_si512( xmm7, mm512_bcast_m128( \
             casti_m128i( round_const_q, round_counter+1 ) ) ); \
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm0 = _mm512_shuffle_epi8( xmm0, SUBSH_MASK1 );\
@@ -471,8 +471,8 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,

 void INIT_4way( __m512i* chaining )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;

  /* load IV into registers xmm8 - xmm15 */
  xmm8 = chaining[0];
@@ -500,12 +500,12 @@ void INIT_4way( __m512i* chaining )

 void TF1024_4way( __m512i* chaining, const __m512i* message )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m512i QTEMP[8];
-  static __m512i TEMP0;
-  static __m512i TEMP1;
-  static __m512i TEMP2;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i QTEMP[8];
+  __m512i TEMP0;
+  __m512i TEMP1;
+  __m512i TEMP2;

  /* load message into registers xmm8 - xmm15 (Q = message) */
  xmm8 = message[0];
@@ -606,11 +606,11 @@ void TF1024_4way( __m512i* chaining, const __m512i* message )

 void OF1024_4way( __m512i* chaining )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m512i TEMP0;
-  static __m512i TEMP1;
-  static __m512i TEMP2;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i TEMP0;
+  __m512i TEMP1;
+  __m512i TEMP2;

  /* load CV into registers xmm8 - xmm15 */
  xmm8 = chaining[0];
@@ -710,7 +710,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  b0 = a2;\
  a1 = _mm256_xor_si256(a1, a2);\
  b1 = a3;\
-  a2 = _mm256_xor_si256(a2, a3);\
+  TEMP2 = _mm256_xor_si256(a2, a3);\
  b2 = a4;\
  a3 = _mm256_xor_si256(a3, a4);\
  b3 = a5;\
@@ -722,34 +722,23 @@ static const __m256i SUBSH_MASK7_2WAY =
  a7 = _mm256_xor_si256(a7, b6);\
  \
  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
-  b0 = _mm256_xor_si256(b0, a4);\
-  b6 = _mm256_xor_si256(b6, a4);\
-  b1 = _mm256_xor_si256(b1, a5);\
-  b7 = _mm256_xor_si256(b7, a5);\
-  b2 = _mm256_xor_si256(b2, a6);\
-  b0 = _mm256_xor_si256(b0, a6);\
+  TEMP0 = mm256_xor3( b0, a4, a6 ); \
  /* spill values y_4, y_5 to memory */\
-  TEMP0 = b0;\
-  b3 = _mm256_xor_si256(b3, a7);\
-  b1 = _mm256_xor_si256(b1, a7);\
-  TEMP1 = b1;\
-  b4 = _mm256_xor_si256(b4, a0);\
-  b2 = _mm256_xor_si256(b2, a0);\
+  TEMP1 = mm256_xor3( b1, a5, a7 ); \
+  b2 = mm256_xor3( b2, a6, a0 ); \
  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
-  b0 = a0;\
-  b5 = _mm256_xor_si256(b5, a1);\
-  b3 = _mm256_xor_si256(b3, a1);\
-  b1 = a1;\
-  b6 = _mm256_xor_si256(b6, a2);\
-  b4 = _mm256_xor_si256(b4, a2);\
-  TEMP2 = a2;\
-  b7 = _mm256_xor_si256(b7, a3);\
-  b5 = _mm256_xor_si256(b5, a3);\
+  b0 = a0; \
+  b3 = mm256_xor3( b3, a7, a1 ); \
+  b1 = a1; \
+  b6 = mm256_xor3( b6, a4, TEMP2 ); \
+  b4 = mm256_xor3( b4, a0, TEMP2 ); \
+  b7 = mm256_xor3( b7, a5, a3 ); \
+  b5 = mm256_xor3( b5, a1, a3 ); \
  \
  /* compute x_i = t_i + t_{i+3} */\
  a0 = _mm256_xor_si256(a0, a3);\
  a1 = _mm256_xor_si256(a1, a4);\
-  a2 = _mm256_xor_si256(a2, a5);\
+  a2 = _mm256_xor_si256( TEMP2, a5);\
  a3 = _mm256_xor_si256(a3, a6);\
  a4 = _mm256_xor_si256(a4, a7);\
  a5 = _mm256_xor_si256(a5, b0);\
@@ -758,7 +747,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m256_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm256_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2_2WAY(a0, b0, b1);\
  a0 = _mm256_xor_si256(a0, TEMP0);\
  MUL2_2WAY(a1, b0, b1);\
@@ -822,7 +811,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  for ( round_counter = 0; round_counter < 14; round_counter += 2 ) \
  { \
    /* AddRoundConstant P1024 */\
-    xmm8 = _mm256_xor_si256( xmm8, m256_const1_128( \
+    xmm8 = _mm256_xor_si256( xmm8, mm256_bcast_m128( \
             casti_m128i( round_const_p, round_counter ) ) ); \
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm8  = _mm256_shuffle_epi8( xmm8,  SUBSH_MASK0_2WAY ); \
@@ -837,7 +826,7 @@ static const __m256i SUBSH_MASK7_2WAY =
    SUBMIX_2WAY(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
    \
     /* AddRoundConstant P1024 */\
-    xmm0 = _mm256_xor_si256( xmm0, m256_const1_128( \
+    xmm0 = _mm256_xor_si256( xmm0, mm256_bcast_m128( \
             casti_m128i( round_const_p, round_counter+1 ) ) ); \
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm0 = _mm256_shuffle_epi8( xmm0, SUBSH_MASK0_2WAY );\
@@ -866,7 +855,7 @@ static const __m256i SUBSH_MASK7_2WAY =
    xmm12 = _mm256_xor_si256( xmm12, xmm1 );\
    xmm13 = _mm256_xor_si256( xmm13, xmm1 );\
    xmm14 = _mm256_xor_si256( xmm14, xmm1 );\
-    xmm15 = _mm256_xor_si256( xmm15, m256_const1_128( \
+    xmm15 = _mm256_xor_si256( xmm15, mm256_bcast_m128( \
                 casti_m128i( round_const_q, round_counter ) ) ); \
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm8  = _mm256_shuffle_epi8( xmm8,  SUBSH_MASK1_2WAY );\
@@ -889,7 +878,7 @@ static const __m256i SUBSH_MASK7_2WAY =
    xmm4 = _mm256_xor_si256( xmm4, xmm9 );\
    xmm5 = _mm256_xor_si256( xmm5, xmm9 );\
    xmm6 = _mm256_xor_si256( xmm6, xmm9 );\
-    xmm7 = _mm256_xor_si256( xmm7, m256_const1_128( \
+    xmm7 = _mm256_xor_si256( xmm7, mm256_bcast_m128( \
             casti_m128i( round_const_q, round_counter+1 ) ) ); \
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm0 = _mm256_shuffle_epi8( xmm0, SUBSH_MASK1_2WAY );\
@@ -1040,8 +1029,8 @@ static const __m256i SUBSH_MASK7_2WAY =

 void INIT_2way( __m256i *chaining )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;

  /* load IV into registers xmm8 - xmm15 */
  xmm8 = chaining[0];
@@ -1069,12 +1058,12 @@ void INIT_2way( __m256i *chaining )

 void TF1024_2way( __m256i *chaining, const __m256i *message )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m256i QTEMP[8];
-  static __m256i TEMP0;
-  static __m256i TEMP1;
-  static __m256i TEMP2;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i QTEMP[8];
+  __m256i TEMP0;
+  __m256i TEMP1;
+  __m256i TEMP2;

  /* load message into registers xmm8 - xmm15 (Q = message) */
  xmm8 = message[0];
@@ -1175,11 +1164,11 @@ void TF1024_2way( __m256i *chaining, const __m256i *message )

 void OF1024_2way( __m256i* chaining )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m256i TEMP0;
-  static __m256i TEMP1;
-  static __m256i TEMP2;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i TEMP0;
+  __m256i TEMP1;
+  __m256i TEMP2;

  /* load CV into registers xmm8 - xmm15 */
  xmm8 = chaining[0];
--- a/algo/hamsi/hamsi-hash-4way.c
+++ b/algo/hamsi/hamsi-hash-4way.c
@@ -562,14 +562,14 @@ do { \
  for ( int u = 0; u < 64; u++ ) \
  { \
     const __mmask8 dm = _mm512_cmplt_epi64_mask( db, zero ); \
-     m0 = _mm512_mask_xor_epi64( m0, dm, m0, m512_const1_64( tp[0] ) ); \
-     m1 = _mm512_mask_xor_epi64( m1, dm, m1, m512_const1_64( tp[1] ) ); \
-     m2 = _mm512_mask_xor_epi64( m2, dm, m2, m512_const1_64( tp[2] ) ); \
-     m3 = _mm512_mask_xor_epi64( m3, dm, m3, m512_const1_64( tp[3] ) ); \
-     m4 = _mm512_mask_xor_epi64( m4, dm, m4, m512_const1_64( tp[4] ) ); \
-     m5 = _mm512_mask_xor_epi64( m5, dm, m5, m512_const1_64( tp[5] ) ); \
-     m6 = _mm512_mask_xor_epi64( m6, dm, m6, m512_const1_64( tp[6] ) ); \
-     m7 = _mm512_mask_xor_epi64( m7, dm, m7, m512_const1_64( tp[7] ) ); \
+     m0 = _mm512_mask_xor_epi64( m0, dm, m0, _mm512_set1_epi64( tp[0] ) ); \
+     m1 = _mm512_mask_xor_epi64( m1, dm, m1, _mm512_set1_epi64( tp[1] ) ); \
+     m2 = _mm512_mask_xor_epi64( m2, dm, m2, _mm512_set1_epi64( tp[2] ) ); \
+     m3 = _mm512_mask_xor_epi64( m3, dm, m3, _mm512_set1_epi64( tp[3] ) ); \
+     m4 = _mm512_mask_xor_epi64( m4, dm, m4, _mm512_set1_epi64( tp[4] ) ); \
+     m5 = _mm512_mask_xor_epi64( m5, dm, m5, _mm512_set1_epi64( tp[5] ) ); \
+     m6 = _mm512_mask_xor_epi64( m6, dm, m6, _mm512_set1_epi64( tp[6] ) ); \
+     m7 = _mm512_mask_xor_epi64( m7, dm, m7, _mm512_set1_epi64( tp[7] ) ); \
     db = _mm512_ror_epi64( db, 1 ); \
     tp += 8; \
  } \
@@ -733,17 +733,17 @@ do { \
   __m512i alpha[16]; \
   const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
   for( int i = 0; i < 16; i++ ) \
-      alpha[i] = m512_const1_64( ( (uint64_t*)alpha_n )[i] ); \
+      alpha[i] = _mm512_set1_epi64( ( (uint64_t*)alpha_n )[i] ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (1ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (1ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (2ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (2ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (3ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (3ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (4ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (4ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (5ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (5ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
 } while (0)

@@ -752,29 +752,29 @@ do { \
   __m512i alpha[16]; \
   const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
   for( int i = 0; i < 16; i++ ) \
-      alpha[i] = m512_const1_64( ( (uint64_t*)alpha_f )[i] ); \
+      alpha[i] = _mm512_set1_epi64( ( (uint64_t*)alpha_f )[i] ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 1ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 1ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 2ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 2ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 3ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 3ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 4ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 4ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 5ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 5ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 6ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 6ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 7ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 7ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 8ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 8ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 9ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 9ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (10ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (10ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (11ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (11ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
 } while (0)

@@ -829,14 +829,14 @@ void hamsi512_8way_init( hamsi_8way_big_context *sc )
   sc->partial_len = 0;
   sc->count_high = sc->count_low = 0;

-   sc->h[0] = m512_const1_64( 0x6c70617273746565 );
-   sc->h[1] = m512_const1_64( 0x656e62656b204172 );
-   sc->h[2] = m512_const1_64( 0x302c206272672031 );
-   sc->h[3] = m512_const1_64( 0x3434362c75732032 );
-   sc->h[4] = m512_const1_64( 0x3030312020422d33 );
-   sc->h[5] = m512_const1_64( 0x656e2d484c657576 );
-   sc->h[6] = m512_const1_64( 0x6c65652c65766572 );
-   sc->h[7] = m512_const1_64( 0x6769756d2042656c );
+   sc->h[0] = _mm512_set1_epi64( 0x6c70617273746565 );
+   sc->h[1] = _mm512_set1_epi64( 0x656e62656b204172 );
+   sc->h[2] = _mm512_set1_epi64( 0x302c206272672031 );
+   sc->h[3] = _mm512_set1_epi64( 0x3434362c75732032 );
+   sc->h[4] = _mm512_set1_epi64( 0x3030312020422d33 );
+   sc->h[5] = _mm512_set1_epi64( 0x656e2d484c657576 );
+   sc->h[6] = _mm512_set1_epi64( 0x6c65652c65766572 );
+   sc->h[7] = _mm512_set1_epi64( 0x6769756d2042656c );
 }

 void hamsi512_8way_update( hamsi_8way_big_context *sc, const void *data,
@@ -859,7 +859,7 @@ void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )
   sph_enc32be( &ch, sc->count_high );
   sph_enc32be( &cl, sc->count_low + ( sc->partial_len << 3 ) );
   pad[0] = _mm512_set1_epi64( ((uint64_t)cl << 32 ) | (uint64_t)ch );
-   sc->buf[0] = m512_const1_64( 0x80 );
+   sc->buf[0] = _mm512_set1_epi64( 0x80 );
   hamsi_8way_big( sc, sc->buf, 1 );
   hamsi_8way_big_final( sc, pad );

@@ -870,6 +870,32 @@ void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )

 // Hamsi 4 way AVX2

+#if defined(__AVX512VL__)
+
+#define INPUT_BIG \
+do { \
+  __m256i db = _mm256_ror_epi64( *buf, 1 ); \
+  const __m256i zero = m256_zero; \
+  const uint64_t *tp = (const uint64_t*)T512; \
+  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = zero; \
+  for ( int u = 0; u < 64; u++ ) \
+  { \
+     const __mmask8 dm = _mm256_cmplt_epi64_mask( db, zero ); \
+     m0 = _mm256_mask_xor_epi64( m0, dm, m0, _mm256_set1_epi64x( tp[0] ) ); \
+     m1 = _mm256_mask_xor_epi64( m1, dm, m1, _mm256_set1_epi64x( tp[1] ) ); \
+     m2 = _mm256_mask_xor_epi64( m2, dm, m2, _mm256_set1_epi64x( tp[2] ) ); \
+     m3 = _mm256_mask_xor_epi64( m3, dm, m3, _mm256_set1_epi64x( tp[3] ) ); \
+     m4 = _mm256_mask_xor_epi64( m4, dm, m4, _mm256_set1_epi64x( tp[4] ) ); \
+     m5 = _mm256_mask_xor_epi64( m5, dm, m5, _mm256_set1_epi64x( tp[5] ) ); \
+     m6 = _mm256_mask_xor_epi64( m6, dm, m6, _mm256_set1_epi64x( tp[6] ) ); \
+     m7 = _mm256_mask_xor_epi64( m7, dm, m7, _mm256_set1_epi64x( tp[7] ) ); \
+     db = _mm256_ror_epi64( db, 1 ); \
+     tp += 8; \
+  } \
+} while (0)
+
+#else
+
 #define INPUT_BIG \
 do { \
  __m256i db = *buf; \
@@ -880,25 +906,58 @@ do { \
  { \
     __m256i dm = _mm256_cmpgt_epi64( zero, _mm256_slli_epi64( db, u ) ); \
     m0 = _mm256_xor_si256( m0, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[0] ) ) ); \
+                                          _mm256_set1_epi64x( tp[0] ) ) ); \
     m1 = _mm256_xor_si256( m1, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[1] ) ) ); \
+                                          _mm256_set1_epi64x( tp[1] ) ) ); \
     m2 = _mm256_xor_si256( m2, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[2] ) ) ); \
+                                          _mm256_set1_epi64x( tp[2] ) ) ); \
     m3 = _mm256_xor_si256( m3, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[3] ) ) ); \
+                                          _mm256_set1_epi64x( tp[3] ) ) ); \
     m4 = _mm256_xor_si256( m4, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[4] ) ) ); \
+                                          _mm256_set1_epi64x( tp[4] ) ) ); \
     m5 = _mm256_xor_si256( m5, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[5] ) ) ); \
+                                          _mm256_set1_epi64x( tp[5] ) ) ); \
     m6 = _mm256_xor_si256( m6, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[6] ) ) ); \
+                                          _mm256_set1_epi64x( tp[6] ) ) ); \
     m7 = _mm256_xor_si256( m7, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[7] ) ) ); \
+                                          _mm256_set1_epi64x( tp[7] ) ) ); \
     tp += 8; \
  } \
 } while (0)

+#endif
+
+#define SBOX( a, b, c, d ) \
+do { \
+  __m256i t; \
+  t = a; \
+  a = mm256_xorand( d, a, c ); \
+  c = mm256_xor3( a, b, c ); \
+  b = mm256_xoror( b, d, t ); \
+  t = _mm256_xor_si256( t, c ); \
+  d = mm256_xoror( a, b, t ); \
+  t = mm256_xorand( t, a, b ); \
+  a = c; \
+  c = mm256_xor3( b, d, t ); \
+  b = d; \
+  d = mm256_not( t ); \
+} while (0)
+
+#define L( a, b, c, d ) \
+do { \
+   a = mm256_rol_32( a, 13 ); \
+   c = mm256_rol_32( c,  3 ); \
+   b = mm256_xor3( a, b, c ); \
+   d = mm256_xor3( d, c, _mm256_slli_epi32( a, 3 ) ); \
+   b = mm256_rol_32( b, 1 ); \
+   d = mm256_rol_32( d, 7 ); \
+   a = mm256_xor3( a, b, d ); \
+   c = mm256_xor3( c, d, _mm256_slli_epi32( b, 7 ) ); \
+   a = mm256_rol_32( a,  5 ); \
+   c = mm256_rol_32( c, 22 ); \
+} while (0)
+
+/*
 #define SBOX( a, b, c, d ) \
 do { \
  __m256i t; \
@@ -937,6 +996,7 @@ do { \
   a = mm256_rol_32( a,  5 ); \
   c = mm256_rol_32( c, 22 ); \
 } while (0)
+*/

 #define DECL_STATE_BIG \
   __m256i c0, c1, c2, c3, c4, c5, c6, c7; \
@@ -1066,17 +1126,17 @@ do { \
   __m256i alpha[16]; \
   const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
   for( int i = 0; i < 16; i++ ) \
-      alpha[i] = m256_const1_64( ( (uint64_t*)alpha_n )[i] ); \
+      alpha[i] = _mm256_set1_epi64x( ( (uint64_t*)alpha_n )[i] ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (1ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (1ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (2ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (2ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (3ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (3ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (4ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (4ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (5ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (5ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
 } while (0)

@@ -1085,29 +1145,29 @@ do { \
   __m256i alpha[16]; \
   const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
   for( int i = 0; i < 16; i++ ) \
-      alpha[i] = m256_const1_64( ( (uint64_t*)alpha_f )[i] ); \
+      alpha[i] = _mm256_set1_epi64x( ( (uint64_t*)alpha_f )[i] ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 1ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 1ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 2ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 2ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 3ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 3ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 4ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 4ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 5ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 5ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 6ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 6ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 7ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 7ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 8ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 8ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 9ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 9ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (10ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (10ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (11ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (11ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
 } while (0)

@@ -1163,14 +1223,14 @@ void hamsi512_4way_init( hamsi_4way_big_context *sc )
   sc->partial_len = 0;
   sc->count_high = sc->count_low = 0;

-   sc->h[0] = m256_const1_64( 0x6c70617273746565 );
-   sc->h[1] = m256_const1_64( 0x656e62656b204172 );
-   sc->h[2] = m256_const1_64( 0x302c206272672031 );
-   sc->h[3] = m256_const1_64( 0x3434362c75732032 );
-   sc->h[4] = m256_const1_64( 0x3030312020422d33 );
-   sc->h[5] = m256_const1_64( 0x656e2d484c657576 );
-   sc->h[6] = m256_const1_64( 0x6c65652c65766572 );
-   sc->h[7] = m256_const1_64( 0x6769756d2042656c );
+   sc->h[0] = _mm256_set1_epi64x( 0x6c70617273746565 );
+   sc->h[1] = _mm256_set1_epi64x( 0x656e62656b204172 );
+   sc->h[2] = _mm256_set1_epi64x( 0x302c206272672031 );
+   sc->h[3] = _mm256_set1_epi64x( 0x3434362c75732032 );
+   sc->h[4] = _mm256_set1_epi64x( 0x3030312020422d33 );
+   sc->h[5] = _mm256_set1_epi64x( 0x656e2d484c657576 );
+   sc->h[6] = _mm256_set1_epi64x( 0x6c65652c65766572 );
+   sc->h[7] = _mm256_set1_epi64x( 0x6769756d2042656c );
 }

 void hamsi512_4way_update( hamsi_4way_big_context *sc, const void *data,
@@ -1193,7 +1253,7 @@ void hamsi512_4way_close( hamsi_4way_big_context *sc, void *dst )
   sph_enc32be( &ch, sc->count_high );
   sph_enc32be( &cl, sc->count_low + ( sc->partial_len << 3 ) );
   pad[0] = _mm256_set1_epi64x( ((uint64_t)cl << 32 ) | (uint64_t)ch );
-   sc->buf[0] = m256_const1_64( 0x80 );
+   sc->buf[0] = _mm256_set1_epi64x( 0x80 );
   hamsi_big( sc, sc->buf, 1 );
   hamsi_big_final( sc, pad );
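
Reviewer note on the AVX512VL `INPUT_BIG` added in this file: it rotates the data word right by one so that bit 0 lands in the sign position, then on each pass tests the sign and rotates again. Below is a minimal scalar sketch of that idea for a single 64-bit lane, checked against a plain bit test. It is an illustration under assumptions, not the project's code: the table is filler produced by a hypothetical `fill_table` helper rather than the real `T512`, and all function names are made up for this note.

```c
#include <stdint.h>

#define WORDS 8    /* one accumulator per m0..m7 */
#define BITS  64   /* one table row per data bit */

/* Rotate right; n must be in 1..63. */
uint64_t ror64( uint64_t x, unsigned n )
{
   return ( x >> n ) | ( x << ( 64 - n ) );
}

/* Hypothetical filler table (xorshift64), standing in for T512. */
void fill_table( uint64_t *t )
{
   uint64_t s = 0x9E3779B97F4A7C15ULL;
   for ( int i = 0; i < BITS * WORDS; i++ )
   {
      s ^= s << 13;  s ^= s >> 7;  s ^= s << 17;
      t[i] = s;
   }
}

/* Sign-bit variant: mirrors the rotate / compare-below-zero loop. */
void expand_rotate( uint64_t d, const uint64_t *t, uint64_t *m )
{
   uint64_t db = ror64( d, 1 );              /* bit 0 -> bit 63 */
   for ( int j = 0; j < WORDS; j++ ) m[j] = 0;
   for ( int u = 0; u < BITS; u++ )
   {
      /* all ones when the sign bit is set, otherwise zero */
      const uint64_t mask = (uint64_t)0 - ( db >> 63 );
      for ( int j = 0; j < WORDS; j++ ) m[j] ^= t[j] & mask;
      db = ror64( db, 1 );                   /* next bit -> bit 63 */
      t += WORDS;
   }
}

/* Reference variant: test bit u of d directly. */
void expand_plain( uint64_t d, const uint64_t *t, uint64_t *m )
{
   for ( int j = 0; j < WORDS; j++ ) m[j] = 0;
   for ( int u = 0; u < BITS; u++ )
      if ( ( d >> u ) & 1 )
         for ( int j = 0; j < WORDS; j++ ) m[j] ^= t[ u * WORDS + j ];
}

/* Returns 1 when both variants produce the same accumulators for d. */
int variants_agree( uint64_t d )
{
   uint64_t t[ BITS * WORDS ], a[WORDS], b[WORDS];
   fill_table( t );
   expand_rotate( d, t, a );
   expand_plain( d, t, b );
   for ( int j = 0; j < WORDS; j++ )
      if ( a[j] != b[j] ) return 0;
   return 1;
}
```

Calling `variants_agree` on a few data words is enough to see that the rotate-and-test loop visits the bits least significant first, one table row per bit.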

--- a/algo/haval/haval-hash-4way.c
+++ b/algo/haval/haval-hash-4way.c
@@ -52,6 +52,56 @@ extern "C"{
 #define SPH_SMALL_FOOTPRINT_HAVAL   1
 //#endif

+#if defined(__AVX512VL__)
+
+// ( ~( a ^ b ) ) & c
+#define mm128_andnotxor( a, b, c ) \
+   _mm_ternarylogic_epi32( a, b, c, 0x82  )
+
+#else
+
+#define mm128_andnotxor( a, b, c ) \
+   _mm_andnot_si128( _mm_xor_si128( a, b ), c )
+
+#endif
+
+#define F1(x6, x5, x4, x3, x2, x1, x0) \
+ mm128_xor3( x0, mm128_andxor( x1, x0, x4 ), \
+                 _mm_xor_si128( _mm_and_si128( x2, x5 ), \
+                                _mm_and_si128( x3, x6 ) ) ) \
+
+#define F2(x6, x5, x4, x3, x2, x1, x0) \
+   mm128_xor3( mm128_andxor( x2, _mm_andnot_si128( x3, x1 ), \
+                       mm128_xor3( _mm_and_si128( x4, x5 ), x6, x0 )  ), \
+               mm128_andxor( x4, x1, x5 ), \
+               mm128_xorand( x0, x3, x5 ) ) \
+
+#define F3(x6, x5, x4, x3, x2, x1, x0) \
+  mm128_xor3( x0, \
+              _mm_and_si128( x3, \
+                         mm128_xor3( _mm_and_si128( x1, x2 ), x6, x0 ) ), \
+              _mm_xor_si128( _mm_and_si128( x1, x4 ), \
+                             _mm_and_si128( x2, x5 ) ) )
+
+#define F4(x6, x5, x4, x3, x2, x1, x0) \
+  mm128_xor3( \
+      mm128_andxor( x3, x5, \
+                    _mm_xor_si128( _mm_and_si128( x1, x2 ), \
+                                      _mm_or_si128( x4, x6 ) ) ), \
+      _mm_and_si128( x4, \
+                        mm128_xor3( x0, _mm_andnot_si128( x2, x5 ), \
+                                    _mm_xor_si128( x1, x6 ) ) ), \
+      mm128_xorand( x0, x2, x6 ) )
+
+#define F5(x6, x5, x4, x3, x2, x1, x0) \
+   _mm_xor_si128( \
+         mm128_andnotxor( mm128_and3( x1, x2, x3 ), x5, x0 ), \
+         mm128_xor3( _mm_and_si128( x1, x4 ), \
+                     _mm_and_si128( x2, x5 ), \
+                     _mm_and_si128( x3, x6 ) ) )
+  
+
+/*
 #define F1(x6, x5, x4, x3, x2, x1, x0) \
   _mm_xor_si128( x0, \
       _mm_xor_si128( _mm_and_si128(_mm_xor_si128( x0, x4 ), x1 ), \
@@ -96,6 +146,7 @@ extern "C"{
      _mm_xor_si128( _mm_xor_si128( _mm_and_si128( x1, x4 ), \
                                    _mm_and_si128( x2, x5 ) ), \
                                    _mm_and_si128( x3, x6 ) ) )
+*/

 /*
 * The macros below integrate the phi() permutations, depending on the
@@ -740,14 +791,14 @@ do { \
 static void
 haval_8way_init( haval_8way_context *sc, unsigned olen, unsigned passes )
 {
-   sc->s0 = m256_const1_32( 0x243F6A88UL );
-   sc->s1 = m256_const1_32( 0x85A308D3UL );
-   sc->s2 = m256_const1_32( 0x13198A2EUL );
-   sc->s3 = m256_const1_32( 0x03707344UL );
-   sc->s4 = m256_const1_32( 0xA4093822UL );
-   sc->s5 = m256_const1_32( 0x299F31D0UL );
-   sc->s6 = m256_const1_32( 0x082EFA98UL );
-   sc->s7 = m256_const1_32( 0xEC4E6C89UL );
+   sc->s0 = _mm256_set1_epi32( 0x243F6A88UL );
+   sc->s1 = _mm256_set1_epi32( 0x85A308D3UL );
+   sc->s2 = _mm256_set1_epi32( 0x13198A2EUL );
+   sc->s3 = _mm256_set1_epi32( 0x03707344UL );
+   sc->s4 = _mm256_set1_epi32( 0xA4093822UL );
+   sc->s5 = _mm256_set1_epi32( 0x299F31D0UL );
+   sc->s6 = _mm256_set1_epi32( 0x082EFA98UL );
+   sc->s7 = _mm256_set1_epi32( 0xEC4E6C89UL );
   sc->olen = olen;
   sc->passes = passes;
   sc->count_high = 0;
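
Reviewer note on `mm128_andnotxor` in this file: the immediate passed to `_mm_ternarylogic_epi32` is an 8-bit truth table, where bit `( a<<2 | b<<1 | c )` holds the result for that input combination. The sketch below derives the immediate from a one-bit boolean function so the `0x82` constant can be checked against the `( ~( a ^ b ) ) & c` comment. The names `ternlog_imm8` and `andnotxor_bit` are hypothetical helpers introduced for this note only.

```c
#include <stdint.h>

/* A boolean function of three one-bit inputs (each argument is 0 or 1). */
typedef unsigned (*bool3_fn)( unsigned a, unsigned b, unsigned c );

/* Build the 8-bit truth table: bit ( a<<2 | b<<1 | c ) holds f(a,b,c). */
uint8_t ternlog_imm8( bool3_fn f )
{
   uint8_t imm = 0;
   for ( unsigned i = 0; i < 8; i++ )
   {
      const unsigned a = ( i >> 2 ) & 1;
      const unsigned b = ( i >> 1 ) & 1;
      const unsigned c = i & 1;
      if ( f( a, b, c ) & 1 )
         imm |= (uint8_t)( 1u << i );
   }
   return imm;
}

/* ( ~( a ^ b ) ) & c, the expression named in the macro comment. */
unsigned andnotxor_bit( unsigned a, unsigned b, unsigned c )
{
   return ~( a ^ b ) & c;
}
```

`ternlog_imm8( andnotxor_bit )` evaluates to `0x82`: the function is true only for inputs (0,0,1) and (1,1,1), which set bits 1 and 7.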
--- a/algo/jh/jh-hash-4way.c
+++ b/algo/jh/jh-hash-4way.c
@@ -76,19 +76,31 @@ do { \

 #endif

+#if defined(__AVX512VL__)
+//TODO enable for AVX10_256, not used with AVX512VL
+
+#define notxorandnot( a, b, c ) \
+   _mm256_ternarylogic_epi64( a, b, c, 0x2d )
+
+#else
+
+#define notxorandnot( a, b, c ) \
+   _mm256_xor_si256( mm256_not( a ), _mm256_andnot_si256( b, c ) )
+
+#endif
+
 #define Sb(x0, x1, x2, x3, c) \
 do { \
-   const __m256i cc = _mm256_set1_epi64x( c ); \
-    x3 = mm256_not( x3 ); \
-    x0 = _mm256_xor_si256( x0, _mm256_andnot_si256( x2, cc ) ); \
-    tmp = _mm256_xor_si256( cc, _mm256_and_si256( x0, x1 ) ); \
-    x0 = _mm256_xor_si256( x0, _mm256_and_si256( x2, x3 ) ); \
-    x3 = _mm256_xor_si256( x3, _mm256_andnot_si256( x1, x2 ) ); \
-    x1 = _mm256_xor_si256( x1, _mm256_and_si256( x0, x2 ) ); \
-    x2 = _mm256_xor_si256( x2, _mm256_andnot_si256( x3, x0 ) ); \
-    x0 = _mm256_xor_si256( x0, _mm256_or_si256( x1, x3 ) ); \
-    x3 = _mm256_xor_si256( x3, _mm256_and_si256( x1, x2 ) ); \
-    x1 = _mm256_xor_si256( x1, _mm256_and_si256( tmp, x0 ) ); \
+    const __m256i cc = _mm256_set1_epi64x( c ); \
+    x0 = mm256_xorandnot( x0, x2, cc ); \
+    tmp = mm256_xorand( cc, x0, x1 ); \
+    x0 = mm256_xorandnot( x0, x3, x2 ); \
+    x3 = notxorandnot( x3, x1, x2 ); \
+    x1 = mm256_xorand( x1, x0, x2 ); \
+    x2 = mm256_xorandnot( x2, x3, x0 ); \
+    x0 = mm256_xoror( x0, x1, x3 ); \
+    x3 = mm256_xorand( x3, x1, x2 ); \
+    x1 = mm256_xorand( x1, tmp, x0 ); \
    x2 = _mm256_xor_si256( x2, tmp ); \
 } while (0)

@@ -96,11 +108,11 @@ do { \
 do { \
    x4 = _mm256_xor_si256( x4, x1 ); \
    x5 = _mm256_xor_si256( x5, x2 ); \
-    x6 = _mm256_xor_si256( x6, _mm256_xor_si256( x3, x0 ) ); \
+    x6 = mm256_xor3( x6, x3, x0 ); \
    x7 = _mm256_xor_si256( x7, x0 ); \
    x0 = _mm256_xor_si256( x0, x5 ); \
    x1 = _mm256_xor_si256( x1, x6 ); \
-    x2 = _mm256_xor_si256( x2, _mm256_xor_si256( x7, x4 ) ); \
+    x2 = mm256_xor3( x2, x7, x4 ); \
    x3 = _mm256_xor_si256( x3, x4 ); \
 } while (0)

@@ -323,12 +335,12 @@ do { \
 } while (0)


-#define W80(x)   Wz_8W(x, m512_const1_64( 0x5555555555555555 ),  1 )
-#define W81(x)   Wz_8W(x, m512_const1_64( 0x3333333333333333 ),  2 )
-#define W82(x)   Wz_8W(x, m512_const1_64( 0x0F0F0F0F0F0F0F0F ),  4 )
-#define W83(x)   Wz_8W(x, m512_const1_64( 0x00FF00FF00FF00FF ),  8 ) 
-#define W84(x)   Wz_8W(x, m512_const1_64( 0x0000FFFF0000FFFF ), 16 )
-#define W85(x)   Wz_8W(x, m512_const1_64( 0x00000000FFFFFFFF ), 32 )
+#define W80(x)   Wz_8W(x, _mm512_set1_epi64( 0x5555555555555555 ),  1 )
+#define W81(x)   Wz_8W(x, _mm512_set1_epi64( 0x3333333333333333 ),  2 )
+#define W82(x)   Wz_8W(x, _mm512_set1_epi64( 0x0F0F0F0F0F0F0F0F ),  4 )
+#define W83(x)   Wz_8W(x, _mm512_set1_epi64( 0x00FF00FF00FF00FF ),  8 ) 
+#define W84(x)   Wz_8W(x, _mm512_set1_epi64( 0x0000FFFF0000FFFF ), 16 )
+#define W85(x)   Wz_8W(x, _mm512_set1_epi64( 0x00000000FFFFFFFF ), 32 )
 #define W86(x) \
 do { \
   __m512i t = x ## h; \
@@ -352,12 +364,12 @@ do { \
   x ## l = _mm256_or_si256( _mm256_and_si256((x ## l >> (n)), (c)), t ); \
 } while (0)

-#define W0(x)   Wz(x, m256_const1_64( 0x5555555555555555 ),  1 )
-#define W1(x)   Wz(x, m256_const1_64( 0x3333333333333333 ),  2 )
-#define W2(x)   Wz(x, m256_const1_64( 0x0F0F0F0F0F0F0F0F ),  4 )
-#define W3(x)   Wz(x, m256_const1_64( 0x00FF00FF00FF00FF ),  8 ) 
-#define W4(x)   Wz(x, m256_const1_64( 0x0000FFFF0000FFFF ), 16 )
-#define W5(x)   Wz(x, m256_const1_64( 0x00000000FFFFFFFF ), 32 )
+#define W0(x)   Wz(x, _mm256_set1_epi64x( 0x5555555555555555 ),  1 )
+#define W1(x)   Wz(x, _mm256_set1_epi64x( 0x3333333333333333 ),  2 )
+#define W2(x)   Wz(x, _mm256_set1_epi64x( 0x0F0F0F0F0F0F0F0F ),  4 )
+#define W3(x)   Wz(x, _mm256_set1_epi64x( 0x00FF00FF00FF00FF ),  8 ) 
+#define W4(x)   Wz(x, _mm256_set1_epi64x( 0x0000FFFF0000FFFF ), 16 )
+#define W5(x)   Wz(x, _mm256_set1_epi64x( 0x00000000FFFFFFFF ), 32 )
 #define W6(x) \
 do { \
   __m256i t = x ## h; \
@@ -624,22 +636,22 @@ static const sph_u64 IV512[] = {
 void jh256_8way_init( jh_8way_context *sc )
 {
    // bswapped IV256
-    sc->H[ 0] = m512_const1_64( 0xebd3202c41a398eb );
-    sc->H[ 1] = m512_const1_64( 0xc145b29c7bbecd92 );
-    sc->H[ 2] = m512_const1_64( 0xfac7d4609151931c );
-    sc->H[ 3] = m512_const1_64( 0x038a507ed6820026 );
-    sc->H[ 4] = m512_const1_64( 0x45b92677269e23a4 );
-    sc->H[ 5] = m512_const1_64( 0x77941ad4481afbe0 );
-    sc->H[ 6] = m512_const1_64( 0x7a176b0226abb5cd );
-    sc->H[ 7] = m512_const1_64( 0xa82fff0f4224f056 );
-    sc->H[ 8] = m512_const1_64( 0x754d2e7f8996a371 );
-    sc->H[ 9] = m512_const1_64( 0x62e27df70849141d );
-    sc->H[10] = m512_const1_64( 0x948f2476f7957627 );
-    sc->H[11] = m512_const1_64( 0x6c29804757b6d587 );
-    sc->H[12] = m512_const1_64( 0x6c0d8eac2d275e5c );
-    sc->H[13] = m512_const1_64( 0x0f7a0557c6508451 );
-    sc->H[14] = m512_const1_64( 0xea12247067d3e47b );
-    sc->H[15] = m512_const1_64( 0x69d71cd313abe389 );
+    sc->H[ 0] = _mm512_set1_epi64( 0xebd3202c41a398eb );
+    sc->H[ 1] = _mm512_set1_epi64( 0xc145b29c7bbecd92 );
+    sc->H[ 2] = _mm512_set1_epi64( 0xfac7d4609151931c );
+    sc->H[ 3] = _mm512_set1_epi64( 0x038a507ed6820026 );
+    sc->H[ 4] = _mm512_set1_epi64( 0x45b92677269e23a4 );
+    sc->H[ 5] = _mm512_set1_epi64( 0x77941ad4481afbe0 );
+    sc->H[ 6] = _mm512_set1_epi64( 0x7a176b0226abb5cd );
+    sc->H[ 7] = _mm512_set1_epi64( 0xa82fff0f4224f056 );
+    sc->H[ 8] = _mm512_set1_epi64( 0x754d2e7f8996a371 );
+    sc->H[ 9] = _mm512_set1_epi64( 0x62e27df70849141d );
+    sc->H[10] = _mm512_set1_epi64( 0x948f2476f7957627 );
+    sc->H[11] = _mm512_set1_epi64( 0x6c29804757b6d587 );
+    sc->H[12] = _mm512_set1_epi64( 0x6c0d8eac2d275e5c );
+    sc->H[13] = _mm512_set1_epi64( 0x0f7a0557c6508451 );
+    sc->H[14] = _mm512_set1_epi64( 0xea12247067d3e47b );
+    sc->H[15] = _mm512_set1_epi64( 0x69d71cd313abe389 );
    sc->ptr = 0;
    sc->block_count = 0;
 }
@@ -647,22 +659,22 @@ void jh256_8way_init( jh_8way_context *sc )
 void jh512_8way_init( jh_8way_context *sc )
 {
    // bswapped IV512
-    sc->H[ 0] = m512_const1_64( 0x17aa003e964bd16f );
-    sc->H[ 1] = m512_const1_64( 0x43d5157a052e6a63 );
-    sc->H[ 2] = m512_const1_64( 0x0bef970c8d5e228a );
-    sc->H[ 3] = m512_const1_64( 0x61c3b3f2591234e9 );
-    sc->H[ 4] = m512_const1_64( 0x1e806f53c1a01d89 );
-    sc->H[ 5] = m512_const1_64( 0x806d2bea6b05a92a );
-    sc->H[ 6] = m512_const1_64( 0xa6ba7520dbcc8e58 );
-    sc->H[ 7] = m512_const1_64( 0xf73bf8ba763a0fa9 );
-    sc->H[ 8] = m512_const1_64( 0x694ae34105e66901 );
-    sc->H[ 9] = m512_const1_64( 0x5ae66f2e8e8ab546 );
-    sc->H[10] = m512_const1_64( 0x243c84c1d0a74710 );
-    sc->H[11] = m512_const1_64( 0x99c15a2db1716e3b );
-    sc->H[12] = m512_const1_64( 0x56f8b19decf657cf );
-    sc->H[13] = m512_const1_64( 0x56b116577c8806a7 );
-    sc->H[14] = m512_const1_64( 0xfb1785e6dffcc2e3 );
-    sc->H[15] = m512_const1_64( 0x4bdd8ccc78465a54 );
+    sc->H[ 0] = _mm512_set1_epi64( 0x17aa003e964bd16f );
+    sc->H[ 1] = _mm512_set1_epi64( 0x43d5157a052e6a63 );
+    sc->H[ 2] = _mm512_set1_epi64( 0x0bef970c8d5e228a );
+    sc->H[ 3] = _mm512_set1_epi64( 0x61c3b3f2591234e9 );
+    sc->H[ 4] = _mm512_set1_epi64( 0x1e806f53c1a01d89 );
+    sc->H[ 5] = _mm512_set1_epi64( 0x806d2bea6b05a92a );
+    sc->H[ 6] = _mm512_set1_epi64( 0xa6ba7520dbcc8e58 );
+    sc->H[ 7] = _mm512_set1_epi64( 0xf73bf8ba763a0fa9 );
+    sc->H[ 8] = _mm512_set1_epi64( 0x694ae34105e66901 );
+    sc->H[ 9] = _mm512_set1_epi64( 0x5ae66f2e8e8ab546 );
+    sc->H[10] = _mm512_set1_epi64( 0x243c84c1d0a74710 );
+    sc->H[11] = _mm512_set1_epi64( 0x99c15a2db1716e3b );
+    sc->H[12] = _mm512_set1_epi64( 0x56f8b19decf657cf );
+    sc->H[13] = _mm512_set1_epi64( 0x56b116577c8806a7 );
+    sc->H[14] = _mm512_set1_epi64( 0xfb1785e6dffcc2e3 );
+    sc->H[15] = _mm512_set1_epi64( 0x4bdd8ccc78465a54 );
    sc->ptr = 0;
    sc->block_count = 0;
 }
@@ -721,7 +733,7 @@ jh_8way_close( jh_8way_context *sc, unsigned ub, unsigned n, void *dst,
   size_t numz, u;
   uint64_t l0, l1;

-   buf[0] = m512_const1_64( 0x80ULL );
+   buf[0] = _mm512_set1_epi64( 0x80ULL );

   if ( sc->ptr == 0 )
       numz = 48;
@@ -772,22 +784,22 @@ jh512_8way_close(void *cc, void *dst)
 void jh256_4way_init( jh_4way_context *sc )
 {
    // bswapped IV256
-    sc->H[ 0] = m256_const1_64( 0xebd3202c41a398eb );
-    sc->H[ 1] = m256_const1_64( 0xc145b29c7bbecd92 );
-    sc->H[ 2] = m256_const1_64( 0xfac7d4609151931c );
-    sc->H[ 3] = m256_const1_64( 0x038a507ed6820026 );
-    sc->H[ 4] = m256_const1_64( 0x45b92677269e23a4 );
-    sc->H[ 5] = m256_const1_64( 0x77941ad4481afbe0 );
-    sc->H[ 6] = m256_const1_64( 0x7a176b0226abb5cd );
-    sc->H[ 7] = m256_const1_64( 0xa82fff0f4224f056 );
-    sc->H[ 8] = m256_const1_64( 0x754d2e7f8996a371 );
-    sc->H[ 9] = m256_const1_64( 0x62e27df70849141d );
-    sc->H[10] = m256_const1_64( 0x948f2476f7957627 );
-    sc->H[11] = m256_const1_64( 0x6c29804757b6d587 );
-    sc->H[12] = m256_const1_64( 0x6c0d8eac2d275e5c );
-    sc->H[13] = m256_const1_64( 0x0f7a0557c6508451 );
-    sc->H[14] = m256_const1_64( 0xea12247067d3e47b );
-    sc->H[15] = m256_const1_64( 0x69d71cd313abe389 );
+    sc->H[ 0] = _mm256_set1_epi64x( 0xebd3202c41a398eb );
+    sc->H[ 1] = _mm256_set1_epi64x( 0xc145b29c7bbecd92 );
+    sc->H[ 2] = _mm256_set1_epi64x( 0xfac7d4609151931c );
+    sc->H[ 3] = _mm256_set1_epi64x( 0x038a507ed6820026 );
+    sc->H[ 4] = _mm256_set1_epi64x( 0x45b92677269e23a4 );
+    sc->H[ 5] = _mm256_set1_epi64x( 0x77941ad4481afbe0 );
+    sc->H[ 6] = _mm256_set1_epi64x( 0x7a176b0226abb5cd );
+    sc->H[ 7] = _mm256_set1_epi64x( 0xa82fff0f4224f056 );
+    sc->H[ 8] = _mm256_set1_epi64x( 0x754d2e7f8996a371 );
+    sc->H[ 9] = _mm256_set1_epi64x( 0x62e27df70849141d );
+    sc->H[10] = _mm256_set1_epi64x( 0x948f2476f7957627 );
+    sc->H[11] = _mm256_set1_epi64x( 0x6c29804757b6d587 );
+    sc->H[12] = _mm256_set1_epi64x( 0x6c0d8eac2d275e5c );
+    sc->H[13] = _mm256_set1_epi64x( 0x0f7a0557c6508451 );
+    sc->H[14] = _mm256_set1_epi64x( 0xea12247067d3e47b );
+    sc->H[15] = _mm256_set1_epi64x( 0x69d71cd313abe389 );
    sc->ptr = 0;
    sc->block_count = 0;
 }
@@ -795,22 +807,22 @@ void jh256_4way_init( jh_4way_context *sc )
 void jh512_4way_init( jh_4way_context *sc )
 {
    // bswapped IV512
-    sc->H[ 0] = m256_const1_64( 0x17aa003e964bd16f );
-    sc->H[ 1] = m256_const1_64( 0x43d5157a052e6a63 );
-    sc->H[ 2] = m256_const1_64( 0x0bef970c8d5e228a );
-    sc->H[ 3] = m256_const1_64( 0x61c3b3f2591234e9 );
-    sc->H[ 4] = m256_const1_64( 0x1e806f53c1a01d89 );
-    sc->H[ 5] = m256_const1_64( 0x806d2bea6b05a92a );
-    sc->H[ 6] = m256_const1_64( 0xa6ba7520dbcc8e58 );
-    sc->H[ 7] = m256_const1_64( 0xf73bf8ba763a0fa9 );
-    sc->H[ 8] = m256_const1_64( 0x694ae34105e66901 );
-    sc->H[ 9] = m256_const1_64( 0x5ae66f2e8e8ab546 );
-    sc->H[10] = m256_const1_64( 0x243c84c1d0a74710 );
-    sc->H[11] = m256_const1_64( 0x99c15a2db1716e3b );
-    sc->H[12] = m256_const1_64( 0x56f8b19decf657cf );
-    sc->H[13] = m256_const1_64( 0x56b116577c8806a7 );
-    sc->H[14] = m256_const1_64( 0xfb1785e6dffcc2e3 );
-    sc->H[15] = m256_const1_64( 0x4bdd8ccc78465a54 );
+    sc->H[ 0] = _mm256_set1_epi64x( 0x17aa003e964bd16f );
+    sc->H[ 1] = _mm256_set1_epi64x( 0x43d5157a052e6a63 );
+    sc->H[ 2] = _mm256_set1_epi64x( 0x0bef970c8d5e228a );
+    sc->H[ 3] = _mm256_set1_epi64x( 0x61c3b3f2591234e9 );
+    sc->H[ 4] = _mm256_set1_epi64x( 0x1e806f53c1a01d89 );
+    sc->H[ 5] = _mm256_set1_epi64x( 0x806d2bea6b05a92a );
+    sc->H[ 6] = _mm256_set1_epi64x( 0xa6ba7520dbcc8e58 );
+    sc->H[ 7] = _mm256_set1_epi64x( 0xf73bf8ba763a0fa9 );
+    sc->H[ 8] = _mm256_set1_epi64x( 0x694ae34105e66901 );
+    sc->H[ 9] = _mm256_set1_epi64x( 0x5ae66f2e8e8ab546 );
+    sc->H[10] = _mm256_set1_epi64x( 0x243c84c1d0a74710 );
+    sc->H[11] = _mm256_set1_epi64x( 0x99c15a2db1716e3b );
+    sc->H[12] = _mm256_set1_epi64x( 0x56f8b19decf657cf );
+    sc->H[13] = _mm256_set1_epi64x( 0x56b116577c8806a7 );
+    sc->H[14] = _mm256_set1_epi64x( 0xfb1785e6dffcc2e3 );
+    sc->H[15] = _mm256_set1_epi64x( 0x4bdd8ccc78465a54 );
    sc->ptr = 0;
    sc->block_count = 0;
 }
@@ -869,7 +881,7 @@ jh_4way_close( jh_4way_context *sc, unsigned ub, unsigned n, void *dst,
   size_t numz, u;
   uint64_t l0, l1;

-   buf[0] = m256_const1_64( 0x80ULL );
+   buf[0] = _mm256_set1_epi64x( 0x80ULL );

   if ( sc->ptr == 0 )
       numz = 48;
--- a/algo/keccak/keccak-4way.c
+++ b/algo/keccak/keccak-4way.c
@@ -49,7 +49,7 @@ int scanhash_keccak_8way( struct work *work, uint32_t max_nonce,
          }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;

   } while ( (n < max_nonce-8) && !work_restart[thr_id].restart);
@@ -101,7 +101,7 @@ int scanhash_keccak_4way( struct work *work, uint32_t max_nonce,
          }
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( (n < max_nonce-4) && !work_restart[thr_id].restart);
   pdata[19] = n;
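
Note on the nonce increment above: the constant is broadcast as a 64-bit value but added with a 32-bit lane add, so only the upper 32 bits of each 64-bit lane change. The scalar sketch below models that behaviour for one lane. It assumes the nonce sits in the upper half of each 64-bit lane (this matches the constant, but the data layout is not shown in this diff), and the helper names `add_epi32_lane` and `bump_nonces_4way` are hypothetical, not part of cpuminer-opt.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of a 32-bit element add
   (as in _mm256_add_epi32): the two 32-bit halves are added
   independently, with no carry from the low half into the high half. */
static uint64_t add_epi32_lane( uint64_t a, uint64_t b )
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;
    uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 );
    return ( (uint64_t)hi << 32 ) | lo;
}

/* Advance a hypothetical 4-lane nonce vector by one pass.
   Assumption: each lane holds the nonce in its upper 32 bits. */
static void bump_nonces_4way( uint64_t lanes[4] )
{
    for ( int i = 0; i < 4; i++ )
        lanes[i] = add_epi32_lane( lanes[i], 0x0000000400000000ULL );
}
```

Under this model the low 32 bits of every lane are left untouched and a nonce near 0xffffffff wraps within its own 32-bit half.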
--- a/algo/keccak/keccak-hash-4way.c
+++ b/algo/keccak/keccak-hash-4way.c
@@ -180,15 +180,15 @@ static void keccak64_8way_close( keccak64_ctx_m512i *kc, void *dst,
    if ( kc->ptr == (lim - 8) )
    {
        const uint64_t t = eb | 0x8000000000000000;
-        u.tmp[0] = m512_const1_64( t );
+        u.tmp[0] = _mm512_set1_epi64( t );
        j = 8;
    }
    else
    {
        j = lim - kc->ptr;
-        u.tmp[0] = m512_const1_64( eb );
+        u.tmp[0] = _mm512_set1_epi64( eb );
        memset_zero_512( u.tmp + 1, (j>>3) - 2 );
-        u.tmp[ (j>>3) - 1] = m512_const1_64( 0x8000000000000000 );
+        u.tmp[ (j>>3) - 1] = _mm512_set1_epi64( 0x8000000000000000 );
    }
    keccak64_8way_core( kc, u.tmp, j, lim );
    /* Finalize the "lane complement" */
@@ -264,8 +264,8 @@ keccak512_8way_close(void *cc, void *dst)
 #define OR64(d, a, b)      (d = _mm256_or_si256(a,b))
 #define NOT64(d, s)        (d = mm256_not( s ) )
 #define ROL64(d, v, n)     (d = mm256_rol_64(v, n))
-#define XOROR(d, a, b, c)  (d = _mm256_xor_si256(a, _mm256_or_si256(b, c)))
-#define XORAND(d, a, b, c) (d = _mm256_xor_si256(a, _mm256_and_si256(b, c)))
+#define XOROR(d, a, b, c)  (d = mm256_xoror( a, b, c ) )
+#define XORAND(d, a, b, c) (d = mm256_xorand( a, b, c ) )
 #define XOR3( d, a, b, c ) (d = mm256_xor3( a, b, c ))

 #include "keccak-macros.c"
@@ -368,15 +368,15 @@ static void keccak64_close( keccak64_ctx_m256i *kc, void *dst, size_t byte_len,
    if ( kc->ptr == (lim - 8) )
    {
        const uint64_t t = eb | 0x8000000000000000;
-        u.tmp[0] = m256_const1_64( t );
+        u.tmp[0] = _mm256_set1_epi64x( t );
        j = 8;
    }
    else
    {
        j = lim - kc->ptr;
-        u.tmp[0] = m256_const1_64( eb );
+        u.tmp[0] = _mm256_set1_epi64x( eb );
        memset_zero_256( u.tmp + 1, (j>>3) - 2 );
-        u.tmp[ (j>>3) - 1] = m256_const1_64( 0x8000000000000000 );
+        u.tmp[ (j>>3) - 1] = _mm256_set1_epi64x( 0x8000000000000000 );
    }
    keccak64_core( kc, u.tmp, j, lim );
    /* Finalize the "lane complement" */
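
Note on the XOROR and XORAND change above: the removed lines define them as a ^ (b | c) and a ^ (b & c). A three-input boolean function like these can be expressed as a single AVX-512 ternary-logic operation given the right 8-bit truth table. Whether `mm256_xoror` and `mm256_xorand` are actually implemented that way is an assumption here, since their definitions are not part of this diff. The sketch below only shows how such an immediate can be derived in plain C; `ternlog_imm8` and the two scalar helpers are hypothetical names.

```c
#include <stdint.h>

/* Scalar reference versions of the two operations, taken from the
   removed macro bodies: a ^ (b | c) and a ^ (b & c). */
static uint8_t xoror( uint8_t a, uint8_t b, uint8_t c )
{
    return (uint8_t)( a ^ ( b | c ) );
}

static uint8_t xorand( uint8_t a, uint8_t b, uint8_t c )
{
    return (uint8_t)( a ^ ( b & c ) );
}

typedef uint8_t (*bool3_fn)( uint8_t, uint8_t, uint8_t );

/* Derive the 8-bit truth table of a bitwise three-input function.
   Bit i of the result is f evaluated at a = bit 2 of i, b = bit 1 of i,
   c = bit 0 of i. Feeding the column patterns 0xF0, 0xCC and 0xAA
   evaluates all eight input combinations at once. */
static uint8_t ternlog_imm8( bool3_fn f )
{
    return f( 0xF0, 0xCC, 0xAA );
}
```

With this helper, `ternlog_imm8( xoror )` evaluates to 0x1E and `ternlog_imm8( xorand )` to 0x78; complementing the latter gives 0x87, the truth table for a xnor (b & c).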
--- a/algo/keccak/sha3d-4way.c
+++ b/algo/keccak/sha3d-4way.c
@@ -56,7 +56,7 @@ int scanhash_sha3d_8way( struct work *work, uint32_t max_nonce,
          }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;

   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
@@ -115,7 +115,7 @@ int scanhash_sha3d_4way( struct work *work, uint32_t max_nonce,
          }
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
--- a/algo/luffa/luffa-hash-2way.c
+++ b/algo/luffa/luffa-hash-2way.c
@@ -60,7 +60,7 @@ static const uint32 CNS_INIT[128] __attribute((aligned(64))) = {

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

-#define cns4w(i)  m512_const1_128( ( (__m128i*)CNS_INIT)[i] )
+#define cns4w(i)  mm512_bcast_m128( ( (__m128i*)CNS_INIT)[i] )

 #define ADD_CONSTANT4W( a, b, c0, c1 ) \
    a = _mm512_xor_si512( a, c0 ); \
@@ -69,7 +69,7 @@ static const uint32 CNS_INIT[128] __attribute((aligned(64))) = {
 #define MULT24W( a0, a1 ) \
 { \
  __m512i b = _mm512_xor_si512( a0, \
-                     _mm512_maskz_shuffle_epi32( 0xbbbb, a1, 16 ) ); \
+                     _mm512_maskz_shuffle_epi32( 0xbbbb, a1, 0x10 ) ); \
  a0 = _mm512_alignr_epi8( a1,  b, 4 ); \
  a1 = _mm512_alignr_epi8(  b, a1, 4 ); \
 }
@@ -107,58 +107,45 @@ static const uint32 CNS_INIT[128] __attribute((aligned(64))) = {
    ADD_CONSTANT4W( x0, x4, c0, c1 );

 #define STEP_PART24W( a0, a1, t0, t1, c0, c1 ) \
-    a1 = _mm512_shuffle_epi32( a1, 147 ); \
-    t0 = _mm512_load_si512( &a1 ); \
-    a1 = _mm512_unpacklo_epi32( a1, a0 ); \
+    t0 = _mm512_shuffle_epi32( a1, 147 ); \
+    a1 = _mm512_unpacklo_epi32( t0, a0 ); \
    t0 = _mm512_unpackhi_epi32( t0, a0 ); \
    t1 = _mm512_shuffle_epi32( t0, 78 ); \
    a0 = _mm512_shuffle_epi32( a1, 78 ); \
    SUBCRUMB4W( t1, t0, a0, a1 ); \
    t0 = _mm512_unpacklo_epi32( t0, t1 ); \
    a1 = _mm512_unpacklo_epi32( a1, a0 ); \
-    a0 = _mm512_load_si512( &a1 ); \
-    a0 = _mm512_unpackhi_epi64( a0, t0 ); \
+    a0 = _mm512_unpackhi_epi64( a1, t0 ); \
    a1 = _mm512_unpacklo_epi64( a1, t0 ); \
    a1 = _mm512_shuffle_epi32( a1, 57 ); \
    MIXWORD4W( a0, a1 ); \
    ADD_CONSTANT4W( a0, a1, c0, c1 );

 #define NMLTOM10244W(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
-    s1 = _mm512_load_si512(&r3);\
-    q1 = _mm512_load_si512(&p3);\
-    s3 = _mm512_load_si512(&r3);\
-    q3 = _mm512_load_si512(&p3);\
-    s1 = _mm512_unpackhi_epi32(s1,r2);\
-    q1 = _mm512_unpackhi_epi32(q1,p2);\
-    s3 = _mm512_unpacklo_epi32(s3,r2);\
-    q3 = _mm512_unpacklo_epi32(q3,p2);\
-    s0 = _mm512_load_si512(&s1);\
-    q0 = _mm512_load_si512(&q1);\
-    s2 = _mm512_load_si512(&s3);\
-    q2 = _mm512_load_si512(&q3);\
-    r3 = _mm512_load_si512(&r1);\
-    p3 = _mm512_load_si512(&p1);\
-    r1 = _mm512_unpacklo_epi32(r1,r0);\
-    p1 = _mm512_unpacklo_epi32(p1,p0);\
-    r3 = _mm512_unpackhi_epi32(r3,r0);\
-    p3 = _mm512_unpackhi_epi32(p3,p0);\
-    s0 = _mm512_unpackhi_epi64(s0,r3);\
-    q0 = _mm512_unpackhi_epi64(q0,p3);\
-    s1 = _mm512_unpacklo_epi64(s1,r3);\
-    q1 = _mm512_unpacklo_epi64(q1,p3);\
-    s2 = _mm512_unpackhi_epi64(s2,r1);\
-    q2 = _mm512_unpackhi_epi64(q2,p1);\
-    s3 = _mm512_unpacklo_epi64(s3,r1);\
-    q3 = _mm512_unpacklo_epi64(q3,p1);
+    s1 = _mm512_unpackhi_epi32( r3, r2 ); \
+    q1 = _mm512_unpackhi_epi32( p3, p2 ); \
+    s3 = _mm512_unpacklo_epi32( r3, r2 ); \
+    q3 = _mm512_unpacklo_epi32( p3, p2 ); \
+    r3 = _mm512_unpackhi_epi32( r1, r0 ); \
+    r1 = _mm512_unpacklo_epi32( r1, r0 ); \
+    p3 = _mm512_unpackhi_epi32( p1, p0 ); \
+    p1 = _mm512_unpacklo_epi32( p1, p0 ); \
+    s0 = _mm512_unpackhi_epi64( s1, r3 ); \
+    q0 = _mm512_unpackhi_epi64( q1, p3 ); \
+    s1 = _mm512_unpacklo_epi64( s1, r3 ); \
+    q1 = _mm512_unpacklo_epi64( q1, p3 ); \
+    s2 = _mm512_unpackhi_epi64( s3, r1 ); \
+    q2 = _mm512_unpackhi_epi64( q3, p1 ); \
+    s3 = _mm512_unpacklo_epi64( s3, r1 ); \
+    q3 = _mm512_unpacklo_epi64( q3, p1 );

 #define MIXTON10244W(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
    NMLTOM10244W(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3);

-void rnd512_4way( luffa_4way_context *state, __m512i *msg )
+void rnd512_4way( luffa_4way_context *state, const __m512i *msg )
 {
    __m512i t0, t1;
    __m512i *chainv = state->chainv;
-    __m512i msg0, msg1;
    __m512i x0, x1, x2, x3, x4, x5, x6, x7;

    t0 = mm512_xor3( chainv[0], chainv[2], chainv[4] );
@@ -168,9 +155,6 @@ void rnd512_4way( luffa_4way_context *state, __m512i *msg )

    MULT24W( t0, t1 );

-    msg0 = _mm512_shuffle_epi32( msg[0], 27 );
-    msg1 = _mm512_shuffle_epi32( msg[1], 27 );
-
    chainv[0] = _mm512_xor_si512( chainv[0], t0 );
    chainv[1] = _mm512_xor_si512( chainv[1], t1 );
    chainv[2] = _mm512_xor_si512( chainv[2], t0 );
@@ -202,11 +186,8 @@ void rnd512_4way( luffa_4way_context *state, __m512i *msg )
    chainv[7] = _mm512_xor_si512(chainv[7], chainv[9]);

    MULT24W( chainv[8], chainv[9] );
-    chainv[8] = _mm512_xor_si512( chainv[8], t0 );
-    chainv[9] = _mm512_xor_si512( chainv[9], t1 );
-
-    t0 = chainv[8];
-    t1 = chainv[9];
+    t0 = chainv[8] = _mm512_xor_si512( chainv[8], t0 );
+    t1 = chainv[9] = _mm512_xor_si512( chainv[9], t1 );

    MULT24W( chainv[8], chainv[9] );
    chainv[8] = _mm512_xor_si512( chainv[8], chainv[6] );
@@ -225,27 +206,36 @@ void rnd512_4way( luffa_4way_context *state, __m512i *msg )
    chainv[3] = _mm512_xor_si512( chainv[3], chainv[1] );

    MULT24W( chainv[0], chainv[1] );
-    chainv[0] = mm512_xor3( chainv[0], t0, msg0 );
-    chainv[1] = mm512_xor3( chainv[1], t1, msg1 );
+    chainv[0] = _mm512_xor_si512( chainv[0], t0 );
+    chainv[1] = _mm512_xor_si512( chainv[1], t1 );

-    MULT24W( msg0, msg1 );
-    chainv[2] = _mm512_xor_si512( chainv[2], msg0 );
-    chainv[3] = _mm512_xor_si512( chainv[3], msg1 );
+    if ( msg )
+    {
+       __m512i msg0, msg1;

-    MULT24W( msg0, msg1 );
-    chainv[4] = _mm512_xor_si512( chainv[4], msg0 );
-    chainv[5] = _mm512_xor_si512( chainv[5], msg1 );
+       msg0 = _mm512_shuffle_epi32( msg[0], 27 );
+       msg1 = _mm512_shuffle_epi32( msg[1], 27 );

-    MULT24W( msg0, msg1 );
-    chainv[6] = _mm512_xor_si512( chainv[6], msg0 );
-    chainv[7] = _mm512_xor_si512( chainv[7], msg1 );
+       chainv[0] = _mm512_xor_si512( chainv[0], msg0 );
+       chainv[1] = _mm512_xor_si512( chainv[1], msg1 );

-    MULT24W( msg0, msg1);
-    chainv[8] = _mm512_xor_si512( chainv[8], msg0 );
-    chainv[9] = _mm512_xor_si512( chainv[9], msg1 );
+       MULT24W( msg0, msg1 );
+       chainv[2] = _mm512_xor_si512( chainv[2], msg0 );
+       chainv[3] = _mm512_xor_si512( chainv[3], msg1 );

-    MULT24W( msg0, msg1 );
+       MULT24W( msg0, msg1 );
+       chainv[4] = _mm512_xor_si512( chainv[4], msg0 );
+       chainv[5] = _mm512_xor_si512( chainv[5], msg1 );

+       MULT24W( msg0, msg1 );
+       chainv[6] = _mm512_xor_si512( chainv[6], msg0 );
+       chainv[7] = _mm512_xor_si512( chainv[7], msg1 );
+
+       MULT24W( msg0, msg1 );
+       chainv[8] = _mm512_xor_si512( chainv[8], msg0 );
+       chainv[9] = _mm512_xor_si512( chainv[9], msg1 );
+    }
+    
    chainv[3] = _mm512_rol_epi32( chainv[3], 1 );
    chainv[5] = _mm512_rol_epi32( chainv[5], 2 );
    chainv[7] = _mm512_rol_epi32( chainv[7], 3 );
@@ -282,16 +272,11 @@ void finalization512_4way( luffa_4way_context *state, uint32 *b )
    uint32_t hash[8*4] __attribute((aligned(128)));
    __m512i* chainv = state->chainv;
    __m512i t[2];
-    __m512i zero[2];
-    zero[0] = zero[1] = m512_zero;
-    const __m512i shuff_bswap32 = m512_const_64(
-                                  0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                  0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                  0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                  0x0c0d0e0f08090a0b, 0x0405060700010203 );
+    const __m512i shuff_bswap32 = mm512_bcast_m128( _mm_set_epi64x( 
+                                  0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

    /*---- blank round with m=0 ----*/
-    rnd512_4way( state, zero );
+    rnd512_4way( state, NULL );
    
    t[0] = mm512_xor3( chainv[0], chainv[2], chainv[4] );
    t[1] = mm512_xor3( chainv[1], chainv[3], chainv[5] );
@@ -300,37 +285,30 @@ void finalization512_4way( luffa_4way_context *state, uint32 *b )
    t[0] = _mm512_shuffle_epi32( t[0], 27 );
    t[1] = _mm512_shuffle_epi32( t[1], 27 );

-    _mm512_store_si512( (__m512i*)&hash[0], t[0] );
+    _mm512_store_si512( (__m512i*)&hash[ 0], t[0] );
    _mm512_store_si512( (__m512i*)&hash[16], t[1] );

-    casti_m512i( b, 0 ) = _mm512_shuffle_epi8(
-                                  casti_m512i( hash, 0 ), shuff_bswap32 );
-    casti_m512i( b, 1 ) = _mm512_shuffle_epi8(
-                                  casti_m512i( hash, 1 ), shuff_bswap32 );
+    casti_m512i( b,0 ) = _mm512_shuffle_epi8(
+                                  casti_m512i( hash,0 ), shuff_bswap32 );
+    casti_m512i( b,1 ) = _mm512_shuffle_epi8(
+                                  casti_m512i( hash,1 ), shuff_bswap32 );

-    rnd512_4way( state, zero );
-
-    t[0] = chainv[0];
-    t[1] = chainv[1];
-    t[0] = _mm512_xor_si512( t[0], chainv[2] );
-    t[1] = _mm512_xor_si512( t[1], chainv[3] );
-    t[0] = _mm512_xor_si512( t[0], chainv[4] );
-    t[1] = _mm512_xor_si512( t[1], chainv[5] );
-    t[0] = _mm512_xor_si512( t[0], chainv[6] );
-    t[1] = _mm512_xor_si512( t[1], chainv[7] );
-    t[0] = _mm512_xor_si512( t[0], chainv[8] );
-    t[1] = _mm512_xor_si512( t[1], chainv[9] );
+    rnd512_4way( state, NULL );

+    t[0] = mm512_xor3( chainv[0], chainv[2], chainv[4] );
+    t[1] = mm512_xor3( chainv[1], chainv[3], chainv[5] );
+    t[0] = mm512_xor3( t[0], chainv[6], chainv[8] );
+    t[1] = mm512_xor3( t[1], chainv[7], chainv[9] );
    t[0] = _mm512_shuffle_epi32( t[0], 27 );
    t[1] = _mm512_shuffle_epi32( t[1], 27 );

-    _mm512_store_si512( (__m512i*)&hash[0], t[0] );
+    _mm512_store_si512( (__m512i*)&hash[ 0], t[0] );
    _mm512_store_si512( (__m512i*)&hash[16], t[1] );

-    casti_m512i( b, 2 ) = _mm512_shuffle_epi8(
-                                  casti_m512i( hash, 0 ), shuff_bswap32 );
-    casti_m512i( b, 3 ) = _mm512_shuffle_epi8(
-                                  casti_m512i( hash, 1 ), shuff_bswap32 );
+    casti_m512i( b,2 ) = _mm512_shuffle_epi8(
+                                  casti_m512i( hash,0 ), shuff_bswap32 );
+    casti_m512i( b,3 ) = _mm512_shuffle_epi8(
+                                  casti_m512i( hash,1 ), shuff_bswap32 );
 }

 int luffa_4way_init( luffa_4way_context *state, int hashbitlen )
@@ -338,16 +316,16 @@ int luffa_4way_init( luffa_4way_context *state, int hashbitlen )
    state->hashbitlen = hashbitlen;
    __m128i *iv = (__m128i*)IV;

-    state->chainv[0] = m512_const1_128( iv[0] );
-    state->chainv[1] = m512_const1_128( iv[1] );
-    state->chainv[2] = m512_const1_128( iv[2] );
-    state->chainv[3] = m512_const1_128( iv[3] );
-    state->chainv[4] = m512_const1_128( iv[4] );
-    state->chainv[5] = m512_const1_128( iv[5] );
-    state->chainv[6] = m512_const1_128( iv[6] );
-    state->chainv[7] = m512_const1_128( iv[7] );
-    state->chainv[8] = m512_const1_128( iv[8] );
-    state->chainv[9] = m512_const1_128( iv[9] );
+    state->chainv[0] = mm512_bcast_m128( iv[0] );
+    state->chainv[1] = mm512_bcast_m128( iv[1] );
+    state->chainv[2] = mm512_bcast_m128( iv[2] );
+    state->chainv[3] = mm512_bcast_m128( iv[3] );
+    state->chainv[4] = mm512_bcast_m128( iv[4] );
+    state->chainv[5] = mm512_bcast_m128( iv[5] );
+    state->chainv[6] = mm512_bcast_m128( iv[6] );
+    state->chainv[7] = mm512_bcast_m128( iv[7] );
+    state->chainv[8] = mm512_bcast_m128( iv[8] );
+    state->chainv[9] = mm512_bcast_m128( iv[9] );

    ((__m512i*)state->buffer)[0] = m512_zero;
    ((__m512i*)state->buffer)[1] = m512_zero;
@@ -370,11 +348,8 @@ int luffa_4way_update( luffa_4way_context *state, const void *data,
    __m512i msg[2];
    int i;
    int blocks = (int)len >> 5;
-    const __m512i shuff_bswap32 = m512_const_64( 
-                                   0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                   0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                   0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                   0x0c0d0e0f08090a0b, 0x0405060700010203 );
+    const __m512i shuff_bswap32 = mm512_bcast_m128( _mm_set_epi64x(  
+                                   0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

    state->rembytes = (int)len & 0x1F;

@@ -392,7 +367,7 @@ int luffa_4way_update( luffa_4way_context *state, const void *data,
    {
      // remaining data bytes
      buffer[0] = _mm512_shuffle_epi8( vdata[0], shuff_bswap32 );
-      buffer[1] = m512_const1_i128(  0x0000000080000000 );
+      buffer[1] = mm512_bcast128lo_64( 0x0000000080000000 );
    }
    return 0;
 }
@@ -416,7 +391,7 @@ int luffa_4way_close( luffa_4way_context *state, void *hashval )
      rnd512_4way( state, buffer );
    else
    {     // empty pad block, constant data
-      msg[0] = m512_const1_i128(  0x0000000080000000 );
+      msg[0] = mm512_bcast128lo_64( 0x0000000080000000 );
      msg[1] = m512_zero;
      rnd512_4way( state, msg );
    }
@@ -440,16 +415,16 @@ int luffa512_4way_full( luffa_4way_context *state, void *output,
    state->hashbitlen = 512;
    __m128i *iv = (__m128i*)IV;

-    state->chainv[0] = m512_const1_128( iv[0] );
-    state->chainv[1] = m512_const1_128( iv[1] );
-    state->chainv[2] = m512_const1_128( iv[2] );
-    state->chainv[3] = m512_const1_128( iv[3] );
-    state->chainv[4] = m512_const1_128( iv[4] );
-    state->chainv[5] = m512_const1_128( iv[5] );
-    state->chainv[6] = m512_const1_128( iv[6] );
-    state->chainv[7] = m512_const1_128( iv[7] );
-    state->chainv[8] = m512_const1_128( iv[8] );
-    state->chainv[9] = m512_const1_128( iv[9] );
+    state->chainv[0] = mm512_bcast_m128( iv[0] );
+    state->chainv[1] = mm512_bcast_m128( iv[1] );
+    state->chainv[2] = mm512_bcast_m128( iv[2] );
+    state->chainv[3] = mm512_bcast_m128( iv[3] );
+    state->chainv[4] = mm512_bcast_m128( iv[4] );
+    state->chainv[5] = mm512_bcast_m128( iv[5] );
+    state->chainv[6] = mm512_bcast_m128( iv[6] );
+    state->chainv[7] = mm512_bcast_m128( iv[7] );
+    state->chainv[8] = mm512_bcast_m128( iv[8] );
+    state->chainv[9] = mm512_bcast_m128( iv[9] );

    ((__m512i*)state->buffer)[0] = m512_zero;
    ((__m512i*)state->buffer)[1] = m512_zero;
@@ -458,11 +433,8 @@ int luffa512_4way_full( luffa_4way_context *state, void *output,
    __m512i msg[2];
    int i;
    const int blocks = (int)( inlen >> 5 );
-    const __m512i shuff_bswap32 = m512_const_64(
-                                   0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                   0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                   0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                   0x0c0d0e0f08090a0b, 0x0405060700010203 );
+    const __m512i shuff_bswap32 = mm512_bcast_m128( _mm_set_epi64x( 
+                                   0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

    state->rembytes = inlen & 0x1F;

@@ -479,13 +451,13 @@ int luffa512_4way_full( luffa_4way_context *state, void *output,
    {
       // padding of partial block
       msg[0] = _mm512_shuffle_epi8( vdata[ 0 ], shuff_bswap32 );
-       msg[1] = m512_const1_i128(  0x0000000080000000 );
+       msg[1] = mm512_bcast128lo_64( 0x0000000080000000 );
       rnd512_4way( state, msg );
    }
    else
    {
       // empty pad block
-       msg[0] = m512_const1_i128( 0x0000000080000000 );
+       msg[0] = mm512_bcast128lo_64( 0x0000000080000000 );
       msg[1] = m512_zero;
       rnd512_4way( state, msg );
    }
@@ -506,11 +478,8 @@ int luffa_4way_update_close( luffa_4way_context *state,
    __m512i msg[2];
    int i;
    const int blocks = (int)( inlen >> 5 );
-    const __m512i shuff_bswap32 = m512_const_64(
-                                   0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                   0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                   0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                   0x0c0d0e0f08090a0b, 0x0405060700010203 );
+    const __m512i shuff_bswap32 = mm512_bcast_m128( _mm_set_epi64x( 
+                                   0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

    state->rembytes = inlen & 0x1F;

@@ -527,13 +496,13 @@ int luffa_4way_update_close( luffa_4way_context *state,
    {
       // padding of partial block
       msg[0] = _mm512_shuffle_epi8( vdata[ 0 ], shuff_bswap32 );
-       msg[1] = m512_const1_i128( 0x0000000080000000 );
+       msg[1] = mm512_bcast128lo_64( 0x0000000080000000 );
       rnd512_4way( state, msg );
    }
    else
    {
       // empty pad block
-       msg[0] = m512_const1_i128( 0x0000000080000000 );
+       msg[0] = mm512_bcast128lo_64( 0x0000000080000000 );
       msg[1] = m512_zero;
       rnd512_4way( state, msg );
    }
@@ -548,26 +517,45 @@ int luffa_4way_update_close( luffa_4way_context *state,

 #endif // AVX512

-#define cns(i)  m256_const1_128( ( (__m128i*)CNS_INIT)[i] )
+#define cns(i)  mm256_bcast_m128( ( (__m128i*)CNS_INIT)[i] )

 #define ADD_CONSTANT( a, b, c0, c1 ) \
    a = _mm256_xor_si256( a, c0 ); \
    b = _mm256_xor_si256( b, c1 );

-/*
-#define MULT2( a0, a1, mask ) \
-do { \
-  __m256i b = _mm256_xor_si256( a0, \
-                   _mm256_shuffle_epi32( _mm256_and_si256(a1,mask), 16 ) ); \
-  a0 = _mm256_or_si256( _mm256_srli_si256(b,4), _mm256_slli_si256(a1,12) ); \
-  a1 = _mm256_or_si256( _mm256_srli_si256(a1,4), _mm256_slli_si256(b,12) );  \
-} while(0)
-*/
+// TODO: Enable for AVX10_256; not used with AVX512 or AVX10_512.
+#if defined(__AVX512VL__) 

-#define MULT2( a0, a1, mask ) \
+#define MULT2( a0, a1 ) \
 { \
  __m256i b = _mm256_xor_si256( a0, \
-                 _mm256_shuffle_epi32( _mm256_and_si256( a1, mask ), 16 ) ); \
+                     _mm256_maskz_shuffle_epi32( 0xbb, a1, 0x10 ) ); \
+  a0 = _mm256_alignr_epi8( a1,  b, 4 ); \
+  a1 = _mm256_alignr_epi8(  b, a1, 4 ); \
+}
+
+#define SUBCRUMB( a0, a1, a2, a3 ) \
+{ \
+    __m256i t = a0; \
+    a0 = mm256_xoror( a3, a0, a1 ); \
+    a2 = _mm256_xor_si256( a2, a3 ); \
+    a1 = _mm256_ternarylogic_epi64( a1, a3, t, 0x87 ); /* a1 xnor (a3 & t) */ \
+    a3 = mm256_xorand( a2, a3, t ); \
+    a2 = mm256_xorand( a1, a2, a0 ); \
+    a1 = _mm256_or_si256( a1, a3 ); \
+    a3 = _mm256_xor_si256( a3, a2 ); \
+    t  = _mm256_xor_si256( t, a1 ); \
+    a2 = _mm256_and_si256( a2, a1 ); \
+    a1 = mm256_xnor( a1, a0 ); \
+    a0 = t; \
+}
+
+#else
+
+#define MULT2( a0, a1 ) \
+{ \
+  __m256i b = _mm256_xor_si256( a0, _mm256_shuffle_epi32( \
+                         _mm256_blend_epi32( a1, m256_zero, 0xee ), 0x10 ) ); \
  a0 = _mm256_alignr_epi8( a1,  b, 4 ); \
  a1 = _mm256_alignr_epi8(  b, a1, 4 ); \
 }
@@ -593,26 +581,14 @@ do { \
    a0 = t; \
 }

+#endif
+
 #define MIXWORD( a, b ) \
-{ \
-    __m256i t1, t2; \
-    b  = _mm256_xor_si256( a,b ); \
-    t1 = _mm256_slli_epi32( a,  2 ); \
-    t2 = _mm256_srli_epi32( a, 30 ); \
-    a  = _mm256_or_si256( t1, t2 ); \
-    a  = _mm256_xor_si256( a, b ); \
-    t1 = _mm256_slli_epi32( b, 14 ); \
-    t2 = _mm256_srli_epi32( b, 18 ); \
-    b  = _mm256_or_si256( t1, t2 ); \
-    b  = _mm256_xor_si256( a, b ); \
-    t1 = _mm256_slli_epi32( a, 10 ); \
-    t2 = _mm256_srli_epi32( a, 22 ); \
-    a  = _mm256_or_si256( t1,t2 ); \
-    a  = _mm256_xor_si256( a,b ); \
-    t1 = _mm256_slli_epi32( b,1 ); \
-    t2 = _mm256_srli_epi32( b,31 ); \
-    b  = _mm256_or_si256( t1, t2 ); \
-}
+    b = _mm256_xor_si256( a, b ); \
+    a = _mm256_xor_si256( b, mm256_rol_32( a,  2 ) ); \
+    b = _mm256_xor_si256( a, mm256_rol_32( b, 14 ) ); \
+    a = _mm256_xor_si256( b, mm256_rol_32( a, 10 ) ); \
+    b = mm256_rol_32( b, 1 );

 #define STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, c0, c1 ) \
    SUBCRUMB( x0, x1, x2, x3 ); \
@@ -624,49 +600,37 @@ do { \
    ADD_CONSTANT( x0, x4, c0, c1 );

 #define STEP_PART2( a0, a1, t0, t1, c0, c1 ) \
-    a1 = _mm256_shuffle_epi32( a1, 147); \
-    t0 = _mm256_load_si256( &a1 ); \
-    a1 = _mm256_unpacklo_epi32( a1, a0 ); \
+    t0 = _mm256_shuffle_epi32( a1, 147 ); \
+    a1 = _mm256_unpacklo_epi32( t0, a0 ); \
    t0 = _mm256_unpackhi_epi32( t0, a0 ); \
    t1 = _mm256_shuffle_epi32( t0, 78 ); \
    a0 = _mm256_shuffle_epi32( a1, 78 ); \
-    SUBCRUMB( t1, t0, a0, a1 );\
+    SUBCRUMB( t1, t0, a0, a1 ); \
    t0 = _mm256_unpacklo_epi32( t0, t1 ); \
    a1 = _mm256_unpacklo_epi32( a1, a0 ); \
-    a0 = _mm256_load_si256( &a1 ); \
-    a0 = _mm256_unpackhi_epi64( a0, t0 ); \
+    a0 = _mm256_unpackhi_epi64( a1, t0 ); \
    a1 = _mm256_unpacklo_epi64( a1, t0 ); \
    a1 = _mm256_shuffle_epi32( a1, 57 ); \
    MIXWORD( a0, a1 ); \
    ADD_CONSTANT( a0, a1, c0, c1 );

 #define NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
-    s1 = _mm256_load_si256(&r3);\
-    q1 = _mm256_load_si256(&p3);\
-    s3 = _mm256_load_si256(&r3);\
-    q3 = _mm256_load_si256(&p3);\
-    s1 = _mm256_unpackhi_epi32(s1,r2);\
-    q1 = _mm256_unpackhi_epi32(q1,p2);\
-    s3 = _mm256_unpacklo_epi32(s3,r2);\
-    q3 = _mm256_unpacklo_epi32(q3,p2);\
-    s0 = _mm256_load_si256(&s1);\
-    q0 = _mm256_load_si256(&q1);\
-    s2 = _mm256_load_si256(&s3);\
-    q2 = _mm256_load_si256(&q3);\
-    r3 = _mm256_load_si256(&r1);\
-    p3 = _mm256_load_si256(&p1);\
-    r1 = _mm256_unpacklo_epi32(r1,r0);\
-    p1 = _mm256_unpacklo_epi32(p1,p0);\
-    r3 = _mm256_unpackhi_epi32(r3,r0);\
-    p3 = _mm256_unpackhi_epi32(p3,p0);\
-    s0 = _mm256_unpackhi_epi64(s0,r3);\
-    q0 = _mm256_unpackhi_epi64(q0,p3);\
-    s1 = _mm256_unpacklo_epi64(s1,r3);\
-    q1 = _mm256_unpacklo_epi64(q1,p3);\
-    s2 = _mm256_unpackhi_epi64(s2,r1);\
-    q2 = _mm256_unpackhi_epi64(q2,p1);\
-    s3 = _mm256_unpacklo_epi64(s3,r1);\
-    q3 = _mm256_unpacklo_epi64(q3,p1);
+    s1 = _mm256_unpackhi_epi32( r3, r2 ); \
+    q1 = _mm256_unpackhi_epi32( p3, p2 ); \
+    s3 = _mm256_unpacklo_epi32( r3, r2 ); \
+    q3 = _mm256_unpacklo_epi32( p3, p2 ); \
+    r3 = _mm256_unpackhi_epi32( r1, r0 ); \
+    r1 = _mm256_unpacklo_epi32( r1, r0 ); \
+    p3 = _mm256_unpackhi_epi32( p1, p0 ); \
+    p1 = _mm256_unpacklo_epi32( p1, p0 ); \
+    s0 = _mm256_unpackhi_epi64( s1, r3 ); \
+    q0 = _mm256_unpackhi_epi64( q1, p3 ); \
+    s1 = _mm256_unpacklo_epi64( s1, r3 ); \
+    q1 = _mm256_unpacklo_epi64( q1, p3 ); \
+    s2 = _mm256_unpackhi_epi64( s3, r1 ); \
+    q2 = _mm256_unpackhi_epi64( q3, p1 ); \
+    s3 = _mm256_unpacklo_epi64( s3, r1 ); \
+    q3 = _mm256_unpacklo_epi64( q3, p1 );

 #define MIXTON1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
    NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3);
@@ -676,30 +640,18 @@ do { \
 /* Round function         */
 /* state: hash context    */

-void rnd512_2way( luffa_2way_context *state, __m256i *msg )
+void rnd512_2way( luffa_2way_context *state, const __m256i *msg )
 {
    __m256i t0, t1;
    __m256i *chainv = state->chainv;
-    __m256i msg0, msg1;
    __m256i x0, x1, x2, x3, x4, x5, x6, x7;
-    const __m256i MASK = m256_const1_i128( 0xffffffff );

-    t0 = chainv[0];
-    t1 = chainv[1];
+    t0 = mm256_xor3( chainv[0], chainv[2], chainv[4] );
+    t1 = mm256_xor3( chainv[1], chainv[3], chainv[5] );
+    t0 = mm256_xor3( t0, chainv[6], chainv[8] );
+    t1 = mm256_xor3( t1, chainv[7], chainv[9] );

-    t0 = _mm256_xor_si256( t0, chainv[2] );
-    t1 = _mm256_xor_si256( t1, chainv[3] );
-    t0 = _mm256_xor_si256( t0, chainv[4] );
-    t1 = _mm256_xor_si256( t1, chainv[5] );
-    t0 = _mm256_xor_si256( t0, chainv[6] );
-    t1 = _mm256_xor_si256( t1, chainv[7] );
-    t0 = _mm256_xor_si256( t0, chainv[8] );
-    t1 = _mm256_xor_si256( t1, chainv[9] );
-
-    MULT2( t0, t1, MASK );
-
-    msg0 = _mm256_shuffle_epi32( msg[0], 27 );
-    msg1 = _mm256_shuffle_epi32( msg[1], 27 );
+    MULT2( t0, t1 );

    chainv[0] = _mm256_xor_si256( chainv[0], t0 );
    chainv[1] = _mm256_xor_si256( chainv[1], t1 );
@@ -715,66 +667,72 @@ void rnd512_2way( luffa_2way_context *state, __m256i *msg )
    t0 = chainv[0];
    t1 = chainv[1];

-    MULT2( chainv[0], chainv[1], MASK );
+    MULT2( chainv[0], chainv[1] );
    chainv[0] = _mm256_xor_si256( chainv[0], chainv[2] );
    chainv[1] = _mm256_xor_si256( chainv[1], chainv[3] );

-    MULT2( chainv[2], chainv[3], MASK );
+    MULT2( chainv[2], chainv[3] );
    chainv[2] = _mm256_xor_si256(chainv[2], chainv[4]);
    chainv[3] = _mm256_xor_si256(chainv[3], chainv[5]);

-    MULT2( chainv[4], chainv[5], MASK );
+    MULT2( chainv[4], chainv[5] );
    chainv[4] = _mm256_xor_si256(chainv[4], chainv[6]);
    chainv[5] = _mm256_xor_si256(chainv[5], chainv[7]);

-    MULT2( chainv[6], chainv[7], MASK );
+    MULT2( chainv[6], chainv[7] );
    chainv[6] = _mm256_xor_si256(chainv[6], chainv[8]);
    chainv[7] = _mm256_xor_si256(chainv[7], chainv[9]);

-    MULT2( chainv[8], chainv[9], MASK );
-    chainv[8] = _mm256_xor_si256( chainv[8], t0 );
-    chainv[9] = _mm256_xor_si256( chainv[9], t1 );
+    MULT2( chainv[8], chainv[9] );
+    t0 = chainv[8] = _mm256_xor_si256( chainv[8], t0 );
+    t1 = chainv[9] = _mm256_xor_si256( chainv[9], t1 );

-    t0 = chainv[8];
-    t1 = chainv[9];
-
-    MULT2( chainv[8], chainv[9], MASK );
+    MULT2( chainv[8], chainv[9] );
    chainv[8] = _mm256_xor_si256( chainv[8], chainv[6] );
    chainv[9] = _mm256_xor_si256( chainv[9], chainv[7] );

-    MULT2( chainv[6], chainv[7], MASK );
+    MULT2( chainv[6], chainv[7] );
    chainv[6] = _mm256_xor_si256( chainv[6], chainv[4] );
    chainv[7] = _mm256_xor_si256( chainv[7], chainv[5] );

-    MULT2( chainv[4], chainv[5], MASK );
+    MULT2( chainv[4], chainv[5] );
    chainv[4] = _mm256_xor_si256( chainv[4], chainv[2] );
    chainv[5] = _mm256_xor_si256( chainv[5], chainv[3] );

-    MULT2( chainv[2], chainv[3], MASK );
+    MULT2( chainv[2], chainv[3] );
    chainv[2] = _mm256_xor_si256( chainv[2], chainv[0] );
    chainv[3] = _mm256_xor_si256( chainv[3], chainv[1] );

-    MULT2( chainv[0], chainv[1], MASK );
-    chainv[0] = _mm256_xor_si256( _mm256_xor_si256( chainv[0], t0 ), msg0 );
-    chainv[1] = _mm256_xor_si256( _mm256_xor_si256( chainv[1], t1 ), msg1 );
+    MULT2( chainv[0], chainv[1] );
+    chainv[0] = _mm256_xor_si256( chainv[0], t0 );
+    chainv[1] = _mm256_xor_si256( chainv[1], t1 );

-    MULT2( msg0, msg1, MASK );
-    chainv[2] = _mm256_xor_si256( chainv[2], msg0 );
-    chainv[3] = _mm256_xor_si256( chainv[3], msg1 );
+    if ( msg )
+    {
+       __m256i msg0, msg1;
+    
+       msg0 = _mm256_shuffle_epi32( msg[0], 27 );
+       msg1 = _mm256_shuffle_epi32( msg[1], 27 );

-    MULT2( msg0, msg1, MASK );
-    chainv[4] = _mm256_xor_si256( chainv[4], msg0 );
-    chainv[5] = _mm256_xor_si256( chainv[5], msg1 );
+       chainv[0] = _mm256_xor_si256( chainv[0], msg0 );
+       chainv[1] = _mm256_xor_si256( chainv[1], msg1 );
+    
+       MULT2( msg0, msg1 );
+       chainv[2] = _mm256_xor_si256( chainv[2], msg0 );
+       chainv[3] = _mm256_xor_si256( chainv[3], msg1 );

-    MULT2( msg0, msg1, MASK );
-    chainv[6] = _mm256_xor_si256( chainv[6], msg0 );
-    chainv[7] = _mm256_xor_si256( chainv[7], msg1 );
+       MULT2( msg0, msg1 );
+       chainv[4] = _mm256_xor_si256( chainv[4], msg0 );
+       chainv[5] = _mm256_xor_si256( chainv[5], msg1 );

-    MULT2( msg0, msg1, MASK );
-    chainv[8] = _mm256_xor_si256( chainv[8], msg0 );
-    chainv[9] = _mm256_xor_si256( chainv[9], msg1 );
+       MULT2( msg0, msg1 );
+       chainv[6] = _mm256_xor_si256( chainv[6], msg0 );
+       chainv[7] = _mm256_xor_si256( chainv[7], msg1 );

-    MULT2( msg0, msg1, MASK );
+       MULT2( msg0, msg1 );
+       chainv[8] = _mm256_xor_si256( chainv[8], msg0 );
+       chainv[9] = _mm256_xor_si256( chainv[9], msg1 );
+    }

    chainv[3] = mm256_rol_32( chainv[3], 1 );
    chainv[5] = mm256_rol_32( chainv[5], 2 );
@@ -816,57 +774,40 @@ void finalization512_2way( luffa_2way_context *state, uint32 *b )
 {
    uint32 hash[8*2] __attribute((aligned(64)));
    __m256i* chainv = state->chainv;
-    __m256i t[2];
-    __m256i zero[2];
-    zero[0] = zero[1] = m256_zero;
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
+    __m256i t0, t1;
+    const __m256i shuff_bswap32 = mm256_set2_64( 0x0c0d0e0f08090a0b,
                                                 0x0405060700010203 );
    /*---- blank round with m=0 ----*/
-    rnd512_2way( state, zero );
+    rnd512_2way( state, NULL );

-    t[0] = chainv[0];
-    t[1] = chainv[1];
+    t0 = mm256_xor3( chainv[0], chainv[2], chainv[4] );
+    t1 = mm256_xor3( chainv[1], chainv[3], chainv[5] );
+    t0 = mm256_xor3( t0, chainv[6], chainv[8] );
+    t1 = mm256_xor3( t1, chainv[7], chainv[9] );

-    t[0] = _mm256_xor_si256( t[0], chainv[2] );
-    t[1] = _mm256_xor_si256( t[1], chainv[3] );
-    t[0] = _mm256_xor_si256( t[0], chainv[4] );
-    t[1] = _mm256_xor_si256( t[1], chainv[5] );
-    t[0] = _mm256_xor_si256( t[0], chainv[6] );
-    t[1] = _mm256_xor_si256( t[1], chainv[7] );
-    t[0] = _mm256_xor_si256( t[0], chainv[8] );
-    t[1] = _mm256_xor_si256( t[1], chainv[9] );
+    t0 = _mm256_shuffle_epi32( t0, 27 );
+    t1 = _mm256_shuffle_epi32( t1, 27 );

-    t[0] = _mm256_shuffle_epi32( t[0], 27 );
-    t[1] = _mm256_shuffle_epi32( t[1], 27 );
-
-    _mm256_store_si256( (__m256i*)&hash[0], t[0] );
-    _mm256_store_si256( (__m256i*)&hash[8], t[1] );
+    _mm256_store_si256( (__m256i*)&hash[0], t0 );
+    _mm256_store_si256( (__m256i*)&hash[8], t1 );

    casti_m256i( b, 0 ) = _mm256_shuffle_epi8(
                                  casti_m256i( hash, 0 ), shuff_bswap32 );
    casti_m256i( b, 1 ) = _mm256_shuffle_epi8( 
                                  casti_m256i( hash, 1 ), shuff_bswap32 );

-    rnd512_2way( state, zero );
+    rnd512_2way( state, NULL );

-    t[0] = chainv[0];
-    t[1] = chainv[1];
-    t[0] = _mm256_xor_si256( t[0], chainv[2] );
-    t[1] = _mm256_xor_si256( t[1], chainv[3] );
-    t[0] = _mm256_xor_si256( t[0], chainv[4] );
-    t[1] = _mm256_xor_si256( t[1], chainv[5] );
-    t[0] = _mm256_xor_si256( t[0], chainv[6] );
-    t[1] = _mm256_xor_si256( t[1], chainv[7] );
-    t[0] = _mm256_xor_si256( t[0], chainv[8] );
-    t[1] = _mm256_xor_si256( t[1], chainv[9] );
+    t0 = mm256_xor3( chainv[0], chainv[2], chainv[4] );
+    t1 = mm256_xor3( chainv[1], chainv[3], chainv[5] );
+    t0 = mm256_xor3( t0, chainv[6], chainv[8] );
+    t1 = mm256_xor3( t1, chainv[7], chainv[9] );
+    
+    t0 = _mm256_shuffle_epi32( t0, 27 );
+    t1 = _mm256_shuffle_epi32( t1, 27 );

-    t[0] = _mm256_shuffle_epi32( t[0], 27 );
-    t[1] = _mm256_shuffle_epi32( t[1], 27 );
-
-    _mm256_store_si256( (__m256i*)&hash[0], t[0] );
-    _mm256_store_si256( (__m256i*)&hash[8], t[1] );
+    _mm256_store_si256( (__m256i*)&hash[0], t0 );
+    _mm256_store_si256( (__m256i*)&hash[8], t1 );

    casti_m256i( b, 2 ) = _mm256_shuffle_epi8( 
                                  casti_m256i( hash, 0 ), shuff_bswap32 );
@@ -879,16 +820,16 @@ int luffa_2way_init( luffa_2way_context *state, int hashbitlen )
    state->hashbitlen = hashbitlen;
    __m128i *iv = (__m128i*)IV;
    
-    state->chainv[0] = m256_const1_128( iv[0] );
-    state->chainv[1] = m256_const1_128( iv[1] );
-    state->chainv[2] = m256_const1_128( iv[2] );
-    state->chainv[3] = m256_const1_128( iv[3] );
-    state->chainv[4] = m256_const1_128( iv[4] );
-    state->chainv[5] = m256_const1_128( iv[5] );
-    state->chainv[6] = m256_const1_128( iv[6] );
-    state->chainv[7] = m256_const1_128( iv[7] );
-    state->chainv[8] = m256_const1_128( iv[8] );
-    state->chainv[9] = m256_const1_128( iv[9] );
+    state->chainv[0] = mm256_bcast_m128( iv[0] );
+    state->chainv[1] = mm256_bcast_m128( iv[1] );
+    state->chainv[2] = mm256_bcast_m128( iv[2] );
+    state->chainv[3] = mm256_bcast_m128( iv[3] );
+    state->chainv[4] = mm256_bcast_m128( iv[4] );
+    state->chainv[5] = mm256_bcast_m128( iv[5] );
+    state->chainv[6] = mm256_bcast_m128( iv[6] );
+    state->chainv[7] = mm256_bcast_m128( iv[7] );
+    state->chainv[8] = mm256_bcast_m128( iv[8] );
+    state->chainv[9] = mm256_bcast_m128( iv[9] );

    ((__m256i*)state->buffer)[0] = m256_zero;
    ((__m256i*)state->buffer)[1] = m256_zero;
@@ -906,9 +847,7 @@ int luffa_2way_update( luffa_2way_context *state, const void *data,
    __m256i msg[2];
    int i;
    int blocks = (int)len >> 5;
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
+    const __m256i shuff_bswap32 = mm256_set2_64( 0x0c0d0e0f08090a0b,
                                                 0x0405060700010203 );
    state-> rembytes = (int)len & 0x1F;

@@ -926,7 +865,7 @@ int luffa_2way_update( luffa_2way_context *state, const void *data,
    {
      // remaining data bytes
      buffer[0] = _mm256_shuffle_epi8( vdata[0], shuff_bswap32 );
-      buffer[1] = m256_const1_i128( 0x0000000080000000 );
+      buffer[1] = mm256_bcast128lo_64( 0x0000000080000000 );
    }
    return 0;
 }
@@ -942,7 +881,7 @@ int luffa_2way_close( luffa_2way_context *state, void *hashval )
      rnd512_2way( state, buffer );
    else
    {     // empty pad block, constant data
-      msg[0] = m256_const1_i128( 0x0000000080000000 );
+      msg[0] = mm256_bcast128lo_64( 0x0000000080000000 );
      msg[1] = m256_zero;
      rnd512_2way( state, msg );
    }
@@ -959,16 +898,16 @@ int luffa512_2way_full( luffa_2way_context *state, void *output,
    state->hashbitlen = 512;
    __m128i *iv = (__m128i*)IV;

-    state->chainv[0] = m256_const1_128( iv[0] );
-    state->chainv[1] = m256_const1_128( iv[1] );
-    state->chainv[2] = m256_const1_128( iv[2] );
-    state->chainv[3] = m256_const1_128( iv[3] );
-    state->chainv[4] = m256_const1_128( iv[4] );
-    state->chainv[5] = m256_const1_128( iv[5] );
-    state->chainv[6] = m256_const1_128( iv[6] );
-    state->chainv[7] = m256_const1_128( iv[7] );
-    state->chainv[8] = m256_const1_128( iv[8] );
-    state->chainv[9] = m256_const1_128( iv[9] );
+    state->chainv[0] = mm256_bcast_m128( iv[0] );
+    state->chainv[1] = mm256_bcast_m128( iv[1] );
+    state->chainv[2] = mm256_bcast_m128( iv[2] );
+    state->chainv[3] = mm256_bcast_m128( iv[3] );
+    state->chainv[4] = mm256_bcast_m128( iv[4] );
+    state->chainv[5] = mm256_bcast_m128( iv[5] );
+    state->chainv[6] = mm256_bcast_m128( iv[6] );
+    state->chainv[7] = mm256_bcast_m128( iv[7] );
+    state->chainv[8] = mm256_bcast_m128( iv[8] );
+    state->chainv[9] = mm256_bcast_m128( iv[9] );

    ((__m256i*)state->buffer)[0] = m256_zero;
    ((__m256i*)state->buffer)[1] = m256_zero;
@@ -977,9 +916,7 @@ int luffa512_2way_full( luffa_2way_context *state, void *output,
    __m256i msg[2];
    int i;
    const int blocks = (int)( inlen >> 5 );
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
+    const __m256i shuff_bswap32 = mm256_set2_64( 0x0c0d0e0f08090a0b,
                                                 0x0405060700010203 );

    state->rembytes = inlen & 0x1F;
@@ -997,13 +934,13 @@ int luffa512_2way_full( luffa_2way_context *state, void *output,
    {
       // padding of partial block
       msg[0] = _mm256_shuffle_epi8( vdata[ 0 ], shuff_bswap32 );
-       msg[1] = m256_const1_i128( 0x0000000080000000 );
+       msg[1] = mm256_bcast128lo_64( 0x0000000080000000 );
       rnd512_2way( state, msg );
    }
    else
    {
       // empty pad block
-       msg[0] = m256_const1_i128( 0x0000000080000000 );
+       msg[0] = mm256_bcast128lo_64( 0x0000000080000000 );
       msg[1] = m256_zero;
       rnd512_2way( state, msg );
    }
@@ -1024,9 +961,7 @@ int luffa_2way_update_close( luffa_2way_context *state,
    __m256i msg[2];
    int i;
    const int blocks = (int)( inlen >> 5 );
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
+    const __m256i shuff_bswap32 = mm256_set2_64( 0x0c0d0e0f08090a0b,
                                                 0x0405060700010203 );

    state->rembytes = inlen & 0x1F;
@@ -1044,13 +979,13 @@ int luffa_2way_update_close( luffa_2way_context *state,
    {
       // padding of partial block
       msg[0] = _mm256_shuffle_epi8( vdata[ 0 ], shuff_bswap32 );
-       msg[1] = m256_const1_i128( 0x0000000080000000 );
+       msg[1] = mm256_bcast128lo_64( 0x0000000080000000 );
       rnd512_2way( state, msg );
    }
    else
    {
       // empty pad block
-       msg[0] = m256_const1_i128( 0x0000000080000000 );
+       msg[0] = mm256_bcast128lo_64( 0x0000000080000000 );
       msg[1] = m256_zero;
       rnd512_2way( state, msg );
    }
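
Note on the shuff_bswap32 change in the hunks above: the old four-word 256-bit constant is replaced by a two-word constant passed to mm256_set2_64. The sketch below is a scalar model, not project code, of why that can be equivalent. It assumes mm256_set2_64 repeats its 128-bit argument pair in both lanes, which this diff does not show. It relies on the documented behaviour of _mm256_shuffle_epi8, which shuffles each 128-bit lane independently and uses only the low 4 bits of each index. The helper names shuffle_epi8_256_model, store_le64 and bswap32_masks_agree are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm256_shuffle_epi8: each 128-bit lane is shuffled
   independently, only the low 4 bits of an index select a source byte,
   and an index with bit 7 set produces zero. */
static void shuffle_epi8_256_model( uint8_t out[32], const uint8_t src[32],
                                    const uint8_t mask[32] )
{
   for ( int lane = 0; lane < 2; lane++ )
      for ( int i = 0; i < 16; i++ )
      {
         const uint8_t m = mask[ lane*16 + i ];
         out[ lane*16 + i ] = ( m & 0x80 ) ? 0 : src[ lane*16 + ( m & 0x0f ) ];
      }
}

/* Store n 64-bit words as little-endian bytes, lowest word first. */
static void store_le64( uint8_t *dst, const uint64_t *words, int n )
{
   for ( int w = 0; w < n; w++ )
      for ( int b = 0; b < 8; b++ )
         dst[ w*8 + b ] = (uint8_t)( words[w] >> ( 8*b ) );
}

/* Returns 1 if the old four-word mask and the two-word mask repeated in
   both lanes (assumed behaviour of mm256_set2_64) give the same result,
   and that result is a byte swap of every 32-bit word. */
static int bswap32_masks_agree( void )
{
   const uint64_t old_words[4] = { 0x0405060700010203ULL, 0x0c0d0e0f08090a0bULL,
                                   0x1415161710111213ULL, 0x1c1d1e1f18191a1bULL };
   const uint64_t new_words[4] = { 0x0405060700010203ULL, 0x0c0d0e0f08090a0bULL,
                                   0x0405060700010203ULL, 0x0c0d0e0f08090a0bULL };
   uint8_t old_mask[32], new_mask[32], src[32], out_old[32], out_new[32];

   store_le64( old_mask, old_words, 4 );
   store_le64( new_mask, new_words, 4 );
   for ( int i = 0; i < 32; i++ ) src[i] = (uint8_t)( 0xa0 + i );

   shuffle_epi8_256_model( out_old, src, old_mask );
   shuffle_epi8_256_model( out_new, src, new_mask );

   /* Explicit reference: reverse the 4 bytes of every 32-bit word. */
   for ( int i = 0; i < 32; i++ )
      if ( out_old[i] != src[ ( i & ~3 ) + 3 - ( i & 3 ) ] ) return 0;

   return memcmp( out_old, out_new, 32 ) == 0;
}
```

Calling bswap32_masks_agree() compares the two masks on a 32-byte test pattern. If the broadcast assumption about mm256_set2_64 does not hold, the comparison says nothing about the real helper.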
--- a/algo/luffa/luffa_for_sse2.c
+++ b/algo/luffa/luffa_for_sse2.c
@@ -22,20 +22,29 @@
 #include "simd-utils.h"
 #include "luffa_for_sse2.h"

+#define cns(i)  ( ( (__m128i*)CNS_INIT)[i] )
+
+#define ADD_CONSTANT( a, b, c0 ,c1 ) \
+    a = _mm_xor_si128( a, c0 ); \
+    b = _mm_xor_si128( b, c1 ); \
+
 #if defined(__AVX512VL__)
+//TODO enable for AVX10_512 AVX10_256

 #define MULT2( a0, a1 ) \
 { \
-  __m128i b = _mm_xor_si128( a0, _mm_maskz_shuffle_epi32( 0xb, a1, 0x10 ) ); \
-  a0 = _mm_alignr_epi32( a1, b, 1 ); \
-  a1 = _mm_alignr_epi32( b, a1, 1 ); \
+  __m128i b = _mm_xor_si128( a0, \
+                      _mm_maskz_shuffle_epi32( 0xb, a1, 0x10 ) ); \
+  a0 = _mm_alignr_epi8( a1, b, 4 ); \
+  a1 = _mm_alignr_epi8( b, a1, 4 ); \
 }

 #elif defined(__SSE4_1__)

 #define MULT2( a0, a1 ) do \
 { \
-  __m128i b = _mm_xor_si128( a0, _mm_shuffle_epi32( mm128_mask_32( a1, 0xe ), 0x10 ) ); \
+  __m128i b = _mm_xor_si128( a0, \
+                      _mm_shuffle_epi32( mm128_mask_32( a1, 0xe ), 0x10 ) ); \
  a0 = _mm_alignr_epi8( a1, b, 4 ); \
  a1 = _mm_alignr_epi8( b, a1, 4 ); \
 } while(0)
@@ -44,79 +53,88 @@

 #define MULT2( a0, a1 ) do \
 { \
-  __m128i b = _mm_xor_si128( a0, _mm_shuffle_epi32( _mm_and_si128( a1, MASK ), 0x10 ) ); \
-  a0 = _mm_or_si128( _mm_srli_si128( b, 4 ), _mm_slli_si128( a1, 12 ) ); \
-  a1 = _mm_or_si128( _mm_srli_si128( a1, 4 ), _mm_slli_si128( b, 12 ) ); \
+  __m128i b = _mm_xor_si128( a0, \
+                      _mm_shuffle_epi32( _mm_and_si128( a1, MASK ), 0x10 ) ); \
+  a0 = _mm_or_si128( _mm_srli_si128(  b, 4 ), _mm_slli_si128( a1, 12 ) ); \
+  a1 = _mm_or_si128( _mm_srli_si128( a1, 4 ), _mm_slli_si128(  b, 12 ) ); \
 } while(0)

 #endif

-#define STEP_PART(x,c,t)\
-    SUBCRUMB(*x,*(x+1),*(x+2),*(x+3),*t);\
-    SUBCRUMB(*(x+5),*(x+6),*(x+7),*(x+4),*t);\
-    MIXWORD(*x,*(x+4),*t,*(t+1));\
-    MIXWORD(*(x+1),*(x+5),*t,*(t+1));\
-    MIXWORD(*(x+2),*(x+6),*t,*(t+1));\
-    MIXWORD(*(x+3),*(x+7),*t,*(t+1));\
-    ADD_CONSTANT(*x, *(x+4), *c, *(c+1));
+#if defined(__AVX512VL__)
+//TODO enable for AVX10_512 AVX10_256

-#define STEP_PART2(a0,a1,t0,t1,c0,c1,tmp0,tmp1)\
-    a1 = _mm_shuffle_epi32(a1,147);\
-    t0 = _mm_load_si128(&a1);\
-    a1 = _mm_unpacklo_epi32(a1,a0);\
-    t0 = _mm_unpackhi_epi32(t0,a0);\
-    t1 = _mm_shuffle_epi32(t0,78);\
-    a0 = _mm_shuffle_epi32(a1,78);\
-    SUBCRUMB(t1,t0,a0,a1,tmp0);\
-    t0 = _mm_unpacklo_epi32(t0,t1);\
-    a1 = _mm_unpacklo_epi32(a1,a0);\
-    a0 = _mm_load_si128(&a1);\
-    a0 = _mm_unpackhi_epi64(a0,t0);\
-    a1 = _mm_unpacklo_epi64(a1,t0);\
-    a1 = _mm_shuffle_epi32(a1,57);\
-    MIXWORD(a0,a1,tmp0,tmp1);\
-    ADD_CONSTANT(a0,a1,c0,c1);
+#define SUBCRUMB( a0, a1, a2, a3 ) \
+{ \
+    __m128i t = a0; \
+    a0 = mm128_xoror( a3, a0, a1 ); \
+    a2 = _mm_xor_si128( a2, a3 ); \
+    a1 = _mm_ternarylogic_epi64( a1, a3, t, 0x87 ); /* a1 xnor (a3 & t) */ \
+    a3 = mm128_xorand( a2, a3, t ); \
+    a2 = mm128_xorand( a1, a2, a0 ); \
+    a1 = _mm_or_si128( a1, a3 ); \
+    a3 = _mm_xor_si128( a3, a2 ); \
+    t  = _mm_xor_si128( t, a1 ); \
+    a2 = _mm_and_si128( a2, a1 ); \
+    a1 = mm128_xnor( a1, a0 ); \
+    a0 = t; \
+}

-#define SUBCRUMB(a0,a1,a2,a3,t)\
-    t  = _mm_load_si128(&a0);\
-    a0 = _mm_or_si128(a0,a1);\
-    a2 = _mm_xor_si128(a2,a3);\
-    a1 = mm128_not( a1 );\
-    a0 = _mm_xor_si128(a0,a3);\
-    a3 = _mm_and_si128(a3,t);\
-    a1 = _mm_xor_si128(a1,a3);\
-    a3 = _mm_xor_si128(a3,a2);\
-    a2 = _mm_and_si128(a2,a0);\
-    a0 = mm128_not( a0 );\
-    a2 = _mm_xor_si128(a2,a1);\
-    a1 = _mm_or_si128(a1,a3);\
-    t  = _mm_xor_si128(t,a1);\
-    a3 = _mm_xor_si128(a3,a2);\
-    a2 = _mm_and_si128(a2,a1);\
-    a1 = _mm_xor_si128(a1,a0);\
-    a0 = _mm_load_si128(&t);\
+#else

-#define MIXWORD(a,b,t1,t2)\
-    b = _mm_xor_si128(a,b);\
-    t1 = _mm_slli_epi32(a,2);\
-    t2 = _mm_srli_epi32(a,30);\
-    a = _mm_or_si128(t1,t2);\
-    a = _mm_xor_si128(a,b);\
-    t1 = _mm_slli_epi32(b,14);\
-    t2 = _mm_srli_epi32(b,18);\
-    b = _mm_or_si128(t1,t2);\
-    b = _mm_xor_si128(a,b);\
-    t1 = _mm_slli_epi32(a,10);\
-    t2 = _mm_srli_epi32(a,22);\
-    a = _mm_or_si128(t1,t2);\
-    a = _mm_xor_si128(a,b);\
-    t1 = _mm_slli_epi32(b,1);\
-    t2 = _mm_srli_epi32(b,31);\
-    b = _mm_or_si128(t1,t2);
+#define SUBCRUMB( a0, a1, a2, a3 ) \
+{ \
+    __m128i t = a0; \
+    a0 = _mm_or_si128( a0, a1 ); \
+    a2 = _mm_xor_si128( a2, a3 ); \
+    a1 = mm128_not( a1 ); \
+    a0 = _mm_xor_si128( a0, a3 ); \
+    a3 = _mm_and_si128( a3, t ); \
+    a1 = _mm_xor_si128( a1, a3 ); \
+    a3 = _mm_xor_si128( a3, a2 ); \
+    a2 = _mm_and_si128( a2, a0 ); \
+    a0 = mm128_not( a0 ); \
+    a2 = _mm_xor_si128( a2, a1 ); \
+    a1 = _mm_or_si128(  a1, a3 ); \
+    t  = _mm_xor_si128( t , a1 ); \
+    a3 = _mm_xor_si128( a3, a2 ); \
+    a2 = _mm_and_si128( a2, a1 ); \
+    a1 = _mm_xor_si128( a1, a0 ); \
+    a0 = t; \
+}

-#define ADD_CONSTANT(a,b,c0,c1)\
-    a = _mm_xor_si128(a,c0);\
-    b = _mm_xor_si128(b,c1);\
+#endif
+
+#define MIXWORD( a, b ) \
+    b = _mm_xor_si128( a, b ); \
+    a = _mm_xor_si128( b, mm128_rol_32( a, 2 ) ); \
+    b = _mm_xor_si128( a, mm128_rol_32( b, 14 ) ); \
+    a = _mm_xor_si128( b, mm128_rol_32( a, 10 ) ); \
+    b = mm128_rol_32( b, 1 );
+
+#define STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, c0, c1 ) \
+    SUBCRUMB( x0, x1, x2, x3 ); \
+    SUBCRUMB( x5, x6, x7, x4 ); \
+    MIXWORD( x0, x4 ); \
+    MIXWORD( x1, x5 ); \
+    MIXWORD( x2, x6 ); \
+    MIXWORD( x3, x7 ); \
+    ADD_CONSTANT( x0, x4, c0, c1 );
+
+#define STEP_PART2( a0, a1, t0, t1, c0, c1 ) \
+    t0 = _mm_shuffle_epi32( a1, 147 ); \
+    a1 = _mm_unpacklo_epi32( t0, a0 ); \
+    t0 = _mm_unpackhi_epi32( t0, a0 ); \
+    t1 = _mm_shuffle_epi32( t0, 78 ); \
+    a0 = _mm_shuffle_epi32( a1, 78 ); \
+    SUBCRUMB( t1, t0, a0, a1 ); \
+    t0 = _mm_unpacklo_epi32( t0, t1 ); \
+    a1 = _mm_unpacklo_epi32( a1, a0 ); \
+    a0 = _mm_unpackhi_epi64( a1, t0 ); \
+    a1 = _mm_unpacklo_epi64( a1, t0 ); \
+    a1 = _mm_shuffle_epi32( a1, 57 ); \
+    MIXWORD( a0, a1 ); \
+    ADD_CONSTANT( a0, a1, c0, c1 );

 #define NMLTOM768(r0,r1,r2,s0,s1,s2,s3,p0,p1,p2,q0,q1,q2,q3)\
    s2 = _mm_load_si128(&r1);\
@@ -177,32 +195,22 @@
    q1 = _mm_load_si128(&p1);\

 #define NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
-    s1 = _mm_load_si128(&r3);\
-    q1 = _mm_load_si128(&p3);\
-    s3 = _mm_load_si128(&r3);\
-    q3 = _mm_load_si128(&p3);\
-    s1 = _mm_unpackhi_epi32(s1,r2);\
-    q1 = _mm_unpackhi_epi32(q1,p2);\
-    s3 = _mm_unpacklo_epi32(s3,r2);\
-    q3 = _mm_unpacklo_epi32(q3,p2);\
-    s0 = _mm_load_si128(&s1);\
-    q0 = _mm_load_si128(&q1);\
-    s2 = _mm_load_si128(&s3);\
-    q2 = _mm_load_si128(&q3);\
-    r3 = _mm_load_si128(&r1);\
-    p3 = _mm_load_si128(&p1);\
-    r1 = _mm_unpacklo_epi32(r1,r0);\
-    p1 = _mm_unpacklo_epi32(p1,p0);\
-    r3 = _mm_unpackhi_epi32(r3,r0);\
-    p3 = _mm_unpackhi_epi32(p3,p0);\
-    s0 = _mm_unpackhi_epi64(s0,r3);\
-    q0 = _mm_unpackhi_epi64(q0,p3);\
-    s1 = _mm_unpacklo_epi64(s1,r3);\
-    q1 = _mm_unpacklo_epi64(q1,p3);\
-    s2 = _mm_unpackhi_epi64(s2,r1);\
-    q2 = _mm_unpackhi_epi64(q2,p1);\
-    s3 = _mm_unpacklo_epi64(s3,r1);\
-    q3 = _mm_unpacklo_epi64(q3,p1);
+    s1 = _mm_unpackhi_epi32( r3, r2 ); \
+    q1 = _mm_unpackhi_epi32( p3, p2 ); \
+    s3 = _mm_unpacklo_epi32( r3, r2 ); \
+    q3 = _mm_unpacklo_epi32( p3, p2 ); \
+    r3 = _mm_unpackhi_epi32( r1, r0 ); \
+    r1 = _mm_unpacklo_epi32( r1, r0 ); \
+    p3 = _mm_unpackhi_epi32( p1, p0 ); \
+    p1 = _mm_unpacklo_epi32( p1, p0 ); \
+    s0 = _mm_unpackhi_epi64( s1, r3 ); \
+    q0 = _mm_unpackhi_epi64( q1 ,p3 ); \
+    s1 = _mm_unpacklo_epi64( s1, r3 ); \
+    q1 = _mm_unpacklo_epi64( q1, p3 ); \
+    s2 = _mm_unpackhi_epi64( s3, r1 ); \
+    q2 = _mm_unpackhi_epi64( q3, p1 ); \
+    s3 = _mm_unpacklo_epi64( s3, r1 ); \
+    q3 = _mm_unpacklo_epi64( q3, p1 );

 #define MIXTON1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
    NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3);
@@ -306,8 +314,7 @@ HashReturn update_luffa( hashState_luffa *state, const BitSequence *data,
      // remaining data bytes
      casti_m128i( state->buffer, 0 ) = mm128_bswap_32( cast_m128i( data ) );
      // padding of partial block
-      casti_m128i( state->buffer, 1 ) =
-            _mm_set_epi8( 0,0,0,0, 0,0,0,0, 0,0,0,0, 0x80,0,0,0 );
+      casti_m128i( state->buffer, 1 ) =  _mm_set_epi32( 0, 0, 0, 0x80000000 );
    }

    return SUCCESS;
@@ -325,8 +332,7 @@ HashReturn final_luffa(hashState_luffa *state, BitSequence *hashval)
    else
    {
      // empty pad block, constant data
-     rnd512( state, _mm_setzero_si128(),
-                       _mm_set_epi8( 0,0,0,0, 0,0,0,0, 0,0,0,0, 0x80,0,0,0 ) );
+     rnd512( state, _mm_setzero_si128(), _mm_set_epi32( 0, 0, 0, 0x80000000 ) );
    }

    finalization512(state, (uint32*) hashval);
@@ -354,11 +360,11 @@ HashReturn update_and_final_luffa( hashState_luffa *state, BitSequence* output,
    // 16 byte partial block exists for 80 byte len
    if ( state->rembytes  )
       // padding of partial block
-       rnd512( state, m128_const_i128(  0x80000000 ),
+       rnd512( state, mm128_mov64_128(  0x80000000 ),
                      mm128_bswap_32( cast_m128i( data ) ) );
    else
       // empty pad block
-       rnd512( state, m128_zero, m128_const_i128( 0x80000000 ) );
+       rnd512( state, m128_zero, mm128_mov64_128( 0x80000000 ) );

    finalization512( state, (uint32*) output );
    if ( state->hashbitlen > 512 )
@@ -403,11 +409,11 @@ int luffa_full( hashState_luffa *state, BitSequence* output, int hashbitlen,
    // 16 byte partial block exists for 80 byte len
    if ( state->rembytes  )
       // padding of partial block
-       rnd512( state, m128_const_i128( 0x80000000 ),
+       rnd512( state, mm128_mov64_128( 0x80000000 ),
                      mm128_bswap_32( cast_m128i( data ) ) );
    else
       // empty pad block
-       rnd512( state, m128_zero, m128_const_i128( 0x80000000 ) );
+       rnd512( state, m128_zero, mm128_mov64_128( 0x80000000 ) );

    finalization512( state, (uint32*) output );
    if ( state->hashbitlen > 512 )
@@ -423,163 +429,119 @@ int luffa_full( hashState_luffa *state, BitSequence* output, int hashbitlen,

 static void rnd512( hashState_luffa *state, __m128i msg1, __m128i msg0 )
 {
-    __m128i t[2];
+    __m128i t0, t1;
    __m128i *chainv = state->chainv;
-    __m128i tmp[2];
-    __m128i x[8];
+    __m128i x0, x1, x2, x3, x4, x5, x6, x7; 

-    t[0] = chainv[0];
-    t[1] = chainv[1];
+    t0 = mm128_xor3( chainv[0], chainv[2], chainv[4] );
+    t1 = mm128_xor3( chainv[1], chainv[3], chainv[5] );
+    t0 = mm128_xor3( t0, chainv[6], chainv[8] );
+    t1 = mm128_xor3( t1, chainv[7], chainv[9] );

-    t[0] = _mm_xor_si128( t[0], chainv[2] );
-    t[1] = _mm_xor_si128( t[1], chainv[3] );
-    t[0] = _mm_xor_si128( t[0], chainv[4] );
-    t[1] = _mm_xor_si128( t[1], chainv[5] );
-    t[0] = _mm_xor_si128( t[0], chainv[6] );
-    t[1] = _mm_xor_si128( t[1], chainv[7] );
-    t[0] = _mm_xor_si128( t[0], chainv[8] );
-    t[1] = _mm_xor_si128( t[1], chainv[9] );
-
-    MULT2( t[0], t[1] );
+    MULT2( t0, t1 );

    msg0 = _mm_shuffle_epi32( msg0, 27 );
    msg1 = _mm_shuffle_epi32( msg1, 27 );

-    chainv[0] = _mm_xor_si128( chainv[0], t[0] );
-    chainv[1] = _mm_xor_si128( chainv[1], t[1] );
-    chainv[2] = _mm_xor_si128( chainv[2], t[0] );
-    chainv[3] = _mm_xor_si128( chainv[3], t[1] );
-    chainv[4] = _mm_xor_si128( chainv[4], t[0] );
-    chainv[5] = _mm_xor_si128( chainv[5], t[1] );
-    chainv[6] = _mm_xor_si128( chainv[6], t[0] );
-    chainv[7] = _mm_xor_si128( chainv[7], t[1] );
-    chainv[8] = _mm_xor_si128( chainv[8], t[0] );
-    chainv[9] = _mm_xor_si128( chainv[9], t[1] );
+    chainv[0] = _mm_xor_si128( chainv[0], t0 );
+    chainv[1] = _mm_xor_si128( chainv[1], t1 );
+    chainv[2] = _mm_xor_si128( chainv[2], t0 );
+    chainv[3] = _mm_xor_si128( chainv[3], t1 );
+    chainv[4] = _mm_xor_si128( chainv[4], t0 );
+    chainv[5] = _mm_xor_si128( chainv[5], t1 );
+    chainv[6] = _mm_xor_si128( chainv[6], t0 );
+    chainv[7] = _mm_xor_si128( chainv[7], t1 );
+    chainv[8] = _mm_xor_si128( chainv[8], t0 );
+    chainv[9] = _mm_xor_si128( chainv[9], t1 );

-    t[0] = chainv[0];
-    t[1] = chainv[1];
+    t0 = chainv[0];
+    t1 = chainv[1];

    MULT2( chainv[0], chainv[1]);
-
    chainv[0] = _mm_xor_si128( chainv[0], chainv[2] );
    chainv[1] = _mm_xor_si128( chainv[1], chainv[3] );

    MULT2( chainv[2], chainv[3]);
-
    chainv[2] = _mm_xor_si128(chainv[2], chainv[4]);
    chainv[3] = _mm_xor_si128(chainv[3], chainv[5]);

    MULT2( chainv[4], chainv[5]);
-
    chainv[4] = _mm_xor_si128(chainv[4], chainv[6]);
    chainv[5] = _mm_xor_si128(chainv[5], chainv[7]);

    MULT2( chainv[6], chainv[7]);
-
    chainv[6] = _mm_xor_si128(chainv[6], chainv[8]);
    chainv[7] = _mm_xor_si128(chainv[7], chainv[9]);

    MULT2( chainv[8], chainv[9]);
-
-    chainv[8] = _mm_xor_si128( chainv[8], t[0] );
-    chainv[9] = _mm_xor_si128( chainv[9], t[1] );
-
-    t[0] = chainv[8];
-    t[1] = chainv[9];
+    t0 = chainv[8] = _mm_xor_si128( chainv[8], t0 );
+    t1 = chainv[9] = _mm_xor_si128( chainv[9], t1 );

    MULT2( chainv[8], chainv[9]);
-
    chainv[8] = _mm_xor_si128( chainv[8], chainv[6] );
    chainv[9] = _mm_xor_si128( chainv[9], chainv[7] );

    MULT2( chainv[6], chainv[7]);
-
    chainv[6] = _mm_xor_si128( chainv[6], chainv[4] );
    chainv[7] = _mm_xor_si128( chainv[7], chainv[5] );

    MULT2( chainv[4], chainv[5]);
-
    chainv[4] = _mm_xor_si128( chainv[4], chainv[2] );
    chainv[5] = _mm_xor_si128( chainv[5], chainv[3] );

    MULT2( chainv[2], chainv[3] );
-
    chainv[2] = _mm_xor_si128( chainv[2], chainv[0] );
    chainv[3] = _mm_xor_si128( chainv[3], chainv[1] );

    MULT2( chainv[0], chainv[1] );
-
-    chainv[0] = _mm_xor_si128( _mm_xor_si128( chainv[0], t[0] ), msg0 );
-    chainv[1] = _mm_xor_si128( _mm_xor_si128( chainv[1], t[1] ), msg1 );
+    chainv[0] = _mm_xor_si128( _mm_xor_si128( chainv[0], t0 ), msg0 );
+    chainv[1] = _mm_xor_si128( _mm_xor_si128( chainv[1], t1 ), msg1 );

    MULT2( msg0, msg1);
-
    chainv[2] = _mm_xor_si128( chainv[2], msg0 );
    chainv[3] = _mm_xor_si128( chainv[3], msg1 );

    MULT2( msg0, msg1);
-
    chainv[4] = _mm_xor_si128( chainv[4], msg0 );
    chainv[5] = _mm_xor_si128( chainv[5], msg1 );

    MULT2( msg0, msg1);
-
    chainv[6] = _mm_xor_si128( chainv[6], msg0 );
    chainv[7] = _mm_xor_si128( chainv[7], msg1 );

    MULT2( msg0, msg1);
-
    chainv[8] = _mm_xor_si128( chainv[8], msg0 );
    chainv[9] = _mm_xor_si128( chainv[9], msg1 );

    MULT2( msg0, msg1);
+    chainv[3] = mm128_rol_32( chainv[3], 1 );    
+    chainv[5] = mm128_rol_32( chainv[5], 2 );
+    chainv[7] = mm128_rol_32( chainv[7], 3 );
+    chainv[9] = mm128_rol_32( chainv[9], 4 );
+    
+    NMLTOM1024( chainv[0], chainv[2], chainv[4], chainv[6], x0, x1, x2, x3,
+                chainv[1], chainv[3], chainv[5], chainv[7], x4, x5, x6, x7 );

-    chainv[3] = _mm_or_si128( _mm_slli_epi32(chainv[3], 1),
-                              _mm_srli_epi32(chainv[3], 31) );
-    chainv[5] = _mm_or_si128( _mm_slli_epi32(chainv[5], 2),
-                              _mm_srli_epi32(chainv[5], 30) );
-    chainv[7] = _mm_or_si128( _mm_slli_epi32(chainv[7], 3),
-                              _mm_srli_epi32(chainv[7], 29) );
-    chainv[9] = _mm_or_si128( _mm_slli_epi32(chainv[9], 4),
-                              _mm_srli_epi32(chainv[9], 28) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 0), cns( 1) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 2), cns( 3) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 4), cns( 5) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 6), cns( 7) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 8), cns( 9) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(10), cns(11) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(12), cns(13) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(14), cns(15) );
+    
+    MIXTON1024( x0, x1, x2, x3, chainv[0], chainv[2], chainv[4], chainv[6],
+                x4, x5, x6, x7, chainv[1], chainv[3], chainv[5], chainv[7]);

-
-    NMLTOM1024( chainv[0], chainv[2], chainv[4], chainv[6],
-                x[0], x[1], x[2], x[3],
-                chainv[1],chainv[3],chainv[5],chainv[7],
-                x[4], x[5], x[6], x[7] );
-
-    STEP_PART( &x[0], &CNS128[ 0], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[ 2], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[ 4], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[ 6], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[ 8], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[10], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[12], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[14], &tmp[0] );
-
-    MIXTON1024( x[0], x[1], x[2], x[3],
-                chainv[0], chainv[2], chainv[4],chainv[6],
-                x[4], x[5], x[6], x[7],
-                chainv[1],chainv[3],chainv[5],chainv[7]);
-
-    /* Process last 256-bit block */
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[16], CNS128[17],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[18], CNS128[19],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[20], CNS128[21],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[22], CNS128[23],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[24], CNS128[25],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[26], CNS128[27],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[28], CNS128[29],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[30], CNS128[31],
-                tmp[0], tmp[1] );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(16), cns(17) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(18), cns(19) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(20), cns(21) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(22), cns(23) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(24), cns(25) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(26), cns(27) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(28), cns(29) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(30), cns(31) );
 }


@@ -588,51 +550,6 @@ static void rnd512( hashState_luffa *state, __m128i msg1, __m128i msg0 )
 /* state: hash context    */
 /* b[8]: hash values      */

-#if defined (__AVX2__)
-
-static void finalization512( hashState_luffa *state, uint32 *b )
-{
-    uint32   hash[8] __attribute((aligned(64)));
-    __m256i* chainv = (__m256i*)state->chainv;
-    __m256i  t;
-    const __m128i zero = m128_zero;
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
-                                                 0x0405060700010203 );
-
-    rnd512( state, zero, zero );
-
-    t = chainv[0];
-    t = _mm256_xor_si256( t, chainv[1] );
-    t = _mm256_xor_si256( t, chainv[2] );
-    t = _mm256_xor_si256( t, chainv[3] );
-    t = _mm256_xor_si256( t, chainv[4] );
-
-    t = _mm256_shuffle_epi32( t, 27 );
-
-    _mm256_store_si256( (__m256i*)hash, t );
-
-    casti_m256i( b, 0 ) = _mm256_shuffle_epi8(
-                                 casti_m256i( hash, 0 ), shuff_bswap32 );
-
-    rnd512( state, zero, zero );
-
-    t = chainv[0];
-    t = _mm256_xor_si256( t, chainv[1] );
-    t = _mm256_xor_si256( t, chainv[2] );
-    t = _mm256_xor_si256( t, chainv[3] );
-    t = _mm256_xor_si256( t, chainv[4] );
-    t = _mm256_shuffle_epi32( t, 27 );
-
-    _mm256_store_si256( (__m256i*)hash, t );
-
-    casti_m256i( b, 1 ) = _mm256_shuffle_epi8( 
-                                 casti_m256i( hash, 0 ), shuff_bswap32 );
-}
-
-#else
-
 static void finalization512( hashState_luffa *state, uint32 *b )
 {
    uint32 hash[8] __attribute((aligned(64)));
@@ -685,6 +602,5 @@ static void finalization512( hashState_luffa *state, uint32 *b )
    casti_m128i( b, 2 ) = mm128_bswap_32( casti_m128i( hash, 0 ) );
    casti_m128i( b, 3 ) = mm128_bswap_32( casti_m128i( hash, 1 ) );
 }
-#endif

 /***************************************************/
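
Note on the AVX512VL SUBCRUMB above: the comment on the _mm_ternarylogic_epi64 line says that immediate 0x87 computes "a1 xnor (a3 & t)". The sketch below is a scalar model of one 64-bit element, not project code, that checks this truth table against the two-step form in the non-AVX512VL branch (a1 = not a1, then a1 ^= a3 & t). The function names ternarylogic64_model and xnor_and_reference are hypothetical.

```c
#include <stdint.h>

/* Scalar model of one 64-bit element of _mm_ternarylogic_epi64: at every
   bit position the three input bits form an index (a is the most
   significant bit, c the least) into the 8-bit truth table imm. */
static uint64_t ternarylogic64_model( uint64_t a, uint64_t b, uint64_t c,
                                      uint8_t imm )
{
   uint64_t r = 0;
   for ( int bit = 0; bit < 64; bit++ )
   {
      const unsigned idx = (unsigned)( ( ( ( a >> bit ) & 1 ) << 2 )
                                     | ( ( ( b >> bit ) & 1 ) << 1 )
                                     |   ( ( c >> bit ) & 1 ) );
      r |= (uint64_t)( ( imm >> idx ) & 1 ) << bit;
   }
   return r;
}

/* The two-step form from the generic SUBCRUMB:
   a1 = not a1; a1 = a1 xor ( a3 and t ). */
static uint64_t xnor_and_reference( uint64_t a1, uint64_t a3, uint64_t t )
{
   return ~a1 ^ ( a3 & t );
}
```

With inputs 0xF0, 0xCC and 0xAA the low 8 bits enumerate all eight input combinations, so the low byte of the model's result reproduces the immediate itself.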
--- a/algo/lyra2/allium-4way.c
+++ b/algo/lyra2/allium-4way.c
@@ -48,7 +48,7 @@ static void allium_16way_hash( void *state, const void *midstate_vars,
   uint32_t hash15[8] __attribute__ ((aligned (32)));
   allium_16way_ctx_holder ctx __attribute__ ((aligned (64)));

-   blake256_16way_final_rounds_le( vhash, midstate_vars, midhash, block );
+   blake256_16way_final_rounds_le( vhash, midstate_vars, midhash, block, 14 );

   dintrlv_16x32( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                  hash8, hash9, hash10, hash11, hash12, hash13, hash14, hash15,
@@ -212,12 +212,12 @@ int scanhash_allium_16way( struct work *work, uint32_t max_nonce,
   const uint32_t last_nonce = max_nonce - 16;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m512i sixteen = m512_const1_32( 16 );
+   const __m512i sixteen = _mm512_set1_epi32( 16 );

   if ( bench ) ( (uint32_t*)ptarget )[7] = 0x0000ff;

   // Prehash first block.
-   blake256_transform_le( phash, pdata, 512, 0 );
+   blake256_transform_le( phash, pdata, 512, 0, 14 );

   // Interleave hash for second block prehash.
   block0_hash[0] = _mm512_set1_epi32( phash[0] );
@@ -286,7 +286,7 @@ static void allium_8way_hash( void *hash, const void *midstate_vars,
   uint64_t *hash7 = (uint64_t*)hash+28;
   allium_8way_ctx_holder ctx __attribute__ ((aligned (64))); 

-   blake256_8way_final_rounds_le( vhashA, midstate_vars, midhash, block );
+   blake256_8way_final_rounds_le( vhashA, midstate_vars, midhash, block, 14 );

   dintrlv_8x32( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                 vhashA, 256 );
@@ -398,10 +398,10 @@ int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;  
   const bool bench = opt_benchmark;
-   const __m256i eight = m256_const1_32( 8 );
+   const __m256i eight = _mm256_set1_epi32( 8 );

   // Prehash first block
-   blake256_transform_le( phash, pdata, 512, 0 );
+   blake256_transform_le( phash, pdata, 512, 0, 14 );

   block0_hash[0] = _mm256_set1_epi32( phash[0] );
   block0_hash[1] = _mm256_set1_epi32( phash[1] );
--- a/algo/lyra2/lyra2rev2-4way.c
+++ b/algo/lyra2/lyra2rev2-4way.c
@@ -75,7 +75,7 @@ void lyra2rev2_16way_hash( void *state, const void *input )
   keccak256_8way_close( &ctx.keccak, vhash );

   dintrlv_8x64( hash8,  hash9,  hash10,  hash11,
-                 hash12, hash13, hash14, hash5, vhash, 256 );
+                 hash12, hash13, hash14, hash15, vhash, 256 );

   cubehash_full( &ctx.cube, (byte*) hash0,  256, (const byte*) hash0,  32 );
   cubehash_full( &ctx.cube, (byte*) hash1,  256, (const byte*) hash1,  32 );
@@ -203,7 +203,7 @@ int scanhash_lyra2rev2_16way( struct work *work, const uint32_t max_nonce,
             submit_solution( work, lane_hash, mythr );
         }
      }
-      *noncev = _mm512_add_epi32( *noncev, m512_const1_32( 16 ) );
+      *noncev = _mm512_add_epi32( *noncev, _mm512_set1_epi32( 16 ) );
      n += 16;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -345,7 +345,7 @@ int scanhash_lyra2rev2_8way( struct work *work, const uint32_t max_nonce,
             submit_solution( work, lane_hash, mythr );
         }
      }
-      *noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
+      *noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
      n += 8;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
--- a/algo/lyra2/lyra2rev3-4way.c
+++ b/algo/lyra2/lyra2rev3-4way.c
@@ -287,7 +287,7 @@ int scanhash_lyra2rev3_8way( struct work *work, const uint32_t max_nonce,
             submit_solution( work, lane_hash, mythr );
         }
      }
-      *noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
+      *noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
      n += 8;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -389,7 +389,7 @@ int scanhash_lyra2rev3_4way( struct work *work, const uint32_t max_nonce,
              submit_solution( work, lane_hash, mythr );
 	      }
      }
-      *noncev = _mm_add_epi32( *noncev, m128_const1_32( 4 ) );
+      *noncev = _mm_add_epi32( *noncev, _mm_set1_epi32( 4 ) );
      n += 4;
   } while ( (n < max_nonce-4) && !work_restart[thr_id].restart);
   pdata[19] = n;
--- a/algo/lyra2/lyra2z-4way.c
+++ b/algo/lyra2/lyra2z-4way.c
@@ -35,7 +35,7 @@ static void lyra2z_16way_hash( void *state, const void *midstate_vars,
    uint32_t hash14[8] __attribute__ ((aligned (32)));
    uint32_t hash15[8] __attribute__ ((aligned (32)));

-    blake256_16way_final_rounds_le( vhash, midstate_vars, midhash, block );
+    blake256_16way_final_rounds_le( vhash, midstate_vars, midhash, block, 14 );

    dintrlv_16x32( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
              hash8, hash9, hash10, hash11 ,hash12, hash13, hash14, hash15,
@@ -103,12 +103,12 @@ int scanhash_lyra2z_16way( struct work *work, uint32_t max_nonce,
   const uint32_t last_nonce = max_nonce - 16;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m512i sixteen = m512_const1_32( 16 );
+   const __m512i sixteen = _mm512_set1_epi32( 16 );

   if ( bench ) ( (uint32_t*)ptarget )[7] = 0x0000ff;

   // Prehash first block
-   blake256_transform_le( phash, pdata, 512, 0 );
+   blake256_transform_le( phash, pdata, 512, 0, 14 );

   block0_hash[0] = _mm512_set1_epi32( phash[0] );
   block0_hash[1] = _mm512_set1_epi32( phash[1] );
@@ -170,7 +170,7 @@ static void lyra2z_8way_hash( void *state, const void *midstate_vars,
     uint32_t hash7[8] __attribute__ ((aligned (32)));
     uint32_t vhash[8*8] __attribute__ ((aligned (64)));

-     blake256_8way_final_rounds_le( vhash, midstate_vars, midhash, block );
+     blake256_8way_final_rounds_le( vhash, midstate_vars, midhash, block, 14 );

     dintrlv_8x32( hash0, hash1, hash2, hash3,
                   hash4, hash5, hash6, hash7, vhash, 256 );
@@ -213,10 +213,10 @@ int scanhash_lyra2z_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m256i eight = m256_const1_32( 8 );
+   const __m256i eight = _mm256_set1_epi32( 8 );

   // Prehash first block
-   blake256_transform_le( phash, pdata, 512, 0 );
+   blake256_transform_le( phash, pdata, 512, 0, 14 );

   block0_hash[0] = _mm256_set1_epi32( phash[0] );
   block0_hash[1] = _mm256_set1_epi32( phash[1] );
@@ -328,7 +328,7 @@ int scanhash_lyra2z_4way( struct work *work, uint32_t max_nonce,
           submit_solution( work, lane_hash, mythr );
        }
      }
-      *noncev = _mm_add_epi32( *noncev, m128_const1_32( 4 ) );
+      *noncev = _mm_add_epi32( *noncev, _mm_set1_epi32( 4 ) );
      n += 4;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

--- a/algo/lyra2/sponge-2way.c
+++ b/algo/lyra2/sponge-2way.c
@@ -85,10 +85,10 @@ inline void absorbBlockBlake2Safe_2way( uint64_t *State, const uint64_t *In,

  state0 = 
  state1 = m512_zero;
-  state2 = m512_const4_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
-                           0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state3 = m512_const4_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
-                           0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state2 = _mm512_set4_epi64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
+                              0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state3 = _mm512_set4_epi64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
+                              0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );

  for ( int i = 0; i < nBlocks; i++ )
  { 
--- a/algo/lyra2/sponge.c
+++ b/algo/lyra2/sponge.c
@@ -41,17 +41,17 @@
 inline void initState( uint64_t State[/*16*/] )
 {

-   /*
+/*
 #if defined (__AVX2__)

  __m256i* state = (__m256i*)State;
  const __m256i zero = m256_zero; 
  state[0] = zero;
  state[1] = zero;
-  state[2] = m256_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
-                            0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state[3] = m256_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
-                            0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state[2] = _mm256_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
+                                0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state[3] = _mm256_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
+                                0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );

 #elif defined (__SSE2__)

@@ -62,10 +62,10 @@ inline void initState( uint64_t State[/*16*/] )
  state[1] = zero;
  state[2] = zero;
  state[3] = zero;
-  state[4] = m128_const_64( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state[5] = m128_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
-  state[6] = m128_const_64( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
-  state[7] = m128_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );
+  state[4] = _mm_set_epi64x( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state[5] = _mm_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
+  state[6] = _mm_set_epi64x( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state[7] = _mm_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );

 #else
    //First 512 bis are zeros
@@ -271,10 +271,10 @@ inline void absorbBlockBlake2Safe( uint64_t *State, const uint64_t *In,

  state0 = 
  state1 = m256_zero;
-  state2 = m256_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
-                          0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state3 = m256_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
-                          0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state2 = _mm256_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
+                              0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state3 = _mm256_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
+                              0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );

  for ( int i = 0; i < nBlocks; i++ )
  { 
@@ -299,10 +299,10 @@ inline void absorbBlockBlake2Safe( uint64_t *State, const uint64_t *In,
  state1 =
  state2 =
  state3 = m128_zero;
-  state4 = m128_const_64( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state5 = m128_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
-  state6 = m128_const_64( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
-  state7 = m128_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );
+  state4 = _mm_set_epi64x( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state5 = _mm_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
+  state6 = _mm_set_epi64x( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state7 = _mm_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );

  for ( int i = 0; i < nBlocks; i++ )
  { 
--- a/algo/lyra2/sponge.h
+++ b/algo/lyra2/sponge.h
@@ -43,27 +43,29 @@ static const uint64_t blake2b_IV[8] =
  0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL
 };

-/*Blake2b's rotation*/
-static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
-    return ( w >> c ) | ( w << ( 64 - c ) );
-}
-
-// serial data is only 32 bytes so AVX2 is the limit for that dimension.
-// However, 2 way parallel looks trivial to code for AVX512 except for
-// a data dependency with rowa.
-
 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

 #define G2W_4X64(a,b,c,d) \
   a = _mm512_add_epi64( a, b ); \
-   d = mm512_ror_64( _mm512_xor_si512( d, a ), 32 ); \
+   d = _mm512_ror_epi64( _mm512_xor_si512( d, a ), 32 ); \
   c = _mm512_add_epi64( c, d ); \
-   b = mm512_ror_64( _mm512_xor_si512( b, c ), 24 ); \
+   b = _mm512_ror_epi64( _mm512_xor_si512( b, c ), 24 ); \
   a = _mm512_add_epi64( a, b ); \
-   d = mm512_ror_64( _mm512_xor_si512( d, a ), 16 ); \
+   d = _mm512_ror_epi64( _mm512_xor_si512( d, a ), 16 ); \
   c = _mm512_add_epi64( c, d ); \
-   b = mm512_ror_64( _mm512_xor_si512( b, c ), 63 );
+   b = _mm512_ror_epi64( _mm512_xor_si512( b, c ), 63 );

+#define LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
+   G2W_4X64( s0, s1, s2, s3 ); \
+   s0 = mm512_shufll256_64( s0 ); \
+   s3 = mm512_swap256_128( s3 ); \
+   s2 = mm512_shuflr256_64( s2 ); \
+   G2W_4X64( s0, s1, s2, s3 ); \
+   s0 = mm512_shuflr256_64( s0 ); \
+   s3 = mm512_swap256_128( s3 ); \
+   s2 = mm512_shufll256_64( s2 ); 
+
+/*
 #define LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
   G2W_4X64( s0, s1, s2, s3 ); \
   s3 = mm512_shufll256_64( s3 ); \
@@ -73,6 +75,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   s3 = mm512_shuflr256_64( s3 ); \
   s1 = mm512_shufll256_64( s1 ); \
   s2 = mm512_swap256_128( s2 ); 
+*/

 #define LYRA_12_ROUNDS_2WAY_AVX512( s0, s1, s2, s3 ) \
   LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
@@ -88,13 +91,10 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
   LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 )

-
 #endif  // AVX512

-#if defined __AVX2__
+#if defined(__AVX2__)

-// process 4 columns in parallel
-// returns void, updates all args
 #define G_4X64(a,b,c,d) \
   a = _mm256_add_epi64( a, b ); \
   d = mm256_swap64_32( _mm256_xor_si256( d, a ) ); \
@@ -105,6 +105,18 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   c = _mm256_add_epi64( c, d ); \
   b = mm256_ror_64( _mm256_xor_si256( b, c ), 63 );

+// Pivoting about s1 instead of s0 reduces latency.
+#define LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
+   G_4X64( s0, s1, s2, s3 ); \
+   s0 = mm256_shufll_64( s0 ); \
+   s3 = mm256_swap_128( s3 ); \
+   s2 = mm256_shuflr_64( s2 ); \
+   G_4X64( s0, s1, s2, s3 ); \
+   s0 = mm256_shuflr_64( s0 ); \
+   s3 = mm256_swap_128( s3 ); \
+   s2 = mm256_shufll_64( s2 );
+
+/*
 #define LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
   G_4X64( s0, s1, s2, s3 ); \
   s3 = mm256_shufll_64( s3 ); \
@@ -114,6 +126,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   s3 = mm256_shuflr_64( s3 ); \
   s1 = mm256_shufll_64( s1 ); \
   s2 = mm256_swap_128( s2 );
+*/

 #define LYRA_12_ROUNDS_AVX2( s0, s1, s2, s3 ) \
   LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
@@ -182,8 +195,13 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){

 #endif // AVX2 else SSE2

-// Scalar
-//Blake2b's G function
+/*
+// Scalar, not used.
+
+static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
+    return ( w >> c ) | ( w << ( 64 - c ) );
+}
+
 #define G(r,i,a,b,c,d) \
  do { \
    a = a + b; \
@@ -196,8 +214,6 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
    b = rotr64(b ^ c, 63); \
  } while(0)

-
-/*One Round of the Blake2b's compression function*/
 #define ROUND_LYRA(r)  \
    G(r,0,v[ 0],v[ 4],v[ 8],v[12]); \
    G(r,1,v[ 1],v[ 5],v[ 9],v[13]); \
@@ -207,6 +223,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
    G(r,5,v[ 1],v[ 6],v[11],v[12]); \
    G(r,6,v[ 2],v[ 7],v[ 8],v[13]); \
    G(r,7,v[ 3],v[ 4],v[ 9],v[14]);
+*/

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

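For reference, the scalar G step that the sponge.h hunks above comment out can be sketched in isolation. This is a minimal sketch, not the project's code: `rotr64` matches the helper shown in the diff, while `lyra_g_scalar` is a hypothetical name used only here. The rotation counts 32, 24, 16 and 63 are taken from the vector macros `G2W_4X64` and `G_4X64`, which apply the same sequence to each 64-bit lane.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by c bits; valid for 0 < c < 64. */
static inline uint64_t rotr64( const uint64_t w, const unsigned c )
{
   return ( w >> c ) | ( w << ( 64 - c ) );
}

/* One G step on four scalar words (hypothetical helper name).
   Same add / xor / rotate order as the vector macros, with no
   message words mixed in. */
static inline void lyra_g_scalar( uint64_t *a, uint64_t *b,
                                  uint64_t *c, uint64_t *d )
{
   *a += *b;   *d = rotr64( *d ^ *a, 32 );
   *c += *d;   *b = rotr64( *b ^ *c, 24 );
   *a += *b;   *d = rotr64( *d ^ *a, 16 );
   *c += *d;   *b = rotr64( *b ^ *c, 63 );
}
```

A scalar version like this can serve as a cross-check when reordering the lane shuffles in the vector rounds, since each lane of the vector G should produce the same four words.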
--- a/algo/quark/anime-4way.c
+++ b/algo/quark/anime-4way.c
@@ -51,7 +51,7 @@ void anime_8way_hash( void *state, const void *input )
    __m512i* vhA = (__m512i*)vhashA;
    __m512i* vhB = (__m512i*)vhashB;
    __m512i* vhC = (__m512i*)vhashC;
-    const __m512i bit3_mask = m512_const1_64( 8 );
+    const __m512i bit3_mask = _mm512_set1_epi64( 8 );
    __mmask8 vh_mask;
    anime_8way_context_overlay ctx __attribute__ ((aligned (64)));

@@ -209,7 +209,7 @@ int scanhash_anime_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                   m512_const1_64( 0x0000000800000000 ) );
+                                   _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
    pdata[19] = n;
@@ -248,7 +248,7 @@ void anime_4way_hash( void *state, const void *input )
    __m256i* vhB = (__m256i*)vhashB;
    __m256i vh_mask;
    int h_mask;
-    const __m256i bit3_mask = m256_const1_64( 8 );
+    const __m256i bit3_mask = _mm256_set1_epi64x( 8 );
    const __m256i zero = _mm256_setzero_si256();
    anime_4way_context_overlay ctx __attribute__ ((aligned (64)));

@@ -388,7 +388,7 @@ int scanhash_anime_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                   m256_const1_64( 0x0000000400000000 ) );
+                                   _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
    pdata[19] = n;
--- a/algo/quark/hmq1725-4way.c
+++ b/algo/quark/hmq1725-4way.c
@@ -75,7 +75,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   uint32_t hash7 [16]    __attribute__ ((aligned (32)));
   hmq1725_8way_context_overlay ctx __attribute__ ((aligned (64)));
   __mmask8 vh_mask;
-   const __m512i vmask = m512_const1_64( 24 );
+   const __m512i vmask = _mm512_set1_epi64( 24 );
   const uint32_t mask = 24;
   __m512i* vh  = (__m512i*)vhash;
   __m512i* vhA = (__m512i*)vhashA;
@@ -593,7 +593,7 @@ int scanhash_hmq1725_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                   m512_const1_64( 0x0000000800000000 ) );
+                                   _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );

@@ -647,7 +647,7 @@ extern void hmq1725_4way_hash(void *state, const void *input)
   hmq1725_4way_context_overlay ctx __attribute__ ((aligned (64)));
   __m256i vh_mask;     
   int h_mask;
-   const __m256i vmask = m256_const1_64( 24 );
+   const __m256i vmask = _mm256_set1_epi64x( 24 );
   const uint32_t mask = 24;
   __m256i* vh  = (__m256i*)vhash;
   __m256i* vhA = (__m256i*)vhashA;
@@ -1041,7 +1041,7 @@ int scanhash_hmq1725_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                   m256_const1_64( 0x0000000400000000 ) );
+                                   _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
    pdata[19] = n;
--- a/algo/quark/quark-4way.c
+++ b/algo/quark/quark-4way.c
@@ -67,7 +67,7 @@ void quark_8way_hash( void *state, const void *input )
    __mmask8 vh_mask;
    quark_8way_ctx_holder ctx;
    const uint32_t mask = 8;
-    const __m512i bit3_mask = m512_const1_64( mask );
+    const __m512i bit3_mask = _mm512_set1_epi64( mask );

    memcpy( &ctx, &quark_8way_ctx, sizeof(quark_8way_ctx) );

@@ -224,7 +224,7 @@ int scanhash_quark_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );

@@ -271,7 +271,7 @@ void quark_4way_hash( void *state, const void *input )
    __m256i vh_mask;
    int h_mask;
    quark_4way_ctx_holder ctx;
-    const __m256i bit3_mask = m256_const1_64( 8 );
+    const __m256i bit3_mask = _mm256_set1_epi64x( 8 );
    const __m256i zero = _mm256_setzero_si256();

    memcpy( &ctx, &quark_4way_ctx, sizeof(quark_4way_ctx) );
@@ -397,7 +397,7 @@ int scanhash_quark_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );

--- a/algo/ripemd/ripemd-hash-4way.c
+++ b/algo/ripemd/ripemd-hash-4way.c
@@ -47,7 +47,7 @@ static const uint32_t IV[5] =
 do{ \
   a = _mm_add_epi32( mm128_rol_32( _mm_add_epi32( _mm_add_epi32( \
                _mm_add_epi32( a, f( b ,c, d ) ), r ), \
-                                 m128_const1_64( k ) ), s ), e ); \
+                                 _mm_set1_epi64x( k ) ), s ), e ); \
   c = mm128_rol_32( c, 10 );\
 } while (0)

@@ -251,11 +251,11 @@ static void ripemd160_4way_round( ripemd160_4way_context *sc )

 void ripemd160_4way_init( ripemd160_4way_context *sc )
 {
-   sc->val[0] = m128_const1_64( 0x6745230167452301 );
-   sc->val[1] = m128_const1_64( 0xEFCDAB89EFCDAB89 );
-   sc->val[2] = m128_const1_64( 0x98BADCFE98BADCFE );
-   sc->val[3] = m128_const1_64( 0x1032547610325476 );
-   sc->val[4] = m128_const1_64( 0xC3D2E1F0C3D2E1F0 );
+   sc->val[0] = _mm_set1_epi64x( 0x6745230167452301 );
+   sc->val[1] = _mm_set1_epi64x( 0xEFCDAB89EFCDAB89 );
+   sc->val[2] = _mm_set1_epi64x( 0x98BADCFE98BADCFE );
+   sc->val[3] = _mm_set1_epi64x( 0x1032547610325476 );
+   sc->val[4] = _mm_set1_epi64x( 0xC3D2E1F0C3D2E1F0 );
   sc->count_high = sc->count_low = 0;
 }

@@ -347,7 +347,7 @@ void ripemd160_4way_close( ripemd160_4way_context  *sc, void *dst )
 do{ \
   a = _mm256_add_epi32( mm256_rol_32( _mm256_add_epi32( _mm256_add_epi32( \
                _mm256_add_epi32( a, f( b ,c, d ) ), r ), \
-                                 m256_const1_64( k ) ), s ), e ); \
+                                 _mm256_set1_epi64x( k ) ), s ), e ); \
   c = mm256_rol_32( c, 10 );\
 } while (0)
    
@@ -552,11 +552,11 @@ static void ripemd160_8way_round( ripemd160_8way_context *sc )

 void ripemd160_8way_init( ripemd160_8way_context *sc )
 {
-   sc->val[0] = m256_const1_64( 0x6745230167452301 );
-   sc->val[1] = m256_const1_64( 0xEFCDAB89EFCDAB89 );
-   sc->val[2] = m256_const1_64( 0x98BADCFE98BADCFE );
-   sc->val[3] = m256_const1_64( 0x1032547610325476 );
-   sc->val[4] = m256_const1_64( 0xC3D2E1F0C3D2E1F0 );
+   sc->val[0] = _mm256_set1_epi64x( 0x6745230167452301 );
+   sc->val[1] = _mm256_set1_epi64x( 0xEFCDAB89EFCDAB89 );
+   sc->val[2] = _mm256_set1_epi64x( 0x98BADCFE98BADCFE );
+   sc->val[3] = _mm256_set1_epi64x( 0x1032547610325476 );
+   sc->val[4] = _mm256_set1_epi64x( 0xC3D2E1F0C3D2E1F0 );
   sc->count_high = sc->count_low = 0;
 }

@@ -649,7 +649,7 @@ void ripemd160_8way_close( ripemd160_8way_context  *sc, void *dst )
 do{ \
   a = _mm512_add_epi32( mm512_rol_32( _mm512_add_epi32( _mm512_add_epi32( \
                _mm512_add_epi32( a, f( b ,c, d ) ), r ), \
-                                 m512_const1_64( k ) ), s ), e ); \
+                                 _mm512_set1_epi64( k ) ), s ), e ); \
   c = mm512_rol_32( c, 10 );\
 } while (0)

@@ -853,11 +853,11 @@ static void ripemd160_16way_round( ripemd160_16way_context *sc )

 void ripemd160_16way_init( ripemd160_16way_context *sc )
 {
-   sc->val[0] = m512_const1_64( 0x6745230167452301 );
-   sc->val[1] = m512_const1_64( 0xEFCDAB89EFCDAB89 );
-   sc->val[2] = m512_const1_64( 0x98BADCFE98BADCFE );
-   sc->val[3] = m512_const1_64( 0x1032547610325476 );
-   sc->val[4] = m512_const1_64( 0xC3D2E1F0C3D2E1F0 );
+   sc->val[0] = _mm512_set1_epi64( 0x6745230167452301 );
+   sc->val[1] = _mm512_set1_epi64( 0xEFCDAB89EFCDAB89 );
+   sc->val[2] = _mm512_set1_epi64( 0x98BADCFE98BADCFE );
+   sc->val[3] = _mm512_set1_epi64( 0x1032547610325476 );
+   sc->val[4] = _mm512_set1_epi64( 0xC3D2E1F0C3D2E1F0 );
   sc->count_high = sc->count_low = 0;
 }

@@ -902,7 +902,7 @@ void ripemd160_16way_close( ripemd160_16way_context  *sc, void *dst )
   const int pad = block_size - 8;

   ptr = (unsigned)sc->count_low & ( block_size - 1U);
-   sc->buf[ ptr>>2 ] = m512_const1_32( 0x80 );
+   sc->buf[ ptr>>2 ] = _mm512_set1_epi32( 0x80 );
   ptr += 4;

   if ( ptr > pad )
--- a/algo/sha/sha-hash-4way.h
+++ b/algo/sha/sha-hash-4way.h
@@ -67,7 +67,7 @@ void sha256_4way_prehash_3rounds( __m128i *state_mid, __m128i *X,
 void sha256_4way_final_rounds( __m128i *state_out, const __m128i *data,
        const __m128i *state_in, const __m128i *state_mid, const __m128i *X );
 int sha256_4way_transform_le_short( __m128i *state_out, const __m128i *data,
-                                     const __m128i *state_in );
+                                   const __m128i *state_in, const uint32_t *target );

 #endif  // SSE2

@@ -95,7 +95,7 @@ void sha256_8way_prehash_3rounds( __m256i *state_mid, __m256i *X,
 void sha256_8way_final_rounds( __m256i *state_out, const __m256i *data,
        const __m256i *state_in, const __m256i *state_mid, const __m256i *X );
 int sha256_8way_transform_le_short( __m256i *state_out, const __m256i *data,
-                                     const __m256i *state_in );
+                             const __m256i *state_in, const uint32_t *target );

 #endif  // AVX2

@@ -123,7 +123,7 @@ void sha256_16way_final_rounds( __m512i *state_out, const __m512i *data,
        const __m512i *state_in, const __m512i *state_mid, const __m512i *X );

 int sha256_16way_transform_le_short( __m512i *state_out, const __m512i *data,
-                                     const __m512i *state_in );
+                            const __m512i *state_in, const uint32_t *target );

 #endif // AVX512

--- a/algo/sha/sha2.c
+++ b/algo/sha/sha2.c
@@ -658,43 +658,14 @@ int scanhash_sha256d_pooler( struct work *work,	uint32_t max_nonce,
 	return 0;
 }

-/*
-int scanhash_SHA256d( struct work *work, const uint32_t max_nonce,
-                      uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t _ALIGN(128) hash[8];
-   uint32_t _ALIGN(64) data[20];
-   uint32_t *pdata = work->data;
-   const uint32_t *ptarget = work->target;
-   uint32_t n = pdata[19] - 1;
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t Htarg = ptarget[7];
-   int thr_id = mythr->id;
-
-   memcpy( data, pdata, 80 );
-
-   do {
-      data[19] = ++n;
-      sha256d( (unsigned char*)hash, (const unsigned char*)data, 80 );
-      if ( unlikely( swab32( hash[7] ) <= Htarg ) )
-      {
-         pdata[19] = n;
-         sha256d_80_swap(hash, pdata);
-         if ( fulltest( hash, ptarget ) && !opt_benchmark )
-            submit_solution( work, hash, mythr );
-      }
-   } while ( likely( n < max_nonce && !work_restart[thr_id].restart ) );
-   *hashes_done = n - first_nonce + 1;
-   pdata[19] = n;
-   return 0;
-}
-*/
-
 bool register_sha256d_algo( algo_gate_t* gate )
 {
   gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
 #if defined(SHA256D_16WAY)
   gate->scanhash = (void*)&scanhash_sha256d_16way;
+#elif defined(SHA256D_SHA)
+   gate->optimizations = SHA_OPT;
+   gate->scanhash = (void*)&scanhash_sha256d_sha;
 //#elif defined(SHA256D_8WAY)
 //   gate->scanhash = (void*)&scanhash_sha256d_8way;
 #else
--- a/algo/sha/sha256-hash-4way.c
+++ b/algo/sha/sha256-hash-4way.c
--- a/algo/sha/sha256-hash.c
+++ b/algo/sha/sha256-hash.c
@@ -50,65 +50,6 @@ void sha256_update( sha256_context *ctx, const void *data, size_t len )
   memcpy( ctx->buf, src, len );
 }

-#if 0
-void sha256_final( sha256_context *ctx, uint32_t *hash )
-{
-   size_t r;
-
-
-   /* Figure out how many bytes we have buffered. */
-   r = ctx->count & 0x3f;
-//   r = ( ctx->count >> 3 ) & 0x3f;
-
-//printf("final: count= %d, r= %d\n", ctx->count, r );
-   
-   /* Pad to 56 mod 64, transforming if we finish a block en route. */
-   if ( r < 56 )
-   {
-      /* Pad to 56 mod 64. */
-      memcpy( &ctx->buf[r], SHA256_PAD, 56 - r );
-   }
-   else
-   {
-      /* Finish the current block and mix. */
-      memcpy( &ctx->buf[r], SHA256_PAD, 64 - r );
-      sha256_transform_be( ctx->state, (uint32_t*)ctx->buf, ctx->state );
-
-//      SHA256_Transform(ctx->state, ctx->buf, &tmp32[0], &tmp32[64]);
-
-      /* The start of the final block is all zeroes. */
-      memset( &ctx->buf[0], 0, 56 );
-   }
-
-   /* Add the terminating bit-count. */
-   ctx->buf[56] = bswap_64( ctx->count << 3 );
-//   ctx->buf[56] = bswap_64( ctx->count );
-//   be64enc( &ctx->buf[56], ctx->count );
-
-   /* Mix in the final block. */
-   sha256_transform_be( ctx->state, (uint32_t*)ctx->buf, ctx->state );
-
-//   SHA256_Transform(ctx->state, ctx->buf, &tmp32[0], &tmp32[64]);
-
-   for ( int i = 0; i < 8; i++ )  hash[i] = bswap_32( ctx->state[i] );
-   
-//   for ( int i = 0; i < 8; i++ )  be32enc( hash + 4*i, ctx->state + i );
-
-/*
-//   be32enc_vect(digest, ctx->state, 4);
-//   be32enc_vect(uint8_t * dst, const uint32_t * src, size_t len)
-   // Encode vector, two words at a time. 
-   do {
-      be32enc(&dst[0], src[0]);
-      be32enc(&dst[4], src[1]);
-      src += 2;
-      dst += 8;
-   } while (--len);
-*/
-
-}
-#endif
-
 void sha256_final( sha256_context *ctx, void *hash )
 {
   int ptr = ctx->count & 0x3f;
--- a/algo/sha/sha256d-4way.c
+++ b/algo/sha/sha256d-4way.c
@@ -3,10 +3,194 @@
 #include <stdint.h>
 #include <string.h>
 #include <stdio.h>
+#include "sha256-hash.h"
 #include "sha-hash-4way.h"

+static const uint32_t sha256_iv[8] __attribute__ ((aligned (32))) =
+{
+   0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+   0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+};
+
+#if defined(SHA256D_SHA)
+
+int scanhash_sha256d_sha( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t block0[16]   __attribute__ ((aligned (64)));
+   uint32_t block1[16]   __attribute__ ((aligned (64)));
+   uint32_t hash0[8]     __attribute__ ((aligned (32)));
+   uint32_t hash1[8]     __attribute__ ((aligned (32)));
+   uint32_t mstate[8]    __attribute__ ((aligned (32)));
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 2;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m128i shuf_bswap32 =
+           _mm_set_epi64x( 0x0c0d0e0f08090a0bULL, 0x0405060700010203ULL );
+
+   // hash first 64 bytes of data
+   sha256_opt_transform_le( mstate, pdata, sha256_iv );
+
+   do
+   {
+      // 1. final 16 bytes of data, with padding
+      memcpy( block0, pdata + 16, 16 );
+      memcpy( block1, pdata + 16, 16 );
+      block0[ 3] = n;
+      block1[ 3] = n+1;
+      block0[ 4] = block1[ 4] = 0x80000000;
+      memset( block0 + 5, 0, 40 );
+      memset( block1 + 5, 0, 40 );
+      block0[15] = block1[15] = 80*8; // bit count
+      sha256_ni2way_transform_le( hash0, hash1, block0, block1,
+                                  mstate, mstate );
+
+      // 2. 32 byte hash from 1.
+      memcpy( block0, hash0, 32 );
+      memcpy( block1, hash1, 32 );
+      block0[ 8] = block1[ 8] = 0x80000000;
+      memset( block0 + 9, 0, 24 );
+      memset( block1 + 9, 0, 24 );
+      block0[15] = block1[15] = 32*8; // bit count
+      sha256_ni2way_transform_le( hash0, hash1, block0, block1,
+                                  sha256_iv, sha256_iv );
+
+      if ( unlikely( bswap_32( hash0[7] ) <= ptarget[7] ) )
+      {
+          casti_m128i( hash0, 0 ) =
+               _mm_shuffle_epi8( casti_m128i( hash0, 0 ), shuf_bswap32 );
+          casti_m128i( hash0, 1 ) =
+               _mm_shuffle_epi8( casti_m128i( hash0, 1 ), shuf_bswap32 );
+          if ( likely( valid_hash( hash0, ptarget ) && !bench ) )
+          {
+             pdata[19] = n;
+             submit_solution( work, hash0, mythr );
+          }
+      }
+
+      if ( unlikely( bswap_32( hash1[7] ) <= ptarget[7] ) )
+      {
+         casti_m128i( hash1, 0 ) =
+               _mm_shuffle_epi8( casti_m128i( hash1, 0 ), shuf_bswap32 );
+         casti_m128i( hash1, 1 ) =
+               _mm_shuffle_epi8( casti_m128i( hash1, 1 ), shuf_bswap32 );
+         if ( likely( valid_hash( hash1, ptarget ) && !bench ) )
+         {
+            pdata[19] = n+1;
+            submit_solution( work, hash1, mythr );
+         }
+      }
+      n += 2;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#endif
+
 #if defined(SHA256D_16WAY)

+int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m512i  hash32[8]    __attribute__ ((aligned (128)));
+   __m512i  block[16]    __attribute__ ((aligned (64)));
+   __m512i  buf[16]      __attribute__ ((aligned (64)));
+   __m512i  mstate1[8]   __attribute__ ((aligned (64)));
+   __m512i  mstate2[8]   __attribute__ ((aligned (64)));
+   __m512i  istate[8]    __attribute__ ((aligned (64)));
+   __m512i  mexp_pre[8]  __attribute__ ((aligned (64)));
+   uint32_t phash[8]     __attribute__ ((aligned (32)));
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   uint32_t *hash32_d7 = (uint32_t*)&(hash32[7]);
+   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 16;
+   const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const __m512i sixteen = _mm512_set1_epi32( 16 );
+   const bool bench = opt_benchmark;
+   const __m256i bswap_shuf = mm256_bcast_m128( _mm_set_epi64x(
+                                0x0c0d0e0f08090a0b, 0x0405060700010203 ) );
+
+   // prehash first block directly from pdata
+   sha256_transform_le( phash, pdata, sha256_iv );
+
+   // vectorize block 0 hash for second block
+   mstate1[0] = _mm512_set1_epi32( phash[0] );
+   mstate1[1] = _mm512_set1_epi32( phash[1] );
+   mstate1[2] = _mm512_set1_epi32( phash[2] );
+   mstate1[3] = _mm512_set1_epi32( phash[3] );
+   mstate1[4] = _mm512_set1_epi32( phash[4] );
+   mstate1[5] = _mm512_set1_epi32( phash[5] );
+   mstate1[6] = _mm512_set1_epi32( phash[6] );
+   mstate1[7] = _mm512_set1_epi32( phash[7] );
+
+   // second message block data, with nonce & padding   
+   buf[0] = _mm512_set1_epi32( pdata[16] );
+   buf[1] = _mm512_set1_epi32( pdata[17] );
+   buf[2] = _mm512_set1_epi32( pdata[18] );
+   buf[3] = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
+                              n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
+   buf[4] = last_byte;
+   memset_zero_512( buf+5, 10 );
+   buf[15] = _mm512_set1_epi32( 80*8 ); // bit count
+
+   // partially pre-expand & prehash second message block, avoiding the nonces
+   sha256_16way_prehash_3rounds( mstate2, mexp_pre, buf, mstate1 );
+
+   // vectorize IV for 2nd & 3rd sha256
+   istate[0] = _mm512_set1_epi32( sha256_iv[0] );
+   istate[1] = _mm512_set1_epi32( sha256_iv[1] );
+   istate[2] = _mm512_set1_epi32( sha256_iv[2] );
+   istate[3] = _mm512_set1_epi32( sha256_iv[3] );
+   istate[4] = _mm512_set1_epi32( sha256_iv[4] );
+   istate[5] = _mm512_set1_epi32( sha256_iv[5] );
+   istate[6] = _mm512_set1_epi32( sha256_iv[6] );
+   istate[7] = _mm512_set1_epi32( sha256_iv[7] );
+
+   // initialize padding for 2nd sha256
+   block[ 8] = last_byte;
+   memset_zero_512( block + 9, 6 );
+   block[15] = _mm512_set1_epi32( 32*8 ); // bit count
+
+   do
+   {
+      sha256_16way_final_rounds( block, buf, mstate1, mstate2, mexp_pre );
+
+      if ( sha256_16way_transform_le_short( hash32, block, istate, ptarget ) )
+      {
+         for ( int lane = 0; lane < 16; lane++ )
+         if ( bswap_32( hash32_d7[ lane ] ) <= targ32_d7 )
+         {
+            extr_lane_16x32( phash, hash32, lane, 256 );
+            casti_m256i( phash, 0 ) =
+                _mm256_shuffle_epi8( casti_m256i( phash, 0 ), bswap_shuf );
+            if ( likely( valid_hash( phash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, phash, mythr );
+            }
+         }
+      }
+      buf[3] = _mm512_add_epi32( buf[3], sixteen );
+      n += 16;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+
+/*
 int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
@@ -28,32 +212,32 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
   __m512i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m512i last_byte = m512_const1_32( 0x80000000 );
-   const __m512i sixteen = m512_const1_32( 16 );
+   const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
+   const __m512i sixteen = _mm512_set1_epi32( 16 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m512_const1_32( pdata[i] );
+       vdata[i] = _mm512_set1_epi32( pdata[i] );

   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_512( vdata+16 + 5, 10 );
-   vdata[16+15] = m512_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm512_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_512( block + 9, 6 );
-   block[15] = m512_const1_32( 32*8 ); // bit count
+   block[15] = _mm512_set1_epi32( 32*8 ); // bit count
   
   // initialize state
-   initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m512_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m512_const1_64( 0x510E527F510E527F );
-   initstate[5] = m512_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm512_set1_epi64( 0x6A09E6676A09E667 );
+   initstate[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm512_set1_epi64( 0x510E527F510E527F );
+   initstate[5] = _mm512_set1_epi64( 0x9B05688C9B05688C );
+   initstate[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );

   sha256_16way_transform_le( midstate1, vdata, initstate );

@@ -67,20 +251,18 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
                                 mexp_pre );

      // 2. 32 byte hash from 1.
-      if ( sha256_16way_transform_le_short( hash32, block, initstate ) )
-      {
-         // byte swap final hash for testing
-         mm512_block_bswap_32( hash32, hash32 );
+      sha256_16way_transform_le( hash32, block, initstate );
+      // byte swap final hash for testing
+      mm512_block_bswap_32( hash32, hash32 );

-         for ( int lane = 0; lane < 16; lane++ )
-         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      for ( int lane = 0; lane < 16; lane++ )
+      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      {
+         extr_lane_16x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
         {
-            extr_lane_16x32( lane_hash, hash32, lane, 256 );
-            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
-            {
-               pdata[19] = n + lane;
-               submit_solution( work, lane_hash, mythr );
-            }
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
         }
      }
      *noncev = _mm512_add_epi32( *noncev, sixteen );
@@ -90,6 +272,7 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
   *hashes_done = n - first_nonce;
   return 0;
 }
+*/

 #endif

@@ -104,7 +287,7 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
   __m256i  initstate[8] __attribute__ ((aligned (32)));
   __m256i  midstate1[8] __attribute__ ((aligned (32)));
   __m256i  midstate2[8] __attribute__ ((aligned (32)));
-   __m256i  mexp_pre[16] __attribute__ ((aligned (32)));
+   __m256i  mexp_pre[8]  __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
@@ -116,31 +299,31 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
   __m256i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m256i last_byte = m256_const1_32( 0x80000000 );
-   const __m256i eight = m256_const1_32( 8 );
+   const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
+   const __m256i eight = _mm256_set1_epi32( 8 );

   for ( int i = 0; i < 19; i++ )
-      vdata[i] = m256_const1_32( pdata[i] );
+      vdata[i] = _mm256_set1_epi32( pdata[i] );

   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_256( vdata+16 + 5, 10 );
-   vdata[16+15] = m256_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm256_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_256( block + 9, 6 );
-   block[15] = m256_const1_32( 32*8 ); // bit count
+   block[15] = _mm256_set1_epi32( 32*8 ); // bit count
   
   // initialize state
-   initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m256_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m256_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m256_const1_64( 0x510E527F510E527F );
-   initstate[5] = m256_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
+   initstate[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm256_set1_epi64x( 0x510E527F510E527F );
+   initstate[5] = _mm256_set1_epi64x( 0x9B05688C9B05688C );
+   initstate[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );

   sha256_8way_transform_le( midstate1, vdata, initstate );
   
@@ -154,21 +337,18 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
                                mexp_pre );

      // 2. 32 byte hash from 1.
-      if ( unlikely(
-               sha256_8way_transform_le_short( hash32, block, initstate ) ) )
-      {
-         // byte swap final hash for testing
-         mm256_block_bswap_32( hash32, hash32 );
+      sha256_8way_transform_le( hash32, block, initstate );
+      // byte swap final hash for testing
+      mm256_block_bswap_32( hash32, hash32 );

-         for ( int lane = 0; lane < 8; lane++ )
-         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      {
+         extr_lane_8x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
         {
-            extr_lane_8x32( lane_hash, hash32, lane, 256 );
-            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
-            {
-               pdata[19] = n + lane;
-               submit_solution( work, lane_hash, mythr );
-            }
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
         }
       }
       *noncev = _mm256_add_epi32( *noncev, eight );
@@ -191,8 +371,6 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
   __m128i  hash32[8]     __attribute__ ((aligned (32)));
   __m128i  initstate[8]  __attribute__ ((aligned (32)));
   __m128i  midstate1[8]   __attribute__ ((aligned (32)));
-   __m128i  midstate2[8]  __attribute__ ((aligned (32)));
-   __m128i  mexp_pre[16]  __attribute__ ((aligned (32)));
   uint32_t lane_hash[8]  __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
@@ -204,59 +382,53 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
   __m128i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m128i last_byte = m128_const1_32( 0x80000000 );
-   const __m128i four = m128_const1_32( 4 );
+   const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
+   const __m128i four = _mm_set1_epi32( 4 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m128_const1_32( pdata[i] );
+       vdata[i] = _mm_set1_epi32( pdata[i] );

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_128( vdata+16 + 5, 10 );
-   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_128( block + 9, 6 );
-   block[15] = m128_const1_32( 32*8 ); // bit count
+   block[15] = _mm_set1_epi32( 32*8 ); // bit count

   // initialize state
-   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m128_const1_64( 0x510E527F510E527F );
-   initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
+   initstate[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm_set1_epi64x( 0x510E527F510E527F );
+   initstate[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
+   initstate[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );

   // hash first 64 bytes of data
   sha256_4way_transform_le( midstate1, vdata, initstate );
-   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_4way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );

   do
   {
      // 1. final 16 bytes of data, with padding
-      sha256_4way_final_rounds( block, vdata+16, midstate1, midstate2,
-                                mexp_pre );
+      sha256_4way_transform_le( block, vdata+16, midstate1 );

      // 2. 32 byte hash from 1.
-      if ( unlikely(
-              sha256_4way_transform_le_short( hash32, block, initstate ) ) )
-      {
-         // byte swap final hash for testing
-         mm128_block_bswap_32( hash32, hash32 );
+      sha256_4way_transform_le( hash32, block, initstate );
+      // byte swap final hash for testing
+      mm128_block_bswap_32( hash32, hash32 );

-         for ( int lane = 0; lane < 4; lane++ )
-         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      for ( int lane = 0; lane < 4; lane++ )
+      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      {
+         extr_lane_4x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
         {
-            extr_lane_4x32( lane_hash, hash32, lane, 256 );
-            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
-            {
-               pdata[19] = n + lane;
-               submit_solution( work, lane_hash, mythr );
-            }
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
         }
      }
      *noncev = _mm_add_epi32( *noncev, four );
@@ -268,21 +440,3 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
 }

 #endif
-
-/*
-bool register_sha256d_algo( algo_gate_t* gate )
-{
-   gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
-#if defined(SHA256D_16WAY)
-   gate->scanhash = (void*)&scanhash_sha256d_16way;
-#elif defined(SHA256D_8WAY)
-   gate->scanhash = (void*)&scanhash_sha256d_8way;
-#elif defined(SHA256D_4WAY)
-   gate->scanhash = (void*)&scanhash_sha256d_4way;
-#endif
-   
-//   gate->hash     = (void*)&sha256d;
-   return true;
-};
-*/
-
--- a/algo/sha/sha256d-4way.h
+++ b/algo/sha/sha256d-4way.h
@@ -6,6 +6,8 @@

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
  #define SHA256D_16WAY 1
+#elif defined(__SHA__)
+  #define SHA256D_SHA 1
 #elif defined(__AVX2__)
  #define SHA256D_8WAY 1
 #else
@@ -32,15 +34,12 @@ int scanhash_sha256d_4way( struct work *work, uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr );
 #endif

+#if defined(SHA256D_SHA)

-/*
-#if defined(__SHA__)
-
-int scanhash_sha256d( struct work *work, uint32_t max_nonce,
-                      uint64_t *hashes_done, struct thr_info *mythr );
-
-#endif
-*/
+int scanhash_sha256d_sha( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr );
+
+#endif

 #endif

--- a/algo/sha/sha256dt.c
+++ b/algo/sha/sha256dt.c
@@ -0,0 +1,377 @@
+#include "algo-gate-api.h"
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+#include "sha256-hash.h"
+#include "sha-hash-4way.h"
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define SHA256DT_16WAY 1
+#elif defined(__SHA__)
+  #define SHA256DT_SHA 1
+#elif defined(__AVX2__)
+  #define SHA256DT_8WAY 1
+#else
+  #define SHA256DT_4WAY 1
+#endif
+
+static const uint32_t sha256dt_iv[8]  __attribute__ ((aligned (32))) =
+   {
+      0xdfa9bf2c, 0xb72074d4, 0x6bb01122, 0xd338e869,
+      0xaa3ff126, 0x475bbf30, 0x8fd52e5b, 0x9f75c9ad
+   };
+
+#if defined(SHA256DT_16WAY)
+
+int scanhash_sha256dt_16way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m512i  hash32[8]    __attribute__ ((aligned (128)));
+   __m512i  block[16]    __attribute__ ((aligned (64)));
+   __m512i  buf[16]      __attribute__ ((aligned (64)));
+   __m512i  mstate1[8]   __attribute__ ((aligned (64)));
+   __m512i  mstate2[8]   __attribute__ ((aligned (64)));
+   __m512i  istate[8]    __attribute__ ((aligned (64)));
+   __m512i  mexp_pre[8]  __attribute__ ((aligned (64)));
+   uint32_t phash[8]     __attribute__ ((aligned (32)));
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+//   uint32_t *hash32_d7 = (uint32_t*)&(hash32[7]);
+//   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 16;
+   const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const __m512i sixteen = _mm512_set1_epi32( 16 );
+   const bool bench = opt_benchmark;
+   const __m256i bswap_shuf = mm256_bcast_m128( _mm_set_epi64x(
+                                0x0c0d0e0f08090a0b, 0x0405060700010203 ) );
+
+   // prehash first block directly from pdata
+   sha256_transform_le( phash, pdata, sha256dt_iv );
+
+   // vectorize block 0 hash for second block
+   mstate1[0] = _mm512_set1_epi32( phash[0] );
+   mstate1[1] = _mm512_set1_epi32( phash[1] );
+   mstate1[2] = _mm512_set1_epi32( phash[2] );
+   mstate1[3] = _mm512_set1_epi32( phash[3] );
+   mstate1[4] = _mm512_set1_epi32( phash[4] );
+   mstate1[5] = _mm512_set1_epi32( phash[5] );
+   mstate1[6] = _mm512_set1_epi32( phash[6] );
+   mstate1[7] = _mm512_set1_epi32( phash[7] );
+
+   // second message block data, with nonce & padding
+   buf[0] = _mm512_set1_epi32( pdata[16] );
+   buf[1] = _mm512_set1_epi32( pdata[17] );
+   buf[2] = _mm512_set1_epi32( pdata[18] );
+   buf[3] = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
+                              n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
+   buf[4] = last_byte;
+   memset_zero_512( buf+5, 10 );
+   buf[15] = _mm512_set1_epi32( 0x480 ); // sha256dt funky bit count
+
+   // partially pre-expand & prehash second message block, avoiding the nonces
+   sha256_16way_prehash_3rounds( mstate2, mexp_pre, buf, mstate1 );
+
+   // vectorize IV for 2nd sha256
+   istate[0] = _mm512_set1_epi32( sha256dt_iv[0] );
+   istate[1] = _mm512_set1_epi32( sha256dt_iv[1] );
+   istate[2] = _mm512_set1_epi32( sha256dt_iv[2] );
+   istate[3] = _mm512_set1_epi32( sha256dt_iv[3] );
+   istate[4] = _mm512_set1_epi32( sha256dt_iv[4] );
+   istate[5] = _mm512_set1_epi32( sha256dt_iv[5] );
+   istate[6] = _mm512_set1_epi32( sha256dt_iv[6] );
+   istate[7] = _mm512_set1_epi32( sha256dt_iv[7] );
+
+   // initialize padding for 2nd sha256
+   block[ 8] = last_byte;
+   memset_zero_512( block+9, 6 );
+   block[15] = _mm512_set1_epi32( 0x300 ); // bit count
+
+   do
+   {
+      // finish second block with nonces
+      sha256_16way_final_rounds( block, buf, mstate1, mstate2, mexp_pre );
+      if ( unlikely( sha256_16way_transform_le_short(
+                                  hash32, block, istate, ptarget ) ) )
+      {
+         for ( int lane = 0; lane < 16; lane++ )
+//         if ( bswap_32( hash32_d7[ lane ] ) <= targ32_d7 )
+         {
+            extr_lane_16x32( phash, hash32, lane, 256 );
+            casti_m256i( phash, 0 ) =
+                   _mm256_shuffle_epi8( casti_m256i( phash, 0 ), bswap_shuf ); 
+            if ( likely( valid_hash( phash, ptarget ) && !bench ) )
+            {
+              pdata[19] = n + lane;
+              submit_solution( work, phash, mythr );
+            }
+         }
+      }
+      buf[3] = _mm512_add_epi32( buf[3], sixteen );
+      n += 16;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+   
+#elif defined(SHA256DT_SHA)
+
+int scanhash_sha256dt_sha( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t block0[16]   __attribute__ ((aligned (64)));
+   uint32_t block1[16]   __attribute__ ((aligned (64)));
+   uint32_t hash0[8]     __attribute__ ((aligned (32)));
+   uint32_t hash1[8]     __attribute__ ((aligned (32)));
+   uint32_t mstate[8]    __attribute__ ((aligned (32)));
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 2;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m128i shuf_bswap32 =
+           _mm_set_epi64x( 0x0c0d0e0f08090a0bULL, 0x0405060700010203ULL );
+
+   // hash first 64 bytes of data
+   sha256_opt_transform_le( mstate, pdata, sha256dt_iv );
+
+   do
+   {
+      // 1. final 16 bytes of data, with padding
+      memcpy( block0, pdata + 16, 16 );
+      memcpy( block1, pdata + 16, 16 );
+      block0[ 3] = n;
+      block1[ 3] = n+1;
+      block0[ 4] = block1[ 4] = 0x80000000;
+      memset( block0 + 5, 0, 40 );
+      memset( block1 + 5, 0, 40 );
+      block0[15] = block1[15] = 0x480; // funky bit count
+      sha256_ni2way_transform_le( hash0, hash1, block0, block1,
+                                  mstate, mstate );
+
+      // 2. 32 byte hash from 1.
+      memcpy( block0, hash0, 32 );
+      memcpy( block1, hash1, 32 );
+      block0[ 8] = block1[ 8] = 0x80000000;
+      memset( block0 + 9, 0, 24 );
+      memset( block1 + 9, 0, 24 );
+      block0[15] = block1[15] = 0x300; // bit count
+      sha256_ni2way_transform_le( hash0, hash1, block0, block1,
+                                  sha256dt_iv, sha256dt_iv );
+
+      if ( unlikely( bswap_32( hash0[7] ) <= ptarget[7] ) )
+      {
+          casti_m128i( hash0, 0 ) =
+               _mm_shuffle_epi8( casti_m128i( hash0, 0 ), shuf_bswap32 );
+          casti_m128i( hash0, 1 ) =
+               _mm_shuffle_epi8( casti_m128i( hash0, 1 ), shuf_bswap32 );
+          if ( likely( valid_hash( hash0, ptarget ) && !bench ) )
+          {
+             pdata[19] = n;
+             submit_solution( work, hash0, mythr );
+          }
+      }
+      if ( unlikely( bswap_32( hash1[7] ) <= ptarget[7] ) )
+      {
+         casti_m128i( hash1, 0 ) =
+               _mm_shuffle_epi8( casti_m128i( hash1, 0 ), shuf_bswap32 );
+         casti_m128i( hash1, 1 ) =
+               _mm_shuffle_epi8( casti_m128i( hash1, 1 ), shuf_bswap32 );
+         if ( likely( valid_hash( hash1, ptarget ) && !bench ) )
+         {
+            pdata[19] = n+1;
+            submit_solution( work, hash1, mythr );
+         }
+      }
+      n += 2;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#elif defined(SHA256DT_8WAY)
+
+int scanhash_sha256dt_8way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m256i  vdata[32]    __attribute__ ((aligned (64)));
+   __m256i  block[16]    __attribute__ ((aligned (32)));
+   __m256i  hash32[8]    __attribute__ ((aligned (32)));
+   __m256i  istate[8]    __attribute__ ((aligned (32)));
+   __m256i  mstate1[8]   __attribute__ ((aligned (32)));
+   __m256i  mstate2[8]   __attribute__ ((aligned (32)));
+   __m256i  mexp_pre[8]  __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   uint32_t n = first_nonce;
+   __m256i *noncev = vdata + 19;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
+   const __m256i eight = _mm256_set1_epi32( 8 );
+   const __m256i bswap_shuf = mm256_bcast_m128( _mm_set_epi64x(
+                                0x0c0d0e0f08090a0b, 0x0405060700010203 ) );
+
+   for ( int i = 0; i < 19; i++ )
+      vdata[i] = _mm256_set1_epi32( pdata[i] );
+
+   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
+
+   vdata[16+4] = last_byte;
+   memset_zero_256( vdata+16 + 5, 10 );
+   vdata[16+15] = _mm256_set1_epi32( 0x480 ); // sha256dt funky bit count
+
+   block[ 8] = last_byte;
+   memset_zero_256( block + 9, 6 );
+   block[15] = _mm256_set1_epi32( 0x300 ); // bit count
+   
+   // initialize state
+   istate[0] = _mm256_set1_epi64x( 0xdfa9bf2cdfa9bf2c );
+   istate[1] = _mm256_set1_epi64x( 0xb72074d4b72074d4 );
+   istate[2] = _mm256_set1_epi64x( 0x6bb011226bb01122 );
+   istate[3] = _mm256_set1_epi64x( 0xd338e869d338e869 );
+   istate[4] = _mm256_set1_epi64x( 0xaa3ff126aa3ff126 );
+   istate[5] = _mm256_set1_epi64x( 0x475bbf30475bbf30 );
+   istate[6] = _mm256_set1_epi64x( 0x8fd52e5b8fd52e5b );
+   istate[7] = _mm256_set1_epi64x( 0x9f75c9ad9f75c9ad );
+
+   sha256_8way_transform_le( mstate1, vdata, istate );
+
+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_8way_prehash_3rounds( mstate2, mexp_pre, vdata + 16, mstate1 );
+   
+   do
+   {
+      sha256_8way_final_rounds( block, vdata+16, mstate1, mstate2,
+                                mexp_pre );
+
+      if ( unlikely( sha256_8way_transform_le_short(
+                            hash32, block, istate, ptarget ) ) )
+      {
+         for ( int lane = 0; lane < 8; lane++ )
+         {
+            extr_lane_8x32( lane_hash, hash32, lane, 256 );
+            casti_m256i( lane_hash, 0 ) =
+               _mm256_shuffle_epi8( casti_m256i( lane_hash, 0 ), bswap_shuf );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
+         }
+      }
+      *noncev = _mm256_add_epi32( *noncev, eight );
+      n += 8;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#elif defined(SHA256DT_4WAY)
+
+int scanhash_sha256dt_4way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m128i  vdata[32]    __attribute__ ((aligned (64)));
+   __m128i  block[16]    __attribute__ ((aligned (32)));
+   __m128i  hash32[8]    __attribute__ ((aligned (32)));
+   __m128i  initstate[8] __attribute__ ((aligned (32)));
+   __m128i  midstate[8]  __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   uint32_t n = first_nonce;
+   __m128i *noncev = vdata + 19;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
+   const __m128i four = _mm_set1_epi32( 4 );
+
+   for ( int i = 0; i < 19; i++ )
+       vdata[i] = _mm_set1_epi32( pdata[i] );
+
+   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
+
+   vdata[16+4] = last_byte;
+   memset_zero_128( vdata+16 + 5, 10 );
+   vdata[16+15] = _mm_set1_epi32( 0x480 ); // sha256dt funky bit count
+
+   block[ 8] = last_byte;
+   memset_zero_128( block + 9, 6 );
+   block[15] = _mm_set1_epi32( 0x300 ); // bit count
+   
+   // initialize state
+   initstate[0] = _mm_set1_epi64x( 0xdfa9bf2cdfa9bf2c );
+   initstate[1] = _mm_set1_epi64x( 0xb72074d4b72074d4 );
+   initstate[2] = _mm_set1_epi64x( 0x6bb011226bb01122 );
+   initstate[3] = _mm_set1_epi64x( 0xd338e869d338e869 );
+   initstate[4] = _mm_set1_epi64x( 0xaa3ff126aa3ff126 );
+   initstate[5] = _mm_set1_epi64x( 0x475bbf30475bbf30 );
+   initstate[6] = _mm_set1_epi64x( 0x8fd52e5b8fd52e5b );
+   initstate[7] = _mm_set1_epi64x( 0x9f75c9ad9f75c9ad );
+
+   // hash first 64 bytes of data
+   sha256_4way_transform_le( midstate, vdata, initstate );
+
+   do
+   {
+      sha256_4way_transform_le( block,  vdata+16, midstate  );
+      sha256_4way_transform_le( hash32, block, initstate );
+
+//      if ( sha256_4way_transform_le_short( hash32, block, initstate, ptarget ) )
+//      {
+         mm128_block_bswap_32( hash32, hash32 );
+
+         for ( int lane = 0; lane < 4; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+         {
+            extr_lane_4x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
+         }
+//      }
+      *noncev = _mm_add_epi32( *noncev, four );
+      n += 4;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#endif
+
+bool register_sha256dt_algo( algo_gate_t* gate )
+{
+    gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
+#if defined(SHA256DT_16WAY)
+    gate->scanhash = (void*)&scanhash_sha256dt_16way;
+#elif defined(SHA256DT_SHA)
+    gate->optimizations = SHA_OPT;
+    gate->scanhash = (void*)&scanhash_sha256dt_sha;
+#elif defined(SHA256DT_8WAY)
+    gate->scanhash = (void*)&scanhash_sha256dt_8way;
+#else
+    gate->scanhash = (void*)&scanhash_sha256dt_4way;
+#endif
+    return true;
+}
+
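Review note on the second-block setup used by the scanhash functions above: every variant builds the same 16-word block from header words 16..18, the nonce, a 0x80000000 padding word, ten zero words and a bit count. The scalar sketch below only illustrates that layout. `build_second_block` is a hypothetical helper, not a function in cpuminer-opt, and the words are left in host order the way the `_le` transforms in this diff receive them.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper (not part of cpuminer-opt): fill the second 64-byte
// message block for an 80-byte header, as 16 host-order 32-bit words.
// header    : the 20-word work data (pdata in the diff)
// nonce     : value placed in word 3 of the block (header word 19)
// bit_count : 80*8 for plain sha256d; this diff uses 0x480 for sha256dt
static void build_second_block( uint32_t blk[16], const uint32_t *header,
                                uint32_t nonce, uint32_t bit_count )
{
   memcpy( blk, header + 16, 12 );   // header words 16..18
   blk[3]  = nonce;                  // header word 19 is the nonce
   blk[4]  = 0x80000000u;            // padding: one 1 bit, then zeros
   memset( blk + 5, 0, 40 );         // words 5..14 are zero
   blk[15] = bit_count;              // message length field
}
```

Calling `build_second_block( blk, pdata, n, 80*8 )` gives the layout set up for sha256d, where 80*8 = 640 = 0x280 is the standard SHA-256 length of an 80-byte message. Passing 0x480 instead reproduces the constant the diff labels the "funky bit count" for sha256dt.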
--- a/algo/sha/sha256q-4way.c
+++ b/algo/sha/sha256q-4way.c
@@ -68,7 +68,7 @@ int scanhash_sha256q_16way( struct work *work, const uint32_t max_nonce,
           submit_solution( work, lane_hash, mythr );
        }
      }
-      *noncev = _mm512_add_epi32( *noncev, m512_const1_32( 16 ) );
+      *noncev = _mm512_add_epi32( *noncev, _mm512_set1_epi32( 16 ) );
      n += 16;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
@@ -140,7 +140,7 @@ int scanhash_sha256q_8way( struct work *work, const uint32_t max_nonce,
           submit_solution( work, lane_hash, mythr );
        }
      }
-      *noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
+      *noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
      n += 8;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
--- a/algo/sha/sha256t-4way.c
+++ b/algo/sha/sha256t-4way.c
@@ -3,6 +3,7 @@
 #include <stdint.h>
 #include <string.h>
 #include <stdio.h>
+#include "sha256-hash.h"
 #include "sha-hash-4way.h"

 #if defined(SHA256T_16WAY)
@@ -10,83 +11,96 @@
 int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   __m512i  vdata[32]    __attribute__ ((aligned (128)));
+   __m512i  hash32[8]    __attribute__ ((aligned (128)));
   __m512i  block[16]    __attribute__ ((aligned (64)));
-   __m512i  hash32[8]    __attribute__ ((aligned (64)));
-   __m512i  initstate[8] __attribute__ ((aligned (64)));
-   __m512i  midstate1[8] __attribute__ ((aligned (64)));
-   __m512i  midstate2[8] __attribute__ ((aligned (64)));
-   __m512i  mexp_pre[16] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
-   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
+   __m512i  buf[16]      __attribute__ ((aligned (64)));
+   __m512i  mstate1[8]   __attribute__ ((aligned (64)));
+   __m512i  mstate2[8]   __attribute__ ((aligned (64)));
+   __m512i  istate[8]    __attribute__ ((aligned (64)));
+   __m512i  mexp_pre[8]  __attribute__ ((aligned (64)));
+   uint32_t phash[8]     __attribute__ ((aligned (32)));
+   static const uint32_t IV[8]  __attribute__ ((aligned (32))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
-   const uint32_t *ptarget = work->target;
+   uint32_t *ptarget = work->target;
+   uint32_t *hash32_d7 = (uint32_t*)&(hash32[7]);
   const uint32_t targ32_d7 = ptarget[7];
   const uint32_t first_nonce = pdata[19];
   const uint32_t last_nonce = max_nonce - 16;
+   const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
   uint32_t n = first_nonce;
-   __m512i *noncev = vdata + 19; 
   const int thr_id = mythr->id;
+   const __m512i sixteen = _mm512_set1_epi32( 16 );
   const bool bench = opt_benchmark;
-   const __m512i last_byte = m512_const1_32( 0x80000000 );
-   const __m512i sixteen = m512_const1_32( 16 );
+   const __m256i bswap_shuf = mm256_bcast_m128( _mm_set_epi64x(
+                                0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

-   for ( int i = 0; i < 19; i++ )
-      vdata[i] = m512_const1_32( pdata[i] );
+   // prehash first block directly from pdata
+   sha256_transform_le( phash, pdata, IV );

-   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
-                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
+   // vectorize block 0 hash for second block
+   mstate1[0] = _mm512_set1_epi32( phash[0] );
+   mstate1[1] = _mm512_set1_epi32( phash[1] );
+   mstate1[2] = _mm512_set1_epi32( phash[2] );
+   mstate1[3] = _mm512_set1_epi32( phash[3] );
+   mstate1[4] = _mm512_set1_epi32( phash[4] );
+   mstate1[5] = _mm512_set1_epi32( phash[5] );
+   mstate1[6] = _mm512_set1_epi32( phash[6] );
+   mstate1[7] = _mm512_set1_epi32( phash[7] );

-   vdata[16+4] = last_byte;
-   memset_zero_512( vdata+16 + 5, 10 );
-   vdata[16+15] = m512_const1_32( 80*8 ); // bit count
-   
+   // second message block data, with nonce & padding   
+   buf[0] = _mm512_set1_epi32( pdata[16] );
+   buf[1] = _mm512_set1_epi32( pdata[17] );
+   buf[2] = _mm512_set1_epi32( pdata[18] );
+   buf[3] = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
+                              n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
+   buf[4] = last_byte;
+   memset_zero_512( buf+5, 10 );
+   buf[15] = _mm512_set1_epi32( 80*8 ); // bit count
+
+   // partially pre-expand & prehash second message block, avoiding the nonces
+   sha256_16way_prehash_3rounds( mstate2, mexp_pre, buf, mstate1 );
+
+   // vectorize IV for 2nd & 3rd sha256
+   istate[0] = _mm512_set1_epi32( IV[0] );
+   istate[1] = _mm512_set1_epi32( IV[1] );
+   istate[2] = _mm512_set1_epi32( IV[2] );
+   istate[3] = _mm512_set1_epi32( IV[3] );
+   istate[4] = _mm512_set1_epi32( IV[4] );
+   istate[5] = _mm512_set1_epi32( IV[5] );
+   istate[6] = _mm512_set1_epi32( IV[6] );
+   istate[7] = _mm512_set1_epi32( IV[7] );
+
+   // initialize padding for 2nd & 3rd sha256
   block[ 8] = last_byte;
   memset_zero_512( block + 9, 6 );
-   block[15] = m512_const1_32( 32*8 ); // bit count
-   
-   initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m512_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m512_const1_64( 0x510E527F510E527F );
-   initstate[5] = m512_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );
-
-   sha256_16way_transform_le( midstate1, vdata, initstate );
-   
-   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_16way_prehash_3rounds( midstate2, mexp_pre, vdata+16, midstate1 );
+   block[15] = _mm512_set1_epi32( 32*8 ); // bit count

   do
   {
-      // 1. final 16 bytes of data, pre-padded
-      sha256_16way_final_rounds( block, vdata+16, midstate1, midstate2,
-                                 mexp_pre );
+      sha256_16way_final_rounds( block, buf, mstate1, mstate2, mexp_pre );

-      // 2. 32 byte hash from 1.
-      sha256_16way_transform_le( block, block, initstate );
+      sha256_16way_transform_le( block, block, istate );

-      // 3. 32 byte hash from 2.
-      if ( unlikely(
-               sha256_16way_transform_le_short( hash32, block, initstate ) ) )
+      if ( sha256_16way_transform_le_short( hash32, block, istate, ptarget ) )
      {
-         // byte swap final hash for testing
-         mm512_block_bswap_32( hash32, hash32 );    
-
         for ( int lane = 0; lane < 16; lane++ )
-         if ( hash32_d7[ lane ] <= targ32_d7 )
+         if ( bswap_32( hash32_d7[ lane ] ) <= targ32_d7 )
         {
-            extr_lane_16x32( lane_hash, hash32, lane, 256 );
-            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            extr_lane_16x32( phash, hash32, lane, 256 );
+            casti_m256i( phash, 0 ) =
+                _mm256_shuffle_epi8( casti_m256i( phash, 0 ), bswap_shuf );
+            if ( likely( valid_hash( phash, ptarget ) && !bench ) )
            {
               pdata[19] = n + lane;
-               submit_solution( work, lane_hash, mythr );
+               submit_solution( work, phash, mythr );
            }
         }
      }
-      *noncev = _mm512_add_epi32( *noncev, sixteen );
+      buf[3] = _mm512_add_epi32( buf[3], sixteen );
      n += 16;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
@@ -94,83 +108,80 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
   return 0;
 }

-
 #endif

 #if defined(SHA256T_8WAY)
-
+   
 int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
   __m256i  vdata[32]    __attribute__ ((aligned (64)));
   __m256i  block[16]    __attribute__ ((aligned (32)));
   __m256i  hash32[8]    __attribute__ ((aligned (32)));
-   __m256i  initstate[8] __attribute__ ((aligned (32)));
-   __m256i  midstate1[8] __attribute__ ((aligned (32)));
-   __m256i  midstate2[8] __attribute__ ((aligned (32)));
-   __m256i  mexp_pre[16] __attribute__ ((aligned (32)));
+   __m256i  istate[8]    __attribute__ ((aligned (32)));
+   __m256i  mstate1[8]   __attribute__ ((aligned (32)));
+   __m256i  mstate2[8]   __attribute__ ((aligned (32)));
+   __m256i  mexp_pre[8]  __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
   const uint32_t *ptarget = work->target;
-   const uint32_t targ32_d7 = ptarget[7];
   const uint32_t first_nonce = pdata[19];
   const uint32_t last_nonce = max_nonce - 8;
   uint32_t n = first_nonce;
   __m256i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m256i last_byte = m256_const1_32( 0x80000000 );
-   const __m256i eight = m256_const1_32( 8 );
+   const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
+   const __m256i eight = _mm256_set1_epi32( 8 );
+   const __m256i bswap_shuf = mm256_bcast_m128( _mm_set_epi64x(
+                                0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

   for ( int i = 0; i < 19; i++ )
-      vdata[i] = m256_const1_32( pdata[i] );
+      vdata[i] = _mm256_set1_epi32( pdata[i] );

   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_256( vdata+16 + 5, 10 );
-   vdata[16+15] = m256_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm256_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_256( block + 9, 6 );
-   block[15] = m256_const1_32( 32*8 ); // bit count
-   
-   // initialize state
-   initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m256_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m256_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m256_const1_64( 0x510E527F510E527F );
-   initstate[5] = m256_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );
+   block[15] = _mm256_set1_epi32( 32*8 ); // bit count

-   sha256_8way_transform_le( midstate1, vdata, initstate );
+   // initialize state
+   istate[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
+   istate[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
+   istate[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
+   istate[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
+   istate[4] = _mm256_set1_epi64x( 0x510E527F510E527F );
+   istate[5] = _mm256_set1_epi64x( 0x9B05688C9B05688C );
+   istate[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   istate[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );
+
+   sha256_8way_transform_le( mstate1, vdata, istate );

   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_8way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );
-   
+   sha256_8way_prehash_3rounds( mstate2, mexp_pre, vdata + 16, mstate1 );
+
   do
   {
      // 1. final 16 bytes of data, with padding
-      sha256_8way_final_rounds( block, vdata+16, midstate1, midstate2,
+      sha256_8way_final_rounds( block, vdata+16, mstate1, mstate2,
                                mexp_pre );

      // 2. 32 byte hash from 1.
-      sha256_8way_transform_le( block, block, initstate );
+      sha256_8way_transform_le( block, block, istate );

      // 3. 32 byte hash from 2.
-      if ( unlikely(
-               sha256_8way_transform_le_short( hash32, block, initstate ) ) )
+      if ( unlikely( sha256_8way_transform_le_short(
+                                    hash32, block, istate, ptarget ) ) )
      {
-         // byte swap final hash for testing
-         mm256_block_bswap_32( hash32, hash32 );
-
         for ( int lane = 0; lane < 8; lane++ )
-         if ( hash32_d7[ lane ] <= targ32_d7 )
         {
            extr_lane_8x32( lane_hash, hash32, lane, 256 );
+            casti_m256i( lane_hash, 0 ) =
+             _mm256_shuffle_epi8( casti_m256i( lane_hash, 0 ), bswap_shuf );
            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
            {
               pdata[19] = n + lane;
@@ -188,109 +199,18 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,

 #endif

-
 #if defined(SHA256T_4WAY)

-// Optimizations are slower with AVX/SSE2
-// https://github.com/JayDDee/cpuminer-opt/issues/344
-/*
-int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
-                           uint64_t *hashes_done, struct thr_info *mythr )
-{
-   __m128i  vdata[32]     __attribute__ ((aligned (64)));
-   __m128i  block[16]     __attribute__ ((aligned (32)));
-   __m128i  hash32[8]     __attribute__ ((aligned (32)));
-   __m128i  initstate[8]  __attribute__ ((aligned (32)));
-   __m128i  midstate1[8]  __attribute__ ((aligned (32)));
-   __m128i  midstate2[8]  __attribute__ ((aligned (32)));
-   __m128i  mexp_pre[16]  __attribute__ ((aligned (32)));
-   uint32_t lane_hash[8]  __attribute__ ((aligned (32)));
-   uint32_t *hash32_d7 = (uint32_t*)&( hash32[7] );
-   uint32_t *pdata = work->data;
-   const uint32_t *ptarget = work->target;
-   const uint32_t targ32_d7 = ptarget[7];
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 4;
-   uint32_t n = first_nonce;
-   __m128i *noncev = vdata + 19;
-   const int thr_id = mythr->id;
-   const bool bench = opt_benchmark;
-   const __m128i last_byte = m128_const1_32( 0x80000000 );
-   const __m128i four = m128_const1_32( 4 );
-
-   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m128_const1_32( pdata[i] );
-
-   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
-
-   vdata[16+4] = last_byte;
-   memset_zero_128( vdata+16 + 5, 10 );
-   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
-
-   block[ 8] = last_byte;
-   memset_zero_128( block + 9, 6 );
-   block[15] = m128_const1_32( 32*8 ); // bit count
-   
-   // initialize state
-   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m128_const1_64( 0x510E527F510E527F );
-   initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
-
-   // hash first 64 bytes of data
-   sha256_4way_transform_le( midstate1, vdata, initstate );
-
-   // Do 3 rounds on the first 12 bytes of the next block
-   sha256_4way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );
-
-   do
-   {
-      // 1. final 16 bytes of data, with padding
-      sha256_4way_final_rounds( block, vdata+16, midstate1, midstate2,
-                                mexp_pre );
-
-      // 2. 32 byte hash from 1.
-      sha256_4way_transform_le( block, block, initstate );
-
-      // 3. 32 byte hash from 2.
-      if ( unlikely(
-              sha256_4way_transform_le_short( hash32, block, initstate ) ) )
-      {   
-         // byte swap final hash for testing
-         mm128_block_bswap_32( hash32, hash32 );
-
-         for ( int lane = 0; lane < 4; lane++ )
-         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
-         {
-            extr_lane_4x32( lane_hash, hash32, lane, 256 );
-            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
-            {
-               pdata[19] = n + lane;
-               submit_solution( work, lane_hash, mythr );
-            }
-         }
-      }
-      *noncev = _mm_add_epi32( *noncev, four );
-      n += 4;
-   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
-   pdata[19] = n;
-   *hashes_done = n - first_nonce;
-   return 0;
-}
-*/
-
 int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
   __m128i  vdata[32]    __attribute__ ((aligned (64)));
   __m128i  block[16]    __attribute__ ((aligned (32)));
   __m128i  hash32[8]    __attribute__ ((aligned (32)));
-   __m128i  initstate[8] __attribute__ ((aligned (32)));
-   __m128i  midstate[8]  __attribute__ ((aligned (32)));
+   __m128i  istate[8]    __attribute__ ((aligned (32)));
+   __m128i  mstate[8]   __attribute__ ((aligned (32)));
+//   __m128i  mstate2[8]   __attribute__ ((aligned (32)));
+//   __m128i  mexp_pre[8]  __attribute__ ((aligned (32)));
   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
   uint32_t *pdata = work->data;
@@ -302,52 +222,61 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
   __m128i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m128i last_byte = m128_const1_32( 0x80000000 );
-   const __m128i four = m128_const1_32( 4 );
+   const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
+   const __m128i four = _mm_set1_epi32( 4 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m128_const1_32( pdata[i] );
+       vdata[i] = _mm_set1_epi32( pdata[i] );

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_128( vdata+16 + 5, 10 );
-   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_128( block + 9, 6 );
-   block[15] = m128_const1_32( 32*8 ); // bit count
+   block[15] = _mm_set1_epi32( 32*8 ); // bit count
   
   // initialize state
-   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m128_const1_64( 0x510E527F510E527F );
-   initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
+   istate[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
+   istate[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
+   istate[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
+   istate[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
+   istate[4] = _mm_set1_epi64x( 0x510E527F510E527F );
+   istate[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
+   istate[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   istate[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );

   // hash first 64 bytes of data
-   sha256_4way_transform_le( midstate, vdata, initstate );
+   sha256_4way_transform_le( mstate, vdata, istate );
+
+//   sha256_4way_prehash_3rounds( mstate2, mexp_pre, vdata + 16, mstate1 );

   do
   {
-      sha256_4way_transform_le( block,  vdata+16, midstate  );
-      sha256_4way_transform_le( block,  block,    initstate );
-      sha256_4way_transform_le( hash32, block,    initstate );
-      mm128_block_bswap_32( hash32, hash32 );
+//      sha256_4way_final_rounds( block, vdata+16, mstate1, mstate2,
+//                                mexp_pre );
+   
+      sha256_4way_transform_le( block,  vdata+16, mstate  );
+      sha256_4way_transform_le( block,  block, istate );
+      sha256_4way_transform_le( hash32, block, istate );

-      for ( int lane = 0; lane < 4; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
-      {
-         extr_lane_4x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+//      if ( unlikely( sha256_4way_transform_le_short(
+//                                  hash32, block, initstate, ptarget ) ))
+//      {
+         mm128_block_bswap_32( hash32, hash32 );
+         for ( int lane = 0; lane < 4; lane++ )
+         if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
         {
-            pdata[19] = n + lane;
-            submit_solution( work, lane_hash, mythr );
+            extr_lane_4x32( lane_hash, hash32, lane, 256 );
+            if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+            {
+               pdata[19] = n + lane;
+               submit_solution( work, lane_hash, mythr );
+            }
         }
-       }
+//       }
       *noncev = _mm_add_epi32( *noncev, four );
       n += 4;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
@@ -356,6 +285,5 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
   return 0;
 }

-
 #endif

--- a/algo/sha/sha256t.c
+++ b/algo/sha/sha256t.c
@@ -23,7 +23,7 @@ int scanhash_sha256t( struct work *work, uint32_t max_nonce,
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 1;
+   const uint32_t last_nonce = max_nonce - 2;
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
--- a/algo/sha/sha512-hash-4way.c
+++ b/algo/sha/sha512-hash-4way.c
@@ -155,14 +155,14 @@ sha512_8way_round( sha512_8way_context *ctx,  __m512i *in, __m512i r[8] )
   }
   else
   {
-      A = m512_const1_64( 0x6A09E667F3BCC908 );
-      B = m512_const1_64( 0xBB67AE8584CAA73B );
-      C = m512_const1_64( 0x3C6EF372FE94F82B );
-      D = m512_const1_64( 0xA54FF53A5F1D36F1 );
-      E = m512_const1_64( 0x510E527FADE682D1 );
-      F = m512_const1_64( 0x9B05688C2B3E6C1F );
-      G = m512_const1_64( 0x1F83D9ABFB41BD6B );
-      H = m512_const1_64( 0x5BE0CD19137E2179 );
+      A = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+      B = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+      C = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+      D = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+      E = _mm512_set1_epi64( 0x510E527FADE682D1 );
+      F = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+      G = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+      H = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
   }

   for ( i = 0; i < 80; i += 8 )
@@ -191,14 +191,14 @@ sha512_8way_round( sha512_8way_context *ctx,  __m512i *in, __m512i r[8] )
   else
   {
      ctx->initialized = true;
-      r[0] = _mm512_add_epi64( A, m512_const1_64( 0x6A09E667F3BCC908 ) );
-      r[1] = _mm512_add_epi64( B, m512_const1_64( 0xBB67AE8584CAA73B ) );
-      r[2] = _mm512_add_epi64( C, m512_const1_64( 0x3C6EF372FE94F82B ) );
-      r[3] = _mm512_add_epi64( D, m512_const1_64( 0xA54FF53A5F1D36F1 ) );
-      r[4] = _mm512_add_epi64( E, m512_const1_64( 0x510E527FADE682D1 ) );
-      r[5] = _mm512_add_epi64( F, m512_const1_64( 0x9B05688C2B3E6C1F ) );
-      r[6] = _mm512_add_epi64( G, m512_const1_64( 0x1F83D9ABFB41BD6B ) );
-      r[7] = _mm512_add_epi64( H, m512_const1_64( 0x5BE0CD19137E2179 ) );
+      r[0] = _mm512_add_epi64( A, _mm512_set1_epi64( 0x6A09E667F3BCC908 ) );
+      r[1] = _mm512_add_epi64( B, _mm512_set1_epi64( 0xBB67AE8584CAA73B ) );
+      r[2] = _mm512_add_epi64( C, _mm512_set1_epi64( 0x3C6EF372FE94F82B ) );
+      r[3] = _mm512_add_epi64( D, _mm512_set1_epi64( 0xA54FF53A5F1D36F1 ) );
+      r[4] = _mm512_add_epi64( E, _mm512_set1_epi64( 0x510E527FADE682D1 ) );
+      r[5] = _mm512_add_epi64( F, _mm512_set1_epi64( 0x9B05688C2B3E6C1F ) );
+      r[6] = _mm512_add_epi64( G, _mm512_set1_epi64( 0x1F83D9ABFB41BD6B ) );
+      r[7] = _mm512_add_epi64( H, _mm512_set1_epi64( 0x5BE0CD19137E2179 ) );
   }
 }

@@ -239,14 +239,11 @@ void sha512_8way_close( sha512_8way_context *sc, void *dst )
    unsigned ptr;
    const int buf_size = 128;
    const int pad = buf_size - 16;
-    const __m512i shuff_bswap64 = m512_const_64(
-                                    0x38393a3b3c3d3e3f, 0x3031323334353637,
-                                    0x28292a2b2c2d2e2f, 0x2021222324252627,
-                                    0x18191a1b1c1d1e1f, 0x1011121314151617,
-                                    0x08090a0b0c0d0e0f, 0x0001020304050607 );
+    const __m512i shuff_bswap64 = mm512_bcast_m128( _mm_set_epi64x(
+                                    0x08090a0b0c0d0e0f, 0x0001020304050607 ) );

    ptr = (unsigned)sc->count & (buf_size - 1U);
-    sc->buf[ ptr>>3 ] = m512_const1_64( 0x80 );
+    sc->buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
    ptr += 8;
    if ( ptr > pad )
    {
@@ -271,51 +268,56 @@ void sha512_8way_close( sha512_8way_context *sc, void *dst )

 // SHA-512 4 way 64 bit

+#define BSG5_0( x )     mm256_xor3( mm256_ror_64( x, 28 ), \
+                                    mm256_ror_64( x, 34 ), \
+                                    mm256_ror_64( x, 39 ) )
+
+#define BSG5_1( x )     mm256_xor3( mm256_ror_64( x, 14 ), \
+                                    mm256_ror_64( x, 18 ), \
+                                    mm256_ror_64( x, 41 ) )
+
+#define SSG5_0( x )     mm256_xor3( mm256_ror_64( x,  1 ), \
+                                    mm256_ror_64( x,  8 ), \
+                                    _mm256_srli_epi64( x, 7 ) ) 
+
+#define SSG5_1( x )     mm256_xor3( mm256_ror_64( x, 19 ), \
+                                    mm256_ror_64( x, 61 ), \
+                                    _mm256_srli_epi64( x, 6 ) )
+
+#if defined(__AVX512VL__)
+//TODO Enable for AVX10_256
+// 4 way is not used with AVX512 but will be with AVX10_256 when it
+// becomes available.
+
+#define CH( X, Y, Z )    _mm256_ternarylogic_epi64( X, Y, Z, 0xca )
+
+#define MAJ( X, Y, Z )   _mm256_ternarylogic_epi64( X, Y, Z, 0xe8 )
+   
+#define SHA3_4WAY_STEP( A, B, C, D, E, F, G, H, i ) \
+do { \
+  __m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[i] ); \
+  __m256i T1 = BSG5_1( E ); \
+  __m256i T2 = BSG5_0( A ); \
+  T0 = _mm256_add_epi64( T0, CH( E, F, G ) ); \
+  T1 = _mm256_add_epi64( T1, H ); \
+  T2 = _mm256_add_epi64( T2, MAJ( A, B, C ) ); \
+  T1 = _mm256_add_epi64( T1, T0 ); \
+  D  = _mm256_add_epi64( D,  T1 ); \
+  H  = _mm256_add_epi64( T1, T2 ); \
+} while (0)
+
+#else   // AVX2 only
+
 #define CH(X, Y, Z) \
   _mm256_xor_si256( _mm256_and_si256( _mm256_xor_si256( Y, Z ), X ), Z ) 

 #define MAJ(X, Y, Z) \
  _mm256_xor_si256( Y, _mm256_and_si256( X_xor_Y = _mm256_xor_si256( X, Y ), \
                                         Y_xor_Z ) )
-                    
-#define BSG5_0(x) \
-  mm256_ror_64( _mm256_xor_si256( mm256_ror_64( \
-                   _mm256_xor_si256( mm256_ror_64( x,  5 ), x ), 6 ), x ), 28 )
-
-#define BSG5_1(x) \
-  mm256_ror_64( _mm256_xor_si256( mm256_ror_64( \
-                   _mm256_xor_si256( mm256_ror_64( x, 23 ), x ), 4 ), x ), 14 )
-
-/*
-#define SSG5_0(x) \
-   _mm256_xor_si256( _mm256_xor_si256( \
-        mm256_ror_64(x,  1), mm256_ror_64(x,  8) ), _mm256_srli_epi64(x, 7) ) 
-
-#define SSG5_1(x) \
-   _mm256_xor_si256( _mm256_xor_si256( \
-        mm256_ror_64(x, 19), mm256_ror_64(x, 61) ), _mm256_srli_epi64(x, 6) )
-*/
-// Interleave SSG0 & SSG1 for better throughput.
-// return ssg0(w0) + ssg1(w1)
-static inline __m256i ssg512_add( __m256i w0, __m256i w1 )
-{
-   __m256i w0a, w1a, w0b, w1b;
-   w0a = mm256_ror_64( w0, 1 );
-   w1a = mm256_ror_64( w1,19 );
-   w0b = mm256_ror_64( w0, 8 );
-   w1b = mm256_ror_64( w1,61 );
-   w0a = _mm256_xor_si256( w0a, w0b );
-   w1a = _mm256_xor_si256( w1a, w1b );
-   w0b = _mm256_srli_epi64( w0, 7 );
-   w1b = _mm256_srli_epi64( w1, 6 );
-   w0a = _mm256_xor_si256( w0a, w0b );
-   w1a = _mm256_xor_si256( w1a, w1b );
-   return _mm256_add_epi64( w0a, w1a );
-}

 #define SHA3_4WAY_STEP( A, B, C, D, E, F, G, H, i ) \
 do { \
-  __m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[ i ] ); \
+  __m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[i] ); \
  __m256i T1 = BSG5_1( E ); \
  __m256i T2 = BSG5_0( A ); \
  T0 = _mm256_add_epi64( T0, CH( E, F, G ) ); \
@@ -327,19 +329,27 @@ do { \
  H  = _mm256_add_epi64( T1, T2 ); \
 } while (0)

+#endif  // AVX512VL AVX10_256
+
 static void
 sha512_4way_round( sha512_4way_context *ctx,  __m256i *in, __m256i r[8] )
 {
   int i;
-   register __m256i A, B, C, D, E, F, G, H, X_xor_Y, Y_xor_Z;
+   register __m256i A, B, C, D, E, F, G, H;
+
+#if !defined(__AVX512VL__)
+// Disable for AVX10_256
+   __m256i X_xor_Y, Y_xor_Z;
+#endif
+
   __m256i W[80];

   mm256_block_bswap_64( W  , in );
   mm256_block_bswap_64( W+8, in+8 );

   for ( i = 16; i < 80; i++ )
-      W[i] = _mm256_add_epi64( ssg512_add( W[i-15], W[i-2] ),
-                               _mm256_add_epi64( W[ i- 7 ], W[ i-16 ] ) );
+       W[i] = mm256_add4_64( SSG5_0( W[i-15] ), SSG5_1( W[i-2] ),
+                             W[ i- 7 ], W[ i-16 ] );

   if ( ctx->initialized )
   {
@@ -354,17 +364,20 @@ sha512_4way_round( sha512_4way_context *ctx,  __m256i *in, __m256i r[8] )
   }
   else
   {
-      A = m256_const1_64( 0x6A09E667F3BCC908 );
-      B = m256_const1_64( 0xBB67AE8584CAA73B );
-      C = m256_const1_64( 0x3C6EF372FE94F82B );
-      D = m256_const1_64( 0xA54FF53A5F1D36F1 );
-      E = m256_const1_64( 0x510E527FADE682D1 );
-      F = m256_const1_64( 0x9B05688C2B3E6C1F );
-      G = m256_const1_64( 0x1F83D9ABFB41BD6B );
-      H = m256_const1_64( 0x5BE0CD19137E2179 );
+      A = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+      B = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+      C = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+      D = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+      E = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+      F = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+      G = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+      H = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
   }

+#if !defined(__AVX512VL__)
+// Disable for AVX10_256
   Y_xor_Z = _mm256_xor_si256( B, C );
+#endif

   for ( i = 0; i < 80; i += 8 )
   {
@@ -392,14 +405,14 @@ sha512_4way_round( sha512_4way_context *ctx,  __m256i *in, __m256i r[8] )
   else
   {
      ctx->initialized = true;
-      r[0] = _mm256_add_epi64( A, m256_const1_64( 0x6A09E667F3BCC908 ) );
-      r[1] = _mm256_add_epi64( B, m256_const1_64( 0xBB67AE8584CAA73B ) );
-      r[2] = _mm256_add_epi64( C, m256_const1_64( 0x3C6EF372FE94F82B ) );
-      r[3] = _mm256_add_epi64( D, m256_const1_64( 0xA54FF53A5F1D36F1 ) );
-      r[4] = _mm256_add_epi64( E, m256_const1_64( 0x510E527FADE682D1 ) );
-      r[5] = _mm256_add_epi64( F, m256_const1_64( 0x9B05688C2B3E6C1F ) );
-      r[6] = _mm256_add_epi64( G, m256_const1_64( 0x1F83D9ABFB41BD6B ) );
-      r[7] = _mm256_add_epi64( H, m256_const1_64( 0x5BE0CD19137E2179 ) );
+      r[0] = _mm256_add_epi64( A, _mm256_set1_epi64x( 0x6A09E667F3BCC908 ) );
+      r[1] = _mm256_add_epi64( B, _mm256_set1_epi64x( 0xBB67AE8584CAA73B ) );
+      r[2] = _mm256_add_epi64( C, _mm256_set1_epi64x( 0x3C6EF372FE94F82B ) );
+      r[3] = _mm256_add_epi64( D, _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 ) );
+      r[4] = _mm256_add_epi64( E, _mm256_set1_epi64x( 0x510E527FADE682D1 ) );
+      r[5] = _mm256_add_epi64( F, _mm256_set1_epi64x( 0x9B05688C2B3E6C1F ) );
+      r[6] = _mm256_add_epi64( G, _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B ) );
+      r[7] = _mm256_add_epi64( H, _mm256_set1_epi64x( 0x5BE0CD19137E2179 ) );
   }
 }

@@ -440,13 +453,11 @@ void sha512_4way_close( sha512_4way_context *sc, void *dst )
    unsigned ptr;
    const int buf_size = 128;
    const int pad = buf_size - 16;
-    const __m256i shuff_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f,
-                                                 0x1011121314151617,
-                                                 0x08090a0b0c0d0e0f,
-                                                 0x0001020304050607 );
+    const __m256i shuff_bswap64 = mm256_bcast_m128( _mm_set_epi64x(
+                                    0x08090a0b0c0d0e0f, 0x0001020304050607 ) );

    ptr = (unsigned)sc->count & (buf_size - 1U);
-    sc->buf[ ptr>>3 ] = m256_const1_64( 0x80 );
+    sc->buf[ ptr>>3 ] = _mm256_set1_epi64x( 0x80 );
    ptr += 8;
    if ( ptr > pad )
    {
--- a/algo/sha/sha512256d-4way.c
+++ b/algo/sha/sha512256d-4way.c
@@ -0,0 +1,221 @@
+#include "algo-gate-api.h"
+#include "sha-hash-4way.h"
+#include <string.h>
+#include <stdint.h>
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#define SHA512256D_8WAY 1
+#elif defined(__AVX2__)
+#define SHA512256D_4WAY 1
+#endif
+
+#if defined(SHA512256D_8WAY)
+
+static void sha512256d_8way_init( sha512_8way_context *ctx )
+{
+  ctx->count = 0;
+  ctx->initialized = true;
+  ctx->val[0] = _mm512_set1_epi64( 0x22312194FC2BF72C );
+  ctx->val[1] = _mm512_set1_epi64( 0x9F555FA3C84C64C2 );
+  ctx->val[2] = _mm512_set1_epi64( 0x2393B86B6F53B151 );
+  ctx->val[3] = _mm512_set1_epi64( 0x963877195940EABD );
+  ctx->val[4] = _mm512_set1_epi64( 0x96283EE2A88EFFE3 );
+  ctx->val[5] = _mm512_set1_epi64( 0xBE5E1E2553863992 );
+  ctx->val[6] = _mm512_set1_epi64( 0x2B0199FC2C85B8AA );
+  ctx->val[7] = _mm512_set1_epi64( 0x0EB72DDC81C52CA2 );
+}
+
+int scanhash_sha512256d_8way( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr )
+{
+    uint64_t hash[8*8] __attribute__ ((aligned (128)));
+    uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+    sha512_8way_context ctx; 
+    uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+    uint64_t *hash_q3 = &(hash[3*8]);
+    uint32_t *pdata = work->data;
+    uint32_t *ptarget = work->target;
+    const uint64_t targ_q3 = ((uint64_t*)ptarget)[3];
+    const uint32_t first_nonce = pdata[19];
+    const uint32_t last_nonce = max_nonce - 8;
+    uint32_t n = first_nonce;
+    __m512i  *noncev = (__m512i*)vdata + 9;
+    const int thr_id = mythr->id;
+    const bool bench = opt_benchmark;
+    const __m512i eight = _mm512_set1_epi64( 0x0000000800000000 );
+
+    mm512_bswap32_intrlv80_8x64( vdata, pdata );
+    *noncev = mm512_intrlv_blend_32(
+                _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
+                                  n+3, 0, n+2, 0, n+1, 0, n  , 0 ), *noncev );
+    do
+    {
+       sha512256d_8way_init( &ctx );
+       sha512_8way_update( &ctx, vdata, 80 );
+       sha512_8way_close( &ctx, hash );        
+
+       sha512256d_8way_init( &ctx );
+       sha512_8way_update( &ctx, hash, 32 );
+       sha512_8way_close( &ctx, hash );
+
+       for ( int lane = 0; lane < 8; lane++ )
+       if ( unlikely( hash_q3[ lane ] <= targ_q3 && !bench ) )
+       {
+          extr_lane_8x64( lane_hash, hash, lane, 256 );
+          if ( valid_hash( lane_hash, ptarget ) && !bench )
+          {
+             pdata[19] = bswap_32( n + lane );
+             submit_solution( work, lane_hash, mythr );
+          }
+       }
+       *noncev = _mm512_add_epi32( *noncev, eight );
+       n += 8;
+    } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
+
+    pdata[19] = n;
+    *hashes_done = n - first_nonce;
+    return 0;
+}
+
+#elif defined(SHA512256D_4WAY)
+
+static void sha512256d_4way_init( sha512_4way_context *ctx )
+{
+  ctx->count = 0;
+  ctx->initialized = true;
+  ctx->val[0] = _mm256_set1_epi64x( 0x22312194FC2BF72C );
+  ctx->val[1] = _mm256_set1_epi64x( 0x9F555FA3C84C64C2 );
+  ctx->val[2] = _mm256_set1_epi64x( 0x2393B86B6F53B151 );
+  ctx->val[3] = _mm256_set1_epi64x( 0x963877195940EABD );
+  ctx->val[4] = _mm256_set1_epi64x( 0x96283EE2A88EFFE3 );
+  ctx->val[5] = _mm256_set1_epi64x( 0xBE5E1E2553863992 );
+  ctx->val[6] = _mm256_set1_epi64x( 0x2B0199FC2C85B8AA );
+  ctx->val[7] = _mm256_set1_epi64x( 0x0EB72DDC81C52CA2 );
+}
+
+int scanhash_sha512256d_4way( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr )
+{
+    uint64_t hash[8*4] __attribute__ ((aligned (64)));
+    uint32_t vdata[20*4] __attribute__ ((aligned (64)));
+    sha512_4way_context ctx;
+    uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+    uint64_t *hash_q3 = &(hash[3*4]);
+    uint32_t *pdata = work->data;
+    uint32_t *ptarget = work->target;
+    const uint64_t targ_q3 = ((uint64_t*)ptarget)[3];
+    const uint32_t first_nonce = pdata[19];
+    const uint32_t last_nonce = max_nonce - 4;
+    uint32_t n = first_nonce;
+    __m256i  *noncev = (__m256i*)vdata + 9;
+    const int thr_id = mythr->id;
+    const bool bench = opt_benchmark;
+    const __m256i four = _mm256_set1_epi64x( 0x0000000400000000 );
+
+    mm256_bswap32_intrlv80_4x64( vdata, pdata );
+    *noncev = mm256_intrlv_blend_32(
+                _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
+    do
+    {
+       sha512256d_4way_init( &ctx );
+       sha512_4way_update( &ctx, vdata, 80 );
+       sha512_4way_close( &ctx, hash );
+
+       sha512256d_4way_init( &ctx );
+       sha512_4way_update( &ctx, hash, 32 );
+       sha512_4way_close( &ctx, hash );
+
+       for ( int lane = 0; lane < 4; lane++ )
+       if ( hash_q3[ lane ] <= targ_q3 )
+       {
+          extr_lane_4x64( lane_hash, hash, lane, 256 );
+          if ( valid_hash( lane_hash, ptarget ) && !bench )
+          {
+             pdata[19] = bswap_32( n + lane );
+             submit_solution( work, lane_hash, mythr );
+          }
+       }
+       *noncev = _mm256_add_epi32( *noncev, four );
+       n += 4;
+    } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+
+    pdata[19] = n;
+    *hashes_done = n - first_nonce;
+    return 0;
+}
+
+#else
+
+#include "sph_sha2.h"
+
+static const uint64_t H512_256[8] =
+{
+   0x22312194FC2BF72C, 0x9F555FA3C84C64C2,
+   0x2393B86B6F53B151, 0x963877195940EABD,
+   0x96283EE2A88EFFE3, 0xBE5E1E2553863992,
+   0x2B0199FC2C85B8AA, 0x0EB72DDC81C52CA2,
+};
+
+static void sha512256d_init( sph_sha512_context *ctx )
+{
+   memcpy( ctx->val, H512_256, sizeof H512_256 );
+   ctx->count = 0;
+}
+
+int scanhash_sha512256d( struct work *work,   uint32_t max_nonce,
+                     uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   uint32_t hash64[8] __attribute__ ((aligned (64)));
+   uint32_t endiandata[20] __attribute__ ((aligned (64)));
+   sph_sha512_context ctx;
+   const uint32_t Htarg = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   uint32_t n = first_nonce;
+   int thr_id = mythr->id;
+
+   swab32_array( endiandata, pdata, 20 );
+
+   do {
+      be32enc( &endiandata[19], n );
+
+      sha512256d_init( &ctx );
+      sph_sha512( &ctx, endiandata, 80 );
+      sph_sha512_close( &ctx, hash64 );
+
+      sha512256d_init( &ctx );
+      sph_sha512( &ctx, hash64, 32 );
+      sph_sha512_close( &ctx, hash64 );
+
+      if ( hash64[7] <= Htarg )
+      if ( fulltest( hash64, ptarget ) && !opt_benchmark )
+      {
+         pdata[19] = n;
+         submit_solution( work, hash64, mythr );
+      }
+      n++;
+
+   } while (n < max_nonce && !work_restart[thr_id].restart);
+
+   *hashes_done = n - first_nonce;
+   pdata[19] = n;
+
+   return 0;
+}
+
+#endif
+
+bool register_sha512256d_algo( algo_gate_t* gate )
+{
+   gate->optimizations = AVX2_OPT | AVX512_OPT;
+#if defined(SHA512256D_8WAY)
+   gate->scanhash = (void*)&scanhash_sha512256d_8way;
+#elif defined(SHA512256D_4WAY)
+   gate->scanhash = (void*)&scanhash_sha512256d_4way;
+#else
+   gate->scanhash = (void*)&scanhash_sha512256d;
+#endif
+   return true;
+}
+
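Note on the vectorized scanhash loops above: the quick test reads `hash_q3[ lane ]` from `&(hash[3*8])`, which suggests that 64-bit word w of lane l sits at index `w*8 + l` in the interleaved buffer, and the nonce is advanced by a packed 32-bit add of `0x0000000800000000`, which touches only the upper half of each 64-bit lane. The scalar sketch below illustrates both ideas. The layout is inferred from the indexing in the diff rather than confirmed, and the names `lane_word`, `extract_lane_256` and `add32_in_lane` are hypothetical helpers, not part of cpuminer-opt.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8   /* eight 64-bit lanes per 512-bit vector */

/* Word w of lane l in an 8x64 interleaved buffer (assumed layout). */
static uint64_t lane_word( const uint64_t *v, int w, int l )
{
   return v[ w * LANES + l ];
}

/* Copy one lane's 256-bit digest (four 64-bit words) out of the
   interleaved buffer, in the spirit of extr_lane_8x64. */
static void extract_lane_256( uint64_t out[4], const uint64_t *v, int l )
{
   for ( int w = 0; w < 4; w++ )
      out[w] = lane_word( v, w, l );
}

/* One 64-bit lane of a packed 32-bit add: the two halves are summed
   independently, so no carry crosses from the low half to the high half. */
static uint64_t add32_in_lane( uint64_t a, uint64_t b )
{
   uint32_t lo = (uint32_t)a + (uint32_t)b;
   uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}
```

With this layout, comparing word 3 of every lane against the top 64 bits of the target costs one indexed load per lane, and the full lane is extracted only for the rare candidates that pass.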
--- a/algo/sha/sph_sha2.c
+++ b/algo/sha/sph_sha2.c
@@ -39,9 +39,9 @@
 #define SPH_SMALL_FOOTPRINT_SHA2   1
 #endif

-#define CH(X, Y, Z)    ((((Y) ^ (Z)) & (X)) ^ (Z))
+#define CH(X, Y, Z)    ( ( ( (Y) ^ (Z) ) & (X)) ^ (Z) )
 //#define MAJ(X, Y, Z)   (((Y) & (Z)) | (((Y) | (Z)) & (X)))
-#define MAJ( X, Y, Z )   ( Y  ^ ( ( X_xor_Y = X ^ Y ) & ( Y_xor_Z ) ) )
+#define MAJ( X, Y, Z )   ( (Y) ^ ( ( (X_xor_Y) = (X) ^ (Y) ) & (Y_xor_Z) ) )
 #define ROTR    SPH_ROTR32

 #define BSG2_0(x)      (ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22))
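Note on the CH and MAJ macros above: both are reduced forms of the SHA-2 choice and majority functions, and MAJ additionally stores `X ^ Y` so that the following round can reuse it as its `Y ^ Z` (after a round, the old a and b become the new b and c). The sketch below checks the two identities in scalar C. The function names are hypothetical, and the cached value is passed explicitly here instead of through the `X_xor_Y` / `Y_xor_Z` variables the macro relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook definitions of the SHA-2 choice and majority functions. */
static uint32_t ch_ref( uint32_t x, uint32_t y, uint32_t z )
{
   return ( x & y ) ^ ( ~x & z );
}

static uint32_t maj_ref( uint32_t x, uint32_t y, uint32_t z )
{
   return ( x & y ) ^ ( x & z ) ^ ( y & z );
}

/* Reduced choice: one XOR, one AND, one XOR. */
static uint32_t ch_opt( uint32_t x, uint32_t y, uint32_t z )
{
   return ( ( y ^ z ) & x ) ^ z;
}

/* Reduced majority. y_xor_z is supplied by the caller (the previous
   round's x ^ y), and this round's x ^ y is handed back for reuse. */
static uint32_t maj_opt( uint32_t x, uint32_t y,
                         uint32_t *x_xor_y, uint32_t y_xor_z )
{
   *x_xor_y = x ^ y;
   return y ^ ( *x_xor_y & y_xor_z );
}
```

Because the macro form assigns to `X_xor_Y` as a side effect, the added parentheses around each argument matter whenever a caller passes an expression rather than a plain variable.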
--- a/algo/shabal/shabal-hash-4way.c
+++ b/algo/shabal/shabal-hash-4way.c
@@ -112,50 +112,50 @@ extern "C"{
   else \
   { \
       (state)->state_loaded = true; \
-       A0 = m256_const1_64( 0x20728DFD20728DFD ); \
-       A1 = m256_const1_64( 0x46C0BD5346C0BD53 ); \
-       A2 = m256_const1_64( 0xE782B699E782B699 ); \
-       A3 = m256_const1_64( 0x5530463255304632 ); \
-       A4 = m256_const1_64( 0x71B4EF9071B4EF90 ); \
-       A5 = m256_const1_64( 0x0EA9E82C0EA9E82C ); \
-       A6 = m256_const1_64( 0xDBB930F1DBB930F1 ); \
-       A7 = m256_const1_64( 0xFAD06B8BFAD06B8B ); \
-       A8 = m256_const1_64( 0xBE0CAE40BE0CAE40 ); \
-       A9 = m256_const1_64( 0x8BD144108BD14410 ); \
-       AA = m256_const1_64( 0x76D2ADAC76D2ADAC ); \
-       AB = m256_const1_64( 0x28ACAB7F28ACAB7F ); \
-       B0 = m256_const1_64( 0xC1099CB7C1099CB7 ); \
-       B1 = m256_const1_64( 0x07B385F307B385F3 ); \
-       B2 = m256_const1_64( 0xE7442C26E7442C26 ); \
-       B3 = m256_const1_64( 0xCC8AD640CC8AD640 ); \
-       B4 = m256_const1_64( 0xEB6F56C7EB6F56C7 ); \
-       B5 = m256_const1_64( 0x1EA81AA91EA81AA9 ); \
-       B6 = m256_const1_64( 0x73B9D31473B9D314 ); \
-       B7 = m256_const1_64( 0x1DE85D081DE85D08 ); \
-       B8 = m256_const1_64( 0x48910A5A48910A5A ); \
-       B9 = m256_const1_64( 0x893B22DB893B22DB ); \
-       BA = m256_const1_64( 0xC5A0DF44C5A0DF44 ); \
-       BB = m256_const1_64( 0xBBC4324EBBC4324E ); \
-       BC = m256_const1_64( 0x72D2F24072D2F240 ); \
-       BD = m256_const1_64( 0x75941D9975941D99 ); \
-       BE = m256_const1_64( 0x6D8BDE826D8BDE82 ); \
-       BF = m256_const1_64( 0xA1A7502BA1A7502B ); \
-       C0 = m256_const1_64( 0xD9BF68D1D9BF68D1 ); \
-       C1 = m256_const1_64( 0x58BAD75058BAD750 ); \
-       C2 = m256_const1_64( 0x56028CB256028CB2 ); \
-       C3 = m256_const1_64( 0x8134F3598134F359 ); \
-       C4 = m256_const1_64( 0xB5D469D8B5D469D8 ); \
-       C5 = m256_const1_64( 0x941A8CC2941A8CC2 ); \
-       C6 = m256_const1_64( 0x418B2A6E418B2A6E ); \
-       C7 = m256_const1_64( 0x0405278004052780 ); \
-       C8 = m256_const1_64( 0x7F07D7877F07D787 ); \
-       C9 = m256_const1_64( 0x5194358F5194358F ); \
-       CA = m256_const1_64( 0x3C60D6653C60D665 ); \
-       CB = m256_const1_64( 0xBE97D79ABE97D79A ); \
-       CC = m256_const1_64( 0x950C3434950C3434 ); \
-       CD = m256_const1_64( 0xAED9A06DAED9A06D ); \
-       CE = m256_const1_64( 0x2537DC8D2537DC8D ); \
-       CF = m256_const1_64( 0x7CDB59697CDB5969 ); \
+       A0 = _mm256_set1_epi64x( 0x20728DFD20728DFD ); \
+       A1 = _mm256_set1_epi64x( 0x46C0BD5346C0BD53 ); \
+       A2 = _mm256_set1_epi64x( 0xE782B699E782B699 ); \
+       A3 = _mm256_set1_epi64x( 0x5530463255304632 ); \
+       A4 = _mm256_set1_epi64x( 0x71B4EF9071B4EF90 ); \
+       A5 = _mm256_set1_epi64x( 0x0EA9E82C0EA9E82C ); \
+       A6 = _mm256_set1_epi64x( 0xDBB930F1DBB930F1 ); \
+       A7 = _mm256_set1_epi64x( 0xFAD06B8BFAD06B8B ); \
+       A8 = _mm256_set1_epi64x( 0xBE0CAE40BE0CAE40 ); \
+       A9 = _mm256_set1_epi64x( 0x8BD144108BD14410 ); \
+       AA = _mm256_set1_epi64x( 0x76D2ADAC76D2ADAC ); \
+       AB = _mm256_set1_epi64x( 0x28ACAB7F28ACAB7F ); \
+       B0 = _mm256_set1_epi64x( 0xC1099CB7C1099CB7 ); \
+       B1 = _mm256_set1_epi64x( 0x07B385F307B385F3 ); \
+       B2 = _mm256_set1_epi64x( 0xE7442C26E7442C26 ); \
+       B3 = _mm256_set1_epi64x( 0xCC8AD640CC8AD640 ); \
+       B4 = _mm256_set1_epi64x( 0xEB6F56C7EB6F56C7 ); \
+       B5 = _mm256_set1_epi64x( 0x1EA81AA91EA81AA9 ); \
+       B6 = _mm256_set1_epi64x( 0x73B9D31473B9D314 ); \
+       B7 = _mm256_set1_epi64x( 0x1DE85D081DE85D08 ); \
+       B8 = _mm256_set1_epi64x( 0x48910A5A48910A5A ); \
+       B9 = _mm256_set1_epi64x( 0x893B22DB893B22DB ); \
+       BA = _mm256_set1_epi64x( 0xC5A0DF44C5A0DF44 ); \
+       BB = _mm256_set1_epi64x( 0xBBC4324EBBC4324E ); \
+       BC = _mm256_set1_epi64x( 0x72D2F24072D2F240 ); \
+       BD = _mm256_set1_epi64x( 0x75941D9975941D99 ); \
+       BE = _mm256_set1_epi64x( 0x6D8BDE826D8BDE82 ); \
+       BF = _mm256_set1_epi64x( 0xA1A7502BA1A7502B ); \
+       C0 = _mm256_set1_epi64x( 0xD9BF68D1D9BF68D1 ); \
+       C1 = _mm256_set1_epi64x( 0x58BAD75058BAD750 ); \
+       C2 = _mm256_set1_epi64x( 0x56028CB256028CB2 ); \
+       C3 = _mm256_set1_epi64x( 0x8134F3598134F359 ); \
+       C4 = _mm256_set1_epi64x( 0xB5D469D8B5D469D8 ); \
+       C5 = _mm256_set1_epi64x( 0x941A8CC2941A8CC2 ); \
+       C6 = _mm256_set1_epi64x( 0x418B2A6E418B2A6E ); \
+       C7 = _mm256_set1_epi64x( 0x0405278004052780 ); \
+       C8 = _mm256_set1_epi64x( 0x7F07D7877F07D787 ); \
+       C9 = _mm256_set1_epi64x( 0x5194358F5194358F ); \
+       CA = _mm256_set1_epi64x( 0x3C60D6653C60D665 ); \
+       CB = _mm256_set1_epi64x( 0xBE97D79ABE97D79A ); \
+       CC = _mm256_set1_epi64x( 0x950C3434950C3434 ); \
+       CD = _mm256_set1_epi64x( 0xAED9A06DAED9A06D ); \
+       CE = _mm256_set1_epi64x( 0x2537DC8D2537DC8D ); \
+       CF = _mm256_set1_epi64x( 0x7CDB59697CDB5969 ); \
   } \
   Wlow = (state)->Wlow; \
   Whigh = (state)->Whigh; \
@@ -276,6 +276,11 @@ do { \
   A1 = _mm256_xor_si256( A1, _mm256_set1_epi32( Whigh ) ); \
 } while (0)

+#define mm256_swap512_256( v1, v2 ) do { \
+   v1 = _mm256_xor_si256( v1, v2 ); \
+   v2 = _mm256_xor_si256( v1, v2 ); \
+   v1 = _mm256_xor_si256( v1, v2 ); } while (0)
+
 #define SWAP_BC8 \
 do { \
    mm256_swap512_256( B0, C0 ); \
@@ -298,7 +303,7 @@ do { \

 #define PERM_ELT8( xa0, xa1, xb0, xb1, xb2, xb3, xc, xm ) \
 do { \
-   xa0 = mm256_xor3( xm, xb1, mm256_xorandnot(  \
+   xa0 = mm256_xor3( xm, xb1, mm256_xorandnot( \
           _mm256_mullo_epi32( mm256_xor3( xa0, xc, \
              _mm256_mullo_epi32( mm256_rol_32( xa1, 15 ), FIVE ) ), THREE ), \
           xb3, xb2 ) ); \
@@ -438,52 +443,52 @@ shabal_8way_init( void *cc, unsigned size )
   else
   {  // No users
       sc->state_loaded = true;
-       sc->A[ 0] = m256_const1_64( 0x52F8455252F84552 );
-       sc->A[ 1] = m256_const1_64( 0xE54B7999E54B7999 );
-       sc->A[ 2] = m256_const1_64( 0x2D8EE3EC2D8EE3EC );
-       sc->A[ 3] = m256_const1_64( 0xB9645191B9645191 );
-       sc->A[ 4] = m256_const1_64( 0xE0078B86E0078B86 );
-       sc->A[ 5] = m256_const1_64( 0xBB7C44C9BB7C44C9 );
-       sc->A[ 6] = m256_const1_64( 0xD2B5C1CAD2B5C1CA );
-       sc->A[ 7] = m256_const1_64( 0xB0D2EB8CB0D2EB8C );
-       sc->A[ 8] = m256_const1_64( 0x14CE5A4514CE5A45 );
-       sc->A[ 9] = m256_const1_64( 0x22AF50DC22AF50DC );
-       sc->A[10] = m256_const1_64( 0xEFFDBC6BEFFDBC6B );
-       sc->A[11] = m256_const1_64( 0xEB21B74AEB21B74A );
+       sc->A[ 0] = _mm256_set1_epi64x( 0x52F8455252F84552 );
+       sc->A[ 1] = _mm256_set1_epi64x( 0xE54B7999E54B7999 );
+       sc->A[ 2] = _mm256_set1_epi64x( 0x2D8EE3EC2D8EE3EC );
+       sc->A[ 3] = _mm256_set1_epi64x( 0xB9645191B9645191 );
+       sc->A[ 4] = _mm256_set1_epi64x( 0xE0078B86E0078B86 );
+       sc->A[ 5] = _mm256_set1_epi64x( 0xBB7C44C9BB7C44C9 );
+       sc->A[ 6] = _mm256_set1_epi64x( 0xD2B5C1CAD2B5C1CA );
+       sc->A[ 7] = _mm256_set1_epi64x( 0xB0D2EB8CB0D2EB8C );
+       sc->A[ 8] = _mm256_set1_epi64x( 0x14CE5A4514CE5A45 );
+       sc->A[ 9] = _mm256_set1_epi64x( 0x22AF50DC22AF50DC );
+       sc->A[10] = _mm256_set1_epi64x( 0xEFFDBC6BEFFDBC6B );
+       sc->A[11] = _mm256_set1_epi64x( 0xEB21B74AEB21B74A );

-       sc->B[ 0] = m256_const1_64( 0xB555C6EEB555C6EE );
-       sc->B[ 1] = m256_const1_64( 0x3E7105963E710596 );
-       sc->B[ 2] = m256_const1_64( 0xA72A652FA72A652F );
-       sc->B[ 3] = m256_const1_64( 0x9301515F9301515F );
-       sc->B[ 4] = m256_const1_64( 0xDA28C1FADA28C1FA );
-       sc->B[ 5] = m256_const1_64( 0x696FD868696FD868 );
-       sc->B[ 6] = m256_const1_64( 0x9CB6BF729CB6BF72 );
-       sc->B[ 7] = m256_const1_64( 0x0AFE40020AFE4002 );
-       sc->B[ 8] = m256_const1_64( 0xA6E03615A6E03615 );
-       sc->B[ 9] = m256_const1_64( 0x5138C1D45138C1D4 );
-       sc->B[10] = m256_const1_64( 0xBE216306BE216306 );
-       sc->B[11] = m256_const1_64( 0xB38B8890B38B8890 );
-       sc->B[12] = m256_const1_64( 0x3EA8B96B3EA8B96B );
-       sc->B[13] = m256_const1_64( 0x3299ACE43299ACE4 );
-       sc->B[14] = m256_const1_64( 0x30924DD430924DD4 );
-       sc->B[15] = m256_const1_64( 0x55CB34A555CB34A5 );
+       sc->B[ 0] = _mm256_set1_epi64x( 0xB555C6EEB555C6EE );
+       sc->B[ 1] = _mm256_set1_epi64x( 0x3E7105963E710596 );
+       sc->B[ 2] = _mm256_set1_epi64x( 0xA72A652FA72A652F );
+       sc->B[ 3] = _mm256_set1_epi64x( 0x9301515F9301515F );
+       sc->B[ 4] = _mm256_set1_epi64x( 0xDA28C1FADA28C1FA );
+       sc->B[ 5] = _mm256_set1_epi64x( 0x696FD868696FD868 );
+       sc->B[ 6] = _mm256_set1_epi64x( 0x9CB6BF729CB6BF72 );
+       sc->B[ 7] = _mm256_set1_epi64x( 0x0AFE40020AFE4002 );
+       sc->B[ 8] = _mm256_set1_epi64x( 0xA6E03615A6E03615 );
+       sc->B[ 9] = _mm256_set1_epi64x( 0x5138C1D45138C1D4 );
+       sc->B[10] = _mm256_set1_epi64x( 0xBE216306BE216306 );
+       sc->B[11] = _mm256_set1_epi64x( 0xB38B8890B38B8890 );
+       sc->B[12] = _mm256_set1_epi64x( 0x3EA8B96B3EA8B96B );
+       sc->B[13] = _mm256_set1_epi64x( 0x3299ACE43299ACE4 );
+       sc->B[14] = _mm256_set1_epi64x( 0x30924DD430924DD4 );
+       sc->B[15] = _mm256_set1_epi64x( 0x55CB34A555CB34A5 );

-       sc->C[ 0] = m256_const1_64( 0xB405F031B405F031 );
-       sc->C[ 1] = m256_const1_64( 0xC4233EBAC4233EBA );
-       sc->C[ 2] = m256_const1_64( 0xB3733979B3733979 );
-       sc->C[ 3] = m256_const1_64( 0xC0DD9D55C0DD9D55 );
-       sc->C[ 4] = m256_const1_64( 0xC51C28AEC51C28AE );
-       sc->C[ 5] = m256_const1_64( 0xA327B8E1A327B8E1 );
-       sc->C[ 6] = m256_const1_64( 0x56C5616756C56167 );
-       sc->C[ 7] = m256_const1_64( 0xED614433ED614433 );
-       sc->C[ 8] = m256_const1_64( 0x88B59D6088B59D60 );
-       sc->C[ 9] = m256_const1_64( 0x60E2CEBA60E2CEBA );
-       sc->C[10] = m256_const1_64( 0x758B4B8B758B4B8B );
-       sc->C[11] = m256_const1_64( 0x83E82A7F83E82A7F );
-       sc->C[12] = m256_const1_64( 0xBC968828BC968828 );
-       sc->C[13] = m256_const1_64( 0xE6E00BF7E6E00BF7 );
-       sc->C[14] = m256_const1_64( 0xBA839E55BA839E55 );
-       sc->C[15] = m256_const1_64( 0x9B491C609B491C60 );
+       sc->C[ 0] = _mm256_set1_epi64x( 0xB405F031B405F031 );
+       sc->C[ 1] = _mm256_set1_epi64x( 0xC4233EBAC4233EBA );
+       sc->C[ 2] = _mm256_set1_epi64x( 0xB3733979B3733979 );
+       sc->C[ 3] = _mm256_set1_epi64x( 0xC0DD9D55C0DD9D55 );
+       sc->C[ 4] = _mm256_set1_epi64x( 0xC51C28AEC51C28AE );
+       sc->C[ 5] = _mm256_set1_epi64x( 0xA327B8E1A327B8E1 );
+       sc->C[ 6] = _mm256_set1_epi64x( 0x56C5616756C56167 );
+       sc->C[ 7] = _mm256_set1_epi64x( 0xED614433ED614433 );
+       sc->C[ 8] = _mm256_set1_epi64x( 0x88B59D6088B59D60 );
+       sc->C[ 9] = _mm256_set1_epi64x( 0x60E2CEBA60E2CEBA );
+       sc->C[10] = _mm256_set1_epi64x( 0x758B4B8B758B4B8B );
+       sc->C[11] = _mm256_set1_epi64x( 0x83E82A7F83E82A7F );
+       sc->C[12] = _mm256_set1_epi64x( 0xBC968828BC968828 );
+       sc->C[13] = _mm256_set1_epi64x( 0xE6E00BF7E6E00BF7 );
+       sc->C[14] = _mm256_set1_epi64x( 0xBA839E55BA839E55 );
+       sc->C[15] = _mm256_set1_epi64x( 0x9B491C609B491C60 );
   }
    sc->Wlow = 1;
    sc->Whigh = 0;
@@ -702,50 +707,50 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
   else \
   { \
       (state)->state_loaded = true; \
-       A0 = m128_const1_64( 0x20728DFD20728DFD ); \
-       A1 = m128_const1_64( 0x46C0BD5346C0BD53 ); \
-       A2 = m128_const1_64( 0xE782B699E782B699 ); \
-       A3 = m128_const1_64( 0x5530463255304632 ); \
-       A4 = m128_const1_64( 0x71B4EF9071B4EF90 ); \
-       A5 = m128_const1_64( 0x0EA9E82C0EA9E82C ); \
-       A6 = m128_const1_64( 0xDBB930F1DBB930F1 ); \
-       A7 = m128_const1_64( 0xFAD06B8BFAD06B8B ); \
-       A8 = m128_const1_64( 0xBE0CAE40BE0CAE40 ); \
-       A9 = m128_const1_64( 0x8BD144108BD14410 ); \
-       AA = m128_const1_64( 0x76D2ADAC76D2ADAC ); \
-       AB = m128_const1_64( 0x28ACAB7F28ACAB7F ); \
-       B0 = m128_const1_64( 0xC1099CB7C1099CB7 ); \
-       B1 = m128_const1_64( 0x07B385F307B385F3 ); \
-       B2 = m128_const1_64( 0xE7442C26E7442C26 ); \
-       B3 = m128_const1_64( 0xCC8AD640CC8AD640 ); \
-       B4 = m128_const1_64( 0xEB6F56C7EB6F56C7 ); \
-       B5 = m128_const1_64( 0x1EA81AA91EA81AA9 ); \
-       B6 = m128_const1_64( 0x73B9D31473B9D314 ); \
-       B7 = m128_const1_64( 0x1DE85D081DE85D08 ); \
-       B8 = m128_const1_64( 0x48910A5A48910A5A ); \
-       B9 = m128_const1_64( 0x893B22DB893B22DB ); \
-       BA = m128_const1_64( 0xC5A0DF44C5A0DF44 ); \
-       BB = m128_const1_64( 0xBBC4324EBBC4324E ); \
-       BC = m128_const1_64( 0x72D2F24072D2F240 ); \
-       BD = m128_const1_64( 0x75941D9975941D99 ); \
-       BE = m128_const1_64( 0x6D8BDE826D8BDE82 ); \
-       BF = m128_const1_64( 0xA1A7502BA1A7502B ); \
-       C0 = m128_const1_64( 0xD9BF68D1D9BF68D1 ); \
-       C1 = m128_const1_64( 0x58BAD75058BAD750 ); \
-       C2 = m128_const1_64( 0x56028CB256028CB2 ); \
-       C3 = m128_const1_64( 0x8134F3598134F359 ); \
-       C4 = m128_const1_64( 0xB5D469D8B5D469D8 ); \
-       C5 = m128_const1_64( 0x941A8CC2941A8CC2 ); \
-       C6 = m128_const1_64( 0x418B2A6E418B2A6E ); \
-       C7 = m128_const1_64( 0x0405278004052780 ); \
-       C8 = m128_const1_64( 0x7F07D7877F07D787 ); \
-       C9 = m128_const1_64( 0x5194358F5194358F ); \
-       CA = m128_const1_64( 0x3C60D6653C60D665 ); \
-       CB = m128_const1_64( 0xBE97D79ABE97D79A ); \
-       CC = m128_const1_64( 0x950C3434950C3434 ); \
-       CD = m128_const1_64( 0xAED9A06DAED9A06D ); \
-       CE = m128_const1_64( 0x2537DC8D2537DC8D ); \
-       CF = m128_const1_64( 0x7CDB59697CDB5969 ); \
+       A0 = _mm_set1_epi64x( 0x20728DFD20728DFD ); \
+       A1 = _mm_set1_epi64x( 0x46C0BD5346C0BD53 ); \
+       A2 = _mm_set1_epi64x( 0xE782B699E782B699 ); \
+       A3 = _mm_set1_epi64x( 0x5530463255304632 ); \
+       A4 = _mm_set1_epi64x( 0x71B4EF9071B4EF90 ); \
+       A5 = _mm_set1_epi64x( 0x0EA9E82C0EA9E82C ); \
+       A6 = _mm_set1_epi64x( 0xDBB930F1DBB930F1 ); \
+       A7 = _mm_set1_epi64x( 0xFAD06B8BFAD06B8B ); \
+       A8 = _mm_set1_epi64x( 0xBE0CAE40BE0CAE40 ); \
+       A9 = _mm_set1_epi64x( 0x8BD144108BD14410 ); \
+       AA = _mm_set1_epi64x( 0x76D2ADAC76D2ADAC ); \
+       AB = _mm_set1_epi64x( 0x28ACAB7F28ACAB7F ); \
+       B0 = _mm_set1_epi64x( 0xC1099CB7C1099CB7 ); \
+       B1 = _mm_set1_epi64x( 0x07B385F307B385F3 ); \
+       B2 = _mm_set1_epi64x( 0xE7442C26E7442C26 ); \
+       B3 = _mm_set1_epi64x( 0xCC8AD640CC8AD640 ); \
+       B4 = _mm_set1_epi64x( 0xEB6F56C7EB6F56C7 ); \
+       B5 = _mm_set1_epi64x( 0x1EA81AA91EA81AA9 ); \
+       B6 = _mm_set1_epi64x( 0x73B9D31473B9D314 ); \
+       B7 = _mm_set1_epi64x( 0x1DE85D081DE85D08 ); \
+       B8 = _mm_set1_epi64x( 0x48910A5A48910A5A ); \
+       B9 = _mm_set1_epi64x( 0x893B22DB893B22DB ); \
+       BA = _mm_set1_epi64x( 0xC5A0DF44C5A0DF44 ); \
+       BB = _mm_set1_epi64x( 0xBBC4324EBBC4324E ); \
+       BC = _mm_set1_epi64x( 0x72D2F24072D2F240 ); \
+       BD = _mm_set1_epi64x( 0x75941D9975941D99 ); \
+       BE = _mm_set1_epi64x( 0x6D8BDE826D8BDE82 ); \
+       BF = _mm_set1_epi64x( 0xA1A7502BA1A7502B ); \
+       C0 = _mm_set1_epi64x( 0xD9BF68D1D9BF68D1 ); \
+       C1 = _mm_set1_epi64x( 0x58BAD75058BAD750 ); \
+       C2 = _mm_set1_epi64x( 0x56028CB256028CB2 ); \
+       C3 = _mm_set1_epi64x( 0x8134F3598134F359 ); \
+       C4 = _mm_set1_epi64x( 0xB5D469D8B5D469D8 ); \
+       C5 = _mm_set1_epi64x( 0x941A8CC2941A8CC2 ); \
+       C6 = _mm_set1_epi64x( 0x418B2A6E418B2A6E ); \
+       C7 = _mm_set1_epi64x( 0x0405278004052780 ); \
+       C8 = _mm_set1_epi64x( 0x7F07D7877F07D787 ); \
+       C9 = _mm_set1_epi64x( 0x5194358F5194358F ); \
+       CA = _mm_set1_epi64x( 0x3C60D6653C60D665 ); \
+       CB = _mm_set1_epi64x( 0xBE97D79ABE97D79A ); \
+       CC = _mm_set1_epi64x( 0x950C3434950C3434 ); \
+       CD = _mm_set1_epi64x( 0xAED9A06DAED9A06D ); \
+       CE = _mm_set1_epi64x( 0x2537DC8D2537DC8D ); \
+       CF = _mm_set1_epi64x( 0x7CDB59697CDB5969 ); \
   } \
   Wlow = (state)->Wlow; \
   Whigh = (state)->Whigh; \
@@ -866,6 +871,11 @@ do { \
   A1 = _mm_xor_si128( A1, _mm_set1_epi32( Whigh ) ); \
 } while (0)

+#define mm128_swap256_128( v1, v2 ) do { \
+   v1 = _mm_xor_si128( v1, v2 ); \
+   v2 = _mm_xor_si128( v1, v2 ); \
+   v1 = _mm_xor_si128( v1, v2 ); } while (0)
+
 #define SWAP_BC \
 do { \
    mm128_swap256_128( B0, C0 ); \
@@ -886,6 +896,16 @@ do { \
    mm128_swap256_128( BF, CF ); \
 } while (0)

+#define PERM_ELT( xa0, xa1, xb0, xb1, xb2, xb3, xc, xm ) \
+do { \
+   xa0 = mm128_xor3( xm, xb1, mm128_xorandnot( \
+           _mm_mullo_epi32( mm128_xor3( xa0, xc, \
+              _mm_mullo_epi32( mm128_rol_32( xa1, 15 ), FIVE ) ), THREE ), \
+           xb3, xb2 ) ); \
+   xb0 = mm128_xnor( xa0, mm128_rol_32( xb0, 1 ) ); \
+} while (0)
+
+/*
 #define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) \
 do { \
   xa0 = _mm_xor_si128( xm, _mm_xor_si128( xb1, _mm_xor_si128(  \
@@ -895,6 +915,7 @@ do { \
                   ) ), THREE ) ) ) ); \
   xb0 = mm128_not( _mm_xor_si128( xa0, mm128_rol_32( xb0, 1 ) ) ); \
 } while (0)
+*/

 #define PERM_STEP_0   do { \
 		PERM_ELT(A0, AB, B0, BD, B9, B6, C8, M0); \
@@ -1068,103 +1089,103 @@ shabal_4way_init( void *cc, unsigned size )
   { // copy immediate constants directly to working registers later.
       sc->state_loaded = false;
 /*
-       sc->A[ 0] = m128_const1_64( 0x20728DFD20728DFD );
-       sc->A[ 1] = m128_const1_64( 0x46C0BD5346C0BD53 );
-       sc->A[ 2] = m128_const1_64( 0xE782B699E782B699 );
-       sc->A[ 3] = m128_const1_64( 0x5530463255304632 );
-       sc->A[ 4] = m128_const1_64( 0x71B4EF9071B4EF90 );
-       sc->A[ 5] = m128_const1_64( 0x0EA9E82C0EA9E82C );
-       sc->A[ 6] = m128_const1_64( 0xDBB930F1DBB930F1 );
-       sc->A[ 7] = m128_const1_64( 0xFAD06B8BFAD06B8B );
-       sc->A[ 8] = m128_const1_64( 0xBE0CAE40BE0CAE40 );
-       sc->A[ 9] = m128_const1_64( 0x8BD144108BD14410 );
-       sc->A[10] = m128_const1_64( 0x76D2ADAC76D2ADAC );
-       sc->A[11] = m128_const1_64( 0x28ACAB7F28ACAB7F );
+       sc->A[ 0] = _mm_set1_epi64x( 0x20728DFD20728DFD );
+       sc->A[ 1] = _mm_set1_epi64x( 0x46C0BD5346C0BD53 );
+       sc->A[ 2] = _mm_set1_epi64x( 0xE782B699E782B699 );
+       sc->A[ 3] = _mm_set1_epi64x( 0x5530463255304632 );
+       sc->A[ 4] = _mm_set1_epi64x( 0x71B4EF9071B4EF90 );
+       sc->A[ 5] = _mm_set1_epi64x( 0x0EA9E82C0EA9E82C );
+       sc->A[ 6] = _mm_set1_epi64x( 0xDBB930F1DBB930F1 );
+       sc->A[ 7] = _mm_set1_epi64x( 0xFAD06B8BFAD06B8B );
+       sc->A[ 8] = _mm_set1_epi64x( 0xBE0CAE40BE0CAE40 );
+       sc->A[ 9] = _mm_set1_epi64x( 0x8BD144108BD14410 );
+       sc->A[10] = _mm_set1_epi64x( 0x76D2ADAC76D2ADAC );
+       sc->A[11] = _mm_set1_epi64x( 0x28ACAB7F28ACAB7F );

-       sc->B[ 0] = m128_const1_64( 0xC1099CB7C1099CB7 );
-       sc->B[ 1] = m128_const1_64( 0x07B385F307B385F3 );
-       sc->B[ 2] = m128_const1_64( 0xE7442C26E7442C26 );
-       sc->B[ 3] = m128_const1_64( 0xCC8AD640CC8AD640 );
-       sc->B[ 4] = m128_const1_64( 0xEB6F56C7EB6F56C7 );
-       sc->B[ 5] = m128_const1_64( 0x1EA81AA91EA81AA9 );
-       sc->B[ 6] = m128_const1_64( 0x73B9D31473B9D314 );
-       sc->B[ 7] = m128_const1_64( 0x1DE85D081DE85D08 );
-       sc->B[ 8] = m128_const1_64( 0x48910A5A48910A5A );
-       sc->B[ 9] = m128_const1_64( 0x893B22DB893B22DB );
-       sc->B[10] = m128_const1_64( 0xC5A0DF44C5A0DF44 );
-       sc->B[11] = m128_const1_64( 0xBBC4324EBBC4324E );
-       sc->B[12] = m128_const1_64( 0x72D2F24072D2F240 );
-       sc->B[13] = m128_const1_64( 0x75941D9975941D99 );
-       sc->B[14] = m128_const1_64( 0x6D8BDE826D8BDE82 );
-       sc->B[15] = m128_const1_64( 0xA1A7502BA1A7502B );
+       sc->B[ 0] = _mm_set1_epi64x( 0xC1099CB7C1099CB7 );
+       sc->B[ 1] = _mm_set1_epi64x( 0x07B385F307B385F3 );
+       sc->B[ 2] = _mm_set1_epi64x( 0xE7442C26E7442C26 );
+       sc->B[ 3] = _mm_set1_epi64x( 0xCC8AD640CC8AD640 );
+       sc->B[ 4] = _mm_set1_epi64x( 0xEB6F56C7EB6F56C7 );
+       sc->B[ 5] = _mm_set1_epi64x( 0x1EA81AA91EA81AA9 );
+       sc->B[ 6] = _mm_set1_epi64x( 0x73B9D31473B9D314 );
+       sc->B[ 7] = _mm_set1_epi64x( 0x1DE85D081DE85D08 );
+       sc->B[ 8] = _mm_set1_epi64x( 0x48910A5A48910A5A );
+       sc->B[ 9] = _mm_set1_epi64x( 0x893B22DB893B22DB );
+       sc->B[10] = _mm_set1_epi64x( 0xC5A0DF44C5A0DF44 );
+       sc->B[11] = _mm_set1_epi64x( 0xBBC4324EBBC4324E );
+       sc->B[12] = _mm_set1_epi64x( 0x72D2F24072D2F240 );
+       sc->B[13] = _mm_set1_epi64x( 0x75941D9975941D99 );
+       sc->B[14] = _mm_set1_epi64x( 0x6D8BDE826D8BDE82 );
+       sc->B[15] = _mm_set1_epi64x( 0xA1A7502BA1A7502B );

-       sc->C[ 0] = m128_const1_64( 0xD9BF68D1D9BF68D1 );
-       sc->C[ 1] = m128_const1_64( 0x58BAD75058BAD750 );
-       sc->C[ 2] = m128_const1_64( 0x56028CB256028CB2 );
-       sc->C[ 3] = m128_const1_64( 0x8134F3598134F359 );
-       sc->C[ 4] = m128_const1_64( 0xB5D469D8B5D469D8 );
-       sc->C[ 5] = m128_const1_64( 0x941A8CC2941A8CC2 );
-       sc->C[ 6] = m128_const1_64( 0x418B2A6E418B2A6E );
-       sc->C[ 7] = m128_const1_64( 0x0405278004052780 );
-       sc->C[ 8] = m128_const1_64( 0x7F07D7877F07D787 );
-       sc->C[ 9] = m128_const1_64( 0x5194358F5194358F );
-       sc->C[10] = m128_const1_64( 0x3C60D6653C60D665 );
-       sc->C[11] = m128_const1_64( 0xBE97D79ABE97D79A );
-       sc->C[12] = m128_const1_64( 0x950C3434950C3434 );
-       sc->C[13] = m128_const1_64( 0xAED9A06DAED9A06D );
-       sc->C[14] = m128_const1_64( 0x2537DC8D2537DC8D );
-       sc->C[15] = m128_const1_64( 0x7CDB59697CDB5969 );
+       sc->C[ 0] = _mm_set1_epi64x( 0xD9BF68D1D9BF68D1 );
+       sc->C[ 1] = _mm_set1_epi64x( 0x58BAD75058BAD750 );
+       sc->C[ 2] = _mm_set1_epi64x( 0x56028CB256028CB2 );
+       sc->C[ 3] = _mm_set1_epi64x( 0x8134F3598134F359 );
+       sc->C[ 4] = _mm_set1_epi64x( 0xB5D469D8B5D469D8 );
+       sc->C[ 5] = _mm_set1_epi64x( 0x941A8CC2941A8CC2 );
+       sc->C[ 6] = _mm_set1_epi64x( 0x418B2A6E418B2A6E );
+       sc->C[ 7] = _mm_set1_epi64x( 0x0405278004052780 );
+       sc->C[ 8] = _mm_set1_epi64x( 0x7F07D7877F07D787 );
+       sc->C[ 9] = _mm_set1_epi64x( 0x5194358F5194358F );
+       sc->C[10] = _mm_set1_epi64x( 0x3C60D6653C60D665 );
+       sc->C[11] = _mm_set1_epi64x( 0xBE97D79ABE97D79A );
+       sc->C[12] = _mm_set1_epi64x( 0x950C3434950C3434 );
+       sc->C[13] = _mm_set1_epi64x( 0xAED9A06DAED9A06D );
+       sc->C[14] = _mm_set1_epi64x( 0x2537DC8D2537DC8D );
+       sc->C[15] = _mm_set1_epi64x( 0x7CDB59697CDB5969 );
 */
   }
   else
   {  // No users
       sc->state_loaded = true;
-       sc->A[ 0] = m128_const1_64( 0x52F8455252F84552 );
-       sc->A[ 1] = m128_const1_64( 0xE54B7999E54B7999 );
-       sc->A[ 2] = m128_const1_64( 0x2D8EE3EC2D8EE3EC );
-       sc->A[ 3] = m128_const1_64( 0xB9645191B9645191 );
-       sc->A[ 4] = m128_const1_64( 0xE0078B86E0078B86 );
-       sc->A[ 5] = m128_const1_64( 0xBB7C44C9BB7C44C9 );
-       sc->A[ 6] = m128_const1_64( 0xD2B5C1CAD2B5C1CA );
-       sc->A[ 7] = m128_const1_64( 0xB0D2EB8CB0D2EB8C );
-       sc->A[ 8] = m128_const1_64( 0x14CE5A4514CE5A45 );
-       sc->A[ 9] = m128_const1_64( 0x22AF50DC22AF50DC );
-       sc->A[10] = m128_const1_64( 0xEFFDBC6BEFFDBC6B );
-       sc->A[11] = m128_const1_64( 0xEB21B74AEB21B74A );
+       sc->A[ 0] = _mm_set1_epi64x( 0x52F8455252F84552 );
+       sc->A[ 1] = _mm_set1_epi64x( 0xE54B7999E54B7999 );
+       sc->A[ 2] = _mm_set1_epi64x( 0x2D8EE3EC2D8EE3EC );
+       sc->A[ 3] = _mm_set1_epi64x( 0xB9645191B9645191 );
+       sc->A[ 4] = _mm_set1_epi64x( 0xE0078B86E0078B86 );
+       sc->A[ 5] = _mm_set1_epi64x( 0xBB7C44C9BB7C44C9 );
+       sc->A[ 6] = _mm_set1_epi64x( 0xD2B5C1CAD2B5C1CA );
+       sc->A[ 7] = _mm_set1_epi64x( 0xB0D2EB8CB0D2EB8C );
+       sc->A[ 8] = _mm_set1_epi64x( 0x14CE5A4514CE5A45 );
+       sc->A[ 9] = _mm_set1_epi64x( 0x22AF50DC22AF50DC );
+       sc->A[10] = _mm_set1_epi64x( 0xEFFDBC6BEFFDBC6B );
+       sc->A[11] = _mm_set1_epi64x( 0xEB21B74AEB21B74A );

-       sc->B[ 0] = m128_const1_64( 0xB555C6EEB555C6EE );
-       sc->B[ 1] = m128_const1_64( 0x3E7105963E710596 );
-       sc->B[ 2] = m128_const1_64( 0xA72A652FA72A652F );
-       sc->B[ 3] = m128_const1_64( 0x9301515F9301515F );
-       sc->B[ 4] = m128_const1_64( 0xDA28C1FADA28C1FA );
-       sc->B[ 5] = m128_const1_64( 0x696FD868696FD868 );
-       sc->B[ 6] = m128_const1_64( 0x9CB6BF729CB6BF72 );
-       sc->B[ 7] = m128_const1_64( 0x0AFE40020AFE4002 );
-       sc->B[ 8] = m128_const1_64( 0xA6E03615A6E03615 );
-       sc->B[ 9] = m128_const1_64( 0x5138C1D45138C1D4 );
-       sc->B[10] = m128_const1_64( 0xBE216306BE216306 );
-       sc->B[11] = m128_const1_64( 0xB38B8890B38B8890 );
-       sc->B[12] = m128_const1_64( 0x3EA8B96B3EA8B96B );
-       sc->B[13] = m128_const1_64( 0x3299ACE43299ACE4 );
-       sc->B[14] = m128_const1_64( 0x30924DD430924DD4 );
-       sc->B[15] = m128_const1_64( 0x55CB34A555CB34A5 );
+       sc->B[ 0] = _mm_set1_epi64x( 0xB555C6EEB555C6EE );
+       sc->B[ 1] = _mm_set1_epi64x( 0x3E7105963E710596 );
+       sc->B[ 2] = _mm_set1_epi64x( 0xA72A652FA72A652F );
+       sc->B[ 3] = _mm_set1_epi64x( 0x9301515F9301515F );
+       sc->B[ 4] = _mm_set1_epi64x( 0xDA28C1FADA28C1FA );
+       sc->B[ 5] = _mm_set1_epi64x( 0x696FD868696FD868 );
+       sc->B[ 6] = _mm_set1_epi64x( 0x9CB6BF729CB6BF72 );
+       sc->B[ 7] = _mm_set1_epi64x( 0x0AFE40020AFE4002 );
+       sc->B[ 8] = _mm_set1_epi64x( 0xA6E03615A6E03615 );
+       sc->B[ 9] = _mm_set1_epi64x( 0x5138C1D45138C1D4 );
+       sc->B[10] = _mm_set1_epi64x( 0xBE216306BE216306 );
+       sc->B[11] = _mm_set1_epi64x( 0xB38B8890B38B8890 );
+       sc->B[12] = _mm_set1_epi64x( 0x3EA8B96B3EA8B96B );
+       sc->B[13] = _mm_set1_epi64x( 0x3299ACE43299ACE4 );
+       sc->B[14] = _mm_set1_epi64x( 0x30924DD430924DD4 );
+       sc->B[15] = _mm_set1_epi64x( 0x55CB34A555CB34A5 );

-       sc->C[ 0] = m128_const1_64( 0xB405F031B405F031 );
-       sc->C[ 1] = m128_const1_64( 0xC4233EBAC4233EBA );
-       sc->C[ 2] = m128_const1_64( 0xB3733979B3733979 );
-       sc->C[ 3] = m128_const1_64( 0xC0DD9D55C0DD9D55 );
-       sc->C[ 4] = m128_const1_64( 0xC51C28AEC51C28AE );
-       sc->C[ 5] = m128_const1_64( 0xA327B8E1A327B8E1 );
-       sc->C[ 6] = m128_const1_64( 0x56C5616756C56167 );
-       sc->C[ 7] = m128_const1_64( 0xED614433ED614433 );
-       sc->C[ 8] = m128_const1_64( 0x88B59D6088B59D60 );
-       sc->C[ 9] = m128_const1_64( 0x60E2CEBA60E2CEBA );
-       sc->C[10] = m128_const1_64( 0x758B4B8B758B4B8B );
-       sc->C[11] = m128_const1_64( 0x83E82A7F83E82A7F );
-       sc->C[12] = m128_const1_64( 0xBC968828BC968828 );
-       sc->C[13] = m128_const1_64( 0xE6E00BF7E6E00BF7 );
-       sc->C[14] = m128_const1_64( 0xBA839E55BA839E55 );
-       sc->C[15] = m128_const1_64( 0x9B491C609B491C60 );
+       sc->C[ 0] = _mm_set1_epi64x( 0xB405F031B405F031 );
+       sc->C[ 1] = _mm_set1_epi64x( 0xC4233EBAC4233EBA );
+       sc->C[ 2] = _mm_set1_epi64x( 0xB3733979B3733979 );
+       sc->C[ 3] = _mm_set1_epi64x( 0xC0DD9D55C0DD9D55 );
+       sc->C[ 4] = _mm_set1_epi64x( 0xC51C28AEC51C28AE );
+       sc->C[ 5] = _mm_set1_epi64x( 0xA327B8E1A327B8E1 );
+       sc->C[ 6] = _mm_set1_epi64x( 0x56C5616756C56167 );
+       sc->C[ 7] = _mm_set1_epi64x( 0xED614433ED614433 );
+       sc->C[ 8] = _mm_set1_epi64x( 0x88B59D6088B59D60 );
+       sc->C[ 9] = _mm_set1_epi64x( 0x60E2CEBA60E2CEBA );
+       sc->C[10] = _mm_set1_epi64x( 0x758B4B8B758B4B8B );
+       sc->C[11] = _mm_set1_epi64x( 0x83E82A7F83E82A7F );
+       sc->C[12] = _mm_set1_epi64x( 0xBC968828BC968828 );
+       sc->C[13] = _mm_set1_epi64x( 0xE6E00BF7E6E00BF7 );
+       sc->C[14] = _mm_set1_epi64x( 0xBA839E55BA839E55 );
+       sc->C[15] = _mm_set1_epi64x( 0x9B491C609B491C60 );
   }
    sc->Wlow = 1;
    sc->Whigh = 0;
--- a/algo/shavite/shavite-hash-2way.c
+++ b/algo/shavite/shavite-hash-2way.c
@@ -18,14 +18,6 @@ static const uint32_t IV512[] =
        0xE275EADE, 0x502D9FCD, 0xB9357178, 0x022A4B9A
 };

-/*
-#define mm256_ror2x256hi_1x32( a, b ) \
-   _mm256_blend_epi32( mm256_shuflr128_32( a ), \
-                       mm256_shuflr128_32( b ), 0x88 )
-*/
-
-//#define mm256_ror2x256hi_1x32( a, b ) _mm256_alignr_epi8( b, a, 4 )
-
 #if defined(__VAES__)

 #define mm256_aesenc_2x128( x, k ) \
@@ -34,8 +26,47 @@ static const uint32_t IV512[] =
 #else

 #define mm256_aesenc_2x128( x, k ) \
-   mm256_concat_128( _mm_aesenc_si128( mm128_extr_hi128_256( x ), k ), \
-                     _mm_aesenc_si128( mm128_extr_lo128_256( x ), k ) )
+   _mm256_inserti128_si256( _mm256_castsi128_si256( \
+            _mm_aesenc_si128( _mm256_castsi256_si128(   x ),    k ) ), \
+            _mm_aesenc_si128( _mm256_extracti128_si256( x, 1 ), k ), 1 )
+
+#endif
+
+#if defined (__AVX512VL__)
+//TODO Enable for AVX10_256
+
+#define DECL_m256i_count \
+   const __m256i count = \
+          mm256_set4_32( ctx->count3, ctx->count2, ctx->count1, ctx->count0 );
+
+#define COUNT_R0 \
+  _mm256_mask_xor_epi32( count, 0x88, count, m256_neg1 )
+
+#define COUNT_R1 \
+  mm256_shuflr128_32( _mm256_mask_xor_epi32( count, 0x11, count, m256_neg1 ) )
+
+#define COUNT_R2 \
+  mm256_swap128_64( _mm256_mask_xor_epi32( count, 0x22, count, m256_neg1 ) )
+
+#define COUNT_R13 \
+  mm256_swap64_32( _mm256_mask_xor_epi32( count, 0x44, count, m256_neg1 ) )
+
+#else
+
+#define DECL_m256i_count
+
+// R matches the loop index, not the round number; this should be changed.
+#define COUNT_R0 \
+  mm256_set4_32( ~ctx->count3, ctx->count2, ctx->count1, ctx->count0 )
+
+#define COUNT_R1 \
+  mm256_set4_32( ~ctx->count0, ctx->count1, ctx->count2, ctx->count3 ) 
+
+#define COUNT_R2 \
+  mm256_set4_32( ~ctx->count1, ctx->count0, ctx->count3, ctx->count2 )
+
+#define COUNT_R13 \
+  mm256_set4_32( ~ctx->count2, ctx->count3, ctx->count0, ctx->count1 )

 #endif

@@ -47,6 +78,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
   __m256i k00, k01, k02, k03, k10, k11, k12, k13;
   __m256i *m = (__m256i*)msg;
   __m256i *h = (__m256i*)ctx->h;
+   DECL_m256i_count;
   int r;

   p0 = h[0];
@@ -54,7 +86,8 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
   p2 = h[2];
   p3 = h[3];

-   // round
+   // round 0
+
   k00 = m[0];
   x = mm256_aesenc_2x128( _mm256_xor_si256( p1, k00 ), zero );
   k01 = m[1];
@@ -85,18 +118,14 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
                                  mm256_aesenc_2x128( k00, zero ) ) );

     if ( r == 0 )
-        k00 = _mm256_xor_si256( k00, _mm256_set_epi32( 
-		      ~ctx->count3, ctx->count2, ctx->count1, ctx->count0,
-                      ~ctx->count3, ctx->count2, ctx->count1, ctx->count0 ) );
+        k00 = _mm256_xor_si256( k00, COUNT_R0 );

     x = mm256_aesenc_2x128( _mm256_xor_si256( p0, k00 ), zero );
     k01 = _mm256_xor_si256( k00,
 		     mm256_shuflr128_32( mm256_aesenc_2x128( k01, zero ) ) );

     if ( r == 1 )
-        k01 = _mm256_xor_si256( k01, _mm256_set_epi32(
-	               ~ctx->count0, ctx->count1, ctx->count2, ctx->count3,
-                       ~ctx->count0, ctx->count1, ctx->count2, ctx->count3 ) );
+        k01 = _mm256_xor_si256( k01, COUNT_R1 );

     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k01 ), zero );
     k02 = _mm256_xor_si256( k01,
@@ -121,9 +150,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
 		     mm256_shuflr128_32( mm256_aesenc_2x128( k13, zero ) ) );

     if ( r == 2 )
-        k13 = _mm256_xor_si256( k13, _mm256_set_epi32(
-                  ~ctx->count1, ctx->count0, ctx->count3, ctx->count2,
-                  ~ctx->count1, ctx->count0, ctx->count3, ctx->count2 ) );
+        k13 = _mm256_xor_si256( k13, COUNT_R2 );
 
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k13 ), zero );
     p1 = _mm256_xor_si256( p1, x );
@@ -235,9 +262,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
   x = mm256_aesenc_2x128( _mm256_xor_si256( x, k11 ), zero );

   k12 = mm256_shuflr128_32( mm256_aesenc_2x128( k12, zero ) );
-   k12 = _mm256_xor_si256( k12, _mm256_xor_si256( k11, _mm256_set_epi32(
-	       ~ctx->count2, ctx->count3, ctx->count0, ctx->count1,
-	       ~ctx->count2, ctx->count3, ctx->count0, ctx->count1 ) ) );
+   k12 = _mm256_xor_si256( k12, _mm256_xor_si256( k11, COUNT_R13 ) );

   x = mm256_aesenc_2x128( _mm256_xor_si256( x, k12 ), zero );
   k13 = _mm256_xor_si256( mm256_shuflr128_32(
@@ -257,10 +282,10 @@ void shavite512_2way_init( shavite512_2way_context *ctx )
    __m256i *h = (__m256i*)ctx->h;
    __m128i *iv = (__m128i*)IV512;
   
-   h[0] = m256_const1_128( iv[0] );
-   h[1] = m256_const1_128( iv[1] );
-   h[2] = m256_const1_128( iv[2] );
-   h[3] = m256_const1_128( iv[3] );
+   h[0] = mm256_bcast_m128( iv[0] );
+   h[1] = mm256_bcast_m128( iv[1] );
+   h[2] = mm256_bcast_m128( iv[2] );
+   h[3] = mm256_bcast_m128( iv[3] );

   ctx->ptr    = 0;
   ctx->count0 = 0;
@@ -320,7 +345,7 @@ void shavite512_2way_close( shavite512_2way_context *ctx, void *dst )
    uint32_t vp = ctx->ptr>>5;

    // Terminating byte then zero pad
-    casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
+    casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );

    // Zero pad full vectors up to count
    for ( ; vp < 6; vp++ )      
@@ -334,9 +359,9 @@ void shavite512_2way_close( shavite512_2way_context *ctx, void *dst )
    count.u32[2] = ctx->count2;
    count.u32[3] = ctx->count3;

-    casti_m256i( buf, 6 ) = m256_const1_128(
+    casti_m256i( buf, 6 ) = mm256_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) ); 
-    casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
+    casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );
                
@@ -400,19 +425,19 @@ void shavite512_2way_update_close( shavite512_2way_context *ctx, void *dst,

   if ( vp == 0 )    // empty buf, xevan.
   { 
-      casti_m256i( buf, 0 ) = m256_const1_i128( 0x0000000000000080 );
+      casti_m256i( buf, 0 ) = mm256_bcast128lo_64( 0x0000000000000080 );
      memset_zero_256( (__m256i*)buf + 1, 5 );
      ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
   }
   else     // half full buf, everyone else.
   {
-    casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
+    casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );
      memset_zero_256( (__m256i*)buf + vp, 6 - vp );
   }

-    casti_m256i( buf, 6 ) = m256_const1_128(
+    casti_m256i( buf, 6 ) = mm256_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) ); 
-    casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
+    casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

@@ -430,10 +455,10 @@ void shavite512_2way_full( shavite512_2way_context *ctx, void *dst,
    __m256i *h = (__m256i*)ctx->h;
    __m128i *iv = (__m128i*)IV512;

-   h[0] = m256_const1_128( iv[0] );
-   h[1] = m256_const1_128( iv[1] );
-   h[2] = m256_const1_128( iv[2] );
-   h[3] = m256_const1_128( iv[3] );
+   h[0] = mm256_bcast_m128( iv[0] );
+   h[1] = mm256_bcast_m128( iv[1] );
+   h[2] = mm256_bcast_m128( iv[2] );
+   h[3] = mm256_bcast_m128( iv[3] );

   ctx->ptr    =
   ctx->count0 =
@@ -490,19 +515,19 @@ void shavite512_2way_full( shavite512_2way_context *ctx, void *dst,

   if ( vp == 0 )    // empty buf, xevan.
   {
-      casti_m256i( buf, 0 ) = m256_const1_i128( 0x0000000000000080 );
+      casti_m256i( buf, 0 ) = mm256_bcast128lo_64( 0x0000000000000080 );
      memset_zero_256( (__m256i*)buf + 1, 5 );
      ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
   }
   else     // half full buf, everyone else.
   {
-    casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
+    casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );
      memset_zero_256( (__m256i*)buf + vp, 6 - vp );
   }

-    casti_m256i( buf, 6 ) = m256_const1_128(
+    casti_m256i( buf, 6 ) = mm256_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
-    casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
+    casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

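Reviewer note: the `__AVX512VL__` variants of `COUNT_R0`, `COUNT_R2` and `COUNT_R13` above replace a scalar `~ctx->countN` with a masked XOR against all-ones, followed by a lane shuffle. The portable sketch below models one 128-bit lane as four 32-bit elements to check that the two forms agree. It is an illustration only: the helper names are hypothetical, and it assumes that `mm256_set4_32` lists elements from highest to lowest, that `mm256_swap128_64` swaps the two 64-bit halves of a lane, and that `mm256_swap64_32` swaps the 32-bit elements inside each 64-bit half (inferred from their names, not shown in this diff). The per-lane masks `0x8`, `0x2`, `0x4` correspond to `0x88`, `0x22`, `0x44` over two lanes. `COUNT_R1` is left out because it depends on the exact element order produced by `mm256_shuflr128_32`.

```c
#include <stdint.h>

// One 128-bit lane as four 32-bit elements, v[0] is the lowest element.
typedef struct { uint32_t v[4]; } lane128;

// Hypothetical stand-in for mm256_set4_32: arguments from highest to lowest.
static lane128 set4( uint32_t e3, uint32_t e2, uint32_t e1, uint32_t e0 )
{
   lane128 r = { { e0, e1, e2, e3 } };
   return r;
}

// Model of a masked XOR with all-ones: element i is inverted when bit i of
// mask is set, otherwise it is copied unchanged.
static lane128 mask_not( lane128 src, unsigned mask )
{
   lane128 r;
   for ( int i = 0; i < 4; i++ )
      r.v[i] = ( ( mask >> i ) & 1 ) ? (uint32_t)~src.v[i] : src.v[i];
   return r;
}

// Assumed behaviour of mm256_swap128_64: swap the 64-bit halves of the lane.
static lane128 swap_halves64( lane128 a )
{
   lane128 r = { { a.v[2], a.v[3], a.v[0], a.v[1] } };
   return r;
}

// Assumed behaviour of mm256_swap64_32: swap elements inside each 64-bit half.
static lane128 swap_pairs32( lane128 a )
{
   lane128 r = { { a.v[1], a.v[0], a.v[3], a.v[2] } };
   return r;
}

static int lane_eq( lane128 a, lane128 b )
{
   for ( int i = 0; i < 4; i++ )
      if ( a.v[i] != b.v[i] ) return 0;
   return 1;
}

// Returns 1 when the masked forms match the scalar forms for these counters.
static int counts_agree( uint32_t c0, uint32_t c1, uint32_t c2, uint32_t c3 )
{
   const lane128 count = set4( c3, c2, c1, c0 );

   return lane_eq( mask_not( count, 0x8 ),                      // COUNT_R0
                   set4( ~c3, c2, c1, c0 ) )
       && lane_eq( swap_halves64( mask_not( count, 0x2 ) ),     // COUNT_R2
                   set4( ~c1, c0, c3, c2 ) )
       && lane_eq( swap_pairs32( mask_not( count, 0x4 ) ),      // COUNT_R13
                   set4( ~c2, c3, c0, c1 ) );
}
```

Under those assumptions the masked form needs only one vector built from the counters per compression call, which appears to be the motivation for `DECL_m256i_count`.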
--- a/algo/shavite/shavite-hash-4way.c
+++ b/algo/shavite/shavite-hash-4way.c
@@ -204,11 +204,9 @@ c512_4way( shavite512_4way_context *ctx, const void *msg )
   K5 = _mm512_xor_si512( mm512_shuflr128_32(
 			             _mm512_aesenc_epi128( K5, m512_zero ) ), K4 );
   X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K5 ), m512_zero );
-
   K6 = mm512_shuflr128_32( _mm512_aesenc_epi128( K6, m512_zero ) );
-   K6 = _mm512_xor_si512( K6, _mm512_xor_si512( K5, _mm512_set4_epi32(
-	       ~ctx->count2, ctx->count3, ctx->count0, ctx->count1 ) ) );
-
+   K6 = _mm512_xor_si512( K6, _mm512_xor_si512( K5,  mm512_swap64_32( 
+              _mm512_mask_xor_epi32( count, 0x4444, count, m512_neg1 ) ) ) );
   X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K6 ), m512_zero );
   K7= _mm512_xor_si512( mm512_shuflr128_32(
 			             _mm512_aesenc_epi128( K7, m512_zero ) ), K6 );
@@ -227,10 +225,10 @@ void shavite512_4way_init( shavite512_4way_context *ctx )
    __m512i *h = (__m512i*)ctx->h;
    __m128i *iv = (__m128i*)IV512;
   
-   h[0] = m512_const1_128( iv[0] );
-   h[1] = m512_const1_128( iv[1] );
-   h[2] = m512_const1_128( iv[2] );
-   h[3] = m512_const1_128( iv[3] );
+   h[0] = mm512_bcast_m128( iv[0] );
+   h[1] = mm512_bcast_m128( iv[1] );
+   h[2] = mm512_bcast_m128( iv[2] );
+   h[3] = mm512_bcast_m128( iv[3] );

   ctx->ptr    = 0;
   ctx->count0 = 0;
@@ -290,7 +288,7 @@ void shavite512_4way_close( shavite512_4way_context *ctx, void *dst )
    uint32_t vp = ctx->ptr>>6;

    // Terminating byte then zero pad
-    casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
+    casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );

    // Zero pad full vectors up to count
    for ( ; vp < 6; vp++ )      
@@ -304,9 +302,9 @@ void shavite512_4way_close( shavite512_4way_context *ctx, void *dst )
    count.u32[2] = ctx->count2;
    count.u32[3] = ctx->count3;

-    casti_m512i( buf, 6 ) = m512_const1_128(
+    casti_m512i( buf, 6 ) = mm512_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) ); 
-    casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
+    casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );
                
@@ -370,19 +368,19 @@ void shavite512_4way_update_close( shavite512_4way_context *ctx, void *dst,

   if ( vp == 0 )    // empty buf, xevan.
   { 
-      casti_m512i( buf, 0 ) = m512_const1_i128( 0x0000000000000080 );
+      casti_m512i( buf, 0 ) = mm512_bcast128lo_64( 0x0000000000000080 );
      memset_zero_512( (__m512i*)buf + 1, 5 );
      ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
   }
   else     // half full buf, everyone else.
   {
-    casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
+    casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );
      memset_zero_512( (__m512i*)buf + vp, 6 - vp );
   }

-    casti_m512i( buf, 6 ) = m512_const1_128(
+    casti_m512i( buf, 6 ) = mm512_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) ); 
-    casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
+    casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

@@ -401,10 +399,10 @@ void shavite512_4way_full( shavite512_4way_context *ctx, void *dst,
    __m512i *h = (__m512i*)ctx->h;
    __m128i *iv = (__m128i*)IV512;

-   h[0] = m512_const1_128( iv[0] );
-   h[1] = m512_const1_128( iv[1] );
-   h[2] = m512_const1_128( iv[2] );
-   h[3] = m512_const1_128( iv[3] );
+   h[0] = mm512_bcast_m128( iv[0] );
+   h[1] = mm512_bcast_m128( iv[1] );
+   h[2] = mm512_bcast_m128( iv[2] );
+   h[3] = mm512_bcast_m128( iv[3] );

   ctx->ptr    = 
   ctx->count0 = 
@@ -461,19 +459,19 @@ void shavite512_4way_full( shavite512_4way_context *ctx, void *dst,

   if ( vp == 0 )    // empty buf, xevan.
   {
-      casti_m512i( buf, 0 ) = m512_const1_i128( 0x0000000000000080 );
+      casti_m512i( buf, 0 ) = mm512_bcast128lo_64( 0x0000000000000080 );
      memset_zero_512( (__m512i*)buf + 1, 5 );
      ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
   }
   else     // half full buf, everyone else.
   {
-    casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
+    casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );
      memset_zero_512( (__m512i*)buf + vp, 6 - vp );
   }

-    casti_m512i( buf, 6 ) = m512_const1_128(
+    casti_m512i( buf, 6 ) = mm512_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
-    casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
+    casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

--- a/algo/simd/simd-hash-2way.c
+++ b/algo/simd/simd-hash-2way.c
@@ -212,14 +212,24 @@ do { \
 // targetted
 #define shufxor2w(x,s) _mm256_shuffle_epi32( x, XCAT( SHUFXOR_, s ))

+#if defined(__AVX512VL__)
+//TODO Enable for AVX10_256
+
 #define REDUCE(x) \
-  _mm256_sub_epi16( _mm256_and_si256( x, m256_const1_64( \
+  _mm256_sub_epi16( _mm256_maskz_mov_epi8( 0x55555555, x ), \
+                    _mm256_srai_epi16( x, 8 ) )
+#else
+
+#define REDUCE(x) \
+  _mm256_sub_epi16( _mm256_and_si256( x, _mm256_set1_epi64x( \
                         0x00ff00ff00ff00ff ) ), _mm256_srai_epi16( x, 8 ) )

+#endif
+
 #define EXTRA_REDUCE_S(x)\
  _mm256_sub_epi16( x, _mm256_and_si256( \
-             m256_const1_64( 0x0101010101010101 ), \
-             _mm256_cmpgt_epi16( x, m256_const1_64( 0x0080008000800080 ) ) ) )
+          _mm256_set1_epi64x( 0x0101010101010101 ), \
+          _mm256_cmpgt_epi16( x, _mm256_set1_epi64x( 0x0080008000800080 ) ) ) )

 #define REDUCE_FULL_S( x )  EXTRA_REDUCE_S( REDUCE (x ) )

@@ -387,17 +397,11 @@ static const m512_v16 FFT256_Twiddle4w[] =
  _mm512_sub_epi16( _mm512_maskz_mov_epi8( 0x5555555555555555, x ), \
                    _mm512_srai_epi16( x, 8 ) )

-/*
-#define REDUCE4w(x) \
-  _mm512_sub_epi16( _mm512_and_si512( x, m512_const1_64( \
-                         0x00ff00ff00ff00ff ) ), _mm512_srai_epi16( x, 8 ) )
-*/
-
 #define EXTRA_REDUCE_S4w(x) \
  _mm512_sub_epi16( x, _mm512_and_si512( \
-             m512_const1_64( 0x0101010101010101 ), \
+             _mm512_set1_epi64( 0x0101010101010101 ), \
             _mm512_movm_epi16( _mm512_cmpgt_epi16_mask( \
-                               x, m512_const1_64( 0x0080008000800080 ) ) ) ) )
+                             x, _mm512_set1_epi64( 0x0080008000800080 ) ) ) ) )

 // generic, except it calls targetted macros
 #define REDUCE_FULL_S4w( x )  EXTRA_REDUCE_S4w( REDUCE4w (x ) )
@@ -484,14 +488,7 @@ do { \
 #undef BUTTERFLY_0
 #undef BUTTERFLY_N

-// twiddle is hard coded  T[0] = m512_const2_64( {128,64,32,16}, {8,4,2,1} )  
  // Multiply by twiddle factors
-//  X(6) = _mm512_mullo_epi16( X(6), m512_const2_64( 0x0080004000200010,
-//                                                   0x0008000400020001 );
-//  X(5) = _mm512_mullo_epi16( X(5), m512_const2_64( 0xffdc0008ffef0004,
-//                                                   0x00780002003c0001 );
-
-
  X(6) = _mm512_mullo_epi16( X(6), FFT64_Twiddle4w[0].v512 );
  X(5) = _mm512_mullo_epi16( X(5), FFT64_Twiddle4w[1].v512 );
  X(4) = _mm512_mullo_epi16( X(4), FFT64_Twiddle4w[2].v512 );
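Reviewer note: both `REDUCE` variants in this file compute `(x & 0x00ff) - (x >> 8)` on each signed 16-bit element. The AVX512VL form gets the `& 0x00ff` part by keeping only the even bytes (`0x55555555` has every even bit set). The scalar sketch below, with hypothetical helper names, checks two things over the whole 16-bit range: that keeping byte 0 of an element equals masking with `0x00ff`, and that the result is congruent to the input modulo 257, which follows from 256 ≡ -1 (mod 257). It is a model of the arithmetic, not of the intrinsics themselves.

```c
#include <stdint.h>

// Scalar model of REDUCE on one signed 16-bit element.
static int reduce257( int16_t x )
{
   int lo = (uint16_t)x & 0xff;   // low byte, 0..255
   int hi = ( x - lo ) / 256;     // exact division, same as arithmetic x >> 8
   return lo - hi;
}

// Model of a zero-masked byte move on one 16-bit element: bit b of byte_mask
// keeps byte b, a clear bit zeroes it.
static uint16_t keep_bytes( uint16_t x, unsigned byte_mask )
{
   uint16_t r = 0;
   for ( int b = 0; b < 2; b++ )
      if ( ( byte_mask >> b ) & 1 )
         r = (uint16_t)( r | ( x & ( 0xffu << ( 8 * b ) ) ) );
   return r;
}

// Returns 1 when every 16-bit input satisfies the properties described above.
static int reduce_checks_pass( void )
{
   for ( int x = INT16_MIN; x <= INT16_MAX; x++ )
   {
      const int r = reduce257( (int16_t)x );

      if ( ( r - x ) % 257 != 0 )    return 0;   // congruent modulo 257
      if ( r < -127 || r > 383 )     return 0;   // partially reduced range
      if ( keep_bytes( (uint16_t)x, 0x1 ) != ( (uint16_t)x & 0x00ff ) )
         return 0;                               // byte mask equals AND mask
   }
   return 1;
}
```

The output is only partially reduced (it can lie anywhere in -127..383), which is consistent with the file following `REDUCE` with `EXTRA_REDUCE_S` in `REDUCE_FULL_S`.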
--- a/algo/skein/skein-4way.c
+++ b/algo/skein/skein-4way.c
@@ -63,7 +63,7 @@ int scanhash_skein_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

@@ -151,7 +151,7 @@ int scanhash_skein_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

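Reviewer note: the nonce update above adds a 64-bit broadcast constant using a 32-bit element add, so only the upper 32 bits of each 64-bit lane change and no carry crosses between halves. The scalar sketch below models one lane. The helper names are hypothetical, and it assumes, as the constants `0x0000000800000000` and `0x0000000400000000` suggest, that the nonce occupies the upper half of each lane.

```c
#include <stdint.h>

// Model of a 32-bit element add applied to one 64-bit lane: each half wraps
// on its own and no carry propagates from the low half into the high half.
static uint64_t add_epi32_lane( uint64_t a, uint64_t b )
{
   const uint32_t lo = (uint32_t)a + (uint32_t)b;
   const uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}

// Advance the nonce held in the upper 32 bits by the number of parallel ways.
static uint64_t bump_nonce( uint64_t lane, uint32_t ways )
{
   return add_epi32_lane( lane, (uint64_t)ways << 32 );
}
```

With `ways` set to 8 or 4 this mirrors the `n += 8` and `n += 4` steps in the two scan loops.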
--- a/algo/skein/skein-hash-4way.c
+++ b/algo/skein/skein-hash-4way.c
@@ -285,7 +285,7 @@ static const uint64_t IV512[] = {
 #define SKBI(k, s, i)   XCAT(k, XCAT(XCAT(XCAT(M9_, s), _), i))
 #define SKBT(t, s, v)   XCAT(t, XCAT(XCAT(XCAT(M3_, s), _), v))

-#define READ_STATE_BIG(sc)   do { \
+#define READ_STATE_BIG(sc) \
      h0 = (sc)->h0; \
      h1 = (sc)->h1; \
      h2 = (sc)->h2; \
@@ -294,10 +294,9 @@ static const uint64_t IV512[] = {
      h5 = (sc)->h5; \
      h6 = (sc)->h6; \
      h7 = (sc)->h7; \
-      bcount = sc->bcount; \
-   } while (0)
+      bcount = sc->bcount;

-#define WRITE_STATE_BIG(sc)   do { \
+#define WRITE_STATE_BIG(sc) \
      (sc)->h0 = h0; \
      (sc)->h1 = h1; \
      (sc)->h2 = h2; \
@@ -306,62 +305,54 @@ static const uint64_t IV512[] = {
      (sc)->h5 = h5; \
      (sc)->h6 = h6; \
      (sc)->h7 = h7; \
-      sc->bcount = bcount; \
-   } while (0)
+      sc->bcount = bcount;
   

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

 #define TFBIG_KINIT_8WAY( k0, k1, k2, k3, k4, k5, k6, k7, k8, t0, t1, t2 ) \
-do { \
-  k8 = mm512_xor3( mm512_xor3( k0, k1, k2 ), mm512_xor3( k3, k4, k5 ), \
-                   mm512_xor3( k6, k7, m512_const1_64( 0x1BD11BDAA9FC1A22) ));\
-  t2 = t0 ^ t1; \
-} while (0)
+  k8 = mm512_xor3( mm512_xor3( k0, k1, k2 ), \
+                   mm512_xor3( k3, k4, k5 ), \
+                   mm512_xor3( k6, k7, \
+                              _mm512_set1_epi64( 0x1BD11BDAA9FC1A22) ) ); \
+  t2 = t0 ^ t1;

 #define TFBIG_ADDKEY_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, k, t, s) \
-do { \
  w0 = _mm512_add_epi64( w0, SKBI(k,s,0) ); \
  w1 = _mm512_add_epi64( w1, SKBI(k,s,1) ); \
  w2 = _mm512_add_epi64( w2, SKBI(k,s,2) ); \
  w3 = _mm512_add_epi64( w3, SKBI(k,s,3) ); \
  w4 = _mm512_add_epi64( w4, SKBI(k,s,4) ); \
  w5 = _mm512_add_epi64( w5, _mm512_add_epi64( SKBI(k,s,5), \
-                                         m512_const1_64( SKBT(t,s,0) ) ) ); \
+                                       _mm512_set1_epi64( SKBT(t,s,0) ) ) ); \
  w6 = _mm512_add_epi64( w6, _mm512_add_epi64( SKBI(k,s,6), \
-                                         m512_const1_64( SKBT(t,s,1) ) ) ); \
+                                       _mm512_set1_epi64( SKBT(t,s,1) ) ) ); \
  w7 = _mm512_add_epi64( w7, _mm512_add_epi64( SKBI(k,s,7), \
-                                         m512_const1_64( s ) ) ); \
-} while (0)
+                                        _mm512_set1_epi64( s ) ) );

 #define TFBIG_MIX_8WAY(x0, x1, rc) \
-do { \
     x0 = _mm512_add_epi64( x0, x1 ); \
-     x1 = _mm512_xor_si512( mm512_rol_64( x1, rc ), x0 ); \
-} while (0)
+     x1 = _mm512_xor_si512( mm512_rol_64( x1, rc ), x0 );

-#define TFBIG_MIX8_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3)  do { \
+#define TFBIG_MIX8_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3) \
      TFBIG_MIX_8WAY(w0, w1, rc0); \
      TFBIG_MIX_8WAY(w2, w3, rc1); \
      TFBIG_MIX_8WAY(w4, w5, rc2); \
-      TFBIG_MIX_8WAY(w6, w7, rc3); \
-   } while (0)
+      TFBIG_MIX_8WAY(w6, w7, rc3);

-#define TFBIG_8WAY_4e(s)   do { \
+#define TFBIG_8WAY_4e(s) \
      TFBIG_ADDKEY_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
      TFBIG_MIX8_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, 46, 36, 19, 37); \
      TFBIG_MIX8_8WAY(p2, p1, p4, p7, p6, p5, p0, p3, 33, 27, 14, 42); \
      TFBIG_MIX8_8WAY(p4, p1, p6, p3, p0, p5, p2, p7, 17, 49, 36, 39); \
-      TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44,  9, 54, 56); \
-   } while (0)
+      TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44,  9, 54, 56);

-#define TFBIG_8WAY_4o(s)   do { \
+#define TFBIG_8WAY_4o(s) \
      TFBIG_ADDKEY_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
      TFBIG_MIX8_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, 39, 30, 34, 24); \
      TFBIG_MIX8_8WAY(p2, p1, p4, p7, p6, p5, p0, p3, 13, 50, 10, 17); \
      TFBIG_MIX8_8WAY(p4, p1, p6, p3, p0, p5, p2, p7, 25, 29, 39, 43); \
-      TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3,  8, 35, 56, 22); \
-   } while (0)
+      TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3,  8, 35, 56, 22);

 #define UBI_BIG_8WAY(etype, extra) \
 do { \
@@ -424,59 +415,48 @@ do { \
 #endif // AVX512

 #define TFBIG_KINIT_4WAY( k0, k1, k2, k3, k4, k5, k6, k7, k8, t0, t1, t2 ) \
-do { \
-  k8 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( _mm256_xor_si256( k0, k1 ), \
-                                              _mm256_xor_si256( k2, k3 ) ), \
-                            _mm256_xor_si256( _mm256_xor_si256( k4, k5 ), \
-                                              _mm256_xor_si256( k6, k7 ) ) ), \
-                         m256_const1_64( 0x1BD11BDAA9FC1A22) ); \
-  t2 = t0 ^ t1; \
-} while (0)
+  k8 = mm256_xor3( mm256_xor3( k0, k1, k2 ), \
+                   mm256_xor3( k3, k4, k5 ), \
+                   mm256_xor3( k6, k7, \
+                               _mm256_set1_epi64x( 0x1BD11BDAA9FC1A22) ) ); \
+  t2 = t0 ^ t1;

 #define TFBIG_ADDKEY_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, k, t, s) \
-do { \
  w0 = _mm256_add_epi64( w0, SKBI(k,s,0) ); \
  w1 = _mm256_add_epi64( w1, SKBI(k,s,1) ); \
  w2 = _mm256_add_epi64( w2, SKBI(k,s,2) ); \
  w3 = _mm256_add_epi64( w3, SKBI(k,s,3) ); \
  w4 = _mm256_add_epi64( w4, SKBI(k,s,4) ); \
  w5 = _mm256_add_epi64( w5, _mm256_add_epi64( SKBI(k,s,5), \
-                                         m256_const1_64( SKBT(t,s,0) ) ) ); \
+                                       _mm256_set1_epi64x( SKBT(t,s,0) ) ) ); \
  w6 = _mm256_add_epi64( w6, _mm256_add_epi64( SKBI(k,s,6), \
-                                         m256_const1_64( SKBT(t,s,1) ) ) ); \
+                                       _mm256_set1_epi64x( SKBT(t,s,1) ) ) ); \
  w7 = _mm256_add_epi64( w7, _mm256_add_epi64( SKBI(k,s,7), \
-                                         m256_const1_64( s ) ) ); \
-} while (0)
+                                       _mm256_set1_epi64x( s ) ) );

 #define TFBIG_MIX_4WAY(x0, x1, rc) \
-do { \
     x0 = _mm256_add_epi64( x0, x1 ); \
-     x1 = _mm256_xor_si256( mm256_rol_64( x1, rc ), x0 ); \
-} while (0)
+     x1 = _mm256_xor_si256( mm256_rol_64( x1, rc ), x0 );

-#define TFBIG_MIX8_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3)  do { \
+#define TFBIG_MIX8_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3) \
      TFBIG_MIX_4WAY(w0, w1, rc0); \
      TFBIG_MIX_4WAY(w2, w3, rc1); \
      TFBIG_MIX_4WAY(w4, w5, rc2); \
-      TFBIG_MIX_4WAY(w6, w7, rc3); \
-   } while (0)
+      TFBIG_MIX_4WAY(w6, w7, rc3);

-#define TFBIG_4WAY_4e(s)   do { \
+#define TFBIG_4WAY_4e(s) \
      TFBIG_ADDKEY_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
      TFBIG_MIX8_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, 46, 36, 19, 37); \
      TFBIG_MIX8_4WAY(p2, p1, p4, p7, p6, p5, p0, p3, 33, 27, 14, 42); \
      TFBIG_MIX8_4WAY(p4, p1, p6, p3, p0, p5, p2, p7, 17, 49, 36, 39); \
-      TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44,  9, 54, 56); \
-   } while (0)
+      TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44,  9, 54, 56);

-#define TFBIG_4WAY_4o(s)   do { \
+#define TFBIG_4WAY_4o(s) \
      TFBIG_ADDKEY_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
      TFBIG_MIX8_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, 39, 30, 34, 24); \
      TFBIG_MIX8_4WAY(p2, p1, p4, p7, p6, p5, p0, p3, 13, 50, 10, 17); \
      TFBIG_MIX8_4WAY(p4, p1, p6, p3, p0, p5, p2, p7, 25, 29, 39, 43); \
-      TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3,  8, 35, 56, 22); \
-   } while (0)
+      TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3,  8, 35, 56, 22);

 // scale buf offset by 4
 #define UBI_BIG_4WAY(etype, extra) \
@@ -541,28 +521,28 @@ do { \

 void skein256_8way_init( skein256_8way_context *sc )
 {
-        sc->h0 = m512_const1_64( 0xCCD044A12FDB3E13 );
-        sc->h1 = m512_const1_64( 0xE83590301A79A9EB );
-        sc->h2 = m512_const1_64( 0x55AEA0614F816E6F );
-        sc->h3 = m512_const1_64( 0x2A2767A4AE9B94DB );
-        sc->h4 = m512_const1_64( 0xEC06025E74DD7683 );
-        sc->h5 = m512_const1_64( 0xE7A436CDC4746251 );
-        sc->h6 = m512_const1_64( 0xC36FBAF9393AD185 );
-        sc->h7 = m512_const1_64( 0x3EEDBA1833EDFC13 );
+        sc->h0 = _mm512_set1_epi64( 0xCCD044A12FDB3E13 );
+        sc->h1 = _mm512_set1_epi64( 0xE83590301A79A9EB );
+        sc->h2 = _mm512_set1_epi64( 0x55AEA0614F816E6F );
+        sc->h3 = _mm512_set1_epi64( 0x2A2767A4AE9B94DB );
+        sc->h4 = _mm512_set1_epi64( 0xEC06025E74DD7683 );
+        sc->h5 = _mm512_set1_epi64( 0xE7A436CDC4746251 );
+        sc->h6 = _mm512_set1_epi64( 0xC36FBAF9393AD185 );
+        sc->h7 = _mm512_set1_epi64( 0x3EEDBA1833EDFC13 );
        sc->bcount = 0;
        sc->ptr = 0;
 }

 void skein512_8way_init( skein512_8way_context *sc )
 {
-        sc->h0 = m512_const1_64( 0x4903ADFF749C51CE );
-        sc->h1 = m512_const1_64( 0x0D95DE399746DF03 );
-        sc->h2 = m512_const1_64( 0x8FD1934127C79BCE );
-        sc->h3 = m512_const1_64( 0x9A255629FF352CB1 );
-        sc->h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
-        sc->h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
-        sc->h6 = m512_const1_64( 0x991112C71A75B523 );
-        sc->h7 = m512_const1_64( 0xAE18A40B660FCC33 );
+        sc->h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
+        sc->h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
+        sc->h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
+        sc->h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
+        sc->h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
+        sc->h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
+        sc->h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
+        sc->h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );
        sc->bcount = 0;
        sc->ptr = 0;
 }
@@ -660,14 +640,14 @@ void skein512_8way_full( skein512_8way_context *sc, void *out, const void *data,

 // Init

-        h0 = m512_const1_64( 0x4903ADFF749C51CE );
-        h1 = m512_const1_64( 0x0D95DE399746DF03 );
-        h2 = m512_const1_64( 0x8FD1934127C79BCE );
-        h3 = m512_const1_64( 0x9A255629FF352CB1 );
-        h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
-        h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
-        h6 = m512_const1_64( 0x991112C71A75B523 );
-        h7 = m512_const1_64( 0xAE18A40B660FCC33 );
+        h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
+        h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
+        h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
+        h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
+        h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
+        h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
+        h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
+        h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );

 // Update

@@ -734,14 +714,14 @@ skein512_8way_prehash64( skein512_8way_context *sc, const void *data )
   buf[5] = vdata[5];
   buf[6] = vdata[6];
   buf[7] = vdata[7];
-   register __m512i h0 = m512_const1_64( 0x4903ADFF749C51CE );
-   register __m512i h1 = m512_const1_64( 0x0D95DE399746DF03 );
-   register __m512i h2 = m512_const1_64( 0x8FD1934127C79BCE );
-   register __m512i h3 = m512_const1_64( 0x9A255629FF352CB1 );
-   register __m512i h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
-   register __m512i h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
-   register __m512i h6 = m512_const1_64( 0x991112C71A75B523 );
-   register __m512i h7 = m512_const1_64( 0xAE18A40B660FCC33 );
+   register __m512i h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
+   register __m512i h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
+   register __m512i h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
+   register __m512i h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
+   register __m512i h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
+   register __m512i h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
+   register __m512i h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
+   register __m512i h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );
   uint64_t bcount = 1;

   UBI_BIG_8WAY( 224, 0 );
@@ -830,28 +810,28 @@ skein512_8way_close(void *cc, void *dst)

 void skein256_4way_init( skein256_4way_context *sc )
 {
-        sc->h0 = m256_const1_64( 0xCCD044A12FDB3E13 );
-        sc->h1 = m256_const1_64( 0xE83590301A79A9EB );
-        sc->h2 = m256_const1_64( 0x55AEA0614F816E6F );
-        sc->h3 = m256_const1_64( 0x2A2767A4AE9B94DB );
-        sc->h4 = m256_const1_64( 0xEC06025E74DD7683 );
-        sc->h5 = m256_const1_64( 0xE7A436CDC4746251 );
-        sc->h6 = m256_const1_64( 0xC36FBAF9393AD185 );
-        sc->h7 = m256_const1_64( 0x3EEDBA1833EDFC13 );
+        sc->h0 = _mm256_set1_epi64x( 0xCCD044A12FDB3E13 );
+        sc->h1 = _mm256_set1_epi64x( 0xE83590301A79A9EB );
+        sc->h2 = _mm256_set1_epi64x( 0x55AEA0614F816E6F );
+        sc->h3 = _mm256_set1_epi64x( 0x2A2767A4AE9B94DB );
+        sc->h4 = _mm256_set1_epi64x( 0xEC06025E74DD7683 );
+        sc->h5 = _mm256_set1_epi64x( 0xE7A436CDC4746251 );
+        sc->h6 = _mm256_set1_epi64x( 0xC36FBAF9393AD185 );
+        sc->h7 = _mm256_set1_epi64x( 0x3EEDBA1833EDFC13 );
        sc->bcount = 0;
        sc->ptr = 0;
 }

 void skein512_4way_init( skein512_4way_context *sc )
 {
-        sc->h0 = m256_const1_64( 0x4903ADFF749C51CE );
-        sc->h1 = m256_const1_64( 0x0D95DE399746DF03 );
-        sc->h2 = m256_const1_64( 0x8FD1934127C79BCE );
-        sc->h3 = m256_const1_64( 0x9A255629FF352CB1 );
-        sc->h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
-        sc->h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
-        sc->h6 = m256_const1_64( 0x991112C71A75B523 );
-        sc->h7 = m256_const1_64( 0xAE18A40B660FCC33 );
+        sc->h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
+        sc->h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
+        sc->h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
+        sc->h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
+        sc->h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
+        sc->h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
+        sc->h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
+        sc->h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );
        sc->bcount = 0;
        sc->ptr = 0;
 }
@@ -954,14 +934,14 @@ skein512_4way_full( skein512_4way_context *sc, void *out, const void *data,
   const int buf_size = 64;   // 64 * __m256i
   uint64_t bcount = 0;

-   h0 = m256_const1_64( 0x4903ADFF749C51CE );
-   h1 = m256_const1_64( 0x0D95DE399746DF03 );
-   h2 = m256_const1_64( 0x8FD1934127C79BCE );
-   h3 = m256_const1_64( 0x9A255629FF352CB1 );
-   h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
-   h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
-   h6 = m256_const1_64( 0x991112C71A75B523 );
-   h7 = m256_const1_64( 0xAE18A40B660FCC33 );
+   h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
+   h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
+   h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
+   h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
+   h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
+   h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
+   h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
+   h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );

 // Update     

@@ -1028,14 +1008,14 @@ skein512_4way_prehash64( skein512_4way_context *sc, const void *data )
   buf[5] = vdata[5];
   buf[6] = vdata[6];
   buf[7] = vdata[7];
-   register __m256i h0 = m256_const1_64( 0x4903ADFF749C51CE );
-   register __m256i h1 = m256_const1_64( 0x0D95DE399746DF03 );
-   register __m256i h2 = m256_const1_64( 0x8FD1934127C79BCE );
-   register __m256i h3 = m256_const1_64( 0x9A255629FF352CB1 );
-   register __m256i h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
-   register __m256i h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
-   register __m256i h6 = m256_const1_64( 0x991112C71A75B523 );
-   register __m256i h7 = m256_const1_64( 0xAE18A40B660FCC33 );
+   register __m256i h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
+   register __m256i h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
+   register __m256i h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
+   register __m256i h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
+   register __m256i h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
+   register __m256i h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
+   register __m256i h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
+   register __m256i h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );
   uint64_t bcount = 1;

   UBI_BIG_4WAY( 224, 0 );
--- a/algo/skein/skein2-4way.c
+++ b/algo/skein/skein2-4way.c
@@ -57,7 +57,7 @@ int scanhash_skein2_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

@@ -119,7 +119,7 @@ int scanhash_skein2_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( (n < last_nonce) && !work_restart[thr_id].restart );

--- a/algo/sm3/sm3-hash-4way.c
+++ b/algo/sm3/sm3-hash-4way.c
@@ -74,6 +74,10 @@
   _mm256_or_si256( _mm256_and_si256( x, y ), \
                    _mm256_andnot_si256( x, z ) )

+#define mm256_rol_var_32( v, c ) \
+   _mm256_or_si256( _mm256_slli_epi32( v, c ), \
+                    _mm256_srli_epi32( v, 32-(c) ) )
+
 void sm3_8way_compress( __m256i *digest, __m256i *block )
 {
   __m256i W[68], W1[64];
@@ -251,6 +255,9 @@ void sm3_8way_close( void *cc, void *dst )
                                 _mm_andnot_si128( x, z ) )


+#define mm128_rol_var_32( v, c ) \
+   _mm_or_si128( _mm_slli_epi32( v, c ), _mm_srli_epi32( v, 32-(c) ) )
+
 void sm3_4way_compress( __m128i *digest, __m128i *block )
 {
   __m128i W[68], W1[64];
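
The two rotate macros added in this file, `mm256_rol_var_32` and `mm128_rol_var_32`, build a 32-bit rotate-left from a left shift, a right shift by `32-(c)`, and an OR. A minimal scalar sketch of what one 32-bit lane computes is shown here. The name `rol32` is hypothetical and not part of the patch. The count masking is an addition for scalar C only, where shifting a 32-bit value by 32 is undefined.

```c
#include <stdint.h>

/* Scalar equivalent of one 32-bit lane of the rotate macros.
   Hypothetical helper, not part of the patch. The count is masked so that
   c == 0 never shifts by 32, which would be undefined in scalar C. */
static inline uint32_t rol32( uint32_t v, unsigned c )
{
   c &= 31;
   return ( v << c ) | ( v >> ( ( 32 - c ) & 31 ) );
}
```

The macro form presumably exists so SM3 can rotate by a count that is an expression rather than a fixed literal; that reading is an inference from the macro name, not something stated in the diff.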
--- a/algo/swifftx/swifftx.c
+++ b/algo/swifftx/swifftx.c
@@ -630,36 +630,35 @@ void InitializeSWIFFTX()
 }

 // In the original code the F matrix is rotated so it was not aranged
-// the same as all the other data. Rearanging F to match all the other
-// data made vectorizing possible, the compiler probably could have been
-// able to auto-vectorize with proper data organisation.
-// Also in the original code the custom 16 bit data types are all now 32
-// bit int32_t regardless of the type name.
-//
-void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
+// the same as the other data. Rearranging F made vectorizing up to 256 bits
+// possible.
+// Also, the custom 16 bit data types of the original code are now all aliased
+// to 32 bit int32_t.
+
+void FFT( const unsigned char input[EIGHTH_N], swift_int32_t *output )
 {
 #if defined(__AVX2__)

-   __m256i F[8] __attribute__ ((aligned (64)));
+   __m256i F0, F1, F2, F3, F4, F5, F6, F7;
+   __m256i tbl = *(__m256i*)&( fftTable[ input[0] << 3 ] );
   __m256i *mul = (__m256i*)multipliers;
   __m256i *out = (__m256i*)output;
-   __m256i *tbl = (__m256i*)&( fftTable[ input[0] << 3 ] );

-   F[0] = _mm256_mullo_epi32( mul[0], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[1] << 3 ] );
-   F[1] = _mm256_mullo_epi32( mul[1], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[2] << 3 ] );
-   F[2] = _mm256_mullo_epi32( mul[2], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[3] << 3 ] );
-   F[3] = _mm256_mullo_epi32( mul[3], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[4] << 3 ] );
-   F[4] = _mm256_mullo_epi32( mul[4], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[5] << 3 ] );
-   F[5] = _mm256_mullo_epi32( mul[5], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[6] << 3 ] );
-   F[6] = _mm256_mullo_epi32( mul[6], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[7] << 3 ] );
-   F[7] = _mm256_mullo_epi32( mul[7], *tbl );
+   F0 = _mm256_mullo_epi32( mul[0], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[1] << 3 ] );
+   F1 = _mm256_mullo_epi32( mul[1], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[2] << 3 ] );
+   F2 = _mm256_mullo_epi32( mul[2], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[3] << 3 ] );
+   F3 = _mm256_mullo_epi32( mul[3], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[4] << 3 ] );
+   F4 = _mm256_mullo_epi32( mul[4], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[5] << 3 ] );
+   F5 = _mm256_mullo_epi32( mul[5], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[6] << 3 ] );
+   F6 = _mm256_mullo_epi32( mul[6], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[7] << 3 ] );
+   F7 = _mm256_mullo_epi32( mul[7], tbl );

   #define ADD_SUB( a, b ) \
   { \
@@ -668,52 +667,50 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
      a = _mm256_add_epi32( a, tmp ); \
   }
   
-   ADD_SUB( F[0], F[1] );
-   ADD_SUB( F[2], F[3] );
-   ADD_SUB( F[4], F[5] );
-   ADD_SUB( F[6], F[7] );
-
-   F[3] = _mm256_slli_epi32( F[3], 4 );
-   F[7] = _mm256_slli_epi32( F[7], 4 );
-
-   ADD_SUB( F[0], F[2] );
-   ADD_SUB( F[1], F[3] );
-   ADD_SUB( F[4], F[6] );
-   ADD_SUB( F[5], F[7] );  
-
-   F[5] = _mm256_slli_epi32( F[5], 2 );
-   F[6] = _mm256_slli_epi32( F[6], 4 );
-   F[7] = _mm256_slli_epi32( F[7], 6 );
-
-   ADD_SUB( F[0], F[4] );
-   ADD_SUB( F[1], F[5] );
-   ADD_SUB( F[2], F[6] );
-   ADD_SUB( F[3], F[7] );
+   ADD_SUB( F0, F1 );
+   ADD_SUB( F2, F3 );
+   ADD_SUB( F4, F5 );
+   ADD_SUB( F6, F7 );
+   F3 = _mm256_slli_epi32( F3, 4 );
+   F7 = _mm256_slli_epi32( F7, 4 );
+   ADD_SUB( F0, F2 );
+   ADD_SUB( F1, F3 );
+   ADD_SUB( F4, F6 );
+   ADD_SUB( F5, F7 );  
+   F5 = _mm256_slli_epi32( F5, 2 );
+   F6 = _mm256_slli_epi32( F6, 4 );
+   F7 = _mm256_slli_epi32( F7, 6 );
+   ADD_SUB( F0, F4 );
+   ADD_SUB( F1, F5 );
+   ADD_SUB( F2, F6 );
+   ADD_SUB( F3, F7 );

   #undef ADD_SUB

 #if defined (__AVX512VL__) && defined(__AVX512BW__)   

-   const __m256i mask = _mm256_movm_epi8( 0x11111111 );
-
+   #define Q_REDUCE( a ) \
+       _mm256_sub_epi32( _mm256_maskz_mov_epi8( 0x11111111, a ), \
+                         _mm256_srai_epi32( a, 8 ) )
+         
 #else

-   const __m256i mask = m256_const1_32( 0x000000ff );
-
-#endif
+   const __m256i mask = _mm256_set1_epi32( 0x000000ff );

   #define Q_REDUCE( a ) \
       _mm256_sub_epi32( _mm256_and_si256( a, mask ), \
                         _mm256_srai_epi32( a, 8 ) )
+   
+#endif

-   out[0] = Q_REDUCE( F[0] );  
-   out[1] = Q_REDUCE( F[1] );                        
-   out[2] = Q_REDUCE( F[2] );                        
-   out[3] = Q_REDUCE( F[3] );                        
-   out[4] = Q_REDUCE( F[4] );                        
-   out[5] = Q_REDUCE( F[5] );                        
-   out[6] = Q_REDUCE( F[6] );                        
-   out[7] = Q_REDUCE( F[7] );
+   out[0] = Q_REDUCE( F0 );  
+   out[1] = Q_REDUCE( F1 );                        
+   out[2] = Q_REDUCE( F2 );                        
+   out[3] = Q_REDUCE( F3 );                        
+   out[4] = Q_REDUCE( F4 );                        
+   out[5] = Q_REDUCE( F5 );                        
+   out[6] = Q_REDUCE( F6 );                        
+   out[7] = Q_REDUCE( F7 );

   #undef Q_REDUCE

@@ -763,12 +760,10 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   ADD_SUB( F[ 9], F[11] );
   ADD_SUB( F[12], F[14] );
   ADD_SUB( F[13], F[15] );
-
   F[ 6] = _mm_slli_epi32( F[ 6], 4 );
   F[ 7] = _mm_slli_epi32( F[ 7], 4 );
   F[14] = _mm_slli_epi32( F[14], 4 );
   F[15] = _mm_slli_epi32( F[15], 4 );
-
   ADD_SUB( F[ 0], F[ 4] );
   ADD_SUB( F[ 1], F[ 5] );
   ADD_SUB( F[ 2], F[ 6] );
@@ -777,14 +772,12 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   ADD_SUB( F[ 9], F[13] );
   ADD_SUB( F[10], F[14] );
   ADD_SUB( F[11], F[15] );
-
   F[10] = _mm_slli_epi32( F[10], 2 );
   F[11] = _mm_slli_epi32( F[11], 2 );
   F[12] = _mm_slli_epi32( F[12], 4 );
   F[13] = _mm_slli_epi32( F[13], 4 );
   F[14] = _mm_slli_epi32( F[14], 6 );
   F[15] = _mm_slli_epi32( F[15], 6 );
-   
   ADD_SUB( F[ 0], F[ 8] );
   ADD_SUB( F[ 1], F[ 9] );
   ADD_SUB( F[ 2], F[10] );
@@ -796,7 +789,7 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)

   #undef ADD_SUB

-   const __m128i mask = m128_const1_32( 0x000000ff );
+   const __m128i mask = _mm_set1_epi32( 0x000000ff );

   #define Q_REDUCE( a ) \
      _mm_sub_epi32( _mm_and_si128( a, mask ), _mm_srai_epi32( a, 8 ) ) 
@@ -820,16 +813,13 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)

   #undef Q_REDUCE

-#else   // < SSE4.1
+#else   // not AVX2 or SSE4_1, scalar fallback
   
   swift_int16_t *mult = multipliers;
-
-   // First loop unrolling:
-	register swift_int16_t *table = &(fftTable[input[0] << 3]);
-
-/*
+	swift_int16_t *table = &( fftTable[ input[0] << 3 ] );
   swift_int32_t F[64];

+   /*
   for (int i = 0; i < 8; i++)
   {
      int j = i<<3;
@@ -845,99 +835,91 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   }
 */

-   register swift_int32_t F0, F1, F2, F3, F4, F5, F6, F7, F8, F9,
-                F10, F11, F12, F13, F14, F15, F16, F17, F18, F19,
-                F20, F21, F22, F23, F24, F25, F26, F27, F28, F29,
-                F30, F31, F32, F33, F34, F35, F36, F37, F38, F39,
-                F40, F41, F42, F43, F44, F45, F46, F47, F48, F49,
-                F50, F51, F52, F53, F54, F55, F56, F57, F58, F59,
-                F60, F61, F62, F63;
-   
-	F0  = mult[0] * table[0];
-	F8  = mult[1] * table[1];
-	F16 = mult[2] * table[2];
-	F24 = mult[3] * table[3];
-	F32 = mult[4] * table[4];
-	F40 = mult[5] * table[5];
-	F48 = mult[6] * table[6];
-	F56 = mult[7] * table[7];
+	F[ 0] = mult[ 0] * table[0];
+	F[ 8] = mult[ 1] * table[1];
+	F[16] = mult[ 2] * table[2];
+	F[24] = mult[ 3] * table[3];
+	F[32] = mult[ 4] * table[4];
+	F[40] = mult[ 5] * table[5];
+	F[48] = mult[ 6] * table[6];
+	F[56] = mult[ 7] * table[7];

 	table = &(fftTable[input[1] << 3]);

-	F1  = mult[ 8] * table[0];
-	F9  = mult[ 9] * table[1];
-	F17 = mult[10] * table[2];
-	F25 = mult[11] * table[3];
-	F33 = mult[12] * table[4];
-	F41 = mult[13] * table[5];
-	F49 = mult[14] * table[6];
-	F57 = mult[15] * table[7];
+	F[ 1] = mult[ 8] * table[0];
+	F[ 9] = mult[ 9] * table[1];
+	F[17] = mult[10] * table[2];
+	F[25] = mult[11] * table[3];
+	F[33] = mult[12] * table[4];
+	F[41] = mult[13] * table[5];
+	F[49] = mult[14] * table[6];
+	F[57] = mult[15] * table[7];

 	table = &(fftTable[input[2] << 3]);

-	F2  = mult[16] * table[0];
-	F10 = mult[17] * table[1];
-	F18 = mult[18] * table[2];
-	F26 = mult[19] * table[3];
-	F34 = mult[20] * table[4];
-	F42 = mult[21] * table[5];
-	F50 = mult[22] * table[6];
-	F58 = mult[23] * table[7];
+	F[ 2] = mult[16] * table[0];
+	F[10] = mult[17] * table[1];
+	F[18] = mult[18] * table[2];
+	F[26] = mult[19] * table[3];
+	F[34] = mult[20] * table[4];
+	F[42] = mult[21] * table[5];
+	F[50] = mult[22] * table[6];
+	F[58] = mult[23] * table[7];

 	table = &(fftTable[input[3] << 3]);

-	F3  = mult[24] * table[0];
-	F11 = mult[25] * table[1];
-	F19 = mult[26] * table[2];
-	F27 = mult[27] * table[3];
-	F35 = mult[28] * table[4];
-	F43 = mult[29] * table[5];
-	F51 = mult[30] * table[6];
-	F59 = mult[31] * table[7];
+	F[ 3] = mult[24] * table[0];
+	F[11] = mult[25] * table[1];
+	F[19] = mult[26] * table[2];
+	F[27] = mult[27] * table[3];
+	F[35] = mult[28] * table[4];
+	F[43] = mult[29] * table[5];
+	F[51] = mult[30] * table[6];
+	F[59] = mult[31] * table[7];

 	table = &(fftTable[input[4] << 3]);

-	F4  = mult[32] * table[0];
-	F12 = mult[33] * table[1];
-	F20 = mult[34] * table[2];
-	F28 = mult[35] * table[3];
-	F36 = mult[36] * table[4];
-	F44 = mult[37] * table[5];
-	F52 = mult[38] * table[6];
-	F60 = mult[39] * table[7];
+	F[ 4] = mult[32] * table[0];
+	F[12] = mult[33] * table[1];
+	F[20] = mult[34] * table[2];
+	F[28] = mult[35] * table[3];
+	F[36] = mult[36] * table[4];
+	F[44] = mult[37] * table[5];
+	F[52] = mult[38] * table[6];
+	F[60] = mult[39] * table[7];

 	table = &(fftTable[input[5] << 3]);

-	F5  = mult[40] * table[0];
-	F13 = mult[41] * table[1];
-	F21 = mult[42] * table[2];
-	F29 = mult[43] * table[3];
-	F37 = mult[44] * table[4];
-	F45 = mult[45] * table[5];
-	F53 = mult[46] * table[6];
-	F61 = mult[47] * table[7];
+	F[ 5] = mult[40] * table[0];
+	F[13] = mult[41] * table[1];
+	F[21] = mult[42] * table[2];
+	F[29] = mult[43] * table[3];
+	F[37] = mult[44] * table[4];
+	F[45] = mult[45] * table[5];
+	F[53] = mult[46] * table[6];
+	F[61] = mult[47] * table[7];

 	table = &(fftTable[input[6] << 3]);

-	F6  = mult[48] * table[0];
-	F14 = mult[49] * table[1];
-	F22 = mult[50] * table[2];
-	F30 = mult[51] * table[3];
-	F38 = mult[52] * table[4];
-	F46 = mult[53] * table[5];
-	F54 = mult[54] * table[6];
-	F62 = mult[55] * table[7];
+	F[ 6] = mult[48] * table[0];
+	F[14] = mult[49] * table[1];
+	F[22] = mult[50] * table[2];
+	F[30] = mult[51] * table[3];
+	F[38] = mult[52] * table[4];
+	F[46] = mult[53] * table[5];
+	F[54] = mult[54] * table[6];
+	F[62] = mult[55] * table[7];

 	table = &(fftTable[input[7] << 3]);

-	F7  = mult[56] * table[0];
-	F15 = mult[57] * table[1];
-	F23 = mult[58] * table[2];
-	F31 = mult[59] * table[3];
-	F39 = mult[60] * table[4];
-	F47 = mult[61] * table[5];
-	F55 = mult[62] * table[6];
-	F63 = mult[63] * table[7];
+	F[ 7] = mult[56] * table[0];
+	F[15] = mult[57] * table[1];
+	F[23] = mult[58] * table[2];
+	F[31] = mult[59] * table[3];
+	F[39] = mult[60] * table[4];
+	F[47] = mult[61] * table[5];
+	F[55] = mult[62] * table[6];
+	F[63] = mult[63] * table[7];

   #define ADD_SUB( a, b ) \
   { \
@@ -987,262 +969,229 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   }
 */

-	// Second loop unrolling:
 	// Iteration 0:
-	ADD_SUB(F0, F1);
-	ADD_SUB(F2, F3);
-	ADD_SUB(F4, F5);
-	ADD_SUB(F6, F7);
+	ADD_SUB( F[ 0], F[ 1] );
+	ADD_SUB( F[ 2], F[ 3] );
+	ADD_SUB( F[ 4], F[ 5] );
+	ADD_SUB( F[ 6], F[ 7] );
+	F[ 3] <<= 4;
+	F[ 7] <<= 4;
+	ADD_SUB( F[ 0], F[ 2] );
+	ADD_SUB( F[ 1], F[ 3] );
+	ADD_SUB( F[ 4], F[ 6] );
+	ADD_SUB( F[ 5], F[ 7] );
+	F[ 5] <<= 2;
+	F[ 6] <<= 4;
+	F[ 7] <<= 6;
+	ADD_SUB( F[ 0], F[ 4] );
+	ADD_SUB( F[ 1], F[ 5] );
+	ADD_SUB( F[ 2], F[ 6] );
+	ADD_SUB( F[ 3], F[ 7] );

-	F3 <<= 4;
-	F7 <<= 4;
-
-	ADD_SUB(F0, F2);
-	ADD_SUB(F1, F3);
-	ADD_SUB(F4, F6);
-	ADD_SUB(F5, F7);
-
-	F5 <<= 2;
-	F6 <<= 4;
-	F7 <<= 6;
-
-	ADD_SUB(F0, F4);
-	ADD_SUB(F1, F5);
-	ADD_SUB(F2, F6);
-	ADD_SUB(F3, F7);
-
-	output[0] = Q_REDUCE(F0);
-	output[8] = Q_REDUCE(F1);
-	output[16] = Q_REDUCE(F2);
-	output[24] = Q_REDUCE(F3);
-	output[32] = Q_REDUCE(F4);
-	output[40] = Q_REDUCE(F5);
-	output[48] = Q_REDUCE(F6);
-	output[56] = Q_REDUCE(F7);
+	output[ 0] = Q_REDUCE( F[ 0] );
+	output[ 8] = Q_REDUCE( F[ 1] );
+	output[16] = Q_REDUCE( F[ 2] );
+	output[24] = Q_REDUCE( F[ 3] );
+	output[32] = Q_REDUCE( F[ 4] );
+	output[40] = Q_REDUCE( F[ 5] );
+	output[48] = Q_REDUCE( F[ 6] );
+	output[56] = Q_REDUCE( F[ 7] );

 	// Iteration 1:
-	ADD_SUB(F8, F9);
-	ADD_SUB(F10, F11);
-	ADD_SUB(F12, F13);
-	ADD_SUB(F14, F15);
+	ADD_SUB( F[ 8], F[ 9] );
+	ADD_SUB( F[10], F[11] );
+	ADD_SUB( F[12], F[13] );
+	ADD_SUB( F[14], F[15] );
+	F[11] <<= 4;
+	F[15] <<= 4;
+	ADD_SUB( F[ 8], F[10] );
+	ADD_SUB( F[ 9], F[11] );
+	ADD_SUB( F[12], F[14] );
+	ADD_SUB( F[13], F[15] );
+	F[13] <<= 2;
+	F[14] <<= 4;
+	F[15] <<= 6;
+	ADD_SUB( F[ 8], F[12] );
+	ADD_SUB( F[ 9], F[13] );
+	ADD_SUB( F[10], F[14] );
+	ADD_SUB( F[11], F[15] );

-	F11 <<= 4;
-	F15 <<= 4;
-
-	ADD_SUB(F8, F10);
-	ADD_SUB(F9, F11);
-	ADD_SUB(F12, F14);
-	ADD_SUB(F13, F15);
-
-	F13 <<= 2;
-	F14 <<= 4;
-	F15 <<= 6;
-
-	ADD_SUB(F8, F12);
-	ADD_SUB(F9, F13);
-	ADD_SUB(F10, F14);
-	ADD_SUB(F11, F15);
-
-	output[1] = Q_REDUCE(F8);
-	output[9] = Q_REDUCE(F9);
-	output[17] = Q_REDUCE(F10);
-	output[25] = Q_REDUCE(F11);
-	output[33] = Q_REDUCE(F12);
-	output[41] = Q_REDUCE(F13);
-	output[49] = Q_REDUCE(F14);
-	output[57] = Q_REDUCE(F15);
+	output[ 1] = Q_REDUCE( F[ 8] );
+	output[ 9] = Q_REDUCE( F[ 9] );
+	output[17] = Q_REDUCE( F[10] );
+	output[25] = Q_REDUCE( F[11] );
+	output[33] = Q_REDUCE( F[12] );
+	output[41] = Q_REDUCE( F[13] );
+	output[49] = Q_REDUCE( F[14] );
+	output[57] = Q_REDUCE( F[15] );

 	// Iteration 2:
-	ADD_SUB(F16, F17);
-	ADD_SUB(F18, F19);
-	ADD_SUB(F20, F21);
-	ADD_SUB(F22, F23);
+	ADD_SUB( F[16], F[17] );
+	ADD_SUB( F[18], F[19] );
+	ADD_SUB( F[20], F[21] );
+	ADD_SUB( F[22], F[23] );
+	F[19] <<= 4;
+	F[23] <<= 4;
+	ADD_SUB( F[16], F[18] );
+	ADD_SUB( F[17], F[19] );
+	ADD_SUB( F[20], F[22] );
+	ADD_SUB( F[21], F[23] );
+	F[21] <<= 2;
+	F[22] <<= 4;
+	F[23] <<= 6;
+	ADD_SUB( F[16], F[20] );
+	ADD_SUB( F[17], F[21] );
+	ADD_SUB( F[18], F[22] );
+	ADD_SUB( F[19], F[23] );

-	F19 <<= 4;
-	F23 <<= 4;
-
-	ADD_SUB(F16, F18);
-	ADD_SUB(F17, F19);
-	ADD_SUB(F20, F22);
-	ADD_SUB(F21, F23);
-
-	F21 <<= 2;
-	F22 <<= 4;
-	F23 <<= 6;
-
-	ADD_SUB(F16, F20);
-	ADD_SUB(F17, F21);
-	ADD_SUB(F18, F22);
-	ADD_SUB(F19, F23);
-
-	output[2] = Q_REDUCE(F16);
-	output[10] = Q_REDUCE(F17);
-	output[18] = Q_REDUCE(F18);
-	output[26] = Q_REDUCE(F19);
-	output[34] = Q_REDUCE(F20);
-	output[42] = Q_REDUCE(F21);
-	output[50] = Q_REDUCE(F22);
-	output[58] = Q_REDUCE(F23);
+	output[ 2] = Q_REDUCE( F[16] );
+	output[10] = Q_REDUCE( F[17] );
+	output[18] = Q_REDUCE( F[18] );
+	output[26] = Q_REDUCE( F[19] );
+	output[34] = Q_REDUCE( F[20] );
+	output[42] = Q_REDUCE( F[21] );
+	output[50] = Q_REDUCE( F[22] );
+	output[58] = Q_REDUCE( F[23] );

 	// Iteration 3:
-	ADD_SUB(F24, F25);
-	ADD_SUB(F26, F27);
-	ADD_SUB(F28, F29);
-	ADD_SUB(F30, F31);
+	ADD_SUB( F[24], F[25] );
+	ADD_SUB( F[26], F[27] );
+	ADD_SUB( F[28], F[29] );
+	ADD_SUB( F[30], F[31] );
+	F[27] <<= 4;
+	F[31] <<= 4;
+	ADD_SUB( F[24], F[26] );
+	ADD_SUB( F[25], F[27] );
+	ADD_SUB( F[28], F[30] );
+	ADD_SUB( F[29], F[31] );
+	F[29] <<= 2;
+	F[30] <<= 4;
+	F[31] <<= 6;
+	ADD_SUB( F[24], F[28] );
+	ADD_SUB( F[25], F[29] );
+	ADD_SUB( F[26], F[30] );
+	ADD_SUB( F[27], F[31] );

-	F27 <<= 4;
-	F31 <<= 4;
-
-	ADD_SUB(F24, F26);
-	ADD_SUB(F25, F27);
-	ADD_SUB(F28, F30);
-	ADD_SUB(F29, F31);
-
-	F29 <<= 2;
-	F30 <<= 4;
-	F31 <<= 6;
-
-	ADD_SUB(F24, F28);
-	ADD_SUB(F25, F29);
-	ADD_SUB(F26, F30);
-	ADD_SUB(F27, F31);
-
-	output[3] = Q_REDUCE(F24);
-	output[11] = Q_REDUCE(F25);
-	output[19] = Q_REDUCE(F26);
-	output[27] = Q_REDUCE(F27);
-	output[35] = Q_REDUCE(F28);
-	output[43] = Q_REDUCE(F29);
-	output[51] = Q_REDUCE(F30);
-	output[59] = Q_REDUCE(F31);
+	output[ 3] = Q_REDUCE( F[24] );
+	output[11] = Q_REDUCE( F[25] );
+	output[19] = Q_REDUCE( F[26] );
+	output[27] = Q_REDUCE( F[27] );
+	output[35] = Q_REDUCE( F[28] );
+	output[43] = Q_REDUCE( F[29] );
+	output[51] = Q_REDUCE( F[30] );
+	output[59] = Q_REDUCE( F[31] );

 	// Iteration 4:
-	ADD_SUB(F32, F33);
-	ADD_SUB(F34, F35);
-	ADD_SUB(F36, F37);
-	ADD_SUB(F38, F39);
+	ADD_SUB( F[32], F[33] );
+	ADD_SUB( F[34], F[35] );
+	ADD_SUB( F[36], F[37] );
+	ADD_SUB( F[38], F[39] );
+	F[35] <<= 4;
+	F[39] <<= 4;
+	ADD_SUB( F[32], F[34] );
+	ADD_SUB( F[33], F[35] );
+	ADD_SUB( F[36], F[38] );
+	ADD_SUB( F[37], F[39] );
+	F[37] <<= 2;
+	F[38] <<= 4;
+	F[39] <<= 6;
+	ADD_SUB( F[32], F[36] );
+	ADD_SUB( F[33], F[37] );
+	ADD_SUB( F[34], F[38] );
+	ADD_SUB( F[35], F[39] );

-	F35 <<= 4;
-	F39 <<= 4;
-
-	ADD_SUB(F32, F34);
-	ADD_SUB(F33, F35);
-	ADD_SUB(F36, F38);
-	ADD_SUB(F37, F39);
-
-	F37 <<= 2;
-	F38 <<= 4;
-	F39 <<= 6;
-
-	ADD_SUB(F32, F36);
-	ADD_SUB(F33, F37);
-	ADD_SUB(F34, F38);
-	ADD_SUB(F35, F39);
-
-	output[4] = Q_REDUCE(F32);
-	output[12] = Q_REDUCE(F33);
-	output[20] = Q_REDUCE(F34);
-	output[28] = Q_REDUCE(F35);
-	output[36] = Q_REDUCE(F36);
-	output[44] = Q_REDUCE(F37);
-	output[52] = Q_REDUCE(F38);
-	output[60] = Q_REDUCE(F39);
+	output[ 4] = Q_REDUCE( F[32] );
+	output[12] = Q_REDUCE( F[33] );
+	output[20] = Q_REDUCE( F[34] );
+	output[28] = Q_REDUCE( F[35] );
+	output[36] = Q_REDUCE( F[36] );
+	output[44] = Q_REDUCE( F[37] );
+	output[52] = Q_REDUCE( F[38] );
+	output[60] = Q_REDUCE( F[39] );

 	// Iteration 5:
-	ADD_SUB(F40, F41);
-	ADD_SUB(F42, F43);
-	ADD_SUB(F44, F45);
-	ADD_SUB(F46, F47);
+	ADD_SUB( F[40], F[41] );
+	ADD_SUB( F[42], F[43] );
+	ADD_SUB( F[44], F[45] );
+	ADD_SUB( F[46], F[47] );
+	F[43] <<= 4;
+	F[47] <<= 4;
+	ADD_SUB( F[40], F[42] );
+	ADD_SUB( F[41], F[43] );
+	ADD_SUB( F[44], F[46] );
+	ADD_SUB( F[45], F[47] );
+	F[45] <<= 2;
+	F[46] <<= 4;
+	F[47] <<= 6;
+	ADD_SUB( F[40], F[44] );
+	ADD_SUB( F[41], F[45] );
+	ADD_SUB( F[42], F[46] );
+	ADD_SUB( F[43], F[47] );

-	F43 <<= 4;
-	F47 <<= 4;
-
-	ADD_SUB(F40, F42);
-	ADD_SUB(F41, F43);
-	ADD_SUB(F44, F46);
-	ADD_SUB(F45, F47);
-
-	F45 <<= 2;
-	F46 <<= 4;
-	F47 <<= 6;
-
-	ADD_SUB(F40, F44);
-	ADD_SUB(F41, F45);
-	ADD_SUB(F42, F46);
-	ADD_SUB(F43, F47);
-
-	output[5] = Q_REDUCE(F40);
-	output[13] = Q_REDUCE(F41);
-	output[21] = Q_REDUCE(F42);
-	output[29] = Q_REDUCE(F43);
-	output[37] = Q_REDUCE(F44);
-	output[45] = Q_REDUCE(F45);
-	output[53] = Q_REDUCE(F46);
-	output[61] = Q_REDUCE(F47);
+	output[ 5] = Q_REDUCE( F[40] );
+	output[13] = Q_REDUCE( F[41] );
+	output[21] = Q_REDUCE( F[42] );
+	output[29] = Q_REDUCE( F[43] );
+	output[37] = Q_REDUCE( F[44] );
+	output[45] = Q_REDUCE( F[45] );
+	output[53] = Q_REDUCE( F[46] );
+	output[61] = Q_REDUCE( F[47] );

 	// Iteration 6:
-	ADD_SUB(F48, F49);
-	ADD_SUB(F50, F51);
-	ADD_SUB(F52, F53);
-	ADD_SUB(F54, F55);
+	ADD_SUB( F[48], F[49] );
+	ADD_SUB( F[50], F[51] );
+	ADD_SUB( F[52], F[53] );
+	ADD_SUB( F[54], F[55] );
+	F[51] <<= 4;
+	F[55] <<= 4;
+	ADD_SUB( F[48], F[50] );
+	ADD_SUB( F[49], F[51] );
+	ADD_SUB( F[52], F[54] );
+	ADD_SUB( F[53], F[55] );
+	F[53] <<= 2;
+	F[54] <<= 4;
+	F[55] <<= 6;
+	ADD_SUB( F[48], F[52] );
+	ADD_SUB( F[49], F[53] );
+	ADD_SUB( F[50], F[54] );
+	ADD_SUB( F[51], F[55] );

-	F51 <<= 4;
-	F55 <<= 4;
-
-	ADD_SUB(F48, F50);
-	ADD_SUB(F49, F51);
-	ADD_SUB(F52, F54);
-	ADD_SUB(F53, F55);
-
-	F53 <<= 2;
-	F54 <<= 4;
-	F55 <<= 6;
-
-	ADD_SUB(F48, F52);
-	ADD_SUB(F49, F53);
-	ADD_SUB(F50, F54);
-	ADD_SUB(F51, F55);
-
-	output[6] = Q_REDUCE(F48);
-	output[14] = Q_REDUCE(F49);
-	output[22] = Q_REDUCE(F50);
-	output[30] = Q_REDUCE(F51);
-	output[38] = Q_REDUCE(F52);
-	output[46] = Q_REDUCE(F53);
-	output[54] = Q_REDUCE(F54);
-	output[62] = Q_REDUCE(F55);
+	output[ 6] = Q_REDUCE( F[48] );
+	output[14] = Q_REDUCE( F[49] );
+	output[22] = Q_REDUCE( F[50] );
+	output[30] = Q_REDUCE( F[51] );
+	output[38] = Q_REDUCE( F[52] );
+	output[46] = Q_REDUCE( F[53] );
+	output[54] = Q_REDUCE( F[54] );
+	output[62] = Q_REDUCE( F[55] );

 	// Iteration 7:
-	ADD_SUB(F56, F57);
-	ADD_SUB(F58, F59);
-	ADD_SUB(F60, F61);
-	ADD_SUB(F62, F63);
+	ADD_SUB( F[56], F[57] );
+	ADD_SUB( F[58], F[59] );
+	ADD_SUB( F[60], F[61] );
+	ADD_SUB( F[62], F[63] );
+	F[59] <<= 4;
+	F[63] <<= 4;
+	ADD_SUB( F[56], F[58] );
+	ADD_SUB( F[57], F[59] );
+	ADD_SUB( F[60], F[62] );
+	ADD_SUB( F[61], F[63] );
+	F[61] <<= 2;
+	F[62] <<= 4;
+	F[63] <<= 6;
+	ADD_SUB( F[56], F[60] );
+	ADD_SUB( F[57], F[61] );
+	ADD_SUB( F[58], F[62] );
+	ADD_SUB( F[59], F[63] );

-	F59 <<= 4;
-	F63 <<= 4;
-
-	ADD_SUB(F56, F58);
-	ADD_SUB(F57, F59);
-	ADD_SUB(F60, F62);
-	ADD_SUB(F61, F63);
-
-	F61 <<= 2;
-	F62 <<= 4;
-	F63 <<= 6;
-
-	ADD_SUB(F56, F60);
-	ADD_SUB(F57, F61);
-	ADD_SUB(F58, F62);
-	ADD_SUB(F59, F63);
-
-	output[7] = Q_REDUCE(F56);
-	output[15] = Q_REDUCE(F57);
-	output[23] = Q_REDUCE(F58);
-	output[31] = Q_REDUCE(F59);
-	output[39] = Q_REDUCE(F60);
-	output[47] = Q_REDUCE(F61);
-	output[55] = Q_REDUCE(F62);
-	output[63] = Q_REDUCE(F63);
+	output[ 7] = Q_REDUCE( F[56] );
+	output[15] = Q_REDUCE( F[57] );
+	output[23] = Q_REDUCE( F[58] );
+	output[31] = Q_REDUCE( F[59] );
+	output[39] = Q_REDUCE( F[60] );
+	output[47] = Q_REDUCE( F[61] );
+	output[55] = Q_REDUCE( F[62] );
+	output[63] = Q_REDUCE( F[63] );

   #undef ADD_SUB
   #undef Q_REDUCE
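
The vector `Q_REDUCE` macros in this file subtract the arithmetically shifted high part of each 32-bit element from its low byte. Because 256 is congruent to -1 modulo 257, writing `a = 256*hi + lo` gives `a ≡ lo - hi (mod 257)`, so the result stays congruent to the input while becoming much smaller. A scalar sketch of the same step follows. The name `q_reduce` is hypothetical, and the sketch assumes `>>` on a negative `int32_t` is an arithmetic shift, as it is on the compilers this project targets.

```c
#include <stdint.h>

/* Scalar sketch of the partial reduction modulo 257 done per element.
   Hypothetical helper, not part of the patch. Assumes >> on a negative
   int32_t is an arithmetic shift (implementation-defined in C). */
static inline int32_t q_reduce( int32_t a )
{
   return ( a & 0xff ) - ( a >> 8 );
}
```

The result is congruent to the input modulo 257 but is not necessarily in the range 0 to 256, for example `q_reduce( 256 )` yields -1.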
--- a/algo/verthash/tiny_sha3/sha3-4way.c
+++ b/algo/verthash/tiny_sha3/sha3-4way.c
@@ -134,10 +134,10 @@ int sha3_4way_update( sha3_4way_ctx_t *c, const void *data, size_t len )
 int sha3_4way_final( void *md, sha3_4way_ctx_t *c )
 {
    c->st[ c->pt ] = _mm256_xor_si256( c->st[ c->pt ],
-                                       m256_const1_64( 6 ) );
+                                       _mm256_set1_epi64x( 6 ) );
    c->st[ c->rsiz / 8 - 1 ] =
                       _mm256_xor_si256( c->st[ c->rsiz / 8 - 1 ],
-                                         m256_const1_64( 0x8000000000000000 ) );
+                                    _mm256_set1_epi64x( 0x8000000000000000 ) );
    sha3_4way_keccakf( c->st );
    memcpy( md, c->st, c->mdlen * 4 );
    return 1;
@@ -268,10 +268,10 @@ int sha3_8way_final( void *md, sha3_8way_ctx_t *c )
 {
    c->st[ c->pt ] =
                       _mm512_xor_si512( c->st[ c->pt ],
-                                         m512_const1_64( 6 ) );
+                                         _mm512_set1_epi64( 6 ) );
    c->st[ c->rsiz / 8 - 1 ] =
                       _mm512_xor_si512( c->st[ c->rsiz / 8 - 1 ],
-                                         m512_const1_64( 0x8000000000000000 ) );
+                                     _mm512_set1_epi64( 0x8000000000000000 ) );
    sha3_8way_keccakf( c->st );
    memcpy( md, c->st, c->mdlen * 8 );
    return 1;
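
Both `sha3_4way_final` and `sha3_8way_final` apply the SHA-3 padding as two XORs on 64-bit state words: the value 6 at the current position and `0x8000000000000000` at the last word of the rate. A scalar sketch for a single, non-interleaved state is given here. The name `sha3_pad_words` is hypothetical, and the sketch assumes, as the vector code appears to, that `pt` indexes 64-bit words and `rsiz` is the rate in bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of the final padding on a single Keccak state.
   Hypothetical helper, not part of the patch. Assumes pt indexes 64-bit
   words and rsiz is the rate in bytes. */
static inline void sha3_pad_words( uint64_t *st, size_t pt, size_t rsiz )
{
   st[ pt ] ^= 6;                                   /* 0x06 in the first pad byte */
   st[ rsiz / 8 - 1 ] ^= 0x8000000000000000ULL;     /* 0x80 in the last rate byte */
}
```

When `pt` is already the last rate word, both XORs land on the same word and combine to `0x8000000000000006`.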
--- a/algo/x11/c11-4way.c
+++ b/algo/x11/c11-4way.c
@@ -201,7 +201,7 @@ int scanhash_c11_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32_d7 = ptarget[7];
-   const __m512i eight = m512_const1_64( 8 );
+   const __m512i eight = _mm512_set1_epi64( 8 );
   const bool bench = opt_benchmark;

   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
@@ -369,7 +369,7 @@ int scanhash_c11_4way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32_d7 = ptarget[7];
-   const __m256i four = m256_const1_64( 4 );
+   const __m256i four = _mm256_set1_epi64x( 4 );
   const bool bench = opt_benchmark;

   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
--- a/algo/x13/skunk-4way.c
+++ b/algo/x13/skunk-4way.c
@@ -114,7 +114,7 @@ int scanhash_skunk_8way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n +=8;
   } while ( likely( ( n < last_nonce ) && !( *restart ) ) );
   pdata[19] = n;
@@ -218,7 +218,7 @@ int scanhash_skunk_4way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n +=4;
   } while ( likely( ( n < last_nonce ) && !( *restart ) ) );
   pdata[19] = n;
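
The nonce update repeated across these scanhash functions is a packed 32-bit add against a constant that holds 4 (or 8) in the upper 32 bits of every 64-bit element and 0 in the lower 32 bits. Since a packed 32-bit add carries nothing between halves, only the upper half of each 64-bit element changes. A scalar sketch of one such element is shown here. The name `add_epi32_lane` is hypothetical, and reading the upper half as the nonce position is an inference from the constants, not something the diff states.

```c
#include <stdint.h>

/* One 64-bit element handled the way a packed 32-bit add treats it:
   each 32-bit half is added independently, with no carry between them.
   Hypothetical helper, not part of the patch. */
static inline uint64_t add_epi32_lane( uint64_t lane, uint64_t inc )
{
   uint32_t lo = (uint32_t)lane + (uint32_t)inc;
   uint32_t hi = (uint32_t)( lane >> 32 ) + (uint32_t)( inc >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}
```

With `inc` set to `0x0000000400000000`, the upper half advances by 4 and wraps on overflow while the lower half is left unchanged.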
--- a/algo/x16/x16r-4way.c
+++ b/algo/x16/x16r-4way.c
@@ -536,7 +536,7 @@ int scanhash_x16r_8way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
@@ -963,7 +963,7 @@ int scanhash_x16r_4way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
--- a/algo/x16/x16rt-4way.c
+++ b/algo/x16/x16rt-4way.c
@@ -49,7 +49,7 @@ int scanhash_x16rt_8way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
@@ -102,7 +102,7 @@ int scanhash_x16rt_4way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( (  n < last_nonce ) && !(*restart) );
   pdata[19] = n;
--- a/algo/x16/x16rt.c
+++ b/algo/x16/x16rt.c
@@ -26,7 +26,7 @@ int scanhash_x16rt( struct work *work, uint32_t max_nonce,
      x16rt_getTimeHash( masked_ntime, &timeHash );
      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
      s_ntime = masked_ntime;
-      if ( opt_debug && !thr_id )
+      if ( !thr_id )
          applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
                        x16r_hash_order, swab32( pdata[17] ), timeHash );
   }
--- a/algo/x16/x16rv2-4way.c
+++ b/algo/x16/x16rv2-4way.c
@@ -658,7 +658,7 @@ int scanhash_x16rv2_8way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
@@ -1143,7 +1143,7 @@ int scanhash_x16rv2_4way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
--- a/algo/x16/x21s-4way.c
+++ b/algo/x16/x21s-4way.c
@@ -181,7 +181,7 @@ int scanhash_x21s_8way( struct work *work, uint32_t max_nonce,
         }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
@@ -335,7 +335,7 @@ int scanhash_x21s_4way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( (  n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
--- a/algo/x17/x17-4way.c
+++ b/algo/x17/x17-4way.c
@@ -254,7 +254,7 @@ int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32_d7 = ptarget[7];
-   const __m512i eight = m512_const1_64( 8 );
+   const __m512i eight = _mm512_set1_epi64( 8 );
   const bool bench = opt_benchmark;

   // convert LE32 to LE64
@@ -468,7 +468,7 @@ int scanhash_x17_4way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32_d7 = ptarget[7];
-   const __m256i four = m256_const1_64( 4 );
+   const __m256i four = _mm256_set1_epi64x( 4 );
   const bool bench = opt_benchmark;

   // convert LE32 to LE64
--- a/algo/x22/x22i-4way.c
+++ b/algo/x22/x22i-4way.c
@@ -445,7 +445,7 @@ int scanhash_x22i_8way_sha( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -494,7 +494,7 @@ int scanhash_x22i_8way( struct work *work, uint32_t max_nonce,
         }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -787,7 +787,7 @@ int scanhash_x22i_4way_sha( struct work* work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -835,7 +835,7 @@ int scanhash_x22i_4way( struct work* work, uint32_t max_nonce,
         }
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n <= last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
--- a/algo/x22/x25x-4way.c
+++ b/algo/x22/x25x-4way.c
@@ -571,7 +571,7 @@ int scanhash_x25x_8way( struct work *work, uint32_t max_nonce,
   const int thr_id = mythr->id;
   const uint32_t targ32 = ptarget[7];
   const bool bench = opt_benchmark;
-   const __m512i eight = m512_const1_64( 8 );
+   const __m512i eight = _mm512_set1_epi64( 8 );
   if ( bench )  ptarget[7] = 0x08ff;

   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) ); 
@@ -927,7 +927,7 @@ int scanhash_x25x_4way( struct work* work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32 = ptarget[7];
-   const __m256i four = m256_const1_64( 4 );
+   const __m256i four = _mm256_set1_epi64x( 4 );
   const bool bench = opt_benchmark;

   if ( bench ) ptarget[7] = 0x08ff;
--- a/build-allarch.sh
+++ b/build-allarch.sh
@@ -29,10 +29,11 @@ mv cpuminer cpuminer-avx512-sha-vaes
 # Zen4 AVX512 SHA VAES
 make clean || echo clean
 rm -f config.status
-# znver3 needs gcc-11, znver4 ?
+# znver3 needs gcc-11, znver4 needs gcc-12.3.
 #CFLAGS="-O3 -march=znver4 -Wall -fno-common " ./configure --with-curl
-CFLAGS="-O3 -march=znver3 -mavx512f -mavx512dq -mavx512bw -mavx512vl -Wall -fno-common " ./configure --with-curl
-#CFLAGS="-O3 -march=znver2 -mvaes -mavx512f -mavx512dq -mavx512bw -mavx512vl -Wall -fno-common " ./configure --with-curl
+# Incomplete list of Zen4 AVX512 extensions but includes all extensions used by cpuminer.
+CFLAGS="-O3 -march=znver3 -mavx512f -mavx512cd -mavx512dq -mavx512bw -mavx512vl -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -Wall -fno-common " ./configure --with-curl
+#CFLAGS="-O3 -march=znver2 -mvaes -mavx512f -mavx512dq -mavx512bw -mavx512vl -mavx512vbmi -Wall -fno-common " ./configure --with-curl
 make -j 8
 strip -s cpuminer
 mv cpuminer cpuminer-zen4
--- a/20
+++ b/20
@@ -1,6 +1,6 @@
 #! /bin/sh
 # Guess values for system-dependent variables and create Makefiles.
-# Generated by GNU Autoconf 2.71 for cpuminer-opt 3.22.0.
+# Generated by GNU Autoconf 2.71 for cpuminer-opt 3.23.1.
 #
 #
 # Copyright (C) 1992-1996, 1998-2017, 2020-2021 Free Software Foundation,
@@ -608,8 +608,8 @@ MAKEFLAGS=
 # Identity of this package.
 PACKAGE_NAME='cpuminer-opt'
 PACKAGE_TARNAME='cpuminer-opt'
-PACKAGE_VERSION='3.22.0'
-PACKAGE_STRING='cpuminer-opt 3.22.0'
+PACKAGE_VERSION='3.23.1'
+PACKAGE_STRING='cpuminer-opt 3.23.1'
 PACKAGE_BUGREPORT=''
 PACKAGE_URL=''

@@ -1360,7 +1360,7 @@ if test "$ac_init_help" = "long"; then
  # Omit some internal or obsolete options to make the list less imposing.
  # This message is too long to be a string in the A/UX 3.1 sh.
  cat <<_ACEOF
-\`configure' configures cpuminer-opt 3.22.0 to adapt to many kinds of systems.
+\`configure' configures cpuminer-opt 3.23.1 to adapt to many kinds of systems.

 Usage: $0 [OPTION]... [VAR=VALUE]...

@@ -1432,7 +1432,7 @@ fi

 if test -n "$ac_init_help"; then
  case $ac_init_help in
-     short | recursive ) echo "Configuration of cpuminer-opt 3.22.0:";;
+     short | recursive ) echo "Configuration of cpuminer-opt 3.23.1:";;
   esac
  cat <<\_ACEOF

@@ -1538,7 +1538,7 @@ fi
 test -n "$ac_init_help" && exit $ac_status
 if $ac_init_version; then
  cat <<\_ACEOF
-cpuminer-opt configure 3.22.0
+cpuminer-opt configure 3.23.1
 generated by GNU Autoconf 2.71

 Copyright (C) 2021 Free Software Foundation, Inc.
@@ -1985,7 +1985,7 @@ cat >config.log <<_ACEOF
 This file contains any messages produced by compilers while
 running configure, to aid debugging if configure makes a mistake.

-It was created by cpuminer-opt $as_me 3.22.0, which was
+It was created by cpuminer-opt $as_me 3.23.1, which was
 generated by GNU Autoconf 2.71.  Invocation command line was

  $ $0$ac_configure_args_raw
@@ -3593,7 +3593,7 @@ fi

 # Define the identity of the package.
 PACKAGE='cpuminer-opt'
- VERSION='3.22.0'
+ VERSION='3.23.1'


 printf "%s\n" "#define PACKAGE \"$PACKAGE\"" >>confdefs.h
@@ -7508,7 +7508,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
 # report actual input values of CONFIG_FILES etc. instead of their
 # values after options handling.
 ac_log="
-This file was extended by cpuminer-opt $as_me 3.22.0, which was
+This file was extended by cpuminer-opt $as_me 3.23.1, which was
 generated by GNU Autoconf 2.71.  Invocation command line was

  CONFIG_FILES    = $CONFIG_FILES
@@ -7576,7 +7576,7 @@ ac_cs_config_escaped=`printf "%s\n" "$ac_cs_config" | sed "s/^ //; s/'/'\\\\\\\\
 cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
 ac_cs_config='$ac_cs_config_escaped'
 ac_cs_version="\\
-cpuminer-opt config.status 3.22.0
+cpuminer-opt config.status 3.23.1
 configured by $0, generated by GNU Autoconf 2.71,
  with options \\"\$ac_cs_config\\"

--- a/configure.ac
+++ b/configure.ac
@@ -1,4 +1,4 @@
-AC_INIT([cpuminer-opt], [3.22.0])
+AC_INIT([cpuminer-opt], [3.23.1])

 AC_PREREQ([2.59c])
 AC_CANONICAL_SYSTEM
--- a/7647
+++ b/7647
--- a/cpu-miner.c
+++ b/cpu-miner.c
@@ -3,7 +3,7 @@
 * Copyright 2012-2014 pooler
 * Copyright 2014 Lucas Jones
 * Copyright 2014-2016 Tanguy Pruvot
- * Copyright 2016-2021 Jay D Dee
+ * Copyright 2016-2023 Jay D Dee
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License as published by the Free
@@ -121,7 +121,6 @@ static uint64_t opt_affinity = 0xFFFFFFFFFFFFFFFFULL;  // default, use all cores
 int opt_priority = 0;  // deprecated
 int num_cpus = 1;
 int num_cpugroups = 1;  // For Windows
-#define max_cpus 256   // max for affinity
 char *rpc_url = NULL;
 char *rpc_userpass = NULL;
 char *rpc_user, *rpc_pass;
@@ -224,8 +223,7 @@ char*  lp_id;

 static void   workio_cmd_free(struct workio_cmd *wc);

-// array mapping thread to cpu
-static uint8_t thread_affinity_map[ max_cpus ];
+static int *thread_affinity_map;

 // display affinity mask graphically
 static void format_affinity_mask( char *mask_str, uint64_t mask )
@@ -867,6 +865,8 @@ static bool gbt_work_decode( const json_t *val, struct work *work )
         sha256d( merkle_tree[i], merkle_tree[2*i], 64 );
   }

+   work->tx_count = tx_count;
+
   /* assemble block header */
   algo_gate.build_block_header( work, swab32( version ),
                                 (uint32_t*) prevhash, (uint32_t*) merkle_tree,
@@ -1532,6 +1532,7 @@ const char *getwork_req =
 #define GBT_CAPABILITIES "[\"coinbasetxn\", \"coinbasevalue\", \"longpoll\", \"workid\"]"

 #define GBT_RULES "[\"segwit\"]"
+
 static const char *gbt_req =
   "{\"method\": \"getblocktemplate\", \"params\": [{\"capabilities\": "
   GBT_CAPABILITIES ", \"rules\": " GBT_RULES "}], \"id\":0}\r\n";
@@ -1589,18 +1590,21 @@ start:
         json_decref( val );
         goto start;
      }
+      allow_getwork = false;  // GBT is working, disable fallback
   } 
   else
      rc = work_decode( json_object_get( val, "result" ), work );

   if ( rc ) 
   {
+      bool new_work = true;
+
      json_decref( val );

      get_mininginfo( curl, work );
      report_summary_log( false );
      
-      if ( opt_protocol | opt_debug )
+      if ( opt_protocol || opt_debug )
      {
         timeval_subtract( &diff, &tv_end, &tv_start );
         applog( LOG_INFO, "%s new work received in %.2f ms",
@@ -1613,16 +1617,18 @@ start:
         last_block_height = work->height;
         last_targetdiff = net_diff;

-         applog( LOG_BLUE, "New Block %d, Net Diff %.5g, Ntime %08x",
-                                work->height, net_diff,
+         applog( LOG_BLUE, "New Block %d, Tx %d, Net Diff %.5g, Ntime %08x",
+                                work->height, work->tx_count, net_diff,
                                work->data[ algo_gate.ntime_index ] );
      }
      else if ( memcmp( &work->data[1], &g_work.data[1], 32 ) )
-         applog( LOG_BLUE, "New Work: Block %d, Net Diff %.5g, Ntime %08x",
-                                     work->height, net_diff,
-                                      work->data[ algo_gate.ntime_index ] );
-       
-      if ( !opt_quiet )
+         applog( LOG_BLUE, "New Work: Block %d, Tx %d, Net Diff %.5g, Ntime %08x",
+                                work->height, work->tx_count, net_diff,
+                                work->data[ algo_gate.ntime_index ] );
+      else
+        new_work = false;
+
+      if ( new_work && !opt_quiet )
      {
         double miner_hr = 0.;
         double net_hr = net_hashrate;
@@ -1960,7 +1966,7 @@ static bool wanna_mine(int thr_id)

 // Common target functions, default usually listed first.

-// default
+// default, double sha256 for root hash
 void sha256d_gen_merkle_root( char* merkle_root, struct stratum_ctx* sctx )
 {
  sha256d( merkle_root, sctx->job.coinbase, (int) sctx->job.coinbase_size );
@@ -1970,6 +1976,17 @@ void sha256d_gen_merkle_root( char* merkle_root, struct stratum_ctx* sctx )
     sha256d( merkle_root, merkle_root, 64 );
  }
 }
+// single sha256 root hash
+void sha256_gen_merkle_root( char* merkle_root, struct stratum_ctx* sctx )
+{
+  sha256_full( merkle_root, sctx->job.coinbase, (int)sctx->job.coinbase_size );
+  for ( int i = 0; i < sctx->job.merkle_count; i++ )
+  {
+     memcpy( merkle_root + 32, sctx->job.merkle[i], 32 );
+     sha256d( merkle_root, merkle_root, 64 );
+  }
+}
+// OpenSSL single sha256, deprecated
 void SHA256_gen_merkle_root( char* merkle_root, struct stratum_ctx* sctx )
 {
  SHA256( sctx->job.coinbase, (int)sctx->job.coinbase_size, merkle_root );
@@ -2056,15 +2073,18 @@ static void stratum_gen_work( struct stratum_ctx *sctx, struct work *g_work )
   pthread_mutex_unlock( &stats_lock );

   if ( stratum_diff != sctx->job.diff )
-      applog( LOG_BLUE, "New Stratum Diff %g, Block %d, Job %s",
-                        sctx->job.diff, sctx->block_height, g_work->job_id );
+      applog( LOG_BLUE, "New Stratum Diff %g, Block %d, Tx %d, Job %s",
+                        sctx->job.diff, sctx->block_height,
+                        sctx->job.merkle_count, g_work->job_id );
   else if ( last_block_height != sctx->block_height )
-      applog( LOG_BLUE, "New Block %d, Net diff %.5g, Job %s",
-                        sctx->block_height, net_diff, g_work->job_id );
+      applog( LOG_BLUE, "New Block %d, Tx %d, Netdiff %.5g, Job %s",
+                        sctx->block_height, sctx->job.merkle_count,
+                        net_diff, g_work->job_id );
   else if ( g_work->job_id && new_job )
-      applog( LOG_BLUE, "New Work: Block %d, Net diff %.5g, Job %s",
-                         sctx->block_height, net_diff, g_work->job_id );
-   else if ( !opt_quiet )
+      applog( LOG_BLUE, "New Work: Block %d, Tx %d, Netdiff %.5g, Job %s",
+                         sctx->block_height, sctx->job.merkle_count,
+                         net_diff, g_work->job_id );
+   else if ( opt_debug )
   {
      unsigned char *xnonce2str = bebin2hex( g_work->xnonce2,
                                             g_work->xnonce2_len );
@@ -2086,7 +2106,7 @@ static void stratum_gen_work( struct stratum_ctx *sctx, struct work *g_work )
         lowest_share = 9e99;
    }

-    if ( !opt_quiet )
+    if ( new_job && !opt_quiet )
    {
       applog2( LOG_INFO, "Diff: Net %.5g, Stratum %.5g, Target %.5g",
                          net_diff, stratum_diff, g_work->targetdiff );
@@ -2742,10 +2762,14 @@ static void *stratum_thread(void *userdata )
         }
         else
         {
-            stratum_down = false;
+// sometimes stratum connects but doesn't immediately send a job, wait for one.
+//            stratum_down = false;
            applog(LOG_BLUE,"Stratum connection established" );
            if ( stratum.new_job )   // prime first job
+            {
+               stratum_down = false;
               stratum_gen_work( &stratum, &g_work );
+            }
         }
      }

@@ -2754,6 +2778,7 @@ static void *stratum_thread(void *userdata )
      {
         if ( likely( s = stratum_recv_line( &stratum ) ) )
         {
+            stratum_down = false;
            if ( likely( !stratum_handle_method( &stratum, s ) ) )
               stratum_handle_response( s );
            free( s );
@@ -2845,6 +2870,7 @@ static bool cpu_capability( bool display_only )
     bool cpu_has_sha    = has_sha();
     bool cpu_has_avx512 = has_avx512();
     bool cpu_has_vaes   = has_vaes();
+     bool cpu_has_avx10  = has_avx10();
     bool sw_has_aes    = false;
     bool sw_has_sse2   = false;
     bool sw_has_sse42  = false;
@@ -2909,8 +2935,8 @@ static bool cpu_capability( bool display_only )
     #ifdef _MSC_VER
         " with VC++ 2013\n");
     #elif defined(__GNUC__)
-         " with GCC");
-        printf(" %d.%d.%d\n", __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
+         " with GCC-");
+        printf("%d.%d.%d\n", __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
     #else
        printf("\n");
     #endif
@@ -2924,6 +2950,8 @@ static bool cpu_capability( bool display_only )
     if      ( cpu_has_vaes   )    printf( " VAES"   );
     else if ( cpu_has_aes    )    printf( "  AES"   );
     if      ( cpu_has_sha    )    printf( " SHA"    );
+     if      ( cpu_has_avx10  )    printf( " AVX10.%d-%d",
+                                    avx10_version(), avx10_vector_length() );

     printf("\nSW features:  ");
     if      ( sw_has_avx512 )    printf( " AVX512" );
@@ -3769,24 +3797,29 @@ int main(int argc, char *argv[])
 #endif

 #if defined(WIN32) && defined(WINDOWS_CPU_GROUPS_ENABLED)
-      if ( !opt_quiet )
-         applog( LOG_INFO, "Found %d CPUs in %d groups", num_cpus, num_cpugroups );
+      if ( opt_debug || ( !opt_quiet && num_cpugroups > 1 ) )
+         applog( LOG_INFO, "Found %d CPUs in %d groups",
+                           num_cpus, num_cpugroups );
 #endif
   
-   if ( opt_affinity && num_cpus > max_cpus )
+   const int map_size = opt_n_threads < num_cpus ? num_cpus : opt_n_threads;   
+   thread_affinity_map = malloc( map_size * (sizeof (int)) );
+   if ( !thread_affinity_map )
   {
-      applog( LOG_WARNING, "More than %d CPUs, CPU affinity is disabled",
-                            max_cpus );
+      applog( LOG_ERR, "CPU Affinity disabled, memory allocation failed" );
      opt_affinity = 0ULL;
-   }
-   
+   }   
   if ( opt_affinity )
   {
-      for ( int thr = 0, cpu = 0; thr < opt_n_threads; thr++, cpu++ )
+      int active_cpus = 0; // total CPUs available using rolling affinity mask
+      for ( int thr = 0, cpu = 0; thr < map_size; thr++, cpu++ )
      {
         while ( !( ( opt_affinity >> ( cpu & 63 ) ) & 1ULL ) ) cpu++;   
         thread_affinity_map[ thr ] = cpu % num_cpus;
+         if ( cpu < num_cpus ) active_cpus++;
      }
+      if ( opt_n_threads > active_cpus )
+         applog( LOG_WARNING, "Affinity: more threads (%d) than active CPUs (%d)", opt_n_threads, active_cpus );
      if ( !opt_quiet )
      {
         char affinity_mask[64];
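Note on the rolling affinity map built in the hunk above: each thread takes the next set bit of the 64 bit mask, the bit index wraps modulo 64, and the resulting CPU number wraps modulo `num_cpus`. The sketch below isolates that loop so it can be tried standalone. The function name `build_affinity_map` and its parameters are hypothetical, and the non-zero mask precondition is an assumption made explicit here because the inner loop never terminates on an empty mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill map[0..map_size-1] with the CPU assigned to each thread and return
   the number of distinct CPUs selected on the first pass over the mask. */
static int build_affinity_map( int *map, int map_size, uint64_t mask,
                               int num_cpus )
{
   assert( map != NULL && mask != 0 && num_cpus > 0 );
   int active_cpus = 0;
   for ( int thr = 0, cpu = 0; thr < map_size; thr++, cpu++ )
   {
      /* skip CPUs whose bit is clear, reusing the mask every 64 CPUs */
      while ( !( ( mask >> ( cpu & 63 ) ) & 1ULL ) ) cpu++;
      map[ thr ] = cpu % num_cpus;
      if ( cpu < num_cpus ) active_cpus++;
   }
   return active_cpus;
}
```

With a mask of 0x5 on a 4 CPU system, four threads are mapped to CPUs 0, 2, 0, 2 and the function reports 2 active CPUs, which is the situation the new "more threads than active CPUs" warning reports.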
--- a/miner.h
+++ b/miner.h
@@ -24,6 +24,11 @@

 #endif /* _MSC_VER */

+// prevent questions from ARM users that don't read the requirements.
+#if !defined(__x86_64__)
+#error "CPU architecture not supported. Consult the requirements for supported CPUs."
+#endif
+
 #include <stdbool.h>
 #include <inttypes.h>
 #include <sys/time.h>
@@ -410,7 +415,8 @@ struct work
   double stratum_diff;
 	int height;
 	char *txs;
-	char *workid;
+   int tx_count;
+   char *workid;
 	char *job_id;
 	size_t xnonce2_len;
 	unsigned char *xnonce2;
@@ -427,7 +433,8 @@ struct stratum_job
 	unsigned char *coinbase;
 	unsigned char *xnonce2;
 	int merkle_count;
-	unsigned char **merkle;
+   int merkle_buf_size;
+   unsigned char **merkle;
 	unsigned char version[4];
 	unsigned char nbits[4];
 	unsigned char ntime[4];
@@ -582,9 +589,11 @@ enum algos {
        ALGO_QUBIT,       
        ALGO_SCRYPT,
        ALGO_SHA256D,
+        ALGO_SHA256DT,
        ALGO_SHA256Q,
        ALGO_SHA256T,
        ALGO_SHA3D,
+        ALGO_SHA512256D,
        ALGO_SHAVITE3,    
        ALGO_SKEIN,       
        ALGO_SKEIN2,      
@@ -675,9 +684,11 @@ static const char* const algo_names[] = {
        "qubit",
        "scrypt",
        "sha256d",
+        "sha256dt",
        "sha256q",
        "sha256t",
        "sha3d",
+        "sha512256d",
        "shavite3",
        "skein",
        "skein2",
@@ -837,9 +848,11 @@ Options:\n\
                          scrypt:N      scrypt(N, 1, 1)\n\
                          scryptn2      scrypt(1048576, 1,1)\n\
                          sha256d       Double SHA-256\n\
+                          sha256dt      Modified sha256d (Novo)\n\
                          sha256q       Quad SHA-256, Pyrite (PYE)\n\
                          sha256t       Triple SHA-256, Onecoin (OC)\n\
                          sha3d         Double Keccak256 (BSHA3)\n\
+                          sha512256d    Double SHA-512 (Radiant)\n\
                          shavite3      Shavite3\n\
                          skein         Skein+Sha (Skeincoin)\n\
                          skein2        Double Skein (Woodcoin)\n\
--- a/simd-utils.h
+++ b/simd-utils.h
@@ -15,10 +15,6 @@
 //    data but not for vectors. The main categories are bit rotation
 //    and endian byte swapping
 //
-//    An attempt was made to make the names as similar as possible to
-//    Intel's intrinsic function format. Most variations are to avoid
-//    confusion with actual Intel intrinsics, brevity, and clarity.
-//
 //    This suite supports some operations on regular 64 bit integers
 //    as well as 128 bit integers available on recent versions of Linux
 //    and GCC.
@@ -37,6 +33,9 @@
 //    SSE2:   128 bit vectors  (64 bit CPUs only, such as Intel Core2.
 //    AVX2:   256 bit vectors  (Starting with Intel Haswell and AMD Ryzen)
 //    AVX512: 512 bit vectors  (Starting with SkylakeX)
+//    AVX10:  when available will supersede AVX512 and will bring AVX512
+//        features, except 512 bit vectors, to Intel's Ecores. It needs to be
+//        enabled manually when the relevant GCC macros are known.
 //
 //    Most functions are avalaible at the stated levels but in rare cases
 //    a higher level feature may be required with no compatible alternative.
@@ -44,15 +43,6 @@
 //    such as SSSE3 or SSE4.1 that will be used automatically on capable
 //    CPUs.
 //
-//    The vector size boundaries are respected to maintain compatibility.
-//    For example, an instruction introduced with AVX2 may improve 128 bit
-//    vector performance but will not be implemented. A CPU with AVX2 will
-//    tend to use 256 bit vectors. On a practical level AVX512 does introduce
-//    bit rotation instructions for 128 and 256 bit vectors in addition to
-//    its own 5a12 bit vectors. These will not be back ported to replace the
-//    SW implementations for the smaller vectors. This policy may be reviewed
-//    in the future once AVX512 is established. 
-//
 //    Strict alignment of data is required: 16 bytes for 128 bit vectors,
 //    32 bytes for 256 bit vectors and 64 bytes for 512 bit vectors. 64 byte
 //    alignment is recommended in all cases for best cache alignment.
@@ -62,43 +52,33 @@
 //    for the applications but also adds responsibility to ensure adequate data
 //    alignment.
 //
-//    Windows has problems with function vector arguments larger than
-//    128 bits. Stack alignment is only guaranteed to 16 bytes. Always use
-//    pointers for larger vectors in function arguments. Macros can be used
-//    for larger value arguments.
-//
 //    An attempt was made to make the names as similar as possible to
 //    Intel's intrinsic function format. Most variations are to avoid
-//    confusion with actual Intel intrinsics, brevity, and clarity
+//    confusion with actual Intel intrinsics, brevity, and clarity.
 //
 //    The main differences are:
 //
-//   - the leading underscore(s) "_" and the "i" are dropped from the
-//     prefix of vector instructions.
-//   - "mm64" and "mm128" used for 64 and 128 bit prefix respectively
-//     to avoid the ambiguity of "mm".
+//   - the leading underscore "_" is dropped from the prefix of vector function
+//     macros.
+//   - "mm128" is used as the 128 bit prefix to be consistent with mm256 &
+//     mm512 and to avoid the ambiguity of "mm" which is also used for
+//     64 bit MMX intrinsics.
 //   - the element size does not include additional type specifiers
 //      like "epi".
-//   - some macros may contain value args that are updated.
-//   - specialized shift and rotate functions that move elements around
-//     use the notation "1x32" to indicate the distance moved as units of
-//     the element size.
-//     Vector shuffle rotations are being renamed to "vrol" and "vror"
-//     to avoid confusion with bit rotations.
 //   - there is a subset of some functions for scalar data. They may have
 //     no prefix nor vec-size, just one size, the size of the data.
 //   - Some integer functions are also defined which use a similar notation.
 //   
 //    Function names follow this pattern:
 //
-//         prefix_op[vsize]_[esize]
+//         [prefix]_[op][vsize]_[esize]
 //
 //    Prefix: usually the size of the returned vector.
 //    Following are some examples:
 //
 //    u64:  unsigned 64 bit integer function
 //    i128: signed 128 bit integer function (rarely used)
-//    m128: 128 bit vector identifier
+//    m128: 128 bit vector identifier (deprecated)
 //    mm128: 128 bit vector function
 //
 //    op: describes the operation of the function or names the data
@@ -109,9 +89,7 @@
 //    vsize: optional, lane size used when a function operates on elements
 //           within lanes of a larger vector.
 //
-//    m256_const_64 defines a vector contructed from the supplied 64 bit
-//        integer arguments.
-//    mm256_shuflr128_32 rotates each 128 bit lane of a 256 bit vector
+//    Ex: mm256_shuflr128_32 rotates each 128 bit lane of a 256 bit vector
 //        right by 32 bits.
 //
 // Vector constants
@@ -137,12 +115,6 @@
 // If a vector constant is to be used repeatedly it is better to define a local
 // variable to generate the constant only once.
 //
-// If a sequence of constants is to be used it can be more efficient to
-// use arithmetic with already existing constants to generate new ones.
-//
-// ex: const __m512i one = m512_one_64;
-//     const __m512i two = _mm512_add_epi64( one, one );
-//     
 //////////////////////////////////////////////////////////////////////////

 #include <inttypes.h>
--- a/simd-utils/intrlv.h
+++ b/simd-utils/intrlv.h
--- a/simd-utils/simd-128.h
+++ b/simd-utils/simd-128.h
@@ -42,10 +42,12 @@ typedef union
   uint32_t u32[4];
 } __attribute__ ((aligned (16))) m128_ovly;

-// Efficient and convenient moving between GP & low bits of XMM.
-// Use VEX when available to give access to xmm8-15 and zero extend for
-// larger vectors.

+// Deprecated. AVX512 adds EVEX encoding (3rd operand) and other improvements
+// that make these functions either unnecessary or inefficient.
+// In cases where an explicit move between GP & SIMD registers is still
+// necessary the cvt, set, or set1 intrinsics can be used allowing the
+// compiler to exploit new features to produce optimum code.
 static inline __m128i mm128_mov64_128( const uint64_t n )
 {
  __m128i a;
@@ -68,62 +70,19 @@ static inline __m128i mm128_mov32_128( const uint32_t n )
  return a;
 }

-// Inconstant naming, prefix should reflect return value:
-// u64_mov128_64
-
-static inline uint64_t u64_mov128_64( const __m128i a )
-{
-  uint64_t n;
-#if defined(__AVX__)
-  asm( "vmovq %1, %0\n\t" : "=r"(n) : "x"(a) );
-#else  
-  asm( "movq %1, %0\n\t" : "=r"(n) : "x"(a) );
-#endif
-  return n;
-}
-
-static inline uint32_t u32_mov128_32( const __m128i a )
-{
-  uint32_t n;
-#if defined(__AVX__)
-  asm( "vmovd %1, %0\n\t" : "=r"(n) : "x"(a) );
-#else  
-  asm( "movd %1, %0\n\t" : "=r"(n) : "x"(a) );
-#endif
-  return n;
-}
-
-// Equivalent of set1, broadcast integer to all elements.
-#define m128_const_i128( i ) mm128_mov64_128( i )
-#define m128_const1_64( i ) _mm_shuffle_epi32( mm128_mov64_128( i ), 0x44 )
-#define m128_const1_32( i ) _mm_shuffle_epi32( mm128_mov32_128( i ), 0x00 )
-
-#if defined(__SSE4_1__)
-
-// Assign 64 bit integers to respective elements: {hi, lo}
-#define m128_const_64( hi, lo ) \
-   _mm_insert_epi64( mm128_mov64_128( lo ), hi, 1 )
-
-#else  // No insert in SSE2
-
-#define m128_const_64  _mm_set_epi64x
-
-#endif
+// Emulate broadcast & insert instructions not available in SSE2
+// FYI only, not used anywhere
+//#define mm128_bcast_m64( v )   _mm_shuffle_epi32( v, 0x44 )
+//#define mm128_bcast_m32( v )   _mm_shuffle_epi32( v, 0x00 )

 // Pseudo constants
-
 #define m128_zero      _mm_setzero_si128()
 #define m128_one_128   mm128_mov64_128( 1 )
-#define m128_one_64    _mm_shuffle_epi32( mm128_mov64_128( 1 ), 0x44 )
-#define m128_one_32    _mm_shuffle_epi32( mm128_mov32_128( 1 ), 0x00 )
-#define m128_one_16    _mm_shuffle_epi32( \
-                                 mm128_mov32_128( 0x00010001 ), 0x00 )
-#define m128_one_8     _mm_shuffle_epi32( \
-                                 mm128_mov32_128( 0x01010101 ), 0x00 )
+//#define m128_one_64    _mm_set1_epi64x( 1 )
+#define m128_one_32    _mm_set1_epi32( 1 )

 // ASM avoids the need to initialize return variable to avoid compiler warning.
 // Macro abstracts function parentheses to look like an identifier.
-
 static inline __m128i mm128_neg1_fn()
 {
   __m128i a;
@@ -149,7 +108,7 @@ static inline __m128i mm128_neg1_fn()
 // sizing. It's unique.
 //
 // It can:
-//   - zero 32 bit elements of a 128 bit vector.
+//   - zero any number of 32 bit elements of a 128 bit vector.
 //   - extract any 32 bit element from one 128 bit vector and insert the
 //     data to any 32 bit element of another 128 bit vector, or the same vector.
 //   - do both simultaneoulsly.
@@ -162,29 +121,31 @@ static inline __m128i mm128_neg1_fn()
 //    c[5:4] destination element selector
 //    c[7:6] source element selector

-// Convert type and abbreviate name: e"x"tract "i"nsert "m"ask
+// Convert type and abbreviate name: eXtract Insert Mask = XIM
 #define mm128_xim_32( v1, v2, c ) \
   _mm_castps_si128( _mm_insert_ps( _mm_castsi128_ps( v1 ), \
                                    _mm_castsi128_ps( v2 ), c ) )

-// Some examples of simple operations:
+/* Another way to do it with individual arguments.
+#define mm128_xim_32( v1, i1, v2, i2, mask ) \
+   _mm_castps_si128( _mm_insert_ps( _mm_castsi128_ps( v1 ), \
+                                    _mm_castsi128_ps( v2 ), \
+                                    (mask) | ((i1)<<4) | ((i2)<<6) ) )
+*/

-// Insert 32 bit integer into v at element c and return modified v.
+// Examples of simple operations using xim:
+
+// Copy i to element c of dest and copy remaining elements from v.
 static inline __m128i mm128_insert_32( const __m128i v, const uint32_t i,
                                       const int c )
 {   return mm128_xim_32( v, mm128_mov32_128( i ), c<<4 ); }

-// Extract 32 bit element c from v and return as integer.
-static inline uint32_t mm128_extract_32( const __m128i v, const int c )
-{   return u32_mov128_32( mm128_xim_32( v, v, c<<6 ) ); }
-
-// Clear (zero) 32 bit elements based on bits set in 4 bit mask.
+// Zero 32 bit elements when corresponding bit in 4 bit mask is set.
 static inline __m128i mm128_mask_32( const __m128i v, const int m ) 
 {   return mm128_xim_32( v, v, m ); }

-// Move element i2 of v2 to element i1 of v1. For reference and convenience,
-// it's faster to precalculate the index.
-#define mm128_shuflmov_32( v1, i1, v2, i2 ) \
+// Copy element i2 of v2 to element i1 of dest and copy remaining elements from v1.
+#define mm128_mov32_32( v1, i1, v2, i2 ) \
  mm128_xim_32( v1, v2, ( (i1)<<4 ) | ( (i2)<<6 ) )

 #endif  // SSE4_1
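Note on the control byte used by `mm128_xim_32`: bits 3:0 are the zero mask, bits 5:4 select the destination element and bits 7:6 select the source element. The portable scalar model below is only a sketch for reasoning about that byte without SSE4.1 hardware. The type `vec128_model` and the functions `xim_control` and `xim_32_model` are hypothetical stand-ins, not part of simd-utils.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128 bit vector of four 32 bit elements. */
typedef struct { uint32_t e[4]; } vec128_model;

/* Compose the control byte: zero mask in bits 3:0, destination index in
   bits 5:4, source index in bits 7:6. */
static inline int xim_control( int dst, int src, int zmask )
{
   assert( dst >= 0 && dst < 4 && src >= 0 && src < 4 );
   assert( zmask >= 0 && zmask < 16 );
   return zmask | ( dst << 4 ) | ( src << 6 );
}

/* Scalar model of the extract-insert-mask operation: copy one element of
   v2 into v1, then zero every element whose mask bit is set. */
static vec128_model xim_32_model( vec128_model v1, vec128_model v2, int c )
{
   vec128_model r = v1;
   r.e[ ( c >> 4 ) & 3 ] = v2.e[ ( c >> 6 ) & 3 ];
   for ( int i = 0; i < 4; i++ )
      if ( ( c >> i ) & 1 ) r.e[i] = 0;
   return r;
}
```

In this model the insert happens before the zeroing, so a mask bit that covers the destination element clears the freshly inserted value.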
@@ -194,6 +155,7 @@ static inline __m128i mm128_mask_32( const __m128i v, const int m )

 // Bitwise not (~v)  
 #if defined(__AVX512VL__)
+//TODO Enable for AVX10_256

 static inline __m128i mm128_not( const __m128i v )
 {  return _mm_ternarylogic_epi64( v, v, v, 1 ); }
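Note on the immediates passed to `_mm_ternarylogic_epi64`: the 8 bit immediate is a truth table in which bit (a<<2 | b<<1 | c) holds the result for that combination of input bits. The sketch below derives the immediate from an ordinary C function, which is one way to check the constants used by the three-input macros in this file. The helper `ternlog_imm8` and the `f_*` functions are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*bool3_fn)( int a, int b, int c );

/* Build the ternary logic immediate: evaluate f for all 8 input
   combinations and store each result at bit index (a<<2 | b<<1 | c). */
static uint8_t ternlog_imm8( bool3_fn f )
{
   assert( f != NULL );
   uint8_t imm = 0;
   for ( int i = 0; i < 8; i++ )
   {
      int a = ( i >> 2 ) & 1, b = ( i >> 1 ) & 1, c = i & 1;
      if ( f( a, b, c ) & 1 ) imm |= (uint8_t)( 1u << i );
   }
   return imm;
}

/* ~(a | b | c); with the same vector in all three inputs this is ~v */
static int f_nor3( int a, int b, int c )      { return !( a | b | c ); }
/* a ^ b ^ c */
static int f_xor3( int a, int b, int c )      { return a ^ b ^ c; }
/* a ^ ( b & c ) */
static int f_xorand( int a, int b, int c )    { return a ^ ( b & c ); }
/* a ^ ( ~b & c ) */
static int f_xorandnot( int a, int b, int c ) { return a ^ ( !b & c ); }
```

When all three operands are the same vector, as in `mm128_not`, only table rows 0 and 7 are reachable, which is why the immediate 1 is sufficient to express a bitwise NOT.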
@@ -204,13 +166,6 @@ static inline __m128i mm128_not( const __m128i v )

 #endif

-/*
-// Unary negation of elements (-v)
-#define mm128_negate_64( v )    _mm_sub_epi64( m128_zero, v )
-#define mm128_negate_32( v )    _mm_sub_epi32( m128_zero, v )  
-#define mm128_negate_16( v )    _mm_sub_epi16( m128_zero, v )  
-*/
-
 // Add 4 values, fewer dependencies than sequential addition.
 #define mm128_add4_64( a, b, c, d ) \
   _mm_add_epi64( _mm_add_epi64( a, b ), _mm_add_epi64( c, d ) )
@@ -263,31 +218,67 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )
 {   for ( int i = 0; i < n; i ++ ) dst[i] = src[i]; }

 #if defined(__AVX512VL__)
+//TODO Enable for AVX10_256

 // a ^ b ^ c
-#define mm128_xor3( a, b, c )    _mm_ternarylogic_epi64( a, b, c, 0x96 )
+#define mm128_xor3( a, b, c )      _mm_ternarylogic_epi64( a, b, c, 0x96 )
+
+// a & b & c
+#define mm128_and3( a, b, c )      _mm_ternarylogic_epi64( a, b, c, 0x80 )
+
+// a | b | c
+#define mm128_or3( a, b, c )       _mm_ternarylogic_epi64( a, b, c, 0xfe )

 // a ^ ( b & c )
-#define mm128_xorand( a, b, c )  _mm_ternarylogic_epi64( a, b, c, 0x78 )
+#define mm128_xorand( a, b, c )    _mm_ternarylogic_epi64( a, b, c, 0x78 )
+
+// a & ( b ^ c )
+#define mm128_andxor( a, b, c )    _mm_ternarylogic_epi64( a, b, c, 0x60 )
+
+// a ^ ( b | c )
+#define mm128_xoror( a, b, c )     _mm_ternarylogic_epi64( a, b, c, 0x1e )
+
+// a ^ ( ~b & c )
+#define mm128_xorandnot( a, b, c ) _mm_ternarylogic_epi64( a, b, c, 0xd2 )
+
+// a | ( b & c )
+#define mm128_orand( a, b, c )     _mm_ternarylogic_epi64( a, b, c, 0xf8 )
+
+// ~( a ^ b ), same as (~a) ^ b
+#define mm128_xnor( a, b )         _mm_ternarylogic_epi64( a, b, b, 0x81 )

 #else

-#define mm128_xor3( a, b, c )    _mm_xor_si128( a, _mm_xor_si128( b, c ) )
+#define mm128_xor3( a, b, c )      _mm_xor_si128( a, _mm_xor_si128( b, c ) )

-#define mm128_xorand( a, b, c )  _mm_xor_si128( a, _mm_and_si128( b, c ) )
+#define mm128_and3( a, b, c )      _mm_and_si128( a, _mm_and_si128( b, c ) )
+
+#define mm128_or3( a, b, c )       _mm_or_si128( a, _mm_or_si128( b, c ) )
+
+#define mm128_xorand( a, b, c )    _mm_xor_si128( a, _mm_and_si128( b, c ) )
+
+#define mm128_andxor( a, b, c )    _mm_and_si128( a, _mm_xor_si128( b, c ))
+
+#define mm128_xoror( a, b, c )     _mm_xor_si128( a, _mm_or_si128( b, c ) )
+
+#define mm128_xorandnot( a, b, c ) _mm_xor_si128( a, _mm_andnot_si128( b, c ) )
+
+#define mm128_orand( a, b, c )     _mm_or_si128( a, _mm_and_si128( b, c ) )
+
+#define mm128_xnor( a, b )         mm128_not( _mm_xor_si128( a, b ) )

 #endif

 // Mask making
 // Equivalent of AVX512 _mm_movepi64_mask & _mm_movepi32_mask.
-// Returns 2 or 4 bit integer mask from MSB of 64 or 32 bit elements.
+// Returns 2 or 4 bit integer mask from MSBit of 64 or 32 bit elements.
 // Effectively a sign test.

-#define mm_movmask_64( v ) \
-   _mm_castpd_si128( _mm_movmask_pd( _mm_castsi128_pd( v ) ) )
+#define mm128_movmask_64( v ) \
+   _mm_movemask_pd( (__m128d)(v) )

-#define mm_movmask_32( v ) \
-   _mm_castps_si128( _mm_movmask_ps( _mm_castsi128_ps( v ) ) )
+#define mm128_movmask_32( v ) \
+   _mm_movemask_ps( (__m128)(v) )

 //
 // Bit rotations
@@ -297,6 +288,7 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )
 // transparency.

 #if defined(__AVX512VL__)
+//TODO Enable for AVX10_256

 #define mm128_ror_64    _mm_ror_epi64
 #define mm128_rol_64    _mm_rol_epi64
@@ -375,16 +367,7 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )

 #endif   // AVX512 else SSE2

-#define mm128_ror_16( v, c ) \
-   _mm_or_si128( _mm_srli_epi16( v, c ), _mm_slli_epi16( v, 16-(c) ) )
-
-#define mm128_rol_16( v, c ) \
-   _mm_or_si128( _mm_slli_epi16( v, c ), _mm_srli_epi16( v, 16-(c) ) )
-
-// Deprecated.
-#define mm128_rol_var_32( v, c ) \
-   _mm_or_si128( _mm_slli_epi32( v, c ), _mm_srli_epi32( v, 32-(c) ) )
-
+// Cross lane shuffles
 //
 // Limited 2 input shuffle, combines shuffle with blend. The destination low
 // half is always taken from v1, and the high half from v2.
@@ -396,16 +379,16 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )
   _mm_castps_si128( _mm_shuffle_ps( _mm_castsi128_ps( v1 ), \
                                     _mm_castsi128_ps( v2 ), c ) ); 

-//
 // Rotate vector elements accross all lanes

-#define mm128_swap_64( v )    _mm_shuffle_epi32( v, 0x4e )
-#define mm128_shuflr_64       mm128_swap_64
-#define mm128_shufll_64       mm128_swap_64
+#define mm128_swap_64( v )     _mm_shuffle_epi32( v, 0x4e )
+#define mm128_shuflr_64        mm128_swap_64
+#define mm128_shufll_64        mm128_swap_64

 #define mm128_shuflr_32( v )   _mm_shuffle_epi32( v, 0x39 )
 #define mm128_shufll_32( v )   _mm_shuffle_epi32( v, 0x93 )

+/* Not used
 #if defined(__SSSE3__)

 // Rotate right by c bytes, no SSE2 equivalent.
@@ -413,16 +396,18 @@ static inline __m128i mm128_shuflr_x8( const __m128i v, const int c )
 { return _mm_alignr_epi8( v, v, c ); }

 #endif
+*/

-// Rotate byte elements within 64 or 32 bit lanes, AKA optimized bit rotations
-// for multiples of 8 bits. Uses ror/rol macros when AVX512 is available
-// (unlikely but faster), or when SSSE3 is not available (slower).
+// Rotate elements within 64 bit lanes

 #define mm128_swap64_32( v )  _mm_shuffle_epi32( v, 0xb1 )
-#define mm128_shuflr64_32 mm128_swap64_32
-#define mm128_shufll64_32 mm128_swap64_32
+#define mm128_shuflr64_32     mm128_swap64_32
+#define mm128_shufll64_32     mm128_swap64_32

-#if defined(__SSSE3__) && !defined(__AVX512VL__)
+//TODO Enable for AVX10_256
+#if defined(__AVX512VL__)
+  #define mm128_shuflr64_24( v )  _mm_ror_epi64( v, 24 )
+#elif defined(__SSSE3__) 
  #define mm128_shuflr64_24( v ) \
    _mm_shuffle_epi8( v, _mm_set_epi64x( \
                                    0x0a09080f0e0d0c0b, 0x0201000706050403 ) )
@@ -430,7 +415,9 @@ static inline __m128i mm128_shuflr_x8( const __m128i v, const int c )
  #define mm128_shuflr64_24( v ) mm128_ror_64( v, 24 )
 #endif

-#if defined(__SSSE3__) && !defined(__AVX512VL__)
+#if defined(__AVX512VL__)
+  #define mm128_shuflr64_16( v )  _mm_ror_epi64( v, 16 )
+#elif defined(__SSSE3__) 
  #define mm128_shuflr64_16( v ) \
    _mm_shuffle_epi8( v, _mm_set_epi64x( \
                                    0x09080f0e0d0c0b0a, 0x0100070605040302 ) )
@@ -438,17 +425,23 @@ static inline __m128i mm128_shuflr_x8( const __m128i v, const int c )
  #define mm128_shuflr64_16( v ) mm128_ror_64( v, 16 )
 #endif

-#if defined(__SSSE3__) && !defined(__AVX512VL__)
+// Rotate elements within 32 bit lanes
+
+#if defined(__AVX512VL__)
+  #define mm128_swap32_16( v )  _mm_ror_epi32( v, 16 )
+#elif defined(__SSSE3__)
  #define mm128_swap32_16( v ) \
    _mm_shuffle_epi8( v, _mm_set_epi64x( \
                                    0x0d0c0f0e09080b0a, 0x0504070601000302 ) )
 #else
  #define mm128_swap32_16( v ) mm128_ror_32( v, 16 )
 #endif
-#define mm128_shuflr32_16 mm128_swap32_16
-#define mm128_shufll32_16 mm128_swap32_16
+#define mm128_shuflr32_16      mm128_swap32_16
+#define mm128_shufll32_16      mm128_swap32_16

-#if defined(__SSSE3__) && !defined(__AVX512VL__)
+#if defined(__AVX512VL__)
+  #define mm128_shuflr32_8( v )  _mm_ror_epi32( v, 8 )
+#elif defined(__SSSE3__)
  #define mm128_shuflr32_8( v ) \
    _mm_shuffle_epi8( v, _mm_set_epi64x( \
                                    0x0c0f0e0d080b0a09, 0x0407060500030201 ) )
@@ -462,25 +455,25 @@ static inline __m128i mm128_shuflr_x8( const __m128i v, const int c )
 #if defined(__SSSE3__)

 #define mm128_bswap_128( v ) \
-   _mm_shuffle_epi8( v, m128_const_64( 0x0001020304050607, \
-                                       0x08090a0b0c0d0e0f ) )
+   _mm_shuffle_epi8( v, _mm_set_epi64x( 0x0001020304050607, \
+                                        0x08090a0b0c0d0e0f ) )

 #define mm128_bswap_64( v ) \
-   _mm_shuffle_epi8( v, m128_const_64( 0x08090a0b0c0d0e0f, \
-                                       0x0001020304050607 ) )
+   _mm_shuffle_epi8( v, _mm_set_epi64x( 0x08090a0b0c0d0e0f, \
+                                        0x0001020304050607 ) )

 #define mm128_bswap_32( v ) \
-   _mm_shuffle_epi8( v, m128_const_64( 0x0c0d0e0f08090a0b, \
-                                       0x0405060700010203 ) )
+   _mm_shuffle_epi8( v, _mm_set_epi64x( 0x0c0d0e0f08090a0b, \
+                                        0x0405060700010203 ) )

 #define mm128_bswap_16( v ) \
-   _mm_shuffle_epi8( v, m128_const_64( 0x0e0f0c0d0a0b0809, \
-                                       0x0607040502030001 )
+   _mm_shuffle_epi8( v, _mm_set_epi64x( 0x0e0f0c0d0a0b0809, \
+                                        0x0607040502030001 ) )

 // 8 byte qword * 8 qwords * 2 lanes = 128 bytes
 #define mm128_block_bswap_64( d, s ) do \
 { \
-   __m128i ctl = m128_const_64(  0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
+   __m128i ctl = _mm_set_epi64x(  0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
  casti_m128i( d, 0 ) = _mm_shuffle_epi8( casti_m128i( s, 0 ), ctl ); \
  casti_m128i( d, 1 ) = _mm_shuffle_epi8( casti_m128i( s, 1 ), ctl ); \
  casti_m128i( d, 2 ) = _mm_shuffle_epi8( casti_m128i( s, 2 ), ctl ); \
@@ -494,7 +487,7 @@ static inline __m128i mm128_shuflr_x8( const __m128i v, const int c )
 // 4 byte dword * 8 dwords * 4 lanes = 128 bytes
 #define mm128_block_bswap_32( d, s ) do \
 { \
-   __m128i ctl = m128_const_64( 0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
+   __m128i ctl = _mm_set_epi64x( 0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
  casti_m128i( d, 0 ) = _mm_shuffle_epi8( casti_m128i( s, 0 ), ctl ); \
  casti_m128i( d, 1 ) = _mm_shuffle_epi8( casti_m128i( s, 1 ), ctl ); \
  casti_m128i( d, 2 ) = _mm_shuffle_epi8( casti_m128i( s, 2 ), ctl ); \
@@ -555,17 +548,8 @@ static inline void mm128_block_bswap_32( __m128i *d, const __m128i *s )

 #endif // SSSE3 else SSE2

-// Swap 128 bit vectors.
-// This should be avoided, it's more efficient to switch references.
-#define mm128_swap256_128( v1, v2 ) \
-   v1 = _mm_xor_si128( v1, v2 ); \
-   v2 = _mm_xor_si128( v1, v2 ); \
-   v1 = _mm_xor_si128( v1, v2 );
-
-
-// alignr for 32 & 64 bit elements is only available with AVX512 but
-// emulated here. Shift argument is not needed, it's always 1.
-// Behaviour is otherwise consistent with Intel alignr intrinsics.
+// alignr instruction for 32 & 64 bit elements is only available with AVX512
+// but is emulated here. Behaviour is consistent with Intel alignr intrinsics.

 #if defined(__SSSE3__)

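Note on the ternary logic immediates in the simd-128.h hunks above: each 8 bit constant passed to _mm_ternarylogic_epi64 (0x96, 0x78, 0xd2, ...) is a truth table in which bit ( a<<2 | b<<1 | c ) holds the result for that combination of input bits. The scalar C sketch below is not part of the commit. It rebuilds each table from its boolean expression so the constants can be checked without AVX512 hardware. The names bool3_fn, ternlog_imm8, the f_* helpers and ternlog_selftest are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper type: a 3 input boolean function on values 0 or 1.
typedef int (*bool3_fn)( int a, int b, int c );

// Build the 8 bit truth table used as the ternary logic immediate:
// bit ( a<<2 | b<<1 | c ) holds f( a, b, c ).
static uint8_t ternlog_imm8( bool3_fn f )
{
   uint8_t imm = 0;
   for ( int i = 0; i < 8; i++ )
      if ( f( ( i >> 2 ) & 1, ( i >> 1 ) & 1, i & 1 ) & 1 )
         imm |= (uint8_t)( 1u << i );
   return imm;
}

// Scalar versions of the expressions named in the macro comments.
static int f_xor3( int a, int b, int c )      { return a ^ b ^ c; }
static int f_and3( int a, int b, int c )      { return a & b & c; }
static int f_or3( int a, int b, int c )       { return a | b | c; }
static int f_xorand( int a, int b, int c )    { return a ^ ( b & c ); }
static int f_andxor( int a, int b, int c )    { return a & ( b ^ c ); }
static int f_xoror( int a, int b, int c )     { return a ^ ( b | c ); }
static int f_xorandnot( int a, int b, int c ) { return a ^ ( !b & c ); }
static int f_orand( int a, int b, int c )     { return a | ( b & c ); }

// Asserts that each immediate in the diff matches its commented expression.
static void ternlog_selftest( void )
{
   assert( ternlog_imm8( f_xor3 )      == 0x96 );
   assert( ternlog_imm8( f_and3 )      == 0x80 );
   assert( ternlog_imm8( f_or3 )       == 0xfe );
   assert( ternlog_imm8( f_xorand )    == 0x78 );
   assert( ternlog_imm8( f_andxor )    == 0x60 );
   assert( ternlog_imm8( f_xoror )     == 0x1e );
   assert( ternlog_imm8( f_xorandnot ) == 0xd2 );
   assert( ternlog_imm8( f_orand )     == 0xf8 );
}
```

Calling ternlog_selftest() from a test program passes silently when every immediate agrees with its comment. The same helper could be used to derive the immediate for any additional 3 input expression before adding a new macro.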
--- a/simd-utils/simd-256.h
+++ b/simd-utils/simd-256.h
@@ -13,17 +13,14 @@
 // automatically but their use is limited because 256 bit vectors are less
 // likely to be used when 512 is available.
 //
+// AVX10_256 will support AVX512VL instructions on CPUs limited to 256 bit
+// vectors. This will require enabling when the compiler's AVX10 feature
+// macros are known.
+//
 // "_mm256_shuffle_epi8" and "_mm256_alignr_epi8" are restricted to 128 bit
 // lanes and data can't cross the 128 bit lane boundary.  
-// Full width byte shuffle is available with AVX512VL using the mask version
-// with a full mask (-1). 
 // Instructions that can move data across 128 bit lane boundary incur a
 // performance penalty over those that can't.
-// Some usage of index vectors may be encoded as if full vector shuffles are
-// supported. This has no side effects and would have the same results using
-// either version.
-// If the need arises and AVX512VL is available, 256 bit full vector byte 
-// shuffles can be implemented using the AVX512 mask feature with a NULL mask.

 #if defined(__AVX__)

@@ -59,55 +56,40 @@ typedef union

 #if defined(__AVX2__)

-// Move integer to low element of vector, other elements are set to zero.
-#define mm256_mov64_256( i ) _mm256_castsi128_si256( mm128_mov64_128( i ) )
-#define mm256_mov32_256( i ) _mm256_castsi128_si256( mm128_mov32_128( i ) )
-
-// Move low element of vector to integer.
-#define u64_mov256_64( v ) u64_mov128_64( _mm256_castsi256_si128( v ) )
-#define u32_mov256_32( v ) u32_mov128_32( _mm256_castsi256_si128( v ) )
-
-// concatenate two 128 bit vectors into one 256 bit vector: { hi, lo }
-#define mm256_concat_128( hi, lo ) \
-   _mm256_inserti128_si256( _mm256_castsi128_si256( lo ), hi, 1 )
-
-
-// Equivalent of set, move 64 bit integer constants to respective 64 bit
-// elements.
-static inline __m256i m256_const_64( const uint64_t i3, const uint64_t i2,
-                                     const uint64_t i1, const uint64_t i0 )
-{
-  union { __m256i m256i;
-          uint64_t u64[4]; } v;
-  v.u64[0] = i0; v.u64[1] = i1; v.u64[2] = i2; v.u64[3] = i3;
-  return v.m256i;
-}
-
-// Equivalent of set1.
-// 128 bit vector argument
-#define m256_const1_128( v ) \
+// Broadcast, ie set1, from 128 bit vector input.
+#define mm256_bcast_m128( v ) \
   _mm256_permute4x64_epi64( _mm256_castsi128_si256( v ), 0x44 )
-// 64 bit integer argument zero extended to 128 bits.
-#define m256_const1_i128( i ) m256_const1_128( mm128_mov64_128( i ) )
-#define m256_const1_64( i )  _mm256_broadcastq_epi64( mm128_mov64_128( i ) )
-#define m256_const1_32( i )  _mm256_broadcastd_epi32( mm128_mov32_128( i ) )
-#define m256_const1_16( i )  _mm256_broadcastw_epi16( mm128_mov32_128( i ) )
-#define m256_const1_8 ( i )  _mm256_broadcastb_epi8 ( mm128_mov32_128( i ) )

-#define m256_const2_64( i1, i0 ) \
-  m256_const1_128( m128_const_64( i1, i0 ) )
+// Set either the low or high 64 bit elements in 128 bit lanes, other elements
+// are set to zero.
+#if defined(__AVX512VL__)
+//TODO Enable for AVX10_256
+
+#define mm256_bcast128lo_64( i64 )     _mm256_maskz_set1_epi64( 0x55, i64 )
+#define mm256_bcast128hi_64( i64 )     _mm256_maskz_set1_epi64( 0xaa, i64 )
+
+#else
+
+#define mm256_bcast128lo_64( i64 )   mm256_bcast_m128( mm128_mov64_128( i64 ) )
+
+#define mm256_bcast128hi_64( i64 )   _mm256_permute4x64_epi64( \
+                   _mm256_castsi128_si256( mm128_mov64_128( i64 ) ), 0x11 )
+
+#endif
+
+#define mm256_set2_64( i1, i0 )   mm256_bcast_m128( _mm_set_epi64x( i1, i0 ) )
+
+#define mm256_set4_32( i3, i2, i1, i0 ) \
+   mm256_bcast_m128( _mm_set_epi32( i3, i2, i1, i0 ) )

-//
 // All SIMD constant macros are actually functions containing executable
 // code and therefore can't be used as compile time initializers.

-#define m256_zero      _mm256_setzero_si256()
-#define m256_one_256   mm256_mov64_256( 1 )
-#define m256_one_128   m256_const1_i128( 1 )
-#define m256_one_64    _mm256_broadcastq_epi64( mm128_mov64_128( 1 ) )
-#define m256_one_32    _mm256_broadcastd_epi32( mm128_mov64_128( 1 ) )
-#define m256_one_16    _mm256_broadcastw_epi16( mm128_mov64_128( 1 ) )
-#define m256_one_8     _mm256_broadcastb_epi8 ( mm128_mov64_128( 1 ) )
+#define m256_zero            _mm256_setzero_si256()
+//#define m256_one_256         mm256_mov64_256( 1 )
+#define m256_one_128         mm256_bcast_m128( m128_one_128 )
+#define m256_one_64          _mm256_set1_epi64x( 1 )
+#define m256_one_32          _mm256_set1_epi32( 1 )

 static inline __m256i mm256_neg1_fn()
 {
@@ -117,10 +99,6 @@ static inline __m256i mm256_neg1_fn()
 }
 #define m256_neg1  mm256_neg1_fn()

-// Consistent naming for similar operations.
-#define mm128_extr_lo128_256( v ) _mm256_castsi256_si128( v )
-#define mm128_extr_hi128_256( v ) _mm256_extracti128_si256( v, 1 )
-
 //
 // Memory functions
 // n = number of 256 bit (32 byte) vectors
@@ -139,6 +117,7 @@ static inline void memcpy_256( __m256i *dst, const __m256i *src, const int n )
 // Basic operations without SIMD equivalent

 #if defined(__AVX512VL__)
+//TODO Enable for AVX10_256

 static inline __m256i mm256_not( const __m256i v )
 {  return _mm256_ternarylogic_epi64( v, v, v, 1 ); }
@@ -149,14 +128,6 @@ static inline __m256i mm256_not( const __m256i v )

 #endif

-/*
-// Unary negation of each element ( -v )
-#define mm256_negate_64( v ) _mm256_sub_epi64( m256_zero, v )
-#define mm256_negate_32( v ) _mm256_sub_epi32( m256_zero, v )
-#define mm256_negate_16( v ) _mm256_sub_epi16( m256_zero, v )
-*/
-
-
 // Add 4 values, fewer dependencies than sequential addition.

 #define mm256_add4_64( a, b, c, d ) \
@@ -165,15 +136,8 @@ static inline __m256i mm256_not( const __m256i v )
 #define mm256_add4_32( a, b, c, d ) \
   _mm256_add_epi32( _mm256_add_epi32( a, b ), _mm256_add_epi32( c, d ) )

-#define mm256_add4_16( a, b, c, d ) \
-   _mm256_add_epi16( _mm256_add_epi16( a, b ), _mm256_add_epi16( c, d ) )
-
-#define mm256_add4_8( a, b, c, d ) \
-   _mm256_add_epi8( _mm256_add_epi8( a, b ), _mm256_add_epi8( c, d ) )
-
 #if defined(__AVX512VL__)
-
-// AVX512 has ternary logic that supports any 3 input boolean expression.
+//TODO Enable for AVX10_256

 // a ^ b ^ c
 #define mm256_xor3( a, b, c )      _mm256_ternarylogic_epi64( a, b, c, 0x96 )
@@ -208,31 +172,31 @@ static inline __m256i mm256_not( const __m256i v )
 #else

 #define mm256_xor3( a, b, c ) \
-   _mm256_xor_si256( a, _mm256_xor_si256( b, c ) )
+  _mm256_xor_si256( a, _mm256_xor_si256( b, c ) )

 #define mm256_xor4( a, b, c, d ) \
-   _mm256_xor_si256( _mm256_xor_si256( a, b ), _mm256_xor_si256( c, d ) )
+  _mm256_xor_si256( _mm256_xor_si256( a, b ), _mm256_xor_si256( c, d ) )

 #define mm256_and3( a, b, c ) \
-   _mm256_and_si256( a, _mm256_and_si256( b, c ) )
+  _mm256_and_si256( a, _mm256_and_si256( b, c ) )

 #define mm256_or3( a, b, c ) \
   _mm256_or_si256( a, _mm256_or_si256( b, c ) )

 #define mm256_xorand( a, b, c ) \
- _mm256_xor_si256( a, _mm256_and_si256( b, c ) )
+  _mm256_xor_si256( a, _mm256_and_si256( b, c ) )

 #define mm256_andxor( a, b, c ) \
  _mm256_and_si256( a, _mm256_xor_si256( b, c ))

 #define mm256_xoror( a, b, c ) \
- _mm256_xor_si256( a, _mm256_or_si256( b, c ) )
+  _mm256_xor_si256( a, _mm256_or_si256( b, c ) )

 #define mm256_xorandnot( a, b, c ) \
- _mm256_xor_si256( a, _mm256_andnot_si256( b, c ) )
+  _mm256_xor_si256( a, _mm256_andnot_si256( b, c ) )

 #define mm256_orand( a, b, c ) \
- _mm256_or_si256( a, _mm256_and_si256( b, c ) )
+  _mm256_or_si256( a, _mm256_and_si256( b, c ) )

 #define mm256_xnor( a, b ) \
  mm256_not( _mm256_xor_si256( a, b ) )
@@ -241,14 +205,14 @@ static inline __m256i mm256_not( const __m256i v )

 // Mask making
 // Equivalent of AVX512 _mm256_movepi64_mask & _mm256_movepi32_mask.
-// Returns 4 or 8 bit integer mask from MSB of 64 or 32 bit elements.
+// Returns 4 or 8 bit integer mask from MSBit of 64 or 32 bit elements.
 // Effectively a sign test.

 #define mm256_movmask_64( v ) \
-   _mm256_castpd_si256( _mm256_movmask_pd( _mm256_castsi256_pd( v ) ) )
+   _mm256_movemask_pd( _mm256_castsi256_pd( v ) )

 #define mm256_movmask_32( v ) \
-   _mm256_castps_si256( _mm256_movmask_ps( _mm256_castsi256_ps( v ) ) )
+   _mm256_movemask_ps( _mm256_castsi256_ps( v ) )

 //
 //           Bit rotations.
@@ -258,6 +222,7 @@ static inline __m256i mm256_not( const __m256i v )
 // transparency.

 #if defined(__AVX512VL__)
+//TODO Enable for AVX10_256

 #define mm256_ror_64    _mm256_ror_epi64
 #define mm256_rol_64    _mm256_rol_epi64
@@ -342,31 +307,22 @@ static inline __m256i mm256_not( const __m256i v )

 #endif     // AVX512 else AVX2

-#define  mm256_ror_16( v, c ) \
-   _mm256_or_si256( _mm256_srli_epi16( v, c ), \
-                    _mm256_slli_epi16( v, 16-(c) ) )
-
-#define mm256_rol_16( v, c ) \
-   _mm256_or_si256( _mm256_slli_epi16( v, c ), \
-                    _mm256_srli_epi16( v, 16-(c) ) )
-
-// Deprecated.
-#define mm256_rol_var_32( v, c ) \
-   _mm256_or_si256( _mm256_slli_epi32( v, c ), \
-                    _mm256_srli_epi32( v, 32-(c) ) )
-
+//
+// Cross lane shuffles
 //
 // Rotate elements accross all lanes.

 // Swap 128 bit elements in 256 bit vector.
 #define mm256_swap_128( v )     _mm256_permute4x64_epi64( v, 0x4e )
-#define mm256_shuflr_128 mm256_swap_128
-#define mm256_shufll_128 mm256_swap_128
+#define mm256_shuflr_128        mm256_swap_128
+#define mm256_shufll_128        mm256_swap_128

 // Rotate 256 bit vector by one 64 bit element
 #define mm256_shuflr_64( v )    _mm256_permute4x64_epi64( v, 0x39 )
 #define mm256_shufll_64( v )    _mm256_permute4x64_epi64( v, 0x93 )

+
+/* Not used
 // Rotate 256 bit vector by one 32 bit element.
 #if defined(__AVX512VL__)

@@ -380,15 +336,16 @@ static inline __m256i mm256_shufll_32( const __m256i v )

 #define mm256_shuflr_32( v ) \
    _mm256_permutevar8x32_epi32( v, \
-                     m256_const_64( 0x0000000000000007, 0x0000000600000005, \
+                 _mm256_set_epi64x( 0x0000000000000007, 0x0000000600000005, \
                                    0x0000000400000003, 0x0000000200000001 ) )

 #define mm256_shufll_32( v ) \
    _mm256_permutevar8x32_epi32( v, \
-                     m256_const_64( 0x0000000600000005,  0x0000000400000003, \
+                 _mm256_set_epi64x( 0x0000000600000005,  0x0000000400000003, \
                                    0x0000000200000001,  0x0000000000000007 ) )

 #endif
+*/

 //
 // Rotate elements within each 128 bit lane of 256 bit vector.
@@ -402,49 +359,52 @@ static inline __m256i mm256_shufll_32( const __m256i v )
   _mm256_castps_si256( _mm256_shuffle_ps( _mm256_castsi256_ps( v1 ), \
                                           _mm256_castsi256_ps( v2 ), c ) ); 

-#define mm256_swap128_64( v )  _mm256_shuffle_epi32( v, 0x4e )
-#define mm256_shuflr128_64 mm256_swap128_64
-#define mm256_shufll128_64 mm256_swap128_64
+#define mm256_swap128_64( v )     _mm256_shuffle_epi32( v, 0x4e )
+#define mm256_shuflr128_64        mm256_swap128_64
+#define mm256_shufll128_64        mm256_swap128_64

 #define mm256_shuflr128_32( v )   _mm256_shuffle_epi32( v, 0x39 )
 #define mm256_shufll128_32( v )   _mm256_shuffle_epi32( v, 0x93 )

+/* Not used
 static inline __m256i mm256_shuflr128_x8( const __m256i v, const int c )
 { return _mm256_alignr_epi8( v, v, c ); }
+*/

-// Rotate byte elements within 64 or 32 bit lanes, AKA optimized bit
-// rotations for multiples of 8 bits. Uses faster ror/rol instructions when
-// AVX512 is available.
+// 64 bit lanes

-#define mm256_swap64_32( v )   _mm256_shuffle_epi32( v, 0xb1 )
-#define mm256_shuflr64_32 mm256_swap64_32
-#define mm256_shufll64_32 mm256_swap64_32
+#define mm256_swap64_32( v )      _mm256_shuffle_epi32( v, 0xb1 )
+#define mm256_shuflr64_32         mm256_swap64_32
+#define mm256_shufll64_32         mm256_swap64_32

+//TODO Enable for AVX10_256
 #if defined(__AVX512VL__)
  #define mm256_shuflr64_24( v )  _mm256_ror_epi64( v, 24 )
 #else
  #define mm256_shuflr64_24( v ) \
-    _mm256_shuffle_epi8( v, m256_const2_64( \
-                                    0x0a09080f0e0d0c0b, 0x0201000706050403 ) )
+    _mm256_shuffle_epi8( v, mm256_bcast_m128( _mm_set_epi64x( \
+                                 0x0a09080f0e0d0c0b, 0x0201000706050403 ) ) )
 #endif

 #if defined(__AVX512VL__)
  #define mm256_shuflr64_16( v )  _mm256_ror_epi64( v, 16 )
 #else
  #define mm256_shuflr64_16( v ) \
-    _mm256_shuffle_epi8( v, m256_const2_64( \
-                                    0x09080f0e0d0c0b0a, 0x0100070605040302 ) )
+    _mm256_shuffle_epi8( v, mm256_bcast_m128( _mm_set_epi64x( \
+                                 0x09080f0e0d0c0b0a, 0x0100070605040302 ) ) )
 #endif

+// 32 bit lanes
+
 #if defined(__AVX512VL__)
  #define mm256_swap32_16( v )  _mm256_ror_epi32( v, 16 )
 #else
  #define mm256_swap32_16( v ) \
-    _mm256_shuffle_epi8( v, m256_const2_64( \
-                                    0x0d0c0f0e09080b0a, 0x0504070601000302 ) )
+    _mm256_shuffle_epi8( v, mm256_bcast_m128( _mm_set_epi64x( \
+                                 0x0d0c0f0e09080b0a, 0x0504070601000302 ) ) )
 #endif
-#define mm256_shuflr32_16 mm256_swap32_16
-#define mm256_shufll32_16 mm256_swap32_16
+#define mm256_shuflr32_16       mm256_swap32_16
+#define mm256_shufll32_16       mm256_swap32_16

 #if defined(__AVX512VL__)
  #define mm256_shuflr32_8( v )  _mm256_ror_epi32( v, 8 )
@@ -457,22 +417,23 @@ static inline __m256i mm256_shuflr128_x8( const __m256i v, const int c )

 // Reverse byte order in elements, endian bswap.
 #define mm256_bswap_64( v ) \
-   _mm256_shuffle_epi8( v, \
-         m256_const2_64( 0x08090a0b0c0d0e0f, 0x0001020304050607 ) )
+   _mm256_shuffle_epi8( v, mm256_bcast_m128( _mm_set_epi64x( \
+                               0x08090a0b0c0d0e0f, 0x0001020304050607 ) ) )

 #define mm256_bswap_32( v ) \
-   _mm256_shuffle_epi8( v, \
-         m256_const2_64( 0x0c0d0e0f08090a0b, 0x0405060700010203 ) )
+   _mm256_shuffle_epi8( v, mm256_bcast_m128( _mm_set_epi64x( \
+                                0x0c0d0e0f08090a0b, 0x0405060700010203 ) ) )

 #define mm256_bswap_16( v ) \
-   _mm256_shuffle_epi8( v, \
-         m256_const2_64( 0x0e0f0c0d0a0b0809, 0x0607040502030001, ) )
+   _mm256_shuffle_epi8( v, mm256_bcast_m128( _mm_set_epi64x( \
+                                0x0e0f0c0d0a0b0809, 0x0607040502030001 ) ) )

 // Source and destination are pointers, may point to same memory.
 // 8 byte qword * 8 qwords * 4 lanes = 256 bytes
 #define mm256_block_bswap_64( d, s ) do \
 { \
-  __m256i ctl = m256_const2_64( 0x08090a0b0c0d0e0f, 0x0001020304050607 ) ; \
+  __m256i ctl = mm256_bcast_m128( _mm_set_epi64x( 0x08090a0b0c0d0e0f, \
+                                                  0x0001020304050607 ) ); \
  casti_m256i( d, 0 ) = _mm256_shuffle_epi8( casti_m256i( s, 0 ), ctl ); \
  casti_m256i( d, 1 ) = _mm256_shuffle_epi8( casti_m256i( s, 1 ), ctl ); \
  casti_m256i( d, 2 ) = _mm256_shuffle_epi8( casti_m256i( s, 2 ), ctl ); \
@@ -486,7 +447,8 @@ static inline __m256i mm256_shuflr128_x8( const __m256i v, const int c )
 // 4 byte dword * 8 dwords * 8 lanes = 256 bytes
 #define mm256_block_bswap_32( d, s ) do \
 { \
-  __m256i ctl = m256_const2_64( 0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
+  __m256i ctl = mm256_bcast_m128( _mm_set_epi64x( 0x0c0d0e0f08090a0b, \
+                                                  0x0405060700010203 ) ); \
  casti_m256i( d, 0 ) = _mm256_shuffle_epi8( casti_m256i( s, 0 ), ctl ); \
  casti_m256i( d, 1 ) = _mm256_shuffle_epi8( casti_m256i( s, 1 ), ctl ); \
  casti_m256i( d, 2 ) = _mm256_shuffle_epi8( casti_m256i( s, 2 ), ctl ); \
@@ -497,13 +459,6 @@ static inline __m256i mm256_shuflr128_x8( const __m256i v, const int c )
  casti_m256i( d, 7 ) = _mm256_shuffle_epi8( casti_m256i( s, 7 ), ctl ); \
 } while(0)

-// swap 256 bit vectors in place.
-// This should be avoided, it's more efficient to switch references.
-#define mm256_swap512_256( v1, v2 ) \
-   v1 = _mm256_xor_si256( v1, v2 ); \
-   v2 = _mm256_xor_si256( v1, v2 ); \
-   v1 = _mm256_xor_si256( v1, v2 );
-
 #endif // __AVX2__
 #endif // SIMD_256_H__

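Note on the byte shuffle control constants used by the shuflr64_24, shuflr64_16, swap32_16 and shuflr32_8 macros above: when AVX512VL is not available, a rotation by a multiple of 8 bits is done as a byte shuffle, where control byte i names the source byte that lands in destination byte i. The scalar C sketch below is not part of the commit. It models one 64 bit lane (the low qword of each control constant) and checks that the shuffle gives the same result as the rotate used in the AVX512VL branch. The names shuffle_bytes_64, ror_64, ror_2x32 and shuffle_ctl_selftest are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of a byte shuffle within one 64 bit lane: destination byte i
// is taken from source byte ( control byte i ) & 7. Only the index selection
// is modelled, not the zeroing done when bit 7 of a control byte is set.
static uint64_t shuffle_bytes_64( uint64_t v, uint64_t ctl )
{
   uint64_t r = 0;
   for ( int i = 0; i < 8; i++ )
   {
      const unsigned idx = ( ctl >> ( 8*i ) ) & 7;
      r |= ( ( v >> ( 8*idx ) ) & 0xff ) << ( 8*i );
   }
   return r;
}

// Rotate right a 64 bit value, valid for 0 < c < 64.
static uint64_t ror_64( uint64_t v, unsigned c )
{  return ( v >> c ) | ( v << ( 64 - c ) ); }

// Rotate right each 32 bit half independently, valid for 0 < c < 32.
static uint64_t ror_2x32( uint64_t v, unsigned c )
{
   const uint32_t lo = (uint32_t)v, hi = (uint32_t)( v >> 32 );
   const uint32_t rlo = ( lo >> c ) | ( lo << ( 32 - c ) );
   const uint32_t rhi = ( hi >> c ) | ( hi << ( 32 - c ) );
   return ( (uint64_t)rhi << 32 ) | rlo;
}

// Asserts that the low qword of each control constant in the diff performs
// the rotation its macro name implies.
static void shuffle_ctl_selftest( void )
{
   const uint64_t v = 0x0123456789abcdefULL;
   assert( shuffle_bytes_64( v, 0x0201000706050403ULL ) == ror_64( v, 24 ) );
   assert( shuffle_bytes_64( v, 0x0100070605040302ULL ) == ror_64( v, 16 ) );
   assert( shuffle_bytes_64( v, 0x0504070601000302ULL ) == ror_2x32( v, 16 ) );
   assert( shuffle_bytes_64( v, 0x0407060500030201ULL ) == ror_2x32( v, 8 ) );
}
```

The high qword of each control constant follows the same pattern with indexes 8 to 15, which is why the 256 bit macros can broadcast a single 128 bit control vector to both lanes.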
--- a/simd-utils/simd-512.h
+++ b/simd-utils/simd-512.h
@@ -32,25 +32,26 @@
 //    "_mm512_permutex_epi64" only shuffles within 256 bit lanes. All other
 //    AVX512 permutes can cross all lanes.
 //
-//    "_mm512_shuffle_epi8" shuffles accross the entire 512 bits. Shuffle
-//    instructions generally don't cross 128 bit lane boundaries and the AVX2
-//    version of this specific instruction does not.
-//
 //    New alignr instructions for epi64 and epi32 operate across the entire
 //    vector but slower than epi8 which continues to be restricted to 128 bit
 //    lanes.
 //
+//    "vpbroadcastq/d/w/b" instructions now support an integer register source
+//    argument in addition to an XMM register or memory location. The set1
+//    intrinsic takes an integer arg, the broadcast intrinsic requires xmm.
+//    Masked versions of 256 and 128 bit broadcast also inherit this addition.
+//
 //    "_mm512_permutexvar_epi8" and "_mm512_permutex2var_epi8" require
 //    AVX512-VBMI. The same instructions with larger elements don't have this
-//    requirement. "_mm512_permutexvar_epi8" also performs the same operation
-//    as "_mm512_shuffle_epi8" which only requires AVX512-BW.
+//    requirement.
 //
 //    Two coding conventions are used to prevent macro argument side effects:
 //      - if a macro arg is used in an expression it must be protected by
-//        parentheses to ensure an expression argument is evaluated first.
+//        parentheses to ensure the expression argument is evaluated first.
 //      - if an argument is to referenced multiple times a C inline function
 //        should be used instead of a macro to prevent an expression argument
-//        from being evaluated multiple times.
+//        from being evaluated multiple times (wasteful) or producing side
+//        effects (very bad).
 //
 //    There are 2 areas where overhead is a major concern: constants and
 //    permutations.
@@ -87,7 +88,7 @@
 // __AVX512VBMI__  __AVX512VAES__
 //

-// Used instead if casting.
+// Used instead of casting.
 typedef union
 {
   __m512i m512;
@@ -96,105 +97,36 @@ typedef union
   uint64_t u64[8];
 } __attribute__ ((aligned (64))) m512_ovly;

-// Move integer to/from element 0 of vector.
-
-#define mm512_mov64_512( n ) _mm512_castsi128_si512( mm128_mov64_128( n ) )
-#define mm512_mov32_512( n ) _mm512_castsi128_si512( mm128_mov32_128( n ) )
-
-#define u64_mov512_64( a ) u64_mov128_64( _mm256_castsi512_si128( a ) )
-#define u32_mov512_32( a ) u32_mov128_32( _mm256_castsi512_si128( a ) )
-
 // A simple 128 bit permute, using function instead of macro avoids
 // problems if the v arg passed as an expression.
 static inline __m512i mm512_perm_128( const __m512i v, const int c )
 {  return _mm512_shuffle_i64x2( v, v, c ); }

-// Concatenate two 256 bit vectors into one 512 bit vector {hi, lo}
-#define mm512_concat_256( hi, lo ) \
-   _mm512_inserti64x4( _mm512_castsi256_si512( lo ), hi, 1 )
+// Broadcast 128 bit vector to all lanes of 512 bit vector.
+#define mm512_bcast_m128( v )  mm512_perm_128( _mm512_castsi128_si512( v ), 0 )

-#define m512_const_128( v3, v2, v1, v0 ) \
-   mm512_concat_256( mm256_concat_128( v3, v2 ), \
-                     mm256_concat_128( v1, v0 ) )
+// Set either the low or high 64 bit elements in 128 bit lanes, other elements
+// are set to zero.
+#define mm512_bcast128lo_64( i64 )     _mm512_maskz_set1_epi64( 0x55, i64 )
+#define mm512_bcast128hi_64( i64 )     _mm512_maskz_set1_epi64( 0xaa, i64 )

-// Equivalent of set, assign 64 bit integers to respective 64 bit elements.
-// Use stack memory overlay
-static inline __m512i m512_const_64( const uint64_t i7, const uint64_t i6,
-                                     const uint64_t i5, const uint64_t i4,
-                                     const uint64_t i3, const uint64_t i2,
-                                     const uint64_t i1, const uint64_t i0 )
-{
-  union { __m512i m512i;
-          uint64_t u64[8]; } v;   
-  v.u64[0] = i0;     v.u64[1] = i1;
-  v.u64[2] = i2;     v.u64[3] = i3;
-  v.u64[4] = i4;     v.u64[5] = i5;
-  v.u64[6] = i6;     v.u64[7] = i7;
-  return v.m512i;
-}
+#define mm512_set2_64( i1, i0 ) \
+   mm512_bcast_m128( _mm_set_epi64x( i1, i0 ) )

-// Equivalent of set1, broadcast lo element to all elements.
-static inline __m512i m512_const1_256( const __m256i v )
-{ return _mm512_inserti64x4( _mm512_castsi256_si512( v ), v, 1 ); }  
-
-#define m512_const1_128( v ) \
-    mm512_perm_128( _mm512_castsi128_si512( v ), 0 )
-// Integer input argument up to 64 bits
-#define m512_const1_i128( i ) \
-    mm512_perm_128( _mm512_castsi128_si512( mm128_mov64_128( i ) ), 0 )
-
-//#define m512_const1_256( v )   _mm512_broadcast_i64x4( v )
-//#define m512_const1_128( v )   _mm512_broadcast_i64x2( v )
-#define m512_const1_64( i )    _mm512_broadcastq_epi64( mm128_mov64_128( i ) )
-#define m512_const1_32( i )    _mm512_broadcastd_epi32( mm128_mov32_128( i ) )
-#define m512_const1_16( i )    _mm512_broadcastw_epi16( mm128_mov32_128( i ) )
-#define m512_const1_8( i )     _mm512_broadcastb_epi8 ( mm128_mov32_128( i ) )
-
-#define m512_const2_128( v1, v0 ) \
-   m512_const1_256( _mm512_inserti64x2( _mm512_castsi128_si512( v0 ), v1, 1 ) )
-
-#define m512_const2_64( i1, i0 ) \
-   m512_const1_128( m128_const_64( i1, i0 ) )
-
-
-static inline __m512i m512_const4_64( const uint64_t i3, const uint64_t i2,
-                                      const uint64_t i1, const uint64_t i0 )
-{
-  union  {  __m512i m512i;
-            uint64_t u64[8];   } v;
-  v.u64[0] = v.u64[4] = i0;
-  v.u64[1] = v.u64[5] = i1;
-  v.u64[2] = v.u64[6] = i2;
-  v.u64[3] = v.u64[7] = i3;
-  return v.m512i;
-}
-
-//
 // Pseudo constants.
-
-// _mm512_setzero_si512 uses xor instruction. If needed frequently
-// in a function is it better to define a register variable (const?)
-// initialized to zero.
-
 #define m512_zero       _mm512_setzero_si512()
-#define m512_one_512    mm512_mov64_512( 1 )
-#define m512_one_256    _mm512_inserti64x4( m512_one_512, m256_one_256, 1 )  
-#define m512_one_128    m512_const1_i128( 1 )
-#define m512_one_64     m512_const1_64( 1 )
-#define m512_one_32     m512_const1_32( 1 )
-#define m512_one_16     m512_const1_16( 1 )
-#define m512_one_8      m512_const1_8( 1 )
+// Deprecated
+#define m512_one_64     _mm512_set1_epi64( 1 )
+#define m512_one_32     _mm512_set1_epi32( 1 )

 // use asm to avoid compiler warning for unitialized local
 static inline __m512i mm512_neg1_fn()
 {
-   __m512i a;
-   asm( "vpternlogq $0xff, %0, %0, %0\n\t" : "=x"(a) );
-   return a;
+   __m512i v;
+   asm( "vpternlogq $0xff, %0, %0, %0\n\t" : "=x"(v) );
+   return v;
 }
-#define m512_neg1 mm512_neg1_fn()                          // 1 clock
-//#define m512_neg1 m512_const1_64( 0xffffffffffffffff )   // 5 clocks
-//#define m512_neg1 _mm512_movm_epi64( 0xff )              // 2 clocks
+#define m512_neg1 mm512_neg1_fn()    

 //
 // Basic operations without SIMD equivalent
@@ -203,13 +135,6 @@ static inline __m512i mm512_neg1_fn()
 static inline __m512i mm512_not( const __m512i x )
 {  return _mm512_ternarylogic_epi64( x, x, x, 1 ); }

-/*
-// Unary negation: -x
-#define mm512_negate_64( x ) _mm512_sub_epi64( m512_zero, x )
-#define mm512_negate_32( x ) _mm512_sub_epi32( m512_zero, x )  
-#define mm512_negate_16( x ) _mm512_sub_epi16( m512_zero, x )  
-*/
-
 //
 // Pointer casting

@@ -251,12 +176,6 @@ static inline void memcpy_512( __m512i *dst, const __m512i *src, const int n )
 #define mm512_add4_32( a, b, c, d ) \
   _mm512_add_epi32( _mm512_add_epi32( a, b ), _mm512_add_epi32( c, d ) )

-#define mm512_add4_16( a, b, c, d ) \
-   _mm512_add_epi16( _mm512_add_epi16( a, b ), _mm512_add_epi16( c, d ) )
-
-#define mm512_add4_8( a, b, c, d ) \
-   _mm512_add_epi8( _mm512_add_epi8( a, b ), _mm512_add_epi8( c, d ) )
-
 //
 // Ternary logic uses 8 bit truth table to define any 3 input logical
 // expression using any number or combinations of AND, OR, XOR, NOT.
@@ -319,34 +238,23 @@ static inline void memcpy_512( __m512i *dst, const __m512i *src, const int n )
 // Reverse byte order of packed elements, vectorized endian conversion.

 #define mm512_bswap_64( v ) \
-   _mm512_shuffle_epi8( v, \
-               m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637, \
-                              0x28292a2b2c2d2e2f, 0x2021222324252627, \
-                              0x18191a1b1c1d1e1f, 0x1011121314151617, \
-                              0x08090a0b0c0d0e0f, 0x0001020304050607 ) )
+   _mm512_shuffle_epi8( v, mm512_bcast_m128( _mm_set_epi64x( \
+                              0x08090a0b0c0d0e0f, 0x0001020304050607 ) ) )

 #define mm512_bswap_32( v ) \
-   _mm512_shuffle_epi8( v, \
-               m512_const_64( 0x3c3d3e3f38393a3b, 0x3435363730313233, \
-                              0x2c2d2e2f28292a2b, 0x2425262720212223, \
-                              0x1c1d1e1f18191a1b, 0x1415161710111213, \
-                              0x0c0d0e0f08090a0b, 0x0405060700010203 ) )
+   _mm512_shuffle_epi8( v, mm512_bcast_m128( _mm_set_epi64x( \
+                              0x0c0d0e0f08090a0b, 0x0405060700010203 ) ) )

 #define mm512_bswap_16( v ) \
-   _mm512_shuffle_epi8( v, \
-               m512_const_64( 0x3e3f3c3d3a3b3839, 0x3637343532333031, \
-                              0x2e2f2c2d2a2b2829, 0x2627242522232021, \
-                              0x1e1f1c1d1a1b1819, 0x1617141512131011, \
-                              0x0e0f0c0d0a0b0809, 0x0607040502030001 ) )
+   _mm512_shuffle_epi8( v, mm512_bcast_m128( _mm_set_epi64x( \
+                              0x0e0f0c0d0a0b0809, 0x0607040502030001 ) ) )

 // Source and destination are pointers, may point to same memory.
 // 8 lanes of 64 bytes each
 #define mm512_block_bswap_64( d, s ) do \
 { \
-  __m512i ctl = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637, \
-                               0x28292a2b2c2d2e2f, 0x2021222324252627, \
-                               0x18191a1b1c1d1e1f, 0x1011121314151617, \
-                               0x08090a0b0c0d0e0f, 0x0001020304050607  ); \
+  const __m512i ctl = mm512_bcast_m128( _mm_set_epi64x( \
+                                0x08090a0b0c0d0e0f, 0x0001020304050607 ) ); \
  casti_m512i( d, 0 ) = _mm512_shuffle_epi8( casti_m512i( s, 0 ), ctl ); \
  casti_m512i( d, 1 ) = _mm512_shuffle_epi8( casti_m512i( s, 1 ), ctl ); \
  casti_m512i( d, 2 ) = _mm512_shuffle_epi8( casti_m512i( s, 2 ), ctl ); \
@@ -360,10 +268,8 @@ static inline void memcpy_512( __m512i *dst, const __m512i *src, const int n )
 // 16 lanes of 32 bytes each
 #define mm512_block_bswap_32( d, s ) do \
 { \
-  __m512i ctl = m512_const_64( 0x3c3d3e3f38393a3b, 0x3435363730313233, \
-                               0x2c2d2e2f28292a2b, 0x2425262720212223, \
-                               0x1c1d1e1f18191a1b, 0x1415161710111213, \
-                               0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
+  const __m512i ctl = mm512_bcast_m128( _mm_set_epi64x( \
+                                0x0c0d0e0f08090a0b, 0x0405060700010203 ) ); \
  casti_m512i( d, 0 ) = _mm512_shuffle_epi8( casti_m512i( s, 0 ), ctl ); \
  casti_m512i( d, 1 ) = _mm512_shuffle_epi8( casti_m512i( s, 1 ), ctl ); \
  casti_m512i( d, 2 ) = _mm512_shuffle_epi8( casti_m512i( s, 2 ), ctl ); \
@@ -381,8 +287,8 @@ static inline void memcpy_512( __m512i *dst, const __m512i *src, const int n )
 // Rotate elements across entire vector.
 static inline __m512i mm512_swap_256( const __m512i v )
 { return _mm512_alignr_epi64( v, v, 4 ); }
-#define mm512_shuflr_256( v ) mm512_swap_256
-#define mm512_shufll_256( v ) mm512_swap_256
+#define mm512_shuflr_256   mm512_swap_256
+#define mm512_shufll_256   mm512_swap_256

 static inline __m512i mm512_shuflr_128( const __m512i v )
 { return _mm512_alignr_epi64( v, v, 2 ); }
@@ -390,6 +296,7 @@ static inline __m512i mm512_shuflr_128( const __m512i v )
 static inline __m512i mm512_shufll_128( const __m512i v )
 { return _mm512_alignr_epi64( v, v, 6 ); }

+/* Not used
 static inline __m512i mm512_shuflr_64( const __m512i v )
 { return _mm512_alignr_epi64( v, v, 1 ); }

@@ -401,7 +308,9 @@ static inline __m512i mm512_shuflr_32( const __m512i v )

 static inline __m512i mm512_shufll_32( const __m512i v )
 { return _mm512_alignr_epi32( v, v, 15 ); }
+*/

+/* Not used
 // Generic
 static inline __m512i mm512_shuflr_x64( const __m512i v, const int n )
 { return _mm512_alignr_epi64( v, v, n ); }
@@ -410,34 +319,20 @@ static inline __m512i mm512_shuflr_x32( const __m512i v, const int n )
 { return _mm512_alignr_epi32( v, v, n ); }

 #define mm512_shuflr_16( v ) \
-   _mm512_permutexvar_epi16( m512_const_64( \
+   _mm512_permutexvar_epi16( _mm512_set_epi64( \
                       0x0000001F001E001D, 0x001C001B001A0019, \
                       0x0018001700160015, 0x0014001300120011, \
                       0x0010000F000E000D, 0x000C000B000A0009, \
                       0x0008000700060005, 0x0004000300020001 ), v )

 #define mm512_shufll_16( v ) \
-   _mm512_permutexvar_epi16( m512_const_64( \
+   _mm512_permutexvar_epi16( _mm512_set_epi64( \
                       0x001E001D001C001B, 0x001A001900180017, \
                       0x0016001500140013, 0x001200110010000F, \
                       0x000E000D000C000B, 0x000A000900080007, \
                       0x0006000500040003, 0x000200010000001F ), v )
+*/

-#define mm512_shuflr_8( v ) \
-   _mm512_shuffle_epi8( v, m512_const_64( \
-                       0x003F3E3D3C3B3A39, 0x3837363534333231, \
-                       0x302F2E2D2C2B2A29, 0x2827262524232221, \
-                       0x201F1E1D1C1B1A19. 0x1817161514131211, \
-                       0x100F0E0D0C0B0A09, 0x0807060504030201 ) )
-
-#define mm512_shufll_8( v ) \
-   _mm512_shuffle_epi8( v, m512_const_64( \
-                       0x3E3D3C3B3A393837, 0x363534333231302F. \
-                       0x2E2D2C2B2A292827, 0x262524232221201F, \
-                       0x1E1D1C1B1A191817, 0x161514131211100F, \
-                       0x0E0D0C0B0A090807, 0x060504030201003F ) )
-
-// 256 bit lanes used only by lyra2, move these there
 // Rotate elements within 256 bit lanes of 512 bit vector.

 // Swap hi & lo 128 bits in each 256 bit lane
@@ -449,54 +344,69 @@ static inline __m512i mm512_shuflr_x32( const __m512i v, const int n )
 #define mm512_shuflr256_64( v )     _mm512_permutex_epi64( v, 0x39 )
 #define mm512_shufll256_64( v )     _mm512_permutex_epi64( v, 0x93 )

-/*
+/*  Not used
 // Rotate 256 bit lanes by one 32 bit element
 #define mm512_shuflr256_32( v ) \
-   _mm512_permutexvar_epi32( m512_const_64( \
+   _mm512_permutexvar_epi32( _mm512_set_epi64( \
                      0x000000080000000f, 0x0000000e0000000d, \
                      0x0000000c0000000b, 0x0000000a00000009, \
                      0x0000000000000007, 0x0000000600000005, \
                      0x0000000400000003, 0x0000000200000001 ), v )

 #define mm512_shufll256_32( v ) \
-   _mm512_permutexvar_epi32( m512_const_64( \
+   _mm512_permutexvar_epi32( _mm512_set_epi64( \
                      0x0000000e0000000d, 0x0000000c0000000b, \
                      0x0000000a00000009, 0x000000080000000f, \
                      0x0000000600000005, 0x0000000400000003, \
                      0x0000000200000001, 0x0000000000000007 ), v )

 #define mm512_shuflr256_16( v ) \
-   _mm512_permutexvar_epi16( m512_const_64( \
+   _mm512_permutexvar_epi16( _mm512_set_epi64( \
                     0x00100001001e001d, 0x001c001b001a0019, \
                     0x0018001700160015, 0x0014001300120011, \
                     0x0000000f000e000d, 0x000c000b000a0009, \
                     0x0008000700060005, 0x0004000300020001 ), v )

 #define mm512_shufll256_16( v ) \
-   _mm512_permutexvar_epi16( m512_const_64( \
+   _mm512_permutexvar_epi16( _mm512_set_epi64( \
                     0x001e001d001c001b, 0x001a001900180017, \
                     0x0016001500140013, 0x001200110010001f, \
                     0x000e000d000c000b, 0x000a000900080007, \
                     0x0006000500040003, 0x000200010000000f ), v )

 #define mm512_shuflr256_8( v ) \
-   _mm512_shuffle_epi8( v, m512_const_64( \
+   _mm512_shuffle_epi8( _mm512_set_epi64( \
                     0x203f3e3d3c3b3a39, 0x3837363534333231, \
                     0x302f2e2d2c2b2a29, 0x2827262524232221, \
                     0x001f1e1d1c1b1a19, 0x1817161514131211, \
-                     0x100f0e0d0c0b0a09, 0x0807060504030201 ) )
+                     0x100f0e0d0c0b0a09, 0x0807060504030201 ), v )

 #define mm512_shufll256_8( v ) \
-   _mm512_shuffle_epi8( v, m512_const_64( \
+   _mm512_shuffle_epi8( _mm512_set_epi64( \
                     0x3e3d3c3b3a393837, 0x363534333231302f, \
                     0x2e2d2c2b2a292827, 0x262524232221203f, \
                     0x1e1d1c1b1a191817, 0x161514131211100f, \
-                     0x0e0d0c0b0a090807, 0x060504030201001f ) )
+                     0x0e0d0c0b0a090807, 0x060504030201001f ), v )
 */
+
 //
 // Shuffle/rotate elements within 128 bit lanes of 512 bit vector.
 
-// Limited 2 input, 1 output shuffle, combines shuffle with blend.
+#define mm512_swap128_64( v )   _mm512_shuffle_epi32( v, 0x4e )
+#define mm512_shuflr128_64      mm512_swap128_64
+#define mm512_shufll128_64      mm512_swap128_64
+
+// Rotate 128 bit lanes by one 32 bit element
+#define mm512_shuflr128_32( v )    _mm512_shuffle_epi32( v, 0x39 )
+#define mm512_shufll128_32( v )    _mm512_shuffle_epi32( v, 0x93 )
+
+/* Not used
+// Rotate 128 bit lanes right by c bytes, versatile and just as fast
+static inline __m512i mm512_shuflr128_x8( const __m512i v, const int c )
+{  return _mm512_alignr_epi8( v, v, c ); }
+*/
+
+// Limited 2 input shuffle, combines shuffle with blend.
 // Like most shuffles it's limited to 128 bit lanes and like some shuffles
 // destination elements must come from a specific source arg. 
 #define mm512_shuffle2_64( v1, v2, c ) \
@@ -507,26 +417,12 @@ static inline __m512i mm512_shuflr_x32( const __m512i v, const int n )
   _mm512_castps_si512( _mm512_shuffle_ps( _mm512_castsi512_ps( v1 ), \
                                           _mm512_castsi512_ps( v2 ), c ) ); 

-// Swap 64 bits in each 128 bit lane
-#define mm512_swap128_64( v )   _mm512_shuffle_epi32( v, 0x4e )
-#define mm512_shuflr128_64  mm512_swap128_64
-#define mm512_shufll128_64  mm512_swap128_64
-
-// Rotate 128 bit lanes by one 32 bit element
-#define mm512_shuflr128_32( v )    _mm512_shuffle_epi32( v, 0x39 )
-#define mm512_shufll128_32( v )    _mm512_shuffle_epi32( v, 0x93 )
-
-// Rotate right 128 bit lanes by c bytes, versatile and just as fast
-static inline __m512i mm512_shuflr128_8( const __m512i v, const int c )
-{  return _mm512_alignr_epi8( v, v, c ); }
-
-// Rotate byte elements in each 64 or 32 bit lane. Redundant for AVX512, all
-// can be done with ror & rol. Defined only for convenience and consistency
-// with AVX2 & SSE2 macros.
+// 64 bit lanes
+// Not really necessary with AVX512, included for consistency with AVX2/SSE.

 #define mm512_swap64_32( v )    _mm512_shuffle_epi32( v, 0xb1 )
-#define mm512_shuflr64_32 mm512_swap64_32
-#define mm512_shufll64_32 mm512_swap64_32
+#define mm512_shuflr64_32       mm512_swap64_32
+#define mm512_shufll64_32       mm512_swap64_32

 #define mm512_shuflr64_24( v )  _mm512_ror_epi64( v, 24 )
 #define mm512_shufll64_24( v )  _mm512_rol_epi64( v, 24 )
@@ -537,12 +433,16 @@ static inline __m512i mm512_shuflr128_8( const __m512i v, const int c )
 #define mm512_shuflr64_8(  v )  _mm512_ror_epi64( v,  8 )
 #define mm512_shufll64_8(  v )  _mm512_rol_epi64( v,  8 )

-#define mm512_swap32_16(   v )  _mm512_ror_epi32( v, 16 )
-#define mm512_shuflr32_16 mm512_swap32_16
-#define mm512_shufll32_16 mm512_swap32_16
+/* Not used
+// 32 bit lanes

-#define mm512_shuflr32_8(  v )  _mm512_ror_epi32( v,  8 )
-#define mm512_shufll32_8(  v )  _mm512_rol_epi32( v,  8 )
+#define mm512_swap32_16( v )    _mm512_ror_epi32( v, 16 )
+#define mm512_shuflr32_16       mm512_swap32_16
+#define mm512_shufll32_16       mm512_swap32_16
+
+#define mm512_shuflr32_8( v )   _mm512_ror_epi32( v,  8 )
+#define mm512_shufll32_8( v )   _mm512_rol_epi32( v,  8 )
+*/

 #endif // AVX512
 #endif // SIMD_512_H__
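
The shortened byte-swap controls in the simd-512.h changes above depend on the
byte shuffle acting independently inside each 128 bit lane, so a single 16 byte
pattern broadcast to all four lanes is sufficient. Below is a minimal scalar
sketch of that behaviour, assuming the per-lane semantics documented for the
instruction. The names model_shuffle_epi8, model_bcast_ctl and model_bswap_64
are hypothetical, chosen only for illustration, and are not part of
cpuminer-opt.

```c
#include <stdint.h>
#include <string.h>

// Scalar model of a 512 bit byte shuffle: every 16 byte lane is shuffled
// independently and only the low 4 bits of a control byte select the source.
static void model_shuffle_epi8( uint8_t dst[64], const uint8_t src[64],
                                const uint8_t ctl[64] )
{
   uint8_t tmp[64];
   for ( int i = 0; i < 64; i++ )
   {
      const int lane_base = i & ~15;
      tmp[i] = ( ctl[i] & 0x80 ) ? 0 : src[ lane_base + ( ctl[i] & 0x0f ) ];
   }
   memcpy( dst, tmp, 64 );
}

// Repeat one 16 byte control pattern in all four lanes.
static void model_bcast_ctl( uint8_t ctl[64], const uint8_t lane_ctl[16] )
{
   for ( int lane = 0; lane < 4; lane++ )
      memcpy( ctl + 16 * lane, lane_ctl, 16 );
}

// Bytes of _mm_set_epi64x( 0x08090a0b0c0d0e0f, 0x0001020304050607 )
// in memory order, lowest byte first.
static const uint8_t bswap64_lane_ctl[16] =
   { 7, 6, 5, 4, 3, 2, 1, 0,  15, 14, 13, 12, 11, 10, 9, 8 };

// Reverse the byte order of each of the eight 64 bit elements in v.
static void model_bswap_64( uint64_t v[8] )
{
   uint8_t ctl[64], bytes[64];
   model_bcast_ctl( ctl, bswap64_lane_ctl );
   memcpy( bytes, v, 64 );
   model_shuffle_epi8( bytes, bytes, ctl );
   memcpy( v, bytes, 64 );
}
```

Because no control index ever needs to reach outside its own lane, the old
eight-constant 512 bit control and the broadcast 128 bit control give the same
result in this model.
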
--- a/simd-utils/simd-int.h
+++ b/simd-utils/simd-int.h
@@ -55,6 +55,13 @@
 typedef          __int128  int128_t;
 typedef unsigned __int128 uint128_t;

+typedef union
+{
+   uint128_t u128;
+   uint64_t  u64[2];
+   uint32_t  u32[4];
+} __attribute__ ((aligned (16))) u128_ovly;
+
 // Extracting the low bits is a trivial cast.
 // These specialized functions are optimized while providing a
 // consistent interface.
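
The u128_ovly union added above lets the same 16 bytes be viewed as one 128 bit
integer, two 64 bit words or four 32 bit words. A short sketch of how such an
overlay might be used follows, assuming a GCC or Clang target that provides
__int128 and a little-endian byte order (as on x86_64). The helpers u128_lo64
and u128_hi64 are hypothetical and are not functions from simd-int.h.

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;

typedef union
{
   uint128_t u128;
   uint64_t  u64[2];
   uint32_t  u32[4];
} __attribute__ ((aligned (16))) u128_ovly;

// On a little-endian target u64[0] overlays the low half of u128
// and u64[1] overlays the high half.
static inline uint64_t u128_lo64( const uint128_t x )
{
   u128_ovly v;
   v.u128 = x;
   return v.u64[0];
}

static inline uint64_t u128_hi64( const uint128_t x )
{
   u128_ovly v;
   v.u128 = x;
   return v.u64[1];
}
```
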
--- a/sysinfos.c
+++ b/sysinfos.c
@@ -174,35 +174,147 @@ static inline int cpu_fanpercent()
 	return 0;
 }

+
+// CPUID
+
+// This list is incomplete; it contains only features of interest to cpuminer.
+// Refer to http://en.wikipedia.org/wiki/CPUID for details.
+
+// AVX10 compatibility notes
+//
+// Notation used: AVX10.[version]_[vectorwidth]
+// AVX10.1_512 is a rebranding of AVX512 and is effectively the AVX* superset
+// with full 512 bit vector support.
+// AVX10.2_256 is effectively AVX2 + AVX512_VL, all AVX512 instructions and
+// features applied only to 256 bit and 128 bit vectors.
+// Future AVX10 versions will add new instructions and features.
+
+// Register array indexes
+#define EAX_Reg  (0)
+#define EBX_Reg  (1)
+#define ECX_Reg  (2)
+#define EDX_Reg  (3)
+
+// CPUID function number, aka leaf (EAX)
+#define VENDOR_ID            (0)
+#define CPU_INFO             (1)
+#define CACHE_TLB_DESCRIPTOR (2)
+#define EXTENDED_FEATURES    (7)
+#define AVX10_FEATURES       (0x24)
+#define HIGHEST_EXT_FUNCTION (0x80000000)
+#define EXTENDED_CPU_INFO    (0x80000001)
+#define CPU_BRAND_1          (0x80000002)
+#define CPU_BRAND_2          (0x80000003)
+#define CPU_BRAND_3          (0x80000004)
+
+// CPU_INFO: EAX=1, ECX=0
+// ECX
+#define SSE3_Flag                 1    
+#define SSSE3_Flag               (1<< 9)
+#define XOP_Flag                 (1<<11)   // obsolete
+#define FMA3_Flag                (1<<12)
+#define SSE41_Flag               (1<<19)
+#define SSE42_Flag               (1<<20)
+#define AES_NI_Flag              (1<<25)
+#define XSAVE_Flag               (1<<26) 
+#define OSXSAVE_Flag             (1<<27)
+#define AVX_Flag                 (1<<28)
+// EDX
+#define MMX_Flag                 (1<<23)
+#define SSE_Flag                 (1<<25)
+#define SSE2_Flag                (1<<26) 
+
+// EXTENDED_FEATURES subleaf 0: EAX=7, ECX=0
+// EBX
+#define AVX2_Flag                (1<< 5)
+#define AVX512_F_Flag            (1<<16)
+#define AVX512_DQ_Flag           (1<<17)
+#define AVX512_IFMA_Flag         (1<<21)
+#define AVX512_PF_Flag           (1<<26)
+#define AVX512_ER_Flag           (1<<27)
+#define AVX512_CD_Flag           (1<<28)
+#define SHA_Flag                 (1<<29)
+#define AVX512_BW_Flag           (1<<30)
+#define AVX512_VL_Flag           (1<<31)
+// ECX
+#define AVX512_VBMI_Flag         (1<< 1) 
+#define AVX512_VBMI2_Flag        (1<< 6)
+#define VAES_Flag                (1<< 9)
+#define AVX512_VNNI_Flag         (1<<11)
+#define AVX512_BITALG_Flag       (1<<12)
+#define AVX512_VPOPCNTDQ_Flag    (1<<14)
+// EDX
+#define AVX512_4VNNIW_Flag       (1<< 2)
+#define AVX512_4FMAPS_Flag       (1<< 3)
+#define AVX512_VP2INTERSECT_Flag (1<< 8)
+#define AMX_BF16_Flag            (1<<22)
+#define AVX512_FP16_Flag         (1<<23)
+#define AMX_TILE_Flag            (1<<24)
+#define AMX_INT8_Flag            (1<<25)
+
+// EXTENDED_FEATURES subleaf 1: EAX=7, ECX=1
+// EAX
+#define SHA512_Flag               1
+#define SM3_Flag                 (1<< 1)
+#define SM4_Flag                 (1<< 2)
+#define AVX_VNNI_Flag            (1<< 4)
+#define AVX512_BF16_Flag         (1<< 5)
+#define AMX_FP16_Flag            (1<<21)
+#define AVX_IFMA_Flag            (1<<23)
+// EDX
+#define AVX_VNNI_INT8_Flag       (1<< 4)
+#define AVX_NE_CONVERT_Flag      (1<< 5)
+#define AMX_COMPLEX_Flag         (1<< 8)
+#define AVX_VNNI_INT16_Flag      (1<<10)
+#define AVX10_Flag               (1<<19)
+#define APX_F_Flag               (1<<21)
+
+// AVX10_FEATURES: EAX=0x24, ECX=0
+// EBX
+#define AVX10_VERSION_mask        0xff      // bits [7:0]
+#define AVX10_128_Flag           (1<<16)
+#define AVX10_256_Flag           (1<<17)   
+#define AVX10_512_Flag           (1<<18)   
+
+// Use this to detect presence of feature
+#define AVX_mask     (AVX_Flag|XSAVE_Flag|OSXSAVE_Flag)
+#define FMA3_mask    (FMA3_Flag|AVX_mask)
+#define AVX512_mask  (AVX512_VL_Flag|AVX512_BW_Flag|AVX512_DQ_Flag|AVX512_F_Flag)
+
+
 #ifndef __arm__
-static inline void cpuid(int functionnumber, int output[4]) {
+static inline void cpuid( unsigned int leaf, unsigned int subleaf,
+                          unsigned int output[4] )
+{
 #if defined (_MSC_VER) || defined (__INTEL_COMPILER)
-	// Microsoft or Intel compiler, intrin.h included
-	__cpuidex(output, functionnumber, 0);
+   // Microsoft or Intel compiler, intrin.h included
+   __cpuidex(output, leaf, subleaf );
 #elif defined(__GNUC__) || defined(__clang__)
-	// use inline assembly, Gnu/AT&T syntax
-	int a, b, c, d;
-	asm volatile("cpuid" : "=a"(a), "=b"(b), "=c"(c), "=d"(d) : "a"(functionnumber), "c"(0));
-	output[0] = a;
-	output[1] = b;
-	output[2] = c;
-	output[3] = d;
+   // use inline assembly, Gnu/AT&T syntax
+   unsigned int a, b, c, d;
+   asm volatile( "cpuid"
+               : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
+               : "a"(leaf), "c"(subleaf) );
+   output[ EAX_Reg ] = a;
+   output[ EBX_Reg ] = b;
+   output[ ECX_Reg ] = c;
+   output[ EDX_Reg ] = d;
 #else
-	// unknown platform. try inline assembly with masm/intel syntax
-	__asm {
-		mov eax, functionnumber
-		xor ecx, ecx
-		cpuid;
-		mov esi, output
-		mov[esi], eax
-		mov[esi + 4], ebx
-		mov[esi + 8], ecx
-		mov[esi + 12], edx
-	}
+   // unknown platform. try inline assembly with masm/intel syntax
+   __asm {
+      mov eax, leaf
+      mov ecx, subleaf
+      cpuid;
+      mov esi, output
+      mov[esi], eax
+      mov[esi + 4], ebx
+      mov[esi + 8], ecx
+      mov[esi + 12], edx
+   }
 #endif
 }
 #else /* !__arm__ */
-#define cpuid(fn, out) out[0] = 0;
+#define cpuid(leaf, subleaf, out) out[0] = 0;
 #endif

 static inline void cpu_getname(char *outbuf, size_t maxsz)
@@ -211,13 +323,13 @@ static inline void cpu_getname(char *outbuf, size_t maxsz)
 #ifdef WIN32
   char brand[256] = { 0 };
   int output[4] = { 0 }, ext;
-   cpuid(0x80000000, output);
+   cpuid( 0x80000000, 0, output );
   ext = output[0];
   if (ext >= 0x80000004)
   {
      for (int i = 2; i <= (ext & 0xF); i++)
      {
-         cpuid(0x80000000+i, output);
+         cpuid( 0x80000000+i, 0, output);
         memcpy(&brand[(i-2) * 4*sizeof(int)], output, 4*sizeof(int));
      }
      snprintf(outbuf, maxsz, "%s", brand);
@@ -309,70 +421,97 @@ static inline void cpu_getmodelid(char *outbuf, size_t maxsz)
 #endif
 }
 
-// http://en.wikipedia.org/wiki/CPUID
+// Typical display format: AVX10.[version]_[vectorlength]. If the vector length
+// is omitted, 256 is the default.
+//    Ex: AVX10.1_512
+// Flags:
+// AVX10  128  256  512
+//   0     0    0    0    = AVX10 not supported
+//   1     1    1    0    = AVX10 256 bit max  (version 2)
+//   1     1    1    1    = AVX10 512 bit max  (version 1, Granite Rapids)
+// Other combinations are not defined.

-// CPUID commands
-#define VENDOR_ID            (0)
-#define CPU_INFO             (1)
-#define CACHE_TLB_DESCRIPTOR (2)
-#define EXTENDED_FEATURES    (7)
-#define HIGHEST_EXT_FUNCTION (0x80000000)
-#define EXTENDED_CPU_INFO    (0x80000001)
-#define CPU_BRAND_1          (0x80000002)
-#define CPU_BRAND_2          (0x80000003)
-#define CPU_BRAND_3          (0x80000004)
+// Test AVX10_Flag before reading the AVX10_FEATURES flags.
+static inline bool has_avx10()
+{
+#ifdef __arm__
+    return false;
+#else
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 1, cpu_info );
+    return cpu_info[ EDX_Reg ] & AVX10_Flag;
+#endif
+}

-// Registers
-#define EAX_Reg  (0)
-#define EBX_Reg  (1)
-#define ECX_Reg  (2)
-#define EDX_Reg  (3)
+static inline unsigned int avx10_version()
+{
+#ifdef __arm__
+    return 0;
+#else
+    if ( has_avx10() )
+    {
+       unsigned int cpu_info[4] = { 0 };
+       cpuid( AVX10_FEATURES, 0, cpu_info );
+       return cpu_info[ EBX_Reg ] & AVX10_VERSION_mask;
+    }
+    return 0;
+#endif
+}

-// Feature flags
+static inline bool has_avx10_512()
+{
+#ifdef __arm__
+    return false;
+#else
+    if ( has_avx10() )
+    {
+       unsigned int cpu_info[4] = { 0 };
+       cpuid( AVX10_FEATURES, 0, cpu_info );
+       return cpu_info[ EBX_Reg ] & AVX10_512_Flag;
+    }
+    return false;
+#endif
+}

-// CPU_INFO ECX
-#define SSE3_Flag      1    
-#define SSSE3_Flag    (1<< 9)
-#define XOP_Flag      (1<<11)   // obsolete, only available on pre-Ryzen AMD
-#define FMA3_Flag     (1<<12)
-#define AES_Flag      (1<<25)
-#define SSE41_Flag    (1<<19)
-#define SSE42_Flag    (1<<20)
-#define AES_Flag      (1<<25)
-#define XSAVE_Flag    (1<<26) 
-#define OSXSAVE_Flag  (1<<27)
-#define AVX_Flag      (1<<28)
+static inline bool has_avx10_256()
+{
+#ifdef __arm__
+    return false;
+#else
+    if ( has_avx10() )
+    {
+       unsigned int cpu_info[4] = { 0 };
+       cpuid( AVX10_FEATURES, 0, cpu_info );
+       return cpu_info[ EBX_Reg ] & AVX10_256_Flag;
+    }
+    return false;
+#endif
+}

-// CPU_INFO EDX
-#define SSE_Flag      (1<<25)
-#define SSE2_Flag     (1<<26) 
-
-// EXTENDED_FEATURES EBX
-#define AVX2_Flag     (1<< 5)
-#define AVX512F_Flag  (1<<16)
-#define AVX512DQ_Flag (1<<17)
-#define SHA_Flag      (1<<29)
-#define AVX512BW_Flag (1<<30)
-#define AVX512VL_Flag (1<<31)
-
-// EXTENDED_FEATURES ECX
-#define AVX512VBMI_Flag  (1<<1) 
-#define AVX512VBMI2_Flag (1<<6)
-#define VAES_Flag        (1<<9)
-
-
-// Use this to detect presence of feature
-#define AVX_mask     (AVX_Flag|XSAVE_Flag|OSXSAVE_Flag)
-#define FMA3_mask    (FMA3_Flag|AVX_mask)
-#define AVX512_mask  (AVX512VL_Flag|AVX512BW_Flag|AVX512DQ_Flag|AVX512F_Flag)
+// Maximum vector length
+static inline unsigned int avx10_vector_length()
+{
+#ifdef __arm__
+    return 0;
+#else
+    if ( has_avx10() )
+    {
+       unsigned int cpu_info[4] = { 0 };
+       cpuid( AVX10_FEATURES, 0, cpu_info );
+       return cpu_info[ EBX_Reg ] & AVX10_512_Flag ? 512
+          : ( cpu_info[ EBX_Reg ] & AVX10_256_Flag ? 256 : 0 );
+    }
+    return 0;
+#endif
+}    

 static inline bool has_sha()
 {
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
    return cpu_info[ EBX_Reg ] & SHA_Flag;
 #endif
 }
@@ -382,8 +521,8 @@ static inline bool has_sse2()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( CPU_INFO, cpu_info );
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( CPU_INFO, 0, cpu_info );
    return cpu_info[ EDX_Reg ] & SSE2_Flag;
 #endif
 }
@@ -394,9 +533,9 @@ static inline bool has_aes_ni()
 #ifdef __arm__
 	return false;
 #else
-	int cpu_info[4] = { 0 };
-        cpuid( CPU_INFO, cpu_info );
-	return cpu_info[ ECX_Reg ] & AES_Flag;
+	unsigned int cpu_info[4] = { 0 };
+        cpuid( CPU_INFO, 0, cpu_info );
+	return cpu_info[ ECX_Reg ] & AES_NI_Flag;
 #endif
 }

@@ -406,8 +545,8 @@ static inline bool has_avx()
 #ifdef __arm__
        return false;
 #else
-        int cpu_info[4] = { 0 };
-        cpuid( CPU_INFO, cpu_info );
+        unsigned int cpu_info[4] = { 0 };
+        cpuid( CPU_INFO, 0, cpu_info );
        return ( ( cpu_info[ ECX_Reg ] & AVX_mask ) == AVX_mask );
 #endif
 }
@@ -418,8 +557,8 @@ static inline bool has_avx2()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
    return cpu_info[ EBX_Reg ] & AVX2_Flag;
 #endif
 }
@@ -429,9 +568,9 @@ static inline bool has_avx512f()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
-    return cpu_info[ EBX_Reg ] & AVX512F_Flag;
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
+    return cpu_info[ EBX_Reg ] & AVX512_F_Flag;
 #endif
 }

@@ -440,9 +579,9 @@ static inline bool has_avx512dq()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
-    return cpu_info[ EBX_Reg ] & AVX512DQ_Flag;
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
+    return cpu_info[ EBX_Reg ] & AVX512_DQ_Flag;
 #endif
 }

@@ -451,9 +590,9 @@ static inline bool has_avx512bw()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
-    return cpu_info[ EBX_Reg ] & AVX512BW_Flag;
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
+    return cpu_info[ EBX_Reg ] & AVX512_BW_Flag;
 #endif
 }

@@ -462,9 +601,9 @@ static inline bool has_avx512vl()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
-    return cpu_info[ EBX_Reg ] & AVX512VL_Flag;
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
+    return cpu_info[ EBX_Reg ] & AVX512_VL_Flag;
 #endif
 }

@@ -474,30 +613,19 @@ static inline bool has_avx512()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
    return ( ( cpu_info[ EBX_Reg ] & AVX512_mask ) == AVX512_mask );
 #endif
 }

-// AMD Zen3 added support for 256 bit VAES without requiring AVX512.
-// The original Intel spec requires AVX512F to support 512 bit VAES and 
-// requires AVX512VL to support 256 bit VAES.
-// The CPUID VAES bit alone can't distiguish 256 vs 512 bit.
-// If necessary:
-// VAES 256 & 512 = VAES && AVX512VL
-// VAES 512 = VAES && AVX512F  
-// VAES 256 = ( VAES && AVX512VL ) || ( VAES && !AVX512F )
-// VAES 512 only = VAES && AVX512F && !AVX512VL
-// VAES 256 only = VAES && !AVX512F
-
 static inline bool has_vaes()
 {
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
    return cpu_info[ ECX_Reg ] & VAES_Flag;
 #endif
 }
@@ -507,9 +635,9 @@ static inline bool has_vbmi()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
-    return cpu_info[ ECX_Reg ] & AVX512VBMI_Flag;
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
+    return cpu_info[ ECX_Reg ] & AVX512_VBMI_Flag;
 #endif
 }

@@ -518,9 +646,9 @@ static inline bool has_vbmi2()
 #ifdef __arm__
    return false;
 #else
-    int cpu_info[4] = { 0 };
-    cpuid( EXTENDED_FEATURES, cpu_info );
-    return cpu_info[ ECX_Reg ] & AVX512VBMI2_Flag;
+    unsigned int cpu_info[4] = { 0 };
+    cpuid( EXTENDED_FEATURES, 0, cpu_info );
+    return cpu_info[ ECX_Reg ] & AVX512_VBMI2_Flag;
 #endif
 }

@@ -530,8 +658,8 @@ static inline bool has_xop()
 #ifdef __arm__
        return false;
 #else
-        int cpu_info[4] = { 0 };
-        cpuid( EXTENDED_CPU_INFO, cpu_info );
+        unsigned int cpu_info[4] = { 0 };
+        cpuid( EXTENDED_CPU_INFO, 0, cpu_info );
        return cpu_info[ ECX_Reg ] & XOP_Flag;
 #endif
 }
@@ -541,8 +669,8 @@ static inline bool has_fma3()
 #ifdef __arm__
        return false;
 #else
-        int cpu_info[4] = { 0 };
-        cpuid( CPU_INFO, cpu_info );
+        unsigned int cpu_info[4] = { 0 };
+        cpuid( CPU_INFO, 0, cpu_info );
        return ( ( cpu_info[ ECX_Reg ] & FMA3_mask ) == FMA3_mask );
 #endif
 }
@@ -552,8 +680,8 @@ static inline bool has_sse42()
 #ifdef __arm__
        return false;
 #else
-        int cpu_info[4] = { 0 };
-        cpuid( CPU_INFO, cpu_info );
+        unsigned int cpu_info[4] = { 0 };
+        cpuid( CPU_INFO, 0, cpu_info );
        return cpu_info[ ECX_Reg ] & SSE42_Flag;
 #endif
 }
@@ -563,16 +691,16 @@ static inline bool has_sse()
 #ifdef __arm__
        return false;
 #else
-        int cpu_info[4] = { 0 };
-        cpuid( CPU_INFO, cpu_info );
+        unsigned int cpu_info[4] = { 0 };
+        cpuid( CPU_INFO, 0, cpu_info );
        return cpu_info[ EDX_Reg ] & SSE_Flag;
 #endif
 }

 static inline uint32_t cpuid_get_highest_function_number()
 {
-  uint32_t cpu_info[4] = {0};
-  cpuid( VENDOR_ID, cpu_info);
+  unsigned int cpu_info[4] = {0};
+  cpuid( VENDOR_ID, 0, cpu_info);
  return cpu_info[ EAX_Reg ];
 }

@@ -605,8 +733,8 @@ static inline void cpu_bestfeature(char *outbuf, size_t maxsz)
 #else
 	int cpu_info[4] = { 0 };
 	int cpu_info_adv[4] = { 0 };
-	cpuid( CPU_INFO, cpu_info );
-	cpuid( EXTENDED_FEATURES, cpu_info_adv );
+	cpuid( CPU_INFO, 0, cpu_info );
+	cpuid( EXTENDED_FEATURES, 0, cpu_info_adv );

        if ( has_avx() && has_avx2() )
              sprintf(outbuf, "AVX2");
@@ -634,14 +762,14 @@ static inline void cpu_brand_string( char* s )
        sprintf( s, "ARM" );
 #else
    int cpu_info[4] = { 0 };
-    cpuid( VENDOR_ID, cpu_info );
+    cpuid( VENDOR_ID, 0, cpu_info );
    if ( cpu_info[ EAX_Reg ] >= 4 )
    {
-        cpuid( CPU_BRAND_1, cpu_info );
+        cpuid( CPU_BRAND_1, 0, cpu_info );
        memcpy( s, cpu_info, sizeof(cpu_info) );
-        cpuid( CPU_BRAND_2, cpu_info );
+        cpuid( CPU_BRAND_2, 0, cpu_info );
        memcpy( s + 16, cpu_info, sizeof(cpu_info) );
-        cpuid( CPU_BRAND_3, cpu_info );
+        cpuid( CPU_BRAND_3, 0, cpu_info );
        memcpy( s + 32, cpu_info, sizeof(cpu_info) );
    }
 #endif
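
The AVX10 helpers in the sysinfos.c changes above first test the AVX10 bit in
leaf 7 subleaf 1 (EDX) and only then read the version and vector width bits
from leaf 0x24 (EBX). A minimal sketch of that decoding, written as a pure
function over already-fetched register values so it can be exercised without
executing CPUID, could look like the following. The flag values are copied from
the definitions above; avx10_info, avx10_decode and avx10_format are
hypothetical names, not part of cpuminer-opt.

```c
#include <stdint.h>
#include <stdio.h>

// Bit positions copied from the definitions in the diff above.
#define AVX10_Flag          (1u<<19)   // leaf 7, subleaf 1, EDX
#define AVX10_VERSION_mask   0xffu     // leaf 0x24, EBX bits [7:0]
#define AVX10_256_Flag      (1u<<17)   // leaf 0x24, EBX
#define AVX10_512_Flag      (1u<<18)   // leaf 0x24, EBX

typedef struct
{
   unsigned int version;         // 0 when AVX10 is not supported
   unsigned int vector_length;   // 512, 256 or 0
} avx10_info;

// Decode AVX10 support from the two raw register values.
static avx10_info avx10_decode( uint32_t leaf7_sub1_edx, uint32_t leaf24_ebx )
{
   avx10_info info = { 0, 0 };
   if ( !( leaf7_sub1_edx & AVX10_Flag ) )
      return info;      // leaf 0x24 is not meaningful without the AVX10 bit
   info.version = leaf24_ebx & AVX10_VERSION_mask;
   info.vector_length = ( leaf24_ebx & AVX10_512_Flag ) ? 512
                      : ( leaf24_ebx & AVX10_256_Flag ) ? 256 : 0;
   return info;
}

// Format as AVX10.[version]_[vectorlength], or "none" when unsupported.
static int avx10_format( char *buf, size_t maxsz, avx10_info info )
{
   if ( info.version == 0 )
      return snprintf( buf, maxsz, "none" );
   return snprintf( buf, maxsz, "AVX10.%u_%u", info.version,
                    info.vector_length );
}
```

Keeping the decoding separate from the cpuid call is only one possible design.
It lets the bit logic be checked on any host, including non-x86 build machines.
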
--- a/util.c
+++ b/util.c
@@ -553,6 +553,7 @@ json_t *json_rpc_call(CURL *curl, const char *url,
 	long timeout = (flags & JSON_RPC_LONGPOLL) ? opt_timeout : 30;
 	struct header_info hi = {0};

+   all_data.headers = &hi;
 	/* it is assumed that 'curl' is freshly [re]initialized at this pt */

 	if (opt_protocol)  curl_easy_setopt(curl, CURLOPT_VERBOSE, 1);
@@ -2017,23 +2018,41 @@ static bool stratum_notify(struct stratum_ctx *sctx, json_t *params)
      }
   }

-   if ( merkle_count )
-      merkle = (uchar**) malloc( merkle_count * sizeof(char *) );
-	for ( i = 0; i < merkle_count; i++ )
-   {
-		const char *s = json_string_value( json_array_get( merkle_arr, i ) );
-		if ( !s || strlen(s) != 64 )
-      {
-			while ( i-- ) free( merkle[i] );
-			free( merkle );
-			applog( LOG_ERR, "Stratum notify: invalid Merkle branch" );
-			goto out;
-		}
-		merkle[i] = (uchar*) malloc( 32 );
-		hex2bin( merkle[i], s, 32 );
-	}
+   pthread_mutex_lock( &sctx->work_lock );

-	pthread_mutex_lock( &sctx->work_lock );
+   if ( merkle_count )
+   {
+      if ( merkle_count > sctx->job.merkle_buf_size )
+      {
+         for ( i = 0; i < sctx->job.merkle_buf_size; i++ )
+            free( sctx->job.merkle[i] );
+         free( sctx->job.merkle );
+
+         merkle = (uchar**) malloc( merkle_count * sizeof(char *) );
+         for ( i = 0; i < merkle_count; i++ )
+            merkle[i] = (uchar*) malloc( 32 );
+         sctx->job.merkle_buf_size = merkle_count;
+         sctx->job.merkle = merkle;
+      }
+
+      for ( i = 0; i < merkle_count; i++ )
+      {
+         const char *s = json_string_value( json_array_get( merkle_arr, i ) );
+         if ( !s || strlen(s) != 64 )
+         {
+            for ( int j = 0; j < sctx->job.merkle_buf_size; j++ )
+               free( sctx->job.merkle[j] );
+            free( sctx->job.merkle );
+            sctx->job.merkle_count =
+            sctx->job.merkle_buf_size = 0;
+            pthread_mutex_unlock( &sctx->work_lock );
+            applog( LOG_ERR, "Stratum notify: invalid Merkle branch" );
+            goto out;
+         }
+         hex2bin( sctx->job.merkle[i], s, 32 );
+      }   
+   }
+   sctx->job.merkle_count = merkle_count;         

 	coinb1_size = strlen( coinb1 ) / 2;
 	coinb2_size = strlen( coinb2 ) / 2;
@@ -2066,18 +2085,9 @@ static bool stratum_notify(struct stratum_ctx *sctx, json_t *params)
   }

 	sctx->block_height = getblocheight( sctx );
-
-	for ( i = 0; i < sctx->job.merkle_count; i++ )
-		free( sctx->job.merkle[i] );
-
-	free( sctx->job.merkle );
-	sctx->job.merkle = merkle;
-	sctx->job.merkle_count = merkle_count;
-
 	hex2bin( sctx->job.nbits, nbits, 4 );
 	hex2bin( sctx->job.ntime, stime, 4 );
 	sctx->job.clean = clean;
-
 	sctx->job.diff = sctx->next_diff;

 	pthread_mutex_unlock( &sctx->work_lock );
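
The stratum_notify hunks above stop freeing and reallocating the Merkle branch array on every job. Instead they keep the buffers and reallocate only when a job needs more branches than are currently allocated. Below is a minimal sketch of that grow-only pattern outside the miner. The struct merkle_job type and the merkle_job_reserve and merkle_job_release helpers are hypothetical names used for illustration; they are not part of cpuminer-opt, and locking is left out.

```c
#include <stdlib.h>

#define BRANCH_SIZE 32   /* one Merkle branch is a 32-byte hash */

/* Hypothetical stand-in for the job fields touched by the diff. */
struct merkle_job
{
   unsigned char **merkle;   /* array of branch buffers */
   int merkle_count;         /* branches used by the current job */
   int merkle_buf_size;      /* branches currently allocated */
};

/* Free every allocated branch (not just the ones in use) and reset
   the bookkeeping so a later reserve starts from a clean state. */
static void merkle_job_release( struct merkle_job *job )
{
   for ( int i = 0; i < job->merkle_buf_size; i++ )
      free( job->merkle[i] );   /* free(NULL) is a no-op */
   free( job->merkle );
   job->merkle = NULL;
   job->merkle_count = job->merkle_buf_size = 0;
}

/* Make room for 'count' branches, reallocating only on growth.
   Returns 0 on success, -1 on allocation failure. */
static int merkle_job_reserve( struct merkle_job *job, int count )
{
   if ( count > job->merkle_buf_size )
   {
      merkle_job_release( job );
      /* calloc zeroes the pointers, so a partial failure can be
         cleaned up safely by merkle_job_release. */
      job->merkle = calloc( (size_t)count, sizeof(unsigned char *) );
      if ( !job->merkle ) return -1;
      job->merkle_buf_size = count;
      for ( int i = 0; i < count; i++ )
      {
         job->merkle[i] = malloc( BRANCH_SIZE );
         if ( !job->merkle[i] )
         {
            merkle_job_release( job );
            return -1;
         }
      }
   }
   job->merkle_count = count;
   return 0;
}
```

A caller would zero-initialize the struct, call merkle_job_reserve with the branch count of each new job while holding whatever lock protects the job, and then write each decoded branch into job->merkle[i]. Clearing the pointer and both counters in one place means an error path cannot leave a dangling pointer behind for the next job.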
--- a/winbuild-cross.sh
+++ b/winbuild-cross.sh
@@ -129,7 +129,7 @@ make clean || echo clean
 # Native with CPU groups enabled
 make clean || echo clean
 rm -f config.status
-CFLAGS="-march=native $DEFAULT_CFLAGS" ./configure $CONFIGURE_ARGS
+CFLAGS="-march=native $DEFAULT_CFLAGS_OLD" ./configure $CONFIGURE_ARGS
 make -j 8
 strip -s cpuminer.exe
Author	SHA1	Message	Date
Jay D Dee	d6b5750362	v3.23.1	2023-09-13 11:48:52 -04:00
Jay D Dee	4378d2f841	v3.23.0	2023-08-30 20:15:48 -04:00
Jay D Dee	57a6b7b58b	v3.22.3	2023-06-14 11:07:40 -04:00
Jay D Dee	de564ccbde	v3.22.2	2023-04-06 13:38:37 -04:00
Jay D Dee	fcd7727b0d	v3.22.1	2023-03-24 18:29:42 -04:00