v3.23.0

v3.22.3
v3.22.2
2025-09-17 23:44:27 +00:00 · 2023-08-30 20:15:48 -04:00 · 2023-06-14 11:07:40 -04:00 · 2023-04-06 13:38:37 -04:00 · 2023-03-24 18:29:42 -04:00 · 2023-03-21 17:12:51 -04:00
111 changed files with 13939 additions and 7267 deletions
--- a/156
+++ b/156
@@ -1,158 +1,4 @@
-Instructions for compiling cpuminer-opt for Windows.
-
-These instructions are out of date. Please consult the wiki for
-the latest:
+Please consult the wiki for Windows compile instructions.

 https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source

-Windows compilation using Visual Studio is not supported. Mingw64 is
-used on a Linux system (bare metal or virtual machine) to cross-compile
-cpuminer-opt executable binaries for Windows.
-
-These instructions were written for Debian and Ubuntu compatible distributions
-but should work on other major distributions as well. However some of the
-package names or file paths may be different.
-
-It is assumed a Linux system is already available and running. And the user
-has enough Linux knowledge to find and install packages and follow these
-instructions.
-
-First it is a good idea to create new user specifically for cross compiling.
-It keeps all mingw stuff contained and isolated from the rest of the system.
-
-Step by step...
-
-1. Install necessary packages from the distribution's repositories.
-
-Refer to Linux compile instructions and install required packages.
-
-Additionally, install mingw-w64.
-
-sudo apt-get install mingw-w64 libz-mingw-w64-dev
-
-
-2. Create a local library directory for packages to be compiled in the next
-   step. Suggested location is $HOME/usr/lib/
-
-$ mkdir $HOME/usr/lib
-
-3. Download and build other packages for mingw that don't have a mingw64
-   version available in the repositories.
-
-Download the following source code packages from their respective and
-respected download locations, copy them to $HOME/usr/lib/ and uncompress them. 
-
-openssl: https://github.com/openssl/openssl/releases
-
-curl: https://github.com/curl/curl/releases
-
-gmp: https://gmplib.org/download/gmp/
-
-In most cases the latest version is ok but it's safest to download the same major and minor version as included in your distribution. The following uses versions from Ubuntu 20.04. Change version numbers as required.
-
-Run the following commands or follow the supplied instructions. Do not run "make install" unless you are using /usr/lib, which isn't recommended.
-
-Some instructions insist on running "make check". If make check fails it may still work, YMMV.
-
-You can speed up "make" by using all CPU cores available with "-j n" where n is the number of CPU threads you want to use.
-
-openssl:
-
-$ ./Configure mingw64 shared --cross-compile-prefix=x86_64-w64-mingw32-
-$ make
-
-Make may fail with an ld error, just ensure libcrypto-1_1-x64.dll is created.
-
-curl:
-
-$ ./configure --with-winssl --with-winidn --host=x86_64-w64-mingw32
-$ make
-
-gmp:
-
-$ ./configure --host=x86_64-w64-mingw32
-$ make
-
-4. Tweak the environment.
-
-This step is required every time you log in; alternatively, add the commands to .bashrc.
-
-Define some local variables to point to local library.
-
-$ export LOCAL_LIB="$HOME/usr/lib"
-
-$ export LDFLAGS="-L$LOCAL_LIB/curl/lib/.libs -L$LOCAL_LIB/gmp/.libs -L$LOCAL_LIB/openssl"
-
-$ export CONFIGURE_ARGS="--with-curl=$LOCAL_LIB/curl --with-crypto=$LOCAL_LIB/openssl --host=x86_64-w64-mingw32"
-
-Adjust for gcc version:
-
-$ export GCC_MINGW_LIB="/usr/lib/gcc/x86_64-w64-mingw32/9.3-win32"
-
-Create a release directory and copy some dll files previously built. This can be done outside of cpuminer-opt and only needs to be done once. If the release directory is in cpuminer-opt directory it needs to be recreated every time a source package is decompressed.
-
-$ mkdir release
-$ cp /usr/x86_64-w64-mingw32/lib/zlib1.dll release/
-$ cp /usr/x86_64-w64-mingw32/lib/libwinpthread-1.dll release/
-$ cp $GCC_MINGW_LIB/libstdc++-6.dll release/
-$ cp $GCC_MINGW_LIB/libgcc_s_seh-1.dll release/
-$ cp $LOCAL_LIB/openssl/libcrypto-1_1-x64.dll release/
-$ cp $LOCAL_LIB/curl/lib/.libs/libcurl-4.dll release/
-
-The following steps need to be done every time a new source package is
-opened.
-
-5. Download cpuminer-opt
-
-Download the latest source code package of cpuminer-opt to your desired
-location. .zip or .tar.gz, your choice.
-
-https://github.com/JayDDee/cpuminer-opt/releases
-
-Decompress and change to the cpuminer-opt directory.
-
-6. compile
-
-Create a link to the locally compiled version of gmp.h
-
-$ ln -s $LOCAL_LIB/gmp-version/gmp.h ./gmp.h
-
-$ ./autogen.sh
-
-Configure the compiler for the CPU architecture of the host machine:
-
-CFLAGS="-O3 -march=native -Wall" ./configure $CONFIGURE_ARGS
-
-or cross compile for a specific CPU architecture:
-
-CFLAGS="-O3 -march=znver1 -Wall" ./configure $CONFIGURE_ARGS
-
-This will compile for AMD Ryzen.
-
-You can compile more generically for a set of specific CPU features if you know what features you want:
-
-CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure $CONFIGURE_ARGS
-
-This will compile for an older CPU that does not have AVX.
-
-You can find several examples in README.txt
-
-If you have a CPU with more than 64 threads and Windows 7 or higher you can enable the CPU Groups feature by adding the following to CFLAGS:
-
-"-D_WIN32_WINNT=0x0601"
-
-Once you have run configure successfully run the compiler with n CPU threads:
-
-$ make -j n
-
-Copy cpuminer.exe to the release directory, compress and copy the release directory to a Windows system and run cpuminer.exe from the command line.
-
-Run cpuminer
-
-In a command window, change directory to the unzipped release folder. To get a list of all options:
-
-cpuminer.exe --help
-
-Command options are specific to where you mine. Refer to the pool's instructions on how to set them.
-
-
--- a/Makefile.am
+++ b/Makefile.am
@@ -55,9 +55,6 @@ cpuminer_SOURCES = \
  algo/blake/mod_blakecoin.c \
  algo/blake/blakecoin.c \
  algo/blake/blakecoin-4way.c \
-  algo/blake/decred-gate.c \
-  algo/blake/decred.c \
-  algo/blake/decred-4way.c \
  algo/blake/pentablake-gate.c \
  algo/blake/pentablake-4way.c \
  algo/blake/pentablake.c \
@@ -178,6 +175,8 @@ cpuminer_SOURCES = \
  algo/sha/sha256t.c \
  algo/sha/sha256q-4way.c \
  algo/sha/sha256q.c \
+  algo/sha/sha512256d-4way.c \
+  algo/sha/sha256dt.c \
  algo/shabal/sph_shabal.c \
  algo/shabal/shabal-hash-4way.c \
  algo/shavite/sph_shavite.c \
@@ -264,6 +263,8 @@ cpuminer_SOURCES = \
  algo/x16/x16r-4way.c \
  algo/x16/x16rv2.c \
  algo/x16/x16rv2-4way.c \
+  algo/x16/x16rt.c \
+  algo/x16/x16rt-4way.c \
  algo/x16/hex.c \
  algo/x16/x21s-4way.c \
  algo/x16/x21s.c \
--- a/72
+++ b/72
@@ -65,8 +65,76 @@ If not what makes it happen or not happen?
 Change Log
 ----------

+v3.23.0
+
+#398: Prevent GBT fallback to Getwork on network error.
+#398: Prevent excessive logs when conditional mining is paused when mining solo.
+Fix a false start if stratum doesn't immediately send a new job after connecting.
+Tweak diagonal shuffle in Blake2b & Blake256 1-way SIMD to reduce latency.
+CPUID support for AVX10.
+Initial changes to AVX2 targeted code in preparation for AVX10.
+Code cleanup and miscellaneous small improvements.
+
 v3.22.3

+Data interleaving and byte swap optimizations with AVX2, AVX512 & AVX512VBMI.
+Faster Luffa with AVX2 & AVX512.
+Other small optimizations.
+Some code cleanup.
+
+v3.22.2
+
+Added sha512256d & sha256dt algos.
+Fixed intermittent invalid shares lyra2v2 AVX512.
+Removed application limits on the number of CPUs and threads, HW and OS limits still apply.
+Added a log warning if more threads are defined than active CPUs in affinity mask.
+Improved merkle tree memory management for stratum.
+Added transaction count to New Work log.
+Other small improvements.
+
+v3.22.1
+
+#393 fixed segfault in GBT, regression from v3.22.0.
+More efficient 32 bit data interleaving.
+
+v3.22.0
+
+Stratum: faster netdiff calculation.
+Merged a few updates from Pooler/cpuminer:
+   Use CURLOPT_POSTFIELDS in json_rpc_call,
+   Use CURLINFO_ACTIVESOCKET when supported,
+   JSONRPC speedup,
+   Speed up hex2bin function.  
+Small log improvements, notably more frequent hash rate reports.
+Removed decred algo.
+
+v3.21.5
+
+All issues with v3.21.3 & v3.21.4 should be resolved.
+Changes since v3.21.2:
+#392 #379 #389 Fixed misaligned address segfault solo mining.
+#392 Fixed stats for myr-gr algo, and a few others, for CPUs without AVX2.
+#392 Fixed conditional mining.
+#392 Fixed cpu affinity on Ryzen CPUs using Windows binaries,
+     Windows binaries no longer support CPU groups,
+     Windows binaries support CPUs with up to 64 threads.
+Small optimizations to serialized vectoring.
+
+v3.21.4 CANCELLED
+
+Reapply selected changes from v3.21.3.
+#392 #379 #389 Fixed misaligned address segfault solo mining.
+#392 Fixed conditional mining.
+#392 Fixed cpu affinity on Ryzen CPUs using Windows binaries,
+     Windows binaries no longer support CPU groups,
+     Windows binaries support CPUs with up to 64 threads.
+
+v3.21.3.1 UNRELEASED
+
+Revert to 3.21.2
+
+v3.21.3 CANCELLED
+
 #392 #379 #389 Fixed misaligned address segfault solo mining.
 #392 Fixed stats for myr-gr algo, and a few others, for CPUs without AVX2.
 #392 Fixed conditional mining.
@@ -74,10 +142,10 @@ v3.22.3
     Windows binaries no longer support CPU groups,
     Windows binaries support CPUs with up to 64 threads.
 Midstate prehash is now centralized, done only once instead of by every thread
-for selected algos. 
+for selected algos.
 Small optimizations to serialized vectoring.

-v3.22.2
+v3.21.2 

 Faster SALSA SIMD shuffle for yespower, yescrypt & scryptn2.
 Fixed a couple of compiler warnings with gcc-12.
--- a/algo-gate-api.c
+++ b/algo-gate-api.c
@@ -171,7 +171,7 @@ int scanhash_4way_64in_32out( struct work *work, uint32_t max_nonce,
         }
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n <= last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -227,7 +227,7 @@ int scanhash_8way_64in_32out( struct work *work, uint32_t max_nonce,
         }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
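Review note on the two hunks above: both swap a project helper (m256_const1_64, m512_const1_64) for the standard set1 broadcast intrinsic and leave the constant unchanged. The constant only works because the add is a packed 32-bit add. The following is a portable scalar sketch of that arithmetic, not the project's code. It assumes each nonce sits in the upper 32 bits of a 64-bit element, which is what the "64in" in the function names suggests; the demo_ names are hypothetical.

```c
#include <stdint.h>

#define DEMO_LANES64 4   /* a 256-bit vector holds four 64-bit elements */

/* Scalar stand-in for a set1 broadcast of one 64-bit constant. */
static void demo_set1_epi64( uint64_t v[DEMO_LANES64], uint64_t c )
{
   for ( int i = 0; i < DEMO_LANES64; i++ ) v[i] = c;
}

/* Scalar stand-in for a packed 32-bit add: the two halves of every 64-bit
   element are added independently, with no carry from low into high. */
static void demo_add_epi32( uint64_t dst[DEMO_LANES64],
                            const uint64_t a[DEMO_LANES64],
                            const uint64_t b[DEMO_LANES64] )
{
   for ( int i = 0; i < DEMO_LANES64; i++ )
   {
      const uint32_t lo = (uint32_t)a[i] + (uint32_t)b[i];
      const uint32_t hi = (uint32_t)( a[i] >> 32 ) + (uint32_t)( b[i] >> 32 );
      dst[i] = ( (uint64_t)hi << 32 ) | lo;
   }
}

/* Advance four interleaved nonces by 4; the low halves are left untouched. */
static void demo_step_nonces( uint64_t noncev[DEMO_LANES64] )
{
   uint64_t step[DEMO_LANES64];
   demo_set1_epi64( step, 0x0000000400000000ULL );
   demo_add_epi32( noncev, noncev, step );
}
```

In this model a nonce of 0xFFFFFFFF wraps to 3 without disturbing the neighbouring low half, which is the reason for using a 32-bit add rather than a 64-bit one.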
@@ -253,7 +253,6 @@ void init_algo_gate( algo_gate_t* gate )
   gate->miner_thread_init       = (void*)&return_true;
   gate->scanhash                = (void*)&scanhash_generic;
   gate->hash                    = (void*)&null_hash;
-   gate->prehash                 = (void*)&return_true;
   gate->get_new_work            = (void*)&std_get_new_work;
   gate->work_decode             = (void*)&std_le_work_decode;
   gate->decode_extra_data       = (void*)&do_nothing;
@@ -264,8 +263,6 @@ void init_algo_gate( algo_gate_t* gate )
   gate->build_block_header      = (void*)&std_build_block_header;
   gate->build_extraheader       = (void*)&std_build_extraheader;
   gate->set_work_data_endian    = (void*)&do_nothing;
-   gate->calc_network_diff       = (void*)&std_calc_network_diff;
-   gate->ready_to_mine           = (void*)&std_ready_to_mine;
   gate->resync_threads          = (void*)&do_nothing;
   gate->do_this_thread          = (void*)&return_true;
   gate->longpoll_rpc_call       = (void*)&std_longpoll_rpc_call;
@@ -309,7 +306,6 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
    case ALGO_BLAKECOIN:    rc = register_blakecoin_algo     ( gate ); break;
    case ALGO_BMW512:       rc = register_bmw512_algo        ( gate ); break;
    case ALGO_C11:          rc = register_c11_algo           ( gate ); break;
-    case ALGO_DECRED:       rc = register_decred_algo        ( gate ); break;
    case ALGO_DEEP:         rc = register_deep_algo          ( gate ); break;
    case ALGO_DMD_GR:       rc = register_dmd_gr_algo        ( gate ); break;
    case ALGO_GROESTL:      rc = register_groestl_algo       ( gate ); break;
@@ -341,9 +337,11 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
    case ALGO_QUBIT:        rc = register_qubit_algo         ( gate ); break;
    case ALGO_SCRYPT:       rc = register_scrypt_algo        ( gate ); break;
    case ALGO_SHA256D:      rc = register_sha256d_algo       ( gate ); break;
+    case ALGO_SHA256DT:     rc = register_sha256dt_algo      ( gate ); break;
    case ALGO_SHA256Q:      rc = register_sha256q_algo       ( gate ); break;
    case ALGO_SHA256T:      rc = register_sha256t_algo       ( gate ); break;
    case ALGO_SHA3D:        rc = register_sha3d_algo         ( gate ); break;
+    case ALGO_SHA512256D:   rc = register_sha512256d_algo    ( gate ); break;
    case ALGO_SHAVITE3:     rc = register_shavite_algo       ( gate ); break;
    case ALGO_SKEIN:        rc = register_skein_algo         ( gate ); break;
    case ALGO_SKEIN2:       rc = register_skein2_algo        ( gate ); break;
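Review note on the gate hunks above: init_algo_gate fills every hook with a default, and each register_*_algo function then overrides only the hooks its algorithm needs. Removing prehash, calc_network_diff and ready_to_mine therefore only requires deleting their defaults and declarations. The following is a cut-down sketch of that pattern. Every demo_ name is hypothetical, and the two-hook struct stands in for the much larger algo_gate_t.

```c
#include <stdbool.h>

/* Hypothetical two-hook gate; the real algo_gate_t has many more members. */
typedef struct
{
   int  ( *hash )           ( void *out, const void *in, int thr_id );
   bool ( *do_this_thread ) ( int thr_id );
} demo_gate_t;

/* Defaults, playing the role of null_hash and return_true. */
static int demo_null_hash( void *out, const void *in, int thr_id )
{
   (void)out; (void)in; (void)thr_id;
   return 0;
}

static bool demo_return_true( int thr_id )
{
   (void)thr_id;
   return true;
}

/* Toy "algorithm": inverts one byte so the override is observable. */
static int demo_invert_hash( void *out, const void *in, int thr_id )
{
   (void)thr_id;
   *(unsigned char*)out = (unsigned char)( *(const unsigned char*)in ^ 0xff );
   return 1;
}

/* Step 1: every hook gets a safe default. */
static void demo_init_gate( demo_gate_t *gate )
{
   gate->hash           = demo_null_hash;
   gate->do_this_thread = demo_return_true;
}

/* Step 2: the algorithm overrides only what it implements. */
static bool demo_register_algo( demo_gate_t *gate )
{
   gate->hash = demo_invert_hash;
   return true;
}
```

Callers always go through the gate pointers, so a hook that was never overridden still resolves to its default.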
@@ -428,7 +426,6 @@ const char* const algo_alias_map[][2] =
  { "blake256r8",        "blakecoin"      },
  { "blake256r8vnl",     "vanilla"        },
  { "blake256r14",       "blake"          },
-  { "blake256r14dcr",    "decred"         },
  { "diamond",           "dmd-gr"         },
  { "espers",            "hmq1725"        },
  { "flax",              "c11"            },
--- a/algo-gate-api.h
+++ b/algo-gate-api.h
@@ -94,10 +94,13 @@ typedef  uint32_t set_t;
 #define SSE42_OPT        4
 #define AVX_OPT          8   // Sandybridge
 #define AVX2_OPT      0x10   // Haswell, Zen1
-#define SHA_OPT       0x20   // Zen1, Icelake (sha256)
-#define AVX512_OPT    0x40   // Skylake-X (AVX512[F,VL,DQ,BW])
-#define VAES_OPT      0x80   // Icelake (VAES & AVX512)
+#define SHA_OPT       0x20   // Zen1, Icelake (deprecated)
+#define AVX512_OPT    0x40   // Skylake-X, Zen4 (AVX512[F,VL,DQ,BW])
+#define VAES_OPT      0x80   // Icelake, Zen3

+// AVX10 does not have explicit algo features:
+//  AVX10_512 is compatible with AVX512 + VAES
+//  AVX10_256 is compatible with AVX2 + VAES

 // return set containing all elements from sets a & b
 inline set_t set_union ( set_t a, set_t b ) { return a | b; }
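Review note on the new AVX10 comment: since AVX10 gets no flag of its own, an algorithm's feature set would be expressed with the existing bits. The following is a minimal sketch of how that could look, using the flag values from this hunk. demo_set_has_all and the two demo_avx10_* functions are hypothetical and do not appear in the source.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t set_t;

/* Flag values copied from the hunk above. */
#define AVX2_OPT      0x10
#define AVX512_OPT    0x40
#define VAES_OPT      0x80

/* Same body as the header's set_union; static inline keeps this sketch
   linkable as a single translation unit. */
static inline set_t set_union( set_t a, set_t b ) { return a | b; }

/* Hypothetical helper: true if every flag in 'wanted' is present in 'have'. */
static inline bool demo_set_has_all( set_t have, set_t wanted )
{
   return ( have & wanted ) == wanted;
}

/* Hypothetical mapping of the AVX10 comment onto the existing flags. */
static inline set_t demo_avx10_256_flags( void )
{
   return set_union( AVX2_OPT, VAES_OPT );
}

static inline set_t demo_avx10_512_flags( void )
{
   return set_union( AVX512_OPT, VAES_OPT );
}
```

With this mapping, a check such as demo_set_has_all( cpu_flags, demo_avx10_256_flags() ) would gate a code path on AVX2 plus VAES; cpu_flags is again an assumed name.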
@@ -119,7 +122,7 @@ typedef struct
 // to be registered with the gate. 
 int ( *scanhash ) ( struct work*, uint32_t, uint64_t*, struct thr_info* );

-int ( *hash )     ( void*, const void*, const int );
+int ( *hash )     ( void*, const void*, int );

 //optional, safe to use default in most cases

@@ -127,9 +130,6 @@ int ( *hash )     ( void*, const void*, const int );
 // other initialization specific to miner threads.
 bool ( *miner_thread_init )     ( int );

-// Perform prehash after receiving new work
-int ( *prehash )                ( struct work* );
-
 // Get thread local copy of blockheader with unique nonce.
 void ( *get_new_work )          ( struct work*, struct work*, int, uint32_t* );

@@ -147,7 +147,7 @@ void ( *gen_merkle_root )       ( char*, struct stratum_ctx* );
 void ( *build_extraheader )     ( struct work*, struct stratum_ctx* );

 void ( *build_block_header )    ( struct work*, uint32_t, uint32_t*,
-	                                uint32_t*, uint32_t, uint32_t,
+	                                uint32_t*,   uint32_t, uint32_t,
                                   unsigned char* );

 // Build mining.submit message
@@ -158,19 +158,13 @@ char* ( *malloc_txs_request )   ( struct work* );
 // Big endian or little endian
 void ( *set_work_data_endian )  ( struct work* );

-double ( *calc_network_diff )   ( struct work* );
-
-// Wait for first work
-bool ( *ready_to_mine )         ( struct work*, struct stratum_ctx*, int );
-
 // Diverge mining threads
 bool ( *do_this_thread )        ( int );

 // After do_this_thread
 void ( *resync_threads )        ( int, struct work* );

-// No longer needed
-json_t* (*longpoll_rpc_call)      ( CURL*, int*, char* );
+json_t* ( *longpoll_rpc_call )  ( CURL*, int*, char* );

 set_t optimizations;
 int  ( *get_work_data_size )     ();
@@ -289,8 +283,6 @@ char* std_malloc_txs_request( struct work *work );
 // Default is do_nothing, little endian is assumed
 void set_work_data_big_endian( struct work *work );

-double std_calc_network_diff( struct work *work );
-
 void std_build_block_header( struct work* g_work, uint32_t version,
 	                          uint32_t *prevhash,  uint32_t *merkle_root,
   	                       uint32_t ntime,      uint32_t nbits,
@@ -300,9 +292,6 @@ void std_build_extraheader( struct work *work, struct stratum_ctx *sctx );

 json_t* std_longpoll_rpc_call( CURL *curl, int *err, char *lp_url );

-bool std_ready_to_mine( struct work* work, struct stratum_ctx* stratum,
-                        int thr_id );
-
 int std_get_work_data_size();

 // Gate admin functions
--- a/algo/blake/blake256-hash-4way.c
+++ b/algo/blake/blake256-hash-4way.c
@@ -308,7 +308,52 @@ static const sph_u32 CS[16] = {
 /////////////////////////////////////////
 //
 // Blake-256 1 way SIMD
+// Only used for prehash, otherwise 4way is used with SSE2.

+// optimize shuffles to reduce latency caused by dependencies on V1.
+#define BLAKE256_ROUND( r ) \
+{ \
+   V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
+                           _mm_set_epi32( CSx( r, 7 ) ^ Mx( r, 6 ), \
+                                          CSx( r, 5 ) ^ Mx( r, 4 ), \
+                                          CSx( r, 3 ) ^ Mx( r, 2 ), \
+                                          CSx( r, 1 ) ^ Mx( r, 0 ) ) ) ); \
+   V3 = mm128_swap32_16( _mm_xor_si128( V3, V0 ) ); \
+   V2 = _mm_add_epi32( V2, V3 ); \
+   V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 12 ); \
+   V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
+                           _mm_set_epi32( CSx( r, 6 ) ^ Mx( r, 7 ), \
+                                          CSx( r, 4 ) ^ Mx( r, 5 ), \
+                                          CSx( r, 2 ) ^ Mx( r, 3 ), \
+                                          CSx( r, 0 ) ^ Mx( r, 1 ) ) ) ); \
+   V3 = mm128_shuflr32_8( _mm_xor_si128( V3, V0 ) ); \
+   V2 = _mm_add_epi32( V2, V3 ); \
+   V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 7 ); \
+   V0 = mm128_shufll_32( V0 ); \
+   V3 = mm128_swap_64( V3 ); \
+   V2 = mm128_shuflr_32( V2 ); \
+   V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
+                           _mm_set_epi32( CSx( r, D ) ^ Mx( r, C ), \
+                                          CSx( r, B ) ^ Mx( r, A ), \
+                                          CSx( r, 9 ) ^ Mx( r, 8 ), \
+                                          CSx( r, F ) ^ Mx( r, E ) ) ) ); \
+   V3 = mm128_swap32_16( _mm_xor_si128( V3, V0 ) ); \
+   V2 = _mm_add_epi32( V2, V3 ); \
+   V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 12 ); \
+   V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
+                           _mm_set_epi32( CSx( r, C ) ^ Mx( r, D ), \
+                                          CSx( r, A ) ^ Mx( r, B ), \
+                                          CSx( r, 8 ) ^ Mx( r, 9 ), \
+                                          CSx( r, E ) ^ Mx( r, F ) ) ) ); \
+   V3 = mm128_shuflr32_8( _mm_xor_si128( V3, V0 ) ); \
+   V2 = _mm_add_epi32( V2, V3 ); \
+   V1 = mm128_ror_32( _mm_xor_si128( V1, V2 ), 7 ); \
+   V0 = mm128_shuflr_32( V0 ); \
+   V3 = mm128_swap_64( V3 ); \
+   V2 = mm128_shufll_32( V2 ); \
+}
+
+/*
 #define BLAKE256_ROUND( r ) \
 { \
   V0 = _mm_add_epi32( V0, _mm_add_epi32( V1, \
@@ -350,6 +395,7 @@ static const sph_u32 CS[16] = {
   V2 = mm128_swap_64( V2 ); \
   V1 = mm128_shufll_32( V1 ); \
 }
+*/

 void blake256_transform_le( uint32_t *H, const uint32_t *buf,
                            const uint32_t T0, const uint32_t T1 )
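Review note on the new BLAKE256_ROUND: the old macro diagonalizes by rotating V1, V2 and V3, so the diagonal step has to wait for the freshly rotated V1. The new macro leaves V1 in place and rotates V0, V2 and V3 around it, reordering the message words to match. The index-level sketch below checks that the new arrangement still lines up the four BLAKE diagonals. It models lane positions only, not the hash. The lane direction of mm128_shufll_32 and mm128_shuflr_32 is an assumption, inferred from the message-word order (F/E in the lowest lane) in the macro.

```c
/* Rotate a 4-element row so that out[i] = in[(i + n) & 3]. */
static void demo_rot_row( int out[4], const int in[4], int n )
{
   for ( int i = 0; i < 4; i++ ) out[i] = in[ ( i + n ) & 3 ];
}

/* Diagonal d of the 4x4 state is { d, 4+(d+1)%4, 8+(d+2)%4, 12+(d+3)%4 }.
   Returns d if the four state indices form diagonal d, otherwise -1. */
static int demo_diagonal_id( int a, int b, int c, int d )
{
   if ( a < 0 || a > 3 )                 return -1;
   if ( b !=  4 + ( ( a + 1 ) & 3 ) )    return -1;
   if ( c !=  8 + ( ( a + 2 ) & 3 ) )    return -1;
   if ( d != 12 + ( ( a + 3 ) & 3 ) )    return -1;
   return a;
}

/* Fill lane_diag[i] with the diagonal handled in lane i when V1 stays
   fixed. Returns 1 if every lane holds a valid diagonal, else 0. */
static int demo_fixed_v1_diagonals( int lane_diag[4] )
{
   const int r0[4] = {  0,  1,  2,  3 };
   const int r1[4] = {  4,  5,  6,  7 };   /* V1 is not shuffled */
   const int r2[4] = {  8,  9, 10, 11 };
   const int r3[4] = { 12, 13, 14, 15 };
   int v0[4], v2[4], v3[4];

   demo_rot_row( v0, r0, 3 );   /* assumed shufll_32: { 3, 0, 1, 2 }     */
   demo_rot_row( v2, r2, 1 );   /* assumed shuflr_32: { 9, 10, 11, 8 }   */
   demo_rot_row( v3, r3, 2 );   /* swap_64:           { 14, 15, 12, 13 } */

   for ( int i = 0; i < 4; i++ )
   {
      lane_diag[i] = demo_diagonal_id( v0[i], r1[i], v2[i], v3[i] );
      if ( lane_diag[i] < 0 ) return 0;
   }
   return 1;
}
```

Under these assumptions the lanes hold diagonals 3, 0, 1, 2 in that order, which agrees with the macro placing message words E/F in the lowest lane, followed by 8/9, A/B and C/D.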
@@ -598,10 +644,10 @@ void blake256_transform_le( uint32_t *H, const uint32_t *buf,
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m128_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m128_const1_64( 0x85A308D385A308D3 ); \
-   VA = m128_const1_64( 0x13198A2E13198A2E ); \
-   VB = m128_const1_64( 0x0370734403707344 ); \
+   V8 = _mm_set1_epi64x( 0x243F6A88243F6A88 ); \
+   V9 = _mm_set1_epi64x( 0x85A308D385A308D3 ); \
+   VA = _mm_set1_epi64x( 0x13198A2E13198A2E ); \
+   VB = _mm_set1_epi64x( 0x0370734403707344 ); \
   VC = _mm_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm_set1_epi32( T1 ^ 0x082EFA98 ); \
@@ -958,7 +1004,6 @@ do { \
   __m256i M8, M9, MA, MB, MC, MD, ME, MF; \
   __m256i V0, V1, V2, V3, V4, V5, V6, V7; \
   __m256i V8, V9, VA, VB, VC, VD, VE, VF; \
-   __m256i shuf_bswap32; \
   V0 = H0; \
   V1 = H1; \
   V2 = H2; \
@@ -967,16 +1012,16 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m256_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m256_const1_64( 0x85A308D385A308D3 ); \
-   VA = m256_const1_64( 0x13198A2E13198A2E ); \
-   VB = m256_const1_64( 0x0370734403707344 ); \
+   V8 = _mm256_set1_epi64x( 0x243F6A88243F6A88 ); \
+   V9 = _mm256_set1_epi64x( 0x85A308D385A308D3 ); \
+   VA = _mm256_set1_epi64x( 0x13198A2E13198A2E ); \
+   VB = _mm256_set1_epi64x( 0x0370734403707344 ); \
   VC = _mm256_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm256_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm256_set1_epi32( T1 ^ 0x082EFA98 ); \
   VF = _mm256_set1_epi32( T1 ^ 0xEC4E6C89 ); \
-   shuf_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b, 0x1415161710111213, \
-                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
+   const __m256i shuf_bswap32 = mm256_set2_64( \
+                               0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
   M0 = _mm256_shuffle_epi8( * buf    , shuf_bswap32 ); \
   M1 = _mm256_shuffle_epi8( *(buf+ 1), shuf_bswap32 ); \
   M2 = _mm256_shuffle_epi8( *(buf+ 2), shuf_bswap32 ); \
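Review note on the shuf_bswap32 change: the full-width control constant is replaced by one 128-bit pattern repeated in every lane. This is safe because the byte shuffle works within each 128-bit lane and uses only the low 4 bits of each control byte, so 0x1c and 0x0c select the same source byte. The scalar model below checks that equivalence for the 256-bit case. It assumes that mm256_set2_64 broadcasts its two 64-bit arguments to both lanes and that m256_const_64 took its arguments from most to least significant; both are inferred from the diff.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a per-lane byte shuffle: each output byte selects from
   its own 16-byte lane using the low 4 bits of the control byte, and a
   set high bit in the control byte yields zero. */
static void demo_shuffle_epi8( uint8_t *dst, const uint8_t *src,
                               const uint8_t *ctl, int nbytes )
{
   for ( int i = 0; i < nbytes; i++ )
   {
      const int lane_base = i & ~15;
      dst[i] = ( ctl[i] & 0x80 ) ? 0 : src[ lane_base + ( ctl[i] & 0x0f ) ];
   }
}

/* Store a 64-bit constant as little-endian bytes, as held in a register. */
static void demo_put_le64( uint8_t *p, uint64_t v )
{
   for ( int i = 0; i < 8; i++ ) p[i] = (uint8_t)( v >> ( 8 * i ) );
}

/* Returns 1 if the old full-width control and the broadcast 128-bit
   control give identical results and both byte-swap each 32-bit word. */
static int demo_bswap32_masks_agree( void )
{
   uint8_t old_ctl[32], new_ctl[32], src[32], a[32], b[32];

   demo_put_le64( old_ctl,      0x0405060700010203ULL );
   demo_put_le64( old_ctl +  8, 0x0c0d0e0f08090a0bULL );
   demo_put_le64( old_ctl + 16, 0x1415161710111213ULL );
   demo_put_le64( old_ctl + 24, 0x1c1d1e1f18191a1bULL );

   memcpy( new_ctl,      old_ctl, 16 );   /* low lane pattern ...       */
   memcpy( new_ctl + 16, old_ctl, 16 );   /* ... repeated in upper lane */

   for ( int i = 0; i < 32; i++ ) src[i] = (uint8_t)( 0xA0 + i );

   demo_shuffle_epi8( a, src, old_ctl, 32 );
   demo_shuffle_epi8( b, src, new_ctl, 32 );

   if ( memcmp( a, b, 32 ) != 0 ) return 0;
   for ( int i = 0; i < 32; i++ )
      if ( a[i] != src[ i ^ 3 ] ) return 0;   /* bytes reversed per word */
   return 1;
}
```

The same argument extends to the 512-bit hunks later in this file, where the eight-constant control is replaced by mm512_bcast_m128 of the identical 128-bit pattern.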
@@ -1034,10 +1079,10 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m256_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m256_const1_64( 0x85A308D385A308D3 ); \
-   VA = m256_const1_64( 0x13198A2E13198A2E ); \
-   VB = m256_const1_64( 0x0370734403707344 ); \
+   V8 = _mm256_set1_epi64x( 0x243F6A88243F6A88 ); \
+   V9 = _mm256_set1_epi64x( 0x85A308D385A308D3 ); \
+   VA = _mm256_set1_epi64x( 0x13198A2E13198A2E ); \
+   VB = _mm256_set1_epi64x( 0x0370734403707344 ); \
   VC = _mm256_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm256_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm256_set1_epi32( T1 ^ 0x082EFA98 ); \
@@ -1100,23 +1145,23 @@ void blake256_8way_round0_prehash_le( void *midstate, const void *midhash,
   V[ 5] = H[5];
   V[ 6] = H[6];
   V[ 7] = H[7];
-   V[ 8] = m256_const1_32( CS0 );
-   V[ 9] = m256_const1_32( CS1 );
-   V[10] = m256_const1_32( CS2 );
-   V[11] = m256_const1_32( CS3 );
-   V[12] = m256_const1_32( CS4 ^ 0x280 );
-   V[13] = m256_const1_32( CS5 ^ 0x280 );
-   V[14] = m256_const1_32( CS6 );
-   V[15] = m256_const1_32( CS7 );
+   V[ 8] = _mm256_set1_epi32( CS0 );
+   V[ 9] = _mm256_set1_epi32( CS1 );
+   V[10] = _mm256_set1_epi32( CS2 );
+   V[11] = _mm256_set1_epi32( CS3 );
+   V[12] = _mm256_set1_epi32( CS4 ^ 0x280 );
+   V[13] = _mm256_set1_epi32( CS5 ^ 0x280 );
+   V[14] = _mm256_set1_epi32( CS6 );
+   V[15] = _mm256_set1_epi32( CS7 );

 // M[ 0:3 ] contain new message data including unique nonces in M[ 3].
 // M[ 5:12, 14 ] are always zero and not needed or used.
 // M[ 4], M[ 13], M[15] are constant and are initialized here.
 // M[ 5] is a special case, used as a cache for (M[13] ^ CSC).

-   M[ 4] = m256_const1_32( 0x80000000 );
+   M[ 4] = _mm256_set1_epi32( 0x80000000 );
   M[13] = m256_one_32;
-   M[15] = m256_const1_32( 80*8 );
+   M[15] = _mm256_set1_epi32( 80*8 );

   M[ 5] =_mm256_xor_si256( M[13], _mm256_set1_epi32( CSC ) );

@@ -1278,8 +1323,7 @@ void blake256_8way_final_rounds_le( void *final_hash, const void *midstate,
   ROUND256_8WAY_3;

   const __m256i shuf_bswap32 =
-                  m256_const_64( 0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                 0x0c0d0e0f08090a0b, 0x0405060700010203 );
+                  mm256_set2_64( 0x0c0d0e0f08090a0b, 0x0405060700010203 );

   H[0] = _mm256_shuffle_epi8( mm256_xor3( V8, V0, h[0] ), shuf_bswap32 );
   H[1] = _mm256_shuffle_epi8( mm256_xor3( V9, V1, h[1] ), shuf_bswap32 );
@@ -1615,7 +1659,8 @@ do { \
   __m512i M8, M9, MA, MB, MC, MD, ME, MF; \
   __m512i V0, V1, V2, V3, V4, V5, V6, V7; \
   __m512i V8, V9, VA, VB, VC, VD, VE, VF; \
-   __m512i shuf_bswap32; \
+   const __m512i shuf_bswap32 = mm512_bcast_m128( _mm_set_epi64x( \
+                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ) ); \
   V0 = H0; \
   V1 = H1; \
   V2 = H2; \
@@ -1624,18 +1669,14 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m512_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m512_const1_64( 0x85A308D385A308D3 ); \
-   VA = m512_const1_64( 0x13198A2E13198A2E ); \
-   VB = m512_const1_64( 0x0370734403707344 ); \
+   V8 = _mm512_set1_epi64( 0x243F6A88243F6A88 ); \
+   V9 = _mm512_set1_epi64( 0x85A308D385A308D3 ); \
+   VA = _mm512_set1_epi64( 0x13198A2E13198A2E ); \
+   VB = _mm512_set1_epi64( 0x0370734403707344 ); \
   VC = _mm512_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm512_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm512_set1_epi32( T1 ^ 0x082EFA98 ); \
   VF = _mm512_set1_epi32( T1 ^ 0xEC4E6C89 ); \
-   shuf_bswap32 = m512_const_64( 0x3c3d3e3f38393a3b, 0x3435363730313233, \
-                                 0x2c2d2e2f28292a2b, 0x2425262720212223, \
-                                 0x1c1d1e1f18191a1b, 0x1415161710111213, \
-                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
   M0 = _mm512_shuffle_epi8( * buf    , shuf_bswap32 ); \
   M1 = _mm512_shuffle_epi8( *(buf+ 1), shuf_bswap32 ); \
   M2 = _mm512_shuffle_epi8( *(buf+ 2), shuf_bswap32 ); \
@@ -1693,10 +1734,10 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = m512_const1_64( 0x243F6A88243F6A88 ); \
-   V9 = m512_const1_64( 0x85A308D385A308D3 ); \
-   VA = m512_const1_64( 0x13198A2E13198A2E ); \
-   VB = m512_const1_64( 0x0370734403707344 ); \
+   V8 = _mm512_set1_epi64( 0x243F6A88243F6A88 ); \
+   V9 = _mm512_set1_epi64( 0x85A308D385A308D3 ); \
+   VA = _mm512_set1_epi64( 0x13198A2E13198A2E ); \
+   VB = _mm512_set1_epi64( 0x0370734403707344 ); \
   VC = _mm512_set1_epi32( T0 ^ 0xA4093822 ); \
   VD = _mm512_set1_epi32( T0 ^ 0x299F31D0 ); \
   VE = _mm512_set1_epi32( T1 ^ 0x082EFA98 ); \
@@ -1763,23 +1804,23 @@ void blake256_16way_round0_prehash_le( void *midstate, const void *midhash,
   V[ 5] = H[5];
   V[ 6] = H[6];
   V[ 7] = H[7];
-   V[ 8] = m512_const1_32( CS0 );
-   V[ 9] = m512_const1_32( CS1 );
-   V[10] = m512_const1_32( CS2 );
-   V[11] = m512_const1_32( CS3 );
-   V[12] = m512_const1_32( CS4 ^ 0x280 );
-   V[13] = m512_const1_32( CS5 ^ 0x280 );
-   V[14] = m512_const1_32( CS6 );
-   V[15] = m512_const1_32( CS7 );
+   V[ 8] = _mm512_set1_epi32( CS0 );
+   V[ 9] = _mm512_set1_epi32( CS1 );
+   V[10] = _mm512_set1_epi32( CS2 );
+   V[11] = _mm512_set1_epi32( CS3 );
+   V[12] = _mm512_set1_epi32( CS4 ^ 0x280 );
+   V[13] = _mm512_set1_epi32( CS5 ^ 0x280 );
+   V[14] = _mm512_set1_epi32( CS6 );
+   V[15] = _mm512_set1_epi32( CS7 );

 // M[ 0:3 ] contain new message data including unique nonces in M[ 3].   
 // M[ 5:12, 14 ] are always zero and not needed or used, except M[5] as noted.
 // M[ 4], M[ 13], M[15] are constant and are initialized here.
 // M[ 5] is a special case, used as a cache for (M[13] ^ CSC).
   
-   M[ 4] = m512_const1_32( 0x80000000 );
+   M[ 4] = _mm512_set1_epi32( 0x80000000 );
   M[13] = m512_one_32;
-   M[15] = m512_const1_32( 80*8 );
+   M[15] = _mm512_set1_epi32( 80*8 );

   M[ 5] =_mm512_xor_si512( M[13], _mm512_set1_epi32( CSC ) );

@@ -1956,10 +1997,8 @@ void blake256_16way_final_rounds_le( void *final_hash, const void *midstate,

   // Byte swap final hash
   const __m512i shuf_bswap32 =
-                  m512_const_64( 0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                 0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                 0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                 0x0c0d0e0f08090a0b, 0x0405060700010203 );
+                  mm512_bcast_m128( _mm_set_epi64x( 
+                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

   H[0] = _mm512_shuffle_epi8( mm512_xor3( V8, V0, h[0] ), shuf_bswap32 );
   H[1] = _mm512_shuffle_epi8( mm512_xor3( V9, V1, h[1] ), shuf_bswap32 );
@@ -1981,14 +2020,14 @@ static void
 blake32_4way_init( blake_4way_small_context *ctx, const uint32_t *iv,
                   const uint32_t *salt, int rounds )
 {
-   casti_m128i( ctx->H, 0 ) = m128_const1_64( 0x6A09E6676A09E667 );
-   casti_m128i( ctx->H, 1 ) = m128_const1_64( 0xBB67AE85BB67AE85 );
-   casti_m128i( ctx->H, 2 ) = m128_const1_64( 0x3C6EF3723C6EF372 );
-   casti_m128i( ctx->H, 3 ) = m128_const1_64( 0xA54FF53AA54FF53A );
-   casti_m128i( ctx->H, 4 ) = m128_const1_64( 0x510E527F510E527F );
-   casti_m128i( ctx->H, 5 ) = m128_const1_64( 0x9B05688C9B05688C );
-   casti_m128i( ctx->H, 6 ) = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   casti_m128i( ctx->H, 7 ) = m128_const1_64( 0x5BE0CD195BE0CD19 );
+   casti_m128i( ctx->H, 0 ) = _mm_set1_epi64x( 0x6A09E6676A09E667 );
+   casti_m128i( ctx->H, 1 ) = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
+   casti_m128i( ctx->H, 2 ) = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
+   casti_m128i( ctx->H, 3 ) = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
+   casti_m128i( ctx->H, 4 ) = _mm_set1_epi64x( 0x510E527F510E527F );
+   casti_m128i( ctx->H, 5 ) = _mm_set1_epi64x( 0x9B05688C9B05688C );
+   casti_m128i( ctx->H, 6 ) = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   casti_m128i( ctx->H, 7 ) = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );
   ctx->T0 = ctx->T1 = 0;
   ctx->ptr = 0;
   ctx->rounds = rounds;
@@ -2059,13 +2098,13 @@ blake32_4way_close( blake_4way_small_context *ctx, unsigned ub, unsigned n,
   else
      ctx->T0 -= 512 - bit_len;

-   buf[vptr] = m128_const1_64( 0x0000008000000080 );
+   buf[vptr] = _mm_set1_epi64x( 0x0000008000000080 );

   if ( vptr < 12 )
   {
      memset_zero_128( buf + vptr + 1, 13 - vptr  );
      buf[ 13 ] = _mm_or_si128( buf[ 13 ],
-                                m128_const1_64( 0x0100000001000000ULL ) );
+                                _mm_set1_epi64x( 0x0100000001000000ULL ) );
      buf[ 14 ] = _mm_set1_epi32( bswap_32( th ) );
      buf[ 15 ] = _mm_set1_epi32( bswap_32( tl ) );
      blake32_4way( ctx, buf + vptr, 64 - ptr );
@@ -2078,7 +2117,7 @@ blake32_4way_close( blake_4way_small_context *ctx, unsigned ub, unsigned n,
      ctx->T1 = 0xFFFFFFFFUL;
      memset_zero_128( buf, 56>>2 );
      buf[ 13 ] = _mm_or_si128( buf[ 13 ],
-                                m128_const1_64( 0x0100000001000000ULL ) );
+                                _mm_set1_epi64x( 0x0100000001000000ULL ) );
      buf[ 14 ] = _mm_set1_epi32( bswap_32( th ) );
      buf[ 15 ] = _mm_set1_epi32( bswap_32( tl ) );
      blake32_4way( ctx, buf, 64 );
@@ -2097,14 +2136,14 @@ static void
 blake32_8way_init( blake_8way_small_context *sc, const sph_u32 *iv,
                   const sph_u32 *salt, int rounds )
 {
-   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E6676A09E667 );
-   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE85BB67AE85 );
-   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF3723C6EF372 );
-   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53AA54FF53A );
-   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527F510E527F );
-   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C9B05688C );
-   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9AB1F83D9AB );
-   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD195BE0CD19 );
+   casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
+   casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
+   casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
+   casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
+   casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527F510E527F );
+   casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C9B05688C );
+   casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );
   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
   sc->rounds = rounds;
@@ -2163,7 +2202,7 @@ blake32_8way_close( blake_8way_small_context *sc, unsigned ub, unsigned n,

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = m256_const1_64( 0x0000008000000080ULL );
+   buf[ptr>>2] = _mm256_set1_epi64x( 0x0000008000000080ULL );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -2185,7 +2224,7 @@ blake32_8way_close( blake_8way_small_context *sc, unsigned ub, unsigned n,
       memset_zero_256( buf + (ptr>>2) + 1, (52 - ptr) >> 2 );
       if ( out_size_w32 == 8 )
           buf[52>>2] = _mm256_or_si256( buf[52>>2],
-                                m256_const1_64( 0x0100000001000000ULL ) );
+                                _mm256_set1_epi64x( 0x0100000001000000ULL ) );
       *(buf+(56>>2)) = _mm256_set1_epi32( bswap_32( th ) );
       *(buf+(60>>2)) = _mm256_set1_epi32( bswap_32( tl ) );
       blake32_8way( sc, buf + (ptr>>2), 64 - ptr );
@@ -2198,7 +2237,7 @@ blake32_8way_close( blake_8way_small_context *sc, unsigned ub, unsigned n,
       sc->T1 = SPH_C32(0xFFFFFFFFUL);
       memset_zero_256( buf, 56>>2 );
       if ( out_size_w32 == 8 )
-           buf[52>>2] = m256_const1_64( 0x0100000001000000ULL );
+           buf[52>>2] = _mm256_set1_epi64x( 0x0100000001000000ULL );
       *(buf+(56>>2)) = _mm256_set1_epi32( bswap_32( th ) );
       *(buf+(60>>2)) = _mm256_set1_epi32( bswap_32( tl ) );
       blake32_8way( sc, buf, 64 );
@@ -2259,7 +2298,7 @@ blake32_8way_close_le( blake_8way_small_context *sc, unsigned ub, unsigned n,

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = m256_const1_32( 0x80000000 );
+   buf[ptr>>2] = _mm256_set1_epi32( 0x80000000 );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -2312,14 +2351,14 @@ static void
 blake32_16way_init( blake_16way_small_context *sc, const sph_u32 *iv,
                   const sph_u32 *salt, int rounds )
 {
-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E6676A09E667 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE85BB67AE85 );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF3723C6EF372 );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53AA54FF53A );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527F510E527F );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C9B05688C );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9AB1F83D9AB );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD195BE0CD19 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E6676A09E667 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527F510E527F );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C9B05688C );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );
   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
   sc->rounds = rounds;
@@ -2376,7 +2415,7 @@ blake32_16way_close( blake_16way_small_context *sc, unsigned ub, unsigned n,

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = m512_const1_64( 0x0000008000000080ULL );
+   buf[ptr>>2] = _mm512_set1_epi64( 0x0000008000000080ULL );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -2398,7 +2437,7 @@ blake32_16way_close( blake_16way_small_context *sc, unsigned ub, unsigned n,
       memset_zero_512( buf + (ptr>>2) + 1, (52 - ptr) >> 2 );
       if ( out_size_w32 == 8 )
           buf[52>>2] = _mm512_or_si512( buf[52>>2],
-                                m512_const1_64( 0x0100000001000000ULL ) );
+                                _mm512_set1_epi64( 0x0100000001000000ULL ) );
       buf[56>>2] = _mm512_set1_epi32( bswap_32( th ) );
       buf[60>>2] = _mm512_set1_epi32( bswap_32( tl ) );
       blake32_16way( sc, buf + (ptr>>2), 64 - ptr );
@@ -2411,7 +2450,7 @@ blake32_16way_close( blake_16way_small_context *sc, unsigned ub, unsigned n,
       sc->T1 = 0xFFFFFFFFUL;
       memset_zero_512( buf, 56>>2 );
       if ( out_size_w32 == 8 )
-          buf[52>>2] = m512_const1_64( 0x0100000001000000ULL );
+          buf[52>>2] = _mm512_set1_epi64( 0x0100000001000000ULL );
       buf[56>>2] = _mm512_set1_epi32( bswap_32( th ) );
       buf[60>>2] = _mm512_set1_epi32( bswap_32( tl ) );
       blake32_16way( sc, buf, 64 );
@@ -2473,7 +2512,7 @@ blake32_16way_close_le( blake_16way_small_context *sc, unsigned ub, unsigned n,

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = m512_const1_32( 0x80000000 );
+   buf[ptr>>2] = _mm512_set1_epi32( 0x80000000 );
   tl = sc->T0 + bit_len;
   th = sc->T1;

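Note on the hunks above: every 64-bit broadcast constant in the blake32 code carries the same 32-bit word in both halves (for example 0x6A09E6676A09E667), so a 64-bit broadcast leaves the same bytes in the register as a 32-bit broadcast of that word. The sketch below checks this with plain arrays rather than intrinsics so it builds on any target. The helper names bcast64_128, bcast32_128 and bcast_equiv are hypothetical and do not appear in the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store two copies of a 64-bit value into 16 bytes,
   mimicking the memory image of a 64-bit broadcast into a 128-bit vector. */
static void bcast64_128( uint8_t out[16], uint64_t x )
{
   uint64_t lanes[2] = { x, x };
   memcpy( out, lanes, sizeof lanes );
}

/* Hypothetical helper: store four copies of a 32-bit value into 16 bytes,
   mimicking the memory image of a 32-bit broadcast into a 128-bit vector. */
static void bcast32_128( uint8_t out[16], uint32_t x )
{
   uint32_t lanes[4] = { x, x, x, x };
   memcpy( out, lanes, sizeof lanes );
}

/* Returns 1 when broadcasting the duplicated 64-bit constant produces the
   same 16 bytes as broadcasting the 32-bit word itself. */
static int bcast_equiv( uint32_t iv )
{
   uint8_t a[16], b[16];
   const uint64_t dup = ( (uint64_t)iv << 32 ) | iv;
   bcast64_128( a, dup );
   bcast32_128( b, iv );
   return memcmp( a, b, 16 ) == 0;
}
```

Calling bcast_equiv( 0x6A09E667 ) should return 1 on both little-endian and big-endian hosts, since the two halves of the 64-bit value are identical.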
--- a/algo/blake/blake2b-hash-4way.c
+++ b/algo/blake/blake2b-hash-4way.c
@@ -252,14 +252,14 @@ static void blake2b_8way_compress( blake2b_8way_ctx *ctx, int last )
   v[ 5] = ctx->h[5];
   v[ 6] = ctx->h[6];
   v[ 7] = ctx->h[7];
-   v[ 8] = m512_const1_64( 0x6A09E667F3BCC908 );
-   v[ 9] = m512_const1_64( 0xBB67AE8584CAA73B );
-   v[10] = m512_const1_64( 0x3C6EF372FE94F82B );
-   v[11] = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   v[12] = m512_const1_64( 0x510E527FADE682D1 );
-   v[13] = m512_const1_64( 0x9B05688C2B3E6C1F );
-   v[14] = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   v[15] = m512_const1_64( 0x5BE0CD19137E2179 );
+   v[ 8] = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   v[ 9] = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   v[10] = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   v[11] = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   v[12] = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   v[13] = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   v[14] = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   v[15] = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   v[12] = _mm512_xor_si512( v[12], _mm512_set1_epi64( ctx->t[0] ) );
   v[13] = _mm512_xor_si512( v[13], _mm512_set1_epi64( ctx->t[1] ) );
@@ -310,16 +310,16 @@ int blake2b_8way_init( blake2b_8way_ctx *ctx )
 {
   size_t i;

-   ctx->h[0] = m512_const1_64( 0x6A09E667F3BCC908 );
-   ctx->h[1] = m512_const1_64( 0xBB67AE8584CAA73B );
-   ctx->h[2] = m512_const1_64( 0x3C6EF372FE94F82B );
-   ctx->h[3] = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   ctx->h[4] = m512_const1_64( 0x510E527FADE682D1 );
-   ctx->h[5] = m512_const1_64( 0x9B05688C2B3E6C1F );
-   ctx->h[6] = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   ctx->h[7] = m512_const1_64( 0x5BE0CD19137E2179 );
+   ctx->h[0] = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   ctx->h[1] = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   ctx->h[2] = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   ctx->h[3] = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   ctx->h[4] = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   ctx->h[5] = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   ctx->h[6] = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   ctx->h[7] = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

-   ctx->h[0] = _mm512_xor_si512( ctx->h[0], m512_const1_64( 0x01010020 ) );
+   ctx->h[0] = _mm512_xor_si512( ctx->h[0], _mm512_set1_epi64( 0x01010020 ) );

   ctx->t[0] = 0;
   ctx->t[1] = 0;
@@ -419,14 +419,14 @@ static void blake2b_4way_compress( blake2b_4way_ctx *ctx, int last )
   v[ 5] = ctx->h[5];
   v[ 6] = ctx->h[6];
   v[ 7] = ctx->h[7];
-   v[ 8] = m256_const1_64( 0x6A09E667F3BCC908 );
-   v[ 9] = m256_const1_64( 0xBB67AE8584CAA73B );
-   v[10] = m256_const1_64( 0x3C6EF372FE94F82B );
-   v[11] = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   v[12] = m256_const1_64( 0x510E527FADE682D1 );
-   v[13] = m256_const1_64( 0x9B05688C2B3E6C1F );
-   v[14] = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   v[15] = m256_const1_64( 0x5BE0CD19137E2179 );
+   v[ 8] = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   v[ 9] = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   v[10] = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   v[11] = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   v[12] = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   v[13] = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   v[14] = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   v[15] = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );

   v[12] = _mm256_xor_si256( v[12], _mm256_set1_epi64x( ctx->t[0] ) );
   v[13] = _mm256_xor_si256( v[13], _mm256_set1_epi64x( ctx->t[1] ) );
@@ -477,16 +477,16 @@ int blake2b_4way_init( blake2b_4way_ctx *ctx )
 {
 	size_t i;

-   ctx->h[0] = m256_const1_64( 0x6A09E667F3BCC908 );
-   ctx->h[1] = m256_const1_64( 0xBB67AE8584CAA73B );
-   ctx->h[2] = m256_const1_64( 0x3C6EF372FE94F82B );
-   ctx->h[3] = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   ctx->h[4] = m256_const1_64( 0x510E527FADE682D1 );
-   ctx->h[5] = m256_const1_64( 0x9B05688C2B3E6C1F );
-   ctx->h[6] = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   ctx->h[7] = m256_const1_64( 0x5BE0CD19137E2179 );
+   ctx->h[0] = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   ctx->h[1] = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   ctx->h[2] = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   ctx->h[3] = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   ctx->h[4] = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   ctx->h[5] = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   ctx->h[6] = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   ctx->h[7] = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );

-   ctx->h[0] = _mm256_xor_si256( ctx->h[0], m256_const1_64( 0x01010020 ) );
+   ctx->h[0] = _mm256_xor_si256( ctx->h[0], _mm256_set1_epi64x( 0x01010020 ) );

 	ctx->t[0] = 0;
 	ctx->t[1] = 0;
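Note on the 0x01010020 constant XORed into h[0] in both init functions above: it matches the first BLAKE2 parameter word for an unkeyed, sequential hash with a 32-byte digest (digest length in byte 0, key length in byte 1, fanout 1 in byte 2, depth 1 in byte 3). The sketch below rebuilds that word from its fields. The helper names blake2_param_word0 and blake2b_h0 are hypothetical and are not part of the patch.

```c
#include <stdint.h>

/* Hypothetical helper: low 32 bits of the first BLAKE2 parameter word.
   Sequential hashing uses fanout = 1 and depth = 1. */
static uint64_t blake2_param_word0( uint8_t outlen, uint8_t keylen )
{
   const uint64_t fanout = 1, depth = 1;
   return (uint64_t)outlen
        | ( (uint64_t)keylen << 8 )
        | ( fanout << 16 )
        | ( depth  << 24 );
}

/* Hypothetical helper: initial h[0] of unkeyed BLAKE2b for a given digest
   length, i.e. the first IV word XORed with the parameter word. */
static uint64_t blake2b_h0( uint8_t outlen )
{
   const uint64_t iv0 = 0x6A09E667F3BCC908ULL;
   return iv0 ^ blake2_param_word0( outlen, 0 );
}
```

With outlen = 32 and keylen = 0 the parameter word evaluates to 0x01010020, which is the value the vectorized init broadcasts to every lane.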
--- a/algo/blake/blake2s-4way.c
+++ b/algo/blake/blake2s-4way.c
@@ -1,6 +1,5 @@
 #include "blake2s-gate.h"
 #include "blake2s-hash-4way.h"
-//#include "sph-blake2s.h"
 #include <string.h>
 #include <stdint.h>

@@ -8,43 +7,6 @@

 static __thread blake2s_16way_state blake2s_16w_ctx;

-/*
-static blake2s_16way_state blake2s_16w_ctx;
-static uint32_t blake2s_16way_vdata[20*16] __attribute__ ((aligned (64)));
-*/
-/*
-int blake2s_16way_prehash( struct work *work )
-{
-   uint32_t edata[20] __attribute__ ((aligned (64)));
-   blake2s_state ctx;
-   mm128_bswap32_80( edata, work->data );
-   blake2s_init( &ctx, BLAKE2S_OUTBYTES );
-   ctx.buflen = ctx.t[0] = 64;
-   blake2s_compress( &ctx, (const uint8_t*)edata );
-
-   blake2s_16way_init( &blake2s_16w_ctx, BLAKE2S_OUTBYTES );
-   intrlv_16x32( blake2s_16w_ctx.h, ctx.h, ctx.h, ctx.h, ctx.h,
-                                    ctx.h, ctx.h, ctx.h, ctx.h,
-                                    ctx.h, ctx.h, ctx.h, ctx.h,
-                                    ctx.h, ctx.h, ctx.h, ctx.h, 256 );
-   intrlv_16x32( blake2s_16way_vdata, edata, edata, edata, edata,
-                                      edata, edata, edata, edata,
-                                      edata, edata, edata, edata,
-                                      edata, edata, edata, edata, 640 );
-   blake2s_16w_ctx.t[0] = 64;
-   return 1;
-}
-*/
-/*
-int blake2s_16way_prehash( struct work *work )
-{
-   mm512_bswap32_intrlv80_16x32( blake2s_16way_vdata, work->data );
-   blake2s_16way_init( &blake2s_16w_ctx, BLAKE2S_OUTBYTES );
-   blake2s_16way_update( &blake2s_16w_ctx, blake2s_16way_vdata, 64 );
-   return 1;
-}
-*/
-
 void blake2s_16way_hash( void *output, const void *input )
 {
   blake2s_16way_state ctx;
@@ -68,40 +30,10 @@ int scanhash_blake2s_16way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   int thr_id = mythr->id;  

-/*   
-//   pthread_rwlock_rdlock( &g_work_lock );
-       memcpy( (__m512i*)vdata +16, (__m512i*)blake2s_16way_vdata +16, 3*4*16 );
-//     casti_m512i( vdata, 16 ) = casti_m512i( blake2s_16way_vdata, 16 );
-//     casti_m512i( vdata, 17 ) = casti_m512i( blake2s_16way_vdata, 17 );
-//     casti_m512i( vdata, 18 ) = casti_m512i( blake2s_16way_vdata, 18 );
-       
-//   pthread_rwlock_unlock( &g_work_lock );
-*/
-/*
-   uint32_t edata[20] __attribute__ ((aligned (64)));
-   blake2s_state ctx;
-   mm128_bswap32_80( edata, pdata );
-   blake2s_init( &ctx, BLAKE2S_OUTBYTES );
-   ctx.buflen = ctx.t[0] = 64;
-   blake2s_compress( &ctx, (const uint8_t*)edata );
-
-   blake2s_16way_init( &blake2s_16w_ctx, BLAKE2S_OUTBYTES );
-   intrlv_16x32( blake2s_16w_ctx.h, ctx.h, ctx.h, ctx.h, ctx.h,
-                                    ctx.h, ctx.h, ctx.h, ctx.h,
-                                    ctx.h, ctx.h, ctx.h, ctx.h,
-                                    ctx.h, ctx.h, ctx.h, ctx.h, 256 );
-   intrlv_16x32( blake2s_16way_blake2s_16way_vdata, edata, edata, edata, edata,
-                                      edata, edata, edata, edata,
-                                      edata, edata, edata, edata,
-                                      edata, edata, edata, edata, 640 );
-   blake2s_16w_ctx.t[0] = 64;
-*/
-   
   mm512_bswap32_intrlv80_16x32( vdata, pdata );
   blake2s_16way_init( &blake2s_16w_ctx, BLAKE2S_OUTBYTES );
   blake2s_16way_update( &blake2s_16w_ctx, vdata, 64 );

-
   do {
      *noncev = mm512_bswap_32( _mm512_set_epi32(
 	                  n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
@@ -131,36 +63,6 @@ int scanhash_blake2s_16way( struct work *work, uint32_t max_nonce,

 static __thread blake2s_8way_state blake2s_8w_ctx;

-/*
-static blake2s_8way_state blake2s_8w_ctx;
-static uint32_t blake2s_8way_vdata[20*8] __attribute__ ((aligned (32)));
-
-int blake2s_8way_prehash( struct work *work )
-{
-   uint32_t edata[20] __attribute__ ((aligned (64)));
-   blake2s_state ctx;
-   mm128_bswap32_80( edata, work->data );
-   blake2s_init( &ctx, BLAKE2S_OUTBYTES );
-   ctx.buflen = ctx.t[0] = 64;
-   blake2s_compress( &ctx, (const uint8_t*)edata );
-
-   blake2s_8way_init( &blake2s_8w_ctx, BLAKE2S_OUTBYTES );
-
-   for ( int i = 0; i < 8; i++ )
-      casti_m256i( blake2s_8w_ctx.h, i ) = _mm256_set1_epi32( ctx.h[i] );
-
-   casti_m256i( blake2s_8way_vdata, 16 ) = _mm256_set1_epi32( edata[16] );
-   casti_m256i( blake2s_8way_vdata, 17 ) = _mm256_set1_epi32( edata[17] );
-   casti_m256i( blake2s_8way_vdata, 18 ) = _mm256_set1_epi32( edata[18] );
-
-//   intrlv_8x32( blake2s_8w_ctx.h, ctx.h, ctx.h, ctx.h, ctx.h,
-//                                  ctx.h, ctx.h, ctx.h, ctx.h, 256 );
-//   intrlv_8x32( blake2s_8way_vdata, edata, edata, edata, edata,
-//                                    edata, edata, edata, edata, 640 );
-   blake2s_8w_ctx.t[0] = 64;
-}
-*/
-
 void blake2s_8way_hash( void *output, const void *input )
 {
   blake2s_8way_state ctx;
@@ -184,41 +86,10 @@ int scanhash_blake2s_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   int thr_id = mythr->id; 

-/*   
-//   pthread_rwlock_rdlock( &g_work_lock );
-       memcpy( &vdata[16*8], &blake2s_8way_vdata[16*8], 3*4*8 );
-//   pthread_rwlock_unlock( &g_work_lock );
-*/
-/*
-   uint32_t edata[20] __attribute__ ((aligned (64)));
-   blake2s_state ctx;
-   mm128_bswap32_80( edata, pdata );
-   blake2s_init( &ctx, BLAKE2S_OUTBYTES );
-   ctx.buflen = ctx.t[0] = 64;
-   blake2s_compress( &ctx, (const uint8_t*)edata );
-
-   blake2s_8way_init( &blake2s_8w_ctx, BLAKE2S_OUTBYTES );
-   for ( int i = 0; i < 8; i++ )
-      casti_m256i( blake2s_8w_ctx.h, i ) = _mm256_set1_epi32( ctx.h[i] );
-
-   casti_m256i( vdata, 16 ) = _mm256_set1_epi32( edata[16] );
-   casti_m256i( vdata, 17 ) = _mm256_set1_epi32( edata[17] );
-   casti_m256i( vdata, 18 ) = _mm256_set1_epi32( edata[18] );
-
-
-//  intrlv_8x32( blake2s_8w_ctx.h, ctx.h, ctx.h, ctx.h, ctx.h,
-//                                  ctx.h, ctx.h, ctx.h, ctx.h, 256 );
-//   intrlv_8x32( vdata, edata, edata, edata, edata,
-//                                    edata, edata, edata, edata, 640 );
-
-   blake2s_8w_ctx.t[0] = 64;
-*/
-   
   mm256_bswap32_intrlv80_8x32( vdata, pdata );
   blake2s_8way_init( &blake2s_8w_ctx, BLAKE2S_OUTBYTES );
   blake2s_8way_update( &blake2s_8w_ctx, vdata, 64 );

-
   do {
      *noncev = mm256_bswap_32( _mm256_set_epi32( n+7, n+6, n+5, n+4,
                                                  n+3, n+2, n+1, n ) );
@@ -246,25 +117,7 @@ int scanhash_blake2s_8way( struct work *work, uint32_t max_nonce,
 #elif defined(BLAKE2S_4WAY)

 static __thread blake2s_4way_state blake2s_4w_ctx;
-/*
-static blake2s_4way_state blake2s_4w_ctx;
-static uint32_t blake2s_4way_vdata[20*4] __attribute__ ((aligned (32)));

-int blake2s_4way_prehash( struct work *work )
-{
-   uint32_t edata[20] __attribute__ ((aligned (64)));
-   blake2s_state ctx;
-   mm128_bswap32_80( edata, work->data );
-   blake2s_init( &ctx, BLAKE2S_OUTBYTES );
-   ctx.buflen = ctx.t[0] = 64;
-   blake2s_compress( &ctx, (const uint8_t*)edata );
-
-   blake2s_4way_init( &blake2s_4w_ctx, BLAKE2S_OUTBYTES );
-   intrlv_4x32( blake2s_4w_ctx.h, ctx.h, ctx.h, ctx.h, ctx.h, 256 );
-   intrlv_4x32( blake2s_4way_vdata, edata, edata, edata, edata, 640 );
-   blake2s_4w_ctx.t[0] = 64;
-}
-*/
 void blake2s_4way_hash( void *output, const void *input )
 {
   blake2s_4way_state ctx;
@@ -287,15 +140,11 @@ int scanhash_blake2s_4way( struct work *work, uint32_t max_nonce,
   __m128i  *noncev = (__m128i*)vdata + 19;   // aligned
   uint32_t n = first_nonce;
   int thr_id = mythr->id; 
-/*
-   pthread_rwlock_rdlock( &g_work_lock );
-       memcpy( vdata, blake2s_4way_vdata, sizeof vdata );
-   pthread_rwlock_unlock( &g_work_lock );
-*/
+
   mm128_bswap32_intrlv80_4x32( vdata, pdata );
   blake2s_4way_init( &blake2s_4w_ctx, BLAKE2S_OUTBYTES );
   blake2s_4way_update( &blake2s_4w_ctx, vdata, 64 );
-   
+
   do {
      *noncev = mm128_bswap_32( _mm_set_epi32( n+3, n+2, n+1, n ) );
      pdata[19] = n;
--- a/algo/blake/blake2s-gate.c
+++ b/algo/blake/blake2s-gate.c
@@ -5,15 +5,13 @@ bool register_blake2s_algo( algo_gate_t* gate )
 #if defined(BLAKE2S_16WAY)
  gate->scanhash  = (void*)&scanhash_blake2s_16way;
  gate->hash      = (void*)&blake2s_16way_hash;
-//  gate->prehash   = (void*)&blake2s_16way_prehash;
 #elif defined(BLAKE2S_8WAY)
+//#if defined(BLAKE2S_8WAY)
  gate->scanhash  = (void*)&scanhash_blake2s_8way;
  gate->hash      = (void*)&blake2s_8way_hash;
-//  gate->prehash   = (void*)&blake2s_8way_prehash;
 #elif defined(BLAKE2S_4WAY)
  gate->scanhash  = (void*)&scanhash_blake2s_4way;
  gate->hash      = (void*)&blake2s_4way_hash;
-//  gate->prehash   = (void*)&blake2s_4way_prehash;
 #else
  gate->scanhash  = (void*)&scanhash_blake2s;
  gate->hash      = (void*)&blake2s_hash;
--- a/algo/blake/blake2s-gate.h
+++ b/algo/blake/blake2s-gate.h
@@ -23,22 +23,18 @@ bool register_blake2s_algo( algo_gate_t* gate );
 void blake2s_16way_hash( void *state, const void *input );
 int scanhash_blake2s_16way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
-int blake2s_16way_prehash( struct work * );

 #elif defined (BLAKE2S_8WAY)

 void blake2s_8way_hash( void *state, const void *input );
 int scanhash_blake2s_8way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
-int blake2s_8way_prehash( struct work * );

 #elif defined (BLAKE2S_4WAY)

 void blake2s_4way_hash( void *state, const void *input );
 int scanhash_blake2s_4way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
-int blake2s_4way_prehash( struct work * );
-
 #else

 void blake2s_hash( void *state, const void *input );
--- a/algo/blake/blake2s-hash-4way.c
+++ b/algo/blake/blake2s-hash-4way.c
@@ -62,14 +62,14 @@ int blake2s_4way_init( blake2s_4way_state *S, const uint8_t outlen )

   memset( S, 0, sizeof( blake2s_4way_state ) );

-   S->h[0] = m128_const1_64( 0x6A09E6676A09E667ULL );
-   S->h[1] = m128_const1_64( 0xBB67AE85BB67AE85ULL );
-   S->h[2] = m128_const1_64( 0x3C6EF3723C6EF372ULL );
-   S->h[3] = m128_const1_64( 0xA54FF53AA54FF53AULL );
-   S->h[4] = m128_const1_64( 0x510E527F510E527FULL );
-   S->h[5] = m128_const1_64( 0x9B05688C9B05688CULL );
-   S->h[6] = m128_const1_64( 0x1F83D9AB1F83D9ABULL );
-   S->h[7] = m128_const1_64( 0x5BE0CD195BE0CD19ULL );
+   S->h[0] = _mm_set1_epi64x( 0x6A09E6676A09E667ULL );
+   S->h[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85ULL );
+   S->h[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372ULL );
+   S->h[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53AULL );
+   S->h[4] = _mm_set1_epi64x( 0x510E527F510E527FULL );
+   S->h[5] = _mm_set1_epi64x( 0x9B05688C9B05688CULL );
+   S->h[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9ABULL );
+   S->h[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19ULL );
   
 //   for( int i = 0; i < 8; ++i )
 //      S->h[i] = _mm_set1_epi32( blake2s_IV[i] );
@@ -90,23 +90,23 @@ int blake2s_4way_compress( blake2s_4way_state *S, const __m128i* block )
   memcpy_128( m, block, 16 );
   memcpy_128( v, S->h, 8 );

-   v[ 8] = m128_const1_64( 0x6A09E6676A09E667ULL );
-   v[ 9] = m128_const1_64( 0xBB67AE85BB67AE85ULL );
-   v[10] = m128_const1_64( 0x3C6EF3723C6EF372ULL );
-   v[11] = m128_const1_64( 0xA54FF53AA54FF53AULL );
+   v[ 8] = _mm_set1_epi64x( 0x6A09E6676A09E667ULL );
+   v[ 9] = _mm_set1_epi64x( 0xBB67AE85BB67AE85ULL );
+   v[10] = _mm_set1_epi64x( 0x3C6EF3723C6EF372ULL );
+   v[11] = _mm_set1_epi64x( 0xA54FF53AA54FF53AULL );
   v[12] = _mm_xor_si128( _mm_set1_epi32( S->t[0] ),
-                          m128_const1_64( 0x510E527F510E527FULL ) );
+                          _mm_set1_epi64x( 0x510E527F510E527FULL ) );
   v[13] = _mm_xor_si128( _mm_set1_epi32( S->t[1] ),
-                          m128_const1_64( 0x9B05688C9B05688CULL ) );
+                          _mm_set1_epi64x( 0x9B05688C9B05688CULL ) );
   v[14] = _mm_xor_si128( _mm_set1_epi32( S->f[0] ),
-                          m128_const1_64( 0x1F83D9AB1F83D9ABULL ) );
+                          _mm_set1_epi64x( 0x1F83D9AB1F83D9ABULL ) );
   v[15] = _mm_xor_si128( _mm_set1_epi32( S->f[1] ),
-                          m128_const1_64( 0x5BE0CD195BE0CD19ULL ) );
+                          _mm_set1_epi64x( 0x5BE0CD195BE0CD19ULL ) );

 #define G4W( sigma0, sigma1, a, b, c, d ) \
 do { \
-   const uint8_t s0 = sigma0; \
-   const uint8_t s1 = sigma1; \
+   uint8_t s0 = sigma0; \
+   uint8_t s1 = sigma1; \
   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ s0 ] ); \
   d = mm128_swap32_16( _mm_xor_si128( d, a ) ); \
   c = _mm_add_epi32( c, d ); \
@@ -120,7 +120,7 @@ do { \

 #define ROUND4W(r)  \
 do { \
-   const uint8_t *sigma = (const uint8_t*)&blake2s_sigma[r]; \
+   uint8_t *sigma = (uint8_t*)&blake2s_sigma[r]; \
   G4W( sigma[ 0], sigma[ 1], v[ 0], v[ 4], v[ 8], v[12] ); \
   G4W( sigma[ 2], sigma[ 3], v[ 1], v[ 5], v[ 9], v[13] ); \
   G4W( sigma[ 4], sigma[ 5], v[ 2], v[ 6], v[10], v[14] ); \
@@ -269,21 +269,21 @@ int blake2s_8way_compress( blake2s_8way_state *S, const __m256i *block )
   memcpy_256( m, block, 16 );
   memcpy_256( v, S->h, 8 );

-   v[ 8] = m256_const1_64( 0x6A09E6676A09E667ULL );
-   v[ 9] = m256_const1_64( 0xBB67AE85BB67AE85ULL );
-   v[10] = m256_const1_64( 0x3C6EF3723C6EF372ULL );
-   v[11] = m256_const1_64( 0xA54FF53AA54FF53AULL );
+   v[ 8] = _mm256_set1_epi64x( 0x6A09E6676A09E667ULL );
+   v[ 9] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85ULL );
+   v[10] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372ULL );
+   v[11] = _mm256_set1_epi64x( 0xA54FF53AA54FF53AULL );
   v[12] = _mm256_xor_si256( _mm256_set1_epi32( S->t[0] ),
-                          m256_const1_64( 0x510E527F510E527FULL ) );
+                          _mm256_set1_epi64x( 0x510E527F510E527FULL ) );

   v[13] = _mm256_xor_si256( _mm256_set1_epi32( S->t[1] ),
-                          m256_const1_64( 0x9B05688C9B05688CULL ) );
+                          _mm256_set1_epi64x( 0x9B05688C9B05688CULL ) );

   v[14] = _mm256_xor_si256( _mm256_set1_epi32( S->f[0] ),
-                          m256_const1_64( 0x1F83D9AB1F83D9ABULL ) );
+                          _mm256_set1_epi64x( 0x1F83D9AB1F83D9ABULL ) );

   v[15] = _mm256_xor_si256( _mm256_set1_epi32( S->f[1] ),
-                          m256_const1_64( 0x5BE0CD195BE0CD19ULL ) );
+                          _mm256_set1_epi64x( 0x5BE0CD195BE0CD19ULL ) );

 /*
   v[ 8] = _mm256_set1_epi32( blake2s_IV[0] );
@@ -317,8 +317,8 @@ do { \

 #define G8W( sigma0, sigma1, a, b, c, d) \
 do { \
-   const uint8_t s0 = sigma0; \
-   const uint8_t s1 = sigma1; \
+   uint8_t s0 = sigma0; \
+   uint8_t s1 = sigma1; \
   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), m[ s0 ] ); \
   d = mm256_swap32_16( _mm256_xor_si256( d, a ) ); \
   c = _mm256_add_epi32( c, d ); \
@@ -331,7 +331,7 @@ do { \

 #define ROUND8W(r)  \
 do { \
-   const uint8_t *sigma = (const uint8_t*)&blake2s_sigma[r]; \
+   uint8_t *sigma = (uint8_t*)&blake2s_sigma[r]; \
   G8W( sigma[ 0], sigma[ 1], v[ 0], v[ 4], v[ 8], v[12] ); \
   G8W( sigma[ 2], sigma[ 3], v[ 1], v[ 5], v[ 9], v[13] ); \
   G8W( sigma[ 4], sigma[ 5], v[ 2], v[ 6], v[10], v[14] ); \
@@ -391,14 +391,14 @@ int blake2s_8way_init( blake2s_8way_state *S, const uint8_t outlen )
   memset( P->personal, 0, sizeof( P->personal ) );

   memset( S, 0, sizeof( blake2s_8way_state ) );
-   S->h[0] = m256_const1_64( 0x6A09E6676A09E667ULL );
-   S->h[1] = m256_const1_64( 0xBB67AE85BB67AE85ULL );
-   S->h[2] = m256_const1_64( 0x3C6EF3723C6EF372ULL );
-   S->h[3] = m256_const1_64( 0xA54FF53AA54FF53AULL );
-   S->h[4] = m256_const1_64( 0x510E527F510E527FULL );
-   S->h[5] = m256_const1_64( 0x9B05688C9B05688CULL );
-   S->h[6] = m256_const1_64( 0x1F83D9AB1F83D9ABULL );
-   S->h[7] = m256_const1_64( 0x5BE0CD195BE0CD19ULL );
+   S->h[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667ULL );
+   S->h[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85ULL );
+   S->h[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372ULL );
+   S->h[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53AULL );
+   S->h[4] = _mm256_set1_epi64x( 0x510E527F510E527FULL );
+   S->h[5] = _mm256_set1_epi64x( 0x9B05688C9B05688CULL );
+   S->h[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9ABULL );
+   S->h[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19ULL );


 //   for( int i = 0; i < 8; ++i )
@@ -510,27 +510,27 @@ int blake2s_16way_compress( blake2s_16way_state *S, const __m512i *block )
   memcpy_512( m, block, 16 );
   memcpy_512( v, S->h, 8 );

-   v[ 8] = m512_const1_64( 0x6A09E6676A09E667ULL );
-   v[ 9] = m512_const1_64( 0xBB67AE85BB67AE85ULL );
-   v[10] = m512_const1_64( 0x3C6EF3723C6EF372ULL );
-   v[11] = m512_const1_64( 0xA54FF53AA54FF53AULL );
+   v[ 8] = _mm512_set1_epi64( 0x6A09E6676A09E667ULL );
+   v[ 9] = _mm512_set1_epi64( 0xBB67AE85BB67AE85ULL );
+   v[10] = _mm512_set1_epi64( 0x3C6EF3723C6EF372ULL );
+   v[11] = _mm512_set1_epi64( 0xA54FF53AA54FF53AULL );
   v[12] = _mm512_xor_si512( _mm512_set1_epi32( S->t[0] ),
-                          m512_const1_64( 0x510E527F510E527FULL ) );
+                          _mm512_set1_epi64( 0x510E527F510E527FULL ) );

   v[13] = _mm512_xor_si512( _mm512_set1_epi32( S->t[1] ),
-                          m512_const1_64( 0x9B05688C9B05688CULL ) );
+                          _mm512_set1_epi64( 0x9B05688C9B05688CULL ) );

   v[14] = _mm512_xor_si512( _mm512_set1_epi32( S->f[0] ),
-                          m512_const1_64( 0x1F83D9AB1F83D9ABULL ) );
+                          _mm512_set1_epi64( 0x1F83D9AB1F83D9ABULL ) );

   v[15] = _mm512_xor_si512( _mm512_set1_epi32( S->f[1] ),
-                          m512_const1_64( 0x5BE0CD195BE0CD19ULL ) );
+                          _mm512_set1_epi64( 0x5BE0CD195BE0CD19ULL ) );


 #define G16W( sigma0, sigma1, a, b, c, d) \
 do { \
-   const uint8_t s0 = sigma0; \
-   const uint8_t s1 = sigma1; \
+   uint8_t s0 = sigma0; \
+   uint8_t s1 = sigma1; \
   a = _mm512_add_epi32( _mm512_add_epi32( a, b ), m[ s0 ] ); \
   d = mm512_ror_32( _mm512_xor_si512( d, a ), 16 ); \
   c = _mm512_add_epi32( c, d ); \
@@ -543,7 +543,7 @@ do { \

 #define ROUND16W(r)  \
 do { \
-   const uint8_t *sigma = (const uint8_t*)&blake2s_sigma[r]; \
+   uint8_t *sigma = (uint8_t*)&blake2s_sigma[r]; \
   G16W( sigma[ 0], sigma[ 1], v[ 0], v[ 4], v[ 8], v[12] ); \
   G16W( sigma[ 2], sigma[ 3], v[ 1], v[ 5], v[ 9], v[13] ); \
   G16W( sigma[ 4], sigma[ 5], v[ 2], v[ 6], v[10], v[14] ); \
@@ -589,14 +589,14 @@ int blake2s_16way_init( blake2s_16way_state *S, const uint8_t outlen )
   memset( P->personal, 0, sizeof( P->personal ) );

   memset( S, 0, sizeof( blake2s_16way_state ) );
-   S->h[0] = m512_const1_64( 0x6A09E6676A09E667ULL );
-   S->h[1] = m512_const1_64( 0xBB67AE85BB67AE85ULL );
-   S->h[2] = m512_const1_64( 0x3C6EF3723C6EF372ULL );
-   S->h[3] = m512_const1_64( 0xA54FF53AA54FF53AULL );
-   S->h[4] = m512_const1_64( 0x510E527F510E527FULL );
-   S->h[5] = m512_const1_64( 0x9B05688C9B05688CULL );
-   S->h[6] = m512_const1_64( 0x1F83D9AB1F83D9ABULL );
-   S->h[7] = m512_const1_64( 0x5BE0CD195BE0CD19ULL );
+   S->h[0] = _mm512_set1_epi64( 0x6A09E6676A09E667ULL );
+   S->h[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85ULL );
+   S->h[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372ULL );
+   S->h[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53AULL );
+   S->h[4] = _mm512_set1_epi64( 0x510E527F510E527FULL );
+   S->h[5] = _mm512_set1_epi64( 0x9B05688C9B05688CULL );
+   S->h[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9ABULL );
+   S->h[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19ULL );

   uint32_t *p = ( uint32_t * )( P );

--- a/algo/blake/blake2s-hash-4way.h
+++ b/algo/blake/blake2s-hash-4way.h
@@ -20,7 +20,6 @@

 #include <stddef.h>
 #include <stdint.h>
-//#include "sph-blake2s.h"

 #if defined(_MSC_VER)
 #include <inttypes.h>
@@ -34,7 +33,7 @@
 #if defined(__cplusplus)
 extern "C" {
 #endif
-/*
+
 enum blake2s_constant
 {
   BLAKE2S_BLOCKBYTES = 64,
@@ -43,13 +42,6 @@ enum blake2s_constant
   BLAKE2S_SALTBYTES  = 8,
   BLAKE2S_PERSONALBYTES = 8
 };
-*/
-
-#define BLAKE2S_BLOCKBYTES  64
-#define BLAKE2S_OUTBYTES    32
-#define BLAKE2S_KEYBYTES    32
-#define BLAKE2S_SALTBYTES   8
-#define BLAKE2S_PERSONALBYTES  8

 #pragma pack(push, 1)
 typedef struct __blake2s_nway_param
--- a/algo/blake/blake2s.c
+++ b/algo/blake/blake2s.c
@@ -8,6 +8,8 @@
 #include "sph-blake2s.h"

 static __thread blake2s_state blake2s_ctx;
+//static __thread blake2s_state s_ctx;
+#define MIDLEN 76

 void blake2s_hash( void *output, const void *input )
 {
@@ -17,27 +19,37 @@ void blake2s_hash( void *output, const void *input )
   memcpy( &ctx, &blake2s_ctx, sizeof ctx );
   blake2s_update( &ctx, input+64, 16 );
 
+//	blake2s_init(&ctx, BLAKE2S_OUTBYTES);
+//	blake2s_update(&ctx, input, 80);
 	blake2s_final( &ctx, hash, BLAKE2S_OUTBYTES );

 	memcpy(output, hash, 32);
 }
-
+/*
+static void blake2s_hash_end(uint32_t *output, const uint32_t *input)
+{
+	s_ctx.buflen = MIDLEN;
+	memcpy(&s_ctx, &s_midstate, 32 + 16 + MIDLEN);
+	blake2s_update(&s_ctx, (uint8_t*) &input[MIDLEN/4], 80 - MIDLEN);
+	blake2s_final(&s_ctx, (uint8_t*) output, BLAKE2S_OUTBYTES);
+}
+*/
 int scanhash_blake2s( struct work *work,
 	uint32_t max_nonce, uint64_t *hashes_done, struct thr_info *mythr )
 {
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
+        uint32_t *pdata = work->data;
+        uint32_t *ptarget = work->target;

 	uint32_t _ALIGN(64) hash64[8];
 	uint32_t _ALIGN(64) endiandata[20];
-   int thr_id = mythr->id;  
+   int thr_id = mythr->id;  // thr_id arg is deprecated

 	const uint32_t Htarg = ptarget[7];
 	const uint32_t first_nonce = pdata[19];

 	uint32_t n = first_nonce;

-   swab32_array( endiandata, pdata, 20 );
+        swab32_array( endiandata, pdata, 20 );

 	// midstate
 	blake2s_init( &blake2s_ctx, BLAKE2S_OUTBYTES );
@@ -46,12 +58,11 @@ int scanhash_blake2s( struct work *work,
 	do {
 		be32enc(&endiandata[19], n);
 		blake2s_hash( hash64, endiandata );
-      if (hash64[7] <= Htarg )
-      if ( fulltest(hash64, ptarget) && !opt_benchmark )
-      {
-         pdata[19] = n;
-         submit_solution( work, hash64, mythr );
-      }
+		if (hash64[7] <= Htarg && fulltest(hash64, ptarget)) {
+			*hashes_done = n - first_nonce + 1;
+			pdata[19] = n;
+			return true;
+		}
 		n++;

 	} while (n < max_nonce && !work_restart[thr_id].restart);
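Note on the midstate pattern in blake2s.c above: blake2s_hash copies a saved context and absorbs only the last 16 bytes of the 80-byte header (input+64), which suggests that scanhash_blake2s absorbs the first 64 bytes once per work unit, although that update call falls outside the lines shown here. The sketch below illustrates the idea with a toy FNV-1a hash standing in for BLAKE2s. All toy_* names are hypothetical, and the toy hash only demonstrates that resuming from a saved state gives the same result as hashing from scratch.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy incremental hash state, a stand-in for blake2s_state. */
typedef struct { uint64_t h; } toy_ctx;

static void toy_init( toy_ctx *c ) { c->h = 0xcbf29ce484222325ULL; }

/* FNV-1a absorb step: XOR in each byte, then multiply by the FNV prime. */
static void toy_update( toy_ctx *c, const uint8_t *p, size_t n )
{
   for ( size_t i = 0; i < n; i++ )
   {
      c->h ^= p[i];
      c->h *= 0x100000001b3ULL;
   }
}

/* Hash all 80 header bytes from scratch. */
static uint64_t toy_full( const uint8_t hdr[80] )
{
   toy_ctx c;
   toy_init( &c );
   toy_update( &c, hdr, 80 );
   return c.h;
}

/* Resume from a midstate that already absorbed the first 64 bytes. */
static uint64_t toy_from_midstate( const toy_ctx *mid, const uint8_t hdr[80] )
{
   toy_ctx c = *mid;                  /* copy, so the midstate is reusable */
   toy_update( &c, hdr + 64, 16 );    /* only the tail, which holds the nonce */
   return c.h;
}

/* Returns 1 if both paths agree for nonces 0 .. count-1. */
static int toy_midstate_check( uint32_t count )
{
   uint8_t hdr[80];
   toy_ctx mid;
   for ( int i = 0; i < 80; i++ ) hdr[i] = (uint8_t)i;
   toy_init( &mid );
   toy_update( &mid, hdr, 64 );       /* computed once per work unit */
   for ( uint32_t n = 0; n < count; n++ )
   {
      /* big-endian nonce in bytes 76..79, similar in spirit to be32enc */
      hdr[76] = (uint8_t)( n >> 24 );
      hdr[77] = (uint8_t)( n >> 16 );
      hdr[78] = (uint8_t)( n >> 8 );
      hdr[79] = (uint8_t)n;
      if ( toy_from_midstate( &mid, hdr ) != toy_full( hdr ) ) return 0;
   }
   return 1;
}
```

The saving comes from hashing 16 bytes per nonce instead of 80. This only works because the nonce lies beyond the bytes already absorbed into the midstate.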
--- a/algo/blake/blake512-hash-4way.c
+++ b/algo/blake/blake512-hash-4way.c
@@ -350,7 +350,6 @@ static const sph_u64 CB[16] = {
  __m512i M8, M9, MA, MB, MC, MD, ME, MF; \
  __m512i V0, V1, V2, V3, V4, V5, V6, V7; \
  __m512i V8, V9, VA, VB, VC, VD, VE, VF; \
-  __m512i shuf_bswap64; \
  V0 = H0; \
  V1 = H1; \
  V2 = H2; \
@@ -359,18 +358,16 @@ static const sph_u64 CB[16] = {
  V5 = H5; \
  V6 = H6; \
  V7 = H7; \
-  V8 = m512_const1_64( CB0 );  \
-  V9 = m512_const1_64( CB1 );  \
-  VA = m512_const1_64( CB2 );  \
-  VB = m512_const1_64( CB3 );  \
+  V8 = _mm512_set1_epi64( CB0 );  \
+  V9 = _mm512_set1_epi64( CB1 );  \
+  VA = _mm512_set1_epi64( CB2 );  \
+  VB = _mm512_set1_epi64( CB3 );  \
  VC = _mm512_set1_epi64( T0 ^ CB4 ); \
  VD = _mm512_set1_epi64( T0 ^ CB5 ); \
  VE = _mm512_set1_epi64( T1 ^ CB6 ); \
  VF = _mm512_set1_epi64( T1 ^ CB7 ); \
-  shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637, \
-                                0x28292a2b2c2d2e2f, 0x2021222324252627, \
-                                0x18191a1b1c1d1e1f, 0x1011121314151617, \
-                                0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
+  const __m512i shuf_bswap64 = mm512_bcast_m128( _mm_set_epi64x( \
+                                   0x08090a0b0c0d0e0f, 0x0001020304050607 ) ); \
  M0 = _mm512_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
  M1 = _mm512_shuffle_epi8( *(buf+ 1), shuf_bswap64 ); \
  M2 = _mm512_shuffle_epi8( *(buf+ 2), shuf_bswap64 ); \
@@ -419,7 +416,6 @@ void blake512_8way_compress( blake_8way_big_context *sc )
  __m512i M8, M9, MA, MB, MC, MD, ME, MF;
  __m512i V0, V1, V2, V3, V4, V5, V6, V7;
  __m512i V8, V9, VA, VB, VC, VD, VE, VF;
-  __m512i shuf_bswap64;

  V0 = sc->H[0];
  V1 = sc->H[1];
@@ -429,19 +425,17 @@ void blake512_8way_compress( blake_8way_big_context *sc )
  V5 = sc->H[5];
  V6 = sc->H[6];
  V7 = sc->H[7];
-  V8 = m512_const1_64( CB0 );
-  V9 = m512_const1_64( CB1 );
-  VA = m512_const1_64( CB2 );
-  VB = m512_const1_64( CB3 );
+  V8 = _mm512_set1_epi64( CB0 );
+  V9 = _mm512_set1_epi64( CB1 );
+  VA = _mm512_set1_epi64( CB2 );
+  VB = _mm512_set1_epi64( CB3 );
  VC = _mm512_set1_epi64( sc->T0 ^ CB4 );
  VD = _mm512_set1_epi64( sc->T0 ^ CB5 );
  VE = _mm512_set1_epi64( sc->T1 ^ CB6 );
  VF = _mm512_set1_epi64( sc->T1 ^ CB7 );

-  shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637,
-                                0x28292a2b2c2d2e2f, 0x2021222324252627,
-                                0x18191a1b1c1d1e1f, 0x1011121314151617,
-                                0x08090a0b0c0d0e0f, 0x0001020304050607 );
+  const __m512i shuf_bswap64 = mm512_bcast_m128( _mm_set_epi64x( 
+                                   0x08090a0b0c0d0e0f, 0x0001020304050607 ) );

  M0 = _mm512_shuffle_epi8( sc->buf[ 0], shuf_bswap64 );
  M1 = _mm512_shuffle_epi8( sc->buf[ 1], shuf_bswap64 );
@@ -503,10 +497,10 @@ void blake512_8way_compress_le( blake_8way_big_context *sc )
  V5 = sc->H[5];
  V6 = sc->H[6];
  V7 = sc->H[7];
-  V8 = m512_const1_64( CB0 );
-  V9 = m512_const1_64( CB1 );
-  VA = m512_const1_64( CB2 );
-  VB = m512_const1_64( CB3 );
+  V8 = _mm512_set1_epi64( CB0 );
+  V9 = _mm512_set1_epi64( CB1 );
+  VA = _mm512_set1_epi64( CB2 );
+  VB = _mm512_set1_epi64( CB3 );
  VC = _mm512_set1_epi64( sc->T0 ^ CB4 );
  VD = _mm512_set1_epi64( sc->T0 ^ CB5 );
  VE = _mm512_set1_epi64( sc->T1 ^ CB6 );
@@ -565,23 +559,23 @@ void blake512_8way_prehash_le( blake_8way_big_context *sc, __m512i *midstate,
   __m512i V8, V9, VA, VB, VC, VD, VE, VF;

   // initial hash
-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   // fill buffer
   memcpy_512( sc->buf, (__m512i*)data, 80>>3 );
-   sc->buf[10] = m512_const1_64( 0x8000000000000000ULL );
+   sc->buf[10] = _mm512_set1_epi64( 0x8000000000000000ULL );
   sc->buf[11] = 
   sc->buf[12] = m512_zero;
   sc->buf[13] = m512_one_64;
   sc->buf[14] = m512_zero;
-   sc->buf[15] = m512_const1_64( 80*8 );
+   sc->buf[15] = _mm512_set1_epi64( 80*8 );

   // build working variables
   V0 = sc->H[0];
@@ -592,10 +586,10 @@ void blake512_8way_prehash_le( blake_8way_big_context *sc, __m512i *midstate,
   V5 = sc->H[5];
   V6 = sc->H[6];
   V7 = sc->H[7];
-   V8 = m512_const1_64( CB0 );
-   V9 = m512_const1_64( CB1 );
-   VA = m512_const1_64( CB2 );
-   VB = m512_const1_64( CB3 );
+   V8 = _mm512_set1_epi64( CB0 );
+   V9 = _mm512_set1_epi64( CB1 );
+   VA = _mm512_set1_epi64( CB2 );
+   VB = _mm512_set1_epi64( CB3 );
   VC = _mm512_set1_epi64( CB4 ^ 0x280ULL );
   VD = _mm512_set1_epi64( CB5 ^ 0x280ULL );
   VE = _mm512_set1_epi64( CB6 );
@@ -790,14 +784,14 @@ void blake512_8way_final_le( blake_8way_big_context *sc, void *hash,

 void blake512_8way_init( blake_8way_big_context *sc )
 {
-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -861,7 +855,7 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>3] = m512_const1_64( 0x80 );
+   buf[ptr>>3] = _mm512_set1_epi64( 0x80 );
   tl = sc->T0 + bit_len;
   th = sc->T1;
   if (ptr == 0 )
@@ -882,9 +876,9 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )
   {
       memset_zero_512( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
       buf[104>>3] = _mm512_or_si512( buf[104>>3],
-                                 m512_const1_64( 0x0100000000000000ULL ) );
-       buf[112>>3] = m512_const1_64( bswap_64( th ) );
-       buf[120>>3] = m512_const1_64( bswap_64( tl ) );
+                                 _mm512_set1_epi64( 0x0100000000000000ULL ) );
+       buf[112>>3] = _mm512_set1_epi64( bswap_64( th ) );
+       buf[120>>3] = _mm512_set1_epi64( bswap_64( tl ) );

       blake64_8way( sc, buf + (ptr>>3), 128 - ptr );
   }
@@ -896,9 +890,9 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )
       sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
       sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
       memset_zero_512( buf, 112>>3 );
-       buf[104>>3] = m512_const1_64( 0x0100000000000000ULL );
-       buf[112>>3] = m512_const1_64( bswap_64( th ) );
-       buf[120>>3] = m512_const1_64( bswap_64( tl ) );
+       buf[104>>3] = _mm512_set1_epi64( 0x0100000000000000ULL );
+       buf[112>>3] = _mm512_set1_epi64( bswap_64( th ) );
+       buf[120>>3] = _mm512_set1_epi64( bswap_64( tl ) );

       blake64_8way( sc, buf, 128 );
   }
@@ -912,14 +906,14 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
   
 // init

-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -943,7 +937,7 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
   uint64_t th, tl;

   bit_len = sc->ptr << 3;
-   sc->buf[ptr64] = m512_const1_64( 0x80 );
+   sc->buf[ptr64] = _mm512_set1_epi64( 0x80 );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -961,9 +955,9 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
      sc->T0 -= 1024 - bit_len;

   memset_zero_512( sc->buf + ptr64 + 1, 13 - ptr64 );
-   sc->buf[13] = m512_const1_64( 0x0100000000000000ULL );
-   sc->buf[14] = m512_const1_64( bswap_64( th ) );
-   sc->buf[15] = m512_const1_64( bswap_64( tl ) );
+   sc->buf[13] = _mm512_set1_epi64( 0x0100000000000000ULL );
+   sc->buf[14] = _mm512_set1_epi64( bswap_64( th ) );
+   sc->buf[15] = _mm512_set1_epi64( bswap_64( tl ) );

   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
       sc->T1 = sc->T1 + 1;
@@ -979,14 +973,14 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,

 // init

-   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
-   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
-   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
-   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
-   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+   casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -1010,7 +1004,7 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,
   uint64_t th, tl;

   bit_len = sc->ptr << 3;
-   sc->buf[ptr64] = m512_const1_64( 0x8000000000000000ULL );
+   sc->buf[ptr64] = _mm512_set1_epi64( 0x8000000000000000ULL );
   tl = sc->T0 + bit_len;
   th = sc->T1;

@@ -1029,8 +1023,8 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,

   memset_zero_512( sc->buf + ptr64 + 1, 13 - ptr64 );
   sc->buf[13] = m512_one_64;
-   sc->buf[14] = m512_const1_64( th );
-   sc->buf[15] = m512_const1_64( tl );
+   sc->buf[14] = _mm512_set1_epi64( th );
+   sc->buf[15] = _mm512_set1_epi64( tl );

   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
       sc->T1 = sc->T1 + 1;
@@ -1092,7 +1086,6 @@ blake512_8way_close(void *cc, void *dst)
  __m256i M8, M9, MA, MB, MC, MD, ME, MF; \
  __m256i V0, V1, V2, V3, V4, V5, V6, V7; \
  __m256i V8, V9, VA, VB, VC, VD, VE, VF; \
-  __m256i shuf_bswap64; \
  V0 = H0; \
  V1 = H1; \
  V2 = H2; \
@@ -1101,16 +1094,16 @@ blake512_8way_close(void *cc, void *dst)
  V5 = H5; \
  V6 = H6; \
  V7 = H7; \
-  V8 = m256_const1_64( CB0 );  \
-  V9 = m256_const1_64( CB1 );  \
-  VA = m256_const1_64( CB2 );  \
-  VB = m256_const1_64( CB3 );  \
+  V8 = _mm256_set1_epi64x( CB0 );  \
+  V9 = _mm256_set1_epi64x( CB1 );  \
+  VA = _mm256_set1_epi64x( CB2 );  \
+  VB = _mm256_set1_epi64x( CB3 );  \
  VC = _mm256_set1_epi64x( T0 ^ CB4 ); \
  VD = _mm256_set1_epi64x( T0 ^ CB5 ); \
  VE = _mm256_set1_epi64x( T1 ^ CB6 ); \
  VF = _mm256_set1_epi64x( T1 ^ CB7 ); \
-  shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617, \
-                                0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
+  const __m256i shuf_bswap64 = mm256_bcast_m128( _mm_set_epi64x( \
+                             0x08090a0b0c0d0e0f, 0x0001020304050607 ) ); \
  M0 = _mm256_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
  M1 = _mm256_shuffle_epi8( *(buf+ 1), shuf_bswap64 ); \
  M2 = _mm256_shuffle_epi8( *(buf+ 2), shuf_bswap64 ); \
@@ -1160,7 +1153,6 @@ void blake512_4way_compress( blake_4way_big_context *sc )
  __m256i M8, M9, MA, MB, MC, MD, ME, MF;
  __m256i V0, V1, V2, V3, V4, V5, V6, V7;
  __m256i V8, V9, VA, VB, VC, VD, VE, VF;
-  __m256i shuf_bswap64;

  V0 = sc->H[0];
  V1 = sc->H[1];
@@ -1170,20 +1162,20 @@ void blake512_4way_compress( blake_4way_big_context *sc )
  V5 = sc->H[5];
  V6 = sc->H[6];
  V7 = sc->H[7];
-  V8 = m256_const1_64( CB0 );
-  V9 = m256_const1_64( CB1 );
-  VA = m256_const1_64( CB2 );
-  VB = m256_const1_64( CB3 );
+  V8 = _mm256_set1_epi64x( CB0 );
+  V9 = _mm256_set1_epi64x( CB1 );
+  VA = _mm256_set1_epi64x( CB2 );
+  VB = _mm256_set1_epi64x( CB3 );
  VC = _mm256_xor_si256( _mm256_set1_epi64x( sc->T0 ),
-                             m256_const1_64( CB4 ) );
+                             _mm256_set1_epi64x( CB4 ) );
  VD = _mm256_xor_si256( _mm256_set1_epi64x( sc->T0 ),
-                             m256_const1_64( CB5 ) );
+                             _mm256_set1_epi64x( CB5 ) );
  VE = _mm256_xor_si256( _mm256_set1_epi64x( sc->T1 ),
-                             m256_const1_64( CB6 ) );
+                             _mm256_set1_epi64x( CB6 ) );
  VF = _mm256_xor_si256( _mm256_set1_epi64x( sc->T1 ),
-                             m256_const1_64( CB7 ) );
-  shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617,
-                                0x08090a0b0c0d0e0f, 0x0001020304050607 );
+                             _mm256_set1_epi64x( CB7 ) );
+  const __m256i shuf_bswap64 = mm256_bcast_m128( _mm_set_epi64x(
+                                    0x08090a0b0c0d0e0f, 0x0001020304050607 ) );

  M0 = _mm256_shuffle_epi8( sc->buf[ 0], shuf_bswap64 );
  M1 = _mm256_shuffle_epi8( sc->buf[ 1], shuf_bswap64 );
@@ -1236,23 +1228,23 @@ void blake512_4way_prehash_le( blake_4way_big_context *sc, __m256i *midstate,
   __m256i V8, V9, VA, VB, VC, VD, VE, VF;

   // initial hash
-   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
-   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
-   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
-   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
-   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
+   casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
   
   // fill buffer
   memcpy_256( sc->buf, (__m256i*)data, 80>>3 );
-   sc->buf[10] = m256_const1_64( 0x8000000000000000ULL );
+   sc->buf[10] = _mm256_set1_epi64x( 0x8000000000000000ULL );
   sc->buf[11] = m256_zero;
   sc->buf[12] = m256_zero;
   sc->buf[13] = m256_one_64;
   sc->buf[14] = m256_zero;
-   sc->buf[15] = m256_const1_64( 80*8 );
+   sc->buf[15] = _mm256_set1_epi64x( 80*8 );

   // build working variables
   V0 = sc->H[0];
@@ -1263,10 +1255,10 @@ void blake512_4way_prehash_le( blake_4way_big_context *sc, __m256i *midstate,
   V5 = sc->H[5];
   V6 = sc->H[6];
   V7 = sc->H[7];
-   V8 = m256_const1_64( CB0 );
-   V9 = m256_const1_64( CB1 );
-   VA = m256_const1_64( CB2 );
-   VB = m256_const1_64( CB3 );
+   V8 = _mm256_set1_epi64x( CB0 );
+   V9 = _mm256_set1_epi64x( CB1 );
+   VA = _mm256_set1_epi64x( CB2 );
+   VB = _mm256_set1_epi64x( CB3 );
   VC = _mm256_set1_epi64x( CB4 ^ 0x280ULL );
   VD = _mm256_set1_epi64x( CB5 ^ 0x280ULL );
   VE = _mm256_set1_epi64x( CB6 );
@@ -1446,14 +1438,14 @@ void blake512_4way_final_le( blake_4way_big_context *sc, void *hash,

 void blake512_4way_init( blake_4way_big_context *sc )
 {
-   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
-   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
-   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
-   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
-   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
+   casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -1513,7 +1505,7 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )

   ptr = sc->ptr;
   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>3] = m256_const1_64( 0x80 );
+   buf[ptr>>3] = _mm256_set1_epi64x( 0x80 );
   tl = sc->T0 + bit_len;
   th = sc->T1;
   if (ptr == 0 )
@@ -1535,9 +1527,9 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )
   {
       memset_zero_256( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
       buf[104>>3] = _mm256_or_si256( buf[104>>3],
-                                 m256_const1_64( 0x0100000000000000ULL ) );
-       buf[112>>3] = m256_const1_64( bswap_64( th ) );
-       buf[120>>3] = m256_const1_64( bswap_64( tl ) );
+                                 _mm256_set1_epi64x( 0x0100000000000000ULL ) );
+       buf[112>>3] = _mm256_set1_epi64x( bswap_64( th ) );
+       buf[120>>3] = _mm256_set1_epi64x( bswap_64( tl ) );

       blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
   }
@@ -1549,9 +1541,9 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )
       sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
       sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
       memset_zero_256( buf, 112>>3 ); 
-       buf[104>>3] = m256_const1_64( 0x0100000000000000ULL );
-       buf[112>>3] = m256_const1_64( bswap_64( th ) );
-       buf[120>>3] = m256_const1_64( bswap_64( tl ) );
+       buf[104>>3] = _mm256_set1_epi64x( 0x0100000000000000ULL );
+       buf[112>>3] = _mm256_set1_epi64x( bswap_64( th ) );
+       buf[120>>3] = _mm256_set1_epi64x( bswap_64( tl ) );

       blake64_4way( sc, buf, 128 );
   }
@@ -1565,14 +1557,14 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,

 // init

-   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
-   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
-   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
-   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
-   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
-   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
-   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
-   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
+   casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+   casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+   casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+   casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+   casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+   casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+   casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+   casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );

   sc->T0 = sc->T1 = 0;
   sc->ptr = 0;
@@ -1596,7 +1588,7 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,
   uint64_t th, tl;

   bit_len = sc->ptr << 3;
-   sc->buf[ptr64] = m256_const1_64( 0x80 );
+   sc->buf[ptr64] = _mm256_set1_epi64x( 0x80 );
   tl = sc->T0 + bit_len;
   th = sc->T1;
   if ( sc->ptr == 0 )
@@ -1613,9 +1605,9 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,
        sc->T0 -= 1024 - bit_len;

   memset_zero_256( sc->buf + ptr64 + 1, 13 - ptr64 );
-   sc->buf[13] = m256_const1_64( 0x0100000000000000ULL );
-   sc->buf[14] = m256_const1_64( bswap_64( th ) );
-   sc->buf[15] = m256_const1_64( bswap_64( tl ) );
+   sc->buf[13] = _mm256_set1_epi64x( 0x0100000000000000ULL );
+   sc->buf[14] = _mm256_set1_epi64x( bswap_64( th ) );
+   sc->buf[15] = _mm256_set1_epi64x( bswap_64( tl ) );

   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
       sc->T1 = sc->T1 + 1;
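Review note on the shuf_bswap64 changes in this file: the full-width mask constants are replaced by one 128-bit pattern broadcast to every lane. This should be behaviour-preserving because `_mm256_shuffle_epi8` and `_mm512_shuffle_epi8` shuffle each 128-bit lane independently and read only the low 4 bits of each control byte (plus bit 7 as a zeroing flag). The sketch below is a scalar model of that instruction, not project code, and it assumes that `mm512_bcast_m128` / `mm256_bcast_m128` simply repeat their 128-bit argument in every lane, as the names suggest. All function names in it are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a per-lane byte shuffle: every 16-byte lane is
   shuffled on its own, using the low 4 bits of each control byte.
   A control byte with bit 7 set produces a zero output byte. */
static void lane_shuffle(uint8_t *dst, const uint8_t *src,
                         const uint8_t *ctl, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++) {
        size_t lane = i & ~(size_t)15;          /* start of this 16-byte lane */
        dst[i] = (ctl[i] & 0x80) ? 0 : src[lane + (ctl[i] & 0x0f)];
    }
}

/* Old-style mask: byte i holds the vector-wide index of its mirror
   inside the same 64-bit element (0x07..0x00, 0x0f..0x08, ... 0x3f..0x38). */
static void full_bswap64_mask(uint8_t *ctl, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        ctl[i] = (uint8_t)((i & ~(size_t)7) | (7 - (i & 7)));
}

/* New-style mask: the first 16 bytes of the pattern repeated per lane. */
static void bcast_bswap64_mask(uint8_t *ctl, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++) {
        size_t j = i & 15;
        ctl[i] = (uint8_t)((j & 8) | (7 - (j & 7)));
    }
}

/* Returns 1 when both masks give identical results for nbytes <= 64. */
static int masks_agree(size_t nbytes)
{
    uint8_t src[64], full[64], bcast[64], a[64], b[64];

    for (size_t i = 0; i < 64; i++)
        src[i] = (uint8_t)(i * 3 + 1);

    full_bswap64_mask(full, nbytes);
    bcast_bswap64_mask(bcast, nbytes);
    lane_shuffle(a, src, full, nbytes);
    lane_shuffle(b, src, bcast, nbytes);

    return memcmp(a, b, nbytes) == 0;
}
```

`masks_agree(32)` models the `__m256i` case and `masks_agree(64)` the `__m512i` case. The model says nothing about the generated code or its speed; that would need to be checked on the real intrinsics.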
--- a/algo/blake/decred-4way.c
+++ b/algo/blake/decred-4way.c
@@ -1,74 +0,0 @@
-#include "decred-gate.h"
-#include "blake-hash-4way.h"
-#include <string.h>
-#include <stdint.h>
-#include <memory.h>
-#include <unistd.h>
-
-#if defined (DECRED_4WAY)
-
-static __thread blake256_4way_context blake_mid;
-
-void decred_hash_4way( void *state, const void *input )
-{
-     uint32_t vhash[8*4] __attribute__ ((aligned (64)));
-//     uint32_t hash0[8] __attribute__ ((aligned (32)));
-//     uint32_t hash1[8] __attribute__ ((aligned (32)));
-//     uint32_t hash2[8] __attribute__ ((aligned (32)));
-//     uint32_t hash3[8] __attribute__ ((aligned (32)));
-     const void *tail = input + ( DECRED_MIDSTATE_LEN << 2 );
-     int tail_len = 180 - DECRED_MIDSTATE_LEN; 
-     blake256_4way_context ctx __attribute__ ((aligned (64)));
-
-     memcpy( &ctx, &blake_mid, sizeof(blake_mid) );
-     blake256_4way_update( &ctx, tail, tail_len );
-     blake256_4way_close( &ctx, vhash );
-     dintrlv_4x32( state, state+32, state+64, state+96, vhash, 256 );
-}
-
-int scanhash_decred_4way( struct work *work, uint32_t max_nonce,
-                          uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t vdata[48*4] __attribute__ ((aligned (64)));
-   uint32_t hash[8*4] __attribute__ ((aligned (32)));
-   uint32_t _ALIGN(64) edata[48];
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   const uint32_t first_nonce = pdata[DECRED_NONCE_INDEX];
-   uint32_t n = first_nonce;
-   const uint32_t HTarget = opt_benchmark ? 0x7f : ptarget[7];
-   int thr_id = mythr->id;  // thr_id arg is deprecated
-
-   // copy to buffer guaranteed to be aligned.
-   memcpy( edata, pdata, 180 );
-
-   // use the old way until  new way updated for size.
-   mm128_intrlv_4x32x( vdata, edata, edata, edata, edata, 180*8 );
-
-   blake256_4way_init( &blake_mid );
-   blake256_4way_update( &blake_mid, vdata, DECRED_MIDSTATE_LEN );
-
-   uint32_t *noncep = vdata + DECRED_NONCE_INDEX * 4;
-   do {
-      * noncep    = n;
-      *(noncep+1) = n+1;
-      *(noncep+2) = n+2;
-      *(noncep+3) = n+3;
-
-      decred_hash_4way( hash, vdata );
-
-      for ( int i = 0; i < 4; i++ )
-      if (  (hash+(i<<3))[7] <= HTarget )
-      if ( fulltest( hash+(i<<3), ptarget ) && !opt_benchmark )
-      {
-          pdata[DECRED_NONCE_INDEX] = n+i;
-          submit_solution( work, hash+(i<<3), mythr );
-      }
-      n += 4;
-  } while ( (n < max_nonce) && !work_restart[thr_id].restart );
-
-  *hashes_done = n - first_nonce + 1;
-  return 0;
-}
-
-#endif
--- a/algo/blake/decred-gate.c
+++ b/algo/blake/decred-gate.c
@@ -1,171 +0,0 @@
-#include "decred-gate.h"
-#include <unistd.h>
-#include <memory.h>
-#include <string.h>
-
-uint32_t *decred_get_nonceptr( uint32_t *work_data )
-{
-   return &work_data[ DECRED_NONCE_INDEX ];
-}
-
-long double decred_calc_network_diff( struct work* work )
-{
-   // sample for diff 43.281 : 1c05ea29
-   // todo: endian reversed on longpoll could be zr5 specific...
-   uint32_t nbits = work->data[ DECRED_NBITS_INDEX ];
-   uint32_t bits = ( nbits & 0xffffff );
-   int16_t shift = ( swab32(nbits) & 0xff ); // 0x1c = 28
-   int m;
-   long double d = (long double)0x0000ffff / (long double)bits;
-
-   for ( m = shift; m < 29; m++ )
-       d *= 256.0;
-   for ( m = 29; m < shift; m++ )
-       d /= 256.0;
-   if ( shift == 28 )
-       d *= 256.0; // testnet
-   if ( opt_debug_diff )
-       applog( LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", (double)d,
-                           shift, bits );
-   return net_diff;
-}
-
-void decred_decode_extradata( struct work* work, uint64_t* net_blocks )
-{
-   // some random extradata to make the work unique
-   work->data[ DECRED_XNONCE_INDEX ] = (rand()*4);
-   work->height = work->data[32];
-   if (!have_longpoll && work->height > *net_blocks + 1)
-   {
-      char netinfo[64] = { 0 };
-      if ( net_diff > 0. )
-      {
-         if (net_diff != work->targetdiff)
-            sprintf(netinfo, ", diff %.3f, target %.1f", net_diff,
-                   work->targetdiff);
-         else
-             sprintf(netinfo, ", diff %.3f", net_diff);
-       }
-       applog(LOG_BLUE, "%s block %d%s", algo_names[opt_algo], work->height,
-                       netinfo);
-       *net_blocks = work->height - 1;
-   }
-}
-
-void decred_be_build_stratum_request( char *req, struct work *work,
-                                      struct stratum_ctx *sctx )
-{
-   unsigned char *xnonce2str;
-   uint32_t ntime, nonce;
-   char ntimestr[9], noncestr[9];
-
-   be32enc( &ntime, work->data[ DECRED_NTIME_INDEX ] );
-   be32enc( &nonce, work->data[ DECRED_NONCE_INDEX ] );
-   bin2hex( ntimestr, (char*)(&ntime), sizeof(uint32_t) );
-   bin2hex( noncestr, (char*)(&nonce), sizeof(uint32_t) );
-   xnonce2str = abin2hex( (char*)( &work->data[ DECRED_XNONCE_INDEX ] ),
-                                     sctx->xnonce1_size );
-   snprintf( req, JSON_BUF_LEN,
-        "{\"method\": \"mining.submit\", \"params\": [\"%s\", \"%s\", \"%s\", \"%s\", \"%s\"], \"id\":4}",
-         rpc_user, work->job_id, xnonce2str, ntimestr, noncestr );
-   free(xnonce2str);
-}
-
-#if !defined(min)
-#define min(a,b) (a>b ? (b) :(a))
-#endif
-
-void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
-{
-   uchar merkle_root[64] = { 0 };
-   uint32_t extraheader[32] = { 0 };
-   int headersize = 0;
-   uint32_t* extradata = (uint32_t*) sctx->xnonce1;
-   int i;
-
-   // getwork over stratum, getwork merkle + header passed in coinb1
-   memcpy(merkle_root, sctx->job.coinbase, 32);
-   headersize = min((int)sctx->job.coinbase_size - 32,
-                  sizeof(extraheader) );
-   memcpy( extraheader, &sctx->job.coinbase[32], headersize );
-
-   // Assemble block header 
-   memset( g_work->data, 0, sizeof(g_work->data) );
-   g_work->data[0] = le32dec( sctx->job.version );
-   for ( i = 0; i < 8; i++ )
-      g_work->data[1 + i] = swab32(
-                              le32dec( (uint32_t *) sctx->job.prevhash + i ) );
-   for ( i = 0; i < 8; i++ )
-      g_work->data[9 + i] = swab32( be32dec( (uint32_t *) merkle_root + i ) );
-
-//   for ( i = 0; i < 8; i++ ) // prevhash
-//      g_work->data[1 + i] = swab32( g_work->data[1 + i] );
-//   for ( i = 0; i < 8; i++ ) // merkle
-//      g_work->data[9 + i] = swab32( g_work->data[9 + i] );
-
-   for ( i = 0; i < headersize/4; i++ ) // header
-      g_work->data[17 + i] = extraheader[i];
-   // extradata
-
-   for ( i = 0; i < sctx->xnonce1_size/4; i++ )
-      g_work->data[ DECRED_XNONCE_INDEX + i ] = extradata[i];
-   for ( i = DECRED_XNONCE_INDEX + sctx->xnonce1_size/4; i < 45; i++ )
-      g_work->data[i] = 0;
-   g_work->data[37] = (rand()*4) << 8;
-   // block header suffix from coinb2 (stake version)
-   memcpy( &g_work->data[44],
-           &sctx->job.coinbase[ sctx->job.coinbase_size-4 ], 4 );
-   sctx->block_height = g_work->data[32];
-   //applog_hex(work->data, 180);
-   //applog_hex(&work->data[36], 36);
-}
-
-#undef min
-
-bool decred_ready_to_mine( struct work* work, struct stratum_ctx* stratum,
-                           int thr_id )
-{
-   if ( have_stratum && strcmp(stratum->job.job_id, work->job_id)  )
-      // need to regen g_work..
-      return false;
-   if ( have_stratum && !work->data[0] && !opt_benchmark )
-   {
-      sleep(1);
-      return false;
-   }
-   // extradata: prevent duplicates
-   work->data[ DECRED_XNONCE_INDEX     ] += 1;
-   work->data[ DECRED_XNONCE_INDEX + 1 ] |= thr_id;
-   return true;
-}
-
-int decred_get_work_data_size() { return DECRED_DATA_SIZE; }
-
-bool register_decred_algo( algo_gate_t* gate )
-{
-#if defined(DECRED_4WAY)
-  four_way_not_tested();
-  gate->scanhash  = (void*)&scanhash_decred_4way;
-  gate->hash      = (void*)&decred_hash_4way;
-#else
-  gate->scanhash  = (void*)&scanhash_decred;
-  gate->hash      = (void*)&decred_hash;
-#endif
-  gate->optimizations = AVX2_OPT;
-//  gate->get_nonceptr          = (void*)&decred_get_nonceptr;
-  gate->decode_extra_data     = (void*)&decred_decode_extradata;
-  gate->build_stratum_request = (void*)&decred_be_build_stratum_request;
-  gate->work_decode           = (void*)&std_be_work_decode;
-  gate->submit_getwork_result = (void*)&std_be_submit_getwork_result;
-  gate->build_extraheader     = (void*)&decred_build_extraheader;
-  gate->ready_to_mine         = (void*)&decred_ready_to_mine;
-  gate->nbits_index           = DECRED_NBITS_INDEX;
-  gate->ntime_index           = DECRED_NTIME_INDEX;
-  gate->nonce_index           = DECRED_NONCE_INDEX;
-  gate->get_work_data_size    = (void*)&decred_get_work_data_size;
-  gate->work_cmp_size         = DECRED_WORK_COMPARE_SIZE;
-  allow_mininginfo            = false;
-  have_gbt                    = false;
-  return true;
-}
-
--- a/algo/blake/decred-gate.h
+++ b/algo/blake/decred-gate.h
@@ -1,36 +0,0 @@
-#ifndef __DECRED_GATE_H__
-#define __DECRED_GATE_H__
-
-#include "algo-gate-api.h"
-#include <stdint.h>
-
-#define DECRED_NBITS_INDEX 29
-#define DECRED_NTIME_INDEX 34
-#define DECRED_NONCE_INDEX 35
-#define DECRED_XNONCE_INDEX 36
-#define DECRED_DATA_SIZE 192
-#define DECRED_WORK_COMPARE_SIZE 140
-#define DECRED_MIDSTATE_LEN 128
-
-#if defined (__AVX2__) 
-//void blakehash_84way(void *state, const void *input);
-//int scanhash_blake_8way( struct work *work, uint32_t max_nonce,
-//                         uint64_t *hashes_done );
-#endif
-
-#if defined(__SSE4_2__)
-  #define DECRED_4WAY
-#endif
-
-#if defined (DECRED_4WAY)
-void decred_hash_4way(void *state, const void *input);
-int scanhash_decred_4way( struct work *work, uint32_t max_nonce,
-                          uint64_t *hashes_done, struct thr_info *mythr );
-#endif
-
-void decred_hash( void *state, const void *input );
-int scanhash_decred( struct work *work, uint32_t max_nonce,
-                     uint64_t *hashes_done, struct thr_info *mythr );
-
-#endif
-
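Reviewer note: the two files removed above formed the decred "gate". The header derived `DECRED_4WAY` from `__SSE4_2__`, and `register_decred_algo` then filled a table of function pointers according to that macro. The sketch below shows only that pattern. `mini_gate_t`, `register_mini_algo` and the `mini_hash_*` functions are hypothetical stand-ins and not part of cpuminer-opt. The hash bodies are empty placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MINI_NONCE_INDEX 35   /* same value as DECRED_NONCE_INDEX in the removed header */

/* Hypothetical, cut-down stand-in for algo_gate_t: one hook plus two fields. */
typedef struct {
    void (*hash)(void *state, const void *input);
    int nonce_index;          /* 32-bit word offset of the nonce in the work data */
    int lanes;                /* how many nonces one hash call would process */
} mini_gate_t;

#if defined(__SSE4_2__)
#define MINI_4WAY             /* mirrors how DECRED_4WAY was derived */
/* Placeholder for a 4-way interleaved hash. */
static void mini_hash_4way(void *state, const void *input)
{
    (void)state; (void)input;
}
#else
/* Placeholder for the scalar hash. */
static void mini_hash_scalar(void *state, const void *input)
{
    (void)state; (void)input;
}
#endif

/* Fill in the gate once at start-up. The variant is fixed at compile time. */
static int register_mini_algo(mini_gate_t *gate)
{
#if defined(MINI_4WAY)
    gate->hash  = mini_hash_4way;
    gate->lanes = 4;
#else
    gate->hash  = mini_hash_scalar;
    gate->lanes = 1;
#endif
    gate->nonce_index = MINI_NONCE_INDEX;
    return 1;
}
```

A caller would declare a `mini_gate_t`, pass its address to `register_mini_algo`, and from then on call through `gate.hash` without needing to know which variant was compiled in.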
--- a/algo/blake/decred.c
+++ b/algo/blake/decred.c
@@ -1,282 +0,0 @@
-#include "decred-gate.h"
-
-#if !defined(DECRED_8WAY) && !defined(DECRED_4WAY)
-
-#include "sph_blake.h"
-
-#include <string.h>
-#include <stdint.h>
-#include <memory.h>
-#include <unistd.h>
-
-/*
-#ifndef min
-#define min(a,b) (a>b ? b : a)
-#endif
-#ifndef max 
-#define max(a,b) (a<b ? b : a)
-#endif
-*/
-/*
-#define DECRED_NBITS_INDEX 29
-#define DECRED_NTIME_INDEX 34
-#define DECRED_NONCE_INDEX 35
-#define DECRED_XNONCE_INDEX 36
-#define DECRED_DATA_SIZE 192
-#define DECRED_WORK_COMPARE_SIZE 140
-*/
-static __thread sph_blake256_context blake_mid;
-static __thread bool ctx_midstate_done = false;
-
-void decred_hash(void *state, const void *input)
-{
-//        #define MIDSTATE_LEN 128
-        sph_blake256_context ctx __attribute__ ((aligned (64)));
-
-        uint8_t *ending = (uint8_t*) input;
-        ending += DECRED_MIDSTATE_LEN;
-
-        if (!ctx_midstate_done) {
-                sph_blake256_init(&blake_mid);
-                sph_blake256(&blake_mid, input, DECRED_MIDSTATE_LEN);
-                ctx_midstate_done = true;
-        }
-        memcpy(&ctx, &blake_mid, sizeof(blake_mid));
-
-        sph_blake256(&ctx, ending, (180 - DECRED_MIDSTATE_LEN));
-        sph_blake256_close(&ctx, state);
-}
-
-void decred_hash_simple(void *state, const void *input)
-{
-        sph_blake256_context ctx;
-        sph_blake256_init(&ctx);
-        sph_blake256(&ctx, input, 180);
-        sph_blake256_close(&ctx, state);
-}
-
-int scanhash_decred( struct work *work, uint32_t max_nonce,
-               uint64_t *hashes_done, struct thr_info *mythr )
-{
-        uint32_t _ALIGN(64) endiandata[48];
-        uint32_t _ALIGN(64) hash32[8];
-        uint32_t *pdata = work->data;
-        uint32_t *ptarget = work->target;
-   int thr_id = mythr->id;  // thr_id arg is deprecated
-
-//        #define DCR_NONCE_OFT32 35
-
-        const uint32_t first_nonce = pdata[DECRED_NONCE_INDEX];
-        const uint32_t HTarget = opt_benchmark ? 0x7f : ptarget[7];
-
-        uint32_t n = first_nonce;
-
-        ctx_midstate_done = false;
-
-#if 1
-        memcpy(endiandata, pdata, 180);
-#else
-        for (int k=0; k < (180/4); k++)
-                be32enc(&endiandata[k], pdata[k]);
-#endif
-
-        do {
-                //be32enc(&endiandata[DCR_NONCE_OFT32], n);
-                endiandata[DECRED_NONCE_INDEX] = n;
-                decred_hash(hash32, endiandata);
-
-                if (hash32[7] <= HTarget && fulltest(hash32, ptarget))
-                {
-                   pdata[DECRED_NONCE_INDEX] = n;
-                   submit_solution( work, hash32, mythr );
-                }
-
-                n++;
-
-        } while (n < max_nonce && !work_restart[thr_id].restart);
-
-        *hashes_done = n - first_nonce + 1;
-        pdata[DECRED_NONCE_INDEX] = n;
-        return 0;
-}
-
-/*
-uint32_t *decred_get_nonceptr( uint32_t *work_data )
-{
-   return &work_data[ DECRED_NONCE_INDEX ];
-}
-
-double decred_calc_network_diff( struct work* work )
-{
-   // sample for diff 43.281 : 1c05ea29
-   // todo: endian reversed on longpoll could be zr5 specific...
-   uint32_t nbits = work->data[ DECRED_NBITS_INDEX ];
-   uint32_t bits = ( nbits & 0xffffff );
-   int16_t shift = ( swab32(nbits) & 0xff ); // 0x1c = 28
-   int m;
-   double d = (double)0x0000ffff / (double)bits;
-
-   for ( m = shift; m < 29; m++ )
-       d *= 256.0;
-   for ( m = 29; m < shift; m++ )
-       d /= 256.0;
-   if ( shift == 28 )
-       d *= 256.0; // testnet
-   if ( opt_debug_diff )
-       applog( LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", d,
-                           shift, bits );
-   return net_diff;
-}
-
-void decred_decode_extradata( struct work* work, uint64_t* net_blocks )
-{
-   // some random extradata to make the work unique
-   work->data[ DECRED_XNONCE_INDEX ] = (rand()*4);
-   work->height = work->data[32];
-   if (!have_longpoll && work->height > *net_blocks + 1)
-   {
-      char netinfo[64] = { 0 };
-      if (net_diff > 0.)
-      {
-         if (net_diff != work->targetdiff)
-	    sprintf(netinfo, ", diff %.3f, target %.1f", net_diff,
-                   work->targetdiff);
-	 else
-	     sprintf(netinfo, ", diff %.3f", net_diff);
-       }
-       applog(LOG_BLUE, "%s block %d%s", algo_names[opt_algo], work->height,
-                       netinfo);
-       *net_blocks = work->height - 1;
-   }
-}
-
-void decred_be_build_stratum_request( char *req, struct work *work,
-                                      struct stratum_ctx *sctx )
-{
-   unsigned char *xnonce2str;
-   uint32_t ntime, nonce;
-   char ntimestr[9], noncestr[9];
-
-   be32enc( &ntime, work->data[ DECRED_NTIME_INDEX ] );
-   be32enc( &nonce, work->data[ DECRED_NONCE_INDEX ] );
-   bin2hex( ntimestr, (char*)(&ntime), sizeof(uint32_t) );
-   bin2hex( noncestr, (char*)(&nonce), sizeof(uint32_t) );
-   xnonce2str = abin2hex( (char*)( &work->data[ DECRED_XNONCE_INDEX ] ),
-                                     sctx->xnonce1_size );
-   snprintf( req, JSON_BUF_LEN,
-        "{\"method\": \"mining.submit\", \"params\": [\"%s\", \"%s\", \"%s\", \"%s\", \"%s\"], \"id\":4}",
-         rpc_user, work->job_id, xnonce2str, ntimestr, noncestr );
-   free(xnonce2str);
-}
-*/
-/*
-// data shared between gen_merkle_root and build_extraheader.
-__thread uint32_t decred_extraheader[32] = { 0 };
-__thread int decred_headersize = 0;
-
-void decred_gen_merkle_root( char* merkle_root, struct stratum_ctx* sctx )
-{
-   // getwork over stratum, getwork merkle + header passed in coinb1
-   memcpy(merkle_root, sctx->job.coinbase, 32);
-   decred_headersize = min((int)sctx->job.coinbase_size - 32, 
-                  sizeof(decred_extraheader) );
-   memcpy( decred_extraheader, &sctx->job.coinbase[32], decred_headersize);
-}
-*/
-
-/*
-#define min(a,b) (a>b ? (b) :(a))
-
-void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
-{
-   uchar merkle_root[64] = { 0 };
-   uint32_t extraheader[32] = { 0 };
-   int headersize = 0;
-   uint32_t* extradata = (uint32_t*) sctx->xnonce1;
-   size_t t;
-   int i;
-
-   // getwork over stratum, getwork merkle + header passed in coinb1
-   memcpy(merkle_root, sctx->job.coinbase, 32);
-   headersize = min((int)sctx->job.coinbase_size - 32,
-                  sizeof(extraheader) );
-   memcpy( extraheader, &sctx->job.coinbase[32], headersize );
-
-   // Increment extranonce2 
-   for ( t = 0; t < sctx->xnonce2_size && !( ++sctx->job.xnonce2[t] ); t++ );
-
-   // Assemble block header 
-   memset( g_work->data, 0, sizeof(g_work->data) );
-   g_work->data[0] = le32dec( sctx->job.version );
-   for ( i = 0; i < 8; i++ )
-      g_work->data[1 + i] = swab32(
-                              le32dec( (uint32_t *) sctx->job.prevhash + i ) );
-   for ( i = 0; i < 8; i++ )
-      g_work->data[9 + i] = swab32( be32dec( (uint32_t *) merkle_root + i ) );
-
-//   for ( i = 0; i < 8; i++ ) // prevhash
-//      g_work->data[1 + i] = swab32( g_work->data[1 + i] );
-//   for ( i = 0; i < 8; i++ ) // merkle
-//      g_work->data[9 + i] = swab32( g_work->data[9 + i] );
-
-   for ( i = 0; i < headersize/4; i++ ) // header
-      g_work->data[17 + i] = extraheader[i];
-   // extradata
-
-   for ( i = 0; i < sctx->xnonce1_size/4; i++ )
-      g_work->data[ DECRED_XNONCE_INDEX + i ] = extradata[i];
-   for ( i = DECRED_XNONCE_INDEX + sctx->xnonce1_size/4; i < 45; i++ )
-      g_work->data[i] = 0;
-   g_work->data[37] = (rand()*4) << 8;
-   // block header suffix from coinb2 (stake version)
-   memcpy( &g_work->data[44],
-           &sctx->job.coinbase[ sctx->job.coinbase_size-4 ], 4 );
-   sctx->bloc_height = g_work->data[32];
-   //applog_hex(work->data, 180);
-   //applog_hex(&work->data[36], 36);
-}
-
-#undef min
-
-bool decred_ready_to_mine( struct work* work, struct stratum_ctx* stratum,
-                           int thr_id )
-{
-   if ( have_stratum && strcmp(stratum->job.job_id, work->job_id)  )
-      // need to regen g_work..
-      return false;
-   if ( have_stratum && !work->data[0] && !opt_benchmark )
-   {
-      sleep(1);
-      return false;
-   }      
-   // extradata: prevent duplicates
-   work->data[ DECRED_XNONCE_INDEX     ] += 1;
-   work->data[ DECRED_XNONCE_INDEX + 1 ] |= thr_id;
-   return true;
-}
-
-
-bool register_decred_algo( algo_gate_t* gate )
-{
-  gate->optimizations         = SSE2_OPT;
-  gate->scanhash              = (void*)&scanhash_decred;
-  gate->hash                  = (void*)&decred_hash;
-  gate->get_nonceptr          = (void*)&decred_get_nonceptr;
-  gate->decode_extra_data     = (void*)&decred_decode_extradata;
-  gate->build_stratum_request = (void*)&decred_be_build_stratum_request;
-  gate->work_decode           = (void*)&std_be_work_decode;
-  gate->submit_getwork_result = (void*)&std_be_submit_getwork_result;
-  gate->build_extraheader     = (void*)&decred_build_extraheader;
-  gate->ready_to_mine         = (void*)&decred_ready_to_mine;
-  gate->nbits_index           = DECRED_NBITS_INDEX;
-  gate->ntime_index           = DECRED_NTIME_INDEX;
-  gate->nonce_index           = DECRED_NONCE_INDEX;
-  gate->work_data_size        = DECRED_DATA_SIZE;
-  gate->work_cmp_size         = DECRED_WORK_COMPARE_SIZE; 
-  allow_mininginfo            = false;
-  have_gbt                    = false;
-  return true;
-}
-*/
-
-#endif
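Reviewer note: the removed `decred_hash` relied on a midstate. The first `DECRED_MIDSTATE_LEN` (128) bytes of the 180-byte header do not change while scanning, because the nonce sits at word 35 (byte offset 140), so their hash state was computed once and copied for every nonce. The sketch below illustrates that technique under stated assumptions: it substitutes a toy FNV-1a streaming hash for BLAKE-256, and every name in it (`toy_ctx`, `midstate_prepare`, `hash_from_midstate`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEADER_LEN   180   /* bytes hashed per attempt, as in decred_hash */
#define MIDSTATE_LEN 128   /* prefix that does not depend on the nonce */
#define NONCE_OFFSET 140   /* byte offset of 32-bit word 35 */

/* Toy streaming hash (64-bit FNV-1a). A stand-in for BLAKE-256, not a
 * cryptographic hash. */
typedef struct { uint64_t h; } toy_ctx;

static void toy_init(toy_ctx *c) { c->h = 0xcbf29ce484222325ULL; }

static void toy_update(toy_ctx *c, const uint8_t *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        c->h ^= p[i];
        c->h *= 0x100000001b3ULL;
    }
}

/* Reference path: hash the whole header from scratch. */
static uint64_t hash_full(const uint8_t *hdr)
{
    toy_ctx c;
    toy_init(&c);
    toy_update(&c, hdr, HEADER_LEN);
    return c.h;
}

/* Once per job: absorb the constant prefix and keep the state. */
static void midstate_prepare(toy_ctx *mid, const uint8_t *hdr)
{
    toy_init(mid);
    toy_update(mid, hdr, MIDSTATE_LEN);
}

/* Once per nonce: copy the saved state and absorb only the 52-byte tail. */
static uint64_t hash_from_midstate(const toy_ctx *mid, const uint8_t *hdr)
{
    toy_ctx c = *mid;
    toy_update(&c, hdr + MIDSTATE_LEN, HEADER_LEN - MIDSTATE_LEN);
    return c.h;
}

/* Store a nonce at the position scanhash_decred used (word 35). */
static void set_nonce(uint8_t *hdr, uint32_t nonce)
{
    assert(NONCE_OFFSET >= MIDSTATE_LEN);   /* nonce must lie in the tail */
    memcpy(hdr + NONCE_OFFSET, &nonce, sizeof nonce);
}
```

In a scan loop, `midstate_prepare` would run once per job and `hash_from_midstate` once per nonce. The result equals `hash_full` for the same header, while the per-nonce work drops from 180 bytes to 52.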
--- a/algo/blake/sph-blake2s.c
+++ b/algo/blake/sph-blake2s.c
@@ -17,7 +17,6 @@

 #include "algo/sha/sph_types.h"
 #include "sph-blake2s.h"
-#include "simd-utils.h"

 static const uint32_t blake2s_IV[8] =
 {
@@ -226,71 +225,6 @@ int blake2s_compress( blake2s_state *S, const uint8_t block[BLAKE2S_BLOCKBYTES]
 	v[13] = S->t[1] ^ blake2s_IV[5];
 	v[14] = S->f[0] ^ blake2s_IV[6];
 	v[15] = S->f[1] ^ blake2s_IV[7];
-
-#if 0    
-//#if defined(__SSE2__) // always true
-
-The only application for this is to do a prehash for the blake2s algorithm.
-SSE2 also supports 4 way parallel hashing so that is preferred in most cases.
-Testing has found that using this serial SIMD code for prehash is slower than
-doing a parallel hash. A parallel hash has more instructions and uses more
-data. The serial hash uses fewer instructions and data and only needs to
-interleave the final hash into parallel streams. This has shown negligible
-improvement on other algos, notably blake256 which is almost identical.
-Considering the low frequency of prehash no statistically valid change
-was expected. It was simply better on paper.
-
-Furthermore, simply defining this macro has an additional negative effect on
-blake2s as a whole. There are no references to this macro, blake2s-4way does
-not include it in any header files, it's just another unused macro which should
-have no effect beyond the preprocessor. But just being visible to the compiler
-changes things in a dramatic way.
-
-These 2 things combined reduced the hash rate for blake2s by more than 5% when
-using serial SIMD for the blake2s prehash over 16way parallel prehash.
-16way parallel hashing was used in the high frequency nonce loop in both cases.
-Comsidering the prehash represents 50% of the algorithm and is done once vs
-the high frequency second half that is done mega, maybe giga, times more it's
-hard to imagine that big of an effect in either direction.
-
-#define ROUND( r ) \
-{ \
-   __m128i *V = (__m128i*)v; \
-   const uint8_t *sigma = blake2s_sigma[r]; \
-   V[0] = _mm_add_epi32( V[0], _mm_add_epi32( V[1], \
-                       _mm_set_epi32( m[ sigma[ 6 ] ], m[ sigma[ 4 ] ], \
-                                      m[ sigma[ 2 ] ], m[ sigma[ 0 ] ] ) ) ); \
-   V[3] = mm128_swap32_16( _mm_xor_si128( V[3], V[0] ) ); \
-   V[2] = _mm_add_epi32( V[2], V[3] ); \
-   V[1] = mm128_ror_32( _mm_xor_si128( V[1], V[2] ), 12 ); \
-   V[0] = _mm_add_epi32( V[0], _mm_add_epi32( V[1], \
-                        _mm_set_epi32( m[ sigma[ 7 ] ], m[ sigma[ 5 ] ], \
-                                       m[ sigma[ 3 ] ], m[ sigma[ 1 ] ] ) ) ); \
-   V[3] = mm128_shuflr32_8( _mm_xor_si128( V[3], V[0] ) ); \
-   V[2] = _mm_add_epi32( V[2], V[3] ); \
-   V[1] = mm128_ror_32( _mm_xor_si128( V[1], V[2] ), 7 ); \
-   V[3] = mm128_shufll_32( V[3] ); \
-   V[2] = mm128_swap_64( V[2] ); \
-   V[1] = mm128_shuflr_32( V[1] ); \
-   V[0] = _mm_add_epi32( V[0], _mm_add_epi32( V[1], \
-                        _mm_set_epi32( m[ sigma[14] ], m[ sigma[12] ], \
-                                       m[ sigma[10] ], m[ sigma[ 8] ] ) ) ); \
-   V[3] = mm128_swap32_16( _mm_xor_si128( V[3], V[0] ) ); \
-   V[2] = _mm_add_epi32( V[2], V[3] ); \
-   V[1] = mm128_ror_32( _mm_xor_si128( V[1], V[2] ), 12 ); \
-   V[0] = _mm_add_epi32( V[0], _mm_add_epi32( V[1], \
-                        _mm_set_epi32( m[ sigma[15] ], m[ sigma[13] ], \
-                                       m[ sigma[11] ], m[ sigma[ 9] ] ) ) ); \
-   V[3] = mm128_shuflr32_8( _mm_xor_si128( V[3], V[0] ) ); \
-   V[2] = _mm_add_epi32( V[2], V[3] ); \
-   V[1] = mm128_ror_32( _mm_xor_si128( V[1], V[2] ), 7 ); \
-   V[3] = mm128_shuflr_32( V[3] ); \
-   V[2] = mm128_swap_64( V[2] ); \
-   V[1] = mm128_shufll_32( V[1] ); \
-}
-
-#else
-
 #define G(r,i,a,b,c,d) \
 	do { \
 		a = a + b + m[blake2s_sigma[r][2*i+0]]; \
@@ -313,10 +247,7 @@ hard to imagine that big of an effect in either direction.
 		G(r,6,v[ 2],v[ 7],v[ 8],v[13]); \
 		G(r,7,v[ 3],v[ 4],v[ 9],v[14]); \
 	} while(0)
-
-#endif
-
-   ROUND( 0 );
+	ROUND( 0 );
 	ROUND( 1 );
 	ROUND( 2 );
 	ROUND( 3 );
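Reviewer note: with the disabled SSE2 `ROUND` macro gone, the scalar `G` macro is the only round implementation left in this file. As a reading aid, here is a minimal sketch of one BLAKE2s G step written as a function. The rotation amounts 16, 12, 8 and 7 are those of the BLAKE2s specification, and they appear to match the rotations in the removed SIMD macro. `rotr32` and `blake2s_g` are hypothetical names, not identifiers from sph-blake2s.c.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word right by n bits, for 0 < n < 32. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* One BLAKE2s G mixing step on state words a, b, c, d with two message
 * words mx and my. */
static void blake2s_g(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
                      uint32_t mx, uint32_t my)
{
    *a = *a + *b + mx;
    *d = rotr32(*d ^ *a, 16);
    *c = *c + *d;
    *b = rotr32(*b ^ *c, 12);
    *a = *a + *b + my;
    *d = rotr32(*d ^ *a, 8);
    *c = *c + *d;
    *b = rotr32(*b ^ *c, 7);
}
```

A full round applies this step to the four columns of the 4x4 state and then to the four diagonals, which is what the eight `G(r,0..7,...)` lines in the retained macro do.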
--- a/algo/blake/sph-blake2s.h
+++ b/algo/blake/sph-blake2s.h
@@ -91,7 +91,6 @@ static inline void secure_zero_memory(void *v, size_t n)
 extern "C" {
 #endif

-/*   
 	enum blake2s_constant
 	{
 		BLAKE2S_BLOCKBYTES = 64,
@@ -100,13 +99,6 @@ extern "C" {
 		BLAKE2S_SALTBYTES  = 8,
 		BLAKE2S_PERSONALBYTES = 8
 	};
-*/
-
-#define BLAKE2S_BLOCKBYTES  64
-#define BLAKE2S_OUTBYTES    32
-#define BLAKE2S_KEYBYTES    32
-#define BLAKE2S_SALTBYTES   8
-#define BLAKE2S_PERSONALBYTES  8

 #pragma pack(push, 1)
 	typedef struct __blake2s_param
--- a/algo/blake/sph_blake2b.c
+++ b/algo/blake/sph_blake2b.c
@@ -64,6 +64,22 @@
  V[1] = mm256_ror_64( _mm256_xor_si256( V[1], V[2] ), 63 ); \
 }

+// Pivot about V[1] instead of V[0] reduces latency.
+#define BLAKE2B_ROUND( R ) \
+{ \
+  __m256i *V = (__m256i*)v; \
+  const uint8_t *sigmaR = sigma[R]; \
+  BLAKE2B_G(  0,  1,  2,  3,  4,  5,  6,  7 ); \
+  V[0] = mm256_shufll_64( V[0] ); \
+  V[3] = mm256_swap_128( V[3] ); \
+  V[2] = mm256_shuflr_64( V[2] ); \
+  BLAKE2B_G( 14, 15,  8,  9, 10, 11, 12, 13 ); \
+  V[0] = mm256_shuflr_64( V[0] ); \
+  V[3] = mm256_swap_128( V[3] ); \
+  V[2] = mm256_shufll_64( V[2] ); \
+}
+
+/*
 #define BLAKE2B_ROUND( R ) \
 { \
  __m256i *V = (__m256i*)v; \
@@ -77,6 +93,7 @@
  V[2] = mm256_swap_128( V[2] ); \
  V[1] = mm256_shufll_64( V[1] ); \
 }
+*/

 #elif defined(__SSE2__)
 // always true
--- a/algo/bmw/bmw256-hash-4way.c
+++ b/algo/bmw/bmw256-hash-4way.c
@@ -451,22 +451,22 @@ static const __m128i final_s[16] =
 */
 void bmw256_4way_init( bmw256_4way_context *ctx )
 {
-   ctx->H[ 0] = m128_const1_64( 0x4041424340414243 );
-   ctx->H[ 1] = m128_const1_64( 0x4445464744454647 );
-   ctx->H[ 2] = m128_const1_64( 0x48494A4B48494A4B );
-   ctx->H[ 3] = m128_const1_64( 0x4C4D4E4F4C4D4E4F );
-   ctx->H[ 4] = m128_const1_64( 0x5051525350515253 );
-   ctx->H[ 5] = m128_const1_64( 0x5455565754555657 );
-   ctx->H[ 6] = m128_const1_64( 0x58595A5B58595A5B );
-   ctx->H[ 7] = m128_const1_64( 0x5C5D5E5F5C5D5E5F );
-   ctx->H[ 8] = m128_const1_64( 0x6061626360616263 );
-   ctx->H[ 9] = m128_const1_64( 0x6465666764656667 );
-   ctx->H[10] = m128_const1_64( 0x68696A6B68696A6B );
-   ctx->H[11] = m128_const1_64( 0x6C6D6E6F6C6D6E6F );
-   ctx->H[12] = m128_const1_64( 0x7071727370717273 );
-   ctx->H[13] = m128_const1_64( 0x7475767774757677 );
-   ctx->H[14] = m128_const1_64( 0x78797A7B78797A7B );
-   ctx->H[15] = m128_const1_64( 0x7C7D7E7F7C7D7E7F );
+   ctx->H[ 0] = _mm_set1_epi64x( 0x4041424340414243 );
+   ctx->H[ 1] = _mm_set1_epi64x( 0x4445464744454647 );
+   ctx->H[ 2] = _mm_set1_epi64x( 0x48494A4B48494A4B );
+   ctx->H[ 3] = _mm_set1_epi64x( 0x4C4D4E4F4C4D4E4F );
+   ctx->H[ 4] = _mm_set1_epi64x( 0x5051525350515253 );
+   ctx->H[ 5] = _mm_set1_epi64x( 0x5455565754555657 );
+   ctx->H[ 6] = _mm_set1_epi64x( 0x58595A5B58595A5B );
+   ctx->H[ 7] = _mm_set1_epi64x( 0x5C5D5E5F5C5D5E5F );
+   ctx->H[ 8] = _mm_set1_epi64x( 0x6061626360616263 );
+   ctx->H[ 9] = _mm_set1_epi64x( 0x6465666764656667 );
+   ctx->H[10] = _mm_set1_epi64x( 0x68696A6B68696A6B );
+   ctx->H[11] = _mm_set1_epi64x( 0x6C6D6E6F6C6D6E6F );
+   ctx->H[12] = _mm_set1_epi64x( 0x7071727370717273 );
+   ctx->H[13] = _mm_set1_epi64x( 0x7475767774757677 );
+   ctx->H[14] = _mm_set1_epi64x( 0x78797A7B78797A7B );
+   ctx->H[15] = _mm_set1_epi64x( 0x7C7D7E7F7C7D7E7F );


 //   for ( int i = 0; i < 16; i++ )
@@ -529,7 +529,7 @@ bmw32_4way_close(bmw_4way_small_context *sc, unsigned ub, unsigned n,

   buf = sc->buf;
   ptr = sc->ptr;
-   buf[ ptr>>2 ] = m128_const1_64( 0x0000008000000080 );
+   buf[ ptr>>2 ] = _mm_set1_epi64x( 0x0000008000000080 );
   ptr += 4;
   h = sc->H;

@@ -959,22 +959,22 @@ static const __m256i final_s8[16] =

 void bmw256_8way_init( bmw256_8way_context *ctx )
 {
-   ctx->H[ 0] = m256_const1_64( 0x4041424340414243 );
-   ctx->H[ 1] = m256_const1_64( 0x4445464744454647 );
-   ctx->H[ 2] = m256_const1_64( 0x48494A4B48494A4B );
-   ctx->H[ 3] = m256_const1_64( 0x4C4D4E4F4C4D4E4F );
-   ctx->H[ 4] = m256_const1_64( 0x5051525350515253 );
-   ctx->H[ 5] = m256_const1_64( 0x5455565754555657 );
-   ctx->H[ 6] = m256_const1_64( 0x58595A5B58595A5B );
-   ctx->H[ 7] = m256_const1_64( 0x5C5D5E5F5C5D5E5F );
-   ctx->H[ 8] = m256_const1_64( 0x6061626360616263 );
-   ctx->H[ 9] = m256_const1_64( 0x6465666764656667 );
-   ctx->H[10] = m256_const1_64( 0x68696A6B68696A6B );
-   ctx->H[11] = m256_const1_64( 0x6C6D6E6F6C6D6E6F );
-   ctx->H[12] = m256_const1_64( 0x7071727370717273 );
-   ctx->H[13] = m256_const1_64( 0x7475767774757677 );
-   ctx->H[14] = m256_const1_64( 0x78797A7B78797A7B );
-   ctx->H[15] = m256_const1_64( 0x7C7D7E7F7C7D7E7F );
+   ctx->H[ 0] = _mm256_set1_epi64x( 0x4041424340414243 );
+   ctx->H[ 1] = _mm256_set1_epi64x( 0x4445464744454647 );
+   ctx->H[ 2] = _mm256_set1_epi64x( 0x48494A4B48494A4B );
+   ctx->H[ 3] = _mm256_set1_epi64x( 0x4C4D4E4F4C4D4E4F );
+   ctx->H[ 4] = _mm256_set1_epi64x( 0x5051525350515253 );
+   ctx->H[ 5] = _mm256_set1_epi64x( 0x5455565754555657 );
+   ctx->H[ 6] = _mm256_set1_epi64x( 0x58595A5B58595A5B );
+   ctx->H[ 7] = _mm256_set1_epi64x( 0x5C5D5E5F5C5D5E5F );
+   ctx->H[ 8] = _mm256_set1_epi64x( 0x6061626360616263 );
+   ctx->H[ 9] = _mm256_set1_epi64x( 0x6465666764656667 );
+   ctx->H[10] = _mm256_set1_epi64x( 0x68696A6B68696A6B );
+   ctx->H[11] = _mm256_set1_epi64x( 0x6C6D6E6F6C6D6E6F );
+   ctx->H[12] = _mm256_set1_epi64x( 0x7071727370717273 );
+   ctx->H[13] = _mm256_set1_epi64x( 0x7475767774757677 );
+   ctx->H[14] = _mm256_set1_epi64x( 0x78797A7B78797A7B );
+   ctx->H[15] = _mm256_set1_epi64x( 0x7C7D7E7F7C7D7E7F );
   ctx->ptr       = 0;
   ctx->bit_count = 0;
 }
@@ -1030,7 +1030,7 @@ void bmw256_8way_close( bmw256_8way_context *ctx, void *dst )

   buf = ctx->buf;
   ptr = ctx->ptr;
-   buf[ ptr>>2 ] = m256_const1_64( 0x0000008000000080 );
+   buf[ ptr>>2 ] = _mm256_set1_epi64x( 0x0000008000000080 );
   ptr += 4;
   h = ctx->H;

@@ -1460,22 +1460,22 @@ static const __m512i final_s16[16] =

 void bmw256_16way_init( bmw256_16way_context *ctx )
 {
-   ctx->H[ 0] = m512_const1_64( 0x4041424340414243 );
-   ctx->H[ 1] = m512_const1_64( 0x4445464744454647 );
-   ctx->H[ 2] = m512_const1_64( 0x48494A4B48494A4B );
-   ctx->H[ 3] = m512_const1_64( 0x4C4D4E4F4C4D4E4F );
-   ctx->H[ 4] = m512_const1_64( 0x5051525350515253 );
-   ctx->H[ 5] = m512_const1_64( 0x5455565754555657 );
-   ctx->H[ 6] = m512_const1_64( 0x58595A5B58595A5B );
-   ctx->H[ 7] = m512_const1_64( 0x5C5D5E5F5C5D5E5F );
-   ctx->H[ 8] = m512_const1_64( 0x6061626360616263 );
-   ctx->H[ 9] = m512_const1_64( 0x6465666764656667 );
-   ctx->H[10] = m512_const1_64( 0x68696A6B68696A6B );
-   ctx->H[11] = m512_const1_64( 0x6C6D6E6F6C6D6E6F );
-   ctx->H[12] = m512_const1_64( 0x7071727370717273 );
-   ctx->H[13] = m512_const1_64( 0x7475767774757677 );
-   ctx->H[14] = m512_const1_64( 0x78797A7B78797A7B );
-   ctx->H[15] = m512_const1_64( 0x7C7D7E7F7C7D7E7F );
+   ctx->H[ 0] = _mm512_set1_epi64( 0x4041424340414243 );
+   ctx->H[ 1] = _mm512_set1_epi64( 0x4445464744454647 );
+   ctx->H[ 2] = _mm512_set1_epi64( 0x48494A4B48494A4B );
+   ctx->H[ 3] = _mm512_set1_epi64( 0x4C4D4E4F4C4D4E4F );
+   ctx->H[ 4] = _mm512_set1_epi64( 0x5051525350515253 );
+   ctx->H[ 5] = _mm512_set1_epi64( 0x5455565754555657 );
+   ctx->H[ 6] = _mm512_set1_epi64( 0x58595A5B58595A5B );
+   ctx->H[ 7] = _mm512_set1_epi64( 0x5C5D5E5F5C5D5E5F );
+   ctx->H[ 8] = _mm512_set1_epi64( 0x6061626360616263 );
+   ctx->H[ 9] = _mm512_set1_epi64( 0x6465666764656667 );
+   ctx->H[10] = _mm512_set1_epi64( 0x68696A6B68696A6B );
+   ctx->H[11] = _mm512_set1_epi64( 0x6C6D6E6F6C6D6E6F );
+   ctx->H[12] = _mm512_set1_epi64( 0x7071727370717273 );
+   ctx->H[13] = _mm512_set1_epi64( 0x7475767774757677 );
+   ctx->H[14] = _mm512_set1_epi64( 0x78797A7B78797A7B );
+   ctx->H[15] = _mm512_set1_epi64( 0x7C7D7E7F7C7D7E7F );
   ctx->ptr       = 0;
   ctx->bit_count = 0;
 }
@@ -1531,7 +1531,7 @@ void bmw256_16way_close( bmw256_16way_context *ctx, void *dst )

   buf = ctx->buf;
   ptr = ctx->ptr;
-   buf[ ptr>>2 ] = m512_const1_64( 0x0000008000000080 );
+   buf[ ptr>>2 ] = _mm512_set1_epi64( 0x0000008000000080 );
   ptr += 4;
   h = ctx->H;

--- a/algo/bmw/bmw512-hash-4way.c
+++ b/algo/bmw/bmw512-hash-4way.c
@@ -896,22 +896,22 @@ static const __m256i final_b[16] =
 static void
 bmw64_4way_init( bmw_4way_big_context *sc, const sph_u64 *iv )
 {
-   sc->H[ 0] = m256_const1_64( 0x8081828384858687 );
-   sc->H[ 1] = m256_const1_64( 0x88898A8B8C8D8E8F );
-   sc->H[ 2] = m256_const1_64( 0x9091929394959697 );
-   sc->H[ 3] = m256_const1_64( 0x98999A9B9C9D9E9F );
-   sc->H[ 4] = m256_const1_64( 0xA0A1A2A3A4A5A6A7 );
-   sc->H[ 5] = m256_const1_64( 0xA8A9AAABACADAEAF );
-   sc->H[ 6] = m256_const1_64( 0xB0B1B2B3B4B5B6B7 );
-   sc->H[ 7] = m256_const1_64( 0xB8B9BABBBCBDBEBF );
-   sc->H[ 8] = m256_const1_64( 0xC0C1C2C3C4C5C6C7 );
-   sc->H[ 9] = m256_const1_64( 0xC8C9CACBCCCDCECF );
-   sc->H[10] = m256_const1_64( 0xD0D1D2D3D4D5D6D7 );
-   sc->H[11] = m256_const1_64( 0xD8D9DADBDCDDDEDF );
-   sc->H[12] = m256_const1_64( 0xE0E1E2E3E4E5E6E7 );
-   sc->H[13] = m256_const1_64( 0xE8E9EAEBECEDEEEF );
-   sc->H[14] = m256_const1_64( 0xF0F1F2F3F4F5F6F7 );
-   sc->H[15] = m256_const1_64( 0xF8F9FAFBFCFDFEFF );
+   sc->H[ 0] = _mm256_set1_epi64x( 0x8081828384858687 );
+   sc->H[ 1] = _mm256_set1_epi64x( 0x88898A8B8C8D8E8F );
+   sc->H[ 2] = _mm256_set1_epi64x( 0x9091929394959697 );
+   sc->H[ 3] = _mm256_set1_epi64x( 0x98999A9B9C9D9E9F );
+   sc->H[ 4] = _mm256_set1_epi64x( 0xA0A1A2A3A4A5A6A7 );
+   sc->H[ 5] = _mm256_set1_epi64x( 0xA8A9AAABACADAEAF );
+   sc->H[ 6] = _mm256_set1_epi64x( 0xB0B1B2B3B4B5B6B7 );
+   sc->H[ 7] = _mm256_set1_epi64x( 0xB8B9BABBBCBDBEBF );
+   sc->H[ 8] = _mm256_set1_epi64x( 0xC0C1C2C3C4C5C6C7 );
+   sc->H[ 9] = _mm256_set1_epi64x( 0xC8C9CACBCCCDCECF );
+   sc->H[10] = _mm256_set1_epi64x( 0xD0D1D2D3D4D5D6D7 );
+   sc->H[11] = _mm256_set1_epi64x( 0xD8D9DADBDCDDDEDF );
+   sc->H[12] = _mm256_set1_epi64x( 0xE0E1E2E3E4E5E6E7 );
+   sc->H[13] = _mm256_set1_epi64x( 0xE8E9EAEBECEDEEEF );
+   sc->H[14] = _mm256_set1_epi64x( 0xF0F1F2F3F4F5F6F7 );
+   sc->H[15] = _mm256_set1_epi64x( 0xF8F9FAFBFCFDFEFF );
   sc->ptr = 0;
   sc->bit_count = 0;
 }
@@ -967,7 +967,7 @@ bmw64_4way_close(bmw_4way_big_context *sc, unsigned ub, unsigned n,

   buf = sc->buf;
   ptr = sc->ptr;
-   buf[ ptr>>3 ] = m256_const1_64( 0x80 );
+   buf[ ptr>>3 ] = _mm256_set1_epi64x( 0x80 );
   ptr += 8;
   h = sc->H;

@@ -1379,22 +1379,22 @@ static const __m512i final_b8[16] =
 void bmw512_8way_init( bmw512_8way_context *ctx )
 //bmw64_4way_init( bmw_4way_big_context *sc, const sph_u64 *iv )
 {
-   ctx->H[ 0] = m512_const1_64( 0x8081828384858687 );
-   ctx->H[ 1] = m512_const1_64( 0x88898A8B8C8D8E8F );
-   ctx->H[ 2] = m512_const1_64( 0x9091929394959697 );
-   ctx->H[ 3] = m512_const1_64( 0x98999A9B9C9D9E9F );
-   ctx->H[ 4] = m512_const1_64( 0xA0A1A2A3A4A5A6A7 );
-   ctx->H[ 5] = m512_const1_64( 0xA8A9AAABACADAEAF );
-   ctx->H[ 6] = m512_const1_64( 0xB0B1B2B3B4B5B6B7 );
-   ctx->H[ 7] = m512_const1_64( 0xB8B9BABBBCBDBEBF );
-   ctx->H[ 8] = m512_const1_64( 0xC0C1C2C3C4C5C6C7 );
-   ctx->H[ 9] = m512_const1_64( 0xC8C9CACBCCCDCECF );
-   ctx->H[10] = m512_const1_64( 0xD0D1D2D3D4D5D6D7 );
-   ctx->H[11] = m512_const1_64( 0xD8D9DADBDCDDDEDF );
-   ctx->H[12] = m512_const1_64( 0xE0E1E2E3E4E5E6E7 );
-   ctx->H[13] = m512_const1_64( 0xE8E9EAEBECEDEEEF );
-   ctx->H[14] = m512_const1_64( 0xF0F1F2F3F4F5F6F7 );
-   ctx->H[15] = m512_const1_64( 0xF8F9FAFBFCFDFEFF );
+   ctx->H[ 0] = _mm512_set1_epi64( 0x8081828384858687 );
+   ctx->H[ 1] = _mm512_set1_epi64( 0x88898A8B8C8D8E8F );
+   ctx->H[ 2] = _mm512_set1_epi64( 0x9091929394959697 );
+   ctx->H[ 3] = _mm512_set1_epi64( 0x98999A9B9C9D9E9F );
+   ctx->H[ 4] = _mm512_set1_epi64( 0xA0A1A2A3A4A5A6A7 );
+   ctx->H[ 5] = _mm512_set1_epi64( 0xA8A9AAABACADAEAF );
+   ctx->H[ 6] = _mm512_set1_epi64( 0xB0B1B2B3B4B5B6B7 );
+   ctx->H[ 7] = _mm512_set1_epi64( 0xB8B9BABBBCBDBEBF );
+   ctx->H[ 8] = _mm512_set1_epi64( 0xC0C1C2C3C4C5C6C7 );
+   ctx->H[ 9] = _mm512_set1_epi64( 0xC8C9CACBCCCDCECF );
+   ctx->H[10] = _mm512_set1_epi64( 0xD0D1D2D3D4D5D6D7 );
+   ctx->H[11] = _mm512_set1_epi64( 0xD8D9DADBDCDDDEDF );
+   ctx->H[12] = _mm512_set1_epi64( 0xE0E1E2E3E4E5E6E7 );
+   ctx->H[13] = _mm512_set1_epi64( 0xE8E9EAEBECEDEEEF );
+   ctx->H[14] = _mm512_set1_epi64( 0xF0F1F2F3F4F5F6F7 );
+   ctx->H[15] = _mm512_set1_epi64( 0xF8F9FAFBFCFDFEFF );
   ctx->ptr = 0;
   ctx->bit_count = 0;
 }
@@ -1448,7 +1448,7 @@ void bmw512_8way_close( bmw512_8way_context *ctx, void *dst )

   buf = ctx->buf;
   ptr = ctx->ptr;
-   buf[ ptr>>3 ] = m512_const1_64( 0x80 );
+   buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
   ptr += 8;
   h = ctx->H;

@@ -1483,22 +1483,22 @@ void bmw512_8way_full( bmw512_8way_context *ctx, void *out, const void *data,

 // Init

-   H[ 0] = m512_const1_64( 0x8081828384858687 );
-   H[ 1] = m512_const1_64( 0x88898A8B8C8D8E8F );
-   H[ 2] = m512_const1_64( 0x9091929394959697 );
-   H[ 3] = m512_const1_64( 0x98999A9B9C9D9E9F );
-   H[ 4] = m512_const1_64( 0xA0A1A2A3A4A5A6A7 );
-   H[ 5] = m512_const1_64( 0xA8A9AAABACADAEAF );
-   H[ 6] = m512_const1_64( 0xB0B1B2B3B4B5B6B7 );
-   H[ 7] = m512_const1_64( 0xB8B9BABBBCBDBEBF );
-   H[ 8] = m512_const1_64( 0xC0C1C2C3C4C5C6C7 );
-   H[ 9] = m512_const1_64( 0xC8C9CACBCCCDCECF );
-   H[10] = m512_const1_64( 0xD0D1D2D3D4D5D6D7 );
-   H[11] = m512_const1_64( 0xD8D9DADBDCDDDEDF );
-   H[12] = m512_const1_64( 0xE0E1E2E3E4E5E6E7 );
-   H[13] = m512_const1_64( 0xE8E9EAEBECEDEEEF );
-   H[14] = m512_const1_64( 0xF0F1F2F3F4F5F6F7 );
-   H[15] = m512_const1_64( 0xF8F9FAFBFCFDFEFF );
+   H[ 0] = _mm512_set1_epi64( 0x8081828384858687 );
+   H[ 1] = _mm512_set1_epi64( 0x88898A8B8C8D8E8F );
+   H[ 2] = _mm512_set1_epi64( 0x9091929394959697 );
+   H[ 3] = _mm512_set1_epi64( 0x98999A9B9C9D9E9F );
+   H[ 4] = _mm512_set1_epi64( 0xA0A1A2A3A4A5A6A7 );
+   H[ 5] = _mm512_set1_epi64( 0xA8A9AAABACADAEAF );
+   H[ 6] = _mm512_set1_epi64( 0xB0B1B2B3B4B5B6B7 );
+   H[ 7] = _mm512_set1_epi64( 0xB8B9BABBBCBDBEBF );
+   H[ 8] = _mm512_set1_epi64( 0xC0C1C2C3C4C5C6C7 );
+   H[ 9] = _mm512_set1_epi64( 0xC8C9CACBCCCDCECF );
+   H[10] = _mm512_set1_epi64( 0xD0D1D2D3D4D5D6D7 );
+   H[11] = _mm512_set1_epi64( 0xD8D9DADBDCDDDEDF );
+   H[12] = _mm512_set1_epi64( 0xE0E1E2E3E4E5E6E7 );
+   H[13] = _mm512_set1_epi64( 0xE8E9EAEBECEDEEEF );
+   H[14] = _mm512_set1_epi64( 0xF0F1F2F3F4F5F6F7 );
+   H[15] = _mm512_set1_epi64( 0xF8F9FAFBFCFDFEFF );

 // Update

@@ -1530,7 +1530,7 @@ void bmw512_8way_full( bmw512_8way_context *ctx, void *out, const void *data,
   __m512i h1[16], h2[16];
   size_t u, v;

-   buf[ ptr>>3 ] = m512_const1_64( 0x80 );
+   buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
   ptr += 8;

   if (  ptr > (buf_size - 8) )
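Reviewer note: the bmw hunks above replace the project wrappers `m128_const1_64`, `m256_const1_64` and `m512_const1_64` with the compiler intrinsics `_mm_set1_epi64x`, `_mm256_set1_epi64x` and `_mm512_set1_epi64`. In bmw256 each 64-bit constant repeats one 32-bit word twice, so broadcasting it fills every 32-bit lane with that word. The sketch below checks this for the 128-bit case. `broadcast64_as_lanes32` is a hypothetical helper, and the non-SSE2 branch is a portable fallback added only so the example also builds on other targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Broadcast one 64-bit constant across a 128-bit vector and write the
 * result out as four 32-bit lanes. */
static void broadcast64_as_lanes32(uint64_t c, uint32_t lanes[4])
{
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi64x((long long)c);
    _mm_storeu_si128((__m128i *)lanes, v);
#else
    /* Fallback: two copies of the constant, reinterpreted as four words. */
    uint64_t pair[2] = { c, c };
    memcpy(lanes, pair, sizeof pair);
#endif
}
```

Calling it with `0x4041424340414243` (the first bmw256 IV constant in the diff) leaves `0x40414243` in all four lanes, which is why a 64-bit broadcast is sufficient to initialize a 4-way 32-bit state.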
--- a/algo/cubehash/cube-hash-2way.c
+++ b/algo/cubehash/cube-hash-2way.c
@@ -221,14 +221,14 @@ int cube_4way_init( cube_4way_context *sp, int hashbitlen, int rounds,
    sp->rounds    = rounds;
    sp->pos       = 0;

-    h[ 0] = m512_const1_128( iv[0] );
-    h[ 1] = m512_const1_128( iv[1] );
-    h[ 2] = m512_const1_128( iv[2] );
-    h[ 3] = m512_const1_128( iv[3] );
-    h[ 4] = m512_const1_128( iv[4] );
-    h[ 5] = m512_const1_128( iv[5] );
-    h[ 6] = m512_const1_128( iv[6] );
-    h[ 7] = m512_const1_128( iv[7] );
+    h[ 0] = mm512_bcast_m128( iv[0] );
+    h[ 1] = mm512_bcast_m128( iv[1] );
+    h[ 2] = mm512_bcast_m128( iv[2] );
+    h[ 3] = mm512_bcast_m128( iv[3] );
+    h[ 4] = mm512_bcast_m128( iv[4] );
+    h[ 5] = mm512_bcast_m128( iv[5] );
+    h[ 6] = mm512_bcast_m128( iv[6] );
+    h[ 7] = mm512_bcast_m128( iv[7] );

    return 0;
 }
@@ -259,11 +259,11 @@ int cube_4way_close( cube_4way_context *sp, void *output )

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
-                                 m512_const2_64( 0, 0x0000000000000080 ) );
+                         mm512_bcast128lo_64( 0x0000000000000080 ) );
    transform_4way( sp );

    sp->h[7] = _mm512_xor_si512( sp->h[7],
-                                 m512_const2_64( 0x0000000100000000, 0 ) );
+                         mm512_bcast128hi_64( 0x0000000100000000 ) );

    for ( i = 0; i < 10; ++i ) 
       transform_4way( sp );
@@ -283,14 +283,14 @@ int cube_4way_full( cube_4way_context *sp, void *output,  int hashbitlen,
    sp->rounds    = 16;
    sp->pos       = 0;

-    h[ 0] = m512_const1_128( iv[0] );
-    h[ 1] = m512_const1_128( iv[1] );
-    h[ 2] = m512_const1_128( iv[2] );
-    h[ 3] = m512_const1_128( iv[3] );
-    h[ 4] = m512_const1_128( iv[4] );
-    h[ 5] = m512_const1_128( iv[5] );
-    h[ 6] = m512_const1_128( iv[6] );
-    h[ 7] = m512_const1_128( iv[7] );
+    h[ 0] = mm512_bcast_m128( iv[0] );
+    h[ 1] = mm512_bcast_m128( iv[1] );
+    h[ 2] = mm512_bcast_m128( iv[2] );
+    h[ 3] = mm512_bcast_m128( iv[3] );
+    h[ 4] = mm512_bcast_m128( iv[4] );
+    h[ 5] = mm512_bcast_m128( iv[5] );
+    h[ 6] = mm512_bcast_m128( iv[6] );
+    h[ 7] = mm512_bcast_m128( iv[7] );

    const int len = size >> 4;
    const __m512i *in = (__m512i*)data;
@@ -310,11 +310,11 @@ int cube_4way_full( cube_4way_context *sp, void *output,  int hashbitlen,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
-                                    m512_const2_64( 0, 0x0000000000000080 ) );
+                         mm512_bcast128lo_64( 0x0000000000000080 ) );
    transform_4way( sp );

    sp->h[7] = _mm512_xor_si512( sp->h[7],
-                                    m512_const2_64( 0x0000000100000000, 0 ) );
+                         mm512_bcast128hi_64( 0x0000000100000000 ) );

    for ( i = 0; i < 10; ++i )
       transform_4way( sp );
@@ -336,14 +336,14 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
    sp->rounds    = 16;
    sp->pos       = 0;

-    h1[0] = h0[0] = m512_const1_128( iv[0] );
-    h1[1] = h0[1] = m512_const1_128( iv[1] );
-    h1[2] = h0[2] = m512_const1_128( iv[2] );
-    h1[3] = h0[3] = m512_const1_128( iv[3] );
-    h1[4] = h0[4] = m512_const1_128( iv[4] );
-    h1[5] = h0[5] = m512_const1_128( iv[5] );
-    h1[6] = h0[6] = m512_const1_128( iv[6] );
-    h1[7] = h0[7] = m512_const1_128( iv[7] );
+    h1[0] = h0[0] = mm512_bcast_m128( iv[0] );
+    h1[1] = h0[1] = mm512_bcast_m128( iv[1] );
+    h1[2] = h0[2] = mm512_bcast_m128( iv[2] );
+    h1[3] = h0[3] = mm512_bcast_m128( iv[3] );
+    h1[4] = h0[4] = mm512_bcast_m128( iv[4] );
+    h1[5] = h0[5] = mm512_bcast_m128( iv[5] );
+    h1[6] = h0[6] = mm512_bcast_m128( iv[6] );
+    h1[7] = h0[7] = mm512_bcast_m128( iv[7] );

    const int len = size >> 4;
    const __m512i *in0 = (__m512i*)data0;
@@ -365,13 +365,13 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
    }

    // pos is zero for 64 byte data, 1 for 80 byte data.
-    __m512i tmp = m512_const2_64( 0, 0x0000000000000080 );
+    __m512i tmp = mm512_bcast128lo_64( 0x0000000000000080 );
    sp->h0[ sp->pos ] = _mm512_xor_si512( sp->h0[ sp->pos ], tmp );
    sp->h1[ sp->pos ] = _mm512_xor_si512( sp->h1[ sp->pos ], tmp );

    transform_4way_2buf( sp );

-    tmp = m512_const2_64( 0x0000000100000000, 0 );
+    tmp = mm512_bcast128hi_64( 0x0000000100000000 );
    sp->h0[7] = _mm512_xor_si512( sp->h0[7], tmp );
    sp->h1[7] = _mm512_xor_si512( sp->h1[7], tmp );

@@ -384,7 +384,6 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
    return 0;
 }

-
 int cube_4way_update_close( cube_4way_context *sp, void *output,
                               const void *data, size_t size )
 {
@@ -406,11 +405,11 @@ int cube_4way_update_close( cube_4way_context *sp, void *output,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
-                                    m512_const2_64( 0, 0x0000000000000080 ) );
+                          mm512_bcast128lo_64( 0x0000000000000080 ) );
    transform_4way( sp );

    sp->h[7] = _mm512_xor_si512( sp->h[7],
-                                    m512_const2_64( 0x0000000100000000, 0 ) );
+                          mm512_bcast128hi_64( 0x0000000100000000 ) );

    for ( i = 0; i < 10; ++i )
       transform_4way( sp );
@@ -424,21 +423,6 @@ int cube_4way_update_close( cube_4way_context *sp, void *output,

 // 2 way 128 

-// This isn't expected to be used with AVX512 so HW rotate intruction
-// is assumed not avaiable.
-// Use double buffering to optimize serial bit rotations. Full double
-// buffering isn't practical because it needs twice as many registers
-// with AVX2 having only half as many as AVX512.
-#define ROL2( out0, out1, in0, in1, c ) \
-{ \
- __m256i t0 = _mm256_slli_epi32( in0, c ); \
- __m256i t1 = _mm256_slli_epi32( in1, c ); \
- out0 = _mm256_srli_epi32( in0, 32-(c) ); \
- out1 = _mm256_srli_epi32( in1, 32-(c) ); \
- out0 = _mm256_or_si256( out0, t0 ); \
- out1 = _mm256_or_si256( out1, t1 ); \
-}
-
 static void transform_2way( cube_2way_context *sp )
 {
    int r;
@@ -461,8 +445,10 @@ static void transform_2way( cube_2way_context *sp )
        x5 = _mm256_add_epi32( x1, x5 );
        x6 = _mm256_add_epi32( x2, x6 );
        x7 = _mm256_add_epi32( x3, x7 );
-        ROL2( y0, y1, x2, x3, 7 );
-        ROL2( x2, x3, x0, x1, 7 );
+        y0 = mm256_rol_32( x2, 7 );
+        y1 = mm256_rol_32( x3, 7 );
+        x2 = mm256_rol_32( x0, 7 );
+        x3 = mm256_rol_32( x1, 7 );
        x0 = _mm256_xor_si256( y0, x4 );
        x1 = _mm256_xor_si256( y1, x5 );
        x2 = _mm256_xor_si256( x2, x6 );
@@ -475,8 +461,10 @@ static void transform_2way( cube_2way_context *sp )
        x5 = _mm256_add_epi32( x1, x5 );
        x6 = _mm256_add_epi32( x2, x6 );
        x7 = _mm256_add_epi32( x3, x7 );
-        ROL2( y0, x1, x1, x0, 11 );
-        ROL2( y1, x3, x3, x2, 11 );
+        y0 = mm256_rol_32( x1, 11 );
+        x1 = mm256_rol_32( x0, 11 );
+        y1 = mm256_rol_32( x3, 11 );
+        x3 = mm256_rol_32( x2, 11 );
        x0 = _mm256_xor_si256( y0, x4 );
        x1 = _mm256_xor_si256( x1, x5 );
        x2 = _mm256_xor_si256( y1, x6 );
@@ -508,14 +496,14 @@ int cube_2way_init( cube_2way_context *sp, int hashbitlen, int rounds,
    sp->rounds    = rounds;
    sp->pos       = 0;

-    h[ 0] = m256_const1_128( iv[0] );
-    h[ 1] = m256_const1_128( iv[1] );
-    h[ 2] = m256_const1_128( iv[2] );
-    h[ 3] = m256_const1_128( iv[3] );
-    h[ 4] = m256_const1_128( iv[4] );
-    h[ 5] = m256_const1_128( iv[5] );
-    h[ 6] = m256_const1_128( iv[6] );
-    h[ 7] = m256_const1_128( iv[7] );
+    h[ 0] = mm256_bcast_m128( iv[0] );
+    h[ 1] = mm256_bcast_m128( iv[1] );
+    h[ 2] = mm256_bcast_m128( iv[2] );
+    h[ 3] = mm256_bcast_m128( iv[3] );
+    h[ 4] = mm256_bcast_m128( iv[4] );
+    h[ 5] = mm256_bcast_m128( iv[5] );
+    h[ 6] = mm256_bcast_m128( iv[6] );
+    h[ 7] = mm256_bcast_m128( iv[7] );
    
    return 0;
 }
@@ -546,13 +534,14 @@ int cube_2way_close( cube_2way_context *sp, void *output )

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                                   m256_const2_64( 0, 0x0000000000000080 ) );
+                                   mm256_bcast128lo_64( 0x0000000000000080 ) );
    transform_2way( sp );

    sp->h[7] = _mm256_xor_si256( sp->h[7],
-                                   m256_const2_64( 0x0000000100000000, 0 ) );
+                                   mm256_bcast128hi_64( 0x0000000100000000 ) );

-    for ( i = 0; i < 10; ++i )           transform_2way( sp );
+    for ( i = 0; i < 10; ++i )  
+       transform_2way( sp );

    memcpy( hash, sp->h, sp->hashlen<<5 );
    return 0;
@@ -579,13 +568,14 @@ int cube_2way_update_close( cube_2way_context *sp, void *output,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                                    m256_const2_64( 0, 0x0000000000000080 ) );
+                                    mm256_bcast128lo_64( 0x0000000000000080 ) );
    transform_2way( sp );

    sp->h[7] = _mm256_xor_si256( sp->h[7],
-                                    m256_const2_64( 0x0000000100000000, 0 ) );
+                                    mm256_bcast128hi_64( 0x0000000100000000 ) );

-    for ( i = 0; i < 10; ++i )    transform_2way( sp );
+    for ( i = 0; i < 10; ++i )
+       transform_2way( sp );

    memcpy( hash, sp->h, sp->hashlen<<5 );
    return 0;
@@ -602,14 +592,14 @@ int cube_2way_full( cube_2way_context *sp, void *output, int hashbitlen,
    sp->rounds    = 16;
    sp->pos       = 0;

-    h[ 0] = m256_const1_128( iv[0] );
-    h[ 1] = m256_const1_128( iv[1] );
-    h[ 2] = m256_const1_128( iv[2] );
-    h[ 3] = m256_const1_128( iv[3] );
-    h[ 4] = m256_const1_128( iv[4] );
-    h[ 5] = m256_const1_128( iv[5] );
-    h[ 6] = m256_const1_128( iv[6] );
-    h[ 7] = m256_const1_128( iv[7] );
+    h[ 0] = mm256_bcast_m128( iv[0] );
+    h[ 1] = mm256_bcast_m128( iv[1] );
+    h[ 2] = mm256_bcast_m128( iv[2] );
+    h[ 3] = mm256_bcast_m128( iv[3] );
+    h[ 4] = mm256_bcast_m128( iv[4] );
+    h[ 5] = mm256_bcast_m128( iv[5] );
+    h[ 6] = mm256_bcast_m128( iv[6] );
+    h[ 7] = mm256_bcast_m128( iv[7] );

    const int len = size >> 4;
    const __m256i *in = (__m256i*)data;
@@ -629,13 +619,14 @@ int cube_2way_full( cube_2way_context *sp, void *output, int hashbitlen,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                                    m256_const2_64( 0, 0x0000000000000080 ) );
+                                    mm256_bcast128lo_64( 0x0000000000000080 ) );
    transform_2way( sp );

    sp->h[7] = _mm256_xor_si256( sp->h[7],
-                                    m256_const2_64( 0x0000000100000000, 0 ) );
+                                    mm256_bcast128hi_64( 0x0000000100000000 ) );

-    for ( i = 0; i < 10; ++i )    transform_2way( sp );
+    for ( i = 0; i < 10; ++i )
+       transform_2way( sp );

    memcpy( hash, sp->h, sp->hashlen<<5 );
    return 0;
--- a/algo/cubehash/cubehash_sse2.c
+++ b/algo/cubehash/cubehash_sse2.c
@@ -32,7 +32,7 @@ static void transform( cubehashParam *sp )
    { 
        x1 = _mm512_add_epi32( x0, x1 );
        x0 = mm512_swap_256( x0 );
-        x0 = mm512_rol_32(  x0, 7 );
+        x0 = mm512_rol_32( x0, 7 );
        x0 = _mm512_xor_si512( x0, x1 );
        x1 = mm512_swap128_64( x1 );
        x1 = _mm512_add_epi32( x0, x1 );
@@ -58,19 +58,18 @@ static void transform( cubehashParam *sp )
    { 
        x2 = _mm256_add_epi32( x0, x2 );
        x3 = _mm256_add_epi32( x1, x3 );
-        y0 = x0;
-        x0 = mm256_rol_32( x1, 7 );
-        x1 = mm256_rol_32( y0, 7 );
-        x0 = _mm256_xor_si256( x0, x2 );
-        x1 = _mm256_xor_si256( x1, x3 );
+        y0 = mm256_rol_32( x1, 7 );
+        y1 = mm256_rol_32( x0, 7 );
+        x0 = _mm256_xor_si256( y0, x2 );
+        x1 = _mm256_xor_si256( y1, x3 );
        x2 = mm256_swap128_64( x2 );
        x3 = mm256_swap128_64( x3 );
        x2 = _mm256_add_epi32( x0, x2 );
        x3 = _mm256_add_epi32( x1, x3 );
-        y0 = mm256_swap_128( x0 );
-        y1 = mm256_swap_128( x1 );
-        x0 = mm256_rol_32( y0, 11 );
-        x1 = mm256_rol_32( y1, 11 );
+        x0 = mm256_swap_128( x0 );
+        x1 = mm256_swap_128( x1 );
+        x0 = mm256_rol_32( x0, 11 );
+        x1 = mm256_rol_32( x1, 11 );
        x0 = _mm256_xor_si256( x0, x2 );
        x1 = _mm256_xor_si256( x1, x3 );
        x2 = mm256_swap64_32( x2 );
@@ -94,47 +93,48 @@ static void transform( cubehashParam *sp )
    x6 = _mm_load_si128( (__m128i*)sp->x + 6 );
    x7 = _mm_load_si128( (__m128i*)sp->x + 7 );

-    for (r = 0; r < rounds; ++r) {
-	x4 = _mm_add_epi32(x0, x4);
-	x5 = _mm_add_epi32(x1, x5);
-	x6 = _mm_add_epi32(x2, x6);
-	x7 = _mm_add_epi32(x3, x7);
-	y0 = x2;
-	y1 = x3;
-	y2 = x0;
-	y3 = x1;
-	x0 = _mm_xor_si128(_mm_slli_epi32(y0, 7), _mm_srli_epi32(y0, 25));
-	x1 = _mm_xor_si128(_mm_slli_epi32(y1, 7), _mm_srli_epi32(y1, 25));
-	x2 = _mm_xor_si128(_mm_slli_epi32(y2, 7), _mm_srli_epi32(y2, 25));
-	x3 = _mm_xor_si128(_mm_slli_epi32(y3, 7), _mm_srli_epi32(y3, 25));
-	x0 = _mm_xor_si128(x0, x4);
-	x1 = _mm_xor_si128(x1, x5);
-	x2 = _mm_xor_si128(x2, x6);
-	x3 = _mm_xor_si128(x3, x7);
-	x4 = _mm_shuffle_epi32(x4, 0x4e);
-	x5 = _mm_shuffle_epi32(x5, 0x4e);
-	x6 = _mm_shuffle_epi32(x6, 0x4e);
-	x7 = _mm_shuffle_epi32(x7, 0x4e);
-	x4 = _mm_add_epi32(x0, x4);
-	x5 = _mm_add_epi32(x1, x5);
-	x6 = _mm_add_epi32(x2, x6);
-	x7 = _mm_add_epi32(x3, x7);
-	y0 = x1;
-	y1 = x0;
-	y2 = x3;
-	y3 = x2;
-	x0 = _mm_xor_si128(_mm_slli_epi32(y0, 11), _mm_srli_epi32(y0, 21));
-	x1 = _mm_xor_si128(_mm_slli_epi32(y1, 11), _mm_srli_epi32(y1, 21));
-	x2 = _mm_xor_si128(_mm_slli_epi32(y2, 11), _mm_srli_epi32(y2, 21));
-	x3 = _mm_xor_si128(_mm_slli_epi32(y3, 11), _mm_srli_epi32(y3, 21));
-	x0 = _mm_xor_si128(x0, x4);
-	x1 = _mm_xor_si128(x1, x5);
-	x2 = _mm_xor_si128(x2, x6);
-	x3 = _mm_xor_si128(x3, x7);
-	x4 = _mm_shuffle_epi32(x4, 0xb1);
-	x5 = _mm_shuffle_epi32(x5, 0xb1);
-	x6 = _mm_shuffle_epi32(x6, 0xb1);
-	x7 = _mm_shuffle_epi32(x7, 0xb1);
+    for ( r = 0; r < rounds; ++r )
+    {
+       x4 = _mm_add_epi32( x0, x4 );
+       x5 = _mm_add_epi32( x1, x5 );
+       x6 = _mm_add_epi32( x2, x6 );
+       x7 = _mm_add_epi32( x3, x7 );
+       y0 = x2;
+       y1 = x3;
+       y2 = x0;
+       y3 = x1;
+       x0 = mm128_rol_32( y0, 7 );
+       x1 = mm128_rol_32( y1, 7 );
+       x2 = mm128_rol_32( y2, 7 );
+       x3 = mm128_rol_32( y3, 7 );
+       x0 = _mm_xor_si128( x0, x4 );
+       x1 = _mm_xor_si128( x1, x5 );
+       x2 = _mm_xor_si128( x2, x6 );
+       x3 = _mm_xor_si128( x3, x7 );
+       x4 = _mm_shuffle_epi32( x4, 0x4e );
+       x5 = _mm_shuffle_epi32( x5, 0x4e );
+       x6 = _mm_shuffle_epi32( x6, 0x4e );
+       x7 = _mm_shuffle_epi32( x7, 0x4e );
+       x4 = _mm_add_epi32( x0, x4 );
+       x5 = _mm_add_epi32( x1, x5 );
+       x6 = _mm_add_epi32( x2, x6 );
+       x7 = _mm_add_epi32( x3, x7 );
+       y0 = x1;
+       y1 = x0;
+       y2 = x3;
+       y3 = x2;
+       x0 = mm128_rol_32( y0, 11 );
+       x1 = mm128_rol_32( y1, 11 );
+       x2 = mm128_rol_32( y2, 11 );
+       x3 = mm128_rol_32( y3, 11 );
+       x0 = _mm_xor_si128( x0, x4 );
+       x1 = _mm_xor_si128( x1, x5 );
+       x2 = _mm_xor_si128( x2, x6 );
+       x3 = _mm_xor_si128( x3, x7 );
+       x4 = _mm_shuffle_epi32( x4, 0xb1 );
+       x5 = _mm_shuffle_epi32( x5, 0xb1 );
+       x6 = _mm_shuffle_epi32( x6, 0xb1 );
+       x7 = _mm_shuffle_epi32( x7, 0xb1 );
    }

    _mm_store_si128( (__m128i*)sp->x,     x0 );
@@ -180,25 +180,25 @@ int cubehashInit(cubehashParam *sp, int hashbitlen, int rounds, int blockbytes)
    if ( hashbitlen == 512 )
    {

-       x[0] = m128_const_64( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
-       x[1] = m128_const_64( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
-       x[2] = m128_const_64( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
-       x[3] = m128_const_64( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
-       x[4] = m128_const_64( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
-       x[5] = m128_const_64( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
-       x[6] = m128_const_64( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
-       x[7] = m128_const_64( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
+       x[0] = _mm_set_epi64x( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
+       x[1] = _mm_set_epi64x( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
+       x[2] = _mm_set_epi64x( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
+       x[3] = _mm_set_epi64x( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
+       x[4] = _mm_set_epi64x( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
+       x[5] = _mm_set_epi64x( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
+       x[6] = _mm_set_epi64x( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
+       x[7] = _mm_set_epi64x( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
    }
    else
    {
-       x[0] = m128_const_64( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
-       x[1] = m128_const_64( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
-       x[2] = m128_const_64( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
-       x[3] = m128_const_64( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
-       x[4] = m128_const_64( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
-       x[5] = m128_const_64( 0x93CB628565C892FD, 0x5FA2560309392549 );
-       x[6] = m128_const_64( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
-       x[7] = m128_const_64( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
+       x[0] = _mm_set_epi64x( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
+       x[1] = _mm_set_epi64x( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
+       x[2] = _mm_set_epi64x( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
+       x[3] = _mm_set_epi64x( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
+       x[4] = _mm_set_epi64x( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
+       x[5] = _mm_set_epi64x( 0x93CB628565C892FD, 0x5FA2560309392549 );
+       x[6] = _mm_set_epi64x( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
+       x[7] = _mm_set_epi64x( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
    }   

    return SUCCESS;
@@ -234,10 +234,10 @@ int cubehashDigest( cubehashParam *sp, byte *digest )

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
-                                      m128_const_64( 0, 0x80 ) );
+                                      _mm_set_epi64x( 0, 0x80 ) );
    transform( sp );

-    sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
+    sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );
    transform( sp );
    transform( sp );
    transform( sp );
@@ -279,10 +279,10 @@ int cubehashUpdateDigest( cubehashParam *sp, byte *digest,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
-                                      m128_const_64( 0, 0x80 ) );
+                                      _mm_set_epi64x( 0, 0x80 ) );
    transform( sp );

-    sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
+    sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );

    transform( sp );
    transform( sp );
@@ -313,25 +313,25 @@ int cubehash_full( cubehashParam *sp, byte *digest, int hashbitlen,
    if ( hashbitlen == 512 )
    {

-       x[0] = m128_const_64( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
-       x[1] = m128_const_64( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
-       x[2] = m128_const_64( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
-       x[3] = m128_const_64( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
-       x[4] = m128_const_64( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
-       x[5] = m128_const_64( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
-       x[6] = m128_const_64( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
-       x[7] = m128_const_64( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
+       x[0] = _mm_set_epi64x( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
+       x[1] = _mm_set_epi64x( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
+       x[2] = _mm_set_epi64x( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
+       x[3] = _mm_set_epi64x( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
+       x[4] = _mm_set_epi64x( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
+       x[5] = _mm_set_epi64x( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
+       x[6] = _mm_set_epi64x( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
+       x[7] = _mm_set_epi64x( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
    }
    else
    {
-       x[0] = m128_const_64( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
-       x[1] = m128_const_64( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
-       x[2] = m128_const_64( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
-       x[3] = m128_const_64( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
-       x[4] = m128_const_64( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
-       x[5] = m128_const_64( 0x93CB628565C892FD, 0x5FA2560309392549 );
-       x[6] = m128_const_64( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
-       x[7] = m128_const_64( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
+       x[0] = _mm_set_epi64x( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
+       x[1] = _mm_set_epi64x( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
+       x[2] = _mm_set_epi64x( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
+       x[3] = _mm_set_epi64x( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
+       x[4] = _mm_set_epi64x( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
+       x[5] = _mm_set_epi64x( 0x93CB628565C892FD, 0x5FA2560309392549 );
+       x[6] = _mm_set_epi64x( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
+       x[7] = _mm_set_epi64x( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
    }


@@ -358,10 +358,10 @@ int cubehash_full( cubehashParam *sp, byte *digest, int hashbitlen,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
-                                      m128_const_64( 0, 0x80 ) );
+                                      _mm_set_epi64x( 0, 0x80 ) );
    transform( sp );

-    sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
+    sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );

    transform( sp );
    transform( sp );
--- a/algo/echo/aes_ni/hash.c
+++ b/algo/echo/aes_ni/hash.c
@@ -566,16 +566,16 @@ HashReturn echo_full( hashState_echo *state, BitSequence *hashval,
         state->uHashSize = 256;
         state->uBlockLength = 192;
         state->uRounds = 8;
-         state->hashsize = m128_const_64( 0, 0x100 );
-         state->const1536 = m128_const_64( 0, 0x600 );
+         state->hashsize = _mm_set_epi64x( 0, 0x100 );
+         state->const1536 = _mm_set_epi64x( 0, 0x600 );
         break;

      case 512:
         state->uHashSize = 512;
         state->uBlockLength = 128;
         state->uRounds = 10;
-         state->hashsize = m128_const_64( 0, 0x200 );
-         state->const1536 = m128_const_64( 0, 0x400 );
+         state->hashsize = _mm_set_epi64x( 0, 0x200 );
+         state->const1536 = _mm_set_epi64x( 0, 0x400 );
         break;

      default:
--- a/algo/echo/echo-hash-4way.c
+++ b/algo/echo/echo-hash-4way.c
@@ -162,9 +162,9 @@ void echo_4way_compress( echo_4way_context *ctx, const __m512i *pmsg,
  unsigned int r, b, i, j;
  __m512i t1, t2, s2, k1;
  __m512i _state[4][4], _state2[4][4], _statebackup[4][4]; 
-  __m512i one = m512_one_128;
-  __m512i mul2mask = m512_const2_64( 0, 0x00001b00 );
-  __m512i lsbmask  = m512_const1_32( 0x01010101 ); 
+  const __m512i one = mm512_bcast128lo_64( 1 ); 
+  const __m512i mul2mask = mm512_bcast128lo_64( 0x00001b00 );
+  const __m512i lsbmask  = _mm512_set1_epi32( 0x01010101 ); 

  _state[ 0 ][ 0 ] = ctx->state[ 0 ][ 0 ];
  _state[ 0 ][ 1 ] = ctx->state[ 0 ][ 1 ];
@@ -264,16 +264,16 @@ int echo_4way_init( echo_4way_context *ctx, int nHashSize )
 		ctx->uHashSize = 256;
 		ctx->uBlockLength = 192;
 		ctx->uRounds = 8;
-		ctx->hashsize = m512_const2_64( 0, 0x100 );
-		ctx->const1536 = m512_const2_64( 0, 0x600 );
+      ctx->hashsize = mm512_bcast128lo_64( 0x100 );
+      ctx->const1536 = mm512_bcast128lo_64( 0x600 );
 		break;

 	case 512:
 		ctx->uHashSize = 512;
 		ctx->uBlockLength = 128;
 		ctx->uRounds = 10;
-		ctx->hashsize = m512_const2_64( 0, 0x200 );
-		ctx->const1536 = m512_const2_64( 0, 0x400);
+      ctx->hashsize = mm512_bcast128lo_64( 0x200 );
+      ctx->const1536 = mm512_bcast128lo_64( 0x400 );
 		break;

 	default:
@@ -305,7 +305,7 @@ int echo_4way_update_close( echo_4way_context *state, void *hashval,
   {
      echo_4way_compress( state, data, 1 );
      state->processed_bits = 1024;
-      remainingbits = m512_const2_64( 0, -1024 );
+      remainingbits = mm512_bcast128lo_64( -1024 );
      vlen = 0;
   }
   else
@@ -313,13 +313,15 @@ int echo_4way_update_close( echo_4way_context *state, void *hashval,
      vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
      memcpy_512( state->buffer, data, vlen );
      state->processed_bits += (unsigned int)( databitlen );
-      remainingbits = m512_const2_64( 0, (uint64_t)databitlen );
+      remainingbits = mm512_bcast128lo_64( (uint64_t)databitlen );
   }

-   state->buffer[ vlen ] = m512_const2_64( 0, 0x80 );
+   state->buffer[ vlen ] = mm512_bcast128lo_64( 0x80 );
   memset_zero_512( state->buffer + vlen + 1, vblen - vlen - 2 );
-   state->buffer[ vblen-2 ] = m512_const2_64( (uint64_t)state->uHashSize << 48, 0 );
-   state->buffer[ vblen-1 ] = m512_const2_64( 0, state->processed_bits);
+   state->buffer[ vblen-2 ] =
+           mm512_bcast128hi_64( (uint64_t)state->uHashSize << 48 );
+   state->buffer[ vblen-1 ] =
+           mm512_bcast128lo_64( state->processed_bits );

   state->k = _mm512_add_epi64( state->k, remainingbits );
   state->k = _mm512_sub_epi64( state->k, state->const1536 );
@@ -352,16 +354,16 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
         ctx->uHashSize = 256;
         ctx->uBlockLength = 192;
         ctx->uRounds = 8;
-         ctx->hashsize = m512_const2_64( 0, 0x100 );
-         ctx->const1536 = m512_const2_64( 0, 0x600 );
+         ctx->hashsize = mm512_bcast128lo_64( 0x100 );
+         ctx->const1536 = mm512_bcast128lo_64( 0x600 );
         break;

      case 512:
         ctx->uHashSize = 512;
         ctx->uBlockLength = 128;
         ctx->uRounds = 10;
-         ctx->hashsize = m512_const2_64( 0, 0x200 );
-         ctx->const1536 = m512_const2_64( 0, 0x400 );
+         ctx->hashsize = mm512_bcast128lo_64( 0x200 );
+         ctx->const1536 = mm512_bcast128lo_64( 0x400 );
         break;

      default:
@@ -388,7 +390,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   {
      echo_4way_compress( ctx, data, 1 );
      ctx->processed_bits = 1024;
-      remainingbits = m512_const2_64( 0, -1024 );
+      remainingbits = mm512_bcast128lo_64( -1024 );
      vlen = 0;
   }
   else
@@ -396,14 +398,14 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
      vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
      memcpy_512( ctx->buffer, data, vlen );
      ctx->processed_bits += (unsigned int)( databitlen );
-      remainingbits = m512_const2_64( 0, databitlen );
+      remainingbits = mm512_bcast128lo_64( databitlen );
   }

-   ctx->buffer[ vlen ] = m512_const2_64( 0, 0x80 );
+   ctx->buffer[ vlen ] = mm512_bcast128lo_64( 0x80 );
   memset_zero_512( ctx->buffer + vlen + 1, vblen - vlen - 2 );
   ctx->buffer[ vblen-2 ] =
-                     m512_const2_64( (uint64_t)ctx->uHashSize << 48, 0 );
-   ctx->buffer[ vblen-1 ] = m512_const2_64( 0, ctx->processed_bits);
+               mm512_bcast128hi_64( (uint64_t)ctx->uHashSize << 48 );
+   ctx->buffer[ vblen-1 ] = mm512_bcast128lo_64( ctx->processed_bits );

   ctx->k = _mm512_add_epi64( ctx->k, remainingbits );
   ctx->k = _mm512_sub_epi64( ctx->k, ctx->const1536 );
@@ -425,9 +427,9 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,

 // AVX2 + VAES

-#define mul2mask_2way   m256_const2_64( 0, 0x0000000000001b00 ) 
+#define mul2mask_2way   mm256_bcast128lo_64( 0x0000000000001b00 ) 

-#define lsbmask_2way    m256_const1_32( 0x01010101 ) 
+#define lsbmask_2way    _mm256_set1_epi32( 0x01010101 ) 

 #define ECHO_SUBBYTES4_2WAY( state, j ) \
   state[0][j] = _mm256_aesenc_epi128( state[0][j], k1 ); \
@@ -467,8 +469,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   t1 = _mm256_and_si256( t1, lsbmask_2way ); \
   t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
   s2 = _mm256_xor_si256( s2, t2 );\
-   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], \
-                              _mm256_xor_si256( s2, state1[ 1 ][ j1 ] ) ); \
+   state2[ 0 ][ j ] = mm256_xor3( state2[ 0 ][ j ], s2, state1[ 1 ][ j1 ] ); \
   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], s2 ); \
   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], state1[ 1 ][ j1 ] ); \
   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3 ][ j ], state1[ 1 ][ j1 ] ); \
@@ -478,8 +479,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
   s2 = _mm256_xor_si256( s2, t2 ); \
   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], state1[ 2 ][ j2 ] ); \
-   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], \
-                            _mm256_xor_si256( s2, state1[ 2 ][ j2 ] ) ); \
+   state2[ 1 ][ j ] = mm256_xor3( state2[ 1 ][ j ], s2, state1[ 2 ][ j2 ] ); \
   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], s2 ); \
   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3][ j ], state1[ 2 ][ j2 ] ); \
   s2 = _mm256_add_epi8( state1[ 3 ][ j3 ], state1[ 3 ][ j3 ] ); \
@@ -489,8 +489,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   s2 = _mm256_xor_si256( s2, t2 ); \
   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], state1[ 3 ][ j3 ] ); \
   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], state1[ 3 ][ j3 ] ); \
-   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], \
-                            _mm256_xor_si256( s2, state1[ 3 ][ j3] ) ); \
+   state2[ 2 ][ j ] = mm256_xor3( state2[ 2 ][ j ], s2, state1[ 3 ][ j3] ); \
   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3 ][ j ], s2 ); \
 } while(0)

@@ -679,16 +678,16 @@ int echo_2way_init( echo_2way_context *ctx, int nHashSize )
                        ctx->uHashSize = 256;
                        ctx->uBlockLength = 192;
                        ctx->uRounds = 8;
-                        ctx->hashsize = m256_const2_64( 0, 0x100 );
-                        ctx->const1536 = m256_const2_64( 0, 0x600 );
+                        ctx->hashsize = mm256_bcast128lo_64( 0x100 );
+                        ctx->const1536 = mm256_bcast128lo_64( 0x600 );
                        break;

                case 512:
                        ctx->uHashSize = 512;
                        ctx->uBlockLength = 128;
                        ctx->uRounds = 10;
-                        ctx->hashsize = m256_const2_64( 0, 0x200 );
-                        ctx->const1536 = m256_const2_64( 0, 0x400 );
+                        ctx->hashsize = mm256_bcast128lo_64( 0x200 );
+                        ctx->const1536 = mm256_bcast128lo_64( 0x400 );
                        break;

                default:
@@ -720,20 +719,20 @@ int echo_2way_update_close( echo_2way_context *state, void *hashval,
   {
      echo_2way_compress( state, data, 1 );
      state->processed_bits = 1024;
-      remainingbits = m256_const2_64( 0, -1024 );
+      remainingbits = mm256_bcast128lo_64( -1024 );
      vlen = 0;
   }
   else
   {
      memcpy_256( state->buffer, data, vlen );
      state->processed_bits += (unsigned int)( databitlen );
-      remainingbits = m256_const2_64( 0, databitlen );
+      remainingbits = mm256_bcast128lo_64( databitlen );
   }

-   state->buffer[ vlen ] = m256_const2_64( 0, 0x80 );
+   state->buffer[ vlen ] = mm256_bcast128lo_64( 0x80 );
   memset_zero_256( state->buffer + vlen + 1, vblen - vlen - 2 );
-   state->buffer[ vblen-2 ] = m256_const2_64( (uint64_t)state->uHashSize << 48, 0 );
-   state->buffer[ vblen-1 ] = m256_const2_64( 0, state->processed_bits );
+   state->buffer[ vblen-2 ] = mm256_bcast128hi_64( (uint64_t)state->uHashSize << 48 );
+   state->buffer[ vblen-1 ] = mm256_bcast128lo_64( state->processed_bits );

   state->k = _mm256_add_epi64( state->k, remainingbits );
   state->k = _mm256_sub_epi64( state->k, state->const1536 );
@@ -766,16 +765,16 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
         ctx->uHashSize = 256;
         ctx->uBlockLength = 192;
         ctx->uRounds = 8;
-         ctx->hashsize = m256_const2_64( 0, 0x100 );
-         ctx->const1536 = m256_const2_64( 0, 0x600 );
+         ctx->hashsize = mm256_bcast128lo_64( 0x100 );
+         ctx->const1536 = mm256_bcast128lo_64( 0x600 );
         break;

      case 512:
         ctx->uHashSize = 512;
         ctx->uBlockLength = 128;
         ctx->uRounds = 10;
-         ctx->hashsize = m256_const2_64( 0, 0x200 );
-         ctx->const1536 = m256_const2_64( 0, 0x400 );
+         ctx->hashsize = mm256_bcast128lo_64( 0x200 );
+         ctx->const1536 = mm256_bcast128lo_64( 0x400 );
         break;

      default:
@@ -798,7 +797,7 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
   {
      echo_2way_compress( ctx, data, 1 );
      ctx->processed_bits = 1024;
-      remainingbits = m256_const2_64( 0, -1024 );
+      remainingbits = mm256_bcast128lo_64( -1024 );
      vlen = 0;
   }
   else
@@ -806,13 +805,13 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
      vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
      memcpy_256( ctx->buffer, data, vlen );
      ctx->processed_bits += (unsigned int)( databitlen );
-      remainingbits = m256_const2_64( 0, databitlen );
+      remainingbits = mm256_bcast128lo_64( databitlen );
   }

-   ctx->buffer[ vlen ] = m256_const2_64( 0, 0x80 );
+   ctx->buffer[ vlen ] = mm256_bcast128lo_64( 0x80 );
   memset_zero_256( ctx->buffer + vlen + 1, vblen - vlen - 2 );
-   ctx->buffer[ vblen-2 ] = m256_const2_64( (uint64_t)ctx->uHashSize << 48, 0 );
-   ctx->buffer[ vblen-1 ] = m256_const2_64( 0, ctx->processed_bits );
+   ctx->buffer[ vblen-2 ] = mm256_bcast128hi_64( (uint64_t)ctx->uHashSize << 48 );
+   ctx->buffer[ vblen-1 ] = mm256_bcast128lo_64( ctx->processed_bits );

   ctx->k = _mm256_add_epi64( ctx->k, remainingbits );
   ctx->k = _mm256_sub_epi64( ctx->k, ctx->const1536 );
--- a/algo/fugue/fugue-aesni.c
+++ b/algo/fugue/fugue-aesni.c
@@ -33,11 +33,11 @@ MYALIGN const unsigned long long _supermix4b[]	= {0x07020d08080e0d0d, 0x07070908
 MYALIGN const unsigned long long _supermix4c[]	= {0x0706050403020000, 0x0302000007060504};
 MYALIGN const unsigned long long _supermix7a[]	= {0x010c0b060d080702, 0x0904030e03000104};
 MYALIGN const unsigned long long _supermix7b[]	= {0x8080808080808080, 0x0504070605040f06};
-MYALIGN const unsigned long long _k_n[] = {0x4E4E4E4E4E4E4E4E, 0x1B1B1B1B0E0E0E0E};
-MYALIGN const unsigned char _shift_one_mask[]   = {7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2};
-MYALIGN const unsigned char _shift_four_mask[]  = {13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8};
-MYALIGN const unsigned char _shift_seven_mask[] = {10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5};
-MYALIGN const unsigned char _aes_shift_rows[]   = {0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11};
+//MYALIGN const unsigned long long _k_n[] = {0x4E4E4E4E4E4E4E4E, 0x1B1B1B1B0E0E0E0E};
+//MYALIGN const unsigned char _shift_one_mask[]   = {7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2};
+//MYALIGN const unsigned char _shift_four_mask[]  = {13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8};
+//MYALIGN const unsigned char _shift_seven_mask[] = {10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5};
+//MYALIGN const unsigned char _aes_shift_rows[]   = {0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11};
 MYALIGN const unsigned int _inv_shift_rows[] = {0x070a0d00, 0x0b0e0104, 0x0f020508, 0x0306090c};
 MYALIGN const unsigned int _mul2mask[] = {0x1b1b0000, 0x00000000, 0x00000000, 0x00000000};
 MYALIGN const unsigned int _mul4mask[] = {0x2d361b00, 0x00000000, 0x00000000, 0x00000000};
@@ -131,7 +131,7 @@ MYALIGN const unsigned int _IV512[] = {
   t1 = _mm_srli_epi16(t0, 6);\
   t1 = _mm_and_si128(t1, M128(_lsbmask2));\
   t3 = _mm_xor_si128(t3, _mm_shuffle_epi8(M128(_mul2mask), t1));\
-   t0  = _mm_xor_si128(t4, _mm_shuffle_epi8(M128(_mul4mask), t1))
+   t0 = _mm_xor_si128(t4, _mm_shuffle_epi8(M128(_mul4mask), t1))

 /*
 #define PRESUPERMIX(x, t1, s1, s2, t2)\
--- a/algo/groestl/aes_ni/groestl-intr-aes.h
+++ b/algo/groestl/aes_ni/groestl-intr-aes.h
@@ -139,7 +139,7 @@ static const __m128i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003 };
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm_xor_si128(a0, TEMP0);\
  MUL2(a1, b0, b1);\
@@ -237,7 +237,7 @@ static const __m128i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003 };
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm_xor_si128(a0, TEMP0);\
  MUL2(a1, b0, b1);\
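
Review note: the comment "double x_i using temp xmm8 and 1B xmm9" refers to doubling each byte in GF(2^8), with the constant 0x1b1b1b1b1b1b1b1b supplying the reduction byte for every position. The sketch below is a scalar model of that doubling. The function names are hypothetical, and the code shows the arithmetic only, not the project's `MUL2` macro.

```c
#include <assert.h>
#include <stdint.h>

/* Double one byte in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1.
   The low byte of that polynomial is 0x1b. */
static uint8_t gf256_double( uint8_t x )
{
   /* All-ones mask when the top bit is set, i.e. when the shift overflows. */
   uint8_t mask = ( x & 0x80 ) ? 0xff : 0x00;
   uint8_t doubled = (uint8_t)( x << 1 );
   return (uint8_t)( doubled ^ ( mask & 0x1b ) );
}

/* Double each of the 8 bytes packed in a 64-bit word independently,
   which is what a byte-wise vector version does per 64-bit element. */
static uint64_t gf256_double_x8( uint64_t v )
{
   uint64_t out = 0;
   for ( int i = 0; i < 8; i++ )
   {
      uint8_t b = (uint8_t)( v >> ( 8 * i ) );
      out |= (uint64_t)gf256_double( b ) << ( 8 * i );
   }
   return out;
}

/* Known values from repeated doubling of 0x57. */
static void check_gf256_double( void )
{
   assert( gf256_double( 0x57 ) == 0xae );
   assert( gf256_double( 0xae ) == 0x47 );
}
```

The vector code gets the same result with a signed byte compare against zero to build the mask, a byte-wise add for the shift, and an AND plus XOR with the 0x1b constant.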
--- a/algo/groestl/aes_ni/groestl256-intr-aes.h
+++ b/algo/groestl/aes_ni/groestl256-intr-aes.h
@@ -128,7 +128,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm_xor_si128(a0, TEMP0);\
  MUL2(a1, b0, b1);\
@@ -226,7 +226,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm_xor_si128(a0, TEMP0);\
  MUL2(a1, b0, b1);\
@@ -275,7 +275,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
 */
 #define ROUND(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* AddRoundConstant */\
-  b1 = m128_const_64( 0xffffffffffffffff, 0 ); \
+  b1 = _mm_set_epi64x( 0xffffffffffffffff, 0 ); \
  a0 = _mm_xor_si128( a0, casti_m128i( round_const_l0, i ) ); \
  a1 = _mm_xor_si128( a1, b1 ); \
  a2 = _mm_xor_si128( a2, b1 ); \
--- a/algo/groestl/aes_ni/hash-groestl.c
+++ b/algo/groestl/aes_ni/hash-groestl.c
@@ -31,7 +31,7 @@ HashReturn_gr init_groestl( hashState_groestl* ctx, int hashlen )
  }

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+  ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );

  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;
@@ -48,7 +48,7 @@ HashReturn_gr reinit_groestl( hashState_groestl* ctx )
     ctx->chaining[i] = _mm_setzero_si128();
     ctx->buffer[i]   = _mm_setzero_si128();
  }
-  ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+  ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );
  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;

@@ -116,7 +116,7 @@ HashReturn_gr final_groestl( hashState_groestl* ctx, void* output )
   else
   {
       // add first padding
-       ctx->buffer[rem_ptr] = m128_const_64( 0, 0x80 );
+       ctx->buffer[rem_ptr] = _mm_set_epi64x( 0, 0x80 );
       // add zero padding
       for ( i = rem_ptr + 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = _mm_setzero_si128();
@@ -148,7 +148,7 @@ int groestl512_full( hashState_groestl* ctx, void* output,
      ctx->chaining[i] = _mm_setzero_si128();
      ctx->buffer[i]   = _mm_setzero_si128();
   }
-   ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+   ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );
   ctx->buf_ptr = 0;

   // --- update ---
@@ -182,7 +182,7 @@ int groestl512_full( hashState_groestl* ctx, void* output,
   else
   {
       // add first padding
-       ctx->buffer[i] = m128_const_64( 0, 0x80 );
+       ctx->buffer[i] = _mm_set_epi64x( 0, 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = _mm_setzero_si128();
@@ -239,7 +239,7 @@ HashReturn_gr update_and_final_groestl( hashState_groestl* ctx, void* output,
   else
   {
       // add first padding
-       ctx->buffer[i] = m128_const_64( 0, 0x80 );
+       ctx->buffer[i] = _mm_set_epi64x( 0, 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = _mm_setzero_si128();
--- a/algo/groestl/aes_ni/hash-groestl256.c
+++ b/algo/groestl/aes_ni/hash-groestl256.c
@@ -46,7 +46,7 @@ HashReturn_gr reinit_groestl256(hashState_groestl256* ctx)
     ctx->buffer[i]   = _mm_setzero_si128();
  }

-  ctx->chaining[ 3 ] = m128_const_64( 0, 0x0100000000000000 );
+  ctx->chaining[ 3 ] = _mm_set_epi64x( 0, 0x0100000000000000 );

  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;
--- a/algo/groestl/groestl256-hash-4way.c
+++ b/algo/groestl/groestl256-hash-4way.c
@@ -33,8 +33,7 @@ int groestl256_4way_init( groestl256_4way_context* ctx, uint64_t hashlen )
  }

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 3 ] = m512_const2_64( 0, 0x0100000000000000 );
-
+  ctx->chaining[ 3 ] = mm512_bcast128lo_64( 0x0100000000000000 );
  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;

@@ -51,9 +50,6 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
   __m512i* in = (__m512i*)input;
   int i;

-//  if (ctx->chaining == NULL || ctx->buffer == NULL)
-//    return 1;
-
  for ( i = 0; i < SIZE256; i++ )
  {
     ctx->chaining[i] = m512_zero;
@@ -61,7 +57,7 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
  }

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 3 ] = m512_const2_64( 0, 0x0100000000000000 );
+  ctx->chaining[ 3 ] = mm512_bcast128lo_64( 0x0100000000000000 );
  ctx->buf_ptr = 0;
   
   // --- update ---
@@ -83,18 +79,18 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
   if ( i == SIZE256 - 1 )
   {        
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 ); 
+       ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 ); 
   }   
   else
   {
       // add first padding
-       ctx->buffer[i] = m512_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE256 - 1; i++ )
           ctx->buffer[i] = m512_zero;

       // add length padding, second last byte is zero unless blocks > 255
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
   }

   // digest final padding block and do output transform
@@ -140,18 +136,18 @@ int groestl256_4way_update_close( groestl256_4way_context* ctx, void* output,
   if ( i == SIZE256 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
   }
   else
   {
       // add first padding
-       ctx->buffer[i] = m512_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE256 - 1; i++ )
           ctx->buffer[i] = m512_zero;

       // add length padding, second last byte is zero unless blocks > 255
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
   }

 // digest final padding block and do output transform
@@ -186,7 +182,7 @@ int groestl256_2way_init( groestl256_2way_context* ctx, uint64_t hashlen )
  }

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
+  ctx->chaining[ 3 ] = mm256_bcast128lo_64( 0x0100000000000000 );

  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;
@@ -211,7 +207,7 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
   }

   // The only non-zero in the IV is len. It can be hard coded.
-   ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
+   ctx->chaining[ 3 ] = mm256_bcast128lo_64( 0x0100000000000000 );
   ctx->buf_ptr = 0;

   // --- update ---
@@ -233,18 +229,18 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
   if ( i == SIZE256 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-      ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+      ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
   }
   else
   {
       // add first padding
-       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE256 - 1; i++ )
           ctx->buffer[i] = m256_zero;

       // add length padding, second last byte is zero unless blocks > 255
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
   }

   // digest final padding block and do output transform
@@ -289,23 +285,22 @@ int groestl256_2way_update_close( groestl256_2way_context* ctx, void* output,
   if ( i == SIZE256 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
   }
   else
   {
       // add first padding
-       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
       // add zero padding
       for ( i += 1; i < SIZE256 - 1; i++ )
           ctx->buffer[i] = m256_zero;

       // add length padding, second last byte is zero unless blocks > 255
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
   }

 // digest final padding block and do output transform
   TF512_2way( ctx->chaining, ctx->buffer );
-
   OF512_2way( ctx->chaining );

   // store hash result in output 
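
Review note: the padding hunks above write 0x80 as the first padding byte, zero-fill the middle, and put `blocks << 56` in the high 64 bits of the last vector, so the block count ends up in the final byte. The sketch below is a byte-level model of that layout. It assumes the count is a 64-bit big-endian integer in the last 8 bytes of the block, which matches the comment "second last byte is zero unless blocks > 255". The function name and signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pad the final block in place.
   block     : buffer of block_len bytes, first `used` bytes hold message data
   blocks    : total number of blocks, including this one
   Returns 0 on success, -1 if the padding does not fit in this block. */
static int groestl_pad_block( uint8_t *block, size_t block_len,
                              size_t used, uint64_t blocks )
{
   /* Need room for the 0x80 byte plus the 8-byte counter. */
   if ( block_len < 9 || used > block_len - 9 ) return -1;

   block[ used ] = 0x80;                                 /* first padding */
   memset( block + used + 1, 0, block_len - used - 1 );  /* zero padding  */

   /* Length padding: big-endian 64-bit block count in the last 8 bytes. */
   for ( int i = 0; i < 8; i++ )
      block[ block_len - 1 - i ] = (uint8_t)( blocks >> ( 8 * i ) );
   return 0;
}

/* One data vector used in a 64-byte block, single block message. */
static void check_pad_block( void )
{
   uint8_t blk[64];
   memset( blk, 0xaa, sizeof blk );
   assert( groestl_pad_block( blk, sizeof blk, 16, 1 ) == 0 );
   assert( blk[15] == 0xaa && blk[16] == 0x80 && blk[17] == 0x00 );
   assert( blk[62] == 0x00 && blk[63] == 0x01 );
}
```

Unlike this model, `blocks << 56` keeps only the low byte of the count, so the vector code covers only the case of fewer than 256 blocks.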
--- a/algo/groestl/groestl256-intr-4way.h
+++ b/algo/groestl/groestl256-intr-4way.h
@@ -165,7 +165,7 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b ); \
+  b1 = _mm512_set1_epi64( 0x1b1b1b1b1b1b1b1b ); \
  MUL2( a0, b0, b1 ); \
  a0 = _mm512_xor_si512( a0, TEMP0 ); \
  MUL2( a1, b0, b1 ); \
@@ -205,116 +205,18 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
  b1 = _mm512_xor_si512( b1, a4 ); \
 }/*MixBytes*/

-
-#if 0
-#define MixBytes(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
-  /* t_i = a_i + a_{i+1} */\
-  b6 = a0;\
-  b7 = a1;\
-  a0 = _mm512_xor_si512(a0, a1);\
-  b0 = a2;\
-  a1 = _mm512_xor_si512(a1, a2);\
-  b1 = a3;\
-  a2 = _mm512_xor_si512(a2, a3);\
-  b2 = a4;\
-  a3 = _mm512_xor_si512(a3, a4);\
-  b3 = a5;\
-  a4 = _mm512_xor_si512(a4, a5);\
-  b4 = a6;\
-  a5 = _mm512_xor_si512(a5, a6);\
-  b5 = a7;\
-  a6 = _mm512_xor_si512(a6, a7);\
-  a7 = _mm512_xor_si512(a7, b6);\
-  \
-  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
-  b0 = _mm512_xor_si512(b0, a4);\
-  b6 = _mm512_xor_si512(b6, a4);\
-  b1 = _mm512_xor_si512(b1, a5);\
-  b7 = _mm512_xor_si512(b7, a5);\
-  b2 = _mm512_xor_si512(b2, a6);\
-  b0 = _mm512_xor_si512(b0, a6);\
-  /* spill values y_4, y_5 to memory */\
-  TEMP0 = b0;\
-  b3 = _mm512_xor_si512(b3, a7);\
-  b1 = _mm512_xor_si512(b1, a7);\
-  TEMP1 = b1;\
-  b4 = _mm512_xor_si512(b4, a0);\
-  b2 = _mm512_xor_si512(b2, a0);\
-  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
-  b0 = a0;\
-  b5 = _mm512_xor_si512(b5, a1);\
-  b3 = _mm512_xor_si512(b3, a1);\
-  b1 = a1;\
-  b6 = _mm512_xor_si512(b6, a2);\
-  b4 = _mm512_xor_si512(b4, a2);\
-  TEMP2 = a2;\
-  b7 = _mm512_xor_si512(b7, a3);\
-  b5 = _mm512_xor_si512(b5, a3);\
-  \
-  /* compute x_i = t_i + t_{i+3} */\
-  a0 = _mm512_xor_si512(a0, a3);\
-  a1 = _mm512_xor_si512(a1, a4);\
-  a2 = _mm512_xor_si512(a2, a5);\
-  a3 = _mm512_xor_si512(a3, a6);\
-  a4 = _mm512_xor_si512(a4, a7);\
-  a5 = _mm512_xor_si512(a5, b0);\
-  a6 = _mm512_xor_si512(a6, b1);\
-  a7 = _mm512_xor_si512(a7, TEMP2);\
-  \
-  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
-  /* compute w_i : add y_{i+4} */\
-  b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b );\
-  MUL2(a0, b0, b1);\
-  a0 = _mm512_xor_si512(a0, TEMP0);\
-  MUL2(a1, b0, b1);\
-  a1 = _mm512_xor_si512(a1, TEMP1);\
-  MUL2(a2, b0, b1);\
-  a2 = _mm512_xor_si512(a2, b2);\
-  MUL2(a3, b0, b1);\
-  a3 = _mm512_xor_si512(a3, b3);\
-  MUL2(a4, b0, b1);\
-  a4 = _mm512_xor_si512(a4, b4);\
-  MUL2(a5, b0, b1);\
-  a5 = _mm512_xor_si512(a5, b5);\
-  MUL2(a6, b0, b1);\
-  a6 = _mm512_xor_si512(a6, b6);\
-  MUL2(a7, b0, b1);\
-  a7 = _mm512_xor_si512(a7, b7);\
-  \
-  /* compute v_i : double w_i      */\
-  /* add to y_4 y_5 .. v3, v4, ... */\
-  MUL2(a0, b0, b1);\
-  b5 = _mm512_xor_si512(b5, a0);\
-  MUL2(a1, b0, b1);\
-  b6 = _mm512_xor_si512(b6, a1);\
-  MUL2(a2, b0, b1);\
-  b7 = _mm512_xor_si512(b7, a2);\
-  MUL2(a5, b0, b1);\
-  b2 = _mm512_xor_si512(b2, a5);\
-  MUL2(a6, b0, b1);\
-  b3 = _mm512_xor_si512(b3, a6);\
-  MUL2(a7, b0, b1);\
-  b4 = _mm512_xor_si512(b4, a7);\
-  MUL2(a3, b0, b1);\
-  MUL2(a4, b0, b1);\
-  b0 = TEMP0;\
-  b1 = TEMP1;\
-  b0 = _mm512_xor_si512(b0, a3);\
-  b1 = _mm512_xor_si512(b1, a4);\
-}/*MixBytes*/
-#endif
+#define MASK_NOT( a )  _mm512_mask_ternarylogic_epi64( a, 0xaa, a, a, 1 )

 #define ROUND(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* AddRoundConstant */\
-  b1 = m512_const2_64( 0xffffffffffffffff, 0 ); \
-  a0 = _mm512_xor_si512( a0, m512_const1_128( round_const_l0[i] ) );\
-  a1 = _mm512_xor_si512( a1, b1 );\
-  a2 = _mm512_xor_si512( a2, b1 );\
-  a3 = _mm512_xor_si512( a3, b1 );\
-  a4 = _mm512_xor_si512( a4, b1 );\
-  a5 = _mm512_xor_si512( a5, b1 );\
-  a6 = _mm512_xor_si512( a6, b1 );\
-  a7 = _mm512_xor_si512( a7, m512_const1_128( round_const_l7[i] ) );\
+  a0 = _mm512_xor_si512( a0, mm512_bcast_m128( round_const_l0[i] ) );\
+  a1 = MASK_NOT( a1 ); \
+  a2 = MASK_NOT( a2 ); \
+  a3 = MASK_NOT( a3 ); \
+  a4 = MASK_NOT( a4 ); \
+  a5 = MASK_NOT( a5 ); \
+  a6 = MASK_NOT( a6 ); \
+  a7 = _mm512_xor_si512( a7, mm512_bcast_m128( round_const_l7[i] ) );\
  \
  /* ShiftBytes + SubBytes (interleaved) */\
  b0 = _mm512_xor_si512( b0, b0 );\
@@ -450,7 +352,7 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
 * outputs: (i0-7) = (0|S)
 */
 #define Matrix_Transpose_O_B(i0, i1, i2, i3, i4, i5, i6, i7, t0){\
-  t0 = _mm512_xor_si512( t0, t0 );\
+  t0 = m512_zero;\
  i1 = i0;\
  i3 = i2;\
  i5 = i4;\
@@ -481,11 +383,11 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,

 void TF512_4way( __m512i* chaining, __m512i* message )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m512i TEMP0;
-  static __m512i TEMP1;
-  static __m512i TEMP2;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i TEMP0;
+  __m512i TEMP1;
+  __m512i TEMP2;

  /* load message into registers xmm12 - xmm15 */
  xmm12 = message[0];
@@ -547,11 +449,11 @@ void TF512_4way( __m512i* chaining, __m512i* message )

 void OF512_4way( __m512i* chaining )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m512i TEMP0;
-  static __m512i TEMP1;
-  static __m512i TEMP2;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i TEMP0;
+  __m512i TEMP1;
+  __m512i TEMP2;

  /* load CV into registers xmm8, xmm10, xmm12, xmm14 */
  xmm8 = chaining[0];
@@ -637,7 +539,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  j = _mm256_cmpgt_epi8(j, i );\
  i = _mm256_add_epi8(i, i);\
  j = _mm256_and_si256(j, k);\
-  i = _mm256_xor_si256(i, j);\
+  i = mm256_xorand( i, j, k );\
 }

 #define MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
@@ -648,7 +550,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  b0 = a2;\
  a1 = _mm256_xor_si256(a1, a2);\
  b1 = a3;\
-  a2 = _mm256_xor_si256(a2, a3);\
+  TEMP2 = _mm256_xor_si256(a2, a3);\
  b2 = a4;\
  a3 = _mm256_xor_si256(a3, a4);\
  b3 = a5;\
@@ -660,34 +562,20 @@ static const __m256i SUBSH_MASK7_2WAY =
  a7 = _mm256_xor_si256(a7, b6);\
  \
  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
-  b0 = _mm256_xor_si256(b0, a4);\
-  b6 = _mm256_xor_si256(b6, a4);\
-  b1 = _mm256_xor_si256(b1, a5);\
-  b7 = _mm256_xor_si256(b7, a5);\
-  b2 = _mm256_xor_si256(b2, a6);\
-  b0 = _mm256_xor_si256(b0, a6);\
-  /* spill values y_4, y_5 to memory */\
-  TEMP0 = b0;\
-  b3 = _mm256_xor_si256(b3, a7);\
-  b1 = _mm256_xor_si256(b1, a7);\
-  TEMP1 = b1;\
-  b4 = _mm256_xor_si256(b4, a0);\
-  b2 = _mm256_xor_si256(b2, a0);\
-  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
-  b0 = a0;\
-  b5 = _mm256_xor_si256(b5, a1);\
-  b3 = _mm256_xor_si256(b3, a1);\
-  b1 = a1;\
-  b6 = _mm256_xor_si256(b6, a2);\
-  b4 = _mm256_xor_si256(b4, a2);\
-  TEMP2 = a2;\
-  b7 = _mm256_xor_si256(b7, a3);\
-  b5 = _mm256_xor_si256(b5, a3);\
-  \
+  TEMP0 = mm256_xor3( b0, a4, a6 ); \
+  TEMP1 = mm256_xor3( b1, a5, a7 ); \
+  b2 = mm256_xor3( b2, a6, a0 ); \
+  b0 = a0; \
+  b3 = mm256_xor3( b3, a7, a1 ); \
+  b1 = a1; \
+  b6 = mm256_xor3( b6, a4, TEMP2 ); \
+  b4 = mm256_xor3( b4, a0, TEMP2 ); \
+  b7 = mm256_xor3( b7, a5, a3 ); \
+  b5 = mm256_xor3( b5, a1, a3 ); \
  /* compute x_i = t_i + t_{i+3} */\
  a0 = _mm256_xor_si256(a0, a3);\
  a1 = _mm256_xor_si256(a1, a4);\
-  a2 = _mm256_xor_si256(a2, a5);\
+  a2 = _mm256_xor_si256( TEMP2, a5);\
  a3 = _mm256_xor_si256(a3, a6);\
  a4 = _mm256_xor_si256(a4, a7);\
  a5 = _mm256_xor_si256(a5, b0);\
@@ -696,7 +584,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m256_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm256_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2_2WAY(a0, b0, b1);\
  a0 = _mm256_xor_si256(a0, TEMP0);\
  MUL2_2WAY(a1, b0, b1);\
@@ -738,15 +626,15 @@ static const __m256i SUBSH_MASK7_2WAY =

 #define ROUND_2WAY(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* AddRoundConstant */\
-  b1 = m256_const2_64( 0xffffffffffffffff, 0 ); \
-  a0 = _mm256_xor_si256( a0, m256_const1_128( round_const_l0[i] ) );\
+  b1 = mm256_bcast_m128( mm128_mask_32( m128_neg1, 0x3 ) ); \
+  a0 = _mm256_xor_si256( a0, mm256_bcast_m128( round_const_l0[i] ) );\
  a1 = _mm256_xor_si256( a1, b1 );\
  a2 = _mm256_xor_si256( a2, b1 );\
  a3 = _mm256_xor_si256( a3, b1 );\
  a4 = _mm256_xor_si256( a4, b1 );\
  a5 = _mm256_xor_si256( a5, b1 );\
  a6 = _mm256_xor_si256( a6, b1 );\
-  a7 = _mm256_xor_si256( a7, m256_const1_128( round_const_l7[i] ) );\
+  a7 = _mm256_xor_si256( a7, mm256_bcast_m128( round_const_l7[i] ) );\
  \
  /* ShiftBytes + SubBytes (interleaved) */\
  b0 = _mm256_xor_si256( b0, b0 );\
@@ -769,7 +657,6 @@ static const __m256i SUBSH_MASK7_2WAY =
  \
  /* MixBytes */\
  MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7);\
-\
 }

 /* 10 rounds, P and Q in parallel */
@@ -850,7 +737,7 @@ static const __m256i SUBSH_MASK7_2WAY =
 }/**/

 #define Matrix_Transpose_O_B_2way(i0, i1, i2, i3, i4, i5, i6, i7, t0){\
-  t0 = _mm256_xor_si256( t0, t0 );\
+  t0 = m256_zero;\
  i1 = i0;\
  i3 = i2;\
  i5 = i4;\
@@ -874,11 +761,11 @@ static const __m256i SUBSH_MASK7_2WAY =

 void TF512_2way( __m256i* chaining, __m256i* message )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m256i TEMP0;
-  static __m256i TEMP1;
-  static __m256i TEMP2;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i TEMP0;
+  __m256i TEMP1;
+  __m256i TEMP2;

  /* load message into registers xmm12 - xmm15 */
  xmm12 = message[0];
@@ -940,11 +827,11 @@ void TF512_2way( __m256i* chaining, __m256i* message )
  
 void OF512_2way( __m256i* chaining )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m256i TEMP0;
-  static __m256i TEMP1;
-  static __m256i TEMP2;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i TEMP0;
+  __m256i TEMP1;
+  __m256i TEMP2;

  /* load CV into registers xmm8, xmm10, xmm12, xmm14 */
  xmm8 = chaining[0];
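
Review note: in the 4-way `ROUND` macro above, the XOR with `m512_const2_64( 0xffffffffffffffff, 0 )` is replaced by `MASK_NOT`, a masked ternary-logic operation with mask 0xaa and immediate 1. The sketch below is a scalar model for checking that the two forms agree. It assumes the usual ternary-logic semantics: the three input bits index into the 8-bit immediate, and elements whose mask bit is clear keep the first operand. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for a 512-bit vector of eight 64-bit elements. */
typedef struct { uint64_t q[8]; } v512_model;

/* Bitwise ternary logic on one 64-bit element: for every bit position the
   bits of a, b, c form a 3-bit index that selects one bit of imm8. */
static uint64_t ternlog64( uint64_t a, uint64_t b, uint64_t c, uint8_t imm8 )
{
   uint64_t r = 0;
   for ( int bit = 0; bit < 64; bit++ )
   {
      unsigned idx = (unsigned)( ( ( ( a >> bit ) & 1 ) << 2 )
                               | ( ( ( b >> bit ) & 1 ) << 1 )
                               |   ( ( c >> bit ) & 1 ) );
      r |= (uint64_t)( ( imm8 >> idx ) & 1 ) << bit;
   }
   return r;
}

/* Model of MASK_NOT: imm8 = 1 is NOR, which is NOT when all three inputs
   are the same; mask 0xaa selects the odd elements, i.e. the high 64 bits
   of each 128-bit lane. Unselected elements are passed through. */
static v512_model model_mask_not( v512_model v )
{
   v512_model r = v;
   for ( int i = 0; i < 8; i++ )
      if ( ( 0xaa >> i ) & 1 )
         r.q[i] = ternlog64( v.q[i], v.q[i], v.q[i], 1 );
   return r;
}

/* Model of the replaced code: XOR with (all-ones, 0) in each 128-bit lane. */
static v512_model model_xor_hi_ones( v512_model v )
{
   v512_model r = v;
   for ( int i = 0; i < 8; i++ )
      r.q[i] ^= ( i & 1 ) ? ~(uint64_t)0 : 0;
   return r;
}

/* The two forms should agree on every element. */
static void check_mask_not( void )
{
   v512_model v = { { 0x0123456789abcdefULL, 0, ~(uint64_t)0, 0xdeadbeefULL,
                      1, 2, 3, 4 } };
   v512_model a = model_mask_not( v );
   v512_model b = model_xor_hi_ones( v );
   for ( int i = 0; i < 8; i++ )
      assert( a.q[i] == b.q[i] );
}
```

If the model matches the hardware, the new form saves the `b1` constant and its XORs for a1 through a6 in the 4-way path. The 2-way path has no masked form without AVX-512 and keeps the XOR.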
--- a/algo/groestl/groestl512-hash-4way.c
+++ b/algo/groestl/groestl512-hash-4way.c
@@ -25,8 +25,7 @@ int groestl512_4way_init( groestl512_4way_context* ctx, uint64_t hashlen )
  memset_zero_512( ctx->buffer, SIZE512 );

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );
-
+  ctx->chaining[ 6 ] = mm512_bcast128hi_64( 0x0200000000000000 );
  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;

@@ -61,14 +60,14 @@ int groestl512_4way_update_close( groestl512_4way_context* ctx, void* output,
   if ( i == SIZE512 - 1 )
   {        
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
   }   
   else
   {
-       ctx->buffer[i] = m512_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m512_zero;
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
   }

   TF1024_4way( ctx->chaining, ctx->buffer );
@@ -94,7 +93,7 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,

   memset_zero_512( ctx->chaining, SIZE512 );
   memset_zero_512( ctx->buffer, SIZE512 );
-   ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );
+   ctx->chaining[ 6 ] = mm512_bcast128hi_64( 0x0200000000000000 );
   ctx->buf_ptr = 0;

   // --- update ---
@@ -113,14 +112,14 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
   if ( i == SIZE512 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
   }
   else
   {
-       ctx->buffer[i] = m512_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m512_zero;
-       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
   }

   TF1024_4way( ctx->chaining, ctx->buffer );
@@ -143,7 +142,7 @@ int groestl512_2way_init( groestl512_2way_context* ctx, uint64_t hashlen )
  memset_zero_256( ctx->buffer, SIZE512 );

  // The only non-zero in the IV is len. It can be hard coded.
-  ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
+  ctx->chaining[ 6 ] = mm256_bcast128hi_64( 0x0200000000000000 );

  ctx->buf_ptr = 0;
  ctx->rem_ptr = 0;
@@ -179,14 +178,14 @@ int groestl512_2way_update_close( groestl512_2way_context* ctx, void* output,
   if ( i == SIZE512 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
   }
   else
   {
-       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m256_zero;
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
   }

   TF1024_2way( ctx->chaining, ctx->buffer );
@@ -212,7 +211,7 @@ int groestl512_2way_full( groestl512_2way_context* ctx, void* output,

   memset_zero_256( ctx->chaining, SIZE512 );
   memset_zero_256( ctx->buffer, SIZE512 );
-   ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
+   ctx->chaining[ 6 ] = mm256_bcast128hi_64( 0x0200000000000000 );
   ctx->buf_ptr = 0;

   // --- update ---
@@ -231,14 +230,14 @@ int groestl512_2way_full( groestl512_2way_context* ctx, void* output,
   if ( i == SIZE512 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+       ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
   }
   else
   {
-       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m256_zero;
-       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+       ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
   }

   TF1024_2way( ctx->chaining, ctx->buffer );
--- a/algo/groestl/groestl512-intr-4way.h
+++ b/algo/groestl/groestl512-intr-4way.h
@@ -174,7 +174,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b ); \
+  b1 = _mm512_set1_epi64( 0x1b1b1b1b1b1b1b1b ); \
  MUL2( a0, b0, b1 ); \
  a0 = _mm512_xor_si512( a0, TEMP0 ); \
  MUL2( a1, b0, b1 ); \
@@ -238,7 +238,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
  for ( round_counter = 0; round_counter < 14; round_counter += 2 ) \
  { \
    /* AddRoundConstant P1024 */\
-    xmm8 = _mm512_xor_si512( xmm8, m512_const1_128( \
+    xmm8 = _mm512_xor_si512( xmm8, mm512_bcast_m128( \
             casti_m128i( round_const_p, round_counter ) ) ); \
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm8  = _mm512_shuffle_epi8( xmm8,  SUBSH_MASK0 ); \
@@ -253,7 +253,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
    SUBMIX(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
    \
     /* AddRoundConstant P1024 */\
-    xmm0 = _mm512_xor_si512( xmm0, m512_const1_128( \
+    xmm0 = _mm512_xor_si512( xmm0, mm512_bcast_m128( \
             casti_m128i( round_const_p, round_counter+1 ) ) ); \
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm0 = _mm512_shuffle_epi8( xmm0, SUBSH_MASK0 );\
@@ -282,7 +282,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
    xmm12 = _mm512_xor_si512( xmm12, xmm1 );\
    xmm13 = _mm512_xor_si512( xmm13, xmm1 );\
    xmm14 = _mm512_xor_si512( xmm14, xmm1 );\
-    xmm15 = _mm512_xor_si512( xmm15, m512_const1_128( \
+    xmm15 = _mm512_xor_si512( xmm15, mm512_bcast_m128( \
                 casti_m128i( round_const_q, round_counter ) ) ); \
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm8  = _mm512_shuffle_epi8( xmm8,  SUBSH_MASK1 );\
@@ -305,7 +305,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
    xmm4 = _mm512_xor_si512( xmm4, xmm9 );\
    xmm5 = _mm512_xor_si512( xmm5, xmm9 );\
    xmm6 = _mm512_xor_si512( xmm6, xmm9 );\
-    xmm7 = _mm512_xor_si512( xmm7, m512_const1_128( \
+    xmm7 = _mm512_xor_si512( xmm7, mm512_bcast_m128( \
             casti_m128i( round_const_q, round_counter+1 ) ) ); \
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm0 = _mm512_shuffle_epi8( xmm0, SUBSH_MASK1 );\
@@ -471,8 +471,8 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,

 void INIT_4way( __m512i* chaining )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;

  /* load IV into registers xmm8 - xmm15 */
  xmm8 = chaining[0];
@@ -500,12 +500,12 @@ void INIT_4way( __m512i* chaining )

 void TF1024_4way( __m512i* chaining, const __m512i* message )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m512i QTEMP[8];
-  static __m512i TEMP0;
-  static __m512i TEMP1;
-  static __m512i TEMP2;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i QTEMP[8];
+  __m512i TEMP0;
+  __m512i TEMP1;
+  __m512i TEMP2;

  /* load message into registers xmm8 - xmm15 (Q = message) */
  xmm8 = message[0];
@@ -606,11 +606,11 @@ void TF1024_4way( __m512i* chaining, const __m512i* message )

 void OF1024_4way( __m512i* chaining )
 {
-  static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m512i TEMP0;
-  static __m512i TEMP1;
-  static __m512i TEMP2;
+  __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m512i TEMP0;
+  __m512i TEMP1;
+  __m512i TEMP2;

  /* load CV into registers xmm8 - xmm15 */
  xmm8 = chaining[0];
@@ -710,7 +710,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  b0 = a2;\
  a1 = _mm256_xor_si256(a1, a2);\
  b1 = a3;\
-  a2 = _mm256_xor_si256(a2, a3);\
+  TEMP2 = _mm256_xor_si256(a2, a3);\
  b2 = a4;\
  a3 = _mm256_xor_si256(a3, a4);\
  b3 = a5;\
@@ -722,34 +722,23 @@ static const __m256i SUBSH_MASK7_2WAY =
  a7 = _mm256_xor_si256(a7, b6);\
  \
  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
-  b0 = _mm256_xor_si256(b0, a4);\
-  b6 = _mm256_xor_si256(b6, a4);\
-  b1 = _mm256_xor_si256(b1, a5);\
-  b7 = _mm256_xor_si256(b7, a5);\
-  b2 = _mm256_xor_si256(b2, a6);\
-  b0 = _mm256_xor_si256(b0, a6);\
+  TEMP0 = mm256_xor3( b0, a4, a6 ); \
  /* spill values y_4, y_5 to memory */\
-  TEMP0 = b0;\
-  b3 = _mm256_xor_si256(b3, a7);\
-  b1 = _mm256_xor_si256(b1, a7);\
-  TEMP1 = b1;\
-  b4 = _mm256_xor_si256(b4, a0);\
-  b2 = _mm256_xor_si256(b2, a0);\
+  TEMP1 = mm256_xor3( b1, a5, a7 ); \
+  b2 = mm256_xor3( b2, a6, a0 ); \
  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
-  b0 = a0;\
-  b5 = _mm256_xor_si256(b5, a1);\
-  b3 = _mm256_xor_si256(b3, a1);\
-  b1 = a1;\
-  b6 = _mm256_xor_si256(b6, a2);\
-  b4 = _mm256_xor_si256(b4, a2);\
-  TEMP2 = a2;\
-  b7 = _mm256_xor_si256(b7, a3);\
-  b5 = _mm256_xor_si256(b5, a3);\
+  b0 = a0; \
+  b3 = mm256_xor3( b3, a7, a1 ); \
+  b1 = a1; \
+  b6 = mm256_xor3( b6, a4, TEMP2 ); \
+  b4 = mm256_xor3( b4, a0, TEMP2 ); \
+  b7 = mm256_xor3( b7, a5, a3 ); \
+  b5 = mm256_xor3( b5, a1, a3 ); \
  \
  /* compute x_i = t_i + t_{i+3} */\
  a0 = _mm256_xor_si256(a0, a3);\
  a1 = _mm256_xor_si256(a1, a4);\
-  a2 = _mm256_xor_si256(a2, a5);\
+  a2 = _mm256_xor_si256( TEMP2, a5);\
  a3 = _mm256_xor_si256(a3, a6);\
  a4 = _mm256_xor_si256(a4, a7);\
  a5 = _mm256_xor_si256(a5, b0);\
@@ -758,7 +747,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
-  b1 = m256_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  b1 = _mm256_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
  MUL2_2WAY(a0, b0, b1);\
  a0 = _mm256_xor_si256(a0, TEMP0);\
  MUL2_2WAY(a1, b0, b1);\
@@ -822,7 +811,7 @@ static const __m256i SUBSH_MASK7_2WAY =
  for ( round_counter = 0; round_counter < 14; round_counter += 2 ) \
  { \
    /* AddRoundConstant P1024 */\
-    xmm8 = _mm256_xor_si256( xmm8, m256_const1_128( \
+    xmm8 = _mm256_xor_si256( xmm8, mm256_bcast_m128( \
             casti_m128i( round_const_p, round_counter ) ) ); \
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm8  = _mm256_shuffle_epi8( xmm8,  SUBSH_MASK0_2WAY ); \
@@ -837,7 +826,7 @@ static const __m256i SUBSH_MASK7_2WAY =
    SUBMIX_2WAY(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
    \
     /* AddRoundConstant P1024 */\
-    xmm0 = _mm256_xor_si256( xmm0, m256_const1_128( \
+    xmm0 = _mm256_xor_si256( xmm0, mm256_bcast_m128( \
             casti_m128i( round_const_p, round_counter+1 ) ) ); \
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm0 = _mm256_shuffle_epi8( xmm0, SUBSH_MASK0_2WAY );\
@@ -866,7 +855,7 @@ static const __m256i SUBSH_MASK7_2WAY =
    xmm12 = _mm256_xor_si256( xmm12, xmm1 );\
    xmm13 = _mm256_xor_si256( xmm13, xmm1 );\
    xmm14 = _mm256_xor_si256( xmm14, xmm1 );\
-    xmm15 = _mm256_xor_si256( xmm15, m256_const1_128( \
+    xmm15 = _mm256_xor_si256( xmm15, mm256_bcast_m128( \
                 casti_m128i( round_const_q, round_counter ) ) ); \
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm8  = _mm256_shuffle_epi8( xmm8,  SUBSH_MASK1_2WAY );\
@@ -889,7 +878,7 @@ static const __m256i SUBSH_MASK7_2WAY =
    xmm4 = _mm256_xor_si256( xmm4, xmm9 );\
    xmm5 = _mm256_xor_si256( xmm5, xmm9 );\
    xmm6 = _mm256_xor_si256( xmm6, xmm9 );\
-    xmm7 = _mm256_xor_si256( xmm7, m256_const1_128( \
+    xmm7 = _mm256_xor_si256( xmm7, mm256_bcast_m128( \
             casti_m128i( round_const_q, round_counter+1 ) ) ); \
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm0 = _mm256_shuffle_epi8( xmm0, SUBSH_MASK1_2WAY );\
@@ -1040,8 +1029,8 @@ static const __m256i SUBSH_MASK7_2WAY =

 void INIT_2way( __m256i *chaining )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;

  /* load IV into registers xmm8 - xmm15 */
  xmm8 = chaining[0];
@@ -1069,12 +1058,12 @@ void INIT_2way( __m256i *chaining )

 void TF1024_2way( __m256i *chaining, const __m256i *message )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m256i QTEMP[8];
-  static __m256i TEMP0;
-  static __m256i TEMP1;
-  static __m256i TEMP2;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i QTEMP[8];
+  __m256i TEMP0;
+  __m256i TEMP1;
+  __m256i TEMP2;

  /* load message into registers xmm8 - xmm15 (Q = message) */
  xmm8 = message[0];
@@ -1175,11 +1164,11 @@ void TF1024_2way( __m256i *chaining, const __m256i *message )

 void OF1024_2way( __m256i* chaining )
 {
-  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
-  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  static __m256i TEMP0;
-  static __m256i TEMP1;
-  static __m256i TEMP2;
+  __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  __m256i TEMP0;
+  __m256i TEMP1;
+  __m256i TEMP2;

  /* load CV into registers xmm8 - xmm15 */
  xmm8 = chaining[0];
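
Note on the SUBMIX_2WAY hunks above: the refactor folds each pair of dependent XORs into a single three-input XOR and keeps t2 in TEMP2 instead of overwriting a2. The scalar sketch below checks that the visible part of the old and new sequences produces the same y values. It is only an illustration under stated assumptions: uint64_t stands in for __m256i, xor3 is assumed to be a plain three-way XOR (the definition of mm256_xor3 is not shown in this diff), and next_u64, y_old, y_new and submix_y_refactor_matches are hypothetical names, not part of cpuminer-opt.

```c
#include <stdint.h>

/* Small deterministic generator (splitmix64), used only to make test inputs. */
static uint64_t next_u64( uint64_t *s )
{
   uint64_t z = ( *s += 0x9E3779B97F4A7C15ULL );
   z = ( z ^ ( z >> 30 ) ) * 0xBF58476D1CE4E5B9ULL;
   z = ( z ^ ( z >> 27 ) ) * 0x94D049BB133111EBULL;
   return z ^ ( z >> 31 );
}

/* Assumed scalar stand-in for mm256_xor3: a plain three-way XOR. */
static uint64_t xor3( uint64_t a, uint64_t b, uint64_t c )
{
   return a ^ b ^ c;
}

/* Old sequence: pairwise XORs in the order of the removed lines.
   a[2] already holds t2 here. out[0], out[1] model TEMP0, TEMP1;
   out[2..7] model b2..b7. */
static void y_old( const uint64_t a[8], const uint64_t bin[8], uint64_t out[8] )
{
   uint64_t b[8];
   for ( int i = 0; i < 8; i++ ) b[i] = bin[i];
   b[0] ^= a[4];  b[6] ^= a[4];
   b[1] ^= a[5];  b[7] ^= a[5];
   b[2] ^= a[6];  b[0] ^= a[6];
   out[0] = b[0];                      /* TEMP0 */
   b[3] ^= a[7];  b[1] ^= a[7];
   out[1] = b[1];                      /* TEMP1 */
   b[4] ^= a[0];  b[2] ^= a[0];
   b[5] ^= a[1];  b[3] ^= a[1];
   b[6] ^= a[2];  b[4] ^= a[2];
   b[7] ^= a[3];  b[5] ^= a[3];
   for ( int i = 2; i < 8; i++ ) out[i] = b[i];
}

/* New sequence: one three-way XOR per output, t2 passed in as temp2.
   a[2] is never read, matching the added lines. */
static void y_new( const uint64_t a[8], uint64_t temp2, const uint64_t b[8],
                   uint64_t out[8] )
{
   out[0] = xor3( b[0], a[4], a[6] );  /* TEMP0 */
   out[1] = xor3( b[1], a[5], a[7] );  /* TEMP1 */
   out[2] = xor3( b[2], a[6], a[0] );
   out[3] = xor3( b[3], a[7], a[1] );
   out[6] = xor3( b[6], a[4], temp2 );
   out[4] = xor3( b[4], a[0], temp2 );
   out[7] = xor3( b[7], a[5], a[3] );
   out[5] = xor3( b[5], a[1], a[3] );
}

/* Returns 1 when both sequences agree for inputs derived from seed. */
static int submix_y_refactor_matches( uint64_t seed )
{
   uint64_t a[8], b[8], yo[8], yn[8];
   for ( int i = 0; i < 8; i++ )
   {
      a[i] = next_u64( &seed );
      b[i] = next_u64( &seed );
   }
   const uint64_t t2 = a[2];
   y_old( a, b, yo );
   a[2] = ~t2;          /* the new code must not depend on a2 at this point */
   y_new( a, t2, b, yn );
   for ( int i = 0; i < 8; i++ )
      if ( yo[i] != yn[i] ) return 0;
   return 1;
}
```

Calling submix_y_refactor_matches with a handful of seeds is a cheap sanity check; it does not replace hashing known test vectors through the real AVX2 code.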
--- a/algo/hamsi/hamsi-hash-4way.c
+++ b/algo/hamsi/hamsi-hash-4way.c
@@ -562,14 +562,14 @@ do { \
  for ( int u = 0; u < 64; u++ ) \
  { \
     const __mmask8 dm = _mm512_cmplt_epi64_mask( db, zero ); \
-     m0 = _mm512_mask_xor_epi64( m0, dm, m0, m512_const1_64( tp[0] ) ); \
-     m1 = _mm512_mask_xor_epi64( m1, dm, m1, m512_const1_64( tp[1] ) ); \
-     m2 = _mm512_mask_xor_epi64( m2, dm, m2, m512_const1_64( tp[2] ) ); \
-     m3 = _mm512_mask_xor_epi64( m3, dm, m3, m512_const1_64( tp[3] ) ); \
-     m4 = _mm512_mask_xor_epi64( m4, dm, m4, m512_const1_64( tp[4] ) ); \
-     m5 = _mm512_mask_xor_epi64( m5, dm, m5, m512_const1_64( tp[5] ) ); \
-     m6 = _mm512_mask_xor_epi64( m6, dm, m6, m512_const1_64( tp[6] ) ); \
-     m7 = _mm512_mask_xor_epi64( m7, dm, m7, m512_const1_64( tp[7] ) ); \
+     m0 = _mm512_mask_xor_epi64( m0, dm, m0, _mm512_set1_epi64( tp[0] ) ); \
+     m1 = _mm512_mask_xor_epi64( m1, dm, m1, _mm512_set1_epi64( tp[1] ) ); \
+     m2 = _mm512_mask_xor_epi64( m2, dm, m2, _mm512_set1_epi64( tp[2] ) ); \
+     m3 = _mm512_mask_xor_epi64( m3, dm, m3, _mm512_set1_epi64( tp[3] ) ); \
+     m4 = _mm512_mask_xor_epi64( m4, dm, m4, _mm512_set1_epi64( tp[4] ) ); \
+     m5 = _mm512_mask_xor_epi64( m5, dm, m5, _mm512_set1_epi64( tp[5] ) ); \
+     m6 = _mm512_mask_xor_epi64( m6, dm, m6, _mm512_set1_epi64( tp[6] ) ); \
+     m7 = _mm512_mask_xor_epi64( m7, dm, m7, _mm512_set1_epi64( tp[7] ) ); \
     db = _mm512_ror_epi64( db, 1 ); \
     tp += 8; \
  } \
@@ -733,17 +733,17 @@ do { \
   __m512i alpha[16]; \
   const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
   for( int i = 0; i < 16; i++ ) \
-      alpha[i] = m512_const1_64( ( (uint64_t*)alpha_n )[i] ); \
+      alpha[i] = _mm512_set1_epi64( ( (uint64_t*)alpha_n )[i] ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (1ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (1ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (2ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (2ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (3ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (3ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (4ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (4ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (5ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (5ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
 } while (0)

@@ -752,29 +752,29 @@ do { \
   __m512i alpha[16]; \
   const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
   for( int i = 0; i < 16; i++ ) \
-      alpha[i] = m512_const1_64( ( (uint64_t*)alpha_f )[i] ); \
+      alpha[i] = _mm512_set1_epi64( ( (uint64_t*)alpha_f )[i] ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 1ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 1ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 2ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 2ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 3ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 3ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 4ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 4ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 5ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 5ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 6ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 6ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 7ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 7ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 8ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 8ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( ( 9ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( ( 9ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (10ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (10ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
-   alpha[0] = m512_const1_64( (11ULL << 32) ^ A0 ); \
+   alpha[0] = _mm512_set1_epi64( (11ULL << 32) ^ A0 ); \
   ROUND_BIG8( alpha ); \
 } while (0)

@@ -829,14 +829,14 @@ void hamsi512_8way_init( hamsi_8way_big_context *sc )
   sc->partial_len = 0;
   sc->count_high = sc->count_low = 0;

-   sc->h[0] = m512_const1_64( 0x6c70617273746565 );
-   sc->h[1] = m512_const1_64( 0x656e62656b204172 );
-   sc->h[2] = m512_const1_64( 0x302c206272672031 );
-   sc->h[3] = m512_const1_64( 0x3434362c75732032 );
-   sc->h[4] = m512_const1_64( 0x3030312020422d33 );
-   sc->h[5] = m512_const1_64( 0x656e2d484c657576 );
-   sc->h[6] = m512_const1_64( 0x6c65652c65766572 );
-   sc->h[7] = m512_const1_64( 0x6769756d2042656c );
+   sc->h[0] = _mm512_set1_epi64( 0x6c70617273746565 );
+   sc->h[1] = _mm512_set1_epi64( 0x656e62656b204172 );
+   sc->h[2] = _mm512_set1_epi64( 0x302c206272672031 );
+   sc->h[3] = _mm512_set1_epi64( 0x3434362c75732032 );
+   sc->h[4] = _mm512_set1_epi64( 0x3030312020422d33 );
+   sc->h[5] = _mm512_set1_epi64( 0x656e2d484c657576 );
+   sc->h[6] = _mm512_set1_epi64( 0x6c65652c65766572 );
+   sc->h[7] = _mm512_set1_epi64( 0x6769756d2042656c );
 }

 void hamsi512_8way_update( hamsi_8way_big_context *sc, const void *data,
@@ -859,7 +859,7 @@ void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )
   sph_enc32be( &ch, sc->count_high );
   sph_enc32be( &cl, sc->count_low + ( sc->partial_len << 3 ) );
   pad[0] = _mm512_set1_epi64( ((uint64_t)cl << 32 ) | (uint64_t)ch );
-   sc->buf[0] = m512_const1_64( 0x80 );
+   sc->buf[0] = _mm512_set1_epi64( 0x80 );
   hamsi_8way_big( sc, sc->buf, 1 );
   hamsi_8way_big_final( sc, pad );

@@ -870,6 +870,32 @@ void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )

 // Hamsi 4 way AVX2

+#if defined(__AVX512VL__)
+
+#define INPUT_BIG \
+do { \
+  __m256i db = _mm256_ror_epi64( *buf, 1 ); \
+  const __m256i zero = m256_zero; \
+  const uint64_t *tp = (const uint64_t*)T512; \
+  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = zero; \
+  for ( int u = 0; u < 64; u++ ) \
+  { \
+     const __mmask8 dm = _mm256_cmplt_epi64_mask( db, zero ); \
+     m0 = _mm256_mask_xor_epi64( m0, dm, m0, _mm256_set1_epi64x( tp[0] ) ); \
+     m1 = _mm256_mask_xor_epi64( m1, dm, m1, _mm256_set1_epi64x( tp[1] ) ); \
+     m2 = _mm256_mask_xor_epi64( m2, dm, m2, _mm256_set1_epi64x( tp[2] ) ); \
+     m3 = _mm256_mask_xor_epi64( m3, dm, m3, _mm256_set1_epi64x( tp[3] ) ); \
+     m4 = _mm256_mask_xor_epi64( m4, dm, m4, _mm256_set1_epi64x( tp[4] ) ); \
+     m5 = _mm256_mask_xor_epi64( m5, dm, m5, _mm256_set1_epi64x( tp[5] ) ); \
+     m6 = _mm256_mask_xor_epi64( m6, dm, m6, _mm256_set1_epi64x( tp[6] ) ); \
+     m7 = _mm256_mask_xor_epi64( m7, dm, m7, _mm256_set1_epi64x( tp[7] ) ); \
+     db = _mm256_ror_epi64( db, 1 ); \
+     tp += 8; \
+  } \
+} while (0)
+
+#else
+
 #define INPUT_BIG \
 do { \
  __m256i db = *buf; \
@@ -880,25 +906,58 @@ do { \
  { \
     __m256i dm = _mm256_cmpgt_epi64( zero, _mm256_slli_epi64( db, u ) ); \
     m0 = _mm256_xor_si256( m0, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[0] ) ) ); \
+                                          _mm256_set1_epi64x( tp[0] ) ) ); \
     m1 = _mm256_xor_si256( m1, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[1] ) ) ); \
+                                          _mm256_set1_epi64x( tp[1] ) ) ); \
     m2 = _mm256_xor_si256( m2, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[2] ) ) ); \
+                                          _mm256_set1_epi64x( tp[2] ) ) ); \
     m3 = _mm256_xor_si256( m3, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[3] ) ) ); \
+                                          _mm256_set1_epi64x( tp[3] ) ) ); \
     m4 = _mm256_xor_si256( m4, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[4] ) ) ); \
+                                          _mm256_set1_epi64x( tp[4] ) ) ); \
     m5 = _mm256_xor_si256( m5, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[5] ) ) ); \
+                                          _mm256_set1_epi64x( tp[5] ) ) ); \
     m6 = _mm256_xor_si256( m6, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[6] ) ) ); \
+                                          _mm256_set1_epi64x( tp[6] ) ) ); \
     m7 = _mm256_xor_si256( m7, _mm256_and_si256( dm, \
-                                          m256_const1_64( tp[7] ) ) ); \
+                                          _mm256_set1_epi64x( tp[7] ) ) ); \
     tp += 8; \
  } \
 } while (0)

+#endif
+
+#define SBOX( a, b, c, d ) \
+do { \
+  __m256i t; \
+  t = a; \
+  a = mm256_xorand( d, a, c ); \
+  c = mm256_xor3( a, b, c ); \
+  b = mm256_xoror( b, d, t ); \
+  t = _mm256_xor_si256( t, c ); \
+  d = mm256_xoror( a, b, t ); \
+  t = mm256_xorand( t, a, b ); \
+  a = c; \
+  c = mm256_xor3( b, d, t ); \
+  b = d; \
+  d = mm256_not( t ); \
+} while (0)
+
+#define L( a, b, c, d ) \
+do { \
+   a = mm256_rol_32( a, 13 ); \
+   c = mm256_rol_32( c,  3 ); \
+   b = mm256_xor3( a, b, c ); \
+   d = mm256_xor3( d, c, _mm256_slli_epi32( a, 3 ) ); \
+   b = mm256_rol_32( b, 1 ); \
+   d = mm256_rol_32( d, 7 ); \
+   a = mm256_xor3( a, b, d ); \
+   c = mm256_xor3( c, d, _mm256_slli_epi32( b, 7 ) ); \
+   a = mm256_rol_32( a,  5 ); \
+   c = mm256_rol_32( c, 22 ); \
+} while (0)
+
+/*
 #define SBOX( a, b, c, d ) \
 do { \
  __m256i t; \
@@ -937,6 +996,7 @@ do { \
   a = mm256_rol_32( a,  5 ); \
   c = mm256_rol_32( c, 22 ); \
 } while (0)
+*/

 #define DECL_STATE_BIG \
   __m256i c0, c1, c2, c3, c4, c5, c6, c7; \
@@ -1066,17 +1126,17 @@ do { \
   __m256i alpha[16]; \
   const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
   for( int i = 0; i < 16; i++ ) \
-      alpha[i] = m256_const1_64( ( (uint64_t*)alpha_n )[i] ); \
+      alpha[i] = _mm256_set1_epi64x( ( (uint64_t*)alpha_n )[i] ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (1ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (1ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (2ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (2ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (3ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (3ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (4ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (4ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (5ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (5ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
 } while (0)

@@ -1085,29 +1145,29 @@ do { \
   __m256i alpha[16]; \
   const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
   for( int i = 0; i < 16; i++ ) \
-      alpha[i] = m256_const1_64( ( (uint64_t*)alpha_f )[i] ); \
+      alpha[i] = _mm256_set1_epi64x( ( (uint64_t*)alpha_f )[i] ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 1ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 1ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 2ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 2ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 3ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 3ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 4ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 4ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 5ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 5ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 6ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 6ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 7ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 7ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 8ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 8ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( ( 9ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( ( 9ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (10ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (10ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
-   alpha[0] = m256_const1_64( (11ULL << 32) ^ A0 ); \
+   alpha[0] = _mm256_set1_epi64x( (11ULL << 32) ^ A0 ); \
   ROUND_BIG( alpha ); \
 } while (0)

@@ -1163,14 +1223,14 @@ void hamsi512_4way_init( hamsi_4way_big_context *sc )
   sc->partial_len = 0;
   sc->count_high = sc->count_low = 0;

-   sc->h[0] = m256_const1_64( 0x6c70617273746565 );
-   sc->h[1] = m256_const1_64( 0x656e62656b204172 );
-   sc->h[2] = m256_const1_64( 0x302c206272672031 );
-   sc->h[3] = m256_const1_64( 0x3434362c75732032 );
-   sc->h[4] = m256_const1_64( 0x3030312020422d33 );
-   sc->h[5] = m256_const1_64( 0x656e2d484c657576 );
-   sc->h[6] = m256_const1_64( 0x6c65652c65766572 );
-   sc->h[7] = m256_const1_64( 0x6769756d2042656c );
+   sc->h[0] = _mm256_set1_epi64x( 0x6c70617273746565 );
+   sc->h[1] = _mm256_set1_epi64x( 0x656e62656b204172 );
+   sc->h[2] = _mm256_set1_epi64x( 0x302c206272672031 );
+   sc->h[3] = _mm256_set1_epi64x( 0x3434362c75732032 );
+   sc->h[4] = _mm256_set1_epi64x( 0x3030312020422d33 );
+   sc->h[5] = _mm256_set1_epi64x( 0x656e2d484c657576 );
+   sc->h[6] = _mm256_set1_epi64x( 0x6c65652c65766572 );
+   sc->h[7] = _mm256_set1_epi64x( 0x6769756d2042656c );
 }

 void hamsi512_4way_update( hamsi_4way_big_context *sc, const void *data,
@@ -1193,7 +1253,7 @@ void hamsi512_4way_close( hamsi_4way_big_context *sc, void *dst )
   sph_enc32be( &ch, sc->count_high );
   sph_enc32be( &cl, sc->count_low + ( sc->partial_len << 3 ) );
   pad[0] = _mm256_set1_epi64x( ((uint64_t)cl << 32 ) | (uint64_t)ch );
-   sc->buf[0] = m256_const1_64( 0x80 );
+   sc->buf[0] = _mm256_set1_epi64x( 0x80 );
   hamsi_big( sc, sc->buf, 1 );
   hamsi_big_final( sc, pad );

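
Note on the AVX512VL INPUT_BIG added above: each iteration tests the sign bit of db and then rotates db right by one, so iteration u ends up testing bit u of the original word. The sketch below models a single 64-bit lane in scalar C and compares it with a direct bit test. Treat it as a rough illustration: the table is filled with arbitrary values rather than the real T512 contents, and ror64, fill_table, input_big_ror, input_big_bit and input_big_models_agree are hypothetical names.

```c
#include <stdint.h>

#define TABLE_WORDS ( 64 * 8 )   /* 64 iterations, 8 words consumed per iteration */

/* Rotate right by n bits; valid for 0 < n < 64. */
static uint64_t ror64( uint64_t x, unsigned n )
{
   return ( x >> n ) | ( x << ( 64 - n ) );
}

/* Fill a stand-in table with arbitrary values (NOT the real T512). */
static void fill_table( uint64_t *t )
{
   uint64_t s = 0x0123456789ABCDEFULL;
   for ( int i = 0; i < TABLE_WORDS; i++ )
   {
      s = s * 6364136223846793005ULL + 1442695040888963407ULL;
      t[i] = s;
   }
}

/* One-lane model of the masked loop: pre-rotate by 1, test the sign bit,
   XOR in eight table words when it is set, rotate by 1 again. */
static void input_big_ror( uint64_t w, const uint64_t *table, uint64_t m[8] )
{
   uint64_t db = ror64( w, 1 );
   const uint64_t *tp = table;
   for ( int j = 0; j < 8; j++ ) m[j] = 0;
   for ( int u = 0; u < 64; u++ )
   {
      if ( db >> 63 )                       /* sign bit set */
         for ( int j = 0; j < 8; j++ ) m[j] ^= tp[j];
      db = ror64( db, 1 );
      tp += 8;
   }
}

/* Reference: iteration u tests bit u of w directly. */
static void input_big_bit( uint64_t w, const uint64_t *table, uint64_t m[8] )
{
   for ( int j = 0; j < 8; j++ ) m[j] = 0;
   for ( int u = 0; u < 64; u++ )
      if ( ( w >> u ) & 1 )
         for ( int j = 0; j < 8; j++ ) m[j] ^= table[ 8 * u + j ];
}

/* Returns 1 when both models produce the same eight accumulators. */
static int input_big_models_agree( uint64_t w )
{
   uint64_t table[TABLE_WORDS], m1[8], m2[8];
   fill_table( table );
   input_big_ror( w, table, m1 );
   input_big_bit( w, table, m2 );
   for ( int j = 0; j < 8; j++ )
      if ( m1[j] != m2[j] ) return 0;
   return 1;
}
```

The vector version evaluates the same test in every 64-bit lane at once and applies the XOR through the mask register instead of a branch.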
--- a/algo/haval/haval-hash-4way.c
+++ b/algo/haval/haval-hash-4way.c
@@ -52,6 +52,56 @@ extern "C"{
 #define SPH_SMALL_FOOTPRINT_HAVAL   1
 //#endif

+#if defined(__AVX512VL__)
+
+// ( ~( a ^ b ) ) & c
+#define mm128_andnotxor( a, b, c ) \
+   _mm_ternarylogic_epi32( a, b, c, 0x82  )
+
+#else
+
+#define mm128_andnotxor( a, b, c ) \
+   _mm_andnot_si128( _mm_xor_si128( a, b ), c )
+
+#endif
+
+#define F1(x6, x5, x4, x3, x2, x1, x0) \
+ mm128_xor3( x0, mm128_andxor( x1, x0, x4 ), \
+                 _mm_xor_si128( _mm_and_si128( x2, x5 ), \
+                                _mm_and_si128( x3, x6 ) ) ) \
+
+#define F2(x6, x5, x4, x3, x2, x1, x0) \
+   mm128_xor3( mm128_andxor( x2, _mm_andnot_si128( x3, x1 ), \
+                       mm128_xor3( _mm_and_si128( x4, x5 ), x6, x0 )  ), \
+               mm128_andxor( x4, x1, x5 ), \
+               mm128_xorand( x0, x3, x5 ) ) \
+
+#define F3(x6, x5, x4, x3, x2, x1, x0) \
+  mm128_xor3( x0, \
+              _mm_and_si128( x3, \
+                         mm128_xor3( _mm_and_si128( x1, x2 ), x6, x0 ) ), \
+              _mm_xor_si128( _mm_and_si128( x1, x4 ), \
+                             _mm_and_si128( x2, x5 ) ) )
+
+#define F4(x6, x5, x4, x3, x2, x1, x0) \
+  mm128_xor3( \
+      mm128_andxor( x3, x5, \
+                    _mm_xor_si128( _mm_and_si128( x1, x2 ), \
+                                      _mm_or_si128( x4, x6 ) ) ), \
+      _mm_and_si128( x4, \
+                        mm128_xor3( x0, _mm_andnot_si128( x2, x5 ), \
+                                    _mm_xor_si128( x1, x6 ) ) ), \
+      mm128_xorand( x0, x2, x6 ) )
+
+#define F5(x6, x5, x4, x3, x2, x1, x0) \
+   _mm_xor_si128( \
+         mm128_andnotxor( mm128_and3( x1, x2, x3 ), x5, x0 ), \
+         mm128_xor3( _mm_and_si128( x1, x4 ), \
+                     _mm_and_si128( x2, x5 ), \
+                     _mm_and_si128( x3, x6 ) ) )
+  
+
+/*
 #define F1(x6, x5, x4, x3, x2, x1, x0) \
   _mm_xor_si128( x0, \
       _mm_xor_si128( _mm_and_si128(_mm_xor_si128( x0, x4 ), x1 ), \
@@ -96,6 +146,7 @@ extern "C"{
      _mm_xor_si128( _mm_xor_si128( _mm_and_si128( x1, x4 ), \
                                    _mm_and_si128( x2, x5 ) ), \
                                    _mm_and_si128( x3, x6 ) ) )
+*/

 /*
 * The macros below integrate the phi() permutations, depending on the
@@ -740,14 +791,14 @@ do { \
 static void
 haval_8way_init( haval_8way_context *sc, unsigned olen, unsigned passes )
 {
-   sc->s0 = m256_const1_32( 0x243F6A88UL );
-   sc->s1 = m256_const1_32( 0x85A308D3UL );
-   sc->s2 = m256_const1_32( 0x13198A2EUL );
-   sc->s3 = m256_const1_32( 0x03707344UL );
-   sc->s4 = m256_const1_32( 0xA4093822UL );
-   sc->s5 = m256_const1_32( 0x299F31D0UL );
-   sc->s6 = m256_const1_32( 0x082EFA98UL );
-   sc->s7 = m256_const1_32( 0xEC4E6C89UL );
+   sc->s0 = _mm256_set1_epi32( 0x243F6A88UL );
+   sc->s1 = _mm256_set1_epi32( 0x85A308D3UL );
+   sc->s2 = _mm256_set1_epi32( 0x13198A2EUL );
+   sc->s3 = _mm256_set1_epi32( 0x03707344UL );
+   sc->s4 = _mm256_set1_epi32( 0xA4093822UL );
+   sc->s5 = _mm256_set1_epi32( 0x299F31D0UL );
+   sc->s6 = _mm256_set1_epi32( 0x082EFA98UL );
+   sc->s7 = _mm256_set1_epi32( 0xEC4E6C89UL );
   sc->olen = olen;
   sc->passes = passes;
   sc->count_high = 0;
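
Note on the ternary-logic constants in this diff: the imm8 passed to the ternarylogic intrinsics is the 8-row truth table of the three-input function, with the row index formed as (a << 2) | (b << 1) | c. The sketch below derives the constant from a scalar description of the function, which is one way to double-check 0x82 for mm128_andnotxor above and 0x2d for notxorandnot in the jh diff that follows. The names bool3_fn, ternlog_imm8, ternlog64 and the f_* helpers are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* A three-input boolean function; only bit 0 of the result is used. */
typedef unsigned (*bool3_fn)( unsigned a, unsigned b, unsigned c );

/* Build the imm8 truth table: bit (a<<2 | b<<1 | c) holds f(a,b,c). */
static uint8_t ternlog_imm8( bool3_fn f )
{
   uint8_t imm = 0;
   for ( unsigned i = 0; i < 8; i++ )
   {
      const unsigned a = ( i >> 2 ) & 1, b = ( i >> 1 ) & 1, c = i & 1;
      imm |= (uint8_t)( ( f( a, b, c ) & 1 ) << i );
   }
   return imm;
}

/* Scalar emulation of a bitwise ternary-logic operation on 64-bit words. */
static uint64_t ternlog64( uint64_t a, uint64_t b, uint64_t c, uint8_t imm )
{
   uint64_t r = 0;
   for ( unsigned i = 0; i < 8; i++ )
      if ( ( imm >> i ) & 1 )
         r |= ( ( i & 4 ) ? a : ~a ) & ( ( i & 2 ) ? b : ~b )
            & ( ( i & 1 ) ? c : ~c );
   return r;
}

/* ( ~( a ^ b ) ) & c, as in the mm128_andnotxor comment. */
static unsigned f_andnotxor( unsigned a, unsigned b, unsigned c )
{
   return ( ~( a ^ b ) ) & c;
}

/* ~a ^ ( ~b & c ), as in the non-AVX512VL notxorandnot fallback. */
static unsigned f_notxorandnot( unsigned a, unsigned b, unsigned c )
{
   return ~a ^ ( ~b & c );
}

/* Plain three-way XOR. */
static unsigned f_xor3( unsigned a, unsigned b, unsigned c )
{
   return a ^ b ^ c;
}
```

For example, ternlog_imm8( f_andnotxor ) evaluates to 0x82 and ternlog_imm8( f_notxorandnot ) to 0x2d, matching the constants in the macros; a plain three-way XOR gives 0x96.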
--- a/algo/jh/jh-hash-4way.c
+++ b/algo/jh/jh-hash-4way.c
@@ -76,19 +76,31 @@ do { \

 #endif

+#if defined(__AVX512VL__)
+//TODO enable for AVX10_256, not used with AVX512VL
+
+#define notxorandnot( a, b, c ) \
+   _mm256_ternarylogic_epi64( a, b, c, 0x2d )
+
+#else
+
+#define notxorandnot( a, b, c ) \
+   _mm256_xor_si256( mm256_not( a ), _mm256_andnot_si256( b, c ) )
+
+#endif
+
 #define Sb(x0, x1, x2, x3, c) \
 do { \
-   const __m256i cc = _mm256_set1_epi64x( c ); \
-    x3 = mm256_not( x3 ); \
-    x0 = _mm256_xor_si256( x0, _mm256_andnot_si256( x2, cc ) ); \
-    tmp = _mm256_xor_si256( cc, _mm256_and_si256( x0, x1 ) ); \
-    x0 = _mm256_xor_si256( x0, _mm256_and_si256( x2, x3 ) ); \
-    x3 = _mm256_xor_si256( x3, _mm256_andnot_si256( x1, x2 ) ); \
-    x1 = _mm256_xor_si256( x1, _mm256_and_si256( x0, x2 ) ); \
-    x2 = _mm256_xor_si256( x2, _mm256_andnot_si256( x3, x0 ) ); \
-    x0 = _mm256_xor_si256( x0, _mm256_or_si256( x1, x3 ) ); \
-    x3 = _mm256_xor_si256( x3, _mm256_and_si256( x1, x2 ) ); \
-    x1 = _mm256_xor_si256( x1, _mm256_and_si256( tmp, x0 ) ); \
+    const __m256i cc = _mm256_set1_epi64x( c ); \
+    x0 = mm256_xorandnot( x0, x2, cc ); \
+    tmp = mm256_xorand( cc, x0, x1 ); \
+    x0 = mm256_xorandnot( x0, x3, x2 ); \
+    x3 = notxorandnot( x3, x1, x2 ); \
+    x1 = mm256_xorand( x1, x0, x2 ); \
+    x2 = mm256_xorandnot( x2, x3, x0 ); \
+    x0 = mm256_xoror( x0, x1, x3 ); \
+    x3 = mm256_xorand( x3, x1, x2 ); \
+    x1 = mm256_xorand( x1, tmp, x0 ); \
    x2 = _mm256_xor_si256( x2, tmp ); \
 } while (0)

@@ -96,11 +108,11 @@ do { \
 do { \
    x4 = _mm256_xor_si256( x4, x1 ); \
    x5 = _mm256_xor_si256( x5, x2 ); \
-    x6 = _mm256_xor_si256( x6, _mm256_xor_si256( x3, x0 ) ); \
+    x6 = mm256_xor3( x6, x3, x0 ); \
    x7 = _mm256_xor_si256( x7, x0 ); \
    x0 = _mm256_xor_si256( x0, x5 ); \
    x1 = _mm256_xor_si256( x1, x6 ); \
-    x2 = _mm256_xor_si256( x2, _mm256_xor_si256( x7, x4 ) ); \
+    x2 = mm256_xor3( x2, x7, x4 ); \
    x3 = _mm256_xor_si256( x3, x4 ); \
 } while (0)

@@ -323,12 +335,12 @@ do { \
 } while (0)


-#define W80(x)   Wz_8W(x, m512_const1_64( 0x5555555555555555 ),  1 )
-#define W81(x)   Wz_8W(x, m512_const1_64( 0x3333333333333333 ),  2 )
-#define W82(x)   Wz_8W(x, m512_const1_64( 0x0F0F0F0F0F0F0F0F ),  4 )
-#define W83(x)   Wz_8W(x, m512_const1_64( 0x00FF00FF00FF00FF ),  8 ) 
-#define W84(x)   Wz_8W(x, m512_const1_64( 0x0000FFFF0000FFFF ), 16 )
-#define W85(x)   Wz_8W(x, m512_const1_64( 0x00000000FFFFFFFF ), 32 )
+#define W80(x)   Wz_8W(x, _mm512_set1_epi64( 0x5555555555555555 ),  1 )
+#define W81(x)   Wz_8W(x, _mm512_set1_epi64( 0x3333333333333333 ),  2 )
+#define W82(x)   Wz_8W(x, _mm512_set1_epi64( 0x0F0F0F0F0F0F0F0F ),  4 )
+#define W83(x)   Wz_8W(x, _mm512_set1_epi64( 0x00FF00FF00FF00FF ),  8 ) 
+#define W84(x)   Wz_8W(x, _mm512_set1_epi64( 0x0000FFFF0000FFFF ), 16 )
+#define W85(x)   Wz_8W(x, _mm512_set1_epi64( 0x00000000FFFFFFFF ), 32 )
 #define W86(x) \
 do { \
   __m512i t = x ## h; \
@@ -352,12 +364,12 @@ do { \
   x ## l = _mm256_or_si256( _mm256_and_si256((x ## l >> (n)), (c)), t ); \
 } while (0)

-#define W0(x)   Wz(x, m256_const1_64( 0x5555555555555555 ),  1 )
-#define W1(x)   Wz(x, m256_const1_64( 0x3333333333333333 ),  2 )
-#define W2(x)   Wz(x, m256_const1_64( 0x0F0F0F0F0F0F0F0F ),  4 )
-#define W3(x)   Wz(x, m256_const1_64( 0x00FF00FF00FF00FF ),  8 ) 
-#define W4(x)   Wz(x, m256_const1_64( 0x0000FFFF0000FFFF ), 16 )
-#define W5(x)   Wz(x, m256_const1_64( 0x00000000FFFFFFFF ), 32 )
+#define W0(x)   Wz(x, _mm256_set1_epi64x( 0x5555555555555555 ),  1 )
+#define W1(x)   Wz(x, _mm256_set1_epi64x( 0x3333333333333333 ),  2 )
+#define W2(x)   Wz(x, _mm256_set1_epi64x( 0x0F0F0F0F0F0F0F0F ),  4 )
+#define W3(x)   Wz(x, _mm256_set1_epi64x( 0x00FF00FF00FF00FF ),  8 ) 
+#define W4(x)   Wz(x, _mm256_set1_epi64x( 0x0000FFFF0000FFFF ), 16 )
+#define W5(x)   Wz(x, _mm256_set1_epi64x( 0x00000000FFFFFFFF ), 32 )
 #define W6(x) \
 do { \
   __m256i t = x ## h; \
@@ -624,22 +636,22 @@ static const sph_u64 IV512[] = {
 void jh256_8way_init( jh_8way_context *sc )
 {
    // bswapped IV256
-    sc->H[ 0] = m512_const1_64( 0xebd3202c41a398eb );
-    sc->H[ 1] = m512_const1_64( 0xc145b29c7bbecd92 );
-    sc->H[ 2] = m512_const1_64( 0xfac7d4609151931c );
-    sc->H[ 3] = m512_const1_64( 0x038a507ed6820026 );
-    sc->H[ 4] = m512_const1_64( 0x45b92677269e23a4 );
-    sc->H[ 5] = m512_const1_64( 0x77941ad4481afbe0 );
-    sc->H[ 6] = m512_const1_64( 0x7a176b0226abb5cd );
-    sc->H[ 7] = m512_const1_64( 0xa82fff0f4224f056 );
-    sc->H[ 8] = m512_const1_64( 0x754d2e7f8996a371 );
-    sc->H[ 9] = m512_const1_64( 0x62e27df70849141d );
-    sc->H[10] = m512_const1_64( 0x948f2476f7957627 );
-    sc->H[11] = m512_const1_64( 0x6c29804757b6d587 );
-    sc->H[12] = m512_const1_64( 0x6c0d8eac2d275e5c );
-    sc->H[13] = m512_const1_64( 0x0f7a0557c6508451 );
-    sc->H[14] = m512_const1_64( 0xea12247067d3e47b );
-    sc->H[15] = m512_const1_64( 0x69d71cd313abe389 );
+    sc->H[ 0] = _mm512_set1_epi64( 0xebd3202c41a398eb );
+    sc->H[ 1] = _mm512_set1_epi64( 0xc145b29c7bbecd92 );
+    sc->H[ 2] = _mm512_set1_epi64( 0xfac7d4609151931c );
+    sc->H[ 3] = _mm512_set1_epi64( 0x038a507ed6820026 );
+    sc->H[ 4] = _mm512_set1_epi64( 0x45b92677269e23a4 );
+    sc->H[ 5] = _mm512_set1_epi64( 0x77941ad4481afbe0 );
+    sc->H[ 6] = _mm512_set1_epi64( 0x7a176b0226abb5cd );
+    sc->H[ 7] = _mm512_set1_epi64( 0xa82fff0f4224f056 );
+    sc->H[ 8] = _mm512_set1_epi64( 0x754d2e7f8996a371 );
+    sc->H[ 9] = _mm512_set1_epi64( 0x62e27df70849141d );
+    sc->H[10] = _mm512_set1_epi64( 0x948f2476f7957627 );
+    sc->H[11] = _mm512_set1_epi64( 0x6c29804757b6d587 );
+    sc->H[12] = _mm512_set1_epi64( 0x6c0d8eac2d275e5c );
+    sc->H[13] = _mm512_set1_epi64( 0x0f7a0557c6508451 );
+    sc->H[14] = _mm512_set1_epi64( 0xea12247067d3e47b );
+    sc->H[15] = _mm512_set1_epi64( 0x69d71cd313abe389 );
    sc->ptr = 0;
    sc->block_count = 0;
 }
@@ -647,22 +659,22 @@ void jh256_8way_init( jh_8way_context *sc )
 void jh512_8way_init( jh_8way_context *sc )
 {
    // bswapped IV512
-    sc->H[ 0] = m512_const1_64( 0x17aa003e964bd16f );
-    sc->H[ 1] = m512_const1_64( 0x43d5157a052e6a63 );
-    sc->H[ 2] = m512_const1_64( 0x0bef970c8d5e228a );
-    sc->H[ 3] = m512_const1_64( 0x61c3b3f2591234e9 );
-    sc->H[ 4] = m512_const1_64( 0x1e806f53c1a01d89 );
-    sc->H[ 5] = m512_const1_64( 0x806d2bea6b05a92a );
-    sc->H[ 6] = m512_const1_64( 0xa6ba7520dbcc8e58 );
-    sc->H[ 7] = m512_const1_64( 0xf73bf8ba763a0fa9 );
-    sc->H[ 8] = m512_const1_64( 0x694ae34105e66901 );
-    sc->H[ 9] = m512_const1_64( 0x5ae66f2e8e8ab546 );
-    sc->H[10] = m512_const1_64( 0x243c84c1d0a74710 );
-    sc->H[11] = m512_const1_64( 0x99c15a2db1716e3b );
-    sc->H[12] = m512_const1_64( 0x56f8b19decf657cf );
-    sc->H[13] = m512_const1_64( 0x56b116577c8806a7 );
-    sc->H[14] = m512_const1_64( 0xfb1785e6dffcc2e3 );
-    sc->H[15] = m512_const1_64( 0x4bdd8ccc78465a54 );
+    sc->H[ 0] = _mm512_set1_epi64( 0x17aa003e964bd16f );
+    sc->H[ 1] = _mm512_set1_epi64( 0x43d5157a052e6a63 );
+    sc->H[ 2] = _mm512_set1_epi64( 0x0bef970c8d5e228a );
+    sc->H[ 3] = _mm512_set1_epi64( 0x61c3b3f2591234e9 );
+    sc->H[ 4] = _mm512_set1_epi64( 0x1e806f53c1a01d89 );
+    sc->H[ 5] = _mm512_set1_epi64( 0x806d2bea6b05a92a );
+    sc->H[ 6] = _mm512_set1_epi64( 0xa6ba7520dbcc8e58 );
+    sc->H[ 7] = _mm512_set1_epi64( 0xf73bf8ba763a0fa9 );
+    sc->H[ 8] = _mm512_set1_epi64( 0x694ae34105e66901 );
+    sc->H[ 9] = _mm512_set1_epi64( 0x5ae66f2e8e8ab546 );
+    sc->H[10] = _mm512_set1_epi64( 0x243c84c1d0a74710 );
+    sc->H[11] = _mm512_set1_epi64( 0x99c15a2db1716e3b );
+    sc->H[12] = _mm512_set1_epi64( 0x56f8b19decf657cf );
+    sc->H[13] = _mm512_set1_epi64( 0x56b116577c8806a7 );
+    sc->H[14] = _mm512_set1_epi64( 0xfb1785e6dffcc2e3 );
+    sc->H[15] = _mm512_set1_epi64( 0x4bdd8ccc78465a54 );
    sc->ptr = 0;
    sc->block_count = 0;
 }
@@ -721,7 +733,7 @@ jh_8way_close( jh_8way_context *sc, unsigned ub, unsigned n, void *dst,
   size_t numz, u;
   uint64_t l0, l1;

-   buf[0] = m512_const1_64( 0x80ULL );
+   buf[0] = _mm512_set1_epi64( 0x80ULL );

   if ( sc->ptr == 0 )
       numz = 48;
@@ -772,22 +784,22 @@ jh512_8way_close(void *cc, void *dst)
 void jh256_4way_init( jh_4way_context *sc )
 {
    // bswapped IV256
-    sc->H[ 0] = m256_const1_64( 0xebd3202c41a398eb );
-    sc->H[ 1] = m256_const1_64( 0xc145b29c7bbecd92 );
-    sc->H[ 2] = m256_const1_64( 0xfac7d4609151931c );
-    sc->H[ 3] = m256_const1_64( 0x038a507ed6820026 );
-    sc->H[ 4] = m256_const1_64( 0x45b92677269e23a4 );
-    sc->H[ 5] = m256_const1_64( 0x77941ad4481afbe0 );
-    sc->H[ 6] = m256_const1_64( 0x7a176b0226abb5cd );
-    sc->H[ 7] = m256_const1_64( 0xa82fff0f4224f056 );
-    sc->H[ 8] = m256_const1_64( 0x754d2e7f8996a371 );
-    sc->H[ 9] = m256_const1_64( 0x62e27df70849141d );
-    sc->H[10] = m256_const1_64( 0x948f2476f7957627 );
-    sc->H[11] = m256_const1_64( 0x6c29804757b6d587 );
-    sc->H[12] = m256_const1_64( 0x6c0d8eac2d275e5c );
-    sc->H[13] = m256_const1_64( 0x0f7a0557c6508451 );
-    sc->H[14] = m256_const1_64( 0xea12247067d3e47b );
-    sc->H[15] = m256_const1_64( 0x69d71cd313abe389 );
+    sc->H[ 0] = _mm256_set1_epi64x( 0xebd3202c41a398eb );
+    sc->H[ 1] = _mm256_set1_epi64x( 0xc145b29c7bbecd92 );
+    sc->H[ 2] = _mm256_set1_epi64x( 0xfac7d4609151931c );
+    sc->H[ 3] = _mm256_set1_epi64x( 0x038a507ed6820026 );
+    sc->H[ 4] = _mm256_set1_epi64x( 0x45b92677269e23a4 );
+    sc->H[ 5] = _mm256_set1_epi64x( 0x77941ad4481afbe0 );
+    sc->H[ 6] = _mm256_set1_epi64x( 0x7a176b0226abb5cd );
+    sc->H[ 7] = _mm256_set1_epi64x( 0xa82fff0f4224f056 );
+    sc->H[ 8] = _mm256_set1_epi64x( 0x754d2e7f8996a371 );
+    sc->H[ 9] = _mm256_set1_epi64x( 0x62e27df70849141d );
+    sc->H[10] = _mm256_set1_epi64x( 0x948f2476f7957627 );
+    sc->H[11] = _mm256_set1_epi64x( 0x6c29804757b6d587 );
+    sc->H[12] = _mm256_set1_epi64x( 0x6c0d8eac2d275e5c );
+    sc->H[13] = _mm256_set1_epi64x( 0x0f7a0557c6508451 );
+    sc->H[14] = _mm256_set1_epi64x( 0xea12247067d3e47b );
+    sc->H[15] = _mm256_set1_epi64x( 0x69d71cd313abe389 );
    sc->ptr = 0;
    sc->block_count = 0;
 }
@@ -795,22 +807,22 @@ void jh256_4way_init( jh_4way_context *sc )
 void jh512_4way_init( jh_4way_context *sc )
 {
    // bswapped IV512
-    sc->H[ 0] = m256_const1_64( 0x17aa003e964bd16f );
-    sc->H[ 1] = m256_const1_64( 0x43d5157a052e6a63 );
-    sc->H[ 2] = m256_const1_64( 0x0bef970c8d5e228a );
-    sc->H[ 3] = m256_const1_64( 0x61c3b3f2591234e9 );
-    sc->H[ 4] = m256_const1_64( 0x1e806f53c1a01d89 );
-    sc->H[ 5] = m256_const1_64( 0x806d2bea6b05a92a );
-    sc->H[ 6] = m256_const1_64( 0xa6ba7520dbcc8e58 );
-    sc->H[ 7] = m256_const1_64( 0xf73bf8ba763a0fa9 );
-    sc->H[ 8] = m256_const1_64( 0x694ae34105e66901 );
-    sc->H[ 9] = m256_const1_64( 0x5ae66f2e8e8ab546 );
-    sc->H[10] = m256_const1_64( 0x243c84c1d0a74710 );
-    sc->H[11] = m256_const1_64( 0x99c15a2db1716e3b );
-    sc->H[12] = m256_const1_64( 0x56f8b19decf657cf );
-    sc->H[13] = m256_const1_64( 0x56b116577c8806a7 );
-    sc->H[14] = m256_const1_64( 0xfb1785e6dffcc2e3 );
-    sc->H[15] = m256_const1_64( 0x4bdd8ccc78465a54 );
+    sc->H[ 0] = _mm256_set1_epi64x( 0x17aa003e964bd16f );
+    sc->H[ 1] = _mm256_set1_epi64x( 0x43d5157a052e6a63 );
+    sc->H[ 2] = _mm256_set1_epi64x( 0x0bef970c8d5e228a );
+    sc->H[ 3] = _mm256_set1_epi64x( 0x61c3b3f2591234e9 );
+    sc->H[ 4] = _mm256_set1_epi64x( 0x1e806f53c1a01d89 );
+    sc->H[ 5] = _mm256_set1_epi64x( 0x806d2bea6b05a92a );
+    sc->H[ 6] = _mm256_set1_epi64x( 0xa6ba7520dbcc8e58 );
+    sc->H[ 7] = _mm256_set1_epi64x( 0xf73bf8ba763a0fa9 );
+    sc->H[ 8] = _mm256_set1_epi64x( 0x694ae34105e66901 );
+    sc->H[ 9] = _mm256_set1_epi64x( 0x5ae66f2e8e8ab546 );
+    sc->H[10] = _mm256_set1_epi64x( 0x243c84c1d0a74710 );
+    sc->H[11] = _mm256_set1_epi64x( 0x99c15a2db1716e3b );
+    sc->H[12] = _mm256_set1_epi64x( 0x56f8b19decf657cf );
+    sc->H[13] = _mm256_set1_epi64x( 0x56b116577c8806a7 );
+    sc->H[14] = _mm256_set1_epi64x( 0xfb1785e6dffcc2e3 );
+    sc->H[15] = _mm256_set1_epi64x( 0x4bdd8ccc78465a54 );
    sc->ptr = 0;
    sc->block_count = 0;
 }
@@ -869,7 +881,7 @@ jh_4way_close( jh_4way_context *sc, unsigned ub, unsigned n, void *dst,
   size_t numz, u;
   uint64_t l0, l1;

-   buf[0] = m256_const1_64( 0x80ULL );
+   buf[0] = _mm256_set1_epi64x( 0x80ULL );

   if ( sc->ptr == 0 )
       numz = 48;
--- a/algo/keccak/keccak-4way.c
+++ b/algo/keccak/keccak-4way.c
@@ -49,7 +49,7 @@ int scanhash_keccak_8way( struct work *work, uint32_t max_nonce,
          }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;

   } while ( (n < max_nonce-8) && !work_restart[thr_id].restart);
@@ -101,7 +101,7 @@ int scanhash_keccak_4way( struct work *work, uint32_t max_nonce,
          }
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( (n < max_nonce-4) && !work_restart[thr_id].restart);
   pdata[19] = n;
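
Note on the nonce step in the two hunks above: the 64-bit constant is broadcast to every 64-bit lane, but the addition is done per 32-bit element, so only the upper word of each lane moves and no carry crosses into its neighbour. Below is a minimal portable sketch of one lane. It is a scalar model, not the intrinsic itself; the helper name `model_add_epi32_lane` is hypothetical, and reading the upper word as the per-lane nonce is an assumption drawn from the constant's shape.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of a packed 32-bit add
   (the behaviour of _mm256_add_epi32 / _mm512_add_epi32):
   the two 32-bit halves are added independently and wrap on overflow. */
static uint64_t model_add_epi32_lane( uint64_t a, uint64_t b )
{
   uint32_t lo = (uint32_t)a + (uint32_t)b;                     /* low element  */
   uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 ); /* high element */
   return ( (uint64_t)hi << 32 ) | lo;
}

/* Increment used by the 4-way loop: 4 in the upper word, 0 in the lower. */
static const uint64_t NONCE_STEP_4WAY = 0x0000000400000000ULL;
```

With this model the lower word is never touched, and an upper word near its maximum wraps without disturbing the lower word.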
--- a/algo/keccak/keccak-hash-4way.c
+++ b/algo/keccak/keccak-hash-4way.c
@@ -180,15 +180,15 @@ static void keccak64_8way_close( keccak64_ctx_m512i *kc, void *dst,
    if ( kc->ptr == (lim - 8) )
    {
        const uint64_t t = eb | 0x8000000000000000;
-        u.tmp[0] = m512_const1_64( t );
+        u.tmp[0] = _mm512_set1_epi64( t );
        j = 8;
    }
    else
    {
        j = lim - kc->ptr;
-        u.tmp[0] = m512_const1_64( eb );
+        u.tmp[0] = _mm512_set1_epi64( eb );
        memset_zero_512( u.tmp + 1, (j>>3) - 2 );
-        u.tmp[ (j>>3) - 1] = m512_const1_64( 0x8000000000000000 );
+        u.tmp[ (j>>3) - 1] = _mm512_set1_epi64( 0x8000000000000000 );
    }
    keccak64_8way_core( kc, u.tmp, j, lim );
    /* Finalize the "lane complement" */
@@ -264,8 +264,8 @@ keccak512_8way_close(void *cc, void *dst)
 #define OR64(d, a, b)      (d = _mm256_or_si256(a,b))
 #define NOT64(d, s)        (d = mm256_not( s ) )
 #define ROL64(d, v, n)     (d = mm256_rol_64(v, n))
-#define XOROR(d, a, b, c)  (d = _mm256_xor_si256(a, _mm256_or_si256(b, c)))
-#define XORAND(d, a, b, c) (d = _mm256_xor_si256(a, _mm256_and_si256(b, c)))
+#define XOROR(d, a, b, c)  (d = mm256_xoror( a, b, c ) )
+#define XORAND(d, a, b, c) (d = mm256_xorand( a, b, c ) )
 #define XOR3( d, a, b, c ) (d = mm256_xor3( a, b, c ))

 #include "keccak-macros.c"
@@ -368,15 +368,15 @@ static void keccak64_close( keccak64_ctx_m256i *kc, void *dst, size_t byte_len,
    if ( kc->ptr == (lim - 8) )
    {
        const uint64_t t = eb | 0x8000000000000000;
-        u.tmp[0] = m256_const1_64( t );
+        u.tmp[0] = _mm256_set1_epi64x( t );
        j = 8;
    }
    else
    {
        j = lim - kc->ptr;
-        u.tmp[0] = m256_const1_64( eb );
+        u.tmp[0] = _mm256_set1_epi64x( eb );
        memset_zero_256( u.tmp + 1, (j>>3) - 2 );
-        u.tmp[ (j>>3) - 1] = m256_const1_64( 0x8000000000000000 );
+        u.tmp[ (j>>3) - 1] = _mm256_set1_epi64x( 0x8000000000000000 );
    }
    keccak64_core( kc, u.tmp, j, lim );
    /* Finalize the "lane complement" */
--- a/algo/keccak/sha3d-4way.c
+++ b/algo/keccak/sha3d-4way.c
@@ -56,7 +56,7 @@ int scanhash_sha3d_8way( struct work *work, uint32_t max_nonce,
          }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;

   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
@@ -115,7 +115,7 @@ int scanhash_sha3d_4way( struct work *work, uint32_t max_nonce,
          }
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
--- a/algo/luffa/luffa-hash-2way.c
+++ b/algo/luffa/luffa-hash-2way.c
@@ -60,7 +60,7 @@ static const uint32 CNS_INIT[128] __attribute((aligned(64))) = {

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

-#define cns4w(i)  m512_const1_128( ( (__m128i*)CNS_INIT)[i] )
+#define cns4w(i)  mm512_bcast_m128( ( (__m128i*)CNS_INIT)[i] )

 #define ADD_CONSTANT4W( a, b, c0, c1 ) \
    a = _mm512_xor_si512( a, c0 ); \
@@ -69,7 +69,7 @@ static const uint32 CNS_INIT[128] __attribute((aligned(64))) = {
 #define MULT24W( a0, a1 ) \
 { \
  __m512i b = _mm512_xor_si512( a0, \
-                     _mm512_maskz_shuffle_epi32( 0xbbbb, a1, 16 ) ); \
+                     _mm512_maskz_shuffle_epi32( 0xbbbb, a1, 0x10 ) ); \
  a0 = _mm512_alignr_epi8( a1,  b, 4 ); \
  a1 = _mm512_alignr_epi8(  b, a1, 4 ); \
 }
@@ -107,58 +107,45 @@ static const uint32 CNS_INIT[128] __attribute((aligned(64))) = {
    ADD_CONSTANT4W( x0, x4, c0, c1 );

 #define STEP_PART24W( a0, a1, t0, t1, c0, c1 ) \
-    a1 = _mm512_shuffle_epi32( a1, 147 ); \
-    t0 = _mm512_load_si512( &a1 ); \
-    a1 = _mm512_unpacklo_epi32( a1, a0 ); \
+    t0 = _mm512_shuffle_epi32( a1, 147 ); \
+    a1 = _mm512_unpacklo_epi32( t0, a0 ); \
    t0 = _mm512_unpackhi_epi32( t0, a0 ); \
    t1 = _mm512_shuffle_epi32( t0, 78 ); \
    a0 = _mm512_shuffle_epi32( a1, 78 ); \
    SUBCRUMB4W( t1, t0, a0, a1 ); \
    t0 = _mm512_unpacklo_epi32( t0, t1 ); \
    a1 = _mm512_unpacklo_epi32( a1, a0 ); \
-    a0 = _mm512_load_si512( &a1 ); \
-    a0 = _mm512_unpackhi_epi64( a0, t0 ); \
+    a0 = _mm512_unpackhi_epi64( a1, t0 ); \
    a1 = _mm512_unpacklo_epi64( a1, t0 ); \
    a1 = _mm512_shuffle_epi32( a1, 57 ); \
    MIXWORD4W( a0, a1 ); \
    ADD_CONSTANT4W( a0, a1, c0, c1 );

 #define NMLTOM10244W(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
-    s1 = _mm512_load_si512(&r3);\
-    q1 = _mm512_load_si512(&p3);\
-    s3 = _mm512_load_si512(&r3);\
-    q3 = _mm512_load_si512(&p3);\
-    s1 = _mm512_unpackhi_epi32(s1,r2);\
-    q1 = _mm512_unpackhi_epi32(q1,p2);\
-    s3 = _mm512_unpacklo_epi32(s3,r2);\
-    q3 = _mm512_unpacklo_epi32(q3,p2);\
-    s0 = _mm512_load_si512(&s1);\
-    q0 = _mm512_load_si512(&q1);\
-    s2 = _mm512_load_si512(&s3);\
-    q2 = _mm512_load_si512(&q3);\
-    r3 = _mm512_load_si512(&r1);\
-    p3 = _mm512_load_si512(&p1);\
-    r1 = _mm512_unpacklo_epi32(r1,r0);\
-    p1 = _mm512_unpacklo_epi32(p1,p0);\
-    r3 = _mm512_unpackhi_epi32(r3,r0);\
-    p3 = _mm512_unpackhi_epi32(p3,p0);\
-    s0 = _mm512_unpackhi_epi64(s0,r3);\
-    q0 = _mm512_unpackhi_epi64(q0,p3);\
-    s1 = _mm512_unpacklo_epi64(s1,r3);\
-    q1 = _mm512_unpacklo_epi64(q1,p3);\
-    s2 = _mm512_unpackhi_epi64(s2,r1);\
-    q2 = _mm512_unpackhi_epi64(q2,p1);\
-    s3 = _mm512_unpacklo_epi64(s3,r1);\
-    q3 = _mm512_unpacklo_epi64(q3,p1);
+    s1 = _mm512_unpackhi_epi32( r3, r2 ); \
+    q1 = _mm512_unpackhi_epi32( p3, p2 ); \
+    s3 = _mm512_unpacklo_epi32( r3, r2 ); \
+    q3 = _mm512_unpacklo_epi32( p3, p2 ); \
+    r3 = _mm512_unpackhi_epi32( r1, r0 ); \
+    r1 = _mm512_unpacklo_epi32( r1, r0 ); \
+    p3 = _mm512_unpackhi_epi32( p1, p0 ); \
+    p1 = _mm512_unpacklo_epi32( p1, p0 ); \
+    s0 = _mm512_unpackhi_epi64( s1, r3 ); \
+    q0 = _mm512_unpackhi_epi64( q1 ,p3 ); \
+    s1 = _mm512_unpacklo_epi64( s1, r3 ); \
+    q1 = _mm512_unpacklo_epi64( q1, p3 ); \
+    s2 = _mm512_unpackhi_epi64( s3, r1 ); \
+    q2 = _mm512_unpackhi_epi64( q3, p1 ); \
+    s3 = _mm512_unpacklo_epi64( s3, r1 ); \
+    q3 = _mm512_unpacklo_epi64( q3, p1 );

 #define MIXTON10244W(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
    NMLTOM10244W(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3);

-void rnd512_4way( luffa_4way_context *state, __m512i *msg )
+void rnd512_4way( luffa_4way_context *state, const __m512i *msg )
 {
    __m512i t0, t1;
    __m512i *chainv = state->chainv;
-    __m512i msg0, msg1;
    __m512i x0, x1, x2, x3, x4, x5, x6, x7;

    t0 = mm512_xor3( chainv[0], chainv[2], chainv[4] );
@@ -168,9 +155,6 @@ void rnd512_4way( luffa_4way_context *state, __m512i *msg )

    MULT24W( t0, t1 );

-    msg0 = _mm512_shuffle_epi32( msg[0], 27 );
-    msg1 = _mm512_shuffle_epi32( msg[1], 27 );
-
    chainv[0] = _mm512_xor_si512( chainv[0], t0 );
    chainv[1] = _mm512_xor_si512( chainv[1], t1 );
    chainv[2] = _mm512_xor_si512( chainv[2], t0 );
@@ -202,11 +186,8 @@ void rnd512_4way( luffa_4way_context *state, __m512i *msg )
    chainv[7] = _mm512_xor_si512(chainv[7], chainv[9]);

    MULT24W( chainv[8], chainv[9] );
-    chainv[8] = _mm512_xor_si512( chainv[8], t0 );
-    chainv[9] = _mm512_xor_si512( chainv[9], t1 );
-
-    t0 = chainv[8];
-    t1 = chainv[9];
+    t0 = chainv[8] = _mm512_xor_si512( chainv[8], t0 );
+    t1 = chainv[9] = _mm512_xor_si512( chainv[9], t1 );

    MULT24W( chainv[8], chainv[9] );
    chainv[8] = _mm512_xor_si512( chainv[8], chainv[6] );
@@ -225,27 +206,36 @@ void rnd512_4way( luffa_4way_context *state, __m512i *msg )
    chainv[3] = _mm512_xor_si512( chainv[3], chainv[1] );

    MULT24W( chainv[0], chainv[1] );
-    chainv[0] = mm512_xor3( chainv[0], t0, msg0 );
-    chainv[1] = mm512_xor3( chainv[1], t1, msg1 );
+    chainv[0] = _mm512_xor_si512( chainv[0], t0 );
+    chainv[1] = _mm512_xor_si512( chainv[1], t1 );

-    MULT24W( msg0, msg1 );
-    chainv[2] = _mm512_xor_si512( chainv[2], msg0 );
-    chainv[3] = _mm512_xor_si512( chainv[3], msg1 );
+    if ( msg )
+    {
+       __m512i msg0, msg1;

-    MULT24W( msg0, msg1 );
-    chainv[4] = _mm512_xor_si512( chainv[4], msg0 );
-    chainv[5] = _mm512_xor_si512( chainv[5], msg1 );
+       msg0 = _mm512_shuffle_epi32( msg[0], 27 );
+       msg1 = _mm512_shuffle_epi32( msg[1], 27 );

-    MULT24W( msg0, msg1 );
-    chainv[6] = _mm512_xor_si512( chainv[6], msg0 );
-    chainv[7] = _mm512_xor_si512( chainv[7], msg1 );
+       chainv[0] = _mm512_xor_si512( chainv[0], msg0 );
+       chainv[1] = _mm512_xor_si512( chainv[1], msg1 );

-    MULT24W( msg0, msg1);
-    chainv[8] = _mm512_xor_si512( chainv[8], msg0 );
-    chainv[9] = _mm512_xor_si512( chainv[9], msg1 );
+       MULT24W( msg0, msg1 );
+       chainv[2] = _mm512_xor_si512( chainv[2], msg0 );
+       chainv[3] = _mm512_xor_si512( chainv[3], msg1 );

-    MULT24W( msg0, msg1 );
+       MULT24W( msg0, msg1 );
+       chainv[4] = _mm512_xor_si512( chainv[4], msg0 );
+       chainv[5] = _mm512_xor_si512( chainv[5], msg1 );

+       MULT24W( msg0, msg1 );
+       chainv[6] = _mm512_xor_si512( chainv[6], msg0 );
+       chainv[7] = _mm512_xor_si512( chainv[7], msg1 );
+
+       MULT24W( msg0, msg1);
+       chainv[8] = _mm512_xor_si512( chainv[8], msg0 );
+       chainv[9] = _mm512_xor_si512( chainv[9], msg1 );
+    }
+    
    chainv[3] = _mm512_rol_epi32( chainv[3], 1 );
    chainv[5] = _mm512_rol_epi32( chainv[5], 2 );
    chainv[7] = _mm512_rol_epi32( chainv[7], 3 );
@@ -282,16 +272,11 @@ void finalization512_4way( luffa_4way_context *state, uint32 *b )
    uint32_t hash[8*4] __attribute((aligned(128)));
    __m512i* chainv = state->chainv;
    __m512i t[2];
-    __m512i zero[2];
-    zero[0] = zero[1] = m512_zero;
-    const __m512i shuff_bswap32 = m512_const_64(
-                                  0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                  0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                  0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                  0x0c0d0e0f08090a0b, 0x0405060700010203 );
+    const __m512i shuff_bswap32 = mm512_bcast_m128( _mm_set_epi64x( 
+                                  0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

    /*---- blank round with m=0 ----*/
-    rnd512_4way( state, zero );
+    rnd512_4way( state, NULL );
    
    t[0] = mm512_xor3( chainv[0], chainv[2], chainv[4] );
    t[1] = mm512_xor3( chainv[1], chainv[3], chainv[5] );
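
Note on the shorter `shuff_bswap32` constant above: a 512-bit byte shuffle works on each 128-bit lane independently and reads only the low 4 bits of each control byte (plus bit 7 as a zeroing flag), so repeating one 16-byte pattern should select the same bytes as the old eight-word constant. The scalar sketch below checks that. All function names are hypothetical, and it assumes `m512_const_64` takes its arguments from high to low, as `_mm512_set_epi64` does.

```c
#include <stdint.h>

/* Scalar model of _mm512_shuffle_epi8: per 128-bit lane, control bit 7
   zeroes the byte, otherwise the low 4 bits index within the same lane. */
static void model_shuffle_epi8_512( uint8_t dst[64], const uint8_t src[64],
                                    const uint8_t ctl[64] )
{
   for ( int i = 0; i < 64; i++ )
   {
      int lane_base = i & 0x30;
      dst[i] = ( ctl[i] & 0x80 ) ? 0 : src[ lane_base + ( ctl[i] & 0x0f ) ];
   }
}

/* Old control: bytes 0x00..0x3f, reversed within each 32-bit word. */
static void make_ctl_full( uint8_t ctl[64] )
{
   for ( int i = 0; i < 64; i++ )
      ctl[i] = (uint8_t)( ( i & 0x3c ) | ( 3 - ( i & 3 ) ) );
}

/* New control: the 16-byte pattern of lane 0 repeated in every lane. */
static void make_ctl_bcast( uint8_t ctl[64] )
{
   for ( int i = 0; i < 64; i++ )
      ctl[i] = (uint8_t)( ( i & 0x0c ) | ( 3 - ( i & 3 ) ) );
}

/* Returns 1 if both controls byte-swap every 32-bit word identically. */
static int bswap_controls_agree( void )
{
   uint8_t src[64], c_old[64], c_new[64], d_old[64], d_new[64];
   for ( int i = 0; i < 64; i++ ) src[i] = (uint8_t)i;
   make_ctl_full( c_old );
   make_ctl_bcast( c_new );
   model_shuffle_epi8_512( d_old, src, c_old );
   model_shuffle_epi8_512( d_new, src, c_new );
   for ( int i = 0; i < 64; i++ )
      if ( d_old[i] != d_new[i] || d_new[i] != ( ( i & 0x3c ) | ( 3 - ( i & 3 ) ) ) )
         return 0;
   return 1;
}
```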
@@ -300,37 +285,30 @@ void finalization512_4way( luffa_4way_context *state, uint32 *b )
    t[0] = _mm512_shuffle_epi32( t[0], 27 );
    t[1] = _mm512_shuffle_epi32( t[1], 27 );

-    _mm512_store_si512( (__m512i*)&hash[0], t[0] );
+    _mm512_store_si512( (__m512i*)&hash[ 0], t[0] );
    _mm512_store_si512( (__m512i*)&hash[16], t[1] );

-    casti_m512i( b, 0 ) = _mm512_shuffle_epi8(
-                                  casti_m512i( hash, 0 ), shuff_bswap32 );
-    casti_m512i( b, 1 ) = _mm512_shuffle_epi8(
-                                  casti_m512i( hash, 1 ), shuff_bswap32 );
+    casti_m512i( b,0 ) = _mm512_shuffle_epi8(
+                                  casti_m512i( hash,0 ), shuff_bswap32 );
+    casti_m512i( b,1 ) = _mm512_shuffle_epi8(
+                                  casti_m512i( hash,1 ), shuff_bswap32 );

-    rnd512_4way( state, zero );
-
-    t[0] = chainv[0];
-    t[1] = chainv[1];
-    t[0] = _mm512_xor_si512( t[0], chainv[2] );
-    t[1] = _mm512_xor_si512( t[1], chainv[3] );
-    t[0] = _mm512_xor_si512( t[0], chainv[4] );
-    t[1] = _mm512_xor_si512( t[1], chainv[5] );
-    t[0] = _mm512_xor_si512( t[0], chainv[6] );
-    t[1] = _mm512_xor_si512( t[1], chainv[7] );
-    t[0] = _mm512_xor_si512( t[0], chainv[8] );
-    t[1] = _mm512_xor_si512( t[1], chainv[9] );
+    rnd512_4way( state, NULL );

+    t[0] = mm512_xor3( chainv[0], chainv[2], chainv[4] );
+    t[1] = mm512_xor3( chainv[1], chainv[3], chainv[5] );
+    t[0] = mm512_xor3( t[0], chainv[6], chainv[8] );
+    t[1] = mm512_xor3( t[1], chainv[7], chainv[9] );
    t[0] = _mm512_shuffle_epi32( t[0], 27 );
    t[1] = _mm512_shuffle_epi32( t[1], 27 );

-    _mm512_store_si512( (__m512i*)&hash[0], t[0] );
+    _mm512_store_si512( (__m512i*)&hash[ 0], t[0] );
    _mm512_store_si512( (__m512i*)&hash[16], t[1] );

-    casti_m512i( b, 2 ) = _mm512_shuffle_epi8(
-                                  casti_m512i( hash, 0 ), shuff_bswap32 );
-    casti_m512i( b, 3 ) = _mm512_shuffle_epi8(
-                                  casti_m512i( hash, 1 ), shuff_bswap32 );
+    casti_m512i( b,2 ) = _mm512_shuffle_epi8(
+                                  casti_m512i( hash,0 ), shuff_bswap32 );
+    casti_m512i( b,3 ) = _mm512_shuffle_epi8(
+                                  casti_m512i( hash,1 ), shuff_bswap32 );
 }

 int luffa_4way_init( luffa_4way_context *state, int hashbitlen )
@@ -338,16 +316,16 @@ int luffa_4way_init( luffa_4way_context *state, int hashbitlen )
    state->hashbitlen = hashbitlen;
    __m128i *iv = (__m128i*)IV;

-    state->chainv[0] = m512_const1_128( iv[0] );
-    state->chainv[1] = m512_const1_128( iv[1] );
-    state->chainv[2] = m512_const1_128( iv[2] );
-    state->chainv[3] = m512_const1_128( iv[3] );
-    state->chainv[4] = m512_const1_128( iv[4] );
-    state->chainv[5] = m512_const1_128( iv[5] );
-    state->chainv[6] = m512_const1_128( iv[6] );
-    state->chainv[7] = m512_const1_128( iv[7] );
-    state->chainv[8] = m512_const1_128( iv[8] );
-    state->chainv[9] = m512_const1_128( iv[9] );
+    state->chainv[0] = mm512_bcast_m128( iv[0] );
+    state->chainv[1] = mm512_bcast_m128( iv[1] );
+    state->chainv[2] = mm512_bcast_m128( iv[2] );
+    state->chainv[3] = mm512_bcast_m128( iv[3] );
+    state->chainv[4] = mm512_bcast_m128( iv[4] );
+    state->chainv[5] = mm512_bcast_m128( iv[5] );
+    state->chainv[6] = mm512_bcast_m128( iv[6] );
+    state->chainv[7] = mm512_bcast_m128( iv[7] );
+    state->chainv[8] = mm512_bcast_m128( iv[8] );
+    state->chainv[9] = mm512_bcast_m128( iv[9] );

    ((__m512i*)state->buffer)[0] = m512_zero;
    ((__m512i*)state->buffer)[1] = m512_zero;
@@ -370,11 +348,8 @@ int luffa_4way_update( luffa_4way_context *state, const void *data,
    __m512i msg[2];
    int i;
    int blocks = (int)len >> 5;
-    const __m512i shuff_bswap32 = m512_const_64( 
-                                   0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                   0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                   0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                   0x0c0d0e0f08090a0b, 0x0405060700010203 );
+    const __m512i shuff_bswap32 = mm512_bcast_m128( _mm_set_epi64x(  
+                                   0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

    state->rembytes = (int)len & 0x1F;

@@ -392,7 +367,7 @@ int luffa_4way_update( luffa_4way_context *state, const void *data,
    {
      // remaining data bytes
      buffer[0] = _mm512_shuffle_epi8( vdata[0], shuff_bswap32 );
-      buffer[1] = m512_const1_i128(  0x0000000080000000 );
+      buffer[1] = mm512_bcast128lo_64( 0x0000000080000000 );
    }
    return 0;
 }
@@ -416,7 +391,7 @@ int luffa_4way_close( luffa_4way_context *state, void *hashval )
      rnd512_4way( state, buffer );
    else
    {     // empty pad block, constant data
-      msg[0] = m512_const1_i128(  0x0000000080000000 );
+      msg[0] = mm512_bcast128lo_64( 0x0000000080000000 );
      msg[1] = m512_zero;
      rnd512_4way( state, msg );
    }
@@ -440,16 +415,16 @@ int luffa512_4way_full( luffa_4way_context *state, void *output,
    state->hashbitlen = 512;
    __m128i *iv = (__m128i*)IV;

-    state->chainv[0] = m512_const1_128( iv[0] );
-    state->chainv[1] = m512_const1_128( iv[1] );
-    state->chainv[2] = m512_const1_128( iv[2] );
-    state->chainv[3] = m512_const1_128( iv[3] );
-    state->chainv[4] = m512_const1_128( iv[4] );
-    state->chainv[5] = m512_const1_128( iv[5] );
-    state->chainv[6] = m512_const1_128( iv[6] );
-    state->chainv[7] = m512_const1_128( iv[7] );
-    state->chainv[8] = m512_const1_128( iv[8] );
-    state->chainv[9] = m512_const1_128( iv[9] );
+    state->chainv[0] = mm512_bcast_m128( iv[0] );
+    state->chainv[1] = mm512_bcast_m128( iv[1] );
+    state->chainv[2] = mm512_bcast_m128( iv[2] );
+    state->chainv[3] = mm512_bcast_m128( iv[3] );
+    state->chainv[4] = mm512_bcast_m128( iv[4] );
+    state->chainv[5] = mm512_bcast_m128( iv[5] );
+    state->chainv[6] = mm512_bcast_m128( iv[6] );
+    state->chainv[7] = mm512_bcast_m128( iv[7] );
+    state->chainv[8] = mm512_bcast_m128( iv[8] );
+    state->chainv[9] = mm512_bcast_m128( iv[9] );

    ((__m512i*)state->buffer)[0] = m512_zero;
    ((__m512i*)state->buffer)[1] = m512_zero;
@@ -458,11 +433,8 @@ int luffa512_4way_full( luffa_4way_context *state, void *output,
    __m512i msg[2];
    int i;
    const int blocks = (int)( inlen >> 5 );
-    const __m512i shuff_bswap32 = m512_const_64(
-                                   0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                   0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                   0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                   0x0c0d0e0f08090a0b, 0x0405060700010203 );
+    const __m512i shuff_bswap32 = mm512_bcast_m128( _mm_set_epi64x( 
+                                   0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

    state->rembytes = inlen & 0x1F;

@@ -479,13 +451,13 @@ int luffa512_4way_full( luffa_4way_context *state, void *output,
    {
       // padding of partial block
       msg[0] = _mm512_shuffle_epi8( vdata[ 0 ], shuff_bswap32 );
-       msg[1] = m512_const1_i128(  0x0000000080000000 );
+       msg[1] = mm512_bcast128lo_64( 0x0000000080000000 );
       rnd512_4way( state, msg );
    }
    else
    {
       // empty pad block
-       msg[0] = m512_const1_i128( 0x0000000080000000 );
+       msg[0] = mm512_bcast128lo_64( 0x0000000080000000 );
       msg[1] = m512_zero;
       rnd512_4way( state, msg );
    }
@@ -506,11 +478,8 @@ int luffa_4way_update_close( luffa_4way_context *state,
    __m512i msg[2];
    int i;
    const int blocks = (int)( inlen >> 5 );
-    const __m512i shuff_bswap32 = m512_const_64(
-                                   0x3c3d3e3f38393a3b, 0x3435363730313233,
-                                   0x2c2d2e2f28292a2b, 0x2425262720212223,
-                                   0x1c1d1e1f18191a1b, 0x1415161710111213,
-                                   0x0c0d0e0f08090a0b, 0x0405060700010203 );
+    const __m512i shuff_bswap32 = mm512_bcast_m128( _mm_set_epi64x( 
+                                   0x0c0d0e0f08090a0b, 0x0405060700010203 ) );

    state->rembytes = inlen & 0x1F;

@@ -527,13 +496,13 @@ int luffa_4way_update_close( luffa_4way_context *state,
    {
       // padding of partial block
       msg[0] = _mm512_shuffle_epi8( vdata[ 0 ], shuff_bswap32 );
-       msg[1] = m512_const1_i128( 0x0000000080000000 );
+       msg[1] = mm512_bcast128lo_64( 0x0000000080000000 );
       rnd512_4way( state, msg );
    }
    else
    {
       // empty pad block
-       msg[0] = m512_const1_i128( 0x0000000080000000 );
+       msg[0] = mm512_bcast128lo_64( 0x0000000080000000 );
       msg[1] = m512_zero;
       rnd512_4way( state, msg );
    }
@@ -548,26 +517,45 @@ int luffa_4way_update_close( luffa_4way_context *state,

 #endif // AVX512

-#define cns(i)  m256_const1_128( ( (__m128i*)CNS_INIT)[i] )
+#define cns(i)  mm256_bcast_m128( ( (__m128i*)CNS_INIT)[i] )

 #define ADD_CONSTANT( a, b, c0, c1 ) \
    a = _mm256_xor_si256( a, c0 ); \
    b = _mm256_xor_si256( b, c1 );

-/*
-#define MULT2( a0, a1, mask ) \
-do { \
-  __m256i b = _mm256_xor_si256( a0, \
-                   _mm256_shuffle_epi32( _mm256_and_si256(a1,mask), 16 ) ); \
-  a0 = _mm256_or_si256( _mm256_srli_si256(b,4), _mm256_slli_si256(a1,12) ); \
-  a1 = _mm256_or_si256( _mm256_srli_si256(a1,4), _mm256_slli_si256(b,12) );  \
-} while(0)
-*/
+//TODO Enable for AVX10_256, not used with AVX512 or AVX10_512
+#if defined(__AVX512VL__) 

-#define MULT2( a0, a1, mask ) \
+#define MULT2( a0, a1 ) \
 { \
  __m256i b = _mm256_xor_si256( a0, \
-                 _mm256_shuffle_epi32( _mm256_and_si256( a1, mask ), 16 ) ); \
+                     _mm256_maskz_shuffle_epi32( 0xbb, a1, 0x10 ) ); \
+  a0 = _mm256_alignr_epi8( a1,  b, 4 ); \
+  a1 = _mm256_alignr_epi8(  b, a1, 4 ); \
+}
+
+#define SUBCRUMB( a0, a1, a2, a3 ) \
+{ \
+    __m256i t = a0; \
+    a0 = mm256_xoror( a3, a0, a1 ); \
+    a2 = _mm256_xor_si256( a2, a3 ); \
+    a1 = _mm256_ternarylogic_epi64( a1, a3, t, 0x87 ); /* a1 xnor (a3 & t) */ \
+    a3 = mm256_xorand( a2, a3, t ); \
+    a2 = mm256_xorand( a1, a2, a0); \
+    a1 = _mm256_or_si256( a1, a3 ); \
+    a3 = _mm256_xor_si256( a3, a2 ); \
+    t  = _mm256_xor_si256( t, a1 ); \
+    a2 = _mm256_and_si256( a2, a1 ); \
+    a1 = mm256_xnor( a1, a0 ); \
+    a0 = t; \
+}
+
+#else
+
+#define MULT2( a0, a1 ) \
+{ \
+  __m256i b = _mm256_xor_si256( a0, _mm256_shuffle_epi32( \
+                         _mm256_blend_epi32( a1, m256_zero, 0xee ), 0x10 ) ); \
  a0 = _mm256_alignr_epi8( a1,  b, 4 ); \
  a1 = _mm256_alignr_epi8(  b, a1, 4 ); \
 }
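
Note on the `0x87` immediate in the AVX512VL `SUBCRUMB` above: a ternary-logic immediate is the 8-entry truth table of the wanted function, with bit `(a<<2)|(b<<1)|c` holding the result for that input combination. The sketch below derives the immediate from a plain C predicate, which is one way to check the `a1 xnor (a3 & t)` comment. `ternlog_imm8` and the three predicates are hypothetical helpers; whether `mm256_xoror` and `mm256_xorand` reduce to a single ternary-logic instruction is an assumption, since their definitions are not part of this diff.

```c
#include <stdint.h>

typedef int (*bool3_fn)( int a, int b, int c );

/* Build the imm8 truth table for a three-input boolean function,
   using the operand order (a, b, c) of _mm256_ternarylogic_epi64. */
static uint8_t ternlog_imm8( bool3_fn f )
{
   uint8_t imm = 0;
   for ( int i = 0; i < 8; i++ )
   {
      int a = ( i >> 2 ) & 1, b = ( i >> 1 ) & 1, c = i & 1;
      if ( f( a, b, c ) & 1 )
         imm |= (uint8_t)( 1u << i );
   }
   return imm;
}

static int xnor_and( int a, int b, int c ) { return !( a ^ ( b & c ) ); }
static int xor_and ( int a, int b, int c ) { return    a ^ ( b & c );   }
static int xor_or  ( int a, int b, int c ) { return    a ^ ( b | c );   }
```

Under this indexing, `xnor_and` gives `0x87`, which matches the immediate used in the macro; `xor_and` and `xor_or` give `0x78` and `0x1e`.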
@@ -593,26 +581,14 @@ do { \
    a0 = t; \
 }

+#endif
+
 #define MIXWORD( a, b ) \
-{ \
-    __m256i t1, t2; \
-    b  = _mm256_xor_si256( a,b ); \
-    t1 = _mm256_slli_epi32( a,  2 ); \
-    t2 = _mm256_srli_epi32( a, 30 ); \
-    a  = _mm256_or_si256( t1, t2 ); \
-    a  = _mm256_xor_si256( a, b ); \
-    t1 = _mm256_slli_epi32( b, 14 ); \
-    t2 = _mm256_srli_epi32( b, 18 ); \
-    b  = _mm256_or_si256( t1, t2 ); \
-    b  = _mm256_xor_si256( a, b ); \
-    t1 = _mm256_slli_epi32( a, 10 ); \
-    t2 = _mm256_srli_epi32( a, 22 ); \
-    a  = _mm256_or_si256( t1,t2 ); \
-    a  = _mm256_xor_si256( a,b ); \
-    t1 = _mm256_slli_epi32( b,1 ); \
-    t2 = _mm256_srli_epi32( b,31 ); \
-    b  = _mm256_or_si256( t1, t2 ); \
-}
+    b = _mm256_xor_si256( a, b ); \
+    a = _mm256_xor_si256( b, mm256_rol_32( a,  2 ) ); \
+    b = _mm256_xor_si256( a, mm256_rol_32( b, 14 ) ); \
+    a = _mm256_xor_si256( b, mm256_rol_32( a, 10 ) ); \
+    b = mm256_rol_32( b, 1 );

 #define STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, c0, c1 ) \
    SUBCRUMB( x0, x1, x2, x3 ); \
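
Note on the `MIXWORD` rewrite above: each shift-left, shift-right, OR triple in the old macro is a 32-bit rotate, and the XOR that followed it is folded into the same statement in the new form. The scalar sketch below compares the two forms on a single 32-bit word pair. `rol32`, `mixword_old`, `mixword_new` and `mixword_forms_agree` are hypothetical names; `mm256_rol_32` is assumed to rotate each 32-bit element left by the given count.

```c
#include <stdint.h>

/* Rotate left; n must be in 1..31 for this sketch. */
static uint32_t rol32( uint32_t x, unsigned n )
{
   return ( x << n ) | ( x >> ( 32 - n ) );
}

/* Removed form: explicit shifts and ORs, then a separate XOR. */
static void mixword_old( uint32_t *a, uint32_t *b )
{
   uint32_t t1, t2;
   *b = *a ^ *b;
   t1 = *a <<  2;  t2 = *a >> 30;  *a = t1 | t2;  *a = *a ^ *b;
   t1 = *b << 14;  t2 = *b >> 18;  *b = t1 | t2;  *b = *a ^ *b;
   t1 = *a << 10;  t2 = *a >> 22;  *a = t1 | t2;  *a = *a ^ *b;
   t1 = *b <<  1;  t2 = *b >> 31;  *b = t1 | t2;
}

/* Added form: rotate and XOR combined per statement. */
static void mixword_new( uint32_t *a, uint32_t *b )
{
   *b = *a ^ *b;
   *a = *b ^ rol32( *a,  2 );
   *b = *a ^ rol32( *b, 14 );
   *a = *b ^ rol32( *a, 10 );
   *b = rol32( *b, 1 );
}

/* Returns 1 when both forms produce the same (a, b) pair. */
static int mixword_forms_agree( uint32_t a, uint32_t b )
{
   uint32_t a0 = a, b0 = b, a1 = a, b1 = b;
   mixword_old( &a0, &b0 );
   mixword_new( &a1, &b1 );
   return a0 == a1 && b0 == b1;
}
```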
@@ -624,49 +600,37 @@ do { \
    ADD_CONSTANT( x0, x4, c0, c1 );

 #define STEP_PART2( a0, a1, t0, t1, c0, c1 ) \
-    a1 = _mm256_shuffle_epi32( a1, 147); \
-    t0 = _mm256_load_si256( &a1 ); \
-    a1 = _mm256_unpacklo_epi32( a1, a0 ); \
+    t0 = _mm256_shuffle_epi32( a1, 147 ); \
+    a1 = _mm256_unpacklo_epi32( t0, a0 ); \
    t0 = _mm256_unpackhi_epi32( t0, a0 ); \
    t1 = _mm256_shuffle_epi32( t0, 78 ); \
    a0 = _mm256_shuffle_epi32( a1, 78 ); \
-    SUBCRUMB( t1, t0, a0, a1 );\
+    SUBCRUMB( t1, t0, a0, a1 ); \
    t0 = _mm256_unpacklo_epi32( t0, t1 ); \
    a1 = _mm256_unpacklo_epi32( a1, a0 ); \
-    a0 = _mm256_load_si256( &a1 ); \
-    a0 = _mm256_unpackhi_epi64( a0, t0 ); \
+    a0 = _mm256_unpackhi_epi64( a1, t0 ); \
    a1 = _mm256_unpacklo_epi64( a1, t0 ); \
    a1 = _mm256_shuffle_epi32( a1, 57 ); \
    MIXWORD( a0, a1 ); \
    ADD_CONSTANT( a0, a1, c0, c1 );

 #define NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
-    s1 = _mm256_load_si256(&r3);\
-    q1 = _mm256_load_si256(&p3);\
-    s3 = _mm256_load_si256(&r3);\
-    q3 = _mm256_load_si256(&p3);\
-    s1 = _mm256_unpackhi_epi32(s1,r2);\
-    q1 = _mm256_unpackhi_epi32(q1,p2);\
-    s3 = _mm256_unpacklo_epi32(s3,r2);\
-    q3 = _mm256_unpacklo_epi32(q3,p2);\
-    s0 = _mm256_load_si256(&s1);\
-    q0 = _mm256_load_si256(&q1);\
-    s2 = _mm256_load_si256(&s3);\
-    q2 = _mm256_load_si256(&q3);\
-    r3 = _mm256_load_si256(&r1);\
-    p3 = _mm256_load_si256(&p1);\
-    r1 = _mm256_unpacklo_epi32(r1,r0);\
-    p1 = _mm256_unpacklo_epi32(p1,p0);\
-    r3 = _mm256_unpackhi_epi32(r3,r0);\
-    p3 = _mm256_unpackhi_epi32(p3,p0);\
-    s0 = _mm256_unpackhi_epi64(s0,r3);\
-    q0 = _mm256_unpackhi_epi64(q0,p3);\
-    s1 = _mm256_unpacklo_epi64(s1,r3);\
-    q1 = _mm256_unpacklo_epi64(q1,p3);\
-    s2 = _mm256_unpackhi_epi64(s2,r1);\
-    q2 = _mm256_unpackhi_epi64(q2,p1);\
-    s3 = _mm256_unpacklo_epi64(s3,r1);\
-    q3 = _mm256_unpacklo_epi64(q3,p1);
+    s1 = _mm256_unpackhi_epi32( r3, r2 ); \
+    q1 = _mm256_unpackhi_epi32( p3, p2 ); \
+    s3 = _mm256_unpacklo_epi32( r3, r2 ); \
+    q3 = _mm256_unpacklo_epi32( p3, p2 ); \
+    r3 = _mm256_unpackhi_epi32( r1, r0 ); \
+    r1 = _mm256_unpacklo_epi32( r1, r0 ); \
+    p3 = _mm256_unpackhi_epi32( p1, p0 ); \
+    p1 = _mm256_unpacklo_epi32( p1, p0 ); \
+    s0 = _mm256_unpackhi_epi64( s1, r3 ); \
+    q0 = _mm256_unpackhi_epi64( q1 ,p3 ); \
+    s1 = _mm256_unpacklo_epi64( s1, r3 ); \
+    q1 = _mm256_unpacklo_epi64( q1, p3 ); \
+    s2 = _mm256_unpackhi_epi64( s3, r1 ); \
+    q2 = _mm256_unpackhi_epi64( q3, p1 ); \
+    s3 = _mm256_unpacklo_epi64( s3, r1 ); \
+    q3 = _mm256_unpacklo_epi64( q3, p1 );

 #define MIXTON1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
    NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3);
@@ -676,30 +640,18 @@ do { \
 /* Round function         */
 /* state: hash context    */

-void rnd512_2way( luffa_2way_context *state, __m256i *msg )
+void rnd512_2way( luffa_2way_context *state, const __m256i *msg )
 {
    __m256i t0, t1;
    __m256i *chainv = state->chainv;
-    __m256i msg0, msg1;
    __m256i x0, x1, x2, x3, x4, x5, x6, x7;
-    const __m256i MASK = m256_const1_i128( 0xffffffff );

-    t0 = chainv[0];
-    t1 = chainv[1];
+    t0 = mm256_xor3( chainv[0], chainv[2], chainv[4] );
+    t1 = mm256_xor3( chainv[1], chainv[3], chainv[5] );
+    t0 = mm256_xor3( t0, chainv[6], chainv[8] );
+    t1 = mm256_xor3( t1, chainv[7], chainv[9] );

-    t0 = _mm256_xor_si256( t0, chainv[2] );
-    t1 = _mm256_xor_si256( t1, chainv[3] );
-    t0 = _mm256_xor_si256( t0, chainv[4] );
-    t1 = _mm256_xor_si256( t1, chainv[5] );
-    t0 = _mm256_xor_si256( t0, chainv[6] );
-    t1 = _mm256_xor_si256( t1, chainv[7] );
-    t0 = _mm256_xor_si256( t0, chainv[8] );
-    t1 = _mm256_xor_si256( t1, chainv[9] );
-
-    MULT2( t0, t1, MASK );
-
-    msg0 = _mm256_shuffle_epi32( msg[0], 27 );
-    msg1 = _mm256_shuffle_epi32( msg[1], 27 );
+    MULT2( t0, t1 );

    chainv[0] = _mm256_xor_si256( chainv[0], t0 );
    chainv[1] = _mm256_xor_si256( chainv[1], t1 );
@@ -715,66 +667,72 @@ void rnd512_2way( luffa_2way_context *state, __m256i *msg )
    t0 = chainv[0];
    t1 = chainv[1];

-    MULT2( chainv[0], chainv[1], MASK );
+    MULT2( chainv[0], chainv[1] );
    chainv[0] = _mm256_xor_si256( chainv[0], chainv[2] );
    chainv[1] = _mm256_xor_si256( chainv[1], chainv[3] );

-    MULT2( chainv[2], chainv[3], MASK );
+    MULT2( chainv[2], chainv[3] );
    chainv[2] = _mm256_xor_si256(chainv[2], chainv[4]);
    chainv[3] = _mm256_xor_si256(chainv[3], chainv[5]);

-    MULT2( chainv[4], chainv[5], MASK );
+    MULT2( chainv[4], chainv[5] );
    chainv[4] = _mm256_xor_si256(chainv[4], chainv[6]);
    chainv[5] = _mm256_xor_si256(chainv[5], chainv[7]);

-    MULT2( chainv[6], chainv[7], MASK );
+    MULT2( chainv[6], chainv[7] );
    chainv[6] = _mm256_xor_si256(chainv[6], chainv[8]);
    chainv[7] = _mm256_xor_si256(chainv[7], chainv[9]);

-    MULT2( chainv[8], chainv[9], MASK );
-    chainv[8] = _mm256_xor_si256( chainv[8], t0 );
-    chainv[9] = _mm256_xor_si256( chainv[9], t1 );
+    MULT2( chainv[8], chainv[9] );
+    t0 = chainv[8] = _mm256_xor_si256( chainv[8], t0 );
+    t1 = chainv[9] = _mm256_xor_si256( chainv[9], t1 );

-    t0 = chainv[8];
-    t1 = chainv[9];
-
-    MULT2( chainv[8], chainv[9], MASK );
+    MULT2( chainv[8], chainv[9] );
    chainv[8] = _mm256_xor_si256( chainv[8], chainv[6] );
    chainv[9] = _mm256_xor_si256( chainv[9], chainv[7] );

-    MULT2( chainv[6], chainv[7], MASK );
+    MULT2( chainv[6], chainv[7] );
    chainv[6] = _mm256_xor_si256( chainv[6], chainv[4] );
    chainv[7] = _mm256_xor_si256( chainv[7], chainv[5] );

-    MULT2( chainv[4], chainv[5], MASK );
+    MULT2( chainv[4], chainv[5] );
    chainv[4] = _mm256_xor_si256( chainv[4], chainv[2] );
    chainv[5] = _mm256_xor_si256( chainv[5], chainv[3] );

-    MULT2( chainv[2], chainv[3], MASK );
+    MULT2( chainv[2], chainv[3] );
    chainv[2] = _mm256_xor_si256( chainv[2], chainv[0] );
    chainv[3] = _mm256_xor_si256( chainv[3], chainv[1] );

-    MULT2( chainv[0], chainv[1], MASK );
-    chainv[0] = _mm256_xor_si256( _mm256_xor_si256( chainv[0], t0 ), msg0 );
-    chainv[1] = _mm256_xor_si256( _mm256_xor_si256( chainv[1], t1 ), msg1 );
+    MULT2( chainv[0], chainv[1] );
+    chainv[0] = _mm256_xor_si256( chainv[0], t0 );
+    chainv[1] = _mm256_xor_si256( chainv[1], t1 );

-    MULT2( msg0, msg1, MASK );
-    chainv[2] = _mm256_xor_si256( chainv[2], msg0 );
-    chainv[3] = _mm256_xor_si256( chainv[3], msg1 );
+    if ( msg )
+    {
+       __m256i msg0, msg1;
+    
+       msg0 = _mm256_shuffle_epi32( msg[0], 27 );
+       msg1 = _mm256_shuffle_epi32( msg[1], 27 );

-    MULT2( msg0, msg1, MASK );
-    chainv[4] = _mm256_xor_si256( chainv[4], msg0 );
-    chainv[5] = _mm256_xor_si256( chainv[5], msg1 );
+       chainv[0] = _mm256_xor_si256( chainv[0], msg0 );
+       chainv[1] = _mm256_xor_si256( chainv[1], msg1 );
+    
+       MULT2( msg0, msg1 );
+       chainv[2] = _mm256_xor_si256( chainv[2], msg0 );
+       chainv[3] = _mm256_xor_si256( chainv[3], msg1 );

-    MULT2( msg0, msg1, MASK );
-    chainv[6] = _mm256_xor_si256( chainv[6], msg0 );
-    chainv[7] = _mm256_xor_si256( chainv[7], msg1 );
+       MULT2( msg0, msg1 );
+       chainv[4] = _mm256_xor_si256( chainv[4], msg0 );
+       chainv[5] = _mm256_xor_si256( chainv[5], msg1 );

-    MULT2( msg0, msg1, MASK );
-    chainv[8] = _mm256_xor_si256( chainv[8], msg0 );
-    chainv[9] = _mm256_xor_si256( chainv[9], msg1 );
+       MULT2( msg0, msg1 );
+       chainv[6] = _mm256_xor_si256( chainv[6], msg0 );
+       chainv[7] = _mm256_xor_si256( chainv[7], msg1 );

-    MULT2( msg0, msg1, MASK );
+       MULT2( msg0, msg1 );
+       chainv[8] = _mm256_xor_si256( chainv[8], msg0 );
+       chainv[9] = _mm256_xor_si256( chainv[9], msg1 );
+    }

    chainv[3] = mm256_rol_32( chainv[3], 1 );
    chainv[5] = mm256_rol_32( chainv[5], 2 );
@@ -816,57 +774,40 @@ void finalization512_2way( luffa_2way_context *state, uint32 *b )
 {
    uint32 hash[8*2] __attribute((aligned(64)));
    __m256i* chainv = state->chainv;
-    __m256i t[2];
-    __m256i zero[2];
-    zero[0] = zero[1] = m256_zero;
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
+    __m256i t0, t1;
+    const __m256i shuff_bswap32 = mm256_set2_64( 0x0c0d0e0f08090a0b,
                                                 0x0405060700010203 );
    /*---- blank round with m=0 ----*/
-    rnd512_2way( state, zero );
+    rnd512_2way( state, NULL );

-    t[0] = chainv[0];
-    t[1] = chainv[1];
+    t0 = mm256_xor3( chainv[0], chainv[2], chainv[4] );
+    t1 = mm256_xor3( chainv[1], chainv[3], chainv[5] );
+    t0 = mm256_xor3( t0, chainv[6], chainv[8] );
+    t1 = mm256_xor3( t1, chainv[7], chainv[9] );

-    t[0] = _mm256_xor_si256( t[0], chainv[2] );
-    t[1] = _mm256_xor_si256( t[1], chainv[3] );
-    t[0] = _mm256_xor_si256( t[0], chainv[4] );
-    t[1] = _mm256_xor_si256( t[1], chainv[5] );
-    t[0] = _mm256_xor_si256( t[0], chainv[6] );
-    t[1] = _mm256_xor_si256( t[1], chainv[7] );
-    t[0] = _mm256_xor_si256( t[0], chainv[8] );
-    t[1] = _mm256_xor_si256( t[1], chainv[9] );
+    t0 = _mm256_shuffle_epi32( t0, 27 );
+    t1 = _mm256_shuffle_epi32( t1, 27 );

-    t[0] = _mm256_shuffle_epi32( t[0], 27 );
-    t[1] = _mm256_shuffle_epi32( t[1], 27 );
-
-    _mm256_store_si256( (__m256i*)&hash[0], t[0] );
-    _mm256_store_si256( (__m256i*)&hash[8], t[1] );
+    _mm256_store_si256( (__m256i*)&hash[0], t0 );
+    _mm256_store_si256( (__m256i*)&hash[8], t1 );

    casti_m256i( b, 0 ) = _mm256_shuffle_epi8(
                                  casti_m256i( hash, 0 ), shuff_bswap32 );
    casti_m256i( b, 1 ) = _mm256_shuffle_epi8( 
                                  casti_m256i( hash, 1 ), shuff_bswap32 );

-    rnd512_2way( state, zero );
+    rnd512_2way( state, NULL );

-    t[0] = chainv[0];
-    t[1] = chainv[1];
-    t[0] = _mm256_xor_si256( t[0], chainv[2] );
-    t[1] = _mm256_xor_si256( t[1], chainv[3] );
-    t[0] = _mm256_xor_si256( t[0], chainv[4] );
-    t[1] = _mm256_xor_si256( t[1], chainv[5] );
-    t[0] = _mm256_xor_si256( t[0], chainv[6] );
-    t[1] = _mm256_xor_si256( t[1], chainv[7] );
-    t[0] = _mm256_xor_si256( t[0], chainv[8] );
-    t[1] = _mm256_xor_si256( t[1], chainv[9] );
+    t0 = mm256_xor3( chainv[0], chainv[2], chainv[4] );
+    t1 = mm256_xor3( chainv[1], chainv[3], chainv[5] );
+    t0 = mm256_xor3( t0, chainv[6], chainv[8] );
+    t1 = mm256_xor3( t1, chainv[7], chainv[9] );
+    
+    t0 = _mm256_shuffle_epi32( t0, 27 );
+    t1 = _mm256_shuffle_epi32( t1, 27 );

-    t[0] = _mm256_shuffle_epi32( t[0], 27 );
-    t[1] = _mm256_shuffle_epi32( t[1], 27 );
-
-    _mm256_store_si256( (__m256i*)&hash[0], t[0] );
-    _mm256_store_si256( (__m256i*)&hash[8], t[1] );
+    _mm256_store_si256( (__m256i*)&hash[0], t0 );
+    _mm256_store_si256( (__m256i*)&hash[8], t1 );

    casti_m256i( b, 2 ) = _mm256_shuffle_epi8( 
                                  casti_m256i( hash, 0 ), shuff_bswap32 );
@@ -879,16 +820,16 @@ int luffa_2way_init( luffa_2way_context *state, int hashbitlen )
    state->hashbitlen = hashbitlen;
    __m128i *iv = (__m128i*)IV;
    
-    state->chainv[0] = m256_const1_128( iv[0] );
-    state->chainv[1] = m256_const1_128( iv[1] );
-    state->chainv[2] = m256_const1_128( iv[2] );
-    state->chainv[3] = m256_const1_128( iv[3] );
-    state->chainv[4] = m256_const1_128( iv[4] );
-    state->chainv[5] = m256_const1_128( iv[5] );
-    state->chainv[6] = m256_const1_128( iv[6] );
-    state->chainv[7] = m256_const1_128( iv[7] );
-    state->chainv[8] = m256_const1_128( iv[8] );
-    state->chainv[9] = m256_const1_128( iv[9] );
+    state->chainv[0] = mm256_bcast_m128( iv[0] );
+    state->chainv[1] = mm256_bcast_m128( iv[1] );
+    state->chainv[2] = mm256_bcast_m128( iv[2] );
+    state->chainv[3] = mm256_bcast_m128( iv[3] );
+    state->chainv[4] = mm256_bcast_m128( iv[4] );
+    state->chainv[5] = mm256_bcast_m128( iv[5] );
+    state->chainv[6] = mm256_bcast_m128( iv[6] );
+    state->chainv[7] = mm256_bcast_m128( iv[7] );
+    state->chainv[8] = mm256_bcast_m128( iv[8] );
+    state->chainv[9] = mm256_bcast_m128( iv[9] );

    ((__m256i*)state->buffer)[0] = m256_zero;
    ((__m256i*)state->buffer)[1] = m256_zero;
@@ -906,9 +847,7 @@ int luffa_2way_update( luffa_2way_context *state, const void *data,
    __m256i msg[2];
    int i;
    int blocks = (int)len >> 5;
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
+    const __m256i shuff_bswap32 = mm256_set2_64( 0x0c0d0e0f08090a0b,
                                                 0x0405060700010203 );
    state-> rembytes = (int)len & 0x1F;

@@ -926,7 +865,7 @@ int luffa_2way_update( luffa_2way_context *state, const void *data,
    {
      // remaining data bytes
      buffer[0] = _mm256_shuffle_epi8( vdata[0], shuff_bswap32 );
-      buffer[1] = m256_const1_i128( 0x0000000080000000 );
+      buffer[1] = mm256_bcast128lo_64( 0x0000000080000000 );
    }
    return 0;
 }
@@ -942,7 +881,7 @@ int luffa_2way_close( luffa_2way_context *state, void *hashval )
      rnd512_2way( state, buffer );
    else
    {     // empty pad block, constant data
-      msg[0] = m256_const1_i128( 0x0000000080000000 );
+      msg[0] = mm256_bcast128lo_64( 0x0000000080000000 );
      msg[1] = m256_zero;
      rnd512_2way( state, msg );
    }
@@ -959,16 +898,16 @@ int luffa512_2way_full( luffa_2way_context *state, void *output,
    state->hashbitlen = 512;
    __m128i *iv = (__m128i*)IV;

-    state->chainv[0] = m256_const1_128( iv[0] );
-    state->chainv[1] = m256_const1_128( iv[1] );
-    state->chainv[2] = m256_const1_128( iv[2] );
-    state->chainv[3] = m256_const1_128( iv[3] );
-    state->chainv[4] = m256_const1_128( iv[4] );
-    state->chainv[5] = m256_const1_128( iv[5] );
-    state->chainv[6] = m256_const1_128( iv[6] );
-    state->chainv[7] = m256_const1_128( iv[7] );
-    state->chainv[8] = m256_const1_128( iv[8] );
-    state->chainv[9] = m256_const1_128( iv[9] );
+    state->chainv[0] = mm256_bcast_m128( iv[0] );
+    state->chainv[1] = mm256_bcast_m128( iv[1] );
+    state->chainv[2] = mm256_bcast_m128( iv[2] );
+    state->chainv[3] = mm256_bcast_m128( iv[3] );
+    state->chainv[4] = mm256_bcast_m128( iv[4] );
+    state->chainv[5] = mm256_bcast_m128( iv[5] );
+    state->chainv[6] = mm256_bcast_m128( iv[6] );
+    state->chainv[7] = mm256_bcast_m128( iv[7] );
+    state->chainv[8] = mm256_bcast_m128( iv[8] );
+    state->chainv[9] = mm256_bcast_m128( iv[9] );

    ((__m256i*)state->buffer)[0] = m256_zero;
    ((__m256i*)state->buffer)[1] = m256_zero;
@@ -977,9 +916,7 @@ int luffa512_2way_full( luffa_2way_context *state, void *output,
    __m256i msg[2];
    int i;
    const int blocks = (int)( inlen >> 5 );
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
+    const __m256i shuff_bswap32 = mm256_set2_64( 0x0c0d0e0f08090a0b,
                                                 0x0405060700010203 );

    state->rembytes = inlen & 0x1F;
@@ -997,13 +934,13 @@ int luffa512_2way_full( luffa_2way_context *state, void *output,
    {
       // padding of partial block
       msg[0] = _mm256_shuffle_epi8( vdata[ 0 ], shuff_bswap32 );
-       msg[1] = m256_const1_i128( 0x0000000080000000 );
+       msg[1] = mm256_bcast128lo_64( 0x0000000080000000 );
       rnd512_2way( state, msg );
    }
    else
    {
       // empty pad block
-       msg[0] = m256_const1_i128( 0x0000000080000000 );
+       msg[0] = mm256_bcast128lo_64( 0x0000000080000000 );
       msg[1] = m256_zero;
       rnd512_2way( state, msg );
    }
@@ -1024,9 +961,7 @@ int luffa_2way_update_close( luffa_2way_context *state,
    __m256i msg[2];
    int i;
    const int blocks = (int)( inlen >> 5 );
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
+    const __m256i shuff_bswap32 = mm256_set2_64( 0x0c0d0e0f08090a0b,
                                                 0x0405060700010203 );

    state->rembytes = inlen & 0x1F;
@@ -1044,13 +979,13 @@ int luffa_2way_update_close( luffa_2way_context *state,
    {
       // padding of partial block
       msg[0] = _mm256_shuffle_epi8( vdata[ 0 ], shuff_bswap32 );
-       msg[1] = m256_const1_i128( 0x0000000080000000 );
+       msg[1] = mm256_bcast128lo_64( 0x0000000080000000 );
       rnd512_2way( state, msg );
    }
    else
    {
       // empty pad block
-       msg[0] = m256_const1_i128( 0x0000000080000000 );
+       msg[0] = mm256_bcast128lo_64( 0x0000000080000000 );
       msg[1] = m256_zero;
       rnd512_2way( state, msg );
    }
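Note on the `shuff_bswap32` change in the file above: the four-word control constant is replaced by a two-word one. This is safe as far as the shuffle itself goes, because the AVX2 byte shuffle works on each 128-bit lane independently and uses only the low 4 bits of each control byte (plus bit 7 as a zeroing flag). The sketch below is a scalar model of that behaviour, not code from the repository. It assumes that `mm256_set2_64` broadcasts its 128-bit pair to both lanes, which the diff does not show, and the names `shuffle_epi8_256_model`, `store_le64x4` and `bswap32_controls_agree` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the AVX2 byte shuffle (_mm256_shuffle_epi8): each 16-byte
   lane is shuffled on its own, only the low 4 bits of a control byte select
   a source byte, and bit 7 of the control byte zeroes the result byte. */
void shuffle_epi8_256_model( uint8_t out[32], const uint8_t in[32],
                             const uint8_t ctl[32] )
{
   for ( int lane = 0; lane < 2; lane++ )
      for ( int i = 0; i < 16; i++ )
      {
         const uint8_t c = ctl[ lane*16 + i ];
         out[ lane*16 + i ] = ( c & 0x80 ) ? 0 : in[ lane*16 + ( c & 0x0f ) ];
      }
}

/* Store four 64-bit words, lowest word first, as little-endian bytes. */
void store_le64x4( uint8_t dst[32], const uint64_t w[4] )
{
   for ( int i = 0; i < 4; i++ )
      for ( int j = 0; j < 8; j++ )
         dst[ i*8 + j ] = (uint8_t)( w[i] >> ( 8*j ) );
}

/* Returns 1 when the removed four-word control and the two-word control
   (assumed to be broadcast to both lanes) shuffle a test pattern
   identically, 0 otherwise. */
int bswap32_controls_agree( void )
{
   /* Removed constant, lowest word first. */
   const uint64_t old_words[4] = { 0x0405060700010203ULL,
                                   0x0c0d0e0f08090a0bULL,
                                   0x1415161710111213ULL,
                                   0x1c1d1e1f18191a1bULL };
   /* New constant: the low 128 bits repeated in the high lane (assumption). */
   const uint64_t new_words[4] = { 0x0405060700010203ULL,
                                   0x0c0d0e0f08090a0bULL,
                                   0x0405060700010203ULL,
                                   0x0c0d0e0f08090a0bULL };
   uint8_t in[32], old_ctl[32], new_ctl[32], old_out[32], new_out[32];

   for ( int i = 0; i < 32; i++ )
      in[i] = (uint8_t)( 0xa0 + i );

   store_le64x4( old_ctl, old_words );
   store_le64x4( new_ctl, new_words );
   shuffle_epi8_256_model( old_out, in, old_ctl );
   shuffle_epi8_256_model( new_out, in, new_ctl );

   return memcmp( old_out, new_out, 32 ) == 0;
}
```

Calling `bswap32_controls_agree()` compares the two controls on one fixed input pattern. The high-lane bytes of the old constant (0x10 to 0x1f) differ from the new ones only in bit 4, which the shuffle ignores.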
--- a/algo/luffa/luffa_for_sse2.c
+++ b/algo/luffa/luffa_for_sse2.c
@@ -22,20 +22,29 @@
 #include "simd-utils.h"
 #include "luffa_for_sse2.h"

+#define cns(i)  ( ( (__m128i*)CNS_INIT)[i] )
+
+#define ADD_CONSTANT( a, b, c0 ,c1 ) \
+    a = _mm_xor_si128( a, c0 ); \
+    b = _mm_xor_si128( b, c1 ); \
+
 #if defined(__AVX512VL__)
+//TODO enable for AVX10_512 AVX10_256

 #define MULT2( a0, a1 ) \
 { \
-  __m128i b = _mm_xor_si128( a0, _mm_maskz_shuffle_epi32( 0xb, a1, 0x10 ) ); \
-  a0 = _mm_alignr_epi32( a1, b, 1 ); \
-  a1 = _mm_alignr_epi32( b, a1, 1 ); \
+  __m128i b = _mm_xor_si128( a0, \
+                      _mm_maskz_shuffle_epi32( 0xb, a1, 0x10 ) ); \
+  a0 = _mm_alignr_epi8( a1, b, 4 ); \
+  a1 = _mm_alignr_epi8( b, a1, 4 ); \
 }

 #elif defined(__SSE4_1__)

 #define MULT2( a0, a1 ) do \
 { \
-  __m128i b = _mm_xor_si128( a0, _mm_shuffle_epi32( mm128_mask_32( a1, 0xe ), 0x10 ) ); \
+  __m128i b = _mm_xor_si128( a0, \
+                      _mm_shuffle_epi32( mm128_mask_32( a1, 0xe ), 0x10 ) ); \
  a0 = _mm_alignr_epi8( a1, b, 4 ); \
  a1 = _mm_alignr_epi8( b, a1, 4 ); \
 } while(0)
@@ -44,79 +53,88 @@

 #define MULT2( a0, a1 ) do \
 { \
-  __m128i b = _mm_xor_si128( a0, _mm_shuffle_epi32( _mm_and_si128( a1, MASK ), 0x10 ) ); \
-  a0 = _mm_or_si128( _mm_srli_si128( b, 4 ), _mm_slli_si128( a1, 12 ) ); \
-  a1 = _mm_or_si128( _mm_srli_si128( a1, 4 ), _mm_slli_si128( b, 12 ) ); \
+  __m128i b = _mm_xor_si128( a0, \
+                      _mm_shuffle_epi32( _mm_and_si128( a1, MASK ), 0x10 ) ); \
+  a0 = _mm_or_si128( _mm_srli_si128(  b, 4 ), _mm_slli_si128( a1, 12 ) ); \
+  a1 = _mm_or_si128( _mm_srli_si128( a1, 4 ), _mm_slli_si128(  b, 12 ) ); \
 } while(0)

 #endif

-#define STEP_PART(x,c,t)\
-    SUBCRUMB(*x,*(x+1),*(x+2),*(x+3),*t);\
-    SUBCRUMB(*(x+5),*(x+6),*(x+7),*(x+4),*t);\
-    MIXWORD(*x,*(x+4),*t,*(t+1));\
-    MIXWORD(*(x+1),*(x+5),*t,*(t+1));\
-    MIXWORD(*(x+2),*(x+6),*t,*(t+1));\
-    MIXWORD(*(x+3),*(x+7),*t,*(t+1));\
-    ADD_CONSTANT(*x, *(x+4), *c, *(c+1));
+#if defined(__AVX512VL__)
+//TODO enable for AVX10_512 AVX10_256

-#define STEP_PART2(a0,a1,t0,t1,c0,c1,tmp0,tmp1)\
-    a1 = _mm_shuffle_epi32(a1,147);\
-    t0 = _mm_load_si128(&a1);\
-    a1 = _mm_unpacklo_epi32(a1,a0);\
-    t0 = _mm_unpackhi_epi32(t0,a0);\
-    t1 = _mm_shuffle_epi32(t0,78);\
-    a0 = _mm_shuffle_epi32(a1,78);\
-    SUBCRUMB(t1,t0,a0,a1,tmp0);\
-    t0 = _mm_unpacklo_epi32(t0,t1);\
-    a1 = _mm_unpacklo_epi32(a1,a0);\
-    a0 = _mm_load_si128(&a1);\
-    a0 = _mm_unpackhi_epi64(a0,t0);\
-    a1 = _mm_unpacklo_epi64(a1,t0);\
-    a1 = _mm_shuffle_epi32(a1,57);\
-    MIXWORD(a0,a1,tmp0,tmp1);\
-    ADD_CONSTANT(a0,a1,c0,c1);
+#define SUBCRUMB( a0, a1, a2, a3 ) \
+{ \
+    __m128i t = a0; \
+    a0 = mm128_xoror( a3, a0, a1 ); \
+    a2 = _mm_xor_si128( a2, a3 ); \
+    a1 = _mm_ternarylogic_epi64( a1, a3, t, 0x87 ); /* a1 xnor (a3 & t) */ \
+    a3 = mm128_xorand( a2, a3, t ); \
+    a2 = mm128_xorand( a1, a2, a0 ); \
+    a1 = _mm_or_si128( a1, a3 ); \
+    a3 = _mm_xor_si128( a3, a2 ); \
+    t  = _mm_xor_si128( t, a1 ); \
+    a2 = _mm_and_si128( a2, a1 ); \
+    a1 = mm128_xnor( a1, a0 ); \
+    a0 = t; \
+}

-#define SUBCRUMB(a0,a1,a2,a3,t)\
-    t  = _mm_load_si128(&a0);\
-    a0 = _mm_or_si128(a0,a1);\
-    a2 = _mm_xor_si128(a2,a3);\
-    a1 = mm128_not( a1 );\
-    a0 = _mm_xor_si128(a0,a3);\
-    a3 = _mm_and_si128(a3,t);\
-    a1 = _mm_xor_si128(a1,a3);\
-    a3 = _mm_xor_si128(a3,a2);\
-    a2 = _mm_and_si128(a2,a0);\
-    a0 = mm128_not( a0 );\
-    a2 = _mm_xor_si128(a2,a1);\
-    a1 = _mm_or_si128(a1,a3);\
-    t  = _mm_xor_si128(t,a1);\
-    a3 = _mm_xor_si128(a3,a2);\
-    a2 = _mm_and_si128(a2,a1);\
-    a1 = _mm_xor_si128(a1,a0);\
-    a0 = _mm_load_si128(&t);\
+#else

-#define MIXWORD(a,b,t1,t2)\
-    b = _mm_xor_si128(a,b);\
-    t1 = _mm_slli_epi32(a,2);\
-    t2 = _mm_srli_epi32(a,30);\
-    a = _mm_or_si128(t1,t2);\
-    a = _mm_xor_si128(a,b);\
-    t1 = _mm_slli_epi32(b,14);\
-    t2 = _mm_srli_epi32(b,18);\
-    b = _mm_or_si128(t1,t2);\
-    b = _mm_xor_si128(a,b);\
-    t1 = _mm_slli_epi32(a,10);\
-    t2 = _mm_srli_epi32(a,22);\
-    a = _mm_or_si128(t1,t2);\
-    a = _mm_xor_si128(a,b);\
-    t1 = _mm_slli_epi32(b,1);\
-    t2 = _mm_srli_epi32(b,31);\
-    b = _mm_or_si128(t1,t2);
+#define SUBCRUMB( a0, a1, a2, a3 ) \
+{ \
+    __m128i t = a0; \
+    a0 = _mm_or_si128( a0, a1 ); \
+    a2 = _mm_xor_si128( a2, a3 ); \
+    a1 = mm128_not( a1 ); \
+    a0 = _mm_xor_si128( a0, a3 ); \
+    a3 = _mm_and_si128( a3, t ); \
+    a1 = _mm_xor_si128( a1, a3 ); \
+    a3 = _mm_xor_si128( a3, a2 ); \
+    a2 = _mm_and_si128( a2, a0 ); \
+    a0 = mm128_not( a0 ); \
+    a2 = _mm_xor_si128( a2, a1 ); \
+    a1 = _mm_or_si128(  a1, a3 ); \
+    t  = _mm_xor_si128( t , a1 ); \
+    a3 = _mm_xor_si128( a3, a2 ); \
+    a2 = _mm_and_si128( a2, a1 ); \
+    a1 = _mm_xor_si128( a1, a0 ); \
+    a0 = t; \
+}

-#define ADD_CONSTANT(a,b,c0,c1)\
-    a = _mm_xor_si128(a,c0);\
-    b = _mm_xor_si128(b,c1);\
+#endif
+
+#define MIXWORD( a, b ) \
+    b = _mm_xor_si128( a, b ); \
+    a = _mm_xor_si128( b, mm128_rol_32( a, 2 ) ); \
+    b = _mm_xor_si128( a, mm128_rol_32( b, 14 ) ); \
+    a = _mm_xor_si128( b, mm128_rol_32( a, 10 ) ); \
+    b = mm128_rol_32( b, 1 );
+
+#define STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, c0, c1 ) \
+    SUBCRUMB( x0, x1, x2, x3 ); \
+    SUBCRUMB( x5, x6, x7, x4 ); \
+    MIXWORD( x0, x4 ); \
+    MIXWORD( x1, x5 ); \
+    MIXWORD( x2, x6 ); \
+    MIXWORD( x3, x7 ); \
+    ADD_CONSTANT( x0, x4, c0, c1 );
+
+#define STEP_PART2( a0, a1, t0, t1, c0, c1 ) \
+    t0 = _mm_shuffle_epi32( a1, 147 ); \
+    a1 = _mm_unpacklo_epi32( t0, a0 ); \
+    t0 = _mm_unpackhi_epi32( t0, a0 ); \
+    t1 = _mm_shuffle_epi32( t0, 78 ); \
+    a0 = _mm_shuffle_epi32( a1, 78 ); \
+    SUBCRUMB( t1, t0, a0, a1 ); \
+    t0 = _mm_unpacklo_epi32( t0, t1 ); \
+    a1 = _mm_unpacklo_epi32( a1, a0 ); \
+    a0 = _mm_unpackhi_epi64( a1, t0 ); \
+    a1 = _mm_unpacklo_epi64( a1, t0 ); \
+    a1 = _mm_shuffle_epi32( a1, 57 ); \
+    MIXWORD( a0, a1 ); \
+    ADD_CONSTANT( a0, a1, c0, c1 );

 #define NMLTOM768(r0,r1,r2,s0,s1,s2,s3,p0,p1,p2,q0,q1,q2,q3)\
    s2 = _mm_load_si128(&r1);\
@@ -177,32 +195,22 @@
    q1 = _mm_load_si128(&p1);\

 #define NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
-    s1 = _mm_load_si128(&r3);\
-    q1 = _mm_load_si128(&p3);\
-    s3 = _mm_load_si128(&r3);\
-    q3 = _mm_load_si128(&p3);\
-    s1 = _mm_unpackhi_epi32(s1,r2);\
-    q1 = _mm_unpackhi_epi32(q1,p2);\
-    s3 = _mm_unpacklo_epi32(s3,r2);\
-    q3 = _mm_unpacklo_epi32(q3,p2);\
-    s0 = _mm_load_si128(&s1);\
-    q0 = _mm_load_si128(&q1);\
-    s2 = _mm_load_si128(&s3);\
-    q2 = _mm_load_si128(&q3);\
-    r3 = _mm_load_si128(&r1);\
-    p3 = _mm_load_si128(&p1);\
-    r1 = _mm_unpacklo_epi32(r1,r0);\
-    p1 = _mm_unpacklo_epi32(p1,p0);\
-    r3 = _mm_unpackhi_epi32(r3,r0);\
-    p3 = _mm_unpackhi_epi32(p3,p0);\
-    s0 = _mm_unpackhi_epi64(s0,r3);\
-    q0 = _mm_unpackhi_epi64(q0,p3);\
-    s1 = _mm_unpacklo_epi64(s1,r3);\
-    q1 = _mm_unpacklo_epi64(q1,p3);\
-    s2 = _mm_unpackhi_epi64(s2,r1);\
-    q2 = _mm_unpackhi_epi64(q2,p1);\
-    s3 = _mm_unpacklo_epi64(s3,r1);\
-    q3 = _mm_unpacklo_epi64(q3,p1);
+    s1 = _mm_unpackhi_epi32( r3, r2 ); \
+    q1 = _mm_unpackhi_epi32( p3, p2 ); \
+    s3 = _mm_unpacklo_epi32( r3, r2 ); \
+    q3 = _mm_unpacklo_epi32( p3, p2 ); \
+    r3 = _mm_unpackhi_epi32( r1, r0 ); \
+    r1 = _mm_unpacklo_epi32( r1, r0 ); \
+    p3 = _mm_unpackhi_epi32( p1, p0 ); \
+    p1 = _mm_unpacklo_epi32( p1, p0 ); \
+    s0 = _mm_unpackhi_epi64( s1, r3 ); \
+    q0 = _mm_unpackhi_epi64( q1 ,p3 ); \
+    s1 = _mm_unpacklo_epi64( s1, r3 ); \
+    q1 = _mm_unpacklo_epi64( q1, p3 ); \
+    s2 = _mm_unpackhi_epi64( s3, r1 ); \
+    q2 = _mm_unpackhi_epi64( q3, p1 ); \
+    s3 = _mm_unpacklo_epi64( s3, r1 ); \
+    q3 = _mm_unpacklo_epi64( q3, p1 );

 #define MIXTON1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
    NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3);
@@ -306,8 +314,7 @@ HashReturn update_luffa( hashState_luffa *state, const BitSequence *data,
      // remaining data bytes
      casti_m128i( state->buffer, 0 ) = mm128_bswap_32( cast_m128i( data ) );
      // padding of partial block
-      casti_m128i( state->buffer, 1 ) =
-            _mm_set_epi8( 0,0,0,0, 0,0,0,0, 0,0,0,0, 0x80,0,0,0 );
+      casti_m128i( state->buffer, 1 ) =  _mm_set_epi32( 0, 0, 0, 0x80000000 );
    }

    return SUCCESS;
@@ -325,8 +332,7 @@ HashReturn final_luffa(hashState_luffa *state, BitSequence *hashval)
    else
    {
      // empty pad block, constant data
-     rnd512( state, _mm_setzero_si128(),
-                       _mm_set_epi8( 0,0,0,0, 0,0,0,0, 0,0,0,0, 0x80,0,0,0 ) );
+     rnd512( state, _mm_setzero_si128(), _mm_set_epi32( 0, 0, 0, 0x80000000 ) );
    }

    finalization512(state, (uint32*) hashval);
@@ -354,11 +360,11 @@ HashReturn update_and_final_luffa( hashState_luffa *state, BitSequence* output,
    // 16 byte partial block exists for 80 byte len
    if ( state->rembytes  )
       // padding of partial block
-       rnd512( state, m128_const_i128(  0x80000000 ),
+       rnd512( state, mm128_mov64_128(  0x80000000 ),
                      mm128_bswap_32( cast_m128i( data ) ) );
    else
       // empty pad block
-       rnd512( state, m128_zero, m128_const_i128( 0x80000000 ) );
+       rnd512( state, m128_zero, mm128_mov64_128( 0x80000000 ) );

    finalization512( state, (uint32*) output );
    if ( state->hashbitlen > 512 )
@@ -403,11 +409,11 @@ int luffa_full( hashState_luffa *state, BitSequence* output, int hashbitlen,
    // 16 byte partial block exists for 80 byte len
    if ( state->rembytes  )
       // padding of partial block
-       rnd512( state, m128_const_i128( 0x80000000 ),
+       rnd512( state, mm128_mov64_128( 0x80000000 ),
                      mm128_bswap_32( cast_m128i( data ) ) );
    else
       // empty pad block
-       rnd512( state, m128_zero, m128_const_i128( 0x80000000 ) );
+       rnd512( state, m128_zero, mm128_mov64_128( 0x80000000 ) );

    finalization512( state, (uint32*) output );
    if ( state->hashbitlen > 512 )
@@ -423,163 +429,119 @@ int luffa_full( hashState_luffa *state, BitSequence* output, int hashbitlen,

 static void rnd512( hashState_luffa *state, __m128i msg1, __m128i msg0 )
 {
-    __m128i t[2];
+    __m128i t0, t1;
    __m128i *chainv = state->chainv;
-    __m128i tmp[2];
-    __m128i x[8];
+    __m128i x0, x1, x2, x3, x4, x5, x6, x7; 

-    t[0] = chainv[0];
-    t[1] = chainv[1];
+    t0 = mm128_xor3( chainv[0], chainv[2], chainv[4] );
+    t1 = mm128_xor3( chainv[1], chainv[3], chainv[5] );
+    t0 = mm128_xor3( t0, chainv[6], chainv[8] );
+    t1 = mm128_xor3( t1, chainv[7], chainv[9] );

-    t[0] = _mm_xor_si128( t[0], chainv[2] );
-    t[1] = _mm_xor_si128( t[1], chainv[3] );
-    t[0] = _mm_xor_si128( t[0], chainv[4] );
-    t[1] = _mm_xor_si128( t[1], chainv[5] );
-    t[0] = _mm_xor_si128( t[0], chainv[6] );
-    t[1] = _mm_xor_si128( t[1], chainv[7] );
-    t[0] = _mm_xor_si128( t[0], chainv[8] );
-    t[1] = _mm_xor_si128( t[1], chainv[9] );
-
-    MULT2( t[0], t[1] );
+    MULT2( t0, t1 );

    msg0 = _mm_shuffle_epi32( msg0, 27 );
    msg1 = _mm_shuffle_epi32( msg1, 27 );

-    chainv[0] = _mm_xor_si128( chainv[0], t[0] );
-    chainv[1] = _mm_xor_si128( chainv[1], t[1] );
-    chainv[2] = _mm_xor_si128( chainv[2], t[0] );
-    chainv[3] = _mm_xor_si128( chainv[3], t[1] );
-    chainv[4] = _mm_xor_si128( chainv[4], t[0] );
-    chainv[5] = _mm_xor_si128( chainv[5], t[1] );
-    chainv[6] = _mm_xor_si128( chainv[6], t[0] );
-    chainv[7] = _mm_xor_si128( chainv[7], t[1] );
-    chainv[8] = _mm_xor_si128( chainv[8], t[0] );
-    chainv[9] = _mm_xor_si128( chainv[9], t[1] );
+    chainv[0] = _mm_xor_si128( chainv[0], t0 );
+    chainv[1] = _mm_xor_si128( chainv[1], t1 );
+    chainv[2] = _mm_xor_si128( chainv[2], t0 );
+    chainv[3] = _mm_xor_si128( chainv[3], t1 );
+    chainv[4] = _mm_xor_si128( chainv[4], t0 );
+    chainv[5] = _mm_xor_si128( chainv[5], t1 );
+    chainv[6] = _mm_xor_si128( chainv[6], t0 );
+    chainv[7] = _mm_xor_si128( chainv[7], t1 );
+    chainv[8] = _mm_xor_si128( chainv[8], t0 );
+    chainv[9] = _mm_xor_si128( chainv[9], t1 );

-    t[0] = chainv[0];
-    t[1] = chainv[1];
+    t0 = chainv[0];
+    t1 = chainv[1];

    MULT2( chainv[0], chainv[1]);
-
    chainv[0] = _mm_xor_si128( chainv[0], chainv[2] );
    chainv[1] = _mm_xor_si128( chainv[1], chainv[3] );

    MULT2( chainv[2], chainv[3]);
-
    chainv[2] = _mm_xor_si128(chainv[2], chainv[4]);
    chainv[3] = _mm_xor_si128(chainv[3], chainv[5]);

    MULT2( chainv[4], chainv[5]);
-
    chainv[4] = _mm_xor_si128(chainv[4], chainv[6]);
    chainv[5] = _mm_xor_si128(chainv[5], chainv[7]);

    MULT2( chainv[6], chainv[7]);
-
    chainv[6] = _mm_xor_si128(chainv[6], chainv[8]);
    chainv[7] = _mm_xor_si128(chainv[7], chainv[9]);

    MULT2( chainv[8], chainv[9]);
-
-    chainv[8] = _mm_xor_si128( chainv[8], t[0] );
-    chainv[9] = _mm_xor_si128( chainv[9], t[1] );
-
-    t[0] = chainv[8];
-    t[1] = chainv[9];
+    t0 = chainv[8] = _mm_xor_si128( chainv[8], t0 );
+    t1 = chainv[9] = _mm_xor_si128( chainv[9], t1 );

    MULT2( chainv[8], chainv[9]);
-
    chainv[8] = _mm_xor_si128( chainv[8], chainv[6] );
    chainv[9] = _mm_xor_si128( chainv[9], chainv[7] );

    MULT2( chainv[6], chainv[7]);
-
    chainv[6] = _mm_xor_si128( chainv[6], chainv[4] );
    chainv[7] = _mm_xor_si128( chainv[7], chainv[5] );

    MULT2( chainv[4], chainv[5]);
-
    chainv[4] = _mm_xor_si128( chainv[4], chainv[2] );
    chainv[5] = _mm_xor_si128( chainv[5], chainv[3] );

    MULT2( chainv[2], chainv[3] );
-
    chainv[2] = _mm_xor_si128( chainv[2], chainv[0] );
    chainv[3] = _mm_xor_si128( chainv[3], chainv[1] );

    MULT2( chainv[0], chainv[1] );
-
-    chainv[0] = _mm_xor_si128( _mm_xor_si128( chainv[0], t[0] ), msg0 );
-    chainv[1] = _mm_xor_si128( _mm_xor_si128( chainv[1], t[1] ), msg1 );
+    chainv[0] = _mm_xor_si128( _mm_xor_si128( chainv[0], t0 ), msg0 );
+    chainv[1] = _mm_xor_si128( _mm_xor_si128( chainv[1], t1 ), msg1 );

    MULT2( msg0, msg1);
-
    chainv[2] = _mm_xor_si128( chainv[2], msg0 );
    chainv[3] = _mm_xor_si128( chainv[3], msg1 );

    MULT2( msg0, msg1);
-
    chainv[4] = _mm_xor_si128( chainv[4], msg0 );
    chainv[5] = _mm_xor_si128( chainv[5], msg1 );

    MULT2( msg0, msg1);
-
    chainv[6] = _mm_xor_si128( chainv[6], msg0 );
    chainv[7] = _mm_xor_si128( chainv[7], msg1 );

    MULT2( msg0, msg1);
-
    chainv[8] = _mm_xor_si128( chainv[8], msg0 );
    chainv[9] = _mm_xor_si128( chainv[9], msg1 );

    MULT2( msg0, msg1);
+    chainv[3] = mm128_rol_32( chainv[3], 1 );    
+    chainv[5] = mm128_rol_32( chainv[5], 2 );
+    chainv[7] = mm128_rol_32( chainv[7], 3 );
+    chainv[9] = mm128_rol_32( chainv[9], 4 );
+    
+    NMLTOM1024( chainv[0], chainv[2], chainv[4], chainv[6], x0, x1, x2, x3,
+                chainv[1], chainv[3], chainv[5], chainv[7], x4, x5, x6, x7 );

-    chainv[3] = _mm_or_si128( _mm_slli_epi32(chainv[3], 1),
-                              _mm_srli_epi32(chainv[3], 31) );
-    chainv[5] = _mm_or_si128( _mm_slli_epi32(chainv[5], 2),
-                              _mm_srli_epi32(chainv[5], 30) );
-    chainv[7] = _mm_or_si128( _mm_slli_epi32(chainv[7], 3),
-                              _mm_srli_epi32(chainv[7], 29) );
-    chainv[9] = _mm_or_si128( _mm_slli_epi32(chainv[9], 4),
-                              _mm_srli_epi32(chainv[9], 28) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 0), cns( 1) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 2), cns( 3) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 4), cns( 5) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 6), cns( 7) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 8), cns( 9) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(10), cns(11) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(12), cns(13) );
+    STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(14), cns(15) );
+    
+    MIXTON1024( x0, x1, x2, x3, chainv[0], chainv[2], chainv[4], chainv[6],
+                x4, x5, x6, x7, chainv[1], chainv[3], chainv[5], chainv[7]);

-
-    NMLTOM1024( chainv[0], chainv[2], chainv[4], chainv[6],
-                x[0], x[1], x[2], x[3],
-                chainv[1],chainv[3],chainv[5],chainv[7],
-                x[4], x[5], x[6], x[7] );
-
-    STEP_PART( &x[0], &CNS128[ 0], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[ 2], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[ 4], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[ 6], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[ 8], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[10], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[12], &tmp[0] );
-    STEP_PART( &x[0], &CNS128[14], &tmp[0] );
-
-    MIXTON1024( x[0], x[1], x[2], x[3],
-                chainv[0], chainv[2], chainv[4],chainv[6],
-                x[4], x[5], x[6], x[7],
-                chainv[1],chainv[3],chainv[5],chainv[7]);
-
-    /* Process last 256-bit block */
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[16], CNS128[17],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[18], CNS128[19],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[20], CNS128[21],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[22], CNS128[23],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[24], CNS128[25],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[26], CNS128[27],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[28], CNS128[29],
-                tmp[0], tmp[1] );
-    STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[30], CNS128[31],
-                tmp[0], tmp[1] );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(16), cns(17) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(18), cns(19) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(20), cns(21) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(22), cns(23) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(24), cns(25) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(26), cns(27) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(28), cns(29) );
+    STEP_PART2( chainv[8], chainv[9], t0, t1, cns(30), cns(31) );
 }


@@ -588,51 +550,6 @@ static void rnd512( hashState_luffa *state, __m128i msg1, __m128i msg0 )
 /* state: hash context    */
 /* b[8]: hash values      */

-#if defined (__AVX2__)
-
-static void finalization512( hashState_luffa *state, uint32 *b )
-{
-    uint32   hash[8] __attribute((aligned(64)));
-    __m256i* chainv = (__m256i*)state->chainv;
-    __m256i  t;
-    const __m128i zero = m128_zero;
-    const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
-                                                 0x1415161710111213,
-                                                 0x0c0d0e0f08090a0b,
-                                                 0x0405060700010203 );
-
-    rnd512( state, zero, zero );
-
-    t = chainv[0];
-    t = _mm256_xor_si256( t, chainv[1] );
-    t = _mm256_xor_si256( t, chainv[2] );
-    t = _mm256_xor_si256( t, chainv[3] );
-    t = _mm256_xor_si256( t, chainv[4] );
-
-    t = _mm256_shuffle_epi32( t, 27 );
-
-    _mm256_store_si256( (__m256i*)hash, t );
-
-    casti_m256i( b, 0 ) = _mm256_shuffle_epi8(
-                                 casti_m256i( hash, 0 ), shuff_bswap32 );
-
-    rnd512( state, zero, zero );
-
-    t = chainv[0];
-    t = _mm256_xor_si256( t, chainv[1] );
-    t = _mm256_xor_si256( t, chainv[2] );
-    t = _mm256_xor_si256( t, chainv[3] );
-    t = _mm256_xor_si256( t, chainv[4] );
-    t = _mm256_shuffle_epi32( t, 27 );
-
-    _mm256_store_si256( (__m256i*)hash, t );
-
-    casti_m256i( b, 1 ) = _mm256_shuffle_epi8( 
-                                 casti_m256i( hash, 0 ), shuff_bswap32 );
-}
-
-#else
-
 static void finalization512( hashState_luffa *state, uint32 *b )
 {
    uint32 hash[8] __attribute((aligned(64)));
@@ -685,6 +602,5 @@ static void finalization512( hashState_luffa *state, uint32 *b )
    casti_m128i( b, 2 ) = mm128_bswap_32( casti_m128i( hash, 0 ) );
    casti_m128i( b, 3 ) = mm128_bswap_32( casti_m128i( hash, 1 ) );
 }
-#endif

 /***************************************************/
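Note on the AVX512VL `SUBCRUMB` in the file above: the comment on the `_mm_ternarylogic_epi64( a1, a3, t, 0x87 )` line says it computes `a1 xnor (a3 & t)`, which in the generic path takes three steps (`mm128_not`, `_mm_and_si128`, `_mm_xor_si128`). The sketch below is a scalar model for checking that truth table by hand. It is not code from the repository, and the names `ternarylogic64_model`, `subcrumb_a1_generic` and `imm_0x87_matches` are hypothetical.

```c
#include <stdint.h>

/* Scalar model of a three-input bitwise function chosen by an 8-bit truth
   table, in the style of _mm_ternarylogic_epi64: at every bit position the
   input bits form the index (a<<2)|(b<<1)|c and the result is bit `index`
   of imm. */
uint64_t ternarylogic64_model( uint64_t a, uint64_t b, uint64_t c,
                               uint8_t imm )
{
   uint64_t r = 0;
   for ( int i = 0; i < 64; i++ )
   {
      const unsigned idx = (unsigned)( ( ( ( a >> i ) & 1 ) << 2 )
                                     | ( ( ( b >> i ) & 1 ) << 1 )
                                     |   ( ( c >> i ) & 1 ) );
      r |= (uint64_t)( ( imm >> idx ) & 1 ) << i;
   }
   return r;
}

/* The generic-path sequence for a1, folded into one expression:
   a1 = not a1;  a3 = a3 & t;  a1 = a1 ^ a3. */
uint64_t subcrumb_a1_generic( uint64_t a1, uint64_t a3, uint64_t t )
{
   return ~a1 ^ ( a3 & t );
}

/* Returns 1 when truth table 0x87 matches the generic sequence. The inputs
   0xf0, 0xcc, 0xaa enumerate all eight (a1, a3, t) bit combinations in
   their low byte. */
int imm_0x87_matches( void )
{
   const uint64_t a1 = 0xf0, a3 = 0xcc, t = 0xaa;
   return ternarylogic64_model( a1, a3, t, 0x87 )
          == subcrumb_a1_generic( a1, a3, t );
}
```

With the inputs 0xf0, 0xcc and 0xaa the low byte of the result is the truth table itself, so any immediate can be checked the same way.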
--- a/algo/lyra2/allium-4way.c
+++ b/algo/lyra2/allium-4way.c
@@ -24,45 +24,6 @@ typedef union {
 #endif
 } allium_16way_ctx_holder;

-static uint32_t allium_16way_midstate_vars[16*16] __attribute__ ((aligned (64)));
-static __m512i allium_16way_block0_hash[8] __attribute__ ((aligned (64)));
-static __m512i allium_16way_block_buf[16] __attribute__ ((aligned (64)));
-
-int allium_16way_prehash( struct work *work )
-{
-   uint32_t phash[8] __attribute__ ((aligned (32))) =
-   {
-      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
-      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
-   };
-   uint32_t *pdata = work->data;
-
-   // Prehash first block.
-   blake256_transform_le( phash, pdata, 512, 0 );
-
-   // Interleave hash for second block prehash.
-   allium_16way_block0_hash[0] = _mm512_set1_epi32( phash[0] );
-   allium_16way_block0_hash[1] = _mm512_set1_epi32( phash[1] );
-   allium_16way_block0_hash[2] = _mm512_set1_epi32( phash[2] );
-   allium_16way_block0_hash[3] = _mm512_set1_epi32( phash[3] );
-   allium_16way_block0_hash[4] = _mm512_set1_epi32( phash[4] );
-   allium_16way_block0_hash[5] = _mm512_set1_epi32( phash[5] );
-   allium_16way_block0_hash[6] = _mm512_set1_epi32( phash[6] );
-   allium_16way_block0_hash[7] = _mm512_set1_epi32( phash[7] );
-
-   // Build vectored second block, interleave 12 of last 16 bytes of data,
-   // excluding the nonce.
-   allium_16way_block_buf[ 0] = _mm512_set1_epi32( pdata[16] );
-   allium_16way_block_buf[ 1] = _mm512_set1_epi32( pdata[17] );
-   allium_16way_block_buf[ 2] = _mm512_set1_epi32( pdata[18] );
-
-   // Partialy prehash second block without touching nonces in block_buf[3].
-   blake256_16way_round0_prehash_le( allium_16way_midstate_vars,
-                         allium_16way_block0_hash, allium_16way_block_buf );
-
-   return 1;
-}
-
 static void allium_16way_hash( void *state, const void *midstate_vars, 
                               const void *midhash, const void *block )
 {
@@ -239,6 +200,11 @@ int scanhash_allium_16way( struct work *work, uint32_t max_nonce,
   uint32_t midstate_vars[16*16] __attribute__ ((aligned (64)));
   __m512i block0_hash[8] __attribute__ ((aligned (64)));
   __m512i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t phash[8] __attribute__ ((aligned (32))) = 
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -246,23 +212,35 @@ int scanhash_allium_16way( struct work *work, uint32_t max_nonce,
   const uint32_t last_nonce = max_nonce - 16;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m512i sixteen = m512_const1_32( 16 );
+   const __m512i sixteen = _mm512_set1_epi32( 16 );

   if ( bench ) ( (uint32_t*)ptarget )[7] = 0x0000ff;

-   pthread_rwlock_rdlock( &g_work_lock );
+   // Prehash first block.
+   blake256_transform_le( phash, pdata, 512, 0 );

-   memcpy( midstate_vars, allium_16way_midstate_vars, sizeof midstate_vars );
-   memcpy( block0_hash,   allium_16way_block0_hash,   sizeof block0_hash );
-   memcpy( block_buf,     allium_16way_block_buf,     sizeof block_buf );
+   // Interleave hash for second block prehash.
+   block0_hash[0] = _mm512_set1_epi32( phash[0] );
+   block0_hash[1] = _mm512_set1_epi32( phash[1] );
+   block0_hash[2] = _mm512_set1_epi32( phash[2] );
+   block0_hash[3] = _mm512_set1_epi32( phash[3] );
+   block0_hash[4] = _mm512_set1_epi32( phash[4] );
+   block0_hash[5] = _mm512_set1_epi32( phash[5] );
+   block0_hash[6] = _mm512_set1_epi32( phash[6] );
+   block0_hash[7] = _mm512_set1_epi32( phash[7] );

-   pthread_rwlock_unlock( &g_work_lock );
-
-   // fill in the nonces
-   block_buf[3] =
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces.
+   block_buf[ 0] = _mm512_set1_epi32( pdata[16] );
+   block_buf[ 1] = _mm512_set1_epi32( pdata[17] );
+   block_buf[ 2] = _mm512_set1_epi32( pdata[18] );
+   block_buf[ 3] =
             _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+ 1, n );
-   
+
+   // Partially prehash second block without touching nonces in block_buf[3].
+   blake256_16way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
+
   do {
     allium_16way_hash( hash, midstate_vars, block0_hash, block_buf );

@@ -293,44 +271,6 @@ typedef union {
 #endif
 } allium_8way_ctx_holder;

-static uint32_t allium_8way_midstate_vars[16*8] __attribute__ ((aligned (64)));
-static __m256i allium_8way_block0_hash[8] __attribute__ ((aligned (64)));
-static __m256i allium_8way_block_buf[16] __attribute__ ((aligned (64)));
-
-int allium_8way_prehash ( struct work *work )
-{
-   uint32_t phash[8] __attribute__ ((aligned (32))) =
-   {
-      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
-      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
-   };
-   uint32_t *pdata = work->data;
-
-   // Prehash first block
-   blake256_transform_le( phash, pdata, 512, 0 );
-
-   allium_8way_block0_hash[0] = _mm256_set1_epi32( phash[0] );
-   allium_8way_block0_hash[1] = _mm256_set1_epi32( phash[1] );
-   allium_8way_block0_hash[2] = _mm256_set1_epi32( phash[2] );
-   allium_8way_block0_hash[3] = _mm256_set1_epi32( phash[3] );
-   allium_8way_block0_hash[4] = _mm256_set1_epi32( phash[4] );
-   allium_8way_block0_hash[5] = _mm256_set1_epi32( phash[5] );
-   allium_8way_block0_hash[6] = _mm256_set1_epi32( phash[6] );
-   allium_8way_block0_hash[7] = _mm256_set1_epi32( phash[7] );
-
-   // Build vectored second block, interleave 12 of the last 16 bytes,
-   // excepting the nonces.
-   allium_8way_block_buf[ 0] = _mm256_set1_epi32( pdata[16] );
-   allium_8way_block_buf[ 1] = _mm256_set1_epi32( pdata[17] );
-   allium_8way_block_buf[ 2] = _mm256_set1_epi32( pdata[18] );
-
-   // Partialy prehash second block without touching nonces
-   blake256_8way_round0_prehash_le( allium_8way_midstate_vars,
-                             allium_8way_block0_hash, allium_8way_block_buf );
-
-   return 1;
-}
-
 static void allium_8way_hash( void *hash, const void *midstate_vars,
                               const void *midhash, const void *block )
 {
@@ -446,6 +386,11 @@ int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
   uint32_t midstate_vars[16*8] __attribute__ ((aligned (64)));
   __m256i block0_hash[8] __attribute__ ((aligned (64)));
   __m256i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t phash[8] __attribute__ ((aligned (32))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
   uint64_t *ptarget = (uint64_t*)work->target;
   const uint32_t first_nonce = pdata[19];
@@ -453,19 +398,31 @@ int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;  
   const bool bench = opt_benchmark;
-   const __m256i eight = m256_const1_32( 8 );
+   const __m256i eight = _mm256_set1_epi32( 8 );

-   pthread_rwlock_rdlock( &g_work_lock );
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0 );

-   memcpy( midstate_vars, allium_8way_midstate_vars, sizeof midstate_vars );
-   memcpy( block0_hash,   allium_8way_block0_hash,   sizeof block0_hash );
-   memcpy( block_buf,     allium_8way_block_buf,     sizeof block_buf );
+   block0_hash[0] = _mm256_set1_epi32( phash[0] );
+   block0_hash[1] = _mm256_set1_epi32( phash[1] );
+   block0_hash[2] = _mm256_set1_epi32( phash[2] );
+   block0_hash[3] = _mm256_set1_epi32( phash[3] );
+   block0_hash[4] = _mm256_set1_epi32( phash[4] );
+   block0_hash[5] = _mm256_set1_epi32( phash[5] );
+   block0_hash[6] = _mm256_set1_epi32( phash[6] );
+   block0_hash[7] = _mm256_set1_epi32( phash[7] );

-   pthread_rwlock_unlock( &g_work_lock );
-   
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces.
+   block_buf[ 0] = _mm256_set1_epi32( pdata[16] );
+   block_buf[ 1] = _mm256_set1_epi32( pdata[17] );
+   block_buf[ 2] = _mm256_set1_epi32( pdata[18] );
   block_buf[ 3] = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4,
                                     n+ 3, n+ 2, n+ 1, n );
-   
+
+   // Partially prehash second block without touching nonces
+   blake256_8way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
+
   do {
     allium_8way_hash( hash, midstate_vars, block0_hash, block_buf );

@@ -481,7 +438,6 @@ int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
     n += 8;
     block_buf[ 3] = _mm256_add_epi32( block_buf[ 3], eight );
   } while ( likely( (n <= last_nonce) && !work_restart[thr_id].restart ) );
-
   pdata[19] = n;
   *hashes_done = n - first_nonce;
   return 0;
--- a/algo/lyra2/lyra2-gate.c
+++ b/algo/lyra2/lyra2-gate.c
@@ -131,12 +131,10 @@ bool register_lyra2z_algo( algo_gate_t* gate )
 {
 #if defined(LYRA2Z_16WAY)
  gate->miner_thread_init = (void*)&lyra2z_16way_thread_init;
-  gate->prehash    = (void*)&lyra2z_16way_prehash;
  gate->scanhash   = (void*)&scanhash_lyra2z_16way;
 //  gate->hash       = (void*)&lyra2z_16way_hash;
 #elif defined(LYRA2Z_8WAY)
  gate->miner_thread_init = (void*)&lyra2z_8way_thread_init;
-  gate->prehash    = (void*)&lyra2z_8way_prehash;
  gate->scanhash   = (void*)&scanhash_lyra2z_8way;
 //  gate->hash       = (void*)&lyra2z_8way_hash;
 #elif defined(LYRA2Z_4WAY)
@@ -177,10 +175,8 @@ bool register_lyra2h_algo( algo_gate_t* gate )
 bool register_allium_algo( algo_gate_t* gate )
 {
 #if defined (ALLIUM_16WAY)
-  gate->prehash   = (void*)&allium_16way_prehash;
  gate->scanhash  = (void*)&scanhash_allium_16way;
 #elif defined (ALLIUM_8WAY)
-  gate->prehash   = (void*)&allium_8way_prehash;
  gate->scanhash  = (void*)&scanhash_allium_8way;
 #else
  gate->miner_thread_init = (void*)&init_allium_ctx;
--- a/algo/lyra2/lyra2-gate.h
+++ b/algo/lyra2/lyra2-gate.h
@@ -5,6 +5,7 @@
 #include <stdint.h>
 #include "lyra2.h"

+
 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
  #define LYRA2REV3_16WAY 1
 #elif defined(__AVX2__)
@@ -101,7 +102,6 @@ bool init_lyra2rev2_ctx();
 //void lyra2z_16way_hash( void *state, const void *input );
 int scanhash_lyra2z_16way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr );
-int lyra2z_16way_prehash ( struct work *work );
 bool lyra2z_16way_thread_init();

 #elif defined(LYRA2Z_8WAY)
@@ -110,7 +110,6 @@ bool lyra2z_16way_thread_init();
 int scanhash_lyra2z_8way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr );
 bool lyra2z_8way_thread_init();
-int lyra2z_8way_prehash ( struct work *work );

 #elif defined(LYRA2Z_4WAY)

@@ -166,13 +165,11 @@ bool register_allium_algo( algo_gate_t* gate );

 int scanhash_allium_16way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr );
-int allium_16way_prehash ( struct work *work );

 #elif defined(ALLIUM_8WAY)

 int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr );
-int allium_8way_prehash ( struct work *work );

 #else

--- a/algo/lyra2/lyra2rev2-4way.c
+++ b/algo/lyra2/lyra2rev2-4way.c
@@ -75,7 +75,7 @@ void lyra2rev2_16way_hash( void *state, const void *input )
   keccak256_8way_close( &ctx.keccak, vhash );

   dintrlv_8x64( hash8,  hash9,  hash10,  hash11,
-                 hash12, hash13, hash14, hash5, vhash, 256 );
+                 hash12, hash13, hash14, hash15, vhash, 256 );

   cubehash_full( &ctx.cube, (byte*) hash0,  256, (const byte*) hash0,  32 );
   cubehash_full( &ctx.cube, (byte*) hash1,  256, (const byte*) hash1,  32 );
@@ -203,7 +203,7 @@ int scanhash_lyra2rev2_16way( struct work *work, const uint32_t max_nonce,
             submit_solution( work, lane_hash, mythr );
         }
      }
-      *noncev = _mm512_add_epi32( *noncev, m512_const1_32( 16 ) );
+      *noncev = _mm512_add_epi32( *noncev, _mm512_set1_epi32( 16 ) );
      n += 16;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -345,7 +345,7 @@ int scanhash_lyra2rev2_8way( struct work *work, const uint32_t max_nonce,
             submit_solution( work, lane_hash, mythr );
         }
      }
-      *noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
+      *noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
      n += 8;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
--- a/algo/lyra2/lyra2rev3-4way.c
+++ b/algo/lyra2/lyra2rev3-4way.c
@@ -287,7 +287,7 @@ int scanhash_lyra2rev3_8way( struct work *work, const uint32_t max_nonce,
             submit_solution( work, lane_hash, mythr );
         }
      }
-      *noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
+      *noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
      n += 8;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -389,7 +389,7 @@ int scanhash_lyra2rev3_4way( struct work *work, const uint32_t max_nonce,
              submit_solution( work, lane_hash, mythr );
 	      }
      }
-      *noncev = _mm_add_epi32( *noncev, m128_const1_32( 4 ) );
+      *noncev = _mm_add_epi32( *noncev, _mm_set1_epi32( 4 ) );
      n += 4;
   } while ( (n < max_nonce-4) && !work_restart[thr_id].restart);
   pdata[19] = n;
--- a/algo/lyra2/lyra2z-4way.c
+++ b/algo/lyra2/lyra2z-4way.c
@@ -14,44 +14,6 @@ bool lyra2z_16way_thread_init()
 return ( lyra2z_16way_matrix = _mm_malloc( 2*LYRA2Z_MATRIX_SIZE, 64 ) );
 }

-static uint32_t lyra2z_16way_midstate_vars[16*16] __attribute__ ((aligned (64)));
-static __m512i lyra2z_16way_block0_hash[8] __attribute__ ((aligned (64)));
-static __m512i lyra2z_16way_block_buf[16] __attribute__ ((aligned (64)));
-
-int lyra2z_16way_prehash ( struct work *work )
-{
-   uint32_t phash[8] __attribute__ ((aligned (32))) =
-   {
-      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
-      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
-   };
-   uint32_t *pdata = work->data;
-
-   // Prehash first block
-   blake256_transform_le( phash, pdata, 512, 0 );
-
-   lyra2z_16way_block0_hash[0] = _mm512_set1_epi32( phash[0] );
-   lyra2z_16way_block0_hash[1] = _mm512_set1_epi32( phash[1] );
-   lyra2z_16way_block0_hash[2] = _mm512_set1_epi32( phash[2] );
-   lyra2z_16way_block0_hash[3] = _mm512_set1_epi32( phash[3] );
-   lyra2z_16way_block0_hash[4] = _mm512_set1_epi32( phash[4] );
-   lyra2z_16way_block0_hash[5] = _mm512_set1_epi32( phash[5] );
-   lyra2z_16way_block0_hash[6] = _mm512_set1_epi32( phash[6] );
-   lyra2z_16way_block0_hash[7] = _mm512_set1_epi32( phash[7] );
-
-   // Build vectored second block, interleave 12 of last 16 bytes of data
-   // excepting the nonce.
-   lyra2z_16way_block_buf[ 0] = _mm512_set1_epi32( pdata[16] );
-   lyra2z_16way_block_buf[ 1] = _mm512_set1_epi32( pdata[17] );
-   lyra2z_16way_block_buf[ 2] = _mm512_set1_epi32( pdata[18] );
-
-   // Partialy prehash second block without touching nonces in block_buf[3].
-   blake256_16way_round0_prehash_le( lyra2z_16way_midstate_vars, 
-                       lyra2z_16way_block0_hash, lyra2z_16way_block_buf );
-
-   return 1;
-}
-
 static void lyra2z_16way_hash( void *state, const void *midstate_vars,
                        const void *midhash, const void *block )
 {
@@ -129,6 +91,11 @@ int scanhash_lyra2z_16way( struct work *work, uint32_t max_nonce,
   uint32_t midstate_vars[16*16] __attribute__ ((aligned (64)));
   __m512i block0_hash[8] __attribute__ ((aligned (64)));
   __m512i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t phash[8] __attribute__ ((aligned (64))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -136,22 +103,34 @@ int scanhash_lyra2z_16way( struct work *work, uint32_t max_nonce,
   const uint32_t last_nonce = max_nonce - 16;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m512i sixteen = m512_const1_32( 16 );
+   const __m512i sixteen = _mm512_set1_epi32( 16 );

   if ( bench ) ( (uint32_t*)ptarget )[7] = 0x0000ff;

-   pthread_rwlock_rdlock( &g_work_lock );
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0 );

-   memcpy( midstate_vars, lyra2z_16way_midstate_vars, sizeof midstate_vars );
-   memcpy( block0_hash,   lyra2z_16way_block0_hash,   sizeof block0_hash );
-   memcpy( block_buf,     lyra2z_16way_block_buf,     sizeof block_buf );
+   block0_hash[0] = _mm512_set1_epi32( phash[0] );
+   block0_hash[1] = _mm512_set1_epi32( phash[1] );
+   block0_hash[2] = _mm512_set1_epi32( phash[2] );
+   block0_hash[3] = _mm512_set1_epi32( phash[3] );
+   block0_hash[4] = _mm512_set1_epi32( phash[4] );
+   block0_hash[5] = _mm512_set1_epi32( phash[5] );
+   block0_hash[6] = _mm512_set1_epi32( phash[6] );
+   block0_hash[7] = _mm512_set1_epi32( phash[7] );

-   pthread_rwlock_unlock( &g_work_lock );
-   
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces.
+   block_buf[ 0] = _mm512_set1_epi32( pdata[16] );
+   block_buf[ 1] = _mm512_set1_epi32( pdata[17] );
+   block_buf[ 2] = _mm512_set1_epi32( pdata[18] );
   block_buf[ 3] =
             _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );

+   // Partially prehash second block without touching nonces in block_buf[3].
+   blake256_16way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
+
   do {
     lyra2z_16way_hash( hash, midstate_vars, block0_hash, block_buf );

@@ -178,44 +157,6 @@ bool lyra2z_8way_thread_init()
 return ( lyra2z_8way_matrix = _mm_malloc( LYRA2Z_MATRIX_SIZE, 64 ) );
 }

-static uint32_t lyra2z_8way_midstate_vars[16*8] __attribute__ ((aligned (64)));
-static __m256i lyra2z_8way_block0_hash[8] __attribute__ ((aligned (64)));
-static __m256i lyra2z_8way_block_buf[16] __attribute__ ((aligned (64)));
-
-int lyra2z_8way_prehash ( struct work *work )
-{
-   uint32_t phash[8] __attribute__ ((aligned (32))) =
-   {
-      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
-      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
-   };
-   uint32_t *pdata = work->data;
-
-   // Prehash first block
-   blake256_transform_le( phash, pdata, 512, 0 );
-
-   lyra2z_8way_block0_hash[0] = _mm256_set1_epi32( phash[0] );
-   lyra2z_8way_block0_hash[1] = _mm256_set1_epi32( phash[1] );
-   lyra2z_8way_block0_hash[2] = _mm256_set1_epi32( phash[2] );
-   lyra2z_8way_block0_hash[3] = _mm256_set1_epi32( phash[3] );
-   lyra2z_8way_block0_hash[4] = _mm256_set1_epi32( phash[4] );
-   lyra2z_8way_block0_hash[5] = _mm256_set1_epi32( phash[5] );
-   lyra2z_8way_block0_hash[6] = _mm256_set1_epi32( phash[6] );
-   lyra2z_8way_block0_hash[7] = _mm256_set1_epi32( phash[7] );
-
-   // Build vectored second block, interleave last 16 bytes of data using
-   // unique nonces.
-   lyra2z_8way_block_buf[ 0] = _mm256_set1_epi32( pdata[16] );
-   lyra2z_8way_block_buf[ 1] = _mm256_set1_epi32( pdata[17] );
-   lyra2z_8way_block_buf[ 2] = _mm256_set1_epi32( pdata[18] );
-
-   // Partialy prehash second block without touching nonces
-   blake256_8way_round0_prehash_le( lyra2z_8way_midstate_vars,
-                           lyra2z_8way_block0_hash, lyra2z_8way_block_buf );
-
-   return 1;
-}
-
 static void lyra2z_8way_hash( void *state, const void *midstate_vars,
                       const void *midhash, const void *block )
 {
@@ -260,6 +201,11 @@ int scanhash_lyra2z_8way( struct work *work, uint32_t max_nonce,
   uint32_t midstate_vars[16*8] __attribute__ ((aligned (64)));
   __m256i block0_hash[8] __attribute__ ((aligned (64)));
   __m256i block_buf[16] __attribute__ ((aligned (64)));
+   uint32_t phash[8] __attribute__ ((aligned (32))) =
+   {
+      0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
+      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
+   };
   uint32_t *pdata = work->data;
   uint64_t *ptarget = (uint64_t*)work->target;
   const uint32_t first_nonce = pdata[19];
@@ -267,16 +213,25 @@ int scanhash_lyra2z_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m256i eight = m256_const1_32( 8 );
+   const __m256i eight = _mm256_set1_epi32( 8 );

-   pthread_rwlock_rdlock( &g_work_lock );
+   // Prehash first block
+   blake256_transform_le( phash, pdata, 512, 0 );

-   memcpy( midstate_vars, lyra2z_8way_midstate_vars, sizeof midstate_vars );
-   memcpy( block0_hash,   lyra2z_8way_block0_hash,   sizeof block0_hash );
-   memcpy( block_buf,     lyra2z_8way_block_buf,     sizeof block_buf );
+   block0_hash[0] = _mm256_set1_epi32( phash[0] );
+   block0_hash[1] = _mm256_set1_epi32( phash[1] );
+   block0_hash[2] = _mm256_set1_epi32( phash[2] );
+   block0_hash[3] = _mm256_set1_epi32( phash[3] );
+   block0_hash[4] = _mm256_set1_epi32( phash[4] );
+   block0_hash[5] = _mm256_set1_epi32( phash[5] );
+   block0_hash[6] = _mm256_set1_epi32( phash[6] );
+   block0_hash[7] = _mm256_set1_epi32( phash[7] );

-   pthread_rwlock_unlock( &g_work_lock );
-   
+   // Build vectored second block, interleave last 16 bytes of data using
+   // unique nonces.
+   block_buf[ 0] = _mm256_set1_epi32( pdata[16] );
+   block_buf[ 1] = _mm256_set1_epi32( pdata[17] );
+   block_buf[ 2] = _mm256_set1_epi32( pdata[18] );
   block_buf[ 3] =
            _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );

@@ -373,7 +328,7 @@ int scanhash_lyra2z_4way( struct work *work, uint32_t max_nonce,
           submit_solution( work, lane_hash, mythr );
        }
      }
-      *noncev = _mm_add_epi32( *noncev, m128_const1_32( 4 ) );
+      *noncev = _mm_add_epi32( *noncev, _mm_set1_epi32( 4 ) );
      n += 4;
   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

--- a/algo/lyra2/sponge-2way.c
+++ b/algo/lyra2/sponge-2way.c
@@ -85,10 +85,10 @@ inline void absorbBlockBlake2Safe_2way( uint64_t *State, const uint64_t *In,

  state0 = 
  state1 = m512_zero;
-  state2 = m512_const4_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
-                           0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state3 = m512_const4_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
-                           0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state2 = _mm512_set4_epi64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
+                              0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state3 = _mm512_set4_epi64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
+                              0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );

  for ( int i = 0; i < nBlocks; i++ )
  { 
--- a/algo/lyra2/sponge.c
+++ b/algo/lyra2/sponge.c
@@ -41,17 +41,17 @@
 inline void initState( uint64_t State[/*16*/] )
 {

-   /*
+/*
 #if defined (__AVX2__)

  __m256i* state = (__m256i*)State;
  const __m256i zero = m256_zero; 
  state[0] = zero;
  state[1] = zero;
-  state[2] = m256_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
-                            0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state[3] = m256_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
-                            0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state[2] = _mm256_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
+                                0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state[3] = _mm256_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
+                                0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );

 #elif defined (__SSE2__)

@@ -62,10 +62,10 @@ inline void initState( uint64_t State[/*16*/] )
  state[1] = zero;
  state[2] = zero;
  state[3] = zero;
-  state[4] = m128_const_64( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state[5] = m128_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
-  state[6] = m128_const_64( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
-  state[7] = m128_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );
+  state[4] = _mm_set_epi64x( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state[5] = _mm_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
+  state[6] = _mm_set_epi64x( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state[7] = _mm_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );

 #else
    //First 512 bis are zeros
@@ -271,10 +271,10 @@ inline void absorbBlockBlake2Safe( uint64_t *State, const uint64_t *In,

  state0 = 
  state1 = m256_zero;
-  state2 = m256_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
-                          0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state3 = m256_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
-                          0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state2 = _mm256_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
+                              0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state3 = _mm256_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
+                              0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );

  for ( int i = 0; i < nBlocks; i++ )
  { 
@@ -299,10 +299,10 @@ inline void absorbBlockBlake2Safe( uint64_t *State, const uint64_t *In,
  state1 =
  state2 =
  state3 = m128_zero;
-  state4 = m128_const_64( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
-  state5 = m128_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
-  state6 = m128_const_64( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
-  state7 = m128_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );
+  state4 = _mm_set_epi64x( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
+  state5 = _mm_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
+  state6 = _mm_set_epi64x( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
+  state7 = _mm_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );

  for ( int i = 0; i < nBlocks; i++ )
  { 
--- a/algo/lyra2/sponge.h
+++ b/algo/lyra2/sponge.h
@@ -43,27 +43,29 @@ static const uint64_t blake2b_IV[8] =
  0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL
 };

-/*Blake2b's rotation*/
-static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
-    return ( w >> c ) | ( w << ( 64 - c ) );
-}
-
-// serial data is only 32 bytes so AVX2 is the limit for that dimension.
-// However, 2 way parallel looks trivial to code for AVX512 except for
-// a data dependency with rowa.
-
 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

 #define G2W_4X64(a,b,c,d) \
   a = _mm512_add_epi64( a, b ); \
-   d = mm512_ror_64( _mm512_xor_si512( d, a ), 32 ); \
+   d = _mm512_ror_epi64( _mm512_xor_si512( d, a ), 32 ); \
   c = _mm512_add_epi64( c, d ); \
-   b = mm512_ror_64( _mm512_xor_si512( b, c ), 24 ); \
+   b = _mm512_ror_epi64( _mm512_xor_si512( b, c ), 24 ); \
   a = _mm512_add_epi64( a, b ); \
-   d = mm512_ror_64( _mm512_xor_si512( d, a ), 16 ); \
+   d = _mm512_ror_epi64( _mm512_xor_si512( d, a ), 16 ); \
   c = _mm512_add_epi64( c, d ); \
-   b = mm512_ror_64( _mm512_xor_si512( b, c ), 63 );
+   b = _mm512_ror_epi64( _mm512_xor_si512( b, c ), 63 );

+#define LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
+   G2W_4X64( s0, s1, s2, s3 ); \
+   s0 = mm512_shufll256_64( s0 ); \
+   s3 = mm512_swap256_128( s3 ); \
+   s2 = mm512_shuflr256_64( s2 ); \
+   G2W_4X64( s0, s1, s2, s3 ); \
+   s0 = mm512_shuflr256_64( s0 ); \
+   s3 = mm512_swap256_128( s3 ); \
+   s2 = mm512_shufll256_64( s2 );
+
+/*
 #define LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
   G2W_4X64( s0, s1, s2, s3 ); \
   s3 = mm512_shufll256_64( s3 ); \
@@ -73,6 +75,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   s3 = mm512_shuflr256_64( s3 ); \
   s1 = mm512_shufll256_64( s1 ); \
   s2 = mm512_swap256_128( s2 ); 
+*/

 #define LYRA_12_ROUNDS_2WAY_AVX512( s0, s1, s2, s3 ) \
   LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
@@ -88,13 +91,10 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
   LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 )

-
 #endif  // AVX512

-#if defined __AVX2__
+#if defined(__AVX2__)

-// process 4 columns in parallel
-// returns void, updates all args
 #define G_4X64(a,b,c,d) \
   a = _mm256_add_epi64( a, b ); \
   d = mm256_swap64_32( _mm256_xor_si256( d, a ) ); \
@@ -105,6 +105,18 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   c = _mm256_add_epi64( c, d ); \
   b = mm256_ror_64( _mm256_xor_si256( b, c ), 63 );

+// Pivoting about s1 instead of s0 reduces latency.
+#define LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
+   G_4X64( s0, s1, s2, s3 ); \
+   s0 = mm256_shufll_64( s0 ); \
+   s3 = mm256_swap_128( s3 ); \
+   s2 = mm256_shuflr_64( s2 ); \
+   G_4X64( s0, s1, s2, s3 ); \
+   s0 = mm256_shuflr_64( s0 ); \
+   s3 = mm256_swap_128( s3 ); \
+   s2 = mm256_shufll_64( s2 );
+
+/*
 #define LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
   G_4X64( s0, s1, s2, s3 ); \
   s3 = mm256_shufll_64( s3 ); \
@@ -114,6 +126,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
   s3 = mm256_shuflr_64( s3 ); \
   s1 = mm256_shufll_64( s1 ); \
   s2 = mm256_swap_128( s2 );
+*/

 #define LYRA_12_ROUNDS_AVX2( s0, s1, s2, s3 ) \
   LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
@@ -182,8 +195,13 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){

 #endif // AVX2 else SSE2

-// Scalar
-//Blake2b's G function
+/*
+// Scalar, not used.
+
+static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
+    return ( w >> c ) | ( w << ( 64 - c ) );
+}
+
 #define G(r,i,a,b,c,d) \
  do { \
    a = a + b; \
@@ -196,8 +214,6 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
    b = rotr64(b ^ c, 63); \
  } while(0)

-
-/*One Round of the Blake2b's compression function*/
 #define ROUND_LYRA(r)  \
    G(r,0,v[ 0],v[ 4],v[ 8],v[12]); \
    G(r,1,v[ 1],v[ 5],v[ 9],v[13]); \
@@ -207,6 +223,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
    G(r,5,v[ 1],v[ 6],v[11],v[12]); \
    G(r,6,v[ 2],v[ 7],v[ 8],v[13]); \
    G(r,7,v[ 3],v[ 4],v[ 9],v[14]);
+*/

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

--- a/algo/quark/anime-4way.c
+++ b/algo/quark/anime-4way.c
@@ -51,7 +51,7 @@ void anime_8way_hash( void *state, const void *input )
    __m512i* vhA = (__m512i*)vhashA;
    __m512i* vhB = (__m512i*)vhashB;
    __m512i* vhC = (__m512i*)vhashC;
-    const __m512i bit3_mask = m512_const1_64( 8 );
+    const __m512i bit3_mask = _mm512_set1_epi64( 8 );
    __mmask8 vh_mask;
    anime_8way_context_overlay ctx __attribute__ ((aligned (64)));

@@ -209,7 +209,7 @@ int scanhash_anime_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                   m512_const1_64( 0x0000000800000000 ) );
+                                   _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
    pdata[19] = n;
@@ -248,7 +248,7 @@ void anime_4way_hash( void *state, const void *input )
    __m256i* vhB = (__m256i*)vhashB;
    __m256i vh_mask;
    int h_mask;
-    const __m256i bit3_mask = m256_const1_64( 8 );
+    const __m256i bit3_mask = _mm256_set1_epi64x( 8 );
    const __m256i zero = _mm256_setzero_si256();
    anime_4way_context_overlay ctx __attribute__ ((aligned (64)));

@@ -388,7 +388,7 @@ int scanhash_anime_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                   m256_const1_64( 0x0000000400000000 ) );
+                                   _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
    pdata[19] = n;
--- a/algo/quark/hmq1725-4way.c
+++ b/algo/quark/hmq1725-4way.c
@@ -75,7 +75,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   uint32_t hash7 [16]    __attribute__ ((aligned (32)));
   hmq1725_8way_context_overlay ctx __attribute__ ((aligned (64)));
   __mmask8 vh_mask;
-   const __m512i vmask = m512_const1_64( 24 );
+   const __m512i vmask = _mm512_set1_epi64( 24 );
   const uint32_t mask = 24;
   __m512i* vh  = (__m512i*)vhash;
   __m512i* vhA = (__m512i*)vhashA;
@@ -593,7 +593,7 @@ int scanhash_hmq1725_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                   m512_const1_64( 0x0000000800000000 ) );
+                                   _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );

@@ -647,7 +647,7 @@ extern void hmq1725_4way_hash(void *state, const void *input)
   hmq1725_4way_context_overlay ctx __attribute__ ((aligned (64)));
   __m256i vh_mask;     
   int h_mask;
-   const __m256i vmask = m256_const1_64( 24 );
+   const __m256i vmask = _mm256_set1_epi64x( 24 );
   const uint32_t mask = 24;
   __m256i* vh  = (__m256i*)vhash;
   __m256i* vhA = (__m256i*)vhashA;
@@ -1041,7 +1041,7 @@ int scanhash_hmq1725_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                   m256_const1_64( 0x0000000400000000 ) );
+                                   _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
    pdata[19] = n;
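
Note on the nonce increments in the quark-family hunks above: the replacement keeps the pattern of adding a broadcast 64-bit constant with a 32-bit add. A minimal scalar sketch of what that does, assuming little-endian lane order as on x86; the type `vec8x32` and the `model_*` helpers are hypothetical stand-ins for `_mm256_set1_epi64x` and `_mm256_add_epi32`, not functions from the source tree:

```c
#include <stdint.h>

/* Hypothetical scalar model of a 256-bit register seen as 8 x 32-bit lanes. */
typedef struct { uint32_t lane[8]; } vec8x32;

/* Models _mm256_set1_epi64x: broadcast one 64-bit value to 4 elements.
   Viewed as 32-bit lanes, each element is (low half, high half). */
static vec8x32 model_set1_epi64( uint64_t x )
{
   vec8x32 v;
   for ( int i = 0; i < 4; i++ )
   {
      v.lane[ 2*i     ] = (uint32_t)x;            /* low 32 bits  */
      v.lane[ 2*i + 1 ] = (uint32_t)( x >> 32 );  /* high 32 bits */
   }
   return v;
}

/* Models _mm256_add_epi32: independent wrapping add in each 32-bit lane. */
static vec8x32 model_add_epi32( vec8x32 a, vec8x32 b )
{
   vec8x32 r;
   for ( int i = 0; i < 8; i++ )
      r.lane[i] = a.lane[i] + b.lane[i];
   return r;
}
```

With the constant 0x0000000400000000 only the odd 32-bit lanes (the high half of each 64-bit element) advance by 4, and the even lanes are left unchanged. That matches stepping four nonces per iteration (`n += 4`) if the nonce sits in the high half of each interleaved 64-bit element, which is an inference from the constant rather than something the diff states.
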
--- a/algo/quark/quark-4way.c
+++ b/algo/quark/quark-4way.c
@@ -67,7 +67,7 @@ void quark_8way_hash( void *state, const void *input )
    __mmask8 vh_mask;
    quark_8way_ctx_holder ctx;
    const uint32_t mask = 8;
-    const __m512i bit3_mask = m512_const1_64( mask );
+    const __m512i bit3_mask = _mm512_set1_epi64( mask );

    memcpy( &ctx, &quark_8way_ctx, sizeof(quark_8way_ctx) );

@@ -224,7 +224,7 @@ int scanhash_quark_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );

@@ -271,7 +271,7 @@ void quark_4way_hash( void *state, const void *input )
    __m256i vh_mask;
    int h_mask;
    quark_4way_ctx_holder ctx;
-    const __m256i bit3_mask = m256_const1_64( 8 );
+    const __m256i bit3_mask = _mm256_set1_epi64x( 8 );
    const __m256i zero = _mm256_setzero_si256();

    memcpy( &ctx, &quark_4way_ctx, sizeof(quark_4way_ctx) );
@@ -397,7 +397,7 @@ int scanhash_quark_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );

--- a/algo/ripemd/lbry-gate.c
+++ b/algo/ripemd/lbry-gate.c
@@ -4,24 +4,6 @@
 #include <string.h>
 #include <stdio.h>

-long double lbry_calc_network_diff( struct work *work )
-{
-        // sample for diff 43.281 : 1c05ea29
-        // todo: endian reversed on longpoll could be zr5 specific...
-
-   uint32_t nbits = swab32( work->data[ LBRY_NBITS_INDEX ] );
-   uint32_t bits = (nbits & 0xffffff);
-   int16_t shift = (swab32(nbits) & 0xff); // 0x1c = 28
-   long double d = (long double)0x0000ffff / (long double)bits;
-
-   for (int m=shift; m < 29; m++) d *= 256.0;
-   for (int m=29; m < shift; m++) d /= 256.0;
-   if (opt_debug_diff)
-      applog(LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", d, shift, bits);
-
-   return d;
-}
-
 // std_le should work but it doesn't
 void lbry_le_build_stratum_request( char *req, struct work *work,
                                      struct stratum_ctx *sctx )
@@ -41,31 +23,6 @@ void lbry_le_build_stratum_request( char *req, struct work *work,
   free(xnonce2str);
 }

-/*
-void lbry_build_block_header( struct work* g_work, uint32_t version,
-                             uint32_t *prevhash, uint32_t *merkle_root,
-                             uint32_t ntime, uint32_t nbits )
-{
-   int i;
-   memset( g_work->data, 0, sizeof(g_work->data) );
-   g_work->data[0] =  version;
-
-   if ( have_stratum )
-      for ( i = 0; i < 8; i++ )
-         g_work->data[1 + i] = le32dec( prevhash + i );
-   else
-      for (i = 0; i < 8; i++)
-         g_work->data[ 8-i ] = le32dec( prevhash + i );
-
-   for ( i = 0; i < 8; i++ )
-      g_work->data[9 + i] = be32dec( merkle_root + i );
-
-   g_work->data[ LBRY_NTIME_INDEX ] = ntime;
-   g_work->data[ LBRY_NBITS_INDEX ] = nbits;
-   g_work->data[28] = 0x80000000;
-}
-*/
-
 void lbry_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
 {
   unsigned char merkle_root[64] = { 0 };
@@ -112,9 +69,7 @@ bool register_lbry_algo( algo_gate_t* gate )
  gate->hash                  = (void*)&lbry_hash;
  gate->optimizations = AVX2_OPT | AVX512_OPT | SHA_OPT;
 #endif
-  gate->calc_network_diff     = (void*)&lbry_calc_network_diff;
  gate->build_stratum_request = (void*)&lbry_le_build_stratum_request;
-//  gate->build_block_header    = (void*)&build_block_header;
  gate->build_extraheader     = (void*)&lbry_build_extraheader;
  gate->ntime_index           = LBRY_NTIME_INDEX;
  gate->nbits_index           = LBRY_NBITS_INDEX;
--- a/algo/ripemd/ripemd-hash-4way.c
+++ b/algo/ripemd/ripemd-hash-4way.c
@@ -47,7 +47,7 @@ static const uint32_t IV[5] =
 do{ \
   a = _mm_add_epi32( mm128_rol_32( _mm_add_epi32( _mm_add_epi32( \
                _mm_add_epi32( a, f( b ,c, d ) ), r ), \
-                                 m128_const1_64( k ) ), s ), e ); \
+                                 _mm_set1_epi64x( k ) ), s ), e ); \
   c = mm128_rol_32( c, 10 );\
 } while (0)

@@ -251,11 +251,11 @@ static void ripemd160_4way_round( ripemd160_4way_context *sc )

 void ripemd160_4way_init( ripemd160_4way_context *sc )
 {
-   sc->val[0] = m128_const1_64( 0x6745230167452301 );
-   sc->val[1] = m128_const1_64( 0xEFCDAB89EFCDAB89 );
-   sc->val[2] = m128_const1_64( 0x98BADCFE98BADCFE );
-   sc->val[3] = m128_const1_64( 0x1032547610325476 );
-   sc->val[4] = m128_const1_64( 0xC3D2E1F0C3D2E1F0 );
+   sc->val[0] = _mm_set1_epi64x( 0x6745230167452301 );
+   sc->val[1] = _mm_set1_epi64x( 0xEFCDAB89EFCDAB89 );
+   sc->val[2] = _mm_set1_epi64x( 0x98BADCFE98BADCFE );
+   sc->val[3] = _mm_set1_epi64x( 0x1032547610325476 );
+   sc->val[4] = _mm_set1_epi64x( 0xC3D2E1F0C3D2E1F0 );
   sc->count_high = sc->count_low = 0;
 }

@@ -347,7 +347,7 @@ void ripemd160_4way_close( ripemd160_4way_context  *sc, void *dst )
 do{ \
   a = _mm256_add_epi32( mm256_rol_32( _mm256_add_epi32( _mm256_add_epi32( \
                _mm256_add_epi32( a, f( b ,c, d ) ), r ), \
-                                 m256_const1_64( k ) ), s ), e ); \
+                                 _mm256_set1_epi64x( k ) ), s ), e ); \
   c = mm256_rol_32( c, 10 );\
 } while (0)
    
@@ -552,11 +552,11 @@ static void ripemd160_8way_round( ripemd160_8way_context *sc )

 void ripemd160_8way_init( ripemd160_8way_context *sc )
 {
-   sc->val[0] = m256_const1_64( 0x6745230167452301 );
-   sc->val[1] = m256_const1_64( 0xEFCDAB89EFCDAB89 );
-   sc->val[2] = m256_const1_64( 0x98BADCFE98BADCFE );
-   sc->val[3] = m256_const1_64( 0x1032547610325476 );
-   sc->val[4] = m256_const1_64( 0xC3D2E1F0C3D2E1F0 );
+   sc->val[0] = _mm256_set1_epi64x( 0x6745230167452301 );
+   sc->val[1] = _mm256_set1_epi64x( 0xEFCDAB89EFCDAB89 );
+   sc->val[2] = _mm256_set1_epi64x( 0x98BADCFE98BADCFE );
+   sc->val[3] = _mm256_set1_epi64x( 0x1032547610325476 );
+   sc->val[4] = _mm256_set1_epi64x( 0xC3D2E1F0C3D2E1F0 );
   sc->count_high = sc->count_low = 0;
 }

@@ -649,7 +649,7 @@ void ripemd160_8way_close( ripemd160_8way_context  *sc, void *dst )
 do{ \
   a = _mm512_add_epi32( mm512_rol_32( _mm512_add_epi32( _mm512_add_epi32( \
                _mm512_add_epi32( a, f( b ,c, d ) ), r ), \
-                                 m512_const1_64( k ) ), s ), e ); \
+                                 _mm512_set1_epi64( k ) ), s ), e ); \
   c = mm512_rol_32( c, 10 );\
 } while (0)

@@ -853,11 +853,11 @@ static void ripemd160_16way_round( ripemd160_16way_context *sc )

 void ripemd160_16way_init( ripemd160_16way_context *sc )
 {
-   sc->val[0] = m512_const1_64( 0x6745230167452301 );
-   sc->val[1] = m512_const1_64( 0xEFCDAB89EFCDAB89 );
-   sc->val[2] = m512_const1_64( 0x98BADCFE98BADCFE );
-   sc->val[3] = m512_const1_64( 0x1032547610325476 );
-   sc->val[4] = m512_const1_64( 0xC3D2E1F0C3D2E1F0 );
+   sc->val[0] = _mm512_set1_epi64( 0x6745230167452301 );
+   sc->val[1] = _mm512_set1_epi64( 0xEFCDAB89EFCDAB89 );
+   sc->val[2] = _mm512_set1_epi64( 0x98BADCFE98BADCFE );
+   sc->val[3] = _mm512_set1_epi64( 0x1032547610325476 );
+   sc->val[4] = _mm512_set1_epi64( 0xC3D2E1F0C3D2E1F0 );
   sc->count_high = sc->count_low = 0;
 }

@@ -902,7 +902,7 @@ void ripemd160_16way_close( ripemd160_16way_context  *sc, void *dst )
   const int pad = block_size - 8;

   ptr = (unsigned)sc->count_low & ( block_size - 1U);
-   sc->buf[ ptr>>2 ] = m512_const1_32( 0x80 );
+   sc->buf[ ptr>>2 ] = _mm512_set1_epi32( 0x80 );
   ptr += 4;

   if ( ptr > pad )
--- a/algo/sha/sha256-hash-4way.c
+++ b/algo/sha/sha256-hash-4way.c
@@ -311,7 +311,7 @@ int sha256_4way_transform_le_short( __m128i *state_out, const __m128i *data,
   __m128i A, B, C, D, E, F, G, H;
   __m128i W[16];      memcpy_128( W, data, 16 );
   // Value required by H after round 60 to produce valid final hash
-   const __m128i H_ = m128_const1_32( 0x136032ED );
+   const __m128i H_ = _mm_set1_epi32( 0x136032ED );

   A = _mm_load_si128( state_in   );
   B = _mm_load_si128( state_in+1 );
@@ -408,14 +408,14 @@ int sha256_4way_transform_le_short( __m128i *state_out, const __m128i *data,
 void sha256_4way_init( sha256_4way_context *sc )
 {
   sc->count_high = sc->count_low = 0;
-   sc->val[0] = m128_const1_64( 0x6A09E6676A09E667 );
-   sc->val[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
-   sc->val[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
-   sc->val[3] = m128_const1_64( 0xA54FF53AA54FF53A );
-   sc->val[4] = m128_const1_64( 0x510E527F510E527F );
-   sc->val[5] = m128_const1_64( 0x9B05688C9B05688C );
-   sc->val[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   sc->val[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
+   sc->val[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
+   sc->val[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
+   sc->val[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
+   sc->val[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
+   sc->val[4] = _mm_set1_epi64x( 0x510E527F510E527F );
+   sc->val[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
+   sc->val[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   sc->val[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );
 }

 void sha256_4way_update( sha256_4way_context *sc, const void *data, size_t len )
@@ -458,7 +458,7 @@ void sha256_4way_close( sha256_4way_context *sc, void *dst )
    const int pad = buf_size - 8;

    ptr = (unsigned)sc->count_low & (buf_size - 1U);
-    sc->buf[ ptr>>2 ] = m128_const1_64( 0x0000008000000080 );
+    sc->buf[ ptr>>2 ] = _mm_set1_epi64x( 0x0000008000000080 );
    ptr += 4;

    if ( ptr > pad )
@@ -474,8 +474,8 @@ void sha256_4way_close( sha256_4way_context *sc, void *dst )
    high = (sc->count_high << 3) | (low >> 29);
    low = low << 3;

-    sc->buf[  pad     >> 2 ] = m128_const1_32( bswap_32( high ) );
-    sc->buf[( pad+4 ) >> 2 ] = m128_const1_32( bswap_32( low ) );
+    sc->buf[  pad     >> 2 ] = _mm_set1_epi32( bswap_32( high ) );
+    sc->buf[( pad+4 ) >> 2 ] = _mm_set1_epi32( bswap_32( low ) );
    sha256_4way_transform_be( sc->val, sc->buf, sc->val );

    mm128_block_bswap_32( dst, sc->val );
@@ -589,7 +589,6 @@ do { \
  _mm256_xor_si256( Y, _mm256_and_si256( X_xor_Y = _mm256_xor_si256( X, Y ), \
                                         Y_xor_Z ) )

-
 #define SHA2s_8WAY_STEP( A, B, C, D, E, F, G, H, i, j ) \
 do { \
  __m256i T0 = _mm256_add_epi32( _mm256_set1_epi32( K256[(j)+(i)] ), W[i] ); \
@@ -863,7 +862,7 @@ int sha256_8way_transform_le_short( __m256i *state_out, const __m256i *data,
 {
   __m256i A, B, C, D, E, F, G, H;
   __m256i W[16];  memcpy_256( W, data, 16 );
-   const __m256i H_ = m256_const1_32( 0x136032ED );
+   const __m256i H_ = _mm256_set1_epi32( 0x136032ED );

   A = _mm256_load_si256( state_in   );
   B = _mm256_load_si256( state_in+1 );
@@ -979,14 +978,14 @@ int sha256_8way_transform_le_short( __m256i *state_out, const __m256i *data,
 void sha256_8way_init( sha256_8way_context *sc )
 {
   sc->count_high = sc->count_low = 0;
-   sc->val[0] = m256_const1_64( 0x6A09E6676A09E667 );
-   sc->val[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
-   sc->val[2] = m256_const1_64( 0x3C6EF3723C6EF372 );
-   sc->val[3] = m256_const1_64( 0xA54FF53AA54FF53A );
-   sc->val[4] = m256_const1_64( 0x510E527F510E527F );
-   sc->val[5] = m256_const1_64( 0x9B05688C9B05688C );
-   sc->val[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
-   sc->val[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );
+   sc->val[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
+   sc->val[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
+   sc->val[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
+   sc->val[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
+   sc->val[4] = _mm256_set1_epi64x( 0x510E527F510E527F );
+   sc->val[5] = _mm256_set1_epi64x( 0x9B05688C9B05688C );
+   sc->val[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   sc->val[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );
 }

 // need to handle odd byte length for yespower.
@@ -1032,7 +1031,7 @@ void sha256_8way_close( sha256_8way_context *sc, void *dst )
    const int pad = buf_size - 8;

    ptr = (unsigned)sc->count_low & (buf_size - 1U);
-    sc->buf[ ptr>>2 ] = m256_const1_64( 0x0000008000000080 );
+    sc->buf[ ptr>>2 ] = _mm256_set1_epi64x( 0x0000008000000080 );
    ptr += 4;

    if ( ptr > pad )
@@ -1048,8 +1047,8 @@ void sha256_8way_close( sha256_8way_context *sc, void *dst )
    high = (sc->count_high << 3) | (low >> 29);
    low = low << 3;

-    sc->buf[   pad     >> 2 ] = m256_const1_32( bswap_32( high ) );
-    sc->buf[ ( pad+4 ) >> 2 ] = m256_const1_32( bswap_32( low ) );
+    sc->buf[   pad     >> 2 ] = _mm256_set1_epi32( bswap_32( high ) );
+    sc->buf[ ( pad+4 ) >> 2 ] = _mm256_set1_epi32( bswap_32( low ) );

    sha256_8way_transform_be( sc->val, sc->buf, sc->val );

@@ -1360,7 +1359,7 @@ int sha256_16way_transform_le_short( __m512i *state_out, const __m512i *data,
   // Value for H at round 60, before adding K, needed to produce valid final
   // hash where H == 0.
   // H_ =  -( H256[7] + K256[60] );
-   const __m512i H_ = m512_const1_32( 0x136032ED );
+   const __m512i H_ = _mm512_set1_epi32( 0x136032ED );

   A = _mm512_load_si512( state_in   );
   B = _mm512_load_si512( state_in+1 );
@@ -1453,14 +1452,14 @@ int sha256_16way_transform_le_short( __m512i *state_out, const __m512i *data,
 void sha256_16way_init( sha256_16way_context *sc )
 {
   sc->count_high = sc->count_low = 0;
-   sc->val[0] = m512_const1_64( 0x6A09E6676A09E667 );
-   sc->val[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
-   sc->val[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
-   sc->val[3] = m512_const1_64( 0xA54FF53AA54FF53A );
-   sc->val[4] = m512_const1_64( 0x510E527F510E527F );
-   sc->val[5] = m512_const1_64( 0x9B05688C9B05688C );
-   sc->val[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
-   sc->val[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );
+   sc->val[0] = _mm512_set1_epi64( 0x6A09E6676A09E667 );
+   sc->val[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
+   sc->val[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
+   sc->val[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
+   sc->val[4] = _mm512_set1_epi64( 0x510E527F510E527F );
+   sc->val[5] = _mm512_set1_epi64( 0x9B05688C9B05688C );
+   sc->val[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
+   sc->val[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );
 }

 void sha256_16way_update( sha256_16way_context *sc, const void *data,
@@ -1504,7 +1503,7 @@ void sha256_16way_close( sha256_16way_context *sc, void *dst )
    const int pad = buf_size - 8;

    ptr = (unsigned)sc->count_low & (buf_size - 1U);
-    sc->buf[ ptr>>2 ] = m512_const1_64( 0x0000008000000080 );
+    sc->buf[ ptr>>2 ] = _mm512_set1_epi64( 0x0000008000000080 );
    ptr += 4;

    if ( ptr > pad )
@@ -1520,8 +1519,8 @@ void sha256_16way_close( sha256_16way_context *sc, void *dst )
    high = (sc->count_high << 3) | (low >> 29);
    low = low << 3;

-    sc->buf[   pad     >> 2 ] = m512_const1_32( bswap_32( high ) );
-    sc->buf[ ( pad+4 ) >> 2 ] = m512_const1_32( bswap_32( low ) );
+    sc->buf[   pad     >> 2 ] = _mm512_set1_epi32( bswap_32( high ) );
+    sc->buf[ ( pad+4 ) >> 2 ] = _mm512_set1_epi32( bswap_32( low ) );

    sha256_16way_transform_be( sc->val, sc->buf, sc->val );

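Note on the `H_` constant in the `transform_le_short` hunks above: the comment in the 16-way version gives the formula `H_ = -( H256[7] + K256[60] )`. A small sketch that checks the constant 0x136032ED against that formula, assuming the standard SHA-256 values for the last IV word and for round constant 60; the names `SHA256_IV7`, `SHA256_K60` and `h_round60_target` are illustrative only:

```c
#include <stdint.h>

/* Last word of the standard SHA-256 initial hash value (also seen in the
   init functions above as 0x5BE0CD19). */
static const uint32_t SHA256_IV7 = 0x5BE0CD19;

/* Standard SHA-256 round constant K[60] (assumed here, not shown in the diff). */
static const uint32_t SHA256_K60 = 0x90BEFFFA;

/* Negation modulo 2^32 of ( IV[7] + K[60] ), per the formula in the comment. */
static uint32_t h_round60_target( void )
{
   return 0u - ( SHA256_IV7 + SHA256_K60 );
}
```

The function returns 0x136032ED, so the literal passed to `_mm_set1_epi32`, `_mm256_set1_epi32` and `_mm512_set1_epi32` agrees with the documented derivation.
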
--- a/algo/sha/sha256d-4way.c
+++ b/algo/sha/sha256d-4way.c
@@ -28,32 +28,32 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
   __m512i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m512i last_byte = m512_const1_32( 0x80000000 );
-   const __m512i sixteen = m512_const1_32( 16 );
+   const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
+   const __m512i sixteen = _mm512_set1_epi32( 16 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m512_const1_32( pdata[i] );
+       vdata[i] = _mm512_set1_epi32( pdata[i] );

   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_512( vdata+16 + 5, 10 );
-   vdata[16+15] = m512_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm512_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_512( block + 9, 6 );
-   block[15] = m512_const1_32( 32*8 ); // bit count
+   block[15] = _mm512_set1_epi32( 32*8 ); // bit count
   
   // initialize state
-   initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m512_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m512_const1_64( 0x510E527F510E527F );
-   initstate[5] = m512_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm512_set1_epi64( 0x6A09E6676A09E667 );
+   initstate[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm512_set1_epi64( 0x510E527F510E527F );
+   initstate[5] = _mm512_set1_epi64( 0x9B05688C9B05688C );
+   initstate[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );

   sha256_16way_transform_le( midstate1, vdata, initstate );

@@ -116,31 +116,31 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
   __m256i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m256i last_byte = m256_const1_32( 0x80000000 );
-   const __m256i eight = m256_const1_32( 8 );
+   const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
+   const __m256i eight = _mm256_set1_epi32( 8 );

   for ( int i = 0; i < 19; i++ )
-      vdata[i] = m256_const1_32( pdata[i] );
+      vdata[i] = _mm256_set1_epi32( pdata[i] );

   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_256( vdata+16 + 5, 10 );
-   vdata[16+15] = m256_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm256_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_256( block + 9, 6 );
-   block[15] = m256_const1_32( 32*8 ); // bit count
+   block[15] = _mm256_set1_epi32( 32*8 ); // bit count
   
   // initialize state
-   initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m256_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m256_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m256_const1_64( 0x510E527F510E527F );
-   initstate[5] = m256_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
+   initstate[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm256_set1_epi64x( 0x510E527F510E527F );
+   initstate[5] = _mm256_set1_epi64x( 0x9B05688C9B05688C );
+   initstate[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );

   sha256_8way_transform_le( midstate1, vdata, initstate );
   
@@ -204,31 +204,31 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
   __m128i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m128i last_byte = m128_const1_32( 0x80000000 );
-   const __m128i four = m128_const1_32( 4 );
+   const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
+   const __m128i four = _mm_set1_epi32( 4 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m128_const1_32( pdata[i] );
+       vdata[i] = _mm_set1_epi32( pdata[i] );

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_128( vdata+16 + 5, 10 );
-   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_128( block + 9, 6 );
-   block[15] = m128_const1_32( 32*8 ); // bit count
+   block[15] = _mm_set1_epi32( 32*8 ); // bit count

   // initialize state
-   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m128_const1_64( 0x510E527F510E527F );
-   initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
+   initstate[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm_set1_epi64x( 0x510E527F510E527F );
+   initstate[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
+   initstate[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );

   // hash first 64 bytes of data
   sha256_4way_transform_le( midstate1, vdata, initstate );
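
Note on the padding set up in the three sha256d scanhash functions above: each one builds the second 64-byte block of an 80-byte header in place. A single-lane scalar sketch of that layout, assuming the usual SHA-256 padding rules; `pad_second_block` is a hypothetical helper, not part of the source tree:

```c
#include <stdint.h>

/* Fill the second SHA-256 block (16 words) for an 80-byte message.
   tail[] holds header words 16..19, the last of which is the nonce. */
static void pad_second_block( uint32_t block[16], const uint32_t tail[4] )
{
   for ( int i = 0; i < 4; i++ )
      block[i] = tail[i];        /* remaining 16 bytes of the header      */
   block[4] = 0x80000000;        /* padding marker, as in vdata[16+4]     */
   for ( int i = 5; i < 15; i++ )
      block[i] = 0;              /* ten zero words, as in memset_zero(..,10) */
   block[15] = 80 * 8;           /* message length in bits, vdata[16+15]  */
}
```

The vector code does the same thing with every lane holding an identical copy of these words, except for the nonce word, which differs per lane.
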
--- a/algo/sha/sha256dt.c
+++ b/algo/sha/sha256dt.c
@@ -0,0 +1,268 @@
+#include "algo-gate-api.h"
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+#include "sha-hash-4way.h"
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define SHA256DT_16WAY 1
+#elif defined(__AVX2__)
+  #define SHA256DT_8WAY 1
+#else
+  #define SHA256DT_4WAY 1
+#endif
+
+#if defined(SHA256DT_16WAY)
+
+int scanhash_sha256dt_16way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m512i  vdata[32]    __attribute__ ((aligned (128)));
+   __m512i  block[16]    __attribute__ ((aligned (64)));
+   __m512i  hash32[8]    __attribute__ ((aligned (64)));
+   __m512i  initstate[8] __attribute__ ((aligned (64)));
+   __m512i  midstate1[8] __attribute__ ((aligned (64)));
+   __m512i  midstate2[8] __attribute__ ((aligned (64)));
+   __m512i  mexp_pre[16] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 16;
+   uint32_t n = first_nonce;
+   __m512i *noncev = vdata + 19; 
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
+   const __m512i sixteen = _mm512_set1_epi32( 16 );
+
+   for ( int i = 0; i < 19; i++ )
+      vdata[i] = _mm512_set1_epi32( pdata[i] );
+
+   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
+                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
+
+   vdata[16+4] = last_byte;
+   memset_zero_512( vdata+16 + 5, 10 );
+   vdata[16+15] = _mm512_set1_epi32( 0x480 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_512( block + 9, 6 );
+   block[15] = _mm512_set1_epi32( 0x300 ); // bit count
+   
+   initstate[0] = _mm512_set1_epi64( 0xdfa9bf2cdfa9bf2c );
+   initstate[1] = _mm512_set1_epi64( 0xb72074d4b72074d4 );
+   initstate[2] = _mm512_set1_epi64( 0x6bb011226bb01122 );
+   initstate[3] = _mm512_set1_epi64( 0xd338e869d338e869 );
+   initstate[4] = _mm512_set1_epi64( 0xaa3ff126aa3ff126 );
+   initstate[5] = _mm512_set1_epi64( 0x475bbf30475bbf30 );
+   initstate[6] = _mm512_set1_epi64( 0x8fd52e5b8fd52e5b );
+   initstate[7] = _mm512_set1_epi64( 0x9f75c9ad9f75c9ad );
+
+   sha256_16way_transform_le( midstate1, vdata, initstate );
+   
+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_16way_prehash_3rounds( midstate2, mexp_pre, vdata+16, midstate1 );
+
+   do
+   {
+      sha256_16way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                 mexp_pre );
+      sha256_16way_transform_le( hash32, block, initstate );
+      mm512_block_bswap_32( hash32, hash32 );    
+
+      for ( int lane = 0; lane < 16; lane++ )
+      if ( hash32_d7[ lane ] <= targ32_d7 )
+      {
+         extr_lane_16x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      *noncev = _mm512_add_epi32( *noncev, sixteen );
+      n += 16;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+
+#endif
+
+#if defined(SHA256DT_8WAY)
+
+int scanhash_sha256dt_8way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m256i  vdata[32]    __attribute__ ((aligned (64)));
+   __m256i  block[16]    __attribute__ ((aligned (32)));
+   __m256i  hash32[8]    __attribute__ ((aligned (32)));
+   __m256i  initstate[8] __attribute__ ((aligned (32)));
+   __m256i  midstate1[8] __attribute__ ((aligned (32)));
+   __m256i  midstate2[8] __attribute__ ((aligned (32)));
+   __m256i  mexp_pre[16] __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   uint32_t n = first_nonce;
+   __m256i *noncev = vdata + 19;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
+   const __m256i eight = _mm256_set1_epi32( 8 );
+
+   for ( int i = 0; i < 19; i++ )
+      vdata[i] = _mm256_set1_epi32( pdata[i] );
+
+   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
+
+   vdata[16+4] = last_byte;
+   memset_zero_256( vdata+16 + 5, 10 );
+   vdata[16+15] = _mm256_set1_epi32( 0x480 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_256( block + 9, 6 );
+   block[15] = _mm256_set1_epi32( 0x300 ); // bit count
+   
+   // initialize state
+   initstate[0] = _mm256_set1_epi64x( 0xdfa9bf2cdfa9bf2c );
+   initstate[1] = _mm256_set1_epi64x( 0xb72074d4b72074d4 );
+   initstate[2] = _mm256_set1_epi64x( 0x6bb011226bb01122 );
+   initstate[3] = _mm256_set1_epi64x( 0xd338e869d338e869 );
+   initstate[4] = _mm256_set1_epi64x( 0xaa3ff126aa3ff126 );
+   initstate[5] = _mm256_set1_epi64x( 0x475bbf30475bbf30 );
+   initstate[6] = _mm256_set1_epi64x( 0x8fd52e5b8fd52e5b );
+   initstate[7] = _mm256_set1_epi64x( 0x9f75c9ad9f75c9ad );
+
+   sha256_8way_transform_le( midstate1, vdata, initstate );
+
+   // Do 3 rounds on the first 12 bytes of the next block
+   sha256_8way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );
+   
+   do
+   {
+      sha256_8way_final_rounds( block, vdata+16, midstate1, midstate2,
+                                mexp_pre );
+      sha256_8way_transform_le( hash32, block, initstate );
+      mm256_block_bswap_32( hash32, hash32 );
+
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( hash32_d7[ lane ] <= targ32_d7 )
+      {
+         extr_lane_8x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      *noncev = _mm256_add_epi32( *noncev, eight );
+      n += 8;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#endif
+
+
+#if defined(SHA256DT_4WAY)
+
+int scanhash_sha256dt_4way( struct work *work, const uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   __m128i  vdata[32]    __attribute__ ((aligned (64)));
+   __m128i  block[16]    __attribute__ ((aligned (32)));
+   __m128i  hash32[8]    __attribute__ ((aligned (32)));
+   __m128i  initstate[8] __attribute__ ((aligned (32)));
+   __m128i  midstate[8]  __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *hash32_d7 =  (uint32_t*)&( hash32[7] );
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t targ32_d7 = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   uint32_t n = first_nonce;
+   __m128i *noncev = vdata + 19;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
+   const __m128i four = _mm_set1_epi32( 4 );
+
+   for ( int i = 0; i < 19; i++ )
+       vdata[i] = _mm_set1_epi32( pdata[i] );
+
+   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
+
+   vdata[16+4] = last_byte;
+   memset_zero_128( vdata+16 + 5, 10 );
+   vdata[16+15] = _mm_set1_epi32( 0x480 ); // bit count
+
+   block[ 8] = last_byte;
+   memset_zero_128( block + 9, 6 );
+   block[15] = _mm_set1_epi32( 0x300 ); // bit count
+   
+   // initialize state
+   initstate[0] = _mm_set1_epi64x( 0xdfa9bf2cdfa9bf2c );
+   initstate[1] = _mm_set1_epi64x( 0xb72074d4b72074d4 );
+   initstate[2] = _mm_set1_epi64x( 0x6bb011226bb01122 );
+   initstate[3] = _mm_set1_epi64x( 0xd338e869d338e869 );
+   initstate[4] = _mm_set1_epi64x( 0xaa3ff126aa3ff126 );
+   initstate[5] = _mm_set1_epi64x( 0x475bbf30475bbf30 );
+   initstate[6] = _mm_set1_epi64x( 0x8fd52e5b8fd52e5b );
+   initstate[7] = _mm_set1_epi64x( 0x9f75c9ad9f75c9ad );
+
+   // hash first 64 bytes of data
+   sha256_4way_transform_le( midstate, vdata, initstate );
+
+   do
+   {
+      sha256_4way_transform_le( block,  vdata+16, midstate  );
+      sha256_4way_transform_le( hash32, block,    initstate );
+      mm128_block_bswap_32( hash32, hash32 );
+
+      for ( int lane = 0; lane < 4; lane++ )
+      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
+      {
+         extr_lane_4x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         {
+            pdata[19] = n + lane;
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      *noncev = _mm_add_epi32( *noncev, four );
+      n += 4;
+   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#endif
+
+bool register_sha256dt_algo( algo_gate_t* gate )
+{
+    gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
+#if defined(SHA256DT_16WAY)
+    gate->scanhash   = (void*)&scanhash_sha256dt_16way;
+#elif defined(SHA256DT_8WAY)
+    gate->scanhash   = (void*)&scanhash_sha256dt_8way;
+#else
+    gate->scanhash   = (void*)&scanhash_sha256dt_4way;
+#endif
+    return true;
+}
+
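Review note on the nonce vector used by the scanhash loops in the file above: lane i of noncev starts at n+i and every pass adds the lane count, which is why a hit in a given lane is reported as n + lane. The sketch below reproduces that stepping for the 4-way case. The helper names nonce_lanes_4way and nonce_lanes_selfcheck are hypothetical and not part of the patch; the scalar fallback is only there so the sketch also builds without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: fill out[0..3] with the per-lane nonces after
// `steps` loop iterations, starting from nonce n.
static void nonce_lanes_4way( uint32_t out[4], uint32_t n, int steps )
{
#if defined(__SSE2__)
   // _mm_set_epi32 takes its arguments from the highest lane down to the
   // lowest, so lane 0 receives n and lane 3 receives n+3.
   __m128i noncev = _mm_set_epi32( (int)(n+3), (int)(n+2), (int)(n+1), (int)n );
   const __m128i four = _mm_set1_epi32( 4 );
   for ( int i = 0; i < steps; i++ )
      noncev = _mm_add_epi32( noncev, four );
   _mm_storeu_si128( (__m128i*)out, noncev );
#else
   // Scalar fallback with the same lane layout.
   for ( int lane = 0; lane < 4; lane++ )
      out[lane] = n + (uint32_t)lane + 4u * (uint32_t)steps;
#endif
}

// Returns 1 when every lane holds start + 4*steps + lane.
static int nonce_lanes_selfcheck( void )
{
   uint32_t lanes[4];
   nonce_lanes_4way( lanes, 100, 2 );
   for ( int lane = 0; lane < 4; lane++ )
      assert( lanes[lane] == 108u + (uint32_t)lane );
   return 1;
}
```

Calling nonce_lanes_selfcheck() from any test harness exercises the same lane arithmetic the loop relies on.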
--- a/algo/sha/sha256q-4way.c
+++ b/algo/sha/sha256q-4way.c
@@ -68,7 +68,7 @@ int scanhash_sha256q_16way( struct work *work, const uint32_t max_nonce,
           submit_solution( work, lane_hash, mythr );
        }
      }
-      *noncev = _mm512_add_epi32( *noncev, m512_const1_32( 16 ) );
+      *noncev = _mm512_add_epi32( *noncev, _mm512_set1_epi32( 16 ) );
      n += 16;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
@@ -140,7 +140,7 @@ int scanhash_sha256q_8way( struct work *work, const uint32_t max_nonce,
           submit_solution( work, lane_hash, mythr );
        }
      }
-      *noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
+      *noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
      n += 8;
   } while ( (n < last_nonce) && !work_restart[thr_id].restart );
   pdata[19] = n;
--- a/algo/sha/sha256t-4way.c
+++ b/algo/sha/sha256t-4way.c
@@ -28,31 +28,31 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
   __m512i *noncev = vdata + 19; 
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m512i last_byte = m512_const1_32( 0x80000000 );
-   const __m512i sixteen = m512_const1_32( 16 );
+   const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
+   const __m512i sixteen = _mm512_set1_epi32( 16 );

   for ( int i = 0; i < 19; i++ )
-      vdata[i] = m512_const1_32( pdata[i] );
+      vdata[i] = _mm512_set1_epi32( pdata[i] );

   *noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
                               n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_512( vdata+16 + 5, 10 );
-   vdata[16+15] = m512_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm512_set1_epi32( 80*8 ); // bit count
   
   block[ 8] = last_byte;
   memset_zero_512( block + 9, 6 );
-   block[15] = m512_const1_32( 32*8 ); // bit count
+   block[15] = _mm512_set1_epi32( 32*8 ); // bit count
   
-   initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m512_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m512_const1_64( 0x510E527F510E527F );
-   initstate[5] = m512_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm512_set1_epi64( 0x6A09E6676A09E667 );
+   initstate[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm512_set1_epi64( 0x510E527F510E527F );
+   initstate[5] = _mm512_set1_epi64( 0x9B05688C9B05688C );
+   initstate[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );

   sha256_16way_transform_le( midstate1, vdata, initstate );
   
@@ -120,31 +120,31 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,
   __m256i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m256i last_byte = m256_const1_32( 0x80000000 );
-   const __m256i eight = m256_const1_32( 8 );
+   const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
+   const __m256i eight = _mm256_set1_epi32( 8 );

   for ( int i = 0; i < 19; i++ )
-      vdata[i] = m256_const1_32( pdata[i] );
+      vdata[i] = _mm256_set1_epi32( pdata[i] );

   *noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_256( vdata+16 + 5, 10 );
-   vdata[16+15] = m256_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm256_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_256( block + 9, 6 );
-   block[15] = m256_const1_32( 32*8 ); // bit count
+   block[15] = _mm256_set1_epi32( 32*8 ); // bit count
   
   // initialize state
-   initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m256_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m256_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m256_const1_64( 0x510E527F510E527F );
-   initstate[5] = m256_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
+   initstate[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm256_set1_epi64x( 0x510E527F510E527F );
+   initstate[5] = _mm256_set1_epi64x( 0x9B05688C9B05688C );
+   initstate[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );

   sha256_8way_transform_le( midstate1, vdata, initstate );

@@ -215,31 +215,31 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
   __m128i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m128i last_byte = m128_const1_32( 0x80000000 );
-   const __m128i four = m128_const1_32( 4 );
+   const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
+   const __m128i four = _mm_set1_epi32( 4 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m128_const1_32( pdata[i] );
+       vdata[i] = _mm_set1_epi32( pdata[i] );

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_128( vdata+16 + 5, 10 );
-   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_128( block + 9, 6 );
-   block[15] = m128_const1_32( 32*8 ); // bit count
+   block[15] = _mm_set1_epi32( 32*8 ); // bit count
   
   // initialize state
-   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m128_const1_64( 0x510E527F510E527F );
-   initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
+   initstate[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm_set1_epi64x( 0x510E527F510E527F );
+   initstate[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
+   initstate[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );

   // hash first 64 bytes of data
   sha256_4way_transform_le( midstate1, vdata, initstate );
@@ -302,31 +302,31 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
   __m128i *noncev = vdata + 19;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;
-   const __m128i last_byte = m128_const1_32( 0x80000000 );
-   const __m128i four = m128_const1_32( 4 );
+   const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
+   const __m128i four = _mm_set1_epi32( 4 );

   for ( int i = 0; i < 19; i++ )
-       vdata[i] = m128_const1_32( pdata[i] );
+       vdata[i] = _mm_set1_epi32( pdata[i] );

   *noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );

   vdata[16+4] = last_byte;
   memset_zero_128( vdata+16 + 5, 10 );
-   vdata[16+15] = m128_const1_32( 80*8 ); // bit count
+   vdata[16+15] = _mm_set1_epi32( 80*8 ); // bit count

   block[ 8] = last_byte;
   memset_zero_128( block + 9, 6 );
-   block[15] = m128_const1_32( 32*8 ); // bit count
+   block[15] = _mm_set1_epi32( 32*8 ); // bit count
   
   // initialize state
-   initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
-   initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
-   initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
-   initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
-   initstate[4] = m128_const1_64( 0x510E527F510E527F );
-   initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
-   initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
-   initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
+   initstate[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
+   initstate[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
+   initstate[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
+   initstate[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
+   initstate[4] = _mm_set1_epi64x( 0x510E527F510E527F );
+   initstate[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
+   initstate[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
+   initstate[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );

   // hash first 64 bytes of data
   sha256_4way_transform_le( midstate, vdata, initstate );
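Review note on the padding constants in the sha256t hunks above: the 80-byte header spans two 64-byte blocks, so the second block carries words 16..19, the 0x80000000 padding word, ten zero words, and the bit count 80*8. The 32-byte digest that feeds the next round is padded the same way with bit count 32*8. A scalar sketch of both layouts follows; the function names are hypothetical, and the words are assumed to already be in the order the transform expects.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: second 64-byte block of an 80-byte header.
static void pad_header_block2( uint32_t blk[16], const uint32_t hdr[20] )
{
   memcpy( blk, hdr + 16, 4 * sizeof(uint32_t) );   // words 16..19, 19 = nonce
   blk[4] = 0x80000000;                             // first padding bit
   memset( blk + 5, 0, 10 * sizeof(uint32_t) );     // words 5..14 are zero
   blk[15] = 80 * 8;                                // message length in bits
}

// Hypothetical helper: single block holding a 32-byte digest.
static void pad_digest_block( uint32_t blk[16], const uint32_t digest[8] )
{
   memcpy( blk, digest, 8 * sizeof(uint32_t) );     // words 0..7
   blk[8] = 0x80000000;                             // first padding bit
   memset( blk + 9, 0, 6 * sizeof(uint32_t) );      // words 9..14 are zero
   blk[15] = 32 * 8;                                // message length in bits
}

// Returns 1 when both layouts match the constants used in the patch.
static int padding_selfcheck( void )
{
   uint32_t hdr[20], digest[8], blk[16];
   for ( uint32_t i = 0; i < 20; i++ ) hdr[i] = i + 1;
   for ( uint32_t i = 0; i < 8; i++ )  digest[i] = 0xA0u + i;

   pad_header_block2( blk, hdr );
   assert( blk[3] == 20 && blk[4] == 0x80000000 );
   assert( blk[14] == 0 && blk[15] == 640 );

   pad_digest_block( blk, digest );
   assert( blk[7] == 0xA7u && blk[8] == 0x80000000 );
   assert( blk[14] == 0 && blk[15] == 256 );
   return 1;
}
```

Because only word 3 of the second block (the nonce) changes between iterations, the first block can be hashed once into a midstate before the loop.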
--- a/algo/sha/sha512-hash-4way.c
+++ b/algo/sha/sha512-hash-4way.c
@@ -155,14 +155,14 @@ sha512_8way_round( sha512_8way_context *ctx,  __m512i *in, __m512i r[8] )
   }
   else
   {
-      A = m512_const1_64( 0x6A09E667F3BCC908 );
-      B = m512_const1_64( 0xBB67AE8584CAA73B );
-      C = m512_const1_64( 0x3C6EF372FE94F82B );
-      D = m512_const1_64( 0xA54FF53A5F1D36F1 );
-      E = m512_const1_64( 0x510E527FADE682D1 );
-      F = m512_const1_64( 0x9B05688C2B3E6C1F );
-      G = m512_const1_64( 0x1F83D9ABFB41BD6B );
-      H = m512_const1_64( 0x5BE0CD19137E2179 );
+      A = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
+      B = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
+      C = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
+      D = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
+      E = _mm512_set1_epi64( 0x510E527FADE682D1 );
+      F = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
+      G = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
+      H = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
   }

   for ( i = 0; i < 80; i += 8 )
@@ -191,14 +191,14 @@ sha512_8way_round( sha512_8way_context *ctx,  __m512i *in, __m512i r[8] )
   else
   {
      ctx->initialized = true;
-      r[0] = _mm512_add_epi64( A, m512_const1_64( 0x6A09E667F3BCC908 ) );
-      r[1] = _mm512_add_epi64( B, m512_const1_64( 0xBB67AE8584CAA73B ) );
-      r[2] = _mm512_add_epi64( C, m512_const1_64( 0x3C6EF372FE94F82B ) );
-      r[3] = _mm512_add_epi64( D, m512_const1_64( 0xA54FF53A5F1D36F1 ) );
-      r[4] = _mm512_add_epi64( E, m512_const1_64( 0x510E527FADE682D1 ) );
-      r[5] = _mm512_add_epi64( F, m512_const1_64( 0x9B05688C2B3E6C1F ) );
-      r[6] = _mm512_add_epi64( G, m512_const1_64( 0x1F83D9ABFB41BD6B ) );
-      r[7] = _mm512_add_epi64( H, m512_const1_64( 0x5BE0CD19137E2179 ) );
+      r[0] = _mm512_add_epi64( A, _mm512_set1_epi64( 0x6A09E667F3BCC908 ) );
+      r[1] = _mm512_add_epi64( B, _mm512_set1_epi64( 0xBB67AE8584CAA73B ) );
+      r[2] = _mm512_add_epi64( C, _mm512_set1_epi64( 0x3C6EF372FE94F82B ) );
+      r[3] = _mm512_add_epi64( D, _mm512_set1_epi64( 0xA54FF53A5F1D36F1 ) );
+      r[4] = _mm512_add_epi64( E, _mm512_set1_epi64( 0x510E527FADE682D1 ) );
+      r[5] = _mm512_add_epi64( F, _mm512_set1_epi64( 0x9B05688C2B3E6C1F ) );
+      r[6] = _mm512_add_epi64( G, _mm512_set1_epi64( 0x1F83D9ABFB41BD6B ) );
+      r[7] = _mm512_add_epi64( H, _mm512_set1_epi64( 0x5BE0CD19137E2179 ) );
   }
 }

@@ -239,14 +239,11 @@ void sha512_8way_close( sha512_8way_context *sc, void *dst )
    unsigned ptr;
    const int buf_size = 128;
    const int pad = buf_size - 16;
-    const __m512i shuff_bswap64 = m512_const_64(
-                                    0x38393a3b3c3d3e3f, 0x3031323334353637,
-                                    0x28292a2b2c2d2e2f, 0x2021222324252627,
-                                    0x18191a1b1c1d1e1f, 0x1011121314151617,
-                                    0x08090a0b0c0d0e0f, 0x0001020304050607 );
+    const __m512i shuff_bswap64 = mm512_bcast_m128( _mm_set_epi64x(
+                                    0x08090a0b0c0d0e0f, 0x0001020304050607 ) );

    ptr = (unsigned)sc->count & (buf_size - 1U);
-    sc->buf[ ptr>>3 ] = m512_const1_64( 0x80 );
+    sc->buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
    ptr += 8;
    if ( ptr > pad )
    {
@@ -271,51 +268,56 @@ void sha512_8way_close( sha512_8way_context *sc, void *dst )

 // SHA-512 4 way 64 bit

+#define BSG5_0( x )     mm256_xor3( mm256_ror_64( x, 28 ), \
+                                    mm256_ror_64( x, 34 ), \
+                                    mm256_ror_64( x, 39 ) )
+
+#define BSG5_1( x )     mm256_xor3( mm256_ror_64( x, 14 ), \
+                                    mm256_ror_64( x, 18 ), \
+                                    mm256_ror_64( x, 41 ) )
+
+#define SSG5_0( x )     mm256_xor3( mm256_ror_64( x,  1 ), \
+                                    mm256_ror_64( x,  8 ), \
+                                    _mm256_srli_epi64( x, 7 ) ) 
+
+#define SSG5_1( x )     mm256_xor3( mm256_ror_64( x, 19 ), \
+                                    mm256_ror_64( x, 61 ), \
+                                    _mm256_srli_epi64( x, 6 ) )
+
+#if defined(__AVX512VL__)
+//TODO Enable for AVX10_256
+// 4 way is not used with AVX512 but will be with AVX10_256 when it
+// becomes available.
+
+#define CH( X, Y, Z )    _mm256_ternarylogic_epi64( X, Y, Z, 0xca )
+
+#define MAJ( X, Y, Z )   _mm256_ternarylogic_epi64( X, Y, Z, 0xe8 )
+   
+#define SHA3_4WAY_STEP( A, B, C, D, E, F, G, H, i ) \
+do { \
+  __m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[i] ); \
+  __m256i T1 = BSG5_1( E ); \
+  __m256i T2 = BSG5_0( A ); \
+  T0 = _mm256_add_epi64( T0, CH( E, F, G ) ); \
+  T1 = _mm256_add_epi64( T1, H ); \
+  T2 = _mm256_add_epi64( T2, MAJ( A, B, C ) ); \
+  T1 = _mm256_add_epi64( T1, T0 ); \
+  D  = _mm256_add_epi64( D,  T1 ); \
+  H  = _mm256_add_epi64( T1, T2 ); \
+} while (0)
+
+#else   // AVX2 only
+
 #define CH(X, Y, Z) \
   _mm256_xor_si256( _mm256_and_si256( _mm256_xor_si256( Y, Z ), X ), Z ) 

 #define MAJ(X, Y, Z) \
  _mm256_xor_si256( Y, _mm256_and_si256( X_xor_Y = _mm256_xor_si256( X, Y ), \
                                         Y_xor_Z ) )
-                    
-#define BSG5_0(x) \
-  mm256_ror_64( _mm256_xor_si256( mm256_ror_64( \
-                   _mm256_xor_si256( mm256_ror_64( x,  5 ), x ), 6 ), x ), 28 )
-
-#define BSG5_1(x) \
-  mm256_ror_64( _mm256_xor_si256( mm256_ror_64( \
-                   _mm256_xor_si256( mm256_ror_64( x, 23 ), x ), 4 ), x ), 14 )
-
-/*
-#define SSG5_0(x) \
-   _mm256_xor_si256( _mm256_xor_si256( \
-        mm256_ror_64(x,  1), mm256_ror_64(x,  8) ), _mm256_srli_epi64(x, 7) ) 
-
-#define SSG5_1(x) \
-   _mm256_xor_si256( _mm256_xor_si256( \
-        mm256_ror_64(x, 19), mm256_ror_64(x, 61) ), _mm256_srli_epi64(x, 6) )
-*/
-// Interleave SSG0 & SSG1 for better throughput.
-// return ssg0(w0) + ssg1(w1)
-static inline __m256i ssg512_add( __m256i w0, __m256i w1 )
-{
-   __m256i w0a, w1a, w0b, w1b;
-   w0a = mm256_ror_64( w0, 1 );
-   w1a = mm256_ror_64( w1,19 );
-   w0b = mm256_ror_64( w0, 8 );
-   w1b = mm256_ror_64( w1,61 );
-   w0a = _mm256_xor_si256( w0a, w0b );
-   w1a = _mm256_xor_si256( w1a, w1b );
-   w0b = _mm256_srli_epi64( w0, 7 );
-   w1b = _mm256_srli_epi64( w1, 6 );
-   w0a = _mm256_xor_si256( w0a, w0b );
-   w1a = _mm256_xor_si256( w1a, w1b );
-   return _mm256_add_epi64( w0a, w1a );
-}

 #define SHA3_4WAY_STEP( A, B, C, D, E, F, G, H, i ) \
 do { \
-  __m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[ i ] ); \
+  __m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[i] ); \
  __m256i T1 = BSG5_1( E ); \
  __m256i T2 = BSG5_0( A ); \
  T0 = _mm256_add_epi64( T0, CH( E, F, G ) ); \
@@ -327,19 +329,27 @@ do { \
  H  = _mm256_add_epi64( T1, T2 ); \
 } while (0)

+#endif  // AVX512VL AVX10_256
+
 static void
 sha512_4way_round( sha512_4way_context *ctx,  __m256i *in, __m256i r[8] )
 {
   int i;
-   register __m256i A, B, C, D, E, F, G, H, X_xor_Y, Y_xor_Z;
+   register __m256i A, B, C, D, E, F, G, H;
+
+#if !defined(__AVX512VL__)
+// Disable for AVX10_256
+   __m256i X_xor_Y, Y_xor_Z;
+#endif
+
   __m256i W[80];

   mm256_block_bswap_64( W  , in );
   mm256_block_bswap_64( W+8, in+8 );

   for ( i = 16; i < 80; i++ )
-      W[i] = _mm256_add_epi64( ssg512_add( W[i-15], W[i-2] ),
-                               _mm256_add_epi64( W[ i- 7 ], W[ i-16 ] ) );
+       W[i] = mm256_add4_64( SSG5_0( W[i-15] ), SSG5_1( W[i-2] ),
+                             W[ i- 7 ], W[ i-16 ] );

   if ( ctx->initialized )
   {
@@ -354,17 +364,20 @@ sha512_4way_round( sha512_4way_context *ctx,  __m256i *in, __m256i r[8] )
   }
   else
   {
-      A = m256_const1_64( 0x6A09E667F3BCC908 );
-      B = m256_const1_64( 0xBB67AE8584CAA73B );
-      C = m256_const1_64( 0x3C6EF372FE94F82B );
-      D = m256_const1_64( 0xA54FF53A5F1D36F1 );
-      E = m256_const1_64( 0x510E527FADE682D1 );
-      F = m256_const1_64( 0x9B05688C2B3E6C1F );
-      G = m256_const1_64( 0x1F83D9ABFB41BD6B );
-      H = m256_const1_64( 0x5BE0CD19137E2179 );
+      A = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
+      B = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
+      C = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
+      D = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
+      E = _mm256_set1_epi64x( 0x510E527FADE682D1 );
+      F = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
+      G = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
+      H = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
   }

+#if !defined(__AVX512VL__)
+// Disable for AVX10_256
   Y_xor_Z = _mm256_xor_si256( B, C );
+#endif

   for ( i = 0; i < 80; i += 8 )
   {
@@ -392,14 +405,14 @@ sha512_4way_round( sha512_4way_context *ctx,  __m256i *in, __m256i r[8] )
   else
   {
      ctx->initialized = true;
-      r[0] = _mm256_add_epi64( A, m256_const1_64( 0x6A09E667F3BCC908 ) );
-      r[1] = _mm256_add_epi64( B, m256_const1_64( 0xBB67AE8584CAA73B ) );
-      r[2] = _mm256_add_epi64( C, m256_const1_64( 0x3C6EF372FE94F82B ) );
-      r[3] = _mm256_add_epi64( D, m256_const1_64( 0xA54FF53A5F1D36F1 ) );
-      r[4] = _mm256_add_epi64( E, m256_const1_64( 0x510E527FADE682D1 ) );
-      r[5] = _mm256_add_epi64( F, m256_const1_64( 0x9B05688C2B3E6C1F ) );
-      r[6] = _mm256_add_epi64( G, m256_const1_64( 0x1F83D9ABFB41BD6B ) );
-      r[7] = _mm256_add_epi64( H, m256_const1_64( 0x5BE0CD19137E2179 ) );
+      r[0] = _mm256_add_epi64( A, _mm256_set1_epi64x( 0x6A09E667F3BCC908 ) );
+      r[1] = _mm256_add_epi64( B, _mm256_set1_epi64x( 0xBB67AE8584CAA73B ) );
+      r[2] = _mm256_add_epi64( C, _mm256_set1_epi64x( 0x3C6EF372FE94F82B ) );
+      r[3] = _mm256_add_epi64( D, _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 ) );
+      r[4] = _mm256_add_epi64( E, _mm256_set1_epi64x( 0x510E527FADE682D1 ) );
+      r[5] = _mm256_add_epi64( F, _mm256_set1_epi64x( 0x9B05688C2B3E6C1F ) );
+      r[6] = _mm256_add_epi64( G, _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B ) );
+      r[7] = _mm256_add_epi64( H, _mm256_set1_epi64x( 0x5BE0CD19137E2179 ) );
   }
 }

@@ -440,13 +453,11 @@ void sha512_4way_close( sha512_4way_context *sc, void *dst )
    unsigned ptr;
    const int buf_size = 128;
    const int pad = buf_size - 16;
-    const __m256i shuff_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f,
-                                                 0x1011121314151617,
-                                                 0x08090a0b0c0d0e0f,
-                                                 0x0001020304050607 );
+    const __m256i shuff_bswap64 = mm256_bcast_m128( _mm_set_epi64x(
+                                    0x08090a0b0c0d0e0f, 0x0001020304050607 ) );

    ptr = (unsigned)sc->count & (buf_size - 1U);
-    sc->buf[ ptr>>3 ] = m256_const1_64( 0x80 );
+    sc->buf[ ptr>>3 ] = _mm256_set1_epi64x( 0x80 );
    ptr += 8;
    if ( ptr > pad )
    {
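Review note on the sha512-hash-4way.c hunks above: two things can be checked with plain scalar code. First, the ternary-logic immediates 0xca and 0xe8 are the truth tables of CH and MAJ when the table index is (x<<2)|(y<<1)|z. Second, the removed nested-rotate form of BSG5_0 and BSG5_1 expands to the same three rotations as the new xor3 form. The sketch below is an independent check under those assumptions; none of its function names come from the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned (*bitfn3)( unsigned, unsigned, unsigned );

// One-bit versions of the SHA-2 choose and majority functions.
static unsigned ch_bit( unsigned x, unsigned y, unsigned z )
{  return ( x & y ) | ( ( ~x & z ) & 1u ); }

static unsigned maj_bit( unsigned x, unsigned y, unsigned z )
{  return ( x & y ) | ( x & z ) | ( y & z ); }

// Build the 8-bit truth table: bit idx holds f(x,y,z), idx = (x<<2)|(y<<1)|z.
static uint8_t ternlog_imm8( bitfn3 f )
{
   uint8_t imm = 0;
   for ( unsigned idx = 0; idx < 8; idx++ )
   {
      unsigned x = ( idx >> 2 ) & 1, y = ( idx >> 1 ) & 1, z = idx & 1;
      imm |= (uint8_t)( f( x, y, z ) << idx );
   }
   return imm;
}

// 64-bit rotate right, valid for c in 1..63.
static uint64_t ror64( uint64_t x, unsigned c )
{  return ( x >> c ) | ( x << ( 64 - c ) ); }

// New form: three independent rotations.
static uint64_t bsg5_0_direct( uint64_t x )
{  return ror64( x, 28 ) ^ ror64( x, 34 ) ^ ror64( x, 39 ); }
static uint64_t bsg5_1_direct( uint64_t x )
{  return ror64( x, 14 ) ^ ror64( x, 18 ) ^ ror64( x, 41 ); }

// Removed form: nested rotations, 28+6 = 34 and 28+6+5 = 39, and so on.
static uint64_t bsg5_0_nested( uint64_t x )
{  return ror64( ror64( ror64( x,  5 ) ^ x, 6 ) ^ x, 28 ); }
static uint64_t bsg5_1_nested( uint64_t x )
{  return ror64( ror64( ror64( x, 23 ) ^ x, 4 ) ^ x, 14 ); }

// Returns 1 when the immediates and both sigma forms agree.
static int sha512_macros_selfcheck( void )
{
   const uint64_t samples[3] = { 1, 0x0123456789ABCDEFULL, 0xFFFFFFFF00000000ULL };
   assert( ternlog_imm8( ch_bit )  == 0xca );
   assert( ternlog_imm8( maj_bit ) == 0xe8 );
   for ( int i = 0; i < 3; i++ )
   {
      assert( bsg5_0_direct( samples[i] ) == bsg5_0_nested( samples[i] ) );
      assert( bsg5_1_direct( samples[i] ) == bsg5_1_nested( samples[i] ) );
   }
   return 1;
}
```

The direct form costs more rotations but has no serial dependency between them, which presumably suits the three-input xor that mm256_xor3 maps to on AVX512VL targets.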
--- a/algo/sha/sha512256d-4way.c
+++ b/algo/sha/sha512256d-4way.c
@@ -0,0 +1,221 @@
+#include "algo-gate-api.h"
+#include "sha-hash-4way.h"
+#include <string.h>
+#include <stdint.h>
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#define SHA512256D_8WAY 1
+#elif defined(__AVX2__)
+#define SHA512256D_4WAY 1
+#endif
+
+#if defined(SHA512256D_8WAY)
+
+static void sha512256d_8way_init( sha512_8way_context *ctx )
+{
+  ctx->count = 0;
+  ctx->initialized = true;
+  ctx->val[0] = _mm512_set1_epi64( 0x22312194FC2BF72C );
+  ctx->val[1] = _mm512_set1_epi64( 0x9F555FA3C84C64C2 );
+  ctx->val[2] = _mm512_set1_epi64( 0x2393B86B6F53B151 );
+  ctx->val[3] = _mm512_set1_epi64( 0x963877195940EABD );
+  ctx->val[4] = _mm512_set1_epi64( 0x96283EE2A88EFFE3 );
+  ctx->val[5] = _mm512_set1_epi64( 0xBE5E1E2553863992 );
+  ctx->val[6] = _mm512_set1_epi64( 0x2B0199FC2C85B8AA );
+  ctx->val[7] = _mm512_set1_epi64( 0x0EB72DDC81C52CA2 );
+}
+
+int scanhash_sha512256d_8way( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr )
+{
+    uint64_t hash[8*8] __attribute__ ((aligned (128)));
+    uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+    sha512_8way_context ctx; 
+    uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+    uint64_t *hash_q3 = &(hash[3*8]);
+    uint32_t *pdata = work->data;
+    uint32_t *ptarget = work->target;
+    const uint64_t targ_q3 = ((uint64_t*)ptarget)[3];
+    const uint32_t first_nonce = pdata[19];
+    const uint32_t last_nonce = max_nonce - 8;
+    uint32_t n = first_nonce;
+    __m512i  *noncev = (__m512i*)vdata + 9;
+    const int thr_id = mythr->id;
+    const bool bench = opt_benchmark;
+    const __m512i eight = _mm512_set1_epi64( 0x0000000800000000 );
+
+    mm512_bswap32_intrlv80_8x64( vdata, pdata );
+    *noncev = mm512_intrlv_blend_32(
+                _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
+                                  n+3, 0, n+2, 0, n+1, 0, n  , 0 ), *noncev );
+    do
+    {
+       sha512256d_8way_init( &ctx );
+       sha512_8way_update( &ctx, vdata, 80 );
+       sha512_8way_close( &ctx, hash );        
+
+       sha512256d_8way_init( &ctx );
+       sha512_8way_update( &ctx, hash, 32 );
+       sha512_8way_close( &ctx, hash );
+
+       for ( int lane = 0; lane < 8; lane++ )
+       if ( unlikely( hash_q3[ lane ] <= targ_q3 && !bench ) )
+       {
+          extr_lane_8x64( lane_hash, hash, lane, 256 );
+          if ( valid_hash( lane_hash, ptarget ) && !bench )
+          {
+             pdata[19] = bswap_32( n + lane );
+             submit_solution( work, lane_hash, mythr );
+          }
+       }
+       *noncev = _mm512_add_epi32( *noncev, eight );
+       n += 8;
+    } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
+
+    pdata[19] = n;
+    *hashes_done = n - first_nonce;
+    return 0;
+}
+
+#elif defined(SHA512256D_4WAY)
+
+static void sha512256d_4way_init( sha512_4way_context *ctx )
+{
+  ctx->count = 0;
+  ctx->initialized = true;
+  ctx->val[0] = _mm256_set1_epi64x( 0x22312194FC2BF72C );
+  ctx->val[1] = _mm256_set1_epi64x( 0x9F555FA3C84C64C2 );
+  ctx->val[2] = _mm256_set1_epi64x( 0x2393B86B6F53B151 );
+  ctx->val[3] = _mm256_set1_epi64x( 0x963877195940EABD );
+  ctx->val[4] = _mm256_set1_epi64x( 0x96283EE2A88EFFE3 );
+  ctx->val[5] = _mm256_set1_epi64x( 0xBE5E1E2553863992 );
+  ctx->val[6] = _mm256_set1_epi64x( 0x2B0199FC2C85B8AA );
+  ctx->val[7] = _mm256_set1_epi64x( 0x0EB72DDC81C52CA2 );
+}
+
+int scanhash_sha512256d_4way( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr )
+{
+    uint64_t hash[8*4] __attribute__ ((aligned (64)));
+    uint32_t vdata[20*4] __attribute__ ((aligned (64)));
+    sha512_4way_context ctx;
+    uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+    uint64_t *hash_q3 = &(hash[3*4]);
+    uint32_t *pdata = work->data;
+    uint32_t *ptarget = work->target;
+    const uint64_t targ_q3 = ((uint64_t*)ptarget)[3];
+    const uint32_t first_nonce = pdata[19];
+    const uint32_t last_nonce = max_nonce - 4;
+    uint32_t n = first_nonce;
+    __m256i  *noncev = (__m256i*)vdata + 9;
+    const int thr_id = mythr->id;
+    const bool bench = opt_benchmark;
+    const __m256i four = _mm256_set1_epi64x( 0x0000000400000000 );
+
+    mm256_bswap32_intrlv80_4x64( vdata, pdata );
+    *noncev = mm256_intrlv_blend_32(
+                _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
+    do
+    {
+       sha512256d_4way_init( &ctx );
+       sha512_4way_update( &ctx, vdata, 80 );
+       sha512_4way_close( &ctx, hash );
+
+       sha512256d_4way_init( &ctx );
+       sha512_4way_update( &ctx, hash, 32 );
+       sha512_4way_close( &ctx, hash );
+
+       for ( int lane = 0; lane < 4; lane++ )
+       if ( hash_q3[ lane ] <= targ_q3 )
+       {
+          extr_lane_4x64( lane_hash, hash, lane, 256 );
+          if ( valid_hash( lane_hash, ptarget ) && !bench )
+          {
+             pdata[19] = bswap_32( n + lane );
+             submit_solution( work, lane_hash, mythr );
+          }
+       }
+       *noncev = _mm256_add_epi32( *noncev, four );
+       n += 4;
+    } while ( (n < last_nonce) && !work_restart[thr_id].restart );
+
+    pdata[19] = n;
+    *hashes_done = n - first_nonce;
+    return 0;
+}
+
+#else
+
+#include "sph_sha2.h"
+
+static const uint64_t H512_256[8] =
+{
+   0x22312194FC2BF72C, 0x9F555FA3C84C64C2,
+   0x2393B86B6F53B151, 0x963877195940EABD,
+   0x96283EE2A88EFFE3, 0xBE5E1E2553863992,
+   0x2B0199FC2C85B8AA, 0x0EB72DDC81C52CA2,
+};
+
+static void sha512256d_init( sph_sha512_context *ctx )
+{
+   memcpy( ctx->val, H512_256, sizeof H512_256 );
+   ctx->count = 0;
+}
+
+int scanhash_sha512256d( struct work *work,   uint32_t max_nonce,
+                     uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   uint32_t hash64[8] __attribute__ ((aligned (64)));
+   uint32_t endiandata[20] __attribute__ ((aligned (64)));
+   sph_sha512_context ctx;
+   const uint32_t Htarg = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   uint32_t n = first_nonce;
+   int thr_id = mythr->id;
+
+   swab32_array( endiandata, pdata, 20 );
+
+   do {
+      be32enc( &endiandata[19], n );
+
+      sha512256d_init( &ctx );
+      sph_sha512( &ctx, endiandata, 80 );
+      sph_sha512_close( &ctx, hash64 );
+
+      sha512256d_init( &ctx );
+      sph_sha512( &ctx, hash64, 32 );
+      sph_sha512_close( &ctx, hash64 );
+      
+      if ( hash64[7] <= Htarg )
+      if ( fulltest( hash64, ptarget ) && !opt_benchmark )
+      {
+         pdata[19] = n;
+         submit_solution( work, hash64, mythr );
+      }
+      n++;
+
+   } while (n < max_nonce && !work_restart[thr_id].restart);
+
+   *hashes_done = n - first_nonce;
+   pdata[19] = n;
+
+   return 0;
+}
+
+#endif
+
+bool register_sha512256d_algo( algo_gate_t* gate )
+{
+   gate->optimizations = AVX2_OPT | AVX512_OPT;
+#if defined(SHA512256D_8WAY)
+   gate->scanhash = (void*)&scanhash_sha512256d_8way;
+#elif defined(SHA512256D_4WAY)
+   gate->scanhash = (void*)&scanhash_sha512256d_4way;
+#else
+   gate->scanhash = (void*)&scanhash_sha512256d;
+#endif
+   return true;
+}
+
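Review note on the nonce step constants in sha512256d-4way.c above: the data is interleaved in 64-bit elements, and the _mm256_set_epi32( n+3, 0, n+2, 0, ... ) pattern suggests the nonce sits in the upper 32 bits of element 9 of each lane, next to header word 18. Adding 0x0000000400000000 with a 32-bit lane add therefore bumps only the nonce half, with no carry into or out of the neighbouring word. The scalar sketch below emulates one 64-bit lane; the helper names are hypothetical and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>

// Emulate one 64-bit lane of a packed 32-bit add: the two halves are
// added independently, so no carry crosses from bit 31 to bit 32.
static uint64_t add_epi32_lane64( uint64_t a, uint64_t b )
{
   uint32_t lo = (uint32_t)a + (uint32_t)b;
   uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}

// Pack header word 18 (low half) and the nonce (high half) into one
// 64-bit element, as assumed for the interleaved buffer.
static uint64_t pack_w18_nonce( uint32_t w18, uint32_t nonce )
{  return ( (uint64_t)nonce << 32 ) | w18; }

// Returns 1 when only the nonce half moves, including on wraparound.
static int nonce_step_selfcheck( void )
{
   const uint64_t four = 0x0000000400000000ULL;

   uint64_t lane = add_epi32_lane64( pack_w18_nonce( 0xFFFFFFFFu, 7 ), four );
   assert( (uint32_t)( lane >> 32 ) == 11 );          // nonce advanced by 4
   assert( (uint32_t)lane == 0xFFFFFFFFu );           // word 18 untouched

   lane = add_epi32_lane64( pack_w18_nonce( 0x12345678u, 0xFFFFFFFEu ), four );
   assert( (uint32_t)( lane >> 32 ) == 2 );           // nonce wraps mod 2^32
   assert( (uint32_t)lane == 0x12345678u );           // still no carry
   return 1;
}
```

A 64-bit add would give the same result here because the low half of the constant is zero, so the 32-bit add is a choice of consistency rather than necessity.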
--- a/algo/shabal/shabal-hash-4way.c
+++ b/algo/shabal/shabal-hash-4way.c
@@ -112,50 +112,50 @@ extern "C"{
   else \
   { \
       (state)->state_loaded = true; \
-       A0 = m256_const1_64( 0x20728DFD20728DFD ); \
-       A1 = m256_const1_64( 0x46C0BD5346C0BD53 ); \
-       A2 = m256_const1_64( 0xE782B699E782B699 ); \
-       A3 = m256_const1_64( 0x5530463255304632 ); \
-       A4 = m256_const1_64( 0x71B4EF9071B4EF90 ); \
-       A5 = m256_const1_64( 0x0EA9E82C0EA9E82C ); \
-       A6 = m256_const1_64( 0xDBB930F1DBB930F1 ); \
-       A7 = m256_const1_64( 0xFAD06B8BFAD06B8B ); \
-       A8 = m256_const1_64( 0xBE0CAE40BE0CAE40 ); \
-       A9 = m256_const1_64( 0x8BD144108BD14410 ); \
-       AA = m256_const1_64( 0x76D2ADAC76D2ADAC ); \
-       AB = m256_const1_64( 0x28ACAB7F28ACAB7F ); \
-       B0 = m256_const1_64( 0xC1099CB7C1099CB7 ); \
-       B1 = m256_const1_64( 0x07B385F307B385F3 ); \
-       B2 = m256_const1_64( 0xE7442C26E7442C26 ); \
-       B3 = m256_const1_64( 0xCC8AD640CC8AD640 ); \
-       B4 = m256_const1_64( 0xEB6F56C7EB6F56C7 ); \
-       B5 = m256_const1_64( 0x1EA81AA91EA81AA9 ); \
-       B6 = m256_const1_64( 0x73B9D31473B9D314 ); \
-       B7 = m256_const1_64( 0x1DE85D081DE85D08 ); \
-       B8 = m256_const1_64( 0x48910A5A48910A5A ); \
-       B9 = m256_const1_64( 0x893B22DB893B22DB ); \
-       BA = m256_const1_64( 0xC5A0DF44C5A0DF44 ); \
-       BB = m256_const1_64( 0xBBC4324EBBC4324E ); \
-       BC = m256_const1_64( 0x72D2F24072D2F240 ); \
-       BD = m256_const1_64( 0x75941D9975941D99 ); \
-       BE = m256_const1_64( 0x6D8BDE826D8BDE82 ); \
-       BF = m256_const1_64( 0xA1A7502BA1A7502B ); \
-       C0 = m256_const1_64( 0xD9BF68D1D9BF68D1 ); \
-       C1 = m256_const1_64( 0x58BAD75058BAD750 ); \
-       C2 = m256_const1_64( 0x56028CB256028CB2 ); \
-       C3 = m256_const1_64( 0x8134F3598134F359 ); \
-       C4 = m256_const1_64( 0xB5D469D8B5D469D8 ); \
-       C5 = m256_const1_64( 0x941A8CC2941A8CC2 ); \
-       C6 = m256_const1_64( 0x418B2A6E418B2A6E ); \
-       C7 = m256_const1_64( 0x0405278004052780 ); \
-       C8 = m256_const1_64( 0x7F07D7877F07D787 ); \
-       C9 = m256_const1_64( 0x5194358F5194358F ); \
-       CA = m256_const1_64( 0x3C60D6653C60D665 ); \
-       CB = m256_const1_64( 0xBE97D79ABE97D79A ); \
-       CC = m256_const1_64( 0x950C3434950C3434 ); \
-       CD = m256_const1_64( 0xAED9A06DAED9A06D ); \
-       CE = m256_const1_64( 0x2537DC8D2537DC8D ); \
-       CF = m256_const1_64( 0x7CDB59697CDB5969 ); \
+       A0 = _mm256_set1_epi64x( 0x20728DFD20728DFD ); \
+       A1 = _mm256_set1_epi64x( 0x46C0BD5346C0BD53 ); \
+       A2 = _mm256_set1_epi64x( 0xE782B699E782B699 ); \
+       A3 = _mm256_set1_epi64x( 0x5530463255304632 ); \
+       A4 = _mm256_set1_epi64x( 0x71B4EF9071B4EF90 ); \
+       A5 = _mm256_set1_epi64x( 0x0EA9E82C0EA9E82C ); \
+       A6 = _mm256_set1_epi64x( 0xDBB930F1DBB930F1 ); \
+       A7 = _mm256_set1_epi64x( 0xFAD06B8BFAD06B8B ); \
+       A8 = _mm256_set1_epi64x( 0xBE0CAE40BE0CAE40 ); \
+       A9 = _mm256_set1_epi64x( 0x8BD144108BD14410 ); \
+       AA = _mm256_set1_epi64x( 0x76D2ADAC76D2ADAC ); \
+       AB = _mm256_set1_epi64x( 0x28ACAB7F28ACAB7F ); \
+       B0 = _mm256_set1_epi64x( 0xC1099CB7C1099CB7 ); \
+       B1 = _mm256_set1_epi64x( 0x07B385F307B385F3 ); \
+       B2 = _mm256_set1_epi64x( 0xE7442C26E7442C26 ); \
+       B3 = _mm256_set1_epi64x( 0xCC8AD640CC8AD640 ); \
+       B4 = _mm256_set1_epi64x( 0xEB6F56C7EB6F56C7 ); \
+       B5 = _mm256_set1_epi64x( 0x1EA81AA91EA81AA9 ); \
+       B6 = _mm256_set1_epi64x( 0x73B9D31473B9D314 ); \
+       B7 = _mm256_set1_epi64x( 0x1DE85D081DE85D08 ); \
+       B8 = _mm256_set1_epi64x( 0x48910A5A48910A5A ); \
+       B9 = _mm256_set1_epi64x( 0x893B22DB893B22DB ); \
+       BA = _mm256_set1_epi64x( 0xC5A0DF44C5A0DF44 ); \
+       BB = _mm256_set1_epi64x( 0xBBC4324EBBC4324E ); \
+       BC = _mm256_set1_epi64x( 0x72D2F24072D2F240 ); \
+       BD = _mm256_set1_epi64x( 0x75941D9975941D99 ); \
+       BE = _mm256_set1_epi64x( 0x6D8BDE826D8BDE82 ); \
+       BF = _mm256_set1_epi64x( 0xA1A7502BA1A7502B ); \
+       C0 = _mm256_set1_epi64x( 0xD9BF68D1D9BF68D1 ); \
+       C1 = _mm256_set1_epi64x( 0x58BAD75058BAD750 ); \
+       C2 = _mm256_set1_epi64x( 0x56028CB256028CB2 ); \
+       C3 = _mm256_set1_epi64x( 0x8134F3598134F359 ); \
+       C4 = _mm256_set1_epi64x( 0xB5D469D8B5D469D8 ); \
+       C5 = _mm256_set1_epi64x( 0x941A8CC2941A8CC2 ); \
+       C6 = _mm256_set1_epi64x( 0x418B2A6E418B2A6E ); \
+       C7 = _mm256_set1_epi64x( 0x0405278004052780 ); \
+       C8 = _mm256_set1_epi64x( 0x7F07D7877F07D787 ); \
+       C9 = _mm256_set1_epi64x( 0x5194358F5194358F ); \
+       CA = _mm256_set1_epi64x( 0x3C60D6653C60D665 ); \
+       CB = _mm256_set1_epi64x( 0xBE97D79ABE97D79A ); \
+       CC = _mm256_set1_epi64x( 0x950C3434950C3434 ); \
+       CD = _mm256_set1_epi64x( 0xAED9A06DAED9A06D ); \
+       CE = _mm256_set1_epi64x( 0x2537DC8D2537DC8D ); \
+       CF = _mm256_set1_epi64x( 0x7CDB59697CDB5969 ); \
   } \
   Wlow = (state)->Wlow; \
   Whigh = (state)->Whigh; \
@@ -276,6 +276,11 @@ do { \
   A1 = _mm256_xor_si256( A1, _mm256_set1_epi32( Whigh ) ); \
 } while (0)

+#define mm256_swap512_256( v1, v2 ) \
+   v1 = _mm256_xor_si256( v1, v2 ); \
+   v2 = _mm256_xor_si256( v1, v2 ); \
+   v1 = _mm256_xor_si256( v1, v2 );
+
 #define SWAP_BC8 \
 do { \
    mm256_swap512_256( B0, C0 ); \
@@ -298,7 +303,7 @@ do { \

 #define PERM_ELT8( xa0, xa1, xb0, xb1, xb2, xb3, xc, xm ) \
 do { \
-   xa0 = mm256_xor3( xm, xb1, mm256_xorandnot(  \
+   xa0 = mm256_xor3( xm, xb1, mm256_xorandnot( \
           _mm256_mullo_epi32( mm256_xor3( xa0, xc, \
              _mm256_mullo_epi32( mm256_rol_32( xa1, 15 ), FIVE ) ), THREE ), \
           xb3, xb2 ) ); \
@@ -438,52 +443,52 @@ shabal_8way_init( void *cc, unsigned size )
   else
   {  // No users
       sc->state_loaded = true;
-       sc->A[ 0] = m256_const1_64( 0x52F8455252F84552 );
-       sc->A[ 1] = m256_const1_64( 0xE54B7999E54B7999 );
-       sc->A[ 2] = m256_const1_64( 0x2D8EE3EC2D8EE3EC );
-       sc->A[ 3] = m256_const1_64( 0xB9645191B9645191 );
-       sc->A[ 4] = m256_const1_64( 0xE0078B86E0078B86 );
-       sc->A[ 5] = m256_const1_64( 0xBB7C44C9BB7C44C9 );
-       sc->A[ 6] = m256_const1_64( 0xD2B5C1CAD2B5C1CA );
-       sc->A[ 7] = m256_const1_64( 0xB0D2EB8CB0D2EB8C );
-       sc->A[ 8] = m256_const1_64( 0x14CE5A4514CE5A45 );
-       sc->A[ 9] = m256_const1_64( 0x22AF50DC22AF50DC );
-       sc->A[10] = m256_const1_64( 0xEFFDBC6BEFFDBC6B );
-       sc->A[11] = m256_const1_64( 0xEB21B74AEB21B74A );
+       sc->A[ 0] = _mm256_set1_epi64x( 0x52F8455252F84552 );
+       sc->A[ 1] = _mm256_set1_epi64x( 0xE54B7999E54B7999 );
+       sc->A[ 2] = _mm256_set1_epi64x( 0x2D8EE3EC2D8EE3EC );
+       sc->A[ 3] = _mm256_set1_epi64x( 0xB9645191B9645191 );
+       sc->A[ 4] = _mm256_set1_epi64x( 0xE0078B86E0078B86 );
+       sc->A[ 5] = _mm256_set1_epi64x( 0xBB7C44C9BB7C44C9 );
+       sc->A[ 6] = _mm256_set1_epi64x( 0xD2B5C1CAD2B5C1CA );
+       sc->A[ 7] = _mm256_set1_epi64x( 0xB0D2EB8CB0D2EB8C );
+       sc->A[ 8] = _mm256_set1_epi64x( 0x14CE5A4514CE5A45 );
+       sc->A[ 9] = _mm256_set1_epi64x( 0x22AF50DC22AF50DC );
+       sc->A[10] = _mm256_set1_epi64x( 0xEFFDBC6BEFFDBC6B );
+       sc->A[11] = _mm256_set1_epi64x( 0xEB21B74AEB21B74A );

-       sc->B[ 0] = m256_const1_64( 0xB555C6EEB555C6EE );
-       sc->B[ 1] = m256_const1_64( 0x3E7105963E710596 );
-       sc->B[ 2] = m256_const1_64( 0xA72A652FA72A652F );
-       sc->B[ 3] = m256_const1_64( 0x9301515F9301515F );
-       sc->B[ 4] = m256_const1_64( 0xDA28C1FADA28C1FA );
-       sc->B[ 5] = m256_const1_64( 0x696FD868696FD868 );
-       sc->B[ 6] = m256_const1_64( 0x9CB6BF729CB6BF72 );
-       sc->B[ 7] = m256_const1_64( 0x0AFE40020AFE4002 );
-       sc->B[ 8] = m256_const1_64( 0xA6E03615A6E03615 );
-       sc->B[ 9] = m256_const1_64( 0x5138C1D45138C1D4 );
-       sc->B[10] = m256_const1_64( 0xBE216306BE216306 );
-       sc->B[11] = m256_const1_64( 0xB38B8890B38B8890 );
-       sc->B[12] = m256_const1_64( 0x3EA8B96B3EA8B96B );
-       sc->B[13] = m256_const1_64( 0x3299ACE43299ACE4 );
-       sc->B[14] = m256_const1_64( 0x30924DD430924DD4 );
-       sc->B[15] = m256_const1_64( 0x55CB34A555CB34A5 );
+       sc->B[ 0] = _mm256_set1_epi64x( 0xB555C6EEB555C6EE );
+       sc->B[ 1] = _mm256_set1_epi64x( 0x3E7105963E710596 );
+       sc->B[ 2] = _mm256_set1_epi64x( 0xA72A652FA72A652F );
+       sc->B[ 3] = _mm256_set1_epi64x( 0x9301515F9301515F );
+       sc->B[ 4] = _mm256_set1_epi64x( 0xDA28C1FADA28C1FA );
+       sc->B[ 5] = _mm256_set1_epi64x( 0x696FD868696FD868 );
+       sc->B[ 6] = _mm256_set1_epi64x( 0x9CB6BF729CB6BF72 );
+       sc->B[ 7] = _mm256_set1_epi64x( 0x0AFE40020AFE4002 );
+       sc->B[ 8] = _mm256_set1_epi64x( 0xA6E03615A6E03615 );
+       sc->B[ 9] = _mm256_set1_epi64x( 0x5138C1D45138C1D4 );
+       sc->B[10] = _mm256_set1_epi64x( 0xBE216306BE216306 );
+       sc->B[11] = _mm256_set1_epi64x( 0xB38B8890B38B8890 );
+       sc->B[12] = _mm256_set1_epi64x( 0x3EA8B96B3EA8B96B );
+       sc->B[13] = _mm256_set1_epi64x( 0x3299ACE43299ACE4 );
+       sc->B[14] = _mm256_set1_epi64x( 0x30924DD430924DD4 );
+       sc->B[15] = _mm256_set1_epi64x( 0x55CB34A555CB34A5 );

-       sc->C[ 0] = m256_const1_64( 0xB405F031B405F031 );
-       sc->C[ 1] = m256_const1_64( 0xC4233EBAC4233EBA );
-       sc->C[ 2] = m256_const1_64( 0xB3733979B3733979 );
-       sc->C[ 3] = m256_const1_64( 0xC0DD9D55C0DD9D55 );
-       sc->C[ 4] = m256_const1_64( 0xC51C28AEC51C28AE );
-       sc->C[ 5] = m256_const1_64( 0xA327B8E1A327B8E1 );
-       sc->C[ 6] = m256_const1_64( 0x56C5616756C56167 );
-       sc->C[ 7] = m256_const1_64( 0xED614433ED614433 );
-       sc->C[ 8] = m256_const1_64( 0x88B59D6088B59D60 );
-       sc->C[ 9] = m256_const1_64( 0x60E2CEBA60E2CEBA );
-       sc->C[10] = m256_const1_64( 0x758B4B8B758B4B8B );
-       sc->C[11] = m256_const1_64( 0x83E82A7F83E82A7F );
-       sc->C[12] = m256_const1_64( 0xBC968828BC968828 );
-       sc->C[13] = m256_const1_64( 0xE6E00BF7E6E00BF7 );
-       sc->C[14] = m256_const1_64( 0xBA839E55BA839E55 );
-       sc->C[15] = m256_const1_64( 0x9B491C609B491C60 );
+       sc->C[ 0] = _mm256_set1_epi64x( 0xB405F031B405F031 );
+       sc->C[ 1] = _mm256_set1_epi64x( 0xC4233EBAC4233EBA );
+       sc->C[ 2] = _mm256_set1_epi64x( 0xB3733979B3733979 );
+       sc->C[ 3] = _mm256_set1_epi64x( 0xC0DD9D55C0DD9D55 );
+       sc->C[ 4] = _mm256_set1_epi64x( 0xC51C28AEC51C28AE );
+       sc->C[ 5] = _mm256_set1_epi64x( 0xA327B8E1A327B8E1 );
+       sc->C[ 6] = _mm256_set1_epi64x( 0x56C5616756C56167 );
+       sc->C[ 7] = _mm256_set1_epi64x( 0xED614433ED614433 );
+       sc->C[ 8] = _mm256_set1_epi64x( 0x88B59D6088B59D60 );
+       sc->C[ 9] = _mm256_set1_epi64x( 0x60E2CEBA60E2CEBA );
+       sc->C[10] = _mm256_set1_epi64x( 0x758B4B8B758B4B8B );
+       sc->C[11] = _mm256_set1_epi64x( 0x83E82A7F83E82A7F );
+       sc->C[12] = _mm256_set1_epi64x( 0xBC968828BC968828 );
+       sc->C[13] = _mm256_set1_epi64x( 0xE6E00BF7E6E00BF7 );
+       sc->C[14] = _mm256_set1_epi64x( 0xBA839E55BA839E55 );
+       sc->C[15] = _mm256_set1_epi64x( 0x9B491C609B491C60 );
   }
    sc->Wlow = 1;
    sc->Whigh = 0;
@@ -702,50 +707,50 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
   else \
   { \
       (state)->state_loaded = true; \
-       A0 = m128_const1_64( 0x20728DFD20728DFD ); \
-       A1 = m128_const1_64( 0x46C0BD5346C0BD53 ); \
-       A2 = m128_const1_64( 0xE782B699E782B699 ); \
-       A3 = m128_const1_64( 0x5530463255304632 ); \
-       A4 = m128_const1_64( 0x71B4EF9071B4EF90 ); \
-       A5 = m128_const1_64( 0x0EA9E82C0EA9E82C ); \
-       A6 = m128_const1_64( 0xDBB930F1DBB930F1 ); \
-       A7 = m128_const1_64( 0xFAD06B8BFAD06B8B ); \
-       A8 = m128_const1_64( 0xBE0CAE40BE0CAE40 ); \
-       A9 = m128_const1_64( 0x8BD144108BD14410 ); \
-       AA = m128_const1_64( 0x76D2ADAC76D2ADAC ); \
-       AB = m128_const1_64( 0x28ACAB7F28ACAB7F ); \
-       B0 = m128_const1_64( 0xC1099CB7C1099CB7 ); \
-       B1 = m128_const1_64( 0x07B385F307B385F3 ); \
-       B2 = m128_const1_64( 0xE7442C26E7442C26 ); \
-       B3 = m128_const1_64( 0xCC8AD640CC8AD640 ); \
-       B4 = m128_const1_64( 0xEB6F56C7EB6F56C7 ); \
-       B5 = m128_const1_64( 0x1EA81AA91EA81AA9 ); \
-       B6 = m128_const1_64( 0x73B9D31473B9D314 ); \
-       B7 = m128_const1_64( 0x1DE85D081DE85D08 ); \
-       B8 = m128_const1_64( 0x48910A5A48910A5A ); \
-       B9 = m128_const1_64( 0x893B22DB893B22DB ); \
-       BA = m128_const1_64( 0xC5A0DF44C5A0DF44 ); \
-       BB = m128_const1_64( 0xBBC4324EBBC4324E ); \
-       BC = m128_const1_64( 0x72D2F24072D2F240 ); \
-       BD = m128_const1_64( 0x75941D9975941D99 ); \
-       BE = m128_const1_64( 0x6D8BDE826D8BDE82 ); \
-       BF = m128_const1_64( 0xA1A7502BA1A7502B ); \
-       C0 = m128_const1_64( 0xD9BF68D1D9BF68D1 ); \
-       C1 = m128_const1_64( 0x58BAD75058BAD750 ); \
-       C2 = m128_const1_64( 0x56028CB256028CB2 ); \
-       C3 = m128_const1_64( 0x8134F3598134F359 ); \
-       C4 = m128_const1_64( 0xB5D469D8B5D469D8 ); \
-       C5 = m128_const1_64( 0x941A8CC2941A8CC2 ); \
-       C6 = m128_const1_64( 0x418B2A6E418B2A6E ); \
-       C7 = m128_const1_64( 0x0405278004052780 ); \
-       C8 = m128_const1_64( 0x7F07D7877F07D787 ); \
-       C9 = m128_const1_64( 0x5194358F5194358F ); \
-       CA = m128_const1_64( 0x3C60D6653C60D665 ); \
-       CB = m128_const1_64( 0xBE97D79ABE97D79A ); \
-       CC = m128_const1_64( 0x950C3434950C3434 ); \
-       CD = m128_const1_64( 0xAED9A06DAED9A06D ); \
-       CE = m128_const1_64( 0x2537DC8D2537DC8D ); \
-       CF = m128_const1_64( 0x7CDB59697CDB5969 ); \
+       A0 = _mm_set1_epi64x( 0x20728DFD20728DFD ); \
+       A1 = _mm_set1_epi64x( 0x46C0BD5346C0BD53 ); \
+       A2 = _mm_set1_epi64x( 0xE782B699E782B699 ); \
+       A3 = _mm_set1_epi64x( 0x5530463255304632 ); \
+       A4 = _mm_set1_epi64x( 0x71B4EF9071B4EF90 ); \
+       A5 = _mm_set1_epi64x( 0x0EA9E82C0EA9E82C ); \
+       A6 = _mm_set1_epi64x( 0xDBB930F1DBB930F1 ); \
+       A7 = _mm_set1_epi64x( 0xFAD06B8BFAD06B8B ); \
+       A8 = _mm_set1_epi64x( 0xBE0CAE40BE0CAE40 ); \
+       A9 = _mm_set1_epi64x( 0x8BD144108BD14410 ); \
+       AA = _mm_set1_epi64x( 0x76D2ADAC76D2ADAC ); \
+       AB = _mm_set1_epi64x( 0x28ACAB7F28ACAB7F ); \
+       B0 = _mm_set1_epi64x( 0xC1099CB7C1099CB7 ); \
+       B1 = _mm_set1_epi64x( 0x07B385F307B385F3 ); \
+       B2 = _mm_set1_epi64x( 0xE7442C26E7442C26 ); \
+       B3 = _mm_set1_epi64x( 0xCC8AD640CC8AD640 ); \
+       B4 = _mm_set1_epi64x( 0xEB6F56C7EB6F56C7 ); \
+       B5 = _mm_set1_epi64x( 0x1EA81AA91EA81AA9 ); \
+       B6 = _mm_set1_epi64x( 0x73B9D31473B9D314 ); \
+       B7 = _mm_set1_epi64x( 0x1DE85D081DE85D08 ); \
+       B8 = _mm_set1_epi64x( 0x48910A5A48910A5A ); \
+       B9 = _mm_set1_epi64x( 0x893B22DB893B22DB ); \
+       BA = _mm_set1_epi64x( 0xC5A0DF44C5A0DF44 ); \
+       BB = _mm_set1_epi64x( 0xBBC4324EBBC4324E ); \
+       BC = _mm_set1_epi64x( 0x72D2F24072D2F240 ); \
+       BD = _mm_set1_epi64x( 0x75941D9975941D99 ); \
+       BE = _mm_set1_epi64x( 0x6D8BDE826D8BDE82 ); \
+       BF = _mm_set1_epi64x( 0xA1A7502BA1A7502B ); \
+       C0 = _mm_set1_epi64x( 0xD9BF68D1D9BF68D1 ); \
+       C1 = _mm_set1_epi64x( 0x58BAD75058BAD750 ); \
+       C2 = _mm_set1_epi64x( 0x56028CB256028CB2 ); \
+       C3 = _mm_set1_epi64x( 0x8134F3598134F359 ); \
+       C4 = _mm_set1_epi64x( 0xB5D469D8B5D469D8 ); \
+       C5 = _mm_set1_epi64x( 0x941A8CC2941A8CC2 ); \
+       C6 = _mm_set1_epi64x( 0x418B2A6E418B2A6E ); \
+       C7 = _mm_set1_epi64x( 0x0405278004052780 ); \
+       C8 = _mm_set1_epi64x( 0x7F07D7877F07D787 ); \
+       C9 = _mm_set1_epi64x( 0x5194358F5194358F ); \
+       CA = _mm_set1_epi64x( 0x3C60D6653C60D665 ); \
+       CB = _mm_set1_epi64x( 0xBE97D79ABE97D79A ); \
+       CC = _mm_set1_epi64x( 0x950C3434950C3434 ); \
+       CD = _mm_set1_epi64x( 0xAED9A06DAED9A06D ); \
+       CE = _mm_set1_epi64x( 0x2537DC8D2537DC8D ); \
+       CF = _mm_set1_epi64x( 0x7CDB59697CDB5969 ); \
   } \
   Wlow = (state)->Wlow; \
   Whigh = (state)->Whigh; \
@@ -866,6 +871,11 @@ do { \
   A1 = _mm_xor_si128( A1, _mm_set1_epi32( Whigh ) ); \
 } while (0)

+#define mm128_swap256_128( v1, v2 ) \
+   v1 = _mm_xor_si128( v1, v2 ); \
+   v2 = _mm_xor_si128( v1, v2 ); \
+   v1 = _mm_xor_si128( v1, v2 );
+
 #define SWAP_BC \
 do { \
    mm128_swap256_128( B0, C0 ); \
@@ -886,6 +896,16 @@ do { \
    mm128_swap256_128( BF, CF ); \
 } while (0)

+#define PERM_ELT( xa0, xa1, xb0, xb1, xb2, xb3, xc, xm ) \
+do { \
+   xa0 = mm128_xor3( xm, xb1, mm128_xorandnot( \
+           _mm_mullo_epi32( mm128_xor3( xa0, xc, \
+              _mm_mullo_epi32( mm128_rol_32( xa1, 15 ), FIVE ) ), THREE ), \
+           xb3, xb2 ) ); \
+   xb0 = mm128_xnor( xa0, mm128_rol_32( xb0, 1 ) ); \
+} while (0)
+
+/*
 #define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) \
 do { \
   xa0 = _mm_xor_si128( xm, _mm_xor_si128( xb1, _mm_xor_si128(  \
@@ -895,6 +915,7 @@ do { \
                   ) ), THREE ) ) ) ); \
   xb0 = mm128_not( _mm_xor_si128( xa0, mm128_rol_32( xb0, 1 ) ) ); \
 } while (0)
+*/

 #define PERM_STEP_0   do { \
 		PERM_ELT(A0, AB, B0, BD, B9, B6, C8, M0); \
@@ -1068,103 +1089,103 @@ shabal_4way_init( void *cc, unsigned size )
   { // copy immediate constants directly to working registers later.
       sc->state_loaded = false;
 /*
-       sc->A[ 0] = m128_const1_64( 0x20728DFD20728DFD );
-       sc->A[ 1] = m128_const1_64( 0x46C0BD5346C0BD53 );
-       sc->A[ 2] = m128_const1_64( 0xE782B699E782B699 );
-       sc->A[ 3] = m128_const1_64( 0x5530463255304632 );
-       sc->A[ 4] = m128_const1_64( 0x71B4EF9071B4EF90 );
-       sc->A[ 5] = m128_const1_64( 0x0EA9E82C0EA9E82C );
-       sc->A[ 6] = m128_const1_64( 0xDBB930F1DBB930F1 );
-       sc->A[ 7] = m128_const1_64( 0xFAD06B8BFAD06B8B );
-       sc->A[ 8] = m128_const1_64( 0xBE0CAE40BE0CAE40 );
-       sc->A[ 9] = m128_const1_64( 0x8BD144108BD14410 );
-       sc->A[10] = m128_const1_64( 0x76D2ADAC76D2ADAC );
-       sc->A[11] = m128_const1_64( 0x28ACAB7F28ACAB7F );
+       sc->A[ 0] = _mm_set1_epi64x( 0x20728DFD20728DFD );
+       sc->A[ 1] = _mm_set1_epi64x( 0x46C0BD5346C0BD53 );
+       sc->A[ 2] = _mm_set1_epi64x( 0xE782B699E782B699 );
+       sc->A[ 3] = _mm_set1_epi64x( 0x5530463255304632 );
+       sc->A[ 4] = _mm_set1_epi64x( 0x71B4EF9071B4EF90 );
+       sc->A[ 5] = _mm_set1_epi64x( 0x0EA9E82C0EA9E82C );
+       sc->A[ 6] = _mm_set1_epi64x( 0xDBB930F1DBB930F1 );
+       sc->A[ 7] = _mm_set1_epi64x( 0xFAD06B8BFAD06B8B );
+       sc->A[ 8] = _mm_set1_epi64x( 0xBE0CAE40BE0CAE40 );
+       sc->A[ 9] = _mm_set1_epi64x( 0x8BD144108BD14410 );
+       sc->A[10] = _mm_set1_epi64x( 0x76D2ADAC76D2ADAC );
+       sc->A[11] = _mm_set1_epi64x( 0x28ACAB7F28ACAB7F );

-       sc->B[ 0] = m128_const1_64( 0xC1099CB7C1099CB7 );
-       sc->B[ 1] = m128_const1_64( 0x07B385F307B385F3 );
-       sc->B[ 2] = m128_const1_64( 0xE7442C26E7442C26 );
-       sc->B[ 3] = m128_const1_64( 0xCC8AD640CC8AD640 );
-       sc->B[ 4] = m128_const1_64( 0xEB6F56C7EB6F56C7 );
-       sc->B[ 5] = m128_const1_64( 0x1EA81AA91EA81AA9 );
-       sc->B[ 6] = m128_const1_64( 0x73B9D31473B9D314 );
-       sc->B[ 7] = m128_const1_64( 0x1DE85D081DE85D08 );
-       sc->B[ 8] = m128_const1_64( 0x48910A5A48910A5A );
-       sc->B[ 9] = m128_const1_64( 0x893B22DB893B22DB );
-       sc->B[10] = m128_const1_64( 0xC5A0DF44C5A0DF44 );
-       sc->B[11] = m128_const1_64( 0xBBC4324EBBC4324E );
-       sc->B[12] = m128_const1_64( 0x72D2F24072D2F240 );
-       sc->B[13] = m128_const1_64( 0x75941D9975941D99 );
-       sc->B[14] = m128_const1_64( 0x6D8BDE826D8BDE82 );
-       sc->B[15] = m128_const1_64( 0xA1A7502BA1A7502B );
+       sc->B[ 0] = _mm_set1_epi64x( 0xC1099CB7C1099CB7 );
+       sc->B[ 1] = _mm_set1_epi64x( 0x07B385F307B385F3 );
+       sc->B[ 2] = _mm_set1_epi64x( 0xE7442C26E7442C26 );
+       sc->B[ 3] = _mm_set1_epi64x( 0xCC8AD640CC8AD640 );
+       sc->B[ 4] = _mm_set1_epi64x( 0xEB6F56C7EB6F56C7 );
+       sc->B[ 5] = _mm_set1_epi64x( 0x1EA81AA91EA81AA9 );
+       sc->B[ 6] = _mm_set1_epi64x( 0x73B9D31473B9D314 );
+       sc->B[ 7] = _mm_set1_epi64x( 0x1DE85D081DE85D08 );
+       sc->B[ 8] = _mm_set1_epi64x( 0x48910A5A48910A5A );
+       sc->B[ 9] = _mm_set1_epi64x( 0x893B22DB893B22DB );
+       sc->B[10] = _mm_set1_epi64x( 0xC5A0DF44C5A0DF44 );
+       sc->B[11] = _mm_set1_epi64x( 0xBBC4324EBBC4324E );
+       sc->B[12] = _mm_set1_epi64x( 0x72D2F24072D2F240 );
+       sc->B[13] = _mm_set1_epi64x( 0x75941D9975941D99 );
+       sc->B[14] = _mm_set1_epi64x( 0x6D8BDE826D8BDE82 );
+       sc->B[15] = _mm_set1_epi64x( 0xA1A7502BA1A7502B );

-       sc->C[ 0] = m128_const1_64( 0xD9BF68D1D9BF68D1 );
-       sc->C[ 1] = m128_const1_64( 0x58BAD75058BAD750 );
-       sc->C[ 2] = m128_const1_64( 0x56028CB256028CB2 );
-       sc->C[ 3] = m128_const1_64( 0x8134F3598134F359 );
-       sc->C[ 4] = m128_const1_64( 0xB5D469D8B5D469D8 );
-       sc->C[ 5] = m128_const1_64( 0x941A8CC2941A8CC2 );
-       sc->C[ 6] = m128_const1_64( 0x418B2A6E418B2A6E );
-       sc->C[ 7] = m128_const1_64( 0x0405278004052780 );
-       sc->C[ 8] = m128_const1_64( 0x7F07D7877F07D787 );
-       sc->C[ 9] = m128_const1_64( 0x5194358F5194358F );
-       sc->C[10] = m128_const1_64( 0x3C60D6653C60D665 );
-       sc->C[11] = m128_const1_64( 0xBE97D79ABE97D79A );
-       sc->C[12] = m128_const1_64( 0x950C3434950C3434 );
-       sc->C[13] = m128_const1_64( 0xAED9A06DAED9A06D );
-       sc->C[14] = m128_const1_64( 0x2537DC8D2537DC8D );
-       sc->C[15] = m128_const1_64( 0x7CDB59697CDB5969 );
+       sc->C[ 0] = _mm_set1_epi64x( 0xD9BF68D1D9BF68D1 );
+       sc->C[ 1] = _mm_set1_epi64x( 0x58BAD75058BAD750 );
+       sc->C[ 2] = _mm_set1_epi64x( 0x56028CB256028CB2 );
+       sc->C[ 3] = _mm_set1_epi64x( 0x8134F3598134F359 );
+       sc->C[ 4] = _mm_set1_epi64x( 0xB5D469D8B5D469D8 );
+       sc->C[ 5] = _mm_set1_epi64x( 0x941A8CC2941A8CC2 );
+       sc->C[ 6] = _mm_set1_epi64x( 0x418B2A6E418B2A6E );
+       sc->C[ 7] = _mm_set1_epi64x( 0x0405278004052780 );
+       sc->C[ 8] = _mm_set1_epi64x( 0x7F07D7877F07D787 );
+       sc->C[ 9] = _mm_set1_epi64x( 0x5194358F5194358F );
+       sc->C[10] = _mm_set1_epi64x( 0x3C60D6653C60D665 );
+       sc->C[11] = _mm_set1_epi64x( 0xBE97D79ABE97D79A );
+       sc->C[12] = _mm_set1_epi64x( 0x950C3434950C3434 );
+       sc->C[13] = _mm_set1_epi64x( 0xAED9A06DAED9A06D );
+       sc->C[14] = _mm_set1_epi64x( 0x2537DC8D2537DC8D );
+       sc->C[15] = _mm_set1_epi64x( 0x7CDB59697CDB5969 );
 */
   }
   else
   {  // No users
       sc->state_loaded = true;
-       sc->A[ 0] = m128_const1_64( 0x52F8455252F84552 );
-       sc->A[ 1] = m128_const1_64( 0xE54B7999E54B7999 );
-       sc->A[ 2] = m128_const1_64( 0x2D8EE3EC2D8EE3EC );
-       sc->A[ 3] = m128_const1_64( 0xB9645191B9645191 );
-       sc->A[ 4] = m128_const1_64( 0xE0078B86E0078B86 );
-       sc->A[ 5] = m128_const1_64( 0xBB7C44C9BB7C44C9 );
-       sc->A[ 6] = m128_const1_64( 0xD2B5C1CAD2B5C1CA );
-       sc->A[ 7] = m128_const1_64( 0xB0D2EB8CB0D2EB8C );
-       sc->A[ 8] = m128_const1_64( 0x14CE5A4514CE5A45 );
-       sc->A[ 9] = m128_const1_64( 0x22AF50DC22AF50DC );
-       sc->A[10] = m128_const1_64( 0xEFFDBC6BEFFDBC6B );
-       sc->A[11] = m128_const1_64( 0xEB21B74AEB21B74A );
+       sc->A[ 0] = _mm_set1_epi64x( 0x52F8455252F84552 );
+       sc->A[ 1] = _mm_set1_epi64x( 0xE54B7999E54B7999 );
+       sc->A[ 2] = _mm_set1_epi64x( 0x2D8EE3EC2D8EE3EC );
+       sc->A[ 3] = _mm_set1_epi64x( 0xB9645191B9645191 );
+       sc->A[ 4] = _mm_set1_epi64x( 0xE0078B86E0078B86 );
+       sc->A[ 5] = _mm_set1_epi64x( 0xBB7C44C9BB7C44C9 );
+       sc->A[ 6] = _mm_set1_epi64x( 0xD2B5C1CAD2B5C1CA );
+       sc->A[ 7] = _mm_set1_epi64x( 0xB0D2EB8CB0D2EB8C );
+       sc->A[ 8] = _mm_set1_epi64x( 0x14CE5A4514CE5A45 );
+       sc->A[ 9] = _mm_set1_epi64x( 0x22AF50DC22AF50DC );
+       sc->A[10] = _mm_set1_epi64x( 0xEFFDBC6BEFFDBC6B );
+       sc->A[11] = _mm_set1_epi64x( 0xEB21B74AEB21B74A );

-       sc->B[ 0] = m128_const1_64( 0xB555C6EEB555C6EE );
-       sc->B[ 1] = m128_const1_64( 0x3E7105963E710596 );
-       sc->B[ 2] = m128_const1_64( 0xA72A652FA72A652F );
-       sc->B[ 3] = m128_const1_64( 0x9301515F9301515F );
-       sc->B[ 4] = m128_const1_64( 0xDA28C1FADA28C1FA );
-       sc->B[ 5] = m128_const1_64( 0x696FD868696FD868 );
-       sc->B[ 6] = m128_const1_64( 0x9CB6BF729CB6BF72 );
-       sc->B[ 7] = m128_const1_64( 0x0AFE40020AFE4002 );
-       sc->B[ 8] = m128_const1_64( 0xA6E03615A6E03615 );
-       sc->B[ 9] = m128_const1_64( 0x5138C1D45138C1D4 );
-       sc->B[10] = m128_const1_64( 0xBE216306BE216306 );
-       sc->B[11] = m128_const1_64( 0xB38B8890B38B8890 );
-       sc->B[12] = m128_const1_64( 0x3EA8B96B3EA8B96B );
-       sc->B[13] = m128_const1_64( 0x3299ACE43299ACE4 );
-       sc->B[14] = m128_const1_64( 0x30924DD430924DD4 );
-       sc->B[15] = m128_const1_64( 0x55CB34A555CB34A5 );
+       sc->B[ 0] = _mm_set1_epi64x( 0xB555C6EEB555C6EE );
+       sc->B[ 1] = _mm_set1_epi64x( 0x3E7105963E710596 );
+       sc->B[ 2] = _mm_set1_epi64x( 0xA72A652FA72A652F );
+       sc->B[ 3] = _mm_set1_epi64x( 0x9301515F9301515F );
+       sc->B[ 4] = _mm_set1_epi64x( 0xDA28C1FADA28C1FA );
+       sc->B[ 5] = _mm_set1_epi64x( 0x696FD868696FD868 );
+       sc->B[ 6] = _mm_set1_epi64x( 0x9CB6BF729CB6BF72 );
+       sc->B[ 7] = _mm_set1_epi64x( 0x0AFE40020AFE4002 );
+       sc->B[ 8] = _mm_set1_epi64x( 0xA6E03615A6E03615 );
+       sc->B[ 9] = _mm_set1_epi64x( 0x5138C1D45138C1D4 );
+       sc->B[10] = _mm_set1_epi64x( 0xBE216306BE216306 );
+       sc->B[11] = _mm_set1_epi64x( 0xB38B8890B38B8890 );
+       sc->B[12] = _mm_set1_epi64x( 0x3EA8B96B3EA8B96B );
+       sc->B[13] = _mm_set1_epi64x( 0x3299ACE43299ACE4 );
+       sc->B[14] = _mm_set1_epi64x( 0x30924DD430924DD4 );
+       sc->B[15] = _mm_set1_epi64x( 0x55CB34A555CB34A5 );

-       sc->C[ 0] = m128_const1_64( 0xB405F031B405F031 );
-       sc->C[ 1] = m128_const1_64( 0xC4233EBAC4233EBA );
-       sc->C[ 2] = m128_const1_64( 0xB3733979B3733979 );
-       sc->C[ 3] = m128_const1_64( 0xC0DD9D55C0DD9D55 );
-       sc->C[ 4] = m128_const1_64( 0xC51C28AEC51C28AE );
-       sc->C[ 5] = m128_const1_64( 0xA327B8E1A327B8E1 );
-       sc->C[ 6] = m128_const1_64( 0x56C5616756C56167 );
-       sc->C[ 7] = m128_const1_64( 0xED614433ED614433 );
-       sc->C[ 8] = m128_const1_64( 0x88B59D6088B59D60 );
-       sc->C[ 9] = m128_const1_64( 0x60E2CEBA60E2CEBA );
-       sc->C[10] = m128_const1_64( 0x758B4B8B758B4B8B );
-       sc->C[11] = m128_const1_64( 0x83E82A7F83E82A7F );
-       sc->C[12] = m128_const1_64( 0xBC968828BC968828 );
-       sc->C[13] = m128_const1_64( 0xE6E00BF7E6E00BF7 );
-       sc->C[14] = m128_const1_64( 0xBA839E55BA839E55 );
-       sc->C[15] = m128_const1_64( 0x9B491C609B491C60 );
+       sc->C[ 0] = _mm_set1_epi64x( 0xB405F031B405F031 );
+       sc->C[ 1] = _mm_set1_epi64x( 0xC4233EBAC4233EBA );
+       sc->C[ 2] = _mm_set1_epi64x( 0xB3733979B3733979 );
+       sc->C[ 3] = _mm_set1_epi64x( 0xC0DD9D55C0DD9D55 );
+       sc->C[ 4] = _mm_set1_epi64x( 0xC51C28AEC51C28AE );
+       sc->C[ 5] = _mm_set1_epi64x( 0xA327B8E1A327B8E1 );
+       sc->C[ 6] = _mm_set1_epi64x( 0x56C5616756C56167 );
+       sc->C[ 7] = _mm_set1_epi64x( 0xED614433ED614433 );
+       sc->C[ 8] = _mm_set1_epi64x( 0x88B59D6088B59D60 );
+       sc->C[ 9] = _mm_set1_epi64x( 0x60E2CEBA60E2CEBA );
+       sc->C[10] = _mm_set1_epi64x( 0x758B4B8B758B4B8B );
+       sc->C[11] = _mm_set1_epi64x( 0x83E82A7F83E82A7F );
+       sc->C[12] = _mm_set1_epi64x( 0xBC968828BC968828 );
+       sc->C[13] = _mm_set1_epi64x( 0xE6E00BF7E6E00BF7 );
+       sc->C[14] = _mm_set1_epi64x( 0xBA839E55BA839E55 );
+       sc->C[15] = _mm_set1_epi64x( 0x9B491C609B491C60 );
   }
    sc->Wlow = 1;
    sc->Whigh = 0;
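Note on the mm256_swap512_256 and mm128_swap256_128 macros added above: both exchange two vectors with three XORs and no temporary. The sketch below is a scalar stand-in for that idea, not the project's code; the function names are hypothetical and a uint64_t takes the place of a vector register. It shows that two distinct objects are exchanged, and that passing the same object twice leaves zero behind, so the arguments must never alias.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for the vector XOR swap: three XORs, no temporary. */
static void xor_swap_u64(uint64_t *a, uint64_t *b)
{
    assert(a && b);
    *a ^= *b;   /* a = a0 ^ b0                */
    *b ^= *a;   /* b = b0 ^ (a0 ^ b0) = a0    */
    *a ^= *b;   /* a = (a0 ^ b0) ^ a0 = b0    */
}

/* Returns 1 when two distinct objects end up exchanged. */
static int xor_swap_exchanges(void)
{
    uint64_t b = 0xC1099CB7C1099CB7ULL;   /* B0 constant taken from the patch */
    uint64_t c = 0xD9BF68D1D9BF68D1ULL;   /* C0 constant taken from the patch */
    xor_swap_u64(&b, &c);
    return b == 0xD9BF68D1D9BF68D1ULL && c == 0xC1099CB7C1099CB7ULL;
}

/* Returns what is left when both arguments refer to the same object. */
static uint64_t xor_swap_aliased(uint64_t v)
{
    xor_swap_u64(&v, &v);   /* the first XOR clears v; the next two keep it at 0 */
    return v;
}
```

SWAP_BC and SWAP_BC8 only pair distinct registers (B0 with C0, and so on), so the aliasing case does not arise there. Because each macro expands to three separate statements rather than a do { } while (0) block, it is also safest to use it only where a full statement sequence is expected, as those two callers do.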
--- a/algo/shavite/shavite-hash-2way.c
+++ b/algo/shavite/shavite-hash-2way.c
@@ -18,14 +18,6 @@ static const uint32_t IV512[] =
        0xE275EADE, 0x502D9FCD, 0xB9357178, 0x022A4B9A
 };

-/*
-#define mm256_ror2x256hi_1x32( a, b ) \
-   _mm256_blend_epi32( mm256_shuflr128_32( a ), \
-                       mm256_shuflr128_32( b ), 0x88 )
-*/
-
-//#define mm256_ror2x256hi_1x32( a, b ) _mm256_alignr_epi8( b, a, 4 )
-
 #if defined(__VAES__)

 #define mm256_aesenc_2x128( x, k ) \
@@ -34,8 +26,47 @@ static const uint32_t IV512[] =
 #else

 #define mm256_aesenc_2x128( x, k ) \
-   mm256_concat_128( _mm_aesenc_si128( mm128_extr_hi128_256( x ), k ), \
-                     _mm_aesenc_si128( mm128_extr_lo128_256( x ), k ) )
+   _mm256_inserti128_si256( _mm256_castsi128_si256( \
+            _mm_aesenc_si128( _mm256_castsi256_si128(   x ),    k ) ), \
+            _mm_aesenc_si128( _mm256_extracti128_si256( x, 1 ), k ), 1 )
+
+#endif
+
+#if defined (__AVX512VL__)
+//TODO Enable for AVX10_256
+
+#define DECL_m256i_count \
+   const __m256i count = \
+          mm256_set4_32( ctx->count3, ctx->count2, ctx->count1, ctx->count0 );
+
+#define COUNT_R0 \
+  _mm256_mask_xor_epi32( count, 0x88, count, m256_neg1 )
+
+#define COUNT_R1 \
+  mm256_shuflr128_32( _mm256_mask_xor_epi32( count, 0x11, count, m256_neg1 ) )
+
+#define COUNT_R2 \
+  mm256_swap128_64( _mm256_mask_xor_epi32( count, 0x22, count, m256_neg1 ) )
+
+#define COUNT_R13 \
+  mm256_swap64_32( _mm256_mask_xor_epi32( count, 0x44, count, m256_neg1 ) )
+
+#else
+
+#define DECL_m256i_count
+
+// R matches the loop index, not the round number; should change that
+#define COUNT_R0 \
+  mm256_set4_32( ~ctx->count3, ctx->count2, ctx->count1, ctx->count0 )
+
+#define COUNT_R1 \
+  mm256_set4_32( ~ctx->count0, ctx->count1, ctx->count2, ctx->count3 ) 
+
+#define COUNT_R2 \
+  mm256_set4_32( ~ctx->count1, ctx->count0, ctx->count3, ctx->count2 )
+
+#define COUNT_R13 \
+  mm256_set4_32( ~ctx->count2, ctx->count3, ctx->count0, ctx->count1 )

 #endif

@@ -47,6 +78,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
   __m256i k00, k01, k02, k03, k10, k11, k12, k13;
   __m256i *m = (__m256i*)msg;
   __m256i *h = (__m256i*)ctx->h;
+   DECL_m256i_count;
   int r;

   p0 = h[0];
@@ -54,7 +86,8 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
   p2 = h[2];
   p3 = h[3];

-   // round
+   // round 0
+
   k00 = m[0];
   x = mm256_aesenc_2x128( _mm256_xor_si256( p1, k00 ), zero );
   k01 = m[1];
@@ -85,18 +118,14 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
                                  mm256_aesenc_2x128( k00, zero ) ) );

     if ( r == 0 )
-        k00 = _mm256_xor_si256( k00, _mm256_set_epi32( 
-		      ~ctx->count3, ctx->count2, ctx->count1, ctx->count0,
-                      ~ctx->count3, ctx->count2, ctx->count1, ctx->count0 ) );
+        k00 = _mm256_xor_si256( k00, COUNT_R0 );

     x = mm256_aesenc_2x128( _mm256_xor_si256( p0, k00 ), zero );
     k01 = _mm256_xor_si256( k00,
 		     mm256_shuflr128_32( mm256_aesenc_2x128( k01, zero ) ) );

     if ( r == 1 )
-        k01 = _mm256_xor_si256( k01, _mm256_set_epi32(
-	               ~ctx->count0, ctx->count1, ctx->count2, ctx->count3,
-                       ~ctx->count0, ctx->count1, ctx->count2, ctx->count3 ) );
+        k01 = _mm256_xor_si256( k01, COUNT_R1 );

     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k01 ), zero );
     k02 = _mm256_xor_si256( k01,
@@ -121,9 +150,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
 		     mm256_shuflr128_32( mm256_aesenc_2x128( k13, zero ) ) );

     if ( r == 2 )
-        k13 = _mm256_xor_si256( k13, _mm256_set_epi32(
-                  ~ctx->count1, ctx->count0, ctx->count3, ctx->count2,
-                  ~ctx->count1, ctx->count0, ctx->count3, ctx->count2 ) );
+        k13 = _mm256_xor_si256( k13, COUNT_R2 );
 
     x = mm256_aesenc_2x128( _mm256_xor_si256( x, k13 ), zero );
     p1 = _mm256_xor_si256( p1, x );
@@ -235,9 +262,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
   x = mm256_aesenc_2x128( _mm256_xor_si256( x, k11 ), zero );

   k12 = mm256_shuflr128_32( mm256_aesenc_2x128( k12, zero ) );
-   k12 = _mm256_xor_si256( k12, _mm256_xor_si256( k11, _mm256_set_epi32(
-	       ~ctx->count2, ctx->count3, ctx->count0, ctx->count1,
-	       ~ctx->count2, ctx->count3, ctx->count0, ctx->count1 ) ) );
+   k12 = _mm256_xor_si256( k12, _mm256_xor_si256( k11, COUNT_R13 ) );

   x = mm256_aesenc_2x128( _mm256_xor_si256( x, k12 ), zero );
   k13 = _mm256_xor_si256( mm256_shuflr128_32(
@@ -257,10 +282,10 @@ void shavite512_2way_init( shavite512_2way_context *ctx )
    __m256i *h = (__m256i*)ctx->h;
    __m128i *iv = (__m128i*)IV512;
   
-   h[0] = m256_const1_128( iv[0] );
-   h[1] = m256_const1_128( iv[1] );
-   h[2] = m256_const1_128( iv[2] );
-   h[3] = m256_const1_128( iv[3] );
+   h[0] = mm256_bcast_m128( iv[0] );
+   h[1] = mm256_bcast_m128( iv[1] );
+   h[2] = mm256_bcast_m128( iv[2] );
+   h[3] = mm256_bcast_m128( iv[3] );

   ctx->ptr    = 0;
   ctx->count0 = 0;
@@ -320,7 +345,7 @@ void shavite512_2way_close( shavite512_2way_context *ctx, void *dst )
    uint32_t vp = ctx->ptr>>5;

    // Terminating byte then zero pad
-    casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
+    casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );

    // Zero pad full vectors up to count
    for ( ; vp < 6; vp++ )      
@@ -334,9 +359,9 @@ void shavite512_2way_close( shavite512_2way_context *ctx, void *dst )
    count.u32[2] = ctx->count2;
    count.u32[3] = ctx->count3;

-    casti_m256i( buf, 6 ) = m256_const1_128(
+    casti_m256i( buf, 6 ) = mm256_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) ); 
-    casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
+    casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );
                
@@ -400,19 +425,19 @@ void shavite512_2way_update_close( shavite512_2way_context *ctx, void *dst,

   if ( vp == 0 )    // empty buf, xevan.
   { 
-      casti_m256i( buf, 0 ) = m256_const1_i128( 0x0000000000000080 );
+      casti_m256i( buf, 0 ) = mm256_bcast128lo_64( 0x0000000000000080 );
      memset_zero_256( (__m256i*)buf + 1, 5 );
      ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
   }
   else     // half full buf, everyone else.
   {
-    casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
+    casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );
      memset_zero_256( (__m256i*)buf + vp, 6 - vp );
   }

-    casti_m256i( buf, 6 ) = m256_const1_128(
+    casti_m256i( buf, 6 ) = mm256_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) ); 
-    casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
+    casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

@@ -430,10 +455,10 @@ void shavite512_2way_full( shavite512_2way_context *ctx, void *dst,
    __m256i *h = (__m256i*)ctx->h;
    __m128i *iv = (__m128i*)IV512;

-   h[0] = m256_const1_128( iv[0] );
-   h[1] = m256_const1_128( iv[1] );
-   h[2] = m256_const1_128( iv[2] );
-   h[3] = m256_const1_128( iv[3] );
+   h[0] = mm256_bcast_m128( iv[0] );
+   h[1] = mm256_bcast_m128( iv[1] );
+   h[2] = mm256_bcast_m128( iv[2] );
+   h[3] = mm256_bcast_m128( iv[3] );

   ctx->ptr    =
   ctx->count0 =
@@ -490,19 +515,19 @@ void shavite512_2way_full( shavite512_2way_context *ctx, void *dst,

   if ( vp == 0 )    // empty buf, xevan.
   {
-      casti_m256i( buf, 0 ) = m256_const1_i128( 0x0000000000000080 );
+      casti_m256i( buf, 0 ) = mm256_bcast128lo_64( 0x0000000000000080 );
      memset_zero_256( (__m256i*)buf + 1, 5 );
      ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
   }
   else     // half full buf, everyone else.
   {
-    casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
+    casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );
      memset_zero_256( (__m256i*)buf + vp, 6 - vp );
   }

-    casti_m256i( buf, 6 ) = m256_const1_128(
+    casti_m256i( buf, 6 ) = mm256_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
-    casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
+    casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

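Review note on the shavite-hash-2way.c hunks above: the renamed broadcast helpers can be pictured with a scalar model. This is a sketch, not the project's code. It assumes that `mm256_bcast_m128` copies one 128-bit value into both 128-bit lanes of a 256-bit vector, and that `mm256_bcast128lo_64` does the same with a 128-bit value whose low 64 bits are the argument and whose high 64 bits are zero; both assumptions are inferred from the names and from the `m256_const1_128` / `m256_const1_i128` calls they replace. The type `v256_model` and the `model_*` functions are hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 256-bit vector: four 64-bit words,
   index 0 is the least significant. */
typedef struct { uint64_t q[4]; } v256_model;

/* Assumed behaviour of mm256_bcast_m128: replicate the 128-bit value
   (lo, hi) into both 128-bit lanes. */
static v256_model model_bcast_m128( uint64_t lo, uint64_t hi )
{
   v256_model v = { { lo, hi, lo, hi } };
   return v;
}

/* Assumed behaviour of mm256_bcast128lo_64: zero-extend a 64-bit value
   to 128 bits, then replicate it into both lanes. */
static v256_model model_bcast128lo_64( uint64_t x )
{
   return model_bcast_m128( x, 0 );
}
```

Under these assumptions, `model_bcast128lo_64( 0x80 )` places the terminating 0x80 pad byte at the start of each lane and leaves the rest of the lane zero, which matches the "terminating byte then zero pad" usage in the hunks.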
--- a/algo/shavite/shavite-hash-4way.c
+++ b/algo/shavite/shavite-hash-4way.c
@@ -204,11 +204,9 @@ c512_4way( shavite512_4way_context *ctx, const void *msg )
   K5 = _mm512_xor_si512( mm512_shuflr128_32(
 			             _mm512_aesenc_epi128( K5, m512_zero ) ), K4 );
   X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K5 ), m512_zero );
-
   K6 = mm512_shuflr128_32( _mm512_aesenc_epi128( K6, m512_zero ) );
-   K6 = _mm512_xor_si512( K6, _mm512_xor_si512( K5, _mm512_set4_epi32(
-	       ~ctx->count2, ctx->count3, ctx->count0, ctx->count1 ) ) );
-
+   K6 = _mm512_xor_si512( K6, _mm512_xor_si512( K5,  mm512_swap64_32( 
+              _mm512_mask_xor_epi32( count, 0x4444, count, m512_neg1 ) ) ) );
   X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K6 ), m512_zero );
   K7= _mm512_xor_si512( mm512_shuflr128_32(
 			             _mm512_aesenc_epi128( K7, m512_zero ) ), K6 );
@@ -227,10 +225,10 @@ void shavite512_4way_init( shavite512_4way_context *ctx )
    __m512i *h = (__m512i*)ctx->h;
    __m128i *iv = (__m128i*)IV512;
   
-   h[0] = m512_const1_128( iv[0] );
-   h[1] = m512_const1_128( iv[1] );
-   h[2] = m512_const1_128( iv[2] );
-   h[3] = m512_const1_128( iv[3] );
+   h[0] = mm512_bcast_m128( iv[0] );
+   h[1] = mm512_bcast_m128( iv[1] );
+   h[2] = mm512_bcast_m128( iv[2] );
+   h[3] = mm512_bcast_m128( iv[3] );

   ctx->ptr    = 0;
   ctx->count0 = 0;
@@ -290,7 +288,7 @@ void shavite512_4way_close( shavite512_4way_context *ctx, void *dst )
    uint32_t vp = ctx->ptr>>6;

    // Terminating byte then zero pad
-    casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
+    casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );

    // Zero pad full vectors up to count
    for ( ; vp < 6; vp++ )      
@@ -304,9 +302,9 @@ void shavite512_4way_close( shavite512_4way_context *ctx, void *dst )
    count.u32[2] = ctx->count2;
    count.u32[3] = ctx->count3;

-    casti_m512i( buf, 6 ) = m512_const1_128(
+    casti_m512i( buf, 6 ) = mm512_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) ); 
-    casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
+    casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );
                
@@ -370,19 +368,19 @@ void shavite512_4way_update_close( shavite512_4way_context *ctx, void *dst,

   if ( vp == 0 )    // empty buf, xevan.
   { 
-      casti_m512i( buf, 0 ) = m512_const1_i128( 0x0000000000000080 );
+      casti_m512i( buf, 0 ) = mm512_bcast128lo_64( 0x0000000000000080 );
      memset_zero_512( (__m512i*)buf + 1, 5 );
      ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
   }
   else     // half full buf, everyone else.
   {
-    casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
+    casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );
      memset_zero_512( (__m512i*)buf + vp, 6 - vp );
   }

-    casti_m512i( buf, 6 ) = m512_const1_128(
+    casti_m512i( buf, 6 ) = mm512_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) ); 
-    casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
+    casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

@@ -401,10 +399,10 @@ void shavite512_4way_full( shavite512_4way_context *ctx, void *dst,
    __m512i *h = (__m512i*)ctx->h;
    __m128i *iv = (__m128i*)IV512;

-   h[0] = m512_const1_128( iv[0] );
-   h[1] = m512_const1_128( iv[1] );
-   h[2] = m512_const1_128( iv[2] );
-   h[3] = m512_const1_128( iv[3] );
+   h[0] = mm512_bcast_m128( iv[0] );
+   h[1] = mm512_bcast_m128( iv[1] );
+   h[2] = mm512_bcast_m128( iv[2] );
+   h[3] = mm512_bcast_m128( iv[3] );

   ctx->ptr    = 
   ctx->count0 = 
@@ -461,19 +459,19 @@ void shavite512_4way_full( shavite512_4way_context *ctx, void *dst,

   if ( vp == 0 )    // empty buf, xevan.
   {
-      casti_m512i( buf, 0 ) = m512_const1_i128( 0x0000000000000080 );
+      casti_m512i( buf, 0 ) = mm512_bcast128lo_64( 0x0000000000000080 );
      memset_zero_512( (__m512i*)buf + 1, 5 );
      ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
   }
   else     // half full buf, everyone else.
   {
-    casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
+    casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );
      memset_zero_512( (__m512i*)buf + vp, 6 - vp );
   }

-    casti_m512i( buf, 6 ) = m512_const1_128(
+    casti_m512i( buf, 6 ) = mm512_bcast_m128(
                  _mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
-    casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
+    casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
                  0x0200,       count.u16[7], count.u16[6], count.u16[5],
                  count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

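Review note on the `c512_4way` hunk in shavite-hash-4way.c: the `_mm512_set4_epi32` call is replaced by a masked XOR followed by `mm512_swap64_32`. The scalar sketch below checks that the two forms produce the same 128-bit lane. It assumes that `count` (defined outside the visible hunk) holds `{ count0, count1, count2, count3 }` from the lowest element up in every lane, and that `mm512_swap64_32` swaps the two 32-bit halves of each 64-bit element. `lane128_model` and both functions are hypothetical names.

```c
#include <stdint.h>

/* One 128-bit lane modelled as four 32-bit elements, index 0 lowest. */
typedef struct { uint32_t e[4]; } lane128_model;

/* Old form: _mm512_set4_epi32( ~count2, count3, count0, count1 ).
   The intrinsic lists its arguments from the highest element down. */
static lane128_model old_counter_word( const uint32_t c[4] )
{
   lane128_model r = { { c[1], c[0], c[3], ~c[2] } };
   return r;
}

/* New form: mask 0x4444 selects element 2 of every lane for the XOR
   with all ones (a bitwise NOT); then the 32-bit halves of each 64-bit
   pair are swapped. */
static lane128_model new_counter_word( const uint32_t c[4] )
{
   lane128_model t = { { c[0], c[1], ~c[2], c[3] } };        /* masked NOT */
   lane128_model r = { { t.e[1], t.e[0], t.e[3], t.e[2] } }; /* swap64_32  */
   return r;
}
```

If the assumed layout of `count` holds, the new code avoids rebuilding the vector from four scalars on every call and instead applies two vector operations to an existing register.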
--- a/algo/simd/simd-hash-2way.c
+++ b/algo/simd/simd-hash-2way.c
@@ -212,14 +212,24 @@ do { \
 // targetted
 #define shufxor2w(x,s) _mm256_shuffle_epi32( x, XCAT( SHUFXOR_, s ))

+#if defined(__AVX512VL__)
+//TODO Enable for AVX10_256
+
 #define REDUCE(x) \
-  _mm256_sub_epi16( _mm256_and_si256( x, m256_const1_64( \
+  _mm256_sub_epi16( _mm256_maskz_mov_epi8( 0x55555555, x ), \
+                    _mm256_srai_epi16( x, 8 ) )
+#else
+
+#define REDUCE(x) \
+  _mm256_sub_epi16( _mm256_and_si256( x, _mm256_set1_epi64x( \
                         0x00ff00ff00ff00ff ) ), _mm256_srai_epi16( x, 8 ) )

+#endif
+
 #define EXTRA_REDUCE_S(x)\
  _mm256_sub_epi16( x, _mm256_and_si256( \
-             m256_const1_64( 0x0101010101010101 ), \
-             _mm256_cmpgt_epi16( x, m256_const1_64( 0x0080008000800080 ) ) ) )
+          _mm256_set1_epi64x( 0x0101010101010101 ), \
+          _mm256_cmpgt_epi16( x, _mm256_set1_epi64x( 0x0080008000800080 ) ) ) )

 #define REDUCE_FULL_S( x )  EXTRA_REDUCE_S( REDUCE (x ) )

@@ -387,17 +397,11 @@ static const m512_v16 FFT256_Twiddle4w[] =
  _mm512_sub_epi16( _mm512_maskz_mov_epi8( 0x5555555555555555, x ), \
                    _mm512_srai_epi16( x, 8 ) )

-/*
-#define REDUCE4w(x) \
-  _mm512_sub_epi16( _mm512_and_si512( x, m512_const1_64( \
-                         0x00ff00ff00ff00ff ) ), _mm512_srai_epi16( x, 8 ) )
-*/
-
 #define EXTRA_REDUCE_S4w(x) \
  _mm512_sub_epi16( x, _mm512_and_si512( \
-             m512_const1_64( 0x0101010101010101 ), \
+             _mm512_set1_epi64( 0x0101010101010101 ), \
             _mm512_movm_epi16( _mm512_cmpgt_epi16_mask( \
-                               x, m512_const1_64( 0x0080008000800080 ) ) ) ) )
+                             x, _mm512_set1_epi64( 0x0080008000800080 ) ) ) ) )

 // generic, except it calls targetted macros
 #define REDUCE_FULL_S4w( x )  EXTRA_REDUCE_S4w( REDUCE4w (x ) )
@@ -484,14 +488,7 @@ do { \
 #undef BUTTERFLY_0
 #undef BUTTERFLY_N

-// twiddle is hard coded  T[0] = m512_const2_64( {128,64,32,16}, {8,4,2,1} )  
  // Multiply by twiddle factors
-//  X(6) = _mm512_mullo_epi16( X(6), m512_const2_64( 0x0080004000200010,
-//                                                   0x0008000400020001 );
-//  X(5) = _mm512_mullo_epi16( X(5), m512_const2_64( 0xffdc0008ffef0004,
-//                                                   0x00780002003c0001 );
-
-
  X(6) = _mm512_mullo_epi16( X(6), FFT64_Twiddle4w[0].v512 );
  X(5) = _mm512_mullo_epi16( X(5), FFT64_Twiddle4w[1].v512 );
  X(4) = _mm512_mullo_epi16( X(4), FFT64_Twiddle4w[2].v512 );
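Review note on the `REDUCE` hunk in simd-hash-2way.c: both branches compute, for each 16-bit lane, the low byte minus the arithmetically shifted high byte. The AVX512VL branch keeps the low bytes with a zeroing byte mask (0x55555555 selects the even bytes), while the fallback ANDs with 0x00ff. A scalar sketch of the per-lane arithmetic follows; `reduce_model` is a hypothetical name, and the sketch relies on `>>` of a negative value being an arithmetic shift, as it is with GCC and Clang.

```c
#include <stdint.h>

/* Scalar model of REDUCE for one 16-bit lane.
   Writing x = lo + 256 * hi and using 256 == -1 (mod 257) gives
   x == lo - hi (mod 257), so the result is congruent to x modulo 257
   while being much smaller in magnitude. */
static int reduce_model( int16_t x )
{
   return ( x & 0xff ) - ( x >> 8 );
}
```

The result is only partially reduced: it lies in the range -127 to 383, which is why the surrounding code follows it with an extra reduction step (`EXTRA_REDUCE_S`).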
--- a/algo/skein/skein-4way.c
+++ b/algo/skein/skein-4way.c
@@ -7,16 +7,8 @@

 #if defined (SKEIN_8WAY)

-static skein512_8way_context skein512_8way_ctx
+static __thread skein512_8way_context skein512_8way_ctx
                                            __attribute__ ((aligned (64)));
-static uint32_t skein_8way_vdata[20*8] __attribute__ ((aligned (64)));
-
-int skein_8way_prehash( struct work *work )
-{
-    mm512_bswap32_intrlv80_8x64( skein_8way_vdata, work->data );
-    skein512_8way_prehash64( &skein512_8way_ctx, skein_8way_vdata );
-    return 1;
-}

 void skeinhash_8way( void *state, const void *input )
 {
@@ -37,27 +29,25 @@ void skeinhash_8way( void *state, const void *input )
 int scanhash_skein_8way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr )
 {
-   uint32_t vdata[20*8] __attribute__ ((aligned (128)));
-   uint32_t hash[8*8] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
-   uint32_t *hash_d7 = &(hash[7*8]);
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   const uint32_t targ_d7 = ptarget[7];
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 8;
-   uint32_t n = first_nonce;
-   __m512i  *noncev = (__m512i*)vdata + 9; 
-   const int thr_id = mythr->id; 
-   const bool bench = opt_benchmark;
-    
-    pthread_rwlock_rdlock( &g_work_lock );
-       memcpy( vdata, skein_8way_vdata, sizeof vdata );
-    pthread_rwlock_unlock( &g_work_lock );
+    uint32_t vdata[20*8] __attribute__ ((aligned (128)));
+    uint32_t hash[8*8] __attribute__ ((aligned (64)));
+    uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+    uint32_t *hash_d7 = &(hash[7*8]);
+    uint32_t *pdata = work->data;
+    uint32_t *ptarget = work->target;
+    const uint32_t targ_d7 = ptarget[7];
+    const uint32_t first_nonce = pdata[19];
+    const uint32_t last_nonce = max_nonce - 8;
+    uint32_t n = first_nonce;
+    __m512i  *noncev = (__m512i*)vdata + 9; 
+    const int thr_id = mythr->id; 
+    const bool bench = opt_benchmark;

+   mm512_bswap32_intrlv80_8x64( vdata, pdata );
   *noncev = mm512_intrlv_blend_32(
                _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
                                  n+3, 0, n+2, 0, n+1, 0, n  , 0 ), *noncev );
+   skein512_8way_prehash64( &skein512_8way_ctx, vdata );
   do
   {
       skeinhash_8way( hash, vdata );
@@ -73,7 +63,7 @@ int scanhash_skein_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

@@ -84,16 +74,8 @@ int scanhash_skein_8way( struct work *work, uint32_t max_nonce,

 #elif defined (SKEIN_4WAY)

-static skein512_4way_context skein512_4way_ctx
+static __thread skein512_4way_context skein512_4way_ctx
                                            __attribute__ ((aligned (64)));
-static uint32_t skein_4way_vdata[20*4] __attribute__ ((aligned (64)));
-
-int skein_4way_prehash( struct work *work )
-{
-    mm256_bswap32_intrlv80_4x64( skein_4way_vdata, work->data );
-    skein512_4way_prehash64( &skein512_4way_ctx, skein_4way_vdata );
-    return 1;
-}

 void skeinhash_4way( void *state, const void *input )
 {
@@ -136,24 +118,23 @@ void skeinhash_4way( void *state, const void *input )
 int scanhash_skein_4way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr )
 {
-   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
-   uint32_t hash[8*4] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-   uint32_t *hash_d7 = &(hash[7<<2]);
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   const uint32_t targ_d7 = ptarget[7];
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 4;
-   uint32_t n = first_nonce;
-   __m256i  *noncev = (__m256i*)vdata + 9; 
-   const int thr_id = mythr->id; 
-   const bool bench = opt_benchmark;
+    uint32_t vdata[20*4] __attribute__ ((aligned (64)));
+    uint32_t hash[8*4] __attribute__ ((aligned (64)));
+    uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+    uint32_t *hash_d7 = &(hash[7<<2]);
+    uint32_t *pdata = work->data;
+    uint32_t *ptarget = work->target;
+    const uint32_t targ_d7 = ptarget[7];
+    const uint32_t first_nonce = pdata[19];
+    const uint32_t last_nonce = max_nonce - 4;
+    uint32_t n = first_nonce;
+    __m256i  *noncev = (__m256i*)vdata + 9; 
+    const int thr_id = mythr->id; 
+    const bool bench = opt_benchmark;
+
+   mm256_bswap32_intrlv80_4x64( vdata, pdata );
+   skein512_4way_prehash64( &skein512_4way_ctx, vdata );

-   pthread_rwlock_rdlock( &g_work_lock );
-      memcpy( vdata, skein_4way_vdata, sizeof vdata );
-   pthread_rwlock_unlock( &g_work_lock );
-    
   *noncev = mm256_intrlv_blend_32(
                _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
   do
@@ -170,7 +151,7 @@ int scanhash_skein_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

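Review note on the nonce increment in skein-4way.c: the constant passed to `_mm256_add_epi32` is built as a 64-bit broadcast (`0x0000000400000000`) but consumed by a 32-bit lane add. Given the `_mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 )` initialisation, the nonce occupies the upper 32 bits of each 64-bit element, so the add advances every nonce by 4 and leaves the lower half untouched. A scalar sketch for one 64-bit element, with the hypothetical name `add_epi32_model`:

```c
#include <stdint.h>

/* Scalar model of a 32-bit lane add applied to one 64-bit element:
   the two halves are added independently, with no carry between them. */
static uint64_t add_epi32_model( uint64_t a, uint64_t b )
{
   uint32_t lo = (uint32_t)a + (uint32_t)b;
   uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}
```

The 8-way path uses the same pattern with `0x0000000800000000`, matching the `n += 8` step of its loop.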
--- a/algo/skein/skein-gate.c
+++ b/algo/skein/skein-gate.c
@@ -7,12 +7,10 @@ bool register_skein_algo( algo_gate_t* gate )
 #if defined (SKEIN_8WAY)
    gate->optimizations = AVX2_OPT | AVX512_OPT;
    gate->scanhash  = (void*)&scanhash_skein_8way;
-    gate->prehash   = (void*)&skein_8way_prehash;
    gate->hash      = (void*)&skeinhash_8way;
 #elif defined (SKEIN_4WAY)
    gate->optimizations = AVX2_OPT | AVX512_OPT | SHA_OPT;
    gate->scanhash  = (void*)&scanhash_skein_4way;
-    gate->prehash   = (void*)&skein_4way_prehash;
    gate->hash      = (void*)&skeinhash_4way;
 #else
    gate->optimizations = AVX2_OPT | AVX512_OPT | SHA_OPT;
@@ -27,12 +25,10 @@ bool register_skein2_algo( algo_gate_t* gate )
  gate->optimizations = AVX2_OPT | AVX512_OPT;
 #if defined (SKEIN_8WAY)
  gate->scanhash  = (void*)&scanhash_skein2_8way;
-//  gate->hash      = (void*)&skein2hash_8way;
-  gate->prehash   = (void*)&skein2_8way_prehash;
+  gate->hash      = (void*)&skein2hash_8way;
 #elif defined (SKEIN_4WAY)
  gate->scanhash  = (void*)&scanhash_skein2_4way;
-//  gate->hash      = (void*)&skein2hash_4way;
-  gate->prehash   = (void*)&skein2_4way_prehash;
+  gate->hash      = (void*)&skein2hash_4way;
 #else
  gate->scanhash  = (void*)&scanhash_skein2;
  gate->hash      = (void*)&skein2hash;
--- a/algo/skein/skein-gate.h
+++ b/algo/skein/skein-gate.h
@@ -14,24 +14,20 @@
 void skeinhash_8way( void *output, const void *input );
 int scanhash_skein_8way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
-int skein_8way_prehash( struct work * );

 void skein2hash_8way( void *output, const void *input );
 int scanhash_skein2_8way( struct work *work, uint32_t max_nonce,
                          uint64_t* hashes_done, struct thr_info *mythr );
-int skein2_8way_prehash( struct work * );

 #elif defined(SKEIN_4WAY)

 void skeinhash_4way( void *output, const void *input );
 int scanhash_skein_4way( struct work *work, uint32_t max_nonce,
                         uint64_t *hashes_done, struct thr_info *mythr );
-int skein_4way_prehash( struct work * );

 void skein2hash_4way( void *output, const void *input );
 int scanhash_skein2_4way( struct work *work, uint32_t max_nonce,
                          uint64_t* hashes_done, struct thr_info *mythr );
-int skein2_4way_prehash( struct work * );

 #else

--- a/algo/skein/skein-hash-4way.c
+++ b/algo/skein/skein-hash-4way.c
@@ -285,7 +285,7 @@ static const uint64_t IV512[] = {
 #define SKBI(k, s, i)   XCAT(k, XCAT(XCAT(XCAT(M9_, s), _), i))
 #define SKBT(t, s, v)   XCAT(t, XCAT(XCAT(XCAT(M3_, s), _), v))

-#define READ_STATE_BIG(sc)   do { \
+#define READ_STATE_BIG(sc) \
      h0 = (sc)->h0; \
      h1 = (sc)->h1; \
      h2 = (sc)->h2; \
@@ -294,10 +294,9 @@ static const uint64_t IV512[] = {
      h5 = (sc)->h5; \
      h6 = (sc)->h6; \
      h7 = (sc)->h7; \
-      bcount = sc->bcount; \
-   } while (0)
+      bcount = sc->bcount;

-#define WRITE_STATE_BIG(sc)   do { \
+#define WRITE_STATE_BIG(sc) \
      (sc)->h0 = h0; \
      (sc)->h1 = h1; \
      (sc)->h2 = h2; \
@@ -306,62 +305,54 @@ static const uint64_t IV512[] = {
      (sc)->h5 = h5; \
      (sc)->h6 = h6; \
      (sc)->h7 = h7; \
-      sc->bcount = bcount; \
-   } while (0)
+      sc->bcount = bcount;
   

 #if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

 #define TFBIG_KINIT_8WAY( k0, k1, k2, k3, k4, k5, k6, k7, k8, t0, t1, t2 ) \
-do { \
-  k8 = mm512_xor3( mm512_xor3( k0, k1, k2 ), mm512_xor3( k3, k4, k5 ), \
-                   mm512_xor3( k6, k7, m512_const1_64( 0x1BD11BDAA9FC1A22) ));\
-  t2 = t0 ^ t1; \
-} while (0)
+  k8 = mm512_xor3( mm512_xor3( k0, k1, k2 ), \
+                   mm512_xor3( k3, k4, k5 ), \
+                   mm512_xor3( k6, k7, \
+                              _mm512_set1_epi64( 0x1BD11BDAA9FC1A22) ) ); \
+  t2 = t0 ^ t1;

 #define TFBIG_ADDKEY_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, k, t, s) \
-do { \
  w0 = _mm512_add_epi64( w0, SKBI(k,s,0) ); \
  w1 = _mm512_add_epi64( w1, SKBI(k,s,1) ); \
  w2 = _mm512_add_epi64( w2, SKBI(k,s,2) ); \
  w3 = _mm512_add_epi64( w3, SKBI(k,s,3) ); \
  w4 = _mm512_add_epi64( w4, SKBI(k,s,4) ); \
  w5 = _mm512_add_epi64( w5, _mm512_add_epi64( SKBI(k,s,5), \
-                                         m512_const1_64( SKBT(t,s,0) ) ) ); \
+                                       _mm512_set1_epi64( SKBT(t,s,0) ) ) ); \
  w6 = _mm512_add_epi64( w6, _mm512_add_epi64( SKBI(k,s,6), \
-                                         m512_const1_64( SKBT(t,s,1) ) ) ); \
+                                       _mm512_set1_epi64( SKBT(t,s,1) ) ) ); \
  w7 = _mm512_add_epi64( w7, _mm512_add_epi64( SKBI(k,s,7), \
-                                         m512_const1_64( s ) ) ); \
-} while (0)
+                                        _mm512_set1_epi64( s ) ) );

 #define TFBIG_MIX_8WAY(x0, x1, rc) \
-do { \
     x0 = _mm512_add_epi64( x0, x1 ); \
-     x1 = _mm512_xor_si512( mm512_rol_64( x1, rc ), x0 ); \
-} while (0)
+     x1 = _mm512_xor_si512( mm512_rol_64( x1, rc ), x0 );

-#define TFBIG_MIX8_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3)  do { \
+#define TFBIG_MIX8_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3) \
      TFBIG_MIX_8WAY(w0, w1, rc0); \
      TFBIG_MIX_8WAY(w2, w3, rc1); \
      TFBIG_MIX_8WAY(w4, w5, rc2); \
-      TFBIG_MIX_8WAY(w6, w7, rc3); \
-   } while (0)
+      TFBIG_MIX_8WAY(w6, w7, rc3);

-#define TFBIG_8WAY_4e(s)   do { \
+#define TFBIG_8WAY_4e(s) \
      TFBIG_ADDKEY_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
      TFBIG_MIX8_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, 46, 36, 19, 37); \
      TFBIG_MIX8_8WAY(p2, p1, p4, p7, p6, p5, p0, p3, 33, 27, 14, 42); \
      TFBIG_MIX8_8WAY(p4, p1, p6, p3, p0, p5, p2, p7, 17, 49, 36, 39); \
-      TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44,  9, 54, 56); \
-   } while (0)
+      TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44,  9, 54, 56);

-#define TFBIG_8WAY_4o(s)   do { \
+#define TFBIG_8WAY_4o(s) \
      TFBIG_ADDKEY_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
      TFBIG_MIX8_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, 39, 30, 34, 24); \
      TFBIG_MIX8_8WAY(p2, p1, p4, p7, p6, p5, p0, p3, 13, 50, 10, 17); \
      TFBIG_MIX8_8WAY(p4, p1, p6, p3, p0, p5, p2, p7, 25, 29, 39, 43); \
-      TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3,  8, 35, 56, 22); \
-   } while (0)
+      TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3,  8, 35, 56, 22);

 #define UBI_BIG_8WAY(etype, extra) \
 do { \
@@ -424,59 +415,48 @@ do { \
 #endif // AVX512

 #define TFBIG_KINIT_4WAY( k0, k1, k2, k3, k4, k5, k6, k7, k8, t0, t1, t2 ) \
-do { \
-  k8 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( _mm256_xor_si256( k0, k1 ), \
-                                              _mm256_xor_si256( k2, k3 ) ), \
-                            _mm256_xor_si256( _mm256_xor_si256( k4, k5 ), \
-                                              _mm256_xor_si256( k6, k7 ) ) ), \
-                         m256_const1_64( 0x1BD11BDAA9FC1A22) ); \
-  t2 = t0 ^ t1; \
-} while (0)
+  k8 = mm256_xor3( mm256_xor3( k0, k1, k2 ), \
+                   mm256_xor3( k3, k4, k5 ), \
+                   mm256_xor3( k6, k7, \
+                               _mm256_set1_epi64x( 0x1BD11BDAA9FC1A22) ) ); \
+  t2 = t0 ^ t1;

 #define TFBIG_ADDKEY_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, k, t, s) \
-do { \
  w0 = _mm256_add_epi64( w0, SKBI(k,s,0) ); \
  w1 = _mm256_add_epi64( w1, SKBI(k,s,1) ); \
  w2 = _mm256_add_epi64( w2, SKBI(k,s,2) ); \
  w3 = _mm256_add_epi64( w3, SKBI(k,s,3) ); \
  w4 = _mm256_add_epi64( w4, SKBI(k,s,4) ); \
  w5 = _mm256_add_epi64( w5, _mm256_add_epi64( SKBI(k,s,5), \
-                                         m256_const1_64( SKBT(t,s,0) ) ) ); \
+                                       _mm256_set1_epi64x( SKBT(t,s,0) ) ) ); \
  w6 = _mm256_add_epi64( w6, _mm256_add_epi64( SKBI(k,s,6), \
-                                         m256_const1_64( SKBT(t,s,1) ) ) ); \
+                                       _mm256_set1_epi64x( SKBT(t,s,1) ) ) ); \
  w7 = _mm256_add_epi64( w7, _mm256_add_epi64( SKBI(k,s,7), \
-                                         m256_const1_64( s ) ) ); \
-} while (0)
+                                       _mm256_set1_epi64x( s ) ) );

 #define TFBIG_MIX_4WAY(x0, x1, rc) \
-do { \
     x0 = _mm256_add_epi64( x0, x1 ); \
-     x1 = _mm256_xor_si256( mm256_rol_64( x1, rc ), x0 ); \
-} while (0)
+     x1 = _mm256_xor_si256( mm256_rol_64( x1, rc ), x0 );

-#define TFBIG_MIX8_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3)  do { \
+#define TFBIG_MIX8_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3) \
      TFBIG_MIX_4WAY(w0, w1, rc0); \
      TFBIG_MIX_4WAY(w2, w3, rc1); \
      TFBIG_MIX_4WAY(w4, w5, rc2); \
-      TFBIG_MIX_4WAY(w6, w7, rc3); \
-   } while (0)
+      TFBIG_MIX_4WAY(w6, w7, rc3);

-#define TFBIG_4WAY_4e(s)   do { \
+#define TFBIG_4WAY_4e(s) \
      TFBIG_ADDKEY_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
      TFBIG_MIX8_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, 46, 36, 19, 37); \
      TFBIG_MIX8_4WAY(p2, p1, p4, p7, p6, p5, p0, p3, 33, 27, 14, 42); \
      TFBIG_MIX8_4WAY(p4, p1, p6, p3, p0, p5, p2, p7, 17, 49, 36, 39); \
-      TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44,  9, 54, 56); \
-   } while (0)
+      TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44,  9, 54, 56);

-#define TFBIG_4WAY_4o(s)   do { \
+#define TFBIG_4WAY_4o(s) \
      TFBIG_ADDKEY_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
      TFBIG_MIX8_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, 39, 30, 34, 24); \
      TFBIG_MIX8_4WAY(p2, p1, p4, p7, p6, p5, p0, p3, 13, 50, 10, 17); \
      TFBIG_MIX8_4WAY(p4, p1, p6, p3, p0, p5, p2, p7, 25, 29, 39, 43); \
-      TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3,  8, 35, 56, 22); \
-   } while (0)
+      TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3,  8, 35, 56, 22);

 // scale buf offset by 4
 #define UBI_BIG_4WAY(etype, extra) \
@@ -541,28 +521,28 @@ do { \

 void skein256_8way_init( skein256_8way_context *sc )
 {
-        sc->h0 = m512_const1_64( 0xCCD044A12FDB3E13 );
-        sc->h1 = m512_const1_64( 0xE83590301A79A9EB );
-        sc->h2 = m512_const1_64( 0x55AEA0614F816E6F );
-        sc->h3 = m512_const1_64( 0x2A2767A4AE9B94DB );
-        sc->h4 = m512_const1_64( 0xEC06025E74DD7683 );
-        sc->h5 = m512_const1_64( 0xE7A436CDC4746251 );
-        sc->h6 = m512_const1_64( 0xC36FBAF9393AD185 );
-        sc->h7 = m512_const1_64( 0x3EEDBA1833EDFC13 );
+        sc->h0 = _mm512_set1_epi64( 0xCCD044A12FDB3E13 );
+        sc->h1 = _mm512_set1_epi64( 0xE83590301A79A9EB );
+        sc->h2 = _mm512_set1_epi64( 0x55AEA0614F816E6F );
+        sc->h3 = _mm512_set1_epi64( 0x2A2767A4AE9B94DB );
+        sc->h4 = _mm512_set1_epi64( 0xEC06025E74DD7683 );
+        sc->h5 = _mm512_set1_epi64( 0xE7A436CDC4746251 );
+        sc->h6 = _mm512_set1_epi64( 0xC36FBAF9393AD185 );
+        sc->h7 = _mm512_set1_epi64( 0x3EEDBA1833EDFC13 );
        sc->bcount = 0;
        sc->ptr = 0;
 }

 void skein512_8way_init( skein512_8way_context *sc )
 {
-        sc->h0 = m512_const1_64( 0x4903ADFF749C51CE );
-        sc->h1 = m512_const1_64( 0x0D95DE399746DF03 );
-        sc->h2 = m512_const1_64( 0x8FD1934127C79BCE );
-        sc->h3 = m512_const1_64( 0x9A255629FF352CB1 );
-        sc->h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
-        sc->h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
-        sc->h6 = m512_const1_64( 0x991112C71A75B523 );
-        sc->h7 = m512_const1_64( 0xAE18A40B660FCC33 );
+        sc->h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
+        sc->h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
+        sc->h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
+        sc->h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
+        sc->h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
+        sc->h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
+        sc->h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
+        sc->h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );
        sc->bcount = 0;
        sc->ptr = 0;
 }
@@ -660,14 +640,14 @@ void skein512_8way_full( skein512_8way_context *sc, void *out, const void *data,

 // Init

-        h0 = m512_const1_64( 0x4903ADFF749C51CE );
-        h1 = m512_const1_64( 0x0D95DE399746DF03 );
-        h2 = m512_const1_64( 0x8FD1934127C79BCE );
-        h3 = m512_const1_64( 0x9A255629FF352CB1 );
-        h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
-        h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
-        h6 = m512_const1_64( 0x991112C71A75B523 );
-        h7 = m512_const1_64( 0xAE18A40B660FCC33 );
+        h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
+        h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
+        h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
+        h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
+        h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
+        h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
+        h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
+        h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );

 // Update

@@ -734,14 +714,14 @@ skein512_8way_prehash64( skein512_8way_context *sc, const void *data )
   buf[5] = vdata[5];
   buf[6] = vdata[6];
   buf[7] = vdata[7];
-   register __m512i h0 = m512_const1_64( 0x4903ADFF749C51CE );
-   register __m512i h1 = m512_const1_64( 0x0D95DE399746DF03 );
-   register __m512i h2 = m512_const1_64( 0x8FD1934127C79BCE );
-   register __m512i h3 = m512_const1_64( 0x9A255629FF352CB1 );
-   register __m512i h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
-   register __m512i h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
-   register __m512i h6 = m512_const1_64( 0x991112C71A75B523 );
-   register __m512i h7 = m512_const1_64( 0xAE18A40B660FCC33 );
+   register __m512i h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
+   register __m512i h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
+   register __m512i h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
+   register __m512i h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
+   register __m512i h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
+   register __m512i h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
+   register __m512i h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
+   register __m512i h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );
   uint64_t bcount = 1;

   UBI_BIG_8WAY( 224, 0 );
@@ -830,28 +810,28 @@ skein512_8way_close(void *cc, void *dst)

 void skein256_4way_init( skein256_4way_context *sc )
 {
-        sc->h0 = m256_const1_64( 0xCCD044A12FDB3E13 );
-        sc->h1 = m256_const1_64( 0xE83590301A79A9EB );
-        sc->h2 = m256_const1_64( 0x55AEA0614F816E6F );
-        sc->h3 = m256_const1_64( 0x2A2767A4AE9B94DB );
-        sc->h4 = m256_const1_64( 0xEC06025E74DD7683 );
-        sc->h5 = m256_const1_64( 0xE7A436CDC4746251 );
-        sc->h6 = m256_const1_64( 0xC36FBAF9393AD185 );
-        sc->h7 = m256_const1_64( 0x3EEDBA1833EDFC13 );
+        sc->h0 = _mm256_set1_epi64x( 0xCCD044A12FDB3E13 );
+        sc->h1 = _mm256_set1_epi64x( 0xE83590301A79A9EB );
+        sc->h2 = _mm256_set1_epi64x( 0x55AEA0614F816E6F );
+        sc->h3 = _mm256_set1_epi64x( 0x2A2767A4AE9B94DB );
+        sc->h4 = _mm256_set1_epi64x( 0xEC06025E74DD7683 );
+        sc->h5 = _mm256_set1_epi64x( 0xE7A436CDC4746251 );
+        sc->h6 = _mm256_set1_epi64x( 0xC36FBAF9393AD185 );
+        sc->h7 = _mm256_set1_epi64x( 0x3EEDBA1833EDFC13 );
        sc->bcount = 0;
        sc->ptr = 0;
 }

 void skein512_4way_init( skein512_4way_context *sc )
 {
-        sc->h0 = m256_const1_64( 0x4903ADFF749C51CE );
-        sc->h1 = m256_const1_64( 0x0D95DE399746DF03 );
-        sc->h2 = m256_const1_64( 0x8FD1934127C79BCE );
-        sc->h3 = m256_const1_64( 0x9A255629FF352CB1 );
-        sc->h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
-        sc->h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
-        sc->h6 = m256_const1_64( 0x991112C71A75B523 );
-        sc->h7 = m256_const1_64( 0xAE18A40B660FCC33 );
+        sc->h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
+        sc->h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
+        sc->h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
+        sc->h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
+        sc->h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
+        sc->h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
+        sc->h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
+        sc->h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );
        sc->bcount = 0;
        sc->ptr = 0;
 }
@@ -954,14 +934,14 @@ skein512_4way_full( skein512_4way_context *sc, void *out, const void *data,
   const int buf_size = 64;   // 64 * __m256i
   uint64_t bcount = 0;

-   h0 = m256_const1_64( 0x4903ADFF749C51CE );
-   h1 = m256_const1_64( 0x0D95DE399746DF03 );
-   h2 = m256_const1_64( 0x8FD1934127C79BCE );
-   h3 = m256_const1_64( 0x9A255629FF352CB1 );
-   h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
-   h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
-   h6 = m256_const1_64( 0x991112C71A75B523 );
-   h7 = m256_const1_64( 0xAE18A40B660FCC33 );
+   h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
+   h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
+   h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
+   h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
+   h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
+   h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
+   h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
+   h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );

 // Update     

@@ -1028,14 +1008,14 @@ skein512_4way_prehash64( skein512_4way_context *sc, const void *data )
   buf[5] = vdata[5];
   buf[6] = vdata[6];
   buf[7] = vdata[7];
-   register __m256i h0 = m256_const1_64( 0x4903ADFF749C51CE );
-   register __m256i h1 = m256_const1_64( 0x0D95DE399746DF03 );
-   register __m256i h2 = m256_const1_64( 0x8FD1934127C79BCE );
-   register __m256i h3 = m256_const1_64( 0x9A255629FF352CB1 );
-   register __m256i h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
-   register __m256i h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
-   register __m256i h6 = m256_const1_64( 0x991112C71A75B523 );
-   register __m256i h7 = m256_const1_64( 0xAE18A40B660FCC33 );
+   register __m256i h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
+   register __m256i h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
+   register __m256i h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
+   register __m256i h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
+   register __m256i h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
+   register __m256i h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
+   register __m256i h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
+   register __m256i h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );
   uint64_t bcount = 1;

   UBI_BIG_4WAY( 224, 0 );
--- a/algo/skein/skein2-4way.c
+++ b/algo/skein/skein2-4way.c
@@ -5,17 +5,9 @@

 #if defined(SKEIN_8WAY)

-static skein512_8way_context skein512_8way_ctx __attribute__ ((aligned (64)));
-static uint32_t skein2_8way_vdata[20*8] __attribute__ ((aligned (64)));
+static __thread skein512_8way_context skein512_8way_ctx
+                                            __attribute__ ((aligned (64)));

-int skein2_8way_prehash( struct work *work )
-{
-    mm512_bswap32_intrlv80_8x64( skein2_8way_vdata, work->data );
-    skein512_8way_prehash64( &skein512_8way_ctx, skein2_8way_vdata );
-    return 1;
-}
-
-/* not used
 void skein2hash_8way( void *output, const void *input )
 {
   uint64_t hash[16*8] __attribute__ ((aligned (128)));
@@ -25,7 +17,6 @@ void skein2hash_8way( void *output, const void *input )
   skein512_8way_final16( &ctx, hash, input + (64*8) );
   skein512_8way_full( &ctx, output, hash, 64 );
 }
-*/

 int scanhash_skein2_8way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr )
@@ -45,14 +36,11 @@ int scanhash_skein2_8way( struct work *work, uint32_t max_nonce,
    const bool bench = opt_benchmark;
    skein512_8way_context ctx;

-    pthread_rwlock_rdlock( &g_work_lock );
-       memcpy( vdata, skein2_8way_vdata, sizeof vdata );
-       memcpy( &ctx, &skein512_8way_ctx, sizeof ctx );
-    pthread_rwlock_unlock( &g_work_lock );
-
+    mm512_bswap32_intrlv80_8x64( vdata, pdata );
    *noncev = mm512_intrlv_blend_32(
                _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
                                  n+3, 0, n+2, 0, n+1, 0, n  , 0 ), *noncev );
+    skein512_8way_prehash64( &ctx, vdata );
    do
    {
       skein512_8way_final16( &ctx, hash, vdata + (16*8) );
@@ -69,7 +57,7 @@ int scanhash_skein2_8way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
       n += 8;
    } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

@@ -79,18 +67,10 @@ int scanhash_skein2_8way( struct work *work, uint32_t max_nonce,
 }

 #elif defined(SKEIN_4WAY)
-                                           
-static skein512_4way_context skein512_4way_ctx __attribute__ ((aligned (64)));
-static uint32_t skein2_4way_vdata[20*4] __attribute__ ((aligned (64)));
-                                           
-int skein2_4way_prehash( struct work *work )
-{
-    mm256_bswap32_intrlv80_4x64( skein2_4way_vdata, work->data );
-    skein512_4way_prehash64( &skein512_4way_ctx, skein2_4way_vdata );
-    return 1;
-}   

-/* not used
+static __thread skein512_4way_context skein512_4way_ctx
+                                           __attribute__ ((aligned (64)));
+
 void skein2hash_4way( void *output, const void *input )
 {
   skein512_4way_context ctx;
@@ -100,7 +80,6 @@ void skein2hash_4way( void *output, const void *input )
   skein512_4way_final16( &ctx, hash, input + (64*4) );
   skein512_4way_full( &ctx, output, hash, 64 );
 }
-*/

 int scanhash_skein2_4way( struct work *work, uint32_t max_nonce,
                          uint64_t *hashes_done, struct thr_info *mythr )
@@ -120,11 +99,8 @@ int scanhash_skein2_4way( struct work *work, uint32_t max_nonce,
    const bool bench = opt_benchmark;
    skein512_4way_context ctx;

-    pthread_rwlock_rdlock( &g_work_lock );
-       memcpy( vdata, skein2_4way_vdata, sizeof vdata );
-       memcpy( &ctx, &skein512_4way_ctx, sizeof ctx );
-    pthread_rwlock_unlock( &g_work_lock );
-
+    mm256_bswap32_intrlv80_4x64( vdata, pdata );
+    skein512_4way_prehash64( &ctx, vdata );
    *noncev = mm256_intrlv_blend_32(
                _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
    do 
@@ -143,7 +119,7 @@ int scanhash_skein2_4way( struct work *work, uint32_t max_nonce,
          }
       }
       *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
       n += 4;
    } while ( (n < last_nonce) && !work_restart[thr_id].restart );

--- a/algo/sm3/sm3-hash-4way.c
+++ b/algo/sm3/sm3-hash-4way.c
@@ -74,6 +74,10 @@
   _mm256_or_si256( _mm256_and_si256( x, y ), \
                    _mm256_andnot_si256( x, z ) )

+#define mm256_rol_var_32( v, c ) \
+   _mm256_or_si256( _mm256_slli_epi32( v, c ), \
+                    _mm256_srli_epi32( v, 32-(c) ) )
+
 void sm3_8way_compress( __m256i *digest, __m256i *block )
 {
   __m256i W[68], W1[64];
@@ -251,6 +255,9 @@ void sm3_8way_close( void *cc, void *dst )
                                 _mm_andnot_si128( x, z ) )


+#define mm128_rol_var_32( v, c ) \
+   _mm_or_si128( _mm_slli_epi32( v, c ), _mm_srli_epi32( v, 32-(c) ) )
+
 void sm3_4way_compress( __m128i *digest, __m128i *block )
 {
   __m128i W[68], W1[64];
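
The `mm256_rol_var_32` and `mm128_rol_var_32` macros added above build a rotate-left from two shifts and an OR, applied to every 32-bit element. A minimal scalar sketch of the same idiom follows; the name `rol32` is hypothetical and not part of the patch, and the sketch assumes a rotate count strictly between 0 and 32, since a shift by 32 is undefined in C.

```c
#include <stdint.h>

/* Scalar counterpart of the shift-or rotate used by the macros.
   Hypothetical helper; valid only for 0 < c < 32. */
static inline uint32_t rol32( uint32_t v, unsigned c )
{
   /* Bits shifted out on the left re-enter on the right. */
   return ( v << c ) | ( v >> ( 32 - c ) );
}
```

For instance, `rol32( 0x12345678, 8 )` moves the top byte to the bottom. The vector macros do the same thing in each lane at once.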
--- a/algo/swifftx/swifftx.c
+++ b/algo/swifftx/swifftx.c
@@ -630,36 +630,35 @@ void InitializeSWIFFTX()
 }

 // In the original code the F matrix is rotated so it was not aranged
-// the same as all the other data. Rearanging F to match all the other
-// data made vectorizing possible, the compiler probably could have been
-// able to auto-vectorize with proper data organisation.
-// Also in the original code the custom 16 bit data types are all now 32
-// bit int32_t regardless of the type name.
-//
-void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
+// the same as the other data. Rearranging F made vectorizing up to 256 bits
+// possible.
+// Also, the custom 16 bit data types of the original code are now all aliased
+// to 32 bit int32_t.
+
+void FFT( const unsigned char input[EIGHTH_N], swift_int32_t *output )
 {
 #if defined(__AVX2__)

-   __m256i F[8] __attribute__ ((aligned (64)));
+   __m256i F0, F1, F2, F3, F4, F5, F6, F7;
+   __m256i tbl = *(__m256i*)&( fftTable[ input[0] << 3 ] );
   __m256i *mul = (__m256i*)multipliers;
   __m256i *out = (__m256i*)output;
-   __m256i *tbl = (__m256i*)&( fftTable[ input[0] << 3 ] );

-   F[0] = _mm256_mullo_epi32( mul[0], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[1] << 3 ] );
-   F[1] = _mm256_mullo_epi32( mul[1], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[2] << 3 ] );
-   F[2] = _mm256_mullo_epi32( mul[2], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[3] << 3 ] );
-   F[3] = _mm256_mullo_epi32( mul[3], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[4] << 3 ] );
-   F[4] = _mm256_mullo_epi32( mul[4], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[5] << 3 ] );
-   F[5] = _mm256_mullo_epi32( mul[5], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[6] << 3 ] );
-   F[6] = _mm256_mullo_epi32( mul[6], *tbl );
-   tbl = (__m256i*)&( fftTable[ input[7] << 3 ] );
-   F[7] = _mm256_mullo_epi32( mul[7], *tbl );
+   F0 = _mm256_mullo_epi32( mul[0], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[1] << 3 ] );
+   F1 = _mm256_mullo_epi32( mul[1], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[2] << 3 ] );
+   F2 = _mm256_mullo_epi32( mul[2], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[3] << 3 ] );
+   F3 = _mm256_mullo_epi32( mul[3], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[4] << 3 ] );
+   F4 = _mm256_mullo_epi32( mul[4], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[5] << 3 ] );
+   F5 = _mm256_mullo_epi32( mul[5], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[6] << 3 ] );
+   F6 = _mm256_mullo_epi32( mul[6], tbl );
+   tbl = *(__m256i*)&( fftTable[ input[7] << 3 ] );
+   F7 = _mm256_mullo_epi32( mul[7], tbl );

   #define ADD_SUB( a, b ) \
   { \
@@ -668,52 +667,50 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
      a = _mm256_add_epi32( a, tmp ); \
   }
   
-   ADD_SUB( F[0], F[1] );
-   ADD_SUB( F[2], F[3] );
-   ADD_SUB( F[4], F[5] );
-   ADD_SUB( F[6], F[7] );
-
-   F[3] = _mm256_slli_epi32( F[3], 4 );
-   F[7] = _mm256_slli_epi32( F[7], 4 );
-
-   ADD_SUB( F[0], F[2] );
-   ADD_SUB( F[1], F[3] );
-   ADD_SUB( F[4], F[6] );
-   ADD_SUB( F[5], F[7] );  
-
-   F[5] = _mm256_slli_epi32( F[5], 2 );
-   F[6] = _mm256_slli_epi32( F[6], 4 );
-   F[7] = _mm256_slli_epi32( F[7], 6 );
-
-   ADD_SUB( F[0], F[4] );
-   ADD_SUB( F[1], F[5] );
-   ADD_SUB( F[2], F[6] );
-   ADD_SUB( F[3], F[7] );
+   ADD_SUB( F0, F1 );
+   ADD_SUB( F2, F3 );
+   ADD_SUB( F4, F5 );
+   ADD_SUB( F6, F7 );
+   F3 = _mm256_slli_epi32( F3, 4 );
+   F7 = _mm256_slli_epi32( F7, 4 );
+   ADD_SUB( F0, F2 );
+   ADD_SUB( F1, F3 );
+   ADD_SUB( F4, F6 );
+   ADD_SUB( F5, F7 );  
+   F5 = _mm256_slli_epi32( F5, 2 );
+   F6 = _mm256_slli_epi32( F6, 4 );
+   F7 = _mm256_slli_epi32( F7, 6 );
+   ADD_SUB( F0, F4 );
+   ADD_SUB( F1, F5 );
+   ADD_SUB( F2, F6 );
+   ADD_SUB( F3, F7 );

   #undef ADD_SUB

 #if defined (__AVX512VL__) && defined(__AVX512BW__)   

-   const __m256i mask = _mm256_movm_epi8( 0x11111111 );
-
+   #define Q_REDUCE( a ) \
+       _mm256_sub_epi32( _mm256_maskz_mov_epi8( 0x11111111, a ), \
+                         _mm256_srai_epi32( a, 8 ) )
+         
 #else

-   const __m256i mask = m256_const1_32( 0x000000ff );
-
-#endif
+   const __m256i mask = _mm256_set1_epi32( 0x000000ff );

   #define Q_REDUCE( a ) \
       _mm256_sub_epi32( _mm256_and_si256( a, mask ), \
                         _mm256_srai_epi32( a, 8 ) )
+   
+#endif

-   out[0] = Q_REDUCE( F[0] );  
-   out[1] = Q_REDUCE( F[1] );                        
-   out[2] = Q_REDUCE( F[2] );                        
-   out[3] = Q_REDUCE( F[3] );                        
-   out[4] = Q_REDUCE( F[4] );                        
-   out[5] = Q_REDUCE( F[5] );                        
-   out[6] = Q_REDUCE( F[6] );                        
-   out[7] = Q_REDUCE( F[7] );
+   out[0] = Q_REDUCE( F0 );  
+   out[1] = Q_REDUCE( F1 );                        
+   out[2] = Q_REDUCE( F2 );                        
+   out[3] = Q_REDUCE( F3 );                        
+   out[4] = Q_REDUCE( F4 );                        
+   out[5] = Q_REDUCE( F5 );                        
+   out[6] = Q_REDUCE( F6 );                        
+   out[7] = Q_REDUCE( F7 );

   #undef Q_REDUCE

@@ -763,12 +760,10 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   ADD_SUB( F[ 9], F[11] );
   ADD_SUB( F[12], F[14] );
   ADD_SUB( F[13], F[15] );
-
   F[ 6] = _mm_slli_epi32( F[ 6], 4 );
   F[ 7] = _mm_slli_epi32( F[ 7], 4 );
   F[14] = _mm_slli_epi32( F[14], 4 );
   F[15] = _mm_slli_epi32( F[15], 4 );
-
   ADD_SUB( F[ 0], F[ 4] );
   ADD_SUB( F[ 1], F[ 5] );
   ADD_SUB( F[ 2], F[ 6] );
@@ -777,14 +772,12 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   ADD_SUB( F[ 9], F[13] );
   ADD_SUB( F[10], F[14] );
   ADD_SUB( F[11], F[15] );
-
   F[10] = _mm_slli_epi32( F[10], 2 );
   F[11] = _mm_slli_epi32( F[11], 2 );
   F[12] = _mm_slli_epi32( F[12], 4 );
   F[13] = _mm_slli_epi32( F[13], 4 );
   F[14] = _mm_slli_epi32( F[14], 6 );
   F[15] = _mm_slli_epi32( F[15], 6 );
-   
   ADD_SUB( F[ 0], F[ 8] );
   ADD_SUB( F[ 1], F[ 9] );
   ADD_SUB( F[ 2], F[10] );
@@ -796,7 +789,7 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)

   #undef ADD_SUB

-   const __m128i mask = m128_const1_32( 0x000000ff );
+   const __m128i mask = _mm_set1_epi32( 0x000000ff );

   #define Q_REDUCE( a ) \
      _mm_sub_epi32( _mm_and_si128( a, mask ), _mm_srai_epi32( a, 8 ) ) 
@@ -820,16 +813,13 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)

   #undef Q_REDUCE

-#else   // < SSE4.1
+#else   // neither AVX2 nor SSE4.1
   
   swift_int16_t *mult = multipliers;
-
-   // First loop unrolling:
-	register swift_int16_t *table = &(fftTable[input[0] << 3]);
-
-/*
+	swift_int16_t *table = &( fftTable[ input[0] << 3 ] );
   swift_int32_t F[64];

+   /*
   for (int i = 0; i < 8; i++)
   {
      int j = i<<3;
@@ -845,99 +835,91 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   }
 */

-   register swift_int32_t F0, F1, F2, F3, F4, F5, F6, F7, F8, F9,
-                F10, F11, F12, F13, F14, F15, F16, F17, F18, F19,
-                F20, F21, F22, F23, F24, F25, F26, F27, F28, F29,
-                F30, F31, F32, F33, F34, F35, F36, F37, F38, F39,
-                F40, F41, F42, F43, F44, F45, F46, F47, F48, F49,
-                F50, F51, F52, F53, F54, F55, F56, F57, F58, F59,
-                F60, F61, F62, F63;
-   
-	F0  = mult[0] * table[0];
-	F8  = mult[1] * table[1];
-	F16 = mult[2] * table[2];
-	F24 = mult[3] * table[3];
-	F32 = mult[4] * table[4];
-	F40 = mult[5] * table[5];
-	F48 = mult[6] * table[6];
-	F56 = mult[7] * table[7];
+	F[ 0] = mult[ 0] * table[0];
+	F[ 8] = mult[ 1] * table[1];
+	F[16] = mult[ 2] * table[2];
+	F[24] = mult[ 3] * table[3];
+	F[32] = mult[ 4] * table[4];
+	F[40] = mult[ 5] * table[5];
+	F[48] = mult[ 6] * table[6];
+	F[56] = mult[ 7] * table[7];

 	table = &(fftTable[input[1] << 3]);

-	F1  = mult[ 8] * table[0];
-	F9  = mult[ 9] * table[1];
-	F17 = mult[10] * table[2];
-	F25 = mult[11] * table[3];
-	F33 = mult[12] * table[4];
-	F41 = mult[13] * table[5];
-	F49 = mult[14] * table[6];
-	F57 = mult[15] * table[7];
+	F[ 1] = mult[ 8] * table[0];
+	F[ 9] = mult[ 9] * table[1];
+	F[17] = mult[10] * table[2];
+	F[25] = mult[11] * table[3];
+	F[33] = mult[12] * table[4];
+	F[41] = mult[13] * table[5];
+	F[49] = mult[14] * table[6];
+	F[57] = mult[15] * table[7];

 	table = &(fftTable[input[2] << 3]);

-	F2  = mult[16] * table[0];
-	F10 = mult[17] * table[1];
-	F18 = mult[18] * table[2];
-	F26 = mult[19] * table[3];
-	F34 = mult[20] * table[4];
-	F42 = mult[21] * table[5];
-	F50 = mult[22] * table[6];
-	F58 = mult[23] * table[7];
+	F[ 2] = mult[16] * table[0];
+	F[10] = mult[17] * table[1];
+	F[18] = mult[18] * table[2];
+	F[26] = mult[19] * table[3];
+	F[34] = mult[20] * table[4];
+	F[42] = mult[21] * table[5];
+	F[50] = mult[22] * table[6];
+	F[58] = mult[23] * table[7];

 	table = &(fftTable[input[3] << 3]);

-	F3  = mult[24] * table[0];
-	F11 = mult[25] * table[1];
-	F19 = mult[26] * table[2];
-	F27 = mult[27] * table[3];
-	F35 = mult[28] * table[4];
-	F43 = mult[29] * table[5];
-	F51 = mult[30] * table[6];
-	F59 = mult[31] * table[7];
+	F[ 3] = mult[24] * table[0];
+	F[11] = mult[25] * table[1];
+	F[19] = mult[26] * table[2];
+	F[27] = mult[27] * table[3];
+	F[35] = mult[28] * table[4];
+	F[43] = mult[29] * table[5];
+	F[51] = mult[30] * table[6];
+	F[59] = mult[31] * table[7];

 	table = &(fftTable[input[4] << 3]);

-	F4  = mult[32] * table[0];
-	F12 = mult[33] * table[1];
-	F20 = mult[34] * table[2];
-	F28 = mult[35] * table[3];
-	F36 = mult[36] * table[4];
-	F44 = mult[37] * table[5];
-	F52 = mult[38] * table[6];
-	F60 = mult[39] * table[7];
+	F[ 4] = mult[32] * table[0];
+	F[12] = mult[33] * table[1];
+	F[20] = mult[34] * table[2];
+	F[28] = mult[35] * table[3];
+	F[36] = mult[36] * table[4];
+	F[44] = mult[37] * table[5];
+	F[52] = mult[38] * table[6];
+	F[60] = mult[39] * table[7];

 	table = &(fftTable[input[5] << 3]);

-	F5  = mult[40] * table[0];
-	F13 = mult[41] * table[1];
-	F21 = mult[42] * table[2];
-	F29 = mult[43] * table[3];
-	F37 = mult[44] * table[4];
-	F45 = mult[45] * table[5];
-	F53 = mult[46] * table[6];
-	F61 = mult[47] * table[7];
+	F[ 5] = mult[40] * table[0];
+	F[13] = mult[41] * table[1];
+	F[21] = mult[42] * table[2];
+	F[29] = mult[43] * table[3];
+	F[37] = mult[44] * table[4];
+	F[45] = mult[45] * table[5];
+	F[53] = mult[46] * table[6];
+	F[61] = mult[47] * table[7];

 	table = &(fftTable[input[6] << 3]);

-	F6  = mult[48] * table[0];
-	F14 = mult[49] * table[1];
-	F22 = mult[50] * table[2];
-	F30 = mult[51] * table[3];
-	F38 = mult[52] * table[4];
-	F46 = mult[53] * table[5];
-	F54 = mult[54] * table[6];
-	F62 = mult[55] * table[7];
+	F[ 6] = mult[48] * table[0];
+	F[14] = mult[49] * table[1];
+	F[22] = mult[50] * table[2];
+	F[30] = mult[51] * table[3];
+	F[38] = mult[52] * table[4];
+	F[46] = mult[53] * table[5];
+	F[54] = mult[54] * table[6];
+	F[62] = mult[55] * table[7];

 	table = &(fftTable[input[7] << 3]);

-	F7  = mult[56] * table[0];
-	F15 = mult[57] * table[1];
-	F23 = mult[58] * table[2];
-	F31 = mult[59] * table[3];
-	F39 = mult[60] * table[4];
-	F47 = mult[61] * table[5];
-	F55 = mult[62] * table[6];
-	F63 = mult[63] * table[7];
+	F[ 7] = mult[56] * table[0];
+	F[15] = mult[57] * table[1];
+	F[23] = mult[58] * table[2];
+	F[31] = mult[59] * table[3];
+	F[39] = mult[60] * table[4];
+	F[47] = mult[61] * table[5];
+	F[55] = mult[62] * table[6];
+	F[63] = mult[63] * table[7];

   #define ADD_SUB( a, b ) \
   { \
@@ -987,262 +969,229 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
   }
 */

-	// Second loop unrolling:
 	// Iteration 0:
-	ADD_SUB(F0, F1);
-	ADD_SUB(F2, F3);
-	ADD_SUB(F4, F5);
-	ADD_SUB(F6, F7);
+	ADD_SUB( F[ 0], F[ 1] );
+	ADD_SUB( F[ 2], F[ 3] );
+	ADD_SUB( F[ 4], F[ 5] );
+	ADD_SUB( F[ 6], F[ 7] );
+	F[ 3] <<= 4;
+	F[ 7] <<= 4;
+	ADD_SUB( F[ 0], F[ 2] );
+	ADD_SUB( F[ 1], F[ 3] );
+	ADD_SUB( F[ 4], F[ 6] );
+	ADD_SUB( F[ 5], F[ 7] );
+	F[ 5] <<= 2;
+	F[ 6] <<= 4;
+	F[ 7] <<= 6;
+	ADD_SUB( F[ 0], F[ 4] );
+	ADD_SUB( F[ 1], F[ 5] );
+	ADD_SUB( F[ 2], F[ 6] );
+	ADD_SUB( F[ 3], F[ 7] );

-	F3 <<= 4;
-	F7 <<= 4;
-
-	ADD_SUB(F0, F2);
-	ADD_SUB(F1, F3);
-	ADD_SUB(F4, F6);
-	ADD_SUB(F5, F7);
-
-	F5 <<= 2;
-	F6 <<= 4;
-	F7 <<= 6;
-
-	ADD_SUB(F0, F4);
-	ADD_SUB(F1, F5);
-	ADD_SUB(F2, F6);
-	ADD_SUB(F3, F7);
-
-	output[0] = Q_REDUCE(F0);
-	output[8] = Q_REDUCE(F1);
-	output[16] = Q_REDUCE(F2);
-	output[24] = Q_REDUCE(F3);
-	output[32] = Q_REDUCE(F4);
-	output[40] = Q_REDUCE(F5);
-	output[48] = Q_REDUCE(F6);
-	output[56] = Q_REDUCE(F7);
+	output[ 0] = Q_REDUCE( F[ 0] );
+	output[ 8] = Q_REDUCE( F[ 1] );
+	output[16] = Q_REDUCE( F[ 2] );
+	output[24] = Q_REDUCE( F[ 3] );
+	output[32] = Q_REDUCE( F[ 4] );
+	output[40] = Q_REDUCE( F[ 5] );
+	output[48] = Q_REDUCE( F[ 6] );
+	output[56] = Q_REDUCE( F[ 7] );

 	// Iteration 1:
-	ADD_SUB(F8, F9);
-	ADD_SUB(F10, F11);
-	ADD_SUB(F12, F13);
-	ADD_SUB(F14, F15);
+	ADD_SUB( F[ 8], F[ 9] );
+	ADD_SUB( F[10], F[11] );
+	ADD_SUB( F[12], F[13] );
+	ADD_SUB( F[14], F[15] );
+	F[11] <<= 4;
+	F[15] <<= 4;
+	ADD_SUB( F[ 8], F[10] );
+	ADD_SUB( F[ 9], F[11] );
+	ADD_SUB( F[12], F[14] );
+	ADD_SUB( F[13], F[15] );
+	F[13] <<= 2;
+	F[14] <<= 4;
+	F[15] <<= 6;
+	ADD_SUB( F[ 8], F[12] );
+	ADD_SUB( F[ 9], F[13] );
+	ADD_SUB( F[10], F[14] );
+	ADD_SUB( F[11], F[15] );

-	F11 <<= 4;
-	F15 <<= 4;
-
-	ADD_SUB(F8, F10);
-	ADD_SUB(F9, F11);
-	ADD_SUB(F12, F14);
-	ADD_SUB(F13, F15);
-
-	F13 <<= 2;
-	F14 <<= 4;
-	F15 <<= 6;
-
-	ADD_SUB(F8, F12);
-	ADD_SUB(F9, F13);
-	ADD_SUB(F10, F14);
-	ADD_SUB(F11, F15);
-
-	output[1] = Q_REDUCE(F8);
-	output[9] = Q_REDUCE(F9);
-	output[17] = Q_REDUCE(F10);
-	output[25] = Q_REDUCE(F11);
-	output[33] = Q_REDUCE(F12);
-	output[41] = Q_REDUCE(F13);
-	output[49] = Q_REDUCE(F14);
-	output[57] = Q_REDUCE(F15);
+	output[ 1] = Q_REDUCE( F[ 8] );
+	output[ 9] = Q_REDUCE( F[ 9] );
+	output[17] = Q_REDUCE( F[10] );
+	output[25] = Q_REDUCE( F[11] );
+	output[33] = Q_REDUCE( F[12] );
+	output[41] = Q_REDUCE( F[13] );
+	output[49] = Q_REDUCE( F[14] );
+	output[57] = Q_REDUCE( F[15] );

 	// Iteration 2:
-	ADD_SUB(F16, F17);
-	ADD_SUB(F18, F19);
-	ADD_SUB(F20, F21);
-	ADD_SUB(F22, F23);
+	ADD_SUB( F[16], F[17] );
+	ADD_SUB( F[18], F[19] );
+	ADD_SUB( F[20], F[21] );
+	ADD_SUB( F[22], F[23] );
+	F[19] <<= 4;
+	F[23] <<= 4;
+	ADD_SUB( F[16], F[18] );
+	ADD_SUB( F[17], F[19] );
+	ADD_SUB( F[20], F[22] );
+	ADD_SUB( F[21], F[23] );
+	F[21] <<= 2;
+	F[22] <<= 4;
+	F[23] <<= 6;
+	ADD_SUB( F[16], F[20] );
+	ADD_SUB( F[17], F[21] );
+	ADD_SUB( F[18], F[22] );
+	ADD_SUB( F[19], F[23] );

-	F19 <<= 4;
-	F23 <<= 4;
-
-	ADD_SUB(F16, F18);
-	ADD_SUB(F17, F19);
-	ADD_SUB(F20, F22);
-	ADD_SUB(F21, F23);
-
-	F21 <<= 2;
-	F22 <<= 4;
-	F23 <<= 6;
-
-	ADD_SUB(F16, F20);
-	ADD_SUB(F17, F21);
-	ADD_SUB(F18, F22);
-	ADD_SUB(F19, F23);
-
-	output[2] = Q_REDUCE(F16);
-	output[10] = Q_REDUCE(F17);
-	output[18] = Q_REDUCE(F18);
-	output[26] = Q_REDUCE(F19);
-	output[34] = Q_REDUCE(F20);
-	output[42] = Q_REDUCE(F21);
-	output[50] = Q_REDUCE(F22);
-	output[58] = Q_REDUCE(F23);
+	output[ 2] = Q_REDUCE( F[16] );
+	output[10] = Q_REDUCE( F[17] );
+	output[18] = Q_REDUCE( F[18] );
+	output[26] = Q_REDUCE( F[19] );
+	output[34] = Q_REDUCE( F[20] );
+	output[42] = Q_REDUCE( F[21] );
+	output[50] = Q_REDUCE( F[22] );
+	output[58] = Q_REDUCE( F[23] );

 	// Iteration 3:
-	ADD_SUB(F24, F25);
-	ADD_SUB(F26, F27);
-	ADD_SUB(F28, F29);
-	ADD_SUB(F30, F31);
+	ADD_SUB( F[24], F[25] );
+	ADD_SUB( F[26], F[27] );
+	ADD_SUB( F[28], F[29] );
+	ADD_SUB( F[30], F[31] );
+	F[27] <<= 4;
+	F[31] <<= 4;
+	ADD_SUB( F[24], F[26] );
+	ADD_SUB( F[25], F[27] );
+	ADD_SUB( F[28], F[30] );
+	ADD_SUB( F[29], F[31] );
+	F[29] <<= 2;
+	F[30] <<= 4;
+	F[31] <<= 6;
+	ADD_SUB( F[24], F[28] );
+	ADD_SUB( F[25], F[29] );
+	ADD_SUB( F[26], F[30] );
+	ADD_SUB( F[27], F[31] );

-	F27 <<= 4;
-	F31 <<= 4;
-
-	ADD_SUB(F24, F26);
-	ADD_SUB(F25, F27);
-	ADD_SUB(F28, F30);
-	ADD_SUB(F29, F31);
-
-	F29 <<= 2;
-	F30 <<= 4;
-	F31 <<= 6;
-
-	ADD_SUB(F24, F28);
-	ADD_SUB(F25, F29);
-	ADD_SUB(F26, F30);
-	ADD_SUB(F27, F31);
-
-	output[3] = Q_REDUCE(F24);
-	output[11] = Q_REDUCE(F25);
-	output[19] = Q_REDUCE(F26);
-	output[27] = Q_REDUCE(F27);
-	output[35] = Q_REDUCE(F28);
-	output[43] = Q_REDUCE(F29);
-	output[51] = Q_REDUCE(F30);
-	output[59] = Q_REDUCE(F31);
+	output[ 3] = Q_REDUCE( F[24] );
+	output[11] = Q_REDUCE( F[25] );
+	output[19] = Q_REDUCE( F[26] );
+	output[27] = Q_REDUCE( F[27] );
+	output[35] = Q_REDUCE( F[28] );
+	output[43] = Q_REDUCE( F[29] );
+	output[51] = Q_REDUCE( F[30] );
+	output[59] = Q_REDUCE( F[31] );

 	// Iteration 4:
-	ADD_SUB(F32, F33);
-	ADD_SUB(F34, F35);
-	ADD_SUB(F36, F37);
-	ADD_SUB(F38, F39);
+	ADD_SUB( F[32], F[33] );
+	ADD_SUB( F[34], F[35] );
+	ADD_SUB( F[36], F[37] );
+	ADD_SUB( F[38], F[39] );
+	F[35] <<= 4;
+	F[39] <<= 4;
+	ADD_SUB( F[32], F[34] );
+	ADD_SUB( F[33], F[35] );
+	ADD_SUB( F[36], F[38] );
+	ADD_SUB( F[37], F[39] );
+	F[37] <<= 2;
+	F[38] <<= 4;
+	F[39] <<= 6;
+	ADD_SUB( F[32], F[36] );
+	ADD_SUB( F[33], F[37] );
+	ADD_SUB( F[34], F[38] );
+	ADD_SUB( F[35], F[39] );

-	F35 <<= 4;
-	F39 <<= 4;
-
-	ADD_SUB(F32, F34);
-	ADD_SUB(F33, F35);
-	ADD_SUB(F36, F38);
-	ADD_SUB(F37, F39);
-
-	F37 <<= 2;
-	F38 <<= 4;
-	F39 <<= 6;
-
-	ADD_SUB(F32, F36);
-	ADD_SUB(F33, F37);
-	ADD_SUB(F34, F38);
-	ADD_SUB(F35, F39);
-
-	output[4] = Q_REDUCE(F32);
-	output[12] = Q_REDUCE(F33);
-	output[20] = Q_REDUCE(F34);
-	output[28] = Q_REDUCE(F35);
-	output[36] = Q_REDUCE(F36);
-	output[44] = Q_REDUCE(F37);
-	output[52] = Q_REDUCE(F38);
-	output[60] = Q_REDUCE(F39);
+	output[ 4] = Q_REDUCE( F[32] );
+	output[12] = Q_REDUCE( F[33] );
+	output[20] = Q_REDUCE( F[34] );
+	output[28] = Q_REDUCE( F[35] );
+	output[36] = Q_REDUCE( F[36] );
+	output[44] = Q_REDUCE( F[37] );
+	output[52] = Q_REDUCE( F[38] );
+	output[60] = Q_REDUCE( F[39] );

 	// Iteration 5:
-	ADD_SUB(F40, F41);
-	ADD_SUB(F42, F43);
-	ADD_SUB(F44, F45);
-	ADD_SUB(F46, F47);
+	ADD_SUB( F[40], F[41] );
+	ADD_SUB( F[42], F[43] );
+	ADD_SUB( F[44], F[45] );
+	ADD_SUB( F[46], F[47] );
+	F[43] <<= 4;
+	F[47] <<= 4;
+	ADD_SUB( F[40], F[42] );
+	ADD_SUB( F[41], F[43] );
+	ADD_SUB( F[44], F[46] );
+	ADD_SUB( F[45], F[47] );
+	F[45] <<= 2;
+	F[46] <<= 4;
+	F[47] <<= 6;
+	ADD_SUB( F[40], F[44] );
+	ADD_SUB( F[41], F[45] );
+	ADD_SUB( F[42], F[46] );
+	ADD_SUB( F[43], F[47] );

-	F43 <<= 4;
-	F47 <<= 4;
-
-	ADD_SUB(F40, F42);
-	ADD_SUB(F41, F43);
-	ADD_SUB(F44, F46);
-	ADD_SUB(F45, F47);
-
-	F45 <<= 2;
-	F46 <<= 4;
-	F47 <<= 6;
-
-	ADD_SUB(F40, F44);
-	ADD_SUB(F41, F45);
-	ADD_SUB(F42, F46);
-	ADD_SUB(F43, F47);
-
-	output[5] = Q_REDUCE(F40);
-	output[13] = Q_REDUCE(F41);
-	output[21] = Q_REDUCE(F42);
-	output[29] = Q_REDUCE(F43);
-	output[37] = Q_REDUCE(F44);
-	output[45] = Q_REDUCE(F45);
-	output[53] = Q_REDUCE(F46);
-	output[61] = Q_REDUCE(F47);
+	output[ 5] = Q_REDUCE( F[40] );
+	output[13] = Q_REDUCE( F[41] );
+	output[21] = Q_REDUCE( F[42] );
+	output[29] = Q_REDUCE( F[43] );
+	output[37] = Q_REDUCE( F[44] );
+	output[45] = Q_REDUCE( F[45] );
+	output[53] = Q_REDUCE( F[46] );
+	output[61] = Q_REDUCE( F[47] );

 	// Iteration 6:
-	ADD_SUB(F48, F49);
-	ADD_SUB(F50, F51);
-	ADD_SUB(F52, F53);
-	ADD_SUB(F54, F55);
+	ADD_SUB( F[48], F[49] );
+	ADD_SUB( F[50], F[51] );
+	ADD_SUB( F[52], F[53] );
+	ADD_SUB( F[54], F[55] );
+	F[51] <<= 4;
+	F[55] <<= 4;
+	ADD_SUB( F[48], F[50] );
+	ADD_SUB( F[49], F[51] );
+	ADD_SUB( F[52], F[54] );
+	ADD_SUB( F[53], F[55] );
+	F[53] <<= 2;
+	F[54] <<= 4;
+	F[55] <<= 6;
+	ADD_SUB( F[48], F[52] );
+	ADD_SUB( F[49], F[53] );
+	ADD_SUB( F[50], F[54] );
+	ADD_SUB( F[51], F[55] );

-	F51 <<= 4;
-	F55 <<= 4;
-
-	ADD_SUB(F48, F50);
-	ADD_SUB(F49, F51);
-	ADD_SUB(F52, F54);
-	ADD_SUB(F53, F55);
-
-	F53 <<= 2;
-	F54 <<= 4;
-	F55 <<= 6;
-
-	ADD_SUB(F48, F52);
-	ADD_SUB(F49, F53);
-	ADD_SUB(F50, F54);
-	ADD_SUB(F51, F55);
-
-	output[6] = Q_REDUCE(F48);
-	output[14] = Q_REDUCE(F49);
-	output[22] = Q_REDUCE(F50);
-	output[30] = Q_REDUCE(F51);
-	output[38] = Q_REDUCE(F52);
-	output[46] = Q_REDUCE(F53);
-	output[54] = Q_REDUCE(F54);
-	output[62] = Q_REDUCE(F55);
+	output[ 6] = Q_REDUCE( F[48] );
+	output[14] = Q_REDUCE( F[49] );
+	output[22] = Q_REDUCE( F[50] );
+	output[30] = Q_REDUCE( F[51] );
+	output[38] = Q_REDUCE( F[52] );
+	output[46] = Q_REDUCE( F[53] );
+	output[54] = Q_REDUCE( F[54] );
+	output[62] = Q_REDUCE( F[55] );

 	// Iteration 7:
-	ADD_SUB(F56, F57);
-	ADD_SUB(F58, F59);
-	ADD_SUB(F60, F61);
-	ADD_SUB(F62, F63);
+	ADD_SUB( F[56], F[57] );
+	ADD_SUB( F[58], F[59] );
+	ADD_SUB( F[60], F[61] );
+	ADD_SUB( F[62], F[63] );
+	F[59] <<= 4;
+	F[63] <<= 4;
+	ADD_SUB( F[56], F[58] );
+	ADD_SUB( F[57], F[59] );
+	ADD_SUB( F[60], F[62] );
+	ADD_SUB( F[61], F[63] );
+	F[61] <<= 2;
+	F[62] <<= 4;
+	F[63] <<= 6;
+	ADD_SUB( F[56], F[60] );
+	ADD_SUB( F[57], F[61] );
+	ADD_SUB( F[58], F[62] );
+	ADD_SUB( F[59], F[63] );

-	F59 <<= 4;
-	F63 <<= 4;
-
-	ADD_SUB(F56, F58);
-	ADD_SUB(F57, F59);
-	ADD_SUB(F60, F62);
-	ADD_SUB(F61, F63);
-
-	F61 <<= 2;
-	F62 <<= 4;
-	F63 <<= 6;
-
-	ADD_SUB(F56, F60);
-	ADD_SUB(F57, F61);
-	ADD_SUB(F58, F62);
-	ADD_SUB(F59, F63);
-
-	output[7] = Q_REDUCE(F56);
-	output[15] = Q_REDUCE(F57);
-	output[23] = Q_REDUCE(F58);
-	output[31] = Q_REDUCE(F59);
-	output[39] = Q_REDUCE(F60);
-	output[47] = Q_REDUCE(F61);
-	output[55] = Q_REDUCE(F62);
-	output[63] = Q_REDUCE(F63);
+	output[ 7] = Q_REDUCE( F[56] );
+	output[15] = Q_REDUCE( F[57] );
+	output[23] = Q_REDUCE( F[58] );
+	output[31] = Q_REDUCE( F[59] );
+	output[39] = Q_REDUCE( F[60] );
+	output[47] = Q_REDUCE( F[61] );
+	output[55] = Q_REDUCE( F[62] );
+	output[63] = Q_REDUCE( F[63] );

   #undef ADD_SUB
   #undef Q_REDUCE
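
Every variant of `Q_REDUCE` in this file computes the low byte minus the arithmetically shifted remainder. A scalar sketch of why that works, assuming the SWIFFT modulus 257: 256 is congruent to -1 mod 257, so `a = 256*hi + lo` is congruent to `lo - hi`. The name `q_reduce` is hypothetical. The result is only a partial reduction and may be negative or exceed 256.

```c
#include <stdint.h>

/* Partial reduction mod 257 (hypothetical scalar helper).
   With hi = a >> 8 and lo = a & 0xff, a = 256*hi + lo, and since
   256 is congruent to -1 (mod 257), a is congruent to lo - hi. */
static inline int32_t q_reduce( int32_t a )
{
   return ( a & 0xff ) - ( a >> 8 );
}
```

The AVX512VL branch gets the same `lo` term with `_mm256_maskz_mov_epi8( 0x11111111, a )`, which keeps only the lowest byte of each 32-bit element, so no mask register constant is needed.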
--- a/algo/verthash/tiny_sha3/sha3-4way.c
+++ b/algo/verthash/tiny_sha3/sha3-4way.c
@@ -134,10 +134,10 @@ int sha3_4way_update( sha3_4way_ctx_t *c, const void *data, size_t len )
 int sha3_4way_final( void *md, sha3_4way_ctx_t *c )
 {
    c->st[ c->pt ] = _mm256_xor_si256( c->st[ c->pt ],
-                                       m256_const1_64( 6 ) );
+                                       _mm256_set1_epi64x( 6 ) );
    c->st[ c->rsiz / 8 - 1 ] =
                       _mm256_xor_si256( c->st[ c->rsiz / 8 - 1 ],
-                                         m256_const1_64( 0x8000000000000000 ) );
+                                    _mm256_set1_epi64x( 0x8000000000000000 ) );
    sha3_4way_keccakf( c->st );
    memcpy( md, c->st, c->mdlen * 4 );
    return 1;
@@ -268,10 +268,10 @@ int sha3_8way_final( void *md, sha3_8way_ctx_t *c )
 {
    c->st[ c->pt ] =
                       _mm512_xor_si512( c->st[ c->pt ],
-                                         m512_const1_64( 6 ) );
+                                         _mm512_set1_epi64( 6 ) );
    c->st[ c->rsiz / 8 - 1 ] =
                       _mm512_xor_si512( c->st[ c->rsiz / 8 - 1 ],
-                                         m512_const1_64( 0x8000000000000000 ) );
+                                     _mm512_set1_epi64( 0x8000000000000000 ) );
    sha3_8way_keccakf( c->st );
    memcpy( md, c->st, c->mdlen * 8 );
    return 1;
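
The two XORs in `sha3_4way_final` and `sha3_8way_final` apply the SHA-3 padding: the domain-separation byte `0x06` at the current position and the closing bit `0x80` in the last byte of the rate block. A scalar sketch on plain 64-bit words follows. The helper name is hypothetical, and it assumes, as the vector code appears to, that `pt` indexes whole 64-bit words, meaning the absorbed length is a multiple of 8 bytes.

```c
#include <stdint.h>

/* Hypothetical scalar mirror of the padding step.
   st   : Keccak state as 64-bit little-endian words
   pt   : index of the next unused word (assumed word-granular)
   rsiz : rate in bytes, e.g. 136 for SHA3-256 */
static inline void sha3_pad_words( uint64_t *st, int pt, int rsiz )
{
   st[ pt ]           ^= 0x06;                   /* low byte of word pt  */
   st[ rsiz / 8 - 1 ] ^= 0x8000000000000000ull;  /* top byte of the rate */
}
```

When `pt` is the last word of the rate, both XORs hit the same word and combine to `0x8000000000000006`.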
--- a/algo/x11/c11-4way.c
+++ b/algo/x11/c11-4way.c
@@ -201,7 +201,7 @@ int scanhash_c11_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32_d7 = ptarget[7];
-   const __m512i eight = m512_const1_64( 8 );
+   const __m512i eight = _mm512_set1_epi64( 8 );
   const bool bench = opt_benchmark;

   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
@@ -369,7 +369,7 @@ int scanhash_c11_4way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32_d7 = ptarget[7];
-   const __m256i four = m256_const1_64( 4 );
+   const __m256i four = _mm256_set1_epi64x( 4 );
   const bool bench = opt_benchmark;

   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
--- a/algo/x13/skunk-4way.c
+++ b/algo/x13/skunk-4way.c
@@ -114,7 +114,7 @@ int scanhash_skunk_8way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n +=8;
   } while ( likely( ( n < last_nonce ) && !( *restart ) ) );
   pdata[19] = n;
@@ -218,7 +218,7 @@ int scanhash_skunk_4way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n +=4;
   } while ( likely( ( n < last_nonce ) && !( *restart ) ) );
   pdata[19] = n;
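
The nonce update in these scanhash loops adds a 64-bit broadcast constant such as `0x0000000400000000` using a packed 32-bit add. Because the add works on 32-bit elements, only the upper half of each 64-bit lane, where the interleaved layout places the nonce, is incremented, and no carry crosses between halves. A scalar sketch of one lane follows; `add_epi32_lane` is a hypothetical name for illustration only.

```c
#include <stdint.h>

/* Emulates one 64-bit lane of a packed 32-bit add (hypothetical helper).
   The two 32-bit halves are added independently and each wraps on its own. */
static inline uint64_t add_epi32_lane( uint64_t a, uint64_t b )
{
   uint32_t lo = (uint32_t)a + (uint32_t)b;
   uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}
```

With `b = 0x0000000800000000` the low half is unchanged and the high half advances by 8, which matches `n += 8` in the 8-way loop.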
--- a/algo/x16/hex.c
+++ b/algo/x16/hex.c
@@ -25,7 +25,7 @@ static void hex_getAlgoString(const uint32_t* prevblock, char *output)

 static __thread x16r_context_overlay hex_ctx;

-int hex_hash( void* output, const void* input, const int thrid )
+int hex_hash( void* output, const void* input, int thrid )
 {
   uint32_t _ALIGN(128) hash[16];
   x16r_context_overlay ctx;
--- a/algo/x16/minotaur.c
+++ b/algo/x16/minotaur.c
@@ -72,7 +72,7 @@ struct TortureGarden

 // Get a 64-byte hash for given 64-byte input, using given TortureGarden contexts and given algo index
 static int get_hash( void *output, const void *input, TortureGarden *garden,
-	                  unsigned int algo, const int thr_id )
+	                  unsigned int algo, int thr_id )
 {    
 	unsigned char hash[64] __attribute__ ((aligned (64)));
   int rc = 1;
@@ -233,7 +233,7 @@ bool initialize_torture_garden()
 }

 // Produce a 32-byte hash from 80-byte input data
-int minotaur_hash( void *output, const void *input, const int thr_id )
+int minotaur_hash( void *output, const void *input, int thr_id )
 {    
    unsigned char hash[64] __attribute__ ((aligned (64)));
    int rc = 1;
--- a/algo/x16/x16r-4way.c
+++ b/algo/x16/x16r-4way.c
@@ -19,7 +19,7 @@
 // Perform midstate prehash of hash functions with block size <= 72 bytes,
 // 76 bytes for hash functions that operate on 32 bit data.

-void x16r_8way_do_prehash( void *vdata, const void *pdata )
+void x16r_8way_prehash( void *vdata, void *pdata )
 {
   uint32_t vdata2[20*8] __attribute__ ((aligned (64)));
   uint32_t edata[20] __attribute__ ((aligned (64)));
@@ -106,18 +106,11 @@ void x16r_8way_do_prehash( void *vdata, const void *pdata )
   }
 }

-int x16r_8way_prehash( struct work *work )
-{
-   x16r_gate_get_hash_order( work, x16r_hash_order );
-   x16r_8way_do_prehash( x16r_8way_vdata, work->data );
-   return 1;
-}
-
 // Perform the full x16r hash and returns 512 bit intermediate hash.
 // Called by wrapper hash function to optionally continue hashing and
 // convert to final hash.

-int x16r_8way_hash_generic( void* output, const void* input, const int thrid )
+int x16r_8way_hash_generic( void* output, const void* input, int thrid )
 {
   uint32_t vhash[20*8] __attribute__ ((aligned (128)));
   uint32_t hash0[20] __attribute__ ((aligned (16)));
@@ -478,7 +471,7 @@ int x16r_8way_hash_generic( void* output, const void* input, const int thrid )

 // x16-r,-s,-rt wrapper called directly by scanhash to repackage 512 bit
 // hash to 256 bit final hash.
-int x16r_8way_hash( void* output, const void* input, const int thrid )
+int x16r_8way_hash( void* output, const void* input, int thrid )
 {
   uint8_t hash[64*8] __attribute__ ((aligned (128)));
   if ( !x16r_8way_hash_generic( hash, input, thrid ) )
@@ -502,6 +495,7 @@ int scanhash_x16r_8way( struct work *work, uint32_t max_nonce,
 {
   uint32_t hash[16*8] __attribute__ ((aligned (128)));
   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   uint32_t bedata1[2];
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -514,16 +508,27 @@ int scanhash_x16r_8way( struct work *work, uint32_t max_nonce,

   if ( bench )   ptarget[7] = 0x0cff;

-   pthread_rwlock_rdlock( &g_work_lock );
-      memcpy( vdata, x16r_8way_vdata, sizeof vdata );
-   pthread_rwlock_unlock( &g_work_lock );
+   bedata1[0] = bswap_32( pdata[1] );
+   bedata1[1] = bswap_32( pdata[2] );

+   static __thread uint32_t s_ntime = UINT32_MAX;
+   const uint32_t ntime = bswap_32( pdata[17] );
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*)bedata1, x16r_hash_order );
+      s_ntime = ntime;
+
+      if ( opt_debug && !thr_id )
+          applog( LOG_INFO, "Hash order %s Ntime %08x", x16r_hash_order, ntime );
+   }
+
+   x16r_8way_prehash( vdata, pdata );
   *noncev = mm512_intrlv_blend_32( _mm512_set_epi32(
                             n+7, 0, n+6, 0, n+5, 0, n+4, 0,
                             n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
   do
   {
-      if( algo_gate.hash( hash, vdata, thr_id ) );
+      if ( x16r_8way_hash( hash, vdata, thr_id ) )
      for ( int i = 0; i < 8; i++ )
      if ( unlikely( valid_hash( hash + (i<<3), ptarget ) && !bench ) )
      {
@@ -531,7 +536,7 @@ int scanhash_x16r_8way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
@@ -541,7 +546,7 @@ int scanhash_x16r_8way( struct work *work, uint32_t max_nonce,

 #elif defined (X16R_4WAY)

-void x16r_4way_do_prehash( void *vdata, const void *pdata )
+void x16r_4way_prehash( void *vdata, void *pdata )
 {
   uint32_t vdata2[20*4] __attribute__ ((aligned (64)));
   uint32_t edata[20] __attribute__ ((aligned (64)));
@@ -622,14 +627,7 @@ void x16r_4way_do_prehash( void *vdata, const void *pdata )
   }
 }

-int x16r_4way_prehash( struct work *work )
-{
-   x16r_gate_get_hash_order( work, x16r_hash_order );
-   x16r_4way_do_prehash( x16r_4way_vdata, work->data );
-   return 1;
-}
-
-int x16r_4way_hash_generic( void* output, const void* input, const int thrid )
+int x16r_4way_hash_generic( void* output, const void* input, int thrid )
 {
   uint32_t vhash[20*4] __attribute__ ((aligned (128)));
   uint32_t hash0[20] __attribute__ ((aligned (32)));
@@ -637,14 +635,13 @@ int x16r_4way_hash_generic( void* output, const void* input, const int thrid )
   uint32_t hash2[20] __attribute__ ((aligned (32)));
   uint32_t hash3[20] __attribute__ ((aligned (32)));
   x16r_4way_context_overlay ctx;
+   memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
   void *in0 = (void*) hash0;
   void *in1 = (void*) hash1;
   void *in2 = (void*) hash2;
   void *in3 = (void*) hash3;
   int size = 80;

-   memcpy( &ctx, &x16r_ctx, sizeof(ctx) );
-
   dintrlv_4x64( hash0, hash1, hash2, hash3, input, 640 );

   for ( int i = 0; i < 16; i++ )
@@ -908,7 +905,7 @@ int x16r_4way_hash_generic( void* output, const void* input, const int thrid )
   return 1;
 }

-int x16r_4way_hash( void* output, const void* input, const int thrid )
+int x16r_4way_hash( void* output, const void* input, int thrid )
 {
   uint8_t hash[64*4] __attribute__ ((aligned (64)));
   if ( !x16r_4way_hash_generic( hash, input, thrid ) )
@@ -927,6 +924,7 @@ int scanhash_x16r_4way( struct work *work, uint32_t max_nonce,
 {
   uint32_t hash[16*4] __attribute__ ((aligned (64)));
   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
+   uint32_t bedata1[2];
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -939,15 +937,25 @@ int scanhash_x16r_4way( struct work *work, uint32_t max_nonce,

   if ( bench )  ptarget[7] = 0x0cff;

-   pthread_rwlock_rdlock( &g_work_lock );
-      memcpy( vdata, x16r_4way_vdata, sizeof vdata );
-   pthread_rwlock_unlock( &g_work_lock );
+   bedata1[0] = bswap_32( pdata[1] );
+   bedata1[1] = bswap_32( pdata[2] );

+   static __thread uint32_t s_ntime = UINT32_MAX;
+   const uint32_t ntime = bswap_32( pdata[17] );
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*)bedata1, x16r_hash_order );
+      s_ntime = ntime;
+      if ( opt_debug && !thr_id )
+         applog( LOG_INFO, "Hash order %s Ntime %08x", x16r_hash_order, ntime );
+   }
+
+   x16r_4way_prehash( vdata, pdata );
   *noncev = mm256_intrlv_blend_32(
                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
   do
   {
-      if ( algo_gate.hash( hash, vdata, thr_id ) );
+      if ( x16r_4way_hash( hash, vdata, thr_id ) )
      for ( int i = 0; i < 4; i++ )
      if ( unlikely( valid_hash( hash + (i<<3), ptarget ) && !bench ) )
      {
@@ -955,7 +963,7 @@ int scanhash_x16r_4way( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
   pdata[19] = n;
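Note on the nonce increment used in the scanhash loops above: the nonce occupies the upper 32 bits of each 64-bit lane, so adding the constant 0x0000000400000000 with a 32-bit lane-wise add bumps the nonce by 4 and leaves the lower word untouched. The following is a minimal scalar sketch of one lane, not the vector code itself; sketch_add_epi32_lane is a hypothetical name introduced only for illustration.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane under a 32-bit lane-wise add
 * (the behavior of _mm256_add_epi32 restricted to a single lane).
 * Hypothetical helper, for illustration only. */
uint64_t sketch_add_epi32_lane( uint64_t lane, uint64_t inc )
{
   /* Each 32-bit half is added independently; no carry crosses halves. */
   const uint32_t lo = (uint32_t)lane + (uint32_t)inc;
   const uint32_t hi = (uint32_t)( lane >> 32 ) + (uint32_t)( inc >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}
```

Because the halves are independent, a nonce that wraps past 0xffffffff does not disturb the neighboring data word.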
--- a/algo/x16/x16r-gate.c
+++ b/algo/x16/x16r-gate.c
@@ -1,44 +1,26 @@
 #include "x16r-gate.h"
 #include "algo/sha/sha256d.h"

-char x16r_hash_order[ X16R_HASH_FUNC_COUNT + 1 ] = {0};
+__thread char x16r_hash_order[ X16R_HASH_FUNC_COUNT + 1 ] = { 0 };

-void (*x16r_gate_get_hash_order) ( const struct work *, char * ) = NULL;
+void (*x16_r_s_getAlgoString) ( const uint8_t*, char* ) = NULL;

 #if defined (X16R_8WAY)

-x16r_8way_context_overlay x16r_ctx;
-uint32_t x16r_8way_vdata[24*8] __attribute__ ((aligned (64)));
+__thread x16r_8way_context_overlay x16r_ctx;

 #elif defined (X16R_4WAY)

-x16r_4way_context_overlay x16r_ctx;
-uint32_t x16r_4way_vdata[24*4] __attribute__ ((aligned (64)));
-
+__thread x16r_4way_context_overlay x16r_ctx;

 #endif

-#if defined (X16RV2_8WAY)
+__thread x16r_context_overlay x16_ctx;

-x16rv2_8way_context_overlay x16rv2_ctx;

-#elif defined (X16RV2_4WAY)
-
-x16rv2_4way_context_overlay x16rv2_ctx;
-
-#endif
-
-x16r_context_overlay x16_ctx;
-uint32_t x16r_edata[24] __attribute__ ((aligned (32)));
-
-void x16r_get_hash_order( const struct work *work, char *hash_order )
+void x16r_getAlgoString( const uint8_t* prevblock, char *output )
 {
-   char *sptr = hash_order;
-   const uint32_t *pdata = work->data;
-   uint8_t prevblock[16];
-   ((uint32_t*)prevblock)[0] = bswap_32( pdata[1] );
-   ((uint32_t*)prevblock)[1] = bswap_32( pdata[2] );
-
+   char *sptr = output;
   for ( int j = 0; j < X16R_HASH_FUNC_COUNT; j++ )
   {
      uint8_t b = (15 - j) >> 1; // 16 first ascii hex chars (lsb in uint256)
@@ -50,51 +32,38 @@ void x16r_get_hash_order( const struct work *work, char *hash_order )
      sptr++;
   }
   *sptr = '\0';
-
-   if ( !opt_quiet )
-      applog( LOG_INFO, "Hash order %s", x16r_hash_order );
 }
-   
-void x16s_get_hash_order( const struct work *work, char *hash_order )
+
+void x16s_getAlgoString( const uint8_t* prevblock, char *output )
 {
-   const uint32_t *pdata = work->data;
-   uint8_t prevblock[16];
-   ((uint32_t*)prevblock)[0] = bswap_32( pdata[1] );
-   ((uint32_t*)prevblock)[1] = bswap_32( pdata[2] );
-   strcpy( hash_order, "0123456789ABCDEF" );
+   strcpy( output, "0123456789ABCDEF" );
   for ( int i = 0; i < 16; i++ )
   {
      uint8_t b = (15 - i) >> 1; // 16 ascii hex chars, reversed
      uint8_t algoDigit = (i & 1) ? prevblock[b] & 0xF : prevblock[b] >> 4;
      int offset = algoDigit;
      // insert the nth character at the front
-      char oldVal = hash_order[ offset ];
+      char oldVal = output[offset];
      for( int j = offset; j-- > 0; )
-         hash_order[ j+1 ] = hash_order[ j ];
-      hash_order[ 0 ] = oldVal;
+         output[j+1] = output[j];
+      output[0] = oldVal;
   }
-
-   if ( !opt_quiet )
-      applog( LOG_INFO, "Hash order %s", x16r_hash_order );
 }

 bool register_x16r_algo( algo_gate_t* gate )
 {
 #if defined (X16R_8WAY)
  gate->scanhash  = (void*)&scanhash_x16r_8way;
-  gate->prehash   = (void*)&x16r_8way_prehash;
  gate->hash      = (void*)&x16r_8way_hash;
 #elif defined (X16R_4WAY)
  gate->scanhash  = (void*)&scanhash_x16r_4way;
-  gate->prehash   = (void*)&x16r_4way_prehash;
  gate->hash      = (void*)&x16r_4way_hash;
 #else
  gate->scanhash  = (void*)&scanhash_x16r;
-  gate->prehash   = (void*)&x16r_prehash;
  gate->hash      = (void*)&x16r_hash;
 #endif
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
-  x16r_gate_get_hash_order = (void*)&x16r_get_hash_order;
+  x16_r_s_getAlgoString = (void*)&x16r_getAlgoString;
  opt_target_factor = 256.0;
  return true;
 };
@@ -102,20 +71,17 @@ bool register_x16r_algo( algo_gate_t* gate )
 bool register_x16rv2_algo( algo_gate_t* gate )
 {
 #if defined (X16RV2_8WAY)
-  gate->scanhash  = (void*)&scanhash_x16r_8way;
-  gate->prehash   = (void*)&x16rv2_8way_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rv2_8way;
  gate->hash      = (void*)&x16rv2_8way_hash;
 #elif defined (X16RV2_4WAY)
-  gate->scanhash  = (void*)&scanhash_x16r_4way;
-  gate->prehash   = (void*)&x16rv2_4way_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rv2_4way;
  gate->hash      = (void*)&x16rv2_4way_hash;
 #else
-  gate->scanhash  = (void*)&scanhash_x16r;
-  gate->prehash   = (void*)&x16rv2_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rv2;
  gate->hash      = (void*)&x16rv2_hash;
 #endif
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
-  x16r_gate_get_hash_order = (void*)&x16r_get_hash_order;
+  x16_r_s_getAlgoString = (void*)&x16r_getAlgoString;
  opt_target_factor = 256.0;
  return true;
 };
@@ -124,19 +90,16 @@ bool register_x16s_algo( algo_gate_t* gate )
 {
 #if defined (X16R_8WAY)
  gate->scanhash  = (void*)&scanhash_x16r_8way;
-  gate->prehash   = (void*)&x16r_8way_prehash;
  gate->hash      = (void*)&x16r_8way_hash;
 #elif defined (X16R_4WAY)
  gate->scanhash  = (void*)&scanhash_x16r_4way;
-  gate->prehash   = (void*)&x16r_4way_prehash;
  gate->hash      = (void*)&x16r_4way_hash;
 #else
  gate->scanhash  = (void*)&scanhash_x16r;
-  gate->prehash   = (void*)&x16r_prehash;
  gate->hash      = (void*)&x16r_hash;
 #endif
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
-  x16r_gate_get_hash_order = (void*)&x16s_get_hash_order;
+  x16_r_s_getAlgoString = (void*)&x16s_getAlgoString;
  opt_target_factor = 256.0;
  return true;
 };
@@ -145,33 +108,30 @@ bool register_x16s_algo( algo_gate_t* gate )
 //
 //   X16RT

-void x16rt_get_hash_order( const struct work * work, char * hash_order )
-{   
-   uint32_t _ALIGN(64) timehash[8*8];
-   const uint32_t ntime = bswap_32( work->data[17] );
-   const int32_t masked_ntime = ntime & 0xffffff80;
-   uint8_t* data = (uint8_t*)timehash;
-   char *sptr = hash_order;

-   sha256d( (unsigned char*)timehash, (const unsigned char*)( &masked_ntime ),
-             sizeof( masked_ntime ) );
+void x16rt_getTimeHash( const uint32_t timeStamp, void* timeHash )
+{
+    int32_t maskedTime = timeStamp & 0xffffff80;
+    sha256d( (unsigned char*)timeHash, (const unsigned char*)( &maskedTime ),
+             sizeof( maskedTime ) );
+}

-   for ( uint8_t j = 0; j < X16R_HASH_FUNC_COUNT; j++ )
-   {
+void x16rt_getAlgoString( const uint32_t *timeHash, char *output)
+{
+   char *sptr = output;
+   uint8_t* data = (uint8_t*)timeHash;
+
+   for (uint8_t j = 0; j < X16R_HASH_FUNC_COUNT; j++) {
      uint8_t b = (15 - j) >> 1; // 16 ascii hex chars, reversed
      uint8_t algoDigit = (j & 1) ? data[b] & 0xF : data[b] >> 4;

-      if ( algoDigit >= 10 )
-         sprintf( sptr, "%c", 'A' + (algoDigit - 10) );
+      if (algoDigit >= 10)
+         sprintf(sptr, "%c", 'A' + (algoDigit - 10));
      else
-         sprintf( sptr, "%u", (uint32_t) algoDigit );
+         sprintf(sptr, "%u", (uint32_t) algoDigit);
      sptr++;
   }
   *sptr = '\0';
-
-   if ( !opt_quiet )
-      applog( LOG_INFO, "Hash order %s, ntime %08x, time hash %08x",
-                         hash_order, ntime, timehash );
 }

 void veil_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
@@ -262,19 +222,15 @@ void veil_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
 bool register_x16rt_algo( algo_gate_t* gate )
 {
 #if defined (X16R_8WAY)
-  gate->scanhash  = (void*)&scanhash_x16r_8way;
-  gate->prehash   = (void*)&x16r_8way_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rt_8way;
  gate->hash      = (void*)&x16r_8way_hash;
 #elif defined (X16R_4WAY)
-  gate->scanhash  = (void*)&scanhash_x16r_4way;
-  gate->prehash   = (void*)&x16r_4way_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rt_4way;
  gate->hash      = (void*)&x16r_4way_hash;
 #else
-  gate->scanhash  = (void*)&scanhash_x16r;
-  gate->prehash   = (void*)&x16r_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rt;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  x16r_gate_get_hash_order = (void*)&x16rt_get_hash_order;
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
  opt_target_factor = 256.0;
  return true;
@@ -283,20 +239,16 @@ bool register_x16rt_algo( algo_gate_t* gate )
 bool register_x16rt_veil_algo( algo_gate_t* gate )
 {
 #if defined (X16R_8WAY)
-  gate->scanhash  = (void*)&scanhash_x16r_8way;
-  gate->prehash   = (void*)&x16r_8way_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rt_8way;
  gate->hash      = (void*)&x16r_8way_hash;
 #elif defined (X16R_4WAY)
-  gate->scanhash  = (void*)&scanhash_x16r_4way;
-  gate->prehash   = (void*)&x16r_4way_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rt_4way;
  gate->hash      = (void*)&x16r_4way_hash;
 #else
-  gate->scanhash  = (void*)&scanhash_x16r;
-  gate->prehash   = (void*)&x16r_prehash;
+  gate->scanhash  = (void*)&scanhash_x16rt;
  gate->hash      = (void*)&x16r_hash;
 #endif
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
-  x16r_gate_get_hash_order = (void*)&x16rt_get_hash_order;
  gate->build_extraheader = (void*)&veil_build_extraheader;
  opt_target_factor = 256.0;
  return true;
@@ -323,23 +275,20 @@ bool register_hex_algo( algo_gate_t* gate )
 bool register_x21s_algo( algo_gate_t* gate )
 {
 #if defined (X16R_8WAY)
-  gate->scanhash          = (void*)&scanhash_x16r_8way;
-  gate->prehash           = (void*)&x16r_8way_prehash;
+  gate->scanhash          = (void*)&scanhash_x21s_8way;
  gate->hash              = (void*)&x21s_8way_hash;
  gate->miner_thread_init = (void*)&x21s_8way_thread_init;
 #elif defined (X16R_4WAY)
-  gate->scanhash          = (void*)&scanhash_x16r_4way;
-  gate->prehash           = (void*)&x16r_4way_prehash;
+  gate->scanhash          = (void*)&scanhash_x21s_4way;
  gate->hash              = (void*)&x21s_4way_hash;
  gate->miner_thread_init = (void*)&x21s_4way_thread_init;
 #else
-  gate->scanhash          = (void*)&scanhash_x16r;
-  gate->prehash           = (void*)&x16r_prehash;
+  gate->scanhash          = (void*)&scanhash_x21s;
  gate->hash              = (void*)&x21s_hash;
  gate->miner_thread_init = (void*)&x21s_thread_init;
 #endif
  gate->optimizations  = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
-  x16r_gate_get_hash_order = (void*)&x16s_get_hash_order;
+  x16_r_s_getAlgoString   = (void*)&x16s_getAlgoString;
  opt_target_factor = 256.0;
  return true;
 };
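Note on the two hash-order derivations in x16r-gate.c: the x16rt variant maps 16 nibbles directly to 16 hex digits, while the x16s variant uses each nibble as an index for a move-to-front permutation of "0123456789ABCDEF". The sketch below restates both loops in standalone form so they can be tried outside the miner. The sketch_ names and SKETCH_FUNC_COUNT are hypothetical, and the value 16 for X16R_HASH_FUNC_COUNT is an assumption based on the "16 ascii hex chars" comments.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_FUNC_COUNT 16   /* assumed value of X16R_HASH_FUNC_COUNT */

/* x16rt style: nibble j of the input selects hash function j.
 * output must hold at least SKETCH_FUNC_COUNT + 1 chars. */
void sketch_order_from_nibbles( const uint8_t *data, char *output )
{
   static const char hex[] = "0123456789ABCDEF";
   for ( int j = 0; j < SKETCH_FUNC_COUNT; j++ )
   {
      const uint8_t b = (15 - j) >> 1;   /* byte 7 first, byte 0 last */
      const uint8_t digit = (j & 1) ? data[b] & 0xF : data[b] >> 4;
      output[j] = hex[digit];
   }
   output[SKETCH_FUNC_COUNT] = '\0';
}

/* x16s style: each nibble picks a position whose character is moved
 * to the front, so the result is always a permutation of 0..F. */
void sketch_order_move_to_front( const uint8_t *data, char *output )
{
   strcpy( output, "0123456789ABCDEF" );
   for ( int i = 0; i < 16; i++ )
   {
      const uint8_t b = (15 - i) >> 1;
      const uint8_t digit = (i & 1) ? data[b] & 0xF : data[b] >> 4;
      const char moved = output[digit];
      for ( int j = digit; j-- > 0; )    /* shift the prefix right by one */
         output[j+1] = output[j];
      output[0] = moved;
   }
}
```

The practical difference is that the first form can repeat a hash function, while the second runs every function exactly once in a shuffled order.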
--- a/algo/x16/x16r-gate.h
+++ b/algo/x16/x16r-gate.h
@@ -21,7 +21,6 @@
 #include "algo/shabal/sph_shabal.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/sha/sph_sha2.h"
-#include "algo/tiger/sph_tiger.h"

 #if defined(__AES__)
 #include "algo/echo/aes_ni/hash_api.h"
@@ -58,11 +57,13 @@

  #define X16R_8WAY   1
  #define X16RV2_8WAY 1
+  #define X16RT_8WAY  1
  #define X21S_8WAY   1

 #elif defined(__AVX2__) && defined(__AES__)

  #define X16RV2_4WAY 1
+  #define X16RT_4WAY  1
  #define X21S_4WAY   1
  #define X16R_4WAY   1

@@ -88,29 +89,23 @@ enum x16r_Algo {
        X16R_HASH_FUNC_COUNT
 };

+extern __thread char x16r_hash_order[ X16R_HASH_FUNC_COUNT + 1 ];

-//extern __thread char x16r_hash_order[ X16R_HASH_FUNC_COUNT + 1 ];
-extern char x16r_hash_order[ X16R_HASH_FUNC_COUNT + 1 ];
-
-
-extern void (*x16r_gate_get_hash_order) ( const struct work *, char * );
-
-// x16r, x16rv2
-void x16r_get_hash_order( const struct work *, char * );
-// x16s, x21s
-void x16s_get_hash_order( const struct work *, char * );
-// x16rt
-void x16rt_get_hash_order( const struct work *, char * );
+extern void (*x16_r_s_getAlgoString) ( const uint8_t*, char* );
+void x16r_getAlgoString( const uint8_t *prevblock, char *output );
+void x16s_getAlgoString( const uint8_t *prevblock, char *output );
+void x16rt_getAlgoString( const uint32_t *timeHash, char *output );

+void x16rt_getTimeHash( const uint32_t timeStamp, void* timeHash );

 bool register_x16r_algo( algo_gate_t* gate );
 bool register_x16rv2_algo( algo_gate_t* gate );
 bool register_x16s_algo( algo_gate_t* gate );
 bool register_x16rt_algo( algo_gate_t* gate );
-bool register_hex_algo( algo_gate_t* gate );
-bool register_x21s_algo( algo_gate_t* gate );
+bool register_hex_algo( algo_gate_t* gate );
+bool register_x21s_algo( algo_gate_t* gate );

-// x16r, x16s, x16rt
+// x16r, x16s
 #if defined(X16R_8WAY)

 union _x16r_8way_context_overlay
@@ -141,15 +136,15 @@ union _x16r_8way_context_overlay

 typedef union _x16r_8way_context_overlay x16r_8way_context_overlay;

-extern x16r_8way_context_overlay x16r_ctx;
-extern uint32_t x16r_8way_vdata[24*8] __attribute__ ((aligned (64)));
+extern __thread x16r_8way_context_overlay x16r_ctx;

-void x16r_8way_do_prehash( void *, const void * );
-int x16r_8way_prehash( struct work * );
-int x16r_8way_hash_generic( void *, const void *, const int );
-int x16r_8way_hash( void *, const void *, const int );
+void x16r_8way_prehash( void *, void * );
+int x16r_8way_hash_generic( void *, const void *, int );
+int x16r_8way_hash( void *, const void *, int );
 int scanhash_x16r_8way( struct work *, uint32_t ,
                        uint64_t *, struct thr_info * );
+extern __thread x16r_8way_context_overlay x16r_ctx;
+

 #elif defined(X16R_4WAY)

@@ -182,15 +177,14 @@ union _x16r_4way_context_overlay

 typedef union _x16r_4way_context_overlay x16r_4way_context_overlay;

-extern x16r_4way_context_overlay x16r_ctx;
-extern uint32_t x16r_4way_vdata[24*4] __attribute__ ((aligned (64)));
+extern __thread x16r_4way_context_overlay x16r_ctx;

-void x16r_4way_do_prehash( void *, const void * );
-int x16r_4way_prehash( struct work * );
-int x16r_4way_hash_generic( void *, const void *, const int );
-int x16r_4way_hash( void *, const void *, const int );
+void x16r_4way_prehash( void *, void * );
+int x16r_4way_hash_generic( void *, const void *, int );
+int x16r_4way_hash( void *, const void *, int );
 int scanhash_x16r_4way( struct work *, uint32_t,
                        uint64_t *, struct thr_info * );
+extern __thread x16r_4way_context_overlay x16r_ctx;

 #endif

@@ -223,113 +217,80 @@ union _x16r_context_overlay

 typedef union _x16r_context_overlay x16r_context_overlay;

-extern x16r_context_overlay x16_ctx;
-extern uint32_t x16r_edata[24] __attribute__ ((aligned (32)));
+extern __thread x16r_context_overlay x16_ctx;

-void x16r_do_prehash( const void * );
-int x16r_prehash( const struct work * );
-int x16r_hash_generic( void *, const void *, const int );
-int x16r_hash( void *, const void *, const int );
+void x16r_prehash( void *, void * );
+int x16r_hash_generic( void *, const void *, int );
+int x16r_hash( void *, const void *, int );
 int scanhash_x16r( struct work *, uint32_t, uint64_t *, struct thr_info * );

 // x16Rv2
 #if defined(X16RV2_8WAY)

-union _x16rv2_8way_context_overlay
-{
-    blake512_8way_context   blake;
-    bmw512_8way_context     bmw;
-    skein512_8way_context   skein;
-    jh512_8way_context      jh;
-    keccak512_8way_context  keccak;
-    luffa_4way_context      luffa;
-    cubehashParam           cube;
-    simd_4way_context       simd;
-    hamsi512_8way_context   hamsi;
-    hashState_fugue         fugue;
-    shabal512_8way_context  shabal;
-    sph_whirlpool_context   whirlpool;
-    sha512_8way_context     sha512;
-    sph_tiger_context       tiger;
-#if defined(__VAES__)
-    groestl512_4way_context groestl;
-    shavite512_4way_context shavite;
-    echo_4way_context       echo;
-#else
-    hashState_groestl       groestl;
-    shavite512_context      shavite;
-    hashState_echo          echo;
-#endif
-} __attribute__ ((aligned (64)));
-
-typedef union _x16rv2_8way_context_overlay x16rv2_8way_context_overlay;
-extern x16rv2_8way_context_overlay x16rv2_ctx;
-
-int x16rv2_8way_prehash( struct work * );
-int x16rv2_8way_hash( void *state, const void *input, const int thrid );
-//int scanhash_x16rv2_8way( struct work *work, uint32_t max_nonce,
-//                          uint64_t *hashes_done, struct thr_info *mythr );
+int x16rv2_8way_hash( void *state, const void *input, int thrid );
+int scanhash_x16rv2_8way( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr );

 #elif defined(X16RV2_4WAY)

-union _x16rv2_4way_context_overlay
-{
-    blake512_4way_context   blake;
-    bmw512_4way_context     bmw;
-#if defined(__VAES__)
-    groestl512_2way_context groestl;
-    shavite512_2way_context shavite;
-    echo_2way_context       echo;
+int x16rv2_4way_hash( void *state, const void *input, int thrid );
+int scanhash_x16rv2_4way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr );
+
 #else
-    hashState_groestl       groestl;
-    shavite512_context      shavite;
-    hashState_echo          echo;
+
+int x16rv2_hash( void *state, const void *input, int thr_id );
+int scanhash_x16rv2( struct work *work, uint32_t max_nonce,
+                   uint64_t *hashes_done, struct thr_info *mythr );
+
 #endif
-    skein512_4way_context   skein;
-    jh512_4way_context      jh;
-    keccak512_4way_context  keccak;
-    luffa_2way_context      luffa;
-    cubehashParam           cube;
-    simd_2way_context       simd;
-    hamsi512_4way_context   hamsi;
-    hashState_fugue         fugue;
-    shabal512_4way_context  shabal;
-    sph_whirlpool_context   whirlpool;
-    sha512_4way_context     sha512;
-    sph_tiger_context       tiger;
-};

-typedef union _x16rv2_4way_context_overlay x16rv2_4way_context_overlay;
-extern x16rv2_4way_context_overlay x16rv2_ctx;
+// x16rt, veil
+#if defined(X16R_8WAY)

-int x16rv2_4way_hash( void *state, const void *input, const int thrid );
-int x16rv2_4way_prehash( struct work * );
+//void x16rt_8way_hash( void *state, const void *input );
+int scanhash_x16rt_8way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr );
+
+#elif defined(X16R_4WAY)
+
+//void x16rt_4way_hash( void *state, const void *input );
+int scanhash_x16rt_4way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr );

 #else

-int x16rv2_hash( void *state, const void *input, const int thr_id );
-int x16rv2_prehash( const struct work * );
+//void x16rt_hash( void *state, const void *input );
+int scanhash_x16rt( struct work *work, uint32_t max_nonce,
+                   uint64_t *hashes_done, struct thr_info *mythr );

 #endif

 // x21s
 #if defined(X16R_8WAY)

-int x21s_8way_hash( void *state, const void *input, const int thrid );
+int x21s_8way_hash( void *state, const void *input, int thrid );
+int scanhash_x21s_8way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr );
 bool x21s_8way_thread_init();

 #elif defined(X16R_4WAY)

-int x21s_4way_hash( void *state, const void *input, const int thrid );
+int x21s_4way_hash( void *state, const void *input, int thrid );
+int scanhash_x21s_4way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr );
 bool x21s_4way_thread_init();

 #else

-int x21s_hash( void *state, const void *input, const int thr_id );
+int x21s_hash( void *state, const void *input, int thr_id );
+int scanhash_x21s( struct work *work, uint32_t max_nonce,
+                  uint64_t *hashes_done, struct thr_info *mythr );
 bool x21s_thread_init();

 #endif

+//void hex_hash( void *state, const void *input );
 int scanhash_hex( struct work *work, uint32_t max_nonce,
                  uint64_t *hashes_done, struct thr_info *mythr );

--- a/algo/x16/x16r.c
+++ b/algo/x16/x16r.c
@@ -10,7 +10,7 @@
 #include <stdlib.h>
 #include <string.h>

-void x16r_do_prehash( const void *edata )
+void x16r_prehash( void *edata, void *pdata )
 {
   const char elem = x16r_hash_order[0];
   const uint8_t algo = elem >= 'A' ? elem - 'A' + 10 : elem - '0';
@@ -48,7 +48,7 @@ void x16r_do_prehash( const void *edata )
   }
 }

-int x16r_hash_generic( void* output, const void* input, const int thrid )
+int x16r_hash_generic( void* output, const void* input, int thrid )
 {
   uint32_t _ALIGN(128) hash[16];
   x16r_context_overlay ctx;
@@ -192,15 +192,7 @@ int x16r_hash_generic( void* output, const void* input, const int thrid )
   return true;
 }

-int x16r_prehash( const struct work *work )
-{
-   mm128_bswap32_80( x16r_edata, work->data );
-   x16r_gate_get_hash_order( work, x16r_hash_order );
-   x16r_do_prehash( x16r_edata );  
-   return 1;
-}
-
-int x16r_hash( void* output, const void* input, const int thrid )
+int x16r_hash( void* output, const void* input, int thrid )
 {  
   uint8_t hash[64] __attribute__ ((aligned (64)));
   if ( !x16r_hash_generic( hash, input, thrid ) )
@@ -213,8 +205,8 @@ int x16r_hash( void* output, const void* input, const int thrid )
 int scanhash_x16r( struct work *work, uint32_t max_nonce,
                   uint64_t *hashes_done, struct thr_info *mythr )
 {
-   uint32_t _ALIGN(32) hash32[8];
-   uint32_t _ALIGN(32) edata[20];
+   uint32_t _ALIGN(128) hash32[8];
+   uint32_t _ALIGN(128) edata[20];
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
   const uint32_t first_nonce = pdata[19];
@@ -224,14 +216,24 @@ int scanhash_x16r( struct work *work, uint32_t max_nonce,
   const bool bench = opt_benchmark;
   if ( bench )  ptarget[7] = 0x0cff;

-   pthread_rwlock_rdlock( &g_work_lock );
-      memcpy( edata, x16r_edata, sizeof edata );
-   pthread_rwlock_unlock( &g_work_lock );
+   mm128_bswap32_80( edata, pdata );
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   const uint32_t ntime = swab32( pdata[17] );
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*)(&edata[1]), x16r_hash_order );
+      s_ntime = ntime;
+      if ( opt_debug && !thr_id )
+           applog( LOG_DEBUG, "hash order %s (%08x)", x16r_hash_order, ntime );
+   }
+
+   x16r_prehash( edata, pdata );

   do
   {
      edata[19] = nonce;
-      if ( algo_gate.hash( hash32, edata, thr_id ) )
+      if ( x16r_hash( hash32, edata, thr_id ) )
      if ( unlikely( valid_hash( hash32, ptarget ) && !bench ) )
      {
         pdata[19] = bswap_32( nonce );
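Note on the per-thread cache used by the scanhash functions: the hash order is recomputed only when the cached key differs from the current one, and the cache lives in a static __thread variable so each miner thread keeps its own copy without locking. For the cache to hit, the value compared must be in the same byte order as the value stored. A minimal sketch of the pattern follows, assuming a GCC or Clang toolchain for __thread; sketch_refresh_on_ntime_change is a hypothetical name.

```c
#include <stdint.h>

/* Returns 1 when the cached key had to be refreshed, 0 on a cache hit.
 * Hypothetical helper, for illustration only. */
int sketch_refresh_on_ntime_change( uint32_t ntime )
{
   /* One copy per thread; UINT32_MAX marks "nothing cached yet". */
   static __thread uint32_t s_ntime = UINT32_MAX;

   if ( s_ntime == ntime )
      return 0;              /* same key: reuse the previous hash order */

   /* The expensive hash-order derivation would run here. */
   s_ntime = ntime;
   return 1;
}
```

Calling it twice with the same key refreshes once; a new key triggers a second refresh.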
--- a/algo/x16/x16rt-4way.c
+++ b/algo/x16/x16rt-4way.c
@@ -0,0 +1,113 @@
+#include "x16r-gate.h"
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#if defined (X16R_8WAY)
+
+int scanhash_x16rt_8way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr)
+{
+   uint32_t hash[16*8] __attribute__ ((aligned (128)));
+   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   uint32_t _ALIGN(64) timeHash[8*8];
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   uint32_t n = first_nonce;
+    __m512i  *noncev = (__m512i*)vdata + 9;   // aligned
+   const int thr_id = mythr->id;
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+   const bool bench = opt_benchmark;
+
+   if ( bench )   ptarget[7] = 0x0cff;
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   uint32_t masked_ntime = bswap_32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
+   {
+      x16rt_getTimeHash( masked_ntime, &timeHash );
+      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
+      s_ntime = masked_ntime;
+      if ( !thr_id )
+          applog( LOG_INFO, "Hash order %s, Ntime %08x, time hash %08x",
+                            x16r_hash_order, bswap_32( pdata[17] ), timeHash[0] );
+   }
+
+   x16r_8way_prehash( vdata, pdata );
+   *noncev = mm512_intrlv_blend_32( _mm512_set_epi32(
+                             n+7, 0, n+6, 0, n+5, 0, n+4, 0,
+                             n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
+   do
+   {
+      if ( x16r_8way_hash( hash, vdata, thr_id ) )
+      for ( int i = 0; i < 8; i++ )
+      if ( unlikely( valid_hash( hash + (i<<3), ptarget ) && !bench ) )
+      {
+         pdata[19] = bswap_32( n+i );
+         submit_solution( work, hash+(i<<3), mythr );
+      }
+      *noncev = _mm512_add_epi32( *noncev,
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
+      n += 8;
+   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#elif defined (X16R_4WAY)
+
+int scanhash_x16rt_4way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr)
+{
+   uint32_t hash[4*16] __attribute__ ((aligned (64)));
+   uint32_t vdata[24*4] __attribute__ ((aligned (64)));
+   uint32_t _ALIGN(64) timeHash[4*8];
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;  
+    __m256i  *noncev = (__m256i*)vdata + 9;   // aligned
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+   const bool bench = opt_benchmark;
+
+   if ( bench )  ptarget[7] = 0x0cff;
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   uint32_t masked_ntime = bswap_32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
+   {
+      x16rt_getTimeHash( masked_ntime, &timeHash );
+      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
+      s_ntime = masked_ntime;
+      if ( !thr_id )
+          applog( LOG_INFO, "Hash order %s, Ntime %08x, time hash %08x",
+                            x16r_hash_order, bswap_32( pdata[17] ), timeHash[0] );
+   }
+
+   x16r_4way_prehash( vdata, pdata );
+   *noncev = mm256_intrlv_blend_32(
+                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
+   do
+   {
+      if ( x16r_4way_hash( hash, vdata, thr_id ) )
+      for ( int i = 0; i < 4; i++ )
+      if ( unlikely( valid_hash( hash + (i<<3), ptarget ) && !bench ) )
+      {
+         pdata[19] = bswap_32( n+i );
+         submit_solution( work, hash+(i<<3), mythr );
+      }
+      *noncev = _mm256_add_epi32( *noncev,
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
+      n += 4;
+   } while ( (  n < last_nonce ) && !(*restart) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+#endif
--- a/algo/x16/x16rt.c
+++ b/algo/x16/x16rt.c
@@ -0,0 +1,53 @@
+#include "x16r-gate.h"
+
+#if !defined(X16R_8WAY) && !defined(X16R_4WAY)
+
+int scanhash_x16rt( struct work *work, uint32_t max_nonce,
+                    uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t _ALIGN(128) hash32[8];
+   uint32_t _ALIGN(128) edata[20];
+   uint32_t _ALIGN(64) timeHash[8];
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const int thr_id = mythr->id; 
+   uint32_t nonce = first_nonce;
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+   const bool bench = opt_benchmark;
+   if ( bench )  ptarget[7] = 0x0cff;
+
+   mm128_bswap32_80( edata, pdata );
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   uint32_t masked_ntime = swab32( pdata[17] ) & 0xffffff80;
+   if ( s_ntime != masked_ntime )
+   {
+      x16rt_getTimeHash( masked_ntime, &timeHash );
+      x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
+      s_ntime = masked_ntime;
+      if ( !thr_id )
+          applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
+                        x16r_hash_order, swab32( pdata[17] ), timeHash[0] );
+   }
+   
+   x16r_prehash( edata, pdata );
+   
+   do
+   {
+      edata[19] = nonce;
+      if ( x16r_hash( hash32, edata, thr_id ) )
+      if ( valid_hash( hash32, ptarget ) && !bench )
+      {
+         pdata[19] = bswap_32( nonce );
+         submit_solution( work, hash32, mythr );
+      }
+      nonce++;
+   } while ( nonce < max_nonce && !(*restart) );
+   pdata[19] = nonce;
+   *hashes_done = pdata[19] - first_nonce;
+   return 0;
+}
+
+#endif  // !defined(X16R_8WAY) && !defined(X16R_4WAY)
+
--- a/algo/x16/x16rv2-4way.c
+++ b/algo/x16/x16rv2-4way.c
@@ -12,73 +12,37 @@

 #if defined (X16RV2_8WAY)

-void x16rv2_8way_do_prehash( void *vdata, void *pdata )
+union _x16rv2_8way_context_overlay
 {
-   uint32_t vdata32[20*8] __attribute__ ((aligned (64)));
-   uint32_t edata[20] __attribute__ ((aligned (64)));
+    blake512_8way_context   blake;
+    bmw512_8way_context     bmw;
+    skein512_8way_context   skein;
+    jh512_8way_context      jh;
+    keccak512_8way_context  keccak;
+    luffa_4way_context      luffa;
+    cubehashParam           cube;
+    simd_4way_context       simd;
+    hamsi512_8way_context   hamsi;
+    hashState_fugue         fugue;
+    shabal512_8way_context  shabal;
+    sph_whirlpool_context   whirlpool;
+    sha512_8way_context     sha512;
+    sph_tiger_context       tiger;
+#if defined(__VAES__)
+    groestl512_4way_context groestl;
+    shavite512_4way_context shavite;
+    echo_4way_context       echo;
+#else
+    hashState_groestl       groestl;
+    shavite512_context      shavite;
+    hashState_echo          echo;
+#endif
+} __attribute__ ((aligned (64)));

-   const char elem = x16r_hash_order[0];
-   const uint8_t algo = elem >= 'A' ? elem - 'A' + 10 : elem - '0';
+typedef union _x16rv2_8way_context_overlay x16rv2_8way_context_overlay;
+static __thread x16rv2_8way_context_overlay x16rv2_ctx;

-   switch ( algo )
-   {
-      case JH:
-         mm512_bswap32_intrlv80_8x64( vdata, pdata );
-         jh512_8way_init( &x16rv2_ctx.jh );
-         jh512_8way_update( &x16rv2_ctx.jh, vdata, 64 );
-      break;
-      case KECCAK:
-      case LUFFA:
-      case SHA_512:
-         mm128_bswap32_80( edata, pdata );
-         sph_tiger_init( &x16rv2_ctx.tiger );
-         sph_tiger( &x16rv2_ctx.tiger, edata, 64 );
-         intrlv_8x64( vdata, edata, edata, edata, edata,
-                             edata, edata, edata, edata, 640 );
-      break;
-      case SKEIN:
-         mm512_bswap32_intrlv80_8x64( vdata, pdata );
-         skein512_8way_init( &x16rv2_ctx.skein );
-         skein512_8way_update( &x16rv2_ctx.skein, vdata, 64 );
-      break;
-      case CUBEHASH:
-         mm128_bswap32_80( edata, pdata );
-         cubehashInit( &x16rv2_ctx.cube, 512, 16, 32 );
-         cubehashUpdate( &x16rv2_ctx.cube, (const byte*)edata, 64 );
-         intrlv_8x64( vdata, edata, edata, edata, edata,
-                             edata, edata, edata, edata, 640 );
-      break;
-      case HAMSI:
-         mm512_bswap32_intrlv80_8x64( vdata, pdata );
-         hamsi512_8way_init( &x16rv2_ctx.hamsi );
-         hamsi512_8way_update( &x16rv2_ctx.hamsi, vdata, 64 );
-      break;
-      case SHABAL:
-         mm256_bswap32_intrlv80_8x32( vdata32, pdata );
-         shabal512_8way_init( &x16rv2_ctx.shabal );
-         shabal512_8way_update( &x16rv2_ctx.shabal, vdata32, 64 );
-         rintrlv_8x32_8x64( vdata, vdata32, 640 );
-      break;
-      case WHIRLPOOL:
-         mm128_bswap32_80( edata, pdata );
-         sph_whirlpool_init( &x16rv2_ctx.whirlpool );
-         sph_whirlpool( &x16rv2_ctx.whirlpool, edata, 64 );
-         intrlv_8x64( vdata, edata, edata, edata, edata,
-                             edata, edata, edata, edata, 640 );
-      break;
-      default:
-         mm512_bswap32_intrlv80_8x64( vdata, pdata );
-   }
-}
-
-int x16rv2_8way_prehash( struct work *work )
-{
-   x16r_gate_get_hash_order( work, x16r_hash_order );
-   x16rv2_8way_do_prehash( x16r_8way_vdata, work->data );
-   return 1;
-}
-
-int x16rv2_8way_hash( void* output, const void* input, const int thrid )
+int x16rv2_8way_hash( void* output, const void* input, int thrid )
 {
   uint32_t vhash[24*8] __attribute__ ((aligned (128)));
   uint32_t hash0[24] __attribute__ ((aligned (32)));
@@ -593,28 +557,50 @@ int x16rv2_8way_hash( void* output, const void* input, const int thrid )
   return 1;
 }

-#elif defined (X16RV2_4WAY)
-
-// Pad the 24 bytes tiger hash to 64 bytes
-inline void padtiger512( uint32_t* hash )
+int scanhash_x16rv2_8way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr)
 {
-  for ( int i = 6; i < 16; i++ ) hash[i] = 0;
-}
-
-void x16rv2_4way_do_prehash( void *vdata, void *pdata )
-{
-   uint32_t vdata32[20*4] __attribute__ ((aligned (64)));
+   uint32_t hash[16*8] __attribute__ ((aligned (128)));
+   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   uint32_t vdata2[20*8] __attribute__ ((aligned (64)));
   uint32_t edata[20] __attribute__ ((aligned (64)));
+   uint32_t bedata1[2] __attribute__((aligned(64)));
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   uint32_t n = first_nonce;
+   __m512i  *noncev = (__m512i*)vdata + 9;   // aligned
+   const int thr_id = mythr->id;
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+   const bool bench = opt_benchmark;

+   if ( bench ) ptarget[7] = 0x0cff;
+
+   mm512_bswap32_intrlv80_8x64( vdata, pdata );
+
+   bedata1[0] = bswap_32( pdata[1] );
+   bedata1[1] = bswap_32( pdata[2] );
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   const uint32_t ntime = bswap_32( pdata[17] );
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*)bedata1, x16r_hash_order );
+      s_ntime = ntime;
+      if ( opt_debug && !thr_id )
+         applog( LOG_INFO, "hash order %s (%08x)", x16r_hash_order, ntime );
+   }
+
+   // Do midstate prehash on hash functions with block size <= 64 bytes.
   const char elem = x16r_hash_order[0];
   const uint8_t algo = elem >= 'A' ? elem - 'A' + 10 : elem - '0';
-
   switch ( algo )
   {
      case JH:
-         mm256_bswap32_intrlv80_4x64( vdata, pdata );
-         jh512_4way_init( &x16rv2_ctx.jh );
-         jh512_4way_update( &x16rv2_ctx.jh, vdata, 64 );
+         mm512_bswap32_intrlv80_8x64( vdata, pdata );
+         jh512_8way_init( &x16rv2_ctx.jh );
+         jh512_8way_update( &x16rv2_ctx.jh, vdata, 64 );
      break;
      case KECCAK:
      case LUFFA:
@@ -622,45 +608,100 @@ void x16rv2_4way_do_prehash( void *vdata, void *pdata )
         mm128_bswap32_80( edata, pdata );
         sph_tiger_init( &x16rv2_ctx.tiger );
         sph_tiger( &x16rv2_ctx.tiger, edata, 64 );
-         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
+         intrlv_8x64( vdata, edata, edata, edata, edata,
+                             edata, edata, edata, edata, 640 );
      break;
      case SKEIN:
-         mm256_bswap32_intrlv80_4x64( vdata, pdata );
-         skein512_4way_prehash64( &x16r_ctx.skein, vdata );
+         mm512_bswap32_intrlv80_8x64( vdata, pdata );
+         skein512_8way_init( &x16rv2_ctx.skein );
+         skein512_8way_update( &x16rv2_ctx.skein, vdata, 64 );
      break;
      case CUBEHASH:
         mm128_bswap32_80( edata, pdata );
         cubehashInit( &x16rv2_ctx.cube, 512, 16, 32 );
         cubehashUpdate( &x16rv2_ctx.cube, (const byte*)edata, 64 );
-         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
+         intrlv_8x64( vdata, edata, edata, edata, edata,
+                             edata, edata, edata, edata, 640 );
      break;
      case HAMSI:
-         mm256_bswap32_intrlv80_4x64( vdata, pdata );
-         hamsi512_4way_init( &x16rv2_ctx.hamsi );
-         hamsi512_4way_update( &x16rv2_ctx.hamsi, vdata, 64 );
+         mm512_bswap32_intrlv80_8x64( vdata, pdata );
+         hamsi512_8way_init( &x16rv2_ctx.hamsi );
+         hamsi512_8way_update( &x16rv2_ctx.hamsi, vdata, 64 );
      break;
      case SHABAL:
-         mm128_bswap32_intrlv80_4x32( vdata32, pdata );
-         shabal512_4way_init( &x16rv2_ctx.shabal );
-         shabal512_4way_update( &x16rv2_ctx.shabal, vdata32, 64 );
-         rintrlv_4x32_4x64( vdata, vdata32, 640 );
+         mm256_bswap32_intrlv80_8x32( vdata2, pdata );
+         shabal512_8way_init( &x16rv2_ctx.shabal );
+         shabal512_8way_update( &x16rv2_ctx.shabal, vdata2, 64 );
+         rintrlv_8x32_8x64( vdata, vdata2, 640 );
      break;
      case WHIRLPOOL:
         mm128_bswap32_80( edata, pdata );
         sph_whirlpool_init( &x16rv2_ctx.whirlpool );
         sph_whirlpool( &x16rv2_ctx.whirlpool, edata, 64 );
-         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
+         intrlv_8x64( vdata, edata, edata, edata, edata,
+                             edata, edata, edata, edata, 640 );
      break;
      default:
-         mm256_bswap32_intrlv80_4x64( vdata, pdata );
+         mm512_bswap32_intrlv80_8x64( vdata, pdata );
   }
-}   
+   
+   *noncev = mm512_intrlv_blend_32( _mm512_set_epi32(
+                             n+7, 0, n+6, 0, n+5, 0, n+4, 0,
+                             n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
+   do
+   {
+      if ( x16rv2_8way_hash( hash, vdata, thr_id ) )
+      for ( int i = 0; i < 8; i++ )
+      if ( unlikely( valid_hash( hash + (i<<3), ptarget ) && !bench ) )
+      {
+         pdata[19] = bswap_32( n+i );
+         submit_solution( work, hash+(i<<3), mythr );
+      }
+      *noncev = _mm512_add_epi32( *noncev,
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
+      n += 8;
+   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}

-int x16rv2_4way_prehash( struct work *work )
+#elif defined (X16RV2_4WAY)
+
+union _x16rv2_4way_context_overlay
 {
-   x16r_gate_get_hash_order( work, x16r_hash_order );
-   x16rv2_4way_do_prehash( x16r_4way_vdata, work->data );
-   return 1;
+    blake512_4way_context   blake;
+    bmw512_4way_context     bmw;
+#if defined(__VAES__)
+    groestl512_2way_context groestl;
+    shavite512_2way_context shavite;
+    echo_2way_context       echo;
+#else
+    hashState_groestl       groestl;
+    shavite512_context      shavite;
+    hashState_echo          echo;
+#endif
+    skein512_4way_context   skein;
+    jh512_4way_context      jh;
+    keccak512_4way_context  keccak;
+    luffa_2way_context      luffa;
+    cubehashParam           cube;
+    simd_2way_context       simd;
+    hamsi512_4way_context   hamsi;
+    hashState_fugue         fugue;
+    shabal512_4way_context  shabal;
+    sph_whirlpool_context   whirlpool;
+    sha512_4way_context     sha512;
+    sph_tiger_context       tiger;
+};
+typedef union _x16rv2_4way_context_overlay x16rv2_4way_context_overlay;
+
+static __thread x16rv2_4way_context_overlay x16rv2_ctx;
+
+// Pad the 24 bytes tiger hash to 64 bytes
+inline void padtiger512( uint32_t* hash )
+{
+  for ( int i = 6; i < 16; i++ ) hash[i] = 0;
 }

 int x16rv2_4way_hash( void* output, const void* input, int thrid )
@@ -1007,4 +1048,107 @@ int x16rv2_4way_hash( void* output, const void* input, int thrid )
   return 1;
 }

+int scanhash_x16rv2_4way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr)
+{
+   uint32_t hash[4*16] __attribute__ ((aligned (64)));
+   uint32_t vdata[24*4] __attribute__ ((aligned (64)));
+   uint32_t vdata32[20*4] __attribute__ ((aligned (64)));
+   uint32_t edata[20];
+   uint32_t bedata1[2];
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   __m256i  *noncev = (__m256i*)vdata + 9;
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+   const bool bench = opt_benchmark;
+
+   if ( bench )  ptarget[7] = 0x0fff;
+   
+   bedata1[0] = bswap_32( pdata[1] );
+   bedata1[1] = bswap_32( pdata[2] );
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   const uint32_t ntime = bswap_32(pdata[17]);
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*)bedata1, x16r_hash_order );
+      s_ntime = ntime;
+      if ( opt_debug && !thr_id )
+         applog( LOG_INFO, "hash order %s (%08x)", x16r_hash_order, ntime );
+   }
+
+   // Do midstate prehash on hash functions with block size <= 64 bytes.
+   const char elem = x16r_hash_order[0];
+   const uint8_t algo = elem >= 'A' ? elem - 'A' + 10 : elem - '0';
+   switch ( algo )
+   {
+      case JH:
+         mm256_bswap32_intrlv80_4x64( vdata, pdata );
+         jh512_4way_init( &x16rv2_ctx.jh );
+         jh512_4way_update( &x16rv2_ctx.jh, vdata, 64 );
+      break;
+      case KECCAK:
+      case LUFFA:
+      case SHA_512:
+         mm128_bswap32_80( edata, pdata );
+         sph_tiger_init( &x16rv2_ctx.tiger );
+         sph_tiger( &x16rv2_ctx.tiger, edata, 64 );
+         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
+      break;
+      case SKEIN:
+         mm256_bswap32_intrlv80_4x64( vdata, pdata );
+         skein512_4way_prehash64( &x16r_ctx.skein, vdata );
+      break;
+      case CUBEHASH:
+         mm128_bswap32_80( edata, pdata );
+         cubehashInit( &x16rv2_ctx.cube, 512, 16, 32 );
+         cubehashUpdate( &x16rv2_ctx.cube, (const byte*)edata, 64 );
+         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
+      break;
+      case HAMSI:
+         mm256_bswap32_intrlv80_4x64( vdata, pdata );
+         hamsi512_4way_init( &x16rv2_ctx.hamsi );
+         hamsi512_4way_update( &x16rv2_ctx.hamsi, vdata, 64 );
+      break;
+      case SHABAL:
+         mm128_bswap32_intrlv80_4x32( vdata32, pdata );
+         shabal512_4way_init( &x16rv2_ctx.shabal );
+         shabal512_4way_update( &x16rv2_ctx.shabal, vdata32, 64 );
+         rintrlv_4x32_4x64( vdata, vdata32, 640 );
+      break;
+      case WHIRLPOOL:
+         mm128_bswap32_80( edata, pdata );
+         sph_whirlpool_init( &x16rv2_ctx.whirlpool );
+         sph_whirlpool( &x16rv2_ctx.whirlpool, edata, 64 );
+         intrlv_4x64( vdata, edata, edata, edata, edata, 640 );
+      break;
+      default:
+         mm256_bswap32_intrlv80_4x64( vdata, pdata );
+   }
+
+   *noncev = mm256_intrlv_blend_32(
+                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
+
+   do
+   {
+      if ( x16rv2_4way_hash( hash, vdata, thr_id ) )
+      for ( int i = 0; i < 4; i++ )
+      if ( unlikely( valid_hash( hash + (i<<3), ptarget ) && !bench ) )
+      {
+         pdata[19] = bswap_32( n+i );
+         submit_solution( work, hash+(i<<3), mythr );
+      }
+      *noncev = _mm256_add_epi32( *noncev,
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
+      n += 4;
+   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
 #endif
--- a/algo/x16/x16rv2.c
+++ b/algo/x16/x16rv2.c
@@ -43,16 +43,9 @@ inline void padtiger512(uint32_t* hash) {
   for (int i = (24/4); i < (64/4); i++) hash[i] = 0;
 }

-// no prehash
-int x16rv2_prehash( const struct work *work )
+int x16rv2_hash( void* output, const void* input, int thrid )
 {
-   x16r_gate_get_hash_order( work, x16r_hash_order );
-   return 1;
-}
-
-int x16rv2_hash( void* output, const void* input, const int thrid )
-{
-   uint32_t _ALIGN(32) hash[16];
+   uint32_t _ALIGN(128) hash[16];
   x16rv2_context_overlay ctx;
   void *in = (void*) input;
   int size = 80;
@@ -177,4 +170,52 @@ int x16rv2_hash( void* output, const void* input, const int thrid )
   return 1;
 }

+int scanhash_x16rv2( struct work *work, uint32_t max_nonce,
+                   uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t _ALIGN(128) hash32[8];
+   uint32_t _ALIGN(128) edata[20];
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const int thr_id = mythr->id;  
+   uint32_t nonce = first_nonce;
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+   const bool bench = opt_benchmark;
+
+   casti_m128i( edata, 0 ) = mm128_bswap_32( casti_m128i( pdata, 0 ) );
+   casti_m128i( edata, 1 ) = mm128_bswap_32( casti_m128i( pdata, 1 ) );
+   casti_m128i( edata, 2 ) = mm128_bswap_32( casti_m128i( pdata, 2 ) );
+   casti_m128i( edata, 3 ) = mm128_bswap_32( casti_m128i( pdata, 3 ) );
+   casti_m128i( edata, 4 ) = mm128_bswap_32( casti_m128i( pdata, 4 ) );
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   const uint32_t ntime = swab32( pdata[17] );
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*) (&edata[1]), x16r_hash_order );
+      s_ntime = ntime;
+      if ( opt_debug && !thr_id )
+              applog( LOG_DEBUG, "hash order %s (%08x)",
+                                 x16r_hash_order, ntime );
+   }
+
+   if ( bench )   ptarget[7] = 0x0cff;
+
+   do
+   {
+      edata[19] = nonce;
+      if ( x16rv2_hash( hash32, edata, thr_id ) )
+      if ( unlikely( valid_hash( hash32, ptarget ) && !bench ) )
+      {
+         pdata[19] = bswap_32( nonce );
+         submit_solution( work, hash32, mythr );
+      }
+      nonce++;
+   } while ( nonce < max_nonce && !(*restart) );
+   pdata[19] = nonce;
+   *hashes_done = pdata[19] - first_nonce;
+   return 0;
+}
+
 #endif
--- a/algo/x16/x21s-4way.c
+++ b/algo/x16/x21s-4way.c
@@ -30,7 +30,7 @@ union _x21s_8way_context_overlay

 typedef union _x21s_8way_context_overlay x21s_8way_context_overlay;

-int x21s_8way_hash( void* output, const void* input, const int thrid )
+int x21s_8way_hash( void* output, const void* input, int thrid )
 {
   uint32_t vhash[16*8] __attribute__ ((aligned (128)));
   uint8_t shash[64*8] __attribute__ ((aligned (64)));
@@ -129,6 +129,66 @@ int x21s_8way_hash( void* output, const void* input, const int thrid )
   return 1;
 }

+int scanhash_x21s_8way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr)
+{
+   uint32_t hash[16*8] __attribute__ ((aligned (128)));
+   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   uint32_t *hash7 = &hash[7<<3];
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+   uint32_t bedata1[2] __attribute__((aligned(64)));
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t Htarg = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   uint32_t n = first_nonce;
+   const uint32_t last_nonce = max_nonce - 16;
+   const int thr_id = mythr->id;
+   __m512i  *noncev = (__m512i*)vdata + 9;   // aligned
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+   const bool bench = opt_benchmark;
+
+   if ( bench )   ptarget[7] = 0x0cff;
+
+   bedata1[0] = bswap_32( pdata[1] );
+   bedata1[1] = bswap_32( pdata[2] );
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   uint32_t ntime = bswap_32( pdata[17] );
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*)bedata1, x16r_hash_order );
+      s_ntime = ntime;
+      if ( opt_debug && !thr_id )
+              applog( LOG_INFO, "hash order %s (%08x)", x16r_hash_order, ntime );
+   }
+
+   x16r_8way_prehash( vdata, pdata );
+   *noncev = mm512_intrlv_blend_32( _mm512_set_epi32(
+                             n+7, 0, n+6, 0, n+5, 0, n+4, 0,
+                             n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
+   do
+   {
+      if ( x21s_8way_hash( hash, vdata, thr_id ) )
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( unlikely( hash7[lane] <= Htarg ) )
+      {
+         extr_lane_8x32( lane_hash, hash, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
+         {
+             pdata[19] = bswap_32( n + lane );
+             submit_solution( work, lane_hash, mythr );
+         }
+      }
+      *noncev = _mm512_add_epi32( *noncev,
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
+      n += 8;
+   } while ( likely( ( n < last_nonce ) && !(*restart) ) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
 bool x21s_8way_thread_init()
 {
   const int64_t ROW_LEN_INT64 = BLOCK_LEN_INT64 * 4; // nCols
@@ -155,7 +215,7 @@ union _x21s_4way_context_overlay

 typedef union _x21s_4way_context_overlay x21s_4way_context_overlay;

-int x21s_4way_hash( void* output, const void* input, const int thrid )
+int x21s_4way_hash( void* output, const void* input, int thrid )
 {
   uint32_t vhash[16*4] __attribute__ ((aligned (64)));
   uint8_t  shash[64*4] __attribute__ ((aligned (64)));
@@ -231,6 +291,58 @@ int x21s_4way_hash( void* output, const void* input, const int thrid )
   return 1;
 }

+int scanhash_x21s_4way( struct work *work, uint32_t max_nonce,
+                        uint64_t *hashes_done, struct thr_info *mythr)
+{
+   uint32_t hash[16*4] __attribute__ ((aligned (64)));
+   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
+   uint32_t bedata1[2] __attribute__((aligned(64)));
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   __m256i  *noncev = (__m256i*)vdata + 9;   // aligned
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+
+   if ( bench )  ptarget[7] = 0x0cff;
+ 
+   bedata1[0] = bswap_32( pdata[1] );
+   bedata1[1] = bswap_32( pdata[2] );
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   uint32_t ntime = bswap_32( pdata[17] );
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*)bedata1, x16r_hash_order );
+      s_ntime = ntime;
+      if ( opt_debug && !thr_id )
+              applog( LOG_DEBUG, "hash order %s (%08x)", x16r_hash_order, ntime );
+   }
+
+   x16r_4way_prehash( vdata, pdata );
+   *noncev = mm256_intrlv_blend_32(
+                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
+   do
+   {
+      if ( x21s_4way_hash( hash, vdata, thr_id ) )
+      for ( int i = 0; i < 4; i++ )
+      if ( unlikely( valid_hash( hash + (i<<3), ptarget ) && !bench ) )
+      {
+         pdata[19] = bswap_32( n+i );
+         submit_solution( work, hash+(i<<3), mythr );
+      }
+      *noncev = _mm256_add_epi32( *noncev,
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
+      n += 4;
+   } while ( likely( (  n < last_nonce ) && !(*restart) ) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
 bool x21s_4way_thread_init()
 {
   const int64_t ROW_LEN_INT64 = BLOCK_LEN_INT64 * 4; // nCols
--- a/algo/x16/x21s.c
+++ b/algo/x16/x21s.c
@@ -27,7 +27,7 @@ union _x21s_context_overlay
 };
 typedef union _x21s_context_overlay x21s_context_overlay;

-int x21s_hash( void* output, const void* input, const int thrid )
+int x21s_hash( void* output, const void* input, int thrid )
 {
   uint32_t _ALIGN(128) hash[16];
   x21s_context_overlay ctx;
@@ -57,6 +57,50 @@ int x21s_hash( void* output, const void* input, const int thrid )
   return 1;
 }

+int scanhash_x21s( struct work *work, uint32_t max_nonce,
+                   uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t _ALIGN(128) hash32[8];
+   uint32_t _ALIGN(128) edata[20];
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const int thr_id = mythr->id;
+   uint32_t nonce = first_nonce;
+   volatile uint8_t *restart = &(work_restart[thr_id].restart);
+   const bool bench = opt_benchmark;
+   if ( bench )  ptarget[7] = 0x0cff;
+
+   mm128_bswap32_80( edata, pdata );
+
+   static __thread uint32_t s_ntime = UINT32_MAX;
+   const uint32_t ntime = swab32( pdata[17] );
+   if ( s_ntime != ntime )
+   {
+      x16_r_s_getAlgoString( (const uint8_t*)(&edata[1]), x16r_hash_order );
+      s_ntime = ntime;
+      if ( opt_debug && !thr_id )
+          applog( LOG_INFO, "hash order %s (%08x)", x16r_hash_order, ntime );
+   }
+
+   x16r_prehash( edata, pdata );
+
+   do
+   {
+      edata[19] = nonce;
+      if ( x21s_hash( hash32, edata, thr_id ) )
+      if ( unlikely( valid_hash( hash32, ptarget ) && !bench ) )
+      {
+         pdata[19] = bswap_32( nonce );
+         submit_solution( work, hash32, mythr );
+      }
+      nonce++;
+   } while ( nonce < max_nonce && !(*restart) );
+   pdata[19] = nonce;
+   *hashes_done = pdata[19] - first_nonce;
+   return 0;
+}
+
 bool x21s_thread_init()
 {
   const int64_t ROW_LEN_INT64 = BLOCK_LEN_INT64 * 4; // nCols
--- a/algo/x17/x17-4way.c
+++ b/algo/x17/x17-4way.c
@@ -254,9 +254,10 @@ int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32_d7 = ptarget[7];
-   const __m512i eight = m512_const1_64( 8 );
+   const __m512i eight = _mm512_set1_epi64( 8 );
   const bool bench = opt_benchmark;

+   // convert LE32 to LE64
   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
   edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
   edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
@@ -467,9 +468,10 @@ int scanhash_x17_4way( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32_d7 = ptarget[7];
-   const __m256i four = m256_const1_64( 4 );
+   const __m256i four = _mm256_set1_epi64x( 4 );
   const bool bench = opt_benchmark;

+   // convert LE32 to LE64
   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
   edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
   edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
--- a/algo/x22/x22i-4way.c
+++ b/algo/x22/x22i-4way.c
@@ -445,7 +445,7 @@ int scanhash_x22i_8way_sha( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -494,7 +494,7 @@ int scanhash_x22i_8way( struct work *work, uint32_t max_nonce,
         }
      }
      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
+                                  _mm512_set1_epi64( 0x0000000800000000 ) );
      n += 8;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -787,7 +787,7 @@ int scanhash_x22i_4way_sha( struct work* work, uint32_t max_nonce,
         submit_solution( work, hash+(i<<3), mythr );
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
@@ -835,7 +835,7 @@ int scanhash_x22i_4way( struct work* work, uint32_t max_nonce,
         }
      }
      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
+                                  _mm256_set1_epi64x( 0x0000000400000000 ) );
      n += 4;
   } while ( likely( ( n <= last_nonce ) && !work_restart[thr_id].restart ) );
   pdata[19] = n;
--- a/algo/x22/x25x-4way.c
+++ b/algo/x22/x25x-4way.c
@@ -571,7 +571,7 @@ int scanhash_x25x_8way( struct work *work, uint32_t max_nonce,
   const int thr_id = mythr->id;
   const uint32_t targ32 = ptarget[7];
   const bool bench = opt_benchmark;
-   const __m512i eight = m512_const1_64( 8 );
+   const __m512i eight = _mm512_set1_epi64( 8 );
   if ( bench )  ptarget[7] = 0x08ff;

   edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) ); 
@@ -927,7 +927,7 @@ int scanhash_x25x_4way( struct work* work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const uint32_t targ32 = ptarget[7];
-   const __m256i four = m256_const1_64( 4 );
+   const __m256i four = _mm256_set1_epi64x( 4 );
   const bool bench = opt_benchmark;

   if ( bench ) ptarget[7] = 0x08ff;
--- a/algo/yespower/yespower-gate.c
+++ b/algo/yespower/yespower-gate.c
@@ -31,26 +31,8 @@

 yespower_params_t yespower_params;

-// master g_work 
-sha256_context yespower_sha256_prehash_ctx;
-uint32_t _ALIGN(64) yespower_endiandata[20];
-
-// local work
 __thread sha256_context sha256_prehash_ctx;

-
-int yespower_sha256_prehash( struct work *work )
-{
-   uint32_t *pdata = work->data;
-
-   for ( int k = 0; k < 19; k++ )
-      be32enc( &yespower_endiandata[k], pdata[k] );
-
-   sha256_ctx_init( &yespower_sha256_prehash_ctx );
-   sha256_update( &yespower_sha256_prehash_ctx, yespower_endiandata, 64 );
-
-   return 1;
-}
 // YESPOWER

 int yespower_hash( const char *input, char *output, uint32_t len, int thrid )
@@ -71,15 +53,14 @@ int scanhash_yespower( struct work *work, uint32_t max_nonce,
   uint32_t n = first_nonce;
   const int thr_id = mythr->id;

-//   pthread_rwlock_rdlock( &g_work_lock );
-
-   memcpy( endiandata, yespower_endiandata, sizeof endiandata );
-   memcpy( &sha256_prehash_ctx, &yespower_sha256_prehash_ctx, sizeof sha256_prehash_ctx );
-
-//   pthread_rwlock_unlock( &g_work_lock );
-
+   for ( int k = 0; k < 19; k++ )
+      be32enc( &endiandata[k], pdata[k] );
   endiandata[19] = n;

+   // do sha256 prehash
+   sha256_ctx_init( &sha256_prehash_ctx );
+   sha256_update( &sha256_prehash_ctx, endiandata, 64 );
+
   do {
      if ( yespower_hash( (char*)endiandata, (char*)vhash, 80, thr_id ) )
      if unlikely( valid_hash( vhash, ptarget ) && !opt_benchmark )
@@ -159,7 +140,6 @@ bool register_yespower_algo( algo_gate_t* gate )

  gate->optimizations = SSE2_OPT | SHA_OPT;
  gate->scanhash      = (void*)&scanhash_yespower;
-  gate->prehash       = (void*)&yespower_sha256_prehash;
  gate->hash          = (void*)&yespower_hash;
  opt_target_factor = 65536.0;
  return true;
@@ -174,7 +154,6 @@ bool register_yespowerr16_algo( algo_gate_t* gate )
  yespower_params.perslen = 0;
  gate->optimizations = SSE2_OPT | SHA_OPT;
  gate->scanhash      = (void*)&scanhash_yespower;
-  gate->prehash       = (void*)&yespower_sha256_prehash;
  gate->hash          = (void*)&yespower_hash;
  opt_target_factor = 65536.0;
  return true;
@@ -186,7 +165,6 @@ bool register_yescrypt_algo( algo_gate_t* gate )
 {
   gate->optimizations = SSE2_OPT | SHA_OPT;
   gate->scanhash   = (void*)&scanhash_yespower;
-   gate->prehash       = (void*)&yespower_sha256_prehash;
   yespower_params.version = YESPOWER_0_5;
   opt_target_factor = 65536.0;

@@ -220,7 +198,6 @@ bool register_yescryptr8_algo( algo_gate_t* gate )
 {
   gate->optimizations = SSE2_OPT | SHA_OPT;
   gate->scanhash   = (void*)&scanhash_yespower;
-   gate->prehash       = (void*)&yespower_sha256_prehash;
   yespower_params.version = YESPOWER_0_5;
   yespower_params.N       = 2048;
   yespower_params.r       = 8;
@@ -234,7 +211,6 @@ bool register_yescryptr16_algo( algo_gate_t* gate )
 {
   gate->optimizations = SSE2_OPT | SHA_OPT;
   gate->scanhash   = (void*)&scanhash_yespower;
-   gate->prehash       = (void*)&yespower_sha256_prehash;
   yespower_params.version = YESPOWER_0_5;
   yespower_params.N       = 4096;
   yespower_params.r       = 16;
@@ -248,7 +224,6 @@ bool register_yescryptr32_algo( algo_gate_t* gate )
 {
   gate->optimizations = SSE2_OPT | SHA_OPT;
   gate->scanhash   = (void*)&scanhash_yespower;
-   gate->prehash       = (void*)&yespower_sha256_prehash;
   yespower_params.version = YESPOWER_0_5;
   yespower_params.N       = 4096;
   yespower_params.r       = 32;
--- a/algo/yespower/yespower.h
+++ b/algo/yespower/yespower.h
@@ -80,8 +80,6 @@ extern yespower_params_t yespower_params;

 extern __thread sha256_context sha256_prehash_ctx;

-int yespower_sha256_prehash( struct work *work );
-
 /**
 * yespower_init_local(local):
 * Initialize the thread-local (RAM) data structure.  Actual memory allocation
--- a/build-allarch.sh
+++ b/build-allarch.sh
@@ -29,10 +29,11 @@ mv cpuminer cpuminer-avx512-sha-vaes
 # Zen4 AVX512 SHA VAES
 make clean || echo clean
 rm -f config.status
-# znver3 needs gcc-11, znver4 ?
+# znver3 needs gcc-11, znver4 needs gcc-12.3.
 #CFLAGS="-O3 -march=znver4 -Wall -fno-common " ./configure --with-curl
-CFLAGS="-O3 -march=znver3 -mavx512f -mavx512dq -mavx512bw -mavx512vl -Wall -fno-common " ./configure --with-curl
-#CFLAGS="-O3 -march=znver2 -mvaes -mavx512f -mavx512dq -mavx512bw -mavx512vl -Wall -fno-common " ./configure --with-curl
+# Incomplete list of Zen4 AVX512 extensions but includes all extensions used by cpuminer.
+CFLAGS="-O3 -march=znver3 -mavx512f -mavx512cd -mavx512dq -mavx512bw -mavx512vl -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -Wall -fno-common " ./configure --with-curl
+#CFLAGS="-O3 -march=znver2 -mvaes -mavx512f -mavx512dq -mavx512bw -mavx512vl -mavx512vbmi -Wall -fno-common " ./configure --with-curl
 make -j 8
 strip -s cpuminer
 mv cpuminer cpuminer-zen4
--- a/20
+++ b/20
@@ -1,6 +1,6 @@
 #! /bin/sh
 # Guess values for system-dependent variables and create Makefiles.
-# Generated by GNU Autoconf 2.71 for cpuminer-opt 3.21.3.
+# Generated by GNU Autoconf 2.71 for cpuminer-opt 3.23.0.
 #
 #
 # Copyright (C) 1992-1996, 1998-2017, 2020-2021 Free Software Foundation,
@@ -608,8 +608,8 @@ MAKEFLAGS=
 # Identity of this package.
 PACKAGE_NAME='cpuminer-opt'
 PACKAGE_TARNAME='cpuminer-opt'
-PACKAGE_VERSION='3.21.3'
-PACKAGE_STRING='cpuminer-opt 3.21.3'
+PACKAGE_VERSION='3.23.0'
+PACKAGE_STRING='cpuminer-opt 3.23.0'
 PACKAGE_BUGREPORT=''
 PACKAGE_URL=''

@@ -1360,7 +1360,7 @@ if test "$ac_init_help" = "long"; then
  # Omit some internal or obsolete options to make the list less imposing.
  # This message is too long to be a string in the A/UX 3.1 sh.
  cat <<_ACEOF
-\`configure' configures cpuminer-opt 3.21.3 to adapt to many kinds of systems.
+\`configure' configures cpuminer-opt 3.23.0 to adapt to many kinds of systems.

 Usage: $0 [OPTION]... [VAR=VALUE]...

@@ -1432,7 +1432,7 @@ fi

 if test -n "$ac_init_help"; then
  case $ac_init_help in
-     short | recursive ) echo "Configuration of cpuminer-opt 3.21.3:";;
+     short | recursive ) echo "Configuration of cpuminer-opt 3.23.0:";;
   esac
  cat <<\_ACEOF

@@ -1538,7 +1538,7 @@ fi
 test -n "$ac_init_help" && exit $ac_status
 if $ac_init_version; then
  cat <<\_ACEOF
-cpuminer-opt configure 3.21.3
+cpuminer-opt configure 3.23.0
 generated by GNU Autoconf 2.71

 Copyright (C) 2021 Free Software Foundation, Inc.
@@ -1985,7 +1985,7 @@ cat >config.log <<_ACEOF
 This file contains any messages produced by compilers while
 running configure, to aid debugging if configure makes a mistake.

-It was created by cpuminer-opt $as_me 3.21.3, which was
+It was created by cpuminer-opt $as_me 3.23.0, which was
 generated by GNU Autoconf 2.71.  Invocation command line was

  $ $0$ac_configure_args_raw
@@ -3593,7 +3593,7 @@ fi

 # Define the identity of the package.
 PACKAGE='cpuminer-opt'
- VERSION='3.21.3'
+ VERSION='3.23.0'


 printf "%s\n" "#define PACKAGE \"$PACKAGE\"" >>confdefs.h
@@ -7508,7 +7508,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
 # report actual input values of CONFIG_FILES etc. instead of their
 # values after options handling.
 ac_log="
-This file was extended by cpuminer-opt $as_me 3.21.3, which was
+This file was extended by cpuminer-opt $as_me 3.23.0, which was
 generated by GNU Autoconf 2.71.  Invocation command line was

  CONFIG_FILES    = $CONFIG_FILES
@@ -7576,7 +7576,7 @@ ac_cs_config_escaped=`printf "%s\n" "$ac_cs_config" | sed "s/^ //; s/'/'\\\\\\\\
 cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
 ac_cs_config='$ac_cs_config_escaped'
 ac_cs_version="\\
-cpuminer-opt config.status 3.21.3
+cpuminer-opt config.status 3.23.0
 configured by $0, generated by GNU Autoconf 2.71,
  with options \\"\$ac_cs_config\\"

--- a/configure.ac
+++ b/configure.ac
@@ -1,4 +1,4 @@
-AC_INIT([cpuminer-opt], [3.21.3])
+AC_INIT([cpuminer-opt], [3.23.0])

 AC_PREREQ([2.59c])
 AC_CANONICAL_SYSTEM
Author	SHA1	Message	Date
Jay D Dee	4378d2f841	v3.23.0	2023-08-30 20:15:48 -04:00
Jay D Dee	57a6b7b58b	v3.22.3	2023-06-14 11:07:40 -04:00
Jay D Dee	de564ccbde	v3.22.2	2023-04-06 13:38:37 -04:00
Jay D Dee	fcd7727b0d	v3.22.1	2023-03-24 18:29:42 -04:00
Jay D Dee	3dd6787531	v3.22.0	2023-03-21 17:12:51 -04:00
Jay D Dee	cae1ce2ab7	v3.21.5	2023-03-15 12:27:04 -04:00
Jay D Dee	7a91c41d74	v3.21.4	2023-03-13 14:54:38 -04:00
Jay D Dee	c6bc9d67fb	v3.21.3 Unreleased	2023-03-13 03:20:13 -04:00