v3.15.2

88 changed files with 4754 additions and 2096 deletions
--- a/Makefile.am
+++ b/Makefile.am
@@ -85,6 +85,7 @@ cpuminer_SOURCES = \
  algo/groestl/aes_ni/hash-groestl.c \
  algo/groestl/aes_ni/hash-groestl256.c \
  algo/fugue/sph_fugue.c \
+  algo/fugue/fugue-aesni.c \
  algo/hamsi/sph_hamsi.c \
  algo/hamsi/hamsi-hash-4way.c \
  algo/haval/haval.c \
--- a/README.txt
+++ b/README.txt
@@ -1,6 +1,10 @@
 This file is included in the Windows binary package. Compile instructions
 for Linux and Windows can be found in RELEASE_NOTES.

+This package is officially available only from:
+ https://github.com/JayDDee/cpuminer-opt
+No other sources should be trusted.
+
 cpuminer is a console program that is executed from a DOS or Powershell
 prompt. There is no GUI and no mouse support.

@@ -31,20 +35,31 @@ https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures
 https://en.wikipedia.org/wiki/List_of_AMD_CPU_microarchitectures


-Exe file name              Compile flags              Arch name
+Exe file name                Compile flags            Arch name

 cpuminer-sse2.exe            "-msse2"                 Core2, Nehalem   
-cpuminer-aes-sse42.exe       "-march=westmere"        Westmere
+cpuminer-aes-sse42.exe       "-march=westmere"        Westmere
 cpuminer-avx.exe             "-march=corei7-avx"      Sandybridge, Ivybridge
-cpuminer-avx2.exe            "-march=core-avx2 -maes" Haswell*
+cpuminer-avx2.exe            "-march=core-avx2 -maes" Haswell(1)
 cpuminer-avx512.exe          "-march=skylake-avx512"  Skylake-X, Cascadelake-X
-cpuminer-zen.exe             "-march=znver1"          AMD Ryzen, Threadripper
-cpuminer-avx512-sha-vaes.exe "-march=icelake-client"  Icelake*
+cpuminer-zen.exe             "-march=znver1"          Zen1, Zen2
+cpuminer-zen3.exe            "-march=znver2 -mvaes"   Zen3(2) 
+cpuminer-avx512-sha-vaes.exe "-march=icelake-client"  Icelake(3)

-* Haswell includes Broadwell, Skylake, Kabylake, Coffeelake & Cometlake. 
-Icelake is only available on some laptops. Mining with a laptop is not
-recommended. The icelake build is included in anticipation of Intel eventually
-releasing a desktop CPU with a microarchitecture newer than Skylake.
+(1) Haswell includes Broadwell, Skylake, Kabylake, Coffeelake & Cometlake. 
+(2) Zen3 build uses Zen2+VAES as workaround until Zen3 compiler support is
+    available. Zen2 CPUs should use Zen build.
+(3) Icelake is only available on some laptops. Mining with a laptop is not
+recommended.
+
+Notes about included DLL files:
+
+Downloading DLL files from alternative sources presents an inherent
+security risk if their source is unknown. All DLL files included have
+been copied from the Ubuntu-20.04 installation or compiled by me from
+source code obtained from the author's official repository. The exact
+procedure is documented in the build instructions for Windows:
+https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source

 If you like this software feel free to donate:

--- a/63
+++ b/63
@@ -44,7 +44,7 @@ Please include the following information:
 1. CPU model, operating system, cpuminer-opt version (must be latest),
   binary file for Windows, changes to default build procedure for Linux.

-2. Exact comand line (except user and pw) and intial output showing
+2. Exact command line (except user and pw) and initial output showing
   the above requested info.

 3. Additional program output showing any error messages or other
@@ -65,6 +65,67 @@ If not what makes it happen or not happen?
 Change Log
 ----------

+v3.15.2
+
+Zen3 AVX2+VAES optimization for x16*, x17, sonoa, xevan, x21s, x22i, x25x,
+allium.
+Zen3 build added to Windows binary package.
+
+v3.15.1
+
+Fix compile on AMD Zen3 CPUs with VAES.
+Force new work immediately after solving a block solo.
+
+
+v3.15.0
+
+Fugue optimized with AES, improves many sha3 algos.
+Minotaur algo optimized for all architectures.
+Fixed neoscrypt BUG log.
+ 
+v3.14.3
+
+#265: more mutex changes to reduce blocking with high thread count.
+
+#267: fixed hodl algo potential memory alignment issue,
+      add warning when thread count is not valid for mining hodl algo.
+
+v3.14.2
+
+The second line of the Share Accepted log is no longer displayed,
+new Xnonce log is added and other small log tweaks.
+
+#265: Cleanup use of mutex.
+
+v3.14.1
+
+GBT and getwork log changes:
+ fixed missing TTF in New Block log,
+ ntime no longer byte-swapped for display in New Work log,
+ fixed zero effective hash rate in Periodic Report log,
+ deleted "Current block is..." log.
+
+Renamed stratum "New Job" log to "New Work" to be consistent with the solo
+version of the log. Added more data to both versions.
+
+v3.14.0
+
+Changes to solo mining:
+  - segwit is supported by getblocktemplate,
+  - longpolling is not working and is disabled,
+  - Periodic Report log is output,
+  - New Block log includes TTF estimates,
+  - Stratum thread no longer created when using getwork or GBT.
+
+Fixed BUG log mining sha256d.
+
+v3.13.1.1
+
+Fixed Windows crash mining minotaur algo.
+
+Fixed GCC 10 compile again.
+Added -fno-common to testing to be consistent with GCC 10 default.
+
 v3.13.1

 Added minotaur algo for Ringcoin.
--- a/algo-gate-api.c
+++ b/algo-gate-api.c
@@ -105,17 +105,16 @@ int scanhash_generic( struct work *work, uint32_t max_nonce,
   uint32_t hash[8] __attribute__((aligned(64)));
   uint32_t *pdata = work->data;
   uint32_t *ptarget = work->target;
-   uint32_t n = pdata[19];
   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 1;
+   uint32_t n = first_nonce;
   const int thr_id = mythr->id;
   const bool bench = opt_benchmark;

   mm128_bswap32_80( edata, pdata );
-
   do
   {
      edata[19] = n;
-
      if ( likely( algo_gate.hash( hash, edata, thr_id ) ) )
      if ( unlikely( valid_hash( hash, ptarget ) && !bench ) )
      {
@@ -123,12 +122,125 @@ int scanhash_generic( struct work *work, uint32_t max_nonce,
         submit_solution( work, hash, mythr );
      }
      n++;
-   } while ( n < max_nonce && !work_restart[thr_id].restart );
+   } while ( n < last_nonce && !work_restart[thr_id].restart );
   *hashes_done = n - first_nonce;
   pdata[19] = n;
   return 0;
 }

+#if defined(__AVX2__)
+
+//int scanhash_4way_64_64( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr )
+
+//int scanhash_4way_64_640( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr )
+
+int scanhash_4way_64in_32out( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t hash32[8*4] __attribute__ ((aligned (64)));
+   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+   uint32_t *hash32_d7 = &(hash32[ 7*4 ]);
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
+   __m256i  *noncev = (__m256i*)vdata + 9;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const uint32_t targ32_d7 = ptarget[7];
+   const bool bench = opt_benchmark;
+
+   mm256_bswap32_intrlv80_4x64( vdata, pdata );
+   *noncev = mm256_intrlv_blend_32(
+                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
+   do
+   {
+      if ( likely( algo_gate.hash( hash32, vdata, thr_id ) ) )
+      for ( int lane = 0; lane < 4; lane++ )
+      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 && !bench ) )
+      {
+         extr_lane_4x32( lane_hash, hash32, lane, 256 );
+         if ( valid_hash( lane_hash, ptarget ) )
+         {
+            pdata[19] = bswap_32( n + lane );
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      *noncev = _mm256_add_epi32( *noncev,
+                                  m256_const1_64( 0x0000000400000000 ) );
+      n += 4;
+   } while ( likely( ( n <= last_nonce ) && !work_restart[thr_id].restart ) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+//int scanhash_8way_32_32( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr )
+
+#endif
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+//int scanhash_8way_64_64( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr )
+
+//int scanhash_8way_64_640( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr )
+
+int scanhash_8way_64in_32out( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t hash32[8*8] __attribute__ ((aligned (128)));
+   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+   uint32_t *hash32_d7 = &(hash32[7*8]);
+   uint32_t *pdata = work->data;
+   const uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 8;
+   __m512i  *noncev = (__m512i*)vdata + 9;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const uint32_t targ32_d7 = ptarget[7];
+   const bool bench = opt_benchmark;
+
+   mm512_bswap32_intrlv80_8x64( vdata, pdata );
+   *noncev = mm512_intrlv_blend_32(
+              _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
+                                n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
+   do
+   {
+      if ( likely( algo_gate.hash( hash32, vdata, thr_id ) ) )
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( unlikely( ( hash32_d7[ lane ] <= targ32_d7 ) && !bench ) )
+      {
+         extr_lane_8x32( lane_hash, hash32, lane, 256 );
+         if ( likely( valid_hash( lane_hash, ptarget ) ) )
+         {
+            pdata[19] = bswap_32( n + lane );
+            submit_solution( work, lane_hash, mythr );
+         }
+      }
+      *noncev = _mm512_add_epi32( *noncev,
+                                  m512_const1_64( 0x0000000800000000 ) );
+      n += 8;
+   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
+   pdata[19] = n;
+   *hashes_done = n - first_nonce;
+   return 0;
+}
+
+//int scanhash_16way_32_32( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr )
+
+#endif
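
The N-way scanhash functions above keep each lane's nonce in the upper 32 bits of a 64-bit interleaved element and advance all lanes with one add of `0x0000000400000000` (4-way) or `0x0000000800000000` (8-way) per element. The following is a minimal scalar sketch of that bookkeeping, assuming a plain `uint64_t` array in place of the `__m256i` vector. The names `nonce_vec_init`, `nonce_vec_step`, `nonce_of` and `scan_count` are hypothetical and do not appear in cpuminer-opt.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4   // 4-way case; the 8-way variant would use 8

// Put nonce n+lane in the upper 32 bits of each 64-bit element and keep
// the lower 32 bits (the preceding header word) unchanged.
void nonce_vec_init( uint64_t v[LANES], uint32_t n )
{
   for ( int lane = 0; lane < LANES; lane++ )
      v[lane] = ( (uint64_t)( n + lane ) << 32 ) | ( v[lane] & 0xffffffffULL );
}

// Advance every lane by LANES nonces. Only the upper 32 bits change.
void nonce_vec_step( uint64_t v[LANES] )
{
   for ( int lane = 0; lane < LANES; lane++ )
      v[lane] += (uint64_t)LANES << 32;
}

// Read back the nonce currently held by one lane.
uint32_t nonce_of( const uint64_t v[LANES], int lane )
{
   return (uint32_t)( v[lane] >> 32 );
}

// Count the nonces visited by a loop shaped like scanhash_4way_64in_32out.
uint32_t scan_count( uint32_t first_nonce, uint32_t max_nonce )
{
   const uint32_t last_nonce = max_nonce - LANES;
   uint64_t v[LANES] = { 0 };
   uint32_t n = first_nonce;

   nonce_vec_init( v, n );
   do
   {
      // Hashing and the per-lane target test would go here.
      assert( nonce_of( v, 0 ) == n );   // scalar counter tracks lane 0
      nonce_vec_step( v );
      n += LANES;
   } while ( n <= last_nonce );
   return n - first_nonce;
}
```

The scalar counter `n` always equals the nonce in lane 0, which is why the diff can report a solution as `n + lane` without reading the vector back.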
+
+
+
 int null_hash()
 {
   applog(LOG_WARNING,"SWERR: null_hash unsafe null function");
--- a/algo-gate-api.h
+++ b/algo-gate-api.h
@@ -90,10 +90,11 @@ typedef  uint32_t set_t;
 #define AES_OPT          2  
 #define SSE42_OPT        4
 #define AVX_OPT          8   // Sandybridge
-#define AVX2_OPT      0x10   // Haswell
-#define SHA_OPT       0x20   // sha256 (Ryzen, Ice Lake)
-#define AVX512_OPT    0x40   // AVX512- F, VL, DQ, BW (Skylake-X)
-#define VAES_OPT      0x80   // VAES (Ice Lake)
+#define AVX2_OPT      0x10   // Haswell, Zen1
+#define SHA_OPT       0x20   // Zen1, Icelake (sha256)
+#define AVX512_OPT    0x40   // Skylake-X (AVX512[F,VL,DQ,BW])
+#define VAES_OPT      0x80   // Icelake (VAES & AVX512)
+#define VAES256_OPT   0x100  // Zen3 (VAES without AVX512)
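
The flags above are single bits in a `set_t`, so feature checks reduce to mask tests. The following is a minimal sketch of how the new `VAES256_OPT` bit could separate the Zen3 case (VAES without AVX512) from the Icelake case. The flag values are copied from the diff; `has_all` and `echo_lanes_sketch` are hypothetical helpers for illustration, and the real code selects its implementation at compile time rather than this way.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t set_t;

#define AVX2_OPT      0x10   // Haswell, Zen1
#define AVX512_OPT    0x40   // Skylake-X
#define VAES_OPT      0x80   // Icelake (VAES & AVX512)
#define VAES256_OPT   0x100  // Zen3 (VAES without AVX512)

// True when every flag in `need` is present in `have` (hypothetical helper).
static inline bool has_all( set_t have, set_t need )
{
   return ( have & need ) == need;
}

// Illustrative only: choose how many 128-bit lanes an echo-style
// VAES hash could process per call for a given feature set.
int echo_lanes_sketch( set_t cpu )
{
   if ( has_all( cpu, VAES_OPT | AVX512_OPT ) )   return 4;  // 512-bit vectors
   if ( has_all( cpu, VAES256_OPT | AVX2_OPT ) )  return 2;  // 256-bit vectors
   return 1;                                                 // no VAES
}
```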


 // return set containing all elements from sets a & b
@@ -110,10 +111,12 @@ inline bool set_excl ( set_t a, set_t b ) { return (a & b) == 0; }

 typedef struct
 {
-// mandatory function, must be overwritten
+// Mandatory functions: at least one of these must be overridden. If a generic
+// scanhash is used, a custom target hash function must be registered. With a
+// custom scanhash, the target hash function can be called directly and doesn't
+// need to be registered in the gate.
 int ( *scanhash ) ( struct work*, uint32_t, uint64_t*, struct thr_info* );

-//int ( *hash )     ( void*, const void*, uint32_t ) ;
 int ( *hash )     ( void*, const void*, int );

 //optional, safe to use default in most cases
@@ -126,7 +129,7 @@ bool ( *miner_thread_init )     ( int );
 void ( *get_new_work )          ( struct work*, struct work*, int, uint32_t* );

 // Decode getwork blockheader
-bool ( *work_decode )           ( const json_t*, struct work* );
+bool ( *work_decode )           ( struct work* );

 // Extra getwork data
 void ( *decode_extra_data )     ( struct work*, uint64_t* );
@@ -201,19 +204,61 @@ void four_way_not_tested();
 #define STD_WORK_DATA_SIZE 128
 #define STD_WORK_CMP_SIZE 76

-#define JR2_NONCE_INDEX 39  // 8 bit offset
+//#define JR2_NONCE_INDEX 39  // 8 bit offset

 // These indexes are only used with JSON RPC2 and are not gated.
-#define JR2_WORK_CMP_INDEX_2 43
-#define JR2_WORK_CMP_SIZE_2 33
+//#define JR2_WORK_CMP_INDEX_2 43
+//#define JR2_WORK_CMP_SIZE_2 33

 // deprecated, use generic instead
 int null_scanhash();

 // Default generic, may be used in many cases.
+// N-way is more complicated, requires many different implementations
+// depending on architecture, input format, and output format.
+// Naming convention is scanhash_[N]way_[input format]in_[output format]out
+// N = number of lanes
+// input/output format:
+//    32: 32 bit interleaved parallel lanes
+//    64: 64 bit interleaved parallel lanes
+//    640: input only, not interleaved, contiguous serial 640 bit lanes.
+//    256: output only, not interleaved, contiguous serial 256 bit lanes.
+
 int scanhash_generic( struct work *work, uint32_t max_nonce,
                      uint64_t *hashes_done, struct thr_info *mythr );

+#if defined(__AVX2__)
+
+//int scanhash_4way_64in_64out( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr );
+
+//int scanhash_4way_64in_256out( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr );
+
+int scanhash_4way_64in_32out( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr );
+
+//int scanhash_8way_32in_32out( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr );
+
+#endif
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+//int scanhash_8way_64in_64out( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr );
+
+//int scanhash_8way_64in_256out( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr );
+
+int scanhash_8way_64in_32out( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr );
+
+//int scanhash_16way_32in_32out( struct work *work, uint32_t max_nonce,
+//                      uint64_t *hashes_done, struct thr_info *mythr );
+
+#endif
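
The naming convention above distinguishes interleaved lanes (`32`, `64`) from contiguous serial lanes (`640`, `256`). The following is a minimal scalar sketch of the difference, assuming word `w` of lane `l` is stored at index `w*lanes + l` in the interleaved form. The function names are hypothetical and are not the project's SIMD helpers.

```c
#include <stddef.h>
#include <stdint.h>

// Convert contiguous serial lanes (lane 0 words, then lane 1 words, ...)
// into 64-bit interleaved form (word 0 of every lane, then word 1, ...).
void intrlv_64_sketch( uint64_t *dst, const uint64_t *src,
                       size_t lanes, size_t words )
{
   for ( size_t w = 0; w < words; w++ )
      for ( size_t l = 0; l < lanes; l++ )
         dst[ w*lanes + l ] = src[ l*words + w ];
}

// Pull one lane back out of the interleaved buffer into serial form.
void extr_lane_64_sketch( uint64_t *dst, const uint64_t *src,
                          size_t lane, size_t lanes, size_t words )
{
   for ( size_t w = 0; w < words; w++ )
      dst[ w ] = src[ w*lanes + lane ];
}
```

With this layout one vector load picks up the same word from every lane, which is what lets a single instruction work on all lanes at once.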
+
 // displays warning
 int null_hash    ();

@@ -225,8 +270,8 @@ void std_get_new_work( struct work *work, struct work *g_work, int thr_id,
 void sha256d_gen_merkle_root( char *merkle_root, struct stratum_ctx *sctx );
 void SHA256_gen_merkle_root ( char *merkle_root, struct stratum_ctx *sctx );

-bool std_le_work_decode( const json_t *val, struct work *work );
-bool std_be_work_decode( const json_t *val, struct work *work );
+bool std_le_work_decode( struct work *work );
+bool std_be_work_decode( struct work *work );

 bool std_le_submit_getwork_result( CURL *curl, struct work *work );
 bool std_be_submit_getwork_result( CURL *curl, struct work *work );
@@ -261,7 +306,7 @@ int std_get_work_data_size();
 // by calling the algo's register function.
 bool register_algo_gate( int algo, algo_gate_t *gate );

-// Called by algos toverride any default gate functions that are applicable
+// Called by algos to override any default gate functions that are applicable
 // and do any other algo-specific initialization.
 // The register functions for all the algos can be declared here to reduce
 // compiler warnings but that's just more work for devs adding new algos.
--- a/algo/blake/decred-gate.c
+++ b/algo/blake/decred-gate.c
@@ -78,7 +78,6 @@ void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
   uint32_t extraheader[32] = { 0 };
   int headersize = 0;
   uint32_t* extradata = (uint32_t*) sctx->xnonce1;
-   size_t t;
   int i;

   // getwork over stratum, getwork merkle + header passed in coinb1
@@ -87,9 +86,6 @@ void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
                  sizeof(extraheader) );
   memcpy( extraheader, &sctx->job.coinbase[32], headersize );

-   // Increment extranonce2 
-   for ( t = 0; t < sctx->xnonce2_size && !( ++sctx->job.xnonce2[t] ); t++ );
-
   // Assemble block header 
   memset( g_work->data, 0, sizeof(g_work->data) );
   g_work->data[0] = le32dec( sctx->job.version );
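
The loop removed from `decred_build_extraheader` treats extranonce2 as a little-endian byte counter: it increments the first byte and carries into the next one only when a byte wraps to zero. The following is a minimal standalone sketch of that idiom, assuming the buffer is an array of unsigned bytes. The function name `xnonce2_increment` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

// Increment a little-endian multi-byte counter in place.
// The loop stops at the first byte that does not wrap to zero.
void xnonce2_increment( uint8_t *xnonce2, size_t size )
{
   for ( size_t t = 0; t < size && !( ++xnonce2[t] ); t++ );
}
```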
--- a/algo/echo/echo-hash-4way.c
+++ b/algo/echo/echo-hash-4way.c
@@ -1,5 +1,4 @@
-//#if 0
-#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#if defined(__VAES__)

 #include "simd-utils.h"
 #include "echo-hash-4way.h"
@@ -13,8 +12,12 @@ static const unsigned int mul2ipt[] __attribute__ ((aligned (64))) =
 */
 // do these need to be reversed?

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+
 #define mul2mask \
-   _mm512_set4_epi32( 0, 0, 0, 0x00001b00 ) 
+     m512_const2_64( 0, 0x00001b00 )
+//_mm512_set4_epi32( 0, 0, 0, 0x00001b00 ) 
 //   _mm512_set4_epi32( 0x00001b00, 0, 0, 0 )  

 #define lsbmask    m512_const1_32( 0x01010101 ) 
@@ -30,87 +33,87 @@ static const unsigned int mul2ipt[] __attribute__ ((aligned (64))) =
   const int j2 = ( (j)+2 ) & 3; \
   const int j3 = ( (j)+3 ) & 3; \
   s2 = _mm512_add_epi8( state1[ 0 ] [j ], state1[ 0 ][ j ] ); \
-	t1 = _mm512_srli_epi16( state1[ 0 ][ j ], 7 ); \
-	t1 = _mm512_and_si512( t1, lsbmask );\
-	t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
-	s2 = _mm512_xor_si512( s2, t2 ); \
-	state2[ 0 ] [j ] = s2; \
-	state2[ 1 ] [j ] = state1[ 0 ][ j ]; \
-	state2[ 2 ] [j ] = state1[ 0 ][ j ]; \
-	state2[ 3 ] [j ] = _mm512_xor_si512( s2, state1[ 0 ][ j ] );\
-	s2 = _mm512_add_epi8( state1[ 1 ][ j1 ], state1[ 1 ][ j1 ] ); \
-	t1 = _mm512_srli_epi16( state1[ 1 ][ j1 ], 7 ); \
-	t1 = _mm512_and_si512( t1, lsbmask ); \
-	t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
-	s2 = _mm512_xor_si512( s2, t2 );\
-	state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], \
-                            _mm512_xor_si512( s2, state1[ 1 ][ j1 ] ) ); \
-	state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], s2 ); \
-	state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], state1[ 1 ][ j1 ] ); \
-	state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3 ][ j ], state1[ 1 ][ j1 ] ); \
-	s2 = _mm512_add_epi8( state1[ 2 ][ j2 ], state1[ 2 ][ j2 ] ); \
-	t1 = _mm512_srli_epi16( state1[ 2 ][ j2 ], 7 ); \
-	t1 = _mm512_and_si512( t1, lsbmask ); \
-	t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
-	s2 = _mm512_xor_si512( s2, t2 ); \
-	state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], state1[ 2 ][ j2 ] ); \
-	state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], \
+   t1 = _mm512_srli_epi16( state1[ 0 ][ j ], 7 ); \
+   t1 = _mm512_and_si512( t1, lsbmask );\
+   t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
+   s2 = _mm512_xor_si512( s2, t2 ); \
+   state2[ 0 ] [j ] = s2; \
+   state2[ 1 ] [j ] = state1[ 0 ][ j ]; \
+   state2[ 2 ] [j ] = state1[ 0 ][ j ]; \
+   state2[ 3 ] [j ] = _mm512_xor_si512( s2, state1[ 0 ][ j ] );\
+   s2 = _mm512_add_epi8( state1[ 1 ][ j1 ], state1[ 1 ][ j1 ] ); \
+   t1 = _mm512_srli_epi16( state1[ 1 ][ j1 ], 7 ); \
+   t1 = _mm512_and_si512( t1, lsbmask ); \
+   t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
+   s2 = _mm512_xor_si512( s2, t2 );\
+   state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], \
+                              _mm512_xor_si512( s2, state1[ 1 ][ j1 ] ) ); \
+   state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], s2 ); \
+   state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], state1[ 1 ][ j1 ] ); \
+   state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3 ][ j ], state1[ 1 ][ j1 ] ); \
+   s2 = _mm512_add_epi8( state1[ 2 ][ j2 ], state1[ 2 ][ j2 ] ); \
+   t1 = _mm512_srli_epi16( state1[ 2 ][ j2 ], 7 ); \
+   t1 = _mm512_and_si512( t1, lsbmask ); \
+   t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
+   s2 = _mm512_xor_si512( s2, t2 ); \
+   state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], state1[ 2 ][ j2 ] ); \
+   state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], \
                            _mm512_xor_si512( s2, state1[ 2 ][ j2 ] ) ); \
-	state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], s2 ); \
-	state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3][ j ], state1[ 2 ][ j2 ] ); \
-	s2 = _mm512_add_epi8( state1[ 3 ][ j3 ], state1[ 3 ][ j3 ] ); \
-	t1 = _mm512_srli_epi16( state1[ 3 ][ j3 ], 7 ); \
-	t1 = _mm512_and_si512( t1, lsbmask ); \
-	t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
-	s2 = _mm512_xor_si512( s2, t2 ); \
-	state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], state1[ 3 ][ j3 ] ); \
-	state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], state1[ 3 ][ j3 ] ); \
-	state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], \
+   state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], s2 ); \
+   state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3][ j ], state1[ 2 ][ j2 ] ); \
+   s2 = _mm512_add_epi8( state1[ 3 ][ j3 ], state1[ 3 ][ j3 ] ); \
+   t1 = _mm512_srli_epi16( state1[ 3 ][ j3 ], 7 ); \
+   t1 = _mm512_and_si512( t1, lsbmask ); \
+   t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
+   s2 = _mm512_xor_si512( s2, t2 ); \
+   state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], state1[ 3 ][ j3 ] ); \
+   state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], state1[ 3 ][ j3 ] ); \
+   state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], \
                            _mm512_xor_si512( s2, state1[ 3 ][ j3] ) ); \
-	state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3 ][ j ], s2 ); \
+   state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3 ][ j ], s2 ); \
 } while(0)
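
The `ECHO_MIXBYTES` macro above is an AES-style MixColumns step applied to every byte position in parallel. Each `add_epi8` / `srli_epi16` / `shuffle_epi8` group doubles a value in GF(2^8), with `mul2mask` acting as a two-entry lookup table (0x00 or 0x1b). The following is a minimal scalar sketch for one byte column, assuming the four inputs have already been selected as `state1[0][j]`, `state1[1][j1]`, `state1[2][j2]` and `state1[3][j3]`. The names `gf_mul2` and `mix_column` are hypothetical.

```c
#include <stdint.h>

// Multiply by 2 in GF(2^8) with the AES reduction polynomial.
uint8_t gf_mul2( uint8_t x )
{
   uint8_t s2 = (uint8_t)( x + x );      // add_epi8( x, x ): shift left by 1
   uint8_t t1 = ( x >> 7 ) & 0x01;       // srli by 7, then lsbmask
   uint8_t t2 = t1 ? 0x1b : 0x00;        // shuffle_epi8( mul2mask, t1 )
   return (uint8_t)( s2 ^ t2 );
}

// out[r] = 2*in[r] ^ 3*in[r+1] ^ in[r+2] ^ in[r+3], indexes modulo 4.
void mix_column( uint8_t out[4], const uint8_t in[4] )
{
   for ( int r = 0; r < 4; r++ )
   {
      const uint8_t a = in[ r ];
      const uint8_t b = in[ ( r + 1 ) & 3 ];
      const uint8_t c = in[ ( r + 2 ) & 3 ];
      const uint8_t d = in[ ( r + 3 ) & 3 ];
      out[r] = (uint8_t)( gf_mul2( a ) ^ gf_mul2( b ) ^ b ^ c ^ d );
   }
}
```

The macro produces the same four outputs but accumulates them input by input, so each doubled value `s2` is computed once and reused for both the 2x and 3x terms.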

 #define ECHO_ROUND_UNROLL2 \
-	ECHO_SUBBYTES(_state, 0, 0);\
+   ECHO_SUBBYTES(_state, 0, 0);\
   ECHO_SUBBYTES(_state, 1, 0);\
-	ECHO_SUBBYTES(_state, 2, 0);\
-	ECHO_SUBBYTES(_state, 3, 0);\
-	ECHO_SUBBYTES(_state, 0, 1);\
-	ECHO_SUBBYTES(_state, 1, 1);\
-	ECHO_SUBBYTES(_state, 2, 1);\
-	ECHO_SUBBYTES(_state, 3, 1);\
-	ECHO_SUBBYTES(_state, 0, 2);\
-	ECHO_SUBBYTES(_state, 1, 2);\
-	ECHO_SUBBYTES(_state, 2, 2);\
-	ECHO_SUBBYTES(_state, 3, 2);\
-	ECHO_SUBBYTES(_state, 0, 3);\
-	ECHO_SUBBYTES(_state, 1, 3);\
-	ECHO_SUBBYTES(_state, 2, 3);\
-	ECHO_SUBBYTES(_state, 3, 3);\
-	ECHO_MIXBYTES(_state, _state2, 0, t1, t2, s2);\
-	ECHO_MIXBYTES(_state, _state2, 1, t1, t2, s2);\
-	ECHO_MIXBYTES(_state, _state2, 2, t1, t2, s2);\
-	ECHO_MIXBYTES(_state, _state2, 3, t1, t2, s2);\
-	ECHO_SUBBYTES(_state2, 0, 0);\
-	ECHO_SUBBYTES(_state2, 1, 0);\
-	ECHO_SUBBYTES(_state2, 2, 0);\
-	ECHO_SUBBYTES(_state2, 3, 0);\
-	ECHO_SUBBYTES(_state2, 0, 1);\
-	ECHO_SUBBYTES(_state2, 1, 1);\
-	ECHO_SUBBYTES(_state2, 2, 1);\
-	ECHO_SUBBYTES(_state2, 3, 1);\
-	ECHO_SUBBYTES(_state2, 0, 2);\
-	ECHO_SUBBYTES(_state2, 1, 2);\
-	ECHO_SUBBYTES(_state2, 2, 2);\
-	ECHO_SUBBYTES(_state2, 3, 2);\
-	ECHO_SUBBYTES(_state2, 0, 3);\
-	ECHO_SUBBYTES(_state2, 1, 3);\
-	ECHO_SUBBYTES(_state2, 2, 3);\
-	ECHO_SUBBYTES(_state2, 3, 3);\
-	ECHO_MIXBYTES(_state2, _state, 0, t1, t2, s2);\
-	ECHO_MIXBYTES(_state2, _state, 1, t1, t2, s2);\
-	ECHO_MIXBYTES(_state2, _state, 2, t1, t2, s2);\
-	ECHO_MIXBYTES(_state2, _state, 3, t1, t2, s2)
+   ECHO_SUBBYTES(_state, 2, 0);\
+   ECHO_SUBBYTES(_state, 3, 0);\
+   ECHO_SUBBYTES(_state, 0, 1);\
+   ECHO_SUBBYTES(_state, 1, 1);\
+   ECHO_SUBBYTES(_state, 2, 1);\
+   ECHO_SUBBYTES(_state, 3, 1);\
+   ECHO_SUBBYTES(_state, 0, 2);\
+   ECHO_SUBBYTES(_state, 1, 2);\
+   ECHO_SUBBYTES(_state, 2, 2);\
+   ECHO_SUBBYTES(_state, 3, 2);\
+   ECHO_SUBBYTES(_state, 0, 3);\
+   ECHO_SUBBYTES(_state, 1, 3);\
+   ECHO_SUBBYTES(_state, 2, 3);\
+   ECHO_SUBBYTES(_state, 3, 3);\
+   ECHO_MIXBYTES(_state, _state2, 0, t1, t2, s2);\
+   ECHO_MIXBYTES(_state, _state2, 1, t1, t2, s2);\
+   ECHO_MIXBYTES(_state, _state2, 2, t1, t2, s2);\
+   ECHO_MIXBYTES(_state, _state2, 3, t1, t2, s2);\
+   ECHO_SUBBYTES(_state2, 0, 0);\
+   ECHO_SUBBYTES(_state2, 1, 0);\
+   ECHO_SUBBYTES(_state2, 2, 0);\
+   ECHO_SUBBYTES(_state2, 3, 0);\
+   ECHO_SUBBYTES(_state2, 0, 1);\
+   ECHO_SUBBYTES(_state2, 1, 1);\
+   ECHO_SUBBYTES(_state2, 2, 1);\
+   ECHO_SUBBYTES(_state2, 3, 1);\
+   ECHO_SUBBYTES(_state2, 0, 2);\
+   ECHO_SUBBYTES(_state2, 1, 2);\
+   ECHO_SUBBYTES(_state2, 2, 2);\
+   ECHO_SUBBYTES(_state2, 3, 2);\
+   ECHO_SUBBYTES(_state2, 0, 3);\
+   ECHO_SUBBYTES(_state2, 1, 3);\
+   ECHO_SUBBYTES(_state2, 2, 3);\
+   ECHO_SUBBYTES(_state2, 3, 3);\
+   ECHO_MIXBYTES(_state2, _state, 0, t1, t2, s2);\
+   ECHO_MIXBYTES(_state2, _state, 1, t1, t2, s2);\
+   ECHO_MIXBYTES(_state2, _state, 2, t1, t2, s2);\
+   ECHO_MIXBYTES(_state2, _state, 3, t1, t2, s2)

 #define SAVESTATE(dst, src)\
 	dst[0][0] = src[0][0];\
@@ -224,43 +227,43 @@ void echo_4way_compress( echo_4way_context *ctx, const __m512i *pmsg,

 int echo_4way_init( echo_4way_context *ctx, int nHashSize )
 {
-	int i, j;
+   int i, j;

   ctx->k = m512_zero; 
-	ctx->processed_bits = 0;
-	ctx->uBufferBytes = 0;
+   ctx->processed_bits = 0;
+   ctx->uBufferBytes = 0;

-	switch( nHashSize )
-	{
-		case 256:
-			ctx->uHashSize = 256;
-			ctx->uBlockLength = 192;
-			ctx->uRounds = 8;
-			ctx->hashsize = _mm512_set4_epi32( 0, 0, 0, 0x100 );
-			ctx->const1536 = _mm512_set4_epi32( 0, 0, 0, 0x600 );
-			break;
+   switch( nHashSize )
+   {
+	case 256:
+		ctx->uHashSize = 256;
+		ctx->uBlockLength = 192;
+		ctx->uRounds = 8;
+		ctx->hashsize = m512_const2_64( 0, 0x100 );
+		ctx->const1536 = m512_const2_64( 0, 0x600 );
+		break;

-		case 512:
-			ctx->uHashSize = 512;
-			ctx->uBlockLength = 128;
-			ctx->uRounds = 10;
-			ctx->hashsize = _mm512_set4_epi32( 0, 0, 0, 0x200 );
-			ctx->const1536 = _mm512_set4_epi32( 0, 0, 0, 0x400);
-			break;
+	case 512:
+		ctx->uHashSize = 512;
+		ctx->uBlockLength = 128;
+		ctx->uRounds = 10;
+		ctx->hashsize = m512_const2_64( 0, 0x200 );
+		ctx->const1536 = m512_const2_64( 0, 0x400);
+		break;

-		default:
-			return 1;
-	}
+	default:
+        	return 1;
+   }

-	for( i = 0; i < 4; i++ )
-		for( j = 0; j < nHashSize / 256; j++ )
-			ctx->state[ i ][ j ] = ctx->hashsize;
+   for( i = 0; i < 4; i++ )
+	for( j = 0; j < nHashSize / 256; j++ )
+		ctx->state[ i ][ j ] = ctx->hashsize;

-	for( i = 0; i < 4; i++ )
-		for( j = nHashSize / 256; j < 4; j++ )
-			ctx->state[ i ][ j ] = m512_zero;
+   for( i = 0; i < 4; i++ )
+	for( j = nHashSize / 256; j < 4; j++ )
+		ctx->state[ i ][ j ] = m512_zero;

-	return 0;
+   return 0;
 }

 int echo_4way_update_close( echo_4way_context *state, void *hashval,
@@ -285,17 +288,13 @@ int echo_4way_update_close( echo_4way_context *state, void *hashval,
      vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
      memcpy_512( state->buffer, data, vlen );
      state->processed_bits += (unsigned int)( databitlen );
-      remainingbits = _mm512_set4_epi32( 0, 0, 0, databitlen );
-
+      remainingbits = m512_const2_64( 0, (uint64_t)databitlen );
   }

-   state->buffer[ vlen ] = _mm512_set4_epi32( 0, 0, 0, 0x80 );
+   state->buffer[ vlen ] = m512_const2_64( 0, 0x80 );
   memset_zero_512( state->buffer + vlen + 1, vblen - vlen - 2 );
-   state->buffer[ vblen-2 ] =
-                _mm512_set4_epi32( (uint32_t)state->uHashSize << 16, 0, 0, 0 );
-   state->buffer[ vblen-1 ] =
-                   _mm512_set4_epi64( 0, state->processed_bits,
-                                      0, state->processed_bits );  
+   state->buffer[ vblen-2 ] = m512_const2_64( (uint64_t)state->uHashSize << 48, 0 );
+   state->buffer[ vblen-1 ] = m512_const2_64( 0, state->processed_bits);
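
The constant changes in this hunk replace four 32-bit values per 128-bit lane with two 64-bit values, which is why the hash size shift grows from 16 to 48. The following is a minimal scalar sketch of one 128-bit lane showing that both forms describe the same bits. It assumes `m512_const2_64( hi, lo )` takes the high 64 bits first, as the paired old and new lines in the diff suggest. The type `lane128_t` and both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

// One 128-bit lane modelled as two 64-bit halves.
typedef struct { uint64_t hi, lo; } lane128_t;

// Model of one lane of _mm512_set4_epi32( d, c, b, a ): a is the lowest word.
lane128_t lane_from_4x32( uint32_t d, uint32_t c, uint32_t b, uint32_t a )
{
   lane128_t r = { ( (uint64_t)d << 32 ) | c, ( (uint64_t)b << 32 ) | a };
   return r;
}

// Model of one lane of m512_const2_64( hi, lo ), argument order assumed.
lane128_t lane_from_2x64( uint64_t hi, uint64_t lo )
{
   lane128_t r = { hi, lo };
   return r;
}

bool lane_equal( lane128_t x, lane128_t y )
{
   return x.hi == y.hi && x.lo == y.lo;
}
```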

   state->k = _mm512_add_epi64( state->k, remainingbits );
   state->k = _mm512_sub_epi64( state->k, state->const1536 );
@@ -328,16 +327,16 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
         ctx->uHashSize = 256;
         ctx->uBlockLength = 192;
         ctx->uRounds = 8;
-         ctx->hashsize = _mm512_set4_epi32( 0, 0, 0, 0x100 );
-         ctx->const1536 = _mm512_set4_epi32( 0, 0, 0, 0x600 );
+         ctx->hashsize = m512_const2_64( 0, 0x100 );
+         ctx->const1536 = m512_const2_64( 0, 0x600 );
         break;

      case 512:
         ctx->uHashSize = 512;
         ctx->uBlockLength = 128;
         ctx->uRounds = 10;
-         ctx->hashsize = _mm512_set4_epi32( 0, 0, 0, 0x200 );
-         ctx->const1536 = _mm512_set4_epi32( 0, 0, 0, 0x400);
+         ctx->hashsize = m512_const2_64( 0, 0x200 );
+         ctx->const1536 = m512_const2_64( 0, 0x400 );
         break;

      default:
@@ -372,17 +371,14 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
      vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
      memcpy_512( ctx->buffer, data, vlen );
      ctx->processed_bits += (unsigned int)( databitlen );
-      remainingbits = _mm512_set4_epi32( 0, 0, 0, databitlen );
-
+      remainingbits = m512_const2_64( 0, databitlen );
   }

-   ctx->buffer[ vlen ] = _mm512_set4_epi32( 0, 0, 0, 0x80 );
+   ctx->buffer[ vlen ] = m512_const2_64( 0, 0x80 );
   memset_zero_512( ctx->buffer + vlen + 1, vblen - vlen - 2 );
   ctx->buffer[ vblen-2 ] =
-                _mm512_set4_epi32( (uint32_t)ctx->uHashSize << 16, 0, 0, 0 );
-   ctx->buffer[ vblen-1 ] =
-                   _mm512_set4_epi64( 0, ctx->processed_bits,
-                                      0, ctx->processed_bits );
+                     m512_const2_64( (uint64_t)ctx->uHashSize << 48, 0 );
+   ctx->buffer[ vblen-1 ] = m512_const2_64( 0, ctx->processed_bits);

   ctx->k = _mm512_add_epi64( ctx->k, remainingbits );
   ctx->k = _mm512_sub_epi64( ctx->k, ctx->const1536 );
@@ -400,5 +396,380 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
   return 0;
 }

+#endif  // AVX512

-#endif
+// AVX2 + VAES
+
+#define mul2mask_2way   m256_const2_64( 0, 0x0000000000001b00 ) 
+
+#define lsbmask_2way    m256_const1_32( 0x01010101 ) 
+
+#define ECHO_SUBBYTES_2WAY( state, i, j ) \
+        state[i][j] = _mm256_aesenc_epi128( state[i][j], k1 ); \
+        state[i][j] = _mm256_aesenc_epi128( state[i][j], m256_zero ); \
+        k1 = _mm256_add_epi32( k1, m256_one_128 );
+
+#define ECHO_MIXBYTES_2WAY( state1, state2, j, t1, t2, s2 ) do \
+{ \
+   const int j1 = ( (j)+1 ) & 3; \
+   const int j2 = ( (j)+2 ) & 3; \
+   const int j3 = ( (j)+3 ) & 3; \
+   s2 = _mm256_add_epi8( state1[ 0 ] [j ], state1[ 0 ][ j ] ); \
+   t1 = _mm256_srli_epi16( state1[ 0 ][ j ], 7 ); \
+   t1 = _mm256_and_si256( t1, lsbmask_2way );\
+   t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
+   s2 = _mm256_xor_si256( s2, t2 ); \
+   state2[ 0 ] [j ] = s2; \
+   state2[ 1 ] [j ] = state1[ 0 ][ j ]; \
+   state2[ 2 ] [j ] = state1[ 0 ][ j ]; \
+   state2[ 3 ] [j ] = _mm256_xor_si256( s2, state1[ 0 ][ j ] );\
+   s2 = _mm256_add_epi8( state1[ 1 ][ j1 ], state1[ 1 ][ j1 ] ); \
+   t1 = _mm256_srli_epi16( state1[ 1 ][ j1 ], 7 ); \
+   t1 = _mm256_and_si256( t1, lsbmask_2way ); \
+   t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
+   s2 = _mm256_xor_si256( s2, t2 );\
+   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], \
+                              _mm256_xor_si256( s2, state1[ 1 ][ j1 ] ) ); \
+   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], s2 ); \
+   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], state1[ 1 ][ j1 ] ); \
+   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3 ][ j ], state1[ 1 ][ j1 ] ); \
+   s2 = _mm256_add_epi8( state1[ 2 ][ j2 ], state1[ 2 ][ j2 ] ); \
+   t1 = _mm256_srli_epi16( state1[ 2 ][ j2 ], 7 ); \
+   t1 = _mm256_and_si256( t1, lsbmask_2way ); \
+   t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
+   s2 = _mm256_xor_si256( s2, t2 ); \
+   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], state1[ 2 ][ j2 ] ); \
+   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], \
+                            _mm256_xor_si256( s2, state1[ 2 ][ j2 ] ) ); \
+   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], s2 ); \
+   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3][ j ], state1[ 2 ][ j2 ] ); \
+   s2 = _mm256_add_epi8( state1[ 3 ][ j3 ], state1[ 3 ][ j3 ] ); \
+   t1 = _mm256_srli_epi16( state1[ 3 ][ j3 ], 7 ); \
+   t1 = _mm256_and_si256( t1, lsbmask_2way ); \
+   t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
+   s2 = _mm256_xor_si256( s2, t2 ); \
+   state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], state1[ 3 ][ j3 ] ); \
+   state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], state1[ 3 ][ j3 ] ); \
+   state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], \
+                            _mm256_xor_si256( s2, state1[ 3 ][ j3] ) ); \
+   state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3 ][ j ], s2 ); \
+} while(0)
+
+#define ECHO_ROUND_UNROLL2_2WAY \
+   ECHO_SUBBYTES_2WAY(_state, 0, 0);\
+   ECHO_SUBBYTES_2WAY(_state, 1, 0);\
+   ECHO_SUBBYTES_2WAY(_state, 2, 0);\
+   ECHO_SUBBYTES_2WAY(_state, 3, 0);\
+   ECHO_SUBBYTES_2WAY(_state, 0, 1);\
+   ECHO_SUBBYTES_2WAY(_state, 1, 1);\
+   ECHO_SUBBYTES_2WAY(_state, 2, 1);\
+   ECHO_SUBBYTES_2WAY(_state, 3, 1);\
+   ECHO_SUBBYTES_2WAY(_state, 0, 2);\
+   ECHO_SUBBYTES_2WAY(_state, 1, 2);\
+   ECHO_SUBBYTES_2WAY(_state, 2, 2);\
+   ECHO_SUBBYTES_2WAY(_state, 3, 2);\
+   ECHO_SUBBYTES_2WAY(_state, 0, 3);\
+   ECHO_SUBBYTES_2WAY(_state, 1, 3);\
+   ECHO_SUBBYTES_2WAY(_state, 2, 3);\
+   ECHO_SUBBYTES_2WAY(_state, 3, 3);\
+   ECHO_MIXBYTES_2WAY(_state, _state2, 0, t1, t2, s2);\
+   ECHO_MIXBYTES_2WAY(_state, _state2, 1, t1, t2, s2);\
+   ECHO_MIXBYTES_2WAY(_state, _state2, 2, t1, t2, s2);\
+   ECHO_MIXBYTES_2WAY(_state, _state2, 3, t1, t2, s2);\
+   ECHO_SUBBYTES_2WAY(_state2, 0, 0);\
+   ECHO_SUBBYTES_2WAY(_state2, 1, 0);\
+   ECHO_SUBBYTES_2WAY(_state2, 2, 0);\
+   ECHO_SUBBYTES_2WAY(_state2, 3, 0);\
+   ECHO_SUBBYTES_2WAY(_state2, 0, 1);\
+   ECHO_SUBBYTES_2WAY(_state2, 1, 1);\
+   ECHO_SUBBYTES_2WAY(_state2, 2, 1);\
+   ECHO_SUBBYTES_2WAY(_state2, 3, 1);\
+   ECHO_SUBBYTES_2WAY(_state2, 0, 2);\
+   ECHO_SUBBYTES_2WAY(_state2, 1, 2);\
+   ECHO_SUBBYTES_2WAY(_state2, 2, 2);\
+   ECHO_SUBBYTES_2WAY(_state2, 3, 2);\
+   ECHO_SUBBYTES_2WAY(_state2, 0, 3);\
+   ECHO_SUBBYTES_2WAY(_state2, 1, 3);\
+   ECHO_SUBBYTES_2WAY(_state2, 2, 3);\
+   ECHO_SUBBYTES_2WAY(_state2, 3, 3);\
+   ECHO_MIXBYTES_2WAY(_state2, _state, 0, t1, t2, s2);\
+   ECHO_MIXBYTES_2WAY(_state2, _state, 1, t1, t2, s2);\
+   ECHO_MIXBYTES_2WAY(_state2, _state, 2, t1, t2, s2);\
+   ECHO_MIXBYTES_2WAY(_state2, _state, 3, t1, t2, s2)
+
+#define SAVESTATE_2WAY(dst, src)\
+        dst[0][0] = src[0][0];\
+        dst[0][1] = src[0][1];\
+        dst[0][2] = src[0][2];\
+        dst[0][3] = src[0][3];\
+        dst[1][0] = src[1][0];\
+        dst[1][1] = src[1][1];\
+        dst[1][2] = src[1][2];\
+        dst[1][3] = src[1][3];\
+        dst[2][0] = src[2][0];\
+        dst[2][1] = src[2][1];\
+        dst[2][2] = src[2][2];\
+        dst[2][3] = src[2][3];\
+        dst[3][0] = src[3][0];\
+        dst[3][1] = src[3][1];\
+        dst[3][2] = src[3][2];\
+        dst[3][3] = src[3][3]
+
+// blockcount always 1
+void echo_2way_compress( echo_2way_context *ctx, const __m256i *pmsg,
+               unsigned int uBlockCount )
+{
+  unsigned int r, b, i, j;
+  __m256i t1, t2, s2, k1;
+  __m256i _state[4][4], _state2[4][4], _statebackup[4][4];
+
+  _state[ 0 ][ 0 ] = ctx->state[ 0 ][ 0 ];
+  _state[ 0 ][ 1 ] = ctx->state[ 0 ][ 1 ];
+  _state[ 0 ][ 2 ] = ctx->state[ 0 ][ 2 ];
+  _state[ 0 ][ 3 ] = ctx->state[ 0 ][ 3 ];
+  _state[ 1 ][ 0 ] = ctx->state[ 1 ][ 0 ];
+  _state[ 1 ][ 1 ] = ctx->state[ 1 ][ 1 ];
+  _state[ 1 ][ 2 ] = ctx->state[ 1 ][ 2 ];
+  _state[ 1 ][ 3 ] = ctx->state[ 1 ][ 3 ];
+  _state[ 2 ][ 0 ] = ctx->state[ 2 ][ 0 ];
+  _state[ 2 ][ 1 ] = ctx->state[ 2 ][ 1 ];
+  _state[ 2 ][ 2 ] = ctx->state[ 2 ][ 2 ];
+  _state[ 2 ][ 3 ] = ctx->state[ 2 ][ 3 ];
+  _state[ 3 ][ 0 ] = ctx->state[ 3 ][ 0 ];
+  _state[ 3 ][ 1 ] = ctx->state[ 3 ][ 1 ];
+  _state[ 3 ][ 2 ] = ctx->state[ 3 ][ 2 ];
+  _state[ 3 ][ 3 ] = ctx->state[ 3 ][ 3 ];
+
+  for ( b = 0; b < uBlockCount; b++ )
+  {
+    ctx->k = _mm256_add_epi64( ctx->k, ctx->const1536 );
+
+    for( j = ctx->uHashSize / 256; j < 4; j++ )
+    {
+      for ( i = 0; i < 4; i++ )
+      {
+        _state[ i ][ j ] = _mm256_load_si256(
+                     pmsg + 4 * (j - (ctx->uHashSize / 256)) + i );
+      }
+   }
+
+   // save state
+   SAVESTATE_2WAY( _statebackup, _state );
+
+   k1 = ctx->k;
+
+   for ( r = 0; r < ctx->uRounds / 2; r++ )
+   {
+       ECHO_ROUND_UNROLL2_2WAY;
+   }
+
+   if ( ctx->uHashSize == 256 )
+   {
+      for ( i = 0; i < 4; i++ )
+      {
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _state[ i ][ 1 ] );
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _state[ i ][ 2 ] );
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _state[ i ][ 3 ] );
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _statebackup[ i ][ 0 ] );
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _statebackup[ i ][ 1 ] );
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _statebackup[ i ][ 2 ] ) ;
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _statebackup[ i ][ 3 ] );
+       }
+    }
+    else
+    {
+       for ( i = 0; i < 4; i++ )
+       {
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _state[ i ][ 2 ] );
+         _state[ i ][ 1 ] = _mm256_xor_si256( _state[ i ][ 1 ],
+                                              _state[ i ][ 3 ] );
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ][ 0 ],
+                                              _statebackup[ i ][ 0 ] );
+         _state[ i ][ 0 ] = _mm256_xor_si256( _state[ i ] [0 ],
+                                              _statebackup[ i ][ 2 ] );
+         _state[ i ][ 1 ] = _mm256_xor_si256( _state[ i ][ 1 ],
+                                              _statebackup[ i ][ 1 ] );
+         _state[ i ][ 1 ] = _mm256_xor_si256( _state[ i ][ 1 ],
+                                              _statebackup[ i ][ 3 ] );
+      }
+    }
+    pmsg += ctx->uBlockLength;
+  }
+  SAVESTATE_2WAY(ctx->state, _state);
+
+}
+int echo_2way_init( echo_2way_context *ctx, int nHashSize )
+{
+   int i, j;
+
+   ctx->k = m256_zero;
+   ctx->processed_bits = 0;
+   ctx->uBufferBytes = 0;
+
+   switch( nHashSize )
+   {
+      case 256:
+         ctx->uHashSize = 256;
+         ctx->uBlockLength = 192;
+         ctx->uRounds = 8;
+         ctx->hashsize = m256_const2_64( 0, 0x100 );
+         ctx->const1536 = m256_const2_64( 0, 0x600 );
+         break;
+
+      case 512:
+         ctx->uHashSize = 512;
+         ctx->uBlockLength = 128;
+         ctx->uRounds = 10;
+         ctx->hashsize = m256_const2_64( 0, 0x200 );
+         ctx->const1536 = m256_const2_64( 0, 0x400 );
+         break;
+
+      default:
+         return 1;
+   }
+
+   for( i = 0; i < 4; i++ )
+      for( j = 0; j < nHashSize / 256; j++ )
+         ctx->state[ i ][ j ] = ctx->hashsize;
+
+   for( i = 0; i < 4; i++ )
+      for( j = nHashSize / 256; j < 4; j++ )
+         ctx->state[ i ][ j ] = m256_zero;
+
+   return 0;
+}
+
+int echo_2way_update_close( echo_2way_context *state, void *hashval,
+                              const void *data, int databitlen )
+{
+// bytelen is 32 (maybe), 64, 80 or 128.
+// All but 128 are buffered as a partial block; 128 is compressed directly below.
+
+   int vlen = databitlen / 128;  // * 2 lanes / 256 bits per vector
+   const int vblen = state->uBlockLength / 16; //  16 bytes per lane
+   __m256i remainingbits;
+
+   if ( databitlen == 1024 )
+   {
+      echo_2way_compress( state, data, 1 );
+      state->processed_bits = 1024;
+      remainingbits = m256_const2_64( 0, -1024 );
+      vlen = 0;
+   }
+   else
+   {
+      memcpy_256( state->buffer, data, vlen );
+      state->processed_bits += (unsigned int)( databitlen );
+      remainingbits = m256_const2_64( 0, databitlen );
+   }
+
+   state->buffer[ vlen ] = m256_const2_64( 0, 0x80 );
+   memset_zero_256( state->buffer + vlen + 1, vblen - vlen - 2 );
+   state->buffer[ vblen-2 ] = m256_const2_64( (uint64_t)state->uHashSize << 48, 0 );
+   state->buffer[ vblen-1 ] = m256_const2_64( 0, state->processed_bits );
+
+   state->k = _mm256_add_epi64( state->k, remainingbits );
+   state->k = _mm256_sub_epi64( state->k, state->const1536 );
+
+   echo_2way_compress( state, state->buffer, 1 );
+
+   _mm256_store_si256( (__m256i*)hashval + 0, state->state[ 0 ][ 0] );
+   _mm256_store_si256( (__m256i*)hashval + 1, state->state[ 1 ][ 0] );
+
+   if ( state->uHashSize == 512 )
+   {
+      _mm256_store_si256( (__m256i*)hashval + 2, state->state[ 2 ][ 0 ] );
+      _mm256_store_si256( (__m256i*)hashval + 3, state->state[ 3 ][ 0 ] );
+   }
+   return 0;
+}
+
+int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
+                    const void *data, int datalen )
+{
+   int i, j;
+   int databitlen = datalen * 8;
+   ctx->k = m256_zero;
+   ctx->processed_bits = 0;
+   ctx->uBufferBytes = 0;
+
+   switch( nHashSize )
+   {
+      case 256:
+         ctx->uHashSize = 256;
+         ctx->uBlockLength = 192;
+         ctx->uRounds = 8;
+         ctx->hashsize = m256_const2_64( 0, 0x100 );
+         ctx->const1536 = m256_const2_64( 0, 0x600 );
+         break;
+
+      case 512:
+         ctx->uHashSize = 512;
+         ctx->uBlockLength = 128;
+         ctx->uRounds = 10;
+         ctx->hashsize = m256_const2_64( 0, 0x200 );
+         ctx->const1536 = m256_const2_64( 0, 0x400 );
+         break;
+
+      default:
+         return 1;
+   }
+
+   for( i = 0; i < 4; i++ )
+      for( j = 0; j < nHashSize / 256; j++ )
+         ctx->state[ i ][ j ] = ctx->hashsize;
+
+   for( i = 0; i < 4; i++ )
+      for( j = nHashSize / 256; j < 4; j++ )
+         ctx->state[ i ][ j ] = m256_zero;
+
+   int vlen = datalen / 32;
+   const int vblen = ctx->uBlockLength / 16; //  16 bytes per lane
+   __m256i remainingbits;
+
+   if ( databitlen == 1024 )
+   {
+      echo_2way_compress( ctx, data, 1 );
+      ctx->processed_bits = 1024;
+      remainingbits = m256_const2_64( 0, -1024 );
+      vlen = 0;
+   }
+   else
+   {
+      vlen = databitlen / 128;  // * 2 lanes / 256 bits per vector
+      memcpy_256( ctx->buffer, data, vlen );
+      ctx->processed_bits += (unsigned int)( databitlen );
+      remainingbits = m256_const2_64( 0, databitlen );
+   }
+
+   ctx->buffer[ vlen ] = m256_const2_64( 0, 0x80 );
+   memset_zero_256( ctx->buffer + vlen + 1, vblen - vlen - 2 );
+   ctx->buffer[ vblen-2 ] = m256_const2_64( (uint64_t)ctx->uHashSize << 48, 0 );
+   ctx->buffer[ vblen-1 ] = m256_const2_64( 0, ctx->processed_bits );
+
+   ctx->k = _mm256_add_epi64( ctx->k, remainingbits );
+   ctx->k = _mm256_sub_epi64( ctx->k, ctx->const1536 );
+
+   echo_2way_compress( ctx, ctx->buffer, 1 );
+
+   _mm256_store_si256( (__m256i*)hashval + 0, ctx->state[ 0 ][ 0] );
+   _mm256_store_si256( (__m256i*)hashval + 1, ctx->state[ 1 ][ 0] );
+
+   if ( ctx->uHashSize == 512 )
+   {
+      _mm256_store_si256( (__m256i*)hashval + 2, ctx->state[ 2 ][ 0 ] );
+      _mm256_store_si256( (__m256i*)hashval + 3, ctx->state[ 3 ][ 0 ] );
+   }
+   return 0;
+}
+
+
+#endif   // VAES
--- a/algo/echo/echo-hash-4way.h
+++ b/algo/echo/echo-hash-4way.h
@@ -1,10 +1,12 @@
 #if !defined(ECHO_HASH_4WAY_H__)
 #define ECHO_HASH_4WAY_H__ 1

-#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#if defined(__VAES__)

 #include "simd-utils.h"

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
 typedef struct
 {
   __m512i    state[4][4];
@@ -20,6 +22,7 @@ typedef struct
   unsigned int   processed_bits;

 } echo_4way_context __attribute__ ((aligned (64)));
+#define echo512_4way_context echo_4way_context

 int echo_4way_init( echo_4way_context *state, int hashbitlen );
 #define echo512_4way_init( state ) echo_4way_init( state, 512 )
@@ -29,8 +32,8 @@ int echo_4way_update( echo_4way_context *state, const void *data,
    unsigned int databitlen);
 #define echo512_4way_update echo_4way_update

-int echo_close( echo_4way_context *state, void *hashval );
-#define echo512_4way_close echo_4way_close
+// int echo_4way_close( echo_4way_context *state, void *hashval );
+// #define echo512_4way_close echo_4way_close

 int echo_4way_update_close( echo_4way_context *state, void *hashval,
                              const void *data, int databitlen );
@@ -43,5 +46,45 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
 #define echo256_4way_full( state, hashval, data, datalen ) \
           echo_4way_full( state, hashval, 256, data, datalen )

-#endif 
-#endif
+#endif   // AVX512
+
+typedef struct
+{
+   __m256i    state[4][4];
+   __m256i    buffer[ 4 * 192 / 16 ];  // 2x128 interleaved 192 bytes
+   __m256i    k;
+   __m256i    hashsize;
+   __m256i    const1536;
+
+   unsigned int   uRounds;
+   unsigned int   uHashSize;
+   unsigned int   uBlockLength;
+   unsigned int   uBufferBytes;
+   unsigned int   processed_bits;
+
+} echo_2way_context __attribute__ ((aligned (64)));
+#define echo512_2way_context echo_2way_context
+
+int echo_2way_init( echo_2way_context *state, int hashbitlen );
+#define echo512_2way_init( state ) echo_2way_init( state, 512 )
+#define echo256_2way_init( state ) echo_2way_init( state, 256 )
+
+int echo_2way_update( echo_2way_context *state, const void *data,
+    unsigned int databitlen);
+#define echo512_2way_update echo_2way_update
+
+int echo_2way_update_close( echo_2way_context *state, void *hashval,
+                              const void *data, int databitlen );
+#define echo512_2way_update_close echo_2way_update_close
+
+int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
+                    const void *data, int datalen );
+#define echo512_2way_full( state, hashval, data, datalen ) \
+           echo_2way_full( state, hashval, 512, data, datalen )
+#define echo256_2way_full( state, hashval, data, datalen ) \
+           echo_2way_full( state, hashval, 256, data, datalen )
+
+
+#endif  // VAES
+
+#endif   // ECHO_HASH_4WAY_H__
--- a/algo/fugue/fugue-aesni.c
+++ b/algo/fugue/fugue-aesni.c
@@ -0,0 +1,565 @@
+/*
+ * file        : fugue_vperm.c
+ * version     : 1.0.208
+ * date        : 14.12.2010
+ * 
+ * - vperm and aes_ni implementations of hash function Fugue
+ * - implements NIST hash api
+ * - assumes that message length is multiple of 8-bits
+ * - _FUGUE_VPERM_ must be defined if compiling with ../main.c
+ * - default version is vperm, define AES_NI for aes_ni version
+ * 
+ * Cagdas Calik
+ * ccalik@metu.edu.tr
+ * Institute of Applied Mathematics, Middle East Technical University, Turkey.
+ *
+ */
+
+#if defined(__AES__)
+
+#include <x86intrin.h>
+
+#include <memory.h>
+#include "fugue-aesni.h"
+
+
+MYALIGN const unsigned long long _supermix1a[]	= {0x0202010807020100, 0x0a05000f06010c0b};
+MYALIGN const unsigned long long _supermix1b[]	= {0x0b0d080703060504, 0x0e0a090c050e0f0a};
+MYALIGN const unsigned long long _supermix1c[]	= {0x0402060c070d0003, 0x090a060580808080};
+MYALIGN const unsigned long long _supermix1d[]	= {0x808080800f0e0d0c, 0x0f0e0d0c80808080};
+MYALIGN const unsigned long long _supermix2a[]	= {0x07020d0880808080, 0x0b06010c050e0f0a};
+MYALIGN const unsigned long long _supermix4a[]	= {0x000f0a050c0b0601, 0x0302020404030e09};
+MYALIGN const unsigned long long _supermix4b[]	= {0x07020d08080e0d0d, 0x07070908050e0f0a};
+MYALIGN const unsigned long long _supermix4c[]	= {0x0706050403020000, 0x0302000007060504};
+MYALIGN const unsigned long long _supermix7a[]	= {0x010c0b060d080702, 0x0904030e03000104};
+MYALIGN const unsigned long long _supermix7b[]	= {0x8080808080808080, 0x0504070605040f06};
+MYALIGN const unsigned long long _k_n[] = {0x4E4E4E4E4E4E4E4E, 0x1B1B1B1B0E0E0E0E};
+MYALIGN const unsigned char _shift_one_mask[]   = {7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2};
+MYALIGN const unsigned char _shift_four_mask[]  = {13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8};
+MYALIGN const unsigned char _shift_seven_mask[] = {10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5};
+MYALIGN const unsigned char _aes_shift_rows[]   = {0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11};
+MYALIGN const unsigned int _inv_shift_rows[] = {0x070a0d00, 0x0b0e0104, 0x0f020508, 0x0306090c};
+MYALIGN const unsigned int _mul2mask[] = {0x1b1b0000, 0x00000000, 0x00000000, 0x00000000};
+MYALIGN const unsigned int _mul4mask[] = {0x2d361b00, 0x00000000, 0x00000000, 0x00000000};
+MYALIGN const unsigned int _lsbmask2[] = {0x03030303, 0x03030303, 0x03030303, 0x03030303};
+
+
+MYALIGN const unsigned int _IV512[] = {		
+	0x00000000, 0x00000000,	0x7ea50788, 0x00000000,
+	0x75af16e6, 0xdbe4d3c5, 0x27b09aac, 0x00000000,
+	0x17f115d9, 0x54cceeb6, 0x0b02e806, 0x00000000,
+	0xd1ef924a, 0xc9e2c6aa, 0x9813b2dd, 0x00000000,
+	0x3858e6ca, 0x3f207f43, 0xe778ea25, 0x00000000,
+	0xd6dd1f95, 0x1dd16eda, 0x67353ee1, 0x00000000};
+
+#if defined(__SSE4_1__)
+
+#define PACK_S0(s0, s1, t1)\
+   s0 = _mm_castps_si128(_mm_insert_ps(_mm_castsi128_ps(s0), _mm_castsi128_ps(s1), 0x30))
+
+#define UNPACK_S0(s0, s1, t1)\
+   s1 = _mm_castps_si128(_mm_insert_ps(_mm_castsi128_ps(s1), _mm_castsi128_ps(s0), 0xc0));\
+   s0 = mm128_mask_32( s0, 8 )
+
+#define CMIX(s1, s2, r1, r2, t1, t2)\
+   t1 = s1;\
+   t1 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(t1), _mm_castsi128_ps(s2), _MM_SHUFFLE(3, 0, 2, 1)));\
+   r1 = _mm_xor_si128(r1, t1);\
+   r2 = _mm_xor_si128(r2, t1);
+
+#else   // SSE2
+
+#define PACK_S0(s0, s1, t1)\
+   t1 = _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 3, 3));\
+   s0 = _mm_xor_si128(s0, t1);
+
+#define UNPACK_S0(s0, s1, t1)\
+   t1 = _mm_shuffle_epi32(s0, _MM_SHUFFLE(3, 3, 3, 3));\
+   s1 = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(s1), _mm_castsi128_ps(t1)));\
+   s0 = mm128_mask_32( s0, 8 )
+
+#define CMIX(s1, s2, r1, r2, t1, t2)\
+   t1 = _mm_shuffle_epi32(s1, 0xf9);\
+   t2 = _mm_shuffle_epi32(s2, 0xcf);\
+   t1 = _mm_xor_si128(t1, t2);\
+   r1 = _mm_xor_si128(r1, t1);\
+   r2 = _mm_xor_si128(r2, t1)
+
+#endif
+
+#define TIX256(msg, s10, s8, s24, s0, t1, t2, t3)\
+	t1 = _mm_shuffle_epi32(s0, _MM_SHUFFLE(3, 3, 0, 3));\
+	s10 = _mm_xor_si128(s10, t1);\
+	t1 = _mm_castps_si128(_mm_load_ss((float*)msg));\
+	s0 = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(s0), _mm_castsi128_ps(t1)));\
+	t1 = _mm_slli_si128(t1, 8);\
+	s8 = _mm_xor_si128(s8, t1);\
+	t1 = _mm_shuffle_epi32(s24, _MM_SHUFFLE(3, 3, 0, 3));\
+	s0 = _mm_xor_si128(s0, t1)
+
+
+#define TIX384(msg, s16, s8, s27, s30, s0, s4, t1, t2, t3)\
+	t1 = _mm_shuffle_epi32(s0, _MM_SHUFFLE(3, 3, 0, 3));\
+	s16 = _mm_xor_si128(s16, t1);\
+	t1 = _mm_castps_si128(_mm_load_ss((float*)msg));\
+	s0 = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(s0), _mm_castsi128_ps(t1)));\
+	t1 = _mm_slli_si128(t1, 8);\
+	s8 = _mm_xor_si128(s8, t1);\
+	t1 = _mm_shuffle_epi32(s27, _MM_SHUFFLE(3, 3, 0, 3));\
+	s0 = _mm_xor_si128(s0, t1);\
+	t1 = _mm_shuffle_epi32(s30, _MM_SHUFFLE(3, 3, 0, 3));\
+	s4 = _mm_xor_si128(s4, t1)
+
+#define TIX512(msg, s22, s8, s24, s27, s30, s0, s4, s7, t1, t2, t3)\
+	t1 = _mm_shuffle_epi32(s0, _MM_SHUFFLE(3, 3, 0, 3));\
+	s22 = _mm_xor_si128(s22, t1);\
+	t1 = _mm_castps_si128(_mm_load_ss((float*)msg));\
+	s0 = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(s0), _mm_castsi128_ps(t1)));\
+	t1 = _mm_slli_si128(t1, 8);\
+	s8 = _mm_xor_si128(s8, t1);\
+	t1 = _mm_shuffle_epi32(s24, _MM_SHUFFLE(3, 3, 0, 3));\
+	s0 = _mm_xor_si128(s0, t1);\
+	t1 = _mm_shuffle_epi32(s27, _MM_SHUFFLE(3, 3, 0, 3));\
+	s4 = _mm_xor_si128(s4, t1);\
+	t1 = _mm_shuffle_epi32(s30, _MM_SHUFFLE(3, 3, 0, 3));\
+	s7 = _mm_xor_si128(s7, t1)
+
+
+#define PRESUPERMIX(x, t1, s1, s2, t2)\
+	s1 = x;\
+	s2 = _mm_add_epi8(x, x);\
+	t2 = _mm_add_epi8(s2, s2);\
+	t1 = _mm_srli_epi16(x, 6);\
+	t1 = _mm_and_si128(t1, M128(_lsbmask2));\
+	s2 = _mm_xor_si128(s2, _mm_shuffle_epi8(M128(_mul2mask), t1));\
+	x  = _mm_xor_si128(t2, _mm_shuffle_epi8(M128(_mul4mask), t1))
+
+#define SUBSTITUTE(r0, _t1, _t2, _t3, _t0)\
+	_t2 = _mm_shuffle_epi8(r0, M128(_inv_shift_rows));\
+	_t2 = _mm_aesenclast_si128( _t2, m128_zero )
+	
+#define SUPERMIX(t0, t1, t2, t3, t4)\
+	PRESUPERMIX(t0, t1, t2, t3, t4);\
+	POSTSUPERMIX(t0, t1, t2, t3, t4)
+
+
+#define POSTSUPERMIX(t0, t1, t2, t3, t4)\
+	t1 = t2;\
+	t1 = _mm_shuffle_epi8(t1, M128(_supermix1b));\
+	t4 = t1;\
+	t1 = _mm_shuffle_epi8(t1, M128(_supermix1c));\
+	t4 = _mm_xor_si128(t4, t1);\
+	t1 = t4;\
+	t1 = _mm_shuffle_epi8(t1, M128(_supermix1d));\
+	t4 = _mm_xor_si128(t4, t1);\
+	t1 = t2;\
+	t1 = _mm_shuffle_epi8(t1, M128(_supermix1a));\
+	t4 = _mm_xor_si128(t4, t1);\
+	t2 = _mm_xor_si128(t2, t3);\
+	t2 = _mm_xor_si128(t2, t0);\
+	t2 = _mm_shuffle_epi8(t2, M128(_supermix7a));\
+	t4 = _mm_xor_si128(t4, t2);\
+	t2 = _mm_shuffle_epi8(t2, M128(_supermix7b));\
+	t4 = _mm_xor_si128(t4, t2);\
+	t3 = _mm_shuffle_epi8(t3, M128(_supermix2a));\
+	t1 = t0;\
+	t1 = _mm_shuffle_epi8(t1, M128(_supermix4a));\
+	t4 = _mm_xor_si128(t4, t1);\
+	t0 = _mm_shuffle_epi8(t0, M128(_supermix4b));\
+	t0 = _mm_xor_si128(t0, t3);\
+	t4 = _mm_xor_si128(t4, t0);\
+	t0 = _mm_shuffle_epi8(t0, M128(_supermix4c));\
+	t4 = _mm_xor_si128(t4, t0)
+
+
+#define SUBROUND512_3(r1a, r1b, r1c, r1d, r2a, r2b, r2c, r2d, r3a, r3b, r3c, r3d)\
+	CMIX(r1a, r1b, r1c, r1d, _t0, _t1);\
+	PACK_S0(r1c, r1a, _t0);\
+	SUBSTITUTE(r1c, _t1, _t2, _t3, _t0);\
+	SUPERMIX(_t2, _t3, _t0, _t1, r1c);\
+	_t0 = _mm_shuffle_epi32(r1c, 0x39);\
+	r2c = _mm_xor_si128(r2c, _t0);\
+   _t0 = mm128_mask_32( _t0, 8 ); \
+	r2d = _mm_xor_si128(r2d, _t0);\
+	UNPACK_S0(r1c, r1a, _t3);\
+	SUBSTITUTE(r2c, _t1, _t2, _t3, _t0);\
+	SUPERMIX(_t2, _t3, _t0, _t1, r2c);\
+	_t0 = _mm_shuffle_epi32(r2c, 0x39);\
+	r3c = _mm_xor_si128(r3c, _t0);\
+   _t0 = mm128_mask_32( _t0, 8 ); \
+	r3d = _mm_xor_si128(r3d, _t0);\
+	UNPACK_S0(r2c, r2a, _t3);\
+	SUBSTITUTE(r3c, _t1, _t2, _t3, _t0);\
+	SUPERMIX(_t2, _t3, _t0, _t1, r3c);\
+	UNPACK_S0(r3c, r3a, _t3)
+
+
+#define SUBROUND512_4(r1a, r1b, r1c, r1d, r2a, r2b, r2c, r2d, r3a, r3b, r3c, r3d, r4a, r4b, r4c, r4d)\
+	CMIX(r1a, r1b, r1c, r1d, _t0, _t1);\
+	PACK_S0(r1c, r1a, _t0);\
+	SUBSTITUTE(r1c, _t1, _t2, _t3, _t0);\
+	SUPERMIX(_t2, _t3, _t0, _t1, r1c);\
+	_t0 = _mm_shuffle_epi32(r1c, 0x39);\
+	r2c = _mm_xor_si128(r2c, _t0);\
+   _t0 = mm128_mask_32( _t0, 8 ); \
+	r2d = _mm_xor_si128(r2d, _t0);\
+	UNPACK_S0(r1c, r1a, _t3);\
+	SUBSTITUTE(r2c, _t1, _t2, _t3, _t0);\
+	SUPERMIX(_t2, _t3, _t0, _t1, r2c);\
+	_t0 = _mm_shuffle_epi32(r2c, 0x39);\
+	r3c = _mm_xor_si128(r3c, _t0);\
+   _t0 = mm128_mask_32( _t0, 8 ); \
+	r3d = _mm_xor_si128(r3d, _t0);\
+	UNPACK_S0(r2c, r2a, _t3);\
+	SUBSTITUTE(r3c, _t1, _t2, _t3, _t0);\
+	SUPERMIX(_t2, _t3, _t0, _t1, r3c);\
+	_t0 = _mm_shuffle_epi32(r3c, 0x39);\
+	r4c = _mm_xor_si128(r4c, _t0);\
+   _t0 = mm128_mask_32( _t0, 8 ); \
+	r4d = _mm_xor_si128(r4d, _t0);\
+	UNPACK_S0(r3c, r3a, _t3);\
+	SUBSTITUTE(r4c, _t1, _t2, _t3, _t0);\
+	SUPERMIX(_t2, _t3, _t0, _t1, r4c);\
+	UNPACK_S0(r4c, r4a, _t3)
+
+
+
+#define LOADCOLUMN(x, s, a)\
+	block[0] = col[(base + a + 0) % s];\
+	block[1] = col[(base + a + 1) % s];\
+	block[2] = col[(base + a + 2) % s];\
+	block[3] = col[(base + a + 3) % s];\
+	x = _mm_load_si128((__m128i*)block)
+
+#define STORECOLUMN(x, s)\
+	_mm_store_si128((__m128i*)block, x);\
+	col[(base + 0) % s] = block[0];\
+	col[(base + 1) % s] = block[1];\
+	col[(base + 2) % s] = block[2];\
+	col[(base + 3) % s] = block[3]
+
+void Compress512(hashState_fugue *ctx, const unsigned char *pmsg, unsigned int uBlockCount)
+{
+   __m128i _t0, _t1, _t2, _t3;
+
+   switch(ctx->base)
+   {
+      case 1:
+         TIX512( pmsg, ctx->state[3], ctx->state[10], ctx->state[4],
+                       ctx->state[5], ctx->state[ 6], ctx->state[8],
+		       ctx->state[9], ctx->state[10], _t0, _t1, _t2 );
+
+	 SUBROUND512_4( ctx->state[8], ctx->state[9], ctx->state[7],
+                        ctx->state[1], ctx->state[7], ctx->state[8],
+		       	ctx->state[6], ctx->state[0], ctx->state[6],
+		       	ctx->state[7], ctx->state[5], ctx->state[11],
+		       	ctx->state[5], ctx->state[6], ctx->state[4],
+		       	ctx->state[10] );
+         ctx->base++;
+         pmsg += 4;
+         uBlockCount--;
+         if( uBlockCount == 0 ) break;   // otherwise fall through to case 2
+
+      case 2:
+         TIX512( pmsg, ctx->state[11], ctx->state[6], ctx->state[0],
+                       ctx->state[ 1], ctx->state[2], ctx->state[4],
+		       ctx->state[ 5], ctx->state[6], _t0, _t1, _t2);
+
+         SUBROUND512_4( ctx->state[4], ctx->state[5], ctx->state[3],
+                        ctx->state[9], ctx->state[3], ctx->state[4],
+		       	ctx->state[2], ctx->state[8], ctx->state[2],
+		       	ctx->state[3], ctx->state[1], ctx->state[7],
+		       	ctx->state[1], ctx->state[2], ctx->state[0],
+		       	ctx->state[6]);
+
+         ctx->base = 0;
+         pmsg += 4;
+         uBlockCount--;
+      break;
+   }
+
+
+   while( uBlockCount > 0 )
+   {
+      TIX512( pmsg, ctx->state[ 7], ctx->state[2], ctx->state[8], ctx->state[9],
+                    ctx->state[10], ctx->state[0], ctx->state[1], ctx->state[2],
+              _t0, _t1, _t2 );
+      SUBROUND512_4( ctx->state[0], ctx->state[1], ctx->state[11],
+                     ctx->state[5], ctx->state[11], ctx->state[0],
+		     ctx->state[10], ctx->state[4], ctx->state[10],
+		     ctx->state[11], ctx->state[9], ctx->state[3],
+		     ctx->state[9], ctx->state[10], ctx->state[8],
+		     ctx->state[2] );
+
+      ctx->base++;
+      pmsg += 4;
+      uBlockCount--;
+      if( uBlockCount == 0 ) break;
+
+      TIX512( pmsg, ctx->state[3], ctx->state[10], ctx->state[4], ctx->state[5],
+                    ctx->state[6], ctx->state[8], ctx->state[9], ctx->state[10],
+              _t0, _t1, _t2 );
+
+      SUBROUND512_4( ctx->state[8], ctx->state[9], ctx->state[7], ctx->state[1],                     ctx->state[7], ctx->state[8], ctx->state[6], ctx->state[0],
+		     ctx->state[6], ctx->state[7], ctx->state[5], ctx->state[11],
+		     ctx->state[5], ctx->state[6], ctx->state[4], ctx->state[10]);
+
+      ctx->base++;
+      pmsg += 4;
+      uBlockCount--;
+      if( uBlockCount == 0 ) break;
+
+      TIX512( pmsg, ctx->state[11], ctx->state[6], ctx->state[0], ctx->state[1],
+		    ctx->state[2], ctx->state[4], ctx->state[5], ctx->state[6],
+               _t0, _t1, _t2);
+      SUBROUND512_4( ctx->state[4], ctx->state[5], ctx->state[3], ctx->state[9],
+		     ctx->state[3], ctx->state[4], ctx->state[2], ctx->state[8],
+		     ctx->state[2], ctx->state[3], ctx->state[1], ctx->state[7],
+		     ctx->state[1], ctx->state[2], ctx->state[0], ctx->state[6]);
+
+      ctx->base = 0;
+      pmsg += 4;
+      uBlockCount--;
+   }
+
+}
+
+void Final512(hashState_fugue *ctx, BitSequence *hashval)
+{
+	unsigned int block[4] __attribute__ ((aligned (32)));
+	unsigned int col[36] __attribute__ ((aligned (16)));
+	unsigned int i, base;
+	__m128i r0, _t0, _t1, _t2, _t3;
+
+	for(i = 0; i < 12; i++)
+	{
+		_mm_store_si128((__m128i*)block, ctx->state[i]);
+
+		col[3 * i + 0] = block[0];
+		col[3 * i + 1] = block[1];
+		col[3 * i + 2] = block[2];
+	}
+
+	base = (36 - (12 * ctx->base)) % 36;
+
+	for(i = 0; i < 32; i++)
+	{
+		// ROR3
+		base = (base + 33) % 36;
+
+		// CMIX
+		col[(base +  0) % 36] ^= col[(base + 4) % 36];
+		col[(base +  1) % 36] ^= col[(base + 5) % 36];
+		col[(base +  2) % 36] ^= col[(base + 6) % 36];
+		col[(base +  18) % 36] ^= col[(base + 4) % 36];
+		col[(base +  19) % 36] ^= col[(base + 5) % 36];
+		col[(base +  20) % 36] ^= col[(base + 6) % 36];
+
+		// SMIX
+		LOADCOLUMN(r0, 36, 0);
+		SUBSTITUTE(r0, _t1, _t2, _t3, _t0);
+		SUPERMIX(_t2, _t3, _t0, _t1, r0);
+		STORECOLUMN(r0, 36);
+	}
+
+	for(i = 0; i < 13; i++)
+	{
+		// S4 += S0; S9 += S0; S18 += S0; S27 += S0;
+		col[(base +  4) % 36] ^= col[(base + 0) % 36];
+		col[(base +  9) % 36] ^= col[(base + 0) % 36];
+		col[(base + 18) % 36] ^= col[(base + 0) % 36];
+		col[(base + 27) % 36] ^= col[(base + 0) % 36];
+
+		// ROR9
+		base = (base + 27) % 36;
+
+		// SMIX
+		LOADCOLUMN(r0, 36, 0);
+		SUBSTITUTE(r0, _t1, _t2, _t3, _t0);
+		SUPERMIX(_t2, _t3, _t0, _t1, r0);
+		STORECOLUMN(r0, 36);
+
+		// S4 += S0; S10 += S0; S18 += S0; S27 += S0;
+		col[(base +  4) % 36] ^= col[(base + 0) % 36];
+		col[(base + 10) % 36] ^= col[(base + 0) % 36];
+		col[(base + 18) % 36] ^= col[(base + 0) % 36];
+		col[(base + 27) % 36] ^= col[(base + 0) % 36];
+
+		// ROR9
+		base = (base + 27) % 36;
+
+		// SMIX
+		LOADCOLUMN(r0, 36, 0);
+		SUBSTITUTE(r0, _t1, _t2, _t3, _t0);
+		SUPERMIX(_t2, _t3, _t0, _t1, r0);
+		STORECOLUMN(r0, 36);
+
+		// S4 += S0; S10 += S0; S19 += S0; S27 += S0;
+		col[(base +  4) % 36] ^= col[(base + 0) % 36];
+		col[(base + 10) % 36] ^= col[(base + 0) % 36];
+		col[(base + 19) % 36] ^= col[(base + 0) % 36];
+		col[(base + 27) % 36] ^= col[(base + 0) % 36];
+
+		// ROR9
+		base = (base + 27) % 36;
+
+		// SMIX
+		LOADCOLUMN(r0, 36, 0);
+		SUBSTITUTE(r0, _t1, _t2, _t3, _t0);
+		SUPERMIX(_t2, _t3, _t0, _t1, r0);
+		STORECOLUMN(r0, 36);
+
+		// S4 += S0; S10 += S0; S19 += S0; S28 += S0;
+		col[(base +  4) % 36] ^= col[(base + 0) % 36];
+		col[(base + 10) % 36] ^= col[(base + 0) % 36];
+		col[(base + 19) % 36] ^= col[(base + 0) % 36];
+		col[(base + 28) % 36] ^= col[(base + 0) % 36];
+
+		// ROR8
+		base = (base + 28) % 36;
+
+		// SMIX
+		LOADCOLUMN(r0, 36, 0);
+		SUBSTITUTE(r0, _t1, _t2, _t3, _t0);
+		SUPERMIX(_t2, _t3, _t0, _t1, r0);
+		STORECOLUMN(r0, 36);
+	}
+
+	// S4 += S0; S9 += S0; S18 += S0; S27 += S0;
+	col[(base +  4) % 36] ^= col[(base + 0) % 36];
+	col[(base +  9) % 36] ^= col[(base + 0) % 36];
+	col[(base + 18) % 36] ^= col[(base + 0) % 36];
+	col[(base + 27) % 36] ^= col[(base + 0) % 36];
+
+	// Transform to the standard basis and store output; S1 || S2 || S3 || S4
+	LOADCOLUMN(r0, 36, 1);
+	_mm_store_si128((__m128i*)hashval, r0);
+
+	// Transform to the standard basis and store output; S9 || S10 || S11 || S12
+	LOADCOLUMN(r0, 36, 9);
+	_mm_store_si128((__m128i*)hashval + 1, r0);
+
+	// Transform to the standard basis and store output; S18 || S19 || S20 || S21
+	LOADCOLUMN(r0, 36, 18);
+	_mm_store_si128((__m128i*)hashval + 2, r0);
+
+	// Transform to the standard basis and store output; S27 || S28 || S29 || S30
+	LOADCOLUMN(r0, 36, 27);
+	_mm_store_si128((__m128i*)hashval + 3, r0);
+}
+
+HashReturn fugue512_Init(hashState_fugue *ctx, int nHashSize)
+{
+	int i;
+	ctx->processed_bits = 0;
+	ctx->uBufferBytes = 0;
+	ctx->base = 0;
+
+
+	ctx->uHashSize = 512;
+	ctx->uBlockLength = 4;
+
+	for(i = 0; i < 6; i++)
+		ctx->state[i] = m128_zero;
+
+	ctx->state[6]  = _mm_load_si128((__m128i*)_IV512 + 0);
+	ctx->state[7]  = _mm_load_si128((__m128i*)_IV512 + 1);
+	ctx->state[8]  = _mm_load_si128((__m128i*)_IV512 + 2);
+	ctx->state[9]  = _mm_load_si128((__m128i*)_IV512 + 3);
+	ctx->state[10] = _mm_load_si128((__m128i*)_IV512 + 4);
+	ctx->state[11] = _mm_load_si128((__m128i*)_IV512 + 5);
+
+	return SUCCESS;
+}
+
+
+HashReturn fugue512_Update(hashState_fugue *state, const void *data, DataLength databitlen)
+{
+	unsigned int uByteLength, uBlockCount, uRemainingBytes;
+
+	uByteLength = (unsigned int)(databitlen / 8);
+
+	if(state->uBufferBytes + uByteLength >= state->uBlockLength)
+	{
+		if(state->uBufferBytes != 0)
+		{
+			// Fill the buffer
+			memcpy(state->buffer + state->uBufferBytes, (void*)data, state->uBlockLength - state->uBufferBytes);
+
+			// Process the buffer
+			Compress512(state, state->buffer, 1);
+
+			state->processed_bits += state->uBlockLength * 8;
+			data += state->uBlockLength - state->uBufferBytes;
+			uByteLength -= state->uBlockLength - state->uBufferBytes;
+		}
+
+		// buffer now does not contain any unprocessed bytes
+
+		uBlockCount = uByteLength / state->uBlockLength;
+		uRemainingBytes = uByteLength % state->uBlockLength;
+
+		if(uBlockCount > 0)
+		{
+			Compress512(state, data, uBlockCount);
+
+			state->processed_bits += uBlockCount * state->uBlockLength * 8;
+			data += uBlockCount * state->uBlockLength;
+		}
+
+		if(uRemainingBytes > 0)
+		{
+			memcpy(state->buffer, (void*)data, uRemainingBytes);
+		}
+
+		state->uBufferBytes = uRemainingBytes;
+	}
+	else
+	{
+		memcpy(state->buffer + state->uBufferBytes, (void*)data, uByteLength);
+		state->uBufferBytes += uByteLength;
+	}
+
+	return SUCCESS;
+}
+
+HashReturn fugue512_Final(hashState_fugue *state, void *hashval)
+{
+	unsigned int i;
+	BitSequence lengthbuf[8] __attribute__((aligned(64)));
+
+	// Update message bit count
+	state->processed_bits += state->uBufferBytes * 8;
+
+	// Pad the remaining buffer bytes with zero
+	if(state->uBufferBytes != 0)
+	{
+		if(state->uBufferBytes != state->uBlockLength)
+			memset(state->buffer + state->uBufferBytes, 0, state->uBlockLength - state->uBufferBytes);
+
+		Compress512(state, state->buffer, 1);
+	}
+
+	// Last two blocks are message length in bits
+	for(i = 0; i < 8; i++)
+		lengthbuf[i] = ((state->processed_bits) >> (8 * (7 - i))) & 0xff;
+
+	// Process the last two blocks
+	Compress512(state, lengthbuf, 2);
+
+	// Finalization
+	Final512(state, hashval);
+
+	return SUCCESS;
+}
+
+
+HashReturn fugue512_full(hashState_fugue *hs, void *hashval, const void *data, DataLength databitlen)
+{
+	fugue512_Init(hs, 512);
+	fugue512_Update(hs, data, databitlen*8);  // despite its name, databitlen is treated as a byte count here
+	fugue512_Final(hs, hashval);
+	return SUCCESS;
+}
+
+#endif  // AES
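The length step in fugue512_Final above writes the 64-bit bit count most significant byte first, so the 8 bytes form two 4-byte input blocks. A minimal scalar sketch of that encoding follows; the helper name encode_bitlen_be is hypothetical and does not appear in the patch.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: serialize a 64-bit bit
 * count as 8 big-endian bytes, mirroring the lengthbuf loop in
 * fugue512_Final. With a 4-byte block length this is two blocks. */
static void encode_bitlen_be(uint64_t processed_bits, uint8_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(processed_bits >> (8 * (7 - i)));
}
```

For an 80-byte input the count is 640 bits (0x280), so only the last two bytes of the buffer are non-zero.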
--- a/algo/fugue/fugue-aesni.h
+++ b/algo/fugue/fugue-aesni.h
@@ -0,0 +1,46 @@
+/*
+ * file        : hash_api.h
+ * version     : 1.0.208
+ * date        : 14.12.2010
+ * 
+ * Fugue vperm implementation Hash API
+ *
+ * Cagdas Calik
+ * ccalik@metu.edu.tr
+ * Institute of Applied Mathematics, Middle East Technical University, Turkey.
+ *
+ */
+
+#ifndef FUGUE_HASH_API_H
+#define FUGUE_HASH_API_H
+
+#if defined(__AES__)
+
+#include "algo/sha/sha3_common.h"
+#include "simd-utils.h"
+
+
+typedef struct
+{
+	__m128i			state[12];
+	unsigned int	base;
+
+	unsigned int	uHashSize;
+	unsigned int	uBlockLength;
+	unsigned int	uBufferBytes;
+	DataLength		processed_bits;
+	BitSequence		buffer[4];
+
+} hashState_fugue __attribute__ ((aligned (64)));
+
+HashReturn fugue512_Init(hashState_fugue *state, int hashbitlen);
+
+HashReturn fugue512_Update(hashState_fugue *state, const void *data, DataLength databitlen);
+
+HashReturn fugue512_Final(hashState_fugue *state, void *hashval);
+
+HashReturn fugue512_full(hashState_fugue *hs, void *hashval, const void *data, DataLength databitlen);
+
+#endif // AES
+#endif // FUGUE_HASH_API_H
+
--- a/algo/groestl/groestl256-hash-4way.c
+++ b/algo/groestl/groestl256-hash-4way.c
@@ -15,7 +15,9 @@
 #include "miner.h"
 #include "simd-utils.h"

-#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#if defined(__AVX2__) && defined(__VAES__)
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)


 int groestl256_4way_init( groestl256_4way_context* ctx, uint64_t hashlen )
@@ -43,10 +45,10 @@ int groestl256_4way_init( groestl256_4way_context* ctx, uint64_t hashlen )
 }

 int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
-                                const void* input, uint64_t databitlen )
+                                const void* input, uint64_t datalen )
 {
-   const int len = (int)databitlen / 128;
-   const int hashlen_m128i = 32 / 16;   // bytes to __m128i
+   const int len = (int)datalen >> 4;
+   const int hashlen_m128i = 32 >> 4;   // bytes to __m128i
   const int hash_offset = SIZE256 - hashlen_m128i;
   int rem = ctx->rem_ptr;
   int blocks = len / SIZE256;
@@ -172,5 +174,161 @@ int groestl256_4way_update_close( groestl256_4way_context* ctx, void* output,
   return 0;
 }

-#endif   // VAES
+#endif   // AVX512

+// AVX2 + VAES
+
+int groestl256_2way_init( groestl256_2way_context* ctx, uint64_t hashlen )
+{
+  int i;
+
+  ctx->hashlen = hashlen;
+
+  if (ctx->chaining == NULL || ctx->buffer == NULL)
+    return 1;
+
+  for ( i = 0; i < SIZE256; i++ )
+  {
+     ctx->chaining[i] = m256_zero;
+     ctx->buffer[i]   = m256_zero;
+  }
+
+  // The only non-zero in the IV is len. It can be hard coded.
+  ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
+
+  ctx->buf_ptr = 0;
+  ctx->rem_ptr = 0;
+
+  return 0;
+}
+
+int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
+                                const void* input, uint64_t datalen )
+{
+   const int len = (int)datalen >> 4;
+   const int hashlen_m128i = 32 >> 4;   // bytes to __m128i
+   const int hash_offset = SIZE256 - hashlen_m128i;
+   int rem = ctx->rem_ptr;
+   int blocks = len / SIZE256;
+   __m256i* in = (__m256i*)input;
+   int i;
+
+  if (ctx->chaining == NULL || ctx->buffer == NULL)
+    return 1;
+
+  for ( i = 0; i < SIZE256; i++ )
+  {
+     ctx->chaining[i] = m256_zero;
+     ctx->buffer[i]   = m256_zero;
+  }
+
+  // The only non-zero in the IV is len. It can be hard coded.
+  ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
+  ctx->buf_ptr = 0;
+  ctx->rem_ptr = 0;
+
+   // --- update ---
+
+   // digest any full blocks, process directly from input 
+   for ( i = 0; i < blocks; i++ )
+      TF512_2way( ctx->chaining, &in[ i * SIZE256 ] );
+   ctx->buf_ptr = blocks * SIZE256;
+
+   // copy any remaining data to buffer, it may already contain data
+   // from a previous update for a midstate precalc
+   for ( i = 0; i < len % SIZE256; i++ )
+       ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
+   i += rem;    // use i as rem_ptr in final
+
+   //--- final ---
+
+   blocks++;      // adjust for final block
+
+   if ( i == SIZE256 - 1 )
+   {
+       // only 1 vector left in buffer, all padding at once
+      ctx->buffer[i] = m256_const2_64( (uint64_t)blocks << 56, 0x80 );
+   }
+   else
+   {
+       // add first padding
+       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       // add zero padding
+       for ( i += 1; i < SIZE256 - 1; i++ )
+           ctx->buffer[i] = m256_zero;
+
+       // add length padding, only the low byte of the block count is stored
+      ctx->buffer[i] = m256_const2_64( (uint64_t)blocks << 56, 0 );
+   }
+
+// digest final padding block and do output transform
+   TF512_2way( ctx->chaining, ctx->buffer );
+
+   OF512_2way( ctx->chaining );
+
+   // store hash result in output 
+   for ( i = 0; i < hashlen_m128i; i++ )
+      casti_m256i( output, i ) = ctx->chaining[ hash_offset + i ];
+
+   return 0;
+}
+int groestl256_2way_update_close( groestl256_2way_context* ctx, void* output,
+                                const void* input, uint64_t databitlen )
+{
+   const int len = (int)databitlen / 128;
+   const int hashlen_m128i = ctx->hashlen / 16;   // bytes to __m128i
+   const int hash_offset = SIZE256 - hashlen_m128i;
+   int rem = ctx->rem_ptr;
+   int blocks = len / SIZE256;
+   __m256i* in = (__m256i*)input;
+   int i;
+
+   // --- update ---
+
+   // digest any full blocks, process directly from input 
+   for ( i = 0; i < blocks; i++ )
+      TF512_2way( ctx->chaining, &in[ i * SIZE256 ] );
+   ctx->buf_ptr = blocks * SIZE256;
+
+   // copy any remaining data to buffer, it may already contain data
+   // from a previous update for a midstate precalc
+   for ( i = 0; i < len % SIZE256; i++ )
+       ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
+   i += rem;    // use i as rem_ptr in final
+
+   //--- final ---
+
+   blocks++;      // adjust for final block
+
+   if ( i == SIZE256 - 1 )
+   {
+       // only 1 vector left in buffer, all padding at once
+       ctx->buffer[i] = m256_const1_128( _mm_set_epi8(
+                      blocks, blocks>>8,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0x80 ) );
+   }
+   else
+   {
+       // add first padding
+       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       // add zero padding
+       for ( i += 1; i < SIZE256 - 1; i++ )
+           ctx->buffer[i] = m256_zero;
+
+       // add length padding, second last byte is zero unless blocks > 255
+       ctx->buffer[i] = m256_const1_128( _mm_set_epi8(
+                   blocks, blocks>>8, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0 ) );
+   }
+
+// digest final padding block and do output transform
+   TF512_2way( ctx->chaining, ctx->buffer );
+
+   OF512_2way( ctx->chaining );
+
+   // store hash result in output 
+   for ( i = 0; i < hashlen_m128i; i++ )
+      casti_m256i( output, i ) = ctx->chaining[ hash_offset + i ];
+
+   return 0;
+}
+
+#endif  // VAES
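The padding in the 2-way functions above works on 16-byte vectors: a 0x80 marker follows the message, zeros fill the gap, and the block count lands in the last bytes of the final 64-byte block. The same layout can be sketched bytewise for a single lane. The helper name groestl256_pad and the constant GROESTL256_BLOCK are hypothetical; the sketch writes a full 64-bit big-endian block count, whereas the vector code above stores only its low one or two bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROESTL256_BLOCK 64   /* bytes per 512-bit block (assumed name) */

/* Hypothetical scalar helper, not part of the patch. Copies msg into out,
 * appends 0x80, zero fill and a big-endian 64-bit block count, and returns
 * the total number of blocks. out must hold at least len + 72 bytes. */
static uint64_t groestl256_pad(const uint8_t *msg, size_t len, uint8_t *out)
{
    /* the 0x80 marker (1 byte) and the counter (8 bytes) must both fit */
    uint64_t blocks = (len + 1 + 8 + GROESTL256_BLOCK - 1) / GROESTL256_BLOCK;
    size_t total = (size_t)blocks * GROESTL256_BLOCK;

    memcpy(out, msg, len);
    out[len] = 0x80;
    memset(out + len + 1, 0, total - len - 1);
    for (int i = 0; i < 8; i++)
        out[total - 1 - i] = (uint8_t)(blocks >> (8 * i));
    return blocks;
}
```

For an 80-byte message this yields two blocks, which agrees with the vector path: one full block is digested directly, the remaining 16 bytes go to the buffer, and blocks++ accounts for the final padded block.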
--- a/algo/groestl/groestl256-hash-4way.h
+++ b/algo/groestl/groestl256-hash-4way.h
@@ -18,8 +18,8 @@
 #endif
 #include <stdlib.h>

-#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
-   
+#if defined(__AVX2__) && defined(__VAES__)
+
 #define LENGTH (256)

 //#include "brg_endian.h"
@@ -48,6 +48,8 @@

 #define SIZE256 (SIZE_512/16)

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
 typedef struct {
  __attribute__ ((aligned (128))) __m512i chaining[SIZE256];
  __attribute__ ((aligned (64))) __m512i buffer[SIZE256];
@@ -55,7 +57,7 @@ typedef struct {
  int blk_count;     // SIZE_m128i
  int buf_ptr;       // __m128i offset
  int rem_ptr;
-  int databitlen;    // bits
+//  int databitlen;    // bits
 } groestl256_4way_context;


@@ -74,5 +76,25 @@ int groestl256_4way_update_close( groestl256_4way_context*,  void*,
 int groestl256_4way_full( groestl256_4way_context*, void*,
                          const void*, uint64_t );

-#endif
-#endif 
+#endif  // AVX512
+
+typedef struct {
+  __attribute__ ((aligned (128))) __m256i chaining[SIZE256];
+  __attribute__ ((aligned (64))) __m256i buffer[SIZE256];
+  int hashlen;       // bytes
+  int blk_count;     // SIZE_m128i
+  int buf_ptr;       // __m128i offset
+  int rem_ptr;
+//  int databitlen;    // bits
+} groestl256_2way_context;
+
+int groestl256_2way_init( groestl256_2way_context*, uint64_t );
+
+int groestl256_2way_update_close( groestl256_2way_context*,  void*,
+                                        const void*, uint64_t );
+
+int groestl256_2way_full( groestl256_2way_context*, void*,
+                          const void*, uint64_t );
+
+#endif  // VAES
+#endif  // GROESTL256_HASH_4WAY_H__
--- a/algo/groestl/groestl256-intr-4way.h
+++ b/algo/groestl/groestl256-intr-4way.h
@@ -7,13 +7,13 @@
 * This code is placed in the public domain
 */

-
 #if !defined(GROESTL256_INTR_4WAY_H__)
 #define GROESTL256_INTR_4WAY_H__ 1
      
 #include "groestl256-hash-4way.h"

-#if defined(__VAES__)
+#if defined(__AVX2__) && defined(__VAES__)
+
 static const __m128i round_const_l0[] __attribute__ ((aligned (64))) =
 {
   { 0x7060504030201000, 0xffffffffffffffff },
@@ -42,6 +42,8 @@ static const __m128i round_const_l7[] __attribute__ ((aligned (64))) =
   { 0x0000000000000000, 0x8696a6b6c6d6e6f6 }
 };

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
 static const __m512i TRANSP_MASK = { 0x0d0509010c040800, 0x0f070b030e060a02,
                                     0x1d1519111c141810, 0x1f171b131e161a12,
                                     0x2d2529212c242820, 0x2f272b232e262a22,
@@ -499,5 +501,398 @@ void OF512_4way( __m512i* chaining )
  chaining[3] = xmm11;
 }

+#endif  // AVX512
+
+static const __m256i TRANSP_MASK_2WAY =
+             { 0x0d0509010c040800, 0x0f070b030e060a02,
+               0x1d1519111c141810, 0x1f171b131e161a12 };
+
+static const __m256i SUBSH_MASK0_2WAY =
+             { 0x0c0f0104070b0e00, 0x03060a0d08020509,
+               0x1c1f1114171b1e10, 0x13161a1d18121519 };
+
+static const __m256i SUBSH_MASK1_2WAY =
+             { 0x0e090205000d0801, 0x04070c0f0a03060b,
+               0x1e191215101d1811, 0x14171c1f1a13161b };
+
+static const __m256i SUBSH_MASK2_2WAY =
+               { 0x080b0306010f0a02, 0x05000e090c04070d,
+                 0x181b1316111f1a12, 0x15101e191c14171d };
+
+static const __m256i SUBSH_MASK3_2WAY =
+               { 0x0a0d040702090c03, 0x0601080b0e05000f,
+                 0x1a1d141712191c13, 0x1611181b1e15101f };
+
+static const __m256i SUBSH_MASK4_2WAY =
+               { 0x0b0e0500030a0d04, 0x0702090c0f060108,
+                 0x1b1e1510131a1d14, 0x1712191c1f161118 };
+
+static const __m256i SUBSH_MASK5_2WAY =
+               { 0x0d080601040c0f05, 0x00030b0e0907020a,
+                 0x1d181611141c1f15, 0x10131b1e1917121a };
+
+static const __m256i SUBSH_MASK6_2WAY =
+               { 0x0f0a0702050e0906, 0x01040d080b00030c,
+                 0x1f1a1712151e1916, 0x11141d181b10131c };
+
+static const __m256i SUBSH_MASK7_2WAY =
+               { 0x090c000306080b07, 0x02050f0a0d01040e,
+                 0x191c101316181b17, 0x12151f1a1d11141e };
+
+#define tos(a)    #a
+#define tostr(a)  tos(a)
+
+/* xmm[i] will be multiplied by 2
+ * xmm[j] will be lost
+ * xmm[k] has to be all 0x1b */
+#define MUL2_2WAY(i, j, k){\
+  j = _mm256_xor_si256(j, j);\
+  j = _mm256_cmpgt_epi8(j, i );\
+  i = _mm256_add_epi8(i, i);\
+  j = _mm256_and_si256(j, k);\
+  i = _mm256_xor_si256(i, j);\
+}
+
+#define MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
+  /* t_i = a_i + a_{i+1} */\
+  b6 = a0;\
+  b7 = a1;\
+  a0 = _mm256_xor_si256(a0, a1);\
+  b0 = a2;\
+  a1 = _mm256_xor_si256(a1, a2);\
+  b1 = a3;\
+  a2 = _mm256_xor_si256(a2, a3);\
+  b2 = a4;\
+  a3 = _mm256_xor_si256(a3, a4);\
+  b3 = a5;\
+  a4 = _mm256_xor_si256(a4, a5);\
+  b4 = a6;\
+  a5 = _mm256_xor_si256(a5, a6);\
+  b5 = a7;\
+  a6 = _mm256_xor_si256(a6, a7);\
+  a7 = _mm256_xor_si256(a7, b6);\
+  \
+  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
+  b0 = _mm256_xor_si256(b0, a4);\
+  b6 = _mm256_xor_si256(b6, a4);\
+  b1 = _mm256_xor_si256(b1, a5);\
+  b7 = _mm256_xor_si256(b7, a5);\
+  b2 = _mm256_xor_si256(b2, a6);\
+  b0 = _mm256_xor_si256(b0, a6);\
+  /* spill values y_4, y_5 to memory */\
+  TEMP0 = b0;\
+  b3 = _mm256_xor_si256(b3, a7);\
+  b1 = _mm256_xor_si256(b1, a7);\
+  TEMP1 = b1;\
+  b4 = _mm256_xor_si256(b4, a0);\
+  b2 = _mm256_xor_si256(b2, a0);\
+  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
+  b0 = a0;\
+  b5 = _mm256_xor_si256(b5, a1);\
+  b3 = _mm256_xor_si256(b3, a1);\
+  b1 = a1;\
+  b6 = _mm256_xor_si256(b6, a2);\
+  b4 = _mm256_xor_si256(b4, a2);\
+  TEMP2 = a2;\
+  b7 = _mm256_xor_si256(b7, a3);\
+  b5 = _mm256_xor_si256(b5, a3);\
+  \
+  /* compute x_i = t_i + t_{i+3} */\
+  a0 = _mm256_xor_si256(a0, a3);\
+  a1 = _mm256_xor_si256(a1, a4);\
+  a2 = _mm256_xor_si256(a2, a5);\
+  a3 = _mm256_xor_si256(a3, a6);\
+  a4 = _mm256_xor_si256(a4, a7);\
+  a5 = _mm256_xor_si256(a5, b0);\
+  a6 = _mm256_xor_si256(a6, b1);\
+  a7 = _mm256_xor_si256(a7, TEMP2);\
+  \
+  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
+  /* compute w_i : add y_{i+4} */\
+  b1 = m256_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  MUL2_2WAY(a0, b0, b1);\
+  a0 = _mm256_xor_si256(a0, TEMP0);\
+  MUL2_2WAY(a1, b0, b1);\
+  a1 = _mm256_xor_si256(a1, TEMP1);\
+  MUL2_2WAY(a2, b0, b1);\
+  a2 = _mm256_xor_si256(a2, b2);\
+  MUL2_2WAY(a3, b0, b1);\
+  a3 = _mm256_xor_si256(a3, b3);\
+  MUL2_2WAY(a4, b0, b1);\
+  a4 = _mm256_xor_si256(a4, b4);\
+  MUL2_2WAY(a5, b0, b1);\
+  a5 = _mm256_xor_si256(a5, b5);\
+  MUL2_2WAY(a6, b0, b1);\
+  a6 = _mm256_xor_si256(a6, b6);\
+  MUL2_2WAY(a7, b0, b1);\
+  a7 = _mm256_xor_si256(a7, b7);\
+  \
+  /* compute v_i : double w_i      */\
+  /* add to y_4 y_5 .. v3, v4, ... */\
+  MUL2_2WAY(a0, b0, b1);\
+  b5 = _mm256_xor_si256(b5, a0);\
+  MUL2_2WAY(a1, b0, b1);\
+  b6 = _mm256_xor_si256(b6, a1);\
+  MUL2_2WAY(a2, b0, b1);\
+  b7 = _mm256_xor_si256(b7, a2);\
+  MUL2_2WAY(a5, b0, b1);\
+  b2 = _mm256_xor_si256(b2, a5);\
+  MUL2_2WAY(a6, b0, b1);\
+  b3 = _mm256_xor_si256(b3, a6);\
+  MUL2_2WAY(a7, b0, b1);\
+  b4 = _mm256_xor_si256(b4, a7);\
+  MUL2_2WAY(a3, b0, b1);\
+  MUL2_2WAY(a4, b0, b1);\
+  b0 = TEMP0;\
+  b1 = TEMP1;\
+  b0 = _mm256_xor_si256(b0, a3);\
+  b1 = _mm256_xor_si256(b1, a4);\
+}/*MixBytes*/
+
+#define ROUND_2WAY(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
+  /* AddRoundConstant */\
+  b1 = m256_const2_64( 0xffffffffffffffff, 0 ); \
+  a0 = _mm256_xor_si256( a0, m256_const1_128( round_const_l0[i] ) );\
+  a1 = _mm256_xor_si256( a1, b1 );\
+  a2 = _mm256_xor_si256( a2, b1 );\
+  a3 = _mm256_xor_si256( a3, b1 );\
+  a4 = _mm256_xor_si256( a4, b1 );\
+  a5 = _mm256_xor_si256( a5, b1 );\
+  a6 = _mm256_xor_si256( a6, b1 );\
+  a7 = _mm256_xor_si256( a7, m256_const1_128( round_const_l7[i] ) );\
+  \
+  /* ShiftBytes + SubBytes (interleaved) */\
+  b0 = _mm256_xor_si256( b0, b0 );\
+  a0 = _mm256_shuffle_epi8( a0, SUBSH_MASK0_2WAY );\
+  a0 = _mm256_aesenclast_epi128(a0, b0 );\
+  a1 = _mm256_shuffle_epi8( a1, SUBSH_MASK1_2WAY );\
+  a1 = _mm256_aesenclast_epi128(a1, b0 );\
+  a2 = _mm256_shuffle_epi8( a2, SUBSH_MASK2_2WAY );\
+  a2 = _mm256_aesenclast_epi128(a2, b0 );\
+  a3 = _mm256_shuffle_epi8( a3, SUBSH_MASK3_2WAY );\
+  a3 = _mm256_aesenclast_epi128(a3, b0 );\
+  a4 = _mm256_shuffle_epi8( a4, SUBSH_MASK4_2WAY );\
+  a4 = _mm256_aesenclast_epi128(a4, b0 );\
+  a5 = _mm256_shuffle_epi8( a5, SUBSH_MASK5_2WAY );\
+  a5 = _mm256_aesenclast_epi128(a5, b0 );\
+  a6 = _mm256_shuffle_epi8( a6, SUBSH_MASK6_2WAY );\
+  a6 = _mm256_aesenclast_epi128(a6, b0 );\
+  a7 = _mm256_shuffle_epi8( a7, SUBSH_MASK7_2WAY );\
+  a7 = _mm256_aesenclast_epi128( a7, b0 );\
+  \
+  /* MixBytes */\
+  MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7);\
+\
+}
+
+/* 10 rounds, P and Q in parallel */
+#define ROUNDS_P_Q_2WAY(){\
+  ROUND_2WAY(0, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
+  ROUND_2WAY(1, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
+  ROUND_2WAY(2, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
+  ROUND_2WAY(3, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
+  ROUND_2WAY(4, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
+  ROUND_2WAY(5, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
+  ROUND_2WAY(6, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
+  ROUND_2WAY(7, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
+  ROUND_2WAY(8, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
+  ROUND_2WAY(9, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
+}
+
+#define Matrix_Transpose_A_2way(i0, i1, i2, i3, o1, o2, o3, t0){\
+  t0 = TRANSP_MASK_2WAY;\
+  \
+  i0 = _mm256_shuffle_epi8( i0, t0 );\
+  i1 = _mm256_shuffle_epi8( i1, t0 );\
+  i2 = _mm256_shuffle_epi8( i2, t0 );\
+  i3 = _mm256_shuffle_epi8( i3, t0 );\
+  \
+  o1 = i0;\
+  t0 = i2;\
+  \
+  i0 = _mm256_unpacklo_epi16( i0, i1 );\
+  o1 = _mm256_unpackhi_epi16( o1, i1 );\
+  i2 = _mm256_unpacklo_epi16( i2, i3 );\
+  t0 = _mm256_unpackhi_epi16( t0, i3 );\
+  \
+  i0 = _mm256_shuffle_epi32( i0, 216 );\
+  o1 = _mm256_shuffle_epi32( o1, 216 );\
+  i2 = _mm256_shuffle_epi32( i2, 216 );\
+  t0 = _mm256_shuffle_epi32( t0, 216 );\
+  \
+  o2 = i0;\
+  o3 = o1;\
+  \
+  i0 = _mm256_unpacklo_epi32( i0, i2 );\
+  o1 = _mm256_unpacklo_epi32( o1, t0 );\
+  o2 = _mm256_unpackhi_epi32( o2, i2 );\
+  o3 = _mm256_unpackhi_epi32( o3, t0 );\
+}/**/
+
+#define Matrix_Transpose_B_2way(i0, i1, i2, i3, i4, i5, i6, i7, o1, o2, o3, o4, o5, o6, o7){\
+  o1 = i0;\
+  o2 = i1;\
+  i0 = _mm256_unpacklo_epi64( i0, i4 );\
+  o1 = _mm256_unpackhi_epi64( o1, i4 );\
+  o3 = i1;\
+  o4 = i2;\
+  o2 = _mm256_unpacklo_epi64( o2, i5 );\
+  o3 = _mm256_unpackhi_epi64( o3, i5 );\
+  o5 = i2;\
+  o6 = i3;\
+  o4 = _mm256_unpacklo_epi64( o4, i6 );\
+  o5 = _mm256_unpackhi_epi64( o5, i6 );\
+  o7 = i3;\
+  o6 = _mm256_unpacklo_epi64( o6, i7 );\
+  o7 = _mm256_unpackhi_epi64( o7, i7 );\
+}/**/
+
+#define Matrix_Transpose_B_INV_2way(i0, i1, i2, i3, i4, i5, i6, i7, o0, o1, o2, o3){\
+  o0 = i0;\
+  i0 = _mm256_unpacklo_epi64( i0, i1 );\
+  o0 = _mm256_unpackhi_epi64( o0, i1 );\
+  o1 = i2;\
+  i2 = _mm256_unpacklo_epi64( i2, i3 );\
+  o1 = _mm256_unpackhi_epi64( o1, i3 );\
+  o2 = i4;\
+  i4 = _mm256_unpacklo_epi64( i4, i5 );\
+  o2 = _mm256_unpackhi_epi64( o2, i5 );\
+  o3 = i6;\
+  i6 = _mm256_unpacklo_epi64( i6, i7 );\
+  o3 = _mm256_unpackhi_epi64( o3, i7 );\
+}/**/
+
+#define Matrix_Transpose_O_B_2way(i0, i1, i2, i3, i4, i5, i6, i7, t0){\
+  t0 = _mm256_xor_si256( t0, t0 );\
+  i1 = i0;\
+  i3 = i2;\
+  i5 = i4;\
+  i7 = i6;\
+  i0 = _mm256_unpacklo_epi64( i0, t0 );\
+  i1 = _mm256_unpackhi_epi64( i1, t0 );\
+  i2 = _mm256_unpacklo_epi64( i2, t0 );\
+  i3 = _mm256_unpackhi_epi64( i3, t0 );\
+  i4 = _mm256_unpacklo_epi64( i4, t0 );\
+  i5 = _mm256_unpackhi_epi64( i5, t0 );\
+  i6 = _mm256_unpacklo_epi64( i6, t0 );\
+  i7 = _mm256_unpackhi_epi64( i7, t0 );\
+}/**/
+
+#define Matrix_Transpose_O_B_INV_2way(i0, i1, i2, i3, i4, i5, i6, i7){\
+  i0 = _mm256_unpacklo_epi64( i0, i1 );\
+  i2 = _mm256_unpacklo_epi64( i2, i3 );\
+  i4 = _mm256_unpacklo_epi64( i4, i5 );\
+  i6 = _mm256_unpacklo_epi64( i6, i7 );\
+}/**/
+
+void TF512_2way( __m256i* chaining, __m256i* message )
+{
+  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  static __m256i TEMP0;
+  static __m256i TEMP1;
+  static __m256i TEMP2;
+
+  /* load message into registers xmm12 - xmm15 */
+  xmm12 = message[0];
+  xmm13 = message[1];
+  xmm14 = message[2];
+  xmm15 = message[3];
+
+  /* transform message M from column ordering into row ordering */
+  /* we first put two rows (64 bit) of the message into one 128-bit xmm register */
+  Matrix_Transpose_A_2way(xmm12, xmm13, xmm14, xmm15, xmm2, xmm6, xmm7, xmm0);
+
+  /* load previous chaining value */
+  /* we first put two rows (64 bit) of the CV into one 128-bit xmm register */
+  xmm8 = chaining[0];
+  xmm0 = chaining[1];
+  xmm4 = chaining[2];
+  xmm5 = chaining[3];
+
+  /* xor message to CV get input of P */
+  /* result: CV+M in xmm8, xmm0, xmm4, xmm5 */
+  xmm8 = _mm256_xor_si256( xmm8, xmm12 );
+  xmm0 = _mm256_xor_si256( xmm0, xmm2 );
+  xmm4 = _mm256_xor_si256( xmm4, xmm6 );
+  xmm5 = _mm256_xor_si256( xmm5, xmm7 );
+
+  /* there are now 2 rows of the Groestl state (P and Q) in each xmm register */
+  /* unpack to get 1 row of P (64 bit) and Q (64 bit) into one xmm register */
+  /* result: the 8 rows of P and Q in xmm8 - xmm15 */
+  Matrix_Transpose_B_2way(xmm8, xmm0, xmm4, xmm5, xmm12, xmm2, xmm6, xmm7, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);
+
+  /* compute the two permutations P and Q in parallel */
+  ROUNDS_P_Q_2WAY();
+
+  /* unpack again to get two rows of P or two rows of Q in one xmm register */
+  Matrix_Transpose_B_INV_2way(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3);
+
+  /* xor output of P and Q */
+  /* result: P(CV+M)+Q(M) in xmm0...xmm3 */
+  xmm0 = _mm256_xor_si256( xmm0, xmm8 );
+  xmm1 = _mm256_xor_si256( xmm1, xmm10 );
+  xmm2 = _mm256_xor_si256( xmm2, xmm12 );
+  xmm3 = _mm256_xor_si256( xmm3, xmm14 );
+
+  /* xor CV (feed-forward) */
+  /* result: P(CV+M)+Q(M)+CV in xmm0...xmm3 */
+  xmm0 = _mm256_xor_si256( xmm0, (chaining[0]) );
+  xmm1 = _mm256_xor_si256( xmm1, (chaining[1]) );
+  xmm2 = _mm256_xor_si256( xmm2, (chaining[2]) );
+  xmm3 = _mm256_xor_si256( xmm3, (chaining[3]) );
+
+  /* store CV */
+  chaining[0] = xmm0;
+  chaining[1] = xmm1;
+  chaining[2] = xmm2;
+  chaining[3] = xmm3;
+
+  return;
+}
+  
+void OF512_2way( __m256i* chaining )
+{
+  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  static __m256i TEMP0;
+  static __m256i TEMP1;
+  static __m256i TEMP2;
+
+  /* load CV into registers xmm8, xmm10, xmm12, xmm14 */
+  xmm8 = chaining[0];
+  xmm10 = chaining[1];
+  xmm12 = chaining[2];
+  xmm14 = chaining[3];
+
+  /* there are now 2 rows of the CV in one xmm register */
+  /* unpack to get 1 row of P (64 bit) into one half of an xmm register */
+  /* result: the 8 input rows of P in xmm8 - xmm15 */
+  Matrix_Transpose_O_B_2way(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0);
+
+  /* compute the permutation P */
+  /* result: the output of P(CV) in xmm8 - xmm15 */
+  ROUNDS_P_Q_2WAY();
+
+  /* unpack again to get two rows of P in one xmm register */
+  /* result: P(CV) in xmm8, xmm10, xmm12, xmm14 */
+  Matrix_Transpose_O_B_INV_2way(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);
+
+  /* xor CV to P output (feed-forward) */
+  /* result: P(CV)+CV in xmm8, xmm10, xmm12, xmm14 */
+  xmm8  = _mm256_xor_si256( xmm8,  (chaining[0]) );
+  xmm10 = _mm256_xor_si256( xmm10, (chaining[1]) );
+  xmm12 = _mm256_xor_si256( xmm12, (chaining[2]) );
+  xmm14 = _mm256_xor_si256( xmm14, (chaining[3]) );
+
+  /* transform state back from row ordering into column ordering */
+  /* result: final hash value in xmm9, xmm11 */
+  Matrix_Transpose_A_2way(xmm8, xmm10, xmm12, xmm14, xmm4, xmm9, xmm11, xmm0);
+
+  /* we only need to return the truncated half of the state */
+  chaining[2] = xmm9;
+  chaining[3] = xmm11;
+}
+
 #endif  // VAES
-#endif  // GROESTL512_INTR_4WAY_H__
+#endif  // GROESTL256_INTR_4WAY_H__
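MUL2_2WAY above doubles every byte in GF(2^8): the signed compare against zero builds a 0xff mask for bytes whose top bit is set, the add shifts each byte left by one, and the masked 0x1b is XORed in as the reduction. A scalar, branch-free sketch of the same per-byte step follows; the function name gf256_mul2 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar equivalent of one byte lane of MUL2_2WAY:
 * multiply by 2 in GF(2^8), reducing with 0x1b when the top bit
 * of the input is set. */
static uint8_t gf256_mul2(uint8_t x)
{
    uint8_t mask = (uint8_t)-(x >> 7);          /* 0xff if top bit set, else 0x00 */
    return (uint8_t)((x << 1) ^ (mask & 0x1b)); /* shift, then conditional reduce */
}
```

The mask plays the role of the _mm256_cmpgt_epi8 result in the macro, which is why the vector version needs no branches either.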
--- a/algo/groestl/groestl512-hash-4way.c
+++ b/algo/groestl/groestl512-hash-4way.c
@@ -15,7 +15,9 @@
 #include "miner.h"
 #include "simd-utils.h"

-#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#if defined(__AVX2__) && defined(__VAES__)
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

 int groestl512_4way_init( groestl512_4way_context* ctx, uint64_t hashlen )
 {
@@ -137,5 +139,130 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
   return 0;
 }

+#endif   // AVX512
+
+
+// AVX2 + VAES
+
+int groestl512_2way_init( groestl512_2way_context* ctx, uint64_t hashlen )
+{
+  if (ctx->chaining == NULL || ctx->buffer == NULL)
+    return 1;
+
+  memset_zero_256( ctx->chaining, SIZE512 );
+  memset_zero_256( ctx->buffer, SIZE512 );
+
+  // The only non-zero in the IV is len. It can be hard coded.
+  ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
+
+  ctx->buf_ptr = 0;
+  ctx->rem_ptr = 0;
+
+  return 0;
+}
+
+int groestl512_2way_update_close( groestl512_2way_context* ctx, void* output,
+                                const void* input, uint64_t databitlen )
+{
+   const int len = (int)databitlen / 128;
+   const int hashlen_m128i = 64 / 16;   // bytes to __m128i
+   const int hash_offset = SIZE512 - hashlen_m128i;
+   int rem = ctx->rem_ptr;
+   int blocks = len / SIZE512;
+   __m256i* in = (__m256i*)input;
+   int i;
+
+   // --- update ---
+
+   for ( i = 0; i < blocks; i++ )
+      TF1024_2way( ctx->chaining, &in[ i * SIZE512 ] );
+   ctx->buf_ptr = blocks * SIZE512;
+
+   for ( i = 0; i < len % SIZE512; i++ )
+       ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
+   i += rem;
+
+   //--- final ---
+
+   blocks++;      // adjust for final block
+
+   if ( i == SIZE512 - 1 )
+   {
+       // only 1 vector left in buffer, all padding at once
+       ctx->buffer[i] = m256_const1_128( _mm_set_epi8(
+                      blocks, blocks>>8,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0x80 ) );
+   }
+   else
+   {
+       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       for ( i += 1; i < SIZE512 - 1; i++ )
+           ctx->buffer[i] = m256_zero;
+       ctx->buffer[i] = m256_const1_128( _mm_set_epi8(
+                   blocks, blocks>>8, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0 ) );
+   }
+
+   TF1024_2way( ctx->chaining, ctx->buffer );
+   OF1024_2way( ctx->chaining );
+
+   for ( i = 0; i < hashlen_m128i; i++ )
+      casti_m256i( output, i ) = ctx->chaining[ hash_offset + i ];
+
+   return 0;
+}
+
+int groestl512_2way_full( groestl512_2way_context* ctx, void* output,
+                          const void* input, uint64_t datalen )
+{
+   const int len = (int)datalen >> 4;
+   const int hashlen_m128i = 64 >> 4;   // bytes to __m128i
+   const int hash_offset = SIZE512 - hashlen_m128i;
+   uint64_t blocks = len / SIZE512;
+   __m256i* in = (__m256i*)input;
+   int i;
+
+   // --- init ---
+
+   memset_zero_256( ctx->chaining, SIZE512 );
+   memset_zero_256( ctx->buffer, SIZE512 );
+   ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
+   ctx->buf_ptr = 0;
+   ctx->rem_ptr = 0;
+
+   // --- update ---
+
+   for ( i = 0; i < blocks; i++ )
+      TF1024_2way( ctx->chaining, &in[ i * SIZE512 ] );
+   ctx->buf_ptr = blocks * SIZE512;
+
+   for ( i = 0; i < len % SIZE512; i++ )
+       ctx->buffer[ ctx->rem_ptr + i ] = in[ ctx->buf_ptr + i ];
+   i += ctx->rem_ptr;
+
+   // --- close ---
+
+   blocks++;
+
+   if ( i == SIZE512 - 1 )
+   {
+       // only 1 vector left in buffer, all padding at once
+       ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
+   }
+   else
+   {
+       ctx->buffer[i] = m256_const2_64( 0, 0x80 );
+       for ( i += 1; i < SIZE512 - 1; i++ )
+           ctx->buffer[i] = m256_zero;
+       ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
+   }
+
+   TF1024_2way( ctx->chaining, ctx->buffer );
+   OF1024_2way( ctx->chaining );
+
+   for ( i = 0; i < hashlen_m128i; i++ )
+      casti_m256i( output, i ) = ctx->chaining[ hash_offset + i ];
+
+   return 0;
+}
+   
 #endif   // VAES

--- a/algo/groestl/groestl512-hash-4way.h
+++ b/algo/groestl/groestl512-hash-4way.h
@@ -10,7 +10,7 @@
 #endif
 #include <stdlib.h>

-#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+#if defined(__AVX2__) && defined(__VAES__)

 #define LENGTH (512)

@@ -36,20 +36,19 @@

 #define SIZE512 (SIZE_1024/16)

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
 typedef struct {
  __attribute__ ((aligned (128))) __m512i chaining[SIZE512];
  __attribute__ ((aligned (64))) __m512i buffer[SIZE512];
  int blk_count;     // SIZE_m128i
  int buf_ptr;       // __m128i offset
  int rem_ptr;
-  int databitlen;    // bits
 } groestl512_4way_context;


 int groestl512_4way_init( groestl512_4way_context*, uint64_t );

-//int reinit_groestl( hashState_groestl* );
-
 int groestl512_4way_update( groestl512_4way_context*, const void*,
                              uint64_t );
 int groestl512_4way_close( groestl512_4way_context*, void* );
@@ -58,5 +57,29 @@ int groestl512_4way_update_close( groestl512_4way_context*,  void*,
 int groestl512_4way_full( groestl512_4way_context*,  void*,
                          const void*, uint64_t );

+#endif   // AVX512
+
+// AVX2 + VAES
+
+typedef struct {
+  __attribute__ ((aligned (128))) __m256i chaining[SIZE512];
+  __attribute__ ((aligned (64))) __m256i buffer[SIZE512];
+  int blk_count;     // SIZE_m128i
+  int buf_ptr;       // __m128i offset
+  int rem_ptr;
+} groestl512_2way_context;
+
+
+int groestl512_2way_init( groestl512_2way_context*, uint64_t );
+
+int groestl512_2way_update( groestl512_2way_context*, const void*,
+                              uint64_t );
+int groestl512_2way_close( groestl512_2way_context*, void* );
+int groestl512_2way_update_close( groestl512_2way_context*,  void*,
+                                        const void*, uint64_t );
+int groestl512_2way_full( groestl512_2way_context*,  void*,
+                          const void*, uint64_t );
+
+
 #endif   // VAES
 #endif   // GROESTL512_HASH_4WAY_H__
--- a/algo/groestl/groestl512-intr-4way.h
+++ b/algo/groestl/groestl512-intr-4way.h
@@ -7,13 +7,12 @@
 * This code is placed in the public domain
 */

-
 #if !defined(GROESTL512_INTR_4WAY_H__)
 #define GROESTL512_INTR_4WAY_H__ 1
      
 #include "groestl512-hash-4way.h"

-#if defined(__VAES__)
+#if defined(__AVX2__) && defined(__VAES__)

 static const __m128i round_const_p[] __attribute__ ((aligned (64))) =
 {
@@ -51,6 +50,8 @@ static const __m128i round_const_q[] __attribute__ ((aligned (64))) =
   { 0x8292a2b2c2d2e2f2, 0x0212223242526272 }
 };

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
 static const __m512i TRANSP_MASK = { 0x0d0509010c040800, 0x0f070b030e060a02,
                                     0x1d1519111c141810, 0x1f171b131e161a12,
                                     0x2d2529212c242820, 0x2f272b232e262a22,
@@ -661,5 +662,578 @@ void OF1024_4way( __m512i* chaining )
  return;
 }

+#endif  // AVX512
+
+// AVX2 + VAES
+
+static const __m256i TRANSP_MASK_2WAY =
+    { 0x0d0509010c040800, 0x0f070b030e060a02,
+      0x1d1519111c141810, 0x1f171b131e161a12 };
+
+static const __m256i SUBSH_MASK0_2WAY =
+    { 0x0b0e0104070a0d00, 0x0306090c0f020508,
+      0x1b1e1114171a1d10, 0x1316191c1f121518 };
+
+static const __m256i SUBSH_MASK1_2WAY =
+    { 0x0c0f0205080b0e01, 0x04070a0d00030609,
+      0x1c1f1215181b1e11, 0x14171a1d10131619 };
+
+static const __m256i SUBSH_MASK2_2WAY =
+    { 0x0d000306090c0f02, 0x05080b0e0104070a,
+      0x1d101316191c1f12, 0x15181b1e1114171a };
+
+static const __m256i SUBSH_MASK3_2WAY =
+    { 0x0e0104070a0d0003, 0x06090c0f0205080b,
+      0x1e1114171a1d1013, 0x16191c1f1215181b };
+
+static const __m256i SUBSH_MASK4_2WAY = 
+    { 0x0f0205080b0e0104, 0x070a0d000306090c,
+      0x1f1215181b1e1114, 0x171a1d101316191c };
+
+static const __m256i SUBSH_MASK5_2WAY =
+    { 0x000306090c0f0205, 0x080b0e0104070a0d,
+      0x101316191c1f1215, 0x181b1e1114171a1d };
+
+static const __m256i SUBSH_MASK6_2WAY =
+    { 0x0104070a0d000306, 0x090c0f0205080b0e,
+      0x1114171a1d101316, 0x191c1f1215181b1e };
+
+static const __m256i SUBSH_MASK7_2WAY =
+    { 0x06090c0f0205080b, 0x0e0104070a0d0003,
+      0x16191c1f1215181b, 0x1e1114171a1d1013 };
+
+#define tos(a)    #a
+#define tostr(a)  tos(a)
+
+/* xmm[i] will be multiplied by 2
+ * xmm[j] will be lost
+ * xmm[k] has to be all 0x1b */
+#define MUL2_2WAY(i, j, k){\
+  j = _mm256_xor_si256(j, j);\
+  j = _mm256_cmpgt_epi8(j, i );\
+  i = _mm256_add_epi8(i, i);\
+  j = _mm256_and_si256(j, k);\
+  i = _mm256_xor_si256(i, j);\
+}
+
+#define MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
+  /* t_i = a_i + a_{i+1} */\
+  b6 = a0;\
+  b7 = a1;\
+  a0 = _mm256_xor_si256(a0, a1);\
+  b0 = a2;\
+  a1 = _mm256_xor_si256(a1, a2);\
+  b1 = a3;\
+  a2 = _mm256_xor_si256(a2, a3);\
+  b2 = a4;\
+  a3 = _mm256_xor_si256(a3, a4);\
+  b3 = a5;\
+  a4 = _mm256_xor_si256(a4, a5);\
+  b4 = a6;\
+  a5 = _mm256_xor_si256(a5, a6);\
+  b5 = a7;\
+  a6 = _mm256_xor_si256(a6, a7);\
+  a7 = _mm256_xor_si256(a7, b6);\
+  \
+  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
+  b0 = _mm256_xor_si256(b0, a4);\
+  b6 = _mm256_xor_si256(b6, a4);\
+  b1 = _mm256_xor_si256(b1, a5);\
+  b7 = _mm256_xor_si256(b7, a5);\
+  b2 = _mm256_xor_si256(b2, a6);\
+  b0 = _mm256_xor_si256(b0, a6);\
+  /* spill values y_4, y_5 to memory */\
+  TEMP0 = b0;\
+  b3 = _mm256_xor_si256(b3, a7);\
+  b1 = _mm256_xor_si256(b1, a7);\
+  TEMP1 = b1;\
+  b4 = _mm256_xor_si256(b4, a0);\
+  b2 = _mm256_xor_si256(b2, a0);\
+  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
+  b0 = a0;\
+  b5 = _mm256_xor_si256(b5, a1);\
+  b3 = _mm256_xor_si256(b3, a1);\
+  b1 = a1;\
+  b6 = _mm256_xor_si256(b6, a2);\
+  b4 = _mm256_xor_si256(b4, a2);\
+  TEMP2 = a2;\
+  b7 = _mm256_xor_si256(b7, a3);\
+  b5 = _mm256_xor_si256(b5, a3);\
+  \
+  /* compute x_i = t_i + t_{i+3} */\
+  a0 = _mm256_xor_si256(a0, a3);\
+  a1 = _mm256_xor_si256(a1, a4);\
+  a2 = _mm256_xor_si256(a2, a5);\
+  a3 = _mm256_xor_si256(a3, a6);\
+  a4 = _mm256_xor_si256(a4, a7);\
+  a5 = _mm256_xor_si256(a5, b0);\
+  a6 = _mm256_xor_si256(a6, b1);\
+  a7 = _mm256_xor_si256(a7, TEMP2);\
+  \
+  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
+  /* compute w_i : add y_{i+4} */\
+  b1 = m256_const1_64( 0x1b1b1b1b1b1b1b1b );\
+  MUL2_2WAY(a0, b0, b1);\
+  a0 = _mm256_xor_si256(a0, TEMP0);\
+  MUL2_2WAY(a1, b0, b1);\
+  a1 = _mm256_xor_si256(a1, TEMP1);\
+  MUL2_2WAY(a2, b0, b1);\
+  a2 = _mm256_xor_si256(a2, b2);\
+  MUL2_2WAY(a3, b0, b1);\
+  a3 = _mm256_xor_si256(a3, b3);\
+  MUL2_2WAY(a4, b0, b1);\
+  a4 = _mm256_xor_si256(a4, b4);\
+  MUL2_2WAY(a5, b0, b1);\
+  a5 = _mm256_xor_si256(a5, b5);\
+  MUL2_2WAY(a6, b0, b1);\
+  a6 = _mm256_xor_si256(a6, b6);\
+  MUL2_2WAY(a7, b0, b1);\
+  a7 = _mm256_xor_si256(a7, b7);\
+  \
+  /* compute v_i : double w_i      */\
+  /* add to y_4 y_5 .. v3, v4, ... */\
+  MUL2_2WAY(a0, b0, b1);\
+  b5 = _mm256_xor_si256(b5, a0);\
+  MUL2_2WAY(a1, b0, b1);\
+  b6 = _mm256_xor_si256(b6, a1);\
+  MUL2_2WAY(a2, b0, b1);\
+  b7 = _mm256_xor_si256(b7, a2);\
+  MUL2_2WAY(a5, b0, b1);\
+  b2 = _mm256_xor_si256(b2, a5);\
+  MUL2_2WAY(a6, b0, b1);\
+  b3 = _mm256_xor_si256(b3, a6);\
+  MUL2_2WAY(a7, b0, b1);\
+  b4 = _mm256_xor_si256(b4, a7);\
+  MUL2_2WAY(a3, b0, b1);\
+  MUL2_2WAY(a4, b0, b1);\
+  b0 = TEMP0;\
+  b1 = TEMP1;\
+  b0 = _mm256_xor_si256(b0, a3);\
+  b1 = _mm256_xor_si256(b1, a4);\
+}/*MixBytes*/
+
+/* one round
+ * a0-a7 = input rows
+ * b0-b7 = output rows
+ */
+#define SUBMIX_2WAY(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
+  /* SubBytes */\
+  b0 = _mm256_xor_si256( b0, b0 );\
+  a0 = _mm256_aesenclast_epi128( a0, b0 );\
+  a1 = _mm256_aesenclast_epi128( a1, b0 );\
+  a2 = _mm256_aesenclast_epi128( a2, b0 );\
+  a3 = _mm256_aesenclast_epi128( a3, b0 );\
+  a4 = _mm256_aesenclast_epi128( a4, b0 );\
+  a5 = _mm256_aesenclast_epi128( a5, b0 );\
+  a6 = _mm256_aesenclast_epi128( a6, b0 );\
+  a7 = _mm256_aesenclast_epi128( a7, b0 );\
+  /* MixBytes */\
+  MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7);\
+}
+
+#define ROUNDS_P_2WAY(){\
+  uint8_t round_counter = 0;\
+  for ( round_counter = 0; round_counter < 14; round_counter += 2 ) \
+  { \
+    /* AddRoundConstant P1024 */\
+    xmm8 = _mm256_xor_si256( xmm8, m256_const1_128( \
+             casti_m128i( round_const_p, round_counter ) ) ); \
+    /* ShiftBytes P1024 + pre-AESENCLAST */\
+    xmm8  = _mm256_shuffle_epi8( xmm8,  SUBSH_MASK0_2WAY ); \
+    xmm9  = _mm256_shuffle_epi8( xmm9,  SUBSH_MASK1_2WAY );\
+    xmm10 = _mm256_shuffle_epi8( xmm10, SUBSH_MASK2_2WAY );\
+    xmm11 = _mm256_shuffle_epi8( xmm11, SUBSH_MASK3_2WAY );\
+    xmm12 = _mm256_shuffle_epi8( xmm12, SUBSH_MASK4_2WAY );\
+    xmm13 = _mm256_shuffle_epi8( xmm13, SUBSH_MASK5_2WAY );\
+    xmm14 = _mm256_shuffle_epi8( xmm14, SUBSH_MASK6_2WAY );\
+    xmm15 = _mm256_shuffle_epi8( xmm15, SUBSH_MASK7_2WAY );\
+    /* SubBytes + MixBytes */\
+    SUBMIX_2WAY(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
+    \
+     /* AddRoundConstant P1024 */\
+    xmm0 = _mm256_xor_si256( xmm0, m256_const1_128( \
+             casti_m128i( round_const_p, round_counter+1 ) ) ); \
+    /* ShiftBytes P1024 + pre-AESENCLAST */\
+    xmm0 = _mm256_shuffle_epi8( xmm0, SUBSH_MASK0_2WAY );\
+    xmm1 = _mm256_shuffle_epi8( xmm1, SUBSH_MASK1_2WAY );\
+    xmm2 = _mm256_shuffle_epi8( xmm2, SUBSH_MASK2_2WAY );\
+    xmm3 = _mm256_shuffle_epi8( xmm3, SUBSH_MASK3_2WAY );\
+    xmm4 = _mm256_shuffle_epi8( xmm4, SUBSH_MASK4_2WAY );\
+    xmm5 = _mm256_shuffle_epi8( xmm5, SUBSH_MASK5_2WAY );\
+    xmm6 = _mm256_shuffle_epi8( xmm6, SUBSH_MASK6_2WAY );\
+    xmm7 = _mm256_shuffle_epi8( xmm7, SUBSH_MASK7_2WAY );\
+    /* SubBytes + MixBytes */\
+     SUBMIX_2WAY(xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
+  }\
+}
+
+#define ROUNDS_Q_2WAY(){\
+  uint8_t round_counter = 0;\
+  for ( round_counter = 0; round_counter < 14; round_counter += 2) \
+  { \
+    /* AddRoundConstant Q1024 */\
+    xmm1 = m256_neg1;\
+    xmm8  = _mm256_xor_si256( xmm8,  xmm1 );\
+    xmm9  = _mm256_xor_si256( xmm9,  xmm1 );\
+    xmm10 = _mm256_xor_si256( xmm10, xmm1 );\
+    xmm11 = _mm256_xor_si256( xmm11, xmm1 );\
+    xmm12 = _mm256_xor_si256( xmm12, xmm1 );\
+    xmm13 = _mm256_xor_si256( xmm13, xmm1 );\
+    xmm14 = _mm256_xor_si256( xmm14, xmm1 );\
+    xmm15 = _mm256_xor_si256( xmm15, m256_const1_128( \
+                 casti_m128i( round_const_q, round_counter ) ) ); \
+    /* ShiftBytes Q1024 + pre-AESENCLAST */\
+    xmm8  = _mm256_shuffle_epi8( xmm8,  SUBSH_MASK1_2WAY );\
+    xmm9  = _mm256_shuffle_epi8( xmm9,  SUBSH_MASK3_2WAY );\
+    xmm10 = _mm256_shuffle_epi8( xmm10, SUBSH_MASK5_2WAY );\
+    xmm11 = _mm256_shuffle_epi8( xmm11, SUBSH_MASK7_2WAY );\
+    xmm12 = _mm256_shuffle_epi8( xmm12, SUBSH_MASK0_2WAY );\
+    xmm13 = _mm256_shuffle_epi8( xmm13, SUBSH_MASK2_2WAY );\
+    xmm14 = _mm256_shuffle_epi8( xmm14, SUBSH_MASK4_2WAY );\
+    xmm15 = _mm256_shuffle_epi8( xmm15, SUBSH_MASK6_2WAY );\
+    /* SubBytes + MixBytes */\
+    SUBMIX_2WAY(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
+    \
+    /* AddRoundConstant Q1024 */\
+    xmm9 = m256_neg1;\
+    xmm0 = _mm256_xor_si256( xmm0, xmm9 );\
+    xmm1 = _mm256_xor_si256( xmm1, xmm9 );\
+    xmm2 = _mm256_xor_si256( xmm2, xmm9 );\
+    xmm3 = _mm256_xor_si256( xmm3, xmm9 );\
+    xmm4 = _mm256_xor_si256( xmm4, xmm9 );\
+    xmm5 = _mm256_xor_si256( xmm5, xmm9 );\
+    xmm6 = _mm256_xor_si256( xmm6, xmm9 );\
+    xmm7 = _mm256_xor_si256( xmm7, m256_const1_128( \
+             casti_m128i( round_const_q, round_counter+1 ) ) ); \
+    /* ShiftBytes Q1024 + pre-AESENCLAST */\
+    xmm0 = _mm256_shuffle_epi8( xmm0, SUBSH_MASK1_2WAY );\
+    xmm1 = _mm256_shuffle_epi8( xmm1, SUBSH_MASK3_2WAY );\
+    xmm2 = _mm256_shuffle_epi8( xmm2, SUBSH_MASK5_2WAY );\
+    xmm3 = _mm256_shuffle_epi8( xmm3, SUBSH_MASK7_2WAY );\
+    xmm4 = _mm256_shuffle_epi8( xmm4, SUBSH_MASK0_2WAY );\
+    xmm5 = _mm256_shuffle_epi8( xmm5, SUBSH_MASK2_2WAY );\
+    xmm6 = _mm256_shuffle_epi8( xmm6, SUBSH_MASK4_2WAY );\
+    xmm7 = _mm256_shuffle_epi8( xmm7, SUBSH_MASK6_2WAY );\
+    /* SubBytes + MixBytes */\
+    SUBMIX_2WAY(xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
+  }\
+}
+
+#define Matrix_Transpose_2way(i0, i1, i2, i3, i4, i5, i6, i7, t0, t1, t2, t3, t4, t5, t6, t7){\
+  t0 = TRANSP_MASK_2WAY;\
+\
+  i6 = _mm256_shuffle_epi8(i6, t0);\
+  i0 = _mm256_shuffle_epi8(i0, t0);\
+  i1 = _mm256_shuffle_epi8(i1, t0);\
+  i2 = _mm256_shuffle_epi8(i2, t0);\
+  i3 = _mm256_shuffle_epi8(i3, t0);\
+  t1 = i2;\
+  i4 = _mm256_shuffle_epi8(i4, t0);\
+  i5 = _mm256_shuffle_epi8(i5, t0);\
+  t2 = i4;\
+  t3 = i6;\
+  i7 = _mm256_shuffle_epi8(i7, t0);\
+\
+  /* continue with unpack using 4 temp registers */\
+  t0 = i0;\
+  t2 = _mm256_unpackhi_epi16(t2, i5);\
+  i4 = _mm256_unpacklo_epi16(i4, i5);\
+  t3 = _mm256_unpackhi_epi16(t3, i7);\
+  i6 = _mm256_unpacklo_epi16(i6, i7);\
+  t0 = _mm256_unpackhi_epi16(t0, i1);\
+  t1 = _mm256_unpackhi_epi16(t1, i3);\
+  i2 = _mm256_unpacklo_epi16(i2, i3);\
+  i0 = _mm256_unpacklo_epi16(i0, i1);\
+\
+  /* shuffle with immediate */\
+  t0 = _mm256_shuffle_epi32(t0, 216);\
+  t1 = _mm256_shuffle_epi32(t1, 216);\
+  t2 = _mm256_shuffle_epi32(t2, 216);\
+  t3 = _mm256_shuffle_epi32(t3, 216);\
+  i0 = _mm256_shuffle_epi32(i0, 216);\
+  i2 = _mm256_shuffle_epi32(i2, 216);\
+  i4 = _mm256_shuffle_epi32(i4, 216);\
+  i6 = _mm256_shuffle_epi32(i6, 216);\
+\
+  /* continue with unpack */\
+  t4 = i0;\
+  i0 = _mm256_unpacklo_epi32(i0, i2);\
+  t4 = _mm256_unpackhi_epi32(t4, i2);\
+  t5 = t0;\
+  t0 = _mm256_unpacklo_epi32(t0, t1);\
+  t5 = _mm256_unpackhi_epi32(t5, t1);\
+  t6 = i4;\
+  i4 = _mm256_unpacklo_epi32(i4, i6);\
+  t7 = t2;\
+  t6 = _mm256_unpackhi_epi32(t6, i6);\
+  i2 = t0;\
+  t2 = _mm256_unpacklo_epi32(t2, t3);\
+  i3 = t0;\
+  t7 = _mm256_unpackhi_epi32(t7, t3);\
+\
+  /* there are now 2 rows in each xmm */\
+  /* unpack to get 1 row of CV in each xmm */\
+  i1 = i0;\
+  i1 = _mm256_unpackhi_epi64(i1, i4);\
+  i0 = _mm256_unpacklo_epi64(i0, i4);\
+  i4 = t4;\
+  i3 = _mm256_unpackhi_epi64(i3, t2);\
+  i5 = t4;\
+  i2 = _mm256_unpacklo_epi64(i2, t2);\
+  i6 = t5;\
+  i5 = _mm256_unpackhi_epi64(i5, t6);\
+  i7 = t5;\
+  i4 = _mm256_unpacklo_epi64(i4, t6);\
+  i7 = _mm256_unpackhi_epi64(i7, t7);\
+  i6 = _mm256_unpacklo_epi64(i6, t7);\
+  /* transpose done */\
+}/**/
+
+#define Matrix_Transpose_INV_2way(i0, i1, i2, i3, i4, i5, i6, i7, o0, o1, o2, t0, t1, t2, t3, t4){\
+  /*  transpose matrix to get output format */\
+  o1 = i0;\
+  i0 = _mm256_unpacklo_epi64(i0, i1);\
+  o1 = _mm256_unpackhi_epi64(o1, i1);\
+  t0 = i2;\
+  i2 = _mm256_unpacklo_epi64(i2, i3);\
+  t0 = _mm256_unpackhi_epi64(t0, i3);\
+  t1 = i4;\
+  i4 = _mm256_unpacklo_epi64(i4, i5);\
+  t1 = _mm256_unpackhi_epi64(t1, i5);\
+  t2 = i6;\
+  o0 = TRANSP_MASK_2WAY;\
+  i6 = _mm256_unpacklo_epi64(i6, i7);\
+  t2 = _mm256_unpackhi_epi64(t2, i7);\
+  /* load transpose mask into a register, because it will be used 8 times */\
+  i0 = _mm256_shuffle_epi8(i0, o0);\
+  i2 = _mm256_shuffle_epi8(i2, o0);\
+  i4 = _mm256_shuffle_epi8(i4, o0);\
+  i6 = _mm256_shuffle_epi8(i6, o0);\
+  o1 = _mm256_shuffle_epi8(o1, o0);\
+  t0 = _mm256_shuffle_epi8(t0, o0);\
+  t1 = _mm256_shuffle_epi8(t1, o0);\
+  t2 = _mm256_shuffle_epi8(t2, o0);\
+  /* continue with unpack using 4 temp registers */\
+  t3 = i4;\
+  o2 = o1;\
+  o0 = i0;\
+  t4 = t1;\
+  \
+  t3 = _mm256_unpackhi_epi16(t3, i6);\
+  i4 = _mm256_unpacklo_epi16(i4, i6);\
+  o0 = _mm256_unpackhi_epi16(o0, i2);\
+  i0 = _mm256_unpacklo_epi16(i0, i2);\
+  o2 = _mm256_unpackhi_epi16(o2, t0);\
+  o1 = _mm256_unpacklo_epi16(o1, t0);\
+  t4 = _mm256_unpackhi_epi16(t4, t2);\
+  t1 = _mm256_unpacklo_epi16(t1, t2);\
+  /* shuffle with immediate */\
+  i4 = _mm256_shuffle_epi32(i4, 216);\
+  t3 = _mm256_shuffle_epi32(t3, 216);\
+  o1 = _mm256_shuffle_epi32(o1, 216);\
+  o2 = _mm256_shuffle_epi32(o2, 216);\
+  i0 = _mm256_shuffle_epi32(i0, 216);\
+  o0 = _mm256_shuffle_epi32(o0, 216);\
+  t1 = _mm256_shuffle_epi32(t1, 216);\
+  t4 = _mm256_shuffle_epi32(t4, 216);\
+  /* continue with unpack */\
+  i1 = i0;\
+  i3 = o0;\
+  i5 = o1;\
+  i7 = o2;\
+  i0 = _mm256_unpacklo_epi32(i0, i4);\
+  i1 = _mm256_unpackhi_epi32(i1, i4);\
+  o0 = _mm256_unpacklo_epi32(o0, t3);\
+  i3 = _mm256_unpackhi_epi32(i3, t3);\
+  o1 = _mm256_unpacklo_epi32(o1, t1);\
+  i5 = _mm256_unpackhi_epi32(i5, t1);\
+  o2 = _mm256_unpacklo_epi32(o2, t4);\
+  i7 = _mm256_unpackhi_epi32(i7, t4);\
+  /* transpose done */\
+}/**/
+
+void INIT_2way( __m256i *chaining )
+{
+  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+
+  /* load IV into registers xmm8 - xmm15 */
+  xmm8 = chaining[0];
+  xmm9 = chaining[1];
+  xmm10 = chaining[2];
+  xmm11 = chaining[3];
+  xmm12 = chaining[4];
+  xmm13 = chaining[5];
+  xmm14 = chaining[6];
+  xmm15 = chaining[7];
+
+  /* transform chaining value from column ordering into row ordering */
+  Matrix_Transpose_2way(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);
+
+  /* store transposed IV */
+  chaining[0] = xmm8;
+  chaining[1] = xmm9;
+  chaining[2] = xmm10;
+  chaining[3] = xmm11;
+  chaining[4] = xmm12;
+  chaining[5] = xmm13;
+  chaining[6] = xmm14;
+  chaining[7] = xmm15;
+}
+
+void TF1024_2way( __m256i *chaining, const __m256i *message )
+{
+  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  static __m256i QTEMP[8];
+  static __m256i TEMP0;
+  static __m256i TEMP1;
+  static __m256i TEMP2;
+
+  /* load message into registers xmm8 - xmm15 (Q = message) */
+  xmm8 = message[0];
+  xmm9 = message[1];
+  xmm10 = message[2];
+  xmm11 = message[3];
+  xmm12 = message[4];
+  xmm13 = message[5];
+  xmm14 = message[6];
+  xmm15 = message[7];
+
+  /* transform message M from column ordering into row ordering */
+  Matrix_Transpose_2way(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);
+
+  /* store message M (Q input) for later */
+  QTEMP[0] = xmm8;
+  QTEMP[1] = xmm9;
+  QTEMP[2] = xmm10;
+  QTEMP[3] = xmm11;
+  QTEMP[4] = xmm12;
+  QTEMP[5] = xmm13;
+  QTEMP[6] = xmm14;
+  QTEMP[7] = xmm15;
+
+  /* xor CV to message to get P input */
+  /* result: CV+M in xmm8...xmm15 */
+  xmm8 = _mm256_xor_si256( xmm8,  (chaining[0]) );
+  xmm9 = _mm256_xor_si256( xmm9,  (chaining[1]) );
+  xmm10 = _mm256_xor_si256( xmm10, (chaining[2]) );
+  xmm11 = _mm256_xor_si256( xmm11, (chaining[3]) );
+  xmm12 = _mm256_xor_si256( xmm12, (chaining[4]) );
+  xmm13 = _mm256_xor_si256( xmm13, (chaining[5]) );
+  xmm14 = _mm256_xor_si256( xmm14, (chaining[6]) );
+  xmm15 = _mm256_xor_si256( xmm15, (chaining[7]) );
+
+  /* compute permutation P */
+  /* result: P(CV+M) in xmm8...xmm15 */
+  ROUNDS_P_2WAY();
+
+  /* xor CV to P output (feed-forward) */
+  /* result: P(CV+M)+CV in xmm8...xmm15 */
+  xmm8 = _mm256_xor_si256( xmm8,  (chaining[0]) );
+  xmm9 = _mm256_xor_si256( xmm9,  (chaining[1]) );
+  xmm10 = _mm256_xor_si256( xmm10, (chaining[2]) );
+  xmm11 = _mm256_xor_si256( xmm11, (chaining[3]) );
+  xmm12 = _mm256_xor_si256( xmm12, (chaining[4]) );
+  xmm13 = _mm256_xor_si256( xmm13, (chaining[5]) );
+  xmm14 = _mm256_xor_si256( xmm14, (chaining[6]) );
+  xmm15 = _mm256_xor_si256( xmm15, (chaining[7]) );
+
+  /* store P(CV+M)+CV */
+  chaining[0] = xmm8;
+  chaining[1] = xmm9;
+  chaining[2] = xmm10;
+  chaining[3] = xmm11;
+  chaining[4] = xmm12;
+  chaining[5] = xmm13;
+  chaining[6] = xmm14;
+  chaining[7] = xmm15;
+
+  /* load message M (Q input) into xmm8-15 */
+  xmm8 = QTEMP[0];
+  xmm9 = QTEMP[1];
+  xmm10 = QTEMP[2];
+  xmm11 = QTEMP[3];
+  xmm12 = QTEMP[4];
+  xmm13 = QTEMP[5];
+  xmm14 = QTEMP[6];
+  xmm15 = QTEMP[7];
+
+  /* compute permutation Q */
+  /* result: Q(M) in xmm8...xmm15 */
+  ROUNDS_Q_2WAY();
+
+  /* xor Q output */
+  /* result: P(CV+M)+CV+Q(M) in xmm8...xmm15 */
+  xmm8 = _mm256_xor_si256( xmm8,  (chaining[0]) );
+  xmm9 = _mm256_xor_si256( xmm9,  (chaining[1]) );
+  xmm10 = _mm256_xor_si256( xmm10, (chaining[2]) );
+  xmm11 = _mm256_xor_si256( xmm11, (chaining[3]) );
+  xmm12 = _mm256_xor_si256( xmm12, (chaining[4]) );
+  xmm13 = _mm256_xor_si256( xmm13, (chaining[5]) );
+  xmm14 = _mm256_xor_si256( xmm14, (chaining[6]) );
+  xmm15 = _mm256_xor_si256( xmm15, (chaining[7]) );
+
+  /* store CV */
+  chaining[0] = xmm8;
+  chaining[1] = xmm9;
+  chaining[2] = xmm10;
+  chaining[3] = xmm11;
+  chaining[4] = xmm12;
+  chaining[5] = xmm13;
+  chaining[6] = xmm14;
+  chaining[7] = xmm15;
+
+  return;
+}
+
+void OF1024_2way( __m256i* chaining )
+{
+  static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
+  static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
+  static __m256i TEMP0;
+  static __m256i TEMP1;
+  static __m256i TEMP2;
+
+  /* load CV into registers xmm8 - xmm15 */
+  xmm8 = chaining[0];
+  xmm9 = chaining[1];
+  xmm10 = chaining[2];
+  xmm11 = chaining[3];
+  xmm12 = chaining[4];
+  xmm13 = chaining[5];
+  xmm14 = chaining[6];
+  xmm15 = chaining[7];
+
+  /* compute permutation P */
+  /* result: P(CV) in xmm8...xmm15 */
+  ROUNDS_P_2WAY();
+
+  /* xor CV to P output (feed-forward) */
+  /* result: P(CV)+CV in xmm8...xmm15 */
+  xmm8 = _mm256_xor_si256( xmm8,  (chaining[0]) );
+  xmm9 = _mm256_xor_si256( xmm9,  (chaining[1]) );
+  xmm10 = _mm256_xor_si256( xmm10, (chaining[2]) );
+  xmm11 = _mm256_xor_si256( xmm11, (chaining[3]) );
+  xmm12 = _mm256_xor_si256( xmm12, (chaining[4]) );
+  xmm13 = _mm256_xor_si256( xmm13, (chaining[5]) );
+  xmm14 = _mm256_xor_si256( xmm14, (chaining[6]) );
+  xmm15 = _mm256_xor_si256( xmm15, (chaining[7]) );
+
+  /* transpose CV back from row ordering to column ordering */
+  /* result: final hash value in xmm0, xmm6, xmm13, xmm15 */
+  Matrix_Transpose_INV_2way(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm4, xmm0, xmm6, xmm1, xmm2, xmm3, xmm5, xmm7);
+
+  /* we only need to return the truncated half of the state */
+  chaining[4] = xmm0;
+  chaining[5] = xmm6;
+  chaining[6] = xmm13;
+  chaining[7] = xmm15;
+
+  return;
+}
+
+
+
 #endif  // VAES
 #endif  // GROESTL512_INTR_4WAY_H__
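
Note on `MUL2_2WAY` above: the macro doubles every byte in GF(2^8), using a signed compare against zero to find bytes with the top bit set and XORing `0x1b` into those bytes after the shift. A scalar sketch of the same per-byte step follows; the function names are hypothetical and the vector code in this commit is the reference, not this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Double one byte in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf256_double( uint8_t x )
{
   /* 0xff when the top bit is set, else 0x00; this is what
    * _mm256_cmpgt_epi8( zero, x ) produces for each byte */
   uint8_t mask = (uint8_t)-( x >> 7 );

   /* x + x drops the top bit; the masked 0x1b applies the reduction */
   return (uint8_t)( ( x << 1 ) ^ ( mask & 0x1b ) );
}

/* Apply the doubling to a whole buffer, one byte at a time. */
static void gf256_double_bytes( uint8_t *v, size_t n )
{
   assert( v != NULL );
   for ( size_t i = 0; i < n; i++ )
      v[i] = gf256_double( v[i] );
}
```

`MixBytes_2way` calls the doubling twice per row, which is why the macro needs a scratch register (`j`) and a register preloaded with `0x1b` in every byte (`k`).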
--- a/algo/hodl/hodl-gate.c
+++ b/algo/hodl/hodl-gate.c
@@ -99,9 +99,13 @@ void hodl_build_block_header( struct work* g_work, uint32_t version,
 // called only by thread 0, saves a backup of g_work
 void hodl_get_new_work( struct work* work, struct work* g_work)
 {
-     work_free( &hodl_work );
-     work_copy( &hodl_work, g_work );
-     hodl_work.data[ algo_gate.nonce_index ] = ( clock() + rand() ) % 9999;
+   pthread_rwlock_rdlock( &g_work_lock );
+
+   work_free( &hodl_work );
+   work_copy( &hodl_work, g_work );
+   hodl_work.data[ algo_gate.nonce_index ] = ( clock() + rand() ) % 9999;
+
+   pthread_rwlock_unlock( &g_work_lock );
 }

 json_t *hodl_longpoll_rpc_call( CURL *curl, int *err, char* lp_url )
@@ -155,11 +159,10 @@ bool register_hodl_algo( algo_gate_t* gate )
  applog( LOG_ERR, "Only CPUs with AES are supported, use legacy version.");
  return false;
 #endif
-//  if ( TOTAL_CHUNKS % opt_n_threads )
-//  {
-//     applog(LOG_ERR,"Thread count must be power of 2.");
-//     return false;
-//  }
+
+  if ( GARBAGE_SIZE % opt_n_threads )
+     applog( LOG_WARNING, "WARNING: Thread count must be a power of 2. Miner may crash or produce invalid hashes!" );
+
  pthread_barrier_init( &hodl_barrier, NULL, opt_n_threads );
  gate->optimizations         = SSE42_OPT | AES_OPT | AVX2_OPT;
  gate->scanhash              = (void*)&hodl_scanhash;
@@ -171,7 +174,7 @@ bool register_hodl_algo( algo_gate_t* gate )
  gate->resync_threads        = (void*)&hodl_resync_threads;
  gate->do_this_thread        = (void*)&hodl_do_this_thread;
  gate->work_cmp_size         = 76;
-  hodl_scratchbuf = (unsigned char*)malloc( 1 << 30 );
+  hodl_scratchbuf = (unsigned char*)_mm_malloc( 1 << 30, 64 );
  allow_getwork = false;
  opt_target_factor = 8388608.0;
  return ( hodl_scratchbuf != NULL );
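
Note on the thread count check above: the test is a modulo, while the message talks about powers of 2. The two agree only if `GARBAGE_SIZE` is itself a power of two, which is an assumption here since its definition is not shown in this diff. A small sketch comparing the two forms, with hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when n is 1, 2, 4, 8, ... */
static bool is_power_of_2( uint32_t n )
{
   return n != 0 && ( n & ( n - 1 ) ) == 0;
}

/* The modulo form: true when n divides size without remainder. */
static bool divides_evenly( uint32_t size, uint32_t n )
{
   assert( n != 0 );
   return size % n == 0;
}
```

For a size of `1 << 30` (the scratch buffer size allocated in this file, used here only as an example) both functions give the same answer for every thread count up to the size, because the only divisors of a power of two are smaller powers of two.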
--- a/algo/hodl/hodl-wolf.c
+++ b/algo/hodl/hodl-wolf.c
@@ -70,7 +70,7 @@ int scanhash_hodl_wolf( struct work* work, uint32_t max_nonce,
    uint32_t *ptarget = work->target;
    int threadNumber = mythr->id;
    CacheEntry *Garbage = (CacheEntry*)hodl_scratchbuf;
-    CacheEntry Cache[AES_PARALLEL_N];
+    CacheEntry Cache[AES_PARALLEL_N] __attribute__ ((aligned (64)));
    __m128i* data[AES_PARALLEL_N];
    const __m128i* next[AES_PARALLEL_N];
    uint32_t CollisionCount = 0;
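
Note on the alignment changes in hodl-gate.c and hodl-wolf.c: both the heap scratch buffer and the stack `Cache` array are now 64-byte aligned. The sketch below shows the same idea with the standard C11 `aligned_alloc` swapped in for `_mm_malloc`, so it builds without x86 headers; the helper names and the `CACHE_LINE` constant are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64

/* Allocate at least `size` bytes on a 64-byte boundary.
 * aligned_alloc expects the size to be a multiple of the alignment,
 * so the request is rounded up first. Release with free(). */
static void *alloc_cacheline_aligned( size_t size )
{
   assert( size > 0 );
   size_t rounded = ( size + CACHE_LINE - 1 ) / CACHE_LINE * CACHE_LINE;
   return aligned_alloc( CACHE_LINE, rounded );
}

/* Check a pointer against an alignment in bytes. */
static int is_aligned( const void *p, size_t alignment )
{
   assert( alignment != 0 );
   return ( (uintptr_t)p % alignment ) == 0;
}
```

One difference to keep in mind: memory obtained from `_mm_malloc` has to be released with `_mm_free`, not `free`, whereas `aligned_alloc` pairs with plain `free`.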
--- a/algo/lyra2/allium-4way.c
+++ b/algo/lyra2/allium-4way.c
@@ -174,24 +174,19 @@ void allium_16way_hash( void *state, const void *input )
 #if defined(__VAES__)

   intrlv_4x128( vhash, hash0, hash1, hash2, hash3, 256 );
-
-   groestl256_4way_full( &ctx.groestl, vhash, vhash, 256 );
-
+   groestl256_4way_full( &ctx.groestl, vhash, vhash, 32 );
   dintrlv_4x128( state, state+32, state+64, state+96, vhash, 256 );
+
   intrlv_4x128( vhash, hash4, hash5, hash6, hash7, 256 );
-
-   groestl256_4way_full( &ctx.groestl, vhash, vhash, 256 );
-   
+   groestl256_4way_full( &ctx.groestl, vhash, vhash, 32 );
   dintrlv_4x128( state+128, state+160, state+192, state+224, vhash, 256 );
+
   intrlv_4x128( vhash, hash8, hash9, hash10, hash11, 256 );
-
-   groestl256_4way_full( &ctx.groestl, vhash, vhash, 256 );
-
+   groestl256_4way_full( &ctx.groestl, vhash, vhash, 32 );
   dintrlv_4x128( state+256, state+288, state+320, state+352, vhash, 256 );
-   intrlv_4x128( vhash, hash12, hash13, hash14, hash15, 256 );

-   groestl256_4way_full( &ctx.groestl, vhash, vhash, 256 );
- 
+   intrlv_4x128( vhash, hash12, hash13, hash14, hash15, 256 );
+   groestl256_4way_full( &ctx.groestl, vhash, vhash, 32 );
   dintrlv_4x128( state+384, state+416, state+448, state+480, vhash, 256 );
   
 #else
@@ -262,8 +257,11 @@ typedef struct {
   keccak256_4way_context    keccak;
   cubehashParam             cube;
   skein256_4way_context     skein;
+#if defined(__VAES__)
+   groestl256_2way_context   groestl;
+#else
   hashState_groestl256      groestl;
-
+#endif
 } allium_8way_ctx_holder;

 static __thread allium_8way_ctx_holder allium_8way_ctx;
@@ -273,7 +271,11 @@ bool init_allium_8way_ctx()
   keccak256_4way_init( &allium_8way_ctx.keccak );
   cubehashInit( &allium_8way_ctx.cube, 256, 16, 32 );
   skein256_4way_init( &allium_8way_ctx.skein );
+#if defined(__VAES__)
+   groestl256_2way_init( &allium_8way_ctx.groestl, 32 );
+#else
   init_groestl256( &allium_8way_ctx.groestl, 32 );
+#endif
   return true;
 }

@@ -352,9 +354,28 @@ void allium_8way_hash( void *hash, const void *input )
   skein256_4way_update( &ctx.skein, vhashB, 32 );
   skein256_4way_close( &ctx.skein, vhashB );

+#if defined(__VAES__)
+
+   uint64_t vhashC[4*2] __attribute__ ((aligned (64)));
+   uint64_t vhashD[4*2] __attribute__ ((aligned (64)));
+   
+   rintrlv_4x64_2x128( vhashC, vhashD, vhashA, 256 );
+   groestl256_2way_full( &ctx.groestl, vhashC, vhashC, 32 );
+   groestl256_2way_full( &ctx.groestl, vhashD, vhashD, 32 );
+   dintrlv_2x128( hash0, hash1, vhashC, 256 );
+   dintrlv_2x128( hash2, hash3, vhashD, 256 );
+
+   rintrlv_4x64_2x128( vhashC, vhashD, vhashB, 256 );
+   groestl256_2way_full( &ctx.groestl, vhashC, vhashC, 32 );
+   groestl256_2way_full( &ctx.groestl, vhashD, vhashD, 32 );
+   dintrlv_2x128( hash4, hash5, vhashC, 256 );
+   dintrlv_2x128( hash6, hash7, vhashD, 256 );
+
+#else
+
   dintrlv_4x64( hash0, hash1, hash2, hash3, vhashA, 256 );
   dintrlv_4x64( hash4, hash5, hash6, hash7, vhashB, 256 );
-
+   
   groestl256_full( &ctx.groestl, hash0, hash0, 256 );
   groestl256_full( &ctx.groestl, hash1, hash1, 256 );
   groestl256_full( &ctx.groestl, hash2, hash2, 256 );
@@ -363,6 +384,8 @@ void allium_8way_hash( void *hash, const void *input )
   groestl256_full( &ctx.groestl, hash5, hash5, 256 );
   groestl256_full( &ctx.groestl, hash6, hash6, 256 );
   groestl256_full( &ctx.groestl, hash7, hash7, 256 );
+
+#endif
 }

 int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
--- a/algo/lyra2/lyra2-gate.c
+++ b/algo/lyra2/lyra2-gate.c
@@ -187,7 +187,8 @@ bool register_allium_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_allium;
  gate->hash      = (void*)&allium_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT
+                      | VAES_OPT | VAES256_OPT;
  opt_target_factor = 256.0;
  return true;
 };
@@ -215,9 +216,6 @@ void phi2_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
   size_t t;

   algo_gate.gen_merkle_root( merkle_tree, sctx );
-   // Increment extranonce2
-   for ( t = 0; t < sctx->xnonce2_size && !( ++sctx->job.xnonce2[t] ); t++ );
-   // Assemble block header
   algo_gate.build_block_header( g_work, le32dec( sctx->job.version ),
                  (uint32_t*) sctx->job.prevhash, (uint32_t*) merkle_tree,
                  le32dec( sctx->job.ntime ), le32dec(sctx->job.nbits), NULL );
@@ -225,7 +223,6 @@ void phi2_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
      g_work->data[ 20+t ] = ((uint32_t*)sctx->job.extra)[t];
 }

-
 bool register_phi2_algo( algo_gate_t* gate )
 {
   gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
--- a/algo/lyra2/phi2-4way.c
+++ b/algo/lyra2/phi2-4way.c
@@ -4,7 +4,7 @@
 #include "algo/gost/sph_gost.h"
 #include "algo/cubehash/cubehash_sse2.h"
 #include "lyra2.h"
-#if defined(__VAES__)
+#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
  #include "algo/echo/echo-hash-4way.h"
 #elif defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
--- a/algo/nist5/zr5.c
+++ b/algo/nist5/zr5.c
@@ -156,6 +156,8 @@ int scanhash_zr5( struct work *work, uint32_t max_nonce,
 void zr5_get_new_work( struct work* work, struct work* g_work, int thr_id,
                       uint32_t* end_nonce_ptr )
 {
+   pthread_rwlock_rdlock( &g_work_lock );
+
   // ignore POK in first word
   const int wkcmp_sz = 72;  // (19-1) * sizeof(uint32_t)
   uint32_t *nonceptr = work->data + algo_gate.nonce_index;
@@ -171,6 +173,8 @@ void zr5_get_new_work( struct work* work, struct work* g_work, int thr_id,
   }
   else
       ++(*nonceptr);
+
+   pthread_rwlock_unlock( &g_work_lock );
 }

 void zr5_display_pok( struct work* work )
--- a/algo/quark/hmq1725-4way.c
+++ b/algo/quark/hmq1725-4way.c
@@ -16,7 +16,7 @@
 #include "algo/simd/simd-hash-2way.h"
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/shabal/shabal-hash-4way.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/haval/haval-hash-4way.h"
@@ -40,7 +40,7 @@ union _hmq1725_8way_context_overlay
    cube_4way_context       cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_8way_context     sha512;
@@ -363,14 +363,14 @@ extern void hmq1725_8way_hash(void *state, const void *input)
   dintrlv_8x64_512( hash0, hash1, hash2, hash3,
                     hash4, hash5, hash6, hash7, vhash );

-   sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-   sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-   sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-   sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
-   sph_fugue512_full( &ctx.fugue, hash4, hash4, 64 );
-   sph_fugue512_full( &ctx.fugue, hash5, hash5, 64 );
-   sph_fugue512_full( &ctx.fugue, hash6, hash6, 64 );
-   sph_fugue512_full( &ctx.fugue, hash7, hash7, 64 );
+   fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+   fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+   fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+   fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+   fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+   fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+   fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+   fugue512_full( &ctx.fugue, hash7, hash7, 64 );

   intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3,
                           hash4, hash5, hash6, hash7 );
@@ -459,21 +459,21 @@ extern void hmq1725_8way_hash(void *state, const void *input)
                                       m512_zero );

   if ( hash0[0] & mask )
-      sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+      fugue512_full( &ctx.fugue, hash0, hash0, 64 );
   if ( hash1[0] & mask )
-      sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+      fugue512_full( &ctx.fugue, hash1, hash1, 64 );
   if ( hash2[0] & mask )
-      sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+      fugue512_full( &ctx.fugue, hash2, hash2, 64 );
   if ( hash3[0] & mask )
-      sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+      fugue512_full( &ctx.fugue, hash3, hash3, 64 );
   if ( hash4[0] & mask )
-      sph_fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+      fugue512_full( &ctx.fugue, hash4, hash4, 64 );
   if ( hash5[0] & mask )
-      sph_fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+      fugue512_full( &ctx.fugue, hash5, hash5, 64 );
   if ( hash6[0] & mask )
-      sph_fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+      fugue512_full( &ctx.fugue, hash6, hash6, 64 );
   if ( hash7[0] & mask )
-      sph_fugue512_full( &ctx.fugue, hash7, hash7, 64 );
+      fugue512_full( &ctx.fugue, hash7, hash7, 64 );

   intrlv_8x64_512( vhashA, hash0, hash1, hash2, hash3,
                            hash4, hash5, hash6, hash7 );
@@ -628,7 +628,7 @@ union _hmq1725_4way_context_overlay
    simd_2way_context       simd;
    hashState_echo          echo;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_4way_context     sha512;
@@ -846,10 +846,10 @@ extern void hmq1725_4way_hash(void *state, const void *input)

    dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );

-    sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-    sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-    sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-    sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+    fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+    fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+    fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+    fugue512_full( &ctx.fugue, hash3, hash3, 64 );

    // In this situation serial simd seems to be faster.

@@ -920,13 +920,13 @@ extern void hmq1725_4way_hash(void *state, const void *input)
   h_mask = _mm256_movemask_epi8( vh_mask );

   if ( hash0[0] & mask ) 
-      sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+      fugue512_full( &ctx.fugue, hash0, hash0, 64 );
   if ( hash1[0] & mask ) 
-      sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+      fugue512_full( &ctx.fugue, hash1, hash1, 64 );
   if ( hash2[0] & mask ) 
-      sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+      fugue512_full( &ctx.fugue, hash2, hash2, 64 );
   if ( hash3[0] & mask ) 
-      sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+      fugue512_full( &ctx.fugue, hash3, hash3, 64 );

   intrlv_4x64( vhashA, hash0, hash1, hash2, hash3, 512 );

--- a/algo/quark/hmq1725.c
+++ b/algo/quark/hmq1725.c
@@ -21,9 +21,11 @@
 #if defined(__AES__)
  #include "algo/groestl/aes_ni/hash-groestl.h"
  #include "algo/echo/aes_ni/hash_api.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif
 #include "algo/luffa/luffa_for_sse2.h"
 #include "algo/cubehash/cubehash_sse2.h"
@@ -40,7 +42,6 @@ typedef struct {
  sph_shavite512_context  shavite1, shavite2;
  hashState_sd            simd1, simd2;
  sph_hamsi512_context    hamsi1;
-  sph_fugue512_context    fugue1, fugue2;
  sph_shabal512_context   shabal1;
  sph_whirlpool_context   whirlpool1, whirlpool2, whirlpool3, whirlpool4;
  SHA512_CTX              sha1, sha2;
@@ -48,9 +49,11 @@ typedef struct {
 #if defined(__AES__)
  hashState_echo          echo1, echo2;
  hashState_groestl       groestl1, groestl2;
+  hashState_fugue         fugue1, fugue2;
 #else
  sph_groestl512_context  groestl1, groestl2;
  sph_echo512_context     echo1, echo2;
+  sph_fugue512_context    fugue1, fugue2;
 #endif
 } hmq1725_ctx_holder;

@@ -88,8 +91,13 @@ void init_hmq1725_ctx()

    sph_hamsi512_init(&hmq1725_ctx.hamsi1);

+#if defined(__AES__)
+    fugue512_Init( &hmq1725_ctx.fugue1, 512 );
+    fugue512_Init( &hmq1725_ctx.fugue2, 512 );
+#else
    sph_fugue512_init(&hmq1725_ctx.fugue1);
    sph_fugue512_init(&hmq1725_ctx.fugue2);
+#endif

    sph_shabal512_init(&hmq1725_ctx.shabal1);

@@ -235,8 +243,13 @@ extern void hmq1725hash(void *state, const void *input)
    sph_hamsi512 (&h_ctx.hamsi1, hashA, 64); //3
    sph_hamsi512_close(&h_ctx.hamsi1, hashB); //4

+#if defined(__AES__)
+    fugue512_Update( &h_ctx.fugue1, hashB, 512 ); //2   ////
+    fugue512_Final( &h_ctx.fugue1, hashA ); //3 
+#else
    sph_fugue512 (&h_ctx.fugue1, hashB, 64); //2   ////
    sph_fugue512_close(&h_ctx.fugue1, hashA); //3 
+#endif

    if ( hashA[0] & mask ) //4
    {
@@ -262,8 +275,13 @@ extern void hmq1725hash(void *state, const void *input)

    if ( hashB[0] & mask ) //7
    {
+#if defined(__AES__)
+        fugue512_Update( &h_ctx.fugue2, hashB, 512 ); //
+        fugue512_Final( &h_ctx.fugue2, hashA ); //8
+#else
        sph_fugue512 (&h_ctx.fugue2, hashB, 64); //
        sph_fugue512_close(&h_ctx.fugue2, hashA); //8
+#endif
    }
    else
    {
--- a/algo/ripemd/lbry-gate.c
+++ b/algo/ripemd/lbry-gate.c
@@ -69,13 +69,9 @@ void lbry_build_block_header( struct work* g_work, uint32_t version,
 void lbry_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
 {
   unsigned char merkle_root[64] = { 0 };
-   size_t t;
   int i;

   algo_gate.gen_merkle_root( merkle_root, sctx );
-   // Increment extranonce2 
-   for ( t = 0; t < sctx->xnonce2_size && !( ++sctx->job.xnonce2[t] ); t++ );
-   // Assemble block header 

   memset( g_work->data, 0, sizeof(g_work->data) );
   g_work->data[0] = le32dec( sctx->job.version );
--- a/algo/scrypt/neoscrypt.c
+++ b/algo/scrypt/neoscrypt.c
@@ -1051,16 +1051,16 @@ int scanhash_neoscrypt( struct work *work,
    uint32_t _ALIGN(64) hash[8];
    const uint32_t Htarg = ptarget[7];
    const uint32_t first_nonce = pdata[19];
-    int thr_id = mythr->id;  // thr_id arg is deprecated
+    int thr_id = mythr->id; 

    while (pdata[19] < max_nonce && !work_restart[thr_id].restart)
    {
        neoscrypt((uint8_t *) hash, (uint8_t *) pdata );

        /* Quick hash check */
-        if (hash[7] <= Htarg && fulltest_le(hash, ptarget)) {
-            *hashes_done = pdata[19] - first_nonce + 1;
-            return 1;
+        if (hash[7] <= Htarg && fulltest_le(hash, ptarget))
+        {
+          submit_solution( work, hash, mythr );
        }

        pdata[19]++;
--- a/algo/sha/sha2.c
+++ b/algo/sha/sha2.c
@@ -479,8 +479,8 @@ static inline void sha256d_ms(uint32_t *hash, uint32_t *W,
 void sha256d_ms_4way(uint32_t *hash,  uint32_t *data,
 	const uint32_t *midstate, const uint32_t *prehash);

-static inline int scanhash_sha256d_4way(int thr_id, struct work *work,
-             uint32_t max_nonce, uint64_t *hashes_done)
+static inline int scanhash_sha256d_4way( struct work *work,
+             uint32_t max_nonce, uint64_t *hashes_done, struct thr_info *mythr )
 {
        uint32_t *pdata = work->data;
        uint32_t *ptarget = work->target;
@@ -492,6 +492,7 @@ static inline int scanhash_sha256d_4way(int thr_id, struct work *work,
 	uint32_t n = pdata[19] - 1;
 	const uint32_t first_nonce = pdata[19];
 	const uint32_t Htarg = ptarget[7];
+   int thr_id = mythr->id;
 	int i, j;
 	
 	memcpy(data, pdata + 16, 64);
@@ -521,10 +522,8 @@ static inline int scanhash_sha256d_4way(int thr_id, struct work *work,
 			if (swab32(hash[4 * 7 + i]) <= Htarg) {
 				pdata[19] = data[4 * 3 + i];
 				sha256d_80_swap(hash, pdata);
-				if (fulltest(hash, ptarget)) {
-					*hashes_done = n - first_nonce + 1;
-					return 1;
-				}
+            if ( fulltest( hash, ptarget ) && !opt_benchmark )
+               submit_solution( work, hash, mythr );
 			}
 		}
 	} while (n < max_nonce && !work_restart[thr_id].restart);
@@ -541,8 +540,8 @@ static inline int scanhash_sha256d_4way(int thr_id, struct work *work,
 void sha256d_ms_8way(uint32_t *hash,  uint32_t *data,
 	const uint32_t *midstate, const uint32_t *prehash);

-static inline int scanhash_sha256d_8way(int thr_id, struct work *work,
-                              uint32_t max_nonce, uint64_t *hashes_done)
+static inline int scanhash_sha256d_8way( struct work *work,
+            uint32_t max_nonce, uint64_t *hashes_done, struct thr_info *mythr )
 {
        uint32_t *pdata = work->data;
        uint32_t *ptarget = work->target;
@@ -554,6 +553,7 @@ static inline int scanhash_sha256d_8way(int thr_id, struct work *work,
 	uint32_t n = pdata[19] - 1;
 	const uint32_t first_nonce = pdata[19];
 	const uint32_t Htarg = ptarget[7];
+   int thr_id = mythr->id;
 	int i, j;
 	
 	memcpy(data, pdata + 16, 64);
@@ -583,10 +583,8 @@ static inline int scanhash_sha256d_8way(int thr_id, struct work *work,
 			if (swab32(hash[8 * 7 + i]) <= Htarg) {
 				pdata[19] = data[8 * 3 + i];
 				sha256d_80_swap(hash, pdata);
-				if (fulltest(hash, ptarget)) {
-					*hashes_done = n - first_nonce + 1;
-					return 1;
-				}
+            if ( fulltest( hash, ptarget ) && !opt_benchmark )
+               submit_solution( work, hash, mythr );
 			}
 		}
 	} while (n < max_nonce && !work_restart[thr_id].restart);
@@ -614,13 +612,11 @@ int scanhash_sha256d( struct work *work,

 #ifdef HAVE_SHA256_8WAY
 	if (sha256_use_8way())
-		return scanhash_sha256d_8way(thr_id, work,
-			max_nonce, hashes_done);
+		return scanhash_sha256d_8way( work,	max_nonce, hashes_done, mythr );
 #endif
 #ifdef HAVE_SHA256_4WAY
 	if (sha256_use_4way())
-		return scanhash_sha256d_4way(thr_id, work,
-			max_nonce, hashes_done);
+		return scanhash_sha256d_4way( work,	max_nonce, hashes_done, mythr );
 #endif
 	
 	memcpy(data, pdata + 16, 64);
@@ -657,7 +653,7 @@ int scanhash_SHA256d( struct work *work, const uint32_t max_nonce,
   uint32_t n = pdata[19] - 1;
   const uint32_t first_nonce = pdata[19];
   const uint32_t Htarg = ptarget[7];
-   int thr_id = mythr->id;  // thr_id arg is deprecated
+   int thr_id = mythr->id;

   memcpy( data, pdata, 80 );

--- a/algo/sha/sha256t-gate.c
+++ b/algo/sha/sha256t-gate.c
@@ -3,36 +3,38 @@
 bool register_sha256t_algo( algo_gate_t* gate )
 {
 #if defined(SHA256T_8WAY)
-    gate->optimizations = SSE2_OPT | AVX2_OPT | SHA_OPT;
    gate->scanhash   = (void*)&scanhash_sha256t_8way;
    gate->hash       = (void*)&sha256t_8way_hash;
-#elif defined(SHA256T_4WAY)
-    gate->optimizations = SSE2_OPT | AVX2_OPT | SHA_OPT;
+#else
    gate->scanhash   = (void*)&scanhash_sha256t_4way;
    gate->hash       = (void*)&sha256t_4way_hash;
+/*
 #else
    gate->optimizations = SHA_OPT;
    gate->scanhash   = (void*)&scanhash_sha256t;
    gate->hash       = (void*)&sha256t_hash;
+*/
 #endif
+    gate->optimizations = SSE2_OPT | AVX2_OPT;
    return true;
 }

 bool register_sha256q_algo( algo_gate_t* gate )
 {
 #if defined(SHA256T_8WAY)
-    gate->optimizations = SSE2_OPT | AVX2_OPT | SHA_OPT;
    gate->scanhash   = (void*)&scanhash_sha256q_8way;
    gate->hash       = (void*)&sha256q_8way_hash;
-#elif defined(SHA256T_4WAY)
-    gate->optimizations = SSE2_OPT | AVX2_OPT | SHA_OPT;
+#else
    gate->scanhash   = (void*)&scanhash_sha256q_4way;
    gate->hash       = (void*)&sha256q_4way_hash;
+/*
 #else
    gate->optimizations = SHA_OPT;
    gate->scanhash   = (void*)&scanhash_sha256q;
    gate->hash       = (void*)&sha256q_hash;
+*/
 #endif
+    gate->optimizations = SSE2_OPT | AVX2_OPT;
    return true;

 }
--- a/algo/sha/sha256t-gate.h
+++ b/algo/sha/sha256t-gate.h
@@ -4,13 +4,10 @@
 #include <stdint.h>
 #include "algo-gate-api.h"

-// Override multi way on ryzen, SHA is better.
-#if !defined(__SHA__)
- #if defined(__AVX2__)
+#if defined(__AVX2__)
  #define SHA256T_8WAY
- #elif defined(__SSE2__)
+#else
  #define SHA256T_4WAY
- #endif
 #endif

 bool register_sha256t_algo( algo_gate_t* gate );
@@ -36,12 +33,13 @@ int scanhash_sha256q_4way( struct work *work, uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr );
 #endif

+/*
 void sha256t_hash( void *output, const void *input );
 int scanhash_sha256t( struct work *work, uint32_t max_nonce,
                      uint64_t *hashes_done, struct thr_info *mythr );
 void sha256q_hash( void *output, const void *input );
 int scanhash_sha256q( struct work *work, uint32_t max_nonce,
                      uint64_t *hashes_done, struct thr_info *mythr );
-
+*/
 #endif

--- a/algo/sha/sha256t.c
+++ b/algo/sha/sha256t.c
@@ -1,5 +1,7 @@
 #include "sha256t-gate.h"

+// Obsolete
+
 #if !defined(SHA256T_16WAY) && !defined(SHA256T_8WAY) && !defined(SHA256T_4WAY)

 #include <stdlib.h>
--- a/algo/shavite/shavite-hash-2way.c
+++ b/algo/shavite/shavite-hash-2way.c
@@ -26,7 +26,11 @@ static const uint32_t IV512[] =
 static void
 c512_2way( shavite512_2way_context *ctx, const void *msg )
 {
+#if defined(__VAES__)
+   const __m256i zero = _mm256_setzero_si256();
+#else
   const __m128i zero = _mm_setzero_si128();
+#endif
   __m256i p0, p1, p2, p3, x;
   __m256i k00, k01, k02, k03, k10, k11, k12, k13;
   __m256i *m = (__m256i*)msg;
--- a/algo/skein/skein-hash-4way.c
+++ b/algo/skein/skein-hash-4way.c
@@ -731,7 +731,7 @@ void skein512_8way_full( skein512_8way_context *sc, void *out, const void *data,
 void
 skein512_8way_prehash64( skein512_8way_context *sc, const void *data )
 {
-   __m512i *vdata = (__m512*)data;
+   __m512i *vdata = (__m512i*)data;
   __m512i *buf = sc->buf;
   buf[0] = vdata[0];
   buf[1] = vdata[1];
--- a/algo/x13/phi1612-4way.c
+++ b/algo/x13/phi1612-4way.c
@@ -7,7 +7,7 @@
 #include "algo/jh/jh-hash-4way.h"
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/cubehash/cube-hash-2way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/gost/sph_gost.h"
 #include "algo/echo/aes_ni/hash_api.h"
 #if defined(__VAES__)
@@ -20,7 +20,7 @@ typedef struct {
    skein512_8way_context   skein;
    jh512_8way_context      jh;
    cube_4way_context       cube;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    sph_gost512_context     gost;
 #if defined(__VAES__)
    echo_4way_context       echo;
@@ -36,7 +36,7 @@ void init_phi1612_8way_ctx()
     skein512_8way_init( &phi1612_8way_ctx.skein );
     jh512_8way_init( &phi1612_8way_ctx.jh );
     cube_4way_init( &phi1612_8way_ctx.cube, 512, 16, 32 );
-     sph_fugue512_init( &phi1612_8way_ctx.fugue );
+     fugue512_Init( &phi1612_8way_ctx.fugue, 512 );
     sph_gost512_init( &phi1612_8way_ctx.gost );
 #if defined(__VAES__)
     echo_4way_init( &phi1612_8way_ctx.echo, 512 );
@@ -79,29 +79,14 @@ void phi1612_8way_hash( void *state, const void *input )
     dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhash );

     // Fugue
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash4, 64 );
-     sph_fugue512_close( &ctx.fugue, hash4 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash5, 64 );
-     sph_fugue512_close( &ctx.fugue, hash5 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash6, 64 );
-     sph_fugue512_close( &ctx.fugue, hash6 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash7, 64 );
-     sph_fugue512_close( &ctx.fugue, hash7 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+     fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+     fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+     fugue512_full( &ctx.fugue, hash7, hash7, 64 );

     // Gost
     sph_gost512( &ctx.gost, hash0, 64 );
@@ -223,7 +208,7 @@ typedef struct {
    skein512_4way_context   skein;
    jh512_4way_context      jh;
    cubehashParam           cube;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    sph_gost512_context     gost;
    hashState_echo          echo;
 } phi1612_4way_ctx_holder;
@@ -235,7 +220,6 @@ void init_phi1612_4way_ctx()
     skein512_4way_init( &phi1612_4way_ctx.skein );
     jh512_4way_init( &phi1612_4way_ctx.jh );
     cubehashInit( &phi1612_4way_ctx.cube, 512, 16, 32 );
-     sph_fugue512_init( &phi1612_4way_ctx.fugue );
     sph_gost512_init( &phi1612_4way_ctx.gost );
     init_echo( &phi1612_4way_ctx.echo, 512 );
 };
@@ -275,17 +259,10 @@ void phi1612_4way_hash( void *state, const void *input )
     cubehashUpdateDigest( &ctx.cube, (byte*)hash3, (const byte*) hash3, 64 );

     // Fugue
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );

     // Gost
     sph_gost512( &ctx.gost, hash0, 64 );
--- a/algo/x13/phi1612.c
+++ b/algo/x13/phi1612.c
@@ -8,24 +8,28 @@
 #include <stdio.h>
 #include "algo/gost/sph_gost.h"
 #include "algo/echo/sph_echo.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/skein/sph_skein.h"
 #include "algo/jh/sph_jh.h"
 #ifdef __AES__
  #include "algo/echo/aes_ni/hash_api.h"
+  #include "algo/fugue/fugue-aesni.h"
+#else
+  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

 typedef struct {
     sph_skein512_context    skein;
     sph_jh512_context       jh;
     cubehashParam           cube;
-     sph_fugue512_context    fugue;
     sph_gost512_context     gost;
 #ifdef __AES__
     hashState_echo          echo;
+     hashState_fugue         fugue;
 #else
     sph_echo512_context     echo;
+     sph_fugue512_context    fugue;
 #endif
 } phi_ctx_holder;

@@ -38,12 +42,13 @@ void init_phi1612_ctx()
     sph_skein512_init( &phi_ctx.skein );
     sph_jh512_init( &phi_ctx.jh );
     cubehashInit( &phi_ctx.cube, 512, 16, 32 );
-     sph_fugue512_init( &phi_ctx.fugue );
     sph_gost512_init( &phi_ctx.gost );
 #ifdef __AES__
     init_echo( &phi_ctx.echo, 512 );
+     fugue512_Init( &phi_ctx.fugue, 512 );
 #else
     sph_echo512_init( &phi_ctx.echo );
+     sph_fugue512_init( &phi_ctx.fugue );
 #endif
 }

@@ -69,8 +74,13 @@ void phi1612_hash(void *output, const void *input)

     cubehashUpdateDigest( &ctx.cube, (byte*) hash, (const byte*)hash, 64 );

+#if defined(__AES__)
+     fugue512_Update( &ctx.fugue, hash, 512 ); 
+     fugue512_Final( &ctx.fugue, hash ); 
+#else
     sph_fugue512( &ctx.fugue, (const void*)hash, 64 );
     sph_fugue512_close( &ctx.fugue, (void*)hash );
+#endif

     sph_gost512( &ctx.gost, hash, 64 );
     sph_gost512_close( &ctx.gost, hash );
--- a/algo/x13/skunk-4way.c
+++ b/algo/x13/skunk-4way.c
@@ -5,7 +5,7 @@
 #include <stdio.h>
 #include "algo/skein/skein-hash-4way.h"
 #include "algo/gost/sph_gost.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/cubehash/cube-hash-2way.h"

@@ -14,7 +14,7 @@
 typedef struct {
    skein512_8way_context skein;
    cube_4way_context     cube;
-    sph_fugue512_context  fugue;
+    hashState_fugue         fugue;
    sph_gost512_context   gost;
 } skunk_8way_ctx_holder;

@@ -46,29 +46,15 @@ void skunk_8way_hash( void *output, const void *input )
     cube_4way_init( &ctx.cube, 512, 16, 32 );           
     cube_4way_update_close( &ctx.cube, vhash, vhash, 64 );  
     dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhash );
-     
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
-     sph_fugue512( &ctx.fugue, hash4, 64 );
-     sph_fugue512_close( &ctx.fugue, hash4 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash5, 64 );
-     sph_fugue512_close( &ctx.fugue, hash5 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash6, 64 );
-     sph_fugue512_close( &ctx.fugue, hash6 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash7, 64 );
-     sph_fugue512_close( &ctx.fugue, hash7 );
+
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+     fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+     fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+     fugue512_full( &ctx.fugue, hash7, hash7, 64 );

     sph_gost512( &ctx.gost, hash0, 64 );
     sph_gost512_close( &ctx.gost, output );
@@ -140,7 +126,6 @@ bool skunk_8way_thread_init()
 {
   skein512_8way_init( &skunk_8way_ctx.skein );
   cube_4way_init( &skunk_8way_ctx.cube, 512, 16, 32 );
-   sph_fugue512_init( &skunk_8way_ctx.fugue );
   sph_gost512_init( &skunk_8way_ctx.gost );
   return true;
 }
@@ -150,7 +135,7 @@ bool skunk_8way_thread_init()
 typedef struct {
    skein512_4way_context skein;
    cubehashParam         cube;
-    sph_fugue512_context  fugue;
+    hashState_fugue       fugue;
    sph_gost512_context   gost;
 } skunk_4way_ctx_holder;

@@ -178,17 +163,10 @@ void skunk_4way_hash( void *output, const void *input )
     memcpy( &ctx.cube, &skunk_4way_ctx.cube, sizeof(cubehashParam) );
     cubehashUpdateDigest( &ctx.cube, (byte*)hash3, (const byte*) hash3, 64 );

-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     sph_fugue512_init( &ctx.fugue );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );

     sph_gost512( &ctx.gost, hash0, 64 );
     sph_gost512_close( &ctx.gost, hash0 );
@@ -252,7 +230,6 @@ bool skunk_4way_thread_init()
 {
   skein512_4way_init( &skunk_4way_ctx.skein );
   cubehashInit( &skunk_4way_ctx.cube, 512, 16, 32 );
-   sph_fugue512_init( &skunk_4way_ctx.fugue );
   sph_gost512_init( &skunk_4way_ctx.gost );
   return true;
 }
--- a/algo/x13/skunk-gate.c
+++ b/algo/x13/skunk-gate.c
@@ -2,7 +2,7 @@

 bool register_skunk_algo( algo_gate_t* gate )
 {
-   gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
+   gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT | AES_OPT;
 #if defined (SKUNK_8WAY)
   gate->miner_thread_init = (void*)&skunk_8way_thread_init;
   gate->scanhash = (void*)&scanhash_skunk_8way;
--- a/algo/x13/skunk.c
+++ b/algo/x13/skunk.c
@@ -8,13 +8,21 @@
 #include <stdio.h>
 #include "algo/gost/sph_gost.h"
 #include "algo/skein/sph_skein.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/cubehash/cubehash_sse2.h"
+#if defined(__AES__)
+  #include "algo/fugue/fugue-aesni.h"
+#else
+  #include "algo/fugue/sph_fugue.h"
+#endif

 typedef struct {
    sph_skein512_context  skein;
    cubehashParam         cube;
+#if defined(__AES__)
+    hashState_fugue       fugue;
+#else
    sph_fugue512_context  fugue;
+#endif
    sph_gost512_context   gost;
 } skunk_ctx_holder;

@@ -32,8 +40,13 @@ void skunkhash( void *output, const void *input )

     cubehashUpdateDigest( &ctx.cube, (byte*) hash, (const byte*)hash, 64 );

+#if defined(__AES__)
+     fugue512_Update( &ctx.fugue, hash, 512 ); 
+     fugue512_Final( &ctx.fugue, hash ); 
+#else
     sph_fugue512( &ctx.fugue, hash, 64 );
     sph_fugue512_close( &ctx.fugue, hash );
+#endif

     sph_gost512( &ctx.gost, hash, 64 );
     sph_gost512_close( &ctx.gost, hash );
@@ -87,8 +100,12 @@ bool skunk_thread_init()
 {
   sph_skein512_init( &skunk_ctx.skein );
   cubehashInit( &skunk_ctx.cube, 512, 16, 32 );
-   sph_fugue512_init( &skunk_ctx.fugue );
-   sph_gost512_init( &skunk_ctx.gost );
+#if defined(__AES__)
+    fugue512_Init( &skunk_ctx.fugue, 512 );
+#else
+    sph_fugue512_init( &skunk_ctx.fugue );
+#endif
+    sph_gost512_init( &skunk_ctx.gost );
   return true;
 }
 #endif
--- a/algo/x13/x13-4way.c
+++ b/algo/x13/x13-4way.c
@@ -16,7 +16,7 @@
 #include "algo/simd/simd-hash-2way.h"
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #if defined(__VAES__)
  #include "algo/groestl/groestl512-hash-4way.h"
  #include "algo/shavite/shavite-hash-4way.h"
@@ -35,7 +35,7 @@ typedef struct {
    cube_4way_context       cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
 #if defined(__VAES__)
    groestl512_4way_context groestl;
    shavite512_4way_context shavite;
@@ -60,7 +60,7 @@ void init_x13_8way_ctx()
     cube_4way_init( &x13_8way_ctx.cube, 512, 16, 32 );
     simd_4way_init( &x13_8way_ctx.simd, 512 );
     hamsi512_8way_init( &x13_8way_ctx.hamsi );
-     sph_fugue512_init( &x13_8way_ctx.fugue );
+     fugue512_Init( &x13_8way_ctx.fugue, 512 );
 #if defined(__VAES__)
     groestl512_4way_init( &x13_8way_ctx.groestl, 64 );
     shavite512_4way_init( &x13_8way_ctx.shavite );
@@ -255,29 +255,29 @@ void x13_8way_hash( void *state, const void *input )
                       vhash );

     // 13 Fugue serial
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
-     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash4, 64 );
-     sph_fugue512_close( &ctx.fugue, hash4 );
-     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash5, 64 );
-     sph_fugue512_close( &ctx.fugue, hash5 );
-     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash6, 64 );
-     sph_fugue512_close( &ctx.fugue, hash6 );
-     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash7, 64 );
-     sph_fugue512_close( &ctx.fugue, hash7 );
+     fugue512_Update( &ctx.fugue, hash0, 512 );
+     fugue512_Final( &ctx.fugue, hash0 ); 
+     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash1, 512 );
+     fugue512_Final( &ctx.fugue, hash1 );   
+     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash2, 512 );
+     fugue512_Final( &ctx.fugue, hash2 );   
+     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash3, 512 );
+     fugue512_Final( &ctx.fugue, hash3 );   
+     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash4, 512 );
+     fugue512_Final( &ctx.fugue, hash4 );   
+     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash5, 512 );
+     fugue512_Final( &ctx.fugue, hash5 );   
+     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash6, 512 );
+     fugue512_Final( &ctx.fugue, hash6 );   
+     memcpy( &ctx.fugue, &x13_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash7, 512 );
+     fugue512_Final( &ctx.fugue, hash7 );   
     
     memcpy( state,     hash0, 32 );
     memcpy( state+ 32, hash1, 32 );
@@ -344,7 +344,7 @@ typedef struct {
    simd_2way_context       simd;
    hashState_echo          echo;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
 } x13_4way_ctx_holder;

 x13_4way_ctx_holder x13_4way_ctx __attribute__ ((aligned (64)));
@@ -363,7 +363,7 @@ void init_x13_4way_ctx()
     simd_2way_init( &x13_4way_ctx.simd, 512 );
     init_echo( &x13_4way_ctx.echo, 512 );
     hamsi512_4way_init( &x13_4way_ctx.hamsi );
-     sph_fugue512_init( &x13_4way_ctx.fugue );
+     fugue512_Init( &x13_4way_ctx.fugue, 512 );
 };

 void x13_4way_hash( void *state, const void *input )
@@ -477,17 +477,17 @@ void x13_4way_hash( void *state, const void *input )
     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );

     // 13 Fugue serial
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     memcpy( &ctx.fugue, &x13_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     memcpy( &ctx.fugue, &x13_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     memcpy( &ctx.fugue, &x13_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
+     fugue512_Update( &ctx.fugue, hash0, 512 );
+     fugue512_Final( &ctx.fugue, hash0 );      
+     memcpy( &ctx.fugue, &x13_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash1, 512 );
+     fugue512_Final( &ctx.fugue, hash1 );       
+     memcpy( &ctx.fugue, &x13_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash2, 512 );
+     fugue512_Final( &ctx.fugue, hash2 );      
+     memcpy( &ctx.fugue, &x13_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash3, 512 );
+     fugue512_Final( &ctx.fugue, hash3 );   

     memcpy( state,    hash0, 32 );
     memcpy( state+32, hash1, 32 );
--- a/algo/x13/x13.c
+++ b/algo/x13/x13.c
@@ -13,7 +13,6 @@
 #include "algo/skein/sph_skein.h"
 #include "algo/shavite/sph_shavite.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/luffa/luffa_for_sse2.h"
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/simd/nist.h"
@@ -21,9 +20,11 @@
 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

 typedef struct {
@@ -32,9 +33,11 @@ typedef struct {
 #if defined(__AES__)
   hashState_echo          echo;
   hashState_groestl       groestl;
+   hashState_fugue         fugue;
 #else
   sph_groestl512_context   groestl;
   sph_echo512_context      echo;
+   sph_fugue512_context    fugue;
 #endif
   sph_jh512_context       jh;
   sph_keccak512_context   keccak;
@@ -44,7 +47,6 @@ typedef struct {
   sph_shavite512_context  shavite;
   hashState_sd            simd;
   sph_hamsi512_context    hamsi;
-   sph_fugue512_context    fugue;
 } x13_ctx_holder;

 x13_ctx_holder x13_ctx;
@@ -56,9 +58,11 @@ void init_x13_ctx()
 #if defined(__AES__)
   init_groestl( &x13_ctx.groestl, 64 );
   init_echo( &x13_ctx.echo, 512 );
+   fugue512_Init( &x13_ctx.fugue, 512 );
 #else
   sph_groestl512_init( &x13_ctx.groestl );
   sph_echo512_init( &x13_ctx.echo );
+   sph_fugue512_init( &x13_ctx.fugue );
 #endif
   sph_skein512_init( &x13_ctx.skein );
   sph_jh512_init( &x13_ctx.jh );
@@ -68,7 +72,6 @@ void init_x13_ctx()
   sph_shavite512_init( &x13_ctx.shavite );
   init_sd( &x13_ctx.simd, 512 );
   sph_hamsi512_init( &x13_ctx.hamsi );
-   sph_fugue512_init( &x13_ctx.fugue );
 };

 void x13hash(void *output, const void *input)
@@ -84,11 +87,9 @@ void x13hash(void *output, const void *input)
    sph_bmw512_close( &ctx.bmw, hash );

 #if defined(__AES__)
-    init_groestl( &ctx.groestl, 64 );
    update_and_final_groestl( &ctx.groestl, (char*)hash,
                                      (const char*)hash, 512 );
 #else
-    sph_groestl512_init( &ctx.groestl );
    sph_groestl512( &ctx.groestl, hash, 64 );
    sph_groestl512_close( &ctx.groestl, hash );
 #endif
@@ -125,8 +126,13 @@ void x13hash(void *output, const void *input)
    sph_hamsi512( &ctx.hamsi, hash, 64 );
    sph_hamsi512_close( &ctx.hamsi, hash );

+#if defined(__AES__)
+    fugue512_Update( &ctx.fugue, hash, 512 );
+    fugue512_Final( &ctx.fugue, hash );
+#else
    sph_fugue512( &ctx.fugue, hash, 64 );
    sph_fugue512_close( &ctx.fugue, hash );
+#endif

 	 memcpy( output, hash, 32 );
 }
--- a/algo/x13/x13bcd-4way.c
+++ b/algo/x13/x13bcd-4way.c
@@ -16,7 +16,7 @@
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/sm3/sm3-hash-4way.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #if defined(__VAES__)
  #include "algo/groestl/groestl512-hash-4way.h"
  #include "algo/shavite/shavite-hash-4way.h"
@@ -35,7 +35,7 @@ typedef struct {
    simd_4way_context       simd;
    sm3_8way_ctx_t          sm3;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
 #if defined(__VAES__)
    groestl512_4way_context groestl;
    shavite512_4way_context shavite;
@@ -61,7 +61,7 @@ void init_x13bcd_8way_ctx()
     simd_4way_init( &x13bcd_8way_ctx.simd, 512 );
     sm3_8way_init( &x13bcd_8way_ctx.sm3 );
     hamsi512_8way_init( &x13bcd_8way_ctx.hamsi );
-     sph_fugue512_init( &x13bcd_8way_ctx.fugue );
+     fugue512_Init( &x13bcd_8way_ctx.fugue, 512 );
 #if defined(__VAES__)
     groestl512_4way_init( &x13bcd_8way_ctx.groestl, 64 );
     shavite512_4way_init( &x13bcd_8way_ctx.shavite );
@@ -257,36 +257,30 @@ void x13bcd_8way_hash( void *state, const void *input )
                       hash4, hash5, hash6, hash7, vhash );

     // Fugue serial
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, state );
-     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, state+32 );
-     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, state+64 );
-     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, state+96 );
-     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash4, 64 );
-     sph_fugue512_close( &ctx.fugue, state+128 );
-     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash5, 64 );
-     sph_fugue512_close( &ctx.fugue, state+160 );
-     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash6, 64 );
-     sph_fugue512_close( &ctx.fugue, state+192 );
-     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash7, 64 );
-     sph_fugue512_close( &ctx.fugue, state+224 );
+     fugue512_Update( &ctx.fugue, hash0, 512 );
+     fugue512_Final( &ctx.fugue, state );
+     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash1, 512 );
+     fugue512_Final( &ctx.fugue, state+32 );
+     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash2, 512 );
+     fugue512_Final( &ctx.fugue, state+64 );
+     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash3, 512 );
+     fugue512_Final( &ctx.fugue, state+96 );
+     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash4, 512 );
+     fugue512_Final( &ctx.fugue, state+128 );
+     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash5, 512 );
+     fugue512_Final( &ctx.fugue, state+160 );
+     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash6, 512 );
+     fugue512_Final( &ctx.fugue, state+192 );
+     memcpy( &ctx.fugue, &x13bcd_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash7, 512 );
+     fugue512_Final( &ctx.fugue, state+224 );
+
 }

 int scanhash_x13bcd_8way( struct work *work, uint32_t max_nonce,
@@ -346,7 +340,7 @@ typedef struct {
    hashState_echo          echo;
    sm3_4way_ctx_t          sm3;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
 } x13bcd_4way_ctx_holder;

 x13bcd_4way_ctx_holder x13bcd_4way_ctx __attribute__ ((aligned (64)));
@@ -366,7 +360,7 @@ void init_x13bcd_4way_ctx()
     init_echo( &x13bcd_4way_ctx.echo, 512 );
     sm3_4way_init( &x13bcd_4way_ctx.sm3 );
     hamsi512_4way_init( &x13bcd_4way_ctx.hamsi );
-     sph_fugue512_init( &x13bcd_4way_ctx.fugue );
+     fugue512_Init( &x13bcd_4way_ctx.fugue, 512 );
 };

 void x13bcd_4way_hash( void *state, const void *input )
@@ -489,20 +483,17 @@ void x13bcd_4way_hash( void *state, const void *input )
     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );

     // Fugue serial
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     memcpy( &ctx.fugue, &x13bcd_4way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     memcpy( &ctx.fugue, &x13bcd_4way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     memcpy( &ctx.fugue, &x13bcd_4way_ctx.fugue,
-                         sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
+     fugue512_Update( &ctx.fugue, hash0, 512 );
+     fugue512_Final( &ctx.fugue, hash0 );
+     memcpy( &ctx.fugue, &x13bcd_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash1, 512 );
+     fugue512_Final( &ctx.fugue, hash1 );
+     memcpy( &ctx.fugue, &x13bcd_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash2, 512 );
+     fugue512_Final( &ctx.fugue, hash2 );
+     memcpy( &ctx.fugue, &x13bcd_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash3, 512 );
+     fugue512_Final( &ctx.fugue, hash3 );

     memcpy( state,    hash0, 32 );
     memcpy( state+32, hash1, 32 );
--- a/algo/x13/x13bcd.c
+++ b/algo/x13/x13bcd.c
@@ -14,16 +14,17 @@
 #include "algo/skein/sph_skein.h"
 #include "algo/shavite/sph_shavite.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/simd/nist.h"

 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

 typedef struct {
@@ -32,9 +33,11 @@ typedef struct {
 #if defined(__AES__)
   hashState_echo          echo;
   hashState_groestl       groestl;
+   hashState_fugue         fugue;
 #else
   sph_groestl512_context   groestl;
   sph_echo512_context      echo;
+   sph_fugue512_context    fugue;
 #endif
   sph_jh512_context       jh;
   sph_keccak512_context   keccak;
@@ -43,7 +46,6 @@ typedef struct {
   sph_shavite512_context  shavite;
   hashState_sd            simd;
   sph_hamsi512_context    hamsi;
-   sph_fugue512_context    fugue;
   sm3_ctx_t               sm3;
 } x13bcd_ctx_holder;

@@ -56,9 +58,11 @@ void init_x13bcd_ctx()
 #if defined(__AES__)
   init_groestl( &x13bcd_ctx.groestl, 64 );
   init_echo( &x13bcd_ctx.echo, 512 );
+   fugue512_Init( &x13bcd_ctx.fugue, 512 );
 #else
   sph_groestl512_init( &x13bcd_ctx.groestl );
   sph_echo512_init( &x13bcd_ctx.echo );
+   sph_fugue512_init( &x13bcd_ctx.fugue );
 #endif
   sph_skein512_init( &x13bcd_ctx.skein );
   sph_jh512_init( &x13bcd_ctx.jh );
@@ -68,7 +72,6 @@ void init_x13bcd_ctx()
   init_sd( &x13bcd_ctx.simd,512 );
   sm3_init( &x13bcd_ctx.sm3 );
   sph_hamsi512_init( &x13bcd_ctx.hamsi );
-   sph_fugue512_init( &x13bcd_ctx.fugue );
 };

 void x13bcd_hash(void *output, const void *input)
@@ -129,8 +132,13 @@ void x13bcd_hash(void *output, const void *input)
    sph_hamsi512( &ctx.hamsi, hash, 64 );
    sph_hamsi512_close( &ctx.hamsi, hash );

+#if defined(__AES__)
+    fugue512_Update( &ctx.fugue, hash, 512 );
+    fugue512_Final( &ctx.fugue, hash );
+#else
    sph_fugue512( &ctx.fugue, hash, 64 );
    sph_fugue512_close( &ctx.fugue, hash );
+#endif

    memcpy( output, hash, 32 );
 }
--- a/algo/x14/x14-4way.c
+++ b/algo/x14/x14-4way.c
@@ -17,7 +17,7 @@
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/echo/sph_echo.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/shabal/shabal-hash-4way.h"
 #if defined(__VAES__)
  #include "algo/groestl/groestl512-hash-4way.h"
@@ -37,7 +37,7 @@ typedef struct {
    cube_4way_context       cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
 #if defined(__VAES__)
    groestl512_4way_context groestl;
@@ -63,7 +63,7 @@ void init_x14_8way_ctx()
     cube_4way_init( &x14_8way_ctx.cube, 512, 16, 32 );
     simd_4way_init( &x14_8way_ctx.simd, 512 );
     hamsi512_8way_init( &x14_8way_ctx.hamsi );
-     sph_fugue512_init( &x14_8way_ctx.fugue );
+     fugue512_Init( &x14_8way_ctx.fugue, 512 );
     shabal512_8way_init( &x14_8way_ctx.shabal );
 #if defined(__VAES__)
     groestl512_4way_init( &x14_8way_ctx.groestl, 64 );
@@ -259,29 +259,29 @@ void x14_8way_hash( void *state, const void *input )
                       vhash );

     // 13 Fugue serial
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
-     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash4, 64 );
-     sph_fugue512_close( &ctx.fugue, hash4 );
-     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash5, 64 );
-     sph_fugue512_close( &ctx.fugue, hash5 );
-     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash6, 64 );
-     sph_fugue512_close( &ctx.fugue, hash6 );
-     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash7, 64 );
-     sph_fugue512_close( &ctx.fugue, hash7 );
+     fugue512_Update( &ctx.fugue, hash0, 512 );
+     fugue512_Final( &ctx.fugue, hash0 );
+     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash1, 512 );
+     fugue512_Final( &ctx.fugue, hash1 );
+     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash2, 512 );
+     fugue512_Final( &ctx.fugue, hash2 );
+     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash3, 512 );
+     fugue512_Final( &ctx.fugue, hash3 );
+     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash4, 512 );
+     fugue512_Final( &ctx.fugue, hash4 );
+     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash5, 512 );
+     fugue512_Final( &ctx.fugue, hash5 );
+     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash6, 512 );
+     fugue512_Final( &ctx.fugue, hash6 );
+     memcpy( &ctx.fugue, &x14_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash7, 512 );
+     fugue512_Final( &ctx.fugue, hash7 );

     // 14 Shabal, parallel 32 bit
     intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
@@ -348,7 +348,7 @@ typedef struct {
    simd_2way_context       simd;
    hashState_echo          echo;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
 } x14_4way_ctx_holder;

@@ -368,7 +368,7 @@ void init_x14_4way_ctx()
     simd_2way_init( &x14_4way_ctx.simd, 512 );
     init_echo( &x14_4way_ctx.echo, 512 );
     hamsi512_4way_init( &x14_4way_ctx.hamsi );
-     sph_fugue512_init( &x14_4way_ctx.fugue );
+     fugue512_Init( &x14_4way_ctx.fugue, 512 );
     shabal512_4way_init( &x14_4way_ctx.shabal );
 };

@@ -483,17 +483,17 @@ void x14_4way_hash( void *state, const void *input )
     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );

     // 13 Fugue serial
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     memcpy( &ctx.fugue, &x14_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     memcpy( &ctx.fugue, &x14_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     memcpy( &ctx.fugue, &x14_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
+     fugue512_Update( &ctx.fugue, hash0, 512 );
+     fugue512_Final( &ctx.fugue, hash0 );
+     memcpy( &ctx.fugue, &x14_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash1, 512 );
+     fugue512_Final( &ctx.fugue, hash1 );
+     memcpy( &ctx.fugue, &x14_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash2, 512 );
+     fugue512_Final( &ctx.fugue, hash2 );
+     memcpy( &ctx.fugue, &x14_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash3, 512 );
+     fugue512_Final( &ctx.fugue, hash3 );

     // 14 Shabal, parallel 32 bit
     intrlv_4x32( vhash, hash0, hash1, hash2, hash3, 512 );
--- a/algo/x14/x14.c
+++ b/algo/x14/x14.c
@@ -13,7 +13,6 @@
 #include "algo/skein/sph_skein.h"
 #include "algo/shavite/sph_shavite.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/shabal/sph_shabal.h"
 #include "algo/luffa/luffa_for_sse2.h"
 #include "algo/cubehash/cubehash_sse2.h"
@@ -21,9 +20,11 @@
 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

 typedef struct {
@@ -32,9 +33,11 @@ typedef struct {
 #if defined(__AES__)
   hashState_groestl       groestl;
   hashState_echo          echo;
+   hashState_fugue         fugue;
 #else
   sph_groestl512_context  groestl;
   sph_echo512_context     echo;
+   sph_fugue512_context    fugue;
 #endif
   sph_jh512_context       jh;
   sph_keccak512_context   keccak;
@@ -44,7 +47,6 @@ typedef struct {
   sph_shavite512_context  shavite;
   hashState_sd            simd;
   sph_hamsi512_context    hamsi;
-   sph_fugue512_context    fugue;
   sph_shabal512_context   shabal;
 } x14_ctx_holder;

@@ -57,9 +59,11 @@ void init_x14_ctx()
 #if defined(__AES__)
   init_groestl( &x14_ctx.groestl, 64 );
   init_echo( &x14_ctx.echo, 512 );
+   fugue512_Init( &x14_ctx.fugue, 512 );
 #else
   sph_groestl512_init( &x14_ctx.groestl );
   sph_echo512_init( &x14_ctx.echo );
+   sph_fugue512_init( &x14_ctx.fugue );
 #endif
   sph_skein512_init( &x14_ctx.skein );
   sph_jh512_init( &x14_ctx.jh );
@@ -69,7 +73,6 @@ void init_x14_ctx()
   sph_shavite512_init( &x14_ctx.shavite );
   init_sd( &x14_ctx.simd,512 );
   sph_hamsi512_init( &x14_ctx.hamsi );
-   sph_fugue512_init( &x14_ctx.fugue );
   sph_shabal512_init( &x14_ctx.shabal );
 };

@@ -125,8 +128,13 @@ void x14hash(void *output, const void *input)
    sph_hamsi512(&ctx.hamsi, hash, 64);
    sph_hamsi512_close(&ctx.hamsi, hash);

+#if defined(__AES__)
+    fugue512_Update( &ctx.fugue, hash, 512 );
+    fugue512_Final( &ctx.fugue, hash );
+#else
    sph_fugue512(&ctx.fugue, hash, 64);
    sph_fugue512_close(&ctx.fugue, hash);
+#endif

    sph_shabal512( &ctx.shabal, hash, 64 );
 	 sph_shabal512_close( &ctx.shabal, hash );
--- a/algo/x15/x15-4way.c
+++ b/algo/x15/x15-4way.c
@@ -17,7 +17,7 @@
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/echo/sph_echo.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/shabal/shabal-hash-4way.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #if defined(__VAES__)
@@ -38,7 +38,7 @@ typedef struct {
    cube_4way_context       cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
    sph_whirlpool_context   whirlpool;
 #if defined(__VAES__)
@@ -65,7 +65,7 @@ void init_x15_8way_ctx()
     cube_4way_init( &x15_8way_ctx.cube, 512, 16, 32 );
     simd_4way_init( &x15_8way_ctx.simd, 512 );
     hamsi512_8way_init( &x15_8way_ctx.hamsi );
-     sph_fugue512_init( &x15_8way_ctx.fugue );
+     fugue512_Init( &x15_8way_ctx.fugue, 512 );
     shabal512_8way_init( &x15_8way_ctx.shabal );
     sph_whirlpool_init( &x15_8way_ctx.whirlpool );
 #if defined(__VAES__)
@@ -260,30 +260,29 @@ void x15_8way_hash( void *state, const void *input )
                       vhash );

     // 13 Fugue
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
-     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash4, 64 );
-     sph_fugue512_close( &ctx.fugue, hash4 );
-     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash5, 64 );
-     sph_fugue512_close( &ctx.fugue, hash5 );
-     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash6, 64 );
-     sph_fugue512_close( &ctx.fugue, hash6 );
-     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash7, 64 );
-     sph_fugue512_close( &ctx.fugue, hash7 );
-
+     fugue512_Update( &ctx.fugue, hash0, 512 );
+     fugue512_Final( &ctx.fugue, hash0 );
+     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash1, 512 );
+     fugue512_Final( &ctx.fugue, hash1 );
+     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash2, 512 );
+     fugue512_Final( &ctx.fugue, hash2 );
+     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash3, 512 );
+     fugue512_Final( &ctx.fugue, hash3 );
+     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash4, 512 );
+     fugue512_Final( &ctx.fugue, hash4 );
+     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash5, 512 );
+     fugue512_Final( &ctx.fugue, hash5 );
+     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash6, 512 );
+     fugue512_Final( &ctx.fugue, hash6 );
+     memcpy( &ctx.fugue, &x15_8way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash7, 512 );
+     fugue512_Final( &ctx.fugue, hash7 );

     // 14 Shabal, parallel 32 bit
     intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
@@ -387,7 +386,7 @@ typedef struct {
    simd_2way_context       simd;
    hashState_echo          echo;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
    sph_whirlpool_context   whirlpool;
 } x15_4way_ctx_holder;
@@ -408,7 +407,7 @@ void init_x15_4way_ctx()
     simd_2way_init( &x15_4way_ctx.simd, 512 );
     init_echo( &x15_4way_ctx.echo, 512 );
     hamsi512_4way_init( &x15_4way_ctx.hamsi );
-     sph_fugue512_init( &x15_4way_ctx.fugue );
+     fugue512_Init( &x15_4way_ctx.fugue, 512 );
     shabal512_4way_init( &x15_4way_ctx.shabal );
     sph_whirlpool_init( &x15_4way_ctx.whirlpool );
 };
@@ -524,17 +523,17 @@ void x15_4way_hash( void *state, const void *input )
     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );

     // 13 Fugue
-     sph_fugue512( &ctx.fugue, hash0, 64 );
-     sph_fugue512_close( &ctx.fugue, hash0 );
-     memcpy( &ctx.fugue, &x15_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash1, 64 );
-     sph_fugue512_close( &ctx.fugue, hash1 );
-     memcpy( &ctx.fugue, &x15_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash2, 64 );
-     sph_fugue512_close( &ctx.fugue, hash2 );
-     memcpy( &ctx.fugue, &x15_4way_ctx.fugue, sizeof(sph_fugue512_context) );
-     sph_fugue512( &ctx.fugue, hash3, 64 );
-     sph_fugue512_close( &ctx.fugue, hash3 );
+     fugue512_Update( &ctx.fugue, hash0, 512 );
+     fugue512_Final( &ctx.fugue, hash0 );
+     memcpy( &ctx.fugue, &x15_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash1, 512 );
+     fugue512_Final( &ctx.fugue, hash1 );
+     memcpy( &ctx.fugue, &x15_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash2, 512 );
+     fugue512_Final( &ctx.fugue, hash2 );
+     memcpy( &ctx.fugue, &x15_4way_ctx.fugue, sizeof(hashState_fugue) );
+     fugue512_Update( &ctx.fugue, hash3, 512 );
+     fugue512_Final( &ctx.fugue, hash3 );

     // 14 Shabal, parallel 32 bit
     intrlv_4x32( vhash, hash0, hash1, hash2, hash3, 512 );
--- a/algo/x15/x15.c
+++ b/algo/x15/x15.c
@@ -23,9 +23,11 @@
 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

 typedef struct {
@@ -34,9 +36,11 @@ typedef struct {
 #if defined(__AES__)
   hashState_echo          echo;
   hashState_groestl       groestl;
+   hashState_fugue         fugue;
 #else
   sph_groestl512_context   groestl;
   sph_echo512_context      echo;
+   sph_fugue512_context    fugue;
 #endif
   sph_jh512_context       jh;
   sph_keccak512_context   keccak;
@@ -46,7 +50,6 @@ typedef struct {
   sph_shavite512_context  shavite;
   hashState_sd            simd;
   sph_hamsi512_context    hamsi;
-   sph_fugue512_context    fugue;
   sph_shabal512_context   shabal;
   sph_whirlpool_context   whirlpool;
 } x15_ctx_holder;
@@ -60,9 +63,11 @@ void init_x15_ctx()
 #if defined(__AES__)
   init_groestl( &x15_ctx.groestl, 64 );
   init_echo( &x15_ctx.echo, 512 );
+   fugue512_Init( &x15_ctx.fugue, 512 );
 #else
   sph_groestl512_init( &x15_ctx.groestl );
   sph_echo512_init( &x15_ctx.echo );
+   sph_fugue512_init( &x15_ctx.fugue );
 #endif
   sph_skein512_init( &x15_ctx.skein );
   sph_jh512_init( &x15_ctx.jh );
@@ -72,7 +77,6 @@ void init_x15_ctx()
   sph_shavite512_init( &x15_ctx.shavite );
   init_sd( &x15_ctx.simd, 512 );
   sph_hamsi512_init( &x15_ctx.hamsi );
-   sph_fugue512_init( &x15_ctx.fugue );
   sph_shabal512_init( &x15_ctx.shabal );
   sph_whirlpool_init( &x15_ctx.whirlpool );
 };
@@ -131,8 +135,13 @@ void x15hash(void *output, const void *input)
    sph_hamsi512( &ctx.hamsi, hash, 64 );
    sph_hamsi512_close( &ctx.hamsi, hash );

+#if defined(__AES__)
+    fugue512_Update( &ctx.fugue, hash, 512 );
+    fugue512_Final( &ctx.fugue, hash );
+#else
    sph_fugue512( &ctx.fugue, hash, 64 );
    sph_fugue512_close( &ctx.fugue, hash );
+#endif

    sph_shabal512( &ctx.shabal, hash, 64 );
    sph_shabal512_close( &ctx.shabal, hash );
--- a/algo/x16/hex.c
+++ b/algo/x16/hex.c
@@ -6,30 +6,6 @@
 */
 #include "x16r-gate.h"

-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include "algo/blake/sph_blake.h"
-#include "algo/bmw/sph_bmw.h"
-#include "algo/groestl/sph_groestl.h"
-#include "algo/jh/sph_jh.h"
-#include "algo/keccak/sph_keccak.h"
-#include "algo/skein/sph_skein.h"
-#include "algo/shavite/sph_shavite.h"
-#include "algo/luffa/luffa_for_sse2.h"
-#include "algo/cubehash/cubehash_sse2.h"
-#include "algo/simd/nist.h"
-#include "algo/echo/sph_echo.h"
-#include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
-#include "algo/shabal/sph_shabal.h"
-#include "algo/whirlpool/sph_whirlpool.h"
-#include <openssl/sha.h>
-#if defined(__AES__)
-  #include "algo/echo/aes_ni/hash_api.h"
-  #include "algo/groestl/aes_ni/hash-groestl.h"
-#endif
-
 static void hex_getAlgoString(const uint32_t* prevblock, char *output)
 {
   char *sptr = output;
@@ -47,34 +23,6 @@ static void hex_getAlgoString(const uint32_t* prevblock, char *output)
   *sptr = '\0';
 }

-/*
-union _hex_context_overlay
-{
-#if defined(__AES__)
-        hashState_echo          echo;
-        hashState_groestl       groestl;
-#else
-        sph_groestl512_context   groestl;
-        sph_echo512_context      echo;
-#endif
-        sph_blake512_context    blake;
-        sph_bmw512_context      bmw;
-        sph_skein512_context    skein;
-        sph_jh512_context       jh;
-        sph_keccak512_context   keccak;
-        hashState_luffa         luffa;
-        cubehashParam           cube;
-        shavite512_context      shavite;
-        hashState_sd            simd;
-        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
-        sph_shabal512_context   shabal;
-        sph_whirlpool_context   whirlpool;
-        SHA512_CTX              sha512;
-};
-typedef union _hex_context_overlay hex_context_overlay;
-*/
-
 static __thread x16r_context_overlay hex_ctx;

 int hex_hash( void* output, const void* input, int thrid )
@@ -187,8 +135,12 @@ int hex_hash( void* output, const void* input, int thrid )
            sph_hamsi512_close( &ctx.hamsi, hash );
         break;
         case FUGUE:
+#if defined(__AES__)
+             fugue512_full( &ctx.fugue, hash, in, size );
+#else
             sph_fugue512_full( &ctx.fugue, hash, in, size );
+#endif
         break;
         case SHABAL:
            if ( i == 0 ) 
               sph_shabal512( &ctx.shabal, in+64, 16 );
--- a/algo/x16/minotaur.c
+++ b/algo/x16/minotaur.c
@@ -7,7 +7,7 @@
 #include <stdio.h>
 #include "algo/blake/sph_blake.h"
 #include "algo/bmw/sph_bmw.h"
-#include "algo/groestl/sph_groestl.h"
+//#include "algo/jh/jh-hash-sse2.h"
 #include "algo/jh/sph_jh.h"
 #include "algo/keccak/sph_keccak.h"
 #include "algo/skein/sph_skein.h"
@@ -15,18 +15,20 @@
 #include "algo/luffa/luffa_for_sse2.h"
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/simd/nist.h"
-#include "algo/echo/sph_echo.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/shabal/sph_shabal.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include <openssl/sha.h>
 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
+#else
+  #include "algo/echo/sph_echo.h"
+  #include "algo/groestl/sph_groestl.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

-
 // Config
 #define MINOTAUR_ALGO_COUNT	16

@@ -34,17 +36,21 @@ typedef struct TortureNode TortureNode;
 typedef struct TortureGarden TortureGarden;

 // Graph of hash algos plus SPH contexts
-struct TortureGarden {
+struct TortureGarden
+{
 #if defined(__AES__)
        hashState_echo          echo;
        hashState_groestl       groestl;
+        hashState_fugue         fugue;
 #else
-        sph_groestl512_context   groestl;
-        sph_echo512_context      echo;
+        sph_echo512_context     echo;
+        sph_groestl512_context  groestl;
+        sph_fugue512_context    fugue;
 #endif
        sph_blake512_context    blake;
        sph_bmw512_context      bmw;
        sph_skein512_context    skein;
+//        jh512_sse2_hashState    jh;
        sph_jh512_context       jh;
        sph_keccak512_context   keccak;
        hashState_luffa         luffa;
@@ -52,23 +58,21 @@ struct TortureGarden {
        shavite512_context      shavite;
        hashState_sd            simd;
        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
        sph_shabal512_context   shabal;
        sph_whirlpool_context   whirlpool;
        SHA512_CTX              sha512;

    struct TortureNode {
        unsigned int algo;
-        TortureNode *childLeft;
-        TortureNode *childRight;
+        TortureNode *child[2];
    } nodes[22];
-};
+} __attribute__ ((aligned (64)));

 // Get a 64-byte hash for given 64-byte input, using given TortureGarden contexts and given algo index
 static void get_hash( void *output, const void *input, TortureGarden *garden,
 	              unsigned int algo )
 {    
-	unsigned char _ALIGN(64) hash[64];
+	unsigned char hash[64] __attribute__ ((aligned (64)));

    switch (algo) {
        case 0:
@@ -97,10 +101,12 @@ static void get_hash( void *output, const void *input, TortureGarden *garden,
 #endif
 	    break;
        case 4:
-            sph_fugue512_init(&garden->fugue);
-            sph_fugue512(&garden->fugue, input, 64);
-            sph_fugue512_close(&garden->fugue, hash);          
-            break;
+#if defined(__AES__)
+            fugue512_full( &garden->fugue, hash, input, 64 );
+#else
+            sph_fugue512_full( &garden->fugue, hash, input, 64 );
+#endif
+	    break;
        case 5:
 #if defined(__AES__)
            groestl512_full( &garden->groestl, (char*)hash, (char*)input, 512 );
@@ -121,6 +127,7 @@ static void get_hash( void *output, const void *input, TortureGarden *garden,
            SHA512_Final( (unsigned char*)hash, &garden->sha512 );
            break;
        case 8:
+//            jh512_sse2_full( &garden->jh, hash, input, 64 );
            sph_jh512_init(&garden->jh);
            sph_jh512(&garden->jh, input, 64);
            sph_jh512_close(&garden->jh, hash);          
@@ -162,96 +169,137 @@ static void get_hash( void *output, const void *input, TortureGarden *garden,
            break;
    }

-    // Output the hash
    memcpy(output, hash, 64);
 }

-// Recursively traverse a given torture garden starting with a given hash and given node within the garden. The hash is overwritten with the final hash.
-static void traverse_garden( TortureGarden *garden, void *hash,
-	                     TortureNode *node )
-{
-    unsigned char _ALIGN(64) partialHash[64];
-    get_hash(partialHash, hash, garden, node->algo);
+static __thread TortureGarden garden;

-    if ( partialHash[63] % 2 == 0 )
-    {   // Last byte of output hash is even
-        if ( node->childLeft != NULL )
-            traverse_garden( garden, partialHash, node->childLeft );
-    }
-    else
-    {   // Last byte of output hash is odd
-        if ( node->childRight != NULL )
-            traverse_garden( garden, partialHash, node->childRight );
-    }
-
-    memcpy( hash, partialHash, 64 );
-}
-
-// Associate child nodes with a parent node
-static inline void link_nodes( TortureNode *parent, TortureNode *childLeft,
-	                       TortureNode *childRight ) 
-{
-    parent->childLeft = childLeft;
-    parent->childRight = childRight;
-}
-
-static TortureGarden garden;
-
-void initialize_torture_garden()
+bool initialize_torture_garden()
 {
    // Create torture garden nodes. Note that both sides of 19 and 20 lead to 21, and 21 has no children (to make traversal complete).
-    link_nodes(&garden.nodes[0], &garden.nodes[1], &garden.nodes[2]);
-    link_nodes(&garden.nodes[1], &garden.nodes[3], &garden.nodes[4]);
-    link_nodes(&garden.nodes[2], &garden.nodes[5], &garden.nodes[6]);
-    link_nodes(&garden.nodes[3], &garden.nodes[7], &garden.nodes[8]);
-    link_nodes(&garden.nodes[4], &garden.nodes[9], &garden.nodes[10]);
-    link_nodes(&garden.nodes[5], &garden.nodes[11], &garden.nodes[12]);
-    link_nodes(&garden.nodes[6], &garden.nodes[13], &garden.nodes[14]);
-    link_nodes(&garden.nodes[7], &garden.nodes[15], &garden.nodes[16]);
-    link_nodes(&garden.nodes[8], &garden.nodes[15], &garden.nodes[16]);
-    link_nodes(&garden.nodes[9], &garden.nodes[15], &garden.nodes[16]);
-    link_nodes(&garden.nodes[10], &garden.nodes[15], &garden.nodes[16]);
-    link_nodes(&garden.nodes[11], &garden.nodes[17], &garden.nodes[18]);
-    link_nodes(&garden.nodes[12], &garden.nodes[17], &garden.nodes[18]);
-    link_nodes(&garden.nodes[13], &garden.nodes[17], &garden.nodes[18]);
-    link_nodes(&garden.nodes[14], &garden.nodes[17], &garden.nodes[18]);
-    link_nodes(&garden.nodes[15], &garden.nodes[19], &garden.nodes[20]);
-    link_nodes(&garden.nodes[16], &garden.nodes[19], &garden.nodes[20]);
-    link_nodes(&garden.nodes[17], &garden.nodes[19], &garden.nodes[20]);
-    link_nodes(&garden.nodes[18], &garden.nodes[19], &garden.nodes[20]);
-    link_nodes(&garden.nodes[19], &garden.nodes[21], &garden.nodes[21]);
-    link_nodes(&garden.nodes[20], &garden.nodes[21], &garden.nodes[21]);
-    garden.nodes[21].childLeft = NULL;
-    garden.nodes[21].childRight = NULL;
+
+   garden.nodes[ 0].child[0] = &garden.nodes[ 1];
+   garden.nodes[ 0].child[1] = &garden.nodes[ 2];
+   garden.nodes[ 1].child[0] = &garden.nodes[ 3];
+   garden.nodes[ 1].child[1] = &garden.nodes[ 4];
+   garden.nodes[ 2].child[0] = &garden.nodes[ 5];
+   garden.nodes[ 2].child[1] = &garden.nodes[ 6];
+   garden.nodes[ 3].child[0] = &garden.nodes[ 7];
+   garden.nodes[ 3].child[1] = &garden.nodes[ 8];
+   garden.nodes[ 4].child[0] = &garden.nodes[ 9];
+   garden.nodes[ 4].child[1] = &garden.nodes[10];
+   garden.nodes[ 5].child[0] = &garden.nodes[11];
+   garden.nodes[ 5].child[1] = &garden.nodes[12];
+   garden.nodes[ 6].child[0] = &garden.nodes[13];
+   garden.nodes[ 6].child[1] = &garden.nodes[14];
+   garden.nodes[ 7].child[0] = &garden.nodes[15];
+   garden.nodes[ 7].child[1] = &garden.nodes[16];
+   garden.nodes[ 8].child[0] = &garden.nodes[15];
+   garden.nodes[ 8].child[1] = &garden.nodes[16];
+   garden.nodes[ 9].child[0] = &garden.nodes[15];
+   garden.nodes[ 9].child[1] = &garden.nodes[16];
+   garden.nodes[10].child[0] = &garden.nodes[15];
+   garden.nodes[10].child[1] = &garden.nodes[16];
+   garden.nodes[11].child[0] = &garden.nodes[17];
+   garden.nodes[11].child[1] = &garden.nodes[18];
+   garden.nodes[12].child[0] = &garden.nodes[17];
+   garden.nodes[12].child[1] = &garden.nodes[18];
+   garden.nodes[13].child[0] = &garden.nodes[17];
+   garden.nodes[13].child[1] = &garden.nodes[18];
+   garden.nodes[14].child[0] = &garden.nodes[17];
+   garden.nodes[14].child[1] = &garden.nodes[18];
+   garden.nodes[15].child[0] = &garden.nodes[19];
+   garden.nodes[15].child[1] = &garden.nodes[20];
+   garden.nodes[16].child[0] = &garden.nodes[19];
+   garden.nodes[16].child[1] = &garden.nodes[20];
+   garden.nodes[17].child[0] = &garden.nodes[19];
+   garden.nodes[17].child[1] = &garden.nodes[20];
+   garden.nodes[18].child[0] = &garden.nodes[19];
+   garden.nodes[18].child[1] = &garden.nodes[20];
+   garden.nodes[19].child[0] = &garden.nodes[21];
+   garden.nodes[19].child[1] = &garden.nodes[21];
+   garden.nodes[20].child[0] = &garden.nodes[21];
+   garden.nodes[20].child[1] = &garden.nodes[21];
+   garden.nodes[21].child[0] = NULL;
+   garden.nodes[21].child[1] = NULL;
+
+   return true;
 }

 // Produce a 32-byte hash from 80-byte input data
-int minotaur_hash( void *output, const void *input )
+int minotaur_hash( void *output, const void *input, int thr_id )
 {    
-    unsigned char _ALIGN(64) hash[64];
+    unsigned char hash[64] __attribute__ ((aligned (64)));

    // Find initial sha512 hash
    SHA512_Init( &garden.sha512 );
    SHA512_Update( &garden.sha512, input, 80 );
    SHA512_Final( (unsigned char*) hash, &garden.sha512 );

+    // Algo 6 (Hamsi) is very slow, so it's faster to skip a nonce that needs
+    // it. Only the first and last functions (nodes 0 and 21) are known
+    // before traversal, so abort if either of them is Hamsi.
+    if ( ( ( hash[ 0] % MINOTAUR_ALGO_COUNT ) == 6 )
+      || ( ( hash[21] % MINOTAUR_ALGO_COUNT ) == 6 ) )
+         return 0;
+
    // Assign algos to torture garden nodes based on initial hash
    for ( int i = 0; i < 22; i++ )
        garden.nodes[i].algo = hash[i] % MINOTAUR_ALGO_COUNT;

    // Send the initial hash through the torture garden
-    traverse_garden( &garden, hash, &garden.nodes[0] );
+    TortureNode *node = &garden.nodes[0];
+
+    while ( node )
+    {
+      get_hash( hash, hash, &garden, node->algo );
+      node = node->child[ hash[63] & 1 ];
+    }

    memcpy( output, hash, 32 );
-
    return 1;
 }

+int scanhash_minotaur( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t edata[20] __attribute__((aligned(64)));
+   uint32_t hash[8] __attribute__((aligned(64)));
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 1;
+   uint32_t n = first_nonce;
+   const int thr_id = mythr->id;
+   const bool bench = opt_benchmark;
+   uint64_t skipped = 0;
+
+   mm128_bswap32_80( edata, pdata );
+   do
+   {
+      edata[19] = n;
+      if ( likely( algo_gate.hash( hash, edata, thr_id ) ) )
+      {
+	 if ( unlikely( valid_hash( hash, ptarget ) && !bench ) )
+         {
+            pdata[19] = bswap_32( n );
+            submit_solution( work, hash, mythr );
+         }
+      }
+      else skipped++;
+      n++;
+   } while ( n < last_nonce && !work_restart[thr_id].restart );
+   *hashes_done = n - first_nonce - skipped;
+   pdata[19] = n;
+   return 0;
+}
+
 bool register_minotaur_algo( algo_gate_t* gate )
 {
+  gate->scanhash = (void*)&scanhash_minotaur;
  gate->hash      = (void*)&minotaur_hash;
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT;
-  initialize_torture_garden();
+  gate->miner_thread_init = (void*)&initialize_torture_garden;
  return true;
 };
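
The minotaur hunk above replaces the recursive `traverse_garden` with a loop that follows `child[ hash[63] & 1 ]` until it reaches a node with no children. The following is a minimal sketch of that loop, not the project's code: `ToyNode`, `toy_hash`, `toy_traverse`, `toy_demo` and the 4-node graph are hypothetical stand-ins for the real 22-node garden and its hash functions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for TortureNode: an algo index plus two children.
typedef struct ToyNode ToyNode;
struct ToyNode
{
   unsigned int algo;
   ToyNode *child[2];
};

// Hypothetical stand-in for get_hash(). It is NOT a cryptographic hash;
// it only mixes the algo index into 64 bytes so the walk is deterministic.
static void toy_hash( uint8_t *out, const uint8_t *in, unsigned int algo )
{
   uint8_t tmp[64];
   for ( int i = 0; i < 64; i++ )
      tmp[i] = (uint8_t)( in[i] * 31u + algo + (unsigned int)i );
   memcpy( out, tmp, 64 );
}

// Walk the graph iteratively. The low bit of the last hash byte selects
// the next child. Returns the number of nodes visited.
static int toy_traverse( ToyNode *root, uint8_t hash[64] )
{
   int visited = 0;
   ToyNode *node = root;
   while ( node )
   {
      toy_hash( hash, hash, node->algo );
      node = node->child[ hash[63] & 1 ];
      visited++;
   }
   return visited;
}

// Build a 4-node diamond (0 -> {1,2} -> 3) and return the path length.
static int toy_demo( void )
{
   ToyNode n[4];
   uint8_t hash[64] = { 0 };
   memset( n, 0, sizeof n );
   for ( unsigned int i = 0; i < 4; i++ ) n[i].algo = i;
   n[0].child[0] = &n[1];  n[0].child[1] = &n[2];
   n[1].child[0] = &n[3];  n[1].child[1] = &n[3];
   n[2].child[0] = &n[3];  n[2].child[1] = &n[3];
   n[3].child[0] = NULL;   n[3].child[1] = NULL;   // terminal node
   return toy_traverse( n, hash );
}
```

Storing the children as a two-element array lets the parity bit index the next node directly, which removes both the even/odd branch and the recursion.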

--- a/algo/x16/x16r-4way.c
+++ b/algo/x16/x16r-4way.c
@@ -347,14 +347,14 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
                          hash7, vhash );
         break;
         case FUGUE:
-             sph_fugue512_full( &ctx.fugue, hash0, in0, size );
-             sph_fugue512_full( &ctx.fugue, hash1, in1, size );
-             sph_fugue512_full( &ctx.fugue, hash2, in2, size );
-             sph_fugue512_full( &ctx.fugue, hash3, in3, size );
-             sph_fugue512_full( &ctx.fugue, hash4, in4, size );
-             sph_fugue512_full( &ctx.fugue, hash5, in5, size );
-             sph_fugue512_full( &ctx.fugue, hash6, in6, size );
-             sph_fugue512_full( &ctx.fugue, hash7, in7, size );
+             fugue512_full( &ctx.fugue, hash0, in0, size );
+             fugue512_full( &ctx.fugue, hash1, in1, size );
+             fugue512_full( &ctx.fugue, hash2, in2, size );
+             fugue512_full( &ctx.fugue, hash3, in3, size );
+             fugue512_full( &ctx.fugue, hash4, in4, size );
+             fugue512_full( &ctx.fugue, hash5, in5, size );
+             fugue512_full( &ctx.fugue, hash6, in6, size );
+             fugue512_full( &ctx.fugue, hash7, in7, size );
         break;
         case SHABAL:
             intrlv_8x32( vhash, in0, in1, in2, in3, in4, in5, in6, in7,
@@ -619,11 +619,20 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
            dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
         break;
         case GROESTL:
+#if defined(__VAES__)
+            intrlv_2x128( vhash, in0, in1, size<<3 );
+            groestl512_2way_full( &ctx.groestl, vhash, vhash, size );
+            dintrlv_2x128_512( hash0, hash1, vhash );
+            intrlv_2x128( vhash, in2, in3, size<<3 );
+            groestl512_2way_full( &ctx.groestl, vhash, vhash, size );
+            dintrlv_2x128_512( hash2, hash3, vhash );
+#else
            groestl512_full( &ctx.groestl, (char*)hash0, (char*)in0, size<<3 );
            groestl512_full( &ctx.groestl, (char*)hash1, (char*)in1, size<<3 );
            groestl512_full( &ctx.groestl, (char*)hash2, (char*)in2, size<<3 );
            groestl512_full( &ctx.groestl, (char*)hash3, (char*)in3, size<<3 );
-         break;
+#endif
+   	    break;
         case JH:
            if ( i == 0 )
               jh512_4way_update( &ctx.jh, input + (64<<2), 16 );
@@ -711,11 +720,20 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
            }
         break;
         case SHAVITE:
+#if defined(__VAES__)
+            intrlv_2x128( vhash, in0, in1, size<<3 );
+            shavite512_2way_full( &ctx.shavite, vhash, vhash, size );
+            dintrlv_2x128_512( hash0, hash1, vhash );
+            intrlv_2x128( vhash, in2, in3, size<<3 );
+            shavite512_2way_full( &ctx.shavite, vhash, vhash, size );
+            dintrlv_2x128_512( hash2, hash3, vhash );
+#else
            shavite512_full( &ctx.shavite, hash0, in0, size );
            shavite512_full( &ctx.shavite, hash1, in1, size );
            shavite512_full( &ctx.shavite, hash2, in2, size );
            shavite512_full( &ctx.shavite, hash3, in3, size );
-         break;
+#endif
+   	    break;
         case SIMD:
            intrlv_2x128( vhash, in0, in1, size<<3 );
            simd512_2way_full( &ctx.simd, vhash, vhash, size );
@@ -725,6 +743,14 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
            dintrlv_2x128_512( hash2, hash3, vhash );
         break;
         case ECHO:
+#if defined(__VAES__)
+            intrlv_2x128( vhash, in0, in1, size<<3 );
+            echo_2way_full( &ctx.echo, vhash, 512, vhash, size );
+            dintrlv_2x128_512( hash0, hash1, vhash );
+            intrlv_2x128( vhash, in2, in3, size<<3 );
+            echo_2way_full( &ctx.echo, vhash, 512, vhash, size );
+            dintrlv_2x128_512( hash2, hash3, vhash );
+#else
            echo_full( &ctx.echo, (BitSequence *)hash0, 512,
                              (const BitSequence *)in0, size );
            echo_full( &ctx.echo, (BitSequence *)hash1, 512,
@@ -733,7 +759,8 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
                              (const BitSequence *)in2, size );
            echo_full( &ctx.echo, (BitSequence *)hash3, 512,
                              (const BitSequence *)in3, size );
-         break;
+#endif
+   	    break;
         case HAMSI:
            if ( i == 0 )
               hamsi512_4way_update( &ctx.hamsi, input + (64<<2), 16 );
@@ -747,10 +774,10 @@ int x16r_4way_hash_generic( void* output, const void* input, int thrid )
            dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
         break;
         case FUGUE:
-             sph_fugue512_full( &ctx.fugue, hash0, in0, size );
-             sph_fugue512_full( &ctx.fugue, hash1, in1, size );
-             sph_fugue512_full( &ctx.fugue, hash2, in2, size );
-             sph_fugue512_full( &ctx.fugue, hash3, in3, size );
+             fugue512_full( &ctx.fugue, hash0, in0, size );
+             fugue512_full( &ctx.fugue, hash1, in1, size );
+             fugue512_full( &ctx.fugue, hash2, in2, size );
+             fugue512_full( &ctx.fugue, hash3, in3, size );
         break;
         case SHABAL:
             intrlv_4x32( vhash, in0, in1, in2, in3, size<<3 );
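
The VAES branches in the x16r-4way.c hunks above all follow one pattern: interleave two inputs, run a 2-way hash in place, then de-interleave the two 512-bit results. The following scalar sketch illustrates that data layout only. `toy_intrlv_2x128` and `toy_dintrlv_2x128_512` are hypothetical models whose lane order is an assumption inferred from the function names; the project's real routines use SIMD and may differ in detail.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical scalar model: interleave two buffers in 16-byte (128-bit)
// lanes. dst receives lane 0 of a, lane 0 of b, lane 1 of a, lane 1 of b...
// bit_len is the length of EACH source in bits, assumed a multiple of 128.
static void toy_intrlv_2x128( uint8_t *dst, const uint8_t *a,
                              const uint8_t *b, int bit_len )
{
   const int lanes = bit_len / 128;
   for ( int i = 0; i < lanes; i++ )
   {
      memcpy( dst + 32*i,      a + 16*i, 16 );
      memcpy( dst + 32*i + 16, b + 16*i, 16 );
   }
}

// Inverse for 512-bit (64-byte) results: split the lanes into two buffers.
static void toy_dintrlv_2x128_512( uint8_t *a, uint8_t *b,
                                   const uint8_t *src )
{
   for ( int i = 0; i < 4; i++ )
   {
      memcpy( a + 16*i, src + 32*i,      16 );
      memcpy( b + 16*i, src + 32*i + 16, 16 );
   }
}

// Round-trip check: returns 1 if both inputs are recovered unchanged and
// the second 16-byte lane of the interleaved buffer came from input 1.
static int toy_roundtrip_ok( void )
{
   uint8_t in0[64], in1[64], v[128], out0[64], out1[64];
   for ( int i = 0; i < 64; i++ )
   {
      in0[i] = (uint8_t)i;
      in1[i] = (uint8_t)( 200 - i );
   }
   toy_intrlv_2x128( v, in0, in1, 512 );
   toy_dintrlv_2x128_512( out0, out1, v );
   return memcmp( in0, out0, 64 ) == 0
       && memcmp( in1, out1, 64 ) == 0
       && v[16] == in1[0];
}
```

With this layout each 256-bit vector holds one 128-bit lane from each input, so a single 2-way call can process both inputs at once.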
--- a/algo/x16/x16r-gate.c
+++ b/algo/x16/x16r-gate.c
@@ -61,7 +61,8 @@ bool register_x16r_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16r;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
+	                VAES_OPT | VAES256_OPT;
  x16_r_s_getAlgoString = (void*)&x16r_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -79,7 +80,8 @@ bool register_x16rv2_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rv2;
  gate->hash      = (void*)&x16rv2_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
+	                VAES_OPT | VAES256_OPT;
  x16_r_s_getAlgoString = (void*)&x16r_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -97,7 +99,8 @@ bool register_x16s_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16r;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
+	                VAES_OPT | VAES256_OPT;
  x16_r_s_getAlgoString = (void*)&x16s_getAlgoString;
  opt_target_factor = 256.0;
  return true;
@@ -135,18 +138,16 @@ void x16rt_getAlgoString( const uint32_t *timeHash, char *output)

 void veil_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
 {
+   uint32_t merkleroothash[8];
+   uint32_t witmerkleroothash[8];
+   uint32_t denom10[8];
+   uint32_t denom100[8];
+   uint32_t denom1000[8];
+   uint32_t denom10000[8];
+   int i;
   uchar merkle_tree[64] = { 0 };
-   size_t t;

   algo_gate.gen_merkle_root( merkle_tree, sctx );
-   // Increment extranonce2
-   for ( t = 0; t < sctx->xnonce2_size && !( ++sctx->job.xnonce2[t] ); t++ );
-
-   // Assemble block header
-//   algo_gate.build_block_header( g_work, le32dec( sctx->job.version ),
-//          (uint32_t*) sctx->job.prevhash, (uint32_t*) merkle_tree,
-//          le32dec( sctx->job.ntime ), le32dec(sctx->job.nbits) );
-   int i;

   memset( g_work->data, 0, sizeof(g_work->data) );
   g_work->data[0] = le32dec( sctx->job.version );
@@ -164,35 +165,35 @@ void veil_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
   g_work->data[31] = 0x00000280;

   for ( i = 0; i < 8; i++ )
-      g_work->merkleroothash[7 - i] = be32dec((uint32_t *)merkle_tree + i);
+      merkleroothash[7 - i] = be32dec((uint32_t *)merkle_tree + i);
   for ( i = 0; i < 8; i++ )
-      g_work->witmerkleroothash[7 - i] = be32dec((uint32_t *)merkle_tree + i);
+      witmerkleroothash[7 - i] = be32dec((uint32_t *)merkle_tree + i);
   for ( i = 0; i < 8; i++ )
-      g_work->denom10[i] =    le32dec((uint32_t *)sctx->job.denom10 + i);
+      denom10[i] =    le32dec((uint32_t *)sctx->job.denom10 + i);
   for ( i = 0; i < 8; i++ )
-      g_work->denom100[i] =   le32dec((uint32_t *)sctx->job.denom100 + i);
+      denom100[i] =   le32dec((uint32_t *)sctx->job.denom100 + i);
   for ( i = 0; i < 8; i++ )
-      g_work->denom1000[i] =  le32dec((uint32_t *)sctx->job.denom1000 + i);
+      denom1000[i] =  le32dec((uint32_t *)sctx->job.denom1000 + i);
   for ( i = 0; i < 8; i++ )
-      g_work->denom10000[i] = le32dec((uint32_t *)sctx->job.denom10000 + i);
+      denom10000[i] = le32dec((uint32_t *)sctx->job.denom10000 + i);

   uint32_t pofnhash[8];
   memset(pofnhash, 0x00, 32);

-   char denom10_str      [ 2 * sizeof( g_work->denom10 )           + 1 ];
-   char denom100_str     [ 2 * sizeof( g_work->denom100 )          + 1 ];
-   char denom1000_str    [ 2 * sizeof( g_work->denom1000 )         + 1 ];
-   char denom10000_str   [ 2 * sizeof( g_work->denom10000 )        + 1 ];
-   char merkleroot_str   [ 2 * sizeof( g_work->merkleroothash )    + 1 ];
-   char witmerkleroot_str[ 2 * sizeof( g_work->witmerkleroothash ) + 1 ];
+   char denom10_str      [ 2 * sizeof( denom10 )           + 1 ];
+   char denom100_str     [ 2 * sizeof( denom100 )          + 1 ];
+   char denom1000_str    [ 2 * sizeof( denom1000 )         + 1 ];
+   char denom10000_str   [ 2 * sizeof( denom10000 )        + 1 ];
+   char merkleroot_str   [ 2 * sizeof( merkleroothash )    + 1 ];
+   char witmerkleroot_str[ 2 * sizeof( witmerkleroothash ) + 1 ];
   char pofn_str         [ 2 * sizeof( pofnhash )                  + 1 ];

-   cbin2hex( denom10_str,       (char*) g_work->denom10,           32 );
-   cbin2hex( denom100_str,      (char*) g_work->denom100,          32 );
-   cbin2hex( denom1000_str,     (char*) g_work->denom1000,         32 );
-   cbin2hex( denom10000_str,    (char*) g_work->denom10000,        32 );
-   cbin2hex( merkleroot_str,    (char*) g_work->merkleroothash,    32 );
-   cbin2hex( witmerkleroot_str, (char*) g_work->witmerkleroothash, 32 );
+   cbin2hex( denom10_str,       (char*) denom10,           32 );
+   cbin2hex( denom100_str,      (char*) denom100,          32 );
+   cbin2hex( denom1000_str,     (char*) denom1000,         32 );
+   cbin2hex( denom10000_str,    (char*) denom10000,        32 );
+   cbin2hex( merkleroot_str,    (char*) merkleroothash,    32 );
+   cbin2hex( witmerkleroot_str, (char*) witmerkleroothash, 32 );
   cbin2hex( pofn_str,          (char*) pofnhash,                  32 );

   if ( true )
@@ -232,7 +233,8 @@ bool register_x16rt_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rt;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
+	                VAES_OPT | VAES256_OPT;
  opt_target_factor = 256.0;
  return true;
 };
@@ -249,7 +251,8 @@ bool register_x16rt_veil_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x16rt;
  gate->hash      = (void*)&x16r_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
+	                VAES_OPT | VAES256_OPT;
  gate->build_extraheader = (void*)&veil_build_extraheader;
  opt_target_factor = 256.0;
  return true;
@@ -279,22 +282,17 @@ bool register_x21s_algo( algo_gate_t* gate )
  gate->scanhash          = (void*)&scanhash_x21s_8way;
  gate->hash              = (void*)&x21s_8way_hash;
  gate->miner_thread_init = (void*)&x21s_8way_thread_init;
-  gate->optimizations     = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT
-                            | VAES_OPT;
 #elif defined (X16R_4WAY)
  gate->scanhash          = (void*)&scanhash_x21s_4way;
  gate->hash              = (void*)&x21s_4way_hash;
  gate->miner_thread_init = (void*)&x21s_4way_thread_init;
-  gate->optimizations     = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT
-                            | AVX512_OPT | VAES_OPT;
 #else
  gate->scanhash          = (void*)&scanhash_x21s;
  gate->hash              = (void*)&x21s_hash;
  gate->miner_thread_init = (void*)&x21s_thread_init;
-  gate->optimizations     = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT
-                          | AVX512_OPT | VAES_OPT;
 #endif
-//  gate->optimizations     = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT | AVX512_OPT;
+  gate->optimizations     = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT |
+	                    VAES_OPT | VAES256_OPT;
  x16_r_s_getAlgoString   = (void*)&x16s_getAlgoString;
  opt_target_factor = 256.0;
  return true;
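
The x16r-gate.c hunks above add `VAES256_OPT` to each `gate->optimizations` mask, and in `register_x21s_algo` they replace three per-branch assignments with one shared assignment that no longer includes `SHA_OPT`. The following sketch shows how such a mask is combined and tested. The `TOY_*` constants, their bit values, and `toy_gate_t` are hypothetical; the real `*_OPT` definitions live elsewhere in the project and may differ.

```c
#include <stdint.h>
#include <stdbool.h>

// Hypothetical bit values, one bit per optimization.
enum
{
   TOY_SSE2_OPT    = 1 << 0,
   TOY_AES_OPT     = 1 << 1,
   TOY_AVX2_OPT    = 1 << 2,
   TOY_AVX512_OPT  = 1 << 3,
   TOY_SHA_OPT     = 1 << 4,
   TOY_VAES_OPT    = 1 << 5,
   TOY_VAES256_OPT = 1 << 6
};

// Hypothetical stand-in for the optimizations field of algo_gate_t.
typedef struct { uint32_t optimizations; } toy_gate_t;

// One assignment shared by all build variants, as in the x21s hunk.
static void toy_register( toy_gate_t *gate )
{
   gate->optimizations = TOY_SSE2_OPT | TOY_AES_OPT | TOY_AVX2_OPT
                       | TOY_AVX512_OPT | TOY_VAES_OPT | TOY_VAES256_OPT;
}

// True if the given flag is set in the gate's mask.
static bool toy_has_opt( const toy_gate_t *gate, uint32_t flag )
{
   return ( gate->optimizations & flag ) != 0;
}
```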
--- a/algo/x16/x16r-gate.h
+++ b/algo/x16/x16r-gate.h
@@ -24,6 +24,7 @@
 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
 #endif
 #if defined (__AVX2__)
 #include "algo/blake/blake-hash-4way.h"
@@ -40,6 +41,7 @@
 #include "algo/sha/sha-hash-4way.h"
 #if defined(__VAES__)
  #include "algo/groestl/groestl512-hash-4way.h"
+  #include "algo/shavite/shavite-hash-2way.h"
  #include "algo/shavite/shavite-hash-4way.h"
  #include "algo/echo/echo-hash-4way.h"
 #endif
@@ -111,7 +113,7 @@ union _x16r_8way_context_overlay
    cubehashParam           cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_8way_context     sha512;
@@ -144,18 +146,24 @@ union _x16r_4way_context_overlay
 {
    blake512_4way_context   blake;
    bmw512_4way_context     bmw;
-    hashState_echo          echo;
+#if defined(__VAES__)
+    groestl512_2way_context groestl;
+    shavite512_2way_context shavite;
+    echo_2way_context       echo;
+#else
    hashState_groestl       groestl;
+    shavite512_context      shavite;
+    hashState_echo          echo;
+#endif
    skein512_4way_context   skein;
    jh512_4way_context      jh;
    keccak512_4way_context  keccak;
    luffa_2way_context      luffa;
    hashState_luffa         luffa1;
    cubehashParam           cube;
-    shavite512_context      shavite;
    simd_2way_context       simd;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_4way_context     sha512;
@@ -180,9 +188,11 @@ union _x16r_context_overlay
 #if defined(__AES__)
        hashState_echo          echo;
        hashState_groestl       groestl;
+        hashState_fugue         fugue;
 #else
        sph_groestl512_context   groestl;
        sph_echo512_context      echo;
+        sph_fugue512_context    fugue;
 #endif
        sph_blake512_context    blake;
        sph_bmw512_context      bmw;
@@ -194,7 +204,6 @@ union _x16r_context_overlay
        shavite512_context      shavite;
        hashState_sd            simd;
        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
        sph_shabal512_context   shabal;
        sph_whirlpool_context   whirlpool;
        SHA512_CTX              sha512;
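
The x16r-gate.h hunks above move the `fugue` member of each context overlay inside the `#if defined(__AES__)` block, so the same member name resolves to a different context type depending on the build. The following sketch shows that pattern in isolation. `TOY_FAST`, the `toy_*` types, and their sizes are hypothetical and chosen only to make the union's size easy to reason about.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical context types of different sizes.
typedef struct { uint8_t state[64];  } toy_small_ctx;
typedef struct { uint8_t state[256]; } toy_large_ctx;

// TOY_FAST is a hypothetical feature macro standing in for __AES__.
// Callers always write ctx.fugue; only the type behind the name changes.
union toy_context_overlay
{
#if defined(TOY_FAST)
   toy_large_ctx fugue;
#else
   toy_small_ctx fugue;
#endif
   toy_small_ctx blake;
   toy_large_ctx simd;
};

// Only one hash runs at a time, so the contexts can share storage: the
// union is as large as its largest member, not the sum of all members.
static size_t toy_overlay_size( void )
{
   return sizeof( union toy_context_overlay );
}
```

Because the member name stays the same in both branches, only the call site that picks the matching hash function needs its own `#if`.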
--- a/algo/x16/x16r.c
+++ b/algo/x16/x16r.c
@@ -151,8 +151,12 @@ int x16r_hash_generic( void* output, const void* input, int thrid )
            sph_hamsi512_close( &ctx.hamsi, hash );
         break;
         case FUGUE:
-            sph_fugue512_full( &ctx.fugue, hash, in, size );
-         break;
+#if defined(__AES__)
+         fugue512_full( &ctx.fugue, hash, in, size );
+#else
+	 sph_fugue512_full( &ctx.fugue, hash, in, size );
+#endif
+	 break;
         case SHABAL:
            if ( i == 0 )
               sph_shabal512( &ctx.shabal, in+64, 16 );
--- a/algo/x16/x16rv2-4way.c
+++ b/algo/x16/x16rv2-4way.c
@@ -8,30 +8,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
-#include "algo/blake/blake-hash-4way.h"
-#include "algo/bmw/bmw-hash-4way.h"
-#include "algo/groestl/aes_ni/hash-groestl.h"
-#include "algo/groestl/aes_ni/hash-groestl.h"
-#include "algo/skein/skein-hash-4way.h"
-#include "algo/jh/jh-hash-4way.h"
-#include "algo/keccak/keccak-hash-4way.h"
-#include "algo/shavite/sph_shavite.h"
-#include "algo/luffa/luffa-hash-2way.h"
-#include "algo/cubehash/cubehash_sse2.h"
-#include "algo/cubehash/cube-hash-2way.h"
-#include "algo/simd/simd-hash-2way.h"
-#include "algo/echo/aes_ni/hash_api.h"
-#include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
-#include "algo/shabal/shabal-hash-4way.h"
-#include "algo/whirlpool/sph_whirlpool.h"
-#include "algo/sha/sha-hash-4way.h"
 #include "algo/tiger/sph_tiger.h"
-#if defined(__VAES__)
-  #include "algo/groestl/groestl512-hash-4way.h"
-  #include "algo/shavite/shavite-hash-4way.h"
-  #include "algo/echo/echo-hash-4way.h"
-#endif

 #if defined (X16RV2_8WAY)

@@ -46,7 +23,7 @@ union _x16rv2_8way_context_overlay
    cubehashParam           cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_8way_context     sha512;
@@ -432,14 +409,14 @@ int x16rv2_8way_hash( void* output, const void* input, int thrid )
                          hash7, vhash );
         break;
         case FUGUE:
-            sph_fugue512_full( &ctx.fugue, hash0, in0, size );
-            sph_fugue512_full( &ctx.fugue, hash1, in1, size );
-            sph_fugue512_full( &ctx.fugue, hash2, in2, size );
-            sph_fugue512_full( &ctx.fugue, hash3, in3, size );
-            sph_fugue512_full( &ctx.fugue, hash4, in4, size );
-            sph_fugue512_full( &ctx.fugue, hash5, in5, size );
-            sph_fugue512_full( &ctx.fugue, hash6, in6, size );
-            sph_fugue512_full( &ctx.fugue, hash7, in7, size );
+            fugue512_full( &ctx.fugue, hash0, in0, size );
+            fugue512_full( &ctx.fugue, hash1, in1, size );
+            fugue512_full( &ctx.fugue, hash2, in2, size );
+            fugue512_full( &ctx.fugue, hash3, in3, size );
+            fugue512_full( &ctx.fugue, hash4, in4, size );
+            fugue512_full( &ctx.fugue, hash5, in5, size );
+            fugue512_full( &ctx.fugue, hash6, in6, size );
+            fugue512_full( &ctx.fugue, hash7, in7, size );
         break;
         case SHABAL:
            intrlv_8x32( vhash, in0, in1, in2, in3, in4, in5, in6, in7,
@@ -695,17 +672,23 @@ union _x16rv2_4way_context_overlay
 {
    blake512_4way_context   blake;
    bmw512_4way_context     bmw;
-    hashState_echo          echo;
+#if defined(__VAES__)
+    groestl512_2way_context groestl;
+    shavite512_2way_context shavite;
+    echo_2way_context       echo;
+#else
    hashState_groestl       groestl;
+    shavite512_context      shavite;
+    hashState_echo          echo;
+#endif
    skein512_4way_context   skein;
    jh512_4way_context      jh;
    keccak512_4way_context  keccak;
    luffa_2way_context      luffa;
    cubehashParam           cube;
-    shavite512_context      shavite;
    simd_2way_context       simd;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_4way_context     sha512;
@@ -768,10 +751,19 @@ int x16rv2_4way_hash( void* output, const void* input, int thrid )
            dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
         break;
         case GROESTL:
+#if defined(__VAES__)
+            intrlv_2x128( vhash, in0, in1, size<<3 );
+            groestl512_2way_full( &ctx.groestl, vhash, vhash, size );
+            dintrlv_2x128_512( hash0, hash1, vhash );
+            intrlv_2x128( vhash, in2, in3, size<<3 );
+            groestl512_2way_full( &ctx.groestl, vhash, vhash, size );
+            dintrlv_2x128_512( hash2, hash3, vhash );
+#else
            groestl512_full( &ctx.groestl, (char*)hash0, (char*)in0, size<<3 );
            groestl512_full( &ctx.groestl, (char*)hash1, (char*)in1, size<<3 );
            groestl512_full( &ctx.groestl, (char*)hash2, (char*)in2, size<<3 );
            groestl512_full( &ctx.groestl, (char*)hash3, (char*)in3, size<<3 );
+#endif
         break;
         case JH:
            if ( i == 0 )
@@ -910,10 +902,19 @@ int x16rv2_4way_hash( void* output, const void* input, int thrid )
            }
         break;
         case SHAVITE:
+#if defined(__VAES__)
+            intrlv_2x128( vhash, in0, in1, size<<3 );
+            shavite512_2way_full( &ctx.shavite, vhash, vhash, size );
+            dintrlv_2x128_512( hash0, hash1, vhash );
+            intrlv_2x128( vhash, in2, in3, size<<3 );
+            shavite512_2way_full( &ctx.shavite, vhash, vhash, size );
+            dintrlv_2x128_512( hash2, hash3, vhash );
+#else
            shavite512_full( &ctx.shavite, hash0, in0, size );
            shavite512_full( &ctx.shavite, hash1, in1, size );
            shavite512_full( &ctx.shavite, hash2, in2, size );
            shavite512_full( &ctx.shavite, hash3, in3, size );
+#endif
         break;
         case SIMD:
            intrlv_2x128( vhash, in0, in1, size<<3 );
@@ -924,6 +925,14 @@ int x16rv2_4way_hash( void* output, const void* input, int thrid )
            dintrlv_2x128_512( hash2, hash3, vhash );
         break;
         case ECHO:
+#if defined(__VAES__)
+            intrlv_2x128( vhash, in0, in1, size<<3 );
+            echo_2way_full( &ctx.echo, vhash, 512, vhash, size );
+            dintrlv_2x128_512( hash0, hash1, vhash );
+            intrlv_2x128( vhash, in2, in3, size<<3 );
+            echo_2way_full( &ctx.echo, vhash, 512, vhash, size );
+            dintrlv_2x128_512( hash2, hash3, vhash );
+#else
            echo_full( &ctx.echo, (BitSequence *)hash0, 512,
                              (const BitSequence *)in0, size );
            echo_full( &ctx.echo, (BitSequence *)hash1, 512,
@@ -932,6 +941,7 @@ int x16rv2_4way_hash( void* output, const void* input, int thrid )
                              (const BitSequence *)in2, size );
            echo_full( &ctx.echo, (BitSequence *)hash3, 512,
                              (const BitSequence *)in3, size );
+#endif
         break;
         case HAMSI:
            if ( i == 0 )
@@ -946,10 +956,10 @@ int x16rv2_4way_hash( void* output, const void* input, int thrid )
            dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
         break;
         case FUGUE:
-            sph_fugue512_full( &ctx.fugue, hash0, in0, size );
-            sph_fugue512_full( &ctx.fugue, hash1, in1, size );
-            sph_fugue512_full( &ctx.fugue, hash2, in2, size );
-            sph_fugue512_full( &ctx.fugue, hash3, in3, size );
+            fugue512_full( &ctx.fugue, hash0, in0, size );
+            fugue512_full( &ctx.fugue, hash1, in1, size );
+            fugue512_full( &ctx.fugue, hash2, in2, size );
+            fugue512_full( &ctx.fugue, hash3, in3, size );
         break;
         case SHABAL:
             intrlv_4x32( vhash, in0, in1, in2, in3, size<<3 );
--- a/algo/x16/x16rv2.c
+++ b/algo/x16/x16rv2.c
@@ -8,41 +8,18 @@

 #if !defined(X16R_8WAY) && !defined(X16R_4WAY)

-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include "algo/blake/sph_blake.h"
-#include "algo/bmw/sph_bmw.h"
-#include "algo/groestl/sph_groestl.h"
-#include "algo/jh/sph_jh.h"
-#include "algo/keccak/sph_keccak.h"
-#include "algo/skein/sph_skein.h"
-#include "algo/shavite/sph_shavite.h"
-#include "algo/luffa/luffa_for_sse2.h"
-#include "algo/cubehash/cubehash_sse2.h"
-#include "algo/simd/nist.h"
-#include "algo/echo/sph_echo.h"
-#include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
-#include "algo/shabal/sph_shabal.h"
-#include "algo/whirlpool/sph_whirlpool.h"
-#include <openssl/sha.h>
 #include "algo/tiger/sph_tiger.h"
-#if defined(__AES__)
-  #include "algo/echo/aes_ni/hash_api.h"
-  #include "algo/groestl/aes_ni/hash-groestl.h"
-#endif
-
-static __thread uint32_t s_ntime = UINT32_MAX;

 union _x16rv2_context_overlay
 {
 #if defined(__AES__)
        hashState_echo          echo;
        hashState_groestl       groestl;
+        hashState_fugue         fugue;
 #else
        sph_groestl512_context   groestl;
        sph_echo512_context      echo;
+        sph_fugue512_context    fugue;
 #endif
        sph_blake512_context    blake;
        sph_bmw512_context      bmw;
@@ -54,7 +31,6 @@ union _x16rv2_context_overlay
        shavite512_context      shavite;
        hashState_sd            simd;
        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
        sph_shabal512_context   shabal;
        sph_whirlpool_context   whirlpool;
        SHA512_CTX              sha512;
@@ -160,8 +136,12 @@ int x16rv2_hash( void* output, const void* input, int thrid )
             sph_hamsi512_close( &ctx.hamsi, hash );
         break;
         case FUGUE:
+#if defined(__AES__)
+             fugue512_full( &ctx.fugue, hash, in, size );
+#else
             sph_fugue512_full( &ctx.fugue, hash, in, size );
-         break;
+#endif
+         break;
         case SHABAL:
             sph_shabal512_init( &ctx.shabal );
             sph_shabal512( &ctx.shabal, in, size );
--- a/algo/x17/sonoa-4way.c
+++ b/algo/x17/sonoa-4way.c
@@ -16,7 +16,7 @@
 #include "algo/simd/simd-hash-2way.h"
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/shabal/shabal-hash-4way.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/haval/haval-hash-4way.h"
@@ -40,7 +40,7 @@ union _sonoa_8way_context_overlay
    cube_4way_context       cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_8way_context     sha512;
@@ -58,7 +58,7 @@ union _sonoa_8way_context_overlay

 typedef union _sonoa_8way_context_overlay sonoa_8way_context_overlay;

-int sonoa_8way_hash( void *state, const void *input, int thrid )
+int sonoa_8way_hash( void *state, const void *input, int thr_id )
 {
     uint64_t vhash[8*8] __attribute__ ((aligned (128)));
     uint64_t vhashA[8*8] __attribute__ ((aligned (64)));
@@ -186,7 +186,7 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )

 #endif

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 2

     bmw512_8way_full( &ctx.bmw, vhash, vhash, 64 );
@@ -302,7 +302,7 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )
     hamsi512_8way_update( &ctx.hamsi, vhash, 64 );
     hamsi512_8way_close( &ctx.hamsi, vhash );

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 3

     bmw512_8way_full( &ctx.bmw, vhash, vhash, 64 );
@@ -423,16 +423,16 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )
     dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                       vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
-     sph_fugue512_full( &ctx.fugue, hash4, hash4, 64 );
-     sph_fugue512_full( &ctx.fugue, hash5, hash5, 64 );
-     sph_fugue512_full( &ctx.fugue, hash6, hash6, 64 );
-     sph_fugue512_full( &ctx.fugue, hash7, hash7, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+     fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+     fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+     fugue512_full( &ctx.fugue, hash7, hash7, 64 );

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 4

     intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
@@ -554,14 +554,14 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )
     dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                       vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
-     sph_fugue512_full( &ctx.fugue, hash4, hash4, 64 );
-     sph_fugue512_full( &ctx.fugue, hash5, hash5, 64 );
-     sph_fugue512_full( &ctx.fugue, hash6, hash6, 64 );
-     sph_fugue512_full( &ctx.fugue, hash7, hash7, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+     fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+     fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+     fugue512_full( &ctx.fugue, hash7, hash7, 64 );

     intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
                      hash7 );
@@ -630,7 +630,7 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )

 #endif

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 5

     bmw512_8way_full( &ctx.bmw, vhash, vhash, 64 );
@@ -755,14 +755,14 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )
     dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                       vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
-     sph_fugue512_full( &ctx.fugue, hash4, hash4, 64 );
-     sph_fugue512_full( &ctx.fugue, hash5, hash5, 64 );
-     sph_fugue512_full( &ctx.fugue, hash6, hash6, 64 );
-     sph_fugue512_full( &ctx.fugue, hash7, hash7, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+     fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+     fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+     fugue512_full( &ctx.fugue, hash7, hash7, 64 );

     intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
                      hash7 );
@@ -783,7 +783,7 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )
     sph_whirlpool512_full( &ctx.whirlpool, hash6, hash6, 64 );
     sph_whirlpool512_full( &ctx.whirlpool, hash7, hash7, 64 );

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 6

     intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
@@ -905,14 +905,14 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )
     dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                       vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
-     sph_fugue512_full( &ctx.fugue, hash4, hash4, 64 );
-     sph_fugue512_full( &ctx.fugue, hash5, hash5, 64 );
-     sph_fugue512_full( &ctx.fugue, hash6, hash6, 64 );
-     sph_fugue512_full( &ctx.fugue, hash7, hash7, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+     fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+     fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+     fugue512_full( &ctx.fugue, hash7, hash7, 64 );

     intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
                      hash7 );
@@ -952,7 +952,7 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )
     sph_whirlpool512_full( &ctx.whirlpool, hash6, hash6, 64 );
     sph_whirlpool512_full( &ctx.whirlpool, hash7, hash7, 64 );

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 7

     intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
@@ -1074,14 +1074,14 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )
     dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                       vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
-     sph_fugue512_full( &ctx.fugue, hash4, hash4, 64 );
-     sph_fugue512_full( &ctx.fugue, hash5, hash5, 64 );
-     sph_fugue512_full( &ctx.fugue, hash6, hash6, 64 );
-     sph_fugue512_full( &ctx.fugue, hash7, hash7, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+     fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+     fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+     fugue512_full( &ctx.fugue, hash7, hash7, 64 );

     intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
                      hash7 );
@@ -1117,49 +1117,6 @@ int sonoa_8way_hash( void *state, const void *input, int thrid )

     return 1;
 }
-     
-int scanhash_sonoa_8way( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t hash[8*16] __attribute__ ((aligned (128)));
-   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
-   uint32_t *hashd7 = &(hash[7<<3]);
-   uint32_t *pdata = work->data;
-   const uint32_t *ptarget = work->target;
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 8;
-   __m512i  *noncev = (__m512i*)vdata + 9;   // aligned
-   uint32_t n = first_nonce;
-   const int thr_id = mythr->id;
-   const uint32_t targ32 = ptarget[7];
-
-   mm512_bswap32_intrlv80_8x64( vdata, pdata );
-   *noncev = mm512_intrlv_blend_32(
-              _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
-                                n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
-
-   do
-   {
-      if ( sonoa_8way_hash( hash, vdata, thr_id ) )
-      for ( int lane = 0; lane < 8; lane++ )
-      if unlikely( ( hashd7[ lane ] <= targ32 ) )
-      {
-         extr_lane_8x32( lane_hash, hash, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) && !opt_benchmark ) )
-         {
-            pdata[19] = bswap_32( n + lane );
-            submit_solution( work, lane_hash, mythr );
-         }
-      }
-      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
-      n += 8;
-   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
-   pdata[19] = n;
-   *hashes_done = n - first_nonce;
-   return 0;
-}

 #elif defined(SONOA_4WAY)

@@ -1167,7 +1124,13 @@ union _sonoa_4way_context_overlay
 {
    blake512_4way_context   blake;
    bmw512_4way_context     bmw;
+#if defined(__VAES__)
+    groestl512_2way_context groestl;
+    echo512_2way_context    echo;
+#else
    hashState_groestl       groestl;
+    hashState_echo          echo;
+#endif
    skein512_4way_context   skein;
    jh512_4way_context      jh;
    keccak512_4way_context  keccak;
@@ -1175,9 +1138,8 @@ union _sonoa_4way_context_overlay
    cube_2way_context       cube;
    shavite512_2way_context shavite;
    simd_2way_context       simd;
-    hashState_echo          echo;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_4way_context     sha512;
@@ -1186,7 +1148,7 @@ union _sonoa_4way_context_overlay

 typedef union _sonoa_4way_context_overlay sonoa_4way_context_overlay;

-int sonoa_4way_hash( void *state, const void *input, int thrid )
+int sonoa_4way_hash( void *state, const void *input, int thr_id )
 {
     uint64_t hash0[8] __attribute__ ((aligned (64)));
     uint64_t hash1[8] __attribute__ ((aligned (64)));
@@ -1205,6 +1167,17 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
     bmw512_4way_close( &ctx.bmw, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -1214,6 +1187,8 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     
     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

     jh512_4way_init( &ctx.jh );
@@ -1238,6 +1213,15 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_2x128_512( hash0, hash1, vhashA );
     dintrlv_2x128_512( hash2, hash3, vhashB );

@@ -1249,16 +1233,29 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
                     (const BitSequence *)hash2, 64 );
     echo_full( &ctx.echo, (BitSequence *)hash3, 512,
                     (const BitSequence *)hash3, 64 );
-     
-     if ( work_restart[thrid].restart ) return 0;
-// 2

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
+     if ( work_restart[thr_id].restart ) return 0;
+// 2
+
     bmw512_4way_init( &ctx.bmw );
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
     bmw512_4way_close( &ctx.bmw, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -1268,6 +1265,8 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

     jh512_4way_init( &ctx.jh );
@@ -1292,6 +1291,15 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_2x128_512( hash0, hash1, vhashA );
     dintrlv_2x128_512( hash2, hash3, vhashB );

@@ -1306,17 +1314,30 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, 64 );
     hamsi512_4way_close( &ctx.hamsi, vhash );

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 3

     bmw512_4way_init( &ctx.bmw );
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
     bmw512_4way_close( &ctx.bmw, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -1326,6 +1347,8 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

     jh512_4way_init( &ctx.jh );
@@ -1350,6 +1373,15 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_2x128_512( hash0, hash1, vhashA );
     dintrlv_2x128_512( hash2, hash3, vhashB );

@@ -1364,18 +1396,20 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, 64 );
     hamsi512_4way_close( &ctx.hamsi, vhash );

     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 4
     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

@@ -1383,6 +1417,17 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
     bmw512_4way_close( &ctx.bmw, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -1392,6 +1437,8 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

     jh512_4way_init( &ctx.jh );
@@ -1416,6 +1463,15 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_2x128_512( hash0, hash1, vhashA );
     dintrlv_2x128_512( hash2, hash3, vhashB );

@@ -1430,16 +1486,18 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, 64 );
     hamsi512_4way_close( &ctx.hamsi, vhash );

     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );

     intrlv_4x32_512( vhash, hash0, hash1, hash2, hash3 );

@@ -1453,6 +1511,15 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     hamsi512_4way_update( &ctx.hamsi, vhashB, 64 );
     hamsi512_4way_close( &ctx.hamsi, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+#else
+
     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

     echo_full( &ctx.echo, (BitSequence *)hash0, 512,
@@ -1467,12 +1534,14 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     intrlv_2x128_512( vhashA, hash0, hash1 );
     intrlv_2x128_512( vhashB, hash2, hash3 );

+#endif
+
     shavite512_2way_init( &ctx.shavite );
     shavite512_2way_update_close( &ctx.shavite, vhashA, vhashA, 64 );
     shavite512_2way_init( &ctx.shavite );
     shavite512_2way_update_close( &ctx.shavite, vhashB, vhashB, 64 );

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 5
     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );

@@ -1486,6 +1555,20 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     shabal512_4way_update( &ctx.shabal, vhashB, 64 );
     shabal512_4way_close( &ctx.shabal, vhash );

+#if defined(__VAES__)
+
+     // Two-step equivalent of rintrlv_4x32_2x128( vhashA, vhashB, vhash, 512 ):
+     dintrlv_4x32_512( hash0, hash1, hash2, hash3, vhash );
+     intrlv_2x128_512( vhashA, hash0, hash1 );
+     intrlv_2x128_512( vhashB, hash2, hash3 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_4x32_512( hash0, hash1, hash2, hash3, vhash );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -1495,6 +1578,8 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

     jh512_4way_init( &ctx.jh );
@@ -1519,6 +1604,15 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_2x128_512( hash0, hash1, vhashA );
     dintrlv_2x128_512( hash2, hash3, vhashB );

@@ -1533,16 +1627,18 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, 64 );
     hamsi512_4way_close( &ctx.hamsi, vhash );

     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );

     intrlv_4x32_512( vhash, hash0, hash1, hash2, hash3 );

@@ -1557,7 +1653,7 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     sph_whirlpool512_full( &ctx.whirlpool, hash2, hash2, 64 );
     sph_whirlpool512_full( &ctx.whirlpool, hash3, hash3, 64 );

-     if ( work_restart[thrid].restart ) return 0;
+     if ( work_restart[thr_id].restart ) return 0;
 // 6

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );
@@ -1566,6 +1662,17 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
     bmw512_4way_close( &ctx.bmw, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -1575,6 +1682,8 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

     jh512_4way_init( &ctx.jh );
@@ -1599,6 +1708,15 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_2x128_512( hash0, hash1, vhashA );
     dintrlv_2x128_512( hash2, hash3, vhashB );

@@ -1613,16 +1731,18 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, 64 );
     hamsi512_4way_close( &ctx.hamsi, vhash );

     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );

     intrlv_4x32_512( vhash, hash0, hash1, hash2, hash3 );

@@ -1650,7 +1770,7 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     sph_whirlpool512_full( &ctx.whirlpool, hash2, hash2, 64 );
     sph_whirlpool512_full( &ctx.whirlpool, hash3, hash3, 64 );

-     if ( work_restart[thrid].restart ) return 0;    
+     if ( work_restart[thr_id].restart ) return 0;
 // 7

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );
@@ -1659,6 +1779,17 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
     bmw512_4way_close( &ctx.bmw, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -1668,6 +1799,8 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

     jh512_4way_init( &ctx.jh );
@@ -1692,6 +1825,15 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_2x128_512( hash0, hash1, vhashA );
     dintrlv_2x128_512( hash2, hash3, vhashB );

@@ -1706,16 +1848,18 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, 64 );
     hamsi512_4way_close( &ctx.hamsi, vhash );

     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );

     intrlv_4x32_512( vhash, hash0, hash1, hash2, hash3 );

@@ -1745,46 +1889,4 @@ int sonoa_4way_hash( void *state, const void *input, int thrid )
     return 1;
 }

-int scanhash_sonoa_4way( struct work *work, const uint32_t max_nonce,
-	            uint64_t *hashes_done, struct thr_info *mythr )
-{
-     uint32_t hash[4*16] __attribute__ ((aligned (64)));
-     uint32_t vdata[24*4] __attribute__ ((aligned (64)));
-     uint32_t lane_hash[8] __attribute__ ((aligned (32)));
-     uint32_t *hashd7 = &( hash[7<<2] );
-     uint32_t *pdata = work->data;
-     const uint32_t *ptarget = work->target;
-     const uint32_t first_nonce = pdata[19];
-     const uint32_t last_nonce = max_nonce - 4;
-     const uint32_t targ32 = ptarget[7];
-     uint32_t n = first_nonce;
-     __m256i  *noncev = (__m256i*)vdata + 9;  
-     const int thr_id = mythr->id;
-
-     mm256_bswap32_intrlv80_4x64( vdata, pdata );
-     *noncev = mm256_intrlv_blend_32(
-                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
-
-     do
-     {
-        if ( sonoa_4way_hash( hash, vdata, thr_id ) )
-        for ( int lane = 0; lane < 4; lane++ )
-        if ( unlikely( hashd7[ lane ] <= targ32 ) )
-        {
-           extr_lane_4x32( lane_hash, hash, lane, 256 );
-           if ( likely( valid_hash( lane_hash, ptarget ) && !opt_benchmark ) )
-           {
-              pdata[19] = bswap_32( n + lane );
-              submit_solution( work, lane_hash, mythr );
-           }
-        }
-        *noncev = _mm256_add_epi32( *noncev,
-                                    m256_const1_64( 0x0000000400000000 ) );
-        n += 4;
-     } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
-     pdata[19] = n;
-     *hashes_done = n - first_nonce;
-     return 0;
-}
-
 #endif
--- a/algo/x17/sonoa-gate.c
+++ b/algo/x17/sonoa-gate.c
@@ -3,17 +3,16 @@
 bool register_sonoa_algo( algo_gate_t* gate )
 {
 #if defined (SONOA_8WAY)
-  gate->scanhash  = (void*)&scanhash_sonoa_8way;
-//  gate->hash      = (void*)&sonoa_8way_hash;
+  gate->scanhash  = (void*)&scanhash_8way_64in_32out;
+  gate->hash      = (void*)&sonoa_8way_hash;
 #elif defined (SONOA_4WAY)
-  gate->scanhash  = (void*)&scanhash_sonoa_4way;
-//  gate->hash      = (void*)&sonoa_4way_hash;
+  gate->scanhash  = (void*)&scanhash_4way_64in_32out;
+  gate->hash      = (void*)&sonoa_4way_hash;
 #else
  init_sonoa_ctx();
-  gate->scanhash  = (void*)&scanhash_sonoa;
-//  gate->hash      = (void*)&sonoa_hash;
+  gate->hash      = (void*)&sonoa_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
  return true;
 };

--- a/algo/x17/sonoa-gate.h
+++ b/algo/x17/sonoa-gate.h
@@ -14,21 +14,15 @@ bool register_sonoa_algo( algo_gate_t* gate );

 #if defined(SONOA_8WAY)

-int sonoa_8way_hash( void *state, const void *input, int thrid );
-int scanhash_sonoa_8way( struct work *work, uint32_t max_nonce,
-                         uint64_t *hashes_done, struct thr_info *mythr );
+int sonoa_8way_hash( void *state, const void *input, int thr_id );

 #elif defined(SONOA_4WAY)

-int sonoa_4way_hash( void *state, const void *input, int thrid );
-int scanhash_sonoa_4way( struct work *work, uint32_t max_nonce,
-                         uint64_t *hashes_done, struct thr_info *mythr );
+int sonoa_4way_hash( void *state, const void *input, int thr_id );

 #else

-int sonoa_hash( void *state, const void *input, int thrid );
-int scanhash_sonoa( struct work *work, uint32_t max_nonce,
-                  uint64_t *hashes_done, struct thr_info *mythr );
+int sonoa_hash( void *state, const void *input, int thr_id );
 void init_sonoa_ctx();

 #endif
--- a/algo/x17/sonoa.c
+++ b/algo/x17/sonoa.c
@@ -14,7 +14,6 @@
 #include "algo/skein/sph_skein.h"
 #include "algo/shavite/sph_shavite.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/shabal/sph_shabal.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/haval/sph-haval.h"
@@ -25,9 +24,11 @@
 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

 typedef struct {
@@ -36,9 +37,11 @@ typedef struct {
 #if defined(__AES__)
        hashState_echo          echo;
        hashState_groestl       groestl;
+        hashState_fugue         fugue;
 #else
        sph_groestl512_context  groestl;
        sph_echo512_context     echo;
+        sph_fugue512_context    fugue;
 #endif
        sph_jh512_context       jh;
        sph_keccak512_context   keccak;
@@ -48,7 +51,6 @@ typedef struct {
        sph_shavite512_context  shavite;
        hashState_sd            simd;
        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
        sph_shabal512_context   shabal;
        sph_whirlpool_context   whirlpool;
        SHA512_CTX              sha512;
@@ -64,9 +66,11 @@ void init_sonoa_ctx()
 #if defined(__AES__)
        init_echo( &sonoa_ctx.echo, 512 );
        init_groestl( &sonoa_ctx.groestl, 64 );
+        fugue512_Init( &sonoa_ctx.fugue, 512 );
 #else
        sph_groestl512_init(&sonoa_ctx.groestl );
        sph_echo512_init( &sonoa_ctx.echo );
+        sph_fugue512_init( &sonoa_ctx.fugue );
 #endif
        sph_skein512_init( &sonoa_ctx.skein);
        sph_jh512_init( &sonoa_ctx.jh);
@@ -76,14 +80,13 @@ void init_sonoa_ctx()
        sph_shavite512_init( &sonoa_ctx.shavite );
        init_sd( &sonoa_ctx.simd, 512 );
        sph_hamsi512_init( &sonoa_ctx.hamsi );
-        sph_fugue512_init( &sonoa_ctx.fugue );
        sph_shabal512_init( &sonoa_ctx.shabal );
        sph_whirlpool_init( &sonoa_ctx.whirlpool );
        SHA512_Init( &sonoa_ctx.sha512 );
        sph_haval256_5_init(&sonoa_ctx.haval);
 };

-int sonoa_hash( void *state, const void *input, int thrid )
+int sonoa_hash( void *state, const void *input, int thr_id )
 {
 	uint8_t hash[128] __attribute__ ((aligned (64)));
   sonoa_ctx_holder ctx __attribute__ ((aligned (64)));
@@ -132,7 +135,7 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_echo512_close(&ctx.echo, hash);
 #endif

-   if ( work_restart[thrid].restart ) return 0;
+   if ( work_restart[thr_id].restart ) return 0;
 //

   sph_bmw512_init( &ctx.bmw);
@@ -190,7 +193,7 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_hamsi512(&ctx.hamsi, hash, 64);
   sph_hamsi512_close(&ctx.hamsi, hash);
 	
-   if ( work_restart[thrid].restart ) return 0;
+   if ( work_restart[thr_id].restart ) return 0;
 //

   sph_bmw512_init( &ctx.bmw);
@@ -249,10 +252,15 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_hamsi512(&ctx.hamsi, hash, 64);
   sph_hamsi512_close(&ctx.hamsi, hash);

+#if defined(__AES__)
+   fugue512_Update( &ctx.fugue, hash, 512 );
+   fugue512_Final( &ctx.fugue, hash );
+#else
   sph_fugue512(&ctx.fugue, hash, 64);
   sph_fugue512_close(&ctx.fugue, hash);
+#endif

-   if ( work_restart[thrid].restart ) return 0;
+   if ( work_restart[thr_id].restart ) return 0;
 //

   sph_bmw512_init( &ctx.bmw);
@@ -311,9 +319,11 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_hamsi512(&ctx.hamsi, hash, 64);
   sph_hamsi512_close(&ctx.hamsi, hash);

-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512(&ctx.fugue, hash, 64);
-   sph_fugue512_close(&ctx.fugue, hash);
+#if defined(__AES__)
+    fugue512_full( &ctx.fugue, hash, hash, 64 );
+#else
+    sph_fugue512_full( &ctx.fugue, hash, hash, 64 );
+#endif

   sph_shabal512(&ctx.shabal, hash, 64);
   sph_shabal512_close(&ctx.shabal, hash);
@@ -336,7 +346,7 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_shavite512(&ctx.shavite, hash, 64);
   sph_shavite512_close(&ctx.shavite, hash);

-   if ( work_restart[thrid].restart ) return 0;
+   if ( work_restart[thr_id].restart ) return 0;
 //

   sph_bmw512_init( &ctx.bmw);
@@ -399,9 +409,11 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_hamsi512(&ctx.hamsi, hash, 64);
   sph_hamsi512_close(&ctx.hamsi, hash);

-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512(&ctx.fugue, hash, 64);
-   sph_fugue512_close(&ctx.fugue, hash);
+#if defined(__AES__)
+    fugue512_full( &ctx.fugue, hash, hash, 64 );
+#else
+    sph_fugue512_full( &ctx.fugue, hash, hash, 64 );
+#endif

   sph_shabal512_init( &ctx.shabal );
   sph_shabal512(&ctx.shabal, hash, 64);
@@ -410,7 +422,7 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_whirlpool(&ctx.whirlpool, hash, 64);
   sph_whirlpool_close(&ctx.whirlpool, hash);

-   if ( work_restart[thrid].restart ) return 0;
+   if ( work_restart[thr_id].restart ) return 0;
 //
   sph_bmw512_init( &ctx.bmw);
   sph_bmw512(&ctx.bmw, hash, 64);
@@ -468,9 +480,11 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_hamsi512(&ctx.hamsi, hash, 64);
   sph_hamsi512_close(&ctx.hamsi, hash);

-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512(&ctx.fugue, hash, 64);
-   sph_fugue512_close(&ctx.fugue, hash);
+#if defined(__AES__)
+    fugue512_full( &ctx.fugue, hash, hash, 64 );
+#else
+    sph_fugue512_full( &ctx.fugue, hash, hash, 64 );
+#endif

   sph_shabal512_init( &ctx.shabal );
   sph_shabal512(&ctx.shabal, hash, 64);
@@ -487,7 +501,7 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_whirlpool(&ctx.whirlpool, hash, 64);
   sph_whirlpool_close(&ctx.whirlpool, hash);

-   if ( work_restart[thrid].restart ) return 0;
+   if ( work_restart[thr_id].restart ) return 0;
 //

   sph_bmw512_init( &ctx.bmw);
@@ -546,9 +560,11 @@ int sonoa_hash( void *state, const void *input, int thrid )
   sph_hamsi512(&ctx.hamsi, hash, 64);
   sph_hamsi512_close(&ctx.hamsi, hash);

-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512(&ctx.fugue, hash, 64);
-   sph_fugue512_close(&ctx.fugue, hash);
+#if defined(__AES__)
+    fugue512_full( &ctx.fugue, hash, hash, 64 );
+#else
+    sph_fugue512_full( &ctx.fugue, hash, hash, 64 );
+#endif

   sph_shabal512_init( &ctx.shabal );
   sph_shabal512(&ctx.shabal, hash, 64);
@@ -569,34 +585,4 @@ int sonoa_hash( void *state, const void *input, int thrid )
   return 1;
 }

-int scanhash_sonoa( struct work *work, uint32_t max_nonce,
-             uint64_t *hashes_done, struct thr_info *mythr)
-{
-   uint32_t edata[20] __attribute__((aligned(64)));
-   uint32_t hash64[8] __attribute__((aligned(64)));
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   uint32_t n = pdata[19];
-   const uint32_t first_nonce = pdata[19];
-   const int thr_id = mythr->id;
-   const bool bench = opt_benchmark;
-
-   mm128_bswap32_80( edata, pdata );
-
-   do
-   {
-      edata[19] = n;
-      if ( sonoa_hash( hash64, edata, thr_id ) )
-      if ( unlikely( valid_hash( hash64, ptarget ) && !bench ) )
-      {
-         pdata[19] = bswap_32( n );
-         submit_solution( work, hash64, mythr );
-      }
-      n++;
-   } while ( n < max_nonce && !work_restart[thr_id].restart );
-   *hashes_done = n - first_nonce;
-   pdata[19] = n;
-   return 0;
-}
-
 #endif
--- a/algo/x17/x17-4way.c
+++ b/algo/x17/x17-4way.c
@@ -21,7 +21,7 @@
 #include "algo/simd/simd-hash-2way.h"
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/shabal/shabal-hash-4way.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/haval/haval-hash-4way.h"
@@ -49,7 +49,7 @@ union _x17_8way_context_overlay
 #endif
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_8way_context     sha512;
@@ -57,7 +57,7 @@ union _x17_8way_context_overlay
 } __attribute__ ((aligned (64)));
 typedef union _x17_8way_context_overlay x17_8way_context_overlay;

-void x17_8way_hash( void *state, const void *input )
+int x17_8way_hash( void *state, const void *input, int thr_id )
 {
     uint64_t vhash[8*8] __attribute__ ((aligned (128)));
     uint64_t vhashA[8*8] __attribute__ ((aligned (64)));
@@ -190,14 +190,14 @@ void x17_8way_hash( void *state, const void *input )
     dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                       vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
-     sph_fugue512_full( &ctx.fugue, hash4, hash4, 64 );
-     sph_fugue512_full( &ctx.fugue, hash5, hash5, 64 );
-     sph_fugue512_full( &ctx.fugue, hash6, hash6, 64 );
-     sph_fugue512_full( &ctx.fugue, hash7, hash7, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+     fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+     fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+     fugue512_full( &ctx.fugue, hash7, hash7, 64 );

     intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
                      hash7 );
@@ -230,50 +230,8 @@ void x17_8way_hash( void *state, const void *input )
     haval256_5_8way_init( &ctx.haval );
     haval256_5_8way_update( &ctx.haval, vhashA, 64 );
     haval256_5_8way_close( &ctx.haval, state );
-}

-int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t hash32[8*8] __attribute__ ((aligned (128)));
-   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
-   uint32_t *hash32_d7 = &(hash32[7*8]);
-   uint32_t *pdata = work->data;
-   const uint32_t *ptarget = work->target;
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 8;
-   __m512i  *noncev = (__m512i*)vdata + 9; 
-   uint32_t n = first_nonce;
-   const int thr_id = mythr->id;
-   const uint32_t targ32_d7 = ptarget[7];
-   const bool bench = opt_benchmark;
-
-   mm512_bswap32_intrlv80_8x64( vdata, pdata );
-   *noncev = mm512_intrlv_blend_32(
-              _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
-                                n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
-   do
-   {
-      x17_8way_hash( hash32, vdata );
-
-      for ( int lane = 0; lane < 8; lane++ )
-      if ( unlikely( ( hash32_d7[ lane ] <= targ32_d7 ) && !bench ) )
-      {
-         extr_lane_8x32( lane_hash, hash32, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) ) )
-         {
-            pdata[19] = bswap_32( n + lane );
-            submit_solution( work, lane_hash, mythr );
-         }
-      }
-      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
-      n += 8;
-   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
-   pdata[19] = n;
-   *hashes_done = n - first_nonce;
-   return 0;
+     return 1;
 }

 #elif defined(X17_4WAY)
@@ -282,7 +240,13 @@ union _x17_4way_context_overlay
 {
    blake512_4way_context   blake;
    bmw512_4way_context     bmw;
+#if defined(__VAES__)
+    groestl512_2way_context groestl;
+    echo512_2way_context    echo;
+#else
    hashState_groestl       groestl;
+    hashState_echo          echo;
+#endif
    skein512_4way_context   skein;
    jh512_4way_context      jh;
    keccak512_4way_context  keccak;
@@ -290,9 +254,8 @@ union _x17_4way_context_overlay
    cube_2way_context       cube;
    shavite512_2way_context shavite;
    simd_2way_context       simd;
-    hashState_echo          echo;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_4way_context     sha512;
@@ -300,7 +263,7 @@ union _x17_4way_context_overlay
 };  
 typedef union _x17_4way_context_overlay x17_4way_context_overlay;

-void x17_4way_hash( void *state, const void *input )
+int x17_4way_hash( void *state, const void *input, int thr_id )
 {
     uint64_t vhash[8*4] __attribute__ ((aligned (64)));
     uint64_t vhashA[8*4] __attribute__ ((aligned (64)));
@@ -317,6 +280,17 @@ void x17_4way_hash( void *state, const void *input )
     bmw512_4way_update( &ctx.bmw, vhash, 64 );
     bmw512_4way_close( &ctx.bmw, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -326,6 +300,8 @@ void x17_4way_hash( void *state, const void *input )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

     jh512_4way_init( &ctx.jh );
@@ -350,6 +326,15 @@ void x17_4way_hash( void *state, const void *input )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
     dintrlv_2x128_512( hash0, hash1, vhashA );
     dintrlv_2x128_512( hash2, hash3, vhashB );

@@ -364,16 +349,18 @@ void x17_4way_hash( void *state, const void *input )

     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, 64 );
     hamsi512_4way_close( &ctx.hamsi, vhash );

     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, 64 );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, 64 );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, 64 );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+     fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+     fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+     fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+     fugue512_full( &ctx.fugue, hash3, hash3, 64 );

     intrlv_4x32_512( vhash, hash0, hash1, hash2, hash3 );

@@ -399,49 +386,8 @@ void x17_4way_hash( void *state, const void *input )
     haval256_5_4way_init( &ctx.haval );
     haval256_5_4way_update( &ctx.haval, vhashB, 64 );
     haval256_5_4way_close( &ctx.haval, state );
-}

-int scanhash_x17_4way( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t hash32[8*4] __attribute__ ((aligned (64)));
-   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
-   uint32_t *hash32_d7 = &(hash32[ 7*4 ]);
-   uint32_t *pdata = work->data;
-   const uint32_t *ptarget = work->target;
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 4;
-   __m256i  *noncev = (__m256i*)vdata + 9;
-   uint32_t n = first_nonce;
-   const int thr_id = mythr->id;
-   const uint32_t targ32_d7 = ptarget[7];
-   const bool bench = opt_benchmark;
-
-   mm256_bswap32_intrlv80_4x64( vdata, pdata );
-   *noncev = mm256_intrlv_blend_32(
-                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
-   do
-   {
-      x17_4way_hash( hash32, vdata );
-
-      for ( int lane = 0; lane < 4; lane++ )
-      if ( unlikely( hash32_d7[ lane ] <= targ32_d7 && !bench ) )
-      {  
-         extr_lane_4x32( lane_hash, hash32, lane, 256 );
-         if ( valid_hash( lane_hash, ptarget ) )
-         {
-            pdata[19] = bswap_32( n + lane );
-            submit_solution( work, lane_hash, mythr );
-         }            
-      }
-      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
-      n += 4;
-   } while ( likely( ( n <= last_nonce ) && !work_restart[thr_id].restart ) );
-   pdata[19] = n;
-   *hashes_done = n - first_nonce;
-   return 0;
+     return 1;
 }

 #endif
--- a/algo/x17/x17-gate.c
+++ b/algo/x17/x17-gate.c
@@ -3,16 +3,15 @@
 bool register_x17_algo( algo_gate_t* gate )
 {
 #if defined (X17_8WAY)
-  gate->scanhash  = (void*)&scanhash_x17_8way;
+  gate->scanhash  = (void*)&scanhash_8way_64in_32out;
  gate->hash      = (void*)&x17_8way_hash;
 #elif defined (X17_4WAY)
-  gate->scanhash  = (void*)&scanhash_x17_4way;
+  gate->scanhash  = (void*)&scanhash_4way_64in_32out;
  gate->hash      = (void*)&x17_4way_hash;
 #else
-  gate->scanhash  = (void*)&scanhash_x17;
  gate->hash      = (void*)&x17_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
  return true;
 };

--- a/algo/x17/x17-gate.h
+++ b/algo/x17/x17-gate.h
@@ -14,20 +14,15 @@ bool register_x17_algo( algo_gate_t* gate );

 #if defined(X17_8WAY)

-void x17_8way_hash( void *state, const void *input );
-int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr );
+int x17_8way_hash( void *state, const void *input, int thr_id );
+
 #elif defined(X17_4WAY)

-void x17_4way_hash( void *state, const void *input );
-int scanhash_x17_4way( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr );
+int x17_4way_hash( void *state, const void *input, int thr_id );

 #endif

-void x17_hash( void *state, const void *input );
-int scanhash_x17( struct work *work, uint32_t max_nonce,
-                  uint64_t *hashes_done, struct thr_info *mythr );
+int x17_hash( void *state, const void *input, int thr_id );

 #endif

--- a/algo/x17/x17.c
+++ b/algo/x17/x17.c
@@ -13,7 +13,6 @@
 #include "algo/skein/sph_skein.h"
 #include "algo/shavite/sph_shavite.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/shabal/sph_shabal.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/haval/sph-haval.h"
@@ -22,11 +21,13 @@
 #include "algo/simd/nist.h"
 #include <openssl/sha.h>
 #if defined(__AES__)
+  #include "algo/fugue/fugue-aesni.h"
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

 union _x17_context_overlay
@@ -36,9 +37,11 @@ union _x17_context_overlay
 #if defined(__AES__)
        hashState_groestl       groestl;
        hashState_echo          echo;
+        hashState_fugue         fugue;
 #else
        sph_groestl512_context  groestl;
        sph_echo512_context     echo;
+        sph_fugue512_context    fugue;
 #endif
        sph_jh512_context       jh;
        sph_keccak512_context   keccak;
@@ -48,7 +51,6 @@ union _x17_context_overlay
        sph_shavite512_context  shavite;
        hashState_sd            simd;
        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
        sph_shabal512_context   shabal;
        sph_whirlpool_context   whirlpool;
        SHA512_CTX              sha512;
@@ -56,7 +58,7 @@ union _x17_context_overlay
 };
 typedef union _x17_context_overlay x17_context_overlay;

-void x17_hash(void *output, const void *input)
+int x17_hash(void *output, const void *input, int thr_id )
 {
 //    unsigned char hash[64 * 4] __attribute__((aligned(64))) = {0};
    unsigned char hash[64] __attribute__((aligned(64)));
@@ -122,9 +124,11 @@ void x17_hash(void *output, const void *input)
    sph_hamsi512_close( &ctx.hamsi, hash );

    // 13 Fugue
-    sph_fugue512_init( &ctx.fugue );
-    sph_fugue512(&ctx.fugue, hash, 64 );
-    sph_fugue512_close(&ctx.fugue, hash );
+#if defined(__AES__)
+    fugue512_full( &ctx.fugue, hash, hash, 64 );
+#else
+    sph_fugue512_full( &ctx.fugue, hash, hash, 64 );
+#endif

    // X14 Shabal
    sph_shabal512_init( &ctx.shabal );
@@ -143,36 +147,8 @@ void x17_hash(void *output, const void *input)
    sph_haval256_5_init(&ctx.haval);
    sph_haval256_5( &ctx.haval, (const void*)hash, 64 );
    sph_haval256_5_close( &ctx.haval, output );
-}

-int scanhash_x17( struct work *work, uint32_t max_nonce,
-	          uint64_t *hashes_done, struct thr_info *mythr)
-{
-   uint32_t edata[20] __attribute__((aligned(64)));
-   uint32_t hash64[8] __attribute__((aligned(64)));
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   uint32_t n = pdata[19] - 1;
-   const uint32_t first_nonce = pdata[19];
-   const int thr_id = mythr->id;
-   const bool bench = opt_benchmark;
-
-   mm128_bswap32_80( edata, pdata );
-   
-   do
-   {
-      edata[19] = n;
-      x17_hash( hash64, edata );
-      if ( unlikely( valid_hash( hash64, ptarget ) && !bench ) )
-      {
-         pdata[19] = bswap_32( n );
-         submit_solution( work, hash64, mythr );
-      }
-      n++;
-   } while ( n < max_nonce && !work_restart[thr_id].restart );
-   *hashes_done = n - first_nonce;
-   pdata[19] = n;
-   return 0;
+    return 1;
 }

 #endif
--- a/algo/x17/xevan-4way.c
+++ b/algo/x17/xevan-4way.c
@@ -16,7 +16,7 @@
 #include "algo/simd/simd-hash-2way.h"
 #include "algo/echo/aes_ni/hash_api.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/shabal/shabal-hash-4way.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/sha/sha-hash-4way.h"
@@ -40,7 +40,7 @@ union _xevan_8way_context_overlay
   cube_4way_context       cube;
   simd_4way_context       simd;
   hamsi512_8way_context   hamsi;
-   sph_fugue512_context    fugue;
+   hashState_fugue         fugue;
   shabal512_8way_context  shabal;
   sph_whirlpool_context   whirlpool;
   sha512_8way_context     sha512;
@@ -57,7 +57,7 @@ union _xevan_8way_context_overlay
 } __attribute__ ((aligned (64)));
 typedef union _xevan_8way_context_overlay xevan_8way_context_overlay;

-void xevan_8way_hash( void *output, const void *input )
+int xevan_8way_hash( void *output, const void *input, int thr_id )
 {
     uint64_t vhash[16<<3] __attribute__ ((aligned (128)));
     uint64_t vhashA[16<<3] __attribute__ ((aligned (64)));
@@ -192,14 +192,14 @@ void xevan_8way_hash( void *output, const void *input )
     dintrlv_8x64( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                   vhash, dataLen<<3 );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash4, hash4, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash5, hash5, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash6, hash6, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash7, hash7, dataLen );
+     fugue512_full( &ctx.fugue, hash0, hash0, dataLen );
+     fugue512_full( &ctx.fugue, hash1, hash1, dataLen );
+     fugue512_full( &ctx.fugue, hash2, hash2, dataLen );
+     fugue512_full( &ctx.fugue, hash3, hash3, dataLen );
+     fugue512_full( &ctx.fugue, hash4, hash4, dataLen );
+     fugue512_full( &ctx.fugue, hash5, hash5, dataLen );
+     fugue512_full( &ctx.fugue, hash6, hash6, dataLen );
+     fugue512_full( &ctx.fugue, hash7, hash7, dataLen );

     intrlv_8x32( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
                  hash7, dataLen<<3 );
@@ -355,14 +355,14 @@ void xevan_8way_hash( void *output, const void *input )
     dintrlv_8x64( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
                   vhash, dataLen<<3 );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash4, hash4, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash5, hash5, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash6, hash6, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash7, hash7, dataLen );
+     fugue512_full( &ctx.fugue, hash0, hash0, dataLen );
+     fugue512_full( &ctx.fugue, hash1, hash1, dataLen );
+     fugue512_full( &ctx.fugue, hash2, hash2, dataLen );
+     fugue512_full( &ctx.fugue, hash3, hash3, dataLen );
+     fugue512_full( &ctx.fugue, hash4, hash4, dataLen );
+     fugue512_full( &ctx.fugue, hash5, hash5, dataLen );
+     fugue512_full( &ctx.fugue, hash6, hash6, dataLen );
+     fugue512_full( &ctx.fugue, hash7, hash7, dataLen );

     intrlv_8x32( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
                  hash7, dataLen<<3 );
@@ -395,50 +395,8 @@ void xevan_8way_hash( void *output, const void *input )
     haval256_5_8way_init( &ctx.haval );
     haval256_5_8way_update( &ctx.haval, vhashA, dataLen );
     haval256_5_8way_close( &ctx.haval, output );
-}

-int scanhash_xevan_8way( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t hash[8*8] __attribute__ ((aligned (128)));
-   uint32_t vdata[20*8] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
-   uint32_t *hashd7 = &(hash[7*8]);
-   uint32_t *pdata = work->data;
-   const uint32_t *ptarget = work->target;
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 8;
-   __m512i  *noncev = (__m512i*)vdata + 9;
-   uint32_t n = first_nonce;
-   const int thr_id = mythr->id;
-   const uint32_t targ32 = ptarget[7];
-   const bool bench = opt_benchmark;
-
-   mm512_bswap32_intrlv80_8x64( vdata, pdata );
-   *noncev = mm512_intrlv_blend_32(
-              _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
-                                n+3, 0, n+2, 0, n+1, 0, n,   0 ), *noncev );
-   do
-   {
-      xevan_8way_hash( hash, vdata );
-
-      for ( int lane = 0; lane < 8; lane++ )
-      if ( unlikely( ( hashd7[ lane ] <= targ32 ) && !bench ) )
-      {
-         extr_lane_8x32( lane_hash, hash, lane, 256 );
-         if ( likely( valid_hash( lane_hash, ptarget ) ) )
-         {
-            pdata[19] = bswap_32( n + lane );
-            submit_solution( work, lane_hash, mythr );
-         }
-      }
-      *noncev = _mm512_add_epi32( *noncev,
-                                  m512_const1_64( 0x0000000800000000 ) );
-      n += 8;
-   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
-   pdata[19] = n;
-   *hashes_done = n - first_nonce;
-   return 0;
+     return 1;
 }

 #elif defined(XEVAN_4WAY)
@@ -447,17 +405,22 @@ union _xevan_4way_context_overlay
 {
 	blake512_4way_context   blake;
        bmw512_4way_context     bmw;
-        hashState_groestl       groestl;
-        skein512_4way_context   skein;
+#if defined(__VAES__)
+        groestl512_2way_context groestl;
+        echo_2way_context       echo;
+#else
+        hashState_groestl       groestl;
+        hashState_echo          echo;
+#endif
+        skein512_4way_context   skein;
        jh512_4way_context      jh;
        keccak512_4way_context  keccak;
        luffa_2way_context      luffa;
        cube_2way_context       cube;
        shavite512_2way_context shavite;
        simd_2way_context       simd;
-        hashState_echo          echo;
        hamsi512_4way_context   hamsi;
-        sph_fugue512_context    fugue;
+        hashState_fugue         fugue;
        shabal512_4way_context  shabal;
        sph_whirlpool_context   whirlpool;
        sha512_4way_context     sha512;
@@ -465,7 +428,7 @@ union _xevan_4way_context_overlay
 };
 typedef union _xevan_4way_context_overlay xevan_4way_context_overlay;

-void xevan_4way_hash( void *output, const void *input )
+int xevan_4way_hash( void *output, const void *input, int thr_id )
 {
     uint64_t hash0[16] __attribute__ ((aligned (64)));
     uint64_t hash1[16] __attribute__ ((aligned (64)));
@@ -484,7 +447,17 @@ void xevan_4way_hash( void *output, const void *input )
     bmw512_4way_update( &ctx.bmw, vhash, dataLen );
     bmw512_4way_close( &ctx.bmw, vhash );

-     // Serial
+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, dataLen<<3 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, dataLen );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, dataLen );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, dataLen<<3 );
+
+#else
+     
     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, dataLen<<3 );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, dataLen<<3 );
@@ -492,9 +465,10 @@ void xevan_4way_hash( void *output, const void *input )
     groestl512_full( &ctx.groestl, (char*)hash2, (char*)hash2, dataLen<<3 );
     groestl512_full( &ctx.groestl, (char*)hash3, (char*)hash3, dataLen<<3 );

-     // Parallel 4way
     intrlv_4x64( vhash, hash0, hash1, hash2, hash3, dataLen<<3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, dataLen );

     jh512_4way_init( &ctx.jh );
@@ -519,6 +493,15 @@ void xevan_4way_hash( void *output, const void *input )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, dataLen );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, dataLen );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, dataLen );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, dataLen );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, dataLen<<3 );
+
+#else
+     
     dintrlv_2x128( hash0, hash1, vhashA, dataLen<<3 );
     dintrlv_2x128( hash2, hash3, vhashB, dataLen<<3 );

@@ -531,19 +514,20 @@ void xevan_4way_hash( void *output, const void *input )
     echo_full( &ctx.echo, (BitSequence *)hash3, 512,
                     (const BitSequence *)hash3, dataLen );

-     // Parallel
     intrlv_4x64( vhash, hash0, hash1, hash2, hash3, dataLen<<3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, dataLen );
     hamsi512_4way_close( &ctx.hamsi, vhash );

     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, dataLen<<3 );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, dataLen );
+     fugue512_full( &ctx.fugue, hash0, hash0, dataLen );
+     fugue512_full( &ctx.fugue, hash1, hash1, dataLen );
+     fugue512_full( &ctx.fugue, hash2, hash2, dataLen );
+     fugue512_full( &ctx.fugue, hash3, hash3, dataLen );

     // Parallel 4way 32 bit
     intrlv_4x32( vhash, hash0, hash1, hash2, hash3, dataLen<<3 );
@@ -584,6 +568,17 @@ void xevan_4way_hash( void *output, const void *input )
     bmw512_4way_update( &ctx.bmw, vhash, dataLen );
     bmw512_4way_close( &ctx.bmw, vhash );

+#if defined(__VAES__)
+
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, dataLen<<3 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, dataLen );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, dataLen );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, dataLen<<3 );
+
+#else
+
     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, dataLen<<3 );

     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, dataLen<<3 );
@@ -593,6 +588,8 @@ void xevan_4way_hash( void *output, const void *input )

     intrlv_4x64( vhash, hash0, hash1, hash2, hash3, dataLen<<3 );

+#endif
+
     skein512_4way_full( &ctx.skein, vhash, vhash, dataLen );

     jh512_4way_init( &ctx.jh );
@@ -617,6 +614,15 @@ void xevan_4way_hash( void *output, const void *input )
     simd512_2way_full( &ctx.simd, vhashA, vhashA, dataLen );
     simd512_2way_full( &ctx.simd, vhashB, vhashB, dataLen );

+#if defined(__VAES__)
+
+     echo_2way_full( &ctx.echo, vhashA, 512, vhashA, dataLen );
+     echo_2way_full( &ctx.echo, vhashB, 512, vhashB, dataLen );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, dataLen<<3 );
+
+#else
+
     dintrlv_2x128( hash0, hash1, vhashA, dataLen<<3 );
     dintrlv_2x128( hash2, hash3, vhashB, dataLen<<3 );

@@ -631,16 +637,18 @@ void xevan_4way_hash( void *output, const void *input )

     intrlv_4x64( vhash, hash0, hash1, hash2, hash3, dataLen<<3 );

+#endif
+
     hamsi512_4way_init( &ctx.hamsi );
     hamsi512_4way_update( &ctx.hamsi, vhash, dataLen );
     hamsi512_4way_close( &ctx.hamsi, vhash );

     dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, dataLen<<3 );

-     sph_fugue512_full( &ctx.fugue, hash0, hash0, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash1, hash1, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash2, hash2, dataLen );
-     sph_fugue512_full( &ctx.fugue, hash3, hash3, dataLen );
+     fugue512_full( &ctx.fugue, hash0, hash0, dataLen );
+     fugue512_full( &ctx.fugue, hash1, hash1, dataLen );
+     fugue512_full( &ctx.fugue, hash2, hash2, dataLen );
+     fugue512_full( &ctx.fugue, hash3, hash3, dataLen );

     intrlv_4x32( vhash, hash0, hash1, hash2, hash3, dataLen<<3 );

@@ -666,49 +674,8 @@ void xevan_4way_hash( void *output, const void *input )
     haval256_5_4way_init( &ctx.haval );
     haval256_5_4way_update( &ctx.haval, vhashA, dataLen );
     haval256_5_4way_close( &ctx.haval, output );
-}

-int scanhash_xevan_4way( struct work *work, uint32_t max_nonce,
-                         uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t hash[16*4] __attribute__ ((aligned (128)));
-   uint32_t vdata[20*4] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
-   uint32_t *hashd7 = &(hash[7<<2]);
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   int thr_id = mythr->id;
-   __m256i  *noncev = (__m256i*)vdata + 9; 
-   const uint32_t targ32 = ptarget[7];
-   const uint32_t first_nonce = pdata[19];
-   const uint32_t last_nonce = max_nonce - 4;
-   uint32_t n = first_nonce;
-   const bool bench = opt_benchmark;
-
-   if ( bench )  ptarget[7] = 0x0cff;
-
-   mm256_bswap32_intrlv80_4x64( vdata, pdata );
-   *noncev = mm256_intrlv_blend_32(
-                   _mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
-   do {
-      xevan_4way_hash( hash, vdata );
-      for ( int lane = 0; lane < 4; lane++ )
-      if ( unlikely( hashd7[ lane ] <= targ32 ) && ! bench )
-      {
-         extr_lane_4x32( lane_hash, hash, lane, 256 );
-	      if ( valid_hash( lane_hash, ptarget ) )
-         {
-             pdata[19] = bswap_32( n + lane );
-             submit_solution( work, lane_hash, mythr );
-         }
-      }
-      *noncev = _mm256_add_epi32( *noncev,
-                                  m256_const1_64( 0x0000000400000000 ) );
-      n += 4;
-   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
-   pdata[19] = n;
-   *hashes_done = n - first_nonce;
-   return 0;
+     return 1;
 }

 #endif
--- a/algo/x17/xevan-gate.c
+++ b/algo/x17/xevan-gate.c
@@ -3,17 +3,16 @@
 bool register_xevan_algo( algo_gate_t* gate )
 {
 #if defined (XEVAN_8WAY)
-  gate->scanhash  = (void*)&scanhash_xevan_8way;
+  gate->scanhash  = (void*)&scanhash_8way_64in_32out;
  gate->hash      = (void*)&xevan_8way_hash;
 #elif defined (XEVAN_4WAY)
-  gate->scanhash  = (void*)&scanhash_xevan_4way;
+  gate->scanhash  = (void*)&scanhash_4way_64in_32out;
  gate->hash      = (void*)&xevan_4way_hash;
 #else
  init_xevan_ctx();
-  gate->scanhash  = (void*)&scanhash_xevan;
  gate->hash      = (void*)&xevan_hash;
 #endif
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT | VAES_OPT | VAES256_OPT;
  opt_target_factor = 256.0;
  return true;
 };
--- a/algo/x17/xevan-gate.h
+++ b/algo/x17/xevan-gate.h
@@ -14,26 +14,15 @@ bool register_xevan_algo( algo_gate_t* gate );

 #if defined(XEVAN_8WAY)

-void xevan_8way_hash( void *state, const void *input );
+int xevan_8way_hash( void *state, const void *input, int thr_id );

-int scanhash_xevan_8way( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr );
 #elif defined(XEVAN_4WAY)

-void xevan_4way_hash( void *state, const void *input );
-
-int scanhash_xevan_4way( struct work *work, uint32_t max_nonce,
-                       uint64_t *hashes_done, struct thr_info *mythr );
-
-//void init_xevan_4way_ctx();
+int xevan_4way_hash( void *state, const void *input, int thr_id );

 #else

-void xevan_hash( void *state, const void *input );
-
-int scanhash_xevan( struct work *work, uint32_t max_nonce,
-                  uint64_t *hashes_done, struct thr_info *mythr );
-
+int xevan_hash( void *state, const void *input, int thr_id );
 void init_xevan_ctx();

 #endif
--- a/algo/x17/xevan.c
+++ b/algo/x17/xevan.c
@@ -15,7 +15,6 @@
 #include "algo/shavite/sph_shavite.h"
 #include "algo/luffa/luffa_for_sse2.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/shabal/sph_shabal.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/haval/sph-haval.h"
@@ -25,9 +24,11 @@
 #if defined(__AES__)
  #include "algo/groestl/aes_ni/hash-groestl.h"
  #include "algo/echo/aes_ni/hash_api.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif

 typedef struct {
@@ -41,7 +42,6 @@ typedef struct {
        sph_shavite512_context  shavite;
        hashState_sd            simd;
        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
        sph_shabal512_context   shabal;
        sph_whirlpool_context   whirlpool;
        SHA512_CTX              sha512;
@@ -49,9 +49,11 @@ typedef struct {
 #if defined(__AES__)
        hashState_echo          echo;
        hashState_groestl       groestl;
+        hashState_fugue         fugue;
 #else
 	sph_groestl512_context  groestl;
        sph_echo512_context     echo;
+        sph_fugue512_context    fugue;
 #endif
 } xevan_ctx_holder;

@@ -69,7 +71,6 @@ void init_xevan_ctx()
        sph_shavite512_init( &xevan_ctx.shavite );
        init_sd( &xevan_ctx.simd, 512 );
        sph_hamsi512_init( &xevan_ctx.hamsi );
-        sph_fugue512_init( &xevan_ctx.fugue );
        sph_shabal512_init( &xevan_ctx.shabal );
        sph_whirlpool_init( &xevan_ctx.whirlpool );
        SHA512_Init( &xevan_ctx.sha512 );
@@ -77,13 +78,15 @@ void init_xevan_ctx()
 #if defined(__AES__)
        init_groestl( &xevan_ctx.groestl, 64 );
        init_echo( &xevan_ctx.echo, 512 );
+        fugue512_Init( &xevan_ctx.fugue, 512 );
 #else
 	sph_groestl512_init( &xevan_ctx.groestl );
        sph_echo512_init( &xevan_ctx.echo );
+        sph_fugue512_init( &xevan_ctx.fugue );
 #endif
 };

-void xevan_hash(void *output, const void *input)
+int xevan_hash(void *output, const void *input, int thr_id )
 {
   uint32_t _ALIGN(64) hash[32]; // 128 bytes required
 	const int dataLen = 128;
@@ -137,8 +140,13 @@ void xevan_hash(void *output, const void *input)
 	sph_hamsi512(&ctx.hamsi, hash, dataLen);
 	sph_hamsi512_close(&ctx.hamsi, hash);

+#if defined(__AES__)
+    fugue512_Update( &ctx.fugue, hash, dataLen*8 );
+    fugue512_Final( &ctx.fugue, hash ); 
+#else
 	sph_fugue512(&ctx.fugue, hash, dataLen);
 	sph_fugue512_close(&ctx.fugue, hash);
+#endif

 	sph_shabal512(&ctx.shabal, hash, dataLen);
 	sph_shabal512_close(&ctx.shabal, hash);
@@ -202,8 +210,13 @@ void xevan_hash(void *output, const void *input)
 	sph_hamsi512(&ctx.hamsi, hash, dataLen);
 	sph_hamsi512_close(&ctx.hamsi, hash);

+#if defined(__AES__)
+    fugue512_Update( &ctx.fugue, hash, dataLen*8 );
+    fugue512_Final( &ctx.fugue, hash );   
+#else
 	sph_fugue512(&ctx.fugue, hash, dataLen);
 	sph_fugue512_close(&ctx.fugue, hash);
+#endif

 	sph_shabal512(&ctx.shabal, hash, dataLen);
 	sph_shabal512_close(&ctx.shabal, hash);
@@ -218,36 +231,8 @@ void xevan_hash(void *output, const void *input)
 	sph_haval256_5_close(&ctx.haval, hash);

 	memcpy(output, hash, 32);
-}

-int scanhash_xevan( struct work *work, uint32_t max_nonce,
-             uint64_t *hashes_done, struct thr_info *mythr)
-{
-   uint32_t edata[20] __attribute__((aligned(64)));
-   uint32_t hash64[8] __attribute__((aligned(64)));
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   uint32_t n = pdata[19];
-   const uint32_t first_nonce = pdata[19];
-   const int thr_id = mythr->id;
-   const bool bench = opt_benchmark;
-
-   mm128_bswap32_80( edata, pdata );
-
-   do
-   {
-      edata[19] = n;
-      xevan_hash( hash64, edata );
-      if ( unlikely( valid_hash( hash64, ptarget ) && !bench ) )
-      {
-         pdata[19] = bswap_32( n );
-         submit_solution( work, hash64, mythr );
-      }
-      n++;
-   } while ( n < max_nonce && !work_restart[thr_id].restart );
-   pdata[19] = n;
-   *hashes_done = n - first_nonce;
-   return 0;
+   return 1;
 }

 #endif
--- a/algo/x22/x22i-4way.c
+++ b/algo/x22/x22i-4way.c
@@ -11,9 +11,9 @@
 #include "algo/shavite/shavite-hash-2way.h"
 #include "algo/shavite/sph_shavite.h"
 #include "algo/simd/simd-hash-2way.h"
-#include "algo/shavite/sph_shavite.h"
+#include "algo/shavite/shavite-hash-2way.h"
 #include "algo/hamsi/hamsi-hash-4way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/shabal/shabal-hash-4way.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/sha/sha-hash-4way.h"
@@ -42,7 +42,7 @@ union _x22i_8way_ctx_overlay
    cube_4way_context       cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_8way_context     sha512;
@@ -225,30 +225,14 @@ int x22i_8way_hash( void *output, const void *input, int thrid )
   dintrlv_8x64_512( hash0, hash1, hash2, hash3,
                     hash4, hash5, hash6, hash7, vhash );
   
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash0, 64 );
-   sph_fugue512_close( &ctx.fugue, hash0 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash1, 64 );
-   sph_fugue512_close( &ctx.fugue, hash1 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash2, 64 );
-   sph_fugue512_close( &ctx.fugue, hash2 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash3, 64 );
-   sph_fugue512_close( &ctx.fugue, hash3 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash4, 64 );
-   sph_fugue512_close( &ctx.fugue, hash4 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash5, 64 );
-   sph_fugue512_close( &ctx.fugue, hash5 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash6, 64 );
-   sph_fugue512_close( &ctx.fugue, hash6 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash7, 64 );
-   sph_fugue512_close( &ctx.fugue, hash7 );
+   fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+   fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+   fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+   fugue512_full( &ctx.fugue, hash3, hash3, 64 );
+   fugue512_full( &ctx.fugue, hash4, hash4, 64 );
+   fugue512_full( &ctx.fugue, hash5, hash5, 64 );
+   fugue512_full( &ctx.fugue, hash6, hash6, 64 );
+   fugue512_full( &ctx.fugue, hash7, hash7, 64 );

   intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3,
                           hash4, hash5, hash6, hash7 );
@@ -510,17 +494,22 @@ union _x22i_4way_ctx_overlay
 {
    blake512_4way_context   blake;
    bmw512_4way_context     bmw;
+#if defined(__VAES__)
+    groestl512_2way_context groestl;
+    echo_2way_context       echo;
+#else
    hashState_groestl       groestl;
    hashState_echo          echo;
+#endif
+    shavite512_2way_context shavite;
    skein512_4way_context   skein;
    jh512_4way_context      jh;
    keccak512_4way_context  keccak;
    luffa_2way_context      luffa;
    cube_2way_context       cube;
-    shavite512_2way_context shavite;
    simd_2way_context       simd;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_4way_context     sha512;
@@ -551,14 +540,28 @@ int x22i_4way_hash( void *output, const void *input, int thrid )
   bmw512_4way_init( &ctx.bmw );
   bmw512_4way_update( &ctx.bmw, vhash, 64 );
   bmw512_4way_close( &ctx.bmw, vhash );
-   dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

-   groestl512_full( &ctx.groestl, (char*)hash0, (const char*)hash0, 512 );
-   groestl512_full( &ctx.groestl, (char*)hash1, (const char*)hash1, 512 );
-   groestl512_full( &ctx.groestl, (char*)hash2, (const char*)hash2, 512 );
-   groestl512_full( &ctx.groestl, (char*)hash3, (const char*)hash3, 512 );
+#if defined(__VAES__)

-   intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );
+     rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+     groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+     groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+     rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
+     dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
+
+     groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
+     groestl512_full( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
+
+     intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );
+
+#endif

   skein512_4way_full( &ctx.skein, vhash, vhash, 64 );

@@ -586,6 +589,15 @@ int x22i_4way_hash( void *output, const void *input, int thrid )
   simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
   simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );

+#if defined(__VAES__)
+
+   echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+   echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+
+   rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else
+
   dintrlv_2x128_512( hash0, hash1, vhashA );
   dintrlv_2x128_512( hash2, hash3, vhashB );
   
@@ -600,6 +612,8 @@ int x22i_4way_hash( void *output, const void *input, int thrid )

   intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );

+#endif
+
   if ( work_restart[thrid].restart ) return false;
   
   hamsi512_4way_init( &ctx.hamsi );
@@ -607,18 +621,10 @@ int x22i_4way_hash( void *output, const void *input, int thrid )
   hamsi512_4way_close( &ctx.hamsi, vhash );
   dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );

-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash0, 64 );
-   sph_fugue512_close( &ctx.fugue, hash0 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash1, 64 );
-   sph_fugue512_close( &ctx.fugue, hash1 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash2, 64 );
-   sph_fugue512_close( &ctx.fugue, hash2 );
-   sph_fugue512_init( &ctx.fugue );
-   sph_fugue512( &ctx.fugue, hash3, 64 );
-   sph_fugue512_close( &ctx.fugue, hash3 );
+   fugue512_full( &ctx.fugue, hash0, hash0, 64 );
+   fugue512_full( &ctx.fugue, hash1, hash1, 64 );
+   fugue512_full( &ctx.fugue, hash2, hash2, 64 );
+   fugue512_full( &ctx.fugue, hash3, hash3, 64 );

   intrlv_4x32_512( vhash, hash0, hash1, hash2, hash3 );

--- a/algo/x22/x22i-gate.c
+++ b/algo/x22/x22i-gate.c
@@ -20,7 +20,7 @@ bool register_x22i_algo( algo_gate_t* gate )
  gate->scanhash  = (void*)&scanhash_x22i;
  gate->hash      = (void*)&x22i_hash;
  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT
-                      | AVX512_OPT | VAES_OPT;
+                      | AVX512_OPT | VAES_OPT | VAES256_OPT;
 #endif
  return true;
 };
@@ -30,20 +30,15 @@ bool register_x25x_algo( algo_gate_t* gate )
 #if defined (X25X_8WAY)
  gate->scanhash  = (void*)&scanhash_x25x_8way;
  gate->hash      = (void*)&x25x_8way_hash;
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT
-                      | AVX512_OPT | VAES_OPT;
 #elif defined (X25X_4WAY)
  gate->scanhash  = (void*)&scanhash_x25x_4way;
  gate->hash      = (void*)&x25x_4way_hash;
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT
-                      | AVX512_OPT | VAES_OPT;
 #else
  gate->scanhash  = (void*)&scanhash_x25x;
  gate->hash      = (void*)&x25x_hash;
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT
-                      | AVX512_OPT | VAES_OPT;
 #endif
-//  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | SHA_OPT;
+  gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT
+                      | VAES_OPT | VAES256_OPT;

  return true;
 };
--- a/algo/x22/x22i.c
+++ b/algo/x22/x22i.c
@@ -7,9 +7,11 @@
 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif
 #include "algo/skein/sph_skein.h"
 #include "algo/jh/sph_jh.h"
@@ -19,7 +21,6 @@
 #include "algo/shavite/sph_shavite.h"
 #include "algo/simd/nist.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/shabal/sph_shabal.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include <openssl/sha.h>
@@ -36,9 +37,11 @@ union _x22i_context_overlay
 #if defined(__AES__)
        hashState_groestl       groestl;
        hashState_echo          echo;
+        hashState_fugue         fugue;
 #else
        sph_groestl512_context  groestl;
        sph_echo512_context     echo;
+        sph_fugue512_context    fugue;
 #endif
        sph_jh512_context       jh;
        sph_keccak512_context   keccak;
@@ -48,7 +51,6 @@ union _x22i_context_overlay
        sph_shavite512_context  shavite;
        hashState_sd            simd;
        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
        sph_shabal512_context   shabal;
        sph_whirlpool_context   whirlpool;
        SHA512_CTX              sha512;
@@ -129,9 +131,13 @@ int x22i_hash( void *output, const void *input, int thrid )
 	sph_hamsi512(&ctx.hamsi, (const void*) hash, 64);
 	sph_hamsi512_close(&ctx.hamsi, hash);

+#if defined(__AES__)
+        fugue512_full( &ctx.fugue, hash, hash, 64 );
+#else
 	sph_fugue512_init(&ctx.fugue);
 	sph_fugue512(&ctx.fugue, (const void*) hash, 64);
 	sph_fugue512_close(&ctx.fugue, hash);
+#endif

 	sph_shabal512_init(&ctx.shabal);
 	sph_shabal512(&ctx.shabal, (const void*) hash, 64);
--- a/algo/x22/x25x-4way.c
+++ b/algo/x22/x25x-4way.c
@@ -15,10 +15,11 @@
 #include "algo/cubehash/cubehash_sse2.h"
 #include "algo/luffa/luffa-hash-2way.h"
 #include "algo/cubehash/cube-hash-2way.h"
+#include "algo/shavite/shavite-hash-2way.h"
 #include "algo/shavite/sph_shavite.h"
 #include "algo/simd/nist.h"
 #include "algo/simd/simd-hash-2way.h"
-#include "algo/fugue/sph_fugue.h"
+#include "algo/fugue/fugue-aesni.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include "algo/tiger/sph_tiger.h"
 #include "algo/lyra2/lyra2.h"
@@ -72,7 +73,7 @@ union _x25x_8way_ctx_overlay
    cube_4way_context       cube;
    simd_4way_context       simd;
    hamsi512_8way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_8way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_8way_context     sha512;
@@ -303,30 +304,15 @@ int x25x_8way_hash( void *output, const void *input, int thrid )
   dintrlv_8x64_512( hash0[11], hash1[11], hash2[11], hash3[11],
                     hash4[11], hash5[11], hash6[11], hash7[11], vhash );
   
-	sph_fugue512_init(&ctx.fugue);
-	sph_fugue512(&ctx.fugue, (const void*) hash0[11], 64);
-	sph_fugue512_close(&ctx.fugue, hash0[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash1[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash1[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash2[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash2[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash3[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash3[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash4[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash4[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash5[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash5[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash6[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash6[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash7[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash7[12]);
+   fugue512_full( &ctx.fugue, hash0[12], hash0[11], 64 );
+   fugue512_full( &ctx.fugue, hash1[12], hash1[11], 64 );
+   fugue512_full( &ctx.fugue, hash2[12], hash2[11], 64 );
+   fugue512_full( &ctx.fugue, hash3[12], hash3[11], 64 );
+   fugue512_full( &ctx.fugue, hash4[12], hash4[11], 64 );
+   fugue512_full( &ctx.fugue, hash5[12], hash5[11], 64 );
+   fugue512_full( &ctx.fugue, hash6[12], hash6[11], 64 );
+   fugue512_full( &ctx.fugue, hash7[12], hash7[11], 64 );
+
   intrlv_8x32_512( vhash, hash0[12], hash1[12], hash2[12], hash3[12],
                           hash4[12], hash5[12], hash6[12], hash7[12] );

@@ -427,9 +413,9 @@ int x25x_8way_hash( void *output, const void *input, int thrid )
   LYRA2X_2WAY( vhash, 32, vhash, 32, 1, 4, 4 );
   dintrlv_2x256( hash6[19], hash7[19], vhash, 256 );

-	sph_gost512_init(&ctx.gost);
-	sph_gost512 (&ctx.gost, (const void*) hash0[19], 64);
-	sph_gost512_close(&ctx.gost, (void*) hash0[20]);
+   sph_gost512_init(&ctx.gost);
+   sph_gost512 (&ctx.gost, (const void*) hash0[19], 64);
+   sph_gost512_close(&ctx.gost, (void*) hash0[20]);
   sph_gost512_init(&ctx.gost);
   sph_gost512 (&ctx.gost, (const void*) hash1[19], 64);
   sph_gost512_close(&ctx.gost, (void*) hash1[20]);
@@ -589,70 +575,28 @@ int scanhash_x25x_8way( struct work *work, uint32_t max_nonce,
   return 0;
 }

-/*
-int scanhash_x25x_8way( struct work* work, uint32_t max_nonce,
-                   uint64_t *hashes_done, struct thr_info *mythr )
-{
-   uint32_t hash[8*16] __attribute__ ((aligned (128)));
-   uint32_t vdata[24*8] __attribute__ ((aligned (64)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
-   uint32_t *hash7 = &(hash[7<<3]);
-   uint32_t *pdata = work->data;
-   uint32_t *ptarget = work->target;
-   const uint32_t first_nonce = pdata[19];
-   __m512i  *noncev = (__m512i*)vdata + 9;   // aligned
-   uint32_t n = first_nonce;
-   const uint32_t last_nonce = max_nonce - 4;
-   const int thr_id = mythr->id;
-   const uint32_t Htarg = ptarget[7];
-
-   if (opt_benchmark)
-      ((uint32_t*)ptarget)[7] = 0x08ff;
-
-   InitializeSWIFFTX();
-
-   mm512_bswap32_intrlv80_8x64( vdata, pdata );
-   do
-   {
-      *noncev = mm512_intrlv_blend_32( mm512_bswap_32(
-              _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
-                                n+3, 0, n+2, 0, n+1, 0, n,   0 ) ), *noncev );
-      x25x_8way_hash( hash, vdata );
-
-      for ( int lane = 0; lane < 8; lane++ ) if ( hash7[lane] <= Htarg )
-      {
-         extr_lane_8x32( lane_hash, hash, lane, 256 );
-         if ( fulltest( lane_hash, ptarget ) && !opt_benchmark )
-         {
-              pdata[19] = n + lane;
-              submit_solution( work, lane_hash, mythr );
-         }
-      }
-      n += 8;
-   } while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
-
-   *hashes_done = n - first_nonce;
-   return 0;
-}
-*/
-
 #elif defined(X25X_4WAY)

 union _x25x_4way_ctx_overlay
 {
    blake512_4way_context   blake;
    bmw512_4way_context     bmw;
+#if defined(__VAES__)
+    groestl512_2way_context groestl;
+    echo_2way_context       echo;
+#else
    hashState_groestl       groestl;
    hashState_echo          echo;
+#endif
    skein512_4way_context   skein;
    jh512_4way_context      jh;
    keccak512_4way_context  keccak;
-    hashState_luffa         luffa;
-    cubehashParam           cube;
-    sph_shavite512_context  shavite;
-    hashState_sd            simd;
+    luffa_2way_context      luffa;
+    cube_2way_context       cube;
+    shavite512_2way_context shavite;
+    simd_2way_context       simd;
    hamsi512_4way_context   hamsi;
-    sph_fugue512_context    fugue;
+    hashState_fugue         fugue;
    shabal512_4way_context  shabal;
    sph_whirlpool_context   whirlpool;
    sha512_4way_context     sha512;
@@ -673,6 +617,8 @@ int x25x_4way_hash( void *output, const void *input, int thrid )
   unsigned char hash2[25][64] __attribute__((aligned(64))) = {0};
   unsigned char hash3[25][64] __attribute__((aligned(64))) = {0};
   unsigned char vhashX[24][64*4] __attribute__ ((aligned (64)));
+   uint64_t vhashA[8*4] __attribute__ ((aligned (64)));
+   uint64_t vhashB[8*4] __attribute__ ((aligned (64)));
   x25x_4way_ctx_overlay ctx __attribute__ ((aligned (64)));

   blake512_4way_full( &ctx.blake, vhash, input, 80 );
@@ -683,11 +629,25 @@ int x25x_4way_hash( void *output, const void *input, int thrid )
   bmw512_4way_close( &ctx.bmw, vhash );
   dintrlv_4x64_512( hash0[1], hash1[1], hash2[1], hash3[1], vhash );

+#if defined(__VAES__)
+
+   rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
+
+   groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
+   groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
+
+   dintrlv_2x128_512( hash0[2], hash1[2], vhashA );
+   dintrlv_2x128_512( hash2[2], hash3[2], vhashB );
+
+#else
+
   groestl512_full( &ctx.groestl, (char*)hash0[2], (const char*)hash0[1], 512 );
   groestl512_full( &ctx.groestl, (char*)hash1[2], (const char*)hash1[1], 512 );
   groestl512_full( &ctx.groestl, (char*)hash2[2], (const char*)hash2[1], 512 );
   groestl512_full( &ctx.groestl, (char*)hash3[2], (const char*)hash3[1], 512 );

+#endif
+
   intrlv_4x64_512( vhash, hash0[2], hash1[2], hash2[2], hash3[2] );
   skein512_4way_full( &ctx.skein, vhash, vhash, 64 );
   dintrlv_4x64_512( hash0[3], hash1[3], hash2[3], hash3[3], vhash );
@@ -704,41 +664,38 @@ int x25x_4way_hash( void *output, const void *input, int thrid )
   keccak512_4way_close( &ctx.keccak, vhash );
   dintrlv_4x64_512( hash0[5], hash1[5], hash2[5], hash3[5], vhash );

-   luffa_full( &ctx.luffa, (BitSequence*)hash0[6], 512,
-                     (const BitSequence*)hash0[5], 64 );
-   luffa_full( &ctx.luffa, (BitSequence*)hash1[6], 512,
-                     (const BitSequence*)hash1[5], 64 );
-   luffa_full( &ctx.luffa, (BitSequence*)hash2[6], 512,
-                     (const BitSequence*)hash2[5], 64 );
-   luffa_full( &ctx.luffa, (BitSequence*)hash3[6], 512,
-                     (const BitSequence*)hash3[5], 64 );
+   rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );

-   cubehash_full( &ctx.cube, (byte*)hash0[7], 512, (const byte*)hash0[6], 64 );
-   cubehash_full( &ctx.cube, (byte*)hash1[7], 512, (const byte*)hash1[6], 64 );
-   cubehash_full( &ctx.cube, (byte*)hash2[7], 512, (const byte*)hash2[6], 64 );
-   cubehash_full( &ctx.cube, (byte*)hash3[7], 512, (const byte*)hash3[6], 64 );
+   luffa512_2way_full( &ctx.luffa, vhashA, vhashA, 64 );
+   luffa512_2way_full( &ctx.luffa, vhashB, vhashB, 64 );
+   dintrlv_2x128_512( hash0[6], hash1[6], vhashA );
+   dintrlv_2x128_512( hash2[6], hash3[6], vhashB );
   
-   sph_shavite512_init(&ctx.shavite);
-   sph_shavite512(&ctx.shavite, (const void*) hash0[7], 64);
-   sph_shavite512_close(&ctx.shavite, hash0[8]);
-   sph_shavite512_init(&ctx.shavite);
-   sph_shavite512(&ctx.shavite, (const void*) hash1[7], 64);
-   sph_shavite512_close(&ctx.shavite, hash1[8]);
-   sph_shavite512_init(&ctx.shavite);
-   sph_shavite512(&ctx.shavite, (const void*) hash2[7], 64);
-   sph_shavite512_close(&ctx.shavite, hash2[8]);
-   sph_shavite512_init(&ctx.shavite);
-   sph_shavite512(&ctx.shavite, (const void*) hash3[7], 64);
-   sph_shavite512_close(&ctx.shavite, hash3[8]);
+   cube_2way_full( &ctx.cube, vhashA, 512, vhashA, 64 );
+   cube_2way_full( &ctx.cube, vhashB, 512, vhashB, 64 );
+   dintrlv_2x128_512( hash0[7], hash1[7], vhashA );
+   dintrlv_2x128_512( hash2[7], hash3[7], vhashB );

-   simd_full( &ctx.simd, (BitSequence*)hash0[9],
-                   (const BitSequence*)hash0[8], 512 );
-   simd_full( &ctx.simd, (BitSequence*)hash1[9],
-                   (const BitSequence*)hash1[8], 512 );
-   simd_full( &ctx.simd, (BitSequence*)hash2[9],
-                   (const BitSequence*)hash2[8], 512 );
-   simd_full( &ctx.simd, (BitSequence*)hash3[9],
-                   (const BitSequence*)hash3[8], 512 );
+   shavite512_2way_full( &ctx.shavite, vhashA, vhashA, 64 );
+   shavite512_2way_full( &ctx.shavite, vhashB, vhashB, 64 );
+   dintrlv_2x128_512( hash0[8], hash1[8], vhashA );
+   dintrlv_2x128_512( hash2[8], hash3[8], vhashB );
+
+   simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
+   simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );
+   dintrlv_2x128_512( hash0[9], hash1[9], vhashA );
+   dintrlv_2x128_512( hash2[9], hash3[9], vhashB );
+
+#if defined(__VAES__)
+
+   echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
+   echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
+   dintrlv_2x128_512( hash0[10], hash1[10], vhashA );
+   dintrlv_2x128_512( hash2[10], hash3[10], vhashB );
+
+   rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
+
+#else

   echo_full( &ctx.echo, (BitSequence *)hash0[10], 512,
                   (const BitSequence *)hash0[ 9], 64 );
@@ -751,6 +708,8 @@ int x25x_4way_hash( void *output, const void *input, int thrid )

   intrlv_4x64_512( vhash, hash0[10], hash1[10], hash2[10], hash3[10] );

+#endif
+
   if ( work_restart[thrid].restart ) return 0;
   
   hamsi512_4way_init( &ctx.hamsi );
@@ -758,18 +717,10 @@ int x25x_4way_hash( void *output, const void *input, int thrid )
   hamsi512_4way_close( &ctx.hamsi, vhash );
   dintrlv_4x64_512( hash0[11], hash1[11], hash2[11], hash3[11], vhash );

-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash0[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash0[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash1[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash1[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash2[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash2[12]);
-   sph_fugue512_init(&ctx.fugue);
-   sph_fugue512(&ctx.fugue, (const void*) hash3[11], 64);
-   sph_fugue512_close(&ctx.fugue, hash3[12]);
+   fugue512_full( &ctx.fugue, hash0[12], hash0[11], 64 );
+   fugue512_full( &ctx.fugue, hash1[12], hash1[11], 64 );
+   fugue512_full( &ctx.fugue, hash2[12], hash2[11], 64 );
+   fugue512_full( &ctx.fugue, hash3[12], hash3[11], 64 );

   intrlv_4x32_512( vhash, hash0[12], hash1[12], hash2[12], hash3[12] );
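
Note on the interleave helpers used in this hunk: the following is a minimal scalar sketch of what a 4-lane, 64-bit interleave and its inverse do, assuming the layout in which word i of lane j lands at index 4*i + j. The names model_intrlv_4x64 and model_dintrlv_4x64 are hypothetical and exist only for illustration; the real intrlv_4x64_512 / dintrlv_4x64_512 routines are vectorized and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (assumed layout): word i of lane j is stored at index
 * 4*i + j, so one 256 bit load fetches word i of all four lanes. */
void model_intrlv_4x64( uint64_t *dst, const uint64_t *s0,
                        const uint64_t *s1, const uint64_t *s2,
                        const uint64_t *s3, size_t words )
{
   assert( dst && s0 && s1 && s2 && s3 );
   for ( size_t i = 0; i < words; i++ )
   {
      dst[ 4*i     ] = s0[i];
      dst[ 4*i + 1 ] = s1[i];
      dst[ 4*i + 2 ] = s2[i];
      dst[ 4*i + 3 ] = s3[i];
   }
}

/* Inverse of the model above: split the interleaved buffer back into
 * four independent lanes. */
void model_dintrlv_4x64( uint64_t *d0, uint64_t *d1, uint64_t *d2,
                         uint64_t *d3, const uint64_t *src, size_t words )
{
   assert( d0 && d1 && d2 && d3 && src );
   for ( size_t i = 0; i < words; i++ )
   {
      d0[i] = src[ 4*i     ];
      d1[i] = src[ 4*i + 1 ];
      d2[i] = src[ 4*i + 2 ];
      d3[i] = src[ 4*i + 3 ];
   }
}
```

For a 512 bit hash per lane, words would be 8. The 2x128 form used by the 2-way functions presumably follows the same idea with 128 bit chunks alternating between two lanes.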

--- a/algo/x22/x25x.c
+++ b/algo/x22/x25x.c
@@ -7,9 +7,11 @@
 #if defined(__AES__)
  #include "algo/echo/aes_ni/hash_api.h"
  #include "algo/groestl/aes_ni/hash-groestl.h"
+  #include "algo/fugue/fugue-aesni.h"
 #else
  #include "algo/groestl/sph_groestl.h"
  #include "algo/echo/sph_echo.h"
+  #include "algo/fugue/sph_fugue.h"
 #endif
 #include "algo/skein/sph_skein.h"
 #include "algo/jh/sph_jh.h"
@@ -19,7 +21,6 @@
 #include "algo/shavite/sph_shavite.h"
 #include "algo/simd/nist.h"
 #include "algo/hamsi/sph_hamsi.h"
-#include "algo/fugue/sph_fugue.h"
 #include "algo/shabal/sph_shabal.h"
 #include "algo/whirlpool/sph_whirlpool.h"
 #include <openssl/sha.h>
@@ -39,9 +40,11 @@ union _x25x_context_overlay
 #if defined(__AES__)
        hashState_groestl       groestl;
        hashState_echo          echo;
+        hashState_fugue         fugue;
 #else
        sph_groestl512_context  groestl;
        sph_echo512_context     echo;
+        sph_fugue512_context    fugue;
 #endif
        sph_jh512_context       jh;
        sph_keccak512_context   keccak;
@@ -51,7 +54,6 @@ union _x25x_context_overlay
        sph_shavite512_context  shavite;
        hashState_sd            simd;
        sph_hamsi512_context    hamsi;
-        sph_fugue512_context    fugue;
        sph_shabal512_context   shabal;
        sph_whirlpool_context   whirlpool;
        SHA512_CTX              sha512;
@@ -133,9 +135,13 @@ int x25x_hash( void *output, const void *input, int thrid )
 	sph_hamsi512(&ctx.hamsi, (const void*) &hash[10], 64);
 	sph_hamsi512_close(&ctx.hamsi, &hash[11]);

+#if defined(__AES__)
+        fugue512_full( &ctx.fugue, &hash[12], &hash[11], 64 );
+#else
 	sph_fugue512_init(&ctx.fugue);
 	sph_fugue512(&ctx.fugue, (const void*) &hash[11], 64);
 	sph_fugue512_close(&ctx.fugue, &hash[12]);
+#endif

 	sph_shabal512_init(&ctx.shabal);
 	sph_shabal512(&ctx.shabal, (const void*) &hash[12], 64);
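
Note on the *_full calls introduced above: they replace an init / update / close triple with a single call. The sketch below illustrates that pattern only. The toy_* names and the toy FNV-1a style mixing are hypothetical stand-ins, not Fugue and not part of cpuminer-opt, and the argument order merely mirrors the fugue512_full call shown in the hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy context standing in for a real hash state such as
 * hashState_fugue. Not cryptographic. */
typedef struct { uint64_t state; } toy_ctx;

void toy_init( toy_ctx *c ) { c->state = 0xcbf29ce484222325ULL; }

void toy_update( toy_ctx *c, const void *data, size_t len )
{
   const unsigned char *p = data;
   for ( size_t i = 0; i < len; i++ )
   {
      c->state ^= p[i];              /* mix in one byte */
      c->state *= 0x100000001b3ULL;  /* FNV-1a 64 bit prime */
   }
}

void toy_close( toy_ctx *c, uint64_t *out ) { *out = c->state; }

/* One-shot wrapper: same result as init + update + close, one call. */
void toy_full( toy_ctx *c, uint64_t *out, const void *data, size_t len )
{
   assert( c && out && ( data || len == 0 ) );
   toy_init( c );
   toy_update( c, data, len );
   toy_close( c, out );
}
```

The wrapper re-initializes the context itself, which is why the same ctx member can be reused for several consecutive calls without an explicit init in between.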
--- a/algo/yescrypt/yescrypt.c
+++ b/algo/yescrypt/yescrypt.c
@@ -445,7 +445,7 @@ bool register_yescrypt_algo( algo_gate_t* gate )

   YESCRYPT_P = 1;

-   applog( LOG_NOTICE,"Yescrypt parameters: N= %d, R= %d.", YESCRYPT_N,
+   applog( LOG_NOTICE,"Yescrypt parameters: N= %d, R= %d", YESCRYPT_N,
                                                            YESCRYPT_R );
   if ( yescrypt_client_key )
     applog( LOG_NOTICE,"Key= \"%s\"\n", yescrypt_client_key );
--- a/algo/yespower/yespower-gate.c
+++ b/algo/yespower/yespower-gate.c
@@ -139,7 +139,7 @@ bool register_yespower_algo( algo_gate_t* gate )
     yespower_params.perslen = 0;
  }

-  applog( LOG_NOTICE,"Yespower parameters: N= %d, R= %d.", yespower_params.N,
+  applog( LOG_NOTICE,"Yespower parameters: N= %d, R= %d", yespower_params.N,
                                                           yespower_params.r );
  if ( yespower_params.pers )
     applog( LOG_NOTICE,"Key= \"%s\"\n", yespower_params.pers );
@@ -264,7 +264,7 @@ bool register_power2b_algo( algo_gate_t* gate )
  yespower_params.pers = "Now I am become Death, the destroyer of worlds";
  yespower_params.perslen = 46;

-  applog( LOG_NOTICE,"yespower-b2b parameters: N= %d, R= %d.", yespower_params.N,
+  applog( LOG_NOTICE,"yespower-b2b parameters: N= %d, R= %d", yespower_params.N,
                                                           yespower_params.r );
  applog( LOG_NOTICE,"Key= \"%s\"", yespower_params.pers );
  applog( LOG_NOTICE,"Key length= %d\n", yespower_params.perslen );
--- a/algo/yespower/yespower.h
+++ b/algo/yespower/yespower.h
@@ -76,7 +76,7 @@ typedef struct {
 	unsigned char uc[32];
 } yespower_binary_t __attribute__ ((aligned (64)));

-yespower_params_t yespower_params;
+extern yespower_params_t yespower_params;

 //SHA256_CTX sha256_prehash_ctx;
 extern __thread SHA256_CTX sha256_prehash_ctx;
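
Note on the extern change above: a plain "yespower_params_t yespower_params;" in a header is a tentative definition in every file that includes it. That links only when the compiler merges common symbols; with -fno-common (the default since GCC 10, and now passed explicitly in build-allarch.sh) it produces multiple-definition errors. The sketch below shows the declaration/definition split in a single file; demo_params_t and demo_params are hypothetical names used only for illustration.

```c
#include <assert.h>

/* What would live in a header: a declaration only, no storage. */
typedef struct { int N; int r; } demo_params_t;
extern demo_params_t demo_params;

/* What would live in exactly one .c file: the single definition. */
demo_params_t demo_params = { 2048, 32 };

/* Any other file that includes the header can use the object. */
int demo_params_product( void )
{
   assert( demo_params.N > 0 && demo_params.r > 0 );
   return demo_params.N * demo_params.r;
}
```

The same reasoning applies to the exp32 .. exp160 constants that miner.h now declares extern.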
--- a/build-allarch.sh
+++ b/build-allarch.sh
@@ -4,93 +4,127 @@
 # during development. However the information contained may provide compilation
 # tips to users.

-rm cpuminer-avx512-sha-vaes cpuminer-avx512 cpuminer-avx2 cpuminer-aes-avx cpuminer-aes-sse42 cpuminer-sse42 cpuminer-ssse3 cpuminer-sse2 cpuminer-zen  > /dev/null
+rm cpuminer-avx512-sha-vaes cpuminer-avx512-sha cpuminer-avx512 cpuminer-avx2 cpuminer-aes-avx cpuminer-aes-sse42 cpuminer-sse42 cpuminer-ssse3 cpuminer-sse2 cpuminer-zen cpuminer-zen3  > /dev/null

+# Icelake AVX512 SHA VAES
 make distclean || echo clean
 rm -f config.status
 ./autogen.sh || echo done
-CFLAGS="-O3 -march=icelake-client -Wall" ./configure --with-curl
-make -j 16
+CFLAGS="-O3 -march=icelake-client -Wall -fno-common" ./configure --with-curl
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe cpuminer-avx512-sha-vaes.exe
 strip -s cpuminer
 mv cpuminer cpuminer-avx512-sha-vaes

-CFLAGS="-O3 -march=skylake-avx512 -Wall" ./configure --with-curl
-make -j 16
+# Rocketlake AVX512 AES SHA
+make clean || echo clean
+rm -f config.status
+CFLAGS="-O3 -march=skylake-avx512 -msha -Wall -fno-common" ./configure --with-curl
+# CFLAGS="-O3 -march=rocketlake -Wall -fno-common" ./configure --with-curl
+make -j 8
+strip -s cpuminer.exe
+mv cpuminer.exe cpuminer-avx512-sha.exe
+strip -s cpuminer
+mv cpuminer cpuminer-avx512-sha
+
+# Skylake-X AVX512 AES
+make clean || echo clean
+rm -f config.status
+CFLAGS="-O3 -march=skylake-avx512 -Wall -fno-common" ./configure --with-curl
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe cpuminer-avx512.exe
 strip -s cpuminer
 mv cpuminer cpuminer-avx512

+# Haswell AVX2 AES
 make clean || echo clean
 rm -f config.status
 # GCC 9 doesn't include AES with core-avx2
-CFLAGS="-O3 -march=core-avx2 -maes -Wall" ./configure --with-curl
-make -j 16
+CFLAGS="-O3 -march=core-avx2 -maes -Wall -fno-common" ./configure --with-curl
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe cpuminer-avx2.exe
 strip -s cpuminer
 mv cpuminer cpuminer-avx2

+# Sandybridge AVX AES
 make clean || echo clean
 rm -f config.status
-CFLAGS="-O3 -march=corei7-avx -maes -Wall" ./configure --with-curl
-make -j 16
+CFLAGS="-O3 -march=corei7-avx -maes -Wall -fno-common" ./configure --with-curl
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe cpuminer-avx.exe
 strip -s cpuminer
-mv cpuminer cpuminer-aes-avx
+mv cpuminer cpuminer-avx

+# Westmere SSE4.2 AES
 make clean || echo clean
 rm -f config.status
-CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure --with-curl
-make -j 16
+CFLAGS="-O3 -march=westmere -Wall -fno-common" ./configure --with-curl
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe cpuminer-aes-sse42.exe
 strip -s cpuminer
 mv cpuminer cpuminer-aes-sse42

-#make clean || echo clean
-#rm -f config.status
-#CFLAGS="-O3 -march=corei7 -Wall" ./configure --with-curl
-#make -j 16
-#strip -s cpuminer.exe
-#mv cpuminer.exe cpuminer-sse42.exe
-#strip -s cpuminer
-#mv cpuminer cpuminer-sse42
-
-#make clean || echo clean
-#rm -f config.status
-#CFLAGS="-O3 -march=core2 -Wall" ./configure --with-curl
-#make -j 16
-#strip -s cpuminer.exe
-#mv cpuminer.exe cpuminer-ssse3.exe
-#strip -s cpuminer
-#mv cpuminer cpuminer-ssse3
-
+# Nehalem SSE4.2
 make clean || echo clean
 rm -f config.status
-CFLAGS="-O3 -msse2 -Wall" ./configure --with-curl
-make -j 16
+CFLAGS="-O3 -march=corei7 -Wall -fno-common" ./configure --with-curl
+make -j 8
+strip -s cpuminer.exe
+mv cpuminer.exe cpuminer-sse42.exe
+strip -s cpuminer
+mv cpuminer cpuminer-sse42
+
+# Core2 SSSE3
+make clean || echo clean
+rm -f config.status
+CFLAGS="-O3 -march=core2 -Wall -fno-common" ./configure --with-curl
+make -j 8
+strip -s cpuminer.exe
+mv cpuminer.exe cpuminer-ssse3.exe
+strip -s cpuminer
+mv cpuminer cpuminer-ssse3
+
+# Generic SSE2
+make clean || echo clean
+rm -f config.status
+CFLAGS="-O3 -msse2 -Wall -fno-common" ./configure --with-curl
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe cpuminer-sse2.exe
 strip -s cpuminer
 mv cpuminer cpuminer-sse2

+# Zen1 AVX2 SHA
 make clean || echo done
 rm -f config.status
-CFLAGS="-O3 -march=znver1 -Wall" ./configure --with-curl
-make -j 16
+CFLAGS="-O3 -march=znver1 -Wall -fno-common" ./configure --with-curl
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe cpuminer-zen.exe
 strip -s cpuminer
 mv cpuminer cpuminer-zen

+# Zen3 AVX2 SHA VAES
 make clean || echo done
 rm -f config.status
-CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
-make -j 16
+CFLAGS="-O3 -march=znver2 -mvaes -Wall -fno-common" ./configure --with-curl
+# CFLAGS="-O3 -march=znver3 -Wall -fno-common" ./configure --with-curl
+make -j 8
+strip -s cpuminer.exe
+mv cpuminer.exe cpuminer-zen3.exe
+strip -s cpuminer
+mv cpuminer cpuminer-zen3
+
+# Native to current CPU
+make clean || echo done
+rm -f config.status
+CFLAGS="-O3 -march=native -Wall -fno-common" ./configure --with-curl
+make -j 8
 strip -s cpuminer.exe
 strip -s cpuminer

--- a/build-avx2.sh
+++ b/build-avx2.sh
@@ -1,27 +0,0 @@
-#!/bin/bash
-
-#if [ "$OS" = "Windows_NT" ]; then
-#    ./mingw64.sh
-#    exit 0
-#fi
-
-# Linux build
-
-make distclean || echo clean
-
-rm -f config.status
-./autogen.sh || echo done
-
-# Ubuntu 10.04 (gcc 4.4)
-# extracflags="-O3 -march=native -Wall -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"
-
-# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
-#extracflags="$extracflags -Ofast -flto -fuse-linker-plugin -ftree-loop-if-convert-stores"
-
-#CFLAGS="-O3 -march=native -Wall" ./configure --with-curl --with-crypto=$HOME/usr
-CFLAGS="-O3 -march=haswell -maes -Wall" ./configure --with-curl
-#CFLAGS="-O3 -march=native -Wall" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-curl
-
-make -j 4
-
-strip -s cpuminer
--- a/build.sh
+++ b/build.sh
@@ -12,15 +12,8 @@ make distclean || echo clean
 rm -f config.status
 ./autogen.sh || echo done

-# Ubuntu 10.04 (gcc 4.4)
-# extracflags="-O3 -march=native -Wall -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"
-
-# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
-#extracflags="$extracflags -Ofast -flto -fuse-linker-plugin -ftree-loop-if-convert-stores"
-
 #CFLAGS="-O3 -march=native -Wall" ./configure --with-curl --with-crypto=$HOME/usr
 CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
-#CFLAGS="-O3 -march=native -Wall" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-curl

 make -j 4

--- a/buildjdd.sh
+++ b/buildjdd.sh
@@ -1,27 +0,0 @@
-#!/bin/bash
-
-#if [ "$OS" = "Windows_NT" ]; then
-#    ./mingw64.sh
-#    exit 0
-#fi
-
-# Linux build
-
-make distclean || echo clean
-
-rm -f config.status
-./autogen.sh || echo done
-
-# Ubuntu 10.04 (gcc 4.4)
-# extracflags="-O3 -march=native -Wall -D_REENTRANT -funroll-loops -fvariable-expansion-in-unroller -fmerge-all-constants -fbranch-target-load-optimize2 -fsched2-use-superblocks -falign-loops=16 -falign-functions=16 -falign-jumps=16 -falign-labels=16"
-
-# Debian 7.7 / Ubuntu 14.04 (gcc 4.7+)
-#extracflags="$extracflags -Ofast -flto -fuse-linker-plugin -ftree-loop-if-convert-stores"
-
-CFLAGS="-O3 -march=corei7-avx -msha -Wall" ./configure --with-curl
-#CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
-#CFLAGS="-O3 -march=native -Wall" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-curl
-
-make -j 4
-
-strip -s cpuminer
--- a/clean-all.sh
+++ b/clean-all.sh
@@ -1,10 +1,9 @@
 #!/bin/bash
 #
-# imake clean and rm all the targetted executables.
-# tips to users.
+# make clean and rm all the targeted executables.

-rm cpuminer-avx512-sha-vaes cpuminer-avx512 cpuminer-avx2 cpuminer-aes-avx cpuminer-aes-sse42 cpuminer-sse2 cpuminer-zen  > /dev/null
+rm cpuminer-avx512-sha-vaes cpuminer-avx512-sha cpuminer-avx512 cpuminer-avx2 cpuminer-avx cpuminer-aes-sse42 cpuminer-sse2 cpuminer-zen cpuminer-sse42 cpuminer-ssse3 cpuminer-zen3 > /dev/null

-rm cpuminer-avx512-sha-vaes.exe cpuminer-avx512.exe cpuminer-avx2.exe cpuminer-aes-avx.exe cpuminer-aes-sse42.exe cpuminer-sse2.exe cpuminer-zen.exe  > /dev/null
+rm cpuminer-avx512-sha-vaes.exe cpuminer-avx512-sha.exe cpuminer-avx512.exe cpuminer-avx2.exe cpuminer-avx.exe cpuminer-aes-sse42.exe cpuminer-sse2.exe cpuminer-zen.exe  cpuminer-sse42.exe cpuminer-ssse3.exe cpuminer-zen3.exe > /dev/null

 make distclean > /dev/null
--- a/configure
+++ b/configure
@@ -1,6 +1,6 @@
 #! /bin/sh
 # Guess values for system-dependent variables and create Makefiles.
-# Generated by GNU Autoconf 2.69 for cpuminer-opt 3.13.1.
+# Generated by GNU Autoconf 2.69 for cpuminer-opt 3.15.2.
 #
 #
 # Copyright (C) 1992-1996, 1998-2012 Free Software Foundation, Inc.
@@ -577,8 +577,8 @@ MAKEFLAGS=
 # Identity of this package.
 PACKAGE_NAME='cpuminer-opt'
 PACKAGE_TARNAME='cpuminer-opt'
-PACKAGE_VERSION='3.13.1'
-PACKAGE_STRING='cpuminer-opt 3.13.1'
+PACKAGE_VERSION='3.15.2'
+PACKAGE_STRING='cpuminer-opt 3.15.2'
 PACKAGE_BUGREPORT=''
 PACKAGE_URL=''

@@ -1332,7 +1332,7 @@ if test "$ac_init_help" = "long"; then
  # Omit some internal or obsolete options to make the list less imposing.
  # This message is too long to be a string in the A/UX 3.1 sh.
  cat <<_ACEOF
-\`configure' configures cpuminer-opt 3.13.1 to adapt to many kinds of systems.
+\`configure' configures cpuminer-opt 3.15.2 to adapt to many kinds of systems.

 Usage: $0 [OPTION]... [VAR=VALUE]...

@@ -1404,7 +1404,7 @@ fi

 if test -n "$ac_init_help"; then
  case $ac_init_help in
-     short | recursive ) echo "Configuration of cpuminer-opt 3.13.1:";;
+     short | recursive ) echo "Configuration of cpuminer-opt 3.15.2:";;
   esac
  cat <<\_ACEOF

@@ -1509,7 +1509,7 @@ fi
 test -n "$ac_init_help" && exit $ac_status
 if $ac_init_version; then
  cat <<\_ACEOF
-cpuminer-opt configure 3.13.1
+cpuminer-opt configure 3.15.2
 generated by GNU Autoconf 2.69

 Copyright (C) 2012 Free Software Foundation, Inc.
@@ -2012,7 +2012,7 @@ cat >config.log <<_ACEOF
 This file contains any messages produced by compilers while
 running configure, to aid debugging if configure makes a mistake.

-It was created by cpuminer-opt $as_me 3.13.1, which was
+It was created by cpuminer-opt $as_me 3.15.2, which was
 generated by GNU Autoconf 2.69.  Invocation command line was

  $ $0 $@
@@ -2993,7 +2993,7 @@ fi

 # Define the identity of the package.
 PACKAGE='cpuminer-opt'
- VERSION='3.13.1'
+ VERSION='3.15.2'


 cat >>confdefs.h <<_ACEOF
@@ -6690,7 +6690,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
 # report actual input values of CONFIG_FILES etc. instead of their
 # values after options handling.
 ac_log="
-This file was extended by cpuminer-opt $as_me 3.13.1, which was
+This file was extended by cpuminer-opt $as_me 3.15.2, which was
 generated by GNU Autoconf 2.69.  Invocation command line was

  CONFIG_FILES    = $CONFIG_FILES
@@ -6756,7 +6756,7 @@ _ACEOF
 cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
 ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
 ac_cs_version="\\
-cpuminer-opt config.status 3.13.1
+cpuminer-opt config.status 3.15.2
 configured by $0, generated by GNU Autoconf 2.69,
  with options \\"\$ac_cs_config\\"

--- a/configure.ac
+++ b/configure.ac
@@ -1,4 +1,4 @@
-AC_INIT([cpuminer-opt], [3.13.1])
+AC_INIT([cpuminer-opt], [3.15.2])

 AC_PREREQ([2.59c])
 AC_CANONICAL_SYSTEM
--- a/cpu-miner.c
+++ b/cpu-miner.c
--- a/miner.h
+++ b/miner.h
@@ -83,6 +83,8 @@ enum {
 };
 #endif

+extern bool is_power_of_2( int n );
+
 static inline bool is_windows(void)
 {
 #ifdef WIN32
@@ -313,6 +315,10 @@ size_t address_to_script( unsigned char *out, size_t outsz, const char *addr );
 int    timeval_subtract( struct timeval *result, struct timeval *x,
                           struct timeval *y);

+// Segwit BEGIN
+extern void memrev(unsigned char *p, size_t len);
+// Segwit END
+
 // Bitcoin formula for converting difficulty to an equivalent
 // number of hashes.
 //
@@ -324,12 +330,12 @@ int    timeval_subtract( struct timeval *result, struct timeval *x,

 #define EXP16 65536.
 #define EXP32 4294967296.
-const long double exp32;  // 2**32
-const long double exp48;  // 2**48
-const long double exp64;  // 2**64
-const long double exp96;  // 2**96
-const long double exp128; // 2**128
-const long double exp160; // 2**160
+extern const long double exp32;  // 2**32
+extern const long double exp48;  // 2**48
+extern const long double exp64;  // 2**64
+extern const long double exp96;  // 2**96
+extern const long double exp128; // 2**128
+extern const long double exp160; // 2**160

 bool   fulltest( const uint32_t *hash, const uint32_t *target );
 bool   valid_hash( const void*, const void* );
@@ -374,36 +380,25 @@ void   cpu_brand_string( char* s );
 float cpu_temp( int core );
 */

-struct work {
+struct work
+{
+   uint32_t target[8] __attribute__ ((aligned (64)));
 	uint32_t data[48] __attribute__ ((aligned (64)));
-	uint32_t target[8] __attribute__ ((aligned (64)));
-
 	double targetdiff;
-//	double shareratio;
 	double sharediff;
   double stratum_diff;
-
 	int height;
 	char *txs;
 	char *workid;
-
 	char *job_id;
 	size_t xnonce2_len;
 	unsigned char *xnonce2;
   bool sapling;
   bool stale;
-
-   // x16rt
-   uint32_t merkleroothash[8];
-   uint32_t witmerkleroothash[8];
-   uint32_t denom10[8];
-   uint32_t denom100[8];
-   uint32_t denom1000[8];
-   uint32_t denom10000[8];
-
 } __attribute__ ((aligned (64)));

-struct stratum_job {
+struct stratum_job
+{
 	unsigned char prevhash[32];
   unsigned char final_sapling_hash[32];
   char *job_id;
@@ -417,7 +412,7 @@ struct stratum_job {
 	unsigned char ntime[4];
 	double diff;
   bool clean;
-   // for x16rt
+   // for x16rt-veil
   unsigned char extra[64];
   unsigned char denom10[32];
   unsigned char denom100[32];
@@ -752,6 +747,7 @@ extern double opt_diff_factor;
 extern double opt_target_factor;
 extern bool opt_randomize;
 extern bool allow_mininginfo;
+extern pthread_rwlock_t g_work_lock;
 extern time_t g_work_time;
 extern bool opt_stratum_stats;
 extern int num_cpus;
--- a/simd-utils/simd-128.h
+++ b/simd-utils/simd-128.h
@@ -135,11 +135,17 @@ static inline __m128i mm128_neg1_fn()
 // Bitwise not (~v)  
 #define mm128_not( v )          _mm_xor_si128( (v), m128_neg1 ) 

-// Unary negation of elements
+// Unary negation of elements (-v)
 #define mm128_negate_64( v )    _mm_sub_epi64( m128_zero, v )
 #define mm128_negate_32( v )    _mm_sub_epi32( m128_zero, v )  
 #define mm128_negate_16( v )    _mm_sub_epi16( m128_zero, v )  

+// Clear (zero) 32 bit elements based on bits set in 4 bit mask.
+// Fast, avoids using vector mask, but only available for 128 bit vectors.
+#define mm128_mask_32( a, mask ) \
+   _mm_castps_si128( _mm_insert_ps( _mm_castsi128_ps( a ), \
+                                    _mm_castsi128_ps( a ), mask ) )
+
 // Add 4 values, fewer dependencies than sequential addition.
 #define mm128_add4_64( a, b, c, d ) \
   _mm_add_epi64( _mm_add_epi64( a, b ), _mm_add_epi64( c, d ) )
@@ -269,11 +275,8 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )
 // Rotate vector elements across all lanes

 #define mm128_swap_64( v )    _mm_shuffle_epi32( v, 0x4e )
-
 #define mm128_ror_1x32( v )   _mm_shuffle_epi32( v, 0x39 )
 #define mm128_rol_1x32( v )   _mm_shuffle_epi32( v, 0x93 )
-
-
 //#define mm128_swap_64( v )    _mm_alignr_epi8( v, v,  8 )
 //#define mm128_ror_1x32( v )   _mm_alignr_epi8( v, v,  4 )
 //#define mm128_rol_1x32( v )   _mm_alignr_epi8( v, v, 12 )
@@ -282,53 +285,11 @@ static inline void memcpy_128( __m128i *dst, const __m128i *src, const int n )
 #define mm128_ror_1x8( v )    _mm_alignr_epi8( v, v,  1 )
 #define mm128_rol_1x8( v )    _mm_alignr_epi8( v, v, 15 )

+// Rotate by c bytes
 #define mm128_ror_x8( v, c )  _mm_alignr_epi8( v, c )
 #define mm128_rol_x8( v, c )  _mm_alignr_epi8( v, 16-(c) )


-/*
-// Rotate 16 byte (128 bit) vector by c bytes.
-// Less efficient using shift but more versatile. Use only for odd number
-// byte rotations. Use shuffle above whenever possible.
-#define mm128_ror_x8( v, c ) \
-   _mm_or_si128( _mm_srli_si128( v, c ), _mm_slli_si128( v, 16-(c) ) )
-
-#define mm128_rol_x8( v, c ) \
-   _mm_or_si128( _mm_slli_si128( v, c ), _mm_srli_si128( v, 16-(c) ) )
-
-#if defined (__SSE3__)
-// no SSE2 implementation, no current users
-
-#define mm128_ror_1x16( v ) \
-   _mm_shuffle_epi8( v, m128_const_64( 0x01000f0e0d0c0b0a, \
-                                       0x0908070605040302 ) )
-#define mm128_rol_1x16( v ) \
-   _mm_shuffle_epi8( v, m128_const_64( 0x0d0c0b0a09080706, \
-                                       0x0504030201000f0e ) )
-#define mm128_ror_1x8( v ) \
-   _mm_shuffle_epi8( v, m128_const_64( 0x000f0e0d0c0b0a09, \
-                                       0x0807060504030201 ) )
-#define mm128_rol_1x8( v ) \
-   _mm_shuffle_epi8( v, m128_const_64( 0x0e0d0c0b0a090807, \
-                                       0x060504030201000f ) )
-#else  // SSE2
-
-#define mm128_ror_1x16( v ) \
-   _mm_or_si128( _mm_srli_si128( v, 2 ), _mm_slli_si128( v, 14 ) )
-
-#define mm128_rol_1x16( v ) \
-   _mm_or_si128( _mm_slli_si128( v, 2 ), _mm_srli_si128( v, 14 ) )
-
-#define mm128_ror_1x8( v ) \
-   _mm_or_si128( _mm_srli_si128( v, 1 ), _mm_slli_si128( v, 15 ) )
-
-#define mm128_rol_1x8( v ) \
-   _mm_or_si128( _mm_slli_si128( v, 1 ), _mm_srli_si128( v, 15 ) )
-
-#endif   // SSE3 else SSE2
-*/
-
-
 // Invert vector: {3,2,1,0} -> {0,1,2,3}
 #define mm128_invert_32( v ) _mm_shuffle_epi32( v, 0x1b )

--- a/simd-utils/simd-256.h
+++ b/simd-utils/simd-256.h
@@ -26,8 +26,6 @@
 #define mm256_concat_128( hi, lo ) \
   _mm256_inserti128_si256( _mm256_castsi128_si256( lo ), hi, 1 )

-#define m256_const1_128( v ) \
-         _mm256_broadcastsi128_si256( v )

 // Equivalent of set, move 64 bit integer constants to respective 64 bit
 // elements.
@@ -144,10 +142,11 @@ do { \

 // Parallel AES, for when x is expected to be in a 256 bit register.
 // Use same 128 bit key.
-#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#if defined(__VAES__)

 #define mm256_aesenc_2x128( x, k ) \
-   _mm256_aesenc_epi128( x, m256_const1_128(k ) )
+   _mm256_aesenc_epi128( x, k )

 #else

--- a/simd-utils/simd-512.h
+++ b/simd-utils/simd-512.h
@@ -56,15 +56,15 @@
 //    If an expensive constant is to be reused in the same function it should
 //    be declared as a local variable defined once and reused.
 //
-//    Permutations cab be very exppensive if they use a vector control index,
+//    Permutations can be very expensive if they use a vector control index,
 //    even if the permutation itself is quite efficient.
 //    The index is essentially a constant with all the baggage that brings.
 //    The same rules apply, if an index is to be reused it should be defined
 //    as a local. This applies specifically to bswap operations.
 //
 //    Additionally, permutations using smaller vectors can be more efficient
-//    if the permutation doesn't cross lane boundaries ,typically 128 bits,
-//    ans the smnaller vector can use an imm comtrol.
+//    if the permutation doesn't cross lane boundaries, typically 128 bits,
+//    and the smaller vector can use an imm control.
 //
 //    If the permutation doesn't cross lane boundaries a shuffle instruction
 //    can be used with imm control instead of permute.
@@ -182,7 +182,10 @@ static inline __m512i m512_const4_64( const uint64_t i3, const uint64_t i2,
 //
 // Basic operations without SIMD equivalent

+// ~x
 #define mm512_not( x )       _mm512_xor_si512( x, m512_neg1 )
+
+// -x
 #define mm512_negate_64( x ) _mm512_sub_epi64( m512_zero, x )
 #define mm512_negate_32( x ) _mm512_sub_epi32( m512_zero, x )  
 #define mm512_negate_16( x ) _mm512_sub_epi16( m512_zero, x )  
@@ -375,10 +378,10 @@ static inline void memcpy_512( __m512i *dst, const __m512i *src, const int n )

 // Generic for odd rotations
 #define mm512_ror_x64( v, n )      _mm512_alignr_epi64( v, v, n )
-#define mm512_rol_x64( v, n )      _mm512_alignr_epi64( v, v, 8-n )
+#define mm512_rol_x64( v, n )      _mm512_alignr_epi64( v, v, 8-(n) )

 #define mm512_ror_x32( v, n )      _mm512_alignr_epi32( v, v, n )
-#define mm512_rol_x32( v, n )      _mm512_alignr_epi32( v, v, 16-n )
+#define mm512_rol_x32( v, n )      _mm512_alignr_epi32( v, v, 16-(n) )

 #define mm512_ror_1x16( v ) \
   _mm512_permutexvar_epi16( m512_const_64( \
@@ -443,20 +446,13 @@ static inline void memcpy_512( __m512i *dst, const __m512i *src, const int n )
 //
 // Rotate elements within 256 bit lanes of 512 bit vector.

-// Rename these for consistency. Element size is always last.
-// mm<vectorsize>_<op><lanesize>_<elementsize>
-
-
 // Swap hi & lo 128 bits in each 256 bit lane
-
 #define mm512_swap256_128( v )   _mm512_permutex_epi64( v, 0x4e )

 // Rotate 256 bit lanes by one 64 bit element
-
 #define mm512_ror256_64( v )   _mm512_permutex_epi64( v, 0x39 )
 #define mm512_rol256_64( v )   _mm512_permutex_epi64( v, 0x93 )

-
 // Rotate 256 bit lanes by one 32 bit element

 #define mm512_ror256_32( v ) \
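
Note on the 8-(n) and 16-(n) changes in mm512_rol_x64 and mm512_rol_x32: without parentheses around the macro argument, an expression argument is spliced into the subtraction and changes its meaning. A minimal sketch, using hypothetical macro names that only mimic the index arithmetic and do not touch any intrinsic:

```c
#include <assert.h>

/* Hypothetical stand-ins for the rotate index computation. */
#define ROL_INDEX_UNSAFE( n )  ( 8-n )     /* argument not parenthesized */
#define ROL_INDEX_SAFE( n )    ( 8-(n) )   /* argument parenthesized */

/* With n = 1+2 the unsafe form expands to 8-1+2, the safe form to 8-(1+2). */
int unsafe_index( void ) { return ROL_INDEX_UNSAFE( 1+2 ); }
int safe_index( void )   { return ROL_INDEX_SAFE( 1+2 ); }

/* Both forms agree when the argument is a single literal. */
int literal_indices_match( void )
{
   int same = ( ROL_INDEX_UNSAFE( 3 ) == ROL_INDEX_SAFE( 3 ) );
   assert( same );
   return same;
}
```

Callers that always pass a literal were never affected, which is probably why the unparenthesized form went unnoticed.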
--- a/sysinfos.c
+++ b/sysinfos.c
@@ -1,4 +1,4 @@
-#if !defined(SYSINJFOS_C___)
+#if !defined(SYSINFOS_C__)
 #define SYSINFOS_C__

 /**
@@ -331,16 +331,20 @@ static inline void cpu_getmodelid(char *outbuf, size_t maxsz)
 // Feature flags

 // CPU_INFO ECX
-#define XSAVE_Flag    (1<<26) 
-#define OSXSAVE_Flag  (1<<27)
-#define AVX_Flag     (1<<28)
+#define SSE3_Flag      1    
+#define SSSE3_Flag    (1<< 9)
 #define XOP_Flag      (1<<11)
 #define FMA3_Flag     (1<<12)
 #define AES_Flag      (1<<25)
+#define SSE41_Flag    (1<<19)
 #define SSE42_Flag    (1<<20)
+#define AES_Flag      (1<<25)
+#define XSAVE_Flag    (1<<26) 
+#define OSXSAVE_Flag  (1<<27)
+#define AVX_Flag      (1<<28)

 // CPU_INFO EDX
-#define SSE_Flag      (1<<25) // EDX
+#define SSE_Flag      (1<<25)
 #define SSE2_Flag     (1<<26) 

 // EXTENDED_FEATURES EBX
@@ -359,8 +363,8 @@ static inline void cpu_getmodelid(char *outbuf, size_t maxsz)

 // Use this to detect presence of feature
 #define AVX_mask     (AVX_Flag|XSAVE_Flag|OSXSAVE_Flag)
-#define FMA3_mask     (FMA3_Flag|AVX_mask)
-#define AVX512_mask   (AVX512VL_Flag|AVX512BW_Flag|AVX512DQ_Flag|AVX512F_Flag)
+#define FMA3_mask    (FMA3_Flag|AVX_mask)
+#define AVX512_mask  (AVX512VL_Flag|AVX512BW_Flag|AVX512DQ_Flag|AVX512F_Flag)

 static inline bool has_sha()
 {
@@ -476,6 +480,17 @@ static inline bool has_avx512()
 #endif
 }

+// AMD Zen3 added support for 256 bit VAES without requiring AVX512.
+// The original Intel spec requires AVX512F to support 512 bit VAES and 
+// requires AVX512VL to support 256 bit VAES.
+// The CPUID VAES bit alone can't distinguish 256 vs 512 bit.
+// If necessary:
+// VAES 256 & 512 = VAES && AVX512VL
+// VAES 512 = VAES && AVX512F  
+// VAES 256 = ( VAES && AVX512VL ) || ( VAES && !AVX512F )
+// VAES 512 only = VAES && AVX512F && !AVX512VL
+// VAES 256 only = VAES && !AVX512F
+
 static inline bool has_vaes()
 {
 #ifdef __arm__
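
Note on the VAES comment above: the listed rules can be written as two small predicates. This is a sketch of the logic in the comment only; the function names are hypothetical, and the three inputs are assumed to come from whatever CPUID checks the caller already performs (for example has_vaes() and the AVX512F / AVX512VL flag tests).

```c
#include <assert.h>
#include <stdbool.h>

/* VAES 512 = VAES && AVX512F */
bool vaes512_usable( bool vaes, bool avx512f )
{
   return vaes && avx512f;
}

/* VAES 256 = ( VAES && AVX512VL ) || ( VAES && !AVX512F ) */
bool vaes256_usable( bool vaes, bool avx512f, bool avx512vl )
{
   /* AVX512VL is an extension of AVX512F, so VL without F is not expected. */
   assert( avx512f || !avx512vl );
   return vaes && ( avx512vl || !avx512f );
}
```

Under these rules a Zen3-style CPU (VAES set, no AVX512) gets 256 bit VAES only, while a CPU with VAES, AVX512F and AVX512VL gets both widths.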
--- a/util.c
+++ b/util.c
@@ -81,6 +81,15 @@ struct thread_q {
 	pthread_cond_t		cond;
 };

+bool is_power_of_2( int n ) 
+{ 
+  while ( n > 1 ) 
+  { 
+      if ( n % 2 != 0 ) return false; 
+      n = n / 2; 
+  } 
+  return true; 
+} 

 void applog2( int prio, const char *fmt, ... )
 {
@@ -609,6 +618,8 @@ json_t *json_rpc_call(CURL *curl, const char *url,
 		goto err_out;
 	}

+// want_stratum is useless, and so is this code it seems. Nothing in
+// hi appears to be set.   
 	/* If X-Stratum was found, activate Stratum */
 	if (want_stratum && hi.stratum_url &&
 	    !strncasecmp(hi.stratum_url, "stratum+tcp://", 14)) {
@@ -747,6 +758,19 @@ err_out:
 	return cfg;
 }

+// Segwit BEGIN
+void memrev(unsigned char *p, size_t len)
+{
+   unsigned char c, *q;
+   for (q = p + len - 1; p < q; p++, q--) {
+      c = *p;
+      *p = *q;
+      *q = c;
+   }
+}
+// Segwit END
+
+
 void cbin2hex(char *out, const char *in, size_t len)
 {
   if (out) {
@@ -1072,9 +1096,10 @@ bool fulltest( const uint32_t *hash, const uint32_t *target )
 // increases the effective precision. Due to the floating nature of the 
 // decimal point leading zeros aren't counted.
 //
-// Unfortunately I can't get float128 to work so long double it is.
+// Unfortunately I can't get float128 to work so long double (float80) is
+// as precise as it gets.
 // All calculations will be done using long double then converted to double.
-// This prevent introducing significant new error while taking advantage
+// This prevents introducing significant new error while taking advantage
 // of HW rounding.

 #if defined(GCC_INT128)
@@ -1083,7 +1108,8 @@ void diff_to_hash( uint32_t *target, const double diff )
 {
  uint128_t *targ = (uint128_t*)target;
  register long double m = 1. / diff;
-  targ[0] = 0;
+//  targ[0] = 0;
+  targ[0] = -1;
  targ[1] = (uint128_t)( m * exp96 );
 }

@@ -1111,7 +1137,8 @@ void diff_to_hash( uint32_t *target, const double diff )
 {
  uint64_t *targ = (uint64_t*)target;
  register long double m = ( 1. / diff ) * exp32;
-  targ[1] = targ[0] = 0;
+//  targ[1] = targ[0] = 0;
+  targ[1] = targ[0] = -1;
  targ[3] = (uint64_t)m;
  targ[2] = (uint64_t)( ( m - (long double)targ[3] ) * exp64 );
 }
@@ -1458,9 +1485,12 @@ static bool stratum_parse_extranonce(struct stratum_ctx *sctx, json_t *params, i
 	sctx->xnonce2_size = xn2_size;
 	pthread_mutex_unlock(&sctx->work_lock);

-        if (pndx == 0 && opt_debug) /* pool dynamic change */
-		applog(LOG_DEBUG, "Stratum set nonce %s with extranonce2 size=%d",
-			xnonce1, xn2_size);
+   if ( !opt_quiet ) /* pool dynamic change */
+      applog( LOG_INFO, "Stratum extranonce1= %s, extranonce2 size= %d",
+         xnonce1, xn2_size);
+//   if (pndx == 0 && opt_debug)
+//		applog(LOG_DEBUG, "Stratum set nonce %s with extranonce2 size=%d",
+//			xnonce1, xn2_size);

 	return true;
 out:
@@ -1554,8 +1584,6 @@ out:
 	return ret;
 }

-extern bool opt_extranonce;
-
 bool stratum_authorize(struct stratum_ctx *sctx, const char *user, const char *pass)
 {
 	json_t *val = NULL, *res_val, *err_val;
--- a/winbuild-cross.sh
+++ b/winbuild-cross.sh
@@ -40,52 +40,67 @@ cp $LOCAL_LIB/curl/lib/.libs/libcurl-4.dll release/

 # Start building...

+# Icelake AVX512 SHA VAES
 ./clean-all.sh || echo clean
 rm -f config.status
 ./autogen.sh || echo done
 CFLAGS="-O3 -march=icelake-client -Wall" ./configure $CONFIGURE_ARGS
-make -j 16
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-avx512-sha-vaes.exe

+# Zen1 AVX2 SHA
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -march=znver1 -Wall" ./configure $CONFIGURE_ARGS
-make -j 16
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-zen.exe

+# Zen3 AVX2 SHA VAES
+make clean || echo clean
+rm -f config.status
+CFLAGS="-O3 -march=znver2 -mvaes -Wall" ./configure $CONFIGURE_ARGS
+# CFLAGS="-O3 -march=znver3 -Wall" ./configure $CONFIGURE_ARGS
+make -j 8
+strip -s cpuminer.exe
+mv cpuminer.exe release/cpuminer-zen3.exe
+
+# Skylake-X AVX512 AES
 # mingw won't compile avx512 without -fno-asynchronous-unwind-tables
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -march=skylake-avx512 -Wall" ./configure $CONFIGURE_ARGS
 #CFLAGS="-O3 -march=skylake-avx512 -Wall -fno-asynchronous-unwind-tables" ./configure $CONFIGURE_ARGS
-make -j 16
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-avx512.exe

+# Haswell AVX2 AES
 make clean || echo clean
 rm -f config.status
 # GCC 9 doesn't include AES in -march=core-avx2
 CFLAGS="-O3 -march=core-avx2 -maes -Wall" ./configure $CONFIGURE_ARGS
-make -j 16
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-avx2.exe

+# Sandybridge AVX AES
 make clean || echo clean
 rm -f config.status
 # -march=corei7-avx still includes aes, but just in case
 CFLAGS="-O3 -march=corei7-avx -maes -Wall" ./configure $CONFIGURE_ARGS 
-make -j 16
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-avx.exe

+# Westmere SSE4.2 AES
 # -march=westmere is supported in gcc5
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -march=westmere -Wall" ./configure $CONFIGURE_ARGS
 #CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure $CONFIGURE_ARGS
-make -j 16
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-aes-sse42.exe

@@ -104,10 +119,11 @@ mv cpuminer.exe release/cpuminer-aes-sse42.exe
 #mv cpuminer.exe release/cpuminer-ssse3.exe
 #make clean || echo clean

+# Generic SSE2
 make clean || echo clean
 rm -f config.status
 CFLAGS="-O3 -msse2 -Wall" ./configure $CONFIGURE_ARGS
-make -j 16
+make -j 8
 strip -s cpuminer.exe
 mv cpuminer.exe release/cpuminer-sse2.exe
 make clean || echo clean
Author	SHA1	Message	Date
Jay D Dee	45ecd0de14	v3.15.2	2020-11-15 17:57:06 -05:00
Jay D Dee	4fa8fcea8b	v3.15.1	2020-11-09 13:19:05 -05:00
Jay D Dee	c85fb3842b	v3.15.0	2020-10-02 10:48:37 -04:00
Jay D Dee	cdd587537e	v3.14.3	2020-06-18 17:30:26 -04:00
Jay D Dee	51a1d91abd	v3.14.2	2020-05-30 21:20:44 -04:00
Jay D Dee	13563e2598	v3.14.1	2020-05-21 13:00:29 -04:00
Jay D Dee	9571f85d53	v3.14.0	2020-05-20 13:56:35 -04:00
Jay D Dee	0e69756634	v3.13.2-segwit-test	2020-05-18 18:17:27 -04:00
Jay D Dee	9653bca1e2	v3.13.1.1	2020-05-17 19:21:37 -04:00