v3.9.4

v3.9.3.1
v3.9.2.5
2025-09-17 23:44:27 +00:00 · 2019-06-18 13:15:45 -04:00 · 2019-06-13 21:15:58 -04:00 · 2019-06-13 11:20:27 -04:00 · 2019-06-07 23:30:38 -04:00 · 2019-06-05 12:20:04 -04:00
319 changed files with 25018 additions and 11013 deletions
--- a/4
+++ b/4
@@ -29,3 +29,7 @@ Wolf0
 Optiminer

 Jay D Dee
+
+xcouiz@gmail.com
+
+Cryply
--- a/123
+++ b/123
@@ -0,0 +1,123 @@
+
+
+Requirements:
+
+Intel Core2 or newer, or AMD Steamroller or newer CPU. ARM CPUs are not
+supported.
+64 bit Linux operating system. Apple is not supported.
+
+Prerequisites for building on Linux:
+
+It is assumed users know how to install packages on their system and
+are able to compile standard source packages. This is basic Linux and
+beyond the scope of cpuminer-opt. Regardless, compiling is trivial if you
+follow the instructions.
+
+Make sure you have the basic development packages installed.
+Here is a good start:
+
+http://askubuntu.com/questions/457526/how-to-install-cpuminer-in-ubuntu
+
+Install any additional dependencies needed by cpuminer-opt. The list below
+contains some that may not be in the default install and need to be
+installed manually. There may be others; read the error messages, they
+will give a clue as to the missing package.
+
+The following command should install everything you need on Debian based
+distributions such as Ubuntu:
+
+sudo apt-get install build-essential libssl-dev libcurl4-openssl-dev libjansson-dev libgmp-dev automake zlib1g-dev
+
+build-essential  (Development Tools package group on Fedora)
+automake
+libjansson-dev
+libgmp-dev
+libcurl4-openssl-dev
+libssl-dev
+lib-thread
+zlib1g-dev
+
+SHA support on AMD Ryzen CPUs requires gcc version 5 or higher and
+openssl 1.1.0e or higher. Add one of the following, depending on the
+compiler version, to CFLAGS:
+"-march=native" or "-march=znver1" or "-msha".
+
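Which of these flags works depends on the installed compiler, so one way to decide is to probe it. The following is a minimal sketch, not part of cpuminer-opt: `first_accepted_flag` is a hypothetical helper that returns the first flag a given compiler accepts. It assumes the compiler takes gcc-style options and reads C source from stdin. Acceptance by the compiler does not prove the CPU supports the feature.

```shell
# Hypothetical helper: print the first flag the compiler accepts.
# Usage: first_accepted_flag COMPILER FLAG...
first_accepted_flag() {
    cc="$1"; shift
    for flag in "$@"; do
        # Compile a trivial program; success means the flag is understood.
        if echo 'int main(void){return 0;}' |
            "$cc" "$flag" -x c -c -o /dev/null - 2>/dev/null; then
            printf '%s\n' "$flag"
            return 0
        fi
    done
    return 1
}

# Example (on a Ryzen host):
#   SHA_FLAG=$(first_accepted_flag gcc -march=znver1 -msha)
#   CFLAGS="-O3 $SHA_FLAG -Wall" ./configure --with-curl
```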
+Additional instructions for static compilation can be found here:
+https://lxadm.com/Static_compilation_of_cpuminer
+Static builds should only be considered in a homogeneous HW and SW environment.
+Local builds will always have the best performance and compatibility.
+
+Extract cpuminer source.
+
+tar xvzf cpuminer-opt-x.y.z.tar.gz
+cd cpuminer-opt-x.y.z
+
+Run ./build.sh to build on Linux or execute the following commands.
+
+./autogen.sh
+CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
+make
+
+Start mining.
+
+./cpuminer -a algo -o url -u username -p password
+
+Windows
+
+Precompiled Windows binaries are built on a Linux host using Mingw
+with a more recent compiler than the following Windows hosted procedure.
+
+Building on Windows prerequisites:
+
+msys
+mingw_w64
+Visual C++ redistributable 2008 X64
+openssl
+
+Install msys and mingw_w64, only needed once.
+
+Unpack msys into C:\msys or your preferred directory.
+
+Install mingw_w64 from win-builds.
+Follow instructions, check "msys or cygwin" and "x86_64" and accept default
+existing msys installation.
+
+Open a msys shell by double clicking on msys.bat.
+Note that msys shell uses Linux syntax for file specifications, "C:\" is
+mounted at "/c/".
+
+Add mingw bin directory to PATH variable
+PATH="/c/msys/opt/windows_64/bin/:$PATH"
+
+Installation complete, compile cpuminer-opt.
+
+Unpack cpuminer-opt source files using tar from msys shell, or using 7zip
+or similar Windows program.
+
+In msys shell cd to miner directory.
+cd /c/path/to/cpuminer-opt
+
+Run build.sh to build on Windows or execute the following commands.
+
+./autogen.sh
+CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
+make
+
+Start mining
+
+cpuminer.exe -a algo -o url -u user -p password
+
+The following tips may be useful for older AMD CPUs.
+
+AMD CPUs older than Steamroller, including Athlon x2 and Phenom II x4, are
+not supported by cpuminer-opt due to an incompatible implementation of SSE2
+on these CPUs. Some algos may crash the miner with an invalid instruction.
+Users are recommended to use an unoptimized miner such as cpuminer-multi.
+
+Some users with AMD CPUs without AES_NI have reported problems compiling
+with build.sh or "-march=native". Problems have included compile errors
+and poor performance. These users are recommended to compile manually
+specifying "-march=btver1" on the configure command line.
+
+Support for even older x86_64 without AES_NI or SSE2 is not available.
+
--- a/173
+++ b/173
@@ -0,0 +1,173 @@
+Instructions for compiling cpuminer-opt for Windows.
+
+
+Windows compilation using Visual Studio is not supported. Mingw64 is
+used on a Linux system (bare metal or virtual machine) to cross-compile
+cpuminer-opt executable binaries for Windows.
+
+These instructions were written for Debian and Ubuntu compatible distributions
+but should work on other major distributions as well. However some of the
+package names or file paths may be different.
+
+It is assumed a Linux system is already available and running. And the user
+has enough Linux knowledge to find and install packages and follow these
+instructions.
+
+First it is a good idea to create a new user specifically for cross compiling.
+It keeps all mingw stuff contained and isolated from the rest of the system.
+
+Step by step...
+
+1. Install necessary packages from the distribution's repositories.
+
+Refer to Linux compile instructions and install required packages.
+
+Additionally, install mingw-w64.
+
+sudo apt-get install mingw-w64
+
+
+2. Create a local library directory for packages to be compiled in the next
+   step. Recommended location is $HOME/usr/lib/
+
+
+3. Download and build other packages for mingw that don't have a mingw64
+   version available in the repositories.
+
+Download the following source code packages from their respective and
+respected download locations, copy them to ~/usr/lib/ and uncompress them. 
+
+openssl
+curl
+gmp
+
+In most cases the latest version is ok but it's safest to download
+the same major and minor version as included in your distribution.
+
+Run the following commands or follow the supplied instructions.
+Do not run "make install" unless you are using ~/usr/lib, which isn't
+recommended.
+
+Some instructions insist on running "make check". If make check fails
+it may still work, YMMV.
+
+You can speed up "make" by using all CPU cores available with "-j n" where
+n is the number of CPU threads you want to use.
+
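As an illustration of choosing n, the sketch below derives the job count from the number of available CPU threads. `make_jobs` is a hypothetical helper name; it assumes `nproc` from GNU coreutils and falls back to a single job when that command is unavailable.

```shell
# Hypothetical helper: print a job count suitable for "make -j".
make_jobs() {
    # nproc reports the number of processing units available.
    n=$(nproc 2>/dev/null) || n=1
    printf '%s\n' "${n:-1}"
}

# Example:
#   make -j "$(make_jobs)"
```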
+openssl:
+
+./Configure mingw64 shared --cross-compile-prefix=x86_64-w64-mingw32
+make
+
+curl:
+
+./configure --with-winssl --with-winidn --host=x86_64-w64-mingw32
+make
+
+gmp:
+
+./configure --host=x86_64-w64-mingw32 
+make
+
+
+
+4. Tweak the environment.
+
+This step is required every time you log in, or the commands can be added to
+.bashrc.
+
+Define some local variables to point to local library. 
+
+export LOCAL_LIB="$HOME/usr/lib"
+
+export LDFLAGS="-L$LOCAL_LIB/curl/lib/.libs -L$LOCAL_LIB/gmp/.libs -L$LOCAL_LIB/openssl"
+
+export CONFIGURE_ARGS="--with-curl=$LOCAL_LIB/curl --with-crypto=$LOCAL_LIB/openssl --host=x86_64-w64-mingw32"
+
+Create a release directory and copy some dll files previously built.
+This can be done outside of cpuminer-opt and only needs to be done once.
+If the release directory is in cpuminer-opt directory it needs to be
+recreated every time a source package is decompressed.
+
+mkdir release
+cp /usr/x86_64-w64-mingw32/lib/zlib1.dll release/
+cp /usr/x86_64-w64-mingw32/lib/libwinpthread-1.dll release/
+cp /usr/lib/gcc/x86_64-w64-mingw32/7.3-win32/libstdc++-6.dll release/
+cp /usr/lib/gcc/x86_64-w64-mingw32/7.3-win32/libgcc_s_seh-1.dll release/
+cp $LOCAL_LIB/openssl/libcrypto-1_1-x64.dll release/
+cp $LOCAL_LIB/curl/lib/.libs/libcurl-4.dll release/
+
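The gcc directory in the paths above (7.3-win32) depends on the installed mingw version, so some of these files may live elsewhere on a given system. A minimal sketch of a copy step that reports missing files instead of stopping at the first error follows. `copy_release_files` is a hypothetical helper, not part of cpuminer-opt.

```shell
# Hypothetical helper: copy files into a release directory and
# report any that are missing. Returns 1 if at least one is missing.
# Usage: copy_release_files DEST FILE...
copy_release_files() {
    dest="$1"; shift
    missing=0
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp "$f" "$dest"/
        else
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   copy_release_files release \
#       /usr/x86_64-w64-mingw32/lib/zlib1.dll \
#       "$LOCAL_LIB/curl/lib/.libs/libcurl-4.dll"
```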
+
+
+The following steps need to be done every time a new source package is
+opened.
+
+5. Download cpuminer-opt
+
+Download the latest source code package of cpuminer-opt to your desired
+location. .zip or .tar.gz, your choice.
+
+https://github.com/JayDDee/cpuminer-opt/releases
+
+Decompress and change to the cpuminer-opt directory.
+
+
+
+6. Prepare to compile
+
+Create a link to the locally compiled version of gmp.h
+
+ln -s $LOCAL_LIB/gmp-version/gmp.h ./gmp.h
+
+Edit configure.ac to fix the libpthread library name.
+
+sed -i 's/"-lpthread"/"-lpthreadGC2"/g' configure.ac
+
+
+7. Compile
+
+You can use the default compile if you intend to use cpuminer-opt on the
+same CPU and the virtual machine supports that architecture.
+
+./build.sh
+
+Otherwise you can compile manually while setting options in CFLAGS.
+
+Some common options:
+
+To compile for a specific CPU architecture:
+
+CFLAGS="-O3 -march=znver1 -Wall" ./configure --with-curl
+
+This will compile for AMD Ryzen.
+
+You can compile more generically for a set of specific CPU features
+if you know what features you want:
+
+CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure --with-curl
+
+This will compile for an older CPU that does not have AVX.
+
+You can find several examples in build-allarch.sh
+
+If you have a CPU with more than 64 threads and Windows 7 or higher you
+can enable the CPU Groups feature:
+
+-D_WIN32_WINNT=0x0601
+
+Once you have run configure successfully, run make with n CPU threads:
+
+make -j n
+
+Copy cpuminer.exe to the release directory, compress and copy the release
+directory to a Windows system and run cpuminer.exe from the command line.
+
+Run cpuminer
+
+In a command window, change directories to the unzipped release folder.
+To get a list of all options:
+
+cpuminer.exe --help
+
+Command options are specific to where you mine. Refer to the pool's
+instructions on how to set them.
--- a/Makefile.am
+++ b/Makefile.am
@@ -31,14 +31,22 @@ cpuminer_SOURCES = \
  crypto/hash.c \
  crypto/aesb.c \
  crypto/magimath.cpp \
-  algo/argon2/argon2a.c \
-  algo/argon2/ar2/argon2.c \
-  algo/argon2/ar2/opt.c \
-  algo/argon2/ar2/cores.c \
-  algo/argon2/ar2/ar2-scrypt-jane.c \
-  algo/argon2/ar2/blake2b.c \
+  algo/argon2/argon2a/argon2a.c \
+  algo/argon2/argon2a/ar2/argon2.c \
+  algo/argon2/argon2a/ar2/opt.c \
+  algo/argon2/argon2a/ar2/cores.c \
+  algo/argon2/argon2a/ar2/ar2-scrypt-jane.c \
+  algo/argon2/argon2a/ar2/blake2b.c \
+  algo/argon2/argon2d/argon2d-gate.c \
+  algo/argon2/argon2d/blake2/blake2b.c \
+  algo/argon2/argon2d/argon2d/argon2.c \
+  algo/argon2/argon2d/argon2d/core.c \
+  algo/argon2/argon2d/argon2d/opt.c \
+  algo/argon2/argon2d/argon2d/argon2d_thread.c \
+  algo/argon2/argon2d/argon2d/encoding.c \
  algo/blake/sph_blake.c \
-  algo/blake/blake-hash-4way.c \
+  algo/blake/blake256-hash-4way.c \
+  algo/blake/blake512-hash-4way.c \
  algo/blake/blake-gate.c \
  algo/blake/blake.c \
  algo/blake/blake-4way.c \
@@ -60,14 +68,15 @@ cpuminer_SOURCES = \
  algo/blake/pentablake-4way.c \
  algo/blake/pentablake.c \
  algo/bmw/sph_bmw.c \
-  algo/bmw/bmw-hash-4way.c \
+  algo/bmw/bmw256-hash-4way.c \
+  algo/bmw/bmw512-hash-4way.c \
  algo/bmw/bmw256.c \
  algo/cryptonight/cryptolight.c \
  algo/cryptonight/cryptonight-common.c\
  algo/cryptonight/cryptonight-aesni.c\
  algo/cryptonight/cryptonight.c\
  algo/cubehash/sph_cubehash.c \
-  algo/cubehash/sse2/cubehash_sse2.c\
+  algo/cubehash/cubehash_sse2.c\
  algo/cubehash/cube-hash-2way.c \
  algo/echo/sph_echo.c \
  algo/echo/aes_ni/hash.c\
@@ -109,26 +118,29 @@ cpuminer_SOURCES = \
  algo/luffa/luffa-hash-2way.c \
  algo/lyra2/lyra2.c \
  algo/lyra2/sponge.c \
-  algo/lyra2/lyra2rev2-gate.c \
+  algo/lyra2/lyra2-gate.c \
  algo/lyra2/lyra2rev2.c \
  algo/lyra2/lyra2rev2-4way.c \
+  algo/lyra2/lyra2rev3.c \
+  algo/lyra2/lyra2rev3-4way.c \
  algo/lyra2/lyra2re.c \
-  algo/lyra2/lyra2z-gate.c \
  algo/lyra2/lyra2z.c \
  algo/lyra2/lyra2z-4way.c \
  algo/lyra2/lyra2z330.c \
-  algo/lyra2/lyra2h-gate.c \
  algo/lyra2/lyra2h.c \
  algo/lyra2/lyra2h-4way.c \
-  algo/lyra2/allium-gate.c \
  algo/lyra2/allium-4way.c \
  algo/lyra2/allium.c \
+  algo/lyra2/phi2-4way.c \
+  algo/lyra2/phi2.c \
  algo/m7m.c \
  algo/neoscrypt/neoscrypt.c \
  algo/nist5/nist5-gate.c \
  algo/nist5/nist5-4way.c \
  algo/nist5/nist5.c \
  algo/nist5/zr5.c \
+  algo/panama/sph_panama.c \
+  algo/radiogatun/sph_radiogatun.c \
  algo/pluck.c \
  algo/quark/quark-gate.c \
  algo/quark/quark.c \
@@ -136,6 +148,9 @@ cpuminer_SOURCES = \
  algo/quark/anime-gate.c \
  algo/quark/anime.c \
  algo/quark/anime-4way.c \
+  algo/quark/hmq1725-gate.c \
+  algo/quark/hmq1725-4way.c \
+  algo/quark/hmq1725.c \
  algo/qubit/qubit-gate.c \
  algo/qubit/qubit.c \
  algo/qubit/qubit-2way.c \
@@ -152,12 +167,18 @@ cpuminer_SOURCES = \
  algo/sha/sph_sha2.c \
  algo/sha/sph_sha2big.c \
  algo/sha/sha2-hash-4way.c \
+  algo/sha/sha256_hash_11way.c \
  algo/sha/sha2.c \
+  algo/sha/sha256t-gate.c \
+  algo/sha/sha256t-4way.c \
  algo/sha/sha256t.c \
+  algo/sha/sha256q-4way.c \
+  algo/sha/sha256q.c \
  algo/shabal/sph_shabal.c \
  algo/shabal/shabal-hash-4way.c \
  algo/shavite/sph_shavite.c \
  algo/shavite/sph-shavite-aesni.c \
+  algo/shavite/shavite-hash-2way.c \
  algo/shavite/shavite.c \
  algo/simd/sph_simd.c \
  algo/simd/nist.c \
@@ -231,19 +252,25 @@ cpuminer_SOURCES = \
  algo/x15/x15-gate.c \
  algo/x15/x15.c \
  algo/x15/x15-4way.c \
+  algo/x16/x16r-gate.c \
+  algo/x16/x16r.c \
+  algo/x16/x16r-4way.c \
  algo/x17/x17-gate.c \
  algo/x17/x17.c \
  algo/x17/x17-4way.c \
  algo/x17/xevan-gate.c \
  algo/x17/xevan.c \
  algo/x17/xevan-4way.c \
-  algo/x17/x16r-gate.c \
-  algo/x17/x16r.c \
-  algo/x17/x16r-4way.c \
-  algo/x17/hmq1725.c \
+  algo/x17/sonoa-gate.c \
+  algo/x17/sonoa-4way.c \
+  algo/x17/sonoa.c \
+  algo/x20/x20r.c \
  algo/yescrypt/yescrypt.c \
  algo/yescrypt/sha256_Y.c \
-  algo/yescrypt/yescrypt-best.c
+  algo/yescrypt/yescrypt-best.c \
+  algo/yespower/yespower.c \
+  algo/yespower/sha256_p.c \
+  algo/yespower/yespower-opt.c

 disable_flags =

--- a/README.md
+++ b/README.md
@@ -7,11 +7,17 @@ All of the code is believed to be open and free. If anyone has a
 claim to any of it post your case in the cpuminer-opt Bitcoin Talk forum
 or by email.

+Miner programs are often flagged as malware by antivirus programs. This is
+a false positive; they are flagged simply because they are cryptocurrency
+miners. The source code is open for anyone to inspect. If you don't trust 
+the software, don't use it.
+
 https://bitcointalk.org/index.php?topic=1326803.0

 mailto://jayddee246@gmail.com

-See file RELEASE_NOTES for change log and compile instructions.
+See file RELEASE_NOTES for change log and INSTALL_LINUX or INSTALL_WINDOWS
+for compile instructions.

 Requirements
 ------------
@@ -40,83 +46,97 @@ MacOS, OSx and Android are not supported.
 Supported Algorithms
 --------------------

-                          allium       Garlicoin
-                          anime        Animecoin
-                          argon2
-                          axiom        Shabal-256 MemoHash
+                          allium        Garlicoin
+                          anime         Animecoin
+                          argon2        Argon2 coin (AR2)
+                          argon2d250    argon2d-crds, Credits (CRDS)
+                          argon2d500    argon2d-dyn, Dynamic (DYN)
+                          argon2d4096   argon2d-uis, Unitus (UIS)
+                          axiom         Shabal-256 MemoHash
                          bastion
-                          blake        Blake-256 (SFR)
-                          blakecoin    blake256r8
-                          blake2s      Blake-2 S
-                          bmw          BMW 256
-                          c11          Chaincoin
-                          cryptolight  Cryptonight-light
-                          cryptonight  cryptonote, Monero (XMR)
+                          blake         Blake-256 (SFR)
+                          blakecoin     blake256r8
+                          blake2s       Blake-2 S
+                          bmw           BMW 256
+                          c11           Chaincoin
+                          cryptolight   Cryptonight-light
+                          cryptonight  
+                          cryptonightv7 Monero (XMR)
                          decred
-                          deep         Deepcoin (DCN)
-                          dmd-gr       Diamond-Groestl
-                          drop         Dropcoin
-                          fresh        Fresh
-                          groestl      Groestl coin
-                          heavy        Heavy
-                          hmq1725      Espers
-                          hodl         Hodlcoin
-                          jha          Jackpotcoin
-                          keccak       Maxcoin
-                          keccakc      Creative coin
-                          lbry         LBC, LBRY Credits
-                          luffa        Luffa
-                          lyra2h       Hppcoin
-                          lyra2re      lyra2
-                          lyra2rev2    lyra2v2, Vertcoin
-                          lyra2z       Zcoin (XZC)
-                          lyra2z330    Lyra2 330 rows, Zoin (ZOI)
-                          m7m          Magi (XMG)
-                          myr-gr       Myriad-Groestl
-                          neoscrypt    NeoScrypt(128, 2, 1)
-                          nist5        Nist5
-                          pentablake   Pentablake
-                          phi1612      phi, LUX coin
-                          pluck        Pluck:128 (Supcoin)
-                          polytimos    Ninja
-                          quark        Quark
-                          qubit        Qubit
-                          scrypt       scrypt(1024, 1, 1) (default)
-                          scrypt:N     scrypt(N, 1, 1)
+                          deep          Deepcoin (DCN)
+                          dmd-gr        Diamond-Groestl
+                          drop          Dropcoin
+                          fresh         Fresh
+                          groestl       Groestl coin
+                          heavy         Heavy
+                          hmq1725       Espers
+                          hodl          Hodlcoin
+                          jha           Jackpotcoin
+                          keccak        Maxcoin
+                          keccakc       Creative coin
+                          lbry          LBC, LBRY Credits
+                          luffa         Luffa
+                          lyra2h        Hppcoin
+                          lyra2re       lyra2
+                          lyra2rev2     lyra2v2, Vertcoin
+                          lyra2rev3     lyra2v3, Vertcoin
+                          lyra2z        Zcoin (XZC)
+                          lyra2z330     Lyra2 330 rows, Zoin (ZOI)
+                          m7m           Magi (XMG)
+                          myr-gr        Myriad-Groestl
+                          neoscrypt     NeoScrypt(128, 2, 1)
+                          nist5         Nist5
+                          pentablake    Pentablake
+                          phi1612       phi, LUX coin (original algo)
+                          phi2          LUX coin (new algo)
+                          pluck         Pluck:128 (Supcoin)
+                          polytimos     Ninja
+                          quark         Quark
+                          qubit         Qubit
+                          scrypt        scrypt(1024, 1, 1) (default)
+                          scrypt:N      scrypt(N, 1, 1)
                          scryptjane:nf
-                          sha256d      Double SHA-256
-                          sha256t      Triple SHA-256, Onecoin (OC)
-                          shavite3     Shavite3
-                          skein        Skein+Sha (Skeincoin)
-                          skein2       Double Skein (Woodcoin)
-                          skunk        Signatum (SIGT)
-                          timetravel   Machinecoin (MAC)
-                          timetravel10 Bitcore
-                          tribus       Denarius (DNR)
-                          vanilla      blake256r8vnl (VCash)
-                          veltor       (VLT)
+                          sha256d       Double SHA-256
+                          sha256t       Triple SHA-256, Onecoin (OC)
+                          shavite3      Shavite3
+                          skein         Skein+Sha (Skeincoin)
+                          skein2        Double Skein (Woodcoin)
+                          skunk         Signatum (SIGT)
+                          sonoa         Sono
+                          timetravel    Machinecoin (MAC)
+                          timetravel10  Bitcore
+                          tribus        Denarius (DNR)
+                          vanilla       blake256r8vnl (VCash)
+                          veltor        (VLT)
                          whirlpool
                          whirlpoolx
-                          x11          Dash
-                          x11evo       Revolvercoin
-                          x11gost      sib (SibCoin)
-                          x12          Galaxie Cash (GCH)
-                          x13          X13
-                          x13sm3       hsr (Hshare)
-                          x14          X14
-                          x15          X15
-                          x16r         Ravencoin
+                          x11           Dash
+                          x11evo        Revolvercoin
+                          x11gost       sib (SibCoin)
+                          x12           Galaxie Cash (GCH)
+                          x13           X13
+                          x13sm3        hsr (Hshare)
+                          x14           X14
+                          x15           X15
+                          x16r          Ravencoin (RVN)
+                          x16s          pigeoncoin (PGN)
                          x17
-                          xevan        Bitsend
-                          yescrypt     Globalboost-Y (BSTY)
-                          yescryptr8   BitZeny (ZNY)
-                          yescryptr16  Yenten (YTN)
-                          yescryptr32  WAVI
-                          zr5          Ziftr
+                          xevan         Bitsend (BSD)
+                          yescrypt      Globalboost-Y (BSTY)
+                          yescryptr8    BitZeny (ZNY)
+                          yescryptr16   Eli
+                          yescryptr32   WAVI
+                          yespower      Cryply
+                          yespowerr16   Yenten (YTN)
+                          zr5           Ziftr

 Errata
 ------

+Cryptonight and variants are no longer supported, use another miner.
+
+Neoscrypt crashes on Windows, use legacy version.
+
 AMD CPUs older than Piledriver, including Athlon x2 and Phenom II x4, are not
 supported by cpuminer-opt due to an incompatible implementation of SSE2 on
 these CPUs. Some algos may crash the miner with an invalid instruction.
--- a/README.txt
+++ b/README.txt
@@ -4,33 +4,35 @@ for Linux and Windows can be found in RELEASE_NOTES.
 cpuminer is a console program that is executed from a DOS command prompt.
 There is no GUI and no mouse support.

+Miner programs are often flagged as malware by antivirus programs. This is
+a false positive; they are flagged simply because they are cryptocurrency
+miners. The source code is open for anyone to inspect. If you don't trust
+the software, don't use it.
+
 Choose the exe that best matches your CPU's features or use trial and
 error to find the fastest one that doesn't crash. Pay attention to
 the features listed at cpuminer startup to ensure you are mining at
-optimum speed using all the available features.
+optimum speed using the best available features.

 Architecture names and compile options used are only provided for Intel
-Core series. Pentium and Celeron often have fewer features.
+Core series. Even the newest Pentium and Celeron CPUs are often missing
+features.

 AMD CPUs older than Piledriver, including Athlon x2 and Phenom II x4, are not
 supported by cpuminer-opt due to an incompatible implementation of SSE2 on
 these CPUs. Some algos may crash the miner with an invalid instruction.
 Users are recommended to use an unoptimized miner such as cpuminer-multi.

-Exe name                Compile flags              Arch name
+Exe name                Compile flags            Arch name

-cpuminer-sse2.exe      "-march=core2"              Core2, Nehalem   
-cpuminer-aes-sse42.exe "-maes -msse4.2"            Westmere
-cpuminer-aes-avx.exe   "-march=corei7-avx"         Sandybridge, Ivybridge
-cpuminer-avx2.exe      "-march=core-avx2"          Haswell...
-cpuminer-avx2-sha.exe  "-march=core-avx2 -msha"    Ryzen
+cpuminer-sse2.exe      "-msse2"                  Core2, Nehalem   
+cpuminer-aes-sse42.exe "-march=westmere"         Westmere
+cpuminer-avx.exe       "-march=corei7-avx"       Sandy-Ivybridge
+cpuminer-avx2.exe      "-march=core-avx2"        Haswell, Sky-Kaby-Coffeelake
+cpuminer-zen           "-march=znver1"           AMD Ryzen, Threadripper
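
The table can be read from the bottom up: pick the most capable build whose features the CPU has. The sketch below is only a rough illustration of that ordering. `has_feature` and `suggest_exe` are hypothetical helpers, the feature names follow Linux /proc/cpuinfo spelling (sse4_2, sha_ni) as an assumption, and mapping SHA plus AVX2 to cpuminer-zen is a simplification. Trial and error, as described above, remains the final check.

```shell
# Hypothetical helper: succeed if word $2 appears in the
# space-separated feature list $1.
has_feature() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print a candidate exe name for a feature list,
# checking the most capable build first.
suggest_exe() {
    feats="$1"
    if has_feature "$feats" avx2 && has_feature "$feats" sha_ni; then
        echo "cpuminer-zen"
    elif has_feature "$feats" avx2; then
        echo "cpuminer-avx2.exe"
    elif has_feature "$feats" avx && has_feature "$feats" aes; then
        echo "cpuminer-avx.exe"
    elif has_feature "$feats" aes && has_feature "$feats" sse4_2; then
        echo "cpuminer-aes-sse42.exe"
    elif has_feature "$feats" sse2; then
        echo "cpuminer-sse2.exe"
    else
        return 1
    fi
}

# Example:
#   suggest_exe "sse2 sse4_2 aes avx avx2"
```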

 If you like this software feel free to donate:

 BTC: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
-ETH: 0x72122edabcae9d3f57eab0729305a425f6fef6d0
-LTC: LdUwoHJnux9r9EKqFWNvAi45kQompHk6e8
-BCH: 1QKYkB6atn4P7RFozyziAXLEnurwnUM1cQ
-BTG: GVUyECtRHeC5D58z9F3nGGfVQndwnsPnHQ


--- a/304
+++ b/304
@@ -1,11 +1,11 @@
-puminer-opt now supports HW SHA acceleration available on AMD Ryzen CPUs.
+cpuminer-opt is a console program run from the command line using the
+keyboard, not the mouse.
+
+cpuminer-opt now supports HW SHA acceleration available on AMD Ryzen CPUs.
 This feature requires recent SW including GCC version 5 or higher and
 openssl version 1.1 or higher. It may also require using "-march=znver1"
 compile flag.

-HW SHA support is only available when compiled from source, Windows binaries
-are not yet available.
-
 cpuminer-opt is a console program, if you're using a mouse you're doing it
 wrong.

@@ -13,11 +13,11 @@ Security warning
 ----------------

 Miner programs are often flagged as malware by antivirus programs. This is
-a false positive, they are flagged simply because they are miners. The source
-code is open for anyone to inspect. If you don't trust the software, don't use
-it.
+a false positive; they are flagged simply because they are cryptocurrency
+miners. The source code is open for anyone to inspect. If you don't trust 
+the software, don't use it.

-The cryptographic code has been taken from trusted sources but has been
+The cryptographic hashing code has been taken from trusted sources but has been
 modified for speed at the expense of accepted security practices. This
 code should not be imported into applications where secure cryptography is
 required.
@@ -25,141 +25,174 @@ required.
 Compile Instructions
 --------------------

-Requirements:
+See INSTALL_LINUX or INSTALL_WINDOWS for compile instructions.
+
+Requirements
+------------

 Intel Core2 or newer, or AMD Steamroller or newer CPU. ARM CPUs are not
 supported.
-64 bit Linux or Windows operating system. Apple is not supported.
-
-Building on linux prerequisites:
-
-It is assumed users know how to install packages on their system and
-be able to compile standard source packages. This is basic Linux and
-beyond the scope of cpuminer-opt.
-
-Make sure you have the basic development packages installed.
-Here is a good start:
-
-http://askubuntu.com/questions/457526/how-to-install-cpuminer-in-ubuntu
-
-Install any additional dependencies needed by cpuminer-opt. The list below
-are some of the ones that may not be in the default install and need to
-be installed manually. There may be others, read the error messages they
-will give a clue as to the missing package.
-
-The following command should install everything you need on Debian based
-distributions such as Ubuntu:
-
-sudo apt-get install build-essential libssl-dev libcurl4-openssl-dev libjansson-dev libgmp-dev automake
-
-
-build-essential  (for Ubuntu, Development Tools package group on Fedora)
-automake
-libjansson-dev
-libgmp-dev
-libcurl4-openssl-dev
-libssl-dev
-pthreads
-zlib
-
-SHA support on AMD Ryzen CPUs requires gcc version 5 or higher and openssl 1.1
-or higher. Reports of improved performiance on Ryzen when using openssl 1.0.2
-have been due to AVX and AVX2 optimizations added to that version.
-Additional improvements are expected on Ryzen with openssl 1.1.
-"-march-znver1" or "-msha".
-
-Additional instructions for static compilalation can be found here:
-https://lxadm.com/Static_compilation_of_cpuminer
-Static builds should only considered in a homogeneous HW and SW environment.
-Local builds will always have the best performance and compatibility.
-
-Extract cpuminer source.
-
-tar xvzf cpuminer-opt-x.y.z.tar.gz
-cd cpuminer-opt-x.y.z
-
-Run ./build.sh to build on Linux or execute the following commands.
-
-./autogen.sh
-CFLAGS="-O3 -march=native -Wall" CXXFLAGS="$CFLAGS -std=gnu++11" ./configure --with-curl
-make
-
-Additional optional compile flags, add the following to CFLAGS to activate:
-
-DUSE_SPH_SHA
-
-SPH may give slightly better performance on algos that use sha256 when using
-openssl 1.0.1 or older. Openssl 1.0.2 adds AVX2 and 1.1 adds SHA and perform
-better than SPH. This option is ignored when 4-way is used, even for CPUs
-with SHA.
-
-Start mining.
-
-./cpuminer -a algo -o url -u username -p password
-
-Windows
-
-Precompiled Windows binaries are built on a Linux host using Mingw
-with a more recent compiler than the following Windows hosted procedure.
-
-Building on Windows prerequisites:
-
-msys
-mingw_w64
-Visual C++ redistributable 2008 X64
-openssl
-
-Install msys and mingw_w64, only needed once.
-
-Unpack msys into C:\msys or your preferred directory.
-
-Install mingw_w64 from win-builds.
-Follow instructions, check "msys or cygwin" and "x86_64" and accept default
-existing msys instalation.
-
-Open a msys shell by double clicking on msys.bat.
-Note that msys shell uses linux syntax for file specifications, "C:\" is
-mounted at "/c/".
-
-Add mingw bin directory to PATH variable
-PATH="/c/msys/opt/windows_64/bin/:$PATH"
-
-Instalation complete, compile cpuminer-opt.
-
-Unpack cpuminer-opt source files using tar from msys shell, or using 7zip
-or similar Windows program.
-
-In msys shell cd to miner directory.
-cd /c/path/to/cpuminer-opt
-
-Run build.sh to build on Windows or execute the following commands.
-
-./autogen.sh
-CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
-make
-
-Start mining
-
-cpuminer.exe -a algo -o url -u user -p password
-
-The following tips may be useful for older AMD CPUs.
-
-AMD CPUs older than Steamroller, including Athlon x2 and Phenom II x4, are
-not supported by cpuminer-opt due to an incompatible implementation of SSE2
-on these CPUs. Some algos may crash the miner with an invalid instruction.
-Users are recommended to use an unoptimized miner such as cpuminer-multi.
-
-Some users with AMD CPUs without AES_NI have reported problems compiling
-with build.sh or "-march=native". Problems have included compile errors
-and poor performance. These users are recommended to compile manually
-specifying "-march=btver1" on the configure command line.
-
-Support for even older x86_64 without AES_NI or SSE2 is not availble.

+64 bit Linux or Windows operating system. Apple and Android are not supported.

 Change Log
 ----------

+v3.9.4
+
+Faster AVX2 for lyra2v3, quark, anime.
+Fixed skein AVX2 regression (invalid shares since v3.9.0), also faster.
+Faster skein2 with 4way AVX2 enabled.
+Automatic SHA override on Ryzen CPUs, no need for -DRYZEN compile flag.
+Ongoing restructuring.
+
+v3.9.3.1
+
+Skipped v3.9.3 due to misidentification of v3.9.2.5 as v3.9.3.
+Fixed x16r algo 25% invalid share reject rate. The bug may have also
+affected other algos.
+
+v3.9.2.5
+
+Fixed 2 regressions: hodl AES detection, x16r invalid shares with AVX2.
+More restructuring.
+
+v3.9.2.4
+
+Yet another affinity fix. Hopefully the last one.
+
+v3.9.2.3
+
+Another cpu-affinity fix.
+Disabled test code that fails to compile on some CPUs with limited
+AVX512 capabilities.
+
+v3.9.2.2
+
+Fixed some day one cpu-affinity issues.
+
+v3.9.2
+
+Added sha256q algo.
+Yespower now uses openssl SHA256, but no observable hash rate increase
+on Ryzen.
+Ongoing rearchitecting.
+Lyra2z now hashes 8-way on CPUs with AVX2.
+Lyra2 (all including phi2) now runs optimized code with SSE2.
+
+v3.9.1.1
+
+Fixed lyra2v3 AVX and below.
+
+Compiling on Windows using Cygwin now works. Simply use "./build.sh"
+just like on Linux. It isn't portable therefore the binaries package will
+continue to use the existing procedure.
+The Cygwin procedure will be documented in more detail later and will
+include a list of packages that need to be installed.
+
+v3.9.1
+
+Fixed AVX2 version of anime algo.
+
+Added sonoa algo.
+
+Added "-DRYZEN_" compile option for Ryzen to override 4-way hashing when algo
+contains sha256 and use SHA instead. This is due to the introduction
+of HW SHA support combined with the poor performance
+of AVX2 on Ryzen. The Windows binaries package replaces cpuminer-avx2-sha
+with cpuminer-zen compiled with the override. Refer to the build instructions
+for more information.
+
+Ongoing restructuring to streamline the process, reduce latency,
+reduce memory usage and unnecessary copying of data. Most of these
+will not result in a noticeably higher reported hashrate as the
+change simply reduces the time wasted that wasn't factored into the
+hash rate reported by the miner. In short, less dead time resulting in
+a higher net hashrate.
+
+One of these measures to reduce latency also results in an enhanced
+share submission message including the share number*, the CPU thread,
+and the vector lane that found the solution. The time difference between
+the share submission and acceptance (or rejection) response indicates
+network latency. One other effect of this change is a reduction in hash
+meter messages because the scan function no longer exits when a share is
+found. Scan cycles will go longer and submit multiple shares per cycle.
+*the share number is anticipated and includes both accepted and rejected
+shares. Because the share is anticipated and not synchronized it may be
+incorrect in times of very rapid share submission. Under most conditions
+it should be easy to match the submission with the corresponding response.
+
+Removed "-DUSE_SPH_SHA" option, all users should have a recent version of
+openssl installed: v1.0.2 (Ubuntu 16.04) or better. Ryzen SHA requires
+v1.1.0 or better. Ryzen SHA is not used when hashing multi-way parallel.
+Ryzen SHA is available in the Windows binaries release package.
+
+Improved compile instructions, now in separate files: INSTALL_LINUX and
+INSTALL_WINDOWS. The Windows instructions are used to build the binaries
+release package. It's built on a Linux system either running as a virtual
+machine or a separate computer. At this time there is no known way to
+build natively on a Windows system.
+
+v3.9.0.1
+
+Isolate Windows CPU groups code when CPU groups support not explicitly defined.
+
+v3.9.0
+
+Added support for Windows CPU groups.
+Fixed BIP34 coinbase height.
+Prep work for AVX512.
+Added lyra2rev3 for the vertcoin algo change.
+Added yespower, yespowerr16 (Yenten)
+Added phi2 algo for LUX
+Discontinued support for cryptonight and variants.
+
+v3.8.8.1
+
+Fixed x16r.
+Removed cryptonight variant check due to false positives.
+API displays hashrate before shares are submitted.
+
+v3.8.8
+
+Added cryptonightv7 for Monero.
+
+v3.8.7.2
+
+Fixed argon2d-dyn regression in v3.8.7.1.
+Changed compile options for aes-sse42 Windows build to -march=westmere
+
+v3.8.7.1
+
+Fixed argon2d-uis low difficulty rejects.
+Fixed argon2d aliases.
+
+v3.8.7
+
+Added argon2d4096 (alias argon2d-uis) for Unitus (UIS).
+argon2d-crds and argon2d-dyn renamed to argon2d250 and argon2d500 respectively.
+  The old names are recognized as aliases.
+AVX512 is now supported for argon2d algos, Linux only.
+AVX is no longer a reported feature and an AVX Windows binary is no longer
+  provided. Use AES-SSE42 build instead.
+
+v3.8.6.1
+
+Faster argon2d* AVX2.
+Untested AVX-512 for argon2d*, YMMV.
+
+v3.8.6
+
+Fixed argon2 regression in v3.8.5.
+Added x16s algo for Pigeoncoin.
+Some code cleanup.
+
+v3.8.5
+
+Added argon2d-crds and argon2d-dyn algos.
+sha256t 8 way AVX2 & 4 way SSE4.2 optimized.
+CPUs with SSE4.2 get optimizations previously reserved for AVX.
+
 v3.8.4.1

 Fixed sha256t low difficulty rejects.
@@ -296,6 +329,7 @@ Changed default sha256 and sha512 to openssl. This should be used when
 compiling with openssl 1.0.2 or higher (Ubuntu 16.04).
 This should increase the hashrate for yescrypt, yescryptr16, m7m, xevan, skein,
 myr-gr & others  when openssl 1.0.2 is installed.
+Note: -DUSE_SPH_SHA has been removed in v3.9.1.
 Users with openssl 1.0.1 (Ubuntu 14.04) may get better perforance by adding
 "-DUSE_SPH_SHA" to CLAGS. 
 Windows binaries are compiled with -DUSE_SPH_SHA and won't get the speedup.
--- a/aclocal.m4
+++ b/aclocal.m4
@@ -1,6 +1,6 @@
-# generated automatically by aclocal 1.14.1 -*- Autoconf -*-
+# generated automatically by aclocal 1.15.1 -*- Autoconf -*-

-# Copyright (C) 1996-2013 Free Software Foundation, Inc.
+# Copyright (C) 1996-2017 Free Software Foundation, Inc.

 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -20,7 +20,7 @@ You have another version of autoconf.  It may work, but is not guaranteed to.
 If you have problems, you may need to regenerate the build system entirely.
 To do so, use the procedure documented by the package, typically 'autoreconf'.])])

-# Copyright (C) 2002-2013 Free Software Foundation, Inc.
+# Copyright (C) 2002-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -32,10 +32,10 @@ To do so, use the procedure documented by the package, typically 'autoreconf'.])
 # generated from the m4 files accompanying Automake X.Y.
 # (This private macro should not be called outside this file.)
 AC_DEFUN([AM_AUTOMAKE_VERSION],
-[am__api_version='1.14'
+[am__api_version='1.15'
 dnl Some users find AM_AUTOMAKE_VERSION and mistake it for a way to
 dnl require some minimum version.  Point them to the right macro.
-m4_if([$1], [1.14.1], [],
+m4_if([$1], [1.15.1], [],
      [AC_FATAL([Do not call $0, use AM_INIT_AUTOMAKE([$1]).])])dnl
 ])

@@ -51,14 +51,14 @@ m4_define([_AM_AUTOCONF_VERSION], [])
 # Call AM_AUTOMAKE_VERSION and AM_AUTOMAKE_VERSION so they can be traced.
 # This function is AC_REQUIREd by AM_INIT_AUTOMAKE.
 AC_DEFUN([AM_SET_CURRENT_AUTOMAKE_VERSION],
-[AM_AUTOMAKE_VERSION([1.14.1])dnl
+[AM_AUTOMAKE_VERSION([1.15.1])dnl
 m4_ifndef([AC_AUTOCONF_VERSION],
  [m4_copy([m4_PACKAGE_VERSION], [AC_AUTOCONF_VERSION])])dnl
 _AM_AUTOCONF_VERSION(m4_defn([AC_AUTOCONF_VERSION]))])

 # Figure out how to run the assembler.                      -*- Autoconf -*-

-# Copyright (C) 2001-2013 Free Software Foundation, Inc.
+# Copyright (C) 2001-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -78,7 +78,7 @@ _AM_IF_OPTION([no-dependencies],, [_AM_DEPENDENCIES([CCAS])])dnl

 # AM_AUX_DIR_EXPAND                                         -*- Autoconf -*-

-# Copyright (C) 2001-2013 Free Software Foundation, Inc.
+# Copyright (C) 2001-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -123,15 +123,14 @@ _AM_IF_OPTION([no-dependencies],, [_AM_DEPENDENCIES([CCAS])])dnl
 # configured tree to be moved without reconfiguration.

 AC_DEFUN([AM_AUX_DIR_EXPAND],
-[dnl Rely on autoconf to set up CDPATH properly.
-AC_PREREQ([2.50])dnl
-# expand $ac_aux_dir to an absolute path
-am_aux_dir=`cd $ac_aux_dir && pwd`
+[AC_REQUIRE([AC_CONFIG_AUX_DIR_DEFAULT])dnl
+# Expand $ac_aux_dir to an absolute path.
+am_aux_dir=`cd "$ac_aux_dir" && pwd`
 ])

 # AM_CONDITIONAL                                            -*- Autoconf -*-

-# Copyright (C) 1997-2013 Free Software Foundation, Inc.
+# Copyright (C) 1997-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -162,7 +161,7 @@ AC_CONFIG_COMMANDS_PRE(
 Usually this means the macro was only invoked conditionally.]])
 fi])])

-# Copyright (C) 1999-2013 Free Software Foundation, Inc.
+# Copyright (C) 1999-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -353,7 +352,7 @@ _AM_SUBST_NOTMAKE([am__nodep])dnl

 # Generate code to set up dependency tracking.              -*- Autoconf -*-

-# Copyright (C) 1999-2013 Free Software Foundation, Inc.
+# Copyright (C) 1999-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -429,7 +428,7 @@ AC_DEFUN([AM_OUTPUT_DEPENDENCY_COMMANDS],

 # Do all the work for Automake.                             -*- Autoconf -*-

-# Copyright (C) 1996-2013 Free Software Foundation, Inc.
+# Copyright (C) 1996-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -519,8 +518,8 @@ AC_REQUIRE([AC_PROG_MKDIR_P])dnl
 # <http://lists.gnu.org/archive/html/automake/2012-07/msg00001.html>
 # <http://lists.gnu.org/archive/html/automake/2012-07/msg00014.html>
 AC_SUBST([mkdir_p], ['$(MKDIR_P)'])
-# We need awk for the "check" target.  The system "awk" is bad on
-# some platforms.
+# We need awk for the "check" target (and possibly the TAP driver).  The
+# system "awk" is bad on some platforms.
 AC_REQUIRE([AC_PROG_AWK])dnl
 AC_REQUIRE([AC_PROG_MAKE_SET])dnl
 AC_REQUIRE([AM_SET_LEADING_DOT])dnl
@@ -593,7 +592,11 @@ to "yes", and re-run configure.
 END
    AC_MSG_ERROR([Your 'rm' program is bad, sorry.])
  fi
-fi])
+fi
+dnl The trailing newline in this macro's definition is deliberate, for
+dnl backward compatibility and to allow trailing 'dnl'-style comments
+dnl after the AM_INIT_AUTOMAKE invocation. See automake bug#16841.
+])

 dnl Hook into '_AC_COMPILER_EXEEXT' early to learn its expansion.  Do not
 dnl add the conditional right here, as _AC_COMPILER_EXEEXT may be further
@@ -622,7 +625,7 @@ for _am_header in $config_headers :; do
 done
 echo "timestamp for $_am_arg" >`AS_DIRNAME(["$_am_arg"])`/stamp-h[]$_am_stamp_count])

-# Copyright (C) 2001-2013 Free Software Foundation, Inc.
+# Copyright (C) 2001-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -633,7 +636,7 @@ echo "timestamp for $_am_arg" >`AS_DIRNAME(["$_am_arg"])`/stamp-h[]$_am_stamp_co
 # Define $install_sh.
 AC_DEFUN([AM_PROG_INSTALL_SH],
 [AC_REQUIRE([AM_AUX_DIR_EXPAND])dnl
-if test x"${install_sh}" != xset; then
+if test x"${install_sh+set}" != xset; then
  case $am_aux_dir in
  *\ * | *\	*)
    install_sh="\${SHELL} '$am_aux_dir/install-sh'" ;;
@@ -643,7 +646,7 @@ if test x"${install_sh}" != xset; then
 fi
 AC_SUBST([install_sh])])

-# Copyright (C) 2003-2013 Free Software Foundation, Inc.
+# Copyright (C) 2003-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -665,7 +668,7 @@ AC_SUBST([am__leading_dot])])
 # Add --enable-maintainer-mode option to configure.         -*- Autoconf -*-
 # From Jim Meyering

-# Copyright (C) 1996-2013 Free Software Foundation, Inc.
+# Copyright (C) 1996-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -700,7 +703,7 @@ AC_MSG_CHECKING([whether to enable maintainer-specific portions of Makefiles])

 # Check to see how 'make' treats includes.	            -*- Autoconf -*-

-# Copyright (C) 2001-2013 Free Software Foundation, Inc.
+# Copyright (C) 2001-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -750,7 +753,7 @@ rm -f confinc confmf

 # Fake the existence of programs that GNU maintainers use.  -*- Autoconf -*-

-# Copyright (C) 1997-2013 Free Software Foundation, Inc.
+# Copyright (C) 1997-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -789,7 +792,7 @@ fi

 # Helper functions for option handling.                     -*- Autoconf -*-

-# Copyright (C) 2001-2013 Free Software Foundation, Inc.
+# Copyright (C) 2001-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -818,7 +821,7 @@ AC_DEFUN([_AM_SET_OPTIONS],
 AC_DEFUN([_AM_IF_OPTION],
 [m4_ifset(_AM_MANGLE_OPTION([$1]), [$2], [$3])])

-# Copyright (C) 1999-2013 Free Software Foundation, Inc.
+# Copyright (C) 1999-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -865,7 +868,7 @@ AC_LANG_POP([C])])
 # For backward compatibility.
 AC_DEFUN_ONCE([AM_PROG_CC_C_O], [AC_REQUIRE([AC_PROG_CC])])

-# Copyright (C) 2001-2013 Free Software Foundation, Inc.
+# Copyright (C) 2001-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -884,7 +887,7 @@ AC_DEFUN([AM_RUN_LOG],

 # Check to make sure that the build environment is sane.    -*- Autoconf -*-

-# Copyright (C) 1996-2013 Free Software Foundation, Inc.
+# Copyright (C) 1996-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -965,7 +968,7 @@ AC_CONFIG_COMMANDS_PRE(
 rm -f conftest.file
 ])

-# Copyright (C) 2009-2013 Free Software Foundation, Inc.
+# Copyright (C) 2009-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -1025,7 +1028,7 @@ AC_SUBST([AM_BACKSLASH])dnl
 _AM_SUBST_NOTMAKE([AM_BACKSLASH])dnl
 ])

-# Copyright (C) 2001-2013 Free Software Foundation, Inc.
+# Copyright (C) 2001-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -1053,7 +1056,7 @@ fi
 INSTALL_STRIP_PROGRAM="\$(install_sh) -c -s"
 AC_SUBST([INSTALL_STRIP_PROGRAM])])

-# Copyright (C) 2006-2013 Free Software Foundation, Inc.
+# Copyright (C) 2006-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
@@ -1072,7 +1075,7 @@ AC_DEFUN([AM_SUBST_NOTMAKE], [_AM_SUBST_NOTMAKE($@)])

 # Check how to create a tarball.                            -*- Autoconf -*-

-# Copyright (C) 2004-2013 Free Software Foundation, Inc.
+# Copyright (C) 2004-2017 Free Software Foundation, Inc.
 #
 # This file is free software; the Free Software Foundation
 # gives unlimited permission to copy and/or distribute it,
--- a/algo-gate-api.c
+++ b/algo-gate-api.c
@@ -69,6 +69,8 @@ void do_nothing   () {}
 bool return_true  () { return true;  }
 bool return_false () { return false; }
 void *return_null () { return NULL;  }
+void call_error   () { printf("ERR: Uninitialized function pointer\n"); }
+

 void algo_not_tested()
 {
@@ -113,7 +115,8 @@ void init_algo_gate( algo_gate_t* gate )
   gate->hash_suw                = (void*)&null_hash_suw;
   gate->get_new_work            = (void*)&std_get_new_work;
   gate->get_nonceptr            = (void*)&std_get_nonceptr;
-   gate->display_extra_data      = (void*)&do_nothing;
+   gate->work_decode             = (void*)&std_le_work_decode;
+   gate->decode_extra_data       = (void*)&do_nothing;
   gate->wait_for_diff           = (void*)&std_wait_for_diff;
   gate->get_max64               = (void*)&get_max64_0x1fffffLL;
   gate->gen_merkle_root         = (void*)&sha256d_gen_merkle_root;
@@ -121,7 +124,6 @@ void init_algo_gate( algo_gate_t* gate )
   gate->build_stratum_request   = (void*)&std_le_build_stratum_request;
   gate->malloc_txs_request      = (void*)&std_malloc_txs_request;
   gate->set_target              = (void*)&std_set_target;
-   gate->work_decode             = (void*)&std_le_work_decode;
   gate->submit_getwork_result   = (void*)&std_le_submit_getwork_result;
   gate->build_block_header      = (void*)&std_build_block_header;
   gate->build_extraheader       = (void*)&std_build_extraheader;
@@ -132,11 +134,11 @@ void init_algo_gate( algo_gate_t* gate )
   gate->do_this_thread          = (void*)&return_true;
   gate->longpoll_rpc_call       = (void*)&std_longpoll_rpc_call;
   gate->stratum_handle_response = (void*)&std_stratum_handle_response;
+   gate->get_work_data_size      = (void*)&std_get_work_data_size;
   gate->optimizations           = EMPTY_SET;
   gate->ntime_index             = STD_NTIME_INDEX;
   gate->nbits_index             = STD_NBITS_INDEX;
   gate->nonce_index             = STD_NONCE_INDEX;
-   gate->work_data_size          = STD_WORK_DATA_SIZE;
   gate->work_cmp_size           = STD_WORK_CMP_SIZE;
 }

@@ -157,78 +159,95 @@ bool register_algo_gate( int algo, algo_gate_t *gate )

   switch (algo)
   {
-     case ALGO_ALLIUM:       register_allium_algo      ( gate ); break;
-     case ALGO_ANIME:        register_anime_algo       ( gate ); break;
-     case ALGO_ARGON2:       register_argon2_algo      ( gate ); break;
-     case ALGO_AXIOM:        register_axiom_algo       ( gate ); break;
-     case ALGO_BASTION:      register_bastion_algo     ( gate ); break;
-     case ALGO_BLAKE:        register_blake_algo       ( gate ); break;
-     case ALGO_BLAKECOIN:    register_blakecoin_algo   ( gate ); break;
+     case ALGO_ALLIUM:       register_allium_algo       ( gate ); break;
+     case ALGO_ANIME:        register_anime_algo        ( gate ); break;
+     case ALGO_ARGON2:       register_argon2_algo       ( gate ); break;
+     case ALGO_ARGON2D250:   register_argon2d_crds_algo ( gate ); break;
+     case ALGO_ARGON2D500:   register_argon2d_dyn_algo  ( gate ); break;
+     case ALGO_ARGON2D4096:  register_argon2d4096_algo  ( gate ); break;
+     case ALGO_AXIOM:        register_axiom_algo        ( gate ); break;
+     case ALGO_BASTION:      register_bastion_algo      ( gate ); break;
+     case ALGO_BLAKE:        register_blake_algo        ( gate ); break;
+     case ALGO_BLAKECOIN:    register_blakecoin_algo    ( gate ); break;
 //     case ALGO_BLAKE2B:      register_blake2b_algo    ( gate ); break;
-     case ALGO_BLAKE2S:      register_blake2s_algo     ( gate ); break;
-     case ALGO_C11:          register_c11_algo         ( gate ); break;
-     case ALGO_CRYPTOLIGHT:  register_cryptolight_algo ( gate ); break;
-     case ALGO_CRYPTONIGHT:  register_cryptonight_algo ( gate ); break;
-     case ALGO_DECRED:       register_decred_algo      ( gate ); break;
-     case ALGO_DEEP:         register_deep_algo        ( gate ); break;
-     case ALGO_DMD_GR:       register_dmd_gr_algo      ( gate ); break;
-     case ALGO_DROP:         register_drop_algo        ( gate ); break;
-     case ALGO_FRESH:        register_fresh_algo       ( gate ); break;
-     case ALGO_GROESTL:      register_groestl_algo     ( gate ); break;
-     case ALGO_HEAVY:        register_heavy_algo       ( gate ); break;
-     case ALGO_HMQ1725:      register_hmq1725_algo     ( gate ); break;
-     case ALGO_HODL:         register_hodl_algo        ( gate ); break;
-     case ALGO_JHA:          register_jha_algo         ( gate ); break;
-     case ALGO_KECCAK:       register_keccak_algo      ( gate ); break;
-     case ALGO_KECCAKC:      register_keccakc_algo     ( gate ); break;
-     case ALGO_LBRY:         register_lbry_algo        ( gate ); break;
-     case ALGO_LUFFA:        register_luffa_algo       ( gate ); break;
-     case ALGO_LYRA2H:       register_lyra2h_algo      ( gate ); break;
-     case ALGO_LYRA2RE:      register_lyra2re_algo     ( gate ); break;
-     case ALGO_LYRA2REV2:    register_lyra2rev2_algo   ( gate ); break;
-     case ALGO_LYRA2Z:       register_lyra2z_algo      ( gate ); break;
-     case ALGO_LYRA2Z330:    register_lyra2z330_algo   ( gate ); break;
-     case ALGO_M7M:          register_m7m_algo         ( gate ); break;
-     case ALGO_MYR_GR:       register_myriad_algo      ( gate ); break;
-     case ALGO_NEOSCRYPT:    register_neoscrypt_algo   ( gate ); break;
-     case ALGO_NIST5:        register_nist5_algo       ( gate ); break;
-     case ALGO_PENTABLAKE:   register_pentablake_algo  ( gate ); break;
-     case ALGO_PHI1612:      register_phi1612_algo     ( gate ); break;
-     case ALGO_PLUCK:        register_pluck_algo       ( gate ); break;
-     case ALGO_POLYTIMOS:    register_polytimos_algo   ( gate ); break;
-     case ALGO_QUARK:        register_quark_algo       ( gate ); break;
-     case ALGO_QUBIT:        register_qubit_algo       ( gate ); break;
-     case ALGO_SCRYPT:       register_scrypt_algo      ( gate ); break;
-     case ALGO_SCRYPTJANE:   register_scryptjane_algo  ( gate ); break;
-     case ALGO_SHA256D:      register_sha256d_algo     ( gate ); break;
-     case ALGO_SHA256T:      register_sha256t_algo     ( gate ); break;
-     case ALGO_SHAVITE3:     register_shavite_algo     ( gate ); break;
-     case ALGO_SKEIN:        register_skein_algo       ( gate ); break;
-     case ALGO_SKEIN2:       register_skein2_algo      ( gate ); break;
-     case ALGO_SKUNK:        register_skunk_algo       ( gate ); break;
-     case ALGO_TIMETRAVEL:   register_timetravel_algo  ( gate ); break;
-     case ALGO_TIMETRAVEL10: register_timetravel10_algo( gate ); break;
-     case ALGO_TRIBUS:       register_tribus_algo      ( gate ); break;
-     case ALGO_VANILLA:      register_vanilla_algo     ( gate ); break;
-     case ALGO_VELTOR:       register_veltor_algo      ( gate ); break;
-     case ALGO_WHIRLPOOL:    register_whirlpool_algo   ( gate ); break;
-     case ALGO_WHIRLPOOLX:   register_whirlpoolx_algo  ( gate ); break;
-     case ALGO_X11:          register_x11_algo         ( gate ); break;
-     case ALGO_X11EVO:       register_x11evo_algo      ( gate ); break;
-     case ALGO_X11GOST:      register_x11gost_algo     ( gate ); break;
-     case ALGO_X12:          register_x12_algo         ( gate ); break;
-     case ALGO_X13:          register_x13_algo         ( gate ); break;
-     case ALGO_X13SM3:       register_x13sm3_algo      ( gate ); break;
-     case ALGO_X14:          register_x14_algo         ( gate ); break;
-     case ALGO_X15:          register_x15_algo         ( gate ); break;
-     case ALGO_X16R:         register_x16r_algo        ( gate ); break;
-     case ALGO_X17:          register_x17_algo         ( gate ); break;
-     case ALGO_XEVAN:        register_xevan_algo       ( gate ); break;
-     case ALGO_YESCRYPT:     register_yescrypt_algo    ( gate ); break;
-     case ALGO_YESCRYPTR8:   register_yescryptr8_algo  ( gate ); break;
-     case ALGO_YESCRYPTR16:  register_yescryptr16_algo ( gate ); break;
-     case ALGO_YESCRYPTR32:  register_yescryptr32_algo ( gate ); break;
-     case ALGO_ZR5:          register_zr5_algo         ( gate ); break;
+     case ALGO_BLAKE2S:      register_blake2s_algo      ( gate ); break;
+     case ALGO_C11:          register_c11_algo          ( gate ); break;
+     case ALGO_CRYPTOLIGHT:  register_cryptolight_algo  ( gate ); break;
+     case ALGO_CRYPTONIGHT:  register_cryptonight_algo  ( gate ); break;
+     case ALGO_CRYPTONIGHTV7:register_cryptonightv7_algo( gate ); break;
+     case ALGO_DECRED:       register_decred_algo       ( gate ); break;
+     case ALGO_DEEP:         register_deep_algo         ( gate ); break;
+     case ALGO_DMD_GR:       register_dmd_gr_algo       ( gate ); break;
+     case ALGO_DROP:         register_drop_algo         ( gate ); break;
+     case ALGO_FRESH:        register_fresh_algo        ( gate ); break;
+     case ALGO_GROESTL:      register_groestl_algo      ( gate ); break;
+     case ALGO_HEAVY:        register_heavy_algo        ( gate ); break;
+     case ALGO_HMQ1725:      register_hmq1725_algo      ( gate ); break;
+     case ALGO_HODL:         register_hodl_algo         ( gate ); break;
+     case ALGO_JHA:          register_jha_algo          ( gate ); break;
+     case ALGO_KECCAK:       register_keccak_algo       ( gate ); break;
+     case ALGO_KECCAKC:      register_keccakc_algo      ( gate ); break;
+     case ALGO_LBRY:         register_lbry_algo         ( gate ); break;
+     case ALGO_LUFFA:        register_luffa_algo        ( gate ); break;
+     case ALGO_LYRA2H:       register_lyra2h_algo       ( gate ); break;
+     case ALGO_LYRA2RE:      register_lyra2re_algo      ( gate ); break;
+     case ALGO_LYRA2REV2:    register_lyra2rev2_algo    ( gate ); break;
+     case ALGO_LYRA2REV3:    register_lyra2rev3_algo    ( gate ); break;
+     case ALGO_LYRA2Z:       register_lyra2z_algo       ( gate ); break;
+     case ALGO_LYRA2Z330:    register_lyra2z330_algo    ( gate ); break;
+     case ALGO_M7M:          register_m7m_algo          ( gate ); break;
+     case ALGO_MYR_GR:       register_myriad_algo       ( gate ); break;
+     case ALGO_NEOSCRYPT:    register_neoscrypt_algo    ( gate ); break;
+     case ALGO_NIST5:        register_nist5_algo        ( gate ); break;
+     case ALGO_PENTABLAKE:   register_pentablake_algo   ( gate ); break;
+     case ALGO_PHI1612:      register_phi1612_algo      ( gate ); break;
+     case ALGO_PHI2:         register_phi2_algo         ( gate ); break;
+     case ALGO_PLUCK:        register_pluck_algo        ( gate ); break;
+     case ALGO_POLYTIMOS:    register_polytimos_algo    ( gate ); break;
+     case ALGO_QUARK:        register_quark_algo        ( gate ); break;
+     case ALGO_QUBIT:        register_qubit_algo        ( gate ); break;
+     case ALGO_SCRYPT:       register_scrypt_algo       ( gate ); break;
+     case ALGO_SCRYPTJANE:   register_scryptjane_algo   ( gate ); break;
+     case ALGO_SHA256D:      register_sha256d_algo      ( gate ); break;
+     case ALGO_SHA256T:      register_sha256t_algo      ( gate ); break;
+     case ALGO_SHA256Q:      register_sha256q_algo      ( gate ); break;
+     case ALGO_SHAVITE3:     register_shavite_algo      ( gate ); break;
+     case ALGO_SKEIN:        register_skein_algo        ( gate ); break;
+     case ALGO_SKEIN2:       register_skein2_algo       ( gate ); break;
+     case ALGO_SKUNK:        register_skunk_algo        ( gate ); break;
+     case ALGO_SONOA:        register_sonoa_algo        ( gate ); break;
+     case ALGO_TIMETRAVEL:   register_timetravel_algo   ( gate ); break;
+     case ALGO_TIMETRAVEL10: register_timetravel10_algo ( gate ); break;
+     case ALGO_TRIBUS:       register_tribus_algo       ( gate ); break;
+     case ALGO_VANILLA:      register_vanilla_algo      ( gate ); break;
+     case ALGO_VELTOR:       register_veltor_algo       ( gate ); break;
+     case ALGO_WHIRLPOOL:    register_whirlpool_algo    ( gate ); break;
+     case ALGO_WHIRLPOOLX:   register_whirlpoolx_algo   ( gate ); break;
+     case ALGO_X11:          register_x11_algo          ( gate ); break;
+     case ALGO_X11EVO:       register_x11evo_algo       ( gate ); break;
+     case ALGO_X11GOST:      register_x11gost_algo      ( gate ); break;
+     case ALGO_X12:          register_x12_algo          ( gate ); break;
+     case ALGO_X13:          register_x13_algo          ( gate ); break;
+     case ALGO_X13SM3:       register_x13sm3_algo       ( gate ); break;
+     case ALGO_X14:          register_x14_algo          ( gate ); break;
+     case ALGO_X15:          register_x15_algo          ( gate ); break;
+     case ALGO_X16R:         register_x16r_algo         ( gate ); break;
+     case ALGO_X16S:         register_x16s_algo         ( gate ); break;
+     case ALGO_X17:          register_x17_algo          ( gate ); break;
+     case ALGO_XEVAN:        register_xevan_algo        ( gate ); break;
+/*    case ALGO_YESCRYPT:     register_yescrypt_05_algo     ( gate ); break;
+     case ALGO_YESCRYPTR8:   register_yescryptr8_05_algo   ( gate ); break;
+     case ALGO_YESCRYPTR16:  register_yescryptr16_05_algo  ( gate ); break;
+     case ALGO_YESCRYPTR32:  register_yescryptr32_05_algo  ( gate ); break;
+*/
+     case ALGO_YESCRYPT:     register_yescrypt_algo     ( gate ); break;
+     case ALGO_YESCRYPTR8:   register_yescryptr8_algo   ( gate ); break;
+     case ALGO_YESCRYPTR16:  register_yescryptr16_algo  ( gate ); break;
+     case ALGO_YESCRYPTR32:  register_yescryptr32_algo  ( gate ); break;
+
+     case ALGO_YESPOWER:     register_yespower_algo     ( gate ); break;
+     case ALGO_YESPOWERR16:  register_yespowerr16_algo  ( gate ); break;
+     case ALGO_ZR5:          register_zr5_algo          ( gate ); break;
    default:
        applog(LOG_ERR,"FAIL: algo_gate registration failed, unknown algo %s.\n", algo_names[opt_algo] );
        return false;
@@ -249,6 +268,10 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
 // override std defaults with jr2 defaults
 bool register_json_rpc2( algo_gate_t *gate )
 {
+  applog(LOG_WARNING,"\nCryptonight algorithm and variants are no longer");
+  applog(LOG_WARNING,"supported by cpuminer-opt. Shares submitted will");
+  applog(LOG_WARNING,"likely be rejected. Proceed at your own risk.\n");
+
  gate->wait_for_diff           = (void*)&do_nothing;
  gate->get_new_work            = (void*)&jr2_get_new_work;
  gate->get_nonceptr            = (void*)&jr2_get_nonceptr;
@@ -285,6 +308,9 @@ void exec_hash_function( int algo, void *output, const void *pdata )
 const char* const algo_alias_map[][2] =
 {
 //   alias                proper
+  { "argon2d-crds",      "argon2d250"   },
+  { "argon2d-dyn",       "argon2d500"   },
+  { "argon2d-uis",       "argon2d4096"  },
  { "bitcore",           "timetravel10" },
  { "bitzeny",           "yescryptr8"   },
  { "blake256r8",        "blakecoin"    },
@@ -302,6 +328,7 @@ const char* const algo_alias_map[][2] =
  { "jane",              "scryptjane"   }, 
  { "lyra2",             "lyra2re"      },
  { "lyra2v2",           "lyra2rev2"    },
+  { "lyra2v3",           "lyra2rev3"    },
  { "lyra2zoin",         "lyra2z330"    },
  { "myrgr",             "myr-gr"       },
  { "myriad",            "myr-gr"       },
@@ -318,9 +345,9 @@ const char* const algo_alias_map[][2] =
  { NULL,                NULL           }   
 };

-// if arg is a valid alias for a known algo it is updated with the proper name.
-// No validation of the algo or alias is done, It is the responsinility of the
-// calling function to validate the algo after return.
+// if arg is a valid alias for a known algo it is updated with the proper
+// name. No validation of the algo or alias is done. It is the responsibility
+// of the calling function to validate the algo after return.
 void get_algo_alias( char** algo_or_alias )
 {
  int i;
@@ -333,3 +360,24 @@ void get_algo_alias( char** algo_or_alias )
    }
 }

+#undef ALIAS
+#undef PROPER
+
+// only for parallel when there are lanes.
+bool submit_solution( struct work *work, void *hash,
+                      struct thr_info *thr, int lane )
+{
+     work_set_target_ratio( work, hash );
+     if ( submit_work( thr, work ) )
+     {
+         applog( LOG_NOTICE, "Share %d submitted by thread %d, lane %d.",
+                 accepted_share_count + rejected_share_count + 1,
+                 thr->id, lane );
+         return true;
+     }
+     else
+         applog( LOG_WARNING, "Failed to submit share." );
+     return false;
+}
+
+
--- a/algo-gate-api.h
+++ b/algo-gate-api.h
@@ -2,6 +2,7 @@
 #include <stdbool.h>
 #include <stdint.h>
 #include "miner.h"
+#include "simd-utils.h"

 /////////////////////////////
 ////
@@ -87,10 +88,11 @@ typedef  uint32_t set_t;
 #define EMPTY_SET       0
 #define SSE2_OPT        1
 #define AES_OPT         2  
-#define AVX_OPT         4
-#define AVX2_OPT        8
-#define SHA_OPT      0x10
-//#define FOUR_WAY_OPT 0x20
+#define SSE42_OPT       4
+#define AVX_OPT         8
+#define AVX2_OPT     0x10
+#define SHA_OPT      0x20
+#define AVX512_OPT   0x40

 // return set containing all elements from sets a & b
 inline set_t set_union ( set_t a, set_t b ) { return a | b; }
@@ -106,8 +108,15 @@ inline bool set_excl ( set_t a, set_t b ) { return (a & b) == 0; }

 typedef struct
 {
+// special case, only one target, provides a callback for scanhash to
+// submit work with less overhead.
+// bool (*submit_work )             ( struct thr_info*, const struct work* );
+
 // mandatory functions, must be overwritten
-int ( *scanhash ) ( int, struct work*, uint32_t, uint64_t* );
+// Added a 5th arg for the thr_info structure to replace the int thr id
+// in the first arg. Both will co-exist during the transition.
+//int ( *scanhash ) ( int, struct work*, uint32_t, uint64_t* );
+int ( *scanhash ) ( int, struct work*, uint32_t, uint64_t*, struct thr_info* );

 // optional unsafe, must be overwritten if algo uses function
 void ( *hash )     ( void*, const void*, uint32_t ) ;
@@ -119,7 +128,7 @@ void ( *stratum_gen_work )       ( struct stratum_ctx*, struct work* );
 void ( *get_new_work )           ( struct work*, struct work*, int, uint32_t*,
                                   bool );
 uint32_t *( *get_nonceptr )      ( uint32_t* );
-void ( *display_extra_data )     ( struct work*, uint64_t* );
+void ( *decode_extra_data )      ( struct work*, uint64_t* );
 void ( *wait_for_diff )          ( struct stratum_ctx* );
 int64_t ( *get_max64 )           ();
 bool ( *work_decode )            ( const json_t*, struct work* );
@@ -128,7 +137,7 @@ bool ( *submit_getwork_result )  ( CURL*, struct work* );
 void ( *gen_merkle_root )        ( char*, struct stratum_ctx* );
 void ( *build_extraheader )      ( struct work*, struct stratum_ctx* );
 void ( *build_block_header )     ( struct work*, uint32_t, uint32_t*,
-                                   uint32_t*, uint32_t, uint32_t );
+	                           uint32_t*, uint32_t, uint32_t );
 void ( *build_stratum_request )  ( char*, struct work*, struct stratum_ctx* );
 char* ( *malloc_txs_request )    ( struct work* );
 void ( *set_work_data_endian )   ( struct work* );
@@ -139,10 +148,10 @@ bool ( *do_this_thread )         ( int );
 json_t* (*longpoll_rpc_call)     ( CURL*, int*, char* );
 bool ( *stratum_handle_response )( json_t* );
 set_t optimizations;
+int  ( *get_work_data_size )     ();
 int  ntime_index;
 int  nbits_index;
 int  nonce_index;            // use with caution, see warning below
-int  work_data_size;
 int  work_cmp_size;

 } algo_gate_t;
@@ -185,6 +194,12 @@ void four_way_not_tested();
 // allways returns failure
 int null_scanhash();

+// The one and only callback for scanhash, used to submit a solution.
+bool submit_solution( struct work *work, void *hash,
+                      struct thr_info *thr, int lane );
+
+bool submit_work( struct thr_info *thr, const struct work *work_in );
+
 // displays warning
 void null_hash    ();
 void null_hash_suw();
@@ -239,8 +254,8 @@ void set_work_data_big_endian( struct work *work );
 double std_calc_network_diff( struct work *work );

 void std_build_block_header( struct work* g_work, uint32_t version,
-                             uint32_t *prevhash, uint32_t *merkle_root,
-                             uint32_t ntime, uint32_t nbits );
+	                     uint32_t *prevhash,  uint32_t *merkle_root,
+   	                     uint32_t ntime, uint32_t nbits );

 void std_build_extraheader( struct work *work, struct stratum_ctx *sctx );

@@ -253,6 +268,8 @@ bool jr2_stratum_handle_response( json_t *val );
 bool std_ready_to_mine( struct work* work, struct stratum_ctx* stratum,
                        int thr_id );

+int std_get_work_data_size();
+
 // Gate admin functions

 // Called from main to initialize all gate functions and algo-specific data
--- a/algo/argon2/argon2a/ar2/ar2-scrypt-jane.c
+++ b/algo/argon2/argon2a/ar2/ar2-scrypt-jane.c
--- a/algo/argon2/argon2a/ar2/ar2-scrypt-jane.h
+++ b/algo/argon2/argon2a/ar2/ar2-scrypt-jane.h
--- a/algo/argon2/argon2a/ar2/argon2.c
+++ b/algo/argon2/argon2a/ar2/argon2.c
@@ -99,18 +99,18 @@ static const char *Argon2_ErrorMessage[] = {
 {ARGON2_MISSING_ARGS, */ "Missing arguments", /*},*/
 };

-int argon2d(argon2_context *context) { return argon2_core(context, Argon2_d); }
+int argon2d(argon2_context *context) { return ar2_argon2_core(context, Argon2_d); }

-int argon2i(argon2_context *context) { return argon2_core(context, Argon2_i); }
+int argon2i(argon2_context *context) { return ar2_argon2_core(context, Argon2_i); }

-int verify_d(argon2_context *context, const char *hash)
+int ar2_verify_d(argon2_context *context, const char *hash)
 {
 	int result;
 	/*if (0 == context->outlen || NULL == hash) {
 		return ARGON2_OUT_PTR_MISMATCH;
 	}*/

-	result = argon2_core(context, Argon2_d);
+	result = ar2_argon2_core(context, Argon2_d);

 	if (ARGON2_OK != result) {
 		return result;
@@ -223,7 +223,7 @@ static size_t to_base64(char *dst, size_t dst_len, const void *src)
 * The output length is always exactly 32 bytes.
 */

-int encode_string(char *dst, size_t dst_len, argon2_context *ctx)
+int ar2_encode_string(char *dst, size_t dst_len, argon2_context *ctx)
 {
 #define SS(str)                                                                \
 	do {                                                                       \
--- a/algo/argon2/argon2a/ar2/argon2.h
+++ b/algo/argon2/argon2a/ar2/argon2.h
@@ -255,7 +255,7 @@ int argon2id(argon2_context *context);
 * specified by the context outlen member
 * @return  Zero if successful, a non zero error code otherwise
 */
-int verify_d(argon2_context *context, const char *hash);
+int ar2_verify_d(argon2_context *context, const char *hash);

 /*
 * Get the associated error message for given error code
@@ -283,7 +283,7 @@ const char *error_message(int error_code);
 * The output length is always exactly 32 bytes.
 */

-int encode_string(char *dst, size_t dst_len, argon2_context *ctx);
+int ar2_encode_string(char *dst, size_t dst_len, argon2_context *ctx);

 #if defined(__cplusplus)
 }
--- a/algo/argon2/argon2a/ar2/bench.c
+++ b/algo/argon2/argon2a/ar2/bench.c
--- a/algo/argon2/argon2a/ar2/blake2/blake2-impl.h
+++ b/algo/argon2/argon2a/ar2/blake2/blake2-impl.h
--- a/algo/argon2/argon2a/ar2/blake2/blake2.h
+++ b/algo/argon2/argon2a/ar2/blake2/blake2.h
@@ -52,22 +52,22 @@ enum {
 };

 /* Streaming API */
-int blake2b_init(blake2b_state *S, size_t outlen);
-int blake2b_init_key(blake2b_state *S, size_t outlen, const void *key,
+int ar2_blake2b_init(blake2b_state *S, size_t outlen);
+int ar2_blake2b_init_key(blake2b_state *S, size_t outlen, const void *key,
 					 size_t keylen);
-int blake2b_init_param(blake2b_state *S, const blake2b_param *P);
-int blake2b_update(blake2b_state *S, const void *in, size_t inlen);
+int ar2_blake2b_init_param(blake2b_state *S, const blake2b_param *P);
+int ar2_blake2b_update(blake2b_state *S, const void *in, size_t inlen);
 void my_blake2b_update(blake2b_state *S, const void *in, size_t inlen);
-int blake2b_final(blake2b_state *S, void *out, size_t outlen);
+int ar2_blake2b_final(blake2b_state *S, void *out, size_t outlen);

 /* Simple API */
-int blake2b(void *out, const void *in, const void *key, size_t keylen);
+int ar2_blake2b(void *out, const void *in, const void *key, size_t keylen);

 /* Argon2 Team - Begin Code */
-int blake2b_long(void *out, const void *in);
+int ar2_blake2b_long(void *out, const void *in);
 /* Argon2 Team - End Code */
 /* Miouyouyou */
-void blake2b_too(void *out, const void *in);
+void ar2_blake2b_too(void *out, const void *in);

 #if defined(__cplusplus)
 }
--- a/algo/argon2/argon2a/ar2/blake2/blamka-round-opt.h
+++ b/algo/argon2/argon2a/ar2/blake2/blamka-round-opt.h
--- a/algo/argon2/argon2a/ar2/blake2/blamka-round-ref.h
+++ b/algo/argon2/argon2a/ar2/blake2/blamka-round-ref.h
--- a/algo/argon2/argon2a/ar2/blake2b.c
+++ b/algo/argon2/argon2a/ar2/blake2b.c
@@ -107,7 +107,7 @@ static const blake2b_state miou = {
 };


-int blake2b_init_param(blake2b_state *S, const blake2b_param *P)
+int ar2_blake2b_init_param(blake2b_state *S, const blake2b_param *P)
 {
 	const unsigned char *p = (const unsigned char *)P;
 	unsigned int i;
@@ -133,7 +133,7 @@ void compare_buffs(uint64_t *h, size_t outlen)
 }

 /* Sequential blake2b initialization */
-int blake2b_init(blake2b_state *S, size_t outlen)
+int ar2_blake2b_init(blake2b_state *S, size_t outlen)
 {
 	memcpy(S, &miou, sizeof(*S));
 	S->h[0] += outlen;
@@ -147,7 +147,7 @@ void print64(const char *name, const uint64_t *array, uint16_t size)
 	printf("};\n");
 }

-int blake2b_init_key(blake2b_state *S, size_t outlen, const void *key, size_t keylen)
+int ar2_blake2b_init_key(blake2b_state *S, size_t outlen, const void *key, size_t keylen)
 {
 	return 0;
 }
@@ -207,7 +207,7 @@ static void blake2b_compress(blake2b_state *S, const uint8_t *block)
 #undef ROUND
 }

-int blake2b_update(blake2b_state *S, const void *in, size_t inlen)
+int ar2_blake2b_update(blake2b_state *S, const void *in, size_t inlen)
 {
 	const uint8_t *pin = (const uint8_t *)in;
 	/* Complete current block */
@@ -235,7 +235,7 @@ void my_blake2b_update(blake2b_state *S, const void *in, size_t inlen)
 	S->buflen += (unsigned int)inlen;
 }

-int blake2b_final(blake2b_state *S, void *out, size_t outlen)
+int ar2_blake2b_final(blake2b_state *S, void *out, size_t outlen)
 {
 	uint8_t buffer[BLAKE2B_OUTBYTES] = {0};
 	unsigned int i;
@@ -257,48 +257,48 @@ int blake2b_final(blake2b_state *S, void *out, size_t outlen)
 	return 0;
 }

-int blake2b(void *out, const void *in, const void *key, size_t keylen)
+int ar2_blake2b(void *out, const void *in, const void *key, size_t keylen)
 {
 	blake2b_state S;

-	blake2b_init(&S, 64);
+	ar2_blake2b_init(&S, 64);
 	my_blake2b_update(&S, in, 64);
-	blake2b_final(&S, out, 64);
+	ar2_blake2b_final(&S, out, 64);
 	burn(&S, sizeof(S));
 	return 0;
 }

-void blake2b_too(void *pout, const void *in)
+void ar2_blake2b_too(void *pout, const void *in)
 {
 	uint8_t *out = (uint8_t *)pout;
 	uint8_t out_buffer[64];
 	uint8_t in_buffer[64];

 	blake2b_state blake_state;
-	blake2b_init(&blake_state, 64);
+	ar2_blake2b_init(&blake_state, 64);
 	blake_state.buflen = blake_state.buf[1] = 4;
 	my_blake2b_update(&blake_state, in, 72);
-	blake2b_final(&blake_state, out_buffer, 64);
+	ar2_blake2b_final(&blake_state, out_buffer, 64);
 	memcpy(out, out_buffer, 32);
 	out += 32;

 	register uint8_t i = 29;
 	while (i--) {
 		memcpy(in_buffer, out_buffer, 64);
-		blake2b(out_buffer, in_buffer, NULL, 0);
+		ar2_blake2b(out_buffer, in_buffer, NULL, 0);
 		memcpy(out, out_buffer, 32);
 		out += 32;
 	}

 	memcpy(in_buffer, out_buffer, 64);
-	blake2b(out_buffer, in_buffer, NULL, 0);
+	ar2_blake2b(out_buffer, in_buffer, NULL, 0);
 	memcpy(out, out_buffer, 64);

 	burn(&blake_state, sizeof(blake_state));
 }

 /* Argon2 Team - Begin Code */
-int blake2b_long(void *pout, const void *in)
+int ar2_blake2b_long(void *pout, const void *in)
 {
 	uint8_t *out = (uint8_t *)pout;
 	blake2b_state blake_state;
@@ -306,10 +306,10 @@ int blake2b_long(void *pout, const void *in)

 	store32(outlen_bytes, 32);

-	blake2b_init(&blake_state, 32);
+	ar2_blake2b_init(&blake_state, 32);
 	my_blake2b_update(&blake_state, outlen_bytes, sizeof(outlen_bytes));
-	blake2b_update(&blake_state, in, 1024);
-	blake2b_final(&blake_state, out, 32);
+	ar2_blake2b_update(&blake_state, in, 1024);
+	ar2_blake2b_final(&blake_state, out, 32);
 	burn(&blake_state, sizeof(blake_state));
 	return 0;
 }
--- a/algo/argon2/argon2a/ar2/cores.c
+++ b/algo/argon2/argon2a/ar2/cores.c
@@ -51,15 +51,15 @@
 #endif

 /***************Instance and Position constructors**********/
-void init_block_value(block *b, uint8_t in) { memset(b->v, in, sizeof(b->v)); }
+void ar2_init_block_value(block *b, uint8_t in) { memset(b->v, in, sizeof(b->v)); }
 //inline void init_block_value(block *b, uint8_t in) { memset(b->v, in, sizeof(b->v)); }

-void copy_block(block *dst, const block *src) {
+void ar2_copy_block(block *dst, const block *src) {
 //inline void copy_block(block *dst, const block *src) {
    memcpy(dst->v, src->v, sizeof(uint64_t) * ARGON2_WORDS_IN_BLOCK);
 }

-void xor_block(block *dst, const block *src) {
+void ar2_xor_block(block *dst, const block *src) {
 //inline void xor_block(block *dst, const block *src) {
    int i;
    for (i = 0; i < ARGON2_WORDS_IN_BLOCK; ++i) {
@@ -67,7 +67,7 @@ void xor_block(block *dst, const block *src) {
    }
 }

-static void load_block(block *dst, const void *input) {
+static void ar2_load_block(block *dst, const void *input) {
 //static inline void load_block(block *dst, const void *input) {
    unsigned i;
    for (i = 0; i < ARGON2_WORDS_IN_BLOCK; ++i) {
@@ -75,7 +75,7 @@ static void load_block(block *dst, const void *input) {
    }
 }

-static void store_block(void *output, const block *src) {
+static void ar2_store_block(void *output, const block *src) {
 //static inline void store_block(void *output, const block *src) {
    unsigned i;
    for (i = 0; i < ARGON2_WORDS_IN_BLOCK; ++i) {
@@ -84,7 +84,7 @@ static void store_block(void *output, const block *src) {
 }

 /***************Memory allocators*****************/
-int allocate_memory(block **memory, uint32_t m_cost) {
+int ar2_allocate_memory(block **memory, uint32_t m_cost) {
    if (memory != NULL) {
        size_t memory_size = sizeof(block) * m_cost;
        if (m_cost != 0 &&
@@ -105,34 +105,34 @@ int allocate_memory(block **memory, uint32_t m_cost) {
    }
 }

-void secure_wipe_memory(void *v, size_t n) { memset(v, 0, n); }
+void ar2_secure_wipe_memory(void *v, size_t n) { memset(v, 0, n); }
 //inline void secure_wipe_memory(void *v, size_t n) { memset(v, 0, n); }

 /*********Memory functions*/

-void clear_memory(argon2_instance_t *instance, int clear) {
+void ar2_clear_memory(argon2_instance_t *instance, int clear) {
 //inline void clear_memory(argon2_instance_t *instance, int clear) {
    if (instance->memory != NULL && clear) {
-        secure_wipe_memory(instance->memory,
+        ar2_secure_wipe_memory(instance->memory,
                           sizeof(block) * /*instance->memory_blocks*/16);
    }
 }

-void free_memory(block *memory) { free(memory); }
+void ar2_free_memory(block *memory) { free(memory); }
 //inline void free_memory(block *memory) { free(memory); }

-void finalize(const argon2_context *context, argon2_instance_t *instance) {
+void ar2_finalize(const argon2_context *context, argon2_instance_t *instance) {
    if (context != NULL && instance != NULL) {
        block blockhash;
-        copy_block(&blockhash, instance->memory + 15);
+        ar2_copy_block(&blockhash, instance->memory + 15);

        /* Hash the result */
        {
            uint8_t blockhash_bytes[ARGON2_BLOCK_SIZE];
-            store_block(blockhash_bytes, &blockhash);
-            blake2b_long(context->out, blockhash_bytes);
-            secure_wipe_memory(blockhash.v, ARGON2_BLOCK_SIZE);
-            secure_wipe_memory(blockhash_bytes, ARGON2_BLOCK_SIZE); /* clear blockhash_bytes */
+            ar2_store_block(blockhash_bytes, &blockhash);
+            ar2_blake2b_long(context->out, blockhash_bytes);
+            ar2_secure_wipe_memory(blockhash.v, ARGON2_BLOCK_SIZE);
+            ar2_secure_wipe_memory(blockhash_bytes, ARGON2_BLOCK_SIZE); /* clear blockhash_bytes */
        }

 #ifdef GENKAT
@@ -142,11 +142,11 @@ void finalize(const argon2_context *context, argon2_instance_t *instance) {
        /* Clear memory */
        // clear_memory(instance, 1);

-        free_memory(instance->memory);
+        ar2_free_memory(instance->memory);
    }
 }

-uint32_t index_alpha(const argon2_instance_t *instance,
+uint32_t ar2_index_alpha(const argon2_instance_t *instance,
                     const argon2_position_t *position, uint32_t pseudo_rand,
                     int same_lane) {
    /*
@@ -207,7 +207,7 @@ uint32_t index_alpha(const argon2_instance_t *instance,
    return absolute_position;
 }

-void fill_memory_blocks(argon2_instance_t *instance) {
+void ar2_fill_memory_blocks(argon2_instance_t *instance) {
    uint32_t r, s;

    for (r = 0; r < 2; ++r) {
@@ -218,7 +218,7 @@ void fill_memory_blocks(argon2_instance_t *instance) {
            position.lane = 0;
            position.slice = (uint8_t)s;
            position.index = 0;
-            fill_segment(instance, position);
+            ar2_fill_segment(instance, position);
        }

 #ifdef GENKAT
@@ -227,19 +227,19 @@ void fill_memory_blocks(argon2_instance_t *instance) {
    }
 }

-void fill_first_blocks(uint8_t *blockhash, const argon2_instance_t *instance) {
+void ar2_fill_first_blocks(uint8_t *blockhash, const argon2_instance_t *instance) {
    /* Make the first and second block in each lane as G(H0||i||0) or
       G(H0||i||1) */
    uint8_t blockhash_bytes[ARGON2_BLOCK_SIZE];
    store32(blockhash + ARGON2_PREHASH_DIGEST_LENGTH, 0);
    store32(blockhash + ARGON2_PREHASH_DIGEST_LENGTH + 4, 0);
-    blake2b_too(blockhash_bytes, blockhash);
-    load_block(&instance->memory[0], blockhash_bytes);
+    ar2_blake2b_too(blockhash_bytes, blockhash);
+    ar2_load_block(&instance->memory[0], blockhash_bytes);

    store32(blockhash + ARGON2_PREHASH_DIGEST_LENGTH, 1);
-    blake2b_too(blockhash_bytes, blockhash);
-    load_block(&instance->memory[1], blockhash_bytes);
-    secure_wipe_memory(blockhash_bytes, ARGON2_BLOCK_SIZE);
+    ar2_blake2b_too(blockhash_bytes, blockhash);
+    ar2_load_block(&instance->memory[1], blockhash_bytes);
+    ar2_secure_wipe_memory(blockhash_bytes, ARGON2_BLOCK_SIZE);
 }


@@ -268,7 +268,7 @@ static const blake2b_state base_hash = {
 #define SALTLEN 32
 #define SECRETLEN 0
 #define ADLEN 0
-void initial_hash(uint8_t *blockhash, argon2_context *context,
+void ar2_initial_hash(uint8_t *blockhash, argon2_context *context,
                  argon2_type type) {

    uint8_t value[sizeof(uint32_t)];
@@ -280,7 +280,7 @@ void initial_hash(uint8_t *blockhash, argon2_context *context,
                   PWDLEN);


-    secure_wipe_memory(context->pwd, PWDLEN);
+    ar2_secure_wipe_memory(context->pwd, PWDLEN);
    context->pwdlen = 0;

    store32(&value, SALTLEN);
@@ -295,22 +295,22 @@ void initial_hash(uint8_t *blockhash, argon2_context *context,
    store32(&value, ADLEN);
    my_blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));

-    blake2b_final(&BlakeHash, blockhash, ARGON2_PREHASH_DIGEST_LENGTH);
+    ar2_blake2b_final(&BlakeHash, blockhash, ARGON2_PREHASH_DIGEST_LENGTH);
 }

-int initialize(argon2_instance_t *instance, argon2_context *context) {
+int ar2_initialize(argon2_instance_t *instance, argon2_context *context) {
    /* 1. Memory allocation */


-    allocate_memory(&(instance->memory), 16);
+    ar2_allocate_memory(&(instance->memory), 16);

    /* 2. Initial hashing */
    /* H_0 + 8 extra bytes to produce the first blocks */
    /* Hashing all inputs */
    uint8_t blockhash[ARGON2_PREHASH_SEED_LENGTH];
-    initial_hash(blockhash, context, instance->type);
+    ar2_initial_hash(blockhash, context, instance->type);
    /* Zeroing 8 extra bytes */
-    secure_wipe_memory(blockhash + ARGON2_PREHASH_DIGEST_LENGTH,
+    ar2_secure_wipe_memory(blockhash + ARGON2_PREHASH_DIGEST_LENGTH,
                       ARGON2_PREHASH_SEED_LENGTH -
                           ARGON2_PREHASH_DIGEST_LENGTH);

@@ -320,14 +320,14 @@ int initialize(argon2_instance_t *instance, argon2_context *context) {

    /* 3. Creating first blocks, we always have at least two blocks in a slice
     */
-    fill_first_blocks(blockhash, instance);
+    ar2_fill_first_blocks(blockhash, instance);
    /* Clearing the hash */
-    secure_wipe_memory(blockhash, ARGON2_PREHASH_SEED_LENGTH);
+    ar2_secure_wipe_memory(blockhash, ARGON2_PREHASH_SEED_LENGTH);

    return ARGON2_OK;
 }

-int argon2_core(argon2_context *context, argon2_type type) {
+int ar2_argon2_core(argon2_context *context, argon2_type type) {
    argon2_instance_t instance;
    instance.memory = NULL;
    instance.type = type;
@@ -336,14 +336,14 @@ int argon2_core(argon2_context *context, argon2_type type) {
     * blocks
     */

-    int result = initialize(&instance, context);
+    int result = ar2_initialize(&instance, context);
    if (ARGON2_OK != result) return result;

    /* 4. Filling memory */
-    fill_memory_blocks(&instance);
+    ar2_fill_memory_blocks(&instance);

    /* 5. Finalization */
-    finalize(context, &instance);
+    ar2_finalize(context, &instance);

    return ARGON2_OK;
 }
--- a/algo/argon2/argon2a/ar2/cores.h
+++ b/algo/argon2/argon2a/ar2/cores.h
@@ -62,13 +62,13 @@ typedef struct _block { uint64_t v[ARGON2_WORDS_IN_BLOCK]; } ALIGN(16) block;
 /*****************Functions that work with the block******************/

 /* Initialize each byte of the block with @in */
-void init_block_value(block *b, uint8_t in);
+void ar2_init_block_value(block *b, uint8_t in);

 /* Copy block @src to block @dst */
-void copy_block(block *dst, const block *src);
+void ar2_copy_block(block *dst, const block *src);

 /* XOR @src onto @dst bytewise */
-void xor_block(block *dst, const block *src);
+void ar2_xor_block(block *dst, const block *src);

 /*
 * Argon2 instance: memory pointer, number of passes, amount of memory, type,
@@ -101,24 +101,24 @@ typedef struct Argon2_position_t {
 * @param m_cost number of blocks to allocate in the memory
 * @return ARGON2_OK if @memory is a valid pointer and memory is allocated
 */
-int allocate_memory(block **memory, uint32_t m_cost);
+int ar2_allocate_memory(block **memory, uint32_t m_cost);

 /* Function that securely cleans the memory
 * @param mem Pointer to the memory
 * @param s Memory size in bytes
 */
-void secure_wipe_memory(void *v, size_t n);
+void ar2_secure_wipe_memory(void *v, size_t n);

 /* Clears memory
 * @param instance pointer to the current instance
 * @param clear_memory indicates if we clear the memory with zeros.
 */
-void clear_memory(argon2_instance_t *instance, int clear);
+void ar2_clear_memory(argon2_instance_t *instance, int clear);

 /* Deallocates memory
 * @param memory pointer to the blocks
 */
-void free_memory(block *memory);
+void ar2_free_memory(block *memory);

 /*
 * Computes absolute position of reference block in the lane following a skewed
@@ -130,7 +130,7 @@ void free_memory(block *memory);
 * If so we can reference the current segment
 * @pre All pointers must be valid
 */
-uint32_t index_alpha(const argon2_instance_t *instance,
+uint32_t ar2_index_alpha(const argon2_instance_t *instance,
                     const argon2_position_t *position, uint32_t pseudo_rand,
                     int same_lane);

@@ -141,7 +141,7 @@ uint32_t index_alpha(const argon2_instance_t *instance,
 * @return ARGON2_OK if everything is all right, otherwise one of error codes
 * (all defined in <argon2.h>
 */
-int validate_inputs(const argon2_context *context);
+int ar2_validate_inputs(const argon2_context *context);

 /*
 * Hashes all the inputs into @a blockhash[PREHASH_DIGEST_LENGTH], clears
@@ -153,7 +153,7 @@ int validate_inputs(const argon2_context *context);
 * @pre    @a blockhash must have at least @a PREHASH_DIGEST_LENGTH bytes
 * allocated
 */
-void initial_hash(uint8_t *blockhash, argon2_context *context,
+void ar2_initial_hash(uint8_t *blockhash, argon2_context *context,
                  argon2_type type);

 /*
@@ -162,7 +162,7 @@ void initial_hash(uint8_t *blockhash, argon2_context *context,
 * @param blockhash Pointer to the pre-hashing digest
 * @pre blockhash must point to @a PREHASH_SEED_LENGTH allocated values
 */
-void fill_firsts_blocks(uint8_t *blockhash, const argon2_instance_t *instance);
+void ar2_fill_first_blocks(uint8_t *blockhash, const argon2_instance_t *instance);

 /*
 * Function allocates memory, hashes the inputs with Blake,  and creates first
@@ -174,7 +174,7 @@ void fill_firsts_blocks(uint8_t *blockhash, const argon2_instance_t *instance);
 * @return Zero if successful, -1 if memory failed to allocate. @context->state
 * will be modified if successful.
 */
-int initialize(argon2_instance_t *instance, argon2_context *context);
+int ar2_initialize(argon2_instance_t *instance, argon2_context *context);

 /*
 * XORing the last block of each lane, hashing it, making the tag. Deallocates
@@ -187,7 +187,7 @@ int initialize(argon2_instance_t *instance, argon2_context *context);
 * @pre if context->free_cbk is not NULL, it should point to a function that
 * deallocates memory
 */
-void finalize(const argon2_context *context, argon2_instance_t *instance);
+void ar2_finalize(const argon2_context *context, argon2_instance_t *instance);

 /*
 * Function that fills the segment using previous segments also from other
@@ -196,7 +196,7 @@ void finalize(const argon2_context *context, argon2_instance_t *instance);
 * @param position Current position
 * @pre all block pointers must be valid
 */
-void fill_segment(const argon2_instance_t *instance,
+void ar2_fill_segment(const argon2_instance_t *instance,
                  argon2_position_t position);

 /*
@@ -204,13 +204,13 @@ void fill_segment(const argon2_instance_t *instance,
 * blocks in each lane
 * @param instance Pointer to the current instance
 */
-void fill_memory_blocks(argon2_instance_t *instance);
+void ar2_fill_memory_blocks(argon2_instance_t *instance);

 /*
 * Function that performs memory-hard hashing with certain degree of parallelism
 * @param  context  Pointer to the Argon2 internal structure
 * @return Error code if smth is wrong, ARGON2_OK otherwise
 */
-int argon2_core(argon2_context *context, argon2_type type);
+int ar2_argon2_core(argon2_context *context, argon2_type type);

 #endif
--- a/algo/argon2/argon2a/ar2/genkat.c.hide
+++ b/algo/argon2/argon2a/ar2/genkat.c.hide
--- a/algo/argon2/argon2a/ar2/genkat.h.hide
+++ b/algo/argon2/argon2a/ar2/genkat.h.hide
--- a/algo/argon2/argon2a/ar2/opt.c
+++ b/algo/argon2/argon2a/ar2/opt.c
@@ -26,7 +26,7 @@
 #include "blake2/blake2.h"
 #include "blake2/blamka-round-opt.h"

-void fill_block(__m128i *state, __m128i const *ref_block, __m128i *next_block)
+void ar2_fill_block(__m128i *state, __m128i const *ref_block, __m128i *next_block)
 {
    __m128i ALIGN(16) block_XY[ARGON2_QWORDS_IN_BLOCK];
    uint32_t i;
@@ -95,7 +95,7 @@ static const uint64_t bad_rands[32] = {
    UINT64_C(8548260058287621283),  UINT64_C(8641748798041936364)
 };

-void generate_addresses(const argon2_instance_t *instance,
+void ar2_generate_addresses(const argon2_instance_t *instance,
                        const argon2_position_t *position,
                        uint64_t *pseudo_rands)
 {
@@ -113,7 +113,7 @@ void generate_addresses(const argon2_instance_t *instance,
 #define LANE_LENGTH 16
 #define POS_LANE 0

-void fill_segment(const argon2_instance_t *instance,
+void ar2_fill_segment(const argon2_instance_t *instance,
                  argon2_position_t position)
 {
    block *ref_block = NULL, *curr_block = NULL;
@@ -129,7 +129,7 @@ void fill_segment(const argon2_instance_t *instance,
    pseudo_rands = (uint64_t *)malloc(/*sizeof(uint64_t) * 4*/32);

    if (data_independent_addressing) {
-        generate_addresses(instance, &position, pseudo_rands);
+        ar2_generate_addresses(instance, &position, pseudo_rands);
    }

    i = 0;
@@ -173,12 +173,12 @@ void fill_segment(const argon2_instance_t *instance,
         * lane.
         */
        position.index = i;
-        ref_index = index_alpha(instance, &position, pseudo_rand & 0xFFFFFFFF,1);
+        ref_index = ar2_index_alpha(instance, &position, pseudo_rand & 0xFFFFFFFF,1);

        /* 2 Creating a new block */
        ref_block = instance->memory + ref_index;
        curr_block = instance->memory + curr_offset;
-        fill_block(state, (__m128i const *)ref_block->v, (__m128i *)curr_block->v);
+        ar2_fill_block(state, (__m128i const *)ref_block->v, (__m128i *)curr_block->v);
    }

    free(pseudo_rands);
--- a/algo/argon2/argon2a/ar2/opt.h
+++ b/algo/argon2/argon2a/ar2/opt.h
@@ -21,7 +21,7 @@
 * @param next_block Pointer to the block to be constructed
 * @pre all block pointers must be valid
 */
-void fill_block(__m128i *state, __m128i const *ref_block, __m128i *next_block);
+void ar2_fill_block(__m128i *state, __m128i const *ref_block, __m128i *next_block);

 /*
 * Generate pseudo-random values to reference blocks in the segment and puts
@@ -31,7 +31,7 @@ void fill_block(__m128i *state, __m128i const *ref_block, __m128i *next_block);
 * @param pseudo_rands Pointer to the array of 64-bit values
 * @pre pseudo_rands must point to @a instance->segment_length allocated values
 */
-void generate_addresses(const argon2_instance_t *instance,
+void ar2_generate_addresses(const argon2_instance_t *instance,
                        const argon2_position_t *position,
                        uint64_t *pseudo_rands);

@@ -43,7 +43,7 @@ void generate_addresses(const argon2_instance_t *instance,
 * @param position Current position
 * @pre all block pointers must be valid
 */
-void fill_segment(const argon2_instance_t *instance,
+void ar2_fill_segment(const argon2_instance_t *instance,
                  argon2_position_t position);

 #endif /* ARGON2_OPT_H */
--- a/algo/argon2/argon2a/ar2/ref.c.hide
+++ b/algo/argon2/argon2a/ar2/ref.c.hide
--- a/algo/argon2/argon2a/ar2/ref.h.hide
+++ b/algo/argon2/argon2a/ar2/ref.h.hide
--- a/algo/argon2/argon2a/ar2/run.c.hide
+++ b/algo/argon2/argon2a/ar2/run.c.hide
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-hash.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-hash.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-hash_skein512.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-hash_skein512.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-avx.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-avx.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-avx2.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-avx2.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-sse2.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-sse2.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-ssse3.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-ssse3.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-xop.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64-xop.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-mix_salsa64.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-pbkdf2.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-pbkdf2.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-portable-x86.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-portable-x86.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-portable.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-portable.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix-basic.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix-basic.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix-template.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix-template.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-romix.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-salsa64.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-salsa64.h
--- a/algo/argon2/argon2a/ar2/sj/scrypt-jane-test-vectors.h
+++ b/algo/argon2/argon2a/ar2/sj/scrypt-jane-test-vectors.h
--- a/algo/argon2/argon2a/argon2a.c
+++ b/algo/argon2/argon2a/argon2a.c
@@ -24,7 +24,7 @@ inline void argon_call(void *out, void *in, void *salt, int type)
 	context.allocate_cbk = NULL;
 	context.free_cbk = NULL;

-	argon2_core(&context, type);
+	ar2_argon2_core(&context, type);
 }

 void argon2hash(void *output, const void *input)
@@ -79,7 +79,7 @@ int64_t argon2_get_max64 ()

 bool register_argon2_algo( algo_gate_t* gate )
 {
-  gate->optimizations = SSE2_OPT | AES_OPT | AVX_OPT | AVX2_OPT;
+  gate->optimizations = SSE2_OPT | AVX_OPT | AVX2_OPT;
  gate->scanhash        = (void*)&scanhash_argon2;
  gate->hash            = (void*)&argon2hash;
  gate->gen_merkle_root = (void*)&SHA256_gen_merkle_root;
--- a/algo/argon2/argon2d/argon2d-gate.c
+++ b/algo/argon2/argon2d/argon2d-gate.c
@@ -0,0 +1,198 @@
+#include "argon2d-gate.h"
+#include "argon2d/argon2.h"
+
+static const size_t INPUT_BYTES = 80;  // Length of a block header in bytes. Input length = salt length (salt = input)
+static const size_t OUTPUT_BYTES = 32; // Length of output needed for a 256-bit hash
+static const unsigned int DEFAULT_ARGON2_FLAG = 2; // 2 = ARGON2_FLAG_CLEAR_SECRET; ARGON2_DEFAULT_FLAGS is 0 in argon2.h
+
+// Credits
+
+void argon2d_crds_hash( void *output, const void *input )
+{
+	argon2_context context;
+	context.out = (uint8_t *)output;
+	context.outlen = (uint32_t)OUTPUT_BYTES;
+	context.pwd = (uint8_t *)input;
+	context.pwdlen = (uint32_t)INPUT_BYTES;
+	context.salt = (uint8_t *)input; //salt = input
+	context.saltlen = (uint32_t)INPUT_BYTES;
+	context.secret = NULL;
+	context.secretlen = 0;
+	context.ad = NULL;
+	context.adlen = 0;
+	context.allocate_cbk = NULL;
+	context.free_cbk = NULL;
+	context.flags = DEFAULT_ARGON2_FLAG; // 2 = ARGON2_FLAG_CLEAR_SECRET (secret is NULL here)
+	// main configurable Argon2 hash parameters
+	context.m_cost = 250; // Memory in KiB (250 KiB requested)
+	context.lanes = 4;    // Degree of Parallelism
+	context.threads = 1;  // Threads
+	context.t_cost = 1;   // Iterations
+	context.version = ARGON2_VERSION_10;
+
+	argon2_ctx( &context, Argon2_d );
+}
+
+int scanhash_argon2d_crds( int thr_id, struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done )
+{
+        uint32_t _ALIGN(64) endiandata[20];
+        uint32_t _ALIGN(64) hash[8];
+        uint32_t *pdata = work->data;
+        uint32_t *ptarget = work->target;
+
+        const uint32_t first_nonce = pdata[19];
+        const uint32_t Htarg = ptarget[7];
+
+        uint32_t nonce = first_nonce;
+
+        swab32_array( endiandata, pdata, 20 );
+
+        do {
+                be32enc(&endiandata[19], nonce);
+                argon2d_crds_hash( hash, endiandata );
+                if ( hash[7] <= Htarg && fulltest( hash, ptarget ) )
+                {
+                        pdata[19] = nonce;
+                        *hashes_done = pdata[19] - first_nonce;
+                        work_set_target_ratio(work, hash);
+                        return 1;
+                }
+                nonce++;
+        } while (nonce < max_nonce && !work_restart[thr_id].restart);
+
+        pdata[19] = nonce;
+        *hashes_done = pdata[19] - first_nonce + 1;
+        return 0;
+}
+
+bool register_argon2d_crds_algo( algo_gate_t* gate )
+{
+        gate->scanhash = (void*)&scanhash_argon2d_crds;
+        gate->hash = (void*)&argon2d_crds_hash;
+        gate->set_target = (void*)&scrypt_set_target;
+        gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
+        return true;
+}
+
+// Dynamic
+
+void argon2d_dyn_hash( void *output, const void *input )
+{
+    argon2_context context;
+    context.out = (uint8_t *)output;
+    context.outlen = (uint32_t)OUTPUT_BYTES;
+    context.pwd = (uint8_t *)input;
+    context.pwdlen = (uint32_t)INPUT_BYTES;
+    context.salt = (uint8_t *)input; //salt = input
+    context.saltlen = (uint32_t)INPUT_BYTES;
+    context.secret = NULL;
+    context.secretlen = 0;
+    context.ad = NULL;
+    context.adlen = 0;
+    context.allocate_cbk = NULL;
+    context.free_cbk = NULL;
+    context.flags = DEFAULT_ARGON2_FLAG; // 2 = ARGON2_FLAG_CLEAR_SECRET (secret is NULL here)
+    // main configurable Argon2 hash parameters
+    context.m_cost = 500;  // Memory in KiB (500 KiB requested)
+    context.lanes = 8;     // Degree of Parallelism
+    context.threads = 1;   // Threads
+    context.t_cost = 2;    // Iterations
+    context.version = ARGON2_VERSION_10;
+
+    argon2_ctx( &context, Argon2_d );
+}
+
+int scanhash_argon2d_dyn( int thr_id, struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done )
+{
+        uint32_t _ALIGN(64) endiandata[20];
+        uint32_t _ALIGN(64) hash[8];
+        uint32_t *pdata = work->data;
+        uint32_t *ptarget = work->target;
+
+        const uint32_t first_nonce = pdata[19];
+        const uint32_t Htarg = ptarget[7];
+
+        uint32_t nonce = first_nonce;
+
+        swab32_array( endiandata, pdata, 20 );
+
+        do {
+                be32enc(&endiandata[19], nonce);
+                argon2d_dyn_hash( hash, endiandata );
+                if ( hash[7] <= Htarg && fulltest( hash, ptarget ) )
+                {
+                        pdata[19] = nonce;
+                        *hashes_done = pdata[19] - first_nonce;
+                        work_set_target_ratio(work, hash);
+                        return 1;
+                }
+                nonce++;
+        } while (nonce < max_nonce && !work_restart[thr_id].restart);
+
+        pdata[19] = nonce;
+        *hashes_done = pdata[19] - first_nonce + 1;
+        return 0;
+}
+
+bool register_argon2d_dyn_algo( algo_gate_t* gate )
+{
+        gate->scanhash = (void*)&scanhash_argon2d_dyn;
+        gate->hash = (void*)&argon2d_dyn_hash;
+        gate->set_target = (void*)&scrypt_set_target;
+        gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
+        return true;
+}
+
+// Unitus
+
+int scanhash_argon2d4096( int thr_id, struct work *work, uint32_t max_nonce,
+                           uint64_t *hashes_done)
+{
+   uint32_t _ALIGN(64) vhash[8];
+   uint32_t _ALIGN(64) endiandata[20];
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t Htarg = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   uint32_t n = first_nonce;
+    
+   uint32_t t_cost = 1; // 1 iteration
+   uint32_t m_cost = 4096; // 4096 KiB (4 MiB)
+   uint32_t parallelism = 1; // 1 lane, 1 thread (argon2_hash sets lanes = threads = parallelism)
+
+   for ( int i = 0; i < 19; i++ )
+      be32enc( &endiandata[i], pdata[i] );
+
+   do {
+      be32enc( &endiandata[19], n );
+      argon2d_hash_raw( t_cost, m_cost, parallelism, (char*) endiandata, 80,
+                 (char*) endiandata, 80, (char*) vhash, 32, ARGON2_VERSION_13 );
+      if ( vhash[7] < Htarg && fulltest( vhash, ptarget ) )
+      {
+         *hashes_done = n - first_nonce + 1;
+         pdata[19] = n;
+         return true;
+      }
+      n++;
+
+   } while (n < max_nonce && !work_restart[thr_id].restart);
+
+   *hashes_done = n - first_nonce + 1;
+   pdata[19] = n;
+
+   return 0;
+}
+
+int64_t get_max64_0x1ff() { return 0x1ff; }
+
+bool register_argon2d4096_algo( algo_gate_t* gate )
+{
+        gate->scanhash = (void*)&scanhash_argon2d4096;
+        gate->set_target = (void*)&scrypt_set_target;
+        gate->get_max64  = (void*)&get_max64_0x1ff;
+        gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
+        return true;
+}
+
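Note on the scan loops above: each one pre-filters on the most significant 32-bit word (`hash[7] <= Htarg`) and only then calls `fulltest`. The sketch below illustrates what such a full comparison can look like, assuming the hash and the target are each stored as eight 32-bit words with word 7 the most significant. `hash_meets_target` is a hypothetical stand-in written for illustration; it is not the project's `fulltest`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration, not the project's fulltest().
 * Treats hash and target as 256-bit integers stored as eight
 * 32-bit words, word 7 being the most significant.
 * Returns true when hash <= target. */
bool hash_meets_target(const uint32_t hash[8], const uint32_t target[8])
{
    for (int i = 7; i >= 0; i--) {
        if (hash[i] > target[i])
            return false;   /* a more significant word is larger */
        if (hash[i] < target[i])
            return true;    /* a more significant word is smaller */
    }
    return true;            /* all eight words are equal */
}
```

Under these assumptions, a hash whose word 7 exceeds the target's word 7 fails on the first iteration, which is why checking that single word first is a cheap way to skip the full comparison for most nonces.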
--- a/algo/argon2/argon2d/argon2d-gate.h
+++ b/algo/argon2/argon2d/argon2d-gate.h
@@ -0,0 +1,31 @@
+#ifndef ARGON2D_GATE_H__
+#define ARGON2D_GATE_H__
+
+#include "algo-gate-api.h"
+#include <stdint.h>
+
+// Credits: version = 0x10, m_cost = 250.
+bool register_argon2d_crds_algo( algo_gate_t* gate );
+
+void argon2d_crds_hash( void *state, const void *input );
+
+int scanhash_argon2d_crds( int thr_id, struct work *work, uint32_t max_nonce,
+                    uint64_t *hashes_done );
+
+// Dynamic: version = 0x10, m_cost = 500.
+bool register_argon2d_dyn_algo( algo_gate_t* gate );
+
+void argon2d_dyn_hash( void *state, const void *input );
+
+int scanhash_argon2d_dyn( int thr_id, struct work *work, uint32_t max_nonce,
+                    uint64_t *hashes_done );
+
+
+// Unitus: version = 0x13, m_cost = 4096.
+bool register_argon2d4096_algo( algo_gate_t* gate );
+
+int scanhash_argon2d4096( int thr_id, struct work *work, uint32_t max_nonce,
+                    uint64_t *hashes_done );
+
+#endif
+
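The m_cost values in the comments above are requests rather than exact sizes. In the `argon2_ctx` function from argon2.c in this change, the block count is first raised to at least 8 blocks per lane and then rounded down to a multiple of `lanes * ARGON2_SYNC_POINTS`. A minimal sketch of that arithmetic follows; the helper name `effective_memory_blocks` and the macro `SYNC_POINTS` are hypothetical, and the lane counts are the ones set in argon2d-gate.c (4 for Credits, 8 for Dynamic, 1 for Unitus).

```c
#include <stdint.h>

/* Mirrors ARGON2_SYNC_POINTS from argon2.h; redefined here so the
 * sketch stands alone. */
#define SYNC_POINTS 4u

/* Hypothetical helper: given the requested m_cost (in 1 KiB blocks)
 * and the number of lanes, return the number of blocks that the
 * alignment step in argon2_ctx would actually use. */
uint32_t effective_memory_blocks(uint32_t m_cost, uint32_t lanes)
{
    uint32_t memory_blocks = m_cost;
    const uint32_t min_blocks = 2u * SYNC_POINTS * lanes;

    /* Enforce the minimum of 8 blocks per lane. */
    if (memory_blocks < min_blocks)
        memory_blocks = min_blocks;

    /* Integer division rounds down so all segments have equal length. */
    const uint32_t segment_length = memory_blocks / (lanes * SYNC_POINTS);
    return segment_length * (lanes * SYNC_POINTS);
}
```

With these inputs the sketch gives 240 blocks for Credits (250 requested, 4 lanes), 480 blocks for Dynamic (500 requested, 8 lanes), and 4096 blocks for Unitus (4096 requested, 1 lane).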
--- a/algo/argon2/argon2d/argon2d/argon2.c
+++ b/algo/argon2/argon2d/argon2d/argon2.c
@@ -0,0 +1,458 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#include "argon2.h"
+#include "encoding.h"
+#include "core.h"
+
+const char *argon2_type2string(argon2_type type, int uppercase) {
+    switch (type) {
+        case Argon2_d:
+            return uppercase ? "Argon2d" : "argon2d";
+        case Argon2_i:
+            return uppercase ? "Argon2i" : "argon2i";
+        case Argon2_id:
+            return uppercase ? "Argon2id" : "argon2id";
+    }
+
+    return NULL;
+}
+
+int argon2_ctx(argon2_context *context, argon2_type type) {
+    /* 1. Validate all inputs */
+    int result = validate_inputs(context);
+    uint32_t memory_blocks, segment_length;
+    argon2_instance_t instance;
+
+    if (ARGON2_OK != result) {
+        return result;
+    }
+
+    if (Argon2_d != type && Argon2_i != type && Argon2_id != type) {
+        return ARGON2_INCORRECT_TYPE;
+    }
+
+    /* 2. Align memory size */
+    /* Minimum memory_blocks = 8L blocks, where L is the number of lanes */
+    memory_blocks = context->m_cost;
+
+    if (memory_blocks < 2 * ARGON2_SYNC_POINTS * context->lanes) {
+        memory_blocks = 2 * ARGON2_SYNC_POINTS * context->lanes;
+    }
+
+    segment_length = memory_blocks / (context->lanes * ARGON2_SYNC_POINTS);
+    /* Ensure that all segments have equal length */
+    memory_blocks = segment_length * (context->lanes * ARGON2_SYNC_POINTS);
+
+    instance.version = context->version;
+    instance.memory = NULL;
+    instance.passes = context->t_cost;
+    instance.memory_blocks = memory_blocks;
+    instance.segment_length = segment_length;
+    instance.lane_length = segment_length * ARGON2_SYNC_POINTS;
+    instance.lanes = context->lanes;
+    instance.threads = context->threads;
+    instance.type = type;
+
+    if (instance.threads > instance.lanes) {
+        instance.threads = instance.lanes;
+    }
+
+    /* 3. Initialization: Hashing inputs, allocating memory, filling first
+     * blocks
+     */
+    result = initialize(&instance, context);
+
+    if (ARGON2_OK != result) {
+        return result;
+    }
+
+    /* 4. Filling memory */
+    result = fill_memory_blocks(&instance);
+
+    if (ARGON2_OK != result) {
+        return result;
+    }
+    /* 5. Finalization */
+    finalize(context, &instance);
+
+    return ARGON2_OK;
+}
+
+int argon2_hash(const uint32_t t_cost, const uint32_t m_cost,
+                const uint32_t parallelism, const void *pwd,
+                const size_t pwdlen, const void *salt, const size_t saltlen,
+                void *hash, const size_t hashlen, char *encoded,
+                const size_t encodedlen, argon2_type type,
+                const uint32_t version){
+
+    argon2_context context;
+    int result;
+    uint8_t *out;
+
+    if (pwdlen > ARGON2_MAX_PWD_LENGTH) {
+        return ARGON2_PWD_TOO_LONG;
+    }
+
+    if (saltlen > ARGON2_MAX_SALT_LENGTH) {
+        return ARGON2_SALT_TOO_LONG;
+    }
+
+    if (hashlen > ARGON2_MAX_OUTLEN) {
+        return ARGON2_OUTPUT_TOO_LONG;
+    }
+
+    if (hashlen < ARGON2_MIN_OUTLEN) {
+        return ARGON2_OUTPUT_TOO_SHORT;
+    }
+
+    out = malloc(hashlen);
+    if (!out) {
+        return ARGON2_MEMORY_ALLOCATION_ERROR;
+    }
+
+    context.out = (uint8_t *)out;
+    context.outlen = (uint32_t)hashlen;
+    context.pwd = CONST_CAST(uint8_t *)pwd;
+    context.pwdlen = (uint32_t)pwdlen;
+    context.salt = CONST_CAST(uint8_t *)salt;
+    context.saltlen = (uint32_t)saltlen;
+    context.secret = NULL;
+    context.secretlen = 0;
+    context.ad = NULL;
+    context.adlen = 0;
+    context.t_cost = t_cost;
+    context.m_cost = m_cost;
+    context.lanes = parallelism;
+    context.threads = parallelism;
+    context.allocate_cbk = NULL;
+    context.free_cbk = NULL;
+    context.flags = ARGON2_DEFAULT_FLAGS;
+    context.version = version;
+
+    result = argon2_ctx(&context, type);
+
+    if (result != ARGON2_OK) {
+        clear_internal_memory(out, hashlen);
+        free(out);
+        return result;
+    }
+
+    /* if raw hash requested, write it */
+    if (hash) {
+        memcpy(hash, out, hashlen);
+    }
+
+    /* if encoding requested, write it */
+    if (encoded && encodedlen) {
+        if (encode_string(encoded, encodedlen, &context, type) != ARGON2_OK) {
+            clear_internal_memory(out, hashlen); /* wipe buffers if error */
+            clear_internal_memory(encoded, encodedlen);
+            free(out);
+            return ARGON2_ENCODING_FAIL;
+        }
+    }
+    clear_internal_memory(out, hashlen);
+    free(out);
+
+    return ARGON2_OK;
+}
+
+int argon2i_hash_encoded(const uint32_t t_cost, const uint32_t m_cost,
+                         const uint32_t parallelism, const void *pwd,
+                         const size_t pwdlen, const void *salt,
+                         const size_t saltlen, const size_t hashlen,
+                         char *encoded, const size_t encodedlen,
+                         const uint32_t version) {
+
+    return argon2_hash(t_cost, m_cost, parallelism, pwd, pwdlen, salt, saltlen,
+                       NULL, hashlen, encoded, encodedlen, Argon2_i,
+                       version );
+}
+
+int argon2i_hash_raw(const uint32_t t_cost, const uint32_t m_cost,
+                     const uint32_t parallelism, const void *pwd,
+                     const size_t pwdlen, const void *salt,
+                     const size_t saltlen, void *hash, const size_t hashlen,
+                     const uint32_t version ) {
+
+    return argon2_hash(t_cost, m_cost, parallelism, pwd, pwdlen, salt, saltlen,
+                       hash, hashlen, NULL, 0, Argon2_i, version );
+}
+
+int argon2d_hash_encoded(const uint32_t t_cost, const uint32_t m_cost,
+                         const uint32_t parallelism, const void *pwd,
+                         const size_t pwdlen, const void *salt,
+                         const size_t saltlen, const size_t hashlen,
+                         char *encoded, const size_t encodedlen,
+                         const uint32_t version ) {
+
+    return argon2_hash(t_cost, m_cost, parallelism, pwd, pwdlen, salt, saltlen,
+                       NULL, hashlen, encoded, encodedlen, Argon2_d,
+                       version );
+}
+
+int argon2d_hash_raw(const uint32_t t_cost, const uint32_t m_cost,
+                     const uint32_t parallelism, const void *pwd,
+                     const size_t pwdlen, const void *salt,
+                     const size_t saltlen, void *hash, const size_t hashlen,
+                     const uint32_t version ) {
+
+    return argon2_hash(t_cost, m_cost, parallelism, pwd, pwdlen, salt, saltlen,
+                       hash, hashlen, NULL, 0, Argon2_d, version );
+}
+
+int argon2id_hash_encoded(const uint32_t t_cost, const uint32_t m_cost,
+                          const uint32_t parallelism, const void *pwd,
+                          const size_t pwdlen, const void *salt,
+                          const size_t saltlen, const size_t hashlen,
+                          char *encoded, const size_t encodedlen,
+                          const uint32_t version ) {
+
+    return argon2_hash(t_cost, m_cost, parallelism, pwd, pwdlen, salt, saltlen,
+                       NULL, hashlen, encoded, encodedlen, Argon2_id,
+                       version);
+}
+
+int argon2id_hash_raw(const uint32_t t_cost, const uint32_t m_cost,
+                      const uint32_t parallelism, const void *pwd,
+                      const size_t pwdlen, const void *salt,
+                      const size_t saltlen, void *hash, const size_t hashlen,
+                      const uint32_t version ) {
+    return argon2_hash(t_cost, m_cost, parallelism, pwd, pwdlen, salt, saltlen,
+                       hash, hashlen, NULL, 0, Argon2_id, version );
+}
+
+static int argon2_compare(const uint8_t *b1, const uint8_t *b2, size_t len) {
+    size_t i;
+    uint8_t d = 0U;
+
+    for (i = 0U; i < len; i++) {
+        d |= b1[i] ^ b2[i];
+    }
+    return (int)((1 & ((d - 1) >> 8)) - 1);
+}
+
+int argon2_verify(const char *encoded, const void *pwd, const size_t pwdlen,
+                  argon2_type type) {
+
+    argon2_context ctx;
+    uint8_t *desired_result = NULL;
+
+    int ret = ARGON2_OK;
+
+    size_t encoded_len;
+    uint32_t max_field_len;
+
+    if (pwdlen > ARGON2_MAX_PWD_LENGTH) {
+        return ARGON2_PWD_TOO_LONG;
+    }
+
+    if (encoded == NULL) {
+        return ARGON2_DECODING_FAIL;
+    }
+
+    encoded_len = strlen(encoded);
+    if (encoded_len > UINT32_MAX) {
+        return ARGON2_DECODING_FAIL;
+    }
+
+    /* No field can be longer than the encoded length */
+    max_field_len = (uint32_t)encoded_len;
+
+    ctx.saltlen = max_field_len;
+    ctx.outlen = max_field_len;
+
+    ctx.salt = malloc(ctx.saltlen);
+    ctx.out = malloc(ctx.outlen);
+    if (!ctx.salt || !ctx.out) {
+        ret = ARGON2_MEMORY_ALLOCATION_ERROR;
+        goto fail;
+    }
+
+    ctx.pwd = (uint8_t *)pwd;
+    ctx.pwdlen = (uint32_t)pwdlen;
+
+    ret = decode_string(&ctx, encoded, type);
+    if (ret != ARGON2_OK) {
+        goto fail;
+    }
+
+    /* Set aside the desired result, and get a new buffer. */
+    desired_result = ctx.out;
+    ctx.out = malloc(ctx.outlen);
+    if (!ctx.out) {
+        ret = ARGON2_MEMORY_ALLOCATION_ERROR;
+        goto fail;
+    }
+
+    ret = argon2_verify_ctx(&ctx, (char *)desired_result, type);
+    if (ret != ARGON2_OK) {
+        goto fail;
+    }
+
+fail:
+    free(ctx.salt);
+    free(ctx.out);
+    free(desired_result);
+
+    return ret;
+}
+
+int argon2i_verify(const char *encoded, const void *pwd, const size_t pwdlen) {
+
+    return argon2_verify(encoded, pwd, pwdlen, Argon2_i);
+}
+
+int argon2d_verify(const char *encoded, const void *pwd, const size_t pwdlen) {
+
+    return argon2_verify(encoded, pwd, pwdlen, Argon2_d);
+}
+
+int argon2id_verify(const char *encoded, const void *pwd, const size_t pwdlen) {
+
+    return argon2_verify(encoded, pwd, pwdlen, Argon2_id);
+}
+
+int argon2d_ctx(argon2_context *context) {
+    return argon2_ctx(context, Argon2_d);
+}
+
+int argon2i_ctx(argon2_context *context) {
+    return argon2_ctx(context, Argon2_i);
+}
+
+int argon2id_ctx(argon2_context *context) {
+    return argon2_ctx(context, Argon2_id);
+}
+
+int argon2_verify_ctx(argon2_context *context, const char *hash,
+                      argon2_type type) {
+    int ret = argon2_ctx(context, type);
+    if (ret != ARGON2_OK) {
+        return ret;
+    }
+
+    if (argon2_compare((uint8_t *)hash, context->out, context->outlen)) {
+        return ARGON2_VERIFY_MISMATCH;
+    }
+
+    return ARGON2_OK;
+}
+
+int argon2d_verify_ctx(argon2_context *context, const char *hash) {
+    return argon2_verify_ctx(context, hash, Argon2_d);
+}
+
+int argon2i_verify_ctx(argon2_context *context, const char *hash) {
+    return argon2_verify_ctx(context, hash, Argon2_i);
+}
+
+int argon2id_verify_ctx(argon2_context *context, const char *hash) {
+    return argon2_verify_ctx(context, hash, Argon2_id);
+}
+
+const char *argon2_error_message(int error_code) {
+    switch (error_code) {
+    case ARGON2_OK:
+        return "OK";
+    case ARGON2_OUTPUT_PTR_NULL:
+        return "Output pointer is NULL";
+    case ARGON2_OUTPUT_TOO_SHORT:
+        return "Output is too short";
+    case ARGON2_OUTPUT_TOO_LONG:
+        return "Output is too long";
+    case ARGON2_PWD_TOO_SHORT:
+        return "Password is too short";
+    case ARGON2_PWD_TOO_LONG:
+        return "Password is too long";
+    case ARGON2_SALT_TOO_SHORT:
+        return "Salt is too short";
+    case ARGON2_SALT_TOO_LONG:
+        return "Salt is too long";
+    case ARGON2_AD_TOO_SHORT:
+        return "Associated data is too short";
+    case ARGON2_AD_TOO_LONG:
+        return "Associated data is too long";
+    case ARGON2_SECRET_TOO_SHORT:
+        return "Secret is too short";
+    case ARGON2_SECRET_TOO_LONG:
+        return "Secret is too long";
+    case ARGON2_TIME_TOO_SMALL:
+        return "Time cost is too small";
+    case ARGON2_TIME_TOO_LARGE:
+        return "Time cost is too large";
+    case ARGON2_MEMORY_TOO_LITTLE:
+        return "Memory cost is too small";
+    case ARGON2_MEMORY_TOO_MUCH:
+        return "Memory cost is too large";
+    case ARGON2_LANES_TOO_FEW:
+        return "Too few lanes";
+    case ARGON2_LANES_TOO_MANY:
+        return "Too many lanes";
+    case ARGON2_PWD_PTR_MISMATCH:
+        return "Password pointer is NULL, but password length is not 0";
+    case ARGON2_SALT_PTR_MISMATCH:
+        return "Salt pointer is NULL, but salt length is not 0";
+    case ARGON2_SECRET_PTR_MISMATCH:
+        return "Secret pointer is NULL, but secret length is not 0";
+    case ARGON2_AD_PTR_MISMATCH:
+        return "Associated data pointer is NULL, but ad length is not 0";
+    case ARGON2_MEMORY_ALLOCATION_ERROR:
+        return "Memory allocation error";
+    case ARGON2_FREE_MEMORY_CBK_NULL:
+        return "The free memory callback is NULL";
+    case ARGON2_ALLOCATE_MEMORY_CBK_NULL:
+        return "The allocate memory callback is NULL";
+    case ARGON2_INCORRECT_PARAMETER:
+        return "Argon2_Context context is NULL";
+    case ARGON2_INCORRECT_TYPE:
+        return "There is no such version of Argon2";
+    case ARGON2_OUT_PTR_MISMATCH:
+        return "Output pointer mismatch";
+    case ARGON2_THREADS_TOO_FEW:
+        return "Not enough threads";
+    case ARGON2_THREADS_TOO_MANY:
+        return "Too many threads";
+    case ARGON2_MISSING_ARGS:
+        return "Missing arguments";
+    case ARGON2_ENCODING_FAIL:
+        return "Encoding failed";
+    case ARGON2_DECODING_FAIL:
+        return "Decoding failed";
+    case ARGON2_THREAD_FAIL:
+        return "Threading failure";
+    case ARGON2_DECODING_LENGTH_FAIL:
+        return "Some of encoded parameters are too long or too short";
+    case ARGON2_VERIFY_MISMATCH:
+        return "The password does not match the supplied hash";
+    default:
+        return "Unknown error code";
+    }
+}
+/*
+size_t argon2_encodedlen(uint32_t t_cost, uint32_t m_cost, uint32_t parallelism,
+                         uint32_t saltlen, uint32_t hashlen, argon2_type type) {
+  return strlen("$$v=$m=,t=,p=$$") + strlen(argon2_type2string(type, 0)) +
+         numlen(t_cost) + numlen(m_cost) + numlen(parallelism) +
+         b64len(saltlen) + b64len(hashlen) + numlen(ARGON2_VERSION_NUMBER) + 1;
+}
+*/
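Note on `argon2_compare` above: it ORs together the XOR of every byte pair, so the loop always visits all `len` bytes regardless of where the first difference occurs, and then maps the accumulator to 0 (equal) or -1 (different) without branching. The sketch below shows the same idea under a hypothetical name, `ct_compare`. It performs the final step in unsigned arithmetic, a choice made here to avoid right-shifting a negative value, which is implementation-defined in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of a branch-free buffer comparison.
 * Returns 0 when the first len bytes are identical, -1 otherwise. */
int ct_compare(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;

    /* Accumulate every differing bit; no early exit. */
    for (size_t i = 0; i < len; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);

    /* diff == 0: 0u - 1u wraps to UINT_MAX, so bit 8 is set and the
     *            masked value is 1, giving 1 - 1 = 0.
     * diff != 0: diff - 1 is in 0..254, so bit 8 is clear and the
     *            masked value is 0, giving 0 - 1 = -1. */
    return (int)(1u & (((unsigned)diff - 1u) >> 8)) - 1;
}
```

A caller would treat any nonzero return as a mismatch, in the same way `argon2_verify_ctx` treats the result of `argon2_compare`.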
--- a/algo/argon2/argon2d/argon2d/argon2.h
+++ b/algo/argon2/argon2d/argon2d/argon2.h
@@ -0,0 +1,440 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#ifndef ARGON2_H
+#define ARGON2_H
+
+#include <stdint.h>
+#include <stddef.h>
+#include <limits.h>
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+/* Symbols visibility control */
+#ifdef A2_VISCTL
+#define ARGON2_PUBLIC __attribute__((visibility("default")))
+#define ARGON2_LOCAL __attribute__ ((visibility ("hidden")))
+#elif _MSC_VER
+#define ARGON2_PUBLIC __declspec(dllexport)
+#define ARGON2_LOCAL
+#else
+#define ARGON2_PUBLIC
+#define ARGON2_LOCAL
+#endif
+
+/*
+ * Argon2 input parameter restrictions
+ */
+
+/* Minimum and maximum number of lanes (degree of parallelism) */
+#define ARGON2_MIN_LANES UINT32_C(1)
+#define ARGON2_MAX_LANES UINT32_C(0xFFFFFF)
+
+/* Minimum and maximum number of threads */
+#define ARGON2_MIN_THREADS UINT32_C(1)
+#define ARGON2_MAX_THREADS UINT32_C(0xFFFFFF)
+
+/* Number of synchronization points between lanes per pass */
+#define ARGON2_SYNC_POINTS UINT32_C(4)
+
+/* Minimum and maximum digest size in bytes */
+#define ARGON2_MIN_OUTLEN UINT32_C(4)
+#define ARGON2_MAX_OUTLEN UINT32_C(0xFFFFFFFF)
+
+/* Minimum and maximum number of memory blocks (each of BLOCK_SIZE bytes) */
+#define ARGON2_MIN_MEMORY (2 * ARGON2_SYNC_POINTS) /* 2 blocks per slice */
+
+#define ARGON2_MIN(a, b) ((a) < (b) ? (a) : (b))
+/* Max memory size is addressing-space/2, topping at 2^32 blocks (4 TB) */
+#define ARGON2_MAX_MEMORY_BITS                                                 \
+    ARGON2_MIN(UINT32_C(32), (sizeof(void *) * CHAR_BIT - 10 - 1))
+#define ARGON2_MAX_MEMORY                                                      \
+    ARGON2_MIN(UINT32_C(0xFFFFFFFF), UINT64_C(1) << ARGON2_MAX_MEMORY_BITS)
+
+/* Minimum and maximum number of passes */
+#define ARGON2_MIN_TIME UINT32_C(1)
+#define ARGON2_MAX_TIME UINT32_C(0xFFFFFFFF)
+
+/* Minimum and maximum password length in bytes */
+#define ARGON2_MIN_PWD_LENGTH UINT32_C(0)
+#define ARGON2_MAX_PWD_LENGTH UINT32_C(0xFFFFFFFF)
+
+/* Minimum and maximum associated data length in bytes */
+#define ARGON2_MIN_AD_LENGTH UINT32_C(0)
+#define ARGON2_MAX_AD_LENGTH UINT32_C(0xFFFFFFFF)
+
+/* Minimum and maximum salt length in bytes */
+#define ARGON2_MIN_SALT_LENGTH UINT32_C(8)
+#define ARGON2_MAX_SALT_LENGTH UINT32_C(0xFFFFFFFF)
+
+/* Minimum and maximum key length in bytes */
+#define ARGON2_MIN_SECRET UINT32_C(0)
+#define ARGON2_MAX_SECRET UINT32_C(0xFFFFFFFF)
+
+/* Flags to determine which fields are securely wiped (default = no wipe). */
+#define ARGON2_DEFAULT_FLAGS UINT32_C(0)
+#define ARGON2_FLAG_CLEAR_PASSWORD (UINT32_C(1) << 0)
+#define ARGON2_FLAG_CLEAR_SECRET (UINT32_C(1) << 1)
+
+/* Global flag to determine if we are wiping internal memory buffers. This flag
+ * is defined in core.c and defaults to 1 (wipe internal memory). */
+extern int FLAG_clear_internal_memory;
+
+/* Error codes */
+typedef enum Argon2_ErrorCodes {
+    ARGON2_OK = 0,
+
+    ARGON2_OUTPUT_PTR_NULL = -1,
+
+    ARGON2_OUTPUT_TOO_SHORT = -2,
+    ARGON2_OUTPUT_TOO_LONG = -3,
+
+    ARGON2_PWD_TOO_SHORT = -4,
+    ARGON2_PWD_TOO_LONG = -5,
+
+    ARGON2_SALT_TOO_SHORT = -6,
+    ARGON2_SALT_TOO_LONG = -7,
+
+    ARGON2_AD_TOO_SHORT = -8,
+    ARGON2_AD_TOO_LONG = -9,
+
+    ARGON2_SECRET_TOO_SHORT = -10,
+    ARGON2_SECRET_TOO_LONG = -11,
+
+    ARGON2_TIME_TOO_SMALL = -12,
+    ARGON2_TIME_TOO_LARGE = -13,
+
+    ARGON2_MEMORY_TOO_LITTLE = -14,
+    ARGON2_MEMORY_TOO_MUCH = -15,
+
+    ARGON2_LANES_TOO_FEW = -16,
+    ARGON2_LANES_TOO_MANY = -17,
+
+    ARGON2_PWD_PTR_MISMATCH = -18,    /* NULL ptr with non-zero length */
+    ARGON2_SALT_PTR_MISMATCH = -19,   /* NULL ptr with non-zero length */
+    ARGON2_SECRET_PTR_MISMATCH = -20, /* NULL ptr with non-zero length */
+    ARGON2_AD_PTR_MISMATCH = -21,     /* NULL ptr with non-zero length */
+
+    ARGON2_MEMORY_ALLOCATION_ERROR = -22,
+
+    ARGON2_FREE_MEMORY_CBK_NULL = -23,
+    ARGON2_ALLOCATE_MEMORY_CBK_NULL = -24,
+
+    ARGON2_INCORRECT_PARAMETER = -25,
+    ARGON2_INCORRECT_TYPE = -26,
+
+    ARGON2_OUT_PTR_MISMATCH = -27,
+
+    ARGON2_THREADS_TOO_FEW = -28,
+    ARGON2_THREADS_TOO_MANY = -29,
+
+    ARGON2_MISSING_ARGS = -30,
+
+    ARGON2_ENCODING_FAIL = -31,
+
+    ARGON2_DECODING_FAIL = -32,
+
+    ARGON2_THREAD_FAIL = -33,
+
+    ARGON2_DECODING_LENGTH_FAIL = -34,
+
+    ARGON2_VERIFY_MISMATCH = -35
+} argon2_error_codes;
+
+/* Memory allocator types --- for external allocation */
+typedef int (*allocate_fptr)(uint8_t **memory, size_t bytes_to_allocate);
+typedef void (*deallocate_fptr)(uint8_t *memory, size_t bytes_to_allocate);
+
+/* Argon2 external data structures */
+
+/*
+ *****
+ * Context: structure to hold Argon2 inputs:
+ *  output array and its length,
+ *  password and its length,
+ *  salt and its length,
+ *  secret and its length,
+ *  associated data and its length,
+ *  number of passes, amount of used memory (in KiB, adjusted to a multiple of 4 * lanes)
+ *  number of parallel threads that will be run.
+ * All the parameters above affect the output hash value.
+ * Additionally, two function pointers can be provided to allocate and
+ * deallocate the memory (if NULL, memory will be allocated internally).
+ * Also, flags indicate whether to erase the password and the secret as soon as
+ * they are pre-hashed (and thus not needed anymore); see ARGON2_FLAG_* above.
+ *****
+ * Simplest situation: you have output array out[8], password is stored in
+ * pwd[32], salt is stored in salt[16], you do not have keys nor associated
+ * data. You need to spend 1 GB of RAM and you run 5 passes of Argon2d with
+ * 4 parallel lanes.
+ * You want to erase the password, but you're OK with last pass not being
+ * erased. You want to use the default memory allocator.
+ * Then you initialize:
+ Argon2_Context(out,8,pwd,32,salt,16,NULL,0,NULL,0,5,1<<20,4,4,NULL,NULL,true,false,false,false)
+ */
+typedef struct Argon2_Context {
+    uint8_t *out;    /* output array */
+    uint32_t outlen; /* digest length */
+
+    uint8_t *pwd;    /* password array */
+    uint32_t pwdlen; /* password length */
+
+    uint8_t *salt;    /* salt array */
+    uint32_t saltlen; /* salt length */
+
+    uint8_t *secret;    /* key array */
+    uint32_t secretlen; /* key length */
+
+    uint8_t *ad;    /* associated data array */
+    uint32_t adlen; /* associated data length */
+
+    uint32_t t_cost;  /* number of passes */
+    uint32_t m_cost;  /* amount of memory requested (KB) */
+    uint32_t lanes;   /* number of lanes */
+    uint32_t threads; /* maximum number of threads */
+
+    uint32_t version; /* version number */
+
+    allocate_fptr allocate_cbk; /* pointer to memory allocator */
+    deallocate_fptr free_cbk;   /* pointer to memory deallocator */
+
+    uint32_t flags; /* array of bool options */
+} argon2_context;
+
+/* Argon2 primitive type */
+typedef enum Argon2_type {
+  Argon2_d = 0,
+  Argon2_i = 1,
+  Argon2_id = 2
+} argon2_type;
+
+/* Version of the algorithm */
+#define ARGON2_VERSION_10 0x10
+#define ARGON2_VERSION_13 0x13
+
+/*
+ * Function that gives the string representation of an argon2_type.
+ * @param type The argon2_type that we want the string for
+ * @param uppercase Whether the string should have the first letter uppercase
+ * @return NULL if invalid type, otherwise the string representation.
+ */
+ARGON2_PUBLIC const char *argon2_type2string(argon2_type type, int uppercase);
+
+/*
+ * Function that performs memory-hard hashing with certain degree of parallelism
+ * @param  context  Pointer to the Argon2 internal structure
+ * @return Error code if something is wrong, ARGON2_OK otherwise
+ */
+ARGON2_PUBLIC int argon2_ctx(argon2_context *context, argon2_type type);
+
+/**
+ * Hashes a password with Argon2i, producing an encoded hash
+ * @param t_cost Number of iterations
+ * @param m_cost Sets memory usage to m_cost kibibytes
+ * @param parallelism Number of threads and compute lanes
+ * @param pwd Pointer to password
+ * @param pwdlen Password size in bytes
+ * @param salt Pointer to salt
+ * @param saltlen Salt size in bytes
+ * @param hashlen Desired length of the hash in bytes
+ * @param encoded Buffer where to write the encoded hash
+ * @param encodedlen Size of the buffer (thus max size of the encoded hash)
+ * @note  Different parallelism levels will give different results
+ * @return ARGON2_OK if successful
+ */
+ARGON2_PUBLIC int argon2i_hash_encoded(const uint32_t t_cost,
+                                       const uint32_t m_cost,
+                                       const uint32_t parallelism,
+                                       const void *pwd, const size_t pwdlen,
+                                       const void *salt, const size_t saltlen,
+                                       const size_t hashlen, char *encoded,
+                                       const size_t encodedlen,
+                                       const uint32_t version );
+
+/**
+ * Hashes a password with Argon2i, producing a raw hash at @hash
+ * @param t_cost Number of iterations
+ * @param m_cost Sets memory usage to m_cost kibibytes
+ * @param parallelism Number of threads and compute lanes
+ * @param pwd Pointer to password
+ * @param pwdlen Password size in bytes
+ * @param salt Pointer to salt
+ * @param saltlen Salt size in bytes
+ * @param hash Buffer where to write the raw hash - updated by the function
+ * @param hashlen Desired length of the hash in bytes
+ * @note  Different parallelism levels will give different results
+ * @return ARGON2_OK if successful
+ */
+ARGON2_PUBLIC int argon2i_hash_raw(const uint32_t t_cost, const uint32_t m_cost,
+                                   const uint32_t parallelism, const void *pwd,
+                                   const size_t pwdlen, const void *salt,
+                                   const size_t saltlen, void *hash,
+                                   const size_t hashlen,
+                                   const uint32_t version );
+
+ARGON2_PUBLIC int argon2d_hash_encoded(const uint32_t t_cost,
+                                       const uint32_t m_cost,
+                                       const uint32_t parallelism,
+                                       const void *pwd, const size_t pwdlen,
+                                       const void *salt, const size_t saltlen,
+                                       const size_t hashlen, char *encoded,
+                                       const size_t encodedlen,
+                                       const uint32_t version );
+
+ARGON2_PUBLIC int argon2d_hash_raw(const uint32_t t_cost, const uint32_t m_cost,
+                                   const uint32_t parallelism, const void *pwd,
+                                   const size_t pwdlen, const void *salt,
+                                   const size_t saltlen, void *hash,
+                                   const size_t hashlen,
+                                   const uint32_t version );
+
+ARGON2_PUBLIC int argon2id_hash_encoded(const uint32_t t_cost,
+                                        const uint32_t m_cost,
+                                        const uint32_t parallelism,
+                                        const void *pwd, const size_t pwdlen,
+                                        const void *salt, const size_t saltlen,
+                                        const size_t hashlen, char *encoded,
+                                        const size_t encodedlen,
+                                        const uint32_t version );
+
+ARGON2_PUBLIC int argon2id_hash_raw(const uint32_t t_cost,
+                                    const uint32_t m_cost,
+                                    const uint32_t parallelism, const void *pwd,
+                                    const size_t pwdlen, const void *salt,
+                                    const size_t saltlen, void *hash,
+                                    const size_t hashlen,
+                                    const uint32_t version );
+
+/* generic function underlying the above ones */
+ARGON2_PUBLIC int argon2_hash(const uint32_t t_cost, const uint32_t m_cost,
+                              const uint32_t parallelism, const void *pwd,
+                              const size_t pwdlen, const void *salt,
+                              const size_t saltlen, void *hash,
+                              const size_t hashlen, char *encoded,
+                              const size_t encodedlen, argon2_type type,
+                              const uint32_t version );
+
+/**
+ * Verifies a password against an encoded string
+ * Encoded string is restricted as in validate_inputs()
+ * @param encoded String encoding parameters, salt, hash
+ * @param pwd Pointer to password
+ * @return ARGON2_OK if successful
+ */
+ARGON2_PUBLIC int argon2i_verify(const char *encoded, const void *pwd,
+                                 const size_t pwdlen);
+
+ARGON2_PUBLIC int argon2d_verify(const char *encoded, const void *pwd,
+                                 const size_t pwdlen);
+
+ARGON2_PUBLIC int argon2id_verify(const char *encoded, const void *pwd,
+                                  const size_t pwdlen);
+
+/* generic function underlying the above ones */
+ARGON2_PUBLIC int argon2_verify(const char *encoded, const void *pwd,
+                                const size_t pwdlen, argon2_type type);
+
+/**
+ * Argon2d: Version of Argon2 that picks memory blocks depending
+ * on the password and salt. Only for side-channel-free
+ * environment!!
+ *****
+ * @param  context  Pointer to current Argon2 context
+ * @return  Zero if successful, a non zero error code otherwise
+ */
+ARGON2_PUBLIC int argon2d_ctx(argon2_context *context);
+
+/**
+ * Argon2i: Version of Argon2 that picks memory blocks
+ * independently of the password and salt. Good against side channels,
+ * but worse w.r.t. tradeoff attacks if only one pass is used.
+ *****
+ * @param  context  Pointer to current Argon2 context
+ * @return  Zero if successful, a non zero error code otherwise
+ */
+ARGON2_PUBLIC int argon2i_ctx(argon2_context *context);
+
+/**
+ * Argon2id: Version of Argon2 where the first half-pass over memory is
+ * password-independent, the rest are password-dependent (on the password and
+ * salt). OK against side channels (they reduce to 1/2-pass Argon2i), and
+ * better w.r.t. tradeoff attacks (similar to Argon2d).
+ *****
+ * @param  context  Pointer to current Argon2 context
+ * @return  Zero if successful, a non zero error code otherwise
+ */
+ARGON2_PUBLIC int argon2id_ctx(argon2_context *context);
+
+/**
+ * Verify if a given password is correct for Argon2d hashing
+ * @param  context  Pointer to current Argon2 context
+ * @param  hash  The password hash to verify. The length of the hash is
+ * specified by the context outlen member
+ * @return  Zero if successful, a non zero error code otherwise
+ */
+ARGON2_PUBLIC int argon2d_verify_ctx(argon2_context *context, const char *hash);
+
+/**
+ * Verify if a given password is correct for Argon2i hashing
+ * @param  context  Pointer to current Argon2 context
+ * @param  hash  The password hash to verify. The length of the hash is
+ * specified by the context outlen member
+ * @return  Zero if successful, a non zero error code otherwise
+ */
+ARGON2_PUBLIC int argon2i_verify_ctx(argon2_context *context, const char *hash);
+
+/**
+ * Verify if a given password is correct for Argon2id hashing
+ * @param  context  Pointer to current Argon2 context
+ * @param  hash  The password hash to verify. The length of the hash is
+ * specified by the context outlen member
+ * @return  Zero if successful, a non zero error code otherwise
+ */
+ARGON2_PUBLIC int argon2id_verify_ctx(argon2_context *context,
+                                      const char *hash);
+
+/* generic function underlying the above ones */
+ARGON2_PUBLIC int argon2_verify_ctx(argon2_context *context, const char *hash,
+                                    argon2_type type);
+
+/**
+ * Get the associated error message for given error code
+ * @return  The error message associated with the given error code
+ */
+ARGON2_PUBLIC const char *argon2_error_message(int error_code);
+
+/**
+ * Returns the encoded hash length for the given input parameters
+ * @param t_cost  Number of iterations
+ * @param m_cost  Memory usage in kibibytes
+ * @param parallelism  Number of threads; used to compute lanes
+ * @param saltlen  Salt size in bytes
+ * @param hashlen  Hash size in bytes
+ * @param type The argon2_type that we want the encoded length for
+ * @return  The encoded hash length in bytes
+ */
+ARGON2_PUBLIC size_t argon2_encodedlen(uint32_t t_cost, uint32_t m_cost,
+                                       uint32_t parallelism, uint32_t saltlen,
+                                       uint32_t hashlen, argon2_type type);
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif
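
Note on m_cost and lanes: the header above documents m_cost in kibibytes and warns that different parallelism levels give different results. One reason is that the memory is split evenly across lanes and slices. The following is a minimal sketch of that split, assuming 1 KiB blocks and 4 sync points per pass as in the upstream reference code; those constants and every `sketch_` name are assumptions for illustration and are not defined in this header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants; the real values live in headers not shown here. */
#define SKETCH_SYNC_POINTS 4u    /* slices per pass */
#define SKETCH_BLOCK_BYTES 1024u /* one block = 1 KiB */

typedef struct {
    uint32_t memory_blocks;  /* total blocks actually used */
    uint32_t segment_length; /* blocks per (lane, slice) segment */
    uint32_t lane_length;    /* blocks per lane */
} sketch_layout;

/* Round m_cost (KiB) down to a multiple of SYNC_POINTS * lanes blocks. */
sketch_layout sketch_layout_from_cost(uint32_t m_cost, uint32_t lanes) {
    sketch_layout out;
    uint32_t min_blocks;

    assert(lanes > 0);

    /* Every segment needs at least two blocks. */
    min_blocks = 2u * SKETCH_SYNC_POINTS * lanes;
    if (m_cost < min_blocks) {
        m_cost = min_blocks;
    }

    out.segment_length = m_cost / (lanes * SKETCH_SYNC_POINTS);
    out.lane_length = out.segment_length * SKETCH_SYNC_POINTS;
    out.memory_blocks = out.lane_length * lanes;
    return out;
}

/* Bytes of working memory implied by a layout. */
uint64_t sketch_layout_bytes(const sketch_layout *layout) {
    return (uint64_t)layout->memory_blocks * SKETCH_BLOCK_BYTES;
}
```

With m_cost = 500 and 8 lanes, this sketch uses 480 blocks (15 per segment, 60 per lane), so the requested 500 KiB is rounded down rather than honoured exactly.
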
--- a/algo/argon2/argon2d/argon2d/argon2d_thread.c
+++ b/algo/argon2/argon2d/argon2d/argon2d_thread.c
@@ -0,0 +1,57 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#if !defined(ARGON2_NO_THREADS)
+
+#include "argon2d_thread.h"
+#if defined(_WIN32)
+#include <windows.h>
+#endif
+
+int argon2_thread_create(argon2_thread_handle_t *handle,
+                         argon2_thread_func_t func, void *args) {
+    if (NULL == handle || func == NULL) {
+        return -1;
+    }
+#if defined(_WIN32)
+    *handle = _beginthreadex(NULL, 0, func, args, 0, NULL);
+    return *handle != 0 ? 0 : -1;
+#else
+    return pthread_create(handle, NULL, func, args);
+#endif
+}
+
+int argon2_thread_join(argon2_thread_handle_t handle) {
+#if defined(_WIN32)
+    if (WaitForSingleObject((HANDLE)handle, INFINITE) == WAIT_OBJECT_0) {
+        return CloseHandle((HANDLE)handle) != 0 ? 0 : -1;
+    }
+    return -1;
+#else
+    return pthread_join(handle, NULL);
+#endif
+}
+
+void argon2_thread_exit(void) {
+#if defined(_WIN32)
+    _endthreadex(0);
+#else
+    pthread_exit(NULL);
+#endif
+}
+
+#endif /* ARGON2_NO_THREADS */
--- a/algo/argon2/argon2d/argon2d/argon2d_thread.h
+++ b/algo/argon2/argon2d/argon2d/argon2d_thread.h
@@ -0,0 +1,67 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#ifndef ARGON2_THREAD_H
+#define ARGON2_THREAD_H
+
+#if !defined(ARGON2_NO_THREADS)
+
+/*
+        Here we implement an abstraction layer for the simple requirements
+        of the Argon2 code. We only require 3 primitives---thread creation,
+        joining, and termination---so full emulation of the pthreads API
+        is unwarranted. Currently we wrap pthreads and Win32 threads.
+
+        The API defines 2 types: the function pointer type,
+        argon2_thread_func_t, and the type of the thread
+        handle---argon2_thread_handle_t.
+*/
+#if defined(_WIN32)
+#include <process.h>
+typedef unsigned(__stdcall *argon2_thread_func_t)(void *);
+typedef uintptr_t argon2_thread_handle_t;
+#else
+#include <pthread.h>
+typedef void *(*argon2_thread_func_t)(void *);
+typedef pthread_t argon2_thread_handle_t;
+#endif
+
+/* Creates a thread
+ * @param handle pointer to a thread handle, which is the output of this
+ * function. Must not be NULL.
+ * @param func A function pointer for the thread's entry point. Must not be
+ * NULL.
+ * @param args Pointer that is passed as an argument to @func. May be NULL.
+ * @return 0 if @handle and @func are valid pointers and a thread is successfully
+ * created.
+ */
+int argon2_thread_create(argon2_thread_handle_t *handle,
+                         argon2_thread_func_t func, void *args);
+
+/* Waits for a thread to terminate
+ * @param handle Handle to a thread created with argon2_thread_create.
+ * @return 0 if @handle is a valid handle, and joining completed successfully.
+*/
+int argon2_thread_join(argon2_thread_handle_t handle);
+
+/* Terminate the current thread. Must be run inside a thread created by
+ * argon2_thread_create.
+*/
+void argon2_thread_exit(void);
+
+#endif /* ARGON2_NO_THREADS */
+#endif
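
Note on how these primitives are used: the lane scheduler in core.c (fill_memory_blocks_mt) creates one worker per lane within a slice. Once `threads` workers are running, it joins the worker started `threads` lanes earlier before creating the next one. The sketch below simulates that window without real threads so the ordering is easy to inspect. The function name and the `join_order` buffer are hypothetical, and it assumes threads <= lanes, which the caller is expected to guarantee.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simulate one slice of the scheduler.
 * Writes lane indices into join_order (capacity: lanes) in the order they
 * would be joined, and returns the peak number of live workers.
 */
uint32_t sketch_schedule_slice(uint32_t lanes, uint32_t threads,
                               uint32_t *join_order) {
    uint32_t live = 0, peak = 0, joined = 0, l;

    assert(lanes > 0 && threads > 0 && threads <= lanes);

    for (l = 0; l < lanes; ++l) {
        if (l >= threads) {
            /* Window is full: join the oldest outstanding worker. */
            join_order[joined++] = l - threads;
            --live;
        }
        ++live; /* "create" the worker for lane l */
        if (live > peak) {
            peak = live;
        }
    }

    /* Join the workers still running at the end of the slice. */
    for (l = lanes - threads; l < lanes; ++l) {
        join_order[joined++] = l;
        --live;
    }

    assert(live == 0 && joined == lanes);
    return peak;
}
```

The window always waits on the oldest worker, not on whichever finishes first, so a slow early lane can delay the start of later lanes even when other workers are idle.
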
--- a/algo/argon2/argon2d/argon2d/core.c
+++ b/algo/argon2/argon2d/argon2d/core.c
@@ -0,0 +1,635 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+/*For memory wiping*/
+#ifdef _MSC_VER
+#include <windows.h>
+#include <winbase.h> /* For SecureZeroMemory */
+#endif
+#if defined __STDC_LIB_EXT1__
+#define __STDC_WANT_LIB_EXT1__ 1
+#endif
+#define VC_GE_2005(version) (version >= 1400)
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "core.h"
+#include "argon2d_thread.h"
+#include "../blake2/blake2.h"
+#include "../blake2/blake2-impl.h"
+
+#ifdef GENKAT
+#include "genkat.h"
+#endif
+
+#if defined(__clang__)
+#if __has_attribute(optnone)
+#define NOT_OPTIMIZED __attribute__((optnone))
+#endif
+#elif defined(__GNUC__)
+#define GCC_VERSION                                                            \
+    (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)
+#if GCC_VERSION >= 40400
+#define NOT_OPTIMIZED __attribute__((optimize("O0")))
+#endif
+#endif
+#ifndef NOT_OPTIMIZED
+#define NOT_OPTIMIZED
+#endif
+
+/***************Instance and Position constructors**********/
+void init_block_value(block *b, uint8_t in) { memset(b->v, in, sizeof(b->v)); }
+
+void copy_block(block *dst, const block *src) {
+    memcpy(dst->v, src->v, sizeof(uint64_t) * ARGON2_QWORDS_IN_BLOCK);
+}
+
+void xor_block(block *dst, const block *src) {
+    int i;
+    for (i = 0; i < ARGON2_QWORDS_IN_BLOCK; ++i) {
+        dst->v[i] ^= src->v[i];
+    }
+}
+
+static void load_block(block *dst, const void *input) {
+    unsigned i;
+    for (i = 0; i < ARGON2_QWORDS_IN_BLOCK; ++i) {
+        dst->v[i] = load64((const uint8_t *)input + i * sizeof(dst->v[i]));
+    }
+}
+
+static void store_block(void *output, const block *src) {
+    unsigned i;
+    for (i = 0; i < ARGON2_QWORDS_IN_BLOCK; ++i) {
+        store64((uint8_t *)output + i * sizeof(src->v[i]), src->v[i]);
+    }
+}
+
+/***************Memory functions*****************/
+
+int allocate_memory(const argon2_context *context, uint8_t **memory,
+                    size_t num, size_t size) {
+    size_t memory_size = num*size;
+    if (memory == NULL) {
+        return ARGON2_MEMORY_ALLOCATION_ERROR;
+    }
+
+    /* 1. Check for multiplication overflow */
+    if (size != 0 && memory_size / size != num) {
+        return ARGON2_MEMORY_ALLOCATION_ERROR;
+    }
+
+    /* 2. Try to allocate with appropriate allocator */
+    if (context->allocate_cbk) {
+        (context->allocate_cbk)(memory, memory_size);
+    } else {
+        *memory = malloc(memory_size);
+    }
+
+    if (*memory == NULL) {
+        return ARGON2_MEMORY_ALLOCATION_ERROR;
+    }
+
+    return ARGON2_OK;
+}
+
+void free_memory(const argon2_context *context, uint8_t *memory,
+                 size_t num, size_t size) {
+    size_t memory_size = num*size;
+//    clear_internal_memory(memory, memory_size);
+    if (context->free_cbk) {
+        (context->free_cbk)(memory, memory_size);
+    } else {
+        free(memory);
+    }
+}
+
+void NOT_OPTIMIZED secure_wipe_memory(void *v, size_t n) {
+#if defined(_MSC_VER) && VC_GE_2005(_MSC_VER)
+    SecureZeroMemory(v, n);
+#elif defined memset_s
+    memset_s(v, n, 0, n);
+#elif defined(__OpenBSD__)
+    explicit_bzero(v, n);
+#else
+    static void *(*const volatile memset_sec)(void *, int, size_t) = &memset;
+    memset_sec(v, 0, n);
+#endif
+}
+
+/* Memory clear flag; disabled (0) in this build, so nothing is wiped. */
+int FLAG_clear_internal_memory = 0;
+void clear_internal_memory(void *v, size_t n) {
+  if (FLAG_clear_internal_memory && v) {
+//    secure_wipe_memory(v, n);
+  }
+}
+
+void finalize(const argon2_context *context, argon2_instance_t *instance) {
+    if (context != NULL && instance != NULL) {
+        block blockhash;
+        uint32_t l;
+
+        copy_block(&blockhash, instance->memory + instance->lane_length - 1);
+
+        /* XOR the last blocks */
+        for (l = 1; l < instance->lanes; ++l) {
+            uint32_t last_block_in_lane =
+                l * instance->lane_length + (instance->lane_length - 1);
+            xor_block(&blockhash, instance->memory + last_block_in_lane);
+        }
+
+        /* Hash the result */
+        {
+            uint8_t blockhash_bytes[ARGON2_BLOCK_SIZE];
+            store_block(blockhash_bytes, &blockhash);
+            blake2b_long(context->out, context->outlen, blockhash_bytes,
+                         ARGON2_BLOCK_SIZE);
+            /* clear blockhash and blockhash_bytes */
+            clear_internal_memory(blockhash.v, ARGON2_BLOCK_SIZE);
+            clear_internal_memory(blockhash_bytes, ARGON2_BLOCK_SIZE);
+        }
+
+#ifdef GENKAT
+        print_tag(context->out, context->outlen);
+#endif
+
+        free_memory(context, (uint8_t *)instance->memory,
+                    instance->memory_blocks, sizeof(block));
+    }
+}
+
+uint32_t index_alpha(const argon2_instance_t *instance,
+                     const argon2_position_t *position, uint32_t pseudo_rand,
+                     int same_lane) {
+    /*
+     * Pass 0:
+     *      This lane : all already finished segments plus already constructed
+     * blocks in this segment
+     *      Other lanes : all already finished segments
+     * Pass 1+:
+     *      This lane : (SYNC_POINTS - 1) last segments plus already constructed
+     * blocks in this segment
+     *      Other lanes : (SYNC_POINTS - 1) last segments
+     */
+    uint32_t reference_area_size;
+    uint64_t relative_position;
+    uint32_t start_position, absolute_position;
+
+    if (0 == position->pass) {
+        /* First pass */
+        if (0 == position->slice) {
+            /* First slice */
+            reference_area_size =
+                position->index - 1; /* all but the previous */
+        } else {
+            if (same_lane) {
+                /* The same lane => add current segment */
+                reference_area_size =
+                    position->slice * instance->segment_length +
+                    position->index - 1;
+            } else {
+                reference_area_size =
+                    position->slice * instance->segment_length +
+                    ((position->index == 0) ? (-1) : 0);
+            }
+        }
+    } else {
+        /* Second and later passes */
+        if (same_lane) {
+            reference_area_size = instance->lane_length -
+                                  instance->segment_length + position->index -
+                                  1;
+        } else {
+            reference_area_size = instance->lane_length -
+                                  instance->segment_length +
+                                  ((position->index == 0) ? (-1) : 0);
+        }
+    }
+
+    /* 1.2.4. Mapping pseudo_rand to 0..<reference_area_size-1> and produce
+     * relative position */
+    relative_position = pseudo_rand;
+    relative_position = relative_position * relative_position >> 32;
+    relative_position = reference_area_size - 1 -
+                        (reference_area_size * relative_position >> 32);
+
+    /* 1.2.5 Computing starting position */
+    start_position = 0;
+
+    if (0 != position->pass) {
+        start_position = (position->slice == ARGON2_SYNC_POINTS - 1)
+                             ? 0
+                             : (position->slice + 1) * instance->segment_length;
+    }
+
+    /* 1.2.6. Computing absolute position */
+    absolute_position = (start_position + relative_position) %
+                        instance->lane_length; /* absolute position */
+    return absolute_position;
+}
+
+/* Single-threaded version for p=1 case */
+static int fill_memory_blocks_st(argon2_instance_t *instance) {
+    uint32_t r, s, l;
+
+    for (r = 0; r < instance->passes; ++r) {
+        for (s = 0; s < ARGON2_SYNC_POINTS; ++s) {
+            for (l = 0; l < instance->lanes; ++l) {
+                argon2_position_t position = {r, l, (uint8_t)s, 0};
+                fill_segment(instance, position);
+            }
+        }
+#ifdef GENKAT
+        internal_kat(instance, r); /* Print all memory blocks */
+#endif
+    }
+    return ARGON2_OK;
+}
+
+#if !defined(ARGON2_NO_THREADS)
+
+#ifdef _WIN32
+static unsigned __stdcall fill_segment_thr(void *thread_data)
+#else
+static void *fill_segment_thr(void *thread_data)
+#endif
+{
+    argon2_thread_data *my_data = thread_data;
+    fill_segment(my_data->instance_ptr, my_data->pos);
+    argon2_thread_exit();
+    return 0;
+}
+
+/* Multi-threaded version for p > 1 case */
+static int fill_memory_blocks_mt(argon2_instance_t *instance) {
+    uint32_t r, s;
+    argon2_thread_handle_t *thread = NULL;
+    argon2_thread_data *thr_data = NULL;
+    int rc = ARGON2_OK;
+
+    /* 1. Allocating space for threads */
+    thread = calloc(instance->lanes, sizeof(argon2_thread_handle_t));
+    if (thread == NULL) {
+        rc = ARGON2_MEMORY_ALLOCATION_ERROR;
+        goto fail;
+    }
+
+    thr_data = calloc(instance->lanes, sizeof(argon2_thread_data));
+    if (thr_data == NULL) {
+        rc = ARGON2_MEMORY_ALLOCATION_ERROR;
+        goto fail;
+    }
+
+    for (r = 0; r < instance->passes; ++r) {
+        for (s = 0; s < ARGON2_SYNC_POINTS; ++s) {
+            uint32_t l;
+
+            /* 2. Calling threads */
+            for (l = 0; l < instance->lanes; ++l) {
+                argon2_position_t position;
+
+                /* 2.1 Join a thread if limit is exceeded */
+                if (l >= instance->threads) {
+                    if (argon2_thread_join(thread[l - instance->threads])) {
+                        rc = ARGON2_THREAD_FAIL;
+                        goto fail;
+                    }
+                }
+
+                /* 2.2 Create thread */
+                position.pass = r;
+                position.lane = l;
+                position.slice = (uint8_t)s;
+                position.index = 0;
+                thr_data[l].instance_ptr =
+                    instance; /* preparing the thread input */
+                memcpy(&(thr_data[l].pos), &position,
+                       sizeof(argon2_position_t));
+                if (argon2_thread_create(&thread[l], &fill_segment_thr,
+                                         (void *)&thr_data[l])) {
+                    rc = ARGON2_THREAD_FAIL;
+                    goto fail;
+                }
+
+                /* fill_segment(instance, position); */
+                /*Non-thread equivalent of the lines above */
+            }
+
+            /* 3. Joining remaining threads */
+            for (l = instance->lanes - instance->threads; l < instance->lanes;
+                 ++l) {
+                if (argon2_thread_join(thread[l])) {
+                    rc = ARGON2_THREAD_FAIL;
+                    goto fail;
+                }
+            }
+        }
+
+#ifdef GENKAT
+        internal_kat(instance, r); /* Print all memory blocks */
+#endif
+    }
+
+fail:
+    if (thread != NULL) {
+        free(thread);
+    }
+    if (thr_data != NULL) {
+        free(thr_data);
+    }
+    return rc;
+}
+
+#endif /* ARGON2_NO_THREADS */
+
+int fill_memory_blocks(argon2_instance_t *instance) {
+    if (instance == NULL || instance->lanes == 0) {
+        return ARGON2_INCORRECT_PARAMETER;
+    }
+#if defined(ARGON2_NO_THREADS)
+    return fill_memory_blocks_st(instance);
+#else
+    return instance->threads == 1 ?
+        fill_memory_blocks_st(instance) : fill_memory_blocks_mt(instance);
+#endif
+}
+
+int validate_inputs(const argon2_context *context) {
+    if (NULL == context) {
+        return ARGON2_INCORRECT_PARAMETER;
+    }
+
+    if (NULL == context->out) {
+        return ARGON2_OUTPUT_PTR_NULL;
+    }
+
+    /* Validate output length */
+    if (ARGON2_MIN_OUTLEN > context->outlen) {
+        return ARGON2_OUTPUT_TOO_SHORT;
+    }
+
+    if (ARGON2_MAX_OUTLEN < context->outlen) {
+        return ARGON2_OUTPUT_TOO_LONG;
+    }
+
+    /* Validate password (required param) */
+    if (NULL == context->pwd) {
+        if (0 != context->pwdlen) {
+            return ARGON2_PWD_PTR_MISMATCH;
+        }
+    }
+
+    if (ARGON2_MIN_PWD_LENGTH > context->pwdlen) {
+        return ARGON2_PWD_TOO_SHORT;
+    }
+
+    if (ARGON2_MAX_PWD_LENGTH < context->pwdlen) {
+        return ARGON2_PWD_TOO_LONG;
+    }
+
+    /* Validate salt (required param) */
+    if (NULL == context->salt) {
+        if (0 != context->saltlen) {
+            return ARGON2_SALT_PTR_MISMATCH;
+        }
+    }
+
+    if (ARGON2_MIN_SALT_LENGTH > context->saltlen) {
+        return ARGON2_SALT_TOO_SHORT;
+    }
+
+    if (ARGON2_MAX_SALT_LENGTH < context->saltlen) {
+        return ARGON2_SALT_TOO_LONG;
+    }
+
+    /* Validate secret (optional param) */
+    if (NULL == context->secret) {
+        if (0 != context->secretlen) {
+            return ARGON2_SECRET_PTR_MISMATCH;
+        }
+    } else {
+        if (ARGON2_MIN_SECRET > context->secretlen) {
+            return ARGON2_SECRET_TOO_SHORT;
+        }
+        if (ARGON2_MAX_SECRET < context->secretlen) {
+            return ARGON2_SECRET_TOO_LONG;
+        }
+    }
+
+    /* Validate associated data (optional param) */
+    if (NULL == context->ad) {
+        if (0 != context->adlen) {
+            return ARGON2_AD_PTR_MISMATCH;
+        }
+    } else {
+        if (ARGON2_MIN_AD_LENGTH > context->adlen) {
+            return ARGON2_AD_TOO_SHORT;
+        }
+        if (ARGON2_MAX_AD_LENGTH < context->adlen) {
+            return ARGON2_AD_TOO_LONG;
+        }
+    }
+
+    /* Validate memory cost */
+    if (ARGON2_MIN_MEMORY > context->m_cost) {
+        return ARGON2_MEMORY_TOO_LITTLE;
+    }
+
+    if (ARGON2_MAX_MEMORY < context->m_cost) {
+        return ARGON2_MEMORY_TOO_MUCH;
+    }
+
+    if (context->m_cost < 8 * context->lanes) {
+        return ARGON2_MEMORY_TOO_LITTLE;
+    }
+
+    /* Validate time cost */
+    if (ARGON2_MIN_TIME > context->t_cost) {
+        return ARGON2_TIME_TOO_SMALL;
+    }
+
+    if (ARGON2_MAX_TIME < context->t_cost) {
+        return ARGON2_TIME_TOO_LARGE;
+    }
+
+    /* Validate lanes */
+    if (ARGON2_MIN_LANES > context->lanes) {
+        return ARGON2_LANES_TOO_FEW;
+    }
+
+    if (ARGON2_MAX_LANES < context->lanes) {
+        return ARGON2_LANES_TOO_MANY;
+    }
+
+    /* Validate threads */
+    if (ARGON2_MIN_THREADS > context->threads) {
+        return ARGON2_THREADS_TOO_FEW;
+    }
+
+    if (ARGON2_MAX_THREADS < context->threads) {
+        return ARGON2_THREADS_TOO_MANY;
+    }
+
+    if (NULL != context->allocate_cbk && NULL == context->free_cbk) {
+        return ARGON2_FREE_MEMORY_CBK_NULL;
+    }
+
+    if (NULL == context->allocate_cbk && NULL != context->free_cbk) {
+        return ARGON2_ALLOCATE_MEMORY_CBK_NULL;
+    }
+
+    return ARGON2_OK;
+}
+
+void fill_first_blocks(uint8_t *blockhash, const argon2_instance_t *instance) {
+    uint32_t l;
+    /* Compute the first and second block of each lane as G(H0||0||i) and
+       G(H0||1||i) */
+    uint8_t blockhash_bytes[ARGON2_BLOCK_SIZE];
+    for (l = 0; l < instance->lanes; ++l) {
+
+        store32(blockhash + ARGON2_PREHASH_DIGEST_LENGTH, 0);
+        store32(blockhash + ARGON2_PREHASH_DIGEST_LENGTH + 4, l);
+        blake2b_long(blockhash_bytes, ARGON2_BLOCK_SIZE, blockhash,
+                     ARGON2_PREHASH_SEED_LENGTH);
+        load_block(&instance->memory[l * instance->lane_length + 0],
+                   blockhash_bytes);
+
+        store32(blockhash + ARGON2_PREHASH_DIGEST_LENGTH, 1);
+        blake2b_long(blockhash_bytes, ARGON2_BLOCK_SIZE, blockhash,
+                     ARGON2_PREHASH_SEED_LENGTH);
+        load_block(&instance->memory[l * instance->lane_length + 1],
+                   blockhash_bytes);
+    }
+    clear_internal_memory(blockhash_bytes, ARGON2_BLOCK_SIZE);
+}
+
+void initial_hash(uint8_t *blockhash, argon2_context *context,
+                  argon2_type type) {
+    blake2b_state BlakeHash;
+    uint8_t value[sizeof(uint32_t)];
+
+    if (NULL == context || NULL == blockhash) {
+        return;
+    }
+
+    blake2b_init(&BlakeHash, ARGON2_PREHASH_DIGEST_LENGTH);
+
+    store32(&value, context->lanes);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    store32(&value, context->outlen);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    store32(&value, context->m_cost);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    store32(&value, context->t_cost);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+//    store32(&value, ARGON2_VERSION_NUMBER);
+    store32(&value, context->version);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    store32(&value, (uint32_t)type);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    store32(&value, context->pwdlen);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    if (context->pwd != NULL) {
+        blake2b_update(&BlakeHash, (const uint8_t *)context->pwd,
+                       context->pwdlen);
+
+        if (context->flags & ARGON2_FLAG_CLEAR_PASSWORD) {
+//            secure_wipe_memory(context->pwd, context->pwdlen);
+            context->pwdlen = 0;
+        }
+    }
+
+    store32(&value, context->saltlen);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    if (context->salt != NULL) {
+        blake2b_update(&BlakeHash, (const uint8_t *)context->salt,
+                       context->saltlen);
+    }
+
+    store32(&value, context->secretlen);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    if (context->secret != NULL) {
+        blake2b_update(&BlakeHash, (const uint8_t *)context->secret,
+                       context->secretlen);
+
+        if (context->flags & ARGON2_FLAG_CLEAR_SECRET) {
+//            secure_wipe_memory(context->secret, context->secretlen);
+            context->secretlen = 0;
+        }
+    }
+
+    store32(&value, context->adlen);
+    blake2b_update(&BlakeHash, (const uint8_t *)&value, sizeof(value));
+
+    if (context->ad != NULL) {
+        blake2b_update(&BlakeHash, (const uint8_t *)context->ad,
+                       context->adlen);
+    }
+
+    blake2b_final(&BlakeHash, blockhash, ARGON2_PREHASH_DIGEST_LENGTH);
+}
+
+int initialize(argon2_instance_t *instance, argon2_context *context) {
+    uint8_t blockhash[ARGON2_PREHASH_SEED_LENGTH];
+    int result = ARGON2_OK;
+
+    if (instance == NULL || context == NULL)
+        return ARGON2_INCORRECT_PARAMETER;
+    instance->context_ptr = context;
+
+    /* 1. Memory allocation */
+    result = allocate_memory(context, (uint8_t **)&(instance->memory),
+                             instance->memory_blocks, sizeof(block));
+    if (result != ARGON2_OK) {
+        return result;
+    }
+
+    /* 2. Initial hashing */
+    /* H_0 + 8 extra bytes to produce the first blocks */
+    /* uint8_t blockhash[ARGON2_PREHASH_SEED_LENGTH]; */
+    /* Hashing all inputs */
+    initial_hash(blockhash, context, instance->type);
+    /* Zeroing 8 extra bytes */
+    clear_internal_memory(blockhash + ARGON2_PREHASH_DIGEST_LENGTH,
+                          ARGON2_PREHASH_SEED_LENGTH -
+                              ARGON2_PREHASH_DIGEST_LENGTH);
+
+#ifdef GENKAT
+    initial_kat(blockhash, context, instance->type);
+#endif
+
+    /* 3. Creating first blocks, we always have at least two blocks in a slice
+     */
+    fill_first_blocks(blockhash, instance);
+    /* Clearing the hash */
+    clear_internal_memory(blockhash, ARGON2_PREHASH_SEED_LENGTH);
+
+    return ARGON2_OK;
+}
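Note on initial_hash above: every variable-length input (pwd, salt, secret, ad) is fed to BLAKE2b as a 32-bit length followed by the bytes themselves. The sketch below illustrates why that framing matters, and is not the project's code. The `transcript` type and the `absorb*` helpers are hypothetical stand-ins for the BLAKE2b state and `blake2b_update`, and `store32_le` assumes that `store32` writes little-endian, as it does in the upstream reference implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the BLAKE2b state: it only records the bytes
 * that would be absorbed, so the framing can be inspected. */
typedef struct {
    uint8_t buf[256];
    size_t len;
} transcript;

/* Little-endian 32-bit store (assumed behaviour of store32). */
static void store32_le(uint8_t *dst, uint32_t w) {
    dst[0] = (uint8_t)(w);
    dst[1] = (uint8_t)(w >> 8);
    dst[2] = (uint8_t)(w >> 16);
    dst[3] = (uint8_t)(w >> 24);
}

/* Append raw bytes; returns 0 on success, -1 if the transcript is full. */
static int absorb(transcript *t, const void *data, size_t n) {
    if (n > sizeof(t->buf) - t->len) return -1;
    if (n > 0) memcpy(t->buf + t->len, data, n);
    t->len += n;
    return 0;
}

/* Absorb a 32-bit length, then the field itself, mirroring the
 * store32 + blake2b_update pairs in initial_hash. */
static int absorb_field(transcript *t, const void *data, uint32_t n) {
    uint8_t prefix[4];
    store32_le(prefix, n);
    if (absorb(t, prefix, sizeof(prefix)) != 0) return -1;
    if (data == NULL) return 0; /* a NULL field contributes only its length */
    return absorb(t, data, n);
}

/* Returns 1 if two (pwd, salt) pairs produce identical transcripts. */
static int same_transcript(const char *pwd1, const char *salt1,
                           const char *pwd2, const char *salt2) {
    transcript a = {{0}, 0}, b = {{0}, 0};
    absorb_field(&a, pwd1, (uint32_t)strlen(pwd1));
    absorb_field(&a, salt1, (uint32_t)strlen(salt1));
    absorb_field(&b, pwd2, (uint32_t)strlen(pwd2));
    absorb_field(&b, salt2, (uint32_t)strlen(salt2));
    return a.len == b.len && memcmp(a.buf, b.buf, a.len) == 0;
}
```

Without the length prefixes, ("ab", "c") and ("a", "bc") would both be absorbed as "abc"; with them, `same_transcript("ab", "c", "a", "bc")` returns 0.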
--- a/algo/argon2/argon2d/argon2d/core.h
+++ b/algo/argon2/argon2d/argon2d/core.h
@@ -0,0 +1,228 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#ifndef ARGON2_CORE_H
+#define ARGON2_CORE_H
+
+#include "argon2.h"
+
+#define CONST_CAST(x) (x)(uintptr_t)
+
+/**********************Argon2 internal constants*******************************/
+
+enum argon2_core_constants {
+    /* Memory block size in bytes */
+    ARGON2_BLOCK_SIZE = 1024,
+    ARGON2_QWORDS_IN_BLOCK = ARGON2_BLOCK_SIZE / 8,
+    ARGON2_OWORDS_IN_BLOCK = ARGON2_BLOCK_SIZE / 16,
+    ARGON2_HWORDS_IN_BLOCK = ARGON2_BLOCK_SIZE / 32,
+    ARGON2_512BIT_WORDS_IN_BLOCK = ARGON2_BLOCK_SIZE / 64,
+
+    /* Number of pseudo-random values generated by one call to Blake in Argon2i
+       to
+       generate reference block positions */
+    ARGON2_ADDRESSES_IN_BLOCK = 128,
+
+    /* Pre-hashing digest length and its extension */
+    ARGON2_PREHASH_DIGEST_LENGTH = 64,
+    ARGON2_PREHASH_SEED_LENGTH = 72
+};
+
+/*************************Argon2 internal data types***********************/
+
+/*
+ * Structure for the (1KB) memory block implemented as 128 64-bit words.
+ * Memory blocks can be copied, XORed. Internal words can be accessed by [] (no
+ * bounds checking).
+ */
+typedef struct block_ { uint64_t v[ARGON2_QWORDS_IN_BLOCK]; } block;
+
+/*****************Functions that work with the block******************/
+
+/* Initialize each byte of the block with @in */
+void init_block_value(block *b, uint8_t in);
+
+/* Copy block @src to block @dst */
+void copy_block(block *dst, const block *src);
+
+/* XOR @src onto @dst bytewise */
+void xor_block(block *dst, const block *src);
+
+/*
+ * Argon2 instance: memory pointer, number of passes, amount of memory, type,
+ * and derived values.
+ * Used to evaluate the number and location of blocks to construct in each
+ * thread
+ */
+typedef struct Argon2_instance_t {
+    block *memory;          /* Memory pointer */
+    uint32_t version;
+    uint32_t passes;        /* Number of passes */
+    uint32_t memory_blocks; /* Number of blocks in memory */
+    uint32_t segment_length;
+    uint32_t lane_length;
+    uint32_t lanes;
+    uint32_t threads;
+    argon2_type type;
+    int print_internals; /* whether to print the memory blocks */
+    argon2_context *context_ptr; /* points back to original context */
+} argon2_instance_t;
+
+/*
+ * Argon2 position: where we construct the block right now. Used to distribute
+ * work between threads.
+ */
+typedef struct Argon2_position_t {
+    uint32_t pass;
+    uint32_t lane;
+    uint8_t slice;
+    uint32_t index;
+} argon2_position_t;
+
+/* Struct that holds the inputs for thread handling FillSegment */
+typedef struct Argon2_thread_data {
+    argon2_instance_t *instance_ptr;
+    argon2_position_t pos;
+} argon2_thread_data;
+
+/*************************Argon2 core functions********************************/
+
+/* Allocates memory to the given pointer, uses the appropriate allocator as
+ * specified in the context. Total allocated memory is num*size.
+ * @param context argon2_context which specifies the allocator
+ * @param memory pointer to the pointer to the memory
+ * @param num the number of elements to be allocated
+ * @param size the size in bytes for each element to be allocated
+ * @return ARGON2_OK if @memory is a valid pointer and memory is allocated
+ */
+int allocate_memory(const argon2_context *context, uint8_t **memory,
+                    size_t num, size_t size);
+
+/*
+ * Frees memory at the given pointer, uses the appropriate deallocator as
+ * specified in the context. Also cleans the memory using clear_internal_memory.
+ * @param context argon2_context which specifies the deallocator
+ * @param memory pointer to buffer to be freed
+ * @param size the size in bytes for each element to be deallocated
+ * @param num the number of elements to be deallocated
+ */
+void free_memory(const argon2_context *context, uint8_t *memory,
+                 size_t num, size_t size);
+
+/* Function that securely cleans the memory. This ignores any flags set
+ * regarding clearing memory. Usually one just calls clear_internal_memory.
+ * @param v Pointer to the memory
+ * @param n Memory size in bytes
+ */
+void secure_wipe_memory(void *v, size_t n);
+
+/* Function that securely clears the memory if FLAG_clear_internal_memory is
+ * set. If the flag isn't set, this function does nothing.
+ * @param v Pointer to the memory
+ * @param n Memory size in bytes
+ */
+void clear_internal_memory(void *v, size_t n);
+
+/*
+ * Computes absolute position of reference block in the lane following a skewed
+ * distribution and using a pseudo-random value as input
+ * @param instance Pointer to the current instance
+ * @param position Pointer to the current position
+ * @param pseudo_rand 32-bit pseudo-random value used to determine the position
+ * @param same_lane Indicates if the block will be taken from the current lane.
+ * If so we can reference the current segment
+ * @pre All pointers must be valid
+ */
+uint32_t index_alpha(const argon2_instance_t *instance,
+                     const argon2_position_t *position, uint32_t pseudo_rand,
+                     int same_lane);
+
+/*
+ * Function that validates all inputs against predefined restrictions and return
+ * an error code
+ * @param context Pointer to current Argon2 context
+ * @return ARGON2_OK if everything is all right, otherwise one of error codes
+ * (all defined in <argon2.h>)
+ */
+int validate_inputs(const argon2_context *context);
+
+/*
+ * Hashes all the inputs into @a blockhash[PREHASH_DIGEST_LENGTH], clears
+ * password and secret if needed
+ * @param  context  Pointer to the Argon2 internal structure containing memory
+ * pointer, and parameters for time and space requirements.
+ * @param  blockhash Buffer for pre-hashing digest
+ * @param  type Argon2 type
+ * @pre    @a blockhash must have at least @a PREHASH_DIGEST_LENGTH bytes
+ * allocated
+ */
+void initial_hash(uint8_t *blockhash, argon2_context *context,
+                  argon2_type type);
+
+/*
+ * Function creates first 2 blocks per lane
+ * @param instance Pointer to the current instance
+ * @param blockhash Pointer to the pre-hashing digest
+ * @pre blockhash must point to @a PREHASH_SEED_LENGTH allocated values
+ */
+void fill_first_blocks(uint8_t *blockhash, const argon2_instance_t *instance);
+
+/*
+ * Function allocates memory, hashes the inputs with Blake, and creates the
+ * first two blocks of each lane. On success, instance->memory points to the
+ * main memory with 2 blocks per lane initialized
+ * @param  instance Current Argon2 instance
+ * @param  context  Pointer to the Argon2 internal structure containing memory
+ * pointer, and parameters for time and space requirements.
+ * @return ARGON2_OK if successful, otherwise an Argon2 error code, e.g. when
+ * memory fails to allocate.
+ */
+int initialize(argon2_instance_t *instance, argon2_context *context);
+
+/*
+ * XORing the last block of each lane, hashing it, making the tag. Deallocates
+ * the memory.
+ * @param context Pointer to current Argon2 context (use only the out parameters
+ * from it)
+ * @param instance Pointer to current instance of Argon2
+ * @pre instance->memory must point to necessary amount of memory
+ * @pre context->out must point to outlen bytes of memory
+ * @pre if context->free_cbk is not NULL, it should point to a function that
+ * deallocates memory
+ */
+void finalize(const argon2_context *context, argon2_instance_t *instance);
+
+/*
+ * Function that fills the segment using previous segments, including
+ * segments filled by other threads
+ * @param instance Pointer to the current instance
+ * @param position Current position (pass, lane, slice, and index), passed by
+ * value
+ * @pre all block pointers must be valid
+ */
+void fill_segment(const argon2_instance_t *instance,
+                  argon2_position_t position);
+
+/*
+ * Function that fills the entire memory t_cost times based on the first two
+ * blocks in each lane
+ * @param instance Pointer to the current instance
+ * @return ARGON2_OK if successful, otherwise an Argon2 error code
+ */
+int fill_memory_blocks(argon2_instance_t *instance);
+
+#endif
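Note on the constants in core.h above: the four `*_IN_BLOCK` values are the same 1024-byte block viewed at 64-, 128-, 256-, and 512-bit granularity, which is what lets the scalar, AVX2, and AVX-512 paths index the same memory. A minimal sketch of that arithmetic follows. The names `demo_block`, `arena_bytes`, and `demo_xor_block` are hypothetical, and the overflow check is a precaution added for this sketch rather than something visible in the diff.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    BLOCK_SIZE = 1024,                 /* mirrors ARGON2_BLOCK_SIZE */
    QWORDS_IN_BLOCK = BLOCK_SIZE / 8,  /* 64-bit words */
    OWORDS_IN_BLOCK = BLOCK_SIZE / 16, /* 128-bit words */
    HWORDS_IN_BLOCK = BLOCK_SIZE / 32, /* 256-bit words (AVX2 path) */
    ZWORDS_IN_BLOCK = BLOCK_SIZE / 64  /* 512-bit words (AVX-512 path) */
};

/* Same layout as the `block` typedef: 128 64-bit words. */
typedef struct { uint64_t v[QWORDS_IN_BLOCK]; } demo_block;

/* Total bytes needed for `memory_blocks` blocks (num * size, as in the
 * allocate_memory call made by initialize). Returns 0 if the product
 * would overflow size_t. */
static size_t arena_bytes(uint32_t memory_blocks) {
    size_t num = memory_blocks;
    if (num > SIZE_MAX / sizeof(demo_block)) return 0;
    return num * sizeof(demo_block);
}

/* XOR src onto dst, one 64-bit word at a time. */
static void demo_xor_block(demo_block *dst, const demo_block *src) {
    size_t i;
    for (i = 0; i < QWORDS_IN_BLOCK; i++) {
        dst->v[i] ^= src->v[i];
    }
}
```

For example, 4096 blocks occupy 4096 * 1024 = 4194304 bytes (4 MiB).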
--- a/algo/argon2/argon2d/argon2d/encoding.c
+++ b/algo/argon2/argon2d/argon2d/encoding.c
@@ -0,0 +1,463 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <limits.h>
+#include "encoding.h"
+#include "core.h"
+
+/*
+ * Example code for a decoder and encoder of "hash strings", with Argon2
+ * parameters.
+ *
+ * This code comprises two sections:
+ *
+ *   -- The first section contains generic Base64 encoding and decoding
+ *   functions. It is conceptually applicable to any hash function
+ *   implementation that uses Base64 to encode and decode parameters,
+ *   salts and outputs. It could be made into a library, provided that
+ *   the relevant functions are made public (non-static) and be given
+ *   reasonable names to avoid collisions with other functions.
+ *
+ *   -- The second section is specific to Argon2. It encodes and decodes
+ *   the parameters, salts and outputs. It does not compute the hash
+ *   itself.
+ *
+ * The code was originally written by Thomas Pornin <pornin@bolet.org>,
+ * to whom comments and remarks may be sent. It is released under what
+ * should amount to Public Domain or its closest equivalent; the
+ * following mantra is supposed to incarnate that fact with all the
+ * proper legal rituals:
+ *
+ * ---------------------------------------------------------------------
+ * This file is provided under the terms of Creative Commons CC0 1.0
+ * Public Domain Dedication. To the extent possible under law, the
+ * author (Thomas Pornin) has waived all copyright and related or
+ * neighboring rights to this file. This work is published from: Canada.
+ * ---------------------------------------------------------------------
+ *
+ * Copyright (c) 2015 Thomas Pornin
+ */
+
+/* ==================================================================== */
+/*
+ * Common code; could be shared between different hash functions.
+ *
+ * Note: the Base64 functions below assume that uppercase letters (resp.
+ * lowercase letters) have consecutive numerical codes, that fit on 8
+ * bits. All modern systems use ASCII-compatible charsets, where these
+ * properties are true. If you are stuck with a dinosaur of a system
+ * that still defaults to EBCDIC then you already have much bigger
+ * interoperability issues to deal with.
+ */
+
+/*
+ * Some macros for constant-time comparisons. These work over values in
+ * the 0..255 range. Returned value is 0x00 on "false", 0xFF on "true".
+ */
+#define EQ(x, y) ((((0U - ((unsigned)(x) ^ (unsigned)(y))) >> 8) & 0xFF) ^ 0xFF)
+#define GT(x, y) ((((unsigned)(y) - (unsigned)(x)) >> 8) & 0xFF)
+#define GE(x, y) (GT(y, x) ^ 0xFF)
+#define LT(x, y) GT(y, x)
+#define LE(x, y) GE(y, x)
+
+/*
+ * Convert value x (0..63) to corresponding Base64 character.
+ */
+static int b64_byte_to_char(unsigned x) {
+    return (LT(x, 26) & (x + 'A')) |
+           (GE(x, 26) & LT(x, 52) & (x + ('a' - 26))) |
+           (GE(x, 52) & LT(x, 62) & (x + ('0' - 52))) | (EQ(x, 62) & '+') |
+           (EQ(x, 63) & '/');
+}
+
+/*
+ * Convert character c to the corresponding 6-bit value. If character c
+ * is not a Base64 character, then 0xFF (255) is returned.
+ */
+static unsigned b64_char_to_byte(int c) {
+    unsigned x;
+
+    x = (GE(c, 'A') & LE(c, 'Z') & (c - 'A')) |
+        (GE(c, 'a') & LE(c, 'z') & (c - ('a' - 26))) |
+        (GE(c, '0') & LE(c, '9') & (c - ('0' - 52))) | (EQ(c, '+') & 62) |
+        (EQ(c, '/') & 63);
+    return x | (EQ(x, 0) & (EQ(c, 'A') ^ 0xFF));
+}
+
+/*
+ * Convert some bytes to Base64. 'dst_len' is the length (in characters)
+ * of the output buffer 'dst'; if that buffer is not large enough to
+ * receive the result (including the terminating 0), then (size_t)-1
+ * is returned. Otherwise, the zero-terminated Base64 string is written
+ * in the buffer, and the output length (counted WITHOUT the terminating
+ * zero) is returned.
+ */
+static size_t to_base64(char *dst, size_t dst_len, const void *src,
+                        size_t src_len) {
+    size_t olen;
+    const unsigned char *buf;
+    unsigned acc, acc_len;
+
+    olen = (src_len / 3) << 2;
+    switch (src_len % 3) {
+    case 2:
+        olen++;
+    /* fall through */
+    case 1:
+        olen += 2;
+        break;
+    }
+    if (dst_len <= olen) {
+        return (size_t)-1;
+    }
+    acc = 0;
+    acc_len = 0;
+    buf = (const unsigned char *)src;
+    while (src_len-- > 0) {
+        acc = (acc << 8) + (*buf++);
+        acc_len += 8;
+        while (acc_len >= 6) {
+            acc_len -= 6;
+            *dst++ = (char)b64_byte_to_char((acc >> acc_len) & 0x3F);
+        }
+    }
+    if (acc_len > 0) {
+        *dst++ = (char)b64_byte_to_char((acc << (6 - acc_len)) & 0x3F);
+    }
+    *dst++ = 0;
+    return olen;
+}
+
+/*
+ * Decode Base64 chars into bytes. The '*dst_len' value must initially
+ * contain the length of the output buffer '*dst'; when the decoding
+ * ends, the actual number of decoded bytes is written back in
+ * '*dst_len'.
+ *
+ * Decoding stops when a non-Base64 character is encountered, or when
+ * the output buffer capacity is exceeded. If an error occurred (output
+ * buffer is too small, invalid last characters leading to unprocessed
+ * buffered bits), then NULL is returned; otherwise, the returned value
+ * points to the first non-Base64 character in the source stream, which
+ * may be the terminating zero.
+ */
+static const char *from_base64(void *dst, size_t *dst_len, const char *src) {
+    size_t len;
+    unsigned char *buf;
+    unsigned acc, acc_len;
+
+    buf = (unsigned char *)dst;
+    len = 0;
+    acc = 0;
+    acc_len = 0;
+    for (;;) {
+        unsigned d;
+
+        d = b64_char_to_byte(*src);
+        if (d == 0xFF) {
+            break;
+        }
+        src++;
+        acc = (acc << 6) + d;
+        acc_len += 6;
+        if (acc_len >= 8) {
+            acc_len -= 8;
+            if ((len++) >= *dst_len) {
+                return NULL;
+            }
+            *buf++ = (acc >> acc_len) & 0xFF;
+        }
+    }
+
+    /*
+     * If the input length is equal to 1 modulo 4 (which is
+     * invalid), then there will remain 6 unprocessed bits;
+     * otherwise, only 0, 2 or 4 bits are buffered. The buffered
+     * bits must also all be zero.
+     */
+    if (acc_len > 4 || (acc & (((unsigned)1 << acc_len) - 1)) != 0) {
+        return NULL;
+    }
+    *dst_len = len;
+    return src;
+}
+
+/*
+ * Decode decimal integer from 'str'; the value is written in '*v'.
+ * Returned value is a pointer to the next non-decimal character in the
+ * string. If there is no digit at all, or the value encoding is not
+ * minimal (extra leading zeros), or the value does not fit in an
+ * 'unsigned long', then NULL is returned.
+ */
+static const char *decode_decimal(const char *str, unsigned long *v) {
+    const char *orig;
+    unsigned long acc;
+
+    acc = 0;
+    for (orig = str;; str++) {
+        int c;
+
+        c = *str;
+        if (c < '0' || c > '9') {
+            break;
+        }
+        c -= '0';
+        if (acc > (ULONG_MAX / 10)) {
+            return NULL;
+        }
+        acc *= 10;
+        if ((unsigned long)c > (ULONG_MAX - acc)) {
+            return NULL;
+        }
+        acc += (unsigned long)c;
+    }
+    if (str == orig || (*orig == '0' && str != (orig + 1))) {
+        return NULL;
+    }
+    *v = acc;
+    return str;
+}
+
+/* ==================================================================== */
+/*
+ * Code specific to Argon2.
+ *
+ * The code below applies the following format:
+ *
+ *  $argon2<T>[$v=<num>]$m=<num>,t=<num>,p=<num>$<bin>$<bin>
+ *
+ * where <T> is either 'd', 'id', or 'i', <num> is a decimal integer (positive,
+ * fits in an 'unsigned long'), and <bin> is Base64-encoded data (no '=' padding
+ * characters, no newline or whitespace).
+ *
+ * The last two binary chunks (encoded in Base64) are, in that order,
+ * the salt and the output. Both are required. The binary salt length and the
+ * output length must be in the allowed ranges defined in argon2.h.
+ *
+ * The ctx struct must contain buffers large enough to hold the salt and the
+ * output when it is fed into decode_string.
+ */
+
+int decode_string(argon2_context *ctx, const char *str, argon2_type type) {
+
+/* check for prefix */
+#define CC(prefix)                                                             \
+    do {                                                                       \
+        size_t cc_len = strlen(prefix);                                        \
+        if (strncmp(str, prefix, cc_len) != 0) {                               \
+            return ARGON2_DECODING_FAIL;                                       \
+        }                                                                      \
+        str += cc_len;                                                         \
+    } while ((void)0, 0)
+
+/* optional prefix checking with supplied code */
+#define CC_opt(prefix, code)                                                   \
+    do {                                                                       \
+        size_t cc_len = strlen(prefix);                                        \
+        if (strncmp(str, prefix, cc_len) == 0) {                               \
+            str += cc_len;                                                     \
+            { code; }                                                          \
+        }                                                                      \
+    } while ((void)0, 0)
+
+/* Decoding prefix into decimal */
+#define DECIMAL(x)                                                             \
+    do {                                                                       \
+        unsigned long dec_x;                                                   \
+        str = decode_decimal(str, &dec_x);                                     \
+        if (str == NULL) {                                                     \
+            return ARGON2_DECODING_FAIL;                                       \
+        }                                                                      \
+        (x) = dec_x;                                                           \
+    } while ((void)0, 0)
+
+
+/* Decoding prefix into uint32_t decimal */
+#define DECIMAL_U32(x)                                                         \
+    do {                                                                       \
+        unsigned long dec_x;                                                   \
+        str = decode_decimal(str, &dec_x);                                     \
+        if (str == NULL || dec_x > UINT32_MAX) {                               \
+            return ARGON2_DECODING_FAIL;                                       \
+        }                                                                      \
+        (x) = (uint32_t)dec_x;                                                 \
+    } while ((void)0, 0)
+
+
+/* Decoding base64 into a binary buffer */
+#define BIN(buf, max_len, len)                                                 \
+    do {                                                                       \
+        size_t bin_len = (max_len);                                            \
+        str = from_base64(buf, &bin_len, str);                                 \
+        if (str == NULL || bin_len > UINT32_MAX) {                             \
+            return ARGON2_DECODING_FAIL;                                       \
+        }                                                                      \
+        (len) = (uint32_t)bin_len;                                             \
+    } while ((void)0, 0)
+
+    size_t maxsaltlen = ctx->saltlen;
+    size_t maxoutlen = ctx->outlen;
+    int validation_result;
+    const char* type_string;
+
+    /* We should start with the argon2_type we are using */
+    type_string = argon2_type2string(type, 0);
+    if (!type_string) {
+        return ARGON2_INCORRECT_TYPE;
+    }
+
+    CC("$");
+    CC(type_string);
+
+    /* Reading the version number if the default is suppressed */
+    ctx->version = ARGON2_VERSION_10;
+    CC_opt("$v=", DECIMAL_U32(ctx->version));
+
+    CC("$m=");
+    DECIMAL_U32(ctx->m_cost);
+    CC(",t=");
+    DECIMAL_U32(ctx->t_cost);
+    CC(",p=");
+    DECIMAL_U32(ctx->lanes);
+    ctx->threads = ctx->lanes;
+
+    CC("$");
+    BIN(ctx->salt, maxsaltlen, ctx->saltlen);
+    CC("$");
+    BIN(ctx->out, maxoutlen, ctx->outlen);
+
+    /* The rest of the fields get the default values */
+    ctx->secret = NULL;
+    ctx->secretlen = 0;
+    ctx->ad = NULL;
+    ctx->adlen = 0;
+    ctx->allocate_cbk = NULL;
+    ctx->free_cbk = NULL;
+    ctx->flags = ARGON2_DEFAULT_FLAGS;
+
+    /* On return, must have valid context */
+    validation_result = validate_inputs(ctx);
+    if (validation_result != ARGON2_OK) {
+        return validation_result;
+    }
+
+    /* Can't have any additional characters */
+    if (*str == 0) {
+        return ARGON2_OK;
+    } else {
+        return ARGON2_DECODING_FAIL;
+    }
+#undef CC
+#undef CC_opt
+#undef DECIMAL
+#undef BIN
+}
+
+int encode_string(char *dst, size_t dst_len, argon2_context *ctx,
+                  argon2_type type) {
+#define SS(str)                                                                \
+    do {                                                                       \
+        size_t pp_len = strlen(str);                                           \
+        if (pp_len >= dst_len) {                                               \
+            return ARGON2_ENCODING_FAIL;                                       \
+        }                                                                      \
+        memcpy(dst, str, pp_len + 1);                                          \
+        dst += pp_len;                                                         \
+        dst_len -= pp_len;                                                     \
+    } while ((void)0, 0)
+
+#define SX(x)                                                                  \
+    do {                                                                       \
+        char tmp[30];                                                          \
+        sprintf(tmp, "%lu", (unsigned long)(x));                               \
+        SS(tmp);                                                               \
+    } while ((void)0, 0)
+
+#define SB(buf, len)                                                           \
+    do {                                                                       \
+        size_t sb_len = to_base64(dst, dst_len, buf, len);                     \
+        if (sb_len == (size_t)-1) {                                            \
+            return ARGON2_ENCODING_FAIL;                                       \
+        }                                                                      \
+        dst += sb_len;                                                         \
+        dst_len -= sb_len;                                                     \
+    } while ((void)0, 0)
+
+    const char* type_string = argon2_type2string(type, 0);
+    int validation_result = validate_inputs(ctx);
+
+    if (!type_string) {
+      return ARGON2_ENCODING_FAIL;
+    }
+
+    if (validation_result != ARGON2_OK) {
+      return validation_result;
+    }
+
+
+    SS("$");
+    SS(type_string);
+
+    SS("$v=");
+    SX(ctx->version);
+
+    SS("$m=");
+    SX(ctx->m_cost);
+    SS(",t=");
+    SX(ctx->t_cost);
+    SS(",p=");
+    SX(ctx->lanes);
+
+    SS("$");
+    SB(ctx->salt, ctx->saltlen);
+
+    SS("$");
+    SB(ctx->out, ctx->outlen);
+    return ARGON2_OK;
+
+#undef SS
+#undef SX
+#undef SB
+}
+
+size_t b64len(uint32_t len) {
+    size_t olen = ((size_t)len / 3) << 2;
+
+    switch (len % 3) {
+    case 2:
+        olen++;
+    /* fall through */
+    case 1:
+        olen += 2;
+        break;
+    }
+
+    return olen;
+}
+
+size_t numlen(uint32_t num) {
+    size_t len = 1;
+    while (num >= 10) {
+        ++len;
+        num = num / 10;
+    }
+    return len;
+}
+
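Note on the Base64 helpers in encoding.c above: the `EQ`/`GT`/`GE`/`LT` macros return 0xFF or 0x00 masks, so `b64_byte_to_char` selects a character by masking instead of branching or indexing a table with secret-dependent values. The sketch below copies the macros and the encoder from the diff and compares them against an ordinary table lookup. `b64_table_char` and `count_mismatches` are hypothetical helpers added only for this check.

```c
/* Constant-time comparison macros, copied from the diff. They work on
 * values in the 0..255 range and yield 0xFF for "true", 0x00 for "false". */
#define EQ(x, y) ((((0U - ((unsigned)(x) ^ (unsigned)(y))) >> 8) & 0xFF) ^ 0xFF)
#define GT(x, y) ((((unsigned)(y) - (unsigned)(x)) >> 8) & 0xFF)
#define GE(x, y) (GT(y, x) ^ 0xFF)
#define LT(x, y) GT(y, x)

/* Branch-free mapping of a 6-bit value to its Base64 character: exactly
 * one of the five masks is 0xFF for any x in 0..63. */
static int b64_byte_to_char(unsigned x) {
    return (LT(x, 26) & (x + 'A')) |
           (GE(x, 26) & LT(x, 52) & (x + ('a' - 26))) |
           (GE(x, 52) & LT(x, 62) & (x + ('0' - 52))) | (EQ(x, 62) & '+') |
           (EQ(x, 63) & '/');
}

/* Ordinary table lookup, used here only as a reference for comparison. */
static int b64_table_char(unsigned x) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    return alphabet[x & 0x3F];
}

/* Number of 6-bit inputs on which the two encoders disagree. */
static unsigned count_mismatches(void) {
    unsigned x, bad = 0;
    for (x = 0; x < 64; x++) {
        if (b64_byte_to_char(x) != b64_table_char(x)) {
            bad++;
        }
    }
    return bad;
}
```

The two encoders should agree on all 64 inputs. The masked form trades readability for the property that neither control flow nor memory addresses depend on the value being encoded.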
--- a/algo/argon2/argon2d/argon2d/encoding.h
+++ b/algo/argon2/argon2d/argon2d/encoding.h
@@ -0,0 +1,57 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#ifndef ENCODING_H
+#define ENCODING_H
+#include "argon2.h"
+
+#define ARGON2_MAX_DECODED_LANES UINT32_C(255)
+#define ARGON2_MIN_DECODED_SALT_LEN UINT32_C(8)
+#define ARGON2_MIN_DECODED_OUT_LEN UINT32_C(12)
+
+/*
+* Encodes an Argon2 hash string into the provided buffer. 'dst_len'
+* contains the size, in characters, of the 'dst' buffer; if 'dst_len'
+* is less than the number of required characters (including the
+* terminating 0), then this function returns ARGON2_ENCODING_FAIL.
+*
+* On success, ARGON2_OK is returned.
+*/
+int encode_string(char *dst, size_t dst_len, argon2_context *ctx,
+                  argon2_type type);
+
+/*
+* Decodes an Argon2 hash string into the provided structure 'ctx'.
+* The only fields that must be set prior to this call are ctx.saltlen and
+* ctx.outlen (which must be the maximal salt and out length values that are
+* allowed), ctx.salt and ctx.out (which must be buffers of the specified
+* length), and ctx.pwd and ctx.pwdlen which must hold a valid password.
+*
+* Invalid input string causes an error. On success, the ctx is valid and all
+* fields have been initialized.
+*
+* Returned value is ARGON2_OK on success, other ARGON2_ codes on error.
+*/
+int decode_string(argon2_context *ctx, const char *str, argon2_type type);
+
+/* Returns the length of the encoded byte stream with length len */
+size_t b64len(uint32_t len);
+
+/* Returns the length of the encoded number num */
+size_t numlen(uint32_t num);
+
+#endif
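Note on `b64len` and `numlen` declared in encoding.h above: together they allow a caller to size the `dst` buffer before calling `encode_string`. The sketch below reproduces both functions from the diff and adds a hypothetical helper, `encoded_size_sketch`, that sums the pieces of the `$argon2<T>$v=<num>$m=<num>,t=<num>,p=<num>$<bin>$<bin>` format plus the terminating zero. The type string is passed in by the caller; a value such as "argon2d" is an assumption about what `argon2_type2string` returns.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length of unpadded Base64 for `len` input bytes (as in the diff). */
static size_t b64len(uint32_t len) {
    size_t olen = ((size_t)len / 3) << 2;

    switch (len % 3) {
    case 2:
        olen++;
    /* fall through */
    case 1:
        olen += 2;
        break;
    }
    return olen;
}

/* Number of decimal digits needed to print `num` (as in the diff). */
static size_t numlen(uint32_t num) {
    size_t len = 1;
    while (num >= 10) {
        ++len;
        num = num / 10;
    }
    return len;
}

/* Hypothetical helper: buffer size, including the terminating zero, that
 * encode_string would need for these parameters. */
static size_t encoded_size_sketch(const char *type_string, uint32_t version,
                                  uint32_t m_cost, uint32_t t_cost,
                                  uint32_t lanes, uint32_t saltlen,
                                  uint32_t outlen) {
    return strlen("$") + strlen(type_string) +
           strlen("$v=") + numlen(version) +
           strlen("$m=") + numlen(m_cost) +
           strlen(",t=") + numlen(t_cost) +
           strlen(",p=") + numlen(lanes) +
           strlen("$") + b64len(saltlen) +
           strlen("$") + b64len(outlen) + 1;
}
```

With a 16-byte salt and a 32-byte output, the Base64 parts take 22 and 43 characters, so `encoded_size_sketch("argon2d", 19, 4096, 3, 1, 16, 32)` evaluates to 96.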
--- a/algo/argon2/argon2d/argon2d/opt.c
+++ b/algo/argon2/argon2d/argon2d/opt.c
@@ -0,0 +1,359 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+
+#include "argon2.h"
+#include "core.h"
+
+#include "../blake2/blake2.h"
+#include "../blake2/blamka-round-opt.h"
+
+/*
+ * Function fills a new memory block and optionally XORs the old block over the new one.
+ * Memory must be initialized.
+ * @param state Pointer to the just produced block. Content will be updated(!)
+ * @param ref_block Pointer to the reference block
+ * @param next_block Pointer to the block to be XORed over. May coincide with @ref_block
+ * @param with_xor Whether to XOR into the new block (1) or just overwrite (0)
+ * @pre all block pointers must be valid
+ */
+
+#if defined(__AVX512F__)
+
+static void fill_block(__m512i *state, const block *ref_block,
+                       block *next_block, int with_xor) {
+    __m512i block_XY[ARGON2_512BIT_WORDS_IN_BLOCK];
+    unsigned int i;
+
+    if (with_xor) {
+        for (i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++) {
+            state[i] = _mm512_xor_si512(
+                state[i], _mm512_loadu_si512((const __m512i *)ref_block->v + i));
+            block_XY[i] = _mm512_xor_si512(
+                state[i], _mm512_loadu_si512((const __m512i *)next_block->v + i));
+        }
+    } else {
+        for (i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++) {
+            block_XY[i] = state[i] = _mm512_xor_si512(
+                state[i], _mm512_loadu_si512((const __m512i *)ref_block->v + i));
+        }
+    }
+
+    BLAKE2_ROUND_1( state[ 0], state[ 1], state[ 2], state[ 3],
+                    state[ 4], state[ 5], state[ 6], state[ 7] );
+    BLAKE2_ROUND_1( state[ 8], state[ 9], state[10], state[11],
+                    state[12], state[13], state[14], state[15] );
+
+    BLAKE2_ROUND_2( state[ 0], state[ 2], state[ 4], state[ 6],
+                    state[ 8], state[10], state[12], state[14] );
+    BLAKE2_ROUND_2( state[ 1], state[ 3], state[ 5], state[ 7],
+                    state[ 9], state[11], state[13], state[15] );
+
+/*
+    for (i = 0; i < 2; ++i) {
+        BLAKE2_ROUND_1(
+            state[8 * i + 0], state[8 * i + 1], state[8 * i + 2], state[8 * i + 3],
+            state[8 * i + 4], state[8 * i + 5], state[8 * i + 6], state[8 * i + 7]);
+    }
+
+    for (i = 0; i < 2; ++i) {
+        BLAKE2_ROUND_2(
+            state[2 * 0 + i], state[2 * 1 + i], state[2 * 2 + i], state[2 * 3 + i],
+            state[2 * 4 + i], state[2 * 5 + i], state[2 * 6 + i], state[2 * 7 + i]);
+    }
+*/
+
+    for (i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++) {
+        state[i] = _mm512_xor_si512(state[i], block_XY[i]);
+        _mm512_storeu_si512((__m512i *)next_block->v + i, state[i]);
+    }
+}
+
+#elif defined(__AVX2__)
+
+static void fill_block(__m256i *state, const block *ref_block,
+                       block *next_block, int with_xor) {
+    __m256i block_XY[ARGON2_HWORDS_IN_BLOCK];
+    unsigned int i;
+
+    if (with_xor) {
+        for (i = 0; i < ARGON2_HWORDS_IN_BLOCK; i++) {
+            state[i] = _mm256_xor_si256(
+                state[i], _mm256_loadu_si256((const __m256i *)ref_block->v + i));
+            block_XY[i] = _mm256_xor_si256(
+                state[i], _mm256_loadu_si256((const __m256i *)next_block->v + i));
+        }
+    } else {
+        for (i = 0; i < ARGON2_HWORDS_IN_BLOCK; i++) {
+            block_XY[i] = state[i] = _mm256_xor_si256(
+                state[i], _mm256_loadu_si256((const __m256i *)ref_block->v + i));
+        }
+    }
+
+    BLAKE2_ROUND_1( state[ 0], state[ 4], state[ 1], state[ 5],
+                    state[ 2], state[ 6], state[ 3], state[ 7] );
+    BLAKE2_ROUND_1( state[ 8], state[12], state[ 9], state[13],
+                    state[10], state[14], state[11], state[15] );
+    BLAKE2_ROUND_1( state[16], state[20], state[17], state[21],
+                    state[18], state[22], state[19], state[23] );
+    BLAKE2_ROUND_1( state[24], state[28], state[25], state[29],
+                    state[26], state[30], state[27], state[31] );
+
+    BLAKE2_ROUND_2( state[ 0], state[ 4], state[ 8], state[12],
+                    state[16], state[20], state[24], state[28] );
+    BLAKE2_ROUND_2( state[ 1], state[ 5], state[ 9], state[13],
+                    state[17], state[21], state[25], state[29] );
+    BLAKE2_ROUND_2( state[ 2], state[ 6], state[10], state[14],
+                    state[18], state[22], state[26], state[30] );
+    BLAKE2_ROUND_2( state[ 3], state[ 7], state[11], state[15],
+                    state[19], state[23], state[27], state[31] );
+
+/*
+    for (i = 0; i < 4; ++i) {
+        BLAKE2_ROUND_1(state[8 * i + 0], state[8 * i + 4], state[8 * i + 1], state[8 * i + 5],
+                       state[8 * i + 2], state[8 * i + 6], state[8 * i + 3], state[8 * i + 7]);
+    }
+
+    for (i = 0; i < 4; ++i) {
+        BLAKE2_ROUND_2(state[ 0 + i], state[ 4 + i], state[ 8 + i], state[12 + i],
+                       state[16 + i], state[20 + i], state[24 + i], state[28 + i]);
+    }
+*/
+
+    for (i = 0; i < ARGON2_HWORDS_IN_BLOCK; i++) {
+        state[i] = _mm256_xor_si256(state[i], block_XY[i]);
+        _mm256_storeu_si256((__m256i *)next_block->v + i, state[i]);
+    }
+}
+
+#else  // SSE2
+
+static void fill_block(__m128i *state, const block *ref_block,
+                       block *next_block, int with_xor) {
+    __m128i block_XY[ARGON2_OWORDS_IN_BLOCK];
+    unsigned int i;
+
+    if (with_xor) {
+        for (i = 0; i < ARGON2_OWORDS_IN_BLOCK; i++) {
+            state[i] = _mm_xor_si128(
+                state[i], _mm_loadu_si128((const __m128i *)ref_block->v + i));
+            block_XY[i] = _mm_xor_si128(
+                state[i], _mm_loadu_si128((const __m128i *)next_block->v + i));
+        }
+    } else {
+        for (i = 0; i < ARGON2_OWORDS_IN_BLOCK; i++) {
+            block_XY[i] = state[i] = _mm_xor_si128(
+                state[i], _mm_loadu_si128((const __m128i *)ref_block->v + i));
+        }
+    }
+
+    BLAKE2_ROUND( state[ 0], state[ 1], state[ 2], state[ 3],
+                  state[ 4], state[ 5], state[ 6], state[ 7] );
+    BLAKE2_ROUND( state[ 8], state[ 9], state[10], state[11], 
+                  state[12], state[13], state[14], state[15] );
+    BLAKE2_ROUND( state[16], state[17], state[18], state[19], 
+                  state[20], state[21], state[22], state[23] );
+    BLAKE2_ROUND( state[24], state[25], state[26], state[27], 
+                  state[28], state[29], state[30], state[31] );
+    BLAKE2_ROUND( state[32], state[33], state[34], state[35], 
+                  state[36], state[37], state[38], state[39] );
+    BLAKE2_ROUND( state[40], state[41], state[42], state[43], 
+                  state[44], state[45], state[46], state[47] );
+    BLAKE2_ROUND( state[48], state[49], state[50], state[51], 
+                  state[52], state[53], state[54], state[55] );
+    BLAKE2_ROUND( state[56], state[57], state[58], state[59], 
+                  state[60], state[61], state[62], state[63] );
+
+    BLAKE2_ROUND( state[ 0], state[ 8], state[16], state[24], 
+                  state[32], state[40], state[48], state[56] );
+    BLAKE2_ROUND( state[ 1], state[ 9], state[17], state[25],  
+                  state[33], state[41], state[49], state[57] );
+    BLAKE2_ROUND( state[ 2], state[10], state[18], state[26],  
+                  state[34], state[42], state[50], state[58] );
+    BLAKE2_ROUND( state[ 3], state[11], state[19], state[27],  
+                  state[35], state[43], state[51], state[59] );
+    BLAKE2_ROUND( state[ 4], state[12], state[20], state[28],  
+                  state[36], state[44], state[52], state[60] );
+    BLAKE2_ROUND( state[ 5], state[13], state[21], state[29],  
+                  state[37], state[45], state[53], state[61] );
+    BLAKE2_ROUND( state[ 6], state[14], state[22], state[30],  
+                  state[38], state[46], state[54], state[62] );
+    BLAKE2_ROUND( state[ 7], state[15], state[23], state[31],  
+                  state[39], state[47], state[55], state[63] );
+
+/*
+    for (i = 0; i < 8; ++i) {
+        BLAKE2_ROUND(state[8 * i + 0], state[8 * i + 1], state[8 * i + 2],
+            state[8 * i + 3], state[8 * i + 4], state[8 * i + 5],
+            state[8 * i + 6], state[8 * i + 7]);
+    }
+
+    for (i = 0; i < 8; ++i) {
+        BLAKE2_ROUND(state[8 * 0 + i], state[8 * 1 + i], state[8 * 2 + i],
+            state[8 * 3 + i], state[8 * 4 + i], state[8 * 5 + i],
+            state[8 * 6 + i], state[8 * 7 + i]);
+    }
+*/
+    for (i = 0; i < ARGON2_OWORDS_IN_BLOCK; i++) {
+        state[i] = _mm_xor_si128(state[i], block_XY[i]);
+        _mm_storeu_si128((__m128i *)next_block->v + i, state[i]);
+    }
+}
+
+#endif
+
+#if 0
+static void next_addresses(block *address_block, block *input_block) {
+    /*Temporary zero-initialized blocks*/
+#if defined(__AVX512F__)
+    __m512i zero_block[ARGON2_512BIT_WORDS_IN_BLOCK];
+    __m512i zero2_block[ARGON2_512BIT_WORDS_IN_BLOCK];
+#elif defined(__AVX2__)
+    __m256i zero_block[ARGON2_HWORDS_IN_BLOCK];
+    __m256i zero2_block[ARGON2_HWORDS_IN_BLOCK];
+#else
+    __m128i zero_block[ARGON2_OWORDS_IN_BLOCK];
+    __m128i zero2_block[ARGON2_OWORDS_IN_BLOCK];
+#endif
+
+    memset(zero_block, 0, sizeof(zero_block));
+    memset(zero2_block, 0, sizeof(zero2_block));
+
+    /*Increasing index counter*/
+    input_block->v[6]++;
+
+    /*First iteration of G*/
+    fill_block(zero_block, input_block, address_block, 0);
+
+    /*Second iteration of G*/
+    fill_block(zero2_block, address_block, address_block, 0);
+}
+#endif
+
+void fill_segment(const argon2_instance_t *instance,
+                  argon2_position_t position) {
+    block *ref_block = NULL, *curr_block = NULL;
+//    block address_block, input_block;
+    uint64_t pseudo_rand, ref_index, ref_lane;
+    uint32_t prev_offset, curr_offset;
+    uint32_t starting_index, i;
+#if defined(__AVX512F__)
+    __m512i state[ARGON2_512BIT_WORDS_IN_BLOCK];
+#elif defined(__AVX2__)
+    __m256i state[ARGON2_HWORDS_IN_BLOCK];
+#else
+    __m128i state[ARGON2_OWORDS_IN_BLOCK];
+#endif
+//    int data_independent_addressing;
+
+    if (instance == NULL) {
+        return;
+    }
+
+    // data_independent_addressing =
+    //     (instance->type == Argon2_i) ||
+    //     (instance->type == Argon2_id && (position.pass == 0) &&
+    //      (position.slice < ARGON2_SYNC_POINTS / 2));
+
+    // if (data_independent_addressing) {
+    //     init_block_value(&input_block, 0);
+
+    //     input_block.v[0] = position.pass;
+    //     input_block.v[1] = position.lane;
+    //     input_block.v[2] = position.slice;
+    //     input_block.v[3] = instance->memory_blocks;
+    //     input_block.v[4] = instance->passes;
+    //     input_block.v[5] = instance->type;
+    // }
+
+    starting_index = 0;
+
+    if ((0 == position.pass) && (0 == position.slice)) {
+        starting_index = 2; /* we have already generated the first two blocks */
+
+        /* Don't forget to generate the first block of addresses: */
+//        if (data_independent_addressing) {
+//            next_addresses(&address_block, &input_block);
+//        }
+    }
+
+    /* Offset of the current block */
+    curr_offset = position.lane * instance->lane_length +
+                  position.slice * instance->segment_length + starting_index;
+
+    if (0 == curr_offset % instance->lane_length) {
+        /* Last block in this lane */
+        prev_offset = curr_offset + instance->lane_length - 1;
+    } else {
+        /* Previous block */
+        prev_offset = curr_offset - 1;
+    }
+
+    memcpy(state, ((instance->memory + prev_offset)->v), ARGON2_BLOCK_SIZE);
+
+    for (i = starting_index; i < instance->segment_length;
+         ++i, ++curr_offset, ++prev_offset) {
+        /* 1.1 Rotating prev_offset if needed */
+        if (curr_offset % instance->lane_length == 1) {
+            prev_offset = curr_offset - 1;
+        }
+
+        /* 1.2 Computing the index of the reference block */
+        /* 1.2.1 Taking pseudo-random value from the previous block */
+//        if (data_independent_addressing) {
+//            if (i % ARGON2_ADDRESSES_IN_BLOCK == 0) {
+//                next_addresses(&address_block, &input_block);
+//            }
+//            pseudo_rand = address_block.v[i % ARGON2_ADDRESSES_IN_BLOCK];
+//        } else {
+            pseudo_rand = instance->memory[prev_offset].v[0];
+//        }
+
+        /* 1.2.2 Computing the lane of the reference block */
+        ref_lane = ((pseudo_rand >> 32)) % instance->lanes;
+
+        if ((position.pass == 0) && (position.slice == 0)) {
+            /* Cannot reference other lanes yet */
+            ref_lane = position.lane;
+        }
+
+        /* 1.2.3 Computing the number of possible reference blocks within
+         * the lane.
+         */
+        position.index = i;
+        ref_index = index_alpha(instance, &position, pseudo_rand & 0xFFFFFFFF,
+                                ref_lane == position.lane);
+
+        /* 2 Creating a new block */
+        ref_block =
+            instance->memory + instance->lane_length * ref_lane + ref_index;
+        curr_block = instance->memory + curr_offset;
+        if (ARGON2_VERSION_10 == instance->version) {
+            /* version 1.2.1 and earlier: overwrite, not XOR */
+            fill_block(state, ref_block, curr_block, 0);
+        } else {
+            if (0 == position.pass) {
+                fill_block(state, ref_block, curr_block, 0);
+            } else {
+                fill_block(state, ref_block, curr_block, 1);
+            }
+        }
+    }
+}
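
The three fill_block variants above share one data flow and differ only in vector width. A minimal scalar sketch of that flow follows, assuming a 1024-byte block viewed as 128 64-bit words. The names fill_block_model and toy_permute are hypothetical, and toy_permute is only a placeholder for the BLAKE2_ROUND passes, not the real permutation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS 128 /* assumed: 1024-byte block as 64-bit words */

/* Hypothetical placeholder for the BLAKE2_ROUND passes (NOT the real one):
 * rotates every word left by one bit so the data flow can be observed. */
static void toy_permute(uint64_t *r) {
    for (size_t i = 0; i < WORDS; i++) {
        r[i] = (r[i] << 1) | (r[i] >> 63);
    }
}

/* Scalar model of the fill_block data flow:
 *   R    = state XOR ref
 *   tmp  = R            (XOR old contents of next when with_xor is set)
 *   R    = permute(R)
 *   next = state = R XOR tmp
 */
static void fill_block_model(uint64_t *state, const uint64_t *ref,
                             uint64_t *next, int with_xor) {
    uint64_t block_xy[WORDS];

    assert(state != NULL && ref != NULL && next != NULL);

    for (size_t i = 0; i < WORDS; i++) {
        state[i] ^= ref[i];
        block_xy[i] = state[i];
        if (with_xor) {
            block_xy[i] ^= next[i];
        }
    }

    toy_permute(state);

    for (size_t i = 0; i < WORDS; i++) {
        state[i] ^= block_xy[i];
        next[i] = state[i];
    }
}
```

With with_xor set to 0 the old contents of next are ignored (first pass, or version 1.0); with 1 they are folded into the feed-forward value, which is what fill_segment requests on later passes.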
--- a/algo/argon2/argon2d/blake2/blake2-impl.h
+++ b/algo/argon2/argon2d/blake2/blake2-impl.h
@@ -0,0 +1,156 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#ifndef PORTABLE_BLAKE2_IMPL_H
+#define PORTABLE_BLAKE2_IMPL_H
+
+#include <stdint.h>
+#include <string.h>
+
+#if defined(_MSC_VER)
+#define BLAKE2_INLINE __inline
+#elif defined(__GNUC__) || defined(__clang__)
+#define BLAKE2_INLINE __inline__
+#else
+#define BLAKE2_INLINE
+#endif
+
+/* Argon2 Team - Begin Code */
+/*
+   Not an exhaustive list, but it should cover most modern platforms.
+   Additionally, the code is correct either way; this is only a performance
+   tweak.
+*/
+#if (defined(__BYTE_ORDER__) &&                                                \
+     (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)) ||                           \
+    defined(__LITTLE_ENDIAN__) || defined(__ARMEL__) || defined(__MIPSEL__) || \
+    defined(__AARCH64EL__) || defined(__amd64__) || defined(__i386__) ||       \
+    defined(_M_IX86) || defined(_M_X64) || defined(_M_AMD64) ||                \
+    defined(_M_ARM)
+#define NATIVE_LITTLE_ENDIAN
+#endif
+/* Argon2 Team - End Code */
+
+static BLAKE2_INLINE uint32_t load32(const void *src) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+    uint32_t w;
+    memcpy(&w, src, sizeof w);
+    return w;
+#else
+    const uint8_t *p = (const uint8_t *)src;
+    uint32_t w = *p++;
+    w |= (uint32_t)(*p++) << 8;
+    w |= (uint32_t)(*p++) << 16;
+    w |= (uint32_t)(*p++) << 24;
+    return w;
+#endif
+}
+
+static BLAKE2_INLINE uint64_t load64(const void *src) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+    uint64_t w;
+    memcpy(&w, src, sizeof w);
+    return w;
+#else
+    const uint8_t *p = (const uint8_t *)src;
+    uint64_t w = *p++;
+    w |= (uint64_t)(*p++) << 8;
+    w |= (uint64_t)(*p++) << 16;
+    w |= (uint64_t)(*p++) << 24;
+    w |= (uint64_t)(*p++) << 32;
+    w |= (uint64_t)(*p++) << 40;
+    w |= (uint64_t)(*p++) << 48;
+    w |= (uint64_t)(*p++) << 56;
+    return w;
+#endif
+}
+
+static BLAKE2_INLINE void store32(void *dst, uint32_t w) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+    memcpy(dst, &w, sizeof w);
+#else
+    uint8_t *p = (uint8_t *)dst;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+#endif
+}
+
+static BLAKE2_INLINE void store64(void *dst, uint64_t w) {
+#if defined(NATIVE_LITTLE_ENDIAN)
+    memcpy(dst, &w, sizeof w);
+#else
+    uint8_t *p = (uint8_t *)dst;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+#endif
+}
+
+static BLAKE2_INLINE uint64_t load48(const void *src) {
+    const uint8_t *p = (const uint8_t *)src;
+    uint64_t w = *p++;
+    w |= (uint64_t)(*p++) << 8;
+    w |= (uint64_t)(*p++) << 16;
+    w |= (uint64_t)(*p++) << 24;
+    w |= (uint64_t)(*p++) << 32;
+    w |= (uint64_t)(*p++) << 40;
+    return w;
+}
+
+static BLAKE2_INLINE void store48(void *dst, uint64_t w) {
+    uint8_t *p = (uint8_t *)dst;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+    w >>= 8;
+    *p++ = (uint8_t)w;
+}
+
+static BLAKE2_INLINE uint32_t rotr32(const uint32_t w, const unsigned c) {
+    return (w >> c) | (w << (32 - c));
+}
+
+static BLAKE2_INLINE uint64_t rotr64(const uint64_t w, const unsigned c) {
+    return (w >> c) | (w << (64 - c));
+}
+
+void clear_internal_memory(void *v, size_t n);
+
+#endif
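
The load/store helpers above fix the byte order to little-endian regardless of the host, and the rotate helpers shift by (width - c), which is only defined for 0 < c < width. A small sketch of both points follows; the names store32_le, load32_le and rotr64_checked are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise little-endian store, same shape as the portable store32 path. */
static void store32_le(void *dst, uint32_t w) {
    uint8_t *p = (uint8_t *)dst;
    for (int i = 0; i < 4; i++) {
        p[i] = (uint8_t)(w >> (8 * i));
    }
}

/* Byte-wise little-endian load, same shape as the portable load32 path. */
static uint32_t load32_le(const void *src) {
    const uint8_t *p = (const uint8_t *)src;
    uint32_t w = 0;
    for (int i = 0; i < 4; i++) {
        w |= (uint32_t)p[i] << (8 * i);
    }
    return w;
}

/* Rotate right with the precondition made explicit: c == 0 would shift
 * by 64, which is undefined behaviour for a 64-bit operand in C. */
static uint64_t rotr64_checked(uint64_t w, unsigned c) {
    assert(c > 0 && c < 64);
    return (w >> c) | (w << (64 - c));
}
```

BLAKE2b only rotates by 32, 24, 16 and 63, so the header's unchecked rotr64 stays inside this range.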
--- a/algo/argon2/argon2d/blake2/blake2.h
+++ b/algo/argon2/argon2d/blake2/blake2.h
@@ -0,0 +1,91 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#ifndef PORTABLE_BLAKE2_H
+#define PORTABLE_BLAKE2_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include <limits.h>
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+enum blake2b_constant {
+    BLAKE2B_BLOCKBYTES = 128,
+    BLAKE2B_OUTBYTES = 64,
+    BLAKE2B_KEYBYTES = 64,
+    BLAKE2B_SALTBYTES = 16,
+    BLAKE2B_PERSONALBYTES = 16
+};
+
+#pragma pack(push, 1)
+typedef struct __blake2b_param {
+    uint8_t digest_length;                   /* 1 */
+    uint8_t key_length;                      /* 2 */
+    uint8_t fanout;                          /* 3 */
+    uint8_t depth;                           /* 4 */
+    uint32_t leaf_length;                    /* 8 */
+    uint64_t node_offset;                    /* 16 */
+    uint8_t node_depth;                      /* 17 */
+    uint8_t inner_length;                    /* 18 */
+    uint8_t reserved[14];                    /* 32 */
+    uint8_t salt[BLAKE2B_SALTBYTES];         /* 48 */
+    uint8_t personal[BLAKE2B_PERSONALBYTES]; /* 64 */
+} blake2b_param;
+#pragma pack(pop)
+
+typedef struct __blake2b_state {
+    uint64_t h[8];
+    uint64_t t[2];
+    uint64_t f[2];
+    uint8_t buf[BLAKE2B_BLOCKBYTES];
+    unsigned buflen;
+    unsigned outlen;
+    uint8_t last_node;
+} blake2b_state;
+
+/* Ensure param structs have not been wrongly padded */
+/* Poor man's static_assert */
+enum {
+    blake2_size_check_0 = 1 / !!(CHAR_BIT == 8),
+    blake2_size_check_2 =
+        1 / !!(sizeof(blake2b_param) == sizeof(uint64_t) * CHAR_BIT)
+};
+
+/* Streaming API */
+int blake2b_init(blake2b_state *S, size_t outlen);
+int blake2b_init_key(blake2b_state *S, size_t outlen, const void *key,
+                     size_t keylen);
+int blake2b_init_param(blake2b_state *S, const blake2b_param *P);
+int blake2b_update(blake2b_state *S, const void *in, size_t inlen);
+int blake2b_final(blake2b_state *S, void *out, size_t outlen);
+
+/* Simple API */
+int blake2b(void *out, size_t outlen, const void *in, size_t inlen,
+                         const void *key, size_t keylen);
+
+/* Argon2 Team - Begin Code */
+int blake2b_long(void *out, size_t outlen, const void *in, size_t inlen);
+/* Argon2 Team - End Code */
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif
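
The "poor man's static_assert" in blake2.h relies on 1 / !!(cond) being a constant expression that divides by zero, and so fails to compile, when cond is false. The header compares against sizeof(uint64_t) * CHAR_BIT, which is 64 when CHAR_BIT is 8. A minimal sketch of the same trick follows, using a hypothetical packed struct demo_param that is not part of the patch, with the C11 static_assert alongside for comparison.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical 64-byte packed struct: 1 + 4 + 59 bytes. */
#pragma pack(push, 1)
typedef struct demo_param {
    uint8_t  digest_length;
    uint32_t leaf_length;
    uint8_t  rest[59];
} demo_param;
#pragma pack(pop)

/* Each enumerator is 1 when its condition holds; a false condition
 * becomes a division by zero in a constant expression, which the
 * compiler rejects. */
enum {
    demo_check_char_bit = 1 / !!(CHAR_BIT == 8),
    demo_check_size = 1 / !!(sizeof(demo_param) == 64)
};

/* C11 and later express the same check directly. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
static_assert(sizeof(demo_param) == 64, "demo_param must be 64 bytes");
#endif
```

Without the pack pragma, leaf_length would normally be aligned to 4 bytes and the struct would grow past 64 bytes, which is the kind of padding mistake the check in blake2.h is meant to catch.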
--- a/algo/argon2/argon2d/blake2/blake2b.c
+++ b/algo/argon2/argon2d/blake2/blake2b.c
@@ -0,0 +1,390 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#include <stdint.h>
+#include <string.h>
+#include <stdio.h>
+
+#include "blake2.h"
+#include "blake2-impl.h"
+
+static const uint64_t blake2b_IV[8] = {
+    UINT64_C(0x6a09e667f3bcc908), UINT64_C(0xbb67ae8584caa73b),
+    UINT64_C(0x3c6ef372fe94f82b), UINT64_C(0xa54ff53a5f1d36f1),
+    UINT64_C(0x510e527fade682d1), UINT64_C(0x9b05688c2b3e6c1f),
+    UINT64_C(0x1f83d9abfb41bd6b), UINT64_C(0x5be0cd19137e2179)};
+
+static const unsigned int blake2b_sigma[12][16] = {
+    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
+    {14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3},
+    {11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4},
+    {7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8},
+    {9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13},
+    {2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9},
+    {12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11},
+    {13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10},
+    {6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5},
+    {10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0},
+    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
+    {14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3},
+};
+
+static BLAKE2_INLINE void blake2b_set_lastnode(blake2b_state *S) {
+    S->f[1] = (uint64_t)-1;
+}
+
+static BLAKE2_INLINE void blake2b_set_lastblock(blake2b_state *S) {
+    if (S->last_node) {
+        blake2b_set_lastnode(S);
+    }
+    S->f[0] = (uint64_t)-1;
+}
+
+static BLAKE2_INLINE void blake2b_increment_counter(blake2b_state *S,
+                                                    uint64_t inc) {
+    S->t[0] += inc;
+    S->t[1] += (S->t[0] < inc);
+}
+
+static BLAKE2_INLINE void blake2b_invalidate_state(blake2b_state *S) {
+    clear_internal_memory(S, sizeof(*S));      /* wipe */
+    blake2b_set_lastblock(S); /* invalidate for further use */
+}
+
+static BLAKE2_INLINE void blake2b_init0(blake2b_state *S) {
+    memset(S, 0, sizeof(*S));
+    memcpy(S->h, blake2b_IV, sizeof(S->h));
+}
+
+int blake2b_init_param(blake2b_state *S, const blake2b_param *P) {
+    const unsigned char *p = (const unsigned char *)P;
+    unsigned int i;
+
+    if (NULL == P || NULL == S) {
+        return -1;
+    }
+
+    blake2b_init0(S);
+    /* IV XOR Parameter Block */
+    for (i = 0; i < 8; ++i) {
+        S->h[i] ^= load64(&p[i * sizeof(S->h[i])]);
+    }
+    S->outlen = P->digest_length;
+    return 0;
+}
+
+/* Sequential blake2b initialization */
+int blake2b_init(blake2b_state *S, size_t outlen) {
+    blake2b_param P;
+
+    if (S == NULL) {
+        return -1;
+    }
+
+    if ((outlen == 0) || (outlen > BLAKE2B_OUTBYTES)) {
+        blake2b_invalidate_state(S);
+        return -1;
+    }
+
+    /* Setup Parameter Block for unkeyed BLAKE2 */
+    P.digest_length = (uint8_t)outlen;
+    P.key_length = 0;
+    P.fanout = 1;
+    P.depth = 1;
+    P.leaf_length = 0;
+    P.node_offset = 0;
+    P.node_depth = 0;
+    P.inner_length = 0;
+    memset(P.reserved, 0, sizeof(P.reserved));
+    memset(P.salt, 0, sizeof(P.salt));
+    memset(P.personal, 0, sizeof(P.personal));
+
+    return blake2b_init_param(S, &P);
+}
+
+int blake2b_init_key(blake2b_state *S, size_t outlen, const void *key,
+                     size_t keylen) {
+    blake2b_param P;
+
+    if (S == NULL) {
+        return -1;
+    }
+
+    if ((outlen == 0) || (outlen > BLAKE2B_OUTBYTES)) {
+        blake2b_invalidate_state(S);
+        return -1;
+    }
+
+    if ((key == 0) || (keylen == 0) || (keylen > BLAKE2B_KEYBYTES)) {
+        blake2b_invalidate_state(S);
+        return -1;
+    }
+
+    /* Setup Parameter Block for keyed BLAKE2 */
+    P.digest_length = (uint8_t)outlen;
+    P.key_length = (uint8_t)keylen;
+    P.fanout = 1;
+    P.depth = 1;
+    P.leaf_length = 0;
+    P.node_offset = 0;
+    P.node_depth = 0;
+    P.inner_length = 0;
+    memset(P.reserved, 0, sizeof(P.reserved));
+    memset(P.salt, 0, sizeof(P.salt));
+    memset(P.personal, 0, sizeof(P.personal));
+
+    if (blake2b_init_param(S, &P) < 0) {
+        blake2b_invalidate_state(S);
+        return -1;
+    }
+
+    {
+        uint8_t block[BLAKE2B_BLOCKBYTES];
+        memset(block, 0, BLAKE2B_BLOCKBYTES);
+        memcpy(block, key, keylen);
+        blake2b_update(S, block, BLAKE2B_BLOCKBYTES);
+        /* Burn the key from stack */
+        clear_internal_memory(block, BLAKE2B_BLOCKBYTES);
+    }
+    return 0;
+}
+
+static void blake2b_compress(blake2b_state *S, const uint8_t *block) {
+    uint64_t m[16];
+    uint64_t v[16];
+    unsigned int i, r;
+
+    for (i = 0; i < 16; ++i) {
+        m[i] = load64(block + i * sizeof(m[i]));
+    }
+
+    for (i = 0; i < 8; ++i) {
+        v[i] = S->h[i];
+    }
+
+    v[8] = blake2b_IV[0];
+    v[9] = blake2b_IV[1];
+    v[10] = blake2b_IV[2];
+    v[11] = blake2b_IV[3];
+    v[12] = blake2b_IV[4] ^ S->t[0];
+    v[13] = blake2b_IV[5] ^ S->t[1];
+    v[14] = blake2b_IV[6] ^ S->f[0];
+    v[15] = blake2b_IV[7] ^ S->f[1];
+
+#define G(r, i, a, b, c, d)                                                    \
+    do {                                                                       \
+        a = a + b + m[blake2b_sigma[r][2 * i + 0]];                            \
+        d = rotr64(d ^ a, 32);                                                 \
+        c = c + d;                                                             \
+        b = rotr64(b ^ c, 24);                                                 \
+        a = a + b + m[blake2b_sigma[r][2 * i + 1]];                            \
+        d = rotr64(d ^ a, 16);                                                 \
+        c = c + d;                                                             \
+        b = rotr64(b ^ c, 63);                                                 \
+    } while ((void)0, 0)
+
+#define ROUND(r)                                                               \
+    do {                                                                       \
+        G(r, 0, v[0], v[4], v[8], v[12]);                                      \
+        G(r, 1, v[1], v[5], v[9], v[13]);                                      \
+        G(r, 2, v[2], v[6], v[10], v[14]);                                     \
+        G(r, 3, v[3], v[7], v[11], v[15]);                                     \
+        G(r, 4, v[0], v[5], v[10], v[15]);                                     \
+        G(r, 5, v[1], v[6], v[11], v[12]);                                     \
+        G(r, 6, v[2], v[7], v[8], v[13]);                                      \
+        G(r, 7, v[3], v[4], v[9], v[14]);                                      \
+    } while ((void)0, 0)
+
+    for (r = 0; r < 12; ++r) {
+        ROUND(r);
+    }
+
+    for (i = 0; i < 8; ++i) {
+        S->h[i] = S->h[i] ^ v[i] ^ v[i + 8];
+    }
+
+#undef G
+#undef ROUND
+}
+
+int blake2b_update(blake2b_state *S, const void *in, size_t inlen) {
+    const uint8_t *pin = (const uint8_t *)in;
+
+    if (inlen == 0) {
+        return 0;
+    }
+
+    /* Sanity check */
+    if (S == NULL || in == NULL) {
+        return -1;
+    }
+
+    /* Is this a reused state? */
+    if (S->f[0] != 0) {
+        return -1;
+    }
+
+    if (S->buflen + inlen > BLAKE2B_BLOCKBYTES) {
+        /* Complete current block */
+        size_t left = S->buflen;
+        size_t fill = BLAKE2B_BLOCKBYTES - left;
+        memcpy(&S->buf[left], pin, fill);
+        blake2b_increment_counter(S, BLAKE2B_BLOCKBYTES);
+        blake2b_compress(S, S->buf);
+        S->buflen = 0;
+        inlen -= fill;
+        pin += fill;
+        /* Avoid buffer copies when possible */
+        while (inlen > BLAKE2B_BLOCKBYTES) {
+            blake2b_increment_counter(S, BLAKE2B_BLOCKBYTES);
+            blake2b_compress(S, pin);
+            inlen -= BLAKE2B_BLOCKBYTES;
+            pin += BLAKE2B_BLOCKBYTES;
+        }
+    }
+    memcpy(&S->buf[S->buflen], pin, inlen);
+    S->buflen += (unsigned int)inlen;
+    return 0;
+}
+
+int blake2b_final(blake2b_state *S, void *out, size_t outlen) {
+    uint8_t buffer[BLAKE2B_OUTBYTES] = {0};
+    unsigned int i;
+
+    /* Sanity checks */
+    if (S == NULL || out == NULL || outlen < S->outlen) {
+        return -1;
+    }
+
+    /* Is this a reused state? */
+    if (S->f[0] != 0) {
+        return -1;
+    }
+
+    blake2b_increment_counter(S, S->buflen);
+    blake2b_set_lastblock(S);
+    memset(&S->buf[S->buflen], 0, BLAKE2B_BLOCKBYTES - S->buflen); /* Padding */
+    blake2b_compress(S, S->buf);
+
+    for (i = 0; i < 8; ++i) { /* Output full hash to temp buffer */
+        store64(buffer + sizeof(S->h[i]) * i, S->h[i]);
+    }
+
+    memcpy(out, buffer, S->outlen);
+    clear_internal_memory(buffer, sizeof(buffer));
+    clear_internal_memory(S->buf, sizeof(S->buf));
+    clear_internal_memory(S->h, sizeof(S->h));
+    return 0;
+}
+
+int blake2b(void *out, size_t outlen, const void *in, size_t inlen,
+            const void *key, size_t keylen) {
+    blake2b_state S;
+    int ret = -1;
+
+    /* Verify parameters */
+    if (NULL == in && inlen > 0) {
+        goto fail;
+    }
+
+    if (NULL == out || outlen == 0 || outlen > BLAKE2B_OUTBYTES) {
+        goto fail;
+    }
+
+    if ((NULL == key && keylen > 0) || keylen > BLAKE2B_KEYBYTES) {
+        goto fail;
+    }
+
+    if (keylen > 0) {
+        if (blake2b_init_key(&S, outlen, key, keylen) < 0) {
+            goto fail;
+        }
+    } else {
+        if (blake2b_init(&S, outlen) < 0) {
+            goto fail;
+        }
+    }
+
+    if (blake2b_update(&S, in, inlen) < 0) {
+        goto fail;
+    }
+    ret = blake2b_final(&S, out, outlen);
+
+fail:
+    clear_internal_memory(&S, sizeof(S));
+    return ret;
+}
+
+/* Argon2 Team - Begin Code */
+int blake2b_long(void *pout, size_t outlen, const void *in, size_t inlen) {
+    uint8_t *out = (uint8_t *)pout;
+    blake2b_state blake_state;
+    uint8_t outlen_bytes[sizeof(uint32_t)] = {0};
+    int ret = -1;
+
+    if (outlen > UINT32_MAX) {
+        goto fail;
+    }
+
+    /* Ensure little-endian byte order! */
+    store32(outlen_bytes, (uint32_t)outlen);
+
+#define TRY(statement)                                                         \
+    do {                                                                       \
+        ret = statement;                                                       \
+        if (ret < 0) {                                                         \
+            goto fail;                                                         \
+        }                                                                      \
+    } while ((void)0, 0)
+
+    if (outlen <= BLAKE2B_OUTBYTES) {
+        TRY(blake2b_init(&blake_state, outlen));
+        TRY(blake2b_update(&blake_state, outlen_bytes, sizeof(outlen_bytes)));
+        TRY(blake2b_update(&blake_state, in, inlen));
+        TRY(blake2b_final(&blake_state, out, outlen));
+    } else {
+        uint32_t toproduce;
+        uint8_t out_buffer[BLAKE2B_OUTBYTES];
+        uint8_t in_buffer[BLAKE2B_OUTBYTES];
+        TRY(blake2b_init(&blake_state, BLAKE2B_OUTBYTES));
+        TRY(blake2b_update(&blake_state, outlen_bytes, sizeof(outlen_bytes)));
+        TRY(blake2b_update(&blake_state, in, inlen));
+        TRY(blake2b_final(&blake_state, out_buffer, BLAKE2B_OUTBYTES));
+        memcpy(out, out_buffer, BLAKE2B_OUTBYTES / 2);
+        out += BLAKE2B_OUTBYTES / 2;
+        toproduce = (uint32_t)outlen - BLAKE2B_OUTBYTES / 2;
+
+        while (toproduce > BLAKE2B_OUTBYTES) {
+            memcpy(in_buffer, out_buffer, BLAKE2B_OUTBYTES);
+            TRY(blake2b(out_buffer, BLAKE2B_OUTBYTES, in_buffer,
+                        BLAKE2B_OUTBYTES, NULL, 0));
+            memcpy(out, out_buffer, BLAKE2B_OUTBYTES / 2);
+            out += BLAKE2B_OUTBYTES / 2;
+            toproduce -= BLAKE2B_OUTBYTES / 2;
+        }
+
+        memcpy(in_buffer, out_buffer, BLAKE2B_OUTBYTES);
+        TRY(blake2b(out_buffer, toproduce, in_buffer, BLAKE2B_OUTBYTES, NULL,
+                    0));
+        memcpy(out, out_buffer, toproduce);
+    }
+fail:
+    clear_internal_memory(&blake_state, sizeof(blake_state));
+    return ret;
+#undef TRY
+}
+/* Argon2 Team - End Code */
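
Note on the variable-length branch of blake2b_long above: when outlen exceeds BLAKE2B_OUTBYTES, the function chains 64-byte BLAKE2b digests and copies out only the first 32 bytes of each, except for the last digest, which is copied out whole. The sketch below models that control flow only. `blake2b_long_calls` is a hypothetical helper that is not part of the Argon2 sources; it counts hash invocations and hashes nothing.

```c
#include <stddef.h>

#define BLAKE2B_OUTBYTES 64 /* same value as in blake2.h */

/* Hypothetical helper (not in the Argon2 sources): returns how many
 * BLAKE2b invocations blake2b_long would perform for a given outlen. */
size_t blake2b_long_calls(size_t outlen) {
    size_t calls;
    size_t toproduce;

    if (outlen <= BLAKE2B_OUTBYTES) {
        return 1; /* short output: a single hash of exactly outlen bytes */
    }

    calls = 1; /* first 64-byte digest, of which 32 bytes are emitted */
    toproduce = outlen - BLAKE2B_OUTBYTES / 2;

    while (toproduce > BLAKE2B_OUTBYTES) {
        calls++; /* intermediate digest: 32 more bytes emitted */
        toproduce -= BLAKE2B_OUTBYTES / 2;
    }

    return calls + 1; /* final digest of toproduce bytes, emitted whole */
}
```

Under these assumptions a 1024-byte Argon2 block takes 31 BLAKE2b calls: 30 digests contribute 32 bytes each and the last one contributes 64.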
--- a/algo/argon2/argon2d/blake2/blamka-round-opt.h
+++ b/algo/argon2/argon2d/blake2/blamka-round-opt.h
@@ -0,0 +1,471 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#ifndef BLAKE_ROUND_MKA_OPT_H
+#define BLAKE_ROUND_MKA_OPT_H
+
+#include "blake2-impl.h"
+
+#include <emmintrin.h>
+#if defined(__SSSE3__)
+#include <tmmintrin.h> /* for _mm_shuffle_epi8 and _mm_alignr_epi8 */
+#endif
+
+#if defined(__XOP__) && (defined(__GNUC__) || defined(__clang__))
+#include <x86intrin.h>
+#endif
+
+#if !defined(__AVX512F__)
+#if !defined(__AVX2__)
+#if !defined(__XOP__)
+#if defined(__SSSE3__)
+#define r16                                                                    \
+    (_mm_setr_epi8(2, 3, 4, 5, 6, 7, 0, 1, 10, 11, 12, 13, 14, 15, 8, 9))
+#define r24                                                                    \
+    (_mm_setr_epi8(3, 4, 5, 6, 7, 0, 1, 2, 11, 12, 13, 14, 15, 8, 9, 10))
+#define _mm_roti_epi64(x, c)                                                   \
+    (-(c) == 32)                                                               \
+        ? _mm_shuffle_epi32((x), _MM_SHUFFLE(2, 3, 0, 1))                      \
+        : (-(c) == 24)                                                         \
+              ? _mm_shuffle_epi8((x), r24)                                     \
+              : (-(c) == 16)                                                   \
+                    ? _mm_shuffle_epi8((x), r16)                               \
+                    : (-(c) == 63)                                             \
+                          ? _mm_xor_si128(_mm_srli_epi64((x), -(c)),           \
+                                          _mm_add_epi64((x), (x)))             \
+                          : _mm_xor_si128(_mm_srli_epi64((x), -(c)),           \
+                                          _mm_slli_epi64((x), 64 - (-(c))))
+#else /* defined(__SSE2__) */
+#define _mm_roti_epi64(r, c)                                                   \
+    _mm_xor_si128(_mm_srli_epi64((r), -(c)), _mm_slli_epi64((r), 64 - (-(c))))
+#endif
+#else /* defined(__XOP__): _mm_roti_epi64 comes from <x86intrin.h> */
+#endif /* !defined(__XOP__) */
+
+static BLAKE2_INLINE __m128i fBlaMka(__m128i x, __m128i y) {
+    const __m128i z = _mm_mul_epu32(x, y);
+    return _mm_add_epi64(_mm_add_epi64(x, y), _mm_add_epi64(z, z));
+}
+
+#define G1(A0, B0, C0, D0, A1, B1, C1, D1)                                     \
+    do {                                                                       \
+        A0 = fBlaMka(A0, B0);                                                  \
+        A1 = fBlaMka(A1, B1);                                                  \
+                                                                               \
+        D0 = _mm_xor_si128(D0, A0);                                            \
+        D1 = _mm_xor_si128(D1, A1);                                            \
+                                                                               \
+        D0 = _mm_roti_epi64(D0, -32);                                          \
+        D1 = _mm_roti_epi64(D1, -32);                                          \
+                                                                               \
+        C0 = fBlaMka(C0, D0);                                                  \
+        C1 = fBlaMka(C1, D1);                                                  \
+                                                                               \
+        B0 = _mm_xor_si128(B0, C0);                                            \
+        B1 = _mm_xor_si128(B1, C1);                                            \
+                                                                               \
+        B0 = _mm_roti_epi64(B0, -24);                                          \
+        B1 = _mm_roti_epi64(B1, -24);                                          \
+    } while ((void)0, 0)
+
+#define G2(A0, B0, C0, D0, A1, B1, C1, D1)                                     \
+    do {                                                                       \
+        A0 = fBlaMka(A0, B0);                                                  \
+        A1 = fBlaMka(A1, B1);                                                  \
+                                                                               \
+        D0 = _mm_xor_si128(D0, A0);                                            \
+        D1 = _mm_xor_si128(D1, A1);                                            \
+                                                                               \
+        D0 = _mm_roti_epi64(D0, -16);                                          \
+        D1 = _mm_roti_epi64(D1, -16);                                          \
+                                                                               \
+        C0 = fBlaMka(C0, D0);                                                  \
+        C1 = fBlaMka(C1, D1);                                                  \
+                                                                               \
+        B0 = _mm_xor_si128(B0, C0);                                            \
+        B1 = _mm_xor_si128(B1, C1);                                            \
+                                                                               \
+        B0 = _mm_roti_epi64(B0, -63);                                          \
+        B1 = _mm_roti_epi64(B1, -63);                                          \
+    } while ((void)0, 0)
+
+#if defined(__SSSE3__)
+#define DIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1)                            \
+    do {                                                                       \
+        __m128i t0 = _mm_alignr_epi8(B1, B0, 8);                               \
+        __m128i t1 = _mm_alignr_epi8(B0, B1, 8);                               \
+        B0 = t0;                                                               \
+        B1 = t1;                                                               \
+                                                                               \
+        t0 = C0;                                                               \
+        C0 = C1;                                                               \
+        C1 = t0;                                                               \
+                                                                               \
+        t0 = _mm_alignr_epi8(D1, D0, 8);                                       \
+        t1 = _mm_alignr_epi8(D0, D1, 8);                                       \
+        D0 = t1;                                                               \
+        D1 = t0;                                                               \
+    } while ((void)0, 0)
+
+#define UNDIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1)                          \
+    do {                                                                       \
+        __m128i t0 = _mm_alignr_epi8(B0, B1, 8);                               \
+        __m128i t1 = _mm_alignr_epi8(B1, B0, 8);                               \
+        B0 = t0;                                                               \
+        B1 = t1;                                                               \
+                                                                               \
+        t0 = C0;                                                               \
+        C0 = C1;                                                               \
+        C1 = t0;                                                               \
+                                                                               \
+        t0 = _mm_alignr_epi8(D0, D1, 8);                                       \
+        t1 = _mm_alignr_epi8(D1, D0, 8);                                       \
+        D0 = t1;                                                               \
+        D1 = t0;                                                               \
+    } while ((void)0, 0)
+#else /* SSE2 */
+#define DIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1)                            \
+    do {                                                                       \
+        __m128i t0 = D0;                                                       \
+        __m128i t1 = B0;                                                       \
+        D0 = C0;                                                               \
+        C0 = C1;                                                               \
+        C1 = D0;                                                               \
+        D0 = _mm_unpackhi_epi64(D1, _mm_unpacklo_epi64(t0, t0));               \
+        D1 = _mm_unpackhi_epi64(t0, _mm_unpacklo_epi64(D1, D1));               \
+        B0 = _mm_unpackhi_epi64(B0, _mm_unpacklo_epi64(B1, B1));               \
+        B1 = _mm_unpackhi_epi64(B1, _mm_unpacklo_epi64(t1, t1));               \
+    } while ((void)0, 0)
+
+#define UNDIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1)                          \
+    do {                                                                       \
+        __m128i t0, t1;                                                        \
+        t0 = C0;                                                               \
+        C0 = C1;                                                               \
+        C1 = t0;                                                               \
+        t0 = B0;                                                               \
+        t1 = D0;                                                               \
+        B0 = _mm_unpackhi_epi64(B1, _mm_unpacklo_epi64(B0, B0));               \
+        B1 = _mm_unpackhi_epi64(t0, _mm_unpacklo_epi64(B1, B1));               \
+        D0 = _mm_unpackhi_epi64(D0, _mm_unpacklo_epi64(D1, D1));               \
+        D1 = _mm_unpackhi_epi64(D1, _mm_unpacklo_epi64(t1, t1));               \
+    } while ((void)0, 0)
+#endif
+
+#define BLAKE2_ROUND(A0, A1, B0, B1, C0, C1, D0, D1)                           \
+    do {                                                                       \
+        G1(A0, B0, C0, D0, A1, B1, C1, D1);                                    \
+        G2(A0, B0, C0, D0, A1, B1, C1, D1);                                    \
+                                                                               \
+        DIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1);                           \
+                                                                               \
+        G1(A0, B0, C0, D0, A1, B1, C1, D1);                                    \
+        G2(A0, B0, C0, D0, A1, B1, C1, D1);                                    \
+                                                                               \
+        UNDIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1);                         \
+    } while ((void)0, 0)
+#else /* __AVX2__ */
+
+#include <immintrin.h>
+
+#define rotr32(x)   _mm256_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1))
+#define rotr24(x)   _mm256_shuffle_epi8(x, _mm256_setr_epi8(3, 4, 5, 6, 7, 0, 1, 2, 11, 12, 13, 14, 15, 8, 9, 10, 3, 4, 5, 6, 7, 0, 1, 2, 11, 12, 13, 14, 15, 8, 9, 10))
+#define rotr16(x)   _mm256_shuffle_epi8(x, _mm256_setr_epi8(2, 3, 4, 5, 6, 7, 0, 1, 10, 11, 12, 13, 14, 15, 8, 9, 2, 3, 4, 5, 6, 7, 0, 1, 10, 11, 12, 13, 14, 15, 8, 9))
+#define rotr63(x)   _mm256_xor_si256(_mm256_srli_epi64((x), 63), _mm256_add_epi64((x), (x)))
+
+#define G1_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+    do { \
+        __m256i ml = _mm256_mul_epu32(A0, B0); \
+        ml = _mm256_add_epi64(ml, ml); \
+        A0 = _mm256_add_epi64(A0, _mm256_add_epi64(B0, ml)); \
+        D0 = _mm256_xor_si256(D0, A0); \
+        D0 = rotr32(D0); \
+        \
+        ml = _mm256_mul_epu32(C0, D0); \
+        ml = _mm256_add_epi64(ml, ml); \
+        C0 = _mm256_add_epi64(C0, _mm256_add_epi64(D0, ml)); \
+        \
+        B0 = _mm256_xor_si256(B0, C0); \
+        B0 = rotr24(B0); \
+        \
+        ml = _mm256_mul_epu32(A1, B1); \
+        ml = _mm256_add_epi64(ml, ml); \
+        A1 = _mm256_add_epi64(A1, _mm256_add_epi64(B1, ml)); \
+        D1 = _mm256_xor_si256(D1, A1); \
+        D1 = rotr32(D1); \
+        \
+        ml = _mm256_mul_epu32(C1, D1); \
+        ml = _mm256_add_epi64(ml, ml); \
+        C1 = _mm256_add_epi64(C1, _mm256_add_epi64(D1, ml)); \
+        \
+        B1 = _mm256_xor_si256(B1, C1); \
+        B1 = rotr24(B1); \
+    } while((void)0, 0);
+
+#define G2_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+    do { \
+        __m256i ml = _mm256_mul_epu32(A0, B0); \
+        ml = _mm256_add_epi64(ml, ml); \
+        A0 = _mm256_add_epi64(A0, _mm256_add_epi64(B0, ml)); \
+        D0 = _mm256_xor_si256(D0, A0); \
+        D0 = rotr16(D0); \
+        \
+        ml = _mm256_mul_epu32(C0, D0); \
+        ml = _mm256_add_epi64(ml, ml); \
+        C0 = _mm256_add_epi64(C0, _mm256_add_epi64(D0, ml)); \
+        B0 = _mm256_xor_si256(B0, C0); \
+        B0 = rotr63(B0); \
+        \
+        ml = _mm256_mul_epu32(A1, B1); \
+        ml = _mm256_add_epi64(ml, ml); \
+        A1 = _mm256_add_epi64(A1, _mm256_add_epi64(B1, ml)); \
+        D1 = _mm256_xor_si256(D1, A1); \
+        D1 = rotr16(D1); \
+        \
+        ml = _mm256_mul_epu32(C1, D1); \
+        ml = _mm256_add_epi64(ml, ml); \
+        C1 = _mm256_add_epi64(C1, _mm256_add_epi64(D1, ml)); \
+        B1 = _mm256_xor_si256(B1, C1); \
+        B1 = rotr63(B1); \
+    } while((void)0, 0);
+
+#define DIAGONALIZE_1(A0, B0, C0, D0, A1, B1, C1, D1) \
+    do { \
+        B0 = _mm256_permute4x64_epi64(B0, _MM_SHUFFLE(0, 3, 2, 1)); \
+        C0 = _mm256_permute4x64_epi64(C0, _MM_SHUFFLE(1, 0, 3, 2)); \
+        D0 = _mm256_permute4x64_epi64(D0, _MM_SHUFFLE(2, 1, 0, 3)); \
+        \
+        B1 = _mm256_permute4x64_epi64(B1, _MM_SHUFFLE(0, 3, 2, 1)); \
+        C1 = _mm256_permute4x64_epi64(C1, _MM_SHUFFLE(1, 0, 3, 2)); \
+        D1 = _mm256_permute4x64_epi64(D1, _MM_SHUFFLE(2, 1, 0, 3)); \
+    } while((void)0, 0);
+
+#define DIAGONALIZE_2(A0, A1, B0, B1, C0, C1, D0, D1) \
+    do { \
+        __m256i tmp1 = _mm256_blend_epi32(B0, B1, 0xCC); \
+        __m256i tmp2 = _mm256_blend_epi32(B0, B1, 0x33); \
+        B1 = _mm256_permute4x64_epi64(tmp1, _MM_SHUFFLE(2,3,0,1)); \
+        B0 = _mm256_permute4x64_epi64(tmp2, _MM_SHUFFLE(2,3,0,1)); \
+        \
+        tmp1 = C0; \
+        C0 = C1; \
+        C1 = tmp1; \
+        \
+        tmp1 = _mm256_blend_epi32(D0, D1, 0xCC); \
+        tmp2 = _mm256_blend_epi32(D0, D1, 0x33); \
+        D0 = _mm256_permute4x64_epi64(tmp1, _MM_SHUFFLE(2,3,0,1)); \
+        D1 = _mm256_permute4x64_epi64(tmp2, _MM_SHUFFLE(2,3,0,1)); \
+    } while((void)0, 0);
+
+#define UNDIAGONALIZE_1(A0, B0, C0, D0, A1, B1, C1, D1) \
+    do { \
+        B0 = _mm256_permute4x64_epi64(B0, _MM_SHUFFLE(2, 1, 0, 3)); \
+        C0 = _mm256_permute4x64_epi64(C0, _MM_SHUFFLE(1, 0, 3, 2)); \
+        D0 = _mm256_permute4x64_epi64(D0, _MM_SHUFFLE(0, 3, 2, 1)); \
+        \
+        B1 = _mm256_permute4x64_epi64(B1, _MM_SHUFFLE(2, 1, 0, 3)); \
+        C1 = _mm256_permute4x64_epi64(C1, _MM_SHUFFLE(1, 0, 3, 2)); \
+        D1 = _mm256_permute4x64_epi64(D1, _MM_SHUFFLE(0, 3, 2, 1)); \
+    } while((void)0, 0);
+
+#define UNDIAGONALIZE_2(A0, A1, B0, B1, C0, C1, D0, D1) \
+    do { \
+        __m256i tmp1 = _mm256_blend_epi32(B0, B1, 0xCC); \
+        __m256i tmp2 = _mm256_blend_epi32(B0, B1, 0x33); \
+        B0 = _mm256_permute4x64_epi64(tmp1, _MM_SHUFFLE(2,3,0,1)); \
+        B1 = _mm256_permute4x64_epi64(tmp2, _MM_SHUFFLE(2,3,0,1)); \
+        \
+        tmp1 = C0; \
+        C0 = C1; \
+        C1 = tmp1; \
+        \
+        tmp1 = _mm256_blend_epi32(D0, D1, 0x33); \
+        tmp2 = _mm256_blend_epi32(D0, D1, 0xCC); \
+        D0 = _mm256_permute4x64_epi64(tmp1, _MM_SHUFFLE(2,3,0,1)); \
+        D1 = _mm256_permute4x64_epi64(tmp2, _MM_SHUFFLE(2,3,0,1)); \
+    } while((void)0, 0);
+
+#define BLAKE2_ROUND_1(A0, A1, B0, B1, C0, C1, D0, D1) \
+    do{ \
+        G1_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        G2_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        \
+        DIAGONALIZE_1(A0, B0, C0, D0, A1, B1, C1, D1) \
+        \
+        G1_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        G2_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        \
+        UNDIAGONALIZE_1(A0, B0, C0, D0, A1, B1, C1, D1) \
+    } while((void)0, 0);
+
+#define BLAKE2_ROUND_2(A0, A1, B0, B1, C0, C1, D0, D1) \
+    do{ \
+        G1_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        G2_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        \
+        DIAGONALIZE_2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        \
+        G1_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        G2_AVX2(A0, A1, B0, B1, C0, C1, D0, D1) \
+        \
+        UNDIAGONALIZE_2(A0, A1, B0, B1, C0, C1, D0, D1) \
+    } while((void)0, 0);
+
+#endif /* __AVX2__ */
+
+#else /* __AVX512F__ */
+
+#include <immintrin.h>
+
+#define ror64(x, n) _mm512_ror_epi64((x), (n))
+
+static __m512i muladd(__m512i x, __m512i y)
+{
+    __m512i z = _mm512_mul_epu32(x, y);
+    return _mm512_add_epi64(_mm512_add_epi64(x, y), _mm512_add_epi64(z, z));
+}
+
+#define G1(A0, B0, C0, D0, A1, B1, C1, D1) \
+    do { \
+        A0 = muladd(A0, B0); \
+        A1 = muladd(A1, B1); \
+\
+        D0 = _mm512_xor_si512(D0, A0); \
+        D1 = _mm512_xor_si512(D1, A1); \
+\
+        D0 = ror64(D0, 32); \
+        D1 = ror64(D1, 32); \
+\
+        C0 = muladd(C0, D0); \
+        C1 = muladd(C1, D1); \
+\
+        B0 = _mm512_xor_si512(B0, C0); \
+        B1 = _mm512_xor_si512(B1, C1); \
+\
+        B0 = ror64(B0, 24); \
+        B1 = ror64(B1, 24); \
+    } while ((void)0, 0)
+
+#define G2(A0, B0, C0, D0, A1, B1, C1, D1) \
+    do { \
+        A0 = muladd(A0, B0); \
+        A1 = muladd(A1, B1); \
+\
+        D0 = _mm512_xor_si512(D0, A0); \
+        D1 = _mm512_xor_si512(D1, A1); \
+\
+        D0 = ror64(D0, 16); \
+        D1 = ror64(D1, 16); \
+\
+        C0 = muladd(C0, D0); \
+        C1 = muladd(C1, D1); \
+\
+        B0 = _mm512_xor_si512(B0, C0); \
+        B1 = _mm512_xor_si512(B1, C1); \
+\
+        B0 = ror64(B0, 63); \
+        B1 = ror64(B1, 63); \
+    } while ((void)0, 0)
+
+#define DIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1) \
+    do { \
+        B0 = _mm512_permutex_epi64(B0, _MM_SHUFFLE(0, 3, 2, 1)); \
+        B1 = _mm512_permutex_epi64(B1, _MM_SHUFFLE(0, 3, 2, 1)); \
+\
+        C0 = _mm512_permutex_epi64(C0, _MM_SHUFFLE(1, 0, 3, 2)); \
+        C1 = _mm512_permutex_epi64(C1, _MM_SHUFFLE(1, 0, 3, 2)); \
+\
+        D0 = _mm512_permutex_epi64(D0, _MM_SHUFFLE(2, 1, 0, 3)); \
+        D1 = _mm512_permutex_epi64(D1, _MM_SHUFFLE(2, 1, 0, 3)); \
+    } while ((void)0, 0)
+
+#define UNDIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1) \
+    do { \
+        B0 = _mm512_permutex_epi64(B0, _MM_SHUFFLE(2, 1, 0, 3)); \
+        B1 = _mm512_permutex_epi64(B1, _MM_SHUFFLE(2, 1, 0, 3)); \
+\
+        C0 = _mm512_permutex_epi64(C0, _MM_SHUFFLE(1, 0, 3, 2)); \
+        C1 = _mm512_permutex_epi64(C1, _MM_SHUFFLE(1, 0, 3, 2)); \
+\
+        D0 = _mm512_permutex_epi64(D0, _MM_SHUFFLE(0, 3, 2, 1)); \
+        D1 = _mm512_permutex_epi64(D1, _MM_SHUFFLE(0, 3, 2, 1)); \
+    } while ((void)0, 0)
+
+#define BLAKE2_ROUND(A0, B0, C0, D0, A1, B1, C1, D1) \
+    do { \
+        G1(A0, B0, C0, D0, A1, B1, C1, D1); \
+        G2(A0, B0, C0, D0, A1, B1, C1, D1); \
+\
+        DIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1); \
+\
+        G1(A0, B0, C0, D0, A1, B1, C1, D1); \
+        G2(A0, B0, C0, D0, A1, B1, C1, D1); \
+\
+        UNDIAGONALIZE(A0, B0, C0, D0, A1, B1, C1, D1); \
+    } while ((void)0, 0)
+
+#define SWAP_HALVES(A0, A1) \
+    do { \
+        __m512i t0, t1; \
+        t0 = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(1, 0, 1, 0)); \
+        t1 = _mm512_shuffle_i64x2(A0, A1, _MM_SHUFFLE(3, 2, 3, 2)); \
+        A0 = t0; \
+        A1 = t1; \
+    } while((void)0, 0)
+
+#define SWAP_QUARTERS(A0, A1) \
+    do { \
+        SWAP_HALVES(A0, A1); \
+        A0 = _mm512_permutexvar_epi64(_mm512_setr_epi64(0, 1, 4, 5, 2, 3, 6, 7), A0); \
+        A1 = _mm512_permutexvar_epi64(_mm512_setr_epi64(0, 1, 4, 5, 2, 3, 6, 7), A1); \
+    } while((void)0, 0)
+
+#define UNSWAP_QUARTERS(A0, A1) \
+    do { \
+        A0 = _mm512_permutexvar_epi64(_mm512_setr_epi64(0, 1, 4, 5, 2, 3, 6, 7), A0); \
+        A1 = _mm512_permutexvar_epi64(_mm512_setr_epi64(0, 1, 4, 5, 2, 3, 6, 7), A1); \
+        SWAP_HALVES(A0, A1); \
+    } while((void)0, 0)
+
+#define BLAKE2_ROUND_1(A0, C0, B0, D0, A1, C1, B1, D1) \
+    do { \
+        SWAP_HALVES(A0, B0); \
+        SWAP_HALVES(C0, D0); \
+        SWAP_HALVES(A1, B1); \
+        SWAP_HALVES(C1, D1); \
+        BLAKE2_ROUND(A0, B0, C0, D0, A1, B1, C1, D1); \
+        SWAP_HALVES(A0, B0); \
+        SWAP_HALVES(C0, D0); \
+        SWAP_HALVES(A1, B1); \
+        SWAP_HALVES(C1, D1); \
+    } while ((void)0, 0)
+
+#define BLAKE2_ROUND_2(A0, A1, B0, B1, C0, C1, D0, D1) \
+    do { \
+        SWAP_QUARTERS(A0, A1); \
+        SWAP_QUARTERS(B0, B1); \
+        SWAP_QUARTERS(C0, C1); \
+        SWAP_QUARTERS(D0, D1); \
+        BLAKE2_ROUND(A0, B0, C0, D0, A1, B1, C1, D1); \
+        UNSWAP_QUARTERS(A0, A1); \
+        UNSWAP_QUARTERS(B0, B1); \
+        UNSWAP_QUARTERS(C0, C1); \
+        UNSWAP_QUARTERS(D0, D1); \
+    } while ((void)0, 0)
+
+#endif /* __AVX512F__ */
+#endif /* BLAKE_ROUND_MKA_OPT_H */
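
Note on the rotation macros in blamka-round-opt.h above: the SSSE3 and AVX2 paths avoid generic shift pairs where a cheaper form exists. Rotations by 32, 24 and 16 bits become lane or byte shuffles, and the rotation by 63 bits becomes a shift XORed with x + x. The scalar sketch below models one 64-bit lane to show why these forms agree with a plain rotate. All function and table names here are illustrative and do not appear in the header; the index tables mirror the first eight entries of r24 and r16.

```c
#include <assert.h>
#include <stdint.h>

/* Plain 64-bit rotate right, used as the reference. */
uint64_t rotr64(uint64_t x, unsigned n) {
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

/* Rotate by 63: (x >> 63) occupies bit 0 only and (x + x) has bit 0
 * clear, so XOR behaves like OR and the result is a rotate left by 1. */
uint64_t rotr63_add(uint64_t x) {
    return (x >> 63) ^ (x + x);
}

/* Rotate by 32: swapping the two 32-bit halves of the lane, which is
 * what _MM_SHUFFLE(2, 3, 0, 1) does for each 64-bit lane. */
uint64_t rotr32_swap(uint64_t x) {
    const uint32_t lo = (uint32_t)x;
    const uint32_t hi = (uint32_t)(x >> 32);
    return ((uint64_t)lo << 32) | hi;
}

/* Per-lane byte indices, as in the first half of r24 and r16. */
const uint8_t R24_IDX[8] = {3, 4, 5, 6, 7, 0, 1, 2};
const uint8_t R16_IDX[8] = {2, 3, 4, 5, 6, 7, 0, 1};

/* Scalar model of a byte shuffle: output byte i is input byte idx[i],
 * with byte 0 being the least significant. */
uint64_t shuffle_bytes(uint64_t x, const uint8_t idx[8]) {
    uint64_t r = 0;
    int i;
    for (i = 0; i < 8; i++) {
        r |= ((x >> (8 * idx[i])) & 0xFF) << (8 * i);
    }
    return r;
}
```

For any x, `shuffle_bytes(x, R24_IDX)` should equal `rotr64(x, 24)`, and likewise for the 16-bit table, because a rotation by a whole number of bytes only moves bytes.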
--- a/algo/argon2/argon2d/blake2/blamka-round-ref.h
+++ b/algo/argon2/argon2d/blake2/blamka-round-ref.h
@@ -0,0 +1,56 @@
+/*
+ * Argon2 reference source code package - reference C implementations
+ *
+ * Copyright 2015
+ * Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
+ *
+ * You may use this work under the terms of a Creative Commons CC0 1.0
+ * License/Waiver or the Apache Public License 2.0, at your option. The terms of
+ * these licenses can be found at:
+ *
+ * - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+ * - Apache 2.0        : http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * You should have received a copy of both of these licenses along with this
+ * software. If not, they may be obtained at the above URLs.
+ */
+
+#ifndef BLAKE_ROUND_MKA_H
+#define BLAKE_ROUND_MKA_H
+
+#include "blake2.h"
+#include "blake2-impl.h"
+
+/* fBlaMka: x + y + 2 * lo32(x) * lo32(y), designed by the Lyra PHC team */
+static BLAKE2_INLINE uint64_t fBlaMka(uint64_t x, uint64_t y) {
+    const uint64_t m = UINT64_C(0xFFFFFFFF);
+    const uint64_t xy = (x & m) * (y & m);
+    return x + y + 2 * xy;
+}
+
+#define G(a, b, c, d)                                                          \
+    do {                                                                       \
+        a = fBlaMka(a, b);                                                     \
+        d = rotr64(d ^ a, 32);                                                 \
+        c = fBlaMka(c, d);                                                     \
+        b = rotr64(b ^ c, 24);                                                 \
+        a = fBlaMka(a, b);                                                     \
+        d = rotr64(d ^ a, 16);                                                 \
+        c = fBlaMka(c, d);                                                     \
+        b = rotr64(b ^ c, 63);                                                 \
+    } while ((void)0, 0)
+
+#define BLAKE2_ROUND_NOMSG(v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11,   \
+                           v12, v13, v14, v15)                                 \
+    do {                                                                       \
+        G(v0, v4, v8, v12);                                                    \
+        G(v1, v5, v9, v13);                                                    \
+        G(v2, v6, v10, v14);                                                   \
+        G(v3, v7, v11, v15);                                                   \
+        G(v0, v5, v10, v15);                                                   \
+        G(v1, v6, v11, v12);                                                   \
+        G(v2, v7, v8, v13);                                                    \
+        G(v3, v4, v9, v14);                                                    \
+    } while ((void)0, 0)
+
+#endif
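
Note on fBlaMka in blamka-round-ref.h above: the reference version masks both operands to 32 bits before multiplying, while the SSE2 version in blamka-round-opt.h relies on _mm_mul_epu32, which multiplies the low 32 bits of each 64-bit lane. The sketch below puts the two forms side by side for a single lane. `blamka_ref` and `blamka_lane` are illustrative names, not identifiers from the Argon2 sources, and the lane model is an assumption about one lane of the vector code rather than the vector code itself.

```c
#include <stdint.h>

/* Same arithmetic as the reference fBlaMka: all sums wrap modulo 2^64. */
uint64_t blamka_ref(uint64_t x, uint64_t y) {
    const uint64_t m = UINT64_C(0xFFFFFFFF);
    const uint64_t xy = (x & m) * (y & m);
    return x + y + 2 * xy;
}

/* Scalar model of one 64-bit lane of the SSE2 fBlaMka: the product of
 * the low 32-bit halves is added twice instead of being doubled. */
uint64_t blamka_lane(uint64_t x, uint64_t y) {
    const uint64_t z = (uint64_t)(uint32_t)x * (uint32_t)y;
    return (x + y) + (z + z);
}
```

As a worked value, x = 0x100000003 and y = 0x200000005 give a low-half product of 15, so both forms return 0x300000008 + 30 = 0x300000026.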
--- a/algo/blake/blake-4way.c
+++ b/algo/blake/blake-4way.c
@@ -15,7 +15,7 @@ void blakehash_4way(void *state, const void *input)
     memcpy( &ctx, &blake_4w_ctx, sizeof ctx );
     blake256r14_4way( &ctx, input + (64<<2), 16 );
     blake256r14_4way_close( &ctx, vhash );
-     mm_deinterleave_4x32( state, state+32, state+64, state+96, vhash, 256 );
+     mm128_deinterleave_4x32( state, state+32, state+64, state+96, vhash, 256 );
 }

 int scanhash_blake_4way( int thr_id, struct work *work, uint32_t max_nonce,
@@ -37,7 +37,7 @@ int scanhash_blake_4way( int thr_id, struct work *work, uint32_t max_nonce,

   // we need big endian data...
   swab32_array( edata, pdata, 20 );
-   mm_interleave_4x32( vdata, edata, edata, edata, edata, 640 );
+   mm128_interleave_4x32( vdata, edata, edata, edata, edata, 640 );
   blake256r14_4way_init( &blake_4w_ctx );
   blake256r14_4way( &blake_4w_ctx, vdata, 64 );

--- a/algo/blake/blake-gate.c
+++ b/algo/blake/blake-gate.c
@@ -10,7 +10,7 @@ bool register_blake_algo( algo_gate_t* gate )
  gate->optimizations = AVX2_OPT;
  gate->get_max64 = (void*)&blake_get_max64;
 //#if defined (__AVX2__) && defined (FOUR_WAY)
-//   gate->optimizations = SSE2_OPT | AVX_OPT | AVX2_OPT;
+//   gate->optimizations = SSE2_OPT | AVX2_OPT;
 //  gate->scanhash  = (void*)&scanhash_blake_8way;
 //  gate->hash      = (void*)&blakehash_8way;
 #if defined(BLAKE_4WAY)
--- a/algo/blake/blake-hash-4way.h
+++ b/algo/blake/blake-hash-4way.h
@@ -37,7 +37,7 @@
 #ifndef __BLAKE_HASH_4WAY__
 #define __BLAKE_HASH_4WAY__ 1

-#ifdef __AVX__
+//#ifdef __SSE4_2__

 #ifdef __cplusplus
 extern "C"{
@@ -45,31 +45,34 @@ extern "C"{

 #include <stddef.h>
 #include "algo/sha/sph_types.h"
-#include "avxdefs.h"
+#include "simd-utils.h"

 #define SPH_SIZE_blake256   256

 #define SPH_SIZE_blake512   512

-// With AVX only Blake-256 4 way is available.
+// With only SSE4.2, Blake-256 4 way is the sole variant available.
 // With AVX2 Blake-256 8way & Blake-512 4 way are also available.

 // Blake-256 4 way

 typedef struct {
-   __m128i buf[16] __attribute__ ((aligned (64)));
-   __m128i H[8];
-   __m128i S[4];    
+   unsigned char buf[64<<2];
+   uint32_t H[8<<2];
+   uint32_t S[4<<2];
+//   __m128i buf[16] __attribute__ ((aligned (64)));
+//   __m128i H[8];
+//   __m128i S[4];
   size_t ptr;
-   sph_u32 T0, T1;
+   uint32_t T0, T1;
   int rounds;   // 14 for blake, 8 for blakecoin & vanilla
-} blake_4way_small_context;
+} blake_4way_small_context __attribute__ ((aligned (64)));

 // Default 14 rounds
 typedef blake_4way_small_context blake256_4way_context;
-void blake256_4way_init(void *cc);
-void blake256_4way(void *cc, const void *data, size_t len);
-void blake256_4way_close(void *cc, void *dst);
+void blake256_4way_init(void *ctx);
+void blake256_4way(void *ctx, const void *data, size_t len);
+void blake256_4way_close(void *ctx, void *dst);

 // 14 rounds, blake, decred
 typedef blake_4way_small_context blake256r14_4way_context;
@@ -132,12 +135,10 @@ void blake512_4way_close(void *cc, void *dst);
 void blake512_4way_addbits_and_close(
 	void *cc, unsigned ub, unsigned n, void *dst);

-#endif
+#endif  // AVX2

 #ifdef __cplusplus
 }
 #endif

-#endif
-
-#endif
+#endif  // __BLAKE_HASH_4WAY__
--- a/algo/blake/blake256-hash-4way.c
+++ b/algo/blake/blake256-hash-4way.c
@@ -30,9 +30,10 @@
 * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
 */

-#if defined (__AVX__)
+//#if defined (__SSE4_2__)

 #include <stddef.h>
+#include <stdint.h>
 #include <string.h>
 #include <limits.h>

@@ -60,26 +61,12 @@ extern "C"{

 // Blake-256

-static const sph_u32 IV256[8] = {
-	SPH_C32(0x6A09E667), SPH_C32(0xBB67AE85),
-	SPH_C32(0x3C6EF372), SPH_C32(0xA54FF53A),
-	SPH_C32(0x510E527F), SPH_C32(0x9B05688C),
-	SPH_C32(0x1F83D9AB), SPH_C32(0x5BE0CD19)
+static const uint32_t IV256[8] =
+{
+	0x6A09E667, 0xBB67AE85,	0x3C6EF372, 0xA54FF53A,
+	0x510E527F, 0x9B05688C,	0x1F83D9AB, 0x5BE0CD19
 };

-#if defined (__AVX2__)
-
-// Blake-512
-
-static const sph_u64 IV512[8] = {
-	SPH_C64(0x6A09E667F3BCC908), SPH_C64(0xBB67AE8584CAA73B),
-	SPH_C64(0x3C6EF372FE94F82B), SPH_C64(0xA54FF53A5F1D36F1),
-	SPH_C64(0x510E527FADE682D1), SPH_C64(0x9B05688C2B3E6C1F),
-	SPH_C64(0x1F83D9ABFB41BD6B), SPH_C64(0x5BE0CD19137E2179)
-};
-
-#endif
-
 #if SPH_COMPACT_BLAKE_32 || SPH_COMPACT_BLAKE_64

 // Blake-256 4 & 8 way, Blake-512 4 way
@@ -317,60 +304,19 @@ static const sph_u32 CS[16] = {

 #endif

-#if defined(__AVX2__)
-
-// Blake-512 4 way
-
-#define CBx(r, i)   CBx_(Z ## r ## i)
-#define CBx_(n)     CBx__(n)
-#define CBx__(n)    CB ## n
-
-#define CB0   SPH_C64(0x243F6A8885A308D3)
-#define CB1   SPH_C64(0x13198A2E03707344)
-#define CB2   SPH_C64(0xA4093822299F31D0)
-#define CB3   SPH_C64(0x082EFA98EC4E6C89)
-#define CB4   SPH_C64(0x452821E638D01377)
-#define CB5   SPH_C64(0xBE5466CF34E90C6C)
-#define CB6   SPH_C64(0xC0AC29B7C97C50DD)
-#define CB7   SPH_C64(0x3F84D5B5B5470917)
-#define CB8   SPH_C64(0x9216D5D98979FB1B)
-#define CB9   SPH_C64(0xD1310BA698DFB5AC)
-#define CBA   SPH_C64(0x2FFD72DBD01ADFB7)
-#define CBB   SPH_C64(0xB8E1AFED6A267E96)
-#define CBC   SPH_C64(0xBA7C9045F12C7F99)
-#define CBD   SPH_C64(0x24A19947B3916CF7)
-#define CBE   SPH_C64(0x0801F2E2858EFC16)
-#define CBF   SPH_C64(0x636920D871574E69)
-
-#if SPH_COMPACT_BLAKE_64
-// not used
-static const sph_u64 CB[16] = {
-	SPH_C64(0x243F6A8885A308D3), SPH_C64(0x13198A2E03707344),
-	SPH_C64(0xA4093822299F31D0), SPH_C64(0x082EFA98EC4E6C89),
-	SPH_C64(0x452821E638D01377), SPH_C64(0xBE5466CF34E90C6C),
-	SPH_C64(0xC0AC29B7C97C50DD), SPH_C64(0x3F84D5B5B5470917),
-	SPH_C64(0x9216D5D98979FB1B), SPH_C64(0xD1310BA698DFB5AC),
-	SPH_C64(0x2FFD72DBD01ADFB7), SPH_C64(0xB8E1AFED6A267E96),
-	SPH_C64(0xBA7C9045F12C7F99), SPH_C64(0x24A19947B3916CF7),
-	SPH_C64(0x0801F2E2858EFC16), SPH_C64(0x636920D871574E69)
-};
-
-#endif
-
-#endif

 #define GS_4WAY( m0, m1, c0, c1, a, b, c, d ) \
 do { \
   a = _mm_add_epi32( _mm_add_epi32( _mm_xor_si128( \
                 _mm_set_epi32( c1, c1, c1, c1 ), m0 ), b ), a ); \
-   d = mm_rotr_32( _mm_xor_si128( d, a ), 16 ); \
+   d = mm128_ror_32( _mm_xor_si128( d, a ), 16 ); \
   c = _mm_add_epi32( c, d ); \
-   b = mm_rotr_32( _mm_xor_si128( b, c ), 12 ); \
+   b = mm128_ror_32( _mm_xor_si128( b, c ), 12 ); \
   a = _mm_add_epi32( _mm_add_epi32( _mm_xor_si128( \
                 _mm_set_epi32( c0, c0, c0, c0 ), m1 ), b ), a ); \
-   d = mm_rotr_32( _mm_xor_si128( d, a ), 8 ); \
+   d = mm128_ror_32( _mm_xor_si128( d, a ), 8 ); \
   c = _mm_add_epi32( c, d ); \
-   b = mm_rotr_32( _mm_xor_si128( b, c ), 7 ); \
+   b = mm128_ror_32( _mm_xor_si128( b, c ), 7 ); \
 } while (0)

 #if SPH_COMPACT_BLAKE_32
@@ -411,125 +357,41 @@ do { \

 #endif

-#if defined (__AVX2__)
-
-// Blake-256 8 way
-
-#define GS_8WAY( m0, m1, c0, c1, a, b, c, d ) \
-do { \
-   a = _mm256_add_epi32( _mm256_add_epi32( _mm256_xor_si256( \
-                 _mm256_set1_epi32( c1 ), m0 ), b ), a ); \
-   d = mm256_rotr_32( _mm256_xor_si256( d, a ), 16 ); \
-   c = _mm256_add_epi32( c, d ); \
-   b = mm256_rotr_32( _mm256_xor_si256( b, c ), 12 ); \
-   a = _mm256_add_epi32( _mm256_add_epi32( _mm256_xor_si256( \
-                 _mm256_set1_epi32( c0 ), m1 ), b ), a ); \
-   d = mm256_rotr_32( _mm256_xor_si256( d, a ), 8 ); \
-   c = _mm256_add_epi32( c, d ); \
-   b = mm256_rotr_32( _mm256_xor_si256( b, c ), 7 ); \
-} while (0)
-
-#define ROUND_S_8WAY(r)   do { \
-        GS_8WAY(Mx(r, 0), Mx(r, 1), CSx(r, 0), CSx(r, 1), V0, V4, V8, VC); \
-        GS_8WAY(Mx(r, 2), Mx(r, 3), CSx(r, 2), CSx(r, 3), V1, V5, V9, VD); \
-        GS_8WAY(Mx(r, 4), Mx(r, 5), CSx(r, 4), CSx(r, 5), V2, V6, VA, VE); \
-        GS_8WAY(Mx(r, 6), Mx(r, 7), CSx(r, 6), CSx(r, 7), V3, V7, VB, VF); \
-        GS_8WAY(Mx(r, 8), Mx(r, 9), CSx(r, 8), CSx(r, 9), V0, V5, VA, VF); \
-        GS_8WAY(Mx(r, A), Mx(r, B), CSx(r, A), CSx(r, B), V1, V6, VB, VC); \
-        GS_8WAY(Mx(r, C), Mx(r, D), CSx(r, C), CSx(r, D), V2, V7, V8, VD); \
-        GS_8WAY(Mx(r, E), Mx(r, F), CSx(r, E), CSx(r, F), V3, V4, V9, VE); \
-} while (0)
-
-// Blake-512 4 way
-
-#define GB_4WAY(m0, m1, c0, c1, a, b, c, d)   do { \
-   a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
-                 _mm256_set_epi64x( c1, c1, c1, c1 ), m0 ), b ), a ); \
-   d = mm256_rotr_64( _mm256_xor_si256( d, a ), 32 ); \
-   c = _mm256_add_epi64( c, d ); \
-   b = mm256_rotr_64( _mm256_xor_si256( b, c ), 25 ); \
-   a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
-                 _mm256_set_epi64x( c0, c0, c0, c0 ), m1 ), b ), a ); \
-   d = mm256_rotr_64( _mm256_xor_si256( d, a ), 16 ); \
-   c = _mm256_add_epi64( c, d ); \
-   b = mm256_rotr_64( _mm256_xor_si256( b, c ), 11 ); \
-} while (0)
-
-#if SPH_COMPACT_BLAKE_64
-// not used
-#define ROUND_B_4WAY(r)   do { \
-	GB_4WAY(M[sigma[r][0x0]], M[sigma[r][0x1]], \
-		CB[sigma[r][0x0]], CB[sigma[r][0x1]], V0, V4, V8, VC); \
-	GB_4WAY(M[sigma[r][0x2]], M[sigma[r][0x3]], \
-		CB[sigma[r][0x2]], CB[sigma[r][0x3]], V1, V5, V9, VD); \
-	GB_4WAY(M[sigma[r][0x4]], M[sigma[r][0x5]], \
-		CB[sigma[r][0x4]], CB[sigma[r][0x5]], V2, V6, VA, VE); \
-	GB_4WAY(M[sigma[r][0x6]], M[sigma[r][0x7]], \
-		CB[sigma[r][0x6]], CB[sigma[r][0x7]], V3, V7, VB, VF); \
-	GB_4WAY(M[sigma[r][0x8]], M[sigma[r][0x9]], \
-		CB[sigma[r][0x8]], CB[sigma[r][0x9]], V0, V5, VA, VF); \
-	GB_4WAY(M[sigma[r][0xA]], M[sigma[r][0xB]], \
-		CB[sigma[r][0xA]], CB[sigma[r][0xB]], V1, V6, VB, VC); \
-	GB_4WAY(M[sigma[r][0xC]], M[sigma[r][0xD]], \
-		CB[sigma[r][0xC]], CB[sigma[r][0xD]], V2, V7, V8, VD); \
-	GB_4WAY(M[sigma[r][0xE]], M[sigma[r][0xF]], \
-		CB[sigma[r][0xE]], CB[sigma[r][0xF]], V3, V4, V9, VE); \
-} while (0)
-
-#else
-//current_impl
-#define ROUND_B_4WAY(r)   do { \
-	GB_4WAY(Mx(r, 0), Mx(r, 1), CBx(r, 0), CBx(r, 1), V0, V4, V8, VC); \
-	GB_4WAY(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD); \
-	GB_4WAY(Mx(r, 4), Mx(r, 5), CBx(r, 4), CBx(r, 5), V2, V6, VA, VE); \
-	GB_4WAY(Mx(r, 6), Mx(r, 7), CBx(r, 6), CBx(r, 7), V3, V7, VB, VF); \
-	GB_4WAY(Mx(r, 8), Mx(r, 9), CBx(r, 8), CBx(r, 9), V0, V5, VA, VF); \
-	GB_4WAY(Mx(r, A), Mx(r, B), CBx(r, A), CBx(r, B), V1, V6, VB, VC); \
-	GB_4WAY(Mx(r, C), Mx(r, D), CBx(r, C), CBx(r, D), V2, V7, V8, VD); \
-	GB_4WAY(Mx(r, E), Mx(r, F), CBx(r, E), CBx(r, F), V3, V4, V9, VE); \
-	} while (0)
-
-#endif
-
-#endif
-
-// Blake-256 4 way
-
 #define DECL_STATE32_4WAY \
 	__m128i H0, H1, H2, H3, H4, H5, H6, H7; \
 	__m128i S0, S1, S2, S3; \
-        sph_u32 T0, T1;
+        uint32_t T0, T1;

 #define READ_STATE32_4WAY(state)   do { \
-		H0 = (state)->H[0]; \
-		H1 = (state)->H[1]; \
-		H2 = (state)->H[2]; \
-		H3 = (state)->H[3]; \
-		H4 = (state)->H[4]; \
-		H5 = (state)->H[5]; \
-		H6 = (state)->H[6]; \
-		H7 = (state)->H[7]; \
-		S0 = (state)->S[0]; \
-		S1 = (state)->S[1]; \
-		S2 = (state)->S[2]; \
-		S3 = (state)->S[3]; \
+		H0 = casti_m128i( state->H, 0 ); \
+		H1 = casti_m128i( state->H, 1 ); \
+		H2 = casti_m128i( state->H, 2 ); \
+		H3 = casti_m128i( state->H, 3 ); \
+		H4 = casti_m128i( state->H, 4 ); \
+		H5 = casti_m128i( state->H, 5 ); \
+		H6 = casti_m128i( state->H, 6 ); \
+		H7 = casti_m128i( state->H, 7 ); \
+		S0 = casti_m128i( state->S, 0 ); \
+		S1 = casti_m128i( state->S, 1 ); \
+		S2 = casti_m128i( state->S, 2 ); \
+		S3 = casti_m128i( state->S, 3 ); \
 		T0 = (state)->T0; \
 		T1 = (state)->T1; \
 	} while (0)

 #define WRITE_STATE32_4WAY(state)   do { \
-		(state)->H[0] = H0; \
-		(state)->H[1] = H1; \
-		(state)->H[2] = H2; \
-		(state)->H[3] = H3; \
-		(state)->H[4] = H4; \
-		(state)->H[5] = H5; \
-		(state)->H[6] = H6; \
-		(state)->H[7] = H7; \
-		(state)->S[0] = S0; \
-		(state)->S[1] = S1; \
-		(state)->S[2] = S2; \
-		(state)->S[3] = S3; \
+		casti_m128i( state->H, 0 ) = H0; \
+		casti_m128i( state->H, 1 ) = H1; \
+		casti_m128i( state->H, 2 ) = H2; \
+		casti_m128i( state->H, 3 ) = H3; \
+		casti_m128i( state->H, 4 ) = H4; \
+		casti_m128i( state->H, 5 ) = H5; \
+		casti_m128i( state->H, 6 ) = H6; \
+		casti_m128i( state->H, 7 ) = H7; \
+		casti_m128i( state->S, 0 ) = S0; \
+		casti_m128i( state->S, 1 ) = S1; \
+		casti_m128i( state->S, 2 ) = S2; \
+		casti_m128i( state->S, 3 ) = S3; \
 		(state)->T0 = T0; \
 		(state)->T1 = T1; \
 	} while (0)
@@ -562,22 +424,22 @@ do { \
                          , _mm_set_epi32( CS6, CS6, CS6, CS6 ) ); \
        VF = _mm_xor_si128( _mm_set_epi32( T1, T1, T1, T1 ), \
                            _mm_set_epi32( CS7, CS7, CS7, CS7 ) ); \
-	M[0x0] = mm_bswap_32( *(buf +  0) ); \
-	M[0x1] = mm_bswap_32( *(buf +  1) ); \
-	M[0x2] = mm_bswap_32( *(buf +  2) ); \
-	M[0x3] = mm_bswap_32( *(buf +  3) ); \
-	M[0x4] = mm_bswap_32( *(buf +  4) ); \
-	M[0x5] = mm_bswap_32( *(buf +  5) ); \
-	M[0x6] = mm_bswap_32( *(buf +  6) ); \
-	M[0x7] = mm_bswap_32( *(buf +  7) ); \
-	M[0x8] = mm_bswap_32( *(buf +  8) ); \
-	M[0x9] = mm_bswap_32( *(buf +  9) ); \
-	M[0xA] = mm_bswap_32( *(buf + 10) ); \
-	M[0xB] = mm_bswap_32( *(buf + 11) ); \
-	M[0xC] = mm_bswap_32( *(buf + 12) ); \
-	M[0xD] = mm_bswap_32( *(buf + 13) ); \
-	M[0xE] = mm_bswap_32( *(buf + 14) ); \
-	M[0xF] = mm_bswap_32( *(buf + 15) ); \
+	M[0x0] = mm128_bswap_32( *(buf +  0) ); \
+	M[0x1] = mm128_bswap_32( *(buf +  1) ); \
+	M[0x2] = mm128_bswap_32( *(buf +  2) ); \
+	M[0x3] = mm128_bswap_32( *(buf +  3) ); \
+	M[0x4] = mm128_bswap_32( *(buf +  4) ); \
+	M[0x5] = mm128_bswap_32( *(buf +  5) ); \
+	M[0x6] = mm128_bswap_32( *(buf +  6) ); \
+	M[0x7] = mm128_bswap_32( *(buf +  7) ); \
+	M[0x8] = mm128_bswap_32( *(buf +  8) ); \
+	M[0x9] = mm128_bswap_32( *(buf +  9) ); \
+	M[0xA] = mm128_bswap_32( *(buf + 10) ); \
+	M[0xB] = mm128_bswap_32( *(buf + 11) ); \
+	M[0xC] = mm128_bswap_32( *(buf + 12) ); \
+	M[0xD] = mm128_bswap_32( *(buf + 13) ); \
+	M[0xE] = mm128_bswap_32( *(buf + 14) ); \
+	M[0xF] = mm128_bswap_32( *(buf + 15) ); \
 	for (r = 0; r < rounds; r ++) \
 		ROUND_S_4WAY(r); \
        H0 = _mm_xor_si128( _mm_xor_si128( \
@@ -616,30 +478,30 @@ do { \
   V5 = H5; \
   V6 = H6; \
   V7 = H7; \
-   V8 = _mm_xor_si128( S0, _mm_set_epi32( CS0, CS0, CS0, CS0 ) ); \
-   V9 = _mm_xor_si128( S1, _mm_set_epi32( CS1, CS1, CS1, CS1 ) ); \
-   VA = _mm_xor_si128( S2, _mm_set_epi32( CS2, CS2, CS2, CS2 ) ); \
-   VB = _mm_xor_si128( S3, _mm_set_epi32( CS3, CS3, CS3, CS3 ) ); \
+   V8 = _mm_xor_si128( S0, _mm_set1_epi32( CS0 ) ); \
+   V9 = _mm_xor_si128( S1, _mm_set1_epi32( CS1 ) ); \
+   VA = _mm_xor_si128( S2, _mm_set1_epi32( CS2 ) ); \
+   VB = _mm_xor_si128( S3, _mm_set1_epi32( CS3 ) ); \
   VC = _mm_xor_si128( _mm_set1_epi32( T0 ), _mm_set1_epi32( CS4 ) ); \
   VD = _mm_xor_si128( _mm_set1_epi32( T0 ), _mm_set1_epi32( CS5 ) ); \
   VE = _mm_xor_si128( _mm_set1_epi32( T1 ), _mm_set1_epi32( CS6 ) ); \
   VF = _mm_xor_si128( _mm_set1_epi32( T1 ), _mm_set1_epi32( CS7 ) ); \
-   M0 = mm_bswap_32( * buf ); \
-   M1 = mm_bswap_32( *(buf+1) ); \
-   M2 = mm_bswap_32( *(buf+2) ); \
-   M3 = mm_bswap_32( *(buf+3) ); \
-   M4 = mm_bswap_32( *(buf+4) ); \
-   M5 = mm_bswap_32( *(buf+5) ); \
-   M6 = mm_bswap_32( *(buf+6) ); \
-   M7 = mm_bswap_32( *(buf+7) ); \
-   M8 = mm_bswap_32( *(buf+8) ); \
-   M9 = mm_bswap_32( *(buf+9) ); \
-   MA = mm_bswap_32( *(buf+10) ); \
-   MB = mm_bswap_32( *(buf+11) ); \
-   MC = mm_bswap_32( *(buf+12) ); \
-   MD = mm_bswap_32( *(buf+13) ); \
-   ME = mm_bswap_32( *(buf+14) ); \
-   MF = mm_bswap_32( *(buf+15) ); \
+   M0 = mm128_bswap_32( buf[ 0] ); \
+   M1 = mm128_bswap_32( buf[ 1] ); \
+   M2 = mm128_bswap_32( buf[ 2] ); \
+   M3 = mm128_bswap_32( buf[ 3] ); \
+   M4 = mm128_bswap_32( buf[ 4] ); \
+   M5 = mm128_bswap_32( buf[ 5] ); \
+   M6 = mm128_bswap_32( buf[ 6] ); \
+   M7 = mm128_bswap_32( buf[ 7] ); \
+   M8 = mm128_bswap_32( buf[ 8] ); \
+   M9 = mm128_bswap_32( buf[ 9] ); \
+   MA = mm128_bswap_32( buf[10] ); \
+   MB = mm128_bswap_32( buf[11] ); \
+   MC = mm128_bswap_32( buf[12] ); \
+   MD = mm128_bswap_32( buf[13] ); \
+   ME = mm128_bswap_32( buf[14] ); \
+   MF = mm128_bswap_32( buf[15] ); \
   ROUND_S_4WAY(0); \
   ROUND_S_4WAY(1); \
   ROUND_S_4WAY(2); \
@@ -673,6 +535,31 @@ do { \

 // Blake-256 8 way

+#define GS_8WAY( m0, m1, c0, c1, a, b, c, d ) \
+do { \
+   a = _mm256_add_epi32( _mm256_add_epi32( _mm256_xor_si256( \
+                 _mm256_set1_epi32( c1 ), m0 ), b ), a ); \
+   d = mm256_ror_32( _mm256_xor_si256( d, a ), 16 ); \
+   c = _mm256_add_epi32( c, d ); \
+   b = mm256_ror_32( _mm256_xor_si256( b, c ), 12 ); \
+   a = _mm256_add_epi32( _mm256_add_epi32( _mm256_xor_si256( \
+                 _mm256_set1_epi32( c0 ), m1 ), b ), a ); \
+   d = mm256_ror_32( _mm256_xor_si256( d, a ), 8 ); \
+   c = _mm256_add_epi32( c, d ); \
+   b = mm256_ror_32( _mm256_xor_si256( b, c ), 7 ); \
+} while (0)
+
+#define ROUND_S_8WAY(r)   do { \
+        GS_8WAY(Mx(r, 0), Mx(r, 1), CSx(r, 0), CSx(r, 1), V0, V4, V8, VC); \
+        GS_8WAY(Mx(r, 2), Mx(r, 3), CSx(r, 2), CSx(r, 3), V1, V5, V9, VD); \
+        GS_8WAY(Mx(r, 4), Mx(r, 5), CSx(r, 4), CSx(r, 5), V2, V6, VA, VE); \
+        GS_8WAY(Mx(r, 6), Mx(r, 7), CSx(r, 6), CSx(r, 7), V3, V7, VB, VF); \
+        GS_8WAY(Mx(r, 8), Mx(r, 9), CSx(r, 8), CSx(r, 9), V0, V5, VA, VF); \
+        GS_8WAY(Mx(r, A), Mx(r, B), CSx(r, A), CSx(r, B), V1, V6, VB, VC); \
+        GS_8WAY(Mx(r, C), Mx(r, D), CSx(r, C), CSx(r, D), V2, V7, V8, VD); \
+        GS_8WAY(Mx(r, E), Mx(r, F), CSx(r, E), CSx(r, F), V3, V4, V9, VE); \
+} while (0)
+
 #define DECL_STATE32_8WAY \
   __m256i H0, H1, H2, H3, H4, H5, H6, H7; \
   __m256i S0, S1, S2, S3; \
@@ -787,312 +674,136 @@ do { \
                                                              S3 ), H7 ); \
 } while (0)

-// Blake-512 4 way
-
-#define DECL_STATE64_4WAY \
-	__m256i H0, H1, H2, H3, H4, H5, H6, H7; \
-        __m256i S0, S1, S2, S3; \
-	sph_u64 T0, T1;
-
-#define READ_STATE64_4WAY(state)   do { \
-		H0 = (state)->H[0]; \
-		H1 = (state)->H[1]; \
-		H2 = (state)->H[2]; \
-		H3 = (state)->H[3]; \
-		H4 = (state)->H[4]; \
-		H5 = (state)->H[5]; \
-		H6 = (state)->H[6]; \
-		H7 = (state)->H[7]; \
-		S0 = (state)->S[0]; \
-		S1 = (state)->S[1]; \
-		S2 = (state)->S[2]; \
-		S3 = (state)->S[3]; \
-		T0 = (state)->T0; \
-		T1 = (state)->T1; \
-	} while (0)
-
-#define WRITE_STATE64_4WAY(state)   do { \
-		(state)->H[0] = H0; \
-		(state)->H[1] = H1; \
-		(state)->H[2] = H2; \
-		(state)->H[3] = H3; \
-		(state)->H[4] = H4; \
-		(state)->H[5] = H5; \
-		(state)->H[6] = H6; \
-		(state)->H[7] = H7; \
-		(state)->S[0] = S0; \
-		(state)->S[1] = S1; \
-		(state)->S[2] = S2; \
-		(state)->S[3] = S3; \
-		(state)->T0 = T0; \
-		(state)->T1 = T1; \
-	} while (0)
-
-#if SPH_COMPACT_BLAKE_64
-
-// not used
-#define COMPRESS64_4WAY   do { \
-	__m256i M[16]; \
-	__m256i V0, V1, V2, V3, V4, V5, V6, V7; \
-	__m256i V8, V9, VA, VB, VC, VD, VE, VF; \
-	unsigned r; \
-	V0 = H0; \
-	V1 = H1; \
-	V2 = H2; \
-	V3 = H3; \
-	V4 = H4; \
-	V5 = H5; \
-	V6 = H6; \
-	V7 = H7; \
-        V8 = _mm256_xor_si256( S0, _mm256_set_epi64x( CB0, CB0, CB0, CB0 ) ); \
-        V9 = _mm256_xor_si256( S1, _mm256_set_epi64x( CB1, CB1, CB1, CB1 ) ); \
-        VA = _mm256_xor_si256( S2, _mm256_set_epi64x( CB2, CB2, CB2, CB2 ) ); \
-        VB = _mm256_xor_si256( S3, _mm256_set_epi64x( CB3, CB3, CB3, CB3 ) ); \
-        VC = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
-                               _mm256_set_epi64x( CB4, CB4, CB4, CB4 ) ); \
-        VD = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
-                               _mm256_set_epi64x( CB5, CB5, CB5, CB5 ) ); \
-        VE = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
-                               _mm256_set_epi64x( CB6, CB6, CB6, CB6 ) ); \
-        VF = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
-                               _mm256_set_epi64x( CB7, CB7, CB7, CB7 ) ); \
-	M[0x0] = mm256_bswap_64( *(buf+0) ); \
-	M[0x1] = mm256_bswap_64( *(buf+1) ); \
-	M[0x2] = mm256_bswap_64( *(buf+2) ); \
-	M[0x3] = mm256_bswap_64( *(buf+3) ); \
-	M[0x4] = mm256_bswap_64( *(buf+4) ); \
-	M[0x5] = mm256_bswap_64( *(buf+5) ); \
-	M[0x6] = mm256_bswap_64( *(buf+6) ); \
-	M[0x7] = mm256_bswap_64( *(buf+7) ); \
-	M[0x8] = mm256_bswap_64( *(buf+8) ); \
-	M[0x9] = mm256_bswap_64( *(buf+9) ); \
-	M[0xA] = mm256_bswap_64( *(buf+10) ); \
-	M[0xB] = mm256_bswap_64( *(buf+11) ); \
-	M[0xC] = mm256_bswap_64( *(buf+12) ); \
-	M[0xD] = mm256_bswap_64( *(buf+13) ); \
-	M[0xE] = mm256_bswap_64( *(buf+14) ); \
-	M[0xF] = mm256_bswap_64( *(buf+15) ); \
-	for (r = 0; r < 16; r ++) \
-		ROUND_B_4WAY(r); \
-        H0 = _mm256_xor_si256( _mm256_xor_si256( \
-                    _mm256_xor_si256( S0, V0 ), V8 ), H0 ); \
-        H1 = _mm256_xor_si256( _mm256_xor_si256( \
-                    _mm256_xor_si256( S1, V1 ), V9 ), H1 ); \
-        H2 = _mm256_xor_si256( _mm256_xor_si256( \
-                    _mm256_xor_si256( S2, V2 ), VA ), H2 ); \
-        H3 = _mm256_xor_si256( _mm256_xor_si256( \
-                    _mm256_xor_si256( S3, V3 ), VB ), H3 ); \
-        H4 = _mm256_xor_si256( _mm256_xor_si256( \
-                    _mm256_xor_si256( S0, V4 ), VC ), H4 ); \
-        H5 = _mm256_xor_si256( _mm256_xor_si256( \
-                    _mm256_xor_si256( S1, V5 ), VD ), H5 ); \
-        H6 = _mm256_xor_si256( _mm256_xor_si256( \
-                    _mm256_xor_si256( S2, V6 ), VE ), H6 ); \
-        H7 = _mm256_xor_si256( _mm256_xor_si256( \
-                    _mm256_xor_si256( S3, V7 ), VF ), H7 ); \
-	} while (0)
-
-#else
-
-//current impl
-
-#define COMPRESS64_4WAY   do { \
-     __m256i M0, M1, M2, M3, M4, M5, M6, M7; \
-     __m256i M8, M9, MA, MB, MC, MD, ME, MF; \
-     __m256i V0, V1, V2, V3, V4, V5, V6, V7; \
-     __m256i V8, V9, VA, VB, VC, VD, VE, VF; \
-     V0 = H0; \
-     V1 = H1; \
-     V2 = H2; \
-     V3 = H3; \
-     V4 = H4; \
-     V5 = H5; \
-     V6 = H6; \
-     V7 = H7; \
-     V8 = _mm256_xor_si256( S0, _mm256_set_epi64x( CB0, CB0, CB0, CB0 ) );  \
-     V9 = _mm256_xor_si256( S1, _mm256_set_epi64x( CB1, CB1, CB1, CB1 ) );  \
-     VA = _mm256_xor_si256( S2, _mm256_set_epi64x( CB2, CB2, CB2, CB2 ) );  \
-     VB = _mm256_xor_si256( S3, _mm256_set_epi64x( CB3, CB3, CB3, CB3 ) );  \
-     VC = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
-                            _mm256_set_epi64x( CB4, CB4, CB4, CB4 ) );  \
-     VD = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
-                            _mm256_set_epi64x( CB5, CB5, CB5, CB5 ) );  \
-     VE = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
-                            _mm256_set_epi64x( CB6, CB6, CB6, CB6 ) );  \
-     VF = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
-                            _mm256_set_epi64x( CB7, CB7, CB7, CB7 ) );  \
-     M0 = mm256_bswap_64( *(buf + 0) ); \
-     M1 = mm256_bswap_64( *(buf + 1) ); \
-     M2 = mm256_bswap_64( *(buf + 2) ); \
-     M3 = mm256_bswap_64( *(buf + 3) ); \
-     M4 = mm256_bswap_64( *(buf + 4) ); \
-     M5 = mm256_bswap_64( *(buf + 5) ); \
-     M6 = mm256_bswap_64( *(buf + 6) ); \
-     M7 = mm256_bswap_64( *(buf + 7) ); \
-     M8 = mm256_bswap_64( *(buf + 8) ); \
-     M9 = mm256_bswap_64( *(buf + 9) ); \
-     MA = mm256_bswap_64( *(buf + 10) ); \
-     MB = mm256_bswap_64( *(buf + 11) ); \
-     MC = mm256_bswap_64( *(buf + 12) ); \
-     MD = mm256_bswap_64( *(buf + 13) ); \
-     ME = mm256_bswap_64( *(buf + 14) ); \
-     MF = mm256_bswap_64( *(buf + 15) ); \
-     ROUND_B_4WAY(0); \
-     ROUND_B_4WAY(1); \
-     ROUND_B_4WAY(2); \
-     ROUND_B_4WAY(3); \
-     ROUND_B_4WAY(4); \
-     ROUND_B_4WAY(5); \
-     ROUND_B_4WAY(6); \
-     ROUND_B_4WAY(7); \
-     ROUND_B_4WAY(8); \
-     ROUND_B_4WAY(9); \
-     ROUND_B_4WAY(0); \
-     ROUND_B_4WAY(1); \
-     ROUND_B_4WAY(2); \
-     ROUND_B_4WAY(3); \
-     ROUND_B_4WAY(4); \
-     ROUND_B_4WAY(5); \
-     H0 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( S0, V0 ), V8 ), H0 ); \
-     H1 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( S1, V1 ), V9 ), H1 ); \
-     H2 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( S2, V2 ), VA ), H2 ); \
-     H3 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( S3, V3 ), VB ), H3 ); \
-     H4 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( S0, V4 ), VC ), H4 ); \
-     H5 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( S1, V5 ), VD ), H5 ); \
-     H6 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( S2, V6 ), VE ), H6 ); \
-     H7 = _mm256_xor_si256( _mm256_xor_si256( \
-                            _mm256_xor_si256( S3, V7 ), VF ), H7 ); \
-	} while (0)
-
-#endif

 #endif

 // Blake-256 4 way

-static const sph_u32 salt_zero_4way_small[4] = { 0, 0, 0, 0 };
+static const uint32_t salt_zero_4way_small[4] = { 0, 0, 0, 0 };

 static void
-blake32_4way_init( blake_4way_small_context *sc, const sph_u32 *iv,
-                   const sph_u32 *salt, int rounds )
+blake32_4way_init( blake_4way_small_context *ctx, const uint32_t *iv,
+                   const uint32_t *salt, int rounds )
 {
-   int i;
-   for ( i = 0; i < 8; i++ )
-      sc->H[i] = _mm_set1_epi32( iv[i] );
-   for ( i = 0; i < 4; i++ )
-      sc->S[i] = _mm_set1_epi32( salt[i] );
-   sc->T0 = sc->T1 = 0;
-   sc->ptr = 0;
-   sc->rounds = rounds;
+   casti_m128i( ctx->H, 0 ) = _mm_set1_epi32( iv[0] );
+   casti_m128i( ctx->H, 1 ) = _mm_set1_epi32( iv[1] );
+   casti_m128i( ctx->H, 2 ) = _mm_set1_epi32( iv[2] );
+   casti_m128i( ctx->H, 3 ) = _mm_set1_epi32( iv[3] );
+   casti_m128i( ctx->H, 4 ) = _mm_set1_epi32( iv[4] );
+   casti_m128i( ctx->H, 5 ) = _mm_set1_epi32( iv[5] );
+   casti_m128i( ctx->H, 6 ) = _mm_set1_epi32( iv[6] );
+   casti_m128i( ctx->H, 7 ) = _mm_set1_epi32( iv[7] );
+
+   casti_m128i( ctx->S, 0 ) = m128_zero;
+   casti_m128i( ctx->S, 1 ) = m128_zero;
+   casti_m128i( ctx->S, 2 ) = m128_zero;
+   casti_m128i( ctx->S, 3 ) = m128_zero;
+/*
+   sc->S[0] = _mm_set1_epi32( salt[0] );
+   sc->S[1] = _mm_set1_epi32( salt[1] );
+   sc->S[2] = _mm_set1_epi32( salt[2] );
+   sc->S[3] = _mm_set1_epi32( salt[3] );
+*/
+   ctx->T0 = ctx->T1 = 0;
+   ctx->ptr = 0;
+   ctx->rounds = rounds;
 }

 static void
-blake32_4way( blake_4way_small_context *sc, const void *data, size_t len )
+blake32_4way( blake_4way_small_context *ctx, const void *data, size_t len )
 {
-   __m128i *vdata = (__m128i*)data;
-   __m128i *buf;
-   size_t ptr;
-   const int buf_size = 64;   // number of elements, sizeof/4
+   __m128i *buf = (__m128i*)ctx->buf;
+   size_t  bptr = ctx->ptr<<2;
+   size_t  vptr = ctx->ptr >> 2;
+   size_t  blen = len << 2;
   DECL_STATE32_4WAY
-   buf = sc->buf;
-   ptr = sc->ptr;
-   if ( len < buf_size - ptr )
+
+   if ( blen < (sizeof ctx->buf) - bptr )
   {
-      memcpy_128( buf + (ptr>>2), vdata, len>>2 );
-      ptr += len;
-      sc->ptr = ptr;
+      memcpy( buf + vptr, data, blen );
+      bptr += blen;
+      ctx->ptr = bptr>>2;
      return;
   }

-   READ_STATE32_4WAY(sc);
-   while ( len > 0 )
+   READ_STATE32_4WAY( ctx );
+   while ( blen > 0 )
   {
-      size_t clen;
+      size_t clen = ( sizeof ctx->buf ) - bptr;

-      clen = buf_size - ptr;
-      if ( clen > len )
-         clen = len;
-      memcpy_128( buf + (ptr>>2), vdata, clen>>2 );
-      ptr += clen;
-      vdata += (clen>>2);
-      len -= clen;
-      if ( ptr == buf_size )
+      if ( clen > blen )
+         clen = blen;
+      memcpy( buf + (bptr>>4), data, clen );
+      bptr += clen;
+      data = (const unsigned char *)data + clen;
+      blen -= clen;
+      if ( bptr == ( sizeof ctx->buf ) )
      {
-         if ( ( T0 = SPH_T32(T0 + 512) ) < 512 )
-            T1 = SPH_T32(T1 + 1);
-         COMPRESS32_4WAY( sc->rounds );
-         ptr = 0;
+         if ( ( T0 = T0 + 512 ) < 512 )
+            T1 = T1 + 1;
+         COMPRESS32_4WAY( ctx->rounds );
+         bptr = 0;
      }
   }
-   WRITE_STATE32_4WAY(sc);
-   sc->ptr = ptr;
+   WRITE_STATE32_4WAY( ctx );
+   ctx->ptr = bptr>>2;
 }

 static void
-blake32_4way_close( blake_4way_small_context *sc, unsigned ub, unsigned n,
+blake32_4way_close( blake_4way_small_context *ctx, unsigned ub, unsigned n,
               void *dst, size_t out_size_w32 )
 {
-//   union {
-	__m128i buf[16];
-//	sph_u32 dummy;
-//   } u;
-   size_t ptr, k;
-   unsigned bit_len;
-   sph_u32 th, tl;
-   __m128i *out;
-
-   ptr = sc->ptr;
-   bit_len = ((unsigned)ptr << 3);
-   buf[ptr>>2] = _mm_set1_epi32( 0x80 );
-   tl = sc->T0 + bit_len;
-   th = sc->T1;
+   __m128i buf[16] __attribute__ ((aligned (64)));
+   size_t   ptr     = ctx->ptr;
+   size_t   vptr    = ctx->ptr>>2;
+   unsigned bit_len = ( (unsigned)ptr << 3 );
+   uint32_t tl      = ctx->T0 + bit_len;
+   uint32_t th      = ctx->T1;

   if ( ptr == 0 )
   {
-	sc->T0 = SPH_C32(0xFFFFFE00UL);
-	sc->T1 = SPH_C32(0xFFFFFFFFUL);
+      ctx->T0 = 0xFFFFFE00UL;
+      ctx->T1 = 0xFFFFFFFFUL;
   }
-   else if ( sc->T0 == 0 )
+   else if ( ctx->T0 == 0 )
   {
-	sc->T0 = SPH_C32(0xFFFFFE00UL) + bit_len;
-	sc->T1 = SPH_T32(sc->T1 - 1);
+      ctx->T0 = 0xFFFFFE00UL + bit_len;
+      ctx->T1 = ctx->T1 - 1;
   } 
   else
-	sc->T0 -= 512 - bit_len;
+      ctx->T0 -= 512 - bit_len;

-   if ( ptr <= 52 )
+   buf[vptr] = _mm_set1_epi32( 0x80 );
+
+   if ( vptr < 12 )
   {
-       memset_zero_128( buf + (ptr>>2) + 1, (52 - ptr) >> 2 );
-       if (out_size_w32 == 8)
-           buf[52>>2] = _mm_or_si128( buf[52>>2],
-                                        _mm_set1_epi32( 0x01000000UL ) );
-       *(buf+(56>>2)) = mm_bswap_32( _mm_set1_epi32( th ) );
-       *(buf+(60>>2)) = mm_bswap_32( _mm_set1_epi32( tl ) );
-       blake32_4way( sc, buf + (ptr>>2), 64 - ptr );
+      memset_zero_128( buf + vptr + 1, 13 - vptr  );
+      buf[ 13 ] = _mm_or_si128( buf[ 13 ], _mm_set1_epi32( 0x01000000UL ) );
+      buf[ 14 ] = mm128_bswap_32( _mm_set1_epi32( th ) );
+      buf[ 15 ] = mm128_bswap_32( _mm_set1_epi32( tl ) );
+      blake32_4way( ctx, buf + vptr, 64 - ptr );
   }
   else
   {
-	memset_zero_128( buf + (ptr>>2) + 1, (60-ptr) >> 2 );
-	blake32_4way( sc, buf + (ptr>>2), 64 - ptr );
-	sc->T0 = SPH_C32(0xFFFFFE00UL);
-	sc->T1 = SPH_C32(0xFFFFFFFFUL);
-	memset_zero_128( buf, 56>>2 );
-       if (out_size_w32 == 8)
-           buf[52>>2] = _mm_set1_epi32( 0x01000000UL );
-        *(buf+(56>>2)) = mm_bswap_32( _mm_set1_epi32( th ) );
-        *(buf+(60>>2)) = mm_bswap_32( _mm_set1_epi32( tl ) );
-	blake32_4way( sc, buf, 64 );
+      memset_zero_128( buf + vptr + 1, (60-ptr) >> 2 );
+      blake32_4way( ctx, buf + vptr, 64 - ptr );
+      ctx->T0 = 0xFFFFFE00UL;
+      ctx->T1 = 0xFFFFFFFFUL;
+      memset_zero_128( buf, 56>>2 );
+      buf[ 13 ] = _mm_or_si128( buf[ 13 ], _mm_set1_epi32( 0x01000000UL ) );
+      buf[ 14 ] = mm128_bswap_32( _mm_set1_epi32( th ) );
+      buf[ 15 ] = mm128_bswap_32( _mm_set1_epi32( tl ) );
+      blake32_4way( ctx, buf, 64 );
   }
-   out = (__m128i*)dst;
-   for ( k = 0; k < out_size_w32; k++ )
-        out[k] = mm_bswap_32( sc->H[k] );
+
+   casti_m128i( dst, 0 ) = mm128_bswap_32( casti_m128i( ctx->H, 0 ) );
+   casti_m128i( dst, 1 ) = mm128_bswap_32( casti_m128i( ctx->H, 1 ) );
+   casti_m128i( dst, 2 ) = mm128_bswap_32( casti_m128i( ctx->H, 2 ) );
+   casti_m128i( dst, 3 ) = mm128_bswap_32( casti_m128i( ctx->H, 3 ) );
+   casti_m128i( dst, 4 ) = mm128_bswap_32( casti_m128i( ctx->H, 4 ) );
+   casti_m128i( dst, 5 ) = mm128_bswap_32( casti_m128i( ctx->H, 5 ) );
+   casti_m128i( dst, 6 ) = mm128_bswap_32( casti_m128i( ctx->H, 6 ) );
+   casti_m128i( dst, 7 ) = mm128_bswap_32( casti_m128i( ctx->H, 7 ) );
 }

 #if defined (__AVX2__)
@@ -1217,163 +928,32 @@ blake32_8way_close( blake_8way_small_context *sc, unsigned ub, unsigned n,
        out[k] = mm256_bswap_32( sc->H[k] );
 }

-// Blake-512 4 way
-
-static const sph_u64 salt_zero_big[4] = { 0, 0, 0, 0 };
-
-static void
-blake64_4way_init( blake_4way_big_context *sc, const sph_u64 *iv,
-              const sph_u64 *salt )
-{
-        int i;
-        for ( i = 0; i < 8; i++ )
-           sc->H[i] = _mm256_set1_epi64x( iv[i] );
-        for ( i = 0; i < 4; i++ )
-           sc->S[i] = _mm256_set1_epi64x( salt[i] );
-        sc->T0 = sc->T1 = 0;
-        sc->ptr = 0;
-}
-
-static void
-blake64_4way( blake_4way_big_context *sc, const void *data, size_t len)
-{
-   __m256i *vdata = (__m256i*)data;
-   __m256i *buf;
-   size_t ptr;
-   DECL_STATE64_4WAY
-
-   const int buf_size = 128;  //  sizeof/8 
-
-   buf = sc->buf;
-   ptr = sc->ptr;
-   if ( len < (buf_size - ptr) )
-   {
-	memcpy_256( buf + (ptr>>3), vdata, len>>3 );
-	ptr += len;
-	sc->ptr = ptr;
-	return;
-   }
-
-   READ_STATE64_4WAY(sc);
-   while ( len > 0 )
-   {
-	size_t clen;
-
-	clen = buf_size - ptr;
-	if ( clen > len )
-		clen = len;
-	memcpy_256( buf + (ptr>>3), vdata, clen>>3 );
-	ptr += clen;
-	vdata = vdata + (clen>>3);
-	len -= clen;
-	if (ptr == buf_size )
-        {
-		if ((T0 = SPH_T64(T0 + 1024)) < 1024)
-			T1 = SPH_T64(T1 + 1);
-		COMPRESS64_4WAY;
-		ptr = 0;
-	}
-   }
-   WRITE_STATE64_4WAY(sc);
-   sc->ptr = ptr;
-}
-
-static void
-blake64_4way_close( blake_4way_big_context *sc,
-	unsigned ub, unsigned n, void *dst, size_t out_size_w64)
-{
-//   union {
-      __m256i buf[16];
-//      sph_u64 dummy;
-//   } u;
-   size_t ptr, k;
-   unsigned bit_len;
-   uint64_t z, zz;
-   sph_u64 th, tl;
-   __m256i *out;
-
-   ptr = sc->ptr;
-   bit_len = ((unsigned)ptr << 3);
-   z = 0x80 >> n;
-   zz = ((ub & -z) | z) & 0xFF;
-   buf[ptr>>3] = _mm256_set_epi64x( zz, zz, zz, zz );
-   tl = sc->T0 + bit_len;
-   th = sc->T1;
-   if (ptr == 0 )
-   {
-	sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
-	sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
-   }
-   else if ( sc->T0 == 0 )
-   {
-	sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL) + bit_len;
-	sc->T1 = SPH_T64(sc->T1 - 1);
-   } 
-   else
-   {
-        sc->T0 -= 1024 - bit_len;
-   }
-   if ( ptr <= 104 )
-   {
-       memset_zero_256( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
-       if ( out_size_w64 == 8 )
-          buf[(104>>3)] = _mm256_or_si256( buf[(104>>3)],
-                                 _mm256_set1_epi64x( 0x0100000000000000ULL ) );
-       *(buf+(112>>3)) = mm256_bswap_64(
-                                    _mm256_set_epi64x( th, th, th, th ) );
-       *(buf+(120>>3)) = mm256_bswap_64(
-                                    _mm256_set_epi64x( tl, tl, tl, tl ) );
-
-       blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
-   }
-   else
-  {
-       memset_zero_256( buf + (ptr>>3) + 1, (120 - ptr) >> 3 );
-
-       blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
-       sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
-       sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
-       memset_zero_256( buf, 112>>3 ); 
-       if ( out_size_w64 == 8 )
-           buf[104>>3] = _mm256_set1_epi64x( 0x0100000000000000ULL );
-       *(buf+(112>>3)) = mm256_bswap_64(
-                                    _mm256_set_epi64x( th, th, th, th ) );
-       *(buf+(120>>3)) = mm256_bswap_64(
-                                    _mm256_set_epi64x( tl, tl, tl, tl ) );
-
-       blake64_4way( sc, buf, 128 );
-   }
-   out = (__m256i*)dst;
-   for ( k = 0; k < out_size_w64; k++ )
-       out[k] = mm256_bswap_64( sc->H[k] );
-}
-
 #endif

 // Blake-256 4 way

 // default 14 rounds, backward copatibility
 void
-blake256_4way_init(void *cc)
+blake256_4way_init(void *ctx)
 {
-   blake32_4way_init( cc, IV256, salt_zero_4way_small, 14 );
+   blake32_4way_init( ctx, IV256, salt_zero_4way_small, 14 );
 }

 void
-blake256_4way(void *cc, const void *data, size_t len)
+blake256_4way(void *ctx, const void *data, size_t len)
 {
-	blake32_4way(cc, data, len);
+	blake32_4way(ctx, data, len);
 }

 void
-blake256_4way_close(void *cc, void *dst)
+blake256_4way_close(void *ctx, void *dst)
 {
-        blake32_4way_close(cc, 0, 0, dst, 8);
+        blake32_4way_close(ctx, 0, 0, dst, 8);
 }

 #if defined(__AVX2__)

-// Blake-256 8way
+// Blake-256 8 way

 void
 blake256_8way_init(void *cc)
@@ -1473,38 +1053,8 @@ blake256r8_8way_close(void *cc, void *dst)

 #endif

-// Blake-512 4 way
-
-#if defined (__AVX2__)
-
-void
-blake512_4way_init(void *cc)
-{
-	blake64_4way_init(cc, IV512, salt_zero_big);
-}
-
-void
-blake512_4way(void *cc, const void *data, size_t len)
-{
-	blake64_4way(cc, data, len);
-}
-
-void
-blake512_4way_close(void *cc, void *dst)
-{
-	blake512_4way_addbits_and_close(cc, 0, 0, dst);
-}
-
-void
-blake512_4way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
-{
-	blake64_4way_close(cc, ub, n, dst, 8);
-}
-
-#endif
-
 #ifdef __cplusplus
 }
 #endif

-#endif
+//#endif
--- a/algo/blake/blake256-hash-4way.c.new
+++ b/algo/blake/blake256-hash-4way.c.new
@@ -0,0 +1,322 @@
+// convert blake256 32 bit to use 64 bit with serial vectoring
+//
+//  cut calls to GS in half
+//
+// combine V
+// v0 = {V0,V1}
+// v1 = {V2,V3}
+// v2 = {V4,V5}
+// v3 = {V6,V7}
+// v4 = {V8,V9}
+// v5 = {VA,VB}
+// v6 = {VC,VD}
+// v7 = {VE,VF}
+//
+// v6x = {VD,VC}      swap(VC,VD)   swap(v6)
+// v7x = {VF,VE}      swap(VE,VF)   swap(v7)
+//
+// V0 = v1v0
+// V1 = v3v2
+// V2 = v5v4
+// V3 = v7v6
+// V4 = v9v8
+// V5 = vbva
+// V6 = vdvc
+// V7 = vfve
+//
+// The rotate in ROUND is to effect straddle and unstraddle for the third
+// and 4th iteration of GS.
+// It concatenates 2 contiguous 256 bit vectors and extracts the middle
+// 256 bits. After the transform they must be restored with only the
+// chosen bits modified in the original 2 vectors.
+// ror1x128 achieves this by putting the chosen bits in arg1, the "low"
+// 256 bit vector and saves the untouched bits temporarily in arg0, the
+// "high" 256 bit vector. Simply reverse the process to restore data back
+// to original positions.
+
+// Use standard 4way when AVX2 is not available; use x2 mode with AVX2.
+//
+// Data is organised the same as 32 bit 4 way, in effect serial vectoring
+// on top of parallel vectoring. Same data in the same place just taking
+// two chunks at a time.
+//
+// Transparent to user, x2 mode used when AVX2 detected.
+// Use existing 4way context but revert to scalar types.
+// Same interleave function (128 bit) or x2 with 256 bit?
+// User transparency would have to apply to interleave as well.
+//
+// Use common 4way update and close
+
+/*
+typedef struct {
+   unsigned char buf[64<<2];
+   uint32_t H[8<<2];
+   uint32_t S[4<<2];
+   size_t ptr;
+   uint32_t T0, T1;
+   int rounds;   // 14 for blake, 8 for blakecoin & vanilla
+} blakex2_4way_small_context __attribute__ ((aligned (64)));
+*/
+
+static void
+blake32x2_4way_init( blake_4way_small_context *ctx, const uint32_t *iv,
+                   const uint32_t *salt, int rounds )
+{
+   casti_m128i( ctx->H, 0 ) = _mm_set1_epi32( iv[0] );
+   casti_m128i( ctx->H, 1 ) = _mm_set1_epi32( iv[1] );
+   casti_m128i( ctx->H, 2 ) = _mm_set1_epi32( iv[2] );
+   casti_m128i( ctx->H, 3 ) = _mm_set1_epi32( iv[3] );
+   casti_m128i( ctx->H, 4 ) = _mm_set1_epi32( iv[4] );
+   casti_m128i( ctx->H, 5 ) = _mm_set1_epi32( iv[5] );
+   casti_m128i( ctx->H, 6 ) = _mm_set1_epi32( iv[6] );
+   casti_m128i( ctx->H, 7 ) = _mm_set1_epi32( iv[7] );
+
+   casti_m128i( ctx->S, 0 ) = m128_zero;
+   casti_m128i( ctx->S, 1 ) = m128_zero;
+   casti_m128i( ctx->S, 2 ) = m128_zero;
+   casti_m128i( ctx->S, 3 ) = m128_zero;
+/*
+   sc->S[0] = _mm_set1_epi32( salt[0] );
+   sc->S[1] = _mm_set1_epi32( salt[1] );
+   sc->S[2] = _mm_set1_epi32( salt[2] );
+   sc->S[3] = _mm_set1_epi32( salt[3] );
+*/
+   ctx->T0 = ctx->T1 = 0;
+   ctx->ptr = 0;
+   ctx->rounds = rounds;
+}
+
+static void
+blake32x2_4way( blake_4way_small_context *ctx, const void *data, size_t len )
+{
+   __m256i *buf = (__m256i*)ctx->buf;
+   size_t  bptr = ctx->ptr << 2;
+   size_t  vptr = ctx->ptr >> 3;
+   size_t  blen = len << 2;
+//    unsigned char *buf = ctx->buf;
+//    size_t ptr         = ctx->ptr<<4;  // repurposed
+    DECL_STATE32x2_4WAY
+
+//    buf = sc->buf;
+//    ptr = sc->ptr;
+
+// adjust len for use with ptr, clen, all absolute bytes.
+//    int blen = len<<2;
+
+    if ( blen < (sizeof ctx->buf) - bptr )
+    {
+        memcpy( buf + vptr, data, blen );
+        bptr += blen;
+        ctx->ptr = bptr >> 2;
+        return;
+    }
+
+    READ_STATE32x2_4WAY( ctx );
+    while ( blen > 0 )
+    {
+        size_t clen;
+
+        clen = ( sizeof ctx->buf ) - bptr;
+        if ( clen > blen )
+            clen = blen;
+        memcpy( buf + vptr, data, clen );
+        bptr += clen;
+        vptr = bptr >> 5;
+        data = (const unsigned char *)data + clen;
+        blen -= clen;
+        if ( bptr == sizeof ctx->buf )
+        {
+           if ( ( T0 = T0 + 512 ) < 512 ) // not needed, will never rollover
+               T1 += 1;
+           COMPRESS32x2_4WAY( ctx->rounds );
+           bptr = vptr = 0;
+        }
+    }
+    WRITE_STATE32x2_4WAY( ctx );
+    ctx->ptr = bptr >> 2;
+}
+
+static void
+blake32x2_4way_close( blake_4way_small_context *ctx, void *dst )
+{
+   __m256i buf[8] __attribute__ ((aligned (64)));
+   size_t   ptr     = ctx->ptr;
+   size_t   vptr    = ctx->ptr >> 3;  // 8 bytes per lane per 256 bit vector
+   unsigned bit_len = ( (unsigned)ptr << 3 );  // one lane
+   uint32_t th      = ctx->T1;
+   uint32_t tl      = ctx->T0 + bit_len;
+
+   if ( ptr == 0 )
+   {
+        ctx->T0 = 0xFFFFFE00UL;
+        ctx->T1 = 0xFFFFFFFFUL;
+   }
+   else if ( ctx->T0 == 0 )
+   {
+      ctx->T0 = 0xFFFFFE00UL + bit_len;
+      ctx->T1 -= 1;
+   }
+   else
+      ctx->T0 -= 512 - bit_len;
+
+   // memset doesn't do ints
+   buf[ vptr ] = _mm256_set_epi32( 0,0,0,0, 0x80, 0x80, 0x80, 0x80 );
+
+   if ( vptr < 5 )
+   {
+       memset_zero_256( buf + vptr + 1, 6 - vptr  );
+       buf[ 6 ] = _mm256_or_si256( buf[ 6 ], _mm256_set_epi32(
+             0x01000000UL,0x01000000UL,0x01000000UL,0x01000000UL, 0,0,0,0 ) );
+       buf[ 7 ] = mm256_bswap_32( _mm256_set_epi32( tl,tl,tl,tl,
+                                                    th,th,th,th ) );
+       blake32x2_4way( ctx, buf + vptr, 64 - ptr );
+   }
+   else
+   {
+       memset_zero_256( buf + vptr + 1, 7 - vptr );
+       blake32x2_4way( ctx, buf + vptr, 64 - ptr );
+       ctx->T0 = 0xFFFFFE00UL;
+       ctx->T1 = 0xFFFFFFFFUL;
+       memset_zero_256( buf, 6 );
+       buf[ 6 ] = _mm256_set_epi32(
+             0x01000000UL,0x01000000UL,0x01000000UL,0x01000000UL, 0,0,0,0 );
+       buf[ 7 ] = mm256_bswap_32( _mm256_set_epi32( tl, tl, tl, tl,
+                                                    th, th, th, th ) );
+       blake32x2_4way( ctx, buf, 64 );
+   }
+
+   casti_m256i( dst, 0 ) = mm256_bswap_32( casti_m256i( ctx->H, 0 ) );
+   casti_m256i( dst, 1 ) = mm256_bswap_32( casti_m256i( ctx->H, 1 ) );
+   casti_m256i( dst, 2 ) = mm256_bswap_32( casti_m256i( ctx->H, 2 ) );
+   casti_m256i( dst, 3 ) = mm256_bswap_32( casti_m256i( ctx->H, 3 ) );
+}
+
+
+
+
+#define DECL_STATE32x2_4WAY \
+   __m256i H0, H1, H2, H3; \
+   __m256i S0, S1; \
+   uint32_t T0, T1;
+
+#define READ_STATE32x2_4WAY(state)  do \
+{ \
+   H0 = casti_m256i( state->H, 0 ); \
+   H1 = casti_m256i( state->H, 1 ); \
+   H2 = casti_m256i( state->H, 2 ); \
+   H3 = casti_m256i( state->H, 3 ); \
+   S0 = casti_m256i( state->S, 0 ); \
+   S1 = casti_m256i( state->S, 1 ); \
+   T0 = state->T0; \
+   T1 = state->T1; \
+} while (0)
+#define WRITE_STATE32x2_4WAY(state)   do { \
+   casti_m256i( state->H, 0 ) = H0; \
+   casti_m256i( state->H, 1 ) = H1; \
+   casti_m256i( state->H, 2 ) = H2; \
+   casti_m256i( state->H, 3 ) = H3; \
+   casti_m256i( state->S, 0 ) = S0; \
+   casti_m256i( state->S, 1 ) = S1; \
+   state->T0 = T0; \
+   state->T1 = T1; \
+} while (0)
+
+
+#define GSx2_4WAY( m0, m1, m2, m3, c0, c1, c2, c3, a, b, c, d ) do \
+{ \
+   a = _mm256_add_epi32( _mm256_add_epi32( _mm256_xor_si256( \
+          _mm256_set_epi32( c1,c3, c1,c3, c1,c3, c1,c3 ), \
+          _mm256_set_epi32( m0,m2, m0,m2, m0,m2, m0,m2 ) ), b ), a ); \
+   d = mm256_ror_32( _mm256_xor_si256( d, a ), 16 ); \
+   c = _mm256_add_epi32( c, d ); \
+   b = mm256_ror_32( _mm256_xor_si256( b, c ), 12 ); \
+   a = _mm256_add_epi32( _mm256_add_epi32( _mm256_xor_si256( \
+          _mm256_set_epi32( c0,c2, c0,c2, c0,c2, c0,c2 ), \
+          _mm256_set_epi32( m1,m3, m1,m3, m1,m3, m1,m3 ) ), b ), a ); \
+   d = mm256_ror_32( _mm256_xor_si256( d, a ), 8 ); \
+   c = _mm256_add_epi32( c, d ); \
+   b = mm256_ror_32( _mm256_xor_si256( b, c ), 7 ); \
+} while (0)
+
+#define ROUND_Sx2_4WAY(r)   do \
+{ \
+  GSx2_4WAY( Mx(r, 0),  Mx(r, 1),  Mx(r, 2),  Mx(r, 3), \
+             CSx(r, 0), CSx(r, 1), CSx(r, 2), CSx(r, 3), V0, V2, V4, V6 ); \
+  GSx2_4WAY( Mx(r, 4),  Mx(r, 5),  Mx(r, 6),  Mx(r, 7), \
+             CSx(r, 4), CSx(r, 5), CSx(r, 6), CSx(r, 7), V1, V3, V5, V7 ); \
+  mm256_ror1x128_512( V3, V2 ); \
+  mm256_ror1x128_512( V6, V7 ); \
+  GSx2_4WAY( Mx(r, 8),  Mx(r, 9),  Mx(r, A),  Mx(r, B), \
+             CSx(r, 8), CSx(r, 9), CSx(r, A), CSx(r, B), V0, V2, V5, V7 ); \
+  GSx2_4WAY( Mx(r, C),  Mx(r, D),  Mx(r, E),  Mx(r, F), \
+             CSx(r, C), CSx(r, D), CSx(r, E), CSx(r, F), V1, V3, V4, V6 ); \
+  mm256_rol1x128_512( V2, V3 ); \
+  mm256_rol1x128_512( V7, V6 ); \
+} while (0)
+#define COMPRESS32x2_4WAY( rounds ) do \
+{ \
+   __m256i M0, M1, M2, M3, M4, M5, M6, M7; \
+   __m256i V0, V1, V2, V3, V4, V5, V6, V7; \
+   unsigned r; \
+   V0 = H0; \
+   V1 = H1; \
+   V2 = H2; \
+   V3 = H3; \
+   V4 = _mm256_xor_si256( S0, _mm256_set_epi32( CS1, CS1, CS1, CS1, \
+			                        CS0, CS0, CS0, CS0 ) ); \
+   V5 = _mm256_xor_si256( S1, _mm256_set_epi32( CS3, CS3, CS3, CS3, \
+                                                CS2, CS2, CS2, CS2 ) ); \
+   V6 = _mm256_xor_si256( _mm256_set1_epi32( T0 ), \
+                              _mm256_set_epi32( CS5, CS5, CS5, CS5, \
+		                                CS4, CS4, CS4, CS4 ) ); \
+   V7 = _mm256_xor_si256( _mm256_set1_epi32( T1 ), \
+                              _mm256_set_epi32( CS7, CS7, CS7, CS7, \
+                                                CS6, CS6, CS6, CS6 ) ); \
+   M0 = mm256_bswap_32( buf[ 0] ); \
+   M1 = mm256_bswap_32( buf[ 1] ); \
+   M2 = mm256_bswap_32( buf[ 2] ); \
+   M3 = mm256_bswap_32( buf[ 3] ); \
+   M4 = mm256_bswap_32( buf[ 4] ); \
+   M5 = mm256_bswap_32( buf[ 5] ); \
+   M6 = mm256_bswap_32( buf[ 6] ); \
+   M7 = mm256_bswap_32( buf[ 7] ); \
+   ROUND_Sx2_4WAY(0); \
+   ROUND_Sx2_4WAY(1); \
+   ROUND_Sx2_4WAY(2); \
+   ROUND_Sx2_4WAY(3); \
+   ROUND_Sx2_4WAY(4); \
+   ROUND_Sx2_4WAY(5); \
+   ROUND_Sx2_4WAY(6); \
+   ROUND_Sx2_4WAY(7); \
+   if (rounds == 14) \
+   { \
+      ROUND_Sx2_4WAY(8); \
+      ROUND_Sx2_4WAY(9); \
+      ROUND_Sx2_4WAY(0); \
+      ROUND_Sx2_4WAY(1); \
+      ROUND_Sx2_4WAY(2); \
+      ROUND_Sx2_4WAY(3); \
+   } \
+   H0 = _mm256_xor_si256( _mm256_xor_si256( \
+                                   _mm256_xor_si256( V4, V0 ), S0 ), H0 ); \
+   H1 = _mm256_xor_si256( _mm256_xor_si256( \
+                                   _mm256_xor_si256( V5, V1 ), S1 ), H1 ); \
+   H2 = _mm256_xor_si256( _mm256_xor_si256( \
+                                   _mm256_xor_si256( V6, V2 ), S0 ), H2 ); \
+   H3 = _mm256_xor_si256( _mm256_xor_si256( \
+                                   _mm256_xor_si256( V7, V3 ), S1 ), H3 ); \
+} while (0)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
--- a/algo/blake/blake2s-4way.c
+++ b/algo/blake/blake2s-4way.c
@@ -85,7 +85,8 @@ void blake2s_4way_hash( void *output, const void *input )
   blake2s_4way_update( &ctx, input + (64<<2), 16 );
   blake2s_4way_final( &ctx, vhash, BLAKE2S_OUTBYTES );

-   mm_deinterleave_4x32( output, output+32, output+64, output+96, vhash, 256 );
+   mm128_deinterleave_4x32( output, output+32, output+64, output+96,
+		            vhash, 256 );
 }

 int scanhash_blake2s_4way( int thr_id, struct work *work, uint32_t max_nonce,
@@ -104,7 +105,7 @@ int scanhash_blake2s_4way( int thr_id, struct work *work, uint32_t max_nonce,
   uint32_t *noncep = vdata + 76;   // 19*4

   swab32_array( edata, pdata, 20 );
-   mm_interleave_4x32( vdata, edata, edata, edata, edata, 640 );
+   mm128_interleave_4x32( vdata, edata, edata, edata, edata, 640 );
   blake2s_4way_init( &blake2s_4w_ctx, BLAKE2S_OUTBYTES );
   blake2s_4way_update( &blake2s_4w_ctx, vdata, 64 );

--- a/algo/blake/blake2s-gate.c
+++ b/algo/blake/blake2s-gate.c
@@ -20,7 +20,7 @@ bool register_blake2s_algo( algo_gate_t* gate )
  gate->hash      = (void*)&blake2s_hash;
 #endif
  gate->get_max64 = (void*)&blake2s_get_max64;
-  gate->optimizations = AVX_OPT | AVX2_OPT;
+  gate->optimizations = SSE42_OPT | AVX2_OPT;
  return true;
 };

--- a/algo/blake/blake2s-gate.h
+++ b/algo/blake/blake2s-gate.h
@@ -4,7 +4,7 @@
 #include <stdint.h>
 #include "algo-gate-api.h"

-#if defined(__AVX__)
+#if defined(__SSE4_2__)
  #define BLAKE2S_4WAY
 #endif
 #if defined(__AVX2__)
--- a/algo/blake/blake2s-hash-4way.c
+++ b/algo/blake/blake2s-hash-4way.c
@@ -17,7 +17,7 @@
 #include <string.h>
 #include <stdio.h>

-#if defined(__AVX__)
+#if defined(__SSE4_2__)

 static const uint32_t blake2s_IV[8] =
 {
@@ -92,13 +92,13 @@ int blake2s_4way_compress( blake2s_4way_state *S, const __m128i* block )
 #define G4W(r,i,a,b,c,d) \
 do { \
   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ blake2s_sigma[r][2*i+0] ] ); \
-   d = mm_rotr_32( _mm_xor_si128( d, a ), 16 ); \
+   d = mm128_ror_32( _mm_xor_si128( d, a ), 16 ); \
   c = _mm_add_epi32( c, d ); \
-   b = mm_rotr_32( _mm_xor_si128( b, c ), 12 ); \
+   b = mm128_ror_32( _mm_xor_si128( b, c ), 12 ); \
   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ blake2s_sigma[r][2*i+1] ] ); \
-   d = mm_rotr_32( _mm_xor_si128( d, a ),  8 ); \
+   d = mm128_ror_32( _mm_xor_si128( d, a ),  8 ); \
   c = _mm_add_epi32( c, d ); \
-   b = mm_rotr_32( _mm_xor_si128( b, c ),  7 ); \
+   b = mm128_ror_32( _mm_xor_si128( b, c ),  7 ); \
 } while(0)

 #define ROUND4W(r)  \
@@ -210,14 +210,14 @@ int blake2s_8way_compress( blake2s_8way_state *S, const __m256i *block )
 do { \
   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), \
                          m[ blake2s_sigma[r][2*i+0] ] ); \
-   d = mm256_rotr_32( _mm256_xor_si256( d, a ), 16 ); \
+   d = mm256_ror_32( _mm256_xor_si256( d, a ), 16 ); \
   c = _mm256_add_epi32( c, d ); \
-   b = mm256_rotr_32( _mm256_xor_si256( b, c ), 12 ); \
+   b = mm256_ror_32( _mm256_xor_si256( b, c ), 12 ); \
   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), \
                         m[ blake2s_sigma[r][2*i+1] ] ); \
-   d = mm256_rotr_32( _mm256_xor_si256( d, a ),  8 ); \
+   d = mm256_ror_32( _mm256_xor_si256( d, a ),  8 ); \
   c = _mm256_add_epi32( c, d ); \
-   b = mm256_rotr_32( _mm256_xor_si256( b, c ),  7 ); \
+   b = mm256_ror_32( _mm256_xor_si256( b, c ),  7 ); \
 } while(0)

 #define ROUND8W(r)  \
@@ -359,4 +359,4 @@ int blake2s( uint8_t *out, const void *in, const void *key, const uint8_t outlen
 }
 #endif

-#endif // __AVX__
+#endif // __SSE4_2__
--- a/algo/blake/blake2s-hash-4way.h
+++ b/algo/blake/blake2s-hash-4way.h
@@ -14,9 +14,9 @@
 #ifndef __BLAKE2S_HASH_4WAY_H__
 #define __BLAKE2S_HASH_4WAY_H__ 1

-#if defined(__AVX__)
+#if defined(__SSE4_2__)

-#include "avxdefs.h"
+#include "simd-utils.h"

 #include <stddef.h>
 #include <stdint.h>
@@ -107,6 +107,6 @@ int blake2s_8way_final( blake2s_8way_state *S, void *out, uint8_t outlen );
 }
 #endif

-#endif  // __AVX__
+#endif  // __SSE4_2__

 #endif
--- a/algo/blake/blake512-hash-4way.c
+++ b/algo/blake/blake512-hash-4way.c
@@ -0,0 +1,701 @@
+/* $Id: blake.c 252 2011-06-07 17:55:14Z tp $ */
+/*
+ * BLAKE implementation.
+ *
+ * ==========================(LICENSE BEGIN)============================
+ *
+ * Copyright (c) 2007-2010  Projet RNRT SAPHIR
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining
+ * a copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sublicense, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
+ * included in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * ===========================(LICENSE END)=============================
+ *
+ * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
+ */
+
+#if defined (__AVX2__)
+
+#include <stddef.h>
+#include <string.h>
+#include <limits.h>
+
+#include "blake-hash-4way.h"
+
+#ifdef __cplusplus
+extern "C"{
+#endif
+
+#if SPH_SMALL_FOOTPRINT && !defined SPH_SMALL_FOOTPRINT_BLAKE
+#define SPH_SMALL_FOOTPRINT_BLAKE   1
+#endif
+
+#if SPH_64 && (SPH_SMALL_FOOTPRINT_BLAKE || !SPH_64_TRUE)
+#define SPH_COMPACT_BLAKE_64   1
+#endif
+
+#ifdef _MSC_VER
+#pragma warning (disable: 4146)
+#endif
+
+
+// Blake-512
+
+static const sph_u64 IV512[8] = {
+	SPH_C64(0x6A09E667F3BCC908), SPH_C64(0xBB67AE8584CAA73B),
+	SPH_C64(0x3C6EF372FE94F82B), SPH_C64(0xA54FF53A5F1D36F1),
+	SPH_C64(0x510E527FADE682D1), SPH_C64(0x9B05688C2B3E6C1F),
+	SPH_C64(0x1F83D9ABFB41BD6B), SPH_C64(0x5BE0CD19137E2179)
+};
+
+
+#if SPH_COMPACT_BLAKE_32 || SPH_COMPACT_BLAKE_64
+
+// Blake-256 4 & 8 way, Blake-512 4 way
+
+static const unsigned sigma[16][16] = {
+	{  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
+	{ 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
+	{ 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
+	{  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
+	{  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
+	{  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
+	{ 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
+	{ 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
+	{  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
+	{ 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
+	{  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
+	{ 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
+	{ 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
+	{  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
+	{  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
+	{  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 }
+};
+
+#endif
+
+#define Z00   0
+#define Z01   1
+#define Z02   2
+#define Z03   3
+#define Z04   4
+#define Z05   5
+#define Z06   6
+#define Z07   7
+#define Z08   8
+#define Z09   9
+#define Z0A   A
+#define Z0B   B
+#define Z0C   C
+#define Z0D   D
+#define Z0E   E
+#define Z0F   F
+
+#define Z10   E
+#define Z11   A
+#define Z12   4
+#define Z13   8
+#define Z14   9
+#define Z15   F
+#define Z16   D
+#define Z17   6
+#define Z18   1
+#define Z19   C
+#define Z1A   0
+#define Z1B   2
+#define Z1C   B
+#define Z1D   7
+#define Z1E   5
+#define Z1F   3
+
+#define Z20   B
+#define Z21   8
+#define Z22   C
+#define Z23   0
+#define Z24   5
+#define Z25   2
+#define Z26   F
+#define Z27   D
+#define Z28   A
+#define Z29   E
+#define Z2A   3
+#define Z2B   6
+#define Z2C   7
+#define Z2D   1
+#define Z2E   9
+#define Z2F   4
+
+#define Z30   7
+#define Z31   9
+#define Z32   3
+#define Z33   1
+#define Z34   D
+#define Z35   C
+#define Z36   B
+#define Z37   E
+#define Z38   2
+#define Z39   6
+#define Z3A   5
+#define Z3B   A
+#define Z3C   4
+#define Z3D   0
+#define Z3E   F
+#define Z3F   8
+
+#define Z40   9
+#define Z41   0
+#define Z42   5
+#define Z43   7
+#define Z44   2
+#define Z45   4
+#define Z46   A
+#define Z47   F
+#define Z48   E
+#define Z49   1
+#define Z4A   B
+#define Z4B   C
+#define Z4C   6
+#define Z4D   8
+#define Z4E   3
+#define Z4F   D
+
+#define Z50   2
+#define Z51   C
+#define Z52   6
+#define Z53   A
+#define Z54   0
+#define Z55   B
+#define Z56   8
+#define Z57   3
+#define Z58   4
+#define Z59   D
+#define Z5A   7
+#define Z5B   5
+#define Z5C   F
+#define Z5D   E
+#define Z5E   1
+#define Z5F   9
+
+#define Z60   C
+#define Z61   5
+#define Z62   1
+#define Z63   F
+#define Z64   E
+#define Z65   D
+#define Z66   4
+#define Z67   A
+#define Z68   0
+#define Z69   7
+#define Z6A   6
+#define Z6B   3
+#define Z6C   9
+#define Z6D   2
+#define Z6E   8
+#define Z6F   B
+
+#define Z70   D
+#define Z71   B
+#define Z72   7
+#define Z73   E
+#define Z74   C
+#define Z75   1
+#define Z76   3
+#define Z77   9
+#define Z78   5
+#define Z79   0
+#define Z7A   F
+#define Z7B   4
+#define Z7C   8
+#define Z7D   6
+#define Z7E   2
+#define Z7F   A
+
+#define Z80   6
+#define Z81   F
+#define Z82   E
+#define Z83   9
+#define Z84   B
+#define Z85   3
+#define Z86   0
+#define Z87   8
+#define Z88   C
+#define Z89   2
+#define Z8A   D
+#define Z8B   7
+#define Z8C   1
+#define Z8D   4
+#define Z8E   A
+#define Z8F   5
+
+#define Z90   A
+#define Z91   2
+#define Z92   8
+#define Z93   4
+#define Z94   7
+#define Z95   6
+#define Z96   1
+#define Z97   5
+#define Z98   F
+#define Z99   B
+#define Z9A   9
+#define Z9B   E
+#define Z9C   3
+#define Z9D   C
+#define Z9E   D
+#define Z9F   0
+
+#define Mx(r, i)    Mx_(Z ## r ## i)
+#define Mx_(n)      Mx__(n)
+#define Mx__(n)     M ## n
+
+// Blake-512 4 way
+
+#define CBx(r, i)   CBx_(Z ## r ## i)
+#define CBx_(n)     CBx__(n)
+#define CBx__(n)    CB ## n
+
+#define CB0   SPH_C64(0x243F6A8885A308D3)
+#define CB1   SPH_C64(0x13198A2E03707344)
+#define CB2   SPH_C64(0xA4093822299F31D0)
+#define CB3   SPH_C64(0x082EFA98EC4E6C89)
+#define CB4   SPH_C64(0x452821E638D01377)
+#define CB5   SPH_C64(0xBE5466CF34E90C6C)
+#define CB6   SPH_C64(0xC0AC29B7C97C50DD)
+#define CB7   SPH_C64(0x3F84D5B5B5470917)
+#define CB8   SPH_C64(0x9216D5D98979FB1B)
+#define CB9   SPH_C64(0xD1310BA698DFB5AC)
+#define CBA   SPH_C64(0x2FFD72DBD01ADFB7)
+#define CBB   SPH_C64(0xB8E1AFED6A267E96)
+#define CBC   SPH_C64(0xBA7C9045F12C7F99)
+#define CBD   SPH_C64(0x24A19947B3916CF7)
+#define CBE   SPH_C64(0x0801F2E2858EFC16)
+#define CBF   SPH_C64(0x636920D871574E69)
+
+#if SPH_COMPACT_BLAKE_64
+// not used
+static const sph_u64 CB[16] = {
+	SPH_C64(0x243F6A8885A308D3), SPH_C64(0x13198A2E03707344),
+	SPH_C64(0xA4093822299F31D0), SPH_C64(0x082EFA98EC4E6C89),
+	SPH_C64(0x452821E638D01377), SPH_C64(0xBE5466CF34E90C6C),
+	SPH_C64(0xC0AC29B7C97C50DD), SPH_C64(0x3F84D5B5B5470917),
+	SPH_C64(0x9216D5D98979FB1B), SPH_C64(0xD1310BA698DFB5AC),
+	SPH_C64(0x2FFD72DBD01ADFB7), SPH_C64(0xB8E1AFED6A267E96),
+	SPH_C64(0xBA7C9045F12C7F99), SPH_C64(0x24A19947B3916CF7),
+	SPH_C64(0x0801F2E2858EFC16), SPH_C64(0x636920D871574E69)
+};
+
+#endif
+
+
+// Blake-512 4 way
+
+#define GB_4WAY(m0, m1, c0, c1, a, b, c, d)   do { \
+   a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
+                 _mm256_set_epi64x( c1, c1, c1, c1 ), m0 ), b ), a ); \
+   d = mm256_ror_64( _mm256_xor_si256( d, a ), 32 ); \
+   c = _mm256_add_epi64( c, d ); \
+   b = mm256_ror_64( _mm256_xor_si256( b, c ), 25 ); \
+   a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
+                 _mm256_set_epi64x( c0, c0, c0, c0 ), m1 ), b ), a ); \
+   d = mm256_ror_64( _mm256_xor_si256( d, a ), 16 ); \
+   c = _mm256_add_epi64( c, d ); \
+   b = mm256_ror_64( _mm256_xor_si256( b, c ), 11 ); \
+} while (0)
+
+#if SPH_COMPACT_BLAKE_64
+// not used
+#define ROUND_B_4WAY(r)   do { \
+	GB_4WAY(M[sigma[r][0x0]], M[sigma[r][0x1]], \
+		CB[sigma[r][0x0]], CB[sigma[r][0x1]], V0, V4, V8, VC); \
+	GB_4WAY(M[sigma[r][0x2]], M[sigma[r][0x3]], \
+		CB[sigma[r][0x2]], CB[sigma[r][0x3]], V1, V5, V9, VD); \
+	GB_4WAY(M[sigma[r][0x4]], M[sigma[r][0x5]], \
+		CB[sigma[r][0x4]], CB[sigma[r][0x5]], V2, V6, VA, VE); \
+	GB_4WAY(M[sigma[r][0x6]], M[sigma[r][0x7]], \
+		CB[sigma[r][0x6]], CB[sigma[r][0x7]], V3, V7, VB, VF); \
+	GB_4WAY(M[sigma[r][0x8]], M[sigma[r][0x9]], \
+		CB[sigma[r][0x8]], CB[sigma[r][0x9]], V0, V5, VA, VF); \
+	GB_4WAY(M[sigma[r][0xA]], M[sigma[r][0xB]], \
+		CB[sigma[r][0xA]], CB[sigma[r][0xB]], V1, V6, VB, VC); \
+	GB_4WAY(M[sigma[r][0xC]], M[sigma[r][0xD]], \
+		CB[sigma[r][0xC]], CB[sigma[r][0xD]], V2, V7, V8, VD); \
+	GB_4WAY(M[sigma[r][0xE]], M[sigma[r][0xF]], \
+		CB[sigma[r][0xE]], CB[sigma[r][0xF]], V3, V4, V9, VE); \
+} while (0)
+
+#else
+// current impl
+#define ROUND_B_4WAY(r)   do { \
+	GB_4WAY(Mx(r, 0), Mx(r, 1), CBx(r, 0), CBx(r, 1), V0, V4, V8, VC); \
+	GB_4WAY(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD); \
+	GB_4WAY(Mx(r, 4), Mx(r, 5), CBx(r, 4), CBx(r, 5), V2, V6, VA, VE); \
+	GB_4WAY(Mx(r, 6), Mx(r, 7), CBx(r, 6), CBx(r, 7), V3, V7, VB, VF); \
+	GB_4WAY(Mx(r, 8), Mx(r, 9), CBx(r, 8), CBx(r, 9), V0, V5, VA, VF); \
+	GB_4WAY(Mx(r, A), Mx(r, B), CBx(r, A), CBx(r, B), V1, V6, VB, VC); \
+	GB_4WAY(Mx(r, C), Mx(r, D), CBx(r, C), CBx(r, D), V2, V7, V8, VD); \
+	GB_4WAY(Mx(r, E), Mx(r, F), CBx(r, E), CBx(r, F), V3, V4, V9, VE); \
+	} while (0)
+
+#endif
+
+
+// Blake-512 4 way
+
+#define DECL_STATE64_4WAY \
+	__m256i H0, H1, H2, H3, H4, H5, H6, H7; \
+        __m256i S0, S1, S2, S3; \
+	sph_u64 T0, T1;
+
+#define READ_STATE64_4WAY(state)   do { \
+		H0 = (state)->H[0]; \
+		H1 = (state)->H[1]; \
+		H2 = (state)->H[2]; \
+		H3 = (state)->H[3]; \
+		H4 = (state)->H[4]; \
+		H5 = (state)->H[5]; \
+		H6 = (state)->H[6]; \
+		H7 = (state)->H[7]; \
+		S0 = (state)->S[0]; \
+		S1 = (state)->S[1]; \
+		S2 = (state)->S[2]; \
+		S3 = (state)->S[3]; \
+		T0 = (state)->T0; \
+		T1 = (state)->T1; \
+	} while (0)
+
+#define WRITE_STATE64_4WAY(state)   do { \
+		(state)->H[0] = H0; \
+		(state)->H[1] = H1; \
+		(state)->H[2] = H2; \
+		(state)->H[3] = H3; \
+		(state)->H[4] = H4; \
+		(state)->H[5] = H5; \
+		(state)->H[6] = H6; \
+		(state)->H[7] = H7; \
+		(state)->S[0] = S0; \
+		(state)->S[1] = S1; \
+		(state)->S[2] = S2; \
+		(state)->S[3] = S3; \
+		(state)->T0 = T0; \
+		(state)->T1 = T1; \
+	} while (0)
+
+#if SPH_COMPACT_BLAKE_64
+
+// not used
+#define COMPRESS64_4WAY   do { \
+	__m256i M[16]; \
+	__m256i V0, V1, V2, V3, V4, V5, V6, V7; \
+	__m256i V8, V9, VA, VB, VC, VD, VE, VF; \
+	unsigned r; \
+	V0 = H0; \
+	V1 = H1; \
+	V2 = H2; \
+	V3 = H3; \
+	V4 = H4; \
+	V5 = H5; \
+	V6 = H6; \
+	V7 = H7; \
+        V8 = _mm256_xor_si256( S0, _mm256_set_epi64x( CB0, CB0, CB0, CB0 ) ); \
+        V9 = _mm256_xor_si256( S1, _mm256_set_epi64x( CB1, CB1, CB1, CB1 ) ); \
+        VA = _mm256_xor_si256( S2, _mm256_set_epi64x( CB2, CB2, CB2, CB2 ) ); \
+        VB = _mm256_xor_si256( S3, _mm256_set_epi64x( CB3, CB3, CB3, CB3 ) ); \
+        VC = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
+                               _mm256_set_epi64x( CB4, CB4, CB4, CB4 ) ); \
+        VD = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
+                               _mm256_set_epi64x( CB5, CB5, CB5, CB5 ) ); \
+        VE = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
+                               _mm256_set_epi64x( CB6, CB6, CB6, CB6 ) ); \
+        VF = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
+                               _mm256_set_epi64x( CB7, CB7, CB7, CB7 ) ); \
+	M[0x0] = mm256_bswap_64( *(buf+0) ); \
+	M[0x1] = mm256_bswap_64( *(buf+1) ); \
+	M[0x2] = mm256_bswap_64( *(buf+2) ); \
+	M[0x3] = mm256_bswap_64( *(buf+3) ); \
+	M[0x4] = mm256_bswap_64( *(buf+4) ); \
+	M[0x5] = mm256_bswap_64( *(buf+5) ); \
+	M[0x6] = mm256_bswap_64( *(buf+6) ); \
+	M[0x7] = mm256_bswap_64( *(buf+7) ); \
+	M[0x8] = mm256_bswap_64( *(buf+8) ); \
+	M[0x9] = mm256_bswap_64( *(buf+9) ); \
+	M[0xA] = mm256_bswap_64( *(buf+10) ); \
+	M[0xB] = mm256_bswap_64( *(buf+11) ); \
+	M[0xC] = mm256_bswap_64( *(buf+12) ); \
+	M[0xD] = mm256_bswap_64( *(buf+13) ); \
+	M[0xE] = mm256_bswap_64( *(buf+14) ); \
+	M[0xF] = mm256_bswap_64( *(buf+15) ); \
+	for (r = 0; r < 16; r ++) \
+		ROUND_B_4WAY(r); \
+        H0 = _mm256_xor_si256( _mm256_xor_si256( \
+                    _mm256_xor_si256( S0, V0 ), V8 ), H0 ); \
+        H1 = _mm256_xor_si256( _mm256_xor_si256( \
+                    _mm256_xor_si256( S1, V1 ), V9 ), H1 ); \
+        H2 = _mm256_xor_si256( _mm256_xor_si256( \
+                    _mm256_xor_si256( S2, V2 ), VA ), H2 ); \
+        H3 = _mm256_xor_si256( _mm256_xor_si256( \
+                    _mm256_xor_si256( S3, V3 ), VB ), H3 ); \
+        H4 = _mm256_xor_si256( _mm256_xor_si256( \
+                    _mm256_xor_si256( S0, V4 ), VC ), H4 ); \
+        H5 = _mm256_xor_si256( _mm256_xor_si256( \
+                    _mm256_xor_si256( S1, V5 ), VD ), H5 ); \
+        H6 = _mm256_xor_si256( _mm256_xor_si256( \
+                    _mm256_xor_si256( S2, V6 ), VE ), H6 ); \
+        H7 = _mm256_xor_si256( _mm256_xor_si256( \
+                    _mm256_xor_si256( S3, V7 ), VF ), H7 ); \
+	} while (0)
+
+#else
+
+//current impl
+
+#define COMPRESS64_4WAY   do { \
+     __m256i M0, M1, M2, M3, M4, M5, M6, M7; \
+     __m256i M8, M9, MA, MB, MC, MD, ME, MF; \
+     __m256i V0, V1, V2, V3, V4, V5, V6, V7; \
+     __m256i V8, V9, VA, VB, VC, VD, VE, VF; \
+     V0 = H0; \
+     V1 = H1; \
+     V2 = H2; \
+     V3 = H3; \
+     V4 = H4; \
+     V5 = H5; \
+     V6 = H6; \
+     V7 = H7; \
+     V8 = _mm256_xor_si256( S0, _mm256_set_epi64x( CB0, CB0, CB0, CB0 ) );  \
+     V9 = _mm256_xor_si256( S1, _mm256_set_epi64x( CB1, CB1, CB1, CB1 ) );  \
+     VA = _mm256_xor_si256( S2, _mm256_set_epi64x( CB2, CB2, CB2, CB2 ) );  \
+     VB = _mm256_xor_si256( S3, _mm256_set_epi64x( CB3, CB3, CB3, CB3 ) );  \
+     VC = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
+                            _mm256_set_epi64x( CB4, CB4, CB4, CB4 ) );  \
+     VD = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
+                            _mm256_set_epi64x( CB5, CB5, CB5, CB5 ) );  \
+     VE = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
+                            _mm256_set_epi64x( CB6, CB6, CB6, CB6 ) );  \
+     VF = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
+                            _mm256_set_epi64x( CB7, CB7, CB7, CB7 ) );  \
+     M0 = mm256_bswap_64( *(buf + 0) ); \
+     M1 = mm256_bswap_64( *(buf + 1) ); \
+     M2 = mm256_bswap_64( *(buf + 2) ); \
+     M3 = mm256_bswap_64( *(buf + 3) ); \
+     M4 = mm256_bswap_64( *(buf + 4) ); \
+     M5 = mm256_bswap_64( *(buf + 5) ); \
+     M6 = mm256_bswap_64( *(buf + 6) ); \
+     M7 = mm256_bswap_64( *(buf + 7) ); \
+     M8 = mm256_bswap_64( *(buf + 8) ); \
+     M9 = mm256_bswap_64( *(buf + 9) ); \
+     MA = mm256_bswap_64( *(buf + 10) ); \
+     MB = mm256_bswap_64( *(buf + 11) ); \
+     MC = mm256_bswap_64( *(buf + 12) ); \
+     MD = mm256_bswap_64( *(buf + 13) ); \
+     ME = mm256_bswap_64( *(buf + 14) ); \
+     MF = mm256_bswap_64( *(buf + 15) ); \
+     ROUND_B_4WAY(0); \
+     ROUND_B_4WAY(1); \
+     ROUND_B_4WAY(2); \
+     ROUND_B_4WAY(3); \
+     ROUND_B_4WAY(4); \
+     ROUND_B_4WAY(5); \
+     ROUND_B_4WAY(6); \
+     ROUND_B_4WAY(7); \
+     ROUND_B_4WAY(8); \
+     ROUND_B_4WAY(9); \
+     ROUND_B_4WAY(0); \
+     ROUND_B_4WAY(1); \
+     ROUND_B_4WAY(2); \
+     ROUND_B_4WAY(3); \
+     ROUND_B_4WAY(4); \
+     ROUND_B_4WAY(5); \
+     H0 = _mm256_xor_si256( _mm256_xor_si256( \
+                            _mm256_xor_si256( S0, V0 ), V8 ), H0 ); \
+     H1 = _mm256_xor_si256( _mm256_xor_si256( \
+                            _mm256_xor_si256( S1, V1 ), V9 ), H1 ); \
+     H2 = _mm256_xor_si256( _mm256_xor_si256( \
+                            _mm256_xor_si256( S2, V2 ), VA ), H2 ); \
+     H3 = _mm256_xor_si256( _mm256_xor_si256( \
+                            _mm256_xor_si256( S3, V3 ), VB ), H3 ); \
+     H4 = _mm256_xor_si256( _mm256_xor_si256( \
+                            _mm256_xor_si256( S0, V4 ), VC ), H4 ); \
+     H5 = _mm256_xor_si256( _mm256_xor_si256( \
+                            _mm256_xor_si256( S1, V5 ), VD ), H5 ); \
+     H6 = _mm256_xor_si256( _mm256_xor_si256( \
+                            _mm256_xor_si256( S2, V6 ), VE ), H6 ); \
+     H7 = _mm256_xor_si256( _mm256_xor_si256( \
+                            _mm256_xor_si256( S3, V7 ), VF ), H7 ); \
+	} while (0)
+
+#endif
+
+static const sph_u64 salt_zero_big[4] = { 0, 0, 0, 0 };
+
+static void
+blake64_4way_init( blake_4way_big_context *sc, const sph_u64 *iv,
+              const sph_u64 *salt )
+{
+        int i;
+        for ( i = 0; i < 8; i++ )
+           sc->H[i] = _mm256_set1_epi64x( iv[i] );
+        for ( i = 0; i < 4; i++ )
+           sc->S[i] = _mm256_set1_epi64x( salt[i] );
+        sc->T0 = sc->T1 = 0;
+        sc->ptr = 0;
+}
+
+static void
+blake64_4way( blake_4way_big_context *sc, const void *data, size_t len)
+{
+   __m256i *vdata = (__m256i*)data;
+   __m256i *buf;
+   size_t ptr;
+   DECL_STATE64_4WAY
+
+   const int buf_size = 128;  //  sizeof/8 
+
+   buf = sc->buf;
+   ptr = sc->ptr;
+   if ( len < (buf_size - ptr) )
+   {
+	memcpy_256( buf + (ptr>>3), vdata, len>>3 );
+	ptr += len;
+	sc->ptr = ptr;
+	return;
+   }
+
+   READ_STATE64_4WAY(sc);
+   while ( len > 0 )
+   {
+	size_t clen;
+
+	clen = buf_size - ptr;
+	if ( clen > len )
+		clen = len;
+	memcpy_256( buf + (ptr>>3), vdata, clen>>3 );
+	ptr += clen;
+	vdata = vdata + (clen>>3);
+	len -= clen;
+	if (ptr == buf_size )
+        {
+		if ((T0 = SPH_T64(T0 + 1024)) < 1024)
+			T1 = SPH_T64(T1 + 1);
+		COMPRESS64_4WAY;
+		ptr = 0;
+	}
+   }
+   WRITE_STATE64_4WAY(sc);
+   sc->ptr = ptr;
+}
+
+static void
+blake64_4way_close( blake_4way_big_context *sc,
+	unsigned ub, unsigned n, void *dst, size_t out_size_w64)
+{
+//   union {
+      __m256i buf[16];
+//      sph_u64 dummy;
+//   } u;
+   size_t ptr, k;
+   unsigned bit_len;
+   uint64_t z, zz;
+   sph_u64 th, tl;
+   __m256i *out;
+
+   ptr = sc->ptr;
+   bit_len = ((unsigned)ptr << 3);
+   z = 0x80 >> n;
+   zz = ((ub & -z) | z) & 0xFF;
+   buf[ptr>>3] = _mm256_set_epi64x( zz, zz, zz, zz );
+   tl = sc->T0 + bit_len;
+   th = sc->T1;
+   if (ptr == 0 )
+   {
+	sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
+	sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
+   }
+   else if ( sc->T0 == 0 )
+   {
+	sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL) + bit_len;
+	sc->T1 = SPH_T64(sc->T1 - 1);
+   } 
+   else
+   {
+        sc->T0 -= 1024 - bit_len;
+   }
+   if ( ptr <= 104 )
+   {
+       memset_zero_256( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
+       if ( out_size_w64 == 8 )
+          buf[(104>>3)] = _mm256_or_si256( buf[(104>>3)],
+                                 _mm256_set1_epi64x( 0x0100000000000000ULL ) );
+       *(buf+(112>>3)) = mm256_bswap_64(
+                                    _mm256_set_epi64x( th, th, th, th ) );
+       *(buf+(120>>3)) = mm256_bswap_64(
+                                    _mm256_set_epi64x( tl, tl, tl, tl ) );
+
+       blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
+   }
+   else
+  {
+       memset_zero_256( buf + (ptr>>3) + 1, (120 - ptr) >> 3 );
+
+       blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
+       sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
+       sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
+       memset_zero_256( buf, 112>>3 ); 
+       if ( out_size_w64 == 8 )
+           buf[104>>3] = _mm256_set1_epi64x( 0x0100000000000000ULL );
+       *(buf+(112>>3)) = mm256_bswap_64(
+                                    _mm256_set_epi64x( th, th, th, th ) );
+       *(buf+(120>>3)) = mm256_bswap_64(
+                                    _mm256_set_epi64x( tl, tl, tl, tl ) );
+
+       blake64_4way( sc, buf, 128 );
+   }
+   out = (__m256i*)dst;
+   for ( k = 0; k < out_size_w64; k++ )
+       out[k] = mm256_bswap_64( sc->H[k] );
+}
+
+void
+blake512_4way_init(void *cc)
+{
+	blake64_4way_init(cc, IV512, salt_zero_big);
+}
+
+void
+blake512_4way(void *cc, const void *data, size_t len)
+{
+	blake64_4way(cc, data, len);
+}
+
+void
+blake512_4way_close(void *cc, void *dst)
+{
+	blake512_4way_addbits_and_close(cc, 0, 0, dst);
+}
+
+void
+blake512_4way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
+{
+	blake64_4way_close(cc, ub, n, dst, 8);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
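
Reviewer note, not part of the patch: the update loop in `blake64_4way` keeps a 128-bit bit counter in two 64-bit words (`T0` low, `T1` high). It adds 1024 bits for each 128-byte block and carries into `T1` when `T0` wraps. A minimal scalar sketch of that carry rule follows. The names `bit_counter128` and `counter_add_block` are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the T0/T1 pair kept in the Blake-512 context. */
typedef struct
{
   uint64_t t0;   /* low 64 bits of the message bit count  */
   uint64_t t1;   /* high 64 bits of the message bit count */
} bit_counter128;

/* Account for one full 128-byte block (1024 bits).
 * Unsigned addition wraps modulo 2^64, so a sum smaller than the
 * addend means the low word overflowed and the high word must be
 * incremented. */
static inline void counter_add_block( bit_counter128 *c )
{
   c->t0 += 1024;
   if ( c->t0 < 1024 )
      c->t1 += 1;
}
```

The same comparison trick appears in the patch as `(T0 = SPH_T64(T0 + 1024)) < 1024`. The sketch only separates the addition from the test for readability.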
--- a/algo/blake/blakecoin-4way.c
+++ b/algo/blake/blakecoin-4way.c
@@ -17,7 +17,7 @@ void blakecoin_4way_hash(void *state, const void *input)
     blake256r8_4way( &ctx, input + (64<<2), 16 );
     blake256r8_4way_close( &ctx, vhash );

-     mm_deinterleave_4x32( state, state+32, state+64, state+96, vhash, 256 );
+     mm128_deinterleave_4x32( state, state+32, state+64, state+96, vhash, 256 );
 }

 int scanhash_blakecoin_4way( int thr_id, struct work *work, uint32_t max_nonce,
@@ -37,7 +37,7 @@ int scanhash_blakecoin_4way( int thr_id, struct work *work, uint32_t max_nonce,
      HTarget = 0x7f;

   swab32_array( edata, pdata, 20 );
-   mm_interleave_4x32( vdata, edata, edata, edata, edata, 640 );
+   mm128_interleave_4x32( vdata, edata, edata, edata, edata, 640 );
   blake256r8_4way_init( &blakecoin_4w_ctx );
   blake256r8_4way( &blakecoin_4w_ctx, vdata, 64 );

--- a/algo/blake/blakecoin-gate.c
+++ b/algo/blake/blakecoin-gate.c
@@ -22,7 +22,7 @@ bool register_vanilla_algo( algo_gate_t* gate )
  gate->scanhash = (void*)&scanhash_blakecoin;
  gate->hash     = (void*)&blakecoinhash;
 #endif
-  gate->optimizations = AVX_OPT | AVX2_OPT;
+  gate->optimizations = SSE42_OPT | AVX2_OPT;
  gate->get_max64 = (void*)&blakecoin_get_max64;
  return true;
 }
--- a/algo/blake/blakecoin-gate.h
+++ b/algo/blake/blakecoin-gate.h
@@ -4,7 +4,7 @@
 #include "algo-gate-api.h"
 #include <stdint.h>

-#if defined(__AVX__)
+#if defined(__SSE4_2__)
  #define BLAKECOIN_4WAY
 #endif
 #if defined(__AVX2__)
--- a/algo/blake/decred-4way.c
+++ b/algo/blake/decred-4way.c
@@ -23,7 +23,7 @@ void decred_hash_4way( void *state, const void *input )
     memcpy( &ctx, &blake_mid, sizeof(blake_mid) );
     blake256_4way( &ctx, tail, tail_len );
     blake256_4way_close( &ctx, vhash );
-     mm_deinterleave_4x32( state, state+32, state+64, state+96, vhash, 256 );
+     mm128_deinterleave_4x32( state, state+32, state+64, state+96, vhash, 256 );
 }

 int scanhash_decred_4way( int thr_id, struct work *work, uint32_t max_nonce,
@@ -44,7 +44,7 @@ int scanhash_decred_4way( int thr_id, struct work *work, uint32_t max_nonce,
   memcpy( edata, pdata, 180 );

   // Use the old way until the new way is updated for this size.
-   mm_interleave_4x32x( vdata, edata, edata, edata, edata, 180*8 );
+   mm128_interleave_4x32x( vdata, edata, edata, edata, edata, 180*8 );

   blake256_4way_init( &blake_mid );
   blake256_4way( &blake_mid, vdata, DECRED_MIDSTATE_LEN );
--- a/algo/blake/decred-gate.c
+++ b/algo/blake/decred-gate.c
@@ -140,6 +140,7 @@ bool decred_ready_to_mine( struct work* work, struct stratum_ctx* stratum,
   return true;
 }

+int decred_get_work_data_size() { return DECRED_DATA_SIZE; }

 bool register_decred_algo( algo_gate_t* gate )
 {
@@ -154,7 +155,7 @@ bool register_decred_algo( algo_gate_t* gate )
  gate->optimizations = AVX2_OPT;
  gate->get_nonceptr          = (void*)&decred_get_nonceptr;
  gate->get_max64             = (void*)&get_max64_0x3fffffLL;
-  gate->display_extra_data    = (void*)&decred_decode_extradata;
+  gate->decode_extra_data     = (void*)&decred_decode_extradata;
  gate->build_stratum_request = (void*)&decred_be_build_stratum_request;
  gate->work_decode           = (void*)&std_be_work_decode;
  gate->submit_getwork_result = (void*)&std_be_submit_getwork_result;
@@ -163,7 +164,7 @@ bool register_decred_algo( algo_gate_t* gate )
  gate->nbits_index           = DECRED_NBITS_INDEX;
  gate->ntime_index           = DECRED_NTIME_INDEX;
  gate->nonce_index           = DECRED_NONCE_INDEX;
-  gate->work_data_size        = DECRED_DATA_SIZE;
+  gate->get_work_data_size    = (void*)&decred_get_work_data_size;
  gate->work_cmp_size         = DECRED_WORK_COMPARE_SIZE;
  allow_mininginfo            = false;
  have_gbt                    = false;
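
Reviewer note, not part of the patch: this hunk replaces the `work_data_size` data member with a `get_work_data_size` function pointer, so an algorithm overrides a getter instead of a constant. The sketch below shows that pattern on a trimmed-down gate. `mini_gate_t`, the `_demo` names and both sizes are assumptions for illustration; the real `algo_gate_t` has many more members and is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down gate with a single overridable getter. */
typedef struct
{
   int (*get_work_data_size)( void );
} mini_gate_t;

/* Assumed values, for illustration only. */
#define STD_WORK_DATA_SIZE_DEMO    128
#define DECRED_DATA_SIZE_DEMO      180

static int std_get_work_data_size_demo( void )
{
   return STD_WORK_DATA_SIZE_DEMO;
}

static int decred_get_work_data_size_demo( void )
{
   return DECRED_DATA_SIZE_DEMO;
}

/* Install the default getter, as a generic gate initializer might. */
static void init_mini_gate( mini_gate_t *gate )
{
   gate->get_work_data_size = std_get_work_data_size_demo;
}

/* An algorithm's register function overrides only what differs. */
static void register_decred_demo( mini_gate_t *gate )
{
   init_mini_gate( gate );
   gate->get_work_data_size = decred_get_work_data_size_demo;
}
```

The sketch uses a typed function pointer, so the compiler checks the getter's signature. The patch instead assigns through a `(void*)` cast, which compiles but skips that check.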
--- a/algo/blake/decred-gate.h
+++ b/algo/blake/decred-gate.h
@@ -18,7 +18,7 @@
 //                         uint64_t *hashes_done );
 #endif

-#if defined(__AVX2__)
+#if defined(__SSE4_2__)
  #define DECRED_4WAY
 #endif

--- a/algo/blake/decred.c
+++ b/algo/blake/decred.c
@@ -268,7 +268,7 @@ bool register_decred_algo( algo_gate_t* gate )
  gate->hash                  = (void*)&decred_hash;
  gate->get_nonceptr          = (void*)&decred_get_nonceptr;
  gate->get_max64             = (void*)&get_max64_0x3fffffLL;
-  gate->display_extra_data    = (void*)&decred_decode_extradata;
+  gate->decode_extra_data     = (void*)&decred_decode_extradata;
  gate->build_stratum_request = (void*)&decred_be_build_stratum_request;
  gate->work_decode           = (void*)&std_be_work_decode;
  gate->submit_getwork_result = (void*)&std_be_submit_getwork_result;
--- a/algo/bmw/bmw-hash-4way.h
+++ b/algo/bmw/bmw-hash-4way.h
@@ -41,15 +41,18 @@ extern "C"{
 #endif

 #include <stddef.h>
-#ifdef __AVX2__

 #include "algo/sha/sph_types.h"
-#include "avxdefs.h"
+#include "simd-utils.h"

 #define SPH_SIZE_bmw256   256

 #define SPH_SIZE_bmw512   512

+#if defined(__SSE2__)
+
+// BMW-256 4 way 32
+
 typedef struct {
   __m128i buf[64];
   __m128i H[16];
@@ -59,6 +62,60 @@ typedef struct {

 typedef bmw_4way_small_context bmw256_4way_context;

+void bmw256_4way_init(void *cc);
+
+void bmw256_4way(void *cc, const void *data, size_t len);
+
+void bmw256_4way_close(void *cc, void *dst);
+
+void bmw256_4way_addbits_and_close(
+        void *cc, unsigned ub, unsigned n, void *dst);
+
+#endif  // __SSE2__
+
+#if defined(__AVX2__)
+
+// BMW-256 8 way 32
+
+typedef struct {
+   __m256i buf[64];
+   __m256i H[16];
+   size_t ptr;
+   uint32_t bit_count;  // assume bit_count fits in 32 bits
+} bmw_8way_small_context __attribute__ ((aligned (64)));
+
+typedef bmw_8way_small_context bmw256_8way_context;
+
+void bmw256_8way_init( bmw256_8way_context *ctx );
+void bmw256_8way( bmw256_8way_context *ctx, const void *data, size_t len );
+void bmw256_8way_close( bmw256_8way_context *ctx, void *dst );
+
+#endif  // __AVX2__
+
+
+#if defined(__SSE2__)
+
+// BMW-512 2 way 64
+
+typedef struct {
+   __m128i buf[16];
+   __m128i H[16];
+   size_t ptr;
+   uint64_t bit_count; 
+} bmw_2way_big_context __attribute__ ((aligned (64)));
+
+typedef bmw_2way_big_context bmw512_2way_context;
+
+void bmw512_2way_init( bmw512_2way_context *ctx );
+void bmw512_2way( bmw512_2way_context *ctx, const void *data, size_t len );
+void bmw512_2way_close( bmw512_2way_context *ctx, void *dst );
+
+#endif // __SSE2__
+
+#if defined(__AVX2__)
+
+// BMW-512 4 way 64
+
 typedef struct {
   __m256i buf[16];
   __m256i H[16];
@@ -68,14 +125,6 @@ typedef struct {

 typedef bmw_4way_big_context bmw512_4way_context;

-void bmw256_4way_init(void *cc);
-
-void bmw256_4way(void *cc, const void *data, size_t len);
-
-void bmw256_4way_close(void *cc, void *dst);
-
-void bmw256_4way_addbits_and_close(
-	void *cc, unsigned ub, unsigned n, void *dst);

 void bmw512_4way_init(void *cc);

@@ -86,10 +135,10 @@ void bmw512_4way_close(void *cc, void *dst);
 void bmw512_4way_addbits_and_close(
 	void *cc, unsigned ub, unsigned n, void *dst);

-#endif
+#endif  // __AVX2__

 #ifdef __cplusplus
 }
 #endif

-#endif
+#endif // BMW_HASH_H__
--- a/algo/bmw/bmw256-hash-4way.c
+++ b/algo/bmw/bmw256-hash-4way.c
--- a/algo/bmw/bmw512-hash-4way.c
+++ b/algo/bmw/bmw512-hash-4way.c
--- a/algo/cryptonight/cryptolight.c
+++ b/algo/cryptonight/cryptolight.c
@@ -325,7 +325,7 @@ int scanhash_cryptolight(int thr_id, struct work *work,

 	struct cryptonight_ctx *ctx = (struct cryptonight_ctx*)malloc(sizeof(struct cryptonight_ctx));

-#ifndef NO_AES_NI
+#if defined(__AES__)
 		do {
 			*nonceptr = ++n;
 			cryptolight_hash_ctx_aes_ni(hash, pdata, 76, ctx);
--- a/algo/cryptonight/cryptonight-aesni.c
+++ b/algo/cryptonight/cryptonight-aesni.c
@@ -1,14 +1,11 @@
+#if defined(__AES__)
+
 #include <x86intrin.h>
 #include <memory.h>
 #include "cryptonight.h"
 #include "miner.h"
 #include "crypto/c_keccak.h"
 #include <immintrin.h>
-//#include "avxdefs.h"
-
-void aesni_parallel_noxor(uint8_t *long_state, uint8_t *text, uint8_t *ExpandedKey);
-void aesni_parallel_xor(uint8_t *text, uint8_t *ExpandedKey, uint8_t *long_state);
-void that_fucking_loop(uint8_t a[16], uint8_t b[16], uint8_t *long_state);

 static inline void ExpandAESKey256_sub1(__m128i *tmp1, __m128i *tmp2)
 {
@@ -25,7 +22,6 @@ static inline void ExpandAESKey256_sub1(__m128i *tmp1, __m128i *tmp2)

 static inline void ExpandAESKey256_sub2(__m128i *tmp1, __m128i *tmp3)
 {
-#ifndef NO_AES_NI
 	__m128i tmp2, tmp4;
 	
 	tmp4 = _mm_aeskeygenassist_si128(*tmp1, 0x00);
@@ -37,14 +33,12 @@ static inline void ExpandAESKey256_sub2(__m128i *tmp1, __m128i *tmp3)
 	tmp4 = _mm_slli_si128(tmp4, 0x04);
 	*tmp3 = _mm_xor_si128(*tmp3, tmp4);
 	*tmp3 = _mm_xor_si128(*tmp3, tmp2);
-#endif
 }

 // Special thanks to Intel for helping me
 // with ExpandAESKey256() and its subroutines
 static inline void ExpandAESKey256(char *keybuf)
 {
-#ifndef NO_AES_NI
 	__m128i tmp1, tmp2, tmp3, *keys;
 	
 	keys = (__m128i *)keybuf;
@@ -91,7 +85,6 @@ static inline void ExpandAESKey256(char *keybuf)
 	tmp2 = _mm_aeskeygenassist_si128(tmp3, 0x40);
 	ExpandAESKey256_sub1(&tmp1, &tmp2);
 	keys[14] = tmp1;
-#endif
 }

 // align to 64 byte cache line
@@ -109,13 +102,19 @@ static __thread cryptonight_ctx ctx;

 void cryptonight_hash_aes( void *restrict output, const void *input, int len )
 {
-#ifndef NO_AES_NI
-
    uint8_t ExpandedKey[256] __attribute__((aligned(64)));
    __m128i *longoutput, *expkey, *xmminput;
    size_t i, j;
    
    keccak( (const uint8_t*)input, 76, (char*)&ctx.state.hs.b, 200 );
+
+    if ( cryptonightV7 && len < 43 )
+      return;
+
+    const uint64_t tweak = cryptonightV7 
+                         ? *((const uint64_t*) (((const uint8_t*)input) + 35))
+                           ^ ctx.state.hs.w[24] : 0; 
+
    memcpy( ExpandedKey, ctx.state.hs.b, AES_KEY_SIZE );
    ExpandAESKey256( ExpandedKey );
    memcpy( ctx.text, ctx.state.init, INIT_SIZE_BYTE );
@@ -214,7 +213,15 @@ void cryptonight_hash_aes( void *restrict output, const void *input, int len )
 	_mm_store_si128( (__m128i*)c, c_x );
        b_x = _mm_xor_si128( b_x, c_x );
        nextblock = (uint64_t *)&ctx.long_state[c[0] & 0x1FFFF0];
-	_mm_store_si128( lsa, b_x );
+        _mm_store_si128( lsa, b_x );
+
+        if ( cryptonightV7 )
+        {
+           const uint8_t tmp = ( (const uint8_t*)(lsa) )[11];
+           const uint8_t index = ( ( (tmp >> 3) & 6 ) | (tmp & 1) ) << 1;
+           ((uint8_t*)(lsa))[11] = tmp ^ ( ( 0x75310 >> index) & 0x30 );
+        } 
+
 	b[0] = nextblock[0];
 	b[1] = nextblock[1];

@@ -227,10 +234,14 @@ void cryptonight_hash_aes( void *restrict output, const void *input, int len )
 		 : "cc" );

        b_x = c_x;
-        nextblock[0] = a[0] + hi;
-        nextblock[1] = a[1] + lo;
-        a[0] = b[0] ^ nextblock[0];
-        a[1] = b[1] ^ nextblock[1];
+
+        a[0] += hi;
+        a[1] += lo;
+        nextblock[0] = a[0];
+        nextblock[1] = cryptonightV7 ? a[1] ^ tweak : a[1];
+        a[0] ^= b[0];
+        a[1] ^= b[1];
+
        lsa = (__m128i*)&ctx.long_state[ a[0] & 0x1FFFF0 ];
        a_x = _mm_load_si128( (__m128i*)a );
        c_x = _mm_load_si128( lsa );
@@ -241,6 +252,14 @@ void cryptonight_hash_aes( void *restrict output, const void *input, int len )
    b_x = _mm_xor_si128( b_x, c_x );
    nextblock = (uint64_t *)&ctx.long_state[c[0] & 0x1FFFF0];
    _mm_store_si128( lsa, b_x );
+
+    if ( cryptonightV7 )
+    {
+       const uint8_t tmp = ( (const uint8_t*)(lsa) )[11];
+       const uint8_t index = ( ( (tmp >> 3) & 6 ) | (tmp & 1) ) << 1;
+       ((uint8_t*)(lsa))[11] = tmp ^ ( ( 0x75310 >> index) & 0x30 );
+    }
+
    b[0] = nextblock[0];
    b[1] = nextblock[1];

@@ -251,8 +270,12 @@ void cryptonight_hash_aes( void *restrict output, const void *input, int len )
               "rm" ( b[0] )
             : "cc" );

-    nextblock[0] = a[0] + hi;
-    nextblock[1] = a[1] + lo;
+    a[0] += hi;
+    a[1] += lo;
+    nextblock[0] = a[0];
+    nextblock[1] = cryptonightV7 ? a[1] ^ tweak : a[1];
+    a[0] ^= b[0];
+    a[1] ^= b[1];

    memcpy( ExpandedKey, &ctx.state.hs.b[32], AES_KEY_SIZE );
    ExpandAESKey256( ExpandedKey );
@@ -330,5 +353,5 @@ void cryptonight_hash_aes( void *restrict output, const void *input, int len )
    keccakf( (uint64_t*)&ctx.state.hs.w, 24 );
    extra_hashes[ctx.state.hs.b[0] & 3](&ctx.state, 200, output);

-#endif
 }
+#endif
--- a/algo/cryptonight/cryptonight-common.c
+++ b/algo/cryptonight/cryptonight-common.c
@@ -7,11 +7,11 @@
 #include "cpuminer-config.h"
 #include "algo-gate-api.h"

-#ifndef NO_AES_NI
+#if defined(__AES__)
  #include "algo/groestl/aes_ni/hash-groestl256.h"
-#endif
-
+#else
 #include "crypto/c_groestl.h"
+#endif
 #include "crypto/c_blake256.h"
 #include "crypto/c_jh.h"
 #include "crypto/c_skein.h"
@@ -30,12 +30,12 @@ void do_blake_hash(const void* input, size_t len, char* output) {
 }

 void do_groestl_hash(const void* input, size_t len, char* output) {
-#ifdef NO_AES_NI
-    groestl(input, len * 8, (uint8_t*)output);
-#else
+#if defined(__AES__)
    hashState_groestl256 ctx;
    init_groestl256( &ctx, 32 );
    update_and_final_groestl256( &ctx, output, input, len * 8 );
+#else
+    groestl(input, len * 8, (uint8_t*)output);
 #endif
 }

@@ -52,23 +52,24 @@ void (* const extra_hashes[4])( const void *, size_t, char *) =

 void cryptonight_hash( void *restrict output, const void *input, int len )
 {
-
-#ifdef NO_AES_NI
-  cryptonight_hash_ctx ( output, input, len );
-#else 
+#if defined(__AES__)
  cryptonight_hash_aes( output, input, len );
+#else
+  cryptonight_hash_ctx ( output, input, len );
 #endif
 }

 void cryptonight_hash_suw( void *restrict output, const void *input )
 {
-#ifdef NO_AES_NI
-  cryptonight_hash_ctx ( output, input, 76 );
-#else
+#if defined(__AES__)
  cryptonight_hash_aes( output, input, 76 );
+#else
+  cryptonight_hash_ctx ( output, input, 76 );
 #endif
 }

+bool cryptonightV7 = false;
+
 int scanhash_cryptonight( int thr_id, struct work *work, uint32_t max_nonce,
                   uint64_t *hashes_done )
 {
@@ -80,6 +81,11 @@ int scanhash_cryptonight( int thr_id, struct work *work, uint32_t max_nonce,
    const uint32_t first_nonce = n + 1;
    const uint32_t Htarg = ptarget[7];
    uint32_t hash[32 / 4] __attribute__((aligned(32)));
+
+//    if (  (  cryptonightV7 && ( *(uint8_t*)pdata <  7 ) )
+//       || ( !cryptonightV7 && ( *(uint8_t*)pdata == 7 ) ) )
+//          applog(LOG_WARNING,"Cryptonight variant mismatch, shares may be rejected.");
+
    do
    {
       *nonceptr = ++n;
@@ -87,6 +93,7 @@ int scanhash_cryptonight( int thr_id, struct work *work, uint32_t max_nonce,
       if (unlikely( hash[7] < Htarg ))
       {
           *hashes_done = n - first_nonce + 1;
+//           work_set_target_ratio( work, hash );
 	   return true;
       }
    } while (likely((n <= max_nonce && !work_restart[thr_id].restart)));
@@ -97,6 +104,7 @@ int scanhash_cryptonight( int thr_id, struct work *work, uint32_t max_nonce,

 bool register_cryptonight_algo( algo_gate_t* gate )
 {
+  cryptonightV7 = false;
  register_json_rpc2( gate );
  gate->optimizations = SSE2_OPT | AES_OPT;
  gate->scanhash         = (void*)&scanhash_cryptonight;
@@ -106,3 +114,15 @@ bool register_cryptonight_algo( algo_gate_t* gate )
  return true;
 };

+bool register_cryptonightv7_algo( algo_gate_t* gate )
+{
+  cryptonightV7 = true;
+  register_json_rpc2( gate );
+  gate->optimizations = SSE2_OPT | AES_OPT;
+  gate->scanhash      = (void*)&scanhash_cryptonight;
+  gate->hash          = (void*)&cryptonight_hash;
+  gate->hash_suw      = (void*)&cryptonight_hash_suw;
+  gate->get_max64     = (void*)&get_max64_0x40LL;
+  return true;
+};
+
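
Reviewer note, not part of the patch: these hunks switch from the project-defined `NO_AES_NI` macro to the compiler-defined `__AES__` macro. GCC and Clang predefine `__AES__` when AES instructions are enabled, for example with `-maes` or with `-march=native` on a capable CPU. The sketch below shows the resulting compile-time selection; `hash_backend_name` is a hypothetical helper, not a cpuminer-opt function.

```c
#include <string.h>

/* Report which code path the preprocessor selected. The choice is
 * made at compile time from the compiler's own feature macro, so no
 * project-specific define has to be kept in sync with the build
 * flags. */
static const char *hash_backend_name( void )
{
#if defined(__AES__)
   return "aes-ni";      /* hardware AES path would be compiled in */
#else
   return "software";    /* portable fallback path */
#endif
}
```

With this scheme a binary built without AES support falls back to the software path automatically, mirroring the `#if defined(__AES__)` / `#else` pairs in `cryptonight_hash` and `do_groestl_hash`.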
--- a/algo/cryptonight/cryptonight.c
+++ b/algo/cryptonight/cryptonight.c
@@ -20,8 +20,8 @@
 #include "crypto/c_jh.h"
 #include "crypto/c_skein.h"
 #include "crypto/int-util.h"
-#include "crypto/hash-ops.h"
-//#include "cryptonight.h"
+//#include "crypto/hash-ops.h"
+#include "cryptonight.h"

 #if USE_INT128

@@ -51,6 +51,7 @@ typedef __uint128_t uint128_t;
 #define INIT_SIZE_BLK   8
 #define INIT_SIZE_BYTE (INIT_SIZE_BLK * AES_BLOCK_SIZE)

+/*
 #pragma pack(push, 1)
 union cn_slow_hash_state {
 	union hash_state hs;
@@ -78,6 +79,7 @@ static void do_skein_hash(const void* input, size_t len, char* output) {
 	int r = skein_hash(8 * HASH_SIZE, input, 8 * len, (uint8_t*)output);
 	assert(likely(SKEIN_SUCCESS == r));
 }
+*/

 extern int aesb_single_round(const uint8_t *in, uint8_t*out, const uint8_t *expandedKey);
 extern int aesb_pseudo_round_mut(uint8_t *val, uint8_t *expandedKey);
@@ -120,9 +122,11 @@ static uint64_t mul128(uint64_t multiplier, uint64_t multiplicand, uint64_t* pro
 extern uint64_t mul128(uint64_t multiplier, uint64_t multiplicand, uint64_t* product_hi);
 #endif

+/*
 static void (* const extra_hashes[4])(const void *, size_t, char *) = {
 		do_blake_hash, do_groestl_hash, do_jh_hash, do_skein_hash
 };
+*/

 static inline size_t e2i(const uint8_t* a) {
 #if !LITE
@@ -132,14 +136,16 @@ static inline size_t e2i(const uint8_t* a) {
 #endif
 }

-static inline void mul_sum_xor_dst(const uint8_t* a, uint8_t* c, uint8_t* dst) {
+static inline void mul_sum_xor_dst( const uint8_t* a, uint8_t* c, uint8_t* dst, 
+         const uint64_t tweak )
+{
 	uint64_t hi, lo = mul128(((uint64_t*) a)[0], ((uint64_t*) dst)[0], &hi) + ((uint64_t*) c)[1];
 	hi += ((uint64_t*) c)[0];

 	((uint64_t*) c)[0] = ((uint64_t*) dst)[0] ^ hi;
 	((uint64_t*) c)[1] = ((uint64_t*) dst)[1] ^ lo;
 	((uint64_t*) dst)[0] = hi;
-	((uint64_t*) dst)[1] = lo;
+	((uint64_t*) dst)[1] = cryptonightV7 ? lo ^ tweak : lo;
 }

 static inline void xor_blocks(uint8_t* a, const uint8_t* b) {
@@ -174,8 +180,16 @@ static __thread cryptonight_ctx ctx;

 void cryptonight_hash_ctx(void* output, const void* input, int len)
 {
-	hash_process(&ctx.state.hs, (const uint8_t*) input, len);
-	ctx.aes_ctx = (oaes_ctx*) oaes_alloc();
+//    hash_process(&ctx.state.hs, (const uint8_t*) input, len);
+    keccak( (const uint8_t*)input, 76, (char*)&ctx.state.hs.b, 200 );
+
+    if ( cryptonightV7 && len < 43 )
+      return;
+    const uint64_t tweak = cryptonightV7
+                         ? *((const uint64_t*) (((const uint8_t*)input) + 35))
+                           ^ ctx.state.hs.w[24] : 0;
+
+    ctx.aes_ctx = (oaes_ctx*) oaes_alloc();

    __builtin_prefetch( ctx.text,             0, 3 );
    __builtin_prefetch( ctx.text       +  64, 0, 3 );
@@ -211,23 +225,44 @@ void cryptonight_hash_ctx(void* output, const void* input, int len)
 	xor_blocks_dst(&ctx.state.k[0], &ctx.state.k[32], ctx.a);
 	xor_blocks_dst(&ctx.state.k[16], &ctx.state.k[48], ctx.b);

-	for (i = 0; likely(i < ITER / 4); ++i) {
-		/* Dependency chain: address -> read value ------+
-		 * written value <-+ hard function (AES or MUL) <+
-		 * next address  <-+
-		 */
-		/* Iteration 1 */
-		j = e2i(ctx.a);
-		aesb_single_round(&ctx.long_state[j], ctx.c, ctx.a);
-		xor_blocks_dst(ctx.c, ctx.b, &ctx.long_state[j]);
-		/* Iteration 2 */
-		mul_sum_xor_dst(ctx.c, ctx.a, &ctx.long_state[e2i(ctx.c)]);
-		/* Iteration 3 */
-		j = e2i(ctx.a);
-		aesb_single_round(&ctx.long_state[j], ctx.b, ctx.a);
-		xor_blocks_dst(ctx.b, ctx.c, &ctx.long_state[j]);
-		/* Iteration 4 */
-		mul_sum_xor_dst(ctx.b, ctx.a, &ctx.long_state[e2i(ctx.b)]);
+	for (i = 0; likely(i < ITER / 4); ++i)
+        {
+           /* Dependency chain: address -> read value ------+
+            * written value <-+ hard function (AES or MUL) <+
+            * next address  <-+
+            */
+           /* Iteration 1 */
+           j = e2i(ctx.a);
+           aesb_single_round(&ctx.long_state[j], ctx.c, ctx.a);
+           xor_blocks_dst(ctx.c, ctx.b, &ctx.long_state[j]);
+
+           if ( cryptonightV7 )
+           {
+              uint8_t *lsa = (uint8_t*)&ctx.long_state[((uint64_t *)(ctx.a))[0] & 0x1FFFF0];
+              const uint8_t tmp = lsa[11];
+              const uint8_t index = ( ( (tmp >> 3) & 6 ) | (tmp & 1) ) << 1;
+              lsa[11] = tmp ^ ( ( 0x75310 >> index) & 0x30 );
+           }
+
+           /* Iteration 2 */
+           mul_sum_xor_dst(ctx.c, ctx.a, &ctx.long_state[e2i(ctx.c)], tweak );
+
+           /* Iteration 3 */
+           j = e2i(ctx.a);
+           aesb_single_round(&ctx.long_state[j], ctx.b, ctx.a);
+           xor_blocks_dst(ctx.b, ctx.c, &ctx.long_state[j]);
+
+           if ( cryptonightV7 )
+           {
+              uint8_t *lsa = (uint8_t*)&ctx.long_state[((uint64_t *)(ctx.a))[0] & 0x1FFFF0];
+              const uint8_t tmp = lsa[11];
+              const uint8_t index = ( ( (tmp >> 3) & 6 ) | (tmp & 1) ) << 1;
+              lsa[11] = tmp ^ ( ( 0x75310 >> index) & 0x30 );
+           }
+
+           /* Iteration 4 */
+           mul_sum_xor_dst(ctx.b, ctx.a, &ctx.long_state[e2i(ctx.b)], tweak );
+
 	}

    __builtin_prefetch( ctx.text,             0, 3 );
@@ -266,7 +301,8 @@ void cryptonight_hash_ctx(void* output, const void* input, int len)
 		aesb_pseudo_round_mut(&ctx.text[7 * AES_BLOCK_SIZE], ctx.aes_ctx->key->exp_data);
 	}
 	memcpy(ctx.state.init, ctx.text, INIT_SIZE_BYTE);
-	hash_permutation(&ctx.state.hs);
+//	hash_permutation(&ctx.state.hs);
+        keccakf( (uint64_t*)&ctx.state.hs.w, 24 );
 	/*memcpy(hash, &state, 32);*/
 	extra_hashes[ctx.state.hs.b[0] & 3](&ctx.state, 200, output);
 	oaes_free((OAES_CTX **) &ctx.aes_ctx);
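
Reviewer note, not part of the patch: the `cryptonightV7` branches added here and in `cryptonight-aesni.c` all apply the same transformation to byte 11 of the 16-byte scratchpad block that was just written. The sketch below isolates that expression in two hypothetical helpers, `cn_v7_tweak_byte` and `cn_v7_tweak_block`, so the behaviour can be inspected on its own.

```c
#include <stdint.h>

/* Byte tweak as written in the patch: bits 0, 4 and 5 of the input
 * select a 2-bit field inside the constant 0x75310, and that field is
 * XORed into bits 4..5 of the byte. No other bits are modified. */
static inline uint8_t cn_v7_tweak_byte( uint8_t tmp )
{
   const uint8_t index = ( ( (tmp >> 3) & 6 ) | (tmp & 1) ) << 1;
   return tmp ^ ( ( 0x75310 >> index ) & 0x30 );
}

/* Apply the tweak in place to byte 11 of a 16-byte block. */
static inline void cn_v7_tweak_block( uint8_t block[16] )
{
   block[11] = cn_v7_tweak_byte( block[11] );
}
```

Factoring the expression out this way would also remove the four near-identical copies in the patch, at the cost of relying on the compiler to inline the helper inside the hot loop.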
--- a/algo/cryptonight/cryptonight.h
+++ b/algo/cryptonight/cryptonight.h
@@ -45,5 +45,7 @@ int scanhash_cryptonight( int thr_id, struct work *work, uint32_t max_nonce,

 void cryptonight_hash_aes( void *restrict output, const void *input, int len );

+extern bool cryptonightV7;
+
 #endif

--- a/algo/cubehash/cube-hash-2way.c
+++ b/algo/cubehash/cube-hash-2way.c
@@ -7,6 +7,24 @@

 // 2x128

+// Precomputed initial state: the result of 10 transform passes over the
+// zero padded parameter block (hashbitlen/8, blockbytes, rounds).
+static const uint64_t IV256[] =
+{
+0xCCD6F29FEA2BD4B4, 0x35481EAE63117E71, 0xE5D94E6322512D5B, 0xF4CC12BE7E624131,
+0x42AF2070C2D0B696, 0x3361DA8CD0720C35, 0x8EF8AD8328CCECA4, 0x40E5FBAB4680AC00,
+0x6107FBD5D89041C3, 0xF0B266796C859D41, 0x5FA2560309392549, 0x93CB628565C892FD,
+0x9E4B4E602AF2B5AE, 0x85254725774ABFDD, 0x4AB6AAD615815AEB, 0xD6032C0A9CDAF8AF
+};
+
+static const uint64_t IV512[] =
+{
+0x50F494D42AEA2A61, 0x4167D83E2D538B8B, 0xC701CF8C3FEE2313, 0x50AC5695CC39968E,
+0xA647A8B34D42C787, 0x825B453797CF0BEF, 0xF22090C4EEF864D2, 0xA23911AED0E5CD33,
+0x148FE485FCD398D9, 0xB64445321B017BEF, 0x2FF5781C6A536159, 0x0DBADEA991FA7934,
+0xA5A70E75D65C8A2B, 0xBC796576B1C62456, 0xE7989AF11921C8F7, 0xD43E3B447795D246
+};
+
 static void transform_2way( cube_2way_context *sp )
 {
    int r;
@@ -45,10 +63,10 @@ static void transform_2way( cube_2way_context *sp )
        x1 = _mm256_xor_si256( x1, x5 );
        x2 = _mm256_xor_si256( x2, x6 );
        x3 = _mm256_xor_si256( x3, x7 );
-        x4 = mm256_swap128_64( x4 );
-        x5 = mm256_swap128_64( x5 );
-        x6 = mm256_swap128_64( x6 );
-        x7 = mm256_swap128_64( x7 );
+        x4 = mm256_swap64_128( x4 );
+        x5 = mm256_swap64_128( x5 );
+        x6 = mm256_swap64_128( x6 );
+        x7 = mm256_swap64_128( x7 );
        x4 = _mm256_add_epi32( x0, x4 );
        x5 = _mm256_add_epi32( x1, x5 );
        x6 = _mm256_add_epi32( x2, x6 );
@@ -69,10 +87,10 @@ static void transform_2way( cube_2way_context *sp )
        x1 = _mm256_xor_si256( x1, x5 );
        x2 = _mm256_xor_si256( x2, x6 );
        x3 = _mm256_xor_si256( x3, x7 );
-        x4 = mm256_swap64_32( x4 );
-        x5 = mm256_swap64_32( x5 );
-        x6 = mm256_swap64_32( x6 );
-        x7 = mm256_swap64_32( x7 );
+        x4 = mm256_swap32_64( x4 );
+        x5 = mm256_swap32_64( x5 );
+        x6 = mm256_swap32_64( x6 );
+        x7 = mm256_swap32_64( x7 );
    }

    _mm256_store_si256( (__m256i*)sp->h,     x0 );
@@ -86,44 +104,33 @@ static void transform_2way( cube_2way_context *sp )

 }

-cube_2way_context cube_2way_ctx_cache __attribute__ ((aligned (64)));
-
-int cube_2way_reinit( cube_2way_context *sp )
-{
-   memcpy( sp, &cube_2way_ctx_cache, sizeof(cube_2way_context) );
-   return 0;
-
-}
-
 int cube_2way_init( cube_2way_context *sp, int hashbitlen, int rounds,
-                       int blockbytes )
+                    int blockbytes )
 {
-    int i;
+    const uint64_t* iv = hashbitlen == 512 ? IV512 : IV256;
+    sp->hashlen   = hashbitlen/128;
+    sp->blocksize = blockbytes/16;
+    sp->rounds    = rounds;
+    sp->pos       = 0;

-    // all sizes of __m128i
-    cube_2way_ctx_cache.hashlen   = hashbitlen/128;
-    cube_2way_ctx_cache.blocksize = blockbytes/16;
-    cube_2way_ctx_cache.rounds    = rounds;
-    cube_2way_ctx_cache.pos       = 0;
+    __m256i* h = (__m256i*)sp->h;

-    for ( i = 0; i < 8; ++i )
-       cube_2way_ctx_cache.h[i] = m256_zero;
+    h[0] = _mm256_set_epi64x( iv[ 1], iv[ 0], iv[ 1], iv[ 0] );
+    h[1] = _mm256_set_epi64x( iv[ 3], iv[ 2], iv[ 3], iv[ 2] );
+    h[2] = _mm256_set_epi64x( iv[ 5], iv[ 4], iv[ 5], iv[ 4] );
+    h[3] = _mm256_set_epi64x( iv[ 7], iv[ 6], iv[ 7], iv[ 6] );
+    h[4] = _mm256_set_epi64x( iv[ 9], iv[ 8], iv[ 9], iv[ 8] );
+    h[5] = _mm256_set_epi64x( iv[11], iv[10], iv[11], iv[10] );
+    h[6] = _mm256_set_epi64x( iv[13], iv[12], iv[13], iv[12] );
+    h[7] = _mm256_set_epi64x( iv[15], iv[14], iv[15], iv[14] );

-    cube_2way_ctx_cache.h[0] = _mm256_set_epi32(
-                                   0, rounds, blockbytes, hashbitlen / 8,
-                                   0, rounds, blockbytes, hashbitlen / 8 );
-
-    for ( i = 0; i < 10; ++i )
-       transform_2way( &cube_2way_ctx_cache );
-
-    memcpy( sp, &cube_2way_ctx_cache, sizeof(cube_2way_context) );
    return 0;
 }


 int cube_2way_update( cube_2way_context *sp, const void *data, size_t size )
 {
-    const int len = size / 16;
+    const int len = size >> 4;
    const __m256i *in = (__m256i*)data;
    int i;

@@ -140,7 +147,6 @@ int cube_2way_update( cube_2way_context *sp, const void *data, size_t size )
           sp->pos = 0;
        }
    }
-
    return 0;
 }

@@ -151,25 +157,22 @@ int cube_2way_close( cube_2way_context *sp, void *output )

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                    _mm256_set_epi8( 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0x80,
-                                     0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0x80 ) );
+                                _mm256_set_epi32( 0,0,0,0x80,  0,0,0,0x80 ) );
    transform_2way( sp );

-    sp->h[7] = _mm256_xor_si256( sp->h[7], _mm256_set_epi32( 1,0,0,0,
-                                                             1,0,0,0 ) );
-    for ( i = 0; i < 10; ++i )
-       transform_2way( &cube_2way_ctx_cache );
+    sp->h[7] = _mm256_xor_si256( sp->h[7],
+                                 _mm256_set_epi32( 1,0,0,0,  1,0,0,0 ) );

-    for ( i = 0; i < sp->hashlen; i++ )
-       hash[i] = sp->h[i];
+    for ( i = 0; i < 10; ++i )           transform_2way( sp );

+    for ( i = 0; i < sp->hashlen; i++ )  hash[i] = sp->h[i];
    return 0;
 }

 int cube_2way_update_close( cube_2way_context *sp, void *output,
                               const void *data, size_t size )
 {
-    const int len = size / 16;
+    const int len = size >> 4;
    const __m256i *in = (__m256i*)data;
    __m256i *hash = (__m256i*)output;
    int i;
@@ -187,18 +190,15 @@ int cube_2way_update_close( cube_2way_context *sp, void *output,

    // pos is zero for 64 byte data, 1 for 80 byte data.
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                    _mm256_set_epi8( 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0x80,
-                                     0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0x80 ) );
+                    _mm256_set_epi32( 0,0,0,0x80,  0,0,0,0x80 ) );
    transform_2way( sp );

    sp->h[7] = _mm256_xor_si256( sp->h[7], _mm256_set_epi32( 1,0,0,0,
                                                             1,0,0,0 ) );
-    for ( i = 0; i < 10; ++i )
-       transform_2way( &cube_2way_ctx_cache );

-    for ( i = 0; i < sp->hashlen; i++ )
-       hash[i] = sp->h[i];
+    for ( i = 0; i < 10; ++i )            transform_2way( sp );

+    for ( i = 0; i < sp->hashlen; i++ )   hash[i] = sp->h[i];
    return 0;
 }
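
Reviewer note, not part of the patch: the `IV256` and `IV512` tables replace the old init code, which seeded the state with `hashbitlen/8`, `blockbytes` and `rounds` and then ran the transform 10 times. The scalar sketch below follows that removed logic using the published CubeHash round structure. The function names are hypothetical. It assumes the miner's usual parameters of 16 rounds and 32-byte blocks, and that each 64-bit table entry holds two consecutive 32-bit state words, low word first.

```c
#include <stdint.h>

static inline uint32_t rotl32( uint32_t v, unsigned n )
{
   return ( v << n ) | ( v >> ( 32 - n ) );
}

/* Run `rounds` CubeHash rounds over the 32-word state (scalar form). */
static void cube_rounds_scalar( uint32_t x[32], int rounds )
{
   uint32_t y[16];
   int i, r;
   for ( r = 0; r < rounds; ++r )
   {
      for ( i = 0; i < 16; ++i ) x[i + 16] += x[i];
      for ( i = 0; i < 16; ++i ) y[i ^ 8] = x[i];           /* swap halves */
      for ( i = 0; i < 16; ++i ) x[i] = rotl32( y[i], 7 );
      for ( i = 0; i < 16; ++i ) x[i] ^= x[i + 16];
      for ( i = 0; i < 16; ++i ) y[i ^ 2] = x[i + 16];
      for ( i = 0; i < 16; ++i ) x[i + 16] = y[i];
      for ( i = 0; i < 16; ++i ) x[i + 16] += x[i];
      for ( i = 0; i < 16; ++i ) y[i ^ 4] = x[i];
      for ( i = 0; i < 16; ++i ) x[i] = rotl32( y[i], 11 );
      for ( i = 0; i < 16; ++i ) x[i] ^= x[i + 16];
      for ( i = 0; i < 16; ++i ) y[i ^ 1] = x[i + 16];
      for ( i = 0; i < 16; ++i ) x[i + 16] = y[i];
   }
}

/* Derive the initial state the way the removed init code did:
 * parameters in the first three words, zeros elsewhere, then 10
 * transform passes of `rounds` rounds each. */
static void cube_iv_scalar( uint32_t x[32], int hashbitlen, int rounds,
                            int blockbytes )
{
   int i;
   for ( i = 0; i < 32; ++i ) x[i] = 0;
   x[0] = (uint32_t)( hashbitlen / 8 );
   x[1] = (uint32_t)blockbytes;
   x[2] = (uint32_t)rounds;
   for ( i = 0; i < 10; ++i ) cube_rounds_scalar( x, rounds );
}
```

If those assumptions hold, `cube_iv_scalar( x, 512, 16, 32 )` should reproduce the `IV512` table, with `x[0]` and `x[1]` forming the low and high halves of `IV512[0]`. One consequence of the table approach is that `cube_2way_init` only supports parameter sets that have a table: any `hashbitlen` other than 512 selects `IV256`.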

--- a/algo/cubehash/cube-hash-2way.h
+++ b/algo/cubehash/cube-hash-2way.h
@@ -4,18 +4,18 @@
 #if defined(__AVX2__)

 #include <stdint.h>
-#include "avxdefs.h"
+#include "simd-utils.h"

 // 2x128, 2 way parallel SSE2

 struct _cube_2way_context
 {
+    __m256i h[8];
    int hashlen;           // __m128i
    int rounds;
    int blocksize;         // __m128i
    int pos;               // number of __m128i read into x from current block
-    __m256i h[8] __attribute__ ((aligned (64)));
-};
+} __attribute__ ((aligned (64)));

 typedef struct _cube_2way_context cube_2way_context;

--- a/algo/cubehash/sse2/cubehash_sse2.c
+++ b/algo/cubehash/sse2/cubehash_sse2.c
@@ -13,7 +13,26 @@
 #include <stdbool.h>
 #include <unistd.h>
 #include <memory.h>
-#include "avxdefs.h"
+#include "simd-utils.h"
+#include <stdio.h>
+
+// Precomputed initial state: the result of 10 transform passes over the
+// zero padded parameter block (hashbitlen/8, blockbytes, rounds).
+static const uint64_t IV256[] =
+{
+0xCCD6F29FEA2BD4B4, 0x35481EAE63117E71, 0xE5D94E6322512D5B, 0xF4CC12BE7E624131,
+0x42AF2070C2D0B696, 0x3361DA8CD0720C35, 0x8EF8AD8328CCECA4, 0x40E5FBAB4680AC00,
+0x6107FBD5D89041C3, 0xF0B266796C859D41, 0x5FA2560309392549, 0x93CB628565C892FD,
+0x9E4B4E602AF2B5AE, 0x85254725774ABFDD, 0x4AB6AAD615815AEB, 0xD6032C0A9CDAF8AF
+};
+
+static const uint64_t IV512[] =
+{
+0x50F494D42AEA2A61, 0x4167D83E2D538B8B, 0xC701CF8C3FEE2313, 0x50AC5695CC39968E,
+0xA647A8B34D42C787, 0x825B453797CF0BEF, 0xF22090C4EEF864D2, 0xA23911AED0E5CD33,
+0x148FE485FCD398D9, 0xB64445321B017BEF, 0x2FF5781C6A536159, 0x0DBADEA991FA7934,
+0xA5A70E75D65C8A2B, 0xBC796576B1C62456, 0xE7989AF11921C8F7, 0xD43E3B447795D246
+};

 static void transform( cubehashParam *sp )
 {
@@ -22,7 +41,7 @@ static void transform( cubehashParam *sp )

 #ifdef __AVX2__

-    __m256i x0, x1, x2, x3, y0, y1;
+    register __m256i x0, x1, x2, x3, y0, y1;

    x0 = _mm256_load_si256( (__m256i*)sp->x     );
    x1 = _mm256_load_si256( (__m256i*)sp->x + 1 );   
@@ -33,20 +52,19 @@ static void transform( cubehashParam *sp )
    { 
        x2 = _mm256_add_epi32( x0, x2 );
        x3 = _mm256_add_epi32( x1, x3 );
-        y0 = x1;
-        y1 = x0;
-        x0 = _mm256_xor_si256( _mm256_slli_epi32( y0, 7 ),
+        y0 = x0;
+        x0 = _mm256_xor_si256( _mm256_slli_epi32( x1, 7 ),
+                               _mm256_srli_epi32( x1, 25 ) );
+        x1 = _mm256_xor_si256( _mm256_slli_epi32( y0, 7 ),
                               _mm256_srli_epi32( y0, 25 ) );
-        x1 = _mm256_xor_si256( _mm256_slli_epi32( y1, 7 ),
-                               _mm256_srli_epi32( y1, 25 ) );
        x0 = _mm256_xor_si256( x0, x2 );
        x1 = _mm256_xor_si256( x1, x3 );
        x2 = _mm256_shuffle_epi32( x2, 0x4e );
        x3 = _mm256_shuffle_epi32( x3, 0x4e );
        x2 = _mm256_add_epi32( x0, x2 );
        x3 = _mm256_add_epi32( x1, x3 );
-        y0 = _mm256_permute2f128_si256( x0, x0, 1 );
-        y1 = _mm256_permute2f128_si256( x1, x1, 1 );
+        y0 = _mm256_permute4x64_epi64( x0, 0x4e );
+        y1 = _mm256_permute4x64_epi64( x1, 0x4e );
        x0 = _mm256_xor_si256( _mm256_slli_epi32( y0, 11 ),
                               _mm256_srli_epi32( y0, 21 ) );
        x1 = _mm256_xor_si256( _mm256_slli_epi32( y1, 11 ), 
@@ -129,48 +147,37 @@ static void transform( cubehashParam *sp )
 #endif
 }  // transform

-// Cubehash context initializing is very expensive.
-// Cache the intial value for faster reinitializing.
-cubehashParam cube_ctx_cache __attribute__ ((aligned (64)));
-
-int cubehashReinit( cubehashParam *sp )
-{
-   memcpy( sp, &cube_ctx_cache, sizeof(cubehashParam) );
-   return SUCCESS;
-
-}
-
-// Initialize the cache then copy to sp.
 int cubehashInit(cubehashParam *sp, int hashbitlen, int rounds, int blockbytes)
 {
-    int i;
+    const uint64_t* iv = hashbitlen == 512 ? IV512 : IV256;
+    sp->hashlen   = hashbitlen/128;
+    sp->blocksize = blockbytes/16;
+    sp->rounds    = rounds;
+    sp->pos       = 0;
+    
+#if defined(__AVX2__)

-    if ( hashbitlen < 8 ) return BAD_HASHBITLEN;
-    if ( hashbitlen > 512 ) return BAD_HASHBITLEN;
-    if ( hashbitlen != 8 * (hashbitlen / 8) ) return BAD_HASHBITLEN;
+    __m256i* x = (__m256i*)sp->x;

-    /* Sanity checks */
-    if ( rounds <= 0 || rounds > 32 )
-       rounds = CUBEHASH_ROUNDS;
-    if ( blockbytes <= 0 || blockbytes >= 256)
-       blockbytes = CUBEHASH_BLOCKBYTES;
+    x[0] = _mm256_set_epi64x( iv[ 3], iv[ 2], iv[ 1], iv[ 0] );
+    x[1] = _mm256_set_epi64x( iv[ 7], iv[ 6], iv[ 5], iv[ 4] );
+    x[2] = _mm256_set_epi64x( iv[11], iv[10], iv[ 9], iv[ 8] );
+    x[3] = _mm256_set_epi64x( iv[15], iv[14], iv[13], iv[12] );

-    // all sizes of __m128i
-    cube_ctx_cache.hashlen   = hashbitlen/128;
-    cube_ctx_cache.blocksize = blockbytes/16;
-    cube_ctx_cache.rounds    = rounds;
-    cube_ctx_cache.pos       = 0;
+#else

-    for ( i = 0; i < 8; ++i )
-       cube_ctx_cache.x[i] = _mm_setzero_si128();;
+    __m128i* x = (__m128i*)sp->x;

-    cube_ctx_cache.x[0] = _mm_set_epi32( 0, rounds, blockbytes,
-                                         hashbitlen / 8 );
+     x[0] = _mm_set_epi64x( iv[ 1], iv[ 0] );
+     x[1] = _mm_set_epi64x( iv[ 3], iv[ 2] );
+     x[2] = _mm_set_epi64x( iv[ 5], iv[ 4] );
+     x[3] = _mm_set_epi64x( iv[ 7], iv[ 6] );
+     x[4] = _mm_set_epi64x( iv[ 9], iv[ 8] );
+     x[5] = _mm_set_epi64x( iv[11], iv[10] );
+     x[6] = _mm_set_epi64x( iv[13], iv[12] );
+     x[7] = _mm_set_epi64x( iv[15], iv[14] );

-    for ( i = 0; i < 10; ++i )
-       transform( &cube_ctx_cache );
-
-    memcpy( sp, &cube_ctx_cache, sizeof(cubehashParam) );
+#endif
    return SUCCESS;
 }

@@ -255,6 +262,7 @@ int cubehashUpdateDigest( cubehashParam *sp, byte *digest,
    transform( sp );

    sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi32( 1,0,0,0 ) );
+
    transform( sp );
    transform( sp );
    transform( sp );
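Review note (not part of the patch): the cubehash_sse2.c change above replaces the old start-up work in cubehashInit (zero the state, set the three parameter words, run transform ten times) with a load from the precomputed IV256/IV512 tables. The scalar sketch below is an assumption-laden reconstruction of that old computation, written from the round structure visible in transform. The names rotl32, swap_pairs, cubehash_round, cubehash_compute_iv and pack_le64 are hypothetical and do not exist in the source tree. It assumes the tables were generated for rounds = 16 and blockbytes = 32; since the new cubehashInit picks the table by hashbitlen only, other parameter combinations would presumably no longer get a matching IV.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t v, int n)
{
    return (v << n) | (v >> (32 - n));
}

/* Swap x[i] with x[i + dist] for every i in [base, base + 16)
 * whose `dist` bit is clear. dist must be 1, 2, 4 or 8. */
static void swap_pairs(uint32_t *x, int base, int dist)
{
    for (int i = base; i < base + 16; i++) {
        if ((i & dist) == 0) {
            uint32_t t = x[i];
            x[i] = x[i + dist];
            x[i + dist] = t;
        }
    }
}

/* One scalar round over 32 words: x[0..15] is the half that gets
 * rotated, x[16..31] the half that gets added into. */
static void cubehash_round(uint32_t x[32])
{
    int i;
    for (i = 0; i < 16; i++) x[16 + i] += x[i];
    for (i = 0; i < 16; i++) x[i] = rotl32(x[i], 7);
    swap_pairs(x, 0, 8);
    for (i = 0; i < 16; i++) x[i] ^= x[16 + i];
    swap_pairs(x, 16, 2);
    for (i = 0; i < 16; i++) x[16 + i] += x[i];
    for (i = 0; i < 16; i++) x[i] = rotl32(x[i], 11);
    swap_pairs(x, 0, 4);
    for (i = 0; i < 16; i++) x[i] ^= x[16 + i];
    swap_pairs(x, 16, 1);
}

/* Recompute the initial state the way the removed code did:
 * parameter words first, then 10 * rounds rounds. */
static void cubehash_compute_iv(uint32_t x[32], int hashbitlen,
                                int rounds, int blockbytes)
{
    assert(rounds > 0);
    memset(x, 0, 32 * sizeof(uint32_t));
    x[0] = (uint32_t)(hashbitlen / 8);
    x[1] = (uint32_t)blockbytes;
    x[2] = (uint32_t)rounds;
    for (int i = 0; i < 10 * rounds; i++)
        cubehash_round(x);
}

/* Pack two consecutive 32-bit words the way the 64-bit tables store them. */
static uint64_t pack_le64(const uint32_t *x)
{
    return ((uint64_t)x[1] << 32) | x[0];
}
```

Calling cubehash_compute_iv(x, 512, 16, 32) and comparing pack_le64(x) with the first IV512 entry is one way to check that a table and a parameter set belong together.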
--- a/algo/cubehash/sse2/cubehash_sse2.h
+++ b/algo/cubehash/sse2/cubehash_sse2.h
--- a/algo/echo/aes_ni/hash.c
+++ b/algo/echo/aes_ni/hash.c
@@ -60,336 +60,174 @@ MYALIGN const unsigned int	zero[]			= {0x00000000, 0x00000000, 0x00000000, 0x000
 MYALIGN const unsigned int	mul2ipt[]		= {0x728efc00, 0x6894e61a, 0x3fc3b14d, 0x25d9ab57, 0xfd5ba600, 0x2a8c71d7, 0x1eb845e3, 0xc96f9234};


-//#include "crypto_hash.h"
-
- int crypto_hash(
-   unsigned char *out,
-   const unsigned char *in,
-   unsigned long long inlen
- )
- {
-
-	 if(hash_echo(512, in, inlen * 8, out) == SUCCESS) 
-		 return 0;
-	 
-	 return -1;
- }
-
-/*
-int main()
-{
-	return 0;
-}
-*/
-
-#if 0
-void DumpState(__m128i *ps)
-{
-	int i, j, k;
-	unsigned int ucol;
-
-	for(j = 0; j < 4; j++)
-	{
-		for(i = 0; i < 4; i++)
-		{
-			printf("row %d,col %d : ", i, j);
-			for(k = 0; k < 4; k++)
-			{
-				ucol = *((int*)ps + 16 * i + 4 * j + k);
-				printf("%02x%02x%02x%02x ", (ucol >> 0) & 0xff, (ucol >> 8) & 0xff, (ucol >> 16) & 0xff, (ucol >> 24) & 0xff);
-			}
-
-			printf("\n");
-		}
-	}
-
-	printf("\n");
-}
-#endif
-
-
-
-
-#ifndef NO_AES_NI
 #define ECHO_SUBBYTES(state, i, j) \
-				state[i][j] = _mm_aesenc_si128(state[i][j], k1);\
-				state[i][j] = _mm_aesenc_si128(state[i][j], M128(zero));\
-				k1 = _mm_add_epi32(k1, M128(const1))
-#else
-#define ECHO_SUBBYTES(state, i, j) \
-				AES_ROUND_VPERM(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
-				state[i][j] = _mm_xor_si128(state[i][j], k1);\
-				AES_ROUND_VPERM(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
-				k1 = _mm_add_epi32(k1, M128(const1))
-
-#define ECHO_SUB_AND_MIX(state, i, j, state2, c, r1, r2, r3, r4) \
-				AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
-				ktemp = k1;\
-				TRANSFORM(ktemp, _k_ipt, t1, t4);\
-				state[i][j] = _mm_xor_si128(state[i][j], ktemp);\
-				AES_ROUND_VPERM_CORE(state[i][j], t1, t2, t3, t4, s1, s2, s3);\
-				k1 = _mm_add_epi32(k1, M128(const1));\
-				s1 = state[i][j];\
-				s2 = s1;\
-				TRANSFORM(s2, mul2ipt, t1, t2);\
-				s3 = _mm_xor_si128(s1, s2);\
-				state2[r1][c] = _mm_xor_si128(state2[r1][c], s2);\
-				state2[r2][c] = _mm_xor_si128(state2[r2][c], s1);\
-				state2[r3][c] = _mm_xor_si128(state2[r3][c], s1);\
-				state2[r4][c] = _mm_xor_si128(state2[r4][c], s3)
-
-
-
-#endif
-
+	state[i][j] = _mm_aesenc_si128(state[i][j], k1);\
+	state[i][j] = _mm_aesenc_si128(state[i][j], M128(zero));\
+	k1 = _mm_add_epi32(k1, M128(const1))

 #define ECHO_MIXBYTES(state1, state2, j, t1, t2, s2) \
-				s2 = _mm_add_epi8(state1[0][j], state1[0][j]);\
-				t1 = _mm_srli_epi16(state1[0][j], 7);\
-				t1 = _mm_and_si128(t1, M128(lsbmask));\
-				t2 = _mm_shuffle_epi8(M128(mul2mask), t1);\
-				s2 = _mm_xor_si128(s2, t2);\
-				state2[0][j] = s2;\
-				state2[1][j] = state1[0][j];\
-				state2[2][j] = state1[0][j];\
-				state2[3][j] = _mm_xor_si128(s2, state1[0][j]);\
-				s2 = _mm_add_epi8(state1[1][(j + 1) & 3], state1[1][(j + 1) & 3]);\
-				t1 = _mm_srli_epi16(state1[1][(j + 1) & 3], 7);\
-				t1 = _mm_and_si128(t1, M128(lsbmask));\
-				t2 = _mm_shuffle_epi8(M128(mul2mask), t1);\
-				s2 = _mm_xor_si128(s2, t2);\
-				state2[0][j] = _mm_xor_si128(state2[0][j], _mm_xor_si128(s2, state1[1][(j + 1) & 3]));\
-				state2[1][j] = _mm_xor_si128(state2[1][j], s2);\
-				state2[2][j] = _mm_xor_si128(state2[2][j], state1[1][(j + 1) & 3]);\
-				state2[3][j] = _mm_xor_si128(state2[3][j], state1[1][(j + 1) & 3]);\
-				s2 = _mm_add_epi8(state1[2][(j + 2) & 3], state1[2][(j + 2) & 3]);\
-				t1 = _mm_srli_epi16(state1[2][(j + 2) & 3], 7);\
-				t1 = _mm_and_si128(t1, M128(lsbmask));\
-				t2 = _mm_shuffle_epi8(M128(mul2mask), t1);\
-				s2 = _mm_xor_si128(s2, t2);\
-				state2[0][j] = _mm_xor_si128(state2[0][j], state1[2][(j + 2) & 3]);\
-				state2[1][j] = _mm_xor_si128(state2[1][j], _mm_xor_si128(s2, state1[2][(j + 2) & 3]));\
-				state2[2][j] = _mm_xor_si128(state2[2][j], s2);\
-				state2[3][j] = _mm_xor_si128(state2[3][j], state1[2][(j + 2) & 3]);\
-				s2 = _mm_add_epi8(state1[3][(j + 3) & 3], state1[3][(j + 3) & 3]);\
-				t1 = _mm_srli_epi16(state1[3][(j + 3) & 3], 7);\
-				t1 = _mm_and_si128(t1, M128(lsbmask));\
-				t2 = _mm_shuffle_epi8(M128(mul2mask), t1);\
-				s2 = _mm_xor_si128(s2, t2);\
-				state2[0][j] = _mm_xor_si128(state2[0][j], state1[3][(j + 3) & 3]);\
-				state2[1][j] = _mm_xor_si128(state2[1][j], state1[3][(j + 3) & 3]);\
-				state2[2][j] = _mm_xor_si128(state2[2][j], _mm_xor_si128(s2, state1[3][(j + 3) & 3]));\
-				state2[3][j] = _mm_xor_si128(state2[3][j], s2)
+	s2 = _mm_add_epi8(state1[0][j], state1[0][j]);\
+	t1 = _mm_srli_epi16(state1[0][j], 7);\
+	t1 = _mm_and_si128(t1, M128(lsbmask));\
+	t2 = _mm_shuffle_epi8(M128(mul2mask), t1);\
+	s2 = _mm_xor_si128(s2, t2);\
+	state2[0][j] = s2;\
+	state2[1][j] = state1[0][j];\
+	state2[2][j] = state1[0][j];\
+	state2[3][j] = _mm_xor_si128(s2, state1[0][j]);\
+	s2 = _mm_add_epi8(state1[1][(j + 1) & 3], state1[1][(j + 1) & 3]);\
+	t1 = _mm_srli_epi16(state1[1][(j + 1) & 3], 7);\
+	t1 = _mm_and_si128(t1, M128(lsbmask));\
+	t2 = _mm_shuffle_epi8(M128(mul2mask), t1);\
+	s2 = _mm_xor_si128(s2, t2);\
+	state2[0][j] = _mm_xor_si128(state2[0][j], _mm_xor_si128(s2, state1[1][(j + 1) & 3]));\
+	state2[1][j] = _mm_xor_si128(state2[1][j], s2);\
+	state2[2][j] = _mm_xor_si128(state2[2][j], state1[1][(j + 1) & 3]);\
+	state2[3][j] = _mm_xor_si128(state2[3][j], state1[1][(j + 1) & 3]);\
+	s2 = _mm_add_epi8(state1[2][(j + 2) & 3], state1[2][(j + 2) & 3]);\
+	t1 = _mm_srli_epi16(state1[2][(j + 2) & 3], 7);\
+	t1 = _mm_and_si128(t1, M128(lsbmask));\
+	t2 = _mm_shuffle_epi8(M128(mul2mask), t1);\
+	s2 = _mm_xor_si128(s2, t2);\
+	state2[0][j] = _mm_xor_si128(state2[0][j], state1[2][(j + 2) & 3]);\
+	state2[1][j] = _mm_xor_si128(state2[1][j], _mm_xor_si128(s2, state1[2][(j + 2) & 3]));\
+	state2[2][j] = _mm_xor_si128(state2[2][j], s2);\
+	state2[3][j] = _mm_xor_si128(state2[3][j], state1[2][(j + 2) & 3]);\
+	s2 = _mm_add_epi8(state1[3][(j + 3) & 3], state1[3][(j + 3) & 3]);\
+	t1 = _mm_srli_epi16(state1[3][(j + 3) & 3], 7);\
+	t1 = _mm_and_si128(t1, M128(lsbmask));\
+	t2 = _mm_shuffle_epi8(M128(mul2mask), t1);\
+	s2 = _mm_xor_si128(s2, t2);\
+	state2[0][j] = _mm_xor_si128(state2[0][j], state1[3][(j + 3) & 3]);\
+	state2[1][j] = _mm_xor_si128(state2[1][j], state1[3][(j + 3) & 3]);\
+	state2[2][j] = _mm_xor_si128(state2[2][j], _mm_xor_si128(s2, state1[3][(j + 3) & 3]));\
+	state2[3][j] = _mm_xor_si128(state2[3][j], s2)


 #define ECHO_ROUND_UNROLL2 \
-			ECHO_SUBBYTES(_state, 0, 0);\
-			ECHO_SUBBYTES(_state, 1, 0);\
-			ECHO_SUBBYTES(_state, 2, 0);\
-			ECHO_SUBBYTES(_state, 3, 0);\
-			ECHO_SUBBYTES(_state, 0, 1);\
-			ECHO_SUBBYTES(_state, 1, 1);\
-			ECHO_SUBBYTES(_state, 2, 1);\
-			ECHO_SUBBYTES(_state, 3, 1);\
-			ECHO_SUBBYTES(_state, 0, 2);\
-			ECHO_SUBBYTES(_state, 1, 2);\
-			ECHO_SUBBYTES(_state, 2, 2);\
-			ECHO_SUBBYTES(_state, 3, 2);\
-			ECHO_SUBBYTES(_state, 0, 3);\
-			ECHO_SUBBYTES(_state, 1, 3);\
-			ECHO_SUBBYTES(_state, 2, 3);\
-			ECHO_SUBBYTES(_state, 3, 3);\
-			ECHO_MIXBYTES(_state, _state2, 0, t1, t2, s2);\
-			ECHO_MIXBYTES(_state, _state2, 1, t1, t2, s2);\
-			ECHO_MIXBYTES(_state, _state2, 2, t1, t2, s2);\
-			ECHO_MIXBYTES(_state, _state2, 3, t1, t2, s2);\
-			ECHO_SUBBYTES(_state2, 0, 0);\
-			ECHO_SUBBYTES(_state2, 1, 0);\
-			ECHO_SUBBYTES(_state2, 2, 0);\
-			ECHO_SUBBYTES(_state2, 3, 0);\
-			ECHO_SUBBYTES(_state2, 0, 1);\
-			ECHO_SUBBYTES(_state2, 1, 1);\
-			ECHO_SUBBYTES(_state2, 2, 1);\
-			ECHO_SUBBYTES(_state2, 3, 1);\
-			ECHO_SUBBYTES(_state2, 0, 2);\
-			ECHO_SUBBYTES(_state2, 1, 2);\
-			ECHO_SUBBYTES(_state2, 2, 2);\
-			ECHO_SUBBYTES(_state2, 3, 2);\
-			ECHO_SUBBYTES(_state2, 0, 3);\
-			ECHO_SUBBYTES(_state2, 1, 3);\
-			ECHO_SUBBYTES(_state2, 2, 3);\
-			ECHO_SUBBYTES(_state2, 3, 3);\
-			ECHO_MIXBYTES(_state2, _state, 0, t1, t2, s2);\
-			ECHO_MIXBYTES(_state2, _state, 1, t1, t2, s2);\
-			ECHO_MIXBYTES(_state2, _state, 2, t1, t2, s2);\
-			ECHO_MIXBYTES(_state2, _state, 3, t1, t2, s2)
+	ECHO_SUBBYTES(_state, 0, 0);\
+	ECHO_SUBBYTES(_state, 1, 0);\
+	ECHO_SUBBYTES(_state, 2, 0);\
+	ECHO_SUBBYTES(_state, 3, 0);\
+	ECHO_SUBBYTES(_state, 0, 1);\
+	ECHO_SUBBYTES(_state, 1, 1);\
+	ECHO_SUBBYTES(_state, 2, 1);\
+	ECHO_SUBBYTES(_state, 3, 1);\
+	ECHO_SUBBYTES(_state, 0, 2);\
+	ECHO_SUBBYTES(_state, 1, 2);\
+	ECHO_SUBBYTES(_state, 2, 2);\
+	ECHO_SUBBYTES(_state, 3, 2);\
+	ECHO_SUBBYTES(_state, 0, 3);\
+	ECHO_SUBBYTES(_state, 1, 3);\
+	ECHO_SUBBYTES(_state, 2, 3);\
+	ECHO_SUBBYTES(_state, 3, 3);\
+	ECHO_MIXBYTES(_state, _state2, 0, t1, t2, s2);\
+	ECHO_MIXBYTES(_state, _state2, 1, t1, t2, s2);\
+	ECHO_MIXBYTES(_state, _state2, 2, t1, t2, s2);\
+	ECHO_MIXBYTES(_state, _state2, 3, t1, t2, s2);\
+	ECHO_SUBBYTES(_state2, 0, 0);\
+	ECHO_SUBBYTES(_state2, 1, 0);\
+	ECHO_SUBBYTES(_state2, 2, 0);\
+	ECHO_SUBBYTES(_state2, 3, 0);\
+	ECHO_SUBBYTES(_state2, 0, 1);\
+	ECHO_SUBBYTES(_state2, 1, 1);\
+	ECHO_SUBBYTES(_state2, 2, 1);\
+	ECHO_SUBBYTES(_state2, 3, 1);\
+	ECHO_SUBBYTES(_state2, 0, 2);\
+	ECHO_SUBBYTES(_state2, 1, 2);\
+	ECHO_SUBBYTES(_state2, 2, 2);\
+	ECHO_SUBBYTES(_state2, 3, 2);\
+	ECHO_SUBBYTES(_state2, 0, 3);\
+	ECHO_SUBBYTES(_state2, 1, 3);\
+	ECHO_SUBBYTES(_state2, 2, 3);\
+	ECHO_SUBBYTES(_state2, 3, 3);\
+	ECHO_MIXBYTES(_state2, _state, 0, t1, t2, s2);\
+	ECHO_MIXBYTES(_state2, _state, 1, t1, t2, s2);\
+	ECHO_MIXBYTES(_state2, _state, 2, t1, t2, s2);\
+	ECHO_MIXBYTES(_state2, _state, 3, t1, t2, s2)



 #define SAVESTATE(dst, src)\
-		dst[0][0] = src[0][0];\
-		dst[0][1] = src[0][1];\
-		dst[0][2] = src[0][2];\
-		dst[0][3] = src[0][3];\
-		dst[1][0] = src[1][0];\
-		dst[1][1] = src[1][1];\
-		dst[1][2] = src[1][2];\
-		dst[1][3] = src[1][3];\
-		dst[2][0] = src[2][0];\
-		dst[2][1] = src[2][1];\
-		dst[2][2] = src[2][2];\
-		dst[2][3] = src[2][3];\
-		dst[3][0] = src[3][0];\
-		dst[3][1] = src[3][1];\
-		dst[3][2] = src[3][2];\
-		dst[3][3] = src[3][3]
+	dst[0][0] = src[0][0];\
+	dst[0][1] = src[0][1];\
+	dst[0][2] = src[0][2];\
+	dst[0][3] = src[0][3];\
+	dst[1][0] = src[1][0];\
+	dst[1][1] = src[1][1];\
+	dst[1][2] = src[1][2];\
+	dst[1][3] = src[1][3];\
+	dst[2][0] = src[2][0];\
+	dst[2][1] = src[2][1];\
+	dst[2][2] = src[2][2];\
+	dst[2][3] = src[2][3];\
+	dst[3][0] = src[3][0];\
+	dst[3][1] = src[3][1];\
+	dst[3][2] = src[3][2];\
+	dst[3][3] = src[3][3]


 void Compress(hashState_echo *ctx, const unsigned char *pmsg, unsigned int uBlockCount)
 {
-	unsigned int r, b, i, j;
-//      __m128i t1, t2, t3, t4, s1, s2, s3, k1, ktemp;
-	__m128i t1, t2, s2, k1;
-	__m128i _state[4][4], _state2[4][4], _statebackup[4][4]; 
+   unsigned int r, b, i, j;
+   __m128i t1, t2, s2, k1;
+   __m128i _state[4][4], _state2[4][4], _statebackup[4][4]; 

+   for(i = 0; i < 4; i++)
+	for(j = 0; j < ctx->uHashSize / 256; j++)
+		_state[i][j] = ctx->state[i][j];

-	for(i = 0; i < 4; i++)
-		for(j = 0; j < ctx->uHashSize / 256; j++)
-			_state[i][j] = ctx->state[i][j];
+   for(b = 0; b < uBlockCount; b++)
+   {
+	ctx->k = _mm_add_epi64(ctx->k, ctx->const1536);

-
-#ifdef NO_AES_NI
-	// transform cv
-	for(i = 0; i < 4; i++)
-		for(j = 0; j < ctx->uHashSize / 256; j++)
-		{
-			TRANSFORM(_state[i][j], _k_ipt, t1, t2);
-		}
-#endif
-
-	for(b = 0; b < uBlockCount; b++)
+	// load message
+	for(j = ctx->uHashSize / 256; j < 4; j++)
 	{
-		ctx->k = _mm_add_epi64(ctx->k, ctx->const1536);
-
-		// load message
-		for(j = ctx->uHashSize / 256; j < 4; j++)
-		{
-			for(i = 0; i < 4; i++)
-			{
-				_state[i][j] = _mm_loadu_si128((__m128i*)pmsg + 4 * (j - (ctx->uHashSize / 256)) + i);
-
-#ifdef NO_AES_NI
-				// transform message
-				TRANSFORM(_state[i][j], _k_ipt, t1, t2);
-#endif
-			}
-		}
-
-		// save state
-		SAVESTATE(_statebackup, _state);
-
-
-		k1 = ctx->k;
-
-#ifndef NO_AES_NI
-		for(r = 0; r < ctx->uRounds / 2; r++)
-		{
-			ECHO_ROUND_UNROLL2;
-		}
-
-#else
-		for(r = 0; r < ctx->uRounds / 2; r++)
-		{
-			_state2[0][0] = M128(zero); _state2[1][0] = M128(zero); _state2[2][0] = M128(zero); _state2[3][0] = M128(zero);
-			_state2[0][1] = M128(zero); _state2[1][1] = M128(zero); _state2[2][1] = M128(zero); _state2[3][1] = M128(zero);
-			_state2[0][2] = M128(zero); _state2[1][2] = M128(zero); _state2[2][2] = M128(zero); _state2[3][2] = M128(zero);
-			_state2[0][3] = M128(zero); _state2[1][3] = M128(zero); _state2[2][3] = M128(zero); _state2[3][3] = M128(zero);																			
-
-			ECHO_SUB_AND_MIX(_state, 0, 0, _state2, 0, 0, 1, 2, 3);
-			ECHO_SUB_AND_MIX(_state, 1, 0, _state2, 3, 1, 2, 3, 0);
-			ECHO_SUB_AND_MIX(_state, 2, 0, _state2, 2, 2, 3, 0, 1);
-			ECHO_SUB_AND_MIX(_state, 3, 0, _state2, 1, 3, 0, 1, 2);
-			ECHO_SUB_AND_MIX(_state, 0, 1, _state2, 1, 0, 1, 2, 3);
-			ECHO_SUB_AND_MIX(_state, 1, 1, _state2, 0, 1, 2, 3, 0);
-			ECHO_SUB_AND_MIX(_state, 2, 1, _state2, 3, 2, 3, 0, 1);
-			ECHO_SUB_AND_MIX(_state, 3, 1, _state2, 2, 3, 0, 1, 2);
-			ECHO_SUB_AND_MIX(_state, 0, 2, _state2, 2, 0, 1, 2, 3);
-			ECHO_SUB_AND_MIX(_state, 1, 2, _state2, 1, 1, 2, 3, 0);
-			ECHO_SUB_AND_MIX(_state, 2, 2, _state2, 0, 2, 3, 0, 1);
-			ECHO_SUB_AND_MIX(_state, 3, 2, _state2, 3, 3, 0, 1, 2);
-			ECHO_SUB_AND_MIX(_state, 0, 3, _state2, 3, 0, 1, 2, 3);
-			ECHO_SUB_AND_MIX(_state, 1, 3, _state2, 2, 1, 2, 3, 0);
-			ECHO_SUB_AND_MIX(_state, 2, 3, _state2, 1, 2, 3, 0, 1);
-			ECHO_SUB_AND_MIX(_state, 3, 3, _state2, 0, 3, 0, 1, 2);
-
-			_state[0][0] = M128(zero); _state[1][0] = M128(zero); _state[2][0] = M128(zero); _state[3][0] = M128(zero);
-			_state[0][1] = M128(zero); _state[1][1] = M128(zero); _state[2][1] = M128(zero); _state[3][1] = M128(zero);
-			_state[0][2] = M128(zero); _state[1][2] = M128(zero); _state[2][2] = M128(zero); _state[3][2] = M128(zero);
-			_state[0][3] = M128(zero); _state[1][3] = M128(zero); _state[2][3] = M128(zero); _state[3][3] = M128(zero);																			
-
-			ECHO_SUB_AND_MIX(_state2, 0, 0, _state, 0, 0, 1, 2, 3);
-			ECHO_SUB_AND_MIX(_state2, 1, 0, _state, 3, 1, 2, 3, 0);
-			ECHO_SUB_AND_MIX(_state2, 2, 0, _state, 2, 2, 3, 0, 1);
-			ECHO_SUB_AND_MIX(_state2, 3, 0, _state, 1, 3, 0, 1, 2);
-			ECHO_SUB_AND_MIX(_state2, 0, 1, _state, 1, 0, 1, 2, 3);
-			ECHO_SUB_AND_MIX(_state2, 1, 1, _state, 0, 1, 2, 3, 0);
-			ECHO_SUB_AND_MIX(_state2, 2, 1, _state, 3, 2, 3, 0, 1);
-			ECHO_SUB_AND_MIX(_state2, 3, 1, _state, 2, 3, 0, 1, 2);
-			ECHO_SUB_AND_MIX(_state2, 0, 2, _state, 2, 0, 1, 2, 3);
-			ECHO_SUB_AND_MIX(_state2, 1, 2, _state, 1, 1, 2, 3, 0);
-			ECHO_SUB_AND_MIX(_state2, 2, 2, _state, 0, 2, 3, 0, 1);
-			ECHO_SUB_AND_MIX(_state2, 3, 2, _state, 3, 3, 0, 1, 2);
-			ECHO_SUB_AND_MIX(_state2, 0, 3, _state, 3, 0, 1, 2, 3);
-			ECHO_SUB_AND_MIX(_state2, 1, 3, _state, 2, 1, 2, 3, 0);
-			ECHO_SUB_AND_MIX(_state2, 2, 3, _state, 1, 2, 3, 0, 1);
-			ECHO_SUB_AND_MIX(_state2, 3, 3, _state, 0, 3, 0, 1, 2);
-
-		}
-#endif
-
-		
-		if(ctx->uHashSize == 256)
-		{
-			for(i = 0; i < 4; i++)
-			{
-				_state[i][0] = _mm_xor_si128(_state[i][0], _state[i][1]);
-				_state[i][0] = _mm_xor_si128(_state[i][0], _state[i][2]);
-				_state[i][0] = _mm_xor_si128(_state[i][0], _state[i][3]);
-
-				_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][0]);
-				_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][1]);
-				_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][2]);
-				_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][3]);
-			}
-		}
-		else
-		{
-			for(i = 0; i < 4; i++)
-			{
-				_state[i][0] = _mm_xor_si128(_state[i][0], _state[i][2]);
-				_state[i][1] = _mm_xor_si128(_state[i][1], _state[i][3]);
-
-				_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][0]);
-				_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][2]);
-
-				_state[i][1] = _mm_xor_si128(_state[i][1], _statebackup[i][1]);
-				_state[i][1] = _mm_xor_si128(_state[i][1], _statebackup[i][3]);
-			}
-		}
-
-		pmsg += ctx->uBlockLength;
+	   for(i = 0; i < 4; i++)
+	   {
+		_state[i][j] = _mm_loadu_si128((__m128i*)pmsg + 4 * (j - (ctx->uHashSize / 256)) + i);
+	   }
 	}

-#ifdef NO_AES_NI
-	// transform state
-	for(i = 0; i < 4; i++)
-		for(j = 0; j < 4; j++)
-		{
-			TRANSFORM(_state[i][j], _k_opt, t1, t2);
-		}
-#endif
+	// save state
+	SAVESTATE(_statebackup, _state);

-		SAVESTATE(ctx->state, _state);
+	k1 = ctx->k;
+
+	for(r = 0; r < ctx->uRounds / 2; r++)
+	{
+		ECHO_ROUND_UNROLL2;
+	}
+		
+	if(ctx->uHashSize == 256)
+	{
+	   for(i = 0; i < 4; i++)
+	   {
+		_state[i][0] = _mm_xor_si128(_state[i][0], _state[i][1]);
+		_state[i][0] = _mm_xor_si128(_state[i][0], _state[i][2]);
+		_state[i][0] = _mm_xor_si128(_state[i][0], _state[i][3]);
+		_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][0]);
+		_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][1]);
+		_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][2]);
+		_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][3]);
+	   }
+	}
+	else
+	{
+	   for(i = 0; i < 4; i++)
+	   {
+		_state[i][0] = _mm_xor_si128(_state[i][0], _state[i][2]);
+		_state[i][1] = _mm_xor_si128(_state[i][1], _state[i][3]);
+		_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][0]);
+		_state[i][0] = _mm_xor_si128(_state[i][0], _statebackup[i][2]);
+		_state[i][1] = _mm_xor_si128(_state[i][1], _statebackup[i][1]);
+		_state[i][1] = _mm_xor_si128(_state[i][1], _statebackup[i][3]);
+           }
+	}
+	pmsg += ctx->uBlockLength;
+   }
+	SAVESTATE(ctx->state, _state);

 }
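Review note (not part of the patch): ECHO_MIXBYTES is only re-indented here, but the macro is dense, so a byte-level sketch may help when reviewing it. Each group of five lines doubles every byte in GF(2^8): _mm_add_epi8 shifts left by one, the srli/and pair extracts the old top bit, and the shuffle turns that bit into a reduction constant that is XORed back in. The four groups then combine as 2a ^ 3b ^ c ^ d with the inputs rotated per output row. The code below is a scalar illustration under the assumption that mul2mask encodes the AES reduction byte 0x1b (the table is defined outside this hunk). gf_double and mix_column are hypothetical names.

```c
#include <stdint.h>

/* Multiply one byte by 2 in GF(2^8), reducing with 0x1b when the
 * top bit is shifted out (assumed to match mul2mask). */
static uint8_t gf_double(uint8_t b)
{
    uint8_t shifted = (uint8_t)(b << 1);   /* _mm_add_epi8(x, x)       */
    uint8_t carry   = (uint8_t)(b >> 7);   /* srli by 7, then lsbmask  */
    return (uint8_t)(shifted ^ (carry ? 0x1b : 0x00));
}

/* Combine four bytes the way the macro combines four state words:
 * out[r] = 2*in[r] ^ 3*in[r+1] ^ in[r+2] ^ in[r+3], indices mod 4. */
static void mix_column(const uint8_t in[4], uint8_t out[4])
{
    for (int r = 0; r < 4; r++) {
        uint8_t a = in[r];
        uint8_t b = in[(r + 1) & 3];
        uint8_t c = in[(r + 2) & 3];
        uint8_t d = in[(r + 3) & 3];
        out[r] = (uint8_t)(gf_double(a) ^ (gf_double(b) ^ b) ^ c ^ d);
    }
}
```

The sketch deliberately ignores the (j + 1) & 3 style column offsets in the macro, which fold ECHO's row shifting into the same pass.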

--- a/algo/echo/aes_ni/hash_api.h
+++ b/algo/echo/aes_ni/hash_api.h
@@ -30,6 +30,7 @@
 typedef struct
 {
 	__m128i			state[4][4];
+        BitSequence             buffer[192];
 	__m128i			k;
 	__m128i			hashsize;
 	__m128i			const1536;
@@ -39,9 +40,8 @@ typedef struct
 	unsigned int	uBlockLength;
 	unsigned int	uBufferBytes;
 	DataLength		processed_bits;
-	BitSequence		buffer[192];

-} hashState_echo;
+} hashState_echo __attribute__ ((aligned (64)));

 HashReturn init_echo(hashState_echo *state, int hashbitlen);
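Review note (not part of the patch): the hash_api.h hunk moves buffer directly behind state and adds a 64-byte alignment attribute to the typedef. One plausible reading is that both the state and the buffer should start on a cache-line boundary. The stand-in below uses byte arrays instead of __m128i so it builds without SIMD headers. demo_echo_state and is_aligned_64 are hypothetical, and the attribute syntax assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for hashState_echo: 16 x 16-byte state words followed
 * directly by the 192-byte message buffer. */
typedef struct
{
    unsigned char state[4][4][16];
    unsigned char buffer[192];
    unsigned int  buffer_bytes;
} demo_echo_state __attribute__ ((aligned (64)));

/* Report whether a pointer sits on a 64-byte boundary. */
static int is_aligned_64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}
```

With this layout offsetof(demo_echo_state, buffer) is 256, a multiple of 64, so buffer lands on a 64-byte boundary whenever the object itself does.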

--- a/algo/echo/sse2/echo.c
+++ b/algo/echo/sse2/echo.c
--- a/algo/echo/sse2/sph_echo.h
+++ b/algo/echo/sse2/sph_echo.h
@@ -1,320 +0,0 @@
-/* $Id: sph_echo.h 216 2010-06-08 09:46:57Z tp $ */
-/**
- * ECHO interface. ECHO is a family of functions which differ by
- * their output size; this implementation defines ECHO for output
- * sizes 224, 256, 384 and 512 bits.
- *
- * ==========================(LICENSE BEGIN)============================
- *
- * Copyright (c) 2007-2010  Projet RNRT SAPHIR
- * 
- * Permission is hereby granted, free of charge, to any person obtaining
- * a copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- * 
- * The above copyright notice and this permission notice shall be
- * included in all copies or substantial portions of the Software.
- * 
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- * ===========================(LICENSE END)=============================
- *
- * @file     sph_echo.h
- * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
- */
-
-#ifndef SPH_ECHO_H__
-#define SPH_ECHO_H__
-
-#ifdef __cplusplus
-extern "C"{
-#endif
-
-#include <stddef.h>
-#include "algo/sha/sph_types.h"
-
-/**
- * Output size (in bits) for ECHO-224.
- */
-#define SPH_SIZE_echo224   224
-
-/**
- * Output size (in bits) for ECHO-256.
- */
-#define SPH_SIZE_echo256   256
-
-/**
- * Output size (in bits) for ECHO-384.
- */
-#define SPH_SIZE_echo384   384
-
-/**
- * Output size (in bits) for ECHO-512.
- */
-#define SPH_SIZE_echo512   512
-
-/**
- * This structure is a context for ECHO computations: it contains the
- * intermediate values and some data from the last entered block. Once
- * an ECHO computation has been performed, the context can be reused for
- * another computation. This specific structure is used for ECHO-224
- * and ECHO-256.
- *
- * The contents of this structure are private. A running ECHO computation
- * can be cloned by copying the context (e.g. with a simple
- * <code>memcpy()</code>).
- */
-typedef struct {
-#ifndef DOXYGEN_IGNORE
-	unsigned char buf[192];    /* first field, for alignment */
-	size_t ptr;
-	union {
-		sph_u32 Vs[4][4];
-#if SPH_64
-		sph_u64 Vb[4][2];
-#endif
-	} u;
-	sph_u32 C0, C1, C2, C3;
-#endif
-} sph_echo_small_context;
-
-/**
- * This structure is a context for ECHO computations: it contains the
- * intermediate values and some data from the last entered block. Once
- * an ECHO computation has been performed, the context can be reused for
- * another computation. This specific structure is used for ECHO-384
- * and ECHO-512.
- *
- * The contents of this structure are private. A running ECHO computation
- * can be cloned by copying the context (e.g. with a simple
- * <code>memcpy()</code>).
- */
-typedef struct {
-#ifndef DOXYGEN_IGNORE
-	unsigned char buf[128];    /* first field, for alignment */
-	size_t ptr;
-	union {
-		sph_u32 Vs[8][4];
-#if SPH_64
-		sph_u64 Vb[8][2];
-#endif
-	} u;
-	sph_u32 C0, C1, C2, C3;
-#endif
-} sph_echo_big_context;
-
-/**
- * Type for a ECHO-224 context (identical to the common "small" context).
- */
-typedef sph_echo_small_context sph_echo224_context;
-
-/**
- * Type for a ECHO-256 context (identical to the common "small" context).
- */
-typedef sph_echo_small_context sph_echo256_context;
-
-/**
- * Type for a ECHO-384 context (identical to the common "big" context).
- */
-typedef sph_echo_big_context sph_echo384_context;
-
-/**
- * Type for a ECHO-512 context (identical to the common "big" context).
- */
-typedef sph_echo_big_context sph_echo512_context;
-
-/**
- * Initialize an ECHO-224 context. This process performs no memory allocation.
- *
- * @param cc   the ECHO-224 context (pointer to a
- *             <code>sph_echo224_context</code>)
- */
-void sph_echo224_init(void *cc);
-
-/**
- * Process some data bytes. It is acceptable that <code>len</code> is zero
- * (in which case this function does nothing).
- *
- * @param cc     the ECHO-224 context
- * @param data   the input data
- * @param len    the input data length (in bytes)
- */
-void sph_echo224(void *cc, const void *data, size_t len);
-
-/**
- * Terminate the current ECHO-224 computation and output the result into
- * the provided buffer. The destination buffer must be wide enough to
- * accomodate the result (28 bytes). The context is automatically
- * reinitialized.
- *
- * @param cc    the ECHO-224 context
- * @param dst   the destination buffer
- */
-void sph_echo224_close(void *cc, void *dst);
-
-/**
- * Add a few additional bits (0 to 7) to the current computation, then
- * terminate it and output the result in the provided buffer, which must
- * be wide enough to accomodate the result (28 bytes). If bit number i
- * in <code>ub</code> has value 2^i, then the extra bits are those
- * numbered 7 downto 8-n (this is the big-endian convention at the byte
- * level). The context is automatically reinitialized.
- *
- * @param cc    the ECHO-224 context
- * @param ub    the extra bits
- * @param n     the number of extra bits (0 to 7)
- * @param dst   the destination buffer
- */
-void sph_echo224_addbits_and_close(
-	void *cc, unsigned ub, unsigned n, void *dst);
-
-/**
- * Initialize an ECHO-256 context. This process performs no memory allocation.
- *
- * @param cc   the ECHO-256 context (pointer to a
- *             <code>sph_echo256_context</code>)
- */
-void sph_echo256_init(void *cc);
-
-/**
- * Process some data bytes. It is acceptable that <code>len</code> is zero
- * (in which case this function does nothing).
- *
- * @param cc     the ECHO-256 context
- * @param data   the input data
- * @param len    the input data length (in bytes)
- */
-void sph_echo256(void *cc, const void *data, size_t len);
-
-/**
- * Terminate the current ECHO-256 computation and output the result into
- * the provided buffer. The destination buffer must be wide enough to
- * accomodate the result (32 bytes). The context is automatically
- * reinitialized.
- *
- * @param cc    the ECHO-256 context
- * @param dst   the destination buffer
- */
-void sph_echo256_close(void *cc, void *dst);
-
-/**
- * Add a few additional bits (0 to 7) to the current computation, then
- * terminate it and output the result in the provided buffer, which must
- * be wide enough to accomodate the result (32 bytes). If bit number i
- * in <code>ub</code> has value 2^i, then the extra bits are those
- * numbered 7 downto 8-n (this is the big-endian convention at the byte
- * level). The context is automatically reinitialized.
- *
- * @param cc    the ECHO-256 context
- * @param ub    the extra bits
- * @param n     the number of extra bits (0 to 7)
- * @param dst   the destination buffer
- */
-void sph_echo256_addbits_and_close(
-	void *cc, unsigned ub, unsigned n, void *dst);
-
-/**
- * Initialize an ECHO-384 context. This process performs no memory allocation.
- *
- * @param cc   the ECHO-384 context (pointer to a
- *             <code>sph_echo384_context</code>)
- */
-void sph_echo384_init(void *cc);
-
-/**
- * Process some data bytes. It is acceptable that <code>len</code> is zero
- * (in which case this function does nothing).
- *
- * @param cc     the ECHO-384 context
- * @param data   the input data
- * @param len    the input data length (in bytes)
- */
-void sph_echo384(void *cc, const void *data, size_t len);
-
-/**
- * Terminate the current ECHO-384 computation and output the result into
- * the provided buffer. The destination buffer must be wide enough to
- * accomodate the result (48 bytes). The context is automatically
- * reinitialized.
- *
- * @param cc    the ECHO-384 context
- * @param dst   the destination buffer
- */
-void sph_echo384_close(void *cc, void *dst);
-
-/**
- * Add a few additional bits (0 to 7) to the current computation, then
- * terminate it and output the result in the provided buffer, which must
- * be wide enough to accomodate the result (48 bytes). If bit number i
- * in <code>ub</code> has value 2^i, then the extra bits are those
- * numbered 7 downto 8-n (this is the big-endian convention at the byte
- * level). The context is automatically reinitialized.
- *
- * @param cc    the ECHO-384 context
- * @param ub    the extra bits
- * @param n     the number of extra bits (0 to 7)
- * @param dst   the destination buffer
- */
-void sph_echo384_addbits_and_close(
-	void *cc, unsigned ub, unsigned n, void *dst);
-
-/**
- * Initialize an ECHO-512 context. This process performs no memory allocation.
- *
- * @param cc   the ECHO-512 context (pointer to a
- *             <code>sph_echo512_context</code>)
- */
-void sph_echo512_init(void *cc);
-
-/**
- * Process some data bytes. It is acceptable that <code>len</code> is zero
- * (in which case this function does nothing).
- *
- * @param cc     the ECHO-512 context
- * @param data   the input data
- * @param len    the input data length (in bytes)
- */
-void sph_echo512(void *cc, const void *data, size_t len);
-
-/**
- * Terminate the current ECHO-512 computation and output the result into
- * the provided buffer. The destination buffer must be wide enough to
- * accomodate the result (64 bytes). The context is automatically
- * reinitialized.
- *
- * @param cc    the ECHO-512 context
- * @param dst   the destination buffer
- */
-void sph_echo512_close(void *cc, void *dst);
-
-/**
- * Add a few additional bits (0 to 7) to the current computation, then
- * terminate it and output the result in the provided buffer, which must
- * be wide enough to accomodate the result (64 bytes). If bit number i
- * in <code>ub</code> has value 2^i, then the extra bits are those
- * numbered 7 downto 8-n (this is the big-endian convention at the byte
- * level). The context is automatically reinitialized.
- *
- * @param cc    the ECHO-512 context
- * @param ub    the extra bits
- * @param n     the number of extra bits (0 to 7)
- * @param dst   the destination buffer
- */
-void sph_echo512_addbits_and_close(
-	void *cc, unsigned ub, unsigned n, void *dst);
-	
-#ifdef __cplusplus
-}
-#endif
-
-#endif
--- a/algo/fugue/sph_fugue.c
+++ b/algo/fugue/sph_fugue.c
@@ -11,6 +11,8 @@ extern "C"{
 #pragma warning (disable: 4146)
 #endif

+#define SPH_FUGUE_NOCOPY 1
+
 static const sph_u32 IV224[] = {
 	SPH_C32(0xf4c9120d), SPH_C32(0x6286f757), SPH_C32(0xee39e01c),
 	SPH_C32(0xe074e3cb), SPH_C32(0xa1127c62), SPH_C32(0x9a43d215),
--- a/algo/groestl/aes_ni/hash-groestl.c
+++ b/algo/groestl/aes_ni/hash-groestl.c
@@ -12,7 +12,7 @@
 #include <memory.h>
 #include "hash-groestl.h"
 #include "miner.h"
-#include "avxdefs.h"
+#include "simd-utils.h"

 #ifndef NO_AES_NI

--- a/algo/groestl/aes_ni/hash-groestl256.c
+++ b/algo/groestl/aes_ni/hash-groestl256.c
@@ -9,7 +9,7 @@
 #include <memory.h>
 #include "hash-groestl256.h"
 #include "miner.h"
-#include "avxdefs.h"
+#include "simd-utils.h"

 #ifndef NO_AES_NI

--- a/algo/groestl/myrgr-4way.c
+++ b/algo/groestl/myrgr-4way.c
@@ -33,7 +33,7 @@ void myriad_4way_hash( void *output, const void *input )
     myrgr_4way_ctx_holder ctx;
     memcpy( &ctx, &myrgr_4way_ctx, sizeof(myrgr_4way_ctx) );

-     mm_deinterleave_4x32( hash0, hash1, hash2, hash3, input, 640 );
+     mm128_deinterleave_4x32( hash0, hash1, hash2, hash3, input, 640 );

     update_and_final_groestl( &ctx.groestl, (char*)hash0, (char*)hash0, 640 );
     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
@@ -43,12 +43,12 @@ void myriad_4way_hash( void *output, const void *input )
     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
     update_and_final_groestl( &ctx.groestl, (char*)hash3, (char*)hash3, 640 );

-     mm_interleave_4x32( vhash, hash0, hash1, hash2, hash3, 512 );
+     mm128_interleave_4x32( vhash, hash0, hash1, hash2, hash3, 512 );

     sha256_4way( &ctx.sha, vhash, 64 );
     sha256_4way_close( &ctx.sha, vhash );

-     mm_deinterleave_4x32( output, output+32, output+64, output+96,
+     mm128_deinterleave_4x32( output, output+32, output+64, output+96,
                           vhash, 256 );
 }

@@ -79,7 +79,7 @@ int scanhash_myriad_4way( int thr_id, struct work *work, uint32_t max_nonce,
      ( (uint32_t*)ptarget )[7] = 0x0000ff;

   swab32_array( edata, pdata, 20 );
-   mm_interleave_4x32( vdata, edata, edata, edata, edata, 640 );
+   mm128_interleave_4x32( vdata, edata, edata, edata, edata, 640 );

   do {
      be32enc( noncep,   n   );
Author	SHA1	Message	Date
Jay D Dee	d6e8d7a46e	v3.9.4	2019-06-18 13:15:45 -04:00
Jay D Dee	71d6b97ee8	v3.9.3.1	2019-06-13 21:15:58 -04:00
Jay D Dee	b2331375a3	v3.9.2.5	2019-06-13 11:20:27 -04:00
Jay D Dee	7fec680835	v3.9.2.4	2019-06-07 23:30:38 -04:00
Jay D Dee	1b0a5aadf6	v3.9.2.3	2019-06-05 12:20:04 -04:00
Jay D Dee	0a3c52810e	v3.9.2.2	2019-06-04 17:14:03 -04:00
Jay D Dee	4d4386a374	v3.9.2.1	2019-06-04 16:56:44 -04:00
Jay D Dee	ce259b915a	v3.9.2	2019-06-03 21:36:33 -04:00
Jay D Dee	02202ab803	v3.9.1.1	2019-05-31 13:20:12 -04:00
Jay D Dee	77c5ae80ab	v3.9.1	2019-05-30 16:59:49 -04:00
Jay D Dee	eb3f57bfc7	v3.9.0.1	2019-05-21 20:55:05 -04:00
Jay D Dee	e1aead3c76	v3.9.0	2019-05-19 13:39:45 -04:00
Jay D Dee	bfd1c002f9	v3.8.8.1	2018-05-11 11:52:36 -04:00
Jay D Dee	9edc650042	v3.8.7.2	2018-04-11 13:44:26 -04:00
Jay D Dee	218cef337a	v3.8.7.1	2018-04-10 21:49:06 -04:00
Jay D Dee	9ffce7bdb7	v3.8.7	2018-04-09 19:14:38 -04:00
Jay D Dee	c7efa50aad	v3.8.6.1	2018-04-06 11:42:01 -04:00
Jay D Dee	dd5e552357	v3.8.6	2018-03-31 12:50:52 -04:00
Jay D Dee	f449c6725f	v3.8.5	2018-03-27 20:20:05 -04:00