Mirror of https://github.com/JayDDee/cpuminer-opt.git
(synced 2025-09-17 23:44:27 +00:00)

Compare commits

12 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 1a7a573675 | |
| | 70089d1224 | |
| | 3572cb53c4 | |
| | 241bc26767 | |
| | c65b0ff7a6 | |
| | a17ff6f189 | |
| | 73430b13b1 | |
| | 40039386a0 | |
| | 91ec6f1771 | |
| | a52c5eccf7 | |
| | 86b889e1b0 | |
| | 72330eb5a7 | |
@@ -1,12 +1,14 @@
-Requirements:
+1. Requirements:
+---------------

 Intel Core2 or newer, or AMD Steamroller or newer CPU. ARM CPUs are not
 supported.

 64 bit Linux operating system. Apple is not supported.

-Building on linux prerequisites:
+2. Building on linux prerequisites:
+-----------------------------------

 It is assumed users know how to install packages on their system and
 be able to compile standard source packages. This is basic Linux and
@@ -20,49 +22,74 @@ http://askubuntu.com/questions/457526/how-to-install-cpuminer-in-ubuntu
 Install any additional dependencies needed by cpuminer-opt. The list below
 are some of the ones that may not be in the default install and need to
-be installed manually. There may be others, read the error messages they
-will give a clue as to the missing package.
+be installed manually. There may be others; read the compiler error messages,
+they will give a clue as to the missing package.

 The following command should install everything you need on Debian based
-distributions such as Ubuntu:
+distributions such as Ubuntu. Fedora and other distributions may have similar
+but different package names.

-sudo apt-get install build-essential libssl-dev libcurl4-openssl-dev libjansson-dev libgmp-dev automake zlib1g-dev
+$ sudo apt-get install build-essential automake libssl-dev libcurl4-openssl-dev libjansson-dev libgmp-dev zlib1g-dev git

-build-essential  (Development Tools package group on Fedora)
-automake
-libjansson-dev
-libgmp-dev
-libcurl4-openssl-dev
-libssl-dev
-lib-thread
-zlib1g-dev

 SHA support on AMD Ryzen CPUs requires gcc version 5 or higher and
-openssl 1.1.0e or higher. Add one of the following, depending on the
-compiler version, to CFLAGS:
-"-march=native" or "-march=znver1" or "-msha".
+openssl 1.1.0e or higher. Add one of the following to CFLAGS for SHA
+support depending on your CPU and compiler version:
+
+"-march=native" is always the best choice.
+
+"-march=znver1" for Ryzen 1000 & 2000 series, znver2 for 3000.
+
+"-msha" adds SHA to other tuning options.

 Additional instructions for static compilation can be found here:
 https://lxadm.com/Static_compilation_of_cpuminer
 Static builds should only be considered in a homogeneous HW and SW environment.
 Local builds will always have the best performance and compatibility.

-Extract cpuminer source.
+3. Download cpuminer-opt
+------------------------

-tar xvzf cpuminer-opt-x.y.z.tar.gz
-cd cpuminer-opt-x.y.z
+Download the source code for the latest release from the official repository.

-Run ./build.sh to build on Linux or execute the following commands.
+https://github.com/JayDDee/cpuminer-opt/releases

-./autogen.sh
-CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
-make
+Extract the source code.

+$ tar xvzf cpuminer-opt-x.y.z.tar.gz
+
+Alternatively it can be cloned from git.
+
+$ git clone https://github.com/JayDDee/cpuminer-opt.git
+
+4. Build cpuminer-opt
+---------------------
+
+It is recommended to build with default options; this will usually
+produce the best results.
+
+$ ./build.sh
+
+or
+
+$ ./autogen.sh
+$ CFLAGS="-O3 -march=native -Wall" ./configure --with-curl
+$ make -j n
+
+n is the number of threads.
+
-Start mining.
+5. Start mining.
+----------------

-./cpuminer -a algo -o url -u username -p password
+$ ./cpuminer -a algo -o url -u username -p password

 Windows
+-------
+
+See also INSTALL_WINDOWS.
+
+The following procedure is obsolete and uses an old compiler.

 Precompiled Windows binaries are built on a Linux host using Mingw
 with a more recent compiler than the following Windows hosted procedure.
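The option choices above can be scripted. The sketch below is an illustration, not part of the official instructions: it picks "-march=native" as recommended, notes whether the CPU advertises SHA extensions, and derives the make job count from nproc. It assumes a Linux host with /proc/cpuinfo and coreutils.

```shell
#!/bin/sh
# Illustrative only: print the configure/make commands suggested above.
if grep -qw sha_ni /proc/cpuinfo 2>/dev/null; then
    sha_note="(CPU advertises SHA extensions)"
else
    sha_note="(no SHA extensions detected)"
fi

cflags="-O3 -march=native -Wall"
jobs="$(nproc)"

echo "CFLAGS=\"$cflags\" ./configure --with-curl  $sha_note"
echo "make -j $jobs"
```

Running it prints the two build commands, ready to paste; nothing is compiled or installed.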
@@ -22,14 +22,13 @@ Step by step...
 Refer to Linux compile instructions and install required packages.

-Additionally, install mingw-64.
+Additionally, install mingw-w64.

 sudo apt-get install mingw-w64


 2. Create a local library directory for packages to be compiled in the next
-   step. Recommended location is $HOME/usr/lib/
+   step. Suggested location is $HOME/usr/lib/


 3. Download and build other packages for mingw that don't have a mingw64
    version available in the repositories.
24
Makefile.am
@@ -84,10 +84,15 @@ cpuminer_SOURCES = \
 	algo/cubehash/cubehash_sse2.c\
 	algo/cubehash/cube-hash-2way.c \
 	algo/echo/sph_echo.c \
+	algo/echo/echo-hash-4way.c \
 	algo/echo/aes_ni/hash.c\
 	algo/gost/sph_gost.c \
+	algo/groestl/groestl-gate.c \
+	algo/groestl/groestl512-hash-4way.c \
+	algo/groestl/groestl256-hash-4way.c \
 	algo/groestl/sph_groestl.c \
 	algo/groestl/groestl.c \
+	algo/groestl/groestl-4way.c \
 	algo/groestl/myrgr-gate.c \
 	algo/groestl/myrgr-4way.c \
 	algo/groestl/myr-groestl.c \
@@ -116,13 +121,15 @@ cpuminer_SOURCES = \
 	algo/keccak/keccak-hash-4way.c \
 	algo/keccak/keccak-4way.c\
 	algo/keccak/keccak-gate.c \
-	algo/keccak/sse2/keccak.c \
+	algo/lanehash/lane.c \
 	algo/luffa/sph_luffa.c \
 	algo/luffa/luffa.c \
 	algo/luffa/luffa_for_sse2.c \
 	algo/luffa/luffa-hash-2way.c \
 	algo/lyra2/lyra2.c \
 	algo/lyra2/sponge.c \
+	algo/lyra2/sponge-2way.c \
+	algo/lyra2/lyra2-hash-2way.c \
 	algo/lyra2/lyra2-gate.c \
 	algo/lyra2/lyra2rev2.c \
 	algo/lyra2/lyra2rev2-4way.c \
@@ -143,6 +150,7 @@ cpuminer_SOURCES = \
 	algo/nist5/nist5-4way.c \
 	algo/nist5/nist5.c \
 	algo/nist5/zr5.c \
+	algo/panama/panama-hash-4way.c \
 	algo/panama/sph_panama.c \
 	algo/radiogatun/sph_radiogatun.c \
 	algo/quark/quark-gate.c \
@@ -168,12 +176,10 @@ cpuminer_SOURCES = \
 	algo/scrypt/scrypt.c \
 	algo/scrypt/neoscrypt.c \
 	algo/scrypt/pluck.c \
-	algo/scryptjane/scrypt-jane.c \
 	algo/sha/sph_sha2.c \
 	algo/sha/sph_sha2big.c \
 	algo/sha/sha256-hash-4way.c \
 	algo/sha/sha512-hash-4way.c \
-	algo/sha/sha256_hash_11way.c \
 	algo/sha/sha2.c \
 	algo/sha/sha256t-gate.c \
 	algo/sha/sha256t-4way.c \
@@ -185,6 +191,7 @@ cpuminer_SOURCES = \
 	algo/shavite/sph_shavite.c \
 	algo/shavite/sph-shavite-aesni.c \
 	algo/shavite/shavite-hash-2way.c \
+	algo/shavite/shavite-hash-4way.c \
 	algo/shavite/shavite.c \
 	algo/simd/sph_simd.c \
 	algo/simd/nist.c \
@@ -197,9 +204,9 @@ cpuminer_SOURCES = \
 	algo/skein/skein-gate.c \
 	algo/skein/skein2.c \
 	algo/skein/skein2-4way.c \
-	algo/skein/skein2-gate.c \
 	algo/sm3/sm3.c \
 	algo/sm3/sm3-hash-4way.c \
+	algo/swifftx/swifftx.c \
 	algo/tiger/sph_tiger.c \
 	algo/whirlpool/sph_whirlpool.c \
 	algo/whirlpool/whirlpool-hash-4way.c \
@@ -279,10 +286,17 @@ cpuminer_SOURCES = \
 	algo/x17/sonoa-4way.c \
 	algo/x17/sonoa.c \
 	algo/x20/x20r.c \
+	algo/x22/x22i-4way.c \
+	algo/x22/x22i.c \
+	algo/x22/x22i-gate.c \
+	algo/x22/x25x.c \
+	algo/x22/x25x-4way.c \
 	algo/yescrypt/yescrypt.c \
 	algo/yescrypt/sha256_Y.c \
 	algo/yescrypt/yescrypt-best.c \
-	algo/yespower/yespower.c \
+	algo/yespower/yespower-gate.c \
+	algo/yespower/yespower-blake2b.c \
+	algo/yespower/crypto/blake2b-yp.c \
 	algo/yespower/sha256_p.c \
 	algo/yespower/yespower-opt.c
21
README.md
@@ -92,6 +92,7 @@ Supported Algorithms
    phi2-lux      identical to phi2
    pluck         Pluck:128 (Supcoin)
    polytimos     Ninja
+   power2b       MicroBitcoin (MBC)
    quark         Quark
    qubit         Qubit
    scrypt        scrypt(1024, 1, 1) (default)
@@ -121,13 +122,15 @@ Supported Algorithms
    x13sm3        hsr (Hshare)
    x14           X14
    x15           X15
-   x16r          Ravencoin (RVN) (original algo)
+   x16r
-   x16rv2        Ravencoin (RVN) (new algo)
+   x16rv2        Ravencoin (RVN)
    x16rt         Gincoin (GIN)
-   x16rt_veil    Veil (VEIL)
+   x16rt-veil    Veil (VEIL)
    x16s          Pigeoncoin (PGN)
    x17
    x21s
+   x22i
+   x25x
    xevan         Bitsend (BSD)
    yescrypt      Globalboost-Y (BSTY)
    yescryptr8    BitZeny (ZNY)
@@ -135,11 +138,15 @@ Supported Algorithms
    yescryptr32   WAVI
    yespower      Cryply
    yespowerr16   Yenten (YTN)
+   yespower-b2b  generic yespower + blake2b
    zr5           Ziftr

 Errata
 ------

+Old algorithms that are no longer used frequently will not have the latest
+optimizations.
+
 Cryptonight and variants are no longer supported, use another miner.

 Neoscrypt crashes on Windows, use legacy version.
@@ -158,10 +165,12 @@ Bugs
 ----

 Users are encouraged to post their bug reports using git issues or on the
-Bitcoin Talk forum at:
+Bitcoin Talk forum:

 https://bitcointalk.org/index.php?topic=1326803.0

+https://github.com/JayDDee/cpuminer-opt/issues
+
 All problem reports must be accompanied by a proper problem definition.
 This should include how the problem occurred, the command line and
 output from the miner showing the startup messages and any errors.
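Part of a proper problem definition can be gathered automatically. A minimal sketch, assuming a Linux system with /proc/cpuinfo (the exact fields shown are illustrative, not a required report format):

```shell
#!/bin/sh
# Collect basic system details commonly requested in bug reports.
echo "OS/arch: $(uname -sm)"
echo "CPU:     $(grep -m1 'model name' /proc/cpuinfo | cut -d: -f2- | sed 's/^ *//')"
```

Paste the output along with the miner command line and startup messages.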
@@ -173,10 +182,6 @@ Donations
 cpuminer-opt has no fees of any kind but donations are accepted.

 BTC: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
-ETH: 0x72122edabcae9d3f57eab0729305a425f6fef6d0
-LTC: LdUwoHJnux9r9EKqFWNvAi45kQompHk6e8
-BCH: 1QKYkB6atn4P7RFozyziAXLEnurwnUM1cQ
-BTG: GVUyECtRHeC5D58z9F3nGGfVQndwnsPnHQ

 Happy mining!
17
README.txt
@@ -15,20 +15,29 @@ the features listed at cpuminer startup to ensure you are mining at
 optimum speed using the best available features.

 Architecture names and compile options used are only provided for Intel
-Core series. Even the newest Pentium and Celeron CPUs are often missing
-features.
+Core series. Budget CPUs like Pentium and Celeron are often missing the
+latest features.

 AMD CPUs older than Piledriver, including Athlon x2 and Phenom II x4, are not
 supported by cpuminer-opt due to an incompatible implementation of SSE2 on
 these CPUs. Some algos may crash the miner with an invalid instruction.
 Users are recommended to use an unoptimized miner such as cpuminer-multi.

+More information for Intel and AMD CPU architectures and their features
+can be found on Wikipedia.
+
+https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures
+
+https://en.wikipedia.org/wiki/List_of_AMD_CPU_microarchitectures
+
 Exe name                Compile flags             Arch name

 cpuminer-sse2.exe       "-msse2"                  Core2, Nehalem
 cpuminer-aes-sse42.exe  "-march=westmere"         Westmere
-cpuminer-avx.exe        "-march=corei7-avx"       Sandy-Ivybridge
+cpuminer-avx.exe        "-march=corei7-avx"       Sandybridge
-cpuminer-avx2.exe       "-march=core-avx2"        Haswell, Sky-Kaby-Coffeelake
+cpuminer-avx2.exe       "-march=core-avx2 -maes"  Haswell, Skylake, Coffeelake
+cpuminer-avx512.exe     "-march=skylake-avx512"   Skylake-X, Cascadelake-X
 cpuminer-zen            "-march=znver1"           AMD Ryzen, Threadripper

 If you like this software feel free to donate:
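The table above maps compile flags to CPU generations. As a rough illustration of how a matching exe name could be chosen from the feature flags a Linux host reports (the cpuinfo flag names and the fallback order are assumptions, not part of the official docs):

```shell
#!/bin/sh
# Map /proc/cpuinfo feature flags to the closest binary from the table above.
flags="$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null)"
case "$flags" in
    *avx512f*) exe="cpuminer-avx512.exe" ;;
    *avx2*)    exe="cpuminer-avx2.exe" ;;
    *avx*)     exe="cpuminer-avx.exe" ;;
    *sse4_2*)  exe="cpuminer-aes-sse42.exe" ;;
    *)         exe="cpuminer-sse2.exe" ;;
esac
echo "$exe"
```

When in doubt the sse2 binary is the safe lowest common denominator.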
217
RELEASE_NOTES
@@ -1,21 +1,18 @@
 cpuminer-opt is a console program run from the command line using the
 keyboard, not the mouse.

-cpuminer-opt now supports HW SHA acceleration available on AMD Ryzen CPUs.
-This feature requires recent SW including GCC version 5 or higher and
-openssl version 1.1 or higher. It may also require using "-march=znver1"
-compile flag.
-
-cpuminer-opt is a console program, if you're using a mouse you're doing it
-wrong.
+See also README.md for a list of supported algorithms.

 Security warning
 ----------------

 Miner programs are often flagged as malware by antivirus programs. This is
-a false positive, they are flagged simply because they are cryptocurrency
-miners. The source code is open for anyone to inspect. If you don't trust
-the software, don't use it.
+usually a false positive, they are flagged simply because they are
+cryptocurrency miners. However, some malware masquerading as a miner has
+been spread using the cover that miners are known to be subject to false
+positives and users will dismiss the AV alert. Always be on alert.
+The source code of cpuminer-opt is open for anyone to inspect.
+If you don't trust the software don't download it.

 The cryptographic hashing code has been taken from trusted sources but has been
 modified for speed at the expense of accepted security practices. This
@@ -25,7 +22,7 @@ required.
 Compile Instructions
 --------------------

-See INSTALL_LINUX or INSTALL_WINDOWS fror compile instructions
+See INSTALL_LINUX or INSTALL_WINDOWS for compile instructions

 Requirements
 ------------
@@ -33,11 +30,207 @@ Requirements
 Intel Core2 or newer, or AMD Steamroller or newer CPU. ARM CPUs are not
 supported.

-64 bit Linux or Windows operating system. Apple and Android are not supported.
+64 bit Linux or Windows operating system. Apple, Android and Raspberry Pi
+are not supported. FreeBSD YMMV.
+
+Reporting bugs
+--------------
+
+Bugs can be reported by sending an email to JayDDee246@gmail.com or opening
+an issue in git: https://github.com/JayDDee/cpuminer-opt/issues
+
+Please include the following information:
+
+1. CPU model, operating system, cpuminer-opt version (must be latest),
+   binary file for Windows, changes to default build procedure for Linux.
+
+2. Exact command line (except user and pw) and initial output showing
+   the above requested info.
+
+3. Additional program output showing any error messages or other
+   pertinent data.
+
+4. A clear description of the problem including history, scope,
+   persistence or intermittence, and reproducibility.
+
+In simpler terms:
+
+What is it doing?
+What should it be doing instead?
+Did it work in a previous release?
+Does it happen for all algos? All pools? All options? Solo?
+Does it happen all the time?
+If not what makes it happen or not happen?
+
 Change Log
 ----------

+v3.11.5
+
+Fixed AVX512 detection that could cause compilation errors on CPUs
+without AVX512.
+
+Fixed "BLOCK SOLVED" log incorrectly displaying "Accepted" when a block
+is solved.
+Added share counter to share submitted & accepted logs.
+Added job id to share submitted log.
+Share submitted log is no longer highlighted blue, there was too much blue.
+
+Another CPU temperature fix for Linux.
+
+Added bug reporting tips to RELEASE NOTES.
+
+v3.11.4
+
+Fixed scrypt segfault since v3.9.9.1.
+
+Stale shares counted and reported separately from other rejected shares.
+
+Display of counters for solved blocks, rejects, stale shares suppressed in
+periodic summary when zero.
+
+v3.11.3
+
+Fixed x12 AVX2 again.
+
+More speed for allium: AVX2 +4%, AVX512 +6%, VAES +14%.
+
+Restored lost speed for x22i & x25x.
+
+v3.11.2
+
+Fixed x11gost (sib) AVX2 invalid shares.
+
+Fixed x16r, x16rv2, x16s, x16rt, x16rt-veil (veil), x21s.
+No shares were submitted when cube, shavite or echo were the first function
+in the hash order.
+
+Fixed all algos reporting stats problems when mining with SSE2.
+
+Faster Lyra2 AVX512: lyra2z +47%, lyra2rev3 +11%, allium +13%, x21s +6%
+
+Other minor performance improvements.
+
+Known issue:
+
+Lyra2 AVX512 improvements paradoxically reduced performance on x22i and x25x.
+https://github.com/JayDDee/cpuminer-opt/issues/225
+
+v3.11.1
+
+Faster panama for x25x AVX2 & AVX512.
+
+Fixed echo VAES for Xevan.
+
+Removed support for scryptjane algo.
+
+Reverted macro implementations of hash functions to SPH reference code
+for SSE2 versions of algos.
+
+v3.11.0
+
+Fixed x25x AVX512 lane 4 invalid shares.
+
+AVX512 for hex, phi2.
+
+VAES optimization for Intel Icelake CPUs for most algos recently optimized
+with AVX512, source code only.
+
+v3.10.7
+
+AVX512 for x25x, lbry, x13bcd (bcd).
+
+v3.10.6
+
+Added support for SSL stratum: stratum+tcps://
+
+Added job id reporting again, but leaner, suppressed with --quiet.
+
+AVX512 for x21s, x22i, lyra2z, allium.
+
+Fixed share overflow warnings mining lbry with Ryzen (SHA).
+
+v3.10.5
+
+AVX512 for x17, sonoa, xevan, hmq1725, lyra2rev3, lyra2rev2.
+Faster hmq1725 AVX2.
+
+v3.10.4
+
+AVX512 for x16r, x16rv2, x16rt, x16s, x16rt-veil (veil).
+
+v3.10.3
+
+AVX512 for x12, x13, x14, x15.
+Fixed x12 AVX2 invalid shares.
+
+v3.10.2
+
+AVX512 added for bmw512, c11, phi1612 (phi), qubit, skunk, x11, x11gost (sib).
+Fixed c11 AVX2 invalid shares.
+
+v3.10.1
+
+AVX512 for blake2b, nist5, quark, tribus.
+
+More broken lane fixes, fixed buffer overflow in skein AVX512, fixed
+quark invalid shares AVX2.
+
+Only the highest ranking feature in a class is listed at startup, lower ranking
+features are available but no longer listed.
+
+v3.10.0
+
+AVX512 is now supported on selected algos, Windows binary is now available.
+AVX512 optimizations are available for argon2d, blake2s, keccak, keccakc,
+skein & skein2.
+
+Fixed CPU temperature for some CPU models (Linux only).
+
+Fixed a bug that caused some lanes not to submit shares.
+
+Fixed some previously undetected buffer overflows.
+
+Lyra2rev2 3% faster SSE2 and AVX2.
+
+Added "-fno-asynchronous-unwind-tables" to AVX512 build script for Windows
+to fix known mingw issue.
+
+Changed AVX2 build script to explicitly add AES to address change in
+behaviour in GCC 9.
+
+v3.9.11
+
+Added x22i & x25x algos.
+Blake2s 2% faster AVX2 with Intel CPU, slower with Ryzen v1, v2 ?
+
+v3.9.10
+
+Faster X* algos with AVX2.
+Small improvements to summary stats report.
+
+v3.9.9.1
+
+Fixed a day1 bug that could cause the miner to idle for up to 2 minutes
+under certain circumstances.
+
+Redesigned summary stats report now includes session statistics.
+
+More robust handling of statistics to reduce corruption.
+
+Removed --hide-diff option.
+
+Better handling of cpu-affinity with more than 64 CPUs.
+
+v3.9.9
+
+Added power2b algo for MicroBitcoin.
+Added generic yespower-b2b (yespower + blake2b) algo to be used with
+the parameters introduced in v3.9.7 for yespower & yescrypt.
+Display additional info when a share is rejected.
+Some low level enhancements and minor tweaking of log output.
+RELEASE_NOTES (this file) and README.md added to Windows release package.
+
 v3.9.8.1

 Summary log report will be generated on stratum diff change or after 5 minutes,
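The v3.9.9.1 note about cpu-affinity refers to pinning worker threads to CPUs with a bitmask. As a quick illustration of how such a mask is formed (a generic sketch, not cpuminer-opt's actual affinity code):

```shell
#!/bin/sh
# Build a bitmask selecting the first n CPUs: bits 0..n-1 set.
n=4
mask=$(( (1 << n) - 1 ))
printf 'affinity mask for %d CPUs: 0x%x\n' "$n" "$mask"
```

A mask wider than 64 bits cannot fit in a single integer, which is why handling more than 64 CPUs needs special treatment.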
@@ -116,8 +116,6 @@ void init_algo_gate( algo_gate_t* gate )
    gate->get_nonceptr          = (void*)&std_get_nonceptr;
    gate->work_decode           = (void*)&std_le_work_decode;
    gate->decode_extra_data     = (void*)&do_nothing;
-   gate->wait_for_diff         = (void*)&std_wait_for_diff;
-   gate->get_max64             = (void*)&get_max64_0x1fffffLL;
    gate->gen_merkle_root       = (void*)&sha256d_gen_merkle_root;
    gate->stratum_gen_work      = (void*)&std_stratum_gen_work;
    gate->build_stratum_request = (void*)&std_le_build_stratum_request;
@@ -204,10 +202,10 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
    case ALGO_PHI2:        register_phi2_algo       ( gate ); break;
    case ALGO_PLUCK:       register_pluck_algo      ( gate ); break;
    case ALGO_POLYTIMOS:   register_polytimos_algo  ( gate ); break;
+   case ALGO_POWER2B:     register_power2b_algo    ( gate ); break;
    case ALGO_QUARK:       register_quark_algo      ( gate ); break;
    case ALGO_QUBIT:       register_qubit_algo      ( gate ); break;
    case ALGO_SCRYPT:      register_scrypt_algo     ( gate ); break;
-   case ALGO_SCRYPTJANE:  register_scryptjane_algo ( gate ); break;
    case ALGO_SHA256D:     register_sha256d_algo    ( gate ); break;
    case ALGO_SHA256Q:     register_sha256q_algo    ( gate ); break;
    case ALGO_SHA256T:     register_sha256t_algo    ( gate ); break;
@@ -239,6 +237,8 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
    case ALGO_X16S:        register_x16s_algo       ( gate ); break;
    case ALGO_X17:         register_x17_algo        ( gate ); break;
    case ALGO_X21S:        register_x21s_algo       ( gate ); break;
+   case ALGO_X22I:        register_x22i_algo       ( gate ); break;
+   case ALGO_X25X:        register_x25x_algo       ( gate ); break;
    case ALGO_XEVAN:       register_xevan_algo      ( gate ); break;
 /* case ALGO_YESCRYPT:    register_yescrypt_05_algo   ( gate ); break;
    case ALGO_YESCRYPTR8:  register_yescryptr8_05_algo ( gate ); break;
@@ -251,6 +251,7 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
    case ALGO_YESCRYPTR32:  register_yescryptr32_algo  ( gate ); break;
    case ALGO_YESPOWER:     register_yespower_algo     ( gate ); break;
    case ALGO_YESPOWERR16:  register_yespowerr16_algo  ( gate ); break;
+   case ALGO_YESPOWER_B2B: register_yespower_b2b_algo ( gate ); break;
    case ALGO_ZR5:          register_zr5_algo          ( gate ); break;
    default:
       applog(LOG_ERR,"FAIL: algo_gate registration failed, unknown algo %s.\n", algo_names[opt_algo] );
@@ -276,7 +277,7 @@ bool register_json_rpc2( algo_gate_t *gate )
|
|||||||
applog(LOG_WARNING,"supported by cpuminer-opt. Shares submitted will");
|
applog(LOG_WARNING,"supported by cpuminer-opt. Shares submitted will");
|
||||||
applog(LOG_WARNING,"likely be rejected. Proceed at your own risk.\n");
|
applog(LOG_WARNING,"likely be rejected. Proceed at your own risk.\n");
|
||||||
|
|
||||||
gate->wait_for_diff = (void*)&do_nothing;
|
// gate->wait_for_diff = (void*)&do_nothing;
|
||||||
gate->get_new_work = (void*)&jr2_get_new_work;
|
gate->get_new_work = (void*)&jr2_get_new_work;
|
||||||
gate->get_nonceptr = (void*)&jr2_get_nonceptr;
|
gate->get_nonceptr = (void*)&jr2_get_nonceptr;
|
||||||
gate->stratum_gen_work = (void*)&jr2_stratum_gen_work;
|
gate->stratum_gen_work = (void*)&jr2_stratum_gen_work;
|
||||||
@@ -315,6 +316,7 @@ const char* const algo_alias_map[][2] =
|
|||||||
{ "argon2d-crds", "argon2d250" },
|
{ "argon2d-crds", "argon2d250" },
|
||||||
{ "argon2d-dyn", "argon2d500" },
|
{ "argon2d-dyn", "argon2d500" },
|
||||||
{ "argon2d-uis", "argon2d4096" },
|
{ "argon2d-uis", "argon2d4096" },
|
||||||
|
{ "bcd", "x13bcd" },
|
||||||
{ "bitcore", "timetravel10" },
|
{ "bitcore", "timetravel10" },
|
||||||
{ "bitzeny", "yescryptr8" },
|
{ "bitzeny", "yescryptr8" },
|
||||||
{ "blake256r8", "blakecoin" },
|
{ "blake256r8", "blakecoin" },
|
||||||
|
|||||||
@@ -35,7 +35,7 @@
|
|||||||
// 6. Determine if other non existant functions are required.
|
// 6. Determine if other non existant functions are required.
|
||||||
// That is determined by the need to add code in cpu-miner.c
|
// That is determined by the need to add code in cpu-miner.c
|
||||||
// that applies only to the new algo. That is forbidden. All
|
// that applies only to the new algo. That is forbidden. All
|
||||||
// algo specific code must be in theh algo's file.
|
// algo specific code must be in the algo's file.
|
||||||
//
|
//
|
||||||
// 7. If new functions need to be added to the gate add the type
|
// 7. If new functions need to be added to the gate add the type
|
||||||
// to the structure, declare a null instance in this file and define
|
// to the structure, declare a null instance in this file and define
|
||||||
@@ -48,7 +48,7 @@
|
|||||||
// instances as they are defined by default, or unsafe functions that
|
// instances as they are defined by default, or unsafe functions that
|
||||||
// are not needed by the algo.
|
// are not needed by the algo.
|
||||||
//
|
//
|
||||||
// 9. Add an case entry to the switch/case in function register_gate
|
// 9. Add a case entry to the switch/case in function register_gate
|
||||||
// in file algo-gate-api.c for the new algo.
|
// in file algo-gate-api.c for the new algo.
|
||||||
//
|
//
|
||||||
// 10 If a new function type was defined add an entry to init algo_gate
|
// 10 If a new function type was defined add an entry to init algo_gate
|
||||||
@@ -89,10 +89,12 @@ typedef uint32_t set_t;
|
|||||||
#define SSE2_OPT 1
|
#define SSE2_OPT 1
|
||||||
#define AES_OPT 2
|
#define AES_OPT 2
|
||||||
#define SSE42_OPT 4
|
#define SSE42_OPT 4
|
||||||
#define AVX_OPT 8
|
#define AVX_OPT 8 // Sandybridge
|
||||||
#define AVX2_OPT 0x10
|
#define AVX2_OPT 0x10 // Haswell
|
||||||
#define SHA_OPT 0x20
|
#define SHA_OPT 0x20 // sha256 (Ryzen, Ice Lake)
|
||||||
#define AVX512_OPT 0x40
|
#define AVX512_OPT 0x40 // AVX512- F, VL, DQ, BW (Skylake-X)
|
||||||
|
#define VAES_OPT 0x80 // VAES (Ice Lake)
|
||||||
|
|
||||||
|
|
||||||
// return set containing all elements from sets a & b
|
// return set containing all elements from sets a & b
|
||||||
inline set_t set_union ( set_t a, set_t b ) { return a | b; }
|
inline set_t set_union ( set_t a, set_t b ) { return a | b; }
|
||||||
@@ -108,14 +110,7 @@ inline bool set_excl ( set_t a, set_t b ) { return (a & b) == 0; }

 typedef struct
 {
-// special case, only one target, provides a callback for scanhash to
-// submit work with less overhead.
-// bool (*submit_work ) ( struct thr_info*, const struct work* );

 // mandatory functions, must be overwritten
-// Added a 5th arg for the thread_info structure to replace the int thr id
-// in the first arg. Both will co-exist during the trasition.
-//int ( *scanhash ) ( int, struct work*, uint32_t, uint64_t* );
 int ( *scanhash ) ( struct work*, uint32_t, uint64_t*, struct thr_info* );

 // optional unsafe, must be overwritten if algo uses function
@@ -123,27 +118,55 @@ void ( *hash ) ( void*, const void*, uint32_t ) ;
 void ( *hash_suw ) ( void*, const void* );

 //optional, safe to use default in most cases

+// Allocate thread local buffers and other initialization specific to miner
+// threads.
 bool ( *miner_thread_init ) ( int );

+// Generate global blockheader from stratum data.
 void ( *stratum_gen_work ) ( struct stratum_ctx*, struct work* );

+// Get thread local copy of blockheader with unique nonce.
 void ( *get_new_work ) ( struct work*, struct work*, int, uint32_t*,
 bool );

+// Return pointer to nonce in blockheader.
 uint32_t *( *get_nonceptr ) ( uint32_t* );
-void ( *decode_extra_data ) ( struct work*, uint64_t* );
-void ( *wait_for_diff ) ( struct stratum_ctx* );
-int64_t ( *get_max64 ) ();
+
+// Decode getwork blockheader
 bool ( *work_decode ) ( const json_t*, struct work* );

+// Extra getwork data
+void ( *decode_extra_data ) ( struct work*, uint64_t* );
+
 bool ( *submit_getwork_result ) ( CURL*, struct work* );

 void ( *gen_merkle_root ) ( char*, struct stratum_ctx* );

+// Increment extranonce
 void ( *build_extraheader ) ( struct work*, struct stratum_ctx* );

 void ( *build_block_header ) ( struct work*, uint32_t, uint32_t*,
 uint32_t*, uint32_t, uint32_t );
+// Build mining.submit message
 void ( *build_stratum_request ) ( char*, struct work*, struct stratum_ctx* );

 char* ( *malloc_txs_request ) ( struct work* );

+// Big or little
 void ( *set_work_data_endian ) ( struct work* );

 double ( *calc_network_diff ) ( struct work* );

+// Wait for first work
 bool ( *ready_to_mine ) ( struct work*, struct stratum_ctx*, int );
-void ( *resync_threads ) ( struct work* );
+
+// Diverge mining threads
 bool ( *do_this_thread ) ( int );

+// After do_this_thread
+void ( *resync_threads ) ( struct work* );

 json_t* (*longpoll_rpc_call) ( CURL*, int*, char* );
 bool ( *stratum_handle_response )( json_t* );
 set_t optimizations;
@@ -198,8 +221,6 @@ void null_hash_suw();

 // optional safe targets, default listed first unless noted.

-void std_wait_for_diff();
-
 uint32_t *std_get_nonceptr( uint32_t *work_data );
 uint32_t *jr2_get_nonceptr( uint32_t *work_data );

@@ -214,14 +235,6 @@ void jr2_stratum_gen_work( struct stratum_ctx *sctx, struct work *work );
 void sha256d_gen_merkle_root( char *merkle_root, struct stratum_ctx *sctx );
 void SHA256_gen_merkle_root ( char *merkle_root, struct stratum_ctx *sctx );

-// pick your favorite or define your own
-int64_t get_max64_0x1fffffLL(); // default
-int64_t get_max64_0x40LL();
-int64_t get_max64_0x3ffff();
-int64_t get_max64_0x3fffffLL();
-int64_t get_max64_0x1ffff();
-int64_t get_max64_0xffffLL();

 bool std_le_work_decode( const json_t *val, struct work *work );
 bool std_be_work_decode( const json_t *val, struct work *work );
 bool jr2_work_decode( const json_t *val, struct work *work );
@@ -264,8 +277,8 @@ int std_get_work_data_size();
 // by calling the algo's register function.
 bool register_algo_gate( int algo, algo_gate_t *gate );

-// Override any default gate functions that are applicable and do any other
-// algo-specific initialization.
+// Called by algos toverride any default gate functions that are applicable
+// and do any other algo-specific initialization.
 // The register functions for all the algos can be declared here to reduce
 // compiler warnings but that's just more work for devs adding new algos.
 bool register_algo( algo_gate_t *gate );
@@ -278,5 +291,7 @@ bool register_json_rpc2( algo_gate_t *gate );
 // use this to call the hash function of an algo directly, ie util.c test.
 void exec_hash_function( int algo, void *output, const void *pdata );

+// Validate a string as a known algo and alias, updates arg to proper
+// algo name if valid alias, NULL if invalid alias or algo.
 void get_algo_alias( char **algo_or_alias );

@@ -62,9 +62,7 @@ int scanhash_argon2( struct work* work, uint32_t max_nonce,
 argon2hash(hash, endiandata);
 if (hash[7] <= Htarg && fulltest(hash, ptarget)) {
 pdata[19] = nonce;
-*hashes_done = pdata[19] - first_nonce;
-work_set_target_ratio(work, hash);
-return 1;
+submit_solution( work, hash, mythr );
 }
 nonce++;
 } while (nonce < max_nonce && !work_restart[thr_id].restart);
@@ -74,18 +72,12 @@ int scanhash_argon2( struct work* work, uint32_t max_nonce,
 return 0;
 }

-int64_t argon2_get_max64 ()
-{
-return 0x1ffLL;
-}
-
 bool register_argon2_algo( algo_gate_t* gate )
 {
 gate->optimizations = SSE2_OPT | AVX_OPT | AVX2_OPT;
 gate->scanhash = (void*)&scanhash_argon2;
 gate->hash = (void*)&argon2hash;
 gate->gen_merkle_root = (void*)&SHA256_gen_merkle_root;
-gate->get_max64 = (void*)&argon2_get_max64;
 opt_target_factor = 65536.0;

 return true;

@@ -179,12 +179,9 @@ int scanhash_argon2d4096( struct work *work, uint32_t max_nonce,
 return 0;
 }

-int64_t get_max64_0x1ff() { return 0x1ff; }
-
 bool register_argon2d4096_algo( algo_gate_t* gate )
 {
 gate->scanhash = (void*)&scanhash_argon2d4096;
-gate->get_max64 = (void*)&get_max64_0x1ff;
 gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
 opt_target_factor = 65536.0;
 return true;

@@ -21,7 +21,7 @@

 #include "argon2.h"
 #include "core.h"
+#include "simd-utils.h"
 #include "../blake2/blake2.h"
 #include "../blake2/blamka-round-opt.h"

@@ -38,22 +38,26 @@
 #if defined(__AVX512F__)

 static void fill_block( __m512i *state, const block *ref_block,
-block *next_block, int with_xor) {
+block *next_block, int with_xor )
+{
 __m512i block_XY[ARGON2_512BIT_WORDS_IN_BLOCK];
 unsigned int i;

-if (with_xor) {
-for (i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++) {
-state[i] = _mm512_xor_si512(
-state[i], _mm512_loadu_si512((const __m512i *)ref_block->v + i));
-block_XY[i] = _mm512_xor_si512(
-state[i], _mm512_loadu_si512((const __m512i *)next_block->v + i));
+if ( with_xor )
+{
+for ( i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++ )
+{
+state[i] = _mm512_xor_si512( state[i],
+_mm512_load_si512( (const __m512i*)ref_block->v + i ) );
+block_XY[i] = _mm512_xor_si512( state[i],
+_mm512_load_si512( (const __m512i*)next_block->v + i ) );
 }
-} else {
-for (i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++) {
-block_XY[i] = state[i] = _mm512_xor_si512(
-state[i], _mm512_loadu_si512((const __m512i *)ref_block->v + i));
 }
+else
+{
+for ( i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++ )
+block_XY[i] = state[i] = _mm512_xor_si512( state[i],
+_mm512_load_si512( (const __m512i*)ref_block->v + i ) );
 }

 BLAKE2_ROUND_1( state[ 0], state[ 1], state[ 2], state[ 3],
@@ -66,23 +70,10 @@ static void fill_block(__m512i *state, const block *ref_block,
 BLAKE2_ROUND_2( state[ 1], state[ 3], state[ 5], state[ 7],
 state[ 9], state[11], state[13], state[15] );

-/*
-for (i = 0; i < 2; ++i) {
-BLAKE2_ROUND_1(
-state[8 * i + 0], state[8 * i + 1], state[8 * i + 2], state[8 * i + 3],
-state[8 * i + 4], state[8 * i + 5], state[8 * i + 6], state[8 * i + 7]);
-}
-
-for (i = 0; i < 2; ++i) {
-BLAKE2_ROUND_2(
-state[2 * 0 + i], state[2 * 1 + i], state[2 * 2 + i], state[2 * 3 + i],
-state[2 * 4 + i], state[2 * 5 + i], state[2 * 6 + i], state[2 * 7 + i]);
-}
-*/
-
-for (i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++) {
+for ( i = 0; i < ARGON2_512BIT_WORDS_IN_BLOCK; i++ )
+{
 state[i] = _mm512_xor_si512( state[i], block_XY[i] );
-_mm512_storeu_si512((__m512i *)next_block->v + i, state[i]);
+_mm512_store_si512( (__m512i*)next_block->v + i, state[i] );
 }
 }

@@ -125,18 +116,6 @@ static void fill_block(__m256i *state, const block *ref_block,
 BLAKE2_ROUND_2( state[ 3], state[ 7], state[11], state[15],
 state[19], state[23], state[27], state[31] );

-/*
-for (i = 0; i < 4; ++i) {
-BLAKE2_ROUND_1(state[8 * i + 0], state[8 * i + 4], state[8 * i + 1], state[8 * i + 5],
-state[8 * i + 2], state[8 * i + 6], state[8 * i + 3], state[8 * i + 7]);
-}
-
-for (i = 0; i < 4; ++i) {
-BLAKE2_ROUND_2(state[ 0 + i], state[ 4 + i], state[ 8 + i], state[12 + i],
-state[16 + i], state[20 + i], state[24 + i], state[28 + i]);
-}
-*/
-
 for (i = 0; i < ARGON2_HWORDS_IN_BLOCK; i++) {
 state[i] = _mm256_xor_si256(state[i], block_XY[i]);
 _mm256_store_si256((__m256i *)next_block->v + i, state[i]);
@@ -153,14 +132,14 @@ static void fill_block(__m128i *state, const block *ref_block,
 if (with_xor) {
 for (i = 0; i < ARGON2_OWORDS_IN_BLOCK; i++) {
 state[i] = _mm_xor_si128(
-state[i], _mm_loadu_si128((const __m128i *)ref_block->v + i));
+state[i], _mm_load_si128((const __m128i *)ref_block->v + i));
 block_XY[i] = _mm_xor_si128(
-state[i], _mm_loadu_si128((const __m128i *)next_block->v + i));
+state[i], _mm_load_si128((const __m128i *)next_block->v + i));
 }
 } else {
 for (i = 0; i < ARGON2_OWORDS_IN_BLOCK; i++) {
 block_XY[i] = state[i] = _mm_xor_si128(
-state[i], _mm_loadu_si128((const __m128i *)ref_block->v + i));
+state[i], _mm_load_si128((const __m128i *)ref_block->v + i));
 }
 }

@@ -198,22 +177,9 @@ static void fill_block(__m128i *state, const block *ref_block,
 BLAKE2_ROUND( state[ 7], state[15], state[23], state[31],
 state[39], state[47], state[55], state[63] );

-/*
-for (i = 0; i < 8; ++i) {
-BLAKE2_ROUND(state[8 * i + 0], state[8 * i + 1], state[8 * i + 2],
-state[8 * i + 3], state[8 * i + 4], state[8 * i + 5],
-state[8 * i + 6], state[8 * i + 7]);
-}
-
-for (i = 0; i < 8; ++i) {
-BLAKE2_ROUND(state[8 * 0 + i], state[8 * 1 + i], state[8 * 2 + i],
-state[8 * 3 + i], state[8 * 4 + i], state[8 * 5 + i],
-state[8 * 6 + i], state[8 * 7 + i]);
-}
-*/
 for (i = 0; i < ARGON2_OWORDS_IN_BLOCK; i++) {
 state[i] = _mm_xor_si128(state[i], block_XY[i]);
-_mm_storeu_si128((__m128i *)next_block->v + i, state[i]);
+_mm_store_si128((__m128i *)next_block->v + i, state[i]);
 }
 }

@@ -184,9 +184,9 @@ static BLAKE2_INLINE __m128i fBlaMka(__m128i x, __m128i y) {

 #include <immintrin.h>

-#define rotr32 mm256_swap32_64
-#define rotr24 mm256_ror3x8_64
-#define rotr16 mm256_ror1x16_64
+#define rotr32( x ) mm256_ror_64( x, 32 )
+#define rotr24( x ) mm256_ror_64( x, 24 )
+#define rotr16( x ) mm256_ror_64( x, 16 )
 #define rotr63( x ) mm256_rol_64( x, 1 )

 //#define rotr32(x) _mm256_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1))
@@ -427,14 +427,14 @@ static __m512i muladd(__m512i x, __m512i y)
 #define SWAP_QUARTERS(A0, A1) \
 do { \
 SWAP_HALVES(A0, A1); \
-A0 = _mm512_permutexvar_epi64(_mm512_setr_epi64(0, 1, 4, 5, 2, 3, 6, 7), A0); \
-A1 = _mm512_permutexvar_epi64(_mm512_setr_epi64(0, 1, 4, 5, 2, 3, 6, 7), A1); \
+A0 = _mm512_shuffle_i64x2( A0, A0, 0xd8 ); \
+A1 = _mm512_shuffle_i64x2( A1, A1, 0xd8 ); \
 } while((void)0, 0)

 #define UNSWAP_QUARTERS(A0, A1) \
 do { \
-A0 = _mm512_permutexvar_epi64(_mm512_setr_epi64(0, 1, 4, 5, 2, 3, 6, 7), A0); \
-A1 = _mm512_permutexvar_epi64(_mm512_setr_epi64(0, 1, 4, 5, 2, 3, 6, 7), A1); \
+A0 = _mm512_shuffle_i64x2( A0, A0, 0xd8 ); \
+A1 = _mm512_shuffle_i64x2( A1, A1, 0xd8 ); \
 SWAP_HALVES(A0, A1); \
 } while((void)0, 0)

@@ -13,7 +13,7 @@ void blakehash_4way(void *state, const void *input)
 uint32_t vhash[8*4] __attribute__ ((aligned (64)));
 blake256r14_4way_context ctx;
 memcpy( &ctx, &blake_4w_ctx, sizeof ctx );
-blake256r14_4way( &ctx, input + (64<<2), 16 );
+blake256r14_4way_update( &ctx, input + (64<<2), 16 );
 blake256r14_4way_close( &ctx, vhash );
 dintrlv_4x32( state, state+32, state+64, state+96, vhash, 256 );
 }
@@ -36,7 +36,7 @@ int scanhash_blake_4way( struct work *work, uint32_t max_nonce,

 mm128_bswap32_intrlv80_4x32( vdata, pdata );
 blake256r14_4way_init( &blake_4w_ctx );
-blake256r14_4way( &blake_4w_ctx, vdata, 64 );
+blake256r14_4way_update( &blake_4w_ctx, vdata, 64 );

 do {
 *noncev = mm128_bswap_32( _mm_set_epi32( n+3, n+2, n+1, n ) );

@@ -1,18 +1,8 @@
 #include "blake-gate.h"

-int64_t blake_get_max64 ()
-{
-return 0x7ffffLL;
-}
-
 bool register_blake_algo( algo_gate_t* gate )
 {
 gate->optimizations = AVX2_OPT;
-gate->get_max64 = (void*)&blake_get_max64;
-//#if defined (__AVX2__) && defined (FOUR_WAY)
-// gate->optimizations = SSE2_OPT | AVX2_OPT;
-// gate->scanhash = (void*)&scanhash_blake_8way;
-// gate->hash = (void*)&blakehash_8way;
 #if defined(BLAKE_4WAY)
 four_way_not_tested();
 gate->scanhash = (void*)&scanhash_blake_4way;

@@ -37,8 +37,6 @@
|
|||||||
#ifndef __BLAKE_HASH_4WAY__
|
#ifndef __BLAKE_HASH_4WAY__
|
||||||
#define __BLAKE_HASH_4WAY__ 1
|
#define __BLAKE_HASH_4WAY__ 1
|
||||||
|
|
||||||
//#ifdef __SSE4_2__
|
|
||||||
|
|
||||||
#ifdef __cplusplus
|
#ifdef __cplusplus
|
||||||
extern "C"{
|
extern "C"{
|
||||||
#endif
|
#endif
|
||||||
@@ -51,49 +49,45 @@ extern "C"{
|
|||||||
|
|
||||||
#define SPH_SIZE_blake512 512
|
#define SPH_SIZE_blake512 512
|
||||||
|
|
||||||
// With SSE4.2 only Blake-256 4 way is available.
|
//////////////////////////
|
||||||
// With AVX2 Blake-256 8way & Blake-512 4 way are also available.
|
//
|
||||||
|
// Blake-256 4 way SSE2
|
||||||
// Blake-256 4 way
|
|
||||||
|
|
||||||
typedef struct {
|
typedef struct {
|
||||||
unsigned char buf[64<<2];
|
unsigned char buf[64<<2];
|
||||||
uint32_t H[8<<2];
|
uint32_t H[8<<2];
|
||||||
uint32_t S[4<<2];
|
|
||||||
// __m128i buf[16] __attribute__ ((aligned (64)));
|
|
||||||
// __m128i H[8];
|
|
||||||
// __m128i S[4];
|
|
||||||
size_t ptr;
|
size_t ptr;
|
||||||
uint32_t T0, T1;
|
uint32_t T0, T1;
|
||||||
int rounds; // 14 for blake, 8 for blakecoin & vanilla
|
int rounds; // 14 for blake, 8 for blakecoin & vanilla
|
||||||
} blake_4way_small_context __attribute__ ((aligned (64)));
|
} blake_4way_small_context __attribute__ ((aligned (64)));
|
||||||
|
|
||||||
// Default 14 rounds
|
// Default, 14 rounds, blake, decred
|
||||||
typedef blake_4way_small_context blake256_4way_context;
|
typedef blake_4way_small_context blake256_4way_context;
|
||||||
void blake256_4way_init(void *ctx);
|
void blake256_4way_init(void *ctx);
|
||||||
void blake256_4way(void *ctx, const void *data, size_t len);
|
void blake256_4way_update(void *ctx, const void *data, size_t len);
|
||||||
void blake256_4way_close(void *ctx, void *dst);
|
void blake256_4way_close(void *ctx, void *dst);
|
||||||
|
|
||||||
// 14 rounds, blake, decred
|
// 14 rounds, blake, decred
|
||||||
typedef blake_4way_small_context blake256r14_4way_context;
|
typedef blake_4way_small_context blake256r14_4way_context;
|
||||||
void blake256r14_4way_init(void *cc);
|
void blake256r14_4way_init(void *cc);
|
||||||
void blake256r14_4way(void *cc, const void *data, size_t len);
|
void blake256r14_4way_update(void *cc, const void *data, size_t len);
|
||||||
void blake256r14_4way_close(void *cc, void *dst);
|
void blake256r14_4way_close(void *cc, void *dst);
|
||||||
|
|
||||||
// 8 rounds, blakecoin, vanilla
|
// 8 rounds, blakecoin, vanilla
|
||||||
typedef blake_4way_small_context blake256r8_4way_context;
|
typedef blake_4way_small_context blake256r8_4way_context;
|
||||||
void blake256r8_4way_init(void *cc);
|
void blake256r8_4way_init(void *cc);
|
||||||
void blake256r8_4way(void *cc, const void *data, size_t len);
|
void blake256r8_4way_update(void *cc, const void *data, size_t len);
|
||||||
void blake256r8_4way_close(void *cc, void *dst);
|
void blake256r8_4way_close(void *cc, void *dst);
|
||||||
|
|
||||||
#ifdef __AVX2__
|
#ifdef __AVX2__
|
||||||
|
|
||||||
// Blake-256 8 way
|
//////////////////////////
|
||||||
|
//
|
||||||
|
// Blake-256 8 way AVX2
|
||||||
|
|
||||||
typedef struct {
|
typedef struct {
|
||||||
__m256i buf[16] __attribute__ ((aligned (64)));
|
__m256i buf[16] __attribute__ ((aligned (64)));
|
||||||
__m256i H[8];
|
__m256i H[8];
|
||||||
__m256i S[4];
|
|
||||||
size_t ptr;
|
size_t ptr;
|
||||||
sph_u32 T0, T1;
|
sph_u32 T0, T1;
|
||||||
int rounds; // 14 for blake, 8 for blakecoin & vanilla
|
int rounds; // 14 for blake, 8 for blakecoin & vanilla
|
||||||
@@ -102,39 +96,92 @@ typedef struct {
|
|||||||
// Default 14 rounds
|
// Default 14 rounds
|
||||||
typedef blake_8way_small_context blake256_8way_context;
|
typedef blake_8way_small_context blake256_8way_context;
|
||||||
void blake256_8way_init(void *cc);
|
void blake256_8way_init(void *cc);
|
||||||
void blake256_8way(void *cc, const void *data, size_t len);
|
void blake256_8way_update(void *cc, const void *data, size_t len);
|
||||||
void blake256_8way_close(void *cc, void *dst);
|
void blake256_8way_close(void *cc, void *dst);
|
||||||
|
|
||||||
// 14 rounds, blake, decred
|
// 14 rounds, blake, decred
|
||||||
typedef blake_8way_small_context blake256r14_8way_context;
|
typedef blake_8way_small_context blake256r14_8way_context;
|
||||||
void blake256r14_8way_init(void *cc);
|
void blake256r14_8way_init(void *cc);
|
||||||
void blake256r14_8way(void *cc, const void *data, size_t len);
|
void blake256r14_8way_update(void *cc, const void *data, size_t len);
|
||||||
void blake256r14_8way_close(void *cc, void *dst);
|
void blake256r14_8way_close(void *cc, void *dst);
|
||||||
|
|
||||||
// 8 rounds, blakecoin, vanilla
|
// 8 rounds, blakecoin, vanilla
|
||||||
typedef blake_8way_small_context blake256r8_8way_context;
|
typedef blake_8way_small_context blake256r8_8way_context;
|
||||||
 void blake256r8_8way_init(void *cc);
-void blake256r8_8way(void *cc, const void *data, size_t len);
+void blake256r8_8way_update(void *cc, const void *data, size_t len);
 void blake256r8_8way_close(void *cc, void *dst);
 
-// Blake-512 4 way
+// Blake-512 4 way AVX2
 
 typedef struct {
-   __m256i buf[16] __attribute__ ((aligned (64)));
+   __m256i buf[16];
    __m256i H[8];
    __m256i S[4];
    size_t ptr;
    sph_u64 T0, T1;
-} blake_4way_big_context;
+} blake_4way_big_context __attribute__ ((aligned (128)));
 
 typedef blake_4way_big_context blake512_4way_context;
 
-void blake512_4way_init(void *cc);
-void blake512_4way(void *cc, const void *data, size_t len);
+void blake512_4way_init( blake_4way_big_context *sc );
+void blake512_4way_update( void *cc, const void *data, size_t len );
 void blake512_4way_close( void *cc, void *dst );
-void blake512_4way_addbits_and_close(
-   void *cc, unsigned ub, unsigned n, void *dst);
+void blake512_4way_full( blake_4way_big_context *sc, void * dst,
+                         const void *data, size_t len );
 
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+////////////////////////////
+//
+// Blake-256 16 way AVX512
+
+typedef struct {
+   __m512i buf[16];
+   __m512i H[8];
+   size_t ptr;
+   uint32_t T0, T1;
+   int rounds;   // 14 for blake, 8 for blakecoin & vanilla
+} blake_16way_small_context __attribute__ ((aligned (128)));
+
+// Default 14 rounds
+typedef blake_16way_small_context blake256_16way_context;
+void blake256_16way_init(void *cc);
+void blake256_16way_update(void *cc, const void *data, size_t len);
+void blake256_16way_close(void *cc, void *dst);
+
+// 14 rounds, blake, decred
+typedef blake_16way_small_context blake256r14_16way_context;
+void blake256r14_16way_init(void *cc);
+void blake256r14_16way_update(void *cc, const void *data, size_t len);
+void blake256r14_16way_close(void *cc, void *dst);
+
+// 8 rounds, blakecoin, vanilla
+typedef blake_16way_small_context blake256r8_16way_context;
+void blake256r8_16way_init(void *cc);
+void blake256r8_16way_update(void *cc, const void *data, size_t len);
+void blake256r8_16way_close(void *cc, void *dst);
+
+////////////////////////////
+//
+//// Blake-512 8 way AVX512
+
+typedef struct {
+   __m512i buf[16];
+   __m512i H[8];
+   __m512i S[4];
+   size_t ptr;
+   sph_u64 T0, T1;
+} blake_8way_big_context __attribute__ ((aligned (128)));
+
+typedef blake_8way_big_context blake512_8way_context;
+
+void blake512_8way_init( blake_8way_big_context *sc );
+void blake512_8way_update( void *cc, const void *data, size_t len );
+void blake512_8way_close( void *cc, void *dst );
+void blake512_8way_full( blake_8way_big_context *sc, void * dst,
+                         const void *data, size_t len );
+
+#endif // AVX512
 #endif // AVX2
 
 #ifdef __cplusplus
@@ -304,16 +304,17 @@ static const sph_u32 CS[16] = {
 
 #endif
 
+// Blake-256 4 way
 
 #define GS_4WAY( m0, m1, c0, c1, a, b, c, d ) \
 do { \
-   a = _mm_add_epi32( _mm_add_epi32( _mm_xor_si128( \
-                 _mm_set1_epi32( c1 ), m0 ), b ), a ); \
+   a = _mm_add_epi32( _mm_add_epi32( a, b ), \
+                 _mm_xor_si128( _mm_set1_epi32( c1 ), m0 ) ); \
    d = mm128_ror_32( _mm_xor_si128( d, a ), 16 ); \
    c = _mm_add_epi32( c, d ); \
    b = mm128_ror_32( _mm_xor_si128( b, c ), 12 ); \
-   a = _mm_add_epi32( _mm_add_epi32( _mm_xor_si128( \
-                 _mm_set1_epi32( c0 ), m1 ), b ), a ); \
+   a = _mm_add_epi32( _mm_add_epi32( a, b ), \
+                 _mm_xor_si128( _mm_set1_epi32( c0 ), m1 ) ); \
    d = mm128_ror_32( _mm_xor_si128( d, a ), 8 ); \
    c = _mm_add_epi32( c, d ); \
    b = mm128_ror_32( _mm_xor_si128( b, c ), 7 ); \
@@ -321,7 +322,8 @@ do { \
 
 #if SPH_COMPACT_BLAKE_32
 
-// Blake-256 4 way
+// Not used
+#if 0
 
 #define ROUND_S_4WAY(r) do { \
    GS_4WAY(M[sigma[r][0x0]], M[sigma[r][0x1]], \
@@ -342,6 +344,8 @@ do { \
       CS[sigma[r][0xE]], CS[sigma[r][0xF]], V3, V4, V9, VE); \
 } while (0)
 
+#endif
+
 #else
 
 #define ROUND_S_4WAY(r) do { \
@@ -359,7 +363,6 @@ do { \
 
 #define DECL_STATE32_4WAY \
    __m128i H0, H1, H2, H3, H4, H5, H6, H7; \
-   __m128i S0, S1, S2, S3; \
    uint32_t T0, T1;
 
 #define READ_STATE32_4WAY(state) do { \
@@ -371,10 +374,6 @@ do { \
    H5 = casti_m128i( state->H, 5 ); \
    H6 = casti_m128i( state->H, 6 ); \
    H7 = casti_m128i( state->H, 7 ); \
-   S0 = casti_m128i( state->S, 0 ); \
-   S1 = casti_m128i( state->S, 1 ); \
-   S2 = casti_m128i( state->S, 2 ); \
-   S3 = casti_m128i( state->S, 3 ); \
    T0 = (state)->T0; \
    T1 = (state)->T1; \
 } while (0)
@@ -388,17 +387,13 @@ do { \
    casti_m128i( state->H, 5 ) = H5; \
    casti_m128i( state->H, 6 ) = H6; \
    casti_m128i( state->H, 7 ) = H7; \
-   casti_m128i( state->S, 0 ) = S0; \
-   casti_m128i( state->S, 1 ) = S1; \
-   casti_m128i( state->S, 2 ) = S2; \
-   casti_m128i( state->S, 3 ) = S3; \
    (state)->T0 = T0; \
    (state)->T1 = T1; \
 } while (0)
 
 #if SPH_COMPACT_BLAKE_32
 // not used
+#if 0
 #define COMPRESS32_4WAY( rounds ) do { \
    __m128i M[16]; \
    __m128i V0, V1, V2, V3, V4, V5, V6, V7; \
@@ -441,6 +436,7 @@ do { \
    H7 = _mm_xor_si128( _mm_xor_si128( \
                  _mm_xor_si128( S3, V7 ), VF ), H7 ); \
 } while (0)
+#endif
 
 #else
 
@@ -508,10 +504,10 @@ do { \
    V5 = H5; \
    V6 = H6; \
    V7 = H7; \
-   V8 = _mm_xor_si128( S0, m128_const1_64( 0x243F6A88243F6A88 ) ); \
-   V9 = _mm_xor_si128( S1, m128_const1_64( 0x85A308D385A308D3 ) ); \
-   VA = _mm_xor_si128( S2, m128_const1_64( 0x13198A2E13198A2E ) ); \
-   VB = _mm_xor_si128( S3, m128_const1_64( 0x0370734403707344 ) ); \
+   V8 = m128_const1_64( 0x243F6A88243F6A88 ); \
+   V9 = m128_const1_64( 0x85A308D385A308D3 ); \
+   VA = m128_const1_64( 0x13198A2E13198A2E ); \
+   VB = m128_const1_64( 0x0370734403707344 ); \
    VC = _mm_xor_si128( _mm_set1_epi32( T0 ), \
                  m128_const1_64( 0xA4093822A4093822 ) ); \
    VD = _mm_xor_si128( _mm_set1_epi32( T0 ), \
@@ -538,14 +534,14 @@ do { \
    ROUND_S_4WAY(2); \
    ROUND_S_4WAY(3); \
    } \
-   H0 = mm128_xor4( V8, V0, S0, H0 ); \
-   H1 = mm128_xor4( V9, V1, S1, H1 ); \
-   H2 = mm128_xor4( VA, V2, S2, H2 ); \
-   H3 = mm128_xor4( VB, V3, S3, H3 ); \
-   H4 = mm128_xor4( VC, V4, S0, H4 ); \
-   H5 = mm128_xor4( VD, V5, S1, H5 ); \
-   H6 = mm128_xor4( VE, V6, S2, H6 ); \
-   H7 = mm128_xor4( VF, V7, S3, H7 ); \
+   H0 = _mm_xor_si128( _mm_xor_si128( V8, V0 ), H0 ); \
+   H1 = _mm_xor_si128( _mm_xor_si128( V9, V1 ), H1 ); \
+   H2 = _mm_xor_si128( _mm_xor_si128( VA, V2 ), H2 ); \
+   H3 = _mm_xor_si128( _mm_xor_si128( VB, V3 ), H3 ); \
+   H4 = _mm_xor_si128( _mm_xor_si128( VC, V4 ), H4 ); \
+   H5 = _mm_xor_si128( _mm_xor_si128( VD, V5 ), H5 ); \
+   H6 = _mm_xor_si128( _mm_xor_si128( VE, V6 ), H6 ); \
+   H7 = _mm_xor_si128( _mm_xor_si128( VF, V7 ), H7 ); \
 } while (0)
 
 #endif
@@ -556,13 +552,13 @@ do { \
 
 #define GS_8WAY( m0, m1, c0, c1, a, b, c, d ) \
 do { \
-   a = _mm256_add_epi32( _mm256_add_epi32( _mm256_xor_si256( \
-                 _mm256_set1_epi32( c1 ), m0 ), b ), a ); \
+   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), \
+                 _mm256_xor_si256( _mm256_set1_epi32( c1 ), m0 ) ); \
    d = mm256_ror_32( _mm256_xor_si256( d, a ), 16 ); \
    c = _mm256_add_epi32( c, d ); \
    b = mm256_ror_32( _mm256_xor_si256( b, c ), 12 ); \
-   a = _mm256_add_epi32( _mm256_add_epi32( _mm256_xor_si256( \
-                 _mm256_set1_epi32( c0 ), m1 ), b ), a ); \
+   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), \
+                 _mm256_xor_si256( _mm256_set1_epi32( c0 ), m1 ) ); \
    d = mm256_ror_32( _mm256_xor_si256( d, a ), 8 ); \
    c = _mm256_add_epi32( c, d ); \
    b = mm256_ror_32( _mm256_xor_si256( b, c ), 7 ); \
@@ -581,7 +577,6 @@ do { \
 
 #define DECL_STATE32_8WAY \
    __m256i H0, H1, H2, H3, H4, H5, H6, H7; \
-   __m256i S0, S1, S2, S3; \
    sph_u32 T0, T1;
 
 #define READ_STATE32_8WAY(state) \
@@ -594,10 +589,6 @@ do { \
    H5 = (state)->H[5]; \
    H6 = (state)->H[6]; \
    H7 = (state)->H[7]; \
-   S0 = (state)->S[0]; \
-   S1 = (state)->S[1]; \
-   S2 = (state)->S[2]; \
-   S3 = (state)->S[3]; \
    T0 = (state)->T0; \
    T1 = (state)->T1; \
 } while (0)
@@ -612,10 +603,6 @@ do { \
    (state)->H[5] = H5; \
    (state)->H[6] = H6; \
    (state)->H[7] = H7; \
-   (state)->S[0] = S0; \
-   (state)->S[1] = S1; \
-   (state)->S[2] = S2; \
-   (state)->S[3] = S3; \
    (state)->T0 = T0; \
    (state)->T1 = T1; \
 } while (0)
@@ -635,10 +622,10 @@ do { \
    V5 = H5; \
    V6 = H6; \
    V7 = H7; \
-   V8 = _mm256_xor_si256( S0, m256_const1_64( 0x243F6A88243F6A88 ) ); \
-   V9 = _mm256_xor_si256( S1, m256_const1_64( 0x85A308D385A308D3 ) ); \
-   VA = _mm256_xor_si256( S2, m256_const1_64( 0x13198A2E13198A2E ) ); \
-   VB = _mm256_xor_si256( S3, m256_const1_64( 0x0370734403707344 ) ); \
+   V8 = m256_const1_64( 0x243F6A88243F6A88 ); \
+   V9 = m256_const1_64( 0x85A308D385A308D3 ); \
+   VA = m256_const1_64( 0x13198A2E13198A2E ); \
+   VB = m256_const1_64( 0x0370734403707344 ); \
    VC = _mm256_xor_si256( _mm256_set1_epi32( T0 ),\
                  m256_const1_64( 0xA4093822A4093822 ) ); \
    VD = _mm256_xor_si256( _mm256_set1_epi32( T0 ),\
@@ -647,7 +634,7 @@ do { \
                  m256_const1_64( 0x082EFA98082EFA98 ) ); \
    VF = _mm256_xor_si256( _mm256_set1_epi32( T1 ), \
                  m256_const1_64( 0xEC4E6C89EC4E6C89 ) ); \
-   shuf_bswap32 = m256_const_64( 0x0c0d0e0f08090a0b, 0x0405060700010203, \
+   shuf_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b, 0x1415161710111213, \
                                  0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
    M0 = _mm256_shuffle_epi8( * buf , shuf_bswap32 ); \
    M1 = _mm256_shuffle_epi8( *(buf+ 1), shuf_bswap32 ); \
@@ -682,17 +669,155 @@ do { \
    ROUND_S_8WAY(2); \
    ROUND_S_8WAY(3); \
    } \
-   H0 = mm256_xor4( V8, V0, S0, H0 ); \
-   H1 = mm256_xor4( V9, V1, S1, H1 ); \
-   H2 = mm256_xor4( VA, V2, S2, H2 ); \
-   H3 = mm256_xor4( VB, V3, S3, H3 ); \
-   H4 = mm256_xor4( VC, V4, S0, H4 ); \
-   H5 = mm256_xor4( VD, V5, S1, H5 ); \
-   H6 = mm256_xor4( VE, V6, S2, H6 ); \
-   H7 = mm256_xor4( VF, V7, S3, H7 ); \
+   H0 = _mm256_xor_si256( _mm256_xor_si256( V8, V0 ), H0 ); \
+   H1 = _mm256_xor_si256( _mm256_xor_si256( V9, V1 ), H1 ); \
+   H2 = _mm256_xor_si256( _mm256_xor_si256( VA, V2 ), H2 ); \
+   H3 = _mm256_xor_si256( _mm256_xor_si256( VB, V3 ), H3 ); \
+   H4 = _mm256_xor_si256( _mm256_xor_si256( VC, V4 ), H4 ); \
+   H5 = _mm256_xor_si256( _mm256_xor_si256( VD, V5 ), H5 ); \
+   H6 = _mm256_xor_si256( _mm256_xor_si256( VE, V6 ), H6 ); \
+   H7 = _mm256_xor_si256( _mm256_xor_si256( VF, V7 ), H7 ); \
 } while (0)
 
+
+#endif
+
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+// Blake-256 16 way AVX512
+
+#define GS_16WAY( m0, m1, c0, c1, a, b, c, d ) \
+do { \
+   a = _mm512_add_epi32( _mm512_add_epi32( a, b ), \
+                 _mm512_xor_si512( _mm512_set1_epi32( c1 ), m0 ) ); \
+   d = mm512_ror_32( _mm512_xor_si512( d, a ), 16 ); \
+   c = _mm512_add_epi32( c, d ); \
+   b = mm512_ror_32( _mm512_xor_si512( b, c ), 12 ); \
+   a = _mm512_add_epi32( _mm512_add_epi32( a, b ), \
+                 _mm512_xor_si512( _mm512_set1_epi32( c0 ), m1 ) ); \
+   d = mm512_ror_32( _mm512_xor_si512( d, a ), 8 ); \
+   c = _mm512_add_epi32( c, d ); \
+   b = mm512_ror_32( _mm512_xor_si512( b, c ), 7 ); \
+} while (0)
+
+#define ROUND_S_16WAY(r) do { \
+   GS_16WAY(Mx(r, 0), Mx(r, 1), CSx(r, 0), CSx(r, 1), V0, V4, V8, VC); \
+   GS_16WAY(Mx(r, 2), Mx(r, 3), CSx(r, 2), CSx(r, 3), V1, V5, V9, VD); \
+   GS_16WAY(Mx(r, 4), Mx(r, 5), CSx(r, 4), CSx(r, 5), V2, V6, VA, VE); \
+   GS_16WAY(Mx(r, 6), Mx(r, 7), CSx(r, 6), CSx(r, 7), V3, V7, VB, VF); \
+   GS_16WAY(Mx(r, 8), Mx(r, 9), CSx(r, 8), CSx(r, 9), V0, V5, VA, VF); \
+   GS_16WAY(Mx(r, A), Mx(r, B), CSx(r, A), CSx(r, B), V1, V6, VB, VC); \
+   GS_16WAY(Mx(r, C), Mx(r, D), CSx(r, C), CSx(r, D), V2, V7, V8, VD); \
+   GS_16WAY(Mx(r, E), Mx(r, F), CSx(r, E), CSx(r, F), V3, V4, V9, VE); \
+} while (0)
+
+#define DECL_STATE32_16WAY \
+   __m512i H0, H1, H2, H3, H4, H5, H6, H7; \
+   sph_u32 T0, T1;
+
+#define READ_STATE32_16WAY(state) \
+do { \
+   H0 = (state)->H[0]; \
+   H1 = (state)->H[1]; \
+   H2 = (state)->H[2]; \
+   H3 = (state)->H[3]; \
+   H4 = (state)->H[4]; \
+   H5 = (state)->H[5]; \
+   H6 = (state)->H[6]; \
+   H7 = (state)->H[7]; \
+   T0 = (state)->T0; \
+   T1 = (state)->T1; \
+} while (0)
+
+#define WRITE_STATE32_16WAY(state) \
+do { \
+   (state)->H[0] = H0; \
+   (state)->H[1] = H1; \
+   (state)->H[2] = H2; \
+   (state)->H[3] = H3; \
+   (state)->H[4] = H4; \
+   (state)->H[5] = H5; \
+   (state)->H[6] = H6; \
+   (state)->H[7] = H7; \
+   (state)->T0 = T0; \
+   (state)->T1 = T1; \
+} while (0)
+
+#define COMPRESS32_16WAY( rounds ) \
+do { \
+   __m512i M0, M1, M2, M3, M4, M5, M6, M7; \
+   __m512i M8, M9, MA, MB, MC, MD, ME, MF; \
+   __m512i V0, V1, V2, V3, V4, V5, V6, V7; \
+   __m512i V8, V9, VA, VB, VC, VD, VE, VF; \
+   __m512i shuf_bswap32; \
+   V0 = H0; \
+   V1 = H1; \
+   V2 = H2; \
+   V3 = H3; \
+   V4 = H4; \
+   V5 = H5; \
+   V6 = H6; \
+   V7 = H7; \
+   V8 = m512_const1_64( 0x243F6A88243F6A88 ); \
+   V9 = m512_const1_64( 0x85A308D385A308D3 ); \
+   VA = m512_const1_64( 0x13198A2E13198A2E ); \
+   VB = m512_const1_64( 0x0370734403707344 ); \
+   VC = _mm512_xor_si512( _mm512_set1_epi32( T0 ),\
+                 m512_const1_64( 0xA4093822A4093822 ) ); \
+   VD = _mm512_xor_si512( _mm512_set1_epi32( T0 ),\
+                 m512_const1_64( 0x299F31D0299F31D0 ) ); \
+   VE = _mm512_xor_si512( _mm512_set1_epi32( T1 ), \
+                 m512_const1_64( 0x082EFA98082EFA98 ) ); \
+   VF = _mm512_xor_si512( _mm512_set1_epi32( T1 ), \
+                 m512_const1_64( 0xEC4E6C89EC4E6C89 ) ); \
+   shuf_bswap32 = m512_const_64( 0x3c3d3e3f38393a3b, 0x3435363730313233, \
+                                 0x2c2d2e2f28292a2b, 0x2425262720212223, \
+                                 0x1c1d1e1f18191a1b, 0x1415161710111213, \
+                                 0x0c0d0e0f08090a0b, 0x0405060700010203 ); \
+   M0 = _mm512_shuffle_epi8( * buf , shuf_bswap32 ); \
+   M1 = _mm512_shuffle_epi8( *(buf+ 1), shuf_bswap32 ); \
+   M2 = _mm512_shuffle_epi8( *(buf+ 2), shuf_bswap32 ); \
+   M3 = _mm512_shuffle_epi8( *(buf+ 3), shuf_bswap32 ); \
+   M4 = _mm512_shuffle_epi8( *(buf+ 4), shuf_bswap32 ); \
+   M5 = _mm512_shuffle_epi8( *(buf+ 5), shuf_bswap32 ); \
+   M6 = _mm512_shuffle_epi8( *(buf+ 6), shuf_bswap32 ); \
+   M7 = _mm512_shuffle_epi8( *(buf+ 7), shuf_bswap32 ); \
+   M8 = _mm512_shuffle_epi8( *(buf+ 8), shuf_bswap32 ); \
+   M9 = _mm512_shuffle_epi8( *(buf+ 9), shuf_bswap32 ); \
+   MA = _mm512_shuffle_epi8( *(buf+10), shuf_bswap32 ); \
+   MB = _mm512_shuffle_epi8( *(buf+11), shuf_bswap32 ); \
+   MC = _mm512_shuffle_epi8( *(buf+12), shuf_bswap32 ); \
+   MD = _mm512_shuffle_epi8( *(buf+13), shuf_bswap32 ); \
+   ME = _mm512_shuffle_epi8( *(buf+14), shuf_bswap32 ); \
+   MF = _mm512_shuffle_epi8( *(buf+15), shuf_bswap32 ); \
+   ROUND_S_16WAY(0); \
+   ROUND_S_16WAY(1); \
+   ROUND_S_16WAY(2); \
+   ROUND_S_16WAY(3); \
+   ROUND_S_16WAY(4); \
+   ROUND_S_16WAY(5); \
+   ROUND_S_16WAY(6); \
+   ROUND_S_16WAY(7); \
+   if (rounds == 14) \
+   { \
+      ROUND_S_16WAY(8); \
+      ROUND_S_16WAY(9); \
+      ROUND_S_16WAY(0); \
+      ROUND_S_16WAY(1); \
+      ROUND_S_16WAY(2); \
+      ROUND_S_16WAY(3); \
+   } \
+   H0 = _mm512_xor_si512( _mm512_xor_si512( V8, V0 ), H0 ); \
+   H1 = _mm512_xor_si512( _mm512_xor_si512( V9, V1 ), H1 ); \
+   H2 = _mm512_xor_si512( _mm512_xor_si512( VA, V2 ), H2 ); \
+   H3 = _mm512_xor_si512( _mm512_xor_si512( VB, V3 ), H3 ); \
+   H4 = _mm512_xor_si512( _mm512_xor_si512( VC, V4 ), H4 ); \
+   H5 = _mm512_xor_si512( _mm512_xor_si512( VD, V5 ), H5 ); \
+   H6 = _mm512_xor_si512( _mm512_xor_si512( VE, V6 ), H6 ); \
+   H7 = _mm512_xor_si512( _mm512_xor_si512( VF, V7 ), H7 ); \
+} while (0)
+
 #endif
 
 // Blake-256 4 way
@@ -703,7 +828,6 @@ static void
|
|||||||
blake32_4way_init( blake_4way_small_context *ctx, const uint32_t *iv,
|
blake32_4way_init( blake_4way_small_context *ctx, const uint32_t *iv,
|
||||||
const uint32_t *salt, int rounds )
|
const uint32_t *salt, int rounds )
|
||||||
{
|
{
|
||||||
__m128i zero = m128_zero;
|
|
||||||
casti_m128i( ctx->H, 0 ) = m128_const1_64( 0x6A09E6676A09E667 );
|
casti_m128i( ctx->H, 0 ) = m128_const1_64( 0x6A09E6676A09E667 );
|
||||||
casti_m128i( ctx->H, 1 ) = m128_const1_64( 0xBB67AE85BB67AE85 );
|
casti_m128i( ctx->H, 1 ) = m128_const1_64( 0xBB67AE85BB67AE85 );
|
||||||
casti_m128i( ctx->H, 2 ) = m128_const1_64( 0x3C6EF3723C6EF372 );
|
casti_m128i( ctx->H, 2 ) = m128_const1_64( 0x3C6EF3723C6EF372 );
|
||||||
@@ -712,18 +836,14 @@ blake32_4way_init( blake_4way_small_context *ctx, const uint32_t *iv,
|
|||||||
casti_m128i( ctx->H, 5 ) = m128_const1_64( 0x9B05688C9B05688C );
|
casti_m128i( ctx->H, 5 ) = m128_const1_64( 0x9B05688C9B05688C );
|
||||||
casti_m128i( ctx->H, 6 ) = m128_const1_64( 0x1F83D9AB1F83D9AB );
|
casti_m128i( ctx->H, 6 ) = m128_const1_64( 0x1F83D9AB1F83D9AB );
|
||||||
casti_m128i( ctx->H, 7 ) = m128_const1_64( 0x5BE0CD195BE0CD19 );
|
casti_m128i( ctx->H, 7 ) = m128_const1_64( 0x5BE0CD195BE0CD19 );
|
||||||
|
|
||||||
casti_m128i( ctx->S, 0 ) = zero;
|
|
||||||
casti_m128i( ctx->S, 1 ) = zero;
|
|
||||||
casti_m128i( ctx->S, 2 ) = zero;
|
|
||||||
casti_m128i( ctx->S, 3 ) = zero;
|
|
||||||
ctx->T0 = ctx->T1 = 0;
|
ctx->T0 = ctx->T1 = 0;
|
||||||
ctx->ptr = 0;
|
ctx->ptr = 0;
|
||||||
ctx->rounds = rounds;
|
ctx->rounds = rounds;
|
||||||
}
|
}
|
||||||
|
|
||||||
static void
|
static void
|
||||||
blake32_4way( blake_4way_small_context *ctx, const void *data, size_t len )
|
blake32_4way( blake_4way_small_context *ctx, const void *data,
|
||||||
|
size_t len )
|
||||||
{
|
{
|
||||||
__m128i *buf = (__m128i*)ctx->buf;
|
__m128i *buf = (__m128i*)ctx->buf;
|
||||||
size_t bptr = ctx->ptr<<2;
|
size_t bptr = ctx->ptr<<2;
|
||||||
@@ -824,7 +944,6 @@ static void
|
|||||||
blake32_8way_init( blake_8way_small_context *sc, const sph_u32 *iv,
|
blake32_8way_init( blake_8way_small_context *sc, const sph_u32 *iv,
|
||||||
const sph_u32 *salt, int rounds )
|
const sph_u32 *salt, int rounds )
|
||||||
{
|
{
|
||||||
__m256i zero = m256_zero;
|
|
||||||
casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E6676A09E667 );
|
casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E6676A09E667 );
|
||||||
casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE85BB67AE85 );
|
casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE85BB67AE85 );
|
||||||
casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF3723C6EF372 );
|
casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF3723C6EF372 );
|
||||||
@@ -833,10 +952,6 @@ blake32_8way_init( blake_8way_small_context *sc, const sph_u32 *iv,
|
|||||||
casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C9B05688C );
|
casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C9B05688C );
|
||||||
casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9AB1F83D9AB );
|
casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9AB1F83D9AB );
|
||||||
casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD195BE0CD19 );
|
casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD195BE0CD19 );
|
||||||
casti_m256i( sc->S, 0 ) = zero;
|
|
||||||
casti_m256i( sc->S, 1 ) = zero;
|
|
||||||
casti_m256i( sc->S, 2 ) = zero;
|
|
||||||
casti_m256i( sc->S, 3 ) = zero;
|
|
||||||
sc->T0 = sc->T1 = 0;
|
sc->T0 = sc->T1 = 0;
|
||||||
sc->ptr = 0;
|
sc->ptr = 0;
|
||||||
sc->rounds = rounds;
|
sc->rounds = rounds;
|
||||||
@@ -940,6 +1055,179 @@ blake32_8way_close( blake_8way_small_context *sc, unsigned ub, unsigned n,
|
|||||||
|
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
|
|
||||||
|
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
|
||||||
|
|
||||||
|
//Blake-256 16 way AVX512
|
||||||
|
|
||||||
|
static void
|
||||||
|
blake32_16way_init( blake_16way_small_context *sc, const sph_u32 *iv,
|
||||||
|
const sph_u32 *salt, int rounds )
|
||||||
|
{
|
||||||
|
casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E6676A09E667 );
|
||||||
|
casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE85BB67AE85 );
|
||||||
|
casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF3723C6EF372 );
|
||||||
|
casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53AA54FF53A );
|
||||||
|
casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527F510E527F );
|
||||||
|
casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C9B05688C );
|
||||||
|
casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9AB1F83D9AB );
|
||||||
|
casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD195BE0CD19 );
|
||||||
|
sc->T0 = sc->T1 = 0;
|
||||||
|
sc->ptr = 0;
|
||||||
|
sc->rounds = rounds;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void
|
||||||
|
blake32_16way( blake_16way_small_context *sc, const void *data, size_t len )
|
||||||
|
{
|
||||||
|
__m512i *vdata = (__m512i*)data;
|
||||||
|
__m512i *buf;
|
||||||
|
size_t ptr;
|
||||||
|
const int buf_size = 64; // number of elements, sizeof/4
|
||||||
|
DECL_STATE32_16WAY
|
||||||
|
buf = sc->buf;
|
||||||
|
ptr = sc->ptr;
|
||||||
|
if ( len < buf_size - ptr )
|
||||||
|
{
|
||||||
|
memcpy_512( buf + (ptr>>2), vdata, len>>2 );
|
||||||
|
ptr += len;
|
||||||
|
sc->ptr = ptr;
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
READ_STATE32_16WAY(sc);
|
||||||
|
while ( len > 0 )
|
||||||
|
{
|
||||||
|
size_t clen;
|
||||||
|
|
||||||
|
clen = buf_size - ptr;
|
||||||
|
if (clen > len)
|
||||||
|
clen = len;
|
||||||
|
memcpy_512( buf + (ptr>>2), vdata, clen>>2 );
|
||||||
|
ptr += clen;
|
||||||
|
vdata += (clen>>2);
|
||||||
|
len -= clen;
|
||||||
|
if ( ptr == buf_size )
|
||||||
|
{
|
||||||
|
if ( ( T0 = T0 + 512 ) < 512 )
|
||||||
|
T1 = T1 + 1;
|
||||||
|
COMPRESS32_16WAY( sc->rounds );
|
||||||
|
ptr = 0;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
WRITE_STATE32_16WAY(sc);
|
||||||
|
sc->ptr = ptr;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void
|
||||||
|
blake32_16way_close( blake_16way_small_context *sc, unsigned ub, unsigned n,
|
||||||
|
void *dst, size_t out_size_w32 )
|
||||||
|
{
|
||||||
|
__m512i buf[16];
|
||||||
|
size_t ptr;
|
||||||
|
unsigned bit_len;
|
||||||
|
sph_u32 th, tl;
|
||||||
|
|
||||||
|
ptr = sc->ptr;
|
||||||
|
bit_len = ((unsigned)ptr << 3);
|
||||||
|
buf[ptr>>2] = m512_const1_64( 0x0000008000000080ULL );
|
||||||
|
tl = sc->T0 + bit_len;
|
||||||
|
th = sc->T1;
|
||||||
|
|
||||||
|
if ( ptr == 0 )
|
||||||
|
{
|
||||||
|
sc->T0 = 0xFFFFFE00UL;
|
||||||
|
sc->T1 = 0xFFFFFFFFUL;
|
||||||
|
}
|
||||||
|
else if ( sc->T0 == 0 )
|
||||||
|
{
|
||||||
|
sc->T0 = 0xFFFFFE00UL + bit_len;
|
||||||
|
sc->T1 = sc->T1 - 1;
|
||||||
|
}
|
||||||
|
else
|
||||||
|
sc->T0 -= 512 - bit_len;
|
||||||
|
|
||||||
|
if ( ptr <= 52 )
|
||||||
|
{
|
||||||
|
memset_zero_512( buf + (ptr>>2) + 1, (52 - ptr) >> 2 );
|
||||||
|
if ( out_size_w32 == 8 )
|
||||||
|
buf[52>>2] = _mm512_or_si512( buf[52>>2],
|
||||||
|
m512_const1_64( 0x0100000001000000ULL ) );
|
||||||
|
buf[+56>>2] = mm512_bswap_32( _mm512_set1_epi32( th ) );
|
||||||
|
buf[+60>>2] = mm512_bswap_32( _mm512_set1_epi32( tl ) );
|
||||||
|
blake32_16way( sc, buf + (ptr>>2), 64 - ptr );
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
memset_zero_512( buf + (ptr>>2) + 1, (60-ptr) >> 2 );
|
||||||
|
blake32_16way( sc, buf + (ptr>>2), 64 - ptr );
|
||||||
|
sc->T0 = 0xFFFFFE00UL;
|
||||||
|
sc->T1 = 0xFFFFFFFFUL;
|
||||||
|
memset_zero_512( buf, 56>>2 );
|
||||||
|
if ( out_size_w32 == 8 )
|
||||||
|
buf[52>>2] = m512_const1_64( 0x0100000001000000ULL );
|
||||||
|
buf[56>>2] = mm512_bswap_32( _mm512_set1_epi32( th ) );
|
||||||
|
buf[60>>2] = mm512_bswap_32( _mm512_set1_epi32( tl ) );
|
||||||
|
blake32_16way( sc, buf, 64 );
|
||||||
|
}
|
||||||
|
mm512_block_bswap_32( (__m512i*)dst, (__m512i*)sc->H );
|
||||||
|
}
|
||||||
|
|
||||||
|
void
|
||||||
|
blake256_16way_init(void *cc)
|
||||||
|
{
|
||||||
|
blake32_16way_init( cc, IV256, salt_zero_8way_small, 14 );
|
||||||
|
}
|
||||||
|
|
||||||
|
void
|
||||||
|
blake256_16way_update(void *cc, const void *data, size_t len)
|
||||||
|
{
|
||||||
|
blake32_16way(cc, data, len);
|
||||||
|
}
|
||||||
|
|
||||||
|
void
|
||||||
|
blake256_16way_close(void *cc, void *dst)
|
||||||
|
{
|
||||||
|
blake32_16way_close(cc, 0, 0, dst, 8);
|
||||||
|
}
|
||||||
|
|
||||||
|
void blake256r14_16way_init(void *cc)
|
||||||
|
{
|
||||||
|
blake32_16way_init( cc, IV256, salt_zero_8way_small, 14 );
|
||||||
|
}
|
||||||
|
|
||||||
|
void
|
||||||
|
blake256r14_16way_update(void *cc, const void *data, size_t len)
|
||||||
|
{
|
||||||
|
blake32_16way(cc, data, len);
|
||||||
|
}
|
||||||
|
|
||||||
|
void
|
||||||
|
blake256r14_16way_close(void *cc, void *dst)
|
||||||
|
{
|
||||||
|
blake32_16way_close(cc, 0, 0, dst, 8);
|
||||||
|
}
|
||||||
|
|
||||||
|
void blake256r8_16way_init(void *cc)
|
||||||
|
{
|
||||||
|
blake32_16way_init( cc, IV256, salt_zero_8way_small, 8 );
|
||||||
|
}
|
||||||
|
|
||||||
|
void
|
||||||
|
blake256r8_16way_update(void *cc, const void *data, size_t len)
|
||||||
|
{
|
||||||
|
blake32_16way(cc, data, len);
|
||||||
|
}
|
||||||
|
|
||||||
|
void
|
||||||
|
blake256r8_16way_close(void *cc, void *dst)
|
||||||
|
{
|
||||||
|
blake32_16way_close(cc, 0, 0, dst, 8);
|
||||||
|
}
|
||||||
|
|
||||||
|
#endif // AVX512
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
// Blake-256 4 way
|
// Blake-256 4 way
|
||||||
|
|
||||||
// default 14 rounds, backward copatibility
|
// default 14 rounds, backward copatibility
|
||||||
@@ -950,7 +1238,7 @@ blake256_4way_init(void *ctx)
 }
 
 void
-blake256_4way(void *ctx, const void *data, size_t len)
+blake256_4way_update(void *ctx, const void *data, size_t len)
 {
    blake32_4way(ctx, data, len);
 }
@@ -972,7 +1260,7 @@ blake256_8way_init(void *cc)
 }
 
 void
-blake256_8way(void *cc, const void *data, size_t len)
+blake256_8way_update(void *cc, const void *data, size_t len)
 {
    blake32_8way(cc, data, len);
 }
@@ -992,7 +1280,7 @@ void blake256r14_4way_init(void *cc)
 }
 
 void
-blake256r14_4way(void *cc, const void *data, size_t len)
+blake256r14_4way_update(void *cc, const void *data, size_t len)
 {
    blake32_4way(cc, data, len);
 }
@@ -1011,7 +1299,7 @@ void blake256r14_8way_init(void *cc)
 }
 
 void
-blake256r14_8way(void *cc, const void *data, size_t len)
+blake256r14_8way_update(void *cc, const void *data, size_t len)
 {
    blake32_8way(cc, data, len);
 }
@@ -1031,7 +1319,7 @@ void blake256r8_4way_init(void *cc)
 }
 
 void
-blake256r8_4way(void *cc, const void *data, size_t len)
+blake256r8_4way_update(void *cc, const void *data, size_t len)
 {
    blake32_4way(cc, data, len);
 }
@@ -1050,7 +1338,7 @@ void blake256r8_8way_init(void *cc)
 }
 
 void
-blake256r8_8way(void *cc, const void *data, size_t len)
+blake256r8_8way_update(void *cc, const void *data, size_t len)
 {
    blake32_8way(cc, data, len);
 }
|
|||||||
@@ -4,13 +4,59 @@
 */
 
 #include "blake2b-gate.h"
 
-#if defined(BLAKE2B_4WAY)
 
 #include <string.h>
 #include <stdint.h>
 #include "blake2b-hash-4way.h"
 
+#if defined(BLAKE2B_8WAY)
+
+int scanhash_blake2b_8way( struct work *work, uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t hash[8*8] __attribute__ ((aligned (128)));;
+   uint32_t vdata[20*8] __attribute__ ((aligned (64)));;
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+   blake2b_8way_ctx ctx __attribute__ ((aligned (64)));
+   uint32_t *hash7 = &(hash[49]);   // 3*16+1
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   int thr_id = mythr->id;
+   __m512i *noncev = (__m512i*)vdata + 9;   // aligned
+   const uint32_t Htarg = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+
+   uint32_t n = first_nonce;
+
+   mm512_bswap32_intrlv80_8x64( vdata, pdata );
+
+   do {
+      *noncev = mm512_intrlv_blend_32( mm512_bswap_32(
+                _mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
+                                  n+3, 0, n+2, 0, n+1, 0, n  , 0 ) ), *noncev );
+
+      blake2b_8way_init( &ctx );
+      blake2b_8way_update( &ctx, vdata, 80 );
+      blake2b_8way_final( &ctx, hash );
+
+      for ( int lane = 0; lane < 8; lane++ )
+      if ( hash7[ lane<<1 ] <= Htarg )
+      {
+         extr_lane_8x64( lane_hash, hash, lane, 256 );
+         if ( fulltest( lane_hash, ptarget ) && !opt_benchmark )
+         {
+            pdata[19] = n + lane;
+            submit_lane_solution( work, lane_hash, mythr, lane );
+         }
+      }
+      n += 8;
+   } while ( (n < max_nonce-8) && !work_restart[thr_id].restart);
+
+   *hashes_done = n - first_nonce + 1;
+   return 0;
+}
+
+#elif defined(BLAKE2B_4WAY)
 
 // Function not used, code inlined.
 void blake2b_4way_hash(void *output, const void *input)
 {
@@ -48,7 +94,7 @@ int scanhash_blake2b_4way( struct work *work, uint32_t max_nonce,
    blake2b_4way_final( &ctx, hash );
 
    for ( int lane = 0; lane < 4; lane++ )
-   if ( hash7[ lane<<1 ] < Htarg )
+   if ( hash7[ lane<<1 ] <= Htarg )
    {
       extr_lane_4x64( lane_hash, hash, lane, 256 );
      if ( fulltest( lane_hash, ptarget ) && !opt_benchmark )
@@ -1,24 +1,19 @@
 #include "blake2b-gate.h"
 
-/*
-// changed to get_max64_0x3fffffLL in cpuminer-multi-decred
-int64_t blake2s_get_max64 ()
-{
-   return 0x7ffffLL;
-}
-*/
 
 bool register_blake2b_algo( algo_gate_t* gate )
 {
-#if defined(BLAKE2B_4WAY)
+#if defined(BLAKE2B_8WAY)
+  gate->scanhash  = (void*)&scanhash_blake2b_8way;
+//  gate->hash      = (void*)&blake2b_8way_hash;
+#elif defined(BLAKE2B_4WAY)
   gate->scanhash  = (void*)&scanhash_blake2b_4way;
   gate->hash      = (void*)&blake2b_4way_hash;
 #else
   gate->scanhash  = (void*)&scanhash_blake2b;
   gate->hash      = (void*)&blake2b_hash;
 #endif
-// gate->get_max64 = (void*)&blake2s_get_max64;
-  gate->optimizations = AVX2_OPT;
+  gate->optimizations = AVX2_OPT | AVX512_OPT;
   return true;
 };
 
@@ -4,13 +4,21 @@
 #include <stdint.h>
 #include "algo-gate-api.h"
 
-#if defined(__AVX2__)
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define BLAKE2B_8WAY
+#elif defined(__AVX2__)
   #define BLAKE2B_4WAY
 #endif
 
 bool register_blake2b_algo( algo_gate_t* gate );
 
-#if defined(BLAKE2B_4WAY)
+#if defined(BLAKE2B_8WAY)
 
+//void blake2b_8way_hash( void *state, const void *input );
+int scanhash_blake2b_8way( struct work *work, uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr );
+
+#elif defined(BLAKE2B_4WAY)
+
 void blake2b_4way_hash( void *state, const void *input );
 int scanhash_blake2b_4way( struct work *work, uint32_t max_nonce,
@@ -33,6 +33,178 @@
 
 #include "blake2b-hash-4way.h"
 
+static const uint8_t sigma[12][16] =
+{
+   {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
+   { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
+   { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
+   {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
+   {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
+   {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
+   { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
+   { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
+   {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
+   { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
+   {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
+   { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 }
+};
+
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#define B2B8W_G(a, b, c, d, x, y) \
+{ \
+   v[a] = _mm512_add_epi64( _mm512_add_epi64( v[a], v[b] ), x ); \
+   v[d] = mm512_ror_64( _mm512_xor_si512( v[d], v[a] ), 32 ); \
+   v[c] = _mm512_add_epi64( v[c], v[d] ); \
+   v[b] = mm512_ror_64( _mm512_xor_si512( v[b], v[c] ), 24 ); \
+   v[a] = _mm512_add_epi64( _mm512_add_epi64( v[a], v[b] ), y ); \
+   v[d] = mm512_ror_64( _mm512_xor_si512( v[d], v[a] ), 16 ); \
+   v[c] = _mm512_add_epi64( v[c], v[d] ); \
+   v[b] = mm512_ror_64( _mm512_xor_si512( v[b], v[c] ), 63 ); \
+}
+
+static void blake2b_8way_compress( blake2b_8way_ctx *ctx, int last )
+{
+   __m512i v[16], m[16];
+
+   v[ 0] = ctx->h[0];
+   v[ 1] = ctx->h[1];
+   v[ 2] = ctx->h[2];
+   v[ 3] = ctx->h[3];
+   v[ 4] = ctx->h[4];
+   v[ 5] = ctx->h[5];
+   v[ 6] = ctx->h[6];
+   v[ 7] = ctx->h[7];
+   v[ 8] = m512_const1_64( 0x6A09E667F3BCC908 );
+   v[ 9] = m512_const1_64( 0xBB67AE8584CAA73B );
+   v[10] = m512_const1_64( 0x3C6EF372FE94F82B );
+   v[11] = m512_const1_64( 0xA54FF53A5F1D36F1 );
+   v[12] = m512_const1_64( 0x510E527FADE682D1 );
+   v[13] = m512_const1_64( 0x9B05688C2B3E6C1F );
+   v[14] = m512_const1_64( 0x1F83D9ABFB41BD6B );
+   v[15] = m512_const1_64( 0x5BE0CD19137E2179 );
+
+   v[12] = _mm512_xor_si512( v[12], _mm512_set1_epi64( ctx->t[0] ) );
+   v[13] = _mm512_xor_si512( v[13], _mm512_set1_epi64( ctx->t[1] ) );
+
+   if ( last )
+      v[14] = mm512_not( v[14] );
+
+   m[ 0] = ctx->b[ 0];
+   m[ 1] = ctx->b[ 1];
+   m[ 2] = ctx->b[ 2];
+   m[ 3] = ctx->b[ 3];
+   m[ 4] = ctx->b[ 4];
+   m[ 5] = ctx->b[ 5];
+   m[ 6] = ctx->b[ 6];
+   m[ 7] = ctx->b[ 7];
+   m[ 8] = ctx->b[ 8];
+   m[ 9] = ctx->b[ 9];
+   m[10] = ctx->b[10];
+   m[11] = ctx->b[11];
+   m[12] = ctx->b[12];
+   m[13] = ctx->b[13];
+   m[14] = ctx->b[14];
+   m[15] = ctx->b[15];
+
+   for ( int i = 0; i < 12; i++ )
+   {
+      B2B8W_G( 0, 4,  8, 12, m[ sigma[i][ 0] ], m[ sigma[i][ 1] ] );
+      B2B8W_G( 1, 5,  9, 13, m[ sigma[i][ 2] ], m[ sigma[i][ 3] ] );
+      B2B8W_G( 2, 6, 10, 14, m[ sigma[i][ 4] ], m[ sigma[i][ 5] ] );
+      B2B8W_G( 3, 7, 11, 15, m[ sigma[i][ 6] ], m[ sigma[i][ 7] ] );
+      B2B8W_G( 0, 5, 10, 15, m[ sigma[i][ 8] ], m[ sigma[i][ 9] ] );
+      B2B8W_G( 1, 6, 11, 12, m[ sigma[i][10] ], m[ sigma[i][11] ] );
+      B2B8W_G( 2, 7,  8, 13, m[ sigma[i][12] ], m[ sigma[i][13] ] );
+      B2B8W_G( 3, 4,  9, 14, m[ sigma[i][14] ], m[ sigma[i][15] ] );
+   }
+
+   ctx->h[0] = _mm512_xor_si512( _mm512_xor_si512( ctx->h[0], v[0] ), v[ 8] );
+   ctx->h[1] = _mm512_xor_si512( _mm512_xor_si512( ctx->h[1], v[1] ), v[ 9] );
+   ctx->h[2] = _mm512_xor_si512( _mm512_xor_si512( ctx->h[2], v[2] ), v[10] );
+   ctx->h[3] = _mm512_xor_si512( _mm512_xor_si512( ctx->h[3], v[3] ), v[11] );
+   ctx->h[4] = _mm512_xor_si512( _mm512_xor_si512( ctx->h[4], v[4] ), v[12] );
+   ctx->h[5] = _mm512_xor_si512( _mm512_xor_si512( ctx->h[5], v[5] ), v[13] );
+   ctx->h[6] = _mm512_xor_si512( _mm512_xor_si512( ctx->h[6], v[6] ), v[14] );
+   ctx->h[7] = _mm512_xor_si512( _mm512_xor_si512( ctx->h[7], v[7] ), v[15] );
+}
+
+int blake2b_8way_init( blake2b_8way_ctx *ctx )
+{
+   size_t i;
+
+   ctx->h[0] = m512_const1_64( 0x6A09E667F3BCC908 );
+   ctx->h[1] = m512_const1_64( 0xBB67AE8584CAA73B );
+   ctx->h[2] = m512_const1_64( 0x3C6EF372FE94F82B );
+   ctx->h[3] = m512_const1_64( 0xA54FF53A5F1D36F1 );
+   ctx->h[4] = m512_const1_64( 0x510E527FADE682D1 );
+   ctx->h[5] = m512_const1_64( 0x9B05688C2B3E6C1F );
+   ctx->h[6] = m512_const1_64( 0x1F83D9ABFB41BD6B );
+   ctx->h[7] = m512_const1_64( 0x5BE0CD19137E2179 );
+
+   ctx->h[0] = _mm512_xor_si512( ctx->h[0], m512_const1_64( 0x01010020 ) );
+
+   ctx->t[0] = 0;
+   ctx->t[1] = 0;
+   ctx->c = 0;
+   ctx->outlen = 32;
+
+   for ( i = 0; i < 16; i++ )
+      ctx->b[i] = m512_zero;
+
+   return 0;
+}
+
+void blake2b_8way_update( blake2b_8way_ctx *ctx, const void *input,
+                          size_t inlen )
+{
+   __m512i* in =(__m512i*)input;
+
+   size_t i, c;
+   c = ctx->c >> 3;
+
+   for ( i = 0; i < (inlen >> 3); i++ )
+   {
+      if ( ctx->c == 128 )
+      {
+         ctx->t[0] += ctx->c;
+         if ( ctx->t[0] < ctx->c )
+            ctx->t[1]++;
+         blake2b_8way_compress( ctx, 0 );
+         ctx->c = 0;
+      }
+      ctx->b[ c++ ] = in[i];
+      ctx->c += 8;
+   }
+}
+
+void blake2b_8way_final( blake2b_8way_ctx *ctx, void *out )
+{
+   size_t c;
+   c = ctx->c >> 3;
+
+   ctx->t[0] += ctx->c;
+   if ( ctx->t[0] < ctx->c )
+      ctx->t[1]++;
+
+   while ( ctx->c < 128 )
+   {
+      ctx->b[c++] = m512_zero;
+      ctx->c += 8;
+   }
+
+   blake2b_8way_compress( ctx, 1 );   // final block flag = 1
+
+   casti_m512i( out, 0 ) = ctx->h[0];
+   casti_m512i( out, 1 ) = ctx->h[1];
+   casti_m512i( out, 2 ) = ctx->h[2];
+   casti_m512i( out, 3 ) = ctx->h[3];
+}
+
+#endif
+
 #if defined(__AVX2__)
 
 // G Mixing function.
@@ -61,21 +233,6 @@ static const uint64_t blake2b_iv[8] = {
 
 static void blake2b_4way_compress( blake2b_4way_ctx *ctx, int last )
 {
-   const uint8_t sigma[12][16] = {
-      {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
-      { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
-      { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
-      {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
-      {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
-      {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
-      { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
-      { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
-      {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
-      { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
-      {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
-      { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 }
-   };
-   int i;
    __m256i v[16], m[16];
 
    v[ 0] = ctx->h[0];
@@ -118,7 +275,7 @@ static void blake2b_4way_compress( blake2b_4way_ctx *ctx, int last )
    m[14] = ctx->b[14];
    m[15] = ctx->b[15];
 
-   for ( i = 0; i < 12; i++ )
+   for ( int i = 0; i < 12; i++ )
    {
      B2B_G( 0, 4,  8, 12, m[ sigma[i][ 0] ], m[ sigma[i][ 1] ] );
      B2B_G( 1, 5,  9, 13, m[ sigma[i][ 2] ], m[ sigma[i][ 3] ] );
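For orientation when reading the B2B8W_G macro in the diff above: it applies the standard BLAKE2b G mixing step to eight independent 64-bit lanes packed in each __m512i register. A scalar sketch of that step follows (illustrative only, not code from this repository; `rotr64` and `b2b_g` are names invented for the example, rotation counts 32/24/16/63 are from the BLAKE2b specification, RFC 7693):

```c
#include <stdint.h>

/* Right-rotate a 64-bit word; the vectorized code does the same per lane
   with mm512_ror_64. */
static inline uint64_t rotr64( uint64_t w, unsigned c )
{
   return ( w >> c ) | ( w << ( 64 - c ) );
}

/* One scalar BLAKE2b G step on the 16-word work vector v, mixing message
   words x and y into state words a, b, c, d. */
static void b2b_g( uint64_t v[16], int a, int b, int c, int d,
                   uint64_t x, uint64_t y )
{
   v[a] = v[a] + v[b] + x;
   v[d] = rotr64( v[d] ^ v[a], 32 );
   v[c] = v[c] + v[d];
   v[b] = rotr64( v[b] ^ v[c], 24 );
   v[a] = v[a] + v[b] + y;
   v[d] = rotr64( v[d] ^ v[a], 16 );
   v[c] = v[c] + v[d];
   v[b] = rotr64( v[b] ^ v[c], 63 );
}
```

The 8-way macro is this exact sequence with each scalar op replaced by its `_mm512_*` / `mm512_*` counterpart, so all eight lanes advance in lockstep.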
@@ -2,8 +2,6 @@
 #ifndef __BLAKE2B_HASH_4WAY_H__
 #define __BLAKE2B_HASH_4WAY_H__
 
-#if defined(__AVX2__)
-
 #include "simd-utils.h"
 #include <stddef.h>
 #include <stdint.h>
@@ -16,14 +14,34 @@
 #define ALIGN(x) __attribute__((aligned(x)))
 #endif
 
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+ALIGN(128) typedef struct {
+   __m512i b[16]; // input buffer
+   __m512i h[8];  // chained state
+   uint64_t t[2]; // total number of bytes
+   size_t c;      // pointer for b[]
+   size_t outlen; // digest size
+} blake2b_8way_ctx;
+
+int blake2b_8way_init( blake2b_8way_ctx *ctx );
+void blake2b_8way_update( blake2b_8way_ctx *ctx, const void *input,
+                          size_t inlen );
+void blake2b_8way_final( blake2b_8way_ctx *ctx, void *out );
+
+#endif
+
+#if defined(__AVX2__)
+
 // state context
-ALIGN(64) typedef struct {
+ALIGN(128) typedef struct {
    __m256i b[16]; // input buffer
    __m256i h[8];  // chained state
    uint64_t t[2]; // total number of bytes
    size_t c;      // pointer for b[]
    size_t outlen; // digest size
-} blake2b_4way_ctx __attribute__((aligned(64)));
+} blake2b_4way_ctx;
 
 int blake2b_4way_init( blake2b_4way_ctx *ctx );
 void blake2b_4way_update( blake2b_4way_ctx *ctx, const void *input,
@@ -43,17 +43,14 @@ int scanhash_blake2b( struct work *work, uint32_t max_nonce,
 
    do {
       be32enc(&endiandata[19], n);
-      //blake2b_hash_end(vhashcpu, endiandata);
       blake2b_hash(vhashcpu, endiandata);
 
-      if (vhashcpu[7] < Htarg && fulltest(vhashcpu, ptarget)) {
-         work_set_target_ratio(work, vhashcpu);
-         *hashes_done = n - first_nonce + 1;
+      if (vhashcpu[7] <= Htarg && fulltest(vhashcpu, ptarget))
+      {
          pdata[19] = n;
-         return 1;
+         submit_solution( work, vhashcpu, mythr );
       }
       n++;
 
    } while (n < max_nonce && !work_restart[thr_id].restart);
    *hashes_done = n - first_nonce + 1;
    pdata[19] = n;
@@ -3,22 +3,72 @@
 #include <string.h>
 #include <stdint.h>
 
-#if defined(BLAKE2S_8WAY)
+#if defined(BLAKE2S_16WAY)
+
+static __thread blake2s_16way_state blake2s_16w_ctx;
+
+void blake2s_16way_hash( void *output, const void *input )
+{
+   blake2s_16way_state ctx;
+   memcpy( &ctx, &blake2s_16w_ctx, sizeof ctx );
+   blake2s_16way_update( &ctx, input + (64<<4), 16 );
+   blake2s_16way_final( &ctx, output, BLAKE2S_OUTBYTES );
+}
+
+int scanhash_blake2s_16way( struct work *work, uint32_t max_nonce,
+                            uint64_t *hashes_done, struct thr_info *mythr )
+{
+   uint32_t vdata[20*16] __attribute__ ((aligned (128)));
+   uint32_t hash[8*16] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+   uint32_t *hash7 = &(hash[7<<4]);
+   uint32_t *pdata = work->data;
+   uint32_t *ptarget = work->target;
+   const uint32_t Htarg = ptarget[7];
+   const uint32_t first_nonce = pdata[19];
+   __m512i *noncev = (__m512i*)vdata + 19;   // aligned
+   uint32_t n = first_nonce;
+   int thr_id = mythr->id;
+
+   mm512_bswap32_intrlv80_16x32( vdata, pdata );
+   blake2s_16way_init( &blake2s_16w_ctx, BLAKE2S_OUTBYTES );
+   blake2s_16way_update( &blake2s_16w_ctx, vdata, 64 );
+
+   do {
+      *noncev = mm512_bswap_32( _mm512_set_epi32(
+                         n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
+                         n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+ 1, n ) );
+      pdata[19] = n;
+
+      blake2s_16way_hash( hash, vdata );
+
+      for ( int lane = 0; lane < 16; lane++ )
+      if ( unlikely( hash7[lane] <= Htarg ) )
+      {
+         extr_lane_16x32( lane_hash, hash, lane, 256 );
+         if ( likely( fulltest( lane_hash, ptarget ) && !opt_benchmark ) )
+         {
+            pdata[19] = n + lane;
+            submit_lane_solution( work, lane_hash, mythr, lane );
+         }
+      }
+      n += 16;
+   } while ( (n < max_nonce-16) && !work_restart[thr_id].restart );
+
+   *hashes_done = n - first_nonce + 1;
+   return 0;
+}
+
+#elif defined(BLAKE2S_8WAY)
 
 static __thread blake2s_8way_state blake2s_8w_ctx;
 
 void blake2s_8way_hash( void *output, const void *input )
 {
-   uint32_t vhash[8*8] __attribute__ ((aligned (64)));
    blake2s_8way_state ctx;
    memcpy( &ctx, &blake2s_8w_ctx, sizeof ctx );
 
    blake2s_8way_update( &ctx, input + (64<<3), 16 );
-   blake2s_8way_final( &ctx, vhash, BLAKE2S_OUTBYTES );
-
-   dintrlv_8x32( output,     output+ 32, output+ 64, output+ 96,
-                 output+128, output+160, output+192, output+224,
-                 vhash, 256 );
+   blake2s_8way_final( &ctx, output, BLAKE2S_OUTBYTES );
 }
 
 int scanhash_blake2s_8way( struct work *work, uint32_t max_nonce,
@@ -26,13 +76,15 @@ int scanhash_blake2s_8way( struct work *work, uint32_t max_nonce,
 {
    uint32_t vdata[20*8] __attribute__ ((aligned (64)));
    uint32_t hash[8*8] __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *hash7 = &(hash[7<<3]);
    uint32_t *pdata = work->data;
    uint32_t *ptarget = work->target;
    const uint32_t Htarg = ptarget[7];
    const uint32_t first_nonce = pdata[19];
    __m256i *noncev = (__m256i*)vdata + 19;   // aligned
    uint32_t n = first_nonce;
-   int thr_id = mythr->id;  // thr_id arg is deprecated
+   int thr_id = mythr->id;
 
    mm256_bswap32_intrlv80_8x32( vdata, pdata );
    blake2s_8way_init( &blake2s_8w_ctx, BLAKE2S_OUTBYTES );
@@ -45,16 +97,17 @@ int scanhash_blake2s_8way( struct work *work, uint32_t max_nonce,
 
      blake2s_8way_hash( hash, vdata );
 
-     for ( int i = 0; i < 8; i++ )
-     if ( (hash+(i<<3))[7] <= Htarg )
-     if ( fulltest( hash+(i<<3), ptarget ) && !opt_benchmark )
+     for ( int lane = 0; lane < 8; lane++ )
+     if ( unlikely( hash7[lane] <= Htarg ) )
      {
-        pdata[19] = n+i;
-        submit_lane_solution( work, hash+(i<<3), mythr, i );
+        extr_lane_8x32( lane_hash, hash, lane, 256 );
+        if ( likely( fulltest( lane_hash, ptarget ) && !opt_benchmark ) )
+        {
+           pdata[19] = n + lane;
+           submit_lane_solution( work, lane_hash, mythr, lane );
+        }
      }
      n += 8;
 
   } while ( (n < max_nonce) && !work_restart[thr_id].restart );
 
   *hashes_done = n - first_nonce + 1;
@@ -67,15 +120,10 @@ static __thread blake2s_4way_state blake2s_4w_ctx;
 
 void blake2s_4way_hash( void *output, const void *input )
 {
-   uint32_t vhash[8*4] __attribute__ ((aligned (64)));
    blake2s_4way_state ctx;
    memcpy( &ctx, &blake2s_4w_ctx, sizeof ctx );
 
    blake2s_4way_update( &ctx, input + (64<<2), 16 );
-   blake2s_4way_final( &ctx, vhash, BLAKE2S_OUTBYTES );
-
-   dintrlv_4x32( output, output+32, output+64, output+96,
-                 vhash, 256 );
+   blake2s_4way_final( &ctx, output, BLAKE2S_OUTBYTES );
 }
 
 int scanhash_blake2s_4way( struct work *work, uint32_t max_nonce,
@@ -83,13 +131,15 @@ int scanhash_blake2s_4way( struct work *work, uint32_t max_nonce,
 {
    uint32_t vdata[20*4] __attribute__ ((aligned (64)));
    uint32_t hash[8*4] __attribute__ ((aligned (32)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t *hash7 = &(hash[7<<2]);
    uint32_t *pdata = work->data;
    uint32_t *ptarget = work->target;
    const uint32_t Htarg = ptarget[7];
    const uint32_t first_nonce = pdata[19];
    __m128i *noncev = (__m128i*)vdata + 19;   // aligned
    uint32_t n = first_nonce;
-   int thr_id = mythr->id;  // thr_id arg is deprecated
+   int thr_id = mythr->id;
 
    mm128_bswap32_intrlv80_4x32( vdata, pdata );
    blake2s_4way_init( &blake2s_4w_ctx, BLAKE2S_OUTBYTES );
@@ -101,15 +151,16 @@ int scanhash_blake2s_4way( struct work *work, uint32_t max_nonce,
 
      blake2s_4way_hash( hash, vdata );
 
-     for ( int i = 0; i < 4; i++ )
-     if ( (hash+(i<<3))[7] <= Htarg )
-     if ( fulltest( hash+(i<<3), ptarget ) && !opt_benchmark )
+     for ( int lane = 0; lane < 4; lane++ ) if ( hash7[lane] <= Htarg )
      {
-        pdata[19] = n+i;
-        submit_lane_solution( work, hash+(i<<3), mythr, i );
+        extr_lane_4x32( lane_hash, hash, lane, 256 );
+        if ( fulltest( lane_hash, ptarget ) && !opt_benchmark )
+        {
+           pdata[19] = n + lane;
+           submit_lane_solution( work, lane_hash, mythr, lane );
+        }
      }
      n += 4;
 
   } while ( (n < max_nonce) && !work_restart[thr_id].restart );
 
  *hashes_done = n - first_nonce + 1;
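The recurring pattern in these scanhash rewrites is to pre-screen only the most significant hash word of each SIMD lane (hash7) against the 32-bit target word, and run the expensive lane extraction plus fulltest() only on the survivors. A scalar illustration of that pre-screen follows (names and signature invented for the example, not the repository's API):

```c
#include <stdint.h>

/* Illustrative lane pre-screen: hash7[lane] holds the high 32-bit word of
   each lane's 256-bit hash. Only lanes whose high word is <= the target's
   high word can possibly pass the full 256-bit comparison, so only those
   are recorded for the caller to extract and fully test. */
static int prescreen_lanes( const uint32_t *hash7, int nlanes,
                            uint32_t htarg, int *hits, int max_hits )
{
   int nhits = 0;
   for ( int lane = 0; lane < nlanes; lane++ )
      if ( hash7[lane] <= htarg && nhits < max_hits )
         hits[ nhits++ ] = lane;   // caller extracts the lane, runs fulltest()
   return nhits;
}
```

Because valid shares are rare, almost every iteration exits after nlanes cheap comparisons, which is why the hot loops above avoid deinterleaving all lanes up front.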
@@ -1,15 +1,12 @@
 #include "blake2s-gate.h"
 
 
-// changed to get_max64_0x3fffffLL in cpuminer-multi-decred
-int64_t blake2s_get_max64 ()
-{
-   return 0x7ffffLL;
-}
 
 bool register_blake2s_algo( algo_gate_t* gate )
 {
-#if defined(BLAKE2S_8WAY)
+#if defined(BLAKE2S_16WAY)
+  gate->scanhash  = (void*)&scanhash_blake2s_16way;
+  gate->hash      = (void*)&blake2s_16way_hash;
+#elif defined(BLAKE2S_8WAY)
+//#if defined(BLAKE2S_8WAY)
   gate->scanhash  = (void*)&scanhash_blake2s_8way;
   gate->hash      = (void*)&blake2s_8way_hash;
 #elif defined(BLAKE2S_4WAY)
@@ -19,8 +16,7 @@ bool register_blake2s_algo( algo_gate_t* gate )
   gate->scanhash  = (void*)&scanhash_blake2s;
   gate->hash      = (void*)&blake2s_hash;
 #endif
-  gate->get_max64 = (void*)&blake2s_get_max64;
-  gate->optimizations = SSE2_OPT | AVX2_OPT;
+  gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
   return true;
 };
 
@@ -8,13 +8,26 @@
 #if defined(__SSE2__)
   #define BLAKE2S_4WAY
 #endif
 
 #if defined(__AVX2__)
   #define BLAKE2S_8WAY
 #endif
 
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define BLAKE2S_16WAY
+#endif
+
 bool register_blake2s_algo( algo_gate_t* gate );
 
-#if defined(BLAKE2S_8WAY)
+#if defined(BLAKE2S_16WAY)
+
+void blake2s_16way_hash( void *state, const void *input );
+int scanhash_blake2s_16way( struct work *work, uint32_t max_nonce,
+                            uint64_t *hashes_done, struct thr_info *mythr );
+
+#elif defined (BLAKE2S_8WAY)
+
+//#if defined(BLAKE2S_8WAY)
 
 void blake2s_8way_hash( void *state, const void *input );
 int scanhash_blake2s_8way( struct work *work, uint32_t max_nonce,
@@ -20,12 +20,13 @@
 //#if defined(__SSE4_2__)
 #if defined(__SSE2__)
 
+/*
 static const uint32_t blake2s_IV[8] =
 {
    0x6A09E667UL, 0xBB67AE85UL, 0x3C6EF372UL, 0xA54FF53AUL,
    0x510E527FUL, 0x9B05688CUL, 0x1F83D9ABUL, 0x5BE0CD19UL
 };
+*/
 
 static const uint8_t blake2s_sigma[10][16] =
 {
@@ -41,6 +42,7 @@ static const uint8_t blake2s_sigma[10][16] =
    { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13 , 0 } ,
 };
 
+
 // define a constant for initial param.
 
 int blake2s_4way_init( blake2s_4way_state *S, const uint8_t outlen )
@@ -88,41 +90,45 @@ int blake2s_4way_compress( blake2s_4way_state *S, const __m128i* block )
    memcpy_128( m, block, 16 );
    memcpy_128( v, S->h, 8 );
 
-   v[ 8] = _mm_set1_epi32( blake2s_IV[0] );
-   v[ 9] = _mm_set1_epi32( blake2s_IV[1] );
-   v[10] = _mm_set1_epi32( blake2s_IV[2] );
-   v[11] = _mm_set1_epi32( blake2s_IV[3] );
+   v[ 8] = m128_const1_64( 0x6A09E6676A09E667ULL );
+   v[ 9] = m128_const1_64( 0xBB67AE85BB67AE85ULL );
+   v[10] = m128_const1_64( 0x3C6EF3723C6EF372ULL );
+   v[11] = m128_const1_64( 0xA54FF53AA54FF53AULL );
    v[12] = _mm_xor_si128( _mm_set1_epi32( S->t[0] ),
-                          _mm_set1_epi32( blake2s_IV[4] ) );
+                          m128_const1_64( 0x510E527F510E527FULL ) );
    v[13] = _mm_xor_si128( _mm_set1_epi32( S->t[1] ),
-                          _mm_set1_epi32( blake2s_IV[5] ) );
+                          m128_const1_64( 0x9B05688C9B05688CULL ) );
    v[14] = _mm_xor_si128( _mm_set1_epi32( S->f[0] ),
-                          _mm_set1_epi32( blake2s_IV[6] ) );
+                          m128_const1_64( 0x1F83D9AB1F83D9ABULL ) );
    v[15] = _mm_xor_si128( _mm_set1_epi32( S->f[1] ),
-                          _mm_set1_epi32( blake2s_IV[7] ) );
+                          m128_const1_64( 0x5BE0CD195BE0CD19ULL ) );
 
-#define G4W(r,i,a,b,c,d) \
+#define G4W( sigma0, sigma1, a, b, c, d ) \
 do { \
-   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ blake2s_sigma[r][2*i+0] ] ); \
+   uint8_t s0 = sigma0; \
+   uint8_t s1 = sigma1; \
+   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ s0 ] ); \
    d = mm128_ror_32( _mm_xor_si128( d, a ), 16 ); \
    c = _mm_add_epi32( c, d ); \
    b = mm128_ror_32( _mm_xor_si128( b, c ), 12 ); \
-   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ blake2s_sigma[r][2*i+1] ] ); \
+   a = _mm_add_epi32( _mm_add_epi32( a, b ), m[ s1 ] ); \
    d = mm128_ror_32( _mm_xor_si128( d, a ), 8 ); \
    c = _mm_add_epi32( c, d ); \
    b = mm128_ror_32( _mm_xor_si128( b, c ), 7 ); \
 } while(0)
 
 
 #define ROUND4W(r) \
 do { \
-   G4W( r, 0, v[ 0], v[ 4], v[ 8], v[12] ); \
-   G4W( r, 1, v[ 1], v[ 5], v[ 9], v[13] ); \
-   G4W( r, 2, v[ 2], v[ 6], v[10], v[14] ); \
-   G4W( r, 3, v[ 3], v[ 7], v[11], v[15] ); \
-   G4W( r, 4, v[ 0], v[ 5], v[10], v[15] ); \
-   G4W( r, 5, v[ 1], v[ 6], v[11], v[12] ); \
-   G4W( r, 6, v[ 2], v[ 7], v[ 8], v[13] ); \
-   G4W( r, 7, v[ 3], v[ 4], v[ 9], v[14] ); \
+   uint8_t *sigma = (uint8_t*)&blake2s_sigma[r]; \
+   G4W( sigma[ 0], sigma[ 1], v[ 0], v[ 4], v[ 8], v[12] ); \
+   G4W( sigma[ 2], sigma[ 3], v[ 1], v[ 5], v[ 9], v[13] ); \
+   G4W( sigma[ 4], sigma[ 5], v[ 2], v[ 6], v[10], v[14] ); \
+   G4W( sigma[ 6], sigma[ 7], v[ 3], v[ 7], v[11], v[15] ); \
+   G4W( sigma[ 8], sigma[ 9], v[ 0], v[ 5], v[10], v[15] ); \
+   G4W( sigma[10], sigma[11], v[ 1], v[ 6], v[11], v[12] ); \
+   G4W( sigma[12], sigma[13], v[ 2], v[ 7], v[ 8], v[13] ); \
+   G4W( sigma[14], sigma[15], v[ 3], v[ 4], v[ 9], v[14] ); \
 } while(0)
 
    ROUND4W( 0 );
@@ -144,26 +150,47 @@ do { \
    return 0;
 }

+// There is a problem that can't be resolved internally.
+// If the last block is a full 64 bytes it should not be compressed in
+// update but left for final. However, when streaming, it isn't known
+// which block is last. There may be a subsequent call to update to add
+// more data.
+//
+// The reference code handled this by juggling 2 blocks at a time at
+// a significant performance penalty.
+//
+// Instead a new function is introduced called full_blocks which combines
+// update and final and is to be used in non-streaming mode where the data
+// is a multiple of 64 bytes.
+//
+// Supported:
+//  64 + 16 bytes (blake2s with midstate optimization)
+//  80 bytes (blake2s without midstate optimization)
+//  Any multiple of 64 bytes in one shot (x25x)
+//
+// Unsupported:
+//  Stream of full 64 byte blocks one at a time.

+// use only when streaming more data or final block not full.
 int blake2s_4way_update( blake2s_4way_state *S, const void *in,
                          uint64_t inlen )
 {
    __m128i *input = (__m128i*)in;
    __m128i *buf = (__m128i*)S->buf;
-   const int bsize = BLAKE2S_BLOCKBYTES;

    while( inlen > 0 )
    {
       size_t left = S->buflen;
-      if( inlen >= bsize - left )
+      if( inlen >= BLAKE2S_BLOCKBYTES - left )
       {
-         memcpy_128( buf + (left>>2), input, (bsize - left) >> 2 );
-         S->buflen += bsize - left;
+         memcpy_128( buf + (left>>2), input, (BLAKE2S_BLOCKBYTES - left) >> 2 );
+         S->buflen += BLAKE2S_BLOCKBYTES - left;
          S->t[0] += BLAKE2S_BLOCKBYTES;
          S->t[1] += ( S->t[0] < BLAKE2S_BLOCKBYTES );
          blake2s_4way_compress( S, buf );
          S->buflen = 0;
-         input += ( bsize >> 2 );
-         inlen -= bsize;
+         input += ( BLAKE2S_BLOCKBYTES >> 2 );
+         inlen -= BLAKE2S_BLOCKBYTES;
       }
       else
       {
@@ -195,8 +222,45 @@ int blake2s_4way_final( blake2s_4way_state *S, void *out, uint8_t outlen )
    return 0;
 }

+// Update and final when inlen is a multiple of 64 bytes
+int blake2s_4way_full_blocks( blake2s_4way_state *S, void *out,
+                              const void *input, uint64_t inlen )
+{
+  __m128i *in = (__m128i*)input;
+  __m128i *buf = (__m128i*)S->buf;
+
+  while( inlen > BLAKE2S_BLOCKBYTES )
+  {
+     memcpy_128( buf, in, BLAKE2S_BLOCKBYTES >> 2 );
+     S->buflen = BLAKE2S_BLOCKBYTES;
+     inlen -= BLAKE2S_BLOCKBYTES;
+     S->t[0] += BLAKE2S_BLOCKBYTES;
+     S->t[1] += ( S->t[0] < BLAKE2S_BLOCKBYTES );
+     blake2s_4way_compress( S, buf );
+     S->buflen = 0;
+     in += ( BLAKE2S_BLOCKBYTES >> 2 );
+  }
+
+  // last block
+  memcpy_128( buf, in, BLAKE2S_BLOCKBYTES >> 2 );
+  S->buflen = BLAKE2S_BLOCKBYTES;
+  S->t[0] += S->buflen;
+  S->t[1] += ( S->t[0] < S->buflen );
+  if ( S->last_node ) S->f[1] = ~0U;
+  S->f[0] = ~0U;
+  blake2s_4way_compress( S, buf );
+
+  for ( int i = 0; i < 8; ++i )
+      casti_m128i( out, i ) = S->h[ i ];
+  return 0;
+}

 #if defined(__AVX2__)

+// The commented code below is slower on Intel but faster on
+// Zen1 AVX2. It's also faster than Zen1 AVX.
+// Ryzen gen2 is unknown at this time.

 int blake2s_8way_compress( blake2s_8way_state *S, const __m256i *block )
 {
    __m256i m[16];
@@ -205,6 +269,23 @@ int blake2s_8way_compress( blake2s_8way_state *S, const __m256i *block )
    memcpy_256( m, block, 16 );
    memcpy_256( v, S->h, 8 );

+   v[ 8] = m256_const1_64( 0x6A09E6676A09E667ULL );
+   v[ 9] = m256_const1_64( 0xBB67AE85BB67AE85ULL );
+   v[10] = m256_const1_64( 0x3C6EF3723C6EF372ULL );
+   v[11] = m256_const1_64( 0xA54FF53AA54FF53AULL );
+   v[12] = _mm256_xor_si256( _mm256_set1_epi32( S->t[0] ),
+                             m256_const1_64( 0x510E527F510E527FULL ) );
+   v[13] = _mm256_xor_si256( _mm256_set1_epi32( S->t[1] ),
+                             m256_const1_64( 0x9B05688C9B05688CULL ) );
+   v[14] = _mm256_xor_si256( _mm256_set1_epi32( S->f[0] ),
+                             m256_const1_64( 0x1F83D9AB1F83D9ABULL ) );
+   v[15] = _mm256_xor_si256( _mm256_set1_epi32( S->f[1] ),
+                             m256_const1_64( 0x5BE0CD195BE0CD19ULL ) );
+
+/*
    v[ 8] = _mm256_set1_epi32( blake2s_IV[0] );
    v[ 9] = _mm256_set1_epi32( blake2s_IV[1] );
    v[10] = _mm256_set1_epi32( blake2s_IV[2] );
@@ -218,6 +299,7 @@ int blake2s_8way_compress( blake2s_8way_state *S, const __m256i *block )
    v[15] = _mm256_xor_si256( _mm256_set1_epi32( S->f[1] ),
                              _mm256_set1_epi32( blake2s_IV[7] ) );

 #define G8W(r,i,a,b,c,d) \
 do { \
    a = _mm256_add_epi32( _mm256_add_epi32( a, b ), \
@@ -231,7 +313,36 @@ do { \
    c = _mm256_add_epi32( c, d ); \
    b = mm256_ror_32( _mm256_xor_si256( b, c ), 7 ); \
 } while(0)
+*/
+
+#define G8W( sigma0, sigma1, a, b, c, d) \
+do { \
+   uint8_t s0 = sigma0; \
+   uint8_t s1 = sigma1; \
+   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), m[ s0 ] ); \
+   d = mm256_ror_32( _mm256_xor_si256( d, a ), 16 ); \
+   c = _mm256_add_epi32( c, d ); \
+   b = mm256_ror_32( _mm256_xor_si256( b, c ), 12 ); \
+   a = _mm256_add_epi32( _mm256_add_epi32( a, b ), m[ s1 ] ); \
+   d = mm256_ror_32( _mm256_xor_si256( d, a ), 8 ); \
+   c = _mm256_add_epi32( c, d ); \
+   b = mm256_ror_32( _mm256_xor_si256( b, c ), 7 ); \
+} while(0)
+
+#define ROUND8W(r) \
+do { \
+   uint8_t *sigma = (uint8_t*)&blake2s_sigma[r]; \
+   G8W( sigma[ 0], sigma[ 1], v[ 0], v[ 4], v[ 8], v[12] ); \
+   G8W( sigma[ 2], sigma[ 3], v[ 1], v[ 5], v[ 9], v[13] ); \
+   G8W( sigma[ 4], sigma[ 5], v[ 2], v[ 6], v[10], v[14] ); \
+   G8W( sigma[ 6], sigma[ 7], v[ 3], v[ 7], v[11], v[15] ); \
+   G8W( sigma[ 8], sigma[ 9], v[ 0], v[ 5], v[10], v[15] ); \
+   G8W( sigma[10], sigma[11], v[ 1], v[ 6], v[11], v[12] ); \
+   G8W( sigma[12], sigma[13], v[ 2], v[ 7], v[ 8], v[13] ); \
+   G8W( sigma[14], sigma[15], v[ 3], v[ 4], v[ 9], v[14] ); \
+} while(0)
+
+/*
 #define ROUND8W(r) \
 do { \
    G8W( r, 0, v[ 0], v[ 4], v[ 8], v[12] ); \
@@ -243,6 +354,7 @@ do { \
    G8W( r, 6, v[ 2], v[ 7], v[ 8], v[13] ); \
    G8W( r, 7, v[ 3], v[ 4], v[ 9], v[14] ); \
 } while(0)
+*/

    ROUND8W( 0 );
    ROUND8W( 1 );
@@ -351,9 +463,203 @@ int blake2s_8way_final( blake2s_8way_state *S, void *out, uint8_t outlen )
    return 0;
 }

+// Update and final when inlen is a multiple of 64 bytes
+int blake2s_8way_full_blocks( blake2s_8way_state *S, void *out,
+                              const void *input, uint64_t inlen )
+{
+  __m256i *in = (__m256i*)input;
+  __m256i *buf = (__m256i*)S->buf;
+
+  while( inlen > BLAKE2S_BLOCKBYTES )
+  {
+     memcpy_256( buf, in, BLAKE2S_BLOCKBYTES >> 2 );
+     S->buflen = BLAKE2S_BLOCKBYTES;
+     inlen -= BLAKE2S_BLOCKBYTES;
+     S->t[0] += BLAKE2S_BLOCKBYTES;
+     S->t[1] += ( S->t[0] < BLAKE2S_BLOCKBYTES );
+     blake2s_8way_compress( S, buf );
+     S->buflen = 0;
+     in += ( BLAKE2S_BLOCKBYTES >> 2 );
+  }
+
+  // last block
+  memcpy_256( buf, in, BLAKE2S_BLOCKBYTES >> 2 );
+  S->buflen = BLAKE2S_BLOCKBYTES;
+  S->t[0] += S->buflen;
+  S->t[1] += ( S->t[0] < S->buflen );
+  if ( S->last_node ) S->f[1] = ~0U;
+  S->f[0] = ~0U;
+  blake2s_8way_compress( S, buf );
+
+  for ( int i = 0; i < 8; ++i )
+      casti_m256i( out, i ) = S->h[ i ];
+  return 0;
+}

 #endif // __AVX2__

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+// Blake2s-256 16 way
+
+int blake2s_16way_compress( blake2s_16way_state *S, const __m512i *block )
+{
+   __m512i m[16];
+   __m512i v[16];
+
+   memcpy_512( m, block, 16 );
+   memcpy_512( v, S->h, 8 );
+
+   v[ 8] = m512_const1_64( 0x6A09E6676A09E667ULL );
+   v[ 9] = m512_const1_64( 0xBB67AE85BB67AE85ULL );
+   v[10] = m512_const1_64( 0x3C6EF3723C6EF372ULL );
+   v[11] = m512_const1_64( 0xA54FF53AA54FF53AULL );
+   v[12] = _mm512_xor_si512( _mm512_set1_epi32( S->t[0] ),
+                             m512_const1_64( 0x510E527F510E527FULL ) );
+   v[13] = _mm512_xor_si512( _mm512_set1_epi32( S->t[1] ),
+                             m512_const1_64( 0x9B05688C9B05688CULL ) );
+   v[14] = _mm512_xor_si512( _mm512_set1_epi32( S->f[0] ),
+                             m512_const1_64( 0x1F83D9AB1F83D9ABULL ) );
+   v[15] = _mm512_xor_si512( _mm512_set1_epi32( S->f[1] ),
+                             m512_const1_64( 0x5BE0CD195BE0CD19ULL ) );
+
+#define G16W( sigma0, sigma1, a, b, c, d) \
+do { \
+   uint8_t s0 = sigma0; \
+   uint8_t s1 = sigma1; \
+   a = _mm512_add_epi32( _mm512_add_epi32( a, b ), m[ s0 ] ); \
+   d = mm512_ror_32( _mm512_xor_si512( d, a ), 16 ); \
+   c = _mm512_add_epi32( c, d ); \
+   b = mm512_ror_32( _mm512_xor_si512( b, c ), 12 ); \
+   a = _mm512_add_epi32( _mm512_add_epi32( a, b ), m[ s1 ] ); \
+   d = mm512_ror_32( _mm512_xor_si512( d, a ), 8 ); \
+   c = _mm512_add_epi32( c, d ); \
+   b = mm512_ror_32( _mm512_xor_si512( b, c ), 7 ); \
+} while(0)
+
+#define ROUND16W(r) \
+do { \
+   uint8_t *sigma = (uint8_t*)&blake2s_sigma[r]; \
+   G16W( sigma[ 0], sigma[ 1], v[ 0], v[ 4], v[ 8], v[12] ); \
+   G16W( sigma[ 2], sigma[ 3], v[ 1], v[ 5], v[ 9], v[13] ); \
+   G16W( sigma[ 4], sigma[ 5], v[ 2], v[ 6], v[10], v[14] ); \
+   G16W( sigma[ 6], sigma[ 7], v[ 3], v[ 7], v[11], v[15] ); \
+   G16W( sigma[ 8], sigma[ 9], v[ 0], v[ 5], v[10], v[15] ); \
+   G16W( sigma[10], sigma[11], v[ 1], v[ 6], v[11], v[12] ); \
+   G16W( sigma[12], sigma[13], v[ 2], v[ 7], v[ 8], v[13] ); \
+   G16W( sigma[14], sigma[15], v[ 3], v[ 4], v[ 9], v[14] ); \
+} while(0)
+
+   ROUND16W( 0 );
+   ROUND16W( 1 );
+   ROUND16W( 2 );
+   ROUND16W( 3 );
+   ROUND16W( 4 );
+   ROUND16W( 5 );
+   ROUND16W( 6 );
+   ROUND16W( 7 );
+   ROUND16W( 8 );
+   ROUND16W( 9 );
+
+   for( size_t i = 0; i < 8; ++i )
+      S->h[i] = _mm512_xor_si512( _mm512_xor_si512( S->h[i], v[i] ), v[i + 8] );
+
+#undef G16W
+#undef ROUND16W
+   return 0;
+}
+
+int blake2s_16way_init( blake2s_16way_state *S, const uint8_t outlen )
+{
+   blake2s_nway_param P[1];
+
+   P->digest_length = outlen;
+   P->key_length = 0;
+   P->fanout = 1;
+   P->depth = 1;
+   P->leaf_length = 0;
+   *((uint64_t*)(P->node_offset)) = 0;
+   P->node_depth = 0;
+   P->inner_length = 0;
+   memset( P->salt, 0, sizeof( P->salt ) );
+   memset( P->personal, 0, sizeof( P->personal ) );
+
+   memset( S, 0, sizeof( blake2s_16way_state ) );
+   S->h[0] = m512_const1_64( 0x6A09E6676A09E667ULL );
+   S->h[1] = m512_const1_64( 0xBB67AE85BB67AE85ULL );
+   S->h[2] = m512_const1_64( 0x3C6EF3723C6EF372ULL );
+   S->h[3] = m512_const1_64( 0xA54FF53AA54FF53AULL );
+   S->h[4] = m512_const1_64( 0x510E527F510E527FULL );
+   S->h[5] = m512_const1_64( 0x9B05688C9B05688CULL );
+   S->h[6] = m512_const1_64( 0x1F83D9AB1F83D9ABULL );
+   S->h[7] = m512_const1_64( 0x5BE0CD195BE0CD19ULL );
+
+   uint32_t *p = ( uint32_t * )( P );
+
+   /* IV XOR ParamBlock */
+   for ( size_t i = 0; i < 8; ++i )
+      S->h[i] = _mm512_xor_si512( S->h[i], _mm512_set1_epi32( p[i] ) );
+   return 0;
+}
+
+int blake2s_16way_update( blake2s_16way_state *S, const void *in,
+                          uint64_t inlen )
+{
+  __m512i *input = (__m512i*)in;
+  __m512i *buf = (__m512i*)S->buf;
+  const int bsize = BLAKE2S_BLOCKBYTES;
+
+  while( inlen > 0 )
+  {
+     size_t left = S->buflen;
+     if( inlen >= bsize - left )
+     {
+        memcpy_512( buf + (left>>2), input, (bsize - left) >> 2 );
+        S->buflen += bsize - left;
+        S->t[0] += BLAKE2S_BLOCKBYTES;
+        S->t[1] += ( S->t[0] < BLAKE2S_BLOCKBYTES );
+        blake2s_16way_compress( S, buf );
+        S->buflen = 0;
+        input += ( bsize >> 2 );
+        inlen -= bsize;
+     }
+     else
+     {
+        memcpy_512( buf + ( left>>2 ), input, inlen>>2 );
+        S->buflen += (size_t) inlen;
+        input += ( inlen>>2 );
+        inlen -= inlen;
+     }
+  }
+  return 0;
+}
+
+int blake2s_16way_final( blake2s_16way_state *S, void *out, uint8_t outlen )
+{
+  __m512i *buf = (__m512i*)S->buf;
+
+  S->t[0] += S->buflen;
+  S->t[1] += ( S->t[0] < S->buflen );
+  if ( S->last_node )
+     S->f[1] = ~0U;
+  S->f[0] = ~0U;
+
+  memset_zero_512( buf + ( S->buflen>>2 ),
+                   ( BLAKE2S_BLOCKBYTES - S->buflen ) >> 2 );
+  blake2s_16way_compress( S, buf );
+
+  for ( int i = 0; i < 8; ++i )
+      casti_m512i( out, i ) = S->h[ i ];
+  return 0;
+}
+
+#endif // AVX512

 #if 0
 int blake2s( uint8_t *out, const void *in, const void *key, const uint8_t outlen, const uint64_t inlen, uint8_t keylen )
 {
@@ -14,7 +14,6 @@
 #ifndef __BLAKE2S_HASH_4WAY_H__
 #define __BLAKE2S_HASH_4WAY_H__ 1

-//#if defined(__SSE4_2__)
 #if defined(__SSE2__)

 #include "simd-utils.h"

@@ -75,6 +74,9 @@ int blake2s_4way_init( blake2s_4way_state *S, const uint8_t outlen );
 int blake2s_4way_update( blake2s_4way_state *S, const void *in,
                          uint64_t inlen );
 int blake2s_4way_final( blake2s_4way_state *S, void *out, uint8_t outlen );
+int blake2s_4way_full_blocks( blake2s_4way_state *S, void *out,
+                              const void *input, uint64_t inlen );
+
 #if defined(__AVX2__)

@@ -92,6 +94,27 @@ int blake2s_8way_init( blake2s_8way_state *S, const uint8_t outlen );
 int blake2s_8way_update( blake2s_8way_state *S, const void *in,
                          uint64_t inlen );
 int blake2s_8way_final( blake2s_8way_state *S, void *out, uint8_t outlen );
+int blake2s_8way_full_blocks( blake2s_8way_state *S, void *out,
+                              const void *input, uint64_t inlen );
+
+#endif
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+ALIGN( 128 ) typedef struct __blake2s_16way_state
+{
+   __m512i h[8];
+   uint8_t buf[ BLAKE2S_BLOCKBYTES * 16 ];
+   uint32_t t[2];
+   uint32_t f[2];
+   size_t buflen;
+   uint8_t last_node;
+} blake2s_16way_state ;
+
+int blake2s_16way_init( blake2s_16way_state *S, const uint8_t outlen );
+int blake2s_16way_update( blake2s_16way_state *S, const void *in,
+                          uint64_t inlen );
+int blake2s_16way_final( blake2s_16way_state *S, void *out, uint8_t outlen );
+
 #endif

@@ -108,6 +131,6 @@ int blake2s_8way_final( blake2s_8way_state *S, void *out, uint8_t outlen );
 }
 #endif

-#endif // __SSE4_2__
+#endif // __SSE2__

 #endif
@@ -56,7 +56,7 @@ int scanhash_blake2s( struct work *work,
    do {
       be32enc(&endiandata[19], n);
       blake2s_hash( hash64, endiandata );
-      if (hash64[7] < Htarg && fulltest(hash64, ptarget)) {
+      if (hash64[7] <= Htarg && fulltest(hash64, ptarget)) {
          *hashes_done = n - first_nonce + 1;
          pdata[19] = n;
          return true;

@@ -70,18 +70,3 @@ int scanhash_blake2s( struct work *work,
    return 0;
 }
-/*
-// changed to get_max64_0x3fffffLL in cpuminer-multi-decred
-int64_t blake2s_get_max64 ()
-{
-   return 0x7ffffLL;
-}
-
-bool register_blake2s_algo( algo_gate_t* gate )
-{
-   gate->scanhash = (void*)&scanhash_blake2s;
-   gate->hash = (void*)&blake2s_hash;
-   gate->get_max64 = (void*)&blake2s_get_max64;
-   return true;
-};
-*/
|||||||
@@ -42,21 +42,13 @@
|
|||||||
extern "C"{
|
extern "C"{
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
#if SPH_SMALL_FOOTPRINT && !defined SPH_SMALL_FOOTPRINT_BLAKE
|
|
||||||
#define SPH_SMALL_FOOTPRINT_BLAKE 1
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#if SPH_64 && (SPH_SMALL_FOOTPRINT_BLAKE || !SPH_64_TRUE)
|
|
||||||
#define SPH_COMPACT_BLAKE_64 1
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifdef _MSC_VER
|
#ifdef _MSC_VER
|
||||||
#pragma warning (disable: 4146)
|
#pragma warning (disable: 4146)
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
|
// Blake-512 common
|
||||||
|
|
||||||
// Blake-512
|
/*
|
||||||
|
|
||||||
static const sph_u64 IV512[8] = {
|
static const sph_u64 IV512[8] = {
|
||||||
SPH_C64(0x6A09E667F3BCC908), SPH_C64(0xBB67AE8584CAA73B),
|
SPH_C64(0x6A09E667F3BCC908), SPH_C64(0xBB67AE8584CAA73B),
|
||||||
SPH_C64(0x3C6EF372FE94F82B), SPH_C64(0xA54FF53A5F1D36F1),
|
SPH_C64(0x3C6EF372FE94F82B), SPH_C64(0xA54FF53A5F1D36F1),
|
||||||
@@ -64,10 +56,7 @@ static const sph_u64 IV512[8] = {
|
|||||||
SPH_C64(0x1F83D9ABFB41BD6B), SPH_C64(0x5BE0CD19137E2179)
|
SPH_C64(0x1F83D9ABFB41BD6B), SPH_C64(0x5BE0CD19137E2179)
|
||||||
};
|
};
|
||||||
|
|
||||||
|
static const sph_u64 salt_zero_big[4] = { 0, 0, 0, 0 };
|
||||||
#if SPH_COMPACT_BLAKE_32 || SPH_COMPACT_BLAKE_64
|
|
||||||
|
|
||||||
// Blake-256 4 & 8 way, Blake-512 4 way
|
|
||||||
|
|
||||||
static const unsigned sigma[16][16] = {
|
static const unsigned sigma[16][16] = {
|
||||||
{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
|
{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
|
||||||
@@ -88,7 +77,17 @@ static const unsigned sigma[16][16] = {
|
|||||||
{ 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9 }
|
{ 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9 }
|
||||||
};
|
};
|
||||||
|
|
||||||
#endif
|
static const sph_u64 CB[16] = {
|
||||||
|
SPH_C64(0x243F6A8885A308D3), SPH_C64(0x13198A2E03707344),
|
||||||
|
SPH_C64(0xA4093822299F31D0), SPH_C64(0x082EFA98EC4E6C89),
|
||||||
|
SPH_C64(0x452821E638D01377), SPH_C64(0xBE5466CF34E90C6C),
|
||||||
|
SPH_C64(0xC0AC29B7C97C50DD), SPH_C64(0x3F84D5B5B5470917),
|
||||||
|
SPH_C64(0x9216D5D98979FB1B), SPH_C64(0xD1310BA698DFB5AC),
|
||||||
|
SPH_C64(0x2FFD72DBD01ADFB7), SPH_C64(0xB8E1AFED6A267E96),
|
||||||
|
SPH_C64(0xBA7C9045F12C7F99), SPH_C64(0x24A19947B3916CF7),
|
||||||
|
SPH_C64(0x0801F2E2858EFC16), SPH_C64(0x636920D871574E69)
|
||||||
|
|
||||||
|
*/
|
||||||
|
|
||||||
#define Z00 0
|
#define Z00 0
|
||||||
#define Z01 1
|
#define Z01 1
|
||||||
@@ -264,105 +263,28 @@ static const unsigned sigma[16][16] = {
|
|||||||
#define Mx_(n) Mx__(n)
|
#define Mx_(n) Mx__(n)
|
||||||
#define Mx__(n) M ## n
|
#define Mx__(n) M ## n
|
||||||
|
|
||||||
// Blake-512 4 way
|
|
||||||
|
|
||||||
#define CBx(r, i) CBx_(Z ## r ## i)
|
#define CBx(r, i) CBx_(Z ## r ## i)
|
||||||
#define CBx_(n) CBx__(n)
|
#define CBx_(n) CBx__(n)
|
||||||
#define CBx__(n) CB ## n
|
#define CBx__(n) CB ## n
|
||||||
|
|
||||||
#define CB0 SPH_C64(0x243F6A8885A308D3)
|
#define CB0 0x243F6A8885A308D3
|
||||||
#define CB1 SPH_C64(0x13198A2E03707344)
|
#define CB1 0x13198A2E03707344
|
||||||
#define CB2 SPH_C64(0xA4093822299F31D0)
|
#define CB2 0xA4093822299F31D0
|
||||||
#define CB3 SPH_C64(0x082EFA98EC4E6C89)
|
#define CB3 0x082EFA98EC4E6C89
|
||||||
#define CB4 SPH_C64(0x452821E638D01377)
|
#define CB4 0x452821E638D01377
|
||||||
#define CB5 SPH_C64(0xBE5466CF34E90C6C)
|
#define CB5 0xBE5466CF34E90C6C
|
||||||
#define CB6 SPH_C64(0xC0AC29B7C97C50DD)
|
#define CB6 0xC0AC29B7C97C50DD
|
||||||
#define CB7 SPH_C64(0x3F84D5B5B5470917)
|
#define CB7 0x3F84D5B5B5470917
|
||||||
#define CB8 SPH_C64(0x9216D5D98979FB1B)
|
#define CB8 0x9216D5D98979FB1B
|
||||||
#define CB9 SPH_C64(0xD1310BA698DFB5AC)
|
#define CB9 0xD1310BA698DFB5AC
|
||||||
#define CBA SPH_C64(0x2FFD72DBD01ADFB7)
|
#define CBA 0x2FFD72DBD01ADFB7
|
||||||
#define CBB SPH_C64(0xB8E1AFED6A267E96)
|
#define CBB 0xB8E1AFED6A267E96
|
||||||
#define CBC SPH_C64(0xBA7C9045F12C7F99)
|
#define CBC 0xBA7C9045F12C7F99
|
||||||
#define CBD SPH_C64(0x24A19947B3916CF7)
|
#define CBD 0x24A19947B3916CF7
|
||||||
#define CBE SPH_C64(0x0801F2E2858EFC16)
|
#define CBE 0x0801F2E2858EFC16
|
||||||
#define CBF SPH_C64(0x636920D871574E69)
|
#define CBF 0x636920D871574E69
|
||||||
|
|
||||||
#if SPH_COMPACT_BLAKE_64
|
#define READ_STATE64(state) do { \
|
||||||
// not used
|
|
||||||
static const sph_u64 CB[16] = {
|
|
||||||
SPH_C64(0x243F6A8885A308D3), SPH_C64(0x13198A2E03707344),
|
|
||||||
SPH_C64(0xA4093822299F31D0), SPH_C64(0x082EFA98EC4E6C89),
|
|
||||||
SPH_C64(0x452821E638D01377), SPH_C64(0xBE5466CF34E90C6C),
|
|
||||||
SPH_C64(0xC0AC29B7C97C50DD), SPH_C64(0x3F84D5B5B5470917),
|
|
||||||
SPH_C64(0x9216D5D98979FB1B), SPH_C64(0xD1310BA698DFB5AC),
|
|
||||||
SPH_C64(0x2FFD72DBD01ADFB7), SPH_C64(0xB8E1AFED6A267E96),
|
|
||||||
SPH_C64(0xBA7C9045F12C7F99), SPH_C64(0x24A19947B3916CF7),
|
|
||||||
SPH_C64(0x0801F2E2858EFC16), SPH_C64(0x636920D871574E69)
|
|
||||||
};
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
// Blake-512 4 way
|
|
||||||
|
|
||||||
#define GB_4WAY(m0, m1, c0, c1, a, b, c, d) do { \
|
|
||||||
a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
|
|
||||||
_mm256_set1_epi64x( c1 ), m0 ), b ), a ); \
|
|
||||||
d = mm256_ror_64( _mm256_xor_si256( d, a ), 32 ); \
|
|
||||||
c = _mm256_add_epi64( c, d ); \
|
|
||||||
b = mm256_ror_64( _mm256_xor_si256( b, c ), 25 ); \
|
|
||||||
a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
|
|
||||||
_mm256_set1_epi64x( c0 ), m1 ), b ), a ); \
|
|
||||||
d = mm256_ror_64( _mm256_xor_si256( d, a ), 16 ); \
|
|
||||||
c = _mm256_add_epi64( c, d ); \
|
|
||||||
b = mm256_ror_64( _mm256_xor_si256( b, c ), 11 ); \
|
|
||||||
} while (0)
|
|
||||||
|
|
||||||
#if SPH_COMPACT_BLAKE_64
|
|
||||||
// not used
|
|
||||||
#define ROUND_B_4WAY(r) do { \
|
|
||||||
GB_4WAY(M[sigma[r][0x0]], M[sigma[r][0x1]], \
|
|
||||||
CB[sigma[r][0x0]], CB[sigma[r][0x1]], V0, V4, V8, VC); \
|
|
||||||
GB_4WAY(M[sigma[r][0x2]], M[sigma[r][0x3]], \
|
|
||||||
CB[sigma[r][0x2]], CB[sigma[r][0x3]], V1, V5, V9, VD); \
|
|
||||||
GB_4WAY(M[sigma[r][0x4]], M[sigma[r][0x5]], \
|
|
||||||
CB[sigma[r][0x4]], CB[sigma[r][0x5]], V2, V6, VA, VE); \
|
|
||||||
GB_4WAY(M[sigma[r][0x6]], M[sigma[r][0x7]], \
|
|
||||||
CB[sigma[r][0x6]], CB[sigma[r][0x7]], V3, V7, VB, VF); \
|
|
||||||
GB_4WAY(M[sigma[r][0x8]], M[sigma[r][0x9]], \
|
|
||||||
CB[sigma[r][0x8]], CB[sigma[r][0x9]], V0, V5, VA, VF); \
|
|
||||||
GB_4WAY(M[sigma[r][0xA]], M[sigma[r][0xB]], \
|
|
||||||
CB[sigma[r][0xA]], CB[sigma[r][0xB]], V1, V6, VB, VC); \
|
|
||||||
GB_4WAY(M[sigma[r][0xC]], M[sigma[r][0xD]], \
|
|
||||||
CB[sigma[r][0xC]], CB[sigma[r][0xD]], V2, V7, V8, VD); \
|
|
||||||
GB_4WAY(M[sigma[r][0xE]], M[sigma[r][0xF]], \
|
|
||||||
CB[sigma[r][0xE]], CB[sigma[r][0xF]], V3, V4, V9, VE); \
|
|
||||||
} while (0)
|
|
||||||
|
|
||||||
#else
|
|
||||||
//current_impl
|
|
||||||
#define ROUND_B_4WAY(r) do { \
|
|
||||||
GB_4WAY(Mx(r, 0), Mx(r, 1), CBx(r, 0), CBx(r, 1), V0, V4, V8, VC); \
|
|
||||||
GB_4WAY(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD); \
|
|
||||||
GB_4WAY(Mx(r, 4), Mx(r, 5), CBx(r, 4), CBx(r, 5), V2, V6, VA, VE); \
|
|
||||||
GB_4WAY(Mx(r, 6), Mx(r, 7), CBx(r, 6), CBx(r, 7), V3, V7, VB, VF); \
|
|
||||||
GB_4WAY(Mx(r, 8), Mx(r, 9), CBx(r, 8), CBx(r, 9), V0, V5, VA, VF); \
|
|
||||||
GB_4WAY(Mx(r, A), Mx(r, B), CBx(r, A), CBx(r, B), V1, V6, VB, VC); \
|
|
||||||
GB_4WAY(Mx(r, C), Mx(r, D), CBx(r, C), CBx(r, D), V2, V7, V8, VD); \
|
|
||||||
GB_4WAY(Mx(r, E), Mx(r, F), CBx(r, E), CBx(r, F), V3, V4, V9, VE); \
|
|
||||||
} while (0)
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
|
|
||||||
// Blake-512 4 way
|
|
||||||
|
|
||||||
#define DECL_STATE64_4WAY \
|
|
||||||
__m256i H0, H1, H2, H3, H4, H5, H6, H7; \
|
|
||||||
__m256i S0, S1, S2, S3; \
|
|
||||||
sph_u64 T0, T1;
|
|
||||||
|
|
||||||
#define READ_STATE64_4WAY(state) do { \
|
|
||||||
H0 = (state)->H[0]; \
|
H0 = (state)->H[0]; \
|
||||||
H1 = (state)->H[1]; \
|
H1 = (state)->H[1]; \
|
||||||
H2 = (state)->H[2]; \
|
H2 = (state)->H[2]; \
|
||||||
@@ -379,7 +301,7 @@ static const sph_u64 CB[16] = {
|
|||||||
T1 = (state)->T1; \
|
T1 = (state)->T1; \
|
||||||
} while (0)
|
} while (0)
|
||||||
|
|
||||||
#define WRITE_STATE64_4WAY(state) do { \
|
#define WRITE_STATE64(state) do { \
|
||||||
(state)->H[0] = H0; \
|
 (state)->H[0] = H0; \
 (state)->H[1] = H1; \
 (state)->H[2] = H2; \
@@ -396,14 +318,46 @@ static const sph_u64 CB[16] = {
 (state)->T1 = T1; \
 } while (0)
 
-#if SPH_COMPACT_BLAKE_64
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
 
-// not used
-#define COMPRESS64_4WAY do { \
-   __m256i M[16]; \
-   __m256i V0, V1, V2, V3, V4, V5, V6, V7; \
-   __m256i V8, V9, VA, VB, VC, VD, VE, VF; \
-   unsigned r; \
+// Blake-512 8 way AVX512
+
+#define GB_8WAY(m0, m1, c0, c1, a, b, c, d) do { \
+   a = _mm512_add_epi64( _mm512_add_epi64( _mm512_xor_si512( \
+                 _mm512_set1_epi64( c1 ), m0 ), b ), a ); \
+   d = mm512_ror_64( _mm512_xor_si512( d, a ), 32 ); \
+   c = _mm512_add_epi64( c, d ); \
+   b = mm512_ror_64( _mm512_xor_si512( b, c ), 25 ); \
+   a = _mm512_add_epi64( _mm512_add_epi64( _mm512_xor_si512( \
+                 _mm512_set1_epi64( c0 ), m1 ), b ), a ); \
+   d = mm512_ror_64( _mm512_xor_si512( d, a ), 16 ); \
+   c = _mm512_add_epi64( c, d ); \
+   b = mm512_ror_64( _mm512_xor_si512( b, c ), 11 ); \
+} while (0)
+
+#define ROUND_B_8WAY(r) do { \
+   GB_8WAY(Mx(r, 0), Mx(r, 1), CBx(r, 0), CBx(r, 1), V0, V4, V8, VC); \
+   GB_8WAY(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD); \
+   GB_8WAY(Mx(r, 4), Mx(r, 5), CBx(r, 4), CBx(r, 5), V2, V6, VA, VE); \
+   GB_8WAY(Mx(r, 6), Mx(r, 7), CBx(r, 6), CBx(r, 7), V3, V7, VB, VF); \
+   GB_8WAY(Mx(r, 8), Mx(r, 9), CBx(r, 8), CBx(r, 9), V0, V5, VA, VF); \
+   GB_8WAY(Mx(r, A), Mx(r, B), CBx(r, A), CBx(r, B), V1, V6, VB, VC); \
+   GB_8WAY(Mx(r, C), Mx(r, D), CBx(r, C), CBx(r, D), V2, V7, V8, VD); \
+   GB_8WAY(Mx(r, E), Mx(r, F), CBx(r, E), CBx(r, F), V3, V4, V9, VE); \
+} while (0)
+
+#define DECL_STATE64_8WAY \
+   __m512i H0, H1, H2, H3, H4, H5, H6, H7; \
+   __m512i S0, S1, S2, S3; \
+   uint64_t T0, T1;
+
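GB_8WAY above is the Blake-512 G mixing step vectorized across eight independent lanes; per 64-bit lane it is the same sequence of adds, xors, and right-rotations by 32, 25, 16, and 11. A minimal scalar sketch of one lane (helper names are hypothetical, not from this file):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit word right by n (1..63). */
static inline uint64_t rotr64( uint64_t x, unsigned n )
{
   return ( x >> n ) | ( x << ( 64 - n ) );
}

/* One scalar Blake-512 G step, the per-lane equivalent of GB_8WAY:
   message words m0, m1 and round constants c0, c1 are mixed into the
   four working variables a, b, c, d with rotations 32, 25, 16, 11. */
static void g_scalar( uint64_t m0, uint64_t m1, uint64_t c0, uint64_t c1,
                      uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d )
{
   *a = *a + *b + ( m0 ^ c1 );
   *d = rotr64( *d ^ *a, 32 );
   *c = *c + *d;
   *b = rotr64( *b ^ *c, 25 );
   *a = *a + *b + ( m1 ^ c0 );
   *d = rotr64( *d ^ *a, 16 );
   *c = *c + *d;
   *b = rotr64( *b ^ *c, 11 );
}
```

The SIMD macro simply runs this step on eight such lanes at once, one 64-bit lane per `__m512i` slot.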
+#define COMPRESS64_8WAY( buf ) do \
+{ \
+   __m512i M0, M1, M2, M3, M4, M5, M6, M7; \
+   __m512i M8, M9, MA, MB, MC, MD, ME, MF; \
+   __m512i V0, V1, V2, V3, V4, V5, V6, V7; \
+   __m512i V8, V9, VA, VB, VC, VD, VE, VF; \
+   __m512i shuf_bswap64; \
    V0 = H0; \
    V1 = H1; \
    V2 = H2; \
@@ -412,57 +366,382 @@ static const sph_u64 CB[16] = {
    V5 = H5; \
    V6 = H6; \
    V7 = H7; \
-   V8 = _mm256_xor_si256( S0, _mm256_set_epi64x( CB0, CB0, CB0, CB0 ) ); \
-   V9 = _mm256_xor_si256( S1, _mm256_set_epi64x( CB1, CB1, CB1, CB1 ) ); \
-   VA = _mm256_xor_si256( S2, _mm256_set_epi64x( CB2, CB2, CB2, CB2 ) ); \
-   VB = _mm256_xor_si256( S3, _mm256_set_epi64x( CB3, CB3, CB3, CB3 ) ); \
-   VC = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
-                          _mm256_set_epi64x( CB4, CB4, CB4, CB4 ) ); \
-   VD = _mm256_xor_si256( _mm256_set_epi64x( T0, T0, T0, T0 ), \
-                          _mm256_set_epi64x( CB5, CB5, CB5, CB5 ) ); \
-   VE = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
-                          _mm256_set_epi64x( CB6, CB6, CB6, CB6 ) ); \
-   VF = _mm256_xor_si256( _mm256_set_epi64x( T1, T1, T1, T1 ), \
-                          _mm256_set_epi64x( CB7, CB7, CB7, CB7 ) ); \
-   M[0x0] = mm256_bswap_64( *(buf+0) ); \
-   M[0x1] = mm256_bswap_64( *(buf+1) ); \
-   M[0x2] = mm256_bswap_64( *(buf+2) ); \
-   M[0x3] = mm256_bswap_64( *(buf+3) ); \
-   M[0x4] = mm256_bswap_64( *(buf+4) ); \
-   M[0x5] = mm256_bswap_64( *(buf+5) ); \
-   M[0x6] = mm256_bswap_64( *(buf+6) ); \
-   M[0x7] = mm256_bswap_64( *(buf+7) ); \
-   M[0x8] = mm256_bswap_64( *(buf+8) ); \
-   M[0x9] = mm256_bswap_64( *(buf+9) ); \
-   M[0xA] = mm256_bswap_64( *(buf+10) ); \
-   M[0xB] = mm256_bswap_64( *(buf+11) ); \
-   M[0xC] = mm256_bswap_64( *(buf+12) ); \
-   M[0xD] = mm256_bswap_64( *(buf+13) ); \
-   M[0xE] = mm256_bswap_64( *(buf+14) ); \
-   M[0xF] = mm256_bswap_64( *(buf+15) ); \
-   for (r = 0; r < 16; r ++) \
-      ROUND_B_4WAY(r); \
-   H0 = _mm256_xor_si256( _mm256_xor_si256( \
-                _mm256_xor_si256( S0, V0 ), V8 ), H0 ); \
-   H1 = _mm256_xor_si256( _mm256_xor_si256( \
-                _mm256_xor_si256( S1, V1 ), V9 ), H1 ); \
-   H2 = _mm256_xor_si256( _mm256_xor_si256( \
-                _mm256_xor_si256( S2, V2 ), VA ), H2 ); \
-   H3 = _mm256_xor_si256( _mm256_xor_si256( \
-                _mm256_xor_si256( S3, V3 ), VB ), H3 ); \
-   H4 = _mm256_xor_si256( _mm256_xor_si256( \
-                _mm256_xor_si256( S0, V4 ), VC ), H4 ); \
-   H5 = _mm256_xor_si256( _mm256_xor_si256( \
-                _mm256_xor_si256( S1, V5 ), VD ), H5 ); \
-   H6 = _mm256_xor_si256( _mm256_xor_si256( \
-                _mm256_xor_si256( S2, V6 ), VE ), H6 ); \
-   H7 = _mm256_xor_si256( _mm256_xor_si256( \
-                _mm256_xor_si256( S3, V7 ), VF ), H7 ); \
+   V8 = _mm512_xor_si512( S0, m512_const1_64( CB0 ) ); \
+   V9 = _mm512_xor_si512( S1, m512_const1_64( CB1 ) ); \
+   VA = _mm512_xor_si512( S2, m512_const1_64( CB2 ) ); \
+   VB = _mm512_xor_si512( S3, m512_const1_64( CB3 ) ); \
+   VC = _mm512_xor_si512( _mm512_set1_epi64( T0 ), \
+                          m512_const1_64( CB4 ) ); \
+   VD = _mm512_xor_si512( _mm512_set1_epi64( T0 ), \
+                          m512_const1_64( CB5 ) ); \
+   VE = _mm512_xor_si512( _mm512_set1_epi64( T1 ), \
+                          m512_const1_64( CB6 ) ); \
+   VF = _mm512_xor_si512( _mm512_set1_epi64( T1 ), \
+                          m512_const1_64( CB7 ) ); \
+   shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637, \
+                                 0x28292a2b2c2d2e2f, 0x2021222324252627, \
+                                 0x18191a1b1c1d1e1f, 0x1011121314151617, \
+                                 0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
+   M0 = _mm512_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
+   M1 = _mm512_shuffle_epi8( *(buf+ 1), shuf_bswap64 ); \
+   M2 = _mm512_shuffle_epi8( *(buf+ 2), shuf_bswap64 ); \
+   M3 = _mm512_shuffle_epi8( *(buf+ 3), shuf_bswap64 ); \
+   M4 = _mm512_shuffle_epi8( *(buf+ 4), shuf_bswap64 ); \
+   M5 = _mm512_shuffle_epi8( *(buf+ 5), shuf_bswap64 ); \
+   M6 = _mm512_shuffle_epi8( *(buf+ 6), shuf_bswap64 ); \
+   M7 = _mm512_shuffle_epi8( *(buf+ 7), shuf_bswap64 ); \
+   M8 = _mm512_shuffle_epi8( *(buf+ 8), shuf_bswap64 ); \
+   M9 = _mm512_shuffle_epi8( *(buf+ 9), shuf_bswap64 ); \
+   MA = _mm512_shuffle_epi8( *(buf+10), shuf_bswap64 ); \
+   MB = _mm512_shuffle_epi8( *(buf+11), shuf_bswap64 ); \
+   MC = _mm512_shuffle_epi8( *(buf+12), shuf_bswap64 ); \
+   MD = _mm512_shuffle_epi8( *(buf+13), shuf_bswap64 ); \
+   ME = _mm512_shuffle_epi8( *(buf+14), shuf_bswap64 ); \
+   MF = _mm512_shuffle_epi8( *(buf+15), shuf_bswap64 ); \
+   ROUND_B_8WAY(0); \
+   ROUND_B_8WAY(1); \
+   ROUND_B_8WAY(2); \
+   ROUND_B_8WAY(3); \
+   ROUND_B_8WAY(4); \
+   ROUND_B_8WAY(5); \
+   ROUND_B_8WAY(6); \
+   ROUND_B_8WAY(7); \
+   ROUND_B_8WAY(8); \
+   ROUND_B_8WAY(9); \
+   ROUND_B_8WAY(0); \
+   ROUND_B_8WAY(1); \
+   ROUND_B_8WAY(2); \
+   ROUND_B_8WAY(3); \
+   ROUND_B_8WAY(4); \
+   ROUND_B_8WAY(5); \
+   H0 = mm512_xor4( V8, V0, S0, H0 ); \
+   H1 = mm512_xor4( V9, V1, S1, H1 ); \
+   H2 = mm512_xor4( VA, V2, S2, H2 ); \
+   H3 = mm512_xor4( VB, V3, S3, H3 ); \
+   H4 = mm512_xor4( VC, V4, S0, H4 ); \
+   H5 = mm512_xor4( VD, V5, S1, H5 ); \
+   H6 = mm512_xor4( VE, V6, S2, H6 ); \
+   H7 = mm512_xor4( VF, V7, S3, H7 ); \
 } while (0)
 
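The shuf_bswap64 constant used by COMPRESS64_8WAY reverses the byte order inside every 64-bit lane, because Blake-512 reads its message words big-endian while x86 stores them little-endian. A scalar equivalent of that per-qword byte swap (illustrative only, not from this source):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar equivalent of one 64-bit lane of the shuf_bswap64 shuffle:
   byte indices 7,6,5,4,3,2,1,0 within each qword, i.e. a byte swap. */
static uint64_t bswap64_scalar( uint64_t x )
{
   uint64_t r = 0;
   for ( int i = 0; i < 8; i++ )
      r |= ( ( x >> ( 8 * i ) ) & 0xFFULL ) << ( 8 * ( 7 - i ) );
   return r;
}
```

Replacing the per-qword `mm256_bswap_64` calls with one shuffle against a preloaded index vector lets a single instruction swap all eight lanes of a message word at once.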
-#else
-
-//current impl
+void blake512_8way_compress( blake_8way_big_context *sc )
+{
+   __m512i M0, M1, M2, M3, M4, M5, M6, M7;
+   __m512i M8, M9, MA, MB, MC, MD, ME, MF;
+   __m512i V0, V1, V2, V3, V4, V5, V6, V7;
+   __m512i V8, V9, VA, VB, VC, VD, VE, VF;
+   __m512i shuf_bswap64;
+
+   V0 = sc->H[0];
+   V1 = sc->H[1];
+   V2 = sc->H[2];
+   V3 = sc->H[3];
+   V4 = sc->H[4];
+   V5 = sc->H[5];
+   V6 = sc->H[6];
+   V7 = sc->H[7];
+   V8 = _mm512_xor_si512( sc->S[0], m512_const1_64( CB0 ) );
+   V9 = _mm512_xor_si512( sc->S[1], m512_const1_64( CB1 ) );
+   VA = _mm512_xor_si512( sc->S[2], m512_const1_64( CB2 ) );
+   VB = _mm512_xor_si512( sc->S[3], m512_const1_64( CB3 ) );
+   VC = _mm512_xor_si512( _mm512_set1_epi64( sc->T0 ),
+                          m512_const1_64( CB4 ) );
+   VD = _mm512_xor_si512( _mm512_set1_epi64( sc->T0 ),
+                          m512_const1_64( CB5 ) );
+   VE = _mm512_xor_si512( _mm512_set1_epi64( sc->T1 ),
+                          m512_const1_64( CB6 ) );
+   VF = _mm512_xor_si512( _mm512_set1_epi64( sc->T1 ),
+                          m512_const1_64( CB7 ) );
+
+   shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637,
+                                 0x28292a2b2c2d2e2f, 0x2021222324252627,
+                                 0x18191a1b1c1d1e1f, 0x1011121314151617,
+                                 0x08090a0b0c0d0e0f, 0x0001020304050607 );
+
+   M0 = _mm512_shuffle_epi8( sc->buf[ 0], shuf_bswap64 );
+   M1 = _mm512_shuffle_epi8( sc->buf[ 1], shuf_bswap64 );
+   M2 = _mm512_shuffle_epi8( sc->buf[ 2], shuf_bswap64 );
+   M3 = _mm512_shuffle_epi8( sc->buf[ 3], shuf_bswap64 );
+   M4 = _mm512_shuffle_epi8( sc->buf[ 4], shuf_bswap64 );
+   M5 = _mm512_shuffle_epi8( sc->buf[ 5], shuf_bswap64 );
+   M6 = _mm512_shuffle_epi8( sc->buf[ 6], shuf_bswap64 );
+   M7 = _mm512_shuffle_epi8( sc->buf[ 7], shuf_bswap64 );
+   M8 = _mm512_shuffle_epi8( sc->buf[ 8], shuf_bswap64 );
+   M9 = _mm512_shuffle_epi8( sc->buf[ 9], shuf_bswap64 );
+   MA = _mm512_shuffle_epi8( sc->buf[10], shuf_bswap64 );
+   MB = _mm512_shuffle_epi8( sc->buf[11], shuf_bswap64 );
+   MC = _mm512_shuffle_epi8( sc->buf[12], shuf_bswap64 );
+   MD = _mm512_shuffle_epi8( sc->buf[13], shuf_bswap64 );
+   ME = _mm512_shuffle_epi8( sc->buf[14], shuf_bswap64 );
+   MF = _mm512_shuffle_epi8( sc->buf[15], shuf_bswap64 );
+
+   ROUND_B_8WAY(0);
+   ROUND_B_8WAY(1);
+   ROUND_B_8WAY(2);
+   ROUND_B_8WAY(3);
+   ROUND_B_8WAY(4);
+   ROUND_B_8WAY(5);
+   ROUND_B_8WAY(6);
+   ROUND_B_8WAY(7);
+   ROUND_B_8WAY(8);
+   ROUND_B_8WAY(9);
+   ROUND_B_8WAY(0);
+   ROUND_B_8WAY(1);
+   ROUND_B_8WAY(2);
+   ROUND_B_8WAY(3);
+   ROUND_B_8WAY(4);
+   ROUND_B_8WAY(5);
+
+   sc->H[0] = mm512_xor4( V8, V0, sc->S[0], sc->H[0] );
+   sc->H[1] = mm512_xor4( V9, V1, sc->S[1], sc->H[1] );
+   sc->H[2] = mm512_xor4( VA, V2, sc->S[2], sc->H[2] );
+   sc->H[3] = mm512_xor4( VB, V3, sc->S[3], sc->H[3] );
+   sc->H[4] = mm512_xor4( VC, V4, sc->S[0], sc->H[4] );
+   sc->H[5] = mm512_xor4( VD, V5, sc->S[1], sc->H[5] );
+   sc->H[6] = mm512_xor4( VE, V6, sc->S[2], sc->H[6] );
+   sc->H[7] = mm512_xor4( VF, V7, sc->S[3], sc->H[7] );
+}
+
+void blake512_8way_init( blake_8way_big_context *sc )
+{
+   __m512i zero = m512_zero;
+   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+
+   casti_m512i( sc->S, 0 ) = zero;
+   casti_m512i( sc->S, 1 ) = zero;
+   casti_m512i( sc->S, 2 ) = zero;
+   casti_m512i( sc->S, 3 ) = zero;
+
+   sc->T0 = sc->T1 = 0;
+   sc->ptr = 0;
+}
+
+static void
+blake64_8way( blake_8way_big_context *sc, const void *data, size_t len )
+{
+   __m512i *vdata = (__m512i*)data;
+   __m512i *buf;
+   size_t ptr;
+   DECL_STATE64_8WAY
+
+   const int buf_size = 128;   // sizeof/8
+
+   // 64, 80 bytes: 1st pass copy data. 2nd pass copy padding and compress.
+   // 128 bytes: 1st pass copy data, compress. 2nd pass copy padding, compress.
+
+   buf = sc->buf;
+   ptr = sc->ptr;
+   if ( len < (buf_size - ptr) )
+   {
+      memcpy_512( buf + (ptr>>3), vdata, len>>3 );
+      ptr += len;
+      sc->ptr = ptr;
+      return;
+   }
+
+   READ_STATE64(sc);
+   while ( len > 0 )
+   {
+      size_t clen;
+
+      clen = buf_size - ptr;
+      if ( clen > len )
+         clen = len;
+      memcpy_512( buf + (ptr>>3), vdata, clen>>3 );
+      ptr += clen;
+      vdata = vdata + (clen>>3);
+      len -= clen;
+      if ( ptr == buf_size )
+      {
+         if ( ( T0 = T0 + 1024 ) < 1024 )
+            T1 = T1 + 1;
+         COMPRESS64_8WAY( buf );
+         ptr = 0;
+      }
+   }
+   WRITE_STATE64(sc);
+   sc->ptr = ptr;
+
+}
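Blake-512 keeps a 128-bit bit counter split across T0 (low qword) and T1 (high qword); each full 128-byte block adds 1024 bits to T0, and the `< 1024` test after the wrapping add detects the carry into T1. A standalone scalar sketch of that idiom:

```c
#include <assert.h>
#include <stdint.h>

/* Advance the split 128-bit bit counter (t1:t0) by one 1024-bit block.
   After the wrapping 64-bit add, the sum is below 1024 only when the
   addition wrapped past 2^64, so a carry must propagate into t1. */
static void count_block( uint64_t *t0, uint64_t *t1 )
{
   if ( ( *t0 = *t0 + 1024 ) < 1024 )
      *t1 = *t1 + 1;
}
```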
+
+static void
+blake64_8way_close( blake_8way_big_context *sc, void *dst )
+{
+   __m512i buf[16];
+   size_t ptr;
+   unsigned bit_len;
+   uint64_t th, tl;
+
+   ptr = sc->ptr;
+   bit_len = ((unsigned)ptr << 3);
+   buf[ptr>>3] = m512_const1_64( 0x80 );
+   tl = sc->T0 + bit_len;
+   th = sc->T1;
+   if (ptr == 0 )
+   {
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
+      sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
+   }
+   else if ( sc->T0 == 0 )
+   {
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL + bit_len;
+      sc->T1 = sc->T1 - 1;
+   }
+   else
+   {
+      sc->T0 -= 1024 - bit_len;
+   }
+   if ( ptr <= 104 )
+   {
+      memset_zero_512( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
+      buf[104>>3] = _mm512_or_si512( buf[104>>3],
+                                m512_const1_64( 0x0100000000000000ULL ) );
+      buf[112>>3] = m512_const1_64( bswap_64( th ) );
+      buf[120>>3] = m512_const1_64( bswap_64( tl ) );
+
+      blake64_8way( sc, buf + (ptr>>3), 128 - ptr );
+   }
+   else
+   {
+      memset_zero_512( buf + (ptr>>3) + 1, (120 - ptr) >> 3 );
+
+      blake64_8way( sc, buf + (ptr>>3), 128 - ptr );
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
+      sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
+      memset_zero_512( buf, 112>>3 );
+      buf[104>>3] = m512_const1_64( 0x0100000000000000ULL );
+      buf[112>>3] = m512_const1_64( bswap_64( th ) );
+      buf[120>>3] = m512_const1_64( bswap_64( tl ) );
+
+      blake64_8way( sc, buf, 128 );
+   }
+   mm512_block_bswap_64( (__m512i*)dst, sc->H );
+}
+
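The 0xFFFFFFFFFFFFFC00 constant in the close path above is 2^64 - 1024: T0 is pre-biased so that the unconditional +1024 added before the final compression lands exactly on the true message bit count, with T1 absorbing the borrow. A standalone check of that arithmetic, mirroring the `sc->T0 == 0` branch:

```c
#include <assert.h>
#include <stdint.h>

/* When the final block carries bit_len message bits, T0 is rewound to
   2^64 - 1024 + bit_len so the unconditional +1024 applied before the
   last compression yields exactly the true bit count (the borrow is
   absorbed by decrementing T1). */
static uint64_t biased_t0( uint64_t bit_len )
{
   return 0xFFFFFFFFFFFFFC00ULL + bit_len;   /* 2^64 - 1024 + bit_len */
}
```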
+// init, update & close
+void blake512_8way_full( blake_8way_big_context *sc, void * dst,
+                         const void *data, size_t len )
+{
+
+// init
+
+   casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
+   casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
+   casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
+   casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
+   casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
+   casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
+   casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
+   casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
+
+   casti_m512i( sc->S, 0 ) = m512_zero;
+   casti_m512i( sc->S, 1 ) = m512_zero;
+   casti_m512i( sc->S, 2 ) = m512_zero;
+   casti_m512i( sc->S, 3 ) = m512_zero;
+
+   sc->T0 = sc->T1 = 0;
+   sc->ptr = 0;
+
+// update
+
+   memcpy_512( sc->buf, (__m512i*)data, len>>3 );
+   sc->ptr = len;
+   if ( len == 128 )
+   {
+      if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
+         sc->T1 = sc->T1 + 1;
+      blake512_8way_compress( sc );
+      sc->ptr = 0;
+   }
+
+// close
+
+   size_t ptr64 = sc->ptr >> 3;
+   unsigned bit_len;
+   uint64_t th, tl;
+
+   bit_len = sc->ptr << 3;
+   sc->buf[ptr64] = m512_const1_64( 0x80 );
+   tl = sc->T0 + bit_len;
+   th = sc->T1;
+
+   if ( ptr64 == 0 )
+   {
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
+      sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
+   }
+   else if ( sc->T0 == 0 )
+   {
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL + bit_len;
+      sc->T1 = sc->T1 - 1;
+   }
+   else
+      sc->T0 -= 1024 - bit_len;
+
+   memset_zero_512( sc->buf + ptr64 + 1, 13 - ptr64 );
+   sc->buf[13] = m512_const1_64( 0x0100000000000000ULL );
+   sc->buf[14] = m512_const1_64( bswap_64( th ) );
+   sc->buf[15] = m512_const1_64( bswap_64( tl ) );
+
+   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
+      sc->T1 = sc->T1 + 1;
+
+   blake512_8way_compress( sc );
+
+   mm512_block_bswap_64( (__m512i*)dst, sc->H );
+}
+
+void
+blake512_8way_update(void *cc, const void *data, size_t len)
+{
+   blake64_8way(cc, data, len);
+}
+
+void
+blake512_8way_close(void *cc, void *dst)
+{
+   blake64_8way_close(cc, dst);
+}
+
+#endif // AVX512
+
+// Blake-512 4 way
+
+#define GB_4WAY(m0, m1, c0, c1, a, b, c, d) do { \
+   a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
+                 _mm256_set1_epi64x( c1 ), m0 ), b ), a ); \
+   d = mm256_ror_64( _mm256_xor_si256( d, a ), 32 ); \
+   c = _mm256_add_epi64( c, d ); \
+   b = mm256_ror_64( _mm256_xor_si256( b, c ), 25 ); \
+   a = _mm256_add_epi64( _mm256_add_epi64( _mm256_xor_si256( \
+                 _mm256_set1_epi64x( c0 ), m1 ), b ), a ); \
+   d = mm256_ror_64( _mm256_xor_si256( d, a ), 16 ); \
+   c = _mm256_add_epi64( c, d ); \
+   b = mm256_ror_64( _mm256_xor_si256( b, c ), 11 ); \
+} while (0)
+
+#define ROUND_B_4WAY(r) do { \
+   GB_4WAY(Mx(r, 0), Mx(r, 1), CBx(r, 0), CBx(r, 1), V0, V4, V8, VC); \
+   GB_4WAY(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD); \
+   GB_4WAY(Mx(r, 4), Mx(r, 5), CBx(r, 4), CBx(r, 5), V2, V6, VA, VE); \
+   GB_4WAY(Mx(r, 6), Mx(r, 7), CBx(r, 6), CBx(r, 7), V3, V7, VB, VF); \
+   GB_4WAY(Mx(r, 8), Mx(r, 9), CBx(r, 8), CBx(r, 9), V0, V5, VA, VF); \
+   GB_4WAY(Mx(r, A), Mx(r, B), CBx(r, A), CBx(r, B), V1, V6, VB, VC); \
+   GB_4WAY(Mx(r, C), Mx(r, D), CBx(r, C), CBx(r, D), V2, V7, V8, VD); \
+   GB_4WAY(Mx(r, E), Mx(r, F), CBx(r, E), CBx(r, F), V3, V4, V9, VE); \
+} while (0)
+
+#define DECL_STATE64_4WAY \
+   __m256i H0, H1, H2, H3, H4, H5, H6, H7; \
+   __m256i S0, S1, S2, S3; \
+   uint64_t T0, T1;
+
 #define COMPRESS64_4WAY do \
 { \
@@ -491,7 +770,7 @@ static const sph_u64 CB[16] = {
                           m256_const1_64( CB6 ) ); \
    VF = _mm256_xor_si256( _mm256_set1_epi64x( T1 ), \
                           m256_const1_64( CB7 ) ); \
-   shuf_bswap64 = m256_const_64( 0x08090a0b0c0d0e0f, 0x0001020304050607, \
+   shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617, \
                                  0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
    M0 = _mm256_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
    M1 = _mm256_shuffle_epi8( *(buf+ 1), shuf_bswap64 ); \
@@ -535,13 +814,83 @@ static const sph_u64 CB[16] = {
    H7 = mm256_xor4( VF, V7, S3, H7 ); \
 } while (0)
 
-#endif
 
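The one-line change to shuf_bswap64 in COMPRESS64_4WAY above is cosmetic: VPSHUFB shuffles each 128-bit lane independently and uses only the low four bits of each index byte (bit 7 zeroes the result byte), so indices 0x18..0x1f select the same lane bytes as 0x08..0x0f; the new constant simply writes the upper lane's indices with the conventional +16 offset. A scalar model of one PSHUFB lane showing the equivalence (illustrative, not from this source):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of (V)PSHUFB on one 128-bit lane: each output byte is
   src[idx & 0x0F], or zero when bit 7 of the index byte is set. */
static void pshufb_lane( uint8_t dst[16], const uint8_t src[16],
                         const uint8_t idx[16] )
{
   for ( int i = 0; i < 16; i++ )
      dst[i] = ( idx[i] & 0x80 ) ? 0 : src[ idx[i] & 0x0F ];
}
```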
-static const sph_u64 salt_zero_big[4] = { 0, 0, 0, 0 };
-
-static void
-blake64_4way_init( blake_4way_big_context *sc, const sph_u64 *iv,
-                   const sph_u64 *salt )
+void blake512_4way_compress( blake_4way_big_context *sc )
+{
+   __m256i M0, M1, M2, M3, M4, M5, M6, M7;
+   __m256i M8, M9, MA, MB, MC, MD, ME, MF;
+   __m256i V0, V1, V2, V3, V4, V5, V6, V7;
+   __m256i V8, V9, VA, VB, VC, VD, VE, VF;
+   __m256i shuf_bswap64;
+
+   V0 = sc->H[0];
+   V1 = sc->H[1];
+   V2 = sc->H[2];
+   V3 = sc->H[3];
+   V4 = sc->H[4];
+   V5 = sc->H[5];
+   V6 = sc->H[6];
+   V7 = sc->H[7];
+   V8 = _mm256_xor_si256( sc->S[0], m256_const1_64( CB0 ) );
+   V9 = _mm256_xor_si256( sc->S[1], m256_const1_64( CB1 ) );
+   VA = _mm256_xor_si256( sc->S[2], m256_const1_64( CB2 ) );
+   VB = _mm256_xor_si256( sc->S[3], m256_const1_64( CB3 ) );
+   VC = _mm256_xor_si256( _mm256_set1_epi64x( sc->T0 ),
+                          m256_const1_64( CB4 ) );
+   VD = _mm256_xor_si256( _mm256_set1_epi64x( sc->T0 ),
+                          m256_const1_64( CB5 ) );
+   VE = _mm256_xor_si256( _mm256_set1_epi64x( sc->T1 ),
+                          m256_const1_64( CB6 ) );
+   VF = _mm256_xor_si256( _mm256_set1_epi64x( sc->T1 ),
+                          m256_const1_64( CB7 ) );
+   shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617,
+                                 0x08090a0b0c0d0e0f, 0x0001020304050607 );
+
+   M0 = _mm256_shuffle_epi8( sc->buf[ 0], shuf_bswap64 );
+   M1 = _mm256_shuffle_epi8( sc->buf[ 1], shuf_bswap64 );
+   M2 = _mm256_shuffle_epi8( sc->buf[ 2], shuf_bswap64 );
+   M3 = _mm256_shuffle_epi8( sc->buf[ 3], shuf_bswap64 );
+   M4 = _mm256_shuffle_epi8( sc->buf[ 4], shuf_bswap64 );
+   M5 = _mm256_shuffle_epi8( sc->buf[ 5], shuf_bswap64 );
+   M6 = _mm256_shuffle_epi8( sc->buf[ 6], shuf_bswap64 );
+   M7 = _mm256_shuffle_epi8( sc->buf[ 7], shuf_bswap64 );
+   M8 = _mm256_shuffle_epi8( sc->buf[ 8], shuf_bswap64 );
+   M9 = _mm256_shuffle_epi8( sc->buf[ 9], shuf_bswap64 );
+   MA = _mm256_shuffle_epi8( sc->buf[10], shuf_bswap64 );
+   MB = _mm256_shuffle_epi8( sc->buf[11], shuf_bswap64 );
+   MC = _mm256_shuffle_epi8( sc->buf[12], shuf_bswap64 );
+   MD = _mm256_shuffle_epi8( sc->buf[13], shuf_bswap64 );
+   ME = _mm256_shuffle_epi8( sc->buf[14], shuf_bswap64 );
+   MF = _mm256_shuffle_epi8( sc->buf[15], shuf_bswap64 );
+
+   ROUND_B_4WAY(0);
+   ROUND_B_4WAY(1);
+   ROUND_B_4WAY(2);
+   ROUND_B_4WAY(3);
+   ROUND_B_4WAY(4);
+   ROUND_B_4WAY(5);
+   ROUND_B_4WAY(6);
+   ROUND_B_4WAY(7);
+   ROUND_B_4WAY(8);
+   ROUND_B_4WAY(9);
+   ROUND_B_4WAY(0);
+   ROUND_B_4WAY(1);
+   ROUND_B_4WAY(2);
+   ROUND_B_4WAY(3);
+   ROUND_B_4WAY(4);
+   ROUND_B_4WAY(5);
+
+   sc->H[0] = mm256_xor4( V8, V0, sc->S[0], sc->H[0] );
+   sc->H[1] = mm256_xor4( V9, V1, sc->S[1], sc->H[1] );
+   sc->H[2] = mm256_xor4( VA, V2, sc->S[2], sc->H[2] );
+   sc->H[3] = mm256_xor4( VB, V3, sc->S[3], sc->H[3] );
+   sc->H[4] = mm256_xor4( VC, V4, sc->S[0], sc->H[4] );
+   sc->H[5] = mm256_xor4( VD, V5, sc->S[1], sc->H[5] );
+   sc->H[6] = mm256_xor4( VE, V6, sc->S[2], sc->H[6] );
+   sc->H[7] = mm256_xor4( VF, V7, sc->S[3], sc->H[7] );
+}
+
+void blake512_4way_init( blake_4way_big_context *sc )
 {
    __m256i zero = m256_zero;
    casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
@@ -582,7 +931,7 @@ blake64_4way( blake_4way_big_context *sc, const void *data, size_t len)
       return;
    }
 
-   READ_STATE64_4WAY(sc);
+   READ_STATE64(sc);
    while ( len > 0 )
    {
      size_t clen;
@@ -596,55 +945,51 @@ blake64_4way( blake_4way_big_context *sc, const void *data, size_t len)
      len -= clen;
      if ( ptr == buf_size )
      {
-        if ((T0 = SPH_T64(T0 + 1024)) < 1024)
+        if ( (T0 = T0 + 1024 ) < 1024 )
           T1 = SPH_T64(T1 + 1);
        COMPRESS64_4WAY;
        ptr = 0;
      }
    }
-   WRITE_STATE64_4WAY(sc);
+   WRITE_STATE64(sc);
    sc->ptr = ptr;
 }
 
 static void
-blake64_4way_close( blake_4way_big_context *sc,
-                    unsigned ub, unsigned n, void *dst, size_t out_size_w64)
+blake64_4way_close( blake_4way_big_context *sc, void *dst )
 {
    __m256i buf[16];
    size_t ptr;
    unsigned bit_len;
-   uint64_t z, zz;
-   sph_u64 th, tl;
+   uint64_t th, tl;
 
    ptr = sc->ptr;
    bit_len = ((unsigned)ptr << 3);
-   z = 0x80 >> n;
-   zz = ((ub & -z) | z) & 0xFF;
-   buf[ptr>>3] = _mm256_set_epi64x( zz, zz, zz, zz );
+   buf[ptr>>3] = m256_const1_64( 0x80 );
    tl = sc->T0 + bit_len;
    th = sc->T1;
    if (ptr == 0 )
    {
-      sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
-      sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
+      sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
    }
    else if ( sc->T0 == 0 )
    {
-      sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL) + bit_len;
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL + bit_len;
-      sc->T1 = SPH_T64(sc->T1 - 1);
+      sc->T1 = sc->T1 - 1;
    }
    else
    {
      sc->T0 -= 1024 - bit_len;
    }
 
    if ( ptr <= 104 )
    {
       memset_zero_256( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
-      if ( out_size_w64 == 8 )
-         buf[(104>>3)] = _mm256_or_si256( buf[(104>>3)],
+      buf[104>>3] = _mm256_or_si256( buf[104>>3],
                                 m256_const1_64( 0x0100000000000000ULL ) );
-      *(buf+(112>>3)) = _mm256_set1_epi64x( bswap_64( th ) );
+      buf[112>>3] = m256_const1_64( bswap_64( th ) );
-      *(buf+(120>>3)) = _mm256_set1_epi64x( bswap_64( tl ) );
+      buf[120>>3] = m256_const1_64( bswap_64( tl ) );
 
       blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
    }
@@ -656,24 +1001,89 @@ blake64_4way_close( blake_4way_big_context *sc,
      sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
      sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
      memset_zero_256( buf, 112>>3 );
-     if ( out_size_w64 == 8 )
      buf[104>>3] = m256_const1_64( 0x0100000000000000ULL );
-     *(buf+(112>>3)) = _mm256_set1_epi64x( bswap_64( th ) );
+     buf[112>>3] = m256_const1_64( bswap_64( th ) );
-     *(buf+(120>>3)) = _mm256_set1_epi64x( bswap_64( tl ) );
+     buf[120>>3] = m256_const1_64( bswap_64( tl ) );
 
      blake64_4way( sc, buf, 128 );
    }
    mm256_block_bswap_64( (__m256i*)dst, sc->H );
 }
 
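For the common 80-byte block-header input, the close logic above produces a single padded 128-byte block per lane: the message, a 0x80 byte, zeros, a 0x01 marker ending at byte 111 (the 0x0100000000000000 qword stored little-endian), and the big-endian bit count in bytes 112-127. A hypothetical scalar sketch of that layout for one lane (assumes th = 0, i.e. a message shorter than 2^64 bits):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the single padded block built by the close path for
   an 80-byte message (one lane of the vectored buffer): 0x80 after the
   data, a 0x01 marker at byte 111, and the big-endian bit count
   (th:tl = 0:640) in bytes 112-127. */
static void pad_80_byte_block( uint8_t blk[128], const uint8_t msg[80] )
{
   const uint64_t bit_len = 80 * 8;   /* 640 */
   memset( blk, 0, 128 );
   memcpy( blk, msg, 80 );
   blk[80]  = 0x80;
   blk[111] = 0x01;   /* qword 0x0100000000000000 stored little-endian */
   for ( int i = 0; i < 8; i++ )      /* bswap_64( tl ) at bytes 120-127 */
      blk[120 + i] = (uint8_t)( bit_len >> ( 8 * ( 7 - i ) ) );
}
```

Because 80 <= 104, the padding and length always fit in the same block as the header, which is why the mining-oriented `full` routines below can hard-code a single extra compression.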
-void
-blake512_4way_init(void *cc)
+// init, update & close
+void blake512_4way_full( blake_4way_big_context *sc, void * dst,
+                         const void *data, size_t len )
 {
-   blake64_4way_init(cc, IV512, salt_zero_big);
+
+// init
+
+   casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
+   casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
+   casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
+   casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
+   casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
+   casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
+   casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
+   casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
+
+   casti_m256i( sc->S, 0 ) = m256_zero;
+   casti_m256i( sc->S, 1 ) = m256_zero;
+   casti_m256i( sc->S, 2 ) = m256_zero;
+   casti_m256i( sc->S, 3 ) = m256_zero;
+
+   sc->T0 = sc->T1 = 0;
+   sc->ptr = 0;
+
+// update
+
+   memcpy_256( sc->buf, (__m256i*)data, len>>3 );
+   sc->ptr += len;
+   if ( len == 128 )
+   {
+      if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
+         sc->T1 = sc->T1 + 1;
+      blake512_4way_compress( sc );
+      sc->ptr = 0;
+   }
+
+// close
+
+   size_t ptr64 = sc->ptr >> 3;
+   unsigned bit_len;
+   uint64_t th, tl;
+
+   bit_len = sc->ptr << 3;
+   sc->buf[ptr64] = m256_const1_64( 0x80 );
+   tl = sc->T0 + bit_len;
+   th = sc->T1;
+   if ( sc->ptr == 0 )
+   {
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
+      sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
+   }
+   else if ( sc->T0 == 0 )
+   {
+      sc->T0 = 0xFFFFFFFFFFFFFC00ULL + bit_len;
+      sc->T1 = sc->T1 - 1;
+   }
+   else
+      sc->T0 -= 1024 - bit_len;
+
+   memset_zero_256( sc->buf + ptr64 + 1, 13 - ptr64 );
+   sc->buf[13] = m256_const1_64( 0x0100000000000000ULL );
+   sc->buf[14] = m256_const1_64( bswap_64( th ) );
+   sc->buf[15] = m256_const1_64( bswap_64( tl ) );
+
+   if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
+      sc->T1 = sc->T1 + 1;
+
+   blake512_4way_compress( sc );
+
+   mm256_block_bswap_64( (__m256i*)dst, sc->H );
 }
 
 void
-blake512_4way(void *cc, const void *data, size_t len)
+blake512_4way_update(void *cc, const void *data, size_t len)
 {
    blake64_4way(cc, data, len);
 }
@@ -681,13 +1091,7 @@ blake512_4way(void *cc, const void *data, size_t len)
 void
 blake512_4way_close(void *cc, void *dst)
 {
-   blake512_4way_addbits_and_close(cc, 0, 0, dst);
+   blake64_4way_close( cc, dst );
-}
-
-void
-blake512_4way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
-{
-   blake64_4way_close(cc, ub, n, dst, 8);
 }
 
|
|
||||||
#ifdef __cplusplus
|
#ifdef __cplusplus
|
||||||
|
@@ -14,7 +14,7 @@ void blakecoin_4way_hash(void *state, const void *input)
    blake256r8_4way_context ctx;
 
    memcpy( &ctx, &blakecoin_4w_ctx, sizeof ctx );
-   blake256r8_4way( &ctx, input + (64<<2), 16 );
+   blake256r8_4way_update( &ctx, input + (64<<2), 16 );
    blake256r8_4way_close( &ctx, vhash );
 
    dintrlv_4x32( state, state+32, state+64, state+96, vhash, 256 );
@@ -37,7 +37,7 @@ int scanhash_blakecoin_4way( struct work *work, uint32_t max_nonce,
 
    mm128_bswap32_intrlv80_4x32( vdata, pdata );
    blake256r8_4way_init( &blakecoin_4w_ctx );
-   blake256r8_4way( &blakecoin_4w_ctx, vdata, 64 );
+   blake256r8_4way_update( &blakecoin_4w_ctx, vdata, 64 );
 
    do {
      *noncev = mm128_bswap_32( _mm_set_epi32( n+3, n+2, n+1, n ) );
@@ -71,7 +71,7 @@ void blakecoin_8way_hash( void *state, const void *input )
    blake256r8_8way_context ctx;
 
    memcpy( &ctx, &blakecoin_8w_ctx, sizeof ctx );
-   blake256r8_8way( &ctx, input + (64<<3), 16 );
+   blake256r8_8way_update( &ctx, input + (64<<3), 16 );
    blake256r8_8way_close( &ctx, vhash );
 
    dintrlv_8x32( state, state+ 32, state+ 64, state+ 96, state+128,
@@ -95,7 +95,7 @@ int scanhash_blakecoin_8way( struct work *work, uint32_t max_nonce,
 
    mm256_bswap32_intrlv80_8x32( vdata, pdata );
    blake256r8_8way_init( &blakecoin_8w_ctx );
-   blake256r8_8way( &blakecoin_8w_ctx, vdata, 64 );
+   blake256r8_8way_update( &blakecoin_8w_ctx, vdata, 64 );
 
    do {
      *noncev = mm256_bswap_32( _mm256_set_epi32( n+7, n+6, n+5, n+4,
@@ -1,13 +1,6 @@
 #include "blakecoin-gate.h"
 #include <memory.h>
 
-// changed to get_max64_0x3fffffLL in cpuminer-multi-decred
-int64_t blakecoin_get_max64 ()
-{
-  return 0x7ffffLL;
-//  return 0x3fffffLL;
-}
-
 // vanilla uses default gen merkle root, otherwise identical to blakecoin
 bool register_vanilla_algo( algo_gate_t* gate )
 {
@@ -23,7 +16,6 @@ bool register_vanilla_algo( algo_gate_t* gate )
   gate->hash = (void*)&blakecoinhash;
 #endif
   gate->optimizations = SSE42_OPT | AVX2_OPT;
-  gate->get_max64 = (void*)&blakecoin_get_max64;
   return true;
 }
 
@@ -93,33 +93,3 @@ int scanhash_blakecoin( struct work *work, uint32_t max_nonce,
    return 0;
 }
 
-/*
-void blakecoin_gen_merkle_root ( char* merkle_root, struct stratum_ctx* sctx )
-{
-   SHA256( sctx->job.coinbase, (int)sctx->job.coinbase_size, merkle_root );
-}
-*/
-/*
-// changed to get_max64_0x3fffffLL in cpuminer-multi-decred
-int64_t blakecoin_get_max64 ()
-{
-  return 0x7ffffLL;
-}
-
-// vanilla uses default gen merkle root, otherwise identical to blakecoin
-bool register_vanilla_algo( algo_gate_t* gate )
-{
-  gate->scanhash  = (void*)&scanhash_blakecoin;
-  gate->hash      = (void*)&blakecoinhash;
-  gate->get_max64 = (void*)&blakecoin_get_max64;
-  blakecoin_init( &blake_init_ctx );
-  return true;
-}
-
-bool register_blakecoin_algo( algo_gate_t* gate )
-{
-  register_vanilla_algo( gate );
-  gate->gen_merkle_root = (void*)&SHA256_gen_merkle_root;
-  return true;
-}
-*/
@@ -21,7 +21,7 @@ void decred_hash_4way( void *state, const void *input )
    blake256_4way_context ctx __attribute__ ((aligned (64)));
 
    memcpy( &ctx, &blake_mid, sizeof(blake_mid) );
-   blake256_4way( &ctx, tail, tail_len );
+   blake256_4way_update( &ctx, tail, tail_len );
    blake256_4way_close( &ctx, vhash );
    dintrlv_4x32( state, state+32, state+64, state+96, vhash, 256 );
 }
@@ -46,7 +46,7 @@ int scanhash_decred_4way( struct work *work, uint32_t max_nonce,
    mm128_intrlv_4x32x( vdata, edata, edata, edata, edata, 180*8 );
 
    blake256_4way_init( &blake_mid );
-   blake256_4way( &blake_mid, vdata, DECRED_MIDSTATE_LEN );
+   blake256_4way_update( &blake_mid, vdata, DECRED_MIDSTATE_LEN );
 
    uint32_t *noncep = vdata + DECRED_NONCE_INDEX * 4;
    do {
@@ -38,7 +38,7 @@ void decred_decode_extradata( struct work* work, uint64_t* net_blocks )
    if (!have_longpoll && work->height > *net_blocks + 1)
    {
       char netinfo[64] = { 0 };
-      if (opt_showdiff && net_diff > 0.)
+      if ( net_diff > 0. )
       {
          if (net_diff != work->targetdiff)
             sprintf(netinfo, ", diff %.3f, target %.1f", net_diff,
@@ -116,7 +116,7 @@ void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
    // block header suffix from coinb2 (stake version)
    memcpy( &g_work->data[44],
           &sctx->job.coinbase[ sctx->job.coinbase_size-4 ], 4 );
-   sctx->bloc_height = g_work->data[32];
+   sctx->block_height = g_work->data[32];
    //applog_hex(work->data, 180);
    //applog_hex(&work->data[36], 36);
 }
@@ -154,7 +154,6 @@ bool register_decred_algo( algo_gate_t* gate )
 #endif
   gate->optimizations = AVX2_OPT;
   gate->get_nonceptr  = (void*)&decred_get_nonceptr;
-  gate->get_max64     = (void*)&get_max64_0x3fffffLL;
   gate->decode_extra_data     = (void*)&decred_decode_extradata;
   gate->build_stratum_request = (void*)&decred_be_build_stratum_request;
   gate->work_decode           = (void*)&std_be_work_decode;
@@ -77,25 +77,15 @@ int scanhash_decred( struct work *work, uint32_t max_nonce,
       be32enc(&endiandata[k], pdata[k]);
 #endif
 
-#ifdef DEBUG_ALGO
-   if (!thr_id) applog(LOG_DEBUG,"[%d] Target=%08x %08x", thr_id, ptarget[6], ptarget[7]);
-#endif
-
    do {
       //be32enc(&endiandata[DCR_NONCE_OFT32], n);
      endiandata[DECRED_NONCE_INDEX] = n;
      decred_hash(hash32, endiandata);
 
-      if (hash32[7] <= HTarget && fulltest(hash32, ptarget)) {
-         work_set_target_ratio(work, hash32);
-         *hashes_done = n - first_nonce + 1;
-#ifdef DEBUG_ALGO
-         applog(LOG_BLUE, "Nonce : %08x %08x", n, swab32(n));
-         applog_hash(ptarget);
-         applog_compare_hash(hash32, ptarget);
-#endif
+      if (hash32[7] <= HTarget && fulltest(hash32, ptarget))
+      {
         pdata[DECRED_NONCE_INDEX] = n;
-         return 1;
+         submit_solution( work, hash32, mythr );
      }
 
      n++;
@@ -143,7 +133,7 @@ void decred_decode_extradata( struct work* work, uint64_t* net_blocks )
    if (!have_longpoll && work->height > *net_blocks + 1)
    {
       char netinfo[64] = { 0 };
-      if (opt_showdiff && net_diff > 0.)
+      if (net_diff > 0.)
      {
         if (net_diff != work->targetdiff)
            sprintf(netinfo, ", diff %.3f, target %.1f", net_diff,
@@ -269,7 +259,6 @@ bool register_decred_algo( algo_gate_t* gate )
   gate->scanhash              = (void*)&scanhash_decred;
   gate->hash                  = (void*)&decred_hash;
   gate->get_nonceptr          = (void*)&decred_get_nonceptr;
-  gate->get_max64             = (void*)&get_max64_0x3fffffLL;
   gate->decode_extra_data     = (void*)&decred_decode_extradata;
   gate->build_stratum_request = (void*)&decred_be_build_stratum_request;
   gate->work_decode           = (void*)&std_be_work_decode;
@@ -22,23 +22,23 @@ extern void pentablakehash_4way( void *output, const void *input )
 
 
    blake512_4way_init( &ctx );
-   blake512_4way( &ctx, input, 80 );
+   blake512_4way_update( &ctx, input, 80 );
    blake512_4way_close( &ctx, vhash );
 
    blake512_4way_init( &ctx );
-   blake512_4way( &ctx, vhash, 64 );
+   blake512_4way_update( &ctx, vhash, 64 );
    blake512_4way_close( &ctx, vhash );
 
    blake512_4way_init( &ctx );
-   blake512_4way( &ctx, vhash, 64 );
+   blake512_4way_update( &ctx, vhash, 64 );
    blake512_4way_close( &ctx, vhash );
 
    blake512_4way_init( &ctx );
-   blake512_4way( &ctx, vhash, 64 );
+   blake512_4way_update( &ctx, vhash, 64 );
    blake512_4way_close( &ctx, vhash );
 
    blake512_4way_init( &ctx );
-   blake512_4way( &ctx, vhash, 64 );
+   blake512_4way_update( &ctx, vhash, 64 );
    blake512_4way_close( &ctx, vhash );
 
    memcpy( output, hash0, 32 );
@@ -10,7 +10,6 @@ bool register_pentablake_algo( algo_gate_t* gate )
   gate->hash      = (void*)&pentablakehash;
 #endif
   gate->optimizations = AVX2_OPT;
-  gate->get_max64 = (void*)&get_max64_0x3ffff;
   return true;
 };
 
@@ -1,476 +0,0 @@
-/* $Id: blake.c 252 2011-06-07 17:55:14Z tp $ */
-/*
- * BLAKE implementation.
- *
- * ==========================(LICENSE BEGIN)============================
- *
- * Copyright (c) 2007-2010  Projet RNRT SAPHIR
- *
- * Permission is hereby granted, free of charge, to any person obtaining
- * a copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice shall be
- * included in all copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- * ===========================(LICENSE END)=============================
- *
- * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
- */
-#include <stddef.h>
-#include <string.h>
-#include <limits.h>
-
-#include "../sph_blake.h"
-
-#ifdef __cplusplus
-extern "C"{
-#endif
-
-#ifdef _MSC_VER
-#pragma warning (disable: 4146)
-#endif
-
-static const sph_u64 blkIV512[8] = {
-   SPH_C64(0x6A09E667F3BCC908), SPH_C64(0xBB67AE8584CAA73B),
-   SPH_C64(0x3C6EF372FE94F82B), SPH_C64(0xA54FF53A5F1D36F1),
-   SPH_C64(0x510E527FADE682D1), SPH_C64(0x9B05688C2B3E6C1F),
-   SPH_C64(0x1F83D9ABFB41BD6B), SPH_C64(0x5BE0CD19137E2179)
-};
-
-#define Z00   0
-#define Z01   1
-#define Z02   2
-#define Z03   3
-#define Z04   4
-#define Z05   5
-#define Z06   6
-#define Z07   7
-#define Z08   8
-#define Z09   9
-#define Z0A   A
-#define Z0B   B
-#define Z0C   C
-#define Z0D   D
-#define Z0E   E
-#define Z0F   F
-
-#define Z10   E
-#define Z11   A
-#define Z12   4
-#define Z13   8
-#define Z14   9
-#define Z15   F
-#define Z16   D
-#define Z17   6
-#define Z18   1
-#define Z19   C
-#define Z1A   0
-#define Z1B   2
-#define Z1C   B
-#define Z1D   7
-#define Z1E   5
-#define Z1F   3
-
-#define Z20   B
-#define Z21   8
-#define Z22   C
-#define Z23   0
-#define Z24   5
-#define Z25   2
-#define Z26   F
-#define Z27   D
-#define Z28   A
-#define Z29   E
-#define Z2A   3
-#define Z2B   6
-#define Z2C   7
-#define Z2D   1
-#define Z2E   9
-#define Z2F   4
-
-#define Z30   7
-#define Z31   9
-#define Z32   3
-#define Z33   1
-#define Z34   D
-#define Z35   C
-#define Z36   B
-#define Z37   E
-#define Z38   2
-#define Z39   6
-#define Z3A   5
-#define Z3B   A
-#define Z3C   4
-#define Z3D   0
-#define Z3E   F
-#define Z3F   8
-
-#define Z40   9
-#define Z41   0
-#define Z42   5
-#define Z43   7
-#define Z44   2
-#define Z45   4
-#define Z46   A
-#define Z47   F
-#define Z48   E
-#define Z49   1
-#define Z4A   B
-#define Z4B   C
-#define Z4C   6
-#define Z4D   8
-#define Z4E   3
-#define Z4F   D
-
-#define Z50   2
-#define Z51   C
-#define Z52   6
-#define Z53   A
-#define Z54   0
-#define Z55   B
-#define Z56   8
-#define Z57   3
-#define Z58   4
-#define Z59   D
-#define Z5A   7
-#define Z5B   5
-#define Z5C   F
-#define Z5D   E
-#define Z5E   1
-#define Z5F   9
-
-#define Z60   C
-#define Z61   5
-#define Z62   1
-#define Z63   F
-#define Z64   E
-#define Z65   D
-#define Z66   4
-#define Z67   A
-#define Z68   0
-#define Z69   7
-#define Z6A   6
-#define Z6B   3
-#define Z6C   9
-#define Z6D   2
-#define Z6E   8
-#define Z6F   B
-
-#define Z70   D
-#define Z71   B
-#define Z72   7
-#define Z73   E
-#define Z74   C
-#define Z75   1
-#define Z76   3
-#define Z77   9
-#define Z78   5
-#define Z79   0
-#define Z7A   F
-#define Z7B   4
-#define Z7C   8
-#define Z7D   6
-#define Z7E   2
-#define Z7F   A
-
-#define Z80   6
-#define Z81   F
-#define Z82   E
-#define Z83   9
-#define Z84   B
-#define Z85   3
-#define Z86   0
-#define Z87   8
-#define Z88   C
-#define Z89   2
-#define Z8A   D
-#define Z8B   7
-#define Z8C   1
-#define Z8D   4
-#define Z8E   A
-#define Z8F   5
-
-#define Z90   A
-#define Z91   2
-#define Z92   8
-#define Z93   4
-#define Z94   7
-#define Z95   6
-#define Z96   1
-#define Z97   5
-#define Z98   F
-#define Z99   B
-#define Z9A   9
-#define Z9B   E
-#define Z9C   3
-#define Z9D   C
-#define Z9E   D
-#define Z9F   0
-
-#define Mx(r, i)    Mx_(Z ## r ## i)
-#define Mx_(n)      Mx__(n)
-#define Mx__(n)     M ## n
-
-#define CSx(r, i)   CSx_(Z ## r ## i)
-#define CSx_(n)     CSx__(n)
-#define CSx__(n)    CS ## n
-
-#define CS0   SPH_C32(0x243F6A88)
-#define CS1   SPH_C32(0x85A308D3)
-#define CS2   SPH_C32(0x13198A2E)
-#define CS3   SPH_C32(0x03707344)
-#define CS4   SPH_C32(0xA4093822)
-#define CS5   SPH_C32(0x299F31D0)
-#define CS6   SPH_C32(0x082EFA98)
-#define CS7   SPH_C32(0xEC4E6C89)
-#define CS8   SPH_C32(0x452821E6)
-#define CS9   SPH_C32(0x38D01377)
-#define CSA   SPH_C32(0xBE5466CF)
-#define CSB   SPH_C32(0x34E90C6C)
-#define CSC   SPH_C32(0xC0AC29B7)
-#define CSD   SPH_C32(0xC97C50DD)
-#define CSE   SPH_C32(0x3F84D5B5)
-#define CSF   SPH_C32(0xB5470917)
-
-
-
-#define CBx(r, i)   CBx_(Z ## r ## i)
-#define CBx_(n)     CBx__(n)
-#define CBx__(n)    CB ## n
-
-#define CB0   SPH_C64(0x243F6A8885A308D3)
-#define CB1   SPH_C64(0x13198A2E03707344)
-#define CB2   SPH_C64(0xA4093822299F31D0)
-#define CB3   SPH_C64(0x082EFA98EC4E6C89)
-#define CB4   SPH_C64(0x452821E638D01377)
-#define CB5   SPH_C64(0xBE5466CF34E90C6C)
-#define CB6   SPH_C64(0xC0AC29B7C97C50DD)
-#define CB7   SPH_C64(0x3F84D5B5B5470917)
-#define CB8   SPH_C64(0x9216D5D98979FB1B)
-#define CB9   SPH_C64(0xD1310BA698DFB5AC)
-#define CBA   SPH_C64(0x2FFD72DBD01ADFB7)
-#define CBB   SPH_C64(0xB8E1AFED6A267E96)
-#define CBC   SPH_C64(0xBA7C9045F12C7F99)
-#define CBD   SPH_C64(0x24A19947B3916CF7)
-#define CBE   SPH_C64(0x0801F2E2858EFC16)
-#define CBF   SPH_C64(0x636920D871574E69)
-
-
-#define GS(m0, m1, c0, c1, a, b, c, d)   do { \
-      a = SPH_T32(a + b + (m0 ^ c1)); \
-      d = SPH_ROTR32(d ^ a, 16); \
-      c = SPH_T32(c + d); \
-      b = SPH_ROTR32(b ^ c, 12); \
-      a = SPH_T32(a + b + (m1 ^ c0)); \
-      d = SPH_ROTR32(d ^ a, 8); \
-      c = SPH_T32(c + d); \
-      b = SPH_ROTR32(b ^ c, 7); \
-   } while (0)
-
-#define ROUND_S(r)   do { \
-      GS(Mx(r, 0), Mx(r, 1), CSx(r, 0), CSx(r, 1), V0, V4, V8, VC); \
-      GS(Mx(r, 2), Mx(r, 3), CSx(r, 2), CSx(r, 3), V1, V5, V9, VD); \
-      GS(Mx(r, 4), Mx(r, 5), CSx(r, 4), CSx(r, 5), V2, V6, VA, VE); \
-      GS(Mx(r, 6), Mx(r, 7), CSx(r, 6), CSx(r, 7), V3, V7, VB, VF); \
-      GS(Mx(r, 8), Mx(r, 9), CSx(r, 8), CSx(r, 9), V0, V5, VA, VF); \
-      GS(Mx(r, A), Mx(r, B), CSx(r, A), CSx(r, B), V1, V6, VB, VC); \
-      GS(Mx(r, C), Mx(r, D), CSx(r, C), CSx(r, D), V2, V7, V8, VD); \
-      GS(Mx(r, E), Mx(r, F), CSx(r, E), CSx(r, F), V3, V4, V9, VE); \
-   } while (0)
-
-
-
-#define GB(m0, m1, c0, c1, a, b, c, d)   do { \
-      a = SPH_T64(a + b + (m0 ^ c1)); \
-      d = SPH_ROTR64(d ^ a, 32); \
-      c = SPH_T64(c + d); \
-      b = SPH_ROTR64(b ^ c, 25); \
-      a = SPH_T64(a + b + (m1 ^ c0)); \
-      d = SPH_ROTR64(d ^ a, 16); \
-      c = SPH_T64(c + d); \
-      b = SPH_ROTR64(b ^ c, 11); \
-   } while (0)
-
-#define ROUND_B(r)   do { \
-      GB(Mx(r, 0), Mx(r, 1), CBx(r, 0), CBx(r, 1), V0, V4, V8, VC); \
-      GB(Mx(r, 2), Mx(r, 3), CBx(r, 2), CBx(r, 3), V1, V5, V9, VD); \
-      GB(Mx(r, 4), Mx(r, 5), CBx(r, 4), CBx(r, 5), V2, V6, VA, VE); \
-      GB(Mx(r, 6), Mx(r, 7), CBx(r, 6), CBx(r, 7), V3, V7, VB, VF); \
-      GB(Mx(r, 8), Mx(r, 9), CBx(r, 8), CBx(r, 9), V0, V5, VA, VF); \
-      GB(Mx(r, A), Mx(r, B), CBx(r, A), CBx(r, B), V1, V6, VB, VC); \
-      GB(Mx(r, C), Mx(r, D), CBx(r, C), CBx(r, D), V2, V7, V8, VD); \
-      GB(Mx(r, E), Mx(r, F), CBx(r, E), CBx(r, F), V3, V4, V9, VE); \
-   } while (0)
-
-
-#define COMPRESS64   do { \
-      int b=0; \
-      sph_u64 M0, M1, M2, M3, M4, M5, M6, M7; \
-      sph_u64 M8, M9, MA, MB, MC, MD, ME, MF; \
-      sph_u64 V0, V1, V2, V3, V4, V5, V6, V7; \
-      sph_u64 V8, V9, VA, VB, VC, VD, VE, VF; \
-      V0 = blkH0, \
-      V1 = blkH1, \
-      V2 = blkH2, \
-      V3 = blkH3, \
-      V4 = blkH4, \
-      V5 = blkH5, \
-      V6 = blkH6, \
-      V7 = blkH7; \
-      V8 = blkS0 ^ CB0, \
-      V9 = blkS1 ^ CB1, \
-      VA = blkS2 ^ CB2, \
-      VB = blkS3 ^ CB3, \
-      VC = hashctA ^ CB4, \
-      VD = hashctA ^ CB5, \
-      VE = hashctB ^ CB6, \
-      VF = hashctB ^ CB7; \
-      M0 = sph_dec64be_aligned(buf +   0), \
-      M1 = sph_dec64be_aligned(buf +   8), \
-      M2 = sph_dec64be_aligned(buf +  16), \
-      M3 = sph_dec64be_aligned(buf +  24), \
-      M4 = sph_dec64be_aligned(buf +  32), \
-      M5 = sph_dec64be_aligned(buf +  40), \
-      M6 = sph_dec64be_aligned(buf +  48), \
-      M7 = sph_dec64be_aligned(buf +  56), \
-      M8 = sph_dec64be_aligned(buf +  64), \
-      M9 = sph_dec64be_aligned(buf +  72), \
-      MA = sph_dec64be_aligned(buf +  80), \
-      MB = sph_dec64be_aligned(buf +  88), \
-      MC = sph_dec64be_aligned(buf +  96), \
-      MD = sph_dec64be_aligned(buf + 104), \
-      ME = sph_dec64be_aligned(buf + 112), \
-      MF = sph_dec64be_aligned(buf + 120); \
-      /* loop once and a half */ \
-      /* save some space */ \
-      for (;;) { \
-         ROUND_B(0); \
-         ROUND_B(1); \
-         ROUND_B(2); \
-         ROUND_B(3); \
-         ROUND_B(4); \
-         ROUND_B(5); \
-         if (b) break; \
-         b = 1; \
-         ROUND_B(6); \
-         ROUND_B(7); \
-         ROUND_B(8); \
-         ROUND_B(9); \
-      }; \
-      blkH0 ^= blkS0 ^ V0 ^ V8, \
-      blkH1 ^= blkS1 ^ V1 ^ V9, \
-      blkH2 ^= blkS2 ^ V2 ^ VA, \
-      blkH3 ^= blkS3 ^ V3 ^ VB, \
-      blkH4 ^= blkS0 ^ V4 ^ VC, \
-      blkH5 ^= blkS1 ^ V5 ^ VD, \
-      blkH6 ^= blkS2 ^ V6 ^ VE, \
-      blkH7 ^= blkS3 ^ V7 ^ VF; \
-   } while (0)
-/*
-*/
-#define DECL_BLK \
-   sph_u64 blkH0; \
-   sph_u64 blkH1; \
-   sph_u64 blkH2; \
-   sph_u64 blkH3; \
-   sph_u64 blkH4; \
-   sph_u64 blkH5; \
-   sph_u64 blkH6; \
-   sph_u64 blkH7; \
-   sph_u64 blkS0; \
-   sph_u64 blkS1; \
-   sph_u64 blkS2; \
-   sph_u64 blkS3; \
-
-/* load initial constants */
-#define BLK_I \
-do { \
-   blkH0 = SPH_C64(0x6A09E667F3BCC908); \
-   blkH1 = SPH_C64(0xBB67AE8584CAA73B); \
-   blkH2 = SPH_C64(0x3C6EF372FE94F82B); \
-   blkH3 = SPH_C64(0xA54FF53A5F1D36F1); \
-   blkH4 = SPH_C64(0x510E527FADE682D1); \
-   blkH5 = SPH_C64(0x9B05688C2B3E6C1F); \
-   blkH6 = SPH_C64(0x1F83D9ABFB41BD6B); \
-   blkH7 = SPH_C64(0x5BE0CD19137E2179); \
-   blkS0 = 0; \
-   blkS1 = 0; \
-   blkS2 = 0; \
-   blkS3 = 0; \
-   hashctB = SPH_T64(0- 1); \
-} while (0)
-
-/* copy in 80 for initial hash */
-#define BLK_W \
-do { \
-   memcpy(hashbuf, input, 80); \
-   hashctA = SPH_C64(0xFFFFFFFFFFFFFC00) + 80*8; \
-   hashptr = 80; \
-} while (0)
-
-/* copy in 64 for looped hash */
-#define BLK_U \
-do { \
-   memcpy(hashbuf, hash , 64); \
-   hashctA = SPH_C64(0xFFFFFFFFFFFFFC00) + 64*8; \
-   hashptr = 64; \
-} while (0)
-
-/* blake compress function */
-/* hash = blake512(loaded) */
-#define BLK_C \
-do { \
- \
-   union { \
-      unsigned char buf[128]; \
-      sph_u64 dummy; \
-   } u; \
-   size_t ptr; \
-   unsigned bit_len; \
- \
-   ptr = hashptr; \
-   bit_len = ((unsigned)ptr << 3) + 0; \
-   u.buf[ptr] = ((0 & -(0x80)) | (0x80)) & 0xFF; \
-   memset(u.buf + ptr + 1, 0, 111 - ptr); \
-   u.buf[111] |= 1; \
-   sph_enc64be_aligned(u.buf + 112, 0); \
-   sph_enc64be_aligned(u.buf + 120, bit_len); \
-   do { \
-      const void *data = u.buf + ptr; \
-      unsigned char *buf; \
-      buf = hashbuf; \
-      size_t clen; \
-      clen = (sizeof(char)*128) - hashptr; \
-      memcpy(buf + hashptr, data, clen); \
-      hashctA = SPH_T64(hashctA + 1024); \
-      hashctB = SPH_T64(hashctB + 1); \
-      COMPRESS64; \
-   } while (0); \
-   /* end blake64(sc, u.buf + ptr, 128 - ptr); */ \
-   sph_enc64be((unsigned char*)(hash) + (0 << 3), blkH0), \
-   sph_enc64be((unsigned char*)(hash) + (1 << 3), blkH1); \
-   sph_enc64be((unsigned char*)(hash) + (2 << 3), blkH2), \
-   sph_enc64be((unsigned char*)(hash) + (3 << 3), blkH3); \
-   sph_enc64be((unsigned char*)(hash) + (4 << 3), blkH4), \
-   sph_enc64be((unsigned char*)(hash) + (5 << 3), blkH5); \
-   sph_enc64be((unsigned char*)(hash) + (6 << 3), blkH6), \
-   sph_enc64be((unsigned char*)(hash) + (7 << 3), blkH7); \
-} while (0)
-
-
-#ifdef __cplusplus
-}
-#endif
@@ -1,2 +0,0 @@
-#define CRYPTO_BYTES 64
-
@@ -1,2 +0,0 @@
-amd64
-x86
@@ -1,8 +0,0 @@
-#ifndef __BLAKE512_CONFIG_H__
-#define __BLAKE512_CONFIG_H__
-
-#define AVOID_BRANCHING 1
-//#define HAVE_XOP 1
-
-#endif
-
@@ -1,287 +0,0 @@
-
-#include "hash.h"
-/*
-#ifndef NOT_SUPERCOP
-
-#include "crypto_hash.h"
-#include "crypto_uint64.h"
-#include "crypto_uint32.h"
-#include "crypto_uint8.h"
-
-typedef crypto_uint64 u64;
-typedef crypto_uint32 u32;
-typedef crypto_uint8 u8;
-
-#else
-
-typedef unsigned long long u64;
-typedef unsigned int u32;
-typedef unsigned char u8;
-
-#endif
-*/
-#define U8TO32(p) \
-   (((u32)((p)[0]) << 24) | ((u32)((p)[1]) << 16) | \
-    ((u32)((p)[2]) <<  8) | ((u32)((p)[3])      ))
-#define U8TO64(p) \
-   (((u64)U8TO32(p) << 32) | (u64)U8TO32((p) + 4))
-#define U32TO8(p, v) \
-   (p)[0] = (u8)((v) >> 24); (p)[1] = (u8)((v) >> 16); \
-   (p)[2] = (u8)((v) >>  8); (p)[3] = (u8)((v)      );
-#define U64TO8(p, v) \
-   U32TO8((p), (u32)((v) >> 32)); \
-   U32TO8((p) + 4, (u32)((v)      ));
-/*
-typedef struct
-{
-   __m128i h[4];
-   u64 s[4], t[2];
-   u32 buflen, nullt;
-   u8 buf[128];
-} state __attribute__ ((aligned (64)));
-*/
-static const u8 padding[129] =
-{
-   0x80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
-   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
-   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
-   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
-};
-
-static inline int blake512_compress( hashState_blake * state, const u8 * datablock )
-{
-
-   __m128i row1l,row1h;
-   __m128i row2l,row2h;
-   __m128i row3l,row3h;
-   __m128i row4l,row4h;
-
-   const __m128i r16 = _mm_setr_epi8(2,3,4,5,6,7,0,1,10,11,12,13,14,15,8,9);
-   const __m128i u8to64 = _mm_set_epi8(8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7);
-
-   __m128i m0, m1, m2, m3, m4, m5, m6, m7;
-   __m128i t0, t1, t2, t3, t4, t5, t6, t7;
-   __m128i b0, b1, b2, b3;
-
-   m0 = _mm_loadu_si128((__m128i*)(datablock +  0));
-   m1 = _mm_loadu_si128((__m128i*)(datablock + 16));
-   m2 = _mm_loadu_si128((__m128i*)(datablock + 32));
-   m3 = _mm_loadu_si128((__m128i*)(datablock + 48));
-   m4 = _mm_loadu_si128((__m128i*)(datablock + 64));
-   m5 = _mm_loadu_si128((__m128i*)(datablock + 80));
-   m6 = _mm_loadu_si128((__m128i*)(datablock + 96));
-   m7 = _mm_loadu_si128((__m128i*)(datablock + 112));
-
-   m0 = BSWAP64(m0);
-   m1 = BSWAP64(m1);
-   m2 = BSWAP64(m2);
-   m3 = BSWAP64(m3);
-   m4 = BSWAP64(m4);
-   m5 = BSWAP64(m5);
-   m6 = BSWAP64(m6);
-   m7 = BSWAP64(m7);
-
-   row1l = state->h[0];
-   row1h = state->h[1];
-   row2l = state->h[2];
-   row2h = state->h[3];
-   row3l = _mm_set_epi64x(0x13198A2E03707344ULL, 0x243F6A8885A308D3ULL);
-   row3h = _mm_set_epi64x(0x082EFA98EC4E6C89ULL, 0xA4093822299F31D0ULL);
-
-   row4l = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0x452821E638D01377ULL);
-   row4h = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0xC0AC29B7C97C50DDULL);
-
-#ifdef AVOID_BRANCHING
-   do
-   {
-      const __m128i mask = _mm_cmpeq_epi32(_mm_setzero_si128(), _mm_set1_epi32(state->nullt));
-      const __m128i xor1 = _mm_and_si128(_mm_set1_epi64x(state->t[0]), mask);
-      const __m128i xor2 = _mm_and_si128(_mm_set1_epi64x(state->t[1]), mask);
-      row4l = _mm_xor_si128(row4l, xor1);
-      row4h = _mm_xor_si128(row4h, xor2);
-   } while(0);
-#else
-   if(!state->nullt)
-   {
-      row4l = _mm_xor_si128(row4l, _mm_set1_epi64x(state->t[0]));
-      row4h = _mm_xor_si128(row4h, _mm_set1_epi64x(state->t[1]));
-   }
-#endif
-
-   ROUND( 0);
-   ROUND( 1);
-   ROUND( 2);
-   ROUND( 3);
-   ROUND( 4);
-   ROUND( 5);
-   ROUND( 6);
-   ROUND( 7);
-   ROUND( 8);
-   ROUND( 9);
-   ROUND(10);
-   ROUND(11);
-   ROUND(12);
-   ROUND(13);
-   ROUND(14);
-   ROUND(15);
-
-   row1l = _mm_xor_si128(row3l,row1l);
-   row1h = _mm_xor_si128(row3h,row1h);
-
-   state->h[0] = _mm_xor_si128(row1l, state->h[0]);
-   state->h[1] = _mm_xor_si128(row1h, state->h[1]);
-
-   row2l = _mm_xor_si128(row4l,row2l);
-   row2h = _mm_xor_si128(row4h,row2h);
-
-   state->h[2] = _mm_xor_si128(row2l, state->h[2]);
-   state->h[3] = _mm_xor_si128(row2h, state->h[3]);
-
-   return 0;
-}
-
-static inline void blake512_init( hashState_blake * S, u64 databitlen )
-{
-   memset(S, 0, sizeof(hashState_blake));
-   S->h[0] = _mm_set_epi64x(0xBB67AE8584CAA73BULL, 0x6A09E667F3BCC908ULL);
-   S->h[1] = _mm_set_epi64x(0xA54FF53A5F1D36F1ULL, 0x3C6EF372FE94F82BULL);
-   S->h[2] = _mm_set_epi64x(0x9B05688C2B3E6C1FULL, 0x510E527FADE682D1ULL);
-   S->h[3] = _mm_set_epi64x(0x5BE0CD19137E2179ULL, 0x1F83D9ABFB41BD6BULL);
-   S->buflen = databitlen;
-}
-
-
-static void blake512_update( hashState_blake * S, const u8 * data, u64 datalen )
-{
-
-
-   int left = (S->buflen >> 3);
-   int fill = 128 - left;
-
-   if( left && ( ((datalen >> 3) & 0x7F) >= fill ) ) {
-      memcpy( (void *) (S->buf + left), (void *) data, fill );
-      S->t[0] += 1024;
-      blake512_compress( S, S->buf );
-      data += fill;
-      datalen -= (fill << 3);
-      left = 0;
-   }
-
-   while( datalen >= 1024 ) {
-      S->t[0] += 1024;
-      blake512_compress( S, data );
-      data += 128;
-      datalen -= 1024;
-   }
-
-   if( datalen > 0 ) {
-      memcpy( (void *) (S->buf + left), (void *) data, ( datalen>>3 ) & 0x7F );
-      S->buflen = (left<<3) + datalen;
-   }
-   else S->buflen=0;
|
|
||||||
}
|
|
||||||
|
|
||||||
static inline void blake512_final( hashState_blake * S, u8 * digest )
|
|
||||||
{
|
|
||||||
|
|
||||||
u8 msglen[16], zo=0x01,oo=0x81;
|
|
||||||
u64 lo=S->t[0] + S->buflen, hi = S->t[1];
|
|
||||||
if ( lo < S->buflen ) hi++;
|
|
||||||
U64TO8( msglen + 0, hi );
|
|
||||||
U64TO8( msglen + 8, lo );
|
|
||||||
|
|
||||||
if ( S->buflen == 888 ) /* one padding byte */
|
|
||||||
{
|
|
||||||
S->t[0] -= 8;
|
|
||||||
blake512_update( S, &oo, 8 );
|
|
||||||
}
|
|
||||||
else
|
|
||||||
{
|
|
||||||
if ( S->buflen < 888 ) /* enough space to fill the block */
|
|
||||||
{
|
|
||||||
if ( S->buflen == 0 ) S->nullt=1;
|
|
||||||
S->t[0] -= 888 - S->buflen;
|
|
||||||
blake512_update( S, padding, 888 - S->buflen );
|
|
||||||
}
|
|
||||||
else /* NOT enough space, need 2 compressions */
|
|
||||||
{
|
|
||||||
S->t[0] -= 1024 - S->buflen;
|
|
||||||
blake512_update( S, padding, 1024 - S->buflen );
|
|
||||||
S->t[0] -= 888;
|
|
||||||
blake512_update( S, padding+1, 888 );
|
|
||||||
S->nullt = 1;
|
|
||||||
}
|
|
||||||
blake512_update( S, &zo, 8 );
|
|
||||||
S->t[0] -= 8;
|
|
||||||
}
|
|
||||||
S->t[0] -= 128;
|
|
||||||
blake512_update( S, msglen, 128 );
|
|
||||||
|
|
||||||
do
|
|
||||||
{
|
|
||||||
const __m128i u8to64 = _mm_set_epi8(8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7);
|
|
||||||
_mm_storeu_si128((__m128i*)(digest + 0), BSWAP64(S->h[0]));
|
|
||||||
_mm_storeu_si128((__m128i*)(digest + 16), BSWAP64(S->h[1]));
|
|
||||||
_mm_storeu_si128((__m128i*)(digest + 32), BSWAP64(S->h[2]));
|
|
||||||
_mm_storeu_si128((__m128i*)(digest + 48), BSWAP64(S->h[3]));
|
|
||||||
} while(0);
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
int crypto_hash( unsigned char *out, const unsigned char *in, unsigned long long inlen )
|
|
||||||
{
|
|
||||||
|
|
||||||
hashState_blake S;
|
|
||||||
blake512_init( &S );
|
|
||||||
blake512_update( &S, in, inlen*8 );
|
|
||||||
blake512_final( &S, out );
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
*/
|
|
||||||
/*
|
|
||||||
#ifdef NOT_SUPERCOP
|
|
||||||
|
|
||||||
int main()
|
|
||||||
{
|
|
||||||
|
|
||||||
int i, v;
|
|
||||||
u8 data[144], digest[64];
|
|
||||||
u8 test1[]= {0x97, 0x96, 0x15, 0x87, 0xF6, 0xD9, 0x70, 0xFA, 0xBA, 0x6D, 0x24, 0x78, 0x04, 0x5D, 0xE6, 0xD1,
|
|
||||||
0xFA, 0xBD, 0x09, 0xB6, 0x1A, 0xE5, 0x09, 0x32, 0x05, 0x4D, 0x52, 0xBC, 0x29, 0xD3, 0x1B, 0xE4,
|
|
||||||
0xFF, 0x91, 0x02, 0xB9, 0xF6, 0x9E, 0x2B, 0xBD, 0xB8, 0x3B, 0xE1, 0x3D, 0x4B, 0x9C, 0x06, 0x09,
|
|
||||||
0x1E, 0x5F, 0xA0, 0xB4, 0x8B, 0xD0, 0x81, 0xB6, 0x34, 0x05, 0x8B, 0xE0, 0xEC, 0x49, 0xBE, 0xB3};
|
|
||||||
u8 test2[]= {0x31, 0x37, 0x17, 0xD6, 0x08, 0xE9, 0xCF, 0x75, 0x8D, 0xCB, 0x1E, 0xB0, 0xF0, 0xC3, 0xCF, 0x9F,
|
|
||||||
0xC1, 0x50, 0xB2, 0xD5, 0x00, 0xFB, 0x33, 0xF5, 0x1C, 0x52, 0xAF, 0xC9, 0x9D, 0x35, 0x8A, 0x2F,
|
|
||||||
0x13, 0x74, 0xB8, 0xA3, 0x8B, 0xBA, 0x79, 0x74, 0xE7, 0xF6, 0xEF, 0x79, 0xCA, 0xB1, 0x6F, 0x22,
|
|
||||||
0xCE, 0x1E, 0x64, 0x9D, 0x6E, 0x01, 0xAD, 0x95, 0x89, 0xC2, 0x13, 0x04, 0x5D, 0x54, 0x5D, 0xDE};
|
|
||||||
|
|
||||||
for(i=0; i<144; ++i) data[i]=0;
|
|
||||||
|
|
||||||
crypto_hash( digest, data, 1 );
|
|
||||||
v=0;
|
|
||||||
for(i=0; i<64; ++i) {
|
|
||||||
printf("%02X", digest[i]);
|
|
||||||
if ( digest[i] != test1[i]) v=1;
|
|
||||||
}
|
|
||||||
if (v) printf("\nerror\n");
|
|
||||||
else printf("\nok\n");
|
|
||||||
|
|
||||||
for(i=0; i<144; ++i) data[i]=0;
|
|
||||||
|
|
||||||
crypto_hash( digest, data, 144 );
|
|
||||||
v=0;
|
|
||||||
for(i=0; i<64; ++i) {
|
|
||||||
printf("%02X", digest[i]);
|
|
||||||
if ( digest[i] != test2[i]) v=1;
|
|
||||||
}
|
|
||||||
if (v) printf("\nerror\n");
|
|
||||||
else printf("\nok\n");
|
|
||||||
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
*/
|
|
||||||
|
|
||||||
|
|
||||||
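The length finalization in blake512_final above tracks the total message length in bits as a 128-bit value: S->t[1] holds the high word, and the low word is S->t[0] plus the residual buffer bits, with a carry propagated when the unsigned addition wraps. A minimal scalar sketch of that carry step (the `len128`/`len128_add` names are hypothetical, not part of this source):

```c
#include <stdint.h>

/* Hypothetical scalar model of the counter math in blake512_final:
   lo = S->t[0] + S->buflen; hi = S->t[1]; if (lo < S->buflen) hi++; */
typedef struct { uint64_t lo, hi; } len128;

static len128 len128_add(len128 t, uint64_t buflen_bits)
{
    len128 r;
    r.lo = t.lo + buflen_bits;
    /* unsigned wraparound: the sum is smaller than an addend iff a carry occurred */
    r.hi = t.hi + ((r.lo < buflen_bits) ? 1 : 0);
    return r;
}
```

This is why the source checks `lo < S->buflen` after the addition rather than testing for overflow beforehand.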
@@ -1,74 +0,0 @@

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>

#include "config.h"
#include "rounds.h"
/*
#ifndef NOT_SUPERCOP

#include "crypto_hash.h"
#include "crypto_uint64.h"
#include "crypto_uint32.h"
#include "crypto_uint8.h"

typedef crypto_uint64 u64;
typedef crypto_uint32 u32;
typedef crypto_uint8 u8;

#else
*/
typedef unsigned long long u64;
typedef unsigned int u32;
typedef unsigned char u8;

typedef struct
{
  __m128i h[4];
  u64 s[4], t[2];
  u32 buflen, nullt;
  u8 buf[128];
} hashState_blake __attribute__ ((aligned (64)));
/*
#endif

#define U8TO32(p) \
  (((u32)((p)[0]) << 24) | ((u32)((p)[1]) << 16) | \
   ((u32)((p)[2]) <<  8) | ((u32)((p)[3])      ))
#define U8TO64(p) \
  (((u64)U8TO32(p) << 32) | (u64)U8TO32((p) + 4))
#define U32TO8(p, v) \
  (p)[0] = (u8)((v) >> 24); (p)[1] = (u8)((v) >> 16); \
  (p)[2] = (u8)((v) >>  8); (p)[3] = (u8)((v)      );
#define U64TO8(p, v) \
  U32TO8((p), (u32)((v) >> 32)); \
  U32TO8((p) + 4, (u32)((v)));
*/

/*
static const u8 padding[129] =
{
  0x80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
};
*/
static inline void blake512_init( hashState_blake * S, u64 datalen );

static void blake512_update( hashState_blake * S, const u8 * data, u64 datalen );

static inline void blake512_final( hashState_blake * S, u8 * digest );

int crypto_hash( unsigned char *out, const unsigned char *in, unsigned long long inlen );
@@ -1,2 +0,0 @@
Jean-Philippe Aumasson
Samuel Neves
@@ -1,871 +0,0 @@

#ifndef __BLAKE512_ROUNDS_H__
#define __BLAKE512_ROUNDS_H__

#ifndef HAVE_XOP
#define BSWAP64(x) _mm_shuffle_epi8((x), u8to64)

#define _mm_roti_epi64(x, c) \
  (-(c) == 32) ? _mm_shuffle_epi32((x), _MM_SHUFFLE(2,3,0,1)) \
  : (-(c) == 16) ? _mm_shuffle_epi8((x), r16) \
  : _mm_xor_si128(_mm_srli_epi64((x), -(c)), _mm_slli_epi64((x), 64-(-c)))
#else
#define BSWAP64(x) _mm_perm_epi8((x),(x),u8to64)
#endif

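The #ifndef HAVE_XOP branch above emulates XOP's `_mm_roti_epi64` with shifts and shuffles; in the generic case, each 64-bit lane is simply rotated right by `-c`. A scalar sketch of that per-lane operation (the helper name `rotr64` is hypothetical, not part of this source):

```c
#include <stdint.h>

/* Scalar equivalent of the macro's generic case,
   _mm_xor_si128(_mm_srli_epi64(x, -c), _mm_slli_epi64(x, 64-(-c))),
   applied to one 64-bit lane, with r = -c and 0 < r < 64. */
static uint64_t rotr64(uint64_t x, unsigned r)
{
    return (x >> r) ^ (x << (64 - r));
}
```

The `-(c) == 32` and `-(c) == 16` special cases in the macro produce the same result through cheaper shuffles (pshufd / pshufb) instead of two shifts and an XOR.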
#define LOAD_MSG_0_1(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m0, m1); \
  t1 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x13198A2E03707344ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m2, m3); \
  t3 = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0xBE5466CF34E90C6CULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_0_2(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m0, m1); \
  t1 = _mm_set_epi64x(0xA4093822299F31D0ULL, 0x243F6A8885A308D3ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m2, m3); \
  t3 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0x452821E638D01377ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_0_3(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m4, m5); \
  t1 = _mm_set_epi64x(0xB8E1AFED6A267E96ULL, 0xD1310BA698DFB5ACULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m6, m7); \
  t3 = _mm_set_epi64x(0x636920D871574E69ULL, 0x24A19947B3916CF7ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_0_4(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m4, m5); \
  t1 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0x9216D5D98979FB1BULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m6, m7); \
  t3 = _mm_set_epi64x(0x801F2E2858EFC16ULL, 0xBA7C9045F12C7F99ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_1_1(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m7, m2); \
  t1 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0x2FFD72DBD01ADFB7ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m4, m6); \
  t3 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0x636920D871574E69ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_1_2(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m5, m4); \
  t1 = _mm_set_epi64x(0x452821E638D01377ULL, 0x801F2E2858EFC16ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_alignr_epi8(m3, m7, 8); \
  t3 = _mm_set_epi64x(0x24A19947B3916CF7ULL, 0xD1310BA698DFB5ACULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_1_3(b0, b1) \
do \
{ \
  t0 = _mm_shuffle_epi32(m0, _MM_SHUFFLE(1,0,3,2)); \
  t1 = _mm_set_epi64x(0xA4093822299F31D0ULL, 0xBA7C9045F12C7F99ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m5, m2); \
  t3 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x3F84D5B5B5470917ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_1_4(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m6, m1); \
  t1 = _mm_set_epi64x(0x243F6A8885A308D3ULL, 0x13198A2E03707344ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m3, m1); \
  t3 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0xB8E1AFED6A267E96ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_2_1(b0, b1) \
do \
{ \
  t0 = _mm_alignr_epi8(m6, m5, 8); \
  t1 = _mm_set_epi64x(0x243F6A8885A308D3ULL, 0x9216D5D98979FB1BULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m2, m7); \
  t3 = _mm_set_epi64x(0x24A19947B3916CF7ULL, 0xA4093822299F31D0ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_2_2(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m4, m0); \
  t1 = _mm_set_epi64x(0xBA7C9045F12C7F99ULL, 0xB8E1AFED6A267E96ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m1, m6, 0xF0); \
  t3 = _mm_set_epi64x(0x636920D871574E69ULL, 0xBE5466CF34E90C6CULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_2_3(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m5, m1, 0xF0); \
  t1 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0x801F2E2858EFC16ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m3, m4); \
  t3 = _mm_set_epi64x(0x452821E638D01377ULL, 0x13198A2E03707344ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_2_4(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m7, m3); \
  t1 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x2FFD72DBD01ADFB7ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_alignr_epi8(m2, m0, 8); \
  t3 = _mm_set_epi64x(0xD1310BA698DFB5ACULL, 0x3F84D5B5B5470917ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_3_1(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m3, m1); \
  t1 = _mm_set_epi64x(0x13198A2E03707344ULL, 0xD1310BA698DFB5ACULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m6, m5); \
  t3 = _mm_set_epi64x(0x801F2E2858EFC16ULL, 0xBA7C9045F12C7F99ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_3_2(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m4, m0); \
  t1 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x3F84D5B5B5470917ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m6, m7); \
  t3 = _mm_set_epi64x(0xB8E1AFED6A267E96ULL, 0x24A19947B3916CF7ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_3_3(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m1, m2, 0xF0); \
  t1 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0xC0AC29B7C97C50DDULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m2, m7, 0xF0); \
  t3 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0x243F6A8885A308D3ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_3_4(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m3, m5); \
  t1 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0xA4093822299F31D0ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m0, m4); \
  t3 = _mm_set_epi64x(0x636920D871574E69ULL, 0x452821E638D01377ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_4_1(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m4, m2); \
  t1 = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0x243F6A8885A308D3ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m1, m5); \
  t3 = _mm_set_epi64x(0x636920D871574E69ULL, 0x452821E638D01377ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_4_2(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m0, m3, 0xF0); \
  t1 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0xD1310BA698DFB5ACULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m2, m7, 0xF0); \
  t3 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0xA4093822299F31D0ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_4_3(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m7, m5, 0xF0); \
  t1 = _mm_set_epi64x(0xBA7C9045F12C7F99ULL, 0x13198A2E03707344ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m3, m1, 0xF0); \
  t3 = _mm_set_epi64x(0x24A19947B3916CF7ULL, 0x9216D5D98979FB1BULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_4_4(b0, b1) \
do \
{ \
  t0 = _mm_alignr_epi8(m6, m0, 8); \
  t1 = _mm_set_epi64x(0xB8E1AFED6A267E96ULL, 0x801F2E2858EFC16ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m4, m6, 0xF0); \
  t3 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0xC0AC29B7C97C50DDULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_5_1(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m1, m3); \
  t1 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0xBA7C9045F12C7F99ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m0, m4); \
  t3 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0xB8E1AFED6A267E96ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_5_2(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m6, m5); \
  t1 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0xA4093822299F31D0ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m5, m1); \
  t3 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0x243F6A8885A308D3ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_5_3(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m2, m3, 0xF0); \
  t1 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0x24A19947B3916CF7ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m7, m0); \
  t3 = _mm_set_epi64x(0xD1310BA698DFB5ACULL, 0x801F2E2858EFC16ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_5_4(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m6, m2); \
  t1 = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0x452821E638D01377ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m7, m4, 0xF0); \
  t3 = _mm_set_epi64x(0x13198A2E03707344ULL, 0x636920D871574E69ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_6_1(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m6, m0, 0xF0); \
  t1 = _mm_set_epi64x(0x636920D871574E69ULL, 0xBE5466CF34E90C6CULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m7, m2); \
  t3 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0x24A19947B3916CF7ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_6_2(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m2, m7); \
  t1 = _mm_set_epi64x(0x13198A2E03707344ULL, 0xBA7C9045F12C7F99ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_alignr_epi8(m5, m6, 8); \
  t3 = _mm_set_epi64x(0x452821E638D01377ULL, 0x801F2E2858EFC16ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_6_3(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m0, m3); \
  t1 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x3F84D5B5B5470917ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_shuffle_epi32(m4, _MM_SHUFFLE(1,0,3,2)); \
  t3 = _mm_set_epi64x(0xB8E1AFED6A267E96ULL, 0xA4093822299F31D0ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_6_4(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m3, m1); \
  t1 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0x243F6A8885A308D3ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m1, m5, 0xF0); \
  t3 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0xD1310BA698DFB5ACULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_7_1(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m6, m3); \
  t1 = _mm_set_epi64x(0x801F2E2858EFC16ULL, 0xB8E1AFED6A267E96ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m6, m1, 0xF0); \
  t3 = _mm_set_epi64x(0xD1310BA698DFB5ACULL, 0x13198A2E03707344ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_7_2(b0, b1) \
do \
{ \
  t0 = _mm_alignr_epi8(m7, m5, 8); \
  t1 = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0x24A19947B3916CF7ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m0, m4); \
  t3 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0xBA7C9045F12C7F99ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_7_3(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m2, m7); \
  t1 = _mm_set_epi64x(0x452821E638D01377ULL, 0x243F6A8885A308D3ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m4, m1); \
  t3 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0xC0AC29B7C97C50DDULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_7_4(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m0, m2); \
  t1 = _mm_set_epi64x(0x636920D871574E69ULL, 0xBE5466CF34E90C6CULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m3, m5); \
  t3 = _mm_set_epi64x(0xA4093822299F31D0ULL, 0x9216D5D98979FB1BULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_8_1(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m3, m7); \
  t1 = _mm_set_epi64x(0xD1310BA698DFB5ACULL, 0x636920D871574E69ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_alignr_epi8(m0, m5, 8); \
  t3 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0x82EFA98EC4E6C89ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_8_2(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m7, m4); \
  t1 = _mm_set_epi64x(0x801F2E2858EFC16ULL, 0xC0AC29B7C97C50DDULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_alignr_epi8(m4, m1, 8); \
  t3 = _mm_set_epi64x(0x243F6A8885A308D3ULL, 0xB8E1AFED6A267E96ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_8_3(b0, b1) \
do \
{ \
  t0 = m6; \
  t1 = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0xA4093822299F31D0ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_alignr_epi8(m5, m0, 8); \
  t3 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0x452821E638D01377ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_8_4(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m1, m3, 0xF0); \
  t1 = _mm_set_epi64x(0x24A19947B3916CF7ULL, 0xBA7C9045F12C7F99ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = m2; \
  t3 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0x13198A2E03707344ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_9_1(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m5, m4); \
  t1 = _mm_set_epi64x(0x452821E638D01377ULL, 0xA4093822299F31D0ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m3, m0); \
  t3 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0xC0AC29B7C97C50DDULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_9_2(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m1, m2); \
  t1 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0x2FFD72DBD01ADFB7ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m3, m2, 0xF0); \
  t3 = _mm_set_epi64x(0x13198A2E03707344ULL, 0x3F84D5B5B5470917ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_9_3(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m7, m4); \
  t1 = _mm_set_epi64x(0x801F2E2858EFC16ULL, 0xB8E1AFED6A267E96ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m1, m6); \
  t3 = _mm_set_epi64x(0x243F6A8885A308D3ULL, 0xBA7C9045F12C7F99ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_9_4(b0, b1) \
do \
{ \
  t0 = _mm_alignr_epi8(m7, m5, 8); \
  t1 = _mm_set_epi64x(0xD1310BA698DFB5ACULL, 0x636920D871574E69ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m6, m0); \
  t3 = _mm_set_epi64x(0x24A19947B3916CF7ULL, 0x82EFA98EC4E6C89ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_10_1(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m0, m1); \
  t1 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x13198A2E03707344ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m2, m3); \
  t3 = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0xBE5466CF34E90C6CULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_10_2(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m0, m1); \
  t1 = _mm_set_epi64x(0xA4093822299F31D0ULL, 0x243F6A8885A308D3ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m2, m3); \
  t3 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0x452821E638D01377ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_10_3(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m4, m5); \
  t1 = _mm_set_epi64x(0xB8E1AFED6A267E96ULL, 0xD1310BA698DFB5ACULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m6, m7); \
  t3 = _mm_set_epi64x(0x636920D871574E69ULL, 0x24A19947B3916CF7ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_10_4(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m4, m5); \
  t1 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0x9216D5D98979FB1BULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m6, m7); \
  t3 = _mm_set_epi64x(0x801F2E2858EFC16ULL, 0xBA7C9045F12C7F99ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_11_1(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m7, m2); \
  t1 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0x2FFD72DBD01ADFB7ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m4, m6); \
  t3 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0x636920D871574E69ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_11_2(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m5, m4); \
  t1 = _mm_set_epi64x(0x452821E638D01377ULL, 0x801F2E2858EFC16ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_alignr_epi8(m3, m7, 8); \
  t3 = _mm_set_epi64x(0x24A19947B3916CF7ULL, 0xD1310BA698DFB5ACULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_11_3(b0, b1) \
do \
{ \
  t0 = _mm_shuffle_epi32(m0, _MM_SHUFFLE(1,0,3,2)); \
  t1 = _mm_set_epi64x(0xA4093822299F31D0ULL, 0xBA7C9045F12C7F99ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m5, m2); \
  t3 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x3F84D5B5B5470917ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_11_4(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m6, m1); \
  t1 = _mm_set_epi64x(0x243F6A8885A308D3ULL, 0x13198A2E03707344ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m3, m1); \
  t3 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0xB8E1AFED6A267E96ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_12_1(b0, b1) \
do \
{ \
  t0 = _mm_alignr_epi8(m6, m5, 8); \
  t1 = _mm_set_epi64x(0x243F6A8885A308D3ULL, 0x9216D5D98979FB1BULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m2, m7); \
  t3 = _mm_set_epi64x(0x24A19947B3916CF7ULL, 0xA4093822299F31D0ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_12_2(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m4, m0); \
  t1 = _mm_set_epi64x(0xBA7C9045F12C7F99ULL, 0xB8E1AFED6A267E96ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m1, m6, 0xF0); \
  t3 = _mm_set_epi64x(0x636920D871574E69ULL, 0xBE5466CF34E90C6CULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_12_3(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m5, m1, 0xF0); \
  t1 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0x801F2E2858EFC16ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m3, m4); \
  t3 = _mm_set_epi64x(0x452821E638D01377ULL, 0x13198A2E03707344ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_12_4(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m7, m3); \
  t1 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x2FFD72DBD01ADFB7ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_alignr_epi8(m2, m0, 8); \
  t3 = _mm_set_epi64x(0xD1310BA698DFB5ACULL, 0x3F84D5B5B5470917ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_13_1(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m3, m1); \
  t1 = _mm_set_epi64x(0x13198A2E03707344ULL, 0xD1310BA698DFB5ACULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpackhi_epi64(m6, m5); \
  t3 = _mm_set_epi64x(0x801F2E2858EFC16ULL, 0xBA7C9045F12C7F99ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_13_2(b0, b1) \
do \
{ \
  t0 = _mm_unpackhi_epi64(m4, m0); \
  t1 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0x3F84D5B5B5470917ULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_unpacklo_epi64(m6, m7); \
  t3 = _mm_set_epi64x(0xB8E1AFED6A267E96ULL, 0x24A19947B3916CF7ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_13_3(b0, b1) \
do \
{ \
  t0 = _mm_blend_epi16(m1, m2, 0xF0); \
  t1 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0xC0AC29B7C97C50DDULL); \
  b0 = _mm_xor_si128(t0, t1); \
  t2 = _mm_blend_epi16(m2, m7, 0xF0); \
  t3 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0x243F6A8885A308D3ULL); \
  b1 = _mm_xor_si128(t2, t3); \
} while(0)

#define LOAD_MSG_13_4(b0, b1) \
do \
{ \
  t0 = _mm_unpacklo_epi64(m3, m5); \
  t1 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0xA4093822299F31D0ULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_unpacklo_epi64(m0, m4); \
|
|
||||||
t3 = _mm_set_epi64x(0x636920D871574E69ULL, 0x452821E638D01377ULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
#define LOAD_MSG_14_1(b0, b1) \
|
|
||||||
do \
|
|
||||||
{ \
|
|
||||||
t0 = _mm_unpackhi_epi64(m4, m2); \
|
|
||||||
t1 = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0x243F6A8885A308D3ULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_unpacklo_epi64(m1, m5); \
|
|
||||||
t3 = _mm_set_epi64x(0x636920D871574E69ULL, 0x452821E638D01377ULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
#define LOAD_MSG_14_2(b0, b1) \
|
|
||||||
do \
|
|
||||||
{ \
|
|
||||||
t0 = _mm_blend_epi16(m0, m3, 0xF0); \
|
|
||||||
t1 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0xD1310BA698DFB5ACULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_blend_epi16(m2, m7, 0xF0); \
|
|
||||||
t3 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0xA4093822299F31D0ULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
#define LOAD_MSG_14_3(b0, b1) \
|
|
||||||
do \
|
|
||||||
{ \
|
|
||||||
t0 = _mm_blend_epi16(m7, m5, 0xF0); \
|
|
||||||
t1 = _mm_set_epi64x(0xBA7C9045F12C7F99ULL, 0x13198A2E03707344ULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_blend_epi16(m3, m1, 0xF0); \
|
|
||||||
t3 = _mm_set_epi64x(0x24A19947B3916CF7ULL, 0x9216D5D98979FB1BULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
#define LOAD_MSG_14_4(b0, b1) \
|
|
||||||
do \
|
|
||||||
{ \
|
|
||||||
t0 = _mm_alignr_epi8(m6, m0, 8); \
|
|
||||||
t1 = _mm_set_epi64x(0xB8E1AFED6A267E96ULL, 0x801F2E2858EFC16ULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_blend_epi16(m4, m6, 0xF0); \
|
|
||||||
t3 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0xC0AC29B7C97C50DDULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
#define LOAD_MSG_15_1(b0, b1) \
|
|
||||||
do \
|
|
||||||
{ \
|
|
||||||
t0 = _mm_unpacklo_epi64(m1, m3); \
|
|
||||||
t1 = _mm_set_epi64x(0x2FFD72DBD01ADFB7ULL, 0xBA7C9045F12C7F99ULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_unpacklo_epi64(m0, m4); \
|
|
||||||
t3 = _mm_set_epi64x(0x82EFA98EC4E6C89ULL, 0xB8E1AFED6A267E96ULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
#define LOAD_MSG_15_2(b0, b1) \
|
|
||||||
do \
|
|
||||||
{ \
|
|
||||||
t0 = _mm_unpacklo_epi64(m6, m5); \
|
|
||||||
t1 = _mm_set_epi64x(0xC0AC29B7C97C50DDULL, 0xA4093822299F31D0ULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_unpackhi_epi64(m5, m1); \
|
|
||||||
t3 = _mm_set_epi64x(0x9216D5D98979FB1BULL, 0x243F6A8885A308D3ULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
#define LOAD_MSG_15_3(b0, b1) \
|
|
||||||
do \
|
|
||||||
{ \
|
|
||||||
t0 = _mm_blend_epi16(m2, m3, 0xF0); \
|
|
||||||
t1 = _mm_set_epi64x(0xBE5466CF34E90C6CULL, 0x24A19947B3916CF7ULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_unpackhi_epi64(m7, m0); \
|
|
||||||
t3 = _mm_set_epi64x(0xD1310BA698DFB5ACULL, 0x801F2E2858EFC16ULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
#define LOAD_MSG_15_4(b0, b1) \
|
|
||||||
do \
|
|
||||||
{ \
|
|
||||||
t0 = _mm_unpackhi_epi64(m6, m2); \
|
|
||||||
t1 = _mm_set_epi64x(0x3F84D5B5B5470917ULL, 0x452821E638D01377ULL); \
|
|
||||||
b0 = _mm_xor_si128(t0, t1); \
|
|
||||||
t2 = _mm_blend_epi16(m7, m4, 0xF0); \
|
|
||||||
t3 = _mm_set_epi64x(0x13198A2E03707344ULL, 0x636920D871574E69ULL); \
|
|
||||||
b1 = _mm_xor_si128(t2, t3); \
|
|
||||||
} while(0)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
#define G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1) \
|
|
||||||
row1l = _mm_add_epi64(_mm_add_epi64(row1l, b0), row2l); \
|
|
||||||
row1h = _mm_add_epi64(_mm_add_epi64(row1h, b1), row2h); \
|
|
||||||
\
|
|
||||||
row4l = _mm_xor_si128(row4l, row1l); \
|
|
||||||
row4h = _mm_xor_si128(row4h, row1h); \
|
|
||||||
\
|
|
||||||
row4l = _mm_roti_epi64(row4l, -32); \
|
|
||||||
row4h = _mm_roti_epi64(row4h, -32); \
|
|
||||||
\
|
|
||||||
row3l = _mm_add_epi64(row3l, row4l); \
|
|
||||||
row3h = _mm_add_epi64(row3h, row4h); \
|
|
||||||
\
|
|
||||||
row2l = _mm_xor_si128(row2l, row3l); \
|
|
||||||
row2h = _mm_xor_si128(row2h, row3h); \
|
|
||||||
\
|
|
||||||
row2l = _mm_roti_epi64(row2l, -25); \
|
|
||||||
row2h = _mm_roti_epi64(row2h, -25); \
|
|
||||||
|
|
||||||
#define G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1) \
|
|
||||||
row1l = _mm_add_epi64(_mm_add_epi64(row1l, b0), row2l); \
|
|
||||||
row1h = _mm_add_epi64(_mm_add_epi64(row1h, b1), row2h); \
|
|
||||||
\
|
|
||||||
row4l = _mm_xor_si128(row4l, row1l); \
|
|
||||||
row4h = _mm_xor_si128(row4h, row1h); \
|
|
||||||
\
|
|
||||||
row4l = _mm_roti_epi64(row4l, -16); \
|
|
||||||
row4h = _mm_roti_epi64(row4h, -16); \
|
|
||||||
\
|
|
||||||
row3l = _mm_add_epi64(row3l, row4l); \
|
|
||||||
row3h = _mm_add_epi64(row3h, row4h); \
|
|
||||||
\
|
|
||||||
row2l = _mm_xor_si128(row2l, row3l); \
|
|
||||||
row2h = _mm_xor_si128(row2h, row3h); \
|
|
||||||
\
|
|
||||||
row2l = _mm_roti_epi64(row2l, -11); \
|
|
||||||
row2h = _mm_roti_epi64(row2h, -11); \
|
|
||||||
|
|
||||||
|
|
||||||
#define DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h) \
|
|
||||||
t0 = _mm_alignr_epi8(row2h, row2l, 8); \
|
|
||||||
t1 = _mm_alignr_epi8(row2l, row2h, 8); \
|
|
||||||
row2l = t0; \
|
|
||||||
row2h = t1; \
|
|
||||||
\
|
|
||||||
t0 = row3l; \
|
|
||||||
row3l = row3h; \
|
|
||||||
row3h = t0; \
|
|
||||||
\
|
|
||||||
t0 = _mm_alignr_epi8(row4h, row4l, 8); \
|
|
||||||
t1 = _mm_alignr_epi8(row4l, row4h, 8); \
|
|
||||||
row4l = t1; \
|
|
||||||
row4h = t0;
|
|
||||||
|
|
||||||
#define UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h) \
|
|
||||||
t0 = _mm_alignr_epi8(row2l, row2h, 8); \
|
|
||||||
t1 = _mm_alignr_epi8(row2h, row2l, 8); \
|
|
||||||
row2l = t0; \
|
|
||||||
row2h = t1; \
|
|
||||||
\
|
|
||||||
t0 = row3l; \
|
|
||||||
row3l = row3h; \
|
|
||||||
row3h = t0; \
|
|
||||||
\
|
|
||||||
t0 = _mm_alignr_epi8(row4l, row4h, 8); \
|
|
||||||
t1 = _mm_alignr_epi8(row4h, row4l, 8); \
|
|
||||||
row4l = t1; \
|
|
||||||
row4h = t0;
|
|
||||||
|
|
||||||
#define ROUND(r) \
|
|
||||||
LOAD_MSG_ ##r ##_1(b0, b1); \
|
|
||||||
G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
|
|
||||||
LOAD_MSG_ ##r ##_2(b0, b1); \
|
|
||||||
G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
|
|
||||||
DIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h); \
|
|
||||||
LOAD_MSG_ ##r ##_3(b0, b1); \
|
|
||||||
G1(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
|
|
||||||
LOAD_MSG_ ##r ##_4(b0, b1); \
|
|
||||||
G2(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h,b0,b1); \
|
|
||||||
UNDIAGONALIZE(row1l,row2l,row3l,row4l,row1h,row2h,row3h,row4h);
|
|
||||||
|
|
||||||
#endif
|
|
||||||
|
|
||||||
@@ -64,7 +64,8 @@ typedef bmw_4way_small_context bmw256_4way_context;
 
 void bmw256_4way_init( bmw256_4way_context *ctx );
 
-void bmw256_4way(void *cc, const void *data, size_t len);
+void bmw256_4way_update(void *cc, const void *data, size_t len);
+#define bmw256_4way bmw256_4way_update
 
 void bmw256_4way_close(void *cc, void *dst);
 
@@ -78,7 +79,7 @@ void bmw256_4way_addbits_and_close(
 // BMW-256 8 way 32
 
 typedef struct {
-   __m256i buf[64];
+   __m256i buf[16];
    __m256i H[16];
    size_t ptr;
    uint32_t bit_count;  // assume bit_count fits in 32 bits
@@ -87,11 +88,33 @@ typedef struct {
 typedef bmw_8way_small_context bmw256_8way_context;
 
 void bmw256_8way_init( bmw256_8way_context *ctx );
-void bmw256_8way( bmw256_8way_context *ctx, const void *data, size_t len );
+void bmw256_8way_update( bmw256_8way_context *ctx, const void *data,
+                         size_t len );
+#define bmw256_8way bmw256_8way_update
 void bmw256_8way_close( bmw256_8way_context *ctx, void *dst );
 
 #endif
 
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+// BMW-256 16 way 32
+
+typedef struct {
+   __m512i buf[16];
+   __m512i H[16];
+   size_t ptr;
+   uint32_t bit_count;  // assume bit_count fits in 32 bits
+} bmw_16way_small_context __attribute__ ((aligned (128)));
+
+typedef bmw_16way_small_context bmw256_16way_context;
+
+void bmw256_16way_init( bmw256_16way_context *ctx );
+void bmw256_16way_update( bmw256_16way_context *ctx, const void *data,
+                          size_t len );
+void bmw256_16way_close( bmw256_16way_context *ctx, void *dst );
+
+#endif
+
 #if defined(__SSE2__)
 
@@ -107,7 +130,8 @@ typedef struct {
 typedef bmw_2way_big_context bmw512_2way_context;
 
 void bmw512_2way_init( bmw512_2way_context *ctx );
-void bmw512_2way( bmw512_2way_context *ctx, const void *data, size_t len );
+void bmw512_2way_update( bmw512_2way_context *ctx, const void *data,
+                         size_t len );
 void bmw512_2way_close( bmw512_2way_context *ctx, void *dst );
 
 #endif // __SSE2__
@@ -121,14 +145,15 @@ typedef struct {
    __m256i H[16];
    size_t ptr;
    sph_u64 bit_count;
-} bmw_4way_big_context;
+} bmw_4way_big_context __attribute__((aligned(128)));
 
 typedef bmw_4way_big_context bmw512_4way_context;
 
 
 void bmw512_4way_init(void *cc);
 
-void bmw512_4way(void *cc, const void *data, size_t len);
+void bmw512_4way_update(void *cc, const void *data, size_t len);
+#define bmw512_4way bmw512_4way_update
 
 void bmw512_4way_close(void *cc, void *dst);
 
@@ -137,6 +162,22 @@ void bmw512_4way_addbits_and_close(
 
 #endif // __AVX2__
 
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+typedef struct {
+   __m512i buf[16];
+   __m512i H[16];
+   size_t ptr;
+   uint64_t bit_count;
+} bmw512_8way_context __attribute__((aligned(128)));
+
+void bmw512_8way_init( bmw512_8way_context *ctx );
+void bmw512_8way_update( bmw512_8way_context *ctx, const void *data,
+                         size_t len );
+void bmw512_8way_close( bmw512_8way_context *ctx, void *dst );
+
+#endif // AVX512
+
 #ifdef __cplusplus
 }
 #endif
@@ -137,165 +137,151 @@ static const uint32_t IV256[] = {
          ss4( qt[ (i)- 2 ] ), ss5( qt[ (i)- 1 ] ) ) ), \
       add_elt_s( M, H, (i)-16 ) )
 
+// Expressions are grouped using associativity to reduce CPU dependencies,
+// resulting in some sign changes compared to the reference code.
+
 #define Ws0 \
-   _mm_add_epi32( \
    _mm_add_epi32( \
       _mm_add_epi32( \
          _mm_sub_epi32( _mm_xor_si128( M[ 5], H[ 5] ), \
                         _mm_xor_si128( M[ 7], H[ 7] ) ), \
          _mm_xor_si128( M[10], H[10] ) ), \
-      _mm_xor_si128( M[13], H[13] ) ), \
-      _mm_xor_si128( M[14], H[14] ) )
+      _mm_add_epi32( _mm_xor_si128( M[13], H[13] ), \
+                     _mm_xor_si128( M[14], H[14] ) ) )
 
 #define Ws1 \
-   _mm_sub_epi32( \
    _mm_add_epi32( \
       _mm_add_epi32( \
          _mm_sub_epi32( _mm_xor_si128( M[ 6], H[ 6] ), \
                         _mm_xor_si128( M[ 8], H[ 8] ) ), \
          _mm_xor_si128( M[11], H[11] ) ), \
-      _mm_xor_si128( M[14], H[14] ) ), \
-      _mm_xor_si128( M[15], H[15] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[14], H[14] ), \
+                     _mm_xor_si128( M[15], H[15] ) ) )
 
 #define Ws2 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_add_epi32( \
         _mm_add_epi32( _mm_xor_si128( M[ 0], H[ 0] ), \
                        _mm_xor_si128( M[ 7], H[ 7] ) ), \
         _mm_xor_si128( M[ 9], H[ 9] ) ), \
-      _mm_xor_si128( M[12], H[12] ) ), \
-      _mm_xor_si128( M[15], H[15] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[12], H[12] ), \
+                     _mm_xor_si128( M[15], H[15] ) ) )
 
 #define Ws3 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_add_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 0], H[ 0] ), \
                        _mm_xor_si128( M[ 1], H[ 1] ) ), \
         _mm_xor_si128( M[ 8], H[ 8] ) ), \
-      _mm_xor_si128( M[10], H[10] ) ), \
-      _mm_xor_si128( M[13], H[13] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[10], H[10] ), \
+                     _mm_xor_si128( M[13], H[13] ) ) )
 
 #define Ws4 \
-   _mm_sub_epi32( \
    _mm_sub_epi32( \
       _mm_add_epi32( \
         _mm_add_epi32( _mm_xor_si128( M[ 1], H[ 1] ), \
                        _mm_xor_si128( M[ 2], H[ 2] ) ), \
         _mm_xor_si128( M[ 9], H[ 9] ) ), \
-      _mm_xor_si128( M[11], H[11] ) ), \
-      _mm_xor_si128( M[14], H[14] ) )
+      _mm_add_epi32( _mm_xor_si128( M[11], H[11] ), \
+                     _mm_xor_si128( M[14], H[14] ) ) )
 
 #define Ws5 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_add_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 3], H[ 3] ), \
                        _mm_xor_si128( M[ 2], H[ 2] ) ), \
         _mm_xor_si128( M[10], H[10] ) ), \
-      _mm_xor_si128( M[12], H[12] ) ), \
-      _mm_xor_si128( M[15], H[15] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[12], H[12] ), \
+                     _mm_xor_si128( M[15], H[15] ) ) )
 
 #define Ws6 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_sub_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 4], H[ 4] ), \
                        _mm_xor_si128( M[ 0], H[ 0] ) ), \
         _mm_xor_si128( M[ 3], H[ 3] ) ), \
-      _mm_xor_si128( M[11], H[11] ) ), \
-      _mm_xor_si128( M[13], H[13] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[11], H[11] ), \
+                     _mm_xor_si128( M[13], H[13] ) ) )
 
 #define Ws7 \
-   _mm_sub_epi32( \
    _mm_sub_epi32( \
       _mm_sub_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 1], H[ 1] ), \
                        _mm_xor_si128( M[ 4], H[ 4] ) ), \
         _mm_xor_si128( M[ 5], H[ 5] ) ), \
-      _mm_xor_si128( M[12], H[12] ) ), \
-      _mm_xor_si128( M[14], H[14] ) )
+      _mm_add_epi32( _mm_xor_si128( M[12], H[12] ), \
+                     _mm_xor_si128( M[14], H[14] ) ) )
 
 #define Ws8 \
-   _mm_sub_epi32( \
    _mm_add_epi32( \
       _mm_sub_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 2], H[ 2] ), \
                        _mm_xor_si128( M[ 5], H[ 5] ) ), \
         _mm_xor_si128( M[ 6], H[ 6] ) ), \
-      _mm_xor_si128( M[13], H[13] ) ), \
-      _mm_xor_si128( M[15], H[15] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[13], H[13] ), \
+                     _mm_xor_si128( M[15], H[15] ) ) )
 
 #define Ws9 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_add_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 0], H[ 0] ), \
                        _mm_xor_si128( M[ 3], H[ 3] ) ), \
         _mm_xor_si128( M[ 6], H[ 6] ) ), \
-      _mm_xor_si128( M[ 7], H[ 7] ) ), \
-      _mm_xor_si128( M[14], H[14] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[ 7], H[ 7] ), \
+                     _mm_xor_si128( M[14], H[14] ) ) )
 
 #define Ws10 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_sub_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 8], H[ 8] ), \
                        _mm_xor_si128( M[ 1], H[ 1] ) ), \
         _mm_xor_si128( M[ 4], H[ 4] ) ), \
-      _mm_xor_si128( M[ 7], H[ 7] ) ), \
-      _mm_xor_si128( M[15], H[15] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[ 7], H[ 7] ), \
+                     _mm_xor_si128( M[15], H[15] ) ) )
 
 #define Ws11 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_sub_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 8], H[ 8] ), \
                        _mm_xor_si128( M[ 0], H[ 0] ) ), \
         _mm_xor_si128( M[ 2], H[ 2] ) ), \
-      _mm_xor_si128( M[ 5], H[ 5] ) ), \
-      _mm_xor_si128( M[ 9], H[ 9] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[ 5], H[ 5] ), \
+                     _mm_xor_si128( M[ 9], H[ 9] ) ) )
 
 #define Ws12 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_sub_epi32( \
         _mm_add_epi32( _mm_xor_si128( M[ 1], H[ 1] ), \
                        _mm_xor_si128( M[ 3], H[ 3] ) ), \
         _mm_xor_si128( M[ 6], H[ 6] ) ), \
-      _mm_xor_si128( M[ 9], H[ 9] ) ), \
-      _mm_xor_si128( M[10], H[10] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[ 9], H[ 9] ), \
+                     _mm_xor_si128( M[10], H[10] ) ) )
 
 #define Ws13 \
-   _mm_add_epi32( \
    _mm_add_epi32( \
       _mm_add_epi32( \
         _mm_add_epi32( _mm_xor_si128( M[ 2], H[ 2] ), \
                        _mm_xor_si128( M[ 4], H[ 4] ) ), \
         _mm_xor_si128( M[ 7], H[ 7] ) ), \
-      _mm_xor_si128( M[10], H[10] ) ), \
-      _mm_xor_si128( M[11], H[11] ) )
+      _mm_add_epi32( _mm_xor_si128( M[10], H[10] ), \
+                     _mm_xor_si128( M[11], H[11] ) ) )
 
 #define Ws14 \
-   _mm_sub_epi32( \
    _mm_sub_epi32( \
       _mm_add_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[ 3], H[ 3] ), \
                        _mm_xor_si128( M[ 5], H[ 5] ) ), \
         _mm_xor_si128( M[ 8], H[ 8] ) ), \
-      _mm_xor_si128( M[11], H[11] ) ), \
-      _mm_xor_si128( M[12], H[12] ) )
+      _mm_add_epi32( _mm_xor_si128( M[11], H[11] ), \
+                     _mm_xor_si128( M[12], H[12] ) ) )
 
 #define Ws15 \
-   _mm_add_epi32( \
    _mm_sub_epi32( \
       _mm_sub_epi32( \
         _mm_sub_epi32( _mm_xor_si128( M[12], H[12] ), \
                        _mm_xor_si128( M[ 4], H[4] ) ), \
         _mm_xor_si128( M[ 6], H[ 6] ) ), \
-      _mm_xor_si128( M[ 9], H[ 9] ) ), \
-      _mm_xor_si128( M[13], H[13] ) )
+      _mm_sub_epi32( _mm_xor_si128( M[ 9], H[ 9] ), \
+                     _mm_xor_si128( M[13], H[13] ) ) )
 
 
 void compress_small( const __m128i *M, const __m128i H[16], __m128i dH[16] )
@@ -578,7 +564,7 @@ bmw256_4way_init(void *cc)
 */
 
 void
-bmw256_4way(void *cc, const void *data, size_t len)
+bmw256_4way_update(void *cc, const void *data, size_t len)
 {
    bmw32_4way(cc, data, len);
 }
@@ -699,164 +685,149 @@ bmw256_4way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
 
 
 #define W8s0 \
-   _mm256_add_epi32( \
    _mm256_add_epi32( \
       _mm256_add_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 5], H[ 5] ), \
                            _mm256_xor_si256( M[ 7], H[ 7] ) ), \
          _mm256_xor_si256( M[10], H[10] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) )
+      _mm256_add_epi32( _mm256_xor_si256( M[13], H[13] ), \
+                        _mm256_xor_si256( M[14], H[14] ) ) )
 
 #define W8s1 \
-   _mm256_sub_epi32( \
    _mm256_add_epi32( \
       _mm256_add_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 6], H[ 6] ), \
                            _mm256_xor_si256( M[ 8], H[ 8] ) ), \
          _mm256_xor_si256( M[11], H[11] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[14], H[14] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )
 
 #define W8s2 \
-   _mm256_add_epi32( \
    _mm256_sub_epi32( \
       _mm256_add_epi32( \
          _mm256_add_epi32( _mm256_xor_si256( M[ 0], H[ 0] ), \
                            _mm256_xor_si256( M[ 7], H[ 7] ) ), \
          _mm256_xor_si256( M[ 9], H[ 9] ) ), \
-      _mm256_xor_si256( M[12], H[12] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[12], H[12] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )
 
 #define W8s3 \
-   _mm256_add_epi32( \
    _mm256_sub_epi32( \
      _mm256_add_epi32( \
         _mm256_sub_epi32( _mm256_xor_si256( M[ 0], H[ 0] ), \
                           _mm256_xor_si256( M[ 1], H[ 1] ) ), \
        _mm256_xor_si256( M[ 8], H[ 8] ) ), \
-      _mm256_xor_si256( M[10], H[10] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[10], H[10] ), \
+                        _mm256_xor_si256( M[13], H[13] ) ) )
 
 #define W8s4 \
-   _mm256_sub_epi32( \
    _mm256_sub_epi32( \
       _mm256_add_epi32( \
          _mm256_add_epi32( _mm256_xor_si256( M[ 1], H[ 1] ), \
                            _mm256_xor_si256( M[ 2], H[ 2] ) ), \
          _mm256_xor_si256( M[ 9], H[ 9] ) ), \
-      _mm256_xor_si256( M[11], H[11] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) )
+      _mm256_add_epi32( _mm256_xor_si256( M[11], H[11] ), \
+                        _mm256_xor_si256( M[14], H[14] ) ) )
 
 #define W8s5 \
-   _mm256_add_epi32( \
    _mm256_sub_epi32( \
       _mm256_add_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 3], H[ 3] ), \
                            _mm256_xor_si256( M[ 2], H[ 2] ) ), \
          _mm256_xor_si256( M[10], H[10] ) ), \
-      _mm256_xor_si256( M[12], H[12] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[12], H[12] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )
 
 #define W8s6 \
-   _mm256_add_epi32( \
    _mm256_sub_epi32( \
       _mm256_sub_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 4], H[ 4] ), \
                            _mm256_xor_si256( M[ 0], H[ 0] ) ), \
          _mm256_xor_si256( M[ 3], H[ 3] ) ), \
-      _mm256_xor_si256( M[11], H[11] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[11], H[11] ), \
+                        _mm256_xor_si256( M[13], H[13] ) ) )
 
 #define W8s7 \
-   _mm256_sub_epi32( \
    _mm256_sub_epi32( \
       _mm256_sub_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 1], H[ 1] ), \
                            _mm256_xor_si256( M[ 4], H[ 4] ) ), \
          _mm256_xor_si256( M[ 5], H[ 5] ) ), \
-      _mm256_xor_si256( M[12], H[12] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) )
+      _mm256_add_epi32( _mm256_xor_si256( M[12], H[12] ), \
+                        _mm256_xor_si256( M[14], H[14] ) ) )
 
 #define W8s8 \
-   _mm256_sub_epi32( \
    _mm256_add_epi32( \
       _mm256_sub_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 2], H[ 2] ), \
                            _mm256_xor_si256( M[ 5], H[ 5] ) ), \
          _mm256_xor_si256( M[ 6], H[ 6] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[13], H[13] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )
 
 #define W8s9 \
-   _mm256_add_epi32( \
    _mm256_sub_epi32( \
       _mm256_add_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 0], H[ 0] ), \
                            _mm256_xor_si256( M[ 3], H[ 3] ) ), \
          _mm256_xor_si256( M[ 6], H[ 6] ) ), \
-      _mm256_xor_si256( M[ 7], H[ 7] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[ 7], H[ 7] ), \
+                        _mm256_xor_si256( M[14], H[14] ) ) )
 
 #define W8s10 \
-   _mm256_add_epi32( \
    _mm256_sub_epi32( \
       _mm256_sub_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 8], H[ 8] ), \
                            _mm256_xor_si256( M[ 1], H[ 1] ) ), \
          _mm256_xor_si256( M[ 4], H[ 4] ) ), \
-      _mm256_xor_si256( M[ 7], H[ 7] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[ 7], H[ 7] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )
 
 #define W8s11 \
-   _mm256_add_epi32( \
    _mm256_sub_epi32( \
       _mm256_sub_epi32( \
          _mm256_sub_epi32( _mm256_xor_si256( M[ 8], H[ 8] ), \
                            _mm256_xor_si256( M[ 0], H[ 0] ) ), \
          _mm256_xor_si256( M[ 2], H[ 2] ) ), \
-      _mm256_xor_si256( M[ 5], H[ 5] ) ), \
-      _mm256_xor_si256( M[ 9], H[ 9] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[ 5], H[ 5] ), \
+                        _mm256_xor_si256( M[ 9], H[ 9] ) ) )
 
 #define W8s12 \
-   _mm256_add_epi32( \
    _mm256_sub_epi32( \
       _mm256_sub_epi32( \
          _mm256_add_epi32( _mm256_xor_si256( M[ 1], H[ 1] ), \
                            _mm256_xor_si256( M[ 3], H[ 3] ) ), \
          _mm256_xor_si256( M[ 6], H[ 6] ) ), \
-      _mm256_xor_si256( M[ 9], H[ 9] ) ), \
-      _mm256_xor_si256( M[10], H[10] ) )
+      _mm256_sub_epi32( _mm256_xor_si256( M[ 9], H[ 9] ), \
+                        _mm256_xor_si256( M[10], H[10] ) ) )
 
 #define W8s13 \
-   _mm256_add_epi32( \
    _mm256_add_epi32( \
       _mm256_add_epi32( \
          _mm256_add_epi32( _mm256_xor_si256( M[ 2], H[ 2] ), \
                            _mm256_xor_si256( M[ 4], H[ 4] ) ), \
          _mm256_xor_si256( M[ 7], H[ 7] ) ), \
-      _mm256_xor_si256( M[10], H[10] ) ), \
+      _mm256_add_epi32( _mm256_xor_si256( M[10], H[10] ), \
||||||
_mm256_xor_si256( M[11], H[11] ) )
|
_mm256_xor_si256( M[11], H[11] ) ) )
|
||||||
|
|
||||||
#define W8s14 \
|
#define W8s14 \
|
||||||
_mm256_sub_epi32( \
|
|
||||||
_mm256_sub_epi32( \
|
_mm256_sub_epi32( \
|
||||||
_mm256_add_epi32( \
|
_mm256_add_epi32( \
|
||||||
_mm256_sub_epi32( _mm256_xor_si256( M[ 3], H[ 3] ), \
|
_mm256_sub_epi32( _mm256_xor_si256( M[ 3], H[ 3] ), \
|
||||||
_mm256_xor_si256( M[ 5], H[ 5] ) ), \
|
_mm256_xor_si256( M[ 5], H[ 5] ) ), \
|
||||||
_mm256_xor_si256( M[ 8], H[ 8] ) ), \
|
_mm256_xor_si256( M[ 8], H[ 8] ) ), \
|
||||||
_mm256_xor_si256( M[11], H[11] ) ), \
|
_mm256_add_epi32( _mm256_xor_si256( M[11], H[11] ), \
|
||||||
_mm256_xor_si256( M[12], H[12] ) )
|
_mm256_xor_si256( M[12], H[12] ) ) )
|
||||||
|
|
||||||
#define W8s15 \
|
#define W8s15 \
|
||||||
_mm256_add_epi32( \
|
|
||||||
_mm256_sub_epi32( \
|
_mm256_sub_epi32( \
|
||||||
_mm256_sub_epi32( \
|
_mm256_sub_epi32( \
|
||||||
_mm256_sub_epi32( _mm256_xor_si256( M[12], H[12] ), \
|
_mm256_sub_epi32( _mm256_xor_si256( M[12], H[12] ), \
|
||||||
_mm256_xor_si256( M[ 4], H[4] ) ), \
|
_mm256_xor_si256( M[ 4], H[4] ) ), \
|
||||||
_mm256_xor_si256( M[ 6], H[ 6] ) ), \
|
_mm256_xor_si256( M[ 6], H[ 6] ) ), \
|
||||||
_mm256_xor_si256( M[ 9], H[ 9] ) ), \
|
_mm256_sub_epi32( _mm256_xor_si256( M[ 9], H[ 9] ), \
|
||||||
_mm256_xor_si256( M[13], H[13] ) )
|
_mm256_xor_si256( M[13], H[13] ) ) )
|
||||||
|
|
||||||
|
|
||||||
void compress_small_8way( const __m256i *M, const __m256i H[16],
|
void compress_small_8way( const __m256i *M, const __m256i H[16],
|
||||||
__m256i dH[16] )
|
__m256i dH[16] )
|
||||||
@@ -903,6 +874,57 @@ void compress_small_8way( const __m256i *M, const __m256i H[16],
|
|||||||
mm256_xor4( qt[24], qt[25], qt[26], qt[27] ),
|
mm256_xor4( qt[24], qt[25], qt[26], qt[27] ),
|
||||||
mm256_xor4( qt[28], qt[29], qt[30], qt[31] ) ) );
|
mm256_xor4( qt[28], qt[29], qt[30], qt[31] ) ) );
|
||||||
|
|
||||||
|
#define DH1L( m, sl, sr, a, b, c ) \
|
||||||
|
_mm256_add_epi32( \
|
||||||
|
_mm256_xor_si256( M[m], \
|
||||||
|
_mm256_xor_si256( _mm256_slli_epi32( xh, sl ), \
|
||||||
|
_mm256_srli_epi32( qt[a], sr ) ) ), \
|
||||||
|
_mm256_xor_si256( _mm256_xor_si256( xl, qt[b] ), qt[c] ) )
|
||||||
|
|
||||||
|
#define DH1R( m, sl, sr, a, b, c ) \
|
||||||
|
_mm256_add_epi32( \
|
||||||
|
_mm256_xor_si256( M[m], \
|
||||||
|
_mm256_xor_si256( _mm256_srli_epi32( xh, sl ), \
|
||||||
|
_mm256_slli_epi32( qt[a], sr ) ) ), \
|
||||||
|
_mm256_xor_si256( _mm256_xor_si256( xl, qt[b] ), qt[c] ) )
|
||||||
|
|
||||||
|
#define DH2L( m, rl, sl, h, a, b, c ) \
|
||||||
|
_mm256_add_epi32( _mm256_add_epi32( \
|
||||||
|
mm256_rol_32( dH[h], rl ), \
|
||||||
|
_mm256_xor_si256( _mm256_xor_si256( xh, qt[a] ), M[m] )), \
|
||||||
|
_mm256_xor_si256( _mm256_slli_epi32( xl, sl ), \
|
||||||
|
_mm256_xor_si256( qt[b], qt[c] ) ) );
|
||||||
|
|
||||||
|
#define DH2R( m, rl, sr, h, a, b, c ) \
|
||||||
|
_mm256_add_epi32( _mm256_add_epi32( \
|
||||||
|
mm256_rol_32( dH[h], rl ), \
|
||||||
|
_mm256_xor_si256( _mm256_xor_si256( xh, qt[a] ), M[m] )), \
|
||||||
|
_mm256_xor_si256( _mm256_srli_epi32( xl, sr ), \
|
||||||
|
_mm256_xor_si256( qt[b], qt[c] ) ) );
|
||||||
|
|
||||||
|
dH[ 0] = DH1L( 0, 5, 5, 16, 24, 0 );
|
||||||
|
dH[ 1] = DH1R( 1, 7, 8, 17, 25, 1 );
|
||||||
|
dH[ 2] = DH1R( 2, 5, 5, 18, 26, 2 );
|
||||||
|
dH[ 3] = DH1R( 3, 1, 5, 19, 27, 3 );
|
||||||
|
dH[ 4] = DH1R( 4, 3, 0, 20, 28, 4 );
|
||||||
|
dH[ 5] = DH1L( 5, 6, 6, 21, 29, 5 );
|
||||||
|
dH[ 6] = DH1R( 6, 4, 6, 22, 30, 6 );
|
||||||
|
dH[ 7] = DH1R( 7, 11, 2, 23, 31, 7 );
|
||||||
|
dH[ 8] = DH2L( 8, 9, 8, 4, 24, 23, 8 );
|
||||||
|
dH[ 9] = DH2R( 9, 10, 6, 5, 25, 16, 9 );
|
||||||
|
dH[10] = DH2L( 10, 11, 6, 6, 26, 17, 10 );
|
||||||
|
dH[11] = DH2L( 11, 12, 4, 7, 27, 18, 11 );
|
||||||
|
dH[12] = DH2R( 12, 13, 3, 0, 28, 19, 12 );
|
||||||
|
dH[13] = DH2R( 13, 14, 4, 1, 29, 20, 13 );
|
||||||
|
dH[14] = DH2R( 14, 15, 7, 2, 30, 21, 14 );
|
||||||
|
dH[15] = DH2R( 15, 16, 2, 3, 31, 22, 15 );
|
||||||
|
|
||||||
|
#undef DH1L
|
||||||
|
#undef DH1R
|
||||||
|
#undef DH2L
|
||||||
|
#undef DH2R
|
||||||
|
|
||||||
|
/*
|
||||||
dH[ 0] = _mm256_add_epi32(
|
dH[ 0] = _mm256_add_epi32(
|
||||||
_mm256_xor_si256( M[0],
|
_mm256_xor_si256( M[0],
|
||||||
_mm256_xor_si256( _mm256_slli_epi32( xh, 5 ),
|
_mm256_xor_si256( _mm256_slli_epi32( xh, 5 ),
|
||||||
@@ -983,6 +1005,7 @@ void compress_small_8way( const __m256i *M, const __m256i H[16],
|
|||||||
_mm256_xor_si256( _mm256_xor_si256( xh, qt[31] ), M[15] )),
|
_mm256_xor_si256( _mm256_xor_si256( xh, qt[31] ), M[15] )),
|
||||||
_mm256_xor_si256( _mm256_srli_epi32( xl, 2 ),
|
_mm256_xor_si256( _mm256_srli_epi32( xl, 2 ),
|
||||||
_mm256_xor_si256( qt[22], qt[15] ) ) );
|
_mm256_xor_si256( qt[22], qt[15] ) ) );
|
||||||
|
*/
|
||||||
}
|
}
|
||||||
|
|
||||||
static const __m256i final_s8[16] =
|
static const __m256i final_s8[16] =
|
||||||
@@ -1043,7 +1066,8 @@ void bmw256_8way_init( bmw256_8way_context *ctx )
|
|||||||
ctx->bit_count = 0;
|
ctx->bit_count = 0;
|
||||||
}
|
}
|
||||||
|
|
||||||
void bmw256_8way( bmw256_8way_context *ctx, const void *data, size_t len )
|
void bmw256_8way_update( bmw256_8way_context *ctx, const void *data,
|
||||||
|
size_t len )
|
||||||
{
|
{
|
||||||
__m256i *vdata = (__m256i*)data;
|
__m256i *vdata = (__m256i*)data;
|
||||||
__m256i *buf;
|
__m256i *buf;
|
||||||
@@ -1121,6 +1145,513 @@ void bmw256_8way_close( bmw256_8way_context *ctx, void *dst )
|
|||||||
|
|
||||||
#endif // __AVX2__
|
#endif // __AVX2__
|
||||||

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+// BMW-256 16 way 32
+
+#define s16s0(x) \
+   mm512_xor4( _mm512_srli_epi32( (x), 1), \
+               _mm512_slli_epi32( (x), 3), \
+               mm512_rol_32( (x),  4), \
+               mm512_rol_32( (x), 19) )
+
+#define s16s1(x) \
+   mm512_xor4( _mm512_srli_epi32( (x), 1), \
+               _mm512_slli_epi32( (x), 2), \
+               mm512_rol_32( (x),  8), \
+               mm512_rol_32( (x), 23) )
+
+#define s16s2(x) \
+   mm512_xor4( _mm512_srli_epi32( (x), 2), \
+               _mm512_slli_epi32( (x), 1), \
+               mm512_rol_32( (x), 12), \
+               mm512_rol_32( (x), 25) )
+
+#define s16s3(x) \
+   mm512_xor4( _mm512_srli_epi32( (x), 2), \
+               _mm512_slli_epi32( (x), 2), \
+               mm512_rol_32( (x), 15), \
+               mm512_rol_32( (x), 29) )
+
+#define s16s4(x) \
+   _mm512_xor_si512( (x), _mm512_srli_epi32( (x), 1 ) )
+
+#define s16s5(x) \
+   _mm512_xor_si512( (x), _mm512_srli_epi32( (x), 2 ) )
+
+#define r16s1(x) mm512_rol_32( x,  3 )
+#define r16s2(x) mm512_rol_32( x,  7 )
+#define r16s3(x) mm512_rol_32( x, 13 )
+#define r16s4(x) mm512_rol_32( x, 16 )
+#define r16s5(x) mm512_rol_32( x, 19 )
+#define r16s6(x) mm512_rol_32( x, 23 )
+#define r16s7(x) mm512_rol_32( x, 27 )
+
+#define mm512_rol_off_32( M, j, off ) \
+   mm512_rol_32( M[ ( (j) + (off) ) & 0xF ] , \
+                 ( ( (j) + (off) ) & 0xF ) + 1 )
+
+#define add_elt_s16( M, H, j ) \
+   _mm512_xor_si512( \
+      _mm512_add_epi32( \
+         _mm512_sub_epi32( _mm512_add_epi32( mm512_rol_off_32( M, j, 0 ), \
+                                             mm512_rol_off_32( M, j, 3 ) ), \
+                           mm512_rol_off_32( M, j, 10 ) ), \
+         _mm512_set1_epi32( ( (j) + 16 ) * 0x05555555UL ) ), \
+      H[ ( (j)+7 ) & 0xF ] )
+
+#define expand1s16( qt, M, H, i ) \
+   _mm512_add_epi32( add_elt_s16( M, H, (i)-16 ), \
+      mm512_add4_32( mm512_add4_32( s16s1( qt[ (i)-16 ] ), \
+                                    s16s2( qt[ (i)-15 ] ), \
+                                    s16s3( qt[ (i)-14 ] ), \
+                                    s16s0( qt[ (i)-13 ] ) ), \
+                     mm512_add4_32( s16s1( qt[ (i)-12 ] ), \
+                                    s16s2( qt[ (i)-11 ] ), \
+                                    s16s3( qt[ (i)-10 ] ), \
+                                    s16s0( qt[ (i)- 9 ] ) ), \
+                     mm512_add4_32( s16s1( qt[ (i)- 8 ] ), \
+                                    s16s2( qt[ (i)- 7 ] ), \
+                                    s16s3( qt[ (i)- 6 ] ), \
+                                    s16s0( qt[ (i)- 5 ] ) ), \
+                     mm512_add4_32( s16s1( qt[ (i)- 4 ] ), \
+                                    s16s2( qt[ (i)- 3 ] ), \
+                                    s16s3( qt[ (i)- 2 ] ), \
+                                    s16s0( qt[ (i)- 1 ] ) ) ) )
+
+#define expand2s16( qt, M, H, i) \
+   _mm512_add_epi32( add_elt_s16( M, H, (i)-16 ), \
+      mm512_add4_32( mm512_add4_32( qt[ (i)-16 ], \
+                                    r16s1( qt[ (i)-15 ] ), \
+                                    qt[ (i)-14 ], \
+                                    r16s2( qt[ (i)-13 ] ) ), \
+                     mm512_add4_32( qt[ (i)-12 ], \
+                                    r16s3( qt[ (i)-11 ] ), \
+                                    qt[ (i)-10 ], \
+                                    r16s4( qt[ (i)- 9 ] ) ), \
+                     mm512_add4_32( qt[ (i)- 8 ], \
+                                    r16s5( qt[ (i)- 7 ] ), \
+                                    qt[ (i)- 6 ], \
+                                    r16s6( qt[ (i)- 5 ] ) ), \
+                     mm512_add4_32( qt[ (i)- 4 ], \
+                                    r16s7( qt[ (i)- 3 ] ), \
+                                    s16s4( qt[ (i)- 2 ] ), \
+                                    s16s5( qt[ (i)- 1 ] ) ) ) )
+
+#define W16s0 \
+   _mm512_add_epi32( \
+      _mm512_add_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 5], H[ 5] ), \
+                           _mm512_xor_si512( M[ 7], H[ 7] ) ), \
+                           _mm512_xor_si512( M[10], H[10] ) ), \
+      _mm512_add_epi32( _mm512_xor_si512( M[13], H[13] ), \
+                        _mm512_xor_si512( M[14], H[14] ) ) )
+
+#define W16s1 \
+   _mm512_add_epi32( \
+      _mm512_add_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 6], H[ 6] ), \
+                           _mm512_xor_si512( M[ 8], H[ 8] ) ), \
+                           _mm512_xor_si512( M[11], H[11] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[14], H[14] ), \
+                        _mm512_xor_si512( M[15], H[15] ) ) )
+
+#define W16s2 \
+   _mm512_sub_epi32( \
+      _mm512_add_epi32( \
+         _mm512_add_epi32( _mm512_xor_si512( M[ 0], H[ 0] ), \
+                           _mm512_xor_si512( M[ 7], H[ 7] ) ), \
+                           _mm512_xor_si512( M[ 9], H[ 9] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[12], H[12] ), \
+                        _mm512_xor_si512( M[15], H[15] ) ) )
+
+#define W16s3 \
+   _mm512_sub_epi32( \
+      _mm512_add_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 0], H[ 0] ), \
+                           _mm512_xor_si512( M[ 1], H[ 1] ) ), \
+                           _mm512_xor_si512( M[ 8], H[ 8] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[10], H[10] ), \
+                        _mm512_xor_si512( M[13], H[13] ) ) )
+
+#define W16s4 \
+   _mm512_sub_epi32( \
+      _mm512_add_epi32( \
+         _mm512_add_epi32( _mm512_xor_si512( M[ 1], H[ 1] ), \
+                           _mm512_xor_si512( M[ 2], H[ 2] ) ), \
+                           _mm512_xor_si512( M[ 9], H[ 9] ) ), \
+      _mm512_add_epi32( _mm512_xor_si512( M[11], H[11] ), \
+                        _mm512_xor_si512( M[14], H[14] ) ) )
+
+#define W16s5 \
+   _mm512_sub_epi32( \
+      _mm512_add_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 3], H[ 3] ), \
+                           _mm512_xor_si512( M[ 2], H[ 2] ) ), \
+                           _mm512_xor_si512( M[10], H[10] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[12], H[12] ), \
+                        _mm512_xor_si512( M[15], H[15] ) ) )
+
+#define W16s6 \
+   _mm512_sub_epi32( \
+      _mm512_sub_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 4], H[ 4] ), \
+                           _mm512_xor_si512( M[ 0], H[ 0] ) ), \
+                           _mm512_xor_si512( M[ 3], H[ 3] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[11], H[11] ), \
+                        _mm512_xor_si512( M[13], H[13] ) ) )
+
+#define W16s7 \
+   _mm512_sub_epi32( \
+      _mm512_sub_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 1], H[ 1] ), \
+                           _mm512_xor_si512( M[ 4], H[ 4] ) ), \
+                           _mm512_xor_si512( M[ 5], H[ 5] ) ), \
+      _mm512_add_epi32( _mm512_xor_si512( M[12], H[12] ), \
+                        _mm512_xor_si512( M[14], H[14] ) ) )
+
+#define W16s8 \
+   _mm512_add_epi32( \
+      _mm512_sub_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 2], H[ 2] ), \
+                           _mm512_xor_si512( M[ 5], H[ 5] ) ), \
+                           _mm512_xor_si512( M[ 6], H[ 6] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[13], H[13] ), \
+                        _mm512_xor_si512( M[15], H[15] ) ) )
+
+#define W16s9 \
+   _mm512_sub_epi32( \
+      _mm512_add_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 0], H[ 0] ), \
+                           _mm512_xor_si512( M[ 3], H[ 3] ) ), \
+                           _mm512_xor_si512( M[ 6], H[ 6] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[ 7], H[ 7] ), \
+                        _mm512_xor_si512( M[14], H[14] ) ) )
+
+#define W16s10 \
+   _mm512_sub_epi32( \
+      _mm512_sub_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 8], H[ 8] ), \
+                           _mm512_xor_si512( M[ 1], H[ 1] ) ), \
+                           _mm512_xor_si512( M[ 4], H[ 4] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[ 7], H[ 7] ), \
+                        _mm512_xor_si512( M[15], H[15] ) ) )
+
+#define W16s11 \
+   _mm512_sub_epi32( \
+      _mm512_sub_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 8], H[ 8] ), \
+                           _mm512_xor_si512( M[ 0], H[ 0] ) ), \
+                           _mm512_xor_si512( M[ 2], H[ 2] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[ 5], H[ 5] ), \
+                        _mm512_xor_si512( M[ 9], H[ 9] ) ) )
+
+#define W16s12 \
+   _mm512_sub_epi32( \
+      _mm512_sub_epi32( \
+         _mm512_add_epi32( _mm512_xor_si512( M[ 1], H[ 1] ), \
+                           _mm512_xor_si512( M[ 3], H[ 3] ) ), \
+                           _mm512_xor_si512( M[ 6], H[ 6] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[ 9], H[ 9] ), \
+                        _mm512_xor_si512( M[10], H[10] ) ) )
+
+#define W16s13 \
+   _mm512_add_epi32( \
+      _mm512_add_epi32( \
+         _mm512_add_epi32( _mm512_xor_si512( M[ 2], H[ 2] ), \
+                           _mm512_xor_si512( M[ 4], H[ 4] ) ), \
+                           _mm512_xor_si512( M[ 7], H[ 7] ) ), \
+      _mm512_add_epi32( _mm512_xor_si512( M[10], H[10] ), \
+                        _mm512_xor_si512( M[11], H[11] ) ) )
+
+#define W16s14 \
+   _mm512_sub_epi32( \
+      _mm512_add_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[ 3], H[ 3] ), \
+                           _mm512_xor_si512( M[ 5], H[ 5] ) ), \
+                           _mm512_xor_si512( M[ 8], H[ 8] ) ), \
+      _mm512_add_epi32( _mm512_xor_si512( M[11], H[11] ), \
+                        _mm512_xor_si512( M[12], H[12] ) ) )
+
+#define W16s15 \
+   _mm512_sub_epi32( \
+      _mm512_sub_epi32( \
+         _mm512_sub_epi32( _mm512_xor_si512( M[12], H[12] ), \
+                           _mm512_xor_si512( M[ 4], H[4] ) ), \
+                           _mm512_xor_si512( M[ 6], H[ 6] ) ), \
+      _mm512_sub_epi32( _mm512_xor_si512( M[ 9], H[ 9] ), \
+                        _mm512_xor_si512( M[13], H[13] ) ) )
+
+void compress_small_16way( const __m512i *M, const __m512i H[16],
+                           __m512i dH[16] )
+{
+   __m512i qt[32], xl, xh;
+
+   qt[ 0] = _mm512_add_epi32( s16s0( W16s0 ), H[ 1] );
+   qt[ 1] = _mm512_add_epi32( s16s1( W16s1 ), H[ 2] );
+   qt[ 2] = _mm512_add_epi32( s16s2( W16s2 ), H[ 3] );
+   qt[ 3] = _mm512_add_epi32( s16s3( W16s3 ), H[ 4] );
+   qt[ 4] = _mm512_add_epi32( s16s4( W16s4 ), H[ 5] );
+   qt[ 5] = _mm512_add_epi32( s16s0( W16s5 ), H[ 6] );
+   qt[ 6] = _mm512_add_epi32( s16s1( W16s6 ), H[ 7] );
+   qt[ 7] = _mm512_add_epi32( s16s2( W16s7 ), H[ 8] );
+   qt[ 8] = _mm512_add_epi32( s16s3( W16s8 ), H[ 9] );
+   qt[ 9] = _mm512_add_epi32( s16s4( W16s9 ), H[10] );
+   qt[10] = _mm512_add_epi32( s16s0( W16s10), H[11] );
+   qt[11] = _mm512_add_epi32( s16s1( W16s11), H[12] );
+   qt[12] = _mm512_add_epi32( s16s2( W16s12), H[13] );
+   qt[13] = _mm512_add_epi32( s16s3( W16s13), H[14] );
+   qt[14] = _mm512_add_epi32( s16s4( W16s14), H[15] );
+   qt[15] = _mm512_add_epi32( s16s0( W16s15), H[ 0] );
+   qt[16] = expand1s16( qt, M, H, 16 );
+   qt[17] = expand1s16( qt, M, H, 17 );
+   qt[18] = expand2s16( qt, M, H, 18 );
+   qt[19] = expand2s16( qt, M, H, 19 );
+   qt[20] = expand2s16( qt, M, H, 20 );
+   qt[21] = expand2s16( qt, M, H, 21 );
+   qt[22] = expand2s16( qt, M, H, 22 );
+   qt[23] = expand2s16( qt, M, H, 23 );
+   qt[24] = expand2s16( qt, M, H, 24 );
+   qt[25] = expand2s16( qt, M, H, 25 );
+   qt[26] = expand2s16( qt, M, H, 26 );
+   qt[27] = expand2s16( qt, M, H, 27 );
+   qt[28] = expand2s16( qt, M, H, 28 );
+   qt[29] = expand2s16( qt, M, H, 29 );
+   qt[30] = expand2s16( qt, M, H, 30 );
+   qt[31] = expand2s16( qt, M, H, 31 );
+
+   xl = _mm512_xor_si512(
+        mm512_xor4( qt[16], qt[17], qt[18], qt[19] ),
+        mm512_xor4( qt[20], qt[21], qt[22], qt[23] ) );
+   xh = _mm512_xor_si512( xl, _mm512_xor_si512(
+        mm512_xor4( qt[24], qt[25], qt[26], qt[27] ),
+        mm512_xor4( qt[28], qt[29], qt[30], qt[31] ) ) );
+
+#define DH1L( m, sl, sr, a, b, c ) \
+   _mm512_add_epi32( \
+      _mm512_xor_si512( M[m], \
+         _mm512_xor_si512( _mm512_slli_epi32( xh, sl ), \
+                           _mm512_srli_epi32( qt[a], sr ) ) ), \
+      _mm512_xor_si512( _mm512_xor_si512( xl, qt[b] ), qt[c] ) )
+
+#define DH1R( m, sl, sr, a, b, c ) \
+   _mm512_add_epi32( \
+      _mm512_xor_si512( M[m], \
+         _mm512_xor_si512( _mm512_srli_epi32( xh, sl ), \
+                           _mm512_slli_epi32( qt[a], sr ) ) ), \
+      _mm512_xor_si512( _mm512_xor_si512( xl, qt[b] ), qt[c] ) )
+
+#define DH2L( m, rl, sl, h, a, b, c ) \
+   _mm512_add_epi32( _mm512_add_epi32( \
+      mm512_rol_32( dH[h], rl ), \
+      _mm512_xor_si512( _mm512_xor_si512( xh, qt[a] ), M[m] )), \
+      _mm512_xor_si512( _mm512_slli_epi32( xl, sl ), \
+                        _mm512_xor_si512( qt[b], qt[c] ) ) );
+
+#define DH2R( m, rl, sr, h, a, b, c ) \
+   _mm512_add_epi32( _mm512_add_epi32( \
+      mm512_rol_32( dH[h], rl ), \
+      _mm512_xor_si512( _mm512_xor_si512( xh, qt[a] ), M[m] )), \
+      _mm512_xor_si512( _mm512_srli_epi32( xl, sr ), \
+                        _mm512_xor_si512( qt[b], qt[c] ) ) );
+
+   dH[ 0] = DH1L( 0, 5, 5, 16, 24, 0 );
+   dH[ 1] = DH1R( 1, 7, 8, 17, 25, 1 );
+   dH[ 2] = DH1R( 2, 5, 5, 18, 26, 2 );
+   dH[ 3] = DH1R( 3, 1, 5, 19, 27, 3 );
+   dH[ 4] = DH1R( 4, 3, 0, 20, 28, 4 );
+   dH[ 5] = DH1L( 5, 6, 6, 21, 29, 5 );
+   dH[ 6] = DH1R( 6, 4, 6, 22, 30, 6 );
+   dH[ 7] = DH1R( 7, 11, 2, 23, 31, 7 );
+   dH[ 8] = DH2L( 8, 9, 8, 4, 24, 23, 8 );
+   dH[ 9] = DH2R( 9, 10, 6, 5, 25, 16, 9 );
+   dH[10] = DH2L( 10, 11, 6, 6, 26, 17, 10 );
+   dH[11] = DH2L( 11, 12, 4, 7, 27, 18, 11 );
+   dH[12] = DH2R( 12, 13, 3, 0, 28, 19, 12 );
+   dH[13] = DH2R( 13, 14, 4, 1, 29, 20, 13 );
+   dH[14] = DH2R( 14, 15, 7, 2, 30, 21, 14 );
+   dH[15] = DH2R( 15, 16, 2, 3, 31, 22, 15 );
+
+#undef DH1L
+#undef DH1R
+#undef DH2L
+#undef DH2R
+
+}
+
+static const __m512i final_s16[16] =
+{
+   { 0xaaaaaaa0aaaaaaa0, 0xaaaaaaa0aaaaaaa0, 0xaaaaaaa0aaaaaaa0, 0xaaaaaaa0aaaaaaa0,
+     0xaaaaaaa0aaaaaaa0, 0xaaaaaaa0aaaaaaa0, 0xaaaaaaa0aaaaaaa0, 0xaaaaaaa0aaaaaaa0 },
+   { 0xaaaaaaa1aaaaaaa1, 0xaaaaaaa1aaaaaaa1, 0xaaaaaaa1aaaaaaa1, 0xaaaaaaa1aaaaaaa1,
+     0xaaaaaaa1aaaaaaa1, 0xaaaaaaa1aaaaaaa1, 0xaaaaaaa1aaaaaaa1, 0xaaaaaaa1aaaaaaa1 },
+   { 0xaaaaaaa2aaaaaaa2, 0xaaaaaaa2aaaaaaa2, 0xaaaaaaa2aaaaaaa2, 0xaaaaaaa2aaaaaaa2,
+     0xaaaaaaa2aaaaaaa2, 0xaaaaaaa2aaaaaaa2, 0xaaaaaaa2aaaaaaa2, 0xaaaaaaa2aaaaaaa2 },
+   { 0xaaaaaaa3aaaaaaa3, 0xaaaaaaa3aaaaaaa3, 0xaaaaaaa3aaaaaaa3, 0xaaaaaaa3aaaaaaa3,
+     0xaaaaaaa3aaaaaaa3, 0xaaaaaaa3aaaaaaa3, 0xaaaaaaa3aaaaaaa3, 0xaaaaaaa3aaaaaaa3 },
+   { 0xaaaaaaa4aaaaaaa4, 0xaaaaaaa4aaaaaaa4, 0xaaaaaaa4aaaaaaa4, 0xaaaaaaa4aaaaaaa4,
+     0xaaaaaaa4aaaaaaa4, 0xaaaaaaa4aaaaaaa4, 0xaaaaaaa4aaaaaaa4, 0xaaaaaaa4aaaaaaa4 },
+   { 0xaaaaaaa5aaaaaaa5, 0xaaaaaaa5aaaaaaa5, 0xaaaaaaa5aaaaaaa5, 0xaaaaaaa5aaaaaaa5,
+     0xaaaaaaa5aaaaaaa5, 0xaaaaaaa5aaaaaaa5, 0xaaaaaaa5aaaaaaa5, 0xaaaaaaa5aaaaaaa5 },
+   { 0xaaaaaaa6aaaaaaa6, 0xaaaaaaa6aaaaaaa6, 0xaaaaaaa6aaaaaaa6, 0xaaaaaaa6aaaaaaa6,
+     0xaaaaaaa6aaaaaaa6, 0xaaaaaaa6aaaaaaa6, 0xaaaaaaa6aaaaaaa6, 0xaaaaaaa6aaaaaaa6 },
+   { 0xaaaaaaa7aaaaaaa7, 0xaaaaaaa7aaaaaaa7, 0xaaaaaaa7aaaaaaa7, 0xaaaaaaa7aaaaaaa7,
+     0xaaaaaaa7aaaaaaa7, 0xaaaaaaa7aaaaaaa7, 0xaaaaaaa7aaaaaaa7, 0xaaaaaaa7aaaaaaa7 },
+   { 0xaaaaaaa8aaaaaaa8, 0xaaaaaaa8aaaaaaa8, 0xaaaaaaa8aaaaaaa8, 0xaaaaaaa8aaaaaaa8,
+     0xaaaaaaa8aaaaaaa8, 0xaaaaaaa8aaaaaaa8, 0xaaaaaaa8aaaaaaa8, 0xaaaaaaa8aaaaaaa8 },
+   { 0xaaaaaaa9aaaaaaa9, 0xaaaaaaa9aaaaaaa9, 0xaaaaaaa9aaaaaaa9, 0xaaaaaaa9aaaaaaa9,
+     0xaaaaaaa9aaaaaaa9, 0xaaaaaaa9aaaaaaa9, 0xaaaaaaa9aaaaaaa9, 0xaaaaaaa9aaaaaaa9 },
+   { 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa,
+     0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa },
+   { 0xaaaaaaabaaaaaaab, 0xaaaaaaabaaaaaaab, 0xaaaaaaabaaaaaaab, 0xaaaaaaabaaaaaaab,
+     0xaaaaaaabaaaaaaab, 0xaaaaaaabaaaaaaab, 0xaaaaaaabaaaaaaab, 0xaaaaaaabaaaaaaab },
+   { 0xaaaaaaacaaaaaaac, 0xaaaaaaacaaaaaaac, 0xaaaaaaacaaaaaaac, 0xaaaaaaacaaaaaaac,
+     0xaaaaaaacaaaaaaac, 0xaaaaaaacaaaaaaac, 0xaaaaaaacaaaaaaac, 0xaaaaaaacaaaaaaac },
+   { 0xaaaaaaadaaaaaaad, 0xaaaaaaadaaaaaaad, 0xaaaaaaadaaaaaaad, 0xaaaaaaadaaaaaaad,
+     0xaaaaaaadaaaaaaad, 0xaaaaaaadaaaaaaad, 0xaaaaaaadaaaaaaad, 0xaaaaaaadaaaaaaad },
+   { 0xaaaaaaaeaaaaaaae, 0xaaaaaaaeaaaaaaae, 0xaaaaaaaeaaaaaaae, 0xaaaaaaaeaaaaaaae,
+     0xaaaaaaaeaaaaaaae, 0xaaaaaaaeaaaaaaae, 0xaaaaaaaeaaaaaaae, 0xaaaaaaaeaaaaaaae },
+   { 0xaaaaaaafaaaaaaaf, 0xaaaaaaafaaaaaaaf, 0xaaaaaaafaaaaaaaf, 0xaaaaaaafaaaaaaaf,
+     0xaaaaaaafaaaaaaaf, 0xaaaaaaafaaaaaaaf, 0xaaaaaaafaaaaaaaf, 0xaaaaaaafaaaaaaaf }
+};
+
+void bmw256_16way_init( bmw256_16way_context *ctx )
+{
+   ctx->H[ 0] = m512_const1_64( 0x4041424340414243 );
+   ctx->H[ 1] = m512_const1_64( 0x4445464744454647 );
+   ctx->H[ 2] = m512_const1_64( 0x48494A4B48494A4B );
+   ctx->H[ 3] = m512_const1_64( 0x4C4D4E4F4C4D4E4F );
+   ctx->H[ 4] = m512_const1_64( 0x5051525350515253 );
+   ctx->H[ 5] = m512_const1_64( 0x5455565754555657 );
+   ctx->H[ 6] = m512_const1_64( 0x58595A5B58595A5B );
+   ctx->H[ 7] = m512_const1_64( 0x5C5D5E5F5C5D5E5F );
+   ctx->H[ 8] = m512_const1_64( 0x6061626360616263 );
+   ctx->H[ 9] = m512_const1_64( 0x6465666764656667 );
+   ctx->H[10] = m512_const1_64( 0x68696A6B68696A6B );
+   ctx->H[11] = m512_const1_64( 0x6C6D6E6F6C6D6E6F );
+   ctx->H[12] = m512_const1_64( 0x7071727370717273 );
+   ctx->H[13] = m512_const1_64( 0x7475767774757677 );
+   ctx->H[14] = m512_const1_64( 0x78797A7B78797A7B );
+   ctx->H[15] = m512_const1_64( 0x7C7D7E7F7C7D7E7F );
+   ctx->ptr = 0;
+   ctx->bit_count = 0;
+}
+
+void bmw256_16way_update( bmw256_16way_context *ctx, const void *data,
+                          size_t len )
+{
+   __m512i *vdata = (__m512i*)data;
+   __m512i *buf;
+   __m512i htmp[16];
+   __m512i *h1, *h2;
+   size_t ptr;
+   const int buf_size = 64;  // bytes of one lane, compatible with len
+
+   ctx->bit_count += len << 3;
+   buf = ctx->buf;
+   ptr = ctx->ptr;
+   h1 = ctx->H;
+   h2 = htmp;
+
+   while ( len > 0 )
+   {
+      size_t clen;
+      clen = buf_size - ptr;
+      if ( clen > len )
+         clen = len;
+      memcpy_512( buf + (ptr>>2), vdata, clen >> 2 );
+      vdata = vdata + (clen>>2);
+      len -= clen;
+      ptr += clen;
+      if ( ptr == buf_size )
+      {
+         __m512i *ht;
+         compress_small_16way( buf, h1, h2 );
+         ht = h1;
+         h1 = h2;
+         h2 = ht;
+         ptr = 0;
+      }
+   }
+   ctx->ptr = ptr;
+
+   if ( h1 != ctx->H )
+      memcpy_512( ctx->H, h1, 16 );
+}
+
+void bmw256_16way_close( bmw256_16way_context *ctx, void *dst )
+{
+   __m512i *buf;
+   __m512i h1[16], h2[16], *h;
+   size_t ptr, u, v;
+   const int buf_size = 64;  // bytes of one lane, compatible with len
+
+   buf = ctx->buf;
+   ptr = ctx->ptr;
+   buf[ ptr>>2 ] = m512_const1_64( 0x0000008000000080 );
+   ptr += 4;
+   h = ctx->H;
+
+   if ( ptr > (buf_size - 4) )
+   {
+      memset_zero_512( buf + (ptr>>2), (buf_size - ptr) >> 2 );
+      compress_small_16way( buf, h, h1 );
+      ptr = 0;
+      h = h1;
+   }
+   memset_zero_512( buf + (ptr>>2), (buf_size - 8 - ptr) >> 2 );
+   buf[ (buf_size - 8) >> 2 ] = _mm512_set1_epi32( ctx->bit_count );
+   buf[ (buf_size - 4) >> 2 ] = m512_zero;
+
+   compress_small_16way( buf, h, h2 );
+
+   for ( u = 0; u < 16; u ++ )
+      buf[u] = h2[u];
+
+   compress_small_16way( buf, final_s16, h1 );
+   for (u = 0, v = 16 - 8; u < 8; u ++, v ++)
+      casti_m512i(dst,u) = h1[v];
+}
+
+#endif // AVX512

 #ifdef __cplusplus
 }
 #endif
|
|||||||
@@ -1,34 +1,88 @@
|
|||||||
#include "bmw512-gate.h"
|
#include "bmw512-gate.h"
|
||||||
|
|
||||||
#ifdef BMW512_4WAY
|
|
||||||
|
|
||||||
#include <stdlib.h>
|
#include <stdlib.h>
|
||||||
#include <string.h>
|
#include <string.h>
|
||||||
#include <stdint.h>
|
#include <stdint.h>
|
||||||
//#include "sph_keccak.h"
|
//#include "sph_keccak.h"
|
||||||
#include "bmw-hash-4way.h"
|
#include "bmw-hash-4way.h"
|
||||||
|
|
||||||
|
#if defined(BMW512_8WAY)
|
||||||
|
|
||||||
|
void bmw512hash_8way(void *state, const void *input)
|
||||||
|
{
|
||||||
|
bmw512_8way_context ctx;
|
||||||
|
bmw512_8way_init( &ctx );
|
||||||
|
bmw512_8way_update( &ctx, input, 80 );
|
||||||
|
bmw512_8way_close( &ctx, state );
|
||||||
|
}
|
||||||
|
|
||||||
|
int scanhash_bmw512_8way( struct work *work, uint32_t max_nonce,
|
||||||
|
uint64_t *hashes_done, struct thr_info *mythr )
|
||||||
|
{
|
||||||
|
uint32_t vdata[24*8] __attribute__ ((aligned (128)));
|
||||||
|
uint32_t hash[16*8] __attribute__ ((aligned (64)));
|
||||||
|
uint32_t lane_hash[8] __attribute__ ((aligned (64)));
|
||||||
|
uint32_t *hash7 = &(hash[49]); // 3*16+1
|
||||||
|
uint32_t *pdata = work->data;
|
||||||
|
uint32_t *ptarget = work->target;
|
||||||
|
uint32_t n = pdata[19];
|
||||||
|
const uint32_t first_nonce = pdata[19];
|
||||||
|
const uint32_t last_nonce = max_nonce - 8;
|
||||||
|
__m512i *noncev = (__m512i*)vdata + 9; // aligned
|
||||||
|
const uint32_t Htarg = ptarget[7];
|
||||||
|
int thr_id = mythr->id;
|
||||||
|
|
||||||
|
mm512_bswap32_intrlv80_8x64( vdata, pdata );
|
||||||
|
do {
|
||||||
|
*noncev = mm512_intrlv_blend_32( mm512_bswap_32(
|
||||||
|
_mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0 ,
|
||||||
|
n+3, 0, n+2, 0, n+1, 0, n , 0 ) ), *noncev );
|
||||||
|
|
||||||
|
bmw512hash_8way( hash, vdata );
|
||||||
|
|
||||||
|
for ( int lane = 0; lane < 8; lane++ )
|
||||||
|
if ( unlikely( hash7[ lane<<1 ] <= Htarg ) )
|
||||||
|
{
|
||||||
|
extr_lane_8x64( lane_hash, hash, lane, 256 );
|
||||||
|
if ( fulltest( lane_hash, ptarget ) )
|
||||||
|
{
|
||||||
|
pdata[19] = n + lane;
|
||||||
|
submit_lane_solution( work, lane_hash, mythr, lane );
|
||||||
|
}
|
||||||
|
}
|
||||||
|
n += 8;
|
||||||
|
|
||||||
|
} while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart) );
|
||||||
|
|
||||||
|
*hashes_done = n - first_nonce;
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
#elif defined(BMW512_4WAY)
|
||||||
|
|
||||||
|
//#ifdef BMW512_4WAY
|
||||||
|
|
||||||
 void bmw512hash_4way(void *state, const void *input)
 {
    bmw512_4way_context ctx;
    bmw512_4way_init( &ctx );
-   bmw512_4way( &ctx, input, 80 );
+   bmw512_4way_update( &ctx, input, 80 );
    bmw512_4way_close( &ctx, state );
 }

 int scanhash_bmw512_4way( struct work *work, uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr )
 {
-   uint32_t vdata[24*4] __attribute__ ((aligned (64)));
-   uint32_t hash[16*4] __attribute__ ((aligned (32)));
-   uint32_t lane_hash[8] __attribute__ ((aligned (32)));
+   uint32_t vdata[24*4] __attribute__ ((aligned (128)));
+   uint32_t hash[16*4] __attribute__ ((aligned (64)));
+   uint32_t lane_hash[8] __attribute__ ((aligned (64)));
    uint32_t *hash7 = &(hash[25]);   // 3*8+1
    uint32_t *pdata = work->data;
    uint32_t *ptarget = work->target;
    uint32_t n = pdata[19];
    const uint32_t first_nonce = pdata[19];
+   const uint32_t last_nonce = max_nonce - 4;
    __m256i *noncev = (__m256i*)vdata + 9;   // aligned
-   // const uint32_t Htarg = ptarget[7];
+   const uint32_t Htarg = ptarget[7];
    int thr_id = mythr->id;  // thr_id arg is deprecated

    mm256_bswap32_intrlv80_4x64( vdata, pdata );
@@ -39,7 +93,7 @@ int scanhash_bmw512_4way( struct work *work, uint32_t max_nonce,
       bmw512hash_4way( hash, vdata );

       for ( int lane = 0; lane < 4; lane++ )
-      if ( ( ( hash7[ lane<<1 ] & 0xFFFFFF00 ) == 0 ) )
+      if ( unlikely( hash7[ lane<<1 ] <= Htarg ) )
       {
          extr_lane_4x64( lane_hash, hash, lane, 256 );
          if ( fulltest( lane_hash, ptarget ) )
@@ -50,9 +104,9 @@ int scanhash_bmw512_4way( struct work *work, uint32_t max_nonce,
          }
       }
       n += 4;
-   } while ( (n < max_nonce-4) && !work_restart[thr_id].restart);
+   } while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

-   *hashes_done = n - first_nonce + 1;
+   *hashes_done = n - first_nonce;
    return 0;
 }

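In both scan loops, `hash7[ lane<<1 ]` is the most significant 32-bit word of a lane's hash, and comparing it against `Htarg` is a cheap pre-filter: only lanes that pass are extracted from the interleaved state and run through the full 256-bit `fulltest` comparison. A minimal scalar sketch of that two-stage test, with plain arrays standing in for the interleaved lanes (function names here are illustrative, not the miner's real API):

```c
#include <assert.h>
#include <stdint.h>

// Cheap pre-filter: reject a lane unless the top 32-bit hash word is
// at or below the top 32-bit target word. Only survivors are extracted.
static int prefilter( uint32_t hash_hi, uint32_t htarg )
{
    return hash_hi <= htarg;
}

// Full test: compare the 256-bit hash to the 256-bit target as eight
// 32-bit limbs, most significant limb first.
static int full_test( const uint32_t hash[8], const uint32_t target[8] )
{
    for ( int i = 7; i >= 0; i-- )
    {
        if ( hash[i] < target[i] ) return 1;
        if ( hash[i] > target[i] ) return 0;
    }
    return 1;   // equal hash meets the target
}
```

The pre-filter is why the loop never needs to de-interleave losing lanes: one 32-bit compare per lane rejects almost every nonce before the expensive extraction.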
@@ -1,13 +1,13 @@
 #include "bmw512-gate.h"

-int64_t bmw512_get_max64() { return 0x7ffffLL; }

 bool register_bmw512_algo( algo_gate_t* gate )
 {
-   gate->optimizations = AVX2_OPT;
+   gate->optimizations = AVX2_OPT | AVX512_OPT;
-   gate->get_max64 = (void*)&bmw512_get_max64;
    opt_target_factor = 256.0;
-#if defined (BMW512_4WAY)
+#if defined (BMW512_8WAY)
+   gate->scanhash = (void*)&scanhash_bmw512_8way;
+   gate->hash = (void*)&bmw512hash_8way;
+#elif defined (BMW512_4WAY)
    gate->scanhash = (void*)&scanhash_bmw512_4way;
    gate->hash = (void*)&bmw512hash_4way;
 #else
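The registration above follows cpuminer-opt's usual gate pattern: compile-time CPU-feature tests pick the widest available implementation once, and the rest of the miner calls through the function pointers stored in the gate. A minimal standalone sketch of that dispatch pattern (the struct and all names here are illustrative stand-ins, not the real `algo_gate_t`):

```c
#include <assert.h>

// Illustrative gate: one slot per overridable operation.
typedef struct {
    int (*scanhash)( int x );
    int (*hash)( int x );
} demo_gate_t;

static int hash_4way( int x )     { return x * 4; }       // stands in for bmw512hash_4way
static int scanhash_4way( int x ) { return hash_4way( x ) + 1; }
#if defined(DEMO_8WAY)
static int hash_8way( int x )     { return x * 8; }       // stands in for bmw512hash_8way
static int scanhash_8way( int x ) { return hash_8way( x ) + 1; }
#endif

// Registration selects one implementation at compile time, exactly as
// the #if defined(BMW512_8WAY)/#elif chain does in bmw512-gate.c.
static void register_demo( demo_gate_t *gate )
{
#if defined(DEMO_8WAY)
    gate->scanhash = scanhash_8way;
    gate->hash     = hash_8way;
#else
    gate->scanhash = scanhash_4way;
    gate->hash     = hash_4way;
#endif
}
```

Because the selection happens once at registration, the hot scan loop pays no per-call dispatch cost beyond an indirect call.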
@@ -1,23 +1,33 @@
 #ifndef BMW512_GATE_H__
-#define BMW512_GATE_H__
+#define BMW512_GATE_H__ 1

 #include "algo-gate-api.h"
 #include <stdint.h>

-#if defined(__AVX2__)
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define BMW512_8WAY 1
+#elif defined(__AVX2__)
   #define BMW512_4WAY 1
 #endif

-#if defined(BMW512_4WAY)
+#if defined(BMW512_8WAY)
+
+void bmw512hash_8way( void *state, const void *input );
+int scanhash_bmw512_8way( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr );
+
+#elif defined(BMW512_4WAY)

 void bmw512hash_4way( void *state, const void *input );
 int scanhash_bmw512_4way( struct work *work, uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr );

-#endif
+#else

 void bmw512hash( void *state, const void *input );
 int scanhash_bmw512( struct work *work, uint32_t max_nonce,
                      uint64_t *hashes_done, struct thr_info *mythr );

 #endif
+
+#endif
@@ -60,7 +60,6 @@ static const sph_u64 IV512[] = {

 // BMW-512 2 way 64

 #define s2b0(x) \
    _mm_xor_si128( _mm_xor_si128( _mm_srli_epi64( (x), 1), \
                                  _mm_slli_epi64( (x), 3) ), \
@@ -556,18 +555,15 @@ void bmw512_2way_close( bmw_2way_big_context *ctx, void *dst )
    compress_big_2way( buf, h, h2 );
    memcpy_128( buf, h2, 16 );
    compress_big_2way( buf, final_b2, h1 );
-   memcpy( (__m128i*)dst, h1+16, 8 );
+   memcpy( (__m128i*)dst, h1+8, 8 );
 }

 #endif // __SSE2__

 #if defined(__AVX2__)

 // BMW-512 4 way 64

 #define sb0(x) \
    mm256_xor4( _mm256_srli_epi64( (x), 1), _mm256_slli_epi64( (x), 3), \
                mm256_rol_64( (x), 4), mm256_rol_64( (x),37) )
@@ -636,165 +632,152 @@
          sb4( qt[ (i)- 2 ] ), sb5( qt[ (i)- 1 ] ) ) ), \
       add_elt_b( M, H, (i)-16 ) )

 #define Wb0 \
-   _mm256_add_epi64( \
    _mm256_add_epi64( \
    _mm256_add_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 5], H[ 5] ), \
                         _mm256_xor_si256( M[ 7], H[ 7] ) ), \
       _mm256_xor_si256( M[10], H[10] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) )
+      _mm256_add_epi64( _mm256_xor_si256( M[13], H[13] ), \
+                        _mm256_xor_si256( M[14], H[14] ) ) )

 #define Wb1 \
-   _mm256_sub_epi64( \
    _mm256_add_epi64( \
    _mm256_add_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 6], H[ 6] ), \
                         _mm256_xor_si256( M[ 8], H[ 8] ) ), \
       _mm256_xor_si256( M[11], H[11] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[14], H[14] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )

 #define Wb2 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_add_epi64( \
       _mm256_add_epi64( _mm256_xor_si256( M[ 0], H[ 0] ), \
                         _mm256_xor_si256( M[ 7], H[ 7] ) ), \
       _mm256_xor_si256( M[ 9], H[ 9] ) ), \
-      _mm256_xor_si256( M[12], H[12] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[12], H[12] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )

 #define Wb3 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_add_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 0], H[ 0] ), \
                         _mm256_xor_si256( M[ 1], H[ 1] ) ), \
       _mm256_xor_si256( M[ 8], H[ 8] ) ), \
-      _mm256_xor_si256( M[10], H[10] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[10], H[10] ), \
+                        _mm256_xor_si256( M[13], H[13] ) ) )

 #define Wb4 \
-   _mm256_sub_epi64( \
    _mm256_sub_epi64( \
    _mm256_add_epi64( \
       _mm256_add_epi64( _mm256_xor_si256( M[ 1], H[ 1] ), \
                         _mm256_xor_si256( M[ 2], H[ 2] ) ), \
       _mm256_xor_si256( M[ 9], H[ 9] ) ), \
-      _mm256_xor_si256( M[11], H[11] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) )
+      _mm256_add_epi64( _mm256_xor_si256( M[11], H[11] ), \
+                        _mm256_xor_si256( M[14], H[14] ) ) )

 #define Wb5 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_add_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 3], H[ 3] ), \
                         _mm256_xor_si256( M[ 2], H[ 2] ) ), \
       _mm256_xor_si256( M[10], H[10] ) ), \
-      _mm256_xor_si256( M[12], H[12] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[12], H[12] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )

 #define Wb6 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_sub_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 4], H[ 4] ), \
                         _mm256_xor_si256( M[ 0], H[ 0] ) ), \
       _mm256_xor_si256( M[ 3], H[ 3] ) ), \
-      _mm256_xor_si256( M[11], H[11] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[11], H[11] ), \
+                        _mm256_xor_si256( M[13], H[13] ) ) )

 #define Wb7 \
-   _mm256_sub_epi64( \
    _mm256_sub_epi64( \
    _mm256_sub_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 1], H[ 1] ), \
                         _mm256_xor_si256( M[ 4], H[ 4] ) ), \
       _mm256_xor_si256( M[ 5], H[ 5] ) ), \
-      _mm256_xor_si256( M[12], H[12] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) )
+      _mm256_add_epi64( _mm256_xor_si256( M[12], H[12] ), \
+                        _mm256_xor_si256( M[14], H[14] ) ) )

 #define Wb8 \
-   _mm256_sub_epi64( \
    _mm256_add_epi64( \
    _mm256_sub_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 2], H[ 2] ), \
                         _mm256_xor_si256( M[ 5], H[ 5] ) ), \
       _mm256_xor_si256( M[ 6], H[ 6] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[13], H[13] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )

 #define Wb9 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_add_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 0], H[ 0] ), \
                         _mm256_xor_si256( M[ 3], H[ 3] ) ), \
       _mm256_xor_si256( M[ 6], H[ 6] ) ), \
-      _mm256_xor_si256( M[ 7], H[ 7] ) ), \
-      _mm256_xor_si256( M[14], H[14] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[ 7], H[ 7] ), \
+                        _mm256_xor_si256( M[14], H[14] ) ) )

 #define Wb10 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_sub_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 8], H[ 8] ), \
                         _mm256_xor_si256( M[ 1], H[ 1] ) ), \
       _mm256_xor_si256( M[ 4], H[ 4] ) ), \
-      _mm256_xor_si256( M[ 7], H[ 7] ) ), \
-      _mm256_xor_si256( M[15], H[15] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[ 7], H[ 7] ), \
+                        _mm256_xor_si256( M[15], H[15] ) ) )

 #define Wb11 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_sub_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 8], H[ 8] ), \
                         _mm256_xor_si256( M[ 0], H[ 0] ) ), \
       _mm256_xor_si256( M[ 2], H[ 2] ) ), \
-      _mm256_xor_si256( M[ 5], H[ 5] ) ), \
-      _mm256_xor_si256( M[ 9], H[ 9] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[ 5], H[ 5] ), \
+                        _mm256_xor_si256( M[ 9], H[ 9] ) ) )

 #define Wb12 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_sub_epi64( \
       _mm256_add_epi64( _mm256_xor_si256( M[ 1], H[ 1] ), \
                         _mm256_xor_si256( M[ 3], H[ 3] ) ), \
       _mm256_xor_si256( M[ 6], H[ 6] ) ), \
-      _mm256_xor_si256( M[ 9], H[ 9] ) ), \
-      _mm256_xor_si256( M[10], H[10] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[ 9], H[ 9] ), \
+                        _mm256_xor_si256( M[10], H[10] ) ) )

 #define Wb13 \
-   _mm256_add_epi64( \
    _mm256_add_epi64( \
    _mm256_add_epi64( \
       _mm256_add_epi64( _mm256_xor_si256( M[ 2], H[ 2] ), \
                         _mm256_xor_si256( M[ 4], H[ 4] ) ), \
       _mm256_xor_si256( M[ 7], H[ 7] ) ), \
-      _mm256_xor_si256( M[10], H[10] ) ), \
-      _mm256_xor_si256( M[11], H[11] ) )
+      _mm256_add_epi64( _mm256_xor_si256( M[10], H[10] ), \
+                        _mm256_xor_si256( M[11], H[11] ) ) )

 #define Wb14 \
-   _mm256_sub_epi64( \
    _mm256_sub_epi64( \
    _mm256_add_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[ 3], H[ 3] ), \
                         _mm256_xor_si256( M[ 5], H[ 5] ) ), \
       _mm256_xor_si256( M[ 8], H[ 8] ) ), \
-      _mm256_xor_si256( M[11], H[11] ) ), \
-      _mm256_xor_si256( M[12], H[12] ) )
+      _mm256_add_epi64( _mm256_xor_si256( M[11], H[11] ), \
+                        _mm256_xor_si256( M[12], H[12] ) ) )

 #define Wb15 \
-   _mm256_add_epi64( \
    _mm256_sub_epi64( \
    _mm256_sub_epi64( \
       _mm256_sub_epi64( _mm256_xor_si256( M[12], H[12] ), \
                         _mm256_xor_si256( M[ 4], H[4] ) ), \
       _mm256_xor_si256( M[ 6], H[ 6] ) ), \
-      _mm256_xor_si256( M[ 9], H[ 9] ) ), \
-      _mm256_xor_si256( M[13], H[13] ) )
+      _mm256_sub_epi64( _mm256_xor_si256( M[ 9], H[ 9] ), \
+                        _mm256_xor_si256( M[13], H[13] ) ) )

 void compress_big( const __m256i *M, const __m256i H[16], __m256i dH[16] )
 {
@@ -840,86 +823,56 @@ void compress_big( const __m256i *M, const __m256i H[16], __m256i dH[16] )
       mm256_xor4( qt[24], qt[25], qt[26], qt[27] ),
       mm256_xor4( qt[28], qt[29], qt[30], qt[31] ) ) );

-   dH[ 0] = _mm256_add_epi64(
-            _mm256_xor_si256( M[0],
-               _mm256_xor_si256( _mm256_slli_epi64( xh, 5 ),
-                                 _mm256_srli_epi64( qt[16], 5 ) ) ),
-            _mm256_xor_si256( _mm256_xor_si256( xl, qt[24] ), qt[ 0] ) );
-   dH[ 1] = _mm256_add_epi64(
-            _mm256_xor_si256( M[1],
-               _mm256_xor_si256( _mm256_srli_epi64( xh, 7 ),
-                                 _mm256_slli_epi64( qt[17], 8 ) ) ),
-            _mm256_xor_si256( _mm256_xor_si256( xl, qt[25] ), qt[ 1] ) );
-   dH[ 2] = _mm256_add_epi64(
-            _mm256_xor_si256( M[2],
-               _mm256_xor_si256( _mm256_srli_epi64( xh, 5 ),
-                                 _mm256_slli_epi64( qt[18], 5 ) ) ),
-            _mm256_xor_si256( _mm256_xor_si256( xl, qt[26] ), qt[ 2] ) );
-   dH[ 3] = _mm256_add_epi64(
-            _mm256_xor_si256( M[3],
-               _mm256_xor_si256( _mm256_srli_epi64( xh, 1 ),
-                                 _mm256_slli_epi64( qt[19], 5 ) ) ),
-            _mm256_xor_si256( _mm256_xor_si256( xl, qt[27] ), qt[ 3] ) );
-   dH[ 4] = _mm256_add_epi64(
-            _mm256_xor_si256( M[4],
-               _mm256_xor_si256( _mm256_srli_epi64( xh, 3 ),
-                                 _mm256_slli_epi64( qt[20], 0 ) ) ),
-            _mm256_xor_si256( _mm256_xor_si256( xl, qt[28] ), qt[ 4] ) );
-   dH[ 5] = _mm256_add_epi64(
-            _mm256_xor_si256( M[5],
-               _mm256_xor_si256( _mm256_slli_epi64( xh, 6 ),
-                                 _mm256_srli_epi64( qt[21], 6 ) ) ),
-            _mm256_xor_si256( _mm256_xor_si256( xl, qt[29] ), qt[ 5] ) );
-   dH[ 6] = _mm256_add_epi64(
-            _mm256_xor_si256( M[6],
-               _mm256_xor_si256( _mm256_srli_epi64( xh, 4 ),
-                                 _mm256_slli_epi64( qt[22], 6 ) ) ),
-            _mm256_xor_si256( _mm256_xor_si256( xl, qt[30] ), qt[ 6] ) );
-   dH[ 7] = _mm256_add_epi64(
-            _mm256_xor_si256( M[7],
-               _mm256_xor_si256( _mm256_srli_epi64( xh, 11 ),
-                                 _mm256_slli_epi64( qt[23], 2 ) ) ),
-            _mm256_xor_si256( _mm256_xor_si256( xl, qt[31] ), qt[ 7] ) );
-   dH[ 8] = _mm256_add_epi64( _mm256_add_epi64(
-            mm256_rol_64( dH[4], 9 ),
-            _mm256_xor_si256( _mm256_xor_si256( xh, qt[24] ), M[ 8] )),
-            _mm256_xor_si256( _mm256_slli_epi64( xl, 8 ),
-                              _mm256_xor_si256( qt[23], qt[ 8] ) ) );
-   dH[ 9] = _mm256_add_epi64( _mm256_add_epi64(
-            mm256_rol_64( dH[5], 10 ),
-            _mm256_xor_si256( _mm256_xor_si256( xh, qt[25] ), M[ 9] )),
-            _mm256_xor_si256( _mm256_srli_epi64( xl, 6 ),
-                              _mm256_xor_si256( qt[16], qt[ 9] ) ) );
-   dH[10] = _mm256_add_epi64( _mm256_add_epi64(
-            mm256_rol_64( dH[6], 11 ),
-            _mm256_xor_si256( _mm256_xor_si256( xh, qt[26] ), M[10] )),
-            _mm256_xor_si256( _mm256_slli_epi64( xl, 6 ),
-                              _mm256_xor_si256( qt[17], qt[10] ) ) );
-   dH[11] = _mm256_add_epi64( _mm256_add_epi64(
-            mm256_rol_64( dH[7], 12 ),
-            _mm256_xor_si256( _mm256_xor_si256( xh, qt[27] ), M[11] )),
-            _mm256_xor_si256( _mm256_slli_epi64( xl, 4 ),
-                              _mm256_xor_si256( qt[18], qt[11] ) ) );
-   dH[12] = _mm256_add_epi64( _mm256_add_epi64(
-            mm256_rol_64( dH[0], 13 ),
-            _mm256_xor_si256( _mm256_xor_si256( xh, qt[28] ), M[12] )),
-            _mm256_xor_si256( _mm256_srli_epi64( xl, 3 ),
-                              _mm256_xor_si256( qt[19], qt[12] ) ) );
-   dH[13] = _mm256_add_epi64( _mm256_add_epi64(
-            mm256_rol_64( dH[1], 14 ),
-            _mm256_xor_si256( _mm256_xor_si256( xh, qt[29] ), M[13] )),
-            _mm256_xor_si256( _mm256_srli_epi64( xl, 4 ),
-                              _mm256_xor_si256( qt[20], qt[13] ) ) );
-   dH[14] = _mm256_add_epi64( _mm256_add_epi64(
-            mm256_rol_64( dH[2], 15 ),
-            _mm256_xor_si256( _mm256_xor_si256( xh, qt[30] ), M[14] )),
-            _mm256_xor_si256( _mm256_srli_epi64( xl, 7 ),
-                              _mm256_xor_si256( qt[21], qt[14] ) ) );
-   dH[15] = _mm256_add_epi64( _mm256_add_epi64(
-            mm256_rol_64( dH[3], 16 ),
-            _mm256_xor_si256( _mm256_xor_si256( xh, qt[31] ), M[15] )),
-            _mm256_xor_si256( _mm256_srli_epi64( xl, 2 ),
-                              _mm256_xor_si256( qt[22], qt[15] ) ) );
+#define DH1L( m, sl, sr, a, b, c ) \
+   _mm256_add_epi64( \
+      _mm256_xor_si256( M[m], \
+         _mm256_xor_si256( _mm256_slli_epi64( xh, sl ), \
+                           _mm256_srli_epi64( qt[a], sr ) ) ), \
+      _mm256_xor_si256( _mm256_xor_si256( xl, qt[b] ), qt[c] ) )
+
+#define DH1R( m, sl, sr, a, b, c ) \
+   _mm256_add_epi64( \
+      _mm256_xor_si256( M[m], \
+         _mm256_xor_si256( _mm256_srli_epi64( xh, sl ), \
+                           _mm256_slli_epi64( qt[a], sr ) ) ), \
+      _mm256_xor_si256( _mm256_xor_si256( xl, qt[b] ), qt[c] ) )
+
+#define DH2L( m, rl, sl, h, a, b, c ) \
+   _mm256_add_epi64( _mm256_add_epi64( \
+      mm256_rol_64( dH[h], rl ), \
+      _mm256_xor_si256( _mm256_xor_si256( xh, qt[a] ), M[m] )), \
+      _mm256_xor_si256( _mm256_slli_epi64( xl, sl ), \
+                        _mm256_xor_si256( qt[b], qt[c] ) ) );
+
+#define DH2R( m, rl, sr, h, a, b, c ) \
+   _mm256_add_epi64( _mm256_add_epi64( \
+      mm256_rol_64( dH[h], rl ), \
+      _mm256_xor_si256( _mm256_xor_si256( xh, qt[a] ), M[m] )), \
+      _mm256_xor_si256( _mm256_srli_epi64( xl, sr ), \
+                        _mm256_xor_si256( qt[b], qt[c] ) ) );
+
+   dH[ 0] = DH1L(  0,  5, 5, 16, 24,  0 );
+   dH[ 1] = DH1R(  1,  7, 8, 17, 25,  1 );
+   dH[ 2] = DH1R(  2,  5, 5, 18, 26,  2 );
+   dH[ 3] = DH1R(  3,  1, 5, 19, 27,  3 );
+   dH[ 4] = DH1R(  4,  3, 0, 20, 28,  4 );
+   dH[ 5] = DH1L(  5,  6, 6, 21, 29,  5 );
+   dH[ 6] = DH1R(  6,  4, 6, 22, 30,  6 );
+   dH[ 7] = DH1R(  7, 11, 2, 23, 31,  7 );
+   dH[ 8] = DH2L(  8,  9, 8, 4, 24, 23,  8 );
+   dH[ 9] = DH2R(  9, 10, 6, 5, 25, 16,  9 );
+   dH[10] = DH2L( 10, 11, 6, 6, 26, 17, 10 );
+   dH[11] = DH2L( 11, 12, 4, 7, 27, 18, 11 );
+   dH[12] = DH2R( 12, 13, 3, 0, 28, 19, 12 );
+   dH[13] = DH2R( 13, 14, 4, 1, 29, 20, 13 );
+   dH[14] = DH2R( 14, 15, 7, 2, 30, 21, 14 );
+   dH[15] = DH2R( 15, 16, 2, 3, 31, 22, 15 );
+
+#undef DH1L
+#undef DH1R
+#undef DH2L
+#undef DH2R
 }

 static const __m256i final_b[16] =
@@ -1060,7 +1013,7 @@ bmw512_4way_init(void *cc)
 }

 void
-bmw512_4way(void *cc, const void *data, size_t len)
+bmw512_4way_update(void *cc, const void *data, size_t len)
 {
    bmw64_4way(cc, data, len);
 }
@@ -1079,6 +1032,483 @@ bmw512_4way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)

 #endif // __AVX2__

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+// BMW-512 8 WAY
+
+#define s8b0(x) \
+   mm512_xor4( _mm512_srli_epi64( (x), 1), _mm512_slli_epi64( (x), 3), \
+               mm512_rol_64( (x), 4), mm512_rol_64( (x),37) )
+
+#define s8b1(x) \
+   mm512_xor4( _mm512_srli_epi64( (x), 1), _mm512_slli_epi64( (x), 2), \
+               mm512_rol_64( (x),13), mm512_rol_64( (x),43) )
+
+#define s8b2(x) \
+   mm512_xor4( _mm512_srli_epi64( (x), 2), _mm512_slli_epi64( (x), 1), \
+               mm512_rol_64( (x),19), mm512_rol_64( (x),53) )
+
+#define s8b3(x) \
+   mm512_xor4( _mm512_srli_epi64( (x), 2), _mm512_slli_epi64( (x), 2), \
+               mm512_rol_64( (x),28), mm512_rol_64( (x),59) )
+
+#define s8b4(x) \
+   _mm512_xor_si512( (x), _mm512_srli_epi64( (x), 1 ) )
+
+#define s8b5(x) \
+   _mm512_xor_si512( (x), _mm512_srli_epi64( (x), 2 ) )
+
+#define r8b1(x) mm512_rol_64( x,  5 )
+#define r8b2(x) mm512_rol_64( x, 11 )
+#define r8b3(x) mm512_rol_64( x, 27 )
+#define r8b4(x) mm512_rol_64( x, 32 )
+#define r8b5(x) mm512_rol_64( x, 37 )
+#define r8b6(x) mm512_rol_64( x, 43 )
+#define r8b7(x) mm512_rol_64( x, 53 )
+
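The s8b macros above vectorize BMW-512's five s-functions across eight 64-bit lanes of a ZMM register. As a reference point, a scalar sketch of the same functions on a single `uint64_t` lane, with the shift and rotate constants taken directly from the macros (function names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

// Rotate left; callers always pass 0 < r < 64.
static uint64_t rotl64( uint64_t x, unsigned r )
{
    return ( x << r ) | ( x >> ( 64 - r ) );
}

// Scalar equivalents of s8b0..s8b5: each 512-bit macro applies one of
// these to every 64-bit lane of the vector simultaneously.
static uint64_t s0( uint64_t x ) { return (x >> 1) ^ (x << 3) ^ rotl64(x,  4) ^ rotl64(x, 37); }
static uint64_t s1( uint64_t x ) { return (x >> 1) ^ (x << 2) ^ rotl64(x, 13) ^ rotl64(x, 43); }
static uint64_t s2( uint64_t x ) { return (x >> 2) ^ (x << 1) ^ rotl64(x, 19) ^ rotl64(x, 53); }
static uint64_t s3( uint64_t x ) { return (x >> 2) ^ (x << 2) ^ rotl64(x, 28) ^ rotl64(x, 59); }
static uint64_t s4( uint64_t x ) { return x ^ (x >> 1); }
static uint64_t s5( uint64_t x ) { return x ^ (x >> 2); }
```

The r8b macros are plain lane-wise rotates; only s0..s3 mix shifts and rotates, while s4 and s5 are single-shift diffusers.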
#define rol8w_off_64( M, j, off ) \
|
||||||
|
mm512_rol_64( M[ ( (j) + (off) ) & 0xF ] , \
|
||||||
|
( ( (j) + (off) ) & 0xF ) + 1 )
|
||||||
|
|
||||||
|
#define add_elt_b8( M, H, j ) \
|
||||||
|
_mm512_xor_si512( \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_sub_epi64( _mm512_add_epi64( rol8w_off_64( M, j, 0 ), \
|
||||||
|
rol8w_off_64( M, j, 3 ) ), \
|
||||||
|
rol8w_off_64( M, j, 10 ) ), \
|
||||||
|
_mm512_set1_epi64( ( (j) + 16 ) * 0x0555555555555555ULL ) ), \
|
||||||
|
H[ ( (j)+7 ) & 0xF ] )
|
||||||
|
|
||||||
|
#define expand1b8( qt, M, H, i ) \
|
||||||
|
_mm512_add_epi64( mm512_add4_64( \
|
||||||
|
mm512_add4_64( s8b1( qt[ (i)-16 ] ), s8b2( qt[ (i)-15 ] ), \
|
||||||
|
s8b3( qt[ (i)-14 ] ), s8b0( qt[ (i)-13 ] )), \
|
||||||
|
mm512_add4_64( s8b1( qt[ (i)-12 ] ), s8b2( qt[ (i)-11 ] ), \
|
||||||
|
s8b3( qt[ (i)-10 ] ), s8b0( qt[ (i)- 9 ] )), \
|
||||||
|
mm512_add4_64( s8b1( qt[ (i)- 8 ] ), s8b2( qt[ (i)- 7 ] ), \
|
||||||
|
s8b3( qt[ (i)- 6 ] ), s8b0( qt[ (i)- 5 ] )), \
|
||||||
|
mm512_add4_64( s8b1( qt[ (i)- 4 ] ), s8b2( qt[ (i)- 3 ] ), \
|
||||||
|
s8b3( qt[ (i)- 2 ] ), s8b0( qt[ (i)- 1 ] ) ) ), \
|
||||||
|
add_elt_b8( M, H, (i)-16 ) )
|
||||||
|
|
||||||
|
#define expand2b8( qt, M, H, i) \
|
||||||
|
_mm512_add_epi64( mm512_add4_64( \
|
||||||
|
mm512_add4_64( qt[ (i)-16 ], r8b1( qt[ (i)-15 ] ), \
|
||||||
|
qt[ (i)-14 ], r8b2( qt[ (i)-13 ] ) ), \
|
||||||
|
mm512_add4_64( qt[ (i)-12 ], r8b3( qt[ (i)-11 ] ), \
|
||||||
|
qt[ (i)-10 ], r8b4( qt[ (i)- 9 ] ) ), \
|
||||||
|
mm512_add4_64( qt[ (i)- 8 ], r8b5( qt[ (i)- 7 ] ), \
|
||||||
|
qt[ (i)- 6 ], r8b6( qt[ (i)- 5 ] ) ), \
|
||||||
|
mm512_add4_64( qt[ (i)- 4 ], r8b7( qt[ (i)- 3 ] ), \
|
||||||
|
s8b4( qt[ (i)- 2 ] ), s8b5( qt[ (i)- 1 ] ) ) ), \
|
||||||
|
add_elt_b8( M, H, (i)-16 ) )
|
||||||
|
|
||||||
|
#define W8b0 \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[ 5], H[ 5] ), \
|
||||||
|
_mm512_xor_si512( M[ 7], H[ 7] ) ), \
|
||||||
|
_mm512_xor_si512( M[10], H[10] ) ), \
|
||||||
|
_mm512_add_epi64( _mm512_xor_si512( M[13], H[13] ), \
|
||||||
|
_mm512_xor_si512( M[14], H[14] ) ) )
|
||||||
|
|
||||||
|
#define W8b1 \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[ 6], H[ 6] ), \
|
||||||
|
_mm512_xor_si512( M[ 8], H[ 8] ) ), \
|
||||||
|
_mm512_xor_si512( M[11], H[11] ) ), \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[14], H[14] ), \
|
||||||
|
_mm512_xor_si512( M[15], H[15] ) ) )
|
||||||
|
|
||||||
|
#define W8b2 \
|
||||||
|
_mm512_sub_epi64( \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_add_epi64( _mm512_xor_si512( M[ 0], H[ 0] ), \
|
||||||
|
_mm512_xor_si512( M[ 7], H[ 7] ) ), \
|
||||||
|
_mm512_xor_si512( M[ 9], H[ 9] ) ), \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[12], H[12] ), \
|
||||||
|
_mm512_xor_si512( M[15], H[15] ) ) )
|
||||||
|
|
||||||
|
#define W8b3 \
|
||||||
|
_mm512_sub_epi64( \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[ 0], H[ 0] ), \
|
||||||
|
_mm512_xor_si512( M[ 1], H[ 1] ) ), \
|
||||||
|
_mm512_xor_si512( M[ 8], H[ 8] ) ), \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[10], H[10] ), \
|
||||||
|
_mm512_xor_si512( M[13], H[13] ) ) )
|
||||||
|
|
||||||
|
#define W8b4 \
|
||||||
|
_mm512_sub_epi64( \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_add_epi64( _mm512_xor_si512( M[ 1], H[ 1] ), \
|
||||||
|
_mm512_xor_si512( M[ 2], H[ 2] ) ), \
|
||||||
|
_mm512_xor_si512( M[ 9], H[ 9] ) ), \
|
||||||
|
_mm512_add_epi64( _mm512_xor_si512( M[11], H[11] ), \
|
||||||
|
_mm512_xor_si512( M[14], H[14] ) ) )
|
||||||
|
|
||||||
|
#define W8b5 \
|
||||||
|
_mm512_sub_epi64( \
|
||||||
|
_mm512_add_epi64( \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[ 3], H[ 3] ), \
|
||||||
|
_mm512_xor_si512( M[ 2], H[ 2] ) ), \
|
||||||
|
_mm512_xor_si512( M[10], H[10] ) ), \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[12], H[12] ), \
|
||||||
|
_mm512_xor_si512( M[15], H[15] ) ) )
|
||||||
|
|
||||||
|
#define W8b6 \
|
||||||
|
_mm512_sub_epi64( \
|
||||||
|
_mm512_sub_epi64( \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[ 4], H[ 4] ), \
|
||||||
|
_mm512_xor_si512( M[ 0], H[ 0] ) ), \
|
||||||
|
_mm512_xor_si512( M[ 3], H[ 3] ) ), \
|
||||||
|
_mm512_sub_epi64( _mm512_xor_si512( M[11], H[11] ), \
|
||||||
|
                        _mm512_xor_si512( M[13], H[13] ) ) )

#define W8b7 \
   _mm512_sub_epi64( \
      _mm512_sub_epi64( \
         _mm512_sub_epi64( _mm512_xor_si512( M[ 1], H[ 1] ), \
                           _mm512_xor_si512( M[ 4], H[ 4] ) ), \
         _mm512_xor_si512( M[ 5], H[ 5] ) ), \
      _mm512_add_epi64( _mm512_xor_si512( M[12], H[12] ), \
                        _mm512_xor_si512( M[14], H[14] ) ) )

#define W8b8 \
   _mm512_add_epi64( \
      _mm512_sub_epi64( \
         _mm512_sub_epi64( _mm512_xor_si512( M[ 2], H[ 2] ), \
                           _mm512_xor_si512( M[ 5], H[ 5] ) ), \
         _mm512_xor_si512( M[ 6], H[ 6] ) ), \
      _mm512_sub_epi64( _mm512_xor_si512( M[13], H[13] ), \
                        _mm512_xor_si512( M[15], H[15] ) ) )

#define W8b9 \
   _mm512_sub_epi64( \
      _mm512_add_epi64( \
         _mm512_sub_epi64( _mm512_xor_si512( M[ 0], H[ 0] ), \
                           _mm512_xor_si512( M[ 3], H[ 3] ) ), \
         _mm512_xor_si512( M[ 6], H[ 6] ) ), \
      _mm512_sub_epi64( _mm512_xor_si512( M[ 7], H[ 7] ), \
                        _mm512_xor_si512( M[14], H[14] ) ) )

#define W8b10 \
   _mm512_sub_epi64( \
      _mm512_sub_epi64( \
         _mm512_sub_epi64( _mm512_xor_si512( M[ 8], H[ 8] ), \
                           _mm512_xor_si512( M[ 1], H[ 1] ) ), \
         _mm512_xor_si512( M[ 4], H[ 4] ) ), \
      _mm512_sub_epi64( _mm512_xor_si512( M[ 7], H[ 7] ), \
                        _mm512_xor_si512( M[15], H[15] ) ) )

#define W8b11 \
   _mm512_sub_epi64( \
      _mm512_sub_epi64( \
         _mm512_sub_epi64( _mm512_xor_si512( M[ 8], H[ 8] ), \
                           _mm512_xor_si512( M[ 0], H[ 0] ) ), \
         _mm512_xor_si512( M[ 2], H[ 2] ) ), \
      _mm512_sub_epi64( _mm512_xor_si512( M[ 5], H[ 5] ), \
                        _mm512_xor_si512( M[ 9], H[ 9] ) ) )

#define W8b12 \
   _mm512_sub_epi64( \
      _mm512_sub_epi64( \
         _mm512_add_epi64( _mm512_xor_si512( M[ 1], H[ 1] ), \
                           _mm512_xor_si512( M[ 3], H[ 3] ) ), \
         _mm512_xor_si512( M[ 6], H[ 6] ) ), \
      _mm512_sub_epi64( _mm512_xor_si512( M[ 9], H[ 9] ), \
                        _mm512_xor_si512( M[10], H[10] ) ) )

#define W8b13 \
   _mm512_add_epi64( \
      _mm512_add_epi64( \
         _mm512_add_epi64( _mm512_xor_si512( M[ 2], H[ 2] ), \
                           _mm512_xor_si512( M[ 4], H[ 4] ) ), \
         _mm512_xor_si512( M[ 7], H[ 7] ) ), \
      _mm512_add_epi64( _mm512_xor_si512( M[10], H[10] ), \
                        _mm512_xor_si512( M[11], H[11] ) ) )

#define W8b14 \
   _mm512_sub_epi64( \
      _mm512_add_epi64( \
         _mm512_sub_epi64( _mm512_xor_si512( M[ 3], H[ 3] ), \
                           _mm512_xor_si512( M[ 5], H[ 5] ) ), \
         _mm512_xor_si512( M[ 8], H[ 8] ) ), \
      _mm512_add_epi64( _mm512_xor_si512( M[11], H[11] ), \
                        _mm512_xor_si512( M[12], H[12] ) ) )

#define W8b15 \
   _mm512_sub_epi64( \
      _mm512_sub_epi64( \
         _mm512_sub_epi64( _mm512_xor_si512( M[12], H[12] ), \
                           _mm512_xor_si512( M[ 4], H[ 4] ) ), \
         _mm512_xor_si512( M[ 6], H[ 6] ) ), \
      _mm512_sub_epi64( _mm512_xor_si512( M[ 9], H[ 9] ), \
                        _mm512_xor_si512( M[13], H[13] ) ) )

void compress_big_8way( const __m512i *M, const __m512i H[16],
                        __m512i dH[16] )
{
   __m512i qt[32], xl, xh;

   qt[ 0] = _mm512_add_epi64( s8b0( W8b0 ), H[ 1] );
   qt[ 1] = _mm512_add_epi64( s8b1( W8b1 ), H[ 2] );
   qt[ 2] = _mm512_add_epi64( s8b2( W8b2 ), H[ 3] );
   qt[ 3] = _mm512_add_epi64( s8b3( W8b3 ), H[ 4] );
   qt[ 4] = _mm512_add_epi64( s8b4( W8b4 ), H[ 5] );
   qt[ 5] = _mm512_add_epi64( s8b0( W8b5 ), H[ 6] );
   qt[ 6] = _mm512_add_epi64( s8b1( W8b6 ), H[ 7] );
   qt[ 7] = _mm512_add_epi64( s8b2( W8b7 ), H[ 8] );
   qt[ 8] = _mm512_add_epi64( s8b3( W8b8 ), H[ 9] );
   qt[ 9] = _mm512_add_epi64( s8b4( W8b9 ), H[10] );
   qt[10] = _mm512_add_epi64( s8b0( W8b10), H[11] );
   qt[11] = _mm512_add_epi64( s8b1( W8b11), H[12] );
   qt[12] = _mm512_add_epi64( s8b2( W8b12), H[13] );
   qt[13] = _mm512_add_epi64( s8b3( W8b13), H[14] );
   qt[14] = _mm512_add_epi64( s8b4( W8b14), H[15] );
   qt[15] = _mm512_add_epi64( s8b0( W8b15), H[ 0] );
   qt[16] = expand1b8( qt, M, H, 16 );
   qt[17] = expand1b8( qt, M, H, 17 );
   qt[18] = expand2b8( qt, M, H, 18 );
   qt[19] = expand2b8( qt, M, H, 19 );
   qt[20] = expand2b8( qt, M, H, 20 );
   qt[21] = expand2b8( qt, M, H, 21 );
   qt[22] = expand2b8( qt, M, H, 22 );
   qt[23] = expand2b8( qt, M, H, 23 );
   qt[24] = expand2b8( qt, M, H, 24 );
   qt[25] = expand2b8( qt, M, H, 25 );
   qt[26] = expand2b8( qt, M, H, 26 );
   qt[27] = expand2b8( qt, M, H, 27 );
   qt[28] = expand2b8( qt, M, H, 28 );
   qt[29] = expand2b8( qt, M, H, 29 );
   qt[30] = expand2b8( qt, M, H, 30 );
   qt[31] = expand2b8( qt, M, H, 31 );

   xl = _mm512_xor_si512(
                 mm512_xor4( qt[16], qt[17], qt[18], qt[19] ),
                 mm512_xor4( qt[20], qt[21], qt[22], qt[23] ) );
   xh = _mm512_xor_si512( xl, _mm512_xor_si512(
                 mm512_xor4( qt[24], qt[25], qt[26], qt[27] ),
                 mm512_xor4( qt[28], qt[29], qt[30], qt[31] ) ) );

#define DH1L( m, sl, sr, a, b, c ) \
   _mm512_add_epi64( \
       _mm512_xor_si512( M[m], \
          _mm512_xor_si512( _mm512_slli_epi64( xh, sl ), \
                            _mm512_srli_epi64( qt[a], sr ) ) ), \
       _mm512_xor_si512( _mm512_xor_si512( xl, qt[b] ), qt[c] ) )

#define DH1R( m, sl, sr, a, b, c ) \
   _mm512_add_epi64( \
       _mm512_xor_si512( M[m], \
          _mm512_xor_si512( _mm512_srli_epi64( xh, sl ), \
                            _mm512_slli_epi64( qt[a], sr ) ) ), \
       _mm512_xor_si512( _mm512_xor_si512( xl, qt[b] ), qt[c] ) )

#define DH2L( m, rl, sl, h, a, b, c ) \
   _mm512_add_epi64( _mm512_add_epi64( \
       mm512_rol_64( dH[h], rl ), \
       _mm512_xor_si512( _mm512_xor_si512( xh, qt[a] ), M[m] )), \
       _mm512_xor_si512( _mm512_slli_epi64( xl, sl ), \
                         _mm512_xor_si512( qt[b], qt[c] ) ) );

#define DH2R( m, rl, sr, h, a, b, c ) \
   _mm512_add_epi64( _mm512_add_epi64( \
       mm512_rol_64( dH[h], rl ), \
       _mm512_xor_si512( _mm512_xor_si512( xh, qt[a] ), M[m] )), \
       _mm512_xor_si512( _mm512_srli_epi64( xl, sr ), \
                         _mm512_xor_si512( qt[b], qt[c] ) ) );

   dH[ 0] = DH1L(  0,  5,  5, 16, 24,  0 );
   dH[ 1] = DH1R(  1,  7,  8, 17, 25,  1 );
   dH[ 2] = DH1R(  2,  5,  5, 18, 26,  2 );
   dH[ 3] = DH1R(  3,  1,  5, 19, 27,  3 );
   dH[ 4] = DH1R(  4,  3,  0, 20, 28,  4 );
   dH[ 5] = DH1L(  5,  6,  6, 21, 29,  5 );
   dH[ 6] = DH1R(  6,  4,  6, 22, 30,  6 );
   dH[ 7] = DH1R(  7, 11,  2, 23, 31,  7 );
   dH[ 8] = DH2L(  8,  9,  8,  4, 24, 23,  8 );
   dH[ 9] = DH2R(  9, 10,  6,  5, 25, 16,  9 );
   dH[10] = DH2L( 10, 11,  6,  6, 26, 17, 10 );
   dH[11] = DH2L( 11, 12,  4,  7, 27, 18, 11 );
   dH[12] = DH2R( 12, 13,  3,  0, 28, 19, 12 );
   dH[13] = DH2R( 13, 14,  4,  1, 29, 20, 13 );
   dH[14] = DH2R( 14, 15,  7,  2, 30, 21, 14 );
   dH[15] = DH2R( 15, 16,  2,  3, 31, 22, 15 );

#undef DH1L
#undef DH1R
#undef DH2L
#undef DH2R

}

static const __m512i final_b8[16] =
{
   { 0xaaaaaaaaaaaaaaa0, 0xaaaaaaaaaaaaaaa0, 0xaaaaaaaaaaaaaaa0, 0xaaaaaaaaaaaaaaa0,
     0xaaaaaaaaaaaaaaa0, 0xaaaaaaaaaaaaaaa0, 0xaaaaaaaaaaaaaaa0, 0xaaaaaaaaaaaaaaa0 },
   { 0xaaaaaaaaaaaaaaa1, 0xaaaaaaaaaaaaaaa1, 0xaaaaaaaaaaaaaaa1, 0xaaaaaaaaaaaaaaa1,
     0xaaaaaaaaaaaaaaa1, 0xaaaaaaaaaaaaaaa1, 0xaaaaaaaaaaaaaaa1, 0xaaaaaaaaaaaaaaa1 },
   { 0xaaaaaaaaaaaaaaa2, 0xaaaaaaaaaaaaaaa2, 0xaaaaaaaaaaaaaaa2, 0xaaaaaaaaaaaaaaa2,
     0xaaaaaaaaaaaaaaa2, 0xaaaaaaaaaaaaaaa2, 0xaaaaaaaaaaaaaaa2, 0xaaaaaaaaaaaaaaa2 },
   { 0xaaaaaaaaaaaaaaa3, 0xaaaaaaaaaaaaaaa3, 0xaaaaaaaaaaaaaaa3, 0xaaaaaaaaaaaaaaa3,
     0xaaaaaaaaaaaaaaa3, 0xaaaaaaaaaaaaaaa3, 0xaaaaaaaaaaaaaaa3, 0xaaaaaaaaaaaaaaa3 },
   { 0xaaaaaaaaaaaaaaa4, 0xaaaaaaaaaaaaaaa4, 0xaaaaaaaaaaaaaaa4, 0xaaaaaaaaaaaaaaa4,
     0xaaaaaaaaaaaaaaa4, 0xaaaaaaaaaaaaaaa4, 0xaaaaaaaaaaaaaaa4, 0xaaaaaaaaaaaaaaa4 },
   { 0xaaaaaaaaaaaaaaa5, 0xaaaaaaaaaaaaaaa5, 0xaaaaaaaaaaaaaaa5, 0xaaaaaaaaaaaaaaa5,
     0xaaaaaaaaaaaaaaa5, 0xaaaaaaaaaaaaaaa5, 0xaaaaaaaaaaaaaaa5, 0xaaaaaaaaaaaaaaa5 },
   { 0xaaaaaaaaaaaaaaa6, 0xaaaaaaaaaaaaaaa6, 0xaaaaaaaaaaaaaaa6, 0xaaaaaaaaaaaaaaa6,
     0xaaaaaaaaaaaaaaa6, 0xaaaaaaaaaaaaaaa6, 0xaaaaaaaaaaaaaaa6, 0xaaaaaaaaaaaaaaa6 },
   { 0xaaaaaaaaaaaaaaa7, 0xaaaaaaaaaaaaaaa7, 0xaaaaaaaaaaaaaaa7, 0xaaaaaaaaaaaaaaa7,
     0xaaaaaaaaaaaaaaa7, 0xaaaaaaaaaaaaaaa7, 0xaaaaaaaaaaaaaaa7, 0xaaaaaaaaaaaaaaa7 },
   { 0xaaaaaaaaaaaaaaa8, 0xaaaaaaaaaaaaaaa8, 0xaaaaaaaaaaaaaaa8, 0xaaaaaaaaaaaaaaa8,
     0xaaaaaaaaaaaaaaa8, 0xaaaaaaaaaaaaaaa8, 0xaaaaaaaaaaaaaaa8, 0xaaaaaaaaaaaaaaa8 },
   { 0xaaaaaaaaaaaaaaa9, 0xaaaaaaaaaaaaaaa9, 0xaaaaaaaaaaaaaaa9, 0xaaaaaaaaaaaaaaa9,
     0xaaaaaaaaaaaaaaa9, 0xaaaaaaaaaaaaaaa9, 0xaaaaaaaaaaaaaaa9, 0xaaaaaaaaaaaaaaa9 },
   { 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa,
     0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa },
   { 0xaaaaaaaaaaaaaaab, 0xaaaaaaaaaaaaaaab, 0xaaaaaaaaaaaaaaab, 0xaaaaaaaaaaaaaaab,
     0xaaaaaaaaaaaaaaab, 0xaaaaaaaaaaaaaaab, 0xaaaaaaaaaaaaaaab, 0xaaaaaaaaaaaaaaab },
   { 0xaaaaaaaaaaaaaaac, 0xaaaaaaaaaaaaaaac, 0xaaaaaaaaaaaaaaac, 0xaaaaaaaaaaaaaaac,
     0xaaaaaaaaaaaaaaac, 0xaaaaaaaaaaaaaaac, 0xaaaaaaaaaaaaaaac, 0xaaaaaaaaaaaaaaac },
   { 0xaaaaaaaaaaaaaaad, 0xaaaaaaaaaaaaaaad, 0xaaaaaaaaaaaaaaad, 0xaaaaaaaaaaaaaaad,
     0xaaaaaaaaaaaaaaad, 0xaaaaaaaaaaaaaaad, 0xaaaaaaaaaaaaaaad, 0xaaaaaaaaaaaaaaad },
   { 0xaaaaaaaaaaaaaaae, 0xaaaaaaaaaaaaaaae, 0xaaaaaaaaaaaaaaae, 0xaaaaaaaaaaaaaaae,
     0xaaaaaaaaaaaaaaae, 0xaaaaaaaaaaaaaaae, 0xaaaaaaaaaaaaaaae, 0xaaaaaaaaaaaaaaae },
   { 0xaaaaaaaaaaaaaaaf, 0xaaaaaaaaaaaaaaaf, 0xaaaaaaaaaaaaaaaf, 0xaaaaaaaaaaaaaaaf,
     0xaaaaaaaaaaaaaaaf, 0xaaaaaaaaaaaaaaaf, 0xaaaaaaaaaaaaaaaf, 0xaaaaaaaaaaaaaaaf }
};

void bmw512_8way_init( bmw512_8way_context *ctx )
//bmw64_4way_init( bmw_4way_big_context *sc, const sph_u64 *iv )
{
   ctx->H[ 0] = m512_const1_64( 0x8081828384858687 );
   ctx->H[ 1] = m512_const1_64( 0x88898A8B8C8D8E8F );
   ctx->H[ 2] = m512_const1_64( 0x9091929394959697 );
   ctx->H[ 3] = m512_const1_64( 0x98999A9B9C9D9E9F );
   ctx->H[ 4] = m512_const1_64( 0xA0A1A2A3A4A5A6A7 );
   ctx->H[ 5] = m512_const1_64( 0xA8A9AAABACADAEAF );
   ctx->H[ 6] = m512_const1_64( 0xB0B1B2B3B4B5B6B7 );
   ctx->H[ 7] = m512_const1_64( 0xB8B9BABBBCBDBEBF );
   ctx->H[ 8] = m512_const1_64( 0xC0C1C2C3C4C5C6C7 );
   ctx->H[ 9] = m512_const1_64( 0xC8C9CACBCCCDCECF );
   ctx->H[10] = m512_const1_64( 0xD0D1D2D3D4D5D6D7 );
   ctx->H[11] = m512_const1_64( 0xD8D9DADBDCDDDEDF );
   ctx->H[12] = m512_const1_64( 0xE0E1E2E3E4E5E6E7 );
   ctx->H[13] = m512_const1_64( 0xE8E9EAEBECEDEEEF );
   ctx->H[14] = m512_const1_64( 0xF0F1F2F3F4F5F6F7 );
   ctx->H[15] = m512_const1_64( 0xF8F9FAFBFCFDFEFF );
   ctx->ptr = 0;
   ctx->bit_count = 0;
}

void bmw512_8way_update( bmw512_8way_context *ctx, const void *data,
                         size_t len )
{
   __m512i *vdata = (__m512i*)data;
   __m512i *buf;
   __m512i htmp[16];
   __m512i *h1, *h2;
   size_t ptr;
   const int buf_size = 128;  // bytes of one lane, compatible with len

   ctx->bit_count += len << 3;
   buf = ctx->buf;
   ptr = ctx->ptr;
   h1 = ctx->H;
   h2 = htmp;
   while ( len > 0 )
   {
      size_t clen;
      clen = buf_size - ptr;
      if ( clen > len )
         clen = len;
      memcpy_512( buf + (ptr>>3), vdata, clen >> 3 );
      vdata = vdata + (clen>>3);
      len -= clen;
      ptr += clen;
      if ( ptr == buf_size )
      {
         __m512i *ht;
         compress_big_8way( buf, h1, h2 );
         ht = h1;
         h1 = h2;
         h2 = ht;
         ptr = 0;
      }
   }
   ctx->ptr = ptr;
   if ( h1 != ctx->H )
      memcpy_512( ctx->H, h1, 16 );
}

void bmw512_8way_close( bmw512_8way_context *ctx, void *dst )
{
   __m512i *buf;
   __m512i h1[16], h2[16], *h;
   size_t ptr, u, v;
   const int buf_size = 128;  // bytes of one lane, compatible with len

   buf = ctx->buf;
   ptr = ctx->ptr;
   buf[ ptr>>3 ] = m512_const1_64( 0x80 );
   ptr += 8;
   h = ctx->H;

   if ( ptr > (buf_size - 8) )
   {
      memset_zero_512( buf + (ptr>>3), (buf_size - ptr) >> 3 );
      compress_big_8way( buf, h, h1 );
      ptr = 0;
      h = h1;
   }
   memset_zero_512( buf + (ptr>>3), (buf_size - 8 - ptr) >> 3 );
   buf[ (buf_size - 8) >> 3 ] = _mm512_set1_epi64( ctx->bit_count );
   compress_big_8way( buf, h, h2 );
   for ( u = 0; u < 16; u ++ )
      buf[ u ] = h2[ u ];
   compress_big_8way( buf, final_b8, h1 );
   for ( u = 0, v = 8; u < 8; u ++, v ++ )
      casti_m512i( dst, u ) = h1[ v ];
}

#endif // AVX512

#ifdef __cplusplus
}
#endif
@@ -1,519 +0,0 @@
/* $Id: bmw.c 227 2010-06-16 17:28:38Z tp $ */
/*
 * BMW implementation.
 *
 * ==========================(LICENSE BEGIN)============================
 *
 * Copyright (c) 2007-2010  Projet RNRT SAPHIR
 *
 * Permission is hereby granted, free of charge, to any person obtaining
 * a copy of this software and associated documentation files (the
 * "Software"), to deal in the Software without restriction, including
 * without limitation the rights to use, copy, modify, merge, publish,
 * distribute, sublicense, and/or sell copies of the Software, and to
 * permit persons to whom the Software is furnished to do so, subject to
 * the following conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
 * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
 * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
 * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 *
 * ===========================(LICENSE END)=============================
 *
 * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
 */

#include <stddef.h>
#include <string.h>
#include <limits.h>

#ifdef __cplusplus
extern "C"{
#endif

#include "../sph_bmw.h"

#ifdef _MSC_VER
#pragma warning (disable: 4146)
#endif

static const sph_u64 bmwIV512[] = {
   SPH_C64(0x8081828384858687), SPH_C64(0x88898A8B8C8D8E8F),
   SPH_C64(0x9091929394959697), SPH_C64(0x98999A9B9C9D9E9F),
   SPH_C64(0xA0A1A2A3A4A5A6A7), SPH_C64(0xA8A9AAABACADAEAF),
   SPH_C64(0xB0B1B2B3B4B5B6B7), SPH_C64(0xB8B9BABBBCBDBEBF),
   SPH_C64(0xC0C1C2C3C4C5C6C7), SPH_C64(0xC8C9CACBCCCDCECF),
   SPH_C64(0xD0D1D2D3D4D5D6D7), SPH_C64(0xD8D9DADBDCDDDEDF),
   SPH_C64(0xE0E1E2E3E4E5E6E7), SPH_C64(0xE8E9EAEBECEDEEEF),
   SPH_C64(0xF0F1F2F3F4F5F6F7), SPH_C64(0xF8F9FAFBFCFDFEFF)
};

#define XCAT(x, y)   XCAT_(x, y)
#define XCAT_(x, y)  x ## y

#define LPAR   (

#define I16_16    0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15
#define I16_17    1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16
#define I16_18    2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17
#define I16_19    3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18
#define I16_20    4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
#define I16_21    5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
#define I16_22    6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
#define I16_23    7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
#define I16_24    8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
#define I16_25    9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
#define I16_26   10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
#define I16_27   11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
#define I16_28   12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27
#define I16_29   13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28
#define I16_30   14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29
#define I16_31   15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30

#define M16_16    0,  1,  3,  4,  7, 10, 11
#define M16_17    1,  2,  4,  5,  8, 11, 12
#define M16_18    2,  3,  5,  6,  9, 12, 13
#define M16_19    3,  4,  6,  7, 10, 13, 14
#define M16_20    4,  5,  7,  8, 11, 14, 15
#define M16_21    5,  6,  8,  9, 12, 15, 16
#define M16_22    6,  7,  9, 10, 13,  0,  1
#define M16_23    7,  8, 10, 11, 14,  1,  2
#define M16_24    8,  9, 11, 12, 15,  2,  3
#define M16_25    9, 10, 12, 13,  0,  3,  4
#define M16_26   10, 11, 13, 14,  1,  4,  5
#define M16_27   11, 12, 14, 15,  2,  5,  6
#define M16_28   12, 13, 15, 16,  3,  6,  7
#define M16_29   13, 14,  0,  1,  4,  7,  8
#define M16_30   14, 15,  1,  2,  5,  8,  9
#define M16_31   15, 16,  2,  3,  6,  9, 10

#define ss0(x)   (((x) >> 1) ^ SPH_T32((x) << 3) \
                 ^ SPH_ROTL32(x,  4) ^ SPH_ROTL32(x, 19))
#define ss1(x)   (((x) >> 1) ^ SPH_T32((x) << 2) \
                 ^ SPH_ROTL32(x,  8) ^ SPH_ROTL32(x, 23))
#define ss2(x)   (((x) >> 2) ^ SPH_T32((x) << 1) \
                 ^ SPH_ROTL32(x, 12) ^ SPH_ROTL32(x, 25))
#define ss3(x)   (((x) >> 2) ^ SPH_T32((x) << 2) \
                 ^ SPH_ROTL32(x, 15) ^ SPH_ROTL32(x, 29))
#define ss4(x)   (((x) >> 1) ^ (x))
#define ss5(x)   (((x) >> 2) ^ (x))
#define rs1(x)   SPH_ROTL32(x,  3)
#define rs2(x)   SPH_ROTL32(x,  7)
#define rs3(x)   SPH_ROTL32(x, 13)
#define rs4(x)   SPH_ROTL32(x, 16)
#define rs5(x)   SPH_ROTL32(x, 19)
#define rs6(x)   SPH_ROTL32(x, 23)
#define rs7(x)   SPH_ROTL32(x, 27)

#define Ks(j)   SPH_T32((sph_u32)(j) * SPH_C32(0x05555555))

#define add_elt_s(mf, hf, j0m, j1m, j3m, j4m, j7m, j10m, j11m, j16) \
   (SPH_T32(SPH_ROTL32(mf(j0m), j1m) + SPH_ROTL32(mf(j3m), j4m) \
      - SPH_ROTL32(mf(j10m), j11m) + Ks(j16)) ^ hf(j7m))

#define expand1s_inner(qf, mf, hf, i16, \
      i0, i1, i2, i3, i4, i5, i6, i7, i8, \
      i9, i10, i11, i12, i13, i14, i15, \
      i0m, i1m, i3m, i4m, i7m, i10m, i11m) \
   SPH_T32(ss1(qf(i0)) + ss2(qf(i1)) + ss3(qf(i2)) + ss0(qf(i3)) \
      + ss1(qf(i4)) + ss2(qf(i5)) + ss3(qf(i6)) + ss0(qf(i7)) \
      + ss1(qf(i8)) + ss2(qf(i9)) + ss3(qf(i10)) + ss0(qf(i11)) \
      + ss1(qf(i12)) + ss2(qf(i13)) + ss3(qf(i14)) + ss0(qf(i15)) \
      + add_elt_s(mf, hf, i0m, i1m, i3m, i4m, i7m, i10m, i11m, i16))

#define expand1s(qf, mf, hf, i16) \
   expand1s_(qf, mf, hf, i16, I16_ ## i16, M16_ ## i16)
#define expand1s_(qf, mf, hf, i16, ix, iy) \
   expand1s_inner LPAR qf, mf, hf, i16, ix, iy)

#define expand2s_inner(qf, mf, hf, i16, \
      i0, i1, i2, i3, i4, i5, i6, i7, i8, \
      i9, i10, i11, i12, i13, i14, i15, \
      i0m, i1m, i3m, i4m, i7m, i10m, i11m) \
   SPH_T32(qf(i0) + rs1(qf(i1)) + qf(i2) + rs2(qf(i3)) \
      + qf(i4) + rs3(qf(i5)) + qf(i6) + rs4(qf(i7)) \
      + qf(i8) + rs5(qf(i9)) + qf(i10) + rs6(qf(i11)) \
      + qf(i12) + rs7(qf(i13)) + ss4(qf(i14)) + ss5(qf(i15)) \
      + add_elt_s(mf, hf, i0m, i1m, i3m, i4m, i7m, i10m, i11m, i16))

#define expand2s(qf, mf, hf, i16) \
   expand2s_(qf, mf, hf, i16, I16_ ## i16, M16_ ## i16)
#define expand2s_(qf, mf, hf, i16, ix, iy) \
   expand2s_inner LPAR qf, mf, hf, i16, ix, iy)

#if SPH_64

#define sb0(x)   (((x) >> 1) ^ SPH_T64((x) << 3) \
                 ^ SPH_ROTL64(x,  4) ^ SPH_ROTL64(x, 37))
#define sb1(x)   (((x) >> 1) ^ SPH_T64((x) << 2) \
                 ^ SPH_ROTL64(x, 13) ^ SPH_ROTL64(x, 43))
#define sb2(x)   (((x) >> 2) ^ SPH_T64((x) << 1) \
                 ^ SPH_ROTL64(x, 19) ^ SPH_ROTL64(x, 53))
#define sb3(x)   (((x) >> 2) ^ SPH_T64((x) << 2) \
                 ^ SPH_ROTL64(x, 28) ^ SPH_ROTL64(x, 59))
#define sb4(x)   (((x) >> 1) ^ (x))
#define sb5(x)   (((x) >> 2) ^ (x))
#define rb1(x)   SPH_ROTL64(x,  5)
#define rb2(x)   SPH_ROTL64(x, 11)
#define rb3(x)   SPH_ROTL64(x, 27)
#define rb4(x)   SPH_ROTL64(x, 32)
#define rb5(x)   SPH_ROTL64(x, 37)
#define rb6(x)   SPH_ROTL64(x, 43)
#define rb7(x)   SPH_ROTL64(x, 53)

#define Kb(j)   SPH_T64((sph_u64)(j) * SPH_C64(0x0555555555555555))

#if 0

static const sph_u64 Kb_tab[] = {
   Kb(16), Kb(17), Kb(18), Kb(19), Kb(20), Kb(21), Kb(22), Kb(23),
   Kb(24), Kb(25), Kb(26), Kb(27), Kb(28), Kb(29), Kb(30), Kb(31)
};

#define rol_off(mf, j, off) \
   SPH_ROTL64(mf(((j) + (off)) & 15), (((j) + (off)) & 15) + 1)

#define add_elt_b(mf, hf, j) \
   (SPH_T64(rol_off(mf, j, 0) + rol_off(mf, j, 3) \
      - rol_off(mf, j, 10) + Kb_tab[j]) ^ hf(((j) + 7) & 15))

#define expand1b(qf, mf, hf, i) \
   SPH_T64(sb1(qf((i) - 16)) + sb2(qf((i) - 15)) \
      + sb3(qf((i) - 14)) + sb0(qf((i) - 13)) \
      + sb1(qf((i) - 12)) + sb2(qf((i) - 11)) \
      + sb3(qf((i) - 10)) + sb0(qf((i) - 9)) \
      + sb1(qf((i) - 8)) + sb2(qf((i) - 7)) \
      + sb3(qf((i) - 6)) + sb0(qf((i) - 5)) \
      + sb1(qf((i) - 4)) + sb2(qf((i) - 3)) \
      + sb3(qf((i) - 2)) + sb0(qf((i) - 1)) \
      + add_elt_b(mf, hf, (i) - 16))

#define expand2b(qf, mf, hf, i) \
   SPH_T64(qf((i) - 16) + rb1(qf((i) - 15)) \
      + qf((i) - 14) + rb2(qf((i) - 13)) \
      + qf((i) - 12) + rb3(qf((i) - 11)) \
      + qf((i) - 10) + rb4(qf((i) - 9)) \
      + qf((i) - 8) + rb5(qf((i) - 7)) \
      + qf((i) - 6) + rb6(qf((i) - 5)) \
      + qf((i) - 4) + rb7(qf((i) - 3)) \
      + sb4(qf((i) - 2)) + sb5(qf((i) - 1)) \
      + add_elt_b(mf, hf, (i) - 16))

#else

#define add_elt_b(mf, hf, j0m, j1m, j3m, j4m, j7m, j10m, j11m, j16) \
   (SPH_T64(SPH_ROTL64(mf(j0m), j1m) + SPH_ROTL64(mf(j3m), j4m) \
      - SPH_ROTL64(mf(j10m), j11m) + Kb(j16)) ^ hf(j7m))

#define expand1b_inner(qf, mf, hf, i16, \
      i0, i1, i2, i3, i4, i5, i6, i7, i8, \
      i9, i10, i11, i12, i13, i14, i15, \
      i0m, i1m, i3m, i4m, i7m, i10m, i11m) \
   SPH_T64(sb1(qf(i0)) + sb2(qf(i1)) + sb3(qf(i2)) + sb0(qf(i3)) \
      + sb1(qf(i4)) + sb2(qf(i5)) + sb3(qf(i6)) + sb0(qf(i7)) \
      + sb1(qf(i8)) + sb2(qf(i9)) + sb3(qf(i10)) + sb0(qf(i11)) \
      + sb1(qf(i12)) + sb2(qf(i13)) + sb3(qf(i14)) + sb0(qf(i15)) \
      + add_elt_b(mf, hf, i0m, i1m, i3m, i4m, i7m, i10m, i11m, i16))

#define expand1b(qf, mf, hf, i16) \
   expand1b_(qf, mf, hf, i16, I16_ ## i16, M16_ ## i16)
#define expand1b_(qf, mf, hf, i16, ix, iy) \
   expand1b_inner LPAR qf, mf, hf, i16, ix, iy)

#define expand2b_inner(qf, mf, hf, i16, \
      i0, i1, i2, i3, i4, i5, i6, i7, i8, \
      i9, i10, i11, i12, i13, i14, i15, \
      i0m, i1m, i3m, i4m, i7m, i10m, i11m) \
   SPH_T64(qf(i0) + rb1(qf(i1)) + qf(i2) + rb2(qf(i3)) \
      + qf(i4) + rb3(qf(i5)) + qf(i6) + rb4(qf(i7)) \
      + qf(i8) + rb5(qf(i9)) + qf(i10) + rb6(qf(i11)) \
      + qf(i12) + rb7(qf(i13)) + sb4(qf(i14)) + sb5(qf(i15)) \
      + add_elt_b(mf, hf, i0m, i1m, i3m, i4m, i7m, i10m, i11m, i16))

#define expand2b(qf, mf, hf, i16) \
   expand2b_(qf, mf, hf, i16, I16_ ## i16, M16_ ## i16)
#define expand2b_(qf, mf, hf, i16, ix, iy) \
   expand2b_inner LPAR qf, mf, hf, i16, ix, iy)

#endif

#endif

#define MAKE_W(tt, i0, op01, i1, op12, i2, op23, i3, op34, i4) \
   tt((M(i0) ^ H(i0)) op01 (M(i1) ^ H(i1)) op12 (M(i2) ^ H(i2)) \
      op23 (M(i3) ^ H(i3)) op34 (M(i4) ^ H(i4)))

#define Ws0    MAKE_W(SPH_T32,  5, -,  7, +, 10, +, 13, +, 14)
#define Ws1    MAKE_W(SPH_T32,  6, -,  8, +, 11, +, 14, -, 15)
#define Ws2    MAKE_W(SPH_T32,  0, +,  7, +,  9, -, 12, +, 15)
#define Ws3    MAKE_W(SPH_T32,  0, -,  1, +,  8, -, 10, +, 13)
#define Ws4    MAKE_W(SPH_T32,  1, +,  2, +,  9, -, 11, -, 14)
#define Ws5    MAKE_W(SPH_T32,  3, -,  2, +, 10, -, 12, +, 15)
#define Ws6    MAKE_W(SPH_T32,  4, -,  0, -,  3, -, 11, +, 13)
#define Ws7    MAKE_W(SPH_T32,  1, -,  4, -,  5, -, 12, -, 14)
#define Ws8    MAKE_W(SPH_T32,  2, -,  5, -,  6, +, 13, -, 15)
#define Ws9    MAKE_W(SPH_T32,  0, -,  3, +,  6, -,  7, +, 14)
#define Ws10   MAKE_W(SPH_T32,  8, -,  1, -,  4, -,  7, +, 15)
#define Ws11   MAKE_W(SPH_T32,  8, -,  0, -,  2, -,  5, +,  9)
#define Ws12   MAKE_W(SPH_T32,  1, +,  3, -,  6, -,  9, +, 10)
#define Ws13   MAKE_W(SPH_T32,  2, +,  4, +,  7, +, 10, +, 11)
#define Ws14   MAKE_W(SPH_T32,  3, -,  5, +,  8, -, 11, -, 12)
#define Ws15   MAKE_W(SPH_T32, 12, -,  4, -,  6, -,  9, +, 13)

#define MAKE_Qas   do { \
      qt[ 0] = SPH_T32(ss0(Ws0 ) + H( 1)); \
      qt[ 1] = SPH_T32(ss1(Ws1 ) + H( 2)); \
      qt[ 2] = SPH_T32(ss2(Ws2 ) + H( 3)); \
      qt[ 3] = SPH_T32(ss3(Ws3 ) + H( 4)); \
      qt[ 4] = SPH_T32(ss4(Ws4 ) + H( 5)); \
      qt[ 5] = SPH_T32(ss0(Ws5 ) + H( 6)); \
      qt[ 6] = SPH_T32(ss1(Ws6 ) + H( 7)); \
      qt[ 7] = SPH_T32(ss2(Ws7 ) + H( 8)); \
      qt[ 8] = SPH_T32(ss3(Ws8 ) + H( 9)); \
      qt[ 9] = SPH_T32(ss4(Ws9 ) + H(10)); \
      qt[10] = SPH_T32(ss0(Ws10) + H(11)); \
      qt[11] = SPH_T32(ss1(Ws11) + H(12)); \
      qt[12] = SPH_T32(ss2(Ws12) + H(13)); \
      qt[13] = SPH_T32(ss3(Ws13) + H(14)); \
      qt[14] = SPH_T32(ss4(Ws14) + H(15)); \
      qt[15] = SPH_T32(ss0(Ws15) + H( 0)); \
   } while (0)

#define MAKE_Qbs   do { \
      qt[16] = expand1s(Qs, M, H, 16); \
      qt[17] = expand1s(Qs, M, H, 17); \
      qt[18] = expand2s(Qs, M, H, 18); \
      qt[19] = expand2s(Qs, M, H, 19); \
      qt[20] = expand2s(Qs, M, H, 20); \
      qt[21] = expand2s(Qs, M, H, 21); \
      qt[22] = expand2s(Qs, M, H, 22); \
      qt[23] = expand2s(Qs, M, H, 23); \
      qt[24] = expand2s(Qs, M, H, 24); \
      qt[25] = expand2s(Qs, M, H, 25); \
      qt[26] = expand2s(Qs, M, H, 26); \
      qt[27] = expand2s(Qs, M, H, 27); \
      qt[28] = expand2s(Qs, M, H, 28); \
      qt[29] = expand2s(Qs, M, H, 29); \
      qt[30] = expand2s(Qs, M, H, 30); \
      qt[31] = expand2s(Qs, M, H, 31); \
   } while (0)

#define MAKE_Qs   do { \
      MAKE_Qas; \
      MAKE_Qbs; \
   } while (0)

#define Qs(j)   (qt[j])

#define Wb0    MAKE_W(SPH_T64,  5, -,  7, +, 10, +, 13, +, 14)
#define Wb1    MAKE_W(SPH_T64,  6, -,  8, +, 11, +, 14, -, 15)
#define Wb2    MAKE_W(SPH_T64,  0, +,  7, +,  9, -, 12, +, 15)
#define Wb3    MAKE_W(SPH_T64,  0, -,  1, +,  8, -, 10, +, 13)
#define Wb4    MAKE_W(SPH_T64,  1, +,  2, +,  9, -, 11, -, 14)
#define Wb5    MAKE_W(SPH_T64,  3, -,  2, +, 10, -, 12, +, 15)
#define Wb6    MAKE_W(SPH_T64,  4, -,  0, -,  3, -, 11, +, 13)
#define Wb7    MAKE_W(SPH_T64,  1, -,  4, -,  5, -, 12, -, 14)
#define Wb8    MAKE_W(SPH_T64,  2, -,  5, -,  6, +, 13, -, 15)
#define Wb9    MAKE_W(SPH_T64,  0, -,  3, +,  6, -,  7, +, 14)
#define Wb10   MAKE_W(SPH_T64,  8, -,  1, -,  4, -,  7, +, 15)
#define Wb11   MAKE_W(SPH_T64,  8, -,  0, -,  2, -,  5, +,  9)
#define Wb12   MAKE_W(SPH_T64,  1, +,  3, -,  6, -,  9, +, 10)
#define Wb13   MAKE_W(SPH_T64,  2, +,  4, +,  7, +, 10, +, 11)
#define Wb14   MAKE_W(SPH_T64,  3, -,  5, +,  8, -, 11, -, 12)
#define Wb15   MAKE_W(SPH_T64, 12, -,  4, -,  6, -,  9, +, 13)

#define MAKE_Qab   do { \
      qt[ 0] = SPH_T64(sb0(Wb0 ) + H( 1)); \
      qt[ 1] = SPH_T64(sb1(Wb1 ) + H( 2)); \
      qt[ 2] = SPH_T64(sb2(Wb2 ) + H( 3)); \
      qt[ 3] = SPH_T64(sb3(Wb3 ) + H( 4)); \
      qt[ 4] = SPH_T64(sb4(Wb4 ) + H( 5)); \
      qt[ 5] = SPH_T64(sb0(Wb5 ) + H( 6)); \
      qt[ 6] = SPH_T64(sb1(Wb6 ) + H( 7)); \
      qt[ 7] = SPH_T64(sb2(Wb7 ) + H( 8)); \
      qt[ 8] = SPH_T64(sb3(Wb8 ) + H( 9)); \
      qt[ 9] = SPH_T64(sb4(Wb9 ) + H(10)); \
      qt[10] = SPH_T64(sb0(Wb10) + H(11)); \
      qt[11] = SPH_T64(sb1(Wb11) + H(12)); \
      qt[12] = SPH_T64(sb2(Wb12) + H(13)); \
      qt[13] = SPH_T64(sb3(Wb13) + H(14)); \
      qt[14] = SPH_T64(sb4(Wb14) + H(15)); \
      qt[15] = SPH_T64(sb0(Wb15) + H( 0)); \
   } while (0)

#define MAKE_Qbb   do { \
      qt[16] = expand1b(Qb, M, H, 16); \
      qt[17] = expand1b(Qb, M, H, 17); \
      qt[18] = expand2b(Qb, M, H, 18); \
      qt[19] = expand2b(Qb, M, H, 19); \
      qt[20] = expand2b(Qb, M, H, 20); \
      qt[21] = expand2b(Qb, M, H, 21); \
      qt[22] = expand2b(Qb, M, H, 22); \
      qt[23] = expand2b(Qb, M, H, 23); \
      qt[24] = expand2b(Qb, M, H, 24); \
      qt[25] = expand2b(Qb, M, H, 25); \
      qt[26] = expand2b(Qb, M, H, 26); \
      qt[27] = expand2b(Qb, M, H, 27); \
      qt[28] = expand2b(Qb, M, H, 28); \
      qt[29] = expand2b(Qb, M, H, 29); \
      qt[30] = expand2b(Qb, M, H, 30); \
      qt[31] = expand2b(Qb, M, H, 31); \
   } while (0)

#define MAKE_Qb   do { \
      MAKE_Qab; \
      MAKE_Qbb; \
   } while (0)

#define Qb(j)   (qt[j])

#define FOLD(type, mkQ, tt, rol, mf, qf, dhf)   do { \
      type qt[32], xl, xh; \
      mkQ; \
      xl = qf(16) ^ qf(17) ^ qf(18) ^ qf(19) \
         ^ qf(20) ^ qf(21) ^ qf(22) ^ qf(23); \
      xh = xl ^ qf(24) ^ qf(25) ^ qf(26) ^ qf(27) \
         ^ qf(28) ^ qf(29) ^ qf(30) ^ qf(31); \
      dhf( 0) = tt(((xh <<  5) ^ (qf(16) >>  5) ^ mf( 0)) \
         + (xl ^ qf(24) ^ qf( 0))); \
      dhf( 1) = tt(((xh >>  7) ^ (qf(17) <<  8) ^ mf( 1)) \
         + (xl ^ qf(25) ^ qf( 1))); \
      dhf( 2) = tt(((xh >>  5) ^ (qf(18) <<  5) ^ mf( 2)) \
         + (xl ^ qf(26) ^ qf( 2))); \
      dhf( 3) = tt(((xh >>  1) ^ (qf(19) <<  5) ^ mf( 3)) \
         + (xl ^ qf(27) ^ qf( 3))); \
      dhf( 4) = tt(((xh >>  3) ^ (qf(20) <<  0) ^ mf( 4)) \
         + (xl ^ qf(28) ^ qf( 4))); \
      dhf( 5) = tt(((xh <<  6) ^ (qf(21) >>  6) ^ mf( 5)) \
         + (xl ^ qf(29) ^ qf( 5))); \
      dhf( 6) = tt(((xh >>  4) ^ (qf(22) <<  6) ^ mf( 6)) \
         + (xl ^ qf(30) ^ qf( 6))); \
      dhf( 7) = tt(((xh >> 11) ^ (qf(23) <<  2) ^ mf( 7)) \
         + (xl ^ qf(31) ^ qf( 7))); \
      dhf( 8) = tt(rol(dhf(4),  9) + (xh ^ qf(24) ^ mf( 8)) \
         + ((xl <<  8) ^ qf(23) ^ qf( 8))); \
      dhf( 9) = tt(rol(dhf(5), 10) + (xh ^ qf(25) ^ mf( 9)) \
         + ((xl >>  6) ^ qf(16) ^ qf( 9))); \
      dhf(10) = tt(rol(dhf(6), 11) + (xh ^ qf(26) ^ mf(10)) \
         + ((xl <<  6) ^ qf(17) ^ qf(10))); \
      dhf(11) = tt(rol(dhf(7), 12) + (xh ^ qf(27) ^ mf(11)) \
         + ((xl <<  4) ^ qf(18) ^ qf(11))); \
      dhf(12) = tt(rol(dhf(0), 13) + (xh ^ qf(28) ^ mf(12)) \
         + ((xl >>  3) ^ qf(19) ^ qf(12))); \
      dhf(13) = tt(rol(dhf(1), 14) + (xh ^ qf(29) ^ mf(13)) \
         + ((xl >>  4) ^ qf(20) ^ qf(13))); \
      dhf(14) = tt(rol(dhf(2), 15) + (xh ^ qf(30) ^ mf(14)) \
         + ((xl >>  7) ^ qf(21) ^ qf(14))); \
      dhf(15) = tt(rol(dhf(3), 16) + (xh ^ qf(31) ^ mf(15)) \
         + ((xl >>  2) ^ qf(22) ^ qf(15))); \
   } while (0)

#define FOLDs   FOLD(sph_u32, MAKE_Qs, SPH_T32, SPH_ROTL32, M, Qs, dH)

#define FOLDb   FOLD(sph_u64, MAKE_Qb, SPH_T64, SPH_ROTL64, M, Qb, dH)

#define DECL_BMW \
   sph_u64 bmwH[16]; \

/* load initial constants */
#define BMW_I \
do { \
   memcpy(bmwH, bmwIV512, sizeof bmwH); \
   hashptr = 0; \
   hashctA = 0; \
} while (0)

/* load hash for loop */
#define BMW_U \
do { \
   const void *data = hash; \
   size_t len = 64; \
   unsigned char *buf; \
\
   hashctA += (sph_u64)len << 3; \
   buf = hashbuf; \
   memcpy(buf, data, 64); \
   hashptr = 64; \
||||||
} while (0)
|
|
||||||
|
|
||||||
|
|
||||||
/* bmw512 hash loaded */
|
|
||||||
/* hash = blake512(loaded) */
|
|
||||||
#define BMW_C \
|
|
||||||
do { \
|
|
||||||
void *dst = hash; \
|
|
||||||
size_t out_size_w64 = 8; \
|
|
||||||
unsigned char *data; \
|
|
||||||
sph_u64 *dh; \
|
|
||||||
unsigned char *out; \
|
|
||||||
size_t ptr, u, v; \
|
|
||||||
unsigned z; \
|
|
||||||
sph_u64 h1[16], h2[16], *h; \
|
|
||||||
data = hashbuf; \
|
|
||||||
ptr = hashptr; \
|
|
||||||
z = 0x80 >> 0; \
|
|
||||||
data[ptr ++] = ((0 & -z) | z) & 0xFF; \
|
|
||||||
memset(data + ptr, 0, (sizeof(char)*128) - 8 - ptr); \
|
|
||||||
sph_enc64le_aligned(data + (sizeof(char)*128) - 8, \
|
|
||||||
SPH_T64(hashctA + 0)); \
|
|
||||||
/* for break loop */ \
|
|
||||||
/* one copy of inline FOLD */ \
|
|
||||||
/* FOLD uses, */ \
|
|
||||||
/* uint64 *h, data */ \
|
|
||||||
/* uint64 dh, state */ \
|
|
||||||
h = bmwH; \
|
|
||||||
dh = h2; \
|
|
||||||
for (;;) { \
|
|
||||||
FOLDb; \
|
|
||||||
/* dh gets changed for 2nd run */ \
|
|
||||||
if (dh == h1) break; \
|
|
||||||
for (u = 0; u < 16; u ++) \
|
|
||||||
sph_enc64le_aligned(data + 8 * u, h2[u]); \
|
|
||||||
dh = h1; \
|
|
||||||
h = (sph_u64*)final_b; \
|
|
||||||
} \
|
|
||||||
/* end wrapped for break loop */ \
|
|
||||||
out = dst; \
|
|
||||||
for (u = 0, v = 16 - out_size_w64; u < out_size_w64; u ++, v ++) \
|
|
||||||
sph_enc64le(out + 8 * u, h1[v]); \
|
|
||||||
} while (0)
|
|
||||||
|
|
||||||
/*
|
|
||||||
static void
|
|
||||||
compress_big(const unsigned char *data, const sph_u64 h[16], sph_u64 dh[16])
|
|
||||||
{
|
|
||||||
|
|
||||||
#define M(x) sph_dec64le_aligned(data + 8 * (x))
|
|
||||||
#define H(x) (h[x])
|
|
||||||
#define dH(x) (dh[x])
|
|
||||||
|
|
||||||
FOLDb;
|
|
||||||
|
|
||||||
#undef M
|
|
||||||
#undef H
|
|
||||||
#undef dH
|
|
||||||
}
|
|
||||||
*/
|
|
||||||
|
|
||||||
static const sph_u64 final_b[16] = {
|
|
||||||
SPH_C64(0xaaaaaaaaaaaaaaa0), SPH_C64(0xaaaaaaaaaaaaaaa1),
|
|
||||||
SPH_C64(0xaaaaaaaaaaaaaaa2), SPH_C64(0xaaaaaaaaaaaaaaa3),
|
|
||||||
SPH_C64(0xaaaaaaaaaaaaaaa4), SPH_C64(0xaaaaaaaaaaaaaaa5),
|
|
||||||
SPH_C64(0xaaaaaaaaaaaaaaa6), SPH_C64(0xaaaaaaaaaaaaaaa7),
|
|
||||||
SPH_C64(0xaaaaaaaaaaaaaaa8), SPH_C64(0xaaaaaaaaaaaaaaa9),
|
|
||||||
SPH_C64(0xaaaaaaaaaaaaaaaa), SPH_C64(0xaaaaaaaaaaaaaaab),
|
|
||||||
SPH_C64(0xaaaaaaaaaaaaaaac), SPH_C64(0xaaaaaaaaaaaaaaad),
|
|
||||||
SPH_C64(0xaaaaaaaaaaaaaaae), SPH_C64(0xaaaaaaaaaaaaaaaf)
|
|
||||||
};
|
|
||||||
|
|
||||||
|
|
||||||
#ifdef __cplusplus
|
|
||||||
}
|
|
||||||
#endif
|
|
||||||
@@ -1,61 +0,0 @@
-/* $Id: sph_bmw.h 216 2010-06-08 09:46:57Z tp $ */
-/**
- * BMW interface. BMW (aka "Blue Midnight Wish") is a family of
- * functions which differ by their output size; this implementation
- * defines BMW for output sizes 224, 256, 384 and 512 bits.
- *
- * ==========================(LICENSE BEGIN)============================
- *
- * Copyright (c) 2007-2010 Projet RNRT SAPHIR
- *
- * Permission is hereby granted, free of charge, to any person obtaining
- * a copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice shall be
- * included in all copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- * ===========================(LICENSE END)=============================
- *
- * @file sph_bmw.h
- * @author Thomas Pornin <thomas.pornin@cryptolog.com>
- */
-
-#ifndef SPH_BMW_H__
-#define SPH_BMW_H__
-
-#ifdef __cplusplus
-extern "C"{
-#endif
-
-#include <stddef.h>
-#include "sph_types.h"
-
-#define SPH_SIZE_bmw512 512
-
-typedef struct {
-#ifndef DOXYGEN_IGNORE
-   sph_u64 bmwH[16];
-#endif
-} sph_bmw_big_context;
-
-typedef sph_bmw_big_context sph_bmw512_context;
-
-
-#ifdef __cplusplus
-}
-#endif
-
-#endif
@@ -363,7 +363,6 @@ bool register_cryptolight_algo( algo_gate_t* gate )
 gate->scanhash = (void*)&scanhash_cryptolight;
 gate->hash = (void*)&cryptolight_hash;
 gate->hash_suw = (void*)&cryptolight_hash;
-gate->get_max64 = (void*)&get_max64_0x40LL;
 return true;
 };

@@ -111,7 +111,6 @@ bool register_cryptonight_algo( algo_gate_t* gate )
 gate->scanhash = (void*)&scanhash_cryptonight;
 gate->hash = (void*)&cryptonight_hash;
 gate->hash_suw = (void*)&cryptonight_hash_suw;
-gate->get_max64 = (void*)&get_max64_0x40LL;
 return true;
 };

@@ -123,7 +122,6 @@ bool register_cryptonightv7_algo( algo_gate_t* gate )
 gate->scanhash = (void*)&scanhash_cryptonight;
 gate->hash = (void*)&cryptonight_hash;
 gate->hash_suw = (void*)&cryptonight_hash_suw;
-gate->get_max64 = (void*)&get_max64_0x40LL;
 return true;
 };

@@ -7,7 +7,7 @@
 // 2x128

-/*
 // The result of hashing 10 rounds of initial data which consists of params
 // zero padded.
 static const uint64_t IV256[] =
@@ -25,7 +25,247 @@ static const uint64_t IV512[] =
 0x148FE485FCD398D9, 0xB64445321B017BEF, 0x2FF5781C6A536159, 0x0DBADEA991FA7934,
 0xA5A70E75D65C8A2B, 0xBC796576B1C62456, 0xE7989AF11921C8F7, 0xD43E3B447795D246
 };
-*/
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+// 4 way 128 is handy to avoid reinterleaving in many algos.
+// If reinterleaving is necessary it may be more efficient to use
+// 2 way 256. The same transform code should work for both.
+
+static void transform_4way( cube_4way_context *sp )
+{
+   int r;
+   const int rounds = sp->rounds;
+
+   __m512i x0, x1, x2, x3, x4, x5, x6, x7, y0, y1;
+
+   x0 = _mm512_load_si512( (__m512i*)sp->h );
+   x1 = _mm512_load_si512( (__m512i*)sp->h + 1 );
+   x2 = _mm512_load_si512( (__m512i*)sp->h + 2 );
+   x3 = _mm512_load_si512( (__m512i*)sp->h + 3 );
+   x4 = _mm512_load_si512( (__m512i*)sp->h + 4 );
+   x5 = _mm512_load_si512( (__m512i*)sp->h + 5 );
+   x6 = _mm512_load_si512( (__m512i*)sp->h + 6 );
+   x7 = _mm512_load_si512( (__m512i*)sp->h + 7 );
+
+   for ( r = 0; r < rounds; ++r )
+   {
+      x4 = _mm512_add_epi32( x0, x4 );
+      x5 = _mm512_add_epi32( x1, x5 );
+      x6 = _mm512_add_epi32( x2, x6 );
+      x7 = _mm512_add_epi32( x3, x7 );
+      y0 = x0;
+      y1 = x1;
+      x0 = mm512_rol_32( x2, 7 );
+      x1 = mm512_rol_32( x3, 7 );
+      x2 = mm512_rol_32( y0, 7 );
+      x3 = mm512_rol_32( y1, 7 );
+      x0 = _mm512_xor_si512( x0, x4 );
+      x1 = _mm512_xor_si512( x1, x5 );
+      x2 = _mm512_xor_si512( x2, x6 );
+      x3 = _mm512_xor_si512( x3, x7 );
+      x4 = mm512_swap128_64( x4 );
+      x5 = mm512_swap128_64( x5 );
+      x6 = mm512_swap128_64( x6 );
+      x7 = mm512_swap128_64( x7 );
+      x4 = _mm512_add_epi32( x0, x4 );
+      x5 = _mm512_add_epi32( x1, x5 );
+      x6 = _mm512_add_epi32( x2, x6 );
+      x7 = _mm512_add_epi32( x3, x7 );
+      y0 = x0;
+      y1 = x2;
+      x0 = mm512_rol_32( x1, 11 );
+      x1 = mm512_rol_32( y0, 11 );
+      x2 = mm512_rol_32( x3, 11 );
+      x3 = mm512_rol_32( y1, 11 );
+      x0 = _mm512_xor_si512( x0, x4 );
+      x1 = _mm512_xor_si512( x1, x5 );
+      x2 = _mm512_xor_si512( x2, x6 );
+      x3 = _mm512_xor_si512( x3, x7 );
+      x4 = mm512_swap64_32( x4 );
+      x5 = mm512_swap64_32( x5 );
+      x6 = mm512_swap64_32( x6 );
+      x7 = mm512_swap64_32( x7 );
+   }
+
+   _mm512_store_si512( (__m512i*)sp->h, x0 );
+   _mm512_store_si512( (__m512i*)sp->h + 1, x1 );
+   _mm512_store_si512( (__m512i*)sp->h + 2, x2 );
+   _mm512_store_si512( (__m512i*)sp->h + 3, x3 );
+   _mm512_store_si512( (__m512i*)sp->h + 4, x4 );
+   _mm512_store_si512( (__m512i*)sp->h + 5, x5 );
+   _mm512_store_si512( (__m512i*)sp->h + 6, x6 );
+   _mm512_store_si512( (__m512i*)sp->h + 7, x7 );
+}
+
+int cube_4way_init( cube_4way_context *sp, int hashbitlen, int rounds,
+                    int blockbytes )
+{
+   __m512i *h = (__m512i*)sp->h;
+   __m128i *iv = (__m128i*)( hashbitlen == 512 ? (__m128i*)IV512
+                                               : (__m128i*)IV256 );
+   sp->hashlen = hashbitlen/128;
+   sp->blocksize = blockbytes/16;
+   sp->rounds = rounds;
+   sp->pos = 0;
+
+   h[ 0] = m512_const1_128( iv[0] );
+   h[ 1] = m512_const1_128( iv[1] );
+   h[ 2] = m512_const1_128( iv[2] );
+   h[ 3] = m512_const1_128( iv[3] );
+   h[ 4] = m512_const1_128( iv[4] );
+   h[ 5] = m512_const1_128( iv[5] );
+   h[ 6] = m512_const1_128( iv[6] );
+   h[ 7] = m512_const1_128( iv[7] );
+   h[ 0] = m512_const1_128( iv[0] );
+   h[ 1] = m512_const1_128( iv[1] );
+   h[ 2] = m512_const1_128( iv[2] );
+   h[ 3] = m512_const1_128( iv[3] );
+   h[ 4] = m512_const1_128( iv[4] );
+   h[ 5] = m512_const1_128( iv[5] );
+   h[ 6] = m512_const1_128( iv[6] );
+   h[ 7] = m512_const1_128( iv[7] );
+
+   return 0;
+}
+
+int cube_4way_update( cube_4way_context *sp, const void *data, size_t size )
+{
+   const int len = size >> 4;
+   const __m512i *in = (__m512i*)data;
+   int i;
+
+   for ( i = 0; i < len; i++ )
+   {
+      sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ], in[i] );
+      sp->pos++;
+      if ( sp->pos == sp->blocksize )
+      {
+         transform_4way( sp );
+         sp->pos = 0;
+      }
+   }
+   return 0;
+}
+
+int cube_4way_close( cube_4way_context *sp, void *output )
+{
+   __m512i *hash = (__m512i*)output;
+   int i;
+
+   // pos is zero for 64 byte data, 1 for 80 byte data.
+   sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
+                                 m512_const2_64( 0, 0x0000000000000080 ) );
+   transform_4way( sp );
+
+   sp->h[7] = _mm512_xor_si512( sp->h[7],
+                                 m512_const2_64( 0x0000000100000000, 0 ) );
+
+   for ( i = 0; i < 10; ++i )
+      transform_4way( sp );
+
+   memcpy( hash, sp->h, sp->hashlen<<6 );
+   return 0;
+}
+
+int cube_4way_full( cube_4way_context *sp, void *output, int hashbitlen,
+                    const void *data, size_t size )
+{
+   __m512i *h = (__m512i*)sp->h;
+   __m128i *iv = (__m128i*)( hashbitlen == 512 ? (__m128i*)IV512
+                                               : (__m128i*)IV256 );
+   sp->hashlen = hashbitlen/128;
+   sp->blocksize = 32/16;
+   sp->rounds = 16;
+   sp->pos = 0;
+
+   h[ 0] = m512_const1_128( iv[0] );
+   h[ 1] = m512_const1_128( iv[1] );
+   h[ 2] = m512_const1_128( iv[2] );
+   h[ 3] = m512_const1_128( iv[3] );
+   h[ 4] = m512_const1_128( iv[4] );
+   h[ 5] = m512_const1_128( iv[5] );
+   h[ 6] = m512_const1_128( iv[6] );
+   h[ 7] = m512_const1_128( iv[7] );
+   h[ 0] = m512_const1_128( iv[0] );
+   h[ 1] = m512_const1_128( iv[1] );
+   h[ 2] = m512_const1_128( iv[2] );
+   h[ 3] = m512_const1_128( iv[3] );
+   h[ 4] = m512_const1_128( iv[4] );
+   h[ 5] = m512_const1_128( iv[5] );
+   h[ 6] = m512_const1_128( iv[6] );
+   h[ 7] = m512_const1_128( iv[7] );
+
+   const int len = size >> 4;
+   const __m512i *in = (__m512i*)data;
+   __m512i *hash = (__m512i*)output;
+   int i;
+
+   for ( i = 0; i < len; i++ )
+   {
+      sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ], in[i] );
+      sp->pos++;
+      if ( sp->pos == sp->blocksize )
+      {
+         transform_4way( sp );
+         sp->pos = 0;
+      }
+   }
+
+   // pos is zero for 64 byte data, 1 for 80 byte data.
+   sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
+                                 m512_const2_64( 0, 0x0000000000000080 ) );
+   transform_4way( sp );
+
+   sp->h[7] = _mm512_xor_si512( sp->h[7],
+                                 m512_const2_64( 0x0000000100000000, 0 ) );
+
+   for ( i = 0; i < 10; ++i )
+      transform_4way( sp );
+
+   memcpy( hash, sp->h, sp->hashlen<<6);
+   return 0;
+}
+
+
+int cube_4way_update_close( cube_4way_context *sp, void *output,
+                            const void *data, size_t size )
+{
+   const int len = size >> 4;
+   const __m512i *in = (__m512i*)data;
+   __m512i *hash = (__m512i*)output;
+   int i;
+
+   for ( i = 0; i < len; i++ )
+   {
+      sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ], in[i] );
+      sp->pos++;
+      if ( sp->pos == sp->blocksize )
+      {
+         transform_4way( sp );
+         sp->pos = 0;
+      }
+   }
+
+   // pos is zero for 64 byte data, 1 for 80 byte data.
+   sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
+                                 m512_const2_64( 0, 0x0000000000000080 ) );
+   transform_4way( sp );
+
+   sp->h[7] = _mm512_xor_si512( sp->h[7],
+                                 m512_const2_64( 0x0000000100000000, 0 ) );
+
+   for ( i = 0; i < 10; ++i )
+      transform_4way( sp );
+
+   memcpy( hash, sp->h, sp->hashlen<<6);
+   return 0;
+}
+
+
+#endif // AVX512
+
+// 2 way 128
+
 static void transform_2way( cube_2way_context *sp )
 {
@@ -59,10 +299,10 @@ static void transform_2way( cube_2way_context *sp )
 x1 = _mm256_xor_si256( x1, x5 );
 x2 = _mm256_xor_si256( x2, x6 );
 x3 = _mm256_xor_si256( x3, x7 );
-x4 = mm256_swap64_128( x4 );
-x5 = mm256_swap64_128( x5 );
-x6 = mm256_swap64_128( x6 );
-x7 = mm256_swap64_128( x7 );
+x4 = mm256_swap128_64( x4 );
+x5 = mm256_swap128_64( x5 );
+x6 = mm256_swap128_64( x6 );
+x7 = mm256_swap128_64( x7 );
 x4 = _mm256_add_epi32( x0, x4 );
 x5 = _mm256_add_epi32( x1, x5 );
 x6 = _mm256_add_epi32( x2, x6 );
@@ -77,10 +317,10 @@ static void transform_2way( cube_2way_context *sp )
 x1 = _mm256_xor_si256( x1, x5 );
 x2 = _mm256_xor_si256( x2, x6 );
 x3 = _mm256_xor_si256( x3, x7 );
-x4 = mm256_swap32_64( x4 );
-x5 = mm256_swap32_64( x5 );
-x6 = mm256_swap32_64( x6 );
-x7 = mm256_swap32_64( x7 );
+x4 = mm256_swap64_32( x4 );
+x5 = mm256_swap64_32( x5 );
+x6 = mm256_swap64_32( x6 );
+x7 = mm256_swap64_32( x7 );
 }

 _mm256_store_si256( (__m256i*)sp->h, x0 );
@@ -91,45 +331,35 @@ static void transform_2way( cube_2way_context *sp )
 _mm256_store_si256( (__m256i*)sp->h + 5, x5 );
 _mm256_store_si256( (__m256i*)sp->h + 6, x6 );
 _mm256_store_si256( (__m256i*)sp->h + 7, x7 );

 }

 int cube_2way_init( cube_2way_context *sp, int hashbitlen, int rounds,
                     int blockbytes )
 {
-   __m128i* h = (__m128i*)sp->h;
+   __m256i *h = (__m256i*)sp->h;
+   __m128i *iv = (__m128i*)( hashbitlen == 512 ? (__m128i*)IV512
+                                               : (__m128i*)IV256 );
    sp->hashlen = hashbitlen/128;
    sp->blocksize = blockbytes/16;
    sp->rounds = rounds;
    sp->pos = 0;

-   if ( hashbitlen == 512 )
-   {
-      h[ 0] = m128_const_64( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
-      h[ 2] = m128_const_64( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
-      h[ 4] = m128_const_64( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
-      h[ 6] = m128_const_64( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
-      h[ 8] = m128_const_64( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
-      h[10] = m128_const_64( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
-      h[12] = m128_const_64( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
-      h[14] = m128_const_64( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
-      h[1] = h[ 0]; h[ 3] = h[ 2]; h[ 5] = h[ 4]; h[ 7] = h[ 6];
-      h[9] = h[ 8]; h[11] = h[10]; h[13] = h[12]; h[15] = h[14];
-   }
-   else
-   {
-      h[ 0] = m128_const_64( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
-      h[ 2] = m128_const_64( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
-      h[ 4] = m128_const_64( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
-      h[ 6] = m128_const_64( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
-      h[ 8] = m128_const_64( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
-      h[10] = m128_const_64( 0x93CB628565C892FD, 0x5FA2560309392549 );
-      h[12] = m128_const_64( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
-      h[14] = m128_const_64( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
-      h[1] = h[ 0]; h[ 3] = h[ 2]; h[ 5] = h[ 4]; h[ 7] = h[ 6];
-      h[9] = h[ 8]; h[11] = h[10]; h[13] = h[12]; h[15] = h[14];
-   }
+   h[ 0] = m256_const1_128( iv[0] );
+   h[ 1] = m256_const1_128( iv[1] );
+   h[ 2] = m256_const1_128( iv[2] );
+   h[ 3] = m256_const1_128( iv[3] );
+   h[ 4] = m256_const1_128( iv[4] );
+   h[ 5] = m256_const1_128( iv[5] );
+   h[ 6] = m256_const1_128( iv[6] );
+   h[ 7] = m256_const1_128( iv[7] );
+   h[ 0] = m256_const1_128( iv[0] );
+   h[ 1] = m256_const1_128( iv[1] );
+   h[ 2] = m256_const1_128( iv[2] );
+   h[ 3] = m256_const1_128( iv[3] );
+   h[ 4] = m256_const1_128( iv[4] );
+   h[ 5] = m256_const1_128( iv[5] );
+   h[ 6] = m256_const1_128( iv[6] );
+   h[ 7] = m256_const1_128( iv[7] );

    return 0;
 }
@@ -141,9 +371,6 @@ int cube_2way_update( cube_2way_context *sp, const void *data, size_t size )
 const __m256i *in = (__m256i*)data;
 int i;

-// It is assumed data is aligned to 256 bits and is a multiple of 128 bits.
-// Current usage sata is either 64 or 80 bytes.
-
 for ( i = 0; i < len; i++ )
 {
    sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ], in[i] );
@@ -164,11 +391,11 @@ int cube_2way_close( cube_2way_context *sp, void *output )

 // pos is zero for 64 byte data, 1 for 80 byte data.
 sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                          _mm256_set_epi32( 0,0,0,0x80, 0,0,0,0x80 ) );
+                          m256_const2_64( 0, 0x0000000000000080 ) );
 transform_2way( sp );

 sp->h[7] = _mm256_xor_si256( sp->h[7],
-                          _mm256_set_epi32( 1,0,0,0, 1,0,0,0 ) );
+                          m256_const2_64( 0x0000000100000000, 0 ) );

 for ( i = 0; i < 10; ++i ) transform_2way( sp );

@@ -197,11 +424,69 @@ int cube_2way_update_close( cube_2way_context *sp, void *output,

 // pos is zero for 64 byte data, 1 for 80 byte data.
 sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
-                          _mm256_set_epi32( 0,0,0,0x80, 0,0,0,0x80 ) );
+                          m256_const2_64( 0, 0x0000000000000080 ) );
 transform_2way( sp );

-sp->h[7] = _mm256_xor_si256( sp->h[7], _mm256_set_epi32( 1,0,0,0,
-                                                         1,0,0,0 ) );
+sp->h[7] = _mm256_xor_si256( sp->h[7],
+                          m256_const2_64( 0x0000000100000000, 0 ) );
+
+for ( i = 0; i < 10; ++i ) transform_2way( sp );
+
+memcpy( hash, sp->h, sp->hashlen<<5 );
+return 0;
+}
+
+int cube_2way_full( cube_2way_context *sp, void *output, int hashbitlen,
+                    const void *data, size_t size )
+{
+   __m256i *h = (__m256i*)sp->h;
+   __m128i *iv = (__m128i*)( hashbitlen == 512 ? (__m128i*)IV512
+                                               : (__m128i*)IV256 );
+   sp->hashlen = hashbitlen/128;
+   sp->blocksize = 32/16;
+   sp->rounds = 16;
+   sp->pos = 0;
+
+   h[ 0] = m256_const1_128( iv[0] );
+   h[ 1] = m256_const1_128( iv[1] );
+   h[ 2] = m256_const1_128( iv[2] );
+   h[ 3] = m256_const1_128( iv[3] );
+   h[ 4] = m256_const1_128( iv[4] );
+   h[ 5] = m256_const1_128( iv[5] );
+   h[ 6] = m256_const1_128( iv[6] );
+   h[ 7] = m256_const1_128( iv[7] );
+   h[ 0] = m256_const1_128( iv[0] );
+   h[ 1] = m256_const1_128( iv[1] );
+   h[ 2] = m256_const1_128( iv[2] );
+   h[ 3] = m256_const1_128( iv[3] );
+   h[ 4] = m256_const1_128( iv[4] );
+   h[ 5] = m256_const1_128( iv[5] );
+   h[ 6] = m256_const1_128( iv[6] );
+   h[ 7] = m256_const1_128( iv[7] );
+
+   const int len = size >> 4;
+   const __m256i *in = (__m256i*)data;
+   __m256i *hash = (__m256i*)output;
+   int i;
+
+   for ( i = 0; i < len; i++ )
+   {
+      sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ], in[i] );
+      sp->pos++;
+      if ( sp->pos == sp->blocksize )
+      {
+         transform_2way( sp );
+         sp->pos = 0;
+      }
+   }
+
+   // pos is zero for 64 byte data, 1 for 80 byte data.
+   sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
+                             m256_const2_64( 0, 0x0000000000000080 ) );
+   transform_2way( sp );
+
+   sp->h[7] = _mm256_xor_si256( sp->h[7],
+                             m256_const2_64( 0x0000000100000000, 0 ) );
+
 for ( i = 0; i < 10; ++i ) transform_2way( sp );

@@ -1,11 +1,35 @@
 #ifndef CUBE_HASH_2WAY_H__
-#define CUBE_HASH_2WAY_H__
+#define CUBE_HASH_2WAY_H__ 1

-#if defined(__AVX2__)

 #include <stdint.h>
 #include "simd-utils.h"

+#if defined(__AVX2__)
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+struct _cube_4way_context
+{
+   __m512i h[8];
+   int hashlen;
+   int rounds;
+   int blocksize;
+   int pos;
+} __attribute__ ((aligned (128)));
+
+typedef struct _cube_4way_context cube_4way_context;
+
+int cube_4way_init( cube_4way_context* sp, int hashbitlen, int rounds,
+                    int blockbytes );
+int cube_4way_update( cube_4way_context *sp, const void *data, size_t size );
+int cube_4way_close( cube_4way_context *sp, void *output );
+int cube_4way_update_close( cube_4way_context *sp, void *output,
+                            const void *data, size_t size );
+int cube_4way_full( cube_4way_context *sp, void *output, int hashbitlen,
+                    const void *data, size_t size );
+
+#endif

 // 2x128, 2 way parallel SSE2

 struct _cube_2way_context
@@ -15,21 +39,18 @@ struct _cube_2way_context
 int rounds;
 int blocksize; // __m128i
 int pos; // number of __m128i read into x from current block
-} __attribute__ ((aligned (64)));
+} __attribute__ ((aligned (128)));

 typedef struct _cube_2way_context cube_2way_context;

 int cube_2way_init( cube_2way_context* sp, int hashbitlen, int rounds,
                     int blockbytes );
-// reinitialize context with same parameters, much faster.
-int cube_2way_reinit( cube_2way_context *sp );

 int cube_2way_update( cube_2way_context *sp, const void *data, size_t size );

 int cube_2way_close( cube_2way_context *sp, void *output );

 int cube_2way_update_close( cube_2way_context *sp, void *output,
                             const void *data, size_t size );
+int cube_2way_full( cube_2way_context *sp, void *output, int hashbitlen,
+                    const void *data, size_t size );


 #endif

@@ -21,7 +21,27 @@ static void transform( cubehashParam *sp )
 int r;
 const int rounds = sp->rounds;

-#ifdef __AVX2__
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+   register __m512i x0, x1;
+
+   x0 = _mm512_load_si512( (__m512i*)sp->x );
+   x1 = _mm512_load_si512( (__m512i*)sp->x + 1 );
+
+   for ( r = 0; r < rounds; ++r )
+   {
+      x1 = _mm512_add_epi32( x0, x1 );
+      x0 = _mm512_xor_si512( mm512_rol_32( mm512_swap_256( x0 ), 7 ), x1 );
+      x1 = _mm512_add_epi32( x0, mm512_swap128_64( x1 ) );
+      x0 = _mm512_xor_si512( mm512_rol_32(
+                                  mm512_swap256_128( x0 ), 11 ), x1 );
+      x1 = mm512_swap64_32( x1 );
+   }
+
+   _mm512_store_si512( (__m512i*)sp->x, x0 );
+   _mm512_store_si512( (__m512i*)sp->x + 1, x1 );
+
+#elif defined(__AVX2__)

 register __m256i x0, x1, x2, x3, y0, y1;

@@ -39,8 +59,8 @@ static void transform( cubehashParam *sp )
 x1 = mm256_rol_32( y0, 7 );
 x0 = _mm256_xor_si256( x0, x2 );
 x1 = _mm256_xor_si256( x1, x3 );
-x2 = mm256_swap64_128( x2 );
-x3 = mm256_swap64_128( x3 );
+x2 = mm256_swap128_64( x2 );
+x3 = mm256_swap128_64( x3 );
 x2 = _mm256_add_epi32( x0, x2 );
 x3 = _mm256_add_epi32( x1, x3 );
 y0 = mm256_swap_128( x0 );
@@ -49,8 +69,8 @@ static void transform( cubehashParam *sp )
|
|||||||
x1 = mm256_rol_32( y1, 11 );
|
x1 = mm256_rol_32( y1, 11 );
|
||||||
x0 = _mm256_xor_si256( x0, x2 );
|
x0 = _mm256_xor_si256( x0, x2 );
|
||||||
x1 = _mm256_xor_si256( x1, x3 );
|
x1 = _mm256_xor_si256( x1, x3 );
|
||||||
x2 = mm256_swap32_64( x2 );
|
x2 = mm256_swap64_32( x2 );
|
||||||
x3 = mm256_swap32_64( x3 );
|
x3 = mm256_swap64_32( x3 );
|
||||||
}
|
}
|
||||||
|
|
||||||
_mm256_store_si256( (__m256i*)sp->x, x0 );
|
_mm256_store_si256( (__m256i*)sp->x, x0 );
|
||||||
|
|||||||
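The CubeHash round above is built from 32-bit lane rotations (mm512_rol_32 / mm256_rol_32) and lane swaps. As a minimal reference sketch, assuming nothing beyond standard C, the per-lane rotate that these helper intrinsics vectorize is:

```c
#include <stdint.h>

// Rotate a 32-bit word left by n bits (0 < n < 32): the scalar
// operation that mm512_rol_32 applies to all sixteen 32-bit lanes
// of a 512-bit register at once.
static inline uint32_t rol32( uint32_t x, unsigned n )
{
    return ( x << n ) | ( x >> ( 32 - n ) );
}
```

The vector code simply performs this rotate on every lane in parallel; the swap macros (mm512_swap_256, mm512_swap128_64, mm512_swap64_32) only permute lanes and carry no arithmetic.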
@@ -7,7 +7,6 @@
  * - implements NIST hash api
  * - assumes that message lenght is multiple of 8-bits
  * - _ECHO_VPERM_ must be defined if compiling with ../main.c
- * - define NO_AES_NI for aes_ni version
  *
  * Cagdas Calik
  * ccalik@metu.edu.tr
@@ -21,13 +20,7 @@
 #include "hash_api.h"
 //#include "vperm.h"
 #include <immintrin.h>
-/*
-#ifndef NO_AES_NI
-#include <wmmintrin.h>
-#else
-#include <tmmintrin.h>
-#endif
-*/
+#include "simd-utils.h"

 MYALIGN const unsigned int _k_s0F[] = {0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F};
 MYALIGN const unsigned int _k_ipt[] = {0x5A2A7000, 0xC2B2E898, 0x52227808, 0xCABAE090, 0x317C4D00, 0x4C01307D, 0xB0FDCC81, 0xCD80B1FC};
@@ -186,7 +179,7 @@ void Compress(hashState_echo *ctx, const unsigned char *pmsg, unsigned int uBloc
    {
       for(i = 0; i < 4; i++)
       {
-         _state[i][j] = _mm_loadu_si128((__m128i*)pmsg + 4 * (j - (ctx->uHashSize / 256)) + i);
+         _state[i][j] = _mm_load_si128((__m128i*)pmsg + 4 * (j - (ctx->uHashSize / 256)) + i);
       }
    }

@@ -390,13 +383,13 @@ HashReturn final_echo(hashState_echo *state, BitSequence *hashval)
    }

    // Store the hash value
-   _mm_storeu_si128((__m128i*)hashval + 0, state->state[0][0]);
-   _mm_storeu_si128((__m128i*)hashval + 1, state->state[1][0]);
+   _mm_store_si128((__m128i*)hashval + 0, state->state[0][0]);
+   _mm_store_si128((__m128i*)hashval + 1, state->state[1][0]);

    if(state->uHashSize == 512)
    {
-      _mm_storeu_si128((__m128i*)hashval + 2, state->state[2][0]);
-      _mm_storeu_si128((__m128i*)hashval + 3, state->state[3][0]);
+      _mm_store_si128((__m128i*)hashval + 2, state->state[2][0]);
+      _mm_store_si128((__m128i*)hashval + 3, state->state[3][0]);
    }

    return SUCCESS;
@@ -513,18 +506,177 @@ HashReturn update_final_echo( hashState_echo *state, BitSequence *hashval,
    }

    // Store the hash value
-   _mm_storeu_si128( (__m128i*)hashval + 0, state->state[0][0] );
-   _mm_storeu_si128( (__m128i*)hashval + 1, state->state[1][0] );
+   _mm_store_si128( (__m128i*)hashval + 0, state->state[0][0] );
+   _mm_store_si128( (__m128i*)hashval + 1, state->state[1][0] );

    if( state->uHashSize == 512 )
    {
-      _mm_storeu_si128( (__m128i*)hashval + 2, state->state[2][0] );
-      _mm_storeu_si128( (__m128i*)hashval + 3, state->state[3][0] );
+      _mm_store_si128( (__m128i*)hashval + 2, state->state[2][0] );
+      _mm_store_si128( (__m128i*)hashval + 3, state->state[3][0] );
    }
    return SUCCESS;
 }

+HashReturn echo_full( hashState_echo *state, BitSequence *hashval,
+                      int nHashSize, const BitSequence *data, DataLength datalen )
+{
+   int i, j;
+
+   state->k = m128_zero;
+   state->processed_bits = 0;
+   state->uBufferBytes = 0;
+
+   switch( nHashSize )
+   {
+      case 256:
+         state->uHashSize = 256;
+         state->uBlockLength = 192;
+         state->uRounds = 8;
+         state->hashsize = m128_const_64( 0, 0x100 );
+         state->const1536 = m128_const_64( 0, 0x600 );
+         break;
+
+      case 512:
+         state->uHashSize = 512;
+         state->uBlockLength = 128;
+         state->uRounds = 10;
+         state->hashsize = m128_const_64( 0, 0x200 );
+         state->const1536 = m128_const_64( 0, 0x400 );
+         break;
+
+      default:
+         return BAD_HASHBITLEN;
+   }
+
+   for(i = 0; i < 4; i++)
+      for(j = 0; j < nHashSize / 256; j++)
+         state->state[i][j] = state->hashsize;
+
+   for(i = 0; i < 4; i++)
+      for(j = nHashSize / 256; j < 4; j++)
+         state->state[i][j] = m128_zero;
+
+   unsigned int uBlockCount, uRemainingBytes;
+
+   if( (state->uBufferBytes + datalen) >= state->uBlockLength )
+   {
+      if( state->uBufferBytes != 0 )
+      {
+         // Fill the buffer
+         memcpy( state->buffer + state->uBufferBytes,
+                 (void*)data, state->uBlockLength - state->uBufferBytes );
+
+         // Process buffer
+         Compress( state, state->buffer, 1 );
+         state->processed_bits += state->uBlockLength * 8;
+
+         data += state->uBlockLength - state->uBufferBytes;
+         datalen -= state->uBlockLength - state->uBufferBytes;
+      }
+
+      // buffer now does not contain any unprocessed bytes
+
+      uBlockCount = datalen / state->uBlockLength;
+      uRemainingBytes = datalen % state->uBlockLength;
+
+      if( uBlockCount > 0 )
+      {
+         Compress( state, data, uBlockCount );
+         state->processed_bits += uBlockCount * state->uBlockLength * 8;
+         data += uBlockCount * state->uBlockLength;
+      }
+
+      if( uRemainingBytes > 0 )
+         memcpy(state->buffer, (void*)data, uRemainingBytes);
+
+      state->uBufferBytes = uRemainingBytes;
+   }
+   else
+   {
+      memcpy( state->buffer + state->uBufferBytes, (void*)data, datalen );
+      state->uBufferBytes += datalen;
+   }
+
+   __m128i remainingbits;
+
+   // Add remaining bytes in the buffer
+   state->processed_bits += state->uBufferBytes * 8;
+
+   remainingbits = _mm_set_epi32( 0, 0, 0, state->uBufferBytes * 8 );
+
+   // Pad with 0x80
+   state->buffer[state->uBufferBytes++] = 0x80;
+   // Enough buffer space for padding in this block?
+   if( (state->uBlockLength - state->uBufferBytes) >= 18 )
+   {
+      // Pad with zeros
+      memset( state->buffer + state->uBufferBytes, 0, state->uBlockLength - (state->uBufferBytes + 18) );
+
+      // Hash size
+      *( (unsigned short*)(state->buffer + state->uBlockLength - 18) ) = state->uHashSize;
+
+      // Processed bits
+      *( (DataLength*)(state->buffer + state->uBlockLength - 16) ) =
+                      state->processed_bits;
+      *( (DataLength*)(state->buffer + state->uBlockLength - 8) ) = 0;
+
+      // Last block contains message bits?
+      if( state->uBufferBytes == 1 )
+      {
+         state->k = _mm_xor_si128( state->k, state->k );
+         state->k = _mm_sub_epi64( state->k, state->const1536 );
+      }
+      else
+      {
+         state->k = _mm_add_epi64( state->k, remainingbits );
+         state->k = _mm_sub_epi64( state->k, state->const1536 );
+      }
+
+      // Compress
+      Compress( state, state->buffer, 1 );
+   }
+   else
+   {
+      // Fill with zero and compress
+      memset( state->buffer + state->uBufferBytes, 0,
+              state->uBlockLength - state->uBufferBytes );
+      state->k = _mm_add_epi64( state->k, remainingbits );
+      state->k = _mm_sub_epi64( state->k, state->const1536 );
+      Compress( state, state->buffer, 1 );
+
+      // Last block
+      memset( state->buffer, 0, state->uBlockLength - 18 );
+
+      // Hash size
+      *( (unsigned short*)(state->buffer + state->uBlockLength - 18) ) =
+                          state->uHashSize;
+
+      // Processed bits
+      *( (DataLength*)(state->buffer + state->uBlockLength - 16) ) =
+                      state->processed_bits;
+      *( (DataLength*)(state->buffer + state->uBlockLength - 8) ) = 0;
+      // Compress the last block
+      state->k = _mm_xor_si128( state->k, state->k );
+      state->k = _mm_sub_epi64( state->k, state->const1536 );
+      Compress( state, state->buffer, 1) ;
+   }
+
+   // Store the hash value
+   _mm_store_si128( (__m128i*)hashval + 0, state->state[0][0] );
+   _mm_store_si128( (__m128i*)hashval + 1, state->state[1][0] );
+
+   if( state->uHashSize == 512 )
+   {
+      _mm_store_si128( (__m128i*)hashval + 2, state->state[2][0] );
+      _mm_store_si128( (__m128i*)hashval + 3, state->state[3][0] );
+   }
+   return SUCCESS;
+}
+
 HashReturn hash_echo(int hashbitlen, const BitSequence *data, DataLength databitlen, BitSequence *hashval)
 {
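The echo_full function above finishes a message with a fixed final-block layout: a 0x80 marker byte, zero fill, the hash size as a 16-bit value at blocklen-18, the processed-bit count as a 64-bit value at blocklen-16, and a zero quadword in the last 8 bytes. As a hedged standalone sketch of just that layout (echo_pad is a hypothetical helper, not part of the source; it assumes a little-endian target, as the x86 code above does):

```c
#include <stdint.h>
#include <string.h>

// Sketch of the ECHO final-block layout used by echo_full:
//   buf[used]          = 0x80 padding marker
//   ...                  zero fill
//   buf[blocklen - 18] = 16-bit hash size (little-endian)
//   buf[blocklen - 16] = 64-bit processed-bit count (little-endian)
//   buf[blocklen -  8] = 64-bit zero
static void echo_pad( uint8_t *buf, unsigned blocklen, unsigned used,
                      uint16_t hashsize, uint64_t processed_bits )
{
    buf[used++] = 0x80;
    memset( buf + used, 0, blocklen - used );   // zeros the tail, incl. last qword
    memcpy( buf + blocklen - 18, &hashsize, 2 );
    memcpy( buf + blocklen - 16, &processed_bits, 8 );
}
```

This only illustrates byte placement; the real code also folds the remaining-bit count into the counter register `k` before the final Compress call, and handles the case where the padding does not fit in the current block by emitting an extra block.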
@@ -15,7 +15,7 @@
 #ifndef HASH_API_H
 #define HASH_API_H

-#ifndef NO_AES_NI
+#ifdef __AES__
 #define HASH_IMPL_STR "ECHO-aesni"
 #else
 #define HASH_IMPL_STR "ECHO-vperm"
@@ -55,6 +55,8 @@ HashReturn hash_echo(int hashbitlen, const BitSequence *data, DataLength databit

 HashReturn update_final_echo( hashState_echo *state, BitSequence *hashval,
                               const BitSequence *data, DataLength databitlen );
+HashReturn echo_full( hashState_echo *state, BitSequence *hashval,
+                      int nHashSize, const BitSequence *data, DataLength databitlen );

 #endif // HASH_API_H
algo/echo/echo-hash-4way.c (new file, 404 lines)
@@ -0,0 +1,404 @@
+//#if 0
+#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#include "simd-utils.h"
+#include "echo-hash-4way.h"
+
+/*
+static const unsigned int mul2ipt[] __attribute__ ((aligned (64))) =
+{
+  0x728efc00, 0x6894e61a, 0x3fc3b14d, 0x25d9ab57,
+  0xfd5ba600, 0x2a8c71d7, 0x1eb845e3, 0xc96f9234
+};
+*/
+// do these need to be reversed?
+
+#define mul2mask \
+   _mm512_set4_epi32( 0, 0, 0, 0x00001b00 )
+//   _mm512_set4_epi32( 0x00001b00, 0, 0, 0 )
+
+#define lsbmask m512_const1_32( 0x01010101 )
+
+#define ECHO_SUBBYTES( state, i, j ) \
+   state[i][j] = _mm512_aesenc_epi128( state[i][j], k1 ); \
+   state[i][j] = _mm512_aesenc_epi128( state[i][j], m512_zero ); \
+   k1 = _mm512_add_epi32( k1, m512_one_128 );
+
+#define ECHO_MIXBYTES( state1, state2, j, t1, t2, s2 ) do \
+{ \
+   const int j1 = ( (j)+1 ) & 3; \
+   const int j2 = ( (j)+2 ) & 3; \
+   const int j3 = ( (j)+3 ) & 3; \
+   s2 = _mm512_add_epi8( state1[ 0 ] [j ], state1[ 0 ][ j ] ); \
+   t1 = _mm512_srli_epi16( state1[ 0 ][ j ], 7 ); \
+   t1 = _mm512_and_si512( t1, lsbmask );\
+   t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
+   s2 = _mm512_xor_si512( s2, t2 ); \
+   state2[ 0 ] [j ] = s2; \
+   state2[ 1 ] [j ] = state1[ 0 ][ j ]; \
+   state2[ 2 ] [j ] = state1[ 0 ][ j ]; \
+   state2[ 3 ] [j ] = _mm512_xor_si512( s2, state1[ 0 ][ j ] );\
+   s2 = _mm512_add_epi8( state1[ 1 ][ j1 ], state1[ 1 ][ j1 ] ); \
+   t1 = _mm512_srli_epi16( state1[ 1 ][ j1 ], 7 ); \
+   t1 = _mm512_and_si512( t1, lsbmask ); \
+   t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
+   s2 = _mm512_xor_si512( s2, t2 );\
+   state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], \
+                                  _mm512_xor_si512( s2, state1[ 1 ][ j1 ] ) ); \
+   state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], s2 ); \
+   state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], state1[ 1 ][ j1 ] ); \
+   state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3 ][ j ], state1[ 1 ][ j1 ] ); \
+   s2 = _mm512_add_epi8( state1[ 2 ][ j2 ], state1[ 2 ][ j2 ] ); \
+   t1 = _mm512_srli_epi16( state1[ 2 ][ j2 ], 7 ); \
+   t1 = _mm512_and_si512( t1, lsbmask ); \
+   t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
+   s2 = _mm512_xor_si512( s2, t2 ); \
+   state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], state1[ 2 ][ j2 ] ); \
+   state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], \
+                                  _mm512_xor_si512( s2, state1[ 2 ][ j2 ] ) ); \
+   state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], s2 ); \
+   state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3][ j ], state1[ 2 ][ j2 ] ); \
+   s2 = _mm512_add_epi8( state1[ 3 ][ j3 ], state1[ 3 ][ j3 ] ); \
+   t1 = _mm512_srli_epi16( state1[ 3 ][ j3 ], 7 ); \
+   t1 = _mm512_and_si512( t1, lsbmask ); \
+   t2 = _mm512_shuffle_epi8( mul2mask, t1 ); \
+   s2 = _mm512_xor_si512( s2, t2 ); \
+   state2[ 0 ][ j ] = _mm512_xor_si512( state2[ 0 ][ j ], state1[ 3 ][ j3 ] ); \
+   state2[ 1 ][ j ] = _mm512_xor_si512( state2[ 1 ][ j ], state1[ 3 ][ j3 ] ); \
+   state2[ 2 ][ j ] = _mm512_xor_si512( state2[ 2 ][ j ], \
+                                  _mm512_xor_si512( s2, state1[ 3 ][ j3] ) ); \
+   state2[ 3 ][ j ] = _mm512_xor_si512( state2[ 3 ][ j ], s2 ); \
+} while(0)
+
+#define ECHO_ROUND_UNROLL2 \
+   ECHO_SUBBYTES(_state, 0, 0);\
+   ECHO_SUBBYTES(_state, 1, 0);\
+   ECHO_SUBBYTES(_state, 2, 0);\
+   ECHO_SUBBYTES(_state, 3, 0);\
+   ECHO_SUBBYTES(_state, 0, 1);\
+   ECHO_SUBBYTES(_state, 1, 1);\
+   ECHO_SUBBYTES(_state, 2, 1);\
+   ECHO_SUBBYTES(_state, 3, 1);\
+   ECHO_SUBBYTES(_state, 0, 2);\
+   ECHO_SUBBYTES(_state, 1, 2);\
+   ECHO_SUBBYTES(_state, 2, 2);\
+   ECHO_SUBBYTES(_state, 3, 2);\
+   ECHO_SUBBYTES(_state, 0, 3);\
+   ECHO_SUBBYTES(_state, 1, 3);\
+   ECHO_SUBBYTES(_state, 2, 3);\
+   ECHO_SUBBYTES(_state, 3, 3);\
+   ECHO_MIXBYTES(_state, _state2, 0, t1, t2, s2);\
+   ECHO_MIXBYTES(_state, _state2, 1, t1, t2, s2);\
+   ECHO_MIXBYTES(_state, _state2, 2, t1, t2, s2);\
+   ECHO_MIXBYTES(_state, _state2, 3, t1, t2, s2);\
+   ECHO_SUBBYTES(_state2, 0, 0);\
+   ECHO_SUBBYTES(_state2, 1, 0);\
+   ECHO_SUBBYTES(_state2, 2, 0);\
+   ECHO_SUBBYTES(_state2, 3, 0);\
+   ECHO_SUBBYTES(_state2, 0, 1);\
+   ECHO_SUBBYTES(_state2, 1, 1);\
+   ECHO_SUBBYTES(_state2, 2, 1);\
+   ECHO_SUBBYTES(_state2, 3, 1);\
+   ECHO_SUBBYTES(_state2, 0, 2);\
+   ECHO_SUBBYTES(_state2, 1, 2);\
+   ECHO_SUBBYTES(_state2, 2, 2);\
+   ECHO_SUBBYTES(_state2, 3, 2);\
+   ECHO_SUBBYTES(_state2, 0, 3);\
+   ECHO_SUBBYTES(_state2, 1, 3);\
+   ECHO_SUBBYTES(_state2, 2, 3);\
+   ECHO_SUBBYTES(_state2, 3, 3);\
+   ECHO_MIXBYTES(_state2, _state, 0, t1, t2, s2);\
+   ECHO_MIXBYTES(_state2, _state, 1, t1, t2, s2);\
+   ECHO_MIXBYTES(_state2, _state, 2, t1, t2, s2);\
+   ECHO_MIXBYTES(_state2, _state, 3, t1, t2, s2)
+
+#define SAVESTATE(dst, src)\
+   dst[0][0] = src[0][0];\
+   dst[0][1] = src[0][1];\
+   dst[0][2] = src[0][2];\
+   dst[0][3] = src[0][3];\
+   dst[1][0] = src[1][0];\
+   dst[1][1] = src[1][1];\
+   dst[1][2] = src[1][2];\
+   dst[1][3] = src[1][3];\
+   dst[2][0] = src[2][0];\
+   dst[2][1] = src[2][1];\
+   dst[2][2] = src[2][2];\
+   dst[2][3] = src[2][3];\
+   dst[3][0] = src[3][0];\
+   dst[3][1] = src[3][1];\
+   dst[3][2] = src[3][2];\
+   dst[3][3] = src[3][3]
+
+// blockcount always 1
+void echo_4way_compress( echo_4way_context *ctx, const __m512i *pmsg,
+                         unsigned int uBlockCount )
+{
+   unsigned int r, b, i, j;
+   __m512i t1, t2, s2, k1;
+   __m512i _state[4][4], _state2[4][4], _statebackup[4][4];
+
+   _state[ 0 ][ 0 ] = ctx->state[ 0 ][ 0 ];
+   _state[ 0 ][ 1 ] = ctx->state[ 0 ][ 1 ];
+   _state[ 0 ][ 2 ] = ctx->state[ 0 ][ 2 ];
+   _state[ 0 ][ 3 ] = ctx->state[ 0 ][ 3 ];
+   _state[ 1 ][ 0 ] = ctx->state[ 1 ][ 0 ];
+   _state[ 1 ][ 1 ] = ctx->state[ 1 ][ 1 ];
+   _state[ 1 ][ 2 ] = ctx->state[ 1 ][ 2 ];
+   _state[ 1 ][ 3 ] = ctx->state[ 1 ][ 3 ];
+   _state[ 2 ][ 0 ] = ctx->state[ 2 ][ 0 ];
+   _state[ 2 ][ 1 ] = ctx->state[ 2 ][ 1 ];
+   _state[ 2 ][ 2 ] = ctx->state[ 2 ][ 2 ];
+   _state[ 2 ][ 3 ] = ctx->state[ 2 ][ 3 ];
+   _state[ 3 ][ 0 ] = ctx->state[ 3 ][ 0 ];
+   _state[ 3 ][ 1 ] = ctx->state[ 3 ][ 1 ];
+   _state[ 3 ][ 2 ] = ctx->state[ 3 ][ 2 ];
+   _state[ 3 ][ 3 ] = ctx->state[ 3 ][ 3 ];
+
+   for ( b = 0; b < uBlockCount; b++ )
+   {
+      ctx->k = _mm512_add_epi64( ctx->k, ctx->const1536 );
+
+      for( j = ctx->uHashSize / 256; j < 4; j++ )
+      {
+         for ( i = 0; i < 4; i++ )
+         {
+            _state[ i ][ j ] = _mm512_load_si512(
+                          pmsg + 4 * (j - (ctx->uHashSize / 256)) + i );
+         }
+      }
+
+      // save state
+      SAVESTATE( _statebackup, _state );
+
+      k1 = ctx->k;
+
+      for ( r = 0; r < ctx->uRounds / 2; r++ )
+      {
+         ECHO_ROUND_UNROLL2;
+      }
+
+      if ( ctx->uHashSize == 256 )
+      {
+         for ( i = 0; i < 4; i++ )
+         {
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _state[ i ][ 1 ] );
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _state[ i ][ 2 ] );
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _state[ i ][ 3 ] );
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _statebackup[ i ][ 0 ] );
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _statebackup[ i ][ 1 ] );
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _statebackup[ i ][ 2 ] ) ;
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _statebackup[ i ][ 3 ] );
+         }
+      }
+      else
+      {
+         for ( i = 0; i < 4; i++ )
+         {
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _state[ i ][ 2 ] );
+            _state[ i ][ 1 ] = _mm512_xor_si512( _state[ i ][ 1 ],
+                                                 _state[ i ][ 3 ] );
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ][ 0 ],
+                                                 _statebackup[ i ][ 0 ] );
+            _state[ i ][ 0 ] = _mm512_xor_si512( _state[ i ] [0 ],
+                                                 _statebackup[ i ][ 2 ] );
+            _state[ i ][ 1 ] = _mm512_xor_si512( _state[ i ][ 1 ],
+                                                 _statebackup[ i ][ 1 ] );
+            _state[ i ][ 1 ] = _mm512_xor_si512( _state[ i ][ 1 ],
+                                                 _statebackup[ i ][ 3 ] );
+         }
+      }
+      pmsg += ctx->uBlockLength;
+   }
+   SAVESTATE(ctx->state, _state);
+
+}
+
+int echo_4way_init( echo_4way_context *ctx, int nHashSize )
+{
+   int i, j;
+
+   ctx->k = m512_zero;
+   ctx->processed_bits = 0;
+   ctx->uBufferBytes = 0;
+
+   switch( nHashSize )
+   {
+      case 256:
+         ctx->uHashSize = 256;
+         ctx->uBlockLength = 192;
+         ctx->uRounds = 8;
+         ctx->hashsize = _mm512_set4_epi32( 0, 0, 0, 0x100 );
+         ctx->const1536 = _mm512_set4_epi32( 0, 0, 0, 0x600 );
+         break;
+
+      case 512:
+         ctx->uHashSize = 512;
+         ctx->uBlockLength = 128;
+         ctx->uRounds = 10;
+         ctx->hashsize = _mm512_set4_epi32( 0, 0, 0, 0x200 );
+         ctx->const1536 = _mm512_set4_epi32( 0, 0, 0, 0x400);
+         break;
+
+      default:
+         return 1;
+   }
+
+   for( i = 0; i < 4; i++ )
+      for( j = 0; j < nHashSize / 256; j++ )
+         ctx->state[ i ][ j ] = ctx->hashsize;
+
+   for( i = 0; i < 4; i++ )
+      for( j = nHashSize / 256; j < 4; j++ )
+         ctx->state[ i ][ j ] = m512_zero;
+
+   return 0;
+}
+
+int echo_4way_update_close( echo_4way_context *state, void *hashval,
+                            const void *data, int databitlen )
+{
+// bytelen is either 32 (maybe), 64 or 80 or 128!
+// all are less than full block.
+
+   int vlen = databitlen / 128;  // * 4 lanes / 128 bits per lane
+   const int vblen = state->uBlockLength / 16;  // 16 bytes per lane
+   __m512i remainingbits;
+
+   if ( databitlen == 1024 )
+   {
+      echo_4way_compress( state, data, 1 );
+      state->processed_bits = 1024;
+      remainingbits = m512_const2_64( 0, -1024 );
+      vlen = 0;
+   }
+   else
+   {
+      vlen = databitlen / 128;   // * 4 lanes / 128 bits per lane
+      memcpy_512( state->buffer, data, vlen );
+      state->processed_bits += (unsigned int)( databitlen );
+      remainingbits = _mm512_set4_epi32( 0, 0, 0, databitlen );
+   }
+
+   state->buffer[ vlen ] = _mm512_set4_epi32( 0, 0, 0, 0x80 );
+   memset_zero_512( state->buffer + vlen + 1, vblen - vlen - 2 );
+   state->buffer[ vblen-2 ] =
+                 _mm512_set4_epi32( (uint32_t)state->uHashSize << 16, 0, 0, 0 );
+   state->buffer[ vblen-1 ] =
+                 _mm512_set4_epi64( 0, state->processed_bits,
+                                    0, state->processed_bits );
+
+   state->k = _mm512_add_epi64( state->k, remainingbits );
+   state->k = _mm512_sub_epi64( state->k, state->const1536 );
+
+   echo_4way_compress( state, state->buffer, 1 );
+
+   _mm512_store_si512( (__m512i*)hashval + 0, state->state[ 0 ][ 0] );
+   _mm512_store_si512( (__m512i*)hashval + 1, state->state[ 1 ][ 0] );
+
+   if ( state->uHashSize == 512 )
+   {
+      _mm512_store_si512( (__m512i*)hashval + 2, state->state[ 2 ][ 0 ] );
+      _mm512_store_si512( (__m512i*)hashval + 3, state->state[ 3 ][ 0 ] );
+   }
+   return 0;
+}
+
+int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
+                    const void *data, int datalen )
+{
+   int i, j;
+   int databitlen = datalen * 8;
+   ctx->k = m512_zero;
+   ctx->processed_bits = 0;
+   ctx->uBufferBytes = 0;
+
+   switch( nHashSize )
+   {
+      case 256:
+         ctx->uHashSize = 256;
+         ctx->uBlockLength = 192;
+         ctx->uRounds = 8;
+         ctx->hashsize = _mm512_set4_epi32( 0, 0, 0, 0x100 );
+         ctx->const1536 = _mm512_set4_epi32( 0, 0, 0, 0x600 );
+         break;
+
+      case 512:
+         ctx->uHashSize = 512;
+         ctx->uBlockLength = 128;
+         ctx->uRounds = 10;
+         ctx->hashsize = _mm512_set4_epi32( 0, 0, 0, 0x200 );
+         ctx->const1536 = _mm512_set4_epi32( 0, 0, 0, 0x400);
+         break;
+
+      default:
+         return 1;
+   }
+
+   for( i = 0; i < 4; i++ )
+      for( j = 0; j < nHashSize / 256; j++ )
+         ctx->state[ i ][ j ] = ctx->hashsize;
+
+   for( i = 0; i < 4; i++ )
+      for( j = nHashSize / 256; j < 4; j++ )
+         ctx->state[ i ][ j ] = m512_zero;
+
+// bytelen is either 32 (maybe), 64 or 80 or 128!
+// all are less than full block.
+
+   int vlen = datalen / 32;
+   const int vblen = ctx->uBlockLength / 16;  // 16 bytes per lane
+   __m512i remainingbits;
+
+   if ( databitlen == 1024 )
+   {
+      echo_4way_compress( ctx, data, 1 );
+      ctx->processed_bits = 1024;
+      remainingbits = m512_const2_64( 0, -1024 );
+      vlen = 0;
+   }
+   else
+   {
+      vlen = databitlen / 128;   // * 4 lanes / 128 bits per lane
+      memcpy_512( ctx->buffer, data, vlen );
+      ctx->processed_bits += (unsigned int)( databitlen );
+      remainingbits = _mm512_set4_epi32( 0, 0, 0, databitlen );
+   }
+
+   ctx->buffer[ vlen ] = _mm512_set4_epi32( 0, 0, 0, 0x80 );
+   memset_zero_512( ctx->buffer + vlen + 1, vblen - vlen - 2 );
+   ctx->buffer[ vblen-2 ] =
+                 _mm512_set4_epi32( (uint32_t)ctx->uHashSize << 16, 0, 0, 0 );
+   ctx->buffer[ vblen-1 ] =
+                 _mm512_set4_epi64( 0, ctx->processed_bits,
+                                    0, ctx->processed_bits );
+
+   ctx->k = _mm512_add_epi64( ctx->k, remainingbits );
+   ctx->k = _mm512_sub_epi64( ctx->k, ctx->const1536 );
+
+   echo_4way_compress( ctx, ctx->buffer, 1 );
+
+   _mm512_store_si512( (__m512i*)hashval + 0, ctx->state[ 0 ][ 0] );
+   _mm512_store_si512( (__m512i*)hashval + 1, ctx->state[ 1 ][ 0] );
+
+   if ( ctx->uHashSize == 512 )
+   {
+      _mm512_store_si512( (__m512i*)hashval + 2, ctx->state[ 2 ][ 0 ] );
+      _mm512_store_si512( (__m512i*)hashval + 3, ctx->state[ 3 ][ 0 ] );
+   }
+   return 0;
+}
+
+#endif
algo/echo/echo-hash-4way.h (new file, 39 lines)
@@ -0,0 +1,39 @@
+#if !defined(ECHO_HASH_4WAY_H__)
+#define ECHO_HASH_4WAY_H__ 1
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#include "simd-utils.h"
+
+typedef struct
+{
+   __m512i state[4][4];
+   __m512i buffer[ 4 * 192 / 16 ];  // 4x128 interleaved 192 bytes
+   __m512i k;
+   __m512i hashsize;
+   __m512i const1536;
+
+   unsigned int uRounds;
+   unsigned int uHashSize;
+   unsigned int uBlockLength;
+   unsigned int uBufferBytes;
+   unsigned int processed_bits;
+
+} echo_4way_context __attribute__ ((aligned (64)));
+
+int echo_4way_init( echo_4way_context *state, int hashbitlen );
+
+int echo_4way_update( echo_4way_context *state, const void *data,
+                      unsigned int databitlen);
+
+int echo_close( echo_4way_context *state, void *hashval );
+
+int echo_4way_update_close( echo_4way_context *state, void *hashval,
+                            const void *data, int databitlen );
+
+int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
+                    const void *data, int datalen );
+
+#endif
+#endif
@@ -4,7 +4,7 @@
|
|||||||
#include <stdlib.h>
|
#include <stdlib.h>
|
||||||
#include <memory.h>
|
#include <memory.h>
|
||||||
#include <math.h>
|
#include <math.h>
|
||||||
|
#include "simd-utils.h"
|
||||||
#include "sph_gost.h"
|
#include "sph_gost.h"
|
||||||
|
|
||||||
#ifdef __cplusplus
|
#ifdef __cplusplus
|
||||||
@@ -696,9 +696,26 @@ static void AddModulo512(const void *a,const void *b,void *c)
|
|||||||
|
|
||||||
static void AddXor512(const void *a,const void *b,void *c)
|
static void AddXor512(const void *a,const void *b,void *c)
|
||||||
{
|
{
|
||||||
|
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
|
||||||
|
casti_m512i( c, 0 ) = _mm512_xor_si512( casti_m512i( a, 0 ),
|
||||||
|
casti_m512i( b, 0 ) );
|
||||||
|
#elif defined(__AVX2__)
|
||||||
|
casti_m256i( c, 0 ) = _mm256_xor_si256( casti_m256i( a, 0 ),
|
||||||
|
casti_m256i( b, 0 ) );
|
||||||
|
casti_m256i( c, 1 ) = _mm256_xor_si256( casti_m256i( a, 1 ),
|
||||||
|
casti_m256i( b, 1 ) );
|
||||||
|
#elif defined(__SSE2__)
|
||||||
|
casti_m128i( c, 0 ) = _mm_xor_si128( casti_m128i( a, 0 ),
|
||||||
|
casti_m128i( b, 0 ) );
|
||||||
|
casti_m128i( c, 1 ) = _mm_xor_si128( casti_m128i( a, 1 ),
|
||||||
|
casti_m128i( b, 1 ) );
|
||||||
|
casti_m128i( c, 2 ) = _mm_xor_si128( casti_m128i( a, 2 ),
|
||||||
|
casti_m128i( b, 2 ) );
|
||||||
|
casti_m128i( c, 3 ) = _mm_xor_si128( casti_m128i( a, 3 ),
|
||||||
|
casti_m128i( b, 3 ) );
|
||||||
|
#else
|
||||||
const unsigned long long *A=a, *B=b;
|
const unsigned long long *A=a, *B=b;
|
||||||
unsigned long long *C=c;
|
unsigned long long *C=c;
|
||||||
#ifdef FULL_UNROLL
|
|
||||||
C[0] = A[0] ^ B[0];
|
C[0] = A[0] ^ B[0];
|
||||||
C[1] = A[1] ^ B[1];
|
C[1] = A[1] ^ B[1];
|
||||||
C[2] = A[2] ^ B[2];
|
C[2] = A[2] ^ B[2];
|
||||||
@@ -707,12 +724,6 @@ static void AddXor512(const void *a,const void *b,void *c)
|
|||||||
C[5] = A[5] ^ B[5];
|
C[5] = A[5] ^ B[5];
|
||||||
C[6] = A[6] ^ B[6];
|
C[6] = A[6] ^ B[6];
|
||||||
C[7] = A[7] ^ B[7];
|
C[7] = A[7] ^ B[7];
|
||||||
#else
|
|
||||||
int i = 0;
|
|
||||||
|
|
||||||
for(i=0; i<8; i++) {
|
|
||||||
C[i] = A[i] ^ B[i];
|
|
||||||
}
|
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -893,31 +904,32 @@ static void g_N(const unsigned char *N,unsigned char *h,const unsigned char *m)
 
 static void hash_X(unsigned char *IV,const unsigned char *message,unsigned long long length,unsigned char *out)
 {
-    unsigned char v512[64] = {
+    unsigned char v512[64] __attribute__((aligned(64))) = {
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x02,0x00
     };
-    unsigned char v0[64] = {
+    unsigned char v0[64] __attribute__((aligned(64))) = {
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00
     };
-    unsigned char Sigma[64] = {
+    unsigned char Sigma[64] __attribute__((aligned(64))) = {
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00
     };
-    unsigned char N[64] = {
+    unsigned char N[64] __attribute__((aligned(64))) = {
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00
     };
-    unsigned char m[64], *hash = IV;
+    unsigned char m[64] __attribute__((aligned(64)));
+    unsigned char *hash = IV;
     unsigned long long len = length;
 
     // Stage 2
@@ -952,7 +964,7 @@ static void hash_X(unsigned char *IV,const unsigned char *message,unsigned long
 
 static void hash_512(const unsigned char *message, unsigned long long length, unsigned char *out)
 {
-    unsigned char IV[64] = {
+    unsigned char IV[64] __attribute__((aligned(64))) = {
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
         0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
|
|||||||
@@ -81,9 +81,9 @@ typedef struct {
 */
 typedef struct {
 #ifndef DOXYGEN_IGNORE
-    unsigned char buf[64];    /* first field, for alignment */
+    unsigned char buf[64] __attribute__((aligned(64)));
+    sph_u32 V[5][8] __attribute__((aligned(64)));
     size_t ptr;
-    sph_u32 V[5][8];
 #endif
 } sph_gost512_context;
 
@@ -209,7 +209,6 @@ __m128i ALL_FF;
 \
     /* AddRoundConstant P1024 */\
     xmm0 = _mm_xor_si128(xmm0, (ROUND_CONST_P[round_counter+1]));\
-    /* ShiftBytes P1024 + pre-AESENCLAST */\
     xmm0 = _mm_shuffle_epi8(xmm0, (SUBSH_MASK[0]));\
     xmm1 = _mm_shuffle_epi8(xmm1, (SUBSH_MASK[1]));\
     xmm2 = _mm_shuffle_epi8(xmm2, (SUBSH_MASK[2]));\
@@ -218,7 +217,6 @@ __m128i ALL_FF;
     xmm5 = _mm_shuffle_epi8(xmm5, (SUBSH_MASK[5]));\
     xmm6 = _mm_shuffle_epi8(xmm6, (SUBSH_MASK[6]));\
     xmm7 = _mm_shuffle_epi8(xmm7, (SUBSH_MASK[7]));\
-    /* SubBytes + MixBytes */\
     SUBMIX(xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
   }\
 }

@@ -2,13 +2,7 @@
 //#define TASM
 #define TINTR
 
-//#define AES_NI
+// Not to be confused with AVX512VAES
 
-//#ifdef AES_NI
-// specify AES-NI, AVX (with AES-NI) or vector-permute implementation
-
-//#ifndef NO_AES_NI
-
 #define VAES
 // #define VAVX
 // #define VVPERM
@@ -14,7 +14,7 @@
 #include "miner.h"
 #include "simd-utils.h"
 
-#ifndef NO_AES_NI
+#ifdef __AES__
 
 #include "groestl-version.h"
 
@@ -67,8 +67,12 @@ HashReturn_gr init_groestl( hashState_groestl* ctx, int hashlen )
      ctx->chaining[i] = _mm_setzero_si128();
      ctx->buffer[i] = _mm_setzero_si128();
   }
-  ((u64*)ctx->chaining)[COLS-1] = U64BIG((u64)LENGTH);
-  INIT(ctx->chaining);
+  // The only non-zero in the IV is len. It can be hard coded.
+  ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+  // ((u64*)ctx->chaining)[COLS-1] = U64BIG((u64)LENGTH);
+  // INIT(ctx->chaining);
 
   ctx->buf_ptr = 0;
   ctx->rem_ptr = 0;
 
@@ -87,8 +91,9 @@ HashReturn_gr reinit_groestl( hashState_groestl* ctx )
      ctx->chaining[i] = _mm_setzero_si128();
      ctx->buffer[i] = _mm_setzero_si128();
   }
-  ((u64*)ctx->chaining)[COLS-1] = U64BIG((u64)LENGTH);
-  INIT(ctx->chaining);
+  ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+  // ((u64*)ctx->chaining)[COLS-1] = U64BIG((u64)LENGTH);
+  // INIT(ctx->chaining);
   ctx->buf_ptr = 0;
   ctx->rem_ptr = 0;
 
@@ -180,6 +185,82 @@ HashReturn_gr final_groestl( hashState_groestl* ctx, void* output )
    return SUCCESS_GR;
 }
 
+int groestl512_full( hashState_groestl* ctx, void* output,
+                     const void* input, uint64_t databitlen )
+{
+   int i;
+
+   ctx->hashlen = 64;
+   SET_CONSTANTS();
+
+   for ( i = 0; i < SIZE512; i++ )
+   {
+      ctx->chaining[i] = _mm_setzero_si128();
+      ctx->buffer[i] = _mm_setzero_si128();
+   }
+   ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
+   ctx->buf_ptr = 0;
+   ctx->rem_ptr = 0;
+
+   const int len = (int)databitlen / 128;
+   const int hashlen_m128i = ctx->hashlen / 16;   // bytes to __m128i
+   const int hash_offset = SIZE512 - hashlen_m128i;
+   int rem = ctx->rem_ptr;
+   uint64_t blocks = len / SIZE512;
+   __m128i* in = (__m128i*)input;
+
+   // --- update ---
+
+   // digest any full blocks, process directly from input
+   for ( i = 0; i < blocks; i++ )
+      TF1024( ctx->chaining, &in[ i * SIZE512 ] );
+   ctx->buf_ptr = blocks * SIZE512;
+
+   // copy any remaining data to buffer, it may already contain data
+   // from a previous update for a midstate precalc
+   for ( i = 0; i < len % SIZE512; i++ )
+      ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
+   i += rem;     // use i as rem_ptr in final
+
+   //--- final ---
+
+   blocks++;      // adjust for final block
+
+   if ( i == len -1 )
+   {
+      // only 128 bits left in buffer, all padding at once
+      ctx->buffer[i] = _mm_set_epi8( blocks,0,0,0, 0,0,0,0,
+                                           0,0,0,0, 0,0,0,0x80 );
+   }
+   else
+   {
+      // add first padding
+      ctx->buffer[i] = _mm_set_epi8( 0,0,0,0, 0,0,0,0,
+                                     0,0,0,0, 0,0,0,0x80 );
+      // add zero padding
+      for ( i += 1; i < SIZE512 - 1; i++ )
+         ctx->buffer[i] = _mm_setzero_si128();
+
+      // add length padding, second last byte is zero unless blocks > 255
+      ctx->buffer[i] = _mm_set_epi8( blocks, blocks>>8, 0,0, 0,0,0,0,
+                                          0, 0 ,0,0, 0,0,0,0 );
+   }
+
+   // digest final padding block and do output transform
+   TF1024( ctx->chaining, ctx->buffer );
+
+   OF1024( ctx->chaining );
+
+   // store hash result in output
+   for ( i = 0; i < hashlen_m128i; i++ )
+      casti_m128i( output, i ) = ctx->chaining[ hash_offset + i ];
+
+   return 0;
+}
+
 HashReturn_gr update_and_final_groestl( hashState_groestl* ctx, void* output,
                 const void* input, DataLength_gr databitlen )
 {
@@ -230,6 +311,7 @@ HashReturn_gr update_and_final_groestl( hashState_groestl* ctx, void* output,
 
    // digest final padding block and do output transform
    TF1024( ctx->chaining, ctx->buffer );
 
    OF1024( ctx->chaining );
 
    // store hash result in output

@@ -87,5 +87,6 @@ HashReturn_gr final_groestl( hashState_groestl*, void* );
 
 HashReturn_gr update_and_final_groestl( hashState_groestl*, void*,
                                         const void*, DataLength_gr );
+int groestl512_full( hashState_groestl*, void*, const void*, uint64_t );
 
 #endif /* __hash_h */
@@ -11,7 +11,7 @@
 #include "miner.h"
 #include "simd-utils.h"
 
-#ifndef NO_AES_NI
+#ifdef __AES__
 
 #include "groestl-version.h"
 
@@ -86,8 +86,11 @@ HashReturn_gr reinit_groestl256(hashState_groestl256* ctx)
      ctx->chaining[i] = _mm_setzero_si128();
      ctx->buffer[i] = _mm_setzero_si128();
   }
-  ((u64*)ctx->chaining)[COLS-1] = U64BIG((u64)LENGTH);
-  INIT256(ctx->chaining);
+  ctx->chaining[ 3 ] = m128_const_64( 0, 0x0100000000000000 );
+
+  // ((u64*)ctx->chaining)[COLS-1] = U64BIG((u64)LENGTH);
+  // INIT256(ctx->chaining);
   ctx->buf_ptr = 0;
   ctx->rem_ptr = 0;
 
@@ -93,9 +93,6 @@ typedef enum
 typedef struct {
     __attribute__ ((aligned (32))) __m128i chaining[SIZE256];
     __attribute__ ((aligned (32))) __m128i buffer[SIZE256];
-//    __attribute__ ((aligned (32))) u64 chaining[SIZE/8];      /* actual state */
-//    __attribute__ ((aligned (32))) BitSequence_gr buffer[SIZE];  /* data buffer */
-//    u64 block_counter;        /* message block counter */
     int hashlen;           // bytes
     int blk_count;
     int buf_ptr;           /* data buffer pointer */
algo/groestl/groestl-4way.c (new file, 64 lines)
@@ -0,0 +1,64 @@
+#include "groestl-gate.h"
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+
+#if defined(GROESTL_4WAY_VAES)
+
+#include "groestl512-hash-4way.h"
+
+void groestl_4way_hash( void *output, const void *input )
+{
+     uint32_t hash[16*4] __attribute__ ((aligned (128)));
+     groestl512_4way_context ctx;
+
+     groestl512_4way_init( &ctx, 64 );
+     groestl512_4way_update_close( &ctx, hash, input, 640 );
+
+     groestl512_4way_init( &ctx, 64 );
+     groestl512_4way_update_close( &ctx, hash, hash, 512 );
+
+     dintrlv_4x128( output, output+32, output+64, output+96, hash, 256 );
+}
+
+int scanhash_groestl_4way( struct work *work, uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr )
+{
+     uint32_t hash[8*4] __attribute__ ((aligned (128)));
+     uint32_t vdata[24*4] __attribute__ ((aligned (64)));
+     uint32_t *pdata = work->data;
+     uint32_t *ptarget = work->target;
+     uint32_t n = pdata[19];
+     const uint32_t first_nonce = pdata[19];
+     const uint32_t last_nonce = max_nonce - 4;
+     uint32_t *noncep = vdata + 64+3;   // 4*16 + 3
+     int thr_id = mythr->id;
+     const uint32_t Htarg = ptarget[7];
+
+     mm512_bswap32_intrlv80_4x128( vdata, pdata );
+
+     do
+     {
+        be32enc( noncep,    n   );
+        be32enc( noncep+ 4, n+1 );
+        be32enc( noncep+ 8, n+2 );
+        be32enc( noncep+12, n+3 );
+
+        groestl_4way_hash( hash, vdata );
+        pdata[19] = n;
+
+        for ( int lane = 0; lane < 4; lane++ )
+        if ( ( hash+(lane<<3) )[7] <= Htarg )
+        if ( fulltest( hash+(lane<<3), ptarget) && !opt_benchmark )
+        {
+           pdata[19] = n + lane;
+           submit_lane_solution( work, hash+(lane<<3), mythr, lane );
+        }
+        n += 4;
+     } while ( ( n < last_nonce ) && !work_restart[thr_id].restart );
+     *hashes_done = n - first_nonce;
+     return 0;
+}
+
+#endif
algo/groestl/groestl-gate.c (new file, 23 lines)
@@ -0,0 +1,23 @@
+#include "groestl-gate.h"
+
+bool register_dmd_gr_algo( algo_gate_t *gate )
+{
+#if defined (GROESTL_4WAY_VAES)
+  gate->scanhash  = (void*)&scanhash_groestl_4way;
+  gate->hash      = (void*)&groestl_4way_hash;
+#else
+  init_groestl_ctx();
+  gate->scanhash  = (void*)&scanhash_groestl;
+  gate->hash      = (void*)&groestlhash;
+#endif
+  gate->optimizations = AES_OPT | VAES_OPT;
+  return true;
+};
+
+bool register_groestl_algo( algo_gate_t* gate )
+{
+  register_dmd_gr_algo( gate );
+  gate->gen_merkle_root = (void*)&SHA256_gen_merkle_root;
+  return true;
+};
algo/groestl/groestl-gate.h (new file, 31 lines)
@@ -0,0 +1,31 @@
+#ifndef GROESTL_GATE_H__
+#define GROESTL_GATE_H__ 1
+
+#include "algo-gate-api.h"
+#include <stdint.h>
+
+#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define GROESTL_4WAY_VAES 1
+#endif
+
+bool register_dmd_gr_algo( algo_gate_t* gate );
+
+bool register_groestl_algo( algo_gate_t* gate );
+
+#if defined(GROESTL_4WAY_VAES)
+
+void groestl_4way_hash( void *state, const void *input );
+int scanhash_groestl_4way( struct work *work, uint32_t max_nonce,
+                           uint64_t *hashes_done, struct thr_info *mythr );
+
+#else
+
+void groestlhash( void *state, const void *input );
+int scanhash_groestl( struct work *work, uint32_t max_nonce,
+                      uint64_t *hashes_done, struct thr_info *mythr );
+void init_groestl_ctx();
+
+#endif
+
+#endif
@@ -1,22 +1,20 @@
-#include "algo-gate-api.h"
+#include "groestl-gate.h"
 
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
 
-#ifdef NO_AES_NI
-  #include "sph_groestl.h"
-#else
+#ifdef __AES__
   #include "algo/groestl/aes_ni/hash-groestl.h"
+#else
+  #include "sph_groestl.h"
 #endif
 
 typedef struct
 {
-#ifdef NO_AES_NI
-    sph_groestl512_context groestl1, groestl2;
-#else
+#ifdef __AES__
     hashState_groestl groestl1, groestl2;
+#else
+    sph_groestl512_context groestl1, groestl2;
 #endif
 
 } groestl_ctx_holder;
@@ -25,12 +23,12 @@ static groestl_ctx_holder groestl_ctx;
 
 void init_groestl_ctx()
 {
-#ifdef NO_AES_NI
-     sph_groestl512_init( &groestl_ctx.groestl1 );
-     sph_groestl512_init( &groestl_ctx.groestl2 );
-#else
+#ifdef __AES__
     init_groestl( &groestl_ctx.groestl1, 64 );
    init_groestl( &groestl_ctx.groestl2, 64 );
+#else
+     sph_groestl512_init( &groestl_ctx.groestl1 );
+     sph_groestl512_init( &groestl_ctx.groestl2 );
 #endif
 }
 
@@ -40,18 +38,18 @@ void groestlhash( void *output, const void *input )
     groestl_ctx_holder ctx __attribute__ ((aligned (64)));
     memcpy( &ctx, &groestl_ctx, sizeof(groestl_ctx) );
 
-#ifdef NO_AES_NI
-    sph_groestl512(&ctx.groestl1, input, 80);
-    sph_groestl512_close(&ctx.groestl1, hash);
-
-    sph_groestl512(&ctx.groestl2, hash, 64);
-    sph_groestl512_close(&ctx.groestl2, hash);
-#else
+#ifdef __AES__
     update_and_final_groestl( &ctx.groestl1, (char*)hash,
                               (const char*)input, 640 );
 
     update_and_final_groestl( &ctx.groestl2, (char*)hash,
                               (const char*)hash, 512 );
+#else
+    sph_groestl512(&ctx.groestl1, input, 80);
+    sph_groestl512_close(&ctx.groestl1, hash);
+
+    sph_groestl512(&ctx.groestl2, hash, 64);
+    sph_groestl512_close(&ctx.groestl2, hash);
 #endif
     memcpy(output, hash, 32);
 }
@@ -78,15 +76,12 @@ int scanhash_groestl( struct work *work, uint32_t max_nonce,
        groestlhash(hash, endiandata);
 
        if (hash[7] <= Htarg )
-       if ( fulltest(hash, ptarget))
+       if ( fulltest(hash, ptarget) && !opt_benchmark )
        {
           pdata[19] = nonce;
-          *hashes_done = pdata[19] - first_nonce;
-          return 1;
+          submit_solution( work, hash, mythr );
       }
 
       nonce++;
 
   } while (nonce < max_nonce && !work_restart[thr_id].restart);
 
   pdata[19] = nonce;
@@ -94,21 +89,3 @@ int scanhash_groestl( struct work *work, uint32_t max_nonce,
   return 0;
 }
-
-bool register_dmd_gr_algo( algo_gate_t* gate )
-{
-  init_groestl_ctx();
-  gate->optimizations = SSE2_OPT | AES_OPT;
-  gate->scanhash  = (void*)&scanhash_groestl;
-  gate->hash      = (void*)&groestlhash;
-  gate->get_max64 = (void*)&get_max64_0x3ffff;
-  opt_target_factor = 256.0;
-  return true;
-};
-
-bool register_groestl_algo( algo_gate_t* gate )
-{
-  register_dmd_gr_algo( gate );
-  gate->gen_merkle_root = (void*)&SHA256_gen_merkle_root;
-  return true;
-};
algo/groestl/groestl256-hash-4way.c (new file, 109 lines)
@@ -0,0 +1,109 @@
+/* hash.c     Aug 2011
+ * groestl512-hash-4way https://github.com/JayDDee/cpuminer-opt 2019-12.
+ *
+ * Groestl implementation for different versions.
+ * Author: Krystian Matusiewicz, Günther A. Roland, Martin Schläffer
+ *
+ * This code is placed in the public domain
+ */
+
+// Optimized for hash and data length that are integrals of __m128i
+
+
+#include <memory.h>
+#include "groestl256-intr-4way.h"
+#include "miner.h"
+#include "simd-utils.h"
+
+#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+
+int groestl256_4way_init( groestl256_4way_context* ctx, uint64_t hashlen )
+{
+   int i;
+
+   ctx->hashlen = hashlen;
+   SET_CONSTANTS();
+
+   if (ctx->chaining == NULL || ctx->buffer == NULL)
+        return 1;
+
+   for ( i = 0; i < SIZE256; i++ )
+   {
+      ctx->chaining[i] = m512_zero;
+      ctx->buffer[i] = m512_zero;
+   }
+
+   // The only non-zero in the IV is len. It can be hard coded.
+   ctx->chaining[ 3 ] = m512_const2_64( 0, 0x0100000000000000 );
+//   uint64_t len = U64BIG((uint64_t)LENGTH);
+//   ctx->chaining[ COLS/2 -1 ] = _mm512_set4_epi64( len, 0, len, 0 );
+//   INIT256_4way(ctx->chaining);
+
+   ctx->buf_ptr = 0;
+   ctx->rem_ptr = 0;
+
+   return 0;
+}
+
+int groestl256_4way_update_close( groestl256_4way_context* ctx, void* output,
+                                  const void* input, uint64_t databitlen )
+{
+   const int len = (int)databitlen / 128;
+   const int hashlen_m128i = ctx->hashlen / 16;   // bytes to __m128i
+   const int hash_offset = SIZE256 - hashlen_m128i;
+   int rem = ctx->rem_ptr;
+   int blocks = len / SIZE256;
+   __m512i* in = (__m512i*)input;
+   int i;
+
+   // --- update ---
+
+   // digest any full blocks, process directly from input
+   for ( i = 0; i < blocks; i++ )
+      TF512_4way( ctx->chaining, &in[ i * SIZE256 ] );
+   ctx->buf_ptr = blocks * SIZE256;
+
+   // copy any remaining data to buffer, it may already contain data
+   // from a previous update for a midstate precalc
+   for ( i = 0; i < len % SIZE256; i++ )
+       ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
+   i += rem;    // use i as rem_ptr in final
+
+   //--- final ---
+
+   blocks++;       // adjust for final block
+
+   if ( i == SIZE256 - 1 )
+   {
+       // only 1 vector left in buffer, all padding at once
+       ctx->buffer[i] = m512_const1_128( _mm_set_epi8(
+                  blocks, blocks>>8,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0x80 ) );
+   }
+   else
+   {
+       // add first padding
+       ctx->buffer[i] = m512_const4_64( 0, 0x80, 0, 0x80 );
+       // add zero padding
+       for ( i += 1; i < SIZE256 - 1; i++ )
+           ctx->buffer[i] = m512_zero;
+
+       // add length padding, second last byte is zero unless blocks > 255
+       ctx->buffer[i] = m512_const1_128( _mm_set_epi8(
+                 blocks, blocks>>8, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0 ) );
+   }
+
+   // digest final padding block and do output transform
+   TF512_4way( ctx->chaining, ctx->buffer );
+
+   OF512_4way( ctx->chaining );
+
+   // store hash result in output
+   for ( i = 0; i < hashlen_m128i; i++ )
+      casti_m512i( output, i ) = ctx->chaining[ hash_offset + i ];
+
+   return 0;
+}
+
+#endif // VAES
+
algo/groestl/groestl256-hash-4way.h (new file, 75 lines)
@@ -0,0 +1,75 @@
+/* hash.h     Aug 2011
+ *
+ * Groestl implementation for different versions.
+ * Author: Krystian Matusiewicz, Günther A. Roland, Martin Schläffer
+ *
+ * This code is placed in the public domain
+ */
+
+#if !defined(GROESTL256_HASH_4WAY_H__)
+#define GROESTL256_HASH_4WAY_H__ 1
+
+#include "simd-utils.h"
+#include <immintrin.h>
+#include <stdint.h>
+#include <stdio.h>
+#if defined(_WIN64) || defined(__WINDOWS__)
+#include <windows.h>
+#endif
+#include <stdlib.h>
+
+#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#define LENGTH (256)
+
+//#include "brg_endian.h"
+//#define NEED_UINT_64T
+//#include "algo/sha/brg_types.h"
+
+/* some sizes (number of bytes) */
+#define ROWS (8)
+#define LENGTHFIELDLEN (ROWS)
+#define COLS512 (8)
+//#define COLS1024 (16)
+#define SIZE_512 ((ROWS)*(COLS512))
+//#define SIZE_1024 ((ROWS)*(COLS1024))
+#define ROUNDS512 (10)
+//#define ROUNDS1024 (14)
+
+//#if LENGTH<=256
+#define COLS (COLS512)
+#define SIZE (SIZE512)
+#define ROUNDS (ROUNDS512)
+//#else
+//#define COLS (COLS1024)
+//#define SIZE (SIZE1024)
+//#define ROUNDS (ROUNDS1024)
+//#endif
+
+#define SIZE256 (SIZE_512/16)
+
+typedef struct {
+   __attribute__ ((aligned (128))) __m512i chaining[SIZE256];
+   __attribute__ ((aligned (64))) __m512i buffer[SIZE256];
+   int hashlen;       // byte
+   int blk_count;     // SIZE_m128i
+   int buf_ptr;       // __m128i offset
+   int rem_ptr;
+   int databitlen;    // bits
+} groestl256_4way_context;
+
+
+int groestl256_4way_init( groestl256_4way_context*, uint64_t );
+
+//int reinit_groestl( hashState_groestl* );
+
+//int groestl512_4way_update( groestl256_4way_context*, const void*,
+//                            uint64_t );
+//int groestl512_4way_close( groestl512_4way_context*, void* );
+
+int groestl256_4way_update_close( groestl256_4way_context*, void*,
+                                  const void*, uint64_t );
+
+#endif
+#endif
algo/groestl/groestl256-intr-4way.h (new file, 526 lines; truncated below)
@@ -0,0 +1,526 @@
|
/* groestl-intr-aes.h Aug 2011
|
||||||
|
*
|
||||||
|
* Groestl implementation with intrinsics using ssse3, sse4.1, and aes
|
||||||
|
* instructions.
|
||||||
|
* Author: Günther A. Roland, Martin Schläffer, Krystian Matusiewicz
|
||||||
|
*
|
||||||
|
* This code is placed in the public domain
|
||||||
|
*/
|
||||||
|
|
||||||
|
|
||||||
|
#if !defined(GROESTL256_INTR_4WAY_H__)
|
||||||
|
#define GROESTL256_INTR_4WAY_H__ 1
|
||||||
|
|
||||||
|
#include "groestl256-hash-4way.h"
|
||||||
|
|
||||||
|
#if defined(__VAES__)
|
||||||
|
|
||||||
|
/* global constants */
|
||||||
|
__m512i ROUND_CONST_Lx;
|
||||||
|
__m512i ROUND_CONST_L0[ROUNDS512];
|
||||||
|
__m512i ROUND_CONST_L7[ROUNDS512];
|
||||||
|
//__m512i ROUND_CONST_P[ROUNDS1024];
|
||||||
|
//__m512i ROUND_CONST_Q[ROUNDS1024];
|
||||||
|
__m512i TRANSP_MASK;
|
||||||
|
__m512i SUBSH_MASK[8];
|
||||||
|
__m512i ALL_1B;
|
||||||
|
__m512i ALL_FF;
|
||||||
|
|
||||||
|
#define tos(a) #a
|
||||||
|
#define tostr(a) tos(a)
|
||||||
|
|
||||||
|
/* xmm[i] will be multiplied by 2
|
||||||
|
* xmm[j] will be lost
|
||||||
|
* xmm[k] has to be all 0x1b */
|
||||||
|
#define MUL2(i, j, k){\
|
||||||
|
j = _mm512_xor_si512(j, j);\
|
||||||
|
j = _mm512_movm_epi8( _mm512_cmpgt_epi8_mask(j, i) );\
|
||||||
|
i = _mm512_add_epi8(i, i);\
|
||||||
|
j = _mm512_and_si512(j, k);\
|
||||||
|
i = _mm512_xor_si512(i, j);\
|
||||||
|
}

/**/

/* Yet another implementation of MixBytes.
   This time we use the formulae (3) from the paper "Byte Slicing Groestl".
   Input: a0, ..., a7
   Output: b0, ..., b7 = MixBytes(a0,...,a7).
   but we use the relations:
   t_i = a_i + a_{i+3}
   x_i = t_i + t_{i+3}
   y_i = t_i + t_{i+2} + a_{i+6}
   z_i = 2*x_i
   w_i = z_i + y_{i+4}
   v_i = 2*w_i
   b_i = v_{i+3} + y_{i+4}
   We keep building b_i in registers xmm8..xmm15 by first building y_{i+4} there
   and then adding v_i computed in the meantime in registers xmm0..xmm7.
   We almost fit into 16 registers, need only 3 spills to memory.
   This implementation costs 7.7 c/b giving total speed on SNB: 10.7c/b.
   K. Matusiewicz, 2011/05/29 */
#define MixBytes(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* t_i = a_i + a_{i+1} */\
  b6 = a0;\
  b7 = a1;\
  a0 = _mm512_xor_si512(a0, a1);\
  b0 = a2;\
  a1 = _mm512_xor_si512(a1, a2);\
  b1 = a3;\
  a2 = _mm512_xor_si512(a2, a3);\
  b2 = a4;\
  a3 = _mm512_xor_si512(a3, a4);\
  b3 = a5;\
  a4 = _mm512_xor_si512(a4, a5);\
  b4 = a6;\
  a5 = _mm512_xor_si512(a5, a6);\
  b5 = a7;\
  a6 = _mm512_xor_si512(a6, a7);\
  a7 = _mm512_xor_si512(a7, b6);\
  \
  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
  b0 = _mm512_xor_si512(b0, a4);\
  b6 = _mm512_xor_si512(b6, a4);\
  b1 = _mm512_xor_si512(b1, a5);\
  b7 = _mm512_xor_si512(b7, a5);\
  b2 = _mm512_xor_si512(b2, a6);\
  b0 = _mm512_xor_si512(b0, a6);\
  /* spill values y_4, y_5 to memory */\
  TEMP0 = b0;\
  b3 = _mm512_xor_si512(b3, a7);\
  b1 = _mm512_xor_si512(b1, a7);\
  TEMP1 = b1;\
  b4 = _mm512_xor_si512(b4, a0);\
  b2 = _mm512_xor_si512(b2, a0);\
  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
  b0 = a0;\
  b5 = _mm512_xor_si512(b5, a1);\
  b3 = _mm512_xor_si512(b3, a1);\
  b1 = a1;\
  b6 = _mm512_xor_si512(b6, a2);\
  b4 = _mm512_xor_si512(b4, a2);\
  TEMP2 = a2;\
  b7 = _mm512_xor_si512(b7, a3);\
  b5 = _mm512_xor_si512(b5, a3);\
  \
  /* compute x_i = t_i + t_{i+3} */\
  a0 = _mm512_xor_si512(a0, a3);\
  a1 = _mm512_xor_si512(a1, a4);\
  a2 = _mm512_xor_si512(a2, a5);\
  a3 = _mm512_xor_si512(a3, a6);\
  a4 = _mm512_xor_si512(a4, a7);\
  a5 = _mm512_xor_si512(a5, b0);\
  a6 = _mm512_xor_si512(a6, b1);\
  a7 = _mm512_xor_si512(a7, TEMP2);\
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
  b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm512_xor_si512(a0, TEMP0);\
  MUL2(a1, b0, b1);\
  a1 = _mm512_xor_si512(a1, TEMP1);\
  MUL2(a2, b0, b1);\
  a2 = _mm512_xor_si512(a2, b2);\
  MUL2(a3, b0, b1);\
  a3 = _mm512_xor_si512(a3, b3);\
  MUL2(a4, b0, b1);\
  a4 = _mm512_xor_si512(a4, b4);\
  MUL2(a5, b0, b1);\
  a5 = _mm512_xor_si512(a5, b5);\
  MUL2(a6, b0, b1);\
  a6 = _mm512_xor_si512(a6, b6);\
  MUL2(a7, b0, b1);\
  a7 = _mm512_xor_si512(a7, b7);\
  \
  /* compute v_i : double w_i */\
  /* add to y_4 y_5 .. v3, v4, ... */\
  MUL2(a0, b0, b1);\
  b5 = _mm512_xor_si512(b5, a0);\
  MUL2(a1, b0, b1);\
  b6 = _mm512_xor_si512(b6, a1);\
  MUL2(a2, b0, b1);\
  b7 = _mm512_xor_si512(b7, a2);\
  MUL2(a5, b0, b1);\
  b2 = _mm512_xor_si512(b2, a5);\
  MUL2(a6, b0, b1);\
  b3 = _mm512_xor_si512(b3, a6);\
  MUL2(a7, b0, b1);\
  b4 = _mm512_xor_si512(b4, a7);\
  MUL2(a3, b0, b1);\
  MUL2(a4, b0, b1);\
  b0 = TEMP0;\
  b1 = TEMP1;\
  b0 = _mm512_xor_si512(b0, a3);\
  b1 = _mm512_xor_si512(b1, a4);\
}/*MixBytes*/

// calculate the round constants separately and load at startup

#define SET_CONSTANTS(){\
  ALL_1B = _mm512_set1_epi32( 0x1b1b1b1b );\
  TRANSP_MASK = _mm512_set_epi32( \
        0x3f373b33, 0x3e363a32, 0x3d353931, 0x3c343830, \
        0x2f272b23, 0x2e262a22, 0x2d252921, 0x2c242820, \
        0x1f171b13, 0x1e161a12, 0x1d151911, 0x1c141810, \
        0x0f070b03, 0x0e060a02, 0x0d050901, 0x0c040800 ); \
  SUBSH_MASK[0] = _mm512_set_epi32( \
        0x33363a3d, 0x38323539, 0x3c3f3134, 0x373b3e30, \
        0x23262a2d, 0x28222529, 0x2c2f2124, 0x272b2e20, \
        0x13161a1d, 0x18121519, 0x1c1f1114, 0x171b1e10, \
        0x03060a0d, 0x08020509, 0x0c0f0104, 0x070b0e00 ); \
  SUBSH_MASK[1] = _mm512_set_epi32( \
        0x34373c3f, 0x3a33363b, 0x3e393235, 0x303d3831, \
        0x24272c2f, 0x2a23262b, 0x2e292225, 0x202d2821, \
        0x14171c1f, 0x1a13161b, 0x1e191215, 0x101d1811, \
        0x04070c0f, 0x0a03060b, 0x0e090205, 0x000d0801 );\
  SUBSH_MASK[2] = _mm512_set_epi32( \
        0x35303e39, 0x3c34373d, 0x383b3336, 0x313f3a32, \
        0x25202e29, 0x2c24272d, 0x282b2326, 0x212f2a22, \
        0x15101e19, 0x1c14171d, 0x181b1316, 0x111f1a12, \
        0x05000e09, 0x0c04070d, 0x080b0306, 0x010f0a02 );\
  SUBSH_MASK[3] = _mm512_set_epi32( \
        0x3631383b, 0x3e35303f, 0x3a3d3437, 0x32393c33, \
        0x2621282b, 0x2e25202f, 0x2a2d2427, 0x22292c23, \
        0x1611181b, 0x1e15101f, 0x1a1d1417, 0x12191c13, \
        0x0601080b, 0x0e05000f, 0x0a0d0407, 0x02090c03 );\
  SUBSH_MASK[4] = _mm512_set_epi32( \
        0x3732393c, 0x3f363138, 0x3b3e3530, 0x333a3d34, \
        0x2722292c, 0x2f262128, 0x2b2e2520, 0x232a2d24, \
        0x1712191c, 0x1f161118, 0x1b1e1510, 0x131a1d14, \
        0x0702090c, 0x0f060108, 0x0b0e0500, 0x030a0d04 );\
  SUBSH_MASK[5] = _mm512_set_epi32( \
        0x30333b3e, 0x3937323a, 0x3d383631, 0x343c3f35, \
        0x20232b2e, 0x2927222a, 0x2d282621, 0x242c2f25, \
        0x10131b1e, 0x1917121a, 0x1d181611, 0x141c1f15, \
        0x00030b0e, 0x0907020a, 0x0d080601, 0x040c0f05 );\
  SUBSH_MASK[6] = _mm512_set_epi32( \
        0x31343d38, 0x3b30333c, 0x3f3a3732, 0x353e3936, \
        0x21242d28, 0x2b20232c, 0x2f2a2722, 0x252e2926, \
        0x11141d18, 0x1b10131c, 0x1f1a1712, 0x151e1916, \
        0x01040d08, 0x0b00030c, 0x0f0a0702, 0x050e0906 );\
  SUBSH_MASK[7] = _mm512_set_epi32( \
        0x32353f3a, 0x3d31343e, 0x393c3033, 0x36383b37, \
        0x22252f2a, 0x2d21242e, 0x292c2023, 0x26282b27, \
        0x12151f1a, 0x1d11141e, 0x191c1013, 0x16181b17, \
        0x02050f0a, 0x0d01040e, 0x090c0003, 0x06080b07 );\
  for ( i = 0; i < ROUNDS512; i++ ) \
  {\
     ROUND_CONST_L0[i] = _mm512_set4_epi32( 0xffffffff, 0xffffffff, \
               0x70605040 ^ ( i * 0x01010101 ), 0x30201000 ^ ( i * 0x01010101 ) ); \
     ROUND_CONST_L7[i] = _mm512_set4_epi32( 0x8f9fafbf ^ ( i * 0x01010101 ), \
               0xcfdfefff ^ ( i * 0x01010101 ), 0x00000000, 0x00000000 ); \
  }\
  ROUND_CONST_Lx = _mm512_set4_epi32( 0xffffffff, 0xffffffff, \
                                      0x00000000, 0x00000000 ); \
}while(0);\

#define ROUND(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* AddRoundConstant */\
  b1 = ROUND_CONST_Lx;\
  a0 = _mm512_xor_si512( a0, (ROUND_CONST_L0[i]) );\
  a1 = _mm512_xor_si512( a1, b1 );\
  a2 = _mm512_xor_si512( a2, b1 );\
  a3 = _mm512_xor_si512( a3, b1 );\
  a4 = _mm512_xor_si512( a4, b1 );\
  a5 = _mm512_xor_si512( a5, b1 );\
  a6 = _mm512_xor_si512( a6, b1 );\
  a7 = _mm512_xor_si512( a7, (ROUND_CONST_L7[i]) );\
  \
  /* ShiftBytes + SubBytes (interleaved) */\
  b0 = _mm512_xor_si512( b0, b0 );\
  a0 = _mm512_shuffle_epi8( a0, (SUBSH_MASK[0]) );\
  a0 = _mm512_aesenclast_epi128( a0, b0 );\
  a1 = _mm512_shuffle_epi8( a1, (SUBSH_MASK[1]) );\
  a1 = _mm512_aesenclast_epi128( a1, b0 );\
  a2 = _mm512_shuffle_epi8( a2, (SUBSH_MASK[2]) );\
  a2 = _mm512_aesenclast_epi128( a2, b0 );\
  a3 = _mm512_shuffle_epi8( a3, (SUBSH_MASK[3]) );\
  a3 = _mm512_aesenclast_epi128( a3, b0 );\
  a4 = _mm512_shuffle_epi8( a4, (SUBSH_MASK[4]) );\
  a4 = _mm512_aesenclast_epi128( a4, b0 );\
  a5 = _mm512_shuffle_epi8( a5, (SUBSH_MASK[5]) );\
  a5 = _mm512_aesenclast_epi128( a5, b0 );\
  a6 = _mm512_shuffle_epi8( a6, (SUBSH_MASK[6]) );\
  a6 = _mm512_aesenclast_epi128( a6, b0 );\
  a7 = _mm512_shuffle_epi8( a7, (SUBSH_MASK[7]) );\
  a7 = _mm512_aesenclast_epi128( a7, b0 );\
  \
  /* MixBytes */\
  MixBytes(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7);\
  \
}

/* 10 rounds, P and Q in parallel */
#define ROUNDS_P_Q(){\
  ROUND(0, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
  ROUND(1, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
  ROUND(2, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
  ROUND(3, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
  ROUND(4, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
  ROUND(5, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
  ROUND(6, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
  ROUND(7, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
  ROUND(8, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
  ROUND(9, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
}

/* Matrix Transpose Step 1
 * input is a 512-bit state with two columns in one xmm
 * output is a 512-bit state with two rows in one xmm
 * inputs: i0-i3
 * outputs: i0, o1-o3
 * clobbers: t0
 */
#define Matrix_Transpose_A(i0, i1, i2, i3, o1, o2, o3, t0){\
  t0 = TRANSP_MASK;\
  \
  i0 = _mm512_shuffle_epi8( i0, t0 );\
  i1 = _mm512_shuffle_epi8( i1, t0 );\
  i2 = _mm512_shuffle_epi8( i2, t0 );\
  i3 = _mm512_shuffle_epi8( i3, t0 );\
  \
  o1 = i0;\
  t0 = i2;\
  \
  i0 = _mm512_unpacklo_epi16( i0, i1 );\
  o1 = _mm512_unpackhi_epi16( o1, i1 );\
  i2 = _mm512_unpacklo_epi16( i2, i3 );\
  t0 = _mm512_unpackhi_epi16( t0, i3 );\
  \
  i0 = _mm512_shuffle_epi32( i0, 216 );\
  o1 = _mm512_shuffle_epi32( o1, 216 );\
  i2 = _mm512_shuffle_epi32( i2, 216 );\
  t0 = _mm512_shuffle_epi32( t0, 216 );\
  \
  o2 = i0;\
  o3 = o1;\
  \
  i0 = _mm512_unpacklo_epi32( i0, i2 );\
  o1 = _mm512_unpacklo_epi32( o1, t0 );\
  o2 = _mm512_unpackhi_epi32( o2, i2 );\
  o3 = _mm512_unpackhi_epi32( o3, t0 );\
}/**/

/* Matrix Transpose Step 2
 * input are two 512-bit states with two rows in one xmm
 * output are two 512-bit states with one row of each state in one xmm
 * inputs: i0-i3 = P, i4-i7 = Q
 * outputs: (i0, o1-o7) = (P|Q)
 * possible reassignments: (output reg = input reg)
 * * i1 -> o3-7
 * * i2 -> o5-7
 * * i3 -> o7
 * * i4 -> o3-7
 * * i5 -> o6-7
 */
#define Matrix_Transpose_B(i0, i1, i2, i3, i4, i5, i6, i7, o1, o2, o3, o4, o5, o6, o7){\
  o1 = i0;\
  o2 = i1;\
  i0 = _mm512_unpacklo_epi64( i0, i4 );\
  o1 = _mm512_unpackhi_epi64( o1, i4 );\
  o3 = i1;\
  o4 = i2;\
  o2 = _mm512_unpacklo_epi64( o2, i5 );\
  o3 = _mm512_unpackhi_epi64( o3, i5 );\
  o5 = i2;\
  o6 = i3;\
  o4 = _mm512_unpacklo_epi64( o4, i6 );\
  o5 = _mm512_unpackhi_epi64( o5, i6 );\
  o7 = i3;\
  o6 = _mm512_unpacklo_epi64( o6, i7 );\
  o7 = _mm512_unpackhi_epi64( o7, i7 );\
}/**/

/* Matrix Transpose Inverse Step 2
 * input are two 512-bit states with one row of each state in one xmm
 * output are two 512-bit states with two rows in one xmm
 * inputs: i0-i7 = (P|Q)
 * outputs: (i0, i2, i4, i6) = P, (o0-o3) = Q
 */
#define Matrix_Transpose_B_INV(i0, i1, i2, i3, i4, i5, i6, i7, o0, o1, o2, o3){\
  o0 = i0;\
  i0 = _mm512_unpacklo_epi64( i0, i1 );\
  o0 = _mm512_unpackhi_epi64( o0, i1 );\
  o1 = i2;\
  i2 = _mm512_unpacklo_epi64( i2, i3 );\
  o1 = _mm512_unpackhi_epi64( o1, i3 );\
  o2 = i4;\
  i4 = _mm512_unpacklo_epi64( i4, i5 );\
  o2 = _mm512_unpackhi_epi64( o2, i5 );\
  o3 = i6;\
  i6 = _mm512_unpacklo_epi64( i6, i7 );\
  o3 = _mm512_unpackhi_epi64( o3, i7 );\
}/**/


/* Matrix Transpose Output Step 2
 * input is one 512-bit state with two rows in one xmm
 * output is one 512-bit state with one row in the low 64-bits of one xmm
 * inputs: i0,i2,i4,i6 = S
 * outputs: (i0-7) = (0|S)
 */
#define Matrix_Transpose_O_B(i0, i1, i2, i3, i4, i5, i6, i7, t0){\
  t0 = _mm512_xor_si512( t0, t0 );\
  i1 = i0;\
  i3 = i2;\
  i5 = i4;\
  i7 = i6;\
  i0 = _mm512_unpacklo_epi64( i0, t0 );\
  i1 = _mm512_unpackhi_epi64( i1, t0 );\
  i2 = _mm512_unpacklo_epi64( i2, t0 );\
  i3 = _mm512_unpackhi_epi64( i3, t0 );\
  i4 = _mm512_unpacklo_epi64( i4, t0 );\
  i5 = _mm512_unpackhi_epi64( i5, t0 );\
  i6 = _mm512_unpacklo_epi64( i6, t0 );\
  i7 = _mm512_unpackhi_epi64( i7, t0 );\
}/**/

/* Matrix Transpose Output Inverse Step 2
 * input is one 512-bit state with one row in the low 64-bits of one xmm
 * output is one 512-bit state with two rows in one xmm
 * inputs: i0-i7 = (0|S)
 * outputs: (i0, i2, i4, i6) = S
 */
#define Matrix_Transpose_O_B_INV(i0, i1, i2, i3, i4, i5, i6, i7){\
  i0 = _mm512_unpacklo_epi64( i0, i1 );\
  i2 = _mm512_unpacklo_epi64( i2, i3 );\
  i4 = _mm512_unpacklo_epi64( i4, i5 );\
  i6 = _mm512_unpacklo_epi64( i6, i7 );\
}/**/


void INIT256_4way( __m512i* chaining )
{
     static __m512i xmm0, xmm2, xmm6, xmm7;
     static __m512i xmm12, xmm13, xmm14, xmm15;

     /* load IV into registers xmm12 - xmm15 */
     xmm12 = chaining[0];
     xmm13 = chaining[1];
     xmm14 = chaining[2];
     xmm15 = chaining[3];

     /* transform chaining value from column ordering into row ordering */
     /* we put two rows (64 bit) of the IV into one 128-bit XMM register */
     Matrix_Transpose_A(xmm12, xmm13, xmm14, xmm15, xmm2, xmm6, xmm7, xmm0);

     /* store transposed IV */
     chaining[0] = xmm12;
     chaining[1] = xmm2;
     chaining[2] = xmm6;
     chaining[3] = xmm7;
}

void TF512_4way( __m512i* chaining, __m512i* message )
{
     static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
     static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
     static __m512i TEMP0;
     static __m512i TEMP1;
     static __m512i TEMP2;

     /* load message into registers xmm12 - xmm15 */
     xmm12 = message[0];
     xmm13 = message[1];
     xmm14 = message[2];
     xmm15 = message[3];

     /* transform message M from column ordering into row ordering */
     /* we first put two rows (64 bit) of the message into one 128-bit xmm register */
     Matrix_Transpose_A(xmm12, xmm13, xmm14, xmm15, xmm2, xmm6, xmm7, xmm0);

     /* load previous chaining value */
     /* we first put two rows (64 bit) of the CV into one 128-bit xmm register */
     xmm8 = chaining[0];
     xmm0 = chaining[1];
     xmm4 = chaining[2];
     xmm5 = chaining[3];

     /* xor message to CV get input of P */
     /* result: CV+M in xmm8, xmm0, xmm4, xmm5 */
     xmm8 = _mm512_xor_si512( xmm8, xmm12 );
     xmm0 = _mm512_xor_si512( xmm0, xmm2 );
     xmm4 = _mm512_xor_si512( xmm4, xmm6 );
     xmm5 = _mm512_xor_si512( xmm5, xmm7 );

     /* there are now 2 rows of the Groestl state (P and Q) in each xmm register */
     /* unpack to get 1 row of P (64 bit) and Q (64 bit) into one xmm register */
     /* result: the 8 rows of P and Q in xmm8 - xmm12 */
     Matrix_Transpose_B(xmm8, xmm0, xmm4, xmm5, xmm12, xmm2, xmm6, xmm7, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);

     /* compute the two permutations P and Q in parallel */
     ROUNDS_P_Q();

     /* unpack again to get two rows of P or two rows of Q in one xmm register */
     Matrix_Transpose_B_INV(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3);

     /* xor output of P and Q */
     /* result: P(CV+M)+Q(M) in xmm0...xmm3 */
     xmm0 = _mm512_xor_si512( xmm0, xmm8 );
     xmm1 = _mm512_xor_si512( xmm1, xmm10 );
     xmm2 = _mm512_xor_si512( xmm2, xmm12 );
     xmm3 = _mm512_xor_si512( xmm3, xmm14 );

     /* xor CV (feed-forward) */
     /* result: P(CV+M)+Q(M)+CV in xmm0...xmm3 */
     xmm0 = _mm512_xor_si512( xmm0, (chaining[0]) );
     xmm1 = _mm512_xor_si512( xmm1, (chaining[1]) );
     xmm2 = _mm512_xor_si512( xmm2, (chaining[2]) );
     xmm3 = _mm512_xor_si512( xmm3, (chaining[3]) );

     /* store CV */
     chaining[0] = xmm0;
     chaining[1] = xmm1;
     chaining[2] = xmm2;
     chaining[3] = xmm3;

     return;
}

void OF512_4way( __m512i* chaining )
{
     static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
     static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
     static __m512i TEMP0;
     static __m512i TEMP1;
     static __m512i TEMP2;

     /* load CV into registers xmm8, xmm10, xmm12, xmm14 */
     xmm8 = chaining[0];
     xmm10 = chaining[1];
     xmm12 = chaining[2];
     xmm14 = chaining[3];

     /* there are now 2 rows of the CV in one xmm register */
     /* unpack to get 1 row of P (64 bit) into one half of an xmm register */
     /* result: the 8 input rows of P in xmm8 - xmm15 */
     Matrix_Transpose_O_B(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0);

     /* compute the permutation P */
     /* result: the output of P(CV) in xmm8 - xmm15 */
     ROUNDS_P_Q();

     /* unpack again to get two rows of P in one xmm register */
     /* result: P(CV) in xmm8, xmm10, xmm12, xmm14 */
     Matrix_Transpose_O_B_INV(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);

     /* xor CV to P output (feed-forward) */
     /* result: P(CV)+CV in xmm8, xmm10, xmm12, xmm14 */
     xmm8 = _mm512_xor_si512( xmm8, (chaining[0]) );
     xmm10 = _mm512_xor_si512( xmm10, (chaining[1]) );
     xmm12 = _mm512_xor_si512( xmm12, (chaining[2]) );
     xmm14 = _mm512_xor_si512( xmm14, (chaining[3]) );

     /* transform state back from row ordering into column ordering */
     /* result: final hash value in xmm9, xmm11 */
     Matrix_Transpose_A(xmm8, xmm10, xmm12, xmm14, xmm4, xmm9, xmm11, xmm0);

     /* we only need to return the truncated half of the state */
     chaining[2] = xmm9;
     chaining[3] = xmm11;
}

#endif   // VAES
#endif   // GROESTL256_INTR_4WAY_H__
146  algo/groestl/groestl512-hash-4way.c  Normal file
@@ -0,0 +1,146 @@
/* hash.c     Aug 2011
 * groestl512-hash-4way https://github.com/JayDDee/cpuminer-opt 2019-12.
 *
 * Groestl implementation for different versions.
 * Author: Krystian Matusiewicz, Günther A. Roland, Martin Schläffer
 *
 * This code is placed in the public domain
 */

// Optimized for hash and data lengths that are integral multiples of __m128i

#include <memory.h>
#include "groestl512-intr-4way.h"
#include "miner.h"
#include "simd-utils.h"

#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

int groestl512_4way_init( groestl512_4way_context* ctx, uint64_t hashlen )
{
   int i;

   SET_CONSTANTS();

   if (ctx->chaining == NULL || ctx->buffer == NULL)
      return 1;

   memset_zero_512( ctx->chaining, SIZE512 );
   memset_zero_512( ctx->buffer, SIZE512 );

   // The only non-zero in the IV is len. It can be hard coded.
   ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );

   ctx->buf_ptr = 0;
   ctx->rem_ptr = 0;

   return 0;
}

int groestl512_4way_update_close( groestl512_4way_context* ctx, void* output,
                                  const void* input, uint64_t databitlen )
{
   const int len = (int)databitlen / 128;
   const int hashlen_m128i = 64 / 16;   // bytes to __m128i
   const int hash_offset = SIZE512 - hashlen_m128i;
   int rem = ctx->rem_ptr;
   int blocks = len / SIZE512;
   __m512i* in = (__m512i*)input;
   int i;

   // --- update ---

   for ( i = 0; i < blocks; i++ )
      TF1024_4way( ctx->chaining, &in[ i * SIZE512 ] );
   ctx->buf_ptr = blocks * SIZE512;

   for ( i = 0; i < len % SIZE512; i++ )
       ctx->buffer[ rem + i ] = in[ ctx->buf_ptr + i ];
   i += rem;

   // --- final ---

   blocks++;      // adjust for final block

   if ( i == SIZE512 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
       ctx->buffer[i] = m512_const1_128( _mm_set_epi8(
                    blocks, blocks>>8,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0x80 ) );
   }
   else
   {
       ctx->buffer[i] = m512_const4_64( 0, 0x80, 0, 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
           ctx->buffer[i] = m512_zero;
       ctx->buffer[i] = m512_const1_128( _mm_set_epi8(
                  blocks, blocks>>8, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0, 0,0 ) );
   }

   TF1024_4way( ctx->chaining, ctx->buffer );
   OF1024_4way( ctx->chaining );

   for ( i = 0; i < hashlen_m128i; i++ )
      casti_m512i( output, i ) = ctx->chaining[ hash_offset + i ];

   return 0;
}

int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
                          const void* input, uint64_t datalen )
{
   const int len = (int)datalen >> 4;
   const int hashlen_m128i = 64 >> 4;   // bytes to __m128i
   const int hash_offset = SIZE512 - hashlen_m128i;
   uint64_t blocks = len / SIZE512;
   __m512i* in = (__m512i*)input;
   int i;

   // --- init ---

   SET_CONSTANTS();
   memset_zero_512( ctx->chaining, SIZE512 );
   memset_zero_512( ctx->buffer, SIZE512 );
   ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );
   ctx->buf_ptr = 0;
   ctx->rem_ptr = 0;

   // --- update ---

   for ( i = 0; i < blocks; i++ )
      TF1024_4way( ctx->chaining, &in[ i * SIZE512 ] );
   ctx->buf_ptr = blocks * SIZE512;

   for ( i = 0; i < len % SIZE512; i++ )
       ctx->buffer[ ctx->rem_ptr + i ] = in[ ctx->buf_ptr + i ];
   i += ctx->rem_ptr;

   // --- close ---

   blocks++;

   if ( i == SIZE512 - 1 )
   {
       // only 1 vector left in buffer, all padding at once
       ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
   }
   else
   {
       ctx->buffer[i] = m512_const4_64( 0, 0x80, 0, 0x80 );
       for ( i += 1; i < SIZE512 - 1; i++ )
          ctx->buffer[i] = m512_zero;
       ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
   }

   TF1024_4way( ctx->chaining, ctx->buffer );
   OF1024_4way( ctx->chaining );

   for ( i = 0; i < hashlen_m128i; i++ )
      casti_m512i( output, i ) = ctx->chaining[ hash_offset + i ];

   return 0;
}

#endif   // VAES
62  algo/groestl/groestl512-hash-4way.h  Normal file
@@ -0,0 +1,62 @@
#if !defined(GROESTL512_HASH_4WAY_H__)
#define GROESTL512_HASH_4WAY_H__ 1

#include "simd-utils.h"
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#if defined(_WIN64) || defined(__WINDOWS__)
#include <windows.h>
#endif
#include <stdlib.h>

#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)

#define LENGTH (512)

/* some sizes (number of bytes) */
#define ROWS (8)
#define LENGTHFIELDLEN (ROWS)
//#define COLS512 (8)
#define COLS1024 (16)
//#define SIZE512 ((ROWS)*(COLS512))
#define SIZE_1024 ((ROWS)*(COLS1024))
//#define ROUNDS512 (10)
#define ROUNDS1024 (14)

//#if LENGTH<=256
//#define COLS (COLS512)
//#define SIZE (SIZE512)
//#define ROUNDS (ROUNDS512)
//#else
#define COLS (COLS1024)
//#define SIZE (SIZE1024)
#define ROUNDS (ROUNDS1024)
//#endif

#define SIZE512 (SIZE_1024/16)

typedef struct {
   __attribute__ ((aligned (128))) __m512i chaining[SIZE512];
   __attribute__ ((aligned (64))) __m512i buffer[SIZE512];
   int blk_count;    // SIZE_m128i
   int buf_ptr;      // __m128i offset
   int rem_ptr;
   int databitlen;   // bits
} groestl512_4way_context;


int groestl512_4way_init( groestl512_4way_context*, uint64_t );

//int reinit_groestl( hashState_groestl* );

int groestl512_4way_update( groestl512_4way_context*, const void*,
                            uint64_t );
int groestl512_4way_close( groestl512_4way_context*, void* );
int groestl512_4way_update_close( groestl512_4way_context*, void*,
                                  const void*, uint64_t );
int groestl512_4way_full( groestl512_4way_context*, void*,
                          const void*, uint64_t );

#endif // VAES
#endif // GROESTL512_HASH_4WAY_H__
654  algo/groestl/groestl512-intr-4way.h  Normal file
@@ -0,0 +1,654 @@
|
|||||||
|
/* groestl-intr-aes.h Aug 2011
|
||||||
|
*
|
||||||
|
* Groestl implementation with intrinsics using ssse3, sse4.1, and aes
|
||||||
|
* instructions.
|
||||||
|
* Author: Günther A. Roland, Martin Schläffer, Krystian Matusiewicz
|
||||||
|
*
|
||||||
|
* This code is placed in the public domain
|
||||||
|
*/
|
||||||
|
|
||||||
|
|
||||||
|
#if !defined(GROESTL512_INTR_4WAY_H__)
|
||||||
|
#define GROESTL512_INTR_4WAY_H__ 1
|
||||||
|
|
||||||
|
#include "groestl512-hash-4way.h"
|
||||||
|
|
||||||
|
#if defined(__VAES__)
|
||||||
|
|
||||||
|
/* global constants */
|
||||||
|
__m512i ROUND_CONST_Lx;
|
||||||
|
//__m128i ROUND_CONST_L0[ROUNDS512];
|
||||||
|
//__m128i ROUND_CONST_L7[ROUNDS512];
|
||||||
|
__m512i ROUND_CONST_P[ROUNDS1024];
|
||||||
|
__m512i ROUND_CONST_Q[ROUNDS1024];
|
||||||
|
__m512i TRANSP_MASK;
|
||||||
|
__m512i SUBSH_MASK[8];
|
||||||
|
__m512i ALL_1B;
|
||||||
|
__m512i ALL_FF;
|
||||||
|
|
||||||
|
#define tos(a) #a
|
||||||
|
#define tostr(a) tos(a)
|
||||||
|
|
||||||
|
/* xmm[i] will be multiplied by 2
|
||||||
|
* xmm[j] will be lost
|
||||||
|
* xmm[k] has to be all 0x1b */
|
||||||
|
#define MUL2(i, j, k){\
|
||||||
|
j = _mm512_xor_si512(j, j);\
|
||||||
|
j = _mm512_movm_epi8( _mm512_cmpgt_epi8_mask(j, i) );\
|
||||||
|
i = _mm512_add_epi8(i, i);\
|
||||||
|
j = _mm512_and_si512(j, k);\
|
||||||
|
i = _mm512_xor_si512(i, j);\
|
||||||
|
}
|
||||||
|
|
||||||
|
/**/
|
||||||
|
|
||||||
|
/* Yet another implementation of MixBytes.
|
||||||
|
This time we use the formulae (3) from the paper "Byte Slicing Groestl".
|
||||||
|
Input: a0, ..., a7
|
||||||
|
Output: b0, ..., b7 = MixBytes(a0,...,a7).
|
||||||
|
but we use the relations:
|
||||||
|
t_i = a_i + a_{i+3}
|
||||||
|
x_i = t_i + t_{i+3}
|
||||||
|
y_i = t_i + t+{i+2} + a_{i+6}
|
||||||
|
z_i = 2*x_i
|
||||||
|
w_i = z_i + y_{i+4}
|
||||||
|
v_i = 2*w_i
|
||||||
|
b_i = v_{i+3} + y_{i+4}
|
||||||
|
We keep building b_i in registers xmm8..xmm15 by first building y_{i+4} there
|
||||||
|
and then adding v_i computed in the meantime in registers xmm0..xmm7.
|
||||||
|
We almost fit into 16 registers, need only 3 spills to memory.
|
||||||
|
This implementation costs 7.7 c/b giving total speed on SNB: 10.7c/b.
|
||||||
|
K. Matusiewicz, 2011/05/29 */

#define MixBytes(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* t_i = a_i + a_{i+1} */\
  b6 = a0;\
  b7 = a1;\
  a0 = _mm512_xor_si512(a0, a1);\
  b0 = a2;\
  a1 = _mm512_xor_si512(a1, a2);\
  b1 = a3;\
  a2 = _mm512_xor_si512(a2, a3);\
  b2 = a4;\
  a3 = _mm512_xor_si512(a3, a4);\
  b3 = a5;\
  a4 = _mm512_xor_si512(a4, a5);\
  b4 = a6;\
  a5 = _mm512_xor_si512(a5, a6);\
  b5 = a7;\
  a6 = _mm512_xor_si512(a6, a7);\
  a7 = _mm512_xor_si512(a7, b6);\
  \
  /* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
  b0 = _mm512_xor_si512(b0, a4);\
  b6 = _mm512_xor_si512(b6, a4);\
  b1 = _mm512_xor_si512(b1, a5);\
  b7 = _mm512_xor_si512(b7, a5);\
  b2 = _mm512_xor_si512(b2, a6);\
  b0 = _mm512_xor_si512(b0, a6);\
  /* spill values y_4, y_5 to memory */\
  TEMP0 = b0;\
  b3 = _mm512_xor_si512(b3, a7);\
  b1 = _mm512_xor_si512(b1, a7);\
  TEMP1 = b1;\
  b4 = _mm512_xor_si512(b4, a0);\
  b2 = _mm512_xor_si512(b2, a0);\
  /* save values t0, t1, t2 to xmm8, xmm9 and memory */\
  b0 = a0;\
  b5 = _mm512_xor_si512(b5, a1);\
  b3 = _mm512_xor_si512(b3, a1);\
  b1 = a1;\
  b6 = _mm512_xor_si512(b6, a2);\
  b4 = _mm512_xor_si512(b4, a2);\
  TEMP2 = a2;\
  b7 = _mm512_xor_si512(b7, a3);\
  b5 = _mm512_xor_si512(b5, a3);\
  \
  /* compute x_i = t_i + t_{i+3} */\
  a0 = _mm512_xor_si512(a0, a3);\
  a1 = _mm512_xor_si512(a1, a4);\
  a2 = _mm512_xor_si512(a2, a5);\
  a3 = _mm512_xor_si512(a3, a6);\
  a4 = _mm512_xor_si512(a4, a7);\
  a5 = _mm512_xor_si512(a5, b0);\
  a6 = _mm512_xor_si512(a6, b1);\
  a7 = _mm512_xor_si512(a7, TEMP2);\
  \
  /* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
  /* compute w_i : add y_{i+4} */\
  b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b );\
  MUL2(a0, b0, b1);\
  a0 = _mm512_xor_si512(a0, TEMP0);\
  MUL2(a1, b0, b1);\
  a1 = _mm512_xor_si512(a1, TEMP1);\
  MUL2(a2, b0, b1);\
  a2 = _mm512_xor_si512(a2, b2);\
  MUL2(a3, b0, b1);\
  a3 = _mm512_xor_si512(a3, b3);\
  MUL2(a4, b0, b1);\
  a4 = _mm512_xor_si512(a4, b4);\
  MUL2(a5, b0, b1);\
  a5 = _mm512_xor_si512(a5, b5);\
  MUL2(a6, b0, b1);\
  a6 = _mm512_xor_si512(a6, b6);\
  MUL2(a7, b0, b1);\
  a7 = _mm512_xor_si512(a7, b7);\
  \
  /* compute v_i : double w_i */\
  /* add to y_4 y_5 .. v3, v4, ... */\
  MUL2(a0, b0, b1);\
  b5 = _mm512_xor_si512(b5, a0);\
  MUL2(a1, b0, b1);\
  b6 = _mm512_xor_si512(b6, a1);\
  MUL2(a2, b0, b1);\
  b7 = _mm512_xor_si512(b7, a2);\
  MUL2(a5, b0, b1);\
  b2 = _mm512_xor_si512(b2, a5);\
  MUL2(a6, b0, b1);\
  b3 = _mm512_xor_si512(b3, a6);\
  MUL2(a7, b0, b1);\
  b4 = _mm512_xor_si512(b4, a7);\
  MUL2(a3, b0, b1);\
  MUL2(a4, b0, b1);\
  b0 = TEMP0;\
  b1 = TEMP1;\
  b0 = _mm512_xor_si512(b0, a3);\
  b1 = _mm512_xor_si512(b1, a4);\
}/*MixBytes*/

// calculate the round constants separately and load at startup

#define SET_CONSTANTS(){\
  ALL_FF = _mm512_set1_epi32( 0xffffffff );\
  ALL_1B = _mm512_set1_epi32( 0x1b1b1b1b );\
  TRANSP_MASK = _mm512_set_epi32( \
        0x3f373b33, 0x3e363a32, 0x3d353931, 0x3c343830, \
        0x2f272b23, 0x2e262a22, 0x2d252921, 0x2c242820, \
        0x1f171b13, 0x1e161a12, 0x1d151911, 0x1c141810, \
        0x0f070b03, 0x0e060a02, 0x0d050901, 0x0c040800 ); \
  SUBSH_MASK[0] = _mm512_set_epi32( \
        0x3336393c, 0x3f323538, 0x3b3e3134, 0x373a3d30, \
        0x2326292c, 0x2f222528, 0x2b2e2124, 0x272a2d20, \
        0x1316191c, 0x1f121518, 0x1b1e1114, 0x171a1d10, \
        0x0306090c, 0x0f020508, 0x0b0e0104, 0x070a0d00 ); \
  SUBSH_MASK[1] = _mm512_set_epi32( \
        0x34373a3d, 0x30333639, 0x3c3f3235, 0x383b3e31, \
        0x24272a2d, 0x20232629, 0x2c2f2225, 0x282b2e21, \
        0x14171a1d, 0x10131619, 0x1c1f1215, 0x181b1e11, \
        0x04070a0d, 0x00030609, 0x0c0f0205, 0x080b0e01 ); \
  SUBSH_MASK[2] = _mm512_set_epi32( \
        0x35383b3e, 0x3134373a, 0x3d303336, 0x393c3f32, \
        0x25282b2e, 0x2124272a, 0x2d202326, 0x292c2f22, \
        0x15181b1e, 0x1114171a, 0x1d101316, 0x191c1f12, \
        0x05080b0e, 0x0104070a, 0x0d000306, 0x090c0f02 ); \
  SUBSH_MASK[3] = _mm512_set_epi32( \
        0x36393c3f, 0x3235383b, 0x3e313437, 0x3a3d3033, \
        0x26292c2f, 0x2225282b, 0x2e212427, 0x2a2d2023, \
        0x16191c1f, 0x1215181b, 0x1e111417, 0x1a1d1013, \
        0x06090c0f, 0x0205080b, 0x0e010407, 0x0a0d0003 ); \
  SUBSH_MASK[4] = _mm512_set_epi32( \
        0x373a3d30, 0x3336393c, 0x3f323538, 0x3b3e3134, \
        0x272a2d20, 0x2326292c, 0x2f222528, 0x2b2e2124, \
        0x171a1d10, 0x1316191c, 0x1f121518, 0x1b1e1114, \
        0x070a0d00, 0x0306090c, 0x0f020508, 0x0b0e0104 ); \
  SUBSH_MASK[5] = _mm512_set_epi32( \
        0x383b3e31, 0x34373a3d, 0x30333639, 0x3c3f3235, \
        0x282b2e21, 0x24272a2d, 0x20232629, 0x2c2f2225, \
        0x181b1e11, 0x14171a1d, 0x10131619, 0x1c1f1215, \
        0x080b0e01, 0x04070a0d, 0x00030609, 0x0c0f0205 ); \
  SUBSH_MASK[6] = _mm512_set_epi32( \
        0x393c3f32, 0x35383b3e, 0x3134373a, 0x3d303336, \
        0x292c2f22, 0x25282b2e, 0x2124272a, 0x2d202326, \
        0x191c1f12, 0x15181b1e, 0x1114171a, 0x1d101316, \
        0x090c0f02, 0x05080b0e, 0x0104070a, 0x0d000306 ); \
  SUBSH_MASK[7] = _mm512_set_epi32( \
        0x3e313437, 0x3a3d3033, 0x36393c3f, 0x3235383b, \
        0x2e212427, 0x2a2d2023, 0x26292c2f, 0x2225282b, \
        0x1e111417, 0x1a1d1013, 0x16191c1f, 0x1215181b, \
        0x0e010407, 0x0a0d0003, 0x06090c0f, 0x0205080b ); \
  for( i = 0; i < ROUNDS1024; i++ ) \
  { \
    ROUND_CONST_P[i] = _mm512_set4_epi32( 0xf0e0d0c0 ^ (i * 0x01010101), \
                                          0xb0a09080 ^ (i * 0x01010101), \
                                          0x70605040 ^ (i * 0x01010101), \
                                          0x30201000 ^ (i * 0x01010101) ); \
    ROUND_CONST_Q[i] = _mm512_set4_epi32( 0x0f1f2f3f ^ (i * 0x01010101), \
                                          0x4f5f6f7f ^ (i * 0x01010101), \
                                          0x8f9fafbf ^ (i * 0x01010101), \
                                          0xcfdfefff ^ (i * 0x01010101));\
  } \
}while(0);\

/* one round
 * a0-a7 = input rows
 * b0-b7 = output rows
 */
#define SUBMIX(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
  /* SubBytes */\
  b0 = _mm512_xor_si512( b0, b0 );\
  a0 = _mm512_aesenclast_epi128( a0, b0 );\
  a1 = _mm512_aesenclast_epi128( a1, b0 );\
  a2 = _mm512_aesenclast_epi128( a2, b0 );\
  a3 = _mm512_aesenclast_epi128( a3, b0 );\
  a4 = _mm512_aesenclast_epi128( a4, b0 );\
  a5 = _mm512_aesenclast_epi128( a5, b0 );\
  a6 = _mm512_aesenclast_epi128( a6, b0 );\
  a7 = _mm512_aesenclast_epi128( a7, b0 );\
  /* MixBytes */\
  MixBytes(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7);\
}

#define ROUNDS_P(){\
  uint8_t round_counter = 0;\
  for ( round_counter = 0; round_counter < 14; round_counter += 2 ) \
  { \
    /* AddRoundConstant P1024 */\
    xmm8 = _mm512_xor_si512( xmm8, ( ROUND_CONST_P[ round_counter ] ) );\
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm8 = _mm512_shuffle_epi8( xmm8, ( SUBSH_MASK[0] ) );\
    xmm9 = _mm512_shuffle_epi8( xmm9, ( SUBSH_MASK[1] ) );\
    xmm10 = _mm512_shuffle_epi8( xmm10, ( SUBSH_MASK[2] ) );\
    xmm11 = _mm512_shuffle_epi8( xmm11, ( SUBSH_MASK[3] ) );\
    xmm12 = _mm512_shuffle_epi8( xmm12, ( SUBSH_MASK[4] ) );\
    xmm13 = _mm512_shuffle_epi8( xmm13, ( SUBSH_MASK[5] ) );\
    xmm14 = _mm512_shuffle_epi8( xmm14, ( SUBSH_MASK[6] ) );\
    xmm15 = _mm512_shuffle_epi8( xmm15, ( SUBSH_MASK[7] ) );\
    /* SubBytes + MixBytes */\
    SUBMIX(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
    \
    /* AddRoundConstant P1024 */\
    xmm0 = _mm512_xor_si512( xmm0, ( ROUND_CONST_P[ round_counter+1 ] ) );\
    /* ShiftBytes P1024 + pre-AESENCLAST */\
    xmm0 = _mm512_shuffle_epi8( xmm0, ( SUBSH_MASK[0] ) );\
    xmm1 = _mm512_shuffle_epi8( xmm1, ( SUBSH_MASK[1] ) );\
    xmm2 = _mm512_shuffle_epi8( xmm2, ( SUBSH_MASK[2] ) );\
    xmm3 = _mm512_shuffle_epi8( xmm3, ( SUBSH_MASK[3] ) );\
    xmm4 = _mm512_shuffle_epi8( xmm4, ( SUBSH_MASK[4] ) );\
    xmm5 = _mm512_shuffle_epi8( xmm5, ( SUBSH_MASK[5] ) );\
    xmm6 = _mm512_shuffle_epi8( xmm6, ( SUBSH_MASK[6] ) );\
    xmm7 = _mm512_shuffle_epi8( xmm7, ( SUBSH_MASK[7] ) );\
    /* SubBytes + MixBytes */\
    SUBMIX(xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
  }\
}

#define ROUNDS_Q(){\
  uint8_t round_counter = 0;\
  for ( round_counter = 0; round_counter < 14; round_counter += 2) \
  { \
    /* AddRoundConstant Q1024 */\
    xmm1 = m512_neg1;\
    xmm8 = _mm512_xor_si512( xmm8, xmm1 );\
    xmm9 = _mm512_xor_si512( xmm9, xmm1 );\
    xmm10 = _mm512_xor_si512( xmm10, xmm1 );\
    xmm11 = _mm512_xor_si512( xmm11, xmm1 );\
    xmm12 = _mm512_xor_si512( xmm12, xmm1 );\
    xmm13 = _mm512_xor_si512( xmm13, xmm1 );\
    xmm14 = _mm512_xor_si512( xmm14, xmm1 );\
    xmm15 = _mm512_xor_si512( xmm15, ( ROUND_CONST_Q[ round_counter ] ) );\
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm8 = _mm512_shuffle_epi8( xmm8, ( SUBSH_MASK[1] ) );\
    xmm9 = _mm512_shuffle_epi8( xmm9, ( SUBSH_MASK[3] ) );\
    xmm10 = _mm512_shuffle_epi8( xmm10, ( SUBSH_MASK[5] ) );\
    xmm11 = _mm512_shuffle_epi8( xmm11, ( SUBSH_MASK[7] ) );\
    xmm12 = _mm512_shuffle_epi8( xmm12, ( SUBSH_MASK[0] ) );\
    xmm13 = _mm512_shuffle_epi8( xmm13, ( SUBSH_MASK[2] ) );\
    xmm14 = _mm512_shuffle_epi8( xmm14, ( SUBSH_MASK[4] ) );\
    xmm15 = _mm512_shuffle_epi8( xmm15, ( SUBSH_MASK[6] ) );\
    /* SubBytes + MixBytes */\
    SUBMIX(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
    \
    /* AddRoundConstant Q1024 */\
    xmm9 = m512_neg1;\
    xmm0 = _mm512_xor_si512( xmm0, xmm9 );\
    xmm1 = _mm512_xor_si512( xmm1, xmm9 );\
    xmm2 = _mm512_xor_si512( xmm2, xmm9 );\
    xmm3 = _mm512_xor_si512( xmm3, xmm9 );\
    xmm4 = _mm512_xor_si512( xmm4, xmm9 );\
    xmm5 = _mm512_xor_si512( xmm5, xmm9 );\
    xmm6 = _mm512_xor_si512( xmm6, xmm9 );\
    xmm7 = _mm512_xor_si512( xmm7, ( ROUND_CONST_Q[ round_counter+1 ] ) );\
    /* ShiftBytes Q1024 + pre-AESENCLAST */\
    xmm0 = _mm512_shuffle_epi8( xmm0, ( SUBSH_MASK[1] ) );\
    xmm1 = _mm512_shuffle_epi8( xmm1, ( SUBSH_MASK[3] ) );\
    xmm2 = _mm512_shuffle_epi8( xmm2, ( SUBSH_MASK[5] ) );\
    xmm3 = _mm512_shuffle_epi8( xmm3, ( SUBSH_MASK[7] ) );\
    xmm4 = _mm512_shuffle_epi8( xmm4, ( SUBSH_MASK[0] ) );\
    xmm5 = _mm512_shuffle_epi8( xmm5, ( SUBSH_MASK[2] ) );\
    xmm6 = _mm512_shuffle_epi8( xmm6, ( SUBSH_MASK[4] ) );\
    xmm7 = _mm512_shuffle_epi8( xmm7, ( SUBSH_MASK[6] ) );\
    /* SubBytes + MixBytes */\
    SUBMIX(xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15);\
  }\
}

/* Matrix Transpose
 * input is a 1024-bit state with two columns in one xmm
 * output is a 1024-bit state with two rows in one xmm
 * inputs: i0-i7
 * outputs: i0-i7
 * clobbers: t0-t7
 */
#define Matrix_Transpose(i0, i1, i2, i3, i4, i5, i6, i7, t0, t1, t2, t3, t4, t5, t6, t7){\
  t0 = TRANSP_MASK;\
  \
  i6 = _mm512_shuffle_epi8(i6, t0);\
  i0 = _mm512_shuffle_epi8(i0, t0);\
  i1 = _mm512_shuffle_epi8(i1, t0);\
  i2 = _mm512_shuffle_epi8(i2, t0);\
  i3 = _mm512_shuffle_epi8(i3, t0);\
  t1 = i2;\
  i4 = _mm512_shuffle_epi8(i4, t0);\
  i5 = _mm512_shuffle_epi8(i5, t0);\
  t2 = i4;\
  t3 = i6;\
  i7 = _mm512_shuffle_epi8(i7, t0);\
  \
  /* continue with unpack using 4 temp registers */\
  t0 = i0;\
  t2 = _mm512_unpackhi_epi16(t2, i5);\
  i4 = _mm512_unpacklo_epi16(i4, i5);\
  t3 = _mm512_unpackhi_epi16(t3, i7);\
  i6 = _mm512_unpacklo_epi16(i6, i7);\
  t0 = _mm512_unpackhi_epi16(t0, i1);\
  t1 = _mm512_unpackhi_epi16(t1, i3);\
  i2 = _mm512_unpacklo_epi16(i2, i3);\
  i0 = _mm512_unpacklo_epi16(i0, i1);\
  \
  /* shuffle with immediate */\
  t0 = _mm512_shuffle_epi32(t0, 216);\
  t1 = _mm512_shuffle_epi32(t1, 216);\
  t2 = _mm512_shuffle_epi32(t2, 216);\
  t3 = _mm512_shuffle_epi32(t3, 216);\
  i0 = _mm512_shuffle_epi32(i0, 216);\
  i2 = _mm512_shuffle_epi32(i2, 216);\
  i4 = _mm512_shuffle_epi32(i4, 216);\
  i6 = _mm512_shuffle_epi32(i6, 216);\
  \
  /* continue with unpack */\
  t4 = i0;\
  i0 = _mm512_unpacklo_epi32(i0, i2);\
  t4 = _mm512_unpackhi_epi32(t4, i2);\
  t5 = t0;\
  t0 = _mm512_unpacklo_epi32(t0, t1);\
  t5 = _mm512_unpackhi_epi32(t5, t1);\
  t6 = i4;\
  i4 = _mm512_unpacklo_epi32(i4, i6);\
  t7 = t2;\
  t6 = _mm512_unpackhi_epi32(t6, i6);\
  i2 = t0;\
  t2 = _mm512_unpacklo_epi32(t2, t3);\
  i3 = t0;\
  t7 = _mm512_unpackhi_epi32(t7, t3);\
  \
  /* there are now 2 rows in each xmm */\
  /* unpack to get 1 row of CV in each xmm */\
  i1 = i0;\
  i1 = _mm512_unpackhi_epi64(i1, i4);\
  i0 = _mm512_unpacklo_epi64(i0, i4);\
  i4 = t4;\
  i3 = _mm512_unpackhi_epi64(i3, t2);\
  i5 = t4;\
  i2 = _mm512_unpacklo_epi64(i2, t2);\
  i6 = t5;\
  i5 = _mm512_unpackhi_epi64(i5, t6);\
  i7 = t5;\
  i4 = _mm512_unpacklo_epi64(i4, t6);\
  i7 = _mm512_unpackhi_epi64(i7, t7);\
  i6 = _mm512_unpacklo_epi64(i6, t7);\
  /* transpose done */\
}/**/

/* Matrix Transpose Inverse
 * input is a 1024-bit state with two rows in one xmm
 * output is a 1024-bit state with two columns in one xmm
 * inputs: i0-i7
 * outputs: (i0, o0, i1, i3, o1, o2, i5, i7)
 * clobbers: t0-t4
 */
#define Matrix_Transpose_INV(i0, i1, i2, i3, i4, i5, i6, i7, o0, o1, o2, t0, t1, t2, t3, t4){\
  /* transpose matrix to get output format */\
  o1 = i0;\
  i0 = _mm512_unpacklo_epi64(i0, i1);\
  o1 = _mm512_unpackhi_epi64(o1, i1);\
  t0 = i2;\
  i2 = _mm512_unpacklo_epi64(i2, i3);\
  t0 = _mm512_unpackhi_epi64(t0, i3);\
  t1 = i4;\
  i4 = _mm512_unpacklo_epi64(i4, i5);\
  t1 = _mm512_unpackhi_epi64(t1, i5);\
  t2 = i6;\
  o0 = TRANSP_MASK;\
  i6 = _mm512_unpacklo_epi64(i6, i7);\
  t2 = _mm512_unpackhi_epi64(t2, i7);\
  /* load transpose mask into a register, because it will be used 8 times */\
  i0 = _mm512_shuffle_epi8(i0, o0);\
  i2 = _mm512_shuffle_epi8(i2, o0);\
  i4 = _mm512_shuffle_epi8(i4, o0);\
  i6 = _mm512_shuffle_epi8(i6, o0);\
  o1 = _mm512_shuffle_epi8(o1, o0);\
  t0 = _mm512_shuffle_epi8(t0, o0);\
  t1 = _mm512_shuffle_epi8(t1, o0);\
  t2 = _mm512_shuffle_epi8(t2, o0);\
  /* continue with unpack using 4 temp registers */\
  t3 = i4;\
  o2 = o1;\
  o0 = i0;\
  t4 = t1;\
  \
  t3 = _mm512_unpackhi_epi16(t3, i6);\
  i4 = _mm512_unpacklo_epi16(i4, i6);\
  o0 = _mm512_unpackhi_epi16(o0, i2);\
  i0 = _mm512_unpacklo_epi16(i0, i2);\
  o2 = _mm512_unpackhi_epi16(o2, t0);\
  o1 = _mm512_unpacklo_epi16(o1, t0);\
  t4 = _mm512_unpackhi_epi16(t4, t2);\
  t1 = _mm512_unpacklo_epi16(t1, t2);\
  /* shuffle with immediate */\
  i4 = _mm512_shuffle_epi32(i4, 216);\
  t3 = _mm512_shuffle_epi32(t3, 216);\
  o1 = _mm512_shuffle_epi32(o1, 216);\
  o2 = _mm512_shuffle_epi32(o2, 216);\
  i0 = _mm512_shuffle_epi32(i0, 216);\
  o0 = _mm512_shuffle_epi32(o0, 216);\
  t1 = _mm512_shuffle_epi32(t1, 216);\
  t4 = _mm512_shuffle_epi32(t4, 216);\
  /* continue with unpack */\
  i1 = i0;\
  i3 = o0;\
  i5 = o1;\
  i7 = o2;\
  i0 = _mm512_unpacklo_epi32(i0, i4);\
  i1 = _mm512_unpackhi_epi32(i1, i4);\
  o0 = _mm512_unpacklo_epi32(o0, t3);\
  i3 = _mm512_unpackhi_epi32(i3, t3);\
  o1 = _mm512_unpacklo_epi32(o1, t1);\
  i5 = _mm512_unpackhi_epi32(i5, t1);\
  o2 = _mm512_unpacklo_epi32(o2, t4);\
  i7 = _mm512_unpackhi_epi32(i7, t4);\
  /* transpose done */\
}/**/


void INIT_4way( __m512i* chaining )
{
     static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
     static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;

     /* load IV into registers xmm8 - xmm15 */
     xmm8 = chaining[0];
     xmm9 = chaining[1];
     xmm10 = chaining[2];
     xmm11 = chaining[3];
     xmm12 = chaining[4];
     xmm13 = chaining[5];
     xmm14 = chaining[6];
     xmm15 = chaining[7];

     /* transform chaining value from column ordering into row ordering */
     Matrix_Transpose(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);

     /* store transposed IV */
     chaining[0] = xmm8;
     chaining[1] = xmm9;
     chaining[2] = xmm10;
     chaining[3] = xmm11;
     chaining[4] = xmm12;
     chaining[5] = xmm13;
     chaining[6] = xmm14;
     chaining[7] = xmm15;
}

void TF1024_4way( __m512i* chaining, const __m512i* message )
{
     static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
     static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
     static __m512i QTEMP[8];
     static __m512i TEMP0;
     static __m512i TEMP1;
     static __m512i TEMP2;

     /* load message into registers xmm8 - xmm15 (Q = message) */
     xmm8 = message[0];
     xmm9 = message[1];
     xmm10 = message[2];
     xmm11 = message[3];
     xmm12 = message[4];
     xmm13 = message[5];
     xmm14 = message[6];
     xmm15 = message[7];

     /* transform message M from column ordering into row ordering */
     Matrix_Transpose(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);

     /* store message M (Q input) for later */
     QTEMP[0] = xmm8;
     QTEMP[1] = xmm9;
     QTEMP[2] = xmm10;
     QTEMP[3] = xmm11;
     QTEMP[4] = xmm12;
     QTEMP[5] = xmm13;
     QTEMP[6] = xmm14;
     QTEMP[7] = xmm15;

     /* xor CV to message to get P input */
     /* result: CV+M in xmm8...xmm15 */
     xmm8 = _mm512_xor_si512( xmm8, (chaining[0]) );
     xmm9 = _mm512_xor_si512( xmm9, (chaining[1]) );
     xmm10 = _mm512_xor_si512( xmm10, (chaining[2]) );
     xmm11 = _mm512_xor_si512( xmm11, (chaining[3]) );
     xmm12 = _mm512_xor_si512( xmm12, (chaining[4]) );
     xmm13 = _mm512_xor_si512( xmm13, (chaining[5]) );
     xmm14 = _mm512_xor_si512( xmm14, (chaining[6]) );
     xmm15 = _mm512_xor_si512( xmm15, (chaining[7]) );

     /* compute permutation P */
     /* result: P(CV+M) in xmm8...xmm15 */
     ROUNDS_P();

     /* xor CV to P output (feed-forward) */
     /* result: P(CV+M)+CV in xmm8...xmm15 */
     xmm8 = _mm512_xor_si512( xmm8, (chaining[0]) );
     xmm9 = _mm512_xor_si512( xmm9, (chaining[1]) );
     xmm10 = _mm512_xor_si512( xmm10, (chaining[2]) );
     xmm11 = _mm512_xor_si512( xmm11, (chaining[3]) );
     xmm12 = _mm512_xor_si512( xmm12, (chaining[4]) );
     xmm13 = _mm512_xor_si512( xmm13, (chaining[5]) );
     xmm14 = _mm512_xor_si512( xmm14, (chaining[6]) );
     xmm15 = _mm512_xor_si512( xmm15, (chaining[7]) );

     /* store P(CV+M)+CV */
     chaining[0] = xmm8;
     chaining[1] = xmm9;
     chaining[2] = xmm10;
     chaining[3] = xmm11;
     chaining[4] = xmm12;
     chaining[5] = xmm13;
     chaining[6] = xmm14;
     chaining[7] = xmm15;

     /* load message M (Q input) into xmm8-15 */
     xmm8 = QTEMP[0];
     xmm9 = QTEMP[1];
     xmm10 = QTEMP[2];
     xmm11 = QTEMP[3];
     xmm12 = QTEMP[4];
     xmm13 = QTEMP[5];
     xmm14 = QTEMP[6];
     xmm15 = QTEMP[7];

     /* compute permutation Q */
     /* result: Q(M) in xmm8...xmm15 */
     ROUNDS_Q();

     /* xor Q output */
     /* result: P(CV+M)+CV+Q(M) in xmm8...xmm15 */
     xmm8 = _mm512_xor_si512( xmm8, (chaining[0]) );
     xmm9 = _mm512_xor_si512( xmm9, (chaining[1]) );
     xmm10 = _mm512_xor_si512( xmm10, (chaining[2]) );
     xmm11 = _mm512_xor_si512( xmm11, (chaining[3]) );
     xmm12 = _mm512_xor_si512( xmm12, (chaining[4]) );
     xmm13 = _mm512_xor_si512( xmm13, (chaining[5]) );
     xmm14 = _mm512_xor_si512( xmm14, (chaining[6]) );
     xmm15 = _mm512_xor_si512( xmm15, (chaining[7]) );

     /* store CV */
     chaining[0] = xmm8;
     chaining[1] = xmm9;
     chaining[2] = xmm10;
     chaining[3] = xmm11;
     chaining[4] = xmm12;
     chaining[5] = xmm13;
     chaining[6] = xmm14;
     chaining[7] = xmm15;

     return;
}

void OF1024_4way( __m512i* chaining )
{
     static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
     static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
     static __m512i TEMP0;
     static __m512i TEMP1;
     static __m512i TEMP2;

     /* load CV into registers xmm8 - xmm15 */
     xmm8 = chaining[0];
     xmm9 = chaining[1];
     xmm10 = chaining[2];
     xmm11 = chaining[3];
     xmm12 = chaining[4];
     xmm13 = chaining[5];
     xmm14 = chaining[6];
     xmm15 = chaining[7];

     /* compute permutation P */
     /* result: P(CV) in xmm8...xmm15 */
     ROUNDS_P();

     /* xor CV to P output (feed-forward) */
     /* result: P(CV)+CV in xmm8...xmm15 */
     xmm8 = _mm512_xor_si512( xmm8, (chaining[0]) );
     xmm9 = _mm512_xor_si512( xmm9, (chaining[1]) );
     xmm10 = _mm512_xor_si512( xmm10, (chaining[2]) );
     xmm11 = _mm512_xor_si512( xmm11, (chaining[3]) );
     xmm12 = _mm512_xor_si512( xmm12, (chaining[4]) );
     xmm13 = _mm512_xor_si512( xmm13, (chaining[5]) );
     xmm14 = _mm512_xor_si512( xmm14, (chaining[6]) );
     xmm15 = _mm512_xor_si512( xmm15, (chaining[7]) );

     /* transpose CV back from row ordering to column ordering */
     /* result: final hash value in xmm0, xmm6, xmm13, xmm15 */
     Matrix_Transpose_INV(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm4, xmm0, xmm6, xmm1, xmm2, xmm3, xmm5, xmm7);

     /* we only need to return the truncated half of the state */
     chaining[4] = xmm0;
     chaining[5] = xmm6;
     chaining[6] = xmm13;
     chaining[7] = xmm15;

     return;
}

#endif // VAES
#endif // GROESTL512_INTR_4WAY_H__

@@ -1,22 +1,20 @@
 #include "myrgr-gate.h"
 
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
-#ifdef NO_AES_NI
-  #include "sph_groestl.h"
-#else
+#ifdef __AES__
   #include "aes_ni/hash-groestl.h"
+#else
+  #include "sph_groestl.h"
 #endif
 #include <openssl/sha.h>
 
 typedef struct {
-#ifdef NO_AES_NI
-    sph_groestl512_context groestl;
-#else
+#ifdef __AES__
     hashState_groestl groestl;
+#else
+    sph_groestl512_context groestl;
 #endif
     SHA256_CTX sha;
 } myrgr_ctx_holder;
@@ -25,10 +23,10 @@ myrgr_ctx_holder myrgr_ctx;
 
 void init_myrgr_ctx()
 {
-#ifdef NO_AES_NI
-     sph_groestl512_init( &myrgr_ctx.groestl );
-#else
+#ifdef __AES__
      init_groestl ( &myrgr_ctx.groestl, 64 );
+#else
+     sph_groestl512_init( &myrgr_ctx.groestl );
 #endif
      SHA256_Init( &myrgr_ctx.sha );
 }
@@ -40,12 +38,12 @@ void myriad_hash(void *output, const void *input)
 	uint32_t _ALIGN(32) hash[16];
 
-#ifdef NO_AES_NI
-	sph_groestl512(&ctx.groestl, input, 80);
-	sph_groestl512_close(&ctx.groestl, hash);
-#else
+#ifdef __AES__
 	update_groestl( &ctx.groestl, (char*)input, 640 );
 	final_groestl( &ctx.groestl, (char*)hash);
+#else
+	sph_groestl512(&ctx.groestl, input, 80);
+	sph_groestl512_close(&ctx.groestl, hash);
 #endif
 
 	SHA256_Update( &ctx.sha, (unsigned char*)hash, 64 );
@@ -88,15 +86,3 @@ int scanhash_myriad( struct work *work, uint32_t max_nonce,
 	*hashes_done = pdata[19] - first_nonce + 1;
 	return 0;
 }
-/*
-bool register_myriad_algo( algo_gate_t* gate )
-{
-  gate->optimizations = SSE2_OPT | AES_OPT;
-  init_myrgr_ctx();
-  gate->scanhash = (void*)&scanhash_myriad;
-  gate->hash = (void*)&myriadhash;
-//  gate->hash_alt = (void*)&myriadhash;
-  gate->get_max64 = (void*)&get_max64_0x3ffff;
-  return true;
-};
-*/

@@ -1,14 +1,159 @@
 #include "myrgr-gate.h"
 
-#if defined(MYRGR_4WAY)
 
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
 
 #include "aes_ni/hash-groestl.h"
 #include "algo/sha/sha-hash-4way.h"
+#if defined(__VAES__)
+  #include "groestl512-hash-4way.h"
+#endif
+
+#if defined(MYRGR_8WAY)
+
+typedef struct {
+#if defined(__VAES__)
+    groestl512_4way_context groestl;
+#else
+    hashState_groestl groestl;
+#endif
+    sha256_8way_context sha;
+} myrgr_8way_ctx_holder;
+
+myrgr_8way_ctx_holder myrgr_8way_ctx;
+
+void init_myrgr_8way_ctx()
+{
+#if defined(__VAES__)
+     groestl512_4way_init( &myrgr_8way_ctx.groestl, 64 );
+#else
+     init_groestl( &myrgr_8way_ctx.groestl, 64 );
+#endif
+     sha256_8way_init( &myrgr_8way_ctx.sha );
+}
+
+void myriad_8way_hash( void *output, const void *input )
+{
+     uint32_t vhash[16*8] __attribute__ ((aligned (128)));
+     uint32_t vhashA[20*8] __attribute__ ((aligned (64)));
+     uint32_t vhashB[20*8] __attribute__ ((aligned (64)));
+     myrgr_8way_ctx_holder ctx;
+     memcpy( &ctx, &myrgr_8way_ctx, sizeof(myrgr_8way_ctx) );
+
+#if defined(__VAES__)
+
+     rintrlv_8x64_4x128( vhashA, vhashB, input, 640 );
+     groestl512_4way_update_close( &ctx.groestl, vhashA, vhashA, 640 );
+     groestl512_4way_update_close( &ctx.groestl, vhashB, vhashB, 640 );
+
+     uint32_t hash0[20] __attribute__ ((aligned (64)));
+     uint32_t hash1[20] __attribute__ ((aligned (64)));
+     uint32_t hash2[20] __attribute__ ((aligned (64)));
+     uint32_t hash3[20] __attribute__ ((aligned (64)));
+     uint32_t hash4[20] __attribute__ ((aligned (64)));
+     uint32_t hash5[20] __attribute__ ((aligned (64)));
+     uint32_t hash6[20] __attribute__ ((aligned (64)));
+     uint32_t hash7[20] __attribute__ ((aligned (64)));
+
+//   rintrlv_4x128_8x32( vhash, vhashA, vhashB, 512 );
+     dintrlv_4x128_512( hash0, hash1, hash2, hash3, vhashA );
+     dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhashB );
+     intrlv_8x32_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5,
+                      hash6, hash7 );
+
+#else
+
+     uint32_t hash0[20] __attribute__ ((aligned (64)));
+     uint32_t hash1[20] __attribute__ ((aligned (64)));
+     uint32_t hash2[20] __attribute__ ((aligned (64)));
+     uint32_t hash3[20] __attribute__ ((aligned (64)));
+     uint32_t hash4[20] __attribute__ ((aligned (64)));
+     uint32_t hash5[20] __attribute__ ((aligned (64)));
+     uint32_t hash6[20] __attribute__ ((aligned (64)));
+     uint32_t hash7[20] __attribute__ ((aligned (64)));
+
+     dintrlv_8x64( hash0, hash1, hash2, hash3,
+                   hash4, hash5, hash6, hash7, input, 640 );
+
+     update_and_final_groestl( &ctx.groestl, (char*)hash0, (char*)hash0, 640 );
+     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
+     update_and_final_groestl( &ctx.groestl, (char*)hash1, (char*)hash1, 640 );
+     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
+     update_and_final_groestl( &ctx.groestl, (char*)hash2, (char*)hash2, 640 );
+     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
+     update_and_final_groestl( &ctx.groestl, (char*)hash3, (char*)hash3, 640 );
+     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
+     update_and_final_groestl( &ctx.groestl, (char*)hash4, (char*)hash4, 640 );
+     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
+     update_and_final_groestl( &ctx.groestl, (char*)hash5, (char*)hash5, 640 );
+     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
+     update_and_final_groestl( &ctx.groestl, (char*)hash6, (char*)hash6, 640 );
+     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
+     update_and_final_groestl( &ctx.groestl, (char*)hash7, (char*)hash7, 640 );
+     memcpy( &ctx.groestl, &myrgr_4way_ctx.groestl, sizeof(hashState_groestl) );
+
+     intrlv_8x32( vhash, hash0, hash1, hash2, hash3,
+                  hash4, hash5, hash6, hash7, 512 );
+
+#endif
+
+     sha256_8way_update( &ctx.sha, vhash, 64 );
+     sha256_8way_close( &ctx.sha, output );
+}
+
+int scanhash_myriad_8way( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr )
+{
+     uint32_t hash[8*8] __attribute__ ((aligned (128)));
+     uint32_t vdata[20*8] __attribute__ ((aligned (64)));
+     uint32_t lane_hash[8] __attribute__ ((aligned (64)));
+     uint32_t *hash7 = &(hash[7<<3]);
+     uint32_t *pdata = work->data;
+     uint32_t *ptarget = work->target;
+     const uint32_t Htarg = ptarget[7];
+     const uint32_t first_nonce = pdata[19];
+     const uint32_t last_nonce = max_nonce - 8;
+     uint32_t n = first_nonce;
+     uint32_t *noncep = vdata + 64+3;   // 4*16 + 3
|
||||||
|
int thr_id = mythr->id; // thr_id arg is deprecated
|
||||||
|
|
||||||
|
if ( opt_benchmark )
|
||||||
|
( (uint32_t*)ptarget )[7] = 0x0000ff;
|
||||||
|
|
||||||
|
mm512_bswap32_intrlv80_4x128( vdata, pdata );
|
||||||
|
|
||||||
|
do
|
||||||
|
{
|
||||||
|
be32enc( noncep, n );
|
||||||
|
be32enc( noncep+ 8, n+1 );
|
||||||
|
be32enc( noncep+16, n+2 );
|
||||||
|
be32enc( noncep+24, n+3 );
|
||||||
|
be32enc( noncep+32, n+4 );
|
||||||
|
be32enc( noncep+40, n+5 );
|
||||||
|
be32enc( noncep+48, n+6 );
|
||||||
|
be32enc( noncep+64, n+7 );
|
||||||
|
|
||||||
|
myriad_8way_hash( hash, vdata );
|
||||||
|
pdata[19] = n;
|
||||||
|
|
||||||
|
for ( int lane = 0; lane < 8; lane++ )
|
||||||
|
if ( hash7[ lane ] <= Htarg )
|
||||||
|
{
|
||||||
|
extr_lane_8x32( lane_hash, hash, lane, 256 );
|
||||||
|
if ( fulltest( lane_hash, ptarget ) && !opt_benchmark )
|
||||||
|
{
|
||||||
|
pdata[19] = n + lane;
|
||||||
|
submit_lane_solution( work, lane_hash, mythr, lane );
|
||||||
|
}
|
||||||
|
}
|
||||||
|
n += 8;
|
||||||
|
} while ( (n < last_nonce) && !work_restart[thr_id].restart);
|
||||||
|
|
||||||
|
*hashes_done = n - first_nonce;
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
#elif defined(MYRGR_4WAY)
|
||||||
|
|
||||||
 typedef struct {
    hashState_groestl groestl;
@@ -45,7 +190,7 @@ void myriad_4way_hash( void *output, const void *input )
 
    intrlv_4x32( vhash, hash0, hash1, hash2, hash3, 512 );
 
-   sha256_4way( &ctx.sha, vhash, 64 );
+   sha256_4way_update( &ctx.sha, vhash, 64 );
    sha256_4way_close( &ctx.sha, output );
 }
 
@@ -2,17 +2,22 @@
 bool register_myriad_algo( algo_gate_t* gate )
 {
-#if defined (MYRGR_4WAY)
+#if defined (MYRGR_8WAY)
+  init_myrgr_8way_ctx();
+  gate->scanhash = (void*)&scanhash_myriad_8way;
+  gate->hash     = (void*)&myriad_8way_hash;
+  gate->optimizations = AES_OPT | AVX2_OPT | VAES_OPT;
+#elif defined (MYRGR_4WAY)
   init_myrgr_4way_ctx();
   gate->scanhash = (void*)&scanhash_myriad_4way;
   gate->hash     = (void*)&myriad_4way_hash;
+  gate->optimizations = AES_OPT | SSE2_OPT | AVX2_OPT | VAES_OPT;
 #else
   init_myrgr_ctx();
   gate->scanhash = (void*)&scanhash_myriad;
   gate->hash     = (void*)&myriad_hash;
+  gate->optimizations = AES_OPT | SSE2_OPT | AVX2_OPT | SHA_OPT | VAES_OPT;
 #endif
-  gate->optimizations = AES_OPT | AVX2_OPT;
-  gate->get_max64 = (void*)&get_max64_0x3ffff;
   return true;
 };
 
@@ -1,30 +1,35 @@
 #ifndef MYRGR_GATE_H__
-#define MYRGR_GATE_H__
+#define MYRGR_GATE_H__ 1
 
 #include "algo-gate-api.h"
 #include <stdint.h>
 
-#if defined(__AVX2__) && defined(__AES__) && !defined(__SHA__)
-  #define MYRGR_4WAY
+#if defined(__VAES__) && defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+  #define MYRGR_8WAY 1
+#elif defined(__AVX2__) && defined(__AES__) && !defined(__SHA__)
+  #define MYRGR_4WAY 1
 #endif
 
-#if defined(MYRGR_4WAY)
+#if defined(MYRGR_8WAY)
+
+void myriad_8way_hash( void *state, const void *input );
+int scanhash_myriad_8way( struct work *work, uint32_t max_nonce,
+                          uint64_t *hashes_done, struct thr_info *mythr );
+void init_myrgr_8way_ctx();
+
+#elif defined(MYRGR_4WAY)
 
 void myriad_4way_hash( void *state, const void *input );
 
 int scanhash_myriad_4way( struct work *work, uint32_t max_nonce,
                           uint64_t *hashes_done, struct thr_info *mythr );
 
 void init_myrgr_4way_ctx();
 
-#endif
+#else
 
 void myriad_hash( void *state, const void *input );
 
 int scanhash_myriad( struct work *work, uint32_t max_nonce,
                      uint64_t *hashes_done, struct thr_info *mythr );
 
 void init_myrgr_ctx();
 
 #endif
+
+#endif
@@ -32,8 +32,6 @@
 #include <stddef.h>
 #include <string.h>
 
-//#include "miner.h"
-
 #include "hamsi-hash-4way.h"
 
 #if defined(__AVX2__)
@@ -100,7 +98,7 @@ extern "C"{
 #endif
 
 //#include "hamsi-helper-4way.c"
+/*
 static const sph_u32 IV512[] = {
    SPH_C32(0x73746565), SPH_C32(0x6c706172), SPH_C32(0x6b204172),
    SPH_C32(0x656e6265), SPH_C32(0x72672031), SPH_C32(0x302c2062),
@@ -109,7 +107,7 @@ static const sph_u32 IV512[] = {
    SPH_C32(0x65766572), SPH_C32(0x6c65652c), SPH_C32(0x2042656c),
    SPH_C32(0x6769756d)
 };
+*/
 static const sph_u32 alpha_n[] = {
    SPH_C32(0xff00f0f0), SPH_C32(0xccccaaaa), SPH_C32(0xf0f0cccc),
    SPH_C32(0xff00aaaa), SPH_C32(0xccccaaaa), SPH_C32(0xf0f0ff00),
@@ -138,6 +136,7 @@ static const sph_u32 alpha_f[] = {
    SPH_C32(0xcaf9f9c0), SPH_C32(0x0ff0639c)
 };
 
 
 // imported from hamsi helper
 
 /* Note: this table lists bits within each byte from least
@@ -529,49 +528,374 @@ static const sph_u32 T512[64][16] = {
    SPH_C32(0xe7e00a94) }
 };
 
+#define s0 m0
+#define s1 c0
+#define s2 m1
+#define s3 c1
+#define s4 c2
+#define s5 m2
+#define s6 c3
+#define s7 m3
+#define s8 m4
+#define s9 c4
+#define sA m5
+#define sB c5
+#define sC c6
+#define sD m6
+#define sE c7
+#define sF m7
+
+
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+// Hamsi 8 way
+
+#define INPUT_BIG8 \
+do { \
+  __m512i db = *buf; \
+  const uint64_t *tp = (uint64_t*)&T512[0][0]; \
+  m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = m512_zero; \
+  for ( int u = 0; u < 64; u++ ) \
+  { \
+    __m512i dm = _mm512_and_si512( db, m512_one_64 ) ; \
+    dm = mm512_negate_32( _mm512_or_si512( dm, \
+                                           _mm512_slli_epi64( dm, 32 ) ) ); \
+    m0 = _mm512_xor_si512( m0, _mm512_and_si512( dm, \
+                                           m512_const1_64( tp[0] ) ) ); \
+    m1 = _mm512_xor_si512( m1, _mm512_and_si512( dm, \
+                                           m512_const1_64( tp[1] ) ) ); \
+    m2 = _mm512_xor_si512( m2, _mm512_and_si512( dm, \
+                                           m512_const1_64( tp[2] ) ) ); \
+    m3 = _mm512_xor_si512( m3, _mm512_and_si512( dm, \
+                                           m512_const1_64( tp[3] ) ) ); \
+    m4 = _mm512_xor_si512( m4, _mm512_and_si512( dm, \
+                                           m512_const1_64( tp[4] ) ) ); \
+    m5 = _mm512_xor_si512( m5, _mm512_and_si512( dm, \
+                                           m512_const1_64( tp[5] ) ) ); \
+    m6 = _mm512_xor_si512( m6, _mm512_and_si512( dm, \
+                                           m512_const1_64( tp[6] ) ) ); \
+    m7 = _mm512_xor_si512( m7, _mm512_and_si512( dm, \
+                                           m512_const1_64( tp[7] ) ) ); \
+    tp += 8; \
+    db = _mm512_srli_epi64( db, 1 ); \
+  } \
+} while (0)
+
+#define SBOX8( a, b, c, d ) \
+do { \
+  __m512i t; \
+  t = a; \
+  a = _mm512_and_si512( a, c ); \
+  a = _mm512_xor_si512( a, d ); \
+  c = _mm512_xor_si512( c, b ); \
+  c = _mm512_xor_si512( c, a ); \
+  d = _mm512_or_si512( d, t ); \
+  d = _mm512_xor_si512( d, b ); \
+  t = _mm512_xor_si512( t, c ); \
+  b = d; \
+  d = _mm512_or_si512( d, t ); \
+  d = _mm512_xor_si512( d, a ); \
+  a = _mm512_and_si512( a, b ); \
+  t = _mm512_xor_si512( t, a ); \
+  b = _mm512_xor_si512( b, d ); \
+  b = _mm512_xor_si512( b, t ); \
+  a = c; \
+  c = b; \
+  b = d; \
+  d = mm512_not( t ); \
+} while (0)
+
+#define L8( a, b, c, d ) \
+do { \
+   a = mm512_rol_32( a, 13 ); \
+   c = mm512_rol_32( c,  3 ); \
+   b = _mm512_xor_si512( b, _mm512_xor_si512( a, c ) ); \
+   d = _mm512_xor_si512( d, _mm512_xor_si512( c, \
+                                              _mm512_slli_epi32( a, 3 ) ) ); \
+   b = mm512_rol_32( b, 1 ); \
+   d = mm512_rol_32( d, 7 ); \
+   a = _mm512_xor_si512( a, _mm512_xor_si512( b, d ) ); \
+   c = _mm512_xor_si512( c, _mm512_xor_si512( d, \
+                                              _mm512_slli_epi32( b, 7 ) ) ); \
+   a = mm512_rol_32( a, 5 ); \
+   c = mm512_rol_32( c, 22 ); \
+} while (0)
+
+#define DECL_STATE_BIG8 \
+   __m512i c0, c1, c2, c3, c4, c5, c6, c7; \
+
+#define READ_STATE_BIG8(sc) \
+do { \
+   c0 = sc->h[0x0]; \
+   c1 = sc->h[0x1]; \
+   c2 = sc->h[0x2]; \
+   c3 = sc->h[0x3]; \
+   c4 = sc->h[0x4]; \
+   c5 = sc->h[0x5]; \
+   c6 = sc->h[0x6]; \
+   c7 = sc->h[0x7]; \
+} while (0)
+
+#define WRITE_STATE_BIG8(sc) \
+do { \
+   sc->h[0x0] = c0; \
+   sc->h[0x1] = c1; \
+   sc->h[0x2] = c2; \
+   sc->h[0x3] = c3; \
+   sc->h[0x4] = c4; \
+   sc->h[0x5] = c5; \
+   sc->h[0x6] = c6; \
+   sc->h[0x7] = c7; \
+} while (0)
+
+
+#define ROUND_BIG8(rc, alpha) \
+do { \
+   __m512i t0, t1, t2, t3; \
+   s0 = _mm512_xor_si512( s0, m512_const1_64( \
+                 ( (uint64_t)(rc) << 32 ) ^ ( (uint64_t*)(alpha) )[ 0] ) ); \
+   s1 = _mm512_xor_si512( s1, m512_const1_64( ( (uint64_t*)(alpha) )[ 1] ) ); \
+   s2 = _mm512_xor_si512( s2, m512_const1_64( ( (uint64_t*)(alpha) )[ 2] ) ); \
+   s3 = _mm512_xor_si512( s3, m512_const1_64( ( (uint64_t*)(alpha) )[ 3] ) ); \
+   s4 = _mm512_xor_si512( s4, m512_const1_64( ( (uint64_t*)(alpha) )[ 4] ) ); \
+   s5 = _mm512_xor_si512( s5, m512_const1_64( ( (uint64_t*)(alpha) )[ 5] ) ); \
+   s6 = _mm512_xor_si512( s6, m512_const1_64( ( (uint64_t*)(alpha) )[ 6] ) ); \
+   s7 = _mm512_xor_si512( s7, m512_const1_64( ( (uint64_t*)(alpha) )[ 7] ) ); \
+   s8 = _mm512_xor_si512( s8, m512_const1_64( ( (uint64_t*)(alpha) )[ 8] ) ); \
+   s9 = _mm512_xor_si512( s9, m512_const1_64( ( (uint64_t*)(alpha) )[ 9] ) ); \
+   sA = _mm512_xor_si512( sA, m512_const1_64( ( (uint64_t*)(alpha) )[10] ) ); \
+   sB = _mm512_xor_si512( sB, m512_const1_64( ( (uint64_t*)(alpha) )[11] ) ); \
+   sC = _mm512_xor_si512( sC, m512_const1_64( ( (uint64_t*)(alpha) )[12] ) ); \
+   sD = _mm512_xor_si512( sD, m512_const1_64( ( (uint64_t*)(alpha) )[13] ) ); \
+   sE = _mm512_xor_si512( sE, m512_const1_64( ( (uint64_t*)(alpha) )[14] ) ); \
+   sF = _mm512_xor_si512( sF, m512_const1_64( ( (uint64_t*)(alpha) )[15] ) ); \
+   \
+   SBOX8( s0, s4, s8, sC ); \
+   SBOX8( s1, s5, s9, sD ); \
+   SBOX8( s2, s6, sA, sE ); \
+   SBOX8( s3, s7, sB, sF ); \
+   \
+   t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s4, 4 ), \
+                                         _mm512_bslli_epi128( s5, 4 ) ); \
+   t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sD, 4 ), \
+                                         _mm512_bslli_epi128( sE, 4 ) ); \
+   L8( s0, t1, s9, t3 ); \
+   s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, _mm512_bslli_epi128( t1, 4 ) ); \
+   s5 = _mm512_mask_blend_epi32( 0x5555, s5, _mm512_bsrli_epi128( t1, 4 ) ); \
+   sD = _mm512_mask_blend_epi32( 0xaaaa, sD, _mm512_bslli_epi128( t3, 4 ) ); \
+   sE = _mm512_mask_blend_epi32( 0x5555, sE, _mm512_bsrli_epi128( t3, 4 ) ); \
+   \
+   t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s5, 4 ), \
+                                         _mm512_bslli_epi128( s6, 4 ) ); \
+   t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sE, 4 ), \
+                                         _mm512_bslli_epi128( sF, 4 ) ); \
+   L8( s1, t1, sA, t3 ); \
+   s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, _mm512_bslli_epi128( t1, 4 ) ); \
+   s6 = _mm512_mask_blend_epi32( 0x5555, s6, _mm512_bsrli_epi128( t1, 4 ) ); \
+   sE = _mm512_mask_blend_epi32( 0xaaaa, sE, _mm512_bslli_epi128( t3, 4 ) ); \
+   sF = _mm512_mask_blend_epi32( 0x5555, sF, _mm512_bsrli_epi128( t3, 4 ) ); \
+   \
+   t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s6, 4 ), \
+                                         _mm512_bslli_epi128( s7, 4 ) ); \
+   t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sF, 4 ), \
+                                         _mm512_bslli_epi128( sC, 4 ) ); \
+   L8( s2, t1, sB, t3 ); \
+   s6 = _mm512_mask_blend_epi32( 0xaaaa, s6, _mm512_bslli_epi128( t1, 4 ) ); \
+   s7 = _mm512_mask_blend_epi32( 0x5555, s7, _mm512_bsrli_epi128( t1, 4 ) ); \
+   sF = _mm512_mask_blend_epi32( 0xaaaa, sF, _mm512_bslli_epi128( t3, 4 ) ); \
+   sC = _mm512_mask_blend_epi32( 0x5555, sC, _mm512_bsrli_epi128( t3, 4 ) ); \
+   \
+   t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s7, 4 ), \
+                                         _mm512_bslli_epi128( s4, 4 ) ); \
+   t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( sC, 4 ), \
+                                         _mm512_bslli_epi128( sD, 4 ) ); \
+   L8( s3, t1, s8, t3 ); \
+   s7 = _mm512_mask_blend_epi32( 0xaaaa, s7, _mm512_bslli_epi128( t1, 4 ) ); \
+   s4 = _mm512_mask_blend_epi32( 0x5555, s4, _mm512_bsrli_epi128( t1, 4 ) ); \
+   sC = _mm512_mask_blend_epi32( 0xaaaa, sC, _mm512_bslli_epi128( t3, 4 ) ); \
+   sD = _mm512_mask_blend_epi32( 0x5555, sD, _mm512_bsrli_epi128( t3, 4 ) ); \
+   \
+   t0 = _mm512_mask_blend_epi32( 0xaaaa, s0, _mm512_bslli_epi128( s8, 4 ) ); \
+   t1 = _mm512_mask_blend_epi32( 0xaaaa, s1, s9 ); \
+   t2 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s2, 4 ), sA ); \
+   t3 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s3, 4 ), \
+                                         _mm512_bslli_epi128( sB, 4 ) ); \
+   L8( t0, t1, t2, t3 ); \
+   s0 = _mm512_mask_blend_epi32( 0x5555, s0, t0 ); \
+   s8 = _mm512_mask_blend_epi32( 0x5555, s8, _mm512_bsrli_epi128( t0, 4 ) ); \
+   s1 = _mm512_mask_blend_epi32( 0x5555, s1, t1 ); \
+   s9 = _mm512_mask_blend_epi32( 0xaaaa, s9, t1 ); \
+   s2 = _mm512_mask_blend_epi32( 0xaaaa, s2, _mm512_bslli_epi128( t2, 4 ) ); \
+   sA = _mm512_mask_blend_epi32( 0xaaaa, sA, t2 ); \
+   s3 = _mm512_mask_blend_epi32( 0xaaaa, s3, _mm512_bslli_epi128( t3, 4 ) ); \
+   sB = _mm512_mask_blend_epi32( 0x5555, sB, _mm512_bsrli_epi128( t3, 4 ) ); \
+   \
+   t0 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s4, 4 ), sC ); \
+   t1 = _mm512_mask_blend_epi32( 0xaaaa, _mm512_bsrli_epi128( s5, 4 ), \
+                                         _mm512_bslli_epi128( sD, 4 ) ); \
+   t2 = _mm512_mask_blend_epi32( 0xaaaa, s6, _mm512_bslli_epi128( sE, 4 ) ); \
+   t3 = _mm512_mask_blend_epi32( 0xaaaa, s7, sF ); \
+   L8( t0, t1, t2, t3 ); \
+   s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, _mm512_bslli_epi128( t0, 4 ) ); \
+   sC = _mm512_mask_blend_epi32( 0xaaaa, sC, t0 ); \
+   s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, _mm512_bslli_epi128( t1, 4 ) ); \
+   sD = _mm512_mask_blend_epi32( 0x5555, sD, _mm512_bsrli_epi128( t1, 4 ) ); \
+   s6 = _mm512_mask_blend_epi32( 0x5555, s6, t2 ); \
+   sE = _mm512_mask_blend_epi32( 0x5555, sE, _mm512_bsrli_epi128( t2, 4 ) ); \
+   s7 = _mm512_mask_blend_epi32( 0x5555, s7, t3 ); \
+   sF = _mm512_mask_blend_epi32( 0xaaaa, sF, t3 ); \
+} while (0)
+
+#define P_BIG8 \
+do { \
+   ROUND_BIG8(0, alpha_n); \
+   ROUND_BIG8(1, alpha_n); \
+   ROUND_BIG8(2, alpha_n); \
+   ROUND_BIG8(3, alpha_n); \
+   ROUND_BIG8(4, alpha_n); \
+   ROUND_BIG8(5, alpha_n); \
+} while (0)
+
+#define PF_BIG8 \
+do { \
+   ROUND_BIG8( 0, alpha_f); \
+   ROUND_BIG8( 1, alpha_f); \
+   ROUND_BIG8( 2, alpha_f); \
+   ROUND_BIG8( 3, alpha_f); \
+   ROUND_BIG8( 4, alpha_f); \
+   ROUND_BIG8( 5, alpha_f); \
+   ROUND_BIG8( 6, alpha_f); \
+   ROUND_BIG8( 7, alpha_f); \
+   ROUND_BIG8( 8, alpha_f); \
+   ROUND_BIG8( 9, alpha_f); \
+   ROUND_BIG8(10, alpha_f); \
+   ROUND_BIG8(11, alpha_f); \
+} while (0)
+
+#define T_BIG8 \
+do { /* order is important */ \
+   c7 = sc->h[ 0x7 ] = _mm512_xor_si512( sc->h[ 0x7 ], sB ); \
+   c6 = sc->h[ 0x6 ] = _mm512_xor_si512( sc->h[ 0x6 ], sA ); \
+   c5 = sc->h[ 0x5 ] = _mm512_xor_si512( sc->h[ 0x5 ], s9 ); \
+   c4 = sc->h[ 0x4 ] = _mm512_xor_si512( sc->h[ 0x4 ], s8 ); \
+   c3 = sc->h[ 0x3 ] = _mm512_xor_si512( sc->h[ 0x3 ], s3 ); \
+   c2 = sc->h[ 0x2 ] = _mm512_xor_si512( sc->h[ 0x2 ], s2 ); \
+   c1 = sc->h[ 0x1 ] = _mm512_xor_si512( sc->h[ 0x1 ], s1 ); \
+   c0 = sc->h[ 0x0 ] = _mm512_xor_si512( sc->h[ 0x0 ], s0 ); \
+} while (0)
+
+void hamsi_8way_big( hamsi_8way_big_context *sc, __m512i *buf, size_t num )
+{
+   DECL_STATE_BIG8
+   uint32_t tmp = num << 6;
+
+   sc->count_low = SPH_T32( sc->count_low + tmp );
+   sc->count_high += (sph_u32)( (num >> 13) >> 13 );
+   if ( sc->count_low < tmp )
+      sc->count_high++;
+
+   READ_STATE_BIG8( sc );
+   while ( num-- > 0 )
+   {
+      __m512i m0, m1, m2, m3, m4, m5, m6, m7;
+
+      INPUT_BIG8;
+      P_BIG8;
+      T_BIG8;
+      buf++;
+   }
+   WRITE_STATE_BIG8( sc );
+}
+
+void hamsi_8way_big_final( hamsi_8way_big_context *sc, __m512i *buf )
+{
+   __m512i m0, m1, m2, m3, m4, m5, m6, m7;
+   DECL_STATE_BIG8
+   READ_STATE_BIG8( sc );
+   INPUT_BIG8;
+   PF_BIG8;
+   T_BIG8;
+   WRITE_STATE_BIG8( sc );
+}
+
+
+void hamsi512_8way_init( hamsi_8way_big_context *sc )
+{
+   sc->partial_len = 0;
+   sc->count_high = sc->count_low = 0;
+
+   sc->h[0] = m512_const1_64( 0x6c70617273746565 );
+   sc->h[1] = m512_const1_64( 0x656e62656b204172 );
+   sc->h[2] = m512_const1_64( 0x302c206272672031 );
+   sc->h[3] = m512_const1_64( 0x3434362c75732032 );
+   sc->h[4] = m512_const1_64( 0x3030312020422d33 );
+   sc->h[5] = m512_const1_64( 0x656e2d484c657576 );
+   sc->h[6] = m512_const1_64( 0x6c65652c65766572 );
+   sc->h[7] = m512_const1_64( 0x6769756d2042656c );
+}
+
+void hamsi512_8way_update( hamsi_8way_big_context *sc, const void *data,
+                           size_t len )
+{
+   __m512i *vdata = (__m512i*)data;
+
+   hamsi_8way_big( sc, vdata, len>>3 );
+   vdata += ( (len& ~(size_t)7) >> 3 );
+   len &= (size_t)7;
+   memcpy_512( sc->buf, vdata, len>>3 );
+   sc->partial_len = len;
+}
+
+void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )
+{
+   __m512i pad[1];
+   int ch, cl;
+
+   sph_enc32be( &ch, sc->count_high );
+   sph_enc32be( &cl, sc->count_low + ( sc->partial_len << 3 ) );
+   pad[0] = _mm512_set_epi32( cl, ch, cl, ch, cl, ch, cl, ch,
+                              cl, ch, cl, ch, cl, ch, cl, ch );
+// pad[0] = m512_const2_32( cl, ch );
+   sc->buf[0] = m512_const1_64( 0x80 );
+   hamsi_8way_big( sc, sc->buf, 1 );
+   hamsi_8way_big_final( sc, pad );
+
+   mm512_block_bswap_32( (__m512i*)dst, sc->h );
+}
+
+
+#endif // AVX512
+
+
+// Hamsi 4 way
+
 #define INPUT_BIG \
 do { \
-   const __m256i zero = _mm256_setzero_si256(); \
    __m256i db = *buf; \
-   const sph_u32 *tp = &T512[0][0]; \
-   m0 = zero; \
-   m1 = zero; \
-   m2 = zero; \
-   m3 = zero; \
-   m4 = zero; \
-   m5 = zero; \
-   m6 = zero; \
-   m7 = zero; \
+   const uint64_t *tp = (uint64_t*)&T512[0][0]; \
+   m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = m256_zero; \
    for ( int u = 0; u < 64; u++ ) \
    { \
      __m256i dm = _mm256_and_si256( db, m256_one_64 ) ; \
      dm = mm256_negate_32( _mm256_or_si256( dm, \
                                             _mm256_slli_epi64( dm, 32 ) ) ); \
      m0 = _mm256_xor_si256( m0, _mm256_and_si256( dm, \
-               _mm256_set_epi32( tp[0x1], tp[0x0], tp[0x1], tp[0x0], \
-                                 tp[0x1], tp[0x0], tp[0x1], tp[0x0] ) ) ); \
+                                            m256_const1_64( tp[0] ) ) ); \
      m1 = _mm256_xor_si256( m1, _mm256_and_si256( dm, \
-               _mm256_set_epi32( tp[0x3], tp[0x2], tp[0x3], tp[0x2], \
-                                 tp[0x3], tp[0x2], tp[0x3], tp[0x2] ) ) ); \
+                                            m256_const1_64( tp[1] ) ) ); \
      m2 = _mm256_xor_si256( m2, _mm256_and_si256( dm, \
-               _mm256_set_epi32( tp[0x5], tp[0x4], tp[0x5], tp[0x4], \
-                                 tp[0x5], tp[0x4], tp[0x5], tp[0x4] ) ) ); \
+                                            m256_const1_64( tp[2] ) ) ); \
      m3 = _mm256_xor_si256( m3, _mm256_and_si256( dm, \
-               _mm256_set_epi32( tp[0x7], tp[0x6], tp[0x7], tp[0x6], \
-                                 tp[0x7], tp[0x6], tp[0x7], tp[0x6] ) ) ); \
+                                            m256_const1_64( tp[3] ) ) ); \
      m4 = _mm256_xor_si256( m4, _mm256_and_si256( dm, \
-               _mm256_set_epi32( tp[0x9], tp[0x8], tp[0x9], tp[0x8], \
-                                 tp[0x9], tp[0x8], tp[0x9], tp[0x8] ) ) ); \
+                                            m256_const1_64( tp[4] ) ) ); \
      m5 = _mm256_xor_si256( m5, _mm256_and_si256( dm, \
-               _mm256_set_epi32( tp[0xB], tp[0xA], tp[0xB], tp[0xA], \
-                                 tp[0xB], tp[0xA], tp[0xB], tp[0xA] ) ) ); \
+                                            m256_const1_64( tp[5] ) ) ); \
      m6 = _mm256_xor_si256( m6, _mm256_and_si256( dm, \
-               _mm256_set_epi32( tp[0xD], tp[0xC], tp[0xD], tp[0xC], \
-                                 tp[0xD], tp[0xC], tp[0xD], tp[0xC] ) ) ); \
+                                            m256_const1_64( tp[6] ) ) ); \
      m7 = _mm256_xor_si256( m7, _mm256_and_si256( dm, \
-               _mm256_set_epi32( tp[0xF], tp[0xE], tp[0xF], tp[0xE], \
-                                 tp[0xF], tp[0xE], tp[0xF], tp[0xE] ) ) ); \
+                                            m256_const1_64( tp[7] ) ) ); \
-     tp += 0x10; \
+     tp += 8; \
      db = _mm256_srli_epi64( db, 1 ); \
    } \
 } while (0)
@@ -643,6 +967,7 @@ do { \
    sc->h[0x7] = c7; \
 } while (0)
 
+/*
 #define s0 m0
 #define s1 c0
 #define s2 m1
@@ -659,58 +984,28 @@ do { \
 #define sD m6
 #define sE c7
 #define sF m7
+*/
 
 #define ROUND_BIG(rc, alpha) \
 do { \
    __m256i t0, t1, t2, t3; \
-   s0 = _mm256_xor_si256( s0, _mm256_set_epi32( \
-            alpha[0x01] ^ (rc), alpha[0x00], alpha[0x01] ^ (rc), alpha[0x00], \
-            alpha[0x01] ^ (rc), alpha[0x00], alpha[0x01] ^ (rc), alpha[0x00] ) ); \
-   s1 = _mm256_xor_si256( s1, _mm256_set_epi32( \
-            alpha[0x03], alpha[0x02], alpha[0x03], alpha[0x02], \
-            alpha[0x03], alpha[0x02], alpha[0x03], alpha[0x02] ) ); \
-   s2 = _mm256_xor_si256( s2, _mm256_set_epi32( \
-            alpha[0x05], alpha[0x04], alpha[0x05], alpha[0x04], \
-            alpha[0x05], alpha[0x04], alpha[0x05], alpha[0x04] ) ); \
-   s3 = _mm256_xor_si256( s3, _mm256_set_epi32( \
-            alpha[0x07], alpha[0x06], alpha[0x07], alpha[0x06], \
-            alpha[0x07], alpha[0x06], alpha[0x07], alpha[0x06] ) ); \
-   s4 = _mm256_xor_si256( s4, _mm256_set_epi32( \
-            alpha[0x09], alpha[0x08], alpha[0x09], alpha[0x08], \
-            alpha[0x09], alpha[0x08], alpha[0x09], alpha[0x08] ) ); \
-   s5 = _mm256_xor_si256( s5, _mm256_set_epi32( \
-            alpha[0x0B], alpha[0x0A], alpha[0x0B], alpha[0x0A], \
-            alpha[0x0B], alpha[0x0A], alpha[0x0B], alpha[0x0A] ) ); \
-   s6 = _mm256_xor_si256( s6, _mm256_set_epi32( \
-            alpha[0x0D], alpha[0x0C], alpha[0x0D], alpha[0x0C], \
-            alpha[0x0D], alpha[0x0C], alpha[0x0D], alpha[0x0C] ) ); \
-   s7 = _mm256_xor_si256( s7, _mm256_set_epi32( \
-            alpha[0x0F], alpha[0x0E], alpha[0x0F], alpha[0x0E], \
-            alpha[0x0F], alpha[0x0E], alpha[0x0F], alpha[0x0E] ) ); \
-   s8 = _mm256_xor_si256( s8, _mm256_set_epi32( \
-            alpha[0x11], alpha[0x10], alpha[0x11], alpha[0x10], \
-            alpha[0x11], alpha[0x10], alpha[0x11], alpha[0x10] ) ); \
-   s9 = _mm256_xor_si256( s9, _mm256_set_epi32( \
-            alpha[0x13], alpha[0x12], alpha[0x13], alpha[0x12], \
-            alpha[0x13], alpha[0x12], alpha[0x13], alpha[0x12] ) ); \
-   sA = _mm256_xor_si256( sA, _mm256_set_epi32( \
-            alpha[0x15], alpha[0x14], alpha[0x15], alpha[0x14], \
-            alpha[0x15], alpha[0x14], alpha[0x15], alpha[0x14] ) ); \
-   sB = _mm256_xor_si256( sB, _mm256_set_epi32( \
-            alpha[0x17], alpha[0x16], alpha[0x17], alpha[0x16], \
-            alpha[0x17], alpha[0x16], alpha[0x17], alpha[0x16] ) ); \
-   sC = _mm256_xor_si256( sC, _mm256_set_epi32( \
-            alpha[0x19], alpha[0x18], alpha[0x19], alpha[0x18], \
-            alpha[0x19], alpha[0x18], alpha[0x19], alpha[0x18] ) ); \
-   sD = _mm256_xor_si256( sD, _mm256_set_epi32( \
-            alpha[0x1B], alpha[0x1A], alpha[0x1B], alpha[0x1A], \
-            alpha[0x1B], alpha[0x1A], alpha[0x1B], alpha[0x1A] ) ); \
-   sE = _mm256_xor_si256( sE, _mm256_set_epi32( \
-            alpha[0x1D], alpha[0x1C], alpha[0x1D], alpha[0x1C], \
-            alpha[0x1D], alpha[0x1C], alpha[0x1D], alpha[0x1C] ) ); \
-   sF = _mm256_xor_si256( sF, _mm256_set_epi32( \
-            alpha[0x1F], alpha[0x1E], alpha[0x1F], alpha[0x1E], \
-            alpha[0x1F], alpha[0x1E], alpha[0x1F], alpha[0x1E] ) ); \
+   s0 = _mm256_xor_si256( s0, m256_const1_64( \
+                 ( (uint64_t)(rc) << 32 ) ^ ( (uint64_t*)(alpha) )[ 0] ) ); \
+   s1 = _mm256_xor_si256( s1, m256_const1_64( ( (uint64_t*)(alpha) )[ 1] ) ); \
+   s2 = _mm256_xor_si256( s2, m256_const1_64( ( (uint64_t*)(alpha) )[ 2] ) ); \
+   s3 = _mm256_xor_si256( s3, m256_const1_64( ( (uint64_t*)(alpha) )[ 3] ) ); \
+   s4 = _mm256_xor_si256( s4, m256_const1_64( ( (uint64_t*)(alpha) )[ 4] ) ); \
+   s5 = _mm256_xor_si256( s5, m256_const1_64( ( (uint64_t*)(alpha) )[ 5] ) ); \
+   s6 = _mm256_xor_si256( s6, m256_const1_64( ( (uint64_t*)(alpha) )[ 6] ) ); \
+   s7 = _mm256_xor_si256( s7, m256_const1_64( ( (uint64_t*)(alpha) )[ 7] ) ); \
+   s8 = _mm256_xor_si256( s8, m256_const1_64( ( (uint64_t*)(alpha) )[ 8] ) ); \
+   s9 = _mm256_xor_si256( s9, m256_const1_64( ( (uint64_t*)(alpha) )[ 9] ) ); \
+   sA = _mm256_xor_si256( sA, m256_const1_64( ( (uint64_t*)(alpha) )[10] ) ); \
+   sB = _mm256_xor_si256( sB, m256_const1_64( ( (uint64_t*)(alpha) )[11] ) ); \
+   sC = _mm256_xor_si256( sC, m256_const1_64( ( (uint64_t*)(alpha) )[12] ) ); \
+   sD = _mm256_xor_si256( sD, m256_const1_64( ( (uint64_t*)(alpha) )[13] ) ); \
+   sE = _mm256_xor_si256( sE, m256_const1_64( ( (uint64_t*)(alpha) )[14] ) ); \
+   sF = _mm256_xor_si256( sF, m256_const1_64( ( (uint64_t*)(alpha) )[15] ) ); \
   \
    SBOX( s0, s4, s8, sC ); \
    SBOX( s1, s5, s9, sD ); \
@@ -864,47 +1159,23 @@ void hamsi_big_final( hamsi_4way_big_context *sc, __m256i *buf )

 void hamsi512_4way_init( hamsi_4way_big_context *sc )
 {
    sc->partial_len = 0;
-   sph_u32 lo, hi;
    sc->count_high = sc->count_low = 0;
-   for ( int i = 0; i < 8; i++ )
-   {
-      lo = 2*i;
-      hi = 2*i + 1;
-      sc->h[i] = _mm256_set_epi32( IV512[hi], IV512[lo], IV512[hi], IV512[lo],
-                                   IV512[hi], IV512[lo], IV512[hi], IV512[lo] );
-   }
+   sc->h[0] = m256_const1_64( 0x6c70617273746565 );
+   sc->h[1] = m256_const1_64( 0x656e62656b204172 );
+   sc->h[2] = m256_const1_64( 0x302c206272672031 );
+   sc->h[3] = m256_const1_64( 0x3434362c75732032 );
+   sc->h[4] = m256_const1_64( 0x3030312020422d33 );
+   sc->h[5] = m256_const1_64( 0x656e2d484c657576 );
+   sc->h[6] = m256_const1_64( 0x6c65652c65766572 );
+   sc->h[7] = m256_const1_64( 0x6769756d2042656c );
 }

-void hamsi512_4way( hamsi_4way_big_context *sc, const void *data, size_t len )
+void hamsi512_4way_update( hamsi_4way_big_context *sc, const void *data,
+                           size_t len )
 {
    __m256i *vdata = (__m256i*)data;

-// It looks like the only way to get in here is if core was previously called
-// with a very small len
-// That's not likely even with 80 byte input so deprecate partial len
-/*
-   if ( sc->partial_len != 0 )
-   {
-      size_t mlen;
-
-      mlen = 8 - sc->partial_len;
-      if ( len < mlen )
-      {
-         memcpy_256( sc->partial + (sc->partial_len >> 3), data, len>>3 );
-         sc->partial_len += len;
-         return;
-      }
-      else
-      {
-         memcpy_256( sc->partial + (sc->partial_len >> 3), data, mlen>>3 );
-         len -= mlen;
-         vdata += mlen>>3;
-         hamsi_big( sc, sc->partial, 1 );
-         sc->partial_len = 0;
-      }
-   }
-*/

    hamsi_big( sc, vdata, len>>3 );
    vdata += ( (len& ~(size_t)7) >> 3 );
    len &= (size_t)7;
@@ -920,8 +1191,9 @@ void hamsi512_4way_close( hamsi_4way_big_context *sc, void *dst )
    sph_enc32be( &ch, sc->count_high );
    sph_enc32be( &cl, sc->count_low + ( sc->partial_len << 3 ) );
    pad[0] = _mm256_set_epi32( cl, ch, cl, ch, cl, ch, cl, ch );
-   sc->buf[0] = _mm256_set_epi32( 0UL, 0x80UL, 0UL, 0x80UL,
-                                  0UL, 0x80UL, 0UL, 0x80UL );
+   sc->buf[0] = m256_const1_64( 0x80 );
+// sc->buf[0] = _mm256_set_epi32( 0UL, 0x80UL, 0UL, 0x80UL,
+//                                0UL, 0x80UL, 0UL, 0x80UL );
    hamsi_big( sc, sc->buf, 1 );
    hamsi_big_final( sc, pad );

@@ -60,9 +60,32 @@ typedef struct {
 typedef hamsi_4way_big_context hamsi512_4way_context;

 void hamsi512_4way_init( hamsi512_4way_context *sc );
-void hamsi512_4way( hamsi512_4way_context *sc, const void *data, size_t len );
+void hamsi512_4way_update( hamsi512_4way_context *sc, const void *data,
+                           size_t len );
+//#define hamsi512_4way hamsi512_4way_update
 void hamsi512_4way_close( hamsi512_4way_context *sc, void *dst );

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+typedef struct {
+   __m512i h[8];
+   __m512i buf[1];
+   size_t partial_len;
+   sph_u32 count_high, count_low;
+} hamsi_8way_big_context;
+
+typedef hamsi_8way_big_context hamsi512_8way_context;
+
+void hamsi512_8way_init( hamsi512_8way_context *sc );
+void hamsi512_8way_update( hamsi512_8way_context *sc, const void *data,
+                           size_t len );
+void hamsi512_8way_close( hamsi512_8way_context *sc, void *dst );
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif

@@ -38,7 +38,7 @@
 #define SPH_XCAT_(a, b)   a ## b

 static void
-SPH_XCAT(SPH_XCAT(haval, PASSES), _4way)
+SPH_XCAT(SPH_XCAT(haval, PASSES), _4way_update)
 ( haval_4way_context *sc, const void *data, size_t len )
 {
    __m128i *vdata = (__m128i*)data;

algo/haval/haval-8way-helper.c  (new file, 115 lines)
@@ -0,0 +1,115 @@
+/* $Id: haval_helper.c 218 2010-06-08 17:06:34Z tp $ */
+/*
+ * Helper code, included (three times !) by HAVAL implementation.
+ *
+ * TODO: try to merge this with md_helper.c.
+ *
+ * ==========================(LICENSE BEGIN)============================
+ *
+ * Copyright (c) 2007-2010  Projet RNRT SAPHIR
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining
+ * a copy of this software and associated documentation files (the
+ * "Software"), to deal in the Software without restriction, including
+ * without limitation the rights to use, copy, modify, merge, publish,
+ * distribute, sublicense, and/or sell copies of the Software, and to
+ * permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
+ * included in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * ===========================(LICENSE END)=============================
+ *
+ * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
+ */
+
+#undef SPH_XCAT
+#define SPH_XCAT(a, b)    SPH_XCAT_(a, b)
+#undef SPH_XCAT_
+#define SPH_XCAT_(a, b)   a ## b
+
+static void
+SPH_XCAT(SPH_XCAT(haval, PASSES), _8way_update)
+( haval_8way_context *sc, const void *data, size_t len )
+{
+   __m256i *vdata = (__m256i*)data;
+   unsigned current;
+
+   current = (unsigned)sc->count_low & 127U;
+   while ( len > 0 )
+   {
+      unsigned clen;
+      uint32_t clow, clow2;
+
+      clen = 128U - current;
+      if ( clen > len )
+         clen = len;
+      memcpy_256( sc->buf + (current>>2), vdata, clen>>2 );
+      vdata += clen>>2;
+      current += clen;
+      len -= clen;
+      if ( current == 128U )
+      {
+         DSTATE_8W;
+         IN_PREPARE_8W(sc->buf);
+         RSTATE_8W;
+         SPH_XCAT(CORE_8W, PASSES)(INW_8W);
+         WSTATE_8W;
+         current = 0;
+      }
+      clow = sc->count_low;
+      clow2 = clow + clen;
+      sc->count_low = clow2;
+      if ( clow2 < clow )
+         sc->count_high ++;
+   }
+}
+
+static void
+SPH_XCAT(SPH_XCAT(haval, PASSES), _8way_close)( haval_8way_context *sc,
+                                                void *dst)
+{
+   unsigned current;
+   DSTATE_8W;
+
+   current = (unsigned)sc->count_low & 127UL;
+
+   sc->buf[ current>>2 ] = m256_one_32;
+   current += 4;
+   RSTATE_8W;
+   if ( current > 116UL )
+   {
+      memset_zero_256( sc->buf + ( current>>2 ), (128UL-current) >> 2 );
+      do
+      {
+         IN_PREPARE_8W(sc->buf);
+         SPH_XCAT(CORE_8W, PASSES)(INW_8W);
+      } while (0);
+      current = 0;
+   }
+
+   uint32_t t1, t2;
+   memset_zero_256( sc->buf + ( current>>2 ), (116UL-current) >> 2 );
+   t1 = 0x01 | (PASSES << 3);
+   t2 = sc->olen << 3;
+   sc->buf[ 116>>2 ] = _mm256_set1_epi32( ( t1 << 16 ) | ( t2 << 24 ) );
+   sc->buf[ 120>>2 ] = _mm256_set1_epi32( sc->count_low << 3 );
+   sc->buf[ 124>>2 ] = _mm256_set1_epi32( (sc->count_high << 3)
+                                        | (sc->count_low >> 29) );
+   do
+   {
+      IN_PREPARE_8W(sc->buf);
+      SPH_XCAT(CORE_8W, PASSES)(INW_8W);
+   } while (0);
+   WSTATE_8W;
+   haval_8way_out( sc, dst );
+}
@@ -40,7 +40,7 @@
 #include <string.h>
 #include "haval-hash-4way.h"

-// won't compile with sse4.2
+// won't compile with sse4.2, not a problem, it's only used with AVX2 4 way.
 //#if defined (__SSE4_2__)
 #if defined(__AVX__)

@@ -479,9 +479,9 @@ haval ## xxx ## _ ## y ## _4way_init(void *cc) \
 } \
 \
 void \
-haval ## xxx ## _ ## y ## _4way (void *cc, const void *data, size_t len) \
+haval ## xxx ## _ ## y ## _4way_update (void *cc, const void *data, size_t len) \
 { \
-   haval ## y ## _4way(cc, data, len); \
+   haval ## y ## _4way_update(cc, data, len); \
 } \
 \
 void \
@@ -518,6 +518,301 @@ do { \

 #define INMSG(i)   msg[i]

+#if defined(__AVX2__)
+
+// Haval-256 8 way 32 bit avx2
+
+#define F1_8W(x6, x5, x4, x3, x2, x1, x0) \
+   _mm256_xor_si256( x0, \
+      _mm256_xor_si256( _mm256_and_si256(_mm256_xor_si256( x0, x4 ), x1 ), \
+                        _mm256_xor_si256( _mm256_and_si256( x2, x5 ), \
+                                          _mm256_and_si256( x3, x6 ) ) ) ) \
+
+#define F2_8W(x6, x5, x4, x3, x2, x1, x0) \
+   _mm256_xor_si256( \
+      _mm256_and_si256( x2, \
+         _mm256_xor_si256( _mm256_andnot_si256( x3, x1 ), \
+            _mm256_xor_si256( _mm256_and_si256( x4, x5 ), \
+                              _mm256_xor_si256( x6, x0 ) ) ) ), \
+      _mm256_xor_si256( \
+          _mm256_and_si256( x4, _mm256_xor_si256( x1, x5 ) ), \
+          _mm256_xor_si256( _mm256_and_si256( x3, x5 ), x0 ) ) ) \
+
+#define F3_8W(x6, x5, x4, x3, x2, x1, x0) \
+   _mm256_xor_si256( \
+      _mm256_and_si256( x3, \
+         _mm256_xor_si256( _mm256_and_si256( x1, x2 ), \
+                           _mm256_xor_si256( x6, x0 ) ) ), \
+      _mm256_xor_si256( _mm256_xor_si256(_mm256_and_si256( x1, x4 ), \
+                                         _mm256_and_si256( x2, x5 ) ), x0 ) )
+
+#define F4_8W(x6, x5, x4, x3, x2, x1, x0) \
+   _mm256_xor_si256( \
+      _mm256_xor_si256( \
+         _mm256_and_si256( x3, \
+            _mm256_xor_si256( _mm256_xor_si256( _mm256_and_si256( x1, x2 ), \
+                                                _mm256_or_si256( x4, x6 ) ), x5 ) ), \
+         _mm256_and_si256( x4, \
+            _mm256_xor_si256( _mm256_xor_si256( _mm256_and_si256( mm256_not(x2), x5 ), \
+                              _mm256_xor_si256( x1, x6 ) ), x0 ) ) ), \
+      _mm256_xor_si256( _mm256_and_si256( x2, x6 ), x0 ) )
+
+
+#define F5_8W(x6, x5, x4, x3, x2, x1, x0) \
+   _mm256_xor_si256( \
+      _mm256_and_si256( x0, \
+         mm256_not( _mm256_xor_si256( \
+               _mm256_and_si256( _mm256_and_si256( x1, x2 ), x3 ), x5 ) ) ), \
+      _mm256_xor_si256( _mm256_xor_si256( _mm256_and_si256( x1, x4 ), \
+                                          _mm256_and_si256( x2, x5 ) ), \
+                        _mm256_and_si256( x3, x6 ) ) )
+
+#define FP3_1_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F1_8W(x1, x0, x3, x5, x6, x2, x4)
+#define FP3_2_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F2_8W(x4, x2, x1, x0, x5, x3, x6)
+#define FP3_3_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F3_8W(x6, x1, x2, x3, x4, x5, x0)
+
+#define FP4_1_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F1_8W(x2, x6, x1, x4, x5, x3, x0)
+#define FP4_2_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F2_8W(x3, x5, x2, x0, x1, x6, x4)
+#define FP4_3_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F3_8W(x1, x4, x3, x6, x0, x2, x5)
+#define FP4_4_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F4_8W(x6, x4, x0, x5, x2, x1, x3)
+
+#define FP5_1_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F1_8W(x3, x4, x1, x0, x5, x2, x6)
+#define FP5_2_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F2_8W(x6, x2, x1, x0, x3, x4, x5)
+#define FP5_3_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F3_8W(x2, x6, x0, x4, x3, x1, x5)
+#define FP5_4_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F4_8W(x1, x5, x3, x2, x0, x4, x6)
+#define FP5_5_8W(x6, x5, x4, x3, x2, x1, x0) \
+   F5_8W(x2, x5, x0, x6, x4, x3, x1)
+
+#define STEP_8W(n, p, x7, x6, x5, x4, x3, x2, x1, x0, w, c) \
+do { \
+   __m256i t = FP ## n ## _ ## p ## _8W(x6, x5, x4, x3, x2, x1, x0); \
+   x7 = _mm256_add_epi32( _mm256_add_epi32( mm256_ror_32( t, 7 ), \
+                                            mm256_ror_32( x7, 11 ) ), \
+                          _mm256_add_epi32( w, _mm256_set1_epi32( c ) ) ); \
+} while (0)
+
+#define PASS1_8W(n, in)   do { \
+   unsigned pass_count; \
+   for (pass_count = 0; pass_count < 32; pass_count += 8) { \
+      STEP_8W(n, 1, s7, s6, s5, s4, s3, s2, s1, s0, \
+         in(pass_count + 0), SPH_C32(0x00000000)); \
+      STEP_8W(n, 1, s6, s5, s4, s3, s2, s1, s0, s7, \
+         in(pass_count + 1), SPH_C32(0x00000000)); \
+      STEP_8W(n, 1, s5, s4, s3, s2, s1, s0, s7, s6, \
+         in(pass_count + 2), SPH_C32(0x00000000)); \
+      STEP_8W(n, 1, s4, s3, s2, s1, s0, s7, s6, s5, \
+         in(pass_count + 3), SPH_C32(0x00000000)); \
+      STEP_8W(n, 1, s3, s2, s1, s0, s7, s6, s5, s4, \
+         in(pass_count + 4), SPH_C32(0x00000000)); \
+      STEP_8W(n, 1, s2, s1, s0, s7, s6, s5, s4, s3, \
+         in(pass_count + 5), SPH_C32(0x00000000)); \
+      STEP_8W(n, 1, s1, s0, s7, s6, s5, s4, s3, s2, \
+         in(pass_count + 6), SPH_C32(0x00000000)); \
+      STEP_8W(n, 1, s0, s7, s6, s5, s4, s3, s2, s1, \
+         in(pass_count + 7), SPH_C32(0x00000000)); \
+   } \
+} while (0)
+
+#define PASSG_8W(p, n, in)   do { \
+   unsigned pass_count; \
+   for (pass_count = 0; pass_count < 32; pass_count += 8) { \
+      STEP_8W(n, p, s7, s6, s5, s4, s3, s2, s1, s0, \
+         in(MP ## p[pass_count + 0]), \
+         RK ## p[pass_count + 0]); \
+      STEP_8W(n, p, s6, s5, s4, s3, s2, s1, s0, s7, \
+         in(MP ## p[pass_count + 1]), \
+         RK ## p[pass_count + 1]); \
+      STEP_8W(n, p, s5, s4, s3, s2, s1, s0, s7, s6, \
+         in(MP ## p[pass_count + 2]), \
+         RK ## p[pass_count + 2]); \
+      STEP_8W(n, p, s4, s3, s2, s1, s0, s7, s6, s5, \
+         in(MP ## p[pass_count + 3]), \
+         RK ## p[pass_count + 3]); \
+      STEP_8W(n, p, s3, s2, s1, s0, s7, s6, s5, s4, \
+         in(MP ## p[pass_count + 4]), \
+         RK ## p[pass_count + 4]); \
+      STEP_8W(n, p, s2, s1, s0, s7, s6, s5, s4, s3, \
+         in(MP ## p[pass_count + 5]), \
+         RK ## p[pass_count + 5]); \
+      STEP_8W(n, p, s1, s0, s7, s6, s5, s4, s3, s2, \
+         in(MP ## p[pass_count + 6]), \
+         RK ## p[pass_count + 6]); \
+      STEP_8W(n, p, s0, s7, s6, s5, s4, s3, s2, s1, \
+         in(MP ## p[pass_count + 7]), \
+         RK ## p[pass_count + 7]); \
+   } \
+} while (0)
+
+#define PASS2_8W(n, in)   PASSG_8W(2, n, in)
+#define PASS3_8W(n, in)   PASSG_8W(3, n, in)
+#define PASS4_8W(n, in)   PASSG_8W(4, n, in)
+#define PASS5_8W(n, in)   PASSG_8W(5, n, in)
+
+#define SAVE_STATE_8W \
+   __m256i u0, u1, u2, u3, u4, u5, u6, u7; \
+   do { \
+      u0 = s0; \
+      u1 = s1; \
+      u2 = s2; \
+      u3 = s3; \
+      u4 = s4; \
+      u5 = s5; \
+      u6 = s6; \
+      u7 = s7; \
+   } while (0)
+
+#define UPDATE_STATE_8W \
+   do { \
+      s0 = _mm256_add_epi32( s0, u0 ); \
+      s1 = _mm256_add_epi32( s1, u1 ); \
+      s2 = _mm256_add_epi32( s2, u2 ); \
+      s3 = _mm256_add_epi32( s3, u3 ); \
+      s4 = _mm256_add_epi32( s4, u4 ); \
+      s5 = _mm256_add_epi32( s5, u5 ); \
+      s6 = _mm256_add_epi32( s6, u6 ); \
+      s7 = _mm256_add_epi32( s7, u7 ); \
+   } while (0)
+
+#define CORE_8W5(in)  do { \
+   SAVE_STATE_8W; \
+   PASS1_8W(5, in); \
+   PASS2_8W(5, in); \
+   PASS3_8W(5, in); \
+   PASS4_8W(5, in); \
+   PASS5_8W(5, in); \
+   UPDATE_STATE_8W; \
+} while (0)
+
+#define DSTATE_8W   __m256i s0, s1, s2, s3, s4, s5, s6, s7
+
+#define RSTATE_8W \
+do { \
+   s0 = sc->s0; \
+   s1 = sc->s1; \
+   s2 = sc->s2; \
+   s3 = sc->s3; \
+   s4 = sc->s4; \
+   s5 = sc->s5; \
+   s6 = sc->s6; \
+   s7 = sc->s7; \
+} while (0)
+
+#define WSTATE_8W \
+do { \
+   sc->s0 = s0; \
+   sc->s1 = s1; \
+   sc->s2 = s2; \
+   sc->s3 = s3; \
+   sc->s4 = s4; \
+   sc->s5 = s5; \
+   sc->s6 = s6; \
+   sc->s7 = s7; \
+} while (0)
+
+static void
+haval_8way_init( haval_8way_context *sc, unsigned olen, unsigned passes )
+{
+   sc->s0 = m256_const1_32( 0x243F6A88UL );
+   sc->s1 = m256_const1_32( 0x85A308D3UL );
+   sc->s2 = m256_const1_32( 0x13198A2EUL );
+   sc->s3 = m256_const1_32( 0x03707344UL );
+   sc->s4 = m256_const1_32( 0xA4093822UL );
+   sc->s5 = m256_const1_32( 0x299F31D0UL );
+   sc->s6 = m256_const1_32( 0x082EFA98UL );
+   sc->s7 = m256_const1_32( 0xEC4E6C89UL );
+   sc->olen = olen;
+   sc->passes = passes;
+   sc->count_high = 0;
+   sc->count_low = 0;
+
+}
+#define IN_PREPARE_8W(indata) const __m256i *const load_ptr_8w = (indata)
+
+#define INW_8W(i)   load_ptr_8w[ i ]
+
+static void
+haval_8way_out( haval_8way_context *sc, void *dst )
+{
+   __m256i *buf = (__m256i*)dst;
+   DSTATE_8W;
+   RSTATE_8W;
+
+   buf[0] = s0;
+   buf[1] = s1;
+   buf[2] = s2;
+   buf[3] = s3;
+   buf[4] = s4;
+   buf[5] = s5;
+   buf[6] = s6;
+   buf[7] = s7;
+}
+
+#undef PASSES
+#define PASSES   5
+#include "haval-8way-helper.c"
+
+#define API_8W(xxx, y) \
+void \
+haval ## xxx ## _ ## y ## _8way_init(void *cc) \
+{ \
+   haval_8way_init(cc, xxx >> 5, y); \
+} \
+\
+void \
+haval ## xxx ## _ ## y ## _8way_update (void *cc, const void *data, size_t len) \
+{ \
+   haval ## y ## _8way_update(cc, data, len); \
+} \
+\
+void \
+haval ## xxx ## _ ## y ## _8way_close(void *cc, void *dst) \
+{ \
+   haval ## y ## _8way_close(cc, dst); \
+} \
+
+API_8W(256, 5)
+
+#define RVAL_8W \
+do { \
+   s0 = val[0]; \
+   s1 = val[1]; \
+   s2 = val[2]; \
+   s3 = val[3]; \
+   s4 = val[4]; \
+   s5 = val[5]; \
+   s6 = val[6]; \
+   s7 = val[7]; \
+} while (0)
+
+#define WVAL_8W \
+do { \
+   val[0] = s0; \
+   val[1] = s1; \
+   val[2] = s2; \
+   val[3] = s3; \
+   val[4] = s4; \
+   val[5] = s5; \
+   val[6] = s6; \
+   val[7] = s7; \
+} while (0)
+
+#define INMSG_8W(i) msg[i]
+
+
+#endif // AVX2
+
 #ifdef __cplusplus
 }
 #endif

@@ -59,7 +59,7 @@
 */

 #ifndef HAVAL_HASH_4WAY_H__
-#define HAVAL_HASH_4WAY_H__
+#define HAVAL_HASH_4WAY_H__ 1

 #if defined(__AVX__)

@@ -84,10 +84,30 @@ typedef haval_4way_context haval256_5_4way_context;

 void haval256_5_4way_init( void *cc );

-void haval256_5_4way( void *cc, const void *data, size_t len );
+void haval256_5_4way_update( void *cc, const void *data, size_t len );
+//#define haval256_5_4way haval256_5_4way_update

 void haval256_5_4way_close( void *cc, void *dst );

+#if defined(__AVX2__)
+
+typedef struct {
+   __m256i buf[32];
+   __m256i s0, s1, s2, s3, s4, s5, s6, s7;
+   unsigned olen, passes;
+   uint32_t count_high, count_low;
+} haval_8way_context __attribute__ ((aligned (64)));
+
+typedef haval_8way_context haval256_5_8way_context;
+
+void haval256_5_8way_init( void *cc );
+
+void haval256_5_8way_update( void *cc, const void *data, size_t len );
+
+void haval256_5_8way_close( void *cc, void *dst );
+
+#endif // AVX2
+
 #ifdef __cplusplus
 }
 #endif

@@ -1,13 +1,10 @@
 #include "algo-gate-api.h"
-
 #include <stdio.h>
 #include <string.h>
 #include <openssl/sha.h>
 #include <stdint.h>
 #include <stdlib.h>
-
 #include "sph_hefty1.h"
-
 #include "algo/luffa/sph_luffa.h"
 #include "algo/fugue/sph_fugue.h"
 #include "algo/skein/sph_skein.h"
@@ -16,9 +13,7 @@
 #include "algo/echo/sph_echo.h"
 #include "algo/hamsi/sph_hamsi.h"
 #include "algo/luffa/luffa_for_sse2.h"
-#include "algo/skein/sse2/skein.c"
-
-#ifndef NO_AES_NI
+#ifdef __AES__
 #include "algo/echo/aes_ni/hash_api.h"
 #endif

@@ -26,29 +21,23 @@ void bastionhash(void *output, const void *input)
 {
    unsigned char hash[64] __attribute__ ((aligned (64)));

-#ifdef NO_AES_NI
-   sph_echo512_context ctx_echo;
-#else
+#ifdef __AES__
    hashState_echo ctx_echo;
+#else
+   sph_echo512_context ctx_echo;
 #endif
    hashState_luffa ctx_luffa;
    sph_fugue512_context ctx_fugue;
    sph_whirlpool_context ctx_whirlpool;
    sph_shabal512_context ctx_shabal;
    sph_hamsi512_context ctx_hamsi;
-   unsigned char hashbuf[128] __attribute__ ((aligned (16)));
-   sph_u64 hashctA;
-// sph_u64 hashctB;
-   size_t hashptr;
+   sph_skein512_context ctx_skein;

    HEFTY1(input, 80, hash);

    init_luffa( &ctx_luffa, 512 );
    update_and_final_luffa( &ctx_luffa, (BitSequence*)hash,
                            (const BitSequence*)hash, 64 );
-// update_luffa( &ctx_luffa, hash, 64 );
-// final_luffa( &ctx_luffa, hash );

    if (hash[0] & 0x8)
    {
@@ -56,10 +45,9 @@ void bastionhash(void *output, const void *input)
       sph_fugue512(&ctx_fugue, hash, 64);
       sph_fugue512_close(&ctx_fugue, hash);
    } else {
-      DECL_SKN;
-      SKN_I;
-      SKN_U;
-      SKN_C;
+      sph_skein512_init( &ctx_skein );
+      sph_skein512( &ctx_skein, hash, 64 );
+      sph_skein512_close( &ctx_skein, hash );
    }

    sph_whirlpool_init(&ctx_whirlpool);
@@ -72,33 +60,28 @@ void bastionhash(void *output, const void *input)

    if (hash[0] & 0x8)
    {
-#ifdef NO_AES_NI
-      sph_echo512_init(&ctx_echo);
-      sph_echo512(&ctx_echo, hash, 64);
-      sph_echo512_close(&ctx_echo, hash);
-#else
+#ifdef __AES__
       init_echo( &ctx_echo, 512 );
       update_final_echo ( &ctx_echo,(BitSequence*)hash,
                           (const BitSequence*)hash, 512 );
-// update_echo ( &ctx_echo, hash, 512 );
-// final_echo( &ctx_echo, hash );
+#else
+      sph_echo512_init(&ctx_echo);
+      sph_echo512(&ctx_echo, hash, 64);
+      sph_echo512_close(&ctx_echo, hash);
 #endif
    } else {
       init_luffa( &ctx_luffa, 512 );
       update_and_final_luffa( &ctx_luffa, (BitSequence*)hash,
                               (const BitSequence*)hash, 64 );
-// update_luffa( &ctx_luffa, hash, 64 );
-// final_luffa( &ctx_luffa, hash );
    }

    sph_shabal512_init(&ctx_shabal);
    sph_shabal512(&ctx_shabal, hash, 64);
    sph_shabal512_close(&ctx_shabal, hash);

-   DECL_SKN;
-   SKN_I;
-   SKN_U;
-   SKN_C;
+   sph_skein512_init( &ctx_skein );
+   sph_skein512( &ctx_skein, hash, 64 );
+   sph_skein512_close( &ctx_skein, hash );

    if (hash[0] & 0x8)
    {
@@ -124,8 +107,6 @@ void bastionhash(void *output, const void *input)
       init_luffa( &ctx_luffa, 512 );
       update_and_final_luffa( &ctx_luffa, (BitSequence*)hash,
                               (const BitSequence*)hash, 64 );
-// update_luffa( &ctx_luffa, hash, 64 );
-// final_luffa( &ctx_luffa, hash );
    }

    memcpy(output, hash, 32);
@@ -152,10 +133,8 @@ int scanhash_bastion( struct work *work, uint32_t max_nonce,
       be32enc(&endiandata[19], n);
       bastionhash(hash32, endiandata);
       if (hash32[7] < Htarg && fulltest(hash32, ptarget)) {
-         work_set_target_ratio(work, hash32);
-         *hashes_done = n - first_nonce + 1;
          pdata[19] = n;
-         return true;
+         submit_solution( work, hash32, mythr );
       }
       n++;

|||||||
@@ -161,7 +161,7 @@ bool register_hodl_algo( algo_gate_t* gate )
|
|||||||
// return false;
|
// return false;
|
||||||
// }
|
// }
|
||||||
pthread_barrier_init( &hodl_barrier, NULL, opt_n_threads );
|
pthread_barrier_init( &hodl_barrier, NULL, opt_n_threads );
|
||||||
gate->optimizations = AES_OPT | AVX_OPT | AVX2_OPT;
|
gate->optimizations = SSE42_OPT | AES_OPT | AVX2_OPT;
|
||||||
gate->scanhash = (void*)&hodl_scanhash;
|
gate->scanhash = (void*)&hodl_scanhash;
|
||||||
gate->get_new_work = (void*)&hodl_get_new_work;
|
gate->get_new_work = (void*)&hodl_get_new_work;
|
||||||
gate->longpoll_rpc_call = (void*)&hodl_longpoll_rpc_call;
|
gate->longpoll_rpc_call = (void*)&hodl_longpoll_rpc_call;
|
||||||
|
|||||||
@@ -41,60 +41,45 @@
 extern "C"{
 #endif


-#if SPH_SMALL_FOOTPRINT && !defined SPH_SMALL_FOOTPRINT_JH
-#define SPH_SMALL_FOOTPRINT_JH 1
-#endif
-
-#if !defined SPH_JH_64 && SPH_64_TRUE
-#define SPH_JH_64 1
-#endif
-
-#if !SPH_64
-#undef SPH_JH_64
-#endif

 #ifdef _MSC_VER
 #pragma warning (disable: 4146)
 #endif

-/*
- * The internal bitslice representation may use either big-endian or
- * little-endian (true bitslice operations do not care about the bit
- * ordering, and the bit-swapping linear operations in JH happen to
- * be invariant through endianness-swapping). The constants must be
- * defined according to the chosen endianness; we use some
- * byte-swapping macros for that.
- */
-
-#if SPH_LITTLE_ENDIAN
-
-#if SPH_64
-#define C64e(x) ((SPH_C64(x) >> 56) \
-	| ((SPH_C64(x) >> 40) & SPH_C64(0x000000000000FF00)) \
-	| ((SPH_C64(x) >> 24) & SPH_C64(0x0000000000FF0000)) \
-	| ((SPH_C64(x) >> 8) & SPH_C64(0x00000000FF000000)) \
-	| ((SPH_C64(x) << 8) & SPH_C64(0x000000FF00000000)) \
-	| ((SPH_C64(x) << 24) & SPH_C64(0x0000FF0000000000)) \
-	| ((SPH_C64(x) << 40) & SPH_C64(0x00FF000000000000)) \
-	| ((SPH_C64(x) << 56) & SPH_C64(0xFF00000000000000)))
-#define dec64e_aligned sph_dec64le_aligned
-#define enc64e sph_enc64le
-#endif
-
-#else
-
-#if SPH_64
-#define C64e(x) SPH_C64(x)
-#define dec64e_aligned sph_dec64be_aligned
-#define enc64e sph_enc64be
-#endif
-
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#define Sb_8W(x0, x1, x2, x3, c) \
+do { \
+    __m512i cc = _mm512_set1_epi64( c ); \
+    x3 = mm512_not( x3 ); \
+    x0 = _mm512_xor_si512( x0, _mm512_andnot_si512( x2, cc ) ); \
+    tmp = _mm512_xor_si512( cc, _mm512_and_si512( x0, x1 ) ); \
+    x0 = _mm512_xor_si512( x0, _mm512_and_si512( x2, x3 ) ); \
+    x3 = _mm512_xor_si512( x3, _mm512_andnot_si512( x1, x2 ) ); \
+    x1 = _mm512_xor_si512( x1, _mm512_and_si512( x0, x2 ) ); \
+    x2 = _mm512_xor_si512( x2, _mm512_andnot_si512( x3, x0 ) ); \
+    x0 = _mm512_xor_si512( x0, _mm512_or_si512( x1, x3 ) ); \
+    x3 = _mm512_xor_si512( x3, _mm512_and_si512( x1, x2 ) ); \
+    x1 = _mm512_xor_si512( x1, _mm512_and_si512( tmp, x0 ) ); \
+    x2 = _mm512_xor_si512( x2, tmp ); \
+} while (0)
+
+#define Lb_8W(x0, x1, x2, x3, x4, x5, x6, x7) \
+do { \
+    x4 = _mm512_xor_si512( x4, x1 ); \
+    x5 = _mm512_xor_si512( x5, x2 ); \
+    x6 = _mm512_xor_si512( x6, _mm512_xor_si512( x3, x0 ) ); \
+    x7 = _mm512_xor_si512( x7, x0 ); \
+    x0 = _mm512_xor_si512( x0, x5 ); \
+    x1 = _mm512_xor_si512( x1, x6 ); \
+    x2 = _mm512_xor_si512( x2, _mm512_xor_si512( x7, x4 ) ); \
+    x3 = _mm512_xor_si512( x3, x4 ); \
+} while (0)
+
 #endif

 #define Sb(x0, x1, x2, x3, c) \
 do { \
-    __m256i cc = _mm256_set_epi64x( c, c, c, c ); \
+    __m256i cc = _mm256_set1_epi64x( c ); \
     x3 = mm256_not( x3 ); \
     x0 = _mm256_xor_si256( x0, _mm256_andnot_si256( x2, cc ) ); \
     tmp = _mm256_xor_si256( cc, _mm256_and_si256( x0, x1 ) ); \
@@ -120,8 +105,97 @@ do { \
     x3 = _mm256_xor_si256( x3, x4 ); \
 } while (0)

-#if SPH_JH_64
+static const uint64_t C[] =
+{
+   0x67f815dfa2ded572, 0x571523b70a15847b,
+   0xf6875a4d90d6ab81, 0x402bd1c3c54f9f4e,
+   0x9cfa455ce03a98ea, 0x9a99b26699d2c503,
+   0x8a53bbf2b4960266, 0x31a2db881a1456b5,
+   0xdb0e199a5c5aa303, 0x1044c1870ab23f40,
+   0x1d959e848019051c, 0xdccde75eadeb336f,
+   0x416bbf029213ba10, 0xd027bbf7156578dc,
+   0x5078aa3739812c0a, 0xd3910041d2bf1a3f,
+   0x907eccf60d5a2d42, 0xce97c0929c9f62dd,
+   0xac442bc70ba75c18, 0x23fcc663d665dfd1,
+   0x1ab8e09e036c6e97, 0xa8ec6c447e450521,
+   0xfa618e5dbb03f1ee, 0x97818394b29796fd,
+   0x2f3003db37858e4a, 0x956a9ffb2d8d672a,
+   0x6c69b8f88173fe8a, 0x14427fc04672c78a,
+   0xc45ec7bd8f15f4c5, 0x80bb118fa76f4475,
+   0xbc88e4aeb775de52, 0xf4a3a6981e00b882,
+   0x1563a3a9338ff48e, 0x89f9b7d524565faa,
+   0xfde05a7c20edf1b6, 0x362c42065ae9ca36,
+   0x3d98fe4e433529ce, 0xa74b9a7374f93a53,
+   0x86814e6f591ff5d0, 0x9f5ad8af81ad9d0e,
+   0x6a6234ee670605a7, 0x2717b96ebe280b8b,
+   0x3f1080c626077447, 0x7b487ec66f7ea0e0,
+   0xc0a4f84aa50a550d, 0x9ef18e979fe7e391,
+   0xd48d605081727686, 0x62b0e5f3415a9e7e,
+   0x7a205440ec1f9ffc, 0x84c9f4ce001ae4e3,
+   0xd895fa9df594d74f, 0xa554c324117e2e55,
+   0x286efebd2872df5b, 0xb2c4a50fe27ff578,
+   0x2ed349eeef7c8905, 0x7f5928eb85937e44,
+   0x4a3124b337695f70, 0x65e4d61df128865e,
+   0xe720b95104771bc7, 0x8a87d423e843fe74,
+   0xf2947692a3e8297d, 0xc1d9309b097acbdd,
+   0xe01bdc5bfb301b1d, 0xbf829cf24f4924da,
+   0xffbf70b431bae7a4, 0x48bcf8de0544320d,
+   0x39d3bb5332fcae3b, 0xa08b29e0c1c39f45,
+   0x0f09aef7fd05c9e5, 0x34f1904212347094,
+   0x95ed44e301b771a2, 0x4a982f4f368e3be9,
+   0x15f66ca0631d4088, 0xffaf52874b44c147,
+   0x30c60ae2f14abb7e, 0xe68c6eccc5b67046,
+   0x00ca4fbd56a4d5a4, 0xae183ec84b849dda,
+   0xadd1643045ce5773, 0x67255c1468cea6e8,
+   0x16e10ecbf28cdaa3, 0x9a99949a5806e933,
+   0x7b846fc220b2601f, 0x1885d1a07facced1,
+   0xd319dd8da15b5932, 0x46b4a5aac01c9a50,
+   0xba6b04e467633d9f, 0x7eee560bab19caf6,
+   0x742128a9ea79b11f, 0xee51363b35f7bde9,
+   0x76d350755aac571d, 0x01707da3fec2463a,
+   0x42d8a498afc135f7, 0x79676b9e20eced78,
+   0xa8db3aea15638341, 0x832c83324d3bc3fa,
+   0xf347271c1f3b40a7, 0x9a762db734f04059,
+   0xfd4f21d26c4e3ee7, 0xef5957dc398dfdb8,
+   0xdaeb492b490c9b8d, 0x0d70f36849d7a25b,
+   0x84558d7ad0ae3b7d, 0x658ef8e4f0e9a5f5,
+   0x533b1036f4a2b8a0, 0x5aec3e759e07a80c,
+   0x4f88e85692946891, 0x4cbcbaf8555cb05b,
+   0x7b9487f3993bbbe3, 0x5d1c6b72d6f4da75,
+   0x6db334dc28acae64, 0x71db28b850a5346c,
+   0x2a518d10f2e261f8, 0xfc75dd593364dbe3,
+   0xa23fce43f1bcac1c, 0xb043e8023cd1bb67,
+   0x75a12988ca5b0a33, 0x5c5316b44d19347f,
+   0x1e4d790ec3943b92, 0x3fafeeb6d7757479,
+   0x21391abef7d4a8ea, 0x5127234c097ef45c,
+   0xd23c32ba5324a326, 0xadd5a66d4a17a344,
+   0x08c9f2afa63e1db5, 0x563c6b91983d5983,
+   0x4d608672a17cf84c, 0xf6c76e08cc3ee246,
+   0x5e76bcb1b333982f, 0x2ae6c4efa566d62b,
+   0x36d4c1bee8b6f406, 0x6321efbc1582ee74,
+   0x69c953f40d4ec1fd, 0x26585806c45a7da7,
+   0x16fae0061614c17e, 0x3f9d63283daf907e,
+   0x0cd29b00e3f2c9d2, 0x300cd4b730ceaa5f,
+   0x9832e0f216512a74, 0x9af8cee3d830eb0d,
+   0x9279f1b57b9ec54b, 0xd36886046ee651ff,
+   0x316796e6574d239b, 0x05750a17f3a6e6cc,
+   0xce6c3213d98176b1, 0x62a205f88452173c,
+   0x47154778b3cb2bf4, 0x486a9323825446ff,
+   0x65655e4e0758df38, 0x8e5086fc897cfcf2,
+   0x86ca0bd0442e7031, 0x4e477830a20940f0,
+   0x8338f7d139eea065, 0xbd3a2ce437e95ef7,
+   0x6ff8130126b29721, 0xe7de9fefd1ed44a3,
+   0xd992257615dfa08b, 0xbe42dc12f6f7853c,
+   0x7eb027ab7ceca7d8, 0xdea83eaada7d8d53,
+   0xd86902bd93ce25aa, 0xf908731afd43f65a,
+   0xa5194a17daef5fc0, 0x6a21fd4c33664d97,
+   0x701541db3198b435, 0x9b54cdedbb0f1eea,
+   0x72409751a163d09a, 0xe26f4791bf9d75f6
+};
+
+// Big endian version
+
+/*
 static const sph_u64 C[] = {
 	C64e(0x72d5dea2df15f867), C64e(0x7b84150ab7231557),
 	C64e(0x81abd6904d5a87f6), C64e(0x4e9f4fc5c3d12b40),
@@ -208,6 +282,7 @@ static const sph_u64 C[] = {
 	C64e(0x35b49831db411570), C64e(0xea1e0fbbedcd549b),
 	C64e(0x9ad063a151974072), C64e(0xf6759dbf91476fe2)
 };
+*/

 #define Ceven_hi(r) (C[((r) << 2) + 0])
 #define Ceven_lo(r) (C[((r) << 2) + 1])
@@ -226,6 +301,48 @@ static const sph_u64 C[] = {
         x4 ## l, x5 ## l, x6 ## l, x7 ## l); \
 } while (0)

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#define S_8W(x0, x1, x2, x3, cb, r) do { \
+    Sb_8W(x0 ## h, x1 ## h, x2 ## h, x3 ## h, cb ## hi(r)); \
+    Sb_8W(x0 ## l, x1 ## l, x2 ## l, x3 ## l, cb ## lo(r)); \
+} while (0)
+
+#define L_8W(x0, x1, x2, x3, x4, x5, x6, x7) do { \
+    Lb_8W(x0 ## h, x1 ## h, x2 ## h, x3 ## h, \
+          x4 ## h, x5 ## h, x6 ## h, x7 ## h); \
+    Lb_8W(x0 ## l, x1 ## l, x2 ## l, x3 ## l, \
+          x4 ## l, x5 ## l, x6 ## l, x7 ## l); \
+} while (0)
+
+#define Wz_8W(x, c, n) \
+do { \
+    __m512i t = _mm512_slli_epi64( _mm512_and_si512(x ## h, (c)), (n) ); \
+    x ## h = _mm512_or_si512( _mm512_and_si512( \
+                                  _mm512_srli_epi64(x ## h, (n)), (c)), t ); \
+    t = _mm512_slli_epi64( _mm512_and_si512(x ## l, (c)), (n) ); \
+    x ## l = _mm512_or_si512( _mm512_and_si512((x ## l >> (n)), (c)), t ); \
+} while (0)
+
+#define W80(x) Wz_8W(x, m512_const1_64( 0x5555555555555555 ), 1 )
+#define W81(x) Wz_8W(x, m512_const1_64( 0x3333333333333333 ), 2 )
+#define W82(x) Wz_8W(x, m512_const1_64( 0x0F0F0F0F0F0F0F0F ), 4 )
+#define W83(x) Wz_8W(x, m512_const1_64( 0x00FF00FF00FF00FF ), 8 )
+#define W84(x) Wz_8W(x, m512_const1_64( 0x0000FFFF0000FFFF ), 16 )
+#define W85(x) Wz_8W(x, m512_const1_64( 0x00000000FFFFFFFF ), 32 )
+#define W86(x) \
+do { \
+    __m512i t = x ## h; \
+    x ## h = x ## l; \
+    x ## l = t; \
+} while (0)
+
+#define DECL_STATE_8W \
+    __m512i h0h, h1h, h2h, h3h, h4h, h5h, h6h, h7h; \
+    __m512i h0l, h1l, h2l, h3l, h4l, h5l, h6l, h7l; \
+    __m512i tmp;
+
+#endif
+
 #define Wz(x, c, n) \
 do { \
@@ -236,16 +353,6 @@ do { \
     x ## l = _mm256_or_si256( _mm256_and_si256((x ## l >> (n)), (c)), t ); \
 } while (0)


-/*
-#define Wz(x, c, n) do { \
-    sph_u64 t = (x ## h & (c)) << (n); \
-    x ## h = ((x ## h >> (n)) & (c)) | t; \
-    t = (x ## l & (c)) << (n); \
-    x ## l = ((x ## l >> (n)) & (c)) | t; \
-} while (0)
-*/

 #define W0(x) Wz(x, m256_const1_64( 0x5555555555555555 ), 1 )
 #define W1(x) Wz(x, m256_const1_64( 0x3333333333333333 ), 2 )
 #define W2(x) Wz(x, m256_const1_64( 0x0F0F0F0F0F0F0F0F ), 4 )
@@ -259,25 +366,12 @@ do { \
     x ## l = t; \
 } while (0)

-/*
-#define W0(x) Wz(x, SPH_C64(0x5555555555555555), 1)
-#define W1(x) Wz(x, SPH_C64(0x3333333333333333), 2)
-#define W2(x) Wz(x, SPH_C64(0x0F0F0F0F0F0F0F0F), 4)
-#define W3(x) Wz(x, SPH_C64(0x00FF00FF00FF00FF), 8)
-#define W4(x) Wz(x, SPH_C64(0x0000FFFF0000FFFF), 16)
-#define W5(x) Wz(x, SPH_C64(0x00000000FFFFFFFF), 32)
-#define W6(x) do { \
-    sph_u64 t = x ## h; \
-    x ## h = x ## l; \
-    x ## l = t; \
-} while (0)
-*/

 #define DECL_STATE \
     __m256i h0h, h1h, h2h, h3h, h4h, h5h, h6h, h7h; \
     __m256i h0l, h1l, h2l, h3l, h4l, h5l, h6l, h7l; \
     __m256i tmp;


 #define READ_STATE(state) do { \
     h0h = (state)->H[ 0]; \
     h0l = (state)->H[ 1]; \
@@ -316,6 +410,38 @@ do { \
     (state)->H[15] = h7l; \
 } while (0)

+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#define INPUT_BUF1_8W \
+    __m512i m0h = buf[0]; \
+    __m512i m0l = buf[1]; \
+    __m512i m1h = buf[2]; \
+    __m512i m1l = buf[3]; \
+    __m512i m2h = buf[4]; \
+    __m512i m2l = buf[5]; \
+    __m512i m3h = buf[6]; \
+    __m512i m3l = buf[7]; \
+    h0h = _mm512_xor_si512( h0h, m0h ); \
+    h0l = _mm512_xor_si512( h0l, m0l ); \
+    h1h = _mm512_xor_si512( h1h, m1h ); \
+    h1l = _mm512_xor_si512( h1l, m1l ); \
+    h2h = _mm512_xor_si512( h2h, m2h ); \
+    h2l = _mm512_xor_si512( h2l, m2l ); \
+    h3h = _mm512_xor_si512( h3h, m3h ); \
+    h3l = _mm512_xor_si512( h3l, m3l ); \
+
+#define INPUT_BUF2_8W \
+    h4h = _mm512_xor_si512( h4h, m0h ); \
+    h4l = _mm512_xor_si512( h4l, m0l ); \
+    h5h = _mm512_xor_si512( h5h, m1h ); \
+    h5l = _mm512_xor_si512( h5l, m1l ); \
+    h6h = _mm512_xor_si512( h6h, m2h ); \
+    h6l = _mm512_xor_si512( h6l, m2l ); \
+    h7h = _mm512_xor_si512( h7h, m3h ); \
+    h7l = _mm512_xor_si512( h7l, m3l ); \
+
+#endif
+
 #define INPUT_BUF1 \
     __m256i m0h = buf[0]; \
     __m256i m0l = buf[1]; \
@@ -344,6 +470,7 @@ do { \
     h7h = _mm256_xor_si256( h7h, m3h ); \
     h7l = _mm256_xor_si256( h7l, m3l ); \

+/*
 static const sph_u64 IV256[] = {
 	C64e(0xeb98a3412c20d3eb), C64e(0x92cdbe7b9cb245c1),
 	C64e(0x1c93519160d4c7fa), C64e(0x260082d67e508a03),
@@ -366,9 +493,22 @@ static const sph_u64 IV512[] = {
 	C64e(0xcf57f6ec9db1f856), C64e(0xa706887c5716b156),
 	C64e(0xe3c2fcdfe68517fb), C64e(0x545a4678cc8cdd4b)
 };
+*/

-#else
-
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#define SL_8W(ro) SLu_8W(r + ro, ro)
+
+#define SLu_8W(r, ro) do { \
+    S_8W(h0, h2, h4, h6, Ceven_, r); \
+    S_8W(h1, h3, h5, h7, Codd_, r); \
+    L_8W(h0, h2, h4, h6, h1, h3, h5, h7); \
+    W8 ## ro(h1); \
+    W8 ## ro(h3); \
+    W8 ## ro(h5); \
+    W8 ## ro(h7); \
+} while (0)
+
 #endif

@@ -384,41 +524,57 @@ static const sph_u64 IV512[] = {
     W ## ro(h7); \
 } while (0)

-#if SPH_SMALL_FOOTPRINT_JH
-
-#if SPH_JH_64
-
-/*
- * The "small footprint" 64-bit version just uses a partially unrolled
- * loop.
- */
-
-#define E8 do { \
-    unsigned r; \
-    for (r = 0; r < 42; r += 7) { \
-        SL(0); \
-        SL(1); \
-        SL(2); \
-        SL(3); \
-        SL(4); \
-        SL(5); \
-        SL(6); \
-    } \
-} while (0)
-
-#else
-
-#endif
-
-#else
-
-#if SPH_JH_64
-
-/*
- * On a "true 64-bit" architecture, we can unroll at will.
- */
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+#define E8_8W do { \
+    SLu_8W( 0, 0); \
+    SLu_8W( 1, 1); \
+    SLu_8W( 2, 2); \
+    SLu_8W( 3, 3); \
+    SLu_8W( 4, 4); \
+    SLu_8W( 5, 5); \
+    SLu_8W( 6, 6); \
+    SLu_8W( 7, 0); \
+    SLu_8W( 8, 1); \
+    SLu_8W( 9, 2); \
+    SLu_8W(10, 3); \
+    SLu_8W(11, 4); \
+    SLu_8W(12, 5); \
+    SLu_8W(13, 6); \
+    SLu_8W(14, 0); \
+    SLu_8W(15, 1); \
+    SLu_8W(16, 2); \
+    SLu_8W(17, 3); \
+    SLu_8W(18, 4); \
+    SLu_8W(19, 5); \
+    SLu_8W(20, 6); \
+    SLu_8W(21, 0); \
+    SLu_8W(22, 1); \
+    SLu_8W(23, 2); \
+    SLu_8W(24, 3); \
+    SLu_8W(25, 4); \
+    SLu_8W(26, 5); \
+    SLu_8W(27, 6); \
+    SLu_8W(28, 0); \
+    SLu_8W(29, 1); \
+    SLu_8W(30, 2); \
+    SLu_8W(31, 3); \
+    SLu_8W(32, 4); \
+    SLu_8W(33, 5); \
+    SLu_8W(34, 6); \
+    SLu_8W(35, 0); \
+    SLu_8W(36, 1); \
+    SLu_8W(37, 2); \
+    SLu_8W(38, 3); \
+    SLu_8W(39, 4); \
+    SLu_8W(40, 5); \
+    SLu_8W(41, 6); \
+} while (0)
+
+#endif  // AVX512

 #define E8 do { \
     SLu( 0, 0); \
     SLu( 1, 1); \
@@ -464,10 +620,153 @@ static const sph_u64 IV512[] = {
     SLu(41, 6); \
 } while (0)

-#else
-
-#endif
+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+void jh256_8way_init( jh_8way_context *sc )
+{
+    // bswapped IV256
+    sc->H[ 0] = m512_const1_64( 0xebd3202c41a398eb );
+    sc->H[ 1] = m512_const1_64( 0xc145b29c7bbecd92 );
+    sc->H[ 2] = m512_const1_64( 0xfac7d4609151931c );
+    sc->H[ 3] = m512_const1_64( 0x038a507ed6820026 );
+    sc->H[ 4] = m512_const1_64( 0x45b92677269e23a4 );
+    sc->H[ 5] = m512_const1_64( 0x77941ad4481afbe0 );
+    sc->H[ 6] = m512_const1_64( 0x7a176b0226abb5cd );
+    sc->H[ 7] = m512_const1_64( 0xa82fff0f4224f056 );
+    sc->H[ 8] = m512_const1_64( 0x754d2e7f8996a371 );
+    sc->H[ 9] = m512_const1_64( 0x62e27df70849141d );
+    sc->H[10] = m512_const1_64( 0x948f2476f7957627 );
+    sc->H[11] = m512_const1_64( 0x6c29804757b6d587 );
+    sc->H[12] = m512_const1_64( 0x6c0d8eac2d275e5c );
+    sc->H[13] = m512_const1_64( 0x0f7a0557c6508451 );
+    sc->H[14] = m512_const1_64( 0xea12247067d3e47b );
+    sc->H[15] = m512_const1_64( 0x69d71cd313abe389 );
+    sc->ptr = 0;
+    sc->block_count = 0;
+}
+
+void jh512_8way_init( jh_8way_context *sc )
+{
+    // bswapped IV512
+    sc->H[ 0] = m512_const1_64( 0x17aa003e964bd16f );
+    sc->H[ 1] = m512_const1_64( 0x43d5157a052e6a63 );
+    sc->H[ 2] = m512_const1_64( 0x0bef970c8d5e228a );
+    sc->H[ 3] = m512_const1_64( 0x61c3b3f2591234e9 );
+    sc->H[ 4] = m512_const1_64( 0x1e806f53c1a01d89 );
+    sc->H[ 5] = m512_const1_64( 0x806d2bea6b05a92a );
+    sc->H[ 6] = m512_const1_64( 0xa6ba7520dbcc8e58 );
+    sc->H[ 7] = m512_const1_64( 0xf73bf8ba763a0fa9 );
+    sc->H[ 8] = m512_const1_64( 0x694ae34105e66901 );
+    sc->H[ 9] = m512_const1_64( 0x5ae66f2e8e8ab546 );
+    sc->H[10] = m512_const1_64( 0x243c84c1d0a74710 );
+    sc->H[11] = m512_const1_64( 0x99c15a2db1716e3b );
+    sc->H[12] = m512_const1_64( 0x56f8b19decf657cf );
+    sc->H[13] = m512_const1_64( 0x56b116577c8806a7 );
+    sc->H[14] = m512_const1_64( 0xfb1785e6dffcc2e3 );
+    sc->H[15] = m512_const1_64( 0x4bdd8ccc78465a54 );
+    sc->ptr = 0;
+    sc->block_count = 0;
+}
+
+static void
+jh_8way_core( jh_8way_context *sc, const void *data, size_t len )
+{
+    __m512i *buf;
+    __m512i *vdata = (__m512i*)data;
+    const int buf_size = 64;   // 64 * _m512i
+    size_t ptr;
+    DECL_STATE_8W
+
+    buf = sc->buf;
+    ptr = sc->ptr;
+
+    if ( len < (buf_size - ptr) )
+    {
+        memcpy_512( buf + (ptr>>3), vdata, len>>3 );
+        ptr += len;
+        sc->ptr = ptr;
+        return;
+    }
+
+    READ_STATE(sc);
+    while ( len > 0 )
+    {
+        size_t clen;
+        clen = buf_size - ptr;
+        if ( clen > len )
+            clen = len;
+
+        memcpy_512( buf + (ptr>>3), vdata, clen>>3 );
+        ptr += clen;
+        vdata += (clen>>3);
+        len -= clen;
+        if ( ptr == buf_size )
+        {
+            INPUT_BUF1_8W;
+            E8_8W;
+            INPUT_BUF2_8W;
+            sc->block_count ++;
+            ptr = 0;
+        }
+    }
+    WRITE_STATE(sc);
+    sc->ptr = ptr;
+}
+
+static void
+jh_8way_close( jh_8way_context *sc, unsigned ub, unsigned n, void *dst,
+               size_t out_size_w32 )
+{
+    __m512i buf[16*4];
+    __m512i *dst512 = (__m512i*)dst;
+    size_t numz, u;
+    uint64_t l0, l1;
+
+    buf[0] = m512_const1_64( 0x80ULL );
+
+    if ( sc->ptr == 0 )
+        numz = 48;
+    else
+        numz = 112 - sc->ptr;
+
+    memset_zero_512( buf+1, (numz>>3) - 1 );
+
+    l0 = ( sc->block_count << 9 ) + ( sc->ptr << 3 );
+    l1 = ( sc->block_count >> 55 );
+    *(buf + (numz>>3)    ) = _mm512_set1_epi64( bswap_64( l1 ) );
+    *(buf + (numz>>3) + 1) = _mm512_set1_epi64( bswap_64( l0 ) );
+
+    jh_8way_core( sc, buf, numz + 16 );
+
+    for ( u=0; u < 8; u++ )
+        buf[u] = sc->H[u+8];
+
+    memcpy_512( dst512, buf, 8 );
+}
+
+void
+jh256_8way_update(void *cc, const void *data, size_t len)
+{
+    jh_8way_core(cc, data, len);
+}
+
+void
+jh256_8way_close(void *cc, void *dst)
+{
+    jh_8way_close(cc, 0, 0, dst, 8);
+}
+
+void
+jh512_8way_update(void *cc, const void *data, size_t len)
+{
+    jh_8way_core(cc, data, len);
+}
+
+void
+jh512_8way_close(void *cc, void *dst)
+{
+    jh_8way_close(cc, 0, 0, dst, 16);
+}
+
+#endif

@@ -564,12 +863,12 @@ jh_4way_core( jh_4way_context *sc, const void *data, size_t len )

 static void
 jh_4way_close( jh_4way_context *sc, unsigned ub, unsigned n, void *dst,
-               size_t out_size_w32, const void *iv )
+               size_t out_size_w32 )
 {
     __m256i buf[16*4];
     __m256i *dst256 = (__m256i*)dst;
     size_t numz, u;
-    sph_u64 l0, l1, l0e, l1e;
+    uint64_t l0, l1;

     buf[0] = m256_const1_64( 0x80ULL );

@@ -580,12 +879,10 @@ jh_4way_close( jh_4way_context *sc, unsigned ub, unsigned n, void *dst,

     memset_zero_256( buf+1, (numz>>3) - 1 );

-    l0 = SPH_T64(sc->block_count << 9) + (sc->ptr << 3);
-    l1 = SPH_T64(sc->block_count >> 55);
-    sph_enc64be( &l0e, l0 );
-    sph_enc64be( &l1e, l1 );
-    *(buf + (numz>>3)    ) = _mm256_set1_epi64x( l1e );
-    *(buf + (numz>>3) + 1) = _mm256_set1_epi64x( l0e );
+    l0 = ( sc->block_count << 9 ) + ( sc->ptr << 3 );
+    l1 = ( sc->block_count >> 55 );
+    *(buf + (numz>>3)    ) = _mm256_set1_epi64x( bswap_64( l1 ) );
+    *(buf + (numz>>3) + 1) = _mm256_set1_epi64x( bswap_64( l0 ) );

     jh_4way_core( sc, buf, numz + 16 );

@@ -595,16 +892,8 @@ jh_4way_close( jh_4way_context *sc, unsigned ub, unsigned n, void *dst,
     memcpy_256( dst256, buf, 8 );
 }

-/*
 void
-jh256_4way_init(void *cc)
-{
-    jhs_4way_init(cc, IV256);
-}
-*/
-
-void
-jh256_4way(void *cc, const void *data, size_t len)
+jh256_4way_update(void *cc, const void *data, size_t len)
 {
     jh_4way_core(cc, data, len);
 }
@@ -612,19 +901,11 @@ jh256_4way(void *cc, const void *data, size_t len)
 void
 jh256_4way_close(void *cc, void *dst)
 {
-    jh_4way_close(cc, 0, 0, dst, 8, IV256);
+    jh_4way_close(cc, 0, 0, dst, 8 );
 }

-/*
 void
-jh512_4way_init(void *cc)
-{
-    jhb_4way_init(cc, IV512);
-}
-*/
-
-void
-jh512_4way(void *cc, const void *data, size_t len)
+jh512_4way_update(void *cc, const void *data, size_t len)
 {
     jh_4way_core(cc, data, len);
 }
@@ -632,9 +913,10 @@ jh512_4way(void *cc, const void *data, size_t len)
 void
 jh512_4way_close(void *cc, void *dst)
 {
-    jh_4way_close(cc, 0, 0, dst, 16, IV512);
+    jh_4way_close(cc, 0, 0, dst, 16 );
 }


 #ifdef __cplusplus
 }
 #endif
@@ -43,7 +43,6 @@ extern "C"{
 #endif

 #include <stddef.h>
-#include "algo/sha/sph_types.h"
 #include "simd-utils.h"

 #define SPH_SIZE_jh256 256
@@ -60,20 +59,41 @@ extern "C"{
  * can be cloned by copying the context (e.g. with a simple
  * <code>memcpy()</code>).
  */


+#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
+
+typedef struct {
+    __m512i buf[8];
+    __m512i H[16];
+    size_t ptr;
+    uint64_t block_count;
+} jh_8way_context __attribute__ ((aligned (128)));
+
+typedef jh_8way_context jh256_8way_context;
+
+typedef jh_8way_context jh512_8way_context;
+
+void jh256_8way_init( jh_8way_context *sc);
+
+void jh256_8way_update(void *cc, const void *data, size_t len);
+
+void jh256_8way_close(void *cc, void *dst);
+
+void jh512_8way_init( jh_8way_context *sc );
+
+void jh512_8way_update(void *cc, const void *data, size_t len);
+
+void jh512_8way_close(void *cc, void *dst);
+
+#endif
+
 typedef struct {
-    __m256i buf[8] __attribute__ ((aligned (64)));
+    __m256i buf[8];
     __m256i H[16];
     size_t ptr;
     uint64_t block_count;
-/*
-    unsigned char buf[64];
-    size_t ptr;
-    union {
-        sph_u64 wide[16];
-    } H;
-    sph_u64 block_count;
-*/
-} jh_4way_context;
+} jh_4way_context __attribute__ ((aligned (128)));

 typedef jh_4way_context jh256_4way_context;

@@ -81,13 +101,13 @@ typedef jh_4way_context jh512_4way_context;

 void jh256_4way_init( jh_4way_context *sc);

-void jh256_4way(void *cc, const void *data, size_t len);
+void jh256_4way_update(void *cc, const void *data, size_t len);

 void jh256_4way_close(void *cc, void *dst);

 void jh512_4way_init( jh_4way_context *sc );

-void jh512_4way(void *cc, const void *data, size_t len);
+void jh512_4way_update(void *cc, const void *data, size_t len);

 void jh512_4way_close(void *cc, void *dst);

@@ -95,6 +115,6 @@ void jh512_4way_close(void *cc, void *dst);
 }
 #endif

-#endif
+#endif // AVX2

 #endif
@@ -33,7 +33,7 @@ void jha_hash_4way( void *out, const void *input )
     keccak512_4way_context ctx_keccak;

     keccak512_4way_init( &ctx_keccak );
-    keccak512_4way( &ctx_keccak, input, 80 );
+    keccak512_4way_update( &ctx_keccak, input, 80 );
     keccak512_4way_close( &ctx_keccak, vhash );

     // Heavy & Light Pair Loop
@@ -58,18 +58,18 @@ void jha_hash_4way( void *out, const void *input )
     intrlv_4x64( vhashA, hash0, hash1, hash2, hash3, 512 );

     skein512_4way_init( &ctx_skein );
-    skein512_4way( &ctx_skein, vhash, 64 );
+    skein512_4way_update( &ctx_skein, vhash, 64 );
     skein512_4way_close( &ctx_skein, vhashB );

     for ( int i = 0; i < 8; i++ )
         vh[i] = _mm256_blendv_epi8( vhA[i], vhB[i], vh_mask );

     blake512_4way_init( &ctx_blake );
-    blake512_4way( &ctx_blake, vhash, 64 );
+    blake512_4way_update( &ctx_blake, vhash, 64 );
     blake512_4way_close( &ctx_blake, vhashA );

     jh512_4way_init( &ctx_jh );
-    jh512_4way( &ctx_jh, vhash, 64 );
+    jh512_4way_update( &ctx_jh, vhash, 64 );
     jh512_4way_close( &ctx_jh, vhashB );

     for ( int i = 0; i < 8; i++ )
@@ -1,19 +1,16 @@
 #include "jha-gate.h"

 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
 #include <stdio.h>

 #include "algo/blake/sph_blake.h"
 #include "algo/jh/sph_jh.h"
 #include "algo/keccak/sph_keccak.h"
 #include "algo/skein/sph_skein.h"
-#ifdef NO_AES_NI
-  #include "algo/groestl/sph_groestl.h"
-#else
+#ifdef __AES__
   #include "algo/groestl/aes_ni/hash-groestl.h"
+#else
+  #include "algo/groestl/sph_groestl.h"
 #endif

 static __thread sph_keccak512_context jha_kec_mid __attribute__ ((aligned (64)));
@@ -28,10 +25,10 @@ void jha_hash(void *output, const void *input)
|
|||||||
{
|
{
|
||||||
uint8_t _ALIGN(128) hash[64];
|
uint8_t _ALIGN(128) hash[64];
|
||||||
|
|
||||||
#ifdef NO_AES_NI
|
#ifdef __AES__
|
||||||
sph_groestl512_context ctx_groestl;
|
|
||||||
#else
|
|
||||||
hashState_groestl ctx_groestl;
|
hashState_groestl ctx_groestl;
|
||||||
|
#else
|
||||||
|
sph_groestl512_context ctx_groestl;
|
||||||
#endif
|
#endif
|
||||||
sph_blake512_context ctx_blake;
|
sph_blake512_context ctx_blake;
|
||||||
sph_jh512_context ctx_jh;
|
sph_jh512_context ctx_jh;
|
||||||
@@ -47,14 +44,14 @@ void jha_hash(void *output, const void *input)
|
|||||||
{
|
{
|
||||||
if (hash[0] & 0x01)
|
if (hash[0] & 0x01)
|
||||||
{
|
{
|
||||||
#ifdef NO_AES_NI
|
#ifdef __AES__
|
||||||
sph_groestl512_init(&ctx_groestl);
|
|
||||||
sph_groestl512(&ctx_groestl, hash, 64 );
|
|
||||||
sph_groestl512_close(&ctx_groestl, hash );
|
|
||||||
#else
|
|
||||||
init_groestl( &ctx_groestl, 64 );
|
init_groestl( &ctx_groestl, 64 );
|
||||||
update_and_final_groestl( &ctx_groestl, (char*)hash,
|
update_and_final_groestl( &ctx_groestl, (char*)hash,
|
||||||
(char*)hash, 512 );
|
(char*)hash, 512 );
|
||||||
|
#else
|
||||||
|
sph_groestl512_init(&ctx_groestl);
|
||||||
|
sph_groestl512(&ctx_groestl, hash, 64 );
|
||||||
|
sph_groestl512_close(&ctx_groestl, hash );
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
else
|
else
|
||||||
@@ -117,9 +114,6 @@ int scanhash_jha( struct work *work, uint32_t max_nonce,
|
|||||||
|
|
||||||
jha_kec_midstate( endiandata );
|
jha_kec_midstate( endiandata );
|
||||||
|
|
||||||
#ifdef DEBUG_ALGO
|
|
||||||
printf("[%d] Htarg=%X\n", thr_id, Htarg);
|
|
||||||
#endif
|
|
||||||
for (int m=0; m < 6; m++) {
|
for (int m=0; m < 6; m++) {
|
||||||
if (Htarg <= htmax[m]) {
|
if (Htarg <= htmax[m]) {
|
||||||
uint32_t mask = masks[m];
|
uint32_t mask = masks[m];
|
||||||
@@ -127,25 +121,9 @@ int scanhash_jha( struct work *work, uint32_t max_nonce,
|
|||||||
pdata[19] = ++n;
|
pdata[19] = ++n;
|
||||||
be32enc(&endiandata[19], n);
|
be32enc(&endiandata[19], n);
|
||||||
jha_hash(hash32, endiandata);
|
jha_hash(hash32, endiandata);
|
||||||
#ifndef DEBUG_ALGO
|
if ((!(hash32[7] & mask)) && fulltest(hash32, ptarget))
|
||||||
if ((!(hash32[7] & mask)) && fulltest(hash32, ptarget)) {
|
submit_solution( work, hash32, mythr );
|
||||||
work_set_target_ratio(work, hash32);
|
|
||||||
*hashes_done = n - first_nonce + 1;
|
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
#else
|
|
||||||
if (!(n % 0x1000) && !thr_id) printf(".");
|
|
||||||
if (!(hash32[7] & mask)) {
|
|
||||||
printf("[%d]",thr_id);
|
|
||||||
if (fulltest(hash32, ptarget)) {
|
|
||||||
work_set_target_ratio(work, hash32);
|
|
||||||
*hashes_done = n - first_nonce + 1;
|
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
#endif
|
|
||||||
} while (n < max_nonce && !work_restart[thr_id].restart);
|
} while (n < max_nonce && !work_restart[thr_id].restart);
|
||||||
// see blake.c if else to understand the loop on htmax => mask
|
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
algo/jh/sse2/jh.c  (1116 lines)
File diff suppressed because it is too large.
@@ -1,465 +0,0 @@
|
|||||||
/* This program gives the optimized SSE2 bitslice implementation of JH for 32-bit platform (with 8 128-bit XMM registers).
|
|
||||||
|
|
||||||
-----------------------------------------
|
|
||||||
Performance:
|
|
||||||
|
|
||||||
Microprocessor: Intel CORE 2 processor (Core 2 Duo Mobile T6600 2.2GHz)
|
|
||||||
Operating System: 32-bit Ubuntu 10.04 (Linux kernel 2.6.32-22-generic)
|
|
||||||
Speed for long message:
|
|
||||||
1) 23.6 cycles/byte compiler: Intel C++ Compiler 11.1 compilation option: icc -O2
|
|
||||||
2) 24.1 cycles/byte compiler: gcc 4.4.3 compilation option: gcc -msse2 -O3
|
|
||||||
|
|
||||||
------------------------------------------
|
|
||||||
Comparing with the original JH sse2 code for 32-bit platform, the following modifications are made:
|
|
||||||
a) The Sbox implementation follows exactly the description given in the document
|
|
||||||
b) Data alignment definition is improved so that the code can be compiled by GCC, Intel C++ compiler and Microsoft Visual C compiler
|
|
||||||
c) Using y0,y1,..,y7 variables in Function F8 for performance improvement (local variable in function F8 so that compiler can optimize the code easily)
|
|
||||||
d) Removed a number of intermediate variables from the program (so as to given compiler more freedom to optimize the code)
|
|
||||||
e) Using "for" loop to implement 42 rounds (with 7 rounds in each loop), so as to reduce the code size.
|
|
||||||
------------------------------------------
|
|
||||||
|
|
||||||
Last Modified: January 16, 2011
|
|
||||||
*/
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
#include <emmintrin.h>
|
|
||||||
#include <string.h>
|
|
||||||
|
|
||||||
typedef unsigned int uint32;
|
|
||||||
typedef __m128i word128; /*word128 defines a 128-bit SSE2 word*/
|
|
||||||
|
|
||||||
typedef unsigned char BitSequence;
|
|
||||||
typedef unsigned long long DataLength;
|
|
||||||
typedef enum {SUCCESS = 0, FAIL = 1, BAD_HASHLEN = 2} HashReturn;
|
|
||||||
|
|
||||||
/*define data alignment for different C compilers*/
|
|
||||||
#if defined(__GNUC__)
|
|
||||||
#define DATA_ALIGN16(x) x __attribute__ ((aligned(16)))
|
|
||||||
#else
|
|
||||||
#define DATA_ALIGN16(x) __declspec(align(16)) x
|
|
||||||
#endif
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
int hashbitlen; /*the message digest size*/
|
|
||||||
unsigned long long databitlen; /*the message size in bits*/
|
|
||||||
unsigned long long datasize_in_buffer; /*the size of the message remained in buffer; assumed to be multiple of 8bits except for the last partial block at the end of the message*/
|
|
||||||
word128 x0,x1,x2,x3,x4,x5,x6,x7; /*1024-bit state;*/
|
|
||||||
unsigned char buffer[64]; /*512-bit message block;*/
|
|
||||||
} hashState;
|
|
||||||
|
|
||||||
/*The initial hash value H(0)*/
|
|
||||||
DATA_ALIGN16(const unsigned char JH224_H0[128])={0x2d,0xfe,0xdd,0x62,0xf9,0x9a,0x98,0xac,0xae,0x7c,0xac,0xd6,0x19,0xd6,0x34,0xe7,0xa4,0x83,0x10,0x5,0xbc,0x30,0x12,0x16,0xb8,0x60,0x38,0xc6,0xc9,0x66,0x14,0x94,0x66,0xd9,0x89,0x9f,0x25,0x80,0x70,0x6f,0xce,0x9e,0xa3,0x1b,0x1d,0x9b,0x1a,0xdc,0x11,0xe8,0x32,0x5f,0x7b,0x36,0x6e,0x10,0xf9,0x94,0x85,0x7f,0x2,0xfa,0x6,0xc1,0x1b,0x4f,0x1b,0x5c,0xd8,0xc8,0x40,0xb3,0x97,0xf6,0xa1,0x7f,0x6e,0x73,0x80,0x99,0xdc,0xdf,0x93,0xa5,0xad,0xea,0xa3,0xd3,0xa4,0x31,0xe8,0xde,0xc9,0x53,0x9a,0x68,0x22,0xb4,0xa9,0x8a,0xec,0x86,0xa1,0xe4,0xd5,0x74,0xac,0x95,0x9c,0xe5,0x6c,0xf0,0x15,0x96,0xd,0xea,0xb5,0xab,0x2b,0xbf,0x96,0x11,0xdc,0xf0,0xdd,0x64,0xea,0x6e};
|
|
||||||
DATA_ALIGN16(const unsigned char JH256_H0[128])={0xeb,0x98,0xa3,0x41,0x2c,0x20,0xd3,0xeb,0x92,0xcd,0xbe,0x7b,0x9c,0xb2,0x45,0xc1,0x1c,0x93,0x51,0x91,0x60,0xd4,0xc7,0xfa,0x26,0x0,0x82,0xd6,0x7e,0x50,0x8a,0x3,0xa4,0x23,0x9e,0x26,0x77,0x26,0xb9,0x45,0xe0,0xfb,0x1a,0x48,0xd4,0x1a,0x94,0x77,0xcd,0xb5,0xab,0x26,0x2,0x6b,0x17,0x7a,0x56,0xf0,0x24,0x42,0xf,0xff,0x2f,0xa8,0x71,0xa3,0x96,0x89,0x7f,0x2e,0x4d,0x75,0x1d,0x14,0x49,0x8,0xf7,0x7d,0xe2,0x62,0x27,0x76,0x95,0xf7,0x76,0x24,0x8f,0x94,0x87,0xd5,0xb6,0x57,0x47,0x80,0x29,0x6c,0x5c,0x5e,0x27,0x2d,0xac,0x8e,0xd,0x6c,0x51,0x84,0x50,0xc6,0x57,0x5,0x7a,0xf,0x7b,0xe4,0xd3,0x67,0x70,0x24,0x12,0xea,0x89,0xe3,0xab,0x13,0xd3,0x1c,0xd7,0x69};
|
|
||||||
DATA_ALIGN16(const unsigned char JH384_H0[128])={0x48,0x1e,0x3b,0xc6,0xd8,0x13,0x39,0x8a,0x6d,0x3b,0x5e,0x89,0x4a,0xde,0x87,0x9b,0x63,0xfa,0xea,0x68,0xd4,0x80,0xad,0x2e,0x33,0x2c,0xcb,0x21,0x48,0xf,0x82,0x67,0x98,0xae,0xc8,0x4d,0x90,0x82,0xb9,0x28,0xd4,0x55,0xea,0x30,0x41,0x11,0x42,0x49,0x36,0xf5,0x55,0xb2,0x92,0x48,0x47,0xec,0xc7,0x25,0xa,0x93,0xba,0xf4,0x3c,0xe1,0x56,0x9b,0x7f,0x8a,0x27,0xdb,0x45,0x4c,0x9e,0xfc,0xbd,0x49,0x63,0x97,0xaf,0xe,0x58,0x9f,0xc2,0x7d,0x26,0xaa,0x80,0xcd,0x80,0xc0,0x8b,0x8c,0x9d,0xeb,0x2e,0xda,0x8a,0x79,0x81,0xe8,0xf8,0xd5,0x37,0x3a,0xf4,0x39,0x67,0xad,0xdd,0xd1,0x7a,0x71,0xa9,0xb4,0xd3,0xbd,0xa4,0x75,0xd3,0x94,0x97,0x6c,0x3f,0xba,0x98,0x42,0x73,0x7f};
|
|
||||||
DATA_ALIGN16(const unsigned char JH512_H0[128])={0x6f,0xd1,0x4b,0x96,0x3e,0x0,0xaa,0x17,0x63,0x6a,0x2e,0x5,0x7a,0x15,0xd5,0x43,0x8a,0x22,0x5e,0x8d,0xc,0x97,0xef,0xb,0xe9,0x34,0x12,0x59,0xf2,0xb3,0xc3,0x61,0x89,0x1d,0xa0,0xc1,0x53,0x6f,0x80,0x1e,0x2a,0xa9,0x5,0x6b,0xea,0x2b,0x6d,0x80,0x58,0x8e,0xcc,0xdb,0x20,0x75,0xba,0xa6,0xa9,0xf,0x3a,0x76,0xba,0xf8,0x3b,0xf7,0x1,0x69,0xe6,0x5,0x41,0xe3,0x4a,0x69,0x46,0xb5,0x8a,0x8e,0x2e,0x6f,0xe6,0x5a,0x10,0x47,0xa7,0xd0,0xc1,0x84,0x3c,0x24,0x3b,0x6e,0x71,0xb1,0x2d,0x5a,0xc1,0x99,0xcf,0x57,0xf6,0xec,0x9d,0xb1,0xf8,0x56,0xa7,0x6,0x88,0x7c,0x57,0x16,0xb1,0x56,0xe3,0xc2,0xfc,0xdf,0xe6,0x85,0x17,0xfb,0x54,0x5a,0x46,0x78,0xcc,0x8c,0xdd,0x4b};
|
|
||||||
|
|
||||||
/*42 round constants, each round constant is 32-byte (256-bit)*/
|
|
||||||
DATA_ALIGN16(const unsigned char E8_bitslice_roundconstant[42][32])={
|
|
||||||
{0x72,0xd5,0xde,0xa2,0xdf,0x15,0xf8,0x67,0x7b,0x84,0x15,0xa,0xb7,0x23,0x15,0x57,0x81,0xab,0xd6,0x90,0x4d,0x5a,0x87,0xf6,0x4e,0x9f,0x4f,0xc5,0xc3,0xd1,0x2b,0x40},
|
|
||||||
{0xea,0x98,0x3a,0xe0,0x5c,0x45,0xfa,0x9c,0x3,0xc5,0xd2,0x99,0x66,0xb2,0x99,0x9a,0x66,0x2,0x96,0xb4,0xf2,0xbb,0x53,0x8a,0xb5,0x56,0x14,0x1a,0x88,0xdb,0xa2,0x31},
|
|
||||||
{0x3,0xa3,0x5a,0x5c,0x9a,0x19,0xe,0xdb,0x40,0x3f,0xb2,0xa,0x87,0xc1,0x44,0x10,0x1c,0x5,0x19,0x80,0x84,0x9e,0x95,0x1d,0x6f,0x33,0xeb,0xad,0x5e,0xe7,0xcd,0xdc},
|
|
||||||
{0x10,0xba,0x13,0x92,0x2,0xbf,0x6b,0x41,0xdc,0x78,0x65,0x15,0xf7,0xbb,0x27,0xd0,0xa,0x2c,0x81,0x39,0x37,0xaa,0x78,0x50,0x3f,0x1a,0xbf,0xd2,0x41,0x0,0x91,0xd3},
|
|
||||||
{0x42,0x2d,0x5a,0xd,0xf6,0xcc,0x7e,0x90,0xdd,0x62,0x9f,0x9c,0x92,0xc0,0x97,0xce,0x18,0x5c,0xa7,0xb,0xc7,0x2b,0x44,0xac,0xd1,0xdf,0x65,0xd6,0x63,0xc6,0xfc,0x23},
|
|
||||||
{0x97,0x6e,0x6c,0x3,0x9e,0xe0,0xb8,0x1a,0x21,0x5,0x45,0x7e,0x44,0x6c,0xec,0xa8,0xee,0xf1,0x3,0xbb,0x5d,0x8e,0x61,0xfa,0xfd,0x96,0x97,0xb2,0x94,0x83,0x81,0x97},
|
|
||||||
{0x4a,0x8e,0x85,0x37,0xdb,0x3,0x30,0x2f,0x2a,0x67,0x8d,0x2d,0xfb,0x9f,0x6a,0x95,0x8a,0xfe,0x73,0x81,0xf8,0xb8,0x69,0x6c,0x8a,0xc7,0x72,0x46,0xc0,0x7f,0x42,0x14},
|
|
||||||
{0xc5,0xf4,0x15,0x8f,0xbd,0xc7,0x5e,0xc4,0x75,0x44,0x6f,0xa7,0x8f,0x11,0xbb,0x80,0x52,0xde,0x75,0xb7,0xae,0xe4,0x88,0xbc,0x82,0xb8,0x0,0x1e,0x98,0xa6,0xa3,0xf4},
|
|
||||||
{0x8e,0xf4,0x8f,0x33,0xa9,0xa3,0x63,0x15,0xaa,0x5f,0x56,0x24,0xd5,0xb7,0xf9,0x89,0xb6,0xf1,0xed,0x20,0x7c,0x5a,0xe0,0xfd,0x36,0xca,0xe9,0x5a,0x6,0x42,0x2c,0x36},
|
|
||||||
{0xce,0x29,0x35,0x43,0x4e,0xfe,0x98,0x3d,0x53,0x3a,0xf9,0x74,0x73,0x9a,0x4b,0xa7,0xd0,0xf5,0x1f,0x59,0x6f,0x4e,0x81,0x86,0xe,0x9d,0xad,0x81,0xaf,0xd8,0x5a,0x9f},
|
|
||||||
{0xa7,0x5,0x6,0x67,0xee,0x34,0x62,0x6a,0x8b,0xb,0x28,0xbe,0x6e,0xb9,0x17,0x27,0x47,0x74,0x7,0x26,0xc6,0x80,0x10,0x3f,0xe0,0xa0,0x7e,0x6f,0xc6,0x7e,0x48,0x7b},
|
|
||||||
{0xd,0x55,0xa,0xa5,0x4a,0xf8,0xa4,0xc0,0x91,0xe3,0xe7,0x9f,0x97,0x8e,0xf1,0x9e,0x86,0x76,0x72,0x81,0x50,0x60,0x8d,0xd4,0x7e,0x9e,0x5a,0x41,0xf3,0xe5,0xb0,0x62},
|
|
||||||
{0xfc,0x9f,0x1f,0xec,0x40,0x54,0x20,0x7a,0xe3,0xe4,0x1a,0x0,0xce,0xf4,0xc9,0x84,0x4f,0xd7,0x94,0xf5,0x9d,0xfa,0x95,0xd8,0x55,0x2e,0x7e,0x11,0x24,0xc3,0x54,0xa5},
|
|
||||||
{0x5b,0xdf,0x72,0x28,0xbd,0xfe,0x6e,0x28,0x78,0xf5,0x7f,0xe2,0xf,0xa5,0xc4,0xb2,0x5,0x89,0x7c,0xef,0xee,0x49,0xd3,0x2e,0x44,0x7e,0x93,0x85,0xeb,0x28,0x59,0x7f},
|
|
||||||
{0x70,0x5f,0x69,0x37,0xb3,0x24,0x31,0x4a,0x5e,0x86,0x28,0xf1,0x1d,0xd6,0xe4,0x65,0xc7,0x1b,0x77,0x4,0x51,0xb9,0x20,0xe7,0x74,0xfe,0x43,0xe8,0x23,0xd4,0x87,0x8a},
|
|
||||||
{0x7d,0x29,0xe8,0xa3,0x92,0x76,0x94,0xf2,0xdd,0xcb,0x7a,0x9,0x9b,0x30,0xd9,0xc1,0x1d,0x1b,0x30,0xfb,0x5b,0xdc,0x1b,0xe0,0xda,0x24,0x49,0x4f,0xf2,0x9c,0x82,0xbf},
|
|
||||||
{0xa4,0xe7,0xba,0x31,0xb4,0x70,0xbf,0xff,0xd,0x32,0x44,0x5,0xde,0xf8,0xbc,0x48,0x3b,0xae,0xfc,0x32,0x53,0xbb,0xd3,0x39,0x45,0x9f,0xc3,0xc1,0xe0,0x29,0x8b,0xa0},
|
|
||||||
{0xe5,0xc9,0x5,0xfd,0xf7,0xae,0x9,0xf,0x94,0x70,0x34,0x12,0x42,0x90,0xf1,0x34,0xa2,0x71,0xb7,0x1,0xe3,0x44,0xed,0x95,0xe9,0x3b,0x8e,0x36,0x4f,0x2f,0x98,0x4a},
|
|
||||||
{0x88,0x40,0x1d,0x63,0xa0,0x6c,0xf6,0x15,0x47,0xc1,0x44,0x4b,0x87,0x52,0xaf,0xff,0x7e,0xbb,0x4a,0xf1,0xe2,0xa,0xc6,0x30,0x46,0x70,0xb6,0xc5,0xcc,0x6e,0x8c,0xe6},
|
|
||||||
{0xa4,0xd5,0xa4,0x56,0xbd,0x4f,0xca,0x0,0xda,0x9d,0x84,0x4b,0xc8,0x3e,0x18,0xae,0x73,0x57,0xce,0x45,0x30,0x64,0xd1,0xad,0xe8,0xa6,0xce,0x68,0x14,0x5c,0x25,0x67},
|
|
||||||
{0xa3,0xda,0x8c,0xf2,0xcb,0xe,0xe1,0x16,0x33,0xe9,0x6,0x58,0x9a,0x94,0x99,0x9a,0x1f,0x60,0xb2,0x20,0xc2,0x6f,0x84,0x7b,0xd1,0xce,0xac,0x7f,0xa0,0xd1,0x85,0x18},
|
|
||||||
{0x32,0x59,0x5b,0xa1,0x8d,0xdd,0x19,0xd3,0x50,0x9a,0x1c,0xc0,0xaa,0xa5,0xb4,0x46,0x9f,0x3d,0x63,0x67,0xe4,0x4,0x6b,0xba,0xf6,0xca,0x19,0xab,0xb,0x56,0xee,0x7e},
|
|
||||||
{0x1f,0xb1,0x79,0xea,0xa9,0x28,0x21,0x74,0xe9,0xbd,0xf7,0x35,0x3b,0x36,0x51,0xee,0x1d,0x57,0xac,0x5a,0x75,0x50,0xd3,0x76,0x3a,0x46,0xc2,0xfe,0xa3,0x7d,0x70,0x1},
|
|
||||||
{0xf7,0x35,0xc1,0xaf,0x98,0xa4,0xd8,0x42,0x78,0xed,0xec,0x20,0x9e,0x6b,0x67,0x79,0x41,0x83,0x63,0x15,0xea,0x3a,0xdb,0xa8,0xfa,0xc3,0x3b,0x4d,0x32,0x83,0x2c,0x83},
|
|
||||||
{0xa7,0x40,0x3b,0x1f,0x1c,0x27,0x47,0xf3,0x59,0x40,0xf0,0x34,0xb7,0x2d,0x76,0x9a,0xe7,0x3e,0x4e,0x6c,0xd2,0x21,0x4f,0xfd,0xb8,0xfd,0x8d,0x39,0xdc,0x57,0x59,0xef},
|
|
||||||
{0x8d,0x9b,0xc,0x49,0x2b,0x49,0xeb,0xda,0x5b,0xa2,0xd7,0x49,0x68,0xf3,0x70,0xd,0x7d,0x3b,0xae,0xd0,0x7a,0x8d,0x55,0x84,0xf5,0xa5,0xe9,0xf0,0xe4,0xf8,0x8e,0x65},
|
|
||||||
{0xa0,0xb8,0xa2,0xf4,0x36,0x10,0x3b,0x53,0xc,0xa8,0x7,0x9e,0x75,0x3e,0xec,0x5a,0x91,0x68,0x94,0x92,0x56,0xe8,0x88,0x4f,0x5b,0xb0,0x5c,0x55,0xf8,0xba,0xbc,0x4c},
|
|
||||||
{0xe3,0xbb,0x3b,0x99,0xf3,0x87,0x94,0x7b,0x75,0xda,0xf4,0xd6,0x72,0x6b,0x1c,0x5d,0x64,0xae,0xac,0x28,0xdc,0x34,0xb3,0x6d,0x6c,0x34,0xa5,0x50,0xb8,0x28,0xdb,0x71},
|
|
||||||
{0xf8,0x61,0xe2,0xf2,0x10,0x8d,0x51,0x2a,0xe3,0xdb,0x64,0x33,0x59,0xdd,0x75,0xfc,0x1c,0xac,0xbc,0xf1,0x43,0xce,0x3f,0xa2,0x67,0xbb,0xd1,0x3c,0x2,0xe8,0x43,0xb0},
|
|
||||||
{0x33,0xa,0x5b,0xca,0x88,0x29,0xa1,0x75,0x7f,0x34,0x19,0x4d,0xb4,0x16,0x53,0x5c,0x92,0x3b,0x94,0xc3,0xe,0x79,0x4d,0x1e,0x79,0x74,0x75,0xd7,0xb6,0xee,0xaf,0x3f},
|
|
||||||
{0xea,0xa8,0xd4,0xf7,0xbe,0x1a,0x39,0x21,0x5c,0xf4,0x7e,0x9,0x4c,0x23,0x27,0x51,0x26,0xa3,0x24,0x53,0xba,0x32,0x3c,0xd2,0x44,0xa3,0x17,0x4a,0x6d,0xa6,0xd5,0xad},
|
|
||||||
{0xb5,0x1d,0x3e,0xa6,0xaf,0xf2,0xc9,0x8,0x83,0x59,0x3d,0x98,0x91,0x6b,0x3c,0x56,0x4c,0xf8,0x7c,0xa1,0x72,0x86,0x60,0x4d,0x46,0xe2,0x3e,0xcc,0x8,0x6e,0xc7,0xf6},
|
|
||||||
{0x2f,0x98,0x33,0xb3,0xb1,0xbc,0x76,0x5e,0x2b,0xd6,0x66,0xa5,0xef,0xc4,0xe6,0x2a,0x6,0xf4,0xb6,0xe8,0xbe,0xc1,0xd4,0x36,0x74,0xee,0x82,0x15,0xbc,0xef,0x21,0x63},
|
|
||||||
{0xfd,0xc1,0x4e,0xd,0xf4,0x53,0xc9,0x69,0xa7,0x7d,0x5a,0xc4,0x6,0x58,0x58,0x26,0x7e,0xc1,0x14,0x16,0x6,0xe0,0xfa,0x16,0x7e,0x90,0xaf,0x3d,0x28,0x63,0x9d,0x3f},
|
|
||||||
{0xd2,0xc9,0xf2,0xe3,0x0,0x9b,0xd2,0xc,0x5f,0xaa,0xce,0x30,0xb7,0xd4,0xc,0x30,0x74,0x2a,0x51,0x16,0xf2,0xe0,0x32,0x98,0xd,0xeb,0x30,0xd8,0xe3,0xce,0xf8,0x9a},
|
|
||||||
{0x4b,0xc5,0x9e,0x7b,0xb5,0xf1,0x79,0x92,0xff,0x51,0xe6,0x6e,0x4,0x86,0x68,0xd3,0x9b,0x23,0x4d,0x57,0xe6,0x96,0x67,0x31,0xcc,0xe6,0xa6,0xf3,0x17,0xa,0x75,0x5},
|
|
||||||
{0xb1,0x76,0x81,0xd9,0x13,0x32,0x6c,0xce,0x3c,0x17,0x52,0x84,0xf8,0x5,0xa2,0x62,0xf4,0x2b,0xcb,0xb3,0x78,0x47,0x15,0x47,0xff,0x46,0x54,0x82,0x23,0x93,0x6a,0x48},
|
|
||||||
{0x38,0xdf,0x58,0x7,0x4e,0x5e,0x65,0x65,0xf2,0xfc,0x7c,0x89,0xfc,0x86,0x50,0x8e,0x31,0x70,0x2e,0x44,0xd0,0xb,0xca,0x86,0xf0,0x40,0x9,0xa2,0x30,0x78,0x47,0x4e},
|
|
||||||
{0x65,0xa0,0xee,0x39,0xd1,0xf7,0x38,0x83,0xf7,0x5e,0xe9,0x37,0xe4,0x2c,0x3a,0xbd,0x21,0x97,0xb2,0x26,0x1,0x13,0xf8,0x6f,0xa3,0x44,0xed,0xd1,0xef,0x9f,0xde,0xe7},
|
|
||||||
{0x8b,0xa0,0xdf,0x15,0x76,0x25,0x92,0xd9,0x3c,0x85,0xf7,0xf6,0x12,0xdc,0x42,0xbe,0xd8,0xa7,0xec,0x7c,0xab,0x27,0xb0,0x7e,0x53,0x8d,0x7d,0xda,0xaa,0x3e,0xa8,0xde},
|
|
||||||
{0xaa,0x25,0xce,0x93,0xbd,0x2,0x69,0xd8,0x5a,0xf6,0x43,0xfd,0x1a,0x73,0x8,0xf9,0xc0,0x5f,0xef,0xda,0x17,0x4a,0x19,0xa5,0x97,0x4d,0x66,0x33,0x4c,0xfd,0x21,0x6a},
|
|
||||||
{0x35,0xb4,0x98,0x31,0xdb,0x41,0x15,0x70,0xea,0x1e,0xf,0xbb,0xed,0xcd,0x54,0x9b,0x9a,0xd0,0x63,0xa1,0x51,0x97,0x40,0x72,0xf6,0x75,0x9d,0xbf,0x91,0x47,0x6f,0xe2}};
|
|
||||||
|
|
||||||
|
|
||||||
void F8(hashState *state); /* the compression function F8 */
|
|
||||||
|
|
||||||
/*The API functions*/
|
|
||||||
HashReturn Init(hashState *state, int hashbitlen);
|
|
||||||
HashReturn Update(hashState *state, const BitSequence *data, DataLength databitlen);
|
|
||||||
HashReturn Final(hashState *state, BitSequence *hashval);
|
|
||||||
HashReturn Hash(int hashbitlen, const BitSequence *data,DataLength databitlen, BitSequence *hashval);
|
|
||||||
|
|
||||||
/*The following defines operations on 128-bit word(s)*/
|
|
||||||
#define CONSTANT(b) _mm_set1_epi8((b)) /*set each byte in a 128-bit register to be "b"*/
|
|
||||||
|
|
||||||
#define XOR(x,y) _mm_xor_si128((x),(y)) /*XOR(x,y) = x ^ y, where x and y are two 128-bit word*/
|
|
||||||
#define AND(x,y) _mm_and_si128((x),(y)) /*AND(x,y) = x & y, where x and y are two 128-bit word*/
|
|
||||||
#define ANDNOT(x,y) _mm_andnot_si128((x),(y)) /*ANDNOT(x,y) = (!x) & y, where x and y are two 128-bit word*/
|
|
||||||
#define OR(x,y) _mm_or_si128((x),(y)) /*OR(x,y) = x | y, where x and y are two 128-bit word*/
|
|
||||||
|
|
||||||
#define SHR1(x) _mm_srli_epi16((x), 1) /*SHR1(x) = x >> 1, where x is a 128 bit word*/
|
|
||||||
#define SHR2(x) _mm_srli_epi16((x), 2) /*SHR2(x) = x >> 2, where x is a 128 bit word*/
|
|
||||||
#define SHR4(x) _mm_srli_epi16((x), 4) /*SHR4(x) = x >> 4, where x is a 128 bit word*/
|
|
||||||
#define SHR8(x) _mm_slli_epi16((x), 8) /*SHR8(x) = x >> 8, where x is a 128 bit word*/
|
|
||||||
#define SHR16(x) _mm_slli_epi32((x), 16) /*SHR16(x) = x >> 16, where x is a 128 bit word*/
|
|
||||||
#define SHR32(x) _mm_slli_epi64((x), 32) /*SHR32(x) = x >> 32, where x is a 128 bit word*/
|
|
||||||
#define SHR64(x) _mm_slli_si128((x), 8) /*SHR64(x) = x >> 64, where x is a 128 bit word*/
|
|
||||||
|
|
||||||
#define SHL1(x) _mm_slli_epi16((x), 1) /*SHL1(x) = x << 1, where x is a 128 bit word*/
|
|
||||||
#define SHL2(x) _mm_slli_epi16((x), 2) /*SHL2(x) = x << 2, where x is a 128 bit word*/
|
|
||||||
#define SHL4(x) _mm_slli_epi16((x), 4) /*SHL4(x) = x << 4, where x is a 128 bit word*/
|
|
||||||
#define SHL8(x) _mm_srli_epi16((x), 8) /*SHL8(x) = x << 8, where x is a 128 bit word*/
|
|
||||||
#define SHL16(x) _mm_srli_epi32((x), 16) /*SHL16(x) = x << 16, where x is a 128 bit word*/
|
|
||||||
#define SHL32(x) _mm_srli_epi64((x), 32) /*SHL32(x) = x << 32, where x is a 128 bit word*/
|
|
||||||
#define SHL64(x) _mm_srli_si128((x), 8) /*SHL64(x) = x << 64, where x is a 128 bit word*/
|
|
||||||
|
|
||||||
#define SWAP1(x) OR(SHR1(AND((x),CONSTANT(0xaa))),SHL1(AND((x),CONSTANT(0x55)))) /*swapping bit 2i with bit 2i+1 of the 128-bit x */
|
|
||||||
#define SWAP2(x) OR(SHR2(AND((x),CONSTANT(0xcc))),SHL2(AND((x),CONSTANT(0x33)))) /*swapping bit 4i||4i+1 with bit 4i+2||4i+3 of the 128-bit x */
|
|
||||||
#define SWAP4(x) OR(SHR4(AND((x),CONSTANT(0xf0))),SHL4(AND((x),CONSTANT(0xf)))) /*swapping bits 8i||8i+1||8i+2||8i+3 with bits 8i+4||8i+5||8i+6||8i+7 of the 128-bit x */
|
|
||||||
#define SWAP8(x) OR(SHR8(x),SHL8(x)) /*swapping bits 16i||16i+1||...||16i+7 with bits 16i+8||16i+9||...||16i+15 of the 128-bit x */
|
|
||||||
#define SWAP16(x) OR(SHR16(x),SHL16(x)) /*swapping bits 32i||32i+1||...||32i+15 with bits 32i+16||32i+17||...||32i+31 of the 128-bit x */
|
|
||||||
#define SWAP32(x) _mm_shuffle_epi32((x),_MM_SHUFFLE(2,3,0,1)) /*swapping bits 64i||64i+1||...||64i+31 with bits 64i+32||64i+33||...||64i+63 of the 128-bit x*/
|
|
||||||
#define SWAP64(x) _mm_shuffle_epi32((x),_MM_SHUFFLE(1,0,3,2)) /*swapping bits 128i||128i+1||...||128i+63 with bits 128i+64||128i+65||...||128i+127 of the 128-bit x*/
|
|
||||||
|
|
||||||
#define STORE(x,p) _mm_store_si128((__m128i *)(p), (x)) /*store the 128-bit word x into memeory address p, where p is the multile of 16 bytes*/
|
|
||||||
#define LOAD(p) _mm_load_si128((__m128i *)(p)) /*load 16 bytes from the memory address p, return a 128-bit word, where p is the multile of 16 bytes*/
|
|
||||||
|
|
||||||
/*The MDS code*/
|
|
||||||
#define L(m0,m1,m2,m3,m4,m5,m6,m7) \
|
|
||||||
(m4) = XOR((m4),(m1)); \
|
|
||||||
(m5) = XOR((m5),(m2)); \
|
|
||||||
(m6) = XOR(XOR((m6),(m3)),(m0)); \
|
|
||||||
(m7) = XOR((m7),(m0)); \
|
|
||||||
(m0) = XOR((m0),(m5)); \
|
|
||||||
(m1) = XOR((m1),(m6)); \
|
|
||||||
(m2) = XOR(XOR((m2),(m7)),(m4)); \
|
|
||||||
(m3) = XOR((m3),(m4));
|
|
||||||
|
|
||||||
/*The Sbox, it implements S0 and S1, selected by a constant bit*/
|
|
||||||
#define S(m0,m1,m2,m3,c0) \
|
|
||||||
m3 = XOR(m3,CONSTANT(0xff)); \
|
|
||||||
m0 = XOR(m0,ANDNOT(m2,c0)); \
|
|
||||||
temp0 = XOR(c0,AND(m0,m1)); \
|
|
||||||
m0 = XOR(m0,AND(m3,m2)); \
|
|
||||||
m3 = XOR(m3,ANDNOT(m1,m2)); \
|
|
||||||
m1 = XOR(m1,AND(m0,m2)); \
|
|
||||||
m2 = XOR(m2,ANDNOT(m3,m0)); \
|
|
||||||
m0 = XOR(m0,OR(m1,m3)); \
|
|
||||||
m3 = XOR(m3,AND(m1,m2)); \
|
|
||||||
m2 = XOR(m2,temp0); \
|
|
||||||
m1 = XOR(m1,AND(temp0,m0));
|
|
||||||
|
|
||||||
/* The linear transform of the (7i+0)th round*/
|
|
||||||
#define lineartransform_R00(m0,m1,m2,m3,m4,m5,m6,m7) \
|
|
||||||
/*MDS layer*/ \
|
|
||||||
L(m0,m1,m2,m3,m4,m5,m6,m7); \
|
|
||||||
/*swapping bit 2i with bit 2i+1 for m4,m5,m6 and m7 */ \
|
|
||||||
m4 = SWAP1(m4); m5 = SWAP1(m5); m6 = SWAP1(m6); m7 = SWAP1(m7);
|
|
||||||
|
|
||||||
/* The linear transform of the (7i+1)th round*/
|
|
||||||
#define lineartransform_R01(m0,m1,m2,m3,m4,m5,m6,m7) \
|
|
||||||
/*MDS layer*/ \
|
|
||||||
L(m0,m1,m2,m3,m4,m5,m6,m7); \
|
|
||||||
/*swapping bit 4i||4i+1 with bit 4i+2||4i+3 for m4,m5,m6 and m7 */ \
|
|
||||||
m4 = SWAP2(m4); m5 = SWAP2(m5); m6 = SWAP2(m6); m7 = SWAP2(m7);
|
|
||||||
|
|
||||||
/* The linear transform of the (7i+2)th round*/
|
|
||||||
#define lineartransform_R02(m0,m1,m2,m3,m4,m5,m6,m7) \
|
|
||||||
/*MDS layer*/ \
|
|
||||||
L(m0,m1,m2,m3,m4,m5,m6,m7); \
|
|
||||||
/*swapping bits 8i||8i+1||8i+2||8i+3 with bits 8i+4||8i+5||8i+6||8i+7 for m4,m5,m6 and m7*/ \
|
|
||||||
m4 = SWAP4(m4); m5 = SWAP4(m5); m6 = SWAP4(m6); m7 = SWAP4(m7);
|
|
||||||
|
|
||||||
/* The linear transform of the (7i+3)th round*/
|
|
||||||
#define lineartransform_R03(m0,m1,m2,m3,m4,m5,m6,m7) \
|
|
||||||
/*MDS layer*/ \
|
|
||||||
L(m0,m1,m2,m3,m4,m5,m6,m7); \
|
|
||||||
/*swapping bits 16i||16i+1||...||16i+7 with bits 16i+8||16i+9||...||16i+15 for m4,m5,m6 and m7*/ \
|
|
||||||
m4 = SWAP8(m4); m5 = SWAP8(m5); m6 = SWAP8(m6); m7 = SWAP8(m7);
|
|
||||||
|
|
||||||
/* The linear transform of the (7i+4)th round*/
|
|
||||||
#define lineartransform_R04(m0,m1,m2,m3,m4,m5,m6,m7) \
|
|
||||||
/*MDS layer*/ \
|
|
||||||
L(m0,m1,m2,m3,m4,m5,m6,m7); \
|
|
||||||
/*swapping bits 32i||32i+1||...||32i+15 with bits 32i+16||32i+17||...||32i+31 for m0,m1,m2 and m3*/ \
|
|
||||||
m4 = SWAP16(m4); m5 = SWAP16(m5); m6 = SWAP16(m6); m7 = SWAP16(m7);
|
|
||||||
|
|
||||||
/* The linear transform of the (7i+5)th round -- faster*/
|
|
||||||
#define lineartransform_R05(m0,m1,m2,m3,m4,m5,m6,m7) \
|
|
||||||
/*MDS layer*/ \
|
|
||||||
L(m0,m1,m2,m3,m4,m5,m6,m7); \
|
|
||||||
/*swapping bits 64i||64i+1||...||64i+31 with bits 64i+32||64i+33||...||64i+63 for m0,m1,m2 and m3*/ \
|
|
||||||
m4 = SWAP32(m4); m5 = SWAP32(m5); m6 = SWAP32(m6); m7 = SWAP32(m7);
|
|
||||||
|
|
||||||
/* The linear transform of the (7i+6)th round -- faster*/
|
|
||||||
#define lineartransform_R06(m0,m1,m2,m3,m4,m5,m6,m7) \
|
|
||||||
/*MDS layer*/ \
|
|
||||||
L(m0,m1,m2,m3,m4,m5,m6,m7); \
|
|
||||||
/*swapping bits 128i||128i+1||...||128i+63 with bits 128i+64||128i+65||...||128i+127 for m0,m1,m2 and m3*/ \
|
|
||||||
m4 = SWAP64(m4); m5 = SWAP64(m5); m6 = SWAP64(m6); m7 = SWAP64(m7);
|
|
||||||
|
|
||||||
/*the round function of E8 */
|
|
||||||
#define round_function(nn,r) \
|
|
||||||
S(y0,y2,y4,y6, LOAD(E8_bitslice_roundconstant[r]) ); \
|
|
||||||
S(y1,y3,y5,y7, LOAD(E8_bitslice_roundconstant[r]+16) ); \
|
|
||||||
lineartransform_R##nn(y0,y2,y4,y6,y1,y3,y5,y7);
|
|
||||||
|
|
||||||
/*the compression function F8 */
|
|
||||||
void F8(hashState *state)
|
|
||||||
{
|
|
||||||
uint32 i;
|
|
||||||
word128 y0,y1,y2,y3,y4,y5,y6,y7;
|
|
||||||
word128 temp0;
|
|
||||||
|
|
||||||
y0 = state->x0;
|
|
||||||
y1 = state->x1;
|
|
||||||
y2 = state->x2;
|
|
||||||
y3 = state->x3;
|
|
||||||
y4 = state->x4;
|
|
||||||
y5 = state->x5;
|
|
||||||
y6 = state->x6;
|
|
||||||
y7 = state->x7;
|
|
||||||
|
|
||||||
/*xor the 512-bit message with the fist half of the 1024-bit hash state*/
|
|
||||||
|
|
||||||
y0 = XOR(y0, LOAD(state->buffer));
|
|
||||||
y1 = XOR(y1, LOAD(state->buffer+16));
|
|
||||||
y2 = XOR(y2, LOAD(state->buffer+32));
|
|
||||||
y3 = XOR(y3, LOAD(state->buffer+48));
|
|
||||||
|
|
||||||
/*perform 42 rounds*/
|
|
||||||
for (i = 0; i < 42; i = i+7) {
|
|
||||||
round_function(00,i);
|
|
||||||
round_function(01,i+1);
|
|
||||||
round_function(02,i+2);
|
|
||||||
round_function(03,i+3);
|
|
||||||
round_function(04,i+4);
|
|
||||||
round_function(05,i+5);
|
|
||||||
round_function(06,i+6);
|
|
||||||
}
|
|
||||||
|
|
||||||
/*xor the 512-bit message with the second half of the 1024-bit hash state*/
|
|
||||||
|
|
||||||
y4 = XOR(y4, LOAD(state->buffer));
|
|
||||||
y5 = XOR(y5, LOAD(state->buffer+16));
|
|
||||||
y6 = XOR(y6, LOAD(state->buffer+32));
|
|
||||||
y7 = XOR(y7, LOAD(state->buffer+48));
|
|
||||||
|
|
||||||
state->x0 = y0;
|
|
||||||
state->x1 = y1;
|
|
||||||
state->x2 = y2;
|
|
||||||
state->x3 = y3;
|
|
||||||
state->x4 = y4;
|
|
||||||
state->x5 = y5;
|
|
||||||
state->x6 = y6;
|
|
||||||
state->x7 = y7;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*before hashing a message, initialize the hash state as H0 */
|
|
||||||
HashReturn Init(hashState *state, int hashbitlen)
|
|
||||||
{
|
|
||||||
|
|
||||||
state->databitlen = 0;
|
|
||||||
state->datasize_in_buffer = 0;
|
|
||||||
|
|
||||||
state->hashbitlen = hashbitlen;
|
|
||||||
|
|
||||||
/*initialize the initial hash value of JH*/
|
|
||||||
/*load the intital hash value into state*/
|
|
||||||
|
|
||||||
switch(hashbitlen)
|
|
||||||
{
|
|
||||||
case 224:
|
|
||||||
state->x0 = LOAD(JH224_H0);
|
|
||||||
state->x1 = LOAD(JH224_H0+16);
|
|
||||||
state->x2 = LOAD(JH224_H0+32);
|
|
||||||
state->x3 = LOAD(JH224_H0+48);
|
|
||||||
state->x4 = LOAD(JH224_H0+64);
|
|
||||||
state->x5 = LOAD(JH224_H0+80);
|
|
||||||
state->x6 = LOAD(JH224_H0+96);
|
|
||||||
state->x7 = LOAD(JH224_H0+112);
|
|
||||||
break;
|
|
||||||
|
|
||||||
case 256:
|
|
||||||
state->x0 = LOAD(JH256_H0);
|
|
||||||
state->x1 = LOAD(JH256_H0+16);
|
|
||||||
state->x2 = LOAD(JH256_H0+32);
|
|
||||||
state->x3 = LOAD(JH256_H0+48);
|
|
||||||
state->x4 = LOAD(JH256_H0+64);
|
|
||||||
state->x5 = LOAD(JH256_H0+80);
|
|
||||||
state->x6 = LOAD(JH256_H0+96);
|
|
||||||
state->x7 = LOAD(JH256_H0+112);
|
|
||||||
break;
|
|
||||||
|
|
||||||
case 384:
|
|
||||||
state->x0 = LOAD(JH384_H0);
|
|
||||||
state->x1 = LOAD(JH384_H0+16);
|
|
||||||
state->x2 = LOAD(JH384_H0+32);
|
|
||||||
state->x3 = LOAD(JH384_H0+48);
|
|
||||||
state->x4 = LOAD(JH384_H0+64);
|
|
||||||
state->x5 = LOAD(JH384_H0+80);
|
|
||||||
state->x6 = LOAD(JH384_H0+96);
|
|
||||||
state->x7 = LOAD(JH384_H0+112);
|
|
||||||
break;
|
|
||||||
|
|
||||||
case 512:
|
|
||||||
state->x0 = LOAD(JH512_H0);
|
|
||||||
state->x1 = LOAD(JH512_H0+16);
|
|
||||||
state->x2 = LOAD(JH512_H0+32);
|
|
||||||
state->x3 = LOAD(JH512_H0+48);
|
|
||||||
state->x4 = LOAD(JH512_H0+64);
|
|
||||||
state->x5 = LOAD(JH512_H0+80);
|
|
||||||
state->x6 = LOAD(JH512_H0+96);
|
|
||||||
state->x7 = LOAD(JH512_H0+112);
|
|
||||||
break;
|
|
||||||
}
|
|
||||||
|
|
||||||
return(SUCCESS);
|
|
||||||
}
|
|
||||||
|
|
||||||
/*hash each 512-bit message block, except the last partial block*/
|
|
||||||
HashReturn Update(hashState *state, const BitSequence *data, DataLength databitlen)
|
|
||||||
{
|
|
||||||
DataLength index; /*the starting address of the data to be compressed*/
|
|
||||||
|
|
||||||
state->databitlen += databitlen;
|
|
||||||
index = 0;
|
|
||||||
|
|
||||||
/*if there is remaining data in the buffer, fill it to a full message block first*/
|
|
||||||
/*we assume that the size of the data in the buffer is the multiple of 8 bits if it is not at the end of a message*/
|
|
||||||
|
|
||||||
/*There is data in the buffer, but the incoming data is insufficient for a full block*/
|
|
||||||
if ( (state->datasize_in_buffer > 0 ) && (( state->datasize_in_buffer + databitlen) < 512) ) {
|
|
||||||
if ( (databitlen & 7) == 0 ) {
|
|
||||||
memcpy(state->buffer + (state->datasize_in_buffer >> 3), data, 64-(state->datasize_in_buffer >> 3)) ;
|
|
||||||
}
|
|
||||||
else memcpy(state->buffer + (state->datasize_in_buffer >> 3), data, 64-(state->datasize_in_buffer >> 3)+1) ;
|
|
||||||
state->datasize_in_buffer += databitlen;
|
|
||||||
databitlen = 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*There is data in the buffer, and the incoming data is sufficient for a full block*/
|
|
||||||
if ( (state->datasize_in_buffer > 0 ) && (( state->datasize_in_buffer + databitlen) >= 512) ) {
|
|
||||||
memcpy( state->buffer + (state->datasize_in_buffer >> 3), data, 64-(state->datasize_in_buffer >> 3) ) ;
|
|
||||||
index = 64-(state->datasize_in_buffer >> 3);
|
|
||||||
databitlen = databitlen - (512 - state->datasize_in_buffer);
|
|
||||||
F8(state);
|
|
||||||
state->datasize_in_buffer = 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*hash the remaining full message blocks*/
|
|
||||||
for ( ; databitlen >= 512; index = index+64, databitlen = databitlen - 512) {
|
|
||||||
memcpy(state->buffer, data+index, 64);
|
|
||||||
F8(state);
|
|
||||||
}
|
|
||||||
|
|
||||||
/*store the partial block into buffer, assume that -- if part of the last byte is not part of the message, then that part consists of 0 bits*/
|
|
||||||
if ( databitlen > 0) {
|
|
||||||
if ((databitlen & 7) == 0)
|
|
||||||
memcpy(state->buffer, data+index, (databitlen & 0x1ff) >> 3);
|
|
||||||
else
|
|
||||||
memcpy(state->buffer, data+index, ((databitlen & 0x1ff) >> 3)+1);
|
|
||||||
state->datasize_in_buffer = databitlen;
|
|
||||||
}
|
|
||||||
|
|
||||||
return(SUCCESS);
|
|
||||||
}
|
|
||||||
|
|
||||||
/*pad the message, process the padded block(s), truncate the hash value H to obtain the message digest*/
HashReturn Final(hashState *state, BitSequence *hashval)
{
   unsigned int i;
   DATA_ALIGN16(unsigned char t[64]);

   if ( (state->databitlen & 0x1ff) == 0 )
   {
      /*pad the message when databitlen is a multiple of 512 bits, then process the padded block*/
      memset(state->buffer, 0, 64);
      state->buffer[0]  = 0x80;
      state->buffer[63] = state->databitlen & 0xff;
      state->buffer[62] = (state->databitlen >> 8)  & 0xff;
      state->buffer[61] = (state->databitlen >> 16) & 0xff;
      state->buffer[60] = (state->databitlen >> 24) & 0xff;
      state->buffer[59] = (state->databitlen >> 32) & 0xff;
      state->buffer[58] = (state->databitlen >> 40) & 0xff;
      state->buffer[57] = (state->databitlen >> 48) & 0xff;
      state->buffer[56] = (state->databitlen >> 56) & 0xff;
      F8(state);
   }
   else {
      /*set the rest of the bytes in the buffer to 0*/
      if ( (state->datasize_in_buffer & 7) == 0 )
         for (i = (state->databitlen & 0x1ff) >> 3; i < 64; i++) state->buffer[i] = 0;
      else
         for (i = ((state->databitlen & 0x1ff) >> 3)+1; i < 64; i++) state->buffer[i] = 0;

      /*pad and process the partial block when databitlen is not a multiple of 512 bits, then hash the padded blocks*/
      state->buffer[(state->databitlen & 0x1ff) >> 3] |= 1 << (7 - (state->databitlen & 7));
      F8(state);
      memset(state->buffer, 0, 64);
      state->buffer[63] = state->databitlen & 0xff;
      state->buffer[62] = (state->databitlen >> 8)  & 0xff;
      state->buffer[61] = (state->databitlen >> 16) & 0xff;
      state->buffer[60] = (state->databitlen >> 24) & 0xff;
      state->buffer[59] = (state->databitlen >> 32) & 0xff;
      state->buffer[58] = (state->databitlen >> 40) & 0xff;
      state->buffer[57] = (state->databitlen >> 48) & 0xff;
      state->buffer[56] = (state->databitlen >> 56) & 0xff;
      F8(state);
   }

   /*truncating the final hash value to generate the message digest*/
   STORE(state->x4, t);
   STORE(state->x5, t+16);
   STORE(state->x6, t+32);
   STORE(state->x7, t+48);

   switch (state->hashbitlen)
   {
      case 224: memcpy(hashval, t+36, 28); break;
      case 256: memcpy(hashval, t+32, 32); break;
      case 384: memcpy(hashval, t+16, 48); break;
      case 512: memcpy(hashval, t,    64); break;
   }

   return(SUCCESS);
}

/* hash a message,
   three inputs: message digest size in bits (hashbitlen); message (data); message length in bits (databitlen)
   one output: message digest (hashval)
*/
HashReturn Hash(int hashbitlen, const BitSequence *data, DataLength databitlen, BitSequence *hashval)
{
   hashState state;

   if ( hashbitlen == 224 || hashbitlen == 256 || hashbitlen == 384 || hashbitlen == 512 )
   {
      Init(&state, hashbitlen);
      Update(&state, data, databitlen);
      Final(&state, hashval);
      return SUCCESS;
   }
   else
      return(BAD_HASHLEN);
}
@@ -1,357 +0,0 @@
/*This program gives the optimized SSE2 bitslice implementation of JH for 64-bit platforms (with 16 128-bit XMM registers).

   --------------------------------
   Performance

   Microprocessor: Intel Core 2 processor (Core 2 Duo Mobile T6600 2.2GHz)
   Operating System: 64-bit Ubuntu 10.04 (Linux kernel 2.6.32-22-generic)
   Speed for long messages:
   1) 19.9 cycles/byte   compiler: Intel C++ Compiler 11.1   compilation option: icc -O3
   2) 20.9 cycles/byte   compiler: gcc 4.4.3                 compilation option: gcc -msse2 -O3

   --------------------------------
   Compared with the original JH SSE2 code (October 2008) for 64-bit platforms, we made the following modifications:
   a) The Sbox implementation follows exactly the description given in the document
   b) The data alignment definition is improved so that the code can be compiled by GCC, the Intel C++ compiler and the Microsoft Visual C compiler
   c) Using y0,y1,..,y7 variables in function F8 for performance improvement (local variables in function F8, so that the compiler can optimize the code easily)
   d) Removed a number of intermediate variables from the program (so as to give the compiler more freedom to optimize the code)
   e) Using a "for" loop to implement the 42 rounds (with 7 rounds in each loop iteration), so as to reduce the code size

   --------------------------------
   Last Modified: January 16, 2011
*/

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>
#include "algo/sha/sha3-defs.h"

typedef __m128i word128;    /*word128 defines a 128-bit SSE2 word*/
typedef enum {jhSUCCESS = 0, jhFAIL = 1, jhBAD_HASHLEN = 2} jhReturn;

/*define data alignment for different C compilers*/
#if defined(__GNUC__)
#define DATA_ALIGN16(x) x __attribute__ ((aligned(16)))
#else
#define DATA_ALIGN16(x) __declspec(align(16)) x
#endif
typedef struct {
   DataLength jhbitlen;             /*the message digest size*/
   DataLength databitlen;           /*the message size in bits*/
   DataLength datasize_in_buffer;   /*the size of the message remaining in the buffer; assumed to be a multiple of 8 bits except for the last partial block at the end of the message*/
   word128 x0,x1,x2,x3,x4,x5,x6,x7; /*1024-bit state*/
   unsigned char buffer[64];        /*512-bit message block*/
} jhState;

#define DECL_JH \
   word128 jhSx0,jhSx1,jhSx2,jhSx3,jhSx4,jhSx5,jhSx6,jhSx7; \
   unsigned char jhSbuffer[64];

/*The initial hash value H(0)*/
static DATA_ALIGN16(const unsigned char JH512_H0[128])={0x6f,0xd1,0x4b,0x96,0x3e,0x0,0xaa,0x17,0x63,0x6a,0x2e,0x5,0x7a,0x15,0xd5,0x43,0x8a,0x22,0x5e,0x8d,0xc,0x97,0xef,0xb,0xe9,0x34,0x12,0x59,0xf2,0xb3,0xc3,0x61,0x89,0x1d,0xa0,0xc1,0x53,0x6f,0x80,0x1e,0x2a,0xa9,0x5,0x6b,0xea,0x2b,0x6d,0x80,0x58,0x8e,0xcc,0xdb,0x20,0x75,0xba,0xa6,0xa9,0xf,0x3a,0x76,0xba,0xf8,0x3b,0xf7,0x1,0x69,0xe6,0x5,0x41,0xe3,0x4a,0x69,0x46,0xb5,0x8a,0x8e,0x2e,0x6f,0xe6,0x5a,0x10,0x47,0xa7,0xd0,0xc1,0x84,0x3c,0x24,0x3b,0x6e,0x71,0xb1,0x2d,0x5a,0xc1,0x99,0xcf,0x57,0xf6,0xec,0x9d,0xb1,0xf8,0x56,0xa7,0x6,0x88,0x7c,0x57,0x16,0xb1,0x56,0xe3,0xc2,0xfc,0xdf,0xe6,0x85,0x17,0xfb,0x54,0x5a,0x46,0x78,0xcc,0x8c,0xdd,0x4b};
/*42 round constants, each round constant is 32-byte (256-bit)*/
static DATA_ALIGN16(const unsigned char jhE8_bitslice_roundconstant[42][32])={
{0x72,0xd5,0xde,0xa2,0xdf,0x15,0xf8,0x67,0x7b,0x84,0x15,0xa,0xb7,0x23,0x15,0x57,0x81,0xab,0xd6,0x90,0x4d,0x5a,0x87,0xf6,0x4e,0x9f,0x4f,0xc5,0xc3,0xd1,0x2b,0x40},
{0xea,0x98,0x3a,0xe0,0x5c,0x45,0xfa,0x9c,0x3,0xc5,0xd2,0x99,0x66,0xb2,0x99,0x9a,0x66,0x2,0x96,0xb4,0xf2,0xbb,0x53,0x8a,0xb5,0x56,0x14,0x1a,0x88,0xdb,0xa2,0x31},
{0x3,0xa3,0x5a,0x5c,0x9a,0x19,0xe,0xdb,0x40,0x3f,0xb2,0xa,0x87,0xc1,0x44,0x10,0x1c,0x5,0x19,0x80,0x84,0x9e,0x95,0x1d,0x6f,0x33,0xeb,0xad,0x5e,0xe7,0xcd,0xdc},
{0x10,0xba,0x13,0x92,0x2,0xbf,0x6b,0x41,0xdc,0x78,0x65,0x15,0xf7,0xbb,0x27,0xd0,0xa,0x2c,0x81,0x39,0x37,0xaa,0x78,0x50,0x3f,0x1a,0xbf,0xd2,0x41,0x0,0x91,0xd3},
{0x42,0x2d,0x5a,0xd,0xf6,0xcc,0x7e,0x90,0xdd,0x62,0x9f,0x9c,0x92,0xc0,0x97,0xce,0x18,0x5c,0xa7,0xb,0xc7,0x2b,0x44,0xac,0xd1,0xdf,0x65,0xd6,0x63,0xc6,0xfc,0x23},
{0x97,0x6e,0x6c,0x3,0x9e,0xe0,0xb8,0x1a,0x21,0x5,0x45,0x7e,0x44,0x6c,0xec,0xa8,0xee,0xf1,0x3,0xbb,0x5d,0x8e,0x61,0xfa,0xfd,0x96,0x97,0xb2,0x94,0x83,0x81,0x97},
{0x4a,0x8e,0x85,0x37,0xdb,0x3,0x30,0x2f,0x2a,0x67,0x8d,0x2d,0xfb,0x9f,0x6a,0x95,0x8a,0xfe,0x73,0x81,0xf8,0xb8,0x69,0x6c,0x8a,0xc7,0x72,0x46,0xc0,0x7f,0x42,0x14},
{0xc5,0xf4,0x15,0x8f,0xbd,0xc7,0x5e,0xc4,0x75,0x44,0x6f,0xa7,0x8f,0x11,0xbb,0x80,0x52,0xde,0x75,0xb7,0xae,0xe4,0x88,0xbc,0x82,0xb8,0x0,0x1e,0x98,0xa6,0xa3,0xf4},
{0x8e,0xf4,0x8f,0x33,0xa9,0xa3,0x63,0x15,0xaa,0x5f,0x56,0x24,0xd5,0xb7,0xf9,0x89,0xb6,0xf1,0xed,0x20,0x7c,0x5a,0xe0,0xfd,0x36,0xca,0xe9,0x5a,0x6,0x42,0x2c,0x36},
{0xce,0x29,0x35,0x43,0x4e,0xfe,0x98,0x3d,0x53,0x3a,0xf9,0x74,0x73,0x9a,0x4b,0xa7,0xd0,0xf5,0x1f,0x59,0x6f,0x4e,0x81,0x86,0xe,0x9d,0xad,0x81,0xaf,0xd8,0x5a,0x9f},
{0xa7,0x5,0x6,0x67,0xee,0x34,0x62,0x6a,0x8b,0xb,0x28,0xbe,0x6e,0xb9,0x17,0x27,0x47,0x74,0x7,0x26,0xc6,0x80,0x10,0x3f,0xe0,0xa0,0x7e,0x6f,0xc6,0x7e,0x48,0x7b},
{0xd,0x55,0xa,0xa5,0x4a,0xf8,0xa4,0xc0,0x91,0xe3,0xe7,0x9f,0x97,0x8e,0xf1,0x9e,0x86,0x76,0x72,0x81,0x50,0x60,0x8d,0xd4,0x7e,0x9e,0x5a,0x41,0xf3,0xe5,0xb0,0x62},
{0xfc,0x9f,0x1f,0xec,0x40,0x54,0x20,0x7a,0xe3,0xe4,0x1a,0x0,0xce,0xf4,0xc9,0x84,0x4f,0xd7,0x94,0xf5,0x9d,0xfa,0x95,0xd8,0x55,0x2e,0x7e,0x11,0x24,0xc3,0x54,0xa5},
{0x5b,0xdf,0x72,0x28,0xbd,0xfe,0x6e,0x28,0x78,0xf5,0x7f,0xe2,0xf,0xa5,0xc4,0xb2,0x5,0x89,0x7c,0xef,0xee,0x49,0xd3,0x2e,0x44,0x7e,0x93,0x85,0xeb,0x28,0x59,0x7f},
{0x70,0x5f,0x69,0x37,0xb3,0x24,0x31,0x4a,0x5e,0x86,0x28,0xf1,0x1d,0xd6,0xe4,0x65,0xc7,0x1b,0x77,0x4,0x51,0xb9,0x20,0xe7,0x74,0xfe,0x43,0xe8,0x23,0xd4,0x87,0x8a},
{0x7d,0x29,0xe8,0xa3,0x92,0x76,0x94,0xf2,0xdd,0xcb,0x7a,0x9,0x9b,0x30,0xd9,0xc1,0x1d,0x1b,0x30,0xfb,0x5b,0xdc,0x1b,0xe0,0xda,0x24,0x49,0x4f,0xf2,0x9c,0x82,0xbf},
{0xa4,0xe7,0xba,0x31,0xb4,0x70,0xbf,0xff,0xd,0x32,0x44,0x5,0xde,0xf8,0xbc,0x48,0x3b,0xae,0xfc,0x32,0x53,0xbb,0xd3,0x39,0x45,0x9f,0xc3,0xc1,0xe0,0x29,0x8b,0xa0},
{0xe5,0xc9,0x5,0xfd,0xf7,0xae,0x9,0xf,0x94,0x70,0x34,0x12,0x42,0x90,0xf1,0x34,0xa2,0x71,0xb7,0x1,0xe3,0x44,0xed,0x95,0xe9,0x3b,0x8e,0x36,0x4f,0x2f,0x98,0x4a},
{0x88,0x40,0x1d,0x63,0xa0,0x6c,0xf6,0x15,0x47,0xc1,0x44,0x4b,0x87,0x52,0xaf,0xff,0x7e,0xbb,0x4a,0xf1,0xe2,0xa,0xc6,0x30,0x46,0x70,0xb6,0xc5,0xcc,0x6e,0x8c,0xe6},
{0xa4,0xd5,0xa4,0x56,0xbd,0x4f,0xca,0x0,0xda,0x9d,0x84,0x4b,0xc8,0x3e,0x18,0xae,0x73,0x57,0xce,0x45,0x30,0x64,0xd1,0xad,0xe8,0xa6,0xce,0x68,0x14,0x5c,0x25,0x67},
{0xa3,0xda,0x8c,0xf2,0xcb,0xe,0xe1,0x16,0x33,0xe9,0x6,0x58,0x9a,0x94,0x99,0x9a,0x1f,0x60,0xb2,0x20,0xc2,0x6f,0x84,0x7b,0xd1,0xce,0xac,0x7f,0xa0,0xd1,0x85,0x18},
{0x32,0x59,0x5b,0xa1,0x8d,0xdd,0x19,0xd3,0x50,0x9a,0x1c,0xc0,0xaa,0xa5,0xb4,0x46,0x9f,0x3d,0x63,0x67,0xe4,0x4,0x6b,0xba,0xf6,0xca,0x19,0xab,0xb,0x56,0xee,0x7e},
{0x1f,0xb1,0x79,0xea,0xa9,0x28,0x21,0x74,0xe9,0xbd,0xf7,0x35,0x3b,0x36,0x51,0xee,0x1d,0x57,0xac,0x5a,0x75,0x50,0xd3,0x76,0x3a,0x46,0xc2,0xfe,0xa3,0x7d,0x70,0x1},
{0xf7,0x35,0xc1,0xaf,0x98,0xa4,0xd8,0x42,0x78,0xed,0xec,0x20,0x9e,0x6b,0x67,0x79,0x41,0x83,0x63,0x15,0xea,0x3a,0xdb,0xa8,0xfa,0xc3,0x3b,0x4d,0x32,0x83,0x2c,0x83},
{0xa7,0x40,0x3b,0x1f,0x1c,0x27,0x47,0xf3,0x59,0x40,0xf0,0x34,0xb7,0x2d,0x76,0x9a,0xe7,0x3e,0x4e,0x6c,0xd2,0x21,0x4f,0xfd,0xb8,0xfd,0x8d,0x39,0xdc,0x57,0x59,0xef},
{0x8d,0x9b,0xc,0x49,0x2b,0x49,0xeb,0xda,0x5b,0xa2,0xd7,0x49,0x68,0xf3,0x70,0xd,0x7d,0x3b,0xae,0xd0,0x7a,0x8d,0x55,0x84,0xf5,0xa5,0xe9,0xf0,0xe4,0xf8,0x8e,0x65},
{0xa0,0xb8,0xa2,0xf4,0x36,0x10,0x3b,0x53,0xc,0xa8,0x7,0x9e,0x75,0x3e,0xec,0x5a,0x91,0x68,0x94,0x92,0x56,0xe8,0x88,0x4f,0x5b,0xb0,0x5c,0x55,0xf8,0xba,0xbc,0x4c},
{0xe3,0xbb,0x3b,0x99,0xf3,0x87,0x94,0x7b,0x75,0xda,0xf4,0xd6,0x72,0x6b,0x1c,0x5d,0x64,0xae,0xac,0x28,0xdc,0x34,0xb3,0x6d,0x6c,0x34,0xa5,0x50,0xb8,0x28,0xdb,0x71},
{0xf8,0x61,0xe2,0xf2,0x10,0x8d,0x51,0x2a,0xe3,0xdb,0x64,0x33,0x59,0xdd,0x75,0xfc,0x1c,0xac,0xbc,0xf1,0x43,0xce,0x3f,0xa2,0x67,0xbb,0xd1,0x3c,0x2,0xe8,0x43,0xb0},
{0x33,0xa,0x5b,0xca,0x88,0x29,0xa1,0x75,0x7f,0x34,0x19,0x4d,0xb4,0x16,0x53,0x5c,0x92,0x3b,0x94,0xc3,0xe,0x79,0x4d,0x1e,0x79,0x74,0x75,0xd7,0xb6,0xee,0xaf,0x3f},
{0xea,0xa8,0xd4,0xf7,0xbe,0x1a,0x39,0x21,0x5c,0xf4,0x7e,0x9,0x4c,0x23,0x27,0x51,0x26,0xa3,0x24,0x53,0xba,0x32,0x3c,0xd2,0x44,0xa3,0x17,0x4a,0x6d,0xa6,0xd5,0xad},
{0xb5,0x1d,0x3e,0xa6,0xaf,0xf2,0xc9,0x8,0x83,0x59,0x3d,0x98,0x91,0x6b,0x3c,0x56,0x4c,0xf8,0x7c,0xa1,0x72,0x86,0x60,0x4d,0x46,0xe2,0x3e,0xcc,0x8,0x6e,0xc7,0xf6},
{0x2f,0x98,0x33,0xb3,0xb1,0xbc,0x76,0x5e,0x2b,0xd6,0x66,0xa5,0xef,0xc4,0xe6,0x2a,0x6,0xf4,0xb6,0xe8,0xbe,0xc1,0xd4,0x36,0x74,0xee,0x82,0x15,0xbc,0xef,0x21,0x63},
{0xfd,0xc1,0x4e,0xd,0xf4,0x53,0xc9,0x69,0xa7,0x7d,0x5a,0xc4,0x6,0x58,0x58,0x26,0x7e,0xc1,0x14,0x16,0x6,0xe0,0xfa,0x16,0x7e,0x90,0xaf,0x3d,0x28,0x63,0x9d,0x3f},
{0xd2,0xc9,0xf2,0xe3,0x0,0x9b,0xd2,0xc,0x5f,0xaa,0xce,0x30,0xb7,0xd4,0xc,0x30,0x74,0x2a,0x51,0x16,0xf2,0xe0,0x32,0x98,0xd,0xeb,0x30,0xd8,0xe3,0xce,0xf8,0x9a},
{0x4b,0xc5,0x9e,0x7b,0xb5,0xf1,0x79,0x92,0xff,0x51,0xe6,0x6e,0x4,0x86,0x68,0xd3,0x9b,0x23,0x4d,0x57,0xe6,0x96,0x67,0x31,0xcc,0xe6,0xa6,0xf3,0x17,0xa,0x75,0x5},
{0xb1,0x76,0x81,0xd9,0x13,0x32,0x6c,0xce,0x3c,0x17,0x52,0x84,0xf8,0x5,0xa2,0x62,0xf4,0x2b,0xcb,0xb3,0x78,0x47,0x15,0x47,0xff,0x46,0x54,0x82,0x23,0x93,0x6a,0x48},
{0x38,0xdf,0x58,0x7,0x4e,0x5e,0x65,0x65,0xf2,0xfc,0x7c,0x89,0xfc,0x86,0x50,0x8e,0x31,0x70,0x2e,0x44,0xd0,0xb,0xca,0x86,0xf0,0x40,0x9,0xa2,0x30,0x78,0x47,0x4e},
{0x65,0xa0,0xee,0x39,0xd1,0xf7,0x38,0x83,0xf7,0x5e,0xe9,0x37,0xe4,0x2c,0x3a,0xbd,0x21,0x97,0xb2,0x26,0x1,0x13,0xf8,0x6f,0xa3,0x44,0xed,0xd1,0xef,0x9f,0xde,0xe7},
{0x8b,0xa0,0xdf,0x15,0x76,0x25,0x92,0xd9,0x3c,0x85,0xf7,0xf6,0x12,0xdc,0x42,0xbe,0xd8,0xa7,0xec,0x7c,0xab,0x27,0xb0,0x7e,0x53,0x8d,0x7d,0xda,0xaa,0x3e,0xa8,0xde},
{0xaa,0x25,0xce,0x93,0xbd,0x2,0x69,0xd8,0x5a,0xf6,0x43,0xfd,0x1a,0x73,0x8,0xf9,0xc0,0x5f,0xef,0xda,0x17,0x4a,0x19,0xa5,0x97,0x4d,0x66,0x33,0x4c,0xfd,0x21,0x6a},
{0x35,0xb4,0x98,0x31,0xdb,0x41,0x15,0x70,0xea,0x1e,0xf,0xbb,0xed,0xcd,0x54,0x9b,0x9a,0xd0,0x63,0xa1,0x51,0x97,0x40,0x72,0xf6,0x75,0x9d,0xbf,0x91,0x47,0x6f,0xe2}};
//static void jhF8(jhState *state);    /* the compression function F8 */

/*The API functions*/

/*The following defines operations on 128-bit word(s)*/
#define jhCONSTANT(b)  _mm_set1_epi8((b))        /*set each byte in a 128-bit register to be "b"*/

#define jhXOR(x,y)     _mm_xor_si128((x),(y))    /*jhXOR(x,y) = x ^ y, where x and y are two 128-bit words*/
#define jhAND(x,y)     _mm_and_si128((x),(y))    /*jhAND(x,y) = x & y, where x and y are two 128-bit words*/
#define jhANDNOT(x,y)  _mm_andnot_si128((x),(y)) /*jhANDNOT(x,y) = (!x) & y, where x and y are two 128-bit words*/
#define jhOR(x,y)      _mm_or_si128((x),(y))     /*jhOR(x,y) = x | y, where x and y are two 128-bit words*/

#define jhSHR1(x)      _mm_srli_epi16((x), 1)    /*jhSHR1(x) = x >> 1, where x is a 128-bit word*/
#define jhSHR2(x)      _mm_srli_epi16((x), 2)    /*jhSHR2(x) = x >> 2, where x is a 128-bit word*/
#define jhSHR4(x)      _mm_srli_epi16((x), 4)    /*jhSHR4(x) = x >> 4, where x is a 128-bit word*/
#define jhSHR8(x)      _mm_slli_epi16((x), 8)    /*jhSHR8(x) = x >> 8, where x is a 128-bit word*/
#define jhSHR16(x)     _mm_slli_epi32((x), 16)   /*jhSHR16(x) = x >> 16, where x is a 128-bit word*/
#define jhSHR32(x)     _mm_slli_epi64((x), 32)   /*jhSHR32(x) = x >> 32, where x is a 128-bit word*/
#define jhSHR64(x)     _mm_slli_si128((x), 8)    /*jhSHR64(x) = x >> 64, where x is a 128-bit word*/

#define jhSHL1(x)      _mm_slli_epi16((x), 1)    /*jhSHL1(x) = x << 1, where x is a 128-bit word*/
#define jhSHL2(x)      _mm_slli_epi16((x), 2)    /*jhSHL2(x) = x << 2, where x is a 128-bit word*/
#define jhSHL4(x)      _mm_slli_epi16((x), 4)    /*jhSHL4(x) = x << 4, where x is a 128-bit word*/
#define jhSHL8(x)      _mm_srli_epi16((x), 8)    /*jhSHL8(x) = x << 8, where x is a 128-bit word*/
#define jhSHL16(x)     _mm_srli_epi32((x), 16)   /*jhSHL16(x) = x << 16, where x is a 128-bit word*/
#define jhSHL32(x)     _mm_srli_epi64((x), 32)   /*jhSHL32(x) = x << 32, where x is a 128-bit word*/
#define jhSHL64(x)     _mm_srli_si128((x), 8)    /*jhSHL64(x) = x << 64, where x is a 128-bit word*/

#define jhSWAP1(x)   jhOR(jhSHR1(jhAND((x),jhCONSTANT(0xaa))),jhSHL1(jhAND((x),jhCONSTANT(0x55))))  /*swapping bit 2i with bit 2i+1 of the 128-bit x*/
#define jhSWAP2(x)   jhOR(jhSHR2(jhAND((x),jhCONSTANT(0xcc))),jhSHL2(jhAND((x),jhCONSTANT(0x33))))  /*swapping bits 4i||4i+1 with bits 4i+2||4i+3 of the 128-bit x*/
#define jhSWAP4(x)   jhOR(jhSHR4(jhAND((x),jhCONSTANT(0xf0))),jhSHL4(jhAND((x),jhCONSTANT(0xf))))   /*swapping bits 8i||8i+1||8i+2||8i+3 with bits 8i+4||8i+5||8i+6||8i+7 of the 128-bit x*/
#define jhSWAP8(x)   jhOR(jhSHR8(x),jhSHL8(x))                     /*swapping bits 16i||16i+1||...||16i+7 with bits 16i+8||16i+9||...||16i+15 of the 128-bit x*/
#define jhSWAP16(x)  jhOR(jhSHR16(x),jhSHL16(x))                   /*swapping bits 32i||32i+1||...||32i+15 with bits 32i+16||32i+17||...||32i+31 of the 128-bit x*/
#define jhSWAP32(x)  _mm_shuffle_epi32((x),_MM_SHUFFLE(2,3,0,1))   /*swapping bits 64i||64i+1||...||64i+31 with bits 64i+32||64i+33||...||64i+63 of the 128-bit x*/
#define jhSWAP64(x)  _mm_shuffle_epi32((x),_MM_SHUFFLE(1,0,3,2))   /*swapping bits 128i||128i+1||...||128i+63 with bits 128i+64||128i+65||...||128i+127 of the 128-bit x*/
#define jhSTORE(x,p) _mm_store_si128((__m128i *)(p), (x))          /*store the 128-bit word x into memory address p, where p is a multiple of 16 bytes*/
#define jhLOAD(p)    _mm_load_si128((__m128i *)(p))                /*load 16 bytes from memory address p, returning a 128-bit word, where p is a multiple of 16 bytes*/

/*The MDS code*/
#define jhL(m0,m1,m2,m3,m4,m5,m6,m7) \
   (m4) = jhXOR((m4),(m1)); \
   (m5) = jhXOR((m5),(m2)); \
   (m6) = jhXOR(jhXOR((m6),(m3)),(m0)); \
   (m7) = jhXOR((m7),(m0)); \
   (m0) = jhXOR((m0),(m5)); \
   (m1) = jhXOR((m1),(m6)); \
   (m2) = jhXOR(jhXOR((m2),(m7)),(m4)); \
   (m3) = jhXOR((m3),(m4));

/*Two Sboxes computed in parallel, each Sbox implements S0 and S1, selected by a constant bit*/
/*The reason to compute two Sboxes in parallel is to try to fully utilize the parallel processing power of SSE2 instructions*/
#define jhSS(m0,m1,m2,m3,m4,m5,m6,m7,constant0,constant1) \
   m3 = jhXOR(m3,jhCONSTANT(0xff)); \
   m7 = jhXOR(m7,jhCONSTANT(0xff)); \
   m0 = jhXOR(m0,jhANDNOT(m2,constant0)); \
   m4 = jhXOR(m4,jhANDNOT(m6,constant1)); \
   a0 = jhXOR(constant0,jhAND(m0,m1)); \
   a1 = jhXOR(constant1,jhAND(m4,m5)); \
   m0 = jhXOR(m0,jhAND(m3,m2)); \
   m4 = jhXOR(m4,jhAND(m7,m6)); \
   m3 = jhXOR(m3,jhANDNOT(m1,m2)); \
   m7 = jhXOR(m7,jhANDNOT(m5,m6)); \
   m1 = jhXOR(m1,jhAND(m0,m2)); \
   m5 = jhXOR(m5,jhAND(m4,m6)); \
   m2 = jhXOR(m2,jhANDNOT(m3,m0)); \
   m6 = jhXOR(m6,jhANDNOT(m7,m4)); \
   m0 = jhXOR(m0,jhOR(m1,m3)); \
   m4 = jhXOR(m4,jhOR(m5,m7)); \
   m3 = jhXOR(m3,jhAND(m1,m2)); \
   m7 = jhXOR(m7,jhAND(m5,m6)); \
   m2 = jhXOR(m2,a0); \
   m6 = jhXOR(m6,a1); \
   m1 = jhXOR(m1,jhAND(a0,m0)); \
   m5 = jhXOR(m5,jhAND(a1,m4));

/* The linear transform of the (7*i+0)th round*/
#define jhlineartransform_R00(m0,m1,m2,m3,m4,m5,m6,m7) \
   /*MDS layer*/ \
   jhL(m0,m1,m2,m3,m4,m5,m6,m7); \
   /*swapping bit 2i with bit 2i+1 for m4,m5,m6 and m7*/ \
   m4 = jhSWAP1(m4); m5 = jhSWAP1(m5); m6 = jhSWAP1(m6); m7 = jhSWAP1(m7);

/* The linear transform of the (7*i+1)th round*/
#define jhlineartransform_R01(m0,m1,m2,m3,m4,m5,m6,m7) \
   /*MDS layer*/ \
   jhL(m0,m1,m2,m3,m4,m5,m6,m7); \
   /*swapping bits 4i||4i+1 with bits 4i+2||4i+3 for m4,m5,m6 and m7*/ \
   m4 = jhSWAP2(m4); m5 = jhSWAP2(m5); m6 = jhSWAP2(m6); m7 = jhSWAP2(m7);

/* The linear transform of the (7*i+2)th round*/
#define jhlineartransform_R02(m0,m1,m2,m3,m4,m5,m6,m7) \
   /*MDS layer*/ \
   jhL(m0,m1,m2,m3,m4,m5,m6,m7); \
   /*swapping bits 8i||8i+1||8i+2||8i+3 with bits 8i+4||8i+5||8i+6||8i+7 for m4,m5,m6 and m7*/ \
   m4 = jhSWAP4(m4); m5 = jhSWAP4(m5); m6 = jhSWAP4(m6); m7 = jhSWAP4(m7);

/* The linear transform of the (7*i+3)th round*/
#define jhlineartransform_R03(m0,m1,m2,m3,m4,m5,m6,m7) \
   /*MDS layer*/ \
   jhL(m0,m1,m2,m3,m4,m5,m6,m7); \
   /*swapping bits 16i||16i+1||...||16i+7 with bits 16i+8||16i+9||...||16i+15 for m4,m5,m6 and m7*/ \
   m4 = jhSWAP8(m4); m5 = jhSWAP8(m5); m6 = jhSWAP8(m6); m7 = jhSWAP8(m7);

/* The linear transform of the (7*i+4)th round*/
#define jhlineartransform_R04(m0,m1,m2,m3,m4,m5,m6,m7) \
   /*MDS layer*/ \
   jhL(m0,m1,m2,m3,m4,m5,m6,m7); \
   /*swapping bits 32i||32i+1||...||32i+15 with bits 32i+16||32i+17||...||32i+31 for m4,m5,m6 and m7*/ \
   m4 = jhSWAP16(m4); m5 = jhSWAP16(m5); m6 = jhSWAP16(m6); m7 = jhSWAP16(m7);

/* The linear transform of the (7*i+5)th round -- faster*/
#define jhlineartransform_R05(m0,m1,m2,m3,m4,m5,m6,m7) \
   /*MDS layer*/ \
   jhL(m0,m1,m2,m3,m4,m5,m6,m7); \
   /*swapping bits 64i||64i+1||...||64i+31 with bits 64i+32||64i+33||...||64i+63 for m4,m5,m6 and m7*/ \
   m4 = jhSWAP32(m4); m5 = jhSWAP32(m5); m6 = jhSWAP32(m6); m7 = jhSWAP32(m7);

/* The linear transform of the (7*i+6)th round -- faster*/
#define jhlineartransform_R06(m0,m1,m2,m3,m4,m5,m6,m7) \
   /*MDS layer*/ \
   jhL(m0,m1,m2,m3,m4,m5,m6,m7); \
   /*swapping bits 128i||128i+1||...||128i+63 with bits 128i+64||128i+65||...||128i+127 for m4,m5,m6 and m7*/ \
   m4 = jhSWAP64(m4); m5 = jhSWAP64(m5); m6 = jhSWAP64(m6); m7 = jhSWAP64(m7);

/*the round function of E8*/
#define jhround_function(nn,r) \
   jhSS(y0,y2,y4,y6,y1,y3,y5,y7, jhLOAD(jhE8_bitslice_roundconstant[r]), jhLOAD(jhE8_bitslice_roundconstant[r]+16) ); \
   jhlineartransform_R##nn(y0,y2,y4,y6,y1,y3,y5,y7);

/*the round function of E8, operating on the in-register state jhSx0..jhSx7*/
#define jhround_functionI(nn,r) \
   jhSS(jhSx0,jhSx2,jhSx4,jhSx6,jhSx1,jhSx3,jhSx5,jhSx7, jhLOAD(jhE8_bitslice_roundconstant[r]), jhLOAD(jhE8_bitslice_roundconstant[r]+16) ); \
   jhlineartransform_R##nn(jhSx0,jhSx2,jhSx4,jhSx6,jhSx1,jhSx3,jhSx5,jhSx7);

/*
//the compression function F8
static void jhF8(jhState *state)
{
      return;
      uint64_t i;
      word128 y0,y1,y2,y3,y4,y5,y6,y7;
      word128 a0,a1;

      //xor the 512-bit message with the first half of the 1024-bit hash state
      y0 = state->x0;
      y0 = jhXOR(y0, jhLOAD(state->buffer));
      y1 = state->x1;
      y1 = jhXOR(y1, jhLOAD(state->buffer+16));
      y2 = state->x2;
      y2 = jhXOR(y2, jhLOAD(state->buffer+32));
      y3 = state->x3;
      y3 = jhXOR(y3, jhLOAD(state->buffer+48));
      y4 = state->x4;
      y5 = state->x5;
      y6 = state->x6;
      y7 = state->x7;

      //perform 42 rounds
      for (i = 0; i < 42; i = i+7) {
         jhround_function(00,i);
         jhround_function(01,i+1);
         jhround_function(02,i+2);
         jhround_function(03,i+3);
         jhround_function(04,i+4);
         jhround_function(05,i+5);
         jhround_function(06,i+6);
      }

      //xor the 512-bit message with the second half of the 1024-bit hash state
      state->x0 = y0;
      state->x1 = y1;
      state->x2 = y2;
      state->x3 = y3;
      y4 = jhXOR(y4, jhLOAD(state->buffer));
      state->x4 = y4;
      y5 = jhXOR(y5, jhLOAD(state->buffer+16));
      state->x5 = y5;
      y6 = jhXOR(y6, jhLOAD(state->buffer+32));
      state->x6 = y6;
      y7 = jhXOR(y7, jhLOAD(state->buffer+48));
      state->x7 = y7;
}
*/

#define jhF8I \
do { \
   uint64_t i; \
   word128 a0,a1; \
   jhSx0 = jhXOR(jhSx0, jhLOAD(jhSbuffer)); \
   jhSx1 = jhXOR(jhSx1, jhLOAD(jhSbuffer+16)); \
   jhSx2 = jhXOR(jhSx2, jhLOAD(jhSbuffer+32)); \
   jhSx3 = jhXOR(jhSx3, jhLOAD(jhSbuffer+48)); \
   for (i = 0; i < 42; i = i+7) { \
      jhround_functionI(00,i); \
      jhround_functionI(01,i+1); \
      jhround_functionI(02,i+2); \
      jhround_functionI(03,i+3); \
      jhround_functionI(04,i+4); \
      jhround_functionI(05,i+5); \
      jhround_functionI(06,i+6); \
   } \
   jhSx4 = jhXOR(jhSx4, jhLOAD(jhSbuffer)); \
   jhSx5 = jhXOR(jhSx5, jhLOAD(jhSbuffer+16)); \
   jhSx6 = jhXOR(jhSx6, jhLOAD(jhSbuffer+32)); \
   jhSx7 = jhXOR(jhSx7, jhLOAD(jhSbuffer+48)); \
} while (0)

/* the whole thing
 * load from hash
 * hash = JH512(loaded)
 */
#define JH_H \
do { \
   jhSx0 = jhLOAD(JH512_H0); \
   jhSx1 = jhLOAD(JH512_H0+16); \
   jhSx2 = jhLOAD(JH512_H0+32); \
   jhSx3 = jhLOAD(JH512_H0+48); \
   jhSx4 = jhLOAD(JH512_H0+64); \
   jhSx5 = jhLOAD(JH512_H0+80); \
   jhSx6 = jhLOAD(JH512_H0+96); \
   jhSx7 = jhLOAD(JH512_H0+112); \
   /* b breaks the loop after the second, inlined pass of jhF8I */ \
   int b = false; \
   memcpy(jhSbuffer, hash, 64); \
   for (;;) { \
      jhF8I; \
      if (b) break; \
      /* build the final padded block: 0x80, zeros, then the */ \
      /* message length in bits (64*8 = 512) as a big-endian 64-bit integer */ \
      memset(jhSbuffer, 0, 48); \
      jhSbuffer[0]  = 0x80; \
      jhSbuffer[48] = 0x00; \
      jhSbuffer[49] = 0x00; \
      jhSbuffer[50] = 0x00; \
      jhSbuffer[51] = 0x00; \
      jhSbuffer[52] = 0x00; \
      jhSbuffer[53] = 0x00; \
      jhSbuffer[54] = 0x00; \
      jhSbuffer[55] = 0x00; \
      jhSbuffer[56] = ((char)((uint64_t)(64*8) >> 56)) & 0xff; \
      jhSbuffer[57] = ((char)((uint64_t)(64*8) >> 48)) & 0xff; \
      jhSbuffer[58] = ((char)((uint64_t)(64*8) >> 40)) & 0xff; \
      jhSbuffer[59] = ((char)((uint64_t)(64*8) >> 32)) & 0xff; \
      jhSbuffer[60] = ((char)((uint64_t)(64*8) >> 24)) & 0xff; \
      jhSbuffer[61] = ((char)((uint64_t)(64*8) >> 16)) & 0xff; \
      jhSbuffer[62] = ((char)((uint64_t)(64*8) >> 8))  & 0xff; \
      jhSbuffer[63] = (64*8) & 0xff; \
      b = true; \
   } \
   jhSTORE(jhSx4,(char *)(hash)); \
   jhSTORE(jhSx5,(char *)(hash)+16); \
   jhSTORE(jhSx6,(char *)(hash)+32); \
   jhSTORE(jhSx7,(char *)(hash)+48); \
} while (0)

@@ -1,127 +0,0 @@
/* $Id: sph_jh.h 216 2010-06-08 09:46:57Z tp $ */
/**
 * JH interface. JH is a family of functions which differ by
 * their output size; this implementation defines JH for output
 * sizes 224, 256, 384 and 512 bits.
 *
 * ==========================(LICENSE BEGIN)============================
 *
 * Copyright (c) 2007-2010  Projet RNRT SAPHIR
 *
 * Permission is hereby granted, free of charge, to any person obtaining
 * a copy of this software and associated documentation files (the
 * "Software"), to deal in the Software without restriction, including
 * without limitation the rights to use, copy, modify, merge, publish,
 * distribute, sublicense, and/or sell copies of the Software, and to
 * permit persons to whom the Software is furnished to do so, subject to
 * the following conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
 * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
 * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
 * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 *
 * ===========================(LICENSE END)=============================
 *
 * @file     sph_jh.h
 * @author   Thomas Pornin <thomas.pornin@cryptolog.com>
 */

#ifndef SPH_JH_H__
#define SPH_JH_H__

#ifdef __cplusplus
extern "C"{
#endif

#include <stddef.h>
#include "sph_types.h"

#define QSTATIC static

/**
 * Output size (in bits) for JH-512.
 */
#define SPH_SIZE_jh512   512

/**
 * This structure is a context for JH computations: it contains the
 * intermediate values and some data from the last entered block. Once
 * a JH computation has been performed, the context can be reused for
 * another computation.
 *
 * The contents of this structure are private. A running JH computation
 * can be cloned by copying the context (e.g. with a simple
 * <code>memcpy()</code>).
 */
typedef struct {
#ifndef DOXYGEN_IGNORE
	size_t ptr;
	union {
		sph_u64 wide[16];
		sph_u32 narrow[32];
	} H;
	sph_u64 block_count;
#endif
} sph_jh_context;

/**
 * Type for a JH-512 context (identical to the common context).
 */
typedef sph_jh_context sph_jh512_context;

/**
 * Initialize a JH-512 context. This process performs no memory allocation.
 *
 * @param cc   the JH-512 context (pointer to a
 *             <code>sph_jh512_context</code>)
 */
QSTATIC void sph_jh512_init(void *cc);

/**
 * Process some data bytes. It is acceptable that <code>len</code> is zero
 * (in which case this function does nothing).
 *
 * @param cc     the JH-512 context
 * @param data   the input data
 * @param len    the input data length (in bytes)
 */
QSTATIC void sph_jh512(void *cc, const void *data, size_t len);

/**
 * Terminate the current JH-512 computation and output the result into
 * the provided buffer. The destination buffer must be wide enough to
 * accommodate the result (64 bytes). The context is automatically
 * reinitialized.
 *
 * @param cc    the JH-512 context
 * @param dst   the destination buffer
 */
QSTATIC void sph_jh512_close(void *cc, void *dst);

/**
 * Add a few additional bits (0 to 7) to the current computation, then
 * terminate it and output the result in the provided buffer, which must
 * be wide enough to accommodate the result (64 bytes). If bit number i
 * in <code>ub</code> has value 2^i, then the extra bits are those
 * numbered 7 downto 8-n (this is the big-endian convention at the byte
 * level). The context is automatically reinitialized.
 *
 * @param cc    the JH-512 context
 * @param ub    the extra bits
 * @param n     the number of extra bits (0 to 7)
 * @param dst   the destination buffer
 */
QSTATIC void sph_jh512_addbits_and_close(
	void *cc, unsigned ub, unsigned n, void *dst);

#ifdef __cplusplus
}
#endif

#endif