Compare commits

...

13 Commits

Author     SHA1        Message             Date
Jay D Dee  4378d2f841  v3.23.0             2023-08-30 20:15:48 -04:00
Jay D Dee  57a6b7b58b  v3.22.3             2023-06-14 11:07:40 -04:00
Jay D Dee  de564ccbde  v3.22.2             2023-04-06 13:38:37 -04:00
Jay D Dee  fcd7727b0d  v3.22.1             2023-03-24 18:29:42 -04:00
Jay D Dee  3dd6787531  v3.22.0             2023-03-21 17:12:51 -04:00
Jay D Dee  cae1ce2ab7  v3.21.5             2023-03-15 12:27:04 -04:00
Jay D Dee  7a91c41d74  v3.21.4             2023-03-13 14:54:38 -04:00
Jay D Dee  c6bc9d67fb  v3.21.3 Unreleased  2023-03-13 03:20:13 -04:00
Jay D Dee  b339450898  v3.21.3             2023-03-11 14:54:49 -05:00
Jay D Dee  fb93160641  v3.21.2             2023-03-03 12:38:31 -05:00
Jay D Dee  520d4d5384  v3.21.1             2023-02-08 22:11:05 -05:00
Jay D Dee  da7030faa8  v3.21.0             2022-12-21 13:09:14 -05:00
Jay D Dee  bd84f199fe  v3.20.3             2022-10-21 23:12:18 -04:00
121 changed files with 18102 additions and 14901 deletions

View File

@@ -1,4 +1,6 @@
These instructions may be out of date, see the Wiki for the latest...
https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source
1. Requirements:
---------------
@@ -35,7 +37,7 @@ SHA support on AMD Ryzen CPUs requires gcc version 5 or higher and
openssl 1.1.0e or higher.
znver1 and znver2 should be recognized by most recent versions of GCC and
znver3 is expected with GCC 11. GCC 11 also includes rocketlake support.
znver3 is available with GCC 11. GCC 11 also includes rocketlake support.
In the meantime here are some suggestions to compile with new CPUs:
"-march=native" is usually the best choice, used by build.sh.

View File

@@ -1,158 +1,4 @@
Instructions for compiling cpuminer-opt for Windows.
These instructions may be out of date. Please consult the wiki for
the latest:
Please consult the wiki for Windows compile instructions.
https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source
Windows compilation using Visual Studio is not supported. Mingw64 is
used on a Linux system (bare metal or virtual machine) to cross-compile
cpuminer-opt executable binaries for Windows.
These instructions were written for Debian and Ubuntu compatible distributions
but should work on other major distributions as well. However some of the
package names or file paths may be different.
It is assumed a Linux system is already available and running. And the user
has enough Linux knowledge to find and install packages and follow these
instructions.
First it is a good idea to create new user specifically for cross compiling.
It keeps all mingw stuff contained and isolated from the rest of the system.
Step by step...
1. Install necessary packages from the distribution's repositories.
Refer to Linux compile instructions and install required packages.
Additionally, install mingw-w64.
sudo apt-get install mingw-w64 libz-mingw-w64-dev
2. Create a local library directory for packages to be compiled in the next
step. Suggested location is $HOME/usr/lib/
$ mkdir $HOME/usr/lib
3. Download and build other packages for mingw that don't have a mingw64
version available in the repositories.
Download the following source code packages from their respective and
respected download locations, copy them to $HOME/usr/lib/ and uncompress them.
openssl: https://github.com/openssl/openssl/releases
curl: https://github.com/curl/curl/releases
gmp: https://gmplib.org/download/gmp/
In most cases the latest version is ok but it's safest to download the same major and minor version as included in your distribution. The following uses versions from Ubuntu 20.04. Change version numbers as required.
Run the following commands or follow the supplied instructions. Do not run "make install" unless you are using /usr/lib, which isn't recommended.
Some instructions insist on running "make check". If make check fails it may still work, YMMV.
You can speed up "make" by using all CPU cores available with "-j n" where n is the number of CPU threads you want to use.
openssl:
$ ./Configure mingw64 shared --cross-compile-prefix=x86_64-w64-mingw32-
$ make
Make may fail with an ld error, just ensure libcrypto-1_1-x64.dll is created.
curl:
$ ./configure --with-winssl --with-winidn --host=x86_64-w64-mingw32
$ make
gmp:
$ ./configure --host=x86_64-w64-mingw32
$ make
4. Tweak the environment.
This step is required every time you log in, or the commands can be added to .bashrc.
Define some local variables to point to local library.
$ export LOCAL_LIB="$HOME/usr/lib"
$ export LDFLAGS="-L$LOCAL_LIB/curl/lib/.libs -L$LOCAL_LIB/gmp/.libs -L$LOCAL_LIB/openssl"
$ export CONFIGURE_ARGS="--with-curl=$LOCAL_LIB/curl --with-crypto=$LOCAL_LIB/openssl --host=x86_64-w64-mingw32"
Adjust for gcc version:
$ export GCC_MINGW_LIB="/usr/lib/gcc/x86_64-w64-mingw32/9.3-win32"
Create a release directory and copy some dll files previously built. This can be done outside of cpuminer-opt and only needs to be done once. If the release directory is in cpuminer-opt directory it needs to be recreated every time a source package is decompressed.
$ mkdir release
$ cp /usr/x86_64-w64-mingw32/lib/zlib1.dll release/
$ cp /usr/x86_64-w64-mingw32/lib/libwinpthread-1.dll release/
$ cp $GCC_MINGW_LIB/libstdc++-6.dll release/
$ cp $GCC_MINGW_LIB/libgcc_s_seh-1.dll release/
$ cp $LOCAL_LIB/openssl/libcrypto-1_1-x64.dll release/
$ cp $LOCAL_LIB/curl/lib/.libs/libcurl-4.dll release/
The following steps need to be done every time a new source package is
opened.
5. Download cpuminer-opt
Download the latest source code package of cpuminer-opt to your desired
location. .zip or .tar.gz, your choice.
https://github.com/JayDDee/cpuminer-opt/releases
Decompress and change to the cpuminer-opt directory.
6. compile
Create a link to the locally compiled version of gmp.h
$ ln -s $LOCAL_LIB/gmp-version/gmp.h ./gmp.h
$ ./autogen.sh
Configure the compiler for the CPU architecture of the host machine:
CFLAGS="-O3 -march=native -Wall" ./configure $CONFIGURE_ARGS
or cross compile for a specific CPU architecture:
CFLAGS="-O3 -march=znver1 -Wall" ./configure $CONFIGURE_ARGS
This will compile for AMD Ryzen.
You can compile more generically for a set of specific CPU features if you know what features you want:
CFLAGS="-O3 -maes -msse4.2 -Wall" ./configure $CONFIGURE_ARGS
This will compile for an older CPU that does not have AVX.
You can find several examples in README.txt
If you have a CPU with more than 64 threads and Windows 7 or higher you can enable the CPU Groups feature by adding the following to CFLAGS:
"-D_WIN32_WINNT=0x0601"
Once you have run configure successfully run the compiler with n CPU threads:
$ make -j n
Copy cpuminer.exe to the release directory, compress and copy the release directory to a Windows system and run cpuminer.exe from the command line.
Run cpuminer
In a command window change directories to the unzipped release folder. To get a list of all options:
cpuminer.exe --help
Command options are specific to where you mine. Refer to the pool's instructions on how to set them.

View File

@@ -55,9 +55,6 @@ cpuminer_SOURCES = \
algo/blake/mod_blakecoin.c \
algo/blake/blakecoin.c \
algo/blake/blakecoin-4way.c \
algo/blake/decred-gate.c \
algo/blake/decred.c \
algo/blake/decred-4way.c \
algo/blake/pentablake-gate.c \
algo/blake/pentablake-4way.c \
algo/blake/pentablake.c \
@@ -178,6 +175,8 @@ cpuminer_SOURCES = \
algo/sha/sha256t.c \
algo/sha/sha256q-4way.c \
algo/sha/sha256q.c \
algo/sha/sha512256d-4way.c \
algo/sha/sha256dt.c \
algo/shabal/sph_shabal.c \
algo/shabal/shabal-hash-4way.c \
algo/shavite/sph_shavite.c \
@@ -205,7 +204,6 @@ cpuminer_SOURCES = \
algo/verthash/tiny_sha3/sha3.c \
algo/verthash/tiny_sha3/sha3-4way.c \
algo/whirlpool/sph_whirlpool.c \
algo/whirlpool/whirlpool-hash-4way.c \
algo/whirlpool/whirlpool-gate.c \
algo/whirlpool/whirlpool.c \
algo/whirlpool/whirlpoolx.c \

View File

@@ -40,17 +40,25 @@ Requirements
Intel Core2 and newer and AMD equivalents. Further optimizations are available
on some algorithms for CPUs with AES, AVX, AVX2, SHA, AVX512 and VAES.
Older CPUs are supported by cpuminer-multi by TPruvot but at reduced
performance.
32 bit CPUs are not supported.
Other CPU architectures such as ARM, Raspberry Pi, RISC-V, Xeon Phi, etc,
are not supported.
ARM and Aarch64 CPUs are not supported.
Mobile CPUs, such as those in laptop computers, are not recommended because
they aren't designed for the extreme heat of operating at full load for
extended periods of time.
Older CPUs and ARM architecture may be supported by cpuminer-multi by TPruvot.
2. 64 bit Linux or Windows OS. Ubuntu and Fedora based distributions,
including Mint and Centos, are known to work and have all dependencies
in their repositories. Others may work but may require more effort. Older
versions such as Centos 6 don't work due to missing features.
64 bit Windows OS is supported with mingw_w64 and msys or pre-built binaries.
Windows 7 or newer is supported with mingw_w64 and msys or using the pre-built
binaries. WindowsXP 64 bit is YMMV.
FreeBSD is not actively tested but should work, YMMV.
MacOS, OSx and Android are not supported.
3. Stratum pool supporting stratum+tcp:// or stratum+ssl:// protocols or
@@ -66,53 +74,50 @@ Supported Algorithms
argon2d250 argon2d-crds, Credits (CRDS)
argon2d500 argon2d-dyn, Dynamic (DYN)
argon2d4096 argon2d-uis, Unitus, (UIS)
axiom Shabal-256 MemoHash
blake Blake-256 (SFR)
blake2b Blake2b 256
blake2s Blake-2 S
blake Blake-256
blake2b Blake2-512
blake2s Blake2-256
blakecoin blake256r8
bmw BMW 256
bmw512 BMW 512
c11 Chaincoin
c11
decred
deep Deepcoin (DCN)
dmd-gr Diamond-Groestl
groestl Groestl coin
hex x16r-hex
hmq1725 Espers
hmq1725
hodl Hodlcoin
jha Jackpotcoin
keccak Maxcoin
keccakc Creative coin
lbry LBC, LBRY Credits
luffa Luffa
lyra2h Hppcoin
lyra2h
lyra2re lyra2
lyra2rev2 lyra2v2
lyra2rev3 lyrav2v3
lyra2z
lyra2z330 Lyra2 330 rows, Zoin (ZOI)
m7m Magi (XMG)
minotaur Ringcoin (RNG)
lyra2z330
m7m
minotaur
minotaurx
myr-gr Myriad-Groestl
neoscrypt NeoScrypt(128, 2, 1)
nist5 Nist5
pentablake Pentablake
phi1612 phi
phi2 Luxcoin (LUX)
phi2-lux identical to phi2
pluck Pluck:128 (Supcoin)
phi2
polytimos Ninja
power2b MicroBitcoin (MBC)
quark Quark
qubit Qubit
scrypt scrypt(1024, 1, 1) (default)
scrypt:N scrypt(N, 1, 1)
scryptn2 scrypt(1048576, 1, 1)
sha256d Double SHA-256
sha256q Quad SHA-256, Pyrite (PYE)
sha256t Triple SHA-256, Onecoin (OC)
sha256q Quad SHA-256
sha256t Triple SHA-256
sha3d Double keccak256 (BSHA3)
shavite3 Shavite3
skein Skein+Sha (Skeincoin)
skein2 Double Skein (Woodcoin)
skunk Signatum (SIGT)
@@ -128,17 +133,17 @@ Supported Algorithms
x11 Dash
x11evo Revolvercoin
x11gost sib (SibCoin)
x12 Galaxie Cash (GCH)
x13 X13
x12
x13
x13bcd bcd
x13sm3 hsr (Hshare)
x14 X14
x15 X15
x14
x15
x16r
x16rv2
x16rt Gincoin (GIN)
x16rt-veil Veil (VEIL)
x16s Pigeoncoin (PGN)
x16rt
x16rt-veil veil
x16s
x17
x21s
x22i

View File

@@ -1,12 +1,22 @@
This file is included in the Windows binary package. Compile instructions
for Linux and Windows can be found in RELEASE_NOTES.
This package is officially avalable only from:
cpuminer-opt is open source and free of any fees. Many forks exist that are
closed source and contain usage fees. Support open source free software.
This package is officially available only from:
https://github.com/JayDDee/cpuminer-opt
No other sources should be trusted.
cpuminer is a console program that is executed from a DOS or Powershell
prompt. There is no GUI and no mouse support.
command prompt. There is no GUI and no mouse support.
New users are encouraged to consult the cpuminer-opt Wiki for detailed
information on usage:
https://github.com/JayDDee/cpuminer-opt/wiki
Miner programs are often flagged as malware by antivirus programs. This is
a false positive, they are flagged simply because they are cryptocurrency
@@ -43,12 +53,11 @@ cpuminer-avx2.exe Haswell, Skylake, Kabylake, Coffeelake, Cometlake
cpuminer-avx2-sha.exe AMD Zen1, Zen2
cpuminer-avx2-sha-vaes.exe Intel Alderlake*, AMD Zen3
cpuminer-avx512.exe Intel HEDT Skylake-X, Cascadelake
cpuminer-avx512-sha-vaes.exe Icelake, Tigerlake, Rocketlake
cpuminer-avx512-sha-vaes.exe AMD Zen4, Intel Rocketlake, Icelake
* Alderlake is a hybrid architecture. With the E-cores disabled it may be
possible to enable AVX512 on the the P-cores and use the avx512-sha-vaes
build. This is not officially supported by Intel at time of writing.
Check for current information.
* Alderlake is a hybrid architecture with a mix of E-cores & P-cores. Although
the P-cores can support AVX512 the E-cores can't, so Intel decided to disable
AVX512 on the P-cores.
Notes about included DLL files:
@@ -59,9 +68,10 @@ source code obtained from the author's official repository. The exact
procedure is documented in the build instructions for Windows:
https://github.com/JayDDee/cpuminer-opt/wiki/Compiling-from-source
Some DLL filess may already be installed on the system by Windows or third
party packages. They often will work and may be used instead of the included
file.
Some included DLL files may already be installed on the system by Windows or
third party packages. They often will work and may be used instead of the
included version of the files.
If you like this software feel free to donate:

View File

@@ -65,6 +65,108 @@ If not what makes it happen or not happen?
Change Log
----------
v3.23.0
#398: Prevent GBT fallback to Getwork on network error.
#398: Prevent excessive logs when conditional mining is paused when mining solo.
Fix a false start if stratum doesn't immediately send a new job after connecting.
Tweak diagonal shuffle in Blake2b & Blake256 1-way SIMD to reduce latency.
CPUID support for AVX10.
Initial changes to AVX2 targeted code in preparation for AVX10.
Code cleanup and miscellaneous small improvements.
v3.22.3
Data interleaving and byte swap optimizations with AVX2, AVX512 & AVX512VBMI.
Faster Luffa with AVX2 & AVX512.
Other small optimizations.
Some code cleanup.
v3.22.2
Added sha512256d & sha256dt algos.
Fixed intermittent invalid shares with lyra2v2 AVX512.
Removed application limits on the number of CPUs and threads, HW and OS limits still apply.
Added a log warning if more threads are defined than active CPUs in affinity mask.
Improved merkle tree memory management for stratum.
Added transaction count to New Work log.
Other small improvements.
v3.22.1
#393 fixed segfault in GBT, regression from v3.22.0.
More efficient 32 bit data interleaving.
v3.22.0
Stratum: faster netdiff calculation.
Merged a few updates from Pooler/cpuminer:
Use CURLOPT_POSTFIELDS in json_rpc_call,
Use CURLINFO_ACTIVESOCKET when supported,
JSONRPC speedup,
Speed up hex2bin function.
Small log improvements, notably more frequent hash rate reports.
Removed decred algo.
v3.21.5
All issues with v3.21.3 & v3.21.4 should be resolved.
Changes since v3.21.2:
#392 #379 #389 Fixed misaligned address segfault solo mining.
#392 Fixed stats for myr-gr algo, and a few others, for CPUs without AVX2.
#392 Fixed conditional mining.
#392 Fixed cpu affinity on Ryzen CPUs using Windows binaries,
Windows binaries no longer support CPU groups,
Windows binaries support CPUs with up to 64 threads.
Small optimizations to serialized vectoring.
v3.21.4 CANCELLED
Reapply selected changes from v3.21.3.
#392 #379 #389 Fixed misaligned address segfault solo mining.
#392 Fixed conditional mining.
#392 Fixed cpu affinity on Ryzen CPUs using Windows binaries,
Windows binaries no longer support CPU groups,
Windows binaries support CPUs with up to 64 threads.
v3.21.3.1 UNRELEASED
Revert to 3.21.2
v3.21.3 CANCELLED
#392 #379 #389 Fixed misaligned address segfault solo mining.
#392 Fixed stats for myr-gr algo, and a few others, for CPUs without AVX2.
#392 Fixed conditional mining.
#392 Fixed cpu affinity on Ryzen CPUs using Windows binaries,
Windows binaries no longer support CPU groups,
Windows binaries support CPUs with up to 64 threads.
Midstate prehash is now centralized, done only once instead of by every thread
for selected algos.
Small optimizations to serialized vectoring.
v3.21.2
Faster SALSA SIMD shuffle for yespower, yescrypt & scryptn2.
Fixed a couple of compiler warnings with gcc-12.
v3.21.1
Fixed a segfault in some obsolete algos.
Small optimizations to Hamsi & Shabal AVX2 & AVX512.
v3.21.0
Added minotaurx algo for stratum only.
Blake256 & sha256 prehash optimized to ignore zero-padded data for AVX2 & AVX512.
Other small improvements.
v3.20.3
Faster c11 algo: AVX512 6%, AVX2 4%, AVX2+VAES 15%.
Faster AVX2+VAES for anime 14%, hmq1725 6%.
Small optimizations to Luffa AVX2 & AVX512.
v3.20.2
Bit rotation optimizations to Blake256, Blake512, Blake2b, Blake2s & Lyra2-blake2b for SSE2 & AVX2.
@@ -75,7 +177,7 @@ v3.20.1
sph_blake2b optimized 1-way SSSE3 & AVX2.
Removed duplicate Blake2b used by Power2b algo, will now use optimized sph_blake2b.
Removed imprecise hash & target display from rejected share log.
Share and target difficulty is now displayed only for low diificulty shares.
Share and target difficulty is now displayed only for low difficulty shares.
Updated configure.ac to check for AVX512 asm support.
Small optimization to Lyra2 SSE2.
@@ -92,12 +194,9 @@ v3.19.8
#370 "stratum+ssl", in addition to "stratum+tcps", is now recognized as a valid
url protocol specifier for requesting a secure stratum connection.
The full url, including the protocol, is now displayed in the stratum connect
log and the periodic summary log.
Small optimizations to Cubehash, AVX2 & AVX512.
Byte order and prehash optimizations for Blake256 & Blake512, AVX2 & AVX512.
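
The #370 entry above says that both "stratum+ssl" and "stratum+tcps" request a secure stratum connection. A minimal sketch of such a check follows. The function names url_has_scheme and stratum_url_wants_tls are hypothetical and are not taken from cpuminer-opt, and the comparison is case-sensitive for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: true when url begins with the given scheme prefix.
static bool url_has_scheme( const char *url, const char *scheme )
{
   assert( url != NULL && scheme != NULL );
   return strncmp( url, scheme, strlen( scheme ) ) == 0;
}

// Hypothetical helper: both secure specifiers named in the change log
// are treated the same way, plain "stratum+tcp://" is not secure.
static bool stratum_url_wants_tls( const char *url )
{
   return url_has_scheme( url, "stratum+ssl://" )
       || url_has_scheme( url, "stratum+tcps://" );
}
```

With this sketch a URL such as stratum+ssl://pool.example.com:443 (a made-up address) would be classified as secure, while stratum+tcp://pool.example.com:3333 would not.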
v3.19.7

83
aclocal.m4 vendored
View File

@@ -1,6 +1,6 @@
# generated automatically by aclocal 1.16.1 -*- Autoconf -*-
# generated automatically by aclocal 1.16.5 -*- Autoconf -*-
# Copyright (C) 1996-2018 Free Software Foundation, Inc.
# Copyright (C) 1996-2021 Free Software Foundation, Inc.
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -14,13 +14,13 @@
m4_ifndef([AC_CONFIG_MACRO_DIRS], [m4_defun([_AM_CONFIG_MACRO_DIRS], [])m4_defun([AC_CONFIG_MACRO_DIRS], [_AM_CONFIG_MACRO_DIRS($@)])])
m4_ifndef([AC_AUTOCONF_VERSION],
[m4_copy([m4_PACKAGE_VERSION], [AC_AUTOCONF_VERSION])])dnl
m4_if(m4_defn([AC_AUTOCONF_VERSION]), [2.69],,
[m4_warning([this file was generated for autoconf 2.69.
m4_if(m4_defn([AC_AUTOCONF_VERSION]), [2.71],,
[m4_warning([this file was generated for autoconf 2.71.
You have another version of autoconf. It may work, but is not guaranteed to.
If you have problems, you may need to regenerate the build system entirely.
To do so, use the procedure documented by the package, typically 'autoreconf'.])])
# Copyright (C) 2002-2018 Free Software Foundation, Inc.
# Copyright (C) 2002-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -35,7 +35,7 @@ AC_DEFUN([AM_AUTOMAKE_VERSION],
[am__api_version='1.16'
dnl Some users find AM_AUTOMAKE_VERSION and mistake it for a way to
dnl require some minimum version. Point them to the right macro.
m4_if([$1], [1.16.1], [],
m4_if([$1], [1.16.5], [],
[AC_FATAL([Do not call $0, use AM_INIT_AUTOMAKE([$1]).])])dnl
])
@@ -51,14 +51,14 @@ m4_define([_AM_AUTOCONF_VERSION], [])
# Call AM_AUTOMAKE_VERSION and AM_AUTOMAKE_VERSION so they can be traced.
# This function is AC_REQUIREd by AM_INIT_AUTOMAKE.
AC_DEFUN([AM_SET_CURRENT_AUTOMAKE_VERSION],
[AM_AUTOMAKE_VERSION([1.16.1])dnl
[AM_AUTOMAKE_VERSION([1.16.5])dnl
m4_ifndef([AC_AUTOCONF_VERSION],
[m4_copy([m4_PACKAGE_VERSION], [AC_AUTOCONF_VERSION])])dnl
_AM_AUTOCONF_VERSION(m4_defn([AC_AUTOCONF_VERSION]))])
# Figure out how to run the assembler. -*- Autoconf -*-
# Copyright (C) 2001-2018 Free Software Foundation, Inc.
# Copyright (C) 2001-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -78,7 +78,7 @@ _AM_IF_OPTION([no-dependencies],, [_AM_DEPENDENCIES([CCAS])])dnl
# AM_AUX_DIR_EXPAND -*- Autoconf -*-
# Copyright (C) 2001-2018 Free Software Foundation, Inc.
# Copyright (C) 2001-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -130,7 +130,7 @@ am_aux_dir=`cd "$ac_aux_dir" && pwd`
# AM_CONDITIONAL -*- Autoconf -*-
# Copyright (C) 1997-2018 Free Software Foundation, Inc.
# Copyright (C) 1997-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -161,7 +161,7 @@ AC_CONFIG_COMMANDS_PRE(
Usually this means the macro was only invoked conditionally.]])
fi])])
# Copyright (C) 1999-2018 Free Software Foundation, Inc.
# Copyright (C) 1999-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -352,7 +352,7 @@ _AM_SUBST_NOTMAKE([am__nodep])dnl
# Generate code to set up dependency tracking. -*- Autoconf -*-
# Copyright (C) 1999-2018 Free Software Foundation, Inc.
# Copyright (C) 1999-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -391,7 +391,9 @@ AC_DEFUN([_AM_OUTPUT_DEPENDENCY_COMMANDS],
done
if test $am_rc -ne 0; then
AC_MSG_FAILURE([Something went wrong bootstrapping makefile fragments
for automatic dependency tracking. Try re-running configure with the
for automatic dependency tracking. If GNU make was not used, consider
re-running the configure script with MAKE="gmake" (or whatever is
necessary). You can also try re-running configure with the
'--disable-dependency-tracking' option to at least be able to build
the package (albeit without support for automatic dependency tracking).])
fi
@@ -418,7 +420,7 @@ AC_DEFUN([AM_OUTPUT_DEPENDENCY_COMMANDS],
# Do all the work for Automake. -*- Autoconf -*-
# Copyright (C) 1996-2018 Free Software Foundation, Inc.
# Copyright (C) 1996-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -446,6 +448,10 @@ m4_defn([AC_PROG_CC])
# release and drop the old call support.
AC_DEFUN([AM_INIT_AUTOMAKE],
[AC_PREREQ([2.65])dnl
m4_ifdef([_$0_ALREADY_INIT],
[m4_fatal([$0 expanded multiple times
]m4_defn([_$0_ALREADY_INIT]))],
[m4_define([_$0_ALREADY_INIT], m4_expansion_stack)])dnl
dnl Autoconf wants to disallow AM_ names. We explicitly allow
dnl the ones we care about.
m4_pattern_allow([^AM_[A-Z]+FLAGS$])dnl
@@ -482,7 +488,7 @@ m4_ifval([$3], [_AM_SET_OPTION([no-define])])dnl
[_AM_SET_OPTIONS([$1])dnl
dnl Diagnose old-style AC_INIT with new-style AM_AUTOMAKE_INIT.
m4_if(
m4_ifdef([AC_PACKAGE_NAME], [ok]):m4_ifdef([AC_PACKAGE_VERSION], [ok]),
m4_ifset([AC_PACKAGE_NAME], [ok]):m4_ifset([AC_PACKAGE_VERSION], [ok]),
[ok:ok],,
[m4_fatal([AC_INIT should be called with package and version arguments])])dnl
AC_SUBST([PACKAGE], ['AC_PACKAGE_TARNAME'])dnl
@@ -534,6 +540,20 @@ AC_PROVIDE_IFELSE([AC_PROG_OBJCXX],
[m4_define([AC_PROG_OBJCXX],
m4_defn([AC_PROG_OBJCXX])[_AM_DEPENDENCIES([OBJCXX])])])dnl
])
# Variables for tags utilities; see am/tags.am
if test -z "$CTAGS"; then
CTAGS=ctags
fi
AC_SUBST([CTAGS])
if test -z "$ETAGS"; then
ETAGS=etags
fi
AC_SUBST([ETAGS])
if test -z "$CSCOPE"; then
CSCOPE=cscope
fi
AC_SUBST([CSCOPE])
AC_REQUIRE([AM_SILENT_RULES])dnl
dnl The testsuite driver may need to know about EXEEXT, so add the
dnl 'am__EXEEXT' conditional if _AM_COMPILER_EXEEXT was seen. This
@@ -615,7 +635,7 @@ for _am_header in $config_headers :; do
done
echo "timestamp for $_am_arg" >`AS_DIRNAME(["$_am_arg"])`/stamp-h[]$_am_stamp_count])
# Copyright (C) 2001-2018 Free Software Foundation, Inc.
# Copyright (C) 2001-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -636,7 +656,7 @@ if test x"${install_sh+set}" != xset; then
fi
AC_SUBST([install_sh])])
# Copyright (C) 2003-2018 Free Software Foundation, Inc.
# Copyright (C) 2003-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -658,7 +678,7 @@ AC_SUBST([am__leading_dot])])
# Add --enable-maintainer-mode option to configure. -*- Autoconf -*-
# From Jim Meyering
# Copyright (C) 1996-2018 Free Software Foundation, Inc.
# Copyright (C) 1996-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -693,7 +713,7 @@ AC_MSG_CHECKING([whether to enable maintainer-specific portions of Makefiles])
# Check to see how 'make' treats includes. -*- Autoconf -*-
# Copyright (C) 2001-2018 Free Software Foundation, Inc.
# Copyright (C) 2001-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -736,7 +756,7 @@ AC_SUBST([am__quote])])
# Fake the existence of programs that GNU maintainers use. -*- Autoconf -*-
# Copyright (C) 1997-2018 Free Software Foundation, Inc.
# Copyright (C) 1997-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -757,12 +777,7 @@ AC_DEFUN([AM_MISSING_HAS_RUN],
[AC_REQUIRE([AM_AUX_DIR_EXPAND])dnl
AC_REQUIRE_AUX_FILE([missing])dnl
if test x"${MISSING+set}" != xset; then
case $am_aux_dir in
*\ * | *\ *)
MISSING="\${SHELL} \"$am_aux_dir/missing\"" ;;
*)
MISSING="\${SHELL} $am_aux_dir/missing" ;;
esac
MISSING="\${SHELL} '$am_aux_dir/missing'"
fi
# Use eval to expand $SHELL
if eval "$MISSING --is-lightweight"; then
@@ -775,7 +790,7 @@ fi
# Helper functions for option handling. -*- Autoconf -*-
# Copyright (C) 2001-2018 Free Software Foundation, Inc.
# Copyright (C) 2001-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -804,7 +819,7 @@ AC_DEFUN([_AM_SET_OPTIONS],
AC_DEFUN([_AM_IF_OPTION],
[m4_ifset(_AM_MANGLE_OPTION([$1]), [$2], [$3])])
# Copyright (C) 1999-2018 Free Software Foundation, Inc.
# Copyright (C) 1999-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -851,7 +866,7 @@ AC_LANG_POP([C])])
# For backward compatibility.
AC_DEFUN_ONCE([AM_PROG_CC_C_O], [AC_REQUIRE([AC_PROG_CC])])
# Copyright (C) 2001-2018 Free Software Foundation, Inc.
# Copyright (C) 2001-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -870,7 +885,7 @@ AC_DEFUN([AM_RUN_LOG],
# Check to make sure that the build environment is sane. -*- Autoconf -*-
# Copyright (C) 1996-2018 Free Software Foundation, Inc.
# Copyright (C) 1996-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -951,7 +966,7 @@ AC_CONFIG_COMMANDS_PRE(
rm -f conftest.file
])
# Copyright (C) 2009-2018 Free Software Foundation, Inc.
# Copyright (C) 2009-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -1011,7 +1026,7 @@ AC_SUBST([AM_BACKSLASH])dnl
_AM_SUBST_NOTMAKE([AM_BACKSLASH])dnl
])
# Copyright (C) 2001-2018 Free Software Foundation, Inc.
# Copyright (C) 2001-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -1039,7 +1054,7 @@ fi
INSTALL_STRIP_PROGRAM="\$(install_sh) -c -s"
AC_SUBST([INSTALL_STRIP_PROGRAM])])
# Copyright (C) 2006-2018 Free Software Foundation, Inc.
# Copyright (C) 2006-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,
@@ -1058,7 +1073,7 @@ AC_DEFUN([AM_SUBST_NOTMAKE], [_AM_SUBST_NOTMAKE($@)])
# Check how to create a tarball. -*- Autoconf -*-
# Copyright (C) 2004-2018 Free Software Foundation, Inc.
# Copyright (C) 2004-2021 Free Software Foundation, Inc.
#
# This file is free software; the Free Software Foundation
# gives unlimited permission to copy and/or distribute it,

View File

@@ -67,7 +67,6 @@ void do_nothing () {}
bool return_true () { return true; }
bool return_false () { return false; }
void *return_null () { return NULL; }
void call_error () { printf("ERR: Uninitialized function pointer\n"); }
void algo_not_tested()
{
@@ -95,7 +94,8 @@ int null_scanhash()
return 0;
}
// Default generic scanhash can be used in many cases.
// Default generic scanhash can be used in many cases. Not to be used when
// prehashing can be done or when byte swapping the data can be avoided.
int scanhash_generic( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
@@ -152,6 +152,9 @@ int scanhash_4way_64in_32out( struct work *work, uint32_t max_nonce,
const bool bench = opt_benchmark;
mm256_bswap32_intrlv80_4x64( vdata, pdata );
// overwrite byte swapped nonce with original byte order for proper
// incrementing. The nonce only needs to be byte swapped if it is to be
// submitted.
*noncev = mm256_intrlv_blend_32(
_mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
do
@@ -168,7 +171,7 @@ int scanhash_4way_64in_32out( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( ( n <= last_nonce ) && !work_restart[thr_id].restart ) );
pdata[19] = n;
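
The nonce handling in this hunk can be pictured with a scalar sketch. It assumes, as the argument order of _mm256_set_epi32 suggests, that each 64-bit lane carries its nonce in the upper 32 bits and an unrelated data word in the lower 32 bits. The names set_lane_nonces, advance_lane_nonces and lane_nonce are hypothetical, and plain uint64_t arrays stand in for the __m256i vector.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

// Write nonce n+lane into the upper 32 bits of each 64-bit lane and keep
// the lower 32 bits untouched, mimicking the blend in the hunk above.
static void set_lane_nonces( uint64_t noncev[LANES], uint32_t n )
{
   for ( int lane = 0; lane < LANES; lane++ )
      noncev[lane] = ( (uint64_t)( n + (uint32_t)lane ) << 32 )
                   | ( noncev[lane] & 0xffffffffULL );
}

// Emulate a 32-bit element add of the constant 0x0000000400000000 to every
// lane: the two halves are added separately, so no carry crosses between
// them and only the nonce half advances, by LANES.
static void advance_lane_nonces( uint64_t noncev[LANES] )
{
   const uint64_t step = 0x0000000400000000ULL;
   for ( int lane = 0; lane < LANES; lane++ )
   {
      uint32_t lo = (uint32_t)noncev[lane] + (uint32_t)step;
      uint32_t hi = (uint32_t)( noncev[lane] >> 32 )
                  + (uint32_t)( step >> 32 );
      noncev[lane] = ( (uint64_t)hi << 32 ) | lo;
   }
}

// Read back the nonce currently held by one lane.
static uint32_t lane_nonce( const uint64_t noncev[LANES], int lane )
{
   assert( lane >= 0 && lane < LANES );
   return (uint32_t)( noncev[lane] >> 32 );
}
```

Starting from n, the four lanes test n, n+1, n+2 and n+3, and after one advance they test n+4 through n+7, which matches the "n += 4" in the loop.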
@@ -224,7 +227,7 @@ int scanhash_8way_64in_32out( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
pdata[19] = n;
@@ -260,8 +263,6 @@ void init_algo_gate( algo_gate_t* gate )
gate->build_block_header = (void*)&std_build_block_header;
gate->build_extraheader = (void*)&std_build_extraheader;
gate->set_work_data_endian = (void*)&do_nothing;
gate->calc_network_diff = (void*)&std_calc_network_diff;
gate->ready_to_mine = (void*)&std_ready_to_mine;
gate->resync_threads = (void*)&do_nothing;
gate->do_this_thread = (void*)&return_true;
gate->longpoll_rpc_call = (void*)&std_longpoll_rpc_call;
@@ -305,7 +306,6 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
case ALGO_BLAKECOIN: rc = register_blakecoin_algo ( gate ); break;
case ALGO_BMW512: rc = register_bmw512_algo ( gate ); break;
case ALGO_C11: rc = register_c11_algo ( gate ); break;
case ALGO_DECRED: rc = register_decred_algo ( gate ); break;
case ALGO_DEEP: rc = register_deep_algo ( gate ); break;
case ALGO_DMD_GR: rc = register_dmd_gr_algo ( gate ); break;
case ALGO_GROESTL: rc = register_groestl_algo ( gate ); break;
@@ -324,6 +324,7 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
case ALGO_LYRA2Z330: rc = register_lyra2z330_algo ( gate ); break;
case ALGO_M7M: rc = register_m7m_algo ( gate ); break;
case ALGO_MINOTAUR: rc = register_minotaur_algo ( gate ); break;
case ALGO_MINOTAURX: rc = register_minotaur_algo ( gate ); break;
case ALGO_MYR_GR: rc = register_myriad_algo ( gate ); break;
case ALGO_NEOSCRYPT: rc = register_neoscrypt_algo ( gate ); break;
case ALGO_NIST5: rc = register_nist5_algo ( gate ); break;
@@ -336,9 +337,11 @@ bool register_algo_gate( int algo, algo_gate_t *gate )
case ALGO_QUBIT: rc = register_qubit_algo ( gate ); break;
case ALGO_SCRYPT: rc = register_scrypt_algo ( gate ); break;
case ALGO_SHA256D: rc = register_sha256d_algo ( gate ); break;
case ALGO_SHA256DT: rc = register_sha256dt_algo ( gate ); break;
case ALGO_SHA256Q: rc = register_sha256q_algo ( gate ); break;
case ALGO_SHA256T: rc = register_sha256t_algo ( gate ); break;
case ALGO_SHA3D: rc = register_sha3d_algo ( gate ); break;
case ALGO_SHA512256D: rc = register_sha512256d_algo ( gate ); break;
case ALGO_SHAVITE3: rc = register_shavite_algo ( gate ); break;
case ALGO_SKEIN: rc = register_skein_algo ( gate ); break;
case ALGO_SKEIN2: rc = register_skein2_algo ( gate ); break;
@@ -423,7 +426,6 @@ const char* const algo_alias_map[][2] =
{ "blake256r8", "blakecoin" },
{ "blake256r8vnl", "vanilla" },
{ "blake256r14", "blake" },
{ "blake256r14dcr", "decred" },
{ "diamond", "dmd-gr" },
{ "espers", "hmq1725" },
{ "flax", "c11" },
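
The hunk above drops the "blake256r14dcr" alias from a two-column table of { alias, proper name } pairs. As a rough illustration of how such a table can be consulted, here is a minimal sketch. It assumes only the entries visible in the hunk; `alias_map_excerpt` and `resolve_alias` are hypothetical names, and the project's own lookup routine is not shown in this diff and may differ (for example in case sensitivity).

```c
#include <stddef.h>
#include <string.h>

/* Entries copied from the hunk above (after the change).
   The real table is longer; this is only an excerpt. */
static const char *const alias_map_excerpt[][2] = {
    { "blake256r8",    "blakecoin" },
    { "blake256r8vnl", "vanilla"   },
    { "blake256r14",   "blake"     },
    { "diamond",       "dmd-gr"    },
    { "espers",        "hmq1725"   },
    { "flax",          "c11"       },
};

/* Hypothetical helper: return the proper name for an alias, or the
   input pointer unchanged when the name is not a known alias.
   The comparison here is case sensitive. */
static const char *resolve_alias(const char *name)
{
    const size_t n = sizeof alias_map_excerpt / sizeof alias_map_excerpt[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, alias_map_excerpt[i][0]) == 0)
            return alias_map_excerpt[i][1];
    return name;
}
```

With the removed entry gone, a name such as "blake256r14dcr" would simply fall through this sketch unchanged.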

View File

@@ -94,10 +94,13 @@ typedef uint32_t set_t;
#define SSE42_OPT 4
#define AVX_OPT 8 // Sandybridge
#define AVX2_OPT 0x10 // Haswell, Zen1
#define SHA_OPT 0x20 // Zen1, Icelake (sha256)
#define AVX512_OPT 0x40 // Skylake-X (AVX512[F,VL,DQ,BW])
#define VAES_OPT 0x80 // Icelake (VAES & AVX512)
#define SHA_OPT 0x20 // Zen1, Icelake (deprecated)
#define AVX512_OPT 0x40 // Skylake-X, Zen4 (AVX512[F,VL,DQ,BW])
#define VAES_OPT 0x80 // Icelake, Zen3
// AVX10 does not have explicit algo features:
// AVX10_512 is compatible with AVX512 + VAES
// AVX10_256 is compatible with AVX2 + VAES
// return set containing all elements from sets a & b
inline set_t set_union ( set_t a, set_t b ) { return a | b; }
@@ -144,7 +147,7 @@ void ( *gen_merkle_root ) ( char*, struct stratum_ctx* );
void ( *build_extraheader ) ( struct work*, struct stratum_ctx* );
void ( *build_block_header ) ( struct work*, uint32_t, uint32_t*,
uint32_t*, uint32_t, uint32_t,
uint32_t*, uint32_t, uint32_t,
unsigned char* );
// Build mining.submit message
@@ -155,19 +158,13 @@ char* ( *malloc_txs_request ) ( struct work* );
// Big endian or little endian
void ( *set_work_data_endian ) ( struct work* );
double ( *calc_network_diff ) ( struct work* );
// Wait for first work
bool ( *ready_to_mine ) ( struct work*, struct stratum_ctx*, int );
// Diverge mining threads
bool ( *do_this_thread ) ( int );
// After do_this_thread
void ( *resync_threads ) ( int, struct work* );
// No longer needed
json_t* (*longpoll_rpc_call) ( CURL*, int*, char* );
json_t* ( *longpoll_rpc_call ) ( CURL*, int*, char* );
set_t optimizations;
int ( *get_work_data_size ) ();
@@ -286,8 +283,6 @@ char* std_malloc_txs_request( struct work *work );
// Default is do_nothing, little endian is assumed
void set_work_data_big_endian( struct work *work );
double std_calc_network_diff( struct work *work );
void std_build_block_header( struct work* g_work, uint32_t version,
uint32_t *prevhash, uint32_t *merkle_root,
uint32_t ntime, uint32_t nbits,
@@ -297,9 +292,6 @@ void std_build_extraheader( struct work *work, struct stratum_ctx *sctx );
json_t* std_longpoll_rpc_call( CURL *curl, int *err, char *lp_url );
bool std_ready_to_mine( struct work* work, struct stratum_ctx* stratum,
int thr_id );
int std_get_work_data_size();
// Gate admin functions
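
The feature flags in this header are single bits in a `set_t`, and `set_union` is a plain bitwise OR. The sketch below shows how such flags could be combined and tested. The flag values and `set_union` are copied from the hunk; `set_has_all` is a hypothetical helper added for illustration and is not taken from the project.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t set_t;

/* Values copied from the hunk above. */
#define AVX2_OPT   0x10
#define AVX512_OPT 0x40
#define VAES_OPT   0x80

/* Same definition as in the hunk: all elements from sets a and b. */
static inline set_t set_union(set_t a, set_t b) { return a | b; }

/* Hypothetical helper: true when every flag in `want` is present in `have`. */
static inline bool set_has_all(set_t have, set_t want)
{
    return (have & want) == want;
}
```

Following the comments in the hunk, a CPU reporting AVX10_512 could be described as the set `AVX512_OPT | VAES_OPT`, and one reporting AVX10_256 as `AVX2_OPT | VAES_OPT`, so no separate AVX10 flag is needed in this scheme.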

View File

@@ -115,7 +115,7 @@ void blake256_8way_close(void *cc, void *dst);
void blake256_8way_update_le(void *cc, const void *data, size_t len);
void blake256_8way_close_le(void *cc, void *dst);
void blake256_8way_round0_prehash_le( void *midstate, const void *midhash,
const void *data );
void *data );
void blake256_8way_final_rounds_le( void *final_hash, const void *midstate,
const void *midhash, const void *data );
@@ -178,7 +178,7 @@ void blake256_16way_close(void *cc, void *dst);
void blake256_16way_update_le(void *cc, const void *data, size_t len);
void blake256_16way_close_le(void *cc, void *dst);
void blake256_16way_round0_prehash_le( void *midstate, const void *midhash,
const void *data );
void *data );
void blake256_16way_final_rounds_le( void *final_hash, const void *midstate,
const void *midhash, const void *data );

File diff suppressed because it is too large

View File

@@ -252,14 +252,14 @@ static void blake2b_8way_compress( blake2b_8way_ctx *ctx, int last )
v[ 5] = ctx->h[5];
v[ 6] = ctx->h[6];
v[ 7] = ctx->h[7];
v[ 8] = m512_const1_64( 0x6A09E667F3BCC908 );
v[ 9] = m512_const1_64( 0xBB67AE8584CAA73B );
v[10] = m512_const1_64( 0x3C6EF372FE94F82B );
v[11] = m512_const1_64( 0xA54FF53A5F1D36F1 );
v[12] = m512_const1_64( 0x510E527FADE682D1 );
v[13] = m512_const1_64( 0x9B05688C2B3E6C1F );
v[14] = m512_const1_64( 0x1F83D9ABFB41BD6B );
v[15] = m512_const1_64( 0x5BE0CD19137E2179 );
v[ 8] = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
v[ 9] = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
v[10] = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
v[11] = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
v[12] = _mm512_set1_epi64( 0x510E527FADE682D1 );
v[13] = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
v[14] = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
v[15] = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
v[12] = _mm512_xor_si512( v[12], _mm512_set1_epi64( ctx->t[0] ) );
v[13] = _mm512_xor_si512( v[13], _mm512_set1_epi64( ctx->t[1] ) );
@@ -310,16 +310,16 @@ int blake2b_8way_init( blake2b_8way_ctx *ctx )
{
size_t i;
ctx->h[0] = m512_const1_64( 0x6A09E667F3BCC908 );
ctx->h[1] = m512_const1_64( 0xBB67AE8584CAA73B );
ctx->h[2] = m512_const1_64( 0x3C6EF372FE94F82B );
ctx->h[3] = m512_const1_64( 0xA54FF53A5F1D36F1 );
ctx->h[4] = m512_const1_64( 0x510E527FADE682D1 );
ctx->h[5] = m512_const1_64( 0x9B05688C2B3E6C1F );
ctx->h[6] = m512_const1_64( 0x1F83D9ABFB41BD6B );
ctx->h[7] = m512_const1_64( 0x5BE0CD19137E2179 );
ctx->h[0] = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
ctx->h[1] = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
ctx->h[2] = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
ctx->h[3] = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
ctx->h[4] = _mm512_set1_epi64( 0x510E527FADE682D1 );
ctx->h[5] = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
ctx->h[6] = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
ctx->h[7] = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
ctx->h[0] = _mm512_xor_si512( ctx->h[0], m512_const1_64( 0x01010020 ) );
ctx->h[0] = _mm512_xor_si512( ctx->h[0], _mm512_set1_epi64( 0x01010020 ) );
ctx->t[0] = 0;
ctx->t[1] = 0;
@@ -419,14 +419,14 @@ static void blake2b_4way_compress( blake2b_4way_ctx *ctx, int last )
v[ 5] = ctx->h[5];
v[ 6] = ctx->h[6];
v[ 7] = ctx->h[7];
v[ 8] = m256_const1_64( 0x6A09E667F3BCC908 );
v[ 9] = m256_const1_64( 0xBB67AE8584CAA73B );
v[10] = m256_const1_64( 0x3C6EF372FE94F82B );
v[11] = m256_const1_64( 0xA54FF53A5F1D36F1 );
v[12] = m256_const1_64( 0x510E527FADE682D1 );
v[13] = m256_const1_64( 0x9B05688C2B3E6C1F );
v[14] = m256_const1_64( 0x1F83D9ABFB41BD6B );
v[15] = m256_const1_64( 0x5BE0CD19137E2179 );
v[ 8] = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
v[ 9] = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
v[10] = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
v[11] = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
v[12] = _mm256_set1_epi64x( 0x510E527FADE682D1 );
v[13] = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
v[14] = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
v[15] = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
v[12] = _mm256_xor_si256( v[12], _mm256_set1_epi64x( ctx->t[0] ) );
v[13] = _mm256_xor_si256( v[13], _mm256_set1_epi64x( ctx->t[1] ) );
@@ -477,16 +477,16 @@ int blake2b_4way_init( blake2b_4way_ctx *ctx )
{
size_t i;
ctx->h[0] = m256_const1_64( 0x6A09E667F3BCC908 );
ctx->h[1] = m256_const1_64( 0xBB67AE8584CAA73B );
ctx->h[2] = m256_const1_64( 0x3C6EF372FE94F82B );
ctx->h[3] = m256_const1_64( 0xA54FF53A5F1D36F1 );
ctx->h[4] = m256_const1_64( 0x510E527FADE682D1 );
ctx->h[5] = m256_const1_64( 0x9B05688C2B3E6C1F );
ctx->h[6] = m256_const1_64( 0x1F83D9ABFB41BD6B );
ctx->h[7] = m256_const1_64( 0x5BE0CD19137E2179 );
ctx->h[0] = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
ctx->h[1] = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
ctx->h[2] = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
ctx->h[3] = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
ctx->h[4] = _mm256_set1_epi64x( 0x510E527FADE682D1 );
ctx->h[5] = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
ctx->h[6] = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
ctx->h[7] = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
ctx->h[0] = _mm256_xor_si256( ctx->h[0], m256_const1_64( 0x01010020 ) );
ctx->h[0] = _mm256_xor_si256( ctx->h[0], _mm256_set1_epi64x( 0x01010020 ) );
ctx->t[0] = 0;
ctx->t[1] = 0;

View File

@@ -62,14 +62,14 @@ int blake2s_4way_init( blake2s_4way_state *S, const uint8_t outlen )
memset( S, 0, sizeof( blake2s_4way_state ) );
S->h[0] = m128_const1_64( 0x6A09E6676A09E667ULL );
S->h[1] = m128_const1_64( 0xBB67AE85BB67AE85ULL );
S->h[2] = m128_const1_64( 0x3C6EF3723C6EF372ULL );
S->h[3] = m128_const1_64( 0xA54FF53AA54FF53AULL );
S->h[4] = m128_const1_64( 0x510E527F510E527FULL );
S->h[5] = m128_const1_64( 0x9B05688C9B05688CULL );
S->h[6] = m128_const1_64( 0x1F83D9AB1F83D9ABULL );
S->h[7] = m128_const1_64( 0x5BE0CD195BE0CD19ULL );
S->h[0] = _mm_set1_epi64x( 0x6A09E6676A09E667ULL );
S->h[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85ULL );
S->h[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372ULL );
S->h[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53AULL );
S->h[4] = _mm_set1_epi64x( 0x510E527F510E527FULL );
S->h[5] = _mm_set1_epi64x( 0x9B05688C9B05688CULL );
S->h[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9ABULL );
S->h[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19ULL );
// for( int i = 0; i < 8; ++i )
// S->h[i] = _mm_set1_epi32( blake2s_IV[i] );
@@ -90,18 +90,18 @@ int blake2s_4way_compress( blake2s_4way_state *S, const __m128i* block )
memcpy_128( m, block, 16 );
memcpy_128( v, S->h, 8 );
v[ 8] = m128_const1_64( 0x6A09E6676A09E667ULL );
v[ 9] = m128_const1_64( 0xBB67AE85BB67AE85ULL );
v[10] = m128_const1_64( 0x3C6EF3723C6EF372ULL );
v[11] = m128_const1_64( 0xA54FF53AA54FF53AULL );
v[ 8] = _mm_set1_epi64x( 0x6A09E6676A09E667ULL );
v[ 9] = _mm_set1_epi64x( 0xBB67AE85BB67AE85ULL );
v[10] = _mm_set1_epi64x( 0x3C6EF3723C6EF372ULL );
v[11] = _mm_set1_epi64x( 0xA54FF53AA54FF53AULL );
v[12] = _mm_xor_si128( _mm_set1_epi32( S->t[0] ),
m128_const1_64( 0x510E527F510E527FULL ) );
_mm_set1_epi64x( 0x510E527F510E527FULL ) );
v[13] = _mm_xor_si128( _mm_set1_epi32( S->t[1] ),
m128_const1_64( 0x9B05688C9B05688CULL ) );
_mm_set1_epi64x( 0x9B05688C9B05688CULL ) );
v[14] = _mm_xor_si128( _mm_set1_epi32( S->f[0] ),
m128_const1_64( 0x1F83D9AB1F83D9ABULL ) );
_mm_set1_epi64x( 0x1F83D9AB1F83D9ABULL ) );
v[15] = _mm_xor_si128( _mm_set1_epi32( S->f[1] ),
m128_const1_64( 0x5BE0CD195BE0CD19ULL ) );
_mm_set1_epi64x( 0x5BE0CD195BE0CD19ULL ) );
#define G4W( sigma0, sigma1, a, b, c, d ) \
do { \
@@ -269,21 +269,21 @@ int blake2s_8way_compress( blake2s_8way_state *S, const __m256i *block )
memcpy_256( m, block, 16 );
memcpy_256( v, S->h, 8 );
v[ 8] = m256_const1_64( 0x6A09E6676A09E667ULL );
v[ 9] = m256_const1_64( 0xBB67AE85BB67AE85ULL );
v[10] = m256_const1_64( 0x3C6EF3723C6EF372ULL );
v[11] = m256_const1_64( 0xA54FF53AA54FF53AULL );
v[ 8] = _mm256_set1_epi64x( 0x6A09E6676A09E667ULL );
v[ 9] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85ULL );
v[10] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372ULL );
v[11] = _mm256_set1_epi64x( 0xA54FF53AA54FF53AULL );
v[12] = _mm256_xor_si256( _mm256_set1_epi32( S->t[0] ),
m256_const1_64( 0x510E527F510E527FULL ) );
_mm256_set1_epi64x( 0x510E527F510E527FULL ) );
v[13] = _mm256_xor_si256( _mm256_set1_epi32( S->t[1] ),
m256_const1_64( 0x9B05688C9B05688CULL ) );
_mm256_set1_epi64x( 0x9B05688C9B05688CULL ) );
v[14] = _mm256_xor_si256( _mm256_set1_epi32( S->f[0] ),
m256_const1_64( 0x1F83D9AB1F83D9ABULL ) );
_mm256_set1_epi64x( 0x1F83D9AB1F83D9ABULL ) );
v[15] = _mm256_xor_si256( _mm256_set1_epi32( S->f[1] ),
m256_const1_64( 0x5BE0CD195BE0CD19ULL ) );
_mm256_set1_epi64x( 0x5BE0CD195BE0CD19ULL ) );
/*
v[ 8] = _mm256_set1_epi32( blake2s_IV[0] );
@@ -391,14 +391,14 @@ int blake2s_8way_init( blake2s_8way_state *S, const uint8_t outlen )
memset( P->personal, 0, sizeof( P->personal ) );
memset( S, 0, sizeof( blake2s_8way_state ) );
S->h[0] = m256_const1_64( 0x6A09E6676A09E667ULL );
S->h[1] = m256_const1_64( 0xBB67AE85BB67AE85ULL );
S->h[2] = m256_const1_64( 0x3C6EF3723C6EF372ULL );
S->h[3] = m256_const1_64( 0xA54FF53AA54FF53AULL );
S->h[4] = m256_const1_64( 0x510E527F510E527FULL );
S->h[5] = m256_const1_64( 0x9B05688C9B05688CULL );
S->h[6] = m256_const1_64( 0x1F83D9AB1F83D9ABULL );
S->h[7] = m256_const1_64( 0x5BE0CD195BE0CD19ULL );
S->h[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667ULL );
S->h[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85ULL );
S->h[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372ULL );
S->h[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53AULL );
S->h[4] = _mm256_set1_epi64x( 0x510E527F510E527FULL );
S->h[5] = _mm256_set1_epi64x( 0x9B05688C9B05688CULL );
S->h[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9ABULL );
S->h[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19ULL );
// for( int i = 0; i < 8; ++i )
@@ -510,21 +510,21 @@ int blake2s_16way_compress( blake2s_16way_state *S, const __m512i *block )
memcpy_512( m, block, 16 );
memcpy_512( v, S->h, 8 );
v[ 8] = m512_const1_64( 0x6A09E6676A09E667ULL );
v[ 9] = m512_const1_64( 0xBB67AE85BB67AE85ULL );
v[10] = m512_const1_64( 0x3C6EF3723C6EF372ULL );
v[11] = m512_const1_64( 0xA54FF53AA54FF53AULL );
v[ 8] = _mm512_set1_epi64( 0x6A09E6676A09E667ULL );
v[ 9] = _mm512_set1_epi64( 0xBB67AE85BB67AE85ULL );
v[10] = _mm512_set1_epi64( 0x3C6EF3723C6EF372ULL );
v[11] = _mm512_set1_epi64( 0xA54FF53AA54FF53AULL );
v[12] = _mm512_xor_si512( _mm512_set1_epi32( S->t[0] ),
m512_const1_64( 0x510E527F510E527FULL ) );
_mm512_set1_epi64( 0x510E527F510E527FULL ) );
v[13] = _mm512_xor_si512( _mm512_set1_epi32( S->t[1] ),
m512_const1_64( 0x9B05688C9B05688CULL ) );
_mm512_set1_epi64( 0x9B05688C9B05688CULL ) );
v[14] = _mm512_xor_si512( _mm512_set1_epi32( S->f[0] ),
m512_const1_64( 0x1F83D9AB1F83D9ABULL ) );
_mm512_set1_epi64( 0x1F83D9AB1F83D9ABULL ) );
v[15] = _mm512_xor_si512( _mm512_set1_epi32( S->f[1] ),
m512_const1_64( 0x5BE0CD195BE0CD19ULL ) );
_mm512_set1_epi64( 0x5BE0CD195BE0CD19ULL ) );
#define G16W( sigma0, sigma1, a, b, c, d) \
@@ -589,14 +589,14 @@ int blake2s_16way_init( blake2s_16way_state *S, const uint8_t outlen )
memset( P->personal, 0, sizeof( P->personal ) );
memset( S, 0, sizeof( blake2s_16way_state ) );
S->h[0] = m512_const1_64( 0x6A09E6676A09E667ULL );
S->h[1] = m512_const1_64( 0xBB67AE85BB67AE85ULL );
S->h[2] = m512_const1_64( 0x3C6EF3723C6EF372ULL );
S->h[3] = m512_const1_64( 0xA54FF53AA54FF53AULL );
S->h[4] = m512_const1_64( 0x510E527F510E527FULL );
S->h[5] = m512_const1_64( 0x9B05688C9B05688CULL );
S->h[6] = m512_const1_64( 0x1F83D9AB1F83D9ABULL );
S->h[7] = m512_const1_64( 0x5BE0CD195BE0CD19ULL );
S->h[0] = _mm512_set1_epi64( 0x6A09E6676A09E667ULL );
S->h[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85ULL );
S->h[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372ULL );
S->h[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53AULL );
S->h[4] = _mm512_set1_epi64( 0x510E527F510E527FULL );
S->h[5] = _mm512_set1_epi64( 0x9B05688C9B05688CULL );
S->h[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9ABULL );
S->h[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19ULL );
uint32_t *p = ( uint32_t * )( P );
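
The blake2s hunks broadcast 64-bit constants in which a 32-bit IV word is repeated twice, and the commented-out loop in the source uses `_mm_set1_epi32( blake2s_IV[i] )` for the same state. The portable sketch below checks, without intrinsics, that the two broadcasts fill a 16-byte vector with identical bytes. The function names are hypothetical, and the byte arrays only model the vector registers.

```c
#include <stdint.h>
#include <string.h>

/* Repeat a 32-bit word in both halves of a 64-bit value. */
static uint64_t double_u32(uint32_t w)
{
    return ((uint64_t)w << 32) | w;
}

/* Model of a 64-bit broadcast into a 128-bit vector (2 lanes). */
static void fill_u64(uint8_t out[16], uint64_t v)
{
    for (int i = 0; i < 2; i++)
        memcpy(out + 8 * i, &v, sizeof v);
}

/* Model of a 32-bit broadcast into a 128-bit vector (4 lanes). */
static void fill_u32(uint8_t out[16], uint32_t v)
{
    for (int i = 0; i < 4; i++)
        memcpy(out + 4 * i, &v, sizeof v);
}

/* Returns 1 when both broadcasts produce the same 16 bytes. */
static int same_broadcast(uint32_t iv)
{
    uint8_t a[16], b[16];
    fill_u64(a, double_u32(iv));
    fill_u32(b, iv);
    return memcmp(a, b, sizeof a) == 0;
}
```

Because both halves of the doubled constant are the same word, the result does not depend on byte order.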

View File

@@ -350,7 +350,6 @@ static const sph_u64 CB[16] = {
__m512i M8, M9, MA, MB, MC, MD, ME, MF; \
__m512i V0, V1, V2, V3, V4, V5, V6, V7; \
__m512i V8, V9, VA, VB, VC, VD, VE, VF; \
__m512i shuf_bswap64; \
V0 = H0; \
V1 = H1; \
V2 = H2; \
@@ -359,18 +358,16 @@ static const sph_u64 CB[16] = {
V5 = H5; \
V6 = H6; \
V7 = H7; \
V8 = m512_const1_64( CB0 ); \
V9 = m512_const1_64( CB1 ); \
VA = m512_const1_64( CB2 ); \
VB = m512_const1_64( CB3 ); \
V8 = _mm512_set1_epi64( CB0 ); \
V9 = _mm512_set1_epi64( CB1 ); \
VA = _mm512_set1_epi64( CB2 ); \
VB = _mm512_set1_epi64( CB3 ); \
VC = _mm512_set1_epi64( T0 ^ CB4 ); \
VD = _mm512_set1_epi64( T0 ^ CB5 ); \
VE = _mm512_set1_epi64( T1 ^ CB6 ); \
VF = _mm512_set1_epi64( T1 ^ CB7 ); \
shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637, \
0x28292a2b2c2d2e2f, 0x2021222324252627, \
0x18191a1b1c1d1e1f, 0x1011121314151617, \
0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
const __m512i shuf_bswap64 = mm512_bcast_m128( _mm_set_epi64x( \
0x08090a0b0c0d0e0f, 0x0001020304050607 ) ); \
M0 = _mm512_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
M1 = _mm512_shuffle_epi8( *(buf+ 1), shuf_bswap64 ); \
M2 = _mm512_shuffle_epi8( *(buf+ 2), shuf_bswap64 ); \
@@ -419,7 +416,6 @@ void blake512_8way_compress( blake_8way_big_context *sc )
__m512i M8, M9, MA, MB, MC, MD, ME, MF;
__m512i V0, V1, V2, V3, V4, V5, V6, V7;
__m512i V8, V9, VA, VB, VC, VD, VE, VF;
__m512i shuf_bswap64;
V0 = sc->H[0];
V1 = sc->H[1];
@@ -429,19 +425,17 @@ void blake512_8way_compress( blake_8way_big_context *sc )
V5 = sc->H[5];
V6 = sc->H[6];
V7 = sc->H[7];
V8 = m512_const1_64( CB0 );
V9 = m512_const1_64( CB1 );
VA = m512_const1_64( CB2 );
VB = m512_const1_64( CB3 );
V8 = _mm512_set1_epi64( CB0 );
V9 = _mm512_set1_epi64( CB1 );
VA = _mm512_set1_epi64( CB2 );
VB = _mm512_set1_epi64( CB3 );
VC = _mm512_set1_epi64( sc->T0 ^ CB4 );
VD = _mm512_set1_epi64( sc->T0 ^ CB5 );
VE = _mm512_set1_epi64( sc->T1 ^ CB6 );
VF = _mm512_set1_epi64( sc->T1 ^ CB7 );
shuf_bswap64 = m512_const_64( 0x38393a3b3c3d3e3f, 0x3031323334353637,
0x28292a2b2c2d2e2f, 0x2021222324252627,
0x18191a1b1c1d1e1f, 0x1011121314151617,
0x08090a0b0c0d0e0f, 0x0001020304050607 );
const __m512i shuf_bswap64 = mm512_bcast_m128( _mm_set_epi64x(
0x08090a0b0c0d0e0f, 0x0001020304050607 ) );
M0 = _mm512_shuffle_epi8( sc->buf[ 0], shuf_bswap64 );
M1 = _mm512_shuffle_epi8( sc->buf[ 1], shuf_bswap64 );
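
The change above replaces a full 512-bit byte-swap mask with a 128-bit mask broadcast to every lane. This works because the byte shuffle instruction operates within each 128-bit lane and uses only the low 4 bits of each index byte. The scalar model below illustrates that for the index values shown in the hunk. It is a sketch, not the project's code: `shuffle_lane`, `bswap64_idx` and `lanes_agree` are hypothetical names, and `mm512_bcast_m128` is assumed, from its name, to copy one 128-bit vector into all four lanes.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a byte shuffle on one 128-bit lane: each index byte
   selects a source byte by its low 4 bits, or yields zero when bit 7
   of the index byte is set. */
static void shuffle_lane(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t idx[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0f];
    memcpy(dst, tmp, sizeof tmp);
}

/* Memory layout, lowest byte first, of
   _mm_set_epi64x( 0x08090a0b0c0d0e0f, 0x0001020304050607 ). */
static const uint8_t bswap64_idx[16] = {
    7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8
};

/* Returns 1 when the old index bytes for 128-bit lane `lane`
   (new index + 16 * lane) select the same bytes as the broadcast mask. */
static int lanes_agree(int lane)
{
    uint8_t src[16], old_idx[16], a[16], b[16];
    for (int i = 0; i < 16; i++) {
        src[i]     = (uint8_t)(0xA0 + i);
        old_idx[i] = (uint8_t)(bswap64_idx[i] + 16 * lane);
    }
    shuffle_lane(a, src, bswap64_idx);
    shuffle_lane(b, src, old_idx);
    return memcmp(a, b, sizeof a) == 0;
}
```

In this model the mask reverses the bytes of each 64-bit half of the lane, which is the byte swap the surrounding code applies to the message words.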
@@ -503,10 +497,10 @@ void blake512_8way_compress_le( blake_8way_big_context *sc )
V5 = sc->H[5];
V6 = sc->H[6];
V7 = sc->H[7];
V8 = m512_const1_64( CB0 );
V9 = m512_const1_64( CB1 );
VA = m512_const1_64( CB2 );
VB = m512_const1_64( CB3 );
V8 = _mm512_set1_epi64( CB0 );
V9 = _mm512_set1_epi64( CB1 );
VA = _mm512_set1_epi64( CB2 );
VB = _mm512_set1_epi64( CB3 );
VC = _mm512_set1_epi64( sc->T0 ^ CB4 );
VD = _mm512_set1_epi64( sc->T0 ^ CB5 );
VE = _mm512_set1_epi64( sc->T1 ^ CB6 );
@@ -565,23 +559,23 @@ void blake512_8way_prehash_le( blake_8way_big_context *sc, __m512i *midstate,
__m512i V8, V9, VA, VB, VC, VD, VE, VF;
// initial hash
casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
// fill buffer
memcpy_512( sc->buf, (__m512i*)data, 80>>3 );
sc->buf[10] = m512_const1_64( 0x8000000000000000ULL );
sc->buf[10] = _mm512_set1_epi64( 0x8000000000000000ULL );
sc->buf[11] =
sc->buf[12] = m512_zero;
sc->buf[13] = m512_one_64;
sc->buf[14] = m512_zero;
sc->buf[15] = m512_const1_64( 80*8 );
sc->buf[15] = _mm512_set1_epi64( 80*8 );
// build working variables
V0 = sc->H[0];
@@ -592,10 +586,10 @@ void blake512_8way_prehash_le( blake_8way_big_context *sc, __m512i *midstate,
V5 = sc->H[5];
V6 = sc->H[6];
V7 = sc->H[7];
V8 = m512_const1_64( CB0 );
V9 = m512_const1_64( CB1 );
VA = m512_const1_64( CB2 );
VB = m512_const1_64( CB3 );
V8 = _mm512_set1_epi64( CB0 );
V9 = _mm512_set1_epi64( CB1 );
VA = _mm512_set1_epi64( CB2 );
VB = _mm512_set1_epi64( CB3 );
VC = _mm512_set1_epi64( CB4 ^ 0x280ULL );
VD = _mm512_set1_epi64( CB5 ^ 0x280ULL );
VE = _mm512_set1_epi64( CB6 );
@@ -790,14 +784,14 @@ void blake512_8way_final_le( blake_8way_big_context *sc, void *hash,
void blake512_8way_init( blake_8way_big_context *sc )
{
casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
sc->T0 = sc->T1 = 0;
sc->ptr = 0;
@@ -861,7 +855,7 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )
ptr = sc->ptr;
bit_len = ((unsigned)ptr << 3);
buf[ptr>>3] = m512_const1_64( 0x80 );
buf[ptr>>3] = _mm512_set1_epi64( 0x80 );
tl = sc->T0 + bit_len;
th = sc->T1;
if (ptr == 0 )
@@ -882,9 +876,9 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )
{
memset_zero_512( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
buf[104>>3] = _mm512_or_si512( buf[104>>3],
m512_const1_64( 0x0100000000000000ULL ) );
buf[112>>3] = m512_const1_64( bswap_64( th ) );
buf[120>>3] = m512_const1_64( bswap_64( tl ) );
_mm512_set1_epi64( 0x0100000000000000ULL ) );
buf[112>>3] = _mm512_set1_epi64( bswap_64( th ) );
buf[120>>3] = _mm512_set1_epi64( bswap_64( tl ) );
blake64_8way( sc, buf + (ptr>>3), 128 - ptr );
}
@@ -896,9 +890,9 @@ blake64_8way_close( blake_8way_big_context *sc, void *dst )
sc->T0 = 0xFFFFFFFFFFFFFC00ULL;
sc->T1 = 0xFFFFFFFFFFFFFFFFULL;
memset_zero_512( buf, 112>>3 );
buf[104>>3] = m512_const1_64( 0x0100000000000000ULL );
buf[112>>3] = m512_const1_64( bswap_64( th ) );
buf[120>>3] = m512_const1_64( bswap_64( tl ) );
buf[104>>3] = _mm512_set1_epi64( 0x0100000000000000ULL );
buf[112>>3] = _mm512_set1_epi64( bswap_64( th ) );
buf[120>>3] = _mm512_set1_epi64( bswap_64( tl ) );
blake64_8way( sc, buf, 128 );
}
@@ -912,14 +906,14 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
// init
casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
sc->T0 = sc->T1 = 0;
sc->ptr = 0;
@@ -943,7 +937,7 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
uint64_t th, tl;
bit_len = sc->ptr << 3;
sc->buf[ptr64] = m512_const1_64( 0x80 );
sc->buf[ptr64] = _mm512_set1_epi64( 0x80 );
tl = sc->T0 + bit_len;
th = sc->T1;
@@ -961,9 +955,9 @@ void blake512_8way_full( blake_8way_big_context *sc, void * dst,
sc->T0 -= 1024 - bit_len;
memset_zero_512( sc->buf + ptr64 + 1, 13 - ptr64 );
sc->buf[13] = m512_const1_64( 0x0100000000000000ULL );
sc->buf[14] = m512_const1_64( bswap_64( th ) );
sc->buf[15] = m512_const1_64( bswap_64( tl ) );
sc->buf[13] = _mm512_set1_epi64( 0x0100000000000000ULL );
sc->buf[14] = _mm512_set1_epi64( bswap_64( th ) );
sc->buf[15] = _mm512_set1_epi64( bswap_64( tl ) );
if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
sc->T1 = sc->T1 + 1;
@@ -979,14 +973,14 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,
// init
casti_m512i( sc->H, 0 ) = m512_const1_64( 0x6A09E667F3BCC908 );
casti_m512i( sc->H, 1 ) = m512_const1_64( 0xBB67AE8584CAA73B );
casti_m512i( sc->H, 2 ) = m512_const1_64( 0x3C6EF372FE94F82B );
casti_m512i( sc->H, 3 ) = m512_const1_64( 0xA54FF53A5F1D36F1 );
casti_m512i( sc->H, 4 ) = m512_const1_64( 0x510E527FADE682D1 );
casti_m512i( sc->H, 5 ) = m512_const1_64( 0x9B05688C2B3E6C1F );
casti_m512i( sc->H, 6 ) = m512_const1_64( 0x1F83D9ABFB41BD6B );
casti_m512i( sc->H, 7 ) = m512_const1_64( 0x5BE0CD19137E2179 );
casti_m512i( sc->H, 0 ) = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
casti_m512i( sc->H, 1 ) = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
casti_m512i( sc->H, 2 ) = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
casti_m512i( sc->H, 3 ) = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
casti_m512i( sc->H, 4 ) = _mm512_set1_epi64( 0x510E527FADE682D1 );
casti_m512i( sc->H, 5 ) = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
casti_m512i( sc->H, 6 ) = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
casti_m512i( sc->H, 7 ) = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
sc->T0 = sc->T1 = 0;
sc->ptr = 0;
@@ -1010,7 +1004,7 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,
uint64_t th, tl;
bit_len = sc->ptr << 3;
sc->buf[ptr64] = m512_const1_64( 0x8000000000000000ULL );
sc->buf[ptr64] = _mm512_set1_epi64( 0x8000000000000000ULL );
tl = sc->T0 + bit_len;
th = sc->T1;
@@ -1029,8 +1023,8 @@ void blake512_8way_full_le( blake_8way_big_context *sc, void * dst,
memset_zero_512( sc->buf + ptr64 + 1, 13 - ptr64 );
sc->buf[13] = m512_one_64;
sc->buf[14] = m512_const1_64( th );
sc->buf[15] = m512_const1_64( tl );
sc->buf[14] = _mm512_set1_epi64( th );
sc->buf[15] = _mm512_set1_epi64( tl );
if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
sc->T1 = sc->T1 + 1;
@@ -1092,7 +1086,6 @@ blake512_8way_close(void *cc, void *dst)
__m256i M8, M9, MA, MB, MC, MD, ME, MF; \
__m256i V0, V1, V2, V3, V4, V5, V6, V7; \
__m256i V8, V9, VA, VB, VC, VD, VE, VF; \
__m256i shuf_bswap64; \
V0 = H0; \
V1 = H1; \
V2 = H2; \
@@ -1101,16 +1094,16 @@ blake512_8way_close(void *cc, void *dst)
V5 = H5; \
V6 = H6; \
V7 = H7; \
V8 = m256_const1_64( CB0 ); \
V9 = m256_const1_64( CB1 ); \
VA = m256_const1_64( CB2 ); \
VB = m256_const1_64( CB3 ); \
V8 = _mm256_set1_epi64x( CB0 ); \
V9 = _mm256_set1_epi64x( CB1 ); \
VA = _mm256_set1_epi64x( CB2 ); \
VB = _mm256_set1_epi64x( CB3 ); \
VC = _mm256_set1_epi64x( T0 ^ CB4 ); \
VD = _mm256_set1_epi64x( T0 ^ CB5 ); \
VE = _mm256_set1_epi64x( T1 ^ CB6 ); \
VF = _mm256_set1_epi64x( T1 ^ CB7 ); \
shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617, \
0x08090a0b0c0d0e0f, 0x0001020304050607 ); \
const __m256i shuf_bswap64 = mm256_bcast_m128( _mm_set_epi64x( \
0x08090a0b0c0d0e0f, 0x0001020304050607 ) ); \
M0 = _mm256_shuffle_epi8( *(buf+ 0), shuf_bswap64 ); \
M1 = _mm256_shuffle_epi8( *(buf+ 1), shuf_bswap64 ); \
M2 = _mm256_shuffle_epi8( *(buf+ 2), shuf_bswap64 ); \
@@ -1160,7 +1153,6 @@ void blake512_4way_compress( blake_4way_big_context *sc )
__m256i M8, M9, MA, MB, MC, MD, ME, MF;
__m256i V0, V1, V2, V3, V4, V5, V6, V7;
__m256i V8, V9, VA, VB, VC, VD, VE, VF;
__m256i shuf_bswap64;
V0 = sc->H[0];
V1 = sc->H[1];
@@ -1170,20 +1162,20 @@ void blake512_4way_compress( blake_4way_big_context *sc )
V5 = sc->H[5];
V6 = sc->H[6];
V7 = sc->H[7];
V8 = m256_const1_64( CB0 );
V9 = m256_const1_64( CB1 );
VA = m256_const1_64( CB2 );
VB = m256_const1_64( CB3 );
V8 = _mm256_set1_epi64x( CB0 );
V9 = _mm256_set1_epi64x( CB1 );
VA = _mm256_set1_epi64x( CB2 );
VB = _mm256_set1_epi64x( CB3 );
VC = _mm256_xor_si256( _mm256_set1_epi64x( sc->T0 ),
m256_const1_64( CB4 ) );
_mm256_set1_epi64x( CB4 ) );
VD = _mm256_xor_si256( _mm256_set1_epi64x( sc->T0 ),
m256_const1_64( CB5 ) );
_mm256_set1_epi64x( CB5 ) );
VE = _mm256_xor_si256( _mm256_set1_epi64x( sc->T1 ),
m256_const1_64( CB6 ) );
_mm256_set1_epi64x( CB6 ) );
VF = _mm256_xor_si256( _mm256_set1_epi64x( sc->T1 ),
m256_const1_64( CB7 ) );
shuf_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f, 0x1011121314151617,
0x08090a0b0c0d0e0f, 0x0001020304050607 );
_mm256_set1_epi64x( CB7 ) );
const __m256i shuf_bswap64 = mm256_bcast_m128( _mm_set_epi64x(
0x08090a0b0c0d0e0f, 0x0001020304050607 ) );
M0 = _mm256_shuffle_epi8( sc->buf[ 0], shuf_bswap64 );
M1 = _mm256_shuffle_epi8( sc->buf[ 1], shuf_bswap64 );
@@ -1236,23 +1228,23 @@ void blake512_4way_prehash_le( blake_4way_big_context *sc, __m256i *midstate,
__m256i V8, V9, VA, VB, VC, VD, VE, VF;
// initial hash
casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
// fill buffer
memcpy_256( sc->buf, (__m256i*)data, 80>>3 );
sc->buf[10] = m256_const1_64( 0x8000000000000000ULL );
sc->buf[10] = _mm256_set1_epi64x( 0x8000000000000000ULL );
sc->buf[11] = m256_zero;
sc->buf[12] = m256_zero;
sc->buf[13] = m256_one_64;
sc->buf[14] = m256_zero;
sc->buf[15] = m256_const1_64( 80*8 );
sc->buf[15] = _mm256_set1_epi64x( 80*8 );
// build working variables
V0 = sc->H[0];
@@ -1263,10 +1255,10 @@ void blake512_4way_prehash_le( blake_4way_big_context *sc, __m256i *midstate,
V5 = sc->H[5];
V6 = sc->H[6];
V7 = sc->H[7];
V8 = m256_const1_64( CB0 );
V9 = m256_const1_64( CB1 );
VA = m256_const1_64( CB2 );
VB = m256_const1_64( CB3 );
V8 = _mm256_set1_epi64x( CB0 );
V9 = _mm256_set1_epi64x( CB1 );
VA = _mm256_set1_epi64x( CB2 );
VB = _mm256_set1_epi64x( CB3 );
VC = _mm256_set1_epi64x( CB4 ^ 0x280ULL );
VD = _mm256_set1_epi64x( CB5 ^ 0x280ULL );
VE = _mm256_set1_epi64x( CB6 );
@@ -1446,14 +1438,14 @@ void blake512_4way_final_le( blake_4way_big_context *sc, void *hash,
void blake512_4way_init( blake_4way_big_context *sc )
{
casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
sc->T0 = sc->T1 = 0;
sc->ptr = 0;
@@ -1513,7 +1505,7 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )
ptr = sc->ptr;
bit_len = ((unsigned)ptr << 3);
buf[ptr>>3] = m256_const1_64( 0x80 );
buf[ptr>>3] = _mm256_set1_epi64x( 0x80 );
tl = sc->T0 + bit_len;
th = sc->T1;
if (ptr == 0 )
@@ -1535,9 +1527,9 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )
{
memset_zero_256( buf + (ptr>>3) + 1, (104-ptr) >> 3 );
buf[104>>3] = _mm256_or_si256( buf[104>>3],
m256_const1_64( 0x0100000000000000ULL ) );
buf[112>>3] = m256_const1_64( bswap_64( th ) );
buf[120>>3] = m256_const1_64( bswap_64( tl ) );
_mm256_set1_epi64x( 0x0100000000000000ULL ) );
buf[112>>3] = _mm256_set1_epi64x( bswap_64( th ) );
buf[120>>3] = _mm256_set1_epi64x( bswap_64( tl ) );
blake64_4way( sc, buf + (ptr>>3), 128 - ptr );
}
@@ -1549,9 +1541,9 @@ blake64_4way_close( blake_4way_big_context *sc, void *dst )
sc->T0 = SPH_C64(0xFFFFFFFFFFFFFC00ULL);
sc->T1 = SPH_C64(0xFFFFFFFFFFFFFFFFULL);
memset_zero_256( buf, 112>>3 );
buf[104>>3] = m256_const1_64( 0x0100000000000000ULL );
buf[112>>3] = m256_const1_64( bswap_64( th ) );
buf[120>>3] = m256_const1_64( bswap_64( tl ) );
buf[104>>3] = _mm256_set1_epi64x( 0x0100000000000000ULL );
buf[112>>3] = _mm256_set1_epi64x( bswap_64( th ) );
buf[120>>3] = _mm256_set1_epi64x( bswap_64( tl ) );
blake64_4way( sc, buf, 128 );
}
@@ -1565,14 +1557,14 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,
// init
casti_m256i( sc->H, 0 ) = m256_const1_64( 0x6A09E667F3BCC908 );
casti_m256i( sc->H, 1 ) = m256_const1_64( 0xBB67AE8584CAA73B );
casti_m256i( sc->H, 2 ) = m256_const1_64( 0x3C6EF372FE94F82B );
casti_m256i( sc->H, 3 ) = m256_const1_64( 0xA54FF53A5F1D36F1 );
casti_m256i( sc->H, 4 ) = m256_const1_64( 0x510E527FADE682D1 );
casti_m256i( sc->H, 5 ) = m256_const1_64( 0x9B05688C2B3E6C1F );
casti_m256i( sc->H, 6 ) = m256_const1_64( 0x1F83D9ABFB41BD6B );
casti_m256i( sc->H, 7 ) = m256_const1_64( 0x5BE0CD19137E2179 );
casti_m256i( sc->H, 0 ) = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
casti_m256i( sc->H, 1 ) = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
casti_m256i( sc->H, 2 ) = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
casti_m256i( sc->H, 3 ) = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
casti_m256i( sc->H, 4 ) = _mm256_set1_epi64x( 0x510E527FADE682D1 );
casti_m256i( sc->H, 5 ) = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
casti_m256i( sc->H, 6 ) = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
casti_m256i( sc->H, 7 ) = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
sc->T0 = sc->T1 = 0;
sc->ptr = 0;
@@ -1596,7 +1588,7 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,
uint64_t th, tl;
bit_len = sc->ptr << 3;
sc->buf[ptr64] = m256_const1_64( 0x80 );
sc->buf[ptr64] = _mm256_set1_epi64x( 0x80 );
tl = sc->T0 + bit_len;
th = sc->T1;
if ( sc->ptr == 0 )
@@ -1613,9 +1605,9 @@ void blake512_4way_full( blake_4way_big_context *sc, void * dst,
sc->T0 -= 1024 - bit_len;
memset_zero_256( sc->buf + ptr64 + 1, 13 - ptr64 );
sc->buf[13] = m256_const1_64( 0x0100000000000000ULL );
sc->buf[14] = m256_const1_64( bswap_64( th ) );
sc->buf[15] = m256_const1_64( bswap_64( tl ) );
sc->buf[13] = _mm256_set1_epi64x( 0x0100000000000000ULL );
sc->buf[14] = _mm256_set1_epi64x( bswap_64( th ) );
sc->buf[15] = _mm256_set1_epi64x( bswap_64( tl ) );
if ( ( sc->T0 = sc->T0 + 1024 ) < 1024 )
sc->T1 = sc->T1 + 1;


@@ -1,74 +0,0 @@
#include "decred-gate.h"
#include "blake-hash-4way.h"
#include <string.h>
#include <stdint.h>
#include <memory.h>
#include <unistd.h>
#if defined (DECRED_4WAY)
static __thread blake256_4way_context blake_mid;
void decred_hash_4way( void *state, const void *input )
{
uint32_t vhash[8*4] __attribute__ ((aligned (64)));
// uint32_t hash0[8] __attribute__ ((aligned (32)));
// uint32_t hash1[8] __attribute__ ((aligned (32)));
// uint32_t hash2[8] __attribute__ ((aligned (32)));
// uint32_t hash3[8] __attribute__ ((aligned (32)));
const void *tail = input + ( DECRED_MIDSTATE_LEN << 2 );
int tail_len = 180 - DECRED_MIDSTATE_LEN;
blake256_4way_context ctx __attribute__ ((aligned (64)));
memcpy( &ctx, &blake_mid, sizeof(blake_mid) );
blake256_4way_update( &ctx, tail, tail_len );
blake256_4way_close( &ctx, vhash );
dintrlv_4x32( state, state+32, state+64, state+96, vhash, 256 );
}
int scanhash_decred_4way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint32_t vdata[48*4] __attribute__ ((aligned (64)));
uint32_t hash[8*4] __attribute__ ((aligned (32)));
uint32_t _ALIGN(64) edata[48];
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
const uint32_t first_nonce = pdata[DECRED_NONCE_INDEX];
uint32_t n = first_nonce;
const uint32_t HTarget = opt_benchmark ? 0x7f : ptarget[7];
int thr_id = mythr->id; // thr_id arg is deprecated
// copy to buffer guaranteed to be aligned.
memcpy( edata, pdata, 180 );
// use the old way until new way updated for size.
mm128_intrlv_4x32x( vdata, edata, edata, edata, edata, 180*8 );
blake256_4way_init( &blake_mid );
blake256_4way_update( &blake_mid, vdata, DECRED_MIDSTATE_LEN );
uint32_t *noncep = vdata + DECRED_NONCE_INDEX * 4;
do {
* noncep = n;
*(noncep+1) = n+1;
*(noncep+2) = n+2;
*(noncep+3) = n+3;
decred_hash_4way( hash, vdata );
for ( int i = 0; i < 4; i++ )
if ( (hash+(i<<3))[7] <= HTarget )
if ( fulltest( hash+(i<<3), ptarget ) && !opt_benchmark )
{
pdata[DECRED_NONCE_INDEX] = n+i;
submit_solution( work, hash+(i<<3), mythr );
}
n += 4;
} while ( (n < max_nonce) && !work_restart[thr_id].restart );
*hashes_done = n - first_nonce + 1;
return 0;
}
#endif
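
The removed 4-way scan loop above interleaves four copies of the header and then writes n, n+1, n+2 and n+3 at the nonce position, one per lane. A minimal scalar sketch of that layout follows. The names intrlv_4x32_sketch and set_lane_nonces_sketch are hypothetical, introduced only for illustration, and the word-by-word layout is an assumption inferred from the `vdata + DECRED_NONCE_INDEX * 4` pointer arithmetic, not a copy of the project's interleave routine.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper (not the project's mm128_intrlv_4x32x): interleave four
   buffers of `words` 32-bit words so that word i of lane l is stored at
   dst[ 4*i + l ]. */
static void intrlv_4x32_sketch( uint32_t *dst, const uint32_t *s0,
                                const uint32_t *s1, const uint32_t *s2,
                                const uint32_t *s3, size_t words )
{
   for ( size_t i = 0; i < words; i++ )
   {
      dst[ 4*i     ] = s0[i];
      dst[ 4*i + 1 ] = s1[i];
      dst[ 4*i + 2 ] = s2[i];
      dst[ 4*i + 3 ] = s3[i];
   }
}

/* Hypothetical helper: give each of the four lanes its own nonce by writing
   n, n+1, n+2, n+3 at header word `index` of the interleaved buffer. */
static void set_lane_nonces_sketch( uint32_t *vdata, size_t index, uint32_t n )
{
   uint32_t *noncep = vdata + index * 4;
   for ( int lane = 0; lane < 4; lane++ )
      noncep[lane] = n + (uint32_t)lane;
}
```

Under this layout only the four words at the nonce index differ between lanes, which is why the loop can advance n by 4 per iteration and test each lane's hash separately.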


@@ -1,171 +0,0 @@
#include "decred-gate.h"
#include <unistd.h>
#include <memory.h>
#include <string.h>
uint32_t *decred_get_nonceptr( uint32_t *work_data )
{
return &work_data[ DECRED_NONCE_INDEX ];
}
long double decred_calc_network_diff( struct work* work )
{
// sample for diff 43.281 : 1c05ea29
// todo: endian reversed on longpoll could be zr5 specific...
uint32_t nbits = work->data[ DECRED_NBITS_INDEX ];
uint32_t bits = ( nbits & 0xffffff );
int16_t shift = ( swab32(nbits) & 0xff ); // 0x1c = 28
int m;
long double d = (long double)0x0000ffff / (long double)bits;
for ( m = shift; m < 29; m++ )
d *= 256.0;
for ( m = 29; m < shift; m++ )
d /= 256.0;
if ( shift == 28 )
d *= 256.0; // testnet
if ( opt_debug_diff )
applog( LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", (double)d,
shift, bits );
return net_diff;
}
void decred_decode_extradata( struct work* work, uint64_t* net_blocks )
{
// some random extradata to make the work unique
work->data[ DECRED_XNONCE_INDEX ] = (rand()*4);
work->height = work->data[32];
if (!have_longpoll && work->height > *net_blocks + 1)
{
char netinfo[64] = { 0 };
if ( net_diff > 0. )
{
if (net_diff != work->targetdiff)
sprintf(netinfo, ", diff %.3f, target %.1f", net_diff,
work->targetdiff);
else
sprintf(netinfo, ", diff %.3f", net_diff);
}
applog(LOG_BLUE, "%s block %d%s", algo_names[opt_algo], work->height,
netinfo);
*net_blocks = work->height - 1;
}
}
void decred_be_build_stratum_request( char *req, struct work *work,
struct stratum_ctx *sctx )
{
unsigned char *xnonce2str;
uint32_t ntime, nonce;
char ntimestr[9], noncestr[9];
be32enc( &ntime, work->data[ DECRED_NTIME_INDEX ] );
be32enc( &nonce, work->data[ DECRED_NONCE_INDEX ] );
bin2hex( ntimestr, (char*)(&ntime), sizeof(uint32_t) );
bin2hex( noncestr, (char*)(&nonce), sizeof(uint32_t) );
xnonce2str = abin2hex( (char*)( &work->data[ DECRED_XNONCE_INDEX ] ),
sctx->xnonce1_size );
snprintf( req, JSON_BUF_LEN,
"{\"method\": \"mining.submit\", \"params\": [\"%s\", \"%s\", \"%s\", \"%s\", \"%s\"], \"id\":4}",
rpc_user, work->job_id, xnonce2str, ntimestr, noncestr );
free(xnonce2str);
}
#if !defined(min)
#define min(a,b) (a>b ? (b) :(a))
#endif
void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
{
uchar merkle_root[64] = { 0 };
uint32_t extraheader[32] = { 0 };
int headersize = 0;
uint32_t* extradata = (uint32_t*) sctx->xnonce1;
int i;
// getwork over stratum, getwork merkle + header passed in coinb1
memcpy(merkle_root, sctx->job.coinbase, 32);
headersize = min((int)sctx->job.coinbase_size - 32,
sizeof(extraheader) );
memcpy( extraheader, &sctx->job.coinbase[32], headersize );
// Assemble block header
memset( g_work->data, 0, sizeof(g_work->data) );
g_work->data[0] = le32dec( sctx->job.version );
for ( i = 0; i < 8; i++ )
g_work->data[1 + i] = swab32(
le32dec( (uint32_t *) sctx->job.prevhash + i ) );
for ( i = 0; i < 8; i++ )
g_work->data[9 + i] = swab32( be32dec( (uint32_t *) merkle_root + i ) );
// for ( i = 0; i < 8; i++ ) // prevhash
// g_work->data[1 + i] = swab32( g_work->data[1 + i] );
// for ( i = 0; i < 8; i++ ) // merkle
// g_work->data[9 + i] = swab32( g_work->data[9 + i] );
for ( i = 0; i < headersize/4; i++ ) // header
g_work->data[17 + i] = extraheader[i];
// extradata
for ( i = 0; i < sctx->xnonce1_size/4; i++ )
g_work->data[ DECRED_XNONCE_INDEX + i ] = extradata[i];
for ( i = DECRED_XNONCE_INDEX + sctx->xnonce1_size/4; i < 45; i++ )
g_work->data[i] = 0;
g_work->data[37] = (rand()*4) << 8;
// block header suffix from coinb2 (stake version)
memcpy( &g_work->data[44],
&sctx->job.coinbase[ sctx->job.coinbase_size-4 ], 4 );
sctx->block_height = g_work->data[32];
//applog_hex(work->data, 180);
//applog_hex(&work->data[36], 36);
}
#undef min
bool decred_ready_to_mine( struct work* work, struct stratum_ctx* stratum,
int thr_id )
{
if ( have_stratum && strcmp(stratum->job.job_id, work->job_id) )
// need to regen g_work..
return false;
if ( have_stratum && !work->data[0] && !opt_benchmark )
{
sleep(1);
return false;
}
// extradata: prevent duplicates
work->data[ DECRED_XNONCE_INDEX ] += 1;
work->data[ DECRED_XNONCE_INDEX + 1 ] |= thr_id;
return true;
}
int decred_get_work_data_size() { return DECRED_DATA_SIZE; }
bool register_decred_algo( algo_gate_t* gate )
{
#if defined(DECRED_4WAY)
four_way_not_tested();
gate->scanhash = (void*)&scanhash_decred_4way;
gate->hash = (void*)&decred_hash_4way;
#else
gate->scanhash = (void*)&scanhash_decred;
gate->hash = (void*)&decred_hash;
#endif
gate->optimizations = AVX2_OPT;
// gate->get_nonceptr = (void*)&decred_get_nonceptr;
gate->decode_extra_data = (void*)&decred_decode_extradata;
gate->build_stratum_request = (void*)&decred_be_build_stratum_request;
gate->work_decode = (void*)&std_be_work_decode;
gate->submit_getwork_result = (void*)&std_be_submit_getwork_result;
gate->build_extraheader = (void*)&decred_build_extraheader;
gate->ready_to_mine = (void*)&decred_ready_to_mine;
gate->nbits_index = DECRED_NBITS_INDEX;
gate->ntime_index = DECRED_NTIME_INDEX;
gate->nonce_index = DECRED_NONCE_INDEX;
gate->get_work_data_size = (void*)&decred_get_work_data_size;
gate->work_cmp_size = DECRED_WORK_COMPARE_SIZE;
allow_mininginfo = false;
have_gbt = false;
return true;
}
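
The difficulty arithmetic in the removed decred_calc_network_diff can be sketched in isolation. This is a minimal version assuming nbits holds the exponent in its top byte and the mantissa in its low 24 bits; the name compact_to_difficulty_sketch is hypothetical. It deliberately omits the extra testnet multiplication for shift == 28 and returns the computed value directly, whereas the removed function returns the global net_diff.

```c
#include <stdint.h>

/* Hypothetical helper: convert a compact target (nbits) to a difficulty.
   Assumes exponent = top byte, mantissa = low 24 bits. The testnet
   adjustment from the removed code is intentionally left out. */
static double compact_to_difficulty_sketch( uint32_t nbits )
{
   const uint32_t mantissa = nbits & 0x00ffffff;
   const int exponent = (int)( nbits >> 24 );

   if ( mantissa == 0 )
      return 0.0;               /* avoid division by zero */

   double d = (double)0x0000ffff / (double)mantissa;
   for ( int m = exponent; m < 29; m++ )
      d *= 256.0;               /* smaller exponent, harder target */
   for ( int m = 29; m < exponent; m++ )
      d /= 256.0;               /* larger exponent, easier target */
   return d;
}
```

With the sample value from the source comment, 0x1c05ea29, the mantissa is 0x05ea29 = 387625 and the exponent is 28, so the result is 65535 / 387625 * 256, about 43.281, matching the figure quoted in the comment.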


@@ -1,36 +0,0 @@
#ifndef __DECRED_GATE_H__
#define __DECRED_GATE_H__
#include "algo-gate-api.h"
#include <stdint.h>
#define DECRED_NBITS_INDEX 29
#define DECRED_NTIME_INDEX 34
#define DECRED_NONCE_INDEX 35
#define DECRED_XNONCE_INDEX 36
#define DECRED_DATA_SIZE 192
#define DECRED_WORK_COMPARE_SIZE 140
#define DECRED_MIDSTATE_LEN 128
#if defined (__AVX2__)
//void blakehash_84way(void *state, const void *input);
//int scanhash_blake_8way( struct work *work, uint32_t max_nonce,
// uint64_t *hashes_done );
#endif
#if defined(__SSE4_2__)
#define DECRED_4WAY
#endif
#if defined (DECRED_4WAY)
void decred_hash_4way(void *state, const void *input);
int scanhash_decred_4way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr );
#endif
void decred_hash( void *state, const void *input );
int scanhash_decred( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr );
#endif


@@ -1,282 +0,0 @@
#include "decred-gate.h"
#if !defined(DECRED_8WAY) && !defined(DECRED_4WAY)
#include "sph_blake.h"
#include <string.h>
#include <stdint.h>
#include <memory.h>
#include <unistd.h>
/*
#ifndef min
#define min(a,b) (a>b ? b : a)
#endif
#ifndef max
#define max(a,b) (a<b ? b : a)
#endif
*/
/*
#define DECRED_NBITS_INDEX 29
#define DECRED_NTIME_INDEX 34
#define DECRED_NONCE_INDEX 35
#define DECRED_XNONCE_INDEX 36
#define DECRED_DATA_SIZE 192
#define DECRED_WORK_COMPARE_SIZE 140
*/
static __thread sph_blake256_context blake_mid;
static __thread bool ctx_midstate_done = false;
void decred_hash(void *state, const void *input)
{
// #define MIDSTATE_LEN 128
sph_blake256_context ctx __attribute__ ((aligned (64)));
uint8_t *ending = (uint8_t*) input;
ending += DECRED_MIDSTATE_LEN;
if (!ctx_midstate_done) {
sph_blake256_init(&blake_mid);
sph_blake256(&blake_mid, input, DECRED_MIDSTATE_LEN);
ctx_midstate_done = true;
}
memcpy(&ctx, &blake_mid, sizeof(blake_mid));
sph_blake256(&ctx, ending, (180 - DECRED_MIDSTATE_LEN));
sph_blake256_close(&ctx, state);
}
void decred_hash_simple(void *state, const void *input)
{
sph_blake256_context ctx;
sph_blake256_init(&ctx);
sph_blake256(&ctx, input, 180);
sph_blake256_close(&ctx, state);
}
int scanhash_decred( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint32_t _ALIGN(64) endiandata[48];
uint32_t _ALIGN(64) hash32[8];
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
int thr_id = mythr->id; // thr_id arg is deprecated
// #define DCR_NONCE_OFT32 35
const uint32_t first_nonce = pdata[DECRED_NONCE_INDEX];
const uint32_t HTarget = opt_benchmark ? 0x7f : ptarget[7];
uint32_t n = first_nonce;
ctx_midstate_done = false;
#if 1
memcpy(endiandata, pdata, 180);
#else
for (int k=0; k < (180/4); k++)
be32enc(&endiandata[k], pdata[k]);
#endif
do {
//be32enc(&endiandata[DCR_NONCE_OFT32], n);
endiandata[DECRED_NONCE_INDEX] = n;
decred_hash(hash32, endiandata);
if (hash32[7] <= HTarget && fulltest(hash32, ptarget))
{
pdata[DECRED_NONCE_INDEX] = n;
submit_solution( work, hash32, mythr );
}
n++;
} while (n < max_nonce && !work_restart[thr_id].restart);
*hashes_done = n - first_nonce + 1;
pdata[DECRED_NONCE_INDEX] = n;
return 0;
}
/*
uint32_t *decred_get_nonceptr( uint32_t *work_data )
{
return &work_data[ DECRED_NONCE_INDEX ];
}
double decred_calc_network_diff( struct work* work )
{
// sample for diff 43.281 : 1c05ea29
// todo: endian reversed on longpoll could be zr5 specific...
uint32_t nbits = work->data[ DECRED_NBITS_INDEX ];
uint32_t bits = ( nbits & 0xffffff );
int16_t shift = ( swab32(nbits) & 0xff ); // 0x1c = 28
int m;
double d = (double)0x0000ffff / (double)bits;
for ( m = shift; m < 29; m++ )
d *= 256.0;
for ( m = 29; m < shift; m++ )
d /= 256.0;
if ( shift == 28 )
d *= 256.0; // testnet
if ( opt_debug_diff )
applog( LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", d,
shift, bits );
return net_diff;
}
void decred_decode_extradata( struct work* work, uint64_t* net_blocks )
{
// some random extradata to make the work unique
work->data[ DECRED_XNONCE_INDEX ] = (rand()*4);
work->height = work->data[32];
if (!have_longpoll && work->height > *net_blocks + 1)
{
char netinfo[64] = { 0 };
if (net_diff > 0.)
{
if (net_diff != work->targetdiff)
sprintf(netinfo, ", diff %.3f, target %.1f", net_diff,
work->targetdiff);
else
sprintf(netinfo, ", diff %.3f", net_diff);
}
applog(LOG_BLUE, "%s block %d%s", algo_names[opt_algo], work->height,
netinfo);
*net_blocks = work->height - 1;
}
}
void decred_be_build_stratum_request( char *req, struct work *work,
struct stratum_ctx *sctx )
{
unsigned char *xnonce2str;
uint32_t ntime, nonce;
char ntimestr[9], noncestr[9];
be32enc( &ntime, work->data[ DECRED_NTIME_INDEX ] );
be32enc( &nonce, work->data[ DECRED_NONCE_INDEX ] );
bin2hex( ntimestr, (char*)(&ntime), sizeof(uint32_t) );
bin2hex( noncestr, (char*)(&nonce), sizeof(uint32_t) );
xnonce2str = abin2hex( (char*)( &work->data[ DECRED_XNONCE_INDEX ] ),
sctx->xnonce1_size );
snprintf( req, JSON_BUF_LEN,
"{\"method\": \"mining.submit\", \"params\": [\"%s\", \"%s\", \"%s\", \"%s\", \"%s\"], \"id\":4}",
rpc_user, work->job_id, xnonce2str, ntimestr, noncestr );
free(xnonce2str);
}
*/
/*
// data shared between gen_merkle_root and build_extraheader.
__thread uint32_t decred_extraheader[32] = { 0 };
__thread int decred_headersize = 0;
void decred_gen_merkle_root( char* merkle_root, struct stratum_ctx* sctx )
{
// getwork over stratum, getwork merkle + header passed in coinb1
memcpy(merkle_root, sctx->job.coinbase, 32);
decred_headersize = min((int)sctx->job.coinbase_size - 32,
sizeof(decred_extraheader) );
memcpy( decred_extraheader, &sctx->job.coinbase[32], decred_headersize);
}
*/
/*
#define min(a,b) (a>b ? (b) :(a))
void decred_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
{
uchar merkle_root[64] = { 0 };
uint32_t extraheader[32] = { 0 };
int headersize = 0;
uint32_t* extradata = (uint32_t*) sctx->xnonce1;
size_t t;
int i;
// getwork over stratum, getwork merkle + header passed in coinb1
memcpy(merkle_root, sctx->job.coinbase, 32);
headersize = min((int)sctx->job.coinbase_size - 32,
sizeof(extraheader) );
memcpy( extraheader, &sctx->job.coinbase[32], headersize );
// Increment extranonce2
for ( t = 0; t < sctx->xnonce2_size && !( ++sctx->job.xnonce2[t] ); t++ );
// Assemble block header
memset( g_work->data, 0, sizeof(g_work->data) );
g_work->data[0] = le32dec( sctx->job.version );
for ( i = 0; i < 8; i++ )
g_work->data[1 + i] = swab32(
le32dec( (uint32_t *) sctx->job.prevhash + i ) );
for ( i = 0; i < 8; i++ )
g_work->data[9 + i] = swab32( be32dec( (uint32_t *) merkle_root + i ) );
// for ( i = 0; i < 8; i++ ) // prevhash
// g_work->data[1 + i] = swab32( g_work->data[1 + i] );
// for ( i = 0; i < 8; i++ ) // merkle
// g_work->data[9 + i] = swab32( g_work->data[9 + i] );
for ( i = 0; i < headersize/4; i++ ) // header
g_work->data[17 + i] = extraheader[i];
// extradata
for ( i = 0; i < sctx->xnonce1_size/4; i++ )
g_work->data[ DECRED_XNONCE_INDEX + i ] = extradata[i];
for ( i = DECRED_XNONCE_INDEX + sctx->xnonce1_size/4; i < 45; i++ )
g_work->data[i] = 0;
g_work->data[37] = (rand()*4) << 8;
// block header suffix from coinb2 (stake version)
memcpy( &g_work->data[44],
&sctx->job.coinbase[ sctx->job.coinbase_size-4 ], 4 );
sctx->bloc_height = g_work->data[32];
//applog_hex(work->data, 180);
//applog_hex(&work->data[36], 36);
}
#undef min
bool decred_ready_to_mine( struct work* work, struct stratum_ctx* stratum,
int thr_id )
{
if ( have_stratum && strcmp(stratum->job.job_id, work->job_id) )
// need to regen g_work..
return false;
if ( have_stratum && !work->data[0] && !opt_benchmark )
{
sleep(1);
return false;
}
// extradata: prevent duplicates
work->data[ DECRED_XNONCE_INDEX ] += 1;
work->data[ DECRED_XNONCE_INDEX + 1 ] |= thr_id;
return true;
}
bool register_decred_algo( algo_gate_t* gate )
{
gate->optimizations = SSE2_OPT;
gate->scanhash = (void*)&scanhash_decred;
gate->hash = (void*)&decred_hash;
gate->get_nonceptr = (void*)&decred_get_nonceptr;
gate->decode_extra_data = (void*)&decred_decode_extradata;
gate->build_stratum_request = (void*)&decred_be_build_stratum_request;
gate->work_decode = (void*)&std_be_work_decode;
gate->submit_getwork_result = (void*)&std_be_submit_getwork_result;
gate->build_extraheader = (void*)&decred_build_extraheader;
gate->ready_to_mine = (void*)&decred_ready_to_mine;
gate->nbits_index = DECRED_NBITS_INDEX;
gate->ntime_index = DECRED_NTIME_INDEX;
gate->nonce_index = DECRED_NONCE_INDEX;
gate->work_data_size = DECRED_DATA_SIZE;
gate->work_cmp_size = DECRED_WORK_COMPARE_SIZE;
allow_mininginfo = false;
have_gbt = false;
return true;
}
*/
#endif
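
The removed decred_hash relies on a midstate: the first DECRED_MIDSTATE_LEN bytes of the 180-byte header never change while scanning nonces, so they are hashed once and the saved context is copied for every nonce. The sketch below illustrates the idea with a toy incremental hash (64-bit FNV-1a) standing in for sph_blake256. All names here are hypothetical, and the toy hash is an assumption chosen only to keep the example self-contained.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_MIDSTATE_LEN 128   /* mirrors DECRED_MIDSTATE_LEN */
#define SKETCH_HEADER_LEN   180   /* header size used by the removed code */

/* Toy incremental hash (64-bit FNV-1a), a stand-in for sph_blake256. */
typedef struct { uint64_t h; } toy_ctx;

static void toy_init( toy_ctx *c ) { c->h = 0xcbf29ce484222325ULL; }

static void toy_update( toy_ctx *c, const void *data, size_t len )
{
   const uint8_t *p = (const uint8_t*)data;
   for ( size_t i = 0; i < len; i++ )
   {
      c->h ^= p[i];
      c->h *= 0x100000001b3ULL;
   }
}

static uint64_t toy_close( const toy_ctx *c ) { return c->h; }

/* Once per work unit: hash the fixed prefix and keep the context. */
static void midstate_prepare( toy_ctx *mid, const uint8_t *header )
{
   toy_init( mid );
   toy_update( mid, header, SKETCH_MIDSTATE_LEN );
}

/* Once per nonce: copy the saved context and hash only the tail. */
static uint64_t midstate_hash( const toy_ctx *mid, const uint8_t *header )
{
   toy_ctx ctx = *mid;
   toy_update( &ctx, header + SKETCH_MIDSTATE_LEN,
               SKETCH_HEADER_LEN - SKETCH_MIDSTATE_LEN );
   return toy_close( &ctx );
}

/* Reference: hash the whole header from scratch. */
static uint64_t full_hash( const uint8_t *header )
{
   toy_ctx ctx;
   toy_init( &ctx );
   toy_update( &ctx, header, SKETCH_HEADER_LEN );
   return toy_close( &ctx );
}
```

The nonce sits at word 35 (byte offset 140), past the 128-byte prefix, so midstate_hash and full_hash agree for every nonce while the former skips the prefix work on each iteration.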


@@ -1,6 +1,6 @@
#include "pentablake-gate.h"
#if defined (__AVX2__)
#if defined(PENTABLAKE_4WAY)
#include <stdlib.h>
#include <stdint.h>


@@ -4,9 +4,10 @@
#include "algo-gate-api.h"
#include <stdint.h>
#if defined(__AVX2__)
#define PENTABLAKE_4WAY
#endif
// 4way is broken
//#if defined(__AVX2__)
// #define PENTABLAKE_4WAY
//#endif
#if defined(PENTABLAKE_4WAY)
void pentablakehash_4way( void *state, const void *input );


@@ -64,6 +64,22 @@
V[1] = mm256_ror_64( _mm256_xor_si256( V[1], V[2] ), 63 ); \
}
// Pivot about V[1] instead of V[0] reduces latency.
#define BLAKE2B_ROUND( R ) \
{ \
__m256i *V = (__m256i*)v; \
const uint8_t *sigmaR = sigma[R]; \
BLAKE2B_G( 0, 1, 2, 3, 4, 5, 6, 7 ); \
V[0] = mm256_shufll_64( V[0] ); \
V[3] = mm256_swap_128( V[3] ); \
V[2] = mm256_shuflr_64( V[2] ); \
BLAKE2B_G( 14, 15, 8, 9, 10, 11, 12, 13 ); \
V[0] = mm256_shuflr_64( V[0] ); \
V[3] = mm256_swap_128( V[3] ); \
V[2] = mm256_shufll_64( V[2] ); \
}
/*
#define BLAKE2B_ROUND( R ) \
{ \
__m256i *V = (__m256i*)v; \
@@ -77,8 +93,10 @@
V[2] = mm256_swap_128( V[2] ); \
V[1] = mm256_shufll_64( V[1] ); \
}
*/
#elif defined(__SSSE3__)
#elif defined(__SSE2__)
// always true
#define BLAKE2B_G( Va, Vb, Vc, Vd, Sa, Sb, Sc, Sd ) \
{ \
@@ -102,19 +120,20 @@
const uint8_t *sigmaR = sigma[R]; \
BLAKE2B_G( V[0], V[2], V[4], V[6], 0, 1, 2, 3 ); \
BLAKE2B_G( V[1], V[3], V[5], V[7], 4, 5, 6, 7 ); \
V2 = mm128_shufl2r_64( V[2], V[3] ); \
V3 = mm128_shufl2r_64( V[3], V[2] ); \
V6 = mm128_shufl2l_64( V[6], V[7] ); \
V7 = mm128_shufl2l_64( V[7], V[6] ); \
V2 = mm128_alignr_64( V[3], V[2], 1 ); \
V3 = mm128_alignr_64( V[2], V[3], 1 ); \
V6 = mm128_alignr_64( V[6], V[7], 1 ); \
V7 = mm128_alignr_64( V[7], V[6], 1 ); \
BLAKE2B_G( V[0], V2, V[5], V6, 8, 9, 10, 11 ); \
BLAKE2B_G( V[1], V3, V[4], V7, 12, 13, 14, 15 ); \
V[2] = mm128_shufl2l_64( V2, V3 ); \
V[3] = mm128_shufl2l_64( V3, V2 ); \
V[6] = mm128_shufl2r_64( V6, V7 ); \
V[7] = mm128_shufl2r_64( V7, V6 ); \
V[2] = mm128_alignr_64( V2, V3, 1 ); \
V[3] = mm128_alignr_64( V3, V2, 1 ); \
V[6] = mm128_alignr_64( V7, V6, 1 ); \
V[7] = mm128_alignr_64( V6, V7, 1 ); \
}
#else
// never used, SSE2 is always available
#ifndef ROTR64
#define ROTR64(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))


@@ -451,22 +451,22 @@ static const __m128i final_s[16] =
*/
void bmw256_4way_init( bmw256_4way_context *ctx )
{
ctx->H[ 0] = m128_const1_64( 0x4041424340414243 );
ctx->H[ 1] = m128_const1_64( 0x4445464744454647 );
ctx->H[ 2] = m128_const1_64( 0x48494A4B48494A4B );
ctx->H[ 3] = m128_const1_64( 0x4C4D4E4F4C4D4E4F );
ctx->H[ 4] = m128_const1_64( 0x5051525350515253 );
ctx->H[ 5] = m128_const1_64( 0x5455565754555657 );
ctx->H[ 6] = m128_const1_64( 0x58595A5B58595A5B );
ctx->H[ 7] = m128_const1_64( 0x5C5D5E5F5C5D5E5F );
ctx->H[ 8] = m128_const1_64( 0x6061626360616263 );
ctx->H[ 9] = m128_const1_64( 0x6465666764656667 );
ctx->H[10] = m128_const1_64( 0x68696A6B68696A6B );
ctx->H[11] = m128_const1_64( 0x6C6D6E6F6C6D6E6F );
ctx->H[12] = m128_const1_64( 0x7071727370717273 );
ctx->H[13] = m128_const1_64( 0x7475767774757677 );
ctx->H[14] = m128_const1_64( 0x78797A7B78797A7B );
ctx->H[15] = m128_const1_64( 0x7C7D7E7F7C7D7E7F );
ctx->H[ 0] = _mm_set1_epi64x( 0x4041424340414243 );
ctx->H[ 1] = _mm_set1_epi64x( 0x4445464744454647 );
ctx->H[ 2] = _mm_set1_epi64x( 0x48494A4B48494A4B );
ctx->H[ 3] = _mm_set1_epi64x( 0x4C4D4E4F4C4D4E4F );
ctx->H[ 4] = _mm_set1_epi64x( 0x5051525350515253 );
ctx->H[ 5] = _mm_set1_epi64x( 0x5455565754555657 );
ctx->H[ 6] = _mm_set1_epi64x( 0x58595A5B58595A5B );
ctx->H[ 7] = _mm_set1_epi64x( 0x5C5D5E5F5C5D5E5F );
ctx->H[ 8] = _mm_set1_epi64x( 0x6061626360616263 );
ctx->H[ 9] = _mm_set1_epi64x( 0x6465666764656667 );
ctx->H[10] = _mm_set1_epi64x( 0x68696A6B68696A6B );
ctx->H[11] = _mm_set1_epi64x( 0x6C6D6E6F6C6D6E6F );
ctx->H[12] = _mm_set1_epi64x( 0x7071727370717273 );
ctx->H[13] = _mm_set1_epi64x( 0x7475767774757677 );
ctx->H[14] = _mm_set1_epi64x( 0x78797A7B78797A7B );
ctx->H[15] = _mm_set1_epi64x( 0x7C7D7E7F7C7D7E7F );
// for ( int i = 0; i < 16; i++ )
@@ -529,7 +529,7 @@ bmw32_4way_close(bmw_4way_small_context *sc, unsigned ub, unsigned n,
buf = sc->buf;
ptr = sc->ptr;
buf[ ptr>>2 ] = m128_const1_64( 0x0000008000000080 );
buf[ ptr>>2 ] = _mm_set1_epi64x( 0x0000008000000080 );
ptr += 4;
h = sc->H;
@@ -959,22 +959,22 @@ static const __m256i final_s8[16] =
void bmw256_8way_init( bmw256_8way_context *ctx )
{
ctx->H[ 0] = m256_const1_64( 0x4041424340414243 );
ctx->H[ 1] = m256_const1_64( 0x4445464744454647 );
ctx->H[ 2] = m256_const1_64( 0x48494A4B48494A4B );
ctx->H[ 3] = m256_const1_64( 0x4C4D4E4F4C4D4E4F );
ctx->H[ 4] = m256_const1_64( 0x5051525350515253 );
ctx->H[ 5] = m256_const1_64( 0x5455565754555657 );
ctx->H[ 6] = m256_const1_64( 0x58595A5B58595A5B );
ctx->H[ 7] = m256_const1_64( 0x5C5D5E5F5C5D5E5F );
ctx->H[ 8] = m256_const1_64( 0x6061626360616263 );
ctx->H[ 9] = m256_const1_64( 0x6465666764656667 );
ctx->H[10] = m256_const1_64( 0x68696A6B68696A6B );
ctx->H[11] = m256_const1_64( 0x6C6D6E6F6C6D6E6F );
ctx->H[12] = m256_const1_64( 0x7071727370717273 );
ctx->H[13] = m256_const1_64( 0x7475767774757677 );
ctx->H[14] = m256_const1_64( 0x78797A7B78797A7B );
ctx->H[15] = m256_const1_64( 0x7C7D7E7F7C7D7E7F );
ctx->H[ 0] = _mm256_set1_epi64x( 0x4041424340414243 );
ctx->H[ 1] = _mm256_set1_epi64x( 0x4445464744454647 );
ctx->H[ 2] = _mm256_set1_epi64x( 0x48494A4B48494A4B );
ctx->H[ 3] = _mm256_set1_epi64x( 0x4C4D4E4F4C4D4E4F );
ctx->H[ 4] = _mm256_set1_epi64x( 0x5051525350515253 );
ctx->H[ 5] = _mm256_set1_epi64x( 0x5455565754555657 );
ctx->H[ 6] = _mm256_set1_epi64x( 0x58595A5B58595A5B );
ctx->H[ 7] = _mm256_set1_epi64x( 0x5C5D5E5F5C5D5E5F );
ctx->H[ 8] = _mm256_set1_epi64x( 0x6061626360616263 );
ctx->H[ 9] = _mm256_set1_epi64x( 0x6465666764656667 );
ctx->H[10] = _mm256_set1_epi64x( 0x68696A6B68696A6B );
ctx->H[11] = _mm256_set1_epi64x( 0x6C6D6E6F6C6D6E6F );
ctx->H[12] = _mm256_set1_epi64x( 0x7071727370717273 );
ctx->H[13] = _mm256_set1_epi64x( 0x7475767774757677 );
ctx->H[14] = _mm256_set1_epi64x( 0x78797A7B78797A7B );
ctx->H[15] = _mm256_set1_epi64x( 0x7C7D7E7F7C7D7E7F );
ctx->ptr = 0;
ctx->bit_count = 0;
}
@@ -1030,7 +1030,7 @@ void bmw256_8way_close( bmw256_8way_context *ctx, void *dst )
buf = ctx->buf;
ptr = ctx->ptr;
buf[ ptr>>2 ] = m256_const1_64( 0x0000008000000080 );
buf[ ptr>>2 ] = _mm256_set1_epi64x( 0x0000008000000080 );
ptr += 4;
h = ctx->H;
@@ -1460,22 +1460,22 @@ static const __m512i final_s16[16] =
void bmw256_16way_init( bmw256_16way_context *ctx )
{
ctx->H[ 0] = m512_const1_64( 0x4041424340414243 );
ctx->H[ 1] = m512_const1_64( 0x4445464744454647 );
ctx->H[ 2] = m512_const1_64( 0x48494A4B48494A4B );
ctx->H[ 3] = m512_const1_64( 0x4C4D4E4F4C4D4E4F );
ctx->H[ 4] = m512_const1_64( 0x5051525350515253 );
ctx->H[ 5] = m512_const1_64( 0x5455565754555657 );
ctx->H[ 6] = m512_const1_64( 0x58595A5B58595A5B );
ctx->H[ 7] = m512_const1_64( 0x5C5D5E5F5C5D5E5F );
ctx->H[ 8] = m512_const1_64( 0x6061626360616263 );
ctx->H[ 9] = m512_const1_64( 0x6465666764656667 );
ctx->H[10] = m512_const1_64( 0x68696A6B68696A6B );
ctx->H[11] = m512_const1_64( 0x6C6D6E6F6C6D6E6F );
ctx->H[12] = m512_const1_64( 0x7071727370717273 );
ctx->H[13] = m512_const1_64( 0x7475767774757677 );
ctx->H[14] = m512_const1_64( 0x78797A7B78797A7B );
ctx->H[15] = m512_const1_64( 0x7C7D7E7F7C7D7E7F );
ctx->H[ 0] = _mm512_set1_epi64( 0x4041424340414243 );
ctx->H[ 1] = _mm512_set1_epi64( 0x4445464744454647 );
ctx->H[ 2] = _mm512_set1_epi64( 0x48494A4B48494A4B );
ctx->H[ 3] = _mm512_set1_epi64( 0x4C4D4E4F4C4D4E4F );
ctx->H[ 4] = _mm512_set1_epi64( 0x5051525350515253 );
ctx->H[ 5] = _mm512_set1_epi64( 0x5455565754555657 );
ctx->H[ 6] = _mm512_set1_epi64( 0x58595A5B58595A5B );
ctx->H[ 7] = _mm512_set1_epi64( 0x5C5D5E5F5C5D5E5F );
ctx->H[ 8] = _mm512_set1_epi64( 0x6061626360616263 );
ctx->H[ 9] = _mm512_set1_epi64( 0x6465666764656667 );
ctx->H[10] = _mm512_set1_epi64( 0x68696A6B68696A6B );
ctx->H[11] = _mm512_set1_epi64( 0x6C6D6E6F6C6D6E6F );
ctx->H[12] = _mm512_set1_epi64( 0x7071727370717273 );
ctx->H[13] = _mm512_set1_epi64( 0x7475767774757677 );
ctx->H[14] = _mm512_set1_epi64( 0x78797A7B78797A7B );
ctx->H[15] = _mm512_set1_epi64( 0x7C7D7E7F7C7D7E7F );
ctx->ptr = 0;
ctx->bit_count = 0;
}
@@ -1531,7 +1531,7 @@ void bmw256_16way_close( bmw256_16way_context *ctx, void *dst )
buf = ctx->buf;
ptr = ctx->ptr;
buf[ ptr>>2 ] = m512_const1_64( 0x0000008000000080 );
buf[ ptr>>2 ] = _mm512_set1_epi64( 0x0000008000000080 );
ptr += 4;
h = ctx->H;


@@ -747,38 +747,40 @@ void compress_big( const __m256i *M, const __m256i H[16], __m256i dH[16] )
mj[14] = mm256_rol_64( M[14], 15 );
mj[15] = mm256_rol_64( M[15], 16 );
qt[16] = add_elt_b( mj[ 0], mj[ 3], mj[10], H[ 7],
(const __m256i)_mm256_set1_epi64x( 16 * 0x0555555555555555ULL ) );
qt[17] = add_elt_b( mj[ 1], mj[ 4], mj[11], H[ 8],
(const __m256i)_mm256_set1_epi64x( 17 * 0x0555555555555555ULL ) );
qt[18] = add_elt_b( mj[ 2], mj[ 5], mj[12], H[ 9],
(const __m256i)_mm256_set1_epi64x( 18 * 0x0555555555555555ULL ) );
qt[19] = add_elt_b( mj[ 3], mj[ 6], mj[13], H[10],
(const __m256i)_mm256_set1_epi64x( 19 * 0x0555555555555555ULL ) );
qt[20] = add_elt_b( mj[ 4], mj[ 7], mj[14], H[11],
(const __m256i)_mm256_set1_epi64x( 20 * 0x0555555555555555ULL ) );
qt[21] = add_elt_b( mj[ 5], mj[ 8], mj[15], H[12],
(const __m256i)_mm256_set1_epi64x( 21 * 0x0555555555555555ULL ) );
qt[22] = add_elt_b( mj[ 6], mj[ 9], mj[ 0], H[13],
(const __m256i)_mm256_set1_epi64x( 22 * 0x0555555555555555ULL ) );
qt[23] = add_elt_b( mj[ 7], mj[10], mj[ 1], H[14],
(const __m256i)_mm256_set1_epi64x( 23 * 0x0555555555555555ULL ) );
qt[24] = add_elt_b( mj[ 8], mj[11], mj[ 2], H[15],
(const __m256i)_mm256_set1_epi64x( 24 * 0x0555555555555555ULL ) );
qt[25] = add_elt_b( mj[ 9], mj[12], mj[ 3], H[ 0],
(const __m256i)_mm256_set1_epi64x( 25 * 0x0555555555555555ULL ) );
qt[26] = add_elt_b( mj[10], mj[13], mj[ 4], H[ 1],
(const __m256i)_mm256_set1_epi64x( 26 * 0x0555555555555555ULL ) );
qt[27] = add_elt_b( mj[11], mj[14], mj[ 5], H[ 2],
(const __m256i)_mm256_set1_epi64x( 27 * 0x0555555555555555ULL ) );
qt[28] = add_elt_b( mj[12], mj[15], mj[ 6], H[ 3],
(const __m256i)_mm256_set1_epi64x( 28 * 0x0555555555555555ULL ) );
qt[29] = add_elt_b( mj[13], mj[ 0], mj[ 7], H[ 4],
(const __m256i)_mm256_set1_epi64x( 29 * 0x0555555555555555ULL ) );
qt[30] = add_elt_b( mj[14], mj[ 1], mj[ 8], H[ 5],
(const __m256i)_mm256_set1_epi64x( 30 * 0x0555555555555555ULL ) );
qt[31] = add_elt_b( mj[15], mj[ 2], mj[ 9], H[ 6],
(const __m256i)_mm256_set1_epi64x( 31 * 0x0555555555555555ULL ) );
__m256i K = _mm256_set1_epi64x( 16 * 0x0555555555555555ULL );
const __m256i Kincr = _mm256_set1_epi64x( 0x0555555555555555ULL );
qt[16] = add_elt_b( mj[ 0], mj[ 3], mj[10], H[ 7], K );
K = _mm256_add_epi64( K, Kincr );
qt[17] = add_elt_b( mj[ 1], mj[ 4], mj[11], H[ 8], K );
K = _mm256_add_epi64( K, Kincr );
qt[18] = add_elt_b( mj[ 2], mj[ 5], mj[12], H[ 9], K );
K = _mm256_add_epi64( K, Kincr );
qt[19] = add_elt_b( mj[ 3], mj[ 6], mj[13], H[10], K );
K = _mm256_add_epi64( K, Kincr );
qt[20] = add_elt_b( mj[ 4], mj[ 7], mj[14], H[11], K );
K = _mm256_add_epi64( K, Kincr );
qt[21] = add_elt_b( mj[ 5], mj[ 8], mj[15], H[12], K );
K = _mm256_add_epi64( K, Kincr );
qt[22] = add_elt_b( mj[ 6], mj[ 9], mj[ 0], H[13], K );
K = _mm256_add_epi64( K, Kincr );
qt[23] = add_elt_b( mj[ 7], mj[10], mj[ 1], H[14], K );
K = _mm256_add_epi64( K, Kincr );
qt[24] = add_elt_b( mj[ 8], mj[11], mj[ 2], H[15], K );
K = _mm256_add_epi64( K, Kincr );
qt[25] = add_elt_b( mj[ 9], mj[12], mj[ 3], H[ 0], K );
K = _mm256_add_epi64( K, Kincr );
qt[26] = add_elt_b( mj[10], mj[13], mj[ 4], H[ 1], K );
K = _mm256_add_epi64( K, Kincr );
qt[27] = add_elt_b( mj[11], mj[14], mj[ 5], H[ 2], K );
K = _mm256_add_epi64( K, Kincr );
qt[28] = add_elt_b( mj[12], mj[15], mj[ 6], H[ 3], K );
K = _mm256_add_epi64( K, Kincr );
qt[29] = add_elt_b( mj[13], mj[ 0], mj[ 7], H[ 4], K );
K = _mm256_add_epi64( K, Kincr );
qt[30] = add_elt_b( mj[14], mj[ 1], mj[ 8], H[ 5], K );
K = _mm256_add_epi64( K, Kincr );
qt[31] = add_elt_b( mj[15], mj[ 2], mj[ 9], H[ 6], K );
qt[16] = _mm256_add_epi64( qt[16], expand1_b( qt, 16 ) );
qt[17] = _mm256_add_epi64( qt[17], expand1_b( qt, 17 ) );
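
The hunk above replaces sixteen separate broadcast constants, i * 0x0555555555555555 for i = 16..31, with one starting value K and a repeated add of Kincr. The two forms agree because 64-bit lane addition wraps modulo 2^64, the same as unsigned multiplication. A scalar sketch of that equivalence follows. The names BMW_K_STEP and k_running_sum_matches are hypothetical, and a plain uint64_t stands in for one lane of the vector.

```c
#include <stdint.h>

#define BMW_K_STEP 0x0555555555555555ULL

/* Hypothetical check: return 1 if a running sum starting at 16 * step
   reproduces i * step (modulo 2^64) for every i in 16..31, else 0. */
static int k_running_sum_matches( void )
{
   uint64_t K = 16 * BMW_K_STEP;        /* wraps modulo 2^64, well defined */
   for ( uint64_t i = 16; i < 32; i++ )
   {
      if ( K != i * BMW_K_STEP )
         return 0;
      K += BMW_K_STEP;                  /* scalar analogue of the lane add */
   }
   return 1;
}
```

The running form needs only two vector constants instead of sixteen, at the cost of a dependency chain of adds between consecutive add_elt_b calls.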
@@ -894,22 +896,22 @@ static const __m256i final_b[16] =
static void
bmw64_4way_init( bmw_4way_big_context *sc, const sph_u64 *iv )
{
sc->H[ 0] = m256_const1_64( 0x8081828384858687 );
sc->H[ 1] = m256_const1_64( 0x88898A8B8C8D8E8F );
sc->H[ 2] = m256_const1_64( 0x9091929394959697 );
sc->H[ 3] = m256_const1_64( 0x98999A9B9C9D9E9F );
sc->H[ 4] = m256_const1_64( 0xA0A1A2A3A4A5A6A7 );
sc->H[ 5] = m256_const1_64( 0xA8A9AAABACADAEAF );
sc->H[ 6] = m256_const1_64( 0xB0B1B2B3B4B5B6B7 );
sc->H[ 7] = m256_const1_64( 0xB8B9BABBBCBDBEBF );
sc->H[ 8] = m256_const1_64( 0xC0C1C2C3C4C5C6C7 );
sc->H[ 9] = m256_const1_64( 0xC8C9CACBCCCDCECF );
sc->H[10] = m256_const1_64( 0xD0D1D2D3D4D5D6D7 );
sc->H[11] = m256_const1_64( 0xD8D9DADBDCDDDEDF );
sc->H[12] = m256_const1_64( 0xE0E1E2E3E4E5E6E7 );
sc->H[13] = m256_const1_64( 0xE8E9EAEBECEDEEEF );
sc->H[14] = m256_const1_64( 0xF0F1F2F3F4F5F6F7 );
sc->H[15] = m256_const1_64( 0xF8F9FAFBFCFDFEFF );
sc->H[ 0] = _mm256_set1_epi64x( 0x8081828384858687 );
sc->H[ 1] = _mm256_set1_epi64x( 0x88898A8B8C8D8E8F );
sc->H[ 2] = _mm256_set1_epi64x( 0x9091929394959697 );
sc->H[ 3] = _mm256_set1_epi64x( 0x98999A9B9C9D9E9F );
sc->H[ 4] = _mm256_set1_epi64x( 0xA0A1A2A3A4A5A6A7 );
sc->H[ 5] = _mm256_set1_epi64x( 0xA8A9AAABACADAEAF );
sc->H[ 6] = _mm256_set1_epi64x( 0xB0B1B2B3B4B5B6B7 );
sc->H[ 7] = _mm256_set1_epi64x( 0xB8B9BABBBCBDBEBF );
sc->H[ 8] = _mm256_set1_epi64x( 0xC0C1C2C3C4C5C6C7 );
sc->H[ 9] = _mm256_set1_epi64x( 0xC8C9CACBCCCDCECF );
sc->H[10] = _mm256_set1_epi64x( 0xD0D1D2D3D4D5D6D7 );
sc->H[11] = _mm256_set1_epi64x( 0xD8D9DADBDCDDDEDF );
sc->H[12] = _mm256_set1_epi64x( 0xE0E1E2E3E4E5E6E7 );
sc->H[13] = _mm256_set1_epi64x( 0xE8E9EAEBECEDEEEF );
sc->H[14] = _mm256_set1_epi64x( 0xF0F1F2F3F4F5F6F7 );
sc->H[15] = _mm256_set1_epi64x( 0xF8F9FAFBFCFDFEFF );
sc->ptr = 0;
sc->bit_count = 0;
}
@@ -965,7 +967,7 @@ bmw64_4way_close(bmw_4way_big_context *sc, unsigned ub, unsigned n,
buf = sc->buf;
ptr = sc->ptr;
buf[ ptr>>3 ] = m256_const1_64( 0x80 );
buf[ ptr>>3 ] = _mm256_set1_epi64x( 0x80 );
ptr += 8;
h = sc->H;
@@ -1180,7 +1182,6 @@ void compress_big_8way( const __m512i *M, const __m512i H[16],
qt[15] = _mm512_add_epi64( s8b0( W8b15), H[ 0] );
__m512i mj[16];
uint64_t K = 16 * 0x0555555555555555ULL;
mj[ 0] = mm512_rol_64( M[ 0], 1 );
mj[ 1] = mm512_rol_64( M[ 1], 2 );
@@ -1199,54 +1200,40 @@ void compress_big_8way( const __m512i *M, const __m512i H[16],
mj[14] = mm512_rol_64( M[14], 15 );
mj[15] = mm512_rol_64( M[15], 16 );
qt[16] = add_elt_b8( mj[ 0], mj[ 3], mj[10], H[ 7],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[17] = add_elt_b8( mj[ 1], mj[ 4], mj[11], H[ 8],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[18] = add_elt_b8( mj[ 2], mj[ 5], mj[12], H[ 9],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[19] = add_elt_b8( mj[ 3], mj[ 6], mj[13], H[10],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[20] = add_elt_b8( mj[ 4], mj[ 7], mj[14], H[11],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[21] = add_elt_b8( mj[ 5], mj[ 8], mj[15], H[12],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[22] = add_elt_b8( mj[ 6], mj[ 9], mj[ 0], H[13],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[23] = add_elt_b8( mj[ 7], mj[10], mj[ 1], H[14],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[24] = add_elt_b8( mj[ 8], mj[11], mj[ 2], H[15],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[25] = add_elt_b8( mj[ 9], mj[12], mj[ 3], H[ 0],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[26] = add_elt_b8( mj[10], mj[13], mj[ 4], H[ 1],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[27] = add_elt_b8( mj[11], mj[14], mj[ 5], H[ 2],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[28] = add_elt_b8( mj[12], mj[15], mj[ 6], H[ 3],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[29] = add_elt_b8( mj[13], mj[ 0], mj[ 7], H[ 4],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[30] = add_elt_b8( mj[14], mj[ 1], mj[ 8], H[ 5],
(const __m512i)_mm512_set1_epi64( K ) );
K += 0x0555555555555555ULL;
qt[31] = add_elt_b8( mj[15], mj[ 2], mj[ 9], H[ 6],
(const __m512i)_mm512_set1_epi64( K ) );
__m512i K = _mm512_set1_epi64( 16 * 0x0555555555555555ULL );
const __m512i Kincr = _mm512_set1_epi64( 0x0555555555555555ULL );
qt[16] = add_elt_b8( mj[ 0], mj[ 3], mj[10], H[ 7], K );
K = _mm512_add_epi64( K, Kincr );
qt[17] = add_elt_b8( mj[ 1], mj[ 4], mj[11], H[ 8], K );
K = _mm512_add_epi64( K, Kincr );
qt[18] = add_elt_b8( mj[ 2], mj[ 5], mj[12], H[ 9], K );
K = _mm512_add_epi64( K, Kincr );
qt[19] = add_elt_b8( mj[ 3], mj[ 6], mj[13], H[10], K );
K = _mm512_add_epi64( K, Kincr );
qt[20] = add_elt_b8( mj[ 4], mj[ 7], mj[14], H[11], K );
K = _mm512_add_epi64( K, Kincr );
qt[21] = add_elt_b8( mj[ 5], mj[ 8], mj[15], H[12], K );
K = _mm512_add_epi64( K, Kincr );
qt[22] = add_elt_b8( mj[ 6], mj[ 9], mj[ 0], H[13], K );
K = _mm512_add_epi64( K, Kincr );
qt[23] = add_elt_b8( mj[ 7], mj[10], mj[ 1], H[14], K );
K = _mm512_add_epi64( K, Kincr );
qt[24] = add_elt_b8( mj[ 8], mj[11], mj[ 2], H[15], K );
K = _mm512_add_epi64( K, Kincr );
qt[25] = add_elt_b8( mj[ 9], mj[12], mj[ 3], H[ 0], K );
K = _mm512_add_epi64( K, Kincr );
qt[26] = add_elt_b8( mj[10], mj[13], mj[ 4], H[ 1], K );
K = _mm512_add_epi64( K, Kincr );
qt[27] = add_elt_b8( mj[11], mj[14], mj[ 5], H[ 2], K );
K = _mm512_add_epi64( K, Kincr );
qt[28] = add_elt_b8( mj[12], mj[15], mj[ 6], H[ 3], K );
K = _mm512_add_epi64( K, Kincr );
qt[29] = add_elt_b8( mj[13], mj[ 0], mj[ 7], H[ 4], K );
K = _mm512_add_epi64( K, Kincr );
qt[30] = add_elt_b8( mj[14], mj[ 1], mj[ 8], H[ 5], K );
K = _mm512_add_epi64( K, Kincr );
qt[31] = add_elt_b8( mj[15], mj[ 2], mj[ 9], H[ 6], K );
qt[16] = _mm512_add_epi64( qt[16], expand1_b8( qt, 16 ) );
qt[17] = _mm512_add_epi64( qt[17], expand1_b8( qt, 17 ) );
@@ -1392,22 +1379,22 @@ static const __m512i final_b8[16] =
void bmw512_8way_init( bmw512_8way_context *ctx )
//bmw64_4way_init( bmw_4way_big_context *sc, const sph_u64 *iv )
{
ctx->H[ 0] = m512_const1_64( 0x8081828384858687 );
ctx->H[ 1] = m512_const1_64( 0x88898A8B8C8D8E8F );
ctx->H[ 2] = m512_const1_64( 0x9091929394959697 );
ctx->H[ 3] = m512_const1_64( 0x98999A9B9C9D9E9F );
ctx->H[ 4] = m512_const1_64( 0xA0A1A2A3A4A5A6A7 );
ctx->H[ 5] = m512_const1_64( 0xA8A9AAABACADAEAF );
ctx->H[ 6] = m512_const1_64( 0xB0B1B2B3B4B5B6B7 );
ctx->H[ 7] = m512_const1_64( 0xB8B9BABBBCBDBEBF );
ctx->H[ 8] = m512_const1_64( 0xC0C1C2C3C4C5C6C7 );
ctx->H[ 9] = m512_const1_64( 0xC8C9CACBCCCDCECF );
ctx->H[10] = m512_const1_64( 0xD0D1D2D3D4D5D6D7 );
ctx->H[11] = m512_const1_64( 0xD8D9DADBDCDDDEDF );
ctx->H[12] = m512_const1_64( 0xE0E1E2E3E4E5E6E7 );
ctx->H[13] = m512_const1_64( 0xE8E9EAEBECEDEEEF );
ctx->H[14] = m512_const1_64( 0xF0F1F2F3F4F5F6F7 );
ctx->H[15] = m512_const1_64( 0xF8F9FAFBFCFDFEFF );
ctx->H[ 0] = _mm512_set1_epi64( 0x8081828384858687 );
ctx->H[ 1] = _mm512_set1_epi64( 0x88898A8B8C8D8E8F );
ctx->H[ 2] = _mm512_set1_epi64( 0x9091929394959697 );
ctx->H[ 3] = _mm512_set1_epi64( 0x98999A9B9C9D9E9F );
ctx->H[ 4] = _mm512_set1_epi64( 0xA0A1A2A3A4A5A6A7 );
ctx->H[ 5] = _mm512_set1_epi64( 0xA8A9AAABACADAEAF );
ctx->H[ 6] = _mm512_set1_epi64( 0xB0B1B2B3B4B5B6B7 );
ctx->H[ 7] = _mm512_set1_epi64( 0xB8B9BABBBCBDBEBF );
ctx->H[ 8] = _mm512_set1_epi64( 0xC0C1C2C3C4C5C6C7 );
ctx->H[ 9] = _mm512_set1_epi64( 0xC8C9CACBCCCDCECF );
ctx->H[10] = _mm512_set1_epi64( 0xD0D1D2D3D4D5D6D7 );
ctx->H[11] = _mm512_set1_epi64( 0xD8D9DADBDCDDDEDF );
ctx->H[12] = _mm512_set1_epi64( 0xE0E1E2E3E4E5E6E7 );
ctx->H[13] = _mm512_set1_epi64( 0xE8E9EAEBECEDEEEF );
ctx->H[14] = _mm512_set1_epi64( 0xF0F1F2F3F4F5F6F7 );
ctx->H[15] = _mm512_set1_epi64( 0xF8F9FAFBFCFDFEFF );
ctx->ptr = 0;
ctx->bit_count = 0;
}
@@ -1461,7 +1448,7 @@ void bmw512_8way_close( bmw512_8way_context *ctx, void *dst )
buf = ctx->buf;
ptr = ctx->ptr;
buf[ ptr>>3 ] = m512_const1_64( 0x80 );
buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
ptr += 8;
h = ctx->H;
@@ -1496,22 +1483,22 @@ void bmw512_8way_full( bmw512_8way_context *ctx, void *out, const void *data,
// Init
H[ 0] = m512_const1_64( 0x8081828384858687 );
H[ 1] = m512_const1_64( 0x88898A8B8C8D8E8F );
H[ 2] = m512_const1_64( 0x9091929394959697 );
H[ 3] = m512_const1_64( 0x98999A9B9C9D9E9F );
H[ 4] = m512_const1_64( 0xA0A1A2A3A4A5A6A7 );
H[ 5] = m512_const1_64( 0xA8A9AAABACADAEAF );
H[ 6] = m512_const1_64( 0xB0B1B2B3B4B5B6B7 );
H[ 7] = m512_const1_64( 0xB8B9BABBBCBDBEBF );
H[ 8] = m512_const1_64( 0xC0C1C2C3C4C5C6C7 );
H[ 9] = m512_const1_64( 0xC8C9CACBCCCDCECF );
H[10] = m512_const1_64( 0xD0D1D2D3D4D5D6D7 );
H[11] = m512_const1_64( 0xD8D9DADBDCDDDEDF );
H[12] = m512_const1_64( 0xE0E1E2E3E4E5E6E7 );
H[13] = m512_const1_64( 0xE8E9EAEBECEDEEEF );
H[14] = m512_const1_64( 0xF0F1F2F3F4F5F6F7 );
H[15] = m512_const1_64( 0xF8F9FAFBFCFDFEFF );
H[ 0] = _mm512_set1_epi64( 0x8081828384858687 );
H[ 1] = _mm512_set1_epi64( 0x88898A8B8C8D8E8F );
H[ 2] = _mm512_set1_epi64( 0x9091929394959697 );
H[ 3] = _mm512_set1_epi64( 0x98999A9B9C9D9E9F );
H[ 4] = _mm512_set1_epi64( 0xA0A1A2A3A4A5A6A7 );
H[ 5] = _mm512_set1_epi64( 0xA8A9AAABACADAEAF );
H[ 6] = _mm512_set1_epi64( 0xB0B1B2B3B4B5B6B7 );
H[ 7] = _mm512_set1_epi64( 0xB8B9BABBBCBDBEBF );
H[ 8] = _mm512_set1_epi64( 0xC0C1C2C3C4C5C6C7 );
H[ 9] = _mm512_set1_epi64( 0xC8C9CACBCCCDCECF );
H[10] = _mm512_set1_epi64( 0xD0D1D2D3D4D5D6D7 );
H[11] = _mm512_set1_epi64( 0xD8D9DADBDCDDDEDF );
H[12] = _mm512_set1_epi64( 0xE0E1E2E3E4E5E6E7 );
H[13] = _mm512_set1_epi64( 0xE8E9EAEBECEDEEEF );
H[14] = _mm512_set1_epi64( 0xF0F1F2F3F4F5F6F7 );
H[15] = _mm512_set1_epi64( 0xF8F9FAFBFCFDFEFF );
// Update
@@ -1543,7 +1530,7 @@ void bmw512_8way_full( bmw512_8way_context *ctx, void *out, const void *data,
__m512i h1[16], h2[16];
size_t u, v;
buf[ ptr>>3 ] = m512_const1_64( 0x80 );
buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
ptr += 8;
if ( ptr > (buf_size - 8) )
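The compress_big_8way hunk above replaces a scalar K that was re-broadcast for every round with a vector K advanced by _mm512_add_epi64( K, Kincr ). The following is a minimal scalar sketch of why the two schedules should yield the same per-lane constants. It assumes every lane holds the same value, as produced by _mm512_set1_epi64; the names bmw_k_closed and bmw_k_schedules_match are hypothetical and do not appear in cpuminer-opt.

```c
#include <stdint.h>

/* Per-round step taken from the diff. */
#define BMW_K_STEP 0x0555555555555555ULL

/* Closed form of the round constant for qt[j], 16 <= j <= 31. */
static uint64_t bmw_k_closed( int j )
{
    return (uint64_t)j * BMW_K_STEP;
}

/* Returns 1 if the old scalar schedule and a lane-wise model of the new
   vector schedule both match the closed form for rounds 16..31. */
static int bmw_k_schedules_match( void )
{
    uint64_t k_scalar = 16 * BMW_K_STEP;   /* old: uint64_t K          */
    uint64_t K[8], Kincr[8];               /* new: modeled __m512i     */

    for ( int lane = 0; lane < 8; lane++ )
    {
        K[lane]     = 16 * BMW_K_STEP;     /* models _mm512_set1_epi64 */
        Kincr[lane] = BMW_K_STEP;
    }

    for ( int j = 16; j < 32; j++ )
    {
        if ( k_scalar != bmw_k_closed( j ) ) return 0;
        for ( int lane = 0; lane < 8; lane++ )
            if ( K[lane] != k_scalar ) return 0;

        k_scalar += BMW_K_STEP;            /* old: K += 0x0555...ULL   */
        for ( int lane = 0; lane < 8; lane++ )
            K[lane] += Kincr[lane];        /* models _mm512_add_epi64  */
    }
    return 1;
}
```

Keeping K in a vector register presumably trades a scalar add plus a broadcast per round for a single vector add; the diff does not state the motivation, so treat that as a guess.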


@@ -221,14 +221,14 @@ int cube_4way_init( cube_4way_context *sp, int hashbitlen, int rounds,
sp->rounds = rounds;
sp->pos = 0;
h[ 0] = m512_const1_128( iv[0] );
h[ 1] = m512_const1_128( iv[1] );
h[ 2] = m512_const1_128( iv[2] );
h[ 3] = m512_const1_128( iv[3] );
h[ 4] = m512_const1_128( iv[4] );
h[ 5] = m512_const1_128( iv[5] );
h[ 6] = m512_const1_128( iv[6] );
h[ 7] = m512_const1_128( iv[7] );
h[ 0] = mm512_bcast_m128( iv[0] );
h[ 1] = mm512_bcast_m128( iv[1] );
h[ 2] = mm512_bcast_m128( iv[2] );
h[ 3] = mm512_bcast_m128( iv[3] );
h[ 4] = mm512_bcast_m128( iv[4] );
h[ 5] = mm512_bcast_m128( iv[5] );
h[ 6] = mm512_bcast_m128( iv[6] );
h[ 7] = mm512_bcast_m128( iv[7] );
return 0;
}
@@ -259,11 +259,11 @@ int cube_4way_close( cube_4way_context *sp, void *output )
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
m512_const2_64( 0, 0x0000000000000080 ) );
mm512_bcast128lo_64( 0x0000000000000080 ) );
transform_4way( sp );
sp->h[7] = _mm512_xor_si512( sp->h[7],
m512_const2_64( 0x0000000100000000, 0 ) );
mm512_bcast128hi_64( 0x0000000100000000 ) );
for ( i = 0; i < 10; ++i )
transform_4way( sp );
@@ -283,14 +283,14 @@ int cube_4way_full( cube_4way_context *sp, void *output, int hashbitlen,
sp->rounds = 16;
sp->pos = 0;
h[ 0] = m512_const1_128( iv[0] );
h[ 1] = m512_const1_128( iv[1] );
h[ 2] = m512_const1_128( iv[2] );
h[ 3] = m512_const1_128( iv[3] );
h[ 4] = m512_const1_128( iv[4] );
h[ 5] = m512_const1_128( iv[5] );
h[ 6] = m512_const1_128( iv[6] );
h[ 7] = m512_const1_128( iv[7] );
h[ 0] = mm512_bcast_m128( iv[0] );
h[ 1] = mm512_bcast_m128( iv[1] );
h[ 2] = mm512_bcast_m128( iv[2] );
h[ 3] = mm512_bcast_m128( iv[3] );
h[ 4] = mm512_bcast_m128( iv[4] );
h[ 5] = mm512_bcast_m128( iv[5] );
h[ 6] = mm512_bcast_m128( iv[6] );
h[ 7] = mm512_bcast_m128( iv[7] );
const int len = size >> 4;
const __m512i *in = (__m512i*)data;
@@ -310,11 +310,11 @@ int cube_4way_full( cube_4way_context *sp, void *output, int hashbitlen,
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
m512_const2_64( 0, 0x0000000000000080 ) );
mm512_bcast128lo_64( 0x0000000000000080 ) );
transform_4way( sp );
sp->h[7] = _mm512_xor_si512( sp->h[7],
m512_const2_64( 0x0000000100000000, 0 ) );
mm512_bcast128hi_64( 0x0000000100000000 ) );
for ( i = 0; i < 10; ++i )
transform_4way( sp );
@@ -336,14 +336,14 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
sp->rounds = 16;
sp->pos = 0;
h1[0] = h0[0] = m512_const1_128( iv[0] );
h1[1] = h0[1] = m512_const1_128( iv[1] );
h1[2] = h0[2] = m512_const1_128( iv[2] );
h1[3] = h0[3] = m512_const1_128( iv[3] );
h1[4] = h0[4] = m512_const1_128( iv[4] );
h1[5] = h0[5] = m512_const1_128( iv[5] );
h1[6] = h0[6] = m512_const1_128( iv[6] );
h1[7] = h0[7] = m512_const1_128( iv[7] );
h1[0] = h0[0] = mm512_bcast_m128( iv[0] );
h1[1] = h0[1] = mm512_bcast_m128( iv[1] );
h1[2] = h0[2] = mm512_bcast_m128( iv[2] );
h1[3] = h0[3] = mm512_bcast_m128( iv[3] );
h1[4] = h0[4] = mm512_bcast_m128( iv[4] );
h1[5] = h0[5] = mm512_bcast_m128( iv[5] );
h1[6] = h0[6] = mm512_bcast_m128( iv[6] );
h1[7] = h0[7] = mm512_bcast_m128( iv[7] );
const int len = size >> 4;
const __m512i *in0 = (__m512i*)data0;
@@ -365,13 +365,13 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
}
// pos is zero for 64 byte data, 1 for 80 byte data.
__m512i tmp = m512_const2_64( 0, 0x0000000000000080 );
__m512i tmp = mm512_bcast128lo_64( 0x0000000000000080 );
sp->h0[ sp->pos ] = _mm512_xor_si512( sp->h0[ sp->pos ], tmp );
sp->h1[ sp->pos ] = _mm512_xor_si512( sp->h1[ sp->pos ], tmp );
transform_4way_2buf( sp );
tmp = m512_const2_64( 0x0000000100000000, 0 );
tmp = mm512_bcast128hi_64( 0x0000000100000000 );
sp->h0[7] = _mm512_xor_si512( sp->h0[7], tmp );
sp->h1[7] = _mm512_xor_si512( sp->h1[7], tmp );
@@ -384,7 +384,6 @@ int cube_4way_2buf_full( cube_4way_2buf_context *sp,
return 0;
}
int cube_4way_update_close( cube_4way_context *sp, void *output,
const void *data, size_t size )
{
@@ -406,11 +405,11 @@ int cube_4way_update_close( cube_4way_context *sp, void *output,
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->h[ sp->pos ] = _mm512_xor_si512( sp->h[ sp->pos ],
m512_const2_64( 0, 0x0000000000000080 ) );
mm512_bcast128lo_64( 0x0000000000000080 ) );
transform_4way( sp );
sp->h[7] = _mm512_xor_si512( sp->h[7],
m512_const2_64( 0x0000000100000000, 0 ) );
mm512_bcast128hi_64( 0x0000000100000000 ) );
for ( i = 0; i < 10; ++i )
transform_4way( sp );
@@ -424,21 +423,6 @@ int cube_4way_update_close( cube_4way_context *sp, void *output,
// 2 way 128
// This isn't expected to be used with AVX512 so HW rotate instruction
// is assumed not available.
// Use double buffering to optimize serial bit rotations. Full double
// buffering isn't practical because it needs twice as many registers
// with AVX2 having only half as many as AVX512.
#define ROL2( out0, out1, in0, in1, c ) \
{ \
__m256i t0 = _mm256_slli_epi32( in0, c ); \
__m256i t1 = _mm256_slli_epi32( in1, c ); \
out0 = _mm256_srli_epi32( in0, 32-(c) ); \
out1 = _mm256_srli_epi32( in1, 32-(c) ); \
out0 = _mm256_or_si256( out0, t0 ); \
out1 = _mm256_or_si256( out1, t1 ); \
}
static void transform_2way( cube_2way_context *sp )
{
int r;
@@ -461,8 +445,10 @@ static void transform_2way( cube_2way_context *sp )
x5 = _mm256_add_epi32( x1, x5 );
x6 = _mm256_add_epi32( x2, x6 );
x7 = _mm256_add_epi32( x3, x7 );
ROL2( y0, y1, x2, x3, 7 );
ROL2( x2, x3, x0, x1, 7 );
y0 = mm256_rol_32( x2, 7 );
y1 = mm256_rol_32( x3, 7 );
x2 = mm256_rol_32( x0, 7 );
x3 = mm256_rol_32( x1, 7 );
x0 = _mm256_xor_si256( y0, x4 );
x1 = _mm256_xor_si256( y1, x5 );
x2 = _mm256_xor_si256( x2, x6 );
@@ -475,8 +461,10 @@ static void transform_2way( cube_2way_context *sp )
x5 = _mm256_add_epi32( x1, x5 );
x6 = _mm256_add_epi32( x2, x6 );
x7 = _mm256_add_epi32( x3, x7 );
ROL2( y0, x1, x1, x0, 11 );
ROL2( y1, x3, x3, x2, 11 );
y0 = mm256_rol_32( x1, 11 );
x1 = mm256_rol_32( x0, 11 );
y1 = mm256_rol_32( x3, 11 );
x3 = mm256_rol_32( x2, 11 );
x0 = _mm256_xor_si256( y0, x4 );
x1 = _mm256_xor_si256( x1, x5 );
x2 = _mm256_xor_si256( y1, x6 );
@@ -508,14 +496,14 @@ int cube_2way_init( cube_2way_context *sp, int hashbitlen, int rounds,
sp->rounds = rounds;
sp->pos = 0;
h[ 0] = m256_const1_128( iv[0] );
h[ 1] = m256_const1_128( iv[1] );
h[ 2] = m256_const1_128( iv[2] );
h[ 3] = m256_const1_128( iv[3] );
h[ 4] = m256_const1_128( iv[4] );
h[ 5] = m256_const1_128( iv[5] );
h[ 6] = m256_const1_128( iv[6] );
h[ 7] = m256_const1_128( iv[7] );
h[ 0] = mm256_bcast_m128( iv[0] );
h[ 1] = mm256_bcast_m128( iv[1] );
h[ 2] = mm256_bcast_m128( iv[2] );
h[ 3] = mm256_bcast_m128( iv[3] );
h[ 4] = mm256_bcast_m128( iv[4] );
h[ 5] = mm256_bcast_m128( iv[5] );
h[ 6] = mm256_bcast_m128( iv[6] );
h[ 7] = mm256_bcast_m128( iv[7] );
return 0;
}
@@ -546,13 +534,14 @@ int cube_2way_close( cube_2way_context *sp, void *output )
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
m256_const2_64( 0, 0x0000000000000080 ) );
mm256_bcast128lo_64( 0x0000000000000080 ) );
transform_2way( sp );
sp->h[7] = _mm256_xor_si256( sp->h[7],
m256_const2_64( 0x0000000100000000, 0 ) );
mm256_bcast128hi_64( 0x0000000100000000 ) );
for ( i = 0; i < 10; ++i ) transform_2way( sp );
for ( i = 0; i < 10; ++i )
transform_2way( sp );
memcpy( hash, sp->h, sp->hashlen<<5 );
return 0;
@@ -579,13 +568,14 @@ int cube_2way_update_close( cube_2way_context *sp, void *output,
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
m256_const2_64( 0, 0x0000000000000080 ) );
mm256_bcast128lo_64( 0x0000000000000080 ) );
transform_2way( sp );
sp->h[7] = _mm256_xor_si256( sp->h[7],
m256_const2_64( 0x0000000100000000, 0 ) );
mm256_bcast128hi_64( 0x0000000100000000 ) );
for ( i = 0; i < 10; ++i ) transform_2way( sp );
for ( i = 0; i < 10; ++i )
transform_2way( sp );
memcpy( hash, sp->h, sp->hashlen<<5 );
return 0;
@@ -602,14 +592,14 @@ int cube_2way_full( cube_2way_context *sp, void *output, int hashbitlen,
sp->rounds = 16;
sp->pos = 0;
h[ 0] = m256_const1_128( iv[0] );
h[ 1] = m256_const1_128( iv[1] );
h[ 2] = m256_const1_128( iv[2] );
h[ 3] = m256_const1_128( iv[3] );
h[ 4] = m256_const1_128( iv[4] );
h[ 5] = m256_const1_128( iv[5] );
h[ 6] = m256_const1_128( iv[6] );
h[ 7] = m256_const1_128( iv[7] );
h[ 0] = mm256_bcast_m128( iv[0] );
h[ 1] = mm256_bcast_m128( iv[1] );
h[ 2] = mm256_bcast_m128( iv[2] );
h[ 3] = mm256_bcast_m128( iv[3] );
h[ 4] = mm256_bcast_m128( iv[4] );
h[ 5] = mm256_bcast_m128( iv[5] );
h[ 6] = mm256_bcast_m128( iv[6] );
h[ 7] = mm256_bcast_m128( iv[7] );
const int len = size >> 4;
const __m256i *in = (__m256i*)data;
@@ -629,13 +619,14 @@ int cube_2way_full( cube_2way_context *sp, void *output, int hashbitlen,
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->h[ sp->pos ] = _mm256_xor_si256( sp->h[ sp->pos ],
m256_const2_64( 0, 0x0000000000000080 ) );
mm256_bcast128lo_64( 0x0000000000000080 ) );
transform_2way( sp );
sp->h[7] = _mm256_xor_si256( sp->h[7],
m256_const2_64( 0x0000000100000000, 0 ) );
mm256_bcast128hi_64( 0x0000000100000000 ) );
for ( i = 0; i < 10; ++i ) transform_2way( sp );
for ( i = 0; i < 10; ++i )
transform_2way( sp );
memcpy( hash, sp->h, sp->hashlen<<5 );
return 0;
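Throughout the hunks above, m512_const2_64( 0, x ) becomes mm512_bcast128lo_64( x ) and m512_const2_64( x, 0 ) becomes mm512_bcast128hi_64( x ). The definitions of the new macros are not part of this diff, so the sketch below only models the behaviour implied by that substitution: x is placed in the low or high 64 bits of each 128-bit lane and the other half is zero. It also assumes the old macro took its arguments in (high, low) order. The type vec512_model and the model_ functions are hypothetical.

```c
#include <stdint.h>

/* A 512-bit vector modeled as eight 64-bit words, w[0] least significant.
   Words 2n and 2n+1 form the low and high halves of 128-bit lane n. */
typedef struct { uint64_t w[8]; } vec512_model;

/* Assumed behaviour of mm512_bcast128lo_64( x ):
   every 128-bit lane holds x in its low half and 0 in its high half. */
static vec512_model model_bcast128lo_64( uint64_t x )
{
    vec512_model v;
    for ( int lane = 0; lane < 4; lane++ )
    {
        v.w[ 2*lane     ] = x;
        v.w[ 2*lane + 1 ] = 0;
    }
    return v;
}

/* Assumed behaviour of mm512_bcast128hi_64( x ):
   every 128-bit lane holds 0 in its low half and x in its high half. */
static vec512_model model_bcast128hi_64( uint64_t x )
{
    vec512_model v;
    for ( int lane = 0; lane < 4; lane++ )
    {
        v.w[ 2*lane     ] = 0;
        v.w[ 2*lane + 1 ] = x;
    }
    return v;
}
```

Under this model, model_bcast128lo_64( 0x80 ) corresponds to the padding constant XORed into h[ pos ], and model_bcast128hi_64( 0x0000000100000000 ) to the finalization constant XORed into h[7].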


@@ -32,7 +32,7 @@ static void transform( cubehashParam *sp )
{
x1 = _mm512_add_epi32( x0, x1 );
x0 = mm512_swap_256( x0 );
x0 = mm512_rol_32( x0, 7 );
x0 = mm512_rol_32( x0, 7 );
x0 = _mm512_xor_si512( x0, x1 );
x1 = mm512_swap128_64( x1 );
x1 = _mm512_add_epi32( x0, x1 );
@@ -58,19 +58,18 @@ static void transform( cubehashParam *sp )
{
x2 = _mm256_add_epi32( x0, x2 );
x3 = _mm256_add_epi32( x1, x3 );
y0 = x0;
x0 = mm256_rol_32( x1, 7 );
x1 = mm256_rol_32( y0, 7 );
x0 = _mm256_xor_si256( x0, x2 );
x1 = _mm256_xor_si256( x1, x3 );
y0 = mm256_rol_32( x1, 7 );
y1 = mm256_rol_32( x0, 7 );
x0 = _mm256_xor_si256( y0, x2 );
x1 = _mm256_xor_si256( y1, x3 );
x2 = mm256_swap128_64( x2 );
x3 = mm256_swap128_64( x3 );
x2 = _mm256_add_epi32( x0, x2 );
x3 = _mm256_add_epi32( x1, x3 );
y0 = mm256_swap_128( x0 );
y1 = mm256_swap_128( x1 );
x0 = mm256_rol_32( y0, 11 );
x1 = mm256_rol_32( y1, 11 );
x0 = mm256_swap_128( x0 );
x1 = mm256_swap_128( x1 );
x0 = mm256_rol_32( x0, 11 );
x1 = mm256_rol_32( x1, 11 );
x0 = _mm256_xor_si256( x0, x2 );
x1 = _mm256_xor_si256( x1, x3 );
x2 = mm256_swap64_32( x2 );
@@ -94,47 +93,48 @@ static void transform( cubehashParam *sp )
x6 = _mm_load_si128( (__m128i*)sp->x + 6 );
x7 = _mm_load_si128( (__m128i*)sp->x + 7 );
for (r = 0; r < rounds; ++r) {
x4 = _mm_add_epi32(x0, x4);
x5 = _mm_add_epi32(x1, x5);
x6 = _mm_add_epi32(x2, x6);
x7 = _mm_add_epi32(x3, x7);
y0 = x2;
y1 = x3;
y2 = x0;
y3 = x1;
x0 = _mm_xor_si128(_mm_slli_epi32(y0, 7), _mm_srli_epi32(y0, 25));
x1 = _mm_xor_si128(_mm_slli_epi32(y1, 7), _mm_srli_epi32(y1, 25));
x2 = _mm_xor_si128(_mm_slli_epi32(y2, 7), _mm_srli_epi32(y2, 25));
x3 = _mm_xor_si128(_mm_slli_epi32(y3, 7), _mm_srli_epi32(y3, 25));
x0 = _mm_xor_si128(x0, x4);
x1 = _mm_xor_si128(x1, x5);
x2 = _mm_xor_si128(x2, x6);
x3 = _mm_xor_si128(x3, x7);
x4 = _mm_shuffle_epi32(x4, 0x4e);
x5 = _mm_shuffle_epi32(x5, 0x4e);
x6 = _mm_shuffle_epi32(x6, 0x4e);
x7 = _mm_shuffle_epi32(x7, 0x4e);
x4 = _mm_add_epi32(x0, x4);
x5 = _mm_add_epi32(x1, x5);
x6 = _mm_add_epi32(x2, x6);
x7 = _mm_add_epi32(x3, x7);
y0 = x1;
y1 = x0;
y2 = x3;
y3 = x2;
x0 = _mm_xor_si128(_mm_slli_epi32(y0, 11), _mm_srli_epi32(y0, 21));
x1 = _mm_xor_si128(_mm_slli_epi32(y1, 11), _mm_srli_epi32(y1, 21));
x2 = _mm_xor_si128(_mm_slli_epi32(y2, 11), _mm_srli_epi32(y2, 21));
x3 = _mm_xor_si128(_mm_slli_epi32(y3, 11), _mm_srli_epi32(y3, 21));
x0 = _mm_xor_si128(x0, x4);
x1 = _mm_xor_si128(x1, x5);
x2 = _mm_xor_si128(x2, x6);
x3 = _mm_xor_si128(x3, x7);
x4 = _mm_shuffle_epi32(x4, 0xb1);
x5 = _mm_shuffle_epi32(x5, 0xb1);
x6 = _mm_shuffle_epi32(x6, 0xb1);
x7 = _mm_shuffle_epi32(x7, 0xb1);
for ( r = 0; r < rounds; ++r )
{
x4 = _mm_add_epi32( x0, x4 );
x5 = _mm_add_epi32( x1, x5 );
x6 = _mm_add_epi32( x2, x6 );
x7 = _mm_add_epi32( x3, x7 );
y0 = x2;
y1 = x3;
y2 = x0;
y3 = x1;
x0 = mm128_rol_32( y0, 7 );
x1 = mm128_rol_32( y1, 7 );
x2 = mm128_rol_32( y2, 7 );
x3 = mm128_rol_32( y3, 7 );
x0 = _mm_xor_si128( x0, x4 );
x1 = _mm_xor_si128( x1, x5 );
x2 = _mm_xor_si128( x2, x6 );
x3 = _mm_xor_si128( x3, x7 );
x4 = _mm_shuffle_epi32( x4, 0x4e );
x5 = _mm_shuffle_epi32( x5, 0x4e );
x6 = _mm_shuffle_epi32( x6, 0x4e );
x7 = _mm_shuffle_epi32( x7, 0x4e );
x4 = _mm_add_epi32( x0, x4 );
x5 = _mm_add_epi32( x1, x5 );
x6 = _mm_add_epi32( x2, x6 );
x7 = _mm_add_epi32( x3, x7 );
y0 = x1;
y1 = x0;
y2 = x3;
y3 = x2;
x0 = mm128_rol_32( y0, 11 );
x1 = mm128_rol_32( y1, 11 );
x2 = mm128_rol_32( y2, 11 );
x3 = mm128_rol_32( y3, 11 );
x0 = _mm_xor_si128( x0, x4 );
x1 = _mm_xor_si128( x1, x5 );
x2 = _mm_xor_si128( x2, x6 );
x3 = _mm_xor_si128( x3, x7 );
x4 = _mm_shuffle_epi32( x4, 0xb1 );
x5 = _mm_shuffle_epi32( x5, 0xb1 );
x6 = _mm_shuffle_epi32( x6, 0xb1 );
x7 = _mm_shuffle_epi32( x7, 0xb1 );
}
_mm_store_si128( (__m128i*)sp->x, x0 );
@@ -180,25 +180,25 @@ int cubehashInit(cubehashParam *sp, int hashbitlen, int rounds, int blockbytes)
if ( hashbitlen == 512 )
{
x[0] = m128_const_64( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
x[1] = m128_const_64( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
x[2] = m128_const_64( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
x[3] = m128_const_64( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
x[4] = m128_const_64( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
x[5] = m128_const_64( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
x[6] = m128_const_64( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
x[7] = m128_const_64( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
x[0] = _mm_set_epi64x( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
x[1] = _mm_set_epi64x( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
x[2] = _mm_set_epi64x( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
x[3] = _mm_set_epi64x( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
x[4] = _mm_set_epi64x( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
x[5] = _mm_set_epi64x( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
x[6] = _mm_set_epi64x( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
x[7] = _mm_set_epi64x( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
}
else
{
x[0] = m128_const_64( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
x[1] = m128_const_64( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
x[2] = m128_const_64( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
x[3] = m128_const_64( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
x[4] = m128_const_64( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
x[5] = m128_const_64( 0x93CB628565C892FD, 0x5FA2560309392549 );
x[6] = m128_const_64( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
x[7] = m128_const_64( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
x[0] = _mm_set_epi64x( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
x[1] = _mm_set_epi64x( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
x[2] = _mm_set_epi64x( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
x[3] = _mm_set_epi64x( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
x[4] = _mm_set_epi64x( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
x[5] = _mm_set_epi64x( 0x93CB628565C892FD, 0x5FA2560309392549 );
x[6] = _mm_set_epi64x( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
x[7] = _mm_set_epi64x( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
}
return SUCCESS;
@@ -234,10 +234,10 @@ int cubehashDigest( cubehashParam *sp, byte *digest )
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
m128_const_64( 0, 0x80 ) );
_mm_set_epi64x( 0, 0x80 ) );
transform( sp );
sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );
transform( sp );
transform( sp );
transform( sp );
@@ -279,10 +279,10 @@ int cubehashUpdateDigest( cubehashParam *sp, byte *digest,
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
m128_const_64( 0, 0x80 ) );
_mm_set_epi64x( 0, 0x80 ) );
transform( sp );
sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );
transform( sp );
transform( sp );
@@ -313,25 +313,25 @@ int cubehash_full( cubehashParam *sp, byte *digest, int hashbitlen,
if ( hashbitlen == 512 )
{
x[0] = m128_const_64( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
x[1] = m128_const_64( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
x[2] = m128_const_64( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
x[3] = m128_const_64( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
x[4] = m128_const_64( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
x[5] = m128_const_64( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
x[6] = m128_const_64( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
x[7] = m128_const_64( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
x[0] = _mm_set_epi64x( 0x4167D83E2D538B8B, 0x50F494D42AEA2A61 );
x[1] = _mm_set_epi64x( 0x50AC5695CC39968E, 0xC701CF8C3FEE2313 );
x[2] = _mm_set_epi64x( 0x825B453797CF0BEF, 0xA647A8B34D42C787 );
x[3] = _mm_set_epi64x( 0xA23911AED0E5CD33, 0xF22090C4EEF864D2 );
x[4] = _mm_set_epi64x( 0xB64445321B017BEF, 0x148FE485FCD398D9 );
x[5] = _mm_set_epi64x( 0x0DBADEA991FA7934, 0x2FF5781C6A536159 );
x[6] = _mm_set_epi64x( 0xBC796576B1C62456, 0xA5A70E75D65C8A2B );
x[7] = _mm_set_epi64x( 0xD43E3B447795D246, 0xE7989AF11921C8F7 );
}
else
{
x[0] = m128_const_64( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
x[1] = m128_const_64( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
x[2] = m128_const_64( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
x[3] = m128_const_64( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
x[4] = m128_const_64( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
x[5] = m128_const_64( 0x93CB628565C892FD, 0x5FA2560309392549 );
x[6] = m128_const_64( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
x[7] = m128_const_64( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
x[0] = _mm_set_epi64x( 0x35481EAE63117E71, 0xCCD6F29FEA2BD4B4 );
x[1] = _mm_set_epi64x( 0xF4CC12BE7E624131, 0xE5D94E6322512D5B );
x[2] = _mm_set_epi64x( 0x3361DA8CD0720C35, 0x42AF2070C2D0B696 );
x[3] = _mm_set_epi64x( 0x40E5FBAB4680AC00, 0x8EF8AD8328CCECA4 );
x[4] = _mm_set_epi64x( 0xF0B266796C859D41, 0x6107FBD5D89041C3 );
x[5] = _mm_set_epi64x( 0x93CB628565C892FD, 0x5FA2560309392549 );
x[6] = _mm_set_epi64x( 0x85254725774ABFDD, 0x9E4B4E602AF2B5AE );
x[7] = _mm_set_epi64x( 0xD6032C0A9CDAF8AF, 0x4AB6AAD615815AEB );
}
@@ -358,10 +358,10 @@ int cubehash_full( cubehashParam *sp, byte *digest, int hashbitlen,
// pos is zero for 64 byte data, 1 for 80 byte data.
sp->x[ sp->pos ] = _mm_xor_si128( sp->x[ sp->pos ],
m128_const_64( 0, 0x80 ) );
_mm_set_epi64x( 0, 0x80 ) );
transform( sp );
sp->x[7] = _mm_xor_si128( sp->x[7], m128_const_64( 0x100000000, 0 ) );
sp->x[7] = _mm_xor_si128( sp->x[7], _mm_set_epi64x( 0x100000000, 0 ) );
transform( sp );
transform( sp );
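The SSE2 transform above previously built each rotation from _mm_slli_epi32 and _mm_srli_epi32 joined with _mm_xor_si128, while the removed ROL2 macro in the 2-way code joined the same two shifts with an OR. Both are replaced by rol_32 helpers whose definitions are not shown in this diff. The scalar sketch below illustrates why the XOR and OR forms are interchangeable for a rotation: the two shifted halves never have a set bit in the same position. The function names are hypothetical, and the sketch assumes a rotation count strictly between 0 and 32.

```c
#include <stdint.h>

/* Rotate left from two shifts joined with OR (style of the removed ROL2).
   Only valid for 0 < c < 32. */
static uint32_t rol32_or( uint32_t x, unsigned c )
{
    return ( x << c ) | ( x >> ( 32 - c ) );
}

/* Rotate left from two shifts joined with XOR (style of the removed
   SSE2 lines). Only valid for 0 < c < 32. */
static uint32_t rol32_xor( uint32_t x, unsigned c )
{
    return ( x << c ) ^ ( x >> ( 32 - c ) );
}

/* Returns 1 if both forms agree for the two rotation counts used in the
   transform (7 and 11) over a sequence of pseudo-random inputs. */
static int rol32_forms_agree( void )
{
    uint32_t x = 0x2AEA2A61u;              /* arbitrary starting value  */
    for ( int i = 0; i < 1000; i++ )
    {
        if ( rol32_or( x,  7 ) != rol32_xor( x,  7 ) ) return 0;
        if ( rol32_or( x, 11 ) != rol32_xor( x, 11 ) ) return 0;
        x = x * 1664525u + 1013904223u;    /* simple LCG step           */
    }
    return 1;
}
```

On hardware with a native 32-bit vector rotate, a helper such as mm128_rol_32 could map to a single instruction instead; whether it does so here depends on definitions outside this diff.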


@@ -566,16 +566,16 @@ HashReturn echo_full( hashState_echo *state, BitSequence *hashval,
state->uHashSize = 256;
state->uBlockLength = 192;
state->uRounds = 8;
state->hashsize = m128_const_64( 0, 0x100 );
state->const1536 = m128_const_64( 0, 0x600 );
state->hashsize = _mm_set_epi64x( 0, 0x100 );
state->const1536 = _mm_set_epi64x( 0, 0x600 );
break;
case 512:
state->uHashSize = 512;
state->uBlockLength = 128;
state->uRounds = 10;
state->hashsize = m128_const_64( 0, 0x200 );
state->const1536 = m128_const_64( 0, 0x400 );
state->hashsize = _mm_set_epi64x( 0, 0x200 );
state->const1536 = _mm_set_epi64x( 0, 0x400 );
break;
default:


@@ -162,9 +162,9 @@ void echo_4way_compress( echo_4way_context *ctx, const __m512i *pmsg,
unsigned int r, b, i, j;
__m512i t1, t2, s2, k1;
__m512i _state[4][4], _state2[4][4], _statebackup[4][4];
__m512i one = m512_one_128;
__m512i mul2mask = m512_const2_64( 0, 0x00001b00 );
__m512i lsbmask = m512_const1_32( 0x01010101 );
const __m512i one = mm512_bcast128lo_64( 1 );
const __m512i mul2mask = mm512_bcast128lo_64( 0x00001b00 );
const __m512i lsbmask = _mm512_set1_epi32( 0x01010101 );
_state[ 0 ][ 0 ] = ctx->state[ 0 ][ 0 ];
_state[ 0 ][ 1 ] = ctx->state[ 0 ][ 1 ];
@@ -264,16 +264,16 @@ int echo_4way_init( echo_4way_context *ctx, int nHashSize )
ctx->uHashSize = 256;
ctx->uBlockLength = 192;
ctx->uRounds = 8;
ctx->hashsize = m512_const2_64( 0, 0x100 );
ctx->const1536 = m512_const2_64( 0, 0x600 );
ctx->hashsize = mm512_bcast128lo_64( 0x100 );
ctx->const1536 = mm512_bcast128lo_64( 0x600 );
break;
case 512:
ctx->uHashSize = 512;
ctx->uBlockLength = 128;
ctx->uRounds = 10;
ctx->hashsize = m512_const2_64( 0, 0x200 );
ctx->const1536 = m512_const2_64( 0, 0x400);
ctx->hashsize = mm512_bcast128lo_64( 0x200 );
ctx->const1536 = mm512_bcast128lo_64( 0x400);
break;
default:
@@ -305,7 +305,7 @@ int echo_4way_update_close( echo_4way_context *state, void *hashval,
{
echo_4way_compress( state, data, 1 );
state->processed_bits = 1024;
remainingbits = m512_const2_64( 0, -1024 );
remainingbits = mm512_bcast128lo_64( -1024 );
vlen = 0;
}
else
@@ -313,13 +313,15 @@ int echo_4way_update_close( echo_4way_context *state, void *hashval,
vlen = databitlen / 128; // * 4 lanes / 128 bits per lane
memcpy_512( state->buffer, data, vlen );
state->processed_bits += (unsigned int)( databitlen );
remainingbits = m512_const2_64( 0, (uint64_t)databitlen );
remainingbits = mm512_bcast128lo_64( (uint64_t)databitlen );
}
state->buffer[ vlen ] = m512_const2_64( 0, 0x80 );
state->buffer[ vlen ] = mm512_bcast128lo_64( 0x80 );
memset_zero_512( state->buffer + vlen + 1, vblen - vlen - 2 );
state->buffer[ vblen-2 ] = m512_const2_64( (uint64_t)state->uHashSize << 48, 0 );
state->buffer[ vblen-1 ] = m512_const2_64( 0, state->processed_bits);
state->buffer[ vblen-2 ] =
mm512_bcast128hi_64( (uint64_t)state->uHashSize << 48 );
state->buffer[ vblen-1 ] =
mm512_bcast128lo_64( state->processed_bits );
state->k = _mm512_add_epi64( state->k, remainingbits );
state->k = _mm512_sub_epi64( state->k, state->const1536 );
@@ -352,16 +354,16 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
ctx->uHashSize = 256;
ctx->uBlockLength = 192;
ctx->uRounds = 8;
ctx->hashsize = m512_const2_64( 0, 0x100 );
ctx->const1536 = m512_const2_64( 0, 0x600 );
ctx->hashsize = mm512_bcast128lo_64( 0x100 );
ctx->const1536 = mm512_bcast128lo_64( 0x600 );
break;
case 512:
ctx->uHashSize = 512;
ctx->uBlockLength = 128;
ctx->uRounds = 10;
ctx->hashsize = m512_const2_64( 0, 0x200 );
ctx->const1536 = m512_const2_64( 0, 0x400 );
ctx->hashsize = mm512_bcast128lo_64( 0x200 );
ctx->const1536 = mm512_bcast128lo_64( 0x400 );
break;
default:
@@ -388,7 +390,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
{
echo_4way_compress( ctx, data, 1 );
ctx->processed_bits = 1024;
remainingbits = m512_const2_64( 0, -1024 );
remainingbits = mm512_bcast128lo_64( -1024 );
vlen = 0;
}
else
@@ -396,14 +398,14 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
vlen = databitlen / 128; // * 4 lanes / 128 bits per lane
memcpy_512( ctx->buffer, data, vlen );
ctx->processed_bits += (unsigned int)( databitlen );
remainingbits = m512_const2_64( 0, databitlen );
remainingbits = mm512_bcast128lo_64( databitlen );
}
ctx->buffer[ vlen ] = m512_const2_64( 0, 0x80 );
ctx->buffer[ vlen ] = mm512_bcast128lo_64( 0x80 );
memset_zero_512( ctx->buffer + vlen + 1, vblen - vlen - 2 );
ctx->buffer[ vblen-2 ] =
m512_const2_64( (uint64_t)ctx->uHashSize << 48, 0 );
ctx->buffer[ vblen-1 ] = m512_const2_64( 0, ctx->processed_bits);
mm512_bcast128hi_64( (uint64_t)ctx->uHashSize << 48 );
ctx->buffer[ vblen-1 ] = mm512_bcast128lo_64( ctx->processed_bits);
ctx->k = _mm512_add_epi64( ctx->k, remainingbits );
ctx->k = _mm512_sub_epi64( ctx->k, ctx->const1536 );
@@ -425,9 +427,9 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
// AVX2 + VAES
#define mul2mask_2way m256_const2_64( 0, 0x0000000000001b00 )
#define mul2mask_2way mm256_bcast128lo_64( 0x0000000000001b00 )
#define lsbmask_2way m256_const1_32( 0x01010101 )
#define lsbmask_2way _mm256_set1_epi32( 0x01010101 )
#define ECHO_SUBBYTES4_2WAY( state, j ) \
state[0][j] = _mm256_aesenc_epi128( state[0][j], k1 ); \
@@ -467,8 +469,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
t1 = _mm256_and_si256( t1, lsbmask_2way ); \
t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
s2 = _mm256_xor_si256( s2, t2 );\
state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], \
_mm256_xor_si256( s2, state1[ 1 ][ j1 ] ) ); \
state2[ 0 ][ j ] = mm256_xor3( state2[ 0 ][ j ], s2, state1[ 1 ][ j1 ] ); \
state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], s2 ); \
state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], state1[ 1 ][ j1 ] ); \
state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3 ][ j ], state1[ 1 ][ j1 ] ); \
@@ -478,8 +479,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
t2 = _mm256_shuffle_epi8( mul2mask_2way, t1 ); \
s2 = _mm256_xor_si256( s2, t2 ); \
state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], state1[ 2 ][ j2 ] ); \
state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], \
_mm256_xor_si256( s2, state1[ 2 ][ j2 ] ) ); \
state2[ 1 ][ j ] = mm256_xor3( state2[ 1 ][ j ], s2, state1[ 2 ][ j2 ] ); \
state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], s2 ); \
state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3][ j ], state1[ 2 ][ j2 ] ); \
s2 = _mm256_add_epi8( state1[ 3 ][ j3 ], state1[ 3 ][ j3 ] ); \
@@ -489,8 +489,7 @@ int echo_4way_full( echo_4way_context *ctx, void *hashval, int nHashSize,
s2 = _mm256_xor_si256( s2, t2 ); \
state2[ 0 ][ j ] = _mm256_xor_si256( state2[ 0 ][ j ], state1[ 3 ][ j3 ] ); \
state2[ 1 ][ j ] = _mm256_xor_si256( state2[ 1 ][ j ], state1[ 3 ][ j3 ] ); \
state2[ 2 ][ j ] = _mm256_xor_si256( state2[ 2 ][ j ], \
_mm256_xor_si256( s2, state1[ 3 ][ j3] ) ); \
state2[ 2 ][ j ] = mm256_xor3( state2[ 2 ][ j ], s2, state1[ 3 ][ j3] ); \
state2[ 3 ][ j ] = _mm256_xor_si256( state2[ 3 ][ j ], s2 ); \
} while(0)
@@ -679,16 +678,16 @@ int echo_2way_init( echo_2way_context *ctx, int nHashSize )
ctx->uHashSize = 256;
ctx->uBlockLength = 192;
ctx->uRounds = 8;
ctx->hashsize = m256_const2_64( 0, 0x100 );
ctx->const1536 = m256_const2_64( 0, 0x600 );
ctx->hashsize = mm256_bcast128lo_64( 0x100 );
ctx->const1536 = mm256_bcast128lo_64( 0x600 );
break;
case 512:
ctx->uHashSize = 512;
ctx->uBlockLength = 128;
ctx->uRounds = 10;
ctx->hashsize = m256_const2_64( 0, 0x200 );
ctx->const1536 = m256_const2_64( 0, 0x400 );
ctx->hashsize = mm256_bcast128lo_64( 0x200 );
ctx->const1536 = mm256_bcast128lo_64( 0x400 );
break;
default:
@@ -720,20 +719,20 @@ int echo_2way_update_close( echo_2way_context *state, void *hashval,
{
echo_2way_compress( state, data, 1 );
state->processed_bits = 1024;
remainingbits = m256_const2_64( 0, -1024 );
remainingbits = mm256_bcast128lo_64( -1024 );
vlen = 0;
}
else
{
memcpy_256( state->buffer, data, vlen );
state->processed_bits += (unsigned int)( databitlen );
remainingbits = m256_const2_64( 0, databitlen );
remainingbits = mm256_bcast128lo_64( databitlen );
}
state->buffer[ vlen ] = m256_const2_64( 0, 0x80 );
state->buffer[ vlen ] = mm256_bcast128lo_64( 0x80 );
memset_zero_256( state->buffer + vlen + 1, vblen - vlen - 2 );
state->buffer[ vblen-2 ] = m256_const2_64( (uint64_t)state->uHashSize << 48, 0 );
state->buffer[ vblen-1 ] = m256_const2_64( 0, state->processed_bits );
state->buffer[ vblen-2 ] = mm256_bcast128hi_64( (uint64_t)state->uHashSize << 48 );
state->buffer[ vblen-1 ] = mm256_bcast128lo_64( state->processed_bits );
state->k = _mm256_add_epi64( state->k, remainingbits );
state->k = _mm256_sub_epi64( state->k, state->const1536 );
@@ -766,16 +765,16 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
ctx->uHashSize = 256;
ctx->uBlockLength = 192;
ctx->uRounds = 8;
ctx->hashsize = m256_const2_64( 0, 0x100 );
ctx->const1536 = m256_const2_64( 0, 0x600 );
ctx->hashsize = mm256_bcast128lo_64( 0x100 );
ctx->const1536 = mm256_bcast128lo_64( 0x600 );
break;
case 512:
ctx->uHashSize = 512;
ctx->uBlockLength = 128;
ctx->uRounds = 10;
ctx->hashsize = m256_const2_64( 0, 0x200 );
ctx->const1536 = m256_const2_64( 0, 0x400 );
ctx->hashsize = mm256_bcast128lo_64( 0x200 );
ctx->const1536 = mm256_bcast128lo_64( 0x400 );
break;
default:
@@ -798,7 +797,7 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
{
echo_2way_compress( ctx, data, 1 );
ctx->processed_bits = 1024;
remainingbits = m256_const2_64( 0, -1024 );
remainingbits = mm256_bcast128lo_64( -1024 );
vlen = 0;
}
else
@@ -806,13 +805,13 @@ int echo_2way_full( echo_2way_context *ctx, void *hashval, int nHashSize,
vlen = databitlen / 128; // * 4 lanes / 128 bits per lane
memcpy_256( ctx->buffer, data, vlen );
ctx->processed_bits += (unsigned int)( databitlen );
remainingbits = m256_const2_64( 0, databitlen );
remainingbits = mm256_bcast128lo_64( databitlen );
}
ctx->buffer[ vlen ] = m256_const2_64( 0, 0x80 );
ctx->buffer[ vlen ] = mm256_bcast128lo_64( 0x80 );
memset_zero_256( ctx->buffer + vlen + 1, vblen - vlen - 2 );
ctx->buffer[ vblen-2 ] = m256_const2_64( (uint64_t)ctx->uHashSize << 48, 0 );
ctx->buffer[ vblen-1 ] = m256_const2_64( 0, ctx->processed_bits );
ctx->buffer[ vblen-2 ] = mm256_bcast128hi_64( (uint64_t)ctx->uHashSize << 48 );
ctx->buffer[ vblen-1 ] = mm256_bcast128lo_64( ctx->processed_bits );
ctx->k = _mm256_add_epi64( ctx->k, remainingbits );
ctx->k = _mm256_sub_epi64( ctx->k, ctx->const1536 );
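
Note on the mm256_xor3 substitutions in the hunks above: each one replaces a nested pair of _mm256_xor_si256 calls with one three-input call, so mm256_xor3( a, b, c ) has to compute a ^ b ^ c. The scalar sketch below checks that identity on a single 64-bit element. It also derives the truth-table immediate a three-input XOR would need if it were done with one ternary-logic instruction. That last part is an assumption, since the diff does not show how mm256_xor3 is defined. All model_* names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Old form in the diff: xor( x, xor( s2, y ) ). */
static uint64_t model_xor_nested( uint64_t x, uint64_t s2, uint64_t y )
{
   return x ^ ( s2 ^ y );
}

/* New form in the diff: xor3( x, s2, y ). Hypothetical scalar stand-in. */
static uint64_t model_xor3( uint64_t a, uint64_t b, uint64_t c )
{
   return a ^ b ^ c;
}

/* Build the 8-entry truth table of a ^ b ^ c, indexed by (a<<2)|(b<<1)|c.
   This is the immediate a ternary-logic instruction would take for XOR3
   (assumed encoding, not taken from the diff). */
static uint8_t model_xor3_imm8( void )
{
   uint8_t imm = 0;
   for ( unsigned idx = 0; idx < 8; idx++ )
   {
      unsigned a = ( idx >> 2 ) & 1, b = ( idx >> 1 ) & 1, c = idx & 1;
      imm |= (uint8_t)( ( a ^ b ^ c ) << idx );
   }
   return imm;
}

/* Both forms agree because XOR is associative. */
static void model_xor3_selfcheck( void )
{
   const uint64_t x = 0x0123456789abcdefULL;
   const uint64_t s = 0x1b1b1b1b1b1b1b1bULL;
   const uint64_t y = 0xfedcba9876543210ULL;
   assert( model_xor_nested( x, s, y ) == model_xor3( x, s, y ) );
}
```

The rewrite is a pure regrouping, so it cannot change the hash output. Any benefit comes from the target being able to fuse the two XORs.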

View File

@@ -33,11 +33,11 @@ MYALIGN const unsigned long long _supermix4b[] = {0x07020d08080e0d0d, 0x07070908
MYALIGN const unsigned long long _supermix4c[] = {0x0706050403020000, 0x0302000007060504};
MYALIGN const unsigned long long _supermix7a[] = {0x010c0b060d080702, 0x0904030e03000104};
MYALIGN const unsigned long long _supermix7b[] = {0x8080808080808080, 0x0504070605040f06};
MYALIGN const unsigned long long _k_n[] = {0x4E4E4E4E4E4E4E4E, 0x1B1B1B1B0E0E0E0E};
MYALIGN const unsigned char _shift_one_mask[] = {7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2};
MYALIGN const unsigned char _shift_four_mask[] = {13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8};
MYALIGN const unsigned char _shift_seven_mask[] = {10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5};
MYALIGN const unsigned char _aes_shift_rows[] = {0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11};
//MYALIGN const unsigned long long _k_n[] = {0x4E4E4E4E4E4E4E4E, 0x1B1B1B1B0E0E0E0E};
//MYALIGN const unsigned char _shift_one_mask[] = {7, 4, 5, 6, 11, 8, 9, 10, 15, 12, 13, 14, 3, 0, 1, 2};
//MYALIGN const unsigned char _shift_four_mask[] = {13, 14, 15, 12, 1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8};
//MYALIGN const unsigned char _shift_seven_mask[] = {10, 11, 8, 9, 14, 15, 12, 13, 2, 3, 0, 1, 6, 7, 4, 5};
//MYALIGN const unsigned char _aes_shift_rows[] = {0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11};
MYALIGN const unsigned int _inv_shift_rows[] = {0x070a0d00, 0x0b0e0104, 0x0f020508, 0x0306090c};
MYALIGN const unsigned int _mul2mask[] = {0x1b1b0000, 0x00000000, 0x00000000, 0x00000000};
MYALIGN const unsigned int _mul4mask[] = {0x2d361b00, 0x00000000, 0x00000000, 0x00000000};
@@ -131,7 +131,7 @@ MYALIGN const unsigned int _IV512[] = {
t1 = _mm_srli_epi16(t0, 6);\
t1 = _mm_and_si128(t1, M128(_lsbmask2));\
t3 = _mm_xor_si128(t3, _mm_shuffle_epi8(M128(_mul2mask), t1));\
t0 = _mm_xor_si128(t4, _mm_shuffle_epi8(M128(_mul4mask), t1))
t0 = _mm_xor_si128(t4, _mm_shuffle_epi8(M128(_mul4mask), t1))
/*
#define PRESUPERMIX(x, t1, s1, s2, t2)\

View File

@@ -139,7 +139,7 @@ static const __m128i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003 };
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
MUL2(a0, b0, b1);\
a0 = _mm_xor_si128(a0, TEMP0);\
MUL2(a1, b0, b1);\
@@ -237,7 +237,7 @@ static const __m128i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003 };
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
MUL2(a0, b0, b1);\
a0 = _mm_xor_si128(a0, TEMP0);\
MUL2(a1, b0, b1);\

View File

@@ -128,7 +128,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
MUL2(a0, b0, b1);\
a0 = _mm_xor_si128(a0, TEMP0);\
MUL2(a1, b0, b1);\
@@ -226,7 +226,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m128_const1_64( 0x1b1b1b1b1b1b1b1b );\
b1 = _mm_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
MUL2(a0, b0, b1);\
a0 = _mm_xor_si128(a0, TEMP0);\
MUL2(a1, b0, b1);\
@@ -275,7 +275,7 @@ static const __m128i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e };
*/
#define ROUND(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
/* AddRoundConstant */\
b1 = m128_const_64( 0xffffffffffffffff, 0 ); \
b1 = _mm_set_epi64x( 0xffffffffffffffff, 0 ); \
a0 = _mm_xor_si128( a0, casti_m128i( round_const_l0, i ) ); \
a1 = _mm_xor_si128( a1, b1 ); \
a2 = _mm_xor_si128( a2, b1 ); \

View File

@@ -24,9 +24,6 @@ HashReturn_gr init_groestl( hashState_groestl* ctx, int hashlen )
ctx->hashlen = hashlen;
if (ctx->chaining == NULL || ctx->buffer == NULL)
return FAIL_GR;
for ( i = 0; i < SIZE512; i++ )
{
ctx->chaining[i] = _mm_setzero_si128();
@@ -34,7 +31,7 @@ HashReturn_gr init_groestl( hashState_groestl* ctx, int hashlen )
}
// The only non-zero in the IV is len. It can be hard coded.
ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );
ctx->buf_ptr = 0;
ctx->rem_ptr = 0;
@@ -46,15 +43,12 @@ HashReturn_gr reinit_groestl( hashState_groestl* ctx )
{
int i;
if (ctx->chaining == NULL || ctx->buffer == NULL)
return FAIL_GR;
for ( i = 0; i < SIZE512; i++ )
{
ctx->chaining[i] = _mm_setzero_si128();
ctx->buffer[i] = _mm_setzero_si128();
}
ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );
ctx->buf_ptr = 0;
ctx->rem_ptr = 0;
@@ -122,7 +116,7 @@ HashReturn_gr final_groestl( hashState_groestl* ctx, void* output )
else
{
// add first padding
ctx->buffer[rem_ptr] = m128_const_64( 0, 0x80 );
ctx->buffer[rem_ptr] = _mm_set_epi64x( 0, 0x80 );
// add zero padding
for ( i = rem_ptr + 1; i < SIZE512 - 1; i++ )
ctx->buffer[i] = _mm_setzero_si128();
@@ -154,7 +148,7 @@ int groestl512_full( hashState_groestl* ctx, void* output,
ctx->chaining[i] = _mm_setzero_si128();
ctx->buffer[i] = _mm_setzero_si128();
}
ctx->chaining[ 6 ] = m128_const_64( 0x0200000000000000, 0 );
ctx->chaining[ 6 ] = _mm_set_epi64x( 0x0200000000000000, 0 );
ctx->buf_ptr = 0;
// --- update ---
@@ -188,7 +182,7 @@ int groestl512_full( hashState_groestl* ctx, void* output,
else
{
// add first padding
ctx->buffer[i] = m128_const_64( 0, 0x80 );
ctx->buffer[i] = _mm_set_epi64x( 0, 0x80 );
// add zero padding
for ( i += 1; i < SIZE512 - 1; i++ )
ctx->buffer[i] = _mm_setzero_si128();
@@ -245,7 +239,7 @@ HashReturn_gr update_and_final_groestl( hashState_groestl* ctx, void* output,
else
{
// add first padding
ctx->buffer[i] = m128_const_64( 0, 0x80 );
ctx->buffer[i] = _mm_set_epi64x( 0, 0x80 );
// add zero padding
for ( i += 1; i < SIZE512 - 1; i++ )
ctx->buffer[i] = _mm_setzero_si128();

View File

@@ -22,9 +22,6 @@ HashReturn_gr init_groestl256( hashState_groestl256* ctx, int hashlen )
ctx->hashlen = hashlen;
if (ctx->chaining == NULL || ctx->buffer == NULL)
return FAIL_GR;
for ( i = 0; i < SIZE256; i++ )
{
ctx->chaining[i] = _mm_setzero_si128();
@@ -43,19 +40,14 @@ HashReturn_gr reinit_groestl256(hashState_groestl256* ctx)
{
int i;
if (ctx->chaining == NULL || ctx->buffer == NULL)
return FAIL_GR;
for ( i = 0; i < SIZE256; i++ )
{
ctx->chaining[i] = _mm_setzero_si128();
ctx->buffer[i] = _mm_setzero_si128();
}
ctx->chaining[ 3 ] = m128_const_64( 0, 0x0100000000000000 );
ctx->chaining[ 3 ] = _mm_set_epi64x( 0, 0x0100000000000000 );
// ((u64*)ctx->chaining)[COLS-1] = U64BIG((u64)LENGTH);
// INIT256(ctx->chaining);
ctx->buf_ptr = 0;
ctx->rem_ptr = 0;

View File

@@ -26,9 +26,6 @@ int groestl256_4way_init( groestl256_4way_context* ctx, uint64_t hashlen )
ctx->hashlen = hashlen;
if (ctx->chaining == NULL || ctx->buffer == NULL)
return 1;
for ( i = 0; i < SIZE256; i++ )
{
ctx->chaining[i] = m512_zero;
@@ -36,8 +33,7 @@ int groestl256_4way_init( groestl256_4way_context* ctx, uint64_t hashlen )
}
// The only non-zero in the IV is len. It can be hard coded.
ctx->chaining[ 3 ] = m512_const2_64( 0, 0x0100000000000000 );
ctx->chaining[ 3 ] = mm512_bcast128lo_64( 0x0100000000000000 );
ctx->buf_ptr = 0;
ctx->rem_ptr = 0;
@@ -54,9 +50,6 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
__m512i* in = (__m512i*)input;
int i;
if (ctx->chaining == NULL || ctx->buffer == NULL)
return 1;
for ( i = 0; i < SIZE256; i++ )
{
ctx->chaining[i] = m512_zero;
@@ -64,7 +57,7 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
}
// The only non-zero in the IV is len. It can be hard coded.
ctx->chaining[ 3 ] = m512_const2_64( 0, 0x0100000000000000 );
ctx->chaining[ 3 ] = mm512_bcast128lo_64( 0x0100000000000000 );
ctx->buf_ptr = 0;
// --- update ---
@@ -86,18 +79,18 @@ int groestl256_4way_full( groestl256_4way_context* ctx, void* output,
if ( i == SIZE256 - 1 )
{
// only 1 vector left in buffer, all padding at once
ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
}
else
{
// add first padding
ctx->buffer[i] = m512_const2_64( 0, 0x80 );
ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
// add zero padding
for ( i += 1; i < SIZE256 - 1; i++ )
ctx->buffer[i] = m512_zero;
// add length padding, second last byte is zero unless blocks > 255
ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
}
// digest final padding block and do output transform
@@ -143,18 +136,18 @@ int groestl256_4way_update_close( groestl256_4way_context* ctx, void* output,
if ( i == SIZE256 - 1 )
{
// only 1 vector left in buffer, all padding at once
ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
}
else
{
// add first padding
ctx->buffer[i] = m512_const2_64( 0, 0x80 );
ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
// add zero padding
for ( i += 1; i < SIZE256 - 1; i++ )
ctx->buffer[i] = m512_zero;
// add length padding, second last byte is zero unless blocks > 255
ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
}
// digest final padding block and do output transform
@@ -179,8 +172,8 @@ int groestl256_2way_init( groestl256_2way_context* ctx, uint64_t hashlen )
ctx->hashlen = hashlen;
if (ctx->chaining == NULL || ctx->buffer == NULL)
return 1;
// if (ctx->chaining == NULL || ctx->buffer == NULL)
// return 1;
for ( i = 0; i < SIZE256; i++ )
{
@@ -189,7 +182,7 @@ int groestl256_2way_init( groestl256_2way_context* ctx, uint64_t hashlen )
}
// The only non-zero in the IV is len. It can be hard coded.
ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
ctx->chaining[ 3 ] = mm256_bcast128lo_64( 0x0100000000000000 );
ctx->buf_ptr = 0;
ctx->rem_ptr = 0;
@@ -207,9 +200,6 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
__m256i* in = (__m256i*)input;
int i;
if (ctx->chaining == NULL || ctx->buffer == NULL)
return 1;
for ( i = 0; i < SIZE256; i++ )
{
ctx->chaining[i] = m256_zero;
@@ -217,7 +207,7 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
}
// The only non-zero in the IV is len. It can be hard coded.
ctx->chaining[ 3 ] = m256_const2_64( 0, 0x0100000000000000 );
ctx->chaining[ 3 ] = mm256_bcast128lo_64( 0x0100000000000000 );
ctx->buf_ptr = 0;
// --- update ---
@@ -239,18 +229,18 @@ int groestl256_2way_full( groestl256_2way_context* ctx, void* output,
if ( i == SIZE256 - 1 )
{
// only 1 vector left in buffer, all padding at once
ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
}
else
{
// add first padding
ctx->buffer[i] = m256_const2_64( 0, 0x80 );
ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
// add zero padding
for ( i += 1; i < SIZE256 - 1; i++ )
ctx->buffer[i] = m256_zero;
// add length padding, second last byte is zero unless blocks > 255
ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
}
// digest final padding block and do output transform
@@ -295,23 +285,22 @@ int groestl256_2way_update_close( groestl256_2way_context* ctx, void* output,
if ( i == SIZE256 - 1 )
{
// only 1 vector left in buffer, all padding at once
ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
}
else
{
// add first padding
ctx->buffer[i] = m256_const2_64( 0, 0x80 );
ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
// add zero padding
for ( i += 1; i < SIZE256 - 1; i++ )
ctx->buffer[i] = m256_zero;
// add length padding, second last byte is zero unless blocks > 255
ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
}
// digest final padding block and do output transform
TF512_2way( ctx->chaining, ctx->buffer );
OF512_2way( ctx->chaining );
// store hash result in output
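
Note on the constructor renames in the hunks above: m512_const2_64( 0, x ) becomes mm512_bcast128lo_64( x ), m512_const2_64( x, 0 ) becomes mm512_bcast128hi_64( x ), and the case with two non-zero halves keeps a two-argument form as mm512_set2_64( hi, lo ). The scalar sketch below models the lane layout these one-for-one substitutions imply. It assumes the ( hi, lo ) argument order within each 128-bit lane that the diff shows for _mm_set_epi64x. The model_* names and the array-based vector type are hypothetical stand-ins, not the library's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_ELEMS 8   /* one 512-bit vector viewed as eight 64-bit elements */

typedef struct { uint64_t e[ MODEL_ELEMS ]; } model_v512;

/* Every 128-bit lane holds { hi, lo }: even elements are the low halves. */
static model_v512 model_set2_64( uint64_t hi, uint64_t lo )
{
   model_v512 v;
   for ( int i = 0; i < MODEL_ELEMS; i += 2 )
   {
      v.e[ i     ] = lo;
      v.e[ i + 1 ] = hi;
   }
   return v;
}

/* x in the low 64 bits of every 128-bit lane, high 64 bits zero. */
static model_v512 model_bcast128lo_64( uint64_t x )
{
   return model_set2_64( 0, x );
}

/* x in the high 64 bits of every 128-bit lane, low 64 bits zero. */
static model_v512 model_bcast128hi_64( uint64_t x )
{
   return model_set2_64( x, 0 );
}

static uint64_t model_elem( model_v512 v, int i ) { return v.e[ i ]; }

static int model_equal( model_v512 a, model_v512 b )
{
   for ( int i = 0; i < MODEL_ELEMS; i++ )
      if ( a.e[ i ] != b.e[ i ] ) return 0;
   return 1;
}

/* The padding vectors built in the hunks above, old form against new form. */
static void model_bcast_selfcheck( void )
{
   const uint64_t blocks = 3;
   assert( model_equal( model_bcast128lo_64( 0x80 ),
                        model_set2_64( 0, 0x80 ) ) );
   assert( model_equal( model_bcast128hi_64( blocks << 56 ),
                        model_set2_64( blocks << 56, 0 ) ) );
}
```

In this model the lo and hi variants take one scalar each, which is all the single-value padding and IV vectors need. Only the combined final-padding vector needs both halves.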

View File

@@ -165,7 +165,7 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b ); \
b1 = _mm512_set1_epi64( 0x1b1b1b1b1b1b1b1b ); \
MUL2( a0, b0, b1 ); \
a0 = _mm512_xor_si512( a0, TEMP0 ); \
MUL2( a1, b0, b1 ); \
@@ -205,116 +205,18 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
b1 = _mm512_xor_si512( b1, a4 ); \
}/*MixBytes*/
#if 0
#define MixBytes(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
/* t_i = a_i + a_{i+1} */\
b6 = a0;\
b7 = a1;\
a0 = _mm512_xor_si512(a0, a1);\
b0 = a2;\
a1 = _mm512_xor_si512(a1, a2);\
b1 = a3;\
a2 = _mm512_xor_si512(a2, a3);\
b2 = a4;\
a3 = _mm512_xor_si512(a3, a4);\
b3 = a5;\
a4 = _mm512_xor_si512(a4, a5);\
b4 = a6;\
a5 = _mm512_xor_si512(a5, a6);\
b5 = a7;\
a6 = _mm512_xor_si512(a6, a7);\
a7 = _mm512_xor_si512(a7, b6);\
\
/* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
b0 = _mm512_xor_si512(b0, a4);\
b6 = _mm512_xor_si512(b6, a4);\
b1 = _mm512_xor_si512(b1, a5);\
b7 = _mm512_xor_si512(b7, a5);\
b2 = _mm512_xor_si512(b2, a6);\
b0 = _mm512_xor_si512(b0, a6);\
/* spill values y_4, y_5 to memory */\
TEMP0 = b0;\
b3 = _mm512_xor_si512(b3, a7);\
b1 = _mm512_xor_si512(b1, a7);\
TEMP1 = b1;\
b4 = _mm512_xor_si512(b4, a0);\
b2 = _mm512_xor_si512(b2, a0);\
/* save values t0, t1, t2 to xmm8, xmm9 and memory */\
b0 = a0;\
b5 = _mm512_xor_si512(b5, a1);\
b3 = _mm512_xor_si512(b3, a1);\
b1 = a1;\
b6 = _mm512_xor_si512(b6, a2);\
b4 = _mm512_xor_si512(b4, a2);\
TEMP2 = a2;\
b7 = _mm512_xor_si512(b7, a3);\
b5 = _mm512_xor_si512(b5, a3);\
\
/* compute x_i = t_i + t_{i+3} */\
a0 = _mm512_xor_si512(a0, a3);\
a1 = _mm512_xor_si512(a1, a4);\
a2 = _mm512_xor_si512(a2, a5);\
a3 = _mm512_xor_si512(a3, a6);\
a4 = _mm512_xor_si512(a4, a7);\
a5 = _mm512_xor_si512(a5, b0);\
a6 = _mm512_xor_si512(a6, b1);\
a7 = _mm512_xor_si512(a7, TEMP2);\
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b );\
MUL2(a0, b0, b1);\
a0 = _mm512_xor_si512(a0, TEMP0);\
MUL2(a1, b0, b1);\
a1 = _mm512_xor_si512(a1, TEMP1);\
MUL2(a2, b0, b1);\
a2 = _mm512_xor_si512(a2, b2);\
MUL2(a3, b0, b1);\
a3 = _mm512_xor_si512(a3, b3);\
MUL2(a4, b0, b1);\
a4 = _mm512_xor_si512(a4, b4);\
MUL2(a5, b0, b1);\
a5 = _mm512_xor_si512(a5, b5);\
MUL2(a6, b0, b1);\
a6 = _mm512_xor_si512(a6, b6);\
MUL2(a7, b0, b1);\
a7 = _mm512_xor_si512(a7, b7);\
\
/* compute v_i : double w_i */\
/* add to y_4 y_5 .. v3, v4, ... */\
MUL2(a0, b0, b1);\
b5 = _mm512_xor_si512(b5, a0);\
MUL2(a1, b0, b1);\
b6 = _mm512_xor_si512(b6, a1);\
MUL2(a2, b0, b1);\
b7 = _mm512_xor_si512(b7, a2);\
MUL2(a5, b0, b1);\
b2 = _mm512_xor_si512(b2, a5);\
MUL2(a6, b0, b1);\
b3 = _mm512_xor_si512(b3, a6);\
MUL2(a7, b0, b1);\
b4 = _mm512_xor_si512(b4, a7);\
MUL2(a3, b0, b1);\
MUL2(a4, b0, b1);\
b0 = TEMP0;\
b1 = TEMP1;\
b0 = _mm512_xor_si512(b0, a3);\
b1 = _mm512_xor_si512(b1, a4);\
}/*MixBytes*/
#endif
#define MASK_NOT( a ) _mm512_mask_ternarylogic_epi64( a, 0xaa, a, a, 1 )
#define ROUND(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
/* AddRoundConstant */\
b1 = m512_const2_64( 0xffffffffffffffff, 0 ); \
a0 = _mm512_xor_si512( a0, m512_const1_128( round_const_l0[i] ) );\
a1 = _mm512_xor_si512( a1, b1 );\
a2 = _mm512_xor_si512( a2, b1 );\
a3 = _mm512_xor_si512( a3, b1 );\
a4 = _mm512_xor_si512( a4, b1 );\
a5 = _mm512_xor_si512( a5, b1 );\
a6 = _mm512_xor_si512( a6, b1 );\
a7 = _mm512_xor_si512( a7, m512_const1_128( round_const_l7[i] ) );\
a0 = _mm512_xor_si512( a0, mm512_bcast_m128( round_const_l0[i] ) );\
a1 = MASK_NOT( a1 ); \
a2 = MASK_NOT( a2 ); \
a3 = MASK_NOT( a3 ); \
a4 = MASK_NOT( a4 ); \
a5 = MASK_NOT( a5 ); \
a6 = MASK_NOT( a6 ); \
a7 = _mm512_xor_si512( a7, mm512_bcast_m128( round_const_l7[i] ) );\
\
/* ShiftBytes + SubBytes (interleaved) */\
b0 = _mm512_xor_si512( b0, b0 );\
@@ -450,7 +352,7 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
* outputs: (i0-7) = (0|S)
*/
#define Matrix_Transpose_O_B(i0, i1, i2, i3, i4, i5, i6, i7, t0){\
t0 = _mm512_xor_si512( t0, t0 );\
t0 = m512_zero;\
i1 = i0;\
i3 = i2;\
i5 = i4;\
@@ -481,11 +383,11 @@ static const __m512i SUBSH_MASK7 = { 0x090c000306080b07, 0x02050f0a0d01040e,
void TF512_4way( __m512i* chaining, __m512i* message )
{
static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
static __m512i TEMP0;
static __m512i TEMP1;
static __m512i TEMP2;
__m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m512i TEMP0;
__m512i TEMP1;
__m512i TEMP2;
/* load message into registers xmm12 - xmm15 */
xmm12 = message[0];
@@ -547,11 +449,11 @@ void TF512_4way( __m512i* chaining, __m512i* message )
void OF512_4way( __m512i* chaining )
{
static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
static __m512i TEMP0;
static __m512i TEMP1;
static __m512i TEMP2;
__m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m512i TEMP0;
__m512i TEMP1;
__m512i TEMP2;
/* load CV into registers xmm8, xmm10, xmm12, xmm14 */
xmm8 = chaining[0];
@@ -637,7 +539,7 @@ static const __m256i SUBSH_MASK7_2WAY =
j = _mm256_cmpgt_epi8(j, i );\
i = _mm256_add_epi8(i, i);\
j = _mm256_and_si256(j, k);\
i = _mm256_xor_si256(i, j);\
i = mm256_xorand( i, j, k );\
}
#define MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
@@ -648,7 +550,7 @@ static const __m256i SUBSH_MASK7_2WAY =
b0 = a2;\
a1 = _mm256_xor_si256(a1, a2);\
b1 = a3;\
a2 = _mm256_xor_si256(a2, a3);\
TEMP2 = _mm256_xor_si256(a2, a3);\
b2 = a4;\
a3 = _mm256_xor_si256(a3, a4);\
b3 = a5;\
@@ -660,34 +562,20 @@ static const __m256i SUBSH_MASK7_2WAY =
a7 = _mm256_xor_si256(a7, b6);\
\
/* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
b0 = _mm256_xor_si256(b0, a4);\
b6 = _mm256_xor_si256(b6, a4);\
b1 = _mm256_xor_si256(b1, a5);\
b7 = _mm256_xor_si256(b7, a5);\
b2 = _mm256_xor_si256(b2, a6);\
b0 = _mm256_xor_si256(b0, a6);\
/* spill values y_4, y_5 to memory */\
TEMP0 = b0;\
b3 = _mm256_xor_si256(b3, a7);\
b1 = _mm256_xor_si256(b1, a7);\
TEMP1 = b1;\
b4 = _mm256_xor_si256(b4, a0);\
b2 = _mm256_xor_si256(b2, a0);\
/* save values t0, t1, t2 to xmm8, xmm9 and memory */\
b0 = a0;\
b5 = _mm256_xor_si256(b5, a1);\
b3 = _mm256_xor_si256(b3, a1);\
b1 = a1;\
b6 = _mm256_xor_si256(b6, a2);\
b4 = _mm256_xor_si256(b4, a2);\
TEMP2 = a2;\
b7 = _mm256_xor_si256(b7, a3);\
b5 = _mm256_xor_si256(b5, a3);\
\
TEMP0 = mm256_xor3( b0, a4, a6 ); \
TEMP1 = mm256_xor3( b1, a5, a7 ); \
b2 = mm256_xor3( b2, a6, a0 ); \
b0 = a0; \
b3 = mm256_xor3( b3, a7, a1 ); \
b1 = a1; \
b6 = mm256_xor3( b6, a4, TEMP2 ); \
b4 = mm256_xor3( b4, a0, TEMP2 ); \
b7 = mm256_xor3( b7, a5, a3 ); \
b5 = mm256_xor3( b5, a1, a3 ); \
/* compute x_i = t_i + t_{i+3} */\
a0 = _mm256_xor_si256(a0, a3);\
a1 = _mm256_xor_si256(a1, a4);\
a2 = _mm256_xor_si256(a2, a5);\
a2 = _mm256_xor_si256( TEMP2, a5);\
a3 = _mm256_xor_si256(a3, a6);\
a4 = _mm256_xor_si256(a4, a7);\
a5 = _mm256_xor_si256(a5, b0);\
@@ -696,7 +584,7 @@ static const __m256i SUBSH_MASK7_2WAY =
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m256_const1_64( 0x1b1b1b1b1b1b1b1b );\
b1 = _mm256_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
MUL2_2WAY(a0, b0, b1);\
a0 = _mm256_xor_si256(a0, TEMP0);\
MUL2_2WAY(a1, b0, b1);\
@@ -738,15 +626,15 @@ static const __m256i SUBSH_MASK7_2WAY =
#define ROUND_2WAY(i, a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7){\
/* AddRoundConstant */\
b1 = m256_const2_64( 0xffffffffffffffff, 0 ); \
a0 = _mm256_xor_si256( a0, m256_const1_128( round_const_l0[i] ) );\
b1 = mm256_bcast_m128( mm128_mask_32( m128_neg1, 0x3 ) ); \
a0 = _mm256_xor_si256( a0, mm256_bcast_m128( round_const_l0[i] ) );\
a1 = _mm256_xor_si256( a1, b1 );\
a2 = _mm256_xor_si256( a2, b1 );\
a3 = _mm256_xor_si256( a3, b1 );\
a4 = _mm256_xor_si256( a4, b1 );\
a5 = _mm256_xor_si256( a5, b1 );\
a6 = _mm256_xor_si256( a6, b1 );\
a7 = _mm256_xor_si256( a7, m256_const1_128( round_const_l7[i] ) );\
a7 = _mm256_xor_si256( a7, mm256_bcast_m128( round_const_l7[i] ) );\
\
/* ShiftBytes + SubBytes (interleaved) */\
b0 = _mm256_xor_si256( b0, b0 );\
@@ -769,7 +657,6 @@ static const __m256i SUBSH_MASK7_2WAY =
\
/* MixBytes */\
MixBytes_2way(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7);\
\
}
/* 10 rounds, P and Q in parallel */
@@ -850,7 +737,7 @@ static const __m256i SUBSH_MASK7_2WAY =
}/**/
#define Matrix_Transpose_O_B_2way(i0, i1, i2, i3, i4, i5, i6, i7, t0){\
t0 = _mm256_xor_si256( t0, t0 );\
t0 = m256_zero;\
i1 = i0;\
i3 = i2;\
i5 = i4;\
@@ -874,11 +761,11 @@ static const __m256i SUBSH_MASK7_2WAY =
void TF512_2way( __m256i* chaining, __m256i* message )
{
static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
static __m256i TEMP0;
static __m256i TEMP1;
static __m256i TEMP2;
__m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m256i TEMP0;
__m256i TEMP1;
__m256i TEMP2;
/* load message into registers xmm12 - xmm15 */
xmm12 = message[0];
@@ -940,11 +827,11 @@ void TF512_2way( __m256i* chaining, __m256i* message )
void OF512_2way( __m256i* chaining )
{
static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
static __m256i TEMP0;
static __m256i TEMP1;
static __m256i TEMP2;
__m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m256i TEMP0;
__m256i TEMP1;
__m256i TEMP2;
/* load CV into registers xmm8, xmm10, xmm12, xmm14 */
xmm8 = chaining[0];
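
Note on the 4-way ROUND macro in the hunks above: the old code XORed a1..a6 with a vector holding 0xffffffffffffffff in the high half of every 128-bit lane, and the new code uses MASK_NOT, a masked ternary-logic operation with mask 0xaa and immediate 1. The scalar sketch below models both and checks that they agree. It assumes the documented semantics of the masked ternary-logic intrinsic: unselected elements keep the source value, and each result bit is looked up in the immediate using the three input bits as an index. The model_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_ELEMS 8   /* one 512-bit vector viewed as eight 64-bit elements */

/* Bitwise ternary logic on one 64-bit element:
   result bit = imm bit number (a_bit<<2 | b_bit<<1 | c_bit). */
static uint64_t model_ternlog64( uint64_t a, uint64_t b, uint64_t c,
                                 uint8_t imm )
{
   uint64_t r = 0;
   for ( int bit = 0; bit < 64; bit++ )
   {
      unsigned idx = (unsigned)( ( ( a >> bit ) & 1 ) << 2
                               | ( ( b >> bit ) & 1 ) << 1
                               | ( ( c >> bit ) & 1 ) );
      r |= (uint64_t)( ( imm >> idx ) & 1 ) << bit;
   }
   return r;
}

/* New form: apply imm 1 (true only when all inputs are 0, i.e. NOT when the
   three inputs are the same value) to the elements selected by mask 0xaa. */
static void model_mask_not( uint64_t v[ MODEL_ELEMS ] )
{
   for ( int i = 0; i < MODEL_ELEMS; i++ )
      if ( ( 0xaa >> i ) & 1 )
         v[ i ] = model_ternlog64( v[ i ], v[ i ], v[ i ], 1 );
}

/* Old form: XOR every 128-bit lane with { 0xffffffffffffffff, 0 }. */
static void model_xor_hi_ones( uint64_t v[ MODEL_ELEMS ] )
{
   for ( int i = 1; i < MODEL_ELEMS; i += 2 )
      v[ i ] ^= UINT64_MAX;
}

/* Returns 1 when both forms produce the same vector for a sample input. */
static int model_mask_not_selfcheck( void )
{
   uint64_t a[ MODEL_ELEMS ], b[ MODEL_ELEMS ];
   for ( int i = 0; i < MODEL_ELEMS; i++ )
      a[ i ] = b[ i ] = 0x0123456789abcdefULL * (uint64_t)( i + 1 );
   model_mask_not( a );
   model_xor_hi_ones( b );
   for ( int i = 0; i < MODEL_ELEMS; i++ )
      if ( a[ i ] != b[ i ] ) return 0;
   return 1;
}
```

Mask 0xaa selects elements 1, 3, 5 and 7, which are the high halves of the four 128-bit lanes, so the masked NOT needs no constant vector in b1.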

View File

@@ -21,15 +21,11 @@
int groestl512_4way_init( groestl512_4way_context* ctx, uint64_t hashlen )
{
if (ctx->chaining == NULL || ctx->buffer == NULL)
return 1;
memset_zero_512( ctx->chaining, SIZE512 );
memset_zero_512( ctx->buffer, SIZE512 );
// The only non-zero in the IV is len. It can be hard coded.
ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );
ctx->chaining[ 6 ] = mm512_bcast128hi_64( 0x0200000000000000 );
ctx->buf_ptr = 0;
ctx->rem_ptr = 0;
@@ -64,14 +60,14 @@ int groestl512_4way_update_close( groestl512_4way_context* ctx, void* output,
if ( i == SIZE512 - 1 )
{
// only 1 vector left in buffer, all padding at once
ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
}
else
{
ctx->buffer[i] = m512_const2_64( 0, 0x80 );
ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
for ( i += 1; i < SIZE512 - 1; i++ )
ctx->buffer[i] = m512_zero;
ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
}
TF1024_4way( ctx->chaining, ctx->buffer );
@@ -97,7 +93,7 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
memset_zero_512( ctx->chaining, SIZE512 );
memset_zero_512( ctx->buffer, SIZE512 );
ctx->chaining[ 6 ] = m512_const2_64( 0x0200000000000000, 0 );
ctx->chaining[ 6 ] = mm512_bcast128hi_64( 0x0200000000000000 );
ctx->buf_ptr = 0;
// --- update ---
@@ -116,14 +112,14 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
if ( i == SIZE512 - 1 )
{
// only 1 vector left in buffer, all padding at once
ctx->buffer[i] = m512_const2_64( blocks << 56, 0x80 );
ctx->buffer[i] = mm512_set2_64( blocks << 56, 0x80 );
}
else
{
ctx->buffer[i] = m512_const2_64( 0, 0x80 );
ctx->buffer[i] = mm512_bcast128lo_64( 0x80 );
for ( i += 1; i < SIZE512 - 1; i++ )
ctx->buffer[i] = m512_zero;
ctx->buffer[i] = m512_const2_64( blocks << 56, 0 );
ctx->buffer[i] = mm512_bcast128hi_64( blocks << 56 );
}
TF1024_4way( ctx->chaining, ctx->buffer );
@@ -142,14 +138,11 @@ int groestl512_4way_full( groestl512_4way_context* ctx, void* output,
int groestl512_2way_init( groestl512_2way_context* ctx, uint64_t hashlen )
{
if (ctx->chaining == NULL || ctx->buffer == NULL)
return 1;
memset_zero_256( ctx->chaining, SIZE512 );
memset_zero_256( ctx->buffer, SIZE512 );
// The only non-zero in the IV is len. It can be hard coded.
ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
ctx->chaining[ 6 ] = mm256_bcast128hi_64( 0x0200000000000000 );
ctx->buf_ptr = 0;
ctx->rem_ptr = 0;
@@ -185,14 +178,14 @@ int groestl512_2way_update_close( groestl512_2way_context* ctx, void* output,
if ( i == SIZE512 - 1 )
{
// only 1 vector left in buffer, all padding at once
ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
}
else
{
ctx->buffer[i] = m256_const2_64( 0, 0x80 );
ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
for ( i += 1; i < SIZE512 - 1; i++ )
ctx->buffer[i] = m256_zero;
ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
}
TF1024_2way( ctx->chaining, ctx->buffer );
@@ -218,7 +211,7 @@ int groestl512_2way_full( groestl512_2way_context* ctx, void* output,
memset_zero_256( ctx->chaining, SIZE512 );
memset_zero_256( ctx->buffer, SIZE512 );
ctx->chaining[ 6 ] = m256_const2_64( 0x0200000000000000, 0 );
ctx->chaining[ 6 ] = mm256_bcast128hi_64( 0x0200000000000000 );
ctx->buf_ptr = 0;
// --- update ---
@@ -237,14 +230,14 @@ int groestl512_2way_full( groestl512_2way_context* ctx, void* output,
if ( i == SIZE512 - 1 )
{
// only 1 vector left in buffer, all padding at once
ctx->buffer[i] = m256_const2_64( blocks << 56, 0x80 );
ctx->buffer[i] = mm256_set2_64( blocks << 56, 0x80 );
}
else
{
ctx->buffer[i] = m256_const2_64( 0, 0x80 );
ctx->buffer[i] = mm256_bcast128lo_64( 0x80 );
for ( i += 1; i < SIZE512 - 1; i++ )
ctx->buffer[i] = m256_zero;
ctx->buffer[i] = m256_const2_64( blocks << 56, 0 );
ctx->buffer[i] = mm256_bcast128hi_64( blocks << 56 );
}
TF1024_2way( ctx->chaining, ctx->buffer );
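
The padding hunks above all build the same 128-bit lane pattern: 0x80 in the low 64 bits marks the start of the padding, and `blocks << 56` in the high 64 bits places the block count in the last byte of the lane. Below is a minimal scalar sketch of one lane, not the project's code. `pad_lane` and `check_pad_lane` are hypothetical names, and the (hi, lo) argument order is inferred from the two-argument calls in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of cpuminer-opt: lay out one 16-byte lane
   as two little-endian 64-bit words, low word first. The (hi, lo) argument
   order mirrors the two-argument calls in the hunks above. */
void pad_lane( uint8_t lane[16], uint64_t hi, uint64_t lo )
{
   for ( int i = 0; i < 8; i++ )
   {
      lane[i]     = (uint8_t)( lo >> ( 8 * i ) );
      lane[i + 8] = (uint8_t)( hi >> ( 8 * i ) );
   }
}

/* Single-vector case: padding marker and block count share one lane. */
void check_pad_lane( void )
{
   uint8_t lane[16];
   const uint64_t blocks = 3;            /* illustrative block count */
   pad_lane( lane, blocks << 56, 0x80 );
   assert( lane[0] == 0x80 );            /* first padding byte */
   assert( lane[15] == 3 );              /* count ends the lane */
   for ( int i = 1; i < 15; i++ )
      assert( lane[i] == 0 );            /* zero fill in between */
}
```

When more than one vector remains, the same two values are split: the low-word pattern goes in the first free vector and the high-word pattern in the last, with zero vectors between them.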


@@ -174,7 +174,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m512_const1_64( 0x1b1b1b1b1b1b1b1b ); \
b1 = _mm512_set1_epi64( 0x1b1b1b1b1b1b1b1b ); \
MUL2( a0, b0, b1 ); \
a0 = _mm512_xor_si512( a0, TEMP0 ); \
MUL2( a1, b0, b1 ); \
@@ -238,7 +238,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
for ( round_counter = 0; round_counter < 14; round_counter += 2 ) \
{ \
/* AddRoundConstant P1024 */\
xmm8 = _mm512_xor_si512( xmm8, m512_const1_128( \
xmm8 = _mm512_xor_si512( xmm8, mm512_bcast_m128( \
casti_m128i( round_const_p, round_counter ) ) ); \
/* ShiftBytes P1024 + pre-AESENCLAST */\
xmm8 = _mm512_shuffle_epi8( xmm8, SUBSH_MASK0 ); \
@@ -253,7 +253,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
SUBMIX(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
\
/* AddRoundConstant P1024 */\
xmm0 = _mm512_xor_si512( xmm0, m512_const1_128( \
xmm0 = _mm512_xor_si512( xmm0, mm512_bcast_m128( \
casti_m128i( round_const_p, round_counter+1 ) ) ); \
/* ShiftBytes P1024 + pre-AESENCLAST */\
xmm0 = _mm512_shuffle_epi8( xmm0, SUBSH_MASK0 );\
@@ -282,7 +282,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
xmm12 = _mm512_xor_si512( xmm12, xmm1 );\
xmm13 = _mm512_xor_si512( xmm13, xmm1 );\
xmm14 = _mm512_xor_si512( xmm14, xmm1 );\
xmm15 = _mm512_xor_si512( xmm15, m512_const1_128( \
xmm15 = _mm512_xor_si512( xmm15, mm512_bcast_m128( \
casti_m128i( round_const_q, round_counter ) ) ); \
/* ShiftBytes Q1024 + pre-AESENCLAST */\
xmm8 = _mm512_shuffle_epi8( xmm8, SUBSH_MASK1 );\
@@ -305,7 +305,7 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
xmm4 = _mm512_xor_si512( xmm4, xmm9 );\
xmm5 = _mm512_xor_si512( xmm5, xmm9 );\
xmm6 = _mm512_xor_si512( xmm6, xmm9 );\
xmm7 = _mm512_xor_si512( xmm7, m512_const1_128( \
xmm7 = _mm512_xor_si512( xmm7, mm512_bcast_m128( \
casti_m128i( round_const_q, round_counter+1 ) ) ); \
/* ShiftBytes Q1024 + pre-AESENCLAST */\
xmm0 = _mm512_shuffle_epi8( xmm0, SUBSH_MASK1 );\
@@ -471,8 +471,8 @@ static const __m512i SUBSH_MASK7 = { 0x06090c0f0205080b, 0x0e0104070a0d0003,
void INIT_4way( __m512i* chaining )
{
static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
/* load IV into registers xmm8 - xmm15 */
xmm8 = chaining[0];
@@ -500,12 +500,12 @@ void INIT_4way( __m512i* chaining )
void TF1024_4way( __m512i* chaining, const __m512i* message )
{
static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
static __m512i QTEMP[8];
static __m512i TEMP0;
static __m512i TEMP1;
static __m512i TEMP2;
__m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m512i QTEMP[8];
__m512i TEMP0;
__m512i TEMP1;
__m512i TEMP2;
/* load message into registers xmm8 - xmm15 (Q = message) */
xmm8 = message[0];
@@ -606,11 +606,11 @@ void TF1024_4way( __m512i* chaining, const __m512i* message )
void OF1024_4way( __m512i* chaining )
{
static __m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
static __m512i TEMP0;
static __m512i TEMP1;
static __m512i TEMP2;
__m512i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m512i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m512i TEMP0;
__m512i TEMP1;
__m512i TEMP2;
/* load CV into registers xmm8 - xmm15 */
xmm8 = chaining[0];
@@ -710,7 +710,7 @@ static const __m256i SUBSH_MASK7_2WAY =
b0 = a2;\
a1 = _mm256_xor_si256(a1, a2);\
b1 = a3;\
a2 = _mm256_xor_si256(a2, a3);\
TEMP2 = _mm256_xor_si256(a2, a3);\
b2 = a4;\
a3 = _mm256_xor_si256(a3, a4);\
b3 = a5;\
@@ -722,34 +722,23 @@ static const __m256i SUBSH_MASK7_2WAY =
a7 = _mm256_xor_si256(a7, b6);\
\
/* build y4 y5 y6 ... in regs xmm8, xmm9, xmm10 by adding t_i*/\
b0 = _mm256_xor_si256(b0, a4);\
b6 = _mm256_xor_si256(b6, a4);\
b1 = _mm256_xor_si256(b1, a5);\
b7 = _mm256_xor_si256(b7, a5);\
b2 = _mm256_xor_si256(b2, a6);\
b0 = _mm256_xor_si256(b0, a6);\
TEMP0 = mm256_xor3( b0, a4, a6 ); \
/* spill values y_4, y_5 to memory */\
TEMP0 = b0;\
b3 = _mm256_xor_si256(b3, a7);\
b1 = _mm256_xor_si256(b1, a7);\
TEMP1 = b1;\
b4 = _mm256_xor_si256(b4, a0);\
b2 = _mm256_xor_si256(b2, a0);\
TEMP1 = mm256_xor3( b1, a5, a7 ); \
b2 = mm256_xor3( b2, a6, a0 ); \
/* save values t0, t1, t2 to xmm8, xmm9 and memory */\
b0 = a0;\
b5 = _mm256_xor_si256(b5, a1);\
b3 = _mm256_xor_si256(b3, a1);\
b1 = a1;\
b6 = _mm256_xor_si256(b6, a2);\
b4 = _mm256_xor_si256(b4, a2);\
TEMP2 = a2;\
b7 = _mm256_xor_si256(b7, a3);\
b5 = _mm256_xor_si256(b5, a3);\
b0 = a0; \
b3 = mm256_xor3( b3, a7, a1 ); \
b1 = a1; \
b6 = mm256_xor3( b6, a4, TEMP2 ); \
b4 = mm256_xor3( b4, a0, TEMP2 ); \
b7 = mm256_xor3( b7, a5, a3 ); \
b5 = mm256_xor3( b5, a1, a3 ); \
\
/* compute x_i = t_i + t_{i+3} */\
a0 = _mm256_xor_si256(a0, a3);\
a1 = _mm256_xor_si256(a1, a4);\
a2 = _mm256_xor_si256(a2, a5);\
a2 = _mm256_xor_si256( TEMP2, a5);\
a3 = _mm256_xor_si256(a3, a6);\
a4 = _mm256_xor_si256(a4, a7);\
a5 = _mm256_xor_si256(a5, b0);\
@@ -758,7 +747,7 @@ static const __m256i SUBSH_MASK7_2WAY =
\
/* compute z_i : double x_i using temp xmm8 and 1B xmm9 */\
/* compute w_i : add y_{i+4} */\
b1 = m256_const1_64( 0x1b1b1b1b1b1b1b1b );\
b1 = _mm256_set1_epi64x( 0x1b1b1b1b1b1b1b1b );\
MUL2_2WAY(a0, b0, b1);\
a0 = _mm256_xor_si256(a0, TEMP0);\
MUL2_2WAY(a1, b0, b1);\
@@ -822,7 +811,7 @@ static const __m256i SUBSH_MASK7_2WAY =
for ( round_counter = 0; round_counter < 14; round_counter += 2 ) \
{ \
/* AddRoundConstant P1024 */\
xmm8 = _mm256_xor_si256( xmm8, m256_const1_128( \
xmm8 = _mm256_xor_si256( xmm8, mm256_bcast_m128( \
casti_m128i( round_const_p, round_counter ) ) ); \
/* ShiftBytes P1024 + pre-AESENCLAST */\
xmm8 = _mm256_shuffle_epi8( xmm8, SUBSH_MASK0_2WAY ); \
@@ -837,7 +826,7 @@ static const __m256i SUBSH_MASK7_2WAY =
SUBMIX_2WAY(xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7);\
\
/* AddRoundConstant P1024 */\
xmm0 = _mm256_xor_si256( xmm0, m256_const1_128( \
xmm0 = _mm256_xor_si256( xmm0, mm256_bcast_m128( \
casti_m128i( round_const_p, round_counter+1 ) ) ); \
/* ShiftBytes P1024 + pre-AESENCLAST */\
xmm0 = _mm256_shuffle_epi8( xmm0, SUBSH_MASK0_2WAY );\
@@ -866,7 +855,7 @@ static const __m256i SUBSH_MASK7_2WAY =
xmm12 = _mm256_xor_si256( xmm12, xmm1 );\
xmm13 = _mm256_xor_si256( xmm13, xmm1 );\
xmm14 = _mm256_xor_si256( xmm14, xmm1 );\
xmm15 = _mm256_xor_si256( xmm15, m256_const1_128( \
xmm15 = _mm256_xor_si256( xmm15, mm256_bcast_m128( \
casti_m128i( round_const_q, round_counter ) ) ); \
/* ShiftBytes Q1024 + pre-AESENCLAST */\
xmm8 = _mm256_shuffle_epi8( xmm8, SUBSH_MASK1_2WAY );\
@@ -889,7 +878,7 @@ static const __m256i SUBSH_MASK7_2WAY =
xmm4 = _mm256_xor_si256( xmm4, xmm9 );\
xmm5 = _mm256_xor_si256( xmm5, xmm9 );\
xmm6 = _mm256_xor_si256( xmm6, xmm9 );\
xmm7 = _mm256_xor_si256( xmm7, m256_const1_128( \
xmm7 = _mm256_xor_si256( xmm7, mm256_bcast_m128( \
casti_m128i( round_const_q, round_counter+1 ) ) ); \
/* ShiftBytes Q1024 + pre-AESENCLAST */\
xmm0 = _mm256_shuffle_epi8( xmm0, SUBSH_MASK1_2WAY );\
@@ -1040,8 +1029,8 @@ static const __m256i SUBSH_MASK7_2WAY =
void INIT_2way( __m256i *chaining )
{
static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
/* load IV into registers xmm8 - xmm15 */
xmm8 = chaining[0];
@@ -1069,12 +1058,12 @@ void INIT_2way( __m256i *chaining )
void TF1024_2way( __m256i *chaining, const __m256i *message )
{
static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
static __m256i QTEMP[8];
static __m256i TEMP0;
static __m256i TEMP1;
static __m256i TEMP2;
__m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m256i QTEMP[8];
__m256i TEMP0;
__m256i TEMP1;
__m256i TEMP2;
/* load message into registers xmm8 - xmm15 (Q = message) */
xmm8 = message[0];
@@ -1175,11 +1164,11 @@ void TF1024_2way( __m256i *chaining, const __m256i *message )
void OF1024_2way( __m256i* chaining )
{
static __m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
static __m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
static __m256i TEMP0;
static __m256i TEMP1;
static __m256i TEMP2;
__m256i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;
__m256i xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
__m256i TEMP0;
__m256i TEMP1;
__m256i TEMP2;
/* load CV into registers xmm8 - xmm15 */
xmm8 = chaining[0];


@@ -73,11 +73,11 @@ int scanhash_myriad( struct work *work, uint32_t max_nonce,
be32enc(&endiandata[19], nonce);
myriad_hash(hash, endiandata);
if (hash[7] <= Htarg && fulltest(hash, ptarget))
if (hash[7] <= Htarg )
if ( fulltest(hash, ptarget) && !opt_benchmark )
{
pdata[19] = nonce;
*hashes_done = pdata[19] - first_nonce;
return 1;
submit_solution( work, hash, mythr );
}
nonce++;


@@ -562,14 +562,14 @@ do { \
for ( int u = 0; u < 64; u++ ) \
{ \
const __mmask8 dm = _mm512_cmplt_epi64_mask( db, zero ); \
m0 = _mm512_mask_xor_epi64( m0, dm, m0, m512_const1_64( tp[0] ) ); \
m1 = _mm512_mask_xor_epi64( m1, dm, m1, m512_const1_64( tp[1] ) ); \
m2 = _mm512_mask_xor_epi64( m2, dm, m2, m512_const1_64( tp[2] ) ); \
m3 = _mm512_mask_xor_epi64( m3, dm, m3, m512_const1_64( tp[3] ) ); \
m4 = _mm512_mask_xor_epi64( m4, dm, m4, m512_const1_64( tp[4] ) ); \
m5 = _mm512_mask_xor_epi64( m5, dm, m5, m512_const1_64( tp[5] ) ); \
m6 = _mm512_mask_xor_epi64( m6, dm, m6, m512_const1_64( tp[6] ) ); \
m7 = _mm512_mask_xor_epi64( m7, dm, m7, m512_const1_64( tp[7] ) ); \
m0 = _mm512_mask_xor_epi64( m0, dm, m0, _mm512_set1_epi64( tp[0] ) ); \
m1 = _mm512_mask_xor_epi64( m1, dm, m1, _mm512_set1_epi64( tp[1] ) ); \
m2 = _mm512_mask_xor_epi64( m2, dm, m2, _mm512_set1_epi64( tp[2] ) ); \
m3 = _mm512_mask_xor_epi64( m3, dm, m3, _mm512_set1_epi64( tp[3] ) ); \
m4 = _mm512_mask_xor_epi64( m4, dm, m4, _mm512_set1_epi64( tp[4] ) ); \
m5 = _mm512_mask_xor_epi64( m5, dm, m5, _mm512_set1_epi64( tp[5] ) ); \
m6 = _mm512_mask_xor_epi64( m6, dm, m6, _mm512_set1_epi64( tp[6] ) ); \
m7 = _mm512_mask_xor_epi64( m7, dm, m7, _mm512_set1_epi64( tp[7] ) ); \
db = _mm512_ror_epi64( db, 1 ); \
tp += 8; \
} \
@@ -585,9 +585,8 @@ do { \
t = _mm512_xor_si512( t, c ); \
d = mm512_xoror( a, b, t ); \
t = mm512_xorand( t, a, b ); \
b = mm512_xor3( b, d, t ); \
a = c; \
c = b; \
c = mm512_xor3( b, d, t ); \
b = d; \
d = mm512_not( t ); \
} while (0)
@@ -635,7 +634,7 @@ do { \
#define ROUND_BIG8( alpha ) \
do { \
__m512i t0, t1, t2, t3; \
__m512i t0, t1, t2, t3, t4, t5; \
s0 = _mm512_xor_si512( s0, alpha[ 0] ); /* m0 */ \
s1 = _mm512_xor_si512( s1, alpha[ 1] ); /* c0 */ \
s2 = _mm512_xor_si512( s2, alpha[ 2] ); /* m1 */ \
@@ -662,43 +661,35 @@ do { \
s5 = mm512_swap64_32( s5 ); \
sD = mm512_swap64_32( sD ); \
sE = mm512_swap64_32( sE ); \
t1 = _mm512_mask_blend_epi32( 0xaaaa, s4, s5 ); \
t3 = _mm512_mask_blend_epi32( 0xaaaa, sD, sE ); \
L8( s0, t1, s9, t3 ); \
s4 = _mm512_mask_blend_epi32( 0x5555, s4, t1 ); \
s5 = _mm512_mask_blend_epi32( 0xaaaa, s5, t1 ); \
sD = _mm512_mask_blend_epi32( 0x5555, sD, t3 ); \
sE = _mm512_mask_blend_epi32( 0xaaaa, sE, t3 ); \
t0 = _mm512_mask_blend_epi32( 0xaaaa, s4, s5 ); \
t1 = _mm512_mask_blend_epi32( 0xaaaa, sD, sE ); \
L8( s0, t0, s9, t1 ); \
\
s6 = mm512_swap64_32( s6 ); \
sF = mm512_swap64_32( sF ); \
t1 = _mm512_mask_blend_epi32( 0xaaaa, s5, s6 ); \
t2 = _mm512_mask_blend_epi32( 0xaaaa, s5, s6 ); \
t3 = _mm512_mask_blend_epi32( 0xaaaa, sE, sF ); \
L8( s1, t1, sA, t3 ); \
s5 = _mm512_mask_blend_epi32( 0x5555, s5, t1 ); \
s6 = _mm512_mask_blend_epi32( 0xaaaa, s6, t1 ); \
sE = _mm512_mask_blend_epi32( 0x5555, sE, t3 ); \
sF = _mm512_mask_blend_epi32( 0xaaaa, sF, t3 ); \
L8( s1, t2, sA, t3 ); \
s5 = _mm512_mask_blend_epi32( 0x5555, t0, t2 ); \
sE = _mm512_mask_blend_epi32( 0x5555, t1, t3 ); \
\
s7 = mm512_swap64_32( s7 ); \
sC = mm512_swap64_32( sC ); \
t1 = _mm512_mask_blend_epi32( 0xaaaa, s6, s7 ); \
t3 = _mm512_mask_blend_epi32( 0xaaaa, sF, sC ); \
L8( s2, t1, sB, t3 ); \
s6 = _mm512_mask_blend_epi32( 0x5555, s6, t1 ); \
s7 = _mm512_mask_blend_epi32( 0xaaaa, s7, t1 ); \
sF = _mm512_mask_blend_epi32( 0x5555, sF, t3 ); \
sC = _mm512_mask_blend_epi32( 0xaaaa, sC, t3 ); \
t4 = _mm512_mask_blend_epi32( 0xaaaa, s6, s7 ); \
t5 = _mm512_mask_blend_epi32( 0xaaaa, sF, sC ); \
L8( s2, t4, sB, t5 ); \
s6 = _mm512_mask_blend_epi32( 0x5555, t2, t4 ); \
sF = _mm512_mask_blend_epi32( 0x5555, t3, t5 ); \
s6 = mm512_swap64_32( s6 ); \
sF = mm512_swap64_32( sF ); \
\
t1 = _mm512_mask_blend_epi32( 0xaaaa, s7, s4 ); \
t2 = _mm512_mask_blend_epi32( 0xaaaa, s7, s4 ); \
t3 = _mm512_mask_blend_epi32( 0xaaaa, sC, sD ); \
L8( s3, t1, s8, t3 ); \
s7 = _mm512_mask_blend_epi32( 0x5555, s7, t1 ); \
s4 = _mm512_mask_blend_epi32( 0xaaaa, s4, t1 ); \
sC = _mm512_mask_blend_epi32( 0x5555, sC, t3 ); \
sD = _mm512_mask_blend_epi32( 0xaaaa, sD, t3 ); \
L8( s3, t2, s8, t3 ); \
s7 = _mm512_mask_blend_epi32( 0x5555, t4, t2 ); \
s4 = _mm512_mask_blend_epi32( 0xaaaa, t0, t2 ); \
sC = _mm512_mask_blend_epi32( 0x5555, t5, t3 ); \
sD = _mm512_mask_blend_epi32( 0xaaaa, t1, t3 ); \
s7 = mm512_swap64_32( s7 ); \
sC = mm512_swap64_32( sC ); \
\
@@ -742,17 +733,17 @@ do { \
__m512i alpha[16]; \
const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
for( int i = 0; i < 16; i++ ) \
alpha[i] = m512_const1_64( ( (uint64_t*)alpha_n )[i] ); \
alpha[i] = _mm512_set1_epi64( ( (uint64_t*)alpha_n )[i] ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( (1ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( (1ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( (2ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( (2ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( (3ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( (3ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( (4ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( (4ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( (5ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( (5ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
} while (0)
@@ -761,29 +752,29 @@ do { \
__m512i alpha[16]; \
const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
for( int i = 0; i < 16; i++ ) \
alpha[i] = m512_const1_64( ( (uint64_t*)alpha_f )[i] ); \
alpha[i] = _mm512_set1_epi64( ( (uint64_t*)alpha_f )[i] ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 1ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 1ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 2ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 2ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 3ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 3ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 4ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 4ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 5ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 5ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 6ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 6ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 7ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 7ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 8ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 8ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( ( 9ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( ( 9ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( (10ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( (10ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
alpha[0] = m512_const1_64( (11ULL << 32) ^ A0 ); \
alpha[0] = _mm512_set1_epi64( (11ULL << 32) ^ A0 ); \
ROUND_BIG8( alpha ); \
} while (0)
@@ -838,14 +829,14 @@ void hamsi512_8way_init( hamsi_8way_big_context *sc )
sc->partial_len = 0;
sc->count_high = sc->count_low = 0;
sc->h[0] = m512_const1_64( 0x6c70617273746565 );
sc->h[1] = m512_const1_64( 0x656e62656b204172 );
sc->h[2] = m512_const1_64( 0x302c206272672031 );
sc->h[3] = m512_const1_64( 0x3434362c75732032 );
sc->h[4] = m512_const1_64( 0x3030312020422d33 );
sc->h[5] = m512_const1_64( 0x656e2d484c657576 );
sc->h[6] = m512_const1_64( 0x6c65652c65766572 );
sc->h[7] = m512_const1_64( 0x6769756d2042656c );
sc->h[0] = _mm512_set1_epi64( 0x6c70617273746565 );
sc->h[1] = _mm512_set1_epi64( 0x656e62656b204172 );
sc->h[2] = _mm512_set1_epi64( 0x302c206272672031 );
sc->h[3] = _mm512_set1_epi64( 0x3434362c75732032 );
sc->h[4] = _mm512_set1_epi64( 0x3030312020422d33 );
sc->h[5] = _mm512_set1_epi64( 0x656e2d484c657576 );
sc->h[6] = _mm512_set1_epi64( 0x6c65652c65766572 );
sc->h[7] = _mm512_set1_epi64( 0x6769756d2042656c );
}
void hamsi512_8way_update( hamsi_8way_big_context *sc, const void *data,
@@ -868,7 +859,7 @@ void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )
sph_enc32be( &ch, sc->count_high );
sph_enc32be( &cl, sc->count_low + ( sc->partial_len << 3 ) );
pad[0] = _mm512_set1_epi64( ((uint64_t)cl << 32 ) | (uint64_t)ch );
sc->buf[0] = m512_const1_64( 0x80 );
sc->buf[0] = _mm512_set1_epi64( 0x80 );
hamsi_8way_big( sc, sc->buf, 1 );
hamsi_8way_big_final( sc, pad );
@@ -879,6 +870,32 @@ void hamsi512_8way_close( hamsi_8way_big_context *sc, void *dst )
// Hamsi 4 way AVX2
#if defined(__AVX512VL__)
#define INPUT_BIG \
do { \
__m256i db = _mm256_ror_epi64( *buf, 1 ); \
const __m256i zero = m256_zero; \
const uint64_t *tp = (const uint64_t*)T512; \
m0 = m1 = m2 = m3 = m4 = m5 = m6 = m7 = zero; \
for ( int u = 0; u < 64; u++ ) \
{ \
const __mmask8 dm = _mm256_cmplt_epi64_mask( db, zero ); \
m0 = _mm256_mask_xor_epi64( m0, dm, m0, _mm256_set1_epi64x( tp[0] ) ); \
m1 = _mm256_mask_xor_epi64( m1, dm, m1, _mm256_set1_epi64x( tp[1] ) ); \
m2 = _mm256_mask_xor_epi64( m2, dm, m2, _mm256_set1_epi64x( tp[2] ) ); \
m3 = _mm256_mask_xor_epi64( m3, dm, m3, _mm256_set1_epi64x( tp[3] ) ); \
m4 = _mm256_mask_xor_epi64( m4, dm, m4, _mm256_set1_epi64x( tp[4] ) ); \
m5 = _mm256_mask_xor_epi64( m5, dm, m5, _mm256_set1_epi64x( tp[5] ) ); \
m6 = _mm256_mask_xor_epi64( m6, dm, m6, _mm256_set1_epi64x( tp[6] ) ); \
m7 = _mm256_mask_xor_epi64( m7, dm, m7, _mm256_set1_epi64x( tp[7] ) ); \
db = _mm256_ror_epi64( db, 1 ); \
tp += 8; \
} \
} while (0)
#else
#define INPUT_BIG \
do { \
__m256i db = *buf; \
@@ -889,25 +906,58 @@ do { \
{ \
__m256i dm = _mm256_cmpgt_epi64( zero, _mm256_slli_epi64( db, u ) ); \
m0 = _mm256_xor_si256( m0, _mm256_and_si256( dm, \
m256_const1_64( tp[0] ) ) ); \
_mm256_set1_epi64x( tp[0] ) ) ); \
m1 = _mm256_xor_si256( m1, _mm256_and_si256( dm, \
m256_const1_64( tp[1] ) ) ); \
_mm256_set1_epi64x( tp[1] ) ) ); \
m2 = _mm256_xor_si256( m2, _mm256_and_si256( dm, \
m256_const1_64( tp[2] ) ) ); \
_mm256_set1_epi64x( tp[2] ) ) ); \
m3 = _mm256_xor_si256( m3, _mm256_and_si256( dm, \
m256_const1_64( tp[3] ) ) ); \
_mm256_set1_epi64x( tp[3] ) ) ); \
m4 = _mm256_xor_si256( m4, _mm256_and_si256( dm, \
m256_const1_64( tp[4] ) ) ); \
_mm256_set1_epi64x( tp[4] ) ) ); \
m5 = _mm256_xor_si256( m5, _mm256_and_si256( dm, \
m256_const1_64( tp[5] ) ) ); \
_mm256_set1_epi64x( tp[5] ) ) ); \
m6 = _mm256_xor_si256( m6, _mm256_and_si256( dm, \
m256_const1_64( tp[6] ) ) ); \
_mm256_set1_epi64x( tp[6] ) ) ); \
m7 = _mm256_xor_si256( m7, _mm256_and_si256( dm, \
m256_const1_64( tp[7] ) ) ); \
_mm256_set1_epi64x( tp[7] ) ) ); \
tp += 8; \
} \
} while (0)
#endif
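
The AVX512VL variant of INPUT_BIG above picks table rows with a sign-bit test on a rotating copy of the input: `db` starts as the input rotated right by 1, so bit 0 sits in the sign position, and each further rotate brings up the next bit. Below is a scalar sketch of that scan order for a single 64-bit lane. `rotr64` and `scan_bits` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right; n must be in 1..63 to avoid an undefined shift. */
uint64_t rotr64( uint64_t x, unsigned n )
{
   return ( x >> n ) | ( x << ( 64 - n ) );
}

/* Rebuild the input from successive sign-bit tests.
   Step u observes bit u of the original input. */
uint64_t scan_bits( uint64_t in )
{
   uint64_t db = rotr64( in, 1 );
   uint64_t seen = 0;
   for ( int u = 0; u < 64; u++ )
   {
      if ( db >> 63 )                 /* sign bit set, like cmplt( db, zero ) */
         seen |= (uint64_t)1 << u;
      db = rotr64( db, 1 );
   }
   return seen;
}

void check_scan_bits( void )
{
   assert( scan_bits( 0x8000000000000001ULL ) == 0x8000000000000001ULL );
   assert( scan_bits( 0x0123456789abcdefULL ) == 0x0123456789abcdefULL );
}
```

Because the reconstructed value equals the input, the scan visits every bit exactly once, lowest bit first.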
#define SBOX( a, b, c, d ) \
do { \
__m256i t; \
t = a; \
a = mm256_xorand( d, a, c ); \
c = mm256_xor3( a, b, c ); \
b = mm256_xoror( b, d, t ); \
t = _mm256_xor_si256( t, c ); \
d = mm256_xoror( a, b, t ); \
t = mm256_xorand( t, a, b ); \
a = c; \
c = mm256_xor3( b, d, t ); \
b = d; \
d = mm256_not( t ); \
} while (0)
#define L( a, b, c, d ) \
do { \
a = mm256_rol_32( a, 13 ); \
c = mm256_rol_32( c, 3 ); \
b = mm256_xor3( a, b, c ); \
d = mm256_xor3( d, c, _mm256_slli_epi32( a, 3 ) ); \
b = mm256_rol_32( b, 1 ); \
d = mm256_rol_32( d, 7 ); \
a = mm256_xor3( a, b, d ); \
c = mm256_xor3( c, d, _mm256_slli_epi32( b, 7 ) ); \
a = mm256_rol_32( a, 5 ); \
c = mm256_rol_32( c, 22 ); \
} while (0)
/*
#define SBOX( a, b, c, d ) \
do { \
__m256i t; \
@@ -924,10 +974,9 @@ do { \
d = _mm256_xor_si256( d, a ); \
a = _mm256_and_si256( a, b ); \
t = _mm256_xor_si256( t, a ); \
b = _mm256_xor_si256( b, d ); \
b = _mm256_xor_si256( b, t ); \
a = c; \
c = b; \
c = _mm256_xor_si256( b, d ); \
c = _mm256_xor_si256( c, t ); \
b = d; \
d = mm256_not( t ); \
} while (0)
@@ -947,6 +996,7 @@ do { \
a = mm256_rol_32( a, 5 ); \
c = mm256_rol_32( c, 22 ); \
} while (0)
*/
#define DECL_STATE_BIG \
__m256i c0, c1, c2, c3, c4, c5, c6, c7; \
@@ -977,7 +1027,7 @@ do { \
#define ROUND_BIG( alpha ) \
do { \
__m256i t0, t1, t2, t3; \
__m256i t0, t1, t2, t3, t4, t5; \
s0 = _mm256_xor_si256( s0, alpha[ 0] ); \
s1 = _mm256_xor_si256( s1, alpha[ 1] ); \
s2 = _mm256_xor_si256( s2, alpha[ 2] ); \
@@ -1004,43 +1054,35 @@ do { \
s5 = mm256_swap64_32( s5 ); \
sD = mm256_swap64_32( sD ); \
sE = mm256_swap64_32( sE ); \
t1 = _mm256_blend_epi32( s4, s5, 0xaa ); \
t3 = _mm256_blend_epi32( sD, sE, 0xaa ); \
L( s0, t1, s9, t3 ); \
s4 = _mm256_blend_epi32( s4, t1, 0x55 ); \
s5 = _mm256_blend_epi32( s5, t1, 0xaa ); \
sD = _mm256_blend_epi32( sD, t3, 0x55 ); \
sE = _mm256_blend_epi32( sE, t3, 0xaa ); \
t0 = _mm256_blend_epi32( s4, s5, 0xaa ); \
t1 = _mm256_blend_epi32( sD, sE, 0xaa ); \
L( s0, t0, s9, t1 ); \
\
s6 = mm256_swap64_32( s6 ); \
sF = mm256_swap64_32( sF ); \
t1 = _mm256_blend_epi32( s5, s6, 0xaa ); \
t2 = _mm256_blend_epi32( s5, s6, 0xaa ); \
t3 = _mm256_blend_epi32( sE, sF, 0xaa ); \
L( s1, t1, sA, t3 ); \
s5 = _mm256_blend_epi32( s5, t1, 0x55 ); \
s6 = _mm256_blend_epi32( s6, t1, 0xaa ); \
sE = _mm256_blend_epi32( sE, t3, 0x55 ); \
sF = _mm256_blend_epi32( sF, t3, 0xaa ); \
L( s1, t2, sA, t3 ); \
s5 = _mm256_blend_epi32( t0, t2, 0x55 ); \
sE = _mm256_blend_epi32( t1, t3, 0x55 ); \
\
s7 = mm256_swap64_32( s7 ); \
sC = mm256_swap64_32( sC ); \
t1 = _mm256_blend_epi32( s6, s7, 0xaa ); \
t3 = _mm256_blend_epi32( sF, sC, 0xaa ); \
L( s2, t1, sB, t3 ); \
s6 = _mm256_blend_epi32( s6, t1, 0x55 ); \
s7 = _mm256_blend_epi32( s7, t1, 0xaa ); \
sF = _mm256_blend_epi32( sF, t3, 0x55 ); \
sC = _mm256_blend_epi32( sC, t3, 0xaa ); \
t4 = _mm256_blend_epi32( s6, s7, 0xaa ); \
t5 = _mm256_blend_epi32( sF, sC, 0xaa ); \
L( s2, t4, sB, t5 ); \
s6 = _mm256_blend_epi32( t2, t4, 0x55 ); \
sF = _mm256_blend_epi32( t3, t5, 0x55 ); \
s6 = mm256_swap64_32( s6 ); \
sF = mm256_swap64_32( sF ); \
\
t1 = _mm256_blend_epi32( s7, s4, 0xaa ); \
t2 = _mm256_blend_epi32( s7, s4, 0xaa ); \
t3 = _mm256_blend_epi32( sC, sD, 0xaa ); \
L( s3, t1, s8, t3 ); \
s7 = _mm256_blend_epi32( s7, t1, 0x55 ); \
s4 = _mm256_blend_epi32( s4, t1, 0xaa ); \
sC = _mm256_blend_epi32( sC, t3, 0x55 ); \
sD = _mm256_blend_epi32( sD, t3, 0xaa ); \
L( s3, t2, s8, t3 ); \
s7 = _mm256_blend_epi32( t4, t2, 0x55 ); \
s4 = _mm256_blend_epi32( t0, t2, 0xaa ); \
sC = _mm256_blend_epi32( t5, t3, 0x55 ); \
sD = _mm256_blend_epi32( t1, t3, 0xaa ); \
s7 = mm256_swap64_32( s7 ); \
sC = mm256_swap64_32( sC ); \
\
@@ -1084,17 +1126,17 @@ do { \
__m256i alpha[16]; \
const uint64_t A0 = ( (uint64_t*)alpha_n )[0]; \
for( int i = 0; i < 16; i++ ) \
alpha[i] = m256_const1_64( ( (uint64_t*)alpha_n )[i] ); \
alpha[i] = _mm256_set1_epi64x( ( (uint64_t*)alpha_n )[i] ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( (1ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( (1ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( (2ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( (2ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( (3ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( (3ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( (4ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( (4ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( (5ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( (5ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
} while (0)
@@ -1103,29 +1145,29 @@ do { \
__m256i alpha[16]; \
const uint64_t A0 = ( (uint64_t*)alpha_f )[0]; \
for( int i = 0; i < 16; i++ ) \
alpha[i] = m256_const1_64( ( (uint64_t*)alpha_f )[i] ); \
alpha[i] = _mm256_set1_epi64x( ( (uint64_t*)alpha_f )[i] ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 1ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 1ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 2ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 2ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 3ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 3ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 4ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 4ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 5ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 5ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 6ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 6ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 7ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 7ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 8ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 8ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( ( 9ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( ( 9ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( (10ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( (10ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
alpha[0] = m256_const1_64( (11ULL << 32) ^ A0 ); \
alpha[0] = _mm256_set1_epi64x( (11ULL << 32) ^ A0 ); \
ROUND_BIG( alpha ); \
} while (0)
@@ -1181,14 +1223,14 @@ void hamsi512_4way_init( hamsi_4way_big_context *sc )
sc->partial_len = 0;
sc->count_high = sc->count_low = 0;
sc->h[0] = m256_const1_64( 0x6c70617273746565 );
sc->h[1] = m256_const1_64( 0x656e62656b204172 );
sc->h[2] = m256_const1_64( 0x302c206272672031 );
sc->h[3] = m256_const1_64( 0x3434362c75732032 );
sc->h[4] = m256_const1_64( 0x3030312020422d33 );
sc->h[5] = m256_const1_64( 0x656e2d484c657576 );
sc->h[6] = m256_const1_64( 0x6c65652c65766572 );
sc->h[7] = m256_const1_64( 0x6769756d2042656c );
sc->h[0] = _mm256_set1_epi64x( 0x6c70617273746565 );
sc->h[1] = _mm256_set1_epi64x( 0x656e62656b204172 );
sc->h[2] = _mm256_set1_epi64x( 0x302c206272672031 );
sc->h[3] = _mm256_set1_epi64x( 0x3434362c75732032 );
sc->h[4] = _mm256_set1_epi64x( 0x3030312020422d33 );
sc->h[5] = _mm256_set1_epi64x( 0x656e2d484c657576 );
sc->h[6] = _mm256_set1_epi64x( 0x6c65652c65766572 );
sc->h[7] = _mm256_set1_epi64x( 0x6769756d2042656c );
}
void hamsi512_4way_update( hamsi_4way_big_context *sc, const void *data,
@@ -1211,7 +1253,7 @@ void hamsi512_4way_close( hamsi_4way_big_context *sc, void *dst )
sph_enc32be( &ch, sc->count_high );
sph_enc32be( &cl, sc->count_low + ( sc->partial_len << 3 ) );
pad[0] = _mm256_set1_epi64x( ((uint64_t)cl << 32 ) | (uint64_t)ch );
sc->buf[0] = m256_const1_64( 0x80 );
sc->buf[0] = _mm256_set1_epi64x( 0x80 );
hamsi_big( sc, sc->buf, 1 );
hamsi_big_final( sc, pad );


@@ -52,6 +52,56 @@ extern "C"{
#define SPH_SMALL_FOOTPRINT_HAVAL 1
//#endif
#if defined(__AVX512VL__)
// ( ~( a ^ b ) ) & c
#define mm128_andnotxor( a, b, c ) \
_mm_ternarylogic_epi32( a, b, c, 0x82 )
#else
#define mm128_andnotxor( a, b, c ) \
_mm_andnot_si128( _mm_xor_si128( a, b ), c )
#endif
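
The immediate 0x82 passed to `_mm_ternarylogic_epi32` above is the truth table of `( ~( a ^ b ) ) & c`, indexed by `(a << 2) | (b << 1) | c`. Below is a small sketch that derives such an immediate from a single-bit function. `ternlog_imm` and `andnotxor_bit` are hypothetical helpers, not part of the source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the 8-bit truth table for f(a,b,c), where the
   table index is (a << 2) | (b << 1) | c as the instruction defines it. */
uint8_t ternlog_imm( int (*f)( int, int, int ) )
{
   uint8_t imm = 0;
   for ( int idx = 0; idx < 8; idx++ )
   {
      int a = ( idx >> 2 ) & 1, b = ( idx >> 1 ) & 1, c = idx & 1;
      if ( f( a, b, c ) & 1 )
         imm |= (uint8_t)( 1u << idx );
   }
   return imm;
}

/* ( ~( a ^ b ) ) & c evaluated on single bits */
int andnotxor_bit( int a, int b, int c )
{
   return ( ~( a ^ b ) ) & c & 1;
}

void check_andnotxor_imm( void )
{
   /* Only a == b with c == 1 yields 1: indices 1 and 7. */
   assert( ternlog_imm( andnotxor_bit ) == 0x82 );
}
```

The same helper can be used to cross-check other three-input immediates before hard-coding them.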
#define F1(x6, x5, x4, x3, x2, x1, x0) \
mm128_xor3( x0, mm128_andxor( x1, x0, x4 ), \
_mm_xor_si128( _mm_and_si128( x2, x5 ), \
_mm_and_si128( x3, x6 ) ) )
#define F2(x6, x5, x4, x3, x2, x1, x0) \
mm128_xor3( mm128_andxor( x2, _mm_andnot_si128( x3, x1 ), \
mm128_xor3( _mm_and_si128( x4, x5 ), x6, x0 ) ), \
mm128_andxor( x4, x1, x5 ), \
mm128_xorand( x0, x3, x5 ) )
#define F3(x6, x5, x4, x3, x2, x1, x0) \
mm128_xor3( x0, \
_mm_and_si128( x3, \
mm128_xor3( _mm_and_si128( x1, x2 ), x6, x0 ) ), \
_mm_xor_si128( _mm_and_si128( x1, x4 ), \
_mm_and_si128( x2, x5 ) ) )
#define F4(x6, x5, x4, x3, x2, x1, x0) \
mm128_xor3( \
mm128_andxor( x3, x5, \
_mm_xor_si128( _mm_and_si128( x1, x2 ), \
_mm_or_si128( x4, x6 ) ) ), \
_mm_and_si128( x4, \
mm128_xor3( x0, _mm_andnot_si128( x2, x5 ), \
_mm_xor_si128( x1, x6 ) ) ), \
mm128_xorand( x0, x2, x6 ) )
#define F5(x6, x5, x4, x3, x2, x1, x0) \
_mm_xor_si128( \
mm128_andnotxor( mm128_and3( x1, x2, x3 ), x5, x0 ), \
mm128_xor3( _mm_and_si128( x1, x4 ), \
_mm_and_si128( x2, x5 ), \
_mm_and_si128( x3, x6 ) ) )
/*
#define F1(x6, x5, x4, x3, x2, x1, x0) \
_mm_xor_si128( x0, \
_mm_xor_si128( _mm_and_si128(_mm_xor_si128( x0, x4 ), x1 ), \
@@ -96,6 +146,7 @@ extern "C"{
_mm_xor_si128( _mm_xor_si128( _mm_and_si128( x1, x4 ), \
_mm_and_si128( x2, x5 ) ), \
_mm_and_si128( x3, x6 ) ) )
*/
/*
* The macros below integrate the phi() permutations, depending on the
@@ -141,6 +192,13 @@ do { \
_mm_add_epi32( w, _mm_set1_epi32( c ) ) ); \
} while (0)
#define STEP1(n, p, x7, x6, x5, x4, x3, x2, x1, x0, w) \
do { \
__m128i t = FP ## n ## _ ## p(x6, x5, x4, x3, x2, x1, x0); \
x7 = _mm_add_epi32( _mm_add_epi32( mm128_ror_32( t, 7 ), \
mm128_ror_32( x7, 11 ) ), w ); \
} while (0)
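
STEP1 above drops the round-constant addition because pass 1 always passed a constant of zero. Below is a scalar sketch of why the two forms agree, assuming STEP has the shape suggested by the context line (`w` plus the constant added to the two rotated terms). `rotr32`, `step` and `step1` are hypothetical scalar stand-ins for the vector macros.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right; n must be in 1..31 to avoid an undefined shift. */
uint32_t rotr32( uint32_t x, unsigned n )
{
   return ( x >> n ) | ( x << ( 32 - n ) );
}

/* General step: rotated terms plus message word plus round constant. */
uint32_t step( uint32_t t, uint32_t x7, uint32_t w, uint32_t c )
{
   return rotr32( t, 7 ) + rotr32( x7, 11 ) + ( w + c );
}

/* Pass-1 step: the constant is zero, so its addition is omitted. */
uint32_t step1( uint32_t t, uint32_t x7, uint32_t w )
{
   return rotr32( t, 7 ) + rotr32( x7, 11 ) + w;
}

void check_step1( void )
{
   assert( step( 0x12345678u, 0x9abcdef0u, 0x0badf00du, 0 )
           == step1( 0x12345678u, 0x9abcdef0u, 0x0badf00du ) );
}
```

In the vector code this saves one broadcast and one add per step across the 32 steps of the first pass.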
/*
* PASSy(n, in) computes pass number "y", for a total of "n", using the
* one-argument macro "in" to access input words. Current state is assumed
@@ -152,22 +210,22 @@ do { \
#define PASS1(n, in) do { \
unsigned pass_count; \
for (pass_count = 0; pass_count < 32; pass_count += 8) { \
STEP(n, 1, s7, s6, s5, s4, s3, s2, s1, s0, \
in(pass_count + 0), SPH_C32(0x00000000)); \
STEP(n, 1, s6, s5, s4, s3, s2, s1, s0, s7, \
in(pass_count + 1), SPH_C32(0x00000000)); \
STEP(n, 1, s5, s4, s3, s2, s1, s0, s7, s6, \
in(pass_count + 2), SPH_C32(0x00000000)); \
STEP(n, 1, s4, s3, s2, s1, s0, s7, s6, s5, \
in(pass_count + 3), SPH_C32(0x00000000)); \
STEP(n, 1, s3, s2, s1, s0, s7, s6, s5, s4, \
in(pass_count + 4), SPH_C32(0x00000000)); \
STEP(n, 1, s2, s1, s0, s7, s6, s5, s4, s3, \
in(pass_count + 5), SPH_C32(0x00000000)); \
STEP(n, 1, s1, s0, s7, s6, s5, s4, s3, s2, \
in(pass_count + 6), SPH_C32(0x00000000)); \
STEP(n, 1, s0, s7, s6, s5, s4, s3, s2, s1, \
in(pass_count + 7), SPH_C32(0x00000000)); \
STEP1(n, 1, s7, s6, s5, s4, s3, s2, s1, s0, \
in(pass_count + 0) ); \
STEP1(n, 1, s6, s5, s4, s3, s2, s1, s0, s7, \
in(pass_count + 1) ); \
STEP1(n, 1, s5, s4, s3, s2, s1, s0, s7, s6, \
in(pass_count + 2) ); \
STEP1(n, 1, s4, s3, s2, s1, s0, s7, s6, s5, \
in(pass_count + 3) ); \
STEP1(n, 1, s3, s2, s1, s0, s7, s6, s5, s4, \
in(pass_count + 4) ); \
STEP1(n, 1, s2, s1, s0, s7, s6, s5, s4, s3, \
in(pass_count + 5) ); \
STEP1(n, 1, s1, s0, s7, s6, s5, s4, s3, s2, \
in(pass_count + 6) ); \
STEP1(n, 1, s0, s7, s6, s5, s4, s3, s2, s1, \
in(pass_count + 7) ); \
} \
} while (0)
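
Note on the STEP1 change above: pass 1 previously passed a constant of zero to every step, so the new macro simply drops that operand. A minimal scalar sketch of why this is behaviour-preserving follows. The names `ror32`, `haval_step` and `haval_step1` are hypothetical, and the full body of STEP is assumed from the tail visible in this diff.

```c
#include <stdint.h>

static uint32_t ror32( uint32_t x, unsigned n )   /* valid for 0 < n < 32 */
{  return ( x >> n ) | ( x << ( 32 - n ) );  }

/* Assumed shape of the general step: the input word is added together
 * with a per-step constant c. Only the tail of STEP is visible here. */
static uint32_t haval_step( uint32_t t, uint32_t x7, uint32_t w, uint32_t c )
{  return ror32( t, 7 ) + ror32( x7, 11 ) + ( w + c );  }

/* Pass-1 form, mirroring STEP1: no constant operand at all. */
static uint32_t haval_step1( uint32_t t, uint32_t x7, uint32_t w )
{  return ror32( t, 7 ) + ror32( x7, 11 ) + w;  }
```

With c equal to zero both functions return the same word, so the vector version saves one broadcast and one add per step in pass 1.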
@@ -605,25 +663,32 @@ do { \
_mm256_add_epi32( w, _mm256_set1_epi32( c ) ) ); \
} while (0)
#define STEP1_8W(n, p, x7, x6, x5, x4, x3, x2, x1, x0, w) \
do { \
__m256i t = FP ## n ## _ ## p ## _8W(x6, x5, x4, x3, x2, x1, x0); \
x7 = _mm256_add_epi32( _mm256_add_epi32( mm256_ror_32( t, 7 ), \
mm256_ror_32( x7, 11 ) ), w ); \
} while (0)
#define PASS1_8W(n, in) do { \
unsigned pass_count; \
for (pass_count = 0; pass_count < 32; pass_count += 8) { \
STEP_8W(n, 1, s7, s6, s5, s4, s3, s2, s1, s0, \
in(pass_count + 0), SPH_C32(0x00000000)); \
STEP_8W(n, 1, s6, s5, s4, s3, s2, s1, s0, s7, \
in(pass_count + 1), SPH_C32(0x00000000)); \
STEP_8W(n, 1, s5, s4, s3, s2, s1, s0, s7, s6, \
in(pass_count + 2), SPH_C32(0x00000000)); \
STEP_8W(n, 1, s4, s3, s2, s1, s0, s7, s6, s5, \
in(pass_count + 3), SPH_C32(0x00000000)); \
STEP_8W(n, 1, s3, s2, s1, s0, s7, s6, s5, s4, \
in(pass_count + 4), SPH_C32(0x00000000)); \
STEP_8W(n, 1, s2, s1, s0, s7, s6, s5, s4, s3, \
in(pass_count + 5), SPH_C32(0x00000000)); \
STEP_8W(n, 1, s1, s0, s7, s6, s5, s4, s3, s2, \
in(pass_count + 6), SPH_C32(0x00000000)); \
STEP_8W(n, 1, s0, s7, s6, s5, s4, s3, s2, s1, \
in(pass_count + 7), SPH_C32(0x00000000)); \
STEP1_8W(n, 1, s7, s6, s5, s4, s3, s2, s1, s0, \
in(pass_count + 0) ); \
STEP1_8W(n, 1, s6, s5, s4, s3, s2, s1, s0, s7, \
in(pass_count + 1) ); \
STEP1_8W(n, 1, s5, s4, s3, s2, s1, s0, s7, s6, \
in(pass_count + 2) ); \
STEP1_8W(n, 1, s4, s3, s2, s1, s0, s7, s6, s5, \
in(pass_count + 3) ); \
STEP1_8W(n, 1, s3, s2, s1, s0, s7, s6, s5, s4, \
in(pass_count + 4) ); \
STEP1_8W(n, 1, s2, s1, s0, s7, s6, s5, s4, s3, \
in(pass_count + 5) ); \
STEP1_8W(n, 1, s1, s0, s7, s6, s5, s4, s3, s2, \
in(pass_count + 6) ); \
STEP1_8W(n, 1, s0, s7, s6, s5, s4, s3, s2, s1, \
in(pass_count + 7) ); \
} \
} while (0)
@@ -726,14 +791,14 @@ do { \
static void
haval_8way_init( haval_8way_context *sc, unsigned olen, unsigned passes )
{
sc->s0 = m256_const1_32( 0x243F6A88UL );
sc->s1 = m256_const1_32( 0x85A308D3UL );
sc->s2 = m256_const1_32( 0x13198A2EUL );
sc->s3 = m256_const1_32( 0x03707344UL );
sc->s4 = m256_const1_32( 0xA4093822UL );
sc->s5 = m256_const1_32( 0x299F31D0UL );
sc->s6 = m256_const1_32( 0x082EFA98UL );
sc->s7 = m256_const1_32( 0xEC4E6C89UL );
sc->s0 = _mm256_set1_epi32( 0x243F6A88UL );
sc->s1 = _mm256_set1_epi32( 0x85A308D3UL );
sc->s2 = _mm256_set1_epi32( 0x13198A2EUL );
sc->s3 = _mm256_set1_epi32( 0x03707344UL );
sc->s4 = _mm256_set1_epi32( 0xA4093822UL );
sc->s5 = _mm256_set1_epi32( 0x299F31D0UL );
sc->s6 = _mm256_set1_epi32( 0x082EFA98UL );
sc->s7 = _mm256_set1_epi32( 0xEC4E6C89UL );
sc->olen = olen;
sc->passes = passes;
sc->count_high = 0;


@@ -49,12 +49,11 @@ extern "C"{
#define Sb_8W(x0, x1, x2, x3, c) \
do { \
__m512i cc = _mm512_set1_epi64( c ); \
x3 = mm512_not( x3 ); \
const __m512i cc = _mm512_set1_epi64( c ); \
x0 = mm512_xorandnot( x0, x2, cc ); \
tmp = mm512_xorand( cc, x0, x1 ); \
x0 = mm512_xorand( x0, x2, x3 ); \
x3 = mm512_xorandnot( x3, x1, x2 ); \
x0 = mm512_xorandnot( x0, x3, x2 ); \
x3 = _mm512_ternarylogic_epi64( x3, x1, x2, 0x2d ); /* ~x3 ^ (~x1 & x2) */\
x1 = mm512_xorand( x1, x0, x2 ); \
x2 = mm512_xorandnot( x2, x3, x0 ); \
x0 = mm512_xoror( x0, x1, x3 ); \
@@ -77,19 +76,31 @@ do { \
#endif
#if defined(__AVX512VL__)
//TODO enable for AVX10_256, not used with AVX512VL
#define notxorandnot( a, b, c ) \
_mm256_ternarylogic_epi64( a, b, c, 0x2d )
#else
#define notxorandnot( a, b, c ) \
_mm256_xor_si256( mm256_not( a ), _mm256_andnot_si256( b, c ) )
#endif
#define Sb(x0, x1, x2, x3, c) \
do { \
__m256i cc = _mm256_set1_epi64x( c ); \
x3 = mm256_not( x3 ); \
x0 = _mm256_xor_si256( x0, _mm256_andnot_si256( x2, cc ) ); \
tmp = _mm256_xor_si256( cc, _mm256_and_si256( x0, x1 ) ); \
x0 = _mm256_xor_si256( x0, _mm256_and_si256( x2, x3 ) ); \
x3 = _mm256_xor_si256( x3, _mm256_andnot_si256( x1, x2 ) ); \
x1 = _mm256_xor_si256( x1, _mm256_and_si256( x0, x2 ) ); \
x2 = _mm256_xor_si256( x2, _mm256_andnot_si256( x3, x0 ) ); \
x0 = _mm256_xor_si256( x0, _mm256_or_si256( x1, x3 ) ); \
x3 = _mm256_xor_si256( x3, _mm256_and_si256( x1, x2 ) ); \
x1 = _mm256_xor_si256( x1, _mm256_and_si256( tmp, x0 ) ); \
const __m256i cc = _mm256_set1_epi64x( c ); \
x0 = mm256_xorandnot( x0, x2, cc ); \
tmp = mm256_xorand( cc, x0, x1 ); \
x0 = mm256_xorandnot( x0, x3, x2 ); \
x3 = notxorandnot( x3, x1, x2 ); \
x1 = mm256_xorand( x1, x0, x2 ); \
x2 = mm256_xorandnot( x2, x3, x0 ); \
x0 = mm256_xoror( x0, x1, x3 ); \
x3 = mm256_xorand( x3, x1, x2 ); \
x1 = mm256_xorand( x1, tmp, x0 ); \
x2 = _mm256_xor_si256( x2, tmp ); \
} while (0)
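
Note on the 0x2d immediate above: the last argument of the ternary-logic intrinsic is an 8-bit truth table indexed by the three input bits. The sketch below shows one way to derive and check that immediate in scalar C. The helper names (`ternlog_imm8`, `ternlog64`, `f_notxorandnot`) are hypothetical and not part of cpuminer-opt.

```c
#include <stdint.h>

/* Hypothetical helper: derives the 8-bit truth-table immediate for a
 * three-input boolean function, using the convention that the table
 * index is (a << 2) | (b << 1) | c. */
typedef unsigned (*bool3_fn)( unsigned a, unsigned b, unsigned c );

static uint8_t ternlog_imm8( bool3_fn f )
{
   uint8_t imm = 0;
   for ( unsigned i = 0; i < 8; i++ )
   {
      unsigned a = ( i >> 2 ) & 1, b = ( i >> 1 ) & 1, c = i & 1;
      if ( f( a, b, c ) & 1 )
         imm |= (uint8_t)( 1u << i );
   }
   return imm;
}

/* ~a ^ (~b & c), the expression in the comment next to 0x2d */
static unsigned f_notxorandnot( unsigned a, unsigned b, unsigned c )
{  return ( ~a ^ ( ~b & c ) ) & 1;  }

/* Scalar model of the ternary-logic operation on one 64-bit lane:
 * each result bit is looked up in imm using the three input bits. */
static uint64_t ternlog64( uint64_t a, uint64_t b, uint64_t c, uint8_t imm )
{
   uint64_t r = 0;
   for ( int bit = 0; bit < 64; bit++ )
   {
      unsigned idx = (unsigned)( ( ( a >> bit ) & 1 ) << 2
                               | ( ( b >> bit ) & 1 ) << 1
                               | ( ( c >> bit ) & 1 ) );
      r |= (uint64_t)( ( imm >> idx ) & 1 ) << bit;
   }
   return r;
}
```

Passing `f_notxorandnot` to `ternlog_imm8` yields 0x2d, which agrees with both the AVX-512 immediate and the two-instruction fallback in the `notxorandnot` macro.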
@@ -97,11 +108,11 @@ do { \
do { \
x4 = _mm256_xor_si256( x4, x1 ); \
x5 = _mm256_xor_si256( x5, x2 ); \
x6 = _mm256_xor_si256( x6, _mm256_xor_si256( x3, x0 ) ); \
x6 = mm256_xor3( x6, x3, x0 ); \
x7 = _mm256_xor_si256( x7, x0 ); \
x0 = _mm256_xor_si256( x0, x5 ); \
x1 = _mm256_xor_si256( x1, x6 ); \
x2 = _mm256_xor_si256( x2, _mm256_xor_si256( x7, x4 ) ); \
x2 = mm256_xor3( x2, x7, x4 ); \
x3 = _mm256_xor_si256( x3, x4 ); \
} while (0)
@@ -324,12 +335,12 @@ do { \
} while (0)
#define W80(x) Wz_8W(x, m512_const1_64( 0x5555555555555555 ), 1 )
#define W81(x) Wz_8W(x, m512_const1_64( 0x3333333333333333 ), 2 )
#define W82(x) Wz_8W(x, m512_const1_64( 0x0F0F0F0F0F0F0F0F ), 4 )
#define W83(x) Wz_8W(x, m512_const1_64( 0x00FF00FF00FF00FF ), 8 )
#define W84(x) Wz_8W(x, m512_const1_64( 0x0000FFFF0000FFFF ), 16 )
#define W85(x) Wz_8W(x, m512_const1_64( 0x00000000FFFFFFFF ), 32 )
#define W80(x) Wz_8W(x, _mm512_set1_epi64( 0x5555555555555555 ), 1 )
#define W81(x) Wz_8W(x, _mm512_set1_epi64( 0x3333333333333333 ), 2 )
#define W82(x) Wz_8W(x, _mm512_set1_epi64( 0x0F0F0F0F0F0F0F0F ), 4 )
#define W83(x) Wz_8W(x, _mm512_set1_epi64( 0x00FF00FF00FF00FF ), 8 )
#define W84(x) Wz_8W(x, _mm512_set1_epi64( 0x0000FFFF0000FFFF ), 16 )
#define W85(x) Wz_8W(x, _mm512_set1_epi64( 0x00000000FFFFFFFF ), 32 )
#define W86(x) \
do { \
__m512i t = x ## h; \
@@ -353,12 +364,12 @@ do { \
x ## l = _mm256_or_si256( _mm256_and_si256((x ## l >> (n)), (c)), t ); \
} while (0)
#define W0(x) Wz(x, m256_const1_64( 0x5555555555555555 ), 1 )
#define W1(x) Wz(x, m256_const1_64( 0x3333333333333333 ), 2 )
#define W2(x) Wz(x, m256_const1_64( 0x0F0F0F0F0F0F0F0F ), 4 )
#define W3(x) Wz(x, m256_const1_64( 0x00FF00FF00FF00FF ), 8 )
#define W4(x) Wz(x, m256_const1_64( 0x0000FFFF0000FFFF ), 16 )
#define W5(x) Wz(x, m256_const1_64( 0x00000000FFFFFFFF ), 32 )
#define W0(x) Wz(x, _mm256_set1_epi64x( 0x5555555555555555 ), 1 )
#define W1(x) Wz(x, _mm256_set1_epi64x( 0x3333333333333333 ), 2 )
#define W2(x) Wz(x, _mm256_set1_epi64x( 0x0F0F0F0F0F0F0F0F ), 4 )
#define W3(x) Wz(x, _mm256_set1_epi64x( 0x00FF00FF00FF00FF ), 8 )
#define W4(x) Wz(x, _mm256_set1_epi64x( 0x0000FFFF0000FFFF ), 16 )
#define W5(x) Wz(x, _mm256_set1_epi64x( 0x00000000FFFFFFFF ), 32 )
#define W6(x) \
do { \
__m256i t = x ## h; \
@@ -625,22 +636,22 @@ static const sph_u64 IV512[] = {
void jh256_8way_init( jh_8way_context *sc )
{
// bswapped IV256
sc->H[ 0] = m512_const1_64( 0xebd3202c41a398eb );
sc->H[ 1] = m512_const1_64( 0xc145b29c7bbecd92 );
sc->H[ 2] = m512_const1_64( 0xfac7d4609151931c );
sc->H[ 3] = m512_const1_64( 0x038a507ed6820026 );
sc->H[ 4] = m512_const1_64( 0x45b92677269e23a4 );
sc->H[ 5] = m512_const1_64( 0x77941ad4481afbe0 );
sc->H[ 6] = m512_const1_64( 0x7a176b0226abb5cd );
sc->H[ 7] = m512_const1_64( 0xa82fff0f4224f056 );
sc->H[ 8] = m512_const1_64( 0x754d2e7f8996a371 );
sc->H[ 9] = m512_const1_64( 0x62e27df70849141d );
sc->H[10] = m512_const1_64( 0x948f2476f7957627 );
sc->H[11] = m512_const1_64( 0x6c29804757b6d587 );
sc->H[12] = m512_const1_64( 0x6c0d8eac2d275e5c );
sc->H[13] = m512_const1_64( 0x0f7a0557c6508451 );
sc->H[14] = m512_const1_64( 0xea12247067d3e47b );
sc->H[15] = m512_const1_64( 0x69d71cd313abe389 );
sc->H[ 0] = _mm512_set1_epi64( 0xebd3202c41a398eb );
sc->H[ 1] = _mm512_set1_epi64( 0xc145b29c7bbecd92 );
sc->H[ 2] = _mm512_set1_epi64( 0xfac7d4609151931c );
sc->H[ 3] = _mm512_set1_epi64( 0x038a507ed6820026 );
sc->H[ 4] = _mm512_set1_epi64( 0x45b92677269e23a4 );
sc->H[ 5] = _mm512_set1_epi64( 0x77941ad4481afbe0 );
sc->H[ 6] = _mm512_set1_epi64( 0x7a176b0226abb5cd );
sc->H[ 7] = _mm512_set1_epi64( 0xa82fff0f4224f056 );
sc->H[ 8] = _mm512_set1_epi64( 0x754d2e7f8996a371 );
sc->H[ 9] = _mm512_set1_epi64( 0x62e27df70849141d );
sc->H[10] = _mm512_set1_epi64( 0x948f2476f7957627 );
sc->H[11] = _mm512_set1_epi64( 0x6c29804757b6d587 );
sc->H[12] = _mm512_set1_epi64( 0x6c0d8eac2d275e5c );
sc->H[13] = _mm512_set1_epi64( 0x0f7a0557c6508451 );
sc->H[14] = _mm512_set1_epi64( 0xea12247067d3e47b );
sc->H[15] = _mm512_set1_epi64( 0x69d71cd313abe389 );
sc->ptr = 0;
sc->block_count = 0;
}
@@ -648,22 +659,22 @@ void jh256_8way_init( jh_8way_context *sc )
void jh512_8way_init( jh_8way_context *sc )
{
// bswapped IV512
sc->H[ 0] = m512_const1_64( 0x17aa003e964bd16f );
sc->H[ 1] = m512_const1_64( 0x43d5157a052e6a63 );
sc->H[ 2] = m512_const1_64( 0x0bef970c8d5e228a );
sc->H[ 3] = m512_const1_64( 0x61c3b3f2591234e9 );
sc->H[ 4] = m512_const1_64( 0x1e806f53c1a01d89 );
sc->H[ 5] = m512_const1_64( 0x806d2bea6b05a92a );
sc->H[ 6] = m512_const1_64( 0xa6ba7520dbcc8e58 );
sc->H[ 7] = m512_const1_64( 0xf73bf8ba763a0fa9 );
sc->H[ 8] = m512_const1_64( 0x694ae34105e66901 );
sc->H[ 9] = m512_const1_64( 0x5ae66f2e8e8ab546 );
sc->H[10] = m512_const1_64( 0x243c84c1d0a74710 );
sc->H[11] = m512_const1_64( 0x99c15a2db1716e3b );
sc->H[12] = m512_const1_64( 0x56f8b19decf657cf );
sc->H[13] = m512_const1_64( 0x56b116577c8806a7 );
sc->H[14] = m512_const1_64( 0xfb1785e6dffcc2e3 );
sc->H[15] = m512_const1_64( 0x4bdd8ccc78465a54 );
sc->H[ 0] = _mm512_set1_epi64( 0x17aa003e964bd16f );
sc->H[ 1] = _mm512_set1_epi64( 0x43d5157a052e6a63 );
sc->H[ 2] = _mm512_set1_epi64( 0x0bef970c8d5e228a );
sc->H[ 3] = _mm512_set1_epi64( 0x61c3b3f2591234e9 );
sc->H[ 4] = _mm512_set1_epi64( 0x1e806f53c1a01d89 );
sc->H[ 5] = _mm512_set1_epi64( 0x806d2bea6b05a92a );
sc->H[ 6] = _mm512_set1_epi64( 0xa6ba7520dbcc8e58 );
sc->H[ 7] = _mm512_set1_epi64( 0xf73bf8ba763a0fa9 );
sc->H[ 8] = _mm512_set1_epi64( 0x694ae34105e66901 );
sc->H[ 9] = _mm512_set1_epi64( 0x5ae66f2e8e8ab546 );
sc->H[10] = _mm512_set1_epi64( 0x243c84c1d0a74710 );
sc->H[11] = _mm512_set1_epi64( 0x99c15a2db1716e3b );
sc->H[12] = _mm512_set1_epi64( 0x56f8b19decf657cf );
sc->H[13] = _mm512_set1_epi64( 0x56b116577c8806a7 );
sc->H[14] = _mm512_set1_epi64( 0xfb1785e6dffcc2e3 );
sc->H[15] = _mm512_set1_epi64( 0x4bdd8ccc78465a54 );
sc->ptr = 0;
sc->block_count = 0;
}
@@ -722,7 +733,7 @@ jh_8way_close( jh_8way_context *sc, unsigned ub, unsigned n, void *dst,
size_t numz, u;
uint64_t l0, l1;
buf[0] = m512_const1_64( 0x80ULL );
buf[0] = _mm512_set1_epi64( 0x80ULL );
if ( sc->ptr == 0 )
numz = 48;
@@ -773,22 +784,22 @@ jh512_8way_close(void *cc, void *dst)
void jh256_4way_init( jh_4way_context *sc )
{
// bswapped IV256
sc->H[ 0] = m256_const1_64( 0xebd3202c41a398eb );
sc->H[ 1] = m256_const1_64( 0xc145b29c7bbecd92 );
sc->H[ 2] = m256_const1_64( 0xfac7d4609151931c );
sc->H[ 3] = m256_const1_64( 0x038a507ed6820026 );
sc->H[ 4] = m256_const1_64( 0x45b92677269e23a4 );
sc->H[ 5] = m256_const1_64( 0x77941ad4481afbe0 );
sc->H[ 6] = m256_const1_64( 0x7a176b0226abb5cd );
sc->H[ 7] = m256_const1_64( 0xa82fff0f4224f056 );
sc->H[ 8] = m256_const1_64( 0x754d2e7f8996a371 );
sc->H[ 9] = m256_const1_64( 0x62e27df70849141d );
sc->H[10] = m256_const1_64( 0x948f2476f7957627 );
sc->H[11] = m256_const1_64( 0x6c29804757b6d587 );
sc->H[12] = m256_const1_64( 0x6c0d8eac2d275e5c );
sc->H[13] = m256_const1_64( 0x0f7a0557c6508451 );
sc->H[14] = m256_const1_64( 0xea12247067d3e47b );
sc->H[15] = m256_const1_64( 0x69d71cd313abe389 );
sc->H[ 0] = _mm256_set1_epi64x( 0xebd3202c41a398eb );
sc->H[ 1] = _mm256_set1_epi64x( 0xc145b29c7bbecd92 );
sc->H[ 2] = _mm256_set1_epi64x( 0xfac7d4609151931c );
sc->H[ 3] = _mm256_set1_epi64x( 0x038a507ed6820026 );
sc->H[ 4] = _mm256_set1_epi64x( 0x45b92677269e23a4 );
sc->H[ 5] = _mm256_set1_epi64x( 0x77941ad4481afbe0 );
sc->H[ 6] = _mm256_set1_epi64x( 0x7a176b0226abb5cd );
sc->H[ 7] = _mm256_set1_epi64x( 0xa82fff0f4224f056 );
sc->H[ 8] = _mm256_set1_epi64x( 0x754d2e7f8996a371 );
sc->H[ 9] = _mm256_set1_epi64x( 0x62e27df70849141d );
sc->H[10] = _mm256_set1_epi64x( 0x948f2476f7957627 );
sc->H[11] = _mm256_set1_epi64x( 0x6c29804757b6d587 );
sc->H[12] = _mm256_set1_epi64x( 0x6c0d8eac2d275e5c );
sc->H[13] = _mm256_set1_epi64x( 0x0f7a0557c6508451 );
sc->H[14] = _mm256_set1_epi64x( 0xea12247067d3e47b );
sc->H[15] = _mm256_set1_epi64x( 0x69d71cd313abe389 );
sc->ptr = 0;
sc->block_count = 0;
}
@@ -796,22 +807,22 @@ void jh256_4way_init( jh_4way_context *sc )
void jh512_4way_init( jh_4way_context *sc )
{
// bswapped IV512
sc->H[ 0] = m256_const1_64( 0x17aa003e964bd16f );
sc->H[ 1] = m256_const1_64( 0x43d5157a052e6a63 );
sc->H[ 2] = m256_const1_64( 0x0bef970c8d5e228a );
sc->H[ 3] = m256_const1_64( 0x61c3b3f2591234e9 );
sc->H[ 4] = m256_const1_64( 0x1e806f53c1a01d89 );
sc->H[ 5] = m256_const1_64( 0x806d2bea6b05a92a );
sc->H[ 6] = m256_const1_64( 0xa6ba7520dbcc8e58 );
sc->H[ 7] = m256_const1_64( 0xf73bf8ba763a0fa9 );
sc->H[ 8] = m256_const1_64( 0x694ae34105e66901 );
sc->H[ 9] = m256_const1_64( 0x5ae66f2e8e8ab546 );
sc->H[10] = m256_const1_64( 0x243c84c1d0a74710 );
sc->H[11] = m256_const1_64( 0x99c15a2db1716e3b );
sc->H[12] = m256_const1_64( 0x56f8b19decf657cf );
sc->H[13] = m256_const1_64( 0x56b116577c8806a7 );
sc->H[14] = m256_const1_64( 0xfb1785e6dffcc2e3 );
sc->H[15] = m256_const1_64( 0x4bdd8ccc78465a54 );
sc->H[ 0] = _mm256_set1_epi64x( 0x17aa003e964bd16f );
sc->H[ 1] = _mm256_set1_epi64x( 0x43d5157a052e6a63 );
sc->H[ 2] = _mm256_set1_epi64x( 0x0bef970c8d5e228a );
sc->H[ 3] = _mm256_set1_epi64x( 0x61c3b3f2591234e9 );
sc->H[ 4] = _mm256_set1_epi64x( 0x1e806f53c1a01d89 );
sc->H[ 5] = _mm256_set1_epi64x( 0x806d2bea6b05a92a );
sc->H[ 6] = _mm256_set1_epi64x( 0xa6ba7520dbcc8e58 );
sc->H[ 7] = _mm256_set1_epi64x( 0xf73bf8ba763a0fa9 );
sc->H[ 8] = _mm256_set1_epi64x( 0x694ae34105e66901 );
sc->H[ 9] = _mm256_set1_epi64x( 0x5ae66f2e8e8ab546 );
sc->H[10] = _mm256_set1_epi64x( 0x243c84c1d0a74710 );
sc->H[11] = _mm256_set1_epi64x( 0x99c15a2db1716e3b );
sc->H[12] = _mm256_set1_epi64x( 0x56f8b19decf657cf );
sc->H[13] = _mm256_set1_epi64x( 0x56b116577c8806a7 );
sc->H[14] = _mm256_set1_epi64x( 0xfb1785e6dffcc2e3 );
sc->H[15] = _mm256_set1_epi64x( 0x4bdd8ccc78465a54 );
sc->ptr = 0;
sc->block_count = 0;
}
@@ -870,7 +881,7 @@ jh_4way_close( jh_4way_context *sc, unsigned ub, unsigned n, void *dst,
size_t numz, u;
uint64_t l0, l1;
buf[0] = m256_const1_64( 0x80ULL );
buf[0] = _mm256_set1_epi64x( 0x80ULL );
if ( sc->ptr == 0 )
numz = 48;


@@ -49,7 +49,7 @@ int scanhash_keccak_8way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( (n < max_nonce-8) && !work_restart[thr_id].restart);
@@ -101,7 +101,7 @@ int scanhash_keccak_4way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( (n < max_nonce-4) && !work_restart[thr_id].restart);
pdata[19] = n;
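
Note on the nonce increment above: the 64-bit constant carries the step (4 or 8) in its upper 32 bits and zero in its lower 32 bits, and it is applied with a packed 32-bit add, so only the upper half of each lane changes and no carry crosses between halves. A minimal scalar sketch, assuming the nonce sits in the upper 32 bits of each 64-bit lane as the constant suggests; `add_epi32_lane` and `bump_nonce` are hypothetical names.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane under a packed 32-bit add: the two
 * 32-bit halves are added independently, with no carry between them. */
static uint64_t add_epi32_lane( uint64_t x, uint64_t y )
{
   uint32_t lo = (uint32_t)x + (uint32_t)y;
   uint32_t hi = (uint32_t)( x >> 32 ) + (uint32_t)( y >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}

/* Assumption: the nonce occupies the upper half of the lane. It is
 * advanced by the number of lanes hashed per iteration (4 or 8). */
static uint64_t bump_nonce( uint64_t lane, uint32_t ways )
{
   return add_epi32_lane( lane, (uint64_t)ways << 32 );
}
```

The lower half of the lane is returned unchanged, and a nonce that wraps past 0xffffffff does not disturb it either.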


@@ -72,11 +72,11 @@ static const uint64_t RC[] = {
// Targetted macros, keccak-macros.h is included for each target.
#define DECL64(x) __m512i x
#define XOR(d, a, b) (d = _mm512_xor_si512(a,b))
#define XOR64 XOR
#define XOR(d, a, b) (d = _mm512_xor_si512(a,b))
#define XOR64 XOR
#define AND64(d, a, b) (d = _mm512_and_si512(a,b))
#define OR64(d, a, b) (d = _mm512_or_si512(a,b))
#define NOT64(d, s) (d = _mm512_xor_si512(s,m512_neg1))
#define NOT64(d, s) (d = mm512_not( s ) )
#define ROL64(d, v, n) (d = mm512_rol_64(v, n))
#define XOROR(d, a, b, c) (d = mm512_xoror(a, b, c))
#define XORAND(d, a, b, c) (d = mm512_xorand(a, b, c))
@@ -180,15 +180,15 @@ static void keccak64_8way_close( keccak64_ctx_m512i *kc, void *dst,
if ( kc->ptr == (lim - 8) )
{
const uint64_t t = eb | 0x8000000000000000;
u.tmp[0] = m512_const1_64( t );
u.tmp[0] = _mm512_set1_epi64( t );
j = 8;
}
else
{
j = lim - kc->ptr;
u.tmp[0] = m512_const1_64( eb );
u.tmp[0] = _mm512_set1_epi64( eb );
memset_zero_512( u.tmp + 1, (j>>3) - 2 );
u.tmp[ (j>>3) - 1] = m512_const1_64( 0x8000000000000000 );
u.tmp[ (j>>3) - 1] = _mm512_set1_epi64( 0x8000000000000000 );
}
keccak64_8way_core( kc, u.tmp, j, lim );
/* Finalize the "lane complement" */
@@ -257,15 +257,15 @@ keccak512_8way_close(void *cc, void *dst)
kc->w[j ] = _mm256_xor_si256( kc->w[j], buf[j] ); \
} while (0)
#define DECL64(x) __m256i x
#define XOR(d, a, b) (d = _mm256_xor_si256(a,b))
#define XOR64 XOR
#define AND64(d, a, b) (d = _mm256_and_si256(a,b))
#define OR64(d, a, b) (d = _mm256_or_si256(a,b))
#define NOT64(d, s) (d = _mm256_xor_si256(s,m256_neg1))
#define ROL64(d, v, n) (d = mm256_rol_64(v, n))
#define XOROR(d, a, b, c) (d = _mm256_xor_si256(a, _mm256_or_si256(b, c)))
#define XORAND(d, a, b, c) (d = _mm256_xor_si256(a, _mm256_and_si256(b, c)))
#define DECL64(x) __m256i x
#define XOR(d, a, b) (d = _mm256_xor_si256(a,b))
#define XOR64 XOR
#define AND64(d, a, b) (d = _mm256_and_si256(a,b))
#define OR64(d, a, b) (d = _mm256_or_si256(a,b))
#define NOT64(d, s) (d = mm256_not( s ) )
#define ROL64(d, v, n) (d = mm256_rol_64(v, n))
#define XOROR(d, a, b, c) (d = mm256_xoror( a, b, c ) )
#define XORAND(d, a, b, c) (d = mm256_xorand( a, b, c ) )
#define XOR3( d, a, b, c ) (d = mm256_xor3( a, b, c ))
#include "keccak-macros.c"
@@ -368,15 +368,15 @@ static void keccak64_close( keccak64_ctx_m256i *kc, void *dst, size_t byte_len,
if ( kc->ptr == (lim - 8) )
{
const uint64_t t = eb | 0x8000000000000000;
u.tmp[0] = m256_const1_64( t );
u.tmp[0] = _mm256_set1_epi64x( t );
j = 8;
}
else
{
j = lim - kc->ptr;
u.tmp[0] = m256_const1_64( eb );
u.tmp[0] = _mm256_set1_epi64x( eb );
memset_zero_256( u.tmp + 1, (j>>3) - 2 );
u.tmp[ (j>>3) - 1] = m256_const1_64( 0x8000000000000000 );
u.tmp[ (j>>3) - 1] = _mm256_set1_epi64x( 0x8000000000000000 );
}
keccak64_core( kc, u.tmp, j, lim );
/* Finalize the "lane complement" */


@@ -56,7 +56,7 @@ int scanhash_sha3d_8way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
@@ -115,7 +115,7 @@ int scanhash_sha3d_4way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
pdata[19] = n;

File diff suppressed because it is too large


@@ -19,96 +19,122 @@
*/
#include <string.h>
#include <emmintrin.h>
#include "simd-utils.h"
#include "luffa_for_sse2.h"
#define cns(i) ( ( (__m128i*)CNS_INIT)[i] )
#define ADD_CONSTANT( a, b, c0 ,c1 ) \
a = _mm_xor_si128( a, c0 ); \
b = _mm_xor_si128( b, c1 ); \
#if defined(__AVX512VL__)
//TODO enable for AVX10_512 AVX10_256
#define MULT2( a0, a1 ) \
{ \
__m128i b = _mm_xor_si128( a0, \
_mm_maskz_shuffle_epi32( 0xb, a1, 0x10 ) ); \
a0 = _mm_alignr_epi8( a1, b, 4 ); \
a1 = _mm_alignr_epi8( b, a1, 4 ); \
}
#elif defined(__SSE4_1__)
#define MULT2( a0, a1 ) do \
{ \
__m128i b = _mm_xor_si128( a0, _mm_shuffle_epi32( _mm_and_si128(a1,MASK), 16 ) ); \
a0 = _mm_or_si128( _mm_srli_si128(b,4), _mm_slli_si128(a1,12) ); \
a1 = _mm_or_si128( _mm_srli_si128(a1,4), _mm_slli_si128(b,12) ); \
__m128i b = _mm_xor_si128( a0, \
_mm_shuffle_epi32( mm128_mask_32( a1, 0xe ), 0x10 ) ); \
a0 = _mm_alignr_epi8( a1, b, 4 ); \
a1 = _mm_alignr_epi8( b, a1, 4 ); \
} while(0)
/*
static inline __m256i mult2_avx2( a )
{
__m128 a0, a0, b;
a0 = mm128_extractlo_256( a );
a1 = mm128_extracthi_256( a );
b = _mm_xor_si128( a0, _mm_shuffle_epi32( _mm_and_si128(a1,MASK), 16 ) );
a0 = _mm_or_si128( _mm_srli_si128(b,4), _mm_slli_si128(a1,12) );
a1 = _mm_or_si128( _mm_srli_si128(a1,4), _mm_slli_si128(b,12) );
return mm256_concat_128( a1, a0 );
#else
#define MULT2( a0, a1 ) do \
{ \
__m128i b = _mm_xor_si128( a0, \
_mm_shuffle_epi32( _mm_and_si128( a1, MASK ), 0x10 ) ); \
a0 = _mm_or_si128( _mm_srli_si128( b, 4 ), _mm_slli_si128( a1, 12 ) ); \
a1 = _mm_or_si128( _mm_srli_si128( a1, 4 ), _mm_slli_si128( b, 12 ) ); \
} while(0)
#endif
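
Note on the MULT2 variants above: the shift/or pairs of the fallback and the alignr calls of the newer paths compute the same lane rotation across the a1:b pair. The sketch below models a 128-bit vector as four 32-bit lanes to check this. The type `v128` and all function names are hypothetical, and the lane behaviour of the shuffle with immediate 0x10 and of the 4-byte alignr is written out by hand rather than taken from the intrinsics.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t w[4]; } v128;   /* w[0] is the lowest lane */

static v128 v_xor( v128 a, v128 b )
{  for ( int i = 0; i < 4; i++ ) a.w[i] ^= b.w[i];   return a;  }

/* Model of a 32-bit shuffle with immediate 0x10: lanes { v0, v0, v1, v0 } */
static v128 shuffle_0x10( v128 v )
{  v128 r = {{ v.w[0], v.w[0], v.w[1], v.w[0] }};   return r;  }

/* Model of a 4-byte alignr: shift the pair hi:lo right by one lane */
static v128 alignr4( v128 hi, v128 lo )
{  v128 r = {{ lo.w[1], lo.w[2], lo.w[3], hi.w[0] }};   return r;  }

/* Fallback form: and with MASK (lane 0 only), shuffle, shift/or pairs */
static void mult2_old( v128 *a0, v128 *a1 )
{
   v128 m  = {{ a1->w[0], 0, 0, 0 }};
   v128 b  = v_xor( *a0, shuffle_0x10( m ) );
   v128 n0 = {{ b.w[1], b.w[2], b.w[3], a1->w[0] }};      /* srli b | slli a1 */
   v128 n1 = {{ a1->w[1], a1->w[2], a1->w[3], b.w[0] }};  /* srli a1 | slli b */
   *a0 = n0;   *a1 = n1;
}

/* Newer form: zero-masked shuffle (mask 0xb clears lane 2), then alignr */
static void mult2_new( v128 *a0, v128 *a1 )
{
   v128 s = shuffle_0x10( *a1 );
   s.w[2] = 0;
   v128 b = v_xor( *a0, s );
   *a0 = alignr4( *a1, b );
   *a1 = alignr4( b, *a1 );
}

/* Returns 1 when both forms give identical results for the inputs */
static int mult2_forms_agree( v128 a0, v128 a1 )
{
   v128 x0 = a0, x1 = a1, y0 = a0, y1 = a1;
   mult2_old( &x0, &x1 );
   mult2_new( &y0, &y1 );
   return memcmp( &x0, &y0, sizeof x0 ) == 0
       && memcmp( &x1, &y1, sizeof x1 ) == 0;
}
```

The alignr form needs two instructions where the fallback needs six, which appears to be the motivation for the SSE4.1 and AVX512VL paths.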
#if defined(__AVX512VL__)
//TODO enable for AVX10_512 AVX10_256
#define SUBCRUMB( a0, a1, a2, a3 ) \
{ \
__m128i t = a0; \
a0 = mm128_xoror( a3, a0, a1 ); \
a2 = _mm_xor_si128( a2, a3 ); \
a1 = _mm_ternarylogic_epi64( a1, a3, t, 0x87 ); /* a1 xnor (a3 & t) */ \
a3 = mm128_xorand( a2, a3, t ); \
a2 = mm128_xorand( a1, a2, a0 ); \
a1 = _mm_or_si128( a1, a3 ); \
a3 = _mm_xor_si128( a3, a2 ); \
t = _mm_xor_si128( t, a1 ); \
a2 = _mm_and_si128( a2, a1 ); \
a1 = mm128_xnor( a1, a0 ); \
a0 = t; \
}
*/
#define STEP_PART(x,c,t)\
SUBCRUMB(*x,*(x+1),*(x+2),*(x+3),*t);\
SUBCRUMB(*(x+5),*(x+6),*(x+7),*(x+4),*t);\
MIXWORD(*x,*(x+4),*t,*(t+1));\
MIXWORD(*(x+1),*(x+5),*t,*(t+1));\
MIXWORD(*(x+2),*(x+6),*t,*(t+1));\
MIXWORD(*(x+3),*(x+7),*t,*(t+1));\
ADD_CONSTANT(*x, *(x+4), *c, *(c+1));
#else
#define STEP_PART2(a0,a1,t0,t1,c0,c1,tmp0,tmp1)\
a1 = _mm_shuffle_epi32(a1,147);\
t0 = _mm_load_si128(&a1);\
a1 = _mm_unpacklo_epi32(a1,a0);\
t0 = _mm_unpackhi_epi32(t0,a0);\
t1 = _mm_shuffle_epi32(t0,78);\
a0 = _mm_shuffle_epi32(a1,78);\
SUBCRUMB(t1,t0,a0,a1,tmp0);\
t0 = _mm_unpacklo_epi32(t0,t1);\
a1 = _mm_unpacklo_epi32(a1,a0);\
a0 = _mm_load_si128(&a1);\
a0 = _mm_unpackhi_epi64(a0,t0);\
a1 = _mm_unpacklo_epi64(a1,t0);\
a1 = _mm_shuffle_epi32(a1,57);\
MIXWORD(a0,a1,tmp0,tmp1);\
ADD_CONSTANT(a0,a1,c0,c1);
#define SUBCRUMB( a0, a1, a2, a3 ) \
{ \
__m128i t = a0; \
a0 = _mm_or_si128( a0, a1 ); \
a2 = _mm_xor_si128( a2, a3 ); \
a1 = mm128_not( a1 ); \
a0 = _mm_xor_si128( a0, a3 ); \
a3 = _mm_and_si128( a3, t ); \
a1 = _mm_xor_si128( a1, a3 ); \
a3 = _mm_xor_si128( a3, a2 ); \
a2 = _mm_and_si128( a2, a0 ); \
a0 = mm128_not( a0 ); \
a2 = _mm_xor_si128( a2, a1 ); \
a1 = _mm_or_si128( a1, a3 ); \
t = _mm_xor_si128( t , a1 ); \
a3 = _mm_xor_si128( a3, a2 ); \
a2 = _mm_and_si128( a2, a1 ); \
a1 = _mm_xor_si128( a1, a0 ); \
a0 = t; \
}
#define SUBCRUMB(a0,a1,a2,a3,t)\
t = _mm_load_si128(&a0);\
a0 = _mm_or_si128(a0,a1);\
a2 = _mm_xor_si128(a2,a3);\
a1 = _mm_andnot_si128(a1,ALLONE);\
a0 = _mm_xor_si128(a0,a3);\
a3 = _mm_and_si128(a3,t);\
a1 = _mm_xor_si128(a1,a3);\
a3 = _mm_xor_si128(a3,a2);\
a2 = _mm_and_si128(a2,a0);\
a0 = _mm_andnot_si128(a0,ALLONE);\
a2 = _mm_xor_si128(a2,a1);\
a1 = _mm_or_si128(a1,a3);\
t = _mm_xor_si128(t,a1);\
a3 = _mm_xor_si128(a3,a2);\
a2 = _mm_and_si128(a2,a1);\
a1 = _mm_xor_si128(a1,a0);\
a0 = _mm_load_si128(&t);\
#endif
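
Note on the two SUBCRUMB variants above: the AVX512VL form fuses several xor/and/or/not pairs into three-input operations and reorders a few steps, so its equivalence to the generic form is not obvious at a glance. The scalar sketch below checks it on random 32-bit words. All function names are hypothetical, and the helper semantics are assumed from their names: xoror(a,b,c) = a ^ (b | c), xorand(a,b,c) = a ^ (b & c), xnor(a,b) = ~(a ^ b).

```c
#include <stdint.h>

/* Step-by-step form, mirroring the generic SUBCRUMB on one 32-bit word */
static void subcrumb_generic( uint32_t *a0, uint32_t *a1,
                              uint32_t *a2, uint32_t *a3 )
{
   uint32_t t = *a0;
   *a0 |= *a1;   *a2 ^= *a3;   *a1 = ~*a1;
   *a0 ^= *a3;   *a3 &= t;     *a1 ^= *a3;
   *a3 ^= *a2;   *a2 &= *a0;   *a0 = ~*a0;
   *a2 ^= *a1;   *a1 |= *a3;   t   ^= *a1;
   *a3 ^= *a2;   *a2 &= *a1;   *a1 ^= *a0;
   *a0 = t;
}

/* Fused form, mirroring the AVX512VL SUBCRUMB */
static void subcrumb_fused( uint32_t *a0, uint32_t *a1,
                            uint32_t *a2, uint32_t *a3 )
{
   uint32_t t = *a0;
   *a0 = *a3 ^ ( *a0 | *a1 );       /* xoror */
   *a2 ^= *a3;
   *a1 = ~( *a1 ^ ( *a3 & t ) );    /* ternary logic, immediate 0x87 */
   *a3 = *a2 ^ ( *a3 & t );         /* xorand */
   *a2 = *a1 ^ ( *a2 & *a0 );       /* xorand */
   *a1 |= *a3;
   *a3 ^= *a2;
   t   ^= *a1;
   *a2 &= *a1;
   *a1 = ~( *a1 ^ *a0 );            /* xnor */
   *a0 = t;
}

static uint32_t xorshift32( uint32_t *s )   /* small test-only PRNG */
{  *s ^= *s << 13;   *s ^= *s >> 17;   *s ^= *s << 5;   return *s;  }

/* Returns 1 when both forms agree on `rounds` random inputs */
static int subcrumb_forms_agree( unsigned rounds )
{
   uint32_t seed = 0x9E3779B9u;
   for ( unsigned i = 0; i < rounds; i++ )
   {
      uint32_t g[4], f[4];
      for ( int j = 0; j < 4; j++ ) g[j] = f[j] = xorshift32( &seed );
      subcrumb_generic( &g[0], &g[1], &g[2], &g[3] );
      subcrumb_fused(   &f[0], &f[1], &f[2], &f[3] );
      for ( int j = 0; j < 4; j++ ) if ( g[j] != f[j] ) return 0;
   }
   return 1;
}
```

Because every operation is bitwise, each bit position is an independent 4-input function, so a modest number of random words exercises all 16 input combinations.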
#define MIXWORD(a,b,t1,t2)\
b = _mm_xor_si128(a,b);\
t1 = _mm_slli_epi32(a,2);\
t2 = _mm_srli_epi32(a,30);\
a = _mm_or_si128(t1,t2);\
a = _mm_xor_si128(a,b);\
t1 = _mm_slli_epi32(b,14);\
t2 = _mm_srli_epi32(b,18);\
b = _mm_or_si128(t1,t2);\
b = _mm_xor_si128(a,b);\
t1 = _mm_slli_epi32(a,10);\
t2 = _mm_srli_epi32(a,22);\
a = _mm_or_si128(t1,t2);\
a = _mm_xor_si128(a,b);\
t1 = _mm_slli_epi32(b,1);\
t2 = _mm_srli_epi32(b,31);\
b = _mm_or_si128(t1,t2);
#define MIXWORD( a, b ) \
b = _mm_xor_si128( a, b ); \
a = _mm_xor_si128( b, mm128_rol_32( a, 2 ) ); \
b = _mm_xor_si128( a, mm128_rol_32( b, 14 ) ); \
a = _mm_xor_si128( b, mm128_rol_32( a, 10 ) ); \
b = mm128_rol_32( b, 1 );
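
Note on the MIXWORD rewrite above: each shift-left / shift-right / or triple of the old macro is a 32-bit rotate left, which lets the new macro drop both temporaries. A minimal scalar sketch of the two forms follows. The names `rol32`, `mixword_old`, `mixword_new` and `mixword_forms_agree` are hypothetical.

```c
#include <stdint.h>

static uint32_t rol32( uint32_t x, unsigned n )   /* valid for 0 < n < 32 */
{  return ( x << n ) | ( x >> ( 32 - n ) );  }

/* Old form: explicit shift pairs through temporaries t1 and t2 */
static void mixword_old( uint32_t *a, uint32_t *b )
{
   uint32_t t1, t2;
   *b = *a ^ *b;
   t1 = *a << 2;    t2 = *a >> 30;   *a = t1 | t2;
   *a = *a ^ *b;
   t1 = *b << 14;   t2 = *b >> 18;   *b = t1 | t2;
   *b = *a ^ *b;
   t1 = *a << 10;   t2 = *a >> 22;   *a = t1 | t2;
   *a = *a ^ *b;
   t1 = *b << 1;    t2 = *b >> 31;   *b = t1 | t2;
}

/* New form: rotates folded into the xor chain, no temporaries */
static void mixword_new( uint32_t *a, uint32_t *b )
{
   *b = *a ^ *b;
   *a = *b ^ rol32( *a, 2 );
   *b = *a ^ rol32( *b, 14 );
   *a = *b ^ rol32( *a, 10 );
   *b = rol32( *b, 1 );
}

/* Returns 1 when both forms give the same (a, b) for the inputs */
static int mixword_forms_agree( uint32_t a, uint32_t b )
{
   uint32_t a1 = a, b1 = b, a2 = a, b2 = b;
   mixword_old( &a1, &b1 );
   mixword_new( &a2, &b2 );
   return a1 == a2 && b1 == b2;
}
```

Dropping the temporaries is also what allows STEP_PART and STEP_PART2 to lose their scratch arguments in the same change.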
#define ADD_CONSTANT(a,b,c0,c1)\
a = _mm_xor_si128(a,c0);\
b = _mm_xor_si128(b,c1);\
#define STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, c0, c1 ) \
SUBCRUMB( x0, x1, x2, x3 ); \
SUBCRUMB( x5, x6, x7, x4 ); \
MIXWORD( x0, x4 ); \
MIXWORD( x1, x5 ); \
MIXWORD( x2, x6 ); \
MIXWORD( x3, x7 ); \
ADD_CONSTANT( x0, x4, c0, c1 );
#define STEP_PART2( a0, a1, t0, t1, c0, c1 ) \
t0 = _mm_shuffle_epi32( a1, 147 ); \
a1 = _mm_unpacklo_epi32( t0, a0 ); \
t0 = _mm_unpackhi_epi32( t0, a0 ); \
t1 = _mm_shuffle_epi32( t0, 78 ); \
a0 = _mm_shuffle_epi32( a1, 78 ); \
SUBCRUMB( t1, t0, a0, a1 ); \
t0 = _mm_unpacklo_epi32( t0, t1 ); \
a1 = _mm_unpacklo_epi32( a1, a0 ); \
a0 = _mm_unpackhi_epi64( a1, t0 ); \
a1 = _mm_unpacklo_epi64( a1, t0 ); \
a1 = _mm_shuffle_epi32( a1, 57 ); \
MIXWORD( a0, a1 ); \
ADD_CONSTANT( a0, a1, c0, c1 );
#define NMLTOM768(r0,r1,r2,s0,s1,s2,s3,p0,p1,p2,q0,q1,q2,q3)\
s2 = _mm_load_si128(&r1);\
@@ -169,32 +195,22 @@ static inline __m256i mult2_avx2( a )
q1 = _mm_load_si128(&p1);\
#define NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
s1 = _mm_load_si128(&r3);\
q1 = _mm_load_si128(&p3);\
s3 = _mm_load_si128(&r3);\
q3 = _mm_load_si128(&p3);\
s1 = _mm_unpackhi_epi32(s1,r2);\
q1 = _mm_unpackhi_epi32(q1,p2);\
s3 = _mm_unpacklo_epi32(s3,r2);\
q3 = _mm_unpacklo_epi32(q3,p2);\
s0 = _mm_load_si128(&s1);\
q0 = _mm_load_si128(&q1);\
s2 = _mm_load_si128(&s3);\
q2 = _mm_load_si128(&q3);\
r3 = _mm_load_si128(&r1);\
p3 = _mm_load_si128(&p1);\
r1 = _mm_unpacklo_epi32(r1,r0);\
p1 = _mm_unpacklo_epi32(p1,p0);\
r3 = _mm_unpackhi_epi32(r3,r0);\
p3 = _mm_unpackhi_epi32(p3,p0);\
s0 = _mm_unpackhi_epi64(s0,r3);\
q0 = _mm_unpackhi_epi64(q0,p3);\
s1 = _mm_unpacklo_epi64(s1,r3);\
q1 = _mm_unpacklo_epi64(q1,p3);\
s2 = _mm_unpackhi_epi64(s2,r1);\
q2 = _mm_unpackhi_epi64(q2,p1);\
s3 = _mm_unpacklo_epi64(s3,r1);\
q3 = _mm_unpacklo_epi64(q3,p1);
s1 = _mm_unpackhi_epi32( r3, r2 ); \
q1 = _mm_unpackhi_epi32( p3, p2 ); \
s3 = _mm_unpacklo_epi32( r3, r2 ); \
q3 = _mm_unpacklo_epi32( p3, p2 ); \
r3 = _mm_unpackhi_epi32( r1, r0 ); \
r1 = _mm_unpacklo_epi32( r1, r0 ); \
p3 = _mm_unpackhi_epi32( p1, p0 ); \
p1 = _mm_unpacklo_epi32( p1, p0 ); \
s0 = _mm_unpackhi_epi64( s1, r3 ); \
q0 = _mm_unpackhi_epi64( q1 ,p3 ); \
s1 = _mm_unpacklo_epi64( s1, r3 ); \
q1 = _mm_unpacklo_epi64( q1, p3 ); \
s2 = _mm_unpackhi_epi64( s3, r1 ); \
q2 = _mm_unpackhi_epi64( q3, p1 ); \
s3 = _mm_unpacklo_epi64( s3, r1 ); \
q3 = _mm_unpacklo_epi64( q3, p1 );
#define MIXTON1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3)\
NMLTOM1024(r0,r1,r2,r3,s0,s1,s2,s3,p0,p1,p2,p3,q0,q1,q2,q3);
@@ -255,17 +271,18 @@ static const uint32 CNS_INIT[128] __attribute((aligned(16))) = {
__m128i CNS128[32];
__m128i ALLONE;
#if !defined(__SSE4_1__)
__m128i MASK;
#endif
HashReturn init_luffa(hashState_luffa *state, int hashbitlen)
{
int i;
state->hashbitlen = hashbitlen;
#if !defined(__SSE4_1__)
/* set the lower 32 bits to '1' */
MASK= _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0xffffffff);
/* set all bits to '1' */
ALLONE = _mm_set_epi32(0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff);
#endif
/* set the 32-bit round constant values to the 128-bit data field */
for ( i=0; i<32; i++ )
CNS128[i] = _mm_load_si128( (__m128i*)&CNS_INIT[i*4] );
@@ -297,8 +314,7 @@ HashReturn update_luffa( hashState_luffa *state, const BitSequence *data,
// remaining data bytes
casti_m128i( state->buffer, 0 ) = mm128_bswap_32( cast_m128i( data ) );
// padding of partial block
casti_m128i( state->buffer, 1 ) =
_mm_set_epi8( 0,0,0,0, 0,0,0,0, 0,0,0,0, 0x80,0,0,0 );
casti_m128i( state->buffer, 1 ) = _mm_set_epi32( 0, 0, 0, 0x80000000 );
}
return SUCCESS;
@@ -316,8 +332,7 @@ HashReturn final_luffa(hashState_luffa *state, BitSequence *hashval)
else
{
// empty pad block, constant data
rnd512( state, _mm_setzero_si128(),
_mm_set_epi8( 0,0,0,0, 0,0,0,0, 0,0,0,0, 0x80,0,0,0 ) );
rnd512( state, _mm_setzero_si128(), _mm_set_epi32( 0, 0, 0, 0x80000000 ) );
}
finalization512(state, (uint32*) hashval);
@@ -345,11 +360,11 @@ HashReturn update_and_final_luffa( hashState_luffa *state, BitSequence* output,
// 16 byte partial block exists for 80 byte len
if ( state->rembytes )
// padding of partial block
rnd512( state, m128_const_i128( 0x80000000 ),
rnd512( state, mm128_mov64_128( 0x80000000 ),
mm128_bswap_32( cast_m128i( data ) ) );
else
// empty pad block
rnd512( state, m128_zero, m128_const_i128( 0x80000000 ) );
rnd512( state, m128_zero, mm128_mov64_128( 0x80000000 ) );
finalization512( state, (uint32*) output );
if ( state->hashbitlen > 512 )
@@ -365,10 +380,10 @@ int luffa_full( hashState_luffa *state, BitSequence* output, int hashbitlen,
// Optimized for integrals of 16 bytes, good for 64 and 80 byte len
int i;
state->hashbitlen = hashbitlen;
#if !defined(__SSE4_1__)
/* set the lower 32 bits to '1' */
MASK= _mm_set_epi32(0x00000000, 0x00000000, 0x00000000, 0xffffffff);
/* set all bits to '1' */
ALLONE = _mm_set_epi32(0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff);
#endif
/* set the 32-bit round constant values to the 128-bit data field */
for ( i=0; i<32; i++ )
CNS128[i] = _mm_load_si128( (__m128i*)&CNS_INIT[i*4] );
@@ -394,11 +409,11 @@ int luffa_full( hashState_luffa *state, BitSequence* output, int hashbitlen,
// 16 byte partial block exists for 80 byte len
if ( state->rembytes )
// padding of partial block
rnd512( state, m128_const_i128( 0x80000000 ),
rnd512( state, mm128_mov64_128( 0x80000000 ),
mm128_bswap_32( cast_m128i( data ) ) );
else
// empty pad block
rnd512( state, m128_zero, m128_const_i128( 0x80000000 ) );
rnd512( state, m128_zero, mm128_mov64_128( 0x80000000 ) );
finalization512( state, (uint32*) output );
if ( state->hashbitlen > 512 )
@@ -414,163 +429,119 @@ int luffa_full( hashState_luffa *state, BitSequence* output, int hashbitlen,
static void rnd512( hashState_luffa *state, __m128i msg1, __m128i msg0 )
{
__m128i t[2];
__m128i t0, t1;
__m128i *chainv = state->chainv;
__m128i tmp[2];
__m128i x[8];
__m128i x0, x1, x2, x3, x4, x5, x6, x7;
t[0] = chainv[0];
t[1] = chainv[1];
t0 = mm128_xor3( chainv[0], chainv[2], chainv[4] );
t1 = mm128_xor3( chainv[1], chainv[3], chainv[5] );
t0 = mm128_xor3( t0, chainv[6], chainv[8] );
t1 = mm128_xor3( t1, chainv[7], chainv[9] );
t[0] = _mm_xor_si128( t[0], chainv[2] );
t[1] = _mm_xor_si128( t[1], chainv[3] );
t[0] = _mm_xor_si128( t[0], chainv[4] );
t[1] = _mm_xor_si128( t[1], chainv[5] );
t[0] = _mm_xor_si128( t[0], chainv[6] );
t[1] = _mm_xor_si128( t[1], chainv[7] );
t[0] = _mm_xor_si128( t[0], chainv[8] );
t[1] = _mm_xor_si128( t[1], chainv[9] );
MULT2( t[0], t[1] );
MULT2( t0, t1 );
msg0 = _mm_shuffle_epi32( msg0, 27 );
msg1 = _mm_shuffle_epi32( msg1, 27 );
chainv[0] = _mm_xor_si128( chainv[0], t[0] );
chainv[1] = _mm_xor_si128( chainv[1], t[1] );
chainv[2] = _mm_xor_si128( chainv[2], t[0] );
chainv[3] = _mm_xor_si128( chainv[3], t[1] );
chainv[4] = _mm_xor_si128( chainv[4], t[0] );
chainv[5] = _mm_xor_si128( chainv[5], t[1] );
chainv[6] = _mm_xor_si128( chainv[6], t[0] );
chainv[7] = _mm_xor_si128( chainv[7], t[1] );
chainv[8] = _mm_xor_si128( chainv[8], t[0] );
chainv[9] = _mm_xor_si128( chainv[9], t[1] );
chainv[0] = _mm_xor_si128( chainv[0], t0 );
chainv[1] = _mm_xor_si128( chainv[1], t1 );
chainv[2] = _mm_xor_si128( chainv[2], t0 );
chainv[3] = _mm_xor_si128( chainv[3], t1 );
chainv[4] = _mm_xor_si128( chainv[4], t0 );
chainv[5] = _mm_xor_si128( chainv[5], t1 );
chainv[6] = _mm_xor_si128( chainv[6], t0 );
chainv[7] = _mm_xor_si128( chainv[7], t1 );
chainv[8] = _mm_xor_si128( chainv[8], t0 );
chainv[9] = _mm_xor_si128( chainv[9], t1 );
t[0] = chainv[0];
t[1] = chainv[1];
t0 = chainv[0];
t1 = chainv[1];
MULT2( chainv[0], chainv[1]);
chainv[0] = _mm_xor_si128( chainv[0], chainv[2] );
chainv[1] = _mm_xor_si128( chainv[1], chainv[3] );
MULT2( chainv[2], chainv[3]);
chainv[2] = _mm_xor_si128(chainv[2], chainv[4]);
chainv[3] = _mm_xor_si128(chainv[3], chainv[5]);
MULT2( chainv[4], chainv[5]);
chainv[4] = _mm_xor_si128(chainv[4], chainv[6]);
chainv[5] = _mm_xor_si128(chainv[5], chainv[7]);
MULT2( chainv[6], chainv[7]);
chainv[6] = _mm_xor_si128(chainv[6], chainv[8]);
chainv[7] = _mm_xor_si128(chainv[7], chainv[9]);
MULT2( chainv[8], chainv[9]);
chainv[8] = _mm_xor_si128( chainv[8], t[0] );
chainv[9] = _mm_xor_si128( chainv[9], t[1] );
t[0] = chainv[8];
t[1] = chainv[9];
t0 = chainv[8] = _mm_xor_si128( chainv[8], t0 );
t1 = chainv[9] = _mm_xor_si128( chainv[9], t1 );
MULT2( chainv[8], chainv[9]);
chainv[8] = _mm_xor_si128( chainv[8], chainv[6] );
chainv[9] = _mm_xor_si128( chainv[9], chainv[7] );
MULT2( chainv[6], chainv[7]);
chainv[6] = _mm_xor_si128( chainv[6], chainv[4] );
chainv[7] = _mm_xor_si128( chainv[7], chainv[5] );
MULT2( chainv[4], chainv[5]);
chainv[4] = _mm_xor_si128( chainv[4], chainv[2] );
chainv[5] = _mm_xor_si128( chainv[5], chainv[3] );
MULT2( chainv[2], chainv[3] );
chainv[2] = _mm_xor_si128( chainv[2], chainv[0] );
chainv[3] = _mm_xor_si128( chainv[3], chainv[1] );
MULT2( chainv[0], chainv[1] );
chainv[0] = _mm_xor_si128( _mm_xor_si128( chainv[0], t[0] ), msg0 );
chainv[1] = _mm_xor_si128( _mm_xor_si128( chainv[1], t[1] ), msg1 );
chainv[0] = _mm_xor_si128( _mm_xor_si128( chainv[0], t0 ), msg0 );
chainv[1] = _mm_xor_si128( _mm_xor_si128( chainv[1], t1 ), msg1 );
MULT2( msg0, msg1);
chainv[2] = _mm_xor_si128( chainv[2], msg0 );
chainv[3] = _mm_xor_si128( chainv[3], msg1 );
MULT2( msg0, msg1);
chainv[4] = _mm_xor_si128( chainv[4], msg0 );
chainv[5] = _mm_xor_si128( chainv[5], msg1 );
MULT2( msg0, msg1);
chainv[6] = _mm_xor_si128( chainv[6], msg0 );
chainv[7] = _mm_xor_si128( chainv[7], msg1 );
MULT2( msg0, msg1);
chainv[8] = _mm_xor_si128( chainv[8], msg0 );
chainv[9] = _mm_xor_si128( chainv[9], msg1 );
MULT2( msg0, msg1);
chainv[3] = mm128_rol_32( chainv[3], 1 );
chainv[5] = mm128_rol_32( chainv[5], 2 );
chainv[7] = mm128_rol_32( chainv[7], 3 );
chainv[9] = mm128_rol_32( chainv[9], 4 );
NMLTOM1024( chainv[0], chainv[2], chainv[4], chainv[6], x0, x1, x2, x3,
chainv[1], chainv[3], chainv[5], chainv[7], x4, x5, x6, x7 );
chainv[3] = _mm_or_si128( _mm_slli_epi32(chainv[3], 1),
_mm_srli_epi32(chainv[3], 31) );
chainv[5] = _mm_or_si128( _mm_slli_epi32(chainv[5], 2),
_mm_srli_epi32(chainv[5], 30) );
chainv[7] = _mm_or_si128( _mm_slli_epi32(chainv[7], 3),
_mm_srli_epi32(chainv[7], 29) );
chainv[9] = _mm_or_si128( _mm_slli_epi32(chainv[9], 4),
_mm_srli_epi32(chainv[9], 28) );
STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 0), cns( 1) );
STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 2), cns( 3) );
STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 4), cns( 5) );
STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 6), cns( 7) );
STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns( 8), cns( 9) );
STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(10), cns(11) );
STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(12), cns(13) );
STEP_PART( x0, x1, x2, x3, x4, x5, x6, x7, cns(14), cns(15) );
MIXTON1024( x0, x1, x2, x3, chainv[0], chainv[2], chainv[4], chainv[6],
x4, x5, x6, x7, chainv[1], chainv[3], chainv[5], chainv[7]);
NMLTOM1024( chainv[0], chainv[2], chainv[4], chainv[6],
x[0], x[1], x[2], x[3],
chainv[1],chainv[3],chainv[5],chainv[7],
x[4], x[5], x[6], x[7] );
STEP_PART( &x[0], &CNS128[ 0], &tmp[0] );
STEP_PART( &x[0], &CNS128[ 2], &tmp[0] );
STEP_PART( &x[0], &CNS128[ 4], &tmp[0] );
STEP_PART( &x[0], &CNS128[ 6], &tmp[0] );
STEP_PART( &x[0], &CNS128[ 8], &tmp[0] );
STEP_PART( &x[0], &CNS128[10], &tmp[0] );
STEP_PART( &x[0], &CNS128[12], &tmp[0] );
STEP_PART( &x[0], &CNS128[14], &tmp[0] );
MIXTON1024( x[0], x[1], x[2], x[3],
chainv[0], chainv[2], chainv[4],chainv[6],
x[4], x[5], x[6], x[7],
chainv[1],chainv[3],chainv[5],chainv[7]);
/* Process last 256-bit block */
STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[16], CNS128[17],
tmp[0], tmp[1] );
STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[18], CNS128[19],
tmp[0], tmp[1] );
STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[20], CNS128[21],
tmp[0], tmp[1] );
STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[22], CNS128[23],
tmp[0], tmp[1] );
STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[24], CNS128[25],
tmp[0], tmp[1] );
STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[26], CNS128[27],
tmp[0], tmp[1] );
STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[28], CNS128[29],
tmp[0], tmp[1] );
STEP_PART2( chainv[8], chainv[9], t[0], t[1], CNS128[30], CNS128[31],
tmp[0], tmp[1] );
STEP_PART2( chainv[8], chainv[9], t0, t1, cns(16), cns(17) );
STEP_PART2( chainv[8], chainv[9], t0, t1, cns(18), cns(19) );
STEP_PART2( chainv[8], chainv[9], t0, t1, cns(20), cns(21) );
STEP_PART2( chainv[8], chainv[9], t0, t1, cns(22), cns(23) );
STEP_PART2( chainv[8], chainv[9], t0, t1, cns(24), cns(25) );
STEP_PART2( chainv[8], chainv[9], t0, t1, cns(26), cns(27) );
STEP_PART2( chainv[8], chainv[9], t0, t1, cns(28), cns(29) );
STEP_PART2( chainv[8], chainv[9], t0, t1, cns(30), cns(31) );
}
@@ -579,51 +550,6 @@ static void rnd512( hashState_luffa *state, __m128i msg1, __m128i msg0 )
/* state: hash context */
/* b[8]: hash values */
#if defined (__AVX2__)
static void finalization512( hashState_luffa *state, uint32 *b )
{
uint32 hash[8] __attribute((aligned(64)));
__m256i* chainv = (__m256i*)state->chainv;
__m256i t;
const __m128i zero = m128_zero;
const __m256i shuff_bswap32 = m256_const_64( 0x1c1d1e1f18191a1b,
0x1415161710111213,
0x0c0d0e0f08090a0b,
0x0405060700010203 );
rnd512( state, zero, zero );
t = chainv[0];
t = _mm256_xor_si256( t, chainv[1] );
t = _mm256_xor_si256( t, chainv[2] );
t = _mm256_xor_si256( t, chainv[3] );
t = _mm256_xor_si256( t, chainv[4] );
t = _mm256_shuffle_epi32( t, 27 );
_mm256_store_si256( (__m256i*)hash, t );
casti_m256i( b, 0 ) = _mm256_shuffle_epi8(
casti_m256i( hash, 0 ), shuff_bswap32 );
rnd512( state, zero, zero );
t = chainv[0];
t = _mm256_xor_si256( t, chainv[1] );
t = _mm256_xor_si256( t, chainv[2] );
t = _mm256_xor_si256( t, chainv[3] );
t = _mm256_xor_si256( t, chainv[4] );
t = _mm256_shuffle_epi32( t, 27 );
_mm256_store_si256( (__m256i*)hash, t );
casti_m256i( b, 1 ) = _mm256_shuffle_epi8(
casti_m256i( hash, 0 ), shuff_bswap32 );
}
#else
static void finalization512( hashState_luffa *state, uint32 *b )
{
uint32 hash[8] __attribute((aligned(64)));
@@ -676,6 +602,5 @@ static void finalization512( hashState_luffa *state, uint32 *b )
casti_m128i( b, 2 ) = mm128_bswap_32( casti_m128i( hash, 0 ) );
casti_m128i( b, 3 ) = mm128_bswap_32( casti_m128i( hash, 1 ) );
}
#endif
/***************************************************/
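
Review note: the rnd512 hunk above writes the rotation of chainv[3], chainv[5], chainv[7] and chainv[9] two ways, once through the project's mm128_rol_32 helper and once as an explicit shift/or pair. The sketch below assumes the helper is a plain rotate-left on each 32-bit lane and checks the shift/or form against a scalar rotate. The names rol32_sse2, rol32_scalar and rol32_selftest are illustrative and do not come from the source tree.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Rotate every 32-bit lane left by c bits (1 <= c <= 31) using SSE2 only.
 * The count is passed in a vector register so it may be a runtime value. */
static inline __m128i rol32_sse2( __m128i v, int c )
{
    const __m128i l = _mm_cvtsi32_si128( c );
    const __m128i r = _mm_cvtsi32_si128( 32 - c );
    return _mm_or_si128( _mm_sll_epi32( v, l ), _mm_srl_epi32( v, r ) );
}

/* Scalar reference rotate, valid for 1 <= c <= 31. */
static inline uint32_t rol32_scalar( uint32_t x, int c )
{
    return ( x << c ) | ( x >> ( 32 - c ) );
}

/* Returns 1 when all four lanes of the vector rotate agree with the
 * scalar rotate for a fixed set of test words, 0 otherwise. */
static int rol32_selftest( int c )
{
    const uint32_t in[4] = { 0x80000001u, 0x12345678u, 0xffffffffu, 0u };
    uint32_t out[4];
    _mm_storeu_si128( (__m128i*)out,
                      rol32_sse2( _mm_loadu_si128( (const __m128i*)in ), c ) );
    for ( int i = 0; i < 4; i++ )
        if ( out[i] != rol32_scalar( in[i], c ) ) return 0;
    return 1;
}
```

Calling rol32_selftest with the counts 1, 2, 3 and 4 covers the four rotations used in the hunk.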

@@ -212,7 +212,7 @@ int scanhash_allium_16way( struct work *work, uint32_t max_nonce,
const uint32_t last_nonce = max_nonce - 16;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m512i sixteen = m512_const1_32( 16 );
const __m512i sixteen = _mm512_set1_epi32( 16 );
if ( bench ) ( (uint32_t*)ptarget )[7] = 0x0000ff;
@@ -230,25 +230,13 @@ int scanhash_allium_16way( struct work *work, uint32_t max_nonce,
block0_hash[7] = _mm512_set1_epi32( phash[7] );
// Build vectored second block, interleave last 16 bytes of data using
// unique nonces, add padding.
// unique nonces.
block_buf[ 0] = _mm512_set1_epi32( pdata[16] );
block_buf[ 1] = _mm512_set1_epi32( pdata[17] );
block_buf[ 2] = _mm512_set1_epi32( pdata[18] );
block_buf[ 3] =
_mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+ 1, n );
block_buf[ 4] = m512_const1_32( 0x80000000 );
block_buf[ 5] =
block_buf[ 6] =
block_buf[ 7] =
block_buf[ 8] =
block_buf[ 9] =
block_buf[10] =
block_buf[11] =
block_buf[12] = m512_zero;
block_buf[13] = m512_one_32;
block_buf[14] = m512_zero;
block_buf[15] = m512_const1_32( 80*8 );
// Partialy prehash second block without touching nonces in block_buf[3].
blake256_16way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
@@ -410,7 +398,7 @@ int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
uint32_t n = first_nonce;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m256i eight = m256_const1_32( 8 );
const __m256i eight = _mm256_set1_epi32( 8 );
// Prehash first block
blake256_transform_le( phash, pdata, 512, 0 );
@@ -425,24 +413,12 @@ int scanhash_allium_8way( struct work *work, uint32_t max_nonce,
block0_hash[7] = _mm256_set1_epi32( phash[7] );
// Build vectored second block, interleave last 16 bytes of data using
// unique nonces and add padding.
// unique nonces.
block_buf[ 0] = _mm256_set1_epi32( pdata[16] );
block_buf[ 1] = _mm256_set1_epi32( pdata[17] );
block_buf[ 2] = _mm256_set1_epi32( pdata[18] );
block_buf[ 3] =
_mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+ 1, n );
block_buf[ 4] = m256_const1_32( 0x80000000 );
block_buf[ 5] =
block_buf[ 6] =
block_buf[ 7] =
block_buf[ 8] =
block_buf[ 9] =
block_buf[10] =
block_buf[11] =
block_buf[12] = m256_zero;
block_buf[13] = m256_one_32;
block_buf[14] = m256_zero;
block_buf[15] = m256_const1_32( 80*8 );
block_buf[ 3] = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4,
n+ 3, n+ 2, n+ 1, n );
// Partialy prehash second block without touching nonces
blake256_8way_round0_prehash_le( midstate_vars, block0_hash, block_buf );

@@ -75,7 +75,7 @@ void lyra2rev2_16way_hash( void *state, const void *input )
keccak256_8way_close( &ctx.keccak, vhash );
dintrlv_8x64( hash8, hash9, hash10, hash11,
hash12, hash13, hash14, hash5, vhash, 256 );
hash12, hash13, hash14, hash15, vhash, 256 );
cubehash_full( &ctx.cube, (byte*) hash0, 256, (const byte*) hash0, 32 );
cubehash_full( &ctx.cube, (byte*) hash1, 256, (const byte*) hash1, 32 );
@@ -203,7 +203,7 @@ int scanhash_lyra2rev2_16way( struct work *work, const uint32_t max_nonce,
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm512_add_epi32( *noncev, m512_const1_32( 16 ) );
*noncev = _mm512_add_epi32( *noncev, _mm512_set1_epi32( 16 ) );
n += 16;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
pdata[19] = n;
@@ -345,7 +345,7 @@ int scanhash_lyra2rev2_8way( struct work *work, const uint32_t max_nonce,
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
*noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
n += 8;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
pdata[19] = n;

@@ -287,7 +287,7 @@ int scanhash_lyra2rev3_8way( struct work *work, const uint32_t max_nonce,
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
*noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
n += 8;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
pdata[19] = n;
@@ -389,7 +389,7 @@ int scanhash_lyra2rev3_4way( struct work *work, const uint32_t max_nonce,
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm_add_epi32( *noncev, m128_const1_32( 4 ) );
*noncev = _mm_add_epi32( *noncev, _mm_set1_epi32( 4 ) );
n += 4;
} while ( (n < max_nonce-4) && !work_restart[thr_id].restart);
pdata[19] = n;

@@ -103,7 +103,7 @@ int scanhash_lyra2z_16way( struct work *work, uint32_t max_nonce,
const uint32_t last_nonce = max_nonce - 16;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m512i sixteen = m512_const1_32( 16 );
const __m512i sixteen = _mm512_set1_epi32( 16 );
if ( bench ) ( (uint32_t*)ptarget )[7] = 0x0000ff;
@@ -120,25 +120,13 @@ int scanhash_lyra2z_16way( struct work *work, uint32_t max_nonce,
block0_hash[7] = _mm512_set1_epi32( phash[7] );
// Build vectored second block, interleave last 16 bytes of data using
// unique nonces and add padding.
// unique nonces.
block_buf[ 0] = _mm512_set1_epi32( pdata[16] );
block_buf[ 1] = _mm512_set1_epi32( pdata[17] );
block_buf[ 2] = _mm512_set1_epi32( pdata[18] );
block_buf[ 3] =
_mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+ 9, n+ 8,
n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
block_buf[ 4] = m512_const1_32( 0x80000000 );
block_buf[ 5] =
block_buf[ 6] =
block_buf[ 7] =
block_buf[ 8] =
block_buf[ 9] =
block_buf[10] =
block_buf[11] =
block_buf[12] = m512_zero;
block_buf[13] = m512_one_32;
block_buf[14] = m512_zero;
block_buf[15] = m512_const1_32( 80*8 );
// Partialy prehash second block without touching nonces in block_buf[3].
blake256_16way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
@@ -225,7 +213,7 @@ int scanhash_lyra2z_8way( struct work *work, uint32_t max_nonce,
uint32_t n = first_nonce;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m256i eight = m256_const1_32( 8 );
const __m256i eight = _mm256_set1_epi32( 8 );
// Prehash first block
blake256_transform_le( phash, pdata, 512, 0 );
@@ -240,24 +228,12 @@ int scanhash_lyra2z_8way( struct work *work, uint32_t max_nonce,
block0_hash[7] = _mm256_set1_epi32( phash[7] );
// Build vectored second block, interleave last 16 bytes of data using
// unique nonces and add padding.
// unique nonces.
block_buf[ 0] = _mm256_set1_epi32( pdata[16] );
block_buf[ 1] = _mm256_set1_epi32( pdata[17] );
block_buf[ 2] = _mm256_set1_epi32( pdata[18] );
block_buf[ 3] =
_mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n +1, n );
block_buf[ 4] = m256_const1_32( 0x80000000 );
block_buf[ 5] =
block_buf[ 6] =
block_buf[ 7] =
block_buf[ 8] =
block_buf[ 9] =
block_buf[10] =
block_buf[11] =
block_buf[12] = m256_zero;
block_buf[13] = m256_one_32;
block_buf[14] = m256_zero;
block_buf[15] = m256_const1_32( 80*8 );
// Partialy prehash second block without touching nonces
blake256_8way_round0_prehash_le( midstate_vars, block0_hash, block_buf );
@@ -352,7 +328,7 @@ int scanhash_lyra2z_4way( struct work *work, uint32_t max_nonce,
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm_add_epi32( *noncev, m128_const1_32( 4 ) );
*noncev = _mm_add_epi32( *noncev, _mm_set1_epi32( 4 ) );
n += 4;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );

@@ -3,7 +3,7 @@
#include "lyra2.h"
#include "simd-utils.h"
__thread uint64_t* lyra2z330_wholeMatrix;
static __thread uint64_t* lyra2z330_wholeMatrix;
void lyra2z330_hash(void *state, const void *input, uint32_t height)
{

@@ -85,10 +85,10 @@ inline void absorbBlockBlake2Safe_2way( uint64_t *State, const uint64_t *In,
state0 =
state1 = m512_zero;
state2 = m512_const4_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state3 = m512_const4_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
state2 = _mm512_set4_epi64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state3 = _mm512_set4_epi64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
for ( int i = 0; i < nBlocks; i++ )
{

@@ -41,17 +41,17 @@
inline void initState( uint64_t State[/*16*/] )
{
/*
/*
#if defined (__AVX2__)
__m256i* state = (__m256i*)State;
const __m256i zero = m256_zero;
state[0] = zero;
state[1] = zero;
state[2] = m256_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state[3] = m256_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
state[2] = _mm256_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state[3] = _mm256_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
#elif defined (__SSE2__)
@@ -62,10 +62,10 @@ inline void initState( uint64_t State[/*16*/] )
state[1] = zero;
state[2] = zero;
state[3] = zero;
state[4] = m128_const_64( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state[5] = m128_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
state[6] = m128_const_64( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
state[7] = m128_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );
state[4] = _mm_set_epi64x( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state[5] = _mm_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
state[6] = _mm_set_epi64x( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
state[7] = _mm_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );
#else
//First 512 bis are zeros
@@ -271,10 +271,10 @@ inline void absorbBlockBlake2Safe( uint64_t *State, const uint64_t *In,
state0 =
state1 = m256_zero;
state2 = m256_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state3 = m256_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
state2 = _mm256_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL,
0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state3 = _mm256_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL,
0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
for ( int i = 0; i < nBlocks; i++ )
{
@@ -299,10 +299,10 @@ inline void absorbBlockBlake2Safe( uint64_t *State, const uint64_t *In,
state1 =
state2 =
state3 = m128_zero;
state4 = m128_const_64( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state5 = m128_const_64( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
state6 = m128_const_64( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
state7 = m128_const_64( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );
state4 = _mm_set_epi64x( 0xbb67ae8584caa73bULL, 0x6a09e667f3bcc908ULL );
state5 = _mm_set_epi64x( 0xa54ff53a5f1d36f1ULL, 0x3c6ef372fe94f82bULL );
state6 = _mm_set_epi64x( 0x9b05688c2b3e6c1fULL, 0x510e527fade682d1ULL );
state7 = _mm_set_epi64x( 0x5be0cd19137e2179ULL, 0x1f83d9abfb41bd6bULL );
for ( int i = 0; i < nBlocks; i++ )
{
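
Review note: the hunks above replace the m128_const_64 and m256_const_64 helpers with _mm_set_epi64x and _mm256_set_epi64x while keeping the argument order. The set intrinsics take their arguments from the most significant lane down to the least significant one, so the last argument lands in lane 0. The sketch below checks that with the first pair of constants from the diff; set_epi64x_lane is a hypothetical helper written only for this illustration.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Build a vector with _mm_set_epi64x( hi, lo ) and return lane i,
 * where lane 0 is the least significant 64 bits and lane 1 the most. */
static uint64_t set_epi64x_lane( uint64_t hi, uint64_t lo, int i )
{
    uint64_t out[2];
    const __m128i v = _mm_set_epi64x( (long long)hi, (long long)lo );
    _mm_storeu_si128( (__m128i*)out, v );
    return out[i];
}
```

With the arguments used for state[4], lane 0 holds 0x6a09e667f3bcc908 and lane 1 holds 0xbb67ae8584caa73b, which is the order the constants need in memory.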

@@ -43,27 +43,29 @@ static const uint64_t blake2b_IV[8] =
0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL
};
/*Blake2b's rotation*/
static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
return ( w >> c ) | ( w << ( 64 - c ) );
}
// serial data is only 32 bytes so AVX2 is the limit for that dimension.
// However, 2 way parallel looks trivial to code for AVX512 except for
// a data dependency with rowa.
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
#define G2W_4X64(a,b,c,d) \
a = _mm512_add_epi64( a, b ); \
d = mm512_ror_64( _mm512_xor_si512( d, a ), 32 ); \
d = _mm512_ror_epi64( _mm512_xor_si512( d, a ), 32 ); \
c = _mm512_add_epi64( c, d ); \
b = mm512_ror_64( _mm512_xor_si512( b, c ), 24 ); \
b = _mm512_ror_epi64( _mm512_xor_si512( b, c ), 24 ); \
a = _mm512_add_epi64( a, b ); \
d = mm512_ror_64( _mm512_xor_si512( d, a ), 16 ); \
d = _mm512_ror_epi64( _mm512_xor_si512( d, a ), 16 ); \
c = _mm512_add_epi64( c, d ); \
b = mm512_ror_64( _mm512_xor_si512( b, c ), 63 );
b = _mm512_ror_epi64( _mm512_xor_si512( b, c ), 63 );
#define LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
G2W_4X64( s0, s1, s2, s3 ); \
s0 = mm512_shufll256_64( s0 ); \
s3 = mm512_swap256_128( s3); \
s2 = mm512_shuflr256_64( s2 ); \
G2W_4X64( s0, s1, s2, s3 ); \
s0 = mm512_shuflr256_64( s0 ); \
s3 = mm512_swap256_128( s3 ); \
s2 = mm512_shufll256_64( s2 );
/*
#define LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
G2W_4X64( s0, s1, s2, s3 ); \
s3 = mm512_shufll256_64( s3 ); \
@@ -73,6 +75,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
s3 = mm512_shuflr256_64( s3 ); \
s1 = mm512_shufll256_64( s1 ); \
s2 = mm512_swap256_128( s2 );
*/
#define LYRA_12_ROUNDS_2WAY_AVX512( s0, s1, s2, s3 ) \
LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
@@ -88,13 +91,10 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 ) \
LYRA_ROUND_2WAY_AVX512( s0, s1, s2, s3 )
#endif // AVX512
#if defined __AVX2__
#if defined(__AVX2__)
// process 4 columns in parallel
// returns void, updates all args
#define G_4X64(a,b,c,d) \
a = _mm256_add_epi64( a, b ); \
d = mm256_swap64_32( _mm256_xor_si256( d, a ) ); \
@@ -105,6 +105,18 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
c = _mm256_add_epi64( c, d ); \
b = mm256_ror_64( _mm256_xor_si256( b, c ), 63 );
// Pivot about s1 instead of s0 reduces latency.
#define LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
G_4X64( s0, s1, s2, s3 ); \
s0 = mm256_shufll_64( s0 ); \
s3 = mm256_swap_128( s3); \
s2 = mm256_shuflr_64( s2 ); \
G_4X64( s0, s1, s2, s3 ); \
s0 = mm256_shuflr_64( s0 ); \
s3 = mm256_swap_128( s3 ); \
s2 = mm256_shufll_64( s2 );
/*
#define LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
G_4X64( s0, s1, s2, s3 ); \
s3 = mm256_shufll_64( s3 ); \
@@ -114,6 +126,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
s3 = mm256_shuflr_64( s3 ); \
s1 = mm256_shufll_64( s1 ); \
s2 = mm256_swap_128( s2 );
*/
#define LYRA_12_ROUNDS_AVX2( s0, s1, s2, s3 ) \
LYRA_ROUND_AVX2( s0, s1, s2, s3 ) \
@@ -146,14 +159,25 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
b = mm128_ror_64( _mm_xor_si128( b, c ), 63 );
#define LYRA_ROUND_AVX(s0,s1,s2,s3,s4,s5,s6,s7) \
{ \
__m128i t; \
G_2X64( s0, s2, s4, s6 ); \
G_2X64( s1, s3, s5, s7 ); \
mm128_vrol256_64( s6, s7 ); \
mm128_vror256_64( s2, s3 ); \
t = mm128_alignr_64( s7, s6, 1 ); \
s6 = mm128_alignr_64( s6, s7, 1 ); \
s7 = t; \
t = mm128_alignr_64( s2, s3, 1 ); \
s2 = mm128_alignr_64( s3, s2, 1 ); \
s3 = t; \
G_2X64( s0, s2, s5, s6 ); \
G_2X64( s1, s3, s4, s7 ); \
mm128_vror256_64( s6, s7 ); \
mm128_vrol256_64( s2, s3 );
t = mm128_alignr_64( s6, s7, 1 ); \
s6 = mm128_alignr_64( s7, s6, 1 ); \
s7 = t; \
t = mm128_alignr_64( s3, s2, 1 ); \
s2 = mm128_alignr_64( s2, s3, 1 ); \
s3 = t; \
}
#define LYRA_12_ROUNDS_AVX(s0,s1,s2,s3,s4,s5,s6,s7) \
LYRA_ROUND_AVX(s0,s1,s2,s3,s4,s5,s6,s7) \
@@ -171,8 +195,13 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
#endif // AVX2 else SSE2
// Scalar
//Blake2b's G function
/*
// Scalar, not used.
static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
return ( w >> c ) | ( w << ( 64 - c ) );
}
#define G(r,i,a,b,c,d) \
do { \
a = a + b; \
@@ -185,8 +214,6 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
b = rotr64(b ^ c, 63); \
} while(0)
/*One Round of the Blake2b's compression function*/
#define ROUND_LYRA(r) \
G(r,0,v[ 0],v[ 4],v[ 8],v[12]); \
G(r,1,v[ 1],v[ 5],v[ 9],v[13]); \
@@ -196,6 +223,7 @@ static inline uint64_t rotr64( const uint64_t w, const unsigned c ){
G(r,5,v[ 1],v[ 6],v[11],v[12]); \
G(r,6,v[ 2],v[ 7],v[ 8],v[13]); \
G(r,7,v[ 3],v[ 4],v[ 9],v[14]);
*/
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
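
Review note: the G2W_4X64 macro in this file applies the same add, xor and rotate sequence to every 64-bit lane, with rotation counts 32, 24, 16 and 63. A minimal scalar sketch of one lane follows, assuming the vector macros are meant to match it lane for lane. The g_state struct and the lyra_g name are illustrative only; rotr64 mirrors the helper shown in the diff.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by c bits, valid for 1 <= c <= 63. */
static inline uint64_t rotr64( uint64_t w, unsigned c )
{
    return ( w >> c ) | ( w << ( 64 - c ) );
}

/* Hypothetical container for the four words one G call works on. */
typedef struct { uint64_t a, b, c, d; } g_state;

/* One G step without message words: the same order of operations
 * and the same rotation counts as the G2W_4X64 macro. */
static g_state lyra_g( g_state s )
{
    s.a += s.b;  s.d = rotr64( s.d ^ s.a, 32 );
    s.c += s.d;  s.b = rotr64( s.b ^ s.c, 24 );
    s.a += s.b;  s.d = rotr64( s.d ^ s.a, 16 );
    s.c += s.d;  s.b = rotr64( s.b ^ s.c, 63 );
    return s;
}
```

A scalar version like this can serve as a reference when checking a vector implementation one lane at a time.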

@@ -15,7 +15,8 @@
#if defined (ANIME_8WAY)
typedef struct {
union _anime_8way_context_overlay
{
blake512_8way_context blake;
bmw512_8way_context bmw;
#if defined(__VAES__)
@@ -26,23 +27,9 @@ typedef struct {
jh512_8way_context jh;
skein512_8way_context skein;
keccak512_8way_context keccak;
} anime_8way_ctx_holder;
} __attribute__ ((aligned (64)));
anime_8way_ctx_holder anime_8way_ctx __attribute__ ((aligned (64)));
void init_anime_8way_ctx()
{
blake512_8way_init( &anime_8way_ctx.blake );
bmw512_8way_init( &anime_8way_ctx.bmw );
#if defined(__VAES__)
groestl512_4way_init( &anime_8way_ctx.groestl, 64 );
#else
init_groestl( &anime_8way_ctx.groestl, 64 );
#endif
skein512_8way_init( &anime_8way_ctx.skein );
jh512_8way_init( &anime_8way_ctx.jh );
keccak512_8way_init( &anime_8way_ctx.keccak );
}
typedef union _anime_8way_context_overlay anime_8way_context_overlay;
void anime_8way_hash( void *state, const void *input )
{
@@ -64,18 +51,15 @@ void anime_8way_hash( void *state, const void *input )
__m512i* vhA = (__m512i*)vhashA;
__m512i* vhB = (__m512i*)vhashB;
__m512i* vhC = (__m512i*)vhashC;
const __m512i bit3_mask = m512_const1_64( 8 );
const __m512i zero = _mm512_setzero_si512();
const __m512i bit3_mask = _mm512_set1_epi64( 8 );
__mmask8 vh_mask;
anime_8way_ctx_holder ctx;
memcpy( &ctx, &anime_8way_ctx, sizeof(anime_8way_ctx) );
anime_8way_context_overlay ctx __attribute__ ((aligned (64)));
bmw512_8way_full( &ctx.bmw, vhash, input, 80 );
blake512_8way_full( &ctx.blake, vhash, vhash, 64 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );
#if defined(__VAES__)
@@ -152,8 +136,7 @@ void anime_8way_hash( void *state, const void *input )
jh512_8way_update( &ctx.jh, vhash, 64 );
jh512_8way_close( &ctx.jh, vhash );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );
if ( ( vh_mask & 0xff ) != 0xff )
blake512_8way_full( &ctx.blake, vhashA, vhash, 64 );
@@ -168,8 +151,7 @@ void anime_8way_hash( void *state, const void *input )
skein512_8way_full( &ctx.skein, vhash, vhash, 64 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );
if ( ( vh_mask & 0xff ) != 0xff )
{
@@ -227,7 +209,7 @@ int scanhash_anime_8way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
pdata[19] = n;
@@ -237,14 +219,20 @@ int scanhash_anime_8way( struct work *work, uint32_t max_nonce,
#elif defined (ANIME_4WAY)
typedef struct {
union _anime_4way_context_overlay
{
blake512_4way_context blake;
bmw512_4way_context bmw;
hashState_groestl groestl;
jh512_4way_context jh;
skein512_4way_context skein;
keccak512_4way_context keccak;
} anime_4way_ctx_holder;
#if defined(__VAES__)
groestl512_2way_context groestl2;
#endif
} __attribute__ ((aligned (64)));
typedef union _anime_4way_context_overlay anime_4way_context_overlay;
void anime_4way_hash( void *state, const void *input )
{
@@ -260,9 +248,9 @@ void anime_4way_hash( void *state, const void *input )
__m256i* vhB = (__m256i*)vhashB;
__m256i vh_mask;
int h_mask;
const __m256i bit3_mask = m256_const1_64( 8 );
const __m256i bit3_mask = _mm256_set1_epi64x( 8 );
const __m256i zero = _mm256_setzero_si256();
anime_4way_ctx_holder ctx;
anime_4way_context_overlay ctx __attribute__ ((aligned (64)));
bmw512_4way_init( &ctx.bmw );
bmw512_4way_update( &ctx.bmw, input, 80 );
@@ -293,7 +281,18 @@ void anime_4way_hash( void *state, const void *input )
mm256_blend_hash_4x64( vh, vhA, vhB, vh_mask );
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
#if defined(__VAES__)
rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
groestl512_2way_full( &ctx.groestl2, vhashA, vhashA, 64 );
groestl512_2way_full( &ctx.groestl2, vhashB, vhashB, 64 );
rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
#else
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
groestl512_full( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
@@ -302,6 +301,8 @@ void anime_4way_hash( void *state, const void *input )
intrlv_4x64( vhash, hash0, hash1, hash2, hash3, 512 );
#endif
jh512_4way_init( &ctx.jh );
jh512_4way_update( &ctx.jh, vhash, 64 );
jh512_4way_close( &ctx.jh, vhash );
@@ -387,7 +388,7 @@ int scanhash_anime_4way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
pdata[19] = n;
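
Review note: the nonce update above combines a 32-bit lane add with a 64-bit broadcast constant whose low half is zero, so only the upper 32-bit word of each 64-bit lane is incremented. The sketch below shows the same effect with the 128-bit SSE2 equivalents. That the nonce occupies the upper word of each lane in the interleaved buffer is an assumption drawn from the constant, and add_to_odd_words and demo_word are hypothetical names.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Add step to the upper 32-bit word of each 64-bit lane (w[1] and w[3])
 * and leave the lower words (w[0] and w[2]) unchanged. */
static void add_to_odd_words( uint32_t w[4], uint32_t step )
{
    const __m128i inc = _mm_set1_epi64x( (long long)( (uint64_t)step << 32 ) );
    __m128i v = _mm_loadu_si128( (const __m128i*)w );
    v = _mm_add_epi32( v, inc );   /* 32-bit lanes: no carry between words */
    _mm_storeu_si128( (__m128i*)w, v );
}

/* Apply a step of 4 to a fixed buffer and return word i of the result. */
static uint32_t demo_word( int i )
{
    uint32_t w[4] = { 0xffffffffu, 100u, 7u, 0xfffffffeu };
    add_to_odd_words( w, 4 );
    return w[i];
}
```

Because the add is done per 32-bit lane, an overflow in one word wraps around without touching its neighbour.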

@@ -13,6 +13,7 @@
#include "algo/cubehash/cubehash_sse2.h"
#include "algo/simd/nist.h"
#include "algo/shavite/sph_shavite.h"
#include "algo/shavite/shavite-hash-2way.h"
#include "algo/simd/simd-hash-2way.h"
#include "algo/echo/aes_ni/hash_api.h"
#include "algo/hamsi/hamsi-hash-4way.h"
@@ -74,7 +75,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
uint32_t hash7 [16] __attribute__ ((aligned (32)));
hmq1725_8way_context_overlay ctx __attribute__ ((aligned (64)));
__mmask8 vh_mask;
const __m512i vmask = m512_const1_64( 24 );
const __m512i vmask = _mm512_set1_epi64( 24 );
const uint32_t mask = 24;
__m512i* vh = (__m512i*)vhash;
__m512i* vhA = (__m512i*)vhashA;
@@ -98,8 +99,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3,
hash4, hash5, hash6, hash7 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
m512_zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
// A
#if defined(__VAES__)
@@ -154,8 +154,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
keccak512_8way_update( &ctx.keccak, vhash, 64 );
keccak512_8way_close( &ctx.keccak, vhash );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
m512_zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
// A
if ( ( vh_mask & 0xff ) != 0xff )
@@ -174,8 +173,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
cube_4way_full( &ctx.cube, vhashB, 512, vhashB, 64 );
rintrlv_4x128_8x64( vhash, vhashA, vhashB, 512 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
m512_zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
if ( likely( ( vh_mask & 0xff ) != 0xff ) )
{
@@ -223,8 +221,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
simd512_4way_full( &ctx.simd, vhashB, vhashB, 64 );
rintrlv_4x128_8x64( vhash, vhashA, vhashB, 512 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
m512_zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
dintrlv_8x64_512( hash0, hash1, hash2, hash3,
hash4, hash5, hash6, hash7, vhash );
// 4x32 for haval
@@ -302,8 +299,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
blake512_8way_full( &ctx.blake, vhash, vhash, 64 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
m512_zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
// A
#if defined(__VAES__)
@@ -374,8 +370,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3,
hash4, hash5, hash6, hash7 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
m512_zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
// A
#if defined(__VAES__)
@@ -455,8 +450,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3,
hash4, hash5, hash6, hash7 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
m512_zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
if ( hash0[0] & mask )
fugue512_full( &ctx.fugue, hash0, hash0, 64 );
@@ -520,8 +514,7 @@ extern void hmq1725_8way_hash(void *state, const void *input)
sha512_8way_update( &ctx.sha512, vhash, 64 );
sha512_8way_close( &ctx.sha512, vhash );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ),
m512_zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], vmask );
dintrlv_8x64_512( hash0, hash1, hash2, hash3,
hash4, hash5, hash6, hash7, vhash );
@@ -600,7 +593,7 @@ int scanhash_hmq1725_8way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
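
Review note: the hunks above replace the pair _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], vmask ), zero ) with a single _mm512_testn_epi64_mask( vh[0], vmask ). Both set a mask bit when the AND of the two operands is zero in that 64-bit lane. The scalar model below is only an illustration of that per-lane rule and does not use AVX-512; the function names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of an 8-lane test: bit i of the result is set when
 * ( a[i] & b ) == 0, the condition both vector forms compute. */
static uint8_t testn_mask_model( const uint64_t a[8], uint64_t b )
{
    uint8_t m = 0;
    for ( int i = 0; i < 8; i++ )
        if ( ( a[i] & b ) == 0 )
            m |= (uint8_t)( 1u << i );
    return m;
}

/* Evaluate the model on fixed data with the bit 3 mask (value 8). */
static uint8_t demo_bit3_mask( void )
{
    const uint64_t h[8] = { 0, 8, 7, 15, 16, 24, 0xfff7, 9 };
    return testn_mask_model( h, 8 );
}
```

In the sample data only lanes 0, 2, 4 and 6 have bit 3 clear, so the model returns 0x55.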
@@ -625,6 +618,7 @@ union _hmq1725_4way_context_overlay
cube_2way_context cube2;
sph_shavite512_context shavite;
hashState_sd sd;
shavite512_2way_context shavite2;
simd_2way_context simd;
hashState_echo echo;
hamsi512_4way_context hamsi;
@@ -633,6 +627,10 @@ union _hmq1725_4way_context_overlay
sph_whirlpool_context whirlpool;
sha512_4way_context sha512;
haval256_5_4way_context haval;
#if defined(__VAES__)
groestl512_2way_context groestl2;
echo_2way_context echo2;
#endif
} __attribute__ ((aligned (64)));
typedef union _hmq1725_4way_context_overlay hmq1725_4way_context_overlay;
@@ -649,7 +647,7 @@ extern void hmq1725_4way_hash(void *state, const void *input)
hmq1725_4way_context_overlay ctx __attribute__ ((aligned (64)));
__m256i vh_mask;
int h_mask;
const __m256i vmask = m256_const1_64( 24 );
const __m256i vmask = _mm256_set1_epi64x( 24 );
const uint32_t mask = 24;
__m256i* vh = (__m256i*)vhash;
__m256i* vhA = (__m256i*)vhashA;
@@ -750,15 +748,10 @@ extern void hmq1725_4way_hash(void *state, const void *input)
mm256_blend_hash_4x64( vh, vhA, vhB, vh_mask );
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
shavite512_full( &ctx.shavite, hash0, hash0, 64 );
shavite512_full( &ctx.shavite, hash1, hash1, 64 );
shavite512_full( &ctx.shavite, hash2, hash2, 64 );
shavite512_full( &ctx.shavite, hash3, hash3, 64 );
intrlv_2x128_512( vhashA, hash0, hash1 );
intrlv_2x128_512( vhashB, hash2, hash3 );
shavite512_2way_full( &ctx.shavite2, vhashA, vhashA, 64 );
shavite512_2way_full( &ctx.shavite2, vhashB, vhashB, 64 );
simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );
@@ -795,6 +788,17 @@ extern void hmq1725_4way_hash(void *state, const void *input)
mm256_blend_hash_4x64( vh, vhA, vhB, vh_mask );
#if defined(__VAES__)
rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
echo_2way_full( &ctx.echo2, vhashA, 512, vhashA, 64 );
echo_2way_full( &ctx.echo2, vhashB, 512, vhashB, 64 );
rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
#else
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
echo_full( &ctx.echo, (BitSequence *)hash0, 512,
@@ -807,7 +811,9 @@ extern void hmq1725_4way_hash(void *state, const void *input)
(const BitSequence *)hash3, 64 );
intrlv_4x64( vhash, hash0, hash1, hash2, hash3, 512 );
#endif
blake512_4way_full( &ctx.blake, vhash, vhash, 64 );
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
@@ -939,6 +945,17 @@ extern void hmq1725_4way_hash(void *state, const void *input)
mm256_blend_hash_4x64( vh, vhA, vhB, vh_mask );
#if defined(__VAES__)
rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
groestl512_2way_full( &ctx.groestl2, vhashA, vhashA, 64 );
groestl512_2way_full( &ctx.groestl2, vhashB, vhashB, 64 );
rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
#else
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
@@ -948,6 +965,8 @@ extern void hmq1725_4way_hash(void *state, const void *input)
intrlv_4x64( vhash, hash0, hash1, hash2, hash3, 512 );
#endif
sha512_4way_init( &ctx.sha512 );
sha512_4way_update( &ctx.sha512, vhash, 64 );
sha512_4way_close( &ctx.sha512, vhash );
@@ -1022,7 +1041,7 @@ int scanhash_hmq1725_4way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
pdata[19] = n;
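
Review note on the nonce update in the hunks above: the increment is a 64-bit broadcast constant (`0x0000000800000000` for 8 ways, `0x0000000400000000` for 4 ways) added with a 32-bit lane add. Read that way, only the upper 32-bit half of each 64-bit lane advances, and nothing carries between halves. The scalar model below is a sketch under that reading, not code from the repository; `lane_add_epi32` and `bump_nonce` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of a 32-bit lane add such as
   _mm256_add_epi32( v, c ): the two 32-bit halves are added
   independently and each wraps modulo 2^32, no carry crosses over. */
static uint64_t lane_add_epi32( uint64_t v, uint64_t c )
{
    uint32_t lo = (uint32_t)v + (uint32_t)c;
    uint32_t hi = (uint32_t)( v >> 32 ) + (uint32_t)( c >> 32 );
    return ( (uint64_t)hi << 32 ) | lo;
}

/* Assumption: the nonce sits in the upper half of the lane, so the
   increment constant is `ways` shifted into the upper 32 bits. */
static uint64_t bump_nonce( uint64_t lane, uint32_t ways )
{
    return lane_add_epi32( lane, (uint64_t)ways << 32 );
}
```

With this model, `bump_nonce( lane, 8 )` corresponds to adding `0x0000000800000000` to every lane, which matches the `n += 8` that follows in the 8-way loop.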

View File

@@ -67,8 +67,7 @@ void quark_8way_hash( void *state, const void *input )
__mmask8 vh_mask;
quark_8way_ctx_holder ctx;
const uint32_t mask = 8;
const __m512i bit3_mask = m512_const1_64( mask );
const __m512i zero = _mm512_setzero_si512();
const __m512i bit3_mask = _mm512_set1_epi64( mask );
memcpy( &ctx, &quark_8way_ctx, sizeof(quark_8way_ctx) );
@@ -76,9 +75,7 @@ void quark_8way_hash( void *state, const void *input )
bmw512_8way_full( &ctx.bmw, vhash, vhash, 64 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );
#if defined(__VAES__)
@@ -154,8 +151,7 @@ void quark_8way_hash( void *state, const void *input )
jh512_8way_update( &ctx.jh, vhash, 64 );
jh512_8way_close( &ctx.jh, vhash );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );
if ( ( vh_mask & 0xff ) != 0xff )
blake512_8way_full( &ctx.blake, vhashA, vhash, 64 );
@@ -169,8 +165,7 @@ void quark_8way_hash( void *state, const void *input )
skein512_8way_full( &ctx.skein, vhash, vhash, 64 );
vh_mask = _mm512_cmpeq_epi64_mask( _mm512_and_si512( vh[0], bit3_mask ),
zero );
vh_mask = _mm512_testn_epi64_mask( vh[0], bit3_mask );
if ( ( vh_mask & 0xff ) != 0xff )
{
@@ -229,7 +224,7 @@ int scanhash_quark_8way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
@@ -276,7 +271,7 @@ void quark_4way_hash( void *state, const void *input )
__m256i vh_mask;
int h_mask;
quark_4way_ctx_holder ctx;
const __m256i bit3_mask = m256_const1_64( 8 );
const __m256i bit3_mask = _mm256_set1_epi64x( 8 );
const __m256i zero = _mm256_setzero_si256();
memcpy( &ctx, &quark_4way_ctx, sizeof(quark_4way_ctx) );
@@ -402,7 +397,7 @@ int scanhash_quark_4way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( ( n < last_nonce ) && !work_restart[thr_id].restart ) );
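
Review note on the mask change in `quark_8way_hash` (and the matching hunks in `hmq1725_8way_hash`): the two-step `_mm512_cmpeq_epi64_mask( _mm512_and_si512( a, b ), zero )` is replaced by a single `_mm512_testn_epi64_mask( a, b )`, which sets mask bit i when `a[i] & b[i]` is zero. The portable sketch below models both forms on plain arrays to show that they produce the same mask; the function names are hypothetical and no AVX-512 hardware is needed to run it.

```c
#include <stdint.h>

/* Old form: AND the lanes first, then compare every result with zero. */
static uint8_t mask_and_cmpeq_zero( const uint64_t a[8], const uint64_t b[8] )
{
    uint64_t t[8];
    uint8_t k = 0;
    for ( int i = 0; i < 8; i++ ) t[i] = a[i] & b[i];
    for ( int i = 0; i < 8; i++ )
        if ( t[i] == 0 ) k |= (uint8_t)( 1u << i );
    return k;
}

/* New form: model of _mm512_testn_epi64_mask, bit i set when
   ( a[i] & b[i] ) == 0. */
static uint8_t mask_testn( const uint64_t a[8], const uint64_t b[8] )
{
    uint8_t k = 0;
    for ( int i = 0; i < 8; i++ )
        k |= (uint8_t)( ( ( a[i] & b[i] ) == 0 ) << i );
    return k;
}

/* Returns 1 if both forms agree on a sample tested against bit 3.
   The sample values are 1, 5, 9, 13, 17, 21, 25, 29; bit 3 is clear
   in lanes 0, 1, 4 and 5, so the expected mask is 0x33. */
static int masks_agree( void )
{
    uint64_t a[8], b[8];
    for ( int i = 0; i < 8; i++ )
    {
        a[i] = (uint64_t)i * 4u + 1u;
        b[i] = 8;
    }
    uint8_t k_old = mask_and_cmpeq_zero( a, b );
    uint8_t k_new = mask_testn( a, b );
    return k_old == k_new && k_new == 0x33;
}
```

Because the result is identical, the `zero` vector becomes unnecessary in the 8-way function, which is consistent with its declaration disappearing in the first hunk.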

View File

@@ -4,24 +4,6 @@
#include <string.h>
#include <stdio.h>
long double lbry_calc_network_diff( struct work *work )
{
// sample for diff 43.281 : 1c05ea29
// todo: endian reversed on longpoll could be zr5 specific...
uint32_t nbits = swab32( work->data[ LBRY_NBITS_INDEX ] );
uint32_t bits = (nbits & 0xffffff);
int16_t shift = (swab32(nbits) & 0xff); // 0x1c = 28
long double d = (long double)0x0000ffff / (long double)bits;
for (int m=shift; m < 29; m++) d *= 256.0;
for (int m=29; m < shift; m++) d /= 256.0;
if (opt_debug_diff)
applog(LOG_DEBUG, "net diff: %f -> shift %u, bits %08x", d, shift, bits);
return d;
}
// std_le should work but it doesn't
void lbry_le_build_stratum_request( char *req, struct work *work,
struct stratum_ctx *sctx )
@@ -41,31 +23,6 @@ void lbry_le_build_stratum_request( char *req, struct work *work,
free(xnonce2str);
}
/*
void lbry_build_block_header( struct work* g_work, uint32_t version,
uint32_t *prevhash, uint32_t *merkle_root,
uint32_t ntime, uint32_t nbits )
{
int i;
memset( g_work->data, 0, sizeof(g_work->data) );
g_work->data[0] = version;
if ( have_stratum )
for ( i = 0; i < 8; i++ )
g_work->data[1 + i] = le32dec( prevhash + i );
else
for (i = 0; i < 8; i++)
g_work->data[ 8-i ] = le32dec( prevhash + i );
for ( i = 0; i < 8; i++ )
g_work->data[9 + i] = be32dec( merkle_root + i );
g_work->data[ LBRY_NTIME_INDEX ] = ntime;
g_work->data[ LBRY_NBITS_INDEX ] = nbits;
g_work->data[28] = 0x80000000;
}
*/
void lbry_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
{
unsigned char merkle_root[64] = { 0 };
@@ -112,9 +69,7 @@ bool register_lbry_algo( algo_gate_t* gate )
gate->hash = (void*)&lbry_hash;
gate->optimizations = AVX2_OPT | AVX512_OPT | SHA_OPT;
#endif
gate->calc_network_diff = (void*)&lbry_calc_network_diff;
gate->build_stratum_request = (void*)&lbry_le_build_stratum_request;
// gate->build_block_header = (void*)&build_block_header;
gate->build_extraheader = (void*)&lbry_build_extraheader;
gate->ntime_index = LBRY_NTIME_INDEX;
gate->nbits_index = LBRY_NBITS_INDEX;

View File

@@ -47,7 +47,7 @@ static const uint32_t IV[5] =
do{ \
a = _mm_add_epi32( mm128_rol_32( _mm_add_epi32( _mm_add_epi32( \
_mm_add_epi32( a, f( b ,c, d ) ), r ), \
m128_const1_64( k ) ), s ), e ); \
_mm_set1_epi64x( k ) ), s ), e ); \
c = mm128_rol_32( c, 10 );\
} while (0)
@@ -251,11 +251,11 @@ static void ripemd160_4way_round( ripemd160_4way_context *sc )
void ripemd160_4way_init( ripemd160_4way_context *sc )
{
sc->val[0] = m128_const1_64( 0x6745230167452301 );
sc->val[1] = m128_const1_64( 0xEFCDAB89EFCDAB89 );
sc->val[2] = m128_const1_64( 0x98BADCFE98BADCFE );
sc->val[3] = m128_const1_64( 0x1032547610325476 );
sc->val[4] = m128_const1_64( 0xC3D2E1F0C3D2E1F0 );
sc->val[0] = _mm_set1_epi64x( 0x6745230167452301 );
sc->val[1] = _mm_set1_epi64x( 0xEFCDAB89EFCDAB89 );
sc->val[2] = _mm_set1_epi64x( 0x98BADCFE98BADCFE );
sc->val[3] = _mm_set1_epi64x( 0x1032547610325476 );
sc->val[4] = _mm_set1_epi64x( 0xC3D2E1F0C3D2E1F0 );
sc->count_high = sc->count_low = 0;
}
@@ -347,7 +347,7 @@ void ripemd160_4way_close( ripemd160_4way_context *sc, void *dst )
do{ \
a = _mm256_add_epi32( mm256_rol_32( _mm256_add_epi32( _mm256_add_epi32( \
_mm256_add_epi32( a, f( b ,c, d ) ), r ), \
m256_const1_64( k ) ), s ), e ); \
_mm256_set1_epi64x( k ) ), s ), e ); \
c = mm256_rol_32( c, 10 );\
} while (0)
@@ -552,11 +552,11 @@ static void ripemd160_8way_round( ripemd160_8way_context *sc )
void ripemd160_8way_init( ripemd160_8way_context *sc )
{
sc->val[0] = m256_const1_64( 0x6745230167452301 );
sc->val[1] = m256_const1_64( 0xEFCDAB89EFCDAB89 );
sc->val[2] = m256_const1_64( 0x98BADCFE98BADCFE );
sc->val[3] = m256_const1_64( 0x1032547610325476 );
sc->val[4] = m256_const1_64( 0xC3D2E1F0C3D2E1F0 );
sc->val[0] = _mm256_set1_epi64x( 0x6745230167452301 );
sc->val[1] = _mm256_set1_epi64x( 0xEFCDAB89EFCDAB89 );
sc->val[2] = _mm256_set1_epi64x( 0x98BADCFE98BADCFE );
sc->val[3] = _mm256_set1_epi64x( 0x1032547610325476 );
sc->val[4] = _mm256_set1_epi64x( 0xC3D2E1F0C3D2E1F0 );
sc->count_high = sc->count_low = 0;
}
@@ -649,7 +649,7 @@ void ripemd160_8way_close( ripemd160_8way_context *sc, void *dst )
do{ \
a = _mm512_add_epi32( mm512_rol_32( _mm512_add_epi32( _mm512_add_epi32( \
_mm512_add_epi32( a, f( b ,c, d ) ), r ), \
m512_const1_64( k ) ), s ), e ); \
_mm512_set1_epi64( k ) ), s ), e ); \
c = mm512_rol_32( c, 10 );\
} while (0)
@@ -853,11 +853,11 @@ static void ripemd160_16way_round( ripemd160_16way_context *sc )
void ripemd160_16way_init( ripemd160_16way_context *sc )
{
sc->val[0] = m512_const1_64( 0x6745230167452301 );
sc->val[1] = m512_const1_64( 0xEFCDAB89EFCDAB89 );
sc->val[2] = m512_const1_64( 0x98BADCFE98BADCFE );
sc->val[3] = m512_const1_64( 0x1032547610325476 );
sc->val[4] = m512_const1_64( 0xC3D2E1F0C3D2E1F0 );
sc->val[0] = _mm512_set1_epi64( 0x6745230167452301 );
sc->val[1] = _mm512_set1_epi64( 0xEFCDAB89EFCDAB89 );
sc->val[2] = _mm512_set1_epi64( 0x98BADCFE98BADCFE );
sc->val[3] = _mm512_set1_epi64( 0x1032547610325476 );
sc->val[4] = _mm512_set1_epi64( 0xC3D2E1F0C3D2E1F0 );
sc->count_high = sc->count_low = 0;
}
@@ -902,7 +902,7 @@ void ripemd160_16way_close( ripemd160_16way_context *sc, void *dst )
const int pad = block_size - 8;
ptr = (unsigned)sc->count_low & ( block_size - 1U);
sc->buf[ ptr>>2 ] = m512_const1_32( 0x80 );
sc->buf[ ptr>>2 ] = _mm512_set1_epi32( 0x80 );
ptr += 4;
if ( ptr > pad )
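
Review note on the RIPEMD-160 constants above: the code works on 32-bit lanes but initializes them with 64-bit broadcasts (`_mm_set1_epi64x`, `_mm256_set1_epi64x`, `_mm512_set1_epi64`). That works because each constant repeats the same 32-bit word in both halves, for example `0x6745230167452301`. The scalar sketch below illustrates that property; `dup32` and `lane32_of_set1_64` are hypothetical helpers, and lane numbering assumes the usual little-endian x86 layout.

```c
#include <stdint.h>

/* Build the 64-bit broadcast constant for a 32-bit word: both halves equal. */
static uint64_t dup32( uint32_t w )
{
    return ( (uint64_t)w << 32 ) | w;
}

/* 32-bit lane i (0 = lowest) of a vector filled with the 64-bit value c:
   even lanes see the low half, odd lanes see the high half. */
static uint32_t lane32_of_set1_64( uint64_t c, int i )
{
    return (uint32_t)( ( i & 1 ) ? ( c >> 32 ) : c );
}
```

A constant whose halves differ would put different values in even and odd 32-bit lanes, so a 64-bit broadcast is only a safe substitute for a 32-bit one when the constant is written in this duplicated form.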

View File

@@ -830,7 +830,7 @@ void scrypt_core_16way( __m512i *X, __m512i *V, const uint32_t N )
}
}
// Working, not up to date, needs stream optimization.
// Working, not up to date, needs stream, shuffle optimizations.
// 4x32 interleaving
static void salsa8_simd128_4way( __m128i *b, const __m128i *c )
{
@@ -937,46 +937,28 @@ void scrypt_core_simd128_4way( __m128i *X, __m128i *V, const uint32_t N )
// 4x memory usage
// Working
// 4x128 interleaving
static void salsa_shuffle_4way_simd128( __m512i *X )
static inline void salsa_shuffle_4way_simd128( __m512i *X )
{
__m512i Y0, Y1, Y2, Y3, Z0, Z1, Z2, Z3;
Y0 = _mm512_mask_blend_epi32( 0x1111, X[1], X[0] );
Z0 = _mm512_mask_blend_epi32( 0x4444, X[3], X[2] );
Y1 = _mm512_mask_blend_epi32( 0x1111, X[2], X[1] );
Z1 = _mm512_mask_blend_epi32( 0x4444, X[0], X[3] );
Y2 = _mm512_mask_blend_epi32( 0x1111, X[3], X[2] );
Z2 = _mm512_mask_blend_epi32( 0x4444, X[1], X[0] );
Y3 = _mm512_mask_blend_epi32( 0x1111, X[0], X[3] );
Z3 = _mm512_mask_blend_epi32( 0x4444, X[2], X[1] );
X[0] = _mm512_mask_blend_epi32( 0x3333, Z0, Y0 );
X[1] = _mm512_mask_blend_epi32( 0x3333, Z1, Y1 );
X[2] = _mm512_mask_blend_epi32( 0x3333, Z2, Y2 );
X[3] = _mm512_mask_blend_epi32( 0x3333, Z3, Y3 );
__m512i t0 = _mm512_mask_blend_epi32( 0xaaaa, X[0], X[1] );
__m512i t1 = _mm512_mask_blend_epi32( 0x5555, X[0], X[1] );
__m512i t2 = _mm512_mask_blend_epi32( 0xaaaa, X[2], X[3] );
__m512i t3 = _mm512_mask_blend_epi32( 0x5555, X[2], X[3] );
X[0] = _mm512_mask_blend_epi32( 0xcccc, t0, t2 );
X[1] = _mm512_mask_blend_epi32( 0x6666, t1, t3 );
X[2] = _mm512_mask_blend_epi32( 0x3333, t0, t2 );
X[3] = _mm512_mask_blend_epi32( 0x9999, t1, t3 );
}
static void salsa_unshuffle_4way_simd128( __m512i *X )
static inline void salsa_unshuffle_4way_simd128( __m512i *X )
{
__m512i Y0, Y1, Y2, Y3;
Y0 = _mm512_mask_blend_epi32( 0x8888, X[0], X[1] );
Y1 = _mm512_mask_blend_epi32( 0x1111, X[0], X[1] );
Y2 = _mm512_mask_blend_epi32( 0x2222, X[0], X[1] );
Y3 = _mm512_mask_blend_epi32( 0x4444, X[0], X[1] );
Y0 = _mm512_mask_blend_epi32( 0x4444, Y0, X[2] );
Y1 = _mm512_mask_blend_epi32( 0x8888, Y1, X[2] );
Y2 = _mm512_mask_blend_epi32( 0x1111, Y2, X[2] );
Y3 = _mm512_mask_blend_epi32( 0x2222, Y3, X[2] );
X[0] = _mm512_mask_blend_epi32( 0x2222, Y0, X[3] );
X[1] = _mm512_mask_blend_epi32( 0x4444, Y1, X[3] );
X[2] = _mm512_mask_blend_epi32( 0x8888, Y2, X[3] );
X[3] = _mm512_mask_blend_epi32( 0x1111, Y3, X[3] );
__m512i t0 = _mm512_mask_blend_epi32( 0xcccc, X[0], X[2] );
__m512i t1 = _mm512_mask_blend_epi32( 0x3333, X[0], X[2] );
__m512i t2 = _mm512_mask_blend_epi32( 0x6666, X[1], X[3] );
__m512i t3 = _mm512_mask_blend_epi32( 0x9999, X[1], X[3] );
X[0] = _mm512_mask_blend_epi32( 0xaaaa, t0, t2 );
X[1] = _mm512_mask_blend_epi32( 0x5555, t0, t2 );
X[2] = _mm512_mask_blend_epi32( 0xaaaa, t1, t3 );
X[3] = _mm512_mask_blend_epi32( 0x5555, t1, t3 );
}
static void salsa8_4way_simd128( __m512i * const B, const __m512i * const C)
@@ -1147,46 +1129,28 @@ void scrypt_core_8way( __m256i *X, __m256i *V, const uint32_t N )
// { l1xb, l1xa, l1x9, l1x8, l0xb, l0xa, l0x9, l0x8 } b[1] B[23:16]
// { l1xf, l1xe, l1xd, l1xc, l0xf, l0xe, l0xd, l0xc } b[0] B[31:24]
static void salsa_shuffle_2way_simd128( __m256i *X )
static inline void salsa_shuffle_2way_simd128( __m256i *X )
{
__m256i Y0, Y1, Y2, Y3, Z0, Z1, Z2, Z3;
Y0 = _mm256_blend_epi32( X[1], X[0], 0x11 );
Z0 = _mm256_blend_epi32( X[3], X[2], 0x44 );
Y1 = _mm256_blend_epi32( X[2], X[1], 0x11 );
Z1 = _mm256_blend_epi32( X[0], X[3], 0x44 );
Y2 = _mm256_blend_epi32( X[3], X[2], 0x11 );
Z2 = _mm256_blend_epi32( X[1], X[0], 0x44 );
Y3 = _mm256_blend_epi32( X[0], X[3], 0x11 );
Z3 = _mm256_blend_epi32( X[2], X[1], 0x44 );
X[0] = _mm256_blend_epi32( Z0, Y0, 0x33 );
X[1] = _mm256_blend_epi32( Z1, Y1, 0x33 );
X[2] = _mm256_blend_epi32( Z2, Y2, 0x33 );
X[3] = _mm256_blend_epi32( Z3, Y3, 0x33 );
__m256i t0 = _mm256_blend_epi32( X[0], X[1], 0xaa );
__m256i t1 = _mm256_blend_epi32( X[0], X[1], 0x55 );
__m256i t2 = _mm256_blend_epi32( X[2], X[3], 0xaa );
__m256i t3 = _mm256_blend_epi32( X[2], X[3], 0x55 );
X[0] = _mm256_blend_epi32( t0, t2, 0xcc );
X[1] = _mm256_blend_epi32( t1, t3, 0x66 );
X[2] = _mm256_blend_epi32( t0, t2, 0x33 );
X[3] = _mm256_blend_epi32( t1, t3, 0x99 );
}
static void salsa_unshuffle_2way_simd128( __m256i *X )
static inline void salsa_unshuffle_2way_simd128( __m256i *X )
{
__m256i Y0, Y1, Y2, Y3;
Y0 = _mm256_blend_epi32( X[0], X[1], 0x88 );
Y1 = _mm256_blend_epi32( X[0], X[1], 0x11 );
Y2 = _mm256_blend_epi32( X[0], X[1], 0x22 );
Y3 = _mm256_blend_epi32( X[0], X[1], 0x44 );
Y0 = _mm256_blend_epi32( Y0, X[2], 0x44 );
Y1 = _mm256_blend_epi32( Y1, X[2], 0x88 );
Y2 = _mm256_blend_epi32( Y2, X[2], 0x11 );
Y3 = _mm256_blend_epi32( Y3, X[2], 0x22 );
X[0] = _mm256_blend_epi32( Y0, X[3], 0x22 );
X[1] = _mm256_blend_epi32( Y1, X[3], 0x44 );
X[2] = _mm256_blend_epi32( Y2, X[3], 0x88 );
X[3] = _mm256_blend_epi32( Y3, X[3], 0x11 );
__m256i t0 = _mm256_blend_epi32( X[0], X[2], 0xcc );
__m256i t1 = _mm256_blend_epi32( X[0], X[2], 0x33 );
__m256i t2 = _mm256_blend_epi32( X[1], X[3], 0x66 );
__m256i t3 = _mm256_blend_epi32( X[1], X[3], 0x99 );
X[0] = _mm256_blend_epi32( t0, t2, 0xaa );
X[1] = _mm256_blend_epi32( t0, t2, 0x55 );
X[2] = _mm256_blend_epi32( t1, t3, 0xaa );
X[3] = _mm256_blend_epi32( t1, t3, 0x55 );
}
static void salsa8_2way_simd128( __m256i * const B, const __m256i * const C)
@@ -2163,7 +2127,7 @@ static void salsa8_simd128( uint32_t *b, const uint32_t * const c)
X2 = _mm_blend_epi32( B[1], B[0], 0x4 );
Y3 = _mm_blend_epi32( B[0], B[3], 0x1 );
X3 = _mm_blend_epi32( B[2], B[1], 0x4 );
X0 = _mm_blend_epi32( X0, Y0, 0x3);
X0 = _mm_blend_epi32( X0, Y0, 0x3 );
X1 = _mm_blend_epi32( X1, Y1, 0x3 );
X2 = _mm_blend_epi32( X2, Y2, 0x3 );
X3 = _mm_blend_epi32( X3, Y3, 0x3 );
@@ -2311,91 +2275,34 @@ void scrypt_core_simd128( uint32_t *X, uint32_t *V, const uint32_t N )
// Double buffered, 2x memory usage
// No interleaving
static void salsa_simd128_shuffle_2buf( uint32_t *xa, uint32_t *xb )
static inline void salsa_simd128_shuffle_2buf( uint32_t *xa, uint32_t *xb )
{
__m128i *XA = (__m128i*)xa;
__m128i *XB = (__m128i*)xb;
__m128i YA0, YA1, YA2, YA3, YB0, YB1, YB2, YB3;
#if defined(__SSE4_1__)
// __m128i YA0, YA1, YA2, YA3, YB0, YB1, YB2, YB3;
__m128i ZA0, ZA1, ZA2, ZA3, ZB0, ZB1, ZB2, ZB3;
#if defined(__AVX2__)
YA0 = _mm_blend_epi32( XA[1], XA[0], 0x1 );
YB0 = _mm_blend_epi32( XB[1], XB[0], 0x1 );
ZA0 = _mm_blend_epi32( XA[3], XA[2], 0x4 );
ZB0 = _mm_blend_epi32( XB[3], XB[2], 0x4 );
YA1 = _mm_blend_epi32( XA[2], XA[1], 0x1 );
YB1 = _mm_blend_epi32( XB[2], XB[1], 0x1 );
ZA1 = _mm_blend_epi32( XA[0], XA[3], 0x4 );
ZB1 = _mm_blend_epi32( XB[0], XB[3], 0x4 );
YA2 = _mm_blend_epi32( XA[3], XA[2], 0x1 );
YB2 = _mm_blend_epi32( XB[3], XB[2], 0x1 );
ZA2 = _mm_blend_epi32( XA[1], XA[0], 0x4 );
ZB2 = _mm_blend_epi32( XB[1], XB[0], 0x4 );
YA3 = _mm_blend_epi32( XA[0], XA[3], 0x1 );
YB3 = _mm_blend_epi32( XB[0], XB[3], 0x1 );
ZA3 = _mm_blend_epi32( XA[2], XA[1], 0x4 );
ZB3 = _mm_blend_epi32( XB[2], XB[1], 0x4 );
XA[0] = _mm_blend_epi32( ZA0, YA0, 0x3 );
XB[0] = _mm_blend_epi32( ZB0, YB0, 0x3 );
XA[1] = _mm_blend_epi32( ZA1, YA1, 0x3 );
XB[1] = _mm_blend_epi32( ZB1, YB1, 0x3 );
XA[2] = _mm_blend_epi32( ZA2, YA2, 0x3 );
XB[2] = _mm_blend_epi32( ZB2, YB2, 0x3 );
XA[3] = _mm_blend_epi32( ZA3, YA3, 0x3 );
XB[3] = _mm_blend_epi32( ZB3, YB3, 0x3 );
#else
// SSE4.1
YA0 = _mm_blend_epi16( XA[1], XA[0], 0x03 );
YB0 = _mm_blend_epi16( XB[1], XB[0], 0x03 );
ZA0 = _mm_blend_epi16( XA[3], XA[2], 0x30 );
ZB0 = _mm_blend_epi16( XB[3], XB[2], 0x30 );
YA1 = _mm_blend_epi16( XA[2], XA[1], 0x03 );
YB1 = _mm_blend_epi16( XB[2], XB[1], 0x03 );
ZA1 = _mm_blend_epi16( XA[0], XA[3], 0x30 );
ZB1 = _mm_blend_epi16( XB[0], XB[3], 0x30 );
YA2 = _mm_blend_epi16( XA[3], XA[2], 0x03 );
YB2 = _mm_blend_epi16( XB[3], XB[2], 0x03 );
ZA2 = _mm_blend_epi16( XA[1], XA[0], 0x30 );
ZB2 = _mm_blend_epi16( XB[1], XB[0], 0x30 );
YA3 = _mm_blend_epi16( XA[0], XA[3], 0x03 );
YB3 = _mm_blend_epi16( XB[0], XB[3], 0x03 );
ZA3 = _mm_blend_epi16( XA[2], XA[1], 0x30 );
ZB3 = _mm_blend_epi16( XB[2], XB[1], 0x30 );
XA[0] = _mm_blend_epi16( ZA0, YA0, 0x0f );
XB[0] = _mm_blend_epi16( ZB0, YB0, 0x0f );
XA[1] = _mm_blend_epi16( ZA1, YA1, 0x0f );
XB[1] = _mm_blend_epi16( ZB1, YB1, 0x0f );
XA[2] = _mm_blend_epi16( ZA2, YA2, 0x0f );
XB[2] = _mm_blend_epi16( ZB2, YB2, 0x0f );
XA[3] = _mm_blend_epi16( ZA3, YA3, 0x0f );
XB[3] = _mm_blend_epi16( ZB3, YB3, 0x0f );
#endif // AVX2 else SSE4_1
__m128i t0 = _mm_blend_epi16( XA[0], XA[1], 0xcc );
__m128i t1 = _mm_blend_epi16( XA[0], XA[1], 0x33 );
__m128i t2 = _mm_blend_epi16( XA[2], XA[3], 0xcc );
__m128i t3 = _mm_blend_epi16( XA[2], XA[3], 0x33 );
XA[0] = _mm_blend_epi16( t0, t2, 0xf0 );
XA[1] = _mm_blend_epi16( t1, t3, 0x3c );
XA[2] = _mm_blend_epi16( t0, t2, 0x0f );
XA[3] = _mm_blend_epi16( t1, t3, 0xc3 );
t0 = _mm_blend_epi16( XB[0], XB[1], 0xcc );
t1 = _mm_blend_epi16( XB[0], XB[1], 0x33 );
t2 = _mm_blend_epi16( XB[2], XB[3], 0xcc );
t3 = _mm_blend_epi16( XB[2], XB[3], 0x33 );
XB[0] = _mm_blend_epi16( t0, t2, 0xf0 );
XB[1] = _mm_blend_epi16( t1, t3, 0x3c );
XB[2] = _mm_blend_epi16( t0, t2, 0x0f );
XB[3] = _mm_blend_epi16( t1, t3, 0xc3 );
#else // SSE2
__m128i YA0, YA1, YA2, YA3, YB0, YB1, YB2, YB3;
YA0 = _mm_set_epi32( xa[15], xa[10], xa[ 5], xa[ 0] );
YB0 = _mm_set_epi32( xb[15], xb[10], xb[ 5], xb[ 0] );
YA1 = _mm_set_epi32( xa[ 3], xa[14], xa[ 9], xa[ 4] );
@@ -2417,7 +2324,7 @@ static void salsa_simd128_shuffle_2buf( uint32_t *xa, uint32_t *xb )
#endif
}
static void salsa_simd128_unshuffle_2buf( uint32_t* xa, uint32_t* xb )
static inline void salsa_simd128_unshuffle_2buf( uint32_t* xa, uint32_t* xb )
{
__m128i *XA = (__m128i*)xa;
@@ -2425,67 +2332,22 @@ static void salsa_simd128_unshuffle_2buf( uint32_t* xa, uint32_t* xb )
#if defined(__SSE4_1__)
__m128i YA0, YA1, YA2, YA3, YB0, YB1, YB2, YB3;
#if defined(__AVX2__)
YA0 = _mm_blend_epi32( XA[0], XA[1], 0x8 );
YB0 = _mm_blend_epi32( XB[0], XB[1], 0x8 );
YA1 = _mm_blend_epi32( XA[0], XA[1], 0x1 );
YB1 = _mm_blend_epi32( XB[0], XB[1], 0x1 );
YA2 = _mm_blend_epi32( XA[0], XA[1], 0x2 );
YB2 = _mm_blend_epi32( XB[0], XB[1], 0x2 );
YA3 = _mm_blend_epi32( XA[0], XA[1], 0x4 );
YB3 = _mm_blend_epi32( XB[0], XB[1], 0x4 );
YA0 = _mm_blend_epi32( YA0, XA[2], 0x4 );
YB0 = _mm_blend_epi32( YB0, XB[2], 0x4 );
YA1 = _mm_blend_epi32( YA1, XA[2], 0x8 );
YB1 = _mm_blend_epi32( YB1, XB[2], 0x8 );
YA2 = _mm_blend_epi32( YA2, XA[2], 0x1 );
YB2 = _mm_blend_epi32( YB2, XB[2], 0x1 );
YA3 = _mm_blend_epi32( YA3, XA[2], 0x2 );
YB3 = _mm_blend_epi32( YB3, XB[2], 0x2 );
XA[0] = _mm_blend_epi32( YA0, XA[3], 0x2 );
XB[0] = _mm_blend_epi32( YB0, XB[3], 0x2 );
XA[1] = _mm_blend_epi32( YA1, XA[3], 0x4 );
XB[1] = _mm_blend_epi32( YB1, XB[3], 0x4 );
XA[2] = _mm_blend_epi32( YA2, XA[3], 0x8 );
XB[2] = _mm_blend_epi32( YB2, XB[3], 0x8 );
XA[3] = _mm_blend_epi32( YA3, XA[3], 0x1 );
XB[3] = _mm_blend_epi32( YB3, XB[3], 0x1 );
#else // SSE4_1
YA0 = _mm_blend_epi16( XA[0], XA[1], 0xc0 );
YB0 = _mm_blend_epi16( XB[0], XB[1], 0xc0 );
YA1 = _mm_blend_epi16( XA[0], XA[1], 0x03 );
YB1 = _mm_blend_epi16( XB[0], XB[1], 0x03 );
YA2 = _mm_blend_epi16( XA[0], XA[1], 0x0c );
YB2 = _mm_blend_epi16( XB[0], XB[1], 0x0c );
YA3 = _mm_blend_epi16( XA[0], XA[1], 0x30 );
YB3 = _mm_blend_epi16( XB[0], XB[1], 0x30 );
YA0 = _mm_blend_epi16( YA0, XA[2], 0x30 );
YB0 = _mm_blend_epi16( YB0, XB[2], 0x30 );
YA1 = _mm_blend_epi16( YA1, XA[2], 0xc0 );
YB1 = _mm_blend_epi16( YB1, XB[2], 0xc0 );
YA2 = _mm_blend_epi16( YA2, XA[2], 0x03 );
YB2 = _mm_blend_epi16( YB2, XB[2], 0x03 );
YA3 = _mm_blend_epi16( YA3, XA[2], 0x0c );
YB3 = _mm_blend_epi16( YB3, XB[2], 0x0c );
XA[0] = _mm_blend_epi16( YA0, XA[3], 0x0c );
XB[0] = _mm_blend_epi16( YB0, XB[3], 0x0c );
XA[1] = _mm_blend_epi16( YA1, XA[3], 0x30 );
XB[1] = _mm_blend_epi16( YB1, XB[3], 0x30 );
XA[2] = _mm_blend_epi16( YA2, XA[3], 0xc0 );
XB[2] = _mm_blend_epi16( YB2, XB[3], 0xc0 );
XA[3] = _mm_blend_epi16( YA3, XA[3], 0x03 );
XB[3] = _mm_blend_epi16( YB3, XB[3], 0x03 );
#endif // AVX2 else SSE4_1
__m128i t0 = _mm_blend_epi16( XA[0], XA[2], 0xf0 );
__m128i t1 = _mm_blend_epi16( XA[0], XA[2], 0x0f );
__m128i t2 = _mm_blend_epi16( XA[1], XA[3], 0x3c );
__m128i t3 = _mm_blend_epi16( XA[1], XA[3], 0xc3 );
XA[0] = _mm_blend_epi16( t0, t2, 0xcc );
XA[1] = _mm_blend_epi16( t0, t2, 0x33 );
XA[2] = _mm_blend_epi16( t1, t3, 0xcc );
XA[3] = _mm_blend_epi16( t1, t3, 0x33 );
t0 = _mm_blend_epi16( XB[0], XB[2], 0xf0 );
t1 = _mm_blend_epi16( XB[0], XB[2], 0x0f );
t2 = _mm_blend_epi16( XB[1], XB[3], 0x3c );
t3 = _mm_blend_epi16( XB[1], XB[3], 0xc3 );
XB[0] = _mm_blend_epi16( t0, t2, 0xcc );
XB[1] = _mm_blend_epi16( t0, t2, 0x33 );
XB[2] = _mm_blend_epi16( t1, t3, 0xcc );
XB[3] = _mm_blend_epi16( t1, t3, 0x33 );
#else // SSE2
@@ -2690,116 +2552,44 @@ void scrypt_core_simd128_2buf( uint32_t *X, uint32_t *V, const uint32_t N )
}
static void salsa_simd128_shuffle_3buf( uint32_t *xa, uint32_t *xb,
static inline void salsa_simd128_shuffle_3buf( uint32_t *xa, uint32_t *xb,
uint32_t *xc )
{
__m128i *XA = (__m128i*)xa;
__m128i *XB = (__m128i*)xb;
__m128i *XC = (__m128i*)xc;
__m128i YA0, YA1, YA2, YA3, YB0, YB1, YB2, YB3, YC0, YC1, YC2, YC3;
#if defined(__SSE4_1__)
__m128i ZA0, ZA1, ZA2, ZA3, ZB0, ZB1, ZB2, ZB3, ZC0, ZC1, ZC2, ZC3;
#if defined(__AVX2__)
YA0 = _mm_blend_epi32( XA[1], XA[0], 0x1 );
YB0 = _mm_blend_epi32( XB[1], XB[0], 0x1 );
YC0 = _mm_blend_epi32( XC[1], XC[0], 0x1 );
ZA0 = _mm_blend_epi32( XA[3], XA[2], 0x4 );
ZB0 = _mm_blend_epi32( XB[3], XB[2], 0x4 );
ZC0 = _mm_blend_epi32( XC[3], XC[2], 0x4 );
YA1 = _mm_blend_epi32( XA[2], XA[1], 0x1 );
YB1 = _mm_blend_epi32( XB[2], XB[1], 0x1 );
YC1 = _mm_blend_epi32( XC[2], XC[1], 0x1 );
ZA1 = _mm_blend_epi32( XA[0], XA[3], 0x4 );
ZB1 = _mm_blend_epi32( XB[0], XB[3], 0x4 );
ZC1 = _mm_blend_epi32( XC[0], XC[3], 0x4 );
YA2 = _mm_blend_epi32( XA[3], XA[2], 0x1 );
YB2 = _mm_blend_epi32( XB[3], XB[2], 0x1 );
YC2 = _mm_blend_epi32( XC[3], XC[2], 0x1 );
ZA2 = _mm_blend_epi32( XA[1], XA[0], 0x4 );
ZB2 = _mm_blend_epi32( XB[1], XB[0], 0x4 );
ZC2 = _mm_blend_epi32( XC[1], XC[0], 0x4 );
YA3 = _mm_blend_epi32( XA[0], XA[3], 0x1 );
YB3 = _mm_blend_epi32( XB[0], XB[3], 0x1 );
YC3 = _mm_blend_epi32( XC[0], XC[3], 0x1 );
ZA3 = _mm_blend_epi32( XA[2], XA[1], 0x4 );
ZB3 = _mm_blend_epi32( XB[2], XB[1], 0x4 );
ZC3 = _mm_blend_epi32( XC[2], XC[1], 0x4 );
XA[0] = _mm_blend_epi32( ZA0, YA0, 0x3 );
XB[0] = _mm_blend_epi32( ZB0, YB0, 0x3 );
XC[0] = _mm_blend_epi32( ZC0, YC0, 0x3 );
XA[1] = _mm_blend_epi32( ZA1, YA1, 0x3 );
XB[1] = _mm_blend_epi32( ZB1, YB1, 0x3 );
XC[1] = _mm_blend_epi32( ZC1, YC1, 0x3 );
XA[2] = _mm_blend_epi32( ZA2, YA2, 0x3 );
XB[2] = _mm_blend_epi32( ZB2, YB2, 0x3 );
XC[2] = _mm_blend_epi32( ZC2, YC2, 0x3 );
XA[3] = _mm_blend_epi32( ZA3, YA3, 0x3 );
XB[3] = _mm_blend_epi32( ZB3, YB3, 0x3 );
XC[3] = _mm_blend_epi32( ZC3, YC3, 0x3 );
#else
// SSE4.1
YA0 = _mm_blend_epi16( XA[1], XA[0], 0x03 );
YB0 = _mm_blend_epi16( XB[1], XB[0], 0x03 );
YC0 = _mm_blend_epi16( XC[1], XC[0], 0x03 );
ZA0 = _mm_blend_epi16( XA[3], XA[2], 0x30 );
ZB0 = _mm_blend_epi16( XB[3], XB[2], 0x30 );
ZC0 = _mm_blend_epi16( XC[3], XC[2], 0x30 );
YA1 = _mm_blend_epi16( XA[2], XA[1], 0x03 );
YB1 = _mm_blend_epi16( XB[2], XB[1], 0x03 );
YC1 = _mm_blend_epi16( XC[2], XC[1], 0x03 );
ZA1 = _mm_blend_epi16( XA[0], XA[3], 0x30 );
ZB1 = _mm_blend_epi16( XB[0], XB[3], 0x30 );
ZC1 = _mm_blend_epi16( XC[0], XC[3], 0x30 );
YA2 = _mm_blend_epi16( XA[3], XA[2], 0x03 );
YB2 = _mm_blend_epi16( XB[3], XB[2], 0x03 );
YC2 = _mm_blend_epi16( XC[3], XC[2], 0x03 );
ZA2 = _mm_blend_epi16( XA[1], XA[0], 0x30 );
ZB2 = _mm_blend_epi16( XB[1], XB[0], 0x30 );
ZC2 = _mm_blend_epi16( XC[1], XC[0], 0x30 );
YA3 = _mm_blend_epi16( XA[0], XA[3], 0x03 );
YB3 = _mm_blend_epi16( XB[0], XB[3], 0x03 );
YC3 = _mm_blend_epi16( XC[0], XC[3], 0x03 );
ZA3 = _mm_blend_epi16( XA[2], XA[1], 0x30 );
ZB3 = _mm_blend_epi16( XB[2], XB[1], 0x30 );
ZC3 = _mm_blend_epi16( XC[2], XC[1], 0x30 );
XA[0] = _mm_blend_epi16( ZA0, YA0, 0x0f );
XB[0] = _mm_blend_epi16( ZB0, YB0, 0x0f );
XC[0] = _mm_blend_epi16( ZC0, YC0, 0x0f );
XA[1] = _mm_blend_epi16( ZA1, YA1, 0x0f );
XB[1] = _mm_blend_epi16( ZB1, YB1, 0x0f );
XC[1] = _mm_blend_epi16( ZC1, YC1, 0x0f );
XA[2] = _mm_blend_epi16( ZA2, YA2, 0x0f );
XB[2] = _mm_blend_epi16( ZB2, YB2, 0x0f );
XC[2] = _mm_blend_epi16( ZC2, YC2, 0x0f );
XA[3] = _mm_blend_epi16( ZA3, YA3, 0x0f );
XB[3] = _mm_blend_epi16( ZB3, YB3, 0x0f );
XC[3] = _mm_blend_epi16( ZC3, YC3, 0x0f );
#endif // AVX2 else SSE4_1
__m128i t0 = _mm_blend_epi16( XA[0], XA[1], 0xcc );
__m128i t1 = _mm_blend_epi16( XA[0], XA[1], 0x33 );
__m128i t2 = _mm_blend_epi16( XA[2], XA[3], 0xcc );
__m128i t3 = _mm_blend_epi16( XA[2], XA[3], 0x33 );
XA[0] = _mm_blend_epi16( t0, t2, 0xf0 );
XA[1] = _mm_blend_epi16( t1, t3, 0x3c );
XA[2] = _mm_blend_epi16( t0, t2, 0x0f );
XA[3] = _mm_blend_epi16( t1, t3, 0xc3 );
t0 = _mm_blend_epi16( XB[0], XB[1], 0xcc );
t1 = _mm_blend_epi16( XB[0], XB[1], 0x33 );
t2 = _mm_blend_epi16( XB[2], XB[3], 0xcc );
t3 = _mm_blend_epi16( XB[2], XB[3], 0x33 );
XB[0] = _mm_blend_epi16( t0, t2, 0xf0 );
XB[1] = _mm_blend_epi16( t1, t3, 0x3c );
XB[2] = _mm_blend_epi16( t0, t2, 0x0f );
XB[3] = _mm_blend_epi16( t1, t3, 0xc3 );
t0 = _mm_blend_epi16( XC[0], XC[1], 0xcc );
t1 = _mm_blend_epi16( XC[0], XC[1], 0x33 );
t2 = _mm_blend_epi16( XC[2], XC[3], 0xcc );
t3 = _mm_blend_epi16( XC[2], XC[3], 0x33 );
XC[0] = _mm_blend_epi16( t0, t2, 0xf0 );
XC[1] = _mm_blend_epi16( t1, t3, 0x3c );
XC[2] = _mm_blend_epi16( t0, t2, 0x0f );
XC[3] = _mm_blend_epi16( t1, t3, 0xc3 );
#else // SSE2
__m128i YA0, YA1, YA2, YA3, YB0, YB1, YB2, YB3, YC0, YC1, YC2, YC3;
YA0 = _mm_set_epi32( xa[15], xa[10], xa[ 5], xa[ 0] );
YB0 = _mm_set_epi32( xb[15], xb[10], xb[ 5], xb[ 0] );
YC0 = _mm_set_epi32( xc[15], xc[10], xc[ 5], xc[ 0] );
@@ -2829,7 +2619,7 @@ static void salsa_simd128_shuffle_3buf( uint32_t *xa, uint32_t *xb,
#endif
}
static void salsa_simd128_unshuffle_3buf( uint32_t* xa, uint32_t* xb,
static inline void salsa_simd128_unshuffle_3buf( uint32_t* xa, uint32_t* xb,
uint32_t* xc )
{
__m128i *XA = (__m128i*)xa;
@@ -2838,91 +2628,30 @@ static void salsa_simd128_unshuffle_3buf( uint32_t* xa, uint32_t* xb,
#if defined(__SSE4_1__)
__m128i YA0, YA1, YA2, YA3, YB0, YB1, YB2, YB3, YC0, YC1, YC2, YC3;
#if defined(__AVX2__)
YA0 = _mm_blend_epi32( XA[0], XA[1], 0x8 );
YB0 = _mm_blend_epi32( XB[0], XB[1], 0x8 );
YC0 = _mm_blend_epi32( XC[0], XC[1], 0x8 );
YA1 = _mm_blend_epi32( XA[0], XA[1], 0x1 );
YB1 = _mm_blend_epi32( XB[0], XB[1], 0x1 );
YC1 = _mm_blend_epi32( XC[0], XC[1], 0x1 );
YA2 = _mm_blend_epi32( XA[0], XA[1], 0x2 );
YB2 = _mm_blend_epi32( XB[0], XB[1], 0x2 );
YC2 = _mm_blend_epi32( XC[0], XC[1], 0x2 );
YA3 = _mm_blend_epi32( XA[0], XA[1], 0x4 );
YB3 = _mm_blend_epi32( XB[0], XB[1], 0x4 );
YC3 = _mm_blend_epi32( XC[0], XC[1], 0x4 );
YA0 = _mm_blend_epi32( YA0, XA[2], 0x4 );
YB0 = _mm_blend_epi32( YB0, XB[2], 0x4 );
YC0 = _mm_blend_epi32( YC0, XC[2], 0x4 );
YA1 = _mm_blend_epi32( YA1, XA[2], 0x8 );
YB1 = _mm_blend_epi32( YB1, XB[2], 0x8 );
YC1 = _mm_blend_epi32( YC1, XC[2], 0x8 );
YA2 = _mm_blend_epi32( YA2, XA[2], 0x1 );
YB2 = _mm_blend_epi32( YB2, XB[2], 0x1 );
YC2 = _mm_blend_epi32( YC2, XC[2], 0x1 );
YA3 = _mm_blend_epi32( YA3, XA[2], 0x2 );
YB3 = _mm_blend_epi32( YB3, XB[2], 0x2 );
YC3 = _mm_blend_epi32( YC3, XC[2], 0x2 );
XA[0] = _mm_blend_epi32( YA0, XA[3], 0x2 );
XB[0] = _mm_blend_epi32( YB0, XB[3], 0x2 );
XC[0] = _mm_blend_epi32( YC0, XC[3], 0x2 );
XA[1] = _mm_blend_epi32( YA1, XA[3], 0x4 );
XB[1] = _mm_blend_epi32( YB1, XB[3], 0x4 );
XC[1] = _mm_blend_epi32( YC1, XC[3], 0x4 );
XA[2] = _mm_blend_epi32( YA2, XA[3], 0x8 );
XB[2] = _mm_blend_epi32( YB2, XB[3], 0x8 );
XC[2] = _mm_blend_epi32( YC2, XC[3], 0x8 );
XA[3] = _mm_blend_epi32( YA3, XA[3], 0x1 );
XB[3] = _mm_blend_epi32( YB3, XB[3], 0x1 );
XC[3] = _mm_blend_epi32( YC3, XC[3], 0x1 );
#else // SSE4_1
YA0 = _mm_blend_epi16( XA[0], XA[1], 0xc0 );
YB0 = _mm_blend_epi16( XB[0], XB[1], 0xc0 );
YC0 = _mm_blend_epi16( XC[0], XC[1], 0xc0 );
YA1 = _mm_blend_epi16( XA[0], XA[1], 0x03 );
YB1 = _mm_blend_epi16( XB[0], XB[1], 0x03 );
YC1 = _mm_blend_epi16( XC[0], XC[1], 0x03 );
YA2 = _mm_blend_epi16( XA[0], XA[1], 0x0c );
YB2 = _mm_blend_epi16( XB[0], XB[1], 0x0c );
YC2 = _mm_blend_epi16( XC[0], XC[1], 0x0c );
YA3 = _mm_blend_epi16( XA[0], XA[1], 0x30 );
YB3 = _mm_blend_epi16( XB[0], XB[1], 0x30 );
YC3 = _mm_blend_epi16( XC[0], XC[1], 0x30 );
YA0 = _mm_blend_epi16( YA0, XA[2], 0x30 );
YB0 = _mm_blend_epi16( YB0, XB[2], 0x30 );
YC0 = _mm_blend_epi16( YC0, XC[2], 0x30 );
YA1 = _mm_blend_epi16( YA1, XA[2], 0xc0 );
YB1 = _mm_blend_epi16( YB1, XB[2], 0xc0 );
YC1 = _mm_blend_epi16( YC1, XC[2], 0xc0 );
YA2 = _mm_blend_epi16( YA2, XA[2], 0x03 );
YB2 = _mm_blend_epi16( YB2, XB[2], 0x03 );
YC2 = _mm_blend_epi16( YC2, XC[2], 0x03 );
YA3 = _mm_blend_epi16( YA3, XA[2], 0x0c );
YB3 = _mm_blend_epi16( YB3, XB[2], 0x0c );
YC3 = _mm_blend_epi16( YC3, XC[2], 0x0c );
XA[0] = _mm_blend_epi16( YA0, XA[3], 0x0c );
XB[0] = _mm_blend_epi16( YB0, XB[3], 0x0c );
XC[0] = _mm_blend_epi16( YC0, XC[3], 0x0c );
XA[1] = _mm_blend_epi16( YA1, XA[3], 0x30 );
XB[1] = _mm_blend_epi16( YB1, XB[3], 0x30 );
XC[1] = _mm_blend_epi16( YC1, XC[3], 0x30 );
XA[2] = _mm_blend_epi16( YA2, XA[3], 0xc0 );
XB[2] = _mm_blend_epi16( YB2, XB[3], 0xc0 );
XC[2] = _mm_blend_epi16( YC2, XC[3], 0xc0 );
XA[3] = _mm_blend_epi16( YA3, XA[3], 0x03 );
XB[3] = _mm_blend_epi16( YB3, XB[3], 0x03 );
XC[3] = _mm_blend_epi16( YC3, XC[3], 0x03 );
#endif // AVX2 else SSE4_1
__m128i t0 = _mm_blend_epi16( XA[0], XA[2], 0xf0 );
__m128i t1 = _mm_blend_epi16( XA[0], XA[2], 0x0f );
__m128i t2 = _mm_blend_epi16( XA[1], XA[3], 0x3c );
__m128i t3 = _mm_blend_epi16( XA[1], XA[3], 0xc3 );
XA[0] = _mm_blend_epi16( t0, t2, 0xcc );
XA[1] = _mm_blend_epi16( t0, t2, 0x33 );
XA[2] = _mm_blend_epi16( t1, t3, 0xcc );
XA[3] = _mm_blend_epi16( t1, t3, 0x33 );
t0 = _mm_blend_epi16( XB[0], XB[2], 0xf0 );
t1 = _mm_blend_epi16( XB[0], XB[2], 0x0f );
t2 = _mm_blend_epi16( XB[1], XB[3], 0x3c );
t3 = _mm_blend_epi16( XB[1], XB[3], 0xc3 );
XB[0] = _mm_blend_epi16( t0, t2, 0xcc );
XB[1] = _mm_blend_epi16( t0, t2, 0x33 );
XB[2] = _mm_blend_epi16( t1, t3, 0xcc );
XB[3] = _mm_blend_epi16( t1, t3, 0x33 );
t0 = _mm_blend_epi16( XC[0], XC[2], 0xf0 );
t1 = _mm_blend_epi16( XC[0], XC[2], 0x0f );
t2 = _mm_blend_epi16( XC[1], XC[3], 0x3c );
t3 = _mm_blend_epi16( XC[1], XC[3], 0xc3 );
XC[0] = _mm_blend_epi16( t0, t2, 0xcc );
XC[1] = _mm_blend_epi16( t0, t2, 0x33 );
XC[2] = _mm_blend_epi16( t1, t3, 0xcc );
XC[3] = _mm_blend_epi16( t1, t3, 0x33 );
#else // SSE2


@@ -1,270 +0,0 @@
/* $Id: md_helper.c 216 2010-06-08 09:46:57Z tp $ */
/*
* This file contains some functions which implement the external data
* handling and padding for Merkle-Damgard hash functions which follow
* the conventions set out by MD4 (little-endian) or SHA-1 (big-endian).
*
* API: this file is meant to be included, not compiled as a stand-alone
* file. Some macros must be defined:
* RFUN name for the round function
* HASH "short name" for the hash function
* BE32 defined for big-endian, 32-bit based (e.g. SHA-1)
* LE32 defined for little-endian, 32-bit based (e.g. MD5)
* BE64 defined for big-endian, 64-bit based (e.g. SHA-512)
* LE64 defined for little-endian, 64-bit based (no example yet)
* PW01 if defined, append 0x01 instead of 0x80 (for Tiger)
* BLEN if defined, length of a message block (in bytes)
* PLW1 if defined, length is defined on one 64-bit word only (for Tiger)
* PLW4 if defined, length is defined on four 64-bit words (for WHIRLPOOL)
* SVAL if defined, reference to the context state information
*
* BLEN is used when a message block is not 16 (32-bit or 64-bit) words:
* this is used for instance for Tiger, which works on 64-bit words but
* uses 512-bit message blocks (eight 64-bit words). PLW1 and PLW4 are
* ignored if 32-bit words are used; if 64-bit words are used and PLW1 is
* set, then only one word (64 bits) will be used to encode the input
* message length (in bits), otherwise two words will be used (as in
* SHA-384 and SHA-512). If 64-bit words are used and PLW4 is defined (but
* not PLW1), four 64-bit words will be used to encode the message length
* (in bits). Note that regardless of those settings, only 64-bit message
* lengths are supported (in bits): messages longer than 2 Exabytes will be
* improperly hashed (this is unlikely to happen soon: 2 Exabytes is about
* 2 million Terabytes, which is huge).
*
* If CLOSE_ONLY is defined, then this file defines only the sph_XXX_close()
* function. This is used for Tiger2, which is identical to Tiger except
* when it comes to the padding (Tiger2 uses the standard 0x80 byte instead
* of the 0x01 from original Tiger).
*
* The RFUN function is invoked with two arguments, the first pointing to
* aligned data (as a "const void *"), the second being state information
* from the context structure. By default, this state information is the
* "val" field from the context, and this field is assumed to be an array
* of words ("sph_u32" or "sph_u64", depending on BE32/LE32/BE64/LE64).
* The "val" field can have any type, except
* for the output encoding which assumes that it is an array of "sph_u32"
* values. By defining NO_OUTPUT, this last step is deactivated; the
* includer code is then responsible for writing out the hash result. When
* NO_OUTPUT is defined, the third parameter to the "close()" function is
* ignored.
*
* ==========================(LICENSE BEGIN)============================
*
* Copyright (c) 2007-2010 Projet RNRT SAPHIR
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*
* ===========================(LICENSE END)=============================
*
* @author Thomas Pornin <thomas.pornin@cryptolog.com>
*/
#ifdef _MSC_VER
#pragma warning (disable: 4146)
#endif
#undef SPH_XCAT
#define SPH_XCAT(a, b) SPH_XCAT_(a, b)
#undef SPH_XCAT_
#define SPH_XCAT_(a, b) a ## b
#undef SPH_BLEN
#undef SPH_WLEN
#if defined BE64 || defined LE64
#define SPH_BLEN 128U
#define SPH_WLEN 8U
#else
#define SPH_BLEN 64U
#define SPH_WLEN 4U
#endif
#ifdef BLEN
#undef SPH_BLEN
#define SPH_BLEN BLEN
#endif
#undef SPH_MAXPAD
#if defined PLW1
#define SPH_MAXPAD (SPH_BLEN - SPH_WLEN)
#elif defined PLW4
#define SPH_MAXPAD (SPH_BLEN - (SPH_WLEN << 2))
#else
#define SPH_MAXPAD (SPH_BLEN - (SPH_WLEN << 1))
#endif
#undef SPH_VAL
#undef SPH_NO_OUTPUT
#ifdef SVAL
#define SPH_VAL SVAL
#define SPH_NO_OUTPUT 1
#else
#define SPH_VAL sc->val
#endif
#ifndef CLOSE_ONLY
#ifdef SPH_UPTR
static void
SPH_XCAT(HASH, _short)( void *cc, const void *data, size_t len )
#else
void
HASH ( void *cc, const void *data, size_t len )
#endif
{
SPH_XCAT( HASH, _context ) *sc;
__m256i *vdata = (__m256i*)data;
size_t ptr;
sc = cc;
ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
while ( len > 0 )
{
size_t clen;
clen = SPH_BLEN - ptr;
if ( clen > len )
clen = len;
memcpy_256( sc->buf + (ptr>>3), vdata, clen>>3 );
vdata = vdata + (clen>>3);
ptr += clen;
len -= clen;
if ( ptr == SPH_BLEN )
{
RFUN( sc->buf, SPH_VAL );
ptr = 0;
}
sc->count += clen;
}
}
#ifdef SPH_UPTR
void
HASH (void *cc, const void *data, size_t len)
{
SPH_XCAT(HASH, _context) *sc;
__m256i *vdata = (__m256i*)data;
unsigned ptr;
if ( len < (2 * SPH_BLEN) )
{
SPH_XCAT(HASH, _short)(cc, data, len);
return;
}
sc = cc;
ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
if ( ptr > 0 )
{
unsigned t;
t = SPH_BLEN - ptr;
SPH_XCAT( HASH, _short )( cc, data, t );
vdata = vdata + (t>>3);
len -= t;
}
SPH_XCAT( HASH, _short )( cc, vdata, len );
}
#endif
#endif
/*
* Perform padding and produce result. The context is NOT reinitialized
* by this function.
*/
static void
SPH_XCAT( HASH, _addbits_and_close )(void *cc, unsigned ub, unsigned n,
void *dst, unsigned rnum )
{
SPH_XCAT(HASH, _context) *sc;
unsigned ptr, u;
sc = cc;
ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
#ifdef PW01
sc->buf[ptr>>3] = m256_const1_64( 0x100 >> 8 );
#else
sc->buf[ptr>>3] = m256_const1_64( 0x80 );
#endif
ptr += 8;
if ( ptr > SPH_MAXPAD )
{
memset_zero_256( sc->buf + (ptr>>3), (SPH_BLEN - ptr) >> 3 );
RFUN( sc->buf, SPH_VAL );
memset_zero_256( sc->buf, SPH_MAXPAD >> 3 );
}
else
{
memset_zero_256( sc->buf + (ptr>>3), (SPH_MAXPAD - ptr) >> 3 );
}
#if defined BE64
#if defined PLW1
sc->buf[ SPH_MAXPAD>>3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
#elif defined PLW4
memset_zero_256( sc->buf + (SPH_MAXPAD>>3), ( 2 * SPH_WLEN ) >> 3 );
sc->buf[ (SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count >> 61 ) );
sc->buf[ (SPH_MAXPAD + 3 * SPH_WLEN ) >> 3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
#else
sc->buf[ ( SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count >> 61 ) );
sc->buf[ ( SPH_MAXPAD + 3 * SPH_WLEN ) >> 3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
#endif // PLW
#else // LE64
#if defined PLW1
sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
#elif defined PLW4
sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
sc->buf[ ( SPH_MAXPAD + SPH_WLEN ) >> 3 ] =
_mm256_set1_epi64x( sc->count >> 61 );
memset_zero_256( sc->buf + ( ( SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ),
( 2 * SPH_WLEN ) >> 3 );
#else
sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
sc->buf[ ( SPH_MAXPAD + SPH_WLEN ) >> 3 ] =
_mm256_set1_epi64x( sc->count >> 61 );
#endif // PLW
#endif // LE64
RFUN( sc->buf, SPH_VAL );
#ifdef SPH_NO_OUTPUT
(void)dst;
(void)rnum;
(void)u;
#else
for ( u = 0; u < rnum; u ++ )
{
#if defined BE64
((__m256i*)dst)[u] = mm256_bswap_64( sc->val[u] );
#else // LE64
((__m256i*)dst)[u] = sc->val[u];
#endif
}
#endif
}
static void
SPH_XCAT( HASH, _mdclose )( void *cc, void *dst, unsigned rnum )
{
SPH_XCAT( HASH, _addbits_and_close )( cc, 0, 0, dst, rnum );
}
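
For orientation, the SPH_MAXPAD selection in the removed helper above can be restated as a scalar C sketch. The names `length_words`, `max_pad` and `max_pad_self_check` are hypothetical and do not appear in the source. The Tiger values follow the header comment (64-bit words, 512-bit blocks); the WHIRLPOOL block length of 64 bytes is an assumption based on its 512-bit block size.

```c
#include <assert.h>

/* Hypothetical enum: how many words encode the message length.
 * Mirrors the PLW1 / default / PLW4 cases in the helper. */
enum length_words { LEN_ONE_WORD = 1, LEN_TWO_WORDS = 2, LEN_FOUR_WORDS = 4 };

/* Byte offset inside the final block at which the length field starts.
 * This is the value the helper calls SPH_MAXPAD. */
unsigned max_pad( unsigned blen, unsigned wlen, enum length_words lw )
{
    return blen - wlen * (unsigned)lw;
}

/* Sanity checks for a few common configurations. */
void max_pad_self_check( void )
{
    assert( max_pad(  64, 4, LEN_TWO_WORDS  ) ==  56 );  /* SHA-256 style   */
    assert( max_pad( 128, 8, LEN_TWO_WORDS  ) == 112 );  /* SHA-512 style   */
    assert( max_pad(  64, 8, LEN_ONE_WORD   ) ==  56 );  /* Tiger (PLW1)    */
    assert( max_pad(  64, 8, LEN_FOUR_WORDS ) ==  32 );  /* WHIRLPOOL (PLW4) */
}
```

Calling `max_pad_self_check()` from a test harness exercises all four cases. In the vectorized helper the same offset is divided by 8 (`SPH_MAXPAD >> 3`) because each buffer element holds one 64-bit word per lane.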


@@ -311,7 +311,7 @@ int sha256_4way_transform_le_short( __m128i *state_out, const __m128i *data,
__m128i A, B, C, D, E, F, G, H;
__m128i W[16]; memcpy_128( W, data, 16 );
// Value required by H after round 60 to produce valid final hash
const __m128i H_ = m128_const1_32( 0x136032ED );
const __m128i H_ = _mm_set1_epi32( 0x136032ED );
A = _mm_load_si128( state_in );
B = _mm_load_si128( state_in+1 );
@@ -408,14 +408,14 @@ int sha256_4way_transform_le_short( __m128i *state_out, const __m128i *data,
void sha256_4way_init( sha256_4way_context *sc )
{
sc->count_high = sc->count_low = 0;
sc->val[0] = m128_const1_64( 0x6A09E6676A09E667 );
sc->val[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
sc->val[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
sc->val[3] = m128_const1_64( 0xA54FF53AA54FF53A );
sc->val[4] = m128_const1_64( 0x510E527F510E527F );
sc->val[5] = m128_const1_64( 0x9B05688C9B05688C );
sc->val[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
sc->val[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
sc->val[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
sc->val[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
sc->val[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
sc->val[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
sc->val[4] = _mm_set1_epi64x( 0x510E527F510E527F );
sc->val[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
sc->val[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
sc->val[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );
}
void sha256_4way_update( sha256_4way_context *sc, const void *data, size_t len )
@@ -458,7 +458,7 @@ void sha256_4way_close( sha256_4way_context *sc, void *dst )
const int pad = buf_size - 8;
ptr = (unsigned)sc->count_low & (buf_size - 1U);
sc->buf[ ptr>>2 ] = m128_const1_64( 0x0000008000000080 );
sc->buf[ ptr>>2 ] = _mm_set1_epi64x( 0x0000008000000080 );
ptr += 4;
if ( ptr > pad )
@@ -474,8 +474,8 @@ void sha256_4way_close( sha256_4way_context *sc, void *dst )
high = (sc->count_high << 3) | (low >> 29);
low = low << 3;
sc->buf[ pad >> 2 ] = m128_const1_32( bswap_32( high ) );
sc->buf[( pad+4 ) >> 2 ] = m128_const1_32( bswap_32( low ) );
sc->buf[ pad >> 2 ] = _mm_set1_epi32( bswap_32( high ) );
sc->buf[( pad+4 ) >> 2 ] = _mm_set1_epi32( bswap_32( low ) );
sha256_4way_transform_be( sc->val, sc->buf, sc->val );
mm128_block_bswap_32( dst, sc->val );
@@ -589,7 +589,6 @@ do { \
_mm256_xor_si256( Y, _mm256_and_si256( X_xor_Y = _mm256_xor_si256( X, Y ), \
Y_xor_Z ) )
#define SHA2s_8WAY_STEP( A, B, C, D, E, F, G, H, i, j ) \
do { \
__m256i T0 = _mm256_add_epi32( _mm256_set1_epi32( K256[(j)+(i)] ), W[i] ); \
@@ -711,8 +710,11 @@ void sha256_8way_prehash_3rounds( __m256i *state_mid, __m256i *X,
{
__m256i A, B, C, D, E, F, G, H;
X[ 0] = SHA2x_MEXP( W[14], W[ 9], W[ 1], W[ 0] );
X[ 1] = SHA2x_MEXP( W[15], W[10], W[ 2], W[ 1] );
// W[9:14] are zero, therefore X[9:13] are also zero and not needed.
// Except X[ 9] which is part of W[ 0] from the third group.
X[ 0] = _mm256_add_epi32( SSG2_0x( W[ 1] ), W[ 0] );
X[ 1] = _mm256_add_epi32( _mm256_add_epi32( SSG2_1x( W[15] ),
SSG2_0x( W[ 2] ) ), W[ 1] );
X[ 2] = _mm256_add_epi32( _mm256_add_epi32( SSG2_1x( X[ 0] ), W[11] ),
W[ 2] );
X[ 3] = _mm256_add_epi32( _mm256_add_epi32( SSG2_1x( X[ 1] ), W[12] ),
@@ -725,16 +727,12 @@ void sha256_8way_prehash_3rounds( __m256i *state_mid, __m256i *X,
W[ 6] );
X[ 7] = _mm256_add_epi32( _mm256_add_epi32( X[ 0], SSG2_0x( W[ 8] ) ),
W[ 7] );
X[ 8] = _mm256_add_epi32( _mm256_add_epi32( X[ 1], SSG2_0x( W[ 9] ) ),
W[ 8] );
X[ 9] = _mm256_add_epi32( SSG2_0x( W[10] ), W[ 9] );
X[10] = _mm256_add_epi32( SSG2_0x( W[11] ), W[10] );
X[11] = _mm256_add_epi32( SSG2_0x( W[12] ), W[11] );
X[12] = _mm256_add_epi32( SSG2_0x( W[13] ), W[12] );
X[13] = _mm256_add_epi32( SSG2_0x( W[14] ), W[13] );
X[14] = _mm256_add_epi32( SSG2_0x( W[15] ), W[14] );
X[ 8] = _mm256_add_epi32( X[ 1], W[ 8] );
X[14] = SSG2_0x( W[15] );
X[15] = _mm256_add_epi32( SSG2_0x( X[ 0] ), W[15] );
X[ 9] = _mm256_add_epi32( SSG2_0x( X[ 1] ), X[ 0] );
A = _mm256_load_si256( state_in );
B = _mm256_load_si256( state_in + 1 );
C = _mm256_load_si256( state_in + 2 );
@@ -779,10 +777,6 @@ void sha256_8way_final_rounds( __m256i *state_out, const __m256i *data,
G = _mm256_load_si256( state_mid + 6 );
H = _mm256_load_si256( state_mid + 7 );
// SHA2s_8WAY_STEP( A, B, C, D, E, F, G, H, 0, 0 );
// SHA2s_8WAY_STEP( H, A, B, C, D, E, F, G, 1, 0 );
// SHA2s_8WAY_STEP( G, H, A, B, C, D, E, F, 2, 0 );
#if !defined(__AVX512VL__)
__m256i X_xor_Y, Y_xor_Z = _mm256_xor_si256( G, H );
#endif
@@ -810,23 +804,36 @@ void sha256_8way_final_rounds( __m256i *state_out, const __m256i *data,
W[ 6] = _mm256_add_epi32( X[ 6], SSG2_1x( W[ 4] ) );
W[ 7] = _mm256_add_epi32( X[ 7], SSG2_1x( W[ 5] ) );
W[ 8] = _mm256_add_epi32( X[ 8], SSG2_1x( W[ 6] ) );
W[ 9] = _mm256_add_epi32( X[ 9], _mm256_add_epi32( SSG2_1x( W[ 7] ),
W[ 2] ) );
W[10] = _mm256_add_epi32( X[10], _mm256_add_epi32( SSG2_1x( W[ 8] ),
W[ 3] ) );
W[11] = _mm256_add_epi32( X[11], _mm256_add_epi32( SSG2_1x( W[ 9] ),
W[ 4] ) );
W[12] = _mm256_add_epi32( X[12], _mm256_add_epi32( SSG2_1x( W[10] ),
W[ 5] ) );
W[13] = _mm256_add_epi32( X[13], _mm256_add_epi32( SSG2_1x( W[11] ),
W[ 6] ) );
W[ 9] = _mm256_add_epi32( SSG2_1x( W[ 7] ), W[ 2] );
W[10] = _mm256_add_epi32( SSG2_1x( W[ 8] ), W[ 3] );
W[11] = _mm256_add_epi32( SSG2_1x( W[ 9] ), W[ 4] );
W[12] = _mm256_add_epi32( SSG2_1x( W[10] ), W[ 5] );
W[13] = _mm256_add_epi32( SSG2_1x( W[11] ), W[ 6] );
W[14] = _mm256_add_epi32( X[14], _mm256_add_epi32( SSG2_1x( W[12] ),
W[ 7] ) );
W[15] = _mm256_add_epi32( X[15], _mm256_add_epi32( SSG2_1x( W[13] ),
W[ 8] ) );
SHA256x8_16ROUNDS( A, B, C, D, E, F, G, H, 16 );
SHA256x8_MSG_EXPANSION( W );
W[ 0] = _mm256_add_epi32( X[ 9], _mm256_add_epi32( SSG2_1x( W[14] ),
W[ 9] ) );
W[ 1] = SHA2x_MEXP( W[15], W[10], W[ 2], W[ 1] );
W[ 2] = SHA2x_MEXP( W[ 0], W[11], W[ 3], W[ 2] );
W[ 3] = SHA2x_MEXP( W[ 1], W[12], W[ 4], W[ 3] );
W[ 4] = SHA2x_MEXP( W[ 2], W[13], W[ 5], W[ 4] );
W[ 5] = SHA2x_MEXP( W[ 3], W[14], W[ 6], W[ 5] );
W[ 6] = SHA2x_MEXP( W[ 4], W[15], W[ 7], W[ 6] );
W[ 7] = SHA2x_MEXP( W[ 5], W[ 0], W[ 8], W[ 7] );
W[ 8] = SHA2x_MEXP( W[ 6], W[ 1], W[ 9], W[ 8] );
W[ 9] = SHA2x_MEXP( W[ 7], W[ 2], W[10], W[ 9] );
W[10] = SHA2x_MEXP( W[ 8], W[ 3], W[11], W[10] );
W[11] = SHA2x_MEXP( W[ 9], W[ 4], W[12], W[11] );
W[12] = SHA2x_MEXP( W[10], W[ 5], W[13], W[12] );
W[13] = SHA2x_MEXP( W[11], W[ 6], W[14], W[13] );
W[14] = SHA2x_MEXP( W[12], W[ 7], W[15], W[14] );
W[15] = SHA2x_MEXP( W[13], W[ 8], W[ 0], W[15] );
SHA256x8_16ROUNDS( A, B, C, D, E, F, G, H, 32 );
SHA256x8_MSG_EXPANSION( W );
SHA256x8_16ROUNDS( A, B, C, D, E, F, G, H, 48 );
@@ -855,7 +862,7 @@ int sha256_8way_transform_le_short( __m256i *state_out, const __m256i *data,
{
__m256i A, B, C, D, E, F, G, H;
__m256i W[16]; memcpy_256( W, data, 16 );
const __m256i H_ = m256_const1_32( 0x136032ED );
const __m256i H_ = _mm256_set1_epi32( 0x136032ED );
A = _mm256_load_si256( state_in );
B = _mm256_load_si256( state_in+1 );
@@ -971,14 +978,14 @@ int sha256_8way_transform_le_short( __m256i *state_out, const __m256i *data,
void sha256_8way_init( sha256_8way_context *sc )
{
sc->count_high = sc->count_low = 0;
sc->val[0] = m256_const1_64( 0x6A09E6676A09E667 );
sc->val[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
sc->val[2] = m256_const1_64( 0x3C6EF3723C6EF372 );
sc->val[3] = m256_const1_64( 0xA54FF53AA54FF53A );
sc->val[4] = m256_const1_64( 0x510E527F510E527F );
sc->val[5] = m256_const1_64( 0x9B05688C9B05688C );
sc->val[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
sc->val[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );
sc->val[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
sc->val[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
sc->val[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
sc->val[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
sc->val[4] = _mm256_set1_epi64x( 0x510E527F510E527F );
sc->val[5] = _mm256_set1_epi64x( 0x9B05688C9B05688C );
sc->val[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
sc->val[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );
}
// need to handle odd byte length for yespower.
@@ -1024,7 +1031,7 @@ void sha256_8way_close( sha256_8way_context *sc, void *dst )
const int pad = buf_size - 8;
ptr = (unsigned)sc->count_low & (buf_size - 1U);
sc->buf[ ptr>>2 ] = m256_const1_64( 0x0000008000000080 );
sc->buf[ ptr>>2 ] = _mm256_set1_epi64x( 0x0000008000000080 );
ptr += 4;
if ( ptr > pad )
@@ -1040,8 +1047,8 @@ void sha256_8way_close( sha256_8way_context *sc, void *dst )
high = (sc->count_high << 3) | (low >> 29);
low = low << 3;
sc->buf[ pad >> 2 ] = m256_const1_32( bswap_32( high ) );
sc->buf[ ( pad+4 ) >> 2 ] = m256_const1_32( bswap_32( low ) );
sc->buf[ pad >> 2 ] = _mm256_set1_epi32( bswap_32( high ) );
sc->buf[ ( pad+4 ) >> 2 ] = _mm256_set1_epi32( bswap_32( low ) );
sha256_8way_transform_be( sc->val, sc->buf, sc->val );
@@ -1201,9 +1208,13 @@ void sha256_16way_prehash_3rounds( __m512i *state_mid, __m512i *X,
{
__m512i A, B, C, D, E, F, G, H;
// precalculate constant part msg expansion for second iteration.
X[ 0] = SHA2x16_MEXP( W[14], W[ 9], W[ 1], W[ 0] );
X[ 1] = SHA2x16_MEXP( W[15], W[10], W[ 2], W[ 1] );
// X is pre-expanded constant part of msg for second group, rounds 16 to 31.
// W[9:14] are zero, therefore X[9:13] are also zero and not needed.
// Except X[ 9] which is used to pre-expand part of W[ 0] from the third
// group, rounds 32 to 48.
X[ 0] = _mm512_add_epi32( SSG2_0x16( W[ 1] ), W[ 0] );
X[ 1] = _mm512_add_epi32( _mm512_add_epi32( SSG2_1x16( W[15] ),
SSG2_0x16( W[ 2] ) ), W[ 1] );
X[ 2] = _mm512_add_epi32( _mm512_add_epi32( SSG2_1x16( X[ 0] ), W[11] ),
W[ 2] );
X[ 3] = _mm512_add_epi32( _mm512_add_epi32( SSG2_1x16( X[ 1] ), W[12] ),
@@ -1216,16 +1227,12 @@ void sha256_16way_prehash_3rounds( __m512i *state_mid, __m512i *X,
W[ 6] );
X[ 7] = _mm512_add_epi32( _mm512_add_epi32( X[ 0], SSG2_0x16( W[ 8] ) ),
W[ 7] );
X[ 8] = _mm512_add_epi32( _mm512_add_epi32( X[ 1], SSG2_0x16( W[ 9] ) ),
W[ 8] );
X[ 9] = _mm512_add_epi32( SSG2_0x16( W[10] ), W[ 9] );
X[10] = _mm512_add_epi32( SSG2_0x16( W[11] ), W[10] );
X[11] = _mm512_add_epi32( SSG2_0x16( W[12] ), W[11] );
X[12] = _mm512_add_epi32( SSG2_0x16( W[13] ), W[12] );
X[13] = _mm512_add_epi32( SSG2_0x16( W[14] ), W[13] );
X[14] = _mm512_add_epi32( SSG2_0x16( W[15] ), W[14] );
X[ 8] = _mm512_add_epi32( X[ 1], W[ 8] );
X[14] = SSG2_0x16( W[15] );
X[15] = _mm512_add_epi32( SSG2_0x16( X[ 0] ), W[15] );
X[ 9] = _mm512_add_epi32( SSG2_0x16( X[ 1] ), X[ 0] );
A = _mm512_load_si512( state_in );
B = _mm512_load_si512( state_in + 1 );
C = _mm512_load_si512( state_in + 2 );
@@ -1280,7 +1287,7 @@ void sha256_16way_final_rounds( __m512i *state_out, const __m512i *data,
SHA2s_16WAY_STEP( C, D, E, F, G, H, A, B, 14, 0 );
SHA2s_16WAY_STEP( B, C, D, E, F, G, H, A, 15, 0 );
// update precalculated msg expansion with new nonce: W[3].
// inject nonce, W[3], to complete msg expansion.
W[ 0] = X[ 0];
W[ 1] = X[ 1];
W[ 2] = _mm512_add_epi32( X[ 2], SSG2_0x16( W[ 3] ) );
@@ -1290,23 +1297,36 @@ void sha256_16way_final_rounds( __m512i *state_out, const __m512i *data,
W[ 6] = _mm512_add_epi32( X[ 6], SSG2_1x16( W[ 4] ) );
W[ 7] = _mm512_add_epi32( X[ 7], SSG2_1x16( W[ 5] ) );
W[ 8] = _mm512_add_epi32( X[ 8], SSG2_1x16( W[ 6] ) );
W[ 9] = _mm512_add_epi32( X[ 9], _mm512_add_epi32( SSG2_1x16( W[ 7] ),
W[ 2] ) );
W[10] = _mm512_add_epi32( X[10], _mm512_add_epi32( SSG2_1x16( W[ 8] ),
W[ 3] ) );
W[11] = _mm512_add_epi32( X[11], _mm512_add_epi32( SSG2_1x16( W[ 9] ),
W[ 4] ) );
W[12] = _mm512_add_epi32( X[12], _mm512_add_epi32( SSG2_1x16( W[10] ),
W[ 5] ) );
W[13] = _mm512_add_epi32( X[13], _mm512_add_epi32( SSG2_1x16( W[11] ),
W[ 6] ) );
W[ 9] = _mm512_add_epi32( SSG2_1x16( W[ 7] ), W[ 2] );
W[10] = _mm512_add_epi32( SSG2_1x16( W[ 8] ), W[ 3] );
W[11] = _mm512_add_epi32( SSG2_1x16( W[ 9] ), W[ 4] );
W[12] = _mm512_add_epi32( SSG2_1x16( W[10] ), W[ 5] );
W[13] = _mm512_add_epi32( SSG2_1x16( W[11] ), W[ 6] );
W[14] = _mm512_add_epi32( X[14], _mm512_add_epi32( SSG2_1x16( W[12] ),
W[ 7] ) );
W[15] = _mm512_add_epi32( X[15], _mm512_add_epi32( SSG2_1x16( W[13] ),
W[ 8] ) );
SHA256x16_16ROUNDS( A, B, C, D, E, F, G, H, 16 );
SHA256x16_MSG_EXPANSION( W );
W[ 0] = _mm512_add_epi32( X[ 9], _mm512_add_epi32( SSG2_1x16( W[14] ),
W[ 9] ) );
W[ 1] = SHA2x16_MEXP( W[15], W[10], W[ 2], W[ 1] );
W[ 2] = SHA2x16_MEXP( W[ 0], W[11], W[ 3], W[ 2] );
W[ 3] = SHA2x16_MEXP( W[ 1], W[12], W[ 4], W[ 3] );
W[ 4] = SHA2x16_MEXP( W[ 2], W[13], W[ 5], W[ 4] );
W[ 5] = SHA2x16_MEXP( W[ 3], W[14], W[ 6], W[ 5] );
W[ 6] = SHA2x16_MEXP( W[ 4], W[15], W[ 7], W[ 6] );
W[ 7] = SHA2x16_MEXP( W[ 5], W[ 0], W[ 8], W[ 7] );
W[ 8] = SHA2x16_MEXP( W[ 6], W[ 1], W[ 9], W[ 8] );
W[ 9] = SHA2x16_MEXP( W[ 7], W[ 2], W[10], W[ 9] );
W[10] = SHA2x16_MEXP( W[ 8], W[ 3], W[11], W[10] );
W[11] = SHA2x16_MEXP( W[ 9], W[ 4], W[12], W[11] );
W[12] = SHA2x16_MEXP( W[10], W[ 5], W[13], W[12] );
W[13] = SHA2x16_MEXP( W[11], W[ 6], W[14], W[13] );
W[14] = SHA2x16_MEXP( W[12], W[ 7], W[15], W[14] );
W[15] = SHA2x16_MEXP( W[13], W[ 8], W[ 0], W[15] );
SHA256x16_16ROUNDS( A, B, C, D, E, F, G, H, 32 );
SHA256x16_MSG_EXPANSION( W );
SHA256x16_16ROUNDS( A, B, C, D, E, F, G, H, 48 );
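
The shortened X[ 0] and X[ 1] expressions in the prehash hunks rely on the SHA-256 small sigma functions mapping zero to zero, so any expansion term that reads a zero message word can be dropped. The following scalar sketch, assuming the standard FIPS 180-4 message schedule, checks that the reduced forms match the full expansion when W[5] to W[14] are zero. The function names are hypothetical and the sample word values are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotr32( uint32_t x, unsigned n )
{
    return ( x >> n ) | ( x << ( 32 - n ) );
}

/* SHA-256 small sigma functions (FIPS 180-4). */
static uint32_t ssg2_0( uint32_t x )
{
    return rotr32( x, 7 ) ^ rotr32( x, 18 ) ^ ( x >> 3 );
}

static uint32_t ssg2_1( uint32_t x )
{
    return rotr32( x, 17 ) ^ rotr32( x, 19 ) ^ ( x >> 10 );
}

/* Full expansion step, same argument order as SHA2x_MEXP in the diff. */
uint32_t mexp_full( uint32_t w14, uint32_t w9, uint32_t w1, uint32_t w0 )
{
    return ssg2_1( w14 ) + w9 + ssg2_0( w1 ) + w0;
}

/* Reduced forms, valid only when W[9], W[10] and W[14] are zero. */
uint32_t mexp_x0( const uint32_t *W ) { return ssg2_0( W[1] ) + W[0]; }
uint32_t mexp_x1( const uint32_t *W )
{
    return ssg2_1( W[15] ) + ssg2_0( W[2] ) + W[1];
}

void mexp_self_check( void )
{
    /* Arbitrary header tail and nonce, then padding; W[5..14] stay zero. */
    uint32_t W[16] = { 0x11111111, 0x22222222, 0x33333333, 0x44444444,
                       0x80000000 };
    W[15] = 80 * 8;   /* bit count of an 80 byte header */

    assert( mexp_x0( W ) == mexp_full( W[14], W[ 9], W[1], W[0] ) );
    assert( mexp_x1( W ) == mexp_full( W[15], W[10], W[2], W[1] ) );
}
```

The vector code applies the same identity lane by lane, which is why the terms involving W[ 9] through W[14] disappear from the new lines.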
@@ -1336,10 +1356,10 @@ int sha256_16way_transform_le_short( __m512i *state_out, const __m512i *data,
{
__m512i A, B, C, D, E, F, G, H;
__m512i W[16]; memcpy_512( W, data, 16 );
// Value for H at round 60, before adding K, to produce valid final hash
//where H == 0.
// Value for H at round 60, before adding K, needed to produce valid final
// hash where H == 0.
// H_ = -( H256[7] + K256[60] );
const __m512i H_ = m512_const1_32( 0x136032ED );
const __m512i H_ = _mm512_set1_epi32( 0x136032ED );
A = _mm512_load_si512( state_in );
B = _mm512_load_si512( state_in+1 );
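
The constant 0x136032ED can be reproduced from the formula in the comment, H_ = -( H256[7] + K256[60] ), using 32-bit wraparound arithmetic. The sketch below assumes the standard SHA-256 values H256[7] = 0x5BE0CD19 (also visible in the init functions of this diff) and K256[60] = 0x90BEFFFA (from FIPS 180-4). The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Recompute the early-exit constant from the comment's formula.
 * All arithmetic is modulo 2^32. */
uint32_t sha256_h_target( void )
{
    const uint32_t h7  = 0x5BE0CD19;   /* H256[7],  initial state word 7 */
    const uint32_t k60 = 0x90BEFFFA;   /* K256[60], round constant 60    */
    return (uint32_t)( 0u - ( h7 + k60 ) );
}

void sha256_h_target_self_check( void )
{
    assert( sha256_h_target() == 0x136032ED );
}
```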
@@ -1432,14 +1452,14 @@ int sha256_16way_transform_le_short( __m512i *state_out, const __m512i *data,
void sha256_16way_init( sha256_16way_context *sc )
{
sc->count_high = sc->count_low = 0;
sc->val[0] = m512_const1_64( 0x6A09E6676A09E667 );
sc->val[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
sc->val[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
sc->val[3] = m512_const1_64( 0xA54FF53AA54FF53A );
sc->val[4] = m512_const1_64( 0x510E527F510E527F );
sc->val[5] = m512_const1_64( 0x9B05688C9B05688C );
sc->val[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
sc->val[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );
sc->val[0] = _mm512_set1_epi64( 0x6A09E6676A09E667 );
sc->val[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
sc->val[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
sc->val[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
sc->val[4] = _mm512_set1_epi64( 0x510E527F510E527F );
sc->val[5] = _mm512_set1_epi64( 0x9B05688C9B05688C );
sc->val[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
sc->val[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );
}
void sha256_16way_update( sha256_16way_context *sc, const void *data,
@@ -1483,7 +1503,7 @@ void sha256_16way_close( sha256_16way_context *sc, void *dst )
const int pad = buf_size - 8;
ptr = (unsigned)sc->count_low & (buf_size - 1U);
sc->buf[ ptr>>2 ] = m512_const1_64( 0x0000008000000080 );
sc->buf[ ptr>>2 ] = _mm512_set1_epi64( 0x0000008000000080 );
ptr += 4;
if ( ptr > pad )
@@ -1499,8 +1519,8 @@ void sha256_16way_close( sha256_16way_context *sc, void *dst )
high = (sc->count_high << 3) | (low >> 29);
low = low << 3;
sc->buf[ pad >> 2 ] = m512_const1_32( bswap_32( high ) );
sc->buf[ ( pad+4 ) >> 2 ] = m512_const1_32( bswap_32( low ) );
sc->buf[ pad >> 2 ] = _mm512_set1_epi32( bswap_32( high ) );
sc->buf[ ( pad+4 ) >> 2 ] = _mm512_set1_epi32( bswap_32( low ) );
sha256_16way_transform_be( sc->val, sc->buf, sc->val );
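
The close functions convert a byte count held in two 32-bit halves into the 64-bit bit count that SHA-256 appends to the final block. A scalar sketch of that conversion, assuming `low` starts out as `count_low` as the surrounding context suggests; the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two halves of the 64-bit bit count. */
typedef struct { uint32_t high, low; } bit_count;

/* Multiply a 64-bit byte count (split in two halves) by 8.
 * The top 3 bits of the low half carry into the high half. */
bit_count bytes_to_bits( uint32_t count_high, uint32_t count_low )
{
    bit_count bc;
    bc.high = ( count_high << 3 ) | ( count_low >> 29 );
    bc.low  = count_low << 3;
    return bc;
}

void bytes_to_bits_self_check( void )
{
    bit_count a = bytes_to_bits( 0, 80 );          /* 80 byte header */
    assert( a.high == 0 && a.low == 640 );

    bit_count b = bytes_to_bits( 0, 0x20000000 );  /* carry into high half */
    assert( b.high == 1 && b.low == 0 );
}
```

In the vector code both halves are then byte-swapped and broadcast to every lane, since all lanes hash messages of the same length.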


@@ -28,32 +28,32 @@ int scanhash_sha256d_16way( struct work *work, const uint32_t max_nonce,
__m512i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m512i last_byte = m512_const1_32( 0x80000000 );
const __m512i sixteen = m512_const1_32( 16 );
const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
const __m512i sixteen = _mm512_set1_epi32( 16 );
for ( int i = 0; i < 19; i++ )
vdata[i] = m512_const1_32( pdata[i] );
vdata[i] = _mm512_set1_epi32( pdata[i] );
*noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_512( vdata+16 + 5, 10 );
vdata[16+15] = m512_const1_32( 80*8 ); // bit count
vdata[16+15] = _mm512_set1_epi32( 80*8 ); // bit count
block[ 8] = last_byte;
memset_zero_512( block + 9, 6 );
block[15] = m512_const1_32( 32*8 ); // bit count
block[15] = _mm512_set1_epi32( 32*8 ); // bit count
// initialize state
initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
initstate[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
initstate[3] = m512_const1_64( 0xA54FF53AA54FF53A );
initstate[4] = m512_const1_64( 0x510E527F510E527F );
initstate[5] = m512_const1_64( 0x9B05688C9B05688C );
initstate[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
initstate[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );
initstate[0] = _mm512_set1_epi64( 0x6A09E6676A09E667 );
initstate[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
initstate[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
initstate[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
initstate[4] = _mm512_set1_epi64( 0x510E527F510E527F );
initstate[5] = _mm512_set1_epi64( 0x9B05688C9B05688C );
initstate[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
initstate[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );
sha256_16way_transform_le( midstate1, vdata, initstate );
@@ -116,31 +116,31 @@ int scanhash_sha256d_8way( struct work *work, const uint32_t max_nonce,
__m256i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m256i last_byte = m256_const1_32( 0x80000000 );
const __m256i eight = m256_const1_32( 8 );
const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
const __m256i eight = _mm256_set1_epi32( 8 );
for ( int i = 0; i < 19; i++ )
vdata[i] = m256_const1_32( pdata[i] );
vdata[i] = _mm256_set1_epi32( pdata[i] );
*noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_256( vdata+16 + 5, 10 );
vdata[16+15] = m256_const1_32( 80*8 ); // bit count
vdata[16+15] = _mm256_set1_epi32( 80*8 ); // bit count
block[ 8] = last_byte;
memset_zero_256( block + 9, 6 );
block[15] = m256_const1_32( 32*8 ); // bit count
block[15] = _mm256_set1_epi32( 32*8 ); // bit count
// initialize state
initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
initstate[2] = m256_const1_64( 0x3C6EF3723C6EF372 );
initstate[3] = m256_const1_64( 0xA54FF53AA54FF53A );
initstate[4] = m256_const1_64( 0x510E527F510E527F );
initstate[5] = m256_const1_64( 0x9B05688C9B05688C );
initstate[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
initstate[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );
initstate[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
initstate[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
initstate[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
initstate[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
initstate[4] = _mm256_set1_epi64x( 0x510E527F510E527F );
initstate[5] = _mm256_set1_epi64x( 0x9B05688C9B05688C );
initstate[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
initstate[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );
sha256_8way_transform_le( midstate1, vdata, initstate );
@@ -204,31 +204,31 @@ int scanhash_sha256d_4way( struct work *work, const uint32_t max_nonce,
__m128i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m128i last_byte = m128_const1_32( 0x80000000 );
const __m128i four = m128_const1_32( 4 );
const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
const __m128i four = _mm_set1_epi32( 4 );
for ( int i = 0; i < 19; i++ )
vdata[i] = m128_const1_32( pdata[i] );
vdata[i] = _mm_set1_epi32( pdata[i] );
*noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_128( vdata+16 + 5, 10 );
vdata[16+15] = m128_const1_32( 80*8 ); // bit count
vdata[16+15] = _mm_set1_epi32( 80*8 ); // bit count
block[ 8] = last_byte;
memset_zero_128( block + 9, 6 );
block[15] = m128_const1_32( 32*8 ); // bit count
block[15] = _mm_set1_epi32( 32*8 ); // bit count
// initialize state
initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
initstate[4] = m128_const1_64( 0x510E527F510E527F );
initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
initstate[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
initstate[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
initstate[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
initstate[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
initstate[4] = _mm_set1_epi64x( 0x510E527F510E527F );
initstate[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
initstate[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
initstate[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );
// hash first 64 bytes of data
sha256_4way_transform_le( midstate1, vdata, initstate );
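
The setup code in each scanhash function builds two padded messages: the 80 byte block header spread over two 64 byte blocks, and the 32 byte first-pass digest in a single block. A single-lane sketch of the same layout is shown below; the function names are hypothetical, and in the vector code each word is broadcast across lanes, with only the nonce word differing per lane.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Lay out an 80 byte header (20 words) as two padded SHA-256 blocks. */
void sha256d_pad_header( uint32_t out[32], const uint32_t header[20] )
{
    memcpy( out, header, 20 * sizeof(uint32_t) );
    out[16+4] = 0x80000000;                            /* end-of-message bit */
    memset( out + 16 + 5, 0, 10 * sizeof(uint32_t) );  /* zero padding */
    out[16+15] = 80 * 8;                               /* bit count */
}

/* Lay out a 32 byte digest (8 words) as one padded block for the
 * second SHA-256 pass. */
void sha256d_pad_digest( uint32_t block[16], const uint32_t digest[8] )
{
    memcpy( block, digest, 8 * sizeof(uint32_t) );
    block[ 8] = 0x80000000;                            /* end-of-message bit */
    memset( block + 9, 0, 6 * sizeof(uint32_t) );      /* zero padding */
    block[15] = 32 * 8;                                /* bit count */
}

void sha256d_pad_self_check( void )
{
    uint32_t header[20], out[32], digest[8] = { 0 }, block[16];
    for ( int i = 0; i < 20; i++ ) header[i] = (uint32_t)i + 1;

    sha256d_pad_header( out, header );
    assert( out[19] == 20 && out[20] == 0x80000000 );
    assert( out[21] == 0 && out[30] == 0 && out[31] == 640 );

    sha256d_pad_digest( block, digest );
    assert( block[8] == 0x80000000 && block[14] == 0 && block[15] == 256 );
}
```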

algo/sha/sha256dt.c (new file, 268 lines)

@@ -0,0 +1,268 @@
#include "algo-gate-api.h"
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include "sha-hash-4way.h"
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
#define SHA256DT_16WAY 1
#elif defined(__AVX2__)
#define SHA256DT_8WAY 1
#else
#define SHA256DT_4WAY 1
#endif
#if defined(SHA256DT_16WAY)
int scanhash_sha256dt_16way( struct work *work, const uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
__m512i vdata[32] __attribute__ ((aligned (128)));
__m512i block[16] __attribute__ ((aligned (64)));
__m512i hash32[8] __attribute__ ((aligned (64)));
__m512i initstate[8] __attribute__ ((aligned (64)));
__m512i midstate1[8] __attribute__ ((aligned (64)));
__m512i midstate2[8] __attribute__ ((aligned (64)));
__m512i mexp_pre[16] __attribute__ ((aligned (64)));
uint32_t lane_hash[8] __attribute__ ((aligned (64)));
uint32_t *hash32_d7 = (uint32_t*)&( hash32[7] );
uint32_t *pdata = work->data;
const uint32_t *ptarget = work->target;
const uint32_t targ32_d7 = ptarget[7];
const uint32_t first_nonce = pdata[19];
const uint32_t last_nonce = max_nonce - 16;
uint32_t n = first_nonce;
__m512i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
const __m512i sixteen = _mm512_set1_epi32( 16 );
for ( int i = 0; i < 19; i++ )
vdata[i] = _mm512_set1_epi32( pdata[i] );
*noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_512( vdata+16 + 5, 10 );
vdata[16+15] = _mm512_set1_epi32( 0x480 );
block[ 8] = last_byte;
memset_zero_512( block + 9, 6 );
block[15] = _mm512_set1_epi32( 0x300 );
initstate[0] = _mm512_set1_epi64( 0xdfa9bf2cdfa9bf2c );
initstate[1] = _mm512_set1_epi64( 0xb72074d4b72074d4 );
initstate[2] = _mm512_set1_epi64( 0x6bb011226bb01122 );
initstate[3] = _mm512_set1_epi64( 0xd338e869d338e869 );
initstate[4] = _mm512_set1_epi64( 0xaa3ff126aa3ff126 );
initstate[5] = _mm512_set1_epi64( 0x475bbf30475bbf30 );
initstate[6] = _mm512_set1_epi64( 0x8fd52e5b8fd52e5b );
initstate[7] = _mm512_set1_epi64( 0x9f75c9ad9f75c9ad );
sha256_16way_transform_le( midstate1, vdata, initstate );
// Do 3 rounds on the first 12 bytes of the next block
sha256_16way_prehash_3rounds( midstate2, mexp_pre, vdata+16, midstate1 );
do
{
sha256_16way_final_rounds( block, vdata+16, midstate1, midstate2,
mexp_pre );
sha256_16way_transform_le( hash32, block, initstate );
mm512_block_bswap_32( hash32, hash32 );
for ( int lane = 0; lane < 16; lane++ )
if ( hash32_d7[ lane ] <= targ32_d7 )
{
extr_lane_16x32( lane_hash, hash32, lane, 256 );
if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
{
pdata[19] = n + lane;
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm512_add_epi32( *noncev, sixteen );
n += 16;
} while ( (n < last_nonce) && !work_restart[thr_id].restart );
pdata[19] = n;
*hashes_done = n - first_nonce;
return 0;
}
#endif
#if defined(SHA256DT_8WAY)
int scanhash_sha256dt_8way( struct work *work, const uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
__m256i vdata[32] __attribute__ ((aligned (64)));
__m256i block[16] __attribute__ ((aligned (32)));
__m256i hash32[8] __attribute__ ((aligned (32)));
__m256i initstate[8] __attribute__ ((aligned (32)));
__m256i midstate1[8] __attribute__ ((aligned (32)));
__m256i midstate2[8] __attribute__ ((aligned (32)));
__m256i mexp_pre[16] __attribute__ ((aligned (32)));
uint32_t lane_hash[8] __attribute__ ((aligned (32)));
uint32_t *hash32_d7 = (uint32_t*)&( hash32[7] );
uint32_t *pdata = work->data;
const uint32_t *ptarget = work->target;
const uint32_t targ32_d7 = ptarget[7];
const uint32_t first_nonce = pdata[19];
const uint32_t last_nonce = max_nonce - 8;
uint32_t n = first_nonce;
__m256i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
const __m256i eight = _mm256_set1_epi32( 8 );
for ( int i = 0; i < 19; i++ )
vdata[i] = _mm256_set1_epi32( pdata[i] );
*noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_256( vdata+16 + 5, 10 );
vdata[16+15] = _mm256_set1_epi32( 0x480 );
block[ 8] = last_byte;
memset_zero_256( block + 9, 6 );
block[15] = _mm256_set1_epi32( 0x300 );
// initialize state
initstate[0] = _mm256_set1_epi64x( 0xdfa9bf2cdfa9bf2c );
initstate[1] = _mm256_set1_epi64x( 0xb72074d4b72074d4 );
initstate[2] = _mm256_set1_epi64x( 0x6bb011226bb01122 );
initstate[3] = _mm256_set1_epi64x( 0xd338e869d338e869 );
initstate[4] = _mm256_set1_epi64x( 0xaa3ff126aa3ff126 );
initstate[5] = _mm256_set1_epi64x( 0x475bbf30475bbf30 );
initstate[6] = _mm256_set1_epi64x( 0x8fd52e5b8fd52e5b );
initstate[7] = _mm256_set1_epi64x( 0x9f75c9ad9f75c9ad );
sha256_8way_transform_le( midstate1, vdata, initstate );
// Do 3 rounds on the first 12 bytes of the next block
sha256_8way_prehash_3rounds( midstate2, mexp_pre, vdata + 16, midstate1 );
do
{
sha256_8way_final_rounds( block, vdata+16, midstate1, midstate2,
mexp_pre );
sha256_8way_transform_le( hash32, block, initstate );
mm256_block_bswap_32( hash32, hash32 );
for ( int lane = 0; lane < 8; lane++ )
if ( hash32_d7[ lane ] <= targ32_d7 )
{
extr_lane_8x32( lane_hash, hash32, lane, 256 );
if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
{
pdata[19] = n + lane;
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm256_add_epi32( *noncev, eight );
n += 8;
} while ( (n < last_nonce) && !work_restart[thr_id].restart );
pdata[19] = n;
*hashes_done = n - first_nonce;
return 0;
}
#endif
#if defined(SHA256DT_4WAY)
int scanhash_sha256dt_4way( struct work *work, const uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
__m128i vdata[32] __attribute__ ((aligned (64)));
__m128i block[16] __attribute__ ((aligned (32)));
__m128i hash32[8] __attribute__ ((aligned (32)));
__m128i initstate[8] __attribute__ ((aligned (32)));
__m128i midstate[8] __attribute__ ((aligned (32)));
uint32_t lane_hash[8] __attribute__ ((aligned (32)));
uint32_t *hash32_d7 = (uint32_t*)&( hash32[7] );
uint32_t *pdata = work->data;
const uint32_t *ptarget = work->target;
const uint32_t targ32_d7 = ptarget[7];
const uint32_t first_nonce = pdata[19];
const uint32_t last_nonce = max_nonce - 4;
uint32_t n = first_nonce;
__m128i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
const __m128i four = _mm_set1_epi32( 4 );
for ( int i = 0; i < 19; i++ )
vdata[i] = _mm_set1_epi32( pdata[i] );
*noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_128( vdata+16 + 5, 10 );
vdata[16+15] = _mm_set1_epi32( 0x480 );
block[ 8] = last_byte;
memset_zero_128( block + 9, 6 );
block[15] = _mm_set1_epi32( 0x300 );
// initialize state
initstate[0] = _mm_set1_epi64x( 0xdfa9bf2cdfa9bf2c );
initstate[1] = _mm_set1_epi64x( 0xb72074d4b72074d4 );
initstate[2] = _mm_set1_epi64x( 0x6bb011226bb01122 );
initstate[3] = _mm_set1_epi64x( 0xd338e869d338e869 );
initstate[4] = _mm_set1_epi64x( 0xaa3ff126aa3ff126 );
initstate[5] = _mm_set1_epi64x( 0x475bbf30475bbf30 );
initstate[6] = _mm_set1_epi64x( 0x8fd52e5b8fd52e5b );
initstate[7] = _mm_set1_epi64x( 0x9f75c9ad9f75c9ad );
// hash first 64 bytes of data
sha256_4way_transform_le( midstate, vdata, initstate );
do
{
sha256_4way_transform_le( block, vdata+16, midstate );
sha256_4way_transform_le( hash32, block, initstate );
mm128_block_bswap_32( hash32, hash32 );
for ( int lane = 0; lane < 4; lane++ )
if ( unlikely( hash32_d7[ lane ] <= targ32_d7 ) )
{
extr_lane_4x32( lane_hash, hash32, lane, 256 );
if ( likely( valid_hash( lane_hash, ptarget ) && !bench ) )
{
pdata[19] = n + lane;
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm_add_epi32( *noncev, four );
n += 4;
} while ( (n < last_nonce) && !work_restart[thr_id].restart );
pdata[19] = n;
*hashes_done = n - first_nonce;
return 0;
}
#endif
bool register_sha256dt_algo( algo_gate_t* gate )
{
gate->optimizations = SSE2_OPT | AVX2_OPT | AVX512_OPT;
#if defined(SHA256DT_16WAY)
gate->scanhash = (void*)&scanhash_sha256dt_16way;
#elif defined(SHA256DT_8WAY)
gate->scanhash = (void*)&scanhash_sha256dt_8way;
#else
gate->scanhash = (void*)&scanhash_sha256dt_4way;
#endif
return true;
}
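
The three scan functions above share one loop shape: each pass tests one nonce per SIMD lane, then advances both the nonce vector and `n` by the lane count until `n` reaches `max_nonce` minus the lane count. The following is a scalar sketch of that control flow only. It assumes a toy mixing function (`toy_top_word`, hypothetical, not SHA-256) in place of the real hash, and it omits the `work_restart` check and the full `valid_hash` test.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16   /* 16 for AVX-512, 8 for AVX2, 4 for SSE2 in the code above */

/* Hypothetical stand-in for the top 32 bits of the final hash; NOT SHA-256. */
static uint32_t toy_top_word( uint32_t nonce )
{
   uint32_t x = nonce * 2654435761u;
   return x ^ ( x >> 15 );
}

/* Scalar model of the n-way scan loop.
   Stores up to max_hits candidate nonces in hits[], sets *nhits,
   and returns the number of nonces tested (n - first_nonce). */
static uint64_t scan_model( uint32_t first_nonce, uint32_t max_nonce,
                            uint32_t target_top, uint32_t *hits,
                            int max_hits, int *nhits )
{
   assert( max_nonce >= LANES );   /* otherwise last_nonce would wrap around */
   const uint32_t last_nonce = max_nonce - LANES;
   uint32_t n = first_nonce;
   *nhits = 0;
   do
   {
      /* One nonce per lane, as in the per-lane check of the vector code. */
      for ( int lane = 0; lane < LANES; lane++ )
         if ( toy_top_word( n + lane ) <= target_top && *nhits < max_hits )
            hits[ (*nhits)++ ] = n + lane;
      n += LANES;                  /* mirrors adding LANES to *noncev */
   } while ( n < last_nonce );
   return n - first_nonce;
}
```

In this model, `first_nonce = 0` and `max_nonce = 64` give three passes and a return value of 48, so the count reported is always a multiple of the lane count.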

View File

@@ -68,7 +68,7 @@ int scanhash_sha256q_16way( struct work *work, const uint32_t max_nonce,
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm512_add_epi32( *noncev, m512_const1_32( 16 ) );
*noncev = _mm512_add_epi32( *noncev, _mm512_set1_epi32( 16 ) );
n += 16;
} while ( (n < last_nonce) && !work_restart[thr_id].restart );
pdata[19] = n;
@@ -140,7 +140,7 @@ int scanhash_sha256q_8way( struct work *work, const uint32_t max_nonce,
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm256_add_epi32( *noncev, m256_const1_32( 8 ) );
*noncev = _mm256_add_epi32( *noncev, _mm256_set1_epi32( 8 ) );
n += 8;
} while ( (n < last_nonce) && !work_restart[thr_id].restart );
pdata[19] = n;

View File

@@ -28,31 +28,31 @@ int scanhash_sha256t_16way( struct work *work, const uint32_t max_nonce,
__m512i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m512i last_byte = m512_const1_32( 0x80000000 );
const __m512i sixteen = m512_const1_32( 16 );
const __m512i last_byte = _mm512_set1_epi32( 0x80000000 );
const __m512i sixteen = _mm512_set1_epi32( 16 );
for ( int i = 0; i < 19; i++ )
vdata[i] = m512_const1_32( pdata[i] );
vdata[i] = _mm512_set1_epi32( pdata[i] );
*noncev = _mm512_set_epi32( n+15, n+14, n+13, n+12, n+11, n+10, n+9, n+8,
n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_512( vdata+16 + 5, 10 );
vdata[16+15] = m512_const1_32( 80*8 ); // bit count
vdata[16+15] = _mm512_set1_epi32( 80*8 ); // bit count
block[ 8] = last_byte;
memset_zero_512( block + 9, 6 );
block[15] = m512_const1_32( 32*8 ); // bit count
block[15] = _mm512_set1_epi32( 32*8 ); // bit count
initstate[0] = m512_const1_64( 0x6A09E6676A09E667 );
initstate[1] = m512_const1_64( 0xBB67AE85BB67AE85 );
initstate[2] = m512_const1_64( 0x3C6EF3723C6EF372 );
initstate[3] = m512_const1_64( 0xA54FF53AA54FF53A );
initstate[4] = m512_const1_64( 0x510E527F510E527F );
initstate[5] = m512_const1_64( 0x9B05688C9B05688C );
initstate[6] = m512_const1_64( 0x1F83D9AB1F83D9AB );
initstate[7] = m512_const1_64( 0x5BE0CD195BE0CD19 );
initstate[0] = _mm512_set1_epi64( 0x6A09E6676A09E667 );
initstate[1] = _mm512_set1_epi64( 0xBB67AE85BB67AE85 );
initstate[2] = _mm512_set1_epi64( 0x3C6EF3723C6EF372 );
initstate[3] = _mm512_set1_epi64( 0xA54FF53AA54FF53A );
initstate[4] = _mm512_set1_epi64( 0x510E527F510E527F );
initstate[5] = _mm512_set1_epi64( 0x9B05688C9B05688C );
initstate[6] = _mm512_set1_epi64( 0x1F83D9AB1F83D9AB );
initstate[7] = _mm512_set1_epi64( 0x5BE0CD195BE0CD19 );
sha256_16way_transform_le( midstate1, vdata, initstate );
@@ -120,31 +120,31 @@ int scanhash_sha256t_8way( struct work *work, const uint32_t max_nonce,
__m256i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m256i last_byte = m256_const1_32( 0x80000000 );
const __m256i eight = m256_const1_32( 8 );
const __m256i last_byte = _mm256_set1_epi32( 0x80000000 );
const __m256i eight = _mm256_set1_epi32( 8 );
for ( int i = 0; i < 19; i++ )
vdata[i] = m256_const1_32( pdata[i] );
vdata[i] = _mm256_set1_epi32( pdata[i] );
*noncev = _mm256_set_epi32( n+ 7, n+ 6, n+ 5, n+ 4, n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_256( vdata+16 + 5, 10 );
vdata[16+15] = m256_const1_32( 80*8 ); // bit count
vdata[16+15] = _mm256_set1_epi32( 80*8 ); // bit count
block[ 8] = last_byte;
memset_zero_256( block + 9, 6 );
block[15] = m256_const1_32( 32*8 ); // bit count
block[15] = _mm256_set1_epi32( 32*8 ); // bit count
// initialize state
initstate[0] = m256_const1_64( 0x6A09E6676A09E667 );
initstate[1] = m256_const1_64( 0xBB67AE85BB67AE85 );
initstate[2] = m256_const1_64( 0x3C6EF3723C6EF372 );
initstate[3] = m256_const1_64( 0xA54FF53AA54FF53A );
initstate[4] = m256_const1_64( 0x510E527F510E527F );
initstate[5] = m256_const1_64( 0x9B05688C9B05688C );
initstate[6] = m256_const1_64( 0x1F83D9AB1F83D9AB );
initstate[7] = m256_const1_64( 0x5BE0CD195BE0CD19 );
initstate[0] = _mm256_set1_epi64x( 0x6A09E6676A09E667 );
initstate[1] = _mm256_set1_epi64x( 0xBB67AE85BB67AE85 );
initstate[2] = _mm256_set1_epi64x( 0x3C6EF3723C6EF372 );
initstate[3] = _mm256_set1_epi64x( 0xA54FF53AA54FF53A );
initstate[4] = _mm256_set1_epi64x( 0x510E527F510E527F );
initstate[5] = _mm256_set1_epi64x( 0x9B05688C9B05688C );
initstate[6] = _mm256_set1_epi64x( 0x1F83D9AB1F83D9AB );
initstate[7] = _mm256_set1_epi64x( 0x5BE0CD195BE0CD19 );
sha256_8way_transform_le( midstate1, vdata, initstate );
@@ -215,31 +215,31 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
__m128i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m128i last_byte = m128_const1_32( 0x80000000 );
const __m128i four = m128_const1_32( 4 );
const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
const __m128i four = _mm_set1_epi32( 4 );
for ( int i = 0; i < 19; i++ )
vdata[i] = m128_const1_32( pdata[i] );
vdata[i] = _mm_set1_epi32( pdata[i] );
*noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_128( vdata+16 + 5, 10 );
vdata[16+15] = m128_const1_32( 80*8 ); // bit count
vdata[16+15] = _mm_set1_epi32( 80*8 ); // bit count
block[ 8] = last_byte;
memset_zero_128( block + 9, 6 );
block[15] = m128_const1_32( 32*8 ); // bit count
block[15] = _mm_set1_epi32( 32*8 ); // bit count
// initialize state
initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
initstate[4] = m128_const1_64( 0x510E527F510E527F );
initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
initstate[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
initstate[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
initstate[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
initstate[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
initstate[4] = _mm_set1_epi64x( 0x510E527F510E527F );
initstate[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
initstate[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
initstate[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );
// hash first 64 bytes of data
sha256_4way_transform_le( midstate1, vdata, initstate );
@@ -302,31 +302,31 @@ int scanhash_sha256t_4way( struct work *work, const uint32_t max_nonce,
__m128i *noncev = vdata + 19;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m128i last_byte = m128_const1_32( 0x80000000 );
const __m128i four = m128_const1_32( 4 );
const __m128i last_byte = _mm_set1_epi32( 0x80000000 );
const __m128i four = _mm_set1_epi32( 4 );
for ( int i = 0; i < 19; i++ )
vdata[i] = m128_const1_32( pdata[i] );
vdata[i] = _mm_set1_epi32( pdata[i] );
*noncev = _mm_set_epi32( n+ 3, n+ 2, n+1, n );
vdata[16+4] = last_byte;
memset_zero_128( vdata+16 + 5, 10 );
vdata[16+15] = m128_const1_32( 80*8 ); // bit count
vdata[16+15] = _mm_set1_epi32( 80*8 ); // bit count
block[ 8] = last_byte;
memset_zero_128( block + 9, 6 );
block[15] = m128_const1_32( 32*8 ); // bit count
block[15] = _mm_set1_epi32( 32*8 ); // bit count
// initialize state
initstate[0] = m128_const1_64( 0x6A09E6676A09E667 );
initstate[1] = m128_const1_64( 0xBB67AE85BB67AE85 );
initstate[2] = m128_const1_64( 0x3C6EF3723C6EF372 );
initstate[3] = m128_const1_64( 0xA54FF53AA54FF53A );
initstate[4] = m128_const1_64( 0x510E527F510E527F );
initstate[5] = m128_const1_64( 0x9B05688C9B05688C );
initstate[6] = m128_const1_64( 0x1F83D9AB1F83D9AB );
initstate[7] = m128_const1_64( 0x5BE0CD195BE0CD19 );
initstate[0] = _mm_set1_epi64x( 0x6A09E6676A09E667 );
initstate[1] = _mm_set1_epi64x( 0xBB67AE85BB67AE85 );
initstate[2] = _mm_set1_epi64x( 0x3C6EF3723C6EF372 );
initstate[3] = _mm_set1_epi64x( 0xA54FF53AA54FF53A );
initstate[4] = _mm_set1_epi64x( 0x510E527F510E527F );
initstate[5] = _mm_set1_epi64x( 0x9B05688C9B05688C );
initstate[6] = _mm_set1_epi64x( 0x1F83D9AB1F83D9AB );
initstate[7] = _mm_set1_epi64x( 0x5BE0CD195BE0CD19 );
// hash first 64 bytes of data
sha256_4way_transform_le( midstate, vdata, initstate );
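
The `initstate` lines in these hunks pass a 64-bit constant whose two 32-bit halves are identical to a 64-bit broadcast. Every 32-bit lane therefore ends up holding the same word, which is the effect a 32-bit broadcast of the SHA-256 initial value would have. A portable sketch of that equivalence, using a byte buffer in place of a vector register (`broadcast64` and `lane32` are hypothetical helpers, not part of cpuminer-opt):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 32   /* size of a __m256i */

/* Hypothetical scalar model of a 64-bit broadcast: fill a 256-bit buffer
   with four copies of a 64-bit value. */
static void broadcast64( uint8_t vec[VEC_BYTES], uint64_t v )
{
   for ( int i = 0; i < VEC_BYTES / 8; i++ )
      memcpy( vec + 8*i, &v, 8 );
}

/* Read 32-bit lane `lane` (0..7) back out of the buffer. */
static uint32_t lane32( const uint8_t vec[VEC_BYTES], int lane )
{
   uint32_t w;
   assert( lane >= 0 && lane < VEC_BYTES / 4 );
   memcpy( &w, vec + 4*lane, 4 );
   return w;
}
```

Because both halves of the constant are equal, the result does not depend on byte order: `lane32` returns `0x6A09E667` for every lane after `broadcast64( v, 0x6A09E6676A09E667 )`.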

View File

@@ -155,14 +155,14 @@ sha512_8way_round( sha512_8way_context *ctx, __m512i *in, __m512i r[8] )
}
else
{
A = m512_const1_64( 0x6A09E667F3BCC908 );
B = m512_const1_64( 0xBB67AE8584CAA73B );
C = m512_const1_64( 0x3C6EF372FE94F82B );
D = m512_const1_64( 0xA54FF53A5F1D36F1 );
E = m512_const1_64( 0x510E527FADE682D1 );
F = m512_const1_64( 0x9B05688C2B3E6C1F );
G = m512_const1_64( 0x1F83D9ABFB41BD6B );
H = m512_const1_64( 0x5BE0CD19137E2179 );
A = _mm512_set1_epi64( 0x6A09E667F3BCC908 );
B = _mm512_set1_epi64( 0xBB67AE8584CAA73B );
C = _mm512_set1_epi64( 0x3C6EF372FE94F82B );
D = _mm512_set1_epi64( 0xA54FF53A5F1D36F1 );
E = _mm512_set1_epi64( 0x510E527FADE682D1 );
F = _mm512_set1_epi64( 0x9B05688C2B3E6C1F );
G = _mm512_set1_epi64( 0x1F83D9ABFB41BD6B );
H = _mm512_set1_epi64( 0x5BE0CD19137E2179 );
}
for ( i = 0; i < 80; i += 8 )
@@ -191,14 +191,14 @@ sha512_8way_round( sha512_8way_context *ctx, __m512i *in, __m512i r[8] )
else
{
ctx->initialized = true;
r[0] = _mm512_add_epi64( A, m512_const1_64( 0x6A09E667F3BCC908 ) );
r[1] = _mm512_add_epi64( B, m512_const1_64( 0xBB67AE8584CAA73B ) );
r[2] = _mm512_add_epi64( C, m512_const1_64( 0x3C6EF372FE94F82B ) );
r[3] = _mm512_add_epi64( D, m512_const1_64( 0xA54FF53A5F1D36F1 ) );
r[4] = _mm512_add_epi64( E, m512_const1_64( 0x510E527FADE682D1 ) );
r[5] = _mm512_add_epi64( F, m512_const1_64( 0x9B05688C2B3E6C1F ) );
r[6] = _mm512_add_epi64( G, m512_const1_64( 0x1F83D9ABFB41BD6B ) );
r[7] = _mm512_add_epi64( H, m512_const1_64( 0x5BE0CD19137E2179 ) );
r[0] = _mm512_add_epi64( A, _mm512_set1_epi64( 0x6A09E667F3BCC908 ) );
r[1] = _mm512_add_epi64( B, _mm512_set1_epi64( 0xBB67AE8584CAA73B ) );
r[2] = _mm512_add_epi64( C, _mm512_set1_epi64( 0x3C6EF372FE94F82B ) );
r[3] = _mm512_add_epi64( D, _mm512_set1_epi64( 0xA54FF53A5F1D36F1 ) );
r[4] = _mm512_add_epi64( E, _mm512_set1_epi64( 0x510E527FADE682D1 ) );
r[5] = _mm512_add_epi64( F, _mm512_set1_epi64( 0x9B05688C2B3E6C1F ) );
r[6] = _mm512_add_epi64( G, _mm512_set1_epi64( 0x1F83D9ABFB41BD6B ) );
r[7] = _mm512_add_epi64( H, _mm512_set1_epi64( 0x5BE0CD19137E2179 ) );
}
}
@@ -239,14 +239,11 @@ void sha512_8way_close( sha512_8way_context *sc, void *dst )
unsigned ptr;
const int buf_size = 128;
const int pad = buf_size - 16;
const __m512i shuff_bswap64 = m512_const_64(
0x38393a3b3c3d3e3f, 0x3031323334353637,
0x28292a2b2c2d2e2f, 0x2021222324252627,
0x18191a1b1c1d1e1f, 0x1011121314151617,
0x08090a0b0c0d0e0f, 0x0001020304050607 );
const __m512i shuff_bswap64 = mm512_bcast_m128( _mm_set_epi64x(
0x08090a0b0c0d0e0f, 0x0001020304050607 ) );
ptr = (unsigned)sc->count & (buf_size - 1U);
sc->buf[ ptr>>3 ] = m512_const1_64( 0x80 );
sc->buf[ ptr>>3 ] = _mm512_set1_epi64( 0x80 );
ptr += 8;
if ( ptr > pad )
{
@@ -271,51 +268,56 @@ void sha512_8way_close( sha512_8way_context *sc, void *dst )
// SHA-512 4 way 64 bit
#define BSG5_0( x ) mm256_xor3( mm256_ror_64( x, 28 ), \
mm256_ror_64( x, 34 ), \
mm256_ror_64( x, 39 ) )
#define BSG5_1( x ) mm256_xor3( mm256_ror_64( x, 14 ), \
mm256_ror_64( x, 18 ), \
mm256_ror_64( x, 41 ) )
#define SSG5_0( x ) mm256_xor3( mm256_ror_64( x, 1 ), \
mm256_ror_64( x, 8 ), \
_mm256_srli_epi64( x, 7 ) )
#define SSG5_1( x ) mm256_xor3( mm256_ror_64( x, 19 ), \
mm256_ror_64( x, 61 ), \
_mm256_srli_epi64( x, 6 ) )
#if defined(__AVX512VL__)
//TODO Enable for AVX10_256
// 4 way is not used with AVX512 but will be with AVX10_256 when it
// becomes available.
#define CH( X, Y, Z ) _mm256_ternarylogic_epi64( X, Y, Z, 0xca )
#define MAJ( X, Y, Z ) _mm256_ternarylogic_epi64( X, Y, Z, 0xe8 )
#define SHA3_4WAY_STEP( A, B, C, D, E, F, G, H, i ) \
do { \
__m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[i] ); \
__m256i T1 = BSG5_1( E ); \
__m256i T2 = BSG5_0( A ); \
T0 = _mm256_add_epi64( T0, CH( E, F, G ) ); \
T1 = _mm256_add_epi64( T1, H ); \
T2 = _mm256_add_epi64( T2, MAJ( A, B, C ) ); \
T1 = _mm256_add_epi64( T1, T0 ); \
D = _mm256_add_epi64( D, T1 ); \
H = _mm256_add_epi64( T1, T2 ); \
} while (0)
#else // AVX2 only
#define CH(X, Y, Z) \
_mm256_xor_si256( _mm256_and_si256( _mm256_xor_si256( Y, Z ), X ), Z )
#define MAJ(X, Y, Z) \
_mm256_xor_si256( Y, _mm256_and_si256( X_xor_Y = _mm256_xor_si256( X, Y ), \
Y_xor_Z ) )
#define BSG5_0(x) \
mm256_ror_64( _mm256_xor_si256( mm256_ror_64( \
_mm256_xor_si256( mm256_ror_64( x, 5 ), x ), 6 ), x ), 28 )
#define BSG5_1(x) \
mm256_ror_64( _mm256_xor_si256( mm256_ror_64( \
_mm256_xor_si256( mm256_ror_64( x, 23 ), x ), 4 ), x ), 14 )
/*
#define SSG5_0(x) \
_mm256_xor_si256( _mm256_xor_si256( \
mm256_ror_64(x, 1), mm256_ror_64(x, 8) ), _mm256_srli_epi64(x, 7) )
#define SSG5_1(x) \
_mm256_xor_si256( _mm256_xor_si256( \
mm256_ror_64(x, 19), mm256_ror_64(x, 61) ), _mm256_srli_epi64(x, 6) )
*/
// Interleave SSG0 & SSG1 for better throughput.
// return ssg0(w0) + ssg1(w1)
static inline __m256i ssg512_add( __m256i w0, __m256i w1 )
{
__m256i w0a, w1a, w0b, w1b;
w0a = mm256_ror_64( w0, 1 );
w1a = mm256_ror_64( w1,19 );
w0b = mm256_ror_64( w0, 8 );
w1b = mm256_ror_64( w1,61 );
w0a = _mm256_xor_si256( w0a, w0b );
w1a = _mm256_xor_si256( w1a, w1b );
w0b = _mm256_srli_epi64( w0, 7 );
w1b = _mm256_srli_epi64( w1, 6 );
w0a = _mm256_xor_si256( w0a, w0b );
w1a = _mm256_xor_si256( w1a, w1b );
return _mm256_add_epi64( w0a, w1a );
}
#define SHA3_4WAY_STEP( A, B, C, D, E, F, G, H, i ) \
do { \
__m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[ i ] ); \
__m256i T0 = _mm256_add_epi64( _mm256_set1_epi64x( K512[i] ), W[i] ); \
__m256i T1 = BSG5_1( E ); \
__m256i T2 = BSG5_0( A ); \
T0 = _mm256_add_epi64( T0, CH( E, F, G ) ); \
@@ -327,19 +329,27 @@ do { \
H = _mm256_add_epi64( T1, T2 ); \
} while (0)
#endif // AVX512VL AVX10_256
static void
sha512_4way_round( sha512_4way_context *ctx, __m256i *in, __m256i r[8] )
{
int i;
register __m256i A, B, C, D, E, F, G, H, X_xor_Y, Y_xor_Z;
register __m256i A, B, C, D, E, F, G, H;
#if !defined(__AVX512VL__)
// Disable for AVX10_256
__m256i X_xor_Y, Y_xor_Z;
#endif
__m256i W[80];
mm256_block_bswap_64( W , in );
mm256_block_bswap_64( W+8, in+8 );
for ( i = 16; i < 80; i++ )
W[i] = _mm256_add_epi64( ssg512_add( W[i-15], W[i-2] ),
_mm256_add_epi64( W[ i- 7 ], W[ i-16 ] ) );
W[i] = mm256_add4_64( SSG5_0( W[i-15] ), SSG5_1( W[i-2] ),
W[ i- 7 ], W[ i-16 ] );
if ( ctx->initialized )
{
@@ -354,17 +364,20 @@ sha512_4way_round( sha512_4way_context *ctx, __m256i *in, __m256i r[8] )
}
else
{
A = m256_const1_64( 0x6A09E667F3BCC908 );
B = m256_const1_64( 0xBB67AE8584CAA73B );
C = m256_const1_64( 0x3C6EF372FE94F82B );
D = m256_const1_64( 0xA54FF53A5F1D36F1 );
E = m256_const1_64( 0x510E527FADE682D1 );
F = m256_const1_64( 0x9B05688C2B3E6C1F );
G = m256_const1_64( 0x1F83D9ABFB41BD6B );
H = m256_const1_64( 0x5BE0CD19137E2179 );
A = _mm256_set1_epi64x( 0x6A09E667F3BCC908 );
B = _mm256_set1_epi64x( 0xBB67AE8584CAA73B );
C = _mm256_set1_epi64x( 0x3C6EF372FE94F82B );
D = _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 );
E = _mm256_set1_epi64x( 0x510E527FADE682D1 );
F = _mm256_set1_epi64x( 0x9B05688C2B3E6C1F );
G = _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B );
H = _mm256_set1_epi64x( 0x5BE0CD19137E2179 );
}
#if !defined(__AVX512VL__)
// Disable for AVX10_256
Y_xor_Z = _mm256_xor_si256( B, C );
#endif
for ( i = 0; i < 80; i += 8 )
{
@@ -392,14 +405,14 @@ sha512_4way_round( sha512_4way_context *ctx, __m256i *in, __m256i r[8] )
else
{
ctx->initialized = true;
r[0] = _mm256_add_epi64( A, m256_const1_64( 0x6A09E667F3BCC908 ) );
r[1] = _mm256_add_epi64( B, m256_const1_64( 0xBB67AE8584CAA73B ) );
r[2] = _mm256_add_epi64( C, m256_const1_64( 0x3C6EF372FE94F82B ) );
r[3] = _mm256_add_epi64( D, m256_const1_64( 0xA54FF53A5F1D36F1 ) );
r[4] = _mm256_add_epi64( E, m256_const1_64( 0x510E527FADE682D1 ) );
r[5] = _mm256_add_epi64( F, m256_const1_64( 0x9B05688C2B3E6C1F ) );
r[6] = _mm256_add_epi64( G, m256_const1_64( 0x1F83D9ABFB41BD6B ) );
r[7] = _mm256_add_epi64( H, m256_const1_64( 0x5BE0CD19137E2179 ) );
r[0] = _mm256_add_epi64( A, _mm256_set1_epi64x( 0x6A09E667F3BCC908 ) );
r[1] = _mm256_add_epi64( B, _mm256_set1_epi64x( 0xBB67AE8584CAA73B ) );
r[2] = _mm256_add_epi64( C, _mm256_set1_epi64x( 0x3C6EF372FE94F82B ) );
r[3] = _mm256_add_epi64( D, _mm256_set1_epi64x( 0xA54FF53A5F1D36F1 ) );
r[4] = _mm256_add_epi64( E, _mm256_set1_epi64x( 0x510E527FADE682D1 ) );
r[5] = _mm256_add_epi64( F, _mm256_set1_epi64x( 0x9B05688C2B3E6C1F ) );
r[6] = _mm256_add_epi64( G, _mm256_set1_epi64x( 0x1F83D9ABFB41BD6B ) );
r[7] = _mm256_add_epi64( H, _mm256_set1_epi64x( 0x5BE0CD19137E2179 ) );
}
}
@@ -440,13 +453,11 @@ void sha512_4way_close( sha512_4way_context *sc, void *dst )
unsigned ptr;
const int buf_size = 128;
const int pad = buf_size - 16;
const __m256i shuff_bswap64 = m256_const_64( 0x18191a1b1c1d1e1f,
0x1011121314151617,
0x08090a0b0c0d0e0f,
0x0001020304050607 );
const __m256i shuff_bswap64 = mm256_bcast_m128( _mm_set_epi64x(
0x08090a0b0c0d0e0f, 0x0001020304050607 ) );
ptr = (unsigned)sc->count & (buf_size - 1U);
sc->buf[ ptr>>3 ] = m256_const1_64( 0x80 );
sc->buf[ ptr>>3 ] = _mm256_set1_epi64x( 0x80 );
ptr += 8;
if ( ptr > pad )
{
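
The AVX512VL path in the 4-way hunk above defines `CH` and `MAJ` as single ternary-logic operations with the immediates `0xca` and `0xe8`. Each immediate is an 8-entry truth table indexed by the three input bits. The scalar sketch below models that lookup on 64-bit words and compares it with the textbook SHA-2 definitions; `ternlog64`, `ch_ref`, `maj_ref` and `check_ch_maj` are illustrative names, not functions from the source.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a three-input ternary-logic operation on 64-bit words:
   at each bit position the bits of a, b, c form a 3-bit index (a is the
   most significant) that selects one bit of the 8-bit truth table imm. */
static uint64_t ternlog64( uint64_t a, uint64_t b, uint64_t c, uint8_t imm )
{
   uint64_t r = 0;
   for ( int bit = 0; bit < 64; bit++ )
   {
      unsigned idx = (unsigned)( ( ( ( a >> bit ) & 1 ) << 2 )
                               | ( ( ( b >> bit ) & 1 ) << 1 )
                               |   ( ( c >> bit ) & 1 ) );
      r |= (uint64_t)( ( imm >> idx ) & 1 ) << bit;
   }
   return r;
}

/* Reference definitions of the SHA-2 choose and majority functions. */
static uint64_t ch_ref( uint64_t x, uint64_t y, uint64_t z )
{  return ( x & y ) ^ ( ~x & z );  }

static uint64_t maj_ref( uint64_t x, uint64_t y, uint64_t z )
{  return ( x & y ) ^ ( x & z ) ^ ( y & z );  }

/* Compare the table-driven and reference forms on one input triple. */
static void check_ch_maj( uint64_t x, uint64_t y, uint64_t z )
{
   assert( ternlog64( x, y, z, 0xca ) == ch_ref ( x, y, z ) );
   assert( ternlog64( x, y, z, 0xe8 ) == maj_ref( x, y, z ) );
}
```

Reading the tables bit by bit: `0xca` is `11001010`, which returns the `b` bit when `a` is 1 and the `c` bit when `a` is 0; `0xe8` is `11101000`, which is 1 exactly when at least two inputs are 1.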

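
The 4-way SHA-512 hunk shows `BSG5_0` and `BSG5_1` in two forms: three rotates combined with a three-way XOR, and a nested chain of rotates and XORs. The two agree because rotation distributes over XOR. A scalar sketch of that identity follows; the function names are illustrative and not taken from the source.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit word right by r bits, 0 < r < 64. */
static uint64_t ror64( uint64_t x, unsigned r )
{
   assert( r > 0 && r < 64 );
   return ( x >> r ) | ( x << ( 64 - r ) );
}

/* Big sigma 0 of SHA-512 written as three independent rotates. */
static uint64_t bsg5_0_flat( uint64_t x )
{  return ror64( x, 28 ) ^ ror64( x, 34 ) ^ ror64( x, 39 );  }

/* Nested form: ror( ror( ror(x,5) ^ x, 6 ) ^ x, 28 ) expands to
   ror(x,39) ^ ror(x,34) ^ ror(x,28), since 5+6+28 = 39 and 6+28 = 34. */
static uint64_t bsg5_0_nested( uint64_t x )
{  return ror64( ror64( ror64( x, 5 ) ^ x, 6 ) ^ x, 28 );  }

/* Big sigma 1 of SHA-512: rotates by 14, 18 and 41. */
static uint64_t bsg5_1_flat( uint64_t x )
{  return ror64( x, 14 ) ^ ror64( x, 18 ) ^ ror64( x, 41 );  }

/* Nested form: 23+4+14 = 41 and 4+14 = 18. */
static uint64_t bsg5_1_nested( uint64_t x )
{  return ror64( ror64( ror64( x, 23 ) ^ x, 4 ) ^ x, 14 );  }
```

The nested form chains each rotate on the previous result, while the flat form leaves the three rotates independent so that one three-input XOR can combine them.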
221
algo/sha/sha512256d-4way.c Normal file
View File

@@ -0,0 +1,221 @@
#include "algo-gate-api.h"
#include "sha-hash-4way.h"
#include <string.h>
#include <stdint.h>
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
#define SHA512256D_8WAY 1
#elif defined(__AVX2__)
#define SHA512256D_4WAY 1
#endif
#if defined(SHA512256D_8WAY)
static void sha512256d_8way_init( sha512_8way_context *ctx )
{
ctx->count = 0;
ctx->initialized = true;
ctx->val[0] = _mm512_set1_epi64( 0x22312194FC2BF72C );
ctx->val[1] = _mm512_set1_epi64( 0x9F555FA3C84C64C2 );
ctx->val[2] = _mm512_set1_epi64( 0x2393B86B6F53B151 );
ctx->val[3] = _mm512_set1_epi64( 0x963877195940EABD );
ctx->val[4] = _mm512_set1_epi64( 0x96283EE2A88EFFE3 );
ctx->val[5] = _mm512_set1_epi64( 0xBE5E1E2553863992 );
ctx->val[6] = _mm512_set1_epi64( 0x2B0199FC2C85B8AA );
ctx->val[7] = _mm512_set1_epi64( 0x0EB72DDC81C52CA2 );
}
int scanhash_sha512256d_8way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint64_t hash[8*8] __attribute__ ((aligned (128)));
uint32_t vdata[20*8] __attribute__ ((aligned (64)));
sha512_8way_context ctx;
uint32_t lane_hash[8] __attribute__ ((aligned (32)));
uint64_t *hash_q3 = &(hash[3*8]);
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
const uint64_t targ_q3 = ((uint64_t*)ptarget)[3];
const uint32_t first_nonce = pdata[19];
const uint32_t last_nonce = max_nonce - 8;
uint32_t n = first_nonce;
__m512i *noncev = (__m512i*)vdata + 9;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m512i eight = _mm512_set1_epi64( 0x0000000800000000 );
mm512_bswap32_intrlv80_8x64( vdata, pdata );
*noncev = mm512_intrlv_blend_32(
_mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
n+3, 0, n+2, 0, n+1, 0, n , 0 ), *noncev );
do
{
sha512256d_8way_init( &ctx );
sha512_8way_update( &ctx, vdata, 80 );
sha512_8way_close( &ctx, hash );
sha512256d_8way_init( &ctx );
sha512_8way_update( &ctx, hash, 32 );
sha512_8way_close( &ctx, hash );
for ( int lane = 0; lane < 8; lane++ )
if ( unlikely( hash_q3[ lane ] <= targ_q3 && !bench ) )
{
extr_lane_8x64( lane_hash, hash, lane, 256 );
if ( valid_hash( lane_hash, ptarget ) && !bench )
{
pdata[19] = bswap_32( n + lane );
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm512_add_epi32( *noncev, eight );
n += 8;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
pdata[19] = n;
*hashes_done = n - first_nonce;
return 0;
}
#elif defined(SHA512256D_4WAY)
static void sha512256d_4way_init( sha512_4way_context *ctx )
{
ctx->count = 0;
ctx->initialized = true;
ctx->val[0] = _mm256_set1_epi64x( 0x22312194FC2BF72C );
ctx->val[1] = _mm256_set1_epi64x( 0x9F555FA3C84C64C2 );
ctx->val[2] = _mm256_set1_epi64x( 0x2393B86B6F53B151 );
ctx->val[3] = _mm256_set1_epi64x( 0x963877195940EABD );
ctx->val[4] = _mm256_set1_epi64x( 0x96283EE2A88EFFE3 );
ctx->val[5] = _mm256_set1_epi64x( 0xBE5E1E2553863992 );
ctx->val[6] = _mm256_set1_epi64x( 0x2B0199FC2C85B8AA );
ctx->val[7] = _mm256_set1_epi64x( 0x0EB72DDC81C52CA2 );
}
int scanhash_sha512256d_4way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint64_t hash[8*4] __attribute__ ((aligned (64)));
uint32_t vdata[20*4] __attribute__ ((aligned (64)));
sha512_4way_context ctx;
uint32_t lane_hash[8] __attribute__ ((aligned (32)));
uint64_t *hash_q3 = &(hash[3*4]);
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
const uint64_t targ_q3 = ((uint64_t*)ptarget)[3];
const uint32_t first_nonce = pdata[19];
const uint32_t last_nonce = max_nonce - 4;
uint32_t n = first_nonce;
__m256i *noncev = (__m256i*)vdata + 9;
const int thr_id = mythr->id;
const bool bench = opt_benchmark;
const __m256i four = _mm256_set1_epi64x( 0x0000000400000000 );
mm256_bswap32_intrlv80_4x64( vdata, pdata );
*noncev = mm256_intrlv_blend_32(
_mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ), *noncev );
do
{
sha512256d_4way_init( &ctx );
sha512_4way_update( &ctx, vdata, 80 );
sha512_4way_close( &ctx, hash );
sha512256d_4way_init( &ctx );
sha512_4way_update( &ctx, hash, 32 );
sha512_4way_close( &ctx, hash );
for ( int lane = 0; lane < 4; lane++ )
if ( hash_q3[ lane ] <= targ_q3 )
{
extr_lane_4x64( lane_hash, hash, lane, 256 );
if ( valid_hash( lane_hash, ptarget ) && !bench )
{
pdata[19] = bswap_32( n + lane );
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm256_add_epi32( *noncev, four );
n += 4;
} while ( (n < last_nonce) && !work_restart[thr_id].restart );
pdata[19] = n;
*hashes_done = n - first_nonce;
return 0;
}
#else
#include "sph_sha2.h"
static const uint64_t H512_256[8] =
{
0x22312194FC2BF72C, 0x9F555FA3C84C64C2,
0x2393B86B6F53B151, 0x963877195940EABD,
0x96283EE2A88EFFE3, 0xBE5E1E2553863992,
0x2B0199FC2C85B8AA, 0x0EB72DDC81C52CA2,
};
static void sha512256d_init( sph_sha512_context *ctx )
{
memcpy( ctx->val, H512_256, sizeof H512_256 );
ctx->count = 0;
}
int scanhash_sha512256d( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
uint32_t hash64[8] __attribute__ ((aligned (64)));
uint32_t endiandata[20] __attribute__ ((aligned (64)));
sph_sha512_context ctx;
const uint32_t Htarg = ptarget[7];
const uint32_t first_nonce = pdata[19];
uint32_t n = first_nonce;
int thr_id = mythr->id;
swab32_array( endiandata, pdata, 20 );
do {
be32enc( &endiandata[19], n );
sha512256d_init( &ctx );
sph_sha512( &ctx, endiandata, 80 );
sph_sha512_close( &ctx, hash64 );
sha512256d_init( &ctx );
sph_sha512( &ctx, hash64, 32 );
sph_sha512_close( &ctx, hash64 );
if ( hash64[7] <= Htarg )
if ( fulltest( hash64, ptarget ) && !opt_benchmark )
{
pdata[19] = n;
submit_solution( work, hash64, mythr );
}
n++;
} while (n < max_nonce && !work_restart[thr_id].restart);
*hashes_done = n - first_nonce;
pdata[19] = n;
return 0;
}
#endif
bool register_sha512256d_algo( algo_gate_t* gate )
{
gate->optimizations = AVX2_OPT | AVX512_OPT;
#if defined(SHA512256D_8WAY)
gate->scanhash = (void*)&scanhash_sha512256d_8way;
#elif defined(SHA512256D_4WAY)
gate->scanhash = (void*)&scanhash_sha512256d_4way;
#else
gate->scanhash = (void*)&scanhash_sha512256d;
#endif
return true;
}

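
In the 8-way and 4-way SHA-512/256d scan functions, the nonce increment is a 64-bit constant such as `0x0000000800000000` applied with a packed 32-bit add. The nonce occupies the high 32 bits of each 64-bit lane, so only that half changes and no carry can leak between halves. The scalar sketch below models one lane; it assumes data word 18 sits in the low half, and `pack_lane`, `add_epi32_lane` and `next_nonce_lane` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 64-bit lane of the interleaved nonce row: an
   unrelated data word in the low 32 bits (assumed to be word 18) and the
   nonce in the high 32 bits. */
static uint64_t pack_lane( uint32_t word18, uint32_t nonce )
{  return ( (uint64_t)nonce << 32 ) | word18;  }

/* Model of a packed 32-bit add applied to one 64-bit lane: each half is
   added on its own, so no carry crosses from the low half to the high. */
static uint64_t add_epi32_lane( uint64_t a, uint64_t b )
{
   uint32_t lo = (uint32_t)a + (uint32_t)b;
   uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}

/* Advance the nonce by `step` using a constant with step in the high half
   and zero in the low half, e.g. 0x0000000800000000 for step 8. */
static uint64_t next_nonce_lane( uint64_t lane, uint32_t step )
{
   assert( step != 0 );
   return add_epi32_lane( lane, (uint64_t)step << 32 );
}
```

A 64-bit add with the same constant would give the same result here, because the low half of the constant is zero; the 32-bit form simply makes the independence of the two halves explicit.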
View File

@@ -33,6 +33,7 @@
#include <stddef.h>
#include <string.h>
// 4way is only used with AVX2, 8way only with AVX512, 16way is not needed.
#ifdef __SSE4_1__
#include "shabal-hash-4way.h"
@@ -44,21 +45,6 @@ extern "C"{
#pragma warning (disable: 4146)
#endif
/*
* Part of this code was automatically generated (the part between
* the "BEGIN" and "END" markers).
*/
#define sM 16
#define C32 SPH_C32
#define T32 SPH_T32
#define O1 13
#define O2 9
#define O3 6
#if defined(__AVX2__)
#define DECL_STATE8 \
@@ -126,50 +112,50 @@ extern "C"{
else \
{ \
(state)->state_loaded = true; \
A0 = m256_const1_64( 0x20728DFD20728DFD ); \
A1 = m256_const1_64( 0x46C0BD5346C0BD53 ); \
A2 = m256_const1_64( 0xE782B699E782B699 ); \
A3 = m256_const1_64( 0x5530463255304632 ); \
A4 = m256_const1_64( 0x71B4EF9071B4EF90 ); \
A5 = m256_const1_64( 0x0EA9E82C0EA9E82C ); \
A6 = m256_const1_64( 0xDBB930F1DBB930F1 ); \
A7 = m256_const1_64( 0xFAD06B8BFAD06B8B ); \
A8 = m256_const1_64( 0xBE0CAE40BE0CAE40 ); \
A9 = m256_const1_64( 0x8BD144108BD14410 ); \
AA = m256_const1_64( 0x76D2ADAC76D2ADAC ); \
AB = m256_const1_64( 0x28ACAB7F28ACAB7F ); \
B0 = m256_const1_64( 0xC1099CB7C1099CB7 ); \
B1 = m256_const1_64( 0x07B385F307B385F3 ); \
B2 = m256_const1_64( 0xE7442C26E7442C26 ); \
B3 = m256_const1_64( 0xCC8AD640CC8AD640 ); \
B4 = m256_const1_64( 0xEB6F56C7EB6F56C7 ); \
B5 = m256_const1_64( 0x1EA81AA91EA81AA9 ); \
B6 = m256_const1_64( 0x73B9D31473B9D314 ); \
B7 = m256_const1_64( 0x1DE85D081DE85D08 ); \
B8 = m256_const1_64( 0x48910A5A48910A5A ); \
B9 = m256_const1_64( 0x893B22DB893B22DB ); \
BA = m256_const1_64( 0xC5A0DF44C5A0DF44 ); \
BB = m256_const1_64( 0xBBC4324EBBC4324E ); \
BC = m256_const1_64( 0x72D2F24072D2F240 ); \
BD = m256_const1_64( 0x75941D9975941D99 ); \
BE = m256_const1_64( 0x6D8BDE826D8BDE82 ); \
BF = m256_const1_64( 0xA1A7502BA1A7502B ); \
C0 = m256_const1_64( 0xD9BF68D1D9BF68D1 ); \
C1 = m256_const1_64( 0x58BAD75058BAD750 ); \
C2 = m256_const1_64( 0x56028CB256028CB2 ); \
C3 = m256_const1_64( 0x8134F3598134F359 ); \
C4 = m256_const1_64( 0xB5D469D8B5D469D8 ); \
C5 = m256_const1_64( 0x941A8CC2941A8CC2 ); \
C6 = m256_const1_64( 0x418B2A6E418B2A6E ); \
C7 = m256_const1_64( 0x0405278004052780 ); \
C8 = m256_const1_64( 0x7F07D7877F07D787 ); \
C9 = m256_const1_64( 0x5194358F5194358F ); \
CA = m256_const1_64( 0x3C60D6653C60D665 ); \
CB = m256_const1_64( 0xBE97D79ABE97D79A ); \
CC = m256_const1_64( 0x950C3434950C3434 ); \
CD = m256_const1_64( 0xAED9A06DAED9A06D ); \
CE = m256_const1_64( 0x2537DC8D2537DC8D ); \
CF = m256_const1_64( 0x7CDB59697CDB5969 ); \
A0 = _mm256_set1_epi64x( 0x20728DFD20728DFD ); \
A1 = _mm256_set1_epi64x( 0x46C0BD5346C0BD53 ); \
A2 = _mm256_set1_epi64x( 0xE782B699E782B699 ); \
A3 = _mm256_set1_epi64x( 0x5530463255304632 ); \
A4 = _mm256_set1_epi64x( 0x71B4EF9071B4EF90 ); \
A5 = _mm256_set1_epi64x( 0x0EA9E82C0EA9E82C ); \
A6 = _mm256_set1_epi64x( 0xDBB930F1DBB930F1 ); \
A7 = _mm256_set1_epi64x( 0xFAD06B8BFAD06B8B ); \
A8 = _mm256_set1_epi64x( 0xBE0CAE40BE0CAE40 ); \
A9 = _mm256_set1_epi64x( 0x8BD144108BD14410 ); \
AA = _mm256_set1_epi64x( 0x76D2ADAC76D2ADAC ); \
AB = _mm256_set1_epi64x( 0x28ACAB7F28ACAB7F ); \
B0 = _mm256_set1_epi64x( 0xC1099CB7C1099CB7 ); \
B1 = _mm256_set1_epi64x( 0x07B385F307B385F3 ); \
B2 = _mm256_set1_epi64x( 0xE7442C26E7442C26 ); \
B3 = _mm256_set1_epi64x( 0xCC8AD640CC8AD640 ); \
B4 = _mm256_set1_epi64x( 0xEB6F56C7EB6F56C7 ); \
B5 = _mm256_set1_epi64x( 0x1EA81AA91EA81AA9 ); \
B6 = _mm256_set1_epi64x( 0x73B9D31473B9D314 ); \
B7 = _mm256_set1_epi64x( 0x1DE85D081DE85D08 ); \
B8 = _mm256_set1_epi64x( 0x48910A5A48910A5A ); \
B9 = _mm256_set1_epi64x( 0x893B22DB893B22DB ); \
BA = _mm256_set1_epi64x( 0xC5A0DF44C5A0DF44 ); \
BB = _mm256_set1_epi64x( 0xBBC4324EBBC4324E ); \
BC = _mm256_set1_epi64x( 0x72D2F24072D2F240 ); \
BD = _mm256_set1_epi64x( 0x75941D9975941D99 ); \
BE = _mm256_set1_epi64x( 0x6D8BDE826D8BDE82 ); \
BF = _mm256_set1_epi64x( 0xA1A7502BA1A7502B ); \
C0 = _mm256_set1_epi64x( 0xD9BF68D1D9BF68D1 ); \
C1 = _mm256_set1_epi64x( 0x58BAD75058BAD750 ); \
C2 = _mm256_set1_epi64x( 0x56028CB256028CB2 ); \
C3 = _mm256_set1_epi64x( 0x8134F3598134F359 ); \
C4 = _mm256_set1_epi64x( 0xB5D469D8B5D469D8 ); \
C5 = _mm256_set1_epi64x( 0x941A8CC2941A8CC2 ); \
C6 = _mm256_set1_epi64x( 0x418B2A6E418B2A6E ); \
C7 = _mm256_set1_epi64x( 0x0405278004052780 ); \
C8 = _mm256_set1_epi64x( 0x7F07D7877F07D787 ); \
C9 = _mm256_set1_epi64x( 0x5194358F5194358F ); \
CA = _mm256_set1_epi64x( 0x3C60D6653C60D665 ); \
CB = _mm256_set1_epi64x( 0xBE97D79ABE97D79A ); \
CC = _mm256_set1_epi64x( 0x950C3434950C3434 ); \
CD = _mm256_set1_epi64x( 0xAED9A06DAED9A06D ); \
CE = _mm256_set1_epi64x( 0x2537DC8D2537DC8D ); \
CF = _mm256_set1_epi64x( 0x7CDB59697CDB5969 ); \
} \
Wlow = (state)->Wlow; \
Whigh = (state)->Whigh; \
@@ -290,6 +276,11 @@ do { \
A1 = _mm256_xor_si256( A1, _mm256_set1_epi32( Whigh ) ); \
} while (0)
#define mm256_swap512_256( v1, v2 ) \
v1 = _mm256_xor_si256( v1, v2 ); \
v2 = _mm256_xor_si256( v1, v2 ); \
v1 = _mm256_xor_si256( v1, v2 );
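
Review note: the swap macro above exchanges two vectors with three XORs instead of a temporary. A minimal scalar sketch of the same idea follows; the function name `xor_swap_u32` is hypothetical and not part of this diff. The one caveat of the technique is aliasing: if both operands are the same object, the first XOR zeroes it.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for the vector macro: swap two words with three XORs,
   using no temporary. The operands must be distinct objects; if both
   pointers alias the same word, the first XOR would zero it. */
static void xor_swap_u32( uint32_t *v1, uint32_t *v2 )
{
   assert( v1 != v2 );   /* aliasing would destroy the value */
   *v1 ^= *v2;           /* v1 = a ^ b            */
   *v2 ^= *v1;           /* v2 = b ^ (a ^ b) = a  */
   *v1 ^= *v2;           /* v1 = (a ^ b) ^ a = b  */
}
```

In the diff the macro is only applied to distinct B/C register pairs (B0 with C0, and so on), so the aliasing case does not arise there.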
#define SWAP_BC8 \
do { \
mm256_swap512_256( B0, C0 ); \
@@ -310,72 +301,71 @@ do { \
mm256_swap512_256( BF, CF ); \
} while (0)
#define PERM_ELT8(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) \
#define PERM_ELT8( xa0, xa1, xb0, xb1, xb2, xb3, xc, xm ) \
do { \
xa0 = mm256_xor3( xm, xb1, _mm256_xor_si256( \
_mm256_andnot_si256( xb3, xb2 ), \
_mm256_mullo_epi32( mm256_xor3( xa0, xc, \
_mm256_mullo_epi32( mm256_rol_32( xa1, 15 ), \
FIVE ) ), THREE ) ) ); \
xa0 = mm256_xor3( xm, xb1, mm256_xorandnot( \
_mm256_mullo_epi32( mm256_xor3( xa0, xc, \
_mm256_mullo_epi32( mm256_rol_32( xa1, 15 ), FIVE ) ), THREE ), \
xb3, xb2 ) ); \
xb0 = mm256_xnor( xa0, mm256_rol_32( xb0, 1 ) ); \
} while (0)
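
Review note: the old and new bodies of PERM_ELT8 appear to compute the same value; the new one folds the final XOR and the and-not into a single `mm256_xorandnot` call. The scalar sketch below models one 32-bit lane under the assumption that `mm256_xorandnot( a, b, c )` means `a ^ ( ~b & c )`, which matches the `_mm256_andnot_si256( xb3, xb2 )` it replaces. The helper names `rol32`, `xorandnot32` and `perm_elt_scalar` are hypothetical.

```c
#include <stdint.h>

/* Rotate left, valid for 0 < n < 32. */
static inline uint32_t rol32( uint32_t x, unsigned n )
{
   return ( x << n ) | ( x >> ( 32 - n ) );
}

/* a ^ ( ~b & c ): assumed scalar meaning of the xorandnot helper. */
static inline uint32_t xorandnot32( uint32_t a, uint32_t b, uint32_t c )
{
   return a ^ ( ~b & c );
}

/* One permutation element on a single 32-bit lane. FIVE and THREE in the
   vector macro are the broadcast constants 5 and 3. */
static void perm_elt_scalar( uint32_t *xa0, uint32_t xa1, uint32_t *xb0,
                             uint32_t xb1, uint32_t xb2, uint32_t xb3,
                             uint32_t xc, uint32_t xm )
{
   uint32_t t = ( *xa0 ^ xc ^ ( rol32( xa1, 15 ) * 5u ) ) * 3u;
   *xa0 = xm ^ xb1 ^ xorandnot32( t, xb3, xb2 );
   *xb0 = ~( *xa0 ^ rol32( *xb0, 1 ) );   /* xnor with the updated xa0 */
}
```

Note that `xb0` is derived from the already updated `xa0`, in the same order as the two statements of the macro.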
#define PERM_STEP_0_8 do { \
PERM_ELT8(A0, AB, B0, BD, B9, B6, C8, M0); \
PERM_ELT8(A1, A0, B1, BE, BA, B7, C7, M1); \
PERM_ELT8(A2, A1, B2, BF, BB, B8, C6, M2); \
PERM_ELT8(A3, A2, B3, B0, BC, B9, C5, M3); \
PERM_ELT8(A4, A3, B4, B1, BD, BA, C4, M4); \
PERM_ELT8(A5, A4, B5, B2, BE, BB, C3, M5); \
PERM_ELT8(A6, A5, B6, B3, BF, BC, C2, M6); \
PERM_ELT8(A7, A6, B7, B4, B0, BD, C1, M7); \
PERM_ELT8(A8, A7, B8, B5, B1, BE, C0, M8); \
PERM_ELT8(A9, A8, B9, B6, B2, BF, CF, M9); \
PERM_ELT8(AA, A9, BA, B7, B3, B0, CE, MA); \
PERM_ELT8(AB, AA, BB, B8, B4, B1, CD, MB); \
PERM_ELT8(A0, AB, BC, B9, B5, B2, CC, MC); \
PERM_ELT8(A1, A0, BD, BA, B6, B3, CB, MD); \
PERM_ELT8(A2, A1, BE, BB, B7, B4, CA, ME); \
PERM_ELT8(A3, A2, BF, BC, B8, B5, C9, MF); \
} while (0)
PERM_ELT8( A0, AB, B0, BD, B9, B6, C8, M0 ); \
PERM_ELT8( A1, A0, B1, BE, BA, B7, C7, M1 ); \
PERM_ELT8( A2, A1, B2, BF, BB, B8, C6, M2 ); \
PERM_ELT8( A3, A2, B3, B0, BC, B9, C5, M3 ); \
PERM_ELT8( A4, A3, B4, B1, BD, BA, C4, M4 ); \
PERM_ELT8( A5, A4, B5, B2, BE, BB, C3, M5 ); \
PERM_ELT8( A6, A5, B6, B3, BF, BC, C2, M6 ); \
PERM_ELT8( A7, A6, B7, B4, B0, BD, C1, M7 ); \
PERM_ELT8( A8, A7, B8, B5, B1, BE, C0, M8 ); \
PERM_ELT8( A9, A8, B9, B6, B2, BF, CF, M9 ); \
PERM_ELT8( AA, A9, BA, B7, B3, B0, CE, MA ); \
PERM_ELT8( AB, AA, BB, B8, B4, B1, CD, MB ); \
PERM_ELT8( A0, AB, BC, B9, B5, B2, CC, MC ); \
PERM_ELT8( A1, A0, BD, BA, B6, B3, CB, MD ); \
PERM_ELT8( A2, A1, BE, BB, B7, B4, CA, ME ); \
PERM_ELT8( A3, A2, BF, BC, B8, B5, C9, MF ); \
} while (0)
#define PERM_STEP_1_8 do { \
PERM_ELT8(A4, A3, B0, BD, B9, B6, C8, M0); \
PERM_ELT8(A5, A4, B1, BE, BA, B7, C7, M1); \
PERM_ELT8(A6, A5, B2, BF, BB, B8, C6, M2); \
PERM_ELT8(A7, A6, B3, B0, BC, B9, C5, M3); \
PERM_ELT8(A8, A7, B4, B1, BD, BA, C4, M4); \
PERM_ELT8(A9, A8, B5, B2, BE, BB, C3, M5); \
PERM_ELT8(AA, A9, B6, B3, BF, BC, C2, M6); \
PERM_ELT8(AB, AA, B7, B4, B0, BD, C1, M7); \
PERM_ELT8(A0, AB, B8, B5, B1, BE, C0, M8); \
PERM_ELT8(A1, A0, B9, B6, B2, BF, CF, M9); \
PERM_ELT8(A2, A1, BA, B7, B3, B0, CE, MA); \
PERM_ELT8(A3, A2, BB, B8, B4, B1, CD, MB); \
PERM_ELT8(A4, A3, BC, B9, B5, B2, CC, MC); \
PERM_ELT8(A5, A4, BD, BA, B6, B3, CB, MD); \
PERM_ELT8(A6, A5, BE, BB, B7, B4, CA, ME); \
PERM_ELT8(A7, A6, BF, BC, B8, B5, C9, MF); \
} while (0)
PERM_ELT8( A4, A3, B0, BD, B9, B6, C8, M0 ); \
PERM_ELT8( A5, A4, B1, BE, BA, B7, C7, M1 ); \
PERM_ELT8( A6, A5, B2, BF, BB, B8, C6, M2 ); \
PERM_ELT8( A7, A6, B3, B0, BC, B9, C5, M3 ); \
PERM_ELT8( A8, A7, B4, B1, BD, BA, C4, M4 ); \
PERM_ELT8( A9, A8, B5, B2, BE, BB, C3, M5 ); \
PERM_ELT8( AA, A9, B6, B3, BF, BC, C2, M6 ); \
PERM_ELT8( AB, AA, B7, B4, B0, BD, C1, M7 ); \
PERM_ELT8( A0, AB, B8, B5, B1, BE, C0, M8 ); \
PERM_ELT8( A1, A0, B9, B6, B2, BF, CF, M9 ); \
PERM_ELT8( A2, A1, BA, B7, B3, B0, CE, MA ); \
PERM_ELT8( A3, A2, BB, B8, B4, B1, CD, MB ); \
PERM_ELT8( A4, A3, BC, B9, B5, B2, CC, MC ); \
PERM_ELT8( A5, A4, BD, BA, B6, B3, CB, MD ); \
PERM_ELT8( A6, A5, BE, BB, B7, B4, CA, ME ); \
PERM_ELT8( A7, A6, BF, BC, B8, B5, C9, MF ); \
} while (0)
#define PERM_STEP_2_8 do { \
PERM_ELT8(A8, A7, B0, BD, B9, B6, C8, M0); \
PERM_ELT8(A9, A8, B1, BE, BA, B7, C7, M1); \
PERM_ELT8(AA, A9, B2, BF, BB, B8, C6, M2); \
PERM_ELT8(AB, AA, B3, B0, BC, B9, C5, M3); \
PERM_ELT8(A0, AB, B4, B1, BD, BA, C4, M4); \
PERM_ELT8(A1, A0, B5, B2, BE, BB, C3, M5); \
PERM_ELT8(A2, A1, B6, B3, BF, BC, C2, M6); \
PERM_ELT8(A3, A2, B7, B4, B0, BD, C1, M7); \
PERM_ELT8(A4, A3, B8, B5, B1, BE, C0, M8); \
PERM_ELT8(A5, A4, B9, B6, B2, BF, CF, M9); \
PERM_ELT8(A6, A5, BA, B7, B3, B0, CE, MA); \
PERM_ELT8(A7, A6, BB, B8, B4, B1, CD, MB); \
PERM_ELT8(A8, A7, BC, B9, B5, B2, CC, MC); \
PERM_ELT8(A9, A8, BD, BA, B6, B3, CB, MD); \
PERM_ELT8(AA, A9, BE, BB, B7, B4, CA, ME); \
PERM_ELT8(AB, AA, BF, BC, B8, B5, C9, MF); \
} while (0)
PERM_ELT8( A8, A7, B0, BD, B9, B6, C8, M0 ); \
PERM_ELT8( A9, A8, B1, BE, BA, B7, C7, M1 ); \
PERM_ELT8( AA, A9, B2, BF, BB, B8, C6, M2 ); \
PERM_ELT8( AB, AA, B3, B0, BC, B9, C5, M3 ); \
PERM_ELT8( A0, AB, B4, B1, BD, BA, C4, M4 ); \
PERM_ELT8( A1, A0, B5, B2, BE, BB, C3, M5 ); \
PERM_ELT8( A2, A1, B6, B3, BF, BC, C2, M6 ); \
PERM_ELT8( A3, A2, B7, B4, B0, BD, C1, M7 ); \
PERM_ELT8( A4, A3, B8, B5, B1, BE, C0, M8 ); \
PERM_ELT8( A5, A4, B9, B6, B2, BF, CF, M9 ); \
PERM_ELT8( A6, A5, BA, B7, B3, B0, CE, MA ); \
PERM_ELT8( A7, A6, BB, B8, B4, B1, CD, MB ); \
PERM_ELT8( A8, A7, BC, B9, B5, B2, CC, MC ); \
PERM_ELT8( A9, A8, BD, BA, B6, B3, CB, MD ); \
PERM_ELT8( AA, A9, BE, BB, B7, B4, CA, ME ); \
PERM_ELT8( AB, AA, BF, BC, B8, B5, C9, MF ); \
} while (0)
#define APPLY_P8 \
do { \
@@ -437,8 +427,8 @@ do { \
} while (0)
#define INCR_W8 do { \
if ((Wlow = T32(Wlow + 1)) == 0) \
Whigh = T32(Whigh + 1); \
if ( ( Wlow = Wlow + 1 ) == 0 ) \
Whigh = Whigh + 1; \
} while (0)
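
Review note: this hunk drops the `T32()` masking from the block counter increment. That is safe provided `Wlow` and `Whigh` are exactly 32-bit unsigned values, since unsigned arithmetic already wraps modulo 2^32. Their declarations are not visible in this diff, so treat that as an assumption. A minimal sketch with a hypothetical `block_counter` type:

```c
#include <stdint.h>

/* Hypothetical container for the 64-bit block counter kept as two words. */
typedef struct { uint32_t Wlow, Whigh; } block_counter;

static void incr_w( block_counter *w )
{
   /* uint32_t wraps modulo 2^32, so no explicit mask is needed;
      a wrap of the low word to zero carries into the high word. */
   if ( ( w->Wlow = w->Wlow + 1 ) == 0 )
      w->Whigh = w->Whigh + 1;
}
```

If the counters were ever widened to a 64-bit type, the mask would be needed again for the carry test to fire.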
static void
@@ -453,52 +443,52 @@ shabal_8way_init( void *cc, unsigned size )
else
{ // No users
sc->state_loaded = true;
sc->A[ 0] = m256_const1_64( 0x52F8455252F84552 );
sc->A[ 1] = m256_const1_64( 0xE54B7999E54B7999 );
sc->A[ 2] = m256_const1_64( 0x2D8EE3EC2D8EE3EC );
sc->A[ 3] = m256_const1_64( 0xB9645191B9645191 );
sc->A[ 4] = m256_const1_64( 0xE0078B86E0078B86 );
sc->A[ 5] = m256_const1_64( 0xBB7C44C9BB7C44C9 );
sc->A[ 6] = m256_const1_64( 0xD2B5C1CAD2B5C1CA );
sc->A[ 7] = m256_const1_64( 0xB0D2EB8CB0D2EB8C );
sc->A[ 8] = m256_const1_64( 0x14CE5A4514CE5A45 );
sc->A[ 9] = m256_const1_64( 0x22AF50DC22AF50DC );
sc->A[10] = m256_const1_64( 0xEFFDBC6BEFFDBC6B );
sc->A[11] = m256_const1_64( 0xEB21B74AEB21B74A );
sc->A[ 0] = _mm256_set1_epi64x( 0x52F8455252F84552 );
sc->A[ 1] = _mm256_set1_epi64x( 0xE54B7999E54B7999 );
sc->A[ 2] = _mm256_set1_epi64x( 0x2D8EE3EC2D8EE3EC );
sc->A[ 3] = _mm256_set1_epi64x( 0xB9645191B9645191 );
sc->A[ 4] = _mm256_set1_epi64x( 0xE0078B86E0078B86 );
sc->A[ 5] = _mm256_set1_epi64x( 0xBB7C44C9BB7C44C9 );
sc->A[ 6] = _mm256_set1_epi64x( 0xD2B5C1CAD2B5C1CA );
sc->A[ 7] = _mm256_set1_epi64x( 0xB0D2EB8CB0D2EB8C );
sc->A[ 8] = _mm256_set1_epi64x( 0x14CE5A4514CE5A45 );
sc->A[ 9] = _mm256_set1_epi64x( 0x22AF50DC22AF50DC );
sc->A[10] = _mm256_set1_epi64x( 0xEFFDBC6BEFFDBC6B );
sc->A[11] = _mm256_set1_epi64x( 0xEB21B74AEB21B74A );
sc->B[ 0] = m256_const1_64( 0xB555C6EEB555C6EE );
sc->B[ 1] = m256_const1_64( 0x3E7105963E710596 );
sc->B[ 2] = m256_const1_64( 0xA72A652FA72A652F );
sc->B[ 3] = m256_const1_64( 0x9301515F9301515F );
sc->B[ 4] = m256_const1_64( 0xDA28C1FADA28C1FA );
sc->B[ 5] = m256_const1_64( 0x696FD868696FD868 );
sc->B[ 6] = m256_const1_64( 0x9CB6BF729CB6BF72 );
sc->B[ 7] = m256_const1_64( 0x0AFE40020AFE4002 );
sc->B[ 8] = m256_const1_64( 0xA6E03615A6E03615 );
sc->B[ 9] = m256_const1_64( 0x5138C1D45138C1D4 );
sc->B[10] = m256_const1_64( 0xBE216306BE216306 );
sc->B[11] = m256_const1_64( 0xB38B8890B38B8890 );
sc->B[12] = m256_const1_64( 0x3EA8B96B3EA8B96B );
sc->B[13] = m256_const1_64( 0x3299ACE43299ACE4 );
sc->B[14] = m256_const1_64( 0x30924DD430924DD4 );
sc->B[15] = m256_const1_64( 0x55CB34A555CB34A5 );
sc->B[ 0] = _mm256_set1_epi64x( 0xB555C6EEB555C6EE );
sc->B[ 1] = _mm256_set1_epi64x( 0x3E7105963E710596 );
sc->B[ 2] = _mm256_set1_epi64x( 0xA72A652FA72A652F );
sc->B[ 3] = _mm256_set1_epi64x( 0x9301515F9301515F );
sc->B[ 4] = _mm256_set1_epi64x( 0xDA28C1FADA28C1FA );
sc->B[ 5] = _mm256_set1_epi64x( 0x696FD868696FD868 );
sc->B[ 6] = _mm256_set1_epi64x( 0x9CB6BF729CB6BF72 );
sc->B[ 7] = _mm256_set1_epi64x( 0x0AFE40020AFE4002 );
sc->B[ 8] = _mm256_set1_epi64x( 0xA6E03615A6E03615 );
sc->B[ 9] = _mm256_set1_epi64x( 0x5138C1D45138C1D4 );
sc->B[10] = _mm256_set1_epi64x( 0xBE216306BE216306 );
sc->B[11] = _mm256_set1_epi64x( 0xB38B8890B38B8890 );
sc->B[12] = _mm256_set1_epi64x( 0x3EA8B96B3EA8B96B );
sc->B[13] = _mm256_set1_epi64x( 0x3299ACE43299ACE4 );
sc->B[14] = _mm256_set1_epi64x( 0x30924DD430924DD4 );
sc->B[15] = _mm256_set1_epi64x( 0x55CB34A555CB34A5 );
sc->C[ 0] = m256_const1_64( 0xB405F031B405F031 );
sc->C[ 1] = m256_const1_64( 0xC4233EBAC4233EBA );
sc->C[ 2] = m256_const1_64( 0xB3733979B3733979 );
sc->C[ 3] = m256_const1_64( 0xC0DD9D55C0DD9D55 );
sc->C[ 4] = m256_const1_64( 0xC51C28AEC51C28AE );
sc->C[ 5] = m256_const1_64( 0xA327B8E1A327B8E1 );
sc->C[ 6] = m256_const1_64( 0x56C5616756C56167 );
sc->C[ 7] = m256_const1_64( 0xED614433ED614433 );
sc->C[ 8] = m256_const1_64( 0x88B59D6088B59D60 );
sc->C[ 9] = m256_const1_64( 0x60E2CEBA60E2CEBA );
sc->C[10] = m256_const1_64( 0x758B4B8B758B4B8B );
sc->C[11] = m256_const1_64( 0x83E82A7F83E82A7F );
sc->C[12] = m256_const1_64( 0xBC968828BC968828 );
sc->C[13] = m256_const1_64( 0xE6E00BF7E6E00BF7 );
sc->C[14] = m256_const1_64( 0xBA839E55BA839E55 );
sc->C[15] = m256_const1_64( 0x9B491C609B491C60 );
sc->C[ 0] = _mm256_set1_epi64x( 0xB405F031B405F031 );
sc->C[ 1] = _mm256_set1_epi64x( 0xC4233EBAC4233EBA );
sc->C[ 2] = _mm256_set1_epi64x( 0xB3733979B3733979 );
sc->C[ 3] = _mm256_set1_epi64x( 0xC0DD9D55C0DD9D55 );
sc->C[ 4] = _mm256_set1_epi64x( 0xC51C28AEC51C28AE );
sc->C[ 5] = _mm256_set1_epi64x( 0xA327B8E1A327B8E1 );
sc->C[ 6] = _mm256_set1_epi64x( 0x56C5616756C56167 );
sc->C[ 7] = _mm256_set1_epi64x( 0xED614433ED614433 );
sc->C[ 8] = _mm256_set1_epi64x( 0x88B59D6088B59D60 );
sc->C[ 9] = _mm256_set1_epi64x( 0x60E2CEBA60E2CEBA );
sc->C[10] = _mm256_set1_epi64x( 0x758B4B8B758B4B8B );
sc->C[11] = _mm256_set1_epi64x( 0x83E82A7F83E82A7F );
sc->C[12] = _mm256_set1_epi64x( 0xBC968828BC968828 );
sc->C[13] = _mm256_set1_epi64x( 0xE6E00BF7E6E00BF7 );
sc->C[14] = _mm256_set1_epi64x( 0xBA839E55BA839E55 );
sc->C[15] = _mm256_set1_epi64x( 0x9B491C609B491C60 );
}
sc->Wlow = 1;
sc->Whigh = 0;
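
Review note: every IV constant in this hunk is a 32-bit Shabal word written twice, for example 0x52F84552 as 0x52F8455252F84552, so that a 64-bit broadcast fills all 32-bit lanes with the same word. The scalar sketch below illustrates the idea with plain arrays in place of vector registers; `dup32` and `bcast64_into_u32_lanes` are hypothetical names, not functions from this repository.

```c
#include <stdint.h>
#include <string.h>

/* Duplicate a 32-bit word into both halves of a 64-bit word. */
static uint64_t dup32( uint32_t c )
{
   return ( (uint64_t)c << 32 ) | c;
}

/* Fill n 32-bit lanes (n even) by storing 64-bit words. Because both
   halves of v are identical, the result does not depend on byte order. */
static void bcast64_into_u32_lanes( uint32_t *lanes, size_t n, uint64_t v )
{
   for ( size_t i = 0; i < n; i += 2 )
      memcpy( &lanes[i], &v, sizeof v );
}
```

With eight lanes this models one `__m256i` of the 8-way state; four lanes would model the `__m128i` 4-way case.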
@@ -650,15 +640,8 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
shabal_8way_close(cc, ub, n, dst, 16);
}
#endif // AVX2
/*
* We copy the state into local variables, so that the compiler knows
* that it can optimize them at will.
*/
#define DECL_STATE \
__m128i A0, A1, A2, A3, A4, A5, A6, A7, \
A8, A9, AA, AB; \
@@ -724,50 +707,50 @@ shabal512_8way_addbits_and_close(void *cc, unsigned ub, unsigned n, void *dst)
else \
{ \
(state)->state_loaded = true; \
A0 = m128_const1_64( 0x20728DFD20728DFD ); \
A1 = m128_const1_64( 0x46C0BD5346C0BD53 ); \
A2 = m128_const1_64( 0xE782B699E782B699 ); \
A3 = m128_const1_64( 0x5530463255304632 ); \
A4 = m128_const1_64( 0x71B4EF9071B4EF90 ); \
A5 = m128_const1_64( 0x0EA9E82C0EA9E82C ); \
A6 = m128_const1_64( 0xDBB930F1DBB930F1 ); \
A7 = m128_const1_64( 0xFAD06B8BFAD06B8B ); \
A8 = m128_const1_64( 0xBE0CAE40BE0CAE40 ); \
A9 = m128_const1_64( 0x8BD144108BD14410 ); \
AA = m128_const1_64( 0x76D2ADAC76D2ADAC ); \
AB = m128_const1_64( 0x28ACAB7F28ACAB7F ); \
B0 = m128_const1_64( 0xC1099CB7C1099CB7 ); \
B1 = m128_const1_64( 0x07B385F307B385F3 ); \
B2 = m128_const1_64( 0xE7442C26E7442C26 ); \
B3 = m128_const1_64( 0xCC8AD640CC8AD640 ); \
B4 = m128_const1_64( 0xEB6F56C7EB6F56C7 ); \
B5 = m128_const1_64( 0x1EA81AA91EA81AA9 ); \
B6 = m128_const1_64( 0x73B9D31473B9D314 ); \
B7 = m128_const1_64( 0x1DE85D081DE85D08 ); \
B8 = m128_const1_64( 0x48910A5A48910A5A ); \
B9 = m128_const1_64( 0x893B22DB893B22DB ); \
BA = m128_const1_64( 0xC5A0DF44C5A0DF44 ); \
BB = m128_const1_64( 0xBBC4324EBBC4324E ); \
BC = m128_const1_64( 0x72D2F24072D2F240 ); \
BD = m128_const1_64( 0x75941D9975941D99 ); \
BE = m128_const1_64( 0x6D8BDE826D8BDE82 ); \
BF = m128_const1_64( 0xA1A7502BA1A7502B ); \
C0 = m128_const1_64( 0xD9BF68D1D9BF68D1 ); \
C1 = m128_const1_64( 0x58BAD75058BAD750 ); \
C2 = m128_const1_64( 0x56028CB256028CB2 ); \
C3 = m128_const1_64( 0x8134F3598134F359 ); \
C4 = m128_const1_64( 0xB5D469D8B5D469D8 ); \
C5 = m128_const1_64( 0x941A8CC2941A8CC2 ); \
C6 = m128_const1_64( 0x418B2A6E418B2A6E ); \
C7 = m128_const1_64( 0x0405278004052780 ); \
C8 = m128_const1_64( 0x7F07D7877F07D787 ); \
C9 = m128_const1_64( 0x5194358F5194358F ); \
CA = m128_const1_64( 0x3C60D6653C60D665 ); \
CB = m128_const1_64( 0xBE97D79ABE97D79A ); \
CC = m128_const1_64( 0x950C3434950C3434 ); \
CD = m128_const1_64( 0xAED9A06DAED9A06D ); \
CE = m128_const1_64( 0x2537DC8D2537DC8D ); \
CF = m128_const1_64( 0x7CDB59697CDB5969 ); \
A0 = _mm_set1_epi64x( 0x20728DFD20728DFD ); \
A1 = _mm_set1_epi64x( 0x46C0BD5346C0BD53 ); \
A2 = _mm_set1_epi64x( 0xE782B699E782B699 ); \
A3 = _mm_set1_epi64x( 0x5530463255304632 ); \
A4 = _mm_set1_epi64x( 0x71B4EF9071B4EF90 ); \
A5 = _mm_set1_epi64x( 0x0EA9E82C0EA9E82C ); \
A6 = _mm_set1_epi64x( 0xDBB930F1DBB930F1 ); \
A7 = _mm_set1_epi64x( 0xFAD06B8BFAD06B8B ); \
A8 = _mm_set1_epi64x( 0xBE0CAE40BE0CAE40 ); \
A9 = _mm_set1_epi64x( 0x8BD144108BD14410 ); \
AA = _mm_set1_epi64x( 0x76D2ADAC76D2ADAC ); \
AB = _mm_set1_epi64x( 0x28ACAB7F28ACAB7F ); \
B0 = _mm_set1_epi64x( 0xC1099CB7C1099CB7 ); \
B1 = _mm_set1_epi64x( 0x07B385F307B385F3 ); \
B2 = _mm_set1_epi64x( 0xE7442C26E7442C26 ); \
B3 = _mm_set1_epi64x( 0xCC8AD640CC8AD640 ); \
B4 = _mm_set1_epi64x( 0xEB6F56C7EB6F56C7 ); \
B5 = _mm_set1_epi64x( 0x1EA81AA91EA81AA9 ); \
B6 = _mm_set1_epi64x( 0x73B9D31473B9D314 ); \
B7 = _mm_set1_epi64x( 0x1DE85D081DE85D08 ); \
B8 = _mm_set1_epi64x( 0x48910A5A48910A5A ); \
B9 = _mm_set1_epi64x( 0x893B22DB893B22DB ); \
BA = _mm_set1_epi64x( 0xC5A0DF44C5A0DF44 ); \
BB = _mm_set1_epi64x( 0xBBC4324EBBC4324E ); \
BC = _mm_set1_epi64x( 0x72D2F24072D2F240 ); \
BD = _mm_set1_epi64x( 0x75941D9975941D99 ); \
BE = _mm_set1_epi64x( 0x6D8BDE826D8BDE82 ); \
BF = _mm_set1_epi64x( 0xA1A7502BA1A7502B ); \
C0 = _mm_set1_epi64x( 0xD9BF68D1D9BF68D1 ); \
C1 = _mm_set1_epi64x( 0x58BAD75058BAD750 ); \
C2 = _mm_set1_epi64x( 0x56028CB256028CB2 ); \
C3 = _mm_set1_epi64x( 0x8134F3598134F359 ); \
C4 = _mm_set1_epi64x( 0xB5D469D8B5D469D8 ); \
C5 = _mm_set1_epi64x( 0x941A8CC2941A8CC2 ); \
C6 = _mm_set1_epi64x( 0x418B2A6E418B2A6E ); \
C7 = _mm_set1_epi64x( 0x0405278004052780 ); \
C8 = _mm_set1_epi64x( 0x7F07D7877F07D787 ); \
C9 = _mm_set1_epi64x( 0x5194358F5194358F ); \
CA = _mm_set1_epi64x( 0x3C60D6653C60D665 ); \
CB = _mm_set1_epi64x( 0xBE97D79ABE97D79A ); \
CC = _mm_set1_epi64x( 0x950C3434950C3434 ); \
CD = _mm_set1_epi64x( 0xAED9A06DAED9A06D ); \
CE = _mm_set1_epi64x( 0x2537DC8D2537DC8D ); \
CF = _mm_set1_epi64x( 0x7CDB59697CDB5969 ); \
} \
Wlow = (state)->Wlow; \
Whigh = (state)->Whigh; \
@@ -888,14 +871,10 @@ do { \
A1 = _mm_xor_si128( A1, _mm_set1_epi32( Whigh ) ); \
} while (0)
/*
#define SWAP(v1, v2) do { \
sph_u32 tmp = (v1); \
(v1) = (v2); \
(v2) = tmp; \
} while (0)
*/
#define mm128_swap256_128( v1, v2 ) \
v1 = _mm_xor_si128( v1, v2 ); \
v2 = _mm_xor_si128( v1, v2 ); \
v1 = _mm_xor_si128( v1, v2 );
#define SWAP_BC \
do { \
@@ -917,18 +896,16 @@ do { \
mm128_swap256_128( BF, CF ); \
} while (0)
/*
#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) \
#define PERM_ELT( xa0, xa1, xb0, xb1, xb2, xb3, xc, xm ) \
do { \
__m128i t1 = _mm_mullo_epi32( mm_rol_32( xa1, 15 ),\
_mm_set1_epi32(5UL) ) \
__m128i t2 = _mm_xor_si128( xa0, xc ); \
xb0 = mm_not( _mm_xor_si256( xa0, mm_rol_32( xb0, 1 ) ) ); \
xa0 = mm_xor4( xm, xb1, _mm_andnot_si128( xb3, xb2 ), \
_mm_xor_si128( t2, \
_mm_mullo_epi32( t1, _mm_set1_epi32(5UL) ) ) ) \
*/
xa0 = mm128_xor3( xm, xb1, mm128_xorandnot( \
_mm_mullo_epi32( mm128_xor3( xa0, xc, \
_mm_mullo_epi32( mm128_rol_32( xa1, 15 ), FIVE ) ), THREE ), \
xb3, xb2 ) ); \
xb0 = mm128_xnor( xa0, mm128_rol_32( xb0, 1 ) ); \
} while (0)
/*
#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) \
do { \
xa0 = _mm_xor_si128( xm, _mm_xor_si128( xb1, _mm_xor_si128( \
@@ -938,6 +915,7 @@ do { \
) ), THREE ) ) ) ); \
xb0 = mm128_not( _mm_xor_si128( xa0, mm128_rol_32( xb0, 1 ) ) ); \
} while (0)
*/
#define PERM_STEP_0 do { \
PERM_ELT(A0, AB, B0, BD, B9, B6, C8, M0); \
@@ -1056,8 +1034,8 @@ do { \
} while (0)
#define INCR_W do { \
if ((Wlow = T32(Wlow + 1)) == 0) \
Whigh = T32(Whigh + 1); \
if ( ( Wlow = Wlow + 1 ) == 0 ) \
Whigh = Whigh + 1; \
} while (0)
/*
@@ -1111,103 +1089,103 @@ shabal_4way_init( void *cc, unsigned size )
{ // copy immediate constants directly to working registers later.
sc->state_loaded = false;
/*
sc->A[ 0] = m128_const1_64( 0x20728DFD20728DFD );
sc->A[ 1] = m128_const1_64( 0x46C0BD5346C0BD53 );
sc->A[ 2] = m128_const1_64( 0xE782B699E782B699 );
sc->A[ 3] = m128_const1_64( 0x5530463255304632 );
sc->A[ 4] = m128_const1_64( 0x71B4EF9071B4EF90 );
sc->A[ 5] = m128_const1_64( 0x0EA9E82C0EA9E82C );
sc->A[ 6] = m128_const1_64( 0xDBB930F1DBB930F1 );
sc->A[ 7] = m128_const1_64( 0xFAD06B8BFAD06B8B );
sc->A[ 8] = m128_const1_64( 0xBE0CAE40BE0CAE40 );
sc->A[ 9] = m128_const1_64( 0x8BD144108BD14410 );
sc->A[10] = m128_const1_64( 0x76D2ADAC76D2ADAC );
sc->A[11] = m128_const1_64( 0x28ACAB7F28ACAB7F );
sc->A[ 0] = _mm_set1_epi64x( 0x20728DFD20728DFD );
sc->A[ 1] = _mm_set1_epi64x( 0x46C0BD5346C0BD53 );
sc->A[ 2] = _mm_set1_epi64x( 0xE782B699E782B699 );
sc->A[ 3] = _mm_set1_epi64x( 0x5530463255304632 );
sc->A[ 4] = _mm_set1_epi64x( 0x71B4EF9071B4EF90 );
sc->A[ 5] = _mm_set1_epi64x( 0x0EA9E82C0EA9E82C );
sc->A[ 6] = _mm_set1_epi64x( 0xDBB930F1DBB930F1 );
sc->A[ 7] = _mm_set1_epi64x( 0xFAD06B8BFAD06B8B );
sc->A[ 8] = _mm_set1_epi64x( 0xBE0CAE40BE0CAE40 );
sc->A[ 9] = _mm_set1_epi64x( 0x8BD144108BD14410 );
sc->A[10] = _mm_set1_epi64x( 0x76D2ADAC76D2ADAC );
sc->A[11] = _mm_set1_epi64x( 0x28ACAB7F28ACAB7F );
sc->B[ 0] = m128_const1_64( 0xC1099CB7C1099CB7 );
sc->B[ 1] = m128_const1_64( 0x07B385F307B385F3 );
sc->B[ 2] = m128_const1_64( 0xE7442C26E7442C26 );
sc->B[ 3] = m128_const1_64( 0xCC8AD640CC8AD640 );
sc->B[ 4] = m128_const1_64( 0xEB6F56C7EB6F56C7 );
sc->B[ 5] = m128_const1_64( 0x1EA81AA91EA81AA9 );
sc->B[ 6] = m128_const1_64( 0x73B9D31473B9D314 );
sc->B[ 7] = m128_const1_64( 0x1DE85D081DE85D08 );
sc->B[ 8] = m128_const1_64( 0x48910A5A48910A5A );
sc->B[ 9] = m128_const1_64( 0x893B22DB893B22DB );
sc->B[10] = m128_const1_64( 0xC5A0DF44C5A0DF44 );
sc->B[11] = m128_const1_64( 0xBBC4324EBBC4324E );
sc->B[12] = m128_const1_64( 0x72D2F24072D2F240 );
sc->B[13] = m128_const1_64( 0x75941D9975941D99 );
sc->B[14] = m128_const1_64( 0x6D8BDE826D8BDE82 );
sc->B[15] = m128_const1_64( 0xA1A7502BA1A7502B );
sc->B[ 0] = _mm_set1_epi64x( 0xC1099CB7C1099CB7 );
sc->B[ 1] = _mm_set1_epi64x( 0x07B385F307B385F3 );
sc->B[ 2] = _mm_set1_epi64x( 0xE7442C26E7442C26 );
sc->B[ 3] = _mm_set1_epi64x( 0xCC8AD640CC8AD640 );
sc->B[ 4] = _mm_set1_epi64x( 0xEB6F56C7EB6F56C7 );
sc->B[ 5] = _mm_set1_epi64x( 0x1EA81AA91EA81AA9 );
sc->B[ 6] = _mm_set1_epi64x( 0x73B9D31473B9D314 );
sc->B[ 7] = _mm_set1_epi64x( 0x1DE85D081DE85D08 );
sc->B[ 8] = _mm_set1_epi64x( 0x48910A5A48910A5A );
sc->B[ 9] = _mm_set1_epi64x( 0x893B22DB893B22DB );
sc->B[10] = _mm_set1_epi64x( 0xC5A0DF44C5A0DF44 );
sc->B[11] = _mm_set1_epi64x( 0xBBC4324EBBC4324E );
sc->B[12] = _mm_set1_epi64x( 0x72D2F24072D2F240 );
sc->B[13] = _mm_set1_epi64x( 0x75941D9975941D99 );
sc->B[14] = _mm_set1_epi64x( 0x6D8BDE826D8BDE82 );
sc->B[15] = _mm_set1_epi64x( 0xA1A7502BA1A7502B );
sc->C[ 0] = m128_const1_64( 0xD9BF68D1D9BF68D1 );
sc->C[ 1] = m128_const1_64( 0x58BAD75058BAD750 );
sc->C[ 2] = m128_const1_64( 0x56028CB256028CB2 );
sc->C[ 3] = m128_const1_64( 0x8134F3598134F359 );
sc->C[ 4] = m128_const1_64( 0xB5D469D8B5D469D8 );
sc->C[ 5] = m128_const1_64( 0x941A8CC2941A8CC2 );
sc->C[ 6] = m128_const1_64( 0x418B2A6E418B2A6E );
sc->C[ 7] = m128_const1_64( 0x0405278004052780 );
sc->C[ 8] = m128_const1_64( 0x7F07D7877F07D787 );
sc->C[ 9] = m128_const1_64( 0x5194358F5194358F );
sc->C[10] = m128_const1_64( 0x3C60D6653C60D665 );
sc->C[11] = m128_const1_64( 0xBE97D79ABE97D79A );
sc->C[12] = m128_const1_64( 0x950C3434950C3434 );
sc->C[13] = m128_const1_64( 0xAED9A06DAED9A06D );
sc->C[14] = m128_const1_64( 0x2537DC8D2537DC8D );
sc->C[15] = m128_const1_64( 0x7CDB59697CDB5969 );
sc->C[ 0] = _mm_set1_epi64x( 0xD9BF68D1D9BF68D1 );
sc->C[ 1] = _mm_set1_epi64x( 0x58BAD75058BAD750 );
sc->C[ 2] = _mm_set1_epi64x( 0x56028CB256028CB2 );
sc->C[ 3] = _mm_set1_epi64x( 0x8134F3598134F359 );
sc->C[ 4] = _mm_set1_epi64x( 0xB5D469D8B5D469D8 );
sc->C[ 5] = _mm_set1_epi64x( 0x941A8CC2941A8CC2 );
sc->C[ 6] = _mm_set1_epi64x( 0x418B2A6E418B2A6E );
sc->C[ 7] = _mm_set1_epi64x( 0x0405278004052780 );
sc->C[ 8] = _mm_set1_epi64x( 0x7F07D7877F07D787 );
sc->C[ 9] = _mm_set1_epi64x( 0x5194358F5194358F );
sc->C[10] = _mm_set1_epi64x( 0x3C60D6653C60D665 );
sc->C[11] = _mm_set1_epi64x( 0xBE97D79ABE97D79A );
sc->C[12] = _mm_set1_epi64x( 0x950C3434950C3434 );
sc->C[13] = _mm_set1_epi64x( 0xAED9A06DAED9A06D );
sc->C[14] = _mm_set1_epi64x( 0x2537DC8D2537DC8D );
sc->C[15] = _mm_set1_epi64x( 0x7CDB59697CDB5969 );
*/
}
else
{ // No users
sc->state_loaded = true;
sc->A[ 0] = m128_const1_64( 0x52F8455252F84552 );
sc->A[ 1] = m128_const1_64( 0xE54B7999E54B7999 );
sc->A[ 2] = m128_const1_64( 0x2D8EE3EC2D8EE3EC );
sc->A[ 3] = m128_const1_64( 0xB9645191B9645191 );
sc->A[ 4] = m128_const1_64( 0xE0078B86E0078B86 );
sc->A[ 5] = m128_const1_64( 0xBB7C44C9BB7C44C9 );
sc->A[ 6] = m128_const1_64( 0xD2B5C1CAD2B5C1CA );
sc->A[ 7] = m128_const1_64( 0xB0D2EB8CB0D2EB8C );
sc->A[ 8] = m128_const1_64( 0x14CE5A4514CE5A45 );
sc->A[ 9] = m128_const1_64( 0x22AF50DC22AF50DC );
sc->A[10] = m128_const1_64( 0xEFFDBC6BEFFDBC6B );
sc->A[11] = m128_const1_64( 0xEB21B74AEB21B74A );
sc->A[ 0] = _mm_set1_epi64x( 0x52F8455252F84552 );
sc->A[ 1] = _mm_set1_epi64x( 0xE54B7999E54B7999 );
sc->A[ 2] = _mm_set1_epi64x( 0x2D8EE3EC2D8EE3EC );
sc->A[ 3] = _mm_set1_epi64x( 0xB9645191B9645191 );
sc->A[ 4] = _mm_set1_epi64x( 0xE0078B86E0078B86 );
sc->A[ 5] = _mm_set1_epi64x( 0xBB7C44C9BB7C44C9 );
sc->A[ 6] = _mm_set1_epi64x( 0xD2B5C1CAD2B5C1CA );
sc->A[ 7] = _mm_set1_epi64x( 0xB0D2EB8CB0D2EB8C );
sc->A[ 8] = _mm_set1_epi64x( 0x14CE5A4514CE5A45 );
sc->A[ 9] = _mm_set1_epi64x( 0x22AF50DC22AF50DC );
sc->A[10] = _mm_set1_epi64x( 0xEFFDBC6BEFFDBC6B );
sc->A[11] = _mm_set1_epi64x( 0xEB21B74AEB21B74A );
sc->B[ 0] = m128_const1_64( 0xB555C6EEB555C6EE );
sc->B[ 1] = m128_const1_64( 0x3E7105963E710596 );
sc->B[ 2] = m128_const1_64( 0xA72A652FA72A652F );
sc->B[ 3] = m128_const1_64( 0x9301515F9301515F );
sc->B[ 4] = m128_const1_64( 0xDA28C1FADA28C1FA );
sc->B[ 5] = m128_const1_64( 0x696FD868696FD868 );
sc->B[ 6] = m128_const1_64( 0x9CB6BF729CB6BF72 );
sc->B[ 7] = m128_const1_64( 0x0AFE40020AFE4002 );
sc->B[ 8] = m128_const1_64( 0xA6E03615A6E03615 );
sc->B[ 9] = m128_const1_64( 0x5138C1D45138C1D4 );
sc->B[10] = m128_const1_64( 0xBE216306BE216306 );
sc->B[11] = m128_const1_64( 0xB38B8890B38B8890 );
sc->B[12] = m128_const1_64( 0x3EA8B96B3EA8B96B );
sc->B[13] = m128_const1_64( 0x3299ACE43299ACE4 );
sc->B[14] = m128_const1_64( 0x30924DD430924DD4 );
sc->B[15] = m128_const1_64( 0x55CB34A555CB34A5 );
sc->B[ 0] = _mm_set1_epi64x( 0xB555C6EEB555C6EE );
sc->B[ 1] = _mm_set1_epi64x( 0x3E7105963E710596 );
sc->B[ 2] = _mm_set1_epi64x( 0xA72A652FA72A652F );
sc->B[ 3] = _mm_set1_epi64x( 0x9301515F9301515F );
sc->B[ 4] = _mm_set1_epi64x( 0xDA28C1FADA28C1FA );
sc->B[ 5] = _mm_set1_epi64x( 0x696FD868696FD868 );
sc->B[ 6] = _mm_set1_epi64x( 0x9CB6BF729CB6BF72 );
sc->B[ 7] = _mm_set1_epi64x( 0x0AFE40020AFE4002 );
sc->B[ 8] = _mm_set1_epi64x( 0xA6E03615A6E03615 );
sc->B[ 9] = _mm_set1_epi64x( 0x5138C1D45138C1D4 );
sc->B[10] = _mm_set1_epi64x( 0xBE216306BE216306 );
sc->B[11] = _mm_set1_epi64x( 0xB38B8890B38B8890 );
sc->B[12] = _mm_set1_epi64x( 0x3EA8B96B3EA8B96B );
sc->B[13] = _mm_set1_epi64x( 0x3299ACE43299ACE4 );
sc->B[14] = _mm_set1_epi64x( 0x30924DD430924DD4 );
sc->B[15] = _mm_set1_epi64x( 0x55CB34A555CB34A5 );
sc->C[ 0] = m128_const1_64( 0xB405F031B405F031 );
sc->C[ 1] = m128_const1_64( 0xC4233EBAC4233EBA );
sc->C[ 2] = m128_const1_64( 0xB3733979B3733979 );
sc->C[ 3] = m128_const1_64( 0xC0DD9D55C0DD9D55 );
sc->C[ 4] = m128_const1_64( 0xC51C28AEC51C28AE );
sc->C[ 5] = m128_const1_64( 0xA327B8E1A327B8E1 );
sc->C[ 6] = m128_const1_64( 0x56C5616756C56167 );
sc->C[ 7] = m128_const1_64( 0xED614433ED614433 );
sc->C[ 8] = m128_const1_64( 0x88B59D6088B59D60 );
sc->C[ 9] = m128_const1_64( 0x60E2CEBA60E2CEBA );
sc->C[10] = m128_const1_64( 0x758B4B8B758B4B8B );
sc->C[11] = m128_const1_64( 0x83E82A7F83E82A7F );
sc->C[12] = m128_const1_64( 0xBC968828BC968828 );
sc->C[13] = m128_const1_64( 0xE6E00BF7E6E00BF7 );
sc->C[14] = m128_const1_64( 0xBA839E55BA839E55 );
sc->C[15] = m128_const1_64( 0x9B491C609B491C60 );
sc->C[ 0] = _mm_set1_epi64x( 0xB405F031B405F031 );
sc->C[ 1] = _mm_set1_epi64x( 0xC4233EBAC4233EBA );
sc->C[ 2] = _mm_set1_epi64x( 0xB3733979B3733979 );
sc->C[ 3] = _mm_set1_epi64x( 0xC0DD9D55C0DD9D55 );
sc->C[ 4] = _mm_set1_epi64x( 0xC51C28AEC51C28AE );
sc->C[ 5] = _mm_set1_epi64x( 0xA327B8E1A327B8E1 );
sc->C[ 6] = _mm_set1_epi64x( 0x56C5616756C56167 );
sc->C[ 7] = _mm_set1_epi64x( 0xED614433ED614433 );
sc->C[ 8] = _mm_set1_epi64x( 0x88B59D6088B59D60 );
sc->C[ 9] = _mm_set1_epi64x( 0x60E2CEBA60E2CEBA );
sc->C[10] = _mm_set1_epi64x( 0x758B4B8B758B4B8B );
sc->C[11] = _mm_set1_epi64x( 0x83E82A7F83E82A7F );
sc->C[12] = _mm_set1_epi64x( 0xBC968828BC968828 );
sc->C[13] = _mm_set1_epi64x( 0xE6E00BF7E6E00BF7 );
sc->C[14] = _mm_set1_epi64x( 0xBA839E55BA839E55 );
sc->C[15] = _mm_set1_epi64x( 0x9B491C609B491C60 );
}
sc->Wlow = 1;
sc->Whigh = 0;


@@ -75,7 +75,6 @@ void shabal512_8way_close( void *cc, void *dst );
void shabal512_8way_addbits_and_close( void *cc, unsigned ub, unsigned n,
void *dst );
#endif
typedef struct {
@@ -97,7 +96,6 @@ void shabal256_4way_addbits_and_close( void *cc, unsigned ub, unsigned n,
void shabal512_4way_init( void *cc );
void shabal512_4way_update( void *cc, const void *data, size_t len );
//#define shabal512_4way shabal512_4way_update
void shabal512_4way_close( void *cc, void *dst );
void shabal512_4way_addbits_and_close( void *cc, unsigned ub, unsigned n,
void *dst );


@@ -18,14 +18,6 @@ static const uint32_t IV512[] =
0xE275EADE, 0x502D9FCD, 0xB9357178, 0x022A4B9A
};
/*
#define mm256_ror2x256hi_1x32( a, b ) \
_mm256_blend_epi32( mm256_shuflr128_32( a ), \
mm256_shuflr128_32( b ), 0x88 )
*/
//#define mm256_ror2x256hi_1x32( a, b ) _mm256_alignr_epi8( b, a, 4 )
#if defined(__VAES__)
#define mm256_aesenc_2x128( x, k ) \
@@ -34,8 +26,47 @@ static const uint32_t IV512[] =
#else
#define mm256_aesenc_2x128( x, k ) \
mm256_concat_128( _mm_aesenc_si128( mm128_extr_hi128_256( x ), k ), \
_mm_aesenc_si128( mm128_extr_lo128_256( x ), k ) )
_mm256_inserti128_si256( _mm256_castsi128_si256( \
_mm_aesenc_si128( _mm256_castsi256_si128( x ), k ) ), \
_mm_aesenc_si128( _mm256_extracti128_si256( x, 1 ), k ), 1 )
#endif
#if defined (__AVX512VL__)
//TODO Enable for AVX10_256
#define DECL_m256i_count \
const __m256i count = \
mm256_set4_32( ctx->count3, ctx->count2, ctx->count1, ctx->count0 );
#define COUNT_R0 \
_mm256_mask_xor_epi32( count, 0x88, count, m256_neg1 )
#define COUNT_R1 \
mm256_shuflr128_32( _mm256_mask_xor_epi32( count, 0x11, count, m256_neg1 ) )
#define COUNT_R2 \
mm256_swap128_64( _mm256_mask_xor_epi32( count, 0x22, count, m256_neg1 ) )
#define COUNT_R13 \
mm256_swap64_32( _mm256_mask_xor_epi32( count, 0x44, count, m256_neg1 ) )
#else
#define DECL_m256i_count
// R matches the loop index, not the round number; should change that
#define COUNT_R0 \
mm256_set4_32( ~ctx->count3, ctx->count2, ctx->count1, ctx->count0 )
#define COUNT_R1 \
mm256_set4_32( ~ctx->count0, ctx->count1, ctx->count2, ctx->count3 )
#define COUNT_R2 \
mm256_set4_32( ~ctx->count1, ctx->count0, ctx->count3, ctx->count2 )
#define COUNT_R13 \
mm256_set4_32( ~ctx->count2, ctx->count3, ctx->count0, ctx->count1 )
#endif
@@ -47,6 +78,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
__m256i k00, k01, k02, k03, k10, k11, k12, k13;
__m256i *m = (__m256i*)msg;
__m256i *h = (__m256i*)ctx->h;
DECL_m256i_count;
int r;
p0 = h[0];
@@ -54,7 +86,8 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
p2 = h[2];
p3 = h[3];
// round
// round 0
k00 = m[0];
x = mm256_aesenc_2x128( _mm256_xor_si256( p1, k00 ), zero );
k01 = m[1];
@@ -85,18 +118,14 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
mm256_aesenc_2x128( k00, zero ) ) );
if ( r == 0 )
k00 = _mm256_xor_si256( k00, _mm256_set_epi32(
~ctx->count3, ctx->count2, ctx->count1, ctx->count0,
~ctx->count3, ctx->count2, ctx->count1, ctx->count0 ) );
k00 = _mm256_xor_si256( k00, COUNT_R0 );
x = mm256_aesenc_2x128( _mm256_xor_si256( p0, k00 ), zero );
k01 = _mm256_xor_si256( k00,
mm256_shuflr128_32( mm256_aesenc_2x128( k01, zero ) ) );
if ( r == 1 )
k01 = _mm256_xor_si256( k01, _mm256_set_epi32(
~ctx->count0, ctx->count1, ctx->count2, ctx->count3,
~ctx->count0, ctx->count1, ctx->count2, ctx->count3 ) );
k01 = _mm256_xor_si256( k01, COUNT_R1 );
x = mm256_aesenc_2x128( _mm256_xor_si256( x, k01 ), zero );
k02 = _mm256_xor_si256( k01,
@@ -121,9 +150,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
mm256_shuflr128_32( mm256_aesenc_2x128( k13, zero ) ) );
if ( r == 2 )
k13 = _mm256_xor_si256( k13, _mm256_set_epi32(
~ctx->count1, ctx->count0, ctx->count3, ctx->count2,
~ctx->count1, ctx->count0, ctx->count3, ctx->count2 ) );
k13 = _mm256_xor_si256( k13, COUNT_R2 );
x = mm256_aesenc_2x128( _mm256_xor_si256( x, k13 ), zero );
p1 = _mm256_xor_si256( p1, x );
@@ -235,9 +262,7 @@ c512_2way( shavite512_2way_context *ctx, const void *msg )
x = mm256_aesenc_2x128( _mm256_xor_si256( x, k11 ), zero );
k12 = mm256_shuflr128_32( mm256_aesenc_2x128( k12, zero ) );
k12 = _mm256_xor_si256( k12, _mm256_xor_si256( k11, _mm256_set_epi32(
~ctx->count2, ctx->count3, ctx->count0, ctx->count1,
~ctx->count2, ctx->count3, ctx->count0, ctx->count1 ) ) );
k12 = _mm256_xor_si256( k12, _mm256_xor_si256( k11, COUNT_R13 ) );
x = mm256_aesenc_2x128( _mm256_xor_si256( x, k12 ), zero );
k13 = _mm256_xor_si256( mm256_shuflr128_32(
@@ -257,10 +282,10 @@ void shavite512_2way_init( shavite512_2way_context *ctx )
__m256i *h = (__m256i*)ctx->h;
__m128i *iv = (__m128i*)IV512;
h[0] = m256_const1_128( iv[0] );
h[1] = m256_const1_128( iv[1] );
h[2] = m256_const1_128( iv[2] );
h[3] = m256_const1_128( iv[3] );
h[0] = mm256_bcast_m128( iv[0] );
h[1] = mm256_bcast_m128( iv[1] );
h[2] = mm256_bcast_m128( iv[2] );
h[3] = mm256_bcast_m128( iv[3] );
ctx->ptr = 0;
ctx->count0 = 0;
@@ -320,7 +345,7 @@ void shavite512_2way_close( shavite512_2way_context *ctx, void *dst )
uint32_t vp = ctx->ptr>>5;
// Terminating byte then zero pad
casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );
// Zero pad full vectors up to count
for ( ; vp < 6; vp++ )
@@ -334,9 +359,9 @@ void shavite512_2way_close( shavite512_2way_context *ctx, void *dst )
count.u32[2] = ctx->count2;
count.u32[3] = ctx->count3;
casti_m256i( buf, 6 ) = m256_const1_128(
casti_m256i( buf, 6 ) = mm256_bcast_m128(
_mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
0x0200, count.u16[7], count.u16[6], count.u16[5],
count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );
@@ -400,19 +425,19 @@ void shavite512_2way_update_close( shavite512_2way_context *ctx, void *dst,
if ( vp == 0 ) // empty buf, xevan.
{
casti_m256i( buf, 0 ) = m256_const1_i128( 0x0000000000000080 );
casti_m256i( buf, 0 ) = mm256_bcast128lo_64( 0x0000000000000080 );
memset_zero_256( (__m256i*)buf + 1, 5 );
ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
}
else // half full buf, everyone else.
{
casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );
memset_zero_256( (__m256i*)buf + vp, 6 - vp );
}
casti_m256i( buf, 6 ) = m256_const1_128(
casti_m256i( buf, 6 ) = mm256_bcast_m128(
_mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
0x0200, count.u16[7], count.u16[6], count.u16[5],
count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );
@@ -430,10 +455,10 @@ void shavite512_2way_full( shavite512_2way_context *ctx, void *dst,
__m256i *h = (__m256i*)ctx->h;
__m128i *iv = (__m128i*)IV512;
h[0] = m256_const1_128( iv[0] );
h[1] = m256_const1_128( iv[1] );
h[2] = m256_const1_128( iv[2] );
h[3] = m256_const1_128( iv[3] );
h[0] = mm256_bcast_m128( iv[0] );
h[1] = mm256_bcast_m128( iv[1] );
h[2] = mm256_bcast_m128( iv[2] );
h[3] = mm256_bcast_m128( iv[3] );
ctx->ptr =
ctx->count0 =
@@ -490,19 +515,19 @@ void shavite512_2way_full( shavite512_2way_context *ctx, void *dst,
if ( vp == 0 ) // empty buf, xevan.
{
casti_m256i( buf, 0 ) = m256_const1_i128( 0x0000000000000080 );
casti_m256i( buf, 0 ) = mm256_bcast128lo_64( 0x0000000000000080 );
memset_zero_256( (__m256i*)buf + 1, 5 );
ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
}
else // half full buf, everyone else.
{
casti_m256i( buf, vp++ ) = m256_const1_i128( 0x0000000000000080 );
casti_m256i( buf, vp++ ) = mm256_bcast128lo_64( 0x0000000000000080 );
memset_zero_256( (__m256i*)buf + vp, 6 - vp );
}
casti_m256i( buf, 6 ) = m256_const1_128(
casti_m256i( buf, 6 ) = mm256_bcast_m128(
_mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
casti_m256i( buf, 7 ) = m256_const1_128( _mm_set_epi16(
casti_m256i( buf, 7 ) = mm256_bcast_m128( _mm_set_epi16(
0x0200, count.u16[7], count.u16[6], count.u16[5],
count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );


@@ -204,11 +204,9 @@ c512_4way( shavite512_4way_context *ctx, const void *msg )
K5 = _mm512_xor_si512( mm512_shuflr128_32(
_mm512_aesenc_epi128( K5, m512_zero ) ), K4 );
X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K5 ), m512_zero );
K6 = mm512_shuflr128_32( _mm512_aesenc_epi128( K6, m512_zero ) );
K6 = _mm512_xor_si512( K6, _mm512_xor_si512( K5, _mm512_set4_epi32(
~ctx->count2, ctx->count3, ctx->count0, ctx->count1 ) ) );
K6 = _mm512_xor_si512( K6, _mm512_xor_si512( K5, mm512_swap64_32(
_mm512_mask_xor_epi32( count, 0x4444, count, m512_neg1 ) ) ) );
X = _mm512_aesenc_epi128( _mm512_xor_si512( X, K6 ), m512_zero );
K7= _mm512_xor_si512( mm512_shuflr128_32(
_mm512_aesenc_epi128( K7, m512_zero ) ), K6 );
@@ -227,10 +225,10 @@ void shavite512_4way_init( shavite512_4way_context *ctx )
__m512i *h = (__m512i*)ctx->h;
__m128i *iv = (__m128i*)IV512;
h[0] = m512_const1_128( iv[0] );
h[1] = m512_const1_128( iv[1] );
h[2] = m512_const1_128( iv[2] );
h[3] = m512_const1_128( iv[3] );
h[0] = mm512_bcast_m128( iv[0] );
h[1] = mm512_bcast_m128( iv[1] );
h[2] = mm512_bcast_m128( iv[2] );
h[3] = mm512_bcast_m128( iv[3] );
ctx->ptr = 0;
ctx->count0 = 0;
@@ -290,7 +288,7 @@ void shavite512_4way_close( shavite512_4way_context *ctx, void *dst )
uint32_t vp = ctx->ptr>>6;
// Terminating byte then zero pad
casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );
// Zero pad full vectors up to count
for ( ; vp < 6; vp++ )
@@ -304,9 +302,9 @@ void shavite512_4way_close( shavite512_4way_context *ctx, void *dst )
count.u32[2] = ctx->count2;
count.u32[3] = ctx->count3;
casti_m512i( buf, 6 ) = m512_const1_128(
casti_m512i( buf, 6 ) = mm512_bcast_m128(
_mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
0x0200, count.u16[7], count.u16[6], count.u16[5],
count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );
@@ -370,19 +368,19 @@ void shavite512_4way_update_close( shavite512_4way_context *ctx, void *dst,
if ( vp == 0 ) // empty buf, xevan.
{
casti_m512i( buf, 0 ) = m512_const1_i128( 0x0000000000000080 );
casti_m512i( buf, 0 ) = mm512_bcast128lo_64( 0x0000000000000080 );
memset_zero_512( (__m512i*)buf + 1, 5 );
ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
}
else // half full buf, everyone else.
{
casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );
memset_zero_512( (__m512i*)buf + vp, 6 - vp );
}
casti_m512i( buf, 6 ) = m512_const1_128(
casti_m512i( buf, 6 ) = mm512_bcast_m128(
_mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
0x0200, count.u16[7], count.u16[6], count.u16[5],
count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );
@@ -401,10 +399,10 @@ void shavite512_4way_full( shavite512_4way_context *ctx, void *dst,
__m512i *h = (__m512i*)ctx->h;
__m128i *iv = (__m128i*)IV512;
h[0] = m512_const1_128( iv[0] );
h[1] = m512_const1_128( iv[1] );
h[2] = m512_const1_128( iv[2] );
h[3] = m512_const1_128( iv[3] );
h[0] = mm512_bcast_m128( iv[0] );
h[1] = mm512_bcast_m128( iv[1] );
h[2] = mm512_bcast_m128( iv[2] );
h[3] = mm512_bcast_m128( iv[3] );
ctx->ptr =
ctx->count0 =
@@ -461,19 +459,19 @@ void shavite512_4way_full( shavite512_4way_context *ctx, void *dst,
if ( vp == 0 ) // empty buf, xevan.
{
casti_m512i( buf, 0 ) = m512_const1_i128( 0x0000000000000080 );
casti_m512i( buf, 0 ) = mm512_bcast128lo_64( 0x0000000000000080 );
memset_zero_512( (__m512i*)buf + 1, 5 );
ctx->count0 = ctx->count1 = ctx->count2 = ctx->count3 = 0;
}
else // half full buf, everyone else.
{
casti_m512i( buf, vp++ ) = m512_const1_i128( 0x0000000000000080 );
casti_m512i( buf, vp++ ) = mm512_bcast128lo_64( 0x0000000000000080 );
memset_zero_512( (__m512i*)buf + vp, 6 - vp );
}
casti_m512i( buf, 6 ) = m512_const1_128(
casti_m512i( buf, 6 ) = mm512_bcast_m128(
_mm_insert_epi16( m128_zero, count.u16[0], 7 ) );
casti_m512i( buf, 7 ) = m512_const1_128( _mm_set_epi16(
casti_m512i( buf, 7 ) = mm512_bcast_m128( _mm_set_epi16(
0x0200, count.u16[7], count.u16[6], count.u16[5],
count.u16[4], count.u16[3], count.u16[2], count.u16[1] ) );

View File

@@ -212,14 +212,24 @@ do { \
// targeted
#define shufxor2w(x,s) _mm256_shuffle_epi32( x, XCAT( SHUFXOR_, s ))
#if defined(__AVX512VL__)
//TODO Enable for AVX10_256
#define REDUCE(x) \
_mm256_sub_epi16( _mm256_and_si256( x, m256_const1_64( \
_mm256_sub_epi16( _mm256_maskz_mov_epi8( 0x55555555, x ), \
_mm256_srai_epi16( x, 8 ) )
#else
#define REDUCE(x) \
_mm256_sub_epi16( _mm256_and_si256( x, _mm256_set1_epi64x( \
0x00ff00ff00ff00ff ) ), _mm256_srai_epi16( x, 8 ) )
#endif
#define EXTRA_REDUCE_S(x)\
_mm256_sub_epi16( x, _mm256_and_si256( \
m256_const1_64( 0x0101010101010101 ), \
_mm256_cmpgt_epi16( x, m256_const1_64( 0x0080008000800080 ) ) ) )
_mm256_set1_epi64x( 0x0101010101010101 ), \
_mm256_cmpgt_epi16( x, _mm256_set1_epi64x( 0x0080008000800080 ) ) ) )
#define REDUCE_FULL_S( x ) EXTRA_REDUCE_S( REDUCE (x ) )
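The REDUCE and EXTRA_REDUCE_S macros above work on 16-bit lanes, and their arithmetic can be followed with a scalar model. This is a minimal sketch under the assumption that the lanes hold residues modulo 257: since 256 is congruent to -1 mod 257, a value hi*256 + lo is congruent to lo - hi. All function names are hypothetical, and the code relies on `>>` being an arithmetic shift for negative values, which is implementation-defined in C but matches _mm256_srai_epi16.

```c
#include <stdint.h>

/* One 16-bit lane of REDUCE: low byte minus the sign-extended high byte. */
static int16_t reduce(int16_t x)
{
    return (int16_t)((x & 0x00ff) - (x >> 8));
}

/* One lane of EXTRA_REDUCE_S: subtract 257 (0x0101) when x > 128 (0x0080). */
static int16_t extra_reduce_s(int16_t x)
{
    return (int16_t)(x - ((x > 128) ? 257 : 0));
}

static int16_t reduce_full_s(int16_t x)
{
    return extra_reduce_s(reduce(x));
}

/* Scalar model of a zero-masked byte move on one 16-bit lane:
   mask bit 0 keeps the low byte, mask bit 1 keeps the high byte.
   The 0x5555... masks in the AVX-512 variant set only the even bits,
   so each lane keeps its low byte, the same result as AND 0x00ff. */
static uint16_t maskz_mov_bytes(uint16_t x, unsigned mask2)
{
    uint16_t r = 0;
    if (mask2 & 1u) r |= (uint16_t)(x & 0x00ffu);
    if (mask2 & 2u) r |= (uint16_t)(x & 0xff00u);
    return r;
}

/* Returns 1 if reduce_full_s(x) is congruent to x mod 257 and lies in
   [-128, 128] for every 16-bit input, 0 otherwise. */
static int reduce_full_s_holds(void)
{
    for (int x = -32768; x <= 32767; x++) {
        int r = reduce_full_s((int16_t)x);
        if ((r - x) % 257 != 0 || r < -128 || r > 128)
            return 0;
    }
    return 1;
}
```

The masked move and the AND with 0x00ff00ff00ff00ff select the same bytes, so the two REDUCE variants should differ only in the instructions emitted.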
@@ -384,14 +394,14 @@ static const m512_v16 FFT256_Twiddle4w[] =
#define shufxor4w(x,s) _mm512_shuffle_epi32( x, XCAT( SHUFXOR_, s ))
#define REDUCE4w(x) \
_mm512_sub_epi16( _mm512_and_si512( x, m512_const1_64( \
0x00ff00ff00ff00ff ) ), _mm512_srai_epi16( x, 8 ) )
_mm512_sub_epi16( _mm512_maskz_mov_epi8( 0x5555555555555555, x ), \
_mm512_srai_epi16( x, 8 ) )
#define EXTRA_REDUCE_S4w(x)\
#define EXTRA_REDUCE_S4w(x) \
_mm512_sub_epi16( x, _mm512_and_si512( \
m512_const1_64( 0x0101010101010101 ), \
_mm512_set1_epi64( 0x0101010101010101 ), \
_mm512_movm_epi16( _mm512_cmpgt_epi16_mask( \
x, m512_const1_64( 0x0080008000800080 ) ) ) ) )
x, _mm512_set1_epi64( 0x0080008000800080 ) ) ) ) )
// generic, except it calls targeted macros
#define REDUCE_FULL_S4w( x ) EXTRA_REDUCE_S4w( REDUCE4w (x ) )
@@ -400,8 +410,8 @@ static const m512_v16 FFT256_Twiddle4w[] =
#define DO_REDUCE_FULL_S4w(i) \
do { \
X(i) = REDUCE4w( X(i) ); \
X(i) = EXTRA_REDUCE_S4w( X(i) ); \
X(i) = REDUCE4w( X(i) ); \
X(i) = EXTRA_REDUCE_S4w( X(i) ); \
} while(0)
@@ -431,10 +441,6 @@ void fft64_4way( void *a )
// Unrolled decimation in frequency (DIF) radix-2 NTT.
// Output data is in revbin_permuted order.
static const int w[] = {0, 2, 4, 6};
// __m256i *Twiddle = (__m256i*)FFT64_Twiddle;
// targeted
#define BUTTERFLY_0( i,j ) \
do { \
@@ -443,25 +449,25 @@ do { \
X(i) = _mm512_sub_epi16( X(i), v ); \
} while(0)
#define BUTTERFLY_N( i,j,n ) \
#define BUTTERFLY_N( i, j, w ) \
do { \
__m512i v = X(j); \
X(j) = _mm512_add_epi16( X(i), X(j) ); \
X(i) = _mm512_slli_epi16( _mm512_sub_epi16( X(i), v ), w[n] ); \
X(i) = _mm512_slli_epi16( _mm512_sub_epi16( X(i), v ), w ); \
} while(0)
BUTTERFLY_0( 0, 4 );
BUTTERFLY_N( 1, 5, 1 );
BUTTERFLY_N( 2, 6, 2 );
BUTTERFLY_N( 3, 7, 3 );
BUTTERFLY_N( 1, 5, 2 );
BUTTERFLY_N( 2, 6, 4 );
BUTTERFLY_N( 3, 7, 6 );
DO_REDUCE( 2 );
DO_REDUCE( 3 );
BUTTERFLY_0( 0, 2 );
BUTTERFLY_0( 4, 6 );
BUTTERFLY_N( 1, 3, 2 );
BUTTERFLY_N( 5, 7, 2 );
BUTTERFLY_N( 1, 3, 4 );
BUTTERFLY_N( 5, 7, 4 );
DO_REDUCE( 1 );
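The rewritten BUTTERFLY_N takes the shift count directly instead of an index into the removed table w[] = {0, 2, 4, 6}, so BUTTERFLY_N( 1, 5, 1 ) becomes BUTTERFLY_N( 1, 5, 2 ) and so on. The scalar sketch below models one 16-bit lane of the first-stage butterfly to show that the two forms agree. The function names are hypothetical.

```c
#include <stdint.h>

static const int w_table[4] = { 0, 2, 4, 6 };   /* the table the diff removes */

/* New form: the shift count is passed directly. Lanes are modeled as
   uint16_t so that wraparound matches the 16-bit SIMD adds and shifts. */
static void butterfly_n(uint16_t *xi, uint16_t *xj, unsigned shift)
{
    uint16_t v = *xj;
    *xj = (uint16_t)(*xi + *xj);
    *xi = (uint16_t)((uint16_t)(*xi - v) << shift);
}

/* Old form: the third argument indexes w_table. */
static void butterfly_n_indexed(uint16_t *xi, uint16_t *xj, int n)
{
    butterfly_n(xi, xj, (unsigned)w_table[n]);
}

/* Runs both forms on the same inputs; returns 1 if the results match. */
static int forms_agree(uint16_t a, uint16_t b, int n)
{
    uint16_t a1 = a, b1 = b, a2 = a, b2 = b;
    butterfly_n_indexed(&a1, &b1, n);
    butterfly_n(&a2, &b2, (unsigned)(2 * n));
    return a1 == a2 && b1 == b2;
}

/* Convenience wrapper: returns the new X(i) for the given inputs. */
static uint16_t butterfly_n_xi(uint16_t a, uint16_t b, unsigned shift)
{
    butterfly_n(&a, &b, shift);
    return a;
}
```

Shifting left by 2, 4 and 6 multiplies a lane by 4, 16 and 64. Passing the count as a literal lets the shift intrinsic receive a compile-time constant.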
@@ -482,14 +488,7 @@ do { \
#undef BUTTERFLY_0
#undef BUTTERFLY_N
// twiddle is hard coded T[0] = m512_const2_64( {128,64,32,16}, {8,4,2,1} )
// Multiply by twiddle factors
// X(6) = _mm512_mullo_epi16( X(6), m512_const2_64( 0x0080004000200010,
// 0x0008000400020001 );
// X(5) = _mm512_mullo_epi16( X(5), m512_const2_64( 0xffdc0008ffef0004,
// 0x00780002003c0001 );
X(6) = _mm512_mullo_epi16( X(6), FFT64_Twiddle4w[0].v512 );
X(5) = _mm512_mullo_epi16( X(5), FFT64_Twiddle4w[1].v512 );
X(4) = _mm512_mullo_epi16( X(4), FFT64_Twiddle4w[2].v512 );
@@ -501,12 +500,11 @@ do { \
// Transpose the FFT state with a revbin order permutation
// on the rows and the column.
// This will make the full FFT_64 in order.
#define INTERLEAVE(i,j) \
#define INTERLEAVE( i, j ) \
do { \
__m512i t1= X(i); \
__m512i t2= X(j); \
X(i) = _mm512_unpacklo_epi16( t1, t2 ); \
X(j) = _mm512_unpackhi_epi16( t1, t2 ); \
__m512i u = X(j); \
X(j) = _mm512_unpackhi_epi16( X(i), X(j) ); \
X(i) = _mm512_unpacklo_epi16( X(i), u ); \
} while(0)
INTERLEAVE( 1, 0 );
@@ -534,10 +532,10 @@ do { \
} while(0)
#define BUTTERFLY_N( i,j,n ) \
#define BUTTERFLY_N( i, j, w ) \
do { \
__m512i u = X(j); \
X(i) = _mm512_slli_epi16( X(i), w[n] ); \
X(i) = _mm512_slli_epi16( X(i), w ); \
X(j) = _mm512_sub_epi16( X(j), X(i) ); \
X(i) = _mm512_add_epi16( u, X(i) ); \
} while(0)
@@ -558,15 +556,15 @@ do { \
BUTTERFLY_0( 0, 2 );
BUTTERFLY_0( 4, 6 );
BUTTERFLY_N( 1, 3, 2 );
BUTTERFLY_N( 5, 7, 2 );
BUTTERFLY_N( 1, 3, 4 );
BUTTERFLY_N( 5, 7, 4 );
DO_REDUCE( 3 );
BUTTERFLY_0( 0, 4 );
BUTTERFLY_N( 1, 5, 1 );
BUTTERFLY_N( 2, 6, 2 );
BUTTERFLY_N( 3, 7, 3 );
BUTTERFLY_N( 1, 5, 2 );
BUTTERFLY_N( 2, 6, 4 );
BUTTERFLY_N( 3, 7, 6 );
DO_REDUCE_FULL_S4w( 0 );
DO_REDUCE_FULL_S4w( 1 );
@@ -599,7 +597,6 @@ void fft128_4way( void *a )
// Temp space to help for interleaving in the end
__m512i B[8];
__m512i *A = (__m512i*) a;
// __m256i *Twiddle = (__m256i*)FFT128_Twiddle;
/* Size-2 butterflies */
for ( i = 0; i<8; i++ )
@@ -633,7 +630,6 @@ void fft128_4way_msg( uint16_t *a, const uint8_t *x, int final )
__m512i *X = (__m512i*)x;
__m512i *A = (__m512i*)a;
// __m256i *Twiddle = (__m256i*)FFT128_Twiddle;
#define UNPACK( i ) \
do { \
@@ -686,7 +682,6 @@ void fft256_4way_msg( uint16_t *a, const uint8_t *x, int final )
__m512i *X = (__m512i*)x;
__m512i *A = (__m512i*)a;
// __m256i *Twiddle = (__m256i*)FFT256_Twiddle;
#define UNPACK( i ) \
do { \
@@ -776,109 +771,6 @@ void rounds512_4way( uint32_t *state, const uint8_t *msg, uint16_t *fft )
// We split the round function in two halves
// so as to insert some independent computations in between
// generic
#if 0
#define SUM7_00 0
#define SUM7_01 1
#define SUM7_02 2
#define SUM7_03 3
#define SUM7_04 4
#define SUM7_05 5
#define SUM7_06 6
#define SUM7_10 1
#define SUM7_11 2
#define SUM7_12 3
#define SUM7_13 4
#define SUM7_14 5
#define SUM7_15 6
#define SUM7_16 0
#define SUM7_20 2
#define SUM7_21 3
#define SUM7_22 4
#define SUM7_23 5
#define SUM7_24 6
#define SUM7_25 0
#define SUM7_26 1
#define SUM7_30 3
#define SUM7_31 4
#define SUM7_32 5
#define SUM7_33 6
#define SUM7_34 0
#define SUM7_35 1
#define SUM7_36 2
#define SUM7_40 4
#define SUM7_41 5
#define SUM7_42 6
#define SUM7_43 0
#define SUM7_44 1
#define SUM7_45 2
#define SUM7_46 3
#define SUM7_50 5
#define SUM7_51 6
#define SUM7_52 0
#define SUM7_53 1
#define SUM7_54 2
#define SUM7_55 3
#define SUM7_56 4
#define SUM7_60 6
#define SUM7_61 0
#define SUM7_62 1
#define SUM7_63 2
#define SUM7_64 3
#define SUM7_65 4
#define SUM7_66 5
#define PERM(z,d,a) XCAT(PERM_,XCAT(SUM7_##z,PERM_START))(d,a)
#define PERM_0(d,a) /* XOR 1 */ \
do { \
d##l = shufxor( a##l, 1 ); \
d##h = shufxor( a##h, 1 ); \
} while(0)
#define PERM_1(d,a) /* XOR 6 */ \
do { \
d##l = shufxor( a##h, 2 ); \
d##h = shufxor( a##l, 2 ); \
} while(0)
#define PERM_2(d,a) /* XOR 2 */ \
do { \
d##l = shufxor( a##l, 2 ); \
d##h = shufxor( a##h, 2 ); \
} while(0)
#define PERM_3(d,a) /* XOR 3 */ \
do { \
d##l = shufxor( a##l, 3 ); \
d##h = shufxor( a##h, 3 ); \
} while(0)
#define PERM_4(d,a) /* XOR 5 */ \
do { \
d##l = shufxor( a##h, 1 ); \
d##h = shufxor( a##l, 1 ); \
} while(0)
#define PERM_5(d,a) /* XOR 7 */ \
do { \
d##l = shufxor( a##h, 3 ); \
d##h = shufxor( a##l, 3 ); \
} while(0)
#define PERM_6(d,a) /* XOR 4 */ \
do { \
d##l = a##h; \
d##h = a##l; \
} while(0)
#endif
// targeted
#define STEP_1_(a,b,c,d,w,fun,r,s,z) \


@@ -63,7 +63,7 @@ int scanhash_skein_8way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
@@ -151,7 +151,7 @@ int scanhash_skein_4way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
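The nonce step above adds a 64-bit constant using a 32-bit lane add, so only the upper 32-bit word of each 64-bit element changes and no carry crosses into the neighbouring word. The scalar sketch below models one 64-bit element, assuming (as the constants suggest) that the nonce sits in the upper word. `bump_nonce` is a hypothetical helper.

```c
#include <stdint.h>

/* Models _mm512_add_epi32( v, _mm512_set1_epi64( (uint64_t)step << 32 ) )
   on a single 64-bit element: the low word is untouched and the high word
   wraps modulo 2^32. */
static uint64_t bump_nonce(uint64_t element, uint32_t step)
{
    uint32_t lo = (uint32_t)element;
    uint32_t hi = (uint32_t)(element >> 32) + step;
    return ((uint64_t)hi << 32) | lo;
}
```

With a step of 8 this matches the n += 8 of the 8-way loop. The 4-way loop uses a step of 4.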


@@ -285,7 +285,7 @@ static const uint64_t IV512[] = {
#define SKBI(k, s, i) XCAT(k, XCAT(XCAT(XCAT(M9_, s), _), i))
#define SKBT(t, s, v) XCAT(t, XCAT(XCAT(XCAT(M3_, s), _), v))
#define READ_STATE_BIG(sc) do { \
#define READ_STATE_BIG(sc) \
h0 = (sc)->h0; \
h1 = (sc)->h1; \
h2 = (sc)->h2; \
@@ -294,10 +294,9 @@ static const uint64_t IV512[] = {
h5 = (sc)->h5; \
h6 = (sc)->h6; \
h7 = (sc)->h7; \
bcount = sc->bcount; \
} while (0)
bcount = sc->bcount;
#define WRITE_STATE_BIG(sc) do { \
#define WRITE_STATE_BIG(sc) \
(sc)->h0 = h0; \
(sc)->h1 = h1; \
(sc)->h2 = h2; \
@@ -306,62 +305,54 @@ static const uint64_t IV512[] = {
(sc)->h5 = h5; \
(sc)->h6 = h6; \
(sc)->h7 = h7; \
sc->bcount = bcount; \
} while (0)
sc->bcount = bcount;
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__) && defined(__AVX512BW__)
#define TFBIG_KINIT_8WAY( k0, k1, k2, k3, k4, k5, k6, k7, k8, t0, t1, t2 ) \
do { \
k8 = mm512_xor3( mm512_xor3( k0, k1, k2 ), mm512_xor3( k3, k4, k5 ), \
mm512_xor3( k6, k7, m512_const1_64( 0x1BD11BDAA9FC1A22) ));\
t2 = t0 ^ t1; \
} while (0)
k8 = mm512_xor3( mm512_xor3( k0, k1, k2 ), \
mm512_xor3( k3, k4, k5 ), \
mm512_xor3( k6, k7, \
_mm512_set1_epi64( 0x1BD11BDAA9FC1A22) ) ); \
t2 = t0 ^ t1;
#define TFBIG_ADDKEY_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, k, t, s) \
do { \
w0 = _mm512_add_epi64( w0, SKBI(k,s,0) ); \
w1 = _mm512_add_epi64( w1, SKBI(k,s,1) ); \
w2 = _mm512_add_epi64( w2, SKBI(k,s,2) ); \
w3 = _mm512_add_epi64( w3, SKBI(k,s,3) ); \
w4 = _mm512_add_epi64( w4, SKBI(k,s,4) ); \
w5 = _mm512_add_epi64( w5, _mm512_add_epi64( SKBI(k,s,5), \
m512_const1_64( SKBT(t,s,0) ) ) ); \
_mm512_set1_epi64( SKBT(t,s,0) ) ) ); \
w6 = _mm512_add_epi64( w6, _mm512_add_epi64( SKBI(k,s,6), \
m512_const1_64( SKBT(t,s,1) ) ) ); \
_mm512_set1_epi64( SKBT(t,s,1) ) ) ); \
w7 = _mm512_add_epi64( w7, _mm512_add_epi64( SKBI(k,s,7), \
m512_const1_64( s ) ) ); \
} while (0)
_mm512_set1_epi64( s ) ) );
#define TFBIG_MIX_8WAY(x0, x1, rc) \
do { \
x0 = _mm512_add_epi64( x0, x1 ); \
x1 = _mm512_xor_si512( mm512_rol_64( x1, rc ), x0 ); \
} while (0)
x1 = _mm512_xor_si512( mm512_rol_64( x1, rc ), x0 );
#define TFBIG_MIX8_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3) do { \
#define TFBIG_MIX8_8WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3) \
TFBIG_MIX_8WAY(w0, w1, rc0); \
TFBIG_MIX_8WAY(w2, w3, rc1); \
TFBIG_MIX_8WAY(w4, w5, rc2); \
TFBIG_MIX_8WAY(w6, w7, rc3); \
} while (0)
TFBIG_MIX_8WAY(w6, w7, rc3);
#define TFBIG_8WAY_4e(s) do { \
#define TFBIG_8WAY_4e(s) \
TFBIG_ADDKEY_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
TFBIG_MIX8_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, 46, 36, 19, 37); \
TFBIG_MIX8_8WAY(p2, p1, p4, p7, p6, p5, p0, p3, 33, 27, 14, 42); \
TFBIG_MIX8_8WAY(p4, p1, p6, p3, p0, p5, p2, p7, 17, 49, 36, 39); \
TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44, 9, 54, 56); \
} while (0)
TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44, 9, 54, 56);
#define TFBIG_8WAY_4o(s) do { \
#define TFBIG_8WAY_4o(s) \
TFBIG_ADDKEY_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
TFBIG_MIX8_8WAY(p0, p1, p2, p3, p4, p5, p6, p7, 39, 30, 34, 24); \
TFBIG_MIX8_8WAY(p2, p1, p4, p7, p6, p5, p0, p3, 13, 50, 10, 17); \
TFBIG_MIX8_8WAY(p4, p1, p6, p3, p0, p5, p2, p7, 25, 29, 39, 43); \
TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3, 8, 35, 56, 22); \
} while (0)
TFBIG_MIX8_8WAY(p6, p1, p0, p7, p2, p5, p4, p3, 8, 35, 56, 22);
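Several macros in this file lose their do { ... } while (0) wrapper. A bare multi-statement macro is only safe where it expands inside a braced block, which is presumably the case here because the TFBIG macros are expanded inside other macros. The sketch below uses hypothetical MIX_WRAPPED and MIX_BARE macros, not the real ones, to show the difference under an unbraced if.

```c
#include <stdint.h>

#define MIX_WRAPPED(x0, x1) do { x0 += x1; x1 ^= x0; } while (0)
#define MIX_BARE(x0, x1)    x0 += x1; x1 ^= x0;

/* Returns x1 after a conditional mix using the wrapped macro. */
static uint64_t mix_wrapped_if(int cond, uint64_t x0, uint64_t x1)
{
    if (cond)
        MIX_WRAPPED(x0, x1);
    return x1;
}

/* Same call shape with the bare macro: only "x0 += x1;" is guarded by
   the if, while "x1 ^= x0;" always runs. */
static uint64_t mix_bare_if(int cond, uint64_t x0, uint64_t x1)
{
    if (cond)
        MIX_BARE(x0, x1);
    return x1;
}

/* Bracing the call site restores the intended behaviour. */
static uint64_t mix_bare_braced(int cond, uint64_t x0, uint64_t x1)
{
    if (cond) {
        MIX_BARE(x0, x1);
    }
    return x1;
}
```

With cond equal to 0, mix_wrapped_if and mix_bare_braced leave x1 untouched, while mix_bare_if still applies the xor.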
#define UBI_BIG_8WAY(etype, extra) \
do { \
@@ -424,59 +415,48 @@ do { \
#endif // AVX512
#define TFBIG_KINIT_4WAY( k0, k1, k2, k3, k4, k5, k6, k7, k8, t0, t1, t2 ) \
do { \
k8 = _mm256_xor_si256( _mm256_xor_si256( \
_mm256_xor_si256( _mm256_xor_si256( k0, k1 ), \
_mm256_xor_si256( k2, k3 ) ), \
_mm256_xor_si256( _mm256_xor_si256( k4, k5 ), \
_mm256_xor_si256( k6, k7 ) ) ), \
m256_const1_64( 0x1BD11BDAA9FC1A22) ); \
t2 = t0 ^ t1; \
} while (0)
k8 = mm256_xor3( mm256_xor3( k0, k1, k2 ), \
mm256_xor3( k3, k4, k5 ), \
mm256_xor3( k6, k7, \
_mm256_set1_epi64x( 0x1BD11BDAA9FC1A22) ) ); \
t2 = t0 ^ t1;
#define TFBIG_ADDKEY_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, k, t, s) \
do { \
w0 = _mm256_add_epi64( w0, SKBI(k,s,0) ); \
w1 = _mm256_add_epi64( w1, SKBI(k,s,1) ); \
w2 = _mm256_add_epi64( w2, SKBI(k,s,2) ); \
w3 = _mm256_add_epi64( w3, SKBI(k,s,3) ); \
w4 = _mm256_add_epi64( w4, SKBI(k,s,4) ); \
w5 = _mm256_add_epi64( w5, _mm256_add_epi64( SKBI(k,s,5), \
m256_const1_64( SKBT(t,s,0) ) ) ); \
_mm256_set1_epi64x( SKBT(t,s,0) ) ) ); \
w6 = _mm256_add_epi64( w6, _mm256_add_epi64( SKBI(k,s,6), \
m256_const1_64( SKBT(t,s,1) ) ) ); \
_mm256_set1_epi64x( SKBT(t,s,1) ) ) ); \
w7 = _mm256_add_epi64( w7, _mm256_add_epi64( SKBI(k,s,7), \
m256_const1_64( s ) ) ); \
} while (0)
_mm256_set1_epi64x( s ) ) );
#define TFBIG_MIX_4WAY(x0, x1, rc) \
do { \
x0 = _mm256_add_epi64( x0, x1 ); \
x1 = _mm256_xor_si256( mm256_rol_64( x1, rc ), x0 ); \
} while (0)
x1 = _mm256_xor_si256( mm256_rol_64( x1, rc ), x0 );
#define TFBIG_MIX8_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3) do { \
#define TFBIG_MIX8_4WAY(w0, w1, w2, w3, w4, w5, w6, w7, rc0, rc1, rc2, rc3) \
TFBIG_MIX_4WAY(w0, w1, rc0); \
TFBIG_MIX_4WAY(w2, w3, rc1); \
TFBIG_MIX_4WAY(w4, w5, rc2); \
TFBIG_MIX_4WAY(w6, w7, rc3); \
} while (0)
TFBIG_MIX_4WAY(w6, w7, rc3);
#define TFBIG_4WAY_4e(s) do { \
#define TFBIG_4WAY_4e(s) \
TFBIG_ADDKEY_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
TFBIG_MIX8_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, 46, 36, 19, 37); \
TFBIG_MIX8_4WAY(p2, p1, p4, p7, p6, p5, p0, p3, 33, 27, 14, 42); \
TFBIG_MIX8_4WAY(p4, p1, p6, p3, p0, p5, p2, p7, 17, 49, 36, 39); \
TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44, 9, 54, 56); \
} while (0)
TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3, 44, 9, 54, 56);
#define TFBIG_4WAY_4o(s) do { \
#define TFBIG_4WAY_4o(s) \
TFBIG_ADDKEY_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, h, t, s); \
TFBIG_MIX8_4WAY(p0, p1, p2, p3, p4, p5, p6, p7, 39, 30, 34, 24); \
TFBIG_MIX8_4WAY(p2, p1, p4, p7, p6, p5, p0, p3, 13, 50, 10, 17); \
TFBIG_MIX8_4WAY(p4, p1, p6, p3, p0, p5, p2, p7, 25, 29, 39, 43); \
TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3, 8, 35, 56, 22); \
} while (0)
TFBIG_MIX8_4WAY(p6, p1, p0, p7, p2, p5, p4, p3, 8, 35, 56, 22);
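The TFBIG_MIX and TFBIG_KINIT macros above apply the same two operations in every 64-bit lane. The scalar sketch below models one lane, with hypothetical function names. The mix step is an add followed by a rotate and xor, and the extra key word k8 is the xor of k0..k7 with the constant 0x1BD11BDAA9FC1A22. Because xor is associative, grouping the terms in threes (as the xor3 form does) gives the same value as the nested two-input form it replaces.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left, valid for 0 < c < 64 (every rotation constant above is). */
static uint64_t rol64(uint64_t x, unsigned c)
{
    assert(c > 0 && c < 64);
    return (x << c) | (x >> (64 - c));
}

/* One lane of TFBIG_MIX: x0 += x1, then x1 = rol(x1, rc) ^ x0. */
static void tfbig_mix(uint64_t *x0, uint64_t *x1, unsigned rc)
{
    *x0 += *x1;
    *x1 = rol64(*x1, rc) ^ *x0;
}

/* Convenience wrapper: returns the new x1 after one mix. */
static uint64_t tfbig_mix_x1(uint64_t x0, uint64_t x1, unsigned rc)
{
    tfbig_mix(&x0, &x1, rc);
    return x1;
}

/* One lane of TFBIG_KINIT with nested pairwise grouping (old 4-way form). */
static uint64_t k8_pairwise(const uint64_t k[8])
{
    return (((k[0] ^ k[1]) ^ (k[2] ^ k[3])) ^ ((k[4] ^ k[5]) ^ (k[6] ^ k[7])))
           ^ 0x1BD11BDAA9FC1A22ULL;
}

static uint64_t xor3(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ b ^ c;
}

/* The same value grouped in threes (new form). */
static uint64_t k8_xor3(const uint64_t k[8])
{
    return xor3(xor3(k[0], k[1], k[2]),
                xor3(k[3], k[4], k[5]),
                xor3(k[6], k[7], 0x1BD11BDAA9FC1A22ULL));
}
```

The three-input grouping presumably exists so that targets with a ternary-logic instruction can fold each xor3 into one operation, while other targets fall back to two xors.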
// scale buf offset by 4
#define UBI_BIG_4WAY(etype, extra) \
@@ -541,28 +521,28 @@ do { \
void skein256_8way_init( skein256_8way_context *sc )
{
sc->h0 = m512_const1_64( 0xCCD044A12FDB3E13 );
sc->h1 = m512_const1_64( 0xE83590301A79A9EB );
sc->h2 = m512_const1_64( 0x55AEA0614F816E6F );
sc->h3 = m512_const1_64( 0x2A2767A4AE9B94DB );
sc->h4 = m512_const1_64( 0xEC06025E74DD7683 );
sc->h5 = m512_const1_64( 0xE7A436CDC4746251 );
sc->h6 = m512_const1_64( 0xC36FBAF9393AD185 );
sc->h7 = m512_const1_64( 0x3EEDBA1833EDFC13 );
sc->h0 = _mm512_set1_epi64( 0xCCD044A12FDB3E13 );
sc->h1 = _mm512_set1_epi64( 0xE83590301A79A9EB );
sc->h2 = _mm512_set1_epi64( 0x55AEA0614F816E6F );
sc->h3 = _mm512_set1_epi64( 0x2A2767A4AE9B94DB );
sc->h4 = _mm512_set1_epi64( 0xEC06025E74DD7683 );
sc->h5 = _mm512_set1_epi64( 0xE7A436CDC4746251 );
sc->h6 = _mm512_set1_epi64( 0xC36FBAF9393AD185 );
sc->h7 = _mm512_set1_epi64( 0x3EEDBA1833EDFC13 );
sc->bcount = 0;
sc->ptr = 0;
}
void skein512_8way_init( skein512_8way_context *sc )
{
sc->h0 = m512_const1_64( 0x4903ADFF749C51CE );
sc->h1 = m512_const1_64( 0x0D95DE399746DF03 );
sc->h2 = m512_const1_64( 0x8FD1934127C79BCE );
sc->h3 = m512_const1_64( 0x9A255629FF352CB1 );
sc->h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
sc->h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
sc->h6 = m512_const1_64( 0x991112C71A75B523 );
sc->h7 = m512_const1_64( 0xAE18A40B660FCC33 );
sc->h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
sc->h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
sc->h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
sc->h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
sc->h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
sc->h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
sc->h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
sc->h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );
sc->bcount = 0;
sc->ptr = 0;
}
@@ -660,14 +640,14 @@ void skein512_8way_full( skein512_8way_context *sc, void *out, const void *data,
// Init
h0 = m512_const1_64( 0x4903ADFF749C51CE );
h1 = m512_const1_64( 0x0D95DE399746DF03 );
h2 = m512_const1_64( 0x8FD1934127C79BCE );
h3 = m512_const1_64( 0x9A255629FF352CB1 );
h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
h6 = m512_const1_64( 0x991112C71A75B523 );
h7 = m512_const1_64( 0xAE18A40B660FCC33 );
h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );
// Update
@@ -734,14 +714,14 @@ skein512_8way_prehash64( skein512_8way_context *sc, const void *data )
buf[5] = vdata[5];
buf[6] = vdata[6];
buf[7] = vdata[7];
register __m512i h0 = m512_const1_64( 0x4903ADFF749C51CE );
register __m512i h1 = m512_const1_64( 0x0D95DE399746DF03 );
register __m512i h2 = m512_const1_64( 0x8FD1934127C79BCE );
register __m512i h3 = m512_const1_64( 0x9A255629FF352CB1 );
register __m512i h4 = m512_const1_64( 0x5DB62599DF6CA7B0 );
register __m512i h5 = m512_const1_64( 0xEABE394CA9D5C3F4 );
register __m512i h6 = m512_const1_64( 0x991112C71A75B523 );
register __m512i h7 = m512_const1_64( 0xAE18A40B660FCC33 );
register __m512i h0 = _mm512_set1_epi64( 0x4903ADFF749C51CE );
register __m512i h1 = _mm512_set1_epi64( 0x0D95DE399746DF03 );
register __m512i h2 = _mm512_set1_epi64( 0x8FD1934127C79BCE );
register __m512i h3 = _mm512_set1_epi64( 0x9A255629FF352CB1 );
register __m512i h4 = _mm512_set1_epi64( 0x5DB62599DF6CA7B0 );
register __m512i h5 = _mm512_set1_epi64( 0xEABE394CA9D5C3F4 );
register __m512i h6 = _mm512_set1_epi64( 0x991112C71A75B523 );
register __m512i h7 = _mm512_set1_epi64( 0xAE18A40B660FCC33 );
uint64_t bcount = 1;
UBI_BIG_8WAY( 224, 0 );
@@ -830,28 +810,28 @@ skein512_8way_close(void *cc, void *dst)
void skein256_4way_init( skein256_4way_context *sc )
{
sc->h0 = m256_const1_64( 0xCCD044A12FDB3E13 );
sc->h1 = m256_const1_64( 0xE83590301A79A9EB );
sc->h2 = m256_const1_64( 0x55AEA0614F816E6F );
sc->h3 = m256_const1_64( 0x2A2767A4AE9B94DB );
sc->h4 = m256_const1_64( 0xEC06025E74DD7683 );
sc->h5 = m256_const1_64( 0xE7A436CDC4746251 );
sc->h6 = m256_const1_64( 0xC36FBAF9393AD185 );
sc->h7 = m256_const1_64( 0x3EEDBA1833EDFC13 );
sc->h0 = _mm256_set1_epi64x( 0xCCD044A12FDB3E13 );
sc->h1 = _mm256_set1_epi64x( 0xE83590301A79A9EB );
sc->h2 = _mm256_set1_epi64x( 0x55AEA0614F816E6F );
sc->h3 = _mm256_set1_epi64x( 0x2A2767A4AE9B94DB );
sc->h4 = _mm256_set1_epi64x( 0xEC06025E74DD7683 );
sc->h5 = _mm256_set1_epi64x( 0xE7A436CDC4746251 );
sc->h6 = _mm256_set1_epi64x( 0xC36FBAF9393AD185 );
sc->h7 = _mm256_set1_epi64x( 0x3EEDBA1833EDFC13 );
sc->bcount = 0;
sc->ptr = 0;
}
void skein512_4way_init( skein512_4way_context *sc )
{
sc->h0 = m256_const1_64( 0x4903ADFF749C51CE );
sc->h1 = m256_const1_64( 0x0D95DE399746DF03 );
sc->h2 = m256_const1_64( 0x8FD1934127C79BCE );
sc->h3 = m256_const1_64( 0x9A255629FF352CB1 );
sc->h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
sc->h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
sc->h6 = m256_const1_64( 0x991112C71A75B523 );
sc->h7 = m256_const1_64( 0xAE18A40B660FCC33 );
sc->h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
sc->h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
sc->h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
sc->h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
sc->h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
sc->h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
sc->h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
sc->h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );
sc->bcount = 0;
sc->ptr = 0;
}
@@ -954,14 +934,14 @@ skein512_4way_full( skein512_4way_context *sc, void *out, const void *data,
const int buf_size = 64; // 64 * __m256i
uint64_t bcount = 0;
h0 = m256_const1_64( 0x4903ADFF749C51CE );
h1 = m256_const1_64( 0x0D95DE399746DF03 );
h2 = m256_const1_64( 0x8FD1934127C79BCE );
h3 = m256_const1_64( 0x9A255629FF352CB1 );
h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
h6 = m256_const1_64( 0x991112C71A75B523 );
h7 = m256_const1_64( 0xAE18A40B660FCC33 );
h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );
// Update
@@ -1028,14 +1008,14 @@ skein512_4way_prehash64( skein512_4way_context *sc, const void *data )
buf[5] = vdata[5];
buf[6] = vdata[6];
buf[7] = vdata[7];
register __m256i h0 = m256_const1_64( 0x4903ADFF749C51CE );
register __m256i h1 = m256_const1_64( 0x0D95DE399746DF03 );
register __m256i h2 = m256_const1_64( 0x8FD1934127C79BCE );
register __m256i h3 = m256_const1_64( 0x9A255629FF352CB1 );
register __m256i h4 = m256_const1_64( 0x5DB62599DF6CA7B0 );
register __m256i h5 = m256_const1_64( 0xEABE394CA9D5C3F4 );
register __m256i h6 = m256_const1_64( 0x991112C71A75B523 );
register __m256i h7 = m256_const1_64( 0xAE18A40B660FCC33 );
register __m256i h0 = _mm256_set1_epi64x( 0x4903ADFF749C51CE );
register __m256i h1 = _mm256_set1_epi64x( 0x0D95DE399746DF03 );
register __m256i h2 = _mm256_set1_epi64x( 0x8FD1934127C79BCE );
register __m256i h3 = _mm256_set1_epi64x( 0x9A255629FF352CB1 );
register __m256i h4 = _mm256_set1_epi64x( 0x5DB62599DF6CA7B0 );
register __m256i h5 = _mm256_set1_epi64x( 0xEABE394CA9D5C3F4 );
register __m256i h6 = _mm256_set1_epi64x( 0x991112C71A75B523 );
register __m256i h7 = _mm256_set1_epi64x( 0xAE18A40B660FCC33 );
uint64_t bcount = 1;
UBI_BIG_4WAY( 224, 0 );
@@ -1106,8 +1086,7 @@ skein256_4way_close(void *cc, void *dst)
}
// Do not use with 128 bit data
// Broken for 80 & 128 bytes, use prehash or full
void
skein512_4way_update(void *cc, const void *data, size_t len)
{


@@ -31,18 +31,19 @@ int scanhash_skein( struct work *work, uint32_t max_nonce,
const uint32_t Htarg = ptarget[7];
const uint32_t first_nonce = pdata[19];
uint32_t n = first_nonce;
int thr_id = mythr->id; // thr_id arg is deprecated
int thr_id = mythr->id;
swab32_array( endiandata, pdata, 20 );
do {
be32enc(&endiandata[19], n);
skeinhash(hash64, endiandata);
if (hash64[7] < Htarg && fulltest(hash64, ptarget)) {
*hashes_done = n - first_nonce + 1;
pdata[19] = n;
return true;
}
if (hash64[7] <= Htarg )
if ( fulltest(hash64, ptarget) && !opt_benchmark )
{
pdata[19] = n;
submit_solution( work, hash64, mythr );
}
n++;
} while (n < max_nonce && !work_restart[thr_id].restart);
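The rewritten loop changes two things. The quick test on the top word becomes <= instead of <, and a hit is submitted while scanning continues instead of returning. The first change matters when the top words are equal, because such a hash can still pass the full comparison. In the minimal sketch below, `hash_meets_target` is a hypothetical stand-in for fulltest that assumes 256-bit values stored as eight 32-bit words with index 7 most significant. `old_check` and `new_check` are likewise illustrative names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical full comparison: true when hash <= target, comparing from
   the most significant word (index 7) down to the least (index 0). */
static bool hash_meets_target(const uint32_t hash[8], const uint32_t target[8])
{
    for (int i = 7; i >= 0; i--) {
        if (hash[i] > target[i]) return false;
        if (hash[i] < target[i]) return true;
    }
    return true;   /* all words equal */
}

/* Old pre-filter: strict < on the top word. */
static bool old_check(const uint32_t hash[8], const uint32_t target[8])
{
    return hash[7] < target[7] && hash_meets_target(hash, target);
}

/* New pre-filter: <= on the top word, so ties go on to the full test. */
static bool new_check(const uint32_t hash[8], const uint32_t target[8])
{
    return hash[7] <= target[7] && hash_meets_target(hash, target);
}
```

A hash whose top word equals the target's top word and whose lower words are smaller is accepted by new_check but rejected by old_check.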


@@ -57,7 +57,7 @@ int scanhash_skein2_8way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( (n < last_nonce) && !work_restart[thr_id].restart ) );
@@ -119,7 +119,7 @@ int scanhash_skein2_4way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( (n < last_nonce) && !work_restart[thr_id].restart );


@@ -34,31 +34,31 @@ void skein2hash(void *output, const void *input)
sph_skein512_close(&ctx_skein, hash);
memcpy(output, hash, 32);
}
int scanhash_skein2( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
uint32_t hash64[8] __attribute__ ((aligned (64)));
uint32_t endiandata[20] __attribute__ ((aligned (64)));
const uint32_t Htarg = ptarget[7];
const uint32_t first_nonce = pdata[19];
uint32_t n = first_nonce;
int thr_id = mythr->id; // thr_id arg is deprecated
int thr_id = mythr->id;
swab32_array( endiandata, pdata, 20 );
swab32_array( endiandata, pdata, 20 );
do {
be32enc(&endiandata[19], n);
skein2hash(hash64, endiandata);
if (hash64[7] < Htarg && fulltest(hash64, ptarget)) {
*hashes_done = n - first_nonce + 1;
pdata[19] = n;
return true;
}
if (hash64[7] <= Htarg )
if ( fulltest(hash64, ptarget) && !opt_benchmark )
{
pdata[19] = n;
submit_solution( work, hash64, mythr );
}
n++;
} while (n < max_nonce && !work_restart[thr_id].restart);

@@ -74,6 +74,10 @@
_mm256_or_si256( _mm256_and_si256( x, y ), \
_mm256_andnot_si256( x, z ) )
#define mm256_rol_var_32( v, c ) \
_mm256_or_si256( _mm256_slli_epi32( v, c ), \
_mm256_srli_epi32( v, 32-(c) ) )
void sm3_8way_compress( __m256i *digest, __m256i *block )
{
__m256i W[68], W1[64];
@@ -251,6 +255,9 @@ void sm3_8way_close( void *cc, void *dst )
_mm_andnot_si128( x, z ) )
#define mm128_rol_var_32( v, c ) \
_mm_or_si128( _mm_slli_epi32( v, c ), _mm_srli_epi32( v, 32-(c) ) )
void sm3_4way_compress( __m128i *digest, __m128i *block )
{
__m128i W[68], W1[64];
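
The mm256_rol_var_32 and mm128_rol_var_32 macros added above build a rotate from two shifts and an OR. A scalar sketch of the same per-lane operation follows; rol32 is an illustrative name. In scalar C the count has to stay in 1..31, because shifting a 32-bit value by 32 is undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by c bits, 0 < c < 32. */
static uint32_t rol32( uint32_t v, unsigned c )
{
    assert( c > 0 && c < 32 );
    return ( v << c ) | ( v >> ( 32 - c ) );
}
```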

@@ -630,36 +630,35 @@ void InitializeSWIFFTX()
}
// In the original code the F matrix is rotated so it was not arranged
// the same as all the other data. Rearranging F to match all the other
// data made vectorizing possible, the compiler probably could have been
// able to auto-vectorize with proper data organisation.
// Also in the original code the custom 16 bit data types are all now 32
// bit int32_t regardless of the type name.
//
void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
// the same as the other data. Rearranging F made vectorizing up to 256 bits
// possible.
// Also in the original code the custom 16 bit data types are all now aliased
// to 32 bit int32_t.
void FFT( const unsigned char input[EIGHTH_N], swift_int32_t *output )
{
#if defined(__AVX2__)
__m256i F[8] __attribute__ ((aligned (64)));
__m256i F0, F1, F2, F3, F4, F5, F6, F7;
__m256i tbl = *(__m256i*)&( fftTable[ input[0] << 3 ] );
__m256i *mul = (__m256i*)multipliers;
__m256i *out = (__m256i*)output;
__m256i *tbl = (__m256i*)&( fftTable[ input[0] << 3 ] );
F[0] = _mm256_mullo_epi32( mul[0], *tbl );
tbl = (__m256i*)&( fftTable[ input[1] << 3 ] );
F[1] = _mm256_mullo_epi32( mul[1], *tbl );
tbl = (__m256i*)&( fftTable[ input[2] << 3 ] );
F[2] = _mm256_mullo_epi32( mul[2], *tbl );
tbl = (__m256i*)&( fftTable[ input[3] << 3 ] );
F[3] = _mm256_mullo_epi32( mul[3], *tbl );
tbl = (__m256i*)&( fftTable[ input[4] << 3 ] );
F[4] = _mm256_mullo_epi32( mul[4], *tbl );
tbl = (__m256i*)&( fftTable[ input[5] << 3 ] );
F[5] = _mm256_mullo_epi32( mul[5], *tbl );
tbl = (__m256i*)&( fftTable[ input[6] << 3 ] );
F[6] = _mm256_mullo_epi32( mul[6], *tbl );
tbl = (__m256i*)&( fftTable[ input[7] << 3 ] );
F[7] = _mm256_mullo_epi32( mul[7], *tbl );
F0 = _mm256_mullo_epi32( mul[0], tbl );
tbl = *(__m256i*)&( fftTable[ input[1] << 3 ] );
F1 = _mm256_mullo_epi32( mul[1], tbl );
tbl = *(__m256i*)&( fftTable[ input[2] << 3 ] );
F2 = _mm256_mullo_epi32( mul[2], tbl );
tbl = *(__m256i*)&( fftTable[ input[3] << 3 ] );
F3 = _mm256_mullo_epi32( mul[3], tbl );
tbl = *(__m256i*)&( fftTable[ input[4] << 3 ] );
F4 = _mm256_mullo_epi32( mul[4], tbl );
tbl = *(__m256i*)&( fftTable[ input[5] << 3 ] );
F5 = _mm256_mullo_epi32( mul[5], tbl );
tbl = *(__m256i*)&( fftTable[ input[6] << 3 ] );
F6 = _mm256_mullo_epi32( mul[6], tbl );
tbl = *(__m256i*)&( fftTable[ input[7] << 3 ] );
F7 = _mm256_mullo_epi32( mul[7], tbl );
#define ADD_SUB( a, b ) \
{ \
@@ -668,52 +667,50 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
a = _mm256_add_epi32( a, tmp ); \
}
ADD_SUB( F[0], F[1] );
ADD_SUB( F[2], F[3] );
ADD_SUB( F[4], F[5] );
ADD_SUB( F[6], F[7] );
F[3] = _mm256_slli_epi32( F[3], 4 );
F[7] = _mm256_slli_epi32( F[7], 4 );
ADD_SUB( F[0], F[2] );
ADD_SUB( F[1], F[3] );
ADD_SUB( F[4], F[6] );
ADD_SUB( F[5], F[7] );
F[5] = _mm256_slli_epi32( F[5], 2 );
F[6] = _mm256_slli_epi32( F[6], 4 );
F[7] = _mm256_slli_epi32( F[7], 6 );
ADD_SUB( F[0], F[4] );
ADD_SUB( F[1], F[5] );
ADD_SUB( F[2], F[6] );
ADD_SUB( F[3], F[7] );
ADD_SUB( F0, F1 );
ADD_SUB( F2, F3 );
ADD_SUB( F4, F5 );
ADD_SUB( F6, F7 );
F3 = _mm256_slli_epi32( F3, 4 );
F7 = _mm256_slli_epi32( F7, 4 );
ADD_SUB( F0, F2 );
ADD_SUB( F1, F3 );
ADD_SUB( F4, F6 );
ADD_SUB( F5, F7 );
F5 = _mm256_slli_epi32( F5, 2 );
F6 = _mm256_slli_epi32( F6, 4 );
F7 = _mm256_slli_epi32( F7, 6 );
ADD_SUB( F0, F4 );
ADD_SUB( F1, F5 );
ADD_SUB( F2, F6 );
ADD_SUB( F3, F7 );
#undef ADD_SUB
#if defined (__AVX512VL__) && defined(__AVX512BW__)
const __m256i mask = _mm256_movm_epi8( 0x11111111 );
#define Q_REDUCE( a ) \
_mm256_sub_epi32( _mm256_maskz_mov_epi8( 0x11111111, a ), \
_mm256_srai_epi32( a, 8 ) )
#else
const __m256i mask = m256_const1_32( 0x000000ff );
#endif
const __m256i mask = _mm256_set1_epi32( 0x000000ff );
#define Q_REDUCE( a ) \
_mm256_sub_epi32( _mm256_and_si256( a, mask ), \
_mm256_srai_epi32( a, 8 ) )
#endif
out[0] = Q_REDUCE( F[0] );
out[1] = Q_REDUCE( F[1] );
out[2] = Q_REDUCE( F[2] );
out[3] = Q_REDUCE( F[3] );
out[4] = Q_REDUCE( F[4] );
out[5] = Q_REDUCE( F[5] );
out[6] = Q_REDUCE( F[6] );
out[7] = Q_REDUCE( F[7] );
out[0] = Q_REDUCE( F0 );
out[1] = Q_REDUCE( F1 );
out[2] = Q_REDUCE( F2 );
out[3] = Q_REDUCE( F3 );
out[4] = Q_REDUCE( F4 );
out[5] = Q_REDUCE( F5 );
out[6] = Q_REDUCE( F6 );
out[7] = Q_REDUCE( F7 );
#undef Q_REDUCE
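
Q_REDUCE above performs a partial reduction modulo 257: since 256 is congruent to -1 mod 257, a value a = 256*hi + lo is congruent to lo - hi. A scalar sketch follows; q_reduce and mod257 are illustrative names. It assumes the right shift of a negative int is arithmetic, as it is on GCC and Clang.

```c
#include <stdint.h>

/* Partial reduction mod 257: the result is congruent to a but is not
   guaranteed to lie in 0..256. */
static int32_t q_reduce( int32_t a )
{
    return ( a & 0xff ) - ( a >> 8 );
}

/* Canonical residue in 0..256, used here only to check congruence. */
static int32_t mod257( int32_t a )
{
    int32_t r = a % 257;
    return r < 0 ? r + 257 : r;
}
```

For example, 1000 = 3*256 + 232 reduces to 232 - 3 = 229, which is also 1000 mod 257.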
@@ -763,12 +760,10 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
ADD_SUB( F[ 9], F[11] );
ADD_SUB( F[12], F[14] );
ADD_SUB( F[13], F[15] );
F[ 6] = _mm_slli_epi32( F[ 6], 4 );
F[ 7] = _mm_slli_epi32( F[ 7], 4 );
F[14] = _mm_slli_epi32( F[14], 4 );
F[15] = _mm_slli_epi32( F[15], 4 );
ADD_SUB( F[ 0], F[ 4] );
ADD_SUB( F[ 1], F[ 5] );
ADD_SUB( F[ 2], F[ 6] );
@@ -777,14 +772,12 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
ADD_SUB( F[ 9], F[13] );
ADD_SUB( F[10], F[14] );
ADD_SUB( F[11], F[15] );
F[10] = _mm_slli_epi32( F[10], 2 );
F[11] = _mm_slli_epi32( F[11], 2 );
F[12] = _mm_slli_epi32( F[12], 4 );
F[13] = _mm_slli_epi32( F[13], 4 );
F[14] = _mm_slli_epi32( F[14], 6 );
F[15] = _mm_slli_epi32( F[15], 6 );
ADD_SUB( F[ 0], F[ 8] );
ADD_SUB( F[ 1], F[ 9] );
ADD_SUB( F[ 2], F[10] );
@@ -796,7 +789,7 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
#undef ADD_SUB
const __m128i mask = m128_const1_32( 0x000000ff );
const __m128i mask = _mm_set1_epi32( 0x000000ff );
#define Q_REDUCE( a ) \
_mm_sub_epi32( _mm_and_si128( a, mask ), _mm_srai_epi32( a, 8 ) )
@@ -820,16 +813,13 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
#undef Q_REDUCE
#else // < SSE4.1
#else // AVX256 elif SSE4_1
swift_int16_t *mult = multipliers;
// First loop unrolling:
register swift_int16_t *table = &(fftTable[input[0] << 3]);
/*
swift_int16_t *table = &( fftTable[ input[0] << 3 ] );
swift_int32_t F[64];
/*
for (int i = 0; i < 8; i++)
{
int j = i<<3;
@@ -845,99 +835,91 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
}
*/
register swift_int32_t F0, F1, F2, F3, F4, F5, F6, F7, F8, F9,
F10, F11, F12, F13, F14, F15, F16, F17, F18, F19,
F20, F21, F22, F23, F24, F25, F26, F27, F28, F29,
F30, F31, F32, F33, F34, F35, F36, F37, F38, F39,
F40, F41, F42, F43, F44, F45, F46, F47, F48, F49,
F50, F51, F52, F53, F54, F55, F56, F57, F58, F59,
F60, F61, F62, F63;
F0 = mult[0] * table[0];
F8 = mult[1] * table[1];
F16 = mult[2] * table[2];
F24 = mult[3] * table[3];
F32 = mult[4] * table[4];
F40 = mult[5] * table[5];
F48 = mult[6] * table[6];
F56 = mult[7] * table[7];
F[ 0] = mult[ 0] * table[0];
F[ 8] = mult[ 1] * table[1];
F[16] = mult[ 2] * table[2];
F[24] = mult[ 3] * table[3];
F[32] = mult[ 4] * table[4];
F[40] = mult[ 5] * table[5];
F[48] = mult[ 6] * table[6];
F[56] = mult[ 7] * table[7];
table = &(fftTable[input[1] << 3]);
F1 = mult[ 8] * table[0];
F9 = mult[ 9] * table[1];
F17 = mult[10] * table[2];
F25 = mult[11] * table[3];
F33 = mult[12] * table[4];
F41 = mult[13] * table[5];
F49 = mult[14] * table[6];
F57 = mult[15] * table[7];
F[ 1] = mult[ 8] * table[0];
F[ 9] = mult[ 9] * table[1];
F[17] = mult[10] * table[2];
F[25] = mult[11] * table[3];
F[33] = mult[12] * table[4];
F[41] = mult[13] * table[5];
F[49] = mult[14] * table[6];
F[57] = mult[15] * table[7];
table = &(fftTable[input[2] << 3]);
F2 = mult[16] * table[0];
F10 = mult[17] * table[1];
F18 = mult[18] * table[2];
F26 = mult[19] * table[3];
F34 = mult[20] * table[4];
F42 = mult[21] * table[5];
F50 = mult[22] * table[6];
F58 = mult[23] * table[7];
F[ 2] = mult[16] * table[0];
F[10] = mult[17] * table[1];
F[18] = mult[18] * table[2];
F[26] = mult[19] * table[3];
F[34] = mult[20] * table[4];
F[42] = mult[21] * table[5];
F[50] = mult[22] * table[6];
F[58] = mult[23] * table[7];
table = &(fftTable[input[3] << 3]);
F3 = mult[24] * table[0];
F11 = mult[25] * table[1];
F19 = mult[26] * table[2];
F27 = mult[27] * table[3];
F35 = mult[28] * table[4];
F43 = mult[29] * table[5];
F51 = mult[30] * table[6];
F59 = mult[31] * table[7];
F[ 3] = mult[24] * table[0];
F[11] = mult[25] * table[1];
F[19] = mult[26] * table[2];
F[27] = mult[27] * table[3];
F[35] = mult[28] * table[4];
F[43] = mult[29] * table[5];
F[51] = mult[30] * table[6];
F[59] = mult[31] * table[7];
table = &(fftTable[input[4] << 3]);
F4 = mult[32] * table[0];
F12 = mult[33] * table[1];
F20 = mult[34] * table[2];
F28 = mult[35] * table[3];
F36 = mult[36] * table[4];
F44 = mult[37] * table[5];
F52 = mult[38] * table[6];
F60 = mult[39] * table[7];
F[ 4] = mult[32] * table[0];
F[12] = mult[33] * table[1];
F[20] = mult[34] * table[2];
F[28] = mult[35] * table[3];
F[36] = mult[36] * table[4];
F[44] = mult[37] * table[5];
F[52] = mult[38] * table[6];
F[60] = mult[39] * table[7];
table = &(fftTable[input[5] << 3]);
F5 = mult[40] * table[0];
F13 = mult[41] * table[1];
F21 = mult[42] * table[2];
F29 = mult[43] * table[3];
F37 = mult[44] * table[4];
F45 = mult[45] * table[5];
F53 = mult[46] * table[6];
F61 = mult[47] * table[7];
F[ 5] = mult[40] * table[0];
F[13] = mult[41] * table[1];
F[21] = mult[42] * table[2];
F[29] = mult[43] * table[3];
F[37] = mult[44] * table[4];
F[45] = mult[45] * table[5];
F[53] = mult[46] * table[6];
F[61] = mult[47] * table[7];
table = &(fftTable[input[6] << 3]);
F6 = mult[48] * table[0];
F14 = mult[49] * table[1];
F22 = mult[50] * table[2];
F30 = mult[51] * table[3];
F38 = mult[52] * table[4];
F46 = mult[53] * table[5];
F54 = mult[54] * table[6];
F62 = mult[55] * table[7];
F[ 6] = mult[48] * table[0];
F[14] = mult[49] * table[1];
F[22] = mult[50] * table[2];
F[30] = mult[51] * table[3];
F[38] = mult[52] * table[4];
F[46] = mult[53] * table[5];
F[54] = mult[54] * table[6];
F[62] = mult[55] * table[7];
table = &(fftTable[input[7] << 3]);
F7 = mult[56] * table[0];
F15 = mult[57] * table[1];
F23 = mult[58] * table[2];
F31 = mult[59] * table[3];
F39 = mult[60] * table[4];
F47 = mult[61] * table[5];
F55 = mult[62] * table[6];
F63 = mult[63] * table[7];
F[ 7] = mult[56] * table[0];
F[15] = mult[57] * table[1];
F[23] = mult[58] * table[2];
F[31] = mult[59] * table[3];
F[39] = mult[60] * table[4];
F[47] = mult[61] * table[5];
F[55] = mult[62] * table[6];
F[63] = mult[63] * table[7];
#define ADD_SUB( a, b ) \
{ \
@@ -987,262 +969,229 @@ void FFT(const unsigned char input[EIGHTH_N], swift_int32_t *output)
}
*/
// Second loop unrolling:
// Iteration 0:
ADD_SUB(F0, F1);
ADD_SUB(F2, F3);
ADD_SUB(F4, F5);
ADD_SUB(F6, F7);
ADD_SUB( F[ 0], F[ 1] );
ADD_SUB( F[ 2], F[ 3] );
ADD_SUB( F[ 4], F[ 5] );
ADD_SUB( F[ 6], F[ 7] );
F[ 3] <<= 4;
F[ 7] <<= 4;
ADD_SUB( F[ 0], F[ 2] );
ADD_SUB( F[ 1], F[ 3] );
ADD_SUB( F[ 4], F[ 6] );
ADD_SUB( F[ 5], F[ 7] );
F[ 5] <<= 2;
F[ 6] <<= 4;
F[ 7] <<= 6;
ADD_SUB( F[ 0], F[ 4] );
ADD_SUB( F[ 1], F[ 5] );
ADD_SUB( F[ 2], F[ 6] );
ADD_SUB( F[ 3], F[ 7] );
F3 <<= 4;
F7 <<= 4;
ADD_SUB(F0, F2);
ADD_SUB(F1, F3);
ADD_SUB(F4, F6);
ADD_SUB(F5, F7);
F5 <<= 2;
F6 <<= 4;
F7 <<= 6;
ADD_SUB(F0, F4);
ADD_SUB(F1, F5);
ADD_SUB(F2, F6);
ADD_SUB(F3, F7);
output[0] = Q_REDUCE(F0);
output[8] = Q_REDUCE(F1);
output[16] = Q_REDUCE(F2);
output[24] = Q_REDUCE(F3);
output[32] = Q_REDUCE(F4);
output[40] = Q_REDUCE(F5);
output[48] = Q_REDUCE(F6);
output[56] = Q_REDUCE(F7);
output[ 0] = Q_REDUCE( F[ 0] );
output[ 8] = Q_REDUCE( F[ 1] );
output[16] = Q_REDUCE( F[ 2] );
output[24] = Q_REDUCE( F[ 3] );
output[32] = Q_REDUCE( F[ 4] );
output[40] = Q_REDUCE( F[ 5] );
output[48] = Q_REDUCE( F[ 6] );
output[56] = Q_REDUCE( F[ 7] );
// Iteration 1:
ADD_SUB(F8, F9);
ADD_SUB(F10, F11);
ADD_SUB(F12, F13);
ADD_SUB(F14, F15);
ADD_SUB( F[ 8], F[ 9] );
ADD_SUB( F[10], F[11] );
ADD_SUB( F[12], F[13] );
ADD_SUB( F[14], F[15] );
F[11] <<= 4;
F[15] <<= 4;
ADD_SUB( F[ 8], F[10] );
ADD_SUB( F[ 9], F[11] );
ADD_SUB( F[12], F[14] );
ADD_SUB( F[13], F[15] );
F[13] <<= 2;
F[14] <<= 4;
F[15] <<= 6;
ADD_SUB( F[ 8], F[12] );
ADD_SUB( F[ 9], F[13] );
ADD_SUB( F[10], F[14] );
ADD_SUB( F[11], F[15] );
F11 <<= 4;
F15 <<= 4;
ADD_SUB(F8, F10);
ADD_SUB(F9, F11);
ADD_SUB(F12, F14);
ADD_SUB(F13, F15);
F13 <<= 2;
F14 <<= 4;
F15 <<= 6;
ADD_SUB(F8, F12);
ADD_SUB(F9, F13);
ADD_SUB(F10, F14);
ADD_SUB(F11, F15);
output[1] = Q_REDUCE(F8);
output[9] = Q_REDUCE(F9);
output[17] = Q_REDUCE(F10);
output[25] = Q_REDUCE(F11);
output[33] = Q_REDUCE(F12);
output[41] = Q_REDUCE(F13);
output[49] = Q_REDUCE(F14);
output[57] = Q_REDUCE(F15);
output[ 1] = Q_REDUCE( F[ 8] );
output[ 9] = Q_REDUCE( F[ 9] );
output[17] = Q_REDUCE( F[10] );
output[25] = Q_REDUCE( F[11] );
output[33] = Q_REDUCE( F[12] );
output[41] = Q_REDUCE( F[13] );
output[49] = Q_REDUCE( F[14] );
output[57] = Q_REDUCE( F[15] );
// Iteration 2:
ADD_SUB(F16, F17);
ADD_SUB(F18, F19);
ADD_SUB(F20, F21);
ADD_SUB(F22, F23);
ADD_SUB( F[16], F[17] );
ADD_SUB( F[18], F[19] );
ADD_SUB( F[20], F[21] );
ADD_SUB( F[22], F[23] );
F[19] <<= 4;
F[23] <<= 4;
ADD_SUB( F[16], F[18]);
ADD_SUB( F[17], F[19]);
ADD_SUB( F[20], F[22]);
ADD_SUB( F[21], F[23]);
F[21] <<= 2;
F[22] <<= 4;
F[23] <<= 6;
ADD_SUB( F[16], F[20] );
ADD_SUB( F[17], F[21] );
ADD_SUB( F[18], F[22] );
ADD_SUB( F[19], F[23] );
F19 <<= 4;
F23 <<= 4;
ADD_SUB(F16, F18);
ADD_SUB(F17, F19);
ADD_SUB(F20, F22);
ADD_SUB(F21, F23);
F21 <<= 2;
F22 <<= 4;
F23 <<= 6;
ADD_SUB(F16, F20);
ADD_SUB(F17, F21);
ADD_SUB(F18, F22);
ADD_SUB(F19, F23);
output[2] = Q_REDUCE(F16);
output[10] = Q_REDUCE(F17);
output[18] = Q_REDUCE(F18);
output[26] = Q_REDUCE(F19);
output[34] = Q_REDUCE(F20);
output[42] = Q_REDUCE(F21);
output[50] = Q_REDUCE(F22);
output[58] = Q_REDUCE(F23);
output[ 2] = Q_REDUCE( F[16] );
output[10] = Q_REDUCE( F[17] );
output[18] = Q_REDUCE( F[18] );
output[26] = Q_REDUCE( F[19] );
output[34] = Q_REDUCE( F[20] );
output[42] = Q_REDUCE( F[21] );
output[50] = Q_REDUCE( F[22] );
output[58] = Q_REDUCE( F[23] );
// Iteration 3:
ADD_SUB(F24, F25);
ADD_SUB(F26, F27);
ADD_SUB(F28, F29);
ADD_SUB(F30, F31);
ADD_SUB( F[24], F[25] );
ADD_SUB( F[26], F[27] );
ADD_SUB( F[28], F[29] );
ADD_SUB( F[30], F[31] );
F[27] <<= 4;
F[31] <<= 4;
ADD_SUB( F[24], F[26] );
ADD_SUB( F[25], F[27] );
ADD_SUB( F[28], F[30] );
ADD_SUB( F[29], F[31] );
F[29] <<= 2;
F[30] <<= 4;
F[31] <<= 6;
ADD_SUB( F[24], F[28] );
ADD_SUB( F[25], F[29] );
ADD_SUB( F[26], F[30] );
ADD_SUB( F[27], F[31] );
F27 <<= 4;
F31 <<= 4;
ADD_SUB(F24, F26);
ADD_SUB(F25, F27);
ADD_SUB(F28, F30);
ADD_SUB(F29, F31);
F29 <<= 2;
F30 <<= 4;
F31 <<= 6;
ADD_SUB(F24, F28);
ADD_SUB(F25, F29);
ADD_SUB(F26, F30);
ADD_SUB(F27, F31);
output[3] = Q_REDUCE(F24);
output[11] = Q_REDUCE(F25);
output[19] = Q_REDUCE(F26);
output[27] = Q_REDUCE(F27);
output[35] = Q_REDUCE(F28);
output[43] = Q_REDUCE(F29);
output[51] = Q_REDUCE(F30);
output[59] = Q_REDUCE(F31);
output[ 3] = Q_REDUCE( F[24] );
output[11] = Q_REDUCE( F[25] );
output[19] = Q_REDUCE( F[26] );
output[27] = Q_REDUCE( F[27] );
output[35] = Q_REDUCE( F[28] );
output[43] = Q_REDUCE( F[29] );
output[51] = Q_REDUCE( F[30] );
output[59] = Q_REDUCE( F[31] );
// Iteration 4:
ADD_SUB(F32, F33);
ADD_SUB(F34, F35);
ADD_SUB(F36, F37);
ADD_SUB(F38, F39);
ADD_SUB( F[32], F[33] );
ADD_SUB( F[34], F[35] );
ADD_SUB( F[36], F[37] );
ADD_SUB( F[38], F[39] );
F[35] <<= 4;
F[39] <<= 4;
ADD_SUB( F[32], F[34] );
ADD_SUB( F[33], F[35] );
ADD_SUB( F[36], F[38] );
ADD_SUB( F[37], F[39] );
F[37] <<= 2;
F[38] <<= 4;
F[39] <<= 6;
ADD_SUB( F[32], F[36] );
ADD_SUB( F[33], F[37] );
ADD_SUB( F[34], F[38] );
ADD_SUB( F[35], F[39] );
F35 <<= 4;
F39 <<= 4;
ADD_SUB(F32, F34);
ADD_SUB(F33, F35);
ADD_SUB(F36, F38);
ADD_SUB(F37, F39);
F37 <<= 2;
F38 <<= 4;
F39 <<= 6;
ADD_SUB(F32, F36);
ADD_SUB(F33, F37);
ADD_SUB(F34, F38);
ADD_SUB(F35, F39);
output[4] = Q_REDUCE(F32);
output[12] = Q_REDUCE(F33);
output[20] = Q_REDUCE(F34);
output[28] = Q_REDUCE(F35);
output[36] = Q_REDUCE(F36);
output[44] = Q_REDUCE(F37);
output[52] = Q_REDUCE(F38);
output[60] = Q_REDUCE(F39);
output[ 4] = Q_REDUCE( F[32] );
output[12] = Q_REDUCE( F[33] );
output[20] = Q_REDUCE( F[34] );
output[28] = Q_REDUCE( F[35] );
output[36] = Q_REDUCE( F[36] );
output[44] = Q_REDUCE( F[37] );
output[52] = Q_REDUCE( F[38] );
output[60] = Q_REDUCE( F[39] );
// Iteration 5:
ADD_SUB(F40, F41);
ADD_SUB(F42, F43);
ADD_SUB(F44, F45);
ADD_SUB(F46, F47);
ADD_SUB( F[40], F[41] );
ADD_SUB( F[42], F[43] );
ADD_SUB( F[44], F[45] );
ADD_SUB( F[46], F[47] );
F[43] <<= 4;
F[47] <<= 4;
ADD_SUB( F[40], F[42] );
ADD_SUB( F[41], F[43] );
ADD_SUB( F[44], F[46] );
ADD_SUB( F[45], F[47] );
F[45] <<= 2;
F[46] <<= 4;
F[47] <<= 6;
ADD_SUB( F[40], F[44] );
ADD_SUB( F[41], F[45] );
ADD_SUB( F[42], F[46] );
ADD_SUB( F[43], F[47] );
F43 <<= 4;
F47 <<= 4;
ADD_SUB(F40, F42);
ADD_SUB(F41, F43);
ADD_SUB(F44, F46);
ADD_SUB(F45, F47);
F45 <<= 2;
F46 <<= 4;
F47 <<= 6;
ADD_SUB(F40, F44);
ADD_SUB(F41, F45);
ADD_SUB(F42, F46);
ADD_SUB(F43, F47);
output[5] = Q_REDUCE(F40);
output[13] = Q_REDUCE(F41);
output[21] = Q_REDUCE(F42);
output[29] = Q_REDUCE(F43);
output[37] = Q_REDUCE(F44);
output[45] = Q_REDUCE(F45);
output[53] = Q_REDUCE(F46);
output[61] = Q_REDUCE(F47);
output[ 5] = Q_REDUCE( F[40] );
output[13] = Q_REDUCE( F[41] );
output[21] = Q_REDUCE( F[42] );
output[29] = Q_REDUCE( F[43] );
output[37] = Q_REDUCE( F[44] );
output[45] = Q_REDUCE( F[45] );
output[53] = Q_REDUCE( F[46] );
output[61] = Q_REDUCE( F[47] );
// Iteration 6:
ADD_SUB(F48, F49);
ADD_SUB(F50, F51);
ADD_SUB(F52, F53);
ADD_SUB(F54, F55);
ADD_SUB( F[48], F[49] );
ADD_SUB( F[50], F[51] );
ADD_SUB( F[52], F[53] );
ADD_SUB( F[54], F[55] );
F[51] <<= 4;
F[55] <<= 4;
ADD_SUB( F[48], F[50] );
ADD_SUB( F[49], F[51] );
ADD_SUB( F[52], F[54] );
ADD_SUB( F[53], F[55] );
F[53] <<= 2;
F[54] <<= 4;
F[55] <<= 6;
ADD_SUB( F[48], F[52] );
ADD_SUB( F[49], F[53] );
ADD_SUB( F[50], F[54] );
ADD_SUB( F[51], F[55] );
F51 <<= 4;
F55 <<= 4;
ADD_SUB(F48, F50);
ADD_SUB(F49, F51);
ADD_SUB(F52, F54);
ADD_SUB(F53, F55);
F53 <<= 2;
F54 <<= 4;
F55 <<= 6;
ADD_SUB(F48, F52);
ADD_SUB(F49, F53);
ADD_SUB(F50, F54);
ADD_SUB(F51, F55);
output[6] = Q_REDUCE(F48);
output[14] = Q_REDUCE(F49);
output[22] = Q_REDUCE(F50);
output[30] = Q_REDUCE(F51);
output[38] = Q_REDUCE(F52);
output[46] = Q_REDUCE(F53);
output[54] = Q_REDUCE(F54);
output[62] = Q_REDUCE(F55);
output[ 6] = Q_REDUCE( F[48] );
output[14] = Q_REDUCE( F[49] );
output[22] = Q_REDUCE( F[50] );
output[30] = Q_REDUCE( F[51] );
output[38] = Q_REDUCE( F[52] );
output[46] = Q_REDUCE( F[53] );
output[54] = Q_REDUCE( F[54] );
output[62] = Q_REDUCE( F[55] );
// Iteration 7:
ADD_SUB(F56, F57);
ADD_SUB(F58, F59);
ADD_SUB(F60, F61);
ADD_SUB(F62, F63);
ADD_SUB( F[56], F[57] );
ADD_SUB( F[58], F[59] );
ADD_SUB( F[60], F[61] );
ADD_SUB( F[62], F[63] );
F[59] <<= 4;
F[63] <<= 4;
ADD_SUB( F[56], F[58] );
ADD_SUB( F[57], F[59] );
ADD_SUB( F[60], F[62] );
ADD_SUB( F[61], F[63] );
F[61] <<= 2;
F[62] <<= 4;
F[63] <<= 6;
ADD_SUB( F[56], F[60] );
ADD_SUB( F[57], F[61] );
ADD_SUB( F[58], F[62] );
ADD_SUB( F[59], F[63] );
F59 <<= 4;
F63 <<= 4;
ADD_SUB(F56, F58);
ADD_SUB(F57, F59);
ADD_SUB(F60, F62);
ADD_SUB(F61, F63);
F61 <<= 2;
F62 <<= 4;
F63 <<= 6;
ADD_SUB(F56, F60);
ADD_SUB(F57, F61);
ADD_SUB(F58, F62);
ADD_SUB(F59, F63);
output[7] = Q_REDUCE(F56);
output[15] = Q_REDUCE(F57);
output[23] = Q_REDUCE(F58);
output[31] = Q_REDUCE(F59);
output[39] = Q_REDUCE(F60);
output[47] = Q_REDUCE(F61);
output[55] = Q_REDUCE(F62);
output[63] = Q_REDUCE(F63);
output[ 7] = Q_REDUCE( F[56] );
output[15] = Q_REDUCE( F[57] );
output[23] = Q_REDUCE( F[58] );
output[31] = Q_REDUCE( F[59] );
output[39] = Q_REDUCE( F[60] );
output[47] = Q_REDUCE( F[61] );
output[55] = Q_REDUCE( F[62] );
output[63] = Q_REDUCE( F[63] );
#undef ADD_SUB
#undef Q_REDUCE
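
Each "Iteration" block above is the same three-stage butterfly network on eight values, with the shifts by 2, 4 and 6 acting as multiplications by powers of 4. Since 4^4 = 256 is congruent to -1 mod 257, 4 is a primitive 8th root of unity there, so the network can be read as an 8-point number-theoretic transform. The sketch below follows that reading and is not a statement of how SWIFFTX defines its transform. It assumes the eight inputs arrive in bit-reversed order; ntt8 and mod257 are illustrative names, and the twiddles are written as multiplications so negative intermediates stay well defined.

```c
#include <stdint.h>

#define ADD_SUB( a, b ) \
   { int32_t tmp = (b); (b) = (a) - (b); (a) = (a) + tmp; }

static int32_t mod257( int32_t a )
{
    int32_t r = a % 257;
    return r < 0 ? r + 257 : r;
}

/* In-place 8-point transform over Z_257 with root 4.
   On entry f[] holds x[0], x[4], x[2], x[6], x[1], x[5], x[3], x[7].
   On return f[k] = sum over n of x[n] * 4^(n*k) mod 257, k = 0..7. */
static void ntt8( int32_t f[8] )
{
    ADD_SUB( f[0], f[1] );  ADD_SUB( f[2], f[3] );
    ADD_SUB( f[4], f[5] );  ADD_SUB( f[6], f[7] );
    f[3] *= 16;  f[7] *= 16;                 /* same factor as << 4 */
    ADD_SUB( f[0], f[2] );  ADD_SUB( f[1], f[3] );
    ADD_SUB( f[4], f[6] );  ADD_SUB( f[5], f[7] );
    f[5] *= 4;  f[6] *= 16;  f[7] *= 64;     /* << 2, << 4, << 6 */
    ADD_SUB( f[0], f[4] );  ADD_SUB( f[1], f[5] );
    ADD_SUB( f[2], f[6] );  ADD_SUB( f[3], f[7] );
    for ( int i = 0; i < 8; i++ )
        f[i] = mod257( f[i] );
}
```

A unit impulse at x[1] comes out as the successive powers of 4 mod 257, which is a quick way to check the twiddle placement.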

@@ -134,10 +134,10 @@ int sha3_4way_update( sha3_4way_ctx_t *c, const void *data, size_t len )
int sha3_4way_final( void *md, sha3_4way_ctx_t *c )
{
c->st[ c->pt ] = _mm256_xor_si256( c->st[ c->pt ],
m256_const1_64( 6 ) );
_mm256_set1_epi64x( 6 ) );
c->st[ c->rsiz / 8 - 1 ] =
_mm256_xor_si256( c->st[ c->rsiz / 8 - 1 ],
m256_const1_64( 0x8000000000000000 ) );
_mm256_set1_epi64x( 0x8000000000000000 ) );
sha3_4way_keccakf( c->st );
memcpy( md, c->st, c->mdlen * 4 );
return 1;
@@ -268,10 +268,10 @@ int sha3_8way_final( void *md, sha3_8way_ctx_t *c )
{
c->st[ c->pt ] =
_mm512_xor_si512( c->st[ c->pt ],
m512_const1_64( 6 ) );
_mm512_set1_epi64( 6 ) );
c->st[ c->rsiz / 8 - 1 ] =
_mm512_xor_si512( c->st[ c->rsiz / 8 - 1 ],
m512_const1_64( 0x8000000000000000 ) );
_mm512_set1_epi64( 0x8000000000000000 ) );
sha3_8way_keccakf( c->st );
memcpy( md, c->st, c->mdlen * 8 );
return 1;
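
The two XORs in sha3_4way_final and sha3_8way_final apply SHA-3 padding to whole 64-bit lanes: 0x06 (domain suffix plus first pad bit) goes into the low byte of lane pt, and 0x80 (final pad bit) into the high byte of the last rate lane. A scalar sketch for a single state follows. It assumes, as the vector code appears to, that pt counts 64-bit lanes and that the message ended on a lane boundary. sha3_pad_lanes and lane_byte are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR SHA-3 padding into a state held as 64-bit lanes.
   pt is the index of the first unused lane, rsiz the rate in bytes. */
static void sha3_pad_lanes( uint64_t st[25], size_t pt, size_t rsiz )
{
    st[ pt ]           ^= 0x06ULL;
    st[ rsiz / 8 - 1 ] ^= 0x8000000000000000ULL;
}

/* Byte i of the state, reading each lane as a little-endian number. */
static unsigned lane_byte( const uint64_t st[25], size_t i )
{
    return (unsigned)( st[ i / 8 ] >> ( 8 * ( i % 8 ) ) ) & 0xff;
}
```

When pt is already the last rate lane, both constants land in the same word and combine to 0x8000000000000006.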

@@ -1,291 +0,0 @@
/* $Id: md_helper.c 216 2010-06-08 09:46:57Z tp $ */
/*
* This file contains some functions which implement the external data
* handling and padding for Merkle-Damgard hash functions which follow
* the conventions set out by MD4 (little-endian) or SHA-1 (big-endian).
*
* API: this file is meant to be included, not compiled as a stand-alone
* file. Some macros must be defined:
* RFUN name for the round function
* HASH "short name" for the hash function
* BE32 defined for big-endian, 32-bit based (e.g. SHA-1)
* LE32 defined for little-endian, 32-bit based (e.g. MD5)
* BE64 defined for big-endian, 64-bit based (e.g. SHA-512)
* LE64 defined for little-endian, 64-bit based (no example yet)
* PW01 if defined, append 0x01 instead of 0x80 (for Tiger)
* BLEN if defined, length of a message block (in bytes)
* PLW1 if defined, length is defined on one 64-bit word only (for Tiger)
* PLW4 if defined, length is defined on four 64-bit words (for WHIRLPOOL)
* SVAL if defined, reference to the context state information
*
* BLEN is used when a message block is not 16 (32-bit or 64-bit) words:
* this is used for instance for Tiger, which works on 64-bit words but
* uses 512-bit message blocks (eight 64-bit words). PLW1 and PLW4 are
* ignored if 32-bit words are used; if 64-bit words are used and PLW1 is
* set, then only one word (64 bits) will be used to encode the input
* message length (in bits), otherwise two words will be used (as in
* SHA-384 and SHA-512). If 64-bit words are used and PLW4 is defined (but
* not PLW1), four 64-bit words will be used to encode the message length
* (in bits). Note that regardless of those settings, only 64-bit message
* lengths are supported (in bits): messages longer than 2 Exabytes will be
* improperly hashed (this is unlikely to happen soon: 2 Exabytes is about
* 2 million Terabytes, which is huge).
*
* If CLOSE_ONLY is defined, then this file defines only the sph_XXX_close()
* function. This is used for Tiger2, which is identical to Tiger except
* when it comes to the padding (Tiger2 uses the standard 0x80 byte instead
* of the 0x01 from original Tiger).
*
* The RFUN function is invoked with two arguments, the first pointing to
* aligned data (as a "const void *"), the second being state information
* from the context structure. By default, this state information is the
* "val" field from the context, and this field is assumed to be an array
* of words ("sph_u32" or "sph_u64", depending on BE32/LE32/BE64/LE64).
* from the context structure. The "val" field can have any type, except
* for the output encoding which assumes that it is an array of "sph_u32"
* values. By defining NO_OUTPUT, this last step is deactivated; the
* includer code is then responsible for writing out the hash result. When
* NO_OUTPUT is defined, the third parameter to the "close()" function is
* ignored.
*
* ==========================(LICENSE BEGIN)============================
*
* Copyright (c) 2007-2010 Projet RNRT SAPHIR
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*
* ===========================(LICENSE END)=============================
*
* @author Thomas Pornin <thomas.pornin@cryptolog.com>
*/
#ifdef _MSC_VER
#pragma warning (disable: 4146)
#endif
#undef SPH_XCAT
#define SPH_XCAT(a, b) SPH_XCAT_(a, b)
#undef SPH_XCAT_
#define SPH_XCAT_(a, b) a ## b
#undef SPH_BLEN
#undef SPH_WLEN
#if defined BE64 || defined LE64
#define SPH_BLEN 128U
#define SPH_WLEN 8U
#else
#define SPH_BLEN 64U
#define SPH_WLEN 4U
#endif
#ifdef BLEN
#undef SPH_BLEN
#define SPH_BLEN BLEN
#endif
#undef SPH_MAXPAD
#if defined PLW1
#define SPH_MAXPAD (SPH_BLEN - SPH_WLEN)
#elif defined PLW4
#define SPH_MAXPAD (SPH_BLEN - (SPH_WLEN << 2))
#else
#define SPH_MAXPAD (SPH_BLEN - (SPH_WLEN << 1))
#endif
#undef SPH_VAL
#undef SPH_NO_OUTPUT
#ifdef SVAL
#define SPH_VAL SVAL
#define SPH_NO_OUTPUT 1
#else
#define SPH_VAL sc->val
#endif
#ifndef CLOSE_ONLY
#ifdef SPH_UPTR
static void
SPH_XCAT(HASH, _short)( void *cc, const void *data, size_t len )
#else
void
HASH ( void *cc, const void *data, size_t len )
#endif
{
SPH_XCAT( HASH, _context ) *sc;
__m256i *vdata = (__m256i*)data;
size_t ptr;
sc = cc;
ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
while ( len > 0 )
{
size_t clen;
clen = SPH_BLEN - ptr;
if ( clen > len )
clen = len;
memcpy_256( sc->buf + (ptr>>3), vdata, clen>>3 );
vdata = vdata + (clen>>3);
ptr += clen;
len -= clen;
if ( ptr == SPH_BLEN )
{
RFUN( sc->buf, SPH_VAL );
ptr = 0;
}
sc->count += clen;
}
}
#ifdef SPH_UPTR
void
HASH (void *cc, const void *data, size_t len)
{
SPH_XCAT(HASH, _context) *sc;
__m256i *vdata = (__m256i*)data;
unsigned ptr;
if ( len < (2 * SPH_BLEN) )
{
SPH_XCAT(HASH, _short)(cc, data, len);
return;
}
sc = cc;
ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
if ( ptr > 0 )
{
unsigned t;
t = SPH_BLEN - ptr;
SPH_XCAT( HASH, _short )( cc, data, t );
vdata = vdata + (t>>3);
len -= t;
}
SPH_XCAT( HASH, _short )( cc, data, len );
}
#endif
#endif
/*
* Perform padding and produce result. The context is NOT reinitialized
* by this function.
*/
static void
SPH_XCAT( HASH, _addbits_and_close )(void *cc, unsigned ub, unsigned n,
void *dst, unsigned rnum )
{
SPH_XCAT(HASH, _context) *sc;
unsigned ptr, u;
sc = cc;
ptr = (unsigned)sc->count & (SPH_BLEN - 1U);
//uint64_t *b= (uint64_t*)sc->buf;
//uint64_t *s= (uint64_t*)sc->state;
//printf("Vptr 1= %u\n", ptr);
//printf("VBuf %016llx %016llx %016llx %016llx\n", b[0], b[4], b[8], b[12] );
//printf("VBuf %016llx %016llx %016llx %016llx\n", b[16], b[20], b[24], b[28] );
#ifdef PW01
sc->buf[ptr>>3] = _mm256_set1_epi64x( 0x100 >> 8 );
// sc->buf[ptr++] = 0x100 >> 8;
#else
// need to overwrite exactly one byte
// sc->buf[ptr>>3] = _mm256_set_epi64x( 0, 0, 0, 0x80 );
sc->buf[ptr>>3] = _mm256_set1_epi64x( 0x80 );
// ptr++;
#endif
ptr += 8;
//printf("Vptr 2= %u\n", ptr);
//printf("VBuf %016llx %016llx %016llx %016llx\n", b[0], b[4], b[8], b[12] );
//printf("VBuf %016llx %016llx %016llx %016llx\n", b[16], b[20], b[24], b[28] );
if ( ptr > SPH_MAXPAD )
{
memset_zero_256( sc->buf + (ptr>>3), (SPH_BLEN - ptr) >> 3 );
RFUN( sc->buf, SPH_VAL );
memset_zero_256( sc->buf, SPH_MAXPAD >> 3 );
}
else
{
memset_zero_256( sc->buf + (ptr>>3), (SPH_MAXPAD - ptr) >> 3 );
}
#if defined BE64
#if defined PLW1
sc->buf[ SPH_MAXPAD>>3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
#elif defined PLW4
memset_zero_256( sc->buf + (SPH_MAXPAD>>3), ( 2 * SPH_WLEN ) >> 3 );
sc->buf[ (SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count >> 61 ) );
sc->buf[ (SPH_MAXPAD + 3 * SPH_WLEN ) >> 3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
#else
sc->buf[ ( SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count >> 61 ) );
sc->buf[ ( SPH_MAXPAD + 3 * SPH_WLEN ) >> 3 ] =
mm256_bswap_64( _mm256_set1_epi64x( sc->count << 3 ) );
#endif // PLW
#else // LE64
#if defined PLW1
sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
#elif defined PLW4
sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
sc->buf[ ( SPH_MAXPAD + SPH_WLEN ) >> 3 ] =
_mm256_set1_epi64x( sc->count >> 61 );
memset_zero_256( sc->buf + ( ( SPH_MAXPAD + 2 * SPH_WLEN ) >> 3 ),
2 * SPH_WLEN );
#else
sc->buf[ SPH_MAXPAD >> 3 ] = _mm256_set1_epi64x( sc->count << 3 );
sc->buf[ ( SPH_MAXPAD + SPH_WLEN ) >> 3 ] =
_mm256_set1_epi64x( sc->count >> 61 );
#endif // PLW
#endif // LE64
//printf("Vptr 3= %u\n", ptr);
//printf("VBuf %016llx %016llx %016llx %016llx\n", b[0], b[4], b[8], b[12] );
//printf("VBuf %016llx %016llx %016llx %016llx\n", b[16], b[20], b[24], b[28] );
RFUN( sc->buf, SPH_VAL );
//printf("Vptr after= %u\n", ptr);
//printf("VState %016llx %016llx %016llx %016llx\n", s[0], s[4], s[8], s[12] );
//printf("VState %016llx %016llx %016llx %016llx\n", s[16], s[20], s[24], s[28] );
#ifdef SPH_NO_OUTPUT
(void)dst;
(void)rnum;
(void)u;
#else
for ( u = 0; u < rnum; u ++ )
{
#if defined BE64
((__m256i*)dst)[u] = mm256_bswap_64( sc->val[u] );
#else // LE64
((__m256i*)dst)[u] = sc->val[u];
#endif
}
#endif
}
static void
SPH_XCAT( HASH, _mdclose )( void *cc, void *dst, unsigned rnum )
{
SPH_XCAT( HASH, _addbits_and_close )( cc, 0, 0, dst, rnum );
}
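
The padding code in the removed helper encodes the message length in bits as a 128-bit value built from a 64-bit byte count: count << 3 is the low word, and count >> 61 recovers the three bits that the shift pushes out. A small sketch follows; bitlen128 and bit_length are illustrative names.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } bitlen128;

/* Convert a byte count to a 128-bit bit count without a 128-bit type. */
static bitlen128 bit_length( uint64_t byte_count )
{
    bitlen128 r;
    r.lo = byte_count << 3;    /* low 64 bits of byte_count * 8 */
    r.hi = byte_count >> 61;   /* the 3 bits shifted out of the low word */
    return r;
}
```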

File diff suppressed because it is too large.

@@ -1,108 +0,0 @@
/* $Id: sph_whirlpool.h 216 2010-06-08 09:46:57Z tp $ */
/**
* WHIRLPOOL interface.
*
* WHIRLPOOL knows three variants, dubbed "WHIRLPOOL-0" (original
* version, published in 2000, studied by NESSIE), "WHIRLPOOL-1"
* (first revision, 2001, with a new S-box) and "WHIRLPOOL" (current
* version, 2003, with a new diffusion matrix, also described as "plain
* WHIRLPOOL"). All three variants are implemented here.
*
* The original WHIRLPOOL (i.e. WHIRLPOOL-0) was published in: P. S. L.
* M. Barreto, V. Rijmen, "The Whirlpool Hashing Function", First open
* NESSIE Workshop, Leuven, Belgium, November 13--14, 2000.
*
* The current WHIRLPOOL specification and a reference implementation
* can be found on the WHIRLPOOL web page:
* http://paginas.terra.com.br/informatica/paulobarreto/WhirlpoolPage.html
*
* ==========================(LICENSE BEGIN)============================
*
* Copyright (c) 2007-2010 Projet RNRT SAPHIR
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*
* ===========================(LICENSE END)=============================
*
* @file sph_whirlpool.h
* @author Thomas Pornin <thomas.pornin@cryptolog.com>
*/
#ifndef WHIRLPOOL_HASH_4WAY_H__
#define WHIRLPOOL_HASH_4WAY_H__
#ifdef __AVX2__
#include <stddef.h>
#include "algo/sha/sph_types.h"
#include "simd-utils.h"
/**
* Output size (in bits) for WHIRLPOOL.
*/
#define SPH_SIZE_whirlpool 512
/**
* Output size (in bits) for WHIRLPOOL-0.
*/
#define SPH_SIZE_whirlpool0 512
/**
* Output size (in bits) for WHIRLPOOL-1.
*/
#define SPH_SIZE_whirlpool1 512
typedef struct {
__m256i buf[8] __attribute__ ((aligned (64)));
__m256i state[8];
sph_u64 count;
} whirlpool_4way_context;
void whirlpool_4way_init( void *cc );
void whirlpool_4way( void *cc, const void *data, size_t len );
void whirlpool_4way_close( void *cc, void *dst );
/**
* WHIRLPOOL-0 uses the same structure as plain WHIRLPOOL.
*/
typedef whirlpool_4way_context whirlpool0_4way_context;
#define whirlpool0_4way_init whirlpool_4way_init
void whirlpool0_4way( void *cc, const void *data, size_t len );
void whirlpool0_4way_close( void *cc, void *dst );
/**
* WHIRLPOOL-1 uses the same structure as plain WHIRLPOOL.
*/
typedef whirlpool_4way_context whirlpool1_4way_context;
#define whirlpool1_4way_init whirlpool_4way_init
void whirlpool1_4way(void *cc, const void *data, size_t len);
void whirlpool1_4way_close(void *cc, void *dst);
#endif
#endif

@@ -12,6 +12,7 @@
#include "algo/cubehash/cube-hash-2way.h"
#include "algo/cubehash/cubehash_sse2.h"
#include "algo/shavite/sph_shavite.h"
#include "algo/shavite/shavite-hash-2way.h"
#include "algo/simd/simd-hash-2way.h"
#include "algo/echo/aes_ni/hash_api.h"
#if defined(__VAES__)
@@ -22,15 +23,15 @@
#if defined (C11_8WAY)
typedef struct {
union _c11_8way_context_overlay
{
blake512_8way_context blake;
bmw512_8way_context bmw;
skein512_8way_context skein;
jh512_8way_context jh;
keccak512_8way_context keccak;
luffa_4way_context luffa;
cube_4way_context cube;
simd_4way_context simd;
cube_4way_2buf_context cube;
#if defined(__VAES__)
groestl512_4way_context groestl;
shavite512_4way_context shavite;
@@ -40,32 +41,14 @@ typedef struct {
sph_shavite512_context shavite;
hashState_echo echo;
#endif
} c11_8way_ctx_holder;
simd_4way_context simd;
} __attribute__ ((aligned (64)));
typedef union _c11_8way_context_overlay c11_8way_context_overlay;
c11_8way_ctx_holder c11_8way_ctx;
static __thread __m512i c11_8way_midstate[16] __attribute__((aligned(64)));
static __thread blake512_8way_context blake512_8way_ctx __attribute__((aligned(64)));
void init_c11_8way_ctx()
{
blake512_8way_init( &c11_8way_ctx.blake );
bmw512_8way_init( &c11_8way_ctx.bmw );
skein512_8way_init( &c11_8way_ctx.skein );
jh512_8way_init( &c11_8way_ctx.jh );
keccak512_8way_init( &c11_8way_ctx.keccak );
luffa_4way_init( &c11_8way_ctx.luffa, 512 );
cube_4way_init( &c11_8way_ctx.cube, 512, 16, 32 );
simd_4way_init( &c11_8way_ctx.simd, 512 );
#if defined(__VAES__)
groestl512_4way_init( &c11_8way_ctx.groestl, 64 );
shavite512_4way_init( &c11_8way_ctx.shavite );
echo_4way_init( &c11_8way_ctx.echo, 512 );
#else
init_groestl( &c11_8way_ctx.groestl, 64 );
sph_shavite512_init( &c11_8way_ctx.shavite );
init_echo( &c11_8way_ctx.echo, 512 );
#endif
}
void c11_8way_hash( void *state, const void *input )
int c11_8way_hash( void *state, const void *input, int thr_id )
{
uint64_t vhash[8*8] __attribute__ ((aligned (128)));
uint64_t vhashA[4*8] __attribute__ ((aligned (64)));
@@ -78,24 +61,19 @@ void c11_8way_hash( void *state, const void *input )
uint64_t hash5[8] __attribute__ ((aligned (64)));
uint64_t hash6[8] __attribute__ ((aligned (64)));
uint64_t hash7[8] __attribute__ ((aligned (64)));
c11_8way_ctx_holder ctx;
memcpy( &ctx, &c11_8way_ctx, sizeof(c11_8way_ctx) );
c11_8way_context_overlay ctx;
// 1 Blake 4way
blake512_8way_update( &ctx.blake, input, 80 );
blake512_8way_close( &ctx.blake, vhash );
// 2 Bmw
bmw512_8way_update( &ctx.bmw, vhash, 64 );
bmw512_8way_close( &ctx.bmw, vhash );
blake512_8way_final_le( &blake512_8way_ctx, vhash, casti_m512i( input, 9 ),
c11_8way_midstate );
bmw512_8way_full( &ctx.bmw, vhash, vhash, 64 );
#if defined(__VAES__)
rintrlv_8x64_4x128( vhashA, vhashB, vhash, 512 );
groestl512_4way_update_close( &ctx.groestl, vhashA, vhashA, 512 );
groestl512_4way_init( &ctx.groestl, 64 );
groestl512_4way_update_close( &ctx.groestl, vhashB, vhashB, 512 );
groestl512_4way_full( &ctx.groestl, vhashA, vhashA, 64 );
groestl512_4way_full( &ctx.groestl, vhashB, vhashB, 64 );
rintrlv_4x128_8x64( vhash, vhashA, vhashB, 512 );
@@ -104,21 +82,14 @@ void c11_8way_hash( void *state, const void *input )
dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6, hash7,
vhash );
update_and_final_groestl( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash4, (char*)hash4, 512 );
memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash5, (char*)hash5, 512 );
memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash6, (char*)hash6, 512 );
memcpy( &ctx.groestl, &c11_8way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash7, (char*)hash7, 512 );
groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
groestl512_full( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
groestl512_full( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
groestl512_full( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
groestl512_full( &ctx.groestl, (char*)hash4, (char*)hash4, 512 );
groestl512_full( &ctx.groestl, (char*)hash5, (char*)hash5, 512 );
groestl512_full( &ctx.groestl, (char*)hash6, (char*)hash6, 512 );
groestl512_full( &ctx.groestl, (char*)hash7, (char*)hash7, 512 );
intrlv_8x64_512( vhash, hash0, hash1, hash2, hash3, hash4, hash5, hash6,
hash7 );
@@ -126,83 +97,56 @@ void c11_8way_hash( void *state, const void *input )
#endif
// 4 JH
jh512_8way_init( &ctx.jh );
jh512_8way_update( &ctx.jh, vhash, 64 );
jh512_8way_close( &ctx.jh, vhash );
// 5 Keccak
keccak512_8way_init( &ctx.keccak );
keccak512_8way_update( &ctx.keccak, vhash, 64 );
keccak512_8way_close( &ctx.keccak, vhash );
// 6 Skein
skein512_8way_update( &ctx.skein, vhash, 64 );
skein512_8way_close( &ctx.skein, vhash );
skein512_8way_full( &ctx.skein, vhash, vhash, 64 );
rintrlv_8x64_4x128( vhashA, vhashB, vhash, 512 );
luffa_4way_update_close( &ctx.luffa, vhashA, vhashA, 64 );
luffa_4way_init( &ctx.luffa, 512 );
luffa_4way_update_close( &ctx.luffa, vhashB, vhashB, 64 );
cube_4way_update_close( &ctx.cube, vhashA, vhashA, 64 );
cube_4way_init( &ctx.cube, 512, 16, 32 );
cube_4way_update_close( &ctx.cube, vhashB, vhashB, 64 );
luffa512_4way_full( &ctx.luffa, vhashA, vhashA, 64 );
luffa512_4way_full( &ctx.luffa, vhashB, vhashB, 64 );
cube_4way_2buf_full( &ctx.cube, vhashA, vhashB, 512, vhashA, vhashB, 64 );
#if defined(__VAES__)
shavite512_4way_update_close( &ctx.shavite, vhashA, vhashA, 64 );
shavite512_4way_init( &ctx.shavite );
shavite512_4way_update_close( &ctx.shavite, vhashB, vhashB, 64 );
shavite512_4way_full( &ctx.shavite, vhashA, vhashA, 64 );
shavite512_4way_full( &ctx.shavite, vhashB, vhashB, 64 );
#else
dintrlv_4x128_512( hash0, hash1, hash2, hash3, vhashA );
dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhashB );
sph_shavite512( &ctx.shavite, hash0, 64 );
sph_shavite512_close( &ctx.shavite, hash0 );
memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash1, 64 );
sph_shavite512_close( &ctx.shavite, hash1 );
memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash2, 64 );
sph_shavite512_close( &ctx.shavite, hash2 );
memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash3, 64 );
sph_shavite512_close( &ctx.shavite, hash3 );
memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash4, 64 );
sph_shavite512_close( &ctx.shavite, hash4 );
memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash5, 64 );
sph_shavite512_close( &ctx.shavite, hash5 );
memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash6, 64 );
sph_shavite512_close( &ctx.shavite, hash6 );
memcpy( &ctx.shavite, &c11_8way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash7, 64 );
sph_shavite512_close( &ctx.shavite, hash7 );
shavite512_full( &ctx.shavite, hash0, hash0, 64 );
shavite512_full( &ctx.shavite, hash1, hash1, 64 );
shavite512_full( &ctx.shavite, hash2, hash2, 64 );
shavite512_full( &ctx.shavite, hash3, hash3, 64 );
shavite512_full( &ctx.shavite, hash4, hash4, 64 );
shavite512_full( &ctx.shavite, hash5, hash5, 64 );
shavite512_full( &ctx.shavite, hash6, hash6, 64 );
shavite512_full( &ctx.shavite, hash7, hash7, 64 );
intrlv_4x128_512( vhashA, hash0, hash1, hash2, hash3 );
intrlv_4x128_512( vhashB, hash4, hash5, hash6, hash7 );
#endif
simd_4way_update_close( &ctx.simd, vhashA, vhashA, 512 );
simd_4way_init( &ctx.simd, 512 );
simd_4way_update_close( &ctx.simd, vhashB, vhashB, 512 );
simd512_4way_full( &ctx.simd, vhashA, vhashA, 64 );
simd512_4way_full( &ctx.simd, vhashB, vhashB, 64 );
#if defined(__VAES__)
echo_4way_update_close( &ctx.echo, vhashA, vhashA, 512 );
echo_4way_init( &ctx.echo, 512 );
echo_4way_update_close( &ctx.echo, vhashB, vhashB, 512 );
echo_4way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
echo_4way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
dintrlv_4x128_512( hash0, hash1, hash2, hash3, vhashA );
dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhashB );
@@ -212,29 +156,22 @@ void c11_8way_hash( void *state, const void *input )
dintrlv_4x128_512( hash0, hash1, hash2, hash3, vhashA );
dintrlv_4x128_512( hash4, hash5, hash6, hash7, vhashB );
update_final_echo( &ctx.echo, (BitSequence *)hash0,
(const BitSequence *) hash0, 512 );
memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash1,
(const BitSequence *) hash1, 512 );
memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash2,
(const BitSequence *) hash2, 512 );
memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash3,
(const BitSequence *) hash3, 512 );
memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash4,
(const BitSequence *) hash4, 512 );
memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash5,
(const BitSequence *) hash5, 512 );
memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash6,
(const BitSequence *) hash6, 512 );
memcpy( &ctx.echo, &c11_8way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash7,
(const BitSequence *) hash7, 512 );
echo_full( &ctx.echo, (BitSequence *)hash0, 512,
(const BitSequence *)hash0, 64 );
echo_full( &ctx.echo, (BitSequence *)hash1, 512,
(const BitSequence *)hash1, 64 );
echo_full( &ctx.echo, (BitSequence *)hash2, 512,
(const BitSequence *)hash2, 64 );
echo_full( &ctx.echo, (BitSequence *)hash3, 512,
(const BitSequence *)hash3, 64 );
echo_full( &ctx.echo, (BitSequence *)hash4, 512,
(const BitSequence *)hash4, 64 );
echo_full( &ctx.echo, (BitSequence *)hash5, 512,
(const BitSequence *)hash5, 64 );
echo_full( &ctx.echo, (BitSequence *)hash6, 512,
(const BitSequence *)hash6, 64 );
echo_full( &ctx.echo, (BitSequence *)hash7, 512,
(const BitSequence *)hash7, 64 );
#endif
@@ -246,225 +183,223 @@ void c11_8way_hash( void *state, const void *input )
memcpy( state+160, hash5, 32 );
memcpy( state+192, hash6, 32 );
memcpy( state+224, hash7, 32 );
return 1;
}
int scanhash_c11_8way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint32_t hash[8*8] __attribute__ ((aligned (128)));
uint32_t vdata[24*8] __attribute__ ((aligned (64)));
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
uint32_t n = pdata[19];
const uint32_t first_nonce = pdata[19];
int thr_id = mythr->id;
__m512i *noncev = (__m512i*)vdata + 9; // aligned
const uint32_t Htarg = ptarget[7];
uint32_t hash[8*8] __attribute__ ((aligned (128)));
uint32_t vdata[20*8] __attribute__ ((aligned (64)));
__m128i edata[5] __attribute__ ((aligned (64)));
uint32_t *pdata = work->data;
const uint32_t *ptarget = work->target;
const uint32_t first_nonce = pdata[19];
const uint32_t last_nonce = max_nonce - 8;
__m512i *noncev = (__m512i*)vdata + 9;
uint32_t n = first_nonce;
const int thr_id = mythr->id;
const uint32_t targ32_d7 = ptarget[7];
const __m512i eight = _mm512_set1_epi64( 8 );
const bool bench = opt_benchmark;
max_nonce -= 8;
edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );
edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );
mm512_bswap32_intrlv80_8x64( vdata, pdata );
mm512_intrlv80_8x64( vdata, edata );
*noncev = _mm512_add_epi32( *noncev, _mm512_set_epi32(
0, 7, 0, 6, 0, 5, 0, 4, 0, 3, 0, 2, 0, 1, 0, 0 ) );
blake512_8way_prehash_le( &blake512_8way_ctx, c11_8way_midstate, vdata );
do
{
*noncev = mm512_intrlv_blend_32( mm512_bswap_32(
_mm512_set_epi32( n+7, 0, n+6, 0, n+5, 0, n+4, 0,
n+3, 0, n+2, 0, n+1, 0, n, 0 ) ), *noncev );
c11_8way_hash( hash, vdata );
pdata[19] = n;
for ( int i = 0; i < 8; i++ )
if ( ( ( hash+(i<<3) )[7] <= Htarg )
&& fulltest( hash+(i<<3), ptarget ) && !opt_benchmark )
{
pdata[19] = n+i;
submit_solution( work, hash+(i<<3), mythr );
}
n += 8;
} while ( ( n < max_nonce ) && !work_restart[thr_id].restart );
*hashes_done = n - first_nonce;
return 0;
do
{
if ( likely( c11_8way_hash( hash, vdata, thr_id ) ) )
for ( int lane = 0; lane < 8; lane++ )
if ( ( ( hash + ( lane << 3 ) )[7] <= targ32_d7 )
&& valid_hash( hash +( lane << 3 ), ptarget ) && !bench )
{
pdata[19] = n + lane;
submit_solution( work, hash + ( lane << 3 ), mythr );
}
*noncev = _mm512_add_epi32( *noncev, eight );
n += 8;
} while ( ( n < last_nonce ) && !work_restart[thr_id].restart );
pdata[19] = n;
*hashes_done = n - first_nonce;
return 0;
}
#elif defined (C11_4WAY)
typedef struct {
union _c11_4way_context_overlay
{
blake512_4way_context blake;
bmw512_4way_context bmw;
#if defined(__VAES__)
groestl512_2way_context groestl;
echo512_2way_context echo;
#else
hashState_groestl groestl;
skein512_4way_context skein;
jh512_4way_context jh;
keccak512_4way_context keccak;
luffa_2way_context luffa;
cubehashParam cube;
sph_shavite512_context shavite;
simd_2way_context simd;
hashState_echo echo;
} c11_4way_ctx_holder;
#endif
skein512_4way_context skein;
jh512_4way_context jh;
keccak512_4way_context keccak;
luffa_2way_context luffa;
cube_2way_context cube;
shavite512_2way_context shavite;
simd_2way_context simd;
};
typedef union _c11_4way_context_overlay c11_4way_context_overlay;
c11_4way_ctx_holder c11_4way_ctx;
static __thread __m256i c11_4way_midstate[16] __attribute__((aligned(64)));
static __thread blake512_4way_context blake512_4way_ctx __attribute__((aligned(64)));
void init_c11_4way_ctx()
{
blake512_4way_init( &c11_4way_ctx.blake );
bmw512_4way_init( &c11_4way_ctx.bmw );
init_groestl( &c11_4way_ctx.groestl, 64 );
skein512_4way_init( &c11_4way_ctx.skein );
jh512_4way_init( &c11_4way_ctx.jh );
keccak512_4way_init( &c11_4way_ctx.keccak );
luffa_2way_init( &c11_4way_ctx.luffa, 512 );
cubehashInit( &c11_4way_ctx.cube, 512, 16, 32 );
sph_shavite512_init( &c11_4way_ctx.shavite );
simd_2way_init( &c11_4way_ctx.simd, 512 );
init_echo( &c11_4way_ctx.echo, 512 );
}
void c11_4way_hash( void *state, const void *input )
int c11_4way_hash( void *state, const void *input, int thr_id )
{
uint64_t hash0[8] __attribute__ ((aligned (64)));
uint64_t hash1[8] __attribute__ ((aligned (64)));
uint64_t hash2[8] __attribute__ ((aligned (64)));
uint64_t hash3[8] __attribute__ ((aligned (64)));
uint64_t vhash[8*4] __attribute__ ((aligned (64)));
uint64_t vhashA[8*2] __attribute__ ((aligned (64)));
uint64_t vhashB[8*2] __attribute__ ((aligned (64)));
c11_4way_ctx_holder ctx;
memcpy( &ctx, &c11_4way_ctx, sizeof(c11_4way_ctx) );
c11_4way_context_overlay ctx;
// 1 Blake 4way
blake512_4way_update( &ctx.blake, input, 80 );
blake512_4way_close( &ctx.blake, vhash );
blake512_4way_final_le( &blake512_4way_ctx, vhash, casti_m256i( input, 9 ),
c11_4way_midstate );
// 2 Bmw
bmw512_4way_init( &ctx.bmw );
bmw512_4way_update( &ctx.bmw, vhash, 64 );
bmw512_4way_close( &ctx.bmw, vhash );
#if defined(__VAES__)
// Serial
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
// 3 Groestl
update_and_final_groestl( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
memcpy( &ctx.groestl, &c11_4way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
memcpy( &ctx.groestl, &c11_4way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
memcpy( &ctx.groestl, &c11_4way_ctx.groestl, sizeof(hashState_groestl) );
update_and_final_groestl( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
groestl512_2way_full( &ctx.groestl, vhashA, vhashA, 64 );
groestl512_2way_full( &ctx.groestl, vhashB, vhashB, 64 );
// 4way
intrlv_4x64( vhash, hash0, hash1, hash2, hash3, 512 );
rintrlv_2x128_4x64( vhash, vhashA, vhashB, 512 );
// 4 JH
#else
dintrlv_4x64_512( hash0, hash1, hash2, hash3, vhash );
groestl512_full( &ctx.groestl, (char*)hash0, (char*)hash0, 512 );
groestl512_full( &ctx.groestl, (char*)hash1, (char*)hash1, 512 );
groestl512_full( &ctx.groestl, (char*)hash2, (char*)hash2, 512 );
groestl512_full( &ctx.groestl, (char*)hash3, (char*)hash3, 512 );
intrlv_4x64_512( vhash, hash0, hash1, hash2, hash3 );
#endif
jh512_4way_init( &ctx.jh );
jh512_4way_update( &ctx.jh, vhash, 64 );
jh512_4way_close( &ctx.jh, vhash );
// 5 Keccak
keccak512_4way_init( &ctx.keccak );
keccak512_4way_update( &ctx.keccak, vhash, 64 );
keccak512_4way_close( &ctx.keccak, vhash );
// 6 Skein
skein512_4way_update( &ctx.skein, vhash, 64 );
skein512_4way_close( &ctx.skein, vhash );
skein512_4way_full( &ctx.skein, vhash, vhash, 64 );
// Serial
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
rintrlv_4x64_2x128( vhashA, vhashB, vhash, 512 );
// 7 Luffa
intrlv_2x128( vhash, hash0, hash1, 512 );
intrlv_2x128( vhashB, hash2, hash3, 512 );
luffa_2way_update_close( &ctx.luffa, vhash, vhash, 64 );
luffa_2way_init( &ctx.luffa, 512 );
luffa_2way_update_close( &ctx.luffa, vhashB, vhashB, 64 );
dintrlv_2x128( hash0, hash1, vhash, 512 );
dintrlv_2x128( hash2, hash3, vhashB, 512 );
luffa512_2way_full( &ctx.luffa, vhashA, vhashA, 64 );
luffa512_2way_full( &ctx.luffa, vhashB, vhashB, 64 );
// 8 Cubehash
cubehashUpdateDigest( &ctx.cube, (byte*)hash0, (const byte*) hash0, 64 );
memcpy( &ctx.cube, &c11_4way_ctx.cube, sizeof(cubehashParam) );
cubehashUpdateDigest( &ctx.cube, (byte*)hash1, (const byte*) hash1, 64 );
memcpy( &ctx.cube, &c11_4way_ctx.cube, sizeof(cubehashParam) );
cubehashUpdateDigest( &ctx.cube, (byte*)hash2, (const byte*) hash2, 64 );
memcpy( &ctx.cube, &c11_4way_ctx.cube, sizeof(cubehashParam) );
cubehashUpdateDigest( &ctx.cube, (byte*)hash3, (const byte*) hash3, 64 );
cube_2way_full( &ctx.cube, vhashA, 512, vhashA, 64 );
cube_2way_full( &ctx.cube, vhashB, 512, vhashB, 64 );
// 9 Shavite
sph_shavite512( &ctx.shavite, hash0, 64 );
sph_shavite512_close( &ctx.shavite, hash0 );
memcpy( &ctx.shavite, &c11_4way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash1, 64 );
sph_shavite512_close( &ctx.shavite, hash1 );
memcpy( &ctx.shavite, &c11_4way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash2, 64 );
sph_shavite512_close( &ctx.shavite, hash2 );
memcpy( &ctx.shavite, &c11_4way_ctx.shavite,
sizeof(sph_shavite512_context) );
sph_shavite512( &ctx.shavite, hash3, 64 );
sph_shavite512_close( &ctx.shavite, hash3 );
shavite512_2way_full( &ctx.shavite, vhashA, vhashA, 64 );
shavite512_2way_full( &ctx.shavite, vhashB, vhashB, 64 );
// 10 Simd
intrlv_2x128( vhash, hash0, hash1, 512 );
intrlv_2x128( vhashB, hash2, hash3, 512 );
simd_2way_update_close( &ctx.simd, vhash, vhash, 512 );
simd_2way_init( &ctx.simd, 512 );
simd_2way_update_close( &ctx.simd, vhashB, vhashB, 512 );
dintrlv_2x128( hash0, hash1, vhash, 512 );
dintrlv_2x128( hash2, hash3, vhashB, 512 );
simd512_2way_full( &ctx.simd, vhashA, vhashA, 64 );
simd512_2way_full( &ctx.simd, vhashB, vhashB, 64 );
// 11 Echo
update_final_echo( &ctx.echo, (BitSequence *)hash0,
(const BitSequence *) hash0, 512 );
memcpy( &ctx.echo, &c11_4way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash1,
(const BitSequence *) hash1, 512 );
memcpy( &ctx.echo, &c11_4way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash2,
(const BitSequence *) hash2, 512 );
memcpy( &ctx.echo, &c11_4way_ctx.echo, sizeof(hashState_echo) );
update_final_echo( &ctx.echo, (BitSequence *)hash3,
(const BitSequence *) hash3, 512 );
#if defined(__VAES__)
echo_2way_full( &ctx.echo, vhashA, 512, vhashA, 64 );
echo_2way_full( &ctx.echo, vhashB, 512, vhashB, 64 );
dintrlv_2x128_512( hash0, hash1, vhashA );
dintrlv_2x128_512( hash2, hash3, vhashB );
#else
dintrlv_2x128_512( hash0, hash1, vhashA );
dintrlv_2x128_512( hash2, hash3, vhashB );
echo_full( &ctx.echo, (BitSequence *)hash0, 512,
(const BitSequence *)hash0, 64 );
echo_full( &ctx.echo, (BitSequence *)hash1, 512,
(const BitSequence *)hash1, 64 );
echo_full( &ctx.echo, (BitSequence *)hash2, 512,
(const BitSequence *)hash2, 64 );
echo_full( &ctx.echo, (BitSequence *)hash3, 512,
(const BitSequence *)hash3, 64 );
#endif
memcpy( state, hash0, 32 );
memcpy( state+32, hash1, 32 );
memcpy( state+64, hash2, 32 );
memcpy( state+96, hash3, 32 );
return 1;
}
int scanhash_c11_4way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint32_t hash[4*8] __attribute__ ((aligned (64)));
uint32_t vdata[24*4] __attribute__ ((aligned (64)));
uint32_t *pdata = work->data;
uint32_t *ptarget = work->target;
uint32_t n = pdata[19];
const uint32_t first_nonce = pdata[19];
int thr_id = mythr->id; // thr_id arg is deprecated
__m256i *noncev = (__m256i*)vdata + 9; // aligned
const uint32_t Htarg = ptarget[7];
uint32_t hash[8*4] __attribute__ ((aligned (128)));
uint32_t vdata[20*4] __attribute__ ((aligned (64)));
__m128i edata[5] __attribute__ ((aligned (32)));
uint32_t *pdata = work->data;
const uint32_t *ptarget = work->target;
const uint32_t first_nonce = pdata[19];
const uint32_t last_nonce = max_nonce - 8;
__m256i *noncev = (__m256i*)vdata + 9;
uint32_t n = first_nonce;
const int thr_id = mythr->id;
const uint32_t targ32_d7 = ptarget[7];
const __m256i four = _mm256_set1_epi64x( 4 );
const bool bench = opt_benchmark;
mm256_bswap32_intrlv80_4x64( vdata, pdata );
edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );
edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );
do
{
*noncev = mm256_intrlv_blend_32( mm256_bswap_32(
_mm256_set_epi32( n+3, 0, n+2, 0, n+1, 0, n, 0 ) ), *noncev );
mm256_intrlv80_4x64( vdata, edata );
c11_4way_hash( hash, vdata );
pdata[19] = n;
*noncev = _mm256_add_epi32( *noncev, _mm256_set_epi32(
0, 3, 0, 2, 0, 1, 0, 0 ) );
blake512_4way_prehash_le( &blake512_4way_ctx, c11_4way_midstate, vdata );
for ( int i = 0; i < 4; i++ )
if ( ( ( hash+(i<<3) )[7] <= Htarg )
&& fulltest( hash+(i<<3), ptarget ) && !opt_benchmark )
{
pdata[19] = n+i;
submit_solution( work, hash+(i<<3), mythr );
}
n += 4;
} while ( ( n < max_nonce ) && !work_restart[thr_id].restart );
*hashes_done = n - first_nonce + 1;
return 0;
do
{
if ( likely( c11_4way_hash( hash, vdata, thr_id ) ) )
for ( int lane = 0; lane < 4; lane++ )
if ( ( ( hash + ( lane << 3 ) )[7] <= targ32_d7 )
&& valid_hash( hash +( lane << 3 ), ptarget ) && !bench )
{
pdata[19] = n + lane;
submit_solution( work, hash + ( lane << 3 ), mythr );
}
*noncev = _mm256_add_epi32( *noncev, four );
n += 4;
} while ( ( n < last_nonce ) && !work_restart[thr_id].restart );
pdata[19] = n;
*hashes_done = n - first_nonce;
return 0;
}
#endif
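
The rewritten scanhash loops above share one shape: hash a batch of nonces, test word 7 of each lane's 8-word result against the target, then advance the nonce by the lane count until `last_nonce` is reached. The following is a minimal scalar sketch of that shape only. `toy_hash_lanes`, `toy_scan`, `scan_result` and `GOLDEN` are hypothetical names invented for illustration, the "hash" is a stand-in rather than C11, and `valid_hash`, `submit_solution` and the restart check are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define LANES          4    /* hashes produced per call, as in a 4-way path */
#define WORDS_PER_HASH 8    /* each lane yields eight 32-bit words          */
#define GOLDEN         42u  /* hypothetical nonce whose toy hash is lowest  */

/* Hypothetical stand-in for a multi-lane hash function. Word 7 of each
   lane is the distance between the lane's nonce and GOLDEN. */
static int toy_hash_lanes( uint32_t *hash, uint32_t n )
{
   for ( int lane = 0; lane < LANES; lane++ )
   {
      const uint32_t nonce = n + lane;
      const uint32_t d = nonce > GOLDEN ? nonce - GOLDEN : GOLDEN - nonce;
      for ( int w = 0; w < WORDS_PER_HASH; w++ )
         hash[ ( lane << 3 ) + w ] = ( w == 7 ) ? d : 0xffffffffu;
   }
   return 1;
}

typedef struct
{
   uint32_t hits;         /* lanes that passed the word-7 test */
   uint32_t last_hit;     /* nonce of the most recent hit      */
   uint32_t hashes_done;  /* nonces consumed by the scan       */
} scan_result;

/* Scan from first_nonce in steps of LANES, stopping before max_nonce. */
static scan_result toy_scan( uint32_t first_nonce, uint32_t max_nonce,
                             uint32_t targ32_d7 )
{
   uint32_t hash[ LANES * WORDS_PER_HASH ];
   scan_result r = { 0, 0, 0 };
   assert( max_nonce >= first_nonce + LANES );
   const uint32_t last_nonce = max_nonce - LANES;
   uint32_t n = first_nonce;

   do
   {
      if ( toy_hash_lanes( hash, n ) )
         for ( int lane = 0; lane < LANES; lane++ )
            if ( ( hash + ( lane << 3 ) )[7] <= targ32_d7 )
            {
               r.hits++;
               r.last_hit = n + lane;   /* lane index recovers the nonce */
            }
      n += LANES;
   } while ( n < last_nonce );

   r.hashes_done = n - first_nonce;
   return r;
}
```

Calling `toy_scan( 0, 64, 0 )` walks the batches starting at 0, 4, ..., 56 and reports the single lane whose word 7 is zero. Subtracting the lane count from `max_nonce` up front keeps the final batch from running past the assigned range.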

@@ -3,11 +3,9 @@
bool register_c11_algo( algo_gate_t* gate )
{
#if defined (C11_8WAY)
init_c11_8way_ctx();
gate->scanhash = (void*)&scanhash_c11_8way;
gate->hash = (void*)&c11_8way_hash;
#elif defined (C11_4WAY)
init_c11_4way_ctx();
gate->scanhash = (void*)&scanhash_c11_4way;
gate->hash = (void*)&c11_4way_hash;
#else

@@ -14,14 +14,14 @@
bool register_c11_algo( algo_gate_t* gate );
#if defined(C11_8WAY)
void c11_8way_hash( void *state, const void *input );
int c11_8way_hash( void *state, const void *input, int thr_id );
int scanhash_c11_8way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr );
void init_c11_8way_ctx();
//void init_c11_8way_ctx();
#elif defined(C11_4WAY)
void c11_4way_hash( void *state, const void *input );
int c11_4way_hash( void *state, const void *input, int thr_id );
int scanhash_c11_4way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr );
void init_c11_4way_ctx();

@@ -112,8 +112,9 @@ void timetravel_4way_hash(void *output, const void *input)
intrlv_4x64( vhashB, hash0, hash1, hash2, hash3, dataLen<<3 );
break;
case 3:
skein512_4way_update( &ctx.skein, vhashA, dataLen );
skein512_4way_close( &ctx.skein, vhashB );
skein512_4way_full( &ctx.skein, vhashB, vhashA, dataLen );
// skein512_4way_update( &ctx.skein, vhashA, dataLen );
// skein512_4way_close( &ctx.skein, vhashB );
if ( i == 7 )
dintrlv_4x64( hash0, hash1, hash2, hash3, vhashB, dataLen<<3 );
break;
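
Several hunks in this comparison replace an `update`/`close` pair (often preceded by an `init` or a `memcpy` from a pre-initialised global context) with a single `*_full` call. The sketch below illustrates that one-shot pattern on a toy FNV-1a checksum. It assumes the `*_full` functions fold initialisation into the call, which the diff suggests but does not show. All `toy_*` names are hypothetical and are not part of cpuminer-opt.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t h; } toy_ctx;   /* hypothetical hash context */

/* Staged interface: the caller must initialise before every message. */
static void toy_init( toy_ctx *c )
{
   c->h = 0xcbf29ce484222325ULL;          /* FNV-1a 64-bit offset basis */
}

static void toy_update( toy_ctx *c, const void *data, size_t len )
{
   const unsigned char *p = data;
   for ( size_t i = 0; i < len; i++ )
   {
      c->h ^= p[i];
      c->h *= 0x100000001b3ULL;           /* FNV-1a 64-bit prime */
   }
}

static void toy_close( toy_ctx *c, uint64_t *out )
{
   *out = c->h;
}

/* One-shot interface: initialises the context itself, so the caller needs
   neither a pre-initialised template nor a memcpy between uses. */
static void toy_full( toy_ctx *c, uint64_t *out, const void *data, size_t len )
{
   toy_init( c );
   toy_update( c, data, len );
   toy_close( c, out );
}
```

Because `toy_full` resets the context on entry, the same context can be reused for consecutive messages without any intermediate reinitialisation, which is the property the removed `memcpy` calls were providing.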

@@ -118,8 +118,9 @@ void timetravel10_4way_hash(void *output, const void *input)
intrlv_4x64( vhashB, hash0, hash1, hash2, hash3, dataLen<<3 );
break;
case 3:
skein512_4way_update( &ctx.skein, vhashA, dataLen );
skein512_4way_close( &ctx.skein, vhashB );
skein512_4way_full( &ctx.skein, vhashB, vhashA, dataLen );
// skein512_4way_update( &ctx.skein, vhashA, dataLen );
// skein512_4way_close( &ctx.skein, vhashB );
if ( i == 9 )
dintrlv_4x64( hash0, hash1, hash2, hash3, vhashB, dataLen<<3 );
break;

@@ -114,7 +114,7 @@ int scanhash_skunk_8way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n +=8;
} while ( likely( ( n < last_nonce ) && !( *restart ) ) );
pdata[19] = n;
@@ -218,7 +218,7 @@ int scanhash_skunk_4way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n +=4;
} while ( likely( ( n < last_nonce ) && !( *restart ) ) );
pdata[19] = n;
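
The nonce increment above adds a 64-bit constant with a packed 32-bit add, so only the upper 32-bit half of each 64-bit element changes. In this interleaved layout that half appears to hold the nonce (header word 19). The sketch below emulates one 64-bit element of such an add in portable C. `add_epi32_on_u64` is a hypothetical helper written for illustration, not an intrinsic or a cpuminer-opt function.

```c
#include <stdint.h>

/* Emulate a packed 32-bit add on a single 64-bit element: the two halves
   are added independently and each wraps on its own, with no carry from
   the low half into the high half. */
static uint64_t add_epi32_on_u64( uint64_t a, uint64_t b )
{
   const uint32_t lo = (uint32_t)a + (uint32_t)b;
   const uint32_t hi = (uint32_t)( a >> 32 ) + (uint32_t)( b >> 32 );
   return ( (uint64_t)hi << 32 ) | lo;
}
```

With `b = 0x0000000800000000` the low half is left untouched and the high half advances by 8, matching the `n += 8` that follows in the 8-way loop. The 4-way constant `0x0000000400000000` behaves the same way with a step of 4.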

@@ -33,9 +33,10 @@ void polytimos_4way_hash( void *output, const void *input )
uint64_t vhash[8*4] __attribute__ ((aligned (64)));
poly_4way_context_overlay ctx;
skein512_4way_init( &ctx.skein );
skein512_4way_update( &ctx.skein, input, 80 );
skein512_4way_close( &ctx.skein, vhash );
skein512_4way_full( &ctx.skein, vhash, input, 80 );
// skein512_4way_init( &ctx.skein );
// skein512_4way_update( &ctx.skein, input, 80 );
// skein512_4way_close( &ctx.skein, vhash );
// Need to convert from 64 bit interleaved to 32 bit interleaved.
uint32_t vhash32[16*4];

@@ -38,8 +38,10 @@ void veltor_4way_hash( void *output, const void *input )
veltor_4way_ctx_holder ctx __attribute__ ((aligned (64)));
memcpy( &ctx, &veltor_4way_ctx, sizeof(veltor_4way_ctx) );
skein512_4way_update( &ctx.skein, input, 80 );
skein512_4way_close( &ctx.skein, vhash );
// skein512_4way_update( &ctx.skein, input, 80 );
// skein512_4way_close( &ctx.skein, vhash );
skein512_4way_full( &ctx.skein, vhash, input, 80 );
dintrlv_4x64( hash0, hash1, hash2, hash3, vhash, 512 );
sph_shavite512( &ctx.shavite, hash0, 64 );
@@ -105,7 +107,7 @@ int scanhash_veltor_4way( struct work *work, uint32_t max_nonce,
pdata[19] = n;
for ( int i = 0; i < 4; i++ )
if ( (hash+(i<<3))[7] <= Htarg && fulltest( hash+(i<<3), ptarget ) )
if ( (hash+(i<<3))[7] <= Htarg && fulltest( hash+(i<<3), ptarget ) && ! opt_benchmark )
{
pdata[19] = n+i;
submit_solution( work, hash+(i<<3), mythr );

@@ -18,6 +18,7 @@
#include "algo/shabal/sph_shabal.h"
#include "algo/whirlpool/sph_whirlpool.h"
#include "algo/sha/sph_sha2.h"
#include "algo/yespower/yespower.h"
#if defined(__AES__)
#include "algo/echo/aes_ni/hash_api.h"
#include "algo/groestl/aes_ni/hash-groestl.h"
@@ -31,6 +32,9 @@
// Config
#define MINOTAUR_ALGO_COUNT 16
static const yespower_params_t minotaurx_yespower_params =
{ YESPOWER_1_0, 2048, 8, "et in arcadia ego", 17 };
typedef struct TortureNode TortureNode;
typedef struct TortureGarden TortureGarden;
@@ -59,20 +63,22 @@ struct TortureGarden
sph_shabal512_context shabal;
sph_whirlpool_context whirlpool;
sph_sha512_context sha512;
struct TortureNode {
struct TortureNode
{
unsigned int algo;
TortureNode *child[2];
} nodes[22];
} __attribute__ ((aligned (64)));
// Get a 64-byte hash for given 64-byte input, using given TortureGarden contexts and given algo index
static void get_hash( void *output, const void *input, TortureGarden *garden,
unsigned int algo )
static int get_hash( void *output, const void *input, TortureGarden *garden,
unsigned int algo, int thr_id )
{
unsigned char hash[64] __attribute__ ((aligned (64)));
int rc = 1;
switch (algo) {
switch ( algo )
{
case 0:
sph_blake512_init(&garden->blake);
sph_blake512(&garden->blake, input, 64);
@@ -97,14 +103,14 @@ static void get_hash( void *output, const void *input, TortureGarden *garden,
sph_echo512(&garden->echo, input, 64);
sph_echo512_close(&garden->echo, hash);
#endif
break;
break;
case 4:
#if defined(__AES__)
fugue512_full( &garden->fugue, hash, input, 64 );
#else
sph_fugue512_full( &garden->fugue, hash, input, 64 );
#endif
break;
break;
case 5:
#if defined(__AES__)
groestl512_full( &garden->groestl, (char*)hash, (char*)input, 512 );
@@ -113,7 +119,7 @@ static void get_hash( void *output, const void *input, TortureGarden *garden,
sph_groestl512(&garden->groestl, input, 64);
sph_groestl512_close(&garden->groestl, hash);
#endif
break;
break;
case 6:
sph_hamsi512_init(&garden->hamsi);
sph_hamsi512(&garden->hamsi, input, 64);
@@ -164,16 +170,20 @@ static void get_hash( void *output, const void *input, TortureGarden *garden,
sph_whirlpool(&garden->whirlpool, input, 64);
sph_whirlpool_close(&garden->whirlpool, hash);
break;
case 16: // minotaurx only, yespower hardcoded for last node
rc = yespower_tls( input, 64, &minotaurx_yespower_params,
(yespower_binary_t*)hash, thr_id );
}
memcpy(output, hash, 64);
return rc;
}
static __thread TortureGarden garden;
bool initialize_torture_garden()
{
// Create torture garden nodes. Note that both sides of 19 and 20 lead to 21, and 21 has no children (to make traversal complete).
// Create torture garden nodes. Note that both sides of 19 and 20 lead to 21, and 21 has no children (to make traversal complete).
garden.nodes[ 0].child[0] = &garden.nodes[ 1];
garden.nodes[ 0].child[1] = &garden.nodes[ 2];
@@ -219,7 +229,6 @@ bool initialize_torture_garden()
garden.nodes[20].child[1] = &garden.nodes[21];
garden.nodes[21].child[0] = NULL;
garden.nodes[21].child[1] = NULL;
return true;
}
@@ -227,38 +236,45 @@ bool initialize_torture_garden()
int minotaur_hash( void *output, const void *input, int thr_id )
{
unsigned char hash[64] __attribute__ ((aligned (64)));
int rc = 1;
// Find initial sha512 hash
sph_sha512_init( &garden.sha512 );
sph_sha512( &garden.sha512, input, 80 );
sph_sha512_close( &garden.sha512, hash );
// algo 6 (Hamsi) is very slow. It's faster to skip hashing this nonce
// if Hamsi is needed but only the first and last functions are
// currently known. Abort if either is Hamsi.
if ( ( ( hash[ 0] % MINOTAUR_ALGO_COUNT ) == 6 )
|| ( ( hash[21] % MINOTAUR_ALGO_COUNT ) == 6 ) )
return 0;
if ( opt_algo != ALGO_MINOTAURX )
{
// algo 6 (Hamsi) is very slow. It's faster to skip hashing this nonce
// if Hamsi is needed but only the first and last functions are
// currently known. Abort if either is Hamsi.
if ( ( ( hash[ 0] % MINOTAUR_ALGO_COUNT ) == 6 )
|| ( ( hash[21] % MINOTAUR_ALGO_COUNT ) == 6 ) )
return 0;
}
// Assign algos to torture garden nodes based on initial hash
for ( int i = 0; i < 22; i++ )
garden.nodes[i].algo = hash[i] % MINOTAUR_ALGO_COUNT;
// MinotaurX override algo for last node with yespower
if ( opt_algo == ALGO_MINOTAURX )
garden.nodes[21].algo = MINOTAUR_ALGO_COUNT;
// Send the initial hash through the torture garden
TortureNode *node = &garden.nodes[0];
while ( node )
while ( rc && node )
{
get_hash( hash, hash, &garden, node->algo );
rc = get_hash( hash, hash, &garden, node->algo, thr_id );
node = node->child[ hash[63] & 1 ];
}
memcpy( output, hash, 32 );
return 1;
return rc;
}
int scanhash_minotaur( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
uint64_t *hashes_done, struct thr_info *mythr )
{
uint32_t edata[20] __attribute__((aligned(64)));
uint32_t hash[8] __attribute__((aligned(64)));
@@ -277,7 +293,7 @@ int scanhash_minotaur( struct work *work, uint32_t max_nonce,
edata[19] = n;
if ( likely( algo_gate.hash( hash, edata, thr_id ) ) )
{
if ( unlikely( valid_hash( hash, ptarget ) && !bench ) )
if ( unlikely( valid_hash( hash, ptarget ) && !bench ) )
{
pdata[19] = bswap_32( n );
submit_solution( work, hash, mythr );
@@ -291,12 +307,14 @@ int scanhash_minotaur( struct work *work, uint32_t max_nonce,
return 0;
}
// hash function has hooks for minotaurx
bool register_minotaur_algo( algo_gate_t* gate )
{
gate->scanhash = (void*)&scanhash_minotaur;
gate->hash = (void*)&minotaur_hash;
gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT;
gate->scanhash = (void*)&scanhash_minotaur;
gate->hash = (void*)&minotaur_hash;
gate->miner_thread_init = (void*)&initialize_torture_garden;
gate->optimizations = SSE2_OPT | AES_OPT | AVX2_OPT | AVX512_OPT;
if ( opt_algo == ALGO_MINOTAURX ) gate->optimizations |= SHA_OPT;
return true;
};
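
Review note: the traversal in `minotaur_hash` above can be modelled in isolation. The sketch below is a simplified stand-in, not the project's code: `Node`, `assign_algos`, `walk` and `toy_step` are hypothetical names, and `toy_step` replaces the real hash functions with a trivial state update so the control flow is easy to follow. It assumes, as in the diff, 22 nodes, 16 base algorithms, an extra index (16) forced onto the last node for MinotaurX, and a loop that stops as soon as a step returns 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_COUNT 22   /* same node count as the diff */
#define ALGO_COUNT 16   /* same value as MINOTAUR_ALGO_COUNT */

typedef struct Node Node;
struct Node {
    unsigned int algo;
    Node *child[2];
};

/* One hashing step: updates the 64-byte state, returns 0 to abort. */
typedef int (*step_fn)(uint8_t state[64], unsigned int algo);

/* Pick an algorithm per node from the seed hash bytes. When minotaurx
 * is set, the last node gets the extra index ALGO_COUNT instead. */
static void assign_algos(Node nodes[NODE_COUNT], const uint8_t seed[64],
                         int minotaurx)
{
    assert(nodes != NULL && seed != NULL);
    for (int i = 0; i < NODE_COUNT; i++)
        nodes[i].algo = seed[i] % ALGO_COUNT;
    if (minotaurx)
        nodes[NODE_COUNT - 1].algo = ALGO_COUNT;
}

/* Walk the tree: hash at each node, then branch on the low bit of the
 * last state byte. Stops at a leaf or when a step reports failure. */
static int walk(Node *root, uint8_t state[64], step_fn step)
{
    int rc = 1;
    Node *node = root;
    while (rc && node) {
        rc = step(state, node->algo);
        node = node->child[state[63] & 1];
    }
    return rc;
}

/* Hypothetical stand-in for a real hash: adds algo to the last byte. */
static int toy_step(uint8_t state[64], unsigned int algo)
{
    state[63] = (uint8_t)(state[63] + algo);
    return 1;
}
```

Propagating the step's return code through the loop is what lets an interrupted yespower call end the walk early instead of hashing the remaining nodes.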

View File

@@ -163,7 +163,7 @@ int x16r_8way_hash_generic( void* output, const void* input, int thrid )
{
intrlv_8x64( vhash, in0, in1, in2, in3, in4, in5, in6, in7,
size<<3 );
bmw512_8way_update( &ctx.bmw, vhash, size );
bmw512_8way_update( &ctx.bmw, vhash, size );
}
bmw512_8way_close( &ctx.bmw, vhash );
dintrlv_8x64_512( hash0, hash1, hash2, hash3, hash4, hash5, hash6,
@@ -536,7 +536,7 @@ int scanhash_x16r_8way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( ( n < last_nonce ) && !(*restart) ) );
pdata[19] = n;
@@ -963,7 +963,7 @@ int scanhash_x16r_4way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( ( n < last_nonce ) && !(*restart) ) );
pdata[19] = n;
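
Review note: the constants `0x0000000800000000` and `0x0000000400000000` are broadcast to every 64-bit lane and then added with a 32-bit-element add, so only the upper 32 bits of each lane advance (by 8 or 4) and no carry crosses between the two halves. The scalar model below illustrates that for a single lane; `add_epi32_lane` is a hypothetical helper written for this note and uses no intrinsics, so it says nothing about how the library implements the operation.

```c
#include <stdint.h>

/* Scalar model of a 32-bit-element add applied to one 64-bit lane:
 * the low and high halves are added independently, each wrapping
 * modulo 2^32, with no carry from the low half into the high half. */
static uint64_t add_epi32_lane(uint64_t lane, uint64_t addend)
{
    uint32_t lo = (uint32_t)lane + (uint32_t)addend;
    uint32_t hi = (uint32_t)(lane >> 32) + (uint32_t)(addend >> 32);
    return ((uint64_t)hi << 32) | lo;
}
```

For example, a lane holding `0x00000005DEADBEEF` becomes `0x0000000DDEADBEEF` after adding `0x0000000800000000`: the upper half moves from 5 to 13 and the lower half is untouched.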

View File

@@ -198,7 +198,7 @@ void veil_build_extraheader( struct work* g_work, struct stratum_ctx* sctx )
{
char* data;
data = (char*)malloc( 2 + strlen( denom10_str ) * 4 + 16 * 4
+ strlen( merkleroot_str ) * 3 );
+ strlen( merkleroot_str ) * 3 + 1 );
// Build the block header veildatahash in hex
sprintf( data, "%s%s%s%s%s%s%s%s%s%s%s%s",
merkleroot_str, witmerkleroot_str, "04",
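
Review note: the `+ 1` added to the `malloc` size accounts for the terminating NUL that `sprintf` writes after the last character. A minimal sketch of the same sizing rule follows; `join3` is a hypothetical helper, not part of the project, and it uses `snprintf` so that a miscount would truncate rather than overflow.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: concatenate three strings into a new buffer.
 * The allocation is the sum of the lengths plus one byte for the NUL. */
static char *join3(const char *a, const char *b, const char *c)
{
    size_t len = strlen(a) + strlen(b) + strlen(c);
    char *out = malloc(len + 1);          /* +1 for the terminator */
    if (out == NULL)
        return NULL;
    int written = snprintf(out, len + 1, "%s%s%s", a, b, c);
    assert(written == (int)len);          /* nothing was truncated */
    return out;
}
```

The caller owns the returned buffer and is expected to `free` it.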

View File

@@ -31,7 +31,7 @@ int scanhash_x16rt_8way( struct work *work, uint32_t max_nonce,
x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
s_ntime = masked_ntime;
if ( !thr_id )
applog( LOG_INFO, "Hash order %s, Nime %08x, time hash %08x",
applog( LOG_INFO, "Hash order %s, Ntime %08x, time hash %08x",
x16r_hash_order, bswap_32( pdata[17] ), timeHash );
}
@@ -49,7 +49,7 @@ int scanhash_x16rt_8way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( ( n < last_nonce ) && !(*restart) ) );
pdata[19] = n;
@@ -85,7 +85,7 @@ int scanhash_x16rt_4way( struct work *work, uint32_t max_nonce,
x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
s_ntime = masked_ntime;
if ( !thr_id )
applog( LOG_INFO, "Hash order %s, Nime %08x, time hash %08x",
applog( LOG_INFO, "Hash order %s, Ntime %08x, time hash %08x",
x16r_hash_order, bswap_32( pdata[17] ), timeHash );
}
@@ -102,7 +102,7 @@ int scanhash_x16rt_4way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( ( n < last_nonce ) && !(*restart) );
pdata[19] = n;

View File

@@ -26,7 +26,7 @@ int scanhash_x16rt( struct work *work, uint32_t max_nonce,
x16rt_getTimeHash( masked_ntime, &timeHash );
x16rt_getAlgoString( &timeHash[0], x16r_hash_order );
s_ntime = masked_ntime;
if ( opt_debug && !thr_id )
if ( !thr_id )
applog( LOG_INFO, "hash order: %s time: (%08x) time hash: (%08x)",
x16r_hash_order, swab32( pdata[17] ), timeHash );
}

View File

@@ -658,7 +658,7 @@ int scanhash_x16rv2_8way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( ( n < last_nonce ) && !(*restart) ) );
pdata[19] = n;
@@ -1143,7 +1143,7 @@ int scanhash_x16rv2_4way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( ( n < last_nonce ) && !(*restart) ) );
pdata[19] = n;

View File

@@ -181,7 +181,7 @@ int scanhash_x21s_8way( struct work *work, uint32_t max_nonce,
}
}
*noncev = _mm512_add_epi32( *noncev,
m512_const1_64( 0x0000000800000000 ) );
_mm512_set1_epi64( 0x0000000800000000 ) );
n += 8;
} while ( likely( ( n < last_nonce ) && !(*restart) ) );
pdata[19] = n;
@@ -335,7 +335,7 @@ int scanhash_x21s_4way( struct work *work, uint32_t max_nonce,
submit_solution( work, hash+(i<<3), mythr );
}
*noncev = _mm256_add_epi32( *noncev,
m256_const1_64( 0x0000000400000000 ) );
_mm256_set1_epi64x( 0x0000000400000000 ) );
n += 4;
} while ( likely( ( n < last_nonce ) && !(*restart) ) );
pdata[19] = n;

View File

@@ -254,9 +254,10 @@ int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
uint32_t n = first_nonce;
const int thr_id = mythr->id;
const uint32_t targ32_d7 = ptarget[7];
const __m512i eight = m512_const1_64( 8 );
const __m512i eight = _mm512_set1_epi64( 8 );
const bool bench = opt_benchmark;
// convert LE32 to LE64
edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
@@ -264,10 +265,8 @@ int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );
mm512_intrlv80_8x64( vdata, edata );
*noncev = mm512_intrlv_blend_32( *noncev,
_mm512_set_epi32( 0, n+7, 0, n+6, 0, n+5, 0, n+4,
0, n+3, 0, n+2, 0, n+1, 0, n ) );
*noncev = _mm512_add_epi32( *noncev, _mm512_set_epi32(
0,7, 0,6, 0,5, 0,4, 0,3, 0,2, 0,1, 0,0 ) );
blake512_8way_prehash_le( &blake512_8way_ctx, x17_8way_midstate, vdata );
do
@@ -279,7 +278,7 @@ int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
extr_lane_8x32( lane_hash, hash32, lane, 256 );
if ( likely( valid_hash( lane_hash, ptarget ) ) )
{
pdata[19] = n + lane;
pdata[19] = n + lane;
submit_solution( work, lane_hash, mythr );
}
}
@@ -291,8 +290,6 @@ int scanhash_x17_8way( struct work *work, uint32_t max_nonce,
return 0;
}
#elif defined(X17_4WAY)
union _x17_4way_context_overlay
@@ -322,6 +319,9 @@ union _x17_4way_context_overlay
};
typedef union _x17_4way_context_overlay x17_4way_context_overlay;
static __thread __m256i x17_4way_midstate[16] __attribute__((aligned(64)));
static __thread blake512_4way_context blake512_4way_ctx __attribute__((aligned(64)));
int x17_4way_hash( void *state, const void *input, int thr_id )
{
uint64_t vhash[8*4] __attribute__ ((aligned (64)));
@@ -333,7 +333,10 @@ int x17_4way_hash( void *state, const void *input, int thr_id )
uint64_t hash3[8] __attribute__ ((aligned (32)));
x17_4way_context_overlay ctx;
blake512_4way_full( &ctx.blake, vhash, input, 80 );
blake512_4way_final_le( &blake512_4way_ctx, vhash, casti_m256i( input, 9 ),
x17_4way_midstate );
// blake512_4way_full( &ctx.blake, vhash, input, 80 );
bmw512_4way_init( &ctx.bmw );
bmw512_4way_update( &ctx.bmw, vhash, 64 );
@@ -449,4 +452,55 @@ int x17_4way_hash( void *state, const void *input, int thr_id )
return 1;
}
int scanhash_x17_4way( struct work *work, uint32_t max_nonce,
uint64_t *hashes_done, struct thr_info *mythr )
{
uint32_t hash32[8*4] __attribute__ ((aligned (128)));
uint32_t vdata[20*4] __attribute__ ((aligned (32)));
uint32_t lane_hash[8] __attribute__ ((aligned (32)));
__m128i edata[5] __attribute__ ((aligned (32)));
uint32_t *pdata = work->data;
uint32_t *hash32_d7 = &(hash32[7*4]);
const uint32_t *ptarget = work->target;
const uint32_t first_nonce = pdata[19];
const uint32_t last_nonce = max_nonce - 4;
__m256i *noncev = (__m256i*)vdata + 9;
uint32_t n = first_nonce;
const int thr_id = mythr->id;
const uint32_t targ32_d7 = ptarget[7];
const __m256i four = _mm256_set1_epi64x( 4 );
const bool bench = opt_benchmark;
// convert LE32 to LE64
edata[0] = mm128_swap64_32( casti_m128i( pdata, 0 ) );
edata[1] = mm128_swap64_32( casti_m128i( pdata, 1 ) );
edata[2] = mm128_swap64_32( casti_m128i( pdata, 2 ) );
edata[3] = mm128_swap64_32( casti_m128i( pdata, 3 ) );
edata[4] = mm128_swap64_32( casti_m128i( pdata, 4 ) );
mm256_intrlv80_4x64( vdata, edata );
*noncev = _mm256_add_epi32( *noncev, _mm256_set_epi32( 0,3,0,2, 0,1,0,0 ) );
blake512_4way_prehash_le( &blake512_4way_ctx, x17_4way_midstate, vdata );
do
{
if ( likely( x17_4way_hash( hash32, vdata, thr_id ) ) )
for ( int lane = 0; lane < 4; lane++ )
if ( unlikely( ( hash32_d7[ lane ] <= targ32_d7 ) && !bench ) )
{
extr_lane_4x32( lane_hash, hash32, lane, 256 );
if ( likely( valid_hash( lane_hash, ptarget ) ) )
{
pdata[19] = n + lane;
submit_solution( work, lane_hash, mythr );
}
}
*noncev = _mm256_add_epi32( *noncev, four );
n += 4;
} while ( ( n < last_nonce ) && !work_restart[thr_id].restart );
pdata[19] = n;
*hashes_done = n - first_nonce;
return 0;
}
#endif
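
Review note: `scanhash_x17_4way` filters each lane on the most significant hash word before extracting the lane and running the full check. The sketch below models that two-stage test for one lane. It assumes that `valid_hash` means "the 256-bit hash, read as eight little-endian 32-bit words, is less than or equal to the target"; that reading is an assumption of this note, and `hash_meets_target` and `check_candidate` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Full comparison, most significant word (index 7) first.
 * Returns true when hash <= target. */
static bool hash_meets_target(const uint32_t hash[8],
                              const uint32_t target[8])
{
    for (int i = 7; i >= 0; i--) {
        if (hash[i] < target[i]) return true;
        if (hash[i] > target[i]) return false;
    }
    return true;   /* all words equal */
}

/* Two-stage check: reject on the top word alone when possible, and
 * run the full comparison only for the rare survivors. */
static bool check_candidate(const uint32_t hash[8],
                            const uint32_t target[8])
{
    if (hash[7] > target[7])
        return false;
    return hash_meets_target(hash, target);
}
```

Almost every nonce fails the first test, so the lane extraction and full comparison run only for a small fraction of hashes.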

View File

@@ -6,7 +6,8 @@ bool register_x17_algo( algo_gate_t* gate )
gate->scanhash = (void*)&scanhash_x17_8way;
gate->hash = (void*)&x17_8way_hash;
#elif defined (X17_4WAY)
gate->scanhash = (void*)&scanhash_4way_64in_32out;
gate->scanhash = (void*)&scanhash_x17_4way;
// gate->scanhash = (void*)&scanhash_4way_64in_32out;
gate->hash = (void*)&x17_4way_hash;
#else
gate->hash = (void*)&x17_hash;
