From bbe46ec6fb26a7dde95399bffc84d1ab55a71a09 Mon Sep 17 00:00:00 2001 From: Chris P Date: Thu, 13 Oct 2016 05:57:34 +0200 Subject: [PATCH 001/180] Clean up and clarify manpage a bit (first iteration) --- docs/rmlint.1.rst | 269 ++++++++++++++++++++++++++++++---------------- lib/cmdline.c | 4 +- 2 files changed, 179 insertions(+), 94 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index a7436474..2d747b82 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -6,7 +6,8 @@ rmlint find duplicate files and other space waste efficiently ------------------------------------------------------ -.. Stuff in curly braces gets replaced by SCons +.. NOTE: Stuff in curly braces gets replaced by SCons +.. Use something like {{this}} to escape curly braces. SYNOPSIS ======== @@ -17,17 +18,23 @@ DESCRIPTION =========== ``rmlint`` finds space waste and other broken things on your filesystem. +It's main focus lies on finding duplicate files and directories. -Types of waste include: +It is able to find the following types of lint: -* Duplicate files and directories. -* Nonstripped Binaries (Binaries with debug symbols). -* Broken links. -* Empty files and directories. +* Duplicate files and directories (and as a result unique files). +* Nonstripped Binaries (Binaries with debug symbols; needs to be explicityl enabled). +* Broken symbolic links. +* Empty files and directories (also nested empty directories). * Files with broken user or group id. -``rmlint`` will not delete any files. It does however produce executable output -(for example a shell script) to help you delete the files if you want to. +``rmlint`` itself WILL NOT DELETE ANY FILES. It does however produce executable +output (for example a shell script) to help you delete the files if you want +to. Another design principle is that it should work well together with other +tools like ``find``. 
Therefore we do not replicate features of other well know +programs, as for example pattern matching and finding duplicate filenames. +However we provide many convinience options for common usecases that are hard +to build from scratch with standard tools. In order to find the lint, ``rmlint`` is given one or more directories to traverse. If no directories or files were given, the current working directory is assumed. @@ -35,13 +42,26 @@ By default, ``rmlint`` will ignore hidden files and will not follow symlinks (se traversal options below). ``rmlint`` will first find "other lint" and then search the remaining files for duplicates. -Duplicate sets will be displayed as an original and one or more duplicates. You -can set criteria for how ``rmlint`` chooses using the `-S` option (by default it -chooses the first-named path on the command line, or if that is equal then the -oldest file based on mtime). You can also specify that certain paths **only** contain -originals by naming the path after the special path separator **//**. - -Examples are given at the end of this manual. +``rmlint`` tries to be helpful by guessing what file of a group of duplicates +is the **original** (i.e. the file that should not be deleted). It does this by using +different sorting strategies that can be controlled via the ``-S`` option. By +default it chooses the first-named path on the commandline. If two duplicates +come from the same path, it will also apply different fallback sort strategies (See the documentation of the ``-S`` strategy). + +This behaviour can be also overwritten if you know that a certain directory +contains duplicates and another one originals. In this case you write the +original directory after specifying a single ``//`` on the commandline. +Everything that comes after is a preferred (or a "tagged") directory. If there +are duplicates from a unpreferred and from a preffered directory, the preferred +one will always count as original. 
Special options can also be used to always +keep files in preferred directories (``-k``) and to only find duplicates that +are present in both given directories (``-m``). + +We advise new users to have a short look at all options ``rmlint`` has to +offer, and maybe test some examples before letting it run on productive data. +WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR DATA. There are some extended +example at the end of this manual, but each option that is not self-explanatory +will also try to give examples. OPTIONS ======= @@ -53,7 +73,7 @@ General Options Configure the types of lint rmlint will look for. The `list` string is a comma-separated list of lint types or lint groups (other separators like - semicolon or space also work). + semicolon or space also work though). One of the following groups can be specified at the beginning of the list: @@ -67,29 +87,41 @@ General Options Any of the following lint types can be added individually, or deselected by prefixing with a **-**: - * ``badids``, ``bi``: Find bad UID, GID or files with both. - * ``badlinks``, ``bl``: Find bad symlinks pointing nowhere. + * ``badids``, ``bi``: Find files with bad UID, GID or both. + * ``badlinks``, ``bl``: Find bad symlinks pointing nowhere valid. * ``emptydirs``, ``ed``: Find empty directories. * ``emptyfiles``, ``ef``: Find empty files. * ``nonstripped``, ``ns``: Find nonstripped binaries. * ``duplicates``, ``df``: Find duplicate files. * ``duplicatedirs``, ``dd``: Find duplicate directories. - **WARNING:** It is good practice to enclose the description in quotes. In - obscure cases argument parsing might fail in weird ways. + **WARNING:** It is good practice to enclose the description in single or + double quotes. In obscure cases argument parsing might fail in weird ways, + especially when using spaces as separator. 
+ + Example: + + ``$ rmlint -T "df,dd" # Only search for duplicate files and directories`` + ``$ rmlint -T "all -df -dd" # Search for all lint except duplicate files and dirs.`` + +:``-o --output=spec`` / ``-O --add-output=spec`` (**default\:** *-o sh\:rmlint.sh -o pretty\:stdout -o summary\:stdout -o json\:rmlint.json*): -:``-o --output=spec`` / ``-O --add-output=spec`` (**default\:** *-o sh\:rmlint.sh -o pretty\:stdout -o summary\:stdout*): + Configure the way ``rmlint`` outputs its results. A ``spec`` is in the form + ``format:file`` or just ``format``. A ``file`` might either be an + arbitrary path or ``stdout`` or ``stderr``. If file is omitted, ``stdout`` + is assumed. ``format`` is the name of a formatter supported by this + program. For a list of formatters and their options, refer to the + **Formatters** section below. - Configure the way ``rmlint`` outputs its results. A ``spec`` is in the - form ``format:file`` or just ``format``. A file might either be an arbitrary - path or ``stdout`` or ``stderr``. If file is omitted, ``stdout`` is assumed. + If ``-o`` is specified, rmlint's default outputs are overwritten. With + ``--O`` the defaults are preserved. Either ``-o`` or ``-O`` may be + specified multiple times to get multiple outputs, including multiple + outputs of the same format. - If ``-o`` is specified, rmlint's defaults are overwritten. With ``--O`` the - defaults are preserved. Either ``-o`` or ``-O`` may be specified multiple - times to get multiple outputs, including multiple outputs of the same format. + Examples: - For a list of formatters and their options, refer to the **Formatters** - section below. + ``$ rmlint -o json # Stream the json output to stdout`` + ``$ rmlint -O csv:/tmp/rmlint.csv # Output an extra csv file to /tmp`` :``-c --config=spec[=value]`` (**default\:** *none*): @@ -97,7 +129,12 @@ General Options the existing formatters. See the **Formatters** section for details on the available keys. 
- If the value is omitted it is set to a true value. + If the value is omitted it is set to a value meaning "enabled". + + Examples: + + ``$ rmlint -c sh:link # Smartly link duplicates instead of removing`` + ``$ rmlint -c progressbar:fancy # Use a different theme for the progressbar`` :``-z --perms[=[rwx]]`` (**default\:** *no check*): @@ -110,6 +147,8 @@ General Options By default this check is not done. + ``$ rmlint -z rx $(echo $PATH | tr ":" " ") # Look at all executable files in $PATH`` + :``-a --algorithm=name`` (**default\:** *blake2b*): Choose the algorithm to use for finding duplicate files. The algorithm can be @@ -124,7 +163,7 @@ General Options * **bastard:** 256bit, combining **city**, and **murmur**. * **city256, city512, murmur256, murmur512:** Use multiple 128-bit hashes with different seeds. - * **spooky32, spooky64:** Faster version of **spooky** with less bits. + * **spooky32, spooky64:** Faster version of **spooky** with less bits. We strongly advise against using these. :``-p --paranoid`` / ``-P --less-paranoid`` (**default**): @@ -140,16 +179,18 @@ General Options :``-v --loud`` / ``-V --quiet``: Increase or decrease the verbosity. You can pass these options several - times. This only affects ``rmlint``'s logging on *stderr*, but not the outputs - defined with **-o**. Passing either option more than three times has no - effect. + times. This only affects ``rmlint``'s logging on *stderr*, but not the + outputs defined with **-o**. Passing either option more than three times + has no further effect. :``-g --progress`` / ``-G --no-progress`` (**default**): - Convenience shortcut for ``-o progressbar -o summary -o sh:rmlint.sh -VVV``. + Show a progressbar with sane defaults. + + Convenience shortcut for ``-o progressbar -o summary -o sh:rmlint.sh -o json:rmlint.json -VVV``. - Note: This flag clears all previous outputs. Specify any additional outputs - after this flag! + NOTE: This flag clears all previous outputs. 
If you want additional + outputs, specify them after this flag using ``-O``. :``-D --merge-directories`` (**default\:** *disabled*): @@ -159,16 +200,19 @@ General Options during ``rmlint``'s or its removal scripts run. IMPORTANT: Definition of equal: Two directories are considered equal by - ``rmlint`` if they contain the exact same data, no matter how are the files + ``rmlint`` if they contain the exact same data, no matter how the files contaning the data are named. Imagine that ``rmlint`` creates a long, sorted stream out of the data found in the directory and compares this in - a magic way. This means that the layout of the directory is not considered - to be important by ``rmlint``. This might be surprising to some users, but + a magic way to another directory. This means that the layout of the + directory is not considered to be important by default. Also empty files + will not count as content. This might be surprising to some users, but remember that ``rmlint`` generally cares only about content, not about any - other metadata or layout. + other metadata or layout. If you want to only find trees with the same hierarchy + you should use ``--honour-dir-layout / -j``. Output is deferred until all duplicates were found. Duplicate directories - are printed first, followed by any remaining duplicate files. + are printed first, followed by any remaining duplicate files that are isolated + or inside of any original directories. **--rank-by** applies for directories too, but 'p' or 'P' (path index) has no defined (i.e. useful) meaning. Sorting takes only place when the number of @@ -179,8 +223,6 @@ General Options * This option enables ``--partial-hidden`` and ``-@`` (``--see-symlinks``) for convenience. If this is not desired, you should change this after specifying ``-D``. - * This feature might not deliver perfect result in corner cases, but - should never report false positives. * This feature might add some runtime for large datasets. 
* When using this option, you will not be able to use the ``-c sh:clone`` option. Use ``-c sh:link`` as a good alternative. @@ -189,8 +231,8 @@ General Options Only recognize directories as duplicates that have the same path layout. In other words: All duplicates that build the duplicate directory must have - the same path from the root of the directory. - This flag has no effect without ``--merge-directories``. + the same path from the root of each respective directory. + This flag makes no sense without ``--merge-directories``. :``-y --sort-by=order`` (**default\:** *none*): @@ -208,23 +250,28 @@ General Options --rank-by``) to reverse the sorting. Note that ``rmlint`` has to hold back all results to the end of the run before sorting and printing. + ``$ rmlint -y sN # Sort groups by size (biggest last) and if tied by files in it (smallest last)`` + :``--gui``: Start the optional graphical frontend to ``rmlint`` called ``Shredder``. + The frontend is supposed to be used by beginner level users and does not + offer all features of the commandline version. This will only work when ``Shredder`` and its dependencies were installed. See also: http://rmlint.readthedocs.org/en/latest/gui.html The gui has its own set of options, see ``--gui --help`` for a list. These - should be placed at the end, ie ``rmlint --gui [options]`` when calling + should be placed at the end, i.e. ``rmlint --gui [options]`` when calling it from commandline. :``--hash [paths...]``: Make ``rmlint`` work as a multi-threaded file hash utility, similar to the - popular ``md5sum`` or ``sha1sum`` utilities, but faster and with more algorithms. - A set of paths given on the commandline or from *stdin* is hashed using one - of the available hash algorithms. Use ``rmlint --hash -h`` to see options. + popular ``md5sum`` or ``sha1sum`` utilities, but faster and with more + algorithms. A set of paths given on the commandline or from *stdin* is + hashed using one of the available hash algorithms. 
Use ``rmlint --hash + --help`` to see the extended options. :``--equal [paths...]``: @@ -237,9 +284,11 @@ General Options Note: This even works for directories and also in combination with paranoid mode (pass ``-pp`` for byte comparison); remember that rmlint does not care about the layout of the directory, but only about the content of the files - in it. At least two paths need to be given to the commandline. + in it. This is the main advantage of ``--equal`` over the ``cmp`` util + which will be faster when comparing files. - By default this will use hashing to compare the files and/or directories. + At least two paths need to be given to the commandline. If more than two paths + are given, all arguments must be equal. :``-w --with-color`` (**default**) / ``-W --no-with-color``: @@ -248,12 +297,12 @@ General Options :``-h --help`` / ``-H --show-man``: - Show a shorter reference help text (``-h``) or this full man page (``-H``). + Show a shorter reference help text (``-h``) or the full man page (``-H``). :``--version``: Print the version of rmlint. Includes git revision and compile time - features. + features. Please include this when giving feedback to us. Traversal Options ----------------- @@ -267,14 +316,15 @@ Traversal Options - *C* (1^1), *W* (2^1), B (512^1), *K* (1000^1), KB (1024^1), *M* (1000^2), *MB* (1024^2), *G* (1000^3), *GB* (1024^3), - *T* (1000^4), *TB* (1024^4), *P* (1000^5), *PB* (1024^5), *E* (1000^6), *EB* (1024^6) - The size format is about the same as `dd(1)` uses. A valid example would be: **"100KB-2M"**. - This limits duplicates to a range from 100 Kilobyte to 2 Megabyte. + The size format is about the same as `dd(1)` uses. A valid example would + be: **"100KB-2M"**. This limits duplicates to a range from 100 Kilobyte to + 2 Megabyte. It's also possible to specify only one size. In this case the size is interpreted as *"bigger or equal"*. 
If you want to to filter for files *up to this size* you can add a ``-`` in front (``-s -1M`` == ``-s 0-1M``). - **NOTE:** The default excludes empty files from the duplicate search. + **Edge case:** The default excludes empty files from the duplicate search. Normally these are treated specially by ``rmlint`` by handling them as *other lint*. If you want to include empty files as duplicates you should lower the limit to zero: @@ -284,11 +334,14 @@ Traversal Options :``-d --max-depth=depth`` (**default\:** *INF*): Only recurse up to this depth. A depth of 1 would disable recursion and is - equivalent to a directory listing. + equivalent to a directory listing. A depth of 2 would also consider also all + children directories and so on. :``-l --hardlinked`` (**default**) / ``-L --no-hardlinked``: Whether to report hardlinked files as duplicates. + Hardlinked files will not appear as space waste in the statistics, since + they do not allocate any extra space. :``-f --followlinks`` / ``-F --no-followlinks`` / ``-@ --see-symlinks`` (**default**): @@ -307,9 +360,11 @@ Traversal Options :``-r --hidden`` / ``-R --no-hidden`` (**default**) / ``--partial-hidden``: Also traverse hidden directories? This is often not a good idea, since - directories like ``.git/`` would be investigated. + directories like ``.git/`` would be investigated, possibly leading to the + deletion of internal ``git`` files which in turn break a repository. With ``--partial-hidden`` hidden files and folders are only considered if - they're inside duplicate directories (see --merge-directories). + they're inside duplicate directories (see ``--merge-directories``) and will + be deleted as part of it. :``-b --match-basename``: @@ -330,7 +385,7 @@ Traversal Options :``-i --match-without-extension`` / ``-I --no-match-without-extension`` (**default**): Only consider those files as dupes that have the same basename minus the file - extension. 
For example: ``banana.png`` and ``banana.jpeg`` would be considered, + extension. For example: ``banana.png`` and ``Banana.jpeg`` would be considered, while ``apple.png`` and ``peach.png`` won't. The comparison is case-insensitive. :``-n --newer-than-stamp=`` / ``-N --newer-than=``: @@ -487,10 +542,13 @@ Caching :``-U --write-unfinished``: - Include files in output that have not been hashed fully (i.e. files that do - not appear to have a duplicate). This is mainly useful in conjunction with - ``--xattr-write/read``. When re-running rmlint on a large dataset this can greatly - speed up a re-run in some cases. + Include files in output that have not been hashed fully, i.e. files that do + not appear to have a duplicate. Note that this will not include all files + that ``rmlint`` traversed, but only the files that were chosen to be hashed. + + This is mainly useful in conjunction with ``--xattr-write/read``. When + re-running rmlint on a large dataset this can greatly speed up a re-run in + some cases. Please refer to ``--xattr-read`` for an example. Rarely used, miscellaneous options ---------------------------------- @@ -498,12 +556,22 @@ Rarely used, miscellaneous options :``-t --threads=N`` (*default\:* 16): The number of threads to use during file tree traversal and hashing. - ``rmlint`` probably knows better than you how to set the value. + ``rmlint`` probably knows better than you how to set this value, so just + leave it as it is. Setting it to ``1`` will also not make ``rmlint`` + a single threaded program. + +:``-u --limit-mem=size``: + + Apply a maximum number of memory to use for hashing and **--paranoid**. + The total number of memory might still exceed this limit though, especially + when setting it very low. In general ``rmlint`` will however consume about this + amont of memory plus a more or less constant extra amount that depends on the + data you are scanning. 
-:``-u --max-paranoid-mem=size``: + The ``size``-description has the same format as for **--size**, therefore you + can do something like this (use this if you have 1GB of memory available): - Apply a maximum number of bytes to use for **--paranoid**. - The ``size``-description has the same format as for **--size**. + + ``$ rmlint -u 512M # Limit paranoid mem usage to 512 MB`` :``-q --clamp-low=[fac.tor|percent%|offset]`` (**default\:** *0*) / ``-Q --clamp-top=[fac.tor|percent%|offset]`` (**default\:** *1.0*): @@ -521,6 +589,10 @@ Rarely used, miscellaneous options Also it might be useful for approximate comparison where it suffices when the file is the same in the middle part. + Example: + + ``$ rmlint -q 10% -Q 512M # Only read the last 90% of a file, but read at max. 512MB`` + :``-Z --mtime-window=T`` (**default\:** *-1*): Only consider those files as duplicates that have the same content and @@ -539,7 +611,8 @@ Rarely used, miscellaneous options :``--with-fiemap`` (**default**) / ``--without-fiemap``: Enable or disable reading the file extents on rotational disk in order to - optimize disk access patterns. + optimize disk access patterns. If this feature is not available, it is + disabled automatically. FORMATTERS ========== @@ -565,49 +638,58 @@ FORMATTERS files in that given order until one handler succeeds. Handlers are just the name of a way of getting rid of the file and can be any of the following: - * ``clone``: ``btrfs`` only. Try to clone both files with the + * ``clone``: For ``btrfs`` only. Try to clone both files with the BTRFS_IOC_FILE_EXTENT_SAME ``ioctl(3p)``. This will physically delete duplicate extents. Needs at least kernel 4.2. * ``reflink``: Try to reflink the duplicate file to the original. See also ``--reflink`` in ``man 1 cp``. Fails if the filesystem does not support it. * ``hardlink``: Replace the duplicate file with a hardlink to the original - file. The resulting files will have the same inode number. 
Fails if both files are not on the same partition. - You can use ``ls -i`` to show the inode number of a file and ``find -samefile `` to find - all hardlinks for a certain file. + file. The resulting files will have the same inode number. Fails if both + files are not on the same partition. You can use ``ls -i`` to show the + inode number of a file and ``find -samefile `` to find all + hardlinks for a certain file. * ``symlink``: Tries to replace the duplicate file with a symbolic link to - the original. Never fails. + the original. This handler never fails. * ``remove``: Remove the file using ``rm -rf``. (``-r`` for duplicate dirs). - Never fails. + This handler never fails. * ``usercmd``: Use the provided user defined command (``-c sh:cmd=something``). Never fails. Default is ``remove``. * *link*: Shortcut for ``-c sh:handler=clone,reflink,hardlink,symlink``. + Use this if you are on a reflink-capable system. * *hardlink*: Shortcut for ``-c sh:handler=hardlink,symlink``. + Use this if you want to hardlink files, but want to fallback + for duplicates that lie on different devices. * *symlink*: Shortcut for ``-c sh:handler=symlink``. + Use this as last straw. -* ``json``: Print a JSON-formatted dump of all found reports. - Outputs all finds as a json document. The document is a list of dictionaries, - where the first and last element is the header and the footer respectively, - everything between are data-dictionaries. +* ``json``: Print a JSON-formatted dump of all found reports. Outputs all lint + as a json document. The document is a list of dictionaries, where the first + and last element is the header and the footer. Everything between are + data-dictionaries. Available options: - - *no_header=[true|false]:* Print the header with metadata. - - *no_footer=[true|false]:* Print the footer with statistics. - - *oneline=[true|false]:* Print one json document per line. 
+ - *no_header=[true|false]:* Print the header with metadata (default: true) + - *no_footer=[true|false]:* Print the footer with statistics (default: true) + - *oneline=[true|false]:* Print one json document per line (default: false) + This is useful if you plan to parse the output line-by-line, e.g. while + ``rmlint`` is still running. * ``py``: Outputs a python script and a JSON document, just like the **json** formatter. The JSON document is written to ``.rmlint.json``, executing the script will make it read from there. This formatter is mostly intented for complex use-cases - where the lint needs special handling. Therefore the python script can be modified - to do things standard ``rmlint`` is not able to do easily. + where the lint needs special handling that you define in the python script. + Therefore the python script can be modified to do things standard ``rmlint`` + is not able to do easily. * ``stamp``: Outputs a timestamp of the time ``rmlint`` was run. + See also the ``--newer-than`` and ``--newer-than-stamp`` file option. Available options: @@ -730,17 +812,20 @@ PROBLEMS option. This will compare all the files byte-by-byte and is not much slower than SHA1. 2. **File modification during or after rmlint run:** It is possible that a file - that ``rmlint`` recognized as duplicate is modified afterwards, resulting in a - different file. If you use the rmlint-generated shell script to delete the duplicates, - you can run it with the ``-p`` option to do a full re-check of the duplicate against - the original before it deletes the file. When using ``-c sh:hardlink`` or ``-c sh:symlink`` - care should be taken that a modification of one file will now result in a modification of - all files. This is not the case for ``-c sh:reflink`` or ``-c sh:clone``. Use ``-c sh:link`` - to minimise this risk. + that ``rmlint`` recognized as duplicate is modified afterwards, resulting in + a different file. 
If you use the rmlint-generated shell script to delete + the duplicates, you can run it with the ``-p`` option to do a full re-check + of the duplicate against the original before it deletes the file. When using + ``-c sh:hardlink`` or ``-c sh:symlink`` care should be taken that + a modification of one file will now result in a modification of all files. + This is not the case for ``-c sh:reflink`` or ``-c sh:clone``. Use ``-c + sh:link`` to minimise this risk. SEE ALSO ======== +Reading the manpages of these tools might help working with ``rmlint``: + * `find(1)` * `rm(1)` * `cp(1)` diff --git a/lib/cmdline.c b/lib/cmdline.c index 2e43c3e7..9702b0c7 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1415,8 +1415,8 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"clamp-low" , 'q' , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(clamp_low) , "Limit lower reading barrier" , "P"} , {"clamp-top" , 'Q' , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(clamp_top) , "Limit upper reading barrier" , "P"} , {"limit-mem" , 'u' , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(limit_mem) , "Specify max. memory usage target" , "S"} , - {"sweep-size" , 'u' , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(sweep_size) , "Specify max. bytes per pass when scanning disks" , "S"} , - {"sweep-files" , 'u' , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(sweep_count) , "Specify max. file count per pass when scanning disks" , "S"} , + {"sweep-size" , 0 , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(sweep_size) , "Specify max. bytes per pass when scanning disks" , "S"} , + {"sweep-files" , 0 , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(sweep_count) , "Specify max. file count per pass when scanning disks" , "S"} , {"threads" , 't' , HIDDEN , G_OPTION_ARG_INT64 , &cfg->threads , "Specify max. 
number of hasher threads" , "N"} , {"threads-per-disk" , 0 , HIDDEN , G_OPTION_ARG_INT , &cfg->threads_per_disk , "Specify number of reader threads per physical disk" , NULL} , {"write-unfinished" , 'U' , HIDDEN , G_OPTION_ARG_NONE , &cfg->write_unfinished , "Output unfinished checksums" , NULL} , From 6a6a071ae0dd85101034c5be25653d650de098ee Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Thu, 15 Jun 2017 17:22:50 -0400 Subject: [PATCH 002/180] py: Fix typos in comments and function names --- lib/formats/py.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index fd31fba7..6c9215a1 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -24,7 +24,7 @@ # This is the python remover utility shipped inside the rmlint binary. # The 200 lines source presented below is meant to be clean and hackable. -# It is intented to be used for corner cases where the built-in sh formatter +# It is intended to be used for corner cases where the built-in sh formatter # is not enough or as an alternative to it. By default it works the same. # Python2 compat: @@ -94,7 +94,7 @@ def handle_empty_dir(path, **kwargs): os.rmdir(path) -def handle_empy_file(path, **kwargs): +def handle_empty_file(path, **kwargs): os.remove(path) @@ -127,7 +127,7 @@ def handle_badugid(path, **kwargs): "duplicate_file": handle_duplicate_file, "unfinished_cksum": handle_unfinished_cksum, "emptydir": handle_empty_dir, - "emptyfile": handle_empy_file, + "emptyfile": handle_empty_file, "nonstripped": handle_nonstripped, "badlink": handle_badlink, "baduid": handle_baduid, From e6a36391126fcf9d6b38b2be81adbc253a56d7fc Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Thu, 15 Jun 2017 17:59:26 -0400 Subject: [PATCH 003/180] py: Move dryrun check to inside each handler function This makes the code more repetitive, but the dryrun should be as close as possible to the real thing, and should include checking original files. 
--- lib/formats/py.py | 30 +++++++++++++++++++----------- 1 file changed, 19 insertions(+), 11 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index 6c9215a1..f2be2f5b 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -78,12 +78,14 @@ def original_check(path, original, be_paranoid=True): def handle_duplicate_dir(path, original, **kwargs): - shutil.rmtree(path) + if not args.dry_run: + shutil.rmtree(path) def handle_duplicate_file(path, original, args, **kwargs): if original_check(path, original['path'], be_paranoid=args.paranoid): - os.remove(path) + if not args.dry_run: + os.remove(path) def handle_unfinished_cksum(path, **kwargs): @@ -91,19 +93,23 @@ def handle_unfinished_cksum(path, **kwargs): def handle_empty_dir(path, **kwargs): - os.rmdir(path) + if not args.dry_run: + os.rmdir(path) def handle_empty_file(path, **kwargs): - os.remove(path) + if not args.dry_run: + os.remove(path) def handle_nonstripped(path, **kwargs): - subprocess.call(["strip", "--strip-debug", path]) + if not args.dry_run: + subprocess.call(["strip", "--strip-debug", path]) def handle_badlink(path, **kwargs): - os.remove(path) + if not args.dry_run: + os.remove(path) CURRENT_UID = os.geteuid() @@ -111,15 +117,18 @@ def handle_badlink(path, **kwargs): def handle_baduid(path, **kwargs): - os.chmod(path, CURRENT_UID, -1) + if not args.dry_run: + os.chmod(path, CURRENT_UID, -1) def handle_badgid(path, **kwargs): - os.chmod(path, -1, CURRENT_GID) + if not args.dry_run: + os.chmod(path, -1, CURRENT_GID) def handle_badugid(path, **kwargs): - os.chmod(path, CURRENT_UID, CURRENT_GID) + if not args.dry_run: + os.chmod(path, CURRENT_UID, CURRENT_GID) OPERATIONS = { @@ -177,8 +186,7 @@ def main(args, header, data, footer): # Do not handle originals. 
continue - if not args.dry_run: - exec_operation(item, original=last_original_item, args=args) + exec_operation(item, original=last_original_item, args=args) print('{c[blue]}#{c[reset]} Handling ({t} -> {v}): {p}'.format( c=COLORS, t=item['type'], v=MESSAGES[item['type']], p=item['path']) From 476fa3e3ae6811fb8a13e3aecd4aaef55b0b0edc Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Fri, 16 Jun 2017 08:07:08 -0400 Subject: [PATCH 004/180] py: Clean up loading of json files Use 'default' option in argparse Don't use argparse's type=open --- lib/formats/py.py | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index f2be2f5b..f87d853f 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -199,8 +199,8 @@ def main(args, header, data, footer): ) parser.add_argument( - 'json_docs', metavar='json_doc', type=open, nargs='*', - help='A json output of rmlint to handle (can be given many times)' + 'json_docs', metavar='json_doc', nargs='*', default=['.rmlint.json'], + help='A json output of rmlint to handle (can be given multiple times)' ) parser.add_argument( '-n', '--dry-run', action='store_true', @@ -215,22 +215,21 @@ def main(args, header, data, footer): help='Do an extra byte-by-byte compare before deleting duplicates' ) - try: - args = parser.parse_args() - except OSError as err: - print(err) - sys.exit(-1) + args = parser.parse_args() - if not args.json_docs: - # None given on the commandline + json_docus = [] + for doc in args.json_docs: try: - args.json_docs.append(open('.rmlint.json', 'r')) - except OSError as err: - print('Cannot load default json document: ', str(err), file=sys.stderr) - sys.exit(-2) + with open(doc) as f: + j = json.load(f) + json_docus.append(j) + except IOError as err: # Cannot open file + print(err, file=sys.stderr) + sys.exit(-1) + except ValueError as err: # File is not valid JSON + print('{}: {}'.format(err, doc), file=sys.stderr) + sys.exit(-1) - 
json_docus = [json.load(doc) for doc in args.json_docs] - json_elems = [item for sublist in json_docus for item in sublist] try: if not args.no_ask and not args.dry_run: From a8d9c49320f26de4fc591c971b1dbf88b4678570 Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Sun, 25 Jun 2017 17:06:42 -0500 Subject: [PATCH 005/180] py: Improve handling json header and footer Process header and footer inside main() function Print header and footer before asking for confirmation Header or footer may not be present if rmlint was run with no_header or no_footer --- lib/formats/py.py | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index f87d853f..6c9f19fe 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -170,10 +170,22 @@ def exec_operation(item, original=None, args=None): ) -def main(args, header, data, footer): +def main(args, data): seen_cksums = set() last_original_item = None + # Process header and footer, if present + header, footer = [], [] + if data[0].get('description'): + header = data.pop(0) + if data[-1].get('total_files'): + footer = data.pop(-1) + # TODO: Print header and footer data here before asking for confirmation + + if not args.no_ask and not args.dry_run: + print('\nPlease hit any key before continuing to shredder your data.', file=sys.stderr) + sys.stdin.read(1) + for item in data: if item['type'].startswith('duplicate_') and item['is_original']: print( @@ -232,12 +244,8 @@ def main(args, header, data, footer): try: - if not args.no_ask and not args.dry_run: - print('\nPlease hit any key before continuing to shredder your data.', file=sys.stderr) - sys.stdin.read(1) - for json_doc in json_docus: - main(args, json_doc[0], json_doc[1:-1], json_doc[-1]) + main(args, json_doc) if args.dry_run: print( From 5f3dbf5ddaeb8c293fa99edde62c594575a57260 Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Sun, 25 Jun 2017 17:16:57 -0500 Subject: [PATCH 006/180] py: Make output, colors, 
and help more consistent with sh output Show progress indicator Print item message before exec_operation, not after --- lib/formats/py.py | 61 ++++++++++++++++++++++++----------------------- 1 file changed, 31 insertions(+), 30 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index 6c9f19fe..f4d7e961 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -39,7 +39,6 @@ import argparse import subprocess - USE_COLOR = sys.stdout.isatty() and sys.stderr.isatty() COLORS = { 'red': "\x1b[31;01m" if USE_COLOR else "", @@ -144,18 +143,6 @@ def handle_badugid(path, **kwargs): "badugid": handle_badugid, } -MESSAGES = { - "duplicate_dir": "removing tree", - "duplicate_file": "removing", - "unfinished_cksum": "checking", - "emptydir": "removing", - "emptyfile": "removing", - "nonstripped": "stripping", - "badlink": "removing", - "baduid": "changing uid", - "badgid": "changing gid", - "badugid": "changing uid & gid", -} def exec_operation(item, original=None, args=None): @@ -163,7 +150,7 @@ def exec_operation(item, original=None, args=None): OPERATIONS[item['type']](item['path'], original=original, item=item, args=args) except OSError as err: print( - '{c[red]}#{c[reset]} Error on `{item[path]}`:\n{c[red]}#{c[reset]} {err}'.format( + '{c[red]}#{c[reset]} {err}'.format( item=item, err=err, c=COLORS ), file=sys.stderr @@ -186,49 +173,63 @@ def main(args, data): print('\nPlease hit any key before continuing to shredder your data.', file=sys.stderr) sys.stdin.read(1) + MESSAGES = { + 'duplicate_dir': '{c[yellow]}Deleting duplicate directory'.format(c=COLORS), + 'duplicate_file': '{c[yellow]}Deleting duplicate:'.format(c=COLORS), + "unfinished_cksum": "checking", + 'emptydir': '{c[green]}Deleting empty directory:'.format(c=COLORS), + 'emptyfile': '{c[green]}Deleting empty file:'.format(c=COLORS), + 'nonstripped': '{c[green]}Stripping debug symbols:'.format(c=COLORS), + 'badlink': '{c[green]}Deleting bad symlink:'.format(c=COLORS), + 'baduid': 
'{c[green]}chown'.format(c=COLORS), + 'badgid': '{c[green]}chgrp'.format(c=COLORS), + 'badugid': '{c[green]}chown'.format(c=COLORS), + } + for item in data: if item['type'].startswith('duplicate_') and item['is_original']: - print( - "\n{c[green]}#{c[reset]} Deleting twins of {item[path]} ".format( - item=item, c=COLORS - ) + print('{c[blue]}[{prog:3}%]{c[reset]} {c[green]}Keeping original: {c[reset]}{path}'.format( + prog=item['progress'], path=item['path'], c=COLORS) ) last_original_item = item # Do not handle originals. continue - exec_operation(item, original=last_original_item, args=args) - - print('{c[blue]}#{c[reset]} Handling ({t} -> {v}): {p}'.format( - c=COLORS, t=item['type'], v=MESSAGES[item['type']], p=item['path']) + print('{c[blue]}[{prog:3}%]{c[reset]} {v}{c[reset]} {p}'.format( + c=COLORS, + prog=item['progress'], + v=MESSAGES[item['type']], + p=item['path'], + ) ) + exec_operation(item, original=last_original_item, args=args) if __name__ == '__main__': parser = argparse.ArgumentParser( - description='Handle the files stored in rmlints json output' + description='Handle the files in a JSON output of rmlint.' ) parser.add_argument( 'json_docs', metavar='json_doc', nargs='*', default=['.rmlint.json'], - help='A json output of rmlint to handle (can be given multiple times)' + help='A JSON output of rmlint to handle (can be given multiple times)' ) parser.add_argument( '-n', '--dry-run', action='store_true', - help='Only print what would be done.' + help='Do not perform any modifications, just print what would be done. ' + + '(implies -d)' ) parser.add_argument( '-d', '--no-ask', action='store_true', default=False, - help='ask for confirmation before running (does nothing for -n)' + help='Do not ask for confirmation before running.' ) parser.add_argument( '-p', '--paranoid', action='store_true', default=False, - help='Do an extra byte-by-byte compare before deleting duplicates' + help='Recheck that files are still identical before removing duplicates.' 
) args = parser.parse_args() - json_docus = [] for doc in args.json_docs: try: @@ -249,9 +250,9 @@ def main(args, data): if args.dry_run: print( - '\n{c[green]}#{c[reset]} This was a dry run. Nothing modified.'.format( + '\n{c[green]}#{c[reset]} This was a dry run. Nothing was modified.'.format( c=COLORS ) ) except KeyboardInterrupt: - print('canceled.') + print('\ncanceled.') From 7df10b6d2900717aed93924e41eaad94f7e7dc07 Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Sun, 25 Jun 2017 23:04:03 -0500 Subject: [PATCH 007/180] py: Fix handling bad uid/gid (#239) python script must be run as root for chown operations. Accordingly, target uid and gid must be given by user on command line, since they are not included in json output. --- lib/formats/py.py | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index f4d7e961..12c3a7a2 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -39,6 +39,9 @@ import argparse import subprocess +CURRENT_UID = os.geteuid() +CURRENT_GID = pwd.getpwuid(CURRENT_UID).pw_gid + USE_COLOR = sys.stdout.isatty() and sys.stderr.isatty() COLORS = { 'red': "\x1b[31;01m" if USE_COLOR else "", @@ -111,23 +114,19 @@ def handle_badlink(path, **kwargs): os.remove(path) -CURRENT_UID = os.geteuid() -CURRENT_GID = pwd.getpwuid(CURRENT_UID).pw_gid - - def handle_baduid(path, **kwargs): if not args.dry_run: - os.chmod(path, CURRENT_UID, -1) + os.chown(path, kwargs['args'].user, -1) def handle_badgid(path, **kwargs): if not args.dry_run: - os.chmod(path, -1, CURRENT_GID) + os.chown(path, -1, kwargs['args'].group) def handle_badugid(path, **kwargs): if not args.dry_run: - os.chmod(path, CURRENT_UID, CURRENT_GID) + os.chown(path, kwargs['args'].user, kwargs['args'].group) OPERATIONS = { @@ -181,9 +180,9 @@ def main(args, data): 'emptyfile': '{c[green]}Deleting empty file:'.format(c=COLORS), 'nonstripped': '{c[green]}Stripping debug symbols:'.format(c=COLORS), 'badlink': 
'{c[green]}Deleting bad symlink:'.format(c=COLORS), - 'baduid': '{c[green]}chown'.format(c=COLORS), - 'badgid': '{c[green]}chgrp'.format(c=COLORS), - 'badugid': '{c[green]}chown'.format(c=COLORS), + 'baduid': '{c[green]}chown {u}'.format(c=COLORS, u=args.user), + 'badgid': '{c[green]}chgrp {g}'.format(c=COLORS, g=args.group), + 'badugid': '{c[green]}chown {u}:{g}'.format(c=COLORS, u=args.user, g=args.group), } for item in data: @@ -228,6 +227,14 @@ def main(args, data): '-p', '--paranoid', action='store_true', default=False, help='Recheck that files are still identical before removing duplicates.' ) + parser.add_argument( + '-u', '--user', type=int, default=CURRENT_UID, + help='Numerical uid for chown operations' + ) + parser.add_argument( + '-g', '--group', type=int, default=CURRENT_GID, + help='Numerical gid for chgrp operations' + ) args = parser.parse_args() json_docus = [] From 3a8563d4f8083d77952c4313a1fe478e197a0057 Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Sun, 25 Jun 2017 23:18:00 -0500 Subject: [PATCH 008/180] py: More verbose confirmation message, other visual tweaks Add 100% Done at end Use same colors as shell script --- lib/formats/py.py | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index 12c3a7a2..a4c1be5e 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -44,11 +44,11 @@ USE_COLOR = sys.stdout.isatty() and sys.stderr.isatty() COLORS = { - 'red': "\x1b[31;01m" if USE_COLOR else "", - 'yellow': "\x1b[33;01m" if USE_COLOR else "", + 'red': "\x1b[0;31m" if USE_COLOR else "", + 'blue': "\x1b[1;34m" if USE_COLOR else "", + 'green': "\x1b[0;32m" if USE_COLOR else "", + 'yellow': "\x1b[0;33m" if USE_COLOR else "", 'reset': "\x1b[0m" if USE_COLOR else "", - 'green': "\x1b[32;01m" if USE_COLOR else "", - 'blue': "\x1b[34;01m" if USE_COLOR else "" } @@ -149,7 +149,7 @@ def exec_operation(item, original=None, args=None): OPERATIONS[item['type']](item['path'], 
original=original, item=item, args=args) except OSError as err: print( - '{c[red]}#{c[reset]} {err}'.format( + '{c[red]}# {err}{c[reset]}'.format( item=item, err=err, c=COLORS ), file=sys.stderr @@ -169,7 +169,12 @@ def main(args, data): # TODO: Print header and footer data here before asking for confirmation if not args.no_ask and not args.dry_run: - print('\nPlease hit any key before continuing to shredder your data.', file=sys.stderr) + print('rmlint was executed in the following way:\n', + header.get('args'), + '\n\nPress Enter to continue and perform modifications, ' + 'or CTRL-C to exit.' + '\nExecute this script with -d to disable this message.', + file=sys.stderr) sys.stdin.read(1) MESSAGES = { @@ -187,7 +192,8 @@ def main(args, data): for item in data: if item['type'].startswith('duplicate_') and item['is_original']: - print('{c[blue]}[{prog:3}%]{c[reset]} {c[green]}Keeping original: {c[reset]}{path}'.format( + print('{c[blue]}[{prog:3}%]{c[reset]} ' + '{c[green]}Keeping original: {c[reset]}{path}'.format( prog=item['progress'], path=item['path'], c=COLORS) ) last_original_item = item @@ -204,6 +210,8 @@ def main(args, data): ) exec_operation(item, original=last_original_item, args=args) + print('{c[blue]}[100%] Done!{c[reset]}'.format(c=COLORS)) + if __name__ == '__main__': parser = argparse.ArgumentParser( @@ -250,8 +258,8 @@ def main(args, data): print('{}: {}'.format(err, doc), file=sys.stderr) sys.exit(-1) - try: + print('# This is a dry run. 
Nothing will be modified.') for json_doc in json_docus: main(args, json_doc) From 545aaca654a6e02018d696367377269d80abe660 Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Sun, 25 Jun 2017 23:19:44 -0500 Subject: [PATCH 009/180] Add rmlint.json to .gitignore --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 10510a06..d2dd8ce7 100644 --- a/.gitignore +++ b/.gitignore @@ -7,6 +7,7 @@ *.mo rmlint rmlint.sh +rmlint.json src/config.h docs/rmlint.1.gz docs/rmlint.1 From c9f5eaae8488fb5d62ab13911f6774c4406981c0 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 4 Jul 2017 09:09:54 +1000 Subject: [PATCH 010/180] sh: fix for #241; escape dirnames during test for new emptydirs --- lib/formats/sh.sh | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index 39f34414..d9cbcfdc 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -1,7 +1,7 @@ #!/bin/sh PROGRESS_CURR=0 -PROGRESS_TOTAL= +PROGRESS_TOTAL= # This file was autowritten by rmlint # rmlint was executed from: %s @@ -48,7 +48,11 @@ print_progress_prefix() { PROGRESS_PERC=$((PROGRESS_CURR * 100 / PROGRESS_TOTAL)) fi printf "$COL_BLUE[% 3d%%]$COL_RESET " $PROGRESS_PERC - PROGRESS_CURR=$((PROGRESS_CURR+1)) + if [ $# -eq "1" ]; then + PROGRESS_CURR=$((PROGRESS_CURR+$1)) + else + PROGRESS_CURR=$((PROGRESS_CURR+1)) + fi fi } @@ -235,7 +239,9 @@ remove_cmd() { if [ ! -z "$DO_DELETE_EMPTY_DIRS" ]; then DIR=$(dirname "$1") - while [ ! "$(ls -A $DIR)" ]; do + while [ ! 
"$(ls -A "$DIR")" ]; do + print_progress_prefix 0 + echo "${COL_GREEN}Deleting resulting empty dir: ${COL_RESET}" "$DIR" rmdir "$DIR" DIR=$(dirname "$DIR") done From 0e5870955f09b66f212b499a9e650ed22c0a0e49 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 4 Jul 2017 09:12:09 +1000 Subject: [PATCH 011/180] tests: add a testcase for #241 --- tests/test_formatters/test_sh.py | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/tests/test_formatters/test_sh.py b/tests/test_formatters/test_sh.py index 04f1e146..a899ddd0 100644 --- a/tests/test_formatters/test_sh.py +++ b/tests/test_formatters/test_sh.py @@ -230,3 +230,29 @@ def test_remove_empty_dirs_with_dupe_dirs(shell, inverse_order): assert data[1]["is_original"] is False _check_if_empty_dirs_deleted(shell, inverse_order, sh_path, data) + +@with_setup(usual_setup_func, usual_teardown_func) +@parameterized([("sh", ), ("bash", ), ("dash", )]) +def test_cleanup_emptydirs(shell): + create_file('xxx', 'dir1/a') + + # create some ugly dir names + names = 'escape me [please?]', '上野洋子, 吉野裕司, 浅井裕子 & 河越重義', '天谷大輔', 'Аркона' + for dirname in names: + create_file('xxx', '{}/b'.format(dirname)) + + head, *data, footer = run_rmlint('-S a -o sh:{t}/rmlint.sh'.format(t=TESTDIR_NAME)) + + assert footer['duplicate_sets'] == 1 + assert footer['total_lint_size'] == 3 * len(names) + assert footer['total_files'] == 1 + len(names) + assert footer['duplicates'] == len(names) + + # run rmlint.sh with -c option (should clean up empty dirs after deleting 'b' files). 
+ sh_path = os.path.join(TESTDIR_NAME, 'rmlint.sh') + text = run_shell_script(shell, sh_path, "-dc") + + assert os.path.exists(os.path.join(TESTDIR_NAME, 'dir1/a')) + + for dirname in names: + assert (not os.path.exists(os.path.join(TESTDIR_NAME, dirname))) From 7cc2fe71ff08fb3a63ca7f23a247ef5b1d07c030 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 4 Jul 2017 09:34:05 +1000 Subject: [PATCH 012/180] test: add nested dirs to test_sh/test_cleanup_emptydirs --- tests/test_formatters/test_sh.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/test_formatters/test_sh.py b/tests/test_formatters/test_sh.py index a899ddd0..ff9d203b 100644 --- a/tests/test_formatters/test_sh.py +++ b/tests/test_formatters/test_sh.py @@ -237,7 +237,7 @@ def test_cleanup_emptydirs(shell): create_file('xxx', 'dir1/a') # create some ugly dir names - names = 'escape me [please?]', '上野洋子, 吉野裕司, 浅井裕子 & 河越重義', '天谷大輔', 'Аркона' + names = 'escape me [please?]', 'let\'s nest/a level/[or two]', '上野洋子, 吉野裕司, 浅井裕子 & 河越重義', '天谷大輔', 'Аркона' for dirname in names: create_file('xxx', '{}/b'.format(dirname)) From 8b9d2bacc175cf148b8b289847fc997875e6597b Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 4 Jul 2017 09:37:56 +1000 Subject: [PATCH 013/180] tests: check that nested new emptydirs all get deleted --- tests/test_formatters/test_sh.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/tests/test_formatters/test_sh.py b/tests/test_formatters/test_sh.py index ff9d203b..8704fbf4 100644 --- a/tests/test_formatters/test_sh.py +++ b/tests/test_formatters/test_sh.py @@ -237,11 +237,14 @@ def test_cleanup_emptydirs(shell): create_file('xxx', 'dir1/a') # create some ugly dir names - names = 'escape me [please?]', 'let\'s nest/a level/[or two]', '上野洋子, 吉野裕司, 浅井裕子 & 河越重義', '天谷大輔', 'Аркона' + names = [ 'escape me [please?]', '上野洋子, 吉野裕司, 浅井裕子 & 河越重義', '天谷大輔', 'Аркона', + 'let\'s nest', + 'let\'s nest/a level', + 'let\'s nest/a level/[or two]' ] for dirname in 
names: create_file('xxx', '{}/b'.format(dirname)) - head, *data, footer = run_rmlint('-S a -o sh:{t}/rmlint.sh'.format(t=TESTDIR_NAME)) + head, *data, footer = run_rmlint('-S a -T df -o sh:{t}/rmlint.sh'.format(t=TESTDIR_NAME)) assert footer['duplicate_sets'] == 1 assert footer['total_lint_size'] == 3 * len(names) From fe7b025328dd421a488e32bdad4947c16475630f Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 4 Jul 2017 20:48:31 +1000 Subject: [PATCH 014/180] sh: restore sahib's trailing spaces and make them a bit more visible --- lib/formats/sh.c.in | 2 +- lib/formats/sh.sh | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/formats/sh.c.in b/lib/formats/sh.c.in index 82d6885e..b9ee2297 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -432,7 +432,7 @@ static void rm_fmt_foot(_UNUSED RmSession *session, RmFmtHandler *parent, FILE * char *escaped_path = rm_fmt_sh_escape_path(parent->path); fprintf(out, SH_SCRIPT_TEMPLATE_FOOT, "rm -f", escaped_path); - const char progress_marker_text[] = "PROGRESS_TOTAL="; + const char progress_marker_text[] = "PROGRESS_TOTAL=\""; char *progress_marker = strstr(SH_SCRIPT_TEMPLATE_HEAD, progress_marker_text); if(progress_marker != NULL) { gsize offset = (progress_marker - SH_SCRIPT_TEMPLATE_HEAD) + diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index d9cbcfdc..e281e596 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -1,7 +1,7 @@ #!/bin/sh PROGRESS_CURR=0 -PROGRESS_TOTAL= +PROGRESS_TOTAL=" " # This file was autowritten by rmlint # rmlint was executed from: %s From f72e7550df52e227585a33b00a1362b5caa54470 Mon Sep 17 00:00:00 2001 From: Chris P Date: Tue, 4 Jul 2017 22:34:11 +0200 Subject: [PATCH 015/180] py: Use fwrite instead of fprintf --- lib/formats/py.c.in | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/lib/formats/py.c.in b/lib/formats/py.c.in index 3daaa657..5ca926e2 100644 --- a/lib/formats/py.c.in +++ b/lib/formats/py.c.in @@ -52,7 +52,11 @@ typedef 
struct RmFmtHandlerPy { static void rm_fmt_head(RmSession *session, RmFmtHandler *parent, FILE *out) { RmFmtHandlerPy *self = (RmFmtHandlerPy *)parent; - fprintf(out, "%s", PY_SOURCE); + if(fwrite(out, 1, sizeof(PY_SOURCE), out) <= 0) { + rm_log_perror("Failed to write python script"); + return; + } + if(fchmod(fileno(out), S_IRUSR | S_IWUSR | S_IXUSR) == -1) { rm_log_perror("Could not chmod +x python-script"); } From 75660213d0bd8f9d89c1078c6bef3a866be432ba Mon Sep 17 00:00:00 2001 From: Chris P Date: Tue, 4 Jul 2017 22:40:15 +0200 Subject: [PATCH 016/180] py: fix wrong usage of fwrite... --- lib/formats/py.c.in | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/formats/py.c.in b/lib/formats/py.c.in index 5ca926e2..3decb309 100644 --- a/lib/formats/py.c.in +++ b/lib/formats/py.c.in @@ -31,7 +31,7 @@ #include #include -static const char *PY_SOURCE = "<>"; +static const char PY_SOURCE[] = "<>"; typedef struct RmFmtHandlerPy { /* must be first */ @@ -52,7 +52,7 @@ typedef struct RmFmtHandlerPy { static void rm_fmt_head(RmSession *session, RmFmtHandler *parent, FILE *out) { RmFmtHandlerPy *self = (RmFmtHandlerPy *)parent; - if(fwrite(out, 1, sizeof(PY_SOURCE), out) <= 0) { + if(fwrite(PY_SOURCE, 1, sizeof(PY_SOURCE), out) <= 0) { rm_log_perror("Failed to write python script"); return; } From 84fc53d7b3755da50ddea2fd6cb1905cab536898 Mon Sep 17 00:00:00 2001 From: Chris P Date: Tue, 4 Jul 2017 22:44:42 +0200 Subject: [PATCH 017/180] cmdline: Make -O not start from beginning; like documented in the docs --- lib/cmdline.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index 2e43c3e7..a1c65a59 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1272,8 +1272,7 @@ static bool rm_cmd_set_outputs(RmSession *session, GError **error) { g_set_error(error, RM_ERROR_QUARK, 0, _("Specifiyng both -o and -O is not allowed")); return false; - } else if(session->output_cnt[0] < 0 && session->output_cnt[1] < 0 && - 
!rm_fmt_len(session->formats)) { + } else if(session->output_cnt[0] < 0 && session->cfg->progress_enabled == false) { rm_cmd_set_default_outputs(session); } From e29ff416b920f8874e82ccd3a4506695a42cb5c0 Mon Sep 17 00:00:00 2001 From: Chris P Date: Tue, 4 Jul 2017 22:50:17 +0200 Subject: [PATCH 018/180] sh: Bring back the "invisible" spaces to avoid having big strings in the user script --- lib/formats/sh.c.in | 2 +- lib/formats/sh.sh | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/formats/sh.c.in b/lib/formats/sh.c.in index b9ee2297..82d6885e 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -432,7 +432,7 @@ static void rm_fmt_foot(_UNUSED RmSession *session, RmFmtHandler *parent, FILE * char *escaped_path = rm_fmt_sh_escape_path(parent->path); fprintf(out, SH_SCRIPT_TEMPLATE_FOOT, "rm -f", escaped_path); - const char progress_marker_text[] = "PROGRESS_TOTAL=\""; + const char progress_marker_text[] = "PROGRESS_TOTAL="; char *progress_marker = strstr(SH_SCRIPT_TEMPLATE_HEAD, progress_marker_text); if(progress_marker != NULL) { gsize offset = (progress_marker - SH_SCRIPT_TEMPLATE_HEAD) + diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index e281e596..320bf68b 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -1,7 +1,7 @@ #!/bin/sh PROGRESS_CURR=0 -PROGRESS_TOTAL=" " +PROGRESS_TOTAL= # This file was autowritten by rmlint # rmlint was executed from: %s From 899229dfe2008ca5fa3dc71c8a9eee598ae4c1c0 Mon Sep 17 00:00:00 2001 From: Chris P Date: Tue, 4 Jul 2017 23:14:38 +0200 Subject: [PATCH 019/180] sh: Fallback to PROGRESS_TOTAL=0 when seek fails (thanks @SeeSpotRun) --- lib/formats/sh.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index 320bf68b..7d69cf5e 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -1,7 +1,7 @@ #!/bin/sh PROGRESS_CURR=0 -PROGRESS_TOTAL= +PROGRESS_TOTAL=0 # This file was autowritten by rmlint # rmlint was executed from: %s From 
83759d2ec66c1b8c2dead048af0ca3ea47311692 Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Tue, 11 Jul 2017 18:03:29 -0400 Subject: [PATCH 020/180] py: Fix dryrun message --- lib/formats/py.py | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index a4c1be5e..91099ec0 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -259,15 +259,17 @@ def main(args, data): sys.exit(-1) try: - print('# This is a dry run. Nothing will be modified.') + if args.dry_run: + print('{c[green]}#{c[reset]} ' + 'This is a dry run. Nothing will be modified.'.format( + c=COLORS)) + for json_doc in json_docus: main(args, json_doc) if args.dry_run: - print( - '\n{c[green]}#{c[reset]} This was a dry run. Nothing was modified.'.format( - c=COLORS - ) - ) + print('{c[green]}#{c[reset]} ' + 'This was a dry run. Nothing was modified.'.format( + c=COLORS)) except KeyboardInterrupt: print('\ncanceled.') From 5a0aa678b32ffdc8185a2dc27b3ee1ac59ae571b Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Tue, 11 Jul 2017 20:23:45 -0400 Subject: [PATCH 021/180] py: Refactor progress and message output --- lib/formats/py.py | 53 ++++++++++++++++++++++++----------------------- 1 file changed, 27 insertions(+), 26 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index 91099ec0..753f742d 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -155,6 +155,23 @@ def exec_operation(item, original=None, args=None): file=sys.stderr ) +MESSAGES = { + 'duplicate_dir': '{c[yellow]}Deleting duplicate directory:', + 'duplicate_file': '{c[yellow]}Deleting duplicate:', + 'unfinished_cksum': 'checking', + 'emptydir': '{c[green]}Deleting empty directory:', + 'emptyfile': '{c[green]}Deleting empty file:', + 'nonstripped': '{c[green]}Stripping debug symbols:', + 'badlink': '{c[green]}Deleting bad symlink:', + 'baduid': '{c[green]}chown {u}', + 'badgid': '{c[green]}chgrp {g}', + 'badugid': '{c[green]}chown {u}:{g}', +} + +ORIGINAL_MESSAGES 
= { + 'duplicate_file': '{c[green]}Keeping original: ', + 'duplicate_dir': '{c[green]}Keeping original directory: ', +} def main(args, data): seen_cksums = set() @@ -177,37 +194,21 @@ def main(args, data): file=sys.stderr) sys.stdin.read(1) - MESSAGES = { - 'duplicate_dir': '{c[yellow]}Deleting duplicate directory'.format(c=COLORS), - 'duplicate_file': '{c[yellow]}Deleting duplicate:'.format(c=COLORS), - "unfinished_cksum": "checking", - 'emptydir': '{c[green]}Deleting empty directory:'.format(c=COLORS), - 'emptyfile': '{c[green]}Deleting empty file:'.format(c=COLORS), - 'nonstripped': '{c[green]}Stripping debug symbols:'.format(c=COLORS), - 'badlink': '{c[green]}Deleting bad symlink:'.format(c=COLORS), - 'baduid': '{c[green]}chown {u}'.format(c=COLORS, u=args.user), - 'badgid': '{c[green]}chgrp {g}'.format(c=COLORS, g=args.group), - 'badugid': '{c[green]}chown {u}:{g}'.format(c=COLORS, u=args.user, g=args.group), - } - for item in data: - if item['type'].startswith('duplicate_') and item['is_original']: - print('{c[blue]}[{prog:3}%]{c[reset]} ' - '{c[green]}Keeping original: {c[reset]}{path}'.format( - prog=item['progress'], path=item['path'], c=COLORS) - ) - last_original_item = item + progress_prefix = '{c[blue]}[{p:3}%]{c[reset]} '.format( + c=COLORS, p=item['progress']) + if item['is_original']: + msg = ORIGINAL_MESSAGES[item['type']].format(c=COLORS) + print('{prog}{v}{c[reset]} {path}'.format( + c=COLORS, prog=progress_prefix, v=msg, path=item['path'])) + last_original_item = item # Do not handle originals. 
continue - print('{c[blue]}[{prog:3}%]{c[reset]} {v}{c[reset]} {p}'.format( - c=COLORS, - prog=item['progress'], - v=MESSAGES[item['type']], - p=item['path'], - ) - ) + msg = MESSAGES[item['type']].format(c=COLORS, u=args.user, g=args.group) + print('{prog}{v}{c[reset]} {path}'.format( + c=COLORS, prog=progress_prefix, v=msg, path=item['path'])) exec_operation(item, original=last_original_item, args=args) print('{c[blue]}[100%] Done!{c[reset]}'.format(c=COLORS)) From 5af12bcef4b0a2a9fc0f3b1c52d6d3bf4f8a524e Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Tue, 11 Jul 2017 22:02:16 -0400 Subject: [PATCH 022/180] py: clean up variables --- lib/formats/py.py | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index 753f742d..3957bf3e 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -116,17 +116,17 @@ def handle_badlink(path, **kwargs): def handle_baduid(path, **kwargs): if not args.dry_run: - os.chown(path, kwargs['args'].user, -1) + os.chown(path, args.user, -1) def handle_badgid(path, **kwargs): if not args.dry_run: - os.chown(path, -1, kwargs['args'].group) + os.chown(path, -1, args.group) def handle_badugid(path, **kwargs): if not args.dry_run: - os.chown(path, kwargs['args'].user, kwargs['args'].group) + os.chown(path, args.user, args.group) OPERATIONS = { @@ -174,7 +174,6 @@ def exec_operation(item, original=None, args=None): } def main(args, data): - seen_cksums = set() last_original_item = None # Process header and footer, if present @@ -220,7 +219,7 @@ def main(args, data): ) parser.add_argument( - 'json_docs', metavar='json_doc', nargs='*', default=['.rmlint.json'], + 'json_files', metavar='json_file', nargs='*', default=['.rmlint.json'], help='A JSON output of rmlint to handle (can be given multiple times)' ) parser.add_argument( @@ -246,12 +245,12 @@ def main(args, data): ) args = parser.parse_args() - json_docus = [] - for doc in args.json_docs: + json_docs = [] + for 
json_file in args.json_files: try: - with open(doc) as f: + with open(json_file) as f: j = json.load(f) - json_docus.append(j) + json_docs.append(j) except IOError as err: # Cannot open file print(err, file=sys.stderr) sys.exit(-1) @@ -265,7 +264,7 @@ def main(args, data): 'This is a dry run. Nothing will be modified.'.format( c=COLORS)) - for json_doc in json_docus: + for json_doc in json_docs: main(args, json_doc) if args.dry_run: From 7560b5adfd00d02f8d99d90c52900895aa16a285 Mon Sep 17 00:00:00 2001 From: hungrywolf27 Date: Tue, 11 Jul 2017 22:03:43 -0400 Subject: [PATCH 023/180] py: clean up for pep8 and pylint --- lib/formats/py.py | 48 ++++++++++++++++++++++------------------------- 1 file changed, 22 insertions(+), 26 deletions(-) diff --git a/lib/formats/py.py b/lib/formats/py.py index 3957bf3e..ebf2c85b 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -57,20 +57,17 @@ def original_check(path, original, be_paranoid=True): stat_p, stat_o = os.stat(path), os.stat(original) if (stat_p.st_dev, stat_p.st_ino) == (stat_o.st_dev, stat_o.st_ino): print('{c[red]}Same inode; ignoring:{c[reset]} {o} <=> {p}'.format( - c=COLORS, o=original, p=path - )) + c=COLORS, o=original, p=path)) return False if stat_p.st_size != stat_o.st_size: - print('{c[red]}Size differs; ignoring:{c[reset]} {o} <=> {p}'.format( - c=COLORS, o=original, p=path - )) + print('{c[red]}Size differs; ignoring:{c[reset]} ' + '{o} <=> {p}'.format(c=COLORS, o=original, p=path)) return False if be_paranoid and not filecmp.cmp(path, original): - print('{c[red]}Content differs; ignoring:{c[reset]} {o} <=> {p}'.format( - c=COLORS, o=original, p=path - )) + print('{c[red]}Content differs; ignoring:{c[reset]} ' + '{o} <=> {p}'.format(c=COLORS, o=original, p=path)) return False return True @@ -143,17 +140,14 @@ def handle_badugid(path, **kwargs): } - def exec_operation(item, original=None, args=None): try: - OPERATIONS[item['type']](item['path'], original=original, item=item, args=args) + 
OPERATIONS[item['type']]( + item['path'], original=original, item=item, args=args) except OSError as err: - print( - '{c[red]}# {err}{c[reset]}'.format( - item=item, err=err, c=COLORS - ), - file=sys.stderr - ) + print('{c[red]}# {err}{c[reset]}'.format( + item=item, err=err, c=COLORS), file=sys.stderr) + MESSAGES = { 'duplicate_dir': '{c[yellow]}Deleting duplicate directory:', @@ -173,6 +167,7 @@ def exec_operation(item, original=None, args=None): 'duplicate_dir': '{c[green]}Keeping original directory: ', } + def main(args, data): last_original_item = None @@ -182,15 +177,14 @@ def main(args, data): header = data.pop(0) if data[-1].get('total_files'): footer = data.pop(-1) - # TODO: Print header and footer data here before asking for confirmation if not args.no_ask and not args.dry_run: print('rmlint was executed in the following way:\n', - header.get('args'), - '\n\nPress Enter to continue and perform modifications, ' - 'or CTRL-C to exit.' - '\nExecute this script with -d to disable this message.', - file=sys.stderr) + header.get('args'), + '\n\nPress Enter to continue and perform modifications, ' + 'or CTRL-C to exit.' + '\nExecute this script with -d to disable this message.', + file=sys.stderr) sys.stdin.read(1) for item in data: @@ -205,7 +199,8 @@ def main(args, data): # Do not handle originals. continue - msg = MESSAGES[item['type']].format(c=COLORS, u=args.user, g=args.group) + msg = MESSAGES[item['type']].format( + c=COLORS, u=args.user, g=args.group) print('{prog}{v}{c[reset]} {path}'.format( c=COLORS, prog=progress_prefix, v=msg, path=item['path'])) exec_operation(item, original=last_original_item, args=args) @@ -224,8 +219,8 @@ def main(args, data): ) parser.add_argument( '-n', '--dry-run', action='store_true', - help='Do not perform any modifications, just print what would be done. ' + - '(implies -d)' + help='Do not perform any modifications, just print what would be ' + 'done. 
(implies -d)' ) parser.add_argument( '-d', '--no-ask', action='store_true', default=False, @@ -233,7 +228,8 @@ def main(args, data): ) parser.add_argument( '-p', '--paranoid', action='store_true', default=False, - help='Recheck that files are still identical before removing duplicates.' + help='Recheck that files are still identical before removing ' + 'duplicates.' ) parser.add_argument( '-u', '--user', type=int, default=CURRENT_UID, From 340ec056cdd6cd58884963e6a32f3945b440b841 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 3 Jul 2017 10:25:05 +1000 Subject: [PATCH 024/180] tests: add tests for rmlint --btrfs-clone --- .travis.yml | 2 + test-requirements.txt | 1 + tests/test_mains/test_clone.py | 132 +++++++++++++++++++++++++++++++++ tests/utils.py | 27 +++++-- 4 files changed, 156 insertions(+), 6 deletions(-) create mode 100644 tests/test_mains/test_clone.py diff --git a/.travis.yml b/.travis.yml index 2010cf7c..9c471fbd 100644 --- a/.travis.yml +++ b/.travis.yml @@ -1,8 +1,10 @@ language: c + install: - sudo apt-get update - sudo apt-get install python3-sphinx gettext python3-setuptools - sudo apt-get install libblkid-dev libelf-dev libglib2.0-dev libjson-glib-dev + - sudo apt-get install clang - sudo easy_install3 $(cat test-requirements.txt) compiler: diff --git a/test-requirements.txt b/test-requirements.txt index 4fce00c3..44df813d 100644 --- a/test-requirements.txt +++ b/test-requirements.txt @@ -1,2 +1,3 @@ nose==1.3.7 parameterized==0.6.1 +psutil==5.2.2 diff --git a/tests/test_mains/test_clone.py b/tests/test_mains/test_clone.py new file mode 100644 index 00000000..f7139608 --- /dev/null +++ b/tests/test_mains/test_clone.py @@ -0,0 +1,132 @@ +#!/usr/bin/env python3 +# encoding: utf-8 + +from nose import with_setup +from nose.tools import make_decorator +from nose.plugins.skip import SkipTest +from contextlib import contextmanager +import psutil + +from tests.utils import * + + +@contextmanager +def assert_exit_code(status_code): + """ + Assert 
that the with block yields a subprocess.CalledProcessError + with a certain return code. If nothing is thrown, status_code + is required to be 0 to survive the test. + """ + try: + yield + except subprocess.CalledProcessError as exc: + assert exc.returncode == status_code + else: + # No exception? status_code should be fine. + assert status_code == 0 + + +def is_btrfs(path): + parts = psutil.disk_partitions(all=True) + + # iterate up from `path` until mountpoint found + p = path + while 1: + match = next((x for x in parts if x.mountpoint == p), None) + if (match): + print("{0} is {1} mounted at {2}".format(path, match.fstype, p)) + return (match.fstype == 'btrfs') + + if (p == '/'): + # probably should never get here... + print("no mountpoint found for {0}".format(path)) + return False + p = os.path.dirname(p) + + +# decorator for tests dependent on btrfs testdir +def needs_btrfs(test): + def no_support(*args): + raise SkipTest("btrfs not supported") + + def not_btrfs(*args): + raise SkipTest("testdir is not on btrfs filesystem") + + if not has_feature('btrfs-support'): + return make_decorator(test)(no_support) + elif not is_btrfs(TESTDIR_NAME): + return make_decorator(test)(not_btrfs) + else: + return test + + +@needs_btrfs +@with_setup(usual_setup_func, usual_teardown_func) +def test_equal_files(): + path_a = create_file('1234', 'a') + path_b = create_file('1234', 'b') + + with assert_exit_code(0): + head, *data, footer = run_rmlint( + '--btrfs-clone', + path_a, path_b, + use_default_dir=False, + with_json=False, + verbosity="") + + with assert_exit_code(0): + head, *data, footer = run_rmlint( + '--btrfs-clone', + path_a, '//', path_b, + use_default_dir=False, + with_json=False) + + +@needs_btrfs +@with_setup(usual_setup_func, usual_teardown_func) +def test_different_files(): + path_a = create_file('1234', 'a') + path_b = create_file('4321', 'b') + + with assert_exit_code(1): + head, *data, footer = run_rmlint( + '--btrfs-clone', + path_a, path_b, + 
use_default_dir=False, + with_json=False, + verbosity="") + + +@needs_btrfs +@with_setup(usual_setup_func, usual_teardown_func) +def test_bad_arguments(): + path_a = create_file('1234', 'a') + path_b = create_file('1234', 'b') + path_c = create_file('1234', 'c') + for paths in [ + path_a, + ' '.join((path_a, path_b, path_c)), + ' '.join((path_a, path_a + ".nonexistent")) + ]: + with assert_exit_code(1): + head, *data, footer = run_rmlint( + '--btrfs-clone', + paths, + use_default_dir=False, + with_json=False, + verbosity="") + + +@needs_btrfs +@with_setup(usual_setup_func, usual_teardown_func) +def test_directories(): + path_a = os.path.dirname(create_dirs('dir_a')) + path_b = os.path.dirname(create_dirs('dir_b')) + + with assert_exit_code(1): + head, *data, footer = run_rmlint( + '--btrfs-clone', + path_a, path_b, + use_default_dir=False, + with_json=False, + verbosity="") diff --git a/tests/utils.py b/tests/utils.py index 950adb7d..1c243df9 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -64,7 +64,14 @@ def has_feature(feature): ).decode('utf-8') -def run_rmlint_once(*args, dir_suffix=None, use_default_dir=True, outputs=None, directly_return_output=False, use_shell=False): +def run_rmlint_once(*args, + dir_suffix=None, + use_default_dir=True, + outputs=None, + with_json=True, + directly_return_output=False, + use_shell=False, + verbosity="-V"): if use_default_dir: if dir_suffix: target_dir = os.path.join(TESTDIR_NAME, dir_suffix) @@ -87,10 +94,11 @@ def run_rmlint_once(*args, dir_suffix=None, use_default_dir=True, outputs=None, env, cmd = {}, [] cmd += [ - './rmlint', target_dir, '-V', - ] + shlex.split(' '.join(args)) + [ - '-o', 'json:/tmp/out.json', '-c', 'json:oneline' - ] + './rmlint', target_dir, verbosity, + ] + shlex.split(' '.join(args)) + + if with_json: + cmd += ['-o', 'json:/tmp/out.json', '-c', 'json:oneline'] for idx, output in enumerate(outputs or []): cmd.append('-o') @@ -254,7 +262,14 @@ def run_rmlint(*args, force_no_pendantic=False, 
**kwargs): def create_dirs(path): - os.makedirs(os.path.join(TESTDIR_NAME, path)) + full_path = os.path.join(TESTDIR_NAME, path) + + try: + os.makedirs(full_path) + except OSError: + pass + + return full_path def create_link(path, target, symlink=False): From 56a48d916f8ace21c9395c7753668114fd7408fc Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 3 Jul 2017 12:27:03 +1000 Subject: [PATCH 025/180] tests: fix test fail if run on btrfs (made reflinks, not hardlinks) --- tests/test_formatters/test_sh.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/test_formatters/test_sh.py b/tests/test_formatters/test_sh.py index 8704fbf4..b03ba08e 100644 --- a/tests/test_formatters/test_sh.py +++ b/tests/test_formatters/test_sh.py @@ -141,7 +141,7 @@ def test_hardlink_duplicate_directories(shell): sh_path = os.path.join(TESTDIR_NAME, "result.sh") header, *data, footer = run_rmlint( - "-D -S a -c sh:link -o sh:{}".format(sh_path), + "-D -S a -c sh:hardlink -o sh:{}".format(sh_path), ) assert len(data) == 2 assert data[0]["path"].endswith("dir_a") From f68eaa32011a0443e6de5b54074ab6971bd70771 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 16 Jul 2017 17:29:44 +1000 Subject: [PATCH 026/180] sh: make comments match actual link priority --- lib/formats/sh.c.in | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/formats/sh.c.in b/lib/formats/sh.c.in index 82d6885e..62961fc7 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -276,16 +276,16 @@ static void rm_fmt_head(RmSession *session, RmFmtHandler *parent, FILE *out) { /* user specified handlers */ rm_sh_parse_handlers(self, handler_cfg); } else if(rm_fmt_get_config_value(session->formats, "sh", "clone") != NULL) { - /* Preset: try clone, then reflinks, then symlinks then hardlinks */ + /* Preset: try clone, then reflinks, then hardlinks then symlinks */ rm_sh_parse_handlers(self, "clone,reflink,hardlink,symlink"); } else 
if(rm_fmt_get_config_value(session->formats, "sh", "link") != NULL) { - /* Preset: try reflinks, then symlinks then hardlinks */ + /* Preset: try reflinks, then hardlinks then symlinks */ rm_sh_parse_handlers(self, "reflink,hardlink,symlink"); } else if(rm_fmt_get_config_value(session->formats, "sh", "hardlink") != NULL) { - /* Preset: try symlinks before using hardlinks */ + /* Preset: try hardlinks before using symlinks */ rm_sh_parse_handlers(self, "hardlink,symlink"); } else if(rm_fmt_get_config_value(session->formats, "sh", "symlink") != NULL) { - /* Preset: only do hardlinks */ + /* Preset: only do symlinks */ rm_sh_parse_handlers(self, "symlink"); } else { /* Default: remove the file */ From 6f59bde62ca326ed366934a8f1241309d3d05a67 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 16 Jul 2017 14:29:58 +1000 Subject: [PATCH 027/180] session: move btrfs clone main to session and make exit code meaningful --- lib/cfg.h | 4 ++ lib/cmdline.c | 115 ++------------------------------------------------ lib/session.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++- lib/session.h | 14 ++++-- src/rmlint.c | 6 ++- 5 files changed, 136 insertions(+), 118 deletions(-) diff --git a/lib/cfg.h b/lib/cfg.h index 0d986287..71bb94e4 100644 --- a/lib/cfg.h +++ b/lib/cfg.h @@ -151,6 +151,10 @@ typedef struct RmCfg { * (or directories) */ gboolean run_equal_mode; + /* for --btrfs-clone option */ + bool btrfs_clone; + bool btrfs_readonly; + } RmCfg; /** diff --git a/lib/cmdline.c b/lib/cmdline.c index a1c65a59..dfbd844b 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -233,105 +233,6 @@ static int rm_cmd_maybe_switch_to_hasher(int argc, const char **argv) { return EXIT_SUCCESS; } -static void rm_cmd_btrfs_clone_usage(void) { - rm_log_error(_("Usage: rmlint --btrfs-clone [-r] source dest\n")); -} - -static void rm_cmd_btrfs_clone(const char *source, const char *dest, - const gboolean read_only) { -#if HAVE_BTRFS_H - struct { - struct btrfs_ioctl_same_args args; -
struct btrfs_ioctl_same_extent_info info; - } extent_same; - memset(&extent_same, 0, sizeof(extent_same)); - - int source_fd = rm_sys_open(source, O_RDONLY); - if(source_fd < 0) { - rm_log_error_line(_("btrfs clone: failed to open source file")); - return; - } - - extent_same.info.fd = rm_sys_open(dest, read_only ? O_RDONLY : O_RDWR); - if(extent_same.info.fd < 0) { - rm_log_error_line(_("btrfs clone: error %i: failed to open dest file.%s"), - errno, - read_only ? "" : _("\n\t(if target is a read-only snapshot " - "then -r option is required)")); - rm_sys_close(source_fd); - return; - } - - struct stat source_stat; - fstat(source_fd, &source_stat); - - guint64 bytes_deduped = 0; - gint64 bytes_remaining = source_stat.st_size; - int ret = 0; - while(bytes_deduped < (guint64)source_stat.st_size && ret == 0 && - extent_same.info.status == 0 && bytes_remaining) { - extent_same.args.dest_count = 1; - extent_same.args.logical_offset = bytes_deduped; - extent_same.info.logical_offset = bytes_deduped; - - /* BTRFS_IOC_FILE_EXTENT_SAME has an internal limit at 16MB */ - extent_same.args.length = MIN(16 * 1024 * 1024, bytes_remaining); - if(extent_same.args.length == 0) { - extent_same.args.length = bytes_remaining; - } - - ret = ioctl(source_fd, BTRFS_IOC_FILE_EXTENT_SAME, &extent_same); - if(ret == 0 && extent_same.info.status == 0) { - bytes_deduped += extent_same.info.bytes_deduped; - bytes_remaining -= extent_same.info.bytes_deduped; - } - } - - rm_sys_close(source_fd); - rm_sys_close(extent_same.info.fd); - - if(ret < 0) { - ret = errno; - rm_log_error_line(_("BTRFS_IOC_FILE_EXTENT_SAME returned error: (%d) %s"), ret, - strerror(ret)); - } else if(extent_same.info.status == -22 && read_only && getuid()) { - rm_log_error_line(_("Need to run as root user to clone to a read-only snapshot")); - } else if(extent_same.info.status < 0) { - rm_log_error_line(_("BTRFS_IOC_FILE_EXTENT_SAME returned status %d for file %s"), - extent_same.info.status, dest); - } else 
if(bytes_remaining > 0) { - rm_log_info_line(_("Files don't match - not cloned")); - } -#else - (void)source; - (void)dest; - (void)read_only; - rm_log_error_line(_("rmlint was not compiled with btrfs support.")) - -#endif -} - -static int rm_cmd_maybe_btrfs_clone(RmSession *session, int argc, const char **argv) { - if(argc > 0 && g_strcmp0("--btrfs-clone", argv[1]) == 0) { - /* treat as a btrfs clone subcommand... */ - if(!rm_session_check_kernel_version(session, 4, 2)) { - rm_log_warning_line("This needs at least linux >= 4.2."); - } else if(argc == 5 && g_strcmp0("-r", argv[2]) == 0) { - /* -r option for deduping read-only snapshots */ - /* TODO: add check for root user permissions */ - rm_cmd_btrfs_clone(argv[3], argv[4], TRUE); - } else if(argc == 4) { - rm_cmd_btrfs_clone(argv[2], argv[3], FALSE); - } else { - /* malformed command */ - rm_cmd_btrfs_clone_usage(); - } - /* return EXIT_FAILURE to indicate not to go ahead with main rmlint call */ - return EXIT_FAILURE; - } - return EXIT_SUCCESS; -} - /* clang-format off */ static const struct FormatSpec { const char *id; @@ -1301,7 +1202,6 @@ static char * rm_cmd_find_own_executable_path(RmSession *session, char **argv) { /* Parse the commandline and set arguments in 'settings' (glob. var accordingly) */ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { RmCfg *cfg = session->cfg; - gboolean clone = FALSE; /* Handle --gui before all other processing, * since we need to pass other args to the python interpreter. 
@@ -1316,10 +1216,6 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { return false; } - if(rm_cmd_maybe_btrfs_clone(session, argc, (const char **)argv) == EXIT_FAILURE) { - return false; - } - /* List of paths we got passed (or NULL) */ char **paths = NULL; @@ -1378,14 +1274,15 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"no-hardlinked" , 'L' , DISABLE , G_OPTION_ARG_NONE , &cfg->find_hardlinked_dupes , _("Ignore hardlink twins") , NULL} , {"partial-hidden" , 0 , EMPTY , G_OPTION_ARG_CALLBACK , FUNC(partial_hidden) , _("Find hidden files in duplicate folders only") , NULL} , {"mtime-window" , 'Z' , 0 , G_OPTION_ARG_DOUBLE , &cfg->mtime_window , _("Consider duplicates only equal when mtime differs at max. T seconds") , "T"} , + {"btrfs-clone" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->btrfs_clone , _("Clone extents from source to dest, if extents match") , NULL} , + {"btrfs-readonly" , 'r' , 0 , G_OPTION_ARG_NONE , &cfg->btrfs_readonly , _("(btrfs-clone option) also clone to read-only snapshots (needs root)") , NULL} , /* Callback */ {"show-man" , 'H' , EMPTY , G_OPTION_ARG_CALLBACK , rm_cmd_show_manpage , _("Show the manpage") , NULL} , {"version" , 0 , EMPTY , G_OPTION_ARG_CALLBACK , rm_cmd_show_version , _("Show the version & features") , NULL} , /* Dummy option for --help output only: */ {"gui" , 0 , 0 , G_OPTION_ARG_NONE , NULL , _("If installed, start the optional gui with all following args") , NULL}, - {"hash" , 0 , 0 , G_OPTION_ARG_NONE , NULL , _("Work like sha1sum for all supported hash algorithms (see also --hash --help)") , NULL} , - {"btrfs-clone" , 0 , 0 , G_OPTION_ARG_NONE , &clone , _("Clone extents from source to dest, if extents match") , NULL} , + {"hash" , 0 , 0 , G_OPTION_ARG_NONE , NULL , _("Work like sha1sum for all supported hash algorithms (see also --hash --help)") , NULL}, /* Special case: accumulate leftover args (paths) in &paths */ {G_OPTION_REMAINING , 0 , 0 , G_OPTION_ARG_FILENAME_ARRAY , 
&paths , "" , NULL} , @@ -1499,12 +1396,6 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { goto failure; } - if(clone) { - /* should not get here */ - rm_cmd_btrfs_clone_usage(); - session->cmdline_parse_error = TRUE; - } - /* Silent fixes of invalid numeric input */ cfg->threads = CLAMP(cfg->threads, 1, 128); cfg->depth = CLAMP(cfg->depth, 1, PATH_MAX / 2 + 1); diff --git a/lib/session.c b/lib/session.c index 3dd15b59..23ffbf38 100644 --- a/lib/session.c +++ b/lib/session.c @@ -33,6 +33,11 @@ #include "session.h" #include "traverse.h" +#if HAVE_BTRFS_H +#include +#include +#endif + #if HAVE_UNAME #include "sys/utsname.h" @@ -99,8 +104,8 @@ void rm_session_init(RmSession *session, RmCfg *cfg) { session->timer_since_proc_start = g_timer_new(); g_timer_start(session->timer_since_proc_start); - /* Assume that files are not equal */ - session->equal_exit_code = EXIT_FAILURE; + /* Assume that files are not equal */ + session->equal_exit_code = EXIT_FAILURE; } void rm_session_clear(RmSession *session) { @@ -160,3 +165,109 @@ bool rm_session_was_aborted() { return rc; } +/** + * *********** btrfs clone session main ************ + **/ +int rm_session_btrfs_clone_main(RmSession *session) { + RmCfg *cfg = session->cfg; + if(cfg->path_count != 2) { + rm_log_error(_("Usage: rmlint --btrfs-clone [-r] [-v|V] source dest\n")); + return EXIT_FAILURE; + } + + if(!rm_session_check_kernel_version(session, 4, 2)) { + rm_log_warning_line("This needs at least linux >= 4.2."); + return EXIT_FAILURE; + } + + /* TODO: if kernel version >= 4.5 then use IOCTL-FIDEDUPERANGE + * http://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html + */ +#if HAVE_BTRFS_H + + g_assert(cfg->paths); + RmPath *dest = cfg->paths->data; + g_assert(cfg->paths->next); + RmPath *source = cfg->paths->next->data; + rm_log_debug_line("Cloning %s -> %s", source->path, dest->path); + + struct { + struct btrfs_ioctl_same_args args; + struct btrfs_ioctl_same_extent_info info; + } extent_same; + 
memset(&extent_same, 0, sizeof(extent_same)); + + int source_fd = rm_sys_open(source->path, O_RDONLY); + if(source_fd < 0) { + rm_log_error_line(_("btrfs clone: failed to open source file")); + return EXIT_FAILURE; + } + + extent_same.info.fd = + rm_sys_open(dest->path, cfg->btrfs_readonly ? O_RDONLY : O_RDWR); + if(extent_same.info.fd < 0) { + rm_log_error_line( + _("btrfs clone: error %i: failed to open dest file.%s"), + errno, + cfg->btrfs_readonly ? "" : _("\n\t(if target is a read-only snapshot " + "then -r option is required)")); + rm_sys_close(source_fd); + return EXIT_FAILURE; + } + + /* fsync's needed to flush extent mapping */ + fsync(source_fd); + fsync(extent_same.info.fd); + + struct stat source_stat; + fstat(source_fd, &source_stat); + + guint64 bytes_deduped = 0; + gint64 bytes_remaining = source_stat.st_size; + int ret = 0; + while(bytes_deduped < (guint64)source_stat.st_size && ret == 0 && + extent_same.info.status == 0 && bytes_remaining) { + extent_same.args.dest_count = 1; + extent_same.args.logical_offset = bytes_deduped; + extent_same.info.logical_offset = bytes_deduped; + + /* try to dedupe the rest of the file */ + extent_same.args.length = bytes_remaining; + + ret = ioctl(source_fd, BTRFS_IOC_FILE_EXTENT_SAME, &extent_same); + bytes_deduped += extent_same.info.bytes_deduped; + bytes_remaining -= extent_same.info.bytes_deduped; + rm_log_debug_line("deduped %lu bytes...", bytes_deduped); + } + + rm_sys_close(source_fd); + rm_sys_close(extent_same.info.fd); + + if(ret >= 0 && bytes_remaining == 0) { + return EXIT_SUCCESS; + } + + if(ret < 0) { + ret = errno; + rm_log_error_line(_("BTRFS_IOC_FILE_EXTENT_SAME returned error: (%d) %s"), ret, + strerror(ret)); + } else if(extent_same.info.status == -22 && cfg->btrfs_readonly && getuid()) { + rm_log_error_line(_("Need to run as root user to clone to a read-only snapshot")); + } else if(extent_same.info.status < 0) { + rm_log_error_line(_("BTRFS_IOC_FILE_EXTENT_SAME returned status %d for file %s"), 
+ extent_same.info.status, dest->path); + } else if(bytes_deduped == 0) { + rm_log_info_line(_("Files don't match - not cloned")); + } else if(bytes_remaining > 0) { + rm_log_info_line(_("Only first %lu bytes cloned - files not fully identical"), + bytes_deduped); + } + +#else + (void)cfg; + rm_log_error_line(_("rmlint was not compiled with btrfs support.")) +#endif + + return EXIT_FAILURE; +} + diff --git a/lib/session.h b/lib/session.h index 6bdef655..4b84a2f0 100644 --- a/lib/session.h +++ b/lib/session.h @@ -138,9 +138,9 @@ typedef struct RmSession { /* Version of the linux kernel (0 on other operating systems) */ int kernel_version[2]; - /* When run with --equal this holds the exit code for rmlint - * (the exit code is determined by the _equal formatter) */ - int equal_exit_code; + /* When run with --equal this holds the exit code for rmlint + * (the exit code is determined by the _equal formatter) */ + int equal_exit_code; } RmSession; /** @@ -181,6 +181,14 @@ bool rm_session_was_aborted(void); */ bool rm_session_check_kernel_version(RmSession *session, int major, int minor); +/** + * @brief Trigger rmlint in --btrfs-clone mode. 
+ * + * @return exit_status for exit() + */ +int rm_session_btrfs_clone_main(RmSession *session); + + /* Maybe colors, for use outside of the rm_log macros, * in order to work with the --with-no-color option * diff --git a/src/rmlint.c b/src/rmlint.c index 836261f6..3439f4e6 100644 --- a/src/rmlint.c +++ b/src/rmlint.c @@ -133,7 +133,11 @@ int main(int argc, const char **argv) { /* Parse commandline */ if(rm_cmd_parse_args(argc, (char **)argv, &session) != 0) { /* Do all the real work */ - exit_state = rm_cmd_main(&session); + if(cfg.btrfs_clone) { + exit_state = rm_session_btrfs_clone_main(&session); + } else { + exit_state = rm_cmd_main(&session); + } } rm_session_clear(&session); From a3dd0803703d6ad30fa6adf4bad430a20d7b2d48 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 16 Jul 2017 17:34:26 +1000 Subject: [PATCH 028/180] session: fix logic error in rm_session_check_kernel_version() --- lib/session.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/session.c b/lib/session.c index 23ffbf38..8919d427 100644 --- a/lib/session.c +++ b/lib/session.c @@ -74,7 +74,7 @@ bool rm_session_check_kernel_version(RmSession *session, int major, int minor) { } /* Lower is bad. 
*/ - if(found_major < major || found_minor < minor) { + if(found_major < major || (found_major == major && found_minor < minor)) { return false; } From 835f2fc512ba66912aa2c6787118e1d23cd752e4 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 16 Jul 2017 18:09:14 +1000 Subject: [PATCH 029/180] session: make kernel version check stand-alone --- lib/cfg.h | 14 ++++++------- lib/cmdline.c | 2 +- lib/formats/sh.c.in | 2 +- lib/session.c | 50 +++++++++++++++++++++------------------------ lib/session.h | 5 +---- 5 files changed, 33 insertions(+), 40 deletions(-) diff --git a/lib/cfg.h b/lib/cfg.h index 71bb94e4..4a6d4758 100644 --- a/lib/cfg.h +++ b/lib/cfg.h @@ -111,8 +111,8 @@ typedef struct RmCfg { /* working dir rmlint called from */ char *iwd; - /* Path to the rmlint binary of this run */ - char *full_argv0_path; + /* Path to the rmlint binary of this run */ + char *full_argv0_path; /* the full command line */ char *joined_argv; @@ -146,11 +146,11 @@ typedef struct RmCfg { */ gboolean cache_file_structs; - /* Instead of running in duplicate detection mode, - * check if the passed arguments are equal files - * (or directories) - */ - gboolean run_equal_mode; + /* Instead of running in duplicate detection mode, + * check if the passed arguments are equal files + * (or directories) + */ + gboolean run_equal_mode; /* for --btrfs-clone option */ bool btrfs_clone; bool btrfs_readonly; diff --git a/lib/cmdline.c b/lib/cmdline.c index dfbd844b..c70b80c7 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1584,7 +1584,7 @@ int rm_cmd_main(RmSession *session) { exit_state = EXIT_FAILURE; } - if(exit_state == EXIT_SUCCESS && cfg->run_equal_mode) { + if(exit_state == EXIT_SUCCESS && cfg->run_equal_mode) { return session->equal_exit_code; } diff --git a/lib/formats/sh.c.in b/lib/formats/sh.c.in index 62961fc7..f2aae86d 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -90,7 +90,7 @@ static bool rm_sh_emit_handler_clone(RmFmtHandlerShScript *self, char **out, RmF } 
/* Needs to have at least kernel 4.2 */ - if(!rm_session_check_kernel_version(self->session, 4, 2)) { + if(!rm_session_check_kernel_version(4, 2)) { return false; } diff --git a/lib/session.c b/lib/session.c index 8919d427..4e57f98b 100644 --- a/lib/session.c +++ b/lib/session.c @@ -40,41 +40,39 @@ #if HAVE_UNAME #include "sys/utsname.h" +#endif -void rm_session_read_kernel_version(RmSession *session) { +static gpointer rm_session_read_kernel_version(_UNUSED gpointer arg) { + static int version[2] = {-1, -1}; +#if HAVE_UNAME struct utsname buf; - if(uname(&buf) == -1) { - return; + if(uname(&buf) != -1 && sscanf(buf.release, "%d.%d.*", &version[0], &version[1]) != EOF) { + rm_log_debug_line("Linux kernel version is %d.%d.", version[0], version[1]); + } else { + rm_log_warning_line("Unable to read Linux kernel version"); } - - if(sscanf(buf.release, "%d.%d.*", &session->kernel_version[0], - &session->kernel_version[1]) == EOF) { - session->kernel_version[0] = -1; - session->kernel_version[1] = -1; - return; - } - - rm_log_debug_line("Linux kernel version is %d.%d.", - session->kernel_version[0], - session->kernel_version[1]); -} #else -void rm_session_read_kernel_version(RmSession *session) { - (void)session; -} + rm_log_warning_line( + "rmlint was not compiled with ability to read Linux kernel version"); #endif + return version; +} -bool rm_session_check_kernel_version(RmSession *session, int major, int minor) { - int found_major = session->kernel_version[0]; - int found_minor = session->kernel_version[1]; - /* Could not read kernel version: Assume failure on our side. */ - if(found_major <= 0 && found_minor <= 0) { +bool rm_session_check_kernel_version(int need_major, int need_minor) { + static GOnce once = G_ONCE_INIT; + g_once (&once, rm_session_read_kernel_version, NULL); + int *version = once.retval; + int major = version[0]; + int minor = version[1]; + + if(major < 0 && minor < 0) { + /* Could not read kernel version: Assume failure on our side. 
*/ return true; } /* Lower is bad. */ - if(found_major < major || (found_major == major && found_minor < minor)) { + if(major < need_major || (major == need_major && minor < need_minor)) { return false; } @@ -99,8 +97,6 @@ void rm_session_init(RmSession *session, RmCfg *cfg) { session->offset_fragments = 0; session->offset_fails = 0; - rm_session_read_kernel_version(session); - session->timer_since_proc_start = g_timer_new(); g_timer_start(session->timer_since_proc_start); @@ -175,7 +171,7 @@ int rm_session_btrfs_clone_main(RmSession *session) { return EXIT_FAILURE; } - if(!rm_session_check_kernel_version(session, 4, 2)) { + if(!rm_session_check_kernel_version(4, 2)) { rm_log_warning_line("This needs at least linux >= 4.2."); return EXIT_FAILURE; } diff --git a/lib/session.h b/lib/session.h index 4b84a2f0..04c154d0 100644 --- a/lib/session.h +++ b/lib/session.h @@ -135,9 +135,6 @@ typedef struct RmSession { /* true once traverse finished running */ bool traverse_finished; - /* Version of the linux kernel (0 on other operating systems) */ - int kernel_version[2]; - /* When run with --equal this holds the exit code for rmlint * (the exit code is determined by the _equal formatter) */ int equal_exit_code; @@ -179,7 +176,7 @@ bool rm_session_was_aborted(void); * * @return True if the kernel is recent enough. */ -bool rm_session_check_kernel_version(RmSession *session, int major, int minor); +bool rm_session_check_kernel_version(int need_major, int need_minor); /** * @brief Trigger rmlint in --btrfs-clone mode. 
From 04cc8ead8350b000361776dded23f3503e8d0ebd Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 17 Jul 2017 08:57:45 +1000 Subject: [PATCH 030/180] config: add rm_log_perrorf() macro --- lib/config.h.in | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/lib/config.h.in b/lib/config.h.in index 7c6aafaf..6c619290 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -70,6 +70,15 @@ rm_log_error_line("%s:%d: %s: %s", __FILE__, __LINE__, message, g_strerror(errno)); \ }} \ +#define rm_log_perrorf(message, ...) \ + if(errno) {{ \ + int _errsv = errno; \ + char *msg = g_strdup_printf(message, __VA_ARGS__); \ + rm_log_error_line("%s:%d: %s: %s", __FILE__, __LINE__, msg, \ + g_strerror(_errsv)); \ + g_free(msg); \ + }} + #define _UNUSED G_GNUC_UNUSED #define LLU G_GUINT64_FORMAT #define LLI G_GINT64_FORMAT @@ -161,7 +170,7 @@ typedef guint64 RmOff; rm_log_error_line(" Will try to continue in 2 seconds. Expect crashes."); \ g_usleep(2 * 1000 * 1000); \ }} - + #define RM_NOT_REACHED (0) #define rm_assert_gentle_not_reached() rm_assert_gentle(RM_NOT_REACHED) From 4000b752ee7b2152d1bf13e5f6294ead9eec8b2b Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 17 Jul 2017 07:50:28 +1000 Subject: [PATCH 031/180] session: update --btrfs-clone implementation to FIDEDUPERANGE ioctl --- SConstruct | 11 +++ lib/SConscript | 1 + lib/cmdline.c | 5 -- lib/config.h.in | 1 + lib/session.c | 177 +++++++++++++++++++++++++++++++++--------------- lib/session.h | 2 +- src/rmlint.c | 2 +- 7 files changed, 139 insertions(+), 60 deletions(-) diff --git a/SConstruct b/SConstruct index f8e2b107..67c7cdab 100755 --- a/SConstruct +++ b/SConstruct @@ -325,6 +325,15 @@ def check_btrfs_h(context): context.Result(rc) return rc +def check_linux_fs_h(context): + rc = 1 + if tests.CheckHeader(context, 'linux/fs.h'): + rc = 0 + + conf.env['HAVE_LINUX_FS_H'] = rc + context.did_show_result = True + context.Result(rc) + return rc def check_linux_limits(context): rc = 1 @@ -524,6 
+533,7 @@ conf = Configure(env, custom_tests={ 'check_gettext': check_gettext, 'check_linux_limits': check_linux_limits, 'check_btrfs_h': check_btrfs_h, + 'check_linux_fs_h': check_linux_fs_h, 'check_uname': check_uname, 'check_cygwin': check_cygwin, 'check_sysmacro_h': check_sysmacro_h @@ -639,6 +649,7 @@ conf.check_linux_limits() conf.check_posix_fadvise() conf.check_faccessat() conf.check_btrfs_h() +conf.check_linux_fs_h() conf.check_uname() conf.check_sysmacro_h() diff --git a/lib/SConscript b/lib/SConscript index 8412b404..9b920628 100644 --- a/lib/SConscript +++ b/lib/SConscript @@ -32,6 +32,7 @@ def build_config_template(target, source, env): HAVE_BLKID=env['HAVE_BLKID'], HAVE_SYSBLOCK=env['HAVE_SYSBLOCK'], HAVE_LINUX_LIMITS=env['HAVE_LINUX_LIMITS'], + HAVE_LINUX_FS_H=env['HAVE_LINUX_FS_H'], HAVE_BTRFS_H=env['HAVE_BTRFS_H'], HAVE_FACCESSAT=env['HAVE_FACCESSAT'], HAVE_UNAME=env['HAVE_UNAME'], diff --git a/lib/cmdline.c b/lib/cmdline.c index c70b80c7..4531463a 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -50,11 +50,6 @@ #include "treemerge.h" #include "utilities.h" -#if HAVE_BTRFS_H -#include -#include -#endif - static void rm_cmd_show_version(void) { fprintf(stderr, "version %s compiled: %s at [%s] \"%s\" (rev %s)\n", RM_VERSION, __DATE__, __TIME__, RM_VERSION_NAME, RM_VERSION_GIT_REVISION); diff --git a/lib/config.h.in b/lib/config.h.in index 6c619290..384315de 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -18,6 +18,7 @@ #define HAVE_LINUX_LIMITS ({HAVE_LINUX_LIMITS}) #define HAVE_POSIX_FADVISE ({HAVE_POSIX_FADVISE}) #define HAVE_BTRFS_H ({HAVE_BTRFS_H}) +#define HAVE_LINUX_FS_H ({HAVE_LINUX_FS_H}) #define HAVE_FACCESSAT ({HAVE_FACCESSAT}) #define HAVE_UNAME ({HAVE_UNAME}) #define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) diff --git a/lib/session.c b/lib/session.c index 4e57f98b..46b15d9a 100644 --- a/lib/session.c +++ b/lib/session.c @@ -35,6 +35,19 @@ #if HAVE_BTRFS_H #include +#endif + +#if HAVE_LINUX_FS_H +#include +#endif + +#ifdef 
FIDEDUPERANGE +#define HAVE_FIDEDUPERANGE 1 +#else +#define HAVE_FIDEDUPERANGE 0 +#endif + +#if HAVE_BTRFS_H || HAVE_FIDEDUPERANGE #include #endif @@ -164,44 +177,83 @@ bool rm_session_was_aborted() { /** * *********** btrfs clone session main ************ **/ -int rm_session_btrfs_clone_main(RmSession *session) { - RmCfg *cfg = session->cfg; +int rm_session_btrfs_clone_main(RmCfg *cfg) { +#if HAVE_FIDEDUPERANGE || HAVE_BTRFS_H if(cfg->path_count != 2) { rm_log_error(_("Usage: rmlint --btrfs-clone [-r] [-v|V] source dest\n")); return EXIT_FAILURE; } - if(!rm_session_check_kernel_version(4, 2)) { - rm_log_warning_line("This needs at least linux >= 4.2."); - return EXIT_FAILURE; - } - - /* TODO: if kernel version >= 4.5 then use IOCTL-FIDEDUPERANGE - * http://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html - */ -#if HAVE_BTRFS_H - g_assert(cfg->paths); RmPath *dest = cfg->paths->data; g_assert(cfg->paths->next); RmPath *source = cfg->paths->next->data; rm_log_debug_line("Cloning %s -> %s", source->path, dest->path); - struct { - struct btrfs_ioctl_same_args args; - struct btrfs_ioctl_same_extent_info info; - } extent_same; - memset(&extent_same, 0, sizeof(extent_same)); - int source_fd = rm_sys_open(source->path, O_RDONLY); if(source_fd < 0) { rm_log_error_line(_("btrfs clone: failed to open source file")); return EXIT_FAILURE; } - extent_same.info.fd = + struct stat source_stat; + fstat(source_fd, &source_stat); + gint64 bytes_deduped = 0; + +/* FIDEDUPERANGE supersedes the btrfs-only BTRFS_IOC_FILE_EXTENT_SAME as of Linux 4.5 and + * should work for ocfs2 and xfs as well as btrfs. We should still support the older + * btrfs ioctl so that this still works on Linux 4.2 to 4.4. The two ioctls are + * identical apart from field names so we can use #define's to accommodate both.
*/ +/* clang-format off */ +#if HAVE_FIDEDUPERANGE +# define _DEDUPE_IOCTL_NAME "FIDEDUPERANGE" +# define _DEDUPE_IOCTL FIDEDUPERANGE +# define _DEST_FD dest_fd +# define _SRC_OFFSET src_offset +# define _DEST_OFFSET dest_offset +# define _SRC_LENGTH src_length +# define _DATA_DIFFERS FILE_DEDUPE_RANGE_DIFFERS +# define _FILE_DEDUPE_RANGE file_dedupe_range +# define _FILE_DEDUPE_RANGE_INFO file_dedupe_range_info +# define _MIN_LINUX_SUBVERSION 5 +#else +# define _DEDUPE_IOCTL_NAME "BTRFS_IOC_FILE_EXTENT_SAME" +# define _DEDUPE_IOCTL BTRFS_IOC_FILE_EXTENT_SAME +# define _DEST_FD fd +# define _SRC_OFFSET logical_offset +# define _DEST_OFFSET logical_offset +# define _SRC_LENGTH length +# define _DATA_DIFFERS BTRFS_SAME_DATA_DIFFERS +# define _FILE_DEDUPE_RANGE btrfs_ioctl_same_args +# define _FILE_DEDUPE_RANGE_INFO btrfs_ioctl_same_extent_info +# define _MIN_LINUX_SUBVERSION 2 +#endif +/* clang-format on */ + + /* a poorly-documented limit for dedupe ioctl's */ + static const gint64 max_dedupe_chunk = 16 * 1024 * 1024; + + /* how fine a resolution to use once difference detected; + * use btrfs default node size (16k): */ + static const gint64 min_dedupe_chunk = 16 * 1024; + + rm_log_debug_line("Cloning using %s", _DEDUPE_IOCTL_NAME); + + if(!rm_session_check_kernel_version(4, _MIN_LINUX_SUBVERSION)) { + rm_log_warning_line("This needs at least linux >= 4.%d.", _MIN_LINUX_SUBVERSION); + return EXIT_FAILURE; + } + + struct { + struct _FILE_DEDUPE_RANGE args; + struct _FILE_DEDUPE_RANGE_INFO info; + } dedupe; + memset(&dedupe, 0, sizeof(dedupe)); + + dedupe.info._DEST_FD = rm_sys_open(dest->path, cfg->btrfs_readonly ? 
O_RDONLY : O_RDWR); - if(extent_same.info.fd < 0) { + + if(dedupe.info._DEST_FD < 0) { rm_log_error_line( _("btrfs clone: error %i: failed to open dest file.%s"), errno, @@ -213,51 +265,70 @@ int rm_session_btrfs_clone_main(RmSession *session) { /* fsync's needed to flush extent mapping */ fsync(source_fd); - fsync(extent_same.info.fd); - - struct stat source_stat; - fstat(source_fd, &source_stat); + fsync(dedupe.info._DEST_FD); - guint64 bytes_deduped = 0; - gint64 bytes_remaining = source_stat.st_size; int ret = 0; - while(bytes_deduped < (guint64)source_stat.st_size && ret == 0 && - extent_same.info.status == 0 && bytes_remaining) { - extent_same.args.dest_count = 1; - extent_same.args.logical_offset = bytes_deduped; - extent_same.info.logical_offset = bytes_deduped; + gint64 dedupe_chunk = max_dedupe_chunk; + while(bytes_deduped < source_stat.st_size && !rm_session_was_aborted()) { + dedupe.args.dest_count = 1; + /* TODO: multiple destinations at same time? */ + dedupe.args._SRC_OFFSET = bytes_deduped; + dedupe.info._DEST_OFFSET = bytes_deduped; /* try to dedupe the rest of the file */ - extent_same.args.length = bytes_remaining; + dedupe.args._SRC_LENGTH = MIN(dedupe_chunk, source_stat.st_size - bytes_deduped); + + ret = ioctl(source_fd, _DEDUPE_IOCTL, &dedupe); + + if(ret != 0) { + rm_log_perrorf(_("%s returned error: (%d)"), _DEDUPE_IOCTL_NAME, ret); + break; + } else if(dedupe.info.status == _DATA_DIFFERS) { + if(dedupe_chunk != min_dedupe_chunk) { + dedupe_chunk = min_dedupe_chunk; + rm_log_debug_line("Dropping to %lu byte chunks after %lu bytes", + dedupe_chunk, bytes_deduped); + } else { + break; + } + } else if(dedupe.info.status != 0) { + rm_log_error_line("%s returned status %d", _DEDUPE_IOCTL_NAME, + dedupe.info.status); + break; + } else if(dedupe.info.bytes_deduped == 0) { + break; + } + + bytes_deduped += dedupe.info.bytes_deduped; + } - ret = ioctl(source_fd, BTRFS_IOC_FILE_EXTENT_SAME, &extent_same); - bytes_deduped += 
extent_same.info.bytes_deduped; - bytes_remaining -= extent_same.info.bytes_deduped; - rm_log_debug_line("deduped %lu bytes...", bytes_deduped); + if(bytes_deduped == 0) { + rm_log_info_line(_("Files don't match - not cloned")); + } else if(bytes_deduped < source_stat.st_size) { + rm_log_info_line(_("Only first %lu bytes cloned - files not fully identical"), + bytes_deduped); } rm_sys_close(source_fd); - rm_sys_close(extent_same.info.fd); + rm_sys_close(dedupe.info._DEST_FD); - if(ret >= 0 && bytes_remaining == 0) { + if(bytes_deduped == source_stat.st_size) { return EXIT_SUCCESS; } - if(ret < 0) { - ret = errno; - rm_log_error_line(_("BTRFS_IOC_FILE_EXTENT_SAME returned error: (%d) %s"), ret, - strerror(ret)); - } else if(extent_same.info.status == -22 && cfg->btrfs_readonly && getuid()) { - rm_log_error_line(_("Need to run as root user to clone to a read-only snapshot")); - } else if(extent_same.info.status < 0) { - rm_log_error_line(_("BTRFS_IOC_FILE_EXTENT_SAME returned status %d for file %s"), - extent_same.info.status, dest->path); - } else if(bytes_deduped == 0) { - rm_log_info_line(_("Files don't match - not cloned")); - } else if(bytes_remaining > 0) { - rm_log_info_line(_("Only first %lu bytes cloned - files not fully identical"), - bytes_deduped); - } +#undef _DEDUPE_IOCTL_NAME +#undef _DEDUPE_IOCTL +#undef _DEST_FD +#undef _SRC_OFFSET +#undef _DEST_OFFSET +#undef _SRC_LENGTH +#undef _DATA_DIFFERS +#undef _DEST_FD +#undef _FILE_DEDUPE_RANGE +#undef _FILE_DEDUPE_RANGE_INFO +#undef _MIN_LINUX_SUBVERSION +#undef MAX_DEDUPE_CHUNK +#undef MIN_DEDUPE_CHUNK #else (void)cfg; diff --git a/lib/session.h b/lib/session.h index 04c154d0..e3c6926f 100644 --- a/lib/session.h +++ b/lib/session.h @@ -183,7 +183,7 @@ bool rm_session_check_kernel_version(int need_major, int need_minor); * * @return exit_status for exit() */ -int rm_session_btrfs_clone_main(RmSession *session); +int rm_session_btrfs_clone_main(RmCfg *cfg); /* Maybe colors, for use outside of the rm_log 
macros, diff --git a/src/rmlint.c b/src/rmlint.c index 3439f4e6..1bc911d0 100644 --- a/src/rmlint.c +++ b/src/rmlint.c @@ -134,7 +134,7 @@ int main(int argc, const char **argv) { if(rm_cmd_parse_args(argc, (char **)argv, &session) != 0) { /* Do all the real work */ if(cfg.btrfs_clone) { - exit_state = rm_session_btrfs_clone_main(&session); + exit_state = rm_session_btrfs_clone_main(&cfg); } else { exit_state = rm_cmd_main(&session); } From f312b82d062b5004cc90b9f7e55bafac7f36111b Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 17 Jul 2017 15:27:01 +1000 Subject: [PATCH 032/180] cmdline: deprecate --btrfs-clone in favour of --dedupe --- docs/cautions.rst | 25 ++++++++++++------------ lib/cfg.h | 6 +++--- lib/cmdline.c | 31 +++++++++++++++++++++++++++--- lib/formats/sh.sh | 6 +++--- lib/session.c | 20 +++++++++---------- lib/session.h | 4 ++-- lib/utilities.c | 4 ++-- po/de.po | 12 ++++++------ po/es.po | 12 ++++++------ po/fr.po | 12 ++++++------ po/rmlint.pot | 6 +++--- src/rmlint.c | 4 ++-- tests/test_mains/test_clone.py | 35 +++++++++++++++++----------------- 13 files changed, 102 insertions(+), 75 deletions(-) diff --git a/docs/cautions.rst b/docs/cautions.rst index 2963a4d2..dc8e8b54 100644 --- a/docs/cautions.rst +++ b/docs/cautions.rst @@ -62,22 +62,23 @@ follows to move the files to */tmp*: fi } -Another safe alternative, if your files are on a ``btrfs`` filesystem and you have linux -kernel 4.2 or higher, is to reflink the duplicate to the original. You can do this via -``cp --reflink`` or using ``rmlint --btrfs-clone``: +Another safe alternative, if your files are on a copy-on-write filesystem such +as ``btrfs``, and you have linux kernel 4.2 or higher, is to use a deduplication +utility such as ``duperemove`` or ``rmlint --dedupe``: .. 
code-block:: bash - $ cp --reflink=always original duplicate # deletes duplicate and replaces it with reflink copy of original - $ rmlint --btrfs-clone original duplicate # does and in-place clone + $ duperemove -dh original duplicate + $ rmlint --dedupe original duplicate + +Both of the above first verify (via the kernel) that ``original`` and +``duplicate`` are identical, then modify ``duplicate`` to reference +``original``'s data extents. Note they do not change the mtime or other +metadata of the duplicate (unlike hardlinks). If you pass ``-c sh:link`` to ``rmlint``, it will even check for you if your filesystem is capable of reflinks and emit the correct command conveniently. -The second option is actually safer because it verifies (via the kernel) that the files -are identical before creating the reflink. Also it does not change the mtime or other -metadata of the duplicate. - You might think hardlinking as a safe alternative to deletion, but in fact hardlinking first deletes the duplicate and then creates a hardlink to the original in its place. If your duplicate finder has found a false positive, it is possible that you may lose @@ -139,7 +140,7 @@ Dupe finders ``rdfind`` and ``dupd`` can also be tricked with the right combinat Deleted 1 files.
$ ls -l dir/ total 0 - + $ dupd scan --path /home/foo/a --path /home/foo/a Files scanned: 2 Total duplicates: 2 @@ -210,8 +211,8 @@ Symlinks can make a real mess out of filesystem traversal: dir/link/link/file [snip] dir/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/file - - Set 1 of 1, preserve files [1 - 41, all]: + + Set 1 of 1, preserve files [1 - 41, all]: *Solution:* diff --git a/lib/cfg.h b/lib/cfg.h index 4a6d4758..4caae6f5 100644 --- a/lib/cfg.h +++ b/lib/cfg.h @@ -151,9 +151,9 @@ typedef struct RmCfg { * (or directories) */ gboolean run_equal_mode; - /* for --btrfs-clone option */ - bool btrfs_clone; - bool btrfs_readonly; + /* --dedupe options */ + bool dedupe; + bool dedupe_readonly; } RmCfg; diff --git a/lib/cmdline.c b/lib/cmdline.c index 4531463a..2d16076f 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1106,6 +1106,22 @@ static gboolean rm_cmd_parse_equal(_UNUSED const char *option_name, return true; } +static gboolean rm_cmd_parse_btrfs_clone(_UNUSED const char *option_name, + _UNUSED const gchar *x, RmSession *session, + _UNUSED GError **error) { + rm_log_warning_line("option --btrfs-clone is deprecated, use --dedupe"); + session->cfg->dedupe = true; + return true; +} + +static gboolean rm_cmd_parse_btrfs_readonly(_UNUSED const char *option_name, + _UNUSED const gchar *x, RmSession *session, + _UNUSED GError **error) { + rm_log_warning_line("option --btrfs-readonly is deprecated, use --dedupe-readonly"); + session->cfg->dedupe_readonly = true; + return true; +} + static bool rm_cmd_set_cwd(RmCfg *cfg) { /* Get current directory */ char cwd_buf[PATH_MAX + 1]; @@ -1269,8 +1285,8 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"no-hardlinked" , 'L' , DISABLE , G_OPTION_ARG_NONE , &cfg->find_hardlinked_dupes , _("Ignore hardlink twins") , NULL} , {"partial-hidden" , 0 , 
EMPTY , G_OPTION_ARG_CALLBACK , FUNC(partial_hidden) , _("Find hidden files in duplicate folders only") , NULL} , {"mtime-window" , 'Z' , 0 , G_OPTION_ARG_DOUBLE , &cfg->mtime_window , _("Consider duplicates only equal when mtime differs at max. T seconds") , "T"} , - {"btrfs-clone" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->btrfs_clone , _("Clone extents from source to dest, if extents match") , NULL} , - {"btrfs-readonly" , 'r' , 0 , G_OPTION_ARG_NONE , &cfg->btrfs_readonly , _("(btrfs-clone option) also clone to read-only snapshots (needs root)") , NULL} , + {"dedupe" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->dedupe , _("Dedupe matching extents from source to dest (if filesystem supports)") , NULL} , + {"dedupe-readonly" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->dedupe_readonly , _("(--dedupe option) even dedupe read-only snapshots (needs root)") , NULL} , /* Callback */ {"show-man" , 'H' , EMPTY , G_OPTION_ARG_CALLBACK , rm_cmd_show_manpage , _("Show the manpage") , NULL} , @@ -1327,6 +1343,11 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {NULL , 0 , HIDDEN , 0 , NULL , NULL , NULL} }; + const GOptionEntry deprecated_option_entries[] = { + {"btrfs-clone" , 0 , EMPTY | HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(btrfs_clone) , "Deprecated, use --dedupe instead" , NULL}, + {"btrfs-readonly" , 0 , EMPTY | HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(btrfs_readonly) , "Deprecated, use --dedupe-readonly instead" , NULL} + }; + /* clang-format on */ /* Initialize default verbosity */ @@ -1344,7 +1365,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { /* Attempt to find out path to own executable. * This is used in the shell script to call the executable - * for special modes like --btrfs-clone or --equal. + * for special modes like --dedupe or --equal. 
* We want to make sure the installed version has this * */ cfg->full_argv0_path = rm_cmd_find_own_executable_path(session, argv); @@ -1363,13 +1384,17 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { "inversed", "inverted", "Options that enable defaults", session, NULL); GOptionGroup *unusual_group = g_option_group_new("unusual", "unusual", "Unusual options", session, NULL); + GOptionGroup *deprecated_group = + g_option_group_new("deprecated", "deprecated", "Deprecated options", session, NULL); g_option_group_add_entries(main_group, main_option_entries); g_option_group_add_entries(main_group, inversed_option_entries); g_option_group_add_entries(main_group, unusual_option_entries); + g_option_group_add_entries(deprecated_group, deprecated_option_entries); g_option_context_add_group(option_parser, inversion_group); g_option_context_add_group(option_parser, unusual_group); + g_option_context_add_group(option_parser, deprecated_group); g_option_context_set_main_group(option_parser, main_group); g_option_context_set_summary(option_parser, _("rmlint finds space waste and other broken things on " diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index 7d69cf5e..2086fd37 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -207,9 +207,9 @@ clone() { echo "${COL_YELLOW}Cloning to: ${COL_RESET}" "$1" if [ -z "$DO_DRY_RUN" ]; then if [ -n "$DO_CLONE_READONLY" ]; then - sudo $RMLINT_BINARY --btrfs-clone -r "$2" "$1" + sudo $RMLINT_BINARY --dedupe -r "$2" "$1" else - $RMLINT_BINARY --btrfs-clone "$2" "$1" + $RMLINT_BINARY --dedupe "$2" "$1" fi fi } @@ -291,7 +291,7 @@ OPTIONS: -d Do not ask before running. -x Keep rmlint.sh; do not autodelete it. -p Recheck that files are still identical before removing duplicates. - -r Allow btrfs-clone to clone to read-only snapshots. (requires sudo) + -r Allow deduplication of files on read-only btrfs snapshots. (requires sudo) -n Do not perform any modifications, just print what would be done. 
(implies -d and -x) -c Clean up empty directories while deleting duplicates. -q Do not show progress. diff --git a/lib/session.c b/lib/session.c index 46b15d9a..2f7bf3ee 100644 --- a/lib/session.c +++ b/lib/session.c @@ -175,12 +175,12 @@ bool rm_session_was_aborted() { return rc; } /** - * *********** btrfs clone session main ************ + * *********** dedupe session main ************ **/ -int rm_session_btrfs_clone_main(RmCfg *cfg) { +int rm_session_dedupe_main(RmCfg *cfg) { #if HAVE_FIDEDUPERANGE || HAVE_BTRFS_H if(cfg->path_count != 2) { - rm_log_error(_("Usage: rmlint --btrfs-clone [-r] [-v|V] source dest\n")); + rm_log_error(_("Usage: rmlint --dedupe [-r] [-v|V] source dest\n")); return EXIT_FAILURE; } @@ -192,7 +192,7 @@ int rm_session_btrfs_clone_main(RmCfg *cfg) { int source_fd = rm_sys_open(source->path, O_RDONLY); if(source_fd < 0) { - rm_log_error_line(_("btrfs clone: failed to open source file")); + rm_log_error_line(_("dedupe: failed to open source file")); return EXIT_FAILURE; } @@ -251,13 +251,13 @@ int rm_session_btrfs_clone_main(RmCfg *cfg) { memset(&dedupe, 0, sizeof(dedupe)); dedupe.info._DEST_FD = - rm_sys_open(dest->path, cfg->btrfs_readonly ? O_RDONLY : O_RDWR); + rm_sys_open(dest->path, cfg->dedupe_readonly ? O_RDONLY : O_RDWR); if(dedupe.info._DEST_FD < 0) { rm_log_error_line( - _("btrfs clone: error %i: failed to open dest file.%s"), + _("dedupe: error %i: failed to open dest file.%s"), errno, - cfg->btrfs_readonly ? "" : _("\n\t(if target is a read-only snapshot " + cfg->dedupe_readonly ? 
"" : _("\n\t(if target is a read-only snapshot " "then -r option is required)")); rm_sys_close(source_fd); return EXIT_FAILURE; @@ -303,9 +303,9 @@ int rm_session_btrfs_clone_main(RmCfg *cfg) { } if(bytes_deduped == 0) { - rm_log_info_line(_("Files don't match - not cloned")); + rm_log_info_line(_("Files don't match - not deduped")); } else if(bytes_deduped < source_stat.st_size) { - rm_log_info_line(_("Only first %lu bytes cloned - files not fully identical"), + rm_log_info_line(_("Only first %lu bytes deduped - files not fully identical"), bytes_deduped); } @@ -332,7 +332,7 @@ int rm_session_btrfs_clone_main(RmCfg *cfg) { #else (void)cfg; - rm_log_error_line(_("rmlint was not compiled with btrfs support.")) + rm_log_error_line(_("rmlint was not compiled with file cloning support.")) #endif return EXIT_FAILURE; diff --git a/lib/session.h b/lib/session.h index e3c6926f..fea7fbe2 100644 --- a/lib/session.h +++ b/lib/session.h @@ -179,11 +179,11 @@ bool rm_session_was_aborted(void); bool rm_session_check_kernel_version(int need_major, int need_minor); /** - * @brief Trigger rmlint in --btrfs-clone mode. + * @brief Trigger rmlint in --dedupe mode. 
* * @return exit_status for exit() */ -int rm_session_btrfs_clone_main(RmCfg *cfg); +int rm_session_dedupe_main(RmCfg *cfg); /* Maybe colors, for use outside of the rm_log macros, diff --git a/lib/utilities.c b/lib/utilities.c index 85370e93..7bcbb880 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -624,8 +624,8 @@ static RmMountEntries *rm_mount_list_open(RmMountTable *table) { {"debugfs", 0}, {NULL, 0}}; - /* btrfs and ocfs2 filesystems support reflinks for deduplication */ - static const char *reflinkfs_types[] = {"btrfs", "ocfs2", NULL}; + /* btrfs, ocfs2 and xfs filesystems support reflinks for deduplication */ + static const char *reflinkfs_types[] = {"btrfs", "ocfs2", "xfs", NULL}; const struct RmEvilFs *evilfs_found = NULL; for(int i = 0; evilfs_types[i].name && !evilfs_found; ++i) { diff --git a/po/de.po b/po/de.po index 8aa35058..cc856cc2 100644 --- a/po/de.po +++ b/po/de.po @@ -487,17 +487,17 @@ msgstr "" #: lib/cmdline.c:237 #, fuzzy -msgid "Usage: rmlint --btrfs-clone [-r] source dest\n" -msgstr "Benutzung: rmlint --btrfs-clone QUELLE ZIEL\n" +msgid "Usage: rmlint --dedupe [-r] source dest\n" +msgstr "Benutzung: rmlint --dedupe QUELLE ZIEL\n" #: lib/cmdline.c:251 -msgid "btrfs clone: failed to open source file" -msgstr "btrfs-clone: Konnte Quelldatei nicht öffnen" +msgid "dedupe: failed to open source file" +msgstr "dedupe: Konnte Quelldatei nicht öffnen" #: lib/cmdline.c:257 #, fuzzy, c-format -msgid "btrfs clone: error %i: failed to open dest file.%s" -msgstr "btrfs clone: Konnte Zieldatei nicht öffnen" +msgid "dedupe: error %i: failed to open dest file.%s" +msgstr "dedupe: Konnte Zieldatei nicht öffnen" #: lib/cmdline.c:259 msgid "" diff --git a/po/es.po b/po/es.po index f409be2a..4bdad2ae 100644 --- a/po/es.po +++ b/po/es.po @@ -480,17 +480,17 @@ msgstr "" #: lib/cmdline.c:237 #, fuzzy -msgid "Usage: rmlint --btrfs-clone [-r] source dest\n" -msgstr "Uso: rmlint --btrfs-clone fuente dest\n" +msgid "Usage: rmlint --dedupe [-r] source dest\n"
+msgstr "Uso: rmlint --dedupe fuente dest\n" #: lib/cmdline.c:251 -msgid "btrfs clone: failed to open source file" -msgstr "btrfs clone: no logró abrir el archivo fuente." +msgid "dedupe: failed to open source file" +msgstr "dedupe: no logró abrir el archivo fuente." #: lib/cmdline.c:257 #, fuzzy, c-format -msgid "btrfs clone: error %i: failed to open dest file.%s" -msgstr "btrfs clone: no logró abrir el archivo dest." +msgid "dedupe: error %i: failed to open dest file.%s" +msgstr "dedupe: no logró abrir el archivo dest." #: lib/cmdline.c:259 msgid "" diff --git a/po/fr.po b/po/fr.po index 0d332df7..14cd62f2 100644 --- a/po/fr.po +++ b/po/fr.po @@ -468,17 +468,17 @@ msgstr "" #: lib/cmdline.c:237 #, fuzzy -msgid "Usage: rmlint --btrfs-clone [-r] source dest\n" -msgstr "Utilisation: rmlint --btrfs-clone source dest\n" +msgid "Usage: rmlint --dedupe [-r] source dest\n" +msgstr "Utilisation: rmlint --dedupe source dest\n" #: lib/cmdline.c:251 -msgid "btrfs clone: failed to open source file" -msgstr "btrfs clone: échec de l'ouverture du fichier source." +msgid "dedupe: failed to open source file" +msgstr "dedupe: échec de l'ouverture du fichier source." #: lib/cmdline.c:257 #, fuzzy, c-format -msgid "btrfs clone: error %i: failed to open dest file.%s" -msgstr "btrfs clone: échec de l'ouverture du fichier de destination." +msgid "dedupe: error %i: failed to open dest file.%s" +msgstr "dedupe: échec de l'ouverture du fichier de destination." 
#: lib/cmdline.c:259 msgid "" diff --git a/po/rmlint.pot b/po/rmlint.pot index 42bd3505..06e8854f 100644 --- a/po/rmlint.pot +++ b/po/rmlint.pot @@ -458,16 +458,16 @@ msgid "" msgstr "" #: lib/cmdline.c:237 -msgid "Usage: rmlint --btrfs-clone [-r] source dest\n" +msgid "Usage: rmlint --dedupe [-r] source dest\n" msgstr "" #: lib/cmdline.c:251 -msgid "btrfs clone: failed to open source file" +msgid "dedupe: failed to open source file" msgstr "" #: lib/cmdline.c:257 #, c-format -msgid "btrfs clone: error %i: failed to open dest file.%s" +msgid "dedupe: error %i: failed to open dest file.%s" msgstr "" #: lib/cmdline.c:259 diff --git a/src/rmlint.c b/src/rmlint.c index 1bc911d0..96976ba6 100644 --- a/src/rmlint.c +++ b/src/rmlint.c @@ -133,8 +133,8 @@ int main(int argc, const char **argv) { /* Parse commandline */ if(rm_cmd_parse_args(argc, (char **)argv, &session) != 0) { /* Do all the real work */ - if(cfg.btrfs_clone) { - exit_state = rm_session_btrfs_clone_main(&cfg); + if(cfg.dedupe) { + exit_state = rm_session_dedupe_main(&cfg); } else { exit_state = rm_cmd_main(&session); } diff --git a/tests/test_mains/test_clone.py b/tests/test_mains/test_clone.py index f7139608..14650ae0 100644 --- a/tests/test_mains/test_clone.py +++ b/tests/test_mains/test_clone.py @@ -9,6 +9,7 @@ from tests.utils import * +REFLINK_CAPABLE_FILESYSTEMS = {'btrfs', 'xfs', 'ocfs2'} @contextmanager def assert_exit_code(status_code): @@ -26,7 +27,7 @@ def assert_exit_code(status_code): assert status_code == 0 -def is_btrfs(path): +def is_on_reflink_fs(path): parts = psutil.disk_partitions(all=True) # iterate up from `path` until mountpoint found @@ -35,7 +36,7 @@ def is_btrfs(path): match = next((x for x in parts if x.mountpoint == p), None) if (match): print("{0} is {1} mounted at {2}".format(path, match.fstype, p)) - return (match.fstype == 'btrfs') + return (match.fstype in REFLINK_CAPABLE_FILESYSTEMS) if (p == '/'): # probably should never get here... 
@@ -44,23 +45,23 @@ def is_btrfs(path): p = os.path.dirname(p) -# decorator for tests dependent on btrfs testdir -def needs_btrfs(test): +# decorator for tests dependent on reflink-capable testdir +def needs_reflink_fs(test): def no_support(*args): raise SkipTest("btrfs not supported") - def not_btrfs(*args): - raise SkipTest("testdir is not on btrfs filesystem") + def not_reflink_fs(*args): + raise SkipTest("testdir is not on reflink-capable filesystem") if not has_feature('btrfs-support'): return make_decorator(test)(no_support) - elif not is_btrfs(TESTDIR_NAME): - return make_decorator(test)(not_btrfs) + elif not is_on_reflink_fs(TESTDIR_NAME): + return make_decorator(test)(not_reflink_fs) else: return test -@needs_btrfs +@needs_reflink_fs @with_setup(usual_setup_func, usual_teardown_func) def test_equal_files(): path_a = create_file('1234', 'a') @@ -68,7 +69,7 @@ def test_equal_files(): with assert_exit_code(0): head, *data, footer = run_rmlint( - '--btrfs-clone', + '--dedupe', path_a, path_b, use_default_dir=False, with_json=False, @@ -76,13 +77,13 @@ def test_equal_files(): with assert_exit_code(0): head, *data, footer = run_rmlint( - '--btrfs-clone', + '--dedupe', path_a, '//', path_b, use_default_dir=False, with_json=False) -@needs_btrfs +@needs_reflink_fs @with_setup(usual_setup_func, usual_teardown_func) def test_different_files(): path_a = create_file('1234', 'a') @@ -90,14 +91,14 @@ def test_different_files(): with assert_exit_code(1): head, *data, footer = run_rmlint( - '--btrfs-clone', + '--dedupe', path_a, path_b, use_default_dir=False, with_json=False, verbosity="") -@needs_btrfs +@needs_reflink_fs @with_setup(usual_setup_func, usual_teardown_func) def test_bad_arguments(): path_a = create_file('1234', 'a') @@ -110,14 +111,14 @@ def test_bad_arguments(): ]: with assert_exit_code(1): head, *data, footer = run_rmlint( - '--btrfs-clone', + '--dedupe', paths, use_default_dir=False, with_json=False, verbosity="") -@needs_btrfs +@needs_reflink_fs 
@with_setup(usual_setup_func, usual_teardown_func) def test_directories(): path_a = os.path.dirname(create_dirs('dir_a')) @@ -125,7 +126,7 @@ def test_directories(): with assert_exit_code(1): head, *data, footer = run_rmlint( - '--btrfs-clone', + '--dedupe', path_a, path_b, use_default_dir=False, with_json=False, From 095b4dd83f8c5cf931f8aeb59cbf4fb0ce6ffda7 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 18 Jul 2017 06:33:41 +1000 Subject: [PATCH 033/180] cmdline: don't create default outputs for --dedupe session --- lib/cmdline.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index 2d16076f..c221bbe9 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1355,12 +1355,12 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { if(!rm_cmd_set_cwd(cfg)) { g_set_error(&error, RM_ERROR_QUARK, 0, _("Cannot set current working directory")); - goto failure; + goto cleanup; } if(!rm_cmd_set_cmdline(cfg, argc, argv)) { g_set_error(&error, RM_ERROR_QUARK, 0, _("Cannot join commandline")); - goto failure; + goto cleanup; } /* Attempt to find out path to own executable. @@ -1413,7 +1413,17 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { g_option_group_set_error_hook(main_group, (GOptionErrorFunc)rm_cmd_on_error); if(!g_option_context_parse(option_parser, &argc, &argv, &error)) { - goto failure; + goto cleanup; + } + + if(!rm_cmd_set_paths(session, paths)) { + error = g_error_new(RM_ERROR_QUARK, 0, _("Not all given paths are valid. 
Aborting")); + goto cleanup; + } + + if(cfg->dedupe) { + /* dedupe session; regular rmlint configs are ignored */ + goto cleanup; } /* Silent fixes of invalid numeric input */ @@ -1457,8 +1467,6 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { } else if(cfg->skip_start_factor >= cfg->skip_end_factor) { error = g_error_new(RM_ERROR_QUARK, 0, _("-q (--clamp-low) should be lower than -Q (--clamp-top)")); - } else if(!rm_cmd_set_paths(session, paths)) { - error = g_error_new(RM_ERROR_QUARK, 0, _("Not all given paths are valid. Aborting")); } else if(!rm_cmd_set_outputs(session, &error)) { /* Something wrong with the outputs */ } else if(cfg->follow_symlinks && cfg->see_symlinks) { @@ -1466,7 +1474,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { rm_assert_gentle_not_reached(); } -failure: +cleanup: if(error != NULL) { rm_cmd_on_error(NULL, NULL, session, &error); } From f78797ca102fef3ffe3afa7353c1a5dda6816cec Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 19 Jul 2017 10:04:06 +1000 Subject: [PATCH 034/180] session: add --is-clone utility to test if two files are clones --- lib/cfg.h | 3 +++ lib/cmdline.c | 5 ++++- lib/session.c | 32 ++++++++++++++++++++++++++++++++ lib/session.h | 7 +++++++ src/rmlint.c | 2 ++ 5 files changed, 48 insertions(+), 1 deletion(-) diff --git a/lib/cfg.h b/lib/cfg.h index 4caae6f5..ca5a88e1 100644 --- a/lib/cfg.h +++ b/lib/cfg.h @@ -155,6 +155,9 @@ typedef struct RmCfg { bool dedupe; bool dedupe_readonly; + /* for --is-reflink option */ + bool is_reflink; + } RmCfg; /** diff --git a/lib/cmdline.c b/lib/cmdline.c index c221bbe9..003f43b8 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1258,7 +1258,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"newer-than" , 'N' , 0 , G_OPTION_ARG_CALLBACK , FUNC(timestamp) , _("Newer than timestamp") , "STAMP"} , {"config" , 'c' , 0 , G_OPTION_ARG_CALLBACK , FUNC(config) , _("Configure a formatter") , "FMT:K[=V]"} , - /* 
Non-trvial switches */ + /* Non-trivial switches */ {"progress" , 'g' , EMPTY , G_OPTION_ARG_CALLBACK , FUNC(progress) , _("Enable progressbar") , NULL} , {"loud" , 'v' , EMPTY , G_OPTION_ARG_CALLBACK , FUNC(loud) , _("Be more verbose (-vvv for much more)") , NULL} , {"quiet" , 'V' , EMPTY , G_OPTION_ARG_CALLBACK , FUNC(quiet) , _("Be less verbose (-VVV for much less)") , NULL} , @@ -1285,8 +1285,11 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"no-hardlinked" , 'L' , DISABLE , G_OPTION_ARG_NONE , &cfg->find_hardlinked_dupes , _("Ignore hardlink twins") , NULL} , {"partial-hidden" , 0 , EMPTY , G_OPTION_ARG_CALLBACK , FUNC(partial_hidden) , _("Find hidden files in duplicate folders only") , NULL} , {"mtime-window" , 'Z' , 0 , G_OPTION_ARG_DOUBLE , &cfg->mtime_window , _("Consider duplicates only equal when mtime differs at max. T seconds") , "T"} , + + /* COW filesystem deduplication support */ {"dedupe" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->dedupe , _("Dedupe matching extents from source to dest (if filesystem supports)") , NULL} , {"dedupe-readonly" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->dedupe_readonly , _("(--dedupe option) even dedupe read-only snapshots (needs root)") , NULL} , + {"is-reflink" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->is_reflink , _("Test if two files are reflinks (share same data extents)") , NULL} , /* Callback */ {"show-man" , 'H' , EMPTY , G_OPTION_ARG_CALLBACK , rm_cmd_show_manpage , _("Show the manpage") , NULL} , diff --git a/lib/session.c b/lib/session.c index 2f7bf3ee..69b528f7 100644 --- a/lib/session.c +++ b/lib/session.c @@ -338,3 +338,35 @@ int rm_session_dedupe_main(RmCfg *cfg) { return EXIT_FAILURE; } + +/** + * *********** `rmlint --is-reflink` session main ************ + **/ +int rm_session_is_reflink_main(RmCfg *cfg) { + /* the linux OS doesn't provide any easy way to check if two files are + * reflinks / clones (eg: + * https://unix.stackexchange.com/questions/263309/how-to-verify-a-file-copy-is-reflink-cow + 
* + * `rmlint --is-clone file_a file_b` provides this functionality in rmlint. + * return values: + * EXIT_SUCCESS if clone confirmed + * EXIT_FAILURE if definitely not clones + */ + if (cfg->path_count != 2) { + rm_log_error(_("Usage: rmlint --is-clone [-v|V] file1 file2\n")); + return EXIT_FAILURE; + } + + g_assert(cfg->paths); + RmPath *a = cfg->paths->data; + g_assert(cfg->paths->next); + RmPath *b = cfg->paths->next->data; + rm_log_debug_line("Testing if %s is clone of %s", a->path, b->path); + + if(rm_offsets_match(a->path, b->path)) { + rm_log_debug_line("Offsets match"); + return EXIT_SUCCESS; + } + + return EXIT_FAILURE; +} diff --git a/lib/session.h b/lib/session.h index fea7fbe2..d7651823 100644 --- a/lib/session.h +++ b/lib/session.h @@ -186,6 +186,13 @@ bool rm_session_check_kernel_version(int need_major, int need_minor); int rm_session_dedupe_main(RmCfg *cfg); +/** + * @brief Trigger rmlint in --is-reflink mode. + * + * @return 0 if is reflink, 1 if not, ?? if can't tell + */ +int rm_session_is_reflink_main(RmCfg *cfg); + /* Maybe colors, for use outside of the rm_log macros, * in order to work with the --with-no-color option * diff --git a/src/rmlint.c b/src/rmlint.c index 96976ba6..eb726e07 100644 --- a/src/rmlint.c +++ b/src/rmlint.c @@ -135,6 +135,8 @@ int main(int argc, const char **argv) { /* Do all the real work */ if(cfg.dedupe) { exit_state = rm_session_dedupe_main(&cfg); + } else if (cfg.is_reflink) { + exit_state = rm_session_is_reflink_main(&cfg); } else { exit_state = rm_cmd_main(&session); } From ee054972d3c7fa77c75c882b1e28c6f8b20f6466 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 19 Jul 2017 10:33:57 +1000 Subject: [PATCH 035/180] utilities: bugfixes for rm_offsets_match and improve readability --- lib/session.c | 25 +++++++++----- lib/utilities.c | 87 +++++++++++++++++++++++++++++++++++++++++-------- lib/utilities.h | 4 +++ 3 files changed, 95 insertions(+), 21 deletions(-) diff --git a/lib/session.c b/lib/session.c index
69b528f7..8d0307bd 100644 --- a/lib/session.c +++ b/lib/session.c @@ -71,7 +71,6 @@ static gpointer rm_session_read_kernel_version(_UNUSED gpointer arg) { return version; } - bool rm_session_check_kernel_version(int need_major, int need_minor) { static GOnce once = G_ONCE_INIT; g_once (&once, rm_session_read_kernel_version, NULL); @@ -258,7 +257,7 @@ int rm_session_dedupe_main(RmCfg *cfg) { _("dedupe: error %i: failed to open dest file.%s"), errno, cfg->dedupe_readonly ? "" : _("\n\t(if target is a read-only snapshot " - "then -r option is required)")); + "then -r option is required)")); rm_sys_close(source_fd); return EXIT_FAILURE; } @@ -338,7 +337,6 @@ int rm_session_dedupe_main(RmCfg *cfg) { return EXIT_FAILURE; } - /** * *********** `rmlint --is-reflink` session main ************ **/ @@ -352,7 +350,7 @@ int rm_session_is_reflink_main(RmCfg *cfg) { * EXIT_SUCCESS if clone confirmed * EXIT_FAILURE if definitely not clones */ - if (cfg->path_count != 2) { + if(cfg->path_count != 2) { rm_log_error(_("Usage: rmlint --is-clone [-v|V] file1 file2\n")); return EXIT_FAILURE; } @@ -363,10 +361,21 @@ int rm_session_is_reflink_main(RmCfg *cfg) { RmPath *b = cfg->paths->next->data; rm_log_debug_line("Testing if %s is clone of %s", a->path, b->path); - if(rm_offsets_match(a->path, b->path)) { - rm_log_debug_line("Offsets match"); - return EXIT_SUCCESS; + if (!rm_offsets_match(a->path, b->path)){ + switch(errno) { + case EXIT_FAILURE: + rm_log_debug_line("Offsets differ"); + break; + case ENODATA: + rm_log_debug_line("Can't read file offsets (maybe inline extents?)"); + break; + default: + rm_log_perror("Error in rm_offsets_match()"); + break; + } + return EXIT_FAILURE; } - return EXIT_FAILURE; + rm_log_debug_line("Offsets match"); + return EXIT_SUCCESS; } diff --git a/lib/utilities.c b/lib/utilities.c index 7bcbb880..3635e932 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -31,6 +31,7 @@ #include #include "config.h" +#include "session.h" /* Be safe: This header is not 
essential and might be missing on some systems. * We only include it here, because it fixes some recent warning... @@ -1006,6 +1007,8 @@ RmOff rm_offset_get_from_fd(int fd, RmOff file_offset, RmOff *file_offset_next) /* used for detecting contiguous extents */ unsigned long expected = 0; + fsync(fd); + while(!done) { /* read in one extent */ struct fiemap *fm = rm_offset_get_fiemap(fd, 1, file_offset); @@ -1079,33 +1082,91 @@ RmOff rm_offset_get_from_path(const char *path, RmOff file_offset, } bool rm_offsets_match(char *path1, char *path2) { - bool result = FALSE; + + errno = 0; + int fd1 = rm_sys_open(path1, O_RDONLY); if(fd1 == -1) { - rm_log_info_line("Error opening %s in rm_offsets_match", path1); + rm_log_perrorf("Error opening %s in rm_offsets_match", path1); return FALSE; } int fd2 = rm_sys_open(path2, O_RDONLY); if(fd2 == -1) { - rm_log_info_line("Error opening %s in rm_offsets_match", path2); + rm_log_perrorf("Error opening %s in rm_offsets_match", path2); rm_sys_close(fd1); return FALSE; } - RmOff file1_offset_next = 0; - RmOff file2_offset_next = 0; - RmOff file_offset_current = 0; - while(!result && - (rm_offset_get_from_fd(fd1, file_offset_current, &file1_offset_next) == - rm_offset_get_from_fd(fd2, file_offset_current, &file2_offset_next)) && - file1_offset_next != 0 && file1_offset_next == file2_offset_next) { - if(file1_offset_next == file_offset_current) { + RmStat stat1; + int stat_state = rm_sys_stat(path1, &stat1); + if(stat_state == -1) { + rm_log_perrorf("Unable to stat file %s", path1); + return FALSE; + } + + RmStat stat2; + stat_state = rm_sys_stat(path2, &stat2); + if(stat_state == -1) { + rm_log_perrorf("Unable to stat file %s", path2); + return FALSE; + } + + if(stat1.st_size != stat2.st_size) { + rm_log_debug_line("Files have different sizes: %lu <> %lu", stat1.st_size, + stat2.st_size); + errno = EINVAL; + return FALSE; + } + + RmOff logical_current = 0; + bool result = FALSE; + + while(!rm_session_was_aborted()) { + RmOff 
logical_next_1 = 0; + RmOff logical_next_2 = 0; + RmOff physical_1 = + rm_offset_get_from_fd(fd1, logical_current, &logical_next_1); + RmOff physical_2 = + rm_offset_get_from_fd(fd2, logical_current, &logical_next_2); + + if(physical_1 != physical_2) { + rm_log_debug_line("Files differ at offset %lu: %lu <> %lu", + logical_current, physical_1, physical_2); + errno = EXIT_FAILURE; + break; + } + if(logical_next_1 != logical_next_2) { + rm_log_debug_line("Next offsets differ after %lu: %lu <> %lu", + logical_current, logical_next_1, logical_next_2); + errno = EXIT_FAILURE; + break; + } + + if(physical_1 == 0) { + rm_log_debug_line( + "Can't determine whether files are clones (maybe inline extents?)"); + errno = ENODATA; + break; + } + + rm_log_debug_line("Offsets match at logical=%lu, physical=%lu", logical_current, + physical_1); + + if(logical_next_1 == logical_current) { + rm_log_debug_line( + "rm_offsets_match() giving up: file1_offset_next==file_offset_current"); + errno = EINVAL; + break; + } + + if(logical_next_1 >= (RmOff)stat1.st_size) { /* phew, we got to the end */ result = TRUE; break; } - file_offset_current = file1_offset_next; + + logical_current = logical_next_1; } rm_sys_close(fd2); @@ -1125,7 +1186,7 @@ RmOff rm_offset_get_from_path(_UNUSED const char *path, _UNUSED RmOff file_offse return 0; } -bool rm_offsets_match(char *path1, char *path2) { +int rm_offsets_match(char *path1, char *path2) { return (path1 == path2); } diff --git a/lib/utilities.h b/lib/utilities.h index b27debef..7ee95f2b 100644 --- a/lib/utilities.h +++ b/lib/utilities.h @@ -410,6 +410,10 @@ RmOff rm_offset_get_from_path(const char *path, RmOff file_offset, /** * @brief Test if two files have identical fiemaps. + * @retval true if match, false otherwise (and errno set). 
+ * errno: EXIT_FAILURE if fiemaps differ, + * ENODATA if file offsets can't be read, + * errno if error encountered */ bool rm_offsets_match(char *path1, char *path2); From e2494a8a139cf061ec47fb3dee1dc3cff6cd7c41 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 19 Jul 2017 10:50:34 +1000 Subject: [PATCH 036/180] sh: apply clone to all reflink-capable filesystems --- lib/formats/sh.c.in | 12 ++---------- lib/utilities.c | 4 ++-- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/lib/formats/sh.c.in b/lib/formats/sh.c.in index f2aae86d..78072c62 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -94,18 +94,10 @@ static bool rm_sh_emit_handler_clone(RmFmtHandlerShScript *self, char **out, RmF return false; } - bool offsets_match = rm_offsets_match(dupe_path, orig_path); - char *reflink_type = g_hash_table_lookup( - self->session->mounts->reflinkfs_table, - GUINT_TO_POINTER(file->dev) - ); - - if (offsets_match) { + if (rm_offsets_match(dupe_path, orig_path)) { *out = g_strdup_printf("skip_reflink '%s' '%s'", dupe_escaped, orig_escaped); - } else if(!g_strcmp0("btrfs", reflink_type)) { - *out = g_strdup_printf("clone '%s' '%s'", dupe_escaped, orig_escaped); } else { - return false; + *out = g_strdup_printf("clone '%s' '%s'", dupe_escaped, orig_escaped); } return true; diff --git a/lib/utilities.c b/lib/utilities.c index 3635e932..57d85bb7 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -625,8 +625,8 @@ static RmMountEntries *rm_mount_list_open(RmMountTable *table) { {"debugfs", 0}, {NULL, 0}}; - /* btrfs, ocfs2 and cfs filesystems support reflinks for deduplication */ - static const char *reflinkfs_types[] = {"btrfs", "ocfs2", "xfs", NULL}; + /* btrfs and ocfs2 filesystems support reflinks for deduplication */ + static const char *reflinkfs_types[] = {"btrfs", "ocfs2", NULL}; const struct RmEvilFs *evilfs_found = NULL; for(int i = 0; evilfs_types[i].name && !evilfs_found; ++i) { From c638166425e82c6658b986df0f1853753e73b011 Mon Sep 17 
00:00:00 2001 From: SeeSpotRun Date: Wed, 19 Jul 2017 22:27:03 +1000 Subject: [PATCH 037/180] docs: updates for --dedupe and --is-reflink --- docs/rmlint.1.rst | 96 ++++++++++++++++++++++++++++++----------------- 1 file changed, 62 insertions(+), 34 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index a7436474..1c2b0c73 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -208,39 +208,6 @@ General Options --rank-by``) to reverse the sorting. Note that ``rmlint`` has to hold back all results to the end of the run before sorting and printing. -:``--gui``: - - Start the optional graphical frontend to ``rmlint`` called ``Shredder``. - - This will only work when ``Shredder`` and its dependencies were installed. - See also: http://rmlint.readthedocs.org/en/latest/gui.html - - The gui has its own set of options, see ``--gui --help`` for a list. These - should be placed at the end, ie ``rmlint --gui [options]`` when calling - it from commandline. - -:``--hash [paths...]``: - - Make ``rmlint`` work as a multi-threaded file hash utility, similar to the - popular ``md5sum`` or ``sha1sum`` utilities, but faster and with more algorithms. - A set of paths given on the commandline or from *stdin* is hashed using one - of the available hash algorithms. Use ``rmlint --hash -h`` to see options. - -:``--equal [paths...]``: - - Check if the paths given on the commandline all have equal content. If all - paths are equal and no other error happened, rmlint will exit with an exit - code 0. Otherwise it will exit with a nonzero exit code. All other options - can be used as normal, but note that no other formatters (``sh``, ``csv`` - etc.) will be executed by default. At least two paths need to be passed. - - Note: This even works for directories and also in combination with paranoid - mode (pass ``-pp`` for byte comparison); remember that rmlint does not care - about the layout of the directory, but only about the content of the files - in it. 
At least two paths need to be given to the commandline. - - By default this will use hashing to compare the files and/or directories. - :``-w --with-color`` (**default**) / ``-W --no-with-color``: Use color escapes for pretty output or disable them. @@ -649,6 +616,63 @@ FORMATTERS print newlines between files, only a space. Newlines are printed only between sets of duplicates. +OTHER STAND-ALONE COMMANDS +========================== + +:``rmlint --gui``: + + Start the optional graphical frontend to ``rmlint`` called ``Shredder``. + + This will only work when ``Shredder`` and its dependencies were installed. + See also: http://rmlint.readthedocs.org/en/latest/gui.html + + The gui has its own set of options, see ``--gui --help`` for a list. These + should be placed at the end, ie ``rmlint --gui [options]`` when calling + it from commandline. + +:``rmlint --hash [paths...]``: + + Make ``rmlint`` work as a multi-threaded file hash utility, similar to the + popular ``md5sum`` or ``sha1sum`` utilities, but faster and with more algorithms. + A set of paths given on the commandline or from *stdin* is hashed using one + of the available hash algorithms. Use ``rmlint --hash -h`` to see options. + +:``rmlint --equal [paths...]``: + + Check if the paths given on the commandline all have equal content. If all + paths are equal and no other error happened, rmlint will exit with an exit + code 0. Otherwise it will exit with a nonzero exit code. All other options + can be used as normal, but note that no other formatters (``sh``, ``csv`` + etc.) will be executed by default. At least two paths need to be passed. + + Note: This even works for directories and also in combination with paranoid + mode (pass ``-pp`` for byte comparison); remember that rmlint does not care + about the layout of the directory, but only about the content of the files + in it. At least two paths need to be given to the commandline. + + By default this will use hashing to compare the files and/or directories. 
+ +:``rmlint --dedupe [-r] [-v|-V] ``: + + If the filesystem supports files sharing physical storage between multiple + files, and if ``src`` and ``dest`` have same content, this command makes the + data in the ``src`` file appear the ``dest`` file by sharing the + underlying storage. + + This command is similar to ``cp --reflink=always `` + except that it (a) checks that ``src`` and ``dest`` have identical data, and + it makes no changes to ``dest``'s metadata. + + Running with ``-r`` option will enable deduplication of read-only [btrfs] + snapshots (requires root). + +:``rmlint --is-reflink [-v|-V] ``: + Tests whether ``file1`` and ``file2`` are reflinks (reference same data). + Returns 0 if yes, 1 if no, or 61 (ENODATA) or another error number if the + files' fiemaps can't be read. + + + EXAMPLES ======== @@ -717,7 +741,11 @@ This is a collection of common usecases and other tricks: * Compare if the directories a b c and are equal - ``$ rmlint --equal a b c; echo $? # Will print 0 if they are equal`` + ``$ rmlint --equal a b c && echo "Files are equal" || echo "Files are not equal"`` + +* Test if two files are reflinks + ``rmlint --is-reflink a b && echo "Files are reflinks" || echo "Files are not reflinks"``. 
+ PROBLEMS ======== From 0a0fc77eeb2d5cae0c8300dd658b8b495f4f60e3 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 19 Jul 2017 22:44:59 +1000 Subject: [PATCH 038/180] tests: rename test_clone to test_dedupe --- tests/test_mains/{test_clone.py => test_dedupe.py} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename tests/test_mains/{test_clone.py => test_dedupe.py} (100%) diff --git a/tests/test_mains/test_clone.py b/tests/test_mains/test_dedupe.py similarity index 100% rename from tests/test_mains/test_clone.py rename to tests/test_mains/test_dedupe.py From a40e86a67392948c0270fdfe33dacbe8dbe6097c Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 19 Jul 2017 22:45:52 +1000 Subject: [PATCH 039/180] tests: bugfix test_dedupe --- tests/test_mains/test_dedupe.py | 10 +++++----- tests/utils.py | 7 +++++-- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/tests/test_mains/test_dedupe.py b/tests/test_mains/test_dedupe.py index 14650ae0..a433d1d4 100644 --- a/tests/test_mains/test_dedupe.py +++ b/tests/test_mains/test_dedupe.py @@ -68,7 +68,7 @@ def test_equal_files(): path_b = create_file('1234', 'b') with assert_exit_code(0): - head, *data, footer = run_rmlint( + run_rmlint( '--dedupe', path_a, path_b, use_default_dir=False, @@ -76,7 +76,7 @@ def test_equal_files(): verbosity="") with assert_exit_code(0): - head, *data, footer = run_rmlint( + run_rmlint( '--dedupe', path_a, '//', path_b, use_default_dir=False, @@ -90,7 +90,7 @@ def test_different_files(): path_b = create_file('4321', 'b') with assert_exit_code(1): - head, *data, footer = run_rmlint( + run_rmlint( '--dedupe', path_a, path_b, use_default_dir=False, @@ -110,7 +110,7 @@ def test_bad_arguments(): ' '.join((path_a, path_a + ".nonexistent")) ]: with assert_exit_code(1): - head, *data, footer = run_rmlint( + run_rmlint( '--dedupe', paths, use_default_dir=False, @@ -125,7 +125,7 @@ def test_directories(): path_b = os.path.dirname(create_dirs('dir_b')) with assert_exit_code(1): - head, 
*data, footer = run_rmlint( + run_rmlint( '--dedupe', path_a, path_b, use_default_dir=False, diff --git a/tests/utils.py b/tests/utils.py index 1c243df9..59fe9f9e 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -132,8 +132,11 @@ def run_rmlint_once(*args, if directly_return_output: return output - with open('/tmp/out.json', 'r') as f: - json_data = json.loads(f.read()) + if with_json: + with open('/tmp/out.json', 'r') as f: + json_data = json.loads(f.read()) + else: + json_data = [] read_outputs = [] for idx, output in enumerate(outputs or []): From 427ae49feece02f2ed78634b9d78e3f0f1bc22fc Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 19 Jul 2017 22:49:00 +1000 Subject: [PATCH 040/180] tests: use rmlint --is-reflink to test rmlint --dedupe --- tests/test_mains/test_dedupe.py | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/tests/test_mains/test_dedupe.py b/tests/test_mains/test_dedupe.py index a433d1d4..19ab110d 100644 --- a/tests/test_mains/test_dedupe.py +++ b/tests/test_mains/test_dedupe.py @@ -131,3 +131,36 @@ def test_directories(): use_default_dir=False, with_json=False, verbosity="") + + +@needs_reflink_fs +@with_setup(usual_setup_func, usual_teardown_func) +def test_dedupe_works(): + + # test files need to be larger than btrfs node size to prevent inline extents + path_a = create_file('1' * 100000, 'a') + path_b = create_file('1' * 100000, 'b') + + with assert_exit_code(1): + run_rmlint( + '--is-reflink', path_a, path_b, + use_default_dir=False, + with_json=False, + verbosity="" + ) + + with assert_exit_code(0): + run_rmlint( + '--dedupe', path_a, path_b, + use_default_dir=False, + with_json=False, + verbosity="" + ) + + with assert_exit_code(0): + run_rmlint( + '--is-reflink', path_a, path_b, + use_default_dir=False, + with_json=False, + verbosity="" + ) From 8734bc26d93ba1fddf32243e7afc7b5497a22c31 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 20 Jul 2017 00:41:49 +1000 Subject: [PATCH 041/180] 
cmdline: bugfix missing NULL terminator --- lib/cmdline.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index 003f43b8..4b7e4f8b 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1348,7 +1348,8 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { const GOptionEntry deprecated_option_entries[] = { {"btrfs-clone" , 0 , EMPTY | HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(btrfs_clone) , "Deprecated, use --dedupe instead" , NULL}, - {"btrfs-readonly" , 0 , EMPTY | HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(btrfs_readonly) , "Deprecated, use --dedupe-readonly instead" , NULL} + {"btrfs-readonly" , 0 , EMPTY | HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(btrfs_readonly) , "Deprecated, use --dedupe-readonly instead" , NULL}, + {NULL , 0 , HIDDEN , 0 , NULL , NULL , NULL} }; /* clang-format on */ From bacdea98243a2d45c24b3bdf3da6eb13a7d10167 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 20 Jul 2017 01:15:59 +1000 Subject: [PATCH 042/180] cmdline: make '-r' option shortcut work for --dedupe-readonly --- lib/cmdline.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index 4b7e4f8b..3acb5778 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1288,7 +1288,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { /* COW filesystem deduplication support */ {"dedupe" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->dedupe , _("Dedupe matching extents from source to dest (if filesystem supports)") , NULL} , - {"dedupe-readonly" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->dedupe_readonly , _("(--dedupe option) even dedupe read-only snapshots (needs root)") , NULL} , + {"dedupe-readonly" , 'r' , 0 , G_OPTION_ARG_NONE , &cfg->dedupe_readonly , _("(--dedupe option) even dedupe read-only snapshots (needs root)") , NULL} , {"is-reflink" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->is_reflink , _("Test if two files are reflinks (share same data extents)") , NULL} , /* Callback */ From 
929d6df316570f8ca4d6d2bd4752e789ed954af7 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 30 Jul 2017 05:37:03 +1000 Subject: [PATCH 043/180] cmdline: add TODO to clean up subcommand handling --- lib/cmdline.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/lib/cmdline.c b/lib/cmdline.c index 6fb51dc9..d1315657 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1378,6 +1378,18 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { // OPTION PARSING // //////////////////// + /* TODO: move subcommands to separate option parser + * e.g. + * Usage: + * rmlint [options] ... + * rmlint --subcommand [options] + * Subcommands (must be first arg): + * --dedupe Dedupe matching extents from source to dest (if filesystem supports) + * --is-reflink Test if two files are reflinks + * --gui Launch rmlint gui + * For help on subcommands use rmlint -- --help + * + */ option_parser = g_option_context_new( _("[TARGET_DIR_OR_FILES …] [//] [TAGGED_TARGET_DIR_OR_FILES …] [-]")); g_option_context_set_translation_domain(option_parser, RM_GETTEXT_PACKAGE); From 83a18bebfcb600d46a11e3e7b3e439f56a8bfb19 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 30 Jul 2017 06:52:29 +1000 Subject: [PATCH 044/180] session: test fsync retval in rm_session_dedupe_main() --- lib/session.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/lib/session.c b/lib/session.c index 8d0307bd..b99ee3be 100644 --- a/lib/session.c +++ b/lib/session.c @@ -263,8 +263,15 @@ int rm_session_dedupe_main(RmCfg *cfg) { } /* fsync's needed to flush extent mapping */ - fsync(source_fd); - fsync(dedupe.info._DEST_FD); + if(fsync(source_fd) != 0) { + rm_log_warning_line("Error syncing source file %s: %s", + source->path, strerror(errno)); + } + + if(fsync(dedupe.info._DEST_FD) != 0) { + rm_log_warning_line("Error syncing dest file %s: %s", + dest->path, strerror(errno)); + } int ret = 0; gint64 dedupe_chunk = max_dedupe_chunk; From 
7002c896dc46e11079765b701ec3fbad7921d987 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 30 Jul 2017 06:53:40 +1000 Subject: [PATCH 045/180] session: add a TODO and do some some clang-formatting --- lib/session.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/lib/session.c b/lib/session.c index b99ee3be..ee9f1a2b 100644 --- a/lib/session.c +++ b/lib/session.c @@ -59,7 +59,8 @@ static gpointer rm_session_read_kernel_version(_UNUSED gpointer arg) { static int version[2] = {-1, -1}; #if HAVE_UNAME struct utsname buf; - if(uname(&buf) != -1 && sscanf(buf.release, "%d.%d.*", &version[0], &version[1]) != EOF) { + if(uname(&buf) != -1 && + sscanf(buf.release, "%d.%d.*", &version[0], &version[1]) != EOF) { rm_log_debug_line("Linux kernel version is %d.%d.", version[0], version[1]); } else { rm_log_warning_line("Unable to read Linux kernel version"); @@ -73,7 +74,7 @@ static gpointer rm_session_read_kernel_version(_UNUSED gpointer arg) { bool rm_session_check_kernel_version(int need_major, int need_minor) { static GOnce once = G_ONCE_INIT; - g_once (&once, rm_session_read_kernel_version, NULL); + g_once(&once, rm_session_read_kernel_version, NULL); int *version = once.retval; int major = version[0]; int minor = version[1]; @@ -203,6 +204,12 @@ int rm_session_dedupe_main(RmCfg *cfg) { * should work for ocfs2 and xfs as well as btrfs. We should still support the older * btrfs ioctl so that this still works on Linux 4.2 to 4.4. The two ioctl's are * identical apart from field names so we can use #define's to accommodate both. */ + +/* TODO: test this on system running kernel 4.[2|3|4] ; if the c headers + * support FIDEDUPERANGE but kernel doesn't, then this will fail at runtime + * because the BTRFS_IOC_FILE_EXTENT_SAME is decided at compile time... 
+ */ + /* clang-format off */ #if HAVE_FIDEDUPERANGE # define _DEDUPE_IOCTL_NAME "FIDEDUPERANGE" @@ -368,7 +375,7 @@ int rm_session_is_reflink_main(RmCfg *cfg) { RmPath *b = cfg->paths->next->data; rm_log_debug_line("Testing if %s is clone of %s", a->path, b->path); - if (!rm_offsets_match(a->path, b->path)){ + if(!rm_offsets_match(a->path, b->path)) { switch(errno) { case EXIT_FAILURE: rm_log_debug_line("Offsets differ"); From 20be7258b85fb3fcd5d9cfea414f245e9f205185 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 30 Jul 2017 08:32:21 +1000 Subject: [PATCH 046/180] session: clarify return codes for `rmlint --is-reflink` --- docs/rmlint.1.rst | 13 ++++++-- lib/session.c | 18 +++++------ lib/utilities.c | 82 ++++++++++++++++++++++++++++++++--------------- lib/utilities.h | 20 +++++++++--- 4 files changed, 90 insertions(+), 43 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 656662e3..adffd1f8 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -743,9 +743,16 @@ OTHER STAND-ALONE COMMANDS :``rmlint --is-reflink [-v|-V] ``: Tests whether ``file1`` and ``file2`` are reflinks (reference same data). - Returns 0 if yes, 1 if no, or 61 (ENODATA) or another error number if the - files' fiemaps can't be read. 
- + Return codes: + 0: files are reflinks + 1: files are not reflinks + 3: not a regular file + 4: file sizes differ + 5: fiemaps can't be read + 6: file1 and file2 are the same path + 7: file1 and file2 are the same file under different mountpoints + 8: files are hardlinks + 9: other error encountered EXAMPLES diff --git a/lib/session.c b/lib/session.c index ee9f1a2b..a9b9f501 100644 --- a/lib/session.c +++ b/lib/session.c @@ -363,7 +363,9 @@ int rm_session_is_reflink_main(RmCfg *cfg) { * return values: * EXIT_SUCCESS if clone confirmed * EXIT_FAILURE if definitely not clones + * Other return values defined in utilities.h 'RmOffsetsMatchCode' enum */ + if(cfg->path_count != 2) { rm_log_error(_("Usage: rmlint --is-clone [-v|V] file1 file2\n")); return EXIT_FAILURE; @@ -375,21 +377,19 @@ int rm_session_is_reflink_main(RmCfg *cfg) { RmPath *b = cfg->paths->next->data; rm_log_debug_line("Testing if %s is clone of %s", a->path, b->path); - if(!rm_offsets_match(a->path, b->path)) { - switch(errno) { - case EXIT_FAILURE: + int result = rm_offsets_match(a->path, b->path); + switch(result) { + case RM_OFFSETS_DIFFER: rm_log_debug_line("Offsets differ"); break; - case ENODATA: + case RM_OFFSETS_MATCH: + rm_log_debug_line("Offsets match"); + case RM_OFFSETS_NO_DATA: rm_log_debug_line("Can't read file offsets (maybe inline extents?)"); break; default: - rm_log_perror("Error in rm_offsets_match()"); break; - } - return EXIT_FAILURE; } - rm_log_debug_line("Offsets match"); - return EXIT_SUCCESS; + return result; } diff --git a/lib/utilities.c b/lib/utilities.c index 57d85bb7..7edd03fd 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -1081,46 +1081,82 @@ RmOff rm_offset_get_from_path(const char *path, RmOff file_offset, return result; } -bool rm_offsets_match(char *path1, char *path2) { +static gboolean rm_util_is_path_double(char *path1, char *path2) { + char *basename1 = rm_util_basename(path1); + char *basename2 = rm_util_basename(path2); + return (strcmp(basename1, 
basename2) == 0 && + rm_util_parent_node(path1) == rm_util_parent_node(path2)); +} + - errno = 0; +RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { int fd1 = rm_sys_open(path1, O_RDONLY); if(fd1 == -1) { rm_log_perrorf("Error opening %s in rm_offsets_match", path1); - return FALSE; + return RM_OFFSETS_ERROR; } - int fd2 = rm_sys_open(path2, O_RDONLY); - if(fd2 == -1) { - rm_log_perrorf("Error opening %s in rm_offsets_match", path2); - rm_sys_close(fd1); - return FALSE; +#define RM_RETURN(value) \ + { \ + rm_sys_close(fd1); \ + return (value); \ } RmStat stat1; int stat_state = rm_sys_stat(path1, &stat1); if(stat_state == -1) { rm_log_perrorf("Unable to stat file %s", path1); - return FALSE; + RM_RETURN(RM_OFFSETS_ERROR); + } + + if(!S_ISREG(stat1.st_mode)) { + RM_RETURN(RM_OFFSETS_NOT_FILE); + } + + int fd2 = rm_sys_open(path2, O_RDONLY); + if(fd2 == -1) { + rm_log_perrorf("Error opening %s in rm_offsets_match", path2); + RM_RETURN(RM_OFFSETS_ERROR); + } + +#undef RM_RETURN +#define RM_RETURN(value) \ + { \ + rm_sys_close(fd1); \ + rm_sys_close(fd2); \ + return (value); \ } RmStat stat2; stat_state = rm_sys_stat(path2, &stat2); if(stat_state == -1) { rm_log_perrorf("Unable to stat file %s", path2); - return FALSE; + RM_RETURN(RM_OFFSETS_ERROR); + } + + if(!S_ISREG(stat2.st_mode)) { + RM_RETURN(RM_OFFSETS_NOT_FILE); } if(stat1.st_size != stat2.st_size) { rm_log_debug_line("Files have different sizes: %lu <> %lu", stat1.st_size, stat2.st_size); - errno = EINVAL; - return FALSE; + RM_RETURN(RM_OFFSETS_WRONG_SIZE); + } + + if(stat1.st_dev == stat2.st_dev && stat1.st_ino == stat2.st_ino) { + /* hardlinks or maybe even same file */ + if(strcmp(path1, path2)==0) { + RM_RETURN(RM_OFFSETS_SAME_FILE); + } else if (rm_util_is_path_double(path1, path2)) { + RM_RETURN(RM_OFFSETS_PATH_DOUBLE); + } else { + RM_RETURN(RM_OFFSETS_HARDLINK); + } } RmOff logical_current = 0; - bool result = FALSE; while(!rm_session_was_aborted()) { RmOff logical_next_1 = 0; @@ -1133,21 
+1169,18 @@ bool rm_offsets_match(char *path1, char *path2) { if(physical_1 != physical_2) { rm_log_debug_line("Files differ at offset %lu: %lu <> %lu", logical_current, physical_1, physical_2); - errno = EXIT_FAILURE; - break; + RM_RETURN(RM_OFFSETS_DIFFER); } if(logical_next_1 != logical_next_2) { rm_log_debug_line("Next offsets differ after %lu: %lu <> %lu", logical_current, logical_next_1, logical_next_2); - errno = EXIT_FAILURE; - break; + RM_RETURN(RM_OFFSETS_DIFFER); } if(physical_1 == 0) { rm_log_debug_line( "Can't determine whether files are clones (maybe inline extents?)"); - errno = ENODATA; - break; + RM_RETURN(RM_OFFSETS_NO_DATA); } rm_log_debug_line("Offsets match at logical=%lu, physical=%lu", logical_current, @@ -1156,22 +1189,19 @@ bool rm_offsets_match(char *path1, char *path2) { if(logical_next_1 == logical_current) { rm_log_debug_line( "rm_offsets_match() giving up: file1_offset_next==file_offset_current"); - errno = EINVAL; - break; + RM_RETURN(RM_OFFSETS_NO_DATA) } if(logical_next_1 >= (RmOff)stat1.st_size) { /* phew, we got to the end */ - result = TRUE; - break; + RM_RETURN(RM_OFFSETS_MATCH) } logical_current = logical_next_1; } - rm_sys_close(fd2); - rm_sys_close(fd1); - return result; + RM_RETURN(RM_OFFSETS_ERROR); +#undef RM_RETURN } #else /* Probably FreeBSD */ diff --git a/lib/utilities.h b/lib/utilities.h index 7ee95f2b..c5928d94 100644 --- a/lib/utilities.h +++ b/lib/utilities.h @@ -38,6 +38,19 @@ /* Pat(h)tricia Trie implementation */ #include "pathtricia.h" +/* return values for rm_offsets_match */ +typedef enum RmOffsetsMatchCode { + RM_OFFSETS_MATCH = EXIT_SUCCESS, + RM_OFFSETS_DIFFER = EXIT_FAILURE, + RM_OFFSETS_NOT_FILE = 3, + RM_OFFSETS_WRONG_SIZE = 4, + RM_OFFSETS_NO_DATA = 5, + RM_OFFSETS_SAME_FILE = 6, + RM_OFFSETS_PATH_DOUBLE = 7, + RM_OFFSETS_HARDLINK = 8, + RM_OFFSETS_ERROR = 9, +} RmOffsetsMatchCode; + #if HAVE_STAT64 && !RM_IS_APPLE typedef struct stat64 RmStat; #else @@ -410,12 +423,9 @@ RmOff 
rm_offset_get_from_path(const char *path, RmOff file_offset, /** * @brief Test if two files have identical fiemaps. - * @retval true if match, false otherwise (and errno set). - * errno: EXIT_FAILURE if fiemaps differ, - * ENODATA if file offsets can't be read, - * errno if error encountered + * @retval see RmOffsetsMatchCode enum definition. */ -bool rm_offsets_match(char *path1, char *path2); +RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2); ////////////////////////////// // TIMESTAMP HELPERS // From 9300623317af9d4f6aab7a395892c073bbbe6ad3 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 30 Jul 2017 08:55:18 +1000 Subject: [PATCH 047/180] cmdline: speed up --equal for hardlinks, reflinks and path doubles --- lib/cmdline.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/lib/cmdline.c b/lib/cmdline.c index d1315657..452274fd 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1590,7 +1590,30 @@ int rm_cmd_main(RmSession *session) { return EXIT_FAILURE; } + /* some optimisations for rmlint --equal */ + if(cfg->run_equal_mode && session->total_files == 2) { + /* check if the two files are hardlinks or reflinks or some such */ + g_assert(cfg->paths); + RmPath *a = cfg->paths->data; + g_assert(cfg->paths->next); + RmPath *b = cfg->paths->next->data; + switch(rm_offsets_match(a->path, b->path)) { + case RM_OFFSETS_HARDLINK: + case RM_OFFSETS_MATCH: + case RM_OFFSETS_PATH_DOUBLE: + case RM_OFFSETS_SAME_FILE: + session->equal_exit_code = EXIT_SUCCESS; + cfg->find_duplicates = FALSE; + cfg->merge_directories = FALSE; + rm_log_debug_line("got match via rm_offsets_match"); + break; + default: + break; + } + } + if(session->total_files >= 1) { + rm_fmt_set_state(session->formats, RM_PROGRESS_STATE_PREPROCESS); rm_preprocess(session); From 7737c05fa353ecef37d80539b3c203e3ee9cd325 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 30 Jul 2017 08:58:20 +1000 Subject: [PATCH 048/180] sh: add comment re lack of original_check() in 
clone() --- lib/formats/sh.sh | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index 2086fd37..bbfa32e6 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -204,6 +204,7 @@ cp_reflink() { clone() { print_progress_prefix # clone $1 from $2's data + # note: no original_check() call because rmlint --dedupe takes care of this echo "${COL_YELLOW}Cloning to: ${COL_RESET}" "$1" if [ -z "$DO_DRY_RUN" ]; then if [ -n "$DO_CLONE_READONLY" ]; then From 8e90ae869121121c3cba4867a08443d1543352ed Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 30 Jul 2017 09:09:18 +1000 Subject: [PATCH 049/180] tests: make is_on_reflink_fs() more readable --- tests/test_mains/test_dedupe.py | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/tests/test_mains/test_dedupe.py b/tests/test_mains/test_dedupe.py index 19ab110d..1dd10364 100644 --- a/tests/test_mains/test_dedupe.py +++ b/tests/test_mains/test_dedupe.py @@ -27,22 +27,25 @@ def assert_exit_code(status_code): assert status_code == 0 +def up(path): + while path: + yield path + if path == "/": + break + path = os.path.dirname(path) + def is_on_reflink_fs(path): parts = psutil.disk_partitions(all=True) # iterate up from `path` until mountpoint found - p = path - while 1: - match = next((x for x in parts if x.mountpoint == p), None) - if (match): - print("{0} is {1} mounted at {2}".format(path, match.fstype, p)) - return (match.fstype in REFLINK_CAPABLE_FILESYSTEMS) - - if (p == '/'): - # probably should never get here... 
- print("no mountpoint found for {0}".format(path)) - return False - p = os.path.dirname(p) + for up_path in up(path): + for part in parts: + if up_path == part.mountpoint: + print("{0} is {1} mounted at {2}".format(path, part.fstype, part.mountpoint)) + return (part.fstype in REFLINK_CAPABLE_FILESYSTEMS) + + print("No mountpoint found for {0}", path) + return False # decorator for tests dependent on reflink-capable testdir From be61e0b29bb991d1e953a80d820b6bc4f2158cf8 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 31 Jul 2017 06:23:35 +1000 Subject: [PATCH 050/180] sh: update for changes to rm_offsets_match() --- lib/formats/sh.c.in | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/lib/formats/sh.c.in b/lib/formats/sh.c.in index 78072c62..4415057b 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -94,13 +94,27 @@ static bool rm_sh_emit_handler_clone(RmFmtHandlerShScript *self, char **out, RmF return false; } - if (rm_offsets_match(dupe_path, orig_path)) { + int match = rm_offsets_match(dupe_path, orig_path); + switch(match) { + case RM_OFFSETS_MATCH: *out = g_strdup_printf("skip_reflink '%s' '%s'", dupe_escaped, orig_escaped); - } else { + return TRUE; + case RM_OFFSETS_SAME_FILE: + case RM_OFFSETS_NOT_FILE: + case RM_OFFSETS_WRONG_SIZE: + case RM_OFFSETS_PATH_DOUBLE: + case RM_OFFSETS_ERROR: + rm_log_warning_line("Unexpected return code %d from rm_offsets_match()", match); + return FALSE; + case RM_OFFSETS_HARDLINK: + case RM_OFFSETS_NO_DATA: + case RM_OFFSETS_DIFFER: *out = g_strdup_printf("clone '%s' '%s'", dupe_escaped, orig_escaped); + return TRUE; + default: + rm_assert_gentle_not_reached(); + return FALSE; } - - return true; } static bool rm_sh_emit_handler_reflink(RmFmtHandlerShScript *self, char **out, RmFile *file, char *dupe_path, char *orig_path, char *dupe_escaped, char *orig_escaped) { From 2d3add25a3a334c288715a7bf86a3abbfb3da963 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 31 Jul 
2017 06:32:31 +1000 Subject: [PATCH 051/180] some clang-formatting --- lib/cmdline.c | 24 ++++++++++++------------ lib/session.c | 28 ++++++++++++++-------------- lib/utilities.c | 34 +++++++++++++++------------------- src/rmlint.c | 2 +- 4 files changed, 42 insertions(+), 46 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index 452274fd..191711db 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1552,7 +1552,7 @@ int rm_cmd_main(RmSession *session) { } if(session->mounts == NULL) { - rm_log_debug_line("No mount table created."); + rm_log_debug_line("No mount table created."); } session->mds = rm_mds_new(cfg->threads, session->mounts, cfg->fake_pathindex_as_disk); @@ -1598,17 +1598,17 @@ int rm_cmd_main(RmSession *session) { g_assert(cfg->paths->next); RmPath *b = cfg->paths->next->data; switch(rm_offsets_match(a->path, b->path)) { - case RM_OFFSETS_HARDLINK: - case RM_OFFSETS_MATCH: - case RM_OFFSETS_PATH_DOUBLE: - case RM_OFFSETS_SAME_FILE: - session->equal_exit_code = EXIT_SUCCESS; - cfg->find_duplicates = FALSE; - cfg->merge_directories = FALSE; - rm_log_debug_line("got match via rm_offsets_match"); - break; - default: - break; + case RM_OFFSETS_HARDLINK: + case RM_OFFSETS_MATCH: + case RM_OFFSETS_PATH_DOUBLE: + case RM_OFFSETS_SAME_FILE: + session->equal_exit_code = EXIT_SUCCESS; + cfg->find_duplicates = FALSE; + cfg->merge_directories = FALSE; + rm_log_debug_line("got match via rm_offsets_match"); + break; + default: + break; } } diff --git a/lib/session.c b/lib/session.c index a9b9f501..2296a1eb 100644 --- a/lib/session.c +++ b/lib/session.c @@ -271,13 +271,13 @@ int rm_session_dedupe_main(RmCfg *cfg) { /* fsync's needed to flush extent mapping */ if(fsync(source_fd) != 0) { - rm_log_warning_line("Error syncing source file %s: %s", - source->path, strerror(errno)); + rm_log_warning_line("Error syncing source file %s: %s", source->path, + strerror(errno)); } if(fsync(dedupe.info._DEST_FD) != 0) { - rm_log_warning_line("Error syncing dest file %s: 
%s", - dest->path, strerror(errno)); + rm_log_warning_line("Error syncing dest file %s: %s", dest->path, + strerror(errno)); } int ret = 0; @@ -379,16 +379,16 @@ int rm_session_is_reflink_main(RmCfg *cfg) { int result = rm_offsets_match(a->path, b->path); switch(result) { - case RM_OFFSETS_DIFFER: - rm_log_debug_line("Offsets differ"); - break; - case RM_OFFSETS_MATCH: - rm_log_debug_line("Offsets match"); - case RM_OFFSETS_NO_DATA: - rm_log_debug_line("Can't read file offsets (maybe inline extents?)"); - break; - default: - break; + case RM_OFFSETS_DIFFER: + rm_log_debug_line("Offsets differ"); + break; + case RM_OFFSETS_MATCH: + rm_log_debug_line("Offsets match"); + case RM_OFFSETS_NO_DATA: + rm_log_debug_line("Can't read file offsets (maybe inline extents?)"); + break; + default: + break; } return result; diff --git a/lib/utilities.c b/lib/utilities.c index 7edd03fd..2da0b76b 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -1088,19 +1088,17 @@ static gboolean rm_util_is_path_double(char *path1, char *path2) { rm_util_parent_node(path1) == rm_util_parent_node(path2)); } - RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { - int fd1 = rm_sys_open(path1, O_RDONLY); if(fd1 == -1) { rm_log_perrorf("Error opening %s in rm_offsets_match", path1); return RM_OFFSETS_ERROR; } -#define RM_RETURN(value) \ - { \ - rm_sys_close(fd1); \ - return (value); \ +#define RM_RETURN(value) \ + { \ + rm_sys_close(fd1); \ + return (value); \ } RmStat stat1; @@ -1121,11 +1119,11 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { } #undef RM_RETURN -#define RM_RETURN(value) \ - { \ - rm_sys_close(fd1); \ - rm_sys_close(fd2); \ - return (value); \ +#define RM_RETURN(value) \ + { \ + rm_sys_close(fd1); \ + rm_sys_close(fd2); \ + return (value); \ } RmStat stat2; @@ -1147,9 +1145,9 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { if(stat1.st_dev == stat2.st_dev && stat1.st_ino == stat2.st_ino) { /* hardlinks or maybe even same file */ - 
if(strcmp(path1, path2)==0) { + if(strcmp(path1, path2) == 0) { RM_RETURN(RM_OFFSETS_SAME_FILE); - } else if (rm_util_is_path_double(path1, path2)) { + } else if(rm_util_is_path_double(path1, path2)) { RM_RETURN(RM_OFFSETS_PATH_DOUBLE); } else { RM_RETURN(RM_OFFSETS_HARDLINK); @@ -1161,14 +1159,12 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { while(!rm_session_was_aborted()) { RmOff logical_next_1 = 0; RmOff logical_next_2 = 0; - RmOff physical_1 = - rm_offset_get_from_fd(fd1, logical_current, &logical_next_1); - RmOff physical_2 = - rm_offset_get_from_fd(fd2, logical_current, &logical_next_2); + RmOff physical_1 = rm_offset_get_from_fd(fd1, logical_current, &logical_next_1); + RmOff physical_2 = rm_offset_get_from_fd(fd2, logical_current, &logical_next_2); if(physical_1 != physical_2) { - rm_log_debug_line("Files differ at offset %lu: %lu <> %lu", - logical_current, physical_1, physical_2); + rm_log_debug_line("Files differ at offset %lu: %lu <> %lu", logical_current, + physical_1, physical_2); RM_RETURN(RM_OFFSETS_DIFFER); } if(logical_next_1 != logical_next_2) { diff --git a/src/rmlint.c b/src/rmlint.c index eb726e07..7815cdb9 100644 --- a/src/rmlint.c +++ b/src/rmlint.c @@ -135,7 +135,7 @@ int main(int argc, const char **argv) { /* Do all the real work */ if(cfg.dedupe) { exit_state = rm_session_dedupe_main(&cfg); - } else if (cfg.is_reflink) { + } else if(cfg.is_reflink) { exit_state = rm_session_is_reflink_main(&cfg); } else { exit_state = rm_cmd_main(&session); From 8276b118b85ca122579ed8e77da2a1b3250dd492 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 31 Jul 2017 06:35:41 +1000 Subject: [PATCH 052/180] utilities: fix rm_offsets_match for when we don't HAVE_FIEMAP --- lib/utilities.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/utilities.c b/lib/utilities.c index 2da0b76b..2191448f 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -1212,8 +1212,8 @@ RmOff rm_offset_get_from_path(_UNUSED const char 
*path, _UNUSED RmOff file_offse return 0; } -int rm_offsets_match(char *path1, char *path2) { - return (path1 == path2); +RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { + return (path1 == path2) ? RM_OFFSETS_MATCH : RM_OFFSETS_NO_DATA; } #endif From ff853b7eaf49d143c49443af066a8d18328ba53a Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 31 Jul 2017 07:20:52 +1000 Subject: [PATCH 053/180] tests: add testcase for be61e0b (sh handler's skip_reflink logic) --- tests/test_mains/test_dedupe.py | 58 +++++++++++++++++++++++++++++++++ 1 file changed, 58 insertions(+) diff --git a/tests/test_mains/test_dedupe.py b/tests/test_mains/test_dedupe.py index 1dd10364..b3325e9e 100644 --- a/tests/test_mains/test_dedupe.py +++ b/tests/test_mains/test_dedupe.py @@ -144,6 +144,7 @@ def test_dedupe_works(): path_a = create_file('1' * 100000, 'a') path_b = create_file('1' * 100000, 'b') + # confirm that files are not reflinks with assert_exit_code(1): run_rmlint( '--is-reflink', path_a, path_b, @@ -152,6 +153,7 @@ def test_dedupe_works(): verbosity="" ) + # reflink our files with assert_exit_code(0): run_rmlint( '--dedupe', path_a, path_b, @@ -160,6 +162,7 @@ def test_dedupe_works(): verbosity="" ) + # confirm that they are now reflinks with assert_exit_code(0): run_rmlint( '--is-reflink', path_a, path_b, @@ -167,3 +170,58 @@ def test_dedupe_works(): with_json=False, verbosity="" ) + +# count the number of line in a file which start with patterns[] +def pattern_count(path, patterns): + counts = [0] * len(patterns) + f = open(path, 'r') + for line in f: + for i, pattern in enumerate(patterns): + if line.startswith(pattern): + counts[i] += 1 + f.close() + return counts + + +@needs_reflink_fs +@with_setup(usual_setup_func, usual_teardown_func) +def test_clone_handler(): + # test files need to be larger than btrfs node size to prevent inline extents + path_a = create_file('1' * 100000, 'a') + path_b = create_file('1' * 100000, 'b') + + sh_path = 
os.path.join(TESTDIR_NAME, 'rmlint.sh') + + # generate rmlint.sh and check that it correctly selects files for cloning + with assert_exit_code(0): + run_rmlint( + '-S a -o sh:{p} -c sh:clone'.format(p=sh_path), + path_a, path_b, + use_default_dir=False, + with_json=False + ) + + # parse output file for expected clone command + counts = pattern_count(sh_path, ["clone '", "skip_reflink '"]) + assert counts[0] == 1 + assert counts[1] == 0 + + # now reflink the two files and check again + with assert_exit_code(0): + run_rmlint( + '--dedupe', path_a, path_b, + use_default_dir=False, + with_json=False, + verbosity="" + ) + with assert_exit_code(0): + run_rmlint( + '-S a -o sh:{p} -c sh:clone'.format(p=sh_path), + path_a, path_b, + use_default_dir=False, + with_json=False + ) + + counts = pattern_count(sh_path, ["clone '", "skip_reflink '"]) + assert counts[0] == 0 + assert counts[1] == 1 From c8e3aa1523726e9129394b306f8a847a26e7e46f Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 31 Jul 2017 20:28:01 +1000 Subject: [PATCH 054/180] utilities: better handling of rm_offsets_match if not HAVE_FIEMAP --- lib/utilities.c | 37 ++++++++++++++++++++----------------- 1 file changed, 20 insertions(+), 17 deletions(-) diff --git a/lib/utilities.c b/lib/utilities.c index 2191448f..40c63cee 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -1081,6 +1081,20 @@ RmOff rm_offset_get_from_path(const char *path, RmOff file_offset, return result; } +#else /* Probably FreeBSD */ + +RmOff rm_offset_get_from_fd(_UNUSED int fd, _UNUSED RmOff file_offset, + _UNUSED RmOff *file_offset_next) { + return 0; +} + +RmOff rm_offset_get_from_path(_UNUSED const char *path, _UNUSED RmOff file_offset, + _UNUSED RmOff *file_offset_next) { + return 0; +} + +#endif + static gboolean rm_util_is_path_double(char *path1, char *path2) { char *basename1 = rm_util_basename(path1); char *basename2 = rm_util_basename(path2); @@ -1154,6 +1168,8 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { 
} } +#if HAVE_FIEMAP + RmOff logical_current = 0; while(!rm_session_was_aborted()) { @@ -1197,26 +1213,13 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { } RM_RETURN(RM_OFFSETS_ERROR); -#undef RM_RETURN -} - -#else /* Probably FreeBSD */ - -RmOff rm_offset_get_from_fd(_UNUSED int fd, _UNUSED RmOff file_offset, - _UNUSED RmOff *file_offset_next) { - return 0; -} - -RmOff rm_offset_get_from_path(_UNUSED const char *path, _UNUSED RmOff file_offset, - _UNUSED RmOff *file_offset_next) { - return 0; -} +#else + RM_RETURN(RM_OFFSETS_NO_DATA); +#endif -RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { - return (path1 == path2) ? RM_OFFSETS_MATCH : RM_OFFSETS_NO_DATA; +#undef RM_RETURN } -#endif ///////////////////////////////// // GTHREADPOOL WRAPPERS // From 9675ac3835050d1fe443d7b139a4783c30e54338 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 31 Jul 2017 20:51:29 +1000 Subject: [PATCH 055/180] utilities: rename rm_offsets_match to rm_util_link_type --- lib/cmdline.c | 10 +++++----- lib/formats/sh.c.in | 45 ++++++++++++++++++++++++++++++--------------- lib/session.c | 12 ++++++------ lib/utilities.c | 36 ++++++++++++++++++------------------ lib/utilities.h | 26 ++++++++++++++------------ 5 files changed, 73 insertions(+), 56 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index 191711db..2307563d 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1597,11 +1597,11 @@ int rm_cmd_main(RmSession *session) { RmPath *a = cfg->paths->data; g_assert(cfg->paths->next); RmPath *b = cfg->paths->next->data; - switch(rm_offsets_match(a->path, b->path)) { - case RM_OFFSETS_HARDLINK: - case RM_OFFSETS_MATCH: - case RM_OFFSETS_PATH_DOUBLE: - case RM_OFFSETS_SAME_FILE: + switch(rm_util_link_type(a->path, b->path)) { + case RM_LINK_HARDLINK: + case RM_LINK_REFLINK: + case RM_LINK_PATH_DOUBLE: + case RM_LINK_SAME_FILE: session->equal_exit_code = EXIT_SUCCESS; cfg->find_duplicates = FALSE; cfg->merge_directories = FALSE; diff --git 
a/lib/formats/sh.c.in b/lib/formats/sh.c.in index 4415057b..1d0cb844 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -94,21 +94,21 @@ static bool rm_sh_emit_handler_clone(RmFmtHandlerShScript *self, char **out, RmF return false; } - int match = rm_offsets_match(dupe_path, orig_path); - switch(match) { - case RM_OFFSETS_MATCH: + int link_type = rm_util_link_type(dupe_path, orig_path); + switch(link_type) { + case RM_LINK_REFLINK: *out = g_strdup_printf("skip_reflink '%s' '%s'", dupe_escaped, orig_escaped); return TRUE; - case RM_OFFSETS_SAME_FILE: - case RM_OFFSETS_NOT_FILE: - case RM_OFFSETS_WRONG_SIZE: - case RM_OFFSETS_PATH_DOUBLE: - case RM_OFFSETS_ERROR: - rm_log_warning_line("Unexpected return code %d from rm_offsets_match()", match); + case RM_LINK_SAME_FILE: + case RM_LINK_NOT_FILE: + case RM_LINK_WRONG_SIZE: + case RM_LINK_PATH_DOUBLE: + case RM_LINK_ERROR: + rm_log_warning_line("Unexpected return code %d from rm_util_link_type()", link_type); return FALSE; - case RM_OFFSETS_HARDLINK: - case RM_OFFSETS_NO_DATA: - case RM_OFFSETS_DIFFER: + case RM_LINK_HARDLINK: + case RM_LINK_MAYBE_REFLINK: + case RM_LINK_NONE: *out = g_strdup_printf("clone '%s' '%s'", dupe_escaped, orig_escaped); return TRUE; default: @@ -126,12 +126,27 @@ static bool rm_sh_emit_handler_reflink(RmFmtHandlerShScript *self, char **out, R return false; } - if (rm_offsets_match(dupe_path, orig_path)) { + int link_type = rm_util_link_type(dupe_path, orig_path); + switch(link_type) { + case RM_LINK_REFLINK: *out = g_strdup_printf("skip_reflink '%s' '%s'", dupe_escaped, orig_escaped); - } else { + return TRUE; + case RM_LINK_SAME_FILE: + case RM_LINK_NOT_FILE: + case RM_LINK_WRONG_SIZE: + case RM_LINK_PATH_DOUBLE: + case RM_LINK_ERROR: + rm_log_warning_line("Unexpected return code %d from rm_util_link_type()", link_type); + return FALSE; + case RM_LINK_HARDLINK: + case RM_LINK_MAYBE_REFLINK: + case RM_LINK_NONE: *out = g_strdup_printf("cp_reflink '%s' '%s'", dupe_escaped, 
orig_escaped); + return TRUE; + default: + rm_assert_gentle_not_reached(); + return FALSE; } - return true; } static bool rm_sh_emit_handler_symlink(RmFmtHandlerShScript *self, char **out, _UNUSED RmFile *file, _UNUSED char *dupe_path, _UNUSED char *orig_path, char *dupe_escaped, char *orig_escaped) { diff --git a/lib/session.c b/lib/session.c index 2296a1eb..1a52939a 100644 --- a/lib/session.c +++ b/lib/session.c @@ -377,19 +377,19 @@ int rm_session_is_reflink_main(RmCfg *cfg) { RmPath *b = cfg->paths->next->data; rm_log_debug_line("Testing if %s is clone of %s", a->path, b->path); - int result = rm_offsets_match(a->path, b->path); + int result = rm_util_link_type(a->path, b->path); switch(result) { - case RM_OFFSETS_DIFFER: + case RM_LINK_REFLINK: + rm_log_debug_line("Offsets match"); + break; + case RM_LINK_NONE: rm_log_debug_line("Offsets differ"); break; - case RM_OFFSETS_MATCH: - rm_log_debug_line("Offsets match"); - case RM_OFFSETS_NO_DATA: + case RM_LINK_MAYBE_REFLINK: rm_log_debug_line("Can't read file offsets (maybe inline extents?)"); break; default: break; } - return result; } diff --git a/lib/utilities.c b/lib/utilities.c index 40c63cee..c074dd26 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -1102,11 +1102,11 @@ static gboolean rm_util_is_path_double(char *path1, char *path2) { rm_util_parent_node(path1) == rm_util_parent_node(path2)); } -RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { +RmLinkType rm_util_link_type(char *path1, char *path2) { int fd1 = rm_sys_open(path1, O_RDONLY); if(fd1 == -1) { rm_log_perrorf("Error opening %s in rm_offsets_match", path1); - return RM_OFFSETS_ERROR; + return RM_LINK_ERROR; } #define RM_RETURN(value) \ @@ -1119,17 +1119,17 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { int stat_state = rm_sys_stat(path1, &stat1); if(stat_state == -1) { rm_log_perrorf("Unable to stat file %s", path1); - RM_RETURN(RM_OFFSETS_ERROR); + RM_RETURN(RM_LINK_ERROR); } if(!S_ISREG(stat1.st_mode)) { 
- RM_RETURN(RM_OFFSETS_NOT_FILE); + RM_RETURN(RM_LINK_NOT_FILE); } int fd2 = rm_sys_open(path2, O_RDONLY); if(fd2 == -1) { rm_log_perrorf("Error opening %s in rm_offsets_match", path2); - RM_RETURN(RM_OFFSETS_ERROR); + RM_RETURN(RM_LINK_ERROR); } #undef RM_RETURN @@ -1144,27 +1144,27 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { stat_state = rm_sys_stat(path2, &stat2); if(stat_state == -1) { rm_log_perrorf("Unable to stat file %s", path2); - RM_RETURN(RM_OFFSETS_ERROR); + RM_RETURN(RM_LINK_ERROR); } if(!S_ISREG(stat2.st_mode)) { - RM_RETURN(RM_OFFSETS_NOT_FILE); + RM_RETURN(RM_LINK_NOT_FILE); } if(stat1.st_size != stat2.st_size) { rm_log_debug_line("Files have different sizes: %lu <> %lu", stat1.st_size, stat2.st_size); - RM_RETURN(RM_OFFSETS_WRONG_SIZE); + RM_RETURN(RM_LINK_WRONG_SIZE); } if(stat1.st_dev == stat2.st_dev && stat1.st_ino == stat2.st_ino) { /* hardlinks or maybe even same file */ if(strcmp(path1, path2) == 0) { - RM_RETURN(RM_OFFSETS_SAME_FILE); + RM_RETURN(RM_LINK_SAME_FILE); } else if(rm_util_is_path_double(path1, path2)) { - RM_RETURN(RM_OFFSETS_PATH_DOUBLE); + RM_RETURN(RM_LINK_PATH_DOUBLE); } else { - RM_RETURN(RM_OFFSETS_HARDLINK); + RM_RETURN(RM_LINK_HARDLINK); } } @@ -1181,18 +1181,18 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { if(physical_1 != physical_2) { rm_log_debug_line("Files differ at offset %lu: %lu <> %lu", logical_current, physical_1, physical_2); - RM_RETURN(RM_OFFSETS_DIFFER); + RM_RETURN(RM_LINK_NONE); } if(logical_next_1 != logical_next_2) { rm_log_debug_line("Next offsets differ after %lu: %lu <> %lu", logical_current, logical_next_1, logical_next_2); - RM_RETURN(RM_OFFSETS_DIFFER); + RM_RETURN(RM_LINK_NONE); } if(physical_1 == 0) { rm_log_debug_line( "Can't determine whether files are clones (maybe inline extents?)"); - RM_RETURN(RM_OFFSETS_NO_DATA); + RM_RETURN(RM_LINK_MAYBE_REFLINK); } rm_log_debug_line("Offsets match at logical=%lu, physical=%lu", logical_current, @@ -1201,20 
+1201,20 @@ RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2) { if(logical_next_1 == logical_current) { rm_log_debug_line( "rm_offsets_match() giving up: file1_offset_next==file_offset_current"); - RM_RETURN(RM_OFFSETS_NO_DATA) + RM_RETURN(RM_LINK_ERROR) } if(logical_next_1 >= (RmOff)stat1.st_size) { /* phew, we got to the end */ - RM_RETURN(RM_OFFSETS_MATCH) + RM_RETURN(RM_LINK_REFLINK) } logical_current = logical_next_1; } - RM_RETURN(RM_OFFSETS_ERROR); + RM_RETURN(RM_LINK_ERROR); #else - RM_RETURN(RM_OFFSETS_NO_DATA); + RM_RETURN(RM_LINK_MAYBE_REFLINK); #endif #undef RM_RETURN diff --git a/lib/utilities.h b/lib/utilities.h index c5928d94..7591acb5 100644 --- a/lib/utilities.h +++ b/lib/utilities.h @@ -39,17 +39,19 @@ #include "pathtricia.h" /* return values for rm_offsets_match */ -typedef enum RmOffsetsMatchCode { - RM_OFFSETS_MATCH = EXIT_SUCCESS, - RM_OFFSETS_DIFFER = EXIT_FAILURE, - RM_OFFSETS_NOT_FILE = 3, - RM_OFFSETS_WRONG_SIZE = 4, - RM_OFFSETS_NO_DATA = 5, - RM_OFFSETS_SAME_FILE = 6, - RM_OFFSETS_PATH_DOUBLE = 7, - RM_OFFSETS_HARDLINK = 8, - RM_OFFSETS_ERROR = 9, -} RmOffsetsMatchCode; +typedef enum RmLinkType { + RM_LINK_REFLINK = EXIT_SUCCESS, + RM_LINK_NONE = EXIT_FAILURE, + RM_LINK_NOT_FILE = 3, + RM_LINK_WRONG_SIZE = 4, + RM_LINK_MAYBE_REFLINK = 5, + RM_LINK_SAME_FILE = 6, + RM_LINK_PATH_DOUBLE = 7, + RM_LINK_HARDLINK = 8, + RM_LINK_ERROR = 9, + RM_LINK_SYMLINK = 10, +} RmLinkType; + #if HAVE_STAT64 && !RM_IS_APPLE typedef struct stat64 RmStat; @@ -425,7 +427,7 @@ RmOff rm_offset_get_from_path(const char *path, RmOff file_offset, * @brief Test if two files have identical fiemaps. * @retval see RmOffsetsMatchCode enum definition.
*/ -RmOffsetsMatchCode rm_offsets_match(char *path1, char *path2); +RmLinkType rm_util_link_type(char *path1, char *path2); ////////////////////////////// // TIMESTAMP HELPERS // From 5758330840b51557dcaa0aee6ee7c3a0009ba6b1 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 1 Aug 2017 08:08:55 +1000 Subject: [PATCH 056/180] utilities: also test files are on same device in rm_util_link_type() --- docs/rmlint.1.rst | 4 +++- lib/formats/sh.c.in | 4 ++++ lib/utilities.c | 37 ++++++++++++++++++++++++++++++++++++- lib/utilities.h | 3 ++- 4 files changed, 45 insertions(+), 3 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index adffd1f8..4d1fa04c 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -752,7 +752,9 @@ OTHER STAND-ALONE COMMANDS 6: file1 and file2 are the same path 7: file1 and file2 are the same file under different mountpoints 8: files are hardlinks - 9: other error encountered + 9: files are symlinks (TODO) + 10: files are not on same device + 11: other error encountered EXAMPLES diff --git a/lib/formats/sh.c.in b/lib/formats/sh.c.in index 1d0cb844..bcd498fb 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -104,6 +104,8 @@ static bool rm_sh_emit_handler_clone(RmFmtHandlerShScript *self, char **out, RmF case RM_LINK_WRONG_SIZE: case RM_LINK_PATH_DOUBLE: case RM_LINK_ERROR: + case RM_LINK_XDEV: + case RM_LINK_SYMLINK: rm_log_warning_line("Unexpected return code %d from rm_util_link_type()", link_type); return FALSE; case RM_LINK_HARDLINK: @@ -135,10 +137,12 @@ static bool rm_sh_emit_handler_reflink(RmFmtHandlerShScript *self, char **out, R case RM_LINK_NOT_FILE: case RM_LINK_WRONG_SIZE: case RM_LINK_PATH_DOUBLE: + case RM_LINK_XDEV: case RM_LINK_ERROR: rm_log_warning_line("Unexpected return code %d from rm_util_link_type()", link_type); return FALSE; case RM_LINK_HARDLINK: + case RM_LINK_SYMLINK: case RM_LINK_MAYBE_REFLINK: case RM_LINK_NONE: *out = g_strdup_printf("cp_reflink '%s' '%s'", dupe_escaped, orig_escaped); diff 
--git a/lib/utilities.c b/lib/utilities.c index c074dd26..dd64181f 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -76,7 +76,7 @@ #endif #if HAVE_BLKID -#include +#include #endif #if HAVE_JSON_GLIB @@ -1102,6 +1102,33 @@ static gboolean rm_util_is_path_double(char *path1, char *path2) { rm_util_parent_node(path1) == rm_util_parent_node(path2)); } +/* test if two file paths are on the same device (even if on different + * mountpoints) + */ +static gboolean rm_util_same_device(const char *path1, const char *path2) { + const char *best1 = NULL; + const char *best2 = NULL; + int len1 = 0; + int len2 = 0; + + GList *mounts = g_unix_mounts_get(NULL); + for(GList *iter = mounts; iter; iter = iter->next) { + GUnixMountEntry *mount = iter->data; + const char *mountpath = g_unix_mount_get_mount_path(mount); + int len = strlen(mountpath); + if (len > len1 && strncmp(mountpath, path1, len) == 0) { + best1 = g_unix_mount_get_device_path(mount); + len1 = len; + } + if (len > len2 && strncmp(mountpath, path2, len) == 0) { + best2 = g_unix_mount_get_device_path(mount); + len2 = len; + } + } + gboolean result = (best1 && best2 && strcmp(best1, best2)==0); + g_list_free_full(mounts, (GDestroyNotify)g_unix_mount_free); + return result; +} RmLinkType rm_util_link_type(char *path1, char *path2) { int fd1 = rm_sys_open(path1, O_RDONLY); if(fd1 == -1) { @@ -1168,6 +1195,14 @@ RmLinkType rm_util_link_type(char *path1, char *path2) { } } + if(stat1.st_dev != stat2.st_dev) { + /* reflinks must be on same filesystem but not necessarily + * same st_dev (btrfs subvolumes have different st_dev's) */ + if(!rm_util_same_device(path1, path2)) { + RM_RETURN(RM_LINK_XDEV); + } + } + #if HAVE_FIEMAP RmOff logical_current = 0; diff --git a/lib/utilities.h b/lib/utilities.h index 7591acb5..2749300e 100644 --- a/lib/utilities.h +++ b/lib/utilities.h @@ -48,8 +48,9 @@ typedef enum RmLinkType { RM_LINK_SAME_FILE = 6, RM_LINK_PATH_DOUBLE = 7, RM_LINK_HARDLINK = 8, - RM_LINK_ERROR = 9, + RM_LINK_XDEV 
= 9, RM_LINK_SYMLINK = 10, + RM_LINK_ERROR = 11, } RmLinkType; From 009731d870fd5f791db15fc33cf1502a2c6d3835 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 1 Aug 2017 08:09:57 +1000 Subject: [PATCH 057/180] tests: nicer python thx sahib --- tests/test_mains/test_dedupe.py | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/tests/test_mains/test_dedupe.py b/tests/test_mains/test_dedupe.py index b3325e9e..48404b71 100644 --- a/tests/test_mains/test_dedupe.py +++ b/tests/test_mains/test_dedupe.py @@ -174,12 +174,11 @@ def test_dedupe_works(): # count the number of line in a file which start with patterns[] def pattern_count(path, patterns): counts = [0] * len(patterns) - f = open(path, 'r') - for line in f: - for i, pattern in enumerate(patterns): - if line.startswith(pattern): - counts[i] += 1 - f.close() + with open(path, 'r') as f: + for line in f: + for i, pattern in enumerate(patterns): + if line.startswith(pattern): + counts[i] += 1 return counts From 2ec0fc5fadaeb584cd3bd313c06708226e77d7df Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 1 Aug 2017 08:12:48 +1000 Subject: [PATCH 058/180] clang-format --- lib/utilities.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/lib/utilities.c b/lib/utilities.c index dd64181f..e613a468 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -41,9 +41,9 @@ #endif -#include -#include #include +#include +#include #include #include @@ -1116,16 +1116,16 @@ static gboolean rm_util_same_device(const char *path1, const char *path2) { GUnixMountEntry *mount = iter->data; const char *mountpath = g_unix_mount_get_mount_path(mount); int len = strlen(mountpath); - if (len > len1 && strncmp(mountpath, path1, len) == 0) { + if(len > len1 && strncmp(mountpath, path1, len) == 0) { best1 = g_unix_mount_get_device_path(mount); len1 = len; } - if (len > len2 && strncmp(mountpath, path2, len) == 0) { + if(len > len2 && strncmp(mountpath, path2, len) == 0) { best2 = 
g_unix_mount_get_device_path(mount); len2 = len; } } - gboolean result = (best1 && best2 && strcmp(best1, best2)==0); + gboolean result = (best1 && best2 && strcmp(best1, best2) == 0); g_list_free_full(mounts, (GDestroyNotify)g_unix_mount_free); return result; } From 9c284ab29be51aae8d27e8426d1cf39d20ab2cb7 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Wed, 16 Aug 2017 20:26:56 +0200 Subject: [PATCH 059/180] Implement basic version of --keep-hardlinked from #248 --- docs/rmlint.1.rst | 17 ++++++++-- lib/cfg.c | 1 + lib/cfg.h | 1 + lib/cmdline.c | 1 + lib/preprocess.c | 2 +- lib/shredder.c | 27 +++++++++++++++ tests/test_options/test_keep_hardlinks.py | 41 +++++++++++++++++++++++ 7 files changed, 86 insertions(+), 4 deletions(-) create mode 100644 tests/test_options/test_keep_hardlinks.py diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 4d1fa04c..96e6bd54 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -299,9 +299,20 @@ Traversal Options :``-l --hardlinked`` (**default**) / ``-L --no-hardlinked``: - Whether to report hardlinked files as duplicates. - Hardlinked files will not appear as space waste in the statistics, since - they do not allocate any extra space. + Whether to report hardlinked files as duplicates. If ``--no-hardlinked`` is given, + ``rmlint`` will filter out all hardlinks to files it already knows of. + + Note that hardlinked files will not appear as space waste in the + statistics, since they do not occupy any extra space as long as not all of them are removed. + + See also ``--keep-hardlinked`` below. + +:``--keep-hardlinked`` (**default**: No.): + + If set, ``rmlint`` will not delete any files that are hardlinked to any original in their respective group. + Such files will be displayed like originals (i.e. in the default output with an "ls" in front). + The reasoning here is to maximize the number of kept files while maximizing the amount of freed space: + Removing hardlinks to originals will not free any space.
:``-f --followlinks`` / ``-F --no-followlinks`` / ``-@ --see-symlinks`` (**default**): diff --git a/lib/cfg.c b/lib/cfg.c index 0f107cd5..970a926f 100644 --- a/lib/cfg.c +++ b/lib/cfg.c @@ -55,6 +55,7 @@ void rm_cfg_set_default(RmCfg *cfg) { cfg->find_badids = true; cfg->find_badlinks = true; cfg->find_hardlinked_dupes = true; + cfg->keep_hardlinked_dupes = false; cfg->build_fiemap = true; cfg->crossdev = true; cfg->list_mounts = true; diff --git a/lib/cfg.h b/lib/cfg.h index ca5a88e1..fc5713c9 100644 --- a/lib/cfg.h +++ b/lib/cfg.h @@ -68,6 +68,7 @@ typedef struct RmCfg { gboolean must_match_tagged; gboolean must_match_untagged; gboolean find_hardlinked_dupes; + gboolean keep_hardlinked_dupes; gboolean limits_specified; gboolean filter_mtime; gboolean match_basename; diff --git a/lib/cmdline.c b/lib/cmdline.c index 2307563d..071ac588 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1283,6 +1283,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"honour-dir-layout" , 'j' , EMPTY , G_OPTION_ARG_CALLBACK , FUNC(honour_dir_layout) , _("Only find directories with same file layout") , NULL} , {"perms" , 'z' , OPTIONAL , G_OPTION_ARG_CALLBACK , FUNC(permissions) , _("Only use files with certain permissions") , "[RWX]+"} , {"no-hardlinked" , 'L' , DISABLE , G_OPTION_ARG_NONE , &cfg->find_hardlinked_dupes , _("Ignore hardlink twins") , NULL} , + {"keep-hardlinked" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->keep_hardlinked_dupes , _("Keep hardlinks that are linked to any original") , NULL} , {"partial-hidden" , 0 , EMPTY , G_OPTION_ARG_CALLBACK , FUNC(partial_hidden) , _("Find hidden files in duplicate folders only") , NULL} , {"mtime-window" , 'Z' , 0 , G_OPTION_ARG_DOUBLE , &cfg->mtime_window , _("Consider duplicates only equal when mtime differs at max.
T seconds") , "T"} , diff --git a/lib/preprocess.c b/lib/preprocess.c index 355d8ec2..fa96aef2 100644 --- a/lib/preprocess.c +++ b/lib/preprocess.c @@ -569,7 +569,7 @@ static gboolean rm_pp_handle_inode_clusters(_UNUSED gpointer key, GQueue *inode_ /* hardlink cluster are counted as filtered files since they are either * ignored or treated as automatic duplicates depending on settings (so - * no effort eaither way); rm_pp_handle_hardlink will either free or bundle + * no effort either way); rm_pp_handle_hardlink will either free or bundle * the hardlinks depending on value of headfile->hardlinks.is_head. */ session->total_filtered_files -= rm_util_queue_foreach_remove( diff --git a/lib/shredder.c b/lib/shredder.c index ae24223f..35f63f80 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -1383,6 +1383,32 @@ static RmShredGroup *rm_shred_basename_rejects(RmShredGroup *group, RmShredTag * } +static RmShredGroup *rm_shred_keep_hardlink_rejects(RmShredGroup *group, _UNUSED RmShredTag *tag) { + if(!tag->session->cfg->keep_hardlinked_dupes) { + return NULL; + } + + if(group->status != RM_SHRED_GROUP_FINISHING) { + return NULL; + } + + RmShredGroup *rejects = NULL; + RmFile *headfile = group->held_files->head->data; + for(GList *iter = group->held_files->head->next, *next = NULL; iter; + iter = next) { + next = iter->next; + RmFile *curr = iter->data; + if(headfile->inode == curr->inode && headfile->dev == curr->dev) { + if(!rejects) { + rejects = rm_shred_create_rejects(group, curr); + } + rm_shred_group_transfer(curr, group, rejects); + } + } + + return rejects; +} + /* post-process a group: * decide which file(s) are originals * maybe split out mtime rejects (--mtime-window option) @@ -1403,6 +1429,7 @@ static void rm_shred_group_postprocess(RmShredGroup *group, RmShredTag *tag) { rm_shred_group_find_original(tag->session, group->held_files, group->status); rm_shred_group_postprocess(rm_shred_basename_rejects(group, tag), tag); 
rm_shred_group_postprocess(rm_shred_mtime_rejects(group, tag), tag); + rm_shred_group_postprocess(rm_shred_keep_hardlink_rejects(group, tag), tag); /* re-check whether what is left of the group still meets all criteria */ group->status = (rm_shred_group_qualifies(group)) ? RM_SHRED_GROUP_FINISHING diff --git a/tests/test_options/test_keep_hardlinks.py b/tests/test_options/test_keep_hardlinks.py new file mode 100644 index 00000000..fc4df414 --- /dev/null +++ b/tests/test_options/test_keep_hardlinks.py @@ -0,0 +1,41 @@ +#!/usr/bin/env python3 +# encoding: utf-8 +from nose import with_setup +from tests.utils import * + + +@with_setup(usual_setup_func, usual_teardown_func) +def test_keep_hardlinks(): + create_file('xxx', 'file_a') + create_link('file_a', 'file_b') + create_file('xxx', 'file_z') + + head, *data, footer = run_rmlint('--no-hardlinked -S a') + assert data[0]["path"].endswith("file_a") + assert data[0]["is_original"] is True + assert data[1]["path"].endswith("file_z") + assert data[1]["is_original"] is False + + head, *data, footer = run_rmlint('--hardlinked -S a') + assert data[0]["path"].endswith("file_a") + assert data[0]["is_original"] is True + assert data[1]["path"].endswith("file_b") + assert data[1]["is_original"] is False + assert data[2]["path"].endswith("file_z") + assert data[2]["is_original"] is False + + head, *data, footer = run_rmlint('--keep-hardlinked -S a') + assert data[0]["path"].endswith("file_b") + assert data[0]["is_original"] is True + assert data[1]["path"].endswith("file_a") + assert data[1]["is_original"] is True + assert data[2]["path"].endswith("file_z") + assert data[2]["is_original"] is False + + head, *data, footer = run_rmlint('--keep-hardlinked -S A') + assert data[0]["path"].endswith("file_z") + assert data[0]["is_original"] is True + assert data[1]["path"].endswith("file_b") + assert data[1]["is_original"] is False + assert data[2]["path"].endswith("file_a") + assert data[2]["is_original"] is False From 
491a5f8b7506a4a7bb94bd1fef099c61dd998ebc Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 17 Aug 2017 18:35:06 +1000 Subject: [PATCH 060/180] tests: add a test to trip up --keep-hardlinks --- tests/test_options/test_keep_hardlinks.py | 43 +++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/tests/test_options/test_keep_hardlinks.py b/tests/test_options/test_keep_hardlinks.py index fc4df414..ef37e129 100644 --- a/tests/test_options/test_keep_hardlinks.py +++ b/tests/test_options/test_keep_hardlinks.py @@ -39,3 +39,46 @@ def test_keep_hardlinks(): assert data[1]["is_original"] is False assert data[2]["path"].endswith("file_a") assert data[2]["is_original"] is False + + +def test_keep_hardlinks_multiple_originals(): + create_file('xxx', 'a/file_a') + create_file('xxx', 'a/file_y') + create_dirs('b') + create_link('a/file_a', 'b/file_b') + create_link('a/file_y', 'b/file_z') + + search_paths = TESTDIR_NAME + '/b // ' + TESTDIR_NAME + '/a' + + head, *data, footer = run_rmlint('--no-hardlinked -S a ' + search_paths, use_default_dir=False) + # hardlinks file_b and file_z should be ignored + assert len(data)==2 + assert data[0]["path"].endswith("file_a") + assert data[0]["is_original"] is True + assert data[1]["path"].endswith("file_y") + assert data[1]["is_original"] is False + + head, *data, footer = run_rmlint('--hardlinked -k -m -S a ' + search_paths, use_default_dir=False) + # files in folder a should both be originals because tagged + assert len(data)==4 + assert data[0]["path"].endswith("file_a") + assert data[0]["is_original"] is True + assert data[1]["path"].endswith("file_y") + assert data[1]["is_original"] is True + assert data[2]["path"].endswith("file_b") + assert data[2]["is_original"] is False + assert data[3]["path"].endswith("file_z") + assert data[3]["is_original"] is False + + head, *data, footer = run_rmlint('--keep-hardlinked -k -m -S a ' + search_paths, use_default_dir=False) + # files in folder a are tagged so should both be 
preserved; + # files in folder b are hardlinks of the two originals so should also be preserved + assert len(data)==4 + assert data[0]["path"].endswith("file_a") + assert data[0]["is_original"] is True + assert data[1]["path"].endswith("file_y") + assert data[1]["is_original"] is True + assert data[2]["path"].endswith("file_b") + assert data[2]["is_original"] is True + assert data[3]["path"].endswith("file_z") + assert data[3]["is_original"] is True From ec27aa51d0b74163f92c45a0ed6c4a0fed133e8e Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 17 Aug 2017 18:56:53 +1000 Subject: [PATCH 061/180] shredder: post-process --keep-hardlinks correctly if multiple originals --- lib/shredder.c | 39 +++++++++++++++++++++++---------------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/lib/shredder.c b/lib/shredder.c index 35f63f80..db8973c0 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -1383,30 +1383,36 @@ static RmShredGroup *rm_shred_basename_rejects(RmShredGroup *group, RmShredTag * } -static RmShredGroup *rm_shred_keep_hardlink_rejects(RmShredGroup *group, _UNUSED RmShredTag *tag) { +/* if cfg->keep_hardlinked_dupes then tag hardlinked dupes as originals */ +static void rm_shred_tag_hardlink_rejects(RmShredGroup *group, _UNUSED RmShredTag *tag) { if(!tag->session->cfg->keep_hardlinked_dupes) { - return NULL; + return; } if(group->status != RM_SHRED_GROUP_FINISHING) { - return NULL; + return; } - RmShredGroup *rejects = NULL; - RmFile *headfile = group->held_files->head->data; - for(GList *iter = group->held_files->head->next, *next = NULL; iter; - iter = next) { - next = iter->next; - RmFile *curr = iter->data; - if(headfile->inode == curr->inode && headfile->dev == curr->dev) { - if(!rejects) { - rejects = rm_shred_create_rejects(group, curr); + /* do triangular iteration over group to check if non-originals are hardlinks of + * originals */ + for(GList *i_orig = group->held_files->head; i_orig; i_orig = i_orig->next) { + RmFile *orig = 
i_orig->data; + rm_log_info_line("orig: %s", orig->folder->basename); + if(!orig->is_original) { + /* have gone past last original */ + break; + } + if(!orig->hardlinks) { + continue; + } + for(GList *i_dupe = i_orig->next, *next = NULL; i_dupe; i_dupe = next) { + next = i_dupe->next; + RmFile *dupe = i_dupe->data; + if(dupe->hardlinks == orig->hardlinks) { + dupe->is_original = TRUE; } - rm_shred_group_transfer(curr, group, rejects); } } - - return rejects; } /* post-process a group: @@ -1429,7 +1435,6 @@ static void rm_shred_group_postprocess(RmShredGroup *group, RmShredTag *tag) { rm_shred_group_find_original(tag->session, group->held_files, group->status); rm_shred_group_postprocess(rm_shred_basename_rejects(group, tag), tag); rm_shred_group_postprocess(rm_shred_mtime_rejects(group, tag), tag); - rm_shred_group_postprocess(rm_shred_keep_hardlink_rejects(group, tag), tag); /* re-check whether what is left of the group still meets all criteria */ group->status = (rm_shred_group_qualifies(group)) ? 
RM_SHRED_GROUP_FINISHING @@ -1440,6 +1445,8 @@ static void rm_shred_group_postprocess(RmShredGroup *group, RmShredTag *tag) { */ rm_shred_group_find_original(tag->session, group->held_files, group->status); + rm_shred_tag_hardlink_rejects(group, tag); + /* Update statistics */ if(group->status == RM_SHRED_GROUP_FINISHING) { rm_fmt_lock_state(tag->session->formats); From 4890d4cf3f964d6e43fc09322e90a992cd09bdf3 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 17 Aug 2017 18:58:51 +1000 Subject: [PATCH 062/180] tests: fix "Afferbeck Lauder" --- tests/test_options/test_keep_hardlinks.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tests/test_options/test_keep_hardlinks.py b/tests/test_options/test_keep_hardlinks.py index ef37e129..b4e8d339 100644 --- a/tests/test_options/test_keep_hardlinks.py +++ b/tests/test_options/test_keep_hardlinks.py @@ -25,9 +25,9 @@ def test_keep_hardlinks(): assert data[2]["is_original"] is False head, *data, footer = run_rmlint('--keep-hardlinked -S a') - assert data[0]["path"].endswith("file_b") + assert data[0]["path"].endswith("file_a") assert data[0]["is_original"] is True - assert data[1]["path"].endswith("file_a") + assert data[1]["path"].endswith("file_b") assert data[1]["is_original"] is True assert data[2]["path"].endswith("file_z") assert data[2]["is_original"] is False From fb43faba3f962b534c15a613db1e90238cb64a13 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Fri, 18 Aug 2017 23:38:27 +0200 Subject: [PATCH 063/180] docs: Integrate feedback from @Awerick for --no-hardlinked --- docs/rmlint.1.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 96e6bd54..28093494 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -299,8 +299,10 @@ Traversal Options :``-l --hardlinked`` (**default**) / ``-L --no-hardlinked``: - Whether to report hardlinked files as duplicates. 
If ``--no-hardlinked`` is given, - ``rmlint`` will filter all hardlinks to files it already knows of. + Whether to report hardlinked files as duplicates. With ``--no-hardlinked``, + if a set of hardlinked files is encountered, all except one are ignored. + The "highest ranked" (see ``-S``) of the set is the one that will be used + for further processing. Note that hardlinked files will not appear as space waste in the statistics, since they do not allocate any extra space if not all of them are removed. From 124c776b256e398b61a79433235c077cfae89899 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Tue, 22 Aug 2017 17:06:42 +0200 Subject: [PATCH 064/180] Fix compile on error on systems without FIEMAP --- lib/utilities.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/utilities.c b/lib/utilities.c index e613a468..fc4e815f 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -1129,6 +1129,7 @@ static gboolean rm_util_same_device(const char *path1, const char *path2) { g_list_free_full(mounts, (GDestroyNotify)g_unix_mount_free); return result; } + RmLinkType rm_util_link_type(char *path1, char *path2) { int fd1 = rm_sys_open(path1, O_RDONLY); if(fd1 == -1) { @@ -1249,7 +1250,7 @@ RmLinkType rm_util_link_type(char *path1, char *path2) { RM_RETURN(RM_LINK_ERROR); #else - RM_RETURN(RM_LINK_NO_DATA); + RM_RETURN(RM_LINK_NONE); #endif #undef RM_RETURN From 11fb8f9eda6b7144ec15d53500455acf28b45baf Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Fri, 3 Nov 2017 09:25:50 +0100 Subject: [PATCH 065/180] Small build fix for scons3 --- docs/SConscript | 2 +- src/SConscript | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/SConscript b/docs/SConscript index 4dbf6f31..378ebcaf 100644 --- a/docs/SConscript +++ b/docs/SConscript @@ -67,7 +67,7 @@ env.Alias('man', env.Depends(manpage, sphinx)) if 'install' in COMMAND_LINE_TARGETS: - man_install = env.InstallPerm('$PREFIX/share/man/man1', [manpage], 0644) + man_install = 
env.InstallPerm('$PREFIX/share/man/man1', [manpage], 0o644) target = env.Alias('install', [manpage, man_install]) diff --git a/src/SConscript b/src/SConscript index bd1e23b4..06efd47d 100644 --- a/src/SConscript +++ b/src/SConscript @@ -26,7 +26,7 @@ for progname in ['rmlint']: if 'install' in COMMAND_LINE_TARGETS: - env.Alias('install', env.InstallPerm('$PREFIX/bin', programs, 0755)) + env.Alias('install', env.InstallPerm('$PREFIX/bin', programs, 0o755)) env.Default(programs) From 7b93e80ff5736f44a5542aa7cb00b9ff7b4c882b Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Fri, 3 Nov 2017 13:55:00 +0100 Subject: [PATCH 066/180] scons: think of debian and other ancient things --- docs/SConscript | 6 +++++- src/SConscript | 5 ++++- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/docs/SConscript b/docs/SConscript index 378ebcaf..ec94b2a2 100644 --- a/docs/SConscript +++ b/docs/SConscript @@ -67,7 +67,11 @@ env.Alias('man', env.Depends(manpage, sphinx)) if 'install' in COMMAND_LINE_TARGETS: - man_install = env.InstallPerm('$PREFIX/share/man/man1', [manpage], 0o644) + man_install = env.InstallPerm( + '$PREFIX/share/man/man1', + [manpage], + int("644", 8), + ) target = env.Alias('install', [manpage, man_install]) diff --git a/src/SConscript b/src/SConscript index 06efd47d..02b6ac62 100644 --- a/src/SConscript +++ b/src/SConscript @@ -26,7 +26,10 @@ for progname in ['rmlint']: if 'install' in COMMAND_LINE_TARGETS: - env.Alias('install', env.InstallPerm('$PREFIX/bin', programs, 0o755)) + env.Alias( + 'install', + env.InstallPerm('$PREFIX/bin', programs, int("755", 8)) + ) env.Default(programs) From 685f4c3ef1090e9abefa7a9610e2a5974b081409 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sat, 4 Nov 2017 22:36:19 +1000 Subject: [PATCH 067/180] hasher: better exception handling during file reads, refer #256 --- lib/hasher.c | 250 ++++++++++++++++++++++++++++----------------------- 1 file changed, 136 insertions(+), 114 deletions(-) diff --git a/lib/hasher.c 
b/lib/hasher.c index 75a78fc2..140f5ce0 100644 --- a/lib/hasher.c +++ b/lib/hasher.c @@ -131,7 +131,8 @@ static void rm_hasher_request_readahead(int fd, RmOff seek_offset, RmOff bytes_t #endif } -static gint64 rm_hasher_symlink_read(RmHasher *hasher, RmDigest *digest, char *path) { +static gboolean rm_hasher_symlink_read(RmHasher *hasher, RmDigest *digest, char *path, + gsize *bytes_actually_read) { /* Fake an IO operation on the symlink. */ RmBuffer *buf = rm_buffer_get(hasher->mem_pool); buf->len = 256; @@ -141,13 +142,14 @@ static gint64 rm_hasher_symlink_read(RmHasher *hasher, RmDigest *digest, char *p if(rm_sys_stat(path, &stat_buf) == -1) { /* Oops, that did not work out, report as an error */ rm_log_perror("Cannot stat symbolic link"); - return -1; + return FALSE; } - gint data_size = snprintf((char *)buf->data, rm_buffer_size(hasher->mem_pool), - "%ld:%ld", (long)stat_buf.st_dev, (long)stat_buf.st_ino); + gint data_size = snprintf((char *)buf->data, hasher->buf_size, "%ld:%ld", + (long)stat_buf.st_dev, (long)stat_buf.st_ino); buf->len = data_size; buf->digest = digest; + *bytes_actually_read = buf->len; rm_digest_buffered_update(buf); @@ -157,95 +159,95 @@ static gint64 rm_hasher_symlink_read(RmHasher *hasher, RmDigest *digest, char *p if(digest->type == RM_DIGEST_PARANOID) { rm_digest_paranoia_shrink(digest, data_size); } - return 0; + return TRUE; } -/* Reads data from file and sends to hasher threadpool - * returns number of bytes successfully read */ +/* Reads data from file and sends to hasher threadpool; + * returns true if no errors encountered; + * increments *bytes_read by the actual bytes read */ -static gint64 rm_hasher_buffered_read(RmHasher *hasher, GThreadPool *hashpipe, - RmDigest *digest, char *path, gsize start_offset, - gsize bytes_to_read) { +static gboolean rm_hasher_buffered_read(RmHasher *hasher, GThreadPool *hashpipe, + RmDigest *digest, char *path, gsize start_offset, + gsize bytes_to_read, gsize *bytes_actually_read) { FILE *fd = 
NULL; - if(bytes_to_read == 0) { - bytes_to_read = G_MAXSIZE; - } - - gsize total_bytes_read = 0; - - if((fd = fopen(path, "rb")) == NULL) { + fd = fopen(path, "rb"); + if(fd == NULL) { rm_log_info("fopen(3) failed for %s: %s\n", path, g_strerror(errno)); - goto finish; + return FALSE; } - gint32 bytes_read = 0; - - rm_hasher_request_readahead(fileno(fd), start_offset, bytes_to_read); + gboolean read_to_eof = (bytes_to_read == 0); + rm_hasher_request_readahead(fileno(fd), start_offset, + read_to_eof ? G_MAXSIZE : bytes_to_read); if(fseek(fd, start_offset, SEEK_SET) == -1) { rm_log_perror("fseek(3) failed"); - goto finish; + fclose(fd); + return FALSE; } - RmBuffer *buffer = rm_buffer_get(hasher->mem_pool); + gboolean success = FALSE; + gsize bytes_remaining = bytes_to_read; + + while(TRUE) { + RmBuffer *buffer = rm_buffer_get(hasher->mem_pool); + + gsize want_bytes = MIN(bytes_remaining, hasher->buf_size); + + gsize bytes_read = fread(buffer->data, 1, want_bytes, fd); + + if(ferror(fd) != 0) { + rm_log_perror("fread(3) failed"); + rm_buffer_release(buffer); + break; + } - while((bytes_read = - fread(buffer->data, 1, MIN(bytes_to_read, hasher->buf_size), fd)) > 0) { - bytes_to_read -= bytes_read; + bytes_remaining -= bytes_read; + *bytes_actually_read += bytes_read; buffer->len = bytes_read; buffer->digest = digest; buffer->user_data = NULL; - rm_util_thread_pool_push(hashpipe, buffer); - total_bytes_read += bytes_read; - buffer = rm_buffer_get(hasher->mem_pool); - } - rm_buffer_release(buffer); - - if(ferror(fd) != 0) { - rm_log_perror("fread(3) failed"); - if(total_bytes_read == bytes_to_read) { - /* signal error to caller */ - total_bytes_read++; + if(read_to_eof && feof(fd)) { + success = TRUE; + break; + } else if(bytes_remaining == 0) { + success = TRUE; + break; + } else if(feof(fd)) { + rm_log_error_line("Unexpected EOF in rm_hasher_buffered_read"); + break; + } else if(bytes_read == 0) { + rm_log_error_line(_("Something went wrong reading %s; expected %li 
bytes, " + "got %li; ignoring"), + path, (long int)bytes_to_read, + (long int)*bytes_actually_read); + break; } } - -finish: - if(fd != NULL) { - fclose(fd); - } - return total_bytes_read; + fclose(fd); + return success; } /* Reads data from file and sends to hasher threadpool - * returns number of bytes successfully read */ + * returns true if no errors encountered; + * increments *bytes_read by the actual bytes read */ -static gint64 rm_hasher_unbuffered_read(RmHasher *hasher, GThreadPool *hashpipe, - RmDigest *digest, char *path, gint64 start_offset, - gint64 bytes_to_read) { +static gboolean rm_hasher_unbuffered_read(RmHasher *hasher, GThreadPool *hashpipe, + RmDigest *digest, char *path, + gint64 start_offset, gint64 bytes_to_read, + gsize *bytes_actually_read) { gint32 bytes_read = 0; - gint64 total_bytes_read = 0; guint64 file_offset = start_offset; - if(bytes_to_read == 0) { - RmStat stat_buf; - if(rm_sys_stat(path, &stat_buf) != -1) { - bytes_to_read = MAX(stat_buf.st_size - start_offset, 0); - } - } - - /* how many buffers to read? */ - const gint16 N_BUFFERS = MIN(4, DIVIDE_CEIL(bytes_to_read, hasher->buf_size)); - struct iovec readvec[N_BUFFERS + 1]; - - int fd = 0; + gboolean read_to_eof = (bytes_to_read == 0); - fd = rm_sys_open(path, O_RDONLY); + int fd = rm_sys_open(path, O_RDONLY); if(fd == -1) { rm_log_info("open(2) failed for %s: %s\n", path, g_strerror(errno)); - goto finish; + return FALSE; } /* preadv() is beneficial for large files since it can cut the @@ -262,68 +264,85 @@ static gint64 rm_hasher_unbuffered_read(RmHasher *hasher, GThreadPool *hashpipe, /* Give the kernel scheduler some hints */ rm_hasher_request_readahead(fd, start_offset, bytes_to_read); - /* Initialize the buffers to begin with. - * After a buffer is full, a new one is retrieved. - */ + /* how many buffers to read? 
*/ + guint16 N_BUFFERS = 4; + if(bytes_to_read > 0) { + N_BUFFERS = MIN(N_BUFFERS, DIVIDE_CEIL(bytes_to_read, hasher->buf_size)); + } + + /* Allocate buffer vector */ RmBuffer **buffers; buffers = g_slice_alloc(sizeof(*buffers) * N_BUFFERS); + struct iovec readvec[N_BUFFERS + 1]; memset(readvec, 0, sizeof(readvec)); - for(int i = 0; i < N_BUFFERS; ++i) { - /* buffer is one contignous memory block */ - buffers[i] = rm_buffer_get(hasher->mem_pool); - readvec[i].iov_base = buffers[i]->data; - readvec[i].iov_len = hasher->buf_size; - } - while((bytes_to_read == 0 || total_bytes_read < bytes_to_read) && - (bytes_read = rm_sys_preadv(fd, readvec, N_BUFFERS, file_offset)) > 0) { - bytes_read = - MIN(bytes_read, bytes_to_read - total_bytes_read); /* ignore over-reads */ + gboolean success = FALSE; + gsize bytes_remaining = bytes_to_read; - int blocks = DIVIDE_CEIL(bytes_read, hasher->buf_size); - rm_assert_gentle(blocks <= N_BUFFERS); + while(TRUE) { + /* allocate buffers for preadv */ + for(int i = 0; i < N_BUFFERS; ++i) { + buffers[i] = rm_buffer_get(hasher->mem_pool); + readvec[i].iov_base = buffers[i]->data; + readvec[i].iov_len = hasher->buf_size; + } + + bytes_read = rm_sys_preadv(fd, readvec, N_BUFFERS, file_offset); + + if(bytes_read == -1) { + /* error occurred */ + rm_log_perror("preadv failed"); + /* Release the buffers and give up*/ + for(int i = 0; i < N_BUFFERS; ++i) { + rm_buffer_release(buffers[i]); + } + break; + } - total_bytes_read += bytes_read; + /* ignore over-reads */ + bytes_read = MIN((gsize)bytes_read, bytes_remaining); + + /* update totals */ file_offset += bytes_read; + *bytes_actually_read += bytes_read; + bytes_remaining -= bytes_read; - for(int i = 0; i < blocks; ++i) { - /* Get the RmBuffer from the datapointer */ + /* send buffers */ + for(int i = 0; i < N_BUFFERS; ++i) { RmBuffer *buffer = buffers[i]; - buffer->len = MIN(hasher->buf_size, bytes_read - i * hasher->buf_size); - buffer->digest = digest; - buffer->user_data = NULL; - - /* Send 
it to the hasher */ - rm_util_thread_pool_push(hashpipe, buffer); - /* Allocate a new buffer - hasher will release the old buffer */ - buffers[i] = rm_buffer_get(hasher->mem_pool); - readvec[i].iov_base = buffers[i]->data; - readvec[i].iov_len = hasher->buf_size; + buffer->len = CLAMP(bytes_read - i * (gint32)hasher->buf_size, 0, + (gint32)hasher->buf_size); + if(buffer->len > 0) { + /* Send it to the hasher */ + buffer->digest = digest; + buffer->user_data = NULL; + rm_util_thread_pool_push(hashpipe, buffer); + } else { + rm_buffer_release(buffer); + } } - } - if(bytes_read == -1) { - rm_log_perror("preadv failed"); - } else if(total_bytes_read != bytes_to_read) { - rm_log_error_line(_("Something went wrong reading %s; expected %li bytes, " - "got %li; ignoring"), - path, (long int)bytes_to_read, (long int)total_bytes_read); + if(read_to_eof && bytes_read == 0) { + success = TRUE; + break; + } else if(bytes_remaining == 0) { + success = TRUE; + break; + } else if(bytes_read == 0) { + rm_log_error_line(_("Something went wrong reading %s; expected %li bytes, " + "got %li; ignoring"), + path, (long int)bytes_to_read, + (long int)*bytes_actually_read); + break; + } } - /* Release the rest of the buffers */ - for(int i = 0; i < N_BUFFERS; ++i) { - rm_buffer_release(buffers[i]); - } g_slice_free1(sizeof(*buffers) * N_BUFFERS, buffers); + rm_sys_close(fd); -finish: - if(fd > 0) { - rm_sys_close(fd); - } - - return total_bytes_read; + return success; } ////////////////////////////////////// @@ -447,21 +466,24 @@ gboolean rm_hasher_task_hash(RmHasherTask *task, char *path, guint64 start_offse guint64 bytes_to_read, gboolean is_symlink, RmOff *bytes_read_out) { guint64 bytes_read = 0; + gboolean success = false; + if(is_symlink) { - bytes_read = rm_hasher_symlink_read(task->hasher, task->digest, path); + success = rm_hasher_symlink_read(task->hasher, task->digest, path, &bytes_read); } else if(task->hasher->use_buffered_read) { - bytes_read = 
rm_hasher_buffered_read(task->hasher, task->hashpipe, task->digest, - path, start_offset, bytes_to_read); + success = rm_hasher_buffered_read(task->hasher, task->hashpipe, task->digest, + path, start_offset, bytes_to_read, &bytes_read); } else { - bytes_read = rm_hasher_unbuffered_read(task->hasher, task->hashpipe, task->digest, - path, start_offset, bytes_to_read); + success = + rm_hasher_unbuffered_read(task->hasher, task->hashpipe, task->digest, path, + start_offset, bytes_to_read, &bytes_read); } if(bytes_read_out != NULL) { - *bytes_read_out = bytes_to_read; + *bytes_read_out = bytes_read; } - return ((is_symlink && bytes_read == 0) || bytes_read == bytes_to_read); + return success; } RmDigest *rm_hasher_task_finish(RmHasherTask *task) { From 46b0fa6d6460974d085e0b90bfe3b8d275c1bf82 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sat, 4 Nov 2017 22:32:29 +1000 Subject: [PATCH 068/180] utilities: fail rm_sys_preadv is seek fails on apple & cygwin --- lib/utilities.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/utilities.h b/lib/utilities.h index 2749300e..75ecb9e5 100644 --- a/lib/utilities.h +++ b/lib/utilities.h @@ -132,11 +132,13 @@ static inline gint64 rm_sys_preadv(int fd, const struct iovec *iov, int iovcnt, #if RM_IS_APPLE || RM_IS_CYGWIN if(lseek(fd, offset, SEEK_SET) == -1) { rm_log_perror("seek in emulated preadv failed"); + return 0; } return readv(fd, iov, iovcnt); #elif RM_PLATFORM_32 if(lseek64(fd, offset, SEEK_SET) == -1) { rm_log_perror("seek in emulated preadv failed"); + return 0; } return readv(fd, iov, iovcnt); #else From e4f57d6d13005703c57e2b2810414519ea50f1a7 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sat, 4 Nov 2017 23:01:59 +1000 Subject: [PATCH 069/180] shredder: add a TODO --- lib/shredder.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/shredder.c b/lib/shredder.c index db8973c0..83bf3c3d 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -1640,6 +1640,7 @@ static gint rm_shred_process_file(RmFile *file, 
RmSession *session) { shredder_waiting = FALSE; } + /* TODO: make this threadsafe: */ session->shred_bytes_read += bytes_read; /* Update totals for file, device and session*/ From 05f8b5f681c27cabb901c99cdacc690d1fc9eccf Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sat, 4 Nov 2017 22:38:27 +1000 Subject: [PATCH 070/180] hasher: read symlink text instead of target dev:inode --- lib/hasher.c | 41 +++++++++++++++++------------------------ 1 file changed, 17 insertions(+), 24 deletions(-) diff --git a/lib/hasher.c b/lib/hasher.c index 140f5ce0..05896493 100644 --- a/lib/hasher.c +++ b/lib/hasher.c @@ -131,34 +131,26 @@ static void rm_hasher_request_readahead(int fd, RmOff seek_offset, RmOff bytes_t #endif } -static gboolean rm_hasher_symlink_read(RmHasher *hasher, RmDigest *digest, char *path, +static gboolean rm_hasher_symlink_read(RmHasher *hasher, GThreadPool *hashpipe, + RmDigest *digest, char *path, gsize *bytes_actually_read) { - /* Fake an IO operation on the symlink. */ - RmBuffer *buf = rm_buffer_get(hasher->mem_pool); - buf->len = 256; - memset(buf->data, 0, buf->len); - - RmStat stat_buf; - if(rm_sys_stat(path, &stat_buf) == -1) { - /* Oops, that did not work out, report as an error */ - rm_log_perror("Cannot stat symbolic link"); + /* Read contents of symlink (i.e. path of symlink's target). 
*/ + + RmBuffer *buffer = rm_buffer_get(hasher->mem_pool); + gint len = readlink(path, (char *)buffer->data, rm_buffer_size(hasher->mem_pool)); + + if (len < 0) { + rm_log_perror("Cannot read symbolic link"); + rm_buffer_release(buffer); return FALSE; } - gint data_size = snprintf((char *)buf->data, hasher->buf_size, "%ld:%ld", - (long)stat_buf.st_dev, (long)stat_buf.st_ino); - buf->len = data_size; - buf->digest = digest; - *bytes_actually_read = buf->len; + *bytes_actually_read = len; + buffer->len = len; + buffer->digest = digest; + buffer->user_data = NULL; + rm_util_thread_pool_push(hashpipe, buffer); - rm_digest_buffered_update(buf); - - /* In case of paranoia: shrink the used data buffer, so comparasion works - * as expected. Otherwise a full buffer is used with possibly different - * content */ - if(digest->type == RM_DIGEST_PARANOID) { - rm_digest_paranoia_shrink(digest, data_size); - } return TRUE; } @@ -469,7 +461,8 @@ gboolean rm_hasher_task_hash(RmHasherTask *task, char *path, guint64 start_offse gboolean success = false; if(is_symlink) { - success = rm_hasher_symlink_read(task->hasher, task->digest, path, &bytes_read); + success = rm_hasher_symlink_read(task->hasher, task->hashpipe, task->digest, + path, &bytes_read); } else if(task->hasher->use_buffered_read) { success = rm_hasher_buffered_read(task->hasher, task->hashpipe, task->digest, path, start_offset, bytes_to_read, &bytes_read); From dcf23a060eb2f42a7dca0ebf84e81e460c83dfe6 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sun, 5 Nov 2017 18:17:07 +1000 Subject: [PATCH 071/180] SConstruct: automate choice of -j (Spot got a new cpu ) --- SConstruct | 33 +++++++++++++++++++++++++++++++++ docs/install.rst | 4 ++-- pkg/arch/PKGBUILD | 2 +- pkg/fedora/rmlint.spec | 4 ++-- 4 files changed, 38 insertions(+), 5 deletions(-) diff --git a/SConstruct b/SConstruct index 67c7cdab..2f6fa89c 100755 --- a/SConstruct +++ b/SConstruct @@ -674,6 +674,39 @@ SConsEnvironment.InstallPerm = InstallPerm # Your extra 
checks here env = conf.Finish() +def get_cpu_count(): + # priority: environ('NUM_CPU'), else try to read actual cpu count, else fallback + fallback = 4 + + if 'NUM_CPU' in os.environ: + return int(os.environ.get('NUM_CPU')) + + # try multiprocessing.cpu_count() (Python 2.6+) + try: + import multiprocessing + return multiprocessing.cpu_count() + except (ImportError, NotImplementedError): + pass + + # try psutil.cpu_count() + try: + import psutil + return psutil.cpu_count() + except (ImportError, AttributeError): + pass + + # default value + return fallback + + +# set number of parallel jobs during build +# note: while not particularly intuitive or obvious from the documentation, +# SetOption() will *not* over-ride commandline option passed by `scons -j` +# or `scons --jobs=` +SetOption('num_jobs', get_cpu_count()) + +print "Running with --jobs=" + repr(GetOption('num_jobs')) + library = SConscript('lib/SConscript') programs = SConscript('src/SConscript', exports='library') env.Default(library) diff --git a/docs/install.rst b/docs/install.rst index 62846861..58951a65 100644 --- a/docs/install.rst +++ b/docs/install.rst @@ -152,9 +152,9 @@ build the software from the potentially unstable ``develop`` branch: $ git clone -b develop https://github.com/sahib/rmlint.git $ cd rmlint/ $ scons config # Look what features scons would compile - $ scons DEBUG=1 -j4 # Optional, build locally. + $ scons DEBUG=1 # Optional, build locally. # Install (and build if necessary). For releases you can omit DEBUG=1 - $ sudo scons DEBUG=1 -j4 --prefix=/usr install + $ sudo scons DEBUG=1 --prefix=/usr install Done! 
diff --git a/pkg/arch/PKGBUILD b/pkg/arch/PKGBUILD index 2a5b7c53..6cd27a9f 100644 --- a/pkg/arch/PKGBUILD +++ b/pkg/arch/PKGBUILD @@ -25,7 +25,7 @@ pkgver() { build() { cd "${srcdir}/${pkgname}" - scons -j4 DEBUG=1 --prefix=${pkgdir}/usr --actual-prefix=/usr + scons DEBUG=1 --prefix=${pkgdir}/usr --actual-prefix=/usr } package() { diff --git a/pkg/fedora/rmlint.spec b/pkg/fedora/rmlint.spec index 4fe739ce..36fa1a6d 100644 --- a/pkg/fedora/rmlint.spec +++ b/pkg/fedora/rmlint.spec @@ -16,13 +16,13 @@ especially an extremely fast tool to remove duplicates from your filesystem. %prep %autosetup -c rmlint-%{version} -%build scons config; scons -j4 --prefix=%{buildroot}/usr --actual-prefix=/usr --libdir=lib64 +%build scons config; scons --prefix=%{buildroot}/usr --actual-prefix=/usr --libdir=lib64 %install # Build rmlint, install it into BUILDROOT/-/, # but take care rmlint thinks it's installed to /usr (--actual_prefix) -scons install -j4 --prefix=%{buildroot}/usr --actual-prefix=/usr --libdir=lib64 +scons install --prefix=%{buildroot}/usr --actual-prefix=/usr --libdir=lib64 # Find all rmlint.mo files and put them in rmlint.lang %find_lang %{name} From 233222d9f657fe097103cc49440879a2322f8683 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 6 Nov 2017 21:29:11 +1000 Subject: [PATCH 072/180] https://www.opensourceshakespeare.org/views/plays/play_view.php?WorkID=troilus&Act=5&Scene=7&Scope=scene verse 3580 --- docs/rmlint.1.rst | 3 +-- lib/checksum.c | 39 +++++++++------------------------------ lib/checksum.h | 1 - lib/cmdline.c | 5 +---- lib/hash-utility.c | 2 +- tests/utils.py | 2 +- 6 files changed, 13 insertions(+), 39 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 28093494..6ad438ce 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -161,7 +161,6 @@ General Options There are also some compound variations of the above functions: - * **bastard:** 256bit, combining **city**, and **murmur**. 
* **city256, city512, murmur256, murmur512:** Use multiple 128-bit hashes with different seeds. * **spooky32, spooky64:** Faster version of **spooky** with less bits. We strongly advise against using these. @@ -173,7 +172,7 @@ General Options * **-p** is equivalent to **--algorithm=sha512** * **-pp** is equivalent to **--algorithm=paranoid** - * **-P** is equivalent to **--algorithm bastard** + * **-P** is equivalent to **--algorithm ** * **-PP** is equivalent to **--algorithm spooky** :``-v --loud`` / ``-V --quiet``: diff --git a/lib/checksum.c b/lib/checksum.c index c5b26a6e..0c2e5be7 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -170,8 +170,6 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { {"ext", RM_DIGEST_EXT}, {"cumulative", RM_DIGEST_CUMULATIVE}, {"paranoid", RM_DIGEST_PARANOID}, - {"bastard", RM_DIGEST_BASTARD}, - {"bastard256", RM_DIGEST_BASTARD}, {"city", RM_DIGEST_CITY}, {"city128", RM_DIGEST_CITY}, {"city256", RM_DIGEST_CITY256}, @@ -230,7 +228,6 @@ const char *rm_digest_type_to_string(RmDigestType type) { [RM_DIGEST_BLAKE2BP] = "blake2bp", [RM_DIGEST_MURMUR256] = "murmur256", [RM_DIGEST_CITY256] = "city256", - [RM_DIGEST_BASTARD] = "bastard", [RM_DIGEST_MURMUR512] = "murmur512", [RM_DIGEST_CITY512] = "city512", [RM_DIGEST_EXT] = "ext", @@ -244,16 +241,15 @@ const char *rm_digest_type_to_string(RmDigestType type) { /* TODO: remove? 
*/ int rm_digest_type_to_multihash_id(RmDigestType type) { - static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, - [RM_DIGEST_SPOOKY] = 14, [RM_DIGEST_SPOOKY32] = 16, - [RM_DIGEST_SPOOKY64] = 18, [RM_DIGEST_CITY] = 15, - [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, - [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, - [RM_DIGEST_MURMUR256] = 7, [RM_DIGEST_CITY256] = 8, - [RM_DIGEST_BASTARD] = 9, [RM_DIGEST_MURMUR512] = 10, - [RM_DIGEST_CITY512] = 11, [RM_DIGEST_EXT] = 12, - [RM_DIGEST_FARMHASH] = 19, [RM_DIGEST_CUMULATIVE] = 13, - [RM_DIGEST_PARANOID] = 14}; + static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, + [RM_DIGEST_SPOOKY] = 14, [RM_DIGEST_SPOOKY32] = 16, + [RM_DIGEST_SPOOKY64] = 18, [RM_DIGEST_CITY] = 15, + [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, + [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, + [RM_DIGEST_MURMUR256] = 7, [RM_DIGEST_CITY256] = 8, + [RM_DIGEST_MURMUR512] = 10, [RM_DIGEST_CITY512] = 11, + [RM_DIGEST_EXT] = 12, [RM_DIGEST_FARMHASH] = 19, + [RM_DIGEST_CUMULATIVE] = 13,[RM_DIGEST_PARANOID] = 14}; return ids[MIN(type, sizeof(ids) / sizeof(ids[0]))]; } @@ -363,7 +359,6 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s break; case RM_DIGEST_MURMUR256: case RM_DIGEST_CITY256: - case RM_DIGEST_BASTARD: digest->bytes = 256 / 8; break; case RM_DIGEST_SPOOKY: @@ -402,11 +397,6 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s } } - if(digest->type == RM_DIGEST_BASTARD) { - /* bastard type *always* has *pure* murmur hash for first checksum - * and seeded city for second checksum */ - digest->checksum[0].first = digest->checksum[0].second = 0; - } return digest; } @@ -466,7 +456,6 @@ void rm_digest_free(RmDigest *digest) { case RM_DIGEST_CITY512: case RM_DIGEST_MURMUR256: case RM_DIGEST_CITY256: - case RM_DIGEST_BASTARD: case RM_DIGEST_SPOOKY: case RM_DIGEST_SPOOKY32: case RM_DIGEST_SPOOKY64: @@ -572,14 +561,6 @@ void rm_digest_update(RmDigest 
*digest, const unsigned char *data, RmOff size) { memcpy(&digest->checksum[block], &old, sizeof(uint128)); } break; - case RM_DIGEST_BASTARD: - MurmurHash3_x86_128(data, size, (uint32_t)digest->checksum[0].first, - &digest->checksum[0]); - - uint128 old = {digest->checksum[1].first, digest->checksum[1].second}; - old = CityHash128WithSeed((const char *)data, size, old); - memcpy(&digest->checksum[1], &old, sizeof(uint128)); - break; case RM_DIGEST_CUMULATIVE: { /* This only XORS the two checksums. */ for(gsize i = 0; i < digest->bytes; ++i) { @@ -726,7 +707,6 @@ RmDigest *rm_digest_copy(RmDigest *digest) { case RM_DIGEST_MURMUR512: case RM_DIGEST_XXHASH: case RM_DIGEST_FARMHASH: - case RM_DIGEST_BASTARD: case RM_DIGEST_CUMULATIVE: case RM_DIGEST_EXT: self = rm_digest_new(digest->type, 0, 0, digest->bytes, FALSE); @@ -771,7 +751,6 @@ static gboolean rm_digest_needs_steal(RmDigestType digest_type) { case RM_DIGEST_FARMHASH: case RM_DIGEST_MURMUR256: case RM_DIGEST_MURMUR512: - case RM_DIGEST_BASTARD: case RM_DIGEST_CUMULATIVE: case RM_DIGEST_EXT: case RM_DIGEST_PARANOID: diff --git a/lib/checksum.h b/lib/checksum.h index 708ea351..b92339a9 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -54,7 +54,6 @@ typedef enum RmDigestType { RM_DIGEST_BLAKE2XS, RM_DIGEST_MURMUR256, RM_DIGEST_CITY256, - RM_DIGEST_BASTARD, RM_DIGEST_MURMUR512, RM_DIGEST_CITY512, RM_DIGEST_XXHASH, diff --git a/lib/cmdline.c b/lib/cmdline.c index 071ac588..34a9d48e 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -756,7 +756,7 @@ static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, cfg->checksum_type = RM_DIGEST_XXHASH; break; case -1: - cfg->checksum_type = RM_DIGEST_BASTARD; + cfg->checksum_type = RM_DIGEST_CITY256; break; case 0: /* leave users choice of -a (default) */ @@ -799,9 +799,6 @@ static gboolean rm_cmd_parse_algorithm(_UNUSED const char *option_name, if(cfg->checksum_type == RM_DIGEST_UNKNOWN) { g_set_error(error, RM_ERROR_QUARK, 0, _("Unknown hash algorithm: 
'%s'"), value); return false; - } else if(cfg->checksum_type == RM_DIGEST_BASTARD) { - session->hash_seed1 = time(NULL) * (GPOINTER_TO_UINT(session)); - session->hash_seed2 = GPOINTER_TO_UINT(&session); } return true; } diff --git a/lib/hash-utility.c b/lib/hash-utility.c index 4c95caec..a63c0d48 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -164,7 +164,7 @@ int rm_hasher_main(int argc, const char **argv) { "\n %s\n"), "spooky, city, xxhash, sha{1,256,512}, md5, murmur", "spooky{32,64,128}, city{128,256,512}, murmur{512}", - "farmhash, cumulative, paranoid, ext, bastard"); + "farmhash, cumulative, paranoid, ext"); g_option_group_add_entries(main_group, entries); g_option_context_set_main_group(context, main_group); diff --git a/tests/utils.py b/tests/utils.py index 59fe9f9e..d2771a83 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -213,7 +213,7 @@ def run_rmlint_pedantic(*args, **kwargs): ] cksum_types = [ - 'paranoid', 'sha1', 'sha256', 'spooky', 'bastard', 'city', + 'paranoid', 'sha1', 'sha256', 'spooky', 'city', 'md5', 'city256', 'city512', 'murmur', 'murmur256', 'murmur512', 'spooky32', 'spooky64', 'xxhash', 'farmhash', 'sha3-256', 'sha3-384', 'sha3-512', From 3210fe1fa988bea547f70520f4702f5146d0f8fd Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 6 Nov 2017 21:30:30 +1000 Subject: [PATCH 073/180] checksum: fix an oops in code_table --- lib/checksum.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/checksum.c b/lib/checksum.c index 0c2e5be7..e67a066f 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -146,7 +146,7 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { {"md5", RM_DIGEST_MD5}, {"city512", RM_DIGEST_CITY512}, {"xxhash", RM_DIGEST_XXHASH}, - {"farmhash", RM_DIGEST_XXHASH}, + {"farmhash", RM_DIGEST_FARMHASH}, {"murmur", RM_DIGEST_MURMUR}, {"murmur128", RM_DIGEST_MURMUR}, {"murmur256", RM_DIGEST_MURMUR256}, From eb54e9fdb384f79e40f03666fc1220dac9c33d4d Mon Sep 17 00:00:00 2001 From: 
SeeSpotRun Date: Mon, 6 Nov 2017 21:33:47 +1000 Subject: [PATCH 074/180] checksum: remove duplicate entries --- lib/checksum.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index e67a066f..b790adc4 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -144,7 +144,6 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { RmDigestType code; } code_entries[] = { {"md5", RM_DIGEST_MD5}, - {"city512", RM_DIGEST_CITY512}, {"xxhash", RM_DIGEST_XXHASH}, {"farmhash", RM_DIGEST_FARMHASH}, {"murmur", RM_DIGEST_MURMUR}, @@ -161,8 +160,6 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { {"blake2b", RM_DIGEST_BLAKE2B}, {"blake2sp", RM_DIGEST_BLAKE2SP}, {"blake2bp", RM_DIGEST_BLAKE2BP}, - {"city256", RM_DIGEST_CITY256}, - {"murmur256", RM_DIGEST_MURMUR256}, {"spooky32", RM_DIGEST_SPOOKY32}, {"spooky64", RM_DIGEST_SPOOKY64}, {"spooky128", RM_DIGEST_SPOOKY}, @@ -183,6 +180,9 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { const size_t n_codes = sizeof(code_entries) / sizeof(code_entries[0]); for(size_t idx = 0; idx < n_codes; idx++) { + if(g_hash_table_contains(*code_table, code_entries[idx].name)) { + rm_log_error_line("Duplicate entry for %s", code_entries[idx].name); + } g_hash_table_insert(*code_table, code_entries[idx].name, GUINT_TO_POINTER(code_entries[idx].code)); From acbd08bac860020ded70efd593fd76f1d166a42a Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 8 Nov 2017 09:49:41 +1000 Subject: [PATCH 075/180] checksum: abandon compound city & murmur hashes, they were a dumb idea --- docs/rmlint.1.rst | 12 +++---- lib/checksum.c | 87 ++++++++-------------------------------------- lib/checksum.h | 4 --- lib/cmdline.c | 4 +-- lib/hash-utility.c | 3 -- tests/utils.py | 3 +- 6 files changed, 23 insertions(+), 90 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 6ad438ce..0464e9d1 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ 
-159,21 +159,19 @@ General Options **sha512**, **farmhash**, **sha3**, **sha3-256**, **sha3-384**, **sha3-512**, **blake2s**, **blake2b**, **blake2sp**, **blake2bp**. - There are also some compound variations of the above functions: - - * **city256, city512, murmur256, murmur512:** Use multiple 128-bit hashes with different seeds. - * **spooky32, spooky64:** Faster version of **spooky** with less bits. We strongly advise against using these. + There are also some weaker hashes; we strongly advise against using these: + * **spooky32, spooky64:** Faster version of **spooky** with less bits. :``-p --paranoid`` / ``-P --less-paranoid`` (**default**): Increase or decrease the paranoia of ``rmlint``'s duplicate algorithm. Use ``-pp`` if you want byte-by-byte comparison without any hashing. - * **-p** is equivalent to **--algorithm=sha512** + * **-p** is equivalent to **--algorithm=** * **-pp** is equivalent to **--algorithm=paranoid** - * **-P** is equivalent to **--algorithm ** - * **-PP** is equivalent to **--algorithm spooky** + * **-P** is equivalent to **--algorithm=** + * **-PP** is equivalent to **--algorithm=** :``-v --loud`` / ``-V --quiet``: diff --git a/lib/checksum.c b/lib/checksum.c index b790adc4..8cf08599 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -147,9 +147,6 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { {"xxhash", RM_DIGEST_XXHASH}, {"farmhash", RM_DIGEST_FARMHASH}, {"murmur", RM_DIGEST_MURMUR}, - {"murmur128", RM_DIGEST_MURMUR}, - {"murmur256", RM_DIGEST_MURMUR256}, - {"murmur512", RM_DIGEST_MURMUR512}, {"sha1", RM_DIGEST_SHA1}, {"sha256", RM_DIGEST_SHA256}, {"sha3", RM_DIGEST_SHA3_256}, @@ -168,9 +165,6 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { {"cumulative", RM_DIGEST_CUMULATIVE}, {"paranoid", RM_DIGEST_PARANOID}, {"city", RM_DIGEST_CITY}, - {"city128", RM_DIGEST_CITY}, - {"city256", RM_DIGEST_CITY256}, - {"city512", RM_DIGEST_CITY512}, #if HAVE_SHA512 {"sha512", RM_DIGEST_SHA512}, 
#endif @@ -226,10 +220,6 @@ const char *rm_digest_type_to_string(RmDigestType type) { [RM_DIGEST_BLAKE2B] = "blake2b", [RM_DIGEST_BLAKE2SP] = "blake2sp", [RM_DIGEST_BLAKE2BP] = "blake2bp", - [RM_DIGEST_MURMUR256] = "murmur256", - [RM_DIGEST_CITY256] = "city256", - [RM_DIGEST_MURMUR512] = "murmur512", - [RM_DIGEST_CITY512] = "city512", [RM_DIGEST_EXT] = "ext", [RM_DIGEST_CUMULATIVE] = "cumulative", [RM_DIGEST_PARANOID] = "paranoid", @@ -246,8 +236,6 @@ int rm_digest_type_to_multihash_id(RmDigestType type) { [RM_DIGEST_SPOOKY64] = 18, [RM_DIGEST_CITY] = 15, [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, - [RM_DIGEST_MURMUR256] = 7, [RM_DIGEST_CITY256] = 8, - [RM_DIGEST_MURMUR512] = 10, [RM_DIGEST_CITY512] = 11, [RM_DIGEST_EXT] = 12, [RM_DIGEST_FARMHASH] = 19, [RM_DIGEST_CUMULATIVE] = 13,[RM_DIGEST_PARANOID] = 14}; @@ -349,18 +337,10 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s case RM_DIGEST_BLAKE2BP: BLAKE_INIT(blake2bp, BLAKE2B); return digest; - case RM_DIGEST_MURMUR512: - case RM_DIGEST_CITY512: - digest->bytes = 512 / 8; - break; case RM_DIGEST_EXT: /* gets allocated on rm_digest_update() */ digest->bytes = ext_size; break; - case RM_DIGEST_MURMUR256: - case RM_DIGEST_CITY256: - digest->bytes = 256 / 8; - break; case RM_DIGEST_SPOOKY: case RM_DIGEST_MURMUR: case RM_DIGEST_CITY: @@ -375,28 +355,11 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s digest->paranoid->shadow_hash = rm_digest_new(RM_DIGEST_XXHASH, seed1, seed2, 0, false); } - break; + return digest; default: rm_assert_gentle_not_reached(); } - - /* starting values to let us generate up to 4 different hashes in parallel with - * different starting seeds: - * */ - static const RmOff seeds[4] = {0x0000000000000000, 0xf0f0f0f0f0f0f0f0, - 0x3333333333333333, 0xaaaaaaaaaaaaaaaa}; - - if(digest->bytes > 0 && type != RM_DIGEST_PARANOID) { - const int n_seeds = sizeof(seeds) / sizeof(seeds[0]); 
- - /* checksum type - allocate memory and initialise */ - digest->checksum = g_slice_alloc0(digest->bytes); - for(gsize block = 0; block < (digest->bytes / 16); block++) { - digest->checksum[block].first = seeds[block % n_seeds] ^ seed1; - digest->checksum[block].second = seeds[block % n_seeds] ^ seed2; - } - } - + digest->checksum = g_slice_alloc0(digest->bytes); return digest; } @@ -451,11 +414,7 @@ void rm_digest_free(RmDigest *digest) { break; case RM_DIGEST_EXT: case RM_DIGEST_CUMULATIVE: - case RM_DIGEST_MURMUR512: case RM_DIGEST_XXHASH: - case RM_DIGEST_CITY512: - case RM_DIGEST_MURMUR256: - case RM_DIGEST_CITY256: case RM_DIGEST_SPOOKY: case RM_DIGEST_SPOOKY32: case RM_DIGEST_SPOOKY64: @@ -533,34 +492,26 @@ void rm_digest_update(RmDigest *digest, const unsigned char *data, RmOff size) { case RM_DIGEST_FARMHASH: digest->checksum[0].first = cfarmhash((const char *)data, size); break; - case RM_DIGEST_MURMUR512: - case RM_DIGEST_MURMUR256: case RM_DIGEST_MURMUR: - for(guint8 block = 0; block < (digest->bytes / 16); block++) { #if RM_PLATFORM_32 - MurmurHash3_x86_128(data, size, (uint32_t)digest->checksum[block].first, - &digest->checksum[block]); + MurmurHash3_x86_128(data, size, (uint32_t)digest->checksum->first, + digest->checksum); #elif RM_PLATFORM_64 - MurmurHash3_x64_128(data, size, (uint32_t)digest->checksum[block].first, - &digest->checksum[block]); + MurmurHash3_x64_128(data, size, (uint32_t)digest->checksum->first, + digest->checksum); #else #error "Probably not a good idea to compile rmlint on 16bit." #endif - } - break; - case RM_DIGEST_CITY: - case RM_DIGEST_CITY256: - case RM_DIGEST_CITY512: - for(guint8 block = 0; block < (digest->bytes / 16); block++) { - /* Opt out for the more optimized version. 
- * This needs the crc command of sse4.2 - * (available on Intel Nehalem and up; my amd box doesn't have this though) - */ - uint128 old = {digest->checksum[block].first, digest->checksum[block].second}; - old = CityHash128WithSeed((const char *)data, size, old); - memcpy(&digest->checksum[block], &old, sizeof(uint128)); - } break; + case RM_DIGEST_CITY: { + /* Opt out for the more optimized version. + * This needs the crc command of sse4.2 + * (available on Intel Nehalem and up; my amd box doesn't have this though) + */ + uint128 old = {digest->checksum->first, digest->checksum->second}; + old = CityHash128WithSeed((const char *)data, size, old); + memcpy(digest->checksum, &old, sizeof(uint128)); + } break; case RM_DIGEST_CUMULATIVE: { /* This only XORS the two checksums. */ for(gsize i = 0; i < digest->bytes; ++i) { @@ -701,10 +652,6 @@ RmDigest *rm_digest_copy(RmDigest *digest) { case RM_DIGEST_SPOOKY64: case RM_DIGEST_MURMUR: case RM_DIGEST_CITY: - case RM_DIGEST_CITY256: - case RM_DIGEST_MURMUR256: - case RM_DIGEST_CITY512: - case RM_DIGEST_MURMUR512: case RM_DIGEST_XXHASH: case RM_DIGEST_FARMHASH: case RM_DIGEST_CUMULATIVE: @@ -745,12 +692,8 @@ static gboolean rm_digest_needs_steal(RmDigestType digest_type) { case RM_DIGEST_SPOOKY: case RM_DIGEST_MURMUR: case RM_DIGEST_CITY: - case RM_DIGEST_CITY256: - case RM_DIGEST_CITY512: case RM_DIGEST_XXHASH: case RM_DIGEST_FARMHASH: - case RM_DIGEST_MURMUR256: - case RM_DIGEST_MURMUR512: case RM_DIGEST_CUMULATIVE: case RM_DIGEST_EXT: case RM_DIGEST_PARANOID: diff --git a/lib/checksum.h b/lib/checksum.h index b92339a9..e45a08c7 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -52,10 +52,6 @@ typedef enum RmDigestType { RM_DIGEST_BLAKE2SP /* Parallel version of BLAKE2P */, RM_DIGEST_BLAKE2BP /* Parallel version of BLAKE2S */, RM_DIGEST_BLAKE2XS, - RM_DIGEST_MURMUR256, - RM_DIGEST_CITY256, - RM_DIGEST_MURMUR512, - RM_DIGEST_CITY512, RM_DIGEST_XXHASH, RM_DIGEST_FARMHASH, diff --git a/lib/cmdline.c b/lib/cmdline.c index 
34a9d48e..a3c5ec45 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -753,10 +753,10 @@ static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, /* Handle the paranoia option */ switch(paranoia_counter) { case -2: - cfg->checksum_type = RM_DIGEST_XXHASH; + cfg->checksum_type = RM_DIGEST_MURMUR; break; case -1: - cfg->checksum_type = RM_DIGEST_CITY256; + cfg->checksum_type = RM_DIGEST_CITY; break; case 0: /* leave users choice of -a (default) */ diff --git a/lib/hash-utility.c b/lib/hash-utility.c index a63c0d48..6e2700fc 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -158,12 +158,9 @@ int rm_hasher_main(int argc, const char **argv) { _("Multi-threaded file digest (hash) calculator.\n" "\n Available digest types:" "\n %s\n" - "\n Versions with different bit numbers:" - "\n %s\n" "\n Supported, but not useful:" "\n %s\n"), "spooky, city, xxhash, sha{1,256,512}, md5, murmur", - "spooky{32,64,128}, city{128,256,512}, murmur{512}", "farmhash, cumulative, paranoid, ext"); g_option_group_add_entries(main_group, entries); diff --git a/tests/utils.py b/tests/utils.py index d2771a83..6031c060 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -214,8 +214,7 @@ def run_rmlint_pedantic(*args, **kwargs): cksum_types = [ 'paranoid', 'sha1', 'sha256', 'spooky', 'city', - 'md5', 'city256', 'city512', 'murmur', 'murmur256', 'murmur512', - 'spooky32', 'spooky64', 'xxhash', 'farmhash', + 'md5', 'murmur', 'spooky32', 'spooky64', 'xxhash', 'farmhash', 'sha3-256', 'sha3-384', 'sha3-512', 'blake2s', 'blake2b', 'blake2sp', 'blake2bp', ] From aed7731e6b6c4cc1d2857ed1d75cb44ccfe9ad2a Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 8 Nov 2017 14:39:21 +1000 Subject: [PATCH 076/180] checksum: introducing struct RmDigestSpec --- lib/checksum.c | 281 +++++++++++++++++++++++++++++-------------------- 1 file changed, 168 insertions(+), 113 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 8cf08599..74269652 100644 --- a/lib/checksum.c +++ 
b/lib/checksum.c @@ -54,6 +54,169 @@ #define _RM_CHECKSUM_DEBUG 0 +typedef struct RmDigestSpec { + const int bits; + void (*init)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); +} RmDigestSpec; + +/* + ****** common interface for non-cryptographic hashes ****** + */ + +static void rm_digest_generic_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + /* init for hashes which just require allocation of digest->checksum */ + + RmOff bytes = MAX(8, digest->bytes); + /* Cannot go lower than 8, since we read 8 byte in some places. + * For some checksums this may mean trailing zeros in the unused bytes */ + digest->checksum = g_slice_alloc0(bytes); + + if(seed1 && seed2) { + /* copy seeds to checksum */ + size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes / 2); + memcpy(digest->checksum, &seed1, seed_bytes); + memcpy(digest->checksum + digest->bytes/2, &seed2, seed_bytes); + } else if(seed1) { + size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes); + memcpy(digest->checksum, &seed1, seed_bytes); + } +} + +/* + ****** glib hash algorithm interface ****** + */ + +static const GChecksumType glib_map[] = { + [RM_DIGEST_MD5] = G_CHECKSUM_MD5, + [RM_DIGEST_SHA1] = G_CHECKSUM_SHA1, + [RM_DIGEST_SHA256] = G_CHECKSUM_SHA256, +#if HAVE_SHA512 + [RM_DIGEST_SHA512] = G_CHECKSUM_SHA512, +#endif +}; + +static void rm_digest_glib_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + digest->glib_checksum = g_checksum_new(glib_map[digest->type]); + if(seed1) { + g_checksum_update(digest->glib_checksum, (const guchar *)&seed1, sizeof(seed1)); + } + if(seed2) { + g_checksum_update(digest->glib_checksum, (const guchar *)&seed2, sizeof(seed2)); + } +} + +/* + ****** sha3 hash algorithm interface ****** + */ + +static void rm_digest_sha3_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + 
digest->sha3_ctx = g_slice_alloc0(sizeof(sha3_context)); + switch(digest->type) { + case RM_DIGEST_SHA3_256: + sha3_Init256(digest->sha3_ctx); + break; + case RM_DIGEST_SHA3_384: + sha3_Init384(digest->sha3_ctx); + break; + case RM_DIGEST_SHA3_512: + sha3_Init512(digest->sha3_ctx); + break; + default: + g_assert_not_reached(); + } + if(seed1) { + sha3_Update(digest->sha3_ctx, &seed1, sizeof(seed1)); + } + if(seed2) { + sha3_Update(digest->sha3_ctx, &seed2, sizeof(seed2)); + } +} + +/* + ****** blake hash algorithm interface ****** + */ + +#define BLAKE_INIT(ALGO, ALGO_BIG) \ + digest->ALGO##_state = g_slice_alloc0(sizeof(ALGO##_state)); \ + ALGO##_init(digest->ALGO##_state, ALGO_BIG##_OUTBYTES); \ + if(seed1) { \ + ALGO##_update(digest->ALGO##_state, &seed1, sizeof(RmOff)); \ + } \ + if(seed2) { \ + ALGO##_update(digest->ALGO##_state, &seed2, sizeof(RmOff)); \ + } \ + g_assert(digest->bytes==ALGO_BIG##_OUTBYTES); + + +static void rm_digest_blake2b_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + BLAKE_INIT(blake2b, BLAKE2B); +} + +static void rm_digest_blake2bp_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + BLAKE_INIT(blake2bp, BLAKE2B); +} + +static void rm_digest_blake2s_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + BLAKE_INIT(blake2s, BLAKE2S); +} + +static void rm_digest_blake2sp_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + BLAKE_INIT(blake2sp, BLAKE2S); +} + +/* + ****** ext hash algorithm interface ****** + */ + +static void rm_digest_ext_init(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash) { + digest->bytes = ext_size; + rm_digest_generic_init(digest, seed1, seed2, ext_size, use_shadow_hash); +} + +/* + ****** paranoid hash algorithm interface ****** + */ + +static void 
rm_digest_paranoid_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, bool use_shadow_hash) { + digest->paranoid = g_slice_new0(RmParanoid); + digest->paranoid->incoming_twin_candidates = g_async_queue_new(); + if(use_shadow_hash) { + digest->paranoid->shadow_hash = rm_digest_new(RM_DIGEST_XXHASH, seed1, seed2, 0, false); + } +} + +/* + ****** hash interface specification map ****** + */ + +static const RmDigestSpec digest_specs[] = { + [RM_DIGEST_UNKNOWN] = { 0, NULL }, + [RM_DIGEST_MURMUR] = { 128, rm_digest_generic_init }, + [RM_DIGEST_SPOOKY] = { 128, rm_digest_generic_init }, + [RM_DIGEST_SPOOKY32] = { 32, rm_digest_generic_init }, + [RM_DIGEST_SPOOKY64] = { 64, rm_digest_generic_init }, + [RM_DIGEST_CITY] = { 128, rm_digest_generic_init }, + [RM_DIGEST_MD5] = { 128, rm_digest_glib_init }, + [RM_DIGEST_SHA1] = { 160, rm_digest_glib_init }, + [RM_DIGEST_SHA256] = { 256, rm_digest_glib_init }, +#if HAVE_SHA512 + [RM_DIGEST_SHA512] = { 512, rm_digest_glib_init }, +#endif + [RM_DIGEST_SHA3_256] = { 256, rm_digest_sha3_init }, + [RM_DIGEST_SHA3_384] = { 384, rm_digest_sha3_init }, + [RM_DIGEST_SHA3_512] = { 512, rm_digest_sha3_init }, + [RM_DIGEST_BLAKE2S] = { 256, rm_digest_blake2s_init }, + [RM_DIGEST_BLAKE2B] = { 512, rm_digest_blake2b_init }, + [RM_DIGEST_BLAKE2SP] = { 256, rm_digest_blake2sp_init}, + [RM_DIGEST_BLAKE2BP] = { 512, rm_digest_blake2bp_init}, + [RM_DIGEST_EXT] = { 0, rm_digest_ext_init }, + [RM_DIGEST_CUMULATIVE] = { 128, rm_digest_generic_init }, + [RM_DIGEST_PARANOID] = { 0, rm_digest_paranoid_init}, + [RM_DIGEST_FARMHASH] = { 64, rm_digest_generic_init }, + [RM_DIGEST_XXHASH] = { 64, rm_digest_generic_init }, +}; + + /////////////////////////////////////// // BUFFER POOL IMPLEMENTATION // /////////////////////////////////////// @@ -242,124 +405,16 @@ int rm_digest_type_to_multihash_id(RmDigestType type) { return ids[MIN(type, sizeof(ids) / sizeof(ids[0]))]; } -#define ADD_SEED(digest, seed) \ - { \ - if(seed) { \ - 
g_checksum_update(digest->glib_checksum, (const guchar *)&seed, \ - sizeof(RmOff)); \ - } \ - } - -#define BLAKE_INIT(ALGO, ALGO_BIG) \ - { \ - digest->ALGO##_state = g_slice_alloc0(sizeof(ALGO##_state)); \ - ALGO##_init(digest->ALGO##_state, ALGO_BIG##_OUTBYTES); \ - if(seed1) { \ - ALGO##_update(digest->ALGO##_state, &seed1, sizeof(RmOff)); \ - } \ - if(seed2) { \ - ALGO##_update(digest->ALGO##_state, &seed2, sizeof(RmOff)); \ - } \ - digest->bytes = ALGO_BIG##_OUTBYTES; \ - } - -#define SHA3_INIT(SIZE) \ - digest->sha3_ctx = g_slice_alloc0(sizeof(sha3_context)); \ - sha3_Init##SIZE(digest->sha3_ctx); \ - if(seed1) { \ - sha3_Update(digest->sha3_ctx, &seed1, sizeof(RmOff)); \ - } \ - if(seed2) { \ - sha3_Update(digest->sha3_ctx, &seed2, sizeof(RmOff)); \ - } \ - digest->bytes = (SIZE) / 8; - RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash) { - RmDigest *digest = g_slice_new0(RmDigest); + g_assert(type != RM_DIGEST_UNKNOWN); - digest->checksum = NULL; + RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; - digest->bytes = 0; + digest->bytes = digest_specs[type].bits / 8; + + digest_specs[type].init(digest, seed1, seed2, ext_size, use_shadow_hash); - switch(type) { - case RM_DIGEST_SPOOKY32: - /* cannot go lower than 64, since we read 8 byte in some places. 
- * simulate by leaving the part at the end empty - */ - digest->bytes = 64 / 8; - break; - case RM_DIGEST_XXHASH: - case RM_DIGEST_FARMHASH: - case RM_DIGEST_SPOOKY64: - digest->bytes = 64 / 8; - break; - case RM_DIGEST_MD5: - digest->glib_checksum = g_checksum_new(G_CHECKSUM_MD5); - ADD_SEED(digest, seed1); - digest->bytes = 128 / 8; - return digest; -#if HAVE_SHA512 - case RM_DIGEST_SHA512: - digest->glib_checksum = g_checksum_new(G_CHECKSUM_SHA512); - ADD_SEED(digest, seed1); - digest->bytes = 512 / 8; - return digest; -#endif - case RM_DIGEST_SHA256: - digest->glib_checksum = g_checksum_new(G_CHECKSUM_SHA256); - ADD_SEED(digest, seed1); - digest->bytes = 256 / 8; - return digest; - case RM_DIGEST_SHA1: - digest->glib_checksum = g_checksum_new(G_CHECKSUM_SHA1); - ADD_SEED(digest, seed1); - digest->bytes = 160 / 8; - return digest; - case RM_DIGEST_SHA3_256: - SHA3_INIT(256); - return digest; - case RM_DIGEST_SHA3_384: - SHA3_INIT(384); - return digest; - case RM_DIGEST_SHA3_512: - SHA3_INIT(512); - return digest; - case RM_DIGEST_BLAKE2S: - BLAKE_INIT(blake2s, BLAKE2S); - return digest; - case RM_DIGEST_BLAKE2B: - BLAKE_INIT(blake2b, BLAKE2B); - return digest; - case RM_DIGEST_BLAKE2SP: - BLAKE_INIT(blake2sp, BLAKE2S); - return digest; - case RM_DIGEST_BLAKE2BP: - BLAKE_INIT(blake2bp, BLAKE2B); - return digest; - case RM_DIGEST_EXT: - /* gets allocated on rm_digest_update() */ - digest->bytes = ext_size; - break; - case RM_DIGEST_SPOOKY: - case RM_DIGEST_MURMUR: - case RM_DIGEST_CITY: - case RM_DIGEST_CUMULATIVE: - digest->bytes = 128 / 8; - break; - case RM_DIGEST_PARANOID: - digest->bytes = 0; - digest->paranoid = g_slice_new0(RmParanoid); - digest->paranoid->incoming_twin_candidates = g_async_queue_new(); - if(use_shadow_hash) { - digest->paranoid->shadow_hash = - rm_digest_new(RM_DIGEST_XXHASH, seed1, seed2, 0, false); - } - return digest; - default: - rm_assert_gentle_not_reached(); - } - digest->checksum = g_slice_alloc0(digest->bytes); return digest; } 
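Reviewer note on the patch above: the change replaces the per-type `switch` in `rm_digest_new()` with a lookup into a table of per-algorithm specs. The pattern can be sketched in isolation as below. This is a minimal, standalone illustration and not rmlint's code: `DemoDigest`, `DemoSpec`, `demo_specs` and the `demo_*` functions are hypothetical names, plain `calloc`/`free` stand in for the GLib slice allocator, and a `free` hook is included only so the sketch cleans up after itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this is not rmlint's actual API. */
typedef enum DemoType {
    DEMO_UNKNOWN = 0,
    DEMO_SUM64,
    DEMO_XOR128,
    DEMO_N_TYPES
} DemoType;

typedef struct DemoDigest {
    DemoType type;
    size_t bytes;      /* digest length in bytes, derived from the spec */
    uint8_t *checksum; /* raw digest state */
} DemoDigest;

/* One entry per algorithm: size plus the hooks that used to be switch cases. */
typedef struct DemoSpec {
    int bits;
    void (*init)(DemoDigest *digest, uint64_t seed);
    void (*free)(DemoDigest *digest);
} DemoSpec;

static void demo_generic_init(DemoDigest *digest, uint64_t seed) {
    /* Zero-filled buffer, optionally pre-loaded with the seed. */
    digest->checksum = calloc(1, digest->bytes);
    if(digest->checksum && seed) {
        size_t n = digest->bytes < sizeof(seed) ? digest->bytes : sizeof(seed);
        memcpy(digest->checksum, &seed, n);
    }
}

static void demo_generic_free(DemoDigest *digest) {
    free(digest->checksum);
}

/* Designated initializers keep each entry next to its enum value. */
static const DemoSpec demo_specs[DEMO_N_TYPES] = {
    [DEMO_UNKNOWN] = {0, NULL, NULL},
    [DEMO_SUM64] = {64, demo_generic_init, demo_generic_free},
    [DEMO_XOR128] = {128, demo_generic_init, demo_generic_free},
};

DemoDigest *demo_digest_new(DemoType type, uint64_t seed) {
    if(type <= DEMO_UNKNOWN || type >= DEMO_N_TYPES) {
        return NULL; /* no spec to dispatch to */
    }
    assert(demo_specs[type].init != NULL);

    DemoDigest *digest = calloc(1, sizeof(DemoDigest));
    if(!digest) {
        return NULL;
    }
    digest->type = type;
    digest->bytes = (size_t)demo_specs[type].bits / 8;

    /* Single dispatch point instead of one switch case per algorithm. */
    demo_specs[type].init(digest, seed);
    if(!digest->checksum) {
        free(digest);
        return NULL;
    }
    return digest;
}

void demo_digest_free(DemoDigest *digest) {
    if(!digest) {
        return;
    }
    demo_specs[digest->type].free(digest);
    free(digest);
}
```

With this layout, adding an algorithm means adding one table row and its hooks; `demo_digest_new()` and `demo_digest_free()` stay untouched. Unlike the sketch, the patch asserts on `RM_DIGEST_UNKNOWN` rather than returning `NULL`.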
From 0c2c21f52a4dc0e5c3fa9912974e5710e5dcb710 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 8 Nov 2017 17:27:32 +1000 Subject: [PATCH 077/180] checksum: extend RmDigestSpec to include free() functions --- lib/checksum.c | 142 +++++++++++++++++++++++-------------------------- 1 file changed, 67 insertions(+), 75 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 74269652..0cecd252 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -57,6 +57,7 @@ typedef struct RmDigestSpec { const int bits; void (*init)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); + void (*free)(RmDigest *digest); } RmDigestSpec; /* @@ -82,6 +83,12 @@ static void rm_digest_generic_init(RmDigest *digest, RmOff seed1, RmOff seed2, _ } } +static void rm_digest_generic_free(RmDigest *digest) { + if(digest->checksum) { + g_slice_free1(digest->bytes, digest->checksum); + } +} + /* ****** glib hash algorithm interface ****** */ @@ -105,6 +112,10 @@ static void rm_digest_glib_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNU } } +static void rm_digest_glib_free(RmDigest *digest) { + g_checksum_free(digest->glib_checksum); +} + /* ****** sha3 hash algorithm interface ****** */ @@ -132,6 +143,10 @@ static void rm_digest_sha3_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNU } } +static void rm_digest_sha3_free(RmDigest *digest) { + g_slice_free(sha3_context, digest->sha3_ctx); +} + /* ****** blake hash algorithm interface ****** */ @@ -164,6 +179,22 @@ static void rm_digest_blake2sp_init(RmDigest *digest, RmOff seed1, RmOff seed2, BLAKE_INIT(blake2sp, BLAKE2S); } +static void rm_digest_blake2b_free(RmDigest *digest) { + g_slice_free(blake2b_state, digest->blake2b_state); +} + +static void rm_digest_blake2bp_free(RmDigest *digest) { + g_slice_free(blake2bp_state, digest->blake2bp_state); +} + +static void rm_digest_blake2s_free(RmDigest *digest) { + g_slice_free(blake2s_state, digest->blake2s_state); +} + +static void 
rm_digest_blake2sp_free(RmDigest *digest) { + g_slice_free(blake2sp_state, digest->blake2sp_state); +} + /* ****** ext hash algorithm interface ****** */ @@ -185,35 +216,48 @@ static void rm_digest_paranoid_init(RmDigest *digest, RmOff seed1, RmOff seed2, } } +static void rm_digest_paranoid_free(RmDigest *digest) { + if(digest->paranoid->shadow_hash) { + rm_digest_free(digest->paranoid->shadow_hash); + } + rm_digest_release_buffers(digest); + if(digest->paranoid->incoming_twin_candidates) { + g_async_queue_unref(digest->paranoid->incoming_twin_candidates); + } + g_slist_free(digest->paranoid->rejects); + g_slice_free(RmParanoid, digest->paranoid); +} + + /* ****** hash interface specification map ****** */ static const RmDigestSpec digest_specs[] = { - [RM_DIGEST_UNKNOWN] = { 0, NULL }, - [RM_DIGEST_MURMUR] = { 128, rm_digest_generic_init }, - [RM_DIGEST_SPOOKY] = { 128, rm_digest_generic_init }, - [RM_DIGEST_SPOOKY32] = { 32, rm_digest_generic_init }, - [RM_DIGEST_SPOOKY64] = { 64, rm_digest_generic_init }, - [RM_DIGEST_CITY] = { 128, rm_digest_generic_init }, - [RM_DIGEST_MD5] = { 128, rm_digest_glib_init }, - [RM_DIGEST_SHA1] = { 160, rm_digest_glib_init }, - [RM_DIGEST_SHA256] = { 256, rm_digest_glib_init }, + [RM_DIGEST_UNKNOWN] = { 0, NULL, NULL}, + [RM_DIGEST_MURMUR] = { 128, rm_digest_generic_init, rm_digest_generic_free}, + [RM_DIGEST_SPOOKY] = { 128, rm_digest_generic_init, rm_digest_generic_free}, + [RM_DIGEST_SPOOKY32] = { 32, rm_digest_generic_init, rm_digest_generic_free}, + [RM_DIGEST_SPOOKY64] = { 64, rm_digest_generic_init, rm_digest_generic_free}, + [RM_DIGEST_CITY] = { 128, rm_digest_generic_init, rm_digest_generic_free}, + [RM_DIGEST_MD5] = { 128, rm_digest_glib_init, rm_digest_glib_free}, + [RM_DIGEST_SHA1] = { 160, rm_digest_glib_init, rm_digest_glib_free}, + [RM_DIGEST_SHA256] = { 256, rm_digest_glib_init, rm_digest_glib_free}, #if HAVE_SHA512 - [RM_DIGEST_SHA512] = { 512, rm_digest_glib_init }, + [RM_DIGEST_SHA512] = { 512, 
rm_digest_glib_init, rm_digest_glib_free}, #endif - [RM_DIGEST_SHA3_256] = { 256, rm_digest_sha3_init }, - [RM_DIGEST_SHA3_384] = { 384, rm_digest_sha3_init }, - [RM_DIGEST_SHA3_512] = { 512, rm_digest_sha3_init }, - [RM_DIGEST_BLAKE2S] = { 256, rm_digest_blake2s_init }, - [RM_DIGEST_BLAKE2B] = { 512, rm_digest_blake2b_init }, - [RM_DIGEST_BLAKE2SP] = { 256, rm_digest_blake2sp_init}, - [RM_DIGEST_BLAKE2BP] = { 512, rm_digest_blake2bp_init}, - [RM_DIGEST_EXT] = { 0, rm_digest_ext_init }, - [RM_DIGEST_CUMULATIVE] = { 128, rm_digest_generic_init }, - [RM_DIGEST_PARANOID] = { 0, rm_digest_paranoid_init}, - [RM_DIGEST_FARMHASH] = { 64, rm_digest_generic_init }, - [RM_DIGEST_XXHASH] = { 64, rm_digest_generic_init }, + [RM_DIGEST_SHA3_256] = { 256, rm_digest_sha3_init, rm_digest_sha3_free}, + [RM_DIGEST_SHA3_384] = { 384, rm_digest_sha3_init, rm_digest_sha3_free}, + [RM_DIGEST_SHA3_512] = { 512, rm_digest_sha3_init, rm_digest_sha3_free}, + [RM_DIGEST_BLAKE2S] = { 256, rm_digest_blake2s_init, rm_digest_blake2s_free}, + [RM_DIGEST_BLAKE2B] = { 512, rm_digest_blake2b_init, rm_digest_blake2b_free}, + [RM_DIGEST_BLAKE2SP] = { 256, rm_digest_blake2sp_init, rm_digest_blake2sp_free}, + [RM_DIGEST_BLAKE2BP] = { 512, rm_digest_blake2bp_init, rm_digest_blake2bp_free}, + [RM_DIGEST_EXT] = { 0, rm_digest_ext_init, rm_digest_generic_free}, + [RM_DIGEST_CUMULATIVE] = { 128, rm_digest_generic_init, rm_digest_generic_free}, + [RM_DIGEST_PARANOID] = { 0, rm_digest_paranoid_init, rm_digest_paranoid_free}, + [RM_DIGEST_FARMHASH] = { 64, rm_digest_generic_init, rm_digest_generic_free}, + [RM_DIGEST_XXHASH] = { 64, rm_digest_generic_init, rm_digest_generic_free}, }; @@ -431,59 +475,7 @@ void rm_digest_release_buffers(RmDigest *digest) { } void rm_digest_free(RmDigest *digest) { - switch(digest->type) { - case RM_DIGEST_MD5: - case RM_DIGEST_SHA512: - case RM_DIGEST_SHA256: - case RM_DIGEST_SHA1: - g_checksum_free(digest->glib_checksum); - digest->glib_checksum = NULL; - break; - case 
RM_DIGEST_PARANOID: - if(digest->paranoid->shadow_hash) { - rm_digest_free(digest->paranoid->shadow_hash); - } - rm_digest_release_buffers(digest); - if(digest->paranoid->incoming_twin_candidates) { - g_async_queue_unref(digest->paranoid->incoming_twin_candidates); - } - g_slist_free(digest->paranoid->rejects); - g_slice_free(RmParanoid, digest->paranoid); - break; - case RM_DIGEST_SHA3_256: - case RM_DIGEST_SHA3_384: - case RM_DIGEST_SHA3_512: - g_slice_free(sha3_context, digest->sha3_ctx); - break; - case RM_DIGEST_BLAKE2S: - g_slice_free(blake2s_state, digest->blake2s_state); - break; - case RM_DIGEST_BLAKE2B: - g_slice_free(blake2b_state, digest->blake2b_state); - break; - case RM_DIGEST_BLAKE2SP: - g_slice_free(blake2sp_state, digest->blake2sp_state); - break; - case RM_DIGEST_BLAKE2BP: - g_slice_free(blake2bp_state, digest->blake2bp_state); - break; - case RM_DIGEST_EXT: - case RM_DIGEST_CUMULATIVE: - case RM_DIGEST_XXHASH: - case RM_DIGEST_SPOOKY: - case RM_DIGEST_SPOOKY32: - case RM_DIGEST_SPOOKY64: - case RM_DIGEST_FARMHASH: - case RM_DIGEST_MURMUR: - case RM_DIGEST_CITY: - if(digest->checksum) { - g_slice_free1(digest->bytes, digest->checksum); - digest->checksum = NULL; - } - break; - default: - rm_assert_gentle_not_reached(); - } + digest_specs[digest->type].free(digest); g_slice_free(RmDigest, digest); } From dc5026941a30ac446050b5f2393ad737d887a273 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 8 Nov 2017 21:42:33 +1000 Subject: [PATCH 078/180] checksum: extend RmDigestSpec to include update() functions --- lib/checksum.c | 233 +++++++++++++++++++++++++------------------------ 1 file changed, 121 insertions(+), 112 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 0cecd252..1e3bd87a 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -54,10 +54,13 @@ #define _RM_CHECKSUM_DEBUG 0 +typedef void (*RmDigestUpdateFunc)(gpointer state, const unsigned char *data, RmOff size); + typedef struct RmDigestSpec { const int bits; void 
(*init)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); void (*free)(RmDigest *digest); + RmDigestUpdateFunc update; } RmDigestSpec; /* @@ -89,6 +92,80 @@ static void rm_digest_generic_free(RmDigest *digest) { } } +/* + ****** spooky hashes ****** + */ + +static void rm_digest_spooky32_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { + checksum->first = spooky_hash32(data, size, checksum->first); +} + +static void rm_digest_spooky64_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { + checksum->first = spooky_hash64(data, size, checksum->first); +} + +static void rm_digest_spooky_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { + spooky_hash128(data, size, (uint64_t *)&checksum->first, (uint64_t *)&checksum->second); +} + +/* + ****** xxhash ****** + */ + +static void rm_digest_xxhash_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { + checksum->first = XXH64(data, size, checksum->first); +} + +/* + ****** farmhash ****** + */ + +static void rm_digest_farmhash_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { + /* TODO: this won't work, it's not cumulative */ + checksum->first = cfarmhash((const char *)data, size); +} + +/* + ****** murmur ****** + */ + +static void rm_digest_murmur_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { +#if RM_PLATFORM_32 + MurmurHash3_x86_128(data, size, (uint32_t)checksum->first, checksum); +#elif RM_PLATFORM_64 + MurmurHash3_x64_128(data, size, (uint32_t)checksum->first, checksum); +#else +#error "Probably not a good idea to compile rmlint on 16bit." 
+#endif +} + +/* + ****** cityhash ****** + */ + +static void rm_digest_city_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { + /* There is a more optimized version but it needs the crc command of sse4.2 + * (available on Intel Nehalem and up; my amd box doesn't have this though) + */ + uint128 old = {checksum->first, checksum->second}; + old = CityHash128WithSeed((const char *)data, size, old); + memcpy(checksum, &old, sizeof(uint128)); +} + + +/* + ****** cumulative ****** + */ + +static void rm_digest_cumulative_update(guint8 *checksum, const unsigned char *data, RmOff size) { + /* This only XORS the two checksums. */ + size = MIN(size, 16); + for(gsize i = 0; i < size; ++i) { + checksum[i] ^= ((guint8 *)data)[i % size]; + } +} + + /* ****** glib hash algorithm interface ****** */ @@ -204,6 +281,24 @@ static void rm_digest_ext_init(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff rm_digest_generic_init(digest, seed1, seed2, ext_size, use_shadow_hash); } +static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, RmOff size) { + /* Data is assumed to be a hex representation of a checksum. + * Needs to be compressed in pure memory first. + * + * Checksum is not updated but rather overwritten. + * */ +#define CHAR_TO_NUM(c) (unsigned char)(g_ascii_isdigit(c) ? 
c - '0' : (c - 'a') + 10) + + + digest->bytes = size / 2; + digest->checksum = g_slice_alloc0(digest->bytes); + + for(unsigned i = 0; i < digest->bytes; ++i) { + ((guint8 *)digest->checksum)[i] = + (CHAR_TO_NUM(data[2 * i]) << 4) + CHAR_TO_NUM(data[2 * i + 1]); + } +} + /* ****** paranoid hash algorithm interface ****** */ @@ -228,36 +323,37 @@ static void rm_digest_paranoid_free(RmDigest *digest) { g_slice_free(RmParanoid, digest->paranoid); } - /* ****** hash interface specification map ****** */ +#define UPDATE(ALGO) (RmDigestUpdateFunc)rm_digest_##ALGO##_update + static const RmDigestSpec digest_specs[] = { - [RM_DIGEST_UNKNOWN] = { 0, NULL, NULL}, - [RM_DIGEST_MURMUR] = { 128, rm_digest_generic_init, rm_digest_generic_free}, - [RM_DIGEST_SPOOKY] = { 128, rm_digest_generic_init, rm_digest_generic_free}, - [RM_DIGEST_SPOOKY32] = { 32, rm_digest_generic_init, rm_digest_generic_free}, - [RM_DIGEST_SPOOKY64] = { 64, rm_digest_generic_init, rm_digest_generic_free}, - [RM_DIGEST_CITY] = { 128, rm_digest_generic_init, rm_digest_generic_free}, - [RM_DIGEST_MD5] = { 128, rm_digest_glib_init, rm_digest_glib_free}, - [RM_DIGEST_SHA1] = { 160, rm_digest_glib_init, rm_digest_glib_free}, - [RM_DIGEST_SHA256] = { 256, rm_digest_glib_init, rm_digest_glib_free}, + [RM_DIGEST_UNKNOWN] = { 0, NULL, NULL, NULL}, + [RM_DIGEST_MURMUR] = { 128, rm_digest_generic_init, rm_digest_generic_free, UPDATE(murmur)}, + [RM_DIGEST_SPOOKY] = { 128, rm_digest_generic_init, rm_digest_generic_free, UPDATE(spooky)}, + [RM_DIGEST_SPOOKY32] = { 32, rm_digest_generic_init, rm_digest_generic_free, UPDATE(spooky32)}, + [RM_DIGEST_SPOOKY64] = { 64, rm_digest_generic_init, rm_digest_generic_free, UPDATE(spooky64)}, + [RM_DIGEST_CITY] = { 128, rm_digest_generic_init, rm_digest_generic_free, UPDATE(city)}, + [RM_DIGEST_MD5] = { 128, rm_digest_glib_init, rm_digest_glib_free, (RmDigestUpdateFunc)g_checksum_update}, + [RM_DIGEST_SHA1] = { 160, rm_digest_glib_init, rm_digest_glib_free, 
(RmDigestUpdateFunc)g_checksum_update}, + [RM_DIGEST_SHA256] = { 256, rm_digest_glib_init, rm_digest_glib_free, (RmDigestUpdateFunc)g_checksum_update}, #if HAVE_SHA512 - [RM_DIGEST_SHA512] = { 512, rm_digest_glib_init, rm_digest_glib_free}, + [RM_DIGEST_SHA512] = { 512, rm_digest_glib_init, rm_digest_glib_free, (RmDigestUpdateFunc)g_checksum_update}, #endif - [RM_DIGEST_SHA3_256] = { 256, rm_digest_sha3_init, rm_digest_sha3_free}, - [RM_DIGEST_SHA3_384] = { 384, rm_digest_sha3_init, rm_digest_sha3_free}, - [RM_DIGEST_SHA3_512] = { 512, rm_digest_sha3_init, rm_digest_sha3_free}, - [RM_DIGEST_BLAKE2S] = { 256, rm_digest_blake2s_init, rm_digest_blake2s_free}, - [RM_DIGEST_BLAKE2B] = { 512, rm_digest_blake2b_init, rm_digest_blake2b_free}, - [RM_DIGEST_BLAKE2SP] = { 256, rm_digest_blake2sp_init, rm_digest_blake2sp_free}, - [RM_DIGEST_BLAKE2BP] = { 512, rm_digest_blake2bp_init, rm_digest_blake2bp_free}, - [RM_DIGEST_EXT] = { 0, rm_digest_ext_init, rm_digest_generic_free}, - [RM_DIGEST_CUMULATIVE] = { 128, rm_digest_generic_init, rm_digest_generic_free}, - [RM_DIGEST_PARANOID] = { 0, rm_digest_paranoid_init, rm_digest_paranoid_free}, - [RM_DIGEST_FARMHASH] = { 64, rm_digest_generic_init, rm_digest_generic_free}, - [RM_DIGEST_XXHASH] = { 64, rm_digest_generic_init, rm_digest_generic_free}, + [RM_DIGEST_SHA3_256] = { 256, rm_digest_sha3_init, rm_digest_sha3_free, (RmDigestUpdateFunc)sha3_Update}, + [RM_DIGEST_SHA3_384] = { 384, rm_digest_sha3_init, rm_digest_sha3_free, (RmDigestUpdateFunc)sha3_Update}, + [RM_DIGEST_SHA3_512] = { 512, rm_digest_sha3_init, rm_digest_sha3_free, (RmDigestUpdateFunc)sha3_Update}, + [RM_DIGEST_BLAKE2S] = { 256, rm_digest_blake2s_init, rm_digest_blake2s_free, (RmDigestUpdateFunc)blake2s_update}, + [RM_DIGEST_BLAKE2B] = { 512, rm_digest_blake2b_init, rm_digest_blake2b_free, (RmDigestUpdateFunc)blake2b_update}, + [RM_DIGEST_BLAKE2SP] = { 256, rm_digest_blake2sp_init, rm_digest_blake2sp_free, (RmDigestUpdateFunc)blake2sp_update}, + 
[RM_DIGEST_BLAKE2BP] = { 512, rm_digest_blake2bp_init, rm_digest_blake2bp_free, (RmDigestUpdateFunc)blake2bp_update}, + [RM_DIGEST_EXT] = { 0, rm_digest_ext_init, rm_digest_generic_free, UPDATE(ext)}, + [RM_DIGEST_CUMULATIVE] = { 128, rm_digest_generic_init, rm_digest_generic_free, UPDATE(cumulative)}, + [RM_DIGEST_PARANOID] = { 0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL}, + [RM_DIGEST_FARMHASH] = { 64, rm_digest_generic_init, rm_digest_generic_free, UPDATE(farmhash)}, + [RM_DIGEST_XXHASH] = { 64, rm_digest_generic_init, rm_digest_generic_free, UPDATE(xxhash)}, }; @@ -480,95 +576,8 @@ void rm_digest_free(RmDigest *digest) { } void rm_digest_update(RmDigest *digest, const unsigned char *data, RmOff size) { - switch(digest->type) { - case RM_DIGEST_EXT: -/* Data is assumed to be a hex representation of a checksum. - * Needs to be compressed in pure memory first. - * - * Checksum is not updated but rather overwritten. - * */ -#define CHAR_TO_NUM(c) (unsigned char)(g_ascii_isdigit(c) ? 
c - '0' : (c - 'a') + 10) - - rm_assert_gentle(data); - - digest->bytes = size / 2; - digest->checksum = g_slice_alloc0(digest->bytes); - - for(unsigned i = 0; i < digest->bytes; ++i) { - ((guint8 *)digest->checksum)[i] = - (CHAR_TO_NUM(data[2 * i]) << 4) + CHAR_TO_NUM(data[2 * i + 1]); - } - - break; - case RM_DIGEST_MD5: - case RM_DIGEST_SHA512: - case RM_DIGEST_SHA256: - case RM_DIGEST_SHA1: - g_checksum_update(digest->glib_checksum, (const guchar *)data, size); - break; - case RM_DIGEST_SHA3_256: - case RM_DIGEST_SHA3_384: - case RM_DIGEST_SHA3_512: - sha3_Update(digest->sha3_ctx, data, size); - break; - case RM_DIGEST_BLAKE2S: - blake2s_update(digest->blake2s_state, data, size); - break; - case RM_DIGEST_BLAKE2B: - blake2b_update(digest->blake2b_state, data, size); - break; - case RM_DIGEST_BLAKE2SP: - blake2sp_update(digest->blake2sp_state, data, size); - break; - case RM_DIGEST_BLAKE2BP: - blake2bp_update(digest->blake2bp_state, data, size); - break; - case RM_DIGEST_SPOOKY32: - digest->checksum[0].first = spooky_hash32(data, size, digest->checksum[0].first); - break; - case RM_DIGEST_SPOOKY64: - digest->checksum[0].first = spooky_hash64(data, size, digest->checksum[0].first); - break; - case RM_DIGEST_SPOOKY: - spooky_hash128(data, size, (uint64_t *)&digest->checksum[0].first, - (uint64_t *)&digest->checksum[0].second); - break; - case RM_DIGEST_XXHASH: - digest->checksum[0].first = XXH64(data, size, digest->checksum[0].first); - break; - case RM_DIGEST_FARMHASH: - digest->checksum[0].first = cfarmhash((const char *)data, size); - break; - case RM_DIGEST_MURMUR: -#if RM_PLATFORM_32 - MurmurHash3_x86_128(data, size, (uint32_t)digest->checksum->first, - digest->checksum); -#elif RM_PLATFORM_64 - MurmurHash3_x64_128(data, size, (uint32_t)digest->checksum->first, - digest->checksum); -#else -#error "Probably not a good idea to compile rmlint on 16bit." -#endif - break; - case RM_DIGEST_CITY: { - /* Opt out for the more optimized version. 
- * This needs the crc command of sse4.2 - * (available on Intel Nehalem and up; my amd box doesn't have this though) - */ - uint128 old = {digest->checksum->first, digest->checksum->second}; - old = CityHash128WithSeed((const char *)data, size, old); - memcpy(digest->checksum, &old, sizeof(uint128)); - } break; - case RM_DIGEST_CUMULATIVE: { - /* This only XORS the two checksums. */ - for(gsize i = 0; i < digest->bytes; ++i) { - ((guint8 *)digest->checksum)[i] ^= ((guint8 *)data)[i % size]; - } - } break; - case RM_DIGEST_PARANOID: - default: - rm_assert_gentle_not_reached(); - } + RmDigestSpec spec = digest_specs[digest->type]; + spec.update(digest->checksum, data, size); } void rm_digest_buffered_update(RmBuffer *buffer) { From 45fbf815babff4f5ebd577f0ce7a3d285a9b9c82 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 8 Nov 2017 21:43:06 +1000 Subject: [PATCH 079/180] checksum: improve readability --- lib/checksum.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 1e3bd87a..5df6b267 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -549,11 +549,12 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s bool use_shadow_hash) { g_assert(type != RM_DIGEST_UNKNOWN); + RmDigestSpec spec = digest_specs[type]; RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; - digest->bytes = digest_specs[type].bits / 8; - digest_specs[type].init(digest, seed1, seed2, ext_size, use_shadow_hash); + digest->bytes = spec.bits / 8; + spec.init(digest, seed1, seed2, ext_size, use_shadow_hash); return digest; } @@ -571,7 +572,8 @@ void rm_digest_release_buffers(RmDigest *digest) { } void rm_digest_free(RmDigest *digest) { - digest_specs[digest->type].free(digest); + RmDigestSpec spec = digest_specs[digest->type]; + spec.free(digest); g_slice_free(RmDigest, digest); } From e7490a30ee1f48bd94e213b0fa7770a6270e8141 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 8 Nov 2017 22:42:16 
+1000 Subject: [PATCH 080/180] checksum: be a bit less cute with RmDigestUpdateFunc; prototype other RmDigestSpec functions --- lib/checksum.c | 221 ++++++++++++++++++++++++++++++++----------------- lib/checksum.h | 1 + 2 files changed, 145 insertions(+), 77 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 5df6b267..e8bf3f72 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -54,12 +54,14 @@ #define _RM_CHECKSUM_DEBUG 0 -typedef void (*RmDigestUpdateFunc)(gpointer state, const unsigned char *data, RmOff size); +typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); +typedef void (*RmDigestFreeFunc)(RmDigest *digest); +typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, RmOff size); typedef struct RmDigestSpec { const int bits; - void (*init)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); - void (*free)(RmDigest *digest); + RmDigestInitFunc init; + RmDigestFreeFunc free; RmDigestUpdateFunc update; } RmDigestSpec; @@ -96,79 +98,98 @@ static void rm_digest_generic_free(RmDigest *digest) { ****** spooky hashes ****** */ -static void rm_digest_spooky32_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { - checksum->first = spooky_hash32(data, size, checksum->first); +static void rm_digest_spooky32_update(RmDigest *digest, const unsigned char *data, RmOff size) { + digest->checksum->first = spooky_hash32(data, size, digest->checksum->first); } -static void rm_digest_spooky64_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { - checksum->first = spooky_hash64(data, size, checksum->first); +static void rm_digest_spooky64_update(RmDigest *digest, const unsigned char *data, RmOff size) { + digest->checksum->first = spooky_hash64(data, size, digest->checksum->first); } -static void rm_digest_spooky_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { - spooky_hash128(data, size, 
(uint64_t *)&checksum->first, (uint64_t *)&checksum->second); +static void rm_digest_spooky_update(RmDigest *digest, const unsigned char *data, RmOff size) { + spooky_hash128(data, size, (uint64_t *)&digest->checksum->first, (uint64_t *)&digest->checksum->second); } -/* - ****** xxhash ****** - */ +static const RmDigestSpec spooky32_spec = { 32, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky32_update}; +static const RmDigestSpec spooky64_spec = { 64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky64_update}; +static const RmDigestSpec spooky_spec = { 128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky_update}; + -static void rm_digest_xxhash_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { - checksum->first = XXH64(data, size, checksum->first); +/////////////////////////// +// xxhash // +/////////////////////////// + +static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, RmOff size) { + digest->checksum->first = XXH64(data, size, digest->checksum->first); } -/* - ****** farmhash ****** - */ +static const RmDigestSpec xxhash_spec = {64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_xxhash_update}; -static void rm_digest_farmhash_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { +/////////////////////////// +// farmhash // +/////////////////////////// + +static void rm_digest_farmhash_update(RmDigest *digest, const unsigned char *data, RmOff size) { /* TODO: this won't work, it's not cumulative */ - checksum->first = cfarmhash((const char *)data, size); + digest->checksum->first = cfarmhash((const char *)data, size); } -/* - ****** murmur ****** - */ +static const RmDigestSpec farmhash_spec = {64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_farmhash_update}; -static void rm_digest_murmur_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { +/////////////////////////// +// murmur //
+/////////////////////////// + + +static void rm_digest_murmur_update(RmDigest *digest, const unsigned char *data, RmOff size) { #if RM_PLATFORM_32 - MurmurHash3_x86_128(data, size, (uint32_t)checksum->first, checksum); + MurmurHash3_x86_128(data, size, (uint32_t)digest->checksum->first, digest->checksum); #elif RM_PLATFORM_64 - MurmurHash3_x64_128(data, size, (uint32_t)checksum->first, checksum); + MurmurHash3_x64_128(data, size, (uint32_t)digest->checksum->first, digest->checksum); #else #error "Probably not a good idea to compile rmlint on 16bit." #endif } -/* - ****** cityhash ****** - */ +static const RmDigestSpec murmur_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_murmur_update}; + +/////////////////////////// +// cityhash // +/////////////////////////// -static void rm_digest_city_update(RmUint128 *checksum, const unsigned char *data, RmOff size) { +static void rm_digest_city_update(RmDigest *digest, const unsigned char *data, RmOff size) { /* There is a more optimized version but it needs the crc command of sse4.2 * (available on Intel Nehalem and up; my amd box doesn't have this though) */ - uint128 old = {checksum->first, checksum->second}; + uint128 old = {digest->checksum->first, digest->checksum->second}; +#ifdef __SSE4_2__ + old = CityHashCrc128WithSeed((const char *)data, size, old); +#else old = CityHash128WithSeed((const char *)data, size, old); - memcpy(checksum, &old, sizeof(uint128)); +#endif + memcpy(digest->checksum, &old, sizeof(uint128)); } +static const RmDigestSpec city_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_city_update}; -/* - ****** cumulative ****** - */ +/////////////////////////// +// cumulative // +/////////////////////////// -static void rm_digest_cumulative_update(guint8 *checksum, const unsigned char *data, RmOff size) { +static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *data, RmOff size) { /* This only XORS the two checksums. 
*/ - size = MIN(size, 16); + size = MIN(size, digest->bytes); for(gsize i = 0; i < size; ++i) { - checksum[i] ^= ((guint8 *)data)[i % size]; + digest->data[i] ^= ((guint8 *)data)[i % size]; } } +static const RmDigestSpec cumulative_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_cumulative_update}; -/* - ****** glib hash algorithm interface ****** - */ + +/////////////////////////// +// glib hashes // +/////////////////////////// static const GChecksumType glib_map[] = { [RM_DIGEST_MD5] = G_CHECKSUM_MD5, @@ -193,6 +214,17 @@ static void rm_digest_glib_free(RmDigest *digest) { g_checksum_free(digest->glib_checksum); } +static void rm_digest_glib_update(RmDigest *digest, const unsigned char *data, RmOff size) { + g_checksum_update(digest->glib_checksum, data, size); +} + +static const RmDigestSpec md5_spec = {128, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; +static const RmDigestSpec sha1_spec = {160, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; +static const RmDigestSpec sha256_spec = {256, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; +#if HAVE_SHA512 +static const RmDigestSpec sha512_spec = {512, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; +#endif + /* ****** sha3 hash algorithm interface ****** */ @@ -224,6 +256,14 @@ static void rm_digest_sha3_free(RmDigest *digest) { g_slice_free(sha3_context, digest->sha3_ctx); } +static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, RmOff size) { + sha3_Update(digest->sha3_ctx, data, size); +} + +static const RmDigestSpec sha3_256_spec = { 256, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update}; +static const RmDigestSpec sha3_384_spec = { 384, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update}; +static const RmDigestSpec sha3_512_spec = { 512, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update}; + /* ****** blake hash algorithm interface ****** */ @@ 
-272,6 +312,29 @@ static void rm_digest_blake2sp_free(RmDigest *digest) { g_slice_free(blake2sp_state, digest->blake2sp_state); } +static void rm_digest_blake2b_update(RmDigest *digest, const unsigned char *data, RmOff size) { + blake2b_update(digest->blake2b_state, data, size); +} + +static void rm_digest_blake2bp_update(RmDigest *digest, const unsigned char *data, RmOff size) { + blake2bp_update(digest->blake2bp_state, data, size); +} + +static void rm_digest_blake2s_update(RmDigest *digest, const unsigned char *data, RmOff size) { + blake2s_update(digest->blake2s_state, data, size); +} + +static void rm_digest_blake2sp_update(RmDigest *digest, const unsigned char *data, RmOff size) { + blake2sp_update(digest->blake2sp_state, data, size); +} + + +static const RmDigestSpec blake2b_spec = {512, rm_digest_blake2b_init, rm_digest_blake2b_free, rm_digest_blake2b_update}; +static const RmDigestSpec blake2bp_spec = {512, rm_digest_blake2bp_init, rm_digest_blake2bp_free, rm_digest_blake2bp_update}; +static const RmDigestSpec blake2s_spec = {256, rm_digest_blake2s_init, rm_digest_blake2s_free, rm_digest_blake2s_update}; +static const RmDigestSpec blake2sp_spec = {256, rm_digest_blake2sp_init, rm_digest_blake2sp_free, rm_digest_blake2sp_update}; + + /* ****** ext hash algorithm interface ****** */ @@ -289,7 +352,6 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm * */ #define CHAR_TO_NUM(c) (unsigned char)(g_ascii_isdigit(c) ? 
c - '0' : (c - 'a') + 10) - digest->bytes = size / 2; digest->checksum = g_slice_alloc0(digest->bytes); @@ -299,6 +361,9 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm } } +static const RmDigestSpec ext_spec = {0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update}; + + /* ****** paranoid hash algorithm interface ****** */ @@ -323,37 +388,39 @@ static void rm_digest_paranoid_free(RmDigest *digest) { g_slice_free(RmParanoid, digest->paranoid); } +/* Note: paranoid update implementation is in rm_digest_buffered_update() below */ + +static const RmDigestSpec paranoid_spec = {0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL}; + /* - ****** hash interface specification map ****** + ****** RmDigestSpec map ****** */ -#define UPDATE(ALGO) (RmDigestUpdateFunc)rm_digest_##ALGO##_update - -static const RmDigestSpec digest_specs[] = { - [RM_DIGEST_UNKNOWN] = { 0, NULL, NULL, NULL}, - [RM_DIGEST_MURMUR] = { 128, rm_digest_generic_init, rm_digest_generic_free, UPDATE(murmur)}, - [RM_DIGEST_SPOOKY] = { 128, rm_digest_generic_init, rm_digest_generic_free, UPDATE(spooky)}, - [RM_DIGEST_SPOOKY32] = { 32, rm_digest_generic_init, rm_digest_generic_free, UPDATE(spooky32)}, - [RM_DIGEST_SPOOKY64] = { 64, rm_digest_generic_init, rm_digest_generic_free, UPDATE(spooky64)}, - [RM_DIGEST_CITY] = { 128, rm_digest_generic_init, rm_digest_generic_free, UPDATE(city)}, - [RM_DIGEST_MD5] = { 128, rm_digest_glib_init, rm_digest_glib_free, (RmDigestUpdateFunc)g_checksum_update}, - [RM_DIGEST_SHA1] = { 160, rm_digest_glib_init, rm_digest_glib_free, (RmDigestUpdateFunc)g_checksum_update}, - [RM_DIGEST_SHA256] = { 256, rm_digest_glib_init, rm_digest_glib_free, (RmDigestUpdateFunc)g_checksum_update}, +static const RmDigestSpec *digest_specs[] = { + [RM_DIGEST_UNKNOWN] = NULL, + [RM_DIGEST_MURMUR] = &murmur_spec, + [RM_DIGEST_SPOOKY] = &spooky_spec, + [RM_DIGEST_SPOOKY32] = &spooky32_spec, + [RM_DIGEST_SPOOKY64] = &spooky64_spec, + 
[RM_DIGEST_CITY] = &city_spec, + [RM_DIGEST_MD5] = &md5_spec, + [RM_DIGEST_SHA1] = &sha1_spec, + [RM_DIGEST_SHA256] = &sha256_spec, #if HAVE_SHA512 - [RM_DIGEST_SHA512] = { 512, rm_digest_glib_init, rm_digest_glib_free, (RmDigestUpdateFunc)g_checksum_update}, + [RM_DIGEST_SHA512] = &sha512_spec, #endif - [RM_DIGEST_SHA3_256] = { 256, rm_digest_sha3_init, rm_digest_sha3_free, (RmDigestUpdateFunc)sha3_Update}, - [RM_DIGEST_SHA3_384] = { 384, rm_digest_sha3_init, rm_digest_sha3_free, (RmDigestUpdateFunc)sha3_Update}, - [RM_DIGEST_SHA3_512] = { 512, rm_digest_sha3_init, rm_digest_sha3_free, (RmDigestUpdateFunc)sha3_Update}, - [RM_DIGEST_BLAKE2S] = { 256, rm_digest_blake2s_init, rm_digest_blake2s_free, (RmDigestUpdateFunc)blake2s_update}, - [RM_DIGEST_BLAKE2B] = { 512, rm_digest_blake2b_init, rm_digest_blake2b_free, (RmDigestUpdateFunc)blake2b_update}, - [RM_DIGEST_BLAKE2SP] = { 256, rm_digest_blake2sp_init, rm_digest_blake2sp_free, (RmDigestUpdateFunc)blake2sp_update}, - [RM_DIGEST_BLAKE2BP] = { 512, rm_digest_blake2bp_init, rm_digest_blake2bp_free, (RmDigestUpdateFunc)blake2bp_update}, - [RM_DIGEST_EXT] = { 0, rm_digest_ext_init, rm_digest_generic_free, UPDATE(ext)}, - [RM_DIGEST_CUMULATIVE] = { 128, rm_digest_generic_init, rm_digest_generic_free, UPDATE(cumulative)}, - [RM_DIGEST_PARANOID] = { 0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL}, - [RM_DIGEST_FARMHASH] = { 64, rm_digest_generic_init, rm_digest_generic_free, UPDATE(farmhash)}, - [RM_DIGEST_XXHASH] = { 64, rm_digest_generic_init, rm_digest_generic_free, UPDATE(xxhash)}, + [RM_DIGEST_SHA3_256] = &sha3_256_spec, + [RM_DIGEST_SHA3_384] = &sha3_384_spec, + [RM_DIGEST_SHA3_512] = &sha3_512_spec, + [RM_DIGEST_BLAKE2S] = &blake2s_spec, + [RM_DIGEST_BLAKE2B] = &blake2b_spec, + [RM_DIGEST_BLAKE2SP] = &blake2sp_spec, + [RM_DIGEST_BLAKE2BP] = &blake2bp_spec, + [RM_DIGEST_EXT] = &ext_spec, + [RM_DIGEST_CUMULATIVE] = &cumulative_spec, + [RM_DIGEST_PARANOID] = &paranoid_spec, + [RM_DIGEST_FARMHASH] =
&farmhash_spec, + [RM_DIGEST_XXHASH] = &xxhash_spec, }; @@ -549,12 +616,12 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s bool use_shadow_hash) { g_assert(type != RM_DIGEST_UNKNOWN); - RmDigestSpec spec = digest_specs[type]; + const RmDigestSpec *spec = digest_specs[type]; RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; - digest->bytes = spec.bits / 8; - spec.init(digest, seed1, seed2, ext_size, use_shadow_hash); + digest->bytes = spec->bits / 8; + spec->init(digest, seed1, seed2, ext_size, use_shadow_hash); return digest; } @@ -572,25 +639,25 @@ void rm_digest_release_buffers(RmDigest *digest) { } void rm_digest_free(RmDigest *digest) { - RmDigestSpec spec = digest_specs[digest->type]; - spec.free(digest); + const RmDigestSpec *spec = digest_specs[digest->type]; + spec->free(digest); g_slice_free(RmDigest, digest); } void rm_digest_update(RmDigest *digest, const unsigned char *data, RmOff size) { - RmDigestSpec spec = digest_specs[digest->type]; - spec.update(digest->checksum, data, size); + const RmDigestSpec *spec = digest_specs[digest->type]; + spec->update(digest, data, size); } void rm_digest_buffered_update(RmBuffer *buffer) { + rm_assert_gentle(buffer); RmDigest *digest = buffer->digest; if(digest->type != RM_DIGEST_PARANOID) { rm_digest_update(digest, buffer->data, buffer->len); rm_buffer_release(buffer); } else { RmParanoid *paranoid = digest->paranoid; - - /* efficiently append buffer to buffers GSList */ + /* paranoid update... 
*/ if(!paranoid->buffers) { /* first buffer */ paranoid->buffers = g_slist_prepend(NULL, buffer); @@ -638,7 +705,7 @@ void rm_digest_buffered_update(RmBuffer *buffer) { paranoid->twin_candidate_buffer = paranoid->twin_candidate_buffer->next; } if(paranoid->twin_candidate && !match) { -/* reject the twin candidate, also add to rejects list to speed up rm_digest_equal() */ + /* reject the twin candidate, also add to rejects list to speed up rm_digest_equal() */ #if _RM_CHECKSUM_DEBUG rm_log_debug_line("Rejected twin candidate %p for %p", paranoid->twin_candidate, paranoid); diff --git a/lib/checksum.h b/lib/checksum.h index e45a08c7..16ebfb50 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -103,6 +103,7 @@ typedef struct RmDigest { sha3_context *sha3_ctx; RmUint128 *checksum; RmParanoid *paranoid; + guint8 *data; }; /* digest type */ From b7de76bd89e54568335371d52d2571923b1503a7 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 8 Nov 2017 22:53:33 +1000 Subject: [PATCH 081/180] checksum: re-order procedures more logically --- lib/checksum.c | 220 +++++++++++++++++++++++++------------------------ 1 file changed, 114 insertions(+), 106 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index e8bf3f72..f6751683 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -54,6 +54,91 @@ #define _RM_CHECKSUM_DEBUG 0 + +/////////////////////////////////////// +// BUFFER POOL IMPLEMENTATION // +/////////////////////////////////////// + +RmOff rm_buffer_size(RmBufferPool *pool) { + return pool->buffer_size; +} + +static RmBuffer *rm_buffer_new(RmBufferPool *pool) { + RmBuffer *self = g_slice_new0(RmBuffer); + self->pool = pool; + self->data = g_slice_alloc(pool->buffer_size); + return self; +} + +static void rm_buffer_free(RmBuffer *buf) { + g_slice_free1(buf->pool->buffer_size, buf->data); + g_slice_free(RmBuffer, buf); +} + +RmBufferPool *rm_buffer_pool_init(gsize buffer_size, gsize max_mem) { + RmBufferPool *self = g_slice_new0(RmBufferPool); + self->buffer_size = 
buffer_size; + self->avail_buffers = max_mem ? MAX(max_mem / buffer_size, 1) : (gsize)-1; + + g_cond_init(&self->change); + g_mutex_init(&self->lock); + return self; +} + +void rm_buffer_pool_destroy(RmBufferPool *pool) { + g_slist_free_full(pool->stack, (GDestroyNotify)rm_buffer_free); + + g_mutex_clear(&pool->lock); + g_cond_clear(&pool->change); + g_slice_free(RmBufferPool, pool); +} + +RmBuffer *rm_buffer_get(RmBufferPool *pool) { + RmBuffer *buffer = NULL; + g_mutex_lock(&pool->lock); + { + while(!buffer) { + buffer = rm_util_slist_pop(&pool->stack, NULL); + if(!buffer && pool->avail_buffers > 0) { + buffer = rm_buffer_new(pool); + } + if(!buffer) { + if(!pool->mem_warned) { + rm_log_warning_line( + "read buffer limit reached - waiting for " + "processing to catch up"); + pool->mem_warned = true; + } + g_cond_wait(&pool->change, &pool->lock); + } + } + pool->avail_buffers--; + } + g_mutex_unlock(&pool->lock); + + rm_assert_gentle(buffer); + return buffer; +} + +void rm_buffer_release(RmBuffer *buf) { + RmBufferPool *pool = buf->pool; + g_mutex_lock(&pool->lock); + { + pool->avail_buffers++; + g_cond_signal(&pool->change); + pool->stack = g_slist_prepend(pool->stack, buf); + } + g_mutex_unlock(&pool->lock); +} + +static gboolean rm_buffer_equal(RmBuffer *a, RmBuffer *b) { + return (a->len == b->len && memcmp(a->data, b->data, a->len) == 0); +} + +/////////////////////////////////////// +// RMDIGEST IMPLEMENTATION // +/////////////////////////////////////// + typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); typedef void (*RmDigestFreeFunc)(RmDigest *digest); typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, RmOff size); @@ -65,9 +150,12 @@ typedef struct RmDigestSpec { RmDigestUpdateFunc update; } RmDigestSpec; -/* - ****** common interface for non-cryptographic hashes ****** - */ + +/////////////////////////// +// common funcs for // +// non-cryptographic // +// 
hashes // +/////////////////////////// static void rm_digest_generic_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { /* init for hashes which just require allocation of digest->checksum */ @@ -94,9 +182,9 @@ static void rm_digest_generic_free(RmDigest *digest) { } } -/* - ****** spooky hashes ****** - */ +/////////////////////////// +// spooky hashes // +/////////////////////////// static void rm_digest_spooky32_update(RmDigest *digest, const unsigned char *data, RmOff size) { digest->checksum->first = spooky_hash32(data, size, digest->checksum->first); @@ -186,7 +274,6 @@ static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *d static const RmDigestSpec cumulative_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_cumulative_update}; - /////////////////////////// // glib hashes // /////////////////////////// @@ -225,9 +312,10 @@ static const RmDigestSpec sha256_spec = {256, rm_digest_glib_init, rm_digest_gli static const RmDigestSpec sha512_spec = {512, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; #endif -/* - ****** sha3 hash algorithm interface ****** - */ +/////////////////////////// +// sha3 hashes // +/////////////////////////// + static void rm_digest_sha3_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { digest->sha3_ctx = g_slice_alloc0(sizeof(sha3_context)); @@ -264,9 +352,10 @@ static const RmDigestSpec sha3_256_spec = { 256, rm_digest_sha3_init, rm_digest_ static const RmDigestSpec sha3_384_spec = { 384, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update}; static const RmDigestSpec sha3_512_spec = { 512, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update}; -/* - ****** blake hash algorithm interface ****** - */ +/////////////////////////// +// blake hashes // +/////////////////////////// + #define BLAKE_INIT(ALGO, ALGO_BIG) \ digest->ALGO##_state = 
g_slice_alloc0(sizeof(ALGO##_state)); \ @@ -335,9 +424,10 @@ static const RmDigestSpec blake2s_spec = {256, rm_digest_blake2s_init, rm_digest static const RmDigestSpec blake2sp_spec = {256, rm_digest_blake2sp_init, rm_digest_blake2sp_free, rm_digest_blake2sp_update}; -/* - ****** ext hash algorithm interface ****** - */ +/////////////////////////// +// ext hash // +/////////////////////////// + static void rm_digest_ext_init(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash) { digest->bytes = ext_size; @@ -364,9 +454,10 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm static const RmDigestSpec ext_spec = {0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update}; -/* - ****** paranoid hash algorithm interface ****** - */ +/////////////////////////// +// paranoid 'hash' // +/////////////////////////// + static void rm_digest_paranoid_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, bool use_shadow_hash) { digest->paranoid = g_slice_new0(RmParanoid); @@ -392,9 +483,10 @@ static void rm_digest_paranoid_free(RmDigest *digest) { static const RmDigestSpec paranoid_spec = {0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL}; -/* - ****** RmDigestSpec map ****** - */ + +//////////////////////////////// +// RmDigestSpec map // +//////////////////////////////// static const RmDigestSpec *digest_specs[] = { [RM_DIGEST_UNKNOWN] = NULL, @@ -424,90 +516,6 @@ static const RmDigestSpec *digest_specs[] = { }; -/////////////////////////////////////// -// BUFFER POOL IMPLEMENTATION // -/////////////////////////////////////// - -RmOff rm_buffer_size(RmBufferPool *pool) { - return pool->buffer_size; -} - -static RmBuffer *rm_buffer_new(RmBufferPool *pool) { - RmBuffer *self = g_slice_new0(RmBuffer); - self->pool = pool; - self->data = g_slice_alloc(pool->buffer_size); - return self; -} - -static void rm_buffer_free(RmBuffer *buf) { - g_slice_free1(buf->pool->buffer_size, 
buf->data); - g_slice_free(RmBuffer, buf); -} - -RmBufferPool *rm_buffer_pool_init(gsize buffer_size, gsize max_mem) { - RmBufferPool *self = g_slice_new0(RmBufferPool); - self->buffer_size = buffer_size; - self->avail_buffers = max_mem ? MAX(max_mem / buffer_size, 1) : (gsize)-1; - - g_cond_init(&self->change); - g_mutex_init(&self->lock); - return self; -} - -void rm_buffer_pool_destroy(RmBufferPool *pool) { - g_slist_free_full(pool->stack, (GDestroyNotify)rm_buffer_free); - - g_mutex_clear(&pool->lock); - g_cond_clear(&pool->change); - g_slice_free(RmBufferPool, pool); -} - -RmBuffer *rm_buffer_get(RmBufferPool *pool) { - RmBuffer *buffer = NULL; - g_mutex_lock(&pool->lock); - { - while(!buffer) { - buffer = rm_util_slist_pop(&pool->stack, NULL); - if(!buffer && pool->avail_buffers > 0) { - buffer = rm_buffer_new(pool); - } - if(!buffer) { - if(!pool->mem_warned) { - rm_log_warning_line( - "read buffer limit reached - waiting for " - "processing to catch up"); - pool->mem_warned = true; - } - g_cond_wait(&pool->change, &pool->lock); - } - } - pool->avail_buffers--; - } - g_mutex_unlock(&pool->lock); - - rm_assert_gentle(buffer); - return buffer; -} - -void rm_buffer_release(RmBuffer *buf) { - RmBufferPool *pool = buf->pool; - g_mutex_lock(&pool->lock); - { - pool->avail_buffers++; - g_cond_signal(&pool->change); - pool->stack = g_slist_prepend(pool->stack, buf); - } - g_mutex_unlock(&pool->lock); -} - -static gboolean rm_buffer_equal(RmBuffer *a, RmBuffer *b) { - return (a->len == b->len && memcmp(a->data, b->data, a->len) == 0); -} - -/////////////////////////////////////// -// RMDIGEST IMPLEMENTATION // -/////////////////////////////////////// - static gpointer rm_init_digest_type_table(GHashTable **code_table) { static struct { char *name; From e6cb5eb1f21928f397ee06364e3631050c8c421a Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 8 Nov 2017 23:16:28 +1000 Subject: [PATCH 082/180] checksum: extend RmDigestSpec to include copy function --- lib/checksum.c 
| 143 ++++++++++++++++++++----------------------------- 1 file changed, 58 insertions(+), 85 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index f6751683..8bd5fae1 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -142,12 +142,14 @@ static gboolean rm_buffer_equal(RmBuffer *a, RmBuffer *b) { typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); typedef void (*RmDigestFreeFunc)(RmDigest *digest); typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, RmOff size); +typedef void (*RmDigestCopyFunc)(RmDigest *digest, RmDigest *copy); typedef struct RmDigestSpec { const int bits; RmDigestInitFunc init; RmDigestFreeFunc free; RmDigestUpdateFunc update; + RmDigestCopyFunc copy; } RmDigestSpec; @@ -157,13 +159,14 @@ typedef struct RmDigestSpec { // hashes // /////////////////////////// +#define ALLOC_BYTES(bytes) MAX(8, bytes) + static void rm_digest_generic_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { /* init for hashes which just require allocation of digest->checksum */ - RmOff bytes = MAX(8, digest->bytes); /* Cannot go lower than 8, since we read 8 byte in some places. 
* For some checksums this may mean trailing zeros in the unused bytes */ - digest->checksum = g_slice_alloc0(bytes); + digest->checksum = g_slice_alloc0(ALLOC_BYTES(digest->bytes)); if(seed1 && seed2) { /* copy seeds to checksum */ @@ -182,6 +185,10 @@ static void rm_digest_generic_free(RmDigest *digest) { } } +static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) { + copy->checksum = g_slice_copy(ALLOC_BYTES(digest->bytes), digest->checksum); +} + /////////////////////////// // spooky hashes // /////////////////////////// @@ -198,9 +205,9 @@ static void rm_digest_spooky_update(RmDigest *digest, const unsigned char *data, spooky_hash128(data, size, (uint64_t *)&digest->checksum->first, (uint64_t *)&digest->checksum->second); } -static const RmDigestSpec spooky32_spec = { 32, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky32_update}; -static const RmDigestSpec spooky64_spec = { 32, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky64_update}; -static const RmDigestSpec spooky_spec = { 128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky_update}; +static const RmDigestSpec spooky32_spec = { 32, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky32_update, rm_digest_generic_copy}; +static const RmDigestSpec spooky64_spec = { 64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky64_update, rm_digest_generic_copy}; +static const RmDigestSpec spooky_spec = { 128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky_update, rm_digest_generic_copy}; /////////////////////////// @@ -211,7 +218,7 @@ static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, digest->checksum->first = XXH64(data, size, digest->checksum->first); } -static const RmDigestSpec xxhash_spec = {64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_xxhash_update}; +static const RmDigestSpec xxhash_spec = {64, rm_digest_generic_init, rm_digest_generic_free, 
rm_digest_xxhash_update, rm_digest_generic_copy}; /////////////////////////// // farmhash // @@ -222,7 +229,7 @@ static void rm_digest_farmhash_update(RmDigest *digest, const unsigned char *dat digest->checksum->first = cfarmhash((const char *)data, size); } -static const RmDigestSpec farmhash_spec = {64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_farmhash_update}; +static const RmDigestSpec farmhash_spec = {64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_farmhash_update, rm_digest_generic_copy}; /////////////////////////// // murmur // @@ -239,7 +246,7 @@ static void rm_digest_murmur_update(RmDigest *digest, const unsigned char *data, #endif } -static const RmDigestSpec murmur_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_murmur_update}; +static const RmDigestSpec murmur_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_murmur_update, rm_digest_generic_copy}; /////////////////////////// // cityhash // @@ -258,7 +265,7 @@ static void rm_digest_city_update(RmDigest *digest, const unsigned char *data, R memcpy(digest->checksum, &old, sizeof(uint128)); } -static const RmDigestSpec city_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_city_update}; +static const RmDigestSpec city_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_city_update, rm_digest_generic_copy}; /////////////////////////// // cumulative // @@ -272,7 +279,7 @@ static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *d } } -static const RmDigestSpec cumulative_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_cumulative_update}; +static const RmDigestSpec cumulative_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_cumulative_update, rm_digest_generic_copy}; /////////////////////////// // glib hashes // @@ -305,11 +312,16 @@ static void rm_digest_glib_update(RmDigest *digest, const unsigned char *data, R 
g_checksum_update(digest->glib_checksum, data, size); } -static const RmDigestSpec md5_spec = {128, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; -static const RmDigestSpec sha1_spec = {160, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; -static const RmDigestSpec sha256_spec = {256, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; +static void rm_digest_glib_copy(RmDigest *digest, RmDigest *copy) { + copy->glib_checksum = g_checksum_copy(digest->glib_checksum); +} + + +static const RmDigestSpec md5_spec = {128, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy}; +static const RmDigestSpec sha1_spec = {160, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy}; +static const RmDigestSpec sha256_spec = {256, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy}; #if HAVE_SHA512 -static const RmDigestSpec sha512_spec = {512, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update}; +static const RmDigestSpec sha512_spec = {512, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy}; #endif /////////////////////////// @@ -348,9 +360,13 @@ static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, R sha3_Update(digest->sha3_ctx, data, size); } -static const RmDigestSpec sha3_256_spec = { 256, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update}; -static const RmDigestSpec sha3_384_spec = { 384, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update}; -static const RmDigestSpec sha3_512_spec = { 512, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update}; +static void rm_digest_sha3_copy(RmDigest *digest, RmDigest *copy) { + copy->sha3_ctx = g_slice_copy(sizeof(sha3_context), digest->sha3_ctx); +} + +static const RmDigestSpec sha3_256_spec = { 256, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, 
rm_digest_sha3_copy}; +static const RmDigestSpec sha3_384_spec = { 384, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy}; +static const RmDigestSpec sha3_512_spec = { 512, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy}; /////////////////////////// // blake hashes // @@ -417,11 +433,26 @@ static void rm_digest_blake2sp_update(RmDigest *digest, const unsigned char *dat blake2sp_update(digest->blake2sp_state, data, size); } +static void rm_digest_blake2b_copy(RmDigest *digest, RmDigest *copy) { + copy->blake2b_state = g_slice_copy(sizeof(blake2b_state), digest->blake2b_state); +} + +static void rm_digest_blake2bp_copy(RmDigest *digest, RmDigest *copy) { + copy->blake2bp_state = g_slice_copy(sizeof(blake2bp_state), digest->blake2bp_state); +} + +static void rm_digest_blake2s_copy(RmDigest *digest, RmDigest *copy) { + copy->blake2s_state = g_slice_copy(sizeof(blake2s_state), digest->blake2s_state); +} + +static void rm_digest_blake2sp_copy(RmDigest *digest, RmDigest *copy) { + copy->blake2sp_state = g_slice_copy(sizeof(blake2sp_state), digest->blake2sp_state); +} -static const RmDigestSpec blake2b_spec = {512, rm_digest_blake2b_init, rm_digest_blake2b_free, rm_digest_blake2b_update}; -static const RmDigestSpec blake2bp_spec = {512, rm_digest_blake2bp_init, rm_digest_blake2bp_free, rm_digest_blake2bp_update}; -static const RmDigestSpec blake2s_spec = {256, rm_digest_blake2s_init, rm_digest_blake2s_free, rm_digest_blake2s_update}; -static const RmDigestSpec blake2sp_spec = {256, rm_digest_blake2sp_init, rm_digest_blake2sp_free, rm_digest_blake2sp_update}; +static const RmDigestSpec blake2b_spec = {512, rm_digest_blake2b_init, rm_digest_blake2b_free, rm_digest_blake2b_update, rm_digest_blake2b_copy}; +static const RmDigestSpec blake2bp_spec = {512, rm_digest_blake2bp_init, rm_digest_blake2bp_free, rm_digest_blake2bp_update, rm_digest_blake2bp_copy}; +static const RmDigestSpec blake2s_spec = {256, 
rm_digest_blake2s_init, rm_digest_blake2s_free, rm_digest_blake2s_update, rm_digest_blake2s_copy}; +static const RmDigestSpec blake2sp_spec = {256, rm_digest_blake2sp_init, rm_digest_blake2sp_free, rm_digest_blake2sp_update, rm_digest_blake2sp_copy}; /////////////////////////// @@ -451,7 +482,7 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm } } -static const RmDigestSpec ext_spec = {0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update}; +static const RmDigestSpec ext_spec = {0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy}; /////////////////////////// @@ -481,7 +512,7 @@ static void rm_digest_paranoid_free(RmDigest *digest) { /* Note: paranoid update implementation is in rm_digest_buffered_update() below */ -static const RmDigestSpec paranoid_spec = {0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL}; +static const RmDigestSpec paranoid_spec = {0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL}; //////////////////////////////// @@ -735,73 +766,15 @@ void rm_digest_buffered_update(RmBuffer *buffer) { } } -#define BLAKE_COPY(ALGO) \ - { \ - self = g_slice_new0(RmDigest); \ - self->bytes = digest->bytes; \ - self->type = digest->type; \ - self->ALGO##_state = g_slice_alloc0(sizeof(ALGO##_state)); \ - memcpy(self->ALGO##_state, digest->ALGO##_state, sizeof(ALGO##_state)); \ - } - RmDigest *rm_digest_copy(RmDigest *digest) { rm_assert_gentle(digest); - RmDigest *self = NULL; + RmDigest *copy = g_slice_copy(sizeof(RmDigest), digest); - switch(digest->type) { - case RM_DIGEST_MD5: - case RM_DIGEST_SHA512: - case RM_DIGEST_SHA256: - case RM_DIGEST_SHA1: - self = g_slice_new0(RmDigest); - self->bytes = digest->bytes; - self->type = digest->type; - self->glib_checksum = g_checksum_copy(digest->glib_checksum); - break; - case RM_DIGEST_SHA3_256: - case RM_DIGEST_SHA3_384: - case RM_DIGEST_SHA3_512: - self = g_slice_new0(RmDigest); - self->bytes = 
digest->bytes; - self->type = digest->type; - self->sha3_ctx = g_slice_alloc0(sizeof(sha3_context)); - memcpy(self->sha3_ctx, digest->sha3_ctx, sizeof(sha3_context)); - break; - case RM_DIGEST_BLAKE2S: - BLAKE_COPY(blake2s); - break; - case RM_DIGEST_BLAKE2B: - BLAKE_COPY(blake2b); - break; - case RM_DIGEST_BLAKE2SP: - BLAKE_COPY(blake2sp); - break; - case RM_DIGEST_BLAKE2BP: - BLAKE_COPY(blake2bp); - break; - case RM_DIGEST_SPOOKY: - case RM_DIGEST_SPOOKY32: - case RM_DIGEST_SPOOKY64: - case RM_DIGEST_MURMUR: - case RM_DIGEST_CITY: - case RM_DIGEST_XXHASH: - case RM_DIGEST_FARMHASH: - case RM_DIGEST_CUMULATIVE: - case RM_DIGEST_EXT: - self = rm_digest_new(digest->type, 0, 0, digest->bytes, FALSE); - - if(self->checksum && digest->checksum) { - memcpy((char *)self->checksum, (char *)digest->checksum, self->bytes); - } - - break; - case RM_DIGEST_PARANOID: - default: - rm_assert_gentle_not_reached(); - } + const RmDigestSpec *spec = digest_specs[digest->type]; + spec->copy(digest, copy); - return self; + return copy; } static gboolean rm_digest_needs_steal(RmDigestType digest_type) { From dd0cdc69f0684c20a6858dc2a175244ee17dac1e Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 00:17:05 +1000 Subject: [PATCH 083/180] checksum: extend RmDigestSpec to include steal function; use macros to shorten definitions --- lib/checksum.c | 225 +++++++++++++++++++------------------------------ 1 file changed, 89 insertions(+), 136 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 8bd5fae1..4352086b 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -143,6 +143,7 @@ typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed1, RmOff seed2, RmO typedef void (*RmDigestFreeFunc)(RmDigest *digest); typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, RmOff size); typedef void (*RmDigestCopyFunc)(RmDigest *digest, RmDigest *copy); +typedef void (*RmDigestStealFunc)(RmDigest *digest, guint8 *result); typedef struct RmDigestSpec { 
const int bits; @@ -150,6 +151,7 @@ typedef struct RmDigestSpec { RmDigestFreeFunc free; RmDigestUpdateFunc update; RmDigestCopyFunc copy; + RmDigestStealFunc steal; } RmDigestSpec; @@ -205,9 +207,10 @@ static void rm_digest_spooky_update(RmDigest *digest, const unsigned char *data, spooky_hash128(data, size, (uint64_t *)&digest->checksum->first, (uint64_t *)&digest->checksum->second); } -static const RmDigestSpec spooky32_spec = { 32, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky32_update, rm_digest_generic_copy}; -static const RmDigestSpec spooky64_spec = { 64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky64_update, rm_digest_generic_copy}; -static const RmDigestSpec spooky_spec = { 128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_spooky_update, rm_digest_generic_copy}; +#define GENERIC_FUNCS(ALGO) rm_digest_generic_init, rm_digest_generic_free, rm_digest_##ALGO##_update, rm_digest_generic_copy, NULL +static const RmDigestSpec spooky32_spec = { 32, GENERIC_FUNCS(spooky32) }; +static const RmDigestSpec spooky64_spec = { 64, GENERIC_FUNCS(spooky64) }; +static const RmDigestSpec spooky_spec = { 128, GENERIC_FUNCS(spooky) }; /////////////////////////// @@ -218,7 +221,7 @@ static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, digest->checksum->first = XXH64(data, size, digest->checksum->first); } -static const RmDigestSpec xxhash_spec = {64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_xxhash_update, rm_digest_generic_copy}; +static const RmDigestSpec xxhash_spec = {64, GENERIC_FUNCS(xxhash)}; /////////////////////////// // farmhash // @@ -229,7 +232,7 @@ static void rm_digest_farmhash_update(RmDigest *digest, const unsigned char *dat digest->checksum->first = cfarmhash((const char *)data, size); } -static const RmDigestSpec farmhash_spec = {64, rm_digest_generic_init, rm_digest_generic_free, rm_digest_farmhash_update, rm_digest_generic_copy}; +static const RmDigestSpec 
farmhash_spec = {64, GENERIC_FUNCS(farmhash)}; /////////////////////////// // murmur // @@ -246,7 +249,7 @@ static void rm_digest_murmur_update(RmDigest *digest, const unsigned char *data, #endif } -static const RmDigestSpec murmur_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_murmur_update, rm_digest_generic_copy}; +static const RmDigestSpec murmur_spec = {128, GENERIC_FUNCS(murmur)}; /////////////////////////// // cityhash // @@ -265,7 +268,7 @@ static void rm_digest_city_update(RmDigest *digest, const unsigned char *data, R memcpy(digest->checksum, &old, sizeof(uint128)); } -static const RmDigestSpec city_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_city_update, rm_digest_generic_copy}; +static const RmDigestSpec city_spec = {128, GENERIC_FUNCS(city)}; /////////////////////////// // cumulative // @@ -279,7 +282,7 @@ static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *d } } -static const RmDigestSpec cumulative_spec = {128, rm_digest_generic_init, rm_digest_generic_free, rm_digest_cumulative_update, rm_digest_generic_copy}; +static const RmDigestSpec cumulative_spec = {128, GENERIC_FUNCS(cumulative)}; /////////////////////////// // glib hashes // @@ -316,12 +319,21 @@ static void rm_digest_glib_copy(RmDigest *digest, RmDigest *copy) { copy->glib_checksum = g_checksum_copy(digest->glib_checksum); } +static void rm_digest_glib_steal(RmDigest *digest, guint8 *result) { + GChecksum *copy = g_checksum_copy(digest->glib_checksum); + gsize buflen = digest->bytes; + g_checksum_get_digest(copy, result, &buflen); + rm_assert_gentle(buflen == digest->bytes); + g_checksum_free(copy); +} -static const RmDigestSpec md5_spec = {128, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy}; -static const RmDigestSpec sha1_spec = {160, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy}; -static const RmDigestSpec sha256_spec = {256, 
rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy}; +#define GLIB_FUNCS rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy, rm_digest_glib_steal + +static const RmDigestSpec md5_spec = {128, GLIB_FUNCS}; +static const RmDigestSpec sha1_spec = {160, GLIB_FUNCS}; +static const RmDigestSpec sha256_spec = {256, GLIB_FUNCS}; #if HAVE_SHA512 -static const RmDigestSpec sha512_spec = {512, rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy}; +static const RmDigestSpec sha512_spec = {512, GLIB_FUNCS}; #endif /////////////////////////// @@ -364,16 +376,28 @@ static void rm_digest_sha3_copy(RmDigest *digest, RmDigest *copy) { copy->sha3_ctx = g_slice_copy(sizeof(sha3_context), digest->sha3_ctx); } -static const RmDigestSpec sha3_256_spec = { 256, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy}; -static const RmDigestSpec sha3_384_spec = { 384, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy}; -static const RmDigestSpec sha3_512_spec = { 512, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy}; +static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { + sha3_context *copy = g_slice_copy(sizeof(sha3_context), digest->sha3_ctx); + memcpy(result, sha3_Finalize(copy), digest->bytes); + g_slice_free(sha3_context, copy); +} + +#define SHA3_FUNCS rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy, rm_digest_sha3_steal + +static const RmDigestSpec sha3_256_spec = { 256, SHA3_FUNCS}; +static const RmDigestSpec sha3_384_spec = { 384, SHA3_FUNCS}; +static const RmDigestSpec sha3_512_spec = { 512, SHA3_FUNCS}; /////////////////////////// // blake hashes // /////////////////////////// - -#define BLAKE_INIT(ALGO, ALGO_BIG) \ +#define CREATE_BLAKE_FUNCS(ALGO, ALGO_BIG) \ + \ +static void rm_digest_##ALGO##_init(RmDigest *digest, RmOff 
seed1, \ + RmOff seed2, \ + _UNUSED RmOff ext_size, \ + _UNUSED bool use_shadow_hash) { \ digest->ALGO##_state = g_slice_alloc0(sizeof(ALGO##_state)); \ ALGO##_init(digest->ALGO##_state, ALGO_BIG##_OUTBYTES); \ if(seed1) { \ @@ -382,78 +406,46 @@ static const RmDigestSpec sha3_512_spec = { 512, rm_digest_sha3_init, rm_digest_ if(seed2) { \ ALGO##_update(digest->ALGO##_state, &seed2, sizeof(RmOff)); \ } \ - g_assert(digest->bytes==ALGO_BIG##_OUTBYTES); - - -static void rm_digest_blake2b_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - BLAKE_INIT(blake2b, BLAKE2B); -} - -static void rm_digest_blake2bp_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - BLAKE_INIT(blake2bp, BLAKE2B); -} - -static void rm_digest_blake2s_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - BLAKE_INIT(blake2s, BLAKE2S); -} - -static void rm_digest_blake2sp_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - BLAKE_INIT(blake2sp, BLAKE2S); -} - -static void rm_digest_blake2b_free(RmDigest *digest) { - g_slice_free(blake2b_state, digest->blake2b_state); -} - -static void rm_digest_blake2bp_free(RmDigest *digest) { - g_slice_free(blake2bp_state, digest->blake2bp_state); -} - -static void rm_digest_blake2s_free(RmDigest *digest) { - g_slice_free(blake2s_state, digest->blake2s_state); -} - -static void rm_digest_blake2sp_free(RmDigest *digest) { - g_slice_free(blake2sp_state, digest->blake2sp_state); -} - -static void rm_digest_blake2b_update(RmDigest *digest, const unsigned char *data, RmOff size) { - blake2b_update(digest->blake2b_state, data, size); -} - -static void rm_digest_blake2bp_update(RmDigest *digest, const unsigned char *data, RmOff size) { - blake2bp_update(digest->blake2bp_state, data, size); -} - -static void rm_digest_blake2s_update(RmDigest *digest, const 
unsigned char *data, RmOff size) { - blake2s_update(digest->blake2s_state, data, size); -} - -static void rm_digest_blake2sp_update(RmDigest *digest, const unsigned char *data, RmOff size) { - blake2sp_update(digest->blake2sp_state, data, size); -} - -static void rm_digest_blake2b_copy(RmDigest *digest, RmDigest *copy) { - copy->blake2b_state = g_slice_copy(sizeof(blake2b_state), digest->blake2b_state); -} - -static void rm_digest_blake2bp_copy(RmDigest *digest, RmDigest *copy) { - copy->blake2bp_state = g_slice_copy(sizeof(blake2bp_state), digest->blake2bp_state); -} - -static void rm_digest_blake2s_copy(RmDigest *digest, RmDigest *copy) { - copy->blake2s_state = g_slice_copy(sizeof(blake2s_state), digest->blake2s_state); -} - -static void rm_digest_blake2sp_copy(RmDigest *digest, RmDigest *copy) { - copy->blake2sp_state = g_slice_copy(sizeof(blake2sp_state), digest->blake2sp_state); -} - -static const RmDigestSpec blake2b_spec = {512, rm_digest_blake2b_init, rm_digest_blake2b_free, rm_digest_blake2b_update, rm_digest_blake2b_copy}; -static const RmDigestSpec blake2bp_spec = {512, rm_digest_blake2bp_init, rm_digest_blake2bp_free, rm_digest_blake2bp_update, rm_digest_blake2bp_copy}; -static const RmDigestSpec blake2s_spec = {256, rm_digest_blake2s_init, rm_digest_blake2s_free, rm_digest_blake2s_update, rm_digest_blake2s_copy}; -static const RmDigestSpec blake2sp_spec = {256, rm_digest_blake2sp_init, rm_digest_blake2sp_free, rm_digest_blake2sp_update, rm_digest_blake2sp_copy}; - + g_assert(digest->bytes==ALGO_BIG##_OUTBYTES); \ +} \ + \ +static void rm_digest_##ALGO##_free(RmDigest *digest) { \ + g_slice_free(ALGO##_state, digest->ALGO##_state); \ +} \ + \ +static void rm_digest_##ALGO##_update(RmDigest *digest, \ + const unsigned char *data, \ + RmOff size) { \ + ALGO##_update(digest->ALGO##_state, data, size); \ +} \ + \ +static void rm_digest_##ALGO##_copy(RmDigest *digest, \ + RmDigest *copy) { \ + copy->ALGO##_state = g_slice_copy(sizeof(ALGO##_state), \ + 
digest->ALGO##_state); \ +} \ + \ +static void rm_digest_##ALGO##_steal(RmDigest *digest, \ + guint8 *result) { \ + ALGO##_state *copy = g_slice_copy(sizeof(ALGO##_state), \ + digest->ALGO##_state); \ + ALGO##_final(copy, result, digest->bytes); \ + g_slice_free(ALGO##_state, copy); \ +} + + + +CREATE_BLAKE_FUNCS(blake2b, BLAKE2B); +CREATE_BLAKE_FUNCS(blake2bp, BLAKE2B); +CREATE_BLAKE_FUNCS(blake2s, BLAKE2S); +CREATE_BLAKE_FUNCS(blake2sp, BLAKE2S); + +#define BLAKE_FUNCS(ALGO) rm_digest_##ALGO##_init, rm_digest_##ALGO##_free, rm_digest_##ALGO##_update, rm_digest_##ALGO##_copy, rm_digest_##ALGO##_steal + +static const RmDigestSpec blake2b_spec = {512, BLAKE_FUNCS(blake2b)}; +static const RmDigestSpec blake2bp_spec = {512, BLAKE_FUNCS(blake2bp)}; +static const RmDigestSpec blake2s_spec = {256, BLAKE_FUNCS(blake2s)}; +static const RmDigestSpec blake2sp_spec = {256, BLAKE_FUNCS(blake2sp)}; /////////////////////////// // ext hash // @@ -482,7 +474,7 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm } } -static const RmDigestSpec ext_spec = {0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy}; +static const RmDigestSpec ext_spec = {0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy, NULL}; /////////////////////////// @@ -512,7 +504,7 @@ static void rm_digest_paranoid_free(RmDigest *digest) { /* Note: paranoid update implementation is in rm_digest_buffered_update() below */ -static const RmDigestSpec paranoid_spec = {0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL}; +static const RmDigestSpec paranoid_spec = {0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL, NULL}; //////////////////////////////// @@ -810,54 +802,15 @@ static gboolean rm_digest_needs_steal(RmDigestType digest_type) { } } -#define BLAKE_STEAL(ALGO) \ - { \ - RmDigest *copy = rm_digest_copy(digest); \ - ALGO##_final(copy->ALGO##_state, result, buflen); \ - 
rm_assert_gentle(buflen == digest->bytes); \ - rm_digest_free(copy); \ - } - guint8 *rm_digest_steal(RmDigest *digest) { - guint8 *result = g_slice_alloc0(digest->bytes); - gsize buflen = digest->bytes; - if(rm_digest_needs_steal(digest->type)) { - /* reading the digest is destructive, so we need to take a copy */ - switch(digest->type) { - case RM_DIGEST_SHA3_256: - case RM_DIGEST_SHA3_384: - case RM_DIGEST_SHA3_512: { - RmDigest *copy = rm_digest_copy(digest); - memcpy(result, sha3_Finalize(copy->sha3_ctx), digest->bytes); - rm_assert_gentle(buflen == digest->bytes); - rm_digest_free(copy); - break; - } - case RM_DIGEST_BLAKE2S: - BLAKE_STEAL(blake2s); - break; - case RM_DIGEST_BLAKE2B: - BLAKE_STEAL(blake2b); - break; - case RM_DIGEST_BLAKE2SP: - BLAKE_STEAL(blake2sp); - break; - case RM_DIGEST_BLAKE2BP: - BLAKE_STEAL(blake2bp); - break; - default: { - RmDigest *copy = rm_digest_copy(digest); - g_checksum_get_digest(copy->glib_checksum, result, &buflen); - rm_assert_gentle(buflen == digest->bytes); - rm_digest_free(copy); - break; - } - } - } else { - /* Stateless checksum, just copy it. 
*/ - memcpy(result, digest->checksum, digest->bytes); + const RmDigestSpec *spec = digest_specs[digest->type]; + if(!spec->steal) { + return g_slice_copy(digest->bytes, digest->checksum); } + + guint8 *result = g_slice_alloc0(digest->bytes); + spec->steal(digest, result); return result; } From 435c542bad3b3aa9ced534996e92d61107f049a4 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 18:44:12 +1000 Subject: [PATCH 084/180] checksum: switch to a farmhash implementation that supports multiple data increments --- lib/checksum.c | 4 +- lib/checksum.h | 2 + lib/checksums/cfarmhash.c | 200 ----- lib/checksums/cfarmhash.h | 6 - lib/checksums/farmhash.c | 1651 +++++++++++++++++++++++++++++++++++++ lib/checksums/farmhash.h | 166 ++++ 6 files changed, 1820 insertions(+), 209 deletions(-) delete mode 100644 lib/checksums/cfarmhash.c delete mode 100644 lib/checksums/cfarmhash.h create mode 100644 lib/checksums/farmhash.c create mode 100644 lib/checksums/farmhash.h diff --git a/lib/checksum.c b/lib/checksum.c index 4352086b..2facf7f8 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -42,7 +42,6 @@ #include "checksum.h" #include "checksums/blake2/blake2.h" -#include "checksums/cfarmhash.h" #include "checksums/city.h" #include "checksums/citycrc.h" #include "checksums/murmur3.h" @@ -228,8 +227,7 @@ static const RmDigestSpec xxhash_spec = {64, GENERIC_FUNCS(xxhash)}; /////////////////////////// static void rm_digest_farmhash_update(RmDigest *digest, const unsigned char *data, RmOff size) { - /* TODO: this won't work, it's not cumulative */ - digest->checksum->first = cfarmhash((const char *)data, size); + *digest->farmhash = farmhash128_with_seed((const char*)data, size, *digest->farmhash); } static const RmDigestSpec farmhash_spec = {64, GENERIC_FUNCS(farmhash)}; diff --git a/lib/checksum.h b/lib/checksum.h index 16ebfb50..45050212 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -32,6 +32,7 @@ #include "checksums/blake2/blake2.h" #include 
"checksums/sha3/sha3.h" +#include "checksums/farmhash.h" typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, @@ -102,6 +103,7 @@ typedef struct RmDigest { blake2bp_state *blake2bp_state; sha3_context *sha3_ctx; RmUint128 *checksum; + uint128_t *farmhash; RmParanoid *paranoid; guint8 *data; }; diff --git a/lib/checksums/cfarmhash.c b/lib/checksums/cfarmhash.c deleted file mode 100644 index a637884d..00000000 --- a/lib/checksums/cfarmhash.c +++ /dev/null @@ -1,200 +0,0 @@ -#include -#include -#include - -#define bswap32(x) __builtin_bswap32(x) -#define bswap64(x) __builtin_bswap64(x) - -#ifdef FARMHASH_BIG_ENDIAN -#define uint32_in_expected_order(x) bswap32(x) -#define uint64_in_expected_order(x) bswap64(x) -#else -#define uint32_in_expected_order(x) (x) -#define uint64_in_expected_order(x) (x) -#endif - -// Some primes between 2^63 and 2^64 for various uses. - -static const uint64_t k0 = 0xc3a5c85c97cb3127ULL; -static const uint64_t k1 = 0xb492b66fbe98f273ULL; -static const uint64_t k2 = 0x9ae16a3b2f90404fULL; - -static inline uint32_t fetch32(const char *p) { - uint32_t result; - - memcpy(&result, p, sizeof(result)); - - return uint32_in_expected_order(result); -} - -static inline uint64_t fetch64(const char *p) { - uint64_t result; - - memcpy(&result, p, sizeof(result)); - - return uint64_in_expected_order(result); -} - -static inline uint64_t shift_mix(uint64_t v) { - return v ^ (v >> 47); -} - -static inline uint64_t rotate64(uint64_t v, int shift) { - return ((v >> shift) | (v << (64 - shift))); -} - -static inline uint64_t hash_len_16(uint64_t u, uint64_t v, uint64_t mul) { - uint64_t a, b; - - a = (u ^ v) * mul; - a ^= (a >> 47); - b = (v ^ a) * mul; - b ^= (b >> 47); - b *= mul; - - return b; -} - -static inline uint64_t hash_len_0_to_16(const char *s, size_t len) { - if(len >= 8) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch64(s) + k2; - uint64_t b = fetch64(s + len - 8); - uint64_t c = rotate64(b, 37) * mul + a; - uint64_t d = (rotate64(a, 25) + b) 
* mul; - return hash_len_16(c, d, mul); - } - - if(len >= 4) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch32(s); - return hash_len_16(len + (a << 3), fetch32(s + len - 4), mul); - } - - if(len > 0) { - uint8_t a = s[0]; - uint8_t b = s[len >> 1]; - uint8_t c = s[len - 1]; - uint32_t y = (uint32_t)a + ((uint32_t)b << 8); - uint32_t z = len + ((uint32_t)c << 2); - return shift_mix(y * k2 ^ z * k0) * k2; - } - - return k2; -} - -static inline uint64_t hash_len_17_to_32(const char *s, size_t len) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch64(s) * k1; - uint64_t b = fetch64(s + 8); - uint64_t c = fetch64(s + len - 8) * mul; - uint64_t d = fetch64(s + len - 16) * k2; - - return hash_len_16(rotate64(a + b, 43) + rotate64(c, 30) + d, - a + rotate64(b + k2, 18) + c, mul); -} - -static inline uint64_t hash_len_33_to_64(const char *s, size_t len) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch64(s) * k2; - uint64_t b = fetch64(s + 8); - uint64_t c = fetch64(s + len - 8) * mul; - uint64_t d = fetch64(s + len - 16) * k2; - uint64_t y = rotate64(a + b, 43) + rotate64(c, 30) + d; - uint64_t z = hash_len_16(y, a + rotate64(b + k2, 18) + c, mul); - uint64_t e = fetch64(s + 16) * mul; - uint64_t f = fetch64(s + 24); - uint64_t g = (y + fetch64(s + len - 32)) * mul; - uint64_t h = (z + fetch64(s + len - 24)) * mul; - - return hash_len_16(rotate64(e + f, 43) + rotate64(g, 30) + h, - e + rotate64(f + a, 18) + g, mul); -} - -#define swap(x, y) \ - do { \ - (x) = (x) ^ (y); \ - (y) = (x) ^ (y); \ - (x) = (x) ^ y; \ - } while(0); - -typedef struct pair64 pair64; - -struct pair64 { - uint64_t first; - uint64_t second; -}; - -static inline pair64 weak_hash_len_32_with_seeds2(uint64_t w, uint64_t x, uint64_t y, - uint64_t z, uint64_t a, uint64_t b) { - uint64_t c; - pair64 result; - - a += w; - b = rotate64(b + a + z, 21); - c = a; - a += x; - a += y; - b += rotate64(a, 44); - result.first = a + z; - result.second = b + c; - - return result; -} - -static inline pair64 
weak_hash_len_32_with_seeds(const char *s, uint64_t a, uint64_t b) { - return weak_hash_len_32_with_seeds2(fetch64(s), fetch64(s + 8), fetch64(s + 16), - fetch64(s + 24), a, b); -} - -uint64_t cfarmhash(const char *s, size_t len) { - uint64_t mul; - const uint64_t seed = 81; - - if(len <= 16) - return hash_len_0_to_16(s, len); - - if(len <= 32) - return hash_len_17_to_32(s, len); - - if(len <= 64) - return hash_len_33_to_64(s, len); - - uint64_t x = seed, y = seed * k1 + 113, z = shift_mix(y * k2 + 113) * k2; - pair64 v = {0, 0}, w = {0, 0}; - - x = x * k2 + fetch64(s); - - const char *end = s + ((len - 1) / 64) * 64; - const char *last64 = end + ((len - 1) & 63) - 63; - - do { - x = rotate64(x + y + v.first + fetch64(s + 8), 37) * k1; - y = rotate64(y + v.second + fetch64(s + 48), 42) * k1; - x ^= w.second; - y += v.first + fetch64(s + 40); - z = rotate64(z + w.first, 33) * k1; - v = weak_hash_len_32_with_seeds(s, v.second * k1, x + w.first); - w = weak_hash_len_32_with_seeds(s + 32, z + w.second, y + fetch64(s + 16)); - swap(z, x); - s += 64; - } while(s != end); - - mul = k1 + ((z & 0xff) << 1); - s = last64; - w.first += ((len - 1) & 63); - v.first += w.first; - w.first += v.first; - x = rotate64(x + y + v.first + fetch64(s + 8), 37) * mul; - y = rotate64(y + v.second + fetch64(s + 48), 42) * mul; - x ^= w.second * 9; - y += v.first * 9 + fetch64(s + 40); - z = rotate64(z + w.first, 33) * mul; - v = weak_hash_len_32_with_seeds(s, v.second * mul, x + w.first); - w = weak_hash_len_32_with_seeds(s + 32, z + w.second, y + fetch64(s + 16)); - swap(z, x); - - return hash_len_16(hash_len_16(v.first, w.first, mul) + shift_mix(y) * k0 + z, - hash_len_16(v.second, w.second, mul) + x, - mul); -} diff --git a/lib/checksums/cfarmhash.h b/lib/checksums/cfarmhash.h deleted file mode 100644 index a2b95dca..00000000 --- a/lib/checksums/cfarmhash.h +++ /dev/null @@ -1,6 +0,0 @@ -#ifndef SHEEPHASH_H_INCLUDED -#define SHEEPHASH_H_INCLUDED - -uint64_t cfarmhash(const char *, 
size_t); - -#endif /* SHEEPHASH_H_INCLUDED */ diff --git a/lib/checksums/farmhash.c b/lib/checksums/farmhash.c new file mode 100644 index 00000000..cc26abb1 --- /dev/null +++ b/lib/checksums/farmhash.c @@ -0,0 +1,1651 @@ +// Copyright (c) 2014 Google, Inc. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in +// all copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +// THE SOFTWARE. +// +// FarmHash, by Geoff Pike + +#include "farmhash.h" + +#include + +#include + +// PLATFORM-SPECIFIC CONFIGURATION + +#if defined (__x86_64) || defined (__x86_64__) +#define x86_64 1 +#else +#define x86_64 0 +#endif + +#if defined(__i386__) || defined(__i386) || defined(__X86__) +#define x86 1 +#else +#define x86 x86_64 +#endif + +#if defined(__SSSE3__) +#include +#define CAN_USE_SSSE3 1 // Now we can use _mm_hsub_epi16 and so on. +#else +#define CAN_USE_SSSE3 0 +#endif + +#if defined(__SSE4_1__) +#include +#define CAN_USE_SSE41 1 // Now we can use _mm_insert_epi64 and so on. 
+#else +#define CAN_USE_SSE41 0 +#endif + +#if defined(__SSE4_2__) +#include +#define CAN_USE_SSE42 1 // Now we can use _mm_crc32_u{32,16,8}. And on 64-bit platforms, _mm_crc32_u64. +#else +#define CAN_USE_SSE42 0 +#endif + +#if defined(__AES__) +#include +#define CAN_USE_AESNI 1 // Now we can use _mm_aesimc_si128 and so on. +#else +#define CAN_USE_AESNI 0 +#endif + +#if defined(__AVX__) +#include +#define CAN_USE_AVX 1 +#else +#define CAN_USE_AVX 0 +#endif + +#define likely(x) (__builtin_expect(!!(x), 1)) + +#ifdef LITTLE_ENDIAN +#define uint32_t_in_expected_order(x) (x) +#define uint64_t_in_expected_order(x) (x) +#else +#define uint32_t_in_expected_order(x) (bswap32(x)) +#define uint64_t_in_expected_order(x) (bswap64(x)) +#endif + +#define PERMUTE3(a, b, c) \ + do { \ + swap32(a, b); \ + swap32(a, c); \ + } while (0) + +static inline uint32_t bswap32(const uint32_t x) { + uint32_t y = x; + + for (size_t i = 0; i < sizeof(uint32_t) >> 1; i++) { + + uint32_t d = sizeof(uint32_t) - i - 1; + + uint32_t mh = ((uint32_t)0xff) << (d << 3); + uint32_t ml = ((uint32_t)0xff) << (i << 3); + + uint32_t h = x & mh; + uint32_t l = x & ml; + + uint64_t t = (l << ((d - i) << 3)) | (h >> ((d - i) << 3)); + + y = t | (y & ~(mh | ml)); + } + + return y; +} + +static inline uint64_t bswap64(const uint64_t x) { + uint64_t y = x; + + for (size_t i = 0; i < sizeof(uint64_t) >> 1; i++) { + + uint64_t d = sizeof(uint64_t) - i - 1; + + uint64_t mh = ((uint64_t)0xff) << (d << 3); + uint64_t ml = ((uint64_t)0xff) << (i << 3); + + uint64_t h = x & mh; + uint64_t l = x & ml; + + uint64_t t = (l << ((d - i) << 3)) | (h >> ((d - i) << 3)); + + y = t | (y & ~(mh | ml)); + } + + return y; +} + +static inline uint64_t fetch64(const char* p) { + uint64_t result; + memcpy(&result, p, sizeof(result)); + + return uint64_t_in_expected_order(result); +} + +static inline uint32_t fetch32(const char* p) { + uint32_t result; + memcpy(&result, p, sizeof(result)); + + return 
uint32_t_in_expected_order(result); +} + +#if CAN_USE_SSSE3 || CAN_USE_SSE41 || CAN_USE_SSE42 || CAN_USE_AESNI || CAN_USE_AVX + +static inline __m128i fetch128(const char* s) { + return _mm_loadu_si128((const __m128i*) s); +} + +#endif + +static inline void swap32(uint32_t* a, uint32_t* b) { + uint32_t t; + + t = *a; + *a = *b; + *b = t; +} + +static inline void swap64(uint64_t* a, uint64_t* b) { + uint64_t t; + + t = *a; + *a = *b; + *b = t; +} + +#if CAN_USE_SSSE3 || CAN_USE_SSE41 || CAN_USE_SSE42 || CAN_USE_AESNI || CAN_USE_AVX + +static inline void swap128(__m128i* a, __m128i* b) { + __m128i t; + + t = *a; + *a = *b; + *b = t; +} + +#endif + +static inline uint32_t ror32(uint32_t val, size_t shift) { + // Avoid shifting by 32: doing so yields an undefined result. + return shift == 0 ? val : (val >> shift) | (val << (32 - shift)); +} + +static inline uint64_t ror64(uint64_t val, size_t shift) { + // Avoid shifting by 64: doing so yields an undefined result. + return shift == 0 ? val : (val >> shift) | (val << (64 - shift)); +} + +// Helpers for data-parallel operations (1x 128 bits or 2x64 or 4x32 or 8x16). 
+ +#if CAN_USE_SSSE3 || CAN_USE_SSE41 || CAN_USE_SSE42 || CAN_USE_AESNI || CAN_USE_AVX + +static inline __m128i add64x2(__m128i x, __m128i y) { return _mm_add_epi64(x, y); } +static inline __m128i add32x4(__m128i x, __m128i y) { return _mm_add_epi32(x, y); } + +static inline __m128i xor128(__m128i x, __m128i y) { return _mm_xor_si128(x, y); } +static inline __m128i or128(__m128i x, __m128i y) { return _mm_or_si128(x, y); } + +static inline __m128i mul32x4_5(__m128i x) { return add32x4(x, _mm_slli_epi32(x, 2)); } + +static inline __m128i rol32x4(__m128i x, int c) { + return or128(_mm_slli_epi32(x, c), + _mm_srli_epi32(x, 32 - c)); +} + +static inline __m128i rol32x4_17(__m128i x) { return rol32x4(x, 17); } +static inline __m128i rol32x4_19(__m128i x) { return rol32x4(x, 19); } + +static inline __m128i shuf32x4_0_3_2_1(__m128i x) { + return _mm_shuffle_epi32(x, (0 << 6) + (3 << 4) + (2 << 2) + (1 << 0)); +} + +#endif + +#if CAN_USE_SSSE3 + +static inline __m128i shuf8x16(__m128i x, __m128i y) { return _mm_shuffle_epi8(y, x); } + +#endif + +#if CAN_USE_SSE41 + +static inline __m128i mul32x4(__m128i x, __m128i y) { return _mm_mullo_epi32(x, y); } + +static inline __m128i murk(__m128i a, __m128i b, __m128i c, __m128i d, __m128i e) { + + return add32x4(e, + mul32x4_5( + rol32x4_19( + xor128( + mul32x4(d, + rol32x4_17( + mul32x4(c, a))), + (b))))); +} + +#endif + +// Building blocks for hash functions + +// Some primes between 2^63 and 2^64 for various uses. +static const uint64_t k0 = 0xc3a5c85c97cb3127ULL; +static const uint64_t k1 = 0xb492b66fbe98f273ULL; +static const uint64_t k2 = 0x9ae16a3b2f90404fULL; + +// Magic numbers for 32-bit hashing. Copied from Murmur3. +static const uint32_t c1 = 0xcc9e2d51; +static const uint32_t c2 = 0x1b873593; + +// A 32-bit to 32-bit integer hash copied from Murmur3. 
+static inline uint32_t fmix(uint32_t h) { + h ^= h >> 16; + h *= 0x85ebca6b; + h ^= h >> 13; + h *= 0xc2b2ae35; + h ^= h >> 16; + return h; +} + +static inline uint64_t smix(uint64_t val) { + return val ^ (val >> 47); +} + +static inline uint32_t mur(uint32_t a, uint32_t h) { + // Helper from Murmur3 for combining two 32-bit values. + a *= c1; + a = ror32(a, 17); + a *= c2; + h ^= a; + h = ror32(h, 19); + return h * 5 + 0xe6546b64; +} + +static inline uint32_t debug_tweak32(uint32_t x) { +#ifndef NDEBUG + x = ~bswap32(x * c1); +#endif + + return x; +} + +static inline uint64_t debug_tweak64(uint64_t x) { +#ifndef NDEBUG + x = ~bswap64(x * k1); +#endif + + return x; +} + +uint128_t debug_tweak128(uint128_t x) { +#ifndef NDEBUG + uint64_t y = debug_tweak64(uint128_t_low64(x)); + uint64_t z = debug_tweak64(uint128_t_high64(x)); + y += z; + z += y; + x = make_uint128_t(y, z * k1); +#endif + + return x; +} + +static inline uint64_t farmhash_len_16(uint64_t u, uint64_t v) { + return farmhash128_to_64(make_uint128_t(u, v)); +} + +static inline uint64_t farmhash_len_16_mul(uint64_t u, uint64_t v, uint64_t mul) { + // Murmur-inspired hashing. 
+ uint64_t a = (u ^ v) * mul; + a ^= (a >> 47); + uint64_t b = (v ^ a) * mul; + b ^= (b >> 47); + b *= mul; + return b; +} + +// farmhash na + +static inline uint64_t farmhash_na_len_0_to_16(const char *s, size_t len) { + if (len >= 8) { + uint64_t mul = k2 + len * 2; + uint64_t a = fetch64(s) + k2; + uint64_t b = fetch64(s + len - 8); + uint64_t c = ror64(b, 37) * mul + a; + uint64_t d = (ror64(a, 25) + b) * mul; + return farmhash_len_16_mul(c, d, mul); + } + if (len >= 4) { + uint64_t mul = k2 + len * 2; + uint64_t a = fetch32(s); + return farmhash_len_16_mul(len + (a << 3), fetch32(s + len - 4), mul); + } + if (len > 0) { + uint8_t a = s[0]; + uint8_t b = s[len >> 1]; + uint8_t c = s[len - 1]; + uint32_t y = (uint32_t) a + ((uint32_t) b << 8); + uint32_t z = len + ((uint32_t) c << 2); + return smix(y * k2 ^ z * k0) * k2; + } + return k2; +} + +// This probably works well for 16-byte strings as well, but it may be overkill +// in that case. +static inline uint64_t farmhash_na_len_17_to_32(const char *s, size_t len) { + uint64_t mul = k2 + len * 2; + uint64_t a = fetch64(s) * k1; + uint64_t b = fetch64(s + 8); + uint64_t c = fetch64(s + len - 8) * mul; + uint64_t d = fetch64(s + len - 16) * k2; + return farmhash_len_16_mul(ror64(a + b, 43) + ror64(c, 30) + d, + a + ror64(b + k2, 18) + c, mul); +} + +// Return a 16-byte hash for 48 bytes. Quick and dirty. +// Callers do best to use "random-looking" values for a and b. +static inline uint128_t weak_farmhash_na_len_32_with_seeds_vals( + uint64_t w, uint64_t x, uint64_t y, uint64_t z, uint64_t a, uint64_t b) { + a += w; + b = ror64(b + a + z, 21); + uint64_t c = a; + a += x; + a += y; + b += ror64(a, 44); + return make_uint128_t(a + z, b + c); +} + +// Return a 16-byte hash for s[0] ... s[31], a, and b. Quick and dirty. 
+static inline uint128_t weak_farmhash_na_len_32_with_seeds( + const char* s, uint64_t a, uint64_t b) { + return weak_farmhash_na_len_32_with_seeds_vals(fetch64(s), + fetch64(s + 8), + fetch64(s + 16), + fetch64(s + 24), + a, + b); +} + +// Return an 8-byte hash for 33 to 64 bytes. +static inline uint64_t farmhash_na_len_33_to_64(const char *s, size_t len) { + uint64_t mul = k2 + len * 2; + uint64_t a = fetch64(s) * k2; + uint64_t b = fetch64(s + 8); + uint64_t c = fetch64(s + len - 8) * mul; + uint64_t d = fetch64(s + len - 16) * k2; + uint64_t y = ror64(a + b, 43) + ror64(c, 30) + d; + uint64_t z = farmhash_len_16_mul(y, a + ror64(b + k2, 18) + c, mul); + uint64_t e = fetch64(s + 16) * mul; + uint64_t f = fetch64(s + 24); + uint64_t g = (y + fetch64(s + len - 32)) * mul; + uint64_t h = (z + fetch64(s + len - 24)) * mul; + return farmhash_len_16_mul(ror64(e + f, 43) + ror64(g, 30) + h, + e + ror64(f + a, 18) + g, mul); +} + +uint64_t farmhash64_na(const char *s, size_t len) { + const uint64_t seed = 81; + if (len <= 32) { + if (len <= 16) { + return farmhash_na_len_0_to_16(s, len); + } else { + return farmhash_na_len_17_to_32(s, len); + } + } else if (len <= 64) { + return farmhash_na_len_33_to_64(s, len); + } + + // For strings over 64 bytes we loop. Internal state consists of + // 56 bytes: v, w, x, y, and z. + uint64_t x = seed; + uint64_t y = seed * k1 + 113; + uint64_t z = smix(y * k2 + 113) * k2; + uint128_t v = make_uint128_t(0, 0); + uint128_t w = make_uint128_t(0, 0); + x = x * k2 + fetch64(s); + + // Set end so that after the loop we have 1 to 64 bytes left to process. 
+ const char* end = s + ((len - 1) / 64) * 64; + const char* last64 = end + ((len - 1) & 63) - 63; + assert(s + len - 64 == last64); + do { + x = ror64(x + y + v.a + fetch64(s + 8), 37) * k1; + y = ror64(y + v.b + fetch64(s + 48), 42) * k1; + x ^= w.b; + y += v.a + fetch64(s + 40); + z = ror64(z + w.a, 33) * k1; + v = weak_farmhash_na_len_32_with_seeds(s, v.b * k1, x + w.a); + w = weak_farmhash_na_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); + swap64(&z, &x); + s += 64; + } while (s != end); + uint64_t mul = k1 + ((z & 0xff) << 1); + // Make s point to the last 64 bytes of input. + s = last64; + w.a += ((len - 1) & 63); + v.a += w.a; + w.a += v.a; + x = ror64(x + y + v.a + fetch64(s + 8), 37) * mul; + y = ror64(y + v.b + fetch64(s + 48), 42) * mul; + x ^= w.b * 9; + y += v.a * 9 + fetch64(s + 40); + z = ror64(z + w.a, 33) * mul; + v = weak_farmhash_na_len_32_with_seeds(s, v.b * mul, x + w.a); + w = weak_farmhash_na_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); + swap64(&z, &x); + return farmhash_len_16_mul(farmhash_len_16_mul(v.a, w.a, mul) + smix(y) * k0 + z, + farmhash_len_16_mul(v.b, w.b, mul) + x, + mul); +} + +uint64_t farmhash64_na_with_seeds(const char *s, size_t len, uint64_t seed0, uint64_t seed1) { + return farmhash_len_16(farmhash64_na(s, len) - seed0, seed1); +} + +uint64_t farmhash64_na_with_seed(const char *s, size_t len, uint64_t seed) { + return farmhash64_na_with_seeds(s, len, k2, seed); +} + +// farmhash uo + +static inline uint64_t farmhash_uo_h(uint64_t x, uint64_t y, uint64_t mul, int r) { + uint64_t a = (x ^ y) * mul; + a ^= (a >> 47); + uint64_t b = (y ^ a) * mul; + return ror64(b, r) * mul; +} + +uint64_t farmhash64_uo_with_seeds(const char *s, size_t len, + uint64_t seed0, uint64_t seed1) { + if (len <= 64) { + return farmhash64_na_with_seeds(s, len, seed0, seed1); + } + + // For strings over 64 bytes we loop. Internal state consists of + // 64 bytes: u, v, w, x, y, and z. 
+ uint64_t x = seed0; + uint64_t y = seed1 * k2 + 113; + uint64_t z = smix(y * k2) * k2; + uint128_t v = make_uint128_t(seed0, seed1); + uint128_t w = make_uint128_t(0, 0); + uint64_t u = x - z; + x *= k2; + uint64_t mul = k2 + (u & 0x82); + + // Set end so that after the loop we have 1 to 64 bytes left to process. + const char* end = s + ((len - 1) / 64) * 64; + const char* last64 = end + ((len - 1) & 63) - 63; + assert(s + len - 64 == last64); + do { + uint64_t a0 = fetch64(s); + uint64_t a1 = fetch64(s + 8); + uint64_t a2 = fetch64(s + 16); + uint64_t a3 = fetch64(s + 24); + uint64_t a4 = fetch64(s + 32); + uint64_t a5 = fetch64(s + 40); + uint64_t a6 = fetch64(s + 48); + uint64_t a7 = fetch64(s + 56); + x += a0 + a1; + y += a2; + z += a3; + v.a += a4; + v.b += a5 + a1; + w.a += a6; + w.b += a7; + + x = ror64(x, 26); + x *= 9; + y = ror64(y, 29); + z *= mul; + v.a = ror64(v.a, 33); + v.b = ror64(v.b, 30); + w.a ^= x; + w.a *= 9; + z = ror64(z, 32); + z += w.b; + w.b += z; + z *= 9; + swap64(&u, &y); + + z += a0 + a6; + v.a += a2; + v.b += a3; + w.a += a4; + w.b += a5 + a6; + x += a1; + y += a7; + + y += v.a; + v.a += x - y; + v.b += w.a; + w.a += v.b; + w.b += x - y; + x += w.b; + w.b = ror64(w.b, 34); + swap64(&u, &z); + s += 64; + } while (s != end); + // Make s point to the last 64 bytes of input. 
+ s = last64; + u *= 9; + v.b = ror64(v.b, 28); + v.a = ror64(v.a, 20); + w.a += ((len - 1) & 63); + u += y; + y += u; + x = ror64(y - x + v.a + fetch64(s + 8), 37) * mul; + y = ror64(y ^ v.b ^ fetch64(s + 48), 42) * mul; + x ^= w.b * 9; + y += v.a + fetch64(s + 40); + z = ror64(z + w.a, 33) * mul; + v = weak_farmhash_na_len_32_with_seeds(s, v.b * mul, x + w.a); + w = weak_farmhash_na_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); + return farmhash_uo_h(farmhash_len_16_mul(v.a + x, w.a ^ y, mul) + z - u, + farmhash_uo_h(v.b + y, w.b + z, k2, 30) ^ x, + k2, + 31); +} + +uint64_t farmhash64_uo_with_seed(const char *s, size_t len, uint64_t seed) { + return len <= 64 ? farmhash64_na_with_seed(s, len, seed) : + farmhash64_uo_with_seeds(s, len, 0, seed); +} + +uint64_t farmhash64_uo(const char *s, size_t len) { + return len <= 64 ? farmhash64_na(s, len) : + farmhash64_uo_with_seeds(s, len, 81, 0); +} + +// farmhash xo + +static inline uint64_t farmhash_xo_h32(const char *s, size_t len, uint64_t mul, + uint64_t seed0, uint64_t seed1) { + uint64_t a = fetch64(s) * k1; + uint64_t b = fetch64(s + 8); + uint64_t c = fetch64(s + len - 8) * mul; + uint64_t d = fetch64(s + len - 16) * k2; + uint64_t u = ror64(a + b, 43) + ror64(c, 30) + d + seed0; + uint64_t v = a + ror64(b + k2, 18) + c + seed1; + a = smix((u ^ v) * mul); + b = smix((v ^ a) * mul); + return b; +} + +// Return an 8-byte hash for 33 to 64 bytes. +static inline uint64_t farmhash_xo_len_33_to_64(const char *s, size_t len) { + uint64_t mul0 = k2 - 30; + uint64_t mul1 = k2 - 30 + 2 * len; + uint64_t h0 = farmhash_xo_h32(s, 32, mul0, 0, 0); + uint64_t h1 = farmhash_xo_h32(s + len - 32, 32, mul1, 0, 0); + return ((h1 * mul1) + h0) * mul1; +} + +// Return an 8-byte hash for 65 to 96 bytes. 
+static inline uint64_t farmhash_xo_len_65_to_96(const char *s, size_t len) { + uint64_t mul0 = k2 - 114; + uint64_t mul1 = k2 - 114 + 2 * len; + uint64_t h0 = farmhash_xo_h32(s, 32, mul0, 0, 0); + uint64_t h1 = farmhash_xo_h32(s + 32, 32, mul1, 0, 0); + uint64_t h2 = farmhash_xo_h32(s + len - 32, 32, mul1, h0, h1); + return (h2 * 9 + (h0 >> 17) + (h1 >> 21)) * mul1; +} + +uint64_t farmhash64_xo(const char *s, size_t len) { + if (len <= 32) { + if (len <= 16) { + return farmhash_na_len_0_to_16(s, len); + } else { + return farmhash_na_len_17_to_32(s, len); + } + } else if (len <= 64) { + return farmhash_xo_len_33_to_64(s, len); + } else if (len <= 96) { + return farmhash_xo_len_65_to_96(s, len); + } else if (len <= 256) { + return farmhash64_na(s, len); + } else { + return farmhash64_uo(s, len); + } +} + +uint64_t farmhash64_xo_with_seeds(const char *s, size_t len, uint64_t seed0, uint64_t seed1) { + return farmhash64_uo_with_seeds(s, len, seed0, seed1); +} + +uint64_t farmhash64_xo_with_seed(const char *s, size_t len, uint64_t seed) { + return farmhash64_uo_with_seed(s, len, seed); +} + +// farmhash te + +#if x86_64 && CAN_USE_SSSE3 && CAN_USE_SSE41 + +// Requires n >= 256. Requires SSE4.1. Should be slightly faster if the +// compiler uses AVX instructions (e.g., use the -mavx flag with GCC). 
+static inline uint64_t farmhash64_te_long(const char* s, size_t n, + uint64_t seed0, uint64_t seed1) { + const __m128i k_shuf = + _mm_set_epi8(4, 11, 10, 5, 8, 15, 6, 9, 12, 2, 14, 13, 0, 7, 3, 1); + const __m128i k_mult = + _mm_set_epi8(0xbd, 0xd6, 0x33, 0x39, 0x45, 0x54, 0xfa, 0x03, + 0x34, 0x3e, 0x33, 0xed, 0xcc, 0x9e, 0x2d, 0x51); + uint64_t seed2 = (seed0 + 113) * (seed1 + 9); + uint64_t seed3 = (ror64(seed0, 23) + 27) * (ror64(seed1, 30) + 111); + __m128i d0 = _mm_cvtsi64_si128(seed0); + __m128i d1 = _mm_cvtsi64_si128(seed1); + __m128i d2 = shuf8x16(k_shuf, d0); + __m128i d3 = shuf8x16(k_shuf, d1); + __m128i d4 = xor128(d0, d1); + __m128i d5 = xor128(d1, d2); + __m128i d6 = xor128(d2, d4); + __m128i d7 = _mm_set1_epi32(seed2 >> 32); + __m128i d8 = mul32x4(k_mult, d2); + __m128i d9 = _mm_set1_epi32(seed3 >> 32); + __m128i d10 = _mm_set1_epi32(seed3); + __m128i d11 = add64x2(d2, _mm_set1_epi32(seed2)); + const char* end = s + (n & ~((size_t) 255)); + do { + __m128i z; + z = fetch128(s); + d0 = add64x2(d0, z); + d1 = shuf8x16(k_shuf, d1); + d2 = xor128(d2, d0); + d4 = xor128(d4, z); + d4 = xor128(d4, d1); + swap128(&d0, &d6); + z = fetch128(s + 16); + d5 = add64x2(d5, z); + d6 = shuf8x16(k_shuf, d6); + d8 = shuf8x16(k_shuf, d8); + d7 = xor128(d7, d5); + d0 = xor128(d0, z); + d0 = xor128(d0, d6); + swap128(&d5, &d11); + z = fetch128(s + 32); + d1 = add64x2(d1, z); + d2 = shuf8x16(k_shuf, d2); + d4 = shuf8x16(k_shuf, d4); + d5 = xor128(d5, z); + d5 = xor128(d5, d2); + swap128(&d10, &d4); + z = fetch128(s + 48); + d6 = add64x2(d6, z); + d7 = shuf8x16(k_shuf, d7); + d0 = shuf8x16(k_shuf, d0); + d8 = xor128(d8, d6); + d1 = xor128(d1, z); + d1 = add64x2(d1, d7); + z = fetch128(s + 64); + d2 = add64x2(d2, z); + d5 = shuf8x16(k_shuf, d5); + d4 = add64x2(d4, d2); + d6 = xor128(d6, z); + d6 = xor128(d6, d11); + swap128(&d8, &d2); + z = fetch128(s + 80); + d7 = xor128(d7, z); + d8 = shuf8x16(k_shuf, d8); + d1 = shuf8x16(k_shuf, d1); + d0 = add64x2(d0, d7); + d2 = 
add64x2(d2, z); + d2 = add64x2(d2, d8); + swap128(&d1, &d7); + z = fetch128(s + 96); + d4 = shuf8x16(k_shuf, d4); + d6 = shuf8x16(k_shuf, d6); + d8 = mul32x4(k_mult, d8); + d5 = xor128(d5, d11); + d7 = xor128(d7, z); + d7 = add64x2(d7, d4); + swap128(&d6, &d0); + z = fetch128(s + 112); + d8 = add64x2(d8, z); + d0 = shuf8x16(k_shuf, d0); + d2 = shuf8x16(k_shuf, d2); + d1 = xor128(d1, d8); + d10 = xor128(d10, z); + d10 = xor128(d10, d0); + swap128(&d11, &d5); + z = fetch128(s + 128); + d4 = add64x2(d4, z); + d5 = shuf8x16(k_shuf, d5); + d7 = shuf8x16(k_shuf, d7); + d6 = add64x2(d6, d4); + d8 = xor128(d8, z); + d8 = xor128(d8, d5); + swap128(&d4, &d10); + z = fetch128(s + 144); + d0 = add64x2(d0, z); + d1 = shuf8x16(k_shuf, d1); + d2 = add64x2(d2, d0); + d4 = xor128(d4, z); + d4 = xor128(d4, d1); + z = fetch128(s + 160); + d5 = add64x2(d5, z); + d6 = shuf8x16(k_shuf, d6); + d8 = shuf8x16(k_shuf, d8); + d7 = xor128(d7, d5); + d0 = xor128(d0, z); + d0 = xor128(d0, d6); + swap128(&d2, &d8); + z = fetch128(s + 176); + d1 = add64x2(d1, z); + d2 = shuf8x16(k_shuf, d2); + d4 = shuf8x16(k_shuf, d4); + d5 = mul32x4(k_mult, d5); + d5 = xor128(d5, z); + d5 = xor128(d5, d2); + swap128(&d7, &d1); + z = fetch128(s + 192); + d6 = add64x2(d6, z); + d7 = shuf8x16(k_shuf, d7); + d0 = shuf8x16(k_shuf, d0); + d8 = add64x2(d8, d6); + d1 = xor128(d1, z); + d1 = xor128(d1, d7); + swap128(&d0, &d6); + z = fetch128(s + 208); + d2 = add64x2(d2, z); + d5 = shuf8x16(k_shuf, d5); + d4 = xor128(d4, d2); + d6 = xor128(d6, z); + d6 = xor128(d6, d9); + swap128(&d5, &d11); + z = fetch128(s + 224); + d7 = add64x2(d7, z); + d8 = shuf8x16(k_shuf, d8); + d1 = shuf8x16(k_shuf, d1); + d0 = xor128(d0, d7); + d2 = xor128(d2, z); + d2 = xor128(d2, d8); + swap128(&d10, &d4); + z = fetch128(s + 240); + d3 = add64x2(d3, z); + d4 = shuf8x16(k_shuf, d4); + d6 = shuf8x16(k_shuf, d6); + d7 = mul32x4(k_mult, d7); + d5 = add64x2(d5, d3); + d7 = xor128(d7, z); + d7 = xor128(d7, d4); + swap128(&d3, &d9); + s += 256; + } 
while (s != end); + d6 = add64x2(mul32x4(k_mult, d6), _mm_cvtsi64_si128(n)); + if (n % 256 != 0) { + d7 = add64x2(_mm_shuffle_epi32(d8, (0 << 6) + (3 << 4) + (2 << 2) + (1 << 0)), d7); + d8 = add64x2(mul32x4(k_mult, d8), _mm_cvtsi64_si128(farmhash64_xo(s, n % 256))); + } + __m128i t[8]; + d0 = mul32x4(k_mult, shuf8x16(k_shuf, mul32x4(k_mult, d0))); + d3 = mul32x4(k_mult, shuf8x16(k_shuf, mul32x4(k_mult, d3))); + d9 = mul32x4(k_mult, shuf8x16(k_shuf, mul32x4(k_mult, d9))); + d1 = mul32x4(k_mult, shuf8x16(k_shuf, mul32x4(k_mult, d1))); + d0 = add64x2(d11, d0); + d3 = xor128(d7, d3); + d9 = add64x2(d8, d9); + d1 = add64x2(d10, d1); + d4 = add64x2(d3, d4); + d5 = add64x2(d9, d5); + d6 = xor128(d1, d6); + d2 = add64x2(d0, d2); + t[0] = d0; + t[1] = d3; + t[2] = d9; + t[3] = d1; + t[4] = d4; + t[5] = d5; + t[6] = d6; + t[7] = d2; + return farmhash64_xo((const char*) t, sizeof(t)); +} + +uint64_t farmhash64_te(const char *s, size_t len) { + // Empirically, farmhash xo seems faster until length 512. + return len >= 512 ? farmhash64_te_long(s, len, k2, k1) : farmhash64_xo(s, len); +} + +uint64_t farmhash64_te_with_seed(const char *s, size_t len, uint64_t seed) { + return len >= 512 ? farmhash64_te_long(s, len, k1, seed) : + farmhash64_xo_with_seed(s, len, seed); +} + +uint64_t farmhash64_te_with_seeds(const char *s, size_t len, uint64_t seed0, uint64_t seed1) { + return len >= 512 ? 
farmhash64_te_long(s, len, seed0, seed1) : + farmhash64_xo_with_seeds(s, len, seed0, seed1); +} + +#endif + +// farmhash nt + +#if x86_64 && CAN_USE_SSE41 + +uint32_t farmhash32_nt(const char *s, size_t len) { + return (uint32_t) farmhash64_te(s, len); +} + +uint32_t farmhash32_nt_with_seed(const char *s, size_t len, uint32_t seed) { + return (uint32_t) farmhash64_te_with_seed(s, len, seed); +} + +#endif + +// farmhash mk + +static inline uint32_t farmhash32_mk_len_13_to_24(const char *s, size_t len, uint32_t seed) { + uint32_t a = fetch32(s - 4 + (len >> 1)); + uint32_t b = fetch32(s + 4); + uint32_t c = fetch32(s + len - 8); + uint32_t d = fetch32(s + (len >> 1)); + uint32_t e = fetch32(s); + uint32_t f = fetch32(s + len - 4); + uint32_t h = d * c1 + len + seed; + a = ror32(a, 12) + f; + h = mur(c, h) + a; + a = ror32(a, 3) + c; + h = mur(e, h) + a; + a = ror32(a + f, 12) + d; + h = mur(b ^ seed, h) + a; + return fmix(h); +} + +static inline uint32_t farmhash32_mk_len_0_to_4(const char *s, size_t len, uint32_t seed) { + uint32_t b = seed; + uint32_t c = 9; + for (size_t i = 0; i < len; i++) { + signed char v = s[i]; + b = b * c1 + v; + c ^= b; + } + return fmix(mur(b, mur(len, c))); +} + +static inline uint32_t farmhash32_mk_len_5_to_12(const char *s, size_t len, uint32_t seed) { + uint32_t a = len, b = len * 5, c = 9, d = b + seed; + a += fetch32(s); + b += fetch32(s + len - 4); + c += fetch32(s + ((len >> 1) & 4)); + return fmix(seed ^ mur(c, mur(b, mur(a, d)))); +} + +uint32_t farmhash32_mk(const char *s, size_t len) { + if (len <= 24) { + return len <= 12 ? + (len <= 4 ? 
farmhash32_mk_len_0_to_4(s, len, 0) : farmhash32_mk_len_5_to_12(s, len, 0)) : + farmhash32_mk_len_13_to_24(s, len, 0); + } + + // len > 24 + uint32_t h = len, g = c1 * len, f = g; + uint32_t a0 = ror32(fetch32(s + len - 4) * c1, 17) * c2; + uint32_t a1 = ror32(fetch32(s + len - 8) * c1, 17) * c2; + uint32_t a2 = ror32(fetch32(s + len - 16) * c1, 17) * c2; + uint32_t a3 = ror32(fetch32(s + len - 12) * c1, 17) * c2; + uint32_t a4 = ror32(fetch32(s + len - 20) * c1, 17) * c2; + h ^= a0; + h = ror32(h, 19); + h = h * 5 + 0xe6546b64; + h ^= a2; + h = ror32(h, 19); + h = h * 5 + 0xe6546b64; + g ^= a1; + g = ror32(g, 19); + g = g * 5 + 0xe6546b64; + g ^= a3; + g = ror32(g, 19); + g = g * 5 + 0xe6546b64; + f += a4; + f = ror32(f, 19) + 113; + size_t iters = (len - 1) / 20; + do { + uint32_t a = fetch32(s); + uint32_t b = fetch32(s + 4); + uint32_t c = fetch32(s + 8); + uint32_t d = fetch32(s + 12); + uint32_t e = fetch32(s + 16); + h += a; + g += b; + f += c; + h = mur(d, h) + e; + g = mur(c, g) + a; + f = mur(b + e * c1, f) + d; + f += g; + g += f; + s += 20; + } while (--iters != 0); + g = ror32(g, 11) * c1; + g = ror32(g, 17) * c1; + f = ror32(f, 11) * c1; + f = ror32(f, 17) * c1; + h = ror32(h + g, 19); + h = h * 5 + 0xe6546b64; + h = ror32(h, 17) * c1; + h = ror32(h + f, 19); + h = h * 5 + 0xe6546b64; + h = ror32(h, 17) * c1; + return h; +} + +uint32_t farmhash32_mk_with_seed(const char *s, size_t len, uint32_t seed) { + if (len <= 24) { + if (len >= 13) return farmhash32_mk_len_13_to_24(s, len, seed * c1); + else if (len >= 5) return farmhash32_mk_len_5_to_12(s, len, seed); + else return farmhash32_mk_len_0_to_4(s, len, seed); + } + uint32_t h = farmhash32_mk_len_13_to_24(s, 24, seed ^ len); + return mur(farmhash32_mk(s + 24, len - 24) + seed, h); +} + +// farmhash su + +#if CAN_USE_SSE41 && CAN_USE_SSE42 && CAN_USE_AESNI + +uint32_t farmhash32_su(const char *s, size_t len) { + const uint32_t seed = 81; + if (len <= 24) { + return len <= 12 ? + (len <= 4 ? 
+ farmhash32_mk_len_0_to_4(s, len, 0) : + farmhash32_mk_len_5_to_12(s, len, 0)) : + farmhash32_mk_len_13_to_24(s, len, 0); + } + + if (len < 40) { + uint32_t a = len, b = seed * c2, c = a + b; + a += fetch32(s + len - 4); + b += fetch32(s + len - 20); + c += fetch32(s + len - 16); + uint32_t d = a; + a = ror32(a, 21); + a = mur(a, mur(b, _mm_crc32_u32(c, d))); + a += fetch32(s + len - 12); + b += fetch32(s + len - 8); + d += a; + a += d; + b = mur(b, d) * c2; + a = _mm_crc32_u32(a, b + c); + return farmhash32_mk_len_13_to_24(s, (len + 1) / 2, a) + b; + } + + const __m128i cc1 = _mm_set1_epi32(c1); + const __m128i cc2 = _mm_set1_epi32(c2); + __m128i h = _mm_set1_epi32(seed); + __m128i g = _mm_set1_epi32(c1 * seed); + __m128i f = g; + __m128i k = _mm_set1_epi32(0xe6546b64); + __m128i q; + if (len < 80) { + __m128i a = fetch128(s); + __m128i b = fetch128(s + 16); + __m128i c = fetch128(s + (len - 15) / 2); + __m128i d = fetch128(s + len - 32); + __m128i e = fetch128(s + len - 16); + h = add32x4(h, a); + g = add32x4(g, b); + q = g; + g = shuf32x4_0_3_2_1(g); + f = add32x4(f, c); + __m128i be = add32x4(b, mul32x4(e, cc1)); + h = add32x4(h, f); + f = add32x4(f, h); + h = add32x4(murk(d, h, cc1, cc2, k), e); + k = xor128(k, _mm_shuffle_epi8(g, f)); + g = add32x4(xor128(c, g), a); + f = add32x4(xor128(be, f), d); + k = add32x4(k, be); + k = add32x4(k, _mm_shuffle_epi8(f, h)); + f = add32x4(f, g); + g = add32x4(g, f); + g = add32x4(_mm_set1_epi32(len), mul32x4(g, cc1)); + } else { + // len >= 80 + // The following is loosely modelled after farmhash32_mk. 
+ size_t iters = (len - 1) / 80; + len -= iters * 80; + +#define CHUNK_AES() do { \ + __m128i a = fetch128(s); \ + __m128i b = fetch128(s + 16); \ + __m128i c = fetch128(s + 32); \ + __m128i d = fetch128(s + 48); \ + __m128i e = fetch128(s + 64); \ + h = add32x4(h, a); \ + g = add32x4(g, b); \ + g = shuf32x4_0_3_2_1(g); \ + f = add32x4(f, c); \ + __m128i be = add32x4(b, mul32x4(e, cc1)); \ + h = add32x4(h, f); \ + f = add32x4(f, h); \ + h = add32x4(h, d); \ + q = add32x4(q, e); \ + h = rol32x4_17(h); \ + h = mul32x4(h, cc1); \ + k = xor128(k, _mm_shuffle_epi8(g, f)); \ + g = add32x4(xor128(c, g), a); \ + f = add32x4(xor128(be, f), d); \ + swap128(&f, &q); \ + q = _mm_aesimc_si128(q); \ + k = add32x4(k, be); \ + k = add32x4(k, _mm_shuffle_epi8(f, h)); \ + f = add32x4(f, g); \ + g = add32x4(g, f); \ + f = mul32x4(f, cc1); \ +} while (0) + + q = g; + while (iters-- != 0) { + CHUNK_AES(); + s += 80; + } + + if (len != 0) { + h = add32x4(h, _mm_set1_epi32(len)); + s = s + len - 80; + CHUNK_AES(); + } + } + + g = shuf32x4_0_3_2_1(g); + k = xor128(k, g); + k = xor128(k, q); + h = xor128(h, q); + f = mul32x4(f, cc1); + k = mul32x4(k, cc2); + g = mul32x4(g, cc1); + h = mul32x4(h, cc2); + k = add32x4(k, _mm_shuffle_epi8(g, f)); + h = add32x4(h, f); + f = add32x4(f, h); + g = add32x4(g, k); + k = add32x4(k, g); + k = xor128(k, _mm_shuffle_epi8(f, h)); + __m128i buf[4]; + buf[0] = f; + buf[1] = g; + buf[2] = k; + buf[3] = h; + s = (char*) buf; + uint32_t x = fetch32(s); + uint32_t y = fetch32(s+4); + uint32_t z = fetch32(s+8); + x = _mm_crc32_u32(x, fetch32(s+12)); + y = _mm_crc32_u32(y, fetch32(s+16)); + z = _mm_crc32_u32(z * c1, fetch32(s+20)); + x = _mm_crc32_u32(x, fetch32(s+24)); + y = _mm_crc32_u32(y * c1, fetch32(s+28)); + uint32_t o = y; + z = _mm_crc32_u32(z, fetch32(s+32)); + x = _mm_crc32_u32(x * c1, fetch32(s+36)); + y = _mm_crc32_u32(y, fetch32(s+40)); + z = _mm_crc32_u32(z * c1, fetch32(s+44)); + x = _mm_crc32_u32(x, fetch32(s+48)); + y = _mm_crc32_u32(y * c1, 
fetch32(s+52)); + z = _mm_crc32_u32(z, fetch32(s+56)); + x = _mm_crc32_u32(x, fetch32(s+60)); + return (o - x + y - z) * c1; +} + +uint32_t farmhash32_su_with_seed(const char *s, size_t len, uint32_t seed) { + if (len <= 24) { + if (len >= 13) return farmhash32_mk_len_13_to_24(s, len, seed * c1); + else if (len >= 5) return farmhash32_mk_len_5_to_12(s, len, seed); + else return farmhash32_mk_len_0_to_4(s, len, seed); + } + uint32_t h = farmhash32_mk_len_13_to_24(s, 24, seed ^ len); + return _mm_crc32_u32(farmhash32_su(s + 24, len - 24) + seed, h); +} + +#endif + +// farmhash sa + +#if CAN_USE_SSSE3 && CAN_USE_SSE41 && CAN_USE_SSE42 + +uint32_t farmhash32_sa(const char *s, size_t len) { + const uint32_t seed = 81; + if (len <= 24) { + return len <= 12 ? + (len <= 4 ? + farmhash32_mk_len_0_to_4(s, len, 0) : + farmhash32_mk_len_5_to_12(s, len, 0)) : + farmhash32_mk_len_13_to_24(s, len, 0); + } + + if (len < 40) { + uint32_t a = len, b = seed * c2, c = a + b; + a += fetch32(s + len - 4); + b += fetch32(s + len - 20); + c += fetch32(s + len - 16); + uint32_t d = a; + a = ror32(a, 21); + a = mur(a, mur(b, mur(c, d))); + a += fetch32(s + len - 12); + b += fetch32(s + len - 8); + d += a; + a += d; + b = mur(b, d) * c2; + a = _mm_crc32_u32(a, b + c); + return farmhash32_mk_len_13_to_24(s, (len + 1) / 2, a) + b; + } + + const __m128i cc1 = _mm_set1_epi32(c1); + const __m128i cc2 = _mm_set1_epi32(c2); + __m128i h = _mm_set1_epi32(seed); + __m128i g = _mm_set1_epi32(c1 * seed); + __m128i f = g; + __m128i k = _mm_set1_epi32(0xe6546b64); + if (len < 80) { + __m128i a = fetch128(s); + __m128i b = fetch128(s + 16); + __m128i c = fetch128(s + (len - 15) / 2); + __m128i d = fetch128(s + len - 32); + __m128i e = fetch128(s + len - 16); + h = add32x4(h, a); + g = add32x4(g, b); + g = shuf32x4_0_3_2_1(g); + f = add32x4(f, c); + __m128i be = add32x4(b, mul32x4(e, cc1)); + h = add32x4(h, f); + f = add32x4(f, h); + h = add32x4(murk(d, h, cc1, cc2, k), e); + k = xor128(k, 
_mm_shuffle_epi8(g, f)); + g = add32x4(xor128(c, g), a); + f = add32x4(xor128(be, f), d); + k = add32x4(k, be); + k = add32x4(k, _mm_shuffle_epi8(f, h)); + f = add32x4(f, g); + g = add32x4(g, f); + g = add32x4(_mm_set1_epi32(len), mul32x4(g, cc1)); + } else { + // len >= 80 + // The following is loosely modelled after farmhash32_mk. + size_t iters = (len - 1) / 80; + len -= iters * 80; + +#define CHUNK() do { \ + __m128i a = fetch128(s); \ + __m128i b = fetch128(s + 16); \ + __m128i c = fetch128(s + 32); \ + __m128i d = fetch128(s + 48); \ + __m128i e = fetch128(s + 64); \ + h = add32x4(h, a); \ + g = add32x4(g, b); \ + g = shuf32x4_0_3_2_1(g); \ + f = add32x4(f, c); \ + __m128i be = add32x4(b, mul32x4(e, cc1)); \ + h = add32x4(h, f); \ + f = add32x4(f, h); \ + h = add32x4(murk(d, h, cc1, cc2, k), e); \ + k = xor128(k, _mm_shuffle_epi8(g, f)); \ + g = add32x4(xor128(c, g), a); \ + f = add32x4(xor128(be, f), d); \ + k = add32x4(k, be); \ + k = add32x4(k, _mm_shuffle_epi8(f, h)); \ + f = add32x4(f, g); \ + g = add32x4(g, f); \ + f = mul32x4(f, cc1); \ +} while (0) + + while (iters-- != 0) { + CHUNK(); + s += 80; + } + + if (len != 0) { + h = add32x4(h, _mm_set1_epi32(len)); + s = s + len - 80; + CHUNK(); + } + } + + g = shuf32x4_0_3_2_1(g); + k = xor128(k, g); + f = mul32x4(f, cc1); + k = mul32x4(k, cc2); + g = mul32x4(g, cc1); + h = mul32x4(h, cc2); + k = add32x4(k, _mm_shuffle_epi8(g, f)); + h = add32x4(h, f); + f = add32x4(f, h); + g = add32x4(g, k); + k = add32x4(k, g); + k = xor128(k, _mm_shuffle_epi8(f, h)); + __m128i buf[4]; + buf[0] = f; + buf[1] = g; + buf[2] = k; + buf[3] = h; + s = (char*) buf; + uint32_t x = fetch32(s); + uint32_t y = fetch32(s+4); + uint32_t z = fetch32(s+8); + x = _mm_crc32_u32(x, fetch32(s+12)); + y = _mm_crc32_u32(y, fetch32(s+16)); + z = _mm_crc32_u32(z * c1, fetch32(s+20)); + x = _mm_crc32_u32(x, fetch32(s+24)); + y = _mm_crc32_u32(y * c1, fetch32(s+28)); + uint32_t o = y; + z = _mm_crc32_u32(z, fetch32(s+32)); + x = _mm_crc32_u32(x 
* c1, fetch32(s+36)); + y = _mm_crc32_u32(y, fetch32(s+40)); + z = _mm_crc32_u32(z * c1, fetch32(s+44)); + x = _mm_crc32_u32(x, fetch32(s+48)); + y = _mm_crc32_u32(y * c1, fetch32(s+52)); + z = _mm_crc32_u32(z, fetch32(s+56)); + x = _mm_crc32_u32(x, fetch32(s+60)); + return (o - x + y - z) * c1; +} + +uint32_t farmhash32_sa_with_seed(const char *s, size_t len, uint32_t seed) { + if (len <= 24) { + if (len >= 13) return farmhash32_mk_len_13_to_24(s, len, seed * c1); + else if (len >= 5) return farmhash32_mk_len_5_to_12(s, len, seed); + else return farmhash32_mk_len_0_to_4(s, len, seed); + } + uint32_t h = farmhash32_mk_len_13_to_24(s, 24, seed ^ len); + return _mm_crc32_u32(farmhash32_sa(s + 24, len - 24) + seed, h); +} + +#endif + +// farmhash cc + +// This file provides a 32-bit hash equivalent to cityhash32 (v1.1.1) +// and a 128-bit hash equivalent to cityhash128 (v1.1.1). It also provides +// a seeded 32-bit hash function similar to cityhash32. + +static inline uint32_t farmhash32_cc_len_13_to_24(const char *s, size_t len) { + uint32_t a = fetch32(s - 4 + (len >> 1)); + uint32_t b = fetch32(s + 4); + uint32_t c = fetch32(s + len - 8); + uint32_t d = fetch32(s + (len >> 1)); + uint32_t e = fetch32(s); + uint32_t f = fetch32(s + len - 4); + uint32_t h = len; + + return fmix(mur(f, mur(e, mur(d, mur(c, mur(b, mur(a, h))))))); +} + +static inline uint32_t farmhash32_cc_len_0_to_4(const char *s, size_t len) { + uint32_t b = 0; + uint32_t c = 9; + for (size_t i = 0; i < len; i++) { + signed char v = s[i]; + b = b * c1 + v; + c ^= b; + } + return fmix(mur(b, mur(len, c))); +} + +static inline uint32_t farmhash32_cc_len_5_to_12(const char *s, size_t len) { + uint32_t a = len, b = len * 5, c = 9, d = b; + a += fetch32(s); + b += fetch32(s + len - 4); + c += fetch32(s + ((len >> 1) & 4)); + return fmix(mur(c, mur(b, mur(a, d)))); +} + +uint32_t farmhash32_cc(const char *s, size_t len) { + if (len <= 24) { + return len <= 12 ? + (len <= 4 ? 
farmhash32_cc_len_0_to_4(s, len) : farmhash32_cc_len_5_to_12(s, len)) : + farmhash32_cc_len_13_to_24(s, len); + } + + // len > 24 + uint32_t h = len, g = c1 * len, f = g; + uint32_t a0 = ror32(fetch32(s + len - 4) * c1, 17) * c2; + uint32_t a1 = ror32(fetch32(s + len - 8) * c1, 17) * c2; + uint32_t a2 = ror32(fetch32(s + len - 16) * c1, 17) * c2; + uint32_t a3 = ror32(fetch32(s + len - 12) * c1, 17) * c2; + uint32_t a4 = ror32(fetch32(s + len - 20) * c1, 17) * c2; + h ^= a0; + h = ror32(h, 19); + h = h * 5 + 0xe6546b64; + h ^= a2; + h = ror32(h, 19); + h = h * 5 + 0xe6546b64; + g ^= a1; + g = ror32(g, 19); + g = g * 5 + 0xe6546b64; + g ^= a3; + g = ror32(g, 19); + g = g * 5 + 0xe6546b64; + f += a4; + f = ror32(f, 19); + f = f * 5 + 0xe6546b64; + size_t iters = (len - 1) / 20; + do { + uint32_t a0 = ror32(fetch32(s) * c1, 17) * c2; + uint32_t a1 = fetch32(s + 4); + uint32_t a2 = ror32(fetch32(s + 8) * c1, 17) * c2; + uint32_t a3 = ror32(fetch32(s + 12) * c1, 17) * c2; + uint32_t a4 = fetch32(s + 16); + h ^= a0; + h = ror32(h, 18); + h = h * 5 + 0xe6546b64; + f += a1; + f = ror32(f, 19); + f = f * c1; + g += a2; + g = ror32(g, 18); + g = g * 5 + 0xe6546b64; + h ^= a3 + a1; + h = ror32(h, 19); + h = h * 5 + 0xe6546b64; + g ^= a4; + g = bswap32(g) * 5; + h += a4 * 5; + h = bswap32(h); + f += a0; + PERMUTE3(&f, &h, &g); + s += 20; + } while (--iters != 0); + g = ror32(g, 11) * c1; + g = ror32(g, 17) * c1; + f = ror32(f, 11) * c1; + f = ror32(f, 17) * c1; + h = ror32(h + g, 19); + h = h * 5 + 0xe6546b64; + h = ror32(h, 17) * c1; + h = ror32(h + f, 19); + h = h * 5 + 0xe6546b64; + h = ror32(h, 17) * c1; + return h; +} + +uint32_t farmhash32_cc_with_seed(const char *s, size_t len, uint32_t seed) { + if (len <= 24) { + if (len >= 13) return farmhash32_mk_len_13_to_24(s, len, seed * c1); + else if (len >= 5) return farmhash32_mk_len_5_to_12(s, len, seed); + else return farmhash32_mk_len_0_to_4(s, len, seed); + } + uint32_t h = farmhash32_mk_len_13_to_24(s, 24, seed ^ len); + 
return mur(farmhash32_cc(s + 24, len - 24) + seed, h); +} + +static inline uint64_t farmhash_cc_len_0_to_16(const char *s, size_t len) { + if (len >= 8) { + uint64_t mul = k2 + len * 2; + uint64_t a = fetch64(s) + k2; + uint64_t b = fetch64(s + len - 8); + uint64_t c = ror64(b, 37) * mul + a; + uint64_t d = (ror64(a, 25) + b) * mul; + return farmhash_len_16_mul(c, d, mul); + } + if (len >= 4) { + uint64_t mul = k2 + len * 2; + uint64_t a = fetch32(s); + return farmhash_len_16_mul(len + (a << 3), fetch32(s + len - 4), mul); + } + if (len > 0) { + uint8_t a = s[0]; + uint8_t b = s[len >> 1]; + uint8_t c = s[len - 1]; + uint32_t y = ((uint32_t) a) + (((uint32_t) b) << 8); + uint32_t z = len + (((uint32_t) c) << 2); + return smix(y * k2 ^ z * k0) * k2; + } + return k2; +} + +// Return a 16-byte hash for 48 bytes. Quick and dirty. +// Callers do best to use "random-looking" values for a and b. +static inline uint128_t weak_farmhash_cc_len_32_with_seeds_vals( + uint64_t w, uint64_t x, uint64_t y, uint64_t z, uint64_t a, uint64_t b) { + a += w; + b = ror64(b + a + z, 21); + uint64_t c = a; + a += x; + a += y; + b += ror64(a, 44); + return make_uint128_t(a + z, b + c); +} + +// Return a 16-byte hash for s[0] ... s[31], a, and b. Quick and dirty. +static inline uint128_t weak_farmhash_cc_len_32_with_seeds( + const char* s, uint64_t a, uint64_t b) { + return weak_farmhash_cc_len_32_with_seeds_vals(fetch64(s), + fetch64(s + 8), + fetch64(s + 16), + fetch64(s + 24), + a, + b); +} + + + +// A subroutine for cityhash128(). Returns a decent 128-bit hash for strings +// of any length representable in signed long. Based on City and Murmur. 
+static inline uint128_t farmhash_cc_city_murmur(const char *s, size_t len, uint128_t seed) { + uint64_t a = uint128_t_low64(seed); + uint64_t b = uint128_t_high64(seed); + uint64_t c = 0; + uint64_t d = 0; + signed long l = len - 16; + if (l <= 0) { // len <= 16 + a = smix(a * k1) * k1; + c = b * k1 + farmhash_cc_len_0_to_16(s, len); + d = smix(a + (len >= 8 ? fetch64(s) : c)); + } else { // len > 16 + c = farmhash_len_16(fetch64(s + len - 8) + k1, a); + d = farmhash_len_16(b + len, c + fetch64(s + len - 16)); + a += d; + do { + a ^= smix(fetch64(s) * k1) * k1; + a *= k1; + b ^= a; + c ^= smix(fetch64(s + 8) * k1) * k1; + c *= k1; + d ^= c; + s += 16; + l -= 16; + } while (l > 0); + } + a = farmhash_len_16(a, c); + b = farmhash_len_16(d, b); + return make_uint128_t(a ^ b, farmhash_len_16(b, a)); +} + +uint128_t farmhash128_cc_city_with_seed(const char *s, size_t len, uint128_t seed) { + if (len < 128) { + return farmhash_cc_city_murmur(s, len, seed); + } + + // We expect len >= 128 to be the common case. Keep 56 bytes of state: + // v, w, x, y, and z. + uint128_t v, w; + uint64_t x = uint128_t_low64(seed); + uint64_t y = uint128_t_high64(seed); + uint64_t z = len * k1; + v.a = ror64(y ^ k1, 49) * k1 + fetch64(s); + v.b = ror64(v.a, 42) * k1 + fetch64(s + 8); + w.a = ror64(y + z, 35) * k1 + x; + w.b = ror64(x + fetch64(s + 88), 53) * k1; + + // This is the same inner loop as cityhash64(), manually unrolled. 
+ do { + x = ror64(x + y + v.a + fetch64(s + 8), 37) * k1; + y = ror64(y + v.b + fetch64(s + 48), 42) * k1; + x ^= w.b; + y += v.a + fetch64(s + 40); + z = ror64(z + w.a, 33) * k1; + v = weak_farmhash_cc_len_32_with_seeds(s, v.b * k1, x + w.a); + w = weak_farmhash_cc_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); + swap64(&z, &x); + s += 64; + x = ror64(x + y + v.a + fetch64(s + 8), 37) * k1; + y = ror64(y + v.b + fetch64(s + 48), 42) * k1; + x ^= w.b; + y += v.a + fetch64(s + 40); + z = ror64(z + w.a, 33) * k1; + v = weak_farmhash_cc_len_32_with_seeds(s, v.b * k1, x + w.a); + w = weak_farmhash_cc_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); + swap64(&z, &x); + s += 64; + len -= 128; + } while (likely(len >= 128)); + x += ror64(v.a + z, 49) * k0; + y = y * k0 + ror64(w.b, 37); + z = z * k0 + ror64(w.a, 27); + w.a *= 9; + v.a *= k0; + // If 0 < len < 128, hash up to 4 chunks of 32 bytes each from the end of s. + for (size_t tail_done = 0; tail_done < len; ) { + tail_done += 32; + y = ror64(x + y, 42) * k0 + v.b; + w.a += fetch64(s + len - tail_done + 16); + x = x * k0 + w.a; + z += w.b + fetch64(s + len - tail_done); + w.b += v.a; + v = weak_farmhash_cc_len_32_with_seeds(s + len - tail_done, v.a + z, v.b); + v.a *= k0; + } + // At this point our 56 bytes of state should contain more than + // enough information for a strong 128-bit hash. We use two + // different 56-byte-to-8-byte hashes to get a 16-byte final result. + x = farmhash_len_16(x, v.a); + y = farmhash_len_16(y + z, w.a); + return make_uint128_t(farmhash_len_16(x + v.b, w.b) + y, + farmhash_len_16(x + w.b, y + v.b)); +} + +static inline uint128_t farmhash128_cc_city(const char *s, size_t len) { + return len >= 16 ? 
+ farmhash128_cc_city_with_seed(s + 16, len - 16, + make_uint128_t(fetch64(s), fetch64(s + 8) + k0)) : + farmhash128_cc_city_with_seed(s, len, make_uint128_t(k0, k1)); +} + +uint128_t farmhash_cc_fingerprint128(const char* s, size_t len) { + return farmhash128_cc_city(s, len); +} + +// BASIC STRING HASHING + +// farmhash function for a byte array. See also Hash(), below. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint32_t farmhash32(const char* s, size_t len) { + return debug_tweak32( + +#if x86_64 && CAN_USE_SSE41 + farmhash32_nt(s, len) +#elif CAN_USE_SSE41 && CAN_USE_SSE42 && CAN_USE_AESNI + farmhash32_su(s, len) +#elif CAN_USE_SSSE3 && CAN_USE_SSE41 && CAN_USE_SSE42 + farmhash32_sa(s, len) +#else + farmhash32_mk(s, len) +#endif + + ); +} + +// Hash function for a byte array. For convenience, a 32-bit seed is also +// hashed into the result. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint32_t farmhash32_with_seed(const char* s, size_t len, uint32_t seed) { + return debug_tweak32( + +#if x86_64 && CAN_USE_SSE41 + farmhash32_nt_with_seed(s, len, seed) +#elif CAN_USE_SSE41 && CAN_USE_SSE42 && CAN_USE_AESNI + farmhash32_su_with_seed(s, len, seed) +#elif CAN_USE_SSSE3 && CAN_USE_SSE41 && CAN_USE_SSE42 + farmhash32_sa_with_seed(s, len, seed) +#else + farmhash32_mk_with_seed(s, len, seed) +#endif + + ); +} + +// Hash function for a byte array. For convenience, a 64-bit seed is also +// hashed into the result. See also farmhash(), below. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint64_t farmhash64(const char* s, size_t len) { + return debug_tweak64( +#if x86_64 && CAN_USE_SSSE3 && CAN_USE_SSE41 + farmhash64_te(s, len) +#else + farmhash64_xo(s, len) +#endif + ); +} + +// Hash function for a byte array. 
+// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +size_t farmhash(const char* s, size_t len) { + return sizeof(size_t) == 8 ? farmhash64(s, len) : farmhash32(s, len); +} + +// Hash function for a byte array. For convenience, a 64-bit seed is also +// hashed into the result. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint64_t farmhash64_with_seed(const char* s, size_t len, uint64_t seed) { + return debug_tweak64(farmhash64_na_with_seed(s, len, seed)); +} + +// Hash function for a byte array. For convenience, two seeds are also +// hashed into the result. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint64_t farmhash64_with_seeds(const char* s, size_t len, uint64_t seed0, uint64_t seed1) { + return debug_tweak64(farmhash64_na_with_seeds(s, len, seed0, seed1)); +} + +// Hash function for a byte array. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint128_t farmhash128(const char* s, size_t len) { + return debug_tweak128(farmhash_cc_fingerprint128(s, len)); +} + +// Hash function for a byte array. For convenience, a 128-bit seed is also +// hashed into the result. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint128_t farmhash128_with_seed(const char* s, size_t len, uint128_t seed) { + return debug_tweak128(farmhash128_cc_city_with_seed(s, len, seed)); +} + +// BASIC NON-STRING HASHING + +// FINGERPRINTING (i.e., good, portable, forever-fixed hash functions) + +// Fingerprint function for a byte array. Most useful in 32-bit binaries. +uint32_t farmhash_fingerprint32(const char* s, size_t len) { + return farmhash32_mk(s, len); +} + +// Fingerprint function for a byte array. 
+uint64_t farmhash_fingerprint64(const char* s, size_t len) { + return farmhash64_na(s, len); +} + +// Fingerprint function for a byte array. +uint128_t farmhash_fingerprint128(const char* s, size_t len) { + return farmhash_cc_fingerprint128(s, len); +} diff --git a/lib/checksums/farmhash.h b/lib/checksums/farmhash.h new file mode 100644 index 00000000..8a2d840a --- /dev/null +++ b/lib/checksums/farmhash.h @@ -0,0 +1,166 @@ +// Copyright (c) 2014 Google, Inc. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in +// all copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +// THE SOFTWARE. +// +// FarmHash, by Geoff Pike + +// +// http://code.google.com/p/farmhash/ +// +// This file provides a few functions for hashing strings and other +// data. All of them are high-quality functions in the sense that +// they do well on standard tests such as Austin Appleby's SMHasher. +// They're also fast. FarmHash is the successor to CityHash. +// +// Functions in the FarmHash family are not suitable for cryptography. 
+// +// WARNING: This code has been only lightly tested on big-endian platforms! +// It is known to work well on little-endian platforms that have a small penalty +// for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs. +// It should work on all 32-bit and 64-bit platforms that allow unaligned reads; +// bug reports are welcome. +// +// By the way, for some hash functions, given strings a and b, the hash +// of a+b is easily derived from the hashes of a and b. This property +// doesn't hold for any hash functions in this file. + +// this c port from https://github.com/uxcn/farmhash-c + +#ifndef FARMHASH_H +#define FARMHASH_H + +#include <stddef.h> +#include <stdint.h> + +struct uint128_t { + uint64_t a; + uint64_t b; +}; + +typedef struct uint128_t uint128_t; + + +static inline uint64_t uint128_t_low64(const uint128_t x) { return x.a; } +static inline uint64_t uint128_t_high64(const uint128_t x) { return x.b; } + +static inline uint128_t make_uint128_t(uint64_t lo, uint64_t hi) { uint128_t x = {lo, hi}; return x; } + +// BASIC STRING HASHING + +// Hash function for a byte array. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +size_t farmhash(const char* s, size_t len); + +// Hash function for a byte array. Most useful in 32-bit binaries. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint32_t farmhash32(const char* s, size_t len); + +// Hash function for a byte array. For convenience, a 32-bit seed is also +// hashed into the result. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint32_t farmhash32_with_seed(const char* s, size_t len, uint32_t seed); + +// Hash 128 input bits down to 64 bits of output. +// Hash function for a byte array. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. 
+uint64_t farmhash64(const char* s, size_t len); + +// Hash function for a byte array. For convenience, a 64-bit seed is also +// hashed into the result. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint64_t farmhash64_with_seed(const char* s, size_t len, uint64_t seed); + +// Hash function for a byte array. For convenience, two seeds are also +// hashed into the result. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint64_t farmhash64_with_seeds(const char* s, size_t len, + uint64_t seed0, uint64_t seed1); + +// Hash function for a byte array. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint128_t farmhash128(const char* s, size_t len); + +// Hash function for a byte array. For convenience, a 128-bit seed is also +// hashed into the result. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +uint128_t farmhash128_with_seed(const char* s, size_t len, uint128_t seed); + +// BASIC NON-STRING HASHING + +// This is intended to be a reasonably good hash function. +// May change from time to time, may differ on different platforms, may differ +// depending on NDEBUG. +static inline uint64_t farmhash128_to_64(uint128_t x) { + // Murmur-inspired hashing. + const uint64_t k_mul = 0x9ddfea08eb382d69ULL; + uint64_t a = (uint128_t_low64(x) ^ uint128_t_high64(x)) * k_mul; + a ^= (a >> 47); + uint64_t b = (uint128_t_high64(x) ^ a) * k_mul; + b ^= (b >> 47); + b *= k_mul; + return b; +} + +// FINGERPRINTING (i.e., good, portable, forever-fixed hash functions) + +// Fingerprint function for a byte array. Most useful in 32-bit binaries. +uint32_t farmhash_fingerprint32(const char* s, size_t len); + +// Fingerprint function for a byte array. +uint64_t farmhash_fingerprint64(const char* s, size_t len); + +// Fingerprint function for a byte array. 
+uint128_t farmhash_fingerprint128(const char* s, size_t len); + +// This is intended to be a good fingerprinting primitive. +// See below for more overloads. +static inline uint64_t farmhash_fingerprint_uint128_t(uint128_t x) { + // Murmur-inspired hashing. + const uint64_t k_mul = 0x9ddfea08eb382d69ULL; + uint64_t a = (uint128_t_low64(x) ^ uint128_t_high64(x)) * k_mul; + a ^= (a >> 47); + uint64_t b = (uint128_t_high64(x) ^ a) * k_mul; + b ^= (b >> 44); + b *= k_mul; + b ^= (b >> 41); + b *= k_mul; + return b; +} + +// This is intended to be a good fingerprinting primitive. +static inline uint64_t farmhash_fingerprint_uint64_t(uint64_t x) { + // Murmur-inspired hashing. + const uint64_t k_mul = 0x9ddfea08eb382d69ULL; + uint64_t b = x * k_mul; + b ^= (b >> 44); + b *= k_mul; + b ^= (b >> 41); + b *= k_mul; + return b; +} + +#endif // FARMHASH_H From b19f0d1edd0520f60dcc8edf22f7c08556c3f982 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 22:02:03 +1000 Subject: [PATCH 085/180] checksum: extend RmDigestSpec to include name; improve encapsulation; remove redundant code --- lib/checksum.c | 182 +++++++++++++++++++------------------------------ lib/checksum.h | 5 +- 2 files changed, 74 insertions(+), 113 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 2facf7f8..f421fd75 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -145,6 +145,7 @@ typedef void (*RmDigestCopyFunc)(RmDigest *digest, RmDigest *copy); typedef void (*RmDigestStealFunc)(RmDigest *digest, guint8 *result); typedef struct RmDigestSpec { + const char *name; const int bits; RmDigestInitFunc init; RmDigestFreeFunc free; @@ -207,9 +208,9 @@ static void rm_digest_spooky_update(RmDigest *digest, const unsigned char *data, } #define GENERIC_FUNCS(ALGO) rm_digest_generic_init, rm_digest_generic_free, rm_digest_##ALGO##_update, rm_digest_generic_copy, NULL -static const RmDigestSpec spooky32_spec = { 32, GENERIC_FUNCS(spooky32) }; -static const RmDigestSpec spooky64_spec = { 64, 
GENERIC_FUNCS(spooky64) }; -static const RmDigestSpec spooky_spec = { 128, GENERIC_FUNCS(spooky) }; +static const RmDigestSpec spooky32_spec = { "spooky32", 32, GENERIC_FUNCS(spooky32) }; +static const RmDigestSpec spooky64_spec = { "spooky64", 64, GENERIC_FUNCS(spooky64) }; +static const RmDigestSpec spooky_spec = { "spooky", 128, GENERIC_FUNCS(spooky) }; /////////////////////////// @@ -220,7 +221,7 @@ static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, digest->checksum->first = XXH64(data, size, digest->checksum->first); } -static const RmDigestSpec xxhash_spec = {64, GENERIC_FUNCS(xxhash)}; +static const RmDigestSpec xxhash_spec = { "xxhash", 64, GENERIC_FUNCS(xxhash)}; /////////////////////////// // farmhash // @@ -230,7 +231,7 @@ static void rm_digest_farmhash_update(RmDigest *digest, const unsigned char *dat *digest->farmhash = farmhash128_with_seed((const char*)data, size, *digest->farmhash); } -static const RmDigestSpec farmhash_spec = {64, GENERIC_FUNCS(farmhash)}; +static const RmDigestSpec farmhash_spec = { "farmhash", 64, GENERIC_FUNCS(farmhash)}; /////////////////////////// // murmur // @@ -247,7 +248,7 @@ static void rm_digest_murmur_update(RmDigest *digest, const unsigned char *data, #endif } -static const RmDigestSpec murmur_spec = {128, GENERIC_FUNCS(murmur)}; +static const RmDigestSpec murmur_spec = { "murmur", 128, GENERIC_FUNCS(murmur)}; /////////////////////////// // cityhash // @@ -266,7 +267,8 @@ static void rm_digest_city_update(RmDigest *digest, const unsigned char *data, R memcpy(digest->checksum, &old, sizeof(uint128)); } -static const RmDigestSpec city_spec = {128, GENERIC_FUNCS(city)}; +static const RmDigestSpec city_spec = { "city", 128, GENERIC_FUNCS(city)}; + /////////////////////////// // cumulative // @@ -280,7 +282,8 @@ static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *d } } -static const RmDigestSpec cumulative_spec = {128, GENERIC_FUNCS(cumulative)}; +static const 
RmDigestSpec cumulative_spec = { "cumulative", 128, GENERIC_FUNCS(cumulative)}; + /////////////////////////// // glib hashes // @@ -327,11 +330,11 @@ static void rm_digest_glib_steal(RmDigest *digest, guint8 *result) { #define GLIB_FUNCS rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy, rm_digest_glib_steal -static const RmDigestSpec md5_spec = {128, GLIB_FUNCS}; -static const RmDigestSpec sha1_spec = {160, GLIB_FUNCS}; -static const RmDigestSpec sha256_spec = {256, GLIB_FUNCS}; +static const RmDigestSpec md5_spec = {"md5", 128, GLIB_FUNCS}; +static const RmDigestSpec sha1_spec = {"sha1", 160, GLIB_FUNCS}; +static const RmDigestSpec sha256_spec = {"sha256", 256, GLIB_FUNCS}; #if HAVE_SHA512 -static const RmDigestSpec sha512_spec = {512, GLIB_FUNCS}; +static const RmDigestSpec sha512_spec = {"sha512", 512, GLIB_FUNCS}; #endif /////////////////////////// @@ -380,11 +383,11 @@ static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { g_slice_free(sha3_context, copy); } -#define SHA3_FUNCS rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy, rm_digest_sha3_steal +#define SHA3_SPEC(BITS) "sha3_" #BITS, BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy, rm_digest_sha3_steal -static const RmDigestSpec sha3_256_spec = { 256, SHA3_FUNCS}; -static const RmDigestSpec sha3_384_spec = { 384, SHA3_FUNCS}; -static const RmDigestSpec sha3_512_spec = { 512, SHA3_FUNCS}; +static const RmDigestSpec sha3_256_spec = { SHA3_SPEC(256)}; +static const RmDigestSpec sha3_384_spec = { SHA3_SPEC(384)}; +static const RmDigestSpec sha3_512_spec = { SHA3_SPEC(512)}; /////////////////////////// // blake hashes // @@ -440,10 +443,10 @@ CREATE_BLAKE_FUNCS(blake2sp, BLAKE2S); #define BLAKE_FUNCS(ALGO) rm_digest_##ALGO##_init, rm_digest_##ALGO##_free, rm_digest_##ALGO##_update, rm_digest_##ALGO##_copy, rm_digest_##ALGO##_steal -static const RmDigestSpec blake2b_spec = {512, 
BLAKE_FUNCS(blake2b)}; -static const RmDigestSpec blake2bp_spec = {512, BLAKE_FUNCS(blake2bp)}; -static const RmDigestSpec blake2s_spec = {256, BLAKE_FUNCS(blake2s)}; -static const RmDigestSpec blake2sp_spec = {256, BLAKE_FUNCS(blake2sp)}; +static const RmDigestSpec blake2b_spec = {"blake2b", 512, BLAKE_FUNCS(blake2b)}; +static const RmDigestSpec blake2bp_spec = {"blake2bp", 512, BLAKE_FUNCS(blake2bp)}; +static const RmDigestSpec blake2s_spec = {"blake2s", 256, BLAKE_FUNCS(blake2s)}; +static const RmDigestSpec blake2sp_spec = {"blake2sp", 256, BLAKE_FUNCS(blake2sp)}; /////////////////////////// // ext hash // @@ -472,7 +475,7 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm } } -static const RmDigestSpec ext_spec = {0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy, NULL}; +static const RmDigestSpec ext_spec = {"ext", 0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy, NULL}; /////////////////////////// @@ -502,40 +505,49 @@ static void rm_digest_paranoid_free(RmDigest *digest) { /* Note: paranoid update implementation is in rm_digest_buffered_update() below */ -static const RmDigestSpec paranoid_spec = {0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL, NULL}; +static const RmDigestSpec paranoid_spec = { "paranoid", 0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL, NULL}; //////////////////////////////// // RmDigestSpec map // //////////////////////////////// -static const RmDigestSpec *digest_specs[] = { - [RM_DIGEST_UNKNOWN] = NULL, - [RM_DIGEST_MURMUR] = &murmur_spec, - [RM_DIGEST_SPOOKY] = &spooky_spec, - [RM_DIGEST_SPOOKY32] = &spooky32_spec, - [RM_DIGEST_SPOOKY64] = &spooky64_spec, - [RM_DIGEST_CITY] = &city_spec, - [RM_DIGEST_MD5] = &md5_spec, - [RM_DIGEST_SHA1] = &sha1_spec, - [RM_DIGEST_SHA256] = &sha256_spec, -#if HAVE_SHA512 - [RM_DIGEST_SHA512] = &sha512_spec, -#endif - [RM_DIGEST_SHA3_256] = &sha3_256_spec, 
- [RM_DIGEST_SHA3_384] = &sha3_384_spec, - [RM_DIGEST_SHA3_512] = &sha3_512_spec, - [RM_DIGEST_BLAKE2S] = &blake2s_spec, - [RM_DIGEST_BLAKE2B] = &blake2b_spec, - [RM_DIGEST_BLAKE2SP] = &blake2sp_spec, - [RM_DIGEST_BLAKE2BP] = &blake2bp_spec, - [RM_DIGEST_EXT] = &ext_spec, - [RM_DIGEST_CUMULATIVE] = &cumulative_spec, - [RM_DIGEST_PARANOID] = &paranoid_spec, - [RM_DIGEST_FARMHASH] = &farmhash_spec, - [RM_DIGEST_XXHASH] = &xxhash_spec, -}; +static const RmDigestSpec *rm_digest_spec(RmDigestType type) { + static const RmDigestSpec *digest_specs[] = { + [RM_DIGEST_UNKNOWN] = NULL, + [RM_DIGEST_MURMUR] = &murmur_spec, + [RM_DIGEST_SPOOKY] = &spooky_spec, + [RM_DIGEST_SPOOKY32] = &spooky32_spec, + [RM_DIGEST_SPOOKY64] = &spooky64_spec, + [RM_DIGEST_CITY] = &city_spec, + [RM_DIGEST_MD5] = &md5_spec, + [RM_DIGEST_SHA1] = &sha1_spec, + [RM_DIGEST_SHA256] = &sha256_spec, + #if HAVE_SHA512 + [RM_DIGEST_SHA512] = &sha512_spec, + #endif + [RM_DIGEST_SHA3_256] = &sha3_256_spec, + [RM_DIGEST_SHA3_384] = &sha3_384_spec, + [RM_DIGEST_SHA3_512] = &sha3_512_spec, + [RM_DIGEST_BLAKE2S] = &blake2s_spec, + [RM_DIGEST_BLAKE2B] = &blake2b_spec, + [RM_DIGEST_BLAKE2SP] = &blake2sp_spec, + [RM_DIGEST_BLAKE2BP] = &blake2bp_spec, + [RM_DIGEST_EXT] = &ext_spec, + [RM_DIGEST_CUMULATIVE] = &cumulative_spec, + [RM_DIGEST_PARANOID] = &paranoid_spec, + [RM_DIGEST_FARMHASH] = &farmhash_spec, + [RM_DIGEST_XXHASH] = &xxhash_spec, + }; + + if(type >= RM_DIGEST_SENTINEL) { + rm_assert_gentle_not_reached(); + return digest_specs[RM_DEFAULT_DIGEST]; + } + + return digest_specs[type]; +} static gpointer rm_init_digest_type_table(GHashTable **code_table) { static struct { @@ -602,30 +614,8 @@ RmDigestType rm_string_to_digest_type(const char *string) { } const char *rm_digest_type_to_string(RmDigestType type) { - static const char *names[] = {[RM_DIGEST_UNKNOWN] = "unknown", - [RM_DIGEST_MURMUR] = "murmur", - [RM_DIGEST_SPOOKY] = "spooky", - [RM_DIGEST_SPOOKY32] = "spooky32", - [RM_DIGEST_SPOOKY64] = "spooky64", -
[RM_DIGEST_CITY] = "city", - [RM_DIGEST_MD5] = "md5", - [RM_DIGEST_SHA1] = "sha1", - [RM_DIGEST_SHA256] = "sha256", - [RM_DIGEST_SHA512] = "sha512", - [RM_DIGEST_SHA3_256] = "sha3-256", - [RM_DIGEST_SHA3_384] = "sha3-384", - [RM_DIGEST_SHA3_512] = "sha3-512", - [RM_DIGEST_BLAKE2S] = "blake2s", - [RM_DIGEST_BLAKE2B] = "blake2b", - [RM_DIGEST_BLAKE2SP] = "blake2sp", - [RM_DIGEST_BLAKE2BP] = "blake2bp", - [RM_DIGEST_EXT] = "ext", - [RM_DIGEST_CUMULATIVE] = "cumulative", - [RM_DIGEST_PARANOID] = "paranoid", - [RM_DIGEST_FARMHASH] = "farmhash", - [RM_DIGEST_XXHASH] = "xxhash"}; - - return names[MIN(type, sizeof(names) / sizeof(names[0]))]; + const RmDigestSpec *spec = rm_digest_spec(type); + return spec->name; } /* TODO: remove? */ @@ -645,10 +635,9 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s bool use_shadow_hash) { g_assert(type != RM_DIGEST_UNKNOWN); - const RmDigestSpec *spec = digest_specs[type]; + const RmDigestSpec *spec = rm_digest_spec(type); RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; - digest->bytes = spec->bits / 8; spec->init(digest, seed1, seed2, ext_size, use_shadow_hash); @@ -668,13 +657,13 @@ void rm_digest_release_buffers(RmDigest *digest) { } void rm_digest_free(RmDigest *digest) { - const RmDigestSpec *spec = digest_specs[digest->type]; + const RmDigestSpec *spec = rm_digest_spec(digest->type); spec->free(digest); g_slice_free(RmDigest, digest); } void rm_digest_update(RmDigest *digest, const unsigned char *data, RmOff size) { - const RmDigestSpec *spec = digest_specs[digest->type]; + const RmDigestSpec *spec = rm_digest_spec(digest->type); spec->update(digest, data, size); } @@ -761,48 +750,15 @@ RmDigest *rm_digest_copy(RmDigest *digest) { RmDigest *copy = g_slice_copy(sizeof(RmDigest), digest); - const RmDigestSpec *spec = digest_specs[digest->type]; + const RmDigestSpec *spec = rm_digest_spec(digest->type); spec->copy(digest, copy); return copy; } -static gboolean 
rm_digest_needs_steal(RmDigestType digest_type) { - switch(digest_type) { - case RM_DIGEST_MD5: - case RM_DIGEST_SHA512: - case RM_DIGEST_SHA256: - case RM_DIGEST_SHA1: - case RM_DIGEST_SHA3_256: - case RM_DIGEST_SHA3_384: - case RM_DIGEST_SHA3_512: - case RM_DIGEST_BLAKE2S: - case RM_DIGEST_BLAKE2B: - case RM_DIGEST_BLAKE2SP: - case RM_DIGEST_BLAKE2BP: - /* for all of the above, reading the digest is destructive, so we - * need to take a copy */ - return TRUE; - case RM_DIGEST_SPOOKY32: - case RM_DIGEST_SPOOKY64: - case RM_DIGEST_SPOOKY: - case RM_DIGEST_MURMUR: - case RM_DIGEST_CITY: - case RM_DIGEST_XXHASH: - case RM_DIGEST_FARMHASH: - case RM_DIGEST_CUMULATIVE: - case RM_DIGEST_EXT: - case RM_DIGEST_PARANOID: - return FALSE; - default: - rm_assert_gentle_not_reached(); - return FALSE; - } -} - guint8 *rm_digest_steal(RmDigest *digest) { - const RmDigestSpec *spec = digest_specs[digest->type]; + const RmDigestSpec *spec = rm_digest_spec(digest->type); if(!spec->steal) { return g_slice_copy(digest->bytes, digest->checksum); } @@ -855,6 +811,8 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { return false; } + const RmDigestSpec *spec = rm_digest_spec(a->type); + if(a->type == RM_DIGEST_PARANOID) { if(!a->paranoid->buffers) { /* buffers have been freed so we need to rely on shadow hash */ @@ -886,7 +844,7 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { } return (!a_iter && !b_iter && bytes == a->bytes); - } else if(rm_digest_needs_steal(a->type)) { + } else if(spec->steal) { guint8 *buf_a = rm_digest_steal(a); guint8 *buf_b = rm_digest_steal(b); gboolean result = !memcmp(buf_a, buf_b, a->bytes); diff --git a/lib/checksum.h b/lib/checksum.h index 45050212..cbd182b0 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -33,6 +33,7 @@ #include "checksums/blake2/blake2.h" #include "checksums/sha3/sha3.h" #include "checksums/farmhash.h" +#include "checksums/highwayhash.h" typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, @@ -59,7 +60,9 @@ typedef enum 
RmDigestType { /* special kids in town */ RM_DIGEST_CUMULATIVE, /* hash([a, b]) = hash([b, a]) */ RM_DIGEST_EXT, /* read hash as string */ - RM_DIGEST_PARANOID /* direct block comparisons */ + RM_DIGEST_PARANOID, /* direct block comparisons */ + /* sentinel */ + RM_DIGEST_SENTINEL, } RmDigestType; typedef struct RmUint128 { From 9f42dda82f3b4580abc25eaaf4019caa9fa1277b Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 22:02:32 +1000 Subject: [PATCH 086/180] checksum: add highway hashes --- lib/checksum.c | 55 ++++++++ lib/checksum.h | 4 + lib/checksums/highwayhash.c | 253 ++++++++++++++++++++++++++++++++++++ lib/checksums/highwayhash.h | 93 +++++++++++++ 4 files changed, 405 insertions(+) create mode 100644 lib/checksums/highwayhash.c create mode 100644 lib/checksums/highwayhash.h diff --git a/lib/checksum.c b/lib/checksum.c index f421fd75..75d9f5d1 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -285,6 +285,55 @@ static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *d static const RmDigestSpec cumulative_spec = { "cumulative", 128, GENERIC_FUNCS(cumulative)}; +/////////////////////////// +// highway hash // +/////////////////////////// + +static void rm_digest_highway_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + uint64_t key[4] = {1, 2, 3, 4}; + if(seed1) { + key[0] = (uint64_t)seed1; + } + if(seed2) { + key[2] = (uint64_t)seed2; + } + + digest->highway_cat = g_slice_alloc0(sizeof(HighwayHashCat)); + HighwayHashCatStart(key, digest->highway_cat); +} + +static void rm_digest_highway_free(RmDigest *digest) { + g_slice_free(HighwayHashCat, digest->highway_cat); +} + +static void rm_digest_highway_update(RmDigest *digest, const unsigned char *data, RmOff size) { + HighwayHashCatAppend((const uint8_t*)data, size, digest->highway_cat); +} + +static void rm_digest_highway_copy(RmDigest *digest, RmDigest *copy) { + copy->glib_checksum = 
g_slice_copy(sizeof(HighwayHashCat), digest->highway_cat); +} + +/* HighwayHashCatFinish functions are non-destructive */ +static void rm_digest_highway256_steal(RmDigest *digest, guint8 *result) { + HighwayHashCatFinish256(digest->highway_cat, (uint64_t*)result); +} + +static void rm_digest_highway128_steal(RmDigest *digest, guint8 *result) { + HighwayHashCatFinish128(digest->highway_cat, (uint64_t*)result); +} + +static void rm_digest_highway64_steal(RmDigest *digest, guint8 *result) { + *result = HighwayHashCatFinish64(digest->highway_cat); +} + +#define HIGHWAY_SPEC(BITS) "highway##BITS", BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, rm_digest_highway_copy, rm_digest_highway##BITS##_steal + +static const RmDigestSpec highway256_spec = {HIGHWAY_SPEC(256)}; +static const RmDigestSpec highway128_spec = {HIGHWAY_SPEC(128)}; +static const RmDigestSpec highway64_spec = {HIGHWAY_SPEC(64)}; + + /////////////////////////// // glib hashes // /////////////////////////// @@ -539,6 +588,9 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { [RM_DIGEST_PARANOID] = &paranoid_spec, [RM_DIGEST_FARMHASH] = &farmhash_spec, [RM_DIGEST_XXHASH] = &xxhash_spec, + [RM_DIGEST_HIGHWAY64] = &highway64_spec, + [RM_DIGEST_HIGHWAY128] = &highway128_spec, + [RM_DIGEST_HIGHWAY256] = &highway256_spec, }; if(type >= RM_DIGEST_SENTINEL) { @@ -576,6 +628,9 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { {"cumulative", RM_DIGEST_CUMULATIVE}, {"paranoid", RM_DIGEST_PARANOID}, {"city", RM_DIGEST_CITY}, + {"highway64", RM_DIGEST_HIGHWAY64}, + {"highway128", RM_DIGEST_HIGHWAY128}, + {"highway256", RM_DIGEST_HIGHWAY256}, #if HAVE_SHA512 {"sha512", RM_DIGEST_SHA512}, #endif diff --git a/lib/checksum.h b/lib/checksum.h index cbd182b0..88adcf5b 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -56,6 +56,9 @@ typedef enum RmDigestType { RM_DIGEST_BLAKE2XS, RM_DIGEST_XXHASH, RM_DIGEST_FARMHASH, + RM_DIGEST_HIGHWAY64, +
RM_DIGEST_HIGHWAY128, + RM_DIGEST_HIGHWAY256, /* special kids in town */ RM_DIGEST_CUMULATIVE, /* hash([a, b]) = hash([b, a]) */ @@ -104,6 +107,7 @@ typedef struct RmDigest { blake2b_state *blake2b_state; blake2sp_state *blake2sp_state; blake2bp_state *blake2bp_state; + HighwayHashCat *highway_cat; sha3_context *sha3_ctx; RmUint128 *checksum; uint128_t *farmhash; diff --git a/lib/checksums/highwayhash.c b/lib/checksums/highwayhash.c new file mode 100644 index 00000000..760553d3 --- /dev/null +++ b/lib/checksums/highwayhash.c @@ -0,0 +1,253 @@ +#include "highwayhash.h" + +#include +#include +#include + +/* +This code is compatible with C90 with the additional requirement of +supporting uint64_t. +*/ + +/*////////////////////////////////////////////////////////////////////////////*/ +/* Internal implementation */ +/*////////////////////////////////////////////////////////////////////////////*/ + +void HighwayHashReset(const uint64_t key[4], HighwayHashState* state) { + state->mul0[0] = 0xdbe6d5d5fe4cce2full; + state->mul0[1] = 0xa4093822299f31d0ull; + state->mul0[2] = 0x13198a2e03707344ull; + state->mul0[3] = 0x243f6a8885a308d3ull; + state->mul1[0] = 0x3bd39e10cb0ef593ull; + state->mul1[1] = 0xc0acf169b5f18a8cull; + state->mul1[2] = 0xbe5466cf34e90c6cull; + state->mul1[3] = 0x452821e638d01377ull; + state->v0[0] = state->mul0[0] ^ key[0]; + state->v0[1] = state->mul0[1] ^ key[1]; + state->v0[2] = state->mul0[2] ^ key[2]; + state->v0[3] = state->mul0[3] ^ key[3]; + state->v1[0] = state->mul1[0] ^ ((key[0] >> 32) | (key[0] << 32)); + state->v1[1] = state->mul1[1] ^ ((key[1] >> 32) | (key[1] << 32)); + state->v1[2] = state->mul1[2] ^ ((key[2] >> 32) | (key[2] << 32)); + state->v1[3] = state->mul1[3] ^ ((key[3] >> 32) | (key[3] << 32)); +} + +static void ZipperMergeAndAdd(const uint64_t v1, const uint64_t v0, uint64_t* add1, + uint64_t* add0) { + *add0 += (((v0 & 0xff000000ull) | (v1 & 0xff00000000ull)) >> 24) | + (((v0 & 0xff0000000000ull) | (v1 & 0xff000000000000ull)) 
>> 16) | + (v0 & 0xff0000ull) | ((v0 & 0xff00ull) << 32) | + ((v1 & 0xff00000000000000ull) >> 8) | (v0 << 56); + *add1 += (((v1 & 0xff000000ull) | (v0 & 0xff00000000ull)) >> 24) | + (v1 & 0xff0000ull) | ((v1 & 0xff0000000000ull) >> 16) | + ((v1 & 0xff00ull) << 24) | ((v0 & 0xff000000000000ull) >> 8) | + ((v1 & 0xffull) << 48) | (v0 & 0xff00000000000000ull); +} + +static void Update(const uint64_t lanes[4], HighwayHashState* state) { + int i; + for(i = 0; i < 4; ++i) { + state->v1[i] += state->mul0[i] + lanes[i]; + state->mul0[i] ^= (state->v1[i] & 0xffffffff) * (state->v0[i] >> 32); + state->v0[i] += state->mul1[i]; + state->mul1[i] ^= (state->v0[i] & 0xffffffff) * (state->v1[i] >> 32); + } + ZipperMergeAndAdd(state->v1[1], state->v1[0], &state->v0[1], &state->v0[0]); + ZipperMergeAndAdd(state->v1[3], state->v1[2], &state->v0[3], &state->v0[2]); + ZipperMergeAndAdd(state->v0[1], state->v0[0], &state->v1[1], &state->v1[0]); + ZipperMergeAndAdd(state->v0[3], state->v0[2], &state->v1[3], &state->v1[2]); +} + +static uint64_t Read64(const uint8_t* src) { + return (uint64_t)src[0] | ((uint64_t)src[1] << 8) | ((uint64_t)src[2] << 16) | + ((uint64_t)src[3] << 24) | ((uint64_t)src[4] << 32) | + ((uint64_t)src[5] << 40) | ((uint64_t)src[6] << 48) | ((uint64_t)src[7] << 56); +} + +void HighwayHashUpdatePacket(const uint8_t* packet, HighwayHashState* state) { + uint64_t lanes[4]; + lanes[0] = Read64(packet + 0); + lanes[1] = Read64(packet + 8); + lanes[2] = Read64(packet + 16); + lanes[3] = Read64(packet + 24); + Update(lanes, state); +} + +static void Rotate32By(uint64_t count, uint64_t lanes[4]) { + int i; + for(i = 0; i < 4; ++i) { + uint32_t half0 = lanes[i] & 0xffffffff; + uint32_t half1 = (lanes[i] >> 32); + lanes[i] = (half0 << count) | (half0 >> (32 - count)); + lanes[i] |= (uint64_t)((half1 << count) | (half1 >> (32 - count))) << 32; + } +} + +void HighwayHashUpdateRemainder(const uint8_t* bytes, const size_t size_mod32, + HighwayHashState* state) { + int i; + const 
size_t size_mod4 = size_mod32 & 3; + const uint8_t* remainder = bytes + (size_mod32 & ~3); + uint8_t packet[32] = {0}; + for(i = 0; i < 4; ++i) { + state->v0[i] += ((uint64_t)size_mod32 << 32) + size_mod32; + } + Rotate32By(size_mod32, state->v1); + for(i = 0; i < remainder - bytes; i++) { + packet[i] = bytes[i]; + } + if(size_mod32 & 16) { + for(i = 0; i < 4; i++) { + packet[28 + i] = remainder[i + size_mod4 - 4]; + } + } else { + if(size_mod4) { + packet[16 + 0] = remainder[0]; + packet[16 + 1] = remainder[size_mod4 >> 1]; + packet[16 + 2] = remainder[size_mod4 - 1]; + } + } + HighwayHashUpdatePacket(packet, state); +} + +static void Permute(const uint64_t v[4], uint64_t* permuted) { + permuted[0] = (v[2] >> 32) | (v[2] << 32); + permuted[1] = (v[3] >> 32) | (v[3] << 32); + permuted[2] = (v[0] >> 32) | (v[0] << 32); + permuted[3] = (v[1] >> 32) | (v[1] << 32); +} + +void PermuteAndUpdate(HighwayHashState* state) { + uint64_t permuted[4]; + Permute(state->v0, permuted); + Update(permuted, state); +} + +static void FinalPermutes(HighwayHashState* state) { + PermuteAndUpdate(state); + PermuteAndUpdate(state); + PermuteAndUpdate(state); + PermuteAndUpdate(state); +} + +static void ModularReduction(uint64_t a3_unmasked, uint64_t a2, uint64_t a1, uint64_t a0, + uint64_t* m1, uint64_t* m0) { + uint64_t a3 = a3_unmasked & 0x3FFFFFFFFFFFFFFFull; + *m1 = a1 ^ ((a3 << 1) | (a2 >> 63)) ^ ((a3 << 2) | (a2 >> 62)); + *m0 = a0 ^ (a2 << 1) ^ (a2 << 2); +} + +uint64_t HighwayHashFinalize64(HighwayHashState* state) { + FinalPermutes(state); + return state->v0[0] + state->v1[0] + state->mul0[0] + state->mul1[0]; +} + +void HighwayHashFinalize128(HighwayHashState* state, uint64_t hash[2]) { + FinalPermutes(state); + hash[0] = state->v0[0] + state->mul0[0] + state->v1[2] + state->mul1[2]; + hash[1] = state->v0[1] + state->mul0[1] + state->v1[3] + state->mul1[3]; +} + +void HighwayHashFinalize256(HighwayHashState* state, uint64_t hash[4]) { + FinalPermutes(state); + 
ModularReduction(state->v1[1] + state->mul1[1], state->v1[0] + state->mul1[0], + state->v0[1] + state->mul0[1], state->v0[0] + state->mul0[0], + &hash[1], &hash[0]); + ModularReduction(state->v1[3] + state->mul1[3], state->v1[2] + state->mul1[2], + state->v0[3] + state->mul0[3], state->v0[2] + state->mul0[2], + &hash[3], &hash[2]); +} + +/*////////////////////////////////////////////////////////////////////////////*/ +/* Non-cat API: single call on full data */ +/*////////////////////////////////////////////////////////////////////////////*/ + +static void ProcessAll(const uint8_t* data, size_t size, const uint64_t key[4], + HighwayHashState* state) { + size_t i; + HighwayHashReset(key, state); + for(i = 0; i + 32 <= size; i += 32) { + HighwayHashUpdatePacket(data + i, state); + } + if((size & 31) != 0) + HighwayHashUpdateRemainder(data + i, size & 31, state); +} + +uint64_t HighwayHash64(const uint8_t* data, size_t size, const uint64_t key[4]) { + HighwayHashState state; + ProcessAll(data, size, key, &state); + return HighwayHashFinalize64(&state); +} + +void HighwayHash128(const uint8_t* data, size_t size, const uint64_t key[4], + uint64_t hash[2]) { + HighwayHashState state; + ProcessAll(data, size, key, &state); + HighwayHashFinalize128(&state, hash); +} + +void HighwayHash256(const uint8_t* data, size_t size, const uint64_t key[4], + uint64_t hash[4]) { + HighwayHashState state; + ProcessAll(data, size, key, &state); + HighwayHashFinalize256(&state, hash); +} + +/*////////////////////////////////////////////////////////////////////////////*/ +/* Cat API: allows appending with multiple calls */ +/*////////////////////////////////////////////////////////////////////////////*/ + +void HighwayHashCatStart(const uint64_t key[4], HighwayHashCat* state) { + HighwayHashReset(key, &state->state); + state->num = 0; +} + +void HighwayHashCatAppend(const uint8_t* bytes, size_t num, HighwayHashCat* state) { + size_t i; + if(state->num != 0) { + size_t num_add = num > (32u 
- state->num) ? (32u - state->num) : num; + for(i = 0; i < num_add; i++) { + state->packet[state->num + i] = bytes[i]; + } + state->num += num_add; + num -= num_add; + bytes += num_add; + if(state->num == 32) { + HighwayHashUpdatePacket(state->packet, &state->state); + state->num = 0; + } + } + while(num >= 32) { + HighwayHashUpdatePacket(bytes, &state->state); + num -= 32; + bytes += 32; + } + for(i = 0; i < num; i++) { + state->packet[state->num] = bytes[i]; + state->num++; + } +} + +uint64_t HighwayHashCatFinish64(const HighwayHashCat* state) { + HighwayHashState copy = state->state; + if(state->num) { + HighwayHashUpdateRemainder(state->packet, state->num, &copy); + } + return HighwayHashFinalize64(&copy); +} + +void HighwayHashCatFinish128(const HighwayHashCat* state, uint64_t hash[2]) { + HighwayHashState copy = state->state; + if(state->num) { + HighwayHashUpdateRemainder(state->packet, state->num, &copy); + } + HighwayHashFinalize128(&copy, hash); +} + +void HighwayHashCatFinish256(const HighwayHashCat* state, uint64_t hash[4]) { + HighwayHashState copy = state->state; + if(state->num) { + HighwayHashUpdateRemainder(state->packet, state->num, &copy); + } + HighwayHashFinalize256(&copy, hash); +} diff --git a/lib/checksums/highwayhash.h b/lib/checksums/highwayhash.h new file mode 100644 index 00000000..845d6c3d --- /dev/null +++ b/lib/checksums/highwayhash.h @@ -0,0 +1,93 @@ +#ifndef C_HIGHWAYHASH_H_ +#define C_HIGHWAYHASH_H_ + +#include +#include + +#if defined(__cplusplus) || defined(c_plusplus) +extern "C" { +#endif + +/*////////////////////////////////////////////////////////////////////////////*/ +/* Low-level API, use for implementing streams etc...
*/ +/*////////////////////////////////////////////////////////////////////////////*/ + +typedef struct { + uint64_t v0[4]; + uint64_t v1[4]; + uint64_t mul0[4]; + uint64_t mul1[4]; +} HighwayHashState; + +/* Initializes state with given key */ +void HighwayHashReset(const uint64_t key[4], HighwayHashState* state); +/* Takes a packet of 32 bytes */ +void HighwayHashUpdatePacket(const uint8_t* packet, HighwayHashState* state); +/* Adds the final 1..31 bytes, do not use if 0 remain */ +void HighwayHashUpdateRemainder(const uint8_t* bytes, const size_t size_mod32, + HighwayHashState* state); +/* Compute final hash value. Makes state invalid. */ +uint64_t HighwayHashFinalize64(HighwayHashState* state); +void HighwayHashFinalize128(HighwayHashState* state, uint64_t hash[2]); +void HighwayHashFinalize256(HighwayHashState* state, uint64_t hash[4]); + +/*////////////////////////////////////////////////////////////////////////////*/ +/* Non-cat API: single call on full data */ +/*////////////////////////////////////////////////////////////////////////////*/ + +uint64_t HighwayHash64(const uint8_t* data, size_t size, const uint64_t key[4]); + +void HighwayHash128(const uint8_t* data, size_t size, const uint64_t key[4], + uint64_t hash[2]); + +void HighwayHash256(const uint8_t* data, size_t size, const uint64_t key[4], + uint64_t hash[4]); + +/*////////////////////////////////////////////////////////////////////////////*/ +/* Cat API: allows appending with multiple calls */ +/*////////////////////////////////////////////////////////////////////////////*/ + +typedef struct { + HighwayHashState state; + uint8_t packet[32]; + int num; +} HighwayHashCat; + +/* Allocates new state for a new streaming hash computation */ +void HighwayHashCatStart(const uint64_t key[4], HighwayHashCat* state); + +void HighwayHashCatAppend(const uint8_t* bytes, size_t num, HighwayHashCat* state); + +/* Computes final hash value */ +uint64_t HighwayHashCatFinish64(const HighwayHashCat* state); +void 
HighwayHashCatFinish128(const HighwayHashCat* state, uint64_t hash[2]); +void HighwayHashCatFinish256(const HighwayHashCat* state, uint64_t hash[4]); + +/* +Usage examples: +#include +#include +void Example64() { + uint64_t key[4] = {1, 2, 3, 4}; + const char* text = "Hello world!"; + size_t size = strlen(text); + uint64_t hash = HighwayHash64((const uint8_t*)text, size, key); + printf("%016"PRIx64"\n", hash); +} +void Example64Cat() { + uint64_t key[4] = {1, 2, 3, 4}; + HighwayHashCat state; + uint64_t hash; + HighwayHashCatStart(key, &state); + HighwayHashCatAppend((const uint8_t*)"Hello", 5, &state); + HighwayHashCatAppend((const uint8_t*)" world!", 7, &state); + hash = HighwayHashCatFinish64(state); + printf("%016"PRIx64"\n", hash); +} +*/ + +#if defined(__cplusplus) || defined(c_plusplus) +} /* extern "C" */ +#endif + +#endif // C_HIGHWAYHASH_H_ From 8e709bf7fd43ef71a146340eefea79978fce2d97 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 22:31:03 +1000 Subject: [PATCH 087/180] checksum: remove unused RM_DIGEST_BLAKE2XS --- lib/checksum.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/lib/checksum.h b/lib/checksum.h index 88adcf5b..7c942db1 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -53,13 +53,11 @@ typedef enum RmDigestType { RM_DIGEST_BLAKE2B, RM_DIGEST_BLAKE2SP /* Parallel version of BLAKE2P */, RM_DIGEST_BLAKE2BP /* Parallel version of BLAKE2S */, - RM_DIGEST_BLAKE2XS, RM_DIGEST_XXHASH, RM_DIGEST_FARMHASH, RM_DIGEST_HIGHWAY64, RM_DIGEST_HIGHWAY128, RM_DIGEST_HIGHWAY256, - /* special kids in town */ RM_DIGEST_CUMULATIVE, /* hash([a, b]) = hash([b, a]) */ RM_DIGEST_EXT, /* read hash as string */ From 085eb088e7000842e66046f76e4ae35a15fe4872 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 22:31:39 +1000 Subject: [PATCH 088/180] checksum: bugfixes for RmDigestSpecs --- lib/checksum.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 
75d9f5d1..691091f5 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -208,7 +208,7 @@ static void rm_digest_spooky_update(RmDigest *digest, const unsigned char *data, } #define GENERIC_FUNCS(ALGO) rm_digest_generic_init, rm_digest_generic_free, rm_digest_##ALGO##_update, rm_digest_generic_copy, NULL -static const RmDigestSpec spooky32_spec = { "spook32", 32, GENERIC_FUNCS(spooky32) }; +static const RmDigestSpec spooky32_spec = { "spooky32", 32, GENERIC_FUNCS(spooky32) }; static const RmDigestSpec spooky64_spec = { "spooky64", 64, GENERIC_FUNCS(spooky64) }; static const RmDigestSpec spooky_spec = { "spooky", 128, GENERIC_FUNCS(spooky) }; @@ -327,11 +327,11 @@ static void rm_digest_highway64_steal(RmDigest *digest, guint8 *result) { *result = HighwayHashCatFinish64(digest->highway_cat); } -#define HIGHWAY_SPEC(BITS) "highway##BITS", BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, rm_digest_highway_copy, rm_digest_highway##BITS##_steal +#define HIGHWAY_SPEC(BITS) BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, rm_digest_highway_copy, rm_digest_highway##BITS##_steal -static const RmDigestSpec highway256_spec = {HIGHWAY_SPEC(256)}; -static const RmDigestSpec highway128_spec = {HIGHWAY_SPEC(128)}; -static const RmDigestSpec highway64_spec = {HIGHWAY_SPEC(64)}; +static const RmDigestSpec highway256_spec = {"highway256", HIGHWAY_SPEC(256)}; +static const RmDigestSpec highway128_spec = {"highway128", HIGHWAY_SPEC(128)}; +static const RmDigestSpec highway64_spec = {"highway64", HIGHWAY_SPEC(64)}; /////////////////////////// @@ -432,11 +432,11 @@ static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { g_slice_free(sha3_context, copy); } -#define SHA3_SPEC(BITS) "sha3_##BITS", BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy, rm_digest_sha3_steal +#define SHA3_SPEC(BITS) BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, 
rm_digest_sha3_copy, rm_digest_sha3_steal -static const RmDigestSpec sha3_256_spec = { SHA3_SPEC(256)}; -static const RmDigestSpec sha3_384_spec = { SHA3_SPEC(384)}; -static const RmDigestSpec sha3_512_spec = { SHA3_SPEC(512)}; +static const RmDigestSpec sha3_256_spec = { "sha3-256", SHA3_SPEC(256)}; +static const RmDigestSpec sha3_384_spec = { "sha3-384", SHA3_SPEC(384)}; +static const RmDigestSpec sha3_512_spec = { "sha3-512", SHA3_SPEC(512)}; /////////////////////////// // blake hashes // @@ -593,12 +593,11 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { [RM_DIGEST_HIGHWAY256] = &highway256_spec, }; - if(type >= RM_DIGEST_SENTINEL) { - rm_assert_gentle_not_reached(); - return digest_specs[RM_DEFAULT_DIGEST]; + if(type < RM_DIGEST_SENTINEL && digest_specs[type]) { + return digest_specs[type]; } - - return digest_specs[type]; + rm_log_error_line("No digest spec for enum %i", type); + g_assert_not_reached(); } static gpointer rm_init_digest_type_table(GHashTable **code_table) { From 4f400e68a4e4420010cab185f1c2a20fdab8bfd9 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 22:31:50 +1000 Subject: [PATCH 089/180] tests: update hash list --- tests/utils.py | 28 ++++++++++++++++++++++++---- 1 file changed, 24 insertions(+), 4 deletions(-) diff --git a/tests/utils.py b/tests/utils.py index 6031c060..fa15541f 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -213,10 +213,30 @@ def run_rmlint_pedantic(*args, **kwargs): ] cksum_types = [ - 'paranoid', 'sha1', 'sha256', 'spooky', 'city', - 'md5', 'murmur', 'spooky32', 'spooky64', 'xxhash', 'farmhash', - 'sha3-256', 'sha3-384', 'sha3-512', - 'blake2s', 'blake2b', 'blake2sp', 'blake2bp', + 'murmur', + 'spooky', + 'spooky32', + 'spooky64', + 'city', + 'md5', + 'sha1', + 'sha256', + 'sha512', + 'sha3-256', + 'sha3-384', + 'sha3-512', + 'blake2s', + 'blake2b', + 'blake2sp', + 'blake2bp', + 'xxhash', + 'farmhash', + 'highway64', + 'highway128', + 'highway256', + #'cumulative', + #'ext', + 
'paranoid', ] # Note: sha512 is supported on all system which have From c95e42f75aaa88665628dc8c108c61202b7e07e9 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 22:48:17 +1000 Subject: [PATCH 090/180] checksum: simplify rm_init_digest_type_table() --- lib/checksum.c | 58 +++++++++++++------------------------------------- 1 file changed, 15 insertions(+), 43 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 691091f5..5a8c075a 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -600,52 +600,24 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { g_assert_not_reached(); } +static void rm_digest_table_insert(GHashTable *code_table, char *name, RmDigestType type) { + if(g_hash_table_contains(code_table, name)) { + rm_log_error_line("Duplicate entry for %s in rm_init_digest_type_table()", name); + } + g_hash_table_insert(code_table, name, GUINT_TO_POINTER(type)); +} + static gpointer rm_init_digest_type_table(GHashTable **code_table) { - static struct { - char *name; - RmDigestType code; - } code_entries[] = { - {"md5", RM_DIGEST_MD5}, - {"xxhash", RM_DIGEST_XXHASH}, - {"farmhash", RM_DIGEST_FARMHASH}, - {"murmur", RM_DIGEST_MURMUR}, - {"sha1", RM_DIGEST_SHA1}, - {"sha256", RM_DIGEST_SHA256}, - {"sha3", RM_DIGEST_SHA3_256}, - {"sha3-256", RM_DIGEST_SHA3_256}, - {"sha3-384", RM_DIGEST_SHA3_384}, - {"sha3-512", RM_DIGEST_SHA3_512}, - {"blake2s", RM_DIGEST_BLAKE2S}, - {"blake2b", RM_DIGEST_BLAKE2B}, - {"blake2sp", RM_DIGEST_BLAKE2SP}, - {"blake2bp", RM_DIGEST_BLAKE2BP}, - {"spooky32", RM_DIGEST_SPOOKY32}, - {"spooky64", RM_DIGEST_SPOOKY64}, - {"spooky128", RM_DIGEST_SPOOKY}, - {"spooky", RM_DIGEST_SPOOKY}, - {"ext", RM_DIGEST_EXT}, - {"cumulative", RM_DIGEST_CUMULATIVE}, - {"paranoid", RM_DIGEST_PARANOID}, - {"city", RM_DIGEST_CITY}, - {"highway64", RM_DIGEST_HIGHWAY64}, - {"highway128", RM_DIGEST_HIGHWAY128}, - {"highway256", RM_DIGEST_HIGHWAY256}, -#if HAVE_SHA512 - {"sha512", RM_DIGEST_SHA512}, -#endif - }; *code_table = 
g_hash_table_new(g_str_hash, g_str_equal); - - const size_t n_codes = sizeof(code_entries) / sizeof(code_entries[0]); - for(size_t idx = 0; idx < n_codes; idx++) { - if(g_hash_table_contains(*code_table, code_entries[idx].name)) { - rm_log_error_line("Duplicate entry for %s", code_entries[idx].name); - } - g_hash_table_insert(*code_table, - code_entries[idx].name, - GUINT_TO_POINTER(code_entries[idx].code)); - } + for(RmDigestType type=1; typename, type); + } + + /* add some synonyms */ + rm_digest_table_insert(*code_table, "sha3", RM_DIGEST_SHA3_256); + rm_digest_table_insert(*code_table, "spooky128", RM_DIGEST_SPOOKY); + rm_digest_table_insert(*code_table, "highway", RM_DIGEST_HIGHWAY256); return NULL; } From b42f87c1bc58a0792e0403c71161467571ef1bf4 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 23:10:30 +1000 Subject: [PATCH 091/180] checksum: update warning --- lib/checksum.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 5a8c075a..4a64bf36 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -25,9 +25,8 @@ /* Welcome to hell! * - * This file is 90% boring switch statements with innocent, but insane code - * squashed between. Modify this file with care and make sure to test all - * checksums afterwards. + * This file is mostly boring code except for the paranoid digest + * optimisations which are pretty insane. 
**/ #include From 13c80164d3ca0beb1f3aef4900dea143831d6060 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 9 Nov 2017 23:13:51 +1000 Subject: [PATCH 092/180] cmdline: update paranoia scale --- lib/cmdline.c | 6 +----- lib/config.h.in | 2 +- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index a3c5ec45..dc84af92 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -762,11 +762,7 @@ static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, /* leave users choice of -a (default) */ break; case 1: -#if HAVE_SHA512 - cfg->checksum_type = RM_DIGEST_SHA512; -#else - cfg->checksum_type = RM_DIGEST_SHA256; -#endif + cfg->checksum_type = RM_DIGEST_BLAKE2B; break; case 2: cfg->checksum_type = RM_DIGEST_PARANOID; diff --git a/lib/config.h.in b/lib/config.h.in index 384315de..187456a2 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -23,7 +23,7 @@ #define HAVE_UNAME ({HAVE_UNAME}) #define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) -#define RM_DEFAULT_DIGEST RM_DIGEST_BLAKE2B +#define RM_DEFAULT_DIGEST RM_DIGEST_HIGHWAY #define RM_VERSION "{VERSION_MAJOR}.{VERSION_MINOR}.{VERSION_PATCH}" #define RM_VERSION_MAJOR {VERSION_MAJOR} #define RM_VERSION_MINOR {VERSION_MINOR} From 4b8166a88b0905bdd70d2ef4fdf1566e6dceecda Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 10 Nov 2017 18:37:23 +1000 Subject: [PATCH 093/180] hash-utility: change syntax to match main rmlint syntax --- lib/hash-utility.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/hash-utility.c b/lib/hash-utility.c index 6e2700fc..bdf56888 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -136,7 +136,7 @@ int rm_hasher_main(int argc, const char **argv) { /* clang-format off */ const GOptionEntry entries[] = { - {"digest-type" , 'd' , 0 , G_OPTION_ARG_CALLBACK , (GOptionArgFunc)rm_hasher_parse_type , _("Digest type [BLAKE2B]") , "[TYPE]"} , + {"algorithm" , 'a' , 0 , G_OPTION_ARG_CALLBACK , 
(GOptionArgFunc)rm_hasher_parse_type , _("Digest type [BLAKE2B]") , "[TYPE]"} , {"num-threads" , 't' , 0 , G_OPTION_ARG_INT , &threads , _("Number of hashing threads [8]") , "N"} , {"multihash" , 'm' , 0 , G_OPTION_ARG_NONE , &tag.print_multihash , _("Print hash as self identifying multihash") , NULL} , {"buffer-mbytes" , 'b' , 0 , G_OPTION_ARG_INT64 , &buffer_mbytes , _("Megabytes read buffer [256 MB]") , "MB"} , From d0c9e28277428b523b3476815d31ee977df273d7 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 10 Nov 2017 18:39:23 +1000 Subject: [PATCH 094/180] hash-utility: provide option to set increment buffer size --- lib/hash-utility.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/lib/hash-utility.c b/lib/hash-utility.c index bdf56888..313b52e5 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -130,6 +130,7 @@ int rm_hasher_main(int argc, const char **argv) { tag.digest_type = RM_DEFAULT_DIGEST; gint threads = 8; gint64 buffer_mbytes = 256; + guint64 increment = 4096; ////////////// Option Parsing /////////////// @@ -140,6 +141,7 @@ int rm_hasher_main(int argc, const char **argv) { {"num-threads" , 't' , 0 , G_OPTION_ARG_INT , &threads , _("Number of hashing threads [8]") , "N"} , {"multihash" , 'm' , 0 , G_OPTION_ARG_NONE , &tag.print_multihash , _("Print hash as self identifying multihash") , NULL} , {"buffer-mbytes" , 'b' , 0 , G_OPTION_ARG_INT64 , &buffer_mbytes , _("Megabytes read buffer [256 MB]") , "MB"} , + {"increment" , 'x' , G_OPTION_FLAG_HIDDEN , G_OPTION_ARG_INT64 , &increment , _("bytes to hash at a time [4096]") , "MB"} , {"ignore-order" , 'i' , G_OPTION_FLAG_REVERSE , G_OPTION_ARG_NONE , &tag.print_in_order , _("Print hashes in order completed, not in order entered (reduces memory usage)") , NULL} , {"" , 0 , 0 , G_OPTION_ARG_FILENAME_ARRAY , &tag.paths , _("Space-separated list of files") , "[FILE…]"} , {NULL , 0 , 0 , 0 , NULL , NULL , NULL}}; @@ -210,7 +212,7 @@ int rm_hasher_main(int argc, const char 
**argv) { RmHasher *hasher = rm_hasher_new(tag.digest_type, threads, FALSE, - 4096, + increment, 1024 * 1024 * buffer_mbytes, (RmHasherCallback)rm_hasher_callback, &tag); From 1e6c79c31c228d4035573cb5deecec89ebf0df35 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 10 Nov 2017 18:39:51 +1000 Subject: [PATCH 095/180] tests: export CKSUM_TYPES --- tests/utils.py | 55 +++++++++++++++++++++++++------------------------- 1 file changed, 28 insertions(+), 27 deletions(-) diff --git a/tests/utils.py b/tests/utils.py index fa15541f..3d7eb322 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -15,6 +15,33 @@ TESTDIR_NAME = os.getenv('RM_TS_DIR') or '/tmp/rmlint-unit-testdir' +CKSUM_TYPES = [ + 'murmur', + 'spooky', + 'spooky32', + 'spooky64', + 'city', + 'md5', + 'sha1', + 'sha256', + 'sha512', + 'sha3-256', + 'sha3-384', + 'sha3-512', + 'blake2s', + 'blake2b', + 'blake2sp', + 'blake2bp', + 'xxhash', + 'farmhash', + 'highway64', + 'highway128', + 'highway256', + #'cumulative', + #'ext', + 'paranoid', +] + def runs_as_root(): return os.geteuid() is 0 @@ -212,39 +239,13 @@ def run_rmlint_pedantic(*args, **kwargs): '--no-mount-table' ] - cksum_types = [ - 'murmur', - 'spooky', - 'spooky32', - 'spooky64', - 'city', - 'md5', - 'sha1', - 'sha256', - 'sha512', - 'sha3-256', - 'sha3-384', - 'sha3-512', - 'blake2s', - 'blake2b', - 'blake2sp', - 'blake2bp', - 'xxhash', - 'farmhash', - 'highway64', - 'highway128', - 'highway256', - #'cumulative', - #'ext', - 'paranoid', - ] # Note: sha512 is supported on all system which have # no recent enough glib with. God forsaken debian people. 
if has_feature('sha512'): cksum_types.append('sha512') - for cksum_type in cksum_types: + for cksum_type in CKSUM_TYPES: options.append('--algorithm=' + cksum_type) data = None From e7674d4d1c8bcd4aed5ea9d15fd1e9d222bc8093 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 10 Nov 2017 18:40:33 +1000 Subject: [PATCH 096/180] tests: add (partial) unit test for rmlint --hash --- tests/test_mains/test_hash.py | 70 +++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 tests/test_mains/test_hash.py diff --git a/tests/test_mains/test_hash.py b/tests/test_mains/test_hash.py new file mode 100644 index 00000000..126666b8 --- /dev/null +++ b/tests/test_mains/test_hash.py @@ -0,0 +1,70 @@ +#!/usr/bin/env python3 +# encoding: utf-8 + +from nose import with_setup +from tests.utils import * + +INCREMENTS = [4096, 1024, 1, 20000] + +def streaming_compliance_check(*patterns): + # a valid hash function streaming function should satisfy hash('a', 'b', 'c') == hash('abc') + + a = create_file('1' * 10000, 'a') + + algos = [] + for pattern in patterns: + algos += [algo for algo in CKSUM_TYPES if pattern in algo] + + cmd = './rmlint --hash --increment {increment} --algorithm {algo} {path}' + + for algo in algos: + command = cmd.format(increment=INCREMENTS[0], algo=algo, path=a) + output0 = subprocess.check_output(command.split()) + for increment in INCREMENTS[1:]: + command = cmd.format(increment=increment, algo=algo, path=a) + output = subprocess.check_output(command.split()) + if(output!=output0): + assert False, "{} fails streaming test with increment {}".format(algo, increment) + break + + +@with_setup(usual_setup_func, usual_teardown_func) +def test_spooky(): + streaming_compliance_check('spooky') + +@with_setup(usual_setup_func, usual_teardown_func) +def test_city(): + streaming_compliance_check('city') + +@with_setup(usual_setup_func, usual_teardown_func) +def test_murmur(): + streaming_compliance_check('murmur') + +@with_setup(usual_setup_func, 
usual_teardown_func) +def test_glib(): + streaming_compliance_check('md5', 'sha1', 'sha256', 'sha512') + +@with_setup(usual_setup_func, usual_teardown_func) +def test_sha3(): + streaming_compliance_check('sha3') + +@with_setup(usual_setup_func, usual_teardown_func) +def test_blake(): + streaming_compliance_check('blake') + +@with_setup(usual_setup_func, usual_teardown_func) +def test_xx(): + streaming_compliance_check('xxhash') + +@with_setup(usual_setup_func, usual_teardown_func) +def test_farm(): + streaming_compliance_check('farm') + +@with_setup(usual_setup_func, usual_teardown_func) +def test_highway(): + streaming_compliance_check('highway') + +@with_setup(usual_setup_func, usual_teardown_func) +def test_cumulative(): + streaming_compliance_check('cumulative') + From ba4286368071cefd2522582ad3f6333d91bd2a2a Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 10 Nov 2017 18:46:05 +1000 Subject: [PATCH 097/180] checksum: don't expose underlying checksum structures --- lib/checksum.c | 258 +++++++++++++++++++++++++----------------------- lib/checksum.h | 17 +--- lib/shredder.c | 2 +- lib/treemerge.c | 9 +- lib/xattr.c | 7 +- 5 files changed, 140 insertions(+), 153 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 4a64bf36..16e3953c 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -145,8 +145,8 @@ typedef void (*RmDigestStealFunc)(RmDigest *digest, guint8 *result); typedef struct RmDigestSpec { const char *name; - const int bits; - RmDigestInitFunc init; + const uint bits; // length of the output checksum in bits + RmDigestInitFunc init; // performs initialisation of digest->state RmDigestFreeFunc free; RmDigestUpdateFunc update; RmDigestCopyFunc copy; @@ -167,43 +167,49 @@ static void rm_digest_generic_init(RmDigest *digest, RmOff seed1, RmOff seed2, _ /* Cannot go lower than 8, since we read 8 byte in some places. 
* For some checksums this may mean trailing zeros in the unused bytes */ - digest->checksum = g_slice_alloc0(ALLOC_BYTES(digest->bytes)); + digest->state = g_slice_alloc0(ALLOC_BYTES(digest->bytes)); if(seed1 && seed2) { /* copy seeds to checksum */ size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes / 2); - memcpy(digest->checksum, &seed1, seed_bytes); - memcpy(digest->checksum + digest->bytes/2, &seed2, seed_bytes); + memcpy(digest->state, &seed1, seed_bytes); + memcpy(digest->state + digest->bytes/2, &seed2, seed_bytes); } else if(seed1) { size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes); - memcpy(digest->checksum, &seed1, seed_bytes); + memcpy(digest->state, &seed1, seed_bytes); } } static void rm_digest_generic_free(RmDigest *digest) { - if(digest->checksum) { - g_slice_free1(digest->bytes, digest->checksum); + if(digest->state) { + g_slice_free1(digest->bytes, digest->state); + digest->state = NULL; } } static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) { - copy->checksum = g_slice_copy(ALLOC_BYTES(digest->bytes), digest->checksum); + copy->state = g_slice_copy(ALLOC_BYTES(digest->bytes), digest->state); } /////////////////////////// // spooky hashes // /////////////////////////// +/* TODO: this is broken; need to extend spooky API to add a streaming variant */ + static void rm_digest_spooky32_update(RmDigest *digest, const unsigned char *data, RmOff size) { - digest->checksum->first = spooky_hash32(data, size, digest->checksum->first); + uint32_t* hash = digest->state; + *hash = spooky_hash32(data, size, *hash); } static void rm_digest_spooky64_update(RmDigest *digest, const unsigned char *data, RmOff size) { - digest->checksum->first = spooky_hash64(data, size, digest->checksum->first); + uint64_t* hash = digest->state; + *hash = spooky_hash64(data, size, *hash); } static void rm_digest_spooky_update(RmDigest *digest, const unsigned char *data, RmOff size) { - spooky_hash128(data, size, (uint64_t *)&digest->checksum->first, (uint64_t 
*)&digest->checksum->second); + uint128 *hash = digest->state; + spooky_hash128(data, size, &hash->first, &hash->second); } #define GENERIC_FUNCS(ALGO) rm_digest_generic_init, rm_digest_generic_free, rm_digest_##ALGO##_update, rm_digest_generic_copy, NULL @@ -216,8 +222,11 @@ static const RmDigestSpec spooky_spec = { "spooky", 128, GENERIC_FUNCS(spooky // xxhash // /////////////////////////// +/* TODO: this is probably broken; should use streaming variant XXH64_update() */ + static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, RmOff size) { - digest->checksum->first = XXH64(data, size, digest->checksum->first); + unsigned long long *hash = digest->state; + *hash = XXH64(data, size, *hash); } static const RmDigestSpec xxhash_spec = { "xxhash", 64, GENERIC_FUNCS(xxhash)}; @@ -226,8 +235,11 @@ static const RmDigestSpec xxhash_spec = { "xxhash", 64, GENERIC_FUNCS(xxhash)}; // farmhash // /////////////////////////// +/* TODO: check that this is not broken, i.e. final hash is independent of increment size */ + static void rm_digest_farmhash_update(RmDigest *digest, const unsigned char *data, RmOff size) { - *digest->farmhash = farmhash128_with_seed((const char*)data, size, *digest->farmhash); + uint128_t *hash = digest->state; + *hash = farmhash128_with_seed((const char*)data, size, *hash); } static const RmDigestSpec farmhash_spec = { "farmhash", 64, GENERIC_FUNCS(farmhash)}; @@ -238,10 +250,12 @@ static const RmDigestSpec farmhash_spec = { "farmhash", 64, GENERIC_FUNCS(farmh static void rm_digest_murmur_update(RmDigest *digest, const unsigned char *data, RmOff size) { + /* TODO: this is broken; need to extend murmur API to add a streaming variant */ + uint32_t *hash = digest->state; #if RM_PLATFORM_32 - MurmurHash3_x86_128(data, size, (uint32_t)digest->checksum->first, digest->checksum); + MurmurHash3_x86_128(data, size, *hash, hash); #elif RM_PLATFORM_64 - MurmurHash3_x64_128(data, size, (uint32_t)digest->checksum->first, digest->checksum); 
+ MurmurHash3_x64_128(data, size, *hash, hash); #else #error "Probably not a good idea to compile rmlint on 16bit." #endif @@ -254,16 +268,18 @@ static const RmDigestSpec murmur_spec = { "murmur", 128, GENERIC_FUNCS(murmur)} /////////////////////////// static void rm_digest_city_update(RmDigest *digest, const unsigned char *data, RmOff size) { + + /* TODO: check that this is not broken, i.e. final hash is independent of increment size */ + /* There is a more optimized version but it needs the crc command of sse4.2 * (available on Intel Nehalem and up; my amd box doesn't have this though) */ - uint128 old = {digest->checksum->first, digest->checksum->second}; + uint128 *hash = digest->state; #ifdef __SSE4_2__ - old = CityHashCrc128WithSeed((const char *)data, size, old); + *hash = CityHashCrc128WithSeed((const char *)data, size, *hash); #else - old = CityHash128WithSeed((const char *)data, size, old); + *hash = CityHash128WithSeed((const char *)data, size, *hash); #endif - memcpy(digest->checksum, &old, sizeof(uint128)); } static const RmDigestSpec city_spec = { "city", 128, GENERIC_FUNCS(city)}; @@ -275,9 +291,10 @@ static const RmDigestSpec city_spec = { "city", 128, GENERIC_FUNCS(city)}; static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *data, RmOff size) { /* This only XORS the two checksums. 
*/ - size = MIN(size, digest->bytes); - for(gsize i = 0; i < size; ++i) { - digest->data[i] ^= ((guint8 *)data)[i % size]; + guint8 *hash = digest->state; + RmOff bytes = MIN(size, digest->bytes); + for(gsize i = 0; i < bytes; ++i) { + hash[i] ^= ((guint8 *)data)[i % size]; } } @@ -297,33 +314,33 @@ static void rm_digest_highway_init(RmDigest *digest, RmOff seed1, RmOff seed2, _ key[2] = (uint64_t)seed2; } - digest->highway_cat = g_slice_alloc0(sizeof(HighwayHashCat)); - HighwayHashCatStart(key, digest->highway_cat); + digest->state = g_slice_alloc0(sizeof(HighwayHashCat)); + HighwayHashCatStart(key, digest->state); } static void rm_digest_highway_free(RmDigest *digest) { - g_slice_free(HighwayHashCat, digest->highway_cat); + g_slice_free(HighwayHashCat, digest->state); } static void rm_digest_highway_update(RmDigest *digest, const unsigned char *data, RmOff size) { - HighwayHashCatAppend((const uint8_t*)data, size, digest->highway_cat); + HighwayHashCatAppend((const uint8_t*)data, size, digest->state); } static void rm_digest_highway_copy(RmDigest *digest, RmDigest *copy) { - copy->glib_checksum = g_slice_copy(sizeof(HighwayHashCat), digest->highway_cat); + copy->state = g_slice_copy(sizeof(HighwayHashCat), digest->state); } /* HighwayHashCatFinish functions are non-destructive */ static void rm_digest_highway256_steal(RmDigest *digest, guint8 *result) { - HighwayHashCatFinish256(digest->highway_cat, (uint64_t*)result); + HighwayHashCatFinish256(digest->state, (uint64_t*)result); } static void rm_digest_highway128_steal(RmDigest *digest, guint8 *result) { - HighwayHashCatFinish128(digest->highway_cat, (uint64_t*)result); + HighwayHashCatFinish128(digest->state, (uint64_t*)result); } static void rm_digest_highway64_steal(RmDigest *digest, guint8 *result) { - *result = HighwayHashCatFinish64(digest->highway_cat); + *result = HighwayHashCatFinish64(digest->state); } #define HIGHWAY_SPEC(BITS) BITS, rm_digest_highway_init, rm_digest_highway_free, 
rm_digest_highway_update, rm_digest_highway_copy, rm_digest_highway##BITS##_steal @@ -347,29 +364,29 @@ static const GChecksumType glib_map[] = { }; static void rm_digest_glib_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - digest->glib_checksum = g_checksum_new(glib_map[digest->type]); + digest->state = g_checksum_new(glib_map[digest->type]); if(seed1) { - g_checksum_update(digest->glib_checksum, (const guchar *)&seed1, sizeof(seed1)); + g_checksum_update(digest->state, (const guchar *)&seed1, sizeof(seed1)); } if(seed2) { - g_checksum_update(digest->glib_checksum, (const guchar *)&seed2, sizeof(seed2)); + g_checksum_update(digest->state, (const guchar *)&seed2, sizeof(seed2)); } } static void rm_digest_glib_free(RmDigest *digest) { - g_checksum_free(digest->glib_checksum); + g_checksum_free(digest->state); } static void rm_digest_glib_update(RmDigest *digest, const unsigned char *data, RmOff size) { - g_checksum_update(digest->glib_checksum, data, size); + g_checksum_update(digest->state, data, size); } static void rm_digest_glib_copy(RmDigest *digest, RmDigest *copy) { - copy->glib_checksum = g_checksum_copy(digest->glib_checksum); + copy->state = g_checksum_copy(digest->state); } static void rm_digest_glib_steal(RmDigest *digest, guint8 *result) { - GChecksum *copy = g_checksum_copy(digest->glib_checksum); + GChecksum *copy = g_checksum_copy(digest->state); gsize buflen = digest->bytes; g_checksum_get_digest(copy, result, &buflen); rm_assert_gentle(buflen == digest->bytes); @@ -391,42 +408,42 @@ static const RmDigestSpec sha512_spec = {"sha512", 512, GLIB_FUNCS}; static void rm_digest_sha3_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - digest->sha3_ctx = g_slice_alloc0(sizeof(sha3_context)); + digest->state = g_slice_alloc0(sizeof(sha3_context)); switch(digest->type) { case RM_DIGEST_SHA3_256: - sha3_Init256(digest->sha3_ctx); + 
sha3_Init256(digest->state); break; case RM_DIGEST_SHA3_384: - sha3_Init384(digest->sha3_ctx); + sha3_Init384(digest->state); break; case RM_DIGEST_SHA3_512: - sha3_Init512(digest->sha3_ctx); + sha3_Init512(digest->state); break; default: g_assert_not_reached(); } if(seed1) { - sha3_Update(digest->sha3_ctx, &seed1, sizeof(seed1)); + sha3_Update(digest->state, &seed1, sizeof(seed1)); } if(seed2) { - sha3_Update(digest->sha3_ctx, &seed2, sizeof(seed2)); + sha3_Update(digest->state, &seed2, sizeof(seed2)); } } static void rm_digest_sha3_free(RmDigest *digest) { - g_slice_free(sha3_context, digest->sha3_ctx); + g_slice_free(sha3_context, digest->state); } static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, RmOff size) { - sha3_Update(digest->sha3_ctx, data, size); + sha3_Update(digest->state, data, size); } static void rm_digest_sha3_copy(RmDigest *digest, RmDigest *copy) { - copy->sha3_ctx = g_slice_copy(sizeof(sha3_context), digest->sha3_ctx); + copy->state = g_slice_copy(sizeof(sha3_context), digest->state); } static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { - sha3_context *copy = g_slice_copy(sizeof(sha3_context), digest->sha3_ctx); + sha3_context *copy = g_slice_copy(sizeof(sha3_context), digest->state); memcpy(result, sha3_Finalize(copy), digest->bytes); g_slice_free(sha3_context, copy); } @@ -447,37 +464,37 @@ static void rm_digest_##ALGO##_init(RmDigest *digest, RmOff seed1, \ RmOff seed2, \ _UNUSED RmOff ext_size, \ _UNUSED bool use_shadow_hash) { \ - digest->ALGO##_state = g_slice_alloc0(sizeof(ALGO##_state)); \ - ALGO##_init(digest->ALGO##_state, ALGO_BIG##_OUTBYTES); \ + digest->state = g_slice_alloc0(sizeof(ALGO##_state)); \ + ALGO##_init(digest->state, ALGO_BIG##_OUTBYTES); \ if(seed1) { \ - ALGO##_update(digest->ALGO##_state, &seed1, sizeof(RmOff)); \ + ALGO##_update(digest->state, &seed1, sizeof(RmOff)); \ } \ if(seed2) { \ - ALGO##_update(digest->ALGO##_state, &seed2, sizeof(RmOff)); \ + 
ALGO##_update(digest->state, &seed2, sizeof(RmOff)); \ } \ g_assert(digest->bytes==ALGO_BIG##_OUTBYTES); \ } \ \ static void rm_digest_##ALGO##_free(RmDigest *digest) { \ - g_slice_free(ALGO##_state, digest->ALGO##_state); \ + g_slice_free(ALGO##_state, digest->state); \ } \ \ static void rm_digest_##ALGO##_update(RmDigest *digest, \ const unsigned char *data, \ RmOff size) { \ - ALGO##_update(digest->ALGO##_state, data, size); \ + ALGO##_update(digest->state, data, size); \ } \ \ static void rm_digest_##ALGO##_copy(RmDigest *digest, \ RmDigest *copy) { \ - copy->ALGO##_state = g_slice_copy(sizeof(ALGO##_state), \ - digest->ALGO##_state); \ + copy->state = g_slice_copy(sizeof(ALGO##_state), \ + digest->state); \ } \ \ static void rm_digest_##ALGO##_steal(RmDigest *digest, \ guint8 *result) { \ ALGO##_state *copy = g_slice_copy(sizeof(ALGO##_state), \ - digest->ALGO##_state); \ + digest->state); \ ALGO##_final(copy, result, digest->bytes); \ g_slice_free(ALGO##_state, copy); \ } @@ -515,10 +532,10 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm #define CHAR_TO_NUM(c) (unsigned char)(g_ascii_isdigit(c) ? 
c - '0' : (c - 'a') + 10) digest->bytes = size / 2; - digest->checksum = g_slice_alloc0(digest->bytes); + digest->state = g_slice_alloc0(digest->bytes); for(unsigned i = 0; i < digest->bytes; ++i) { - ((guint8 *)digest->checksum)[i] = + ((guint8 *)digest->state)[i] = (CHAR_TO_NUM(data[2 * i]) << 4) + CHAR_TO_NUM(data[2 * i + 1]); } } @@ -532,28 +549,46 @@ static const RmDigestSpec ext_spec = {"ext", 0, rm_digest_ext_init, rm_digest_ge static void rm_digest_paranoid_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, bool use_shadow_hash) { - digest->paranoid = g_slice_new0(RmParanoid); - digest->paranoid->incoming_twin_candidates = g_async_queue_new(); + RmParanoid *paranoid = g_slice_new0(RmParanoid); + digest->state = paranoid; + paranoid->incoming_twin_candidates = g_async_queue_new(); if(use_shadow_hash) { - digest->paranoid->shadow_hash = rm_digest_new(RM_DIGEST_XXHASH, seed1, seed2, 0, false); + paranoid->shadow_hash = rm_digest_new(RM_DIGEST_XXHASH, seed1, seed2, 0, false); + digest->bytes = paranoid->shadow_hash->bytes; } } static void rm_digest_paranoid_free(RmDigest *digest) { - if(digest->paranoid->shadow_hash) { - rm_digest_free(digest->paranoid->shadow_hash); + RmParanoid *paranoid = digest->state; + if(paranoid->shadow_hash) { + rm_digest_free(paranoid->shadow_hash); } rm_digest_release_buffers(digest); - if(digest->paranoid->incoming_twin_candidates) { - g_async_queue_unref(digest->paranoid->incoming_twin_candidates); + if(paranoid->incoming_twin_candidates) { + g_async_queue_unref(paranoid->incoming_twin_candidates); } - g_slist_free(digest->paranoid->rejects); - g_slice_free(RmParanoid, digest->paranoid); + g_slist_free(paranoid->rejects); + g_slice_free(RmParanoid, paranoid); } +static void rm_digest_paranoid_steal(RmDigest *digest, guint8 *result) { + RmParanoid *paranoid = digest->state; + if(paranoid->shadow_hash) { + guint8 *buf = rm_digest_steal(paranoid->shadow_hash); + memcpy(result, buf, digest->bytes); + } else { + /* 
steal the first few bytes of the first buffer */ + if(paranoid->buffers) { + RmBuffer *buffer = paranoid->buffers->data; + memcpy(result, buffer->data, MIN(buffer->len, digest->bytes)); + } + } +} + + /* Note: paranoid update implementation is in rm_digest_buffered_update() below */ -static const RmDigestSpec paranoid_spec = { "paranoid", 0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL, NULL}; +static const RmDigestSpec paranoid_spec = { "paranoid", 0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL, rm_digest_paranoid_steal}; //////////////////////////////// @@ -675,9 +710,10 @@ void rm_digest_paranoia_shrink(RmDigest *digest, gsize new_size) { } void rm_digest_release_buffers(RmDigest *digest) { - if(digest->paranoid && digest->paranoid->buffers) { - g_slist_free_full(digest->paranoid->buffers, (GDestroyNotify)rm_buffer_free); - digest->paranoid->buffers = NULL; + RmParanoid *paranoid = digest->state; + if(paranoid && paranoid->buffers) { + g_slist_free_full(paranoid->buffers, (GDestroyNotify)rm_buffer_free); + paranoid->buffers = NULL; } } @@ -699,7 +735,7 @@ void rm_digest_buffered_update(RmBuffer *buffer) { rm_digest_update(digest, buffer->data, buffer->len); rm_buffer_release(buffer); } else { - RmParanoid *paranoid = digest->paranoid; + RmParanoid *paranoid = digest->state; /* paranoid update... 
*/ if(!paranoid->buffers) { /* first buffer */ @@ -709,8 +745,6 @@ void rm_digest_buffered_update(RmBuffer *buffer) { paranoid->buffer_tail = g_slist_append(paranoid->buffer_tail, buffer)->next; } - digest->bytes += buffer->len; - if(paranoid->shadow_hash) { rm_digest_update(paranoid->shadow_hash, buffer->data, buffer->len); } @@ -738,7 +772,8 @@ void rm_digest_buffered_update(RmBuffer *buffer) { g_async_queue_try_pop(paranoid->incoming_twin_candidates))) { /* validate the new candidate by comparing the previous buffers (not * including current)*/ - paranoid->twin_candidate_buffer = paranoid->twin_candidate->paranoid->buffers; + RmParanoid *twin = paranoid->twin_candidate->state; + paranoid->twin_candidate_buffer = twin->buffers; GSList *iter_self = paranoid->buffers; gboolean match = TRUE; while(match && iter_self) { @@ -785,7 +820,7 @@ guint8 *rm_digest_steal(RmDigest *digest) { const RmDigestSpec *spec = rm_digest_spec(digest->type); if(!spec->steal) { - return g_slice_copy(digest->bytes, digest->checksum); + return g_slice_copy(digest->bytes, digest->state); } guint8 *result = g_slice_alloc0(digest->bytes); @@ -798,24 +833,8 @@ guint rm_digest_hash(RmDigest *digest) { gsize bytes = 0; guint hash = 0; - if(digest->type == RM_DIGEST_PARANOID) { - if(digest->paranoid->shadow_hash) { - buf = rm_digest_steal(digest->paranoid->shadow_hash); - bytes = digest->paranoid->shadow_hash->bytes; - } else { - /* steal the first few bytes of the first buffer */ - if(digest->paranoid->buffers) { - RmBuffer *buffer = digest->paranoid->buffers->data; - if(buffer->len >= sizeof(guint)) { - hash = *(guint *)buffer->data; - return hash; - } - } - } - } else { - buf = rm_digest_steal(digest); - bytes = digest->bytes; - } + buf = rm_digest_steal(digest); + bytes = digest->bytes; if(buf != NULL) { rm_assert_gentle(bytes >= sizeof(guint)); @@ -839,22 +858,24 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { const RmDigestSpec *spec = rm_digest_spec(a->type); if(a->type == 
RM_DIGEST_PARANOID) { - if(!a->paranoid->buffers) { + RmParanoid *pa = a->state; + RmParanoid *pb = b->state; + if(!pa->buffers) { /* buffers have been freed so we need to rely on shadow hash */ - return rm_digest_equal(a->paranoid->shadow_hash, b->paranoid->shadow_hash); + return rm_digest_equal(pa->shadow_hash, pb->shadow_hash); } /* check if pre-matched twins */ - if(a->paranoid->twin_candidate == b || b->paranoid->twin_candidate == a) { + if(pa->twin_candidate == b || pb->twin_candidate == a) { return true; } /* check if already rejected */ - if(g_slist_find(a->paranoid->rejects, b) || - g_slist_find(b->paranoid->rejects, a)) { + if(g_slist_find(pa->rejects, b) || + g_slist_find(pb->rejects, a)) { return false; } /* all the "easy" ways failed... do manual check of all buffers */ - GSList *a_iter = a->paranoid->buffers; - GSList *b_iter = b->paranoid->buffers; + GSList *a_iter = pa->buffers; + GSList *b_iter = pb->buffers; guint bytes = 0; while(a_iter && b_iter) { if(!rm_buffer_equal(a_iter->data, b_iter->data)) { @@ -868,7 +889,7 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { b_iter = b_iter->next; } - return (!a_iter && !b_iter && bytes == a->bytes); + return (!a_iter && !b_iter); } else if(spec->steal) { guint8 *buf_a = rm_digest_steal(a); guint8 *buf_b = rm_digest_steal(b); @@ -879,7 +900,7 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { return result; } else { - return !memcmp(a->checksum, b->checksum, a->bytes); + return !memcmp(a->state, b->state, a->bytes); } } @@ -891,15 +912,8 @@ int rm_digest_hexstring(RmDigest *digest, char *buffer) { return 0; } - if(digest->type == RM_DIGEST_PARANOID) { - if(digest->paranoid->shadow_hash) { - input = rm_digest_steal(digest->paranoid->shadow_hash); - bytes = digest->paranoid->shadow_hash->bytes; - } - } else { - input = rm_digest_steal(digest); - bytes = digest->bytes; - } + input = rm_digest_steal(digest); + bytes = digest->bytes; for(gsize i = 0; i < bytes; ++i) { buffer[0] = hex[input[i] / 16]; 
@@ -921,22 +935,16 @@ int rm_digest_get_bytes(RmDigest *self) { return 0; } - if(self->type != RM_DIGEST_PARANOID) { - return self->bytes; - } - - if(self->paranoid->shadow_hash) { - return self->paranoid->shadow_hash->bytes; - } - - return 0; + return self->bytes; } void rm_digest_send_match_candidate(RmDigest *target, RmDigest *candidate) { - if(!target->paranoid->incoming_twin_candidates) { - target->paranoid->incoming_twin_candidates = g_async_queue_new(); + RmParanoid *paranoid = target->state; + + if(!paranoid->incoming_twin_candidates) { + paranoid->incoming_twin_candidates = g_async_queue_new(); } - g_async_queue_push(target->paranoid->incoming_twin_candidates, candidate); + g_async_queue_push(paranoid->incoming_twin_candidates, candidate); } guint8 *rm_digest_sum(RmDigestType algo, const guint8 *data, gsize len, gsize *out_len) { diff --git a/lib/checksum.h b/lib/checksum.h index 7c942db1..4a50b85e 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -99,25 +99,14 @@ typedef struct RmParanoid { typedef struct RmDigest { /* Different storage structures are used depending on digest type: */ - union { - GChecksum *glib_checksum; - blake2s_state *blake2s_state; - blake2b_state *blake2b_state; - blake2sp_state *blake2sp_state; - blake2bp_state *blake2bp_state; - HighwayHashCat *highway_cat; - sha3_context *sha3_ctx; - RmUint128 *checksum; - uint128_t *farmhash; - RmParanoid *paranoid; - guint8 *data; - }; + gpointer state; /* digest type */ RmDigestType type; - /* digest size in bytes */ + /* digest output size in bytes */ gsize bytes; + } RmDigest; /////////// RmBufferPool and RmBuffer //////////////// diff --git a/lib/shredder.c b/lib/shredder.c index db8973c0..6e06df24 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -1658,7 +1658,7 @@ static gint rm_shred_process_file(RmFile *file, RmSession *session) { file->shred_group->children && /* no point waiting if paranoid digest with no twin candidates */ (file->digest->type != RM_DIGEST_PARANOID || - 
file->digest->paranoid->twin_candidate); + ((RmParanoid*)file->digest->state)->twin_candidate); } file->signal = shredder_waiting ? rm_signal_new() : NULL; file->shredder_waiting = shredder_waiting; diff --git a/lib/treemerge.c b/lib/treemerge.c index 3e7d4746..0b107579 100644 --- a/lib/treemerge.c +++ b/lib/treemerge.c @@ -423,13 +423,8 @@ static void rm_directory_add(RmTreeMerger *self, RmDirectory *directory, RmFile guint8 *file_digest = NULL; RmOff digest_bytes = 0; - if(file->digest->type == RM_DIGEST_PARANOID) { - file_digest = rm_digest_steal(file->digest->paranoid->shadow_hash); - digest_bytes = file->digest->paranoid->shadow_hash->bytes; - } else { - file_digest = rm_digest_steal(file->digest); - digest_bytes = file->digest->bytes; - } + file_digest = rm_digest_steal(file->digest); + digest_bytes = file->digest->bytes; /* Update the directorie's hash with the file's hash Since we cannot be sure in which order the files come in diff --git a/lib/xattr.c b/lib/xattr.c index 539e1dba..20c1e4e5 100644 --- a/lib/xattr.c +++ b/lib/xattr.c @@ -103,12 +103,7 @@ static int rm_xattr_build_cksum(RmFile *file, char *buf, size_t buf_size) { memset(buf, '0', buf_size); buf[buf_size - 1] = 0; - if(file->digest->type == RM_DIGEST_PARANOID) { - rm_assert_gentle(file->digest->paranoid->shadow_hash); - return rm_digest_hexstring(file->digest->paranoid->shadow_hash, buf); - } else { - return rm_digest_hexstring(file->digest, buf); - } + return rm_digest_hexstring(file->digest, buf); } static int rm_xattr_is_fail(const char *name, int rc) { From 66cced5e6f161cbd5a45de65e62be1a8449b995b Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 10 Nov 2017 19:03:08 +1000 Subject: [PATCH 098/180] config: correct default checksum type --- lib/config.h.in | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/config.h.in b/lib/config.h.in index 187456a2..c01712af 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -23,7 +23,7 @@ #define HAVE_UNAME ({HAVE_UNAME}) #define 
HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) -#define RM_DEFAULT_DIGEST RM_DIGEST_HIGHWAY +#define RM_DEFAULT_DIGEST RM_DIGEST_HIGHWAY256 #define RM_VERSION "{VERSION_MAJOR}.{VERSION_MINOR}.{VERSION_PATCH}" #define RM_VERSION_MAJOR {VERSION_MAJOR} #define RM_VERSION_MINOR {VERSION_MINOR} From 237bfdb0cf816f7d9ac6cf012df50e4d765894b3 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 07:01:45 +1000 Subject: [PATCH 099/180] checksum: some trivial codesmithing --- lib/checksum.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 16e3953c..d6ed56f8 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -294,7 +294,7 @@ static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *d guint8 *hash = digest->state; RmOff bytes = MIN(size, digest->bytes); for(gsize i = 0; i < bytes; ++i) { - hash[i] ^= ((guint8 *)data)[i % size]; + hash[i] ^= ((guint8 *)data)[i]; } } @@ -906,28 +906,22 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { int rm_digest_hexstring(RmDigest *digest, char *buffer) { static const char *hex = "0123456789abcdef"; - guint8 *input = NULL; - gsize bytes = 0; if(digest == NULL) { return 0; } - input = rm_digest_steal(digest); - bytes = digest->bytes; + guint8 *input = rm_digest_steal(digest); + gsize bytes = digest->bytes; + gsize out = 0; for(gsize i = 0; i < bytes; ++i) { - buffer[0] = hex[input[i] / 16]; - buffer[1] = hex[input[i] % 16]; - - if(i == bytes - 1) { - buffer[2] = '\0'; - } - - buffer += 2; + buffer[out++] = hex[input[i] / 16]; + buffer[out++] = hex[input[i] % 16]; } + buffer[out++] = '\0'; g_slice_free1(bytes, input); - return bytes * 2 + 1; + return out; } int rm_digest_get_bytes(RmDigest *self) { From 64dd9a1b286697f91703944c812d643e30b09009 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 08:03:56 +1000 Subject: [PATCH 100/180] checksum: add streaming implementation for murmur --- lib/checksum.c | 50 +++- 
lib/checksums/murmur3.c | 538 +++++++++++++++++++++++++--------------- lib/checksums/murmur3.h | 60 ++++- 3 files changed, 440 insertions(+), 208 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index d6ed56f8..2ca7c4aa 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -249,19 +249,55 @@ static const RmDigestSpec farmhash_spec = { "farmhash", 64, GENERIC_FUNCS(farmh /////////////////////////// -static void rm_digest_murmur_update(RmDigest *digest, const unsigned char *data, RmOff size) { - /* TODO: this is broken; need to extend murmur API to add a streaming variant */ - uint32_t *hash = digest->state; + + + +#define CREATE_MURMUR_FUNCS(TYPE) \ +static void rm_digest_murmur_##TYPE##_free(RmDigest *digest) { \ + MurmurHash3_##TYPE##_free(digest->state); \ +} \ + \ +static void rm_digest_murmur_##TYPE##_update(RmDigest *digest, \ + const unsigned char *data, \ + RmOff size) { \ + MurmurHash3_##TYPE##_update(digest->state, data, size); \ +} \ + \ +static void rm_digest_murmur_##TYPE##_copy(RmDigest *digest, RmDigest *copy) { \ + copy->state = MurmurHash3_##TYPE##_copy(digest->state); \ +} \ + \ +static void rm_digest_murmur_##TYPE##_steal(RmDigest *digest, guint8 *result) { \ + MurmurHash3_##TYPE##_steal(digest->state, result); \ +} + +#define MURMUR_FUNCS(TYPE) rm_digest_murmur_##TYPE##_init, rm_digest_murmur_##TYPE##_free, rm_digest_murmur_##TYPE##_update, rm_digest_murmur_##TYPE##_copy, rm_digest_murmur_##TYPE##_steal + + #if RM_PLATFORM_32 - MurmurHash3_x86_128(data, size, *hash, hash); + +CREATE_MURMUR_FUNCS(x86_128) + +static void rm_digest_murmur_x86_128_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + digest->state = MurmurHash3_x86_128_new(seed1, seed1>>32, seed2, seed2>>32); +} + +static const RmDigestSpec murmur_spec = { "murmur", 128, MURMUR_FUNCS(x86_128)}; + #elif RM_PLATFORM_64 - MurmurHash3_x64_128(data, size, *hash, hash); + +CREATE_MURMUR_FUNCS(x64_128) + +static void 
rm_digest_murmur_x64_128_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + digest->state = MurmurHash3_x64_128_new(seed1, seed2); +} + +static const RmDigestSpec murmur_spec = { "murmur", 128, MURMUR_FUNCS(x64_128)}; + #else #error "Probably not a good idea to compile rmlint on 16bit." #endif -} -static const RmDigestSpec murmur_spec = { "murmur", 128, GENERIC_FUNCS(murmur)}; /////////////////////////// // cityhash // diff --git a/lib/checksums/murmur3.c b/lib/checksums/murmur3.c index e4aeb035..dc27ac22 100644 --- a/lib/checksums/murmur3.c +++ b/lib/checksums/murmur3.c @@ -1,28 +1,28 @@ //----------------------------------------------------------------------------- -// MurmurHash3 was written by Austin Appleby, and is placed in the public -// domain. The author hereby disclaims copyright to this source code. +// Streaming implementation of MurmurHash3 by Daniel Thomas +// Based on single-buffer implementation by Austin Appleby +// Code is placed in the public domain. +// The authors disclaim copyright to this source code. // Note - The x86 and x64 versions do _not_ produce the same results, as the // algorithms are optimized for their respective platforms. You can still // compile and run any of them on any platform, but your performance with the // non-native version will be less than optimal. 
+// Also will give different (but equally strong) results on big- vs +// little-endian platforms #include "murmur3.h" +#include +#include //----------------------------------------------------------------------------- // Platform-specific functions and macros -#ifdef __GNUC__ -#define FORCE_INLINE __attribute__((always_inline)) inline -#else -#define FORCE_INLINE inline -#endif - -static FORCE_INLINE uint32_t rotl32(uint32_t x, int8_t r) { +static inline uint32_t rotl32(uint32_t x, int8_t r) { return (x << r) | (x >> (32 - r)); } -static FORCE_INLINE uint64_t rotl64(uint64_t x, int8_t r) { +static inline uint64_t rotl64(uint64_t x, int8_t r) { return (x << r) | (x >> (64 - r)); } @@ -35,12 +35,39 @@ static FORCE_INLINE uint64_t rotl64(uint64_t x, int8_t r) { // Block read - if your platform needs to do endian-swapping or can only // handle aligned reads, do the conversion here -#define getblock(p, i) (p[i]) +#define GET_UINT64(p) *((uint64_t*)(p)); +#define GET_UINT32(p) *((uint32_t*)(p)); + +struct _MurmurHash3_x86_32_state { + uint32_t h1; + uint8_t xs[4]; /* unhashed data from last increment */ + uint8_t xs_len; + uint32_t len; +}; + +struct _MurmurHash3_x86_128_state { + uint32_t h1; + uint32_t h2; + uint32_t h3; + uint32_t h4; + uint8_t xs[16]; /* unhashed data from last increment */ + uint8_t xs_len; + uint32_t len; +}; + +struct _MurmurHash3_x64_128_state { + uint64_t h1; + uint64_t h2; + uint8_t xs[16]; /* unhashed data from last increment */ + uint8_t xs_len; + uint32_t len; +}; + //----------------------------------------------------------------------------- // Finalization mix - force all bits of a hash block to avalanche -static FORCE_INLINE uint32_t fmix32(uint32_t h) { +static inline uint32_t fmix32(uint32_t h) { h ^= h >> 16; h *= 0x85ebca6b; h ^= h >> 13; @@ -52,7 +79,7 @@ static FORCE_INLINE uint32_t fmix32(uint32_t h) { //---------- -static FORCE_INLINE uint64_t fmix64(uint64_t k) { +static inline uint64_t fmix64(uint64_t k) { k ^= k >> 33; k *= 
BIG_CONSTANT(0xff51afd7ed558ccd); k ^= k >> 33; @@ -64,197 +91,258 @@ static FORCE_INLINE uint64_t fmix64(uint64_t k) { //----------------------------------------------------------------------------- -void MurmurHash3_x86_32(const void *key, int len, uint32_t seed, void *out) { - const uint8_t *data = (const uint8_t *)key; - const int nblocks = len / 4; - int i; +#define MURMUR_UPDATE(h, k, rotl, ca, cb) \ + k *= ca; \ + k = ROTL64(k, rotl); \ + k *= cb; \ + h ^= k; - uint32_t h1 = seed; +#define MURMUR_MIX(ha, hb, rotl, c) \ + ha = ROTL64(ha, rotl); \ + ha += hb; \ + ha = ha * 5 + c; - uint32_t c1 = 0xcc9e2d51; - uint32_t c2 = 0x1b873593; +#define MURMUR_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ + const int bytes = (data_len + xs_len > xs_cap) ? \ + (int)xs_cap - (int)xs_len : \ + (int)data_len; \ + memcpy(xs + xs_len, data, bytes); \ + xs_len += bytes; \ + data += bytes; - //---------- - // body - const uint32_t *blocks = (const uint32_t *)(data + nblocks * 4); - for(i = -nblocks; i; i++) { - uint32_t k1 = getblock(blocks, i); +//----------------------------------------------------------------------------- + +MurmurHash3_x86_32_state *MurmurHash3_x86_32_new(uint32_t seed) { + MurmurHash3_x86_32_state *state = g_slice_new0(MurmurHash3_x86_32_state); + state->h1 = seed; + return state; +} + +MurmurHash3_x86_32_state *MurmurHash3_x86_32_copy(MurmurHash3_x86_32_state *state) { + MurmurHash3_x86_32_state *copy = g_slice_copy(sizeof(MurmurHash3_x86_32_state), state); + return copy; +} + +#define MURMUR_UPDATE_H1_X86_32(H1) MURMUR_UPDATE(H1, k1, 15, 0xcc9e2d51, 0x1b873593); - k1 *= c1; - k1 = ROTL32(k1, 15); - k1 *= c2; +void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, const void * restrict key, const uint32_t len) { + state->len += len; + uint8_t *data = (uint8_t *)key; + const uint8_t *stop = data + len; - h1 ^= k1; - h1 = ROTL32(h1, 13); - h1 = h1 * 5 + 0xe6546b64; + if(state->xs_len > 0) { + MURMUR_FILL_XS(state->xs, state->xs_len, 4, data, 
len); } - //---------- - // tail + /* process blocks of 4 bytes */ + while(state->xs_len == 4 || data + 4 <= stop) { + + uint32_t k1; + + if(state->xs_len == 4) { + /* process remnant data from previous update */ + k1 = GET_UINT32(&state->xs[0]); + state->xs_len = 0; + } else { + /* process new data */ + k1 = GET_UINT32(data); + data += 4; + } + + MURMUR_UPDATE_H1_X86_32(state->h1); + MURMUR_MIX(state->h1, 0, 13, 0xe6546b64); + } - const uint8_t *tail = (const uint8_t *)(data + nblocks * 4); + if (state->xs_len == 0 && stop > data) { + // store excess data in state + state->xs_len = stop - data; + memcpy(state->xs, data, state->xs_len); + } +} +void MurmurHash3_x86_32_steal(const MurmurHash3_x86_32_state *const restrict state, void *const restrict out) { uint32_t k1 = 0; - switch(len & 3) { + /* copy h to make this a non-destructive steal */ + uint32_t h1 = state->h1; + + switch(state->xs_len) { case 3: - k1 ^= tail[2] << 16; + k1 ^= state->xs[2] << 16; case 2: - k1 ^= tail[1] << 8; + k1 ^= state->xs[1] << 8; case 1: - k1 ^= tail[0]; - k1 *= c1; - k1 = ROTL32(k1, 15); - k1 *= c2; - h1 ^= k1; + k1 ^= state->xs[0]; + + MURMUR_UPDATE_H1_X86_32(h1); }; //---------- // finalization - h1 ^= len; + h1 ^= state->len; h1 = fmix32(h1); *(uint32_t *)out = h1; } +void MurmurHash3_x86_32_finalise(MurmurHash3_x86_32_state *state, void *out) { + MurmurHash3_x86_32_steal(state, out); + MurmurHash3_x86_32_free(state); +} + +void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state *state) { + g_slice_free(MurmurHash3_x86_32_state, state); +} + +void MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed, void *out) { + MurmurHash3_x86_32_state *state = MurmurHash3_x86_32_new(seed); + MurmurHash3_x86_32_update(state, key, len); + MurmurHash3_x86_32_finalise(state, out); +} + + //----------------------------------------------------------------------------- -void MurmurHash3_x86_128(const void *key, const int len, uint32_t seed, void *out) { - const uint8_t *data = (const uint8_t 
*)key; - const int nblocks = len / 16; - int i; +MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed2, uint32_t seed3, uint32_t seed4) { + MurmurHash3_x86_128_state *state = g_slice_new0(MurmurHash3_x86_128_state); + state->h1 = seed1; + state->h2 = seed2; + state->h3 = seed3; + state->h4 = seed4; + return state; +} + +MurmurHash3_x86_128_state *MurmurHash3_x86_128_copy(MurmurHash3_x86_128_state *state) { + MurmurHash3_x86_128_state *copy = g_slice_copy(sizeof(MurmurHash3_x86_128_state), state); + return copy; +} - uint32_t h1 = seed; - uint32_t h2 = seed; - uint32_t h3 = seed; - uint32_t h4 = seed; +#define MURMUR_UPDATE_H1_X86_128(H1) MURMUR_UPDATE(H1, k1, 15, 0x239b961b, 0xab0e9789); +#define MURMUR_UPDATE_H2_X86_128(H2) MURMUR_UPDATE(H2, k2, 16, 0xab0e9789, 0x38b34ae5); +#define MURMUR_UPDATE_H3_X86_128(H3) MURMUR_UPDATE(H3, k3, 17, 0x38b34ae5, 0xa1e38b93); +#define MURMUR_UPDATE_H4_X86_128(H4) MURMUR_UPDATE(H4, k4, 18, 0xa1e38b93, 0x239b961b); - uint32_t c1 = 0x239b961b; - uint32_t c2 = 0xab0e9789; - uint32_t c3 = 0x38b34ae5; - uint32_t c4 = 0xa1e38b93; +void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, const void * restrict key, const uint32_t len) { + state->len += len; + uint8_t *data = (uint8_t *)key; + const uint8_t *stop = data + len; - //---------- - // body - - const uint32_t *blocks = (const uint32_t *)(data + nblocks * 16); - - for(i = -nblocks; i; i++) { - uint32_t k1 = getblock(blocks, i * 4 + 0); - uint32_t k2 = getblock(blocks, i * 4 + 1); - uint32_t k3 = getblock(blocks, i * 4 + 2); - uint32_t k4 = getblock(blocks, i * 4 + 3); - - k1 *= c1; - k1 = ROTL32(k1, 15); - k1 *= c2; - h1 ^= k1; - - h1 = ROTL32(h1, 19); - h1 += h2; - h1 = h1 * 5 + 0x561ccd1b; - - k2 *= c2; - k2 = ROTL32(k2, 16); - k2 *= c3; - h2 ^= k2; - - h2 = ROTL32(h2, 17); - h2 += h3; - h2 = h2 * 5 + 0x0bcaa747; - - k3 *= c3; - k3 = ROTL32(k3, 17); - k3 *= c4; - h3 ^= k3; - - h3 = ROTL32(h3, 15); - h3 += h4; - h3 = h3 * 5 + 
0x96cd1c35; - - k4 *= c4; - k4 = ROTL32(k4, 18); - k4 *= c1; - h4 ^= k4; - - h4 = ROTL32(h4, 13); - h4 += h1; - h4 = h4 * 5 + 0x32ac3b17; + if(state->xs_len > 0) { + MURMUR_FILL_XS(state->xs, state->xs_len, 16, data, len); } - //---------- - // tail + /* process blocks of 16 bytes */ + while(state->xs_len == 16 || data + 16 <= stop) { + + uint32_t k1; + uint32_t k2; + uint32_t k3; + uint32_t k4; + + if(state->xs_len == 16) { + /* process remnant data from previous update */ + k1 = GET_UINT32(&state->xs[0]); + k2 = GET_UINT32(&state->xs[4]); + k3 = GET_UINT32(&state->xs[8]); + k4 = GET_UINT32(&state->xs[12]); + state->xs_len = 0; + } else { + /* process new data */ + k1 = GET_UINT32(data); + k2 = GET_UINT32(data + 4); + k3 = GET_UINT32(data + 8); + k4 = GET_UINT32(data + 12); + data += 16; + } + + MURMUR_UPDATE_H1_X86_128(state->h1); + MURMUR_MIX(state->h1, state->h2, 19, 0x561ccd1b); + + MURMUR_UPDATE_H2_X86_128(state->h2); + MURMUR_MIX(state->h2, state->h3, 17, 0x0bcaa747); + + MURMUR_UPDATE_H3_X86_128(state->h3); + MURMUR_MIX(state->h3, state->h4, 15, 0x96cd1c35); + + MURMUR_UPDATE_H4_X86_128(state->h4); + MURMUR_MIX(state->h4, state->h1, 13, 0x32ac3b17); + + } - const uint8_t *tail = (const uint8_t *)(data + nblocks * 16); + if (state->xs_len == 0 && stop > data) { + // store excess data in state + state->xs_len = stop - data; + memcpy(state->xs, data, state->xs_len); + } +} +void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict state, void *const restrict out) { uint32_t k1 = 0; uint32_t k2 = 0; uint32_t k3 = 0; uint32_t k4 = 0; - switch(len & 15) { + /* copy h to make this a non-destructive steal */ + uint32_t h1 = state->h1; + uint32_t h2 = state->h2; + uint32_t h3 = state->h3; + uint32_t h4 = state->h4; + + switch(state->len & 15) { case 15: - k4 ^= tail[14] << 16; + k4 ^= state->xs[14] << 16; case 14: - k4 ^= tail[13] << 8; + k4 ^= state->xs[13] << 8; case 13: - k4 ^= tail[12] << 0; - k4 *= c4; - k4 = ROTL32(k4, 18); - k4 *= c1; - 
h4 ^= k4; + k4 ^= state->xs[12] << 0; + + MURMUR_UPDATE_H4_X86_128(h4); case 12: - k3 ^= tail[11] << 24; + k3 ^= state->xs[11] << 24; case 11: - k3 ^= tail[10] << 16; + k3 ^= state->xs[10] << 16; case 10: - k3 ^= tail[9] << 8; + k3 ^= state->xs[9] << 8; case 9: - k3 ^= tail[8] << 0; - k3 *= c3; - k3 = ROTL32(k3, 17); - k3 *= c4; - h3 ^= k3; + k3 ^= state->xs[8] << 0; + + MURMUR_UPDATE_H3_X86_128(h3); case 8: - k2 ^= tail[7] << 24; + k2 ^= state->xs[7] << 24; case 7: - k2 ^= tail[6] << 16; + k2 ^= state->xs[6] << 16; case 6: - k2 ^= tail[5] << 8; + k2 ^= state->xs[5] << 8; case 5: - k2 ^= tail[4] << 0; - k2 *= c2; - k2 = ROTL32(k2, 16); - k2 *= c3; - h2 ^= k2; + k2 ^= state->xs[4] << 0; + + MURMUR_UPDATE_H2_X86_128(h2); case 4: - k1 ^= tail[3] << 24; + k1 ^= state->xs[3] << 24; case 3: - k1 ^= tail[2] << 16; + k1 ^= state->xs[2] << 16; case 2: - k1 ^= tail[1] << 8; + k1 ^= state->xs[1] << 8; case 1: - k1 ^= tail[0] << 0; - k1 *= c1; - k1 = ROTL32(k1, 15); - k1 *= c2; - h1 ^= k1; + k1 ^= state->xs[0] << 0; + + MURMUR_UPDATE_H1_X86_128(h1); }; //---------- // finalization - h1 ^= len; - h2 ^= len; - h3 ^= len; - h4 ^= len; + h1 ^= state->len; + h2 ^= state->len; + h3 ^= state->len; + h4 ^= state->len; h1 += h2; h1 += h3; @@ -279,104 +367,136 @@ void MurmurHash3_x86_128(const void *key, const int len, uint32_t seed, void *ou ((uint32_t *)out)[1] = h2; ((uint32_t *)out)[2] = h3; ((uint32_t *)out)[3] = h4; + } -//----------------------------------------------------------------------------- +void MurmurHash3_x86_128_finalise(MurmurHash3_x86_128_state *state, void *out) { + MurmurHash3_x86_128_steal(state, out); + MurmurHash3_x86_128_free(state); +} -void MurmurHash3_x64_128(const void *key, const int len, const uint32_t seed, void *out) { - const uint8_t *data = (const uint8_t *)key; - const int nblocks = len / 16; - int i; +void MurmurHash3_x86_128_free(MurmurHash3_x86_128_state *state) { + g_slice_free(MurmurHash3_x86_128_state, state); +} - uint64_t h1 = seed; - 
uint64_t h2 = seed; +void MurmurHash3_x86_128(const void *key, uint32_t len, uint32_t seed, void *out) { + MurmurHash3_x86_128_state *state = MurmurHash3_x86_128_new(seed, seed, seed, seed); + MurmurHash3_x86_128_update(state, key, len); + MurmurHash3_x86_128_finalise(state, out); +} - uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5); - uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f); +//----------------------------------------------------------------------------- - //---------- - // body - const uint64_t *blocks = (const uint64_t *)(data); - for(i = 0; i < nblocks; i++) { - uint64_t k1 = getblock(blocks, i * 2 + 0); - uint64_t k2 = getblock(blocks, i * 2 + 1); +MurmurHash3_x64_128_state *MurmurHash3_x64_128_new(uint64_t seed1, uint64_t seed2) { + MurmurHash3_x64_128_state *state = g_slice_new0(MurmurHash3_x64_128_state); + state->h1 = seed1; + state->h2 = seed2; + return state; +} + +MurmurHash3_x64_128_state *MurmurHash3_x64_128_copy(MurmurHash3_x64_128_state *state) { + return g_slice_copy(sizeof(MurmurHash3_x64_128_state), state); +} - k1 *= c1; - k1 = ROTL64(k1, 31); - k1 *= c2; - h1 ^= k1; +#define MURMUR_UPDATE_H1_X64_128(H1) MURMUR_UPDATE(H1, k1, 31, BIG_CONSTANT(0x87c37b91114253d5), BIG_CONSTANT(0x4cf5ad432745937f)); +#define MURMUR_UPDATE_H2_X64_128(H2) MURMUR_UPDATE(H2, k2, 33, BIG_CONSTANT(0x4cf5ad432745937f), BIG_CONSTANT(0x87c37b91114253d5)); - h1 = ROTL64(h1, 27); - h1 += h2; - h1 = h1 * 5 + 0x52dce729; +void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, const void * restrict key, const uint64_t len) { - k2 *= c2; - k2 = ROTL64(k2, 33); - k2 *= c1; - h2 ^= k2; + state->len += len; + uint8_t *data = (uint8_t *)key; + const uint8_t *stop = data + len; - h2 = ROTL64(h2, 31); - h2 += h1; - h2 = h2 * 5 + 0x38495ab5; + if(state->xs_len > 0) { + MURMUR_FILL_XS(state->xs, state->xs_len, 16, data, len); } - //---------- - // tail + /* process blocks of 16 bytes */ + while(state->xs_len == 16 || data + 16 <= stop) { + + uint64_t k1; 
+ uint64_t k2; + + if(state->xs_len == 16) { + /* process remnant data from previous update */ + k1 = GET_UINT64(&state->xs[0]); + k2 = GET_UINT64(&state->xs[8]); + state->xs_len = 0; + } else { + /* process new data */ + k1 = GET_UINT64(data); + k2 = GET_UINT64(data + 8); + data += 16; + } + + MURMUR_UPDATE_H1_X64_128(state->h1); + MURMUR_MIX(state->h1, state->h2, 27, 0x52dce729); + + MURMUR_UPDATE_H2_X64_128(state->h2); + MURMUR_MIX(state->h2, state->h1, 31, 0x38495ab5); + } + + if (state->xs_len == 0 && stop > data) { + // store excess data in state + state->xs_len = stop - data; + memcpy(state->xs, data, state->xs_len); + } +} - const uint8_t *tail = (const uint8_t *)(data + nblocks * 16); +void MurmurHash3_x64_128_steal(const MurmurHash3_x64_128_state *const restrict state, void *const restrict out) { uint64_t k1 = 0; uint64_t k2 = 0; - switch(len & 15) { + /* copy h to make this a non-destructive steal */ + uint64_t h1 = state->h1; + uint64_t h2 = state->h2; + + switch(state->xs_len) { case 15: - k2 ^= (uint64_t)(tail[14]) << 48; + k2 ^= (uint64_t)(state->xs[14]) << 48; case 14: - k2 ^= (uint64_t)(tail[13]) << 40; + k2 ^= (uint64_t)(state->xs[13]) << 40; case 13: - k2 ^= (uint64_t)(tail[12]) << 32; + k2 ^= (uint64_t)(state->xs[12]) << 32; case 12: - k2 ^= (uint64_t)(tail[11]) << 24; + k2 ^= (uint64_t)(state->xs[11]) << 24; case 11: - k2 ^= (uint64_t)(tail[10]) << 16; + k2 ^= (uint64_t)(state->xs[10]) << 16; case 10: - k2 ^= (uint64_t)(tail[9]) << 8; + k2 ^= (uint64_t)(state->xs[9]) << 8; case 9: - k2 ^= (uint64_t)(tail[8]) << 0; - k2 *= c2; - k2 = ROTL64(k2, 33); - k2 *= c1; - h2 ^= k2; + k2 ^= (uint64_t)(state->xs[8]) << 0; + + MURMUR_UPDATE_H2_X64_128(h2); case 8: - k1 ^= (uint64_t)(tail[7]) << 56; + k1 ^= (uint64_t)(state->xs[7]) << 56; case 7: - k1 ^= (uint64_t)(tail[6]) << 48; + k1 ^= (uint64_t)(state->xs[6]) << 48; case 6: - k1 ^= (uint64_t)(tail[5]) << 40; + k1 ^= (uint64_t)(state->xs[5]) << 40; case 5: - k1 ^= (uint64_t)(tail[4]) << 32; + k1 ^= 
(uint64_t)(state->xs[4]) << 32; case 4: - k1 ^= (uint64_t)(tail[3]) << 24; + k1 ^= (uint64_t)(state->xs[3]) << 24; case 3: - k1 ^= (uint64_t)(tail[2]) << 16; + k1 ^= (uint64_t)(state->xs[2]) << 16; case 2: - k1 ^= (uint64_t)(tail[1]) << 8; + k1 ^= (uint64_t)(state->xs[1]) << 8; case 1: - k1 ^= (uint64_t)(tail[0]) << 0; - k1 *= c1; - k1 = ROTL64(k1, 31); - k1 *= c2; - h1 ^= k1; + k1 ^= (uint64_t)(state->xs[0]) << 0; + + MURMUR_UPDATE_H1_X64_128(h1); }; //---------- // finalization - h1 ^= len; - h2 ^= len; + h1 ^= state->len; + h2 ^= state->len; h1 += h2; h2 += h1; @@ -391,4 +511,26 @@ void MurmurHash3_x64_128(const void *key, const int len, const uint32_t seed, vo ((uint64_t *)out)[1] = h2; } +void MurmurHash3_x64_128_free(MurmurHash3_x64_128_state *state) { + g_slice_free(MurmurHash3_x64_128_state, state); +} + +void MurmurHash3_x64_128(const void *key, const uint64_t len, const uint32_t seed, void *out) { + MurmurHash3_x64_128_state *state = MurmurHash3_x64_128_new(seed, seed); + MurmurHash3_x64_128_update(state, key, len); + MurmurHash3_x64_128_finalise(state, out); +} + +void MurmurHash3_x64_128_finalise(MurmurHash3_x64_128_state *state, void *out) { + MurmurHash3_x64_128_steal(state, out); + MurmurHash3_x64_128_free(state); +} + +int MurmurHash3_x64_128_equal(MurmurHash3_x64_128_state *a, MurmurHash3_x64_128_state *b) { + if (a->h1 != b->h1 || a->h2 != b->h2 || a->xs_len != b->xs_len || a->len != b->len) { + return 0; + } + return (a->xs_len == 0 || !memcmp(a->xs, b->xs, a->xs_len)); +} + //----------------------------------------------------------------------------- diff --git a/lib/checksums/murmur3.h b/lib/checksums/murmur3.h index 82712de6..a7285a46 100644 --- a/lib/checksums/murmur3.h +++ b/lib/checksums/murmur3.h @@ -3,18 +3,72 @@ // public domain. The author hereby disclaims copyright to this source // code. 
+// Streaming implementation by Daniel Thomas + #ifndef _MURMURHASH3_H_ #define _MURMURHASH3_H_ #include //----------------------------------------------------------------------------- +// opaque structs for intermediate checksum states + +typedef struct _MurmurHash3_x86_32_state MurmurHash3_x86_32_state; +typedef struct _MurmurHash3_x86_128_state MurmurHash3_x86_128_state; +typedef struct _MurmurHash3_x64_128_state MurmurHash3_x64_128_state; + +//----------------------------------------------------------------------------- +// API + +/** + * return newly initialised, seeded state + */ +MurmurHash3_x86_32_state *MurmurHash3_x86_32_new(uint32_t seed); +MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed2, uint32_t seed3, uint32_t seed4); +MurmurHash3_x64_128_state *MurmurHash3_x64_128_new(uint64_t seed1, uint64_t seed2); + +/** + * return duplicate copy of a state + */ +MurmurHash3_x86_32_state *MurmurHash3_x86_32_copy(MurmurHash3_x86_32_state *state); +MurmurHash3_x86_128_state *MurmurHash3_x86_128_copy(MurmurHash3_x86_128_state *state); +MurmurHash3_x64_128_state *MurmurHash3_x64_128_copy(MurmurHash3_x64_128_state *state); + +/** + * streaming update of checksum + */ +void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const restrict state, const void *restrict key, const uint32_t len); +void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const restrict state, const void *restrict key, const uint32_t len); +void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, const void *restrict key, const uint64_t len); + +/** + * output checksum result; does not modify underlying state + */ +void MurmurHash3_x86_32_steal(const MurmurHash3_x86_32_state *const restrict state, void *const restrict out); +void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict state, void *const restrict out); +void MurmurHash3_x64_128_steal(const MurmurHash3_x64_128_state *const restrict state, void *const 
restrict out); + +/** + * output checksum result; frees state + */ +void MurmurHash3_x86_32_finalise(MurmurHash3_x86_32_state *state, void *out); +void MurmurHash3_x86_128_finalise(MurmurHash3_x86_128_state *state, void *out); +void MurmurHash3_x64_128_finalise(MurmurHash3_x64_128_state *state, void *out); -void MurmurHash3_x86_32(const void *key, int len, uint32_t seed, void *out); +/** + * free state + */ +void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state *state); +void MurmurHash3_x86_128_free(MurmurHash3_x86_128_state *state); +void MurmurHash3_x64_128_free(MurmurHash3_x64_128_state *state); -void MurmurHash3_x86_128(const void *key, int len, uint32_t seed, void *out); +/** + * convenience single-buffer hash + */ +void MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed, void *out); +void MurmurHash3_x86_128(const void *key, uint32_t len, uint32_t seed, void *out); +void MurmurHash3_x64_128(const void *key, uint64_t len, uint32_t seed, void *out); -void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out); //----------------------------------------------------------------------------- From 9f620e9227c762017ece900e25ba4e1936f62147 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 08:15:02 +1000 Subject: [PATCH 101/180] checksums: ditch spooky rather than add streaming version (spooky state is 304 bytes long!) 
--- lib/checksum.c | 33 +-- lib/checksum.h | 3 - lib/checksums/murmur3.c | 6 +- lib/checksums/murmur3.h | 2 +- lib/checksums/spooky-c.c | 439 ---------------------------------- lib/checksums/spooky-c.h | 55 ----- lib/formats/json.c | 6 +- tests/test_mains/test_hash.py | 9 +- tests/utils.py | 6 +- 9 files changed, 15 insertions(+), 544 deletions(-) delete mode 100644 lib/checksums/spooky-c.c delete mode 100644 lib/checksums/spooky-c.h diff --git a/lib/checksum.c b/lib/checksum.c index 2ca7c4aa..467df9eb 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -45,7 +45,6 @@ #include "checksums/citycrc.h" #include "checksums/murmur3.h" #include "checksums/sha3/sha3.h" -#include "checksums/spooky-c.h" #include "checksums/xxhash/xxhash.h" #include "utilities.h" @@ -191,32 +190,7 @@ static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) { copy->state = g_slice_copy(ALLOC_BYTES(digest->bytes), digest->state); } -/////////////////////////// -// spooky hashes // -/////////////////////////// - -/* TODO: this is broken; need to extend spooky API to add a streaming variant */ - -static void rm_digest_spooky32_update(RmDigest *digest, const unsigned char *data, RmOff size) { - uint32_t* hash = digest->state; - *hash = spooky_hash32(data, size, *hash); -} - -static void rm_digest_spooky64_update(RmDigest *digest, const unsigned char *data, RmOff size) { - uint64_t* hash = digest->state; - *hash = spooky_hash64(data, size, *hash); -} - -static void rm_digest_spooky_update(RmDigest *digest, const unsigned char *data, RmOff size) { - uint128 *hash = digest->state; - spooky_hash128(data, size, &hash->first, &hash->second); -} - #define GENERIC_FUNCS(ALGO) rm_digest_generic_init, rm_digest_generic_free, rm_digest_##ALGO##_update, rm_digest_generic_copy, NULL -static const RmDigestSpec spooky32_spec = { "spooky32", 32, GENERIC_FUNCS(spooky32) }; -static const RmDigestSpec spooky64_spec = { "spooky64", 64, GENERIC_FUNCS(spooky64) }; -static const RmDigestSpec spooky_spec = { 
"spooky", 128, GENERIC_FUNCS(spooky) }; - /////////////////////////// // xxhash // @@ -636,9 +610,6 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { static const RmDigestSpec *digest_specs[] = { [RM_DIGEST_UNKNOWN] = NULL, [RM_DIGEST_MURMUR] = &murmur_spec, - [RM_DIGEST_SPOOKY] = &spooky_spec, - [RM_DIGEST_SPOOKY32] = &spooky32_spec, - [RM_DIGEST_SPOOKY64] = &spooky64_spec, [RM_DIGEST_CITY] = &city_spec, [RM_DIGEST_MD5] = &md5_spec, [RM_DIGEST_SHA1] = &sha1_spec, @@ -686,7 +657,6 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { /* add some synonyms */ rm_digest_table_insert(*code_table, "sha3", RM_DIGEST_SHA3_256); - rm_digest_table_insert(*code_table, "spooky128", RM_DIGEST_SPOOKY); rm_digest_table_insert(*code_table, "highway", RM_DIGEST_HIGHWAY256); return NULL; @@ -717,8 +687,7 @@ const char *rm_digest_type_to_string(RmDigestType type) { /* TODO: remove? */ int rm_digest_type_to_multihash_id(RmDigestType type) { static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, - [RM_DIGEST_SPOOKY] = 14, [RM_DIGEST_SPOOKY32] = 16, - [RM_DIGEST_SPOOKY64] = 18, [RM_DIGEST_CITY] = 15, + [RM_DIGEST_CITY] = 15, [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, [RM_DIGEST_EXT] = 12, [RM_DIGEST_FARMHASH] = 19, diff --git a/lib/checksum.h b/lib/checksum.h index 4a50b85e..8334f029 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -38,9 +38,6 @@ typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, RM_DIGEST_MURMUR, - RM_DIGEST_SPOOKY, - RM_DIGEST_SPOOKY32, - RM_DIGEST_SPOOKY64, RM_DIGEST_CITY, RM_DIGEST_MD5, RM_DIGEST_SHA1, diff --git a/lib/checksums/murmur3.c b/lib/checksums/murmur3.c index dc27ac22..23436394 100644 --- a/lib/checksums/murmur3.c +++ b/lib/checksums/murmur3.c @@ -198,10 +198,12 @@ void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state *state) { g_slice_free(MurmurHash3_x86_32_state, state); } -void MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed, void 
*out) { +uint32_t MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed) { + uint32_t out; MurmurHash3_x86_32_state *state = MurmurHash3_x86_32_new(seed); MurmurHash3_x86_32_update(state, key, len); - MurmurHash3_x86_32_finalise(state, out); + MurmurHash3_x86_32_finalise(state, &out); + return out; } diff --git a/lib/checksums/murmur3.h b/lib/checksums/murmur3.h index a7285a46..800c5b04 100644 --- a/lib/checksums/murmur3.h +++ b/lib/checksums/murmur3.h @@ -65,7 +65,7 @@ void MurmurHash3_x64_128_free(MurmurHash3_x64_128_state *state); /** * convenience single-buffer hash */ -void MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed, void *out); +uint32_t MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed); void MurmurHash3_x86_128(const void *key, uint32_t len, uint32_t seed, void *out); void MurmurHash3_x64_128(const void *key, uint64_t len, uint32_t seed, void *out); diff --git a/lib/checksums/spooky-c.c b/lib/checksums/spooky-c.c deleted file mode 100644 index 37ac45ac..00000000 --- a/lib/checksums/spooky-c.c +++ /dev/null @@ -1,439 +0,0 @@ -// A C version of Bob Jenkins' spooky hash -// Spooky Hash -// A 128-bit noncryptographic hash, for checksums and table lookup -// By Bob Jenkins. Public domain. -// Oct 31 2010: published framework, disclaimer ShortHash isn't right -// Nov 7 2010: disabled ShortHash -// Oct 11 2011: C version ported by Andi Kleen (andikleen@github) -// Oct 31 2011: replace End, ShortMix, ShortEnd, enable ShortHash again -// Apr 10 2012: buffer overflow on platforms without unaligned reads -// Apr 27 2012: C version updated by Ziga Zupanec ziga.zupanec@gmail.com (agiz@github) - -// Assumes little endian ness. Caller has to check this case. 
- -#include - -#include "spooky-c.h" - -#if defined(__i386__) || defined(__x86_64__) // add more architectures here -#define ALLOW_UNALIGNED_READS 1 -#else -#define ALLOW_UNALIGNED_READS 0 -#endif - -// SC_CONST: a constant which: -// * is not zero -// * is odd -// * is a not-very-regular mix of 1's and 0's -// * does not need any other special mathematical properties -#define SC_CONST 0xdeadbeefdeadbeefLL - -static inline uint64_t rot64(uint64_t x, int k) { - return (x << k) | (x >> (64 - k)); -} - -// -// This is used if the input is 96 bytes long or longer. -// -// The internal state is fully overwritten every 96 bytes. -// Every input bit appears to cause at least 128 bits of entropy -// before 96 other bytes are combined, when run forward or backward -// For every input bit, -// Two inputs differing in just that input bit -// Where "differ" means xor or subtraction -// And the base value is random -// When run forward or backwards one Mix -// I tried 3 pairs of each; they all differed by at least 212 bits. 
-// -static inline void mix(const uint64_t *data, uint64_t *s0, uint64_t *s1, uint64_t *s2, - uint64_t *s3, uint64_t *s4, uint64_t *s5, uint64_t *s6, - uint64_t *s7, uint64_t *s8, uint64_t *s9, uint64_t *s10, - uint64_t *s11) { - *s0 += data[0]; - *s2 ^= *s10; - *s11 ^= *s0; - *s0 = rot64(*s0, 11); - *s11 += *s1; - *s1 += data[1]; - *s3 ^= *s11; - *s0 ^= *s1; - *s1 = rot64(*s1, 32); - *s0 += *s2; - *s2 += data[2]; - *s4 ^= *s0; - *s1 ^= *s2; - *s2 = rot64(*s2, 43); - *s1 += *s3; - *s3 += data[3]; - *s5 ^= *s1; - *s2 ^= *s3; - *s3 = rot64(*s3, 31); - *s2 += *s4; - *s4 += data[4]; - *s6 ^= *s2; - *s3 ^= *s4; - *s4 = rot64(*s4, 17); - *s3 += *s5; - *s5 += data[5]; - *s7 ^= *s3; - *s4 ^= *s5; - *s5 = rot64(*s5, 28); - *s4 += *s6; - *s6 += data[6]; - *s8 ^= *s4; - *s5 ^= *s6; - *s6 = rot64(*s6, 39); - *s5 += *s7; - *s7 += data[7]; - *s9 ^= *s5; - *s6 ^= *s7; - *s7 = rot64(*s7, 57); - *s6 += *s8; - *s8 += data[8]; - *s10 ^= *s6; - *s7 ^= *s8; - *s8 = rot64(*s8, 55); - *s7 += *s9; - *s9 += data[9]; - *s11 ^= *s7; - *s8 ^= *s9; - *s9 = rot64(*s9, 54); - *s8 += *s10; - *s10 += data[10]; - *s0 ^= *s8; - *s9 ^= *s10; - *s10 = rot64(*s10, 22); - *s9 += *s11; - *s11 += data[11]; - *s1 ^= *s9; - *s10 ^= *s11; - *s11 = rot64(*s11, 46); - *s10 += *s0; -} - -// -// Mix all 12 inputs together so that h0, h1 are a hash of them all. -// -// For two inputs differing in just the input bits -// Where "differ" means xor or subtraction -// And the base value is random, or a counting value starting at that bit -// The final result will have each bit of h0, h1 flip -// For every input bit, -// with probability 50 +- .3% -// For every pair of input bits, -// with probability 50 +- 3% -// -// This does not rely on the last Mix() call having already mixed some. -// Two iterations was almost good enough for a 64-bit result, but a -// 128-bit result is reported, so End() does three iterations. 
-// -static inline void endPartial(uint64_t *h0, uint64_t *h1, uint64_t *h2, uint64_t *h3, - uint64_t *h4, uint64_t *h5, uint64_t *h6, uint64_t *h7, - uint64_t *h8, uint64_t *h9, uint64_t *h10, uint64_t *h11) { - *h11 += *h1; - *h2 ^= *h11; - *h1 = rot64(*h1, 44); - *h0 += *h2; - *h3 ^= *h0; - *h2 = rot64(*h2, 15); - *h1 += *h3; - *h4 ^= *h1; - *h3 = rot64(*h3, 34); - *h2 += *h4; - *h5 ^= *h2; - *h4 = rot64(*h4, 21); - *h3 += *h5; - *h6 ^= *h3; - *h5 = rot64(*h5, 38); - *h4 += *h6; - *h7 ^= *h4; - *h6 = rot64(*h6, 33); - *h5 += *h7; - *h8 ^= *h5; - *h7 = rot64(*h7, 10); - *h6 += *h8; - *h9 ^= *h6; - *h8 = rot64(*h8, 13); - *h7 += *h9; - *h10 ^= *h7; - *h9 = rot64(*h9, 38); - *h8 += *h10; - *h11 ^= *h8; - *h10 = rot64(*h10, 53); - *h9 += *h11; - *h0 ^= *h9; - *h11 = rot64(*h11, 42); - *h10 += *h0; - *h1 ^= *h10; - *h0 = rot64(*h0, 54); -} - -static inline void end(uint64_t *h0, uint64_t *h1, uint64_t *h2, uint64_t *h3, - uint64_t *h4, uint64_t *h5, uint64_t *h6, uint64_t *h7, - uint64_t *h8, uint64_t *h9, uint64_t *h10, uint64_t *h11) { - endPartial(h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11); - endPartial(h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11); - endPartial(h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11); -} - -// -// The goal is for each bit of the input to expand into 128 bits of -// apparent entropy before it is fully overwritten. 
-// n trials both set and cleared at least m bits of h0 h1 h2 h3 -// n: 2 m: 29 -// n: 3 m: 46 -// n: 4 m: 57 -// n: 5 m: 107 -// n: 6 m: 146 -// n: 7 m: 152 -// when run forwards or backwards -// for all 1-bit and 2-bit diffs -// with diffs defined by either xor or subtraction -// with a base of all zeros plus a counter, or plus another bit, or random -// -static inline void short_mix(uint64_t *h0, uint64_t *h1, uint64_t *h2, uint64_t *h3) { - *h2 = rot64(*h2, 50); - *h2 += *h3; - *h0 ^= *h2; - *h3 = rot64(*h3, 52); - *h3 += *h0; - *h1 ^= *h3; - *h0 = rot64(*h0, 30); - *h0 += *h1; - *h2 ^= *h0; - *h1 = rot64(*h1, 41); - *h1 += *h2; - *h3 ^= *h1; - *h2 = rot64(*h2, 54); - *h2 += *h3; - *h0 ^= *h2; - *h3 = rot64(*h3, 48); - *h3 += *h0; - *h1 ^= *h3; - *h0 = rot64(*h0, 38); - *h0 += *h1; - *h2 ^= *h0; - *h1 = rot64(*h1, 37); - *h1 += *h2; - *h3 ^= *h1; - *h2 = rot64(*h2, 62); - *h2 += *h3; - *h0 ^= *h2; - *h3 = rot64(*h3, 34); - *h3 += *h0; - *h1 ^= *h3; - *h0 = rot64(*h0, 5); - *h0 += *h1; - *h2 ^= *h0; - *h1 = rot64(*h1, 36); - *h1 += *h2; - *h3 ^= *h1; -} - -// -// Mix all 4 inputs together so that h0, h1 are a hash of them all. 
-// -// For two inputs differing in just the input bits -// Where "differ" means xor or subtraction -// And the base value is random, or a counting value starting at that bit -// The final result will have each bit of h0, h1 flip -// For every input bit, -// with probability 50 +- .3% (it is probably better than that) -// For every pair of input bits, -// with probability 50 +- .75% (the worst case is approximately that) -// -static inline void short_end(uint64_t *h0, uint64_t *h1, uint64_t *h2, uint64_t *h3) { - *h3 ^= *h2; - *h2 = rot64(*h2, 15); - *h3 += *h2; - *h0 ^= *h3; - *h3 = rot64(*h3, 52); - *h0 += *h3; - *h1 ^= *h0; - *h0 = rot64(*h0, 26); - *h1 += *h0; - *h2 ^= *h1; - *h1 = rot64(*h1, 51); - *h2 += *h1; - *h3 ^= *h2; - *h2 = rot64(*h2, 28); - *h3 += *h2; - *h0 ^= *h3; - *h3 = rot64(*h3, 9); - *h0 += *h3; - *h1 ^= *h0; - *h0 = rot64(*h0, 47); - *h1 += *h0; - *h2 ^= *h1; - *h1 = rot64(*h1, 54); - *h2 += *h1; - *h3 ^= *h2; - *h2 = rot64(*h2, 32); - *h3 += *h2; - *h0 ^= *h3; - *h3 = rot64(*h3, 25); - *h0 += *h3; - *h1 ^= *h0; - *h0 = rot64(*h0, 63); - *h1 += *h0; -} - -void spooky_shorthash(const void *message, - size_t length, - uint64_t *hash1, - uint64_t *hash2) { - uint64_t buf[2 * SC_NUMVARS]; - union { - const uint8_t *p8; - uint32_t *p32; - uint64_t *p64; - size_t i; - } u; - size_t remainder; - uint64_t a, b, c, d; - u.p8 = (const uint8_t *)message; - - if(!ALLOW_UNALIGNED_READS && (u.i & 0x7)) { - memcpy(buf, message, length); - u.p64 = buf; - } - - remainder = length % 32; - a = *hash1; - b = *hash2; - c = SC_CONST; - d = SC_CONST; - - if(length > 15) { - const uint64_t *endp = u.p64 + (length / 32) * 4; - - // handle all complete sets of 32 bytes - for(; u.p64 < endp; u.p64 += 4) { - c += u.p64[0]; - d += u.p64[1]; - short_mix(&a, &b, &c, &d); - a += u.p64[2]; - b += u.p64[3]; - } - - // Handle the case of 16+ remaining bytes. 
- if(remainder >= 16) { - c += u.p64[0]; - d += u.p64[1]; - short_mix(&a, &b, &c, &d); - u.p64 += 2; - remainder -= 16; - } - } - - // Handle the last 0..15 bytes, and its length - d = ((uint64_t)length) << 56; - switch(remainder) { - case 15: - d += ((uint64_t)u.p8[14]) << 48; - case 14: - d += ((uint64_t)u.p8[13]) << 40; - case 13: - d += ((uint64_t)u.p8[12]) << 32; - case 12: - d += u.p32[2]; - c += u.p64[0]; - break; - case 11: - d += ((uint64_t)u.p8[10]) << 16; - case 10: - d += ((uint64_t)u.p8[9]) << 8; - case 9: - d += (uint64_t)u.p8[8]; - case 8: - c += u.p64[0]; - break; - case 7: - c += ((uint64_t)u.p8[6]) << 48; - case 6: - c += ((uint64_t)u.p8[5]) << 40; - case 5: - c += ((uint64_t)u.p8[4]) << 32; - case 4: - c += u.p32[0]; - break; - case 3: - c += ((uint64_t)u.p8[2]) << 16; - case 2: - c += ((uint64_t)u.p8[1]) << 8; - case 1: - c += (uint64_t)u.p8[0]; - break; - case 0: - c += SC_CONST; - d += SC_CONST; - } - short_end(&a, &b, &c, &d); - *hash1 = a; - *hash2 = b; -} - -void spooky_hash128(const void *message, - size_t length, - uint64_t *hash1, - uint64_t *hash2) { - uint64_t h0, h1, h2, h3, h4, h5, h6, h7, h8, h9, h10, h11; - uint64_t buf[SC_NUMVARS]; - uint64_t *endp; - union { - const uint8_t *p8; - uint64_t *p64; - uintptr_t i; - } u; - size_t remainder; - - if(length < SC_BUFSIZE) { - spooky_shorthash(message, length, hash1, hash2); - return; - } - - h0 = h3 = h6 = h9 = *hash1; - h1 = h4 = h7 = h10 = *hash2; - h2 = h5 = h8 = h11 = SC_CONST; - - u.p8 = (const uint8_t *)message; - endp = u.p64 + (length / SC_BLOCKSIZE) * SC_NUMVARS; - - // handle all whole blocks of SC_BLOCKSIZE bytes - if(ALLOW_UNALIGNED_READS || (u.i & 0x7) == 0) { - while(u.p64 < endp) { - mix(u.p64, &h0, &h1, &h2, &h3, &h4, &h5, &h6, &h7, &h8, &h9, &h10, &h11); - u.p64 += SC_NUMVARS; - } - } else { - while(u.p64 < endp) { - memcpy(buf, u.p64, SC_BLOCKSIZE); - mix(buf, &h0, &h1, &h2, &h3, &h4, &h5, &h6, &h7, &h8, &h9, &h10, &h11); - u.p64 += SC_NUMVARS; - } - } - - // handle the 
last partial block of SC_BLOCKSIZE bytes - remainder = (length - ((const uint8_t *)endp - (const uint8_t *)message)); - memcpy(buf, endp, remainder); - memset(((uint8_t *)buf) + remainder, 0, SC_BLOCKSIZE - remainder); - ((uint8_t *)buf)[SC_BLOCKSIZE - 1] = remainder; - mix(buf, &h0, &h1, &h2, &h3, &h4, &h5, &h6, &h7, &h8, &h9, &h10, &h11); - - // do some final mixing - end(&h0, &h1, &h2, &h3, &h4, &h5, &h6, &h7, &h8, &h9, &h10, &h11); - *hash1 = h0; - *hash2 = h1; -} - -uint64_t spooky_hash64(const void *message, size_t length, uint64_t seed) { - uint64_t hash1 = seed; - spooky_hash128(message, length, &hash1, &seed); - return hash1; -} - -uint32_t spooky_hash32(const void *message, size_t length, uint32_t seed) { - uint64_t hash1 = seed, hash2 = seed; - spooky_hash128(message, length, &hash1, &hash2); - return (uint32_t)hash1; -} diff --git a/lib/checksums/spooky-c.h b/lib/checksums/spooky-c.h deleted file mode 100644 index 560e517e..00000000 --- a/lib/checksums/spooky-c.h +++ /dev/null @@ -1,55 +0,0 @@ -// SpookyHash: a 128-bit noncryptographic hash function -// By Bob Jenkins, public domain -// Oct 31 2010: alpha, framework + SpookyHash::Mix appears right -// Oct 11 2011: C version ported by Andi Kleen (andikleen@github) -// Oct 31 2011: alpha again, Mix only good to 2^^69 but rest appears right -// Dec 31 2011: beta, improved Mix, tested it for 2-bit deltas -// Feb 2 2012: production, same bits as beta -// Feb 5 2012: adjusted definitions of uint* to be more portable -// Mar 30 2012: 3 bytes/cycle, not 4. Alpha was 4 but wasn't thorough enough. -// Apr 27 2012: C version updated by Ziga Zupanec ziga.zupanec@gmail.com (agiz@github) -// -// Up to 3 bytes/cycle for long messages. Reasonably fast for short messages. -// All 1 or 2 bit deltas achieve avalanche within 1% bias per output bit. -// -// This was developed for and tested on 64-bit x86-compatible processors. -// It assumes the processor is little-endian. 
There is a macro -// controlling whether unaligned reads are allowed (by default they are). -// This should be an equally good hash on big-endian machines, but it will -// compute different results on them than on little-endian machines. -// -// Google's CityHash has similar specs to SpookyHash, and CityHash is faster -// on some platforms. MD4 and MD5 also have similar specs, but they are orders -// of magnitude slower. CRCs are two or more times slower, but unlike -// SpookyHash, they have nice math for combining the CRCs of pieces to form -// the CRCs of wholes. There are also cryptographic hashes, but those are even -// slower than MD5. -// - -#include -#include - -#define SC_NUMVARS 12 -#define SC_BLOCKSIZE (8 * SC_NUMVARS) -#define SC_BUFSIZE (2 * SC_BLOCKSIZE) - -struct spooky_state { - uint64_t m_data[2 * SC_NUMVARS]; - uint64_t m_state[SC_NUMVARS]; - size_t m_length; - unsigned char m_remainder; -}; - -void spooky_copy(struct spooky_state *dest, struct spooky_state *src); - -void spooky_shorthash(const void *message, - size_t length, - uint64_t *hash1, - uint64_t *hash2); - -// hash1/2 doubles as input parameter for seed1/2 and output for hash1/2 -void spooky_hash128(const void *message, size_t length, uint64_t *hash1, uint64_t *hash2); - -uint64_t spooky_hash64(const void *message, size_t len, uint64_t seed); - -uint32_t spooky_hash32(const void *message, size_t len, uint32_t seed); diff --git a/lib/formats/json.c b/lib/formats/json.c index 67c7d237..e9e7586d 100644 --- a/lib/formats/json.c +++ b/lib/formats/json.c @@ -23,7 +23,7 @@ * */ -#include "../checksums/spooky-c.h" +#include "../checksums/murmur3.h" #include "../formats.h" #include "../preprocess.h" #include "../utilities.h" @@ -54,8 +54,8 @@ static guint32 rm_fmt_json_generate_id(RmFmtHandlerJSON *self, RmFile *file, hash ^= file->actual_file_size; for(int i = 0; i < 8192; ++i) { - hash ^= spooky_hash32(file_path, strlen(file_path), i); - hash ^= spooky_hash32(cksum, strlen(cksum), i); + hash ^= 
MurmurHash3_x86_32(file_path, strlen(file_path), i); + hash ^= MurmurHash3_x86_32(cksum, strlen(cksum), i); if(!g_hash_table_contains(self->id_set, GUINT_TO_POINTER(hash))) { break; diff --git a/tests/test_mains/test_hash.py b/tests/test_mains/test_hash.py index 126666b8..2765c3c8 100644 --- a/tests/test_mains/test_hash.py +++ b/tests/test_mains/test_hash.py @@ -3,6 +3,7 @@ from nose import with_setup from tests.utils import * +from nose.plugins.attrib import attr INCREMENTS = [4096, 1024, 1, 20000] @@ -28,10 +29,7 @@ def streaming_compliance_check(*patterns): break -@with_setup(usual_setup_func, usual_teardown_func) -def test_spooky(): - streaming_compliance_check('spooky') - +@attr('known_issue') @with_setup(usual_setup_func, usual_teardown_func) def test_city(): streaming_compliance_check('city') @@ -52,10 +50,12 @@ def test_sha3(): def test_blake(): streaming_compliance_check('blake') +@attr('known_issue') @with_setup(usual_setup_func, usual_teardown_func) def test_xx(): streaming_compliance_check('xxhash') +@attr('known_issue') @with_setup(usual_setup_func, usual_teardown_func) def test_farm(): streaming_compliance_check('farm') @@ -64,6 +64,7 @@ def test_farm(): def test_highway(): streaming_compliance_check('highway') +@attr('known_issue') @with_setup(usual_setup_func, usual_teardown_func) def test_cumulative(): streaming_compliance_check('cumulative') diff --git a/tests/utils.py b/tests/utils.py index 3d7eb322..d8789412 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -17,14 +17,10 @@ CKSUM_TYPES = [ 'murmur', - 'spooky', - 'spooky32', - 'spooky64', 'city', 'md5', 'sha1', 'sha256', - 'sha512', 'sha3-256', 'sha3-384', 'sha3-512', @@ -243,7 +239,7 @@ def run_rmlint_pedantic(*args, **kwargs): # Note: sha512 is supported on all system which have # no recent enough glib with. God forsaken debian people. 
     if has_feature('sha512'):
-        cksum_types.append('sha512')
+        CKSUM_TYPES.append('sha512')

     for cksum_type in CKSUM_TYPES:
         options.append('--algorithm=' + cksum_type)

From b71a3d9426d8a4ab887921fc31f0af04574621da Mon Sep 17 00:00:00 2001
From: SeeSpotRun
Date: Mon, 13 Nov 2017 08:30:44 +1000
Subject: [PATCH 102/180] checksum: convert xxhash to streaming implementation

---
 lib/checksum.c                | 27 ++++++++++++++++++++-------
 tests/test_mains/test_hash.py |  2 --
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/lib/checksum.c b/lib/checksum.c
index 467df9eb..5a6476ca 100644
--- a/lib/checksum.c
+++ b/lib/checksum.c
@@ -196,14 +196,30 @@ static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) {
 // xxhash //
 ///////////////////////////

-/* TODO: this is probably broken; should use streaming variant XXH64_update() */
+static void rm_digest_xxhash_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) {
+    digest->state = XXH64_createState();
+    XXH64_reset(digest->state, seed1 ^ seed2);
+}
+
+static void rm_digest_xxhash_free(RmDigest *digest) {
+    XXH64_freeState(digest->state);
+}

 static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, RmOff size) {
-    unsigned long long *hash = digest->state;
-    *hash = XXH64(data, size, *hash);
+    XXH64_update(digest->state, data, size);
 }

-static const RmDigestSpec xxhash_spec = { "xxhash", 64, GENERIC_FUNCS(xxhash)};
+static void rm_digest_xxhash_copy(RmDigest *digest, RmDigest *copy) {
+    copy->state = XXH64_createState();
+    memcpy(copy->state, digest->state, sizeof(XXH64_state_t));
+}
+
+static void rm_digest_xxhash_steal(RmDigest *digest, guint8 *result) {
+    *(unsigned long long*)result = XXH64_digest(digest->state);
+}
+
+
+static const RmDigestSpec xxhash_spec = { "xxhash", 64, rm_digest_xxhash_init, rm_digest_xxhash_free, rm_digest_xxhash_update, rm_digest_xxhash_copy, rm_digest_xxhash_steal};
 ///////////////////////////
 // farmhash //
 ///////////////////////////
@@ -223,9 +239,6 @@ static const RmDigestSpec farmhash_spec = { "farmhash", 64, GENERIC_FUNCS(farmh
 ///////////////////////////


-
-
-
 #define CREATE_MURMUR_FUNCS(TYPE) \
 static void rm_digest_murmur_##TYPE##_free(RmDigest *digest) { \
     MurmurHash3_##TYPE##_free(digest->state); \
diff --git a/tests/test_mains/test_hash.py b/tests/test_mains/test_hash.py
index 2765c3c8..8c1f0726 100644
--- a/tests/test_mains/test_hash.py
+++ b/tests/test_mains/test_hash.py
@@ -50,7 +50,6 @@ def test_sha3():
 def test_blake():
     streaming_compliance_check('blake')

-@attr('known_issue')
 @with_setup(usual_setup_func, usual_teardown_func)
 def test_xx():
     streaming_compliance_check('xxhash')
@@ -64,7 +63,6 @@ def test_farm():
 def test_highway():
     streaming_compliance_check('highway')

-@attr('known_issue')
 @with_setup(usual_setup_func, usual_teardown_func)
 def test_cumulative():
     streaming_compliance_check('cumulative')

From 315e0f594ce75fed7148b6317cbcfa077c5ca8f5 Mon Sep 17 00:00:00 2001
From: SeeSpotRun
Date: Mon, 13 Nov 2017 10:49:00 +1000
Subject: [PATCH 103/180] checksum: fix cumulative so that is uses all data

---
 lib/checksum.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/checksum.c b/lib/checksum.c
index 5a6476ca..6d5d353f 100644
--- a/lib/checksum.c
+++ b/lib/checksum.c
@@ -315,9 +315,8 @@ static const RmDigestSpec city_spec = { "city", 128, GENERIC_FUNCS(city)};
 static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *data, RmOff size) {
     /* This only XORS the two checksums. */
     guint8 *hash = digest->state;
-    RmOff bytes = MIN(size, digest->bytes);
-    for(gsize i = 0; i < bytes; ++i) {
-        hash[i] ^= ((guint8 *)data)[i];
+    for(gsize i = 0; i < size; ++i) {
+        hash[i % digest->bytes] ^= ((guint8 *)data)[i];
     }
 }


From 33f197d4dca8db6886f39623501302f5f55d4092 Mon Sep 17 00:00:00 2001
From: SeeSpotRun
Date: Mon, 13 Nov 2017 10:49:38 +1000
Subject: [PATCH 104/180] checksum: fix highway64 casting during steal

---
 lib/checksum.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/checksum.c b/lib/checksum.c
index 6d5d353f..f69767eb 100644
--- a/lib/checksum.c
+++ b/lib/checksum.c
@@ -362,7 +362,7 @@ static void rm_digest_highway128_steal(RmDigest *digest, guint8 *result) {
 }

 static void rm_digest_highway64_steal(RmDigest *digest, guint8 *result) {
-    *result = HighwayHashCatFinish64(digest->state);
+    *(uint64_t*)result = HighwayHashCatFinish64(digest->state);
 }

 #define HIGHWAY_SPEC(BITS) BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, rm_digest_highway_copy, rm_digest_highway##BITS##_steal

From de826d3e1fd886e2ed46a55d87275e8d353442c5 Mon Sep 17 00:00:00 2001
From: SeeSpotRun
Date: Mon, 13 Nov 2017 11:53:38 +1000
Subject: [PATCH 105/180] checksum: remove city hashes; converting to streaming would require 184-byte state

---
 lib/checksum.c                |  26 ---
 lib/checksum.h                |   1 -
 lib/checksums/city.c          | 389 ----------------------------------
 lib/checksums/city.h          |  85 --------
 lib/checksums/citycrc.h       |  48 -----
 lib/cmdline.c                 |   6 +-
 tests/test_mains/test_hash.py |   6 -
 tests/utils.py                |   1 -
 8 files changed, 3 insertions(+), 559 deletions(-)
 delete mode 100644 lib/checksums/city.c
 delete mode 100644 lib/checksums/city.h
 delete mode 100644 lib/checksums/citycrc.h

diff --git a/lib/checksum.c b/lib/checksum.c
index f69767eb..7a647996 100644
--- a/lib/checksum.c
+++ b/lib/checksum.c
@@ -41,8 +41,6 @@

 #include "checksum.h"
 #include "checksums/blake2/blake2.h"
-#include "checksums/city.h"
-#include "checksums/citycrc.h"
#include "checksums/murmur3.h" #include "checksums/sha3/sha3.h" #include "checksums/xxhash/xxhash.h" @@ -286,28 +284,6 @@ static const RmDigestSpec murmur_spec = { "murmur", 128, MURMUR_FUNCS(x64_128)}; #endif -/////////////////////////// -// cityhash // -/////////////////////////// - -static void rm_digest_city_update(RmDigest *digest, const unsigned char *data, RmOff size) { - - /* TODO: check that this is not broken, i.e. final hash is independent of increment size */ - - /* There is a more optimized version but it needs the crc command of sse4.2 - * (available on Intel Nehalem and up; my amd box doesn't have this though) - */ - uint128 *hash = digest->state; -#ifdef __SSE4_2__ - *hash = CityHashCrc128WithSeed((const char *)data, size, *hash); -#else - *hash = CityHash128WithSeed((const char *)data, size, *hash); -#endif -} - -static const RmDigestSpec city_spec = { "city", 128, GENERIC_FUNCS(city)}; - - /////////////////////////// // cumulative // /////////////////////////// @@ -622,7 +598,6 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { static const RmDigestSpec *digest_specs[] = { [RM_DIGEST_UNKNOWN] = NULL, [RM_DIGEST_MURMUR] = &murmur_spec, - [RM_DIGEST_CITY] = &city_spec, [RM_DIGEST_MD5] = &md5_spec, [RM_DIGEST_SHA1] = &sha1_spec, [RM_DIGEST_SHA256] = &sha256_spec, @@ -699,7 +674,6 @@ const char *rm_digest_type_to_string(RmDigestType type) { /* TODO: remove? 
*/ int rm_digest_type_to_multihash_id(RmDigestType type) { static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, - [RM_DIGEST_CITY] = 15, [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, [RM_DIGEST_EXT] = 12, [RM_DIGEST_FARMHASH] = 19, diff --git a/lib/checksum.h b/lib/checksum.h index 8334f029..31390789 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -38,7 +38,6 @@ typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, RM_DIGEST_MURMUR, - RM_DIGEST_CITY, RM_DIGEST_MD5, RM_DIGEST_SHA1, RM_DIGEST_SHA256, diff --git a/lib/checksums/city.c b/lib/checksums/city.c deleted file mode 100644 index 1a34c3fc..00000000 --- a/lib/checksums/city.c +++ /dev/null @@ -1,389 +0,0 @@ -// city.c - cityhash-c -// CityHash on C -// Copyright (c) 2011-2012, Alexander Nusov -// -// - original copyright notice - -// Copyright (c) 2011 Google, Inc. -// -// Permission is hereby granted, free of charge, to any person obtaining a copy -// of this software and associated documentation files (the "Software"), to deal -// in the Software without restriction, including without limitation the rights -// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -// copies of the Software, and to permit persons to whom the Software is -// furnished to do so, subject to the following conditions: -// -// The above copyright notice and this permission notice shall be included in -// all copies or substantial portions of the Software. -// -// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -// THE SOFTWARE. 
-// -// CityHash, by Geoff Pike and Jyrki Alakuijala -// -// This file provides CityHash64() and related functions. -// -// It's probably possible to create even faster hash functions by -// writing a program that systematically explores some of the space of -// possible hash functions, by using SIMD instructions, or by -// compromising on hash quality. - -#include "city.h" -#include - -static uint64 UNALIGNED_LOAD64(const char *p) { - uint64 result; - memcpy(&result, p, sizeof(result)); - return result; -} - -static uint32 UNALIGNED_LOAD32(const char *p) { - uint32 result; - memcpy(&result, p, sizeof(result)); - return result; -} - -#if !defined(WORDS_BIGENDIAN) - -#define uint32_in_expected_order(x) (x) -#define uint64_in_expected_order(x) (x) - -#else - -#ifdef _MSC_VER -#include -#define bswap_32(x) _byteswap_ulong(x) -#define bswap_64(x) _byteswap_uint64(x) - -#elif defined(__APPLE__) -// Mac OS X / Darwin features -#include -#define bswap_32(x) OSSwapInt32(x) -#define bswap_64(x) OSSwapInt64(x) - -#else -#include -#endif - -#define uint32_in_expected_order(x) (bswap_32(x)) -#define uint64_in_expected_order(x) (bswap_64(x)) - -#endif // WORDS_BIGENDIAN - -#if !defined(LIKELY) -#if HAVE_BUILTIN_EXPECT -#define LIKELY(x) (__builtin_expect(!!(x), 1)) -#else -#define LIKELY(x) (x) -#endif -#endif - -static uint64 Fetch64(const char *p) { - return uint64_in_expected_order(UNALIGNED_LOAD64(p)); -} - -static uint32 Fetch32(const char *p) { - return uint32_in_expected_order(UNALIGNED_LOAD32(p)); -} - -// Some primes between 2^63 and 2^64 for various uses. -static const uint64 k0 = 0xc3a5c85c97cb3127ULL; -static const uint64 k1 = 0xb492b66fbe98f273ULL; -static const uint64 k2 = 0x9ae16a3b2f90404fULL; -static const uint64 k3 = 0xc949d7c7509e6557ULL; - -// Hash 128 input bits down to 64 bits of output. -// This is intended to be a reasonably good hash function. -static inline uint64 Hash128to64(const uint128 x) { - // Murmur-inspired hashing. 
- const uint64 kMul = 0x9ddfea08eb382d69ULL; - uint64 a = (Uint128Low64(x) ^ Uint128High64(x)) * kMul; - a ^= (a >> 47); - uint64 b = (Uint128High64(x) ^ a) * kMul; - b ^= (b >> 47); - b *= kMul; - return b; -} - -// Bitwise right rotate. Normally this will compile to a single -// instruction, especially if the shift is a manifest constant. -static uint64 Rotate(uint64 val, int shift) { - // Avoid shifting by 64: doing so yields an undefined result. - return shift == 0 ? val : ((val >> shift) | (val << (64 - shift))); -} - -// Equivalent to Rotate(), but requires the second arg to be non-zero. -// On x86-64, and probably others, it's possible for this to compile -// to a single instruction if both args are already in registers. -static uint64 RotateByAtLeast1(uint64 val, int shift) { - return (val >> shift) | (val << (64 - shift)); -} - -static uint64 ShiftMix(uint64 val) { - return val ^ (val >> 47); -} - -static uint64 HashLen16(uint64 u, uint64 v) { - uint128 result; - result.first = u; - result.second = v; - return Hash128to64(result); -} - -static uint64 HashLen0to16(const char *s, size_t len) { - if(len > 8) { - uint64 a = Fetch64(s); - uint64 b = Fetch64(s + len - 8); - return HashLen16(a, RotateByAtLeast1(b + len, len)) ^ b; - } - if(len >= 4) { - uint64 a = Fetch32(s); - return HashLen16(len + (a << 3), Fetch32(s + len - 4)); - } - if(len > 0) { - uint8 a = s[0]; - uint8 b = s[len >> 1]; - uint8 c = s[len - 1]; - uint32 y = (uint32)(a) + ((uint32)(b) << 8); - uint32 z = len + ((uint32)(c) << 2); - return ShiftMix(y * k2 ^ z * k3) * k2; - } - return k2; -} - -// This probably works well for 16-byte strings as well, but it may be overkill -// in that case. 
-static uint64 HashLen17to32(const char *s, size_t len) { - uint64 a = Fetch64(s) * k1; - uint64 b = Fetch64(s + 8); - uint64 c = Fetch64(s + len - 8) * k2; - uint64 d = Fetch64(s + len - 16) * k0; - return HashLen16(Rotate(a - b, 43) + Rotate(c, 30) + d, - a + Rotate(b ^ k3, 20) - c + len); -} - -// Return a 16-byte hash for 48 bytes. Quick and dirty. -// Callers do best to use "random-looking" values for a and b. -// static pair WeakHashLen32WithSeeds( -uint128 WeakHashLen32WithSeeds6(uint64 w, uint64 x, uint64 y, uint64 z, uint64 a, - uint64 b) { - a += w; - b = Rotate(b + a + z, 21); - uint64 c = a; - a += x; - a += y; - b += Rotate(a, 44); - - uint128 result; - result.first = (uint64)(a + z); - result.second = (uint64)(b + c); - return result; -} - -// Return a 16-byte hash for s[0] ... s[31], a, and b. Quick and dirty. -// static pair WeakHashLen32WithSeeds( -uint128 WeakHashLen32WithSeeds(const char *s, uint64 a, uint64 b) { - return WeakHashLen32WithSeeds6(Fetch64(s), Fetch64(s + 8), Fetch64(s + 16), - Fetch64(s + 24), a, b); -} - -// Return an 8-byte hash for 33 to 64 bytes. 
-static uint64 HashLen33to64(const char *s, size_t len) { - uint64 z = Fetch64(s + 24); - uint64 a = Fetch64(s) + (len + Fetch64(s + len - 16)) * k0; - uint64 b = Rotate(a + z, 52); - uint64 c = Rotate(a, 37); - a += Fetch64(s + 8); - c += Rotate(a, 7); - a += Fetch64(s + 16); - uint64 vf = a + z; - uint64 vs = b + Rotate(a, 31) + c; - a = Fetch64(s + 16) + Fetch64(s + len - 32); - z = Fetch64(s + len - 8); - b = Rotate(a + z, 52); - c = Rotate(a, 37); - a += Fetch64(s + len - 24); - c += Rotate(a, 7); - a += Fetch64(s + len - 16); - uint64 wf = a + z; - uint64 ws = b + Rotate(a, 31) + c; - uint64 r = ShiftMix((vf + ws) * k2 + (wf + vs) * k0); - return ShiftMix(r * k0 + vs) * k2; -} - -uint64 CityHash64(const char *s, size_t len) { - if(len <= 32) { - if(len <= 16) { - return HashLen0to16(s, len); - } else { - return HashLen17to32(s, len); - } - } else if(len <= 64) { - return HashLen33to64(s, len); - } - - // For strings over 64 bytes we hash the end first, and then as we - // loop we keep 56 bytes of state: v, w, x, y, and z. - uint64 x = Fetch64(s + len - 40); - uint64 y = Fetch64(s + len - 16) + Fetch64(s + len - 56); - uint64 z = HashLen16(Fetch64(s + len - 48) + len, Fetch64(s + len - 24)); - uint64 temp; - uint128 v = WeakHashLen32WithSeeds(s + len - 64, len, z); - uint128 w = WeakHashLen32WithSeeds(s + len - 32, y + k1, x); - x = x * k1 + Fetch64(s); - - // Decrease len to the nearest multiple of 64, and operate on 64-byte chunks. 
- len = (len - 1) & ~(size_t)(63); - do { - x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1; - y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1; - x ^= w.second; - y += v.first + Fetch64(s + 40); - z = Rotate(z + w.first, 33) * k1; - v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first); - w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16)); - temp = z; - z = x; - x = temp; - s += 64; - len -= 64; - } while(len != 0); - return HashLen16(HashLen16(v.first, w.first) + ShiftMix(y) * k1 + z, - HashLen16(v.second, w.second) + x); -} - -uint64 CityHash64WithSeed(const char *s, size_t len, uint64 seed) { - return CityHash64WithSeeds(s, len, k2, seed); -} - -uint64 CityHash64WithSeeds(const char *s, size_t len, uint64 seed0, uint64 seed1) { - return HashLen16(CityHash64(s, len) - seed0, seed1); -} - -// A subroutine for CityHash128(). Returns a decent 128-bit hash for strings -// of any length representable in signed long. Based on City and Murmur. -static uint128 CityMurmur(const char *s, size_t len, uint128 seed) { - uint64 a = Uint128Low64(seed); - uint64 b = Uint128High64(seed); - uint64 c = 0; - uint64 d = 0; - signed long l = len - 16; - if(l <= 0) { // len <= 16 - a = ShiftMix(a * k1) * k1; - c = b * k1 + HashLen0to16(s, len); - d = ShiftMix(a + (len >= 8 ? Fetch64(s) : c)); - } else { // len > 16 - c = HashLen16(Fetch64(s + len - 8) + k1, a); - d = HashLen16(b + len, c + Fetch64(s + len - 16)); - a += d; - do { - a ^= ShiftMix(Fetch64(s) * k1) * k1; - a *= k1; - b ^= a; - c ^= ShiftMix(Fetch64(s + 8) * k1) * k1; - c *= k1; - d ^= c; - s += 16; - l -= 16; - } while(l > 0); - } - a = HashLen16(a, c); - b = HashLen16(d, b); - - uint128 result; - result.first = (uint64)(a ^ b); - result.second = (uint64)(HashLen16(b, a)); - return result; -} - -uint128 CityHash128WithSeed(const char *s, size_t len, uint128 seed) { - if(len < 128) { - return CityMurmur(s, len, seed); - } - - // We expect len >= 128 to be the common case. 
Keep 56 bytes of state: - // v, w, x, y, and z. - uint128 v, w; - uint64 x = Uint128Low64(seed); - uint64 y = Uint128High64(seed); - uint64 z = len * k1; - uint64 temp; - v.first = Rotate(y ^ k1, 49) * k1 + Fetch64(s); - v.second = Rotate(v.first, 42) * k1 + Fetch64(s + 8); - w.first = Rotate(y + z, 35) * k1 + x; - w.second = Rotate(x + Fetch64(s + 88), 53) * k1; - - // This is the same inner loop as CityHash64(), manually unrolled. - do { - x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1; - y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1; - x ^= w.second; - y += v.first + Fetch64(s + 40); - z = Rotate(z + w.first, 33) * k1; - v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first); - w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16)); - temp = z; - z = x; - x = temp; - s += 64; - x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1; - y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1; - x ^= w.second; - y += v.first + Fetch64(s + 40); - z = Rotate(z + w.first, 33) * k1; - v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first); - w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16)); - temp = z; - z = x; - x = temp; - s += 64; - len -= 128; - } while(LIKELY(len >= 128)); - x += Rotate(v.first + z, 49) * k0; - z += Rotate(w.first, 37) * k0; - // If 0 < len < 128, hash up to 4 chunks of 32 bytes each from the end of s. - size_t tail_done; - for(tail_done = 0; tail_done < len;) { - tail_done += 32; - y = Rotate(x + y, 42) * k0 + v.second; - w.first += Fetch64(s + len - tail_done + 16); - x = x * k0 + w.first; - z += w.second + Fetch64(s + len - tail_done); - w.second += v.first; - v = WeakHashLen32WithSeeds(s + len - tail_done, v.first + z, v.second); - } - // At this point our 56 bytes of state should contain more than - // enough information for a strong 128-bit hash. We use two - // different 56-byte-to-8-byte hashes to get a 16-byte final result. 
- x = HashLen16(x, v.first); - y = HashLen16(y + z, w.first); - - uint128 result; - result.first = (uint64)(HashLen16(x + v.second, w.second) + y); - result.second = (uint64)HashLen16(x + w.second, y + v.second); - return result; -} - -uint128 CityHash128(const char *s, size_t len) { - uint128 r; - if(len >= 16) { - r.first = (uint64)(Fetch64(s) ^ k3); - r.second = (uint64)(Fetch64(s + 8)); - - return CityHash128WithSeed(s + 16, len - 16, r); - - } else if(len >= 8) { - r.first = (uint64)(Fetch64(s) ^ (len * k0)); - r.second = (uint64)(Fetch64(s + len - 8) ^ k1); - - return CityHash128WithSeed(NULL, 0, r); - } else { - r.first = (uint64)k0; - r.second = (uint64)k1; - return CityHash128WithSeed(s, len, r); - } -} diff --git a/lib/checksums/city.h b/lib/checksums/city.h deleted file mode 100644 index 5fe54559..00000000 --- a/lib/checksums/city.h +++ /dev/null @@ -1,85 +0,0 @@ -// city.h - cityhash-c -// CityHash on C -// Copyright (c) 2011-2012, Alexander Nusov -// -// - original copyright notice - -// Copyright (c) 2011 Google, Inc. -// -// Permission is hereby granted, free of charge, to any person obtaining a copy -// of this software and associated documentation files (the "Software"), to deal -// in the Software without restriction, including without limitation the rights -// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -// copies of the Software, and to permit persons to whom the Software is -// furnished to do so, subject to the following conditions: -// -// The above copyright notice and this permission notice shall be included in -// all copies or substantial portions of the Software. -// -// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE -// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -// THE SOFTWARE. -// -// CityHash, by Geoff Pike and Jyrki Alakuijala -// -// This file provides a few functions for hashing strings. On x86-64 -// hardware in 2011, CityHash64() is faster than other high-quality -// hash functions, such as Murmur. This is largely due to higher -// instruction-level parallelism. CityHash64() and CityHash128() also perform -// well on hash-quality tests. -// -// CityHash128() is optimized for relatively long strings and returns -// a 128-bit hash. For strings more than about 2000 bytes it can be -// faster than CityHash64(). -// -// Functions in the CityHash family are not suitable for cryptography. -// -// WARNING: This code has not been tested on big-endian platforms! -// It is known to work well on little-endian platforms that have a small penalty -// for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs. -// -// By the way, for some hash functions, given strings a and b, the hash -// of a+b is easily derived from the hashes of a and b. This property -// doesn't hold for any hash functions in this file. - -#ifndef CITY_HASH_H_ -#define CITY_HASH_H_ - -#include -#include - -typedef uint8_t uint8; -typedef uint32_t uint32; -typedef uint64_t uint64; - -typedef struct _uint128 uint128; -struct _uint128 { - uint64 first; - uint64 second; -}; - -#define Uint128Low64(x) (x).first -#define Uint128High64(x) (x).second - -// Hash function for a byte array. -uint64 CityHash64(const char *buf, size_t len); - -// Hash function for a byte array. For convenience, a 64-bit seed is also -// hashed into the result. -uint64 CityHash64WithSeed(const char *buf, size_t len, uint64 seed); - -// Hash function for a byte array. 
For convenience, two seeds are also -// hashed into the result. -uint64 CityHash64WithSeeds(const char *buf, size_t len, uint64 seed0, uint64 seed1); - -// Hash function for a byte array. -uint128 CityHash128(const char *s, size_t len); - -// Hash function for a byte array. For convenience, a 128-bit seed is also -// hashed into the result. -uint128 CityHash128WithSeed(const char *s, size_t len, uint128 seed); - -#endif // CITY_HASH_H_ diff --git a/lib/checksums/citycrc.h b/lib/checksums/citycrc.h deleted file mode 100644 index 2001bfea..00000000 --- a/lib/checksums/citycrc.h +++ /dev/null @@ -1,48 +0,0 @@ -// citycrc.h - cityhash-c -// CityHash on C -// Copyright (c) 2011-2012, Alexander Nusov -// -// - original copyright notice - -// Copyright (c) 2011 Google, Inc. -// -// Permission is hereby granted, free of charge, to any person obtaining a copy -// of this software and associated documentation files (the "Software"), to deal -// in the Software without restriction, including without limitation the rights -// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -// copies of the Software, and to permit persons to whom the Software is -// furnished to do so, subject to the following conditions: -// -// The above copyright notice and this permission notice shall be included in -// all copies or substantial portions of the Software. -// -// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -// THE SOFTWARE. 
-// -// CityHash, by Geoff Pike and Jyrki Alakuijala -// -// This file declares the subset of the CityHash functions that require -// _mm_crc32_u64(). See the CityHash README for details. -// -// Functions in the CityHash family are not suitable for cryptography. - -#ifndef CITY_HASH_CRC_H_ -#define CITY_HASH_CRC_H_ - -#include "city.h" - -// Hash function for a byte array. -uint128 CityHashCrc128(const char *s, size_t len); - -// Hash function for a byte array. For convenience, a 128-bit seed is also -// hashed into the result. -uint128 CityHashCrc128WithSeed(const char *s, size_t len, uint128 seed); - -// Hash function for a byte array. Sets result[0] ... result[3]. -void CityHashCrc256(const char *s, size_t len, uint64 *result); - -#endif // CITY_HASH_CRC_H_ diff --git a/lib/cmdline.c b/lib/cmdline.c index dc84af92..08072a60 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -753,16 +753,16 @@ static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, /* Handle the paranoia option */ switch(paranoia_counter) { case -2: - cfg->checksum_type = RM_DIGEST_MURMUR; + cfg->checksum_type = RM_DIGEST_XXHASH; // 64-bit non-crypto break; case -1: - cfg->checksum_type = RM_DIGEST_CITY; + cfg->checksum_type = RM_DIGEST_MURMUR; // 128-bit non-crypto break; case 0: /* leave users choice of -a (default) */ break; case 1: - cfg->checksum_type = RM_DIGEST_BLAKE2B; + cfg->checksum_type = RM_DIGEST_BLAKE2B; // 512-bit crypto break; case 2: cfg->checksum_type = RM_DIGEST_PARANOID; diff --git a/tests/test_mains/test_hash.py b/tests/test_mains/test_hash.py index 8c1f0726..eb3e9d42 100644 --- a/tests/test_mains/test_hash.py +++ b/tests/test_mains/test_hash.py @@ -28,12 +28,6 @@ def streaming_compliance_check(*patterns): assert False, "{} fails streaming test with increment {}".format(algo, increment) break - -@attr('known_issue') -@with_setup(usual_setup_func, usual_teardown_func) -def test_city(): - streaming_compliance_check('city') - @with_setup(usual_setup_func, 
usual_teardown_func) def test_murmur(): streaming_compliance_check('murmur') diff --git a/tests/utils.py b/tests/utils.py index d8789412..a3b75bd2 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -17,7 +17,6 @@ CKSUM_TYPES = [ 'murmur', - 'city', 'md5', 'sha1', 'sha256', From 6e4531569dd86300372a8723e7ebc388c4e0ae46 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 12:05:18 +1000 Subject: [PATCH 106/180] remove farmhash - its 128-bit variant was just a copy of cityhash anyway... --- lib/checksum.c | 14 - lib/checksum.h | 2 - lib/checksums/farmhash.c | 1651 -------------------------------------- lib/checksums/farmhash.h | 166 ---- 4 files changed, 1833 deletions(-) delete mode 100644 lib/checksums/farmhash.c delete mode 100644 lib/checksums/farmhash.h diff --git a/lib/checksum.c b/lib/checksum.c index 7a647996..1155ad46 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -219,18 +219,6 @@ static void rm_digest_xxhash_steal(RmDigest *digest, guint8 *result) { static const RmDigestSpec xxhash_spec = { "xxhash", 64, rm_digest_xxhash_init, rm_digest_xxhash_free, rm_digest_xxhash_update, rm_digest_xxhash_copy, rm_digest_xxhash_steal}; -/////////////////////////// -// farmhash // -/////////////////////////// - -/* TODO: check that this is not broken, i.e. 
final hash is independent of increment size */ - -static void rm_digest_farmhash_update(RmDigest *digest, const unsigned char *data, RmOff size) { - uint128_t *hash = digest->state; - *hash = farmhash128_with_seed((const char*)data, size, *hash); -} - -static const RmDigestSpec farmhash_spec = { "farmhash", 64, GENERIC_FUNCS(farmhash)}; /////////////////////////// // murmur // @@ -614,7 +602,6 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { [RM_DIGEST_EXT] = &ext_spec, [RM_DIGEST_CUMULATIVE] = &cumulative_spec, [RM_DIGEST_PARANOID] = &paranoid_spec, - [RM_DIGEST_FARMHASH] = &farmhash_spec, [RM_DIGEST_XXHASH] = &xxhash_spec, [RM_DIGEST_HIGHWAY64] = &highway64_spec, [RM_DIGEST_HIGHWAY128] = &highway128_spec, @@ -676,7 +663,6 @@ int rm_digest_type_to_multihash_id(RmDigestType type) { static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, - [RM_DIGEST_EXT] = 12, [RM_DIGEST_FARMHASH] = 19, [RM_DIGEST_CUMULATIVE] = 13,[RM_DIGEST_PARANOID] = 14}; return ids[MIN(type, sizeof(ids) / sizeof(ids[0]))]; diff --git a/lib/checksum.h b/lib/checksum.h index 31390789..e50f2bb5 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -32,7 +32,6 @@ #include "checksums/blake2/blake2.h" #include "checksums/sha3/sha3.h" -#include "checksums/farmhash.h" #include "checksums/highwayhash.h" typedef enum RmDigestType { @@ -50,7 +49,6 @@ typedef enum RmDigestType { RM_DIGEST_BLAKE2SP /* Parallel version of BLAKE2P */, RM_DIGEST_BLAKE2BP /* Parallel version of BLAKE2S */, RM_DIGEST_XXHASH, - RM_DIGEST_FARMHASH, RM_DIGEST_HIGHWAY64, RM_DIGEST_HIGHWAY128, RM_DIGEST_HIGHWAY256, diff --git a/lib/checksums/farmhash.c b/lib/checksums/farmhash.c deleted file mode 100644 index cc26abb1..00000000 --- a/lib/checksums/farmhash.c +++ /dev/null @@ -1,1651 +0,0 @@ -// Copyright (c) 2014 Google, Inc.
-// -// Permission is hereby granted, free of charge, to any person obtaining a copy -// of this software and associated documentation files (the "Software"), to deal -// in the Software without restriction, including without limitation the rights -// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -// copies of the Software, and to permit persons to whom the Software is -// furnished to do so, subject to the following conditions: -// -// The above copyright notice and this permission notice shall be included in -// all copies or substantial portions of the Software. -// -// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -// THE SOFTWARE. -// -// FarmHash, by Geoff Pike - -#include "farmhash.h" - -#include - -#include - -// PLATFORM-SPECIFIC CONFIGURATION - -#if defined (__x86_64) || defined (__x86_64__) -#define x86_64 1 -#else -#define x86_64 0 -#endif - -#if defined(__i386__) || defined(__i386) || defined(__X86__) -#define x86 1 -#else -#define x86 x86_64 -#endif - -#if defined(__SSSE3__) -#include -#define CAN_USE_SSSE3 1 // Now we can use _mm_hsub_epi16 and so on. -#else -#define CAN_USE_SSSE3 0 -#endif - -#if defined(__SSE4_1__) -#include -#define CAN_USE_SSE41 1 // Now we can use _mm_insert_epi64 and so on. -#else -#define CAN_USE_SSE41 0 -#endif - -#if defined(__SSE4_2__) -#include -#define CAN_USE_SSE42 1 // Now we can use _mm_crc32_u{32,16,8}. And on 64-bit platforms, _mm_crc32_u64. 
-#else -#define CAN_USE_SSE42 0 -#endif - -#if defined(__AES__) -#include -#define CAN_USE_AESNI 1 // Now we can use _mm_aesimc_si128 and so on. -#else -#define CAN_USE_AESNI 0 -#endif - -#if defined(__AVX__) -#include -#define CAN_USE_AVX 1 -#else -#define CAN_USE_AVX 0 -#endif - -#define likely(x) (__builtin_expect(!!(x), 1)) - -#ifdef LITTLE_ENDIAN -#define uint32_t_in_expected_order(x) (x) -#define uint64_t_in_expected_order(x) (x) -#else -#define uint32_t_in_expected_order(x) (bswap32(x)) -#define uint64_t_in_expected_order(x) (bswap64(x)) -#endif - -#define PERMUTE3(a, b, c) \ - do { \ - swap32(a, b); \ - swap32(a, c); \ - } while (0) - -static inline uint32_t bswap32(const uint32_t x) { - uint32_t y = x; - - for (size_t i = 0; i < sizeof(uint32_t) >> 1; i++) { - - uint32_t d = sizeof(uint32_t) - i - 1; - - uint32_t mh = ((uint32_t)0xff) << (d << 3); - uint32_t ml = ((uint32_t)0xff) << (i << 3); - - uint32_t h = x & mh; - uint32_t l = x & ml; - - uint64_t t = (l << ((d - i) << 3)) | (h >> ((d - i) << 3)); - - y = t | (y & ~(mh | ml)); - } - - return y; -} - -static inline uint64_t bswap64(const uint64_t x) { - uint64_t y = x; - - for (size_t i = 0; i < sizeof(uint64_t) >> 1; i++) { - - uint64_t d = sizeof(uint64_t) - i - 1; - - uint64_t mh = ((uint64_t)0xff) << (d << 3); - uint64_t ml = ((uint64_t)0xff) << (i << 3); - - uint64_t h = x & mh; - uint64_t l = x & ml; - - uint64_t t = (l << ((d - i) << 3)) | (h >> ((d - i) << 3)); - - y = t | (y & ~(mh | ml)); - } - - return y; -} - -static inline uint64_t fetch64(const char* p) { - uint64_t result; - memcpy(&result, p, sizeof(result)); - - return uint64_t_in_expected_order(result); -} - -static inline uint32_t fetch32(const char* p) { - uint32_t result; - memcpy(&result, p, sizeof(result)); - - return uint32_t_in_expected_order(result); -} - -#if CAN_USE_SSSE3 || CAN_USE_SSE41 || CAN_USE_SSE42 || CAN_USE_AESNI || CAN_USE_AVX - -static inline __m128i fetch128(const char* s) { - return _mm_loadu_si128((const 
__m128i*) s); -} - -#endif - -static inline void swap32(uint32_t* a, uint32_t* b) { - uint32_t t; - - t = *a; - *a = *b; - *b = t; -} - -static inline void swap64(uint64_t* a, uint64_t* b) { - uint64_t t; - - t = *a; - *a = *b; - *b = t; -} - -#if CAN_USE_SSSE3 || CAN_USE_SSE41 || CAN_USE_SSE42 || CAN_USE_AESNI || CAN_USE_AVX - -static inline void swap128(__m128i* a, __m128i* b) { - __m128i t; - - t = *a; - *a = *b; - *b = t; -} - -#endif - -static inline uint32_t ror32(uint32_t val, size_t shift) { - // Avoid shifting by 32: doing so yields an undefined result. - return shift == 0 ? val : (val >> shift) | (val << (32 - shift)); -} - -static inline uint64_t ror64(uint64_t val, size_t shift) { - // Avoid shifting by 64: doing so yields an undefined result. - return shift == 0 ? val : (val >> shift) | (val << (64 - shift)); -} - -// Helpers for data-parallel operations (1x 128 bits or 2x64 or 4x32 or 8x16). - -#if CAN_USE_SSSE3 || CAN_USE_SSE41 || CAN_USE_SSE42 || CAN_USE_AESNI || CAN_USE_AVX - -static inline __m128i add64x2(__m128i x, __m128i y) { return _mm_add_epi64(x, y); } -static inline __m128i add32x4(__m128i x, __m128i y) { return _mm_add_epi32(x, y); } - -static inline __m128i xor128(__m128i x, __m128i y) { return _mm_xor_si128(x, y); } -static inline __m128i or128(__m128i x, __m128i y) { return _mm_or_si128(x, y); } - -static inline __m128i mul32x4_5(__m128i x) { return add32x4(x, _mm_slli_epi32(x, 2)); } - -static inline __m128i rol32x4(__m128i x, int c) { - return or128(_mm_slli_epi32(x, c), - _mm_srli_epi32(x, 32 - c)); -} - -static inline __m128i rol32x4_17(__m128i x) { return rol32x4(x, 17); } -static inline __m128i rol32x4_19(__m128i x) { return rol32x4(x, 19); } - -static inline __m128i shuf32x4_0_3_2_1(__m128i x) { - return _mm_shuffle_epi32(x, (0 << 6) + (3 << 4) + (2 << 2) + (1 << 0)); -} - -#endif - -#if CAN_USE_SSSE3 - -static inline __m128i shuf8x16(__m128i x, __m128i y) { return _mm_shuffle_epi8(y, x); } - -#endif - -#if CAN_USE_SSE41 - 
-static inline __m128i mul32x4(__m128i x, __m128i y) { return _mm_mullo_epi32(x, y); } - -static inline __m128i murk(__m128i a, __m128i b, __m128i c, __m128i d, __m128i e) { - - return add32x4(e, - mul32x4_5( - rol32x4_19( - xor128( - mul32x4(d, - rol32x4_17( - mul32x4(c, a))), - (b))))); -} - -#endif - -// Building blocks for hash functions - -// Some primes between 2^63 and 2^64 for various uses. -static const uint64_t k0 = 0xc3a5c85c97cb3127ULL; -static const uint64_t k1 = 0xb492b66fbe98f273ULL; -static const uint64_t k2 = 0x9ae16a3b2f90404fULL; - -// Magic numbers for 32-bit hashing. Copied from Murmur3. -static const uint32_t c1 = 0xcc9e2d51; -static const uint32_t c2 = 0x1b873593; - -// A 32-bit to 32-bit integer hash copied from Murmur3. -static inline uint32_t fmix(uint32_t h) { - h ^= h >> 16; - h *= 0x85ebca6b; - h ^= h >> 13; - h *= 0xc2b2ae35; - h ^= h >> 16; - return h; -} - -static inline uint64_t smix(uint64_t val) { - return val ^ (val >> 47); -} - -static inline uint32_t mur(uint32_t a, uint32_t h) { - // Helper from Murmur3 for combining two 32-bit values. - a *= c1; - a = ror32(a, 17); - a *= c2; - h ^= a; - h = ror32(h, 19); - return h * 5 + 0xe6546b64; -} - -static inline uint32_t debug_tweak32(uint32_t x) { -#ifndef NDEBUG - x = ~bswap32(x * c1); -#endif - - return x; -} - -static inline uint64_t debug_tweak64(uint64_t x) { -#ifndef NDEBUG - x = ~bswap64(x * k1); -#endif - - return x; -} - -uint128_t debug_tweak128(uint128_t x) { -#ifndef NDEBUG - uint64_t y = debug_tweak64(uint128_t_low64(x)); - uint64_t z = debug_tweak64(uint128_t_high64(x)); - y += z; - z += y; - x = make_uint128_t(y, z * k1); -#endif - - return x; -} - -static inline uint64_t farmhash_len_16(uint64_t u, uint64_t v) { - return farmhash128_to_64(make_uint128_t(u, v)); -} - -static inline uint64_t farmhash_len_16_mul(uint64_t u, uint64_t v, uint64_t mul) { - // Murmur-inspired hashing. 
- uint64_t a = (u ^ v) * mul; - a ^= (a >> 47); - uint64_t b = (v ^ a) * mul; - b ^= (b >> 47); - b *= mul; - return b; -} - -// farmhash na - -static inline uint64_t farmhash_na_len_0_to_16(const char *s, size_t len) { - if (len >= 8) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch64(s) + k2; - uint64_t b = fetch64(s + len - 8); - uint64_t c = ror64(b, 37) * mul + a; - uint64_t d = (ror64(a, 25) + b) * mul; - return farmhash_len_16_mul(c, d, mul); - } - if (len >= 4) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch32(s); - return farmhash_len_16_mul(len + (a << 3), fetch32(s + len - 4), mul); - } - if (len > 0) { - uint8_t a = s[0]; - uint8_t b = s[len >> 1]; - uint8_t c = s[len - 1]; - uint32_t y = (uint32_t) a + ((uint32_t) b << 8); - uint32_t z = len + ((uint32_t) c << 2); - return smix(y * k2 ^ z * k0) * k2; - } - return k2; -} - -// This probably works well for 16-byte strings as well, but it may be overkill -// in that case. -static inline uint64_t farmhash_na_len_17_to_32(const char *s, size_t len) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch64(s) * k1; - uint64_t b = fetch64(s + 8); - uint64_t c = fetch64(s + len - 8) * mul; - uint64_t d = fetch64(s + len - 16) * k2; - return farmhash_len_16_mul(ror64(a + b, 43) + ror64(c, 30) + d, - a + ror64(b + k2, 18) + c, mul); -} - -// Return a 16-byte hash for 48 bytes. Quick and dirty. -// Callers do best to use "random-looking" values for a and b. -static inline uint128_t weak_farmhash_na_len_32_with_seeds_vals( - uint64_t w, uint64_t x, uint64_t y, uint64_t z, uint64_t a, uint64_t b) { - a += w; - b = ror64(b + a + z, 21); - uint64_t c = a; - a += x; - a += y; - b += ror64(a, 44); - return make_uint128_t(a + z, b + c); -} - -// Return a 16-byte hash for s[0] ... s[31], a, and b. Quick and dirty. 
-static inline uint128_t weak_farmhash_na_len_32_with_seeds( - const char* s, uint64_t a, uint64_t b) { - return weak_farmhash_na_len_32_with_seeds_vals(fetch64(s), - fetch64(s + 8), - fetch64(s + 16), - fetch64(s + 24), - a, - b); -} - -// Return an 8-byte hash for 33 to 64 bytes. -static inline uint64_t farmhash_na_len_33_to_64(const char *s, size_t len) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch64(s) * k2; - uint64_t b = fetch64(s + 8); - uint64_t c = fetch64(s + len - 8) * mul; - uint64_t d = fetch64(s + len - 16) * k2; - uint64_t y = ror64(a + b, 43) + ror64(c, 30) + d; - uint64_t z = farmhash_len_16_mul(y, a + ror64(b + k2, 18) + c, mul); - uint64_t e = fetch64(s + 16) * mul; - uint64_t f = fetch64(s + 24); - uint64_t g = (y + fetch64(s + len - 32)) * mul; - uint64_t h = (z + fetch64(s + len - 24)) * mul; - return farmhash_len_16_mul(ror64(e + f, 43) + ror64(g, 30) + h, - e + ror64(f + a, 18) + g, mul); -} - -uint64_t farmhash64_na(const char *s, size_t len) { - const uint64_t seed = 81; - if (len <= 32) { - if (len <= 16) { - return farmhash_na_len_0_to_16(s, len); - } else { - return farmhash_na_len_17_to_32(s, len); - } - } else if (len <= 64) { - return farmhash_na_len_33_to_64(s, len); - } - - // For strings over 64 bytes we loop. Internal state consists of - // 56 bytes: v, w, x, y, and z. - uint64_t x = seed; - uint64_t y = seed * k1 + 113; - uint64_t z = smix(y * k2 + 113) * k2; - uint128_t v = make_uint128_t(0, 0); - uint128_t w = make_uint128_t(0, 0); - x = x * k2 + fetch64(s); - - // Set end so that after the loop we have 1 to 64 bytes left to process. 
- const char* end = s + ((len - 1) / 64) * 64; - const char* last64 = end + ((len - 1) & 63) - 63; - assert(s + len - 64 == last64); - do { - x = ror64(x + y + v.a + fetch64(s + 8), 37) * k1; - y = ror64(y + v.b + fetch64(s + 48), 42) * k1; - x ^= w.b; - y += v.a + fetch64(s + 40); - z = ror64(z + w.a, 33) * k1; - v = weak_farmhash_na_len_32_with_seeds(s, v.b * k1, x + w.a); - w = weak_farmhash_na_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); - swap64(&z, &x); - s += 64; - } while (s != end); - uint64_t mul = k1 + ((z & 0xff) << 1); - // Make s point to the last 64 bytes of input. - s = last64; - w.a += ((len - 1) & 63); - v.a += w.a; - w.a += v.a; - x = ror64(x + y + v.a + fetch64(s + 8), 37) * mul; - y = ror64(y + v.b + fetch64(s + 48), 42) * mul; - x ^= w.b * 9; - y += v.a * 9 + fetch64(s + 40); - z = ror64(z + w.a, 33) * mul; - v = weak_farmhash_na_len_32_with_seeds(s, v.b * mul, x + w.a); - w = weak_farmhash_na_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); - swap64(&z, &x); - return farmhash_len_16_mul(farmhash_len_16_mul(v.a, w.a, mul) + smix(y) * k0 + z, - farmhash_len_16_mul(v.b, w.b, mul) + x, - mul); -} - -uint64_t farmhash64_na_with_seeds(const char *s, size_t len, uint64_t seed0, uint64_t seed1) { - return farmhash_len_16(farmhash64_na(s, len) - seed0, seed1); -} - -uint64_t farmhash64_na_with_seed(const char *s, size_t len, uint64_t seed) { - return farmhash64_na_with_seeds(s, len, k2, seed); -} - -// farmhash uo - -static inline uint64_t farmhash_uo_h(uint64_t x, uint64_t y, uint64_t mul, int r) { - uint64_t a = (x ^ y) * mul; - a ^= (a >> 47); - uint64_t b = (y ^ a) * mul; - return ror64(b, r) * mul; -} - -uint64_t farmhash64_uo_with_seeds(const char *s, size_t len, - uint64_t seed0, uint64_t seed1) { - if (len <= 64) { - return farmhash64_na_with_seeds(s, len, seed0, seed1); - } - - // For strings over 64 bytes we loop. Internal state consists of - // 64 bytes: u, v, w, x, y, and z. 
- uint64_t x = seed0; - uint64_t y = seed1 * k2 + 113; - uint64_t z = smix(y * k2) * k2; - uint128_t v = make_uint128_t(seed0, seed1); - uint128_t w = make_uint128_t(0, 0); - uint64_t u = x - z; - x *= k2; - uint64_t mul = k2 + (u & 0x82); - - // Set end so that after the loop we have 1 to 64 bytes left to process. - const char* end = s + ((len - 1) / 64) * 64; - const char* last64 = end + ((len - 1) & 63) - 63; - assert(s + len - 64 == last64); - do { - uint64_t a0 = fetch64(s); - uint64_t a1 = fetch64(s + 8); - uint64_t a2 = fetch64(s + 16); - uint64_t a3 = fetch64(s + 24); - uint64_t a4 = fetch64(s + 32); - uint64_t a5 = fetch64(s + 40); - uint64_t a6 = fetch64(s + 48); - uint64_t a7 = fetch64(s + 56); - x += a0 + a1; - y += a2; - z += a3; - v.a += a4; - v.b += a5 + a1; - w.a += a6; - w.b += a7; - - x = ror64(x, 26); - x *= 9; - y = ror64(y, 29); - z *= mul; - v.a = ror64(v.a, 33); - v.b = ror64(v.b, 30); - w.a ^= x; - w.a *= 9; - z = ror64(z, 32); - z += w.b; - w.b += z; - z *= 9; - swap64(&u, &y); - - z += a0 + a6; - v.a += a2; - v.b += a3; - w.a += a4; - w.b += a5 + a6; - x += a1; - y += a7; - - y += v.a; - v.a += x - y; - v.b += w.a; - w.a += v.b; - w.b += x - y; - x += w.b; - w.b = ror64(w.b, 34); - swap64(&u, &z); - s += 64; - } while (s != end); - // Make s point to the last 64 bytes of input. 
- s = last64; - u *= 9; - v.b = ror64(v.b, 28); - v.a = ror64(v.a, 20); - w.a += ((len - 1) & 63); - u += y; - y += u; - x = ror64(y - x + v.a + fetch64(s + 8), 37) * mul; - y = ror64(y ^ v.b ^ fetch64(s + 48), 42) * mul; - x ^= w.b * 9; - y += v.a + fetch64(s + 40); - z = ror64(z + w.a, 33) * mul; - v = weak_farmhash_na_len_32_with_seeds(s, v.b * mul, x + w.a); - w = weak_farmhash_na_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); - return farmhash_uo_h(farmhash_len_16_mul(v.a + x, w.a ^ y, mul) + z - u, - farmhash_uo_h(v.b + y, w.b + z, k2, 30) ^ x, - k2, - 31); -} - -uint64_t farmhash64_uo_with_seed(const char *s, size_t len, uint64_t seed) { - return len <= 64 ? farmhash64_na_with_seed(s, len, seed) : - farmhash64_uo_with_seeds(s, len, 0, seed); -} - -uint64_t farmhash64_uo(const char *s, size_t len) { - return len <= 64 ? farmhash64_na(s, len) : - farmhash64_uo_with_seeds(s, len, 81, 0); -} - -// farmhash xo - -static inline uint64_t farmhash_xo_h32(const char *s, size_t len, uint64_t mul, - uint64_t seed0, uint64_t seed1) { - uint64_t a = fetch64(s) * k1; - uint64_t b = fetch64(s + 8); - uint64_t c = fetch64(s + len - 8) * mul; - uint64_t d = fetch64(s + len - 16) * k2; - uint64_t u = ror64(a + b, 43) + ror64(c, 30) + d + seed0; - uint64_t v = a + ror64(b + k2, 18) + c + seed1; - a = smix((u ^ v) * mul); - b = smix((v ^ a) * mul); - return b; -} - -// Return an 8-byte hash for 33 to 64 bytes. -static inline uint64_t farmhash_xo_len_33_to_64(const char *s, size_t len) { - uint64_t mul0 = k2 - 30; - uint64_t mul1 = k2 - 30 + 2 * len; - uint64_t h0 = farmhash_xo_h32(s, 32, mul0, 0, 0); - uint64_t h1 = farmhash_xo_h32(s + len - 32, 32, mul1, 0, 0); - return ((h1 * mul1) + h0) * mul1; -} - -// Return an 8-byte hash for 65 to 96 bytes. 
-static inline uint64_t farmhash_xo_len_65_to_96(const char *s, size_t len) { - uint64_t mul0 = k2 - 114; - uint64_t mul1 = k2 - 114 + 2 * len; - uint64_t h0 = farmhash_xo_h32(s, 32, mul0, 0, 0); - uint64_t h1 = farmhash_xo_h32(s + 32, 32, mul1, 0, 0); - uint64_t h2 = farmhash_xo_h32(s + len - 32, 32, mul1, h0, h1); - return (h2 * 9 + (h0 >> 17) + (h1 >> 21)) * mul1; -} - -uint64_t farmhash64_xo(const char *s, size_t len) { - if (len <= 32) { - if (len <= 16) { - return farmhash_na_len_0_to_16(s, len); - } else { - return farmhash_na_len_17_to_32(s, len); - } - } else if (len <= 64) { - return farmhash_xo_len_33_to_64(s, len); - } else if (len <= 96) { - return farmhash_xo_len_65_to_96(s, len); - } else if (len <= 256) { - return farmhash64_na(s, len); - } else { - return farmhash64_uo(s, len); - } -} - -uint64_t farmhash64_xo_with_seeds(const char *s, size_t len, uint64_t seed0, uint64_t seed1) { - return farmhash64_uo_with_seeds(s, len, seed0, seed1); -} - -uint64_t farmhash64_xo_with_seed(const char *s, size_t len, uint64_t seed) { - return farmhash64_uo_with_seed(s, len, seed); -} - -// farmhash te - -#if x86_64 && CAN_USE_SSSE3 && CAN_USE_SSE41 - -// Requires n >= 256. Requires SSE4.1. Should be slightly faster if the -// compiler uses AVX instructions (e.g., use the -mavx flag with GCC). 
-static inline uint64_t farmhash64_te_long(const char* s, size_t n, - uint64_t seed0, uint64_t seed1) { - const __m128i k_shuf = - _mm_set_epi8(4, 11, 10, 5, 8, 15, 6, 9, 12, 2, 14, 13, 0, 7, 3, 1); - const __m128i k_mult = - _mm_set_epi8(0xbd, 0xd6, 0x33, 0x39, 0x45, 0x54, 0xfa, 0x03, - 0x34, 0x3e, 0x33, 0xed, 0xcc, 0x9e, 0x2d, 0x51); - uint64_t seed2 = (seed0 + 113) * (seed1 + 9); - uint64_t seed3 = (ror64(seed0, 23) + 27) * (ror64(seed1, 30) + 111); - __m128i d0 = _mm_cvtsi64_si128(seed0); - __m128i d1 = _mm_cvtsi64_si128(seed1); - __m128i d2 = shuf8x16(k_shuf, d0); - __m128i d3 = shuf8x16(k_shuf, d1); - __m128i d4 = xor128(d0, d1); - __m128i d5 = xor128(d1, d2); - __m128i d6 = xor128(d2, d4); - __m128i d7 = _mm_set1_epi32(seed2 >> 32); - __m128i d8 = mul32x4(k_mult, d2); - __m128i d9 = _mm_set1_epi32(seed3 >> 32); - __m128i d10 = _mm_set1_epi32(seed3); - __m128i d11 = add64x2(d2, _mm_set1_epi32(seed2)); - const char* end = s + (n & ~((size_t) 255)); - do { - __m128i z; - z = fetch128(s); - d0 = add64x2(d0, z); - d1 = shuf8x16(k_shuf, d1); - d2 = xor128(d2, d0); - d4 = xor128(d4, z); - d4 = xor128(d4, d1); - swap128(&d0, &d6); - z = fetch128(s + 16); - d5 = add64x2(d5, z); - d6 = shuf8x16(k_shuf, d6); - d8 = shuf8x16(k_shuf, d8); - d7 = xor128(d7, d5); - d0 = xor128(d0, z); - d0 = xor128(d0, d6); - swap128(&d5, &d11); - z = fetch128(s + 32); - d1 = add64x2(d1, z); - d2 = shuf8x16(k_shuf, d2); - d4 = shuf8x16(k_shuf, d4); - d5 = xor128(d5, z); - d5 = xor128(d5, d2); - swap128(&d10, &d4); - z = fetch128(s + 48); - d6 = add64x2(d6, z); - d7 = shuf8x16(k_shuf, d7); - d0 = shuf8x16(k_shuf, d0); - d8 = xor128(d8, d6); - d1 = xor128(d1, z); - d1 = add64x2(d1, d7); - z = fetch128(s + 64); - d2 = add64x2(d2, z); - d5 = shuf8x16(k_shuf, d5); - d4 = add64x2(d4, d2); - d6 = xor128(d6, z); - d6 = xor128(d6, d11); - swap128(&d8, &d2); - z = fetch128(s + 80); - d7 = xor128(d7, z); - d8 = shuf8x16(k_shuf, d8); - d1 = shuf8x16(k_shuf, d1); - d0 = add64x2(d0, d7); - d2 = 
add64x2(d2, z); - d2 = add64x2(d2, d8); - swap128(&d1, &d7); - z = fetch128(s + 96); - d4 = shuf8x16(k_shuf, d4); - d6 = shuf8x16(k_shuf, d6); - d8 = mul32x4(k_mult, d8); - d5 = xor128(d5, d11); - d7 = xor128(d7, z); - d7 = add64x2(d7, d4); - swap128(&d6, &d0); - z = fetch128(s + 112); - d8 = add64x2(d8, z); - d0 = shuf8x16(k_shuf, d0); - d2 = shuf8x16(k_shuf, d2); - d1 = xor128(d1, d8); - d10 = xor128(d10, z); - d10 = xor128(d10, d0); - swap128(&d11, &d5); - z = fetch128(s + 128); - d4 = add64x2(d4, z); - d5 = shuf8x16(k_shuf, d5); - d7 = shuf8x16(k_shuf, d7); - d6 = add64x2(d6, d4); - d8 = xor128(d8, z); - d8 = xor128(d8, d5); - swap128(&d4, &d10); - z = fetch128(s + 144); - d0 = add64x2(d0, z); - d1 = shuf8x16(k_shuf, d1); - d2 = add64x2(d2, d0); - d4 = xor128(d4, z); - d4 = xor128(d4, d1); - z = fetch128(s + 160); - d5 = add64x2(d5, z); - d6 = shuf8x16(k_shuf, d6); - d8 = shuf8x16(k_shuf, d8); - d7 = xor128(d7, d5); - d0 = xor128(d0, z); - d0 = xor128(d0, d6); - swap128(&d2, &d8); - z = fetch128(s + 176); - d1 = add64x2(d1, z); - d2 = shuf8x16(k_shuf, d2); - d4 = shuf8x16(k_shuf, d4); - d5 = mul32x4(k_mult, d5); - d5 = xor128(d5, z); - d5 = xor128(d5, d2); - swap128(&d7, &d1); - z = fetch128(s + 192); - d6 = add64x2(d6, z); - d7 = shuf8x16(k_shuf, d7); - d0 = shuf8x16(k_shuf, d0); - d8 = add64x2(d8, d6); - d1 = xor128(d1, z); - d1 = xor128(d1, d7); - swap128(&d0, &d6); - z = fetch128(s + 208); - d2 = add64x2(d2, z); - d5 = shuf8x16(k_shuf, d5); - d4 = xor128(d4, d2); - d6 = xor128(d6, z); - d6 = xor128(d6, d9); - swap128(&d5, &d11); - z = fetch128(s + 224); - d7 = add64x2(d7, z); - d8 = shuf8x16(k_shuf, d8); - d1 = shuf8x16(k_shuf, d1); - d0 = xor128(d0, d7); - d2 = xor128(d2, z); - d2 = xor128(d2, d8); - swap128(&d10, &d4); - z = fetch128(s + 240); - d3 = add64x2(d3, z); - d4 = shuf8x16(k_shuf, d4); - d6 = shuf8x16(k_shuf, d6); - d7 = mul32x4(k_mult, d7); - d5 = add64x2(d5, d3); - d7 = xor128(d7, z); - d7 = xor128(d7, d4); - swap128(&d3, &d9); - s += 256; - } 
while (s != end); - d6 = add64x2(mul32x4(k_mult, d6), _mm_cvtsi64_si128(n)); - if (n % 256 != 0) { - d7 = add64x2(_mm_shuffle_epi32(d8, (0 << 6) + (3 << 4) + (2 << 2) + (1 << 0)), d7); - d8 = add64x2(mul32x4(k_mult, d8), _mm_cvtsi64_si128(farmhash64_xo(s, n % 256))); - } - __m128i t[8]; - d0 = mul32x4(k_mult, shuf8x16(k_shuf, mul32x4(k_mult, d0))); - d3 = mul32x4(k_mult, shuf8x16(k_shuf, mul32x4(k_mult, d3))); - d9 = mul32x4(k_mult, shuf8x16(k_shuf, mul32x4(k_mult, d9))); - d1 = mul32x4(k_mult, shuf8x16(k_shuf, mul32x4(k_mult, d1))); - d0 = add64x2(d11, d0); - d3 = xor128(d7, d3); - d9 = add64x2(d8, d9); - d1 = add64x2(d10, d1); - d4 = add64x2(d3, d4); - d5 = add64x2(d9, d5); - d6 = xor128(d1, d6); - d2 = add64x2(d0, d2); - t[0] = d0; - t[1] = d3; - t[2] = d9; - t[3] = d1; - t[4] = d4; - t[5] = d5; - t[6] = d6; - t[7] = d2; - return farmhash64_xo((const char*) t, sizeof(t)); -} - -uint64_t farmhash64_te(const char *s, size_t len) { - // Empirically, farmhash xo seems faster until length 512. - return len >= 512 ? farmhash64_te_long(s, len, k2, k1) : farmhash64_xo(s, len); -} - -uint64_t farmhash64_te_with_seed(const char *s, size_t len, uint64_t seed) { - return len >= 512 ? farmhash64_te_long(s, len, k1, seed) : - farmhash64_xo_with_seed(s, len, seed); -} - -uint64_t farmhash64_te_with_seeds(const char *s, size_t len, uint64_t seed0, uint64_t seed1) { - return len >= 512 ? 
farmhash64_te_long(s, len, seed0, seed1) : - farmhash64_xo_with_seeds(s, len, seed0, seed1); -} - -#endif - -// farmhash nt - -#if x86_64 && CAN_USE_SSE41 - -uint32_t farmhash32_nt(const char *s, size_t len) { - return (uint32_t) farmhash64_te(s, len); -} - -uint32_t farmhash32_nt_with_seed(const char *s, size_t len, uint32_t seed) { - return (uint32_t) farmhash64_te_with_seed(s, len, seed); -} - -#endif - -// farmhash mk - -static inline uint32_t farmhash32_mk_len_13_to_24(const char *s, size_t len, uint32_t seed) { - uint32_t a = fetch32(s - 4 + (len >> 1)); - uint32_t b = fetch32(s + 4); - uint32_t c = fetch32(s + len - 8); - uint32_t d = fetch32(s + (len >> 1)); - uint32_t e = fetch32(s); - uint32_t f = fetch32(s + len - 4); - uint32_t h = d * c1 + len + seed; - a = ror32(a, 12) + f; - h = mur(c, h) + a; - a = ror32(a, 3) + c; - h = mur(e, h) + a; - a = ror32(a + f, 12) + d; - h = mur(b ^ seed, h) + a; - return fmix(h); -} - -static inline uint32_t farmhash32_mk_len_0_to_4(const char *s, size_t len, uint32_t seed) { - uint32_t b = seed; - uint32_t c = 9; - for (size_t i = 0; i < len; i++) { - signed char v = s[i]; - b = b * c1 + v; - c ^= b; - } - return fmix(mur(b, mur(len, c))); -} - -static inline uint32_t farmhash32_mk_len_5_to_12(const char *s, size_t len, uint32_t seed) { - uint32_t a = len, b = len * 5, c = 9, d = b + seed; - a += fetch32(s); - b += fetch32(s + len - 4); - c += fetch32(s + ((len >> 1) & 4)); - return fmix(seed ^ mur(c, mur(b, mur(a, d)))); -} - -uint32_t farmhash32_mk(const char *s, size_t len) { - if (len <= 24) { - return len <= 12 ? - (len <= 4 ? 
farmhash32_mk_len_0_to_4(s, len, 0) : farmhash32_mk_len_5_to_12(s, len, 0)) : - farmhash32_mk_len_13_to_24(s, len, 0); - } - - // len > 24 - uint32_t h = len, g = c1 * len, f = g; - uint32_t a0 = ror32(fetch32(s + len - 4) * c1, 17) * c2; - uint32_t a1 = ror32(fetch32(s + len - 8) * c1, 17) * c2; - uint32_t a2 = ror32(fetch32(s + len - 16) * c1, 17) * c2; - uint32_t a3 = ror32(fetch32(s + len - 12) * c1, 17) * c2; - uint32_t a4 = ror32(fetch32(s + len - 20) * c1, 17) * c2; - h ^= a0; - h = ror32(h, 19); - h = h * 5 + 0xe6546b64; - h ^= a2; - h = ror32(h, 19); - h = h * 5 + 0xe6546b64; - g ^= a1; - g = ror32(g, 19); - g = g * 5 + 0xe6546b64; - g ^= a3; - g = ror32(g, 19); - g = g * 5 + 0xe6546b64; - f += a4; - f = ror32(f, 19) + 113; - size_t iters = (len - 1) / 20; - do { - uint32_t a = fetch32(s); - uint32_t b = fetch32(s + 4); - uint32_t c = fetch32(s + 8); - uint32_t d = fetch32(s + 12); - uint32_t e = fetch32(s + 16); - h += a; - g += b; - f += c; - h = mur(d, h) + e; - g = mur(c, g) + a; - f = mur(b + e * c1, f) + d; - f += g; - g += f; - s += 20; - } while (--iters != 0); - g = ror32(g, 11) * c1; - g = ror32(g, 17) * c1; - f = ror32(f, 11) * c1; - f = ror32(f, 17) * c1; - h = ror32(h + g, 19); - h = h * 5 + 0xe6546b64; - h = ror32(h, 17) * c1; - h = ror32(h + f, 19); - h = h * 5 + 0xe6546b64; - h = ror32(h, 17) * c1; - return h; -} - -uint32_t farmhash32_mk_with_seed(const char *s, size_t len, uint32_t seed) { - if (len <= 24) { - if (len >= 13) return farmhash32_mk_len_13_to_24(s, len, seed * c1); - else if (len >= 5) return farmhash32_mk_len_5_to_12(s, len, seed); - else return farmhash32_mk_len_0_to_4(s, len, seed); - } - uint32_t h = farmhash32_mk_len_13_to_24(s, 24, seed ^ len); - return mur(farmhash32_mk(s + 24, len - 24) + seed, h); -} - -// farmhash su - -#if CAN_USE_SSE41 && CAN_USE_SSE42 && CAN_USE_AESNI - -uint32_t farmhash32_su(const char *s, size_t len) { - const uint32_t seed = 81; - if (len <= 24) { - return len <= 12 ? - (len <= 4 ? 
- farmhash32_mk_len_0_to_4(s, len, 0) : - farmhash32_mk_len_5_to_12(s, len, 0)) : - farmhash32_mk_len_13_to_24(s, len, 0); - } - - if (len < 40) { - uint32_t a = len, b = seed * c2, c = a + b; - a += fetch32(s + len - 4); - b += fetch32(s + len - 20); - c += fetch32(s + len - 16); - uint32_t d = a; - a = ror32(a, 21); - a = mur(a, mur(b, _mm_crc32_u32(c, d))); - a += fetch32(s + len - 12); - b += fetch32(s + len - 8); - d += a; - a += d; - b = mur(b, d) * c2; - a = _mm_crc32_u32(a, b + c); - return farmhash32_mk_len_13_to_24(s, (len + 1) / 2, a) + b; - } - - const __m128i cc1 = _mm_set1_epi32(c1); - const __m128i cc2 = _mm_set1_epi32(c2); - __m128i h = _mm_set1_epi32(seed); - __m128i g = _mm_set1_epi32(c1 * seed); - __m128i f = g; - __m128i k = _mm_set1_epi32(0xe6546b64); - __m128i q; - if (len < 80) { - __m128i a = fetch128(s); - __m128i b = fetch128(s + 16); - __m128i c = fetch128(s + (len - 15) / 2); - __m128i d = fetch128(s + len - 32); - __m128i e = fetch128(s + len - 16); - h = add32x4(h, a); - g = add32x4(g, b); - q = g; - g = shuf32x4_0_3_2_1(g); - f = add32x4(f, c); - __m128i be = add32x4(b, mul32x4(e, cc1)); - h = add32x4(h, f); - f = add32x4(f, h); - h = add32x4(murk(d, h, cc1, cc2, k), e); - k = xor128(k, _mm_shuffle_epi8(g, f)); - g = add32x4(xor128(c, g), a); - f = add32x4(xor128(be, f), d); - k = add32x4(k, be); - k = add32x4(k, _mm_shuffle_epi8(f, h)); - f = add32x4(f, g); - g = add32x4(g, f); - g = add32x4(_mm_set1_epi32(len), mul32x4(g, cc1)); - } else { - // len >= 80 - // The following is loosely modelled after farmhash32_mk. 
- size_t iters = (len - 1) / 80; - len -= iters * 80; - -#define CHUNK_AES() do { \ - __m128i a = fetch128(s); \ - __m128i b = fetch128(s + 16); \ - __m128i c = fetch128(s + 32); \ - __m128i d = fetch128(s + 48); \ - __m128i e = fetch128(s + 64); \ - h = add32x4(h, a); \ - g = add32x4(g, b); \ - g = shuf32x4_0_3_2_1(g); \ - f = add32x4(f, c); \ - __m128i be = add32x4(b, mul32x4(e, cc1)); \ - h = add32x4(h, f); \ - f = add32x4(f, h); \ - h = add32x4(h, d); \ - q = add32x4(q, e); \ - h = rol32x4_17(h); \ - h = mul32x4(h, cc1); \ - k = xor128(k, _mm_shuffle_epi8(g, f)); \ - g = add32x4(xor128(c, g), a); \ - f = add32x4(xor128(be, f), d); \ - swap128(&f, &q); \ - q = _mm_aesimc_si128(q); \ - k = add32x4(k, be); \ - k = add32x4(k, _mm_shuffle_epi8(f, h)); \ - f = add32x4(f, g); \ - g = add32x4(g, f); \ - f = mul32x4(f, cc1); \ -} while (0) - - q = g; - while (iters-- != 0) { - CHUNK_AES(); - s += 80; - } - - if (len != 0) { - h = add32x4(h, _mm_set1_epi32(len)); - s = s + len - 80; - CHUNK_AES(); - } - } - - g = shuf32x4_0_3_2_1(g); - k = xor128(k, g); - k = xor128(k, q); - h = xor128(h, q); - f = mul32x4(f, cc1); - k = mul32x4(k, cc2); - g = mul32x4(g, cc1); - h = mul32x4(h, cc2); - k = add32x4(k, _mm_shuffle_epi8(g, f)); - h = add32x4(h, f); - f = add32x4(f, h); - g = add32x4(g, k); - k = add32x4(k, g); - k = xor128(k, _mm_shuffle_epi8(f, h)); - __m128i buf[4]; - buf[0] = f; - buf[1] = g; - buf[2] = k; - buf[3] = h; - s = (char*) buf; - uint32_t x = fetch32(s); - uint32_t y = fetch32(s+4); - uint32_t z = fetch32(s+8); - x = _mm_crc32_u32(x, fetch32(s+12)); - y = _mm_crc32_u32(y, fetch32(s+16)); - z = _mm_crc32_u32(z * c1, fetch32(s+20)); - x = _mm_crc32_u32(x, fetch32(s+24)); - y = _mm_crc32_u32(y * c1, fetch32(s+28)); - uint32_t o = y; - z = _mm_crc32_u32(z, fetch32(s+32)); - x = _mm_crc32_u32(x * c1, fetch32(s+36)); - y = _mm_crc32_u32(y, fetch32(s+40)); - z = _mm_crc32_u32(z * c1, fetch32(s+44)); - x = _mm_crc32_u32(x, fetch32(s+48)); - y = _mm_crc32_u32(y * c1, 
fetch32(s+52)); - z = _mm_crc32_u32(z, fetch32(s+56)); - x = _mm_crc32_u32(x, fetch32(s+60)); - return (o - x + y - z) * c1; -} - -uint32_t farmhash32_su_with_seed(const char *s, size_t len, uint32_t seed) { - if (len <= 24) { - if (len >= 13) return farmhash32_mk_len_13_to_24(s, len, seed * c1); - else if (len >= 5) return farmhash32_mk_len_5_to_12(s, len, seed); - else return farmhash32_mk_len_0_to_4(s, len, seed); - } - uint32_t h = farmhash32_mk_len_13_to_24(s, 24, seed ^ len); - return _mm_crc32_u32(farmhash32_su(s + 24, len - 24) + seed, h); -} - -#endif - -// farmhash sa - -#if CAN_USE_SSSE3 && CAN_USE_SSE41 && CAN_USE_SSE42 - -uint32_t farmhash32_sa(const char *s, size_t len) { - const uint32_t seed = 81; - if (len <= 24) { - return len <= 12 ? - (len <= 4 ? - farmhash32_mk_len_0_to_4(s, len, 0) : - farmhash32_mk_len_5_to_12(s, len, 0)) : - farmhash32_mk_len_13_to_24(s, len, 0); - } - - if (len < 40) { - uint32_t a = len, b = seed * c2, c = a + b; - a += fetch32(s + len - 4); - b += fetch32(s + len - 20); - c += fetch32(s + len - 16); - uint32_t d = a; - a = ror32(a, 21); - a = mur(a, mur(b, mur(c, d))); - a += fetch32(s + len - 12); - b += fetch32(s + len - 8); - d += a; - a += d; - b = mur(b, d) * c2; - a = _mm_crc32_u32(a, b + c); - return farmhash32_mk_len_13_to_24(s, (len + 1) / 2, a) + b; - } - - const __m128i cc1 = _mm_set1_epi32(c1); - const __m128i cc2 = _mm_set1_epi32(c2); - __m128i h = _mm_set1_epi32(seed); - __m128i g = _mm_set1_epi32(c1 * seed); - __m128i f = g; - __m128i k = _mm_set1_epi32(0xe6546b64); - if (len < 80) { - __m128i a = fetch128(s); - __m128i b = fetch128(s + 16); - __m128i c = fetch128(s + (len - 15) / 2); - __m128i d = fetch128(s + len - 32); - __m128i e = fetch128(s + len - 16); - h = add32x4(h, a); - g = add32x4(g, b); - g = shuf32x4_0_3_2_1(g); - f = add32x4(f, c); - __m128i be = add32x4(b, mul32x4(e, cc1)); - h = add32x4(h, f); - f = add32x4(f, h); - h = add32x4(murk(d, h, cc1, cc2, k), e); - k = xor128(k, 
_mm_shuffle_epi8(g, f)); - g = add32x4(xor128(c, g), a); - f = add32x4(xor128(be, f), d); - k = add32x4(k, be); - k = add32x4(k, _mm_shuffle_epi8(f, h)); - f = add32x4(f, g); - g = add32x4(g, f); - g = add32x4(_mm_set1_epi32(len), mul32x4(g, cc1)); - } else { - // len >= 80 - // The following is loosely modelled after farmhash32_mk. - size_t iters = (len - 1) / 80; - len -= iters * 80; - -#define CHUNK() do { \ - __m128i a = fetch128(s); \ - __m128i b = fetch128(s + 16); \ - __m128i c = fetch128(s + 32); \ - __m128i d = fetch128(s + 48); \ - __m128i e = fetch128(s + 64); \ - h = add32x4(h, a); \ - g = add32x4(g, b); \ - g = shuf32x4_0_3_2_1(g); \ - f = add32x4(f, c); \ - __m128i be = add32x4(b, mul32x4(e, cc1)); \ - h = add32x4(h, f); \ - f = add32x4(f, h); \ - h = add32x4(murk(d, h, cc1, cc2, k), e); \ - k = xor128(k, _mm_shuffle_epi8(g, f)); \ - g = add32x4(xor128(c, g), a); \ - f = add32x4(xor128(be, f), d); \ - k = add32x4(k, be); \ - k = add32x4(k, _mm_shuffle_epi8(f, h)); \ - f = add32x4(f, g); \ - g = add32x4(g, f); \ - f = mul32x4(f, cc1); \ -} while (0) - - while (iters-- != 0) { - CHUNK(); - s += 80; - } - - if (len != 0) { - h = add32x4(h, _mm_set1_epi32(len)); - s = s + len - 80; - CHUNK(); - } - } - - g = shuf32x4_0_3_2_1(g); - k = xor128(k, g); - f = mul32x4(f, cc1); - k = mul32x4(k, cc2); - g = mul32x4(g, cc1); - h = mul32x4(h, cc2); - k = add32x4(k, _mm_shuffle_epi8(g, f)); - h = add32x4(h, f); - f = add32x4(f, h); - g = add32x4(g, k); - k = add32x4(k, g); - k = xor128(k, _mm_shuffle_epi8(f, h)); - __m128i buf[4]; - buf[0] = f; - buf[1] = g; - buf[2] = k; - buf[3] = h; - s = (char*) buf; - uint32_t x = fetch32(s); - uint32_t y = fetch32(s+4); - uint32_t z = fetch32(s+8); - x = _mm_crc32_u32(x, fetch32(s+12)); - y = _mm_crc32_u32(y, fetch32(s+16)); - z = _mm_crc32_u32(z * c1, fetch32(s+20)); - x = _mm_crc32_u32(x, fetch32(s+24)); - y = _mm_crc32_u32(y * c1, fetch32(s+28)); - uint32_t o = y; - z = _mm_crc32_u32(z, fetch32(s+32)); - x = _mm_crc32_u32(x 
* c1, fetch32(s+36)); - y = _mm_crc32_u32(y, fetch32(s+40)); - z = _mm_crc32_u32(z * c1, fetch32(s+44)); - x = _mm_crc32_u32(x, fetch32(s+48)); - y = _mm_crc32_u32(y * c1, fetch32(s+52)); - z = _mm_crc32_u32(z, fetch32(s+56)); - x = _mm_crc32_u32(x, fetch32(s+60)); - return (o - x + y - z) * c1; -} - -uint32_t farmhash32_sa_with_seed(const char *s, size_t len, uint32_t seed) { - if (len <= 24) { - if (len >= 13) return farmhash32_mk_len_13_to_24(s, len, seed * c1); - else if (len >= 5) return farmhash32_mk_len_5_to_12(s, len, seed); - else return farmhash32_mk_len_0_to_4(s, len, seed); - } - uint32_t h = farmhash32_mk_len_13_to_24(s, 24, seed ^ len); - return _mm_crc32_u32(farmhash32_sa(s + 24, len - 24) + seed, h); -} - -#endif - -// farmhash cc - -// This file provides a 32-bit hash equivalent to cityhash32 (v1.1.1) -// and a 128-bit hash equivalent to cityhash128 (v1.1.1). It also provides -// a seeded 32-bit hash function similar to cityhash32. - -static inline uint32_t farmhash32_cc_len_13_to_24(const char *s, size_t len) { - uint32_t a = fetch32(s - 4 + (len >> 1)); - uint32_t b = fetch32(s + 4); - uint32_t c = fetch32(s + len - 8); - uint32_t d = fetch32(s + (len >> 1)); - uint32_t e = fetch32(s); - uint32_t f = fetch32(s + len - 4); - uint32_t h = len; - - return fmix(mur(f, mur(e, mur(d, mur(c, mur(b, mur(a, h))))))); -} - -static inline uint32_t farmhash32_cc_len_0_to_4(const char *s, size_t len) { - uint32_t b = 0; - uint32_t c = 9; - for (size_t i = 0; i < len; i++) { - signed char v = s[i]; - b = b * c1 + v; - c ^= b; - } - return fmix(mur(b, mur(len, c))); -} - -static inline uint32_t farmhash32_cc_len_5_to_12(const char *s, size_t len) { - uint32_t a = len, b = len * 5, c = 9, d = b; - a += fetch32(s); - b += fetch32(s + len - 4); - c += fetch32(s + ((len >> 1) & 4)); - return fmix(mur(c, mur(b, mur(a, d)))); -} - -uint32_t farmhash32_cc(const char *s, size_t len) { - if (len <= 24) { - return len <= 12 ? - (len <= 4 ? 
farmhash32_cc_len_0_to_4(s, len) : farmhash32_cc_len_5_to_12(s, len)) : - farmhash32_cc_len_13_to_24(s, len); - } - - // len > 24 - uint32_t h = len, g = c1 * len, f = g; - uint32_t a0 = ror32(fetch32(s + len - 4) * c1, 17) * c2; - uint32_t a1 = ror32(fetch32(s + len - 8) * c1, 17) * c2; - uint32_t a2 = ror32(fetch32(s + len - 16) * c1, 17) * c2; - uint32_t a3 = ror32(fetch32(s + len - 12) * c1, 17) * c2; - uint32_t a4 = ror32(fetch32(s + len - 20) * c1, 17) * c2; - h ^= a0; - h = ror32(h, 19); - h = h * 5 + 0xe6546b64; - h ^= a2; - h = ror32(h, 19); - h = h * 5 + 0xe6546b64; - g ^= a1; - g = ror32(g, 19); - g = g * 5 + 0xe6546b64; - g ^= a3; - g = ror32(g, 19); - g = g * 5 + 0xe6546b64; - f += a4; - f = ror32(f, 19); - f = f * 5 + 0xe6546b64; - size_t iters = (len - 1) / 20; - do { - uint32_t a0 = ror32(fetch32(s) * c1, 17) * c2; - uint32_t a1 = fetch32(s + 4); - uint32_t a2 = ror32(fetch32(s + 8) * c1, 17) * c2; - uint32_t a3 = ror32(fetch32(s + 12) * c1, 17) * c2; - uint32_t a4 = fetch32(s + 16); - h ^= a0; - h = ror32(h, 18); - h = h * 5 + 0xe6546b64; - f += a1; - f = ror32(f, 19); - f = f * c1; - g += a2; - g = ror32(g, 18); - g = g * 5 + 0xe6546b64; - h ^= a3 + a1; - h = ror32(h, 19); - h = h * 5 + 0xe6546b64; - g ^= a4; - g = bswap32(g) * 5; - h += a4 * 5; - h = bswap32(h); - f += a0; - PERMUTE3(&f, &h, &g); - s += 20; - } while (--iters != 0); - g = ror32(g, 11) * c1; - g = ror32(g, 17) * c1; - f = ror32(f, 11) * c1; - f = ror32(f, 17) * c1; - h = ror32(h + g, 19); - h = h * 5 + 0xe6546b64; - h = ror32(h, 17) * c1; - h = ror32(h + f, 19); - h = h * 5 + 0xe6546b64; - h = ror32(h, 17) * c1; - return h; -} - -uint32_t farmhash32_cc_with_seed(const char *s, size_t len, uint32_t seed) { - if (len <= 24) { - if (len >= 13) return farmhash32_mk_len_13_to_24(s, len, seed * c1); - else if (len >= 5) return farmhash32_mk_len_5_to_12(s, len, seed); - else return farmhash32_mk_len_0_to_4(s, len, seed); - } - uint32_t h = farmhash32_mk_len_13_to_24(s, 24, seed ^ len); - 
return mur(farmhash32_cc(s + 24, len - 24) + seed, h); -} - -static inline uint64_t farmhash_cc_len_0_to_16(const char *s, size_t len) { - if (len >= 8) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch64(s) + k2; - uint64_t b = fetch64(s + len - 8); - uint64_t c = ror64(b, 37) * mul + a; - uint64_t d = (ror64(a, 25) + b) * mul; - return farmhash_len_16_mul(c, d, mul); - } - if (len >= 4) { - uint64_t mul = k2 + len * 2; - uint64_t a = fetch32(s); - return farmhash_len_16_mul(len + (a << 3), fetch32(s + len - 4), mul); - } - if (len > 0) { - uint8_t a = s[0]; - uint8_t b = s[len >> 1]; - uint8_t c = s[len - 1]; - uint32_t y = ((uint32_t) a) + (((uint32_t) b) << 8); - uint32_t z = len + (((uint32_t) c) << 2); - return smix(y * k2 ^ z * k0) * k2; - } - return k2; -} - -// Return a 16-byte hash for 48 bytes. Quick and dirty. -// Callers do best to use "random-looking" values for a and b. -static inline uint128_t weak_farmhash_cc_len_32_with_seeds_vals( - uint64_t w, uint64_t x, uint64_t y, uint64_t z, uint64_t a, uint64_t b) { - a += w; - b = ror64(b + a + z, 21); - uint64_t c = a; - a += x; - a += y; - b += ror64(a, 44); - return make_uint128_t(a + z, b + c); -} - -// Return a 16-byte hash for s[0] ... s[31], a, and b. Quick and dirty. -static inline uint128_t weak_farmhash_cc_len_32_with_seeds( - const char* s, uint64_t a, uint64_t b) { - return weak_farmhash_cc_len_32_with_seeds_vals(fetch64(s), - fetch64(s + 8), - fetch64(s + 16), - fetch64(s + 24), - a, - b); -} - - - -// A subroutine for cityhash128(). Returns a decent 128-bit hash for strings -// of any length representable in signed long. Based on City and Murmur. 
-static inline uint128_t farmhash_cc_city_murmur(const char *s, size_t len, uint128_t seed) { - uint64_t a = uint128_t_low64(seed); - uint64_t b = uint128_t_high64(seed); - uint64_t c = 0; - uint64_t d = 0; - signed long l = len - 16; - if (l <= 0) { // len <= 16 - a = smix(a * k1) * k1; - c = b * k1 + farmhash_cc_len_0_to_16(s, len); - d = smix(a + (len >= 8 ? fetch64(s) : c)); - } else { // len > 16 - c = farmhash_len_16(fetch64(s + len - 8) + k1, a); - d = farmhash_len_16(b + len, c + fetch64(s + len - 16)); - a += d; - do { - a ^= smix(fetch64(s) * k1) * k1; - a *= k1; - b ^= a; - c ^= smix(fetch64(s + 8) * k1) * k1; - c *= k1; - d ^= c; - s += 16; - l -= 16; - } while (l > 0); - } - a = farmhash_len_16(a, c); - b = farmhash_len_16(d, b); - return make_uint128_t(a ^ b, farmhash_len_16(b, a)); -} - -uint128_t farmhash128_cc_city_with_seed(const char *s, size_t len, uint128_t seed) { - if (len < 128) { - return farmhash_cc_city_murmur(s, len, seed); - } - - // We expect len >= 128 to be the common case. Keep 56 bytes of state: - // v, w, x, y, and z. - uint128_t v, w; - uint64_t x = uint128_t_low64(seed); - uint64_t y = uint128_t_high64(seed); - uint64_t z = len * k1; - v.a = ror64(y ^ k1, 49) * k1 + fetch64(s); - v.b = ror64(v.a, 42) * k1 + fetch64(s + 8); - w.a = ror64(y + z, 35) * k1 + x; - w.b = ror64(x + fetch64(s + 88), 53) * k1; - - // This is the same inner loop as cityhash64(), manually unrolled. 
- do { - x = ror64(x + y + v.a + fetch64(s + 8), 37) * k1; - y = ror64(y + v.b + fetch64(s + 48), 42) * k1; - x ^= w.b; - y += v.a + fetch64(s + 40); - z = ror64(z + w.a, 33) * k1; - v = weak_farmhash_cc_len_32_with_seeds(s, v.b * k1, x + w.a); - w = weak_farmhash_cc_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); - swap64(&z, &x); - s += 64; - x = ror64(x + y + v.a + fetch64(s + 8), 37) * k1; - y = ror64(y + v.b + fetch64(s + 48), 42) * k1; - x ^= w.b; - y += v.a + fetch64(s + 40); - z = ror64(z + w.a, 33) * k1; - v = weak_farmhash_cc_len_32_with_seeds(s, v.b * k1, x + w.a); - w = weak_farmhash_cc_len_32_with_seeds(s + 32, z + w.b, y + fetch64(s + 16)); - swap64(&z, &x); - s += 64; - len -= 128; - } while (likely(len >= 128)); - x += ror64(v.a + z, 49) * k0; - y = y * k0 + ror64(w.b, 37); - z = z * k0 + ror64(w.a, 27); - w.a *= 9; - v.a *= k0; - // If 0 < len < 128, hash up to 4 chunks of 32 bytes each from the end of s. - for (size_t tail_done = 0; tail_done < len; ) { - tail_done += 32; - y = ror64(x + y, 42) * k0 + v.b; - w.a += fetch64(s + len - tail_done + 16); - x = x * k0 + w.a; - z += w.b + fetch64(s + len - tail_done); - w.b += v.a; - v = weak_farmhash_cc_len_32_with_seeds(s + len - tail_done, v.a + z, v.b); - v.a *= k0; - } - // At this point our 56 bytes of state should contain more than - // enough information for a strong 128-bit hash. We use two - // different 56-byte-to-8-byte hashes to get a 16-byte final result. - x = farmhash_len_16(x, v.a); - y = farmhash_len_16(y + z, w.a); - return make_uint128_t(farmhash_len_16(x + v.b, w.b) + y, - farmhash_len_16(x + w.b, y + v.b)); -} - -static inline uint128_t farmhash128_cc_city(const char *s, size_t len) { - return len >= 16 ? 
- farmhash128_cc_city_with_seed(s + 16, len - 16, - make_uint128_t(fetch64(s), fetch64(s + 8) + k0)) : - farmhash128_cc_city_with_seed(s, len, make_uint128_t(k0, k1)); -} - -uint128_t farmhash_cc_fingerprint128(const char* s, size_t len) { - return farmhash128_cc_city(s, len); -} - -// BASIC STRING HASHING - -// farmhash function for a byte array. See also Hash(), below. -// May change from time to time, may differ on different platforms, may differ -// depending on NDEBUG. -uint32_t farmhash32(const char* s, size_t len) { - return debug_tweak32( - -#if x86_64 && CAN_USE_SSE41 - farmhash32_nt(s, len) -#elif CAN_USE_SSE41 && CAN_USE_SSE42 && CAN_USE_AESNI - farmhash32_su(s, len) -#elif CAN_USE_SSSE3 && CAN_USE_SSE41 && CAN_USE_SSE42 - farmhash32_sa(s, len) -#else - farmhash32_mk(s, len) -#endif - - ); -} - -// Hash function for a byte array. For convenience, a 32-bit seed is also -// hashed into the result. -// May change from time to time, may differ on different platforms, may differ -// depending on NDEBUG. -uint32_t farmhash32_with_seed(const char* s, size_t len, uint32_t seed) { - return debug_tweak32( - -#if x86_64 && CAN_USE_SSE41 - farmhash32_nt_with_seed(s, len, seed) -#elif CAN_USE_SSE41 && CAN_USE_SSE42 && CAN_USE_AESNI - farmhash32_su_with_seed(s, len, seed) -#elif CAN_USE_SSSE3 && CAN_USE_SSE41 && CAN_USE_SSE42 - farmhash32_sa_with_seed(s, len, seed) -#else - farmhash32_mk_with_seed(s, len, seed) -#endif - - ); -} - -// Hash function for a byte array. For convenience, a 64-bit seed is also -// hashed into the result. See also farmhash(), below. -// May change from time to time, may differ on different platforms, may differ -// depending on NDEBUG. -uint64_t farmhash64(const char* s, size_t len) { - return debug_tweak64( -#if x86_64 && CAN_USE_SSSE3 && CAN_USE_SSE41 - farmhash64_te(s, len) -#else - farmhash64_xo(s, len) -#endif - ); -} - -// Hash function for a byte array. 
-// May change from time to time, may differ on different platforms, may differ -// depending on NDEBUG. -size_t farmhash(const char* s, size_t len) { - return sizeof(size_t) == 8 ? farmhash64(s, len) : farmhash32(s, len); -} - -// Hash function for a byte array. For convenience, a 64-bit seed is also -// hashed into the result. -// May change from time to time, may differ on different platforms, may differ -// depending on NDEBUG. -uint64_t farmhash64_with_seed(const char* s, size_t len, uint64_t seed) { - return debug_tweak64(farmhash64_na_with_seed(s, len, seed)); -} - -// Hash function for a byte array. For convenience, two seeds are also -// hashed into the result. -// May change from time to time, may differ on different platforms, may differ -// depending on NDEBUG. -uint64_t farmhash64_with_seeds(const char* s, size_t len, uint64_t seed0, uint64_t seed1) { - return debug_tweak64(farmhash64_na_with_seeds(s, len, seed0, seed1)); -} - -// Hash function for a byte array. -// May change from time to time, may differ on different platforms, may differ -// depending on NDEBUG. -uint128_t farmhash128(const char* s, size_t len) { - return debug_tweak128(farmhash_cc_fingerprint128(s, len)); -} - -// Hash function for a byte array. For convenience, a 128-bit seed is also -// hashed into the result. -// May change from time to time, may differ on different platforms, may differ -// depending on NDEBUG. -uint128_t farmhash128_with_seed(const char* s, size_t len, uint128_t seed) { - return debug_tweak128(farmhash128_cc_city_with_seed(s, len, seed)); -} - -// BASIC NON-STRING HASHING - -// FINGERPRINTING (i.e., good, portable, forever-fixed hash functions) - -// Fingerprint function for a byte array. Most useful in 32-bit binaries. -uint32_t farmhash_fingerprint32(const char* s, size_t len) { - return farmhash32_mk(s, len); -} - -// Fingerprint function for a byte array. 
-uint64_t farmhash_fingerprint64(const char* s, size_t len) {
-  return farmhash64_na(s, len);
-}
-
-// Fingerprint function for a byte array.
-uint128_t farmhash_fingerprint128(const char* s, size_t len) {
-  return farmhash_cc_fingerprint128(s, len);
-}
diff --git a/lib/checksums/farmhash.h b/lib/checksums/farmhash.h
deleted file mode 100644
index 8a2d840a..00000000
--- a/lib/checksums/farmhash.h
+++ /dev/null
@@ -1,166 +0,0 @@
-// Copyright (c) 2014 Google, Inc.
-//
-// Permission is hereby granted, free of charge, to any person obtaining a copy
-// of this software and associated documentation files (the "Software"), to deal
-// in the Software without restriction, including without limitation the rights
-// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-// copies of the Software, and to permit persons to whom the Software is
-// furnished to do so, subject to the following conditions:
-//
-// The above copyright notice and this permission notice shall be included in
-// all copies or substantial portions of the Software.
-//
-// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-// THE SOFTWARE.
-//
-// FarmHash, by Geoff Pike
-
-//
-// http://code.google.com/p/farmhash/
-//
-// This file provides a few functions for hashing strings and other
-// data. All of them are high-quality functions in the sense that
-// they do well on standard tests such as Austin Appleby's SMHasher.
-// They're also fast. FarmHash is the successor to CityHash.
-//
-// Functions in the FarmHash family are not suitable for cryptography.
-//
-// WARNING: This code has been only lightly tested on big-endian platforms!
-// It is known to work well on little-endian platforms that have a small penalty
-// for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs.
-// It should work on all 32-bit and 64-bit platforms that allow unaligned reads;
-// bug reports are welcome.
-//
-// By the way, for some hash functions, given strings a and b, the hash
-// of a+b is easily derived from the hashes of a and b. This property
-// doesn't hold for any hash functions in this file.
-
-// this c port from https://github.com/uxcn/farmhash-c
-
-#ifndef FARMHASH_H
-#define FARMHASH_H
-
-#include
-#include
-
-struct uint128_t {
-  uint64_t a;
-  uint64_t b;
-};
-
-typedef struct uint128_t uint128_t;
-
-
-static inline uint64_t uint128_t_low64(const uint128_t x) { return x.a; }
-static inline uint64_t uint128_t_high64(const uint128_t x) { return x.b; }
-
-static inline uint128_t make_uint128_t(uint64_t lo, uint64_t hi) { uint128_t x = {lo, hi}; return x; }
-
-// BASIC STRING HASHING
-
-// Hash function for a byte array.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-size_t farmhash(const char* s, size_t len);
-
-// Hash function for a byte array. Most useful in 32-bit binaries.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-uint32_t farmhash32(const char* s, size_t len);
-
-// Hash function for a byte array. For convenience, a 32-bit seed is also
-// hashed into the result.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-uint32_t farmhash32_with_seed(const char* s, size_t len, uint32_t seed);
-
-// Hash 128 input bits down to 64 bits of output.
-// Hash function for a byte array.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-uint64_t farmhash64(const char* s, size_t len);
-
-// Hash function for a byte array. For convenience, a 64-bit seed is also
-// hashed into the result.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-uint64_t farmhash64_with_seed(const char* s, size_t len, uint64_t seed);
-
-// Hash function for a byte array. For convenience, two seeds are also
-// hashed into the result.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-uint64_t farmhash64_with_seeds(const char* s, size_t len,
-                               uint64_t seed0, uint64_t seed1);
-
-// Hash function for a byte array.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-uint128_t farmhash128(const char* s, size_t len);
-
-// Hash function for a byte array. For convenience, a 128-bit seed is also
-// hashed into the result.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-uint128_t farmhash128_with_seed(const char* s, size_t len, uint128_t seed);
-
-// BASIC NON-STRING HASHING
-
-// This is intended to be a reasonably good hash function.
-// May change from time to time, may differ on different platforms, may differ
-// depending on NDEBUG.
-static inline uint64_t farmhash128_to_64(uint128_t x) {
-  // Murmur-inspired hashing.
-  const uint64_t k_mul = 0x9ddfea08eb382d69ULL;
-  uint64_t a = (uint128_t_low64(x) ^ uint128_t_high64(x)) * k_mul;
-  a ^= (a >> 47);
-  uint64_t b = (uint128_t_high64(x) ^ a) * k_mul;
-  b ^= (b >> 47);
-  b *= k_mul;
-  return b;
-}
-
-// FINGERPRINTING (i.e., good, portable, forever-fixed hash functions)
-
-// Fingerprint function for a byte array. Most useful in 32-bit binaries.
-uint32_t farmhash_fingerprint32(const char* s, size_t len);
-
-// Fingerprint function for a byte array.
-uint64_t farmhash_fingerprint64(const char* s, size_t len);
-
-// Fingerprint function for a byte array.
-uint128_t farmhash_fingerprint128(const char* s, size_t len); - -// This is intended to be a good fingerprinting primitive. -// See below for more overloads. -static inline uint64_t farmhash_fingerprint_uint128_t(uint128_t x) { - // Murmur-inspired hashing. - const uint64_t k_mul = 0x9ddfea08eb382d69ULL; - uint64_t a = (uint128_t_low64(x) ^ uint128_t_high64(x)) * k_mul; - a ^= (a >> 47); - uint64_t b = (uint128_t_high64(x) ^ a) * k_mul; - b ^= (b >> 44); - b *= k_mul; - b ^= (b >> 41); - b *= k_mul; - return b; -} - -// This is intended to be a good fingerprinting primitive. -static inline uint64_t farmhash_fingerprint_uint64_t(uint64_t x) { - // Murmur-inspired hashing. - const uint64_t k_mul = 0x9ddfea08eb382d69ULL; - uint64_t b = x * k_mul; - b ^= (b >> 44); - b *= k_mul; - b ^= (b >> 41); - b *= k_mul; - return b; -} - -#endif // FARMHASH_H From cdc68416024e52779c2b00d43ad0c632bd9a3ad6 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 12:40:45 +1000 Subject: [PATCH 107/180] checksum: import metrohash from https://github.com/jedisct1/metrohash-c --- lib/checksums/metrohash.h | 72 +++++++++++++ lib/checksums/metrohash128.c | 177 +++++++++++++++++++++++++++++++ lib/checksums/metrohash128crc.c | 179 ++++++++++++++++++++++++++++++++ 3 files changed, 428 insertions(+) create mode 100644 lib/checksums/metrohash.h create mode 100644 lib/checksums/metrohash128.c create mode 100644 lib/checksums/metrohash128crc.c diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h new file mode 100644 index 00000000..0f988c6f --- /dev/null +++ b/lib/checksums/metrohash.h @@ -0,0 +1,72 @@ +// metrohash.h +// +// The MIT License (MIT) +// +// Copyright (c) 2015 J. 
Andrew Rogers +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+// + +#ifndef METROHASH_METROHASH_H +#define METROHASH_METROHASH_H + +#include +#include + +// MetroHash 64-bit hash functions +void metrohash64_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); +void metrohash64_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); + +// MetroHash 128-bit hash functions +void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); +void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); + +// MetroHash 128-bit hash functions using CRC instruction +void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); +void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); + +/* rotate right idiom recognized by compiler*/ +inline static uint64_t rotate_right(uint64_t v, unsigned k) +{ + return (v >> k) | (v << (64 - k)); +} + +// unaligned reads, fast and safe on Nehalem and later microarchitectures +inline static uint64_t read_u64(const void * const ptr) +{ + return * (uint64_t *) ptr; +} + +inline static uint64_t read_u32(const void * const ptr) +{ + return * (uint32_t *) ptr; +} + +inline static uint64_t read_u16(const void * const ptr) +{ + return * (uint16_t *) ptr; +} + +inline static uint64_t read_u8 (const void * const ptr) +{ + return * (uint8_t *) ptr; +} + + +#endif // #ifndef METROHASH_METROHASH_H diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c new file mode 100644 index 00000000..295c246d --- /dev/null +++ b/lib/checksums/metrohash128.c @@ -0,0 +1,177 @@ +// metrohash128.cpp +// +// The MIT License (MIT) +// +// Copyright (c) 2015 J. 
Andrew Rogers +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+// + +#include "metrohash.h" + +void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) +{ + static const uint64_t k0 = 0xC83A91E1; + static const uint64_t k1 = 0x8648DBDB; + static const uint64_t k2 = 0x7BDEC03B; + static const uint64_t k3 = 0x2F5870A5; + + const uint8_t * ptr = key; + const uint8_t * const end = ptr + len; + + uint64_t v[4]; + + v[0] = ((((uint64_t) seed) - k0) * k3) + len; + v[1] = ((((uint64_t) seed) + k1) * k2) + len; + + if (len >= 32) + { + v[2] = ((((uint64_t) seed) + k0) * k2) + len; + v[3] = ((((uint64_t) seed) - k1) * k3) + len; + + do + { + v[0] += read_u64(ptr) * k0; ptr += 8; v[0] = rotate_right(v[0],29) + v[2]; + v[1] += read_u64(ptr) * k1; ptr += 8; v[1] = rotate_right(v[1],29) + v[3]; + v[2] += read_u64(ptr) * k2; ptr += 8; v[2] = rotate_right(v[2],29) + v[0]; + v[3] += read_u64(ptr) * k3; ptr += 8; v[3] = rotate_right(v[3],29) + v[1]; + } + while (ptr <= (end - 32)); + + v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 26) * k1; + v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 26) * k0; + v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 26) * k1; + v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 30) * k0; + } + + if ((end - ptr) >= 16) + { + v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],33) * k3; + v[1] += read_u64(ptr) * k2; ptr += 8; v[1] = rotate_right(v[1],33) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 17) * k1; + v[1] ^= rotate_right((v[1] * k3) + v[0], 17) * k0; + } + + if ((end - ptr) >= 8) + { + v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],33) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 20) * k1; + } + + if ((end - ptr) >= 4) + { + v[1] += read_u32(ptr) * k2; ptr += 4; v[1] = rotate_right(v[1],33) * k3; + v[1] ^= rotate_right((v[1] * k3) + v[0], 18) * k0; + } + + if ((end - ptr) >= 2) + { + v[0] += read_u16(ptr) * k2; ptr += 2; v[0] = rotate_right(v[0],33) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 24) * k1; + } + + if ((end - ptr) >= 
1) + { + v[1] += read_u8 (ptr) * k2; v[1] = rotate_right(v[1],33) * k3; + v[1] ^= rotate_right((v[1] * k3) + v[0], 24) * k0; + } + + v[0] += rotate_right((v[0] * k0) + v[1], 13); + v[1] += rotate_right((v[1] * k1) + v[0], 37); + v[0] += rotate_right((v[0] * k2) + v[1], 13); + v[1] += rotate_right((v[1] * k3) + v[0], 37); + + memcpy(out, v, 16); +} + + +void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) +{ + static const uint64_t k0 = 0xD6D018F5; + static const uint64_t k1 = 0xA2AA033B; + static const uint64_t k2 = 0x62992FC1; + static const uint64_t k3 = 0x30BC5B29; + + const uint8_t * ptr = key; + const uint8_t * const end = ptr + len; + + uint64_t v[4]; + + v[0] = ((((uint64_t) seed) - k0) * k3) + len; + v[1] = ((((uint64_t) seed) + k1) * k2) + len; + + if (len >= 32) + { + v[2] = ((((uint64_t) seed) + k0) * k2) + len; + v[3] = ((((uint64_t) seed) - k1) * k3) + len; + + do + { + v[0] += read_u64(ptr) * k0; ptr += 8; v[0] = rotate_right(v[0],29) + v[2]; + v[1] += read_u64(ptr) * k1; ptr += 8; v[1] = rotate_right(v[1],29) + v[3]; + v[2] += read_u64(ptr) * k2; ptr += 8; v[2] = rotate_right(v[2],29) + v[0]; + v[3] += read_u64(ptr) * k3; ptr += 8; v[3] = rotate_right(v[3],29) + v[1]; + } + while (ptr <= (end - 32)); + + v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 33) * k1; + v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 33) * k0; + v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 33) * k1; + v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 33) * k0; + } + + if ((end - ptr) >= 16) + { + v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],29) * k3; + v[1] += read_u64(ptr) * k2; ptr += 8; v[1] = rotate_right(v[1],29) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 29) * k1; + v[1] ^= rotate_right((v[1] * k3) + v[0], 29) * k0; + } + + if ((end - ptr) >= 8) + { + v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],29) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 29) * k1; + } + + if ((end - ptr) >= 4) 
+ { + v[1] += read_u32(ptr) * k2; ptr += 4; v[1] = rotate_right(v[1],29) * k3; + v[1] ^= rotate_right((v[1] * k3) + v[0], 25) * k0; + } + + if ((end - ptr) >= 2) + { + v[0] += read_u16(ptr) * k2; ptr += 2; v[0] = rotate_right(v[0],29) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 30) * k1; + } + + if ((end - ptr) >= 1) + { + v[1] += read_u8 (ptr) * k2; v[1] = rotate_right(v[1],29) * k3; + v[1] ^= rotate_right((v[1] * k3) + v[0], 18) * k0; + } + + v[0] += rotate_right((v[0] * k0) + v[1], 33); + v[1] += rotate_right((v[1] * k1) + v[0], 33); + v[0] += rotate_right((v[0] * k2) + v[1], 33); + v[1] += rotate_right((v[1] * k3) + v[0], 33); + + memcpy(out, v, 16); +} diff --git a/lib/checksums/metrohash128crc.c b/lib/checksums/metrohash128crc.c new file mode 100644 index 00000000..6ce78265 --- /dev/null +++ b/lib/checksums/metrohash128crc.c @@ -0,0 +1,179 @@ +// metrohash128crc.cpp +// +// The MIT License (MIT) +// +// Copyright (c) 2015 J. Andrew Rogers +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. +// + + +#include "metrohash.h" +#include + +void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) +{ + static const uint64_t k0 = 0xC83A91E1; + static const uint64_t k1 = 0x8648DBDB; + static const uint64_t k2 = 0x7BDEC03B; + static const uint64_t k3 = 0x2F5870A5; + + const uint8_t * ptr = key; + const uint8_t * const end = ptr + len; + + uint64_t v[4]; + + v[0] = ((((uint64_t) seed) - k0) * k3) + len; + v[1] = ((((uint64_t) seed) + k1) * k2) + len; + + if (len >= 32) + { + v[2] = ((((uint64_t) seed) + k0) * k2) + len; + v[3] = ((((uint64_t) seed) - k1) * k3) + len; + + do + { + v[0] ^= _mm_crc32_u64(v[0], read_u64(ptr)); ptr += 8; + v[1] ^= _mm_crc32_u64(v[1], read_u64(ptr)); ptr += 8; + v[2] ^= _mm_crc32_u64(v[2], read_u64(ptr)); ptr += 8; + v[3] ^= _mm_crc32_u64(v[3], read_u64(ptr)); ptr += 8; + } + while (ptr <= (end - 32)); + + v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 34) * k1; + v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 37) * k0; + v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 34) * k1; + v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 37) * k0; + } + + if ((end - ptr) >= 16) + { + v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],34) * k3; + v[1] += read_u64(ptr) * k2; ptr += 8; v[1] = rotate_right(v[1],34) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 30) * k1; + v[1] ^= rotate_right((v[1] * k3) + v[0], 30) * k0; + } + + if ((end - ptr) >= 8) + { + v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],36) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 23) * k1; + } + + if ((end - ptr) >= 4) + { + v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; + v[1] ^= rotate_right((v[1] * k3) + v[0], 19) * k0; + 
} + + if ((end - ptr) >= 2) + { + v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; + v[0] ^= rotate_right((v[0] * k2) + v[1], 13) * k1; + } + + if ((end - ptr) >= 1) + { + v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); + v[1] ^= rotate_right((v[1] * k3) + v[0], 17) * k0; + } + + v[0] += rotate_right((v[0] * k0) + v[1], 11); + v[1] += rotate_right((v[1] * k1) + v[0], 26); + v[0] += rotate_right((v[0] * k0) + v[1], 11); + v[1] += rotate_right((v[1] * k1) + v[0], 26); + + memcpy(out, v, 16); +} + + +void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) +{ + static const uint64_t k0 = 0xEE783E2F; + static const uint64_t k1 = 0xAD07C493; + static const uint64_t k2 = 0x797A90BB; + static const uint64_t k3 = 0x2E4B2E1B; + + const uint8_t * ptr = key; + const uint8_t * const end = ptr + len; + + uint64_t v[4]; + + v[0] = ((((uint64_t) seed) - k0) * k3) + len; + v[1] = ((((uint64_t) seed) + k1) * k2) + len; + + if (len >= 32) + { + v[2] = ((((uint64_t) seed) + k0) * k2) + len; + v[3] = ((((uint64_t) seed) - k1) * k3) + len; + + do + { + v[0] ^= _mm_crc32_u64(v[0], read_u64(ptr)); ptr += 8; + v[1] ^= _mm_crc32_u64(v[1], read_u64(ptr)); ptr += 8; + v[2] ^= _mm_crc32_u64(v[2], read_u64(ptr)); ptr += 8; + v[3] ^= _mm_crc32_u64(v[3], read_u64(ptr)); ptr += 8; + } + while (ptr <= (end - 32)); + + v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 12) * k1; + v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 19) * k0; + v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 12) * k1; + v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 19) * k0; + } + + if ((end - ptr) >= 16) + { + v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],41) * k3; + v[1] += read_u64(ptr) * k2; ptr += 8; v[1] = rotate_right(v[1],41) * k3; + v[0] ^= rotate_right((v[0] * k2) + v[1], 10) * k1; + v[1] ^= rotate_right((v[1] * k3) + v[0], 10) * k0; + } + + if ((end - ptr) >= 8) + { + v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],34) * k3; + v[0] ^= 
rotate_right((v[0] * k2) + v[1], 22) * k1; + } + + if ((end - ptr) >= 4) + { + v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; + v[1] ^= rotate_right((v[1] * k3) + v[0], 14) * k0; + } + + if ((end - ptr) >= 2) + { + v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; + v[0] ^= rotate_right((v[0] * k2) + v[1], 15) * k1; + } + + if ((end - ptr) >= 1) + { + v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); + v[1] ^= rotate_right((v[1] * k3) + v[0], 18) * k0; + } + + v[0] += rotate_right((v[0] * k0) + v[1], 15); + v[1] += rotate_right((v[1] * k1) + v[0], 27); + v[0] += rotate_right((v[0] * k0) + v[1], 15); + v[1] += rotate_right((v[1] * k1) + v[0], 27); + + memcpy(out, v, 16); +} From 675149e1d44eaab9a5f9c5e2d828bfe62caf3f05 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 12:47:37 +1000 Subject: [PATCH 108/180] scons: add SSE4 check --- SConstruct | 23 ++++++++++++++++++++++- lib/SConscript | 1 + lib/checksums/metrohash.h | 2 ++ lib/checksums/metrohash128crc.c | 3 +++ 4 files changed, 28 insertions(+), 1 deletion(-) diff --git a/SConstruct b/SConstruct index 2f6fa89c..b4caa04c 100755 --- a/SConstruct +++ b/SConstruct @@ -361,6 +361,21 @@ def check_cygwin(context): context.Result(rc) return rc +def check_sse4(context): + rc = 0 + + context.Message('Checking for sse4 support...') + try: + if 'sse4' in open('/proc/cpuinfo').read(): + rc = 1 + except (IOError, OSError): + # Oops.
+ context.Message("read cpuinfo failed") + + conf.env['HAVE_SSE4'] = rc + context.Result(rc) + return rc + def create_uninstall_target(env, path): env.Command("uninstall-" + path, path, [ @@ -536,6 +551,7 @@ conf = Configure(env, custom_tests={ 'check_linux_fs_h': check_linux_fs_h, 'check_uname': check_uname, 'check_cygwin': check_cygwin, + 'check_sse4': check_sse4, 'check_sysmacro_h': check_sysmacro_h }) @@ -609,6 +625,11 @@ if conf.env['IS_CYGWIN']: else: conf.env.Append(CCFLAGS=['-fPIC']) +# check SSE4 support: +conf.check_sse4() +if conf.env['HAVE_SSE4']: + conf.env.Append(CCFLAGS=['-msse4']) + if ARGUMENTS.get('DEBUG') == "1": conf.env.Append(CCFLAGS=['-ggdb3']) @@ -629,7 +650,7 @@ conf.env.Append(CFLAGS=[ '-Wmissing-include-dirs', '-Wuninitialized', '-Wstrict-prototypes', - '-Wno-implicit-fallthrough' + '-Wno-implicit-fallthrough', ]) env.ParseConfig(pkg_config + ' --cflags --libs ' + ' '.join(packages)) diff --git a/lib/SConscript b/lib/SConscript index 9b920628..5791052e 100644 --- a/lib/SConscript +++ b/lib/SConscript @@ -34,6 +34,7 @@ def build_config_template(target, source, env): HAVE_LINUX_LIMITS=env['HAVE_LINUX_LIMITS'], HAVE_LINUX_FS_H=env['HAVE_LINUX_FS_H'], HAVE_BTRFS_H=env['HAVE_BTRFS_H'], + HAVE_SSE4=env['HAVE_SSE4'], HAVE_FACCESSAT=env['HAVE_FACCESSAT'], HAVE_UNAME=env['HAVE_UNAME'], HAVE_SYSMACROS_H=env['HAVE_SYSMACROS_H'], diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index 0f988c6f..5f313d62 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -37,9 +37,11 @@ void metrohash64_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * o void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); +#if HAVE_SSE4 // MetroHash 128-bit hash functions using CRC instruction void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); void metrohash128crc_2(const 
uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); +#endif /* rotate right idiom recognized by compiler*/ inline static uint64_t rotate_right(uint64_t v, unsigned k) diff --git a/lib/checksums/metrohash128crc.c b/lib/checksums/metrohash128crc.c index 6ce78265..0c256e21 100644 --- a/lib/checksums/metrohash128crc.c +++ b/lib/checksums/metrohash128crc.c @@ -23,6 +23,7 @@ // SOFTWARE. // +#if HAVE_SSE4 #include "metrohash.h" #include @@ -177,3 +178,5 @@ void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t memcpy(out, v, 16); } + +#endif From 1ad662bd7b900dafd975fac3d9054b94dd7d0c02 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 14:10:26 +1000 Subject: [PATCH 109/180] checksum: convert part of metrohash to streaming format (for speed tests) --- lib/checksum.c | 28 +++++++ lib/checksum.h | 1 + lib/checksums/metrohash.h | 11 +++ lib/checksums/metrohash128crc.c | 129 +++++++++++++++++++++++++------- lib/config.h.in | 3 +- 5 files changed, 142 insertions(+), 30 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 1155ad46..af0d8386 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -42,6 +42,7 @@ #include "checksums/blake2/blake2.h" #include "checksums/murmur3.h" +#include "checksums/metrohash.h" #include "checksums/sha3/sha3.h" #include "checksums/xxhash/xxhash.h" @@ -272,6 +273,32 @@ static const RmDigestSpec murmur_spec = { "murmur", 128, MURMUR_FUNCS(x64_128)}; #endif +/////////////////////////// +// metro // +/////////////////////////// + +static void rm_digest_metro_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + digest->state = metrohash128crc_1_new(seed1 ^ seed2); +} + +static void rm_digest_metro_free(RmDigest *digest) { + metrohash128crc_1_free(digest->state); +} + +static void rm_digest_metro_update(RmDigest *digest, const unsigned char *data, RmOff size) { + metrohash128crc_1_update(digest->state, data, size); +} + +static void 
rm_digest_metro_copy(RmDigest *digest, RmDigest *copy) { + copy->state = metrohash128crc_1_copy(digest->state); +} + +static void rm_digest_metro_steal(RmDigest *digest, guint8 *result) { + metrohash128crc_1_steal(digest->state, result); +} + +static const RmDigestSpec metro_spec = {"metro", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_update, rm_digest_metro_copy, rm_digest_metro_steal }; + /////////////////////////// // cumulative // /////////////////////////// @@ -586,6 +613,7 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { static const RmDigestSpec *digest_specs[] = { [RM_DIGEST_UNKNOWN] = NULL, [RM_DIGEST_MURMUR] = &murmur_spec, + [RM_DIGEST_METRO] = &metro_spec, [RM_DIGEST_MD5] = &md5_spec, [RM_DIGEST_SHA1] = &sha1_spec, [RM_DIGEST_SHA256] = &sha256_spec, diff --git a/lib/checksum.h b/lib/checksum.h index e50f2bb5..9c612cd5 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -37,6 +37,7 @@ typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, RM_DIGEST_MURMUR, + RM_DIGEST_METRO, RM_DIGEST_MD5, RM_DIGEST_SHA1, RM_DIGEST_SHA256, diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index 5f313d62..99dacabe 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -28,6 +28,10 @@ #include #include +#include "../config.h" + +typedef struct _Metro64_state Metro64State; +typedef struct _Metro128_state Metro128State; // MetroHash 64-bit hash functions void metrohash64_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); @@ -41,6 +45,13 @@ void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * // MetroHash 128-bit hash functions using CRC instruction void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); + +Metro128State *metrohash128crc_1_new(uint32_t seed); +Metro128State *metrohash128crc_1_copy(Metro128State *state); +void 
metrohash128crc_1_free(Metro128State *state); +void metrohash128crc_1_update(Metro128State *state, const uint8_t * key, uint64_t len); +void metrohash128crc_1_steal(Metro128State *state, uint8_t * out); + #endif /* rotate right idiom recognized by compiler*/ diff --git a/lib/checksums/metrohash128crc.c b/lib/checksums/metrohash128crc.c index 0c256e21..1f78475a 100644 --- a/lib/checksums/metrohash128crc.c +++ b/lib/checksums/metrohash128crc.c @@ -23,45 +23,109 @@ // SOFTWARE. // -#if HAVE_SSE4 - #include "metrohash.h" #include +#include -void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) -{ - static const uint64_t k0 = 0xC83A91E1; - static const uint64_t k1 = 0x8648DBDB; - static const uint64_t k2 = 0x7BDEC03B; - static const uint64_t k3 = 0x2F5870A5; +#if HAVE_SSE4 - const uint8_t * ptr = key; - const uint8_t * const end = ptr + len; - +struct _Metro128_state { uint64_t v[4]; + uint8_t xs[32]; /* unhashed data from last increment */ + uint8_t xs_len; +}; + +static const uint64_t k0 = 0xC83A91E1; +static const uint64_t k1 = 0x8648DBDB; +static const uint64_t k2 = 0x7BDEC03B; +static const uint64_t k3 = 0x2F5870A5; + +Metro128State *metrohash128crc_1_new(uint32_t seed) { + Metro128State *state = g_slice_new0(Metro128State); + state->v[0] = ((((uint64_t) seed) - k0) * k3); + state->v[1] = ((((uint64_t) seed) + k1) * k2); + state->v[2] = ((((uint64_t) seed) + k0) * k2); + state->v[3] = ((((uint64_t) seed) - k1) * k3); + return state; +} + +void metrohash128crc_1_free(Metro128State *state) { + g_slice_free(Metro128State, state); +} + +Metro128State *metrohash128crc_1_copy(Metro128State *state) { + return g_slice_copy(sizeof(Metro128State), state); +} + +#define METRO_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ + const int bytes = (data_len + xs_len > xs_cap) ? 
\ + (int)xs_cap - (int)xs_len : \ + (int)data_len; \ + memcpy(xs + xs_len, data, bytes); \ + xs_len += bytes; \ + data += bytes; + +void metrohash128crc_1_update(Metro128State *state, const uint8_t * key, uint64_t len) +{ + + uint8_t *data = (uint8_t *)key; + const uint8_t *stop = data + len; + + METRO_FILL_XS(state->xs, state->xs_len, 32, data, len); + + /* process blocks of 32 bytes */ + while(state->xs_len == 32 || data + 32 <= stop) { + + uint64_t d1; + uint64_t d2; + uint64_t d3; + uint64_t d4; + + if(state->xs_len == 32) { + /* process remnant data from previous update */ + d1 = read_u64(&state->xs[0]); + d2 = read_u64(&state->xs[8]); + d3 = read_u64(&state->xs[16]); + d4 = read_u64(&state->xs[24]); + state->xs_len = 0; + } else { + /* process new data */ + d1 = read_u64(data); + d2 = read_u64(data + 8); + d3 = read_u64(data + 16); + d4 = read_u64(data + 24); + data += 32; + } - v[0] = ((((uint64_t) seed) - k0) * k3) + len; - v[1] = ((((uint64_t) seed) + k1) * k2) + len; + state->v[0] ^= _mm_crc32_u64(state->v[0], d1); + state->v[1] ^= _mm_crc32_u64(state->v[1], d2); + state->v[2] ^= _mm_crc32_u64(state->v[2], d3); + state->v[3] ^= _mm_crc32_u64(state->v[3], d4); + + } - if (len >= 32) - { - v[2] = ((((uint64_t) seed) + k0) * k2) + len; - v[3] = ((((uint64_t) seed) - k1) * k3) + len; + if (state->xs_len == 0 && stop > data) { + // store excess data in state + state->xs_len = stop - data; + memcpy(state->xs, data, state->xs_len); + } - do - { - v[0] ^= _mm_crc32_u64(v[0], read_u64(ptr)); ptr += 8; - v[1] ^= _mm_crc32_u64(v[1], read_u64(ptr)); ptr += 8; - v[2] ^= _mm_crc32_u64(v[2], read_u64(ptr)); ptr += 8; - v[3] ^= _mm_crc32_u64(v[3], read_u64(ptr)); ptr += 8; - } - while (ptr <= (end - 32)); +} + +void metrohash128crc_1_steal(Metro128State *state, uint8_t * out) { - v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 34) * k1; - v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 37) * k0; - v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 34) * k1; - v[1] ^=
rotate_right(((v[1] + v[3]) * k1) + v[2], 37) * k0; + uint64_t v[4]; + for(int i=0; i<4; i++) { + v[i] = state->v[i]; } + + v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 34) * k1; + v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 37) * k0; + v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 34) * k1; + v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 37) * k0; + + uint8_t *ptr = state->xs; + uint8_t *end = ptr + state->xs_len; if ((end - ptr) >= 16) { @@ -103,6 +167,13 @@ void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t memcpy(out, v, 16); } +void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { + Metro128State *state = metrohash128crc_1_new(seed); + metrohash128crc_1_update(state, key, len); + metrohash128crc_1_steal(state, out); + metrohash128crc_1_free(state); +} + void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { diff --git a/lib/config.h.in b/lib/config.h.in index c01712af..fa03f65c 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -21,7 +21,8 @@ #define HAVE_LINUX_FS_H ({HAVE_LINUX_FS_H}) #define HAVE_FACCESSAT ({HAVE_FACCESSAT}) #define HAVE_UNAME ({HAVE_UNAME}) -#define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) +#define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) +#define HAVE_SSE4 ({HAVE_SSE4}) #define RM_DEFAULT_DIGEST RM_DIGEST_HIGHWAY256 #define RM_VERSION "{VERSION_MAJOR}.{VERSION_MINOR}.{VERSION_PATCH}" From a066f7b41c47c85d897bacc2748b3f559c58b538 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 15:04:50 +1000 Subject: [PATCH 110/180] checksum: finish conversion of metro 128 bit hashes to streaming --- lib/checksum.c | 26 +- lib/checksum.h | 5 + lib/checksums/metrohash.h | 20 +- lib/checksums/metrohash128.c | 475 ++++++++++++++++++++++++++------ lib/checksums/metrohash128crc.c | 253 ----------------- 5 files changed, 434 insertions(+), 345 deletions(-) delete mode 100644 lib/checksums/metrohash128crc.c diff --git 
a/lib/checksum.c b/lib/checksum.c index af0d8386..732a7d93 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -278,27 +278,42 @@ static const RmDigestSpec murmur_spec = { "murmur", 128, MURMUR_FUNCS(x64_128)}; /////////////////////////// static void rm_digest_metro_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - digest->state = metrohash128crc_1_new(seed1 ^ seed2); + digest->state = metrohash128_1_new(seed1 ^ seed2); } static void rm_digest_metro_free(RmDigest *digest) { - metrohash128crc_1_free(digest->state); + metrohash128_free(digest->state); } static void rm_digest_metro_update(RmDigest *digest, const unsigned char *data, RmOff size) { - metrohash128crc_1_update(digest->state, data, size); + metrohash128_1_update(digest->state, data, size); } static void rm_digest_metro_copy(RmDigest *digest, RmDigest *copy) { - copy->state = metrohash128crc_1_copy(digest->state); + copy->state = metrohash128_copy(digest->state); } static void rm_digest_metro_steal(RmDigest *digest, guint8 *result) { - metrohash128crc_1_steal(digest->state, result); + metrohash128_1_steal(digest->state, result); } static const RmDigestSpec metro_spec = {"metro", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_update, rm_digest_metro_copy, rm_digest_metro_steal }; +#if HAVE_SSE4 + +static void rm_digest_metro_crc_update(RmDigest *digest, const unsigned char *data, RmOff size) { + metrohash128crc_update(digest->state, data, size); +} + +static void rm_digest_metro_crc_steal(RmDigest *digest, guint8 *result) { + metrohash128crc_1_steal(digest->state, result); +} + +static const RmDigestSpec metro_crc_spec = {"metrocrc", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_crc_update, rm_digest_metro_copy, rm_digest_metro_crc_steal }; + +#endif + + /////////////////////////// // cumulative // /////////////////////////// @@ -614,6 +629,7 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { 
[RM_DIGEST_UNKNOWN] = NULL, [RM_DIGEST_MURMUR] = &murmur_spec, [RM_DIGEST_METRO] = &metro_spec, + [RM_DIGEST_METROCRC] = &metro_crc_spec, [RM_DIGEST_MD5] = &md5_spec, [RM_DIGEST_SHA1] = &sha1_spec, [RM_DIGEST_SHA256] = &sha256_spec, diff --git a/lib/checksum.h b/lib/checksum.h index 9c612cd5..6d87790e 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -38,10 +38,15 @@ typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, RM_DIGEST_MURMUR, RM_DIGEST_METRO, +#if HAVE_SSE4 + RM_DIGEST_METROCRC, +#endif RM_DIGEST_MD5, RM_DIGEST_SHA1, RM_DIGEST_SHA256, +#if HAVE_SHA512 RM_DIGEST_SHA512, +#endif RM_DIGEST_SHA3_256, RM_DIGEST_SHA3_384, RM_DIGEST_SHA3_512, diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index 99dacabe..94b7e6d8 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -38,19 +38,31 @@ void metrohash64_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * o void metrohash64_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); // MetroHash 128-bit hash functions +Metro128State *metrohash128_1_new(uint32_t seed); +Metro128State *metrohash128_2_new(uint32_t seed); + +Metro128State *metrohash128_copy(Metro128State *state); +void metrohash128_free(Metro128State *state); + void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); +void metrohash128_1_update(Metro128State *state, const uint8_t * key, uint64_t len); +void metrohash128_1_steal(Metro128State *state, uint8_t * out); + +void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t len); +void metrohash128_2_steal(Metro128State *state, uint8_t * out); + + #if HAVE_SSE4 // MetroHash 128-bit hash functions using CRC instruction void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); -Metro128State 
*metrohash128crc_1_new(uint32_t seed); -Metro128State *metrohash128crc_1_copy(Metro128State *state); -void metrohash128crc_1_free(Metro128State *state); -void metrohash128crc_1_update(Metro128State *state, const uint8_t * key, uint64_t len); +void metrohash128crc_update(Metro128State *state, const uint8_t * key, uint64_t len); + void metrohash128crc_1_steal(Metro128State *state, uint8_t * out); +void metrohash128crc_2_steal(Metro128State *state, uint8_t * out); #endif diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index 295c246d..4f4f23d0 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -1,4 +1,4 @@ -// metrohash128.cpp +// metrohash128crc.cpp // // The MIT License (MIT) // @@ -24,154 +24,463 @@ // #include "metrohash.h" +#include +#include -void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) +#if HAVE_SSE4 + +struct _Metro128_state { + uint64_t v[4]; + uint8_t xs[32]; /* unhashed data from last increment */ + uint8_t xs_len; +}; + +static const uint64_t k0_1 = 0xC83A91E1; +static const uint64_t k1_1 = 0x8648DBDB; +static const uint64_t k2_1 = 0x7BDEC03B; +static const uint64_t k3_1 = 0x2F5870A5; + +Metro128State *metrohash128_1_new(uint32_t seed) { + Metro128State *state = g_slice_new0(Metro128State); + state->v[0] = ((((uint64_t) seed) - k0_1) * k3_1); + state->v[1] = ((((uint64_t) seed) + k1_1) * k2_1); + state->v[2] = ((((uint64_t) seed) + k0_1) * k2_1); + state->v[3] = ((((uint64_t) seed) - k1_1) * k3_1); + return state; +} + +static const uint64_t k0_2 = 0xEE783E2F; +static const uint64_t k1_2 = 0xAD07C493; +static const uint64_t k2_2 = 0x797A90BB; +static const uint64_t k3_2 = 0x2E4B2E1B; + +Metro128State *metrohash128_2_new(uint32_t seed) { + Metro128State *state = g_slice_new0(Metro128State); + state->v[0] = ((((uint64_t) seed) - k0_2) * k3_2); + state->v[1] = ((((uint64_t) seed) + k1_2) * k2_2); + state->v[2] = ((((uint64_t) seed) + k0_2) * k2_2); + state->v[3] = 
((((uint64_t) seed) - k1_2) * k3_2); + return state; +} + + +void metrohash128_free(Metro128State *state) { + g_slice_free(Metro128State, state); +} + +Metro128State *metrohash128_copy(Metro128State *state) { + return g_slice_copy(sizeof(Metro128State), state); +} + +#define METRO_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ + const int bytes = (data_len + xs_len > xs_cap) ? \ + (int)xs_cap - (int)xs_len : \ + (int)data_len; \ + memcpy(xs + xs_len, data, bytes); \ + xs_len += bytes; \ + data += bytes; + +void metrohash128crc_update(Metro128State *state, const uint8_t * key, uint64_t len) { - static const uint64_t k0 = 0xC83A91E1; - static const uint64_t k1 = 0x8648DBDB; - static const uint64_t k2 = 0x7BDEC03B; - static const uint64_t k3 = 0x2F5870A5; - const uint8_t * ptr = key; - const uint8_t * const end = ptr + len; + uint8_t *data = (uint8_t *)key; + const uint8_t *stop = data + len; + + METRO_FILL_XS(state->xs, state->xs_len, 32, data, len); + + /* process blocks of 32 bytes */ + while(state->xs_len == 32 || data + 32 <= stop) { + + uint64_t d1; + uint64_t d2; + uint64_t d3; + uint64_t d4; + + if(state->xs_len == 32) { + /* process remnant data from previous update */ + d1 = read_u64(&state->xs[0]); + d2 = read_u64(&state->xs[8]); + d3 = read_u64(&state->xs[16]); + d4 = read_u64(&state->xs[24]); + state->xs_len = 0; + } else { + /* process new data */ + d1 = read_u64(data); + d2 = read_u64(data + 8); + d3 = read_u64(data + 16); + d4 = read_u64(data + 24); + data += 32; + } + state->v[0] ^= _mm_crc32_u64(state->v[0], d1); + state->v[1] ^= _mm_crc32_u64(state->v[1], d2); + state->v[2] ^= _mm_crc32_u64(state->v[2], d3); + state->v[3] ^= _mm_crc32_u64(state->v[3], d4); + + } + + if (state->xs_len == 0 && stop > data) { + // store excess data in state + state->xs_len = stop - data; + memcpy(state->xs, data, state->xs_len); + } +} + +void metrohash128crc_1_steal(Metro128State *state, uint8_t * out) { + uint64_t v[4]; + for(int i=0; i<4; i++) { + v[i] = state->v[i]; 
+ } + + v[2] ^= rotate_right(((v[0] + v[3]) * k0_1) + v[1], 34) * k1_1; + v[3] ^= rotate_right(((v[1] + v[2]) * k1_1) + v[0], 37) * k0_1; + v[0] ^= rotate_right(((v[0] + v[2]) * k0_1) + v[3], 34) * k1_1; + v[1] ^= rotate_right(((v[1] + v[3]) * k1_1) + v[2], 37) * k0_1; + + uint8_t *ptr = state->xs; + uint8_t *end = ptr + state->xs_len; + + if ((end - ptr) >= 16) + { + v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],34) * k3_1; + v[1] += read_u64(ptr) * k2_1; ptr += 8; v[1] = rotate_right(v[1],34) * k3_1; + v[0] ^= rotate_right((v[0] * k2_1) + v[1], 30) * k1_1; + v[1] ^= rotate_right((v[1] * k3_1) + v[0], 30) * k0_1; + } + + if ((end - ptr) >= 8) + { + v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],36) * k3_1; + v[0] ^= rotate_right((v[0] * k2_1) + v[1], 23) * k1_1; + } + + if ((end - ptr) >= 4) + { + v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; + v[1] ^= rotate_right((v[1] * k3_1) + v[0], 19) * k0_1; + } + + if ((end - ptr) >= 2) + { + v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; + v[0] ^= rotate_right((v[0] * k2_1) + v[1], 13) * k1_1; + } - v[0] = ((((uint64_t) seed) - k0) * k3) + len; - v[1] = ((((uint64_t) seed) + k1) * k2) + len; + if ((end - ptr) >= 1) + { + v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); + v[1] ^= rotate_right((v[1] * k3_1) + v[0], 17) * k0_1; + } - if (len >= 32) - { - v[2] = ((((uint64_t) seed) + k0) * k2) + len; - v[3] = ((((uint64_t) seed) - k1) * k3) + len; + v[0] += rotate_right((v[0] * k0_1) + v[1], 11); + v[1] += rotate_right((v[1] * k1_1) + v[0], 26); + v[0] += rotate_right((v[0] * k0_1) + v[1], 11); + v[1] += rotate_right((v[1] * k1_1) + v[0], 26); - do - { - v[0] += read_u64(ptr) * k0; ptr += 8; v[0] = rotate_right(v[0],29) + v[2]; - v[1] += read_u64(ptr) * k1; ptr += 8; v[1] = rotate_right(v[1],29) + v[3]; - v[2] += read_u64(ptr) * k2; ptr += 8; v[2] = rotate_right(v[2],29) + v[0]; - v[3] += read_u64(ptr) * k3; ptr += 8; v[3] = rotate_right(v[3],29) + v[1]; - } - while (ptr <= (end 
- 32)); + memcpy(out, v, 16); +} - v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 26) * k1; - v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 26) * k0; - v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 26) * k1; - v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 30) * k0; +void metrohash128crc_2_steal(Metro128State *state, uint8_t * out) { + + uint64_t v[4]; + for(int i=0; i<4; i++) { + v[i] = state->v[i]; } + + v[2] ^= rotate_right(((v[0] + v[3]) * k0_2) + v[1], 12) * k1_2; + v[3] ^= rotate_right(((v[1] + v[2]) * k1_2) + v[0], 19) * k0_2; + v[0] ^= rotate_right(((v[0] + v[2]) * k0_2) + v[3], 12) * k1_2; + v[1] ^= rotate_right(((v[1] + v[3]) * k1_2) + v[2], 19) * k0_2; + + uint8_t *ptr = state->xs; + uint8_t *end = ptr + state->xs_len; if ((end - ptr) >= 16) { - v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],33) * k3; - v[1] += read_u64(ptr) * k2; ptr += 8; v[1] = rotate_right(v[1],33) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 17) * k1; - v[1] ^= rotate_right((v[1] * k3) + v[0], 17) * k0; + v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],41) * k3_2; + v[1] += read_u64(ptr) * k2_2; ptr += 8; v[1] = rotate_right(v[1],41) * k3_2; + v[0] ^= rotate_right((v[0] * k2_2) + v[1], 10) * k1_2; + v[1] ^= rotate_right((v[1] * k3_2) + v[0], 10) * k0_2; } if ((end - ptr) >= 8) { - v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],33) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 20) * k1; + v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],34) * k3_2; + v[0] ^= rotate_right((v[0] * k2_2) + v[1], 22) * k1_2; } if ((end - ptr) >= 4) { - v[1] += read_u32(ptr) * k2; ptr += 4; v[1] = rotate_right(v[1],33) * k3; - v[1] ^= rotate_right((v[1] * k3) + v[0], 18) * k0; + v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; + v[1] ^= rotate_right((v[1] * k3_2) + v[0], 14) * k0_2; } if ((end - ptr) >= 2) { - v[0] += read_u16(ptr) * k2; ptr += 2; v[0] = rotate_right(v[0],33) * k3; - v[0] ^= rotate_right((v[0] * k2) 
+ v[1], 24) * k1; + v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; + v[0] ^= rotate_right((v[0] * k2_2) + v[1], 15) * k1_2; } if ((end - ptr) >= 1) { - v[1] += read_u8 (ptr) * k2; v[1] = rotate_right(v[1],33) * k3; - v[1] ^= rotate_right((v[1] * k3) + v[0], 24) * k0; + v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); + v[1] ^= rotate_right((v[1] * k3_2) + v[0], 18) * k0_2; } - v[0] += rotate_right((v[0] * k0) + v[1], 13); - v[1] += rotate_right((v[1] * k1) + v[0], 37); - v[0] += rotate_right((v[0] * k2) + v[1], 13); - v[1] += rotate_right((v[1] * k3) + v[0], 37); + v[0] += rotate_right((v[0] * k0_2) + v[1], 15); + v[1] += rotate_right((v[1] * k1_2) + v[0], 27); + v[0] += rotate_right((v[0] * k0_2) + v[1], 15); + v[1] += rotate_right((v[1] * k1_2) + v[0], 27); memcpy(out, v, 16); } +void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { + Metro128State *state = metrohash128_1_new(seed); + metrohash128crc_update(state, key, len); + metrohash128crc_1_steal(state, out); + metrohash128_free(state); +} + +void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { + Metro128State *state = metrohash128_2_new(seed); + metrohash128crc_update(state, key, len); + metrohash128crc_2_steal(state, out); + metrohash128_free(state); +} + +#endif + + + +void metrohash128_1_update(Metro128State *state, const uint8_t * key, uint64_t len) +{ + + uint8_t *data = (uint8_t *)key; + const uint8_t *stop = data + len; + + METRO_FILL_XS(state->xs, state->xs_len, 32, data, len); + + /* process blocks of 32 bytes */ + while(state->xs_len == 32 || data + 32 <= stop) { + + uint64_t d1; + uint64_t d2; + uint64_t d3; + uint64_t d4; + + if(state->xs_len == 32) { + /* process remnant data from previous update */ + d1 = read_u64(&state->xs[0]); + d2 = read_u64(&state->xs[8]); + d3 = read_u64(&state->xs[16]); + d4 = read_u64(&state->xs[24]); + state->xs_len = 0; + } else { + /* process new data */ + d1 = read_u64(data); + d2 = 
read_u64(data + 8); + d3 = read_u64(data + 16); + d4 = read_u64(data + 24); + data += 32; + } + + state->v[0] += d1 * k0_1; + state->v[0] = rotate_right(state->v[0],29) + state->v[2]; + + state->v[1] += d2 * k1_1; + state->v[1] = rotate_right(state->v[1],29) + state->v[3]; -void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) + state->v[2] += d3 * k2_1; + state->v[2] = rotate_right(state->v[2],29) + state->v[0]; + + state->v[3] += d4 * k3_1; + state->v[3] = rotate_right(state->v[3],29) + state->v[1]; + + } + + if (state->xs_len == 0 && stop > data) { + // store excess data in state + state->xs_len = stop - data; + memcpy(state->xs, data, state->xs_len); + } +} +void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t len) { - static const uint64_t k0 = 0xD6D018F5; - static const uint64_t k1 = 0xA2AA033B; - static const uint64_t k2 = 0x62992FC1; - static const uint64_t k3 = 0x30BC5B29; - const uint8_t * ptr = key; - const uint8_t * const end = ptr + len; + uint8_t *data = (uint8_t *)key; + const uint8_t *stop = data + len; + + METRO_FILL_XS(state->xs, state->xs_len, 32, data, len); + + /* process blocks of 32 bytes */ + while(state->xs_len == 32 || data + 32 <= stop) { + + uint64_t d1; + uint64_t d2; + uint64_t d3; + uint64_t d4; + + if(state->xs_len == 32) { + /* process remnant data from previous update */ + d1 = read_u64(&state->xs[0]); + d2 = read_u64(&state->xs[8]); + d3 = read_u64(&state->xs[16]); + d4 = read_u64(&state->xs[24]); + state->xs_len = 0; + } else { + /* process new data */ + d1 = read_u64(data); + d2 = read_u64(data + 8); + d3 = read_u64(data + 16); + d4 = read_u64(data + 24); + data += 32; + } + + state->v[0] += d1 * k0_2; + state->v[0] = rotate_right(state->v[0],29) + state->v[2]; + + state->v[1] += d2 * k1_2; + state->v[1] = rotate_right(state->v[1],29) + state->v[3]; + + state->v[2] += d3 * k2_2; + state->v[2] = rotate_right(state->v[2],29) + state->v[0]; + + state->v[3] += d4 * k3_2; + 
state->v[3] = rotate_right(state->v[3],29) + state->v[1]; + + } + if (state->xs_len == 0 && stop > data) { + // store excess data in state + state->xs_len = stop - data; + memcpy(state->xs, data, state->xs_len); + } +} + +void metrohash128_1_steal(Metro128State *state, uint8_t * out) { + uint64_t v[4]; + for(int i=0; i<4; i++) { + v[i] = state->v[i]; + } + + v[2] ^= rotate_right(((v[0] + v[3]) * k0_1) + v[1], 26) * k1_1; + v[3] ^= rotate_right(((v[1] + v[2]) * k1_1) + v[0], 26) * k0_1; + v[0] ^= rotate_right(((v[0] + v[2]) * k0_1) + v[3], 26) * k1_1; + v[1] ^= rotate_right(((v[1] + v[3]) * k1_1) + v[2], 30) * k0_1; + + uint8_t *ptr = state->xs; + uint8_t *end = ptr + state->xs_len; - v[0] = ((((uint64_t) seed) - k0) * k3) + len; - v[1] = ((((uint64_t) seed) + k1) * k2) + len; + if ((end - ptr) >= 16) + { + v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],33) * k3_1; + v[1] += read_u64(ptr) * k2_1; ptr += 8; v[1] = rotate_right(v[1],33) * k3_1; + v[0] ^= rotate_right((v[0] * k2_1) + v[1], 17) * k1_1; + v[1] ^= rotate_right((v[1] * k3_1) + v[0], 17) * k0_1; + } + + if ((end - ptr) >= 8) + { + v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],33) * k3_1; + v[0] ^= rotate_right((v[0] * k2_1) + v[1], 20) * k1_1; + } + + if ((end - ptr) >= 4) + { + v[1] += read_u32(ptr) * k2_1; ptr += 4; v[1] = rotate_right(v[1],33) * k3_1; + v[1] ^= rotate_right((v[1] * k3_1) + v[0], 18) * k0_1; + } + + if ((end - ptr) >= 2) + { + v[0] += read_u16(ptr) * k2_1; ptr += 2; v[0] = rotate_right(v[0],33) * k3_1; + v[0] ^= rotate_right((v[0] * k2_1) + v[1], 24) * k1_1; + } + + if ((end - ptr) >= 1) + { + v[1] += read_u8 (ptr) * k2_1; v[1] = rotate_right(v[1],33) * k3_1; + v[1] ^= rotate_right((v[1] * k3_1) + v[0], 24) * k0_1; + } - if (len >= 32) - { - v[2] = ((((uint64_t) seed) + k0) * k2) + len; - v[3] = ((((uint64_t) seed) - k1) * k3) + len; + v[0] += rotate_right((v[0] * k0_1) + v[1], 13); + v[1] += rotate_right((v[1] * k1_1) + v[0], 37); + v[0] += 
rotate_right((v[0] * k2_1) + v[1], 13); + v[1] += rotate_right((v[1] * k3_1) + v[0], 37); - do - { - v[0] += read_u64(ptr) * k0; ptr += 8; v[0] = rotate_right(v[0],29) + v[2]; - v[1] += read_u64(ptr) * k1; ptr += 8; v[1] = rotate_right(v[1],29) + v[3]; - v[2] += read_u64(ptr) * k2; ptr += 8; v[2] = rotate_right(v[2],29) + v[0]; - v[3] += read_u64(ptr) * k3; ptr += 8; v[3] = rotate_right(v[3],29) + v[1]; - } - while (ptr <= (end - 32)); + memcpy(out, v, 16); +} + + +void metrohash128_2_steal(Metro128State *state, uint8_t * out) { - v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 33) * k1; - v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 33) * k0; - v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 33) * k1; - v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 33) * k0; + uint64_t v[4]; + for(int i=0; i<4; i++) { + v[i] = state->v[i]; } + + v[2] ^= rotate_right(((v[0] + v[3]) * k0_2) + v[1], 33) * k1_2; + v[3] ^= rotate_right(((v[1] + v[2]) * k1_2) + v[0], 33) * k0_2; + v[0] ^= rotate_right(((v[0] + v[2]) * k0_2) + v[3], 33) * k1_2; + v[1] ^= rotate_right(((v[1] + v[3]) * k1_2) + v[2], 33) * k0_2; + + uint8_t *ptr = state->xs; + uint8_t *end = ptr + state->xs_len; if ((end - ptr) >= 16) { - v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],29) * k3; - v[1] += read_u64(ptr) * k2; ptr += 8; v[1] = rotate_right(v[1],29) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 29) * k1; - v[1] ^= rotate_right((v[1] * k3) + v[0], 29) * k0; + v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],29) * k3_2; + v[1] += read_u64(ptr) * k2_2; ptr += 8; v[1] = rotate_right(v[1],29) * k3_2; + v[0] ^= rotate_right((v[0] * k2_2) + v[1], 29) * k1_2; + v[1] ^= rotate_right((v[1] * k3_2) + v[0], 29) * k0_2; } if ((end - ptr) >= 8) { - v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],29) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 29) * k1; + v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],29) * k3_2; + v[0] ^= rotate_right((v[0] 
* k2_2) + v[1], 29) * k1_2; } if ((end - ptr) >= 4) { - v[1] += read_u32(ptr) * k2; ptr += 4; v[1] = rotate_right(v[1],29) * k3; - v[1] ^= rotate_right((v[1] * k3) + v[0], 25) * k0; + v[1] += read_u32(ptr) * k2_2; ptr += 4; v[1] = rotate_right(v[1],29) * k3_2; + v[1] ^= rotate_right((v[1] * k3_2) + v[0], 25) * k0_2; } if ((end - ptr) >= 2) { - v[0] += read_u16(ptr) * k2; ptr += 2; v[0] = rotate_right(v[0],29) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 30) * k1; + v[0] += read_u16(ptr) * k2_2; ptr += 2; v[0] = rotate_right(v[0],29) * k3_2; + v[0] ^= rotate_right((v[0] * k2_2) + v[1], 30) * k1_2; } if ((end - ptr) >= 1) { - v[1] += read_u8 (ptr) * k2; v[1] = rotate_right(v[1],29) * k3; - v[1] ^= rotate_right((v[1] * k3) + v[0], 18) * k0; + v[1] += read_u8 (ptr) * k2_2; v[1] = rotate_right(v[1],29) * k3_2; + v[1] ^= rotate_right((v[1] * k3_2) + v[0], 18) * k0_2; } - v[0] += rotate_right((v[0] * k0) + v[1], 33); - v[1] += rotate_right((v[1] * k1) + v[0], 33); - v[0] += rotate_right((v[0] * k2) + v[1], 33); - v[1] += rotate_right((v[1] * k3) + v[0], 33); + v[0] += rotate_right((v[0] * k0_2) + v[1], 33); + v[1] += rotate_right((v[1] * k1_2) + v[0], 33); + v[0] += rotate_right((v[0] * k2_2) + v[1], 33); + v[1] += rotate_right((v[1] * k3_2) + v[0], 33); memcpy(out, v, 16); } + + +void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { + Metro128State *state = metrohash128_1_new(seed); + metrohash128_1_update(state, key, len); + metrohash128_1_steal(state, out); + metrohash128_free(state); +} + +void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { + Metro128State *state = metrohash128_2_new(seed); + metrohash128_2_update(state, key, len); + metrohash128_2_steal(state, out); + metrohash128_free(state); +} diff --git a/lib/checksums/metrohash128crc.c b/lib/checksums/metrohash128crc.c deleted file mode 100644 index 1f78475a..00000000 --- a/lib/checksums/metrohash128crc.c +++ /dev/null @@ -1,253 +0,0 @@ -// 
metrohash128crc.cpp -// -// The MIT License (MIT) -// -// Copyright (c) 2015 J. Andrew Rogers -// -// Permission is hereby granted, free of charge, to any person obtaining a copy -// of this software and associated documentation files (the "Software"), to deal -// in the Software without restriction, including without limitation the rights -// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -// copies of the Software, and to permit persons to whom the Software is -// furnished to do so, subject to the following conditions: -// -// The above copyright notice and this permission notice shall be included in all -// copies or substantial portions of the Software. -// -// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -// SOFTWARE. 
-// - -#include "metrohash.h" -#include -#include - -#if HAVE_SSE4 - -struct _Metro128_state { - uint64_t v[4]; - uint8_t xs[32]; /* unhashed data from last increment */ - uint8_t xs_len; -}; - -static const uint64_t k0 = 0xC83A91E1; -static const uint64_t k1 = 0x8648DBDB; -static const uint64_t k2 = 0x7BDEC03B; -static const uint64_t k3 = 0x2F5870A5; - -Metro128State *metrohash128crc_1_new(uint32_t seed) { - Metro128State *state = g_slice_new0(Metro128State); - state->v[0] = ((((uint64_t) seed) - k0) * k3); - state->v[1] = ((((uint64_t) seed) + k1) * k2); - state->v[2] = ((((uint64_t) seed) + k0) * k2); - state->v[3] = ((((uint64_t) seed) - k1) * k3); - return state; -} - -void metrohash128crc_1_free(Metro128State *state) { - g_slice_free(Metro128State, state); -} - -Metro128State *metrohash128crc_1_copy(Metro128State *state) { - return g_slice_copy(sizeof(Metro128State), state); -} - -#define METRO_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ - const int bytes = (data_len + xs_len > xs_cap) ? 
\ - (int)xs_cap - (int)xs_len : \ - (int)data_len; \ - memcpy(xs + xs_len, data, bytes); \ - xs_len += bytes; \ - data += bytes; - -void metrohash128crc_1_update(Metro128State *state, const uint8_t * key, uint64_t len) -{ - - uint8_t *data = (uint8_t *)key; - const uint8_t *stop = data + len; - - METRO_FILL_XS(state->xs, state->xs_len, 32, data, len); - - /* process blocks of 16 bytes */ - while(state->xs_len == 32 || data + 32 <= stop) { - - uint64_t d1; - uint64_t d2; - uint64_t d3; - uint64_t d4; - - if(state->xs_len == 32) { - /* process remnant data from previous update */ - d1 = read_u64(&state->xs[0]); - d2 = read_u64(&state->xs[8]); - d3 = read_u64(&state->xs[16]); - d4 = read_u64(&state->xs[24]); - state->xs_len = 0; - } else { - /* process new data */ - d1 = read_u64(data); - d2 = read_u64(data + 8); - d3 = read_u64(data + 16); - d4 = read_u64(data + 24); - data += 32; - } - - state->v[0] ^= _mm_crc32_u64(state->v[0], d1); - state->v[1] ^= _mm_crc32_u64(state->v[1], d2); - state->v[2] ^= _mm_crc32_u64(state->v[2], d3); - state->v[3] ^= _mm_crc32_u64(state->v[3], d4); - - } - - if (state->xs_len == 0 && stop > data) { - // store excess data in state - state->xs_len = stop - data; - memcpy(state->xs, data, state->xs_len); - } - -} - -void metrohash128crc_1_steal(Metro128State *state, uint8_t * out) { - - uint64_t v[4]; - for(int i=0; i<4; i++) { - v[i] = state->v[i]; - } - - v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 34) * k1; - v[3] ^= rotate_right(((v[1] + v[2]) * k1) + v[0], 37) * k0; - v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 34) * k1; - v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 37) * k0; - - uint8_t *ptr = state->xs; - uint8_t *end = ptr + state->xs_len; - - if ((end - ptr) >= 16) - { - v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],34) * k3; - v[1] += read_u64(ptr) * k2; ptr += 8; v[1] = rotate_right(v[1],34) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 30) * k1; - v[1] ^= rotate_right((v[1] * k3) + v[0], 
30) * k0; - } - - if ((end - ptr) >= 8) - { - v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],36) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 23) * k1; - } - - if ((end - ptr) >= 4) - { - v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; - v[1] ^= rotate_right((v[1] * k3) + v[0], 19) * k0; - } - - if ((end - ptr) >= 2) - { - v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; - v[0] ^= rotate_right((v[0] * k2) + v[1], 13) * k1; - } - - if ((end - ptr) >= 1) - { - v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); - v[1] ^= rotate_right((v[1] * k3) + v[0], 17) * k0; - } - - v[0] += rotate_right((v[0] * k0) + v[1], 11); - v[1] += rotate_right((v[1] * k1) + v[0], 26); - v[0] += rotate_right((v[0] * k0) + v[1], 11); - v[1] += rotate_right((v[1] * k1) + v[0], 26); - - memcpy(out, v, 16); -} - -void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { - Metro128State *state = metrohash128crc_1_new(seed); - metrohash128crc_1_update(state, key, len); - metrohash128crc_1_steal(state, out); - metrohash128crc_1_free(state); -} - - -void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) -{ - static const uint64_t k0 = 0xEE783E2F; - static const uint64_t k1 = 0xAD07C493; - static const uint64_t k2 = 0x797A90BB; - static const uint64_t k3 = 0x2E4B2E1B; - - const uint8_t * ptr = key; - const uint8_t * const end = ptr + len; - - uint64_t v[4]; - - v[0] = ((((uint64_t) seed) - k0) * k3) + len; - v[1] = ((((uint64_t) seed) + k1) * k2) + len; - - if (len >= 32) - { - v[2] = ((((uint64_t) seed) + k0) * k2) + len; - v[3] = ((((uint64_t) seed) - k1) * k3) + len; - - do - { - v[0] ^= _mm_crc32_u64(v[0], read_u64(ptr)); ptr += 8; - v[1] ^= _mm_crc32_u64(v[1], read_u64(ptr)); ptr += 8; - v[2] ^= _mm_crc32_u64(v[2], read_u64(ptr)); ptr += 8; - v[3] ^= _mm_crc32_u64(v[3], read_u64(ptr)); ptr += 8; - } - while (ptr <= (end - 32)); - - v[2] ^= rotate_right(((v[0] + v[3]) * k0) + v[1], 12) * k1; - v[3] ^= 
rotate_right(((v[1] + v[2]) * k1) + v[0], 19) * k0; - v[0] ^= rotate_right(((v[0] + v[2]) * k0) + v[3], 12) * k1; - v[1] ^= rotate_right(((v[1] + v[3]) * k1) + v[2], 19) * k0; - } - - if ((end - ptr) >= 16) - { - v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],41) * k3; - v[1] += read_u64(ptr) * k2; ptr += 8; v[1] = rotate_right(v[1],41) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 10) * k1; - v[1] ^= rotate_right((v[1] * k3) + v[0], 10) * k0; - } - - if ((end - ptr) >= 8) - { - v[0] += read_u64(ptr) * k2; ptr += 8; v[0] = rotate_right(v[0],34) * k3; - v[0] ^= rotate_right((v[0] * k2) + v[1], 22) * k1; - } - - if ((end - ptr) >= 4) - { - v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; - v[1] ^= rotate_right((v[1] * k3) + v[0], 14) * k0; - } - - if ((end - ptr) >= 2) - { - v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; - v[0] ^= rotate_right((v[0] * k2) + v[1], 15) * k1; - } - - if ((end - ptr) >= 1) - { - v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); - v[1] ^= rotate_right((v[1] * k3) + v[0], 18) * k0; - } - - v[0] += rotate_right((v[0] * k0) + v[1], 15); - v[1] += rotate_right((v[1] * k1) + v[0], 27); - v[0] += rotate_right((v[0] * k0) + v[1], 15); - v[1] += rotate_right((v[1] * k1) + v[0], 27); - - memcpy(out, v, 16); -} - -#endif From 6b4cd60c07b93c269f4b9542d0cc3949f504aef9 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 15:37:54 +1000 Subject: [PATCH 111/180] add 256-bit combination of metro128_1 and metro128_2 --- lib/checksum.c | 36 ++++++++++ lib/checksum.h | 2 + lib/checksums/metrohash.h | 10 +++ lib/checksums/metrohash128.c | 125 +++++++++++++++++++++++++---------- lib/cmdline.c | 10 ++- 5 files changed, 145 insertions(+), 38 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 732a7d93..4ae64772 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -297,7 +297,29 @@ static void rm_digest_metro_steal(RmDigest *digest, guint8 *result) { metrohash128_1_steal(digest->state, result); } +static void 
rm_digest_metro256_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + digest->state = metrohash256_new(seed1 ^ seed2); +} + +static void rm_digest_metro256_free(RmDigest *digest) { + metrohash256_free(digest->state); +} + +static void rm_digest_metro256_update(RmDigest *digest, const unsigned char *data, RmOff size) { + metrohash256_update(digest->state, data, size); +} + +static void rm_digest_metro256_copy(RmDigest *digest, RmDigest *copy) { + copy->state = metrohash256_copy(digest->state); +} + +static void rm_digest_metro256_steal(RmDigest *digest, guint8 *result) { + metrohash256_steal(digest->state, result); +} + + static const RmDigestSpec metro_spec = {"metro", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_update, rm_digest_metro_copy, rm_digest_metro_steal }; +static const RmDigestSpec metro256_spec = {"metro256", 256, rm_digest_metro256_init, rm_digest_metro256_free, rm_digest_metro256_update, rm_digest_metro256_copy, rm_digest_metro256_steal }; #if HAVE_SSE4 @@ -309,7 +331,17 @@ static void rm_digest_metro_crc_steal(RmDigest *digest, guint8 *result) { metrohash128crc_1_steal(digest->state, result); } +static void rm_digest_metro256_crc_update(RmDigest *digest, const unsigned char *data, RmOff size) { + metrohash256_update(digest->state, data, size); +} + +static void rm_digest_metro256_crc_steal(RmDigest *digest, guint8 *result) { + metrohash256_steal(digest->state, result); +} + + static const RmDigestSpec metro_crc_spec = {"metrocrc", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_crc_update, rm_digest_metro_copy, rm_digest_metro_crc_steal }; +static const RmDigestSpec metro256_crc_spec = {"metrocrc256", 256, rm_digest_metro256_init, rm_digest_metro256_free, rm_digest_metro256_crc_update, rm_digest_metro256_copy, rm_digest_metro256_crc_steal }; #endif @@ -629,7 +661,11 @@ static const RmDigestSpec *rm_digest_spec(RmDigestType type) { [RM_DIGEST_UNKNOWN] = NULL, 
[RM_DIGEST_MURMUR] = &murmur_spec, [RM_DIGEST_METRO] = &metro_spec, + [RM_DIGEST_METRO256] = &metro256_spec, + #if HAVE_SSE4 [RM_DIGEST_METROCRC] = &metro_crc_spec, + [RM_DIGEST_METROCRC256]= &metro256_crc_spec, + #endif [RM_DIGEST_MD5] = &md5_spec, [RM_DIGEST_SHA1] = &sha1_spec, [RM_DIGEST_SHA256] = &sha256_spec, diff --git a/lib/checksum.h b/lib/checksum.h index 6d87790e..ec54346b 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -38,8 +38,10 @@ typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, RM_DIGEST_MURMUR, RM_DIGEST_METRO, + RM_DIGEST_METRO256, #if HAVE_SSE4 RM_DIGEST_METROCRC, + RM_DIGEST_METROCRC256, #endif RM_DIGEST_MD5, RM_DIGEST_SHA1, diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index 94b7e6d8..3330a490 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -32,6 +32,7 @@ typedef struct _Metro64_state Metro64State; typedef struct _Metro128_state Metro128State; +typedef struct _Metro256_state Metro256State; // MetroHash 64-bit hash functions void metrohash64_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); @@ -40,9 +41,13 @@ void metrohash64_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * o // MetroHash 128-bit hash functions Metro128State *metrohash128_1_new(uint32_t seed); Metro128State *metrohash128_2_new(uint32_t seed); +Metro256State *metrohash256_new(uint32_t seed); Metro128State *metrohash128_copy(Metro128State *state); +Metro256State *metrohash256_copy(Metro256State *state); + void metrohash128_free(Metro128State *state); +void metrohash256_free(Metro256State *state); void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); @@ -53,6 +58,8 @@ void metrohash128_1_steal(Metro128State *state, uint8_t * out); void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t len); void metrohash128_2_steal(Metro128State *state, uint8_t * out); +void 
metrohash256_update(Metro256State *state, const uint8_t * key, uint64_t len); +void metrohash256_steal(Metro256State *state, uint8_t * out); #if HAVE_SSE4 // MetroHash 128-bit hash functions using CRC instruction @@ -64,6 +71,9 @@ void metrohash128crc_update(Metro128State *state, const uint8_t * key, uint64_t void metrohash128crc_1_steal(Metro128State *state, uint8_t * out); void metrohash128crc_2_steal(Metro128State *state, uint8_t * out); +void metrohash256crc_update(Metro256State *state, const uint8_t * key, uint64_t len); +void metrohash256crc_steal(Metro256State *state, uint8_t * out); + #endif /* rotate right idiom recognized by compiler*/ diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index 4f4f23d0..af346b94 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -35,17 +35,27 @@ struct _Metro128_state { uint8_t xs_len; }; +struct _Metro256_state { + Metro128State state1; + Metro128State state2; +}; + + static const uint64_t k0_1 = 0xC83A91E1; static const uint64_t k1_1 = 0x8648DBDB; static const uint64_t k2_1 = 0x7BDEC03B; static const uint64_t k3_1 = 0x2F5870A5; -Metro128State *metrohash128_1_new(uint32_t seed) { - Metro128State *state = g_slice_new0(Metro128State); +static void metrohash128_1_init(Metro128State *state, uint32_t seed) { state->v[0] = ((((uint64_t) seed) - k0_1) * k3_1); state->v[1] = ((((uint64_t) seed) + k1_1) * k2_1); state->v[2] = ((((uint64_t) seed) + k0_1) * k2_1); state->v[3] = ((((uint64_t) seed) - k1_1) * k3_1); +} + +Metro128State *metrohash128_1_new(uint32_t seed) { + Metro128State *state = g_slice_new0(Metro128State); + metrohash128_1_init(state, seed); return state; } @@ -54,15 +64,18 @@ static const uint64_t k1_2 = 0xAD07C493; static const uint64_t k2_2 = 0x797A90BB; static const uint64_t k3_2 = 0x2E4B2E1B; -Metro128State *metrohash128_2_new(uint32_t seed) { - Metro128State *state = g_slice_new0(Metro128State); +static void metrohash128_2_init(Metro128State *state, uint32_t 
seed) { state->v[0] = ((((uint64_t) seed) - k0_2) * k3_2); state->v[1] = ((((uint64_t) seed) + k1_2) * k2_2); state->v[2] = ((((uint64_t) seed) + k0_2) * k2_2); state->v[3] = ((((uint64_t) seed) - k1_2) * k3_2); - return state; } +Metro128State *metrohash128_2_new(uint32_t seed) { + Metro128State *state = g_slice_new0(Metro128State); + metrohash128_2_init(state, seed); + return state; +} void metrohash128_free(Metro128State *state) { g_slice_free(Metro128State, state); @@ -111,14 +124,14 @@ void metrohash128crc_update(Metro128State *state, const uint8_t * key, uint64_t d4 = read_u64(data + 24); data += 32; } - + state->v[0] ^= _mm_crc32_u64(state->v[0], d1); state->v[1] ^= _mm_crc32_u64(state->v[1], d2); state->v[2] ^= _mm_crc32_u64(state->v[2], d3); state->v[3] ^= _mm_crc32_u64(state->v[3], d4); } - + if (state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; @@ -140,7 +153,7 @@ void metrohash128crc_1_steal(Metro128State *state, uint8_t * out) { uint8_t *ptr = state->xs; uint8_t *end = ptr + state->xs_len; - + if ((end - ptr) >= 16) { v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],34) * k3_1; @@ -148,31 +161,31 @@ void metrohash128crc_1_steal(Metro128State *state, uint8_t * out) { v[0] ^= rotate_right((v[0] * k2_1) + v[1], 30) * k1_1; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 30) * k0_1; } - + if ((end - ptr) >= 8) { v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],36) * k3_1; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 23) * k1_1; } - + if ((end - ptr) >= 4) { v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 19) * k0_1; } - + if ((end - ptr) >= 2) { v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 13) * k1_1; } - + if ((end - ptr) >= 1) { v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); v[1] ^= rotate_right((v[1] * k3_1) + v[0], 17) * k0_1; } - + v[0] += rotate_right((v[0] * k0_1) + v[1], 11); v[1] 
+= rotate_right((v[1] * k1_1) + v[0], 26); v[0] += rotate_right((v[0] * k0_1) + v[1], 11); @@ -195,7 +208,7 @@ void metrohash128crc_2_steal(Metro128State *state, uint8_t * out) { uint8_t *ptr = state->xs; uint8_t *end = ptr + state->xs_len; - + if ((end - ptr) >= 16) { v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],41) * k3_2; @@ -203,31 +216,31 @@ void metrohash128crc_2_steal(Metro128State *state, uint8_t * out) { v[0] ^= rotate_right((v[0] * k2_2) + v[1], 10) * k1_2; v[1] ^= rotate_right((v[1] * k3_2) + v[0], 10) * k0_2; } - + if ((end - ptr) >= 8) { v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],34) * k3_2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 22) * k1_2; } - + if ((end - ptr) >= 4) { v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; v[1] ^= rotate_right((v[1] * k3_2) + v[0], 14) * k0_2; } - + if ((end - ptr) >= 2) { v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 15) * k1_2; } - + if ((end - ptr) >= 1) { v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); v[1] ^= rotate_right((v[1] * k3_2) + v[0], 18) * k0_2; } - + v[0] += rotate_right((v[0] * k0_2) + v[1], 15); v[1] += rotate_right((v[1] * k1_2) + v[0], 27); v[0] += rotate_right((v[0] * k0_2) + v[1], 15); @@ -250,6 +263,18 @@ void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t metrohash128_free(state); } +void metrohash256crc_update(Metro256State *state, const uint8_t * key, uint64_t len) { + metrohash128crc_update(&state->state1, key, len); + metrohash128crc_update(&state->state2, key, len); +} + +void metrohash256crc_steal(Metro256State *state, uint8_t * out) { + metrohash128crc_1_steal(&state->state1, out); + metrohash128crc_2_steal(&state->state2, out+16); +} + + + #endif @@ -285,7 +310,7 @@ void metrohash128_1_update(Metro128State *state, const uint8_t * key, uint64_t l d4 = read_u64(data + 24); data += 32; } - + state->v[0] += d1 * k0_1; state->v[0] = rotate_right(state->v[0],29) + 
state->v[2]; @@ -299,7 +324,7 @@ void metrohash128_1_update(Metro128State *state, const uint8_t * key, uint64_t l state->v[3] = rotate_right(state->v[3],29) + state->v[1]; } - + if (state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; @@ -337,7 +362,9 @@ void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t l d4 = read_u64(data + 24); data += 32; } - + void metrohash256_update(Metro256State *state, const uint8_t * key, uint64_t len); +void metrohash256_steal(Metro256State *state, uint8_t * out); + state->v[0] += d1 * k0_2; state->v[0] = rotate_right(state->v[0],29) + state->v[2]; @@ -351,7 +378,7 @@ void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t l state->v[3] = rotate_right(state->v[3],29) + state->v[1]; } - + if (state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; @@ -373,7 +400,7 @@ void metrohash128_1_steal(Metro128State *state, uint8_t * out) { uint8_t *ptr = state->xs; uint8_t *end = ptr + state->xs_len; - + if ((end - ptr) >= 16) { v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],33) * k3_1; @@ -381,31 +408,31 @@ void metrohash128_1_steal(Metro128State *state, uint8_t * out) { v[0] ^= rotate_right((v[0] * k2_1) + v[1], 17) * k1_1; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 17) * k0_1; } - + if ((end - ptr) >= 8) { v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],33) * k3_1; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 20) * k1_1; } - + if ((end - ptr) >= 4) { v[1] += read_u32(ptr) * k2_1; ptr += 4; v[1] = rotate_right(v[1],33) * k3_1; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 18) * k0_1; } - + if ((end - ptr) >= 2) { v[0] += read_u16(ptr) * k2_1; ptr += 2; v[0] = rotate_right(v[0],33) * k3_1; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 24) * k1_1; } - + if ((end - ptr) >= 1) { v[1] += read_u8 (ptr) * k2_1; v[1] = rotate_right(v[1],33) * k3_1; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 24) 
* k0_1; } - + v[0] += rotate_right((v[0] * k0_1) + v[1], 13); v[1] += rotate_right((v[1] * k1_1) + v[0], 37); v[0] += rotate_right((v[0] * k2_1) + v[1], 13); @@ -429,7 +456,7 @@ void metrohash128_2_steal(Metro128State *state, uint8_t * out) { uint8_t *ptr = state->xs; uint8_t *end = ptr + state->xs_len; - + if ((end - ptr) >= 16) { v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],29) * k3_2; @@ -437,31 +464,31 @@ void metrohash128_2_steal(Metro128State *state, uint8_t * out) { v[0] ^= rotate_right((v[0] * k2_2) + v[1], 29) * k1_2; v[1] ^= rotate_right((v[1] * k3_2) + v[0], 29) * k0_2; } - + if ((end - ptr) >= 8) { v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],29) * k3_2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 29) * k1_2; } - + if ((end - ptr) >= 4) { v[1] += read_u32(ptr) * k2_2; ptr += 4; v[1] = rotate_right(v[1],29) * k3_2; v[1] ^= rotate_right((v[1] * k3_2) + v[0], 25) * k0_2; } - + if ((end - ptr) >= 2) { v[0] += read_u16(ptr) * k2_2; ptr += 2; v[0] = rotate_right(v[0],29) * k3_2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 30) * k1_2; } - + if ((end - ptr) >= 1) { v[1] += read_u8 (ptr) * k2_2; v[1] = rotate_right(v[1],29) * k3_2; v[1] ^= rotate_right((v[1] * k3_2) + v[0], 18) * k0_2; } - + v[0] += rotate_right((v[0] * k0_2) + v[1], 33); v[1] += rotate_right((v[1] * k1_2) + v[0], 33); v[0] += rotate_right((v[0] * k2_2) + v[1], 33); @@ -484,3 +511,31 @@ void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * metrohash128_2_steal(state, out); metrohash128_free(state); } + + +Metro256State *metrohash256_new(uint32_t seed) { + Metro256State *state = g_slice_new0(Metro256State); + metrohash128_1_init(&state->state1, seed); + metrohash128_2_init(&state->state2, seed); + return state; +} + +void metrohash256_free(Metro256State *state) { + g_slice_free(Metro256State, state); +} + +Metro256State *metrohash256_copy(Metro256State *state) { + return g_slice_copy(sizeof(Metro256State), state); +} + +void 
metrohash256_update(Metro256State *state, const uint8_t * key, uint64_t len) { + metrohash128_1_update(&state->state1, key, len); + metrohash128_2_update(&state->state2, key, len); +} + +void metrohash256_steal(Metro256State *state, uint8_t * out) { + metrohash128_1_steal(&state->state1, out); + metrohash128_2_steal(&state->state2, out+16); +} + + diff --git a/lib/cmdline.c b/lib/cmdline.c index 08072a60..963cfed6 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -753,16 +753,20 @@ static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, /* Handle the paranoia option */ switch(paranoia_counter) { case -2: - cfg->checksum_type = RM_DIGEST_XXHASH; // 64-bit non-crypto + cfg->checksum_type = RM_DIGEST_METRO; // 128-bit non-crypto break; case -1: - cfg->checksum_type = RM_DIGEST_MURMUR; // 128-bit non-crypto +#if HAVE_SSE4 + cfg->checksum_type = RM_DIGEST_METROCRC256; // 256-bit non-crypto +#else + cfg->checksum_type = RM_DIGEST_METRO256; // 256-bit non-crypto +#endif break; case 0: /* leave users choice of -a (default) */ break; case 1: - cfg->checksum_type = RM_DIGEST_BLAKE2B; // 512-bit crypto + cfg->checksum_type = RM_DIGEST_BLAKE2B; // 512-bit crypto break; case 2: cfg->checksum_type = RM_DIGEST_PARANOID; From f1ead325e52ffe90a3784657238b97dfdbf564d2 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 13 Nov 2017 21:46:21 +1000 Subject: [PATCH 112/180] tests: update hash list --- tests/test_mains/test_hash.py | 10 +++++----- tests/utils.py | 7 ++++++- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/tests/test_mains/test_hash.py b/tests/test_mains/test_hash.py index eb3e9d42..6a7db5a2 100644 --- a/tests/test_mains/test_hash.py +++ b/tests/test_mains/test_hash.py @@ -32,6 +32,10 @@ def streaming_compliance_check(*patterns): def test_murmur(): streaming_compliance_check('murmur') +@with_setup(usual_setup_func, usual_teardown_func) +def test_metro(): + streaming_compliance_check('metro') + @with_setup(usual_setup_func, 
usual_teardown_func) def test_glib(): streaming_compliance_check('md5', 'sha1', 'sha256', 'sha512') @@ -48,15 +52,11 @@ def test_blake(): def test_xx(): streaming_compliance_check('xxhash') -@attr('known_issue') -@with_setup(usual_setup_func, usual_teardown_func) -def test_farm(): - streaming_compliance_check('farm') - @with_setup(usual_setup_func, usual_teardown_func) def test_highway(): streaming_compliance_check('highway') +@attr("known_issue") @with_setup(usual_setup_func, usual_teardown_func) def test_cumulative(): streaming_compliance_check('cumulative') diff --git a/tests/utils.py b/tests/utils.py index a3b75bd2..92f309e1 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -17,6 +17,8 @@ CKSUM_TYPES = [ 'murmur', + 'metro', + 'metro256' 'md5', 'sha1', 'sha256', @@ -28,7 +30,6 @@ 'blake2sp', 'blake2bp', 'xxhash', - 'farmhash', 'highway64', 'highway128', 'highway256', @@ -240,6 +241,10 @@ def run_rmlint_pedantic(*args, **kwargs): if has_feature('sha512'): CKSUM_TYPES.append('sha512') + if has_feature('sse4'): + CKSUM_TYPES.append('metrocrc') + CKSUM_TYPES.append('metrocrc256') + for cksum_type in CKSUM_TYPES: options.append('--algorithm=' + cksum_type) From c1a23c93d3c8c590f56a6279764c5f3ba46901bc Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 14 Nov 2017 10:19:56 +1000 Subject: [PATCH 113/180] tests: fix missing comma --- tests/utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/utils.py b/tests/utils.py index 92f309e1..9c49a5d4 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -18,7 +18,7 @@ CKSUM_TYPES = [ 'murmur', 'metro', - 'metro256' + 'metro256', 'md5', 'sha1', 'sha256', From 7fbceb14cfb41b4ed39c21d47d9e6eef8d888f1a Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 14 Nov 2017 10:21:16 +1000 Subject: [PATCH 114/180] checksum: make cumulative digest streaming-compatible & platform-optimise --- lib/checksum.c | 86 +++++++++++++++++++++++++++++++---- tests/test_mains/test_hash.py | 3 +- 2 files changed, 79 
insertions(+), 10 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 4ae64772..350ecfbe 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -350,15 +350,85 @@ static const RmDigestSpec metro256_crc_spec = {"metrocrc256", 256, rm_digest_me // cumulative // /////////////////////////// +#define RM_DIGEST_CUMULATIVE_LEN 16 /* must be power of 2 and >= 8 */ + +#if RM_PLATFORM_64 + +#define RM_DIGEST_CUMULATIVE_T guint64 +#define RM_DIGEST_CUMULATIVE_DATA data64 +#define RM_DIGEST_CUMULATIVE_ALIGN 8 + +#else + +#define RM_DIGEST_CUMULATIVE_T guint32 +#define RM_DIGEST_CUMULATIVE_DATA data32 +#define RM_DIGEST_CUMULATIVE_ALIGN 4 + +#endif + +typedef struct RmDigestCumulative { + union { + guint8 data[RM_DIGEST_CUMULATIVE_LEN]; + RM_DIGEST_CUMULATIVE_T bigdata[RM_DIGEST_CUMULATIVE_LEN / RM_DIGEST_CUMULATIVE_ALIGN]; + }; + RM_DIGEST_CUMULATIVE_T pos; /* could be smaller but this is faster */ +} RmDigestCumulative; + +static void rm_digest_cumulative_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + RmDigestCumulative *state = g_slice_new0(RmDigestCumulative); + *(RmOff*)&state->data[0] ^= seed1; +#if (RM_DIGEST_CUMULATIVE_LEN >= 16) + *(RmOff*)&state->data[8] ^= seed2; +#else + *(RmOff*)&state->data[0] ^= seed2; +#endif + digest->state = state; +} + +static void rm_digest_cumulative_free(RmDigest *digest) { + g_slice_free(RmDigestCumulative, digest->state); + digest->state = NULL; +} + static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *data, RmOff size) { - /* This only XORS the two checksums. 
*/ - guint8 *hash = digest->state; - for(gsize i = 0; i < size; ++i) { - hash[i % digest->bytes] ^= ((guint8 *)data)[i]; + guint8 *ptr = (guint8*) data; + guint8 *stop = ptr + size; + RmDigestCumulative *state = digest->state; + + /* align so we can use [32|64]-bit xor */ + while ((state->pos % RM_DIGEST_CUMULATIVE_ALIGN != 0) && ptr < stop) { + state->data[state->pos++] ^= *(ptr++); + state->pos &= (RM_DIGEST_CUMULATIVE_LEN-1); + } + + RM_DIGEST_CUMULATIVE_T *ptr_big = (RM_DIGEST_CUMULATIVE_T*)ptr; + RM_DIGEST_CUMULATIVE_T *stop_big = (RM_DIGEST_CUMULATIVE_T*)(stop + 1 - RM_DIGEST_CUMULATIVE_ALIGN); + + /* plough through body of data efficiently */ + while (ptr_big < stop_big) { + state->bigdata[state->pos / RM_DIGEST_CUMULATIVE_ALIGN] ^= *ptr_big++; + state->pos = (state->pos + RM_DIGEST_CUMULATIVE_ALIGN) & (RM_DIGEST_CUMULATIVE_LEN-1); + } + + /* process remaining data byte-wise */ + ptr = (guint8*)ptr_big; + while (ptr < stop) { + state->data[state->pos++] ^= *(ptr++); + state->pos &= (RM_DIGEST_CUMULATIVE_LEN-1); } } -static const RmDigestSpec cumulative_spec = { "cumulative", 128, GENERIC_FUNCS(cumulative)}; +static void rm_digest_cumulative_copy(RmDigest *digest, RmDigest *copy) { + copy->state = g_slice_copy(sizeof(RmDigestCumulative), digest->state); +} + +static void rm_digest_cumulative_steal(RmDigest *digest, guint8 *result) { + RmDigestCumulative *state = digest->state; + memcpy(result, state->data, RM_DIGEST_CUMULATIVE_LEN); +} + +static const RmDigestSpec cumulative_spec = { "cumulative", 8 * RM_DIGEST_CUMULATIVE_LEN, rm_digest_cumulative_init, rm_digest_cumulative_free, + rm_digest_cumulative_update, rm_digest_cumulative_copy, rm_digest_cumulative_steal}; /////////////////////////// @@ -705,9 +775,9 @@ static void rm_digest_table_insert(GHashTable *code_table, char *name, RmDigestT static gpointer rm_init_digest_type_table(GHashTable **code_table) { *code_table = g_hash_table_new(g_str_hash, g_str_equal); - for(RmDigestType type=1; typename, 
type); - } + } /* add some synonyms */ rm_digest_table_insert(*code_table, "sha3", RM_DIGEST_SHA3_256); @@ -991,7 +1061,7 @@ int rm_digest_get_bytes(RmDigest *self) { void rm_digest_send_match_candidate(RmDigest *target, RmDigest *candidate) { RmParanoid *paranoid = target->state; - + if(!paranoid->incoming_twin_candidates) { paranoid->incoming_twin_candidates = g_async_queue_new(); } diff --git a/tests/test_mains/test_hash.py b/tests/test_mains/test_hash.py index 6a7db5a2..ec3a022e 100644 --- a/tests/test_mains/test_hash.py +++ b/tests/test_mains/test_hash.py @@ -9,7 +9,7 @@ def streaming_compliance_check(*patterns): # a valid hash function streaming function should satisfy hash('a', 'b', 'c') == hash('abc') - + a = create_file('1' * 10000, 'a') algos = [] @@ -56,7 +56,6 @@ def test_xx(): def test_highway(): streaming_compliance_check('highway') -@attr("known_issue") @with_setup(usual_setup_func, usual_teardown_func) def test_cumulative(): streaming_compliance_check('cumulative') From 555060fc85564333080bcd8f477b44b640b3755e Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 14 Nov 2017 16:12:31 +1000 Subject: [PATCH 115/180] checksum: update default hash to SHA512 (refer #261) and update docs --- docs/rmlint.1.rst | 37 ++++++++++++++++++++++--------------- lib/cmdline.c | 9 +++++---- lib/config.h.in | 2 +- lib/hash-utility.c | 11 +++++++++-- 4 files changed, 37 insertions(+), 22 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 0464e9d1..bb6af12b 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -149,29 +149,34 @@ General Options ``$ rmlint -z rx $(echo $PATH | tr ":" " ") # Look at all executable files in $PATH`` -:``-a --algorithm=name`` (**default\:** *blake2b*): +:``-a --algorithm=name`` (**default\:** *sha512*): Choose the algorithm to use for finding duplicate files. The algorithm can be either **paranoid** (byte-by-byte file comparison) or use one of several file hash - algorithms to identify duplicates. 
The following well-known algorithms are available: + algorithms to identify duplicates. The following cryptographic algorithms are available: - **spooky**, **city**, **murmur**, **xxhash**, **md5**, **sha1**, **sha256**, - **sha512**, **farmhash**, **sha3**, **sha3-256**, **sha3-384**, **sha3-512**, - **blake2s**, **blake2b**, **blake2sp**, **blake2bp**. + **highway128**, **highway256**, + **sha1** (160 bit), **sha256**, **sha512**, + **sha3-256**, **sha3-384**, **sha3-512**, + **blake2s/blake2sp** (256), **blake2b/blake2bp** (512). - There are also some weaker hashes; we strongly advise against using these: - * **spooky32, spooky64:** Faster version of **spooky** with less bits. + For improved run time / reduced CPU load, the following non-cryptographic + hashes are also available: + **murmur** (128 bit), **metro** (128), **metro256**, + **metrocrc** (128), **metrocrc256** (if cpu supports crc) + + There are also some 64-bit hashes; we strongly advise against using these: + * **highway64** (cryptographic), **xxhash**. :``-p --paranoid`` / ``-P --less-paranoid`` (**default**): Increase or decrease the paranoia of ``rmlint``'s duplicate algorithm. Use ``-pp`` if you want byte-by-byte comparison without any hashing. - * **-p** is equivalent to **--algorithm=** - * **-pp** is equivalent to **--algorithm=paranoid** + * **-p** is equivalent to **--algorithm=paranoid** - * **-P** is equivalent to **--algorithm=** - * **-PP** is equivalent to **--algorithm=** + * **-P** is equivalent to **--algorithm=metro256** + * **-PP** is equivalent to **--algorithm=metro** :``-v --loud`` / ``-V --quiet``: @@ -846,10 +851,12 @@ PROBLEMS 1. **False Positives:** Depending on the options you use, there is a very slight risk of false positives (files that are erroneously detected as duplicate). - The default hash function (SHA1) is pretty safe but in theory it is possible for - two files to have then same hash. This happens about once in 2 ** 80 files, so - it is very very unlikely. 
If you're concerned just use the ``--paranoid`` (``-pp``) - option. This will compare all the files byte-by-byte and is not much slower than SHA1. + The default hash function (sha512) is very safe but in theory it is possible for + two files to have the same hash. If you had 10^73 different files, all the same + size, then the chance of a false positive is still less than 1 in a billion. + If you're concerned just use the ``--paranoid`` (``-p``) + option. This will compare all the files byte-by-byte and is not much slower than + sha512 (it may even be faster), although it is a lot more memory-hungry. 2. **File modification during or after rmlint run:** It is possible that a file that ``rmlint`` recognized as duplicate is modified afterwards, resulting in diff --git a/lib/cmdline.c b/lib/cmdline.c index 963cfed6..8684f4fa 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -753,7 +753,11 @@ static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, /* Handle the paranoia option */ switch(paranoia_counter) { case -2: +#if HAVE_SSE4 + cfg->checksum_type = RM_DIGEST_METROCRC; // 128-bit non-crypto +#else cfg->checksum_type = RM_DIGEST_METRO; // 128-bit non-crypto +#endif break; case -1: #if HAVE_SSE4 @@ -766,15 +770,12 @@ static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, /* leave users choice of -a (default) */ break; case 1: - cfg->checksum_type = RM_DIGEST_BLAKE2B; // 512-bit crypto - break; - case 2: cfg->checksum_type = RM_DIGEST_PARANOID; break; default: if(error && *error == NULL) { g_set_error(error, RM_ERROR_QUARK, 0, - _("Only up to -pp or down to -PP flags allowed")); + _("Only up to -p or down to -PP flags allowed")); } break; } diff --git a/lib/config.h.in b/lib/config.h.in index fa03f65c..a5e8c6c8 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -24,7 +24,7 @@ #define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) #define HAVE_SSE4 ({HAVE_SSE4}) -#define RM_DEFAULT_DIGEST 
RM_DIGEST_SHA512 #define RM_VERSION "{VERSION_MAJOR}.{VERSION_MINOR}.{VERSION_PATCH}" #define RM_VERSION_MAJOR {VERSION_MAJOR} #define RM_VERSION_MINOR {VERSION_MINOR} diff --git a/lib/hash-utility.c b/lib/hash-utility.c index 313b52e5..d0c01d0c 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -159,11 +159,18 @@ int rm_hasher_main(int argc, const char **argv) { g_snprintf(summary, sizeof(summary), _("Multi-threaded file digest (hash) calculator.\n" "\n Available digest types:" + "\n Cryptographic:" + "\n %s\n" + "\n Non-cryptographic:" "\n %s\n" "\n Supported, but not useful:" "\n %s\n"), - "spooky, city, xxhash, sha{1,256,512}, md5, murmur", - "farmhash, cumulative, paranoid, ext"); + "sha{1,256,512}, sha3-{256,384,512}, blake{2s,2b,2sp,2bp}, highway{64,128,256}", +#if HAVE_SSE4 + "metrocrc, metrocrc256, " +#endif + "metro, metro256, xxhash, murmur", + "cumulative, paranoid, ext"); g_option_group_add_entries(main_group, entries); g_option_context_set_main_group(context, main_group); From 0b44249564be81ae6218f71fafa4f964d5624d7d Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 14 Nov 2017 18:58:27 +1000 Subject: [PATCH 116/180] checksum: use rhash implementation of sha3 hashes --- lib/checksum.c | 27 +-- lib/checksum.h | 5 +- lib/checksums/sha3/byte_order.c | 152 ++++++++++++++ lib/checksums/sha3/byte_order.h | 178 ++++++++++++++++ lib/checksums/sha3/sha3.c | 238 --------------------- lib/checksums/sha3/sha3.h | 54 ----- lib/checksums/sha3/sha3_rhash.c | 356 ++++++++++++++++++++++++++++++++ lib/checksums/sha3/sha3_rhash.h | 54 +++++ lib/checksums/sha3/ustd.h | 30 +++ lib/utilities.h | 2 + 10 files changed, 788 insertions(+), 308 deletions(-) create mode 100644 lib/checksums/sha3/byte_order.c create mode 100644 lib/checksums/sha3/byte_order.h delete mode 100644 lib/checksums/sha3/sha3.c delete mode 100644 lib/checksums/sha3/sha3.h create mode 100644 lib/checksums/sha3/sha3_rhash.c create mode 100644 lib/checksums/sha3/sha3_rhash.h create mode 100644 
lib/checksums/sha3/ustd.h diff --git a/lib/checksum.c b/lib/checksum.c index 350ecfbe..e1ba32c6 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -43,8 +43,9 @@ #include "checksums/blake2/blake2.h" #include "checksums/murmur3.h" #include "checksums/metrohash.h" -#include "checksums/sha3/sha3.h" +#include "checksums/sha3/sha3_rhash.h" #include "checksums/xxhash/xxhash.h" +#include "checksums/highwayhash.h" #include "utilities.h" @@ -538,44 +539,44 @@ static const RmDigestSpec sha512_spec = {"sha512", 512, GLIB_FUNCS}; static void rm_digest_sha3_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - digest->state = g_slice_alloc0(sizeof(sha3_context)); + digest->state = g_slice_alloc0(sizeof(sha3_ctx)); switch(digest->type) { case RM_DIGEST_SHA3_256: - sha3_Init256(digest->state); + rhash_sha3_256_init(digest->state); break; case RM_DIGEST_SHA3_384: - sha3_Init384(digest->state); + rhash_sha3_384_init(digest->state); break; case RM_DIGEST_SHA3_512: - sha3_Init512(digest->state); + rhash_sha3_512_init(digest->state); break; default: g_assert_not_reached(); } if(seed1) { - sha3_Update(digest->state, &seed1, sizeof(seed1)); + rhash_sha3_update(digest->state, (const unsigned char *)&seed1, sizeof(seed1)); } if(seed2) { - sha3_Update(digest->state, &seed2, sizeof(seed2)); + rhash_sha3_update(digest->state, (const unsigned char *)&seed2, sizeof(seed2)); } } static void rm_digest_sha3_free(RmDigest *digest) { - g_slice_free(sha3_context, digest->state); + g_slice_free(sha3_ctx, digest->state); } static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, RmOff size) { - sha3_Update(digest->state, data, size); + rhash_sha3_update(digest->state, data, size); } static void rm_digest_sha3_copy(RmDigest *digest, RmDigest *copy) { - copy->state = g_slice_copy(sizeof(sha3_context), digest->state); + copy->state = g_slice_copy(sizeof(sha3_ctx), digest->state); } static void rm_digest_sha3_steal(RmDigest 
*digest, guint8 *result) { - sha3_context *copy = g_slice_copy(sizeof(sha3_context), digest->state); - memcpy(result, sha3_Finalize(copy), digest->bytes); - g_slice_free(sha3_context, copy); + sha3_ctx *copy = g_slice_copy(sizeof(sha3_ctx), digest->state); + rhash_sha3_final(copy, result); + g_slice_free(sha3_ctx, copy); } #define SHA3_SPEC(BITS) BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy, rm_digest_sha3_steal diff --git a/lib/checksum.h b/lib/checksum.h index ec54346b..5a506f62 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -30,9 +30,8 @@ #include #include "config.h" -#include "checksums/blake2/blake2.h" -#include "checksums/sha3/sha3.h" -#include "checksums/highwayhash.h" +//#include "checksums/blake2/blake2.h" +//#include "checksums/highwayhash.h" typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, diff --git a/lib/checksums/sha3/byte_order.c b/lib/checksums/sha3/byte_order.c new file mode 100644 index 00000000..9be65c3e --- /dev/null +++ b/lib/checksums/sha3/byte_order.c @@ -0,0 +1,152 @@ +/* byte_order.c - byte order related platform dependent routines, + * + * Copyright: 2008-2012 Aleksey Kravchenko + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. Use this program at your own risk! 
+ */ +#include "byte_order.h" + +#ifndef rhash_ctz + +# if _MSC_VER >= 1300 && (_M_IX86 || _M_AMD64 || _M_IA64) /* if MSVC++ >= 2002 on x86/x64 */ +# include +# pragma intrinsic(_BitScanForward) + +/** + * Returns index of the trailing bit of x. + * + * @param x the number to process + * @return zero-based index of the trailing bit + */ +unsigned rhash_ctz(unsigned x) +{ + unsigned long index; + unsigned char isNonzero = _BitScanForward(&index, x); /* MSVC intrinsic */ + return (isNonzero ? (unsigned)index : 0); +} +# else /* _MSC_VER >= 1300... */ + +/** + * Returns index of the trailing bit of a 32-bit number. + * This is a plain C equivalent for GCC __builtin_ctz() bit scan. + * + * @param x the number to process + * @return zero-based index of the trailing bit + */ +unsigned rhash_ctz(unsigned x) +{ + /* array for conversion to bit position */ + static unsigned char bit_pos[32] = { + 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, + 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 + }; + + /* The De Bruijn bit-scan was devised in 1997, according to Donald Knuth + * by Martin Lauter. The constant 0x077CB531UL is a De Bruijn sequence, + * which produces a unique pattern of bits into the high 5 bits for each + * possible bit position that it is multiplied against. + * See http://graphics.stanford.edu/~seander/bithacks.html + * and http://chessprogramming.wikispaces.com/BitScan */ + return (unsigned)bit_pos[((uint32_t)((x & -x) * 0x077CB531U)) >> 27]; +} +# endif /* _MSC_VER >= 1300... */ +#endif /* rhash_ctz */ + +/** + * Copy a memory block with simultaneous exchanging byte order. + * The byte order is changed from little-endian 32-bit integers + * to big-endian (or vice-versa). 
+ * + * @param to the pointer where to copy memory block + * @param index the index to start writing from + * @param from the source block to copy + * @param length length of the memory block + */ +void rhash_swap_copy_str_to_u32(void* to, int index, const void* from, size_t length) +{ + /* if all pointers and length are 32-bits aligned */ + if ( 0 == (( (int)((char*)to - (char*)0) | ((char*)from - (char*)0) | index | length ) & 3) ) { + /* copy memory as 32-bit words */ + const uint32_t* src = (const uint32_t*)from; + const uint32_t* end = (const uint32_t*)((const char*)src + length); + uint32_t* dst = (uint32_t*)((char*)to + index); + for (; src < end; dst++, src++) + *dst = bswap_32(*src); + } else { + const char* src = (const char*)from; + for (length += index; (size_t)index < length; index++) + ((char*)to)[index ^ 3] = *(src++); + } +} + +/** + * Copy a memory block with changed byte order. + * The byte order is changed from little-endian 64-bit integers + * to big-endian (or vice-versa). + * + * @param to the pointer where to copy memory block + * @param index the index to start writing from + * @param from the source block to copy + * @param length length of the memory block + */ +void rhash_swap_copy_str_to_u64(void* to, int index, const void* from, size_t length) +{ + /* if all pointers and length are 64-bits aligned */ + if ( 0 == (( (int)((char*)to - (char*)0) | ((char*)from - (char*)0) | index | length ) & 7) ) { + /* copy aligned memory block as 64-bit integers */ + const uint64_t* src = (const uint64_t*)from; + const uint64_t* end = (const uint64_t*)((const char*)src + length); + uint64_t* dst = (uint64_t*)((char*)to + index); + while (src < end) *(dst++) = bswap_64( *(src++) ); + } else { + const char* src = (const char*)from; + for (length += index; (size_t)index < length; index++) ((char*)to)[index ^ 7] = *(src++); + } +} + +/** + * Copy data from a sequence of 64-bit words to a binary string of given length, + * while changing byte order. 
+ * + * @param to the binary string to receive data + * @param from the source sequence of 64-bit words + * @param length the size in bytes of the data being copied + */ +void rhash_swap_copy_u64_to_str(void* to, const void* from, size_t length) +{ + /* if all pointers and length are 64-bits aligned */ + if ( 0 == (( (int)((char*)to - (char*)0) | ((char*)from - (char*)0) | length ) & 7) ) { + /* copy aligned memory block as 64-bit integers */ + const uint64_t* src = (const uint64_t*)from; + const uint64_t* end = (const uint64_t*)((const char*)src + length); + uint64_t* dst = (uint64_t*)to; + while (src < end) *(dst++) = bswap_64( *(src++) ); + } else { + size_t index; + char* dst = (char*)to; + for (index = 0; index < length; index++) *(dst++) = ((char*)from)[index ^ 7]; + } +} + +/** + * Exchange byte order in the given array of 32-bit integers. + * + * @param arr the array to process + * @param length array length + */ +void rhash_u32_mem_swap(unsigned *arr, int length) +{ + unsigned* end = arr + length; + for (; arr < end; arr++) { + *arr = bswap_32(*arr); + } +} diff --git a/lib/checksums/sha3/byte_order.h b/lib/checksums/sha3/byte_order.h new file mode 100644 index 00000000..4085f0e5 --- /dev/null +++ b/lib/checksums/sha3/byte_order.h @@ -0,0 +1,178 @@ +/* byte_order.h */ +#ifndef BYTE_ORDER_H +#define BYTE_ORDER_H +#include "ustd.h" +#include + +#ifdef __GLIBC__ +# include +#endif + +#ifdef __cplusplus +extern "C" { +#endif + +/* if x86 compatible cpu */ +#if defined(i386) || defined(__i386__) || defined(__i486__) || \ + defined(__i586__) || defined(__i686__) || defined(__pentium__) || \ + defined(__pentiumpro__) || defined(__pentium4__) || \ + defined(__nocona__) || defined(prescott) || defined(__core2__) || \ + defined(__k6__) || defined(__k8__) || defined(__athlon__) || \ + defined(__amd64) || defined(__amd64__) || \ + defined(__x86_64) || defined(__x86_64__) || defined(_M_IX86) || \ + defined(_M_AMD64) || defined(_M_IA64) || defined(_M_X64) +/* detect if 
x86-64 instruction set is supported */ +# if defined(_LP64) || defined(__LP64__) || defined(__x86_64) || \ + defined(__x86_64__) || defined(_M_AMD64) || defined(_M_X64) +# define CPU_X64 +# else +# define CPU_IA32 +# endif +#endif + + +/* detect CPU endianness */ +#if (defined(__BYTE_ORDER) && defined(__LITTLE_ENDIAN) && \ + __BYTE_ORDER == __LITTLE_ENDIAN) || \ + (defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \ + __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__) || \ + defined(CPU_IA32) || defined(CPU_X64) || \ + defined(__ia64) || defined(__ia64__) || defined(__alpha__) || defined(_M_ALPHA) || \ + defined(vax) || defined(MIPSEL) || defined(_ARM_) || defined(__arm__) +# define CPU_LITTLE_ENDIAN +# define IS_BIG_ENDIAN 0 +# define IS_LITTLE_ENDIAN 1 +#elif (defined(__BYTE_ORDER) && defined(__BIG_ENDIAN) && \ + __BYTE_ORDER == __BIG_ENDIAN) || \ + (defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \ + __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__) || \ + defined(__sparc) || defined(__sparc__) || defined(sparc) || \ + defined(_ARCH_PPC) || defined(_ARCH_PPC64) || defined(_POWER) || \ + defined(__POWERPC__) || defined(POWERPC) || defined(__powerpc) || \ + defined(__powerpc__) || defined(__powerpc64__) || defined(__ppc__) || \ + defined(__hpux) || defined(_MIPSEB) || defined(mc68000) || \ + defined(__s390__) || defined(__s390x__) || defined(sel) +# define CPU_BIG_ENDIAN +# define IS_BIG_ENDIAN 1 +# define IS_LITTLE_ENDIAN 0 +#else +# error "Can't detect CPU architechture" +#endif + +#ifndef __has_builtin +# define __has_builtin(x) 0 +#endif + +#define IS_ALIGNED_32(p) (0 == (3 & ((const char*)(p) - (const char*)0))) +#define IS_ALIGNED_64(p) (0 == (7 & ((const char*)(p) - (const char*)0))) + +#if defined(_MSC_VER) +#define ALIGN_ATTR(n) __declspec(align(n)) +#elif defined(__GNUC__) +#define ALIGN_ATTR(n) __attribute__((aligned (n))) +#else +#define ALIGN_ATTR(n) /* nothing */ +#endif + + +#if defined(_MSC_VER) || defined(__BORLANDC__) +#define I64(x) 
x##ui64 +#else +#define I64(x) x##ULL +#endif + + +#ifndef __STRICT_ANSI__ +#define RHASH_INLINE inline +#elif defined(__GNUC__) +#define RHASH_INLINE __inline__ +#else +#define RHASH_INLINE +#endif + +/* define rhash_ctz - count traling zero bits */ +#if (defined(__GNUC__) && __GNUC__ >= 4 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)) || \ + (defined(__clang__) && __has_builtin(__builtin_ctz)) +/* GCC >= 3.4 or clang */ +# define rhash_ctz(x) __builtin_ctz(x) +#else +unsigned rhash_ctz(unsigned); /* define as function */ +#endif + +void rhash_swap_copy_str_to_u32(void* to, int index, const void* from, size_t length); +void rhash_swap_copy_str_to_u64(void* to, int index, const void* from, size_t length); +void rhash_swap_copy_u64_to_str(void* to, const void* from, size_t length); +void rhash_u32_mem_swap(unsigned *p, int length_in_u32); + +/* bswap definitions */ +#if (defined(__GNUC__) && (__GNUC__ >= 4) && (__GNUC__ > 4 || __GNUC_MINOR__ >= 3)) || \ + (defined(__clang__) && __has_builtin(__builtin_bswap32) && __has_builtin(__builtin_bswap64)) +/* GCC >= 4.3 or clang */ +# define bswap_32(x) __builtin_bswap32(x) +# define bswap_64(x) __builtin_bswap64(x) +#elif (_MSC_VER > 1300) && (defined(CPU_IA32) || defined(CPU_X64)) /* MS VC */ +# define bswap_32(x) _byteswap_ulong((unsigned long)x) +# define bswap_64(x) _byteswap_uint64((__int64)x) +#else +/* fallback to generic bswap definition */ +static RHASH_INLINE uint32_t bswap_32(uint32_t x) +{ +# if defined(__GNUC__) && defined(CPU_IA32) && !defined(__i386__) && !defined(RHASH_NO_ASM) + __asm("bswap\t%0" : "=r" (x) : "0" (x)); /* gcc x86 version */ + return x; +# else + x = ((x << 8) & 0xFF00FF00u) | ((x >> 8) & 0x00FF00FFu); + return (x >> 16) | (x << 16); +# endif +} +static RHASH_INLINE uint64_t bswap_64(uint64_t x) +{ + union { + uint64_t ll; + uint32_t l[2]; + } w, r; + w.ll = x; + r.l[0] = bswap_32(w.l[1]); + r.l[1] = bswap_32(w.l[0]); + return r.ll; +} +#endif /* bswap definitions */ + +#ifdef CPU_BIG_ENDIAN +# 
define be2me_32(x) (x) +# define be2me_64(x) (x) +# define le2me_32(x) bswap_32(x) +# define le2me_64(x) bswap_64(x) + +# define be32_copy(to, index, from, length) memcpy((to) + (index), (from), (length)) +# define le32_copy(to, index, from, length) rhash_swap_copy_str_to_u32((to), (index), (from), (length)) +# define be64_copy(to, index, from, length) memcpy((to) + (index), (from), (length)) +# define le64_copy(to, index, from, length) rhash_swap_copy_str_to_u64((to), (index), (from), (length)) +# define me64_to_be_str(to, from, length) memcpy((to), (from), (length)) +# define me64_to_le_str(to, from, length) rhash_swap_copy_u64_to_str((to), (from), (length)) + +#else /* CPU_BIG_ENDIAN */ +# define be2me_32(x) bswap_32(x) +# define be2me_64(x) bswap_64(x) +# define le2me_32(x) (x) +# define le2me_64(x) (x) + +# define be32_copy(to, index, from, length) rhash_swap_copy_str_to_u32((to), (index), (from), (length)) +# define le32_copy(to, index, from, length) memcpy((to) + (index), (from), (length)) +# define be64_copy(to, index, from, length) rhash_swap_copy_str_to_u64((to), (index), (from), (length)) +# define le64_copy(to, index, from, length) memcpy((to) + (index), (from), (length)) +# define me64_to_be_str(to, from, length) rhash_swap_copy_u64_to_str((to), (from), (length)) +# define me64_to_le_str(to, from, length) memcpy((to), (from), (length)) +#endif /* CPU_BIG_ENDIAN */ + +/* ROTL/ROTR macros rotate a 32/64-bit word left/right by n bits */ +#define ROTL32(dword, n) ((dword) << (n) ^ ((dword) >> (32 - (n)))) +#define ROTR32(dword, n) ((dword) >> (n) ^ ((dword) << (32 - (n)))) +#define ROTL64(qword, n) ((qword) << (n) ^ ((qword) >> (64 - (n)))) +#define ROTR64(qword, n) ((qword) >> (n) ^ ((qword) << (64 - (n)))) + +#ifdef __cplusplus +} /* extern "C" */ +#endif /* __cplusplus */ + +#endif /* BYTE_ORDER_H */ diff --git a/lib/checksums/sha3/sha3.c b/lib/checksums/sha3/sha3.c deleted file mode 100644 index 30f7b969..00000000 --- a/lib/checksums/sha3/sha3.c +++ 
/dev/null @@ -1,238 +0,0 @@ -/* ------------------------------------------------------------------------- - * Works when compiled for either 32-bit or 64-bit targets, optimized for - * 64 bit. - * - * Canonical implementation of Init/Update/Finalize for SHA-3 byte input. - * - * SHA3-256, SHA3-384, SHA-512 are implemented. SHA-224 can easily be added. - * - * Based on code from http://keccak.noekeon.org/ . - * - * I place the code that I wrote into public domain, free to use. - * - * I would appreciate if you give credits to this work if you used it to - * write or test * your code. - * - * Aug 2015. Andrey Jivsov. crypto@brainhub.org - * ---------------------------------------------------------------------- */ - -#include -#include -#include - -#include "sha3.h" - -#ifndef SHA3_ROTL64 -#define SHA3_ROTL64(x, y) (((x) << (y)) | ((x) >> ((sizeof(uint64_t) * 8) - (y)))) -#endif - -static const uint64_t keccakf_rndc[24] = { - SHA3_CONST(0x0000000000000001UL), SHA3_CONST(0x0000000000008082UL), - SHA3_CONST(0x800000000000808aUL), SHA3_CONST(0x8000000080008000UL), - SHA3_CONST(0x000000000000808bUL), SHA3_CONST(0x0000000080000001UL), - SHA3_CONST(0x8000000080008081UL), SHA3_CONST(0x8000000000008009UL), - SHA3_CONST(0x000000000000008aUL), SHA3_CONST(0x0000000000000088UL), - SHA3_CONST(0x0000000080008009UL), SHA3_CONST(0x000000008000000aUL), - SHA3_CONST(0x000000008000808bUL), SHA3_CONST(0x800000000000008bUL), - SHA3_CONST(0x8000000000008089UL), SHA3_CONST(0x8000000000008003UL), - SHA3_CONST(0x8000000000008002UL), SHA3_CONST(0x8000000000000080UL), - SHA3_CONST(0x000000000000800aUL), SHA3_CONST(0x800000008000000aUL), - SHA3_CONST(0x8000000080008081UL), SHA3_CONST(0x8000000000008080UL), - SHA3_CONST(0x0000000080000001UL), SHA3_CONST(0x8000000080008008UL)}; - -static const unsigned keccakf_rotc[24] = {1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, - 27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44}; - -static const unsigned keccakf_piln[24] = {10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 
4, - 15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1}; - -/* generally called after SHA3_KECCAK_SPONGE_WORDS-ctx->capacityWords words - * are XORed into the state s - */ -static void keccakf(uint64_t s[25]) { - int i, j, round; - uint64_t t, bc[5]; -#define KECCAK_ROUNDS 24 - - for(round = 0; round < KECCAK_ROUNDS; round++) { - /* Theta */ - for(i = 0; i < 5; i++) - bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20]; - - for(i = 0; i < 5; i++) { - t = bc[(i + 4) % 5] ^ SHA3_ROTL64(bc[(i + 1) % 5], 1); - for(j = 0; j < 25; j += 5) - s[j + i] ^= t; - } - - /* Rho Pi */ - t = s[1]; - for(i = 0; i < 24; i++) { - j = keccakf_piln[i]; - bc[0] = s[j]; - s[j] = SHA3_ROTL64(t, keccakf_rotc[i]); - t = bc[0]; - } - - /* Chi */ - for(j = 0; j < 25; j += 5) { - for(i = 0; i < 5; i++) - bc[i] = s[j + i]; - for(i = 0; i < 5; i++) - s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5]; - } - - /* Iota */ - s[0] ^= keccakf_rndc[round]; - } -} - -/* *************************** Public Inteface ************************ */ - -/* For Init or Reset call these: */ -void sha3_Init256(sha3_context *ctx) { - memset(ctx, 0, sizeof(*ctx)); - ctx->capacityWords = 2 * 256 / (8 * sizeof(uint64_t)); -} - -void sha3_Init384(sha3_context *ctx) { - memset(ctx, 0, sizeof(*ctx)); - ctx->capacityWords = 2 * 384 / (8 * sizeof(uint64_t)); -} - -void sha3_Init512(sha3_context *ctx) { - memset(ctx, 0, sizeof(*ctx)); - ctx->capacityWords = 2 * 512 / (8 * sizeof(uint64_t)); -} - -void sha3_Update(sha3_context *ctx, void const *bufIn, size_t len) { - /* 0...7 -- how much is needed to have a word */ - unsigned old_tail = (8 - ctx->byteIndex) & 7; - - size_t words; - unsigned tail; - size_t i; - - const uint8_t *buf = bufIn; - - SHA3_TRACE_BUF("called to update with:", buf, len); - - SHA3_ASSERT(ctx->byteIndex < 8); - SHA3_ASSERT(ctx->wordIndex < sizeof(ctx->s) / sizeof(ctx->s[0])); - - if(len < old_tail) { /* have no complete word or haven't started - * the word yet */ - SHA3_TRACE("because %d<%d, store it and 
return", (unsigned)len, - (unsigned)old_tail); - /* endian-independent code follows: */ - while(len--) - ctx->saved |= (uint64_t)(*(buf++)) << ((ctx->byteIndex++) * 8); - SHA3_ASSERT(ctx->byteIndex < 8); - return; - } - - if(old_tail) { /* will have one word to process */ - SHA3_TRACE("completing one word with %d bytes", (unsigned)old_tail); - /* endian-independent code follows: */ - len -= old_tail; - while(old_tail--) - ctx->saved |= (uint64_t)(*(buf++)) << ((ctx->byteIndex++) * 8); - - /* now ready to add saved to the sponge */ - ctx->s[ctx->wordIndex] ^= ctx->saved; - SHA3_ASSERT(ctx->byteIndex == 8); - ctx->byteIndex = 0; - ctx->saved = 0; - if(++ctx->wordIndex == (SHA3_KECCAK_SPONGE_WORDS - ctx->capacityWords)) { - keccakf(ctx->s); - ctx->wordIndex = 0; - } - } - - /* now work in full words directly from input */ - - SHA3_ASSERT(ctx->byteIndex == 0); - - words = len / sizeof(uint64_t); - tail = len - words * sizeof(uint64_t); - - SHA3_TRACE("have %d full words to process", (unsigned)words); - - for(i = 0; i < words; i++, buf += sizeof(uint64_t)) { - const uint64_t t = (uint64_t)(buf[0]) | ((uint64_t)(buf[1]) << 8 * 1) | - ((uint64_t)(buf[2]) << 8 * 2) | ((uint64_t)(buf[3]) << 8 * 3) | - ((uint64_t)(buf[4]) << 8 * 4) | ((uint64_t)(buf[5]) << 8 * 5) | - ((uint64_t)(buf[6]) << 8 * 6) | ((uint64_t)(buf[7]) << 8 * 7); -#if defined(__x86_64__) || defined(__i386__) - SHA3_ASSERT(memcmp(&t, buf, 8) == 0); -#endif - ctx->s[ctx->wordIndex] ^= t; - if(++ctx->wordIndex == (SHA3_KECCAK_SPONGE_WORDS - ctx->capacityWords)) { - keccakf(ctx->s); - ctx->wordIndex = 0; - } - } - - SHA3_TRACE("have %d bytes left to process, save them", (unsigned)tail); - - /* finally, save the partial word */ - SHA3_ASSERT(ctx->byteIndex == 0 && tail < 8); - while(tail--) { - SHA3_TRACE("Store byte %02x '%c'", *buf, *buf); - ctx->saved |= (uint64_t)(*(buf++)) << ((ctx->byteIndex++) * 8); - } - SHA3_ASSERT(ctx->byteIndex < 8); - SHA3_TRACE("Have saved=0x%016" PRIx64 " at the end", ctx->saved); -} 
- -/* This is simply the 'update' with the padding block. - * The padding block is 0x01 || 0x00* || 0x80. First 0x01 and last 0x80 - * bytes are always present, but they can be the same byte. - */ -void const *sha3_Finalize(sha3_context *ctx) { - SHA3_TRACE("called with %d bytes in the buffer", ctx->byteIndex); - -/* Append 2-bit suffix 01, per SHA-3 spec. Instead of 1 for padding we - * use 1<<2 below. The 0x02 below corresponds to the suffix 01. - * Overall, we feed 0, then 1, and finally 1 to start padding. Without - * M || 01, we would simply use 1 to start padding. */ - -#ifndef SHA3_USE_KECCAK - /* SHA3 version */ - ctx->s[ctx->wordIndex] ^= (ctx->saved ^ ((uint64_t)((uint64_t)(0x02 | (1 << 2)) - << ((ctx->byteIndex) * 8)))); -#else - /* For testing the "pure" Keccak version */ - ctx->s[ctx->wordIndex] ^= - (ctx->saved ^ ((uint64_t)((uint64_t)1 << (ctx->byteIndex * 8)))); -#endif - - ctx->s[SHA3_KECCAK_SPONGE_WORDS - ctx->capacityWords - 1] ^= - SHA3_CONST(0x8000000000000000UL); - keccakf(ctx->s); - - /* Return first bytes of the ctx->s. This conversion is not needed for - * little-endian platforms e.g. wrap with #if !defined(__BYTE_ORDER__) - * || !defined(__ORDER_LITTLE_ENDIAN__) || \ - * __BYTE_ORDER__!=__ORDER_LITTLE_ENDIAN__ ... the conversion below ... 
- * #endif */ - { - unsigned i; - for(i = 0; i < SHA3_KECCAK_SPONGE_WORDS; i++) { - const unsigned t1 = (uint32_t)ctx->s[i]; - const unsigned t2 = (uint32_t)((ctx->s[i] >> 16) >> 16); - ctx->sb[i * 8 + 0] = (uint8_t)(t1); - ctx->sb[i * 8 + 1] = (uint8_t)(t1 >> 8); - ctx->sb[i * 8 + 2] = (uint8_t)(t1 >> 16); - ctx->sb[i * 8 + 3] = (uint8_t)(t1 >> 24); - ctx->sb[i * 8 + 4] = (uint8_t)(t2); - ctx->sb[i * 8 + 5] = (uint8_t)(t2 >> 8); - ctx->sb[i * 8 + 6] = (uint8_t)(t2 >> 16); - ctx->sb[i * 8 + 7] = (uint8_t)(t2 >> 24); - } - } - - SHA3_TRACE_BUF("Hash: (first 32 bytes)", ctx->sb, 256 / 8); - - return (ctx->sb); -} diff --git a/lib/checksums/sha3/sha3.h b/lib/checksums/sha3/sha3.h deleted file mode 100644 index 39d03d95..00000000 --- a/lib/checksums/sha3/sha3.h +++ /dev/null @@ -1,54 +0,0 @@ -#ifndef _RM_CHECKSUM_SHA3 -#define _RM_CHECKSUM_SHA3 - -#include - -#define SHA3_ASSERT(x) -#if defined(_MSC_VER) -#define SHA3_TRACE(format, ...) -#define SHA3_TRACE_BUF(format, buf, l, ...) -#else -#define SHA3_TRACE(format, args...) -#define SHA3_TRACE_BUF(format, buf, l, args...) -#endif - -//#define SHA3_USE_KECCAK -/* - * Define SHA3_USE_KECCAK to run "pure" Keccak, as opposed to SHA3. - * The tests that this macro enables use the input and output from [Keccak] - * (see the reference below). The used test vectors aren't correct for SHA3, - * however, they are helpful to verify the implementation. - * SHA3_USE_KECCAK only changes one line of code in Finalize. 
- */ - -#if defined(_MSC_VER) -#define SHA3_CONST(x) x -#else -#define SHA3_CONST(x) x##L -#endif - -/* 'Words' here refers to uint64_t */ -#define SHA3_KECCAK_SPONGE_WORDS (((1600) / 8 /*bits to byte*/) / sizeof(uint64_t)) -typedef struct sha3_context_ { - uint64_t saved; /* the portion of the input message that we - * didn't consume yet */ - union { /* Keccak's state */ - uint64_t s[SHA3_KECCAK_SPONGE_WORDS]; - uint8_t sb[SHA3_KECCAK_SPONGE_WORDS * 8]; - }; - unsigned byteIndex; /* 0..7--the next byte after the set one - * (starts from 0; 0--none are buffered) */ - unsigned wordIndex; /* 0..24--the next word to integrate input - * (starts from 0) */ - unsigned capacityWords; /* the double size of the hash output in - * words (e.g. 16 for Keccak 512) */ -} sha3_context; - -void sha3_Init256(sha3_context *ctx); -void sha3_Init384(sha3_context *ctx); -void sha3_Init512(sha3_context *ctx); - -void sha3_Update(sha3_context *ctx, void const *bufIn, size_t len); -void const *sha3_Finalize(sha3_context *ctx); - -#endif /* _RM_CHECKSUM_SHA3 */ diff --git a/lib/checksums/sha3/sha3_rhash.c b/lib/checksums/sha3/sha3_rhash.c new file mode 100644 index 00000000..05633b03 --- /dev/null +++ b/lib/checksums/sha3/sha3_rhash.c @@ -0,0 +1,356 @@ +/* sha3.c - an implementation of Secure Hash Algorithm 3 (Keccak). + * based on the + * The Keccak SHA-3 submission. Submission to NIST (Round 3), 2011 + * by Guido Bertoni, Joan Daemen, Michaël Peeters and Gilles Van Assche + * + * Copyright: 2013 Aleksey Kravchenko + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so. 
+ * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. Use this program at your own risk! + */ + +#include +#include +#include "byte_order.h" +#include "sha3_rhash.h" + +/* constants */ +#define NumberOfRounds 24 + +/* SHA3 (Keccak) constants for 24 rounds */ +static uint64_t keccak_round_constants[NumberOfRounds] = { + I64(0x0000000000000001), I64(0x0000000000008082), I64(0x800000000000808A), I64(0x8000000080008000), + I64(0x000000000000808B), I64(0x0000000080000001), I64(0x8000000080008081), I64(0x8000000000008009), + I64(0x000000000000008A), I64(0x0000000000000088), I64(0x0000000080008009), I64(0x000000008000000A), + I64(0x000000008000808B), I64(0x800000000000008B), I64(0x8000000000008089), I64(0x8000000000008003), + I64(0x8000000000008002), I64(0x8000000000000080), I64(0x000000000000800A), I64(0x800000008000000A), + I64(0x8000000080008081), I64(0x8000000000008080), I64(0x0000000080000001), I64(0x8000000080008008) +}; + +/* Initializing a sha3 context for given number of output bits */ +static void rhash_keccak_init(sha3_ctx *ctx, unsigned bits) +{ + /* NB: The Keccak capacity parameter = bits * 2 */ + unsigned rate = 1600 - bits * 2; + + memset(ctx, 0, sizeof(sha3_ctx)); + ctx->block_size = rate / 8; + assert(rate <= 1600 && (rate % 64) == 0); +} + +/** + * Initialize context before calculating hash. + * + * @param ctx context to initialize + */ +void rhash_sha3_224_init(sha3_ctx *ctx) +{ + rhash_keccak_init(ctx, 224); +} + +/** + * Initialize context before calculating hash. + * + * @param ctx context to initialize + */ +void rhash_sha3_256_init(sha3_ctx *ctx) +{ + rhash_keccak_init(ctx, 256); +} + +/** + * Initialize context before calculating hash. 
+ * + * @param ctx context to initialize + */ +void rhash_sha3_384_init(sha3_ctx *ctx) +{ + rhash_keccak_init(ctx, 384); +} + +/** + * Initialize context before calculating hash. + * + * @param ctx context to initialize + */ +void rhash_sha3_512_init(sha3_ctx *ctx) +{ + rhash_keccak_init(ctx, 512); +} + +/* Keccak theta() transformation */ +static void keccak_theta(uint64_t *A) +{ + unsigned int x; + uint64_t C[5], D[5]; + + for (x = 0; x < 5; x++) { + C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20]; + } + D[0] = ROTL64(C[1], 1) ^ C[4]; + D[1] = ROTL64(C[2], 1) ^ C[0]; + D[2] = ROTL64(C[3], 1) ^ C[1]; + D[3] = ROTL64(C[4], 1) ^ C[2]; + D[4] = ROTL64(C[0], 1) ^ C[3]; + + for (x = 0; x < 5; x++) { + A[x] ^= D[x]; + A[x + 5] ^= D[x]; + A[x + 10] ^= D[x]; + A[x + 15] ^= D[x]; + A[x + 20] ^= D[x]; + } +} + +/* Keccak pi() transformation */ +static void keccak_pi(uint64_t *A) +{ + uint64_t A1; + A1 = A[1]; + A[ 1] = A[ 6]; + A[ 6] = A[ 9]; + A[ 9] = A[22]; + A[22] = A[14]; + A[14] = A[20]; + A[20] = A[ 2]; + A[ 2] = A[12]; + A[12] = A[13]; + A[13] = A[19]; + A[19] = A[23]; + A[23] = A[15]; + A[15] = A[ 4]; + A[ 4] = A[24]; + A[24] = A[21]; + A[21] = A[ 8]; + A[ 8] = A[16]; + A[16] = A[ 5]; + A[ 5] = A[ 3]; + A[ 3] = A[18]; + A[18] = A[17]; + A[17] = A[11]; + A[11] = A[ 7]; + A[ 7] = A[10]; + A[10] = A1; + /* note: A[ 0] is left as is */ +} + +/* Keccak chi() transformation */ +static void keccak_chi(uint64_t *A) +{ + int i; + for (i = 0; i < 25; i += 5) { + uint64_t A0 = A[0 + i], A1 = A[1 + i]; + A[0 + i] ^= ~A1 & A[2 + i]; + A[1 + i] ^= ~A[2 + i] & A[3 + i]; + A[2 + i] ^= ~A[3 + i] & A[4 + i]; + A[3 + i] ^= ~A[4 + i] & A0; + A[4 + i] ^= ~A0 & A1; + } +} + +static void rhash_sha3_permutation(uint64_t *state) +{ + int round; + for (round = 0; round < NumberOfRounds; round++) + { + keccak_theta(state); + + /* apply Keccak rho() transformation */ + state[ 1] = ROTL64(state[ 1], 1); + state[ 2] = ROTL64(state[ 2], 62); + state[ 3] = ROTL64(state[ 3], 28); + 
state[ 4] = ROTL64(state[ 4], 27); + state[ 5] = ROTL64(state[ 5], 36); + state[ 6] = ROTL64(state[ 6], 44); + state[ 7] = ROTL64(state[ 7], 6); + state[ 8] = ROTL64(state[ 8], 55); + state[ 9] = ROTL64(state[ 9], 20); + state[10] = ROTL64(state[10], 3); + state[11] = ROTL64(state[11], 10); + state[12] = ROTL64(state[12], 43); + state[13] = ROTL64(state[13], 25); + state[14] = ROTL64(state[14], 39); + state[15] = ROTL64(state[15], 41); + state[16] = ROTL64(state[16], 45); + state[17] = ROTL64(state[17], 15); + state[18] = ROTL64(state[18], 21); + state[19] = ROTL64(state[19], 8); + state[20] = ROTL64(state[20], 18); + state[21] = ROTL64(state[21], 2); + state[22] = ROTL64(state[22], 61); + state[23] = ROTL64(state[23], 56); + state[24] = ROTL64(state[24], 14); + + keccak_pi(state); + keccak_chi(state); + + /* apply iota(state, round) */ + *state ^= keccak_round_constants[round]; + } +} + +/** + * The core transformation. Process the specified block of data. + * + * @param hash the algorithm state + * @param block the message block to process + * @param block_size the size of the processed block in bytes + */ +static void rhash_sha3_process_block(uint64_t hash[25], const uint64_t *block, size_t block_size) +{ + /* expanded loop */ + hash[ 0] ^= le2me_64(block[ 0]); + hash[ 1] ^= le2me_64(block[ 1]); + hash[ 2] ^= le2me_64(block[ 2]); + hash[ 3] ^= le2me_64(block[ 3]); + hash[ 4] ^= le2me_64(block[ 4]); + hash[ 5] ^= le2me_64(block[ 5]); + hash[ 6] ^= le2me_64(block[ 6]); + hash[ 7] ^= le2me_64(block[ 7]); + hash[ 8] ^= le2me_64(block[ 8]); + /* if not sha3-512 */ + if (block_size > 72) { + hash[ 9] ^= le2me_64(block[ 9]); + hash[10] ^= le2me_64(block[10]); + hash[11] ^= le2me_64(block[11]); + hash[12] ^= le2me_64(block[12]); + /* if not sha3-384 */ + if (block_size > 104) { + hash[13] ^= le2me_64(block[13]); + hash[14] ^= le2me_64(block[14]); + hash[15] ^= le2me_64(block[15]); + hash[16] ^= le2me_64(block[16]); + /* if not sha3-256 */ + if (block_size > 136) { + 
hash[17] ^= le2me_64(block[17]); +#ifdef FULL_SHA3_FAMILY_SUPPORT + /* if not sha3-224 */ + if (block_size > 144) { + hash[18] ^= le2me_64(block[18]); + hash[19] ^= le2me_64(block[19]); + hash[20] ^= le2me_64(block[20]); + hash[21] ^= le2me_64(block[21]); + hash[22] ^= le2me_64(block[22]); + hash[23] ^= le2me_64(block[23]); + hash[24] ^= le2me_64(block[24]); + } +#endif + } + } + } + /* make a permutation of the hash */ + rhash_sha3_permutation(hash); +} + +#define SHA3_FINALIZED 0x80000000 + +/** + * Calculate message hash. + * Can be called repeatedly with chunks of the message to be hashed. + * + * @param ctx the algorithm context containing current hashing state + * @param msg message chunk + * @param size length of the message chunk + */ +void rhash_sha3_update(sha3_ctx *ctx, const unsigned char *msg, size_t size) +{ + size_t index = (size_t)ctx->rest; + size_t block_size = (size_t)ctx->block_size; + + if (ctx->rest & SHA3_FINALIZED) return; /* too late for additional input */ + ctx->rest = (unsigned)((ctx->rest + size) % block_size); + + /* fill partial block */ + if (index) { + size_t left = block_size - index; + memcpy((char*)ctx->message + index, msg, (size < left ? size : left)); + if (size < left) return; + + /* process partial block */ + rhash_sha3_process_block(ctx->hash, ctx->message, block_size); + msg += left; + size -= left; + } + while (size >= block_size) { + uint64_t* aligned_message_block; + if (IS_ALIGNED_64(msg)) { + /* the most common case is processing of an already aligned message + without copying it */ + aligned_message_block = (uint64_t*)msg; + } else { + memcpy(ctx->message, msg, block_size); + aligned_message_block = ctx->message; + } + + rhash_sha3_process_block(ctx->hash, aligned_message_block, block_size); + msg += block_size; + size -= block_size; + } + if (size) { + memcpy(ctx->message, msg, size); /* save leftovers */ + } +} + +/** + * Store calculated hash into the given array. 
+ * + * @param ctx the algorithm context containing current hashing state + * @param result calculated hash in binary form + */ +void rhash_sha3_final(sha3_ctx *ctx, unsigned char* result) +{ + size_t digest_length = 100 - ctx->block_size / 2; + const size_t block_size = ctx->block_size; + + if (!(ctx->rest & SHA3_FINALIZED)) + { + /* clear the rest of the data queue */ + memset((char*)ctx->message + ctx->rest, 0, block_size - ctx->rest); + ((char*)ctx->message)[ctx->rest] |= 0x06; + ((char*)ctx->message)[block_size - 1] |= 0x80; + + /* process final block */ + rhash_sha3_process_block(ctx->hash, ctx->message, block_size); + ctx->rest = SHA3_FINALIZED; /* mark context as finalized */ + } + + assert(block_size > digest_length); + if (result) me64_to_le_str(result, ctx->hash, digest_length); +} + +#ifdef USE_KECCAK +/** +* Store calculated hash into the given array. +* +* @param ctx the algorithm context containing current hashing state +* @param result calculated hash in binary form +*/ +void rhash_keccak_final(sha3_ctx *ctx, unsigned char* result) +{ + size_t digest_length = 100 - ctx->block_size / 2; + const size_t block_size = ctx->block_size; + + if (!(ctx->rest & SHA3_FINALIZED)) + { + /* clear the rest of the data queue */ + memset((char*)ctx->message + ctx->rest, 0, block_size - ctx->rest); + ((char*)ctx->message)[ctx->rest] |= 0x01; + ((char*)ctx->message)[block_size - 1] |= 0x80; + + /* process final block */ + rhash_sha3_process_block(ctx->hash, ctx->message, block_size); + ctx->rest = SHA3_FINALIZED; /* mark context as finalized */ + } + + assert(block_size > digest_length); + if (result) me64_to_le_str(result, ctx->hash, digest_length); +} +#endif /* USE_KECCAK */ diff --git a/lib/checksums/sha3/sha3_rhash.h b/lib/checksums/sha3/sha3_rhash.h new file mode 100644 index 00000000..28319978 --- /dev/null +++ b/lib/checksums/sha3/sha3_rhash.h @@ -0,0 +1,54 @@ +/* sha3.h */ +#ifndef RHASH_SHA3_H +#define RHASH_SHA3_H +#include "ustd.h" + +#ifdef __cplusplus 
+extern "C" { +#endif + +#define sha3_224_hash_size 28 +#define sha3_256_hash_size 32 +#define sha3_384_hash_size 48 +#define sha3_512_hash_size 64 +#define sha3_max_permutation_size 25 +#define sha3_max_rate_in_qwords 24 + +/** + * SHA3 Algorithm context. + */ +typedef struct sha3_ctx +{ + /* 1600 bits algorithm hashing state */ + uint64_t hash[sha3_max_permutation_size]; + /* 1536-bit buffer for leftovers */ + uint64_t message[sha3_max_rate_in_qwords]; + /* count of bytes in the message[] buffer */ + unsigned rest; + /* size of a message block processed at once */ + unsigned block_size; +} sha3_ctx; + +/* methods for calculating the hash function */ + +void rhash_sha3_224_init(sha3_ctx *ctx); +void rhash_sha3_256_init(sha3_ctx *ctx); +void rhash_sha3_384_init(sha3_ctx *ctx); +void rhash_sha3_512_init(sha3_ctx *ctx); +void rhash_sha3_update(sha3_ctx *ctx, const unsigned char* msg, size_t size); +void rhash_sha3_final(sha3_ctx *ctx, unsigned char* result); + +#ifdef USE_KECCAK +#define rhash_keccak_224_init rhash_sha3_224_init +#define rhash_keccak_256_init rhash_sha3_256_init +#define rhash_keccak_384_init rhash_sha3_384_init +#define rhash_keccak_512_init rhash_sha3_512_init +#define rhash_keccak_update rhash_sha3_update +void rhash_keccak_final(sha3_ctx *ctx, unsigned char* result); +#endif + +#ifdef __cplusplus +} /* extern "C" */ +#endif /* __cplusplus */ + +#endif /* RHASH_SHA3_H */ diff --git a/lib/checksums/sha3/ustd.h b/lib/checksums/sha3/ustd.h new file mode 100644 index 00000000..94f1ae26 --- /dev/null +++ b/lib/checksums/sha3/ustd.h @@ -0,0 +1,30 @@ +/* ustd.h common macros and includes */ +#ifndef LIBRHASH_USTD_H +#define LIBRHASH_USTD_H + +#if _MSC_VER >= 1300 + +# define int64_t __int64 +# define int32_t __int32 +# define int16_t __int16 +# define int8_t __int8 +# define uint64_t unsigned __int64 +# define uint32_t unsigned __int32 +# define uint16_t unsigned __int16 +# define uint8_t unsigned __int8 + +/* disable warnings: The POSIX name for this 
item is deprecated. Use the ISO C++ conformant name. */ +#pragma warning(disable : 4996) + +#else /* _MSC_VER >= 1300 */ + +# include +# include + +#endif /* _MSC_VER >= 1300 */ + +#if _MSC_VER <= 1300 +# include /* size_t for vc6.0 */ +#endif /* _MSC_VER <= 1300 */ + +#endif /* LIBRHASH_USTD_H */ diff --git a/lib/utilities.h b/lib/utilities.h index 75ecb9e5..114f4bb7 100644 --- a/lib/utilities.h +++ b/lib/utilities.h @@ -34,6 +34,8 @@ #include #include #include +#include + /* Pat(h)tricia Trie implementation */ #include "pathtricia.h" From 080dc1879dad5859d40fea9f8ad5c8e1a82b24f5 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 14 Nov 2017 21:37:16 +1000 Subject: [PATCH 117/180] SConstruct: optimise even if DEBUG=1; allow custom option via e.g. `scons O=2` --- SConstruct | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/SConstruct b/SConstruct index b4caa04c..441d370b 100755 --- a/SConstruct +++ b/SConstruct @@ -11,6 +11,8 @@ import SCons import SCons.Conftest as tests from SCons.Script.SConscript import SConsEnvironment +DEFAULT_OPTIMISATION='s' # compile with -Os + pkg_config = os.getenv('PKG_CONFIG') or 'pkg-config' def read_version(): @@ -635,9 +637,9 @@ if ARGUMENTS.get('DEBUG') == "1": conf.env.Append(CCFLAGS=['-ggdb3']) else: # Generic compiler: - conf.env.Append(CCFLAGS=['-Os']) conf.env.Append(LINKFLAGS=['-s']) + if 'clang' in os.path.basename(conf.env['CC']): conf.env.Append(CCFLAGS=['-fcolor-diagnostics']) # Colored warnings conf.env.Append(CCFLAGS=['-Qunused-arguments']) # Hide wrong messages @@ -677,6 +679,11 @@ conf.check_sysmacro_h() if conf.env['HAVE_LIBELF']: conf.env.Append(_LIBFLAGS=['-lelf']) +# compiler optimisations: +o_option = '-O' + (ARGUMENTS.get('O') or DEFAULT_OPTIMISATION) +print("Using compiler optimisation {} (to change, run scons with O=[0|1|2|3|s|fast])".format(o_option)) +conf.env.Append(CCFLAGS=[o_option]) + SConsEnvironment.Chmod = SCons.Action.ActionFactory( os.chmod, lambda dest, mode: 'Chmod("%s", 
0%o)' % (dest, mode) From d913934b5f7a6df7ef98155ec0d5bb6010da2dd8 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 14 Nov 2017 23:38:48 +1000 Subject: [PATCH 118/180] xattr: fix logic error (thanks clang) --- lib/xattr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/xattr.c b/lib/xattr.c index 20c1e4e5..b37f039e 100644 --- a/lib/xattr.c +++ b/lib/xattr.c @@ -199,7 +199,7 @@ gboolean rm_xattr_read_hash(RmFile *file, RmSession *session) { return FALSE; } - if(cksum_hex_str == NULL || strcmp(cksum_hex_str, "")==0) { + if(cksum_hex_str[0] == 0 || strcmp(cksum_hex_str, "")==0) { return FALSE; } From 6eba93c9a3e5b51bd15415b19f7fdc31ffeb3e05 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 15 Nov 2017 06:02:07 +1000 Subject: [PATCH 119/180] murmur: stop clang complaining about unused function --- lib/checksums/murmur3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/checksums/murmur3.c b/lib/checksums/murmur3.c index 23436394..20aa1fc5 100644 --- a/lib/checksums/murmur3.c +++ b/lib/checksums/murmur3.c @@ -18,7 +18,7 @@ //----------------------------------------------------------------------------- // Platform-specific functions and macros -static inline uint32_t rotl32(uint32_t x, int8_t r) { +__attribute__((__unused__)) static inline uint32_t rotl32(uint32_t x, int8_t r) { return (x << r) | (x >> (32 - r)); } From 69639eec2b12f0313613063820ccfc5612c066d0 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 15 Nov 2017 06:15:23 +1000 Subject: [PATCH 120/180] SConstruct: fix / clarify optimisation settings --- SConstruct | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/SConstruct b/SConstruct index 441d370b..925e6fd8 100755 --- a/SConstruct +++ b/SConstruct @@ -632,14 +632,6 @@ conf.check_sse4() if conf.env['HAVE_SSE4']: conf.env.Append(CCFLAGS=['-msse4']) - -if ARGUMENTS.get('DEBUG') == "1": - conf.env.Append(CCFLAGS=['-ggdb3']) -else: - # Generic compiler: - 
conf.env.Append(LINKFLAGS=['-s']) - - if 'clang' in os.path.basename(conf.env['CC']): conf.env.Append(CCFLAGS=['-fcolor-diagnostics']) # Colored warnings conf.env.Append(CCFLAGS=['-Qunused-arguments']) # Hide wrong messages @@ -679,10 +671,19 @@ conf.check_sysmacro_h() if conf.env['HAVE_LIBELF']: conf.env.Append(_LIBFLAGS=['-lelf']) -# compiler optimisations: -o_option = '-O' + (ARGUMENTS.get('O') or DEFAULT_OPTIMISATION) -print("Using compiler optimisation {} (to change, run scons with O=[0|1|2|3|s|fast])".format(o_option)) -conf.env.Append(CCFLAGS=[o_option]) +# compiler optimisation and debug symbols: +cc_O_option = '-O' +if ARGUMENTS.get('DEBUG') == "1": + print("Compiling with gdb extra debug symbols") + conf.env.Append(CCFLAGS=['-ggdb3', '-fno-inline']) + cc_O_option += (ARGUMENTS.get('O') or '0') +else: + conf.env.Append(LINKFLAGS=['-s']) + cc_O_option += (ARGUMENTS.get('O') or DEFAULT_OPTIMISATION) + +print("Using compiler optimisation {} (to change, run scons with O=[0|1|2|3|s|fast])".format(cc_O_option)) +conf.env.Append(CCFLAGS=[cc_O_option]) + SConsEnvironment.Chmod = SCons.Action.ActionFactory( os.chmod, From f0dc3e95528f7aed4955cd5ddab8f41e52802309 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 16 Nov 2017 21:36:23 +1000 Subject: [PATCH 121/180] checksum: dispense with RmBufferPool (the g_slice allocator effectively does the same thing) --- lib/checksum.c | 79 ++++++-------------------------------------------- lib/checksum.h | 78 ++++++------------------------------------------- lib/hasher.c | 27 +++++++---------- 3 files changed, 29 insertions(+), 155 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index e1ba32c6..08c30eac 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -51,83 +51,22 @@ #define _RM_CHECKSUM_DEBUG 0 +////////////////////////////////// +// BUFFER IMPLEMENTATION // +////////////////////////////////// -/////////////////////////////////////// -// BUFFER POOL IMPLEMENTATION // -/////////////////////////////////////// - 
-RmOff rm_buffer_size(RmBufferPool *pool) { - return pool->buffer_size; -} - -static RmBuffer *rm_buffer_new(RmBufferPool *pool) { +RmBuffer *rm_buffer_new(gsize buf_size) { RmBuffer *self = g_slice_new0(RmBuffer); - self->pool = pool; - self->data = g_slice_alloc(pool->buffer_size); + self->data = g_slice_alloc(buf_size); + self->buf_size = buf_size; return self; } -static void rm_buffer_free(RmBuffer *buf) { - g_slice_free1(buf->pool->buffer_size, buf->data); +void rm_buffer_free(RmBuffer *buf) { + g_slice_free1(buf->buf_size, buf->data); g_slice_free(RmBuffer, buf); } -RmBufferPool *rm_buffer_pool_init(gsize buffer_size, gsize max_mem) { - RmBufferPool *self = g_slice_new0(RmBufferPool); - self->buffer_size = buffer_size; - self->avail_buffers = max_mem ? MAX(max_mem / buffer_size, 1) : (gsize)-1; - - g_cond_init(&self->change); - g_mutex_init(&self->lock); - return self; -} - -void rm_buffer_pool_destroy(RmBufferPool *pool) { - g_slist_free_full(pool->stack, (GDestroyNotify)rm_buffer_free); - - g_mutex_clear(&pool->lock); - g_cond_clear(&pool->change); - g_slice_free(RmBufferPool, pool); -} - -RmBuffer *rm_buffer_get(RmBufferPool *pool) { - RmBuffer *buffer = NULL; - g_mutex_lock(&pool->lock); - { - while(!buffer) { - buffer = rm_util_slist_pop(&pool->stack, NULL); - if(!buffer && pool->avail_buffers > 0) { - buffer = rm_buffer_new(pool); - } - if(!buffer) { - if(!pool->mem_warned) { - rm_log_warning_line( - "read buffer limit reached - waiting for " - "processing to catch up"); - pool->mem_warned = true; - } - g_cond_wait(&pool->change, &pool->lock); - } - } - pool->avail_buffers--; - } - g_mutex_unlock(&pool->lock); - - rm_assert_gentle(buffer); - return buffer; -} - -void rm_buffer_release(RmBuffer *buf) { - RmBufferPool *pool = buf->pool; - g_mutex_lock(&pool->lock); - { - pool->avail_buffers++; - g_cond_signal(&pool->change); - pool->stack = g_slist_prepend(pool->stack, buf); - } - g_mutex_unlock(&pool->lock); -} - static gboolean rm_buffer_equal(RmBuffer 
*a, RmBuffer *b) { return (a->len == b->len && memcmp(a->data, b->data, a->len) == 0); } @@ -861,7 +800,7 @@ void rm_digest_buffered_update(RmBuffer *buffer) { RmDigest *digest = buffer->digest; if(digest->type != RM_DIGEST_PARANOID) { rm_digest_update(digest, buffer->data, buffer->len); - rm_buffer_release(buffer); + rm_buffer_free(buffer); } else { RmParanoid *paranoid = digest->state; /* paranoid update... */ diff --git a/lib/checksum.h b/lib/checksum.h index 5a506f62..3ae83db1 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -110,31 +110,8 @@ typedef struct RmDigest { } RmDigest; -/////////// RmBufferPool and RmBuffer //////////////// +/////////// RmBuffer //////////////// -typedef struct RmBufferPool { - /* Place where recycled buffers are stored */ - GSList *stack; - - /* size of each buffer */ - gsize buffer_size; - - /* how many new buffers can we allocate before hitting mem limit? */ - gsize avail_buffers; - - /* Buffers that were kept for paranoia (internal) */ - gsize kept_buffers; - - gsize min_kept_buffers; - gsize max_kept_buffers; - - /* Flag to prevent double warnings. */ - bool mem_warned; - - /* concurrent accesses may happen */ - GMutex lock; - GCond change; -} RmBufferPool; /* Represents one block of read data */ typedef struct RmBuffer { @@ -145,19 +122,24 @@ typedef struct RmBuffer { /* checksum the data belongs to */ struct RmDigest *digest; + /* len of data */ + guint32 buf_size; + /* len of the data actually filled */ guint32 len; /* user utility data field */ gpointer user_data; - /* the pool the buffer belongs to */ - RmBufferPool *pool; - /* pointer to the data block */ unsigned char *data; } RmBuffer; +RmBuffer *rm_buffer_new(gsize buf_size); + +void rm_buffer_free(RmBuffer *buf); + + /** * @brief Convert a string like "md5" to a RmDigestType member. 
* @@ -308,48 +290,6 @@ void rm_digest_paranoia_shrink(RmDigest *digest, gsize new_size); */ void rm_digest_release_buffers(RmDigest *digest); -/** - * @brief Return the size of an individual buffer. - */ -RmOff rm_buffer_size(RmBufferPool *pool); - -/** - * @brief Create a new buffer pool. - * - * A buffer pool holds a number of same-sized RmBuffer structs - * up to a maximum number of bytes. - * - * If the limit is hit, rm_buffer_get() will block till - * other buffers were released. - * - * @param buffer_size The size of each buffer. - * @param max_mem Maxmimum number of bytes the pool may allocate; 0 for no limit. - * - * @return A readily usable RmBufferPool. - */ -RmBufferPool *rm_buffer_pool_init(gsize buffer_size, gsize max_mem); - -/** - * @brief Destroy a RmBufferPool. - * - * This can only be safely called when no parallel access to the pool is done. - */ -void rm_buffer_pool_destroy(RmBufferPool *pool); - -/** - * @brief Retrieve a RmBuffer. - * - * This might be either a previously used one or initially allocate one. - */ -RmBuffer *rm_buffer_get(RmBufferPool *pool); - -/** - * @brief Release a previously retrieved buffer. - * - * It will be either cached or freed if over the limit. - */ -void rm_buffer_release(RmBuffer *buf); - /** * @brief Send a new (pending) paranoid digest match `candidate` for `target`. 
*/ diff --git a/lib/hasher.c b/lib/hasher.c index 05896493..e570b230 100644 --- a/lib/hasher.c +++ b/lib/hasher.c @@ -54,7 +54,6 @@ struct _RmHasher { gboolean use_buffered_read; guint64 cache_quota_bytes; gpointer session_user_data; - RmBufferPool *mem_pool; RmHasherCallback callback; GAsyncQueue *hashpipe_pool; @@ -103,7 +102,7 @@ static void rm_hasher_hashpipe_worker(RmBuffer *buffer, RmHasher *hasher) { hasher->callback(hasher, task->digest, hasher->session_user_data, task->task_user_data); rm_hasher_task_free(task); - rm_buffer_release(buffer); + rm_buffer_free(buffer); g_mutex_lock(&hasher->lock); { @@ -136,12 +135,12 @@ static gboolean rm_hasher_symlink_read(RmHasher *hasher, GThreadPool *hashpipe, gsize *bytes_actually_read) { /* Read contents of symlink (i.e. path of symlink's target). */ - RmBuffer *buffer = rm_buffer_get(hasher->mem_pool); - gint len = readlink(path, (char *)buffer->data, rm_buffer_size(hasher->mem_pool)); + RmBuffer *buffer = rm_buffer_new(hasher->buf_size); + gint len = readlink(path, (char *)buffer->data, hasher->buf_size); if (len < 0) { rm_log_perror("Cannot read symbolic link"); - rm_buffer_release(buffer); + rm_buffer_free(buffer); return FALSE; } @@ -182,7 +181,7 @@ static gboolean rm_hasher_buffered_read(RmHasher *hasher, GThreadPool *hashpipe, gsize bytes_remaining = bytes_to_read; while(TRUE) { - RmBuffer *buffer = rm_buffer_get(hasher->mem_pool); + RmBuffer *buffer = rm_buffer_new(hasher->buf_size); gsize want_bytes = MIN(bytes_remaining, hasher->buf_size); @@ -190,7 +189,7 @@ static gboolean rm_hasher_buffered_read(RmHasher *hasher, GThreadPool *hashpipe, if(ferror(fd) != 0) { rm_log_perror("fread(3) failed"); - rm_buffer_release(buffer); + rm_buffer_free(buffer); break; } @@ -275,7 +274,7 @@ static gboolean rm_hasher_unbuffered_read(RmHasher *hasher, GThreadPool *hashpip while(TRUE) { /* allocate buffers for preadv */ for(int i = 0; i < N_BUFFERS; ++i) { - buffers[i] = rm_buffer_get(hasher->mem_pool); + buffers[i] = 
rm_buffer_new(hasher->buf_size); readvec[i].iov_base = buffers[i]->data; readvec[i].iov_len = hasher->buf_size; } @@ -287,7 +286,7 @@ static gboolean rm_hasher_unbuffered_read(RmHasher *hasher, GThreadPool *hashpip rm_log_perror("preadv failed"); /* Release the buffers and give up*/ for(int i = 0; i < N_BUFFERS; ++i) { - rm_buffer_release(buffers[i]); + rm_buffer_free(buffers[i]); } break; } @@ -312,7 +311,7 @@ static gboolean rm_hasher_unbuffered_read(RmHasher *hasher, GThreadPool *hashpip buffer->user_data = NULL; rm_util_thread_pool_push(hashpipe, buffer); } else { - rm_buffer_release(buffer); + rm_buffer_free(buffer); } } @@ -384,10 +383,7 @@ RmHasher *rm_hasher_new(RmDigestType digest_type, /* initialise mutex & cond */ g_mutex_init(&self->lock); g_cond_init(&self->cond); - - /* Create buffer mem pool */ - self->mem_pool = rm_buffer_pool_init(buf_size, cache_quota_bytes); - + /* Create a pool of hashing thread "pools" - each "pool" can only have * one thread because hashing must be done in order */ self->hashpipe_pool = g_async_queue_new_full((GDestroyNotify)rm_hasher_hashpipe_free); @@ -413,7 +409,6 @@ void rm_hasher_free(RmHasher *hasher, gboolean wait) { g_async_queue_unref(hasher->hashpipe_pool); - rm_buffer_pool_destroy(hasher->mem_pool); g_cond_clear(&hasher->cond); g_mutex_clear(&hasher->lock); g_slice_free(RmHasher, hasher); @@ -483,7 +478,7 @@ RmDigest *rm_hasher_task_finish(RmHasherTask *task) { /* get a dummy buffer to use to signal the hasher thread that this increment is * finished */ RmHasher *hasher = task->hasher; - RmBuffer *finisher = rm_buffer_get(task->hasher->mem_pool); + RmBuffer *finisher = rm_buffer_new(task->hasher->buf_size); finisher->digest = task->digest; finisher->len = 0; finisher->user_data = task; From be0048a88a18b5095bf6ba3a0559a69ac8e44834 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 16 Nov 2017 21:43:57 +1000 Subject: [PATCH 122/180] checksum: remove redundant rm_digest_paranoia_shrink --- lib/checksum.c | 5 ----- 
lib/checksum.h | 7 ------- 2 files changed, 12 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 08c30eac..85bb549f 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -771,11 +771,6 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s return digest; } -void rm_digest_paranoia_shrink(RmDigest *digest, gsize new_size) { - rm_assert_gentle(digest->type == RM_DIGEST_PARANOID); - digest->bytes = new_size; -} - void rm_digest_release_buffers(RmDigest *digest) { RmParanoid *paranoid = digest->state; if(paranoid && paranoid->buffers) { diff --git a/lib/checksum.h b/lib/checksum.h index 3ae83db1..a7bca81c 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -278,13 +278,6 @@ RmDigest *rm_digest_copy(RmDigest *digest); */ int rm_digest_get_bytes(RmDigest *self); -/** - * Shrink the paranoid checksum buffer to new_size. - * - * This is mainly useful for using an adjusted buffer for symlinks. - */ -void rm_digest_paranoia_shrink(RmDigest *digest, gsize new_size); - /** * Release any kept (paranoid) buffers. 
*/ From d72806d9f8aa139e8736531690806f527efc6226 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 16 Nov 2017 21:51:29 +1000 Subject: [PATCH 123/180] checksum: rename "spec" to "interface" --- lib/checksum.c | 151 +++++++++++++++++++++++++------------------------ 1 file changed, 76 insertions(+), 75 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 85bb549f..9162620b 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -72,16 +72,17 @@ static gboolean rm_buffer_equal(RmBuffer *a, RmBuffer *b) { } /////////////////////////////////////// -// RMDIGEST IMPLEMENTATION // +// RMDIGEST INTERFACE DEFINITIONS // /////////////////////////////////////// +/* Each digest type must have an RmDigestInterface defined as follows: */ typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); typedef void (*RmDigestFreeFunc)(RmDigest *digest); typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, RmOff size); typedef void (*RmDigestCopyFunc)(RmDigest *digest, RmDigest *copy); typedef void (*RmDigestStealFunc)(RmDigest *digest, guint8 *result); -typedef struct RmDigestSpec { +typedef struct RmDigestInterface { const char *name; const uint bits; // length of the output checksum in bits RmDigestInitFunc init; // performs initialisation of digest->state @@ -89,7 +90,7 @@ typedef struct RmDigestSpec { RmDigestUpdateFunc update; RmDigestCopyFunc copy; RmDigestStealFunc steal; -} RmDigestSpec; +} RmDigestInterface; /////////////////////////// @@ -158,7 +159,7 @@ static void rm_digest_xxhash_steal(RmDigest *digest, guint8 *result) { } -static const RmDigestSpec xxhash_spec = { "xxhash", 64, rm_digest_xxhash_init, rm_digest_xxhash_free, rm_digest_xxhash_update, rm_digest_xxhash_copy, rm_digest_xxhash_steal}; +static const RmDigestInterface xxhash_interface = { "xxhash", 64, rm_digest_xxhash_init, rm_digest_xxhash_free, rm_digest_xxhash_update, rm_digest_xxhash_copy, rm_digest_xxhash_steal}; 
/////////////////////////// @@ -196,7 +197,7 @@ static void rm_digest_murmur_x86_128_init(RmDigest *digest, RmOff seed1, RmOff s digest->state = MurmurHash3_x86_128_new(seed1, seed1>>32, seed2, seed2>>32); } -static const RmDigestSpec murmur_spec = { "murmur", 128, MURMUR_FUNCS(x86_128)}; +static const RmDigestInterface murmur_interface = { "murmur", 128, MURMUR_FUNCS(x86_128)}; #elif RM_PLATFORM_64 @@ -206,7 +207,7 @@ static void rm_digest_murmur_x64_128_init(RmDigest *digest, RmOff seed1, RmOff s digest->state = MurmurHash3_x64_128_new(seed1, seed2); } -static const RmDigestSpec murmur_spec = { "murmur", 128, MURMUR_FUNCS(x64_128)}; +static const RmDigestInterface murmur_interface = { "murmur", 128, MURMUR_FUNCS(x64_128)}; #else #error "Probably not a good idea to compile rmlint on 16bit." @@ -258,8 +259,8 @@ static void rm_digest_metro256_steal(RmDigest *digest, guint8 *result) { } -static const RmDigestSpec metro_spec = {"metro", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_update, rm_digest_metro_copy, rm_digest_metro_steal }; -static const RmDigestSpec metro256_spec = {"metro256", 256, rm_digest_metro256_init, rm_digest_metro256_free, rm_digest_metro256_update, rm_digest_metro256_copy, rm_digest_metro256_steal }; +static const RmDigestInterface metro_interface = {"metro", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_update, rm_digest_metro_copy, rm_digest_metro_steal }; +static const RmDigestInterface metro256_interface = {"metro256", 256, rm_digest_metro256_init, rm_digest_metro256_free, rm_digest_metro256_update, rm_digest_metro256_copy, rm_digest_metro256_steal }; #if HAVE_SSE4 @@ -280,8 +281,8 @@ static void rm_digest_metro256_crc_steal(RmDigest *digest, guint8 *result) { } -static const RmDigestSpec metro_crc_spec = {"metrocrc", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_crc_update, rm_digest_metro_copy, rm_digest_metro_crc_steal }; -static const RmDigestSpec metro256_crc_spec = 
{"metrocrc256", 256, rm_digest_metro256_init, rm_digest_metro256_free, rm_digest_metro256_crc_update, rm_digest_metro256_copy, rm_digest_metro256_crc_steal }; +static const RmDigestInterface metro_crc_interface = {"metrocrc", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_crc_update, rm_digest_metro_copy, rm_digest_metro_crc_steal }; +static const RmDigestInterface metro256_crc_interface = {"metrocrc256", 256, rm_digest_metro256_init, rm_digest_metro256_free, rm_digest_metro256_crc_update, rm_digest_metro256_copy, rm_digest_metro256_crc_steal }; #endif @@ -367,7 +368,7 @@ static void rm_digest_cumulative_steal(RmDigest *digest, guint8 *result) { memcpy(result, state->data, RM_DIGEST_CUMULATIVE_LEN); } -static const RmDigestSpec cumulative_spec = { "cumulative", 8 * RM_DIGEST_CUMULATIVE_LEN, rm_digest_cumulative_init, rm_digest_cumulative_free, +static const RmDigestInterface cumulative_interface = { "cumulative", 8 * RM_DIGEST_CUMULATIVE_LEN, rm_digest_cumulative_init, rm_digest_cumulative_free, rm_digest_cumulative_update, rm_digest_cumulative_copy, rm_digest_cumulative_steal}; @@ -413,11 +414,11 @@ static void rm_digest_highway64_steal(RmDigest *digest, guint8 *result) { *(uint64_t*)result = HighwayHashCatFinish64(digest->state); } -#define HIGHWAY_SPEC(BITS) BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, rm_digest_highway_copy, rm_digest_highway##BITS##_steal +#define HIGHWAY_INTERFACE(BITS) BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, rm_digest_highway_copy, rm_digest_highway##BITS##_steal -static const RmDigestSpec highway256_spec = {"highway256", HIGHWAY_SPEC(256)}; -static const RmDigestSpec highway128_spec = {"highway128", HIGHWAY_SPEC(128)}; -static const RmDigestSpec highway64_spec = {"highway64", HIGHWAY_SPEC(64)}; +static const RmDigestInterface highway256_interface = {"highway256", HIGHWAY_INTERFACE(256)}; +static const RmDigestInterface highway128_interface = 
{"highway128", HIGHWAY_INTERFACE(128)}; +static const RmDigestInterface highway64_interface = {"highway64", HIGHWAY_INTERFACE(64)}; /////////////////////////// @@ -465,11 +466,11 @@ static void rm_digest_glib_steal(RmDigest *digest, guint8 *result) { #define GLIB_FUNCS rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy, rm_digest_glib_steal -static const RmDigestSpec md5_spec = {"md5", 128, GLIB_FUNCS}; -static const RmDigestSpec sha1_spec = {"sha1", 160, GLIB_FUNCS}; -static const RmDigestSpec sha256_spec = {"sha256", 256, GLIB_FUNCS}; +static const RmDigestInterface md5_interface = {"md5", 128, GLIB_FUNCS}; +static const RmDigestInterface sha1_interface = {"sha1", 160, GLIB_FUNCS}; +static const RmDigestInterface sha256_interface = {"sha256", 256, GLIB_FUNCS}; #if HAVE_SHA512 -static const RmDigestSpec sha512_spec = {"sha512", 512, GLIB_FUNCS}; +static const RmDigestInterface sha512_interface = {"sha512", 512, GLIB_FUNCS}; #endif /////////////////////////// @@ -518,11 +519,11 @@ static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { g_slice_free(sha3_ctx, copy); } -#define SHA3_SPEC(BITS) BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy, rm_digest_sha3_steal +#define SHA3_INTERFACE(BITS) BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy, rm_digest_sha3_steal -static const RmDigestSpec sha3_256_spec = { "sha3-256", SHA3_SPEC(256)}; -static const RmDigestSpec sha3_384_spec = { "sha3-384", SHA3_SPEC(384)}; -static const RmDigestSpec sha3_512_spec = { "sha3-512", SHA3_SPEC(512)}; +static const RmDigestInterface sha3_256_interface = { "sha3-256", SHA3_INTERFACE(256)}; +static const RmDigestInterface sha3_384_interface = { "sha3-384", SHA3_INTERFACE(384)}; +static const RmDigestInterface sha3_512_interface = { "sha3-512", SHA3_INTERFACE(512)}; /////////////////////////// // blake hashes // @@ -578,10 +579,10 @@ CREATE_BLAKE_FUNCS(blake2sp, 
BLAKE2S); #define BLAKE_FUNCS(ALGO) rm_digest_##ALGO##_init, rm_digest_##ALGO##_free, rm_digest_##ALGO##_update, rm_digest_##ALGO##_copy, rm_digest_##ALGO##_steal -static const RmDigestSpec blake2b_spec = {"blake2b", 512, BLAKE_FUNCS(blake2b)}; -static const RmDigestSpec blake2bp_spec = {"blake2bp", 512, BLAKE_FUNCS(blake2bp)}; -static const RmDigestSpec blake2s_spec = {"blake2s", 256, BLAKE_FUNCS(blake2s)}; -static const RmDigestSpec blake2sp_spec = {"blake2sp", 256, BLAKE_FUNCS(blake2sp)}; +static const RmDigestInterface blake2b_interface = {"blake2b", 512, BLAKE_FUNCS(blake2b)}; +static const RmDigestInterface blake2bp_interface = {"blake2bp", 512, BLAKE_FUNCS(blake2bp)}; +static const RmDigestInterface blake2s_interface = {"blake2s", 256, BLAKE_FUNCS(blake2s)}; +static const RmDigestInterface blake2sp_interface = {"blake2sp", 256, BLAKE_FUNCS(blake2sp)}; /////////////////////////// // ext hash // @@ -610,7 +611,7 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm } } -static const RmDigestSpec ext_spec = {"ext", 0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy, NULL}; +static const RmDigestInterface ext_interface = {"ext", 0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy, NULL}; /////////////////////////// @@ -658,50 +659,50 @@ static void rm_digest_paranoid_steal(RmDigest *digest, guint8 *result) { /* Note: paranoid update implementation is in rm_digest_buffered_update() below */ -static const RmDigestSpec paranoid_spec = { "paranoid", 0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL, rm_digest_paranoid_steal}; +static const RmDigestInterface paranoid_interface = { "paranoid", 0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL, rm_digest_paranoid_steal}; //////////////////////////////// -// RmDigestSpec map // +// RmDigestInterface map // //////////////////////////////// -static const RmDigestSpec 
*rm_digest_spec(RmDigestType type) { static const RmDigestSpec *digest_specs[] = { +static const RmDigestInterface *rm_digest_interface(RmDigestType type) { + static const RmDigestInterface *digest_interfaces[] = { [RM_DIGEST_UNKNOWN] = NULL, - [RM_DIGEST_MURMUR] = &murmur_spec, - [RM_DIGEST_METRO] = &metro_spec, - [RM_DIGEST_METRO256] = &metro256_spec, + [RM_DIGEST_MURMUR] = &murmur_interface, + [RM_DIGEST_METRO] = &metro_interface, + [RM_DIGEST_METRO256] = &metro256_interface, #if HAVE_SSE4 - [RM_DIGEST_METROCRC] = &metro_crc_spec, - [RM_DIGEST_METROCRC256]= &metro256_crc_spec, + [RM_DIGEST_METROCRC] = &metro_crc_interface, + [RM_DIGEST_METROCRC256]= &metro256_crc_interface, #endif - [RM_DIGEST_MD5] = &md5_spec, - [RM_DIGEST_SHA1] = &sha1_spec, - [RM_DIGEST_SHA256] = &sha256_spec, + [RM_DIGEST_MD5] = &md5_interface, + [RM_DIGEST_SHA1] = &sha1_interface, + [RM_DIGEST_SHA256] = &sha256_interface, #if HAVE_SHA512 - [RM_DIGEST_SHA512] = &sha512_spec, + [RM_DIGEST_SHA512] = &sha512_interface, #endif - [RM_DIGEST_SHA3_256] = &sha3_256_spec, - [RM_DIGEST_SHA3_384] = &sha3_384_spec, - [RM_DIGEST_SHA3_512] = &sha3_512_spec, - [RM_DIGEST_BLAKE2S] = &blake2s_spec, - [RM_DIGEST_BLAKE2B] = &blake2b_spec, - [RM_DIGEST_BLAKE2SP] = &blake2sp_spec, - [RM_DIGEST_BLAKE2BP] = &blake2bp_spec, - [RM_DIGEST_EXT] = &ext_spec, - [RM_DIGEST_CUMULATIVE] = &cumulative_spec, - [RM_DIGEST_PARANOID] = &paranoid_spec, - [RM_DIGEST_XXHASH] = &xxhash_spec, - [RM_DIGEST_HIGHWAY64] = &highway64_spec, - [RM_DIGEST_HIGHWAY128] = &highway128_spec, - [RM_DIGEST_HIGHWAY256] = &highway256_spec, + [RM_DIGEST_SHA3_256] = &sha3_256_interface, + [RM_DIGEST_SHA3_384] = &sha3_384_interface, + [RM_DIGEST_SHA3_512] = &sha3_512_interface, + [RM_DIGEST_BLAKE2S] = &blake2s_interface, + [RM_DIGEST_BLAKE2B] = &blake2b_interface, + [RM_DIGEST_BLAKE2SP] = &blake2sp_interface, + [RM_DIGEST_BLAKE2BP] = &blake2bp_interface, + [RM_DIGEST_EXT] = &ext_interface, + [RM_DIGEST_CUMULATIVE] = &cumulative_interface, +
[RM_DIGEST_PARANOID] = &paranoid_interface, + [RM_DIGEST_XXHASH] = &xxhash_interface, + [RM_DIGEST_HIGHWAY64] = &highway64_interface, + [RM_DIGEST_HIGHWAY128] = &highway128_interface, + [RM_DIGEST_HIGHWAY256] = &highway256_interface, }; - if(type < RM_DIGEST_SENTINEL && digest_specs[type]) { - return digest_specs[type]; + if(type < RM_DIGEST_SENTINEL && digest_interfaces[type]) { + return digest_interfaces[type]; } - rm_log_error_line("No digest spec for enum %i", type); + rm_log_error_line("No digest interface for enum %i", type); g_assert_not_reached(); } @@ -716,7 +717,7 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { *code_table = g_hash_table_new(g_str_hash, g_str_equal); for(RmDigestType type=1; type<RM_DIGEST_SENTINEL; type++) { - rm_digest_table_insert(*code_table, (char*)rm_digest_spec(type)->name, type); + rm_digest_table_insert(*code_table, (char*)rm_digest_interface(type)->name, type); } /* add some synonyms */ @@ -744,8 +745,8 @@ RmDigestType rm_string_to_digest_type(const char *string) { } const char *rm_digest_type_to_string(RmDigestType type) { - const RmDigestSpec *spec = rm_digest_spec(type); - return spec->name; + const RmDigestInterface *interface = rm_digest_interface(type); + return interface->name; } /* TODO: remove?
*/ @@ -762,11 +763,11 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_s bool use_shadow_hash) { g_assert(type != RM_DIGEST_UNKNOWN); - const RmDigestSpec *spec = rm_digest_spec(type); + const RmDigestInterface *interface = rm_digest_interface(type); RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; - digest->bytes = spec->bits / 8; - spec->init(digest, seed1, seed2, ext_size, use_shadow_hash); + digest->bytes = interface->bits / 8; + interface->init(digest, seed1, seed2, ext_size, use_shadow_hash); return digest; } @@ -780,14 +781,14 @@ void rm_digest_release_buffers(RmDigest *digest) { } void rm_digest_free(RmDigest *digest) { - const RmDigestSpec *spec = rm_digest_spec(digest->type); - spec->free(digest); + const RmDigestInterface *interface = rm_digest_interface(digest->type); + interface->free(digest); g_slice_free(RmDigest, digest); } void rm_digest_update(RmDigest *digest, const unsigned char *data, RmOff size) { - const RmDigestSpec *spec = rm_digest_spec(digest->type); - spec->update(digest, data, size); + const RmDigestInterface *interface = rm_digest_interface(digest->type); + interface->update(digest, data, size); } void rm_digest_buffered_update(RmBuffer *buffer) { @@ -872,21 +873,21 @@ RmDigest *rm_digest_copy(RmDigest *digest) { RmDigest *copy = g_slice_copy(sizeof(RmDigest), digest); - const RmDigestSpec *spec = rm_digest_spec(digest->type); - spec->copy(digest, copy); + const RmDigestInterface *interface = rm_digest_interface(digest->type); + interface->copy(digest, copy); return copy; } guint8 *rm_digest_steal(RmDigest *digest) { - const RmDigestSpec *spec = rm_digest_spec(digest->type); - if(!spec->steal) { + const RmDigestInterface *interface = rm_digest_interface(digest->type); + if(!interface->steal) { return g_slice_copy(digest->bytes, digest->state); } guint8 *result = g_slice_alloc0(digest->bytes); - spec->steal(digest, result); + interface->steal(digest, result); return result; } @@ -917,7 
+918,7 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { return false; } - const RmDigestSpec *spec = rm_digest_spec(a->type); + const RmDigestInterface *interface = rm_digest_interface(a->type); if(a->type == RM_DIGEST_PARANOID) { RmParanoid *pa = a->state; @@ -952,7 +953,7 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { } return (!a_iter && !b_iter); - } else if(spec->steal) { + } else if(interface->steal) { guint8 *buf_a = rm_digest_steal(a); guint8 *buf_b = rm_digest_steal(b); gboolean result = !memcmp(buf_a, buf_b, a->bytes); From 0bcc196838886757845d987092b04c4cf806a94c Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 16 Nov 2017 22:36:30 +1000 Subject: [PATCH 124/180] checsum: tidy up xxhash, murmur and metro interfaces --- lib/checksum.c | 199 ++++++++++++++++++++++++++++--------------------- 1 file changed, 114 insertions(+), 85 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 9162620b..2d5606bd 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -92,48 +92,20 @@ typedef struct RmDigestInterface { RmDigestStealFunc steal; } RmDigestInterface; +/* convenience macro to define an interface where all methods follow the standard naming convention */ +#define RM_DIGEST_DEFINE_INTERFACE(NAME, BITS) \ +static const RmDigestInterface NAME##_interface = { \ + .name = (#NAME), \ + .bits = (BITS), \ + .init = rm_digest_##NAME##_init, \ + .free = rm_digest_##NAME##_free, \ + .update = rm_digest_##NAME##_update, \ + .copy = rm_digest_##NAME##_copy, \ + .steal = rm_digest_##NAME##_steal \ + }; /////////////////////////// -// common funcs for // -// non-cryptographic // -// hashes // -/////////////////////////// - -#define ALLOC_BYTES(bytes) MAX(8, bytes) - -static void rm_digest_generic_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - /* init for hashes which just require allocation of digest->checksum */ - - /* Cannot go lower than 8, since we read 8 byte in some places. 
- * For some checksums this may mean trailing zeros in the unused bytes */ - digest->state = g_slice_alloc0(ALLOC_BYTES(digest->bytes)); - - if(seed1 && seed2) { - /* copy seeds to checksum */ - size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes / 2); - memcpy(digest->state, &seed1, seed_bytes); - memcpy(digest->state + digest->bytes/2, &seed2, seed_bytes); - } else if(seed1) { - size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes); - memcpy(digest->state, &seed1, seed_bytes); - } -} - -static void rm_digest_generic_free(RmDigest *digest) { - if(digest->state) { - g_slice_free1(digest->bytes, digest->state); - digest->state = NULL; - } -} - -static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) { - copy->state = g_slice_copy(ALLOC_BYTES(digest->bytes), digest->state); -} - -#define GENERIC_FUNCS(ALGO) rm_digest_generic_init, rm_digest_generic_free, rm_digest_##ALGO##_update, rm_digest_generic_copy, NULL - -/////////////////////////// -// xxhash // +// xxhash interface // /////////////////////////// static void rm_digest_xxhash_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { @@ -158,61 +130,74 @@ static void rm_digest_xxhash_steal(RmDigest *digest, guint8 *result) { *(unsigned long long*)result = XXH64_digest(digest->state); } - -static const RmDigestInterface xxhash_interface = { "xxhash", 64, rm_digest_xxhash_init, rm_digest_xxhash_free, rm_digest_xxhash_update, rm_digest_xxhash_copy, rm_digest_xxhash_steal}; - +RM_DIGEST_DEFINE_INTERFACE(xxhash, 64); /////////////////////////// // murmur // /////////////////////////// +#if RM_PLATFORM_32 -#define CREATE_MURMUR_FUNCS(TYPE) \ -static void rm_digest_murmur_##TYPE##_free(RmDigest *digest) { \ - MurmurHash3_##TYPE##_free(digest->state); \ -} \ - \ -static void rm_digest_murmur_##TYPE##_update(RmDigest *digest, \ - const unsigned char *data, \ - RmOff size) { \ - MurmurHash3_##TYPE##_update(digest->state, data, size); \ -} \ - \ -static void 
rm_digest_murmur_##TYPE##_copy(RmDigest *digest, RmDigest *copy) { \ - copy->state = MurmurHash3_##TYPE##_copy(digest->state); \ -} \ - \ -static void rm_digest_murmur_##TYPE##_steal(RmDigest *digest, guint8 *result) { \ - MurmurHash3_##TYPE##_steal(digest->state, result); \ +static void rm_digest_murmur_init(RmDigest *digest, RmOff seed1, RmOff seed2, + _UNUSED RmOff ext_size, + _UNUSED bool use_shadow_hash) { + digest->state = MurmurHash3_x86_128_new(seed1, seed1>>32, seed2, seed2>>32); } -#define MURMUR_FUNCS(TYPE) rm_digest_murmur_##TYPE##_init, rm_digest_murmur_##TYPE##_free, rm_digest_murmur_##TYPE##_update, rm_digest_murmur_##TYPE##_copy, rm_digest_murmur_##TYPE##_steal - +static void rm_digest_murmur_free(RmDigest *digest) { + MurmurHash3_x86_128_free(digest->state); +} -#if RM_PLATFORM_32 +static void rm_digest_murmur_update(RmDigest *digest, + const unsigned char *data, + RmOff size) { + MurmurHash3_x86_128_update(digest->state, data, size); +} -CREATE_MURMUR_FUNCS(x86_128) +static void rm_digest_murmur_copy(RmDigest *digest, RmDigest *copy) { + copy->state = MurmurHash3_x86_128_copy(digest->state); +} -static void rm_digest_murmur_x86_128_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - digest->state = MurmurHash3_x86_128_new(seed1, seed1>>32, seed2, seed2>>32); +static void rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { + MurmurHash3_x86_128_steal(digest->state, result); } -static const RmDigestInterface murmur_interface = { "murmur", 128, MURMUR_FUNCS(x86_128)}; #elif RM_PLATFORM_64 -CREATE_MURMUR_FUNCS(x64_128) - -static void rm_digest_murmur_x64_128_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { +static void rm_digest_murmur_init(RmDigest *digest, RmOff seed1, RmOff seed2, + _UNUSED RmOff ext_size, + _UNUSED bool use_shadow_hash) { digest->state = MurmurHash3_x64_128_new(seed1, seed2); } -static const RmDigestInterface 
murmur_interface = { "murmur", 128, MURMUR_FUNCS(x64_128)}; +static void rm_digest_murmur_free(RmDigest *digest) { + MurmurHash3_x64_128_free(digest->state); +} + +static void rm_digest_murmur_update(RmDigest *digest, + const unsigned char *data, + RmOff size) { + MurmurHash3_x64_128_update(digest->state, data, size); +} + +static void rm_digest_murmur_copy(RmDigest *digest, RmDigest *copy) { + copy->state = MurmurHash3_x64_128_copy(digest->state); +} + +static void rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { + MurmurHash3_x64_128_steal(digest->state, result); +} + #else + #error "Probably not a good idea to compile rmlint on 16bit." + #endif +RM_DIGEST_DEFINE_INTERFACE(murmur, 128); + /////////////////////////// // metro // @@ -258,31 +243,37 @@ static void rm_digest_metro256_steal(RmDigest *digest, guint8 *result) { metrohash256_steal(digest->state, result); } - -static const RmDigestInterface metro_interface = {"metro", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_update, rm_digest_metro_copy, rm_digest_metro_steal }; -static const RmDigestInterface metro256_interface = {"metro256", 256, rm_digest_metro256_init, rm_digest_metro256_free, rm_digest_metro256_update, rm_digest_metro256_copy, rm_digest_metro256_steal }; +RM_DIGEST_DEFINE_INTERFACE(metro, 128); +RM_DIGEST_DEFINE_INTERFACE(metro256, 256); #if HAVE_SSE4 -static void rm_digest_metro_crc_update(RmDigest *digest, const unsigned char *data, RmOff size) { +#define rm_digest_metrocrc_init rm_digest_metro_init +#define rm_digest_metrocrc_free rm_digest_metro_free +#define rm_digest_metrocrc_copy rm_digest_metro_copy + +static void rm_digest_metrocrc_update(RmDigest *digest, const unsigned char *data, RmOff size) { metrohash128crc_update(digest->state, data, size); } -static void rm_digest_metro_crc_steal(RmDigest *digest, guint8 *result) { +static void rm_digest_metrocrc_steal(RmDigest *digest, guint8 *result) { metrohash128crc_1_steal(digest->state, result); } -static void 
rm_digest_metro256_crc_update(RmDigest *digest, const unsigned char *data, RmOff size) { - metrohash256_update(digest->state, data, size); -} +#define rm_digest_metrocrc256_init rm_digest_metro256_init +#define rm_digest_metrocrc256_free rm_digest_metro256_free +#define rm_digest_metrocrc256_copy rm_digest_metro256_copy -static void rm_digest_metro256_crc_steal(RmDigest *digest, guint8 *result) { - metrohash256_steal(digest->state, result); +static void rm_digest_metrocrc256_update(RmDigest *digest, const unsigned char *data, RmOff size) { + metrohash256crc_update(digest->state, data, size); } +static void rm_digest_metrocrc256_steal(RmDigest *digest, guint8 *result) { + metrohash256crc_steal(digest->state, result); +} -static const RmDigestInterface metro_crc_interface = {"metrocrc", 128, rm_digest_metro_init, rm_digest_metro_free, rm_digest_metro_crc_update, rm_digest_metro_copy, rm_digest_metro_crc_steal }; -static const RmDigestInterface metro256_crc_interface = {"metrocrc256", 256, rm_digest_metro256_init, rm_digest_metro256_free, rm_digest_metro256_crc_update, rm_digest_metro256_copy, rm_digest_metro256_crc_steal }; +RM_DIGEST_DEFINE_INTERFACE(metrocrc, 128); +RM_DIGEST_DEFINE_INTERFACE(metrocrc256, 256); #endif @@ -588,6 +579,44 @@ static const RmDigestInterface blake2sp_interface = {"blake2sp", 256, BLAKE_FUNC // ext hash // /////////////////////////// +#define ALLOC_BYTES(bytes) MAX(8, bytes) + +static void rm_digest_generic_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { + /* init for hashes which just require allocation of digest->checksum */ + + /* Cannot go lower than 8, since we read 8 byte in some places. 
+ * For some checksums this may mean trailing zeros in the unused bytes */ + digest->state = g_slice_alloc0(ALLOC_BYTES(digest->bytes)); + + if(seed1 && seed2) { + /* copy seeds to checksum */ + size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes / 2); + memcpy(digest->state, &seed1, seed_bytes); + memcpy(digest->state + digest->bytes/2, &seed2, seed_bytes); + } else if(seed1) { + size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes); + memcpy(digest->state, &seed1, seed_bytes); + } +} + +static void rm_digest_generic_free(RmDigest *digest) { + if(digest->state) { + g_slice_free1(digest->bytes, digest->state); + digest->state = NULL; + } +} + +static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) { + copy->state = g_slice_copy(ALLOC_BYTES(digest->bytes), digest->state); +} + +#define GENERIC_FUNCS(ALGO) \ + .init = rm_digest_generic_init, \ + .free = rm_digest_generic_free, \ + .update = rm_digest_##ALGO##_update,\ + .copy = rm_digest_generic_copy, \ + .steal = NULL + static void rm_digest_ext_init(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash) { digest->bytes = ext_size; @@ -674,8 +703,8 @@ static const RmDigestInterface *rm_digest_interface(RmDigestType type) { [RM_DIGEST_METRO] = &metro_interface, [RM_DIGEST_METRO256] = &metro256_interface, #if HAVE_SSE4 - [RM_DIGEST_METROCRC] = &metro_crc_interface, - [RM_DIGEST_METROCRC256]= &metro256_crc_interface, + [RM_DIGEST_METROCRC] = &metrocrc_interface, + [RM_DIGEST_METROCRC256]= &metrocrc256_interface, #endif [RM_DIGEST_MD5] = &md5_interface, [RM_DIGEST_SHA1] = &sha1_interface, From d575c945a8e8f202514754f14b82fa3e8110c2e0 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 16 Nov 2017 23:06:02 +1000 Subject: [PATCH 125/180] checksum: simplify interface (ext_size and use_shadow hash redundant; 1 seed is enough) --- lib/checksum.c | 111 +++++++++++++++++---------------------------- lib/checksum.h | 5 +- lib/formats/json.c | 6 +-- lib/hasher.c | 3 +- lib/replay.c | 2 
+- lib/session.h | 3 +- lib/shredder.c | 17 ++----- lib/treemerge.c | 2 +- 8 files changed, 52 insertions(+), 97 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 2d5606bd..0611e172 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -76,7 +76,7 @@ static gboolean rm_buffer_equal(RmBuffer *a, RmBuffer *b) { /////////////////////////////////////// /* Each digest type must have an RmDigestInterface defined as follows: */ -typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash); +typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed); typedef void (*RmDigestFreeFunc)(RmDigest *digest); typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, RmOff size); typedef void (*RmDigestCopyFunc)(RmDigest *digest, RmDigest *copy); @@ -108,9 +108,9 @@ static const RmDigestInterface NAME##_interface = { \ // xxhash interface // /////////////////////////// -static void rm_digest_xxhash_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { +static void rm_digest_xxhash_init(RmDigest *digest, RmOff seed) { digest->state = XXH64_createState(); - XXH64_reset(digest->state, seed1 ^ seed2); + XXH64_reset(digest->state, seed); } static void rm_digest_xxhash_free(RmDigest *digest) { @@ -138,10 +138,8 @@ RM_DIGEST_DEFINE_INTERFACE(xxhash, 64); #if RM_PLATFORM_32 -static void rm_digest_murmur_init(RmDigest *digest, RmOff seed1, RmOff seed2, - _UNUSED RmOff ext_size, - _UNUSED bool use_shadow_hash) { - digest->state = MurmurHash3_x86_128_new(seed1, seed1>>32, seed2, seed2>>32); +static void rm_digest_murmur_init(RmDigest *digest, RmOff seed) { + digest->state = MurmurHash3_x86_128_new(seed, seed>>32, seed, seed>>32); } static void rm_digest_murmur_free(RmDigest *digest) { @@ -165,10 +163,8 @@ static void rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { #elif RM_PLATFORM_64 -static void rm_digest_murmur_init(RmDigest *digest, 
RmOff seed1, RmOff seed2, - _UNUSED RmOff ext_size, - _UNUSED bool use_shadow_hash) { - digest->state = MurmurHash3_x64_128_new(seed1, seed2); +static void rm_digest_murmur_init(RmDigest *digest, RmOff seed) { + digest->state = MurmurHash3_x64_128_new(seed, seed); } static void rm_digest_murmur_free(RmDigest *digest) { @@ -203,8 +199,8 @@ RM_DIGEST_DEFINE_INTERFACE(murmur, 128); // metro // /////////////////////////// -static void rm_digest_metro_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - digest->state = metrohash128_1_new(seed1 ^ seed2); +static void rm_digest_metro_init(RmDigest *digest, RmOff seed) { + digest->state = metrohash128_1_new(seed); } static void rm_digest_metro_free(RmDigest *digest) { @@ -223,8 +219,8 @@ static void rm_digest_metro_steal(RmDigest *digest, guint8 *result) { metrohash128_1_steal(digest->state, result); } -static void rm_digest_metro256_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { - digest->state = metrohash256_new(seed1 ^ seed2); +static void rm_digest_metro256_init(RmDigest *digest, RmOff seed) { + digest->state = metrohash256_new(seed); } static void rm_digest_metro256_free(RmDigest *digest) { @@ -248,6 +244,7 @@ RM_DIGEST_DEFINE_INTERFACE(metro256, 256); #if HAVE_SSE4 +/* some of the interface procedures are common between crc- and non-crc-variants */ #define rm_digest_metrocrc_init rm_digest_metro_init #define rm_digest_metrocrc_free rm_digest_metro_free #define rm_digest_metrocrc_copy rm_digest_metro_copy @@ -260,6 +257,7 @@ static void rm_digest_metrocrc_steal(RmDigest *digest, guint8 *result) { metrohash128crc_1_steal(digest->state, result); } +/* some of the interface procedures are common between crc- and non-crc-variants */ #define rm_digest_metrocrc256_init rm_digest_metro256_init #define rm_digest_metrocrc256_free rm_digest_metro256_free #define rm_digest_metrocrc256_copy rm_digest_metro256_copy @@ 
-306,14 +304,9 @@ typedef struct RmDigestCumulative { RM_DIGEST_CUMULATIVE_T pos; /* could be smaller but this is faster */ } RmDigestCumulative; -static void rm_digest_cumulative_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { +static void rm_digest_cumulative_init(RmDigest *digest, RmOff seed) { RmDigestCumulative *state = g_slice_new0(RmDigestCumulative); - *(RmOff*)&state->data[0] ^= seed1; -#if (RM_DIGEST_CUMULATIVE_LEN >= 16) - *(RmOff*)&state->data[8] ^= seed2; -#else - *(RmOff*)&state->data[0] ^= seed2; -#endif + *(RmOff*)&state->data[0] ^= seed; digest->state = state; } @@ -367,13 +360,10 @@ static const RmDigestInterface cumulative_interface = { "cumulative", 8 * RM_DI // highway hash // /////////////////////////// -static void rm_digest_highway_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { +static void rm_digest_highway_init(RmDigest *digest, RmOff seed) { uint64_t key[4] = {1, 2, 3, 4}; - if(seed1) { - key[0] = (uint64_t)seed1; - } - if(seed2) { - key[2] = (uint64_t)seed2; + if(seed) { + key[0] = (uint64_t)seed; } digest->state = g_slice_alloc0(sizeof(HighwayHashCat)); @@ -425,13 +415,10 @@ static const GChecksumType glib_map[] = { #endif }; -static void rm_digest_glib_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { +static void rm_digest_glib_init(RmDigest *digest, RmOff seed) { digest->state = g_checksum_new(glib_map[digest->type]); - if(seed1) { - g_checksum_update(digest->state, (const guchar *)&seed1, sizeof(seed1)); - } - if(seed2) { - g_checksum_update(digest->state, (const guchar *)&seed2, sizeof(seed2)); + if(seed) { + g_checksum_update(digest->state, (const guchar *)&seed, sizeof(seed)); } } @@ -469,7 +456,7 @@ static const RmDigestInterface sha512_interface = {"sha512", 512, GLIB_FUNCS}; /////////////////////////// -static void rm_digest_sha3_init(RmDigest *digest, RmOff 
seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { +static void rm_digest_sha3_init(RmDigest *digest, RmOff seed) { digest->state = g_slice_alloc0(sizeof(sha3_ctx)); switch(digest->type) { case RM_DIGEST_SHA3_256: @@ -484,11 +471,8 @@ static void rm_digest_sha3_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNU default: g_assert_not_reached(); } - if(seed1) { - rhash_sha3_update(digest->state, (const unsigned char *)&seed1, sizeof(seed1)); - } - if(seed2) { - rhash_sha3_update(digest->state, (const unsigned char *)&seed2, sizeof(seed2)); + if(seed) { + rhash_sha3_update(digest->state, (const unsigned char *)&seed, sizeof(seed)); } } @@ -522,17 +506,11 @@ static const RmDigestInterface sha3_512_interface = { "sha3-512", SHA3_INTERFACE #define CREATE_BLAKE_FUNCS(ALGO, ALGO_BIG) \ \ -static void rm_digest_##ALGO##_init(RmDigest *digest, RmOff seed1, \ - RmOff seed2, \ - _UNUSED RmOff ext_size, \ - _UNUSED bool use_shadow_hash) { \ +static void rm_digest_##ALGO##_init(RmDigest *digest, RmOff seed) { \ digest->state = g_slice_alloc0(sizeof(ALGO##_state)); \ ALGO##_init(digest->state, ALGO_BIG##_OUTBYTES); \ - if(seed1) { \ - ALGO##_update(digest->state, &seed1, sizeof(RmOff)); \ - } \ - if(seed2) { \ - ALGO##_update(digest->state, &seed2, sizeof(RmOff)); \ + if(seed) { \ + ALGO##_update(digest->state, &seed, sizeof(RmOff)); \ } \ g_assert(digest->bytes==ALGO_BIG##_OUTBYTES); \ } \ @@ -581,21 +559,17 @@ static const RmDigestInterface blake2sp_interface = {"blake2sp", 256, BLAKE_FUNC #define ALLOC_BYTES(bytes) MAX(8, bytes) -static void rm_digest_generic_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, _UNUSED bool use_shadow_hash) { +static void rm_digest_generic_init(RmDigest *digest, RmOff seed) { /* init for hashes which just require allocation of digest->checksum */ /* Cannot go lower than 8, since we read 8 byte in some places. 
* For some checksums this may mean trailing zeros in the unused bytes */ digest->state = g_slice_alloc0(ALLOC_BYTES(digest->bytes)); - if(seed1 && seed2) { - /* copy seeds to checksum */ + if(seed) { + /* copy seed to checksum */ size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes / 2); - memcpy(digest->state, &seed1, seed_bytes); - memcpy(digest->state + digest->bytes/2, &seed2, seed_bytes); - } else if(seed1) { - size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes); - memcpy(digest->state, &seed1, seed_bytes); + memcpy(digest->state, &seed, seed_bytes); } } @@ -618,9 +592,9 @@ static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) { .steal = NULL -static void rm_digest_ext_init(RmDigest *digest, RmOff seed1, RmOff seed2, RmOff ext_size, bool use_shadow_hash) { - digest->bytes = ext_size; - rm_digest_generic_init(digest, seed1, seed2, ext_size, use_shadow_hash); +static void rm_digest_ext_init(RmDigest *digest, RmOff seed) { + digest->bytes = 64; + rm_digest_generic_init(digest, seed); } static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, RmOff size) { @@ -640,7 +614,7 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm } } -static const RmDigestInterface ext_interface = {"ext", 0, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy, NULL}; +static const RmDigestInterface ext_interface = {"ext", 512, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy, NULL}; /////////////////////////// @@ -648,14 +622,12 @@ static const RmDigestInterface ext_interface = {"ext", 0, rm_digest_ext_init, rm /////////////////////////// -static void rm_digest_paranoid_init(RmDigest *digest, RmOff seed1, RmOff seed2, _UNUSED RmOff ext_size, bool use_shadow_hash) { +static void rm_digest_paranoid_init(RmDigest *digest, RmOff seed) { RmParanoid *paranoid = g_slice_new0(RmParanoid); digest->state = paranoid; paranoid->incoming_twin_candidates = 
g_async_queue_new(); - if(use_shadow_hash) { - paranoid->shadow_hash = rm_digest_new(RM_DIGEST_XXHASH, seed1, seed2, 0, false); - digest->bytes = paranoid->shadow_hash->bytes; - } + paranoid->shadow_hash = rm_digest_new(RM_DIGEST_XXHASH, seed); + digest->bytes = paranoid->shadow_hash->bytes; } static void rm_digest_paranoid_free(RmDigest *digest) { @@ -788,15 +760,14 @@ int rm_digest_type_to_multihash_id(RmDigestType type) { return ids[MIN(type, sizeof(ids) / sizeof(ids[0]))]; } -RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_size, - bool use_shadow_hash) { +RmDigest *rm_digest_new(RmDigestType type, RmOff seed) { g_assert(type != RM_DIGEST_UNKNOWN); const RmDigestInterface *interface = rm_digest_interface(type); RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; digest->bytes = interface->bits / 8; - interface->init(digest, seed1, seed2, ext_size, use_shadow_hash); + interface->init(digest, seed); return digest; } @@ -1034,7 +1005,7 @@ void rm_digest_send_match_candidate(RmDigest *target, RmDigest *candidate) { } guint8 *rm_digest_sum(RmDigestType algo, const guint8 *data, gsize len, gsize *out_len) { - RmDigest *digest = rm_digest_new(algo, 0, 0, 0, false); + RmDigest *digest = rm_digest_new(algo, 0); rm_digest_update(digest, data, len); guint8 *buf = rm_digest_steal(digest); diff --git a/lib/checksum.h b/lib/checksum.h index a7bca81c..b79144c0 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -174,11 +174,8 @@ const char *rm_digest_type_to_string(RmDigestType type); * * @param type Which algorithm to use for hashing. * @param seed Initial seed. Pass 0 if not interested. - * @param ext_size Size of the digest in case on RM_DIGEST_EXT - * @param use_shadow_hash. Keep a shadow hash for lookup purposes. 
*/ -RmDigest *rm_digest_new(RmDigestType type, RmOff seed1, RmOff seed2, RmOff ext_size, - bool use_shadow_hash); +RmDigest *rm_digest_new(RmDigestType type, RmOff seed); /** * @brief Deallocate memory assocated with a RmDigest. diff --git a/lib/formats/json.c b/lib/formats/json.c index e9e7586d..e0746448 100644 --- a/lib/formats/json.c +++ b/lib/formats/json.c @@ -200,11 +200,9 @@ static void rm_fmt_head(RmSession *session, _UNUSED RmFmtHandler *parent, FILE * rm_fmt_json_sep(self, out); rm_fmt_json_key(out, "checksum_type", rm_digest_type_to_string(session->cfg->checksum_type)); - if(session->hash_seed1 && session->hash_seed2) { + if(session->hash_seed) { rm_fmt_json_sep(self, out); - rm_fmt_json_key_int(out, "hash_seed1", session->hash_seed1); - rm_fmt_json_sep(self, out); - rm_fmt_json_key_int(out, "hash_seed2", session->hash_seed2); + rm_fmt_json_key_int(out, "hash_seed", session->hash_seed); } } rm_fmt_json_close(self, out); diff --git a/lib/hasher.c b/lib/hasher.c index e570b230..8934ed0d 100644 --- a/lib/hasher.c +++ b/lib/hasher.c @@ -425,8 +425,7 @@ RmHasherTask *rm_hasher_task_new(RmHasher *hasher, RmDigest *digest, if(digest) { self->digest = digest; } else { - self->digest = rm_digest_new(hasher->digest_type, 0, 0, 0, - hasher->digest_type == RM_DIGEST_PARANOID); + self->digest = rm_digest_new(hasher->digest_type, 0); } /* get a recycled hashpipe if available */ diff --git a/lib/replay.c b/lib/replay.c index 9b0adc42..31cb6c3f 100644 --- a/lib/replay.c +++ b/lib/replay.c @@ -187,7 +187,7 @@ static RmFile *rm_parrot_try_next(RmParrot *polly) { file = rm_file_new(polly->session, path, stat_info, type, 0, 0, 0); file->is_original = json_object_get_boolean_member(object, "is_original"); file->is_symlink = (lstat_buf.st_mode & S_IFLNK); - file->digest = rm_digest_new(RM_DIGEST_EXT, 0, 0, 0, FALSE); + file->digest = rm_digest_new(RM_DIGEST_EXT, 0); file->free_digest = true; if(file->is_original) { diff --git a/lib/session.h b/lib/session.h index 
d7651823..efd70e95 100644 --- a/lib/session.h +++ b/lib/session.h @@ -114,8 +114,7 @@ typedef struct RmSession { RmOff offset_fails; /* Daniels paranoia */ - RmOff hash_seed1; - RmOff hash_seed2; + RmOff hash_seed; /* count used for determining the verbosity level */ int verbosity_count; diff --git a/lib/shredder.c b/lib/shredder.c index 62043269..fe2d57f7 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -333,11 +333,6 @@ typedef struct RmShredTag { #define NEEDS_NEW(group) \ (group->session->cfg->min_mtime) -/* There does not seem to be an performance advance here, - * but for paranoid mode it's useful to have a checksum in the json output. - * */ -#define NEEDS_SHADOW_HASH(cfg) \ - (TRUE || cfg->merge_directories || cfg->read_cksum_from_xattr) typedef struct RmShredGroup { /* holding queue for files; they are held here until the group first meets @@ -1070,7 +1065,7 @@ static void rm_shred_file_preprocess(RmFile *file, RmShredGroup **group) { /* Create an empty checksum for empty files */ if(file->file_size == 0) { - file->digest = rm_digest_new(cfg->checksum_type, 0, 0, 0, NEEDS_SHADOW_HASH(cfg)); + file->digest = rm_digest_new(cfg->checksum_type, 0); } if(!(*group)) { @@ -1486,13 +1481,12 @@ static void rm_shred_group_postprocess(RmShredGroup *group, RmShredTag *tag) { } static void rm_shred_result_factory(RmShredGroup *group, RmShredTag *tag) { - RmCfg *cfg = tag->session->cfg; /* maybe create group's digest from external checksums */ RmFile *headfile = group->held_files->head->data; char *cksum = headfile->ext_cksum; if(cksum && !group->digest) { - group->digest = rm_digest_new(RM_DIGEST_EXT, 0, 0, 0, NEEDS_SHADOW_HASH(cfg)); + group->digest = rm_digest_new(RM_DIGEST_EXT, 0); rm_digest_update(group->digest, (unsigned char *)cksum, strlen(cksum)); } @@ -1556,7 +1550,7 @@ static bool rm_shred_reassign_checksum(RmShredTag *main, RmFile *file) { } g_mutex_unlock(&group->lock); - file->digest = rm_digest_new(RM_DIGEST_PARANOID, 0, 0, 0, NEEDS_SHADOW_HASH(cfg)); 
+ file->digest = rm_digest_new(RM_DIGEST_PARANOID, 0); if((file->is_symlink == false || cfg->see_symlinks == false) && (group->next_offset > file->hash_offset + SHRED_PREMATCH_THRESHOLD)) { @@ -1584,10 +1578,7 @@ static bool rm_shred_reassign_checksum(RmShredTag *main, RmFile *file) { } else { /* this is first generation of RMGroups, so there is no progressive hash yet */ file->digest = rm_digest_new(cfg->checksum_type, - main->session->hash_seed1, - main->session->hash_seed2, - 0, - NEEDS_SHADOW_HASH(cfg)); + main->session->hash_seed); } return true; } diff --git a/lib/treemerge.c b/lib/treemerge.c index 0b107579..e8a8cbf6 100644 --- a/lib/treemerge.c +++ b/lib/treemerge.c @@ -313,7 +313,7 @@ static RmDirectory *rm_directory_new(char *dirname) { * order in which the file hashes were added. * It is not used as full hash, but as sorting speedup. */ - self->digest = rm_digest_new(RM_DIGEST_CUMULATIVE, 0, 0, 0, false); + self->digest = rm_digest_new(RM_DIGEST_CUMULATIVE, 0); g_queue_init(&self->known_files); g_queue_init(&self->children); From 434e8142fc25e3f9352df33f0e8e4b54be32d97e Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 16 Nov 2017 23:06:21 +1000 Subject: [PATCH 126/180] tests: update for only one level of -p --- tests/test_options/test_equal.py | 4 +-- tests/test_options/test_merge_directories.py | 26 ++++++++++---------- tests/test_speed/benchmark.py | 6 ++--- 3 files changed, 18 insertions(+), 18 deletions(-) diff --git a/tests/test_options/test_equal.py b/tests/test_options/test_equal.py index 90b1f825..6b899e06 100644 --- a/tests/test_options/test_equal.py +++ b/tests/test_options/test_equal.py @@ -38,7 +38,7 @@ def test_equal_files(): with assert_exit_code(0): head, *data, footer = run_rmlint( - '-pp', '--equal', path_a, path_b, + '-p', '--equal', path_a, path_b, use_default_dir=False ) @@ -107,7 +107,7 @@ def test_equal_directories(): with assert_exit_code(0): head, *data, footer = run_rmlint( - '-pp', '--equal', path_a, path_b, + '-p', 
'--equal', path_a, path_b, use_default_dir=False ) diff --git a/tests/test_options/test_merge_directories.py b/tests/test_options/test_merge_directories.py index 5d5997fc..d8534878 100644 --- a/tests/test_options/test_merge_directories.py +++ b/tests/test_options/test_merge_directories.py @@ -10,7 +10,7 @@ def test_simple(): create_file('xxx', '2/a') create_file('xxx', 'a') - head, *data, footer = run_rmlint('-pp -D --rank-by A') + head, *data, footer = run_rmlint('-p -D --rank-by A') assert 2 == sum(find['type'] == 'duplicate_dir' for find in data) @@ -32,7 +32,7 @@ def test_diff(): create_file('xxx', '2/a') create_file('xxx', '3/a') create_file('yyy', '3/b') - head, *data, footer = run_rmlint('-pp -D --rank-by A') + head, *data, footer = run_rmlint('-p -D --rank-by A') assert 2 == sum(find['type'] == 'duplicate_dir' for find in data) assert data[0]['size'] == 3 @@ -49,7 +49,7 @@ def test_same_but_not_dupe(): create_file('xxx', '1/a') create_file('xxx', '2/a') create_file('xxx', '2/b') - head, *data, footer = run_rmlint('-pp -D --rank-by A') + head, *data, footer = run_rmlint('-p -D --rank-by A') # No duplicate dirs, but 3 duplicate files should be found. 
assert 0 == sum(find['type'] == 'duplicate_dir' for find in data) @@ -64,7 +64,7 @@ def test_hardlinks(): create_link('2/a', '2/link1') create_link('2/a', '2/link2') - head, *data, footer = run_rmlint('-pp -D -l -S a') + head, *data, footer = run_rmlint('-p -D -l -S a') assert len(data) is 5 assert data[0]['type'] == 'duplicate_dir' assert data[0]['path'].endswith('1') @@ -109,7 +109,7 @@ def test_deep_simple(): create_file('xxx', 'd/b/empty') create_file('xxx', 'd/a/1') create_file('xxx', 'd/b/empty') - head, *data, footer = run_rmlint('-pp -D -S a') + head, *data, footer = run_rmlint('-p -D -S a') assert data[0]['path'].endswith('d/a') assert data[1]['path'].endswith('d/b') @@ -120,7 +120,7 @@ def test_deep_simple(): def test_dirs_with_empty_files_only(): create_file('', 'a/empty') create_file('', 'b/empty') - head, *data, footer = run_rmlint('-pp -D -S a -T df,dd --size 0') + head, *data, footer = run_rmlint('-p -D -S a -T df,dd --size 0') assert len(data) == 2 assert data[0]['path'].endswith('a') @@ -128,10 +128,10 @@ def test_dirs_with_empty_files_only(): assert data[1]['path'].endswith('b') assert data[1]['type'] == "duplicate_dir" - head, *data, footer = run_rmlint('-pp -D -S a -T df,dd') + head, *data, footer = run_rmlint('-p -D -S a -T df,dd') assert len(data) == 0 - head, *data, footer = run_rmlint('-pp -D -S a --size 0') + head, *data, footer = run_rmlint('-p -D -S a --size 0') assert len(data) == 2 data.sort(key=lambda elem: elem["path"]) @@ -156,7 +156,7 @@ def test_deep_full(): # subprocess.call('tree ' + TESTDIR_NAME, shell=True) # subprocess.call('./rmlint -p -S a -D ' + TESTDIR_NAME, shell=True) - head, *data, footer = run_rmlint('-pp -D -S a') + head, *data, footer = run_rmlint('-p -D -S a') assert len(data) == 6 @@ -225,7 +225,7 @@ def test_symlinks(): create_file('xxx', 'b/z') create_link('b/z', 'b/x', symlink=True) - head, *data, footer = run_rmlint('-pp -D -S a -F') + head, *data, footer = run_rmlint('-p -D -S a -F') assert len(data) == 2 
assert data[0]['path'].endswith('z') @@ -233,7 +233,7 @@ def test_symlinks(): assert data[1]['path'].endswith('z') assert not data[1]['is_original'] - head, *data, footer = run_rmlint('-pp -D -S a -f') + head, *data, footer = run_rmlint('-p -D -S a -f') assert len(data) == 4 assert data[0]['path'].endswith('/a') @@ -425,7 +425,7 @@ def test_equal_content_different_layout(): create_file('xxx', "tree-b/x") create_file('yyy', "tree-b/y") - head, *data, footer = run_rmlint('-pp -D --rank-by a') + head, *data, footer = run_rmlint('-p -D --rank-by a') assert data[0]["path"].endswith("tree-a") assert data[0]["is_original"] is True @@ -433,7 +433,7 @@ def test_equal_content_different_layout(): assert data[1]["is_original"] is False # Now, try to honour the layout - head, *data, footer = run_rmlint('-pp -Dj --rank-by a') + head, *data, footer = run_rmlint('-p -Dj --rank-by a') for point in data: assert point["type"] == "duplicate_file" diff --git a/tests/test_speed/benchmark.py b/tests/test_speed/benchmark.py index b468e65e..815256df 100644 --- a/tests/test_speed/benchmark.py +++ b/tests/test_speed/benchmark.py @@ -275,7 +275,7 @@ def get_benchid(self): class RmlintSpotParanoid(RmlintSpot): def get_options(self, paths): - return '-pp ' + RmlintSpot.get_options(self, paths) + return '-p ' + RmlintSpot.get_options(self, paths) def get_benchid(self): return 'rmlint-spot-paranoid' @@ -304,7 +304,7 @@ def get_benchid(self): class Rmlint222Paranoid(Rmlint222): def get_options(self, paths): - return '-pp ' + Rmlint222.get_options(self, paths) + return '-p ' + Rmlint222.get_options(self, paths) def get_benchid(self): return Rmlint222.get_benchid(self) + '-paranoid' @@ -363,7 +363,7 @@ def get_benchid(self): class RmlintParanoid(Rmlint): def get_options(self, paths): - return '-pp ' + Rmlint.get_options(self, paths) + return '-p ' + Rmlint.get_options(self, paths) def get_benchid(self): return 'rmlint-paranoid' From 476ff0f3155456098bf49962973fa59af141e860 Mon Sep 17 00:00:00 2001 
From: SeeSpotRun Date: Thu, 16 Nov 2017 23:16:21 +1000 Subject: [PATCH 127/180] checksum: clang-format --- lib/checksum.c | 364 ++++++++++++++++++---------------- lib/checksum.h | 3 - lib/checksums/metrohash.h | 60 +++--- lib/checksums/metrohash128.c | 277 +++++++++++++------------- lib/checksums/murmur3.c | 109 +++++----- lib/checksums/murmur3.h | 24 ++- lib/checksums/xxhash/xxhash.c | 44 ++-- lib/checksums/xxhash/xxhash.h | 20 +- 8 files changed, 455 insertions(+), 446 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 0611e172..45b01e48 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -41,11 +41,11 @@ #include "checksum.h" #include "checksums/blake2/blake2.h" -#include "checksums/murmur3.h" +#include "checksums/highwayhash.h" #include "checksums/metrohash.h" +#include "checksums/murmur3.h" #include "checksums/sha3/sha3_rhash.h" #include "checksums/xxhash/xxhash.h" -#include "checksums/highwayhash.h" #include "utilities.h" @@ -78,31 +78,32 @@ static gboolean rm_buffer_equal(RmBuffer *a, RmBuffer *b) { /* Each digest type must have an RmDigestInterface defined as follows: */ typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed); typedef void (*RmDigestFreeFunc)(RmDigest *digest); -typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, RmOff size); +typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, + RmOff size); typedef void (*RmDigestCopyFunc)(RmDigest *digest, RmDigest *copy); typedef void (*RmDigestStealFunc)(RmDigest *digest, guint8 *result); typedef struct RmDigestInterface { const char *name; - const uint bits; // length of the output checksum in bits - RmDigestInitFunc init; // performs initialisation of digest->state + const uint bits; // length of the output checksum in bits + RmDigestInitFunc init; // performs initialisation of digest->state RmDigestFreeFunc free; RmDigestUpdateFunc update; RmDigestCopyFunc copy; RmDigestStealFunc steal; } RmDigestInterface; -/* convenience 
macro to define an interface where all methods follow the standard naming convention */ -#define RM_DIGEST_DEFINE_INTERFACE(NAME, BITS) \ -static const RmDigestInterface NAME##_interface = { \ - .name = (#NAME), \ - .bits = (BITS), \ - .init = rm_digest_##NAME##_init, \ - .free = rm_digest_##NAME##_free, \ - .update = rm_digest_##NAME##_update, \ - .copy = rm_digest_##NAME##_copy, \ - .steal = rm_digest_##NAME##_steal \ - }; +/* convenience macro to define an interface where all methods follow the standard naming + * convention */ +#define RM_DIGEST_DEFINE_INTERFACE(NAME, BITS) \ + static const RmDigestInterface NAME##_interface = { \ + .name = (#NAME), \ + .bits = (BITS), \ + .init = rm_digest_##NAME##_init, \ + .free = rm_digest_##NAME##_free, \ + .update = rm_digest_##NAME##_update, \ + .copy = rm_digest_##NAME##_copy, \ + .steal = rm_digest_##NAME##_steal}; /////////////////////////// // xxhash interface // @@ -117,7 +118,8 @@ static void rm_digest_xxhash_free(RmDigest *digest) { XXH64_freeState(digest->state); } -static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, RmOff size) { +static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, + RmOff size) { XXH64_update(digest->state, data, size); } @@ -127,7 +129,7 @@ static void rm_digest_xxhash_copy(RmDigest *digest, RmDigest *copy) { } static void rm_digest_xxhash_steal(RmDigest *digest, guint8 *result) { - *(unsigned long long*)result = XXH64_digest(digest->state); + *(unsigned long long *)result = XXH64_digest(digest->state); } RM_DIGEST_DEFINE_INTERFACE(xxhash, 64); @@ -139,7 +141,7 @@ RM_DIGEST_DEFINE_INTERFACE(xxhash, 64); #if RM_PLATFORM_32 static void rm_digest_murmur_init(RmDigest *digest, RmOff seed) { - digest->state = MurmurHash3_x86_128_new(seed, seed>>32, seed, seed>>32); + digest->state = MurmurHash3_x86_128_new(seed, seed >> 32, seed, seed >> 32); } static void rm_digest_murmur_free(RmDigest *digest) { @@ -160,7 +162,6 @@ static void 
rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { MurmurHash3_x86_128_steal(digest->state, result); } - #elif RM_PLATFORM_64 static void rm_digest_murmur_init(RmDigest *digest, RmOff seed) { @@ -185,7 +186,6 @@ static void rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { MurmurHash3_x64_128_steal(digest->state, result); } - #else #error "Probably not a good idea to compile rmlint on 16bit." @@ -194,7 +194,6 @@ static void rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { RM_DIGEST_DEFINE_INTERFACE(murmur, 128); - /////////////////////////// // metro // /////////////////////////// @@ -207,7 +206,8 @@ static void rm_digest_metro_free(RmDigest *digest) { metrohash128_free(digest->state); } -static void rm_digest_metro_update(RmDigest *digest, const unsigned char *data, RmOff size) { +static void rm_digest_metro_update(RmDigest *digest, const unsigned char *data, + RmOff size) { metrohash128_1_update(digest->state, data, size); } @@ -227,7 +227,8 @@ static void rm_digest_metro256_free(RmDigest *digest) { metrohash256_free(digest->state); } -static void rm_digest_metro256_update(RmDigest *digest, const unsigned char *data, RmOff size) { +static void rm_digest_metro256_update(RmDigest *digest, const unsigned char *data, + RmOff size) { metrohash256_update(digest->state, data, size); } @@ -249,7 +250,8 @@ RM_DIGEST_DEFINE_INTERFACE(metro256, 256); #define rm_digest_metrocrc_free rm_digest_metro_free #define rm_digest_metrocrc_copy rm_digest_metro_copy -static void rm_digest_metrocrc_update(RmDigest *digest, const unsigned char *data, RmOff size) { +static void rm_digest_metrocrc_update(RmDigest *digest, const unsigned char *data, + RmOff size) { metrohash128crc_update(digest->state, data, size); } @@ -262,7 +264,8 @@ static void rm_digest_metrocrc_steal(RmDigest *digest, guint8 *result) { #define rm_digest_metrocrc256_free rm_digest_metro256_free #define rm_digest_metrocrc256_copy rm_digest_metro256_copy -static void 
rm_digest_metrocrc256_update(RmDigest *digest, const unsigned char *data, RmOff size) { +static void rm_digest_metrocrc256_update(RmDigest *digest, const unsigned char *data, + RmOff size) { metrohash256crc_update(digest->state, data, size); } @@ -275,7 +278,6 @@ RM_DIGEST_DEFINE_INTERFACE(metrocrc256, 256); #endif - /////////////////////////// // cumulative // /////////////////////////// @@ -299,14 +301,15 @@ RM_DIGEST_DEFINE_INTERFACE(metrocrc256, 256); typedef struct RmDigestCumulative { union { guint8 data[RM_DIGEST_CUMULATIVE_LEN]; - RM_DIGEST_CUMULATIVE_T bigdata[RM_DIGEST_CUMULATIVE_LEN / RM_DIGEST_CUMULATIVE_ALIGN]; + RM_DIGEST_CUMULATIVE_T + bigdata[RM_DIGEST_CUMULATIVE_LEN / RM_DIGEST_CUMULATIVE_ALIGN]; }; - RM_DIGEST_CUMULATIVE_T pos; /* could be smaller but this is faster */ + RM_DIGEST_CUMULATIVE_T pos; /* could be smaller but this is faster */ } RmDigestCumulative; static void rm_digest_cumulative_init(RmDigest *digest, RmOff seed) { RmDigestCumulative *state = g_slice_new0(RmDigestCumulative); - *(RmOff*)&state->data[0] ^= seed; + *(RmOff *)&state->data[0] ^= seed; digest->state = state; } @@ -315,31 +318,34 @@ static void rm_digest_cumulative_free(RmDigest *digest) { digest->state = NULL; } -static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *data, RmOff size) { - guint8 *ptr = (guint8*) data; +static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *data, + RmOff size) { + guint8 *ptr = (guint8 *)data; guint8 *stop = ptr + size; RmDigestCumulative *state = digest->state; /* align so we can use [32|64]-bit xor */ - while ((state->pos % RM_DIGEST_CUMULATIVE_ALIGN != 0) && ptr < stop) { + while((state->pos % RM_DIGEST_CUMULATIVE_ALIGN != 0) && ptr < stop) { state->data[state->pos++] ^= *(ptr++); - state->pos &= (RM_DIGEST_CUMULATIVE_LEN-1); + state->pos &= (RM_DIGEST_CUMULATIVE_LEN - 1); } - RM_DIGEST_CUMULATIVE_T *ptr_big = (RM_DIGEST_CUMULATIVE_T*)ptr; - RM_DIGEST_CUMULATIVE_T *stop_big = 
(RM_DIGEST_CUMULATIVE_T*)(stop + 1 - RM_DIGEST_CUMULATIVE_ALIGN); + RM_DIGEST_CUMULATIVE_T *ptr_big = (RM_DIGEST_CUMULATIVE_T *)ptr; + RM_DIGEST_CUMULATIVE_T *stop_big = + (RM_DIGEST_CUMULATIVE_T *)(stop + 1 - RM_DIGEST_CUMULATIVE_ALIGN); /* plough through body of data efficiently */ - while (ptr_big < stop_big) { + while(ptr_big < stop_big) { state->bigdata[state->pos / RM_DIGEST_CUMULATIVE_ALIGN] ^= *ptr_big++; - state->pos = (state->pos + RM_DIGEST_CUMULATIVE_ALIGN) & (RM_DIGEST_CUMULATIVE_ALIGN-1); + state->pos = + (state->pos + RM_DIGEST_CUMULATIVE_ALIGN) & (RM_DIGEST_CUMULATIVE_ALIGN - 1); } /* process remaining date byte-wise */ - ptr = (guint8*)ptr_big; - while (ptr < stop) { + ptr = (guint8 *)ptr_big; + while(ptr < stop) { state->data[state->pos++] ^= *(ptr++); - state->pos &= (RM_DIGEST_CUMULATIVE_LEN-1); + state->pos &= (RM_DIGEST_CUMULATIVE_LEN - 1); } } @@ -352,9 +358,13 @@ static void rm_digest_cumulative_steal(RmDigest *digest, guint8 *result) { memcpy(result, state->data, RM_DIGEST_CUMULATIVE_LEN); } -static const RmDigestInterface cumulative_interface = { "cumulative", 8 * RM_DIGEST_CUMULATIVE_LEN, rm_digest_cumulative_init, rm_digest_cumulative_free, - rm_digest_cumulative_update, rm_digest_cumulative_copy, rm_digest_cumulative_steal}; - +static const RmDigestInterface cumulative_interface = {"cumulative", + 8 * RM_DIGEST_CUMULATIVE_LEN, + rm_digest_cumulative_init, + rm_digest_cumulative_free, + rm_digest_cumulative_update, + rm_digest_cumulative_copy, + rm_digest_cumulative_steal}; /////////////////////////// // highway hash // @@ -374,8 +384,9 @@ static void rm_digest_highway_free(RmDigest *digest) { g_slice_free(HighwayHashCat, digest->state); } -static void rm_digest_highway_update(RmDigest *digest, const unsigned char *data, RmOff size) { - HighwayHashCatAppend((const uint8_t*)data, size, digest->state); +static void rm_digest_highway_update(RmDigest *digest, const unsigned char *data, + RmOff size) { + HighwayHashCatAppend((const uint8_t 
*)data, size, digest->state); } static void rm_digest_highway_copy(RmDigest *digest, RmDigest *copy) { @@ -384,34 +395,37 @@ static void rm_digest_highway_copy(RmDigest *digest, RmDigest *copy) { /* HighwayHashCatFinish functions are non-destructive */ static void rm_digest_highway256_steal(RmDigest *digest, guint8 *result) { - HighwayHashCatFinish256(digest->state, (uint64_t*)result); + HighwayHashCatFinish256(digest->state, (uint64_t *)result); } static void rm_digest_highway128_steal(RmDigest *digest, guint8 *result) { - HighwayHashCatFinish128(digest->state, (uint64_t*)result); + HighwayHashCatFinish128(digest->state, (uint64_t *)result); } static void rm_digest_highway64_steal(RmDigest *digest, guint8 *result) { - *(uint64_t*)result = HighwayHashCatFinish64(digest->state); + *(uint64_t *)result = HighwayHashCatFinish64(digest->state); } -#define HIGHWAY_INTERFACE(BITS) BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, rm_digest_highway_copy, rm_digest_highway##BITS##_steal - -static const RmDigestInterface highway256_interface = {"highway256", HIGHWAY_INTERFACE(256)}; -static const RmDigestInterface highway128_interface = {"highway128", HIGHWAY_INTERFACE(128)}; -static const RmDigestInterface highway64_interface = {"highway64", HIGHWAY_INTERFACE(64)}; +#define HIGHWAY_INTERFACE(BITS) \ + BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, \ + rm_digest_highway_copy, rm_digest_highway##BITS##_steal +static const RmDigestInterface highway256_interface = {"highway256", + HIGHWAY_INTERFACE(256)}; +static const RmDigestInterface highway128_interface = {"highway128", + HIGHWAY_INTERFACE(128)}; +static const RmDigestInterface highway64_interface = {"highway64", HIGHWAY_INTERFACE(64)}; /////////////////////////// // glib hashes // /////////////////////////// static const GChecksumType glib_map[] = { - [RM_DIGEST_MD5] = G_CHECKSUM_MD5, - [RM_DIGEST_SHA1] = G_CHECKSUM_SHA1, - [RM_DIGEST_SHA256] = 
G_CHECKSUM_SHA256, + [RM_DIGEST_MD5] = G_CHECKSUM_MD5, + [RM_DIGEST_SHA1] = G_CHECKSUM_SHA1, + [RM_DIGEST_SHA256] = G_CHECKSUM_SHA256, #if HAVE_SHA512 - [RM_DIGEST_SHA512] = G_CHECKSUM_SHA512, + [RM_DIGEST_SHA512] = G_CHECKSUM_SHA512, #endif }; @@ -426,7 +440,8 @@ static void rm_digest_glib_free(RmDigest *digest) { g_checksum_free(digest->state); } -static void rm_digest_glib_update(RmDigest *digest, const unsigned char *data, RmOff size) { +static void rm_digest_glib_update(RmDigest *digest, const unsigned char *data, + RmOff size) { g_checksum_update(digest->state, data, size); } @@ -442,9 +457,11 @@ static void rm_digest_glib_steal(RmDigest *digest, guint8 *result) { g_checksum_free(copy); } -#define GLIB_FUNCS rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, rm_digest_glib_copy, rm_digest_glib_steal +#define GLIB_FUNCS \ + rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, \ + rm_digest_glib_copy, rm_digest_glib_steal -static const RmDigestInterface md5_interface = {"md5", 128, GLIB_FUNCS}; +static const RmDigestInterface md5_interface = {"md5", 128, GLIB_FUNCS}; static const RmDigestInterface sha1_interface = {"sha1", 160, GLIB_FUNCS}; static const RmDigestInterface sha256_interface = {"sha256", 256, GLIB_FUNCS}; #if HAVE_SHA512 @@ -455,21 +472,20 @@ static const RmDigestInterface sha512_interface = {"sha512", 512, GLIB_FUNCS}; // sha3 hashes // /////////////////////////// - static void rm_digest_sha3_init(RmDigest *digest, RmOff seed) { digest->state = g_slice_alloc0(sizeof(sha3_ctx)); switch(digest->type) { - case RM_DIGEST_SHA3_256: - rhash_sha3_256_init(digest->state); - break; - case RM_DIGEST_SHA3_384: - rhash_sha3_384_init(digest->state); - break; - case RM_DIGEST_SHA3_512: - rhash_sha3_512_init(digest->state); - break; - default: - g_assert_not_reached(); + case RM_DIGEST_SHA3_256: + rhash_sha3_256_init(digest->state); + break; + case RM_DIGEST_SHA3_384: + rhash_sha3_384_init(digest->state); + break; + case 
RM_DIGEST_SHA3_512: + rhash_sha3_512_init(digest->state); + break; + default: + g_assert_not_reached(); } if(seed) { rhash_sha3_update(digest->state, (const unsigned char *)&seed, sizeof(seed)); @@ -480,7 +496,8 @@ static void rm_digest_sha3_free(RmDigest *digest) { g_slice_free(sha3_ctx, digest->state); } -static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, RmOff size) { +static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, + RmOff size) { rhash_sha3_update(digest->state, data, size); } @@ -494,64 +511,63 @@ static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { g_slice_free(sha3_ctx, copy); } -#define SHA3_INTERFACE(BITS) BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, rm_digest_sha3_copy, rm_digest_sha3_steal +#define SHA3_INTERFACE(BITS) \ + BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, \ + rm_digest_sha3_copy, rm_digest_sha3_steal -static const RmDigestInterface sha3_256_interface = { "sha3-256", SHA3_INTERFACE(256)}; -static const RmDigestInterface sha3_384_interface = { "sha3-384", SHA3_INTERFACE(384)}; -static const RmDigestInterface sha3_512_interface = { "sha3-512", SHA3_INTERFACE(512)}; +static const RmDigestInterface sha3_256_interface = {"sha3-256", SHA3_INTERFACE(256)}; +static const RmDigestInterface sha3_384_interface = {"sha3-384", SHA3_INTERFACE(384)}; +static const RmDigestInterface sha3_512_interface = {"sha3-512", SHA3_INTERFACE(512)}; /////////////////////////// // blake hashes // /////////////////////////// -#define CREATE_BLAKE_FUNCS(ALGO, ALGO_BIG) \ - \ -static void rm_digest_##ALGO##_init(RmDigest *digest, RmOff seed) { \ - digest->state = g_slice_alloc0(sizeof(ALGO##_state)); \ - ALGO##_init(digest->state, ALGO_BIG##_OUTBYTES); \ - if(seed) { \ - ALGO##_update(digest->state, &seed, sizeof(RmOff)); \ - } \ - g_assert(digest->bytes==ALGO_BIG##_OUTBYTES); \ -} \ - \ -static void rm_digest_##ALGO##_free(RmDigest *digest) { \ - 
g_slice_free(ALGO##_state, digest->state); \ -} \ - \ -static void rm_digest_##ALGO##_update(RmDigest *digest, \ - const unsigned char *data, \ - RmOff size) { \ - ALGO##_update(digest->state, data, size); \ -} \ - \ -static void rm_digest_##ALGO##_copy(RmDigest *digest, \ - RmDigest *copy) { \ - copy->state = g_slice_copy(sizeof(ALGO##_state), \ - digest->state); \ -} \ - \ -static void rm_digest_##ALGO##_steal(RmDigest *digest, \ - guint8 *result) { \ - ALGO##_state *copy = g_slice_copy(sizeof(ALGO##_state), \ - digest->state); \ - ALGO##_final(copy, result, digest->bytes); \ - g_slice_free(ALGO##_state, copy); \ -} - - +#define CREATE_BLAKE_FUNCS(ALGO, ALGO_BIG) \ + \ + static void rm_digest_##ALGO##_init(RmDigest *digest, RmOff seed) { \ + digest->state = g_slice_alloc0(sizeof(ALGO##_state)); \ + ALGO##_init(digest->state, ALGO_BIG##_OUTBYTES); \ + if(seed) { \ + ALGO##_update(digest->state, &seed, sizeof(RmOff)); \ + } \ + g_assert(digest->bytes == ALGO_BIG##_OUTBYTES); \ + } \ + \ + static void rm_digest_##ALGO##_free(RmDigest *digest) { \ + g_slice_free(ALGO##_state, digest->state); \ + } \ + \ + static void rm_digest_##ALGO##_update(RmDigest *digest, const unsigned char *data, \ + RmOff size) { \ + ALGO##_update(digest->state, data, size); \ + } \ + \ + static void rm_digest_##ALGO##_copy(RmDigest *digest, RmDigest *copy) { \ + copy->state = g_slice_copy(sizeof(ALGO##_state), digest->state); \ + } \ + \ + static void rm_digest_##ALGO##_steal(RmDigest *digest, guint8 *result) { \ + ALGO##_state *copy = g_slice_copy(sizeof(ALGO##_state), digest->state); \ + ALGO##_final(copy, result, digest->bytes); \ + g_slice_free(ALGO##_state, copy); \ + } CREATE_BLAKE_FUNCS(blake2b, BLAKE2B); CREATE_BLAKE_FUNCS(blake2bp, BLAKE2B); CREATE_BLAKE_FUNCS(blake2s, BLAKE2S); CREATE_BLAKE_FUNCS(blake2sp, BLAKE2S); -#define BLAKE_FUNCS(ALGO) rm_digest_##ALGO##_init, rm_digest_##ALGO##_free, rm_digest_##ALGO##_update, rm_digest_##ALGO##_copy, rm_digest_##ALGO##_steal +#define 
BLAKE_FUNCS(ALGO) \ + rm_digest_##ALGO##_init, rm_digest_##ALGO##_free, rm_digest_##ALGO##_update, \ + rm_digest_##ALGO##_copy, rm_digest_##ALGO##_steal static const RmDigestInterface blake2b_interface = {"blake2b", 512, BLAKE_FUNCS(blake2b)}; -static const RmDigestInterface blake2bp_interface = {"blake2bp", 512, BLAKE_FUNCS(blake2bp)}; +static const RmDigestInterface blake2bp_interface = {"blake2bp", 512, + BLAKE_FUNCS(blake2bp)}; static const RmDigestInterface blake2s_interface = {"blake2s", 256, BLAKE_FUNCS(blake2s)}; -static const RmDigestInterface blake2sp_interface = {"blake2sp", 256, BLAKE_FUNCS(blake2sp)}; +static const RmDigestInterface blake2sp_interface = {"blake2sp", 256, + BLAKE_FUNCS(blake2sp)}; /////////////////////////// // ext hash // @@ -584,25 +600,22 @@ static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) { copy->state = g_slice_copy(ALLOC_BYTES(digest->bytes), digest->state); } -#define GENERIC_FUNCS(ALGO) \ - .init = rm_digest_generic_init, \ - .free = rm_digest_generic_free, \ - .update = rm_digest_##ALGO##_update,\ - .copy = rm_digest_generic_copy, \ - .steal = NULL - +#define GENERIC_FUNCS(ALGO) \ + .init = rm_digest_generic_init, .free = rm_digest_generic_free, \ + .update = rm_digest_##ALGO##_update, .copy = rm_digest_generic_copy, .steal = NULL static void rm_digest_ext_init(RmDigest *digest, RmOff seed) { digest->bytes = 64; rm_digest_generic_init(digest, seed); } -static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, RmOff size) { - /* Data is assumed to be a hex representation of a checksum. - * Needs to be compressed in pure memory first. - * - * Checksum is not updated but rather overwritten. - * */ +static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, + RmOff size) { +/* Data is assumed to be a hex representation of a checksum. + * Needs to be compressed in pure memory first. + * + * Checksum is not updated but rather overwritten. 
+ * */ #define CHAR_TO_NUM(c) (unsigned char)(g_ascii_isdigit(c) ? c - '0' : (c - 'a') + 10) digest->bytes = size / 2; @@ -614,14 +627,18 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, Rm } } -static const RmDigestInterface ext_interface = {"ext", 512, rm_digest_ext_init, rm_digest_generic_free, rm_digest_ext_update, rm_digest_generic_copy, NULL}; - +static const RmDigestInterface ext_interface = {"ext", + 512, + rm_digest_ext_init, + rm_digest_generic_free, + rm_digest_ext_update, + rm_digest_generic_copy, + NULL}; /////////////////////////// // paranoid 'hash' // /////////////////////////// - static void rm_digest_paranoid_init(RmDigest *digest, RmOff seed) { RmParanoid *paranoid = g_slice_new0(RmParanoid); digest->state = paranoid; @@ -657,45 +674,44 @@ static void rm_digest_paranoid_steal(RmDigest *digest, guint8 *result) { } } - /* Note: paranoid update implementation is in rm_digest_buffered_update() below */ -static const RmDigestInterface paranoid_interface = { "paranoid", 0, rm_digest_paranoid_init, rm_digest_paranoid_free, NULL, NULL, rm_digest_paranoid_steal}; - +static const RmDigestInterface paranoid_interface = { + "paranoid", 0, rm_digest_paranoid_init, rm_digest_paranoid_free, + NULL, NULL, rm_digest_paranoid_steal}; //////////////////////////////// // RmDigestInterface map // //////////////////////////////// - static const RmDigestInterface *rm_digest_interface(RmDigestType type) { static const RmDigestInterface *digest_interfaces[] = { - [RM_DIGEST_UNKNOWN] = NULL, - [RM_DIGEST_MURMUR] = &murmur_interface, - [RM_DIGEST_METRO] = &metro_interface, - [RM_DIGEST_METRO256] = &metro256_interface, - #if HAVE_SSE4 - [RM_DIGEST_METROCRC] = &metrocrc_interface, - [RM_DIGEST_METROCRC256]= &metrocrc256_interface, - #endif - [RM_DIGEST_MD5] = &md5_interface, - [RM_DIGEST_SHA1] = &sha1_interface, - [RM_DIGEST_SHA256] = &sha256_interface, - #if HAVE_SHA512 - [RM_DIGEST_SHA512] = &sha512_interface, - #endif - [RM_DIGEST_SHA3_256] 
= &sha3_256_interface, - [RM_DIGEST_SHA3_384] = &sha3_384_interface, - [RM_DIGEST_SHA3_512] = &sha3_512_interface, - [RM_DIGEST_BLAKE2S] = &blake2s_interface, - [RM_DIGEST_BLAKE2B] = &blake2b_interface, - [RM_DIGEST_BLAKE2SP] = &blake2sp_interface, - [RM_DIGEST_BLAKE2BP] = &blake2bp_interface, - [RM_DIGEST_EXT] = &ext_interface, + [RM_DIGEST_UNKNOWN] = NULL, + [RM_DIGEST_MURMUR] = &murmur_interface, + [RM_DIGEST_METRO] = &metro_interface, + [RM_DIGEST_METRO256] = &metro256_interface, +#if HAVE_SSE4 + [RM_DIGEST_METROCRC] = &metrocrc_interface, + [RM_DIGEST_METROCRC256] = &metrocrc256_interface, +#endif + [RM_DIGEST_MD5] = &md5_interface, + [RM_DIGEST_SHA1] = &sha1_interface, + [RM_DIGEST_SHA256] = &sha256_interface, +#if HAVE_SHA512 + [RM_DIGEST_SHA512] = &sha512_interface, +#endif + [RM_DIGEST_SHA3_256] = &sha3_256_interface, + [RM_DIGEST_SHA3_384] = &sha3_384_interface, + [RM_DIGEST_SHA3_512] = &sha3_512_interface, + [RM_DIGEST_BLAKE2S] = &blake2s_interface, + [RM_DIGEST_BLAKE2B] = &blake2b_interface, + [RM_DIGEST_BLAKE2SP] = &blake2sp_interface, + [RM_DIGEST_BLAKE2BP] = &blake2bp_interface, + [RM_DIGEST_EXT] = &ext_interface, [RM_DIGEST_CUMULATIVE] = &cumulative_interface, - [RM_DIGEST_PARANOID] = &paranoid_interface, - [RM_DIGEST_XXHASH] = &xxhash_interface, - [RM_DIGEST_HIGHWAY64] = &highway64_interface, + [RM_DIGEST_PARANOID] = &paranoid_interface, + [RM_DIGEST_XXHASH] = &xxhash_interface, + [RM_DIGEST_HIGHWAY64] = &highway64_interface, [RM_DIGEST_HIGHWAY128] = &highway128_interface, [RM_DIGEST_HIGHWAY256] = &highway256_interface, }; @@ -707,7 +723,8 @@ static const RmDigestInterface *rm_digest_interface(RmDigestType type) { g_assert_not_reached(); } -static void rm_digest_table_insert(GHashTable *code_table, char *name, RmDigestType type) { +static void rm_digest_table_insert(GHashTable *code_table, char *name, + RmDigestType type) { if(g_hash_table_contains(code_table, name)) { rm_log_error_line("Duplicate entry for %s in rm_init_digest_type_table()", name); } @@
-715,10 +732,10 @@ static void rm_digest_table_insert(GHashTable *code_table, char *name, RmDigestT } static gpointer rm_init_digest_type_table(GHashTable **code_table) { - *code_table = g_hash_table_new(g_str_hash, g_str_equal); - for(RmDigestType type=1; type<RM_DIGEST_SENTINEL; type++) { - rm_digest_table_insert(*code_table, (char*)rm_digest_interface(type)->name, type); + for(RmDigestType type = 1; type < RM_DIGEST_SENTINEL; type++) { + rm_digest_table_insert(*code_table, (char *)rm_digest_interface(type)->name, + type); } /* add some synonyms */ @@ -752,10 +769,10 @@ const char *rm_digest_type_to_string(RmDigestType type) { /* TODO: remove? */ int rm_digest_type_to_multihash_id(RmDigestType type) { - static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, - [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, - [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, - [RM_DIGEST_CUMULATIVE] = 13,[RM_DIGEST_PARANOID] = 14}; + static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, + [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, + [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, + [RM_DIGEST_CUMULATIVE] = 13, [RM_DIGEST_PARANOID] = 14}; return ids[MIN(type, sizeof(ids) / sizeof(ids[0]))]; } @@ -846,7 +863,8 @@ void rm_digest_buffered_update(RmBuffer *buffer) { paranoid->twin_candidate_buffer = paranoid->twin_candidate_buffer->next; } if(paranoid->twin_candidate && !match) { - /* reject the twin candidate, also add to rejects list to speed up rm_digest_equal() */ + /* reject the twin candidate, also add to rejects list to speed up + * rm_digest_equal() */ #if _RM_CHECKSUM_DEBUG rm_log_debug_line("Rejected twin candidate %p for %p", paranoid->twin_candidate, paranoid); @@ -880,7 +898,6 @@ RmDigest *rm_digest_copy(RmDigest *digest) { } guint8 *rm_digest_steal(RmDigest *digest) { - const RmDigestInterface *interface = rm_digest_interface(digest->type); if(!interface->steal) { return g_slice_copy(digest->bytes, digest->state); @@ -932,8 +949,7 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { return true; } /* check if already rejected */ -
if(g_slist_find(pa->rejects, b) || - g_slist_find(pb->rejects, a)) { + if(g_slist_find(pa->rejects, b) || g_slist_find(pb->rejects, a)) { return false; } /* all the "easy" ways failed... do manual check of all buffers */ diff --git a/lib/checksum.h b/lib/checksum.h index b79144c0..7edb726e 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -112,7 +112,6 @@ typedef struct RmDigest { /////////// RmBuffer //////////////// - /* Represents one block of read data */ typedef struct RmBuffer { /* note that first (sizeof(pointer)) bytes of this structure get overwritten @@ -139,7 +138,6 @@ RmBuffer *rm_buffer_new(gsize buf_size); void rm_buffer_free(RmBuffer *buf); - /** * @brief Convert a string like "md5" to a RmDigestType member. * @@ -236,7 +234,6 @@ guint8 *rm_digest_steal(RmDigest *digest); **/ guint8 *rm_digest_sum(RmDigestType algo, const guint8 *data, gsize len, gsize *out_len); - /** * @brief Hash a Digest, suitable for GHashTable. * diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index 3330a490..1c5451f6 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -35,8 +35,8 @@ typedef struct _Metro128_state Metro128State; typedef struct _Metro256_state Metro256State; // MetroHash 64-bit hash functions -void metrohash64_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); -void metrohash64_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); +void metrohash64_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); +void metrohash64_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); // MetroHash 128-bit hash functions Metro128State *metrohash128_1_new(uint32_t seed); @@ -49,59 +49,53 @@ Metro256State *metrohash256_copy(Metro256State *state); void metrohash128_free(Metro128State *state); void metrohash256_free(Metro256State *state); -void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); -void metrohash128_2(const uint8_t * key, uint64_t len, 
uint32_t seed, uint8_t * out); +void metrohash128_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); +void metrohash128_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); -void metrohash128_1_update(Metro128State *state, const uint8_t * key, uint64_t len); -void metrohash128_1_steal(Metro128State *state, uint8_t * out); +void metrohash128_1_update(Metro128State *state, const uint8_t *key, uint64_t len); +void metrohash128_1_steal(Metro128State *state, uint8_t *out); -void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t len); -void metrohash128_2_steal(Metro128State *state, uint8_t * out); +void metrohash128_2_update(Metro128State *state, const uint8_t *key, uint64_t len); +void metrohash128_2_steal(Metro128State *state, uint8_t *out); -void metrohash256_update(Metro256State *state, const uint8_t * key, uint64_t len); -void metrohash256_steal(Metro256State *state, uint8_t * out); +void metrohash256_update(Metro256State *state, const uint8_t *key, uint64_t len); +void metrohash256_steal(Metro256State *state, uint8_t *out); #if HAVE_SSE4 // MetroHash 128-bit hash functions using CRC instruction -void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); -void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out); +void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); +void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); -void metrohash128crc_update(Metro128State *state, const uint8_t * key, uint64_t len); +void metrohash128crc_update(Metro128State *state, const uint8_t *key, uint64_t len); -void metrohash128crc_1_steal(Metro128State *state, uint8_t * out); -void metrohash128crc_2_steal(Metro128State *state, uint8_t * out); +void metrohash128crc_1_steal(Metro128State *state, uint8_t *out); +void metrohash128crc_2_steal(Metro128State *state, uint8_t *out); -void 
metrohash256crc_update(Metro256State *state, const uint8_t * key, uint64_t len); -void metrohash256crc_steal(Metro256State *state, uint8_t * out); +void metrohash256crc_update(Metro256State *state, const uint8_t *key, uint64_t len); +void metrohash256crc_steal(Metro256State *state, uint8_t *out); #endif /* rotate right idiom recognized by compiler*/ -inline static uint64_t rotate_right(uint64_t v, unsigned k) -{ +inline static uint64_t rotate_right(uint64_t v, unsigned k) { return (v >> k) | (v << (64 - k)); } // unaligned reads, fast and safe on Nehalem and later microarchitectures -inline static uint64_t read_u64(const void * const ptr) -{ - return * (uint64_t *) ptr; +inline static uint64_t read_u64(const void *const ptr) { + return *(uint64_t *)ptr; } -inline static uint64_t read_u32(const void * const ptr) -{ - return * (uint32_t *) ptr; +inline static uint64_t read_u32(const void *const ptr) { + return *(uint32_t *)ptr; } -inline static uint64_t read_u16(const void * const ptr) -{ - return * (uint16_t *) ptr; +inline static uint64_t read_u16(const void *const ptr) { + return *(uint16_t *)ptr; } -inline static uint64_t read_u8 (const void * const ptr) -{ - return * (uint8_t *) ptr; +inline static uint64_t read_u8(const void *const ptr) { + return *(uint8_t *)ptr; } - -#endif // #ifndef METROHASH_METROHASH_H +#endif // #ifndef METROHASH_METROHASH_H diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index af346b94..8abd3a3f 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -23,15 +23,15 @@ // SOFTWARE. 
// -#include "metrohash.h" -#include #include +#include +#include "metrohash.h" #if HAVE_SSE4 struct _Metro128_state { uint64_t v[4]; - uint8_t xs[32]; /* unhashed data from last increment */ + uint8_t xs[32]; /* unhashed data from last increment */ uint8_t xs_len; }; @@ -40,17 +40,16 @@ struct _Metro256_state { Metro128State state2; }; - static const uint64_t k0_1 = 0xC83A91E1; static const uint64_t k1_1 = 0x8648DBDB; static const uint64_t k2_1 = 0x7BDEC03B; static const uint64_t k3_1 = 0x2F5870A5; static void metrohash128_1_init(Metro128State *state, uint32_t seed) { - state->v[0] = ((((uint64_t) seed) - k0_1) * k3_1); - state->v[1] = ((((uint64_t) seed) + k1_1) * k2_1); - state->v[2] = ((((uint64_t) seed) + k0_1) * k2_1); - state->v[3] = ((((uint64_t) seed) - k1_1) * k3_1); + state->v[0] = ((((uint64_t)seed) - k0_1) * k3_1); + state->v[1] = ((((uint64_t)seed) + k1_1) * k2_1); + state->v[2] = ((((uint64_t)seed) + k0_1) * k2_1); + state->v[3] = ((((uint64_t)seed) - k1_1) * k3_1); } Metro128State *metrohash128_1_new(uint32_t seed) { @@ -65,10 +64,10 @@ static const uint64_t k2_2 = 0x797A90BB; static const uint64_t k3_2 = 0x2E4B2E1B; static void metrohash128_2_init(Metro128State *state, uint32_t seed) { - state->v[0] = ((((uint64_t) seed) - k0_2) * k3_2); - state->v[1] = ((((uint64_t) seed) + k1_2) * k2_2); - state->v[2] = ((((uint64_t) seed) + k0_2) * k2_2); - state->v[3] = ((((uint64_t) seed) - k1_2) * k3_2); + state->v[0] = ((((uint64_t)seed) - k0_2) * k3_2); + state->v[1] = ((((uint64_t)seed) + k1_2) * k2_2); + state->v[2] = ((((uint64_t)seed) + k0_2) * k2_2); + state->v[3] = ((((uint64_t)seed) - k1_2) * k3_2); } Metro128State *metrohash128_2_new(uint32_t seed) { @@ -85,17 +84,14 @@ Metro128State *metrohash128_copy(Metro128State *state) { return g_slice_copy(sizeof(Metro128State), state); } -#define METRO_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ - const int bytes = (data_len + xs_len > xs_cap) ? 
\ - (int)xs_cap - (int)xs_len : \ - (int)data_len; \ - memcpy(xs + xs_len, data, bytes); \ - xs_len += bytes; \ +#define METRO_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ + const int bytes = \ + (data_len + xs_len > xs_cap) ? (int)xs_cap - (int)xs_len : (int)data_len; \ + memcpy(xs + xs_len, data, bytes); \ + xs_len += bytes; \ data += bytes; -void metrohash128crc_update(Metro128State *state, const uint8_t * key, uint64_t len) -{ - +void metrohash128crc_update(Metro128State *state, const uint8_t *key, uint64_t len) { uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -103,7 +99,6 @@ void metrohash128crc_update(Metro128State *state, const uint8_t * key, uint64_t /* process blocks of 32 bytes */ while(state->xs_len == 32 || data + 32 <= stop) { - uint64_t d1; uint64_t d2; uint64_t d3; @@ -129,20 +124,18 @@ void metrohash128crc_update(Metro128State *state, const uint8_t * key, uint64_t state->v[1] ^= _mm_crc32_u64(state->v[1], d2); state->v[2] ^= _mm_crc32_u64(state->v[2], d3); state->v[3] ^= _mm_crc32_u64(state->v[3], d4); - } - if (state->xs_len == 0 && stop > data) { + if(state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; memcpy(state->xs, data, state->xs_len); } } -void metrohash128crc_1_steal(Metro128State *state, uint8_t * out) { - +void metrohash128crc_1_steal(Metro128State *state, uint8_t *out) { uint64_t v[4]; - for(int i=0; i<4; i++) { + for(int i = 0; i < 4; i++) { v[i] = state->v[i]; } @@ -154,35 +147,38 @@ void metrohash128crc_1_steal(Metro128State *state, uint8_t * out) { uint8_t *ptr = state->xs; uint8_t *end = ptr + state->xs_len; - if ((end - ptr) >= 16) - { - v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],34) * k3_1; - v[1] += read_u64(ptr) * k2_1; ptr += 8; v[1] = rotate_right(v[1],34) * k3_1; + if((end - ptr) >= 16) { + v[0] += read_u64(ptr) * k2_1; + ptr += 8; + v[0] = rotate_right(v[0], 34) * k3_1; + v[1] += read_u64(ptr) * k2_1; + ptr += 8; + v[1] = 
rotate_right(v[1], 34) * k3_1; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 30) * k1_1; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 30) * k0_1; } - if ((end - ptr) >= 8) - { - v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],36) * k3_1; + if((end - ptr) >= 8) { + v[0] += read_u64(ptr) * k2_1; + ptr += 8; + v[0] = rotate_right(v[0], 36) * k3_1; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 23) * k1_1; } - if ((end - ptr) >= 4) - { - v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; + if((end - ptr) >= 4) { + v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); + ptr += 4; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 19) * k0_1; } - if ((end - ptr) >= 2) - { - v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; + if((end - ptr) >= 2) { + v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); + ptr += 2; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 13) * k1_1; } - if ((end - ptr) >= 1) - { - v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); + if((end - ptr) >= 1) { + v[1] ^= _mm_crc32_u64(v[0], read_u8(ptr)); v[1] ^= rotate_right((v[1] * k3_1) + v[0], 17) * k0_1; } @@ -194,10 +190,9 @@ void metrohash128crc_1_steal(Metro128State *state, uint8_t * out) { memcpy(out, v, 16); } -void metrohash128crc_2_steal(Metro128State *state, uint8_t * out) { - +void metrohash128crc_2_steal(Metro128State *state, uint8_t *out) { uint64_t v[4]; - for(int i=0; i<4; i++) { + for(int i = 0; i < 4; i++) { v[i] = state->v[i]; } @@ -209,36 +204,39 @@ void metrohash128crc_2_steal(Metro128State *state, uint8_t * out) { uint8_t *ptr = state->xs; uint8_t *end = ptr + state->xs_len; - if ((end - ptr) >= 16) - { - v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],41) * k3_2; - v[1] += read_u64(ptr) * k2_2; ptr += 8; v[1] = rotate_right(v[1],41) * k3_2; + if((end - ptr) >= 16) { + v[0] += read_u64(ptr) * k2_2; + ptr += 8; + v[0] = rotate_right(v[0], 41) * k3_2; + v[1] += read_u64(ptr) * k2_2; + ptr += 8; + v[1] = rotate_right(v[1], 41) * k3_2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 10) * k1_2; 
v[1] ^= rotate_right((v[1] * k3_2) + v[0], 10) * k0_2; } - if ((end - ptr) >= 8) - { - v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],34) * k3_2; + if((end - ptr) >= 8) { + v[0] += read_u64(ptr) * k2_2; + ptr += 8; + v[0] = rotate_right(v[0], 34) * k3_2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 22) * k1_2; } - if ((end - ptr) >= 4) - { - v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); ptr += 4; + if((end - ptr) >= 4) { + v[1] ^= _mm_crc32_u64(v[0], read_u32(ptr)); + ptr += 4; v[1] ^= rotate_right((v[1] * k3_2) + v[0], 14) * k0_2; } - if ((end - ptr) >= 2) - { - v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); ptr += 2; + if((end - ptr) >= 2) { + v[0] ^= _mm_crc32_u64(v[1], read_u16(ptr)); + ptr += 2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 15) * k1_2; } - if ((end - ptr) >= 1) - { - v[1] ^= _mm_crc32_u64(v[0], read_u8 (ptr)); - v[1] ^= rotate_right((v[1] * k3_2) + v[0], 18) * k0_2; + if((end - ptr) >= 1) { + v[1] ^= _mm_crc32_u64(v[0], read_u8(ptr)); + v[1] ^= rotate_right((v[1] * k3_2) + v[0], 18) * k0_2; } v[0] += rotate_right((v[0] * k0_2) + v[1], 15); @@ -249,39 +247,33 @@ void metrohash128crc_2_steal(Metro128State *state, uint8_t * out) { memcpy(out, v, 16); } -void metrohash128crc_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { +void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { Metro128State *state = metrohash128_1_new(seed); metrohash128crc_update(state, key, len); metrohash128crc_1_steal(state, out); metrohash128_free(state); } -void metrohash128crc_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { +void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { Metro128State *state = metrohash128_2_new(seed); metrohash128crc_update(state, key, len); metrohash128crc_2_steal(state, out); metrohash128_free(state); } -void metrohash256crc_update(Metro256State *state, const uint8_t * key, uint64_t len) { +void metrohash256crc_update(Metro256State 
*state, const uint8_t *key, uint64_t len) { metrohash128crc_update(&state->state1, key, len); metrohash128crc_update(&state->state2, key, len); } -void metrohash256crc_steal(Metro256State *state, uint8_t * out) { +void metrohash256crc_steal(Metro256State *state, uint8_t *out) { metrohash128crc_1_steal(&state->state1, out); - metrohash128crc_2_steal(&state->state2, out+16); + metrohash128crc_2_steal(&state->state2, out + 16); } - - #endif - - -void metrohash128_1_update(Metro128State *state, const uint8_t * key, uint64_t len) -{ - +void metrohash128_1_update(Metro128State *state, const uint8_t *key, uint64_t len) { uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -289,7 +281,6 @@ void metrohash128_1_update(Metro128State *state, const uint8_t * key, uint64_t l /* process blocks of 32 bytes */ while(state->xs_len == 32 || data + 32 <= stop) { - uint64_t d1; uint64_t d2; uint64_t d3; @@ -312,28 +303,25 @@ void metrohash128_1_update(Metro128State *state, const uint8_t * key, uint64_t l } state->v[0] += d1 * k0_1; - state->v[0] = rotate_right(state->v[0],29) + state->v[2]; + state->v[0] = rotate_right(state->v[0], 29) + state->v[2]; state->v[1] += d2 * k1_1; - state->v[1] = rotate_right(state->v[1],29) + state->v[3]; + state->v[1] = rotate_right(state->v[1], 29) + state->v[3]; state->v[2] += d3 * k2_1; - state->v[2] = rotate_right(state->v[2],29) + state->v[0]; + state->v[2] = rotate_right(state->v[2], 29) + state->v[0]; state->v[3] += d4 * k3_1; - state->v[3] = rotate_right(state->v[3],29) + state->v[1]; - + state->v[3] = rotate_right(state->v[3], 29) + state->v[1]; } - if (state->xs_len == 0 && stop > data) { + if(state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; memcpy(state->xs, data, state->xs_len); } } -void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t len) -{ - +void metrohash128_2_update(Metro128State *state, const uint8_t *key, uint64_t len) { uint8_t *data = (uint8_t 
*)key; const uint8_t *stop = data + len; @@ -341,7 +329,6 @@ void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t l /* process blocks of 32 bytes */ while(state->xs_len == 32 || data + 32 <= stop) { - uint64_t d1; uint64_t d2; uint64_t d3; @@ -362,34 +349,32 @@ void metrohash128_2_update(Metro128State *state, const uint8_t * key, uint64_t l d4 = read_u64(data + 24); data += 32; } - void metrohash256_update(Metro256State *state, const uint8_t * key, uint64_t len); -void metrohash256_steal(Metro256State *state, uint8_t * out); + void metrohash256_update(Metro256State * state, const uint8_t *key, uint64_t len); + void metrohash256_steal(Metro256State * state, uint8_t * out); state->v[0] += d1 * k0_2; - state->v[0] = rotate_right(state->v[0],29) + state->v[2]; + state->v[0] = rotate_right(state->v[0], 29) + state->v[2]; state->v[1] += d2 * k1_2; - state->v[1] = rotate_right(state->v[1],29) + state->v[3]; + state->v[1] = rotate_right(state->v[1], 29) + state->v[3]; state->v[2] += d3 * k2_2; - state->v[2] = rotate_right(state->v[2],29) + state->v[0]; + state->v[2] = rotate_right(state->v[2], 29) + state->v[0]; state->v[3] += d4 * k3_2; - state->v[3] = rotate_right(state->v[3],29) + state->v[1]; - + state->v[3] = rotate_right(state->v[3], 29) + state->v[1]; } - if (state->xs_len == 0 && stop > data) { + if(state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; memcpy(state->xs, data, state->xs_len); } } -void metrohash128_1_steal(Metro128State *state, uint8_t * out) { - +void metrohash128_1_steal(Metro128State *state, uint8_t *out) { uint64_t v[4]; - for(int i=0; i<4; i++) { + for(int i = 0; i < 4; i++) { v[i] = state->v[i]; } @@ -401,35 +386,41 @@ void metrohash128_1_steal(Metro128State *state, uint8_t * out) { uint8_t *ptr = state->xs; uint8_t *end = ptr + state->xs_len; - if ((end - ptr) >= 16) - { - v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],33) * k3_1; - v[1] += read_u64(ptr) * 
k2_1; ptr += 8; v[1] = rotate_right(v[1],33) * k3_1; + if((end - ptr) >= 16) { + v[0] += read_u64(ptr) * k2_1; + ptr += 8; + v[0] = rotate_right(v[0], 33) * k3_1; + v[1] += read_u64(ptr) * k2_1; + ptr += 8; + v[1] = rotate_right(v[1], 33) * k3_1; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 17) * k1_1; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 17) * k0_1; } - if ((end - ptr) >= 8) - { - v[0] += read_u64(ptr) * k2_1; ptr += 8; v[0] = rotate_right(v[0],33) * k3_1; + if((end - ptr) >= 8) { + v[0] += read_u64(ptr) * k2_1; + ptr += 8; + v[0] = rotate_right(v[0], 33) * k3_1; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 20) * k1_1; } - if ((end - ptr) >= 4) - { - v[1] += read_u32(ptr) * k2_1; ptr += 4; v[1] = rotate_right(v[1],33) * k3_1; + if((end - ptr) >= 4) { + v[1] += read_u32(ptr) * k2_1; + ptr += 4; + v[1] = rotate_right(v[1], 33) * k3_1; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 18) * k0_1; } - if ((end - ptr) >= 2) - { - v[0] += read_u16(ptr) * k2_1; ptr += 2; v[0] = rotate_right(v[0],33) * k3_1; + if((end - ptr) >= 2) { + v[0] += read_u16(ptr) * k2_1; + ptr += 2; + v[0] = rotate_right(v[0], 33) * k3_1; v[0] ^= rotate_right((v[0] * k2_1) + v[1], 24) * k1_1; } - if ((end - ptr) >= 1) - { - v[1] += read_u8 (ptr) * k2_1; v[1] = rotate_right(v[1],33) * k3_1; + if((end - ptr) >= 1) { + v[1] += read_u8(ptr) * k2_1; + v[1] = rotate_right(v[1], 33) * k3_1; v[1] ^= rotate_right((v[1] * k3_1) + v[0], 24) * k0_1; } @@ -441,11 +432,9 @@ void metrohash128_1_steal(Metro128State *state, uint8_t * out) { memcpy(out, v, 16); } - -void metrohash128_2_steal(Metro128State *state, uint8_t * out) { - +void metrohash128_2_steal(Metro128State *state, uint8_t *out) { uint64_t v[4]; - for(int i=0; i<4; i++) { + for(int i = 0; i < 4; i++) { v[i] = state->v[i]; } @@ -457,36 +446,42 @@ void metrohash128_2_steal(Metro128State *state, uint8_t * out) { uint8_t *ptr = state->xs; uint8_t *end = ptr + state->xs_len; - if ((end - ptr) >= 16) - { - v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = 
rotate_right(v[0],29) * k3_2; - v[1] += read_u64(ptr) * k2_2; ptr += 8; v[1] = rotate_right(v[1],29) * k3_2; + if((end - ptr) >= 16) { + v[0] += read_u64(ptr) * k2_2; + ptr += 8; + v[0] = rotate_right(v[0], 29) * k3_2; + v[1] += read_u64(ptr) * k2_2; + ptr += 8; + v[1] = rotate_right(v[1], 29) * k3_2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 29) * k1_2; v[1] ^= rotate_right((v[1] * k3_2) + v[0], 29) * k0_2; } - if ((end - ptr) >= 8) - { - v[0] += read_u64(ptr) * k2_2; ptr += 8; v[0] = rotate_right(v[0],29) * k3_2; + if((end - ptr) >= 8) { + v[0] += read_u64(ptr) * k2_2; + ptr += 8; + v[0] = rotate_right(v[0], 29) * k3_2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 29) * k1_2; } - if ((end - ptr) >= 4) - { - v[1] += read_u32(ptr) * k2_2; ptr += 4; v[1] = rotate_right(v[1],29) * k3_2; + if((end - ptr) >= 4) { + v[1] += read_u32(ptr) * k2_2; + ptr += 4; + v[1] = rotate_right(v[1], 29) * k3_2; v[1] ^= rotate_right((v[1] * k3_2) + v[0], 25) * k0_2; } - if ((end - ptr) >= 2) - { - v[0] += read_u16(ptr) * k2_2; ptr += 2; v[0] = rotate_right(v[0],29) * k3_2; + if((end - ptr) >= 2) { + v[0] += read_u16(ptr) * k2_2; + ptr += 2; + v[0] = rotate_right(v[0], 29) * k3_2; v[0] ^= rotate_right((v[0] * k2_2) + v[1], 30) * k1_2; } - if ((end - ptr) >= 1) - { - v[1] += read_u8 (ptr) * k2_2; v[1] = rotate_right(v[1],29) * k3_2; - v[1] ^= rotate_right((v[1] * k3_2) + v[0], 18) * k0_2; + if((end - ptr) >= 1) { + v[1] += read_u8(ptr) * k2_2; + v[1] = rotate_right(v[1], 29) * k3_2; + v[1] ^= rotate_right((v[1] * k3_2) + v[0], 18) * k0_2; } v[0] += rotate_right((v[0] * k0_2) + v[1], 33); @@ -497,22 +492,20 @@ void metrohash128_2_steal(Metro128State *state, uint8_t * out) { memcpy(out, v, 16); } - -void metrohash128_1(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { +void metrohash128_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { Metro128State *state = metrohash128_1_new(seed); metrohash128_1_update(state, key, len); metrohash128_1_steal(state, out); 
metrohash128_free(state); } -void metrohash128_2(const uint8_t * key, uint64_t len, uint32_t seed, uint8_t * out) { +void metrohash128_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { Metro128State *state = metrohash128_2_new(seed); metrohash128_2_update(state, key, len); metrohash128_2_steal(state, out); metrohash128_free(state); } - Metro256State *metrohash256_new(uint32_t seed) { Metro256State *state = g_slice_new0(Metro256State); metrohash128_1_init(&state->state1, seed); @@ -528,14 +521,12 @@ Metro256State *metrohash256_copy(Metro256State *state) { return g_slice_copy(sizeof(Metro256State), state); } -void metrohash256_update(Metro256State *state, const uint8_t * key, uint64_t len) { +void metrohash256_update(Metro256State *state, const uint8_t *key, uint64_t len) { metrohash128_1_update(&state->state1, key, len); metrohash128_2_update(&state->state2, key, len); } -void metrohash256_steal(Metro256State *state, uint8_t * out) { +void metrohash256_steal(Metro256State *state, uint8_t *out) { metrohash128_1_steal(&state->state1, out); - metrohash128_2_steal(&state->state2, out+16); + metrohash128_2_steal(&state->state2, out + 16); } - - diff --git a/lib/checksums/murmur3.c b/lib/checksums/murmur3.c index 20aa1fc5..6bbd8eaf 100644 --- a/lib/checksums/murmur3.c +++ b/lib/checksums/murmur3.c @@ -12,8 +12,8 @@ // little-endian platforms #include "murmur3.h" -#include #include +#include //----------------------------------------------------------------------------- // Platform-specific functions and macros @@ -31,16 +31,16 @@ static inline uint64_t rotl64(uint64_t x, int8_t r) { #define BIG_CONSTANT(x) (x##LLU) -//----------------------------------------------------------------------------- -// Block read - if your platform needs to do endian-swapping or can only -// handle aligned reads, do the conversion here + //----------------------------------------------------------------------------- + // Block read - if your platform needs to do 
endian-swapping or can only + // handle aligned reads, do the conversion here -#define GET_UINT64(p) *((uint64_t*)(p)); -#define GET_UINT32(p) *((uint32_t*)(p)); +#define GET_UINT64(p) *((uint64_t *)(p)); +#define GET_UINT32(p) *((uint32_t *)(p)); struct _MurmurHash3_x86_32_state { uint32_t h1; - uint8_t xs[4]; /* unhashed data from last increment */ + uint8_t xs[4]; /* unhashed data from last increment */ uint8_t xs_len; uint32_t len; }; @@ -50,7 +50,7 @@ struct _MurmurHash3_x86_128_state { uint32_t h2; uint32_t h3; uint32_t h4; - uint8_t xs[16]; /* unhashed data from last increment */ + uint8_t xs[16]; /* unhashed data from last increment */ uint8_t xs_len; uint32_t len; }; @@ -58,12 +58,11 @@ struct _MurmurHash3_x86_128_state { struct _MurmurHash3_x64_128_state { uint64_t h1; uint64_t h2; - uint8_t xs[16]; /* unhashed data from last increment */ + uint8_t xs[16]; /* unhashed data from last increment */ uint8_t xs_len; uint32_t len; }; - //----------------------------------------------------------------------------- // Finalization mix - force all bits of a hash block to avalanche @@ -89,29 +88,26 @@ static inline uint64_t fmix64(uint64_t k) { return k; } -//----------------------------------------------------------------------------- + //----------------------------------------------------------------------------- -#define MURMUR_UPDATE(h, k, rotl, ca, cb) \ - k *= ca; \ - k = ROTL64(k, rotl); \ - k *= cb; \ +#define MURMUR_UPDATE(h, k, rotl, ca, cb) \ + k *= ca; \ + k = ROTL64(k, rotl); \ + k *= cb; \ h ^= k; -#define MURMUR_MIX(ha, hb, rotl, c) \ - ha = ROTL64(ha, rotl); \ - ha += hb; \ +#define MURMUR_MIX(ha, hb, rotl, c) \ + ha = ROTL64(ha, rotl); \ + ha += hb; \ ha = ha * 5 + c; -#define MURMUR_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ - const int bytes = (data_len + xs_len > xs_cap) ? 
\ - (int)xs_cap - (int)xs_len : \ - (int)data_len; \ - memcpy(xs + xs_len, data, bytes); \ - xs_len += bytes; \ +#define MURMUR_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ + const int bytes = \ + (data_len + xs_len > xs_cap) ? (int)xs_cap - (int)xs_len : (int)data_len; \ + memcpy(xs + xs_len, data, bytes); \ + xs_len += bytes; \ data += bytes; - - //----------------------------------------------------------------------------- MurmurHash3_x86_32_state *MurmurHash3_x86_32_new(uint32_t seed) { @@ -121,13 +117,15 @@ MurmurHash3_x86_32_state *MurmurHash3_x86_32_new(uint32_t seed) { } MurmurHash3_x86_32_state *MurmurHash3_x86_32_copy(MurmurHash3_x86_32_state *state) { - MurmurHash3_x86_32_state *copy = g_slice_copy(sizeof(MurmurHash3_x86_32_state), state); + MurmurHash3_x86_32_state *copy = + g_slice_copy(sizeof(MurmurHash3_x86_32_state), state); return copy; } #define MURMUR_UPDATE_H1_X86_32(H1) MURMUR_UPDATE(H1, k1, 15, 0xcc9e2d51, 0x1b873593); -void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, const void * restrict key, const uint32_t len) { +void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, + const void *restrict key, const uint32_t len) { state->len += len; uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -138,7 +136,6 @@ void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, const void /* process blocks of 4 bytes */ while(state->xs_len == 4 || data + 4 <= stop) { - uint32_t k1; if(state->xs_len == 4) { @@ -155,14 +152,15 @@ void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, const void MURMUR_MIX(state->h1, 0, 13, 0xe6546b64); } - if (state->xs_len == 0 && stop > data) { + if(state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; memcpy(state->xs, data, state->xs_len); } } -void MurmurHash3_x86_32_steal(const MurmurHash3_x86_32_state *const restrict state, void *const restrict out) { +void MurmurHash3_x86_32_steal(const 
MurmurHash3_x86_32_state *const restrict state, + void *const restrict out) { uint32_t k1 = 0; /* copy h to make this a non-destructive steal */ @@ -194,7 +192,7 @@ void MurmurHash3_x86_32_finalise(MurmurHash3_x86_32_state *state, void *out) { MurmurHash3_x86_32_free(state); } -void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state *state) { +void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state *state) { g_slice_free(MurmurHash3_x86_32_state, state); } @@ -206,10 +204,10 @@ uint32_t MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed) { return out; } - //----------------------------------------------------------------------------- -MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed2, uint32_t seed3, uint32_t seed4) { +MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed2, + uint32_t seed3, uint32_t seed4) { MurmurHash3_x86_128_state *state = g_slice_new0(MurmurHash3_x86_128_state); state->h1 = seed1; state->h2 = seed2; @@ -219,7 +217,8 @@ MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed } MurmurHash3_x86_128_state *MurmurHash3_x86_128_copy(MurmurHash3_x86_128_state *state) { - MurmurHash3_x86_128_state *copy = g_slice_copy(sizeof(MurmurHash3_x86_128_state), state); + MurmurHash3_x86_128_state *copy = + g_slice_copy(sizeof(MurmurHash3_x86_128_state), state); return copy; } @@ -228,7 +227,8 @@ MurmurHash3_x86_128_state *MurmurHash3_x86_128_copy(MurmurHash3_x86_128_state *s #define MURMUR_UPDATE_H3_X86_128(H3) MURMUR_UPDATE(H3, k3, 17, 0x38b34ae5, 0xa1e38b93); #define MURMUR_UPDATE_H4_X86_128(H4) MURMUR_UPDATE(H4, k4, 18, 0xa1e38b93, 0x239b961b); -void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, const void * restrict key, const uint32_t len) { +void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, + const void *restrict key, const uint32_t len) { state->len += len; uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + 
len; @@ -239,7 +239,6 @@ void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, const vo /* process blocks of 16 bytes */ while(state->xs_len == 16 || data + 16 <= stop) { - uint32_t k1; uint32_t k2; uint32_t k3; @@ -272,17 +271,17 @@ void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, const vo MURMUR_UPDATE_H4_X86_128(state->h4); MURMUR_MIX(state->h4, state->h1, 13, 0x32ac3b17); - } - if (state->xs_len == 0 && stop > data) { + if(state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; memcpy(state->xs, data, state->xs_len); } } -void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict state, void *const restrict out) { +void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict state, + void *const restrict out) { uint32_t k1 = 0; uint32_t k2 = 0; uint32_t k3 = 0; @@ -369,7 +368,6 @@ void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict s ((uint32_t *)out)[1] = h2; ((uint32_t *)out)[2] = h3; ((uint32_t *)out)[3] = h4; - } void MurmurHash3_x86_128_finalise(MurmurHash3_x86_128_state *state, void *out) { @@ -389,8 +387,6 @@ void MurmurHash3_x86_128(const void *key, uint32_t len, uint32_t seed, void *out //----------------------------------------------------------------------------- - - MurmurHash3_x64_128_state *MurmurHash3_x64_128_new(uint64_t seed1, uint64_t seed2) { MurmurHash3_x64_128_state *state = g_slice_new0(MurmurHash3_x64_128_state); state->h1 = seed1; @@ -402,11 +398,15 @@ MurmurHash3_x64_128_state *MurmurHash3_x64_128_copy(MurmurHash3_x64_128_state *s return g_slice_copy(sizeof(MurmurHash3_x64_128_state), state); } -#define MURMUR_UPDATE_H1_X64_128(H1) MURMUR_UPDATE(H1, k1, 31, BIG_CONSTANT(0x87c37b91114253d5), BIG_CONSTANT(0x4cf5ad432745937f)); -#define MURMUR_UPDATE_H2_X64_128(H2) MURMUR_UPDATE(H2, k2, 33, BIG_CONSTANT(0x4cf5ad432745937f), BIG_CONSTANT(0x87c37b91114253d5)); - -void 
MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, const void * restrict key, const uint64_t len) { +#define MURMUR_UPDATE_H1_X64_128(H1) \ + MURMUR_UPDATE(H1, k1, 31, BIG_CONSTANT(0x87c37b91114253d5), \ + BIG_CONSTANT(0x4cf5ad432745937f)); +#define MURMUR_UPDATE_H2_X64_128(H2) \ + MURMUR_UPDATE(H2, k2, 33, BIG_CONSTANT(0x4cf5ad432745937f), \ + BIG_CONSTANT(0x87c37b91114253d5)); +void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, + const void *restrict key, const uint64_t len) { state->len += len; uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -417,7 +417,6 @@ void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, /* process blocks of 16 bytes */ while(state->xs_len == 16 || data + 16 <= stop) { - uint64_t k1; uint64_t k2; @@ -440,15 +439,15 @@ void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, MURMUR_MIX(state->h2, state->h1, 31, 0x38495ab5); } - if (state->xs_len == 0 && stop > data) { + if(state->xs_len == 0 && stop > data) { // store excess data in state state->xs_len = stop - data; memcpy(state->xs, data, state->xs_len); } } -void MurmurHash3_x64_128_steal(const MurmurHash3_x64_128_state *const restrict state, void *const restrict out) { - +void MurmurHash3_x64_128_steal(const MurmurHash3_x64_128_state *const restrict state, + void *const restrict out) { uint64_t k1 = 0; uint64_t k2 = 0; @@ -517,7 +516,8 @@ void MurmurHash3_x64_128_free(MurmurHash3_x64_128_state *state) { g_slice_free(MurmurHash3_x64_128_state, state); } -void MurmurHash3_x64_128(const void *key, const uint64_t len, const uint32_t seed, void *out) { +void MurmurHash3_x64_128(const void *key, const uint64_t len, const uint32_t seed, + void *out) { MurmurHash3_x64_128_state *state = MurmurHash3_x64_128_new(seed, seed); MurmurHash3_x64_128_update(state, key, len); MurmurHash3_x64_128_finalise(state, out); @@ -528,8 +528,9 @@ void 
MurmurHash3_x64_128_finalise(MurmurHash3_x64_128_state *state, void *out) { MurmurHash3_x64_128_free(state); } -int MurmurHash3_x64_128_equal(MurmurHash3_x64_128_state *a, MurmurHash3_x64_128_state *b) { - if (a->h1 != b->h1 || a->h2 != b->h2 || a->xs_len != b->xs_len || a->len != b->len) { +int MurmurHash3_x64_128_equal(MurmurHash3_x64_128_state *a, + MurmurHash3_x64_128_state *b) { + if(a->h1 != b->h1 || a->h2 != b->h2 || a->xs_len != b->xs_len || a->len != b->len) { return 0; } return (a->xs_len == 0 || !memcmp(a->xs, b->xs, a->xs_len)); diff --git a/lib/checksums/murmur3.h b/lib/checksums/murmur3.h index 800c5b04..99106544 100644 --- a/lib/checksums/murmur3.h +++ b/lib/checksums/murmur3.h @@ -24,7 +24,8 @@ typedef struct _MurmurHash3_x64_128_state MurmurHash3_x64_128_state; * return newly initialised, seeded state */ MurmurHash3_x86_32_state *MurmurHash3_x86_32_new(uint32_t seed); -MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed2, uint32_t seed3, uint32_t seed4); +MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed2, + uint32_t seed3, uint32_t seed4); MurmurHash3_x64_128_state *MurmurHash3_x64_128_new(uint64_t seed1, uint64_t seed2); /** @@ -37,16 +38,22 @@ MurmurHash3_x64_128_state *MurmurHash3_x64_128_copy(MurmurHash3_x64_128_state *s /** * streaming update of checksum */ -void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const restrict state, const void *restrict key, const uint32_t len); -void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const restrict state, const void *restrict key, const uint32_t len); -void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, const void *restrict key, const uint64_t len); +void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const restrict state, + const void *restrict key, const uint32_t len); +void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const restrict state, + const void *restrict key, const uint32_t 
len); +void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, + const void *restrict key, const uint64_t len); /** * output checksum result; does not modify underlying state */ -void MurmurHash3_x86_32_steal(const MurmurHash3_x86_32_state *const restrict state, void *const restrict out); -void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict state, void *const restrict out); -void MurmurHash3_x64_128_steal(const MurmurHash3_x64_128_state *const restrict state, void *const restrict out); +void MurmurHash3_x86_32_steal(const MurmurHash3_x86_32_state *const restrict state, + void *const restrict out); +void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict state, + void *const restrict out); +void MurmurHash3_x64_128_steal(const MurmurHash3_x64_128_state *const restrict state, + void *const restrict out); /** * output checksum result; frees state @@ -58,7 +65,7 @@ void MurmurHash3_x64_128_finalise(MurmurHash3_x64_128_state *state, void *out); /** * free state */ -void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state *state); +void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state *state); void MurmurHash3_x86_128_free(MurmurHash3_x86_128_state *state); void MurmurHash3_x64_128_free(MurmurHash3_x64_128_state *state); @@ -69,7 +76,6 @@ uint32_t MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed); void MurmurHash3_x86_128(const void *key, uint32_t len, uint32_t seed, void *out); void MurmurHash3_x64_128(const void *key, uint64_t len, uint32_t seed, void *out); - //----------------------------------------------------------------------------- #endif // _MURMURHASH3_H_ diff --git a/lib/checksums/xxhash/xxhash.c b/lib/checksums/xxhash/xxhash.c index 065c4188..c0c80974 100644 --- a/lib/checksums/xxhash/xxhash.c +++ b/lib/checksums/xxhash/xxhash.c @@ -32,8 +32,8 @@ You can contact the author at : */ /************************************** -* Tuning parameters 
-**************************************/ + * Tuning parameters + **************************************/ /* Unaligned memory access is automatically enabled for "common" CPU, such as x86. * For others CPU, the compiler will be more cautious, and insert extra code to ensure * aligned access is respected. @@ -70,8 +70,8 @@ You can contact the author at : #define XXH_FORCE_NATIVE_FORMAT 0 /************************************** -* Compiler Specific Options -***************************************/ + * Compiler Specific Options + ***************************************/ #ifdef _MSC_VER /* Visual Studio */ #pragma warning(disable : 4127) /* disable: C4127: conditional expression is constant */ #define FORCE_INLINE static __forceinline @@ -88,8 +88,8 @@ You can contact the author at : #endif /************************************** -* Includes & Memory related functions -***************************************/ + * Includes & Memory related functions + ***************************************/ #include "xxhash.h" /* Modify the local functions below should you wish to use some other memory routines */ /* for malloc(), free() */ @@ -107,8 +107,8 @@ static void* XXH_memcpy(void* dest, const void* src, size_t size) { } /************************************** -* Basic Types -***************************************/ + * Basic Types + ***************************************/ #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L /* C99 */ #include typedef uint8_t BYTE; @@ -137,8 +137,8 @@ static U64 XXH_read64(const void* memPtr) { } /****************************************** -* Compiler-specific Functions and Macros -******************************************/ + * Compiler-specific Functions and Macros + ******************************************/ #define GCC_VERSION (__GNUC__ * 100 + __GNUC_MINOR__) /* Note : although _rotl exists for minGW (GCC under windows), performance seems poor */ @@ -170,8 +170,8 @@ static U64 XXH_swap64(U64 x) { #endif 
/*************************************** -* Architecture Macros -***************************************/ + * Architecture Macros + ***************************************/ typedef enum { XXH_bigEndian = 0, XXH_littleEndian = 1 } XXH_endianess; #ifndef XXH_CPU_LITTLE_ENDIAN /* XXH_CPU_LITTLE_ENDIAN can be defined externally, for \ example using a compiler switch */ @@ -180,8 +180,8 @@ static const int one = 1; #endif /***************************** -* Memory reads -*****************************/ + * Memory reads + *****************************/ typedef enum { XXH_aligned, XXH_unaligned } XXH_alignment; FORCE_INLINE U32 XXH_readLE32_align(const void* ptr, XXH_endianess endian, @@ -211,16 +211,16 @@ FORCE_INLINE U64 XXH_readLE64(const void* ptr, XXH_endianess endian) { } /*************************************** -* Macros -***************************************/ + * Macros + ***************************************/ #define XXH_STATIC_ASSERT(c) \ { \ enum { XXH_static_assert = 1 / (!!(c)) }; \ } /* use only *after* variable declarations */ /*************************************** -* Constants -***************************************/ + * Constants + ***************************************/ #define PRIME32_1 2654435761U #define PRIME32_2 2246822519U #define PRIME32_3 3266489917U @@ -234,8 +234,8 @@ FORCE_INLINE U64 XXH_readLE64(const void* ptr, XXH_endianess endian) { #define PRIME64_5 2870177450012600261ULL /***************************** -* Simple Hash Functions -*****************************/ + * Simple Hash Functions + *****************************/ FORCE_INLINE U32 XXH32_endian_align(const void* input, size_t len, U32 seed, XXH_endianess endian, XXH_alignment align) { const BYTE* p = (const BYTE*)input; @@ -465,8 +465,8 @@ unsigned long long XXH64(const void* input, size_t len, unsigned long long seed) } /**************************************************** -* Advanced Hash Functions -****************************************************/ + * Advanced Hash Functions 
+ ****************************************************/ /*** Allocation ***/ typedef struct { diff --git a/lib/checksums/xxhash/xxhash.h b/lib/checksums/xxhash/xxhash.h index 6c55265b..a21affbf 100644 --- a/lib/checksums/xxhash/xxhash.h +++ b/lib/checksums/xxhash/xxhash.h @@ -71,14 +71,14 @@ extern "C" { #endif /***************************** -* Definitions -*****************************/ + * Definitions + *****************************/ #include /* size_t */ typedef enum { XXH_OK = 0, XXH_ERROR } XXH_errorcode; /***************************** -* Simple Hash Functions -*****************************/ + * Simple Hash Functions + *****************************/ unsigned int XXH32(const void* input, size_t length, unsigned seed); unsigned long long XXH64(const void* input, size_t length, unsigned long long seed); @@ -98,10 +98,14 @@ XXH64() : */ /***************************** -* Advanced Hash Functions -*****************************/ -typedef struct { long long ll[6]; } XXH32_state_t; -typedef struct { long long ll[11]; } XXH64_state_t; + * Advanced Hash Functions + *****************************/ +typedef struct { + long long ll[6]; +} XXH32_state_t; +typedef struct { + long long ll[11]; +} XXH64_state_t; /* These structures allow static allocation of XXH states. From c33709543cf859746478baec7601ca47c81d1e25 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Thu, 16 Nov 2017 23:19:26 +1000 Subject: [PATCH 128/180] Revert "checksum: use rhash implementation of sha3 hashes" This reverts commit 0b44249564be81ae6218f71fafa4f964d5624d7d. 
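Note on the restored `rm_digest_sha3_steal()`: it finalises a copy of the context rather than the context itself, so a digest can be read out and still accept further updates. Below is a minimal sketch of that copy-then-finalise pattern. It uses a toy FNV-1a style hash in place of SHA-3; `toy_ctx`, `toy_init`, `toy_update`, `toy_finalize` and `toy_steal` are hypothetical names for illustration only and are not part of rmlint or the sha3 library.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy streaming hash state; stands in for sha3_context (hypothetical). */
typedef struct {
    uint64_t h;   /* running hash value */
    uint64_t len; /* total number of bytes consumed so far */
} toy_ctx;

static void toy_init(toy_ctx *ctx) {
    ctx->h = 14695981039346656037ULL; /* FNV-1a 64-bit offset basis */
    ctx->len = 0;
}

static void toy_update(toy_ctx *ctx, const void *data, size_t size) {
    const uint8_t *p = data;
    for(size_t i = 0; i < size; i++) {
        ctx->h ^= p[i];
        ctx->h *= 1099511628211ULL; /* FNV-1a 64-bit prime */
    }
    ctx->len += size;
}

/* Destructive: mixes the length into the state in place, much like a
 * real finaliser that pads and permutes its own context. */
static uint64_t toy_finalize(toy_ctx *ctx) {
    ctx->h ^= ctx->len;
    ctx->h *= 1099511628211ULL;
    return ctx->h;
}

/* Non-destructive: finalise a copy so the caller can keep updating ctx. */
static uint64_t toy_steal(const toy_ctx *ctx) {
    toy_ctx copy = *ctx;
    return toy_finalize(&copy);
}
```

In this sketch, calling `toy_steal()` twice in a row returns the same value, and updates made after a steal continue as if no digest had been read.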
# Conflicts: # lib/checksum.c --- lib/checksum.c | 43 ++-- lib/checksum.h | 5 +- lib/checksums/sha3/byte_order.c | 152 -------------- lib/checksums/sha3/byte_order.h | 178 ---------------- lib/checksums/sha3/sha3.c | 238 +++++++++++++++++++++ lib/checksums/sha3/sha3.h | 54 +++++ lib/checksums/sha3/sha3_rhash.c | 356 -------------------------------- lib/checksums/sha3/sha3_rhash.h | 54 ----- lib/checksums/sha3/ustd.h | 30 --- lib/utilities.h | 2 - 10 files changed, 316 insertions(+), 796 deletions(-) delete mode 100644 lib/checksums/sha3/byte_order.c delete mode 100644 lib/checksums/sha3/byte_order.h create mode 100644 lib/checksums/sha3/sha3.c create mode 100644 lib/checksums/sha3/sha3.h delete mode 100644 lib/checksums/sha3/sha3_rhash.c delete mode 100644 lib/checksums/sha3/sha3_rhash.h delete mode 100644 lib/checksums/sha3/ustd.h diff --git a/lib/checksum.c b/lib/checksum.c index 45b01e48..b8fbfb5c 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -44,7 +44,7 @@ #include "checksums/highwayhash.h" #include "checksums/metrohash.h" #include "checksums/murmur3.h" -#include "checksums/sha3/sha3_rhash.h" +#include "checksums/sha3/sha3.h" #include "checksums/xxhash/xxhash.h" #include "utilities.h" @@ -473,42 +473,41 @@ static const RmDigestInterface sha512_interface = {"sha512", 512, GLIB_FUNCS}; /////////////////////////// static void rm_digest_sha3_init(RmDigest *digest, RmOff seed) { - digest->state = g_slice_alloc0(sizeof(sha3_ctx)); + digest->state = g_slice_alloc0(sizeof(sha3_context)); switch(digest->type) { - case RM_DIGEST_SHA3_256: - rhash_sha3_256_init(digest->state); - break; - case RM_DIGEST_SHA3_384: - rhash_sha3_384_init(digest->state); - break; - case RM_DIGEST_SHA3_512: - rhash_sha3_512_init(digest->state); - break; - default: - g_assert_not_reached(); + case RM_DIGEST_SHA3_256: + sha3_Init256(digest->state); + break; + case RM_DIGEST_SHA3_384: + sha3_Init384(digest->state); + break; + case RM_DIGEST_SHA3_512: + sha3_Init512(digest->state); + break; + 
default: + g_assert_not_reached(); } if(seed) { - rhash_sha3_update(digest->state, (const unsigned char *)&seed, sizeof(seed)); + sha3_Update(digest->state, &seed, sizeof(seed)); } } static void rm_digest_sha3_free(RmDigest *digest) { - g_slice_free(sha3_ctx, digest->state); + g_slice_free(sha3_context, digest->state); } -static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, - RmOff size) { - rhash_sha3_update(digest->state, data, size); +static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, RmOff size) { + sha3_Update(digest->state, data, size); } static void rm_digest_sha3_copy(RmDigest *digest, RmDigest *copy) { - copy->state = g_slice_copy(sizeof(sha3_ctx), digest->state); + copy->state = g_slice_copy(sizeof(sha3_context), digest->state); } static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { - sha3_ctx *copy = g_slice_copy(sizeof(sha3_ctx), digest->state); - rhash_sha3_final(copy, result); - g_slice_free(sha3_ctx, copy); + sha3_context *copy = g_slice_copy(sizeof(sha3_context), digest->state); + memcpy(result, sha3_Finalize(copy), digest->bytes); + g_slice_free(sha3_context, copy); } #define SHA3_INTERFACE(BITS) \ diff --git a/lib/checksum.h b/lib/checksum.h index 7edb726e..fc7e31db 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -30,8 +30,9 @@ #include #include "config.h" -//#include "checksums/blake2/blake2.h" -//#include "checksums/highwayhash.h" +#include "checksums/blake2/blake2.h" +#include "checksums/sha3/sha3.h" +#include "checksums/highwayhash.h" typedef enum RmDigestType { RM_DIGEST_UNKNOWN = 0, diff --git a/lib/checksums/sha3/byte_order.c b/lib/checksums/sha3/byte_order.c deleted file mode 100644 index 9be65c3e..00000000 --- a/lib/checksums/sha3/byte_order.c +++ /dev/null @@ -1,152 +0,0 @@ -/* byte_order.c - byte order related platform dependent routines, - * - * Copyright: 2008-2012 Aleksey Kravchenko - * - * Permission is hereby granted, free of charge, to any person obtaining a 
- * copy of this software and associated documentation files (the "Software"), - * to deal in the Software without restriction, including without limitation - * the rights to use, copy, modify, merge, publish, distribute, sublicense, - * and/or sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY - * or FITNESS FOR A PARTICULAR PURPOSE. Use this program at your own risk! - */ -#include "byte_order.h" - -#ifndef rhash_ctz - -# if _MSC_VER >= 1300 && (_M_IX86 || _M_AMD64 || _M_IA64) /* if MSVC++ >= 2002 on x86/x64 */ -# include -# pragma intrinsic(_BitScanForward) - -/** - * Returns index of the trailing bit of x. - * - * @param x the number to process - * @return zero-based index of the trailing bit - */ -unsigned rhash_ctz(unsigned x) -{ - unsigned long index; - unsigned char isNonzero = _BitScanForward(&index, x); /* MSVC intrinsic */ - return (isNonzero ? (unsigned)index : 0); -} -# else /* _MSC_VER >= 1300... */ - -/** - * Returns index of the trailing bit of a 32-bit number. - * This is a plain C equivalent for GCC __builtin_ctz() bit scan. - * - * @param x the number to process - * @return zero-based index of the trailing bit - */ -unsigned rhash_ctz(unsigned x) -{ - /* array for conversion to bit position */ - static unsigned char bit_pos[32] = { - 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, - 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 - }; - - /* The De Bruijn bit-scan was devised in 1997, according to Donald Knuth - * by Martin Lauter. The constant 0x077CB531UL is a De Bruijn sequence, - * which produces a unique pattern of bits into the high 5 bits for each - * possible bit position that it is multiplied against. 
- * See http://graphics.stanford.edu/~seander/bithacks.html - * and http://chessprogramming.wikispaces.com/BitScan */ - return (unsigned)bit_pos[((uint32_t)((x & -x) * 0x077CB531U)) >> 27]; -} -# endif /* _MSC_VER >= 1300... */ -#endif /* rhash_ctz */ - -/** - * Copy a memory block with simultaneous exchanging byte order. - * The byte order is changed from little-endian 32-bit integers - * to big-endian (or vice-versa). - * - * @param to the pointer where to copy memory block - * @param index the index to start writing from - * @param from the source block to copy - * @param length length of the memory block - */ -void rhash_swap_copy_str_to_u32(void* to, int index, const void* from, size_t length) -{ - /* if all pointers and length are 32-bits aligned */ - if ( 0 == (( (int)((char*)to - (char*)0) | ((char*)from - (char*)0) | index | length ) & 3) ) { - /* copy memory as 32-bit words */ - const uint32_t* src = (const uint32_t*)from; - const uint32_t* end = (const uint32_t*)((const char*)src + length); - uint32_t* dst = (uint32_t*)((char*)to + index); - for (; src < end; dst++, src++) - *dst = bswap_32(*src); - } else { - const char* src = (const char*)from; - for (length += index; (size_t)index < length; index++) - ((char*)to)[index ^ 3] = *(src++); - } -} - -/** - * Copy a memory block with changed byte order. - * The byte order is changed from little-endian 64-bit integers - * to big-endian (or vice-versa). 
- * - * @param to the pointer where to copy memory block - * @param index the index to start writing from - * @param from the source block to copy - * @param length length of the memory block - */ -void rhash_swap_copy_str_to_u64(void* to, int index, const void* from, size_t length) -{ - /* if all pointers and length are 64-bits aligned */ - if ( 0 == (( (int)((char*)to - (char*)0) | ((char*)from - (char*)0) | index | length ) & 7) ) { - /* copy aligned memory block as 64-bit integers */ - const uint64_t* src = (const uint64_t*)from; - const uint64_t* end = (const uint64_t*)((const char*)src + length); - uint64_t* dst = (uint64_t*)((char*)to + index); - while (src < end) *(dst++) = bswap_64( *(src++) ); - } else { - const char* src = (const char*)from; - for (length += index; (size_t)index < length; index++) ((char*)to)[index ^ 7] = *(src++); - } -} - -/** - * Copy data from a sequence of 64-bit words to a binary string of given length, - * while changing byte order. - * - * @param to the binary string to receive data - * @param from the source sequence of 64-bit words - * @param length the size in bytes of the data being copied - */ -void rhash_swap_copy_u64_to_str(void* to, const void* from, size_t length) -{ - /* if all pointers and length are 64-bits aligned */ - if ( 0 == (( (int)((char*)to - (char*)0) | ((char*)from - (char*)0) | length ) & 7) ) { - /* copy aligned memory block as 64-bit integers */ - const uint64_t* src = (const uint64_t*)from; - const uint64_t* end = (const uint64_t*)((const char*)src + length); - uint64_t* dst = (uint64_t*)to; - while (src < end) *(dst++) = bswap_64( *(src++) ); - } else { - size_t index; - char* dst = (char*)to; - for (index = 0; index < length; index++) *(dst++) = ((char*)from)[index ^ 7]; - } -} - -/** - * Exchange byte order in the given array of 32-bit integers. 
- * - * @param arr the array to process - * @param length array length - */ -void rhash_u32_mem_swap(unsigned *arr, int length) -{ - unsigned* end = arr + length; - for (; arr < end; arr++) { - *arr = bswap_32(*arr); - } -} diff --git a/lib/checksums/sha3/byte_order.h b/lib/checksums/sha3/byte_order.h deleted file mode 100644 index 4085f0e5..00000000 --- a/lib/checksums/sha3/byte_order.h +++ /dev/null @@ -1,178 +0,0 @@ -/* byte_order.h */ -#ifndef BYTE_ORDER_H -#define BYTE_ORDER_H -#include "ustd.h" -#include - -#ifdef __GLIBC__ -# include -#endif - -#ifdef __cplusplus -extern "C" { -#endif - -/* if x86 compatible cpu */ -#if defined(i386) || defined(__i386__) || defined(__i486__) || \ - defined(__i586__) || defined(__i686__) || defined(__pentium__) || \ - defined(__pentiumpro__) || defined(__pentium4__) || \ - defined(__nocona__) || defined(prescott) || defined(__core2__) || \ - defined(__k6__) || defined(__k8__) || defined(__athlon__) || \ - defined(__amd64) || defined(__amd64__) || \ - defined(__x86_64) || defined(__x86_64__) || defined(_M_IX86) || \ - defined(_M_AMD64) || defined(_M_IA64) || defined(_M_X64) -/* detect if x86-64 instruction set is supported */ -# if defined(_LP64) || defined(__LP64__) || defined(__x86_64) || \ - defined(__x86_64__) || defined(_M_AMD64) || defined(_M_X64) -# define CPU_X64 -# else -# define CPU_IA32 -# endif -#endif - - -/* detect CPU endianness */ -#if (defined(__BYTE_ORDER) && defined(__LITTLE_ENDIAN) && \ - __BYTE_ORDER == __LITTLE_ENDIAN) || \ - (defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \ - __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__) || \ - defined(CPU_IA32) || defined(CPU_X64) || \ - defined(__ia64) || defined(__ia64__) || defined(__alpha__) || defined(_M_ALPHA) || \ - defined(vax) || defined(MIPSEL) || defined(_ARM_) || defined(__arm__) -# define CPU_LITTLE_ENDIAN -# define IS_BIG_ENDIAN 0 -# define IS_LITTLE_ENDIAN 1 -#elif (defined(__BYTE_ORDER) && defined(__BIG_ENDIAN) && \ - __BYTE_ORDER == 
__BIG_ENDIAN) || \ - (defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \ - __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__) || \ - defined(__sparc) || defined(__sparc__) || defined(sparc) || \ - defined(_ARCH_PPC) || defined(_ARCH_PPC64) || defined(_POWER) || \ - defined(__POWERPC__) || defined(POWERPC) || defined(__powerpc) || \ - defined(__powerpc__) || defined(__powerpc64__) || defined(__ppc__) || \ - defined(__hpux) || defined(_MIPSEB) || defined(mc68000) || \ - defined(__s390__) || defined(__s390x__) || defined(sel) -# define CPU_BIG_ENDIAN -# define IS_BIG_ENDIAN 1 -# define IS_LITTLE_ENDIAN 0 -#else -# error "Can't detect CPU architechture" -#endif - -#ifndef __has_builtin -# define __has_builtin(x) 0 -#endif - -#define IS_ALIGNED_32(p) (0 == (3 & ((const char*)(p) - (const char*)0))) -#define IS_ALIGNED_64(p) (0 == (7 & ((const char*)(p) - (const char*)0))) - -#if defined(_MSC_VER) -#define ALIGN_ATTR(n) __declspec(align(n)) -#elif defined(__GNUC__) -#define ALIGN_ATTR(n) __attribute__((aligned (n))) -#else -#define ALIGN_ATTR(n) /* nothing */ -#endif - - -#if defined(_MSC_VER) || defined(__BORLANDC__) -#define I64(x) x##ui64 -#else -#define I64(x) x##ULL -#endif - - -#ifndef __STRICT_ANSI__ -#define RHASH_INLINE inline -#elif defined(__GNUC__) -#define RHASH_INLINE __inline__ -#else -#define RHASH_INLINE -#endif - -/* define rhash_ctz - count traling zero bits */ -#if (defined(__GNUC__) && __GNUC__ >= 4 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)) || \ - (defined(__clang__) && __has_builtin(__builtin_ctz)) -/* GCC >= 3.4 or clang */ -# define rhash_ctz(x) __builtin_ctz(x) -#else -unsigned rhash_ctz(unsigned); /* define as function */ -#endif - -void rhash_swap_copy_str_to_u32(void* to, int index, const void* from, size_t length); -void rhash_swap_copy_str_to_u64(void* to, int index, const void* from, size_t length); -void rhash_swap_copy_u64_to_str(void* to, const void* from, size_t length); -void rhash_u32_mem_swap(unsigned *p, int length_in_u32); - -/* 
bswap definitions */ -#if (defined(__GNUC__) && (__GNUC__ >= 4) && (__GNUC__ > 4 || __GNUC_MINOR__ >= 3)) || \ - (defined(__clang__) && __has_builtin(__builtin_bswap32) && __has_builtin(__builtin_bswap64)) -/* GCC >= 4.3 or clang */ -# define bswap_32(x) __builtin_bswap32(x) -# define bswap_64(x) __builtin_bswap64(x) -#elif (_MSC_VER > 1300) && (defined(CPU_IA32) || defined(CPU_X64)) /* MS VC */ -# define bswap_32(x) _byteswap_ulong((unsigned long)x) -# define bswap_64(x) _byteswap_uint64((__int64)x) -#else -/* fallback to generic bswap definition */ -static RHASH_INLINE uint32_t bswap_32(uint32_t x) -{ -# if defined(__GNUC__) && defined(CPU_IA32) && !defined(__i386__) && !defined(RHASH_NO_ASM) - __asm("bswap\t%0" : "=r" (x) : "0" (x)); /* gcc x86 version */ - return x; -# else - x = ((x << 8) & 0xFF00FF00u) | ((x >> 8) & 0x00FF00FFu); - return (x >> 16) | (x << 16); -# endif -} -static RHASH_INLINE uint64_t bswap_64(uint64_t x) -{ - union { - uint64_t ll; - uint32_t l[2]; - } w, r; - w.ll = x; - r.l[0] = bswap_32(w.l[1]); - r.l[1] = bswap_32(w.l[0]); - return r.ll; -} -#endif /* bswap definitions */ - -#ifdef CPU_BIG_ENDIAN -# define be2me_32(x) (x) -# define be2me_64(x) (x) -# define le2me_32(x) bswap_32(x) -# define le2me_64(x) bswap_64(x) - -# define be32_copy(to, index, from, length) memcpy((to) + (index), (from), (length)) -# define le32_copy(to, index, from, length) rhash_swap_copy_str_to_u32((to), (index), (from), (length)) -# define be64_copy(to, index, from, length) memcpy((to) + (index), (from), (length)) -# define le64_copy(to, index, from, length) rhash_swap_copy_str_to_u64((to), (index), (from), (length)) -# define me64_to_be_str(to, from, length) memcpy((to), (from), (length)) -# define me64_to_le_str(to, from, length) rhash_swap_copy_u64_to_str((to), (from), (length)) - -#else /* CPU_BIG_ENDIAN */ -# define be2me_32(x) bswap_32(x) -# define be2me_64(x) bswap_64(x) -# define le2me_32(x) (x) -# define le2me_64(x) (x) - -# define be32_copy(to, index, 
from, length) rhash_swap_copy_str_to_u32((to), (index), (from), (length)) -# define le32_copy(to, index, from, length) memcpy((to) + (index), (from), (length)) -# define be64_copy(to, index, from, length) rhash_swap_copy_str_to_u64((to), (index), (from), (length)) -# define le64_copy(to, index, from, length) memcpy((to) + (index), (from), (length)) -# define me64_to_be_str(to, from, length) rhash_swap_copy_u64_to_str((to), (from), (length)) -# define me64_to_le_str(to, from, length) memcpy((to), (from), (length)) -#endif /* CPU_BIG_ENDIAN */ - -/* ROTL/ROTR macros rotate a 32/64-bit word left/right by n bits */ -#define ROTL32(dword, n) ((dword) << (n) ^ ((dword) >> (32 - (n)))) -#define ROTR32(dword, n) ((dword) >> (n) ^ ((dword) << (32 - (n)))) -#define ROTL64(qword, n) ((qword) << (n) ^ ((qword) >> (64 - (n)))) -#define ROTR64(qword, n) ((qword) >> (n) ^ ((qword) << (64 - (n)))) - -#ifdef __cplusplus -} /* extern "C" */ -#endif /* __cplusplus */ - -#endif /* BYTE_ORDER_H */ diff --git a/lib/checksums/sha3/sha3.c b/lib/checksums/sha3/sha3.c new file mode 100644 index 00000000..30f7b969 --- /dev/null +++ b/lib/checksums/sha3/sha3.c @@ -0,0 +1,238 @@ +/* ------------------------------------------------------------------------- + * Works when compiled for either 32-bit or 64-bit targets, optimized for + * 64 bit. + * + * Canonical implementation of Init/Update/Finalize for SHA-3 byte input. + * + * SHA3-256, SHA3-384, SHA-512 are implemented. SHA-224 can easily be added. + * + * Based on code from http://keccak.noekeon.org/ . + * + * I place the code that I wrote into public domain, free to use. + * + * I would appreciate if you give credits to this work if you used it to + * write or test * your code. + * + * Aug 2015. Andrey Jivsov. 
crypto@brainhub.org + * ---------------------------------------------------------------------- */ + +#include +#include +#include + +#include "sha3.h" + +#ifndef SHA3_ROTL64 +#define SHA3_ROTL64(x, y) (((x) << (y)) | ((x) >> ((sizeof(uint64_t) * 8) - (y)))) +#endif + +static const uint64_t keccakf_rndc[24] = { + SHA3_CONST(0x0000000000000001UL), SHA3_CONST(0x0000000000008082UL), + SHA3_CONST(0x800000000000808aUL), SHA3_CONST(0x8000000080008000UL), + SHA3_CONST(0x000000000000808bUL), SHA3_CONST(0x0000000080000001UL), + SHA3_CONST(0x8000000080008081UL), SHA3_CONST(0x8000000000008009UL), + SHA3_CONST(0x000000000000008aUL), SHA3_CONST(0x0000000000000088UL), + SHA3_CONST(0x0000000080008009UL), SHA3_CONST(0x000000008000000aUL), + SHA3_CONST(0x000000008000808bUL), SHA3_CONST(0x800000000000008bUL), + SHA3_CONST(0x8000000000008089UL), SHA3_CONST(0x8000000000008003UL), + SHA3_CONST(0x8000000000008002UL), SHA3_CONST(0x8000000000000080UL), + SHA3_CONST(0x000000000000800aUL), SHA3_CONST(0x800000008000000aUL), + SHA3_CONST(0x8000000080008081UL), SHA3_CONST(0x8000000000008080UL), + SHA3_CONST(0x0000000080000001UL), SHA3_CONST(0x8000000080008008UL)}; + +static const unsigned keccakf_rotc[24] = {1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, + 27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44}; + +static const unsigned keccakf_piln[24] = {10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, + 15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1}; + +/* generally called after SHA3_KECCAK_SPONGE_WORDS-ctx->capacityWords words + * are XORed into the state s + */ +static void keccakf(uint64_t s[25]) { + int i, j, round; + uint64_t t, bc[5]; +#define KECCAK_ROUNDS 24 + + for(round = 0; round < KECCAK_ROUNDS; round++) { + /* Theta */ + for(i = 0; i < 5; i++) + bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20]; + + for(i = 0; i < 5; i++) { + t = bc[(i + 4) % 5] ^ SHA3_ROTL64(bc[(i + 1) % 5], 1); + for(j = 0; j < 25; j += 5) + s[j + i] ^= t; + } + + /* Rho Pi */ + t = s[1]; + for(i = 0; i < 24; i++) { + j 
= keccakf_piln[i]; + bc[0] = s[j]; + s[j] = SHA3_ROTL64(t, keccakf_rotc[i]); + t = bc[0]; + } + + /* Chi */ + for(j = 0; j < 25; j += 5) { + for(i = 0; i < 5; i++) + bc[i] = s[j + i]; + for(i = 0; i < 5; i++) + s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5]; + } + + /* Iota */ + s[0] ^= keccakf_rndc[round]; + } +} + +/* *************************** Public Inteface ************************ */ + +/* For Init or Reset call these: */ +void sha3_Init256(sha3_context *ctx) { + memset(ctx, 0, sizeof(*ctx)); + ctx->capacityWords = 2 * 256 / (8 * sizeof(uint64_t)); +} + +void sha3_Init384(sha3_context *ctx) { + memset(ctx, 0, sizeof(*ctx)); + ctx->capacityWords = 2 * 384 / (8 * sizeof(uint64_t)); +} + +void sha3_Init512(sha3_context *ctx) { + memset(ctx, 0, sizeof(*ctx)); + ctx->capacityWords = 2 * 512 / (8 * sizeof(uint64_t)); +} + +void sha3_Update(sha3_context *ctx, void const *bufIn, size_t len) { + /* 0...7 -- how much is needed to have a word */ + unsigned old_tail = (8 - ctx->byteIndex) & 7; + + size_t words; + unsigned tail; + size_t i; + + const uint8_t *buf = bufIn; + + SHA3_TRACE_BUF("called to update with:", buf, len); + + SHA3_ASSERT(ctx->byteIndex < 8); + SHA3_ASSERT(ctx->wordIndex < sizeof(ctx->s) / sizeof(ctx->s[0])); + + if(len < old_tail) { /* have no complete word or haven't started + * the word yet */ + SHA3_TRACE("because %d<%d, store it and return", (unsigned)len, + (unsigned)old_tail); + /* endian-independent code follows: */ + while(len--) + ctx->saved |= (uint64_t)(*(buf++)) << ((ctx->byteIndex++) * 8); + SHA3_ASSERT(ctx->byteIndex < 8); + return; + } + + if(old_tail) { /* will have one word to process */ + SHA3_TRACE("completing one word with %d bytes", (unsigned)old_tail); + /* endian-independent code follows: */ + len -= old_tail; + while(old_tail--) + ctx->saved |= (uint64_t)(*(buf++)) << ((ctx->byteIndex++) * 8); + + /* now ready to add saved to the sponge */ + ctx->s[ctx->wordIndex] ^= ctx->saved; + SHA3_ASSERT(ctx->byteIndex == 8); + 
ctx->byteIndex = 0; + ctx->saved = 0; + if(++ctx->wordIndex == (SHA3_KECCAK_SPONGE_WORDS - ctx->capacityWords)) { + keccakf(ctx->s); + ctx->wordIndex = 0; + } + } + + /* now work in full words directly from input */ + + SHA3_ASSERT(ctx->byteIndex == 0); + + words = len / sizeof(uint64_t); + tail = len - words * sizeof(uint64_t); + + SHA3_TRACE("have %d full words to process", (unsigned)words); + + for(i = 0; i < words; i++, buf += sizeof(uint64_t)) { + const uint64_t t = (uint64_t)(buf[0]) | ((uint64_t)(buf[1]) << 8 * 1) | + ((uint64_t)(buf[2]) << 8 * 2) | ((uint64_t)(buf[3]) << 8 * 3) | + ((uint64_t)(buf[4]) << 8 * 4) | ((uint64_t)(buf[5]) << 8 * 5) | + ((uint64_t)(buf[6]) << 8 * 6) | ((uint64_t)(buf[7]) << 8 * 7); +#if defined(__x86_64__) || defined(__i386__) + SHA3_ASSERT(memcmp(&t, buf, 8) == 0); +#endif + ctx->s[ctx->wordIndex] ^= t; + if(++ctx->wordIndex == (SHA3_KECCAK_SPONGE_WORDS - ctx->capacityWords)) { + keccakf(ctx->s); + ctx->wordIndex = 0; + } + } + + SHA3_TRACE("have %d bytes left to process, save them", (unsigned)tail); + + /* finally, save the partial word */ + SHA3_ASSERT(ctx->byteIndex == 0 && tail < 8); + while(tail--) { + SHA3_TRACE("Store byte %02x '%c'", *buf, *buf); + ctx->saved |= (uint64_t)(*(buf++)) << ((ctx->byteIndex++) * 8); + } + SHA3_ASSERT(ctx->byteIndex < 8); + SHA3_TRACE("Have saved=0x%016" PRIx64 " at the end", ctx->saved); +} + +/* This is simply the 'update' with the padding block. + * The padding block is 0x01 || 0x00* || 0x80. First 0x01 and last 0x80 + * bytes are always present, but they can be the same byte. + */ +void const *sha3_Finalize(sha3_context *ctx) { + SHA3_TRACE("called with %d bytes in the buffer", ctx->byteIndex); + +/* Append 2-bit suffix 01, per SHA-3 spec. Instead of 1 for padding we + * use 1<<2 below. The 0x02 below corresponds to the suffix 01. + * Overall, we feed 0, then 1, and finally 1 to start padding. Without + * M || 01, we would simply use 1 to start padding. 
*/ + +#ifndef SHA3_USE_KECCAK + /* SHA3 version */ + ctx->s[ctx->wordIndex] ^= (ctx->saved ^ ((uint64_t)((uint64_t)(0x02 | (1 << 2)) + << ((ctx->byteIndex) * 8)))); +#else + /* For testing the "pure" Keccak version */ + ctx->s[ctx->wordIndex] ^= + (ctx->saved ^ ((uint64_t)((uint64_t)1 << (ctx->byteIndex * 8)))); +#endif + + ctx->s[SHA3_KECCAK_SPONGE_WORDS - ctx->capacityWords - 1] ^= + SHA3_CONST(0x8000000000000000UL); + keccakf(ctx->s); + + /* Return first bytes of the ctx->s. This conversion is not needed for + * little-endian platforms e.g. wrap with #if !defined(__BYTE_ORDER__) + * || !defined(__ORDER_LITTLE_ENDIAN__) || \ + * __BYTE_ORDER__!=__ORDER_LITTLE_ENDIAN__ ... the conversion below ... + * #endif */ + { + unsigned i; + for(i = 0; i < SHA3_KECCAK_SPONGE_WORDS; i++) { + const unsigned t1 = (uint32_t)ctx->s[i]; + const unsigned t2 = (uint32_t)((ctx->s[i] >> 16) >> 16); + ctx->sb[i * 8 + 0] = (uint8_t)(t1); + ctx->sb[i * 8 + 1] = (uint8_t)(t1 >> 8); + ctx->sb[i * 8 + 2] = (uint8_t)(t1 >> 16); + ctx->sb[i * 8 + 3] = (uint8_t)(t1 >> 24); + ctx->sb[i * 8 + 4] = (uint8_t)(t2); + ctx->sb[i * 8 + 5] = (uint8_t)(t2 >> 8); + ctx->sb[i * 8 + 6] = (uint8_t)(t2 >> 16); + ctx->sb[i * 8 + 7] = (uint8_t)(t2 >> 24); + } + } + + SHA3_TRACE_BUF("Hash: (first 32 bytes)", ctx->sb, 256 / 8); + + return (ctx->sb); +} diff --git a/lib/checksums/sha3/sha3.h b/lib/checksums/sha3/sha3.h new file mode 100644 index 00000000..39d03d95 --- /dev/null +++ b/lib/checksums/sha3/sha3.h @@ -0,0 +1,54 @@ +#ifndef _RM_CHECKSUM_SHA3 +#define _RM_CHECKSUM_SHA3 + +#include + +#define SHA3_ASSERT(x) +#if defined(_MSC_VER) +#define SHA3_TRACE(format, ...) +#define SHA3_TRACE_BUF(format, buf, l, ...) +#else +#define SHA3_TRACE(format, args...) +#define SHA3_TRACE_BUF(format, buf, l, args...) +#endif + +//#define SHA3_USE_KECCAK +/* + * Define SHA3_USE_KECCAK to run "pure" Keccak, as opposed to SHA3. 
+ * The tests that this macro enables use the input and output from [Keccak] + * (see the reference below). The used test vectors aren't correct for SHA3, + * however, they are helpful to verify the implementation. + * SHA3_USE_KECCAK only changes one line of code in Finalize. + */ + +#if defined(_MSC_VER) +#define SHA3_CONST(x) x +#else +#define SHA3_CONST(x) x##L +#endif + +/* 'Words' here refers to uint64_t */ +#define SHA3_KECCAK_SPONGE_WORDS (((1600) / 8 /*bits to byte*/) / sizeof(uint64_t)) +typedef struct sha3_context_ { + uint64_t saved; /* the portion of the input message that we + * didn't consume yet */ + union { /* Keccak's state */ + uint64_t s[SHA3_KECCAK_SPONGE_WORDS]; + uint8_t sb[SHA3_KECCAK_SPONGE_WORDS * 8]; + }; + unsigned byteIndex; /* 0..7--the next byte after the set one + * (starts from 0; 0--none are buffered) */ + unsigned wordIndex; /* 0..24--the next word to integrate input + * (starts from 0) */ + unsigned capacityWords; /* the double size of the hash output in + * words (e.g. 16 for Keccak 512) */ +} sha3_context; + +void sha3_Init256(sha3_context *ctx); +void sha3_Init384(sha3_context *ctx); +void sha3_Init512(sha3_context *ctx); + +void sha3_Update(sha3_context *ctx, void const *bufIn, size_t len); +void const *sha3_Finalize(sha3_context *ctx); + +#endif /* _RM_CHECKSUM_SHA3 */ diff --git a/lib/checksums/sha3/sha3_rhash.c b/lib/checksums/sha3/sha3_rhash.c deleted file mode 100644 index 05633b03..00000000 --- a/lib/checksums/sha3/sha3_rhash.c +++ /dev/null @@ -1,356 +0,0 @@ -/* sha3.c - an implementation of Secure Hash Algorithm 3 (Keccak). - * based on the - * The Keccak SHA-3 submission. 
Submission to NIST (Round 3), 2011 - * by Guido Bertoni, Joan Daemen, Michaël Peeters and Gilles Van Assche - * - * Copyright: 2013 Aleksey Kravchenko - * - * Permission is hereby granted, free of charge, to any person obtaining a - * copy of this software and associated documentation files (the "Software"), - * to deal in the Software without restriction, including without limitation - * the rights to use, copy, modify, merge, publish, distribute, sublicense, - * and/or sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY - * or FITNESS FOR A PARTICULAR PURPOSE. Use this program at your own risk! - */ - -#include -#include -#include "byte_order.h" -#include "sha3_rhash.h" - -/* constants */ -#define NumberOfRounds 24 - -/* SHA3 (Keccak) constants for 24 rounds */ -static uint64_t keccak_round_constants[NumberOfRounds] = { - I64(0x0000000000000001), I64(0x0000000000008082), I64(0x800000000000808A), I64(0x8000000080008000), - I64(0x000000000000808B), I64(0x0000000080000001), I64(0x8000000080008081), I64(0x8000000000008009), - I64(0x000000000000008A), I64(0x0000000000000088), I64(0x0000000080008009), I64(0x000000008000000A), - I64(0x000000008000808B), I64(0x800000000000008B), I64(0x8000000000008089), I64(0x8000000000008003), - I64(0x8000000000008002), I64(0x8000000000000080), I64(0x000000000000800A), I64(0x800000008000000A), - I64(0x8000000080008081), I64(0x8000000000008080), I64(0x0000000080000001), I64(0x8000000080008008) -}; - -/* Initializing a sha3 context for given number of output bits */ -static void rhash_keccak_init(sha3_ctx *ctx, unsigned bits) -{ - /* NB: The Keccak capacity parameter = bits * 2 */ - unsigned rate = 1600 - bits * 2; - - memset(ctx, 0, sizeof(sha3_ctx)); - ctx->block_size = rate / 8; - assert(rate <= 1600 && (rate % 64) == 0); -} - -/** - * 
Initialize context before calculating hash. - * - * @param ctx context to initialize - */ -void rhash_sha3_224_init(sha3_ctx *ctx) -{ - rhash_keccak_init(ctx, 224); -} - -/** - * Initialize context before calculating hash. - * - * @param ctx context to initialize - */ -void rhash_sha3_256_init(sha3_ctx *ctx) -{ - rhash_keccak_init(ctx, 256); -} - -/** - * Initialize context before calculating hash. - * - * @param ctx context to initialize - */ -void rhash_sha3_384_init(sha3_ctx *ctx) -{ - rhash_keccak_init(ctx, 384); -} - -/** - * Initialize context before calculating hash. - * - * @param ctx context to initialize - */ -void rhash_sha3_512_init(sha3_ctx *ctx) -{ - rhash_keccak_init(ctx, 512); -} - -/* Keccak theta() transformation */ -static void keccak_theta(uint64_t *A) -{ - unsigned int x; - uint64_t C[5], D[5]; - - for (x = 0; x < 5; x++) { - C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20]; - } - D[0] = ROTL64(C[1], 1) ^ C[4]; - D[1] = ROTL64(C[2], 1) ^ C[0]; - D[2] = ROTL64(C[3], 1) ^ C[1]; - D[3] = ROTL64(C[4], 1) ^ C[2]; - D[4] = ROTL64(C[0], 1) ^ C[3]; - - for (x = 0; x < 5; x++) { - A[x] ^= D[x]; - A[x + 5] ^= D[x]; - A[x + 10] ^= D[x]; - A[x + 15] ^= D[x]; - A[x + 20] ^= D[x]; - } -} - -/* Keccak pi() transformation */ -static void keccak_pi(uint64_t *A) -{ - uint64_t A1; - A1 = A[1]; - A[ 1] = A[ 6]; - A[ 6] = A[ 9]; - A[ 9] = A[22]; - A[22] = A[14]; - A[14] = A[20]; - A[20] = A[ 2]; - A[ 2] = A[12]; - A[12] = A[13]; - A[13] = A[19]; - A[19] = A[23]; - A[23] = A[15]; - A[15] = A[ 4]; - A[ 4] = A[24]; - A[24] = A[21]; - A[21] = A[ 8]; - A[ 8] = A[16]; - A[16] = A[ 5]; - A[ 5] = A[ 3]; - A[ 3] = A[18]; - A[18] = A[17]; - A[17] = A[11]; - A[11] = A[ 7]; - A[ 7] = A[10]; - A[10] = A1; - /* note: A[ 0] is left as is */ -} - -/* Keccak chi() transformation */ -static void keccak_chi(uint64_t *A) -{ - int i; - for (i = 0; i < 25; i += 5) { - uint64_t A0 = A[0 + i], A1 = A[1 + i]; - A[0 + i] ^= ~A1 & A[2 + i]; - A[1 + i] ^= ~A[2 + i] & A[3 + i]; - A[2 
+ i] ^= ~A[3 + i] & A[4 + i]; - A[3 + i] ^= ~A[4 + i] & A0; - A[4 + i] ^= ~A0 & A1; - } -} - -static void rhash_sha3_permutation(uint64_t *state) -{ - int round; - for (round = 0; round < NumberOfRounds; round++) - { - keccak_theta(state); - - /* apply Keccak rho() transformation */ - state[ 1] = ROTL64(state[ 1], 1); - state[ 2] = ROTL64(state[ 2], 62); - state[ 3] = ROTL64(state[ 3], 28); - state[ 4] = ROTL64(state[ 4], 27); - state[ 5] = ROTL64(state[ 5], 36); - state[ 6] = ROTL64(state[ 6], 44); - state[ 7] = ROTL64(state[ 7], 6); - state[ 8] = ROTL64(state[ 8], 55); - state[ 9] = ROTL64(state[ 9], 20); - state[10] = ROTL64(state[10], 3); - state[11] = ROTL64(state[11], 10); - state[12] = ROTL64(state[12], 43); - state[13] = ROTL64(state[13], 25); - state[14] = ROTL64(state[14], 39); - state[15] = ROTL64(state[15], 41); - state[16] = ROTL64(state[16], 45); - state[17] = ROTL64(state[17], 15); - state[18] = ROTL64(state[18], 21); - state[19] = ROTL64(state[19], 8); - state[20] = ROTL64(state[20], 18); - state[21] = ROTL64(state[21], 2); - state[22] = ROTL64(state[22], 61); - state[23] = ROTL64(state[23], 56); - state[24] = ROTL64(state[24], 14); - - keccak_pi(state); - keccak_chi(state); - - /* apply iota(state, round) */ - *state ^= keccak_round_constants[round]; - } -} - -/** - * The core transformation. Process the specified block of data. 
- * - * @param hash the algorithm state - * @param block the message block to process - * @param block_size the size of the processed block in bytes - */ -static void rhash_sha3_process_block(uint64_t hash[25], const uint64_t *block, size_t block_size) -{ - /* expanded loop */ - hash[ 0] ^= le2me_64(block[ 0]); - hash[ 1] ^= le2me_64(block[ 1]); - hash[ 2] ^= le2me_64(block[ 2]); - hash[ 3] ^= le2me_64(block[ 3]); - hash[ 4] ^= le2me_64(block[ 4]); - hash[ 5] ^= le2me_64(block[ 5]); - hash[ 6] ^= le2me_64(block[ 6]); - hash[ 7] ^= le2me_64(block[ 7]); - hash[ 8] ^= le2me_64(block[ 8]); - /* if not sha3-512 */ - if (block_size > 72) { - hash[ 9] ^= le2me_64(block[ 9]); - hash[10] ^= le2me_64(block[10]); - hash[11] ^= le2me_64(block[11]); - hash[12] ^= le2me_64(block[12]); - /* if not sha3-384 */ - if (block_size > 104) { - hash[13] ^= le2me_64(block[13]); - hash[14] ^= le2me_64(block[14]); - hash[15] ^= le2me_64(block[15]); - hash[16] ^= le2me_64(block[16]); - /* if not sha3-256 */ - if (block_size > 136) { - hash[17] ^= le2me_64(block[17]); -#ifdef FULL_SHA3_FAMILY_SUPPORT - /* if not sha3-224 */ - if (block_size > 144) { - hash[18] ^= le2me_64(block[18]); - hash[19] ^= le2me_64(block[19]); - hash[20] ^= le2me_64(block[20]); - hash[21] ^= le2me_64(block[21]); - hash[22] ^= le2me_64(block[22]); - hash[23] ^= le2me_64(block[23]); - hash[24] ^= le2me_64(block[24]); - } -#endif - } - } - } - /* make a permutation of the hash */ - rhash_sha3_permutation(hash); -} - -#define SHA3_FINALIZED 0x80000000 - -/** - * Calculate message hash. - * Can be called repeatedly with chunks of the message to be hashed. 
- * - * @param ctx the algorithm context containing current hashing state - * @param msg message chunk - * @param size length of the message chunk - */ -void rhash_sha3_update(sha3_ctx *ctx, const unsigned char *msg, size_t size) -{ - size_t index = (size_t)ctx->rest; - size_t block_size = (size_t)ctx->block_size; - - if (ctx->rest & SHA3_FINALIZED) return; /* too late for additional input */ - ctx->rest = (unsigned)((ctx->rest + size) % block_size); - - /* fill partial block */ - if (index) { - size_t left = block_size - index; - memcpy((char*)ctx->message + index, msg, (size < left ? size : left)); - if (size < left) return; - - /* process partial block */ - rhash_sha3_process_block(ctx->hash, ctx->message, block_size); - msg += left; - size -= left; - } - while (size >= block_size) { - uint64_t* aligned_message_block; - if (IS_ALIGNED_64(msg)) { - /* the most common case is processing of an already aligned message - without copying it */ - aligned_message_block = (uint64_t*)msg; - } else { - memcpy(ctx->message, msg, block_size); - aligned_message_block = ctx->message; - } - - rhash_sha3_process_block(ctx->hash, aligned_message_block, block_size); - msg += block_size; - size -= block_size; - } - if (size) { - memcpy(ctx->message, msg, size); /* save leftovers */ - } -} - -/** - * Store calculated hash into the given array. 
- * - * @param ctx the algorithm context containing current hashing state - * @param result calculated hash in binary form - */ -void rhash_sha3_final(sha3_ctx *ctx, unsigned char* result) -{ - size_t digest_length = 100 - ctx->block_size / 2; - const size_t block_size = ctx->block_size; - - if (!(ctx->rest & SHA3_FINALIZED)) - { - /* clear the rest of the data queue */ - memset((char*)ctx->message + ctx->rest, 0, block_size - ctx->rest); - ((char*)ctx->message)[ctx->rest] |= 0x06; - ((char*)ctx->message)[block_size - 1] |= 0x80; - - /* process final block */ - rhash_sha3_process_block(ctx->hash, ctx->message, block_size); - ctx->rest = SHA3_FINALIZED; /* mark context as finalized */ - } - - assert(block_size > digest_length); - if (result) me64_to_le_str(result, ctx->hash, digest_length); -} - -#ifdef USE_KECCAK -/** -* Store calculated hash into the given array. -* -* @param ctx the algorithm context containing current hashing state -* @param result calculated hash in binary form -*/ -void rhash_keccak_final(sha3_ctx *ctx, unsigned char* result) -{ - size_t digest_length = 100 - ctx->block_size / 2; - const size_t block_size = ctx->block_size; - - if (!(ctx->rest & SHA3_FINALIZED)) - { - /* clear the rest of the data queue */ - memset((char*)ctx->message + ctx->rest, 0, block_size - ctx->rest); - ((char*)ctx->message)[ctx->rest] |= 0x01; - ((char*)ctx->message)[block_size - 1] |= 0x80; - - /* process final block */ - rhash_sha3_process_block(ctx->hash, ctx->message, block_size); - ctx->rest = SHA3_FINALIZED; /* mark context as finalized */ - } - - assert(block_size > digest_length); - if (result) me64_to_le_str(result, ctx->hash, digest_length); -} -#endif /* USE_KECCAK */ diff --git a/lib/checksums/sha3/sha3_rhash.h b/lib/checksums/sha3/sha3_rhash.h deleted file mode 100644 index 28319978..00000000 --- a/lib/checksums/sha3/sha3_rhash.h +++ /dev/null @@ -1,54 +0,0 @@ -/* sha3.h */ -#ifndef RHASH_SHA3_H -#define RHASH_SHA3_H -#include "ustd.h" - -#ifdef 
__cplusplus -extern "C" { -#endif - -#define sha3_224_hash_size 28 -#define sha3_256_hash_size 32 -#define sha3_384_hash_size 48 -#define sha3_512_hash_size 64 -#define sha3_max_permutation_size 25 -#define sha3_max_rate_in_qwords 24 - -/** - * SHA3 Algorithm context. - */ -typedef struct sha3_ctx -{ - /* 1600 bits algorithm hashing state */ - uint64_t hash[sha3_max_permutation_size]; - /* 1536-bit buffer for leftovers */ - uint64_t message[sha3_max_rate_in_qwords]; - /* count of bytes in the message[] buffer */ - unsigned rest; - /* size of a message block processed at once */ - unsigned block_size; -} sha3_ctx; - -/* methods for calculating the hash function */ - -void rhash_sha3_224_init(sha3_ctx *ctx); -void rhash_sha3_256_init(sha3_ctx *ctx); -void rhash_sha3_384_init(sha3_ctx *ctx); -void rhash_sha3_512_init(sha3_ctx *ctx); -void rhash_sha3_update(sha3_ctx *ctx, const unsigned char* msg, size_t size); -void rhash_sha3_final(sha3_ctx *ctx, unsigned char* result); - -#ifdef USE_KECCAK -#define rhash_keccak_224_init rhash_sha3_224_init -#define rhash_keccak_256_init rhash_sha3_256_init -#define rhash_keccak_384_init rhash_sha3_384_init -#define rhash_keccak_512_init rhash_sha3_512_init -#define rhash_keccak_update rhash_sha3_update -void rhash_keccak_final(sha3_ctx *ctx, unsigned char* result); -#endif - -#ifdef __cplusplus -} /* extern "C" */ -#endif /* __cplusplus */ - -#endif /* RHASH_SHA3_H */ diff --git a/lib/checksums/sha3/ustd.h b/lib/checksums/sha3/ustd.h deleted file mode 100644 index 94f1ae26..00000000 --- a/lib/checksums/sha3/ustd.h +++ /dev/null @@ -1,30 +0,0 @@ -/* ustd.h common macros and includes */ -#ifndef LIBRHASH_USTD_H -#define LIBRHASH_USTD_H - -#if _MSC_VER >= 1300 - -# define int64_t __int64 -# define int32_t __int32 -# define int16_t __int16 -# define int8_t __int8 -# define uint64_t unsigned __int64 -# define uint32_t unsigned __int32 -# define uint16_t unsigned __int16 -# define uint8_t unsigned __int8 - -/* disable warnings: The POSIX 
name for this item is deprecated. Use the ISO C++ conformant name. */ -#pragma warning(disable : 4996) - -#else /* _MSC_VER >= 1300 */ - -# include -# include - -#endif /* _MSC_VER >= 1300 */ - -#if _MSC_VER <= 1300 -# include /* size_t for vc6.0 */ -#endif /* _MSC_VER <= 1300 */ - -#endif /* LIBRHASH_USTD_H */ diff --git a/lib/utilities.h b/lib/utilities.h index 114f4bb7..75ecb9e5 100644 --- a/lib/utilities.h +++ b/lib/utilities.h @@ -34,8 +34,6 @@ #include #include #include -#include - /* Pat(h)tricia Trie implementation */ #include "pathtricia.h" From 0fa8e5eb53686c69dc5cd5a89ad696e172d61aeb Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 17 Nov 2017 08:28:45 +1000 Subject: [PATCH 129/180] checksum: rename rm_digest_interface() to rm_digest_interface_get() --- lib/checksum.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index b8fbfb5c..fc35009c 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -683,7 +683,8 @@ static const RmDigestInterface paranoid_interface = { // RmDigestInterface map // //////////////////////////////// -static const RmDigestInterface *rm_digest_interface(RmDigestType type) { +static const RmDigestInterface *rm_digest_get_interface(RmDigestType type) { + static const RmDigestInterface *digest_interfaces[] = { [RM_DIGEST_UNKNOWN] = NULL, [RM_DIGEST_MURMUR] = &murmur_interface, @@ -733,7 +734,7 @@ static void rm_digest_table_insert(GHashTable *code_table, char *name, static gpointer rm_init_digest_type_table(GHashTable **code_table) { *code_table = g_hash_table_new(g_str_hash, g_str_equal); for(RmDigestType type = 1; type < RM_DIGEST_SENTINEL; type++) { - rm_digest_table_insert(*code_table, (char *)rm_digest_interface(type)->name, + rm_digest_table_insert(*code_table, (char *)rm_digest_get_interface(type)->name, type); } @@ -762,7 +763,7 @@ RmDigestType rm_string_to_digest_type(const char *string) { } const char *rm_digest_type_to_string(RmDigestType type) { - const 
RmDigestInterface *interface = rm_digest_interface(type); + const RmDigestInterface *interface = rm_digest_get_interface(type); return interface->name; } @@ -779,7 +780,7 @@ int rm_digest_type_to_multihash_id(RmDigestType type) { RmDigest *rm_digest_new(RmDigestType type, RmOff seed) { g_assert(type != RM_DIGEST_UNKNOWN); - const RmDigestInterface *interface = rm_digest_interface(type); + const RmDigestInterface *interface = rm_digest_get_interface(type); RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; digest->bytes = interface->bits / 8; @@ -797,13 +798,13 @@ void rm_digest_release_buffers(RmDigest *digest) { } void rm_digest_free(RmDigest *digest) { - const RmDigestInterface *interface = rm_digest_interface(digest->type); + const RmDigestInterface *interface = rm_digest_get_interface(digest->type); interface->free(digest); g_slice_free(RmDigest, digest); } void rm_digest_update(RmDigest *digest, const unsigned char *data, RmOff size) { - const RmDigestInterface *interface = rm_digest_interface(digest->type); + const RmDigestInterface *interface = rm_digest_get_interface(digest->type); interface->update(digest, data, size); } @@ -890,14 +891,14 @@ RmDigest *rm_digest_copy(RmDigest *digest) { RmDigest *copy = g_slice_copy(sizeof(RmDigest), digest); - const RmDigestInterface *interface = rm_digest_interface(digest->type); + const RmDigestInterface *interface = rm_digest_get_interface(digest->type); interface->copy(digest, copy); return copy; } guint8 *rm_digest_steal(RmDigest *digest) { - const RmDigestInterface *interface = rm_digest_interface(digest->type); + const RmDigestInterface *interface = rm_digest_get_interface(digest->type); if(!interface->steal) { return g_slice_copy(digest->bytes, digest->state); } @@ -934,7 +935,7 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { return false; } - const RmDigestInterface *interface = rm_digest_interface(a->type); + const RmDigestInterface *interface = rm_digest_get_interface(a->type); if(a->type 
== RM_DIGEST_PARANOID) { RmParanoid *pa = a->state; From 7877a9afa93b6e16435d443a6efa36136eb68b46 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 17 Nov 2017 08:30:45 +1000 Subject: [PATCH 130/180] checksum: tidy up interfaces a bit more --- lib/checksum.c | 191 ++++++++++++++++++++++++++----------------------- 1 file changed, 100 insertions(+), 91 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index fc35009c..04f1f062 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -139,6 +139,7 @@ RM_DIGEST_DEFINE_INTERFACE(xxhash, 64); /////////////////////////// #if RM_PLATFORM_32 +/* use 32-bit optimised murmur hash interface */ static void rm_digest_murmur_init(RmDigest *digest, RmOff seed) { digest->state = MurmurHash3_x86_128_new(seed, seed >> 32, seed, seed >> 32); @@ -163,6 +164,7 @@ static void rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { } #elif RM_PLATFORM_64 +/* use 64-bit optimised murmur hash interface */ static void rm_digest_murmur_init(RmDigest *digest, RmOff seed) { digest->state = MurmurHash3_x64_128_new(seed, seed); @@ -244,12 +246,17 @@ RM_DIGEST_DEFINE_INTERFACE(metro, 128); RM_DIGEST_DEFINE_INTERFACE(metro256, 256); #if HAVE_SSE4 +/* also define crc-optimised metro variants metrocrc and metrocrc256*/ /* some of the interface procedures are common between crc- and non-crc-variants */ #define rm_digest_metrocrc_init rm_digest_metro_init #define rm_digest_metrocrc_free rm_digest_metro_free #define rm_digest_metrocrc_copy rm_digest_metro_copy +#define rm_digest_metrocrc256_init rm_digest_metro256_init +#define rm_digest_metrocrc256_free rm_digest_metro256_free +#define rm_digest_metrocrc256_copy rm_digest_metro256_copy + static void rm_digest_metrocrc_update(RmDigest *digest, const unsigned char *data, RmOff size) { metrohash128crc_update(digest->state, data, size); @@ -259,11 +266,6 @@ static void rm_digest_metrocrc_steal(RmDigest *digest, guint8 *result) { metrohash128crc_1_steal(digest->state, result); } -/* some of the 
interface procedures are common between crc- and non-crc-variants */ -#define rm_digest_metrocrc256_init rm_digest_metro256_init -#define rm_digest_metrocrc256_free rm_digest_metro256_free -#define rm_digest_metrocrc256_copy rm_digest_metro256_copy - static void rm_digest_metrocrc256_update(RmDigest *digest, const unsigned char *data, RmOff size) { metrohash256crc_update(digest->state, data, size); @@ -298,13 +300,14 @@ RM_DIGEST_DEFINE_INTERFACE(metrocrc256, 256); #endif +#define RM_DIGEST_CUMULATIVE_INTS (RM_DIGEST_CUMULATIVE_LEN / RM_DIGEST_CUMULATIVE_ALIGN) + typedef struct RmDigestCumulative { union { guint8 data[RM_DIGEST_CUMULATIVE_LEN]; - RM_DIGEST_CUMULATIVE_T - bigdata[RM_DIGEST_CUMULATIVE_LEN / RM_DIGEST_CUMULATIVE_ALIGN]; + RM_DIGEST_CUMULATIVE_T bigdata[RM_DIGEST_CUMULATIVE_INTS]; }; - RM_DIGEST_CUMULATIVE_T pos; /* could be smaller but this is faster */ + RM_DIGEST_CUMULATIVE_T pos; /* byte offset within data */ } RmDigestCumulative; static void rm_digest_cumulative_init(RmDigest *digest, RmOff seed) { @@ -358,13 +361,7 @@ static void rm_digest_cumulative_steal(RmDigest *digest, guint8 *result) { memcpy(result, state->data, RM_DIGEST_CUMULATIVE_LEN); } -static const RmDigestInterface cumulative_interface = {"cumulative", - 8 * RM_DIGEST_CUMULATIVE_LEN, - rm_digest_cumulative_init, - rm_digest_cumulative_free, - rm_digest_cumulative_update, - rm_digest_cumulative_copy, - rm_digest_cumulative_steal}; +RM_DIGEST_DEFINE_INTERFACE(cumulative, 8 * RM_DIGEST_CUMULATIVE_LEN) /////////////////////////// // highway hash // @@ -393,7 +390,8 @@ static void rm_digest_highway_copy(RmDigest *digest, RmDigest *copy) { copy->state = g_slice_copy(sizeof(HighwayHashCat), digest->state); } -/* HighwayHashCatFinish functions are non-destructive */ +/* HighwayHashCatFinish functions are non-destructive so steal funcs don't + * need to make a copy */ static void rm_digest_highway256_steal(RmDigest *digest, guint8 *result) { HighwayHashCatFinish256(digest->state, (uint64_t 
*)result); } @@ -406,15 +404,27 @@ static void rm_digest_highway64_steal(RmDigest *digest, guint8 *result) { *(uint64_t *)result = HighwayHashCatFinish64(digest->state); } -#define HIGHWAY_INTERFACE(BITS) \ - BITS, rm_digest_highway_init, rm_digest_highway_free, rm_digest_highway_update, \ - rm_digest_highway_copy, rm_digest_highway##BITS##_steal + /* highway hashes share common interface functions other than steal: */ + +#define rm_digest_highway64_init rm_digest_highway_init +#define rm_digest_highway128_init rm_digest_highway_init +#define rm_digest_highway256_init rm_digest_highway_init -static const RmDigestInterface highway256_interface = {"highway256", - HIGHWAY_INTERFACE(256)}; -static const RmDigestInterface highway128_interface = {"highway128", - HIGHWAY_INTERFACE(128)}; -static const RmDigestInterface highway64_interface = {"highway64", HIGHWAY_INTERFACE(64)}; +#define rm_digest_highway64_free rm_digest_highway_free +#define rm_digest_highway128_free rm_digest_highway_free +#define rm_digest_highway256_free rm_digest_highway_free + +#define rm_digest_highway64_update rm_digest_highway_update +#define rm_digest_highway128_update rm_digest_highway_update +#define rm_digest_highway256_update rm_digest_highway_update + +#define rm_digest_highway64_copy rm_digest_highway_copy +#define rm_digest_highway128_copy rm_digest_highway_copy +#define rm_digest_highway256_copy rm_digest_highway_copy + +RM_DIGEST_DEFINE_INTERFACE(highway64, 64) +RM_DIGEST_DEFINE_INTERFACE(highway128, 128) +RM_DIGEST_DEFINE_INTERFACE(highway256, 256) /////////////////////////// // glib hashes // @@ -457,36 +467,45 @@ static void rm_digest_glib_steal(RmDigest *digest, guint8 *result) { g_checksum_free(copy); } -#define GLIB_FUNCS \ - rm_digest_glib_init, rm_digest_glib_free, rm_digest_glib_update, \ - rm_digest_glib_copy, rm_digest_glib_steal +#define RM_DIGEST_DEFINE_GLIB(NAME, BITS) \ + static const RmDigestInterface NAME##_interface = {.name = (#NAME), \ + .bits = (BITS), \ + .init = 
rm_digest_glib_init, \ + .free = rm_digest_glib_free, \ + .update = rm_digest_glib_update, \ + .copy = rm_digest_glib_copy, \ + .steal = rm_digest_glib_steal}; -static const RmDigestInterface md5_interface = {"md5", 128, GLIB_FUNCS}; -static const RmDigestInterface sha1_interface = {"sha1", 160, GLIB_FUNCS}; -static const RmDigestInterface sha256_interface = {"sha256", 256, GLIB_FUNCS}; +RM_DIGEST_DEFINE_GLIB(md5, 128); +RM_DIGEST_DEFINE_GLIB(sha1, 160); +RM_DIGEST_DEFINE_GLIB(sha256, 256); #if HAVE_SHA512 -static const RmDigestInterface sha512_interface = {"sha512", 512, GLIB_FUNCS}; +RM_DIGEST_DEFINE_GLIB(sha512, 512); #endif /////////////////////////// // sha3 hashes // /////////////////////////// -static void rm_digest_sha3_init(RmDigest *digest, RmOff seed) { +static void rm_digest_sha3_256_init(RmDigest *digest, RmOff seed) { digest->state = g_slice_alloc0(sizeof(sha3_context)); - switch(digest->type) { - case RM_DIGEST_SHA3_256: - sha3_Init256(digest->state); - break; - case RM_DIGEST_SHA3_384: - sha3_Init384(digest->state); - break; - case RM_DIGEST_SHA3_512: - sha3_Init512(digest->state); - break; - default: - g_assert_not_reached(); + sha3_Init256(digest->state); + if(seed) { + sha3_Update(digest->state, &seed, sizeof(seed)); } +} + +static void rm_digest_sha3_384_init(RmDigest *digest, RmOff seed) { + digest->state = g_slice_alloc0(sizeof(sha3_context)); + sha3_Init384(digest->state); + if(seed) { + sha3_Update(digest->state, &seed, sizeof(seed)); + } +} + +static void rm_digest_sha3_512_init(RmDigest *digest, RmOff seed) { + digest->state = g_slice_alloc0(sizeof(sha3_context)); + sha3_Init512(digest->state); if(seed) { sha3_Update(digest->state, &seed, sizeof(seed)); } @@ -496,7 +515,8 @@ static void rm_digest_sha3_free(RmDigest *digest) { g_slice_free(sha3_context, digest->state); } -static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, RmOff size) { +static void rm_digest_sha3_update(RmDigest *digest, const unsigned char 
*data, + RmOff size) { sha3_Update(digest->state, data, size); } @@ -510,13 +530,19 @@ static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { g_slice_free(sha3_context, copy); } -#define SHA3_INTERFACE(BITS) \ - BITS, rm_digest_sha3_init, rm_digest_sha3_free, rm_digest_sha3_update, \ - rm_digest_sha3_copy, rm_digest_sha3_steal +#define RM_DIGEST_DEFINE_SHA3(BITS) \ + static const RmDigestInterface sha3_##BITS##_interface = { \ + .name = ("sha3-" #BITS), \ + .bits = (BITS), \ + .init = rm_digest_sha3_##BITS##_init, \ + .free = rm_digest_sha3_free, \ + .update = rm_digest_sha3_update, \ + .copy = rm_digest_sha3_copy, \ + .steal = rm_digest_sha3_steal}; -static const RmDigestInterface sha3_256_interface = {"sha3-256", SHA3_INTERFACE(256)}; -static const RmDigestInterface sha3_384_interface = {"sha3-384", SHA3_INTERFACE(384)}; -static const RmDigestInterface sha3_512_interface = {"sha3-512", SHA3_INTERFACE(512)}; +RM_DIGEST_DEFINE_SHA3(256) +RM_DIGEST_DEFINE_SHA3(384) +RM_DIGEST_DEFINE_SHA3(512) /////////////////////////// // blake hashes // @@ -572,42 +598,13 @@ static const RmDigestInterface blake2sp_interface = {"blake2sp", 256, // ext hash // /////////////////////////// -#define ALLOC_BYTES(bytes) MAX(8, bytes) - -static void rm_digest_generic_init(RmDigest *digest, RmOff seed) { - /* init for hashes which just require allocation of digest->checksum */ - - /* Cannot go lower than 8, since we read 8 byte in some places. 
- * For some checksums this may mean trailing zeros in the unused bytes */ - digest->state = g_slice_alloc0(ALLOC_BYTES(digest->bytes)); - - if(seed) { - /* copy seed to checksum */ - size_t seed_bytes = MIN(sizeof(RmOff), digest->bytes / 2); - memcpy(digest->state, &seed, seed_bytes); - } -} - -static void rm_digest_generic_free(RmDigest *digest) { +static void rm_digest_ext_free(RmDigest *digest) { if(digest->state) { g_slice_free1(digest->bytes, digest->state); digest->state = NULL; } } -static void rm_digest_generic_copy(RmDigest *digest, RmDigest *copy) { - copy->state = g_slice_copy(ALLOC_BYTES(digest->bytes), digest->state); -} - -#define GENERIC_FUNCS(ALGO) \ - .init = rm_digest_generic_init, .free = rm_digest_generic_free, \ - .update = rm_digest_##ALGO##_update, .copy = rm_digest_generic_copy, .steal = NULL - -static void rm_digest_ext_init(RmDigest *digest, RmOff seed) { - digest->bytes = 64; - rm_digest_generic_init(digest, seed); -} - static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, RmOff size) { /* Data is assumed to be a hex representation of a checksum. 
@@ -626,13 +623,18 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, } } -static const RmDigestInterface ext_interface = {"ext", - 512, - rm_digest_ext_init, - rm_digest_generic_free, - rm_digest_ext_update, - rm_digest_generic_copy, - NULL}; +static void rm_digest_ext_copy(RmDigest *digest, RmDigest *copy) { + copy->state = g_slice_copy(digest->bytes, digest->state); +} + +static const RmDigestInterface ext_interface = { + .name = "ext", + .bits = 512, + .init = NULL, + .free = rm_digest_ext_free, + .update = rm_digest_ext_update, + .copy = rm_digest_ext_copy, + .steal = NULL}; /////////////////////////// // paranoid 'hash' // @@ -676,8 +678,13 @@ static void rm_digest_paranoid_steal(RmDigest *digest, guint8 *result) { /* Note: paranoid update implementation is in rm_digest_buffered_update() below */ static const RmDigestInterface paranoid_interface = { - "paranoid", 0, rm_digest_paranoid_init, rm_digest_paranoid_free, - NULL, NULL, rm_digest_paranoid_steal}; + .name = "paranoid", + .bits = 0, + .init = rm_digest_paranoid_init, + .free = rm_digest_paranoid_free, + .update = NULL, + .copy = NULL, + .steal = rm_digest_paranoid_steal}; //////////////////////////////// // RmDigestInterface map // @@ -784,7 +791,9 @@ RmDigest *rm_digest_new(RmDigestType type, RmOff seed) { RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; digest->bytes = interface->bits / 8; - interface->init(digest, seed); + if(interface->init) { + interface->init(digest, seed); + } return digest; } From 81f7c8e3fb11be5d88ffb496e4ee4888185d2fc1 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sat, 18 Nov 2017 11:04:15 +1000 Subject: [PATCH 131/180] checksum: finish interfaces; move to standardised seeding approach --- lib/checksum.c | 917 ++++++++++++++++------------------ lib/checksums/metrohash.h | 6 +- lib/checksums/metrohash128.c | 46 +- lib/checksums/murmur3.c | 46 +- lib/checksums/murmur3.h | 9 +- lib/checksums/xxhash/xxhash.h | 10 +- 6 files changed, 
476 insertions(+), 558 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index 04f1f062..f57b2f48 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -23,9 +23,9 @@ * */ -/* Welcome to hell! - * - * This file is mostly boring code except for the paranoid digest +/* This file is mostly boring interface definitions to conform all of the + * difference hash types to a single interface. + * code except for the paranoid digest * optimisations which are pretty insane. **/ @@ -76,207 +76,118 @@ static gboolean rm_buffer_equal(RmBuffer *a, RmBuffer *b) { /////////////////////////////////////// /* Each digest type must have an RmDigestInterface defined as follows: */ -typedef void (*RmDigestInitFunc)(RmDigest *digest, RmOff seed); -typedef void (*RmDigestFreeFunc)(RmDigest *digest); -typedef void (*RmDigestUpdateFunc)(RmDigest *digest, const unsigned char *data, - RmOff size); -typedef void (*RmDigestCopyFunc)(RmDigest *digest, RmDigest *copy); -typedef void (*RmDigestStealFunc)(RmDigest *digest, guint8 *result); +typedef gpointer (*RmDigestNewFunc)(void); +typedef void (*RmDigestFreeFunc)(gpointer state); +typedef void (*RmDigestUpdateFunc)(gpointer state, const unsigned char *data, gsize size); +typedef gpointer (*RmDigestCopyFunc)(gpointer state); +typedef void (*RmDigestStealFunc)(gpointer state, guint8 *result); typedef struct RmDigestInterface { - const char *name; - const uint bits; // length of the output checksum in bits - RmDigestInitFunc init; // performs initialisation of digest->state - RmDigestFreeFunc free; - RmDigestUpdateFunc update; - RmDigestCopyFunc copy; - RmDigestStealFunc steal; + const char *name; // hash name + const uint bits; // length of the output checksum in bits + RmDigestNewFunc new; // returns new digest->state + RmDigestFreeFunc free; // frees state allocated by new() + RmDigestUpdateFunc update; // hashes data into state + RmDigestCopyFunc copy; // allocates and returns a copy of passed state + RmDigestStealFunc steal; // writes 
checksum (as binary) to *result } RmDigestInterface; -/* convenience macro to define an interface where all methods follow the standard naming - * convention */ -#define RM_DIGEST_DEFINE_INTERFACE(NAME, BITS) \ - static const RmDigestInterface NAME##_interface = { \ - .name = (#NAME), \ - .bits = (BITS), \ - .init = rm_digest_##NAME##_init, \ - .free = rm_digest_##NAME##_free, \ - .update = rm_digest_##NAME##_update, \ - .copy = rm_digest_##NAME##_copy, \ - .steal = rm_digest_##NAME##_steal}; - /////////////////////////// // xxhash interface // /////////////////////////// -static void rm_digest_xxhash_init(RmDigest *digest, RmOff seed) { - digest->state = XXH64_createState(); - XXH64_reset(digest->state, seed); -} - -static void rm_digest_xxhash_free(RmDigest *digest) { - XXH64_freeState(digest->state); -} - -static void rm_digest_xxhash_update(RmDigest *digest, const unsigned char *data, - RmOff size) { - XXH64_update(digest->state, data, size); +static XXH64_state_t *rm_digest_xxhash_new(void) { + XXH64_state_t *state = XXH64_createState(); + XXH64_reset(state, 0); + return state; } -static void rm_digest_xxhash_copy(RmDigest *digest, RmDigest *copy) { - copy->state = XXH64_createState(); - memcpy(copy->state, digest->state, sizeof(XXH64_state_t)); +static XXH64_state_t *rm_digest_xxhash_copy(XXH64_state_t *state) { + XXH64_state_t *copy = XXH64_createState(); + memcpy(copy, state, sizeof(XXH64_state_t)); + return copy; } -static void rm_digest_xxhash_steal(RmDigest *digest, guint8 *result) { - *(unsigned long long *)result = XXH64_digest(digest->state); +static void rm_digest_xxhash_steal(gpointer state, guint8 *result) { + *(unsigned long long *)result = XXH64_digest(state); } -RM_DIGEST_DEFINE_INTERFACE(xxhash, 64); +static const RmDigestInterface xxhash_interface = { + .name = "xxhash", + .bits = 64, + .new = (RmDigestNewFunc)rm_digest_xxhash_new, + .free = (RmDigestFreeFunc)XXH64_freeState, + .update = (RmDigestUpdateFunc)XXH64_update, + .copy = 
(RmDigestCopyFunc)rm_digest_xxhash_copy, + .steal = rm_digest_xxhash_steal}; /////////////////////////// // murmur // /////////////////////////// +static const RmDigestInterface murmur_interface = { + .name = "murmur", + .bits = 128, #if RM_PLATFORM_32 -/* use 32-bit optimised murmur hash interface */ - -static void rm_digest_murmur_init(RmDigest *digest, RmOff seed) { - digest->state = MurmurHash3_x86_128_new(seed, seed >> 32, seed, seed >> 32); -} - -static void rm_digest_murmur_free(RmDigest *digest) { - MurmurHash3_x86_128_free(digest->state); -} - -static void rm_digest_murmur_update(RmDigest *digest, - const unsigned char *data, - RmOff size) { - MurmurHash3_x86_128_update(digest->state, data, size); -} - -static void rm_digest_murmur_copy(RmDigest *digest, RmDigest *copy) { - copy->state = MurmurHash3_x86_128_copy(digest->state); -} - -static void rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { - MurmurHash3_x86_128_steal(digest->state, result); -} - + .new = (RmDigestNewFunc)MurmurHash3_x86_128_new, + .free = (RmDigestFreeFunc)MurmurHash3_x86_128_free, + .update = (RmDigestUpdateFunc)MurmurHash3_x86_128_update, + .copy = (RmDigestCopyFunc)MurmurHash3_x86_128_copy, + .steal = (RmDigestStealFunc)MurmurHash3_x86_128_steal, #elif RM_PLATFORM_64 -/* use 64-bit optimised murmur hash interface */ - -static void rm_digest_murmur_init(RmDigest *digest, RmOff seed) { - digest->state = MurmurHash3_x64_128_new(seed, seed); -} - -static void rm_digest_murmur_free(RmDigest *digest) { - MurmurHash3_x64_128_free(digest->state); -} - -static void rm_digest_murmur_update(RmDigest *digest, - const unsigned char *data, - RmOff size) { - MurmurHash3_x64_128_update(digest->state, data, size); -} - -static void rm_digest_murmur_copy(RmDigest *digest, RmDigest *copy) { - copy->state = MurmurHash3_x64_128_copy(digest->state); -} - -static void rm_digest_murmur_steal(RmDigest *digest, guint8 *result) { - MurmurHash3_x64_128_steal(digest->state, result); -} - + /* use 
64-bit optimised murmur hash interface */ + .new = (RmDigestNewFunc)MurmurHash3_x64_128_new, + .free = (RmDigestFreeFunc)MurmurHash3_x64_128_free, + .update = (RmDigestUpdateFunc)MurmurHash3_x64_128_update, + .copy = (RmDigestCopyFunc)MurmurHash3_x64_128_copy, + .steal = (RmDigestStealFunc)MurmurHash3_x64_128_steal, #else - #error "Probably not a good idea to compile rmlint on 16bit." - #endif - -RM_DIGEST_DEFINE_INTERFACE(murmur, 128); +}; /////////////////////////// // metro // /////////////////////////// -static void rm_digest_metro_init(RmDigest *digest, RmOff seed) { - digest->state = metrohash128_1_new(seed); -} - -static void rm_digest_metro_free(RmDigest *digest) { - metrohash128_free(digest->state); -} - -static void rm_digest_metro_update(RmDigest *digest, const unsigned char *data, - RmOff size) { - metrohash128_1_update(digest->state, data, size); -} - -static void rm_digest_metro_copy(RmDigest *digest, RmDigest *copy) { - copy->state = metrohash128_copy(digest->state); -} - -static void rm_digest_metro_steal(RmDigest *digest, guint8 *result) { - metrohash128_1_steal(digest->state, result); -} - -static void rm_digest_metro256_init(RmDigest *digest, RmOff seed) { - digest->state = metrohash256_new(seed); -} - -static void rm_digest_metro256_free(RmDigest *digest) { - metrohash256_free(digest->state); -} - -static void rm_digest_metro256_update(RmDigest *digest, const unsigned char *data, - RmOff size) { - metrohash256_update(digest->state, data, size); -} - -static void rm_digest_metro256_copy(RmDigest *digest, RmDigest *copy) { - copy->state = metrohash256_copy(digest->state); -} - -static void rm_digest_metro256_steal(RmDigest *digest, guint8 *result) { - metrohash256_steal(digest->state, result); -} - -RM_DIGEST_DEFINE_INTERFACE(metro, 128); -RM_DIGEST_DEFINE_INTERFACE(metro256, 256); +static const RmDigestInterface metro_interface = { + .name = "metro", + .bits = 128, + .new = (RmDigestNewFunc)metrohash128_1_new, + .free = 
(RmDigestFreeFunc)metrohash128_free, + .update = (RmDigestUpdateFunc)metrohash128_1_update, + .copy = (RmDigestCopyFunc)metrohash128_copy, + .steal = (RmDigestStealFunc)metrohash128_1_steal}; + +static const RmDigestInterface metro256_interface = { + .name = "metro256", + .bits = 256, + .new = (RmDigestNewFunc)metrohash256_new, + .free = (RmDigestFreeFunc)metrohash256_free, + .update = (RmDigestUpdateFunc)metrohash256_update, + .copy = (RmDigestCopyFunc)metrohash256_copy, + .steal = (RmDigestStealFunc)metrohash256_steal}; #if HAVE_SSE4 /* also define crc-optimised metro variants metrocrc and metrocrc256*/ -/* some of the interface procedures are common between crc- and non-crc-variants */ -#define rm_digest_metrocrc_init rm_digest_metro_init -#define rm_digest_metrocrc_free rm_digest_metro_free -#define rm_digest_metrocrc_copy rm_digest_metro_copy - -#define rm_digest_metrocrc256_init rm_digest_metro256_init -#define rm_digest_metrocrc256_free rm_digest_metro256_free -#define rm_digest_metrocrc256_copy rm_digest_metro256_copy - -static void rm_digest_metrocrc_update(RmDigest *digest, const unsigned char *data, - RmOff size) { - metrohash128crc_update(digest->state, data, size); -} - -static void rm_digest_metrocrc_steal(RmDigest *digest, guint8 *result) { - metrohash128crc_1_steal(digest->state, result); -} - -static void rm_digest_metrocrc256_update(RmDigest *digest, const unsigned char *data, - RmOff size) { - metrohash256crc_update(digest->state, data, size); -} - -static void rm_digest_metrocrc256_steal(RmDigest *digest, guint8 *result) { - metrohash256crc_steal(digest->state, result); -} - -RM_DIGEST_DEFINE_INTERFACE(metrocrc, 128); -RM_DIGEST_DEFINE_INTERFACE(metrocrc256, 256); +static const RmDigestInterface metrocrc_interface = { + .name = "metrocrc", + .bits = 128, + .new = (RmDigestNewFunc)metrohash128_1_new, /* <-same */ + .free = (RmDigestFreeFunc)metrohash128_free, /* <-same */ + .update = (RmDigestUpdateFunc)metrohash128crc_update, + .copy = 
(RmDigestCopyFunc)metrohash128_copy, /* <-same */ + .steal = (RmDigestStealFunc)metrohash128crc_1_steal}; + +static const RmDigestInterface metrocrc256_interface = { + .name = "metrocrc256", + .bits = 256, + .new = (RmDigestNewFunc)metrohash256_new, /* <-same */ + .free = (RmDigestFreeFunc)metrohash256_free, /* <-same */ + .update = (RmDigestUpdateFunc)metrohash256crc_update, + .copy = (RmDigestCopyFunc)metrohash256_copy, /* <-same */ + .steal = (RmDigestStealFunc)metrohash256crc_steal}; #endif @@ -310,22 +221,18 @@ typedef struct RmDigestCumulative { RM_DIGEST_CUMULATIVE_T pos; /* byte offset within data */ } RmDigestCumulative; -static void rm_digest_cumulative_init(RmDigest *digest, RmOff seed) { - RmDigestCumulative *state = g_slice_new0(RmDigestCumulative); - *(RmOff *)&state->data[0] ^= seed; - digest->state = state; +static RmDigestCumulative *rm_digest_cumulative_new(void) { + return g_slice_new0(RmDigestCumulative); } -static void rm_digest_cumulative_free(RmDigest *digest) { - g_slice_free(RmDigestCumulative, digest->state); - digest->state = NULL; +static void rm_digest_cumulative_free(RmDigestCumulative *state) { + g_slice_free(RmDigestCumulative, state); } -static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *data, - RmOff size) { +static void rm_digest_cumulative_update(RmDigestCumulative *state, + const unsigned char *data, RmOff size) { guint8 *ptr = (guint8 *)data; guint8 *stop = ptr + size; - RmDigestCumulative *state = digest->state; /* align so we can use [32|64]-bit xor */ while((state->pos % RM_DIGEST_CUMULATIVE_ALIGN != 0) && ptr < stop) { @@ -352,193 +259,206 @@ static void rm_digest_cumulative_update(RmDigest *digest, const unsigned char *d } } -static void rm_digest_cumulative_copy(RmDigest *digest, RmDigest *copy) { - copy->state = g_slice_copy(sizeof(RmDigestCumulative), digest->state); +static RmDigestCumulative *rm_digest_cumulative_copy(RmDigestCumulative *state) { + return 
g_slice_copy(sizeof(RmDigestCumulative), state); } -static void rm_digest_cumulative_steal(RmDigest *digest, guint8 *result) { - RmDigestCumulative *state = digest->state; +static void rm_digest_cumulative_steal(RmDigestCumulative *state, guint8 *result) { memcpy(result, state->data, RM_DIGEST_CUMULATIVE_LEN); } -RM_DIGEST_DEFINE_INTERFACE(cumulative, 8 * RM_DIGEST_CUMULATIVE_LEN) +static const RmDigestInterface cumulative_interface = { + .name = "cumulative", + .bits = 8 * RM_DIGEST_CUMULATIVE_LEN, + .new = (RmDigestNewFunc)rm_digest_cumulative_new, /* <-same */ + .free = (RmDigestFreeFunc)rm_digest_cumulative_free, /* <-same */ + .update = (RmDigestUpdateFunc)rm_digest_cumulative_update, + .copy = (RmDigestCopyFunc)rm_digest_cumulative_copy, /* <-same */ + .steal = (RmDigestStealFunc)rm_digest_cumulative_steal}; /////////////////////////// // highway hash // /////////////////////////// -static void rm_digest_highway_init(RmDigest *digest, RmOff seed) { - uint64_t key[4] = {1, 2, 3, 4}; - if(seed) { - key[0] = (uint64_t)seed; - } - - digest->state = g_slice_alloc0(sizeof(HighwayHashCat)); - HighwayHashCatStart(key, digest->state); +static HighwayHashCat *rm_digest_highway_new(void) { + HighwayHashCat *state = g_slice_new(HighwayHashCat); + static const uint64_t key[4] = {1, 2, 3, 4}; + HighwayHashCatStart(key, state); + return state; } -static void rm_digest_highway_free(RmDigest *digest) { - g_slice_free(HighwayHashCat, digest->state); +static void rm_digest_highway_free(HighwayHashCat *state) { + g_slice_free(HighwayHashCat, state); } -static void rm_digest_highway_update(RmDigest *digest, const unsigned char *data, +static void rm_digest_highway_update(HighwayHashCat *state, const unsigned char *data, RmOff size) { - HighwayHashCatAppend((const uint8_t *)data, size, digest->state); -} - -static void rm_digest_highway_copy(RmDigest *digest, RmDigest *copy) { - copy->state = g_slice_copy(sizeof(HighwayHashCat), digest->state); -} - -/* HighwayHashCatFinish 
functions are non-destructive so steal funcs don't - * need to make a copy */ -static void rm_digest_highway256_steal(RmDigest *digest, guint8 *result) { - HighwayHashCatFinish256(digest->state, (uint64_t *)result); -} - -static void rm_digest_highway128_steal(RmDigest *digest, guint8 *result) { - HighwayHashCatFinish128(digest->state, (uint64_t *)result); -} - -static void rm_digest_highway64_steal(RmDigest *digest, guint8 *result) { - *(uint64_t *)result = HighwayHashCatFinish64(digest->state); -} - - /* highway hashes share common interface functions other than steal: */ - -#define rm_digest_highway64_init rm_digest_highway_init -#define rm_digest_highway128_init rm_digest_highway_init -#define rm_digest_highway256_init rm_digest_highway_init - -#define rm_digest_highway64_free rm_digest_highway_free -#define rm_digest_highway128_free rm_digest_highway_free -#define rm_digest_highway256_free rm_digest_highway_free - -#define rm_digest_highway64_update rm_digest_highway_update -#define rm_digest_highway128_update rm_digest_highway_update -#define rm_digest_highway256_update rm_digest_highway_update - -#define rm_digest_highway64_copy rm_digest_highway_copy -#define rm_digest_highway128_copy rm_digest_highway_copy -#define rm_digest_highway256_copy rm_digest_highway_copy - -RM_DIGEST_DEFINE_INTERFACE(highway64, 64) -RM_DIGEST_DEFINE_INTERFACE(highway128, 128) -RM_DIGEST_DEFINE_INTERFACE(highway256, 256) + HighwayHashCatAppend((const uint8_t *)data, size, state); +} + +static HighwayHashCat *rm_digest_highway_copy(HighwayHashCat *state) { + return g_slice_copy(sizeof(HighwayHashCat), state); +} + +static void rm_digest_highway64_steal(HighwayHashCat *state, guint8 *result) { + /* HighwayHashCatFinish functions are non-destructive so steal funcs don't + * need to make a copy */ + *(uint64_t *)result = HighwayHashCatFinish64(state); +} + +static const RmDigestInterface highway64_interface = { + .name = "highway64", + .bits = 64, + .new = 
(RmDigestNewFunc)rm_digest_highway_new, + .free = (RmDigestFreeFunc)rm_digest_highway_free, + .update = (RmDigestUpdateFunc)rm_digest_highway_update, + .copy = (RmDigestCopyFunc)rm_digest_highway_copy, + .steal = (RmDigestStealFunc)rm_digest_highway64_steal}; + +static const RmDigestInterface highway128_interface = { + .name = "highway128", + .bits = 128, + .new = (RmDigestNewFunc)rm_digest_highway_new, + .free = (RmDigestFreeFunc)rm_digest_highway_free, + .update = (RmDigestUpdateFunc)rm_digest_highway_update, + .copy = (RmDigestCopyFunc)rm_digest_highway_copy, + .steal = (RmDigestStealFunc)HighwayHashCatFinish128}; + +static const RmDigestInterface highway256_interface = { + .name = "highway256", + .bits = 256, + .new = (RmDigestNewFunc)rm_digest_highway_new, + .free = (RmDigestFreeFunc)rm_digest_highway_free, + .update = (RmDigestUpdateFunc)rm_digest_highway_update, + .copy = (RmDigestCopyFunc)rm_digest_highway_copy, + .steal = (RmDigestStealFunc)HighwayHashCatFinish256}; /////////////////////////// // glib hashes // /////////////////////////// -static const GChecksumType glib_map[] = { - [RM_DIGEST_MD5] = G_CHECKSUM_MD5, - [RM_DIGEST_SHA1] = G_CHECKSUM_SHA1, - [RM_DIGEST_SHA256] = G_CHECKSUM_SHA256, -#if HAVE_SHA512 - [RM_DIGEST_SHA512] = G_CHECKSUM_SHA512, -#endif -}; +static void rm_digest_glib_steal(GChecksum *state, guint8 *result, gsize *len) { + GChecksum *copy = g_checksum_copy(state); + g_checksum_get_digest(copy, result, len); + g_checksum_free(copy); +} -static void rm_digest_glib_init(RmDigest *digest, RmOff seed) { - digest->state = g_checksum_new(glib_map[digest->type]); - if(seed) { - g_checksum_update(digest->state, (const guchar *)&seed, sizeof(seed)); - } +#define RM_DIGEST_DEFINE_GLIB(NAME, BITS) \ + static const RmDigestInterface NAME##_interface = { \ + .name = #NAME, \ + .bits = BITS, \ + .new = (RmDigestNewFunc)rm_digest_##NAME##_new, \ + .free = (RmDigestFreeFunc)g_checksum_free, \ + .update = (RmDigestUpdateFunc)g_checksum_update, \ + 
.copy = (RmDigestCopyFunc)g_checksum_copy, \ + .steal = (RmDigestStealFunc)rm_digest_##NAME##_steal}; + +/* md5 */ +static GChecksum *rm_digest_md5_new(void) { + return g_checksum_new(G_CHECKSUM_MD5); } -static void rm_digest_glib_free(RmDigest *digest) { - g_checksum_free(digest->state); +static void rm_digest_md5_steal(GChecksum *state, guint8 *result) { + gsize len = 16; + rm_digest_glib_steal(state, result, &len); } +RM_DIGEST_DEFINE_GLIB(md5, 128); -static void rm_digest_glib_update(RmDigest *digest, const unsigned char *data, - RmOff size) { - g_checksum_update(digest->state, data, size); +/* sha1 */ +static GChecksum *rm_digest_sha1_new(void) { + return g_checksum_new(G_CHECKSUM_SHA1); } -static void rm_digest_glib_copy(RmDigest *digest, RmDigest *copy) { - copy->state = g_checksum_copy(digest->state); +static void rm_digest_sha1_steal(GChecksum *state, guint8 *result) { + gsize len = 20; + rm_digest_glib_steal(state, result, &len); } -static void rm_digest_glib_steal(RmDigest *digest, guint8 *result) { - GChecksum *copy = g_checksum_copy(digest->state); - gsize buflen = digest->bytes; - g_checksum_get_digest(copy, result, &buflen); - rm_assert_gentle(buflen == digest->bytes); - g_checksum_free(copy); +RM_DIGEST_DEFINE_GLIB(sha1, 160); + +/* sha256 */ +static GChecksum *rm_digest_sha256_new(void) { + return g_checksum_new(G_CHECKSUM_SHA256); } -#define RM_DIGEST_DEFINE_GLIB(NAME, BITS) \ - static const RmDigestInterface NAME##_interface = {.name = (#NAME), \ - .bits = (BITS), \ - .init = rm_digest_glib_init, \ - .free = rm_digest_glib_free, \ - .update = rm_digest_glib_update, \ - .copy = rm_digest_glib_copy, \ - .steal = rm_digest_glib_steal}; +static void rm_digest_sha256_steal(GChecksum *state, guint8 *result) { + gsize len = 32; + rm_digest_glib_steal(state, result, &len); +} -RM_DIGEST_DEFINE_GLIB(md5, 128); -RM_DIGEST_DEFINE_GLIB(sha1, 160); RM_DIGEST_DEFINE_GLIB(sha256, 256); + +/* sha512 */ #if HAVE_SHA512 +static GChecksum 
*rm_digest_sha512_new(void) { + return g_checksum_new(G_CHECKSUM_SHA512); +} + +static void rm_digest_sha512_steal(GChecksum *state, guint8 *result) { + gsize len = 64; + rm_digest_glib_steal(state, result, &len); +} RM_DIGEST_DEFINE_GLIB(sha512, 512); + #endif /////////////////////////// // sha3 hashes // /////////////////////////// -static void rm_digest_sha3_256_init(RmDigest *digest, RmOff seed) { - digest->state = g_slice_alloc0(sizeof(sha3_context)); - sha3_Init256(digest->state); - if(seed) { - sha3_Update(digest->state, &seed, sizeof(seed)); - } +static sha3_context *rm_digest_sha3_256_new(void) { + sha3_context *state = g_slice_new(sha3_context); + sha3_Init256(state); + return state; } -static void rm_digest_sha3_384_init(RmDigest *digest, RmOff seed) { - digest->state = g_slice_alloc0(sizeof(sha3_context)); - sha3_Init384(digest->state); - if(seed) { - sha3_Update(digest->state, &seed, sizeof(seed)); - } +static sha3_context *rm_digest_sha3_384_new(void) { + sha3_context *state = g_slice_new(sha3_context); + sha3_Init384(state); + return state; } -static void rm_digest_sha3_512_init(RmDigest *digest, RmOff seed) { - digest->state = g_slice_alloc0(sizeof(sha3_context)); - sha3_Init512(digest->state); - if(seed) { - sha3_Update(digest->state, &seed, sizeof(seed)); - } +static sha3_context *rm_digest_sha3_512_new(void) { + sha3_context *state = g_slice_new(sha3_context); + sha3_Init512(state); + return state; } -static void rm_digest_sha3_free(RmDigest *digest) { - g_slice_free(sha3_context, digest->state); +static void rm_digest_sha3_free(sha3_context *state) { + g_slice_free(sha3_context, state); } -static void rm_digest_sha3_update(RmDigest *digest, const unsigned char *data, - RmOff size) { - sha3_Update(digest->state, data, size); +static sha3_context *rm_digest_sha3_copy(sha3_context *state) { + return g_slice_copy(sizeof(sha3_context), state); } -static void rm_digest_sha3_copy(RmDigest *digest, RmDigest *copy) { - copy->state = 
g_slice_copy(sizeof(sha3_context), digest->state); +static void rm_digest_sha3_256_steal(sha3_context *state, guint8 *result) { + sha3_context *copy = g_slice_copy(sizeof(sha3_context), state); + memcpy(result, sha3_Finalize(copy), 256 / 8); + rm_digest_sha3_free(copy); } -static void rm_digest_sha3_steal(RmDigest *digest, guint8 *result) { - sha3_context *copy = g_slice_copy(sizeof(sha3_context), digest->state); - memcpy(result, sha3_Finalize(copy), digest->bytes); - g_slice_free(sha3_context, copy); +static void rm_digest_sha3_384_steal(sha3_context *state, guint8 *result) { + sha3_context *copy = g_slice_copy(sizeof(sha3_context), state); + memcpy(result, sha3_Finalize(copy), 384 / 8); + rm_digest_sha3_free(copy); +} + +static void rm_digest_sha3_512_steal(sha3_context *state, guint8 *result) { + sha3_context *copy = g_slice_copy(sizeof(sha3_context), state); + memcpy(result, sha3_Finalize(copy), 512 / 8); + rm_digest_sha3_free(copy); } #define RM_DIGEST_DEFINE_SHA3(BITS) \ static const RmDigestInterface sha3_##BITS##_interface = { \ .name = ("sha3-" #BITS), \ .bits = (BITS), \ - .init = rm_digest_sha3_##BITS##_init, \ - .free = rm_digest_sha3_free, \ - .update = rm_digest_sha3_update, \ - .copy = rm_digest_sha3_copy, \ - .steal = rm_digest_sha3_steal}; + .new = (RmDigestNewFunc)rm_digest_sha3_##BITS##_new, \ + .free = (RmDigestFreeFunc)rm_digest_sha3_free, \ + .update = (RmDigestUpdateFunc)sha3_Update, \ + .copy = (RmDigestCopyFunc)rm_digest_sha3_copy, \ + .steal = (RmDigestStealFunc)rm_digest_sha3_##BITS##_steal}; RM_DIGEST_DEFINE_SHA3(256) RM_DIGEST_DEFINE_SHA3(384) @@ -548,64 +468,67 @@ RM_DIGEST_DEFINE_SHA3(512) // blake hashes // /////////////////////////// -#define CREATE_BLAKE_FUNCS(ALGO, ALGO_BIG) \ - \ - static void rm_digest_##ALGO##_init(RmDigest *digest, RmOff seed) { \ - digest->state = g_slice_alloc0(sizeof(ALGO##_state)); \ - ALGO##_init(digest->state, ALGO_BIG##_OUTBYTES); \ - if(seed) { \ - ALGO##_update(digest->state, &seed, sizeof(RmOff)); \ 
- } \ - g_assert(digest->bytes == ALGO_BIG##_OUTBYTES); \ - } \ - \ - static void rm_digest_##ALGO##_free(RmDigest *digest) { \ - g_slice_free(ALGO##_state, digest->state); \ - } \ - \ - static void rm_digest_##ALGO##_update(RmDigest *digest, const unsigned char *data, \ - RmOff size) { \ - ALGO##_update(digest->state, data, size); \ - } \ - \ - static void rm_digest_##ALGO##_copy(RmDigest *digest, RmDigest *copy) { \ - copy->state = g_slice_copy(sizeof(ALGO##_state), digest->state); \ - } \ - \ - static void rm_digest_##ALGO##_steal(RmDigest *digest, guint8 *result) { \ - ALGO##_state *copy = g_slice_copy(sizeof(ALGO##_state), digest->state); \ - ALGO##_final(copy, result, digest->bytes); \ - g_slice_free(ALGO##_state, copy); \ - } - -CREATE_BLAKE_FUNCS(blake2b, BLAKE2B); -CREATE_BLAKE_FUNCS(blake2bp, BLAKE2B); -CREATE_BLAKE_FUNCS(blake2s, BLAKE2S); -CREATE_BLAKE_FUNCS(blake2sp, BLAKE2S); - -#define BLAKE_FUNCS(ALGO) \ - rm_digest_##ALGO##_init, rm_digest_##ALGO##_free, rm_digest_##ALGO##_update, \ - rm_digest_##ALGO##_copy, rm_digest_##ALGO##_steal - -static const RmDigestInterface blake2b_interface = {"blake2b", 512, BLAKE_FUNCS(blake2b)}; -static const RmDigestInterface blake2bp_interface = {"blake2bp", 512, - BLAKE_FUNCS(blake2bp)}; -static const RmDigestInterface blake2s_interface = {"blake2s", 256, BLAKE_FUNCS(blake2s)}; -static const RmDigestInterface blake2sp_interface = {"blake2sp", 256, - BLAKE_FUNCS(blake2sp)}; +#define CREATE_BLAKE_INTERFACE(ALGO, ALGO_BIG) \ + \ + static ALGO##_state *rm_digest_##ALGO##_new(void) { \ + ALGO##_state *state = g_slice_new(ALGO##_state); \ + ALGO##_init(state, ALGO_BIG##_OUTBYTES); \ + return state; \ + } \ + \ + static void rm_digest_##ALGO##_free(ALGO##_state *state) { \ + g_slice_free(ALGO##_state, state); \ + } \ + \ + static ALGO##_state *rm_digest_##ALGO##_copy(ALGO##_state *state) { \ + return g_slice_copy(sizeof(ALGO##_state), state); \ + } \ + \ + static void rm_digest_##ALGO##_steal(ALGO##_state *state, guint8 
*result) { \ + ALGO##_state *copy = rm_digest_##ALGO##_copy(state); \ + ALGO##_final(copy, result, ALGO_BIG##_OUTBYTES); \ + rm_digest_##ALGO##_free(copy); \ + } \ + \ + static const RmDigestInterface ALGO##_interface = { \ + .name = #ALGO, \ + .bits = 8 * ALGO_BIG##_OUTBYTES, \ + .new = (RmDigestNewFunc)rm_digest_##ALGO##_new, \ + .free = (RmDigestFreeFunc)rm_digest_##ALGO##_free, \ + .update = (RmDigestUpdateFunc)ALGO##_update, \ + .copy = (RmDigestCopyFunc)rm_digest_##ALGO##_copy, \ + .steal = (RmDigestStealFunc)rm_digest_##ALGO##_steal}; + +CREATE_BLAKE_INTERFACE(blake2b, BLAKE2B); +CREATE_BLAKE_INTERFACE(blake2bp, BLAKE2B); +CREATE_BLAKE_INTERFACE(blake2s, BLAKE2S); +CREATE_BLAKE_INTERFACE(blake2sp, BLAKE2S); /////////////////////////// // ext hash // /////////////////////////// -static void rm_digest_ext_free(RmDigest *digest) { - if(digest->state) { - g_slice_free1(digest->bytes, digest->state); - digest->state = NULL; +typedef struct RmDigestExt { + guint8 len; + guint8 *data; +} RmDigestExt; + +static RmDigestExt *rm_digest_ext_new(void) { + return g_slice_new0(RmDigestExt); +} + +static void rm_digest_ext_free_data(RmDigestExt *state) { + if(state->data) { + g_slice_free1(state->len, state->data); } } -static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, +static void rm_digest_ext_free(RmDigestExt *state) { + rm_digest_ext_free_data(state); + g_slice_free(RmDigestExt, state); +} + +static void rm_digest_ext_update(RmDigestExt *state, const unsigned char *data, RmOff size) { /* Data is assumed to be a hex representation of a checksum. * Needs to be compressed in pure memory first. @@ -614,84 +537,158 @@ static void rm_digest_ext_update(RmDigest *digest, const unsigned char *data, * */ #define CHAR_TO_NUM(c) (unsigned char)(g_ascii_isdigit(c) ? 
c - '0' : (c - 'a') + 10) - digest->bytes = size / 2; - digest->state = g_slice_alloc0(digest->bytes); + if(state->data) { + rm_digest_ext_free_data(state); + } + + state->len = size / 2; + state->data = g_slice_alloc(state->len); - for(unsigned i = 0; i < digest->bytes; ++i) { - ((guint8 *)digest->state)[i] = - (CHAR_TO_NUM(data[2 * i]) << 4) + CHAR_TO_NUM(data[2 * i + 1]); + for(unsigned i = 0; i < state->len; ++i) { + state->data[i] = (CHAR_TO_NUM(data[2 * i]) << 4) + CHAR_TO_NUM(data[2 * i + 1]); } } -static void rm_digest_ext_copy(RmDigest *digest, RmDigest *copy) { - copy->state = g_slice_copy(digest->bytes, digest->state); +static RmDigestExt *rm_digest_ext_copy(RmDigestExt *state) { + RmDigestExt *copy = g_slice_copy(sizeof(*state), state); + copy->data = g_slice_copy(state->len, state->data); + return copy; +} + +static void rm_digest_ext_steal(RmDigestExt *state, guint8 *result) { + memcpy(result, state->data, state->len); } static const RmDigestInterface ext_interface = { - .name = "ext", - .bits = 512, - .init = NULL, - .free = rm_digest_ext_free, - .update = rm_digest_ext_update, - .copy = rm_digest_ext_copy, - .steal = NULL}; + .name = "ext", + .bits = 512, + .new = (RmDigestNewFunc)rm_digest_ext_new, + .free = (RmDigestFreeFunc)rm_digest_ext_free, + .update = (RmDigestUpdateFunc)rm_digest_ext_update, + .copy = (RmDigestCopyFunc)rm_digest_ext_copy, + .steal = (RmDigestStealFunc)rm_digest_ext_steal}; /////////////////////////// // paranoid 'hash' // /////////////////////////// -static void rm_digest_paranoid_init(RmDigest *digest, RmOff seed) { +static RmParanoid *rm_digest_paranoid_new(void) { RmParanoid *paranoid = g_slice_new0(RmParanoid); - digest->state = paranoid; paranoid->incoming_twin_candidates = g_async_queue_new(); - paranoid->shadow_hash = rm_digest_new(RM_DIGEST_XXHASH, seed); - digest->bytes = paranoid->shadow_hash->bytes; + paranoid->shadow_hash = rm_digest_new(RM_DIGEST_XXHASH, 0); + return paranoid; } -static void 
rm_digest_paranoid_free(RmDigest *digest) { - RmParanoid *paranoid = digest->state; - if(paranoid->shadow_hash) { - rm_digest_free(paranoid->shadow_hash); - } - rm_digest_release_buffers(digest); - if(paranoid->incoming_twin_candidates) { - g_async_queue_unref(paranoid->incoming_twin_candidates); - } +static void rm_digest_paranoid_release_buffers(RmParanoid *paranoid) { + g_slist_free_full(paranoid->buffers, (GDestroyNotify)rm_buffer_free); + paranoid->buffers = NULL; +} + +static void rm_digest_paranoid_free(RmParanoid *paranoid) { + rm_digest_free(paranoid->shadow_hash); + rm_digest_paranoid_release_buffers(paranoid); + g_async_queue_unref(paranoid->incoming_twin_candidates); g_slist_free(paranoid->rejects); g_slice_free(RmParanoid, paranoid); } -static void rm_digest_paranoid_steal(RmDigest *digest, guint8 *result) { - RmParanoid *paranoid = digest->state; - if(paranoid->shadow_hash) { - guint8 *buf = rm_digest_steal(paranoid->shadow_hash); - memcpy(result, buf, digest->bytes); +static void rm_digest_paranoid_buffered_update(RmParanoid *paranoid, RmBuffer *buffer) { + /* Welcome to hell! + * This is a somewhat crazy part of the rmlint optimisation strategy. + * Comparing two "paranoid digests" (basically a large chunk of a file stored in + * a series of buffers) is fairly simple but it's slow because it has to compare + * each buffer. + * The algorithm below tries to get a head-start on the comparison by starting the + * buffer comparison before the last buffer has been read. 
+ */ + + rm_digest_update(paranoid->shadow_hash, buffer->data, buffer->len); + + if(!paranoid->buffers) { + /* first buffer */ + paranoid->buffers = g_slist_prepend(NULL, buffer); + paranoid->buffer_tail = paranoid->buffers; } else { - /* steal the first few bytes of the first buffer */ - if(paranoid->buffers) { - RmBuffer *buffer = paranoid->buffers->data; - memcpy(result, buffer->data, MIN(buffer->len, digest->bytes)); + paranoid->buffer_tail = g_slist_append(paranoid->buffer_tail, buffer)->next; + } + + if(paranoid->twin_candidate) { + /* do a running check that digest remains the same as its candidate twin */ + if(rm_buffer_equal(buffer, paranoid->twin_candidate_buffer->data)) { + /* buffers match; move ptr to next one ready for next buffer */ + paranoid->twin_candidate_buffer = paranoid->twin_candidate_buffer->next; + } else { + /* buffers don't match - delete candidate (new candidate might be added on + * next + * call to rm_digest_buffered_update) */ + paranoid->twin_candidate = NULL; + paranoid->twin_candidate_buffer = NULL; +#if _RM_CHECKSUM_DEBUG + rm_log_debug_line("Ejected candidate match at buffer #%u", + g_slist_length(paranoid->buffers)); +#endif } } + + while(!paranoid->twin_candidate && (paranoid->twin_candidate = g_async_queue_try_pop( + paranoid->incoming_twin_candidates))) { + /* validate the new candidate by comparing the previous buffers (not + * including current)*/ + RmParanoid *twin = paranoid->twin_candidate->state; + paranoid->twin_candidate_buffer = twin->buffers; + GSList *iter_self = paranoid->buffers; + gboolean match = TRUE; + while(match && iter_self) { + match = + (rm_buffer_equal(paranoid->twin_candidate_buffer->data, iter_self->data)); + iter_self = iter_self->next; + paranoid->twin_candidate_buffer = paranoid->twin_candidate_buffer->next; + } + if(paranoid->twin_candidate && !match) { + /* reject the twin candidate, also add to rejects list to speed up + * rm_digest_equal() */ +#if _RM_CHECKSUM_DEBUG + 
rm_log_debug_line("Rejected twin candidate %p for %p", + paranoid->twin_candidate, paranoid); +#endif + if(!paranoid->shadow_hash) { + /* we use the rejects list to speed up rm_digest_equal */ + paranoid->rejects = + g_slist_prepend(paranoid->rejects, paranoid->twin_candidate); + } + paranoid->twin_candidate = NULL; + paranoid->twin_candidate_buffer = NULL; + } else { +#if _RM_CHECKSUM_DEBUG + rm_log_debug_line("Added twin candidate %p for %p", paranoid->twin_candidate, + paranoid); +#endif + } + } +} + +static void rm_digest_paranoid_steal(RmParanoid *paranoid, guint8 *result) { + RmDigest *shadow_hash = paranoid->shadow_hash; + // rm_log_warning_line("rm_digest_paranoid_steal %d bytes", shadow_hash->bytes); + rm_digest_xxhash_steal(shadow_hash->state, result); } /* Note: paranoid update implementation is in rm_digest_buffered_update() below */ static const RmDigestInterface paranoid_interface = { - .name = "paranoid", - .bits = 0, - .init = rm_digest_paranoid_init, - .free = rm_digest_paranoid_free, - .update = NULL, - .copy = NULL, - .steal = rm_digest_paranoid_steal}; + .name = "paranoid", + .bits = 64, /* must match shadow hash length */ + .new = (RmDigestNewFunc)rm_digest_paranoid_new, + .free = (RmDigestFreeFunc)rm_digest_paranoid_free, + .update = NULL, + .copy = NULL, + .steal = (RmDigestStealFunc)rm_digest_paranoid_steal}; //////////////////////////////// // RmDigestInterface map // //////////////////////////////// static const RmDigestInterface *rm_digest_get_interface(RmDigestType type) { - static const RmDigestInterface *digest_interfaces[] = { [RM_DIGEST_UNKNOWN] = NULL, [RM_DIGEST_MURMUR] = &murmur_interface, @@ -723,7 +720,8 @@ static const RmDigestInterface *rm_digest_get_interface(RmDigestType type) { [RM_DIGEST_HIGHWAY256] = &highway256_interface, }; - if(type < RM_DIGEST_SENTINEL && digest_interfaces[type]) { + if(type != RM_DIGEST_UNKNOWN && type < RM_DIGEST_SENTINEL && + digest_interfaces[type]) { return digest_interfaces[type]; }
rm_log_error_line("No digest interface for enum %i", type); @@ -752,6 +750,10 @@ static gpointer rm_init_digest_type_table(GHashTable **code_table) { return NULL; } +/////////////////////////////////////// +// RMDIGEST API // +/////////////////////////////////////// + RmDigestType rm_string_to_digest_type(const char *string) { static GHashTable *code_table = NULL; static GOnce table_once = G_ONCE_INIT; @@ -774,7 +776,7 @@ const char *rm_digest_type_to_string(RmDigestType type) { return interface->name; } -/* TODO: remove? */ +/* TODO: update or remove? */ int rm_digest_type_to_multihash_id(RmDigestType type) { static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, @@ -785,36 +787,33 @@ int rm_digest_type_to_multihash_id(RmDigestType type) { } RmDigest *rm_digest_new(RmDigestType type, RmOff seed) { - g_assert(type != RM_DIGEST_UNKNOWN); - const RmDigestInterface *interface = rm_digest_get_interface(type); + RmDigest *digest = g_slice_new0(RmDigest); digest->type = type; digest->bytes = interface->bits / 8; - if(interface->init) { - interface->init(digest, seed); + digest->state = interface->new(); + if(seed) { + interface->update(digest->state, (const unsigned char *)&seed, sizeof(seed)); } return digest; } void rm_digest_release_buffers(RmDigest *digest) { - RmParanoid *paranoid = digest->state; - if(paranoid && paranoid->buffers) { - g_slist_free_full(paranoid->buffers, (GDestroyNotify)rm_buffer_free); - paranoid->buffers = NULL; - } + rm_assert_gentle(digest->type == RM_DIGEST_PARANOID); + rm_digest_paranoid_release_buffers(digest->state); } void rm_digest_free(RmDigest *digest) { const RmDigestInterface *interface = rm_digest_get_interface(digest->type); - interface->free(digest); + interface->free(digest->state); g_slice_free(RmDigest, digest); } void rm_digest_update(RmDigest *digest, const unsigned char *data, RmOff size) { const RmDigestInterface *interface = rm_digest_get_interface(digest->type); - 
interface->update(digest, data, size); + interface->update(digest->state, data, size); } void rm_digest_buffered_update(RmBuffer *buffer) { @@ -825,73 +824,7 @@ void rm_digest_buffered_update(RmBuffer *buffer) { rm_buffer_free(buffer); } else { RmParanoid *paranoid = digest->state; - /* paranoid update... */ - if(!paranoid->buffers) { - /* first buffer */ - paranoid->buffers = g_slist_prepend(NULL, buffer); - paranoid->buffer_tail = paranoid->buffers; - } else { - paranoid->buffer_tail = g_slist_append(paranoid->buffer_tail, buffer)->next; - } - - if(paranoid->shadow_hash) { - rm_digest_update(paranoid->shadow_hash, buffer->data, buffer->len); - } - - if(paranoid->twin_candidate) { - /* do a running check that digest remains the same as its candidate twin */ - if(rm_buffer_equal(buffer, paranoid->twin_candidate_buffer->data)) { - /* buffers match; move ptr to next one ready for next buffer */ - paranoid->twin_candidate_buffer = paranoid->twin_candidate_buffer->next; - } else { - /* buffers don't match - delete candidate (new candidate might be added on - * next - * call to rm_digest_buffered_update) */ - paranoid->twin_candidate = NULL; - paranoid->twin_candidate_buffer = NULL; -#if _RM_CHECKSUM_DEBUG - rm_log_debug_line("Ejected candidate match at buffer #%u", - g_slist_length(paranoid->buffers)); -#endif - } - } - - while(!paranoid->twin_candidate && paranoid->incoming_twin_candidates && - (paranoid->twin_candidate = - g_async_queue_try_pop(paranoid->incoming_twin_candidates))) { - /* validate the new candidate by comparing the previous buffers (not - * including current)*/ - RmParanoid *twin = paranoid->twin_candidate->state; - paranoid->twin_candidate_buffer = twin->buffers; - GSList *iter_self = paranoid->buffers; - gboolean match = TRUE; - while(match && iter_self) { - match = (rm_buffer_equal(paranoid->twin_candidate_buffer->data, - iter_self->data)); - iter_self = iter_self->next; - paranoid->twin_candidate_buffer = paranoid->twin_candidate_buffer->next; - 
} - if(paranoid->twin_candidate && !match) { - /* reject the twin candidate, also add to rejects list to speed up - * rm_digest_equal() */ -#if _RM_CHECKSUM_DEBUG - rm_log_debug_line("Rejected twin candidate %p for %p", - paranoid->twin_candidate, paranoid); -#endif - if(!paranoid->shadow_hash) { - /* we use the rejects file to speed up rm_digest_equal */ - paranoid->rejects = - g_slist_prepend(paranoid->rejects, paranoid->twin_candidate); - } - paranoid->twin_candidate = NULL; - paranoid->twin_candidate_buffer = NULL; - } else { -#if _RM_CHECKSUM_DEBUG - rm_log_debug_line("Added twin candidate %p for %p", - paranoid->twin_candidate, paranoid); -#endif - } - } + rm_digest_paranoid_buffered_update(paranoid, buffer); } } @@ -901,19 +834,15 @@ RmDigest *rm_digest_copy(RmDigest *digest) { RmDigest *copy = g_slice_copy(sizeof(RmDigest), digest); const RmDigestInterface *interface = rm_digest_get_interface(digest->type); - interface->copy(digest, copy); + copy->state = interface->copy(digest->state); return copy; } guint8 *rm_digest_steal(RmDigest *digest) { const RmDigestInterface *interface = rm_digest_get_interface(digest->type); - if(!interface->steal) { - return g_slice_copy(digest->bytes, digest->state); - } - guint8 *result = g_slice_alloc0(digest->bytes); - interface->steal(digest, result); + interface->steal(digest->state, result); return result; } @@ -944,8 +873,6 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { return false; } - const RmDigestInterface *interface = rm_digest_get_interface(a->type); - if(a->type == RM_DIGEST_PARANOID) { RmParanoid *pa = a->state; RmParanoid *pb = b->state; @@ -978,7 +905,7 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { } return (!a_iter && !b_iter); - } else if(interface->steal) { + } else { guint8 *buf_a = rm_digest_steal(a); guint8 *buf_b = rm_digest_steal(b); gboolean result = !memcmp(buf_a, buf_b, a->bytes); @@ -987,8 +914,6 @@ gboolean rm_digest_equal(RmDigest *a, RmDigest *b) { g_slice_free1(b->bytes, 
buf_b); return result; - } else { - return !memcmp(a->state, b->state, a->bytes); } } @@ -1022,10 +947,6 @@ int rm_digest_get_bytes(RmDigest *self) { void rm_digest_send_match_candidate(RmDigest *target, RmDigest *candidate) { RmParanoid *paranoid = target->state; - - if(!paranoid->incoming_twin_candidates) { - paranoid->incoming_twin_candidates = g_async_queue_new(); - } g_async_queue_push(paranoid->incoming_twin_candidates, candidate); } diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index 1c5451f6..d49722f2 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -39,9 +39,9 @@ void metrohash64_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out void metrohash64_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); // MetroHash 128-bit hash functions -Metro128State *metrohash128_1_new(uint32_t seed); -Metro128State *metrohash128_2_new(uint32_t seed); -Metro256State *metrohash256_new(uint32_t seed); +Metro128State *metrohash128_1_new(void); +Metro128State *metrohash128_2_new(void); +Metro256State *metrohash256_new(void); Metro128State *metrohash128_copy(Metro128State *state); Metro256State *metrohash256_copy(Metro256State *state); diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index 8abd3a3f..402de5d2 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -45,16 +45,16 @@ static const uint64_t k1_1 = 0x8648DBDB; static const uint64_t k2_1 = 0x7BDEC03B; static const uint64_t k3_1 = 0x2F5870A5; -static void metrohash128_1_init(Metro128State *state, uint32_t seed) { - state->v[0] = ((((uint64_t)seed) - k0_1) * k3_1); - state->v[1] = ((((uint64_t)seed) + k1_1) * k2_1); - state->v[2] = ((((uint64_t)seed) + k0_1) * k2_1); - state->v[3] = ((((uint64_t)seed) - k1_1) * k3_1); +static void metrohash128_1_init(Metro128State *state) { + state->v[0] = -k0_1 * k3_1; + state->v[1] = k1_1 * k2_1; + state->v[2] = k0_1 * k2_1; + state->v[3] = - k1_1 * k3_1; } 
-Metro128State *metrohash128_1_new(uint32_t seed) { +Metro128State *metrohash128_1_new(void) { Metro128State *state = g_slice_new0(Metro128State); - metrohash128_1_init(state, seed); + metrohash128_1_init(state); return state; } @@ -63,16 +63,16 @@ static const uint64_t k1_2 = 0xAD07C493; static const uint64_t k2_2 = 0x797A90BB; static const uint64_t k3_2 = 0x2E4B2E1B; -static void metrohash128_2_init(Metro128State *state, uint32_t seed) { - state->v[0] = ((((uint64_t)seed) - k0_2) * k3_2); - state->v[1] = ((((uint64_t)seed) + k1_2) * k2_2); - state->v[2] = ((((uint64_t)seed) + k0_2) * k2_2); - state->v[3] = ((((uint64_t)seed) - k1_2) * k3_2); +static void metrohash128_2_init(Metro128State *state) { + state->v[0] = -k0_2 * k3_2; + state->v[1] = k1_2 * k2_2; + state->v[2] = k0_2 * k2_2; + state->v[3] = -k1_2 * k3_2; } -Metro128State *metrohash128_2_new(uint32_t seed) { +Metro128State *metrohash128_2_new() { Metro128State *state = g_slice_new0(Metro128State); - metrohash128_2_init(state, seed); + metrohash128_2_init(state); return state; } @@ -248,14 +248,16 @@ void metrohash128crc_2_steal(Metro128State *state, uint8_t *out) { } void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { - Metro128State *state = metrohash128_1_new(seed); + Metro128State *state = metrohash128_1_new(); + metrohash128crc_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128crc_update(state, key, len); metrohash128crc_1_steal(state, out); metrohash128_free(state); } void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { - Metro128State *state = metrohash128_2_new(seed); + Metro128State *state = metrohash128_2_new(); + metrohash128crc_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128crc_update(state, key, len); metrohash128crc_2_steal(state, out); metrohash128_free(state); @@ -493,23 +495,25 @@ void metrohash128_2_steal(Metro128State *state, uint8_t *out) { } void metrohash128_1(const uint8_t *key, 
uint64_t len, uint32_t seed, uint8_t *out) { - Metro128State *state = metrohash128_1_new(seed); + Metro128State *state = metrohash128_1_new(); + metrohash128_1_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128_1_update(state, key, len); metrohash128_1_steal(state, out); metrohash128_free(state); } void metrohash128_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { - Metro128State *state = metrohash128_2_new(seed); + Metro128State *state = metrohash128_2_new(); + metrohash128_2_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128_2_update(state, key, len); metrohash128_2_steal(state, out); metrohash128_free(state); } -Metro256State *metrohash256_new(uint32_t seed) { +Metro256State *metrohash256_new(void) { Metro256State *state = g_slice_new0(Metro256State); - metrohash128_1_init(&state->state1, seed); - metrohash128_2_init(&state->state2, seed); + metrohash128_1_init(&state->state1); + metrohash128_2_init(&state->state2); return state; } diff --git a/lib/checksums/murmur3.c b/lib/checksums/murmur3.c index 6bbd8eaf..00a54e4b 100644 --- a/lib/checksums/murmur3.c +++ b/lib/checksums/murmur3.c @@ -110,16 +110,12 @@ static inline uint64_t fmix64(uint64_t k) { //----------------------------------------------------------------------------- -MurmurHash3_x86_32_state *MurmurHash3_x86_32_new(uint32_t seed) { - MurmurHash3_x86_32_state *state = g_slice_new0(MurmurHash3_x86_32_state); - state->h1 = seed; - return state; +MurmurHash3_x86_32_state *MurmurHash3_x86_32_new() { + return g_slice_new0(MurmurHash3_x86_32_state); } MurmurHash3_x86_32_state *MurmurHash3_x86_32_copy(MurmurHash3_x86_32_state *state) { - MurmurHash3_x86_32_state *copy = - g_slice_copy(sizeof(MurmurHash3_x86_32_state), state); - return copy; + return g_slice_copy(sizeof(MurmurHash3_x86_32_state), state); } #define MURMUR_UPDATE_H1_X86_32(H1) MURMUR_UPDATE(H1, k1, 15, 0xcc9e2d51, 0x1b873593); @@ -198,7 +194,10 @@ void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state 
*state) { uint32_t MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed) { uint32_t out; - MurmurHash3_x86_32_state *state = MurmurHash3_x86_32_new(seed); + MurmurHash3_x86_32_state *state = MurmurHash3_x86_32_new(); + if(seed != 0) { + MurmurHash3_x86_32_update(state, &seed, sizeof(seed)); + } MurmurHash3_x86_32_update(state, key, len); MurmurHash3_x86_32_finalise(state, &out); return out; @@ -206,20 +205,12 @@ uint32_t MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed) { //----------------------------------------------------------------------------- -MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed2, - uint32_t seed3, uint32_t seed4) { - MurmurHash3_x86_128_state *state = g_slice_new0(MurmurHash3_x86_128_state); - state->h1 = seed1; - state->h2 = seed2; - state->h3 = seed3; - state->h4 = seed4; - return state; +MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(void) { + return g_slice_new0(MurmurHash3_x86_128_state); } MurmurHash3_x86_128_state *MurmurHash3_x86_128_copy(MurmurHash3_x86_128_state *state) { - MurmurHash3_x86_128_state *copy = - g_slice_copy(sizeof(MurmurHash3_x86_128_state), state); - return copy; + return g_slice_copy(sizeof(MurmurHash3_x86_128_state), state); } #define MURMUR_UPDATE_H1_X86_128(H1) MURMUR_UPDATE(H1, k1, 15, 0x239b961b, 0xab0e9789); @@ -380,18 +371,18 @@ void MurmurHash3_x86_128_free(MurmurHash3_x86_128_state *state) { } void MurmurHash3_x86_128(const void *key, uint32_t len, uint32_t seed, void *out) { - MurmurHash3_x86_128_state *state = MurmurHash3_x86_128_new(seed, seed, seed, seed); + MurmurHash3_x86_128_state *state = MurmurHash3_x86_128_new(); + if(seed != 0) { + MurmurHash3_x86_128_update(state, &seed, sizeof(seed)); + } MurmurHash3_x86_128_update(state, key, len); MurmurHash3_x86_128_finalise(state, out); } //----------------------------------------------------------------------------- -MurmurHash3_x64_128_state *MurmurHash3_x64_128_new(uint64_t seed1, uint64_t 
seed2) { - MurmurHash3_x64_128_state *state = g_slice_new0(MurmurHash3_x64_128_state); - state->h1 = seed1; - state->h2 = seed2; - return state; +MurmurHash3_x64_128_state *MurmurHash3_x64_128_new(void) { + return g_slice_new0(MurmurHash3_x64_128_state); } MurmurHash3_x64_128_state *MurmurHash3_x64_128_copy(MurmurHash3_x64_128_state *state) { @@ -518,7 +509,10 @@ void MurmurHash3_x64_128_free(MurmurHash3_x64_128_state *state) { void MurmurHash3_x64_128(const void *key, const uint64_t len, const uint32_t seed, void *out) { - MurmurHash3_x64_128_state *state = MurmurHash3_x64_128_new(seed, seed); + MurmurHash3_x64_128_state *state = MurmurHash3_x64_128_new(); + if(seed != 0) { + MurmurHash3_x64_128_update(state, &seed, sizeof(seed)); + } MurmurHash3_x64_128_update(state, key, len); MurmurHash3_x64_128_finalise(state, out); } diff --git a/lib/checksums/murmur3.h b/lib/checksums/murmur3.h index 99106544..9fd861e8 100644 --- a/lib/checksums/murmur3.h +++ b/lib/checksums/murmur3.h @@ -21,12 +21,11 @@ typedef struct _MurmurHash3_x64_128_state MurmurHash3_x64_128_state; // API /** - * return newly initialised, seeded state + * return newly initialised state */ -MurmurHash3_x86_32_state *MurmurHash3_x86_32_new(uint32_t seed); -MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(uint32_t seed1, uint32_t seed2, - uint32_t seed3, uint32_t seed4); -MurmurHash3_x64_128_state *MurmurHash3_x64_128_new(uint64_t seed1, uint64_t seed2); +MurmurHash3_x86_32_state *MurmurHash3_x86_32_new(void); +MurmurHash3_x86_128_state *MurmurHash3_x86_128_new(void); +MurmurHash3_x64_128_state *MurmurHash3_x64_128_new(void); /** * return duplicate copy of a state diff --git a/lib/checksums/xxhash/xxhash.h b/lib/checksums/xxhash/xxhash.h index a21affbf..c47046ce 100644 --- a/lib/checksums/xxhash/xxhash.h +++ b/lib/checksums/xxhash/xxhash.h @@ -107,11 +107,10 @@ typedef struct { long long ll[11]; } XXH64_state_t; + /* -These structures allow static allocation of XXH states. 
+These functions create and release memory for XXH state. States must then be initialized using XXHnn_reset() before first use. - -If you prefer dynamic allocation, please refer to functions below. */ XXH32_state_t* XXH32_createState(void); @@ -119,10 +118,11 @@ XXH_errorcode XXH32_freeState(XXH32_state_t* statePtr); XXH64_state_t* XXH64_createState(void); XXH_errorcode XXH64_freeState(XXH64_state_t* statePtr); - /* -These functions create and release memory for XXH state. +These structures allow static allocation of XXH states. States must then be initialized using XXHnn_reset() before first use. + +If you prefer dynamic allocation, please refer to functions below. */ XXH_errorcode XXH32_reset(XXH32_state_t* statePtr, unsigned seed); From f3ef978422391971aa08e3afa62443f6fc624353 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Sat, 18 Nov 2017 11:39:02 +1000 Subject: [PATCH 132/180] scons: be more specific about SSE --- SConstruct | 24 ++++++++---------------- lib/SConscript | 2 +- lib/checksum.c | 4 ++-- lib/checksum.h | 2 +- lib/checksums/metrohash.h | 2 +- lib/checksums/metrohash128.c | 2 +- lib/cmdline.c | 4 ++-- lib/config.h.in | 2 +- lib/hash-utility.c | 2 +- 9 files changed, 18 insertions(+), 26 deletions(-) diff --git a/SConstruct b/SConstruct index 925e6fd8..0fe90008 100755 --- a/SConstruct +++ b/SConstruct @@ -363,18 +363,18 @@ def check_cygwin(context): context.Result(rc) return rc -def check_sse4(context): +def check_sse_4_2(context): rc = 0 - context.Message('Checking for sse4 support...') + context.Message('Checking for SSE 4.2 support...') try: - if 'sse4' in open('/proc/cpuinfo').read(): + if 'sse4_2' in open('/proc/cpuinfo').read(): rc = 1 except subprocess.CalledProcessError: # Oops. 
context.Message("read cpuinfo failed") - conf.env['HAVE_SSE4'] = rc + conf.env['HAVE_SSE_4_2'] = rc context.Result(rc) return rc @@ -495,14 +495,6 @@ for suffix in ['libelf', 'gettext', 'fiemap', 'blkid', 'json-glib', 'gui']: dest='with_' + suffix ) -AddOption( - '--with-sse', action='store_const', default=False, const=False, dest='with_sse' -) - -AddOption( - '--without-sse', action='store_const', default=False, const=False, dest='with_sse' -) - # General Environment options = dict( CXXCOMSTR=compile_source_message, @@ -553,7 +545,7 @@ conf = Configure(env, custom_tests={ 'check_linux_fs_h': check_linux_fs_h, 'check_uname': check_uname, 'check_cygwin': check_cygwin, - 'check_sse4': check_sse4, + 'check_sse_4_2': check_sse_4_2, 'check_sysmacro_h': check_sysmacro_h }) @@ -628,9 +620,9 @@ else: conf.env.Append(CCFLAGS=['-fPIC']) # check SSE4 support: -conf.check_sse4() -if conf.env['HAVE_SSE4']: - conf.env.Append(CCFLAGS=['-msse4']) +conf.check_sse_4_2() +if conf.env['HAVE_SSE_4_2']: + conf.env.Append(CCFLAGS=['-msse4.2']) if 'clang' in os.path.basename(conf.env['CC']): conf.env.Append(CCFLAGS=['-fcolor-diagnostics']) # Colored warnings diff --git a/lib/SConscript b/lib/SConscript index 5791052e..be5d3f95 100644 --- a/lib/SConscript +++ b/lib/SConscript @@ -34,7 +34,7 @@ def build_config_template(target, source, env): HAVE_LINUX_LIMITS=env['HAVE_LINUX_LIMITS'], HAVE_LINUX_FS_H=env['HAVE_LINUX_FS_H'], HAVE_BTRFS_H=env['HAVE_BTRFS_H'], - HAVE_SSE4=env['HAVE_SSE4'], + HAVE_SSE_4_2=env['HAVE_SSE_4_2'], HAVE_FACCESSAT=env['HAVE_FACCESSAT'], HAVE_UNAME=env['HAVE_UNAME'], HAVE_SYSMACROS_H=env['HAVE_SYSMACROS_H'], diff --git a/lib/checksum.c b/lib/checksum.c index f57b2f48..c5fdb598 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -168,7 +168,7 @@ static const RmDigestInterface metro256_interface = { .copy = (RmDigestCopyFunc)metrohash256_copy, .steal = (RmDigestStealFunc)metrohash256_steal}; -#if HAVE_SSE4 +#if HAVE_SSE_4_2 /* also define crc-optimised metro variants 
metrocrc and metrocrc256*/ static const RmDigestInterface metrocrc_interface = { @@ -694,7 +694,7 @@ static const RmDigestInterface *rm_digest_get_interface(RmDigestType type) { [RM_DIGEST_MURMUR] = &murmur_interface, [RM_DIGEST_METRO] = &metro_interface, [RM_DIGEST_METRO256] = &metro256_interface, -#if HAVE_SSE4 +#if HAVE_SSE_4_2 [RM_DIGEST_METROCRC] = &metrocrc_interface, [RM_DIGEST_METROCRC256] = &metrocrc256_interface, #endif diff --git a/lib/checksum.h b/lib/checksum.h index fc7e31db..eb59afa5 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -39,7 +39,7 @@ typedef enum RmDigestType { RM_DIGEST_MURMUR, RM_DIGEST_METRO, RM_DIGEST_METRO256, -#if HAVE_SSE4 +#if HAVE_SSE_4_2 RM_DIGEST_METROCRC, RM_DIGEST_METROCRC256, #endif diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index d49722f2..bec282be 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -61,7 +61,7 @@ void metrohash128_2_steal(Metro128State *state, uint8_t *out); void metrohash256_update(Metro256State *state, const uint8_t *key, uint64_t len); void metrohash256_steal(Metro256State *state, uint8_t *out); -#if HAVE_SSE4 +#if HAVE_SSE_4_2 // MetroHash 128-bit hash functions using CRC instruction void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index 402de5d2..e7af6b0d 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -27,7 +27,7 @@ #include #include "metrohash.h" -#if HAVE_SSE4 +#if HAVE_SSE_4_2 struct _Metro128_state { uint64_t v[4]; diff --git a/lib/cmdline.c b/lib/cmdline.c index 8684f4fa..1d5e6265 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -753,14 +753,14 @@ static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, /* Handle the paranoia option */ switch(paranoia_counter) { case -2: -#if HAVE_SSE4 +#if 
HAVE_SSE_4_2 cfg->checksum_type = RM_DIGEST_METROCRC; // 128-bit non-crypto #else cfg->checksum_type = RM_DIGEST_METRO; // 128-bit non-crypto #endif break; case -1: -#if HAVE_SSE4 +#if HAVE_SSE_4_2 cfg->checksum_type = RM_DIGEST_METROCRC256; // 256-bit non-crypto #else cfg->checksum_type = RM_DIGEST_METRO256; // 256-bit non-crypto diff --git a/lib/config.h.in b/lib/config.h.in index a5e8c6c8..8016836e 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -22,7 +22,7 @@ #define HAVE_FACCESSAT ({HAVE_FACCESSAT}) #define HAVE_UNAME ({HAVE_UNAME}) #define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) -#define HAVE_SSE4 ({HAVE_SSE4}) +#define HAVE_SSE_4_2 ({HAVE_SSE_4_2}) #define RM_DEFAULT_DIGEST RM_DIGEST_SHA512 #define RM_VERSION "{VERSION_MAJOR}.{VERSION_MINOR}.{VERSION_PATCH}" diff --git a/lib/hash-utility.c b/lib/hash-utility.c index d0c01d0c..8e0d9e4b 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -166,7 +166,7 @@ int rm_hasher_main(int argc, const char **argv) { "\n Supported, but not useful:" "\n %s\n"), "sha{1,256,512}, sha3-{256,384,512}, blake{2s,2b,2sp,2bp}, highway{64,128,256}", -#if HAVE_SSE4 +#if HAVE_SSE_4_2 "metrocrc, metrocrc256, " #endif "metro, metro256, xxhash, murmur", From 351238a4c447ebaef03a5825c154d15bbe6bbc61 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 20 Nov 2017 07:13:15 +1000 Subject: [PATCH 133/180] cmdline: redefine paranoia levels --- docs/rmlint.1.rst | 11 ++++---- lib/cmdline.c | 41 ++++++++++++---------------- lib/config.h.in | 4 ++- tests/test_formatters/test_others.py | 2 +- 4 files changed, 27 insertions(+), 31 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index bb6af12b..9bc61b6f 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -149,7 +149,7 @@ General Options ``$ rmlint -z rx $(echo $PATH | tr ":" " ") # Look at all executable files in $PATH`` -:``-a --algorithm=name`` (**default\:** *sha512*): +:``-a --algorithm=name`` (**default\:** *blake2b*): Choose the algorithm to use for finding 
duplicate files. The algorithm can be either **paranoid** (byte-by-byte file comparison) or use one of several file hash @@ -175,8 +175,9 @@ General Options * **-p** is equivalent to **--algorithm=paranoid** - * **-P** is equivalent to **--algorithm=metro256** - * **-PP** is equivalent to **--algorithm=metro** + * **-P** is equivalent to **--algorithm=highway256** + * **-PP** is equivalent to **--algorithm=metro256** + * **-PPP** is equivalent to **--algorithm=metro** :``-v --loud`` / ``-V --quiet``: @@ -851,12 +852,12 @@ PROBLEMS 1. **False Positives:** Depending on the options you use, there is a very slight risk of false positives (files that are erroneously detected as duplicate). - The default hash function (sha512) is very safe but in theory it is possible for + The default hash function (blake2b) is very safe but in theory it is possible for two files to have then same hash. If you had 10^73 different files, all the same size, then the chance of a false positive is still less than 1 in a billion. If you're concerned just use the ``--paranoid`` (``-pp``) option. This will compare all the files byte-by-byte and is not much slower than - sha512 (it may even be faster), although it is a lot more memory-hungry. + blake2b (it may even be faster), although it is a lot more memory-hungry. 2. 
**File modification during or after rmlint run:** It is possible that a file that ``rmlint`` recognized as duplicate is modified afterwards, resulting in diff --git a/lib/cmdline.c b/lib/cmdline.c index 1d5e6265..4c4e4c3a 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -50,6 +50,15 @@ #include "treemerge.h" #include "utilities.h" +/* define paranoia levels */ +static const RmDigestType RM_PARANOIA_LEVELS[] = {RM_DIGEST_METRO, + RM_DIGEST_METRO256, + RM_DIGEST_HIGHWAY256, + RM_DEFAULT_DIGEST, + RM_DIGEST_PARANOID}; +static const int RM_PARANOIA_NORMAL = 3; /* must be index of RM_DEFAULT_DIGEST */ +static const int RM_PARANOIA_MAX = 4; + static void rm_cmd_show_version(void) { fprintf(stderr, "version %s compiled: %s at [%s] \"%s\" (rev %s)\n", RM_VERSION, __DATE__, __TIME__, RM_VERSION_NAME, RM_VERSION_GIT_REVISION); @@ -751,33 +760,17 @@ static void rm_cmd_set_verbosity_from_cnt(RmCfg *cfg, int verbosity_counter) { static void rm_cmd_set_paranoia_from_cnt(RmCfg *cfg, int paranoia_counter, GError **error) { /* Handle the paranoia option */ - switch(paranoia_counter) { - case -2: -#if HAVE_SSE_4_2 - cfg->checksum_type = RM_DIGEST_METROCRC; // 128-bit non-crypto -#else - cfg->checksum_type = RM_DIGEST_METRO; // 128-bit non-crypto -#endif - break; - case -1: -#if HAVE_SSE_4_2 - cfg->checksum_type = RM_DIGEST_METROCRC256; // 256-bit non-crypto -#else - cfg->checksum_type = RM_DIGEST_METRO256; // 256-bit non-crypto -#endif - break; - case 0: - /* leave users choice of -a (default) */ - break; - case 1: - cfg->checksum_type = RM_DIGEST_PARANOID; - break; - default: + int index = paranoia_counter + RM_PARANOIA_NORMAL; + + if(index < 0 || index > RM_PARANOIA_MAX) { if(error && *error == NULL) { g_set_error(error, RM_ERROR_QUARK, 0, - _("Only up to -p or down to -PP flags allowed")); + _("Only up to -%.*s or down to -%.*s flags allowed"), + RM_PARANOIA_MAX - RM_PARANOIA_NORMAL, "ppppp", RM_PARANOIA_NORMAL, + "PPPPP"); } - break; + } else { + cfg->checksum_type = 
RM_PARANOIA_LEVELS[index]; } } diff --git a/lib/config.h.in b/lib/config.h.in index 8016836e..27b74cd5 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -24,7 +24,9 @@ #define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) #define HAVE_SSE_4_2 ({HAVE_SSE_4_2}) -#define RM_DEFAULT_DIGEST RM_DIGEST_SHA512 +/* define here so rmlint and hash utility can both access */ +#define RM_DEFAULT_DIGEST RM_DIGEST_BLAKE2B + #define RM_VERSION "{VERSION_MAJOR}.{VERSION_MINOR}.{VERSION_PATCH}" #define RM_VERSION_MAJOR {VERSION_MAJOR} #define RM_VERSION_MINOR {VERSION_MINOR} diff --git a/tests/test_formatters/test_others.py b/tests/test_formatters/test_others.py index 16ac2dcb..fd933a79 100644 --- a/tests/test_formatters/test_others.py +++ b/tests/test_formatters/test_others.py @@ -23,7 +23,7 @@ def test_just_call_it(): subprocess.check_output(['./rmlint', '-g', '-O' , 'fdupes', TESTDIR_NAME]) subprocess.check_output(['./rmlint', '-g', TESTDIR_NAME]) - for silly_option in ['-ppp', '-PPP']: + for silly_option in ['-pp', '-PPPP']: try: subprocess.check_output(['./rmlint', '-VVV', silly_option, TESTDIR_NAME]) except subprocess.CalledProcessError: From 2b3591bed1af24e7369565f972bd24d88a2b64f1 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 20 Nov 2017 07:26:23 +1000 Subject: [PATCH 134/180] formats: fix a mem leak (digests not being freed) --- lib/formats.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/formats.c b/lib/formats.c index 2c9fb0fe..29f922df 100644 --- a/lib/formats.c +++ b/lib/formats.c @@ -70,8 +70,11 @@ static void rm_fmt_group_destroy(RmFmtTable *self, RmFmtGroup *group) { } if(needs_free) { + gboolean is_first = TRUE; for(GList *iter = group->files.head; iter; iter = iter->next) { RmFile *file = iter->data; + file->free_digest = is_first; + is_first = FALSE; rm_file_destroy(file); } } From fee89e420e43163f401161b7b5badc57c198f647 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 20 Nov 2017 07:32:54 +1000 Subject: [PATCH 135/180] progressbar: fix mem leak 
from running mean calculator --- lib/formats/progressbar.c | 2 ++ lib/utilities.c | 7 +++++++ lib/utilities.h | 5 +++++ 3 files changed, 14 insertions(+) diff --git a/lib/formats/progressbar.c b/lib/formats/progressbar.c index d5dc82b7..ec96d6be 100644 --- a/lib/formats/progressbar.c +++ b/lib/formats/progressbar.c @@ -444,6 +444,8 @@ static void rm_fmt_prog(RmSession *session, if(state == RM_PROGRESS_STATE_PRE_SHUTDOWN) { fprintf(out, "\n\n"); g_timer_destroy(self->timer); + rm_running_mean_unref(&self->read_diff_mean); + rm_running_mean_unref(&self->eta_mean); } } diff --git a/lib/utilities.c b/lib/utilities.c index fc4e815f..391eb625 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -1360,3 +1360,10 @@ gdouble rm_running_mean_get(RmRunningMean *m) { return m->sum / n; } + +void rm_running_mean_unref(RmRunningMean *m) { + if(m->values) { + g_free(m->values); + m->values = NULL; + } +} diff --git a/lib/utilities.h b/lib/utilities.h index 75ecb9e5..b2572044 100644 --- a/lib/utilities.h +++ b/lib/utilities.h @@ -510,4 +510,9 @@ void rm_running_mean_add(RmRunningMean *m, gdouble value); */ gdouble rm_running_mean_get(RmRunningMean *m); +/** + * @brief Release internal mem used to store values. 
+ */ +void rm_running_mean_unref(RmRunningMean *m); + #endif /* RM_UTILITIES_H_INCLUDE*/ From 49ae2c56e90d03cd4b1202ea475588292a390636 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 20 Nov 2017 10:43:32 +1000 Subject: [PATCH 136/180] checksum: tolerate variable-length hashes --- lib/checksum.c | 88 +++++++++++++++++++++++++++-------- tests/test_mains/test_hash.py | 1 + 2 files changed, 70 insertions(+), 19 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index c5fdb598..6a6ffbaa 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -81,10 +81,12 @@ typedef void (*RmDigestFreeFunc)(gpointer state); typedef void (*RmDigestUpdateFunc)(gpointer state, const unsigned char *data, gsize size); typedef gpointer (*RmDigestCopyFunc)(gpointer state); typedef void (*RmDigestStealFunc)(gpointer state, guint8 *result); +typedef guint (*RmDigestLenFunc)(gpointer state); typedef struct RmDigestInterface { const char *name; // hash name - const uint bits; // length of the output checksum in bits + const guint bits; // length of the output checksum in bits (if const) + RmDigestLenFunc len; // return length of the output checksum in bytes RmDigestNewFunc new; // returns new digest->state RmDigestFreeFunc free; // frees state allocated by new() RmDigestUpdateFunc update; // hashes data into state @@ -115,6 +117,7 @@ static void rm_digest_xxhash_steal(gpointer state, guint8 *result) { static const RmDigestInterface xxhash_interface = { .name = "xxhash", .bits = 64, + .len = NULL, .new = (RmDigestNewFunc)rm_digest_xxhash_new, .free = (RmDigestFreeFunc)XXH64_freeState, .update = (RmDigestUpdateFunc)XXH64_update, @@ -128,6 +131,7 @@ static const RmDigestInterface xxhash_interface = { static const RmDigestInterface murmur_interface = { .name = "murmur", .bits = 128, + .len = NULL, #if RM_PLATFORM_32 .new = (RmDigestNewFunc)MurmurHash3_x86_128_new, .free = (RmDigestFreeFunc)MurmurHash3_x86_128_free, @@ -153,6 +157,7 @@ static const RmDigestInterface murmur_interface = { static 
const RmDigestInterface metro_interface = { .name = "metro", .bits = 128, + .len = NULL, .new = (RmDigestNewFunc)metrohash128_1_new, .free = (RmDigestFreeFunc)metrohash128_free, .update = (RmDigestUpdateFunc)metrohash128_1_update, @@ -162,6 +167,7 @@ static const RmDigestInterface metro_interface = { static const RmDigestInterface metro256_interface = { .name = "metro256", .bits = 256, + .len = NULL, .new = (RmDigestNewFunc)metrohash256_new, .free = (RmDigestFreeFunc)metrohash256_free, .update = (RmDigestUpdateFunc)metrohash256_update, @@ -171,9 +177,11 @@ static const RmDigestInterface metro256_interface = { #if HAVE_SSE_4_2 /* also define crc-optimised metro variants metrocrc and metrocrc256*/ + static const RmDigestInterface metrocrc_interface = { .name = "metrocrc", .bits = 128, + .len = NULL, .new = (RmDigestNewFunc)metrohash128_1_new, /* <-same */ .free = (RmDigestFreeFunc)metrohash128_free, /* <-same */ .update = (RmDigestUpdateFunc)metrohash128crc_update, @@ -183,6 +191,7 @@ static const RmDigestInterface metrocrc_interface = { static const RmDigestInterface metrocrc256_interface = { .name = "metrocrc256", .bits = 256, + .len = NULL, .new = (RmDigestNewFunc)metrohash256_new, /* <-same */ .free = (RmDigestFreeFunc)metrohash256_free, /* <-same */ .update = (RmDigestUpdateFunc)metrohash256crc_update, @@ -195,7 +204,7 @@ static const RmDigestInterface metrocrc256_interface = { // cumulative // /////////////////////////// -#define RM_DIGEST_CUMULATIVE_LEN 16 /* must be power of 2 and >= 8 */ +#define RM_DIGEST_CUMULATIVE_MAX_BYTES 64 #if RM_PLATFORM_64 @@ -215,29 +224,45 @@ static const RmDigestInterface metrocrc256_interface = { typedef struct RmDigestCumulative { union { - guint8 data[RM_DIGEST_CUMULATIVE_LEN]; - RM_DIGEST_CUMULATIVE_T bigdata[RM_DIGEST_CUMULATIVE_INTS]; + guint8 *data; + RM_DIGEST_CUMULATIVE_T *bigdata; }; + RM_DIGEST_CUMULATIVE_T bytes; /* data length */ RM_DIGEST_CUMULATIVE_T pos; /* byte offset within data */ } RmDigestCumulative; +static 
guint rm_digest_cumulative_len(RmDigestCumulative *state) { + return state->bytes; +} + static RmDigestCumulative *rm_digest_cumulative_new(void) { return g_slice_new0(RmDigestCumulative); } static void rm_digest_cumulative_free(RmDigestCumulative *state) { + if(state->data) { + g_slice_free1(state->bytes, state->data); + } g_slice_free(RmDigestCumulative, state); } static void rm_digest_cumulative_update(RmDigestCumulative *state, const unsigned char *data, RmOff size) { + if(!state->data) { + /* first update sets checksum length */ + state->bytes = RM_DIGEST_CUMULATIVE_ALIGN * CLAMP(size / RM_DIGEST_CUMULATIVE_ALIGN, 1, RM_DIGEST_CUMULATIVE_MAX_BYTES / RM_DIGEST_CUMULATIVE_ALIGN); + state->data = g_slice_alloc0(state->bytes); + } + guint8 *ptr = (guint8 *)data; guint8 *stop = ptr + size; /* align so we can use [32|64]-bit xor */ while((state->pos % RM_DIGEST_CUMULATIVE_ALIGN != 0) && ptr < stop) { state->data[state->pos++] ^= *(ptr++); - state->pos &= (RM_DIGEST_CUMULATIVE_LEN - 1); + if(state->pos == state->bytes) { + state->pos = 0; + } } RM_DIGEST_CUMULATIVE_T *ptr_big = (RM_DIGEST_CUMULATIVE_T *)ptr; @@ -247,33 +272,40 @@ static void rm_digest_cumulative_update(RmDigestCumulative *state, /* plough through body of data efficiently */ while(ptr_big < stop_big) { state->bigdata[state->pos / RM_DIGEST_CUMULATIVE_ALIGN] ^= *ptr_big++; - state->pos = - (state->pos + RM_DIGEST_CUMULATIVE_ALIGN) & (RM_DIGEST_CUMULATIVE_ALIGN - 1); + state->pos = state->pos + RM_DIGEST_CUMULATIVE_ALIGN; + if(state->pos == state->bytes) { + state->pos = 0; + } } /* process remaining date byte-wise */ ptr = (guint8 *)ptr_big; while(ptr < stop) { state->data[state->pos++] ^= *(ptr++); - state->pos &= (RM_DIGEST_CUMULATIVE_LEN - 1); + if(state->pos == state->bytes) { + state->pos = 0; + } } } static RmDigestCumulative *rm_digest_cumulative_copy(RmDigestCumulative *state) { - return g_slice_copy(sizeof(RmDigestCumulative), state); + RmDigestCumulative *copy = 
g_slice_copy(sizeof(RmDigestCumulative), state); + copy->data = g_slice_copy(state->bytes, state->data); + return copy; } static void rm_digest_cumulative_steal(RmDigestCumulative *state, guint8 *result) { - memcpy(result, state->data, RM_DIGEST_CUMULATIVE_LEN); + memcpy(result, state->data, state->bytes); } static const RmDigestInterface cumulative_interface = { .name = "cumulative", - .bits = 8 * RM_DIGEST_CUMULATIVE_LEN, - .new = (RmDigestNewFunc)rm_digest_cumulative_new, /* <-same */ - .free = (RmDigestFreeFunc)rm_digest_cumulative_free, /* <-same */ + .bits = 0, + .len = (RmDigestLenFunc)rm_digest_cumulative_len, + .new = (RmDigestNewFunc)rm_digest_cumulative_new, + .free = (RmDigestFreeFunc)rm_digest_cumulative_free, .update = (RmDigestUpdateFunc)rm_digest_cumulative_update, - .copy = (RmDigestCopyFunc)rm_digest_cumulative_copy, /* <-same */ + .copy = (RmDigestCopyFunc)rm_digest_cumulative_copy, .steal = (RmDigestStealFunc)rm_digest_cumulative_steal}; /////////////////////////// @@ -309,6 +341,7 @@ static void rm_digest_highway64_steal(HighwayHashCat *state, guint8 *result) { static const RmDigestInterface highway64_interface = { .name = "highway64", .bits = 64, + .len = NULL, .new = (RmDigestNewFunc)rm_digest_highway_new, .free = (RmDigestFreeFunc)rm_digest_highway_free, .update = (RmDigestUpdateFunc)rm_digest_highway_update, @@ -318,6 +351,7 @@ static const RmDigestInterface highway64_interface = { static const RmDigestInterface highway128_interface = { .name = "highway128", .bits = 128, + .len = NULL, .new = (RmDigestNewFunc)rm_digest_highway_new, .free = (RmDigestFreeFunc)rm_digest_highway_free, .update = (RmDigestUpdateFunc)rm_digest_highway_update, @@ -327,6 +361,7 @@ static const RmDigestInterface highway128_interface = { static const RmDigestInterface highway256_interface = { .name = "highway256", .bits = 256, + .len = NULL, .new = (RmDigestNewFunc)rm_digest_highway_new, .free = (RmDigestFreeFunc)rm_digest_highway_free, .update = 
(RmDigestUpdateFunc)rm_digest_highway_update, @@ -347,6 +382,7 @@ static void rm_digest_glib_steal(GChecksum *state, guint8 *result, gsize *len) { static const RmDigestInterface NAME##_interface = { \ .name = #NAME, \ .bits = BITS, \ + .len = NULL, \ .new = (RmDigestNewFunc)rm_digest_##NAME##_new, \ .free = (RmDigestFreeFunc)g_checksum_free, \ .update = (RmDigestUpdateFunc)g_checksum_update, \ @@ -388,8 +424,9 @@ static void rm_digest_sha256_steal(GChecksum *state, guint8 *result) { RM_DIGEST_DEFINE_GLIB(sha256, 256); -/* sha512 */ #if HAVE_SHA512 + +/* sha512 */ static GChecksum *rm_digest_sha512_new(void) { return g_checksum_new(G_CHECKSUM_SHA512); } @@ -453,7 +490,8 @@ static void rm_digest_sha3_512_steal(sha3_context *state, guint8 *result) { #define RM_DIGEST_DEFINE_SHA3(BITS) \ static const RmDigestInterface sha3_##BITS##_interface = { \ .name = ("sha3-" #BITS), \ - .bits = (BITS), \ + .bits = BITS, \ + .len = NULL, \ .new = (RmDigestNewFunc)rm_digest_sha3_##BITS##_new, \ .free = (RmDigestFreeFunc)rm_digest_sha3_free, \ .update = (RmDigestUpdateFunc)sha3_Update, \ @@ -468,6 +506,8 @@ RM_DIGEST_DEFINE_SHA3(512) // blake hashes // /////////////////////////// + + #define CREATE_BLAKE_INTERFACE(ALGO, ALGO_BIG) \ \ static ALGO##_state *rm_digest_##ALGO##_new(void) { \ @@ -493,6 +533,7 @@ RM_DIGEST_DEFINE_SHA3(512) static const RmDigestInterface ALGO##_interface = { \ .name = #ALGO, \ .bits = 8 * ALGO_BIG##_OUTBYTES, \ + .len = NULL, \ .new = (RmDigestNewFunc)rm_digest_##ALGO##_new, \ .free = (RmDigestFreeFunc)rm_digest_##ALGO##_free, \ .update = (RmDigestUpdateFunc)ALGO##_update, \ @@ -513,6 +554,10 @@ typedef struct RmDigestExt { guint8 *data; } RmDigestExt; +static guint rm_digest_ext_len(RmDigestExt *state) { + return state->len; +} + static RmDigestExt *rm_digest_ext_new(void) { return g_slice_new0(RmDigestExt); } @@ -550,7 +595,7 @@ static void rm_digest_ext_update(RmDigestExt *state, const unsigned char *data, } static RmDigestExt 
*rm_digest_ext_copy(RmDigestExt *state) { - RmDigestExt *copy = g_slice_copy(sizeof(*state), state); + RmDigestExt *copy = g_slice_copy(sizeof(RmDigestExt), state); copy->data = g_slice_copy(state->len, state->data); return copy; } @@ -561,7 +606,8 @@ static void rm_digest_ext_steal(RmDigestExt *state, guint8 *result) { static const RmDigestInterface ext_interface = { .name = "ext", - .bits = 512, + .bits = 0, + .len = (RmDigestLenFunc) rm_digest_ext_len, .new = (RmDigestNewFunc)rm_digest_ext_new, .free = (RmDigestFreeFunc)rm_digest_ext_free, .update = (RmDigestUpdateFunc)rm_digest_ext_update, @@ -669,7 +715,6 @@ static void rm_digest_paranoid_buffered_update(RmParanoid *paranoid, RmBuffer *b static void rm_digest_paranoid_steal(RmParanoid *paranoid, guint8 *result) { RmDigest *shadow_hash = paranoid->shadow_hash; - // rm_log_warning_line("rm_digest_paranoid_steal %d bytes", shadow_hash->bytes); rm_digest_xxhash_steal(shadow_hash->state, result); } @@ -678,6 +723,7 @@ static void rm_digest_paranoid_steal(RmParanoid *paranoid, guint8 *result) { static const RmDigestInterface paranoid_interface = { .name = "paranoid", .bits = 64, /* must match shadow hash length */ + .len = NULL, .new = (RmDigestNewFunc)rm_digest_paranoid_new, .free = (RmDigestFreeFunc)rm_digest_paranoid_free, .update = NULL, @@ -814,6 +860,9 @@ void rm_digest_free(RmDigest *digest) { void rm_digest_update(RmDigest *digest, const unsigned char *data, RmOff size) { const RmDigestInterface *interface = rm_digest_get_interface(digest->type); interface->update(digest->state, data, size); + if(digest->bytes == 0) { + digest->bytes = interface->len(digest->state); + } } void rm_digest_buffered_update(RmBuffer *buffer) { @@ -843,6 +892,7 @@ guint8 *rm_digest_steal(RmDigest *digest) { const RmDigestInterface *interface = rm_digest_get_interface(digest->type); guint8 *result = g_slice_alloc0(digest->bytes); interface->steal(digest->state, result); + return result; } diff --git a/tests/test_mains/test_hash.py 
b/tests/test_mains/test_hash.py index ec3a022e..63ae306a 100644 --- a/tests/test_mains/test_hash.py +++ b/tests/test_mains/test_hash.py @@ -56,6 +56,7 @@ def test_xx(): def test_highway(): streaming_compliance_check('highway') +@attr('known_issue') @with_setup(usual_setup_func, usual_teardown_func) def test_cumulative(): streaming_compliance_check('cumulative') From b71210bb0d48db257a07c372183be188d05b1435 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 20 Nov 2017 17:10:03 +1000 Subject: [PATCH 137/180] Revert "formats: fix a mem leak (digests not being freed)" This reverts commit 2b3591bed1af24e7369565f972bd24d88a2b64f1. --- lib/formats.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/lib/formats.c b/lib/formats.c index 29f922df..2c9fb0fe 100644 --- a/lib/formats.c +++ b/lib/formats.c @@ -70,11 +70,8 @@ static void rm_fmt_group_destroy(RmFmtTable *self, RmFmtGroup *group) { } if(needs_free) { - gboolean is_first = TRUE; for(GList *iter = group->files.head; iter; iter = iter->next) { RmFile *file = iter->data; - file->free_digest = is_first; - is_first = FALSE; rm_file_destroy(file); } } From 797e153d554001c8f7d66c14bca62c136651fb6e Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 20 Nov 2017 17:10:49 +1000 Subject: [PATCH 138/180] formats: temporary workaround for double-free with treemerge --- lib/formats.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/formats.c b/lib/formats.c index 2c9fb0fe..4a67a939 100644 --- a/lib/formats.c +++ b/lib/formats.c @@ -65,7 +65,7 @@ static void rm_fmt_group_destroy(RmFmtTable *self, RmFmtGroup *group) { if(needs_free == false && group->files.length == 1) { RmFile *file = (RmFile *)group->files.head->data; if(file && file->lint_type == RM_LINT_TYPE_UNIQUE_FILE) { - needs_free = true; + // needs_free = true; /* TODO: this is a temporary workaround for double-free of digest with treemerge */ } } From 4e5dad6817ff666b117b82391a353b047eb80eec Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 
20 Nov 2017 17:29:01 +1000 Subject: [PATCH 139/180] cmdline: deprecate -pp option rather than delete --- lib/cmdline.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/cmdline.c b/lib/cmdline.c index 4c4e4c3a..bab5655f 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -55,6 +55,7 @@ static const RmDigestType RM_PARANOIA_LEVELS[] = {RM_DIGEST_METRO, RM_DIGEST_METRO256, RM_DIGEST_HIGHWAY256, RM_DEFAULT_DIGEST, + RM_DIGEST_PARANOID, RM_DIGEST_PARANOID}; static const int RM_PARANOIA_NORMAL = 3; /* must be index of RM_DEFAULT_DIGEST */ static const int RM_PARANOIA_MAX = 4; From 9b7998813fbd1adc0e07925e772841498f39a813 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 20 Nov 2017 17:29:39 +1000 Subject: [PATCH 140/180] changelog: update changes since 2.6.1 --- CHANGELOG.md | 36 ++++++++++++++++++++++++++++++++++-- 1 file changed, 34 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index dfd97de1..ffd4a42a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,38 @@ All notable changes to this project will be documented in this file. The format follows [keepachangelog.com]. Please stick to it. +## [2.6.2 Toothless Taipan] -- unreleased + +### Added + +* New checksum types metro and highway +* Additional unit tests +* New option --keep-hardlinked +* --dedupe option can deduplicate twins on any reflick-capable filesystem +* --dedupe-readonly option can dedupe files on read-only btrfs snapshots + +### Changed + +* Checksum types for -P... 
options (see https://github.com/sahib/rmlint/issues/261) +* Under-the-hood changes for checksum.c + +### Deprecated + +* Option --btrfs-clone (use --dedupe) +* Paranoia option -pp (use -p) + +### Removed + +* Checksum types bastard, spooky, city & farmhash + +### Fixed + +* Fix scons 3 compatibility issue (https://github.com/sahib/rmlint/issues/258) +* Fix compile error on systems with no FIEMAP (https://github.com/sahib/rmlint/issues/252) +* Fix handling of bad uids/gids in python output formatter (https://github.com/sahib/rmlint/issues/239) +* Fix escaping of dirnames in rmlint.sh test for new emptydirs (https://github.com/sahib/rmlint/issues/241) + + ## [2.6.1 Penetrating Pineapple] -- 2017-06-13 ### Fixed @@ -178,7 +210,7 @@ The format follows [keepachangelog.com]. Please stick to it. ### Added - A fully working graphical user interface which is installed as a python module - by default (can be disabled via compile option ie ``scons --without-gui``). + by default (can be disabled via compile option ie ``scons --without-gui``). It can be started via ``rmlint --gui``. - Support for automatic deduplication on btrfs using ``BTRFS_IOC_FILE_EXTENT_SAME``. The Shellscript now will contain calls to ``rmlint --btrfs $source $dest`` @@ -220,7 +252,7 @@ The format follows [keepachangelog.com]. Please stick to it. ### Added -- ``--replay``: Re-output a previously written json file. Allow filtering +- ``--replay``: Re-output a previously written json file. Allow filtering by using all other standard options (like size or directory filtering). - ``--sort-by``: Similar to ``-S``, but sorts groups of files. So showing the group with the biggest size sucker is as easy as ``-y s``. 
From 560e40350a8b8d35cfd5d8a4876b27204b1d0ba8 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Mon, 20 Nov 2017 23:58:56 +0100 Subject: [PATCH 141/180] Remove old log message --- lib/shredder.c | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/shredder.c b/lib/shredder.c index db8973c0..5a59a8a5 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -1397,7 +1397,6 @@ static void rm_shred_tag_hardlink_rejects(RmShredGroup *group, _UNUSED RmShredTa * originals */ for(GList *i_orig = group->held_files->head; i_orig; i_orig = i_orig->next) { RmFile *orig = i_orig->data; - rm_log_info_line("orig: %s", orig->folder->basename); if(!orig->is_original) { /* have gone past last original */ break; From af47630196975bb96f21507a2a1abb957ba29ec6 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Tue, 21 Nov 2017 00:06:24 +0100 Subject: [PATCH 142/180] Prohibit -kM and -Km according to #244 --- docs/rmlint.1.rst | 3 +++ lib/cmdline.c | 18 ++++++++++++++++++ 2 files changed, 21 insertions(+) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 28093494..287f85c3 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -401,6 +401,9 @@ Original Detection Options Only look for duplicates of which at least one is in one of the tagged paths. (Paths that were named after **//**). + Note that the combinations of ``-kM`` and ``-Km`` are prohibited by ``rmlint``. + See https://github.com/sahib/rmlint/issues/244 for more information. 
+ :``-S --rank-by=criteria`` (**default\:** *pOma*): Sort the files in a group of duplicates into originals and duplicates by diff --git a/lib/cmdline.c b/lib/cmdline.c index 071ac588..a09a157f 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1489,6 +1489,24 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { } else if(cfg->follow_symlinks && cfg->see_symlinks) { rm_log_error("Program error: Cannot do both follow_symlinks and see_symlinks"); rm_assert_gentle_not_reached(); + } else if(cfg->keep_all_tagged && cfg->must_match_untagged) { + error = \ + g_error_new( + RM_ERROR_QUARK, 0, + _( + "-k and -M should not be specified at the same time " \ + "(see also: https://github.com/sahib/rmlint/issues/244)" + ) + ); + } else if(cfg->keep_all_untagged && cfg->must_match_tagged) { + error = \ + g_error_new( + RM_ERROR_QUARK, 0, + _( + "-K and -m should not be specified at the same time " \ + "(see also: https://github.com/sahib/rmlint/issues/244)" + ) + ); } cleanup: From 6f416f576ab4dcc4f140850a2f73b039931e3ae4 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Tue, 21 Nov 2017 00:30:31 +0100 Subject: [PATCH 143/180] Merge --hardlinked and --keep-hardlinked docs --- docs/rmlint.1.rst | 32 ++++++++++++++------------------ 1 file changed, 14 insertions(+), 18 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 287f85c3..56b9f6bd 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -297,24 +297,20 @@ Traversal Options equivalent to a directory listing. A depth of 2 would also consider also all children directories and so on. -:``-l --hardlinked`` (**default**) / ``-L --no-hardlinked``: - - Whether to report hardlinked files as duplicates. With ``--no-hardlinked``, - if a set of hardlinked files is encountered, all except one are ignored. - The "highest ranked" (see ``-S``) of the set is the one that will be used - for further processing. 
- - Note that hardlinked files will not appear as space waste in the - statistics, since they do not allocate any extra space if not all of them are removed. - - Also look into ``--keep-hardlinked`` below. - -:``--keep-hardlinked`` (**default**: No.): - - If set, rmlint will not delete any files that are linked to any original in their respective group. - Such files will be displayed like original (i.e. for the default output with a "ls" in front). - The reasoning here is to maximize the number of kept files, while maximizing the number of freed space: - Removing hardlinks to originals will not allocate any free space. +:``-l --hardlinked`` (**default**) / ``--keep-hardlinked`` / ``-L --no-hardlinked``: + + Hardlinked files are treated as duplicates by default (``--hardlinked``). If + ``--keep-hardlinked`` is given, ``rmlint`` will not delete any files that are + hardlinked to an original in their respective group. Such files will be + displayed like originals, i.e. for the default output with an "ls" in front. + The reasoning here is to maximize the number of kept files while maximizing + the amount of freed space: removing hardlinks to originals will not free + any space. + + If ``--no-hardlinked`` is given, only one file (of a set of hardlinked files) + is considered and all the others are ignored; that is, they are not + deleted and not even shown in the output. The "highest ranked" of the + set is the one that is considered.
:``-f --followlinks`` / ``-F --no-followlinks`` / ``-@ --see-symlinks`` (**default**): From b5f04e7956b20da2820175b0d38f6d59ae245568 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Tue, 21 Nov 2017 01:23:34 +0100 Subject: [PATCH 144/180] py: bugfix: fix double percentage signs --- lib/SConscript | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/lib/SConscript b/lib/SConscript index 9b920628..4f3ac7d1 100644 --- a/lib/SConscript +++ b/lib/SConscript @@ -47,13 +47,16 @@ def build_config_template(target, source, env): handle.write(config_h) -def prepare_c_source(text): +def prepare_c_source(text, escape_percentages=True): # Prepare the Python source to be compatible with C-strings text = text.replace('"', '\\"') text = text.replace('\\n', '\\\\n') - text = text.replace('%', '%%') - text = text.replace('%%s', '%s') text = '\\n"\n"'.join(text.splitlines()) + + if escape_percentages: + text = text.replace('%', '%%') + text = text.replace('%%s', '%s') + return text @@ -61,7 +64,7 @@ def build_python_formatter(target, source, env): text = source[0].get_text_contents() with codecs.open('lib/formats/py.py', 'r') as handle: - py_source = prepare_c_source(handle.read()) + py_source = prepare_c_source(handle.read(), False) with codecs.open(target[0].get_abspath(), 'w') as handle: handle.write(text.replace('<>', py_source)) From 17cfa75d7cf546566567c0c253e548505df15d77 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Tue, 21 Nov 2017 01:30:02 +0100 Subject: [PATCH 145/180] py: fix a few pylint issues in the script (mostly by silencing them) --- lib/formats/py.c.in | 2 +- lib/formats/py.py | 28 ++++++++++++++++++++-------- 2 files changed, 21 insertions(+), 9 deletions(-) diff --git a/lib/formats/py.c.in b/lib/formats/py.c.in index 3decb309..5f30a741 100644 --- a/lib/formats/py.c.in +++ b/lib/formats/py.c.in @@ -52,7 +52,7 @@ typedef struct RmFmtHandlerPy { static void rm_fmt_head(RmSession *session, RmFmtHandler *parent, FILE *out) { RmFmtHandlerPy 
*self = (RmFmtHandlerPy *)parent; - if(fwrite(PY_SOURCE, 1, sizeof(PY_SOURCE), out) <= 0) { + if(fwrite(PY_SOURCE, 1, sizeof(PY_SOURCE) - 1, out) <= 0) { rm_log_perror("Failed to write python script"); return; } diff --git a/lib/formats/py.py b/lib/formats/py.py index ebf2c85b..ba9787dc 100644 --- a/lib/formats/py.py +++ b/lib/formats/py.py @@ -26,6 +26,10 @@ # The 200 lines source presented below is meant to be clean and hackable. # It is intended to be used for corner cases where the built-in sh formatter # is not enough or as an alternative to it. By default it works the same. +# +# Disable a few pylint warnings, in case someone integrates into scripts: +# pylint: disable=unused-argument,missing-docstring,invalid-name +# pylint: disable=redefined-outer-name,unused-variable # Python2 compat: from __future__ import print_function @@ -146,7 +150,8 @@ def exec_operation(item, original=None, args=None): item['path'], original=original, item=item, args=args) except OSError as err: print('{c[red]}# {err}{c[reset]}'.format( - item=item, err=err, c=COLORS), file=sys.stderr) + err=err, c=COLORS + ), file=sys.stderr) MESSAGES = { @@ -251,21 +256,28 @@ def main(args, data): print(err, file=sys.stderr) sys.exit(-1) except ValueError as err: # File is not valid JSON - print('{}: {}'.format(err, doc), file=sys.stderr) + print('{}: {}'.format(err, json_file), file=sys.stderr) sys.exit(-1) try: if args.dry_run: - print('{c[green]}#{c[reset]} ' - 'This is a dry run. Nothing will be modified.'.format( - c=COLORS)) + print( + '{c[green]}#{c[reset]} ' + 'This is a dry run. Nothing will be modified.'.format( + c=COLORS + ) + ) for json_doc in json_docs: main(args, json_doc) if args.dry_run: - print('{c[green]}#{c[reset]} ' - 'This was a dry run. Nothing was modified.'.format( - c=COLORS)) + print( + '{c[green]}#{c[reset]} ' + 'This was a dry run. 
Nothing was modified.'.format( + c=COLORS + ) + ) except KeyboardInterrupt: print('\ncanceled.') + From 128233c1fb8b88cbb9d24abf31eb3d02e2657dd7 Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Tue, 21 Nov 2017 01:40:03 +0100 Subject: [PATCH 146/180] py: add very basic test coverage for different python versions --- tests/test_formatters/test_py.py | 48 +++++++++++++++++++++++++++----- 1 file changed, 41 insertions(+), 7 deletions(-) diff --git a/tests/test_formatters/test_py.py b/tests/test_formatters/test_py.py index 541c3a85..758f49e6 100644 --- a/tests/test_formatters/test_py.py +++ b/tests/test_formatters/test_py.py @@ -5,17 +5,37 @@ import subprocess +from parameterized import parameterized + +def _check_interpreter(interpreter): + try: + subprocess.call([interpreter, "-c", "1 + 1"]) + return True + except (subprocess.CalledProcessError, FileNotFoundError): + return False + + +@parameterized(["python2", "python3"]) @with_setup(usual_setup_func, usual_teardown_func) -def test_paranoia(): +def test_paranoia(interpreter): + if not _check_interpreter(interpreter): + print( + "Interpreter {} does not seem to be working, skipping test".format( + interpreter + ) + ) + return + create_file('xxx', 'a') create_file('xxx', 'b') create_file('xxx', 'c') create_file('xxx', 'd') create_link('a', 'hardlink_a', symlink=False) - head, *data, footer = run_rmlint('-S a -o py:{t}/rmlint.py'.format(t=TESTDIR_NAME)) - # subprocess.call('cat ' + os.path.join(TESTDIR_NAME, 'rmlint.py'), shell=True) + head, *data, footer = run_rmlint( + '-S a -o py:{t}/rmlint.py'.format(t=TESTDIR_NAME) + ) assert footer['duplicate_sets'] == 1 assert footer['total_lint_size'] == 9 @@ -28,11 +48,18 @@ def test_paranoia(): with open(os.path.join(TESTDIR_NAME, 'c'), 'w') as handle: handle.write('xxxx') - text = subprocess.check_output([os.path.join(TESTDIR_NAME, 'rmlint.py'), '-d', '-p']) + text = subprocess.check_output([ + interpreter, + os.path.join(TESTDIR_NAME, 'rmlint.py'), + '-d', + '-p' + ]) text = 
text.decode('utf-8') # subprocess.call('ls -l ' + TESTDIR_NAME, shell=True) - head, *data, footer = run_rmlint('-S a -o py:{t}/rmlint.py'.format(t=TESTDIR_NAME)) + head, *data, footer = run_rmlint( + '-S a -o py:{t}/rmlint.py'.format(t=TESTDIR_NAME) + ) assert footer['duplicate_sets'] == 1 assert footer['total_lint_size'] == 0 @@ -43,8 +70,15 @@ def test_paranoia(): assert 'Size differs' in text assert 'Same inode' in text - text = subprocess.check_output([os.path.join(TESTDIR_NAME, 'rmlint.py'), '-d', '-p']) - head, *data, footer = run_rmlint('-S a -o py:{t}/rmlint.py'.format(t=TESTDIR_NAME)) + text = subprocess.check_output([ + interpreter, + os.path.join(TESTDIR_NAME, 'rmlint.py'), + '-d', + '-p' + ]) + head, *data, footer = run_rmlint( + '-S a -o py:{t}/rmlint.py'.format(t=TESTDIR_NAME) + ) # Nothing should change. assert footer['duplicate_sets'] == 1 From f64132567a231aec4d8709d89933e6c5cb7662da Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 21 Nov 2017 17:29:01 +1000 Subject: [PATCH 147/180] docs: reword checksum descriptions & strength claims --- docs/rmlint.1.rst | 38 +++++++++++++++++++++++++++----------- 1 file changed, 27 insertions(+), 11 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 9bc61b6f..cad38dd7 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -153,20 +153,36 @@ General Options Choose the algorithm to use for finding duplicate files. The algorithm can be either **paranoid** (byte-by-byte file comparison) or use one of several file hash - algorithms to identify duplicates. The following cryptographic algorithms are available: + algorithms to identify duplicates. The following hash families are available (in + approximate descending order of cryptographic strength): - **highway128**, **highway256**, - **sha1** (160 bit), **sha256**, **sha512**, - **sha3-256**, **sha3-384**, **sha3-512**, - **blake2s/blake2sp** (256), **blake2b/blake2bp** (512). 
+ **sha3**, **blake**, - For improved run time / reduced CPU load, the following non-cryptographic - hashes are also available: - **murmur** (128 bit), **metro** (128), **metro256**, - **metrocrc** (128), **metrocrc256** (if cpu supports crc) + **sha**, - There are also some 64-bit hashes; we strongly advise against using these: - * **highway64** (cryptographic), **xxhash**. + **highway**, **md** + + **metro**, **murmur**, **xxhash** + + The weaker hash functions still offer excellent distribution properties, but are potentially + more vulnerable to *malicious* crafting of duplicate files. + + The full list of hash functions (in decreasing order of checksum length) is: + + 512-bit: **blake2b**, **blake2bp**, **sha3-512**, **sha512** + + 384-bit: **sha3-384** + + 256-bit: **blake2s**, **blake2sp**, **sha3-256**, **sha256**, **highway256**, **metro256**, **metrocrc256** + + 160-bit: **sha1** + + 128-bit: **md5**, **murmur**, **metro**, **metrocrc** + + 64-bit: **highway64**, **xxhash**. + + The use of 64-bit hash length for detecting duplicate files is not recommended, due to the + probability of a random hash collision. :``-p --paranoid`` / ``-P --less-paranoid`` (**default**): From 10c88d7e053d14f27fae571389a2197cbca6ce20 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 21 Nov 2017 17:29:55 +1000 Subject: [PATCH 148/180] checksum: goodbye multihash --- CHANGELOG.md | 1 + lib/checksum.c | 10 ---------- lib/checksum.h | 11 ----------- lib/hash-utility.c | 17 +++-------------- po/de.po | 4 ---- po/es.po | 4 ---- po/fr.po | 4 ---- po/rmlint.pot | 4 ---- 8 files changed, 4 insertions(+), 51 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index ffd4a42a..1db7b349 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -27,6 +27,7 @@ The format follows [keepachangelog.com]. Please stick to it.
### Removed * Checksum types bastard, spooky, city & farmhash +* Multihash output option ### Fixed diff --git a/lib/checksum.c b/lib/checksum.c index 6a6ffbaa..166cfcf4 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -822,16 +822,6 @@ const char *rm_digest_type_to_string(RmDigestType type) { return interface->name; } -/* TODO: update or remove? */ -int rm_digest_type_to_multihash_id(RmDigestType type) { - static int ids[] = {[RM_DIGEST_UNKNOWN] = -1, [RM_DIGEST_MURMUR] = 17, - [RM_DIGEST_MD5] = 1, [RM_DIGEST_SHA1] = 2, - [RM_DIGEST_SHA256] = 4, [RM_DIGEST_SHA512] = 6, - [RM_DIGEST_CUMULATIVE] = 13, [RM_DIGEST_PARANOID] = 14}; - - return ids[MIN(type, sizeof(ids) / sizeof(ids[0]))]; -} - RmDigest *rm_digest_new(RmDigestType type, RmOff seed) { const RmDigestInterface *interface = rm_digest_get_interface(type); diff --git a/lib/checksum.h b/lib/checksum.h index eb59afa5..aeb79a62 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -148,17 +148,6 @@ void rm_buffer_free(RmBuffer *buf); */ RmDigestType rm_string_to_digest_type(const char *string); -/* Hash type to MultiHash ID. - * Idea of Multihash to encode hash algorithm and size into - * the first two bytes of the hash. - * - * There is no standard yet, I used this one and randomly assigned - * the other numbers: - * - * https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-18 - */ -int rm_digest_type_to_multihash_id(RmDigestType type); - /** * @brief Convert a RmDigestType to a human readable string. 
* diff --git a/lib/hash-utility.c b/lib/hash-utility.c index 8e0d9e4b..766aebda 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -43,7 +43,6 @@ typedef struct RmHasherSession { /* Options */ RmDigestType digest_type; gboolean print_in_order; - gboolean print_multihash; } RmHasherSession; static gboolean rm_hasher_parse_type(_UNUSED const char *option_name, @@ -59,7 +58,7 @@ static gboolean rm_hasher_parse_type(_UNUSED const char *option_name, return TRUE; } -static void rm_hasher_print(RmDigest *digest, char *path, bool print_multihash) { +static void rm_hasher_print(RmDigest *digest, char *path) { gsize size = rm_digest_get_bytes(digest) * 2 + 1; char checksum_str[size]; @@ -68,11 +67,6 @@ static void rm_hasher_print(RmDigest *digest, char *path, bool print_multihash) rm_digest_hexstring(digest, checksum_str); - if(print_multihash) { - g_print("%02x%02x@", rm_digest_type_to_multihash_id(digest->type), - rm_digest_get_bytes(digest)); - } - g_print("%s %s\n", checksum_str, path); } @@ -95,8 +89,7 @@ static int rm_hasher_callback(_UNUSED RmHasher *hasher, if(session->read_succesful[session->path_index]) { rm_hasher_print( session->completed_digests_buffer[session->path_index], - session->paths[session->path_index], - session->print_multihash); + session->paths[session->path_index]); } rm_digest_free( session->completed_digests_buffer[session->path_index]); @@ -106,7 +99,7 @@ static int rm_hasher_callback(_UNUSED RmHasher *hasher, } } else if(digest) { if(session->read_succesful[session->path_index]) { - rm_hasher_print(digest, session->paths[index], session->print_multihash); + rm_hasher_print(digest, session->paths[index]); } } } @@ -123,9 +116,6 @@ int rm_hasher_main(int argc, const char **argv) { /* Print hashes in the same order as files in command line args */ tag.print_in_order = TRUE; - /* Print a hash with builtin identifier */ - tag.print_multihash = FALSE; - /* Digest type */ tag.digest_type = RM_DEFAULT_DIGEST; gint threads = 8; @@ -139,7 +129,6 
@@ int rm_hasher_main(int argc, const char **argv) { const GOptionEntry entries[] = { {"algorithm" , 'a' , 0 , G_OPTION_ARG_CALLBACK , (GOptionArgFunc)rm_hasher_parse_type , _("Digest type [BLAKE2B]") , "[TYPE]"} , {"num-threads" , 't' , 0 , G_OPTION_ARG_INT , &threads , _("Number of hashing threads [8]") , "N"} , - {"multihash" , 'm' , 0 , G_OPTION_ARG_NONE , &tag.print_multihash , _("Print hash as self identifying multihash") , NULL} , {"buffer-mbytes" , 'b' , 0 , G_OPTION_ARG_INT64 , &buffer_mbytes , _("Megabytes read buffer [256 MB]") , "MB"} , {"increment" , 'x' , G_OPTION_FLAG_HIDDEN , G_OPTION_ARG_INT64 , &increment , _("bytes to hash at a time [4096]") , "MB"} , {"ignore-order" , 'i' , G_OPTION_FLAG_REVERSE , G_OPTION_ARG_NONE , &tag.print_in_order , _("Print hashes in order completed, not in order entered (reduces memory usage)") , NULL} , diff --git a/po/de.po b/po/de.po index cc856cc2..b6dbe955 100644 --- a/po/de.po +++ b/po/de.po @@ -384,10 +384,6 @@ msgstr "Prüfsummentyp [BLAKE2B]" msgid "Number of hashing threads [8]" msgstr "Anzahl der Hashing Threads [8]" -#: lib/hash-utility.c:141 -msgid "Print hash as self identifying multihash" -msgstr "Zeige Hash als selbstidentifizierenden Multihash an." 
- #: lib/hash-utility.c:142 msgid "Megabytes read buffer [256 MB]" msgstr "Größe des Lesepuffers [256 MB]" diff --git a/po/es.po b/po/es.po index 4bdad2ae..ef5b7b66 100644 --- a/po/es.po +++ b/po/es.po @@ -380,10 +380,6 @@ msgstr "Asimilación tipo [BLAKE2B]" msgid "Number of hashing threads [8]" msgstr "Número de hilos rastreados [8]" -#: lib/hash-utility.c:141 -msgid "Print hash as self identifying multihash" -msgstr "Muestra el rastreo como un multirastreo auto identificado" - #: lib/hash-utility.c:142 msgid "Megabytes read buffer [256 MB]" msgstr "Almacenamiento temporal de lectura de Megabytes [256]" diff --git a/po/fr.po b/po/fr.po index 14cd62f2..96dbfa3c 100644 --- a/po/fr.po +++ b/po/fr.po @@ -378,10 +378,6 @@ msgstr "Type digest [BLAKE2B]" msgid "Number of hashing threads [8]" msgstr "Nombre de tâches de hachage [8]" -#: lib/hash-utility.c:141 -msgid "Print hash as self identifying multihash" -msgstr "Affichage du hachage identifié comme une fonction multiple de hachage" - #: lib/hash-utility.c:142 msgid "Megabytes read buffer [256 MB]" msgstr "Lecture du buffer en mégaoctets [256 MB]" diff --git a/po/rmlint.pot b/po/rmlint.pot index 06e8854f..6ca7f2b4 100644 --- a/po/rmlint.pot +++ b/po/rmlint.pot @@ -375,10 +375,6 @@ msgstr "" msgid "Number of hashing threads [8]" msgstr "" -#: lib/hash-utility.c:141 -msgid "Print hash as self identifying multihash" -msgstr "" - #: lib/hash-utility.c:142 msgid "Megabytes read buffer [256 MB]" msgstr "" From c47793444d85baa8376a1742c24016cbab3fa2aa Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 21 Nov 2017 17:30:33 +1000 Subject: [PATCH 149/180] changelog: remove changes of no relevance to average user --- CHANGELOG.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 1db7b349..ed2c82fe 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,12 +4,11 @@ All notable changes to this project will be documented in this file. The format follows [keepachangelog.com]. 
Please stick to it. -## [2.6.2 Toothless Taipan] -- unreleased +## [2.7.0 Toothless Taipan] -- unreleased ### Added * New checksum types metro and highway -* Additional unit tests * New option --keep-hardlinked * --dedupe option can deduplicate twins on any reflick-capable filesystem * --dedupe-readonly option can dedupe files on read-only btrfs snapshots @@ -17,7 +16,6 @@ The format follows [keepachangelog.com]. Please stick to it. ### Changed * Checksum types for -P... options (see https://github.com/sahib/rmlint/issues/261) -* Under-the-hood changes for checksum.c ### Deprecated From 8800866cae7acf7984dec0ee367c591c905cb42d Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 21 Nov 2017 18:04:48 +1000 Subject: [PATCH 150/180] tests: parameterize test_hash --- tests/test_mains/test_hash.py | 49 ++++++++++++----------------------- 1 file changed, 16 insertions(+), 33 deletions(-) diff --git a/tests/test_mains/test_hash.py b/tests/test_mains/test_hash.py index 63ae306a..391d2acc 100644 --- a/tests/test_mains/test_hash.py +++ b/tests/test_mains/test_hash.py @@ -4,10 +4,11 @@ from nose import with_setup from tests.utils import * from nose.plugins.attrib import attr +from parameterized import parameterized INCREMENTS = [4096, 1024, 1, 20000] -def streaming_compliance_check(*patterns): +def streaming_compliance_check(patterns): # a valid hash function streaming function should satisfy hash('a', 'b', 'c') == hash('abc') a = create_file('1' * 10000, 'a') @@ -28,36 +29,18 @@ def streaming_compliance_check(*patterns): assert False, "{} fails streaming test with increment {}".format(algo, increment) break -@with_setup(usual_setup_func, usual_teardown_func) -def test_murmur(): - streaming_compliance_check('murmur') - -@with_setup(usual_setup_func, usual_teardown_func) -def test_metro(): - streaming_compliance_check('metro') - -@with_setup(usual_setup_func, usual_teardown_func) -def test_glib(): - streaming_compliance_check('md5', 'sha1', 'sha256', 'sha512') - 
-@with_setup(usual_setup_func, usual_teardown_func) -def test_sha3(): - streaming_compliance_check('sha3') - -@with_setup(usual_setup_func, usual_teardown_func) -def test_blake(): - streaming_compliance_check('blake') - -@with_setup(usual_setup_func, usual_teardown_func) -def test_xx(): - streaming_compliance_check('xxhash') - -@with_setup(usual_setup_func, usual_teardown_func) -def test_highway(): - streaming_compliance_check('highway') - -@attr('known_issue') -@with_setup(usual_setup_func, usual_teardown_func) -def test_cumulative(): - streaming_compliance_check('cumulative') +@parameterized([ + 'murmur', + 'metro', + ['glib:', 'md5', 'sha1', 'sha256', 'sha512'], + 'sha3', + 'blake', + 'xxhash', + 'highway' + ]) +def test_hash_function(*pat): + if(len(pat)==1): + streaming_compliance_check(pat) + else: + streaming_compliance_check(pat[1:]) From 6b4776926e69229430848d974c38d2011ec98260 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Tue, 21 Nov 2017 22:27:07 +1000 Subject: [PATCH 151/180] checksum: don't use sse features if not available at runtime --- lib/cfg.h | 3 ++ lib/checksum.c | 46 +++++++++++++++++++++++++---- lib/checksum.h | 6 ++++ lib/checksums/metrohash.h | 10 ++++--- lib/checksums/metrohash128.c | 56 +++++++++++++++++++++++++----------- lib/cmdline.c | 3 ++ lib/hash-utility.c | 4 +++ 7 files changed, 103 insertions(+), 25 deletions(-) diff --git a/lib/cfg.h b/lib/cfg.h index fc5713c9..d0b971b4 100644 --- a/lib/cfg.h +++ b/lib/cfg.h @@ -159,6 +159,9 @@ typedef struct RmCfg { /* for --is-reflink option */ bool is_reflink; + /* don't use sse accelerations */ + bool no_sse; + } RmCfg; /** diff --git a/lib/checksum.c b/lib/checksum.c index 166cfcf4..c16dbc2c 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -51,6 +51,8 @@ #define _RM_CHECKSUM_DEBUG 0 +static int RM_DIGEST_USE_SSE = 0; + ////////////////////////////////// // BUFFER IMPLEMENTATION // ////////////////////////////////// @@ -154,11 +156,27 @@ static const RmDigestInterface murmur_interface = 
{ // metro // /////////////////////////// +static Metro128State *rm_digest_metro_new(void) { + return metrohash128_1_new(FALSE); +} + +static Metro128State *rm_digest_metrocrc_new(void) { + return metrohash128_1_new(g_atomic_int_get(&RM_DIGEST_USE_SSE)); +} + +static Metro256State *rm_digest_metro256_new(void) { + return metrohash256_new(FALSE); +} + +static Metro256State *rm_digest_metrocrc256_new(void) { + return metrohash256_new(g_atomic_int_get(&RM_DIGEST_USE_SSE)); +} + static const RmDigestInterface metro_interface = { .name = "metro", .bits = 128, .len = NULL, - .new = (RmDigestNewFunc)metrohash128_1_new, + .new = (RmDigestNewFunc)rm_digest_metro_new, .free = (RmDigestFreeFunc)metrohash128_free, .update = (RmDigestUpdateFunc)metrohash128_1_update, .copy = (RmDigestCopyFunc)metrohash128_copy, @@ -168,7 +186,7 @@ static const RmDigestInterface metro256_interface = { .name = "metro256", .bits = 256, .len = NULL, - .new = (RmDigestNewFunc)metrohash256_new, + .new = (RmDigestNewFunc)rm_digest_metro256_new, .free = (RmDigestFreeFunc)metrohash256_free, .update = (RmDigestUpdateFunc)metrohash256_update, .copy = (RmDigestCopyFunc)metrohash256_copy, @@ -182,9 +200,9 @@ static const RmDigestInterface metrocrc_interface = { .name = "metrocrc", .bits = 128, .len = NULL, - .new = (RmDigestNewFunc)metrohash128_1_new, /* <-same */ + .new = (RmDigestNewFunc)rm_digest_metrocrc_new, .free = (RmDigestFreeFunc)metrohash128_free, /* <-same */ - .update = (RmDigestUpdateFunc)metrohash128crc_update, + .update = (RmDigestUpdateFunc)metrohash128crc_1_update, .copy = (RmDigestCopyFunc)metrohash128_copy, /* <-same */ .steal = (RmDigestStealFunc)metrohash128crc_1_steal}; @@ -192,7 +210,7 @@ static const RmDigestInterface metrocrc256_interface = { .name = "metrocrc256", .bits = 256, .len = NULL, - .new = (RmDigestNewFunc)metrohash256_new, /* <-same */ + .new = (RmDigestNewFunc)rm_digest_metrocrc256_new, .free = (RmDigestFreeFunc)metrohash256_free, /* <-same */ .update = 
(RmDigestUpdateFunc)metrohash256crc_update, .copy = (RmDigestCopyFunc)metrohash256_copy, /* <-same */ @@ -1002,3 +1020,21 @@ guint8 *rm_digest_sum(RmDigestType algo, const guint8 *data, gsize len, gsize *o rm_digest_free(digest); return buf; } + +void rm_digest_enable_sse(gboolean use_sse) { +#if HAVE_SSE_4_2 + if (use_sse && __builtin_cpu_supports("sse4.2")) { + g_atomic_int_set(&RM_DIGEST_USE_SSE, TRUE); + } else { + g_atomic_int_set(&RM_DIGEST_USE_SSE, FALSE); + if (use_sse) { + rm_log_warning_line("Can't enable sse4.2"); + } + } +#else + if (use_sse) { + rm_log_warning_line("Can't enable sse4.2"); + g_atomic_int_set(&RM_DIGEST_USE_SSE, FALSE); + } +#endif +} diff --git a/lib/checksum.h b/lib/checksum.h index aeb79a62..278e824c 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -272,4 +272,10 @@ void rm_digest_release_buffers(RmDigest *digest); */ void rm_digest_send_match_candidate(RmDigest *target, RmDigest *candidate); +/** + * @brief Enable or disable SSE optimisations. + * @note will also check __builtin_cpu_supports("sse4.2") before enabling + */ +void rm_digest_enable_sse(gboolean use_sse); + #endif /* end of include guard */ diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index bec282be..4259a5d7 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -28,6 +28,7 @@ #include #include +#include #include "../config.h" typedef struct _Metro64_state Metro64State; @@ -39,9 +40,9 @@ void metrohash64_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out void metrohash64_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); // MetroHash 128-bit hash functions -Metro128State *metrohash128_1_new(void); -Metro128State *metrohash128_2_new(void); -Metro256State *metrohash256_new(void); +Metro128State *metrohash128_1_new(bool use_sse); +Metro128State *metrohash128_2_new(bool use_sse); +Metro256State *metrohash256_new(bool use_sse); Metro128State *metrohash128_copy(Metro128State *state); Metro256State 
*metrohash256_copy(Metro256State *state); @@ -66,7 +67,8 @@ void metrohash256_steal(Metro256State *state, uint8_t *out); void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); -void metrohash128crc_update(Metro128State *state, const uint8_t *key, uint64_t len); +void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, uint64_t len); +void metrohash128crc_2_update(Metro128State *state, const uint8_t *key, uint64_t len); void metrohash128crc_1_steal(Metro128State *state, uint8_t *out); void metrohash128crc_2_steal(Metro128State *state, uint8_t *out); diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index e7af6b0d..78e5d208 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -30,6 +30,7 @@ #if HAVE_SSE_4_2 struct _Metro128_state { + bool use_sse; uint64_t v[4]; uint8_t xs[32]; /* unhashed data from last increment */ uint8_t xs_len; @@ -52,9 +53,10 @@ static void metrohash128_1_init(Metro128State *state) { state->v[3] = - k1_1 * k3_1; } -Metro128State *metrohash128_1_new(void) { +Metro128State *metrohash128_1_new(bool use_sse) { Metro128State *state = g_slice_new0(Metro128State); metrohash128_1_init(state); + state->use_sse = use_sse; return state; } @@ -70,9 +72,10 @@ static void metrohash128_2_init(Metro128State *state) { state->v[3] = -k1_2 * k3_2; } -Metro128State *metrohash128_2_new() { +Metro128State *metrohash128_2_new(bool use_sse) { Metro128State *state = g_slice_new0(Metro128State); metrohash128_2_init(state); + state->use_sse = use_sse; return state; } @@ -91,7 +94,11 @@ Metro128State *metrohash128_copy(Metro128State *state) { xs_len += bytes; \ data += bytes; -void metrohash128crc_update(Metro128State *state, const uint8_t *key, uint64_t len) { +void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, uint64_t len) { + if(!state->use_sse) { + 
metrohash128_1_update(state, key, len); + return; + } uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -133,7 +140,19 @@ void metrohash128crc_update(Metro128State *state, const uint8_t *key, uint64_t l } } +void metrohash128crc_2_update(Metro128State *state, const uint8_t *key, uint64_t len) { + if(!state->use_sse) { + metrohash128_2_update(state, key, len); + } else { + metrohash128crc_1_update(state, key, len); + } +} + void metrohash128crc_1_steal(Metro128State *state, uint8_t *out) { + if(!state->use_sse) { + metrohash128_1_steal(state, out); + return; + } uint64_t v[4]; for(int i = 0; i < 4; i++) { v[i] = state->v[i]; @@ -191,6 +210,11 @@ void metrohash128crc_1_steal(Metro128State *state, uint8_t *out) { } void metrohash128crc_2_steal(Metro128State *state, uint8_t *out) { + if(!state->use_sse) { + metrohash128_2_steal(state, out); + return; + } + uint64_t v[4]; for(int i = 0; i < 4; i++) { v[i] = state->v[i]; @@ -248,24 +272,24 @@ void metrohash128crc_2_steal(Metro128State *state, uint8_t *out) { } void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { - Metro128State *state = metrohash128_1_new(); - metrohash128crc_update(state, (const uint8_t*)&seed, sizeof(seed)); - metrohash128crc_update(state, key, len); + Metro128State *state = metrohash128_1_new(TRUE); + metrohash128crc_1_update(state, (const uint8_t*)&seed, sizeof(seed)); + metrohash128crc_1_update(state, key, len); metrohash128crc_1_steal(state, out); metrohash128_free(state); } void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { - Metro128State *state = metrohash128_2_new(); - metrohash128crc_update(state, (const uint8_t*)&seed, sizeof(seed)); - metrohash128crc_update(state, key, len); + Metro128State *state = metrohash128_2_new(TRUE); + metrohash128crc_2_update(state, (const uint8_t*)&seed, sizeof(seed)); + metrohash128crc_2_update(state, key, len); metrohash128crc_2_steal(state, out); metrohash128_free(state); 
} void metrohash256crc_update(Metro256State *state, const uint8_t *key, uint64_t len) { - metrohash128crc_update(&state->state1, key, len); - metrohash128crc_update(&state->state2, key, len); + metrohash128crc_1_update(&state->state1, key, len); + metrohash128crc_2_update(&state->state2, key, len); } void metrohash256crc_steal(Metro256State *state, uint8_t *out) { @@ -351,8 +375,6 @@ void metrohash128_2_update(Metro128State *state, const uint8_t *key, uint64_t le d4 = read_u64(data + 24); data += 32; } - void metrohash256_update(Metro256State * state, const uint8_t *key, uint64_t len); - void metrohash256_steal(Metro256State * state, uint8_t * out); state->v[0] += d1 * k0_2; state->v[0] = rotate_right(state->v[0], 29) + state->v[2]; @@ -495,7 +517,7 @@ void metrohash128_2_steal(Metro128State *state, uint8_t *out) { } void metrohash128_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { - Metro128State *state = metrohash128_1_new(); + Metro128State *state = metrohash128_1_new(FALSE); metrohash128_1_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128_1_update(state, key, len); metrohash128_1_steal(state, out); @@ -503,17 +525,19 @@ void metrohash128_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *ou } void metrohash128_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { - Metro128State *state = metrohash128_2_new(); + Metro128State *state = metrohash128_2_new(FALSE); metrohash128_2_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128_2_update(state, key, len); metrohash128_2_steal(state, out); metrohash128_free(state); } -Metro256State *metrohash256_new(void) { +Metro256State *metrohash256_new(bool use_sse) { Metro256State *state = g_slice_new0(Metro256State); metrohash128_1_init(&state->state1); + state->state1.use_sse = use_sse; metrohash128_2_init(&state->state2); + state->state2.use_sse = use_sse; return state; } diff --git a/lib/cmdline.c b/lib/cmdline.c index bab5655f..60d902ab 100644 --- 
a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1335,6 +1335,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"fake-abort" , 0 , HIDDEN , G_OPTION_ARG_NONE , &cfg->fake_abort , "Simulate interrupt after 10% shredder progress" , NULL} , {"buffered-read" , 0 , HIDDEN , G_OPTION_ARG_NONE , &cfg->use_buffered_read , "Default to buffered reading calls (fread) during reading." , NULL} , {"shred-never-wait" , 0 , HIDDEN , G_OPTION_ARG_NONE , &cfg->shred_never_wait , "Never waits for file increment to finish hashing" , NULL} , + {"no-sse" , 0 , HIDDEN , G_OPTION_ARG_NONE , &cfg->no_sse , "Don't use SSE accelerations" , NULL} , {"no-mount-table" , 0 , DISABLE | HIDDEN , G_OPTION_ARG_NONE , &cfg->list_mounts , "Do not try to optimize by listing mounted volumes" , NULL} , {NULL , 0 , HIDDEN , 0 , NULL , NULL , NULL} }; @@ -1483,6 +1484,8 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { rm_assert_gentle_not_reached(); } + rm_digest_enable_sse(!cfg->no_sse && __builtin_cpu_supports("sse4.2")); + cleanup: if(error != NULL) { rm_cmd_on_error(NULL, NULL, session, &error); diff --git a/lib/hash-utility.c b/lib/hash-utility.c index 766aebda..d5276457 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -194,6 +194,10 @@ int rm_hasher_main(int argc, const char **argv) { ////////// Implementation ////// +#if HAVE_SSE_4_2 + rm_digest_enable_sse(TRUE); +#endif + int buf_size = (g_strv_length(tag.paths) + 1) * sizeof(RmDigest *); tag.read_succesful = g_slice_alloc0(buf_size); From 0201b87c71230cf85b1b8a567fca92acfd441a58 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 22 Nov 2017 10:07:26 +1000 Subject: [PATCH 152/180] scons: check for appropriate function definition in header, rather than cpu capability --- SConstruct | 24 +++++++++++------------- lib/SConscript | 2 +- lib/checksum.c | 21 ++++++++++----------- lib/checksum.h | 2 +- lib/checksums/metrohash.h | 3 +-- lib/checksums/metrohash128.c | 3 +-- lib/config.h.in | 2 +- 
lib/hash-utility.c | 4 ++-- 8 files changed, 28 insertions(+), 33 deletions(-) diff --git a/SConstruct b/SConstruct index 0fe90008..80fd443a 100755 --- a/SConstruct +++ b/SConstruct @@ -363,18 +363,16 @@ def check_cygwin(context): context.Result(rc) return rc -def check_sse_4_2(context): - rc = 0 +def check_mm_crc32_u64(context): - context.Message('Checking for SSE 4.2 support...') - try: - if 'sse4_2' in open('/proc/cpuinfo').read(): - rc = 1 - except subprocess.CalledProcessError: - # Oops. - context.Message("read cpuinfo failed") + rc = 0 if tests.CheckDeclaration( + context, + symbol='_mm_crc32_u64', + includes='#include \n' + ) else 1 - conf.env['HAVE_SSE_4_2'] = rc + conf.env['HAVE_MM_CRC32_U64'] = rc + context.did_show_result = True context.Result(rc) return rc @@ -545,7 +543,7 @@ conf = Configure(env, custom_tests={ 'check_linux_fs_h': check_linux_fs_h, 'check_uname': check_uname, 'check_cygwin': check_cygwin, - 'check_sse_4_2': check_sse_4_2, + 'check_mm_crc32_u64': check_mm_crc32_u64, 'check_sysmacro_h': check_sysmacro_h }) @@ -620,8 +618,8 @@ else: conf.env.Append(CCFLAGS=['-fPIC']) # check SSE4 support: -conf.check_sse_4_2() -if conf.env['HAVE_SSE_4_2']: +conf.check_mm_crc32_u64() +if conf.env['HAVE_MM_CRC32_U64']: conf.env.Append(CCFLAGS=['-msse4.2']) if 'clang' in os.path.basename(conf.env['CC']): diff --git a/lib/SConscript b/lib/SConscript index be5d3f95..7072b54e 100644 --- a/lib/SConscript +++ b/lib/SConscript @@ -34,7 +34,7 @@ def build_config_template(target, source, env): HAVE_LINUX_LIMITS=env['HAVE_LINUX_LIMITS'], HAVE_LINUX_FS_H=env['HAVE_LINUX_FS_H'], HAVE_BTRFS_H=env['HAVE_BTRFS_H'], - HAVE_SSE_4_2=env['HAVE_SSE_4_2'], + HAVE_MM_CRC32_U64=env['HAVE_MM_CRC32_U64'], HAVE_FACCESSAT=env['HAVE_FACCESSAT'], HAVE_UNAME=env['HAVE_UNAME'], HAVE_SYSMACROS_H=env['HAVE_SYSMACROS_H'], diff --git a/lib/checksum.c b/lib/checksum.c index c16dbc2c..e26db873 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -160,18 +160,10 @@ static Metro128State 
*rm_digest_metro_new(void) { return metrohash128_1_new(FALSE); } -static Metro128State *rm_digest_metrocrc_new(void) { - return metrohash128_1_new(g_atomic_int_get(&RM_DIGEST_USE_SSE)); -} - static Metro256State *rm_digest_metro256_new(void) { return metrohash256_new(FALSE); } -static Metro256State *rm_digest_metrocrc256_new(void) { - return metrohash256_new(g_atomic_int_get(&RM_DIGEST_USE_SSE)); -} - static const RmDigestInterface metro_interface = { .name = "metro", .bits = 128, @@ -192,9 +184,16 @@ static const RmDigestInterface metro256_interface = { .copy = (RmDigestCopyFunc)metrohash256_copy, .steal = (RmDigestStealFunc)metrohash256_steal}; -#if HAVE_SSE_4_2 +#if HAVE_MM_CRC32_U64 /* also define crc-optimised metro variants metrocrc and metrocrc256*/ +static Metro128State *rm_digest_metrocrc_new(void) { + return metrohash128_1_new(g_atomic_int_get(&RM_DIGEST_USE_SSE)); +} + +static Metro256State *rm_digest_metrocrc256_new(void) { + return metrohash256_new(g_atomic_int_get(&RM_DIGEST_USE_SSE)); +} static const RmDigestInterface metrocrc_interface = { .name = "metrocrc", @@ -758,7 +757,7 @@ static const RmDigestInterface *rm_digest_get_interface(RmDigestType type) { [RM_DIGEST_MURMUR] = &murmur_interface, [RM_DIGEST_METRO] = &metro_interface, [RM_DIGEST_METRO256] = &metro256_interface, -#if HAVE_SSE_4_2 +#if HAVE_MM_CRC32_U64 [RM_DIGEST_METROCRC] = &metrocrc_interface, [RM_DIGEST_METROCRC256] = &metrocrc256_interface, #endif @@ -1022,7 +1021,7 @@ guint8 *rm_digest_sum(RmDigestType algo, const guint8 *data, gsize len, gsize *o } void rm_digest_enable_sse(gboolean use_sse) { -#if HAVE_SSE_4_2 +#if HAVE_MM_CRC32_U64 if (use_sse && __builtin_cpu_supports("sse4.2")) { g_atomic_int_set(&RM_DIGEST_USE_SSE, TRUE); } else { diff --git a/lib/checksum.h b/lib/checksum.h index 278e824c..a872915f 100644 --- a/lib/checksum.h +++ b/lib/checksum.h @@ -39,7 +39,7 @@ typedef enum RmDigestType { RM_DIGEST_MURMUR, RM_DIGEST_METRO, RM_DIGEST_METRO256, -#if HAVE_SSE_4_2 +#if 
HAVE_MM_CRC32_U64 RM_DIGEST_METROCRC, RM_DIGEST_METROCRC256, #endif diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index 4259a5d7..8d7e0bb0 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -62,7 +62,7 @@ void metrohash128_2_steal(Metro128State *state, uint8_t *out); void metrohash256_update(Metro256State *state, const uint8_t *key, uint64_t len); void metrohash256_steal(Metro256State *state, uint8_t *out); -#if HAVE_SSE_4_2 +#if HAVE_MM_CRC32_U64 // MetroHash 128-bit hash functions using CRC instruction void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); @@ -75,7 +75,6 @@ void metrohash128crc_2_steal(Metro128State *state, uint8_t *out); void metrohash256crc_update(Metro256State *state, const uint8_t *key, uint64_t len); void metrohash256crc_steal(Metro256State *state, uint8_t *out); - #endif /* rotate right idiom recognized by compiler*/ diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index 78e5d208..2ec51d66 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -27,8 +27,6 @@ #include #include "metrohash.h" -#if HAVE_SSE_4_2 - struct _Metro128_state { bool use_sse; uint64_t v[4]; @@ -94,6 +92,7 @@ Metro128State *metrohash128_copy(Metro128State *state) { xs_len += bytes; \ data += bytes; +#if HAVE_MM_CRC32_U64 void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, uint64_t len) { if(!state->use_sse) { metrohash128_1_update(state, key, len); diff --git a/lib/config.h.in b/lib/config.h.in index 27b74cd5..cffec507 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -22,7 +22,7 @@ #define HAVE_FACCESSAT ({HAVE_FACCESSAT}) #define HAVE_UNAME ({HAVE_UNAME}) #define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) -#define HAVE_SSE_4_2 ({HAVE_SSE_4_2}) +#define HAVE_MM_CRC32_U64 ({HAVE_MM_CRC32_U64}) /* define here so rmlint and hash utility can both 
access */ #define RM_DEFAULT_DIGEST RM_DIGEST_BLAKE2B diff --git a/lib/hash-utility.c b/lib/hash-utility.c index d5276457..cd914e89 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -155,7 +155,7 @@ int rm_hasher_main(int argc, const char **argv) { "\n Supported, but not useful:" "\n %s\n"), "sha{1,256,512}, sha3-{256,384,512}, blake{2s,2b,2sp,2bp}, highway{64,128,256}", -#if HAVE_SSE_4_2 +#if HAVE_MM_CRC32_U64 "metrocrc, metrocrc256, " #endif "metro, metro256, xxhash, murmur", @@ -194,7 +194,7 @@ int rm_hasher_main(int argc, const char **argv) { ////////// Implementation ////// -#if HAVE_SSE_4_2 +#if HAVE_MM_CRC32_U64 rm_digest_enable_sse(TRUE); #endif From 56eb723d325726d2238324d7b55ba5ab8f04ee11 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 22 Nov 2017 11:08:00 +1000 Subject: [PATCH 153/180] scons: tolerate old gcc's with no __builtin_cpu_supports() --- SConstruct | 15 ++++++++++++++- lib/SConscript | 1 + lib/checksum.c | 2 +- lib/checksums/metrohash128.c | 4 +++- lib/cmdline.c | 2 ++ lib/config.h.in | 1 + 6 files changed, 22 insertions(+), 3 deletions(-) diff --git a/SConstruct b/SConstruct index 80fd443a..7f2a645b 100755 --- a/SConstruct +++ b/SConstruct @@ -376,6 +376,17 @@ def check_mm_crc32_u64(context): context.Result(rc) return rc +def check_builtin_cpu_supports(context): + rc = 0 if tests.CheckDeclaration( + context, + symbol='__builtin_cpu_supports' + ) else 1 + + conf.env['HAVE_BUILTIN_CPU_SUPPORTS'] = rc + context.did_show_result = True + context.Result(rc) + return rc + def create_uninstall_target(env, path): env.Command("uninstall-" + path, path, [ @@ -544,6 +555,7 @@ conf = Configure(env, custom_tests={ 'check_uname': check_uname, 'check_cygwin': check_cygwin, 'check_mm_crc32_u64': check_mm_crc32_u64, + 'check_builtin_cpu_supports': check_builtin_cpu_supports, 'check_sysmacro_h': check_sysmacro_h }) @@ -617,7 +629,7 @@ if conf.env['IS_CYGWIN']: else: conf.env.Append(CCFLAGS=['-fPIC']) -# check SSE4 support: +# check _mm_crc32_u64 
(SSE4.2) support: conf.check_mm_crc32_u64() if conf.env['HAVE_MM_CRC32_U64']: conf.env.Append(CCFLAGS=['-msse4.2']) @@ -642,6 +654,7 @@ env.ParseConfig(pkg_config + ' --cflags --libs ' + ' '.join(packages)) conf.env.Append(_LIBFLAGS=['-lm']) +conf.check_builtin_cpu_supports() conf.check_blkid() conf.check_sys_block() conf.check_libelf() diff --git a/lib/SConscript b/lib/SConscript index 7072b54e..210122ae 100644 --- a/lib/SConscript +++ b/lib/SConscript @@ -35,6 +35,7 @@ def build_config_template(target, source, env): HAVE_LINUX_FS_H=env['HAVE_LINUX_FS_H'], HAVE_BTRFS_H=env['HAVE_BTRFS_H'], HAVE_MM_CRC32_U64=env['HAVE_MM_CRC32_U64'], + HAVE_BUILTIN_CPU_SUPPORTS=env['HAVE_BUILTIN_CPU_SUPPORTS'], HAVE_FACCESSAT=env['HAVE_FACCESSAT'], HAVE_UNAME=env['HAVE_UNAME'], HAVE_SYSMACROS_H=env['HAVE_SYSMACROS_H'], diff --git a/lib/checksum.c b/lib/checksum.c index e26db873..fbc13067 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -1021,7 +1021,7 @@ guint8 *rm_digest_sum(RmDigestType algo, const guint8 *data, gsize len, gsize *o } void rm_digest_enable_sse(gboolean use_sse) { -#if HAVE_MM_CRC32_U64 +#if HAVE_MM_CRC32_U64 && HAVE_BUILTIN_CPU_SUPPORTS if (use_sse && __builtin_cpu_supports("sse4.2")) { g_atomic_int_set(&RM_DIGEST_USE_SSE, TRUE); } else { diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index 2ec51d66..320c0284 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -24,7 +24,6 @@ // #include -#include #include "metrohash.h" struct _Metro128_state { @@ -93,6 +92,9 @@ Metro128State *metrohash128_copy(Metro128State *state) { data += bytes; #if HAVE_MM_CRC32_U64 + +#include + void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, uint64_t len) { if(!state->use_sse) { metrohash128_1_update(state, key, len); diff --git a/lib/cmdline.c b/lib/cmdline.c index 60d902ab..9f01483f 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1484,7 +1484,9 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession 
*session) { rm_assert_gentle_not_reached(); } +#if HAVE_BUILTIN_CPU_SUPPORTS rm_digest_enable_sse(!cfg->no_sse && __builtin_cpu_supports("sse4.2")); +#endif cleanup: if(error != NULL) { diff --git a/lib/config.h.in b/lib/config.h.in index cffec507..254b7fe0 100644 --- a/lib/config.h.in +++ b/lib/config.h.in @@ -23,6 +23,7 @@ #define HAVE_UNAME ({HAVE_UNAME}) #define HAVE_SYSMACROS_H ({HAVE_SYSMACROS_H}) #define HAVE_MM_CRC32_U64 ({HAVE_MM_CRC32_U64}) +#define HAVE_BUILTIN_CPU_SUPPORTS ({HAVE_BUILTIN_CPU_SUPPORTS}) /* define here so rmlint and hash utility can both access */ #define RM_DEFAULT_DIGEST RM_DIGEST_BLAKE2B From 74ce99711179e08f0db2dc681bef60f79c9bb3ff Mon Sep 17 00:00:00 2001 From: daniel Date: Sat, 25 Nov 2017 08:24:34 +1000 Subject: [PATCH 154/180] checksums: fix metro var type incompatibilities on 32bit --- lib/checksum.c | 2 +- lib/checksums/metrohash.h | 24 ++++++++++++------------ lib/checksums/metrohash128.c | 28 ++++++++++++++-------------- 3 files changed, 27 insertions(+), 27 deletions(-) diff --git a/lib/checksum.c b/lib/checksum.c index fbc13067..24ba29e0 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -80,7 +80,7 @@ static gboolean rm_buffer_equal(RmBuffer *a, RmBuffer *b) { /* Each digest type must have an RmDigestInterface defined as follows: */ typedef gpointer (*RmDigestNewFunc)(void); typedef void (*RmDigestFreeFunc)(gpointer state); -typedef void (*RmDigestUpdateFunc)(gpointer state, const unsigned char *data, gsize size); +typedef void (*RmDigestUpdateFunc)(gpointer state, const unsigned char *data, size_t size); typedef gpointer (*RmDigestCopyFunc)(gpointer state); typedef void (*RmDigestStealFunc)(gpointer state, guint8 *result); typedef guint (*RmDigestLenFunc)(gpointer state); diff --git a/lib/checksums/metrohash.h b/lib/checksums/metrohash.h index 8d7e0bb0..5e71a7d6 100644 --- a/lib/checksums/metrohash.h +++ b/lib/checksums/metrohash.h @@ -36,8 +36,8 @@ typedef struct _Metro128_state Metro128State; typedef struct 
_Metro256_state Metro256State; // MetroHash 64-bit hash functions -void metrohash64_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); -void metrohash64_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); +void metrohash64_1(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out); +void metrohash64_2(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out); // MetroHash 128-bit hash functions Metro128State *metrohash128_1_new(bool use_sse); @@ -50,30 +50,30 @@ Metro256State *metrohash256_copy(Metro256State *state); void metrohash128_free(Metro128State *state); void metrohash256_free(Metro256State *state); -void metrohash128_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); -void metrohash128_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); +void metrohash128_1(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out); +void metrohash128_2(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out); -void metrohash128_1_update(Metro128State *state, const uint8_t *key, uint64_t len); +void metrohash128_1_update(Metro128State *state, const uint8_t *key, size_t len); void metrohash128_1_steal(Metro128State *state, uint8_t *out); -void metrohash128_2_update(Metro128State *state, const uint8_t *key, uint64_t len); +void metrohash128_2_update(Metro128State *state, const uint8_t *key, size_t len); void metrohash128_2_steal(Metro128State *state, uint8_t *out); -void metrohash256_update(Metro256State *state, const uint8_t *key, uint64_t len); +void metrohash256_update(Metro256State *state, const uint8_t *key, size_t len); void metrohash256_steal(Metro256State *state, uint8_t *out); #if HAVE_MM_CRC32_U64 // MetroHash 128-bit hash functions using CRC instruction -void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); -void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out); +void metrohash128crc_1(const uint8_t *key, size_t len, uint32_t 
seed, uint8_t *out); +void metrohash128crc_2(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out); -void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, uint64_t len); -void metrohash128crc_2_update(Metro128State *state, const uint8_t *key, uint64_t len); +void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, size_t len); +void metrohash128crc_2_update(Metro128State *state, const uint8_t *key, size_t len); void metrohash128crc_1_steal(Metro128State *state, uint8_t *out); void metrohash128crc_2_steal(Metro128State *state, uint8_t *out); -void metrohash256crc_update(Metro256State *state, const uint8_t *key, uint64_t len); +void metrohash256crc_update(Metro256State *state, const uint8_t *key, size_t len); void metrohash256crc_steal(Metro256State *state, uint8_t *out); #endif diff --git a/lib/checksums/metrohash128.c b/lib/checksums/metrohash128.c index 320c0284..1f37a993 100644 --- a/lib/checksums/metrohash128.c +++ b/lib/checksums/metrohash128.c @@ -27,10 +27,10 @@ #include "metrohash.h" struct _Metro128_state { - bool use_sse; - uint64_t v[4]; uint8_t xs[32]; /* unhashed data from last increment */ uint8_t xs_len; + uint64_t v[4]; + bool use_sse; }; struct _Metro256_state { @@ -85,8 +85,7 @@ Metro128State *metrohash128_copy(Metro128State *state) { } #define METRO_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ - const int bytes = \ - (data_len + xs_len > xs_cap) ? (int)xs_cap - (int)xs_len : (int)data_len; \ + const int bytes = ((int)data_len + (int)xs_len > (int)xs_cap) ? 
(int)xs_cap - (int)xs_len : (int)data_len; \ memcpy(xs + xs_len, data, bytes); \ xs_len += bytes; \ data += bytes; @@ -95,7 +94,7 @@ Metro128State *metrohash128_copy(Metro128State *state) { #include -void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, uint64_t len) { +void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, size_t len) { if(!state->use_sse) { metrohash128_1_update(state, key, len); return; @@ -141,7 +140,7 @@ void metrohash128crc_1_update(Metro128State *state, const uint8_t *key, uint64_t } } -void metrohash128crc_2_update(Metro128State *state, const uint8_t *key, uint64_t len) { +void metrohash128crc_2_update(Metro128State *state, const uint8_t *key, size_t len) { if(!state->use_sse) { metrohash128_2_update(state, key, len); } else { @@ -272,7 +271,7 @@ void metrohash128crc_2_steal(Metro128State *state, uint8_t *out) { memcpy(out, v, 16); } -void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { +void metrohash128crc_1(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out) { Metro128State *state = metrohash128_1_new(TRUE); metrohash128crc_1_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128crc_1_update(state, key, len); @@ -280,7 +279,7 @@ void metrohash128crc_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t metrohash128_free(state); } -void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { +void metrohash128crc_2(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out) { Metro128State *state = metrohash128_2_new(TRUE); metrohash128crc_2_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128crc_2_update(state, key, len); @@ -288,7 +287,7 @@ void metrohash128crc_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t metrohash128_free(state); } -void metrohash256crc_update(Metro256State *state, const uint8_t *key, uint64_t len) { +void metrohash256crc_update(Metro256State *state, const uint8_t *key, 
size_t len) { metrohash128crc_1_update(&state->state1, key, len); metrohash128crc_2_update(&state->state2, key, len); } @@ -300,7 +299,8 @@ void metrohash256crc_steal(Metro256State *state, uint8_t *out) { #endif -void metrohash128_1_update(Metro128State *state, const uint8_t *key, uint64_t len) { +void metrohash128_1_update(Metro128State *state, const uint8_t *key, size_t len) { + uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -348,7 +348,7 @@ void metrohash128_1_update(Metro128State *state, const uint8_t *key, uint64_t le memcpy(state->xs, data, state->xs_len); } } -void metrohash128_2_update(Metro128State *state, const uint8_t *key, uint64_t len) { +void metrohash128_2_update(Metro128State *state, const uint8_t *key, size_t len) { uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -517,7 +517,7 @@ void metrohash128_2_steal(Metro128State *state, uint8_t *out) { memcpy(out, v, 16); } -void metrohash128_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { +void metrohash128_1(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out) { Metro128State *state = metrohash128_1_new(FALSE); metrohash128_1_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128_1_update(state, key, len); @@ -525,7 +525,7 @@ void metrohash128_1(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *ou metrohash128_free(state); } -void metrohash128_2(const uint8_t *key, uint64_t len, uint32_t seed, uint8_t *out) { +void metrohash128_2(const uint8_t *key, size_t len, uint32_t seed, uint8_t *out) { Metro128State *state = metrohash128_2_new(FALSE); metrohash128_2_update(state, (const uint8_t*)&seed, sizeof(seed)); metrohash128_2_update(state, key, len); @@ -550,7 +550,7 @@ Metro256State *metrohash256_copy(Metro256State *state) { return g_slice_copy(sizeof(Metro256State), state); } -void metrohash256_update(Metro256State *state, const uint8_t *key, uint64_t len) { +void metrohash256_update(Metro256State *state, const uint8_t 
*key, size_t len) { metrohash128_1_update(&state->state1, key, len); metrohash128_2_update(&state->state2, key, len); } From d4438859df2667105803d941d3b1b10727b1e61e Mon Sep 17 00:00:00 2001 From: daniel Date: Sat, 25 Nov 2017 17:47:11 +1000 Subject: [PATCH 155/180] murmur: use size_t for data len --- lib/checksums/murmur3.c | 16 ++++++++-------- lib/checksums/murmur3.h | 13 +++++++------ 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/lib/checksums/murmur3.c b/lib/checksums/murmur3.c index 00a54e4b..81c2973e 100644 --- a/lib/checksums/murmur3.c +++ b/lib/checksums/murmur3.c @@ -42,7 +42,7 @@ struct _MurmurHash3_x86_32_state { uint32_t h1; uint8_t xs[4]; /* unhashed data from last increment */ uint8_t xs_len; - uint32_t len; + size_t len; }; struct _MurmurHash3_x86_128_state { @@ -52,7 +52,7 @@ struct _MurmurHash3_x86_128_state { uint32_t h4; uint8_t xs[16]; /* unhashed data from last increment */ uint8_t xs_len; - uint32_t len; + size_t len; }; struct _MurmurHash3_x64_128_state { @@ -60,7 +60,7 @@ struct _MurmurHash3_x64_128_state { uint64_t h2; uint8_t xs[16]; /* unhashed data from last increment */ uint8_t xs_len; - uint32_t len; + size_t len; }; //----------------------------------------------------------------------------- @@ -121,7 +121,7 @@ MurmurHash3_x86_32_state *MurmurHash3_x86_32_copy(MurmurHash3_x86_32_state *stat #define MURMUR_UPDATE_H1_X86_32(H1) MURMUR_UPDATE(H1, k1, 15, 0xcc9e2d51, 0x1b873593); void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, - const void *restrict key, const uint32_t len) { + const void *restrict key, const size_t len) { state->len += len; uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -192,7 +192,7 @@ void MurmurHash3_x86_32_free(MurmurHash3_x86_32_state *state) { g_slice_free(MurmurHash3_x86_32_state, state); } -uint32_t MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed) { +uint32_t MurmurHash3_x86_32(const void *key, size_t len, uint32_t seed) { uint32_t 
out; MurmurHash3_x86_32_state *state = MurmurHash3_x86_32_new(); if(seed != 0) { @@ -219,7 +219,7 @@ MurmurHash3_x86_128_state *MurmurHash3_x86_128_copy(MurmurHash3_x86_128_state *s #define MURMUR_UPDATE_H4_X86_128(H4) MURMUR_UPDATE(H4, k4, 18, 0xa1e38b93, 0x239b961b); void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, - const void *restrict key, const uint32_t len) { + const void *restrict key, const size_t len) { state->len += len; uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -370,7 +370,7 @@ void MurmurHash3_x86_128_free(MurmurHash3_x86_128_state *state) { g_slice_free(MurmurHash3_x86_128_state, state); } -void MurmurHash3_x86_128(const void *key, uint32_t len, uint32_t seed, void *out) { +void MurmurHash3_x86_128(const void *key, size_t len, uint32_t seed, void *out) { MurmurHash3_x86_128_state *state = MurmurHash3_x86_128_new(); if(seed != 0) { MurmurHash3_x86_128_update(state, &seed, sizeof(seed)); @@ -507,7 +507,7 @@ void MurmurHash3_x64_128_free(MurmurHash3_x64_128_state *state) { g_slice_free(MurmurHash3_x64_128_state, state); } -void MurmurHash3_x64_128(const void *key, const uint64_t len, const uint32_t seed, +void MurmurHash3_x64_128(const void *key, const size_t len, const uint32_t seed, void *out) { MurmurHash3_x64_128_state *state = MurmurHash3_x64_128_new(); if(seed != 0) { diff --git a/lib/checksums/murmur3.h b/lib/checksums/murmur3.h index 9fd861e8..8185bc8f 100644 --- a/lib/checksums/murmur3.h +++ b/lib/checksums/murmur3.h @@ -9,6 +9,7 @@ #define _MURMURHASH3_H_ #include +#include //----------------------------------------------------------------------------- // opaque structs for intermediate checksum states @@ -38,11 +39,11 @@ MurmurHash3_x64_128_state *MurmurHash3_x64_128_copy(MurmurHash3_x64_128_state *s * streaming update of checksum */ void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const restrict state, - const void *restrict key, const uint32_t len); + const void *restrict key, const 
size_t len); void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const restrict state, - const void *restrict key, const uint32_t len); + const void *restrict key, const size_t len); void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, - const void *restrict key, const uint64_t len); + const void *restrict key, const size_t len); /** * output checksum result; does not modify underlying state @@ -71,9 +72,9 @@ void MurmurHash3_x64_128_free(MurmurHash3_x64_128_state *state); /** * convenience single-buffer hash */ -uint32_t MurmurHash3_x86_32(const void *key, uint32_t len, uint32_t seed); -void MurmurHash3_x86_128(const void *key, uint32_t len, uint32_t seed, void *out); -void MurmurHash3_x64_128(const void *key, uint64_t len, uint32_t seed, void *out); +uint32_t MurmurHash3_x86_32(const void *key, size_t len, uint32_t seed); +void MurmurHash3_x86_128(const void *key, size_t len, uint32_t seed, void *out); +void MurmurHash3_x64_128(const void *key, size_t len, uint32_t seed, void *out); //----------------------------------------------------------------------------- From 99d11222e41c18e93f49af3d034d8f5211bcbd36 Mon Sep 17 00:00:00 2001 From: daniel Date: Sat, 25 Nov 2017 17:47:44 +1000 Subject: [PATCH 156/180] murmur: fix error using wrong ROTL for 32-bit implementation --- lib/checksums/murmur3.c | 78 +++++++++++++++++++++++++---------------- 1 file changed, 47 insertions(+), 31 deletions(-) diff --git a/lib/checksums/murmur3.c b/lib/checksums/murmur3.c index 81c2973e..04412f30 100644 --- a/lib/checksums/murmur3.c +++ b/lib/checksums/murmur3.c @@ -90,20 +90,35 @@ static inline uint64_t fmix64(uint64_t k) { //----------------------------------------------------------------------------- -#define MURMUR_UPDATE(h, k, rotl, ca, cb) \ - k *= ca; \ +#define MURMUR_UPDATE_X86(h, k, rotl, ca, cb) \ + k *= ca; \ + k = ROTL32(k, rotl); \ + k *= cb; \ + h ^= k; + +#define MURMUR_MIX_X86(ha, hb, rotl, c) \ + ha = ROTL32(ha, rotl); \ + ha += hb; 
\ + ha = ha * 5 + c; + + +#define MURMUR_UPDATE_X64(h, k, rotl, ca, cb) \ + k = k * ca; \ k = ROTL64(k, rotl); \ k *= cb; \ - h ^= k; + h ^= k; \ + -#define MURMUR_MIX(ha, hb, rotl, c) \ +#define MURMUR_MIX_X64(ha, hb, rotl, c) \ ha = ROTL64(ha, rotl); \ ha += hb; \ - ha = ha * 5 + c; + ha = ha * 5 + c; \ + + #define MURMUR_FILL_XS(xs, xs_len, xs_cap, data, data_len) \ const int bytes = \ - (data_len + xs_len > xs_cap) ? (int)xs_cap - (int)xs_len : (int)data_len; \ + ((int)data_len + (int)xs_len > (int)xs_cap) ? (int)xs_cap - (int)xs_len : (int)data_len; \ memcpy(xs + xs_len, data, bytes); \ xs_len += bytes; \ data += bytes; @@ -118,7 +133,7 @@ MurmurHash3_x86_32_state *MurmurHash3_x86_32_copy(MurmurHash3_x86_32_state *stat return g_slice_copy(sizeof(MurmurHash3_x86_32_state), state); } -#define MURMUR_UPDATE_H1_X86_32(H1) MURMUR_UPDATE(H1, k1, 15, 0xcc9e2d51, 0x1b873593); +#define MURMUR_UPDATE_H1_X86_32(H1, K1) MURMUR_UPDATE_X86(H1, K1, 15, 0xcc9e2d51, 0x1b873593); void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, const void *restrict key, const size_t len) { @@ -144,8 +159,8 @@ void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, data += 4; } - MURMUR_UPDATE_H1_X86_32(state->h1); - MURMUR_MIX(state->h1, 0, 13, 0xe6546b64); + MURMUR_UPDATE_H1_X86_32(state->h1, k1); + MURMUR_MIX_X86(state->h1, 0, 13, 0xe6546b64); } if(state->xs_len == 0 && stop > data) { @@ -170,7 +185,7 @@ void MurmurHash3_x86_32_steal(const MurmurHash3_x86_32_state *const restrict sta case 1: k1 ^= state->xs[0]; - MURMUR_UPDATE_H1_X86_32(h1); + MURMUR_UPDATE_H1_X86_32(h1, k1); }; //---------- @@ -213,10 +228,11 @@ MurmurHash3_x86_128_state *MurmurHash3_x86_128_copy(MurmurHash3_x86_128_state *s return g_slice_copy(sizeof(MurmurHash3_x86_128_state), state); } -#define MURMUR_UPDATE_H1_X86_128(H1) MURMUR_UPDATE(H1, k1, 15, 0x239b961b, 0xab0e9789); -#define MURMUR_UPDATE_H2_X86_128(H2) MURMUR_UPDATE(H2, k2, 16, 0xab0e9789, 0x38b34ae5); -#define 
MURMUR_UPDATE_H3_X86_128(H3) MURMUR_UPDATE(H3, k3, 17, 0x38b34ae5, 0xa1e38b93); -#define MURMUR_UPDATE_H4_X86_128(H4) MURMUR_UPDATE(H4, k4, 18, 0xa1e38b93, 0x239b961b); + +#define MURMUR_UPDATE_H1_X86_128(H, K) MURMUR_UPDATE_X86(H, K, 15, 0x239b961b, 0xab0e9789); +#define MURMUR_UPDATE_H2_X86_128(H, K) MURMUR_UPDATE_X86(H, K, 16, 0xab0e9789, 0x38b34ae5); +#define MURMUR_UPDATE_H3_X86_128(H, K) MURMUR_UPDATE_X86(H, K, 17, 0x38b34ae5, 0xa1e38b93); +#define MURMUR_UPDATE_H4_X86_128(H, K) MURMUR_UPDATE_X86(H, K, 18, 0xa1e38b93, 0x239b961b); void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, const void *restrict key, const size_t len) { @@ -251,17 +267,17 @@ void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, data += 16; } - MURMUR_UPDATE_H1_X86_128(state->h1); - MURMUR_MIX(state->h1, state->h2, 19, 0x561ccd1b); + MURMUR_UPDATE_H1_X86_128(state->h1, k1); + MURMUR_MIX_X86(state->h1, state->h2, 19, 0x561ccd1b); - MURMUR_UPDATE_H2_X86_128(state->h2); - MURMUR_MIX(state->h2, state->h3, 17, 0x0bcaa747); + MURMUR_UPDATE_H2_X86_128(state->h2, k2); + MURMUR_MIX_X86(state->h2, state->h3, 17, 0x0bcaa747); - MURMUR_UPDATE_H3_X86_128(state->h3); - MURMUR_MIX(state->h3, state->h4, 15, 0x96cd1c35); + MURMUR_UPDATE_H3_X86_128(state->h3, k3); + MURMUR_MIX_X86(state->h3, state->h4, 15, 0x96cd1c35); - MURMUR_UPDATE_H4_X86_128(state->h4); - MURMUR_MIX(state->h4, state->h1, 13, 0x32ac3b17); + MURMUR_UPDATE_H4_X86_128(state->h4, k4); + MURMUR_MIX_X86(state->h4, state->h1, 13, 0x32ac3b17); } if(state->xs_len == 0 && stop > data) { @@ -292,7 +308,7 @@ void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict s case 13: k4 ^= state->xs[12] << 0; - MURMUR_UPDATE_H4_X86_128(h4); + MURMUR_UPDATE_H4_X86_128(h4, k4); case 12: k3 ^= state->xs[11] << 24; @@ -303,7 +319,7 @@ void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict s case 9: k3 ^= state->xs[8] << 0; - MURMUR_UPDATE_H3_X86_128(h3); + 
MURMUR_UPDATE_H3_X86_128(h3, k3); case 8: k2 ^= state->xs[7] << 24; @@ -314,7 +330,7 @@ void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict s case 5: k2 ^= state->xs[4] << 0; - MURMUR_UPDATE_H2_X86_128(h2); + MURMUR_UPDATE_H2_X86_128(h2, k2); case 4: k1 ^= state->xs[3] << 24; @@ -325,7 +341,7 @@ void MurmurHash3_x86_128_steal(const MurmurHash3_x86_128_state *const restrict s case 1: k1 ^= state->xs[0] << 0; - MURMUR_UPDATE_H1_X86_128(h1); + MURMUR_UPDATE_H1_X86_128(h1, k1); }; //---------- @@ -390,14 +406,14 @@ MurmurHash3_x64_128_state *MurmurHash3_x64_128_copy(MurmurHash3_x64_128_state *s } #define MURMUR_UPDATE_H1_X64_128(H1) \ - MURMUR_UPDATE(H1, k1, 31, BIG_CONSTANT(0x87c37b91114253d5), \ + MURMUR_UPDATE_X64(H1, k1, 31, BIG_CONSTANT(0x87c37b91114253d5), \ BIG_CONSTANT(0x4cf5ad432745937f)); #define MURMUR_UPDATE_H2_X64_128(H2) \ - MURMUR_UPDATE(H2, k2, 33, BIG_CONSTANT(0x4cf5ad432745937f), \ + MURMUR_UPDATE_X64(H2, k2, 33, BIG_CONSTANT(0x4cf5ad432745937f), \ BIG_CONSTANT(0x87c37b91114253d5)); void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, - const void *restrict key, const uint64_t len) { + const void *restrict key, const size_t len) { state->len += len; uint8_t *data = (uint8_t *)key; const uint8_t *stop = data + len; @@ -424,10 +440,10 @@ void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, } MURMUR_UPDATE_H1_X64_128(state->h1); - MURMUR_MIX(state->h1, state->h2, 27, 0x52dce729); + MURMUR_MIX_X64(state->h1, state->h2, 27, 0x52dce729); MURMUR_UPDATE_H2_X64_128(state->h2); - MURMUR_MIX(state->h2, state->h1, 31, 0x38495ab5); + MURMUR_MIX_X64(state->h2, state->h1, 31, 0x38495ab5); } if(state->xs_len == 0 && stop > data) { From 448cb0c76cbb6178105556ede2bfd864c6f83af3 Mon Sep 17 00:00:00 2001 From: daniel Date: Sat, 25 Nov 2017 19:05:21 +1000 Subject: [PATCH 157/180] checksum: fix data len typing --- lib/checksum.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git 
a/lib/checksum.c b/lib/checksum.c index 24ba29e0..016713bb 100644 --- a/lib/checksum.c +++ b/lib/checksum.c @@ -264,7 +264,7 @@ static void rm_digest_cumulative_free(RmDigestCumulative *state) { } static void rm_digest_cumulative_update(RmDigestCumulative *state, - const unsigned char *data, RmOff size) { + const unsigned char *data, size_t size) { if(!state->data) { /* first update sets checksum length */ state->bytes = RM_DIGEST_CUMULATIVE_ALIGN * CLAMP(size / RM_DIGEST_CUMULATIVE_ALIGN, 1, RM_DIGEST_CUMULATIVE_MAX_BYTES / RM_DIGEST_CUMULATIVE_ALIGN); From f6463ee70a68932592226b462be88a17dfaf4c8c Mon Sep 17 00:00:00 2001 From: daniel Date: Mon, 27 Nov 2017 07:28:53 +1000 Subject: [PATCH 158/180] tests: add hash collision resistance test --- tests/test_robustness/test_collisions.py | 23 +++++++++++++++++++++++ tests/utils.py | 13 ++++++++++--- 2 files changed, 33 insertions(+), 3 deletions(-) create mode 100644 tests/test_robustness/test_collisions.py diff --git a/tests/test_robustness/test_collisions.py b/tests/test_robustness/test_collisions.py new file mode 100644 index 00000000..6fd2eb66 --- /dev/null +++ b/tests/test_robustness/test_collisions.py @@ -0,0 +1,23 @@ +#!/usr/bin/env python3 +# encoding: utf-8 +from nose.plugins.attrib import attr +from nose import with_setup +from tests.utils import * + + +BLACKLIST = [ ] + +@attr('slow') +@with_setup(usual_setup_func, usual_teardown_func) +def test_collision_resistance(): + # test for at least 20 bits of collision resistance + # this should detect gross errors in checksum encoding...
+ + numfiles = 1024*1024 + for i in range(numfiles): + create_file(i, str(i), write_binary=True) + + for algo in CKSUM_TYPES: + if algo not in BLACKLIST: + *_, footer = run_rmlint('-a {}'.format(algo)) + assert footer['duplicates'] == 0, 'Unexpected hash collision for hash type {}'.format(algo) diff --git a/tests/utils.py b/tests/utils.py index 9c49a5d4..464783ce 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -11,6 +11,7 @@ import pprint import shutil import shlex +import struct import subprocess TESTDIR_NAME = os.getenv('RM_TS_DIR') or '/tmp/rmlint-unit-testdir' @@ -303,7 +304,7 @@ def create_link(path, target, symlink=False): ) -def create_file(data, name, mtime=None): +def create_file(data, name, mtime=None, write_binary=False): full_path = os.path.join(TESTDIR_NAME, name) if '/' in name: try: @@ -311,8 +312,14 @@ def create_file(data, name, mtime=None): except OSError: pass - with open(full_path, 'w') as handle: - handle.write(data) + with open(full_path, 'wb' if write_binary else 'w') as handle: + if write_binary: + if isinstance(data, int): + handle.write(struct.pack('i', data)) + else: + assert False, "Unhandled data type for binary write: " + data + else: + handle.write(data) if not mtime is None: subprocess.call(['touch', '-m', '-d', str(mtime), full_path]) From f9fe3a2493e768a9300ea23a12dff3e27c50c101 Mon Sep 17 00:00:00 2001 From: daniel Date: Mon, 27 Nov 2017 09:36:07 +1000 Subject: [PATCH 159/180] tests: try to accommodate `paranoid` in collision resistance test --- lib/cfg.c | 15 +++++++++++++++ lib/cfg.h | 3 +++ lib/cmdline.c | 7 +++++++ lib/shredder.c | 15 +-------------- tests/test_robustness/test_collisions.py | 2 +- 5 files changed, 27 insertions(+), 15 deletions(-) diff --git a/lib/cfg.c b/lib/cfg.c index 970a926f..fbb23f8e 100644 --- a/lib/cfg.c +++ b/lib/cfg.c @@ -72,6 +72,21 @@ void rm_cfg_set_default(RmCfg *cfg) { cfg->verbosity = G_LOG_LEVEL_INFO; cfg->follow_symlinks = false; + /* Optimum buffer size based on /usr without dropping 
caches: + * 4k => 5.29 seconds + * 8k => 5.11 seconds + * 16k => 5.04 seconds + * 32k => 5.08 seconds + * With dropped caches: + * 4k => 45.2 seconds + * 16k => 45.0 seconds + * Optimum buffer size using a rotational disk and paranoid hash: + * 4k => 16.5 seconds + * 8k => 16.5 seconds + * 16k => 15.9 seconds + * 32k => 15.8 seconds */ + cfg->read_buf_len = 16 * 1024; + cfg->total_mem = (RmOff)1024 * 1024 * 1024; cfg->sweep_size = 1024 * 1024 * 1024; cfg->sweep_count = 1024 * 16; diff --git a/lib/cfg.h b/lib/cfg.h index d0b971b4..cc75d1de 100644 --- a/lib/cfg.h +++ b/lib/cfg.h @@ -132,6 +132,9 @@ typedef struct RmCfg { /* total number of bytes we are allowed to use (target only) */ RmOff total_mem; + /* length of read buffers */ + RmOff read_buf_len; + /* number of bytes to read before going back to start of disk * (too big a sweep risks metadata getting pushed out of ram)*/ RmOff sweep_size; diff --git a/lib/cmdline.c b/lib/cmdline.c index 9f01483f..96852d27 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -832,6 +832,12 @@ static gboolean rm_cmd_parse_limit_mem(_UNUSED const char *option_name, return (rm_cmd_parse_mem(size_spec, error, &session->cfg->total_mem)); } +static gboolean rm_cmd_parse_read_buf_len(_UNUSED const char *option_name, + const gchar *size_spec, RmSession *session, + GError **error) { + return (rm_cmd_parse_mem(size_spec, error, &session->cfg->read_buf_len)); +} + static gboolean rm_cmd_parse_sweep_size(_UNUSED const char *option_name, const gchar *size_spec, RmSession *session, GError **error) { @@ -1318,6 +1324,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"clamp-low" , 'q' , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(clamp_low) , "Limit lower reading barrier" , "P"} , {"clamp-top" , 'Q' , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(clamp_top) , "Limit upper reading barrier" , "P"} , {"limit-mem" , 'u' , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(limit_mem) , "Specify max. 
memory usage target" , "S"} , + {"read-buffer-len" , 0 , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(read_buf_len) , "Specify read buffer length in bytes" , "S"} , {"sweep-size" , 0 , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(sweep_size) , "Specify max. bytes per pass when scanning disks" , "S"} , {"sweep-files" , 0 , HIDDEN , G_OPTION_ARG_CALLBACK , FUNC(sweep_count) , "Specify max. file count per pass when scanning disks" , "S"} , {"threads" , 't' , HIDDEN , G_OPTION_ARG_INT64 , &cfg->threads , "Specify max. number of hasher threads" , "N"} , diff --git a/lib/shredder.c b/lib/shredder.c index fe2d57f7..34f76163 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -1739,24 +1739,11 @@ void rm_shred_run(RmSession *session) { rm_log_debug_line("Read buffer Mem: %" LLU, read_buffer_mem); /* Initialise hasher */ - /* Optimum buffer size based on /usr without dropping caches: - * SHRED_PAGE_SIZE * 1 => 5.29 seconds - * SHRED_PAGE_SIZE * 2 => 5.11 seconds - * SHRED_PAGE_SIZE * 4 => 5.04 seconds - * SHRED_PAGE_SIZE * 8 => 5.08 seconds - * With dropped caches: - * SHRED_PAGE_SIZE * 1 => 45.2 seconds - * SHRED_PAGE_SIZE * 4 => 45.0 seconds - * Optimum buffer size using a rotational disk and paranoid hash: - * SHRED_PAGE_SIZE * 1 => 16.5 seconds - * SHRED_PAGE_SIZE * 2 => 16.5 seconds - * SHRED_PAGE_SIZE * 4 => 15.9 seconds - * SHRED_PAGE_SIZE * 8 => 15.8 seconds */ tag.hasher = rm_hasher_new(cfg->checksum_type, cfg->threads, cfg->use_buffered_read, - SHRED_PAGE_SIZE * 4, + cfg->read_buf_len, read_buffer_mem, (RmHasherCallback)rm_shred_hash_callback, &tag); diff --git a/tests/test_robustness/test_collisions.py b/tests/test_robustness/test_collisions.py index 6fd2eb66..f1a8cd59 100644 --- a/tests/test_robustness/test_collisions.py +++ b/tests/test_robustness/test_collisions.py @@ -19,5 +19,5 @@ def test_collision_resistance(): for algo in CKSUM_TYPES: if algo not in BLACKLIST: - *_, footer = run_rmlint('-a {}'.format(algo)) + *_, footer = run_rmlint('--read-buffer-len=4 -a 
{}'.format(algo)) assert footer['duplicates'] == 0, 'Unexpected hash collision for hash type {}'.format(algo) From ae06aed402a9ed8d90f4d351438b094e73a1133d Mon Sep 17 00:00:00 2001 From: daniel Date: Mon, 27 Nov 2017 09:41:54 +1000 Subject: [PATCH 160/180] tests: blacklist paranoid from collision_resistance test --- tests/test_robustness/test_collisions.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tests/test_robustness/test_collisions.py b/tests/test_robustness/test_collisions.py index f1a8cd59..0d729bff 100644 --- a/tests/test_robustness/test_collisions.py +++ b/tests/test_robustness/test_collisions.py @@ -5,7 +5,11 @@ from tests.utils import * -BLACKLIST = [ ] +# current shredder algorithm does not handle large size-groups at all +# well, due to pre-matching "optimisation" +# https://github.com/SeeSpotRun/rmlint/blob/448cb0c76cbb6178105556ede2bfd864c6f83af3/lib/checksum.c#L678-L730 +# which degenerates into an inefficient O(n^2) lookup with large size groups +BLACKLIST = ['paranoid'] @attr('slow') @with_setup(usual_setup_func, usual_teardown_func) From 6d573ac709443c51b8e324831ecd8d3bf5707206 Mon Sep 17 00:00:00 2001 From: daniel Date: Mon, 27 Nov 2017 09:54:02 +1000 Subject: [PATCH 161/180] cmdline: make sse handling uglier to avoid pointless warning messages --- lib/cmdline.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index 96852d27..b5328ce3 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1491,7 +1491,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { rm_assert_gentle_not_reached(); } -#if HAVE_BUILTIN_CPU_SUPPORTS +#if HAVE_BUILTIN_CPU_SUPPORTS && HAVE_MM_CRC32_U64 rm_digest_enable_sse(!cfg->no_sse && __builtin_cpu_supports("sse4.2")); #endif From d9c79ade7d4dcea6527fc5b4a70a02f796e8eddc Mon Sep 17 00:00:00 2001 From: daniel Date: Mon, 27 Nov 2017 10:22:47 +1000 Subject: [PATCH 162/180] various: fix some typecasting and formatting warnings on 32-bit --- 
lib/checksums/murmur3.c | 29 +++++++++++++++++++---------- lib/hasher.c | 6 +++--- lib/hasher.h | 4 ++-- lib/shredder.c | 2 +- lib/utilities.c | 16 ++++++++++------ lib/xattr.c | 2 +- 6 files changed, 36 insertions(+), 23 deletions(-) diff --git a/lib/checksums/murmur3.c b/lib/checksums/murmur3.c index 04412f30..0a6a32d3 100644 --- a/lib/checksums/murmur3.c +++ b/lib/checksums/murmur3.c @@ -40,7 +40,10 @@ static inline uint64_t rotl64(uint64_t x, int8_t r) { struct _MurmurHash3_x86_32_state { uint32_t h1; - uint8_t xs[4]; /* unhashed data from last increment */ + union { + uint8_t xs[4]; /* unhashed data from last increment */ + uint32_t xs32; + }; uint8_t xs_len; size_t len; }; @@ -50,7 +53,10 @@ struct _MurmurHash3_x86_128_state { uint32_t h2; uint32_t h3; uint32_t h4; - uint8_t xs[16]; /* unhashed data from last increment */ + union { + uint8_t xs[16]; /* unhashed data from last increment */ + uint32_t xs32[4]; + }; uint8_t xs_len; size_t len; }; @@ -58,7 +64,10 @@ struct _MurmurHash3_x86_128_state { struct _MurmurHash3_x64_128_state { uint64_t h1; uint64_t h2; - uint8_t xs[16]; /* unhashed data from last increment */ + union { + uint8_t xs[16]; /* unhashed data from last increment */ + uint64_t xs64[2]; + }; uint8_t xs_len; size_t len; }; @@ -151,7 +160,7 @@ void MurmurHash3_x86_32_update(MurmurHash3_x86_32_state *const state, if(state->xs_len == 4) { /* process remnant data from previous update */ - k1 = GET_UINT32(&state->xs[0]); + k1 = state->xs32; state->xs_len = 0; } else { /* process new data */ @@ -253,10 +262,10 @@ void MurmurHash3_x86_128_update(MurmurHash3_x86_128_state *const state, if(state->xs_len == 16) { /* process remnant data from previous update */ - k1 = GET_UINT32(&state->xs[0]); - k2 = GET_UINT32(&state->xs[4]); - k3 = GET_UINT32(&state->xs[8]); - k4 = GET_UINT32(&state->xs[12]); + k1 = state->xs32[0]; + k2 = state->xs32[1]; + k3 = state->xs32[2]; + k4 = state->xs32[3]; state->xs_len = 0; } else { /* process new data */ @@ -429,8 +438,8 @@ 
void MurmurHash3_x64_128_update(MurmurHash3_x64_128_state *const restrict state, if(state->xs_len == 16) { /* process remnant data from previous update */ - k1 = GET_UINT64(&state->xs[0]); - k2 = GET_UINT64(&state->xs[8]); + k1 = state->xs64[0]; + k2 = state->xs64[1]; state->xs_len = 0; } else { /* process new data */ diff --git a/lib/hasher.c b/lib/hasher.c index 8934ed0d..919cd859 100644 --- a/lib/hasher.c +++ b/lib/hasher.c @@ -449,9 +449,9 @@ RmHasherTask *rm_hasher_task_new(RmHasher *hasher, RmDigest *digest, } gboolean rm_hasher_task_hash(RmHasherTask *task, char *path, guint64 start_offset, - guint64 bytes_to_read, gboolean is_symlink, - RmOff *bytes_read_out) { - guint64 bytes_read = 0; + gsize bytes_to_read, gboolean is_symlink, + gsize *bytes_read_out) { + gsize bytes_read = 0; gboolean success = false; if(is_symlink) { diff --git a/lib/hasher.h b/lib/hasher.h index 82f9f1c8..0b455db8 100644 --- a/lib/hasher.h +++ b/lib/hasher.h @@ -141,9 +141,9 @@ RmHasherTask *rm_hasher_task_new(RmHasher *hasher, gboolean rm_hasher_task_hash(RmHasherTask *task, char *path, guint64 start_offset, - guint64 bytes_to_read, + size_t bytes_to_read, gboolean is_symlink, - RmOff *bytes_read_out); + gsize *bytes_read_out); /** * @brief Finalise a hashing task diff --git a/lib/shredder.c b/lib/shredder.c index 34f76163..eb7549ca 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -1622,7 +1622,7 @@ static gint rm_shred_process_file(RmFile *file, RmSession *session) { (!cfg->shred_never_wait && rm_mds_device_is_rotational(file->disk) && bytes_to_read < SHRED_TOO_MANY_BYTES_TO_WAIT)); - RmOff bytes_read = 0; + gsize bytes_read = 0; RmHasherTask *task = rm_hasher_task_new(tag->hasher, file->digest, file); if(!rm_hasher_task_hash(task, file_path, file->hash_offset, bytes_to_read, file->is_symlink, &bytes_read)) { diff --git a/lib/utilities.c b/lib/utilities.c index 391eb625..63129dee 100644 --- a/lib/utilities.c +++ b/lib/utilities.c @@ -1180,7 +1180,8 @@ RmLinkType 
rm_util_link_type(char *path1, char *path2) { } if(stat1.st_size != stat2.st_size) { - rm_log_debug_line("Files have different sizes: %lu <> %lu", stat1.st_size, + rm_log_debug_line("Files have different sizes: %" G_GUINT64_FORMAT + " <> %" G_GUINT64_FORMAT, stat1.st_size, stat2.st_size); RM_RETURN(RM_LINK_WRONG_SIZE); } @@ -1215,12 +1216,14 @@ RmLinkType rm_util_link_type(char *path1, char *path2) { RmOff physical_2 = rm_offset_get_from_fd(fd2, logical_current, &logical_next_2); if(physical_1 != physical_2) { - rm_log_debug_line("Files differ at offset %lu: %lu <> %lu", logical_current, - physical_1, physical_2); + rm_log_debug_line("Files differ at offset %" G_GUINT64_FORMAT + ": %"G_GUINT64_FORMAT "<> %" G_GUINT64_FORMAT, + logical_current, physical_1, physical_2); RM_RETURN(RM_LINK_NONE); } if(logical_next_1 != logical_next_2) { - rm_log_debug_line("Next offsets differ after %lu: %lu <> %lu", + rm_log_debug_line("Next offsets differ after %" G_GUINT64_FORMAT + ": %" G_GUINT64_FORMAT "<> %" G_GUINT64_FORMAT, logical_current, logical_next_1, logical_next_2); RM_RETURN(RM_LINK_NONE); } @@ -1231,8 +1234,9 @@ RmLinkType rm_util_link_type(char *path1, char *path2) { RM_RETURN(RM_LINK_MAYBE_REFLINK); } - rm_log_debug_line("Offsets match at logical=%lu, physical=%lu", logical_current, - physical_1); + rm_log_debug_line("Offsets match at logical=%" G_GUINT64_FORMAT + ", physical=%" G_GUINT64_FORMAT, + logical_current, physical_1); if(logical_next_1 == logical_current) { rm_log_debug_line( diff --git a/lib/xattr.c b/lib/xattr.c index b37f039e..5dc6bbb2 100644 --- a/lib/xattr.c +++ b/lib/xattr.c @@ -205,7 +205,7 @@ gboolean rm_xattr_read_hash(RmFile *file, RmSession *session) { if(FLOAT_SIGN_DIFF(g_ascii_strtod(mtime_buf, NULL), file->mtime, MTIME_TOL) < 0) { /* Data is too old and not useful, autoclean it */ - rm_log_debug_line("Checksum too old for %s, %li < %li", + rm_log_debug_line("Checksum too old for %s, %" G_GINT64_FORMAT " < %" G_GINT64_FORMAT, 
file->folder->basename, g_ascii_strtoll(mtime_buf, NULL, 10), (gint64)file->mtime); From 17a9a14b61ed4f621fadf9eee66797a9e307b9d1 Mon Sep 17 00:00:00 2001 From: daniel Date: Mon, 27 Nov 2017 10:27:19 +1000 Subject: [PATCH 163/180] hash-utility: silence sse warning --- lib/hash-utility.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/hash-utility.c b/lib/hash-utility.c index cd914e89..68395b15 100644 --- a/lib/hash-utility.c +++ b/lib/hash-utility.c @@ -194,7 +194,7 @@ int rm_hasher_main(int argc, const char **argv) { ////////// Implementation ////// -#if HAVE_MM_CRC32_U64 +#if HAVE_MM_CRC32_U64 && HAVE_BUILTIN_CPU_SUPPORTS rm_digest_enable_sse(TRUE); #endif From 19173affbe982b4cf13a618472eab308c9a5b0b8 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 27 Nov 2017 11:42:19 +1000 Subject: [PATCH 164/180] shredder: amuse @Awerick (https://github.com/sahib/rmlint/issues/248#issuecomment-346689036) --- lib/shredder.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/lib/shredder.c b/lib/shredder.c index 707395c4..80a4f79c 100644 --- a/lib/shredder.c +++ b/lib/shredder.c @@ -1272,6 +1272,16 @@ void rm_shred_group_find_original(RmSession *session, GQueue *files, } } +static gboolean rm_shred_has_duplicates(GQueue *group) { + for(GList *iter=group->head; iter; iter=iter->next) { + RmFile *file = iter->data; + if(!file->is_original) { + return TRUE; + } + } + return FALSE; +} + void rm_shred_forward_to_output(RmSession *session, GQueue *group) { rm_assert_gentle(group); rm_assert_gentle(group->head); @@ -1283,9 +1293,11 @@ void rm_shred_forward_to_output(RmSession *session, GQueue *group) { #endif /* Hand it over to the printing module */ - for(GList *iter = group->head; iter; iter = iter->next) { - RmFile *file = iter->data; - rm_fmt_write(file, session->formats, group->length); + if(rm_shred_has_duplicates(group)) { + for(GList *iter = group->head; iter; iter = iter->next) { + RmFile *file = iter->data; + 
rm_fmt_write(file, session->formats, group->length); + } } } From 51c9542f9f9b131b91319196a581bf65f3789ec0 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Mon, 27 Nov 2017 12:13:47 +1000 Subject: [PATCH 165/180] tests: update keep-hardlinked test for all-original group --- tests/test_options/test_keep_hardlinks.py | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/tests/test_options/test_keep_hardlinks.py b/tests/test_options/test_keep_hardlinks.py index b4e8d339..e9ffc44b 100644 --- a/tests/test_options/test_keep_hardlinks.py +++ b/tests/test_options/test_keep_hardlinks.py @@ -73,12 +73,5 @@ def test_keep_hardlinks_multiple_originals(): head, *data, footer = run_rmlint('--keep-hardlinked -k -m -S a ' + search_paths, use_default_dir=False) # files in folder a are tagged so should both be preserved; # files in folder b are hardlinks of the two originals so should also be preserved - assert len(data)==4 - assert data[0]["path"].endswith("file_a") - assert data[0]["is_original"] is True - assert data[1]["path"].endswith("file_y") - assert data[1]["is_original"] is True - assert data[2]["path"].endswith("file_b") - assert data[2]["is_original"] is True - assert data[3]["path"].endswith("file_z") - assert data[3]["is_original"] is True + # therefore all files are originals and so don't get reported + assert len(data)==0 From 4edc5c46d2a5075725a0e373a9fd1c040bb22a77 Mon Sep 17 00:00:00 2001 From: Stuart Powers Date: Mon, 27 Nov 2017 10:41:59 -0500 Subject: [PATCH 166/180] Add parens after print statement --- SConstruct | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SConstruct b/SConstruct index 2f6fa89c..34eaae52 100755 --- a/SConstruct +++ b/SConstruct @@ -705,7 +705,7 @@ def get_cpu_count(): # or `scons --jobs=` SetOption('num_jobs', get_cpu_count()) -print "Running with --jobs=" + repr(GetOption('num_jobs')) +print ("Running with --jobs=" + repr(GetOption('num_jobs'))) library = SConscript('lib/SConscript') programs = 
SConscript('src/SConscript', exports='library') From e1b3e2196b679722fa08196bbd6ef7073af1996b Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Sun, 31 Dec 2017 16:08:53 +0100 Subject: [PATCH 167/180] sh: Only use sudo in rmlint.sh if not root yet (see #271) --- lib/formats/sh.sh | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index bbfa32e6..e6a53e46 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -9,6 +9,14 @@ PROGRESS_TOTAL=0 RMLINT_BINARY="%s" +# Only use sudo if we're not root yet: +# (See: https://github.com/sahib/rmlint/issues/271) +SUDO_COMMAND="sudo" +if [ "$EUID" -eq 0 ] +then + SUDO_COMMAND="" +fi + # In special cases --equal needs special args to mimic # the behaviour that led to finding the duplicates below. RMLINT_EQUAL_EXTRA_ARGS="%s" @@ -208,7 +216,7 @@ clone() { echo "${COL_YELLOW}Cloning to: ${COL_RESET}" "$1" if [ -z "$DO_DRY_RUN" ]; then if [ -n "$DO_CLONE_READONLY" ]; then - sudo $RMLINT_BINARY --dedupe -r "$2" "$1" + $SUDO_COMMAND $RMLINT_BINARY --dedupe -r "$2" "$1" else $RMLINT_BINARY --dedupe "$2" "$1" fi From 07d0ac0bcdcc1787b841e0db89e1ccace064b7a3 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 23 Feb 2018 21:50:58 +1000 Subject: [PATCH 168/180] sh: posix-compliant way of getting UID --- lib/formats/sh.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index e6a53e46..e29065ff 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -12,7 +12,7 @@ RMLINT_BINARY="%s" # Only use sudo if we're not root yet: # (See: https://github.com/sahib/rmlint/issues/271) SUDO_COMMAND="sudo" -if [ "$EUID" -eq 0 ] +if [ "$(id -u)" -eq "0" ] then SUDO_COMMAND="" fi From 7df026f636aa3fe5e6efd7c5c14c4b8e8344dad6 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 23 Feb 2018 21:51:49 +1000 Subject: [PATCH 169/180] sh: use read -r to protect against
backslash mangling --- lib/formats/sh.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index e29065ff..efb27ebc 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -281,7 +281,7 @@ Rmlint was executed in the following way: Execute this script with -d to disable this informational message. Type any string to continue; CTRL-C, Enter or CTRL-D to abort immediately EOF - read eof_check + read -r eof_check if [ -z "$eof_check" ] then # Count Ctrl-D and Enter as aborted too. From 80ded7ae276493b5fc7498e34ab624a006117929 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 23 Feb 2018 21:52:22 +1000 Subject: [PATCH 170/180] sh: handle invalid options --- lib/formats/sh.sh | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index efb27ebc..7aa56160 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -315,7 +315,7 @@ do case $OPTION in h) usage - exit 1 + exit 0 ;; d) DO_ASK=false @@ -340,6 +340,9 @@ do q) DO_SHOW_PROGRESS= ;; + *) + usage + exit 1 esac done From fc54d822585eac768118eb3cb2cc16c67647dc5b Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 23 Feb 2018 21:53:10 +1000 Subject: [PATCH 171/180] sh: add warning message about shell script deletion --- lib/formats/sh.sh | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index 7aa56160..a595b89c 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -346,6 +346,11 @@ do esac done +if [ -z $DO_REMOVE ] +then + echo "#${COL_YELLOW} ///${COL_RESET}This script will be deleted after it runs${COL_YELLOW}///${COL_RESET}" +fi + if [ -z $DO_ASK ] then usage From aa9b063827da6903770a1d36debae71fb669d8a6 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 23 Feb 2018 21:53:54 +1000 Subject: [PATCH 172/180] sh: quoting of VAR's in echo statements to keep shellcheck happy --- lib/formats/sh.sh | 52 +++++++++++++++++++++++------------------------ 1 file 
changed, 26 insertions(+), 26 deletions(-) diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index a595b89c..6286890b 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -55,7 +55,7 @@ print_progress_prefix() { if [ $((PROGRESS_TOTAL)) -gt 0 ]; then PROGRESS_PERC=$((PROGRESS_CURR * 100 / PROGRESS_TOTAL)) fi - printf "$COL_BLUE[% 3d%%]$COL_RESET " $PROGRESS_PERC + printf "${COL_BLUE}[% 3d%%]${COL_RESET} " "$PROGRESS_PERC" if [ $# -eq "1" ]; then PROGRESS_CURR=$((PROGRESS_CURR+$1)) else @@ -66,7 +66,7 @@ print_progress_prefix() { handle_emptyfile() { print_progress_prefix - echo "${COL_GREEN}Deleting empty file:${COL_RESET}" "$1" + echo "${COL_GREEN}Deleting empty file:${COL_RESET} $1" if [ -z "$DO_DRY_RUN" ]; then rm -f "$1" fi @@ -74,7 +74,7 @@ handle_emptyfile() { handle_emptydir() { print_progress_prefix - echo "${COL_GREEN}Deleting empty directory: ${COL_RESET}" "$1" + echo "${COL_GREEN}Deleting empty directory: ${COL_RESET}$1" if [ -z "$DO_DRY_RUN" ]; then rmdir "$1" fi @@ -82,7 +82,7 @@ handle_emptydir() { handle_bad_symlink() { print_progress_prefix - echo "${COL_GREEN} Deleting symlink pointing nowhere: ${COL_RESET}" "$1" + echo "${COL_GREEN} Deleting symlink pointing nowhere: ${COL_RESET}$1" if [ -z "$DO_DRY_RUN" ]; then rm -f "$1" fi @@ -90,7 +90,7 @@ handle_bad_symlink() { handle_unstripped_binary() { print_progress_prefix - echo "${COL_GREEN} Stripping debug symbols of: ${COL_RESET}" "$1" + echo "${COL_GREEN} Stripping debug symbols of: ${COL_RESET}$1" if [ -z "$DO_DRY_RUN" ]; then strip -s "$1" fi @@ -98,7 +98,7 @@ handle_unstripped_binary() { handle_bad_user_id() { print_progress_prefix - echo "${COL_GREEN}chown ${USER}${COL_RESET}" "$1" + echo "${COL_GREEN}chown ${USER}${COL_RESET} $1" if [ -z "$DO_DRY_RUN" ]; then chown "$USER" "$1" fi @@ -106,7 +106,7 @@ handle_bad_user_id() { handle_bad_group_id() { print_progress_prefix - echo "${COL_GREEN}chgrp ${GROUP}${COL_RESET}" "$1" + echo "${COL_GREEN}chgrp ${GROUP}${COL_RESET} $1" if [ -z 
"$DO_DRY_RUN"
]; then chgrp "$GROUP" "$1" fi @@ -114,7 +114,7 @@ handle_bad_group_id() { handle_bad_user_and_group_id() { print_progress_prefix - echo "${COL_GREEN}chown ${USER}:${GROUP}${COL_RESET}" "$1" + echo "${COL_GREEN}chown ${USER}:${GROUP}${COL_RESET} $1" if [ -z "$DO_DRY_RUN" ]; then chown "$USER:$GROUP" "$1" fi @@ -138,18 +138,18 @@ check_for_equality() { original_check() { if [ ! -e "$2" ]; then - echo $COL_RED "^^^^^^ Error: original has disappeared - cancelling....." $COL_RESET + echo "${COL_RED}^^^^^^ Error: original has disappeared - cancelling.....${COL_RESET}" return 1 fi if [ ! -e "$1" ]; then - echo $COL_RED "^^^^^^ Error: duplicate has disappeared - cancelling....." $COL_RESET + echo "${COL_RED}^^^^^^ Error: duplicate has disappeared - cancelling.....${COL_RESET}" return 1 fi # Check they are not the exact same file (hardlinks allowed): if [ "$1" = "$2" ]; then - echo $COL_RED "^^^^^^ Error: original and duplicate point to the *same* path - cancelling....." $COL_RESET + echo "${COL_RED}^^^^^^ Error: original and duplicate point to the *same* path - cancelling.....${COL_RESET}" return 1 fi @@ -157,15 +157,15 @@ original_check() { if [ -z "$DO_PARANOID_CHECK" ]; then return 0 else - if [ $(check_for_equality "$1" "$2") -ne 0 ]; then - echo $COL_RED "^^^^^^ Error: files no longer identical - cancelling....." 
$COL_RESET + if [ "$(check_for_equality "$1" "$2")" -ne "0" ]; then + echo "${COL_RED}^^^^^^ Error: files no longer identical - cancelling.....${COL_RESET}" fi fi } cp_hardlink() { print_progress_prefix - echo "${COL_YELLOW}Hardlinking to original: ${COL_RESET}" "$1" + echo "${COL_YELLOW}Hardlinking to original: ${COL_RESET}$1" if original_check "$1" "$2"; then if [ -z "$DO_DRY_RUN" ]; then # If it's a directory cp will create a new copy into @@ -180,7 +180,7 @@ cp_hardlink() { cp_symlink() { print_progress_prefix - echo "${COL_YELLOW}Symlinking to original: ${COL_RESET}" "$1" + echo "${COL_YELLOW}Symlinking to original: ${COL_RESET}$1" if original_check "$1" "$2"; then if [ -z "$DO_DRY_RUN" ]; then touch -mr "$1" "$0" @@ -196,7 +196,7 @@ cp_symlink() { cp_reflink() { print_progress_prefix # reflink $1 to $2's data, preserving $1's mtime - echo "${COL_YELLOW}Reflinking to original: ${COL_RESET}" "$1" + echo "${COL_YELLOW}Reflinking to original: ${COL_RESET}$1" if original_check "$1" "$2"; then if [ -z "$DO_DRY_RUN" ]; then touch -mr "$1" "$0" @@ -213,7 +213,7 @@ clone() { print_progress_prefix # clone $1 from $2's data # note: no original_check() call because rmlint --dedupe takes care of this - echo "${COL_YELLOW}Cloning to: ${COL_RESET}" "$1" + echo "${COL_YELLOW}Cloning to: ${COL_RESET}$1" if [ -z "$DO_DRY_RUN" ]; then if [ -n "$DO_CLONE_READONLY" ]; then $SUDO_COMMAND $RMLINT_BINARY --dedupe -r "$2" "$1" @@ -225,12 +225,12 @@ clone() { skip_hardlink() { print_progress_prefix - echo "${COL_BLUE}Leaving as-is (already hardlinked to original): ${COL_RESET}" "$1" + echo "${COL_BLUE}Leaving as-is (already hardlinked to original): ${COL_RESET}$1" } skip_reflink() { print_progress_prefix - echo "{$COL_BLUE}Leaving as-is (already reflinked to original): ${COL_RESET}" "$1" + echo "${COL_BLUE}Leaving as-is (already reflinked to original): ${COL_RESET}$1" } user_command() { @@ -241,7 +241,7 @@ user_command() { remove_cmd() { print_progress_prefix - echo 
"${COL_YELLOW}Deleting: ${COL_RESET}" "$1" + echo "${COL_YELLOW}Deleting: ${COL_RESET}$1" if original_check "$1" "$2"; then if [ -z "$DO_DRY_RUN" ]; then rm -rf "$1" @@ -250,7 +250,7 @@ remove_cmd() { DIR=$(dirname "$1") while [ ! "$(ls -A "$DIR")" ]; do print_progress_prefix 0 - echo "${COL_GREEN}Deleting resulting empty dir: ${COL_RESET}" "$DIR" + echo "${COL_GREEN}Deleting resulting empty dir: ${COL_RESET}$DIR" rmdir "$DIR" DIR=$(dirname "$DIR") done @@ -261,7 +261,7 @@ remove_cmd() { original_cmd() { print_progress_prefix - echo "${COL_GREEN}Keeping: ${COL_RESET}" "$1" + echo "${COL_GREEN}Keeping: ${COL_RESET}$1" } ################## @@ -285,7 +285,7 @@ EOF if [ -z "$eof_check" ] then # Count Ctrl-D and Enter as aborted too. - echo $COL_RED "Aborted on behalf of the user." $COL_RESET + echo "${COL_RED}Aborted on behalf of the user.${COL_RESET}" exit 1; fi } @@ -359,9 +359,9 @@ fi if [ ! -z $DO_DRY_RUN ] then - echo "#$COL_YELLOW ////////////////////////////////////////////////////////////" $COL_RESET - echo "#$COL_YELLOW ///" $COL_RESET "This is only a dry run; nothing will be modified! " $COL_YELLOW "///" $COL_RESET - echo "#$COL_YELLOW ////////////////////////////////////////////////////////////" $COL_RESET + echo "#${COL_YELLOW} ////////////////////////////////////////////////////////////${COL_RESET}" + echo "#${COL_YELLOW} /// ${COL_RESET} This is only a dry run; nothing will be modified! 
${COL_YELLOW}///${COL_RESET}" + echo "#${COL_YELLOW} ////////////////////////////////////////////////////////////${COL_RESET}" fi ######### START OF AUTOGENERATED OUTPUT ######### From 16e9455bc60935f59a58c04331320970a22018a1 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Fri, 23 Feb 2018 22:22:12 +1000 Subject: [PATCH 173/180] sh: make RMLINT_EQUAL_EXTRA_ARGS work --- lib/formats/sh.c.in | 2 +- lib/formats/sh.sh | 6 +----- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/lib/formats/sh.c.in b/lib/formats/sh.c.in index bcd498fb..3853282a 100644 --- a/lib/formats/sh.c.in +++ b/lib/formats/sh.c.in @@ -331,9 +331,9 @@ static void rm_fmt_head(RmSession *session, RmFmtHandler *parent, FILE *out) { session->cfg->iwd, (session->cfg->joined_argv) ? (session->cfg->joined_argv) : "[unknown]", (session->cfg->full_argv0_path) ? (session->cfg->full_argv0_path) : "$(which rmlint)", - equal_extra_args, rm_util_get_username(), rm_util_get_groupname(), + equal_extra_args, (self->user_cmd) ? self->user_cmd : "echo 'no user command defined.'", (session->cfg->joined_argv) ? (session->cfg->joined_argv) : "unknown_commandline" ); diff --git a/lib/formats/sh.sh b/lib/formats/sh.sh index 6286890b..348d7940 100644 --- a/lib/formats/sh.sh +++ b/lib/formats/sh.sh @@ -17,10 +17,6 @@ then SUDO_COMMAND="" fi -# In special cases --equal needs special args to mimic -# the behaviour that lead to finding the duplicates below. -RMLINT_EQUAL_EXTRA_ARGS="%s" - USER='%s' GROUP='%s' @@ -131,7 +127,7 @@ check_for_equality() { echo $? else # Fallback to `rmlint --equal` for directories: - $RMLINT_BINARY -pp --equal $RMLINT_EQUAL_EXTRA_ARGS "$1" "$2" + "$RMLINT_BINARY" -pp --equal %s "$1" "$2" echo $? 
fi } From ec789684c8d0d2f07574b2878fff1ce8e84ded40 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 7 Mar 2018 08:09:09 +1000 Subject: [PATCH 174/180] cmdline: add option -0 | --stdin0 for null-separated stdin input --- lib/cfg.h | 2 ++ lib/cmdline.c | 29 ++++++++++++++++++++++------- 2 files changed, 24 insertions(+), 7 deletions(-) diff --git a/lib/cfg.h b/lib/cfg.h index cc75d1de..e568bd1a 100644 --- a/lib/cfg.h +++ b/lib/cfg.h @@ -87,6 +87,8 @@ typedef struct RmCfg { gboolean progress_enabled; gboolean list_mounts; gboolean replay; + gboolean read_stdin; + gboolean read_stdin0; int permissions; diff --git a/lib/cmdline.c b/lib/cmdline.c index 9401041d..c6128e2f 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -358,18 +358,27 @@ static GLogLevelFlags VERBOSITY_TO_LOG_LEVEL[] = {[0] = G_LOG_LEVEL_CRITICAL, [3] = G_LOG_LEVEL_MESSAGE | G_LOG_LEVEL_INFO, [4] = G_LOG_LEVEL_DEBUG}; +static bool rm_cmd_read_paths_from_stdin(RmSession *session, bool is_prefd, + bool null_separated) { + char delim = null_separated ? 
0 : '\n'; + + size_t buf_len = PATH_MAX; + char *path_buf = malloc(buf_len * sizeof(char)); -static bool rm_cmd_read_paths_from_stdin(RmSession *session, bool is_prefd) { - char path_buf[PATH_MAX]; - char *tokbuf = NULL; bool all_paths_read = true; + int path_len; + /* Still read all paths on errors, so the user knows all paths that failed */ - while(fgets(path_buf, PATH_MAX, stdin)) { - all_paths_read &= - rm_cfg_add_path(session->cfg, is_prefd, strtok_r(path_buf, "\n", &tokbuf)); + while((path_len = getdelim(&path_buf, &buf_len, delim, stdin)) >= 0) { + if(path_len > 0) { + /* replace returned delimiter with null */ + path_buf[path_len] = 0; + all_paths_read &= rm_cfg_add_path(session->cfg, is_prefd, path_buf); + } } + free(path_buf); return all_paths_read; } @@ -1157,7 +1166,7 @@ static bool rm_cmd_set_paths(RmSession *session, char **paths) { for(int i = 0; paths && paths[i]; ++i) { if(strcmp(paths[i], "-") == 0) { /* option '-' means read paths from stdin */ - all_paths_valid &= rm_cmd_read_paths_from_stdin(session, is_prefd); + cfg->read_stdin = TRUE; } else if(strcmp(paths[i], "//") == 0) { /* the '//' separator separates non-preferred paths from preferred */ is_prefd = !is_prefd; @@ -1168,6 +1177,11 @@ static bool rm_cmd_set_paths(RmSession *session, char **paths) { g_strfreev(paths); + if(cfg->read_stdin || cfg->read_stdin0) { + all_paths_valid &= + rm_cmd_read_paths_from_stdin(session, is_prefd, cfg->read_stdin0); + } + if(cfg->path_count == 0 && all_paths_valid) { /* Still no path set? 
- use `pwd` */ rm_cfg_add_path(session->cfg, is_prefd, cfg->iwd); @@ -1284,6 +1298,7 @@ bool rm_cmd_parse_args(int argc, char **argv, RmSession *session) { {"keep-hardlinked" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->keep_hardlinked_dupes , _("Keep hardlink that are linked to any original") , NULL} , {"partial-hidden" , 0 , EMPTY , G_OPTION_ARG_CALLBACK , FUNC(partial_hidden) , _("Find hidden files in duplicate folders only") , NULL} , {"mtime-window" , 'Z' , 0 , G_OPTION_ARG_DOUBLE , &cfg->mtime_window , _("Consider duplicates only equal when mtime differs at max. T seconds") , "T"} , + {"stdin0" , '0' , 0 , G_OPTION_ARG_NONE , &cfg->read_stdin0 , _("Read null-separated file list from stdin") , NULL} , /* COW filesystem deduplication support */ {"dedupe" , 0 , 0 , G_OPTION_ARG_NONE , &cfg->dedupe , _("Dedupe matching extents from source to dest (if filesystem supports)") , NULL} , From f26774fe7581e683bd2d7d4c4ae033fa6b0c85fd Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 7 Mar 2018 14:25:34 +1000 Subject: [PATCH 175/180] cmdline: respect location of '-' option relative to '//' --- lib/cmdline.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index c6128e2f..f1e9a787 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1165,8 +1165,9 @@ static bool rm_cmd_set_paths(RmSession *session, char **paths) { /* Check the directory to be valid */ for(int i = 0; paths && paths[i]; ++i) { if(strcmp(paths[i], "-") == 0) { - /* option '-' means read paths from stdin */ cfg->read_stdin = TRUE; + /* remember whether to treat stdin paths as preferred paths */ + stdin_paths_preferred = is_prefd; } else if(strcmp(paths[i], "//") == 0) { /* the '//' separator separates non-preferred paths from preferred */ is_prefd = !is_prefd; @@ -1178,8 +1179,9 @@ static bool rm_cmd_set_paths(RmSession *session, char **paths) { g_strfreev(paths); if(cfg->read_stdin || cfg->read_stdin0) { + /* option '-' means read paths from stdin */ 
all_paths_valid &= - rm_cmd_read_paths_from_stdin(session, is_prefd, cfg->read_stdin0); + rm_cmd_read_paths_from_stdin(session, stdin_paths_preferred, cfg->read_stdin0); } if(cfg->path_count == 0 && all_paths_valid) { From 05a0364ce82f8f2de4664ca4f88810bfb19c8b6c Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 7 Mar 2018 16:26:43 +1000 Subject: [PATCH 176/180] cmdline: fix handling of returned delimiter --- lib/cmdline.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index f1e9a787..36e2b2ff 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -373,7 +373,9 @@ static bool rm_cmd_read_paths_from_stdin(RmSession *session, bool is_prefd, while((path_len = getdelim(&path_buf, &buf_len, delim, stdin)) >= 0) { if(path_len > 0) { /* replace returned delimiter with null */ - path_buf[path_len] = 0; + if (path_buf[path_len - 1] == delim) { + path_buf[path_len - 1] = 0; + } all_paths_read &= rm_cfg_add_path(session->cfg, is_prefd, path_buf); } } From 0b0bd868b832b3630be46540c4912a0397fa49a7 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 7 Mar 2018 16:29:23 +1000 Subject: [PATCH 177/180] docs: update for new option `-0` --- docs/rmlint.1.rst | 4 +++- docs/tutorial.rst | 3 ++- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/rmlint.1.rst b/docs/rmlint.1.rst index 57d45d5c..8f26212d 100644 --- a/docs/rmlint.1.rst +++ b/docs/rmlint.1.rst @@ -395,7 +395,7 @@ Traversal Options original and duplicate are newer than ``timestamp`` you can use ``find(1)``: - * ``find -mtime -1 | rmlint - # find all files younger than a day`` + * ``find -mtime -1 -print0 | rmlint -0 # pass all files younger than a day to rmlint`` *Note:* you can make rmlint write out a compatible timestamp with: @@ -828,6 +828,8 @@ This is a collection of common usecases and other tricks: ``$ find /usr/lib -iname '*.so' -type f | rmlint - # find all duplicate .so files`` + ``$ find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as above 
but handles filenames with newline characters in them`` + ``$ find ~/pics -iname '*.png' | ./rmlint - # compare png files only`` * Limit file size range to investigate: diff --git a/docs/tutorial.rst b/docs/tutorial.rst index df9a294f..f93e371f 100644 --- a/docs/tutorial.rst +++ b/docs/tutorial.rst @@ -87,6 +87,7 @@ can also use external tools to feed ``rmlint's stdin``: .. code-block:: bash $ find pics/ -iname '*.png' | rmlint - + $ find pics/ -iname '*.png' -print0 | rmlint -0 # (also handles filenames with newline characters) Limit files by size using ``--size`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -775,7 +776,7 @@ Here's just a list of options that are nice to know, but not essential: .. code-block:: bash $ # find all files except everything under .git or .svn folders - $ find . -type d | grep -v '\(.git\|.svn\)' | rmlint - --hidden + $ find . -type d -print0 | grep -zv '\(.git\|.svn\)' | rmlint -0 --hidden But you would have checked the output anyways, wouldn't you? From b16e8febdadb6e56af1f4580952125542a71a2b8 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 7 Mar 2018 16:49:26 +1000 Subject: [PATCH 178/180] tests: add test for `-0` option --- tests/test_options/test_stdin.py | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/tests/test_options/test_stdin.py b/tests/test_options/test_stdin.py index 228981c1..157805d4 100644 --- a/tests/test_options/test_stdin.py +++ b/tests/test_options/test_stdin.py @@ -32,6 +32,29 @@ def test_stdin_read(): assert data[3]['path'].endswith('c') assert footer['total_lint_size'] == 12 +@with_setup(usual_setup_func, usual_teardown_func) +def test_stdin_read_newlines(): + path_a = create_file('1234', 'a') + '\0' + path_b = create_file('1234', 'name\nwith\nnewlines') + '\0' + path_c = create_file('1234', '.hidden') + '\0' + + subdir = 'look-in-here' + create_file('1234', subdir + '/c') + subdir_path = os.path.join(TESTDIR_NAME, subdir) + + proc = subprocess.Popen( + ['./rmlint', '-0', subdir_path, 
'-o', 'json', '-S', 'a', '--hidden'], + stdin=subprocess.PIPE, + stdout=subprocess.PIPE + ) + data, _ = proc.communicate((path_a + path_b + path_c).encode('utf-8')) + head, *data, footer = json.loads(data.decode('utf-8')) + + assert data[0]['path'].endswith('.hidden') + assert data[1]['path'].endswith('a') + assert data[2]['path'].endswith('c') + assert data[3]['path'].endswith('newlines') + assert footer['total_lint_size'] == 12 @with_setup(usual_setup_func, usual_teardown_func) def test_path_starting_with_dash(): From f5c48059657e1cd3796587069e4e4c77d9569399 Mon Sep 17 00:00:00 2001 From: SeeSpotRun Date: Wed, 7 Mar 2018 16:56:42 +1000 Subject: [PATCH 179/180] cmdline: oops missed a line in the commit --- lib/cmdline.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index 36e2b2ff..68f064b9 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -1161,13 +1161,14 @@ static bool rm_cmd_set_cmdline(RmCfg *cfg, int argc, char **argv) { static bool rm_cmd_set_paths(RmSession *session, char **paths) { bool is_prefd = false; bool all_paths_valid = true; + bool stdin_paths_preferred = false; RmCfg *cfg = session->cfg; /* Check the directory to be valid */ for(int i = 0; paths && paths[i]; ++i) { if(strcmp(paths[i], "-") == 0) { - cfg->read_stdin = TRUE; + cfg->read_stdin = true; /* remember whether to treat stdin paths as preferred paths */ stdin_paths_preferred = is_prefd; } else if(strcmp(paths[i], "//") == 0) { From 403eb4c582e358d117c9f00c3b777b16809a4d0e Mon Sep 17 00:00:00 2001 From: Chris Pahl Date: Wed, 25 Apr 2018 18:14:04 +0200 Subject: [PATCH 180/180] bump version 2.6.2 --- .version | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.version b/.version index 5d2af625..fb914671 100644 --- a/.version +++ b/.version @@ -1 +1 @@ -2.6.1 Penetrating Pineapple +2.6.2 Penetrating Pineapple
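
Note on PATCH 174/180 and 176/180: the stdin handling added there reads one path per delimiter-terminated record (newline for `-`, NUL for `-0`) and, after the fix in 176/180, strips the delimiter only when the record actually ends with one. The following is a minimal Python sketch of that splitting rule, not rmlint's implementation. `split_paths` is a hypothetical helper, and dropping empty records is an assumption made for this sketch only.

```python
from typing import List


def split_paths(data: bytes, null_separated: bool = False) -> List[bytes]:
    """Split raw stdin bytes into paths, one per delimiter-terminated record.

    Hypothetical helper for illustration; it mirrors the idea of reading
    records with a delimiter and stripping at most one trailing delimiter.
    """
    delim = b"\0" if null_separated else b"\n"
    paths = []
    start = 0
    while start < len(data):
        end = data.find(delim, start)
        if end == -1:
            # Last record has no trailing delimiter: keep it unchanged.
            record, start = data[start:], len(data)
        else:
            # Record ends with the delimiter: strip exactly that one byte.
            record, start = data[start:end], end + 1
        if record:
            # Assumption of this sketch: empty records are ignored.
            paths.append(record)
    return paths


if __name__ == "__main__":
    # Same shape of input as test_stdin_read_newlines in PATCH 178/180.
    print(split_paths(b"a\0name\nwith\nnewlines\0.hidden\0", null_separated=True))
```

With NUL as the delimiter, embedded newlines stay part of the path, which is the case `find ... -print0 | rmlint -0` is meant to cover. Checking for the delimiter before stripping matters for input whose final record is unterminated, for example the output of `printf 'a\nb'`.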