forked from simonsj/fdupes-jody
-
Notifications
You must be signed in to change notification settings - Fork 0
A powerful duplicate file finder and an enhanced fork of 'fdupes'.
License
razvanco13/jdupes
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Introduction -------------------------------------------------------------------------- jdupes is a program for identifying and taking actions upon duplicate files. This fork known as 'jdupes' is heavily modified from and improved over the original. See CHANGES for details. A WORD OF WARNING: jdupes IS NOT a drop-in compatible replacement for fdupes! Do not blindly replace fdupes with jdupes in scripts and expect everything to work the same way. Option availability and meanings differ between the two programs. For example, the -I switch in jdupes means "isolate" and blocks intra-argument matching, while in fdupes it means "immediately delete files during scanning without prompting the user." Why use jdupes instead of the original fdupes or other forks? -------------------------------------------------------------------------- The biggest reason is raw speed. In testing on various data sets, jdupes is over 7 times faster than fdupes-1.51 on average. jdupes is the only Windows port of fdupes. Most duplicate scanners built on Linux and other UNIX-like systems do not compile for Windows out-of-the-box and even if they do, they don't support Unicode and other Windows-specific quirks and features. jdupes is generally stable. All releases of jdupes are compared against a known working reference versions of fdupes or jdupes to be certain that output does not change. You get the benefits of an aggressive development process without putting your data at increased risk. Code in jdupes is written with data loss avoidance as the highest priority. If a choice must be made between being aggressive or careful, the careful way is always chosen. jdupes includes features that are not always found elsewhere. Examples of such features include btrfs block-level deduplication and control over which file is kept when a match set is automatically deleted. jdupes is not afraid of dropping features of low value; a prime example is the -1 switch which outputs all matches in a set on one line, a feature which was found to be useless in real-world tests and therefore thrown out. The downside is that jdupes development is never guaranteed to be bug-free! If the program eats your dog or sets fire to your lawn, the authors cannot be held responsible. If you notice a bug, please report it. While jdupes maintains some degree of compatibility with fdupes from which it was originally derived, there is no guarantee that it will continue to maintain such compatibility in the future. However, compatibility will be retained between minor versions, i.e. jdupes-1.6 and jdupes-1.6.1 should not have any significant differences in results with identical command lines. What jdupes is not: a similar (but not identical) file finding tool -------------------------------------------------------------------------- Please note that jdupes ONLY works on 100% exact matches. It does not have any sort of "similarity" matching, nor does it know anything about any specific file formats such as images or sounds. Something as simple as a change in embedded metadata such as the ID3 tags in an MP3 file or the EXIF information in a JPEG image will not change the sound or image presented to the user when opened, but technically it makes the file no longer identical to the original. Plenty of excellent tools already exist to "fuzzy match" specific file types using knowledge of their file formats to help. There are no plans to add this type of matching to jdupes. Usage -------------------------------------------------------------------------- Usage: jdupes [options] DIRECTORY... -1 --one-file-system do not match files on different filesystems/devices -A --nohidden exclude hidden files from consideration -B --dedupe Send matches to btrfs for block-level deduplication -d --delete prompt user for files to preserve and delete all others; important: under particular circumstances, data may be lost when using this option together with -s or --symlinks, or when specifying a particular directory more than once; refer to the documentation for additional information -f --omitfirst omit the first file in each set of matches -h --help display this help message -H --hardlinks treat hard-linked files as duplicate files. Normally hard links are treated as non-duplicates for safety -i --reverse reverse (invert) the match sort order -I --isolate files in the same specified directory won't match -l --linksoft make relative symlinks for duplicates w/o prompting -L --linkhard hard link all duplicate files without prompting Windows allows a maximum of 1023 hard links per file -m --summarize summarize dupe information -N --noprompt together with --delete, preserve the first file in each set of duplicates and delete the rest without prompting the user -o --order=BY select sort order for output, linking and deleting; by -O --paramorder Parameter order is more important than selected -O sort mtime (BY=time) or filename (BY=name, the default) -p --permissions don't consider files with different owner/group or permission bits as duplicates -r --recurse for every directory given follow subdirectories encountered within -R --recurse: for each directory given after this option follow subdirectories encountered within (note the ':' at the end of the option, manpage for more details) -s --symlinks follow symlinks -S --size show size of duplicate files -q --quiet hide progress indicator -v --version display jdupes version and license information -x --xsize=SIZE exclude files of size < SIZE bytes from consideration --xsize=+SIZE '+' specified before SIZE, exclude size > SIZE K/M/G size suffixes can be used (case-insensitive) -z --zeromatch consider zero-length files to be duplicates -Z --softabort If the user aborts (i.e. CTRL-C) act on matches so far The -n/--noempty option was removed for safety. Matching zero-length files as duplicates now requires explicit use of the -z/--zeromatch option instead. Duplicate files are listed together in groups with each file displayed on a Separate line. The groups are then separated from each other by blank lines. The -s/--symlinks option will treat symlinked files as regular files, but direct symlinks will be treated as if they are hard linked files and the -H/--hardlinks option will apply to them in the same manner. This option used to follow symlinked directories but in the current implementation this behavior has been disabled due to it being too dangerous. When using -d or --delete, care should be taken to insure against accidental data loss. While no information will be immediately lost, using this option together with -s or --symlink can lead to confusing information being presented to the user when prompted for files to preserve. Specifically, a user could accidentally preserve a symlink while deleting the file it points to. A similar problem arises when specifying a particular directory more than once. All files within that directory will be listed as their own duplicates, leading to data loss should a user preserve a file without its "duplicate" (the file itself!) The -I/--isolate option attempts to block matches that are contained in the same specified directory parameter on the command line. Due to the underlying nature of the jdupes algorithm, a lot of matches will be blocked by this option that probably should not be. This code could use improvement. Hard and soft (symbolic) linking status symbols and behavior -------------------------------------------------------------------------- A set of arrows are used in file linking to show what action was taken on each link candidate. These arrows are as follows: ----> File was hard linked to the first file in the duplicate chain -@@-> File was symlinked to the first file in the chain -==-> Already a hard link to the first file in the chain -//-> File linking failed due to an error during the linking process If your data set has linked files and you do not use -H to always consider them as duplicates, you may still see linked files appear together in match sets. This is caused by a separate file that matches with linked files independently and is the correct behavior. See notes below on the "triangle problem" in jdupes for technical details. Microsoft Windows platform-specific notes -------------------------------------------------------------------------- The Windows port does not support Unicode, only ANSI file names. This is because Unicode support on Windows is difficult to add to existing code without making it very messy or breaking things. Support is eventually planned for Unicode on Windows. Windows has a hard limit of 1024 hard links per file. There is no way to change this. The documentation for CreateHardLink() states: "The maximum number of hard links that can be created with this function is 1023 per file. If more than 1023 links are created for a file, an error results." (The number is actually 1024, but they're ignoring the first file.) The current jdupes algorithm's "triangle problem" -------------------------------------------------------------------------- Pairs of files are excluded individually based on how the two files compare. For example, if --hardlinks is not specified then two files which are hard linked will not match one another for duplicate scanning purposes. The problem with only examining files in pairs is that certain circumstances will lead to the exclusion being overridden. Let's say we have three files with identical contents: a/file1 a/file2 a/file3 and 'a/file1' is linked to 'a/file3'. Here's how 'jdupes a/' sees them: --- Are 'a/file1' and 'a/file2' matches? Yes [point a/file1->duplicates to a/file2] Are 'a/file1' and 'a/file3' matches? No (hard linked already, -H off) Are 'a/file2' and 'a/file3' matches? Yes [point a/file2->duplicates to a/file3] --- Now you have the following duplicate list: a/file1->duplicates ==> a/file2->duplicates ==> a/file3 The solution is to split match sets into multiple sets, but doing this will also remove the guarantee that files will only ever appear in one match set and could result in data loss if handled improperly. In the future, options for "greedy" and "sparse" may be introduced to switch between allowing triangle matches to be in the same set vs. splitting sets after matching finishes without the "only ever appears once" guarantee. Does jdupes meet the "Good Practice when Deleting Duplicates" by rmlint? -------------------------------------------------------------------------- Yes. If you've not read this list of cautions, it is available at http://rmlint.readthedocs.io/en/latest/cautions.html Here's a breakdown of how jdupes addresses each of the items listed. "Backup your data" "Measure twice, cut once" These guidelines are for the user of duplicate scanning software, not the software itself. Back up your files regularly. Use jdupes to print a list of what is found as duplicated and check that list very carefully before automatically deleting the files. "Beware of unusual filename characters" The only character that poses a concern in jdupes is a newline '\n' and that is only a problem because the duplicate set printer uses them to separate file names. Actions taken by jdupes are not parsed like a command line, so spaces and other weird characters in names aren't a problem. Escaping the names properly if acting on the printed output is a problem for the user's shell script or other external program. "Consider safe removal options" This is also an exercise for the user. "Traversal Robustness" jdupes tracks each directory traversed by dev:inode pair to avoid adding the contents of the same directory twice. This prevents the user from being able to register all of their files twice by duplicating an entry on the command line. Symlinked directories are not followed. Files are renamed to a temporary name before any linking is done and if the link operation fails they are renamed back to the original name. "Collision Robustness" jdupes uses jodyhash for file data hashing. This hash is extremely fast with a low collision rate, but it still encounters collisions as any hash function will ("secure" or otherwise) due to the "birthday problem." This is why jdupes performs a full-file verification before declaring a match. It is slower than matching on hashes alone, but the birthday problem puts all data sets larger than the hash at risk of collision, meaning a false duplicate detection and data loss. The slower completion time is not as important as data integrity. Checking for a match based on hashes alone is irresponsible, and using secure hashes like MD5 or the SHA families is orders of magnitude slower than jodyhash while still suffering from the risk brought about by the birthday problem. In short, the birthday problem means that if you have 365 days in a year and 366 people, the having at least two birthdays on the same day is guaranteed; likewise, even though SHA512 is a 512-bit (64-byte) wide hash, there are guaranteed to be at least 256 pairs of data streams that causes a collision once any of the data streams being hashed for comparison is 65 bytes (520 bits) or larger. "Unusual Characters Robustness" jdupes does not protect the user from putting ASCII control characters in their file names; they will mangle the output if printed, but they can still be operated upon by the actions (delete, link, etc.) in jdupes. "Seek Thrash Robustness" jdupes uses an I/O chunk size that is optimized for reading as much as possible from disk at once to take advantage of high sequential read speeds in traditional rotating media drives while balancing against the significantly higher rate of CPU cache misses triggered by an excessively large I/O buffer size. Enlarging the I/O buffer further may allow for lots of large files to be read with less head seeking, but the CPU cache misses slow the algorithm down and memory usage increases to hold these large buffers. jdupes is benchmarked periodically to make sure that the chosen I/O chunk size is the best compromise for a wide variety of data sets. "Memory Usage Robustness" This is a very subjective concern considering that even a cell phone in someone's pocket has at least 1GB of RAM, however it still applies in the embedded device world where 32MB of RAM might be all that you can have. Even when processing a data set with over a million files, jdupes memory usage (tested on Linux x86_64 with -O3 optimization) doesn't exceed 2GB. A low memory mode can be chosen at compile time to reduce overall memory usage with a small performance penalty. Contact Information -------------------------------------------------------------------------- For all jdupes inquiries, contact Jody Bruchon <[email protected]> Please DO NOT contact Adrian Lopez about issues with jdupes. Legal Information and Software License -------------------------------------------------------------------------- jdupes is Copyright (C) 2015-2017 by Jody Bruchon <[email protected]> Derived from the original 'fdupes' (C) 1999-2017 by Adrian Lopez Includes other code libraries which are (C) 2015-2017 by Jody Bruchon The MIT License Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
About
A powerful duplicate file finder and an enhanced fork of 'fdupes'.
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published
Languages
- C 89.5%
- Roff 5.1%
- Makefile 3.2%
- Shell 2.2%