# wordlist-dedup

wordlist-dedup is a program written in Rust to deduplicate wordlists. Duh.

I tried to deduplicate the lines of a huge wordlist (>80 GB) with GNU/coreutils uniq. At first, everything seemed to be hunky-dory. Before I deleted the original, I checked the size difference between the two files: the result was about half the size of the original, not just about 5% smaller, as I had initially expected.

To confirm this, I wrote a program to count the duplicates, and bingo! The original file contained just a smidgen over 3% duplicates.

Maybe I did something wrong, or my PC was not able to handle the memory requirements of uniq. I don't know why it needs that much memory and has such low throughput. The wordlist must be sorted before using uniq anyway, so all uniq has to do is compare line n with line n-1, and if they are the same, skip writing one of them to the output file. That simple.
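To illustrate the idea, here is a minimal sketch of such a loop in Rust. It is not the actual implementation, just the approach, assuming plain line-based UTF-8 input; the file names are placeholders:

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};

fn main() -> io::Result<()> {
    // Placeholder file names; the real tool takes them as CLI arguments.
    let reader = BufReader::new(File::open("wordlist_sorted.txt")?);
    let mut writer = BufWriter::new(File::create("wordlist_uniq.txt")?);

    let mut previous: Option<String> = None;
    let mut duplicates: u64 = 0;

    for line in reader.lines() {
        let line = line?;
        // The input is sorted, so duplicates are always adjacent:
        // comparing with the previous line is enough.
        if previous.as_deref() == Some(line.as_str()) {
            duplicates += 1;
        } else {
            writeln!(writer, "{line}")?;
            previous = Some(line);
        }
    }

    eprintln!("Found {duplicates} duplicates.");
    Ok(())
}
```

A streaming loop like this holds only one line at a time, so its memory usage stays flat no matter how large the file is.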

So, here is wordlist-dedup. It does exactly that. On my old laptop it writes roughly 400 MB/s of deduplicated output. When only counting the number of duplicates, it reads about 1.4 GB/s of data. Both figures were measured with a line length of 200 UTF-8 characters.

I use it together with a script that first runs GNU/coreutils sort (which works as intended) and then wordlist-dedup.

## Command line tool

```console
$ wordlist-dedup --help
Deduplicate Pre-Sorted Wordlists

Usage: wordlist-dedup [OPTIONS] <SRC> [DEST]

Arguments:
  <SRC>   The pre-sorted source file, which may contain duplicated lines
  [DEST]  The destination file to write the deduplicated output to

Options:
  -c, --count    Only count the number of duplicates
  -h, --help     Print help
  -V, --version  Print version
```

Please keep in mind that the file must be sorted beforehand; otherwise, it may not find all duplicates. You can use GNU/coreutils sort, which does a fine job even when the file is larger than the available memory. wordlist-dedup uses whatever memory it gets. If there is a bunch of memory available, it might use it (to an extent); if there is barely any memory left, it will still work, just slower.

You can use it to deduplicate a file like this:

```console
$ wordlist-dedup some_file_with_dups.txt new_file_to_write_to.txt
  ⠏ Checking for duplicates... (0.8 s)
  🚀 Found 410 duplicates in 2.1 s.
```

During operation it shows a spinner at the front of the line to signal that the program is still running. When it is done, it tells you the number of duplicated lines.

If you use it with only one argument, like `file.ext`, it names the destination file `file_uniq.ext`.
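For illustration, deriving that default name could look roughly like this. The helper `derive_default_dest` is hypothetical and not taken from the actual source:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: turns "file.ext" into "file_uniq.ext"
/// ("file" without an extension becomes "file_uniq").
fn derive_default_dest(src: &Path) -> PathBuf {
    let stem = src.file_stem().and_then(|s| s.to_str()).unwrap_or("out");
    let name = match src.extension().and_then(|e| e.to_str()) {
        Some(ext) => format!("{stem}_uniq.{ext}"),
        None => format!("{stem}_uniq"),
    };
    src.with_file_name(name)
}

fn main() {
    assert_eq!(
        derive_default_dest(Path::new("rockyou.txt")),
        PathBuf::from("rockyou_uniq.txt")
    );
}
```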

Keep in mind that it was made for one job: deduplicating sorted wordlists. It might work in other scenarios as well.

If you would like to use my scripts to dedup many files in one folder structure, check out my Gist.

## Installation

Make sure you have a recent version of the Rust compiler installed. To build the program, run `cargo build --release`.

The binary is then located in the `target` folder, which is created while building the program: `target/release/wordlist-dedup`.

If you are using a GNU/Linux based operating system, you can also use the binary already built by the CD workflow. Head over to the releases tab.

## Semantic Versioning

This repository uses SemVer for its release cycle.

## Branching Model

This repository uses the git-flow branching model by Vincent Driessen. It has two branches with infinite lifetime:

- master gets updated on every release.
- develop is the merging branch.

## License

Copyright © 2020-2023 Michael Sasser [email protected]. Released under the GPLv3 license.