"uniqueurls" is a Python-based tool used for decluttering a list of URLs by performing string similarity comparisons. It generates a list of unique URLs by comparing the similarity of path components of URLs.
It helps professionals in the realm of pentesting and bug hunting streamline their process by efficiently filtering out similar URLs from a large dataset, thus enhancing the effectiveness and productivity of their workflow.
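As an illustration of the underlying idea (not the tool's own code), the fuzzywuzzy library it relies on scores the similarity of two strings from 0 to 100:
from fuzzywuzzy import fuzz

# Paths that differ only in a trailing identifier score high...
print(fuzz.ratio("/event/4v35s", "/event/4v38q"))
# ...while structurally different paths score much lower.
print(fuzz.ratio("/people/AMWarren", "/people/amy082600"))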
- Python 3
- Python packages: sqlite3, urllib.parse, tempfile, fuzzywuzzy, argparse (only fuzzywuzzy is a third-party package; the others ship with the Python standard library)
- Clone the GitHub repository:
git clone https://github.com/Nishantbhagat57/uniqueurls.git
cd uniqueurls
- Install the required Python packages:
pip3 install -r requirements.txt
uniqueurls deduplicates URLs based on string similarity, in the following steps (a Python sketch of the logic follows the list):
- URLs are read from an input text file and stored in a temporary SQLite database.
- URLs are then split into their origins (domain or hostname) and paths.
- For each origin, paths are compared with each other for similarity. If the similarity ratio between two paths is less than a specified percentage (default is 85%), the URLs are considered unique.
- Unique URLs are printed out, and any URL found to be similar (i.e. a duplicate) is optionally logged to a debug file for review.
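The exact logic lives in uniqueurls.py; the snippet below is only a minimal sketch of the comparison step described above, assuming fuzzywuzzy's ratio() for scoring. It skips the temporary SQLite storage, and the function name and structure are illustrative rather than the tool's actual code.
from urllib.parse import urlparse
from fuzzywuzzy import fuzz

def dedupe(urls, ratio_threshold=85):
    # Sketch only: keep one URL per group of similar paths within the same origin.
    kept_paths = {}              # origin -> paths already kept for that origin
    unique, removed = [], []
    for url in urls:
        parsed = urlparse(url.strip())
        origin, path = parsed.netloc, parsed.path
        kept = kept_paths.setdefault(origin, [])
        # A URL is kept only if its path is sufficiently different
        # from every path already kept for the same origin.
        if all(fuzz.ratio(path, other) < ratio_threshold for other in kept):
            kept.append(path)
            unique.append(url)
        else:
            removed.append(url)  # these would go to the optional debug log
    return unique, removed
Here, unique corresponds to what the tool prints and removed to what the debug file would contain.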
While there are other deduplication tools available on GitHub, uniqueurls stands out for the following reasons:
- Contextual Analysis: Rather than just comparing URLs as a whole, uniqueurls examines the path of URLs for each specific domain/hostname, ensuring a thorough and context-based review.
- Configurable Similarity Ratio: The tool allows the user to define the ratio of similarity, providing flexibility in determining how stringent the comparison should be.
- Debug Log: Removed URLs can be logged for analysis, which aids in understanding the tool's operation and ensures transparency.
python3 uniqueurls.py [url_file] [-debug [debug_file]] [-ratio [similarity_ratio]]
- url_file: (required) The path to the file containing the URLs to be processed.
- -debug [debug_file]: (optional) The path to the debug file where removed URLs will be logged.
- -ratio [similarity_ratio]: (optional) The ratio used for string similarity checks. Defaults to 85 if not specified.
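These flags map naturally onto Python's argparse module; the following is a hypothetical declaration of the same interface (argument names and help strings are assumptions, not taken from the tool's source):
import argparse

parser = argparse.ArgumentParser(description="Deduplicate URLs by path similarity")
parser.add_argument("url_file", help="file containing the URLs to process")
parser.add_argument("-debug", metavar="debug_file",
                    help="file where removed (similar) URLs are logged")
parser.add_argument("-ratio", metavar="similarity_ratio", type=int, default=85,
                    help="similarity threshold for comparisons (default: 85)")
args = parser.parse_args()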
Input file urls.txt contains:
http://example.com/event/4v35s
http://example.com/event/4v38q
http://example.com/event/4v3bw
http://example.com/people/AMWarren
http://example.com/people/amy082600
http://xyz.example.com/job/Chicago-(US)-JobIL60606/772758
http://xyz.example.com/job/Chicago-(US)-JobIL60611/772757
Command:
python3 uniqueurls.py urls.txt -debug debug.txt -ratio 90
Output:
http://example.com/event/4v35s
http://example.com/people/AMWarren
http://example.com/people/amy082600
http://xyz.example.com/job/Chicago-(US)-JobIL60606/772758
debug.txt would contain:
http://example.com/event/4v38q
http://example.com/event/4v3bw
http://xyz.example.com/job/Chicago-(US)-JobIL60611/772757
This project is licensed under the MIT License - see the LICENSE.md file for details.