Adding Jarelllama's Scam Blocklists & other contributions #184

jarelllama · 2024-04-17T17:23:21Z

jarelllama
Apr 17, 2024

Contact Details

No response

What's your idea?

Hi, I'm the maintainer of Jarelllama's Scam Blocklist, a blocklist of newly created scam and phishing domains that are retrieved automatically every day using the Google Search API, automated NRD detection, and other public sources.

This blocklist aims to be an alternative to blocking all newly registered domains (NRDs), since many, but not all, NRDs are malicious. A variety of sources are integrated to detect new malicious domains within a short time span of their registration date.

Taken from my README, this is the current filtering process:

The domains collated from all sources are filtered against an actively maintained whitelist (scam reporting sites, forums, vetted stores, etc.)
The domains are checked against the Tranco Top Sites Ranking for potential false positives which are then vetted manually
Common subdomains like 'www' are stripped to make use of wildcard matching for all other subdomains. The list of subdomains checked for can be viewed here: subdomains.txt
Only domains are included in the blocklist; IP addresses are manually checked for resolving DNS records and URLs are stripped down to their domains
Entries that require manual verification/intervention are sent in a Telegram notification for fast remediations
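
As a rough illustration of the whitelist and subdomain steps above, here is a minimal sketch. The function names, the three sample subdomains, and the exact-match whitelist behaviour are assumptions made for the example, not taken from the project's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: names and sample data are illustrative only.

# Strip a leading common subdomain so that wildcard matching on the
# root domain covers it. Only three sample subdomains are handled here;
# the real list lives in subdomains.txt.
strip_subdomains() {
    # Require at least one more dot after the prefix so "m.com" is kept.
    sed -E 's/^(www|m|shop)\.(.+\..+)$/\2/'
}

# Drop domains that appear in the whitelist file ($1), one per line.
# Exact whole-line matching is an assumption made for this sketch.
filter_whitelist() {
    grep -vxFf "$1" || true   # grep exits 1 when nothing is left
}

# Read domains on stdin, write the cleaned, de-duplicated list to stdout.
clean_domains() {
    strip_subdomains | filter_whitelist "$1" | sort -u
}

# Demo with made-up domains.
whitelist="$(mktemp)"
printf '%s\n' 'forum.example' > "$whitelist"
printf '%s\n' 'www.scam-shop.example' 'forum.example' 'shop.fake-store.example' \
    | clean_domains "$whitelist"
rm -f "$whitelist"
```

The guard in the `sed` expression is there so that a bare domain that happens to start with a listed label is not truncated.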

Dead domains and parked domains are automatically removed daily as well. More about the blocklist's retrieval and filtering process can be found in the README.

These are the formats I currently offer:

Format	Syntax
Adblock Plus	||scam.com^
Dnsmasq	local=/scam.com/
Unbound	local-zone: "scam.com." always_nxdomain
Wildcard Asterisk	*.scam.com
Wildcard Domains	scam.com
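
For readers who want to see how one plain domain list could be turned into each syntax in the table, here is a small sketch. The function name `format_domains` and the format keywords are hypothetical; only the output syntaxes come from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert plain domains on stdin into one of the
# syntaxes from the table. $1 selects the format.
format_domains() {
    case "$1" in
        adblock)           awk '{ print "||" $0 "^" }' ;;
        dnsmasq)           awk '{ print "local=/" $0 "/" }' ;;
        unbound)           awk '{ print "local-zone: \"" $0 ".\" always_nxdomain" }' ;;
        wildcard_asterisk) awk '{ print "*." $0 }' ;;
        wildcard_domains)  cat ;;
        *) echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Demo: print the sample domain in every format.
for fmt in adblock dnsmasq unbound wildcard_asterisk wildcard_domains; do
    printf '%s\n' 'scam.com' | format_domains "$fmt"
done
```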

Please do let me know your thoughts!

Code of Conduct

I agree to follow this project's Code of Conduct

T145 · 2024-04-18T16:06:08Z

T145
Apr 18, 2024
Maintainer

Hey! This looks neat, and I'm actually surprised I haven't found it yet. I'll need some time to look over things thoroughly at some point. I'm using PhishStats already, and do something similar to what you're doing with my maintained version of the "Not on my Shift" blocklist.

My present thoughts since you asked:

Rather than downloading data from a few third parties I'd go straight to the source for NRDs. NRD Downloader is great to use.
With your ScamAdviser processing, I'm fairly certain you can just use URL globbing rather than using a URL array.
You can use my Docker image if you'd like to have access to better CLI utilities: https://github.com/T145/black-mirror/blob/master/.github/workflows/publish_lists.yml#L18
I would just assign execution_time once before your update methods are called rather than assigning it in every update method. To my understanding, you get the timestamp from the time the last update method is called rather than the first, which is a bug.
When it comes to parallel downloads, I don't think many utilities beat Aria2.
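
The execution_time suggestion above could look roughly like the sketch below. The update function names and their output are hypothetical stand-ins; the only point is that the timestamp is assigned once in the caller and read by every update method.

```bash
#!/usr/bin/env bash
# Hypothetical update methods; real ones would download and process a
# source. They only read execution_time, they never assign it.
update_source_a() { echo "source_a,${execution_time}"; }
update_source_b() { echo "source_b,${execution_time}"; }

run_updates() {
    # Assigned once, before any update method is called. Bash locals are
    # visible to functions called from here, so every method sees the
    # same start timestamp.
    local execution_time
    execution_time="$(date +%s)"

    update_source_a
    update_source_b
}

run_updates
```

Declaring the variable with `local` on its own line and assigning it afterwards keeps the exit status of `date` from being masked by `local`.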

I'll need time to check the list content but it looks good so far. I'd also be open to integrating your project's functionality into Black Mirror if you'd like to join on as a contributor. If the contributing documentation isn't helpful I'm happy to answer any questions.

jarelllama · 2024-04-19T02:50:03Z

jarelllama
Apr 19, 2024
Author

Thanks @T145 for the code review. Your input is incredibly valuable seeing how I'm the single maintainer and doing my own code reviews is less effective than input from someone else.

Feel free to give feedback for any of the other scripts/workflows.

Regarding contributing, please do let me know in what other ways I could help contribute besides lending my blocklist as a source. If it's helpful I can begin a pull request for data/v2/manifest.json.

Thanks again!

T145 · 2024-04-19T17:36:40Z

T145
Apr 19, 2024
Maintainer

Sure, if you'd like to make a PR go for it.

My thoughts on contributing more to the project involve adding what you do w/ some sources directly to Black Mirror. This would mean making an entry in the manifest with the source URL in the mirrors field, then designating a filter in my scripts to process the text. For reference, this is what you're already doing with many of your grep commands. It's all in the contributing docs but if anything is confusing just ask.

T145 · 2024-04-20T04:53:24Z

T145
Apr 20, 2024
Maintainer

I moved the issue to a discussion so conversation can feel more natural and so other interested parties can feel more welcome in joining or giving some feedback.

jarelllama · 2024-04-20T04:55:33Z

jarelllama
Apr 20, 2024
Author

Hi again. I gave it some thought after reviewing my sources and code. As this is my first big project, my code is rather rigid and does not leave much room for modularity. A lot of the source retrieval code does not adhere to the same filters. For example, some sources are limited to their first 100 results or first 5 pages, depending on the update frequency of each individual source. Most annoying is handling edge cases, like whether a trailing slash returns nothing in curl, or sources that add modifiers to their domains.

This is despite me spending the better half of last night trying to reduce redundancy and implement mawk wherever feasible.

I would love to contribute in practical ways if you have any suggestions. In the meantime, I see you have already merged my manifest.json commit. Thanks for that!

2 replies

T145 Apr 20, 2024
Maintainer

I'd agree that it's rigid, which is one of the reasons I think joining on Black Mirror would be fun, b/c then you can focus on just source scouting and how to process that information rather than building out your own code framework. And that's precisely why I've designed this project as it is, to aid contribution and maintenance. So let's work on converting what you do w/ a source method over to what Black Mirror would do. Taking this example (since I already use PhishStats):

source_phishstats() {
    local source='PhishStats'
    local ignore_from_light=true
    local results_file='data/pending/domains_phishstats.tmp'
    local execution_time
    execution_time="$(date +%s)"

    [[ "$USE_EXISTING" == true ]] && { process_source; return; }

    local url='https://phishstats.info/phish_score.csv'
    # Get only URLs with no subdirectories, exclude IP addresses and extract
    # domains
    # -o for grep can be omitted since each entry is on its own line.
    # Once again, mawk does not work well with such regex expressions.
    wget -qO - "$url" | mawk -F ',' '{print $3}' \
        | grep -E '^"?https?://[[:alnum:].-]+\.[[:alnum:]-]*[a-z]{2,}[[:alnum:]-]*."?$' \
        | mawk -F '/' '{gsub(/"/, "", $3); print $3}' \
        | sort -u -o "$results_file"

    # Get matching NRDs for light version (Unicode ignored)
    comm -12 "$results_file" nrd.tmp > phishstats_nrds.tmp

    process_source
}

So let's break it down into what the most essential steps are.

Download the list
Convert the list content into a specific format
Save the result into a list that holds to the respective format

For your method, you accomplish this by:

Using wget to download.
Using mawk and grep to transform the text.
Saving it to your "results" file, which in this case only handles domains.
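
To make the three steps concrete without touching the network, here is a self-contained sketch. It swaps in POSIX awk for mawk and a few made-up CSV rows for the wget download; the column layout (URL in the third field) is assumed from the `print $3` in the method above, and the trailing `.` in the original regex is written as `/?` here on the assumption that it was meant to allow an optional trailing slash.

```bash
#!/usr/bin/env bash
# Step 1 stand-in: print made-up rows instead of downloading the CSV.
fetch_sample() {
    printf '%s\n' \
        '"2024-04-20","5.0","https://scam-site.example/","1.2.3.4"' \
        '"2024-04-20","4.1","http://phish.example/login/index.html","5.6.7.8"' \
        '"2024-04-20","3.3","http://192.0.2.10/","192.0.2.10"' \
        '"2024-04-20","2.9","https://fake-bank.example","9.9.9.9"' \
        '"2024-04-20","2.7","https://scam-site.example/","1.2.3.4"'
}

# Step 2: keep only URLs with no path, drop IP addresses, extract the domain.
extract_domains() {
    awk -F ',' '{ print $3 }' \
        | grep -E '^"?https?://[[:alnum:].-]+\.[[:alnum:]-]*[a-z]{2,}[[:alnum:]-]*/?"?$' \
        | awk -F '/' '{ gsub(/"/, "", $3); print $3 }'
}

# Step 3: save the sorted, de-duplicated result to a results file.
results_file="$(mktemp)"
fetch_sample | extract_domains | sort -u -o "$results_file"
cat "$results_file"
rm -f "$results_file"
```

With these sample rows, the URL with a path and the bare IP address are filtered out and the duplicate is collapsed, leaving two domains in the results file.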

Now let's look at how I implement the same thing:

Use the retriever manifest definition to download the list w/ Aria2.
Convert the list's content into several formats using content filters also designated in the manifest (note the content type too). In this case by Miller and Perl, though I'm fairly certain there's a way to do it with only Miller.
Save each list's content into a format also designated by each format field in the manifest.

Now that you at least have some experience making a manifest field, the next step is adding unique content through a custom retriever, filter, etc. Or if you could find a way to make existing filters even faster that would also be very helpful.

I've been trying to make this project as open to community involvement as possible so hopefully this is a step in the right direction!

jarelllama Apr 20, 2024
Author

Thanks so much for the guidance! I'll practice on a fork and see what I can do in my free time. I've already spent the day looking at your bash scripts to learn and update my own code.
