Threat categorization

Václav Bartoš edited this page Aug 2, 2024 · 3 revisions

IP addresses are divided into categories based on the type of threat they pose. Primarily, each address is labeled as either src or dst, depending on whether it is an active attacker or a part of some malicious infrastructure. Source addresses are then classified based on the kind of attack (e.g. brute force or DDoS attacks) and destination addresses are further divided by their type (C&C servers, phishing sites, etc.). These categories can have additional subcategories (depending on the configuration), such as targeted ports or protocols. An IP may belong to multiple categories and each category label has a confidence value based on the number of related reports.

Taxonomy

| category | role | subcategories | description |
|---|---|---|---|
| bruteforce | src | port, protocol | The IP performs dictionary (or bruteforce) attacks on password-protected services. Usually accompanied by scanning, i.e. searching for the targeted service. |
| botnet_drone | src | malware_family | The IP is acting as a bot/drone of a botnet. |
| cc | dst | malware_family | The IP is used as a Command&Control server for a botnet/malware. |
| ddos | src | - | The IP was observed as a source of volumetric (D)DoS attacks. |
| ddos-amplifier | dst | protocol | The IP runs a service which can be (and often is) misused as an amplifier for DDoS attacks, e.g. open DNS resolvers, NTP servers, memcached, etc. |
| exploit | src | protocol | The IP is attempting to exploit known vulnerabilities. |
| malware_distribution | dst | malware_family | The IP is used to distribute malware, e.g. it hosts an HTTP URL from which malware is downloaded. |
| phishing_site | dst | - | The IP is hosting a phishing website. |
| scan | src | port | The IP performs network scanning, i.e. it tries to connect to various targets to search for open ports/services. |
| spam | src | - | The IP is sending spam. |
| unknown | src | - | The IP was reported as a source of malicious/rogue/unexpected packets, but without any further specification. |

How it works

Classification

When an IP address is seen in a new event/pulse/blacklist, it is assigned a threat category by the corresponding source module. All modules use the same taxonomy (defined above) and the classification method is largely the same, but can differ slightly based on how the module operates and what kind of information it has about each IP.

The classification is based on a system of rules. Each rule is a Python expression evaluated by the built-in eval() function, so it must be valid Python that resolves to either True or False. When a rule evaluates to True, the IP address is assigned the corresponding category label. Optionally, subcategory values can also be assigned directly within a specific rule.

Within each rule, the programmer can access the objects and functions visible in the context of the classify_ip() function defined in /nerd/common/threat_categorization.py. The most useful are the event object, whose attributes contain information about the event currently being classified, and the regex library (re), which simplifies string matching.
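To illustrate how such rules behave, here is a minimal sketch of rule evaluation. It is not the actual classify_ip() code: the helper rule_matches(), the sample event and the three rules are hypothetical, and the real evaluation context may expose more names than event and re.

```python
import re
from types import SimpleNamespace


def rule_matches(rule: str, event) -> bool:
    """Evaluate one rule expression with `event` and `re` in scope.

    Hypothetical helper: it only mimics the idea of evaluating a rule
    with eval(); it is not taken from the NERD code base.
    """
    return bool(eval(rule, {"event": event, "re": re}))


# Stand-in for the event object (only a few attributes are filled in).
event = SimpleNamespace(
    description="SSH login attempts with invalid credentials",
    ip_info="",
    protocols=["ssh"],
    target_ports=[22],
)

rules = [
    '"login attempts" in event.description',
    r're.search(r"\bssh\b", event.description, re.IGNORECASE) is not None',
    '445 in event.target_ports',
]

results = [rule_matches(rule, event) for rule in rules]
print(results)  # [True, True, False]
```

Because rules are passed to eval(), they can run arbitrary Python, so the configuration file should only be writable by trusted users.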

Here is a list of event attributes that can be used for classification:

| attribute name | type | description |
|---|---|---|
| date | string | Time of detection |
| description | string | Event description |
| ip_info | string | IP description / additional info |
| protocols | list[string] | List of protocols used by the IP |
| target_ports | list[int] | List of ports targeted by the IP |
| categories | list[string] | List of event categories (specific to Warden) |
| tags | list[string] | List of tags related to the event and/or IP address (specific to MISP) |
| ip_role | string | IP role (src/dst, specific to MISP) |
| indicator_role | string | IP category (specific to OTX) |
| blacklist_id | string | Blacklist ID (specific to blacklists) |
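The attributes can be pictured as a simple record. The sketch below is an assumption-laden illustration: the Event dataclass, the date format and the choice to leave source-specific attributes empty (rather than missing) for other sources are all hypothetical and may differ from the real event object.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical container mirroring the attribute list above."""
    date: str = ""
    description: str = ""
    ip_info: str = ""
    protocols: list[str] = field(default_factory=list)
    target_ports: list[int] = field(default_factory=list)
    categories: list[str] = field(default_factory=list)  # Warden only
    tags: list[str] = field(default_factory=list)        # MISP only
    ip_role: str = ""                                     # MISP only
    indicator_role: str = ""                              # OTX only
    blacklist_id: str = ""                                # blacklists only


# An event as a Warden-like source might describe it (made-up values).
warden_event = Event(
    date="2024-08-02T10:15:00Z",
    description="Port scan",
    protocols=["tcp"],
    target_ports=[22, 23, 80],
    categories=["Recon.Scanning"],
)

# Rules can test list attributes with `in` or with any().
scan_rule = '"Recon.Scanning" in event.categories'
misp_rule = 'any("botnet" in tag for tag in event.tags)'

scan_hit = eval(scan_rule, {"event": warden_event})
misp_hit = eval(misp_rule, {"event": warden_event})
print(scan_hit, misp_hit)  # True False
```

In this sketch a rule written for one source simply does not match events from another source, because the attribute it inspects is empty.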

Summary module

For each IP address there is a history of category records which contain the category id, date of detection and the number of reports from individual source modules. These records are then aggregated by a secondary module threat_category_summary and each category from the final summary is assigned a confidence value based on the number of times the IP was reported.

The confidence for each category is computed as follows:

  • For each of the last 14 days, compute:
    • n_events(d) - Number of times the IP address was reported within the day
    • n_sources(d) - Number of distinct source modules that reported those events
    • Daily confidence confidence(d) = (1 - 1/2^n_events(d)) * (1 - 1/2^n_sources(d))
  • Final confidence is the weighted average of the 14 daily values with linearly decreasing weight (most recent day has the highest weight):
    • confidence = SUM[d=0..13](confidence(d) * (14-d)/14) / 7.5 (where d is the number of days before today and 7.5 is the sum of the weights, (14 + 13 + ... + 1)/14 = 105/14)
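The computation above can be sketched in a few lines of Python. The function names and the input layout (a list of 14 (n_events, n_sources) pairs, index 0 being today) are assumptions made for this example, not the interface of the threat_category_summary module.

```python
def daily_confidence(n_events: int, n_sources: int) -> float:
    """Confidence for a single day, as defined by the formula above."""
    return (1 - 1 / 2**n_events) * (1 - 1 / 2**n_sources)


def category_confidence(daily_counts: list[tuple[int, int]]) -> float:
    """Weighted average of 14 daily confidences.

    daily_counts[d] holds (n_events, n_sources) for the day d days
    before today, so index 0 is the most recent day.
    """
    if len(daily_counts) != 14:
        raise ValueError("expected exactly 14 daily entries")
    weights = [(14 - d) / 14 for d in range(14)]  # 1, 13/14, ..., 1/14
    weighted_sum = sum(
        weight * daily_confidence(n_events, n_sources)
        for weight, (n_events, n_sources) in zip(weights, daily_counts)
    )
    return weighted_sum / sum(weights)  # sum(weights) is 7.5


# One report from one source today, nothing in the previous 13 days.
single_report = category_confidence([(1, 1)] + [(0, 0)] * 13)

# One report from one source on each of the 14 days.
steady_reports = category_confidence([(1, 1)] * 14)
```

A single report from a single source today gives a daily confidence of 0.25 and a final confidence of 0.25 / 7.5 ≈ 0.033, while the same report repeated on all 14 days gives 0.25. Values close to 1 require many reports from several source modules on most days.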

Configuration

Configuration is specified in a YAML-formatted file (/etc/nerd/threat_categorization.yml by default). It contains the definition of individual categories and their parameters:

threat_categorization:
  category_id:
    label: "Full name"
    description: "Category description"
    role: "src"
    subcategories:
      - "port"
      - "protocol"
      - "malware_family"
    triggers:
      module_name: |-
        "indicator1" in event.description
        "indicator2" in event.ip_info -> {protocol: ['proto1']}
      another_module_name: |-
        ...
  another_category_id:
    ...

A category is defined by adding a new item under threat_categorization, with the key set to the name of the category. Each category must have a label (full name), a description and a role (src or dst).

subcategories specifies which subcategories the source modules should try to classify. Currently there are 3 possible subcategories: target port, protocol and malware family.

The triggers field contains the rules used for classification, divided into sections for individual source modules (a common set of rules for all modules can be defined under the name general). Rules for each module are written as a single multiline string (a block scalar with one rule per line) so that special characters like quotes do not have to be escaped. Each rule may have two parts separated by the "->" symbol: the statement used for classification (mandatory) and a subcategory assignment (optional). Both must be valid Python expressions that eval() can evaluate. The statement should resolve to either True or False, and the assignment should be a valid dictionary (key : set of values).
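The two-part rule format can be illustrated with a short sketch. Everything here is hypothetical: apply_rules() is not a NERD function, the sketch assumes that all rules of a block are evaluated and their subcategory values merged, and it quotes the dictionary keys so that each assignment evaluates on its own (the configuration example above writes the key unquoted, which presumably relies on names supplied by the real evaluation context).

```python
import re
from types import SimpleNamespace


def apply_rules(rules_block: str, event) -> tuple[bool, dict]:
    """Return (matched, subcategories) for one module's rule block.

    Hypothetical helper. Splitting on the first "->" is a simplification
    that would break if a statement contained "->" inside a string.
    """
    context = {"event": event, "re": re}
    matched = False
    subcategories: dict[str, set] = {}
    for line in rules_block.splitlines():
        line = line.strip()
        if not line:
            continue
        statement, _, assignment = line.partition("->")
        if not eval(statement, context):
            continue
        matched = True
        if assignment.strip():
            # Merge the values assigned by every matching rule.
            for key, values in eval(assignment, context).items():
                subcategories.setdefault(key, set()).update(values)
    return matched, subcategories


# One rule per line, as in a YAML block scalar (made-up rules).
rules_block = '''"ssh" in event.description.lower() -> {"protocol": ["ssh"], "port": [22]}
"telnet" in event.description.lower() -> {"protocol": ["telnet"]}
"bruteforce" in event.ip_info'''

event = SimpleNamespace(description="SSH dictionary attack", ip_info="")
matched, subcats = apply_rules(rules_block, event)
print(matched, subcats)
```

Here only the first rule matches, so the category is assigned with protocol {"ssh"} and port {22}; the third rule shows a statement without a subcategory assignment.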
