Improves detection for various bots (matomo-org#7857)
* Improves detection for generic bots
* Adds detection for Inspici
* Adds detection for Meta-ExternalAgent
* Adds detection for Meta-ExternalFetcher
* Fix url for Facebook crawlers
liviuconcioiu authored Oct 7, 2024
1 parent c6cb44b commit f21f39c
Showing 2 changed files with 80 additions and 7 deletions.
59 changes: 54 additions & 5 deletions Tests/fixtures/bots.yml
@@ -950,7 +950,7 @@
bot:
name: Facebook Crawler
category: Social Media Agent
url: https://developers.facebook.com/docs/sharing/webmasters/crawler/
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
@@ -959,7 +959,7 @@
bot:
name: Facebook Crawler
category: Social Media Agent
url: https://developers.facebook.com/docs/sharing/webmasters/crawler/
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
@@ -968,7 +968,7 @@
bot:
name: Facebook Crawler
category: Social Media Agent
url: https://developers.facebook.com/docs/sharing/webmasters/crawler/
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
@@ -4642,7 +4642,7 @@
bot:
name: Facebook Crawler
category: Social Media Agent
url: https://developers.facebook.com/docs/sharing/webmasters/crawler/
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
@@ -7762,7 +7762,7 @@
bot:
name: Facebook Crawler
category: Social Media Agent
url: https://developers.facebook.com/docs/sharing/webmasters/crawler/
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
@@ -8309,3 +8309,52 @@
producer:
name: Immutable, SNC
url: https://ohdear.app/
-
user_agent: Mozilla/5.0 Keydrop
bot:
name: Generic Bot
-
user_agent: Mozilla/6.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Inspici (www.inspici.com)
bot:
name: Inspici
category: Crawler
url: https://www.inspici.com/
producer:
name: Inspici, LLC
url: https://www.inspici.com/
-
user_agent: meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
bot:
name: Meta-ExternalAgent
category: Crawler
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
-
user_agent: meta-externalagent/1.1
bot:
name: Meta-ExternalAgent
category: Crawler
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
-
user_agent: meta-externalfetcher/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
bot:
name: Meta-ExternalFetcher
category: Social Media Agent
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
-
user_agent: meta-externalfetcher/1.1
bot:
name: Meta-ExternalFetcher
category: Social Media Agent
url: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers
producer:
name: Meta Platforms, Inc.
url: https://www.meta.com/
28 changes: 26 additions & 2 deletions regexes/bots.yml
@@ -594,7 +594,23 @@
- regex: 'facebook(?:catalog|externalhit|externalua|platform|scraper)'
name: 'Facebook Crawler'
category: 'Social Media Agent'
url: 'https://developers.facebook.com/docs/sharing/webmasters/crawler/'
url: 'https://developers.facebook.com/docs/sharing/webmasters/web-crawlers'
producer:
name: 'Meta Platforms, Inc.'
url: 'https://www.meta.com/'

- regex: 'meta-externalagent'
name: 'Meta-ExternalAgent'
category: 'Crawler'
url: 'https://developers.facebook.com/docs/sharing/webmasters/web-crawlers'
producer:
name: 'Meta Platforms, Inc.'
url: 'https://www.meta.com/'

- regex: 'meta-externalfetcher'
name: 'Meta-ExternalFetcher'
category: 'Social Media Agent'
url: 'https://developers.facebook.com/docs/sharing/webmasters/web-crawlers'
producer:
name: 'Meta Platforms, Inc.'
url: 'https://www.meta.com/'
@@ -4822,8 +4838,16 @@
name: 'Immutable, SNC'
url: 'https://ohdear.app/'

- regex: 'Inspici'
name: 'Inspici'
category: 'Crawler'
url: 'https://www.inspici.com/'
producer:
name: 'Inspici, LLC'
url: 'https://www.inspici.com/'

# Generic bots
- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus| CM62| HD65))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherx?web|kirkland-signature|LinkChain|survey-security-dot-txt|infrawatch|Time/|r00ts3c-owned-you|nvdorz|Root Slut|NiggaBalls|BotPoke|GlobalWebSearch|xx032_bo9vs83_2a|sslshed|geckotrail|Wordup|^xenu|^(?:chrome|firefox|Abcd|Dark|KvshClient|Node.js|Report Runner|url|Zeus|ZmEu)$'
- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus| CM62| HD65))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherx?web|kirkland-signature|LinkChain|survey-security-dot-txt|infrawatch|Time/|r00ts3c-owned-you|nvdorz|Root Slut|NiggaBalls|BotPoke|GlobalWebSearch|xx032_bo9vs83_2a|sslshed|geckotrail|Wordup|Keydrop|^xenu|^(?:chrome|firefox|Abcd|Dark|KvshClient|Node.js|Report Runner|url|Zeus|ZmEu)$'
name: 'Generic Bot'

# Generic detections
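
The new entries above can be sanity-checked against the fixture user agents with a minimal sketch like the one below. It is not the project's matcher: `BOT_PATTERNS` and `detect_bot` are hypothetical names, only a small subset of the patterns from this commit is included, and case-insensitive, first-match-wins matching is an assumption made for illustration.

```python
import re

# Hypothetical, simplified subset of the patterns touched by this commit.
# Each pair is (regex, bot name); order matters because the first match wins.
BOT_PATTERNS = [
    ("meta-externalagent", "Meta-ExternalAgent"),
    ("meta-externalfetcher", "Meta-ExternalFetcher"),
    ("Inspici", "Inspici"),
    ("Keydrop", "Generic Bot"),  # stands in for the long generic-bot alternation
]


def detect_bot(user_agent):
    """Return the name of the first matching bot, or None if nothing matches."""
    for pattern, name in BOT_PATTERNS:
        # Assumption: patterns are applied case-insensitively anywhere in the string.
        if re.search(pattern, user_agent, re.IGNORECASE):
            return name
    return None


if __name__ == "__main__":
    # User agents taken from the fixtures added in Tests/fixtures/bots.yml.
    for ua in (
        "meta-externalagent/1.1",
        "meta-externalfetcher/1.1",
        "Mozilla/5.0 Keydrop",
    ):
        print(ua, "->", detect_bot(ua))
```

Keeping the specific Meta and Inspici patterns ahead of the generic one mirrors the ordering in regexes/bots.yml, where the generic-bot entry comes after the named entries.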