# URL Extract

This module extracts the TLD, domain, subdomains, and query from URLs. It also validates URLs.

Documentation: https://url-extract.readthedocs.io/en/latest/

## Installation

```
pip install url_extract
```

## Usage

```python
>>> from url_extract import UrlExtract
>>> extract = UrlExtract()
Downloading list...
>>> extracted = extract.extract('http://dir.bg')
>>> extracted.getDomain()
'dir'
>>> extracted.getTld()
'bg'
>>> extracted.valid()
True
>>> extracted = extract.extract('https://sireninfo.com')
>>> extracted.getDomain()
'sireninfo'
>>> extracted = extract.extract('http://police.uk')
>>> extracted.valid()
False
```

## Documentation

#### class UrlExtract(datFileMaxAge=86400*31, datFileSaveDir=None, alwaysPuny=None)

* datFileMaxAge - specifies the maximum age of the cached public suffix list
* datFileSaveDir - specifies where the public suffix list (tlds.dat) will be downloaded
* alwaysPuny - if set to True, Unicode domains are punyencoded after extraction
* extract(url) - Extracts the URL and returns a Result() object (see the sketch below)
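
A minimal sketch of constructing an extractor with non-default settings, based only on the parameter descriptions above; the cache directory and the Unicode example URL are hypothetical, and the commented output is an assumption rather than verified behaviour:

```python
from url_extract import UrlExtract

# Refresh the public suffix list once it is older than 7 days,
# keep tlds.dat in /tmp (hypothetical path), and punyencode Unicode domains.
extract = UrlExtract(datFileMaxAge=86400 * 7,
                     datFileSaveDir='/tmp',
                     alwaysPuny=True)

extracted = extract.extract('http://пример.bg')  # hypothetical Unicode URL
extracted.getDomain()  # expected to be the punyencoded form, e.g. 'xn--e1afmkfd'
```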

#### class Result()

* getDomain() - Returns the domain name without subdomains and TLD
* getTld() - Returns the TLD of the domain
* valid() - Validates the domain and returns True or False
* getFoundSubdomains() - Returns the extracted subdomains as a list
* getHostname() - Returns the hostname of the URL
* getUrlQuery() - Returns the query after the first / in the URL (see the sketch below)
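
A hedged sketch of the Result() accessors listed above, using a made-up URL; the values in the comments are what the method descriptions suggest, not verified output:

```python
from url_extract import UrlExtract

extract = UrlExtract()
extracted = extract.extract('http://mail.dir.bg/inbox?folder=spam')

extracted.getDomain()           # 'dir'
extracted.getTld()              # 'bg'
extracted.getFoundSubdomains()  # presumably ['mail']
extracted.getHostname()         # presumably 'mail.dir.bg'
extracted.getUrlQuery()         # presumably everything after the first '/', e.g. 'inbox?folder=spam'
extracted.valid()               # True
```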