Adding WebSites
A WebSite object corresponds to one or more WebPage objects. The WebSite corresponding to a WebPage can be found as follows:
domain = WebPage.domain_for_url(webpage.url)
site_data = WebPage.site_data_for_domain(domain)
website = WebSite().load(site_data)
A WebSite requires name, domains, and is_whitelisted. It can also have the following optional attributes: bad_urls, normalization_rules, title_branding, initial_title_branding, exclude_from_tracking, and whitelist_selectors. exclude_from_tracking and whitelist_selectors are only relevant for linker v1 and v2.

name is what displays in the WebPages sidebar. domains is a list of all domains corresponding to the WebSite with the specified name. is_whitelisted must be set to True in order for the WebSite's WebPages to appear in the Sefaria sidebar. bad_urls is a list of regular expressions specifying URLs that match one of the domains but that we nevertheless do not want to save in our database or show in the sidebar.

To understand normalization_rules, see normalize_url() in sefaria/model/webpage.py. In normalize_url(), the URL of an incoming WebPage is normalized based on global rules that apply to all incoming WebPages, plus any additional rules listed in the WebSite object's normalization_rules list.

When WebPage data is received by the server, the incoming dictionary has a title field. title_branding and initial_title_branding are used to normalize this title; see clean_title in sefaria/model/webpage.py for how they are applied.
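To make the role of bad_urls more concrete, here is a minimal sketch of how a list of regular expressions could be used to filter URLs. It is not Sefaria's implementation: the helper is_bad_url is hypothetical, the sample URLs are made up, and it assumes the patterns are matched anywhere in the URL with re.search.

```python
import re

def is_bad_url(url, bad_urls):
    """Hypothetical helper (not part of Sefaria): True if any pattern matches the URL.

    Assumes each entry in bad_urls is a regular expression that may match
    anywhere in the URL, i.e. re.search semantics.
    """
    return any(re.search(pattern, url) for pattern in bad_urls)

# Pattern taken from the Torah In Motion example; the URLs below are illustrative.
bad_urls = [r"torahinmotionorg\.e\.civicrm\.ca\/store"]

store_url = "https://torahinmotionorg.e.civicrm.ca/store/item"
article_url = "https://torahinmotion.org/discussions"

print(is_bad_url(store_url, bad_urls))    # matches the store pattern
print(is_bad_url(article_url, bad_urls))  # no pattern matches
```

Note that the dots are escaped in the pattern so that they match a literal "." rather than any character.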
Here is an example of a WebSite in the database:
{"name" : "Torah In Motion",
"domains" : [
"torahinmotion.org",
"torahinmotionorg.e.civicrm.ca"
],
"is_whitelisted" : true,
"bad_urls" : [
"torahinmotionorg\\.e\\.civicrm\\.ca\\/store"
],
"normalization_rules" : [
"remove www"
],
"title_branding" : [
"TORAH IN MOTION"
],
"initial_title_branding" : true
}
To add this WebSite from the CLI:
from sefaria.model.webpage import WebSite

w = WebSite()
w.name = "Torah In Motion"  # required attribute
w.domains = ["torahinmotion.org", "torahinmotionorg.e.civicrm.ca"]  # required attribute
w.is_whitelisted = True  # required attribute
w.bad_urls = ["torahinmotionorg\\.e\\.civicrm\\.ca\\/store"]  # optional: regexes for URLs to exclude
w.normalization_rules = ["remove www"]  # optional
w.title_branding = ["TORAH IN MOTION"]  # optional
w.initial_title_branding = True  # optional
w.save()
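To give a feel for what title_branding and initial_title_branding control, here is a rough sketch of stripping a site's brand from a page title. It is not the actual clean_title code. The helper strip_branding, the " | " separator, the sample titles, and the reading of initial_title_branding as "the brand appears at the start of the title" are all assumptions made for illustration. Consult clean_title in sefaria/model/webpage.py for the real behavior.

```python
def strip_branding(title, brands, initial=False, separator=" | "):
    """Hypothetical helper: remove a brand string from one end of a title.

    initial=True  -> look for "<brand><separator>" at the start of the title
    initial=False -> look for "<separator><brand>" at the end of the title
    The separator is an assumption; real titles may use other separators.
    """
    for brand in brands:
        if initial:
            prefix = brand + separator
            if title.startswith(prefix):
                return title[len(prefix):]
        else:
            suffix = separator + brand
            if title.endswith(suffix):
                return title[:-len(suffix)]
    # No brand found at the expected end: leave the title unchanged.
    return title

brands = ["TORAH IN MOTION"]  # value from the example WebSite

# Illustrative titles, not taken from real pages.
print(strip_branding("TORAH IN MOTION | Parashat Noach", brands, initial=True))
print(strip_branding("Parashat Noach | TORAH IN MOTION", brands, initial=False))
```

Under these assumptions both calls reduce the title to the part that describes the page itself, which is the kind of cleanup the two branding attributes exist to support.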