Add an input URL list parameter #38
Changes from 10 commits
@@ -1,9 +1,11 @@
+import re
 from enum import Enum
 from importlib.metadata import version
-from typing import Any, Dict, Optional
+from logging import getLogger
+from typing import Any, Dict, List, Optional, Union
 
 import scrapy
-from pydantic import BaseModel, Field
+from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator
 from scrapy.crawler import Crawler
 
 from zyte_spider_templates._geolocations import (
@@ -15,6 +17,8 @@
 # Higher priority than command-line-defined settings (40).
 ARG_SETTING_PRIORITY: int = 50
 
+logger = getLogger(__name__)
+
 
 @document_enum
 class ExtractFrom(str, Enum):
@@ -26,12 +30,48 @@ class ExtractFrom(str, Enum):
     """Use browser rendering. Often provides the best quality."""
 
 
+_INPUT_FIELDS = ("url", "urls")
+_URL_PATTERN = r"^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$"
+
+
 class BaseSpiderParams(BaseModel):
+    model_config = ConfigDict(
+        json_schema_extra={
+            "groups": [
+                {
+                    "id": "inputs",
+                    "title": "Inputs",
+                    "description": (
+                        "Input data that determines the start URLs of the crawl."
+                    ),
+                    "widget": "exclusive",
+                },
+            ],
+        },
+    )
+
     url: str = Field(
         title="URL",
         description="Initial URL for the crawl. Enter the full URL including http(s), "
         "you can copy and paste it from your browser. Example: https://toscrape.com/",
-        pattern=r"^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$",
+        pattern=_URL_PATTERN,
         default="",
+        json_schema_extra={
+            "group": "inputs",
+        },
     )
+    urls: Optional[List[str]] = Field(
+        title="URLs",
+        description=(
+            "Initial URLs for the crawl, separated by new lines. Enter the "
+            "full URL including http(s), you can copy and paste it from your "
+            "browser. Example: https://toscrape.com/"
+        ),
+        default=None,
+        json_schema_extra={
+            "group": "inputs",
+            "widget": "textarea",
+        },
+    )
     geolocation: Optional[Geolocation] = Field(
         title="Geolocation",
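The `_URL_PATTERN` regex introduced above accepts `http`/`https` URLs with an optional port, path, and fragment, and rejects anything containing whitespace. A quick stdlib check of its behavior (the sample URLs are illustrative, not from the PR's test suite):

```python
import re

# Pattern copied verbatim from the diff above.
_URL_PATTERN = r"^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$"

samples = {
    "https://toscrape.com/": True,             # scheme + host + path
    "http://example.com:8080/a/b#frag": True,  # port and fragment allowed
    "toscrape.com": False,                     # missing scheme
    "ftp://example.com/": False,               # only http(s) schemes accepted
    "https://example.com/a b": False,          # whitespace is rejected
}

for url, expected in samples.items():
    assert bool(re.search(_URL_PATTERN, url)) is expected, url
```

This is why the `urls` validator below can safely split on newlines first and then filter each entry against the same pattern.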
@@ -81,6 +121,65 @@ class BaseSpiderParams(BaseModel):
         },
     )
 
+    @field_validator("urls", mode="before")
+    @classmethod
+    def validate_url_list(cls, value: Union[List[str], str]) -> List[str]:
+        """Validate a list of URLs.
+
+        If a string is received as input, it is split into multiple strings
+        on new lines.
+
+        List items that do not match a URL pattern trigger a warning and are
+        removed from the list. If all URLs are invalid, validation fails.
+        """
+        if isinstance(value, str):
+            value = value.split("\n")
+        if value:
+            new_value = []
+            for v in value:
+                v = v.strip()
+                if not v:
+                    continue
+                if not re.search(_URL_PATTERN, v):
+                    logger.warning(
+                        f"{v!r}, from the 'urls' spider argument, is not a "
+                        f"valid URL and will be ignored."
+                    )
+                    continue
+                new_value.append(v)
+            if new_value:
+                value = new_value
+            else:
+                raise ValueError(f"No valid URL found in {value!r}")
+        return value
+
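The splitting-and-filtering behavior of `validate_url_list` can be exercised outside pydantic. A minimal stdlib sketch of the same logic (the function mirrors the diff above but is a standalone illustration, not the shipped code; the warning is replaced by a silent skip):

```python
import re
from typing import List, Union

_URL_PATTERN = r"^https?://[^:/\s]+(:\d{1,5})?(/[^\s]*)*(#[^\s]*)?$"


def validate_url_list(value: Union[List[str], str]) -> Union[List[str], str]:
    """Split a newline-separated string and drop entries that are not URLs."""
    if isinstance(value, str):
        value = value.split("\n")
    if value:
        new_value = []
        for v in value:
            v = v.strip()
            if not v:
                continue  # blank lines are ignored
            if not re.search(_URL_PATTERN, v):
                continue  # invalid entries are dropped (the real code logs a warning)
            new_value.append(v)
        if new_value:
            value = new_value
        else:
            # All entries invalid: validation fails, as in the diff.
            raise ValueError(f"No valid URL found in {value!r}")
    return value


# A newline-separated string is split; blanks and invalid entries are removed:
urls = validate_url_list("https://a.example\n\nnot a url\nhttps://b.example/x")
# → ["https://a.example", "https://b.example/x"]
```

Note that an empty list or `None` passes through unchanged, which is what lets `urls` remain optional while `single_input` below enforces that at least one input field is set.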
+    @model_validator(mode="after")
+    def single_input(self):
+        """Fields
+        :class:`~zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams.url`
+        and
+        :class:`~zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams.urls`
+        form a mandatory, mutually-exclusive field group: one of them must be
+        defined, the rest must not be defined."""
+        input_fields = set(
+            field for field in _INPUT_FIELDS if getattr(self, field, None)
+        )
+        if not input_fields:
+            input_field_list = ", ".join(_INPUT_FIELDS)
+            raise ValueError(
+                f"No input parameter defined. Please, define one of: "
+                f"{input_field_list}."
+            )
+        elif len(input_fields) > 1:
+            input_field_list = ", ".join(
+                f"{field} ({getattr(self, field)!r})" for field in input_fields
+            )
+            raise ValueError(
+                f"Expected a single input parameter, got {len(input_fields)}: "
+                f"{input_field_list}."
+            )
+        return self
+
+
 class BaseSpider(scrapy.Spider):
     custom_settings: Dict[str, Any] = {

Review comments on the docstring cross-references:

Comment: Since this is a base class and we are going to use the same approach in the article spiders, there might be no need to mention `ecommerce.EcommerceSpiderParams` explicitly.

Reply: Once we merge article support, we can refactor the docs to cover shared params elsewhere and point there. But for the current state of the repository, this is the best thing we can link to.
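The mutually-exclusive group check in `single_input` can also be sketched without pydantic. Here the model is stood in for by a plain dict (a hypothetical `check_single_input` helper for illustration; in the real code this logic runs as a `mode="after"` model validator on the instance):

```python
_INPUT_FIELDS = ("url", "urls")


def check_single_input(params: dict) -> None:
    """Require exactly one truthy input field, mirroring single_input() above."""
    defined = {f for f in _INPUT_FIELDS if params.get(f)}
    if not defined:
        input_field_list = ", ".join(_INPUT_FIELDS)
        raise ValueError(
            f"No input parameter defined. Please, define one of: "
            f"{input_field_list}."
        )
    if len(defined) > 1:
        raise ValueError(
            f"Expected a single input parameter, got {len(defined)}."
        )


# Exactly one input is accepted; zero or both raise ValueError:
check_single_input({"url": "https://toscrape.com/", "urls": None})  # OK
```

Because `url` defaults to `""` and `urls` to `None`, both falsy, a model built with no arguments fails this check, which is what makes the "inputs" group mandatory.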
Review discussion on `validate_url_list`:

Comment: What do we expect `value` to be when it's not a `str` instance?

Reply: Any of the other supported types: `None`, or a list of `str`.

Comment: @Gallaecio, will you add something to make this work? Mind that "URL\nURL" will not work (per your comment in the description), nor will running something like `scrapy crawl spider -a urls="https://www.nytimes.com/\nhttps://www.theguardian.com"`. I've added this workaround https://github.com/zytedata/zyte-spider-templates-private/blob/article/zyte_spider_templates/spiders/article.py#L135, but maybe we can handle it in `validate_url_list` too.

Reply: I think supporting that syntax is a bit dangerous, as `\n` are valid URL characters. Since it will not be an issue in the UI or from code, I am not sure it is worth supporting. Maybe we could support spaces as a separator, but we risk scenarios where users expect a space → %20 conversion to happen.

Comment: @Gallaecio, I've checked https://www.rfc-editor.org/rfc/rfc3986 and haven't found anything about \n as an allowed character, but I might have missed something.

Reply: I meant valid characters, i.e. `\` is a valid character and `n` is a valid character. https://example.com/\n is a valid URL.

Comment: Got it, makes sense, thanks!
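The distinction discussed above is between the two-character sequence backslash-`n` and an actual newline character: a shell argument like `-a urls="URL\nURL"` passes the former, which `str.split("\n")` does not split on. A quick illustration:

```python
# What the shell passes for -a urls="https://a.example\nhttps://b.example":
# a literal backslash followed by the letter "n", not a newline.
shell_value = "https://a.example\\nhttps://b.example"
assert "\n" not in shell_value
assert shell_value.split("\n") == [shell_value]  # no split happens

# An actual newline (as entered in the textarea widget, or passed from code)
# does split into two URLs:
ui_value = "https://a.example\nhttps://b.example"
assert ui_value.split("\n") == ["https://a.example", "https://b.example"]
```

This is why the validator works as intended from the UI and from Python code, while the command-line `\n` syntax needs a separate workaround (or is, per the discussion, deliberately unsupported).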