Merge remote-tracking branch 'zytedata/main' into google-language
Gallaecio committed Nov 22, 2024
2 parents 884643d + b6b33ea commit 84a94bf
Showing 25 changed files with 2,634 additions and 220 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.9.0
current_version = 0.10.0
commit = True
tag = True
tag_name = {new_version}
41 changes: 41 additions & 0 deletions CHANGES.rst
@@ -1,6 +1,47 @@
Changes
=======

0.10.0 (2024-11-22)
-------------------

* Dropped Python 3.8 support, added Python 3.13 support.

* Increased the minimum required versions of some dependencies:

* ``pydantic``: ``2`` → ``2.1``

* ``scrapy-poet``: ``0.21.0`` → ``0.24.0``

* ``scrapy-spider-metadata``: ``0.1.2`` → ``0.2.0``

* ``scrapy-zyte-api[provider]``: ``0.16.0`` → ``0.23.0``

* ``zyte-common-items``: ``0.22.0`` → ``0.23.0``

* Added :ref:`custom attributes <custom-attributes>` support to the
:ref:`e-commerce spider template <e-commerce>` through its new
:class:`~zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams.custom_attrs_input`
and
:class:`~zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams.custom_attrs_method`
parameters.

* The
:class:`~zyte_spider_templates.spiders.serp.GoogleSearchSpiderParams.max_pages`
parameter of the :ref:`Google Search spider template <google-search>` can no
longer be 0 or lower.

* The :ref:`Google Search spider template <google-search>` now follows
pagination for the results of each query page by page, instead of sending a
request for every page in parallel. It stops once it reaches a page without
organic results.

* Improved the description of
:class:`~zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategy`
values.

* Fixed type hint issues related to Scrapy.
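
The pagination change described above can be illustrated with a minimal sketch. It is not the spider's actual code: ``crawl_query`` and ``fake_fetch_page`` are hypothetical names, and the fetcher is an in-memory stand-in for a real request. It only shows the described behavior, pages requested one at a time, stopping at the first page without organic results or at ``max_pages``, with ``max_pages`` required to be 1 or higher.

```python
from typing import Callable, Iterator, List


def crawl_query(
    fetch_page: Callable[[int], List[str]], max_pages: int
) -> Iterator[List[str]]:
    """Yield organic results page by page (hypothetical helper)."""
    if max_pages < 1:
        # Mirrors the new constraint: max_pages can no longer be 0 or lower.
        raise ValueError("max_pages must be 1 or higher")
    for page_number in range(1, max_pages + 1):
        organic_results = fetch_page(page_number)
        if not organic_results:
            # A page without organic results ends pagination for this query.
            break
        yield organic_results


def fake_fetch_page(page_number: int) -> List[str]:
    """In-memory stand-in for a search results request (assumption)."""
    fake_pages = {1: ["result-1", "result-2"], 2: ["result-3"]}
    return fake_pages.get(page_number, [])


pages = list(crawl_query(fake_fetch_page, max_pages=10))
```

Here only 3 pages are requested even though ``max_pages`` is 10, because page 3 has no organic results.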


0.9.0 (2024-09-17)
------------------

15 changes: 14 additions & 1 deletion docs/conf.py
@@ -4,7 +4,7 @@
project = "zyte-spider-templates"
copyright = "2023, Zyte Group Ltd"
author = "Zyte Group Ltd"
release = "0.9.0"
release = "0.10.0"

sys.path.insert(0, str(Path(__file__).parent.absolute())) # _ext
extensions = [
@@ -22,6 +22,14 @@
html_theme = "sphinx_rtd_theme"

intersphinx_mapping = {
"form2request": (
"https://form2request.readthedocs.io/en/latest",
None,
),
"formasaurus": (
"https://formasaurus.readthedocs.io/en/latest",
None,
),
"python": (
"https://docs.python.org/3",
None,
@@ -46,6 +54,10 @@
"https://web-poet.readthedocs.io/en/stable",
None,
),
"zyte": (
"https://docs.zyte.com",
None,
),
"zyte-common-items": (
"https://zyte-common-items.readthedocs.io/en/latest",
None,
@@ -57,6 +69,7 @@
autodoc_pydantic_model_show_json = False
autodoc_pydantic_model_show_validator_members = False
autodoc_pydantic_model_show_validator_summary = False
autodoc_pydantic_field_list_validators = False

# sphinx-reredirects
redirects = {
27 changes: 26 additions & 1 deletion docs/customization/pages.rst
@@ -6,7 +6,8 @@ Customizing page objects

All parsing is implemented using :ref:`web-poet page objects <page-objects>`
that use `Zyte API automatic extraction`_ to extract :ref:`standard items
<item-api>`, both for navigation and for item details.
<item-api>`: for navigation, for item details, and even for :ref:`search
request generation <search-queries>`.

.. _Zyte API automatic extraction: https://docs.zyte.com/zyte-api/usage/extract.html

@@ -141,3 +142,27 @@ To extract a new field for one or more websites:
        def parse_product(self, response: DummyResponse, product: CustomProduct):
            yield from super().parse_product(response, product)

.. _fix-search:

Fixing search support
=====================

If the default implementation to build a request out of :ref:`search queries
<search-queries>` does not work on a given website, you can implement your
own search request page object to fix that. See
:ref:`custom-request-template-page`.

For example:

.. code-block:: python

    from web_poet import field, handle_urls
    from zyte_common_items import BaseSearchRequestTemplatePage


    @handle_urls("example.com")
    class ExampleComSearchRequestTemplatePage(BaseSearchRequestTemplatePage):
        @field
        def url(self):
            # "query" is filled in per search query, URL-encoded by quote_plus.
            return "https://example.com/search?q={{ query|quote_plus }}"

43 changes: 43 additions & 0 deletions docs/features/search.rst
@@ -0,0 +1,43 @@
.. _search-queries:

==============
Search queries
==============

The :ref:`e-commerce spider template <e-commerce>` supports a spider argument,
:data:`~zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams.search_queries`,
that allows you to define a different search query per line, and
turns the input URLs into search requests for those queries.

For example, given the following input URLs:

.. code-block:: none

    https://a.example
    https://b.example

And the following list of search queries:

.. code-block:: none

    foo bar
    baz

By default, the spider sends 2 initial requests, 1 per input URL, to try to
find out how to build a search request for each website. If it succeeds, it
then sends 4 search requests, 1 per combination of input URL and search
query. For example:

.. code-block:: none

    https://a.example/search?q=foo+bar
    https://a.example/search?q=baz
    https://b.example/s/foo%20bar
    https://b.example/s/baz

The default implementation uses a combination of HTML metadata, AI-based HTML
form inspection and heuristics to find the most likely way to build a search
request for a given website.

If this default implementation does not work as expected on a given website,
you can :ref:`write a page object to fix that <fix-search>`.
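
As a rough illustration of the "1 search request per combination of input URL and search query" behavior described above, here is a minimal standard-library sketch. It does not reflect how the spider detects search forms. The per-site URL builders in ``search_url_builders`` are hypothetical and simply reproduce the two URL shapes from the example, one using ``quote_plus`` in a query string and one using ``quote`` in the path.

```python
from itertools import product
from urllib.parse import quote, quote_plus

input_urls = ["https://a.example", "https://b.example"]
search_queries = ["foo bar", "baz"]

# Hypothetical per-site search URL builders (assumption): in the spider these
# would come from inspecting each website, not from a hard-coded mapping.
search_url_builders = {
    "https://a.example": lambda q: f"https://a.example/search?q={quote_plus(q)}",
    "https://b.example": lambda q: f"https://b.example/s/{quote(q)}",
}

# One search request URL per (input URL, search query) combination.
search_requests = [
    search_url_builders[url](query)
    for url, query in product(input_urls, search_queries)
]
```

With 2 input URLs and 2 search queries, ``search_requests`` holds 4 URLs.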
6 changes: 6 additions & 0 deletions docs/index.rst
@@ -20,6 +20,12 @@ zyte-spider-templates documentation
E-commerce <templates/e-commerce>
Google search <templates/google-search>

.. toctree::
:caption: Features
:hidden:

Search queries <features/search>

.. toctree::
:caption: Customization
:hidden:
13 changes: 13 additions & 0 deletions docs/reference/index.rst
@@ -23,6 +23,14 @@ Pages
Parameter mixins
================

.. autopydantic_model:: zyte_spider_templates.params.CustomAttrsInputParam
:exclude-members: model_computed_fields

.. autopydantic_model:: zyte_spider_templates.params.CustomAttrsMethodParam
:exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.params.CustomAttrsMethod

.. autopydantic_model:: zyte_spider_templates.params.ExtractFromParam
:exclude-members: model_computed_fields

@@ -44,5 +52,10 @@ Parameter mixins

.. autoenum:: zyte_spider_templates.spiders.ecommerce.EcommerceCrawlStrategy

.. autopydantic_model:: zyte_spider_templates.spiders.serp.SerpItemTypeParam
:exclude-members: model_computed_fields

.. autoenum:: zyte_spider_templates.spiders.serp.SerpItemType

.. autopydantic_model:: zyte_spider_templates.spiders.serp.SerpMaxPagesParam
:exclude-members: model_computed_fields
3 changes: 3 additions & 0 deletions pytest.ini
@@ -0,0 +1,3 @@
[pytest]
filterwarnings =
ignore:deprecated string literal syntax::jmespath.lexer
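
For readers unfamiliar with the ``action:message:category:module`` filter syntax used above, the following sketch shows a roughly equivalent filter built with the standard ``warnings`` module. It is an approximation, not what pytest does internally, and the warning category emitted here (``PendingDeprecationWarning``) is an assumption made for the demonstration; the filter itself leaves the category empty, so it matches any category.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Rough equivalent of: ignore:deprecated string literal syntax::jmespath.lexer
    warnings.filterwarnings(
        "ignore",
        message="deprecated string literal syntax",
        module=r"jmespath\.lexer",
    )
    # Simulated warning attributed to the jmespath.lexer module: filtered out.
    warnings.warn_explicit(
        "deprecated string literal syntax",
        PendingDeprecationWarning,  # assumed category, for illustration only
        filename="lexer.py",
        lineno=1,
        module="jmespath.lexer",
    )
    # Unrelated warning from another module: still recorded.
    warnings.warn_explicit(
        "some other warning",
        UserWarning,
        filename="other.py",
        lineno=1,
        module="other",
    )

remaining = [str(w.message) for w in caught]
```

Only the unrelated warning ends up in ``remaining``.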
11 changes: 8 additions & 3 deletions setup.py
@@ -2,7 +2,7 @@

setup(
name="zyte-spider-templates",
version="0.9.0",
version="0.10.0",
description="Spider templates for automatic crawlers.",
long_description=open("README.rst").read(),
long_description_content_type="text/x-rst",
@@ -12,13 +12,18 @@
packages=find_packages(),
include_package_data=True,
install_requires=[
"extruct>=0.18.0",
"form2request>=0.2.0",
"formasaurus>=0.10.0",
"jmespath>=0.9.5",
"pydantic>=2.1",
"requests>=0.10.1",
"requests>=1.0.0",
"scrapy>=2.11.0",
"scrapy-poet>=0.24.0",
"scrapy-spider-metadata>=0.2.0",
"scrapy-zyte-api[provider]>=0.23.0",
"zyte-common-items>=0.23.0",
"web-poet>=0.17.1",
"zyte-common-items>=0.26.2",
],
classifiers=[
"Development Status :: 3 - Alpha",