Avoid duplicate requests #105

Merged
117 commits
5908b6c
Move the provider deps into an optional feature.
wRAR Jul 3, 2023
eaa3409
Run the provider tests on CI.
wRAR Jul 3, 2023
a3f3e15
Merge pull request #98 from scrapy-plugins/scrapy-poet-optional
kmike Jul 4, 2023
5ad4d79
ZyteApiProvider: support more data types
kmike Jul 12, 2023
914d599
pre-commit fixes
kmike Jul 12, 2023
3c81296
document requirements for scrapy-poet integration
kmike Jul 12, 2023
be25b79
cleanup tools configuration
kmike Jul 12, 2023
edd2ed2
fix black configuration
kmike Jul 12, 2023
44c3a5f
document new dependencies which scrapy-zyte-api can provide
kmike Jul 12, 2023
8fd2138
proper mypy config
kmike Jul 12, 2023
7ac7f12
Update README.rst
kmike Jul 12, 2023
aeca861
make SCRAPY_POET_PROVIDERS configuration copy-pasteable (#102)
kmike Jul 12, 2023
fd91fbf
Merge pull request #101 from scrapy-plugins/cleanup-ci
kmike Jul 12, 2023
bd4bd6b
Merge pull request #100 from scrapy-plugins/more-data-types
kmike Jul 12, 2023
6d549b7
add build/ folder to gitignore (#103)
kmike Jul 12, 2023
226620c
add a quick start section to readme
kmike Jul 12, 2023
139f55e
fix rst syntax in readme
kmike Jul 12, 2023
267d689
Merge pull request #104 from scrapy-plugins/quickstart
kmike Jul 12, 2023
23c4940
WIP
Gallaecio Jul 13, 2023
1c7db87
Support driving Zyte API requests through a proxy
Gallaecio Jul 13, 2023
4e4c829
Add release notes for 0.10.0.
wRAR Jul 13, 2023
0c8b648
Merge pull request #109 from scrapy-plugins/relnotes-0.10.0
wRAR Jul 14, 2023
46ae308
Reword the behind-a-proxy readme entry to minimize confusion
Gallaecio Jul 14, 2023
24a4914
Merge remote-tracking branch 'scrapy-plugins/main' into trust-env
Gallaecio Jul 14, 2023
4361ca7
CHANGES:rst: cover ZYTE_API_USE_ENV_PROXY
Gallaecio Jul 14, 2023
28927c0
Merge pull request #106 from Gallaecio/trust-env
kmike Jul 14, 2023
23762b6
Bump version: 0.9.0 → 0.10.0
wRAR Jul 14, 2023
febe99f
introduce ZYTE_API_MAX_REQUESTS
BurnzZ Jul 19, 2023
5a8ce24
update tests to include non-ZAPI requests
BurnzZ Jul 19, 2023
561d9ff
refactor conditional check on max request reached
BurnzZ Jul 19, 2023
276c7f4
Warn about requestCookies=[] (#115)
Gallaecio Jul 25, 2023
660cdb2
improve IgnoreRequest message when max ZAPI requests is reached
BurnzZ Aug 2, 2023
63a9d44
document new ZYTE_API_MAX_REQUESTS setting
BurnzZ Aug 2, 2023
2820d56
Styling improvements
BurnzZ Aug 3, 2023
3c09dfc
Merge pull request #114 from scrapy-plugins/max-zapi-requests
kmike Aug 3, 2023
8438007
CHANGES.rst: Cover 0.11.0
Gallaecio Aug 4, 2023
1097650
Clarify that ZYTE_API_MAX_REQUESTS limits successful requests
Gallaecio Aug 4, 2023
6da9a39
Merge pull request #117 from Gallaecio/release-notes
kmike Aug 7, 2023
b2d251d
Bump version: 0.10.0 → 0.11.0
Gallaecio Aug 7, 2023
654aaba
CHANGES.rst: set the 0.11.0 release date
Gallaecio Aug 7, 2023
e11265e
Fix version bumping
Gallaecio Aug 7, 2023
30c91d0
Support scrapy-poet 0.10.0 again
Gallaecio Aug 7, 2023
d3e7115
Hint at potential issues when switching reactors on existing projects…
Gallaecio Aug 7, 2023
5708e0e
Merge pull request #121 from Gallaecio/fix-version-bumping
kmike Aug 8, 2023
75808e3
Test the minimum dependencies of the provider extra (#122)
Gallaecio Aug 8, 2023
a330b4b
Clarify how status codes are reflected in stats
Gallaecio Aug 10, 2023
51823c9
Merge pull request #123 from scrapy-plugins/stat-status-codes
kmike Aug 10, 2023
8b1411b
Do not warn about dropping the default Accept-Encoding header
Gallaecio Aug 21, 2023
b8a2fa4
Merge pull request #126 from scrapy-plugins/dont-warn-on-default-headers
kmike Aug 23, 2023
a82ccbb
Cover 0.11.1 in the release notes
Gallaecio Aug 24, 2023
deb1826
Fix a typo
Gallaecio Aug 24, 2023
20046c1
Merge pull request #127 from Gallaecio/release-notes
kmike Aug 24, 2023
6fd143b
Set a release date for 0.11.1
Gallaecio Aug 25, 2023
cd47399
Bump version: 0.11.0 → 0.11.1
Gallaecio Aug 25, 2023
b9eb5d1
Track top-level request args in stats
Gallaecio Aug 25, 2023
21926eb
Apply special treatment for experimental features
Gallaecio Aug 30, 2023
d61cb68
Merge pull request #128 from Gallaecio/request-arg-stats
kmike Aug 31, 2023
0282e00
Implement ZYTE_API_PROVIDER_PARAMS
Gallaecio Sep 26, 2023
af1e223
Change the code-block syntax for one that works on GitHub
Gallaecio Sep 26, 2023
c5f02e0
Merge pull request #133 from Gallaecio/provider-geolocation
kmike Sep 26, 2023
f2dac4e
Add release notes for 0.12.0
Gallaecio Sep 26, 2023
207fed4
Merge pull request #134 from Gallaecio/release-notes
kmike Sep 26, 2023
c60c48a
Set a release date for 0.12.0
Gallaecio Sep 26, 2023
0918370
Bump version: 0.11.1 → 0.12.0
Gallaecio Sep 26, 2023
b4ce6e0
change ellipsis to 3 dots
BurnzZ Sep 28, 2023
d50b14d
Identify scrapy-zyte-api usage via custom user-agent (#130)
PyExplorer Sep 29, 2023
6de738a
updates for 0.12.1 release (#137)
PyExplorer Sep 29, 2023
1bd2e25
update date for the release 0.12.1
PyExplorer Sep 29, 2023
b110833
Bump version: 0.12.0 → 0.12.1
PyExplorer Sep 29, 2023
3db30f1
Merge pull request #136 from scrapy-plugins/ellipsis
kmike Oct 4, 2023
60e2265
remove unused extration options
kmike Oct 4, 2023
2cabbe0
Initial support for typing.Annotated.
wRAR Oct 11, 2023
ab34b93
Switch to scrapy_poet.AnnotatedResult, improve the test.
wRAR Oct 13, 2023
ecc42d4
Forbid multiple extractFrom.
wRAR Oct 16, 2023
7f15a9c
document the new behavior
kmike Oct 18, 2023
8ee4e24
Update README.rst
kmike Oct 19, 2023
ad39749
Merge pull request #138 from scrapy-plugins/remove-extraction-options
kmike Oct 19, 2023
bb54a28
0.12.2 release notes
kmike Oct 19, 2023
1370505
Bump version: 0.12.1 → 0.12.2
kmike Oct 19, 2023
fbf37f0
log Zyte API request id on errors
BurnzZ Oct 23, 2023
600a571
use new RequestError.request_id attribute
BurnzZ Oct 25, 2023
80ae0a0
bump zyte-api dep: 0.4.7 → 0.4.8
BurnzZ Nov 2, 2023
63228ec
Use repr when logging request-id
BurnzZ Nov 3, 2023
5e205ef
Merge pull request #142 from scrapy-plugins/log-request-id
BurnzZ Nov 3, 2023
0b58307
Test that different extraction outputs generate different fingerprint…
Gallaecio Nov 6, 2023
438df61
Clarify the max requests docs
Gallaecio Nov 7, 2023
b73bfc2
Update README.rst
Gallaecio Nov 7, 2023
b395711
Merge pull request #146 from Gallaecio/max-request-docs
kmike Nov 8, 2023
e55e99c
Set a close reason for bad key and suspended account scenarios
Gallaecio Nov 14, 2023
87bd23a
Merge remote-tracking branch 'origin/main' into annotated-support
wRAR Nov 14, 2023
4b5c088
Fix removing *Options.
wRAR Nov 14, 2023
aa3f77b
Update the pinned zyte-api version in CI (#149)
wRAR Nov 14, 2023
b63ad66
Fixes and improvements.
wRAR Nov 14, 2023
15fc06e
Fix CI issues.
wRAR Nov 14, 2023
08a5f48
More fixes.
wRAR Nov 14, 2023
e0ba154
Install more deps for the mypy test.
wRAR Nov 14, 2023
37e3517
Merge pull request #148 from Gallaecio/bad-auth
kmike Nov 14, 2023
2d936fd
Fix an old typing issue.
wRAR Nov 15, 2023
b892453
Use Sphinx and ReadTheDocs for the documentation (#150)
Gallaecio Nov 20, 2023
493a48c
Drop the Set-Cookie header (#132)
Gallaecio Nov 24, 2023
dc09ac3
Improve forbidden domain handling (#147)
Gallaecio Nov 28, 2023
702cc63
Implement a parameter map (#151)
Gallaecio Nov 30, 2023
d701f48
Fail on SPM + transparent mode (#152)
Gallaecio Nov 30, 2023
cad9c3a
Merge remote-tracking branch 'origin/main' into annotated-support
wRAR Dec 12, 2023
e9be1a3
Bump the andi version.
wRAR Dec 12, 2023
3ee24f0
Fix and improve extractFrom tests.
wRAR Dec 12, 2023
2b0d659
Roll back capture_exceptions.
wRAR Dec 12, 2023
de01fc2
Bump the scrapy-poet version.
wRAR Dec 12, 2023
412475e
Bump the web-poet version.
wRAR Dec 12, 2023
5305b46
Move ExtractFrom into scrapy_zyte_api/_annotations.py.
wRAR Dec 12, 2023
f5bb894
Add docs for ExtractFrom.
wRAR Dec 12, 2023
11a9b03
Merge pull request #141 from scrapy-plugins/annotated-support
wRAR Dec 12, 2023
a80fc0b
Release notes for 0.13.0.
wRAR Dec 13, 2023
fab5e58
Merge pull request #153 from scrapy-plugins/relnotes-0.13.0
wRAR Dec 13, 2023
0536f42
Bump version: 0.12.2 → 0.13.0
wRAR Dec 13, 2023
a5f909c
Merge remote-tracking branch 'scrapy-plugins/main' into provider-fix-…
Gallaecio Dec 27, 2023
857b51c
Require scrapy-poet >= 0.19.0
Gallaecio Dec 27, 2023
6 changes: 4 additions & 2 deletions .bumpversion.cfg
@@ -1,7 +1,9 @@
[bumpversion]
current_version = 0.9.0
current_version = 0.13.0
commit = True
tag = True
tag_name = {new_version}

[bumpversion:file:setup.py]
[bumpversion:file:docs/conf.py]

[bumpversion:file:scrapy_zyte_api/__version__.py]
20 changes: 15 additions & 5 deletions .github/workflows/test.yml
@@ -34,10 +34,20 @@ jobs:
- python-version: '3.10'
- python-version: '3.11'

- python-version: '3.8'
toxenv: pinned-provider
- python-version: '3.11'
toxenv: provider

- python-version: '3.7'
toxenv: pinned-extra
- python-version: '3.11'
toxenv: extra

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
@@ -57,12 +67,12 @@ jobs:
fail-fast: false
matrix:
python-version: ["3.11"]
tox-job: ["mypy", "flake8", "twine-check"]
tox-job: ["mypy", "linters", "twine-check", "docs"]

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
1 change: 1 addition & 0 deletions .gitignore
@@ -8,3 +8,4 @@ docs/_build
*.egg-info/
__pycache__/
/test-results/
build/
3 changes: 0 additions & 3 deletions .isort.cfg

This file was deleted.

23 changes: 8 additions & 15 deletions .pre-commit-config.yaml
@@ -1,23 +1,16 @@
repos:
- repo: https://github.com/pre-commit/mirrors-isort
rev: v5.7.0
- repo: https://github.com/PyCQA/isort
rev: 5.12.0
hooks:
- id: isort
- repo: https://github.com/ambv/black
rev: 22.3.0
- repo: https://github.com/psf/black
rev: 23.7.0
hooks:
- id: black
language_version: python3.8
additional_dependencies:
- click<8.1
- repo: https://github.com/pycqa/flake8
rev: 3.8.4
hooks:
- id: flake8
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.991
rev: 6.0.0
hooks:
- id: mypy
- id: flake8
additional_dependencies:
- types-setuptools
args: [--check-untyped-defs, --ignore-missing-imports, --no-warn-no-return]
- flake8-docstrings
- flake8-print
12 changes: 12 additions & 0 deletions .readthedocs.yml
@@ -0,0 +1,12 @@
version: 2
formats: all
sphinx:
  configuration: docs/conf.py
build:
  os: ubuntu-22.04
  tools:
    python: "3.11" # Keep in sync with .github/workflows/test.yml
python:
  install:
    - requirements: docs/requirements.txt
    - path: .
140 changes: 136 additions & 4 deletions CHANGES.rst
@@ -1,6 +1,138 @@
Changes
=======

0.13.0 (2023-12-13)
-------------------

* Updated requirement versions:

* andi >= 0.5.0
* scrapy-poet >= 0.18.0
* web-poet >= 0.15.1
* zyte-api >= 0.4.8

* The spider is now closed and the finish reason is set to
``"zyte_api_bad_key"`` or ``"zyte_api_suspended_account"`` when receiving
"Authentication Key Not Found" or "Account Suspended" responses from Zyte
API.

* The spider is now closed and the finish reason is set to
``"failed_forbidden_domain"`` when all start requests fail because they are
pointing to domains forbidden by Zyte API.

* The spider is now closed and the finish reason is set to
``"plugin_conflict"`` if both scrapy-zyte-smartproxy and the transparent mode
of scrapy-zyte-api are enabled.

* The ``extractFrom`` extraction option can now be requested by annotating the
dependency with a ``scrapy_zyte_api.ExtractFrom`` member (e.g.
``product: typing.Annotated[Product, ExtractFrom.httpResponseBody]``).

* The ``Set-Cookie`` header is now removed from the response if the cookies
were returned by Zyte API (as ``"experimental.responseCookies"``).

* The request fingerprinting was improved by refining which parts of the
request affect the fingerprint.

* Zyte API Request IDs are now included in the error logs.

* Split README.rst into multiple documentation files and publish them on
ReadTheDocs.

* Improve the documentation for the ``ZYTE_API_MAX_REQUESTS`` setting.

* Test and CI improvements.
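
Editorial illustration, not part of CHANGES.rst: the ``extractFrom`` entry above relies on ``typing.Annotated`` metadata. The sketch below shows, under stated assumptions, how such an annotation can be written and read back. ``ExtractFrom`` and ``Product`` here are local stand-ins for ``scrapy_zyte_api.ExtractFrom`` and ``zyte_common_items.Product``, and ``extract_from_of`` is a hypothetical helper; none of this is the plugin's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Annotated, get_type_hints


class ExtractFrom(str, Enum):
    # Stand-in for scrapy_zyte_api.ExtractFrom (assumption); only the
    # httpResponseBody member is confirmed by the changelog entry.
    httpResponseBody = "httpResponseBody"


@dataclass
class Product:
    # Stand-in for zyte_common_items.Product (assumption).
    url: str


def parse_product(product: Annotated[Product, ExtractFrom.httpResponseBody]) -> str:
    # The annotation carries the extraction source as metadata; the
    # function body only sees a regular Product instance.
    return product.url


def extract_from_of(func, arg_name):
    """Hypothetical helper: return the ExtractFrom member attached to an argument, or None."""
    hint = get_type_hints(func, include_extras=True)[arg_name]
    for meta in getattr(hint, "__metadata__", ()):
        if isinstance(meta, ExtractFrom):
            return meta
    return None
```

In a real project the two stand-in classes would presumably be replaced by imports from the named packages, and the annotated dependency would be resolved by the scrapy-poet provider rather than by a helper like this one.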

0.12.2 (2023-10-19)
-------------------

* Unused ``<data type>Options`` (e.g. ``productOptions``) are now dropped
from ``ZYTE_API_PROVIDER_PARAMS`` when sending the Zyte API request.
* When logging Zyte API requests, truncation now uses
"..." instead of Unicode ellipsis.

0.12.1 (2023-09-29)
-------------------

* The new ``_ZYTE_API_USER_AGENT`` setting allows customizing the user agent
string reported to Zyte API.

Note that this setting is only meant for libraries and frameworks built on
top of scrapy-zyte-api, to report themselves to Zyte API, for client software
tracking and monitoring purposes. The value of this setting is *not* the
``User-Agent`` header sent to upstream websites when using Zyte API.


0.12.0 (2023-09-26)
-------------------

* A new ``ZYTE_API_PROVIDER_PARAMS`` setting allows setting Zyte API
parameters, like ``geolocation``, to be included in all Zyte API requests by
the scrapy-poet provider.

* A new ``scrapy-zyte-api/request_args/<parameter>`` stat counts the number of
requests containing a given Zyte API request parameter. For example,
``scrapy-zyte-api/request_args/url`` counts the number of Zyte API requests
with the URL parameter set (which should be all of them).

Experimental is treated as a namespace, and its parameters are the ones
counted, i.e. there is no ``scrapy-zyte-api/request_args/experimental`` stat,
but there are stats like
``scrapy-zyte-api/request_args/experimental.responseCookies``.
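
Editorial illustration, not part of CHANGES.rst: the naming rule for the ``request_args`` stats described above can be sketched as a small function. ``request_arg_stat_keys`` is a hypothetical name and the code only mirrors the rule as worded in the entry (top-level parameters are counted, ``experimental`` is expanded as a namespace); it is not the plugin's implementation.

```python
STAT_PREFIX = "scrapy-zyte-api/request_args/"


def request_arg_stat_keys(params: dict) -> list:
    """Hypothetical helper: list the stat keys that one request's parameters map to."""
    keys = []
    for name, value in params.items():
        if name == "experimental" and isinstance(value, dict):
            # "experimental" is a namespace: count its parameters,
            # not the namespace itself.
            keys.extend(f"{STAT_PREFIX}experimental.{sub}" for sub in value)
        else:
            keys.append(f"{STAT_PREFIX}{name}")
    return keys


example_params = {
    "url": "https://example.com",
    "experimental": {"responseCookies": True},
}
example_keys = request_arg_stat_keys(example_params)
```

For the example parameters, the helper yields one key for ``url`` and one for ``experimental.responseCookies``, and no key for ``experimental`` on its own.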


0.11.1 (2023-08-25)
-------------------

* scrapy-zyte-api 0.11.0 accidentally increased the minimum required version of
scrapy-poet from 0.10.0 to 0.11.0. We have reverted that change and
implemented measures to prevent similar accidents in the future.

* Automatic parameter mapping no longer warns about dropping the
``Accept-Encoding`` header when the header value matches the Scrapy default.

* The README now mentions additional changes that may be necessary when
switching Twisted reactors on existing projects.

* The README now explains how status codes, from Zyte API or from wrapped
responses, are reflected in Scrapy stats.

0.11.0 (2023-08-07)
-------------------

* Added a ``ZYTE_API_MAX_REQUESTS`` setting to limit the number of successful
Zyte API requests that a spider can send. Reaching the limit stops the
spider.

* Setting ``requestCookies`` to ``[]`` in the ``zyte_api_automap`` request
metadata field now triggers a warning.
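
Editorial illustration, not part of CHANGES.rst: a minimal settings fragment for the ``ZYTE_API_MAX_REQUESTS`` entry above, assuming a Scrapy project where scrapy-zyte-api is already configured. The value is an arbitrary example.

```python
# settings.py (fragment)
# Stop the spider once this many successful Zyte API requests have been
# sent. 1000 is an arbitrary example value, not a recommended default.
ZYTE_API_MAX_REQUESTS = 1000
```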

0.10.0 (2023-07-14)
-------------------

* Added more data types to the scrapy-poet provider:

* ``zyte_common_items.ProductList``
* ``zyte_common_items.ProductNavigation``
* ``zyte_common_items.Article``
* ``zyte_common_items.ArticleList``
* ``zyte_common_items.ArticleNavigation``

* Moved the new dependencies added in 0.9.0 and needed only for the scrapy-poet
provider (``scrapy-poet``, ``web-poet``, ``zyte-common-items``) into the new
optional feature ``[provider]``.

* Improved result caching in the scrapy-poet provider.

* Added a new setting, ``ZYTE_API_USE_ENV_PROXY``, which can be set to ``True``
to access Zyte API using a proxy configured in the local environment.

* Fixed getting the Scrapy Cloud job ID.

* Improved the documentation.

* Improved the CI configuration.

0.9.0 (2023-06-13)
------------------

@@ -71,7 +203,7 @@ Changes
cookiejar of the request.

* A new boolean setting, ``ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED``, can be
set to ``True`` to enable automated mapping of cookies from a request
set to ``True`` to enable automatic mapping of cookies from a request
cookiejar into the ``experimental.requestCookies`` Zyte API parameter.

* ``ZyteAPITextResponse`` is now a subclass of ``HtmlResponse``, so that the
@@ -143,10 +275,10 @@ When upgrading, you should set the following in your Scrapy settings:
be set to ``True`` to make all requests use Zyte API by default, with request
parameters being automatically mapped to Zyte API parameters.
* Add a Request meta key, ``zyte_api_automap``, that can be used to enable
automated request parameter mapping for specific requests, or to modify the
outcome of automated request parameter mapping for specific requests.
automatic request parameter mapping for specific requests, or to modify the
outcome of automatic request parameter mapping for specific requests.
* Add a ``ZYTE_API_AUTOMAP_PARAMS`` setting, which is a counterpart for
``ZYTE_API_DEFAULT_PARAMS`` that applies to requests where automated request
``ZYTE_API_DEFAULT_PARAMS`` that applies to requests where automatic request
parameter mapping is enabled.
* Add the ``ZYTE_API_SKIP_HEADERS`` and ``ZYTE_API_BROWSER_HEADERS`` settings
to control the automatic mapping of request headers.