Add an input URL list parameter #38

Gallaecio · 2024-02-15T11:34:17Z

Addresses the first half of #36. (I plan to have a separate PR for the ability to specify input URLs that point to the actual start URLs).

To do:

Find out why CI fails.
Confirm we are OK with the approach (\n as URL separator, ~~separate~~ same spider)
Tests
Docs

Tested manually. When specifying the urls parameter from a terminal, mind that "URL\nURL" will not work.
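For illustration, here is a minimal sketch of why "URL\nURL" fails from a terminal. It assumes the shell passes the double-quoted text through unchanged, so the spider receives a backslash followed by the letter n instead of a newline. The example.com-style hosts are placeholders, not URLs from this PR.

```python
# What the spider receives when a shell passes "URL\nURL" unchanged:
# a backslash followed by the letter n, not a newline character.
from_terminal = r"https://a.example\nhttps://b.example"

# What the urls parameter expects: a real newline between the URLs.
with_newline = "https://a.example\nhttps://b.example"

# Splitting on a real newline leaves the terminal value as one item.
print(from_terminal.split("\n"))
print(with_newline.split("\n"))
```

In Bash, ANSI-C quoting such as urls=$'URL\nURL' is one way to pass a real newline. That suggestion is an assumption about the reader's shell, not something tested in this PR.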

codecov-commenter · 2024-02-15T13:48:46Z

Codecov Report

Merging #38 (3481c76) into main (097e0d8) will increase coverage by 0.09%.
The diff coverage is 100.00%.

❗ Current head 3481c76 differs from pull request most recent head 25e2283. Consider uploading reports for the commit 25e2283 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #38      +/-   ##
==========================================
+ Coverage   98.81%   98.90%   +0.09%     
==========================================
  Files          12       12              
  Lines         506      549      +43     
==========================================
+ Hits          500      543      +43     
  Misses          6        6

Files	Coverage Δ
zyte_spider_templates/spiders/base.py	`100.00% <100.00%> (ø)`
zyte_spider_templates/spiders/ecommerce.py	`100.00% <100.00%> (ø)`

PyExplorer · 2024-03-12T08:26:18Z

zyte_spider_templates/spiders/base.py

+        """
+        if isinstance(value, str):
+            value = value.split("\n")
+        if value:


What do we expect value to be when it is not a str instance?

Any of the actually supported types: None or a list of str.
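A rough sketch of the normalization described here, assuming the three input shapes named in this thread (None, a newline-separated str, or a list of str). The function name normalize_url_list and the error message are hypothetical; the real validator in base.py may differ.

```python
from typing import List, Optional, Union


def normalize_url_list(value: Union[str, List[str], None]) -> Optional[List[str]]:
    """Hypothetical stand-in for the validator discussed in this thread."""
    if value is None:
        return None
    if isinstance(value, str):
        # A single string is treated as newline-separated URLs.
        value = value.split("\n")
    # Tolerate surrounding whitespace and skip blank lines.
    urls = [line.strip() for line in value if line.strip()]
    if not urls:
        raise ValueError("No URLs found in the urls parameter.")
    return urls
```

With this sketch, a string input and the equivalent list input produce the same list, which is why only None and a list of str need handling afterwards.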

@Gallaecio, will you add something to make this work?
From your comment in the description: mind that "URL\nURL" will not work.
That is, running the spider like this:
scrapy crawl spider -a urls="https://www.nytimes.com/\nhttps://www.theguardian.com"
I've added this workaround: https://github.com/zytedata/zyte-spider-templates-private/blob/article/zyte_spider_templates/spiders/article.py#L135, but maybe we can handle it in validate_url_list too.

I think supporting that syntax is a bit dangerous, as \n are valid URL characters. Since it will not be an issue in the UI or from code, I am not sure if it is worth supporting.

Maybe we could support spaces as a separator, but we risk scenarios where users expect a space → %20 conversion to happen.

@Gallaecio I've checked https://www.rfc-editor.org/rfc/rfc3986 and haven't found anything about \n being an allowed character, but I might have missed something.

I meant valid characters, i.e. \ is a valid character and n is a valid character. https://example.com/\n is a valid URL.

Got it, makes sense, thanks!
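To make the risk concrete, here is a small sketch of a naive workaround that treats a literal backslash-n as a separator. The helper naive_unescape_split is hypothetical and is not the workaround linked above; the URLs are placeholders.

```python
from typing import List


def naive_unescape_split(value: str) -> List[str]:
    """Hypothetical workaround: treat a literal backslash-n as a separator."""
    return value.replace("\\n", "\n").split("\n")


# Intended use: two URLs joined by a literal backslash-n from the terminal.
print(naive_unescape_split(r"https://a.example/\nhttps://b.example/"))

# Problem: a single URL whose path happens to contain the characters \ and n
# is split into two bogus entries.
print(naive_unescape_split(r"https://example.com/\news"))
```

The second call breaks one URL into two pieces, which is the kind of silent corruption the comment above warns about.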

PyExplorer · 2024-03-12T08:46:01Z

zyte_spider_templates/spiders/base.py

+    @model_validator(mode="after")
+    def single_input(self):
+        """Fields
+        :class:`~zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams.url`


If this is a base class and we are going to use the same approach in Articles, there might be no need to mention ecommerce.EcommerceSpiderParams explicitly.

Once we merge article support, we can refactor the docs to cover shared params elsewhere in the docs, and point there. But for the current state of the repository, this is the best thing we can link to.
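As a rough illustration of what a single-input check like single_input enforces, here is a standard-library sketch. It uses a dataclass instead of the Pydantic model in this PR, and the class name InputParams, the field defaults, and the error message are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputParams:
    """Hypothetical stand-in for the spider parameter model."""

    url: str = ""
    urls: Optional[List[str]] = None

    def __post_init__(self) -> None:
        # Collect the names of the mutually exclusive inputs that were set.
        provided = [name for name in ("url", "urls") if getattr(self, name)]
        if len(provided) != 1:
            raise ValueError(
                f"Expected exactly one of url or urls, got: {provided or 'none'}"
            )
```

Constructing InputParams with only url or only urls succeeds, while passing both or neither raises ValueError.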

BurnzZ

LGTM. A few thoughts:

Making this available in the ecommerce template might mean that the upcoming article spider could be released without the max_requests_per_seed parameter. We can do it after the article launch. What do you think @PyExplorer ?

zyte_spider_templates/params.py

PyExplorer · 2024-04-29T20:57:53Z

This parameter is quite important for Articles, especially for incremental crawls, as discussed here: https://zytegroup.slack.com/archives/G011GR9M47N/p1712058449481609?thread_ts=1712047414.333929&cid=G011GR9M47N

kmike · 2024-08-19T14:14:44Z

Is there anything left to do in this PR, besides resolving the conflicts?

Implement an alternative e-commerce spider with multiple start URL support

3b8ae73

Gallaecio requested review from kmike, wRAR, BurnzZ, VMRuiz and proway2 February 15, 2024 11:34

Fix test_start_requests

1ebf3f9

BurnzZ reviewed Feb 15, 2024

zyte_spider_templates/spiders/ecommerce.py (3 outdated threads, resolved)

Allow extra spacing in urls

350f626

PyExplorer reviewed Feb 16, 2024

zyte_spider_templates/spiders/ecommerce.py (outdated, resolved)

Gallaecio added 2 commits February 16, 2024 19:08

Merge remote-tracking branch 'zytedata/main' into start-urls

488c734

urls: perform URL and emptiness validation

3f39de4

Gallaecio mentioned this pull request Feb 19, 2024

Define standard params #34

Closed

6 tasks

Gallaecio added 5 commits March 11, 2024 20:26

Merge remote-tracking branch 'zytedata/main' into start-urls

e7580cf

Switch to a single spider, use new widgets

ff5423f

Add docs

25e2283

Black cleanup

c44172c

Black cleanup

9698494

Gallaecio changed the title ~~Implement an alternative e-commerce spider with multiple start URL support~~ Add an input URL list parameter Mar 11, 2024

PyExplorer reviewed Mar 12, 2024

Add exclusiveRequired

3f2cc93

Gallaecio mentioned this pull request Apr 8, 2024

#38 + #41 #48

Closed

Gallaecio added 3 commits April 8, 2024 10:06

Merge remote-tracking branch 'zytedata/main' into start-urls

1dc5c36

Merge remote-tracking branch 'zytedata/main' into start-urls

a2bf6ab

Remove an unused logger

6123775

BurnzZ approved these changes Apr 16, 2024

zyte_spider_templates/params.py (outdated, resolved)

Apply an early return

de22b34

kmike approved these changes Jun 17, 2024

PyExplorer approved these changes Jun 28, 2024

Merge remote-tracking branch 'zytedata/main' into start-urls

fb09cbe

Gallaecio merged commit 0e54634 into zytedata:main Aug 20, 2024
10 checks passed

Gallaecio mentioned this pull request Aug 20, 2024

Allow multiple URLs as input to the spiders #36

Closed

BurnzZ mentioned this pull request Aug 22, 2024

No errors on Python 3.12 when none of url, urls, or urls_file is given #59

Open
