Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an input URL list parameter #38

Merged
merged 16 commits into from
Aug 20, 2024
Merged

Conversation

Gallaecio
Copy link
Contributor

@Gallaecio Gallaecio commented Feb 15, 2024

Addresses the first half of #36. (I plan to have a separate PR for the ability to specify input URLs that point to the actual start URLs).

To do:

  • Find out why CI fails.
  • Confirm we are OK with the approach (\n as URL separator, separate same spider)
  • Tests
  • Docs

Tested manually. When specifying the urls parameter from a terminal, mind that "URL\nURL" will not work.

zyte_spider_templates/spiders/ecommerce.py Outdated Show resolved Hide resolved
zyte_spider_templates/spiders/ecommerce.py Outdated Show resolved Hide resolved
zyte_spider_templates/spiders/ecommerce.py Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Feb 15, 2024

Codecov Report

Merging #38 (3481c76) into main (097e0d8) will increase coverage by 0.09%.
The diff coverage is 100.00%.

❗ Current head 3481c76 differs from pull request most recent head 25e2283. Consider uploading reports for the commit 25e2283 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #38      +/-   ##
==========================================
+ Coverage   98.81%   98.90%   +0.09%     
==========================================
  Files          12       12              
  Lines         506      549      +43     
==========================================
+ Hits          500      543      +43     
  Misses          6        6              
Files Coverage Δ
zyte_spider_templates/spiders/base.py 100.00% <100.00%> (ø)
zyte_spider_templates/spiders/ecommerce.py 100.00% <100.00%> (ø)

@Gallaecio Gallaecio mentioned this pull request Feb 19, 2024
6 tasks
@Gallaecio Gallaecio changed the title Implement an alternative e-commerce spider with multiple start URL support Add an input URL list parameter Mar 11, 2024
"""
if isinstance(value, str):
value = value.split("\n")
if value:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we expect to have is it's not str instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any of the actually supported types: None or a list of str.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gallaecio will you add something to make this work?
mind that "URL\nURL" will not work - from your comment in description.
or running like
scrapy crawl spider -a urls="https://www.nytimes.com/\nhttps://www.theguardian.com"
I've added this workaround https://github.com/zytedata/zyte-spider-templates-private/blob/article/zyte_spider_templates/spiders/article.py#L135, but might be we can handle it in validate_url_list too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think supporting that syntax is a bit dangerous, as \n are valid URL characters. Since it will not be an issue in the UI or from code, I am not sure if it is worth supporting.

Maybe we could support spaces as a separator, but we risk scenarios were users expect a space → %20 conversion to happen.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gallaecio I've checked https://www.rfc-editor.org/rfc/rfc3986 and haven't found anything about \n as allowed character - but might be missed something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant valid characters, i.e. \ is a valid character and n is a valid character. https://example.com/\n is a valid URL.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, makes sense, thanks!

@model_validator(mode="after")
def single_input(self):
"""Fields
:class:`~zyte_spider_templates.spiders.ecommerce.EcommerceSpiderParams.url`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's base class and we are going to use the same approach in Articles - in this case, might be no need to mention ecommerce.EcommerceSpiderParams explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we merge article support, we can refactor the docs to cover shared params elsewhere in the docs, and point there. But for the current state of the repository, this is the best thing we can link to.

@Gallaecio Gallaecio mentioned this pull request Apr 8, 2024
Copy link
Contributor

@BurnzZ BurnzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A few thoughts:

Making this available in the ecommerce template might mean that the upcoming article spider could be released without the max_requests_per_seed parameter. We can do it after the article launch. What do you think @PyExplorer ?

zyte_spider_templates/params.py Outdated Show resolved Hide resolved
@PyExplorer
Copy link

LGTM. A few thoughts:

Making this available in the ecommerce template might mean that the upcoming article spider could be released without the max_requests_per_seed parameter. We can do it after the article launch. What do you think @PyExplorer ?

This parameter is quite important for Articles, especially for incremental crawl - discussed here https://zytegroup.slack.com/archives/G011GR9M47N/p1712058449481609?thread_ts=1712047414.333929&cid=G011GR9M47N

@kmike
Copy link
Contributor

kmike commented Aug 19, 2024

Is there anything left to do in this PR, besides resolving the conflicts?

@Gallaecio Gallaecio merged commit 0e54634 into zytedata:main Aug 20, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants