Discussion: Handle data type of `response` data type in HeuristicsProductNavigationPage (`BrowserResponse` vs `HttpResponse`) #25

BurnzZ · 2024-01-10T07:08:01Z

Problem

Currently, HeuristicsProductNavigationPage uses HttpResponse as one of the dependencies (code ref). This causes the spider to send 2 separate requests to ZAPI if the extract_from parameter is set to "browserHtml":

{"productNavigationOptions": {"extractFrom": "browserHtml"}, "productNavigation": true, "url": "https://books.toscrape.com/"}
{"httpResponseBody": true, "url": "https://books.toscrape.com/"}

This is also true when it's left to the default value which is extract_from=None.

Note that this issue is not related to scrapy-plugins/scrapy-zyte-api#91 since scrapinghub/scrapy-poet#175 has addressed it.

The problem that we face is that the dependency of HeuristicsProductNavigationPage should be dynamic depending on the extract_from input value:

response: BrowserResponse, if extract_from=browserHtml
response: HttpResponse, if extract_from=httpResponseBody

The text was updated successfully, but these errors were encountered:

kmike · 2024-01-10T10:18:06Z

I think scrapinghub/andi#30 should allow to express this kind of dependencies in page objects.

BurnzZ · 2024-01-10T12:29:14Z

Yes, it would 👍

Though some new interface in page objects should be made to allow declaration of dynamic dependencies. For example, depending on the extract_from value, it would logically switch between BrowserResponse or HttpResponse.

kmike · 2024-01-10T15:57:20Z

For example, depending on the extract_from value, it would logically switch between BrowserResponse or HttpResponse.

Yeah, this case is more tricky than the original use case. I was thinking that it still can be handled implicitly: page object defines response: Reuse(BrowserResponse) | HttpResponse dependency (or whatever syntax is). Then, scrapy-poet +scrapy-zyte-api handle it - not just for the case BrowserResponse dependency is defined somewhere explicitly, but also for the case ExtractFrom dependency is used.

PyExplorer · 2024-01-16T14:22:52Z

@BurnzZ I've checked this approach with AnyResponse for Articles and it's actually resolves handling the case with extract_from (browserHtml or httpResponseBody) argument for spider.
But the issue with doubling the requests is still here.

PyExplorer · 2024-01-17T09:09:04Z

After double checking I see that this is working, thanks!
In my case, previously, doubled requests were only for HttpResponse, and for BrowserResponse everything was in one response.
Now, I'm just curious if we can handle HttpResponse the same way as BrowserResponse here https://github.com/scrapy-plugins/scrapy-zyte-api/pull/161/files#diff-67cea92ffa3989fe30ffd7f4bbe868f953810e0f4fdf2ddfe7f18bc54fba505eR81 - at this moment PO with HttpResponse is not even handled by ZyteApiProvider

BurnzZ · 2024-01-17T09:50:13Z

Both browserHtml and httpResponseBody should be handled @PyExplorer . For the latter, was the extract_from spider parameter input updated to httpResponseBody?

PyExplorer · 2024-01-17T10:12:34Z

Yes, extract_from is httpResponseBody, but I mean that here https://github.com/scrapy-plugins/scrapy-zyte-api/pull/161/files#diff-67cea92ffa3989fe30ffd7f4bbe868f953810e0f4fdf2ddfe7f18bc54fba505eR81 in to_provide we don't have HttpResponse (if we asked response: HttpResponse in PO) but we have BrowserResponse (if we ask response: BrowserResponse) in PO

BurnzZ · 2024-01-19T07:52:58Z

in to_provide we don't have HttpResponse (if we asked response: HttpResponse in PO)

The PR in scrapy-plugins/scrapy-zyte-api#161 has been updated so that providers can also use the instances created by earlier providers. That is, HttpResponse is made by HttpResponseProvider. Later on, AnyResponse can be created by ZyteApiProvider as it can use the HttpResponse that was made earlier.

The new PR in scrapinghub/scrapy-poet#184 also made this possible.

BurnzZ added the question Further information is requested label Jan 10, 2024

This was referenced Jan 16, 2024

support AnyResponse #28

Merged

add new AnyResponse scrapinghub/web-poet#195

Merged

support AnyResponse scrapy-plugins/scrapy-zyte-api#161

Merged

kmike closed this as completed in #28 Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: Handle data type of `response` data type in HeuristicsProductNavigationPage (`BrowserResponse` vs `HttpResponse`) #25

Discussion: Handle data type of `response` data type in HeuristicsProductNavigationPage (`BrowserResponse` vs `HttpResponse`) #25

BurnzZ commented Jan 10, 2024

kmike commented Jan 10, 2024

BurnzZ commented Jan 10, 2024

kmike commented Jan 10, 2024

PyExplorer commented Jan 16, 2024

PyExplorer commented Jan 17, 2024

BurnzZ commented Jan 17, 2024

PyExplorer commented Jan 17, 2024

BurnzZ commented Jan 19, 2024

Discussion: Handle data type of response data type in HeuristicsProductNavigationPage (BrowserResponse vs HttpResponse) #25

Discussion: Handle data type of response data type in HeuristicsProductNavigationPage (BrowserResponse vs HttpResponse) #25

Comments

BurnzZ commented Jan 10, 2024

Problem

kmike commented Jan 10, 2024

BurnzZ commented Jan 10, 2024

kmike commented Jan 10, 2024

PyExplorer commented Jan 16, 2024

PyExplorer commented Jan 17, 2024

BurnzZ commented Jan 17, 2024

PyExplorer commented Jan 17, 2024

BurnzZ commented Jan 19, 2024

Discussion: Handle data type of `response` data type in HeuristicsProductNavigationPage (`BrowserResponse` vs `HttpResponse`) #25

Discussion: Handle data type of `response` data type in HeuristicsProductNavigationPage (`BrowserResponse` vs `HttpResponse`) #25