
docs: blog How to scrape Google search results with Python #2739

Open

Mantisus wants to merge 9 commits into master

Conversation

@Mantisus (Author) commented Nov 7, 2024

Draft article for @souravjain540

@souravjain540 (Collaborator)

please use this format for blog PR titles: `docs: blog title`

@B4nan (Member) commented Nov 13, 2024

more importantly, the article is missing the frontmatter completely (the "header" if you want)

https://github.com/apify/crawlee/blob/master/website/blog/2024/11-10-web-scraping-tips/index.md?plain=1#L1-L8

@Mantisus changed the title from "add draft for blog How to scrape Google search results with Python" to "docs: blog How to scrape Google search results with Python" on Nov 13, 2024
@Mantisus (Author)

> more importantly, the article is missing the frontmatter completely (the "header" if you want)
>
> https://github.com/apify/crawlee/blob/master/website/blog/2024/11-10-web-scraping-tips/index.md?plain=1#L1-L8

Yes, this block is usually added by @souravjain540.

That's why I didn't include it in the draft version.


If Google search isn't going anywhere in the coming years, then tasks like [`SERP Analysis`](https://www.semrush.com/blog/serp-analysis/), SEO optimization, and evaluating your site's search ranking remain highly relevant - and for these, we need effective tools.

That's why in this blog, we'll explore creating a Google search results crawler with [`Crawlee for Python`](https://crawlee.dev/python). This will be particularly useful if you're conducting a small data analysis project, analyzing search results for your business, or writing an [article about Google ranking analysis](https://backlinko.com/search-engine-ranking). Rather ironically, don't you think, considering that the largest known crawler belongs to Google itself?
Collaborator:

no need to give a website reference in a blog published on the same website, change it to GitHub

website/blog/2024/google-search-crawler-draft/index.md (outdated, resolved)

![Check Html](./img/check_html.webp)

Based on the data obtained from the page, all necessary information is present in the HTML code. Therefore, we can use [`beautifulsoup_crawler`](https://crawlee.dev/python/docs/examples/beautifulsoup-crawler).
Collaborator:

whenever mentioning crawlee.dev, use www.crawlee.dev - there is some error with Docusaurus not being able to mention the URL with www when it's about crawlee
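
For context while reading this thread, a minimal `BeautifulSoupCrawler` setup looks roughly like the sketch below (the import path follows the snippets later in this PR; the URL and handler body are illustrative only):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Keep the crawl tiny for a quick smoke test.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # context.soup is the parsed BeautifulSoup document of the fetched page.
        title = context.soup.title.string if context.soup.title else 'n/a'
        context.log.info(f'Title of {context.request.url}: {title}')

    await crawler.run(['https://www.google.com/search?q=crawlee'])


if __name__ == '__main__':
    asyncio.run(main())
```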


#### 3. Write Selectors Manually

For any project that needs to be reproduced more than once, or any large-scale website project that might have multiple page layouts, you should avoid using selectors obtained from the browser's "Copy XPath/selector" feature. Selectors built from absolute paths are not stable against page changes. You should write selectors yourself, so familiarize yourself with the basics of [CSS](https://www.w3schools.com/cssref/css_selectors.php) and [XPath](https://www.w3schools.com/xml/xpath_syntax.asp) syntax.
Collaborator:

new learning for me too, can we explain more on the why part?

@Mantisus (Author) commented Nov 20, 2024

Here are two selectors obtained from the dev tools, one for Google and the other for g2.com:

```
//*[@id="rso"]/div[2]/div[8]/div/div/div/div[2]/div
/html/body/div[5]/div/div/div[1]/div/div[5]/div/div[2]/div[5]/div[1]
```

The first one looks so-so - at least it uses @id - but if tomorrow Google updates the page markup and drops even one div, the selector will stop working or return wrong data.

For g2.com it just looks terrible.

These selectors work well for some quick tests here and now, but no more.
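
For contrast, a hand-written selector anchored on meaningful attributes survives markup changes far better than an absolute path. A small sketch with made-up HTML and selectors, just to illustrate the difference:

```python
from bs4 import BeautifulSoup

# Illustrative markup only.
html = """
<div id="search">
  <div class="result"><h3>First result</h3></div>
  <div class="result"><h3>Second result</h3></div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Brittle, DevTools-style: depends on the exact position in the tree and breaks
# as soon as a div is added, removed, or reordered.
brittle = soup.select('div#search > div:nth-of-type(2) > h3')

# Hand-written: anchored on stable, meaningful attributes instead of position.
robust = soup.select('div#search div.result h3')

print([h.get_text(strip=True) for h in robust])  # ['First result', 'Second result']
```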

Comment on lines 49 to 63
### Site Specifics

#### 1. Results are Personalized

Google tries to provide data that's useful to you based on the information it has. This is crucial when working with Google, and you should *always* keep it in mind.

#### 2. Do You Need Personalized Data?

If you want to analyze data that's maximally relevant to your search results, use your main browser's `cookies`.

#### 3. Location Matters

Your IP address's geolocation will influence Google search results.

#### 4. Don't Forget About [`advanced search`](https://www.google.com/advanced_search)
Collaborator:

this can simply be a list, no need for sub headings
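
As a side note on point 4 in the quoted section, advanced-search operators can be folded directly into the query URL. A hypothetical example (the operator and parameter here are chosen for illustration, not taken from the draft):

```python
from urllib.parse import quote_plus

# Advanced-search operators such as `site:` go straight into the q parameter.
query = 'web crawling site:crawlee.dev'
url = f'https://www.google.com/search?q={quote_plus(query)}'

# The hl parameter pins the interface language, which helps keep results comparable.
url += '&hl=en'

print(url)  # https://www.google.com/search?q=web+crawling+site%3Acrawlee.dev&hl=en
```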

```
poetry install
```

Use whatever tools you're comfortable with - perhaps [`pyenv`](https://github.com/pyenv/pyenv) + [`uv`](https://github.com/astral-sh/uv) or something else.
Collaborator:

let's ask them to use the tools we want, as this is a tutorial

Comment on lines 90 to 122
Let's implement the necessary code:

```python
# crawlee-google-search.main

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee import Request, ConcurrencySettings

from .routes import router

QUERIES = ["Apify", "Crawlee"]

CRAWL_DEPTH = 2


async def main() -> None:
    """The crawler entry point."""
    # Keep concurrency low and retries minimal to reduce the chance of getting blocked.
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=20),
        max_requests_per_crawl=100,
    )

    # One start request per query; user_data carries the depth counters and the
    # original query so the request handler can use them later.
    requests_lists = [
        Request.from_url(
            f"https://www.google.com/search?q={query}",
            user_data={
                "current_depth": 0,
                "crawl_depth": CRAWL_DEPTH,
                "query": query,
            },
            headers={
                "referer": "https://www.google.com/",
                # ...
```
Collaborator:

I am missing code explanations and a breakdown, we cannot just put the whole code at once and call it a tutorial :D

valid for the rest of the code snippets too.

@Mantisus (Author)

Updated the code snippets.

Also updated the code to better match the latest version of crawlee-python.
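
Since the quoted snippet cuts off before the `routes` module it imports, here is a hypothetical sketch of what such a router could look like; the selectors, field names, and pagination handling are illustrative only and not taken from the draft:

```python
# routes.py - hypothetical sketch, not the draft's actual implementation.
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Extract result titles and follow pagination up to the configured crawl depth."""
    user_data = context.request.user_data
    current_depth = int(user_data['current_depth'])
    crawl_depth = int(user_data['crawl_depth'])

    # Illustrative selector only - real Google markup changes frequently.
    for position, title in enumerate(context.soup.select('h3'), start=1):
        await context.push_data({
            'query': user_data['query'],
            'position': position,
            'title': title.get_text(strip=True),
        })

    if current_depth < crawl_depth:
        # "Next page" link; the id is illustrative and may differ in practice.
        next_link = context.soup.select_one('a#pnnext')
        if next_link and next_link.get('href'):
            await context.add_requests([
                Request.from_url(
                    f'https://www.google.com{next_link["href"]}',
                    user_data={**user_data, 'current_depth': current_depth + 1},
                ),
            ])
```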


In this blog, we've created a Google search crawler that captures ranking data. How you analyze this dataset is entirely up to you!

The code repository is available on [`GitHub`](soon)
Collaborator:

let's not forget to update it.

That's why in this blog, we'll explore creating a Google search results crawler with [`Crawlee for Python`](https://crawlee.dev/python). This will be particularly useful if you're conducting a small data analysis project, analyzing search results for your business, or writing an [article about Google ranking analysis](https://backlinko.com/search-engine-ranking). Rather ironically, don't you think, considering that the largest known crawler belongs to Google itself?

Let's get started!

Collaborator:

somewhere here or later, let's add a note like I did in your previous article.

Collaborator:

what about this?

@@ -0,0 +1,201 @@
# How to scrape Google search results with Python
Collaborator:

let's also add the meta description from your previous articles.

Collaborator:

just copy paste and edit

@souravjain540 (Collaborator)

more review

[image]


@souravjain540 (Collaborator)

also, the folder name should use this format: mm-dd-article-name

@Mantisus marked this pull request as ready for review on November 28, 2024 at 12:16
@souravjain540 (Collaborator)

@B4nan can you please give a final review before I merge :) thanks!

@B4nan (Member) left a comment

just one nit, didn't read the text, sorry, no time for that now

website/blog/2024/11-27-scrape-google-search/index.md (outdated, resolved)
@souravjain540 (Collaborator)

thanks! @Mantisus, let's merge it Monday morning now :)
