
docs: blog How to scrape Google search results with Python #2739

Open

Mantisus wants to merge 9 commits into master

Conversation

@Mantisus (Author) commented Nov 7, 2024

Draft article for @souravjain540

@souravjain540 (Collaborator)

please use this format for blog PR titles: `docs: blog title`

@B4nan (Member) commented Nov 13, 2024

more importantly, the article is missing the frontmatter completely (the "header" if you want)

https://github.com/apify/crawlee/blob/master/website/blog/2024/11-10-web-scraping-tips/index.md?plain=1#L1-L8

@Mantisus changed the title from "add draft for blog How to scrape Google search results with Python" to "docs: blog How to scrape Google search results with Python" on Nov 13, 2024
@Mantisus (Author)

> more importantly, the article is missing the frontmatter completely (the "header" if you want)
>
> https://github.com/apify/crawlee/blob/master/website/blog/2024/11-10-web-scraping-tips/index.md?plain=1#L1-L8

Yes, this block is usually added by @souravjain540.

That's why I didn't include it in the draft version.


If Google search isn't going anywhere in the coming years, then tasks like [`SERP Analysis`](https://www.semrush.com/blog/serp-analysis/), SEO optimization, and evaluating your site's search ranking remain highly relevant - and for these, we need effective tools.

That's why in this blog, we'll explore creating a Google search results crawler with [`Crawlee for Python`](https://crawlee.dev/python). This will be particularly useful if you're conducting a small data analysis project, analyzing search results for your business, or writing an [article about Google ranking analysis](https://backlinko.com/search-engine-ranking). Rather ironically, don't you think, considering that the largest known crawler belongs to Google itself?
Collaborator:

no need to give a website reference in a blog published on the same website, change it to GitHub

website/blog/2024/google-search-crawler-draft/index.md (outdated, resolved)

![Check Html](./img/check_html.webp)

Based on the data obtained from the page, all necessary information is present in the HTML code. Therefore, we can use [`beautifulsoup_crawler`](https://crawlee.dev/python/docs/examples/beautifulsoup-crawler).
Collaborator:

whenever mentioning crawlee.dev, use www.crawlee.dev - there is some error with Docusaurus not being able to mention the URL with www when it's about crawlee
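
For context while reading this thread, a minimal `BeautifulSoupCrawler` setup looks roughly like the sketch below (the import path follows the snippets later in this PR; the URL and handler body are illustrative only):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Keep the crawl tiny for a quick smoke test.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # context.soup is the parsed BeautifulSoup document of the fetched page.
        title = context.soup.title.string if context.soup.title else 'n/a'
        context.log.info(f'Title of {context.request.url}: {title}')

    await crawler.run(['https://www.google.com/search?q=crawlee'])


if __name__ == '__main__':
    asyncio.run(main())
```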


#### 3. Write Selectors Manually

For any project that needs to be reproduced more than once, or any large-scale website project that might have multiple page layouts, you should avoid using selectors obtained from the browser's "Copy XPath/selector" feature. Selectors built from absolute paths are not stable against page changes. You should write selectors yourself, so familiarize yourself with the basics of [CSS](https://www.w3schools.com/cssref/css_selectors.php) and [XPath](https://www.w3schools.com/xml/xpath_syntax.asp) syntax.
Collaborator:

new learning for me too, can we explain more on the why part?

@Mantisus (Author) commented Nov 20, 2024

Here are two selectors obtained from the dev tools, one for Google and the other for g2.com:

```
//*[@id="rso"]/div[2]/div[8]/div/div/div/div[2]/div
/html/body/div[5]/div/div/div[1]/div/div[5]/div/div[2]/div[5]/div[1]
```

The first one looks so-so - at least it uses @id - but if tomorrow Google updates the page markup and drops even one div, the selector will stop working or return wrong data.

For g2.com it just looks terrible.

These selectors work well for some quick tests here and now, but no more.
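
For contrast, a hand-written selector anchored on meaningful attributes survives markup changes far better than an absolute path. A small sketch with made-up HTML and selectors, just to illustrate the difference:

```python
from bs4 import BeautifulSoup

# Illustrative markup only.
html = """
<div id="search">
  <div class="result"><h3>First result</h3></div>
  <div class="result"><h3>Second result</h3></div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Brittle, DevTools-style: depends on the exact position in the tree and breaks
# as soon as a div is added, removed, or reordered.
brittle = soup.select('div#search > div:nth-of-type(2) > h3')

# Hand-written: anchored on stable, meaningful attributes instead of position.
robust = soup.select('div#search div.result h3')

print([h.get_text(strip=True) for h in robust])  # ['First result', 'Second result']
```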

Comment on lines 49 to 63
### Site Specifics

#### 1. Results are Personalized

Google tries to provide data that's useful to you based on the information it has. This is crucial when working with Google, and you should *always* keep it in mind.

#### 2. Do You Need Personalized Data?

If you want to analyze data that's maximally relevant to your search results, use your main browser's `cookies`.

#### 3. Location Matters

Your IP address's geolocation will influence Google search results.

#### 4. Don't Forget About [`advanced search`](https://www.google.com/advanced_search)
Collaborator:

this can simply be a list, no need for sub headings
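
As a side note on point 4 in the quoted section, advanced-search operators can be folded directly into the query URL. A hypothetical example (the operator and parameter here are chosen for illustration, not taken from the draft):

```python
from urllib.parse import quote_plus

# Advanced-search operators such as `site:` go straight into the q parameter.
query = 'web crawling site:crawlee.dev'
url = f'https://www.google.com/search?q={quote_plus(query)}'

# The hl parameter pins the interface language, which helps keep results comparable.
url += '&hl=en'

print(url)  # https://www.google.com/search?q=web+crawling+site%3Acrawlee.dev&hl=en
```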

```
poetry install
```

Use whatever tools you're comfortable with - perhaps [`pyenv`](https://github.com/pyenv/pyenv) + [`uv`](https://github.com/astral-sh/uv) or something else.
Collaborator:

let's ask them to use the tools we want, as this is a tutorial

Comment on lines 90 to 122
Let's implement the necessary code:

```python
# crawlee-google-search.main

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee import Request, ConcurrencySettings

from .routes import router

QUERIES = ["Apify", "Crawlee"]

CRAWL_DEPTH = 2


async def main() -> None:
    """The crawler entry point."""
    # Keep concurrency low and retries minimal to reduce the chance of getting blocked.
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=20),
        max_requests_per_crawl=100,
    )

    # One start request per query; user_data carries the depth counters and the
    # original query so the request handler can use them later.
    requests_lists = [
        Request.from_url(
            f"https://www.google.com/search?q={query}",
            user_data={
                "current_depth": 0,
                "crawl_depth": CRAWL_DEPTH,
                "query": query,
            },
            headers={
                "referer": "https://www.google.com/",
                # ...
```
Collaborator:

I am missing code explanations and a breakdown, we cannot just put the whole code at once and call it a tutorial :D

valid for the rest of the code snippets too.

@Mantisus (Author)

Updated the code snippets.

Also updated the code to better match the latest version of crawlee-python.
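
Since the quoted snippet cuts off before the `routes` module it imports, here is a hypothetical sketch of what such a router could look like; the selectors, field names, and pagination handling are illustrative only and not taken from the draft:

```python
# routes.py - hypothetical sketch, not the draft's actual implementation.
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Extract result titles and follow pagination up to the configured crawl depth."""
    user_data = context.request.user_data
    current_depth = int(user_data['current_depth'])
    crawl_depth = int(user_data['crawl_depth'])

    # Illustrative selector only - real Google markup changes frequently.
    for position, title in enumerate(context.soup.select('h3'), start=1):
        await context.push_data({
            'query': user_data['query'],
            'position': position,
            'title': title.get_text(strip=True),
        })

    if current_depth < crawl_depth:
        # "Next page" link; the id is illustrative and may differ in practice.
        next_link = context.soup.select_one('a#pnnext')
        if next_link and next_link.get('href'):
            await context.add_requests([
                Request.from_url(
                    f'https://www.google.com{next_link["href"]}',
                    user_data={**user_data, 'current_depth': current_depth + 1},
                ),
            ])
```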


In this blog, we've created a Google search crawler that captures ranking data. How you analyze this dataset is entirely up to you!

The code repository is available on [`GitHub`](soon)
Collaborator:

let's not forget to update it.

That's why in this blog, we'll explore creating a Google search results crawler with [`Crawlee for Python`](https://crawlee.dev/python). This will be particularly useful if you're conducting a small data analysis project, analyzing search results for your business, or writing an [article about Google ranking analysis](https://backlinko.com/search-engine-ranking). Rather ironically, don't you think, considering that the largest known crawler belongs to Google itself?

Let's get started!

Collaborator:

somewhere here or later, let's add a note like I did in your previous article.

Collaborator:

what about this?

@@ -0,0 +1,201 @@
# How to scrape Google search results with Python
Collaborator:

let's also add the meta description from your previous articles.

Collaborator:

just copy paste and edit

@souravjain540 (Collaborator)

more review

[image]


@souravjain540 (Collaborator)

also, the folder name should use this format: mm-dd-article-name

@Mantisus marked this pull request as ready for review on November 28, 2024 at 12:16
@souravjain540 (Collaborator)

@B4nan can you please give a final review before I merge :) thanks!

@B4nan (Member) left a comment

just one nit, didn't read the text, sorry, no time for that now

website/blog/2024/11-27-scrape-google-search/index.md (outdated, resolved)
@souravjain540 (Collaborator)

thanks! @Mantisus, let's merge it Monday morning now :)
