docs: blog How to scrape Google search results with Python #2739
base: master
Conversation
please use the PR title for blogs in this format
more importantly, the article is missing the frontmatter completely (the "header", if you want)
How to scrape Google search results with Python
Yes, @souravjain540, this block is usually added later. That's why I didn't include it in the draft version.
If Google search isn't going anywhere in the coming years, then tasks like [`SERP Analysis`](https://www.semrush.com/blog/serp-analysis/), SEO optimization, and evaluating your site's search ranking remain highly relevant - and for these, we need effective tools.
That's why in this blog, we'll explore creating a Google search results crawler with [`Crawlee for Python`](https://crawlee.dev/python). This will be particularly useful if you're conducting a small data analysis project, analyzing search results for your business, or writing an [article about Google ranking analysis](https://backlinko.com/search-engine-ranking). Rather ironically, don't you think, considering that the largest known crawler belongs to Google itself?
no need to give a website reference in a blog published on the same website; change it to GitHub
![Check Html](./img/check_html.webp)
Based on the data obtained from the page, all necessary information is present in the HTML code. Therefore, we can use [`beautifulsoup_crawler`](https://crawlee.dev/python/docs/examples/beautifulsoup-crawler).
whenever mentioning crawlee.dev, use www.crawlee.dev - there is some error with Docusaurus not being able to handle the URL without www when it's about crawlee
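To verify for yourself that the necessary information really is in the static HTML, you can fetch the page without a browser and look for a known result string. Below is a minimal sketch using only the standard library (the query and user-agent values are arbitrary examples, and Google may still rate-limit plain HTTP clients):

```python
from urllib.request import Request, urlopen

# Fetch the raw server response; no JavaScript runs here.
req = Request(
    "https://www.google.com/search?q=crawlee",
    headers={"User-Agent": "Mozilla/5.0"},
)
html = urlopen(req).read().decode("utf-8", errors="replace")

# If a string you saw in the browser is present in this HTML,
# the data is server-rendered and an HTTP crawler is enough.
print("crawlee" in html.lower())
```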
#### 3. Write Selectors Manually
For any project that needs to be reproduced more than once, or any large-scale website project that might have multiple page layouts, you should avoid using selectors obtained from the browser's "Copy XPath/selector" feature. Selectors built with absolute paths are not stable against page changes. You should write selectors yourself, so familiarize yourself with the basics of [CSS](https://www.w3schools.com/cssref/css_selectors.php) and [XPath](https://www.w3schools.com/xml/xpath_syntax.asp) syntax.
new learning for me too, can we explain more on the why part?
Here are two selectors obtained from the dev tools, one for google.com and the other for g2.com:

`//*[@id="rso"]/div[2]/div[8]/div/div/div/div[2]/div`

`/html/body/div[5]/div/div/div[1]/div/div[5]/div/div[2]/div[5]/div[1]`

The first one looks so-so, at least it uses `@id`, but if tomorrow Google updates the page markup by removing or adding even one `div`, the selector will break or return wrong data. The one for g2.com just looks terrible. These selectors work well for some quick tests here and now, but no more.
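To make the difference concrete, here is a minimal sketch (the `result` class and the sample HTML are invented for illustration, not Google's real markup):

```python
from bs4 import BeautifulSoup

html = """
<div id="rso">
  <div class="result"><h3>First result</h3></div>
  <div class="result"><h3>Second result</h3></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Brittle: absolute path in the style of "Copy selector"; it targets
# the second div by position and breaks if any sibling is added or removed.
brittle = soup.select("#rso > div:nth-of-type(2) > h3")

# Robust: a hand-written selector anchored on semantic attributes;
# it survives layout shuffles as long as the class name is stable.
robust = soup.select("#rso div.result h3")

print([h3.get_text() for h3 in robust])  # ['First result', 'Second result']
```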
### Site Specifics
#### 1. Results are Personalized
Google tries to provide data that's useful to you based on the information it has. This is crucial when working with Google, and you should *always* keep it in mind.
#### 2. Do You Need Personalized Data?
If you want to analyze data that's maximally relevant to your search results, use your main browser's `cookies`.
#### 3. Location Matters
Your IP address's geolocation will influence Google search results.
#### 4. Don't Forget About [`advanced search`](https://www.google.com/advanced_search)
this can simply be a list, no need for subheadings
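To make points 2 and 3 above concrete, here is a minimal sketch of attaching your own cookies and headers to a request (the cookie values are placeholders; `Request.from_url` is the same helper used in the crawler code below):

```python
from crawlee import Request

# Placeholder cookie string exported from your main browser.
browser_cookies = "SID=abc123; NID=xyz789"

request = Request.from_url(
    "https://www.google.com/search?q=crawlee",
    headers={
        "referer": "https://www.google.com/",
        # Reuse your browser session so results match what you see.
        "cookie": browser_cookies,
        # A language hint; the IP's geolocation still dominates.
        "accept-language": "en-US,en;q=0.9",
    },
)
```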
```bash
poetry install
```
Use whatever tools you're comfortable with - perhaps [`pyenv`](https://github.com/pyenv/pyenv) + [`uv`](https://github.com/astral-sh/uv) or something else.
let's ask them to use what we want them to, as this is a tutorial
Let's implement the necessary code:
```python
# crawlee-google-search.main

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee import Request, ConcurrencySettings

from .routes import router

QUERIES = ["Apify", "Crawlee"]

CRAWL_DEPTH = 2


async def main() -> None:
    """The crawler entry point."""
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=20),
        max_requests_per_crawl=100,
    )

    # One start request per query; user_data carries the crawl state.
    requests_lists = [
        Request.from_url(
            f"https://www.google.com/search?q={query}",
            user_data={
                "current_depth": 0,
                "crawl_depth": CRAWL_DEPTH,
                "query": query,
            },
            headers={
                "referer": "https://www.google.com/",
            },
        )
        for query in QUERIES
    ]

    await crawler.run(requests_lists)
```
I am missing code explanations and breakdown, we cannot just put the whole code at once and say it's a tutorial :D
Valid for the rest of the code snippets too.
Updated the code snippets.
Also updated the code to better match the latest version of crawlee-python.
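For reference, here is a minimal sketch of what the imported `routes.py` router might look like (the actual file isn't part of this hunk; the `h3` and `a#pnnext` selectors and the field names are assumptions):

```python
# crawlee-google-search.routes

from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Save result titles and follow pagination until crawl_depth is reached."""
    user_data = context.request.user_data
    query = user_data["query"]

    # Store one record per result title, with its position on the page.
    for position, title in enumerate(context.soup.select("h3"), start=1):
        await context.push_data(
            {"query": query, "position": position, "title": title.get_text()}
        )

    # Enqueue the next results page while we are above the depth limit.
    if user_data["current_depth"] < user_data["crawl_depth"]:
        next_link = context.soup.select_one("a#pnnext")
        if next_link is not None:
            await context.add_requests(
                [
                    Request.from_url(
                        f"https://www.google.com{next_link['href']}",
                        user_data={
                            "current_depth": user_data["current_depth"] + 1,
                            "crawl_depth": user_data["crawl_depth"],
                            "query": query,
                        },
                    )
                ]
            )
```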
In this blog, we've created a Google search crawler that captures ranking data. How you analyze this dataset is entirely up to you!
The code repository is available on [`GitHub`](soon)
let's not forget to update it.
Let's get started!
somewhere here or later let's add a note like I did in your previous article.
what about this?
@@ -0,0 +1,201 @@
# How to scrape Google search results with Python |
let's also add the meta description from your previous articles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just copy paste and edit
Co-authored-by: Saurav Jain <[email protected]>
force-pushed from ffee23f to 06ab968
what about this?
also, the folder name should start with:
@B4nan can you please give a final review before I merge :) thanks!
just one nit, didn't read the text, sorry, no time for that now
Co-authored-by: Martin Adámek <[email protected]>
thanks! @Mantisus let's merge it Monday morning now :)
Draft article for @souravjain540