My entry to Trustoo's hiring challenge. Find its description here.
- Docker Compose
- Python 3.12
- Clone this repository or download its contents.
- Install the Python environment: `poetry install`.
- Start Splash: `docker compose up -d`.
- Run the scraper: `poetry run scrapy crawl gouden_gids`.
- Wait for the process to finish and find the scraped data in `results.csv`.
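Once the crawl has finished, the output can be inspected with Python's standard `csv` module. The snippet below is a minimal sketch: it assumes `results.csv` starts with a header row and makes no assumption about which columns the spider exports. `load_results` is a hypothetical helper, not part of this repository.

```python
import csv
from pathlib import Path


def load_results(path="results.csv"):
    """Return the scraped rows as a list of dicts keyed by the CSV header."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling `load_results()` from the project root returns one dict per scraped business; `rows[0].keys()` then lists the exported columns.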
- Crawl any category in goudengids.nl. Provide it as an argument to the spider: `poetry run scrapy crawl gouden_gids -a category=fysiotherapeuten`
- The `gouden_gids` spider also takes the number of pages to crawl as an argument. Example: `poetry run scrapy crawl gouden_gids -a category=fysiotherapeuten -a max_page=3`
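Scrapy hands every `-a name=value` pair to the spider as a string keyword argument, so a value such as `max_page=3` arrives as `"3"` and needs converting before use. The sketch below shows one way the two arguments could be validated. The function name `normalise_spider_args` and the default category `advocaten` are assumptions made for illustration; the spider in this repository may handle its arguments differently.

```python
def normalise_spider_args(category="advocaten", max_page=None):
    """Validate the values passed via `-a category=... -a max_page=...`.

    Scrapy delivers `-a` values as strings, so `max_page` is converted to int.
    The default category is an assumption made for this sketch.
    """
    category = category.strip().lower()
    if not category:
        raise ValueError("category must not be empty")
    if max_page is None:
        return category, None  # None means "no page limit" in this sketch
    pages = int(max_page)  # raises ValueError for non-numeric input
    if pages < 1:
        raise ValueError("max_page must be at least 1")
    return category, pages
```

For the second command above, `normalise_spider_args("fysiotherapeuten", "3")` yields the category unchanged and the page limit as the integer `3`.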
- The spider waits between requests while crawling in order to avoid detection and overloading the infrastructure of the crawled website.
- The spider uses a spoofed user agent that is chosen at random and changed continually.
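Both behaviours map onto standard Scrapy mechanisms. The sketch below is one possible setup, not necessarily the one used in this project: `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` are real Scrapy settings, while the values, the user agent strings, and the `RandomUserAgentMiddleware` class are illustrative assumptions.

```python
import random

# Scrapy settings (real setting names, illustrative values).
DOWNLOAD_DELAY = 2.0             # base wait in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # Scrapy then waits between 0.5x and 1.5x DOWNLOAD_DELAY

# Placeholder user agents; a real pool would be larger and kept up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a random user agent per request."""

    def __init__(self, user_agents=USER_AGENTS, rng=None):
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Overwrite the outgoing header with a randomly chosen user agent.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None  # returning None lets Scrapy continue handling the request
```

A middleware like this only takes effect once it is registered under `DOWNLOADER_MIDDLEWARES` in the project settings.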
- Scrape parking info
- Scrape reviews
- More extensive detection avoidance
- Improve argument names
- Add a CLI
- Publish to PyPI
- Publish a docker image with Splash auto starting to allow for an easier showcase.
- Extensive documentation
- Developer guide
- Module docstrings
- The scraper works well with lawyers and also handles the other categories in Gouden Gids well.
- The code is documented, formatted and linted.
- A decent part of the code is covered by tests.
- Unit testing
- Integration testing
- Scraping dynamic content
- Logging
- Collection of "Overige Informatie" -- Currently broken
The items below are ordered from most to least preferable to scrape in terms of data quality.
Enroll Business is one of the leading business directories in Spain. It enables browsing of local businesses with ease. Useful customer reviews help customers identify the best services.
Could not find anything that would apply to scraping.
Trustoo would likely be allowed to scrape useful data.
- Address
- Telephone
- Work hours
- Website
- Description
- Reviews (Mostly missing)
- Photos (Often missing)
- Free
- Domain authority 74 (Excellent)
- Must choose area when searching
- Provides a "change language" option which would allow a person who doesn't speak Spanish to explore more easily.
Could not find them
Trustoo would not be allowed to scrape useful information.
- Address
- Telephone
- Work hours
- Website
- Description
- Specifics about areas of work
- Reviews (Mostly missing)
- Photos (Often missing)
Not as big as some of the other websites listed here.
- Free
- Domain authority 30 (Below average)
- Must choose area when searching
Yelp is a popular business directory in Spain. Yelp is a great platform to connect with local businesses by making it easier for consumers to make a purchase, reservation, or an appointment. It has detailed information and review content, and all necessary business information.
- 4.B: You may not access or use the Service if you are a competitor of Yelp
- 7.B.ix: Use the Service to Modify, adapt, appropriate, reproduce, distribute, translate, create derivative works of the Service or the Service Content or adaptations thereof, and publicly display, sell, trade or exploit in any way the Service or the Service Content (other than Your Content), except as expressly authorized by Yelp;
- 7.B.x: Use any robot, spider, Service search/retrieval application, or other automated device, process or means to access, copy, retrieve or index any portion of the Service or any content on the Service, unless expressly authorized by Yelp
- 7.B.xix: Use any device, software or routine that interferes with the proper working of the Service or attempts to do so in any way
- 7.B.xxi: Remove, circumvent, disable, damage or interfere with security features of the Service, features that prevent or restrict use or copying of Service Content, or features that enforce limitations on the use of the Service.
A non-spoofed user agent would not be allowed anything (`Disallow: /`). Most details about a business are disallowed for any user agent.
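A check like this can be reproduced with Python's standard `urllib.robotparser`. The sketch below uses a made-up robots.txt, not Yelp's actual file, in which one named crawler gets limited access and every other user agent is blocked with `Disallow: /`. The helper name `is_allowed`, the sample rules, the user agent `trustoo-scraper`, and the example URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a named crawler is mostly allowed, everyone else is blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /biz_photos/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/biz/some-lawyer"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "trustoo-scraper", page))  # falls under "*"
    print(is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", page))        # has its own rules
```

To test a live site instead of a string, `RobotFileParser` also offers `set_url()` followed by `read()`.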
- Address
- Telephone
- Work hours
- Website
- Description
- Specifics about areas of work
- Reviews (Mostly missing)
- Photos (Often missing)
- It is not possible to get all lawyers in Spain from a single search; a city always has to be specified.
- A lot of the entries are of low quality
- Free
- Domain authority 59 (Good)
It is quite simple to find a local business with Hotfrog. It is a famous business directory in Spain, helping millions of small businesses gain maximum customers.
Fair use by users
All Content made available on or via the Services is provided for informational purposes only. The Content may only be used and reproduced for personal and non-commercial use. The following are examples of unacceptable use: (a) Content framing; (b) Content scraping; (c) Content data-mining; (d) Content extraction; (e) Content re-distribution; (f) mirroring of material; or (g) using this website in any way which would interfere with its operation for other parties.
- Address
- Telephone
- Website
- Description
- Specifics about areas of work
- Reviews (Mostly missing)
- Socials
- Photos (Often missing)
No restrictions are specified for the information that would be useful to Trustoo.
- Free
- Domain authority 42 (Average)
- Inflict an excessive load on our infrastructure or otherwise interfere with the proper functioning of Yalwa through:
- Copying, modifying, distributing content from other users' ads
- Copying other people's information, including email addresses, without their consent
- Circumvention of measures intended to prevent or restrict access to Yalwa
Trustoo would not be allowed to scrape useful information. Interestingly, a number of older Mozilla user agents are severely restricted along with scraping-associated user agents such as "EmailSiphon".
- Address
- Telephone (the page says "click to reveal", but the number is never revealed, and I could not easily find it in the page source)
- Website
- Description
- Photos (Often missing)
- Free
- Domain authority 40
- Must choose area when searching
e-justice [Not part of the ranking]
The European e-Justice Portal allows you to easily find a lawyer throughout the EU. This service is provided by the European Commission in collaboration with the currently participating national bar registers.
Could be useful, especially if there is an easy way to access the data. I did not manage to find it, but perhaps there is an API?
Paginas Amarillas [Not part of the ranking]
Spanish yellow pages, blocked when visiting from abroad.