GitHub - martanman/weblio-hakusuisha-scraper: Scrapes 白水社中国語辞典 on Weblio and creates a yomichan dictionary

Outline

A scraper for the Chinese-Japanese dictionary 白水社中国語辞典 hosted on Weblio. It includes a script that converts the scraped data into a yomichan-compatible dictionary format for use with the yomichan browser extension.

Requirements

The required Python libraries can be installed with

pip install scrapy html5lib regex

Pandoc is also required; it is used to convert HTML to plain text.
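As an illustration of the Pandoc step, here is a minimal sketch of how an HTML fragment could be piped through the `pandoc` command-line tool from Python. The function names `pandoc_command` and `html_to_plain` are hypothetical and are not taken from the scripts in this repository; the actual scripts may call Pandoc differently.

```python
import subprocess


def pandoc_command(wrap="none"):
    """Build the pandoc argument list for HTML (stdin) -> plain text (stdout).

    Hypothetical helper for illustration; not part of this repository.
    """
    return ["pandoc", "--from=html", "--to=plain", f"--wrap={wrap}"]


def html_to_plain(html):
    """Pipe an HTML fragment through pandoc and return the plain-text result.

    Requires the pandoc executable to be on PATH.
    """
    result = subprocess.run(
        pandoc_command(),
        input=html,
        capture_output=True,
        text=True,
        encoding="utf-8",  # dictionary text is CJK, so avoid locale defaults
        check=True,        # raise CalledProcessError if pandoc fails
    )
    return result.stdout.strip()
```

With Pandoc installed, calling `html_to_plain("<p>你好</p>")` should return the text of the paragraph with the tags stripped.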

Usage

python spider.py

runs the scraper. Scraping all the pages takes about a day, depending on the value of DOWNLOAD_DELAY in spider.py. Then run

python export.py

to create a yomichan-compatible dictionary zip file from entries.jsonl.
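To give a rough idea of what the export step involves, the sketch below packs JSON Lines entries into a zip archive laid out the way the yomichan dictionary format (version 3) is commonly described: an `index.json` plus one or more `term_bank_N.json` files whose rows are eight-element arrays. The entry field names (`headword`, `pinyin`, `definitions`), the dictionary title, and the function names are assumptions made for illustration; the real `entries.jsonl` schema and the logic in `export.py` may differ.

```python
import io
import json
import zipfile


def entry_to_term(entry, sequence):
    """Turn one scraped entry into a yomichan term row.

    Row layout: [expression, reading, definition tags, rules,
                 score, glossary list, sequence number, term tags].
    The keys read from `entry` are hypothetical.
    """
    return [
        entry["headword"],
        entry.get("pinyin", ""),
        "",
        "",
        0,
        list(entry["definitions"]),
        sequence,
        "",
    ]


def build_dictionary(jsonl_lines, out, title="hakusuisha-sketch", bank_size=10000):
    """Write a yomichan-style zip to `out` (a path or a binary file object).

    Returns the number of terms written. Blank lines are skipped.
    """
    entries = [json.loads(line) for line in jsonl_lines if line.strip()]
    terms = [entry_to_term(e, i) for i, e in enumerate(entries, start=1)]
    index = {"title": title, "format": 3, "revision": "1"}

    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("index.json", json.dumps(index, ensure_ascii=False))
        # Split the terms into numbered banks of at most bank_size rows each.
        for n, start in enumerate(range(0, len(terms), bank_size), start=1):
            bank = terms[start:start + bank_size]
            zf.writestr(f"term_bank_{n}.json", json.dumps(bank, ensure_ascii=False))
    return len(terms)
```

A call such as `build_dictionary(open("entries.jsonl", encoding="utf-8"), "dictionary.zip")` would then produce an archive that can be tried with yomichan's dictionary import, assuming the entries use the field names above.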

Disclaimer

Please don't go around sharing copies of the scraped dictionary for copyright reasons. These scripts are intended for individual use.

