This Python script tries to scrape and store every word and definition from Urban Dictionary.
$ . venv/bin/activate
$ pip install -r requirements.txt
$ python ubscrape-runner.py # [args]
pip install ubscrape
ubscrape --help # shows options
ubscrape --scrape # begins scraping all of urban dictionary, starting by adding words to database
ubscrape --define hello # defines hello and prints it, verifies your network connection
ubscrape --define-all # begins defining all words that are stored locally
ubscrape --dump # dump all existing definitions to .json files
ubscrape --dump --out OUT # specify an output directory for --dump
ubscrape --report # shows the progress in defining all the locally stored words
ubscrape --clear --force # deletes the locally stored words and definitions
- ubscrape goes through the page indices looking for every word (https://www.urbandictionary.com/browse.php?character=A, https://www.urbandictionary.com/browse.php?character=A&page=2, etc.). ubscrape adds these words to a SQLite database in a `words` table.
- ubscrape goes through every row in the database, looks it up (https://www.urbandictionary.com/define.php?term=Magic%20Carpet%20Ride), and adds the definitions to a `definitions` table.
- When ubscrape has added every definition for a word, it flags the word as `complete` and moves on to the next word.
- When every word in ubscrape is complete, it dumps the SQLite database to JSON. Each letter gets its own folder, and definitions are written to files of ~50 MB each. Each file is named after the first and last word it contains (firstword-lastword.json).
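The grouping in the dump step could look roughly like the sketch below. It is not taken from the ubscrape source: the helper names `group_by_size` and `dump_letter` are hypothetical, the input is assumed to be a sorted list of `(word, definitions)` pairs for one letter, and file names are not sanitized, so words containing characters such as `/` would need extra handling.

```python
import json
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # ~50 MB per file, the size mentioned above


def group_by_size(entries, max_bytes=MAX_BYTES):
    """Yield lists of (word, definitions) pairs whose JSON size stays near max_bytes."""
    group, size = [], 0
    for word, definitions in entries:
        entry_size = len(json.dumps({word: definitions}).encode("utf-8"))
        # Start a new group once adding this entry would exceed the limit.
        if group and size + entry_size > max_bytes:
            yield group
            group, size = [], 0
        group.append((word, definitions))
        size += entry_size
    if group:
        yield group


def dump_letter(letter, entries, out_dir, max_bytes=MAX_BYTES):
    """Write one letter's entries to out_dir/<letter>/<firstword>-<lastword>.json."""
    folder = Path(out_dir) / letter
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for group in group_by_size(entries, max_bytes):
        first, last = group[0][0], group[-1][0]
        path = folder / f"{first}-{last}.json"
        path.write_text(json.dumps(dict(group), ensure_ascii=False), encoding="utf-8")
        written.append(path)
    return written
```

A single entry larger than the limit still gets a file of its own, so files can exceed the target size in that case.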
If ubscrape crashes or fails, it will restart and try to redo as little work as possible.
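One way the two tables and the complete flag could support this kind of restart is sketched below. The column names (`word`, `complete`, `definition`) and the function names are assumptions made for illustration; the actual ubscrape schema may differ.

```python
import sqlite3


def init_db(conn):
    """Create the words and definitions tables (column names are assumptions)."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS words (
            word TEXT PRIMARY KEY,
            complete INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE IF NOT EXISTS definitions (
            id INTEGER PRIMARY KEY,
            word TEXT NOT NULL REFERENCES words(word),
            definition TEXT NOT NULL
        );
        """
    )


def add_words(conn, words):
    """Insert scraped words; words that are already stored are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO words (word) VALUES (?)", [(w,) for w in words]
    )
    conn.commit()


def pending_words(conn):
    """Return words not yet flagged complete: the only work left after a restart."""
    rows = conn.execute("SELECT word FROM words WHERE complete = 0 ORDER BY word")
    return [row[0] for row in rows]


def store_definitions(conn, word, definitions):
    """Save a word's definitions and flag it complete in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM definitions WHERE word = ?", (word,))
        conn.executemany(
            "INSERT INTO definitions (word, definition) VALUES (?, ?)",
            [(word, d) for d in definitions],
        )
        conn.execute("UPDATE words SET complete = 1 WHERE word = ?", (word,))
```

Because the flag is set in the same transaction as the definitions, a crash in the middle of a word leaves it pending, and only that word is redone on the next run.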
- Do we want examples as well as definitions?
- Add support for dumping at the same time as scraping, making it less linear.
- Cannot take escaped unicode characters as input to `--define`: `ubscrape --define \u2764\ufe0f` does not work. `ubscrape --define ❤️❤️` DOES work.
- Cannot dump to JSON while it's defining words.
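A possible direction for the escaped-unicode issue is to decode `\uXXXX` sequences before looking the term up. The sketch below is only an illustration: `decode_escapes` is a hypothetical helper, not part of ubscrape, and it assumes the argument reaches the program with its backslashes intact (in most shells an unquoted backslash is consumed before the program sees it, so the term would have to be quoted).

```python
def decode_escapes(term):
    """Turn literal \\uXXXX sequences in term into the characters they name."""
    if "\\u" not in term and "\\U" not in term:
        return term  # already plain text, e.g. a typed or pasted emoji
    # backslashreplace keeps existing non-ASCII characters as escapes,
    # so they survive the round trip through unicode_escape.
    decoded = term.encode("ascii", "backslashreplace").decode("unicode_escape")
    # Join surrogate pairs such as \ud83d\ude00 into a single code point.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16")
```

Malformed sequences (for example a truncated `\u27`) raise `UnicodeDecodeError`, and other backslash escapes in the same term are interpreted as well, so a real fix would need to decide how to handle those cases.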
- Using multiprocessing pool
Time of 100 words: real 0m13.341s real 0m12.922s real 0m12.606s
Time of 0 words (testing for initialization): real 0m3.033s real 0m3.171s real 0m2.893s
~13 s and ~3 s seem good enough for an estimate. Subtracting the ~3 s of initialization, 100 words take about 10 seconds, so 1.9 million words take about 0.19 million seconds.
0.19 * 10 ^ 6 seconds / 60 sec/min / 60 min/hr / 24 hr/day = ~2.2 days
I could run it on my laptop for 6 hours a day, or I could run it on the school computers and get it done in two days (checking twice a day on progress).
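The same back-of-the-envelope estimate as a small script, using the rounded figures from the timings above. The function name `estimate_days` is made up for this example, and the calculation assumes run time scales linearly with the number of words.

```python
def estimate_days(total_words, batch_seconds, init_seconds, batch_words=100):
    """Extrapolate total run time in days from one timed batch."""
    # Remove the fixed start-up cost before computing the per-word time.
    per_word = (batch_seconds - init_seconds) / batch_words
    return total_words * per_word / 60 / 60 / 24


# Rounded figures: ~13 s per 100 words, ~3 s of initialization.
days = estimate_days(total_words=1_900_000, batch_seconds=13, init_seconds=3)
print(f"{days:.1f} days")  # prints "2.2 days"
```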
- Testing before building:
python -m ubscrape --version
python -m ubscrape --scrape # for a bit
python -m ubscrape --define hello
python -m ubscrape --define-all # for a bit
python -m ubscrape --dump
- Delete `build/`, `dist/`, `ubscrape.egg-info/`.
- Bump version number in `ubscrape/setup.py`.
- Activate your virtual environment, make sure everything is installed.
- `python ubscrape/setup.py sdist bdist_wheel`
Note to self
- Activate global environment with `. ~/global_venv/bin/activate` before the next step.
- Upload: `twine upload --repository-url https://test.pypi.org/legacy/ dist/*` (test), `twine upload dist/*` (real)
- Download and test: `pip install -i https://test.pypi.org/simple/ ubscrape==0.5` (test), `pip install ubscrape` (real)
- `ubscrape --help`