Tab-delimited word frequency list compiled from the German Wikipedia.
Words were converted to lowercase before being counted.
The list is available as result.zip (a compressed text file).
Example output:
öpnv 3547
sylvia 3547
gewonnene 3546
milde 3546
deal 3546
amy 3546
mittelgewicht 3546
gegenspieler 3545
...
Date of Wikipedia dump file: 02-Nov-2021
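Because the list is plain tab-delimited text, it can be loaded with a few lines of Python. The following is a minimal sketch, not part of this repository: the function name `load_word_counts` is made up for illustration, and the sample data is an in-memory copy of a few lines from the example output above.

```python
import io
from typing import Dict, TextIO


def load_word_counts(handle: TextIO) -> Dict[str, int]:
    """Parse lines of the form 'word<TAB>count' into a dict."""
    counts: Dict[str, int] = {}
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and anything that is not tab-delimited.
        if "\t" not in line:
            continue
        # Split on the last tab so the count is always the final field.
        word, _, count = line.rpartition("\t")
        counts[word] = int(count)
    return counts


# In-memory sample mirroring the example output (hypothetical stand-in for the real file).
sample = io.StringIO("öpnv\t3547\nsylvia\t3547\ngewonnene\t3546\n")
counts = load_word_counts(sample)
print(counts["öpnv"])
```

To read the real list, unpack result.zip and pass an open file handle instead of the sample, for example `open(path, encoding="utf-8")`, where `path` points to the extracted text file.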
To compile the list yourself, you need Python 3.8+ and Poetry installed.
1. Clone the repository and install dependencies with Poetry:
$ git clone git@github.com:gambolputty/dewiki-wordrank.git
$ cd dewiki-wordrank
$ poetry install
2. Extract the Wikipedia pages from the XML dump file with WikiExtractor:
$ poetry run python -m wikiextractor.WikiExtractor /path-to-xml-file.xml.bz2 --output /path-to-output-directory
3. Run the script in this repository to compile the list of word occurrences:
$ poetry run python -m dewiki_wordrank /path-to-wikiextractor-output-directory
The result will be saved in the dewiki_wordrank directory.
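For readers curious about what the counting step involves, here is a rough sketch under stated assumptions. It is not the code in this repository. It assumes WikiExtractor's default text output, where each article is wrapped in `<doc ...>` and `</doc>` lines, and it treats any run of letters as a word, which may differ from the tokenization the actual script uses. All function names are illustrative.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator

# Assumption: a "word" is any run of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(lines: Iterable[str]) -> Counter:
    """Count lowercased words, ignoring WikiExtractor's <doc> wrapper lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("<doc ") or line.startswith("</doc>"):
            continue
        # Lowercase before counting, as described at the top of this document.
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def iter_lines(root: Path) -> Iterator[str]:
    """Yield every line of every file below root."""
    for path in sorted(root.rglob("*")):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                yield from handle


def format_tsv(counts: Counter) -> str:
    """Render 'word<TAB>count' lines, most frequent first."""
    return "\n".join(f"{word}\t{n}" for word, n in counts.most_common())


# Tiny in-memory demo instead of a real dump.
demo = count_words(['<doc id="1" url="" title="Demo">', "Der ÖPNV und der Bus.", "</doc>"])
print(format_tsv(demo))
```

On a real extraction the same functions could be combined as `count_words(iter_lines(Path("/path-to-wikiextractor-output-directory")))`. Keeping the counting separate from the file walking makes it possible to try the logic on a handful of lines first.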