Create README.md
added basic info and a teaser
beviah authored Mar 30, 2024
1 parent 2feb389 commit e1103b1
Showing 1 changed file with 28 additions and 0 deletions.
28 changes: 28 additions & 0 deletions thesauruses-co/README.md
@@ -0,0 +1,28 @@
# Thesauruses.co

This is a series of ~20 scripts used for parsing and structuring Wiktionary data in a template- (locale-) agnostic way!
The same scripts are used to parse all 170+ Wiktionary dumps!

The final dataset was used, and will be used again, at https://www.ezglot.com/

We need to upgrade the server so it can handle the traffic with all the new data we have created over the years.

## Files

**1. wiki_parse.py**
- Loads Wiktionary XML dumps from ./xmls folder
- Detects templates and attempts replacements
- Normalizes common wiki markup found in various locales (so maybe it is not locale agnostic!)
  but does not rely on specific language strings (so maybe it is!)
- Converts content into node-property-relation-like paths with more or less consistent levels,
  to be dealt with further in later script(s)
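
The steps listed above can be illustrated with a minimal sketch. This is not the actual `wiki_parse.py`; it only shows one way the four bullets could fit together. All function names (`iter_pages`, `find_templates`, `normalize`, `to_paths`), the regexes, the `*.xml` file pattern, and the "keep the last template argument" replacement rule are assumptions made for illustration. Nested templates are not handled.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical patterns; they key on wiki syntax only, not on language strings.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}]*)?)\}\}")
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export (path or file)."""
    title, text = None, ""
    for _, elem in ET.iterparse(source, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page; dumps are large


def find_templates(text):
    """Return (name, [args]) for every non-nested {{name|arg|...}} in text."""
    return [(m.group(1).strip(), m.group(2).split("|")[1:])
            for m in TEMPLATE_RE.finditer(text)]


def replace_template(match):
    """Placeholder rule (assumption): keep the last argument, drop bare templates."""
    args = match.group(2).split("|")[1:]
    return args[-1] if args else ""


def normalize(line):
    """Attempt template replacement, unwrap [[links]], strip ''bold/italic''."""
    line = TEMPLATE_RE.sub(replace_template, line)
    line = LINK_RE.sub(r"\1", line)
    return re.sub(r"'{2,}", "", line).strip()


def to_paths(title, text):
    """Turn headings and content lines into 'title/heading/.../value' paths."""
    stack, paths = [], []  # stack holds (level, heading) for open sections
    for raw in text.splitlines():
        m = HEADING_RE.match(raw)
        if m:
            level = len(m.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()  # close sections at the same or a deeper level
            stack.append((level, m.group(2)))
            continue
        value = normalize(raw.lstrip("#*: "))
        if value:
            paths.append("/".join([title] + [h for _, h in stack] + [value]))
    return paths


if __name__ == "__main__":
    # Assumes uncompressed dumps named *.xml inside ./xmls
    for dump in sorted(Path("xmls").glob("*.xml")):
        for title, text in iter_pages(dump):
            for path in to_paths(title, text):
                print(path)
```

Because the path depth comes from the heading markers (`==`, `===`, ...) rather than from heading names, the same code runs unchanged on any locale's dump, which is the kind of agnosticism described above.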

I thought I had the problem solved with this script, but oh boy, was I wrong... it took *a few* more processing steps.

**2. finer.py**

- *Every few dozen stars will motivate me to add one more script ;-)*

**3. polngrams.py**

...
