okulovsky/grammar_ru

Grammar_ru

Initially, grammar_ru was envisioned as a set of tools to correct grammar and style errors in Russian texts.

Since the beginning, a lot has changed.

Grammar_ru is now a module that manages a language-independent representation of texts in tabular format.

  • The texts are split into tokens and sentences, placed in pandas DataFrames, and stored in zip files along with table-of-contents (toc) files that contain metadata about each dataframe.
  • We call these zip files corpora.
  • Each word, sentence and paragraph receives a unique ID within the corpus.
  • Relations can be stored in a corpus to link fragments of texts (e.g. to record that a chapter of a translation and a chapter of the original text are in fact the same chapter).
  • Grammar_ru also allows you to apply featurizers, such as pymorphy, snowball, slovnet, etc.
  • Grammar_ru contains useful components to further convert such datasets into torch tensors (based on the Training Grounds Framework).
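
The tabular representation described above can be illustrated with a minimal sketch. This is not grammar_ru's actual API: the function names `text_to_frame` and `write_corpus`, the column names, the CSV storage format and the `toc.csv` file name are all assumptions made for illustration, and the regex-based splitting is far cruder than a real tokenizer.

```python
import io
import re
import zipfile

import pandas as pd


def text_to_frame(text: str) -> pd.DataFrame:
    """Hypothetical helper: one row per token, with word, sentence and paragraph IDs."""
    rows = []
    word_id = 0
    sentence_id = 0
    paragraphs = [p for p in text.split("\n") if p.strip()]
    for paragraph_id, paragraph in enumerate(paragraphs):
        # Naive sentence splitter: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            # Naive tokenizer: runs of word characters, or single punctuation marks.
            for word in re.findall(r"\w+|[^\w\s]", sentence):
                rows.append(
                    {
                        "word_id": word_id,
                        "sentence_id": sentence_id,
                        "paragraph_id": paragraph_id,
                        "word": word,
                    }
                )
                word_id += 1
            sentence_id += 1
    return pd.DataFrame(rows)


def write_corpus(frames: dict, target) -> None:
    """Hypothetical helper: store frames in a zip file next to a toc with per-frame metadata."""
    toc = pd.DataFrame(
        [{"file_id": name, "token_count": len(df)} for name, df in frames.items()]
    )
    with zipfile.ZipFile(target, "w") as archive:
        archive.writestr("toc.csv", toc.to_csv(index=False))
        for name, df in frames.items():
            archive.writestr(f"{name}.csv", df.to_csv(index=False))


frame = text_to_frame("Мама мыла раму. Рама была чистой!\nВторой абзац.")
buffer = io.BytesIO()  # an in-memory stand-in for a zip file on disk
write_corpus({"sample": frame}, buffer)
```

In this sketch the IDs are unique only within a single frame; making them unique across a whole corpus, as described above, would additionally require carrying the counters over from one text to the next.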

Aside from grammar_ru, the repository contains app_grammar_ru, a not-yet-working Docker app intended to check Russian texts for errors. The app is meant to use existing Python solutions (pyenchant) as well as ML models trained in the grammar_ru paradigm.

Finally, a creative articulator (ca) project is also temporarily hosted in this repo.

About

NLP algorithms for Russian grammar check
