Text

Text & language processing for Elixir. Initial release focuses on:

N-gram generation from text
Pluralization of English words
Word counting (word frequencies)
Language detection using pluggable classifier, vocabulary and corpus backends

Second phase will focus on:

Stemming
Tokenization and part-of-speech tagging (at least for English)
Sentiment analysis

Each of these phases depends on groundwork that is not yet finished. See "Down the rabbit hole" below.

Status Update Sept 2021

The Text project remains active and maintained. However, the advent of the Numerical Elixir (Nx) project opens up many better opportunities to use ML for text analysis, and this is the planned path. I expect to focus on using ML for the additional planned functionality as a calendar year 2022 project. Bug reports, PRs and suggestions are welcome!

Installation

def deps do
  [
    {:text, "~> 0.2.0"}
  ]
end

Word Counting

text contains an implementation of word counting that is oriented towards large streams of words rather than discrete strings. Input to Text.Word.word_count/2 can be a String.t, File.Stream.t or Flow.t allowing flexible streaming of text.
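
To make the streaming idea concrete, here is a minimal sketch in plain Elixir that folds word counts over any enumerable of lines. It does not call the library: the module name `WordCountSketch`, its `count/1` function and the word-splitting regex are assumptions made for illustration, and the real `Text.Word.word_count/2` may tokenize and return results differently.

```elixir
defmodule WordCountSketch do
  # Hypothetical helper, not part of the text library.
  # Accepts any enumerable of lines (a list, a File.stream!/1 result, ...)
  # and folds the counts into one map, one line at a time, so the whole
  # input never has to be held in memory as a single string.
  def count(lines) do
    Enum.reduce(lines, %{}, fn line, acc ->
      line
      |> String.downcase()
      # Split on runs of anything that is not a letter, digit or apostrophe.
      |> String.split(~r/[^\p{L}\p{N}']+/u, trim: true)
      |> Enum.reduce(acc, fn word, counts ->
        Map.update(counts, word, 1, &(&1 + 1))
      end)
    end)
  end
end

WordCountSketch.count(["The cat sat.", "The dog sat too."])
# => %{"cat" => 1, "dog" => 1, "sat" => 2, "the" => 2, "too" => 1}
```

Because `count/1` only relies on the `Enumerable` protocol, the same function could be given a lazily read file stream instead of a list.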

English Pluralization

text includes an inflector for the English language that takes an approach based upon An Algorithmic Approach to English Pluralization. See the module Text.Inflect.En and the functions:

Text.Inflect.En.pluralize/2
Text.Inflect.En.pluralize_noun/2
Text.Inflect.En.pluralize_verb/1
Text.Inflect.En.pluralize_adjective/1

Language Detection

text contains three language classifiers to aid in natural language detection. However, it does not include any corpora; these are provided by separate libraries. The available classifiers are:

Text.Language.Classifier.CommulativeFrequency
Text.Language.Classifier.NaiveBayesian
Text.Language.Classifier.RankOrder

Additional classifiers can be added by defining a module that implements the Text.Language.Classifier behaviour.

The library text_corpus_udhr implements the Text.Corpus behaviour for the United Nations Universal Declaration of Human Rights, which is available for download in 423 languages from Unicode.

See Text.Language.detect/2.
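
For readers new to n-gram based language detection, the sketch below shows one common way to compare rank-ordered n-gram profiles (often called the "out-of-place" measure). It is an illustration only: the module `RankOrderSketch`, its functions, the profile size and the penalty value are all assumptions, and it is not the implementation behind Text.Language.Classifier.RankOrder or Text.Language.detect/2. The two tiny "corpora" are made-up sample sentences, far too small for real use.

```elixir
defmodule RankOrderSketch do
  # Hypothetical illustration only; not the library's RankOrder classifier.

  # Build a profile: the `size` most frequent character bigrams of a text,
  # mapped to their rank (0 = most frequent). Ties are broken alphabetically
  # so the result is deterministic.
  def profile(text, size \\ 50) do
    text
    |> String.downcase()
    |> String.graphemes()
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
    |> Enum.sort_by(fn {gram, count} -> {-count, gram} end)
    |> Enum.take(size)
    |> Enum.with_index()
    |> Map.new(fn {{gram, _count}, rank} -> {gram, rank} end)
  end

  # Sum of rank differences between two profiles. A bigram that is missing
  # from the reference profile costs a fixed `penalty`.
  def distance(sample, reference, penalty \\ 50) do
    Enum.reduce(sample, 0, fn {gram, rank}, acc ->
      case Map.fetch(reference, gram) do
        {:ok, ref_rank} -> acc + abs(rank - ref_rank)
        :error -> acc + penalty
      end
    end)
  end

  # Return the key of the reference text whose profile is closest to the sample.
  def detect(text, corpora) do
    sample = profile(text)

    corpora
    |> Enum.min_by(fn {_lang, reference} -> distance(sample, profile(reference)) end)
    |> elem(0)
  end
end

corpora = %{
  en: "the quick brown fox jumps over the lazy dog and then the other dog",
  de: "der schnelle braune fuchs springt über den faulen hund und dann den anderen hund"
}

RankOrderSketch.detect("the dog and the fox", corpora)
```

A real corpus, such as the one provided by text_corpus_udhr, would supply far larger reference texts per language than these two sentences.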

N-Gram generation

The Text.Ngram module supports efficient generation of n-grams of length 2 to 7. See Text.Ngram.ngram/2.
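
As a rough picture of what n-gram generation involves, the sketch below slides a window over the graphemes of a string and counts each window. It is written against the standard library only: `NgramSketch.ngrams/2` is a hypothetical name, and the shape of its result (a map of n-gram to count) is an assumption that may not match what Text.Ngram.ngram/2 returns.

```elixir
defmodule NgramSketch do
  # Hypothetical helper, not the library's implementation.
  # Slides a window of `n` graphemes over the text, one step at a time,
  # and counts how often each window occurs. The guard mirrors the
  # documented n-gram length range of 2 to 7.
  def ngrams(text, n) when is_binary(text) and n in 2..7 do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
  end
end

NgramSketch.ngrams("banana", 2)
# => %{"an" => 2, "ba" => 1, "na" => 2}
```

This version favours clarity over speed; an efficient implementation would avoid building the intermediate lists.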

Down the rabbit hole

Text analysis at a fundamental level requires segmenting arbitrary text in any language into characters (graphemes), words and sentences. This is a complex topic covered by the Unicode text segmentation standard, augmented by localised rules in CLDR's segmentations data.

Therefore, to provide higher-order text analysis, the order of development looks like this:

Finish the Unicode regular expression engine in ex_unicode_set. Most of the work is complete, but compound character classes need further work. Unicode regular expressions are required to implement both Unicode transforms and Unicode segmentation.
Implement basic Unicode word and sentence segmentation in ex_unicode_string. Grapheme cluster segmentation is available in the standard library as String.graphemes/1.
Add CLDR tailorings for locale-specific segmentation of words and sentences.
Finish the Snowball stemming compiler. There is a lot to do here; only the parser is partially complete.
Implement stemming.
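
The segmentation steps above are easier to appreciate with a small example using only the standard library. It shows that graphemes and code points are not the same thing, and that splitting on whitespace is only a rough stand-in for real word segmentation. The sample strings are chosen purely for illustration.

```elixir
# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points that render as a single grapheme.
text = "e\u0301"

text |> String.codepoints() |> length()
# => 2
text |> String.graphemes() |> length()
# => 1

# Whitespace splitting keeps punctuation attached to words and does
# nothing useful for scripts that are written without spaces.
String.split("Hello, world!")
# => ["Hello,", "world!"]
```

Closing this gap for words and sentences is what the Unicode segmentation rules and the CLDR tailorings are for.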
