A library for processing text data

cophi is a Python library for handling, modeling, and processing text corpora. You can easily pipe a collection of text files through the high-level API:

import cophi

# Read every .txt file below the directory, lowercase it, and tokenize it
corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")
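
To get a feel for what a token pattern like the one above matches, here is a minimal sketch that uses only the standard library. It is not cophi's implementation: the standard `re` module does not support `\p{...}` classes, so `[^\W\d_]` stands in for `\p{L}` and `[^\w\s]` for `\p{P}` (both approximations), and the `tokenize` helper is a hypothetical name chosen for illustration.

```python
import re

# Assumption: stdlib stand-in for r"\p{L}+\p{P}?\p{L}+".
# [^\W\d_] approximates \p{L} (letters); [^\w\s] approximates \p{P},
# although it also matches symbols that are not punctuation.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+[^\w\s]?[^\W\d_]+")


def tokenize(text, lowercase=True):
    """Return all pattern matches in text, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Don't panic, it's 42!")
print(tokens)
```

The pattern needs at least two letters per token, so single-letter words and pure numbers are dropped, while an inner apostrophe or hyphen stays part of the token.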

You can also plug the DARIAH-DKPro-Wrapper into this pipeline to lemmatize text, or just keep certain word types.

Check out the introductory Jupyter notebook.

Getting started

To install the latest stable version:

$ pip install cophi

To install the latest development version:

$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing

Available complexity measures

The library also provides a range of complexity metrics for measuring the lexical richness of (literary) texts.

Measures that use sample size and vocabulary size:

Type-Token Ratio TTR
Guiraud’s R
Herdan’s C
Dugast’s k
Maas’ a²
Dugast’s U
Tuldava’s LN
Brunet’s W
Carroll’s CTTR
Summer’s S
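
As an illustration of this first group, the following sketch computes four of the measures listed above from a plain list of tokens, where N is the number of tokens and V the number of distinct types. It uses the commonly cited textbook definitions and does not call cophi; the function name `sample_size_measures` is hypothetical.

```python
import math


def sample_size_measures(tokens):
    """Compute measures that depend only on sample size N and vocabulary size V."""
    n = len(tokens)        # N: number of tokens
    v = len(set(tokens))   # V: number of distinct types
    if n < 2:
        # log(N) is zero for N == 1, which would break Herdan's C
        raise ValueError("need at least two tokens")
    return {
        "ttr": v / n,                            # Type-Token Ratio: V / N
        "guiraud_r": v / math.sqrt(n),           # Guiraud's R: V / sqrt(N)
        "herdan_c": math.log(v) / math.log(n),   # Herdan's C: log V / log N
        "carroll_cttr": v / math.sqrt(2 * n),    # Carroll's CTTR: V / sqrt(2N)
    }


tokens = "to be or not to be that is the question".split()
for name, value in sample_size_measures(tokens).items():
    print(f"{name}: {value:.4f}")
```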

Measures that use part of the frequency spectrum:

Honoré’s H
Sichel’s S
Michéa’s M
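
These three measures rely on the number of types that occur exactly once (hapax legomena, V1) or exactly twice (dis legomena, V2). The sketch below follows the commonly cited definitions and is independent of cophi's own implementation; `partial_spectrum_measures` is a hypothetical name.

```python
import math
from collections import Counter


def partial_spectrum_measures(tokens):
    """Compute measures based on the counts of hapax and dis legomena."""
    n = len(tokens)                      # N: number of tokens
    freqs = Counter(tokens)              # type -> frequency
    v = len(freqs)                       # V: number of distinct types
    spectrum = Counter(freqs.values())   # frequency i -> number of types V(i, N)
    v1 = spectrum[1]                     # types occurring exactly once
    v2 = spectrum[2]                     # types occurring exactly twice
    # Note: Honoré's H is undefined if every type is a hapax (V1 == V),
    # and Michéa's M is undefined if no type occurs exactly twice (V2 == 0).
    return {
        "honore_h": 100 * math.log(n) / (1 - v1 / v),  # 100 * log N / (1 - V1/V)
        "sichel_s": v2 / v,                            # V2 / V
        "michea_m": v / v2,                            # V / V2
    }


tokens = "to be or not to be that is the question".split()
for name, value in partial_spectrum_measures(tokens).items():
    print(f"{name}: {value:.4f}")
```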

Measures that use the whole frequency spectrum:

Entropy S
Yule’s K
Simpson’s D
Herdan’s V_m
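
Measures in this group sum over every frequency class. As a rough sketch using the commonly cited formulas (not cophi's code, and with the natural logarithm assumed for entropy), Yule's K, Simpson's D and entropy can be computed as follows; `whole_spectrum_measures` is a hypothetical name.

```python
import math
from collections import Counter


def whole_spectrum_measures(tokens):
    """Compute measures that sum over the complete frequency spectrum."""
    n = len(tokens)                      # N: number of tokens
    if n < 2:
        raise ValueError("need at least two tokens")
    freqs = Counter(tokens)              # type -> frequency
    spectrum = Counter(freqs.values())   # frequency i -> number of types V(i, N)

    # Yule's K: 10^4 * (sum_i i^2 * V(i, N) - N) / N^2
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)

    # Simpson's D: sum_i V(i, N) * (i / N) * ((i - 1) / (N - 1))
    simpson_d = sum(vi * (i / n) * ((i - 1) / (n - 1)) for i, vi in spectrum.items())

    # Entropy: -sum over types of p * log(p), with p = frequency / N
    entropy = -sum((f / n) * math.log(f / n) for f in freqs.values())

    return {"yule_k": yule_k, "simpson_d": simpson_d, "entropy": entropy}


tokens = "to be or not to be that is the question".split()
for name, value in whole_spectrum_measures(tokens).items():
    print(f"{name}: {value:.4f}")
```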

Parameters of probabilistic models:

Orlov’s Z
