WtP usage in wtpsplit (Legacy)

This doc details how to use the old WtP models. You should probably use SaT instead.

Usage

from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# optionally run on GPU for better performance
# also supports TPUs via e.g. wtp.to("xla:0"), in that case pass `pad_last_batch=True` to wtp.split
wtp.half().to("cuda")

# returns ["Hello ", "This is a test."]
wtp.split("Hello This is a test.")

# returns an iterator yielding a list of sentences for every text (see the iteration sketch after this example)
# do this instead of calling wtp.split on every text individually for much better performance
wtp.split(["Hello This is a test.", "And some more texts..."])

# if you're using a model with language adapters, also pass a `lang_code`
wtp.split("Hello This is a test.", lang_code="en")

# depending on your use case, adaptation to e.g. the Universal Dependencies style may give better results
# this always requires a language code
wtp.split("Hello This is a test.", lang_code="en", style="ud")

ONNX support

You can enable ONNX inference for the wtp-bert-* models:

wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])

This requires onnxruntime and onnxruntime-gpu. It should give a good speedup on GPU!

>>> from wtpsplit import WtP
>>> texts = ["This is a sentence. This is another sentence."] * 1000

# PyTorch GPU
>>> model = WtP("wtp-bert-mini")
>>> model.half().to("cuda")
>>> %timeit list(model.split(texts))
272 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# onnxruntime GPU
>>> model = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
>>> %timeit list(model.split(texts))
198 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notes:

  • The wtp-canine-* models are currently not supported with ONNX because the pooling done by CANINE is not trivial to export. Ideas to solve this are very welcome!
  • This does not work with Python 3.7 because onnxruntime does not support the opset we need for py37.

Available Models

Pro tips: I recommend wtp-bert-mini for speed-sensitive applications, otherwise wtp-canine-s-12l. The *-no-adapters models provide a good tradeoff between speed and performance. You should probably not use wtp-bert-tiny.

| Model | English Score | English Score (adapted) | Multilingual Score | Multilingual Score (adapted) |
|---|---|---|---|---|
| wtp-bert-tiny | 83.8 | 91.9 | 79.5 | 88.6 |
| wtp-bert-mini | 91.8 | 95.9 | 84.3 | 91.3 |
| wtp-canine-s-1l | 94.5 | 96.5 | 86.7 | 92.8 |
| wtp-canine-s-1l-no-adapters | 93.1 | 96.4 | 85.1 | 91.8 |
| wtp-canine-s-3l | 94.4 | 96.8 | 86.7 | 93.4 |
| wtp-canine-s-3l-no-adapters | 93.8 | 96.4 | 86.0 | 92.3 |
| wtp-canine-s-6l | 94.5 | 97.1 | 87.0 | 93.6 |
| wtp-canine-s-6l-no-adapters | 94.4 | 96.8 | 86.4 | 92.8 |
| wtp-canine-s-9l | 94.8 | 97.0 | 87.7 | 93.8 |
| wtp-canine-s-9l-no-adapters | 94.3 | 96.9 | 86.6 | 93.0 |
| wtp-canine-s-12l | 94.7 | 97.1 | 87.9 | 94.0 |
| wtp-canine-s-12l-no-adapters | 94.5 | 97.0 | 87.1 | 93.2 |

The scores are the macro-average F1 score across all available datasets for "English", and the macro-average F1 score across all datasets and languages for "Multilingual". "Adapted" means adaptation via WtP Punct; check out the paper for details.

For comparison, here are the English scores of some other tools:

| Model | English Score |
|---|---|
| SpaCy (sentencizer) | 86.8 |
| PySBD | 69.8 |
| SpaCy (dependency parser) | 93.1 |
| Ersatz | 91.6 |
| Punkt (nltk.sent_tokenize) | 92.5 |

Paragraph Segmentation

Since WtP models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.

# returns a list of paragraphs, each containing a list of sentences
# adjust the paragraph threshold via the `paragraph_threshold` argument.
wtp.split(text, do_paragraph_segmentation=True)
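
For example, to make paragraph breaks stricter, you can raise the paragraph threshold. A minimal sketch; the 0.99 value is only an illustration, tune it for your data:

# a higher `paragraph_threshold` requires a higher newline probability before starting a new paragraph
wtp.split(text, do_paragraph_segmentation=True, paragraph_threshold=0.99)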

Adaptation

WtP can adapt to the Universal Dependencies, OPUS100 or Ersatz corpus segmentation style in many languages by punctuation adaptation (preferred) or threshold adaptation.

Punctuation Adaptation

# this requires a `lang_code`
# check the paper or `wtp.mixtures` for supported styles
wtp.split(text, lang_code="en", style="ud")

This also allows changing the threshold, but the threshold values are inherently higher since it is no longer the raw newline probability that is being thresholded:

wtp.split(text, lang_code="en", style="ud", threshold=0.7)

To get the default threshold for a style:

wtp.get_threshold("en", "ud", return_punctuation_threshold=True)

Threshold Adaptation

threshold = wtp.get_threshold("en", "ud")

wtp.split(text, threshold=threshold)

Advanced Usage

Get the newline or sentence boundary probabilities for a text:

# returns newline probabilities (supports batching!)
wtp.predict_proba(text)

# returns sentence boundary probabilities for the given style
wtp.predict_proba(text, lang_code="en", style="ud")
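
If you want to post-process the probabilities yourself, you can threshold them manually. A minimal sketch, assuming `predict_proba` is called on a single string and returns one newline probability per character; the 0.5 cutoff is only an example:

import numpy as np

probs = np.asarray(wtp.predict_proba(text))
# split after every character whose newline probability exceeds the cutoff
cut_points = (np.where(probs > 0.5)[0] + 1).tolist()
sentences = [text[i:j] for i, j in zip([0] + cut_points, cut_points + [len(text)])]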

Load a WtP model in HuggingFace transformers:

# import wtpsplit.models to register the custom models 
# (character-level BERT w/ hash embeddings and canine with language adapters)
import wtpsplit.models
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("benjamin/wtp-bert-mini") # or some other model name

**NEW**: Adapt to your own corpus using WtP_Punct

Clone the repository:

git clone https://github.com/bminixhofer/wtpsplit
cd wtpsplit

Create your data:

import torch

torch.save(
    {
        # language code
        "en": {
            "sentence": {
                # dataset name (used as the `style` when calling `wtp.split` later)
                "dummy-dataset": {
                    "meta": {
                        # sentences used to fit the adaptation mixture
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    # sentences used for evaluation
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ]
                }
            }
        }
    },
    "dummy-dataset.pth"
)

Run adaptation:

python3 wtpsplit/evaluation/adapt.py --model_path=benjamin/wtp-bert-mini --eval_data_path dummy-dataset.pth --include_langs=en

This should print something like

en dummy-dataset U=0.500 T=0.667 PUNCT=0.667
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.52it/s]
Wrote mixture to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini.skops
Wrote results to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json

i.e., it runs adaptation on your data and saves the mixture and evaluation results. You can then load and use the mixture like this:

from wtpsplit import WtP
import skops.io as sio

wtp = WtP(
    "wtp-bert-mini",
    mixtures=sio.load(
        "wtpsplit/.cache/wtp-bert-mini.skops",
        ["numpy.float32", "numpy.float64", "sklearn.linear_model._logistic.LogisticRegression"],
    ),
)

wtp.split("your text here", lang_code="en", style="dummy-dataset")

Adjust the dataset name, language code, and model name above to fit your needs.

Reproducing the paper

configs/ contains the configs for the runs from the paper. We trained on a TPUv3-8. Launch training like this:

python wtpsplit/train/train.py configs/<config_name>.json

In addition:

  • wtpsplit/data_acquisition contains the code for obtaining evaluation data and raw text from the mC4 corpus.
  • wtpsplit/evaluation contains the code for:
    • intrinsic evaluation (i.e. sentence segmentation results) via adapt.py. The raw intrinsic results in JSON format are also at evaluation_results/
    • extrinsic evaluation on Machine Translation in extrinsic.py
    • baseline (PySBD, nltk, etc.) intrinsic evaluation in intrinsic_baselines.py
    • punctuation annotation experiments in punct_annotation.py and punct_annotation_wtp.py

Supported Languages

| iso | Name |
|---|---|
| af | Afrikaans |
| am | Amharic |
| ar | Arabic |
| az | Azerbaijani |
| be | Belarusian |
| bg | Bulgarian |
| bn | Bengali |
| ca | Catalan |
| ceb | Cebuano |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fi | Finnish |
| fr | French |
| fy | Western Frisian |
| ga | Irish |
| gd | Scottish Gaelic |
| gl | Galician |
| gu | Gujarati |
| ha | Hausa |
| he | Hebrew |
| hi | Hindi |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| ja | Japanese |
| jv | Javanese |
| ka | Georgian |
| kk | Kazakh |
| km | Central Khmer |
| kn | Kannada |
| ko | Korean |
| ku | Kurdish |
| ky | Kirghiz |
| la | Latin |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mk | Macedonian |
| ml | Malayalam |
| mn | Mongolian |
| mr | Marathi |
| ms | Malay |
| mt | Maltese |
| my | Burmese |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| pa | Panjabi |
| pl | Polish |
| ps | Pushto |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| sq | Albanian |
| sr | Serbian |
| sv | Swedish |
| ta | Tamil |
| te | Telugu |
| tg | Tajik |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| ur | Urdu |
| uz | Uzbek |
| vi | Vietnamese |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zh | Chinese |
| zu | Zulu |