SCAPHRA

A spaCy component for scattered phrase matching.

You have documents such as "In a hole in the ground there lived a Hobbit."
You want to match patterns like "in holes live hobbits".
Then you need this spaCy component!

Usage

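import spacy
import scaphra  # assumption: importing the package registers the "scaphra" factory
nlp = spacy.load("en_core_web_sm")  # example model name; any pipeline with a lemmatizer
phrasemap = {'hobbits': ['in', 'holes', 'live', 'hobbits']}
nlp.add_pipe("scaphra", config=dict(phrasemap=phrasemap))
doc = nlp("In a hole in the ground there lived a Hobbit")
# doc.spans now contains a SpanGroup with the matched tokens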

See scaphra/example.py for several complete examples.

The matcher is a single spaCy component that matches scattered phrases using both their lemmas and their stems. This matters when text quality is poor and lemmas alone do not suffice. In addition, in some languages (such as German) phrases are often non-contiguous. For example, the pattern "does not start" should match "Does it not always start well?".
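To illustrate why comparing on both lemmas and stems can help, here is a minimal sketch. It is not SCAPHRA's actual code: the `Tok` class, the `same_word` function, and all lemma and stem values are hypothetical and hand-written for illustration rather than produced by a real lemmatizer or stemmer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """A token with two normalized forms (hypothetical helper type)."""
    text: str
    lemma: str
    stem: str


def same_word(a: Tok, b: Tok) -> bool:
    """Treat two tokens as the same word if their lemmas OR their stems agree."""
    return a.lemma == b.lemma or a.stem == b.stem


# Illustrative values only, written by hand.
# Case 1: the lemmatizer misses an unknown word, but the stems still agree.
pattern_hobbit = Tok("hobbit", lemma="hobbit", stem="hobbit")
text_hobbits = Tok("Hobbits", lemma="hobbits", stem="hobbit")

# Case 2: an irregular plural defeats the stemmer, but the lemmas agree.
pattern_mouse = Tok("mouse", lemma="mouse", stem="mous")
text_mice = Tok("mice", lemma="mouse", stem="mice")

# Case 3: unrelated words agree on neither form.
text_hole = Tok("hole", lemma="hole", stem="hole")

print(same_word(pattern_hobbit, text_hobbits))  # stems agree
print(same_word(pattern_mouse, text_mice))      # lemmas agree
print(same_word(pattern_hobbit, text_hole))     # neither agrees
```

Accepting a match on either form trades some precision for recall, which is the situation described above for low-quality text.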

This implementation should run reasonably fast: it uses a state machine that memoizes all partial matches, so each text needs to be traversed only once. However, the computational cost rises when many similar patterns are applied to large texts with many matches, since the runtime depends on the number of patterns.
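The single-pass idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the component's real implementation: it assumes tokens are already normalized (for example, lemmatized and lowercased), that pattern words must occur in order with arbitrary gaps, and the function name `scattered_matches` is hypothetical.

```python
from typing import Dict, List, Tuple


def scattered_matches(
    tokens: List[str], phrasemap: Dict[str, List[str]]
) -> Dict[str, List[List[int]]]:
    """Find in-order, possibly gapped occurrences of each pattern in one pass."""
    # partial[(name, k)] holds the token-index tuples that have matched
    # the first k words of pattern `name` so far (the memoized states).
    partial: Dict[Tuple[str, int], List[Tuple[int, ...]]] = {}
    complete: Dict[str, List[List[int]]] = {name: [] for name in phrasemap}

    for i, tok in enumerate(tokens):
        for name, words in phrasemap.items():
            # Walk pattern positions from last to first so that the current
            # token cannot extend a partial match it has just created.
            for k in range(len(words) - 1, -1, -1):
                if words[k] != tok:
                    continue
                prefixes = [()] if k == 0 else partial.get((name, k), [])
                for prefix in prefixes:
                    extended = prefix + (i,)
                    if k + 1 == len(words):
                        complete[name].append(list(extended))
                    else:
                        partial.setdefault((name, k + 1), []).append(extended)
    return complete


# Tokens are written here in an already-normalized form for illustration.
tokens = ["in", "a", "hole", "in", "the", "ground", "there", "live", "a", "hobbit"]
result = scattered_matches(tokens, {"hobbits": ["in", "hole", "live", "hobbit"]})
print(result)  # {'hobbits': [[0, 2, 7, 9]]}
```

Because every partial match is kept, the work per token grows with the number of patterns and with the number of open partial matches, which is consistent with the cost caveat above.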
