Version 0.3.0
New Features
Hearst Patterns
This component implements Automatic Acquisition of Hyponyms from Large Text Corpora using the spaCy Matcher component.
Passing extended=True to the HyponymDetector will use the extended set of Hearst patterns, which includes hyponymy relations with higher recall but lower precision (e.g. X compared to Y, X similar to Y, etc.).
This component produces a doc-level attribute on the spaCy doc, doc._.hearst_patterns, which is a list of tuples of extracted hyponym pairs. Each tuple contains:
- The relation rule used to extract the hyponym (type: str)
- The more general concept (type: spacy.Span)
- The more specific concept (type: spacy.Span)
Usage:
import spacy
from scispacy.hyponym_detector import HyponymDetector
nlp = spacy.load("en_core_sci_sm")
hyponym_pipe = HyponymDetector(nlp, extended=True)
nlp.add_pipe(hyponym_pipe, last=True)
doc = nlp("Keystone plant species such as fig trees are good for the soil.")
print(doc._.hearst_patterns)
# Prints: [('such_as', Keystone plant species, fig trees)]
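Since each entry is a (rule, general concept, specific concept) tuple, the results can be unpacked directly. The following is only a minimal sketch of one way to group the extracted pairs by their general concept. The helper name group_hyponyms is hypothetical and not part of scispacy, and plain strings stand in for the spacy.Span objects so that the snippet runs without a model installed.

```python
from collections import defaultdict

# Stand-in for doc._.hearst_patterns. Real entries hold spacy Span objects;
# plain strings are used here (an assumption) so the sketch runs on its own.
extracted = [("such_as", "Keystone plant species", "fig trees")]


def group_hyponyms(pairs):
    """Group (rule, general, specific) tuples by the general concept.

    Hypothetical helper for illustration only; not part of scispacy.
    """
    grouped = defaultdict(list)
    for rule, general, specific in pairs:
        # str() gives the text for both plain strings and spaCy spans.
        grouped[str(general)].append((str(specific), rule))
    return dict(grouped)


hyponyms = group_hyponyms(extracted)
print(hyponyms)
```

With a real pipeline, doc._.hearst_patterns would be passed in place of the hand-written list.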
Ontonotes Mixin: Clear Format > UD
Thanks to Yoav Goldberg for this fix! Yoav noticed that the dependency labels for the OntoNotes data use a different format from the converted GENIA trees. He wrote some scripts to convert between the two formats, which also normalise some syntactic phenomena that were being treated inconsistently between the two corpora.
Bug Fixes
#252 - removed duplicated aliases in the entity linkers, reducing the size of the UMLS linker by ~10%
#249 - fixed the path to the RxNorm linker