using "noun_chunks" from custom extension #259

fukidzon · 2020-02-21T10:20:08Z

fukidzon
Feb 21, 2020

I wanted to use pytextrank together with spacy_udpipe to get keywords from texts in other languages (see https://stackoverflow.com/questions/59824405/spacy-udpipe-with-pytextrank-to-extract-keywords-from-non-english-text) but I realized, that udpipe-spacy somehow "overrides" the original spacy's pipeline so the noun_chunks are not generated (btw: the noun_chunks are created in lang/en/syntax_iterators.py but it doesn't exist for all languages so even if it is called, it doesn't work e.g. for Slovak language)

Pytextrank keywords are taken from the spacy doc.noun_chunks, but if the noun_chunks are not generated, pytextrank doesn't work.

Sample code:

import spacy_udpipe, spacy, pytextrank
spacy_udpipe.download("en") # download English model
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# using spacy_udpipe
nlp_udpipe = spacy_udpipe.load("en")
tr = pytextrank.TextRank(logger=None)
nlp_udpipe.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc_udpipe = nlp_udpipe(text)

print("keywords from udpipe processing:")
for phrase in doc_udpipe._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

# loading original spacy model
nlp_spacy = spacy.load("en_core_web_sm")
tr2 = pytextrank.TextRank(logger=None)
nlp_spacy.add_pipe(tr2.PipelineComponent, name="textrank", last=True)
doc_spacy = nlp_spacy(text)

print("keywords from spacy processing:")
for phrase in doc_spacy._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

Would it be possible that pytextrank processes the "noun_chunks" (candidates for keywords) from a custom extension (function which uses a Matcher and the result is available e.g. as a doc._.custom_noun_chunks - see explosion/spaCy#3856 )?

ceteri · 2020-02-23T19:20:56Z

ceteri
Feb 23, 2020
Maintainer

Thank you @fukidzon

I've reworked the code that you provided as a Colab notebook: https://gist.github.com/ceteri/f3bfac641cffb61e10af5aae7eefc9dd so people can view and interact with the problem.

The root issue appears to be that the noun chunks produced by the UDpipe extension are much less rich than what spaCy produces? Or is that information available through some other field?

To address your main question:

Could PyTextRank be extended to support doc._.custom_noun_chunks as in append the noun_chunk generator object explosion/spaCy#3856 ?

If you have an example of an extension that has implemented "3856" then yes we could add support for that in the next release.

0 replies

ceteri · 2020-02-23T19:29:27Z

ceteri
Feb 23, 2020
Maintainer

Also, there's another implied question:

Could PyTextRank be modified so that it does not depend on the availability of noun chunks ?

While that's possible, and somewhat closer to the original algorithm description, it would a larger job to refactor the code.

I'll take a look, and try to scope it. We may be able to add a use_chunks flag that's True by default.

Back to your original question on StackOverflow could you provide a brief example text in sk along with the code you're using to run that pipeline? If you could provide what phrases would be expected too, that would help lots!

0 replies

ceteri · 2020-02-23T19:30:59Z

ceteri
Feb 23, 2020
Maintainer

Another issue that was mentioned:

I get tokens with POS and DEP tags, but there is nothing in doc._.phrases (doc.noun_chunks is also empty) and in nlp.pipe_names is just ['textrank']

See the gist it appears that spacy_udpipe clears nlp.pipe_names

0 replies

ceteri · 2020-02-23T20:16:46Z

ceteri
Feb 23, 2020
Maintainer

@fukidzon @asajatovic

The points above identify two issues in the spacy_udpipe implementation. It might be more efficient to make a pull request on that project to resolve those issues, rather than adapt to them?

0 replies

asajatovic · 2020-02-24T19:34:42Z

asajatovic
Feb 24, 2020

@ceteri You can find the explanation for only ['textrank'] showing up in nlp.pipe_names at https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/language.py#L75-L76.

@fukidzon @ceteri
Regarding the doc.noun_chunks property, it is built from a dependency parsed document https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L577. If you peek a little deeper into spaCy source code, you'll notice some languages have an implementation of a proper syntax iterator (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L206, i.e. English https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py#L7) and others don't. The idea behind spacy-udpipe is to be a lightweight wrapper around the underlying UDPipe models. As the dependecy labels used both in spaCy and udpipe-spacy use the UD scheme for languages other than English and German, I believe the only required thing for doc.noun_chunks to work is a proper syntax iterator implementation. Taking all of this into account, I suggest you try the approach using doc._.custom_noun_chunks or try implementing the syntax iterator for your language. Hope this helps to solve your issue! :)

0 replies

ceteri · 2020-02-24T20:10:24Z

ceteri
Feb 24, 2020
Maintainer

Thank you kindly @asajatovic, that's good to know and makes a lot of sense to use that approach.

@fukidzon I can help with a syntax iterator implementation. To start I'd need more about a language sample and expected output -- there are core models in spaCy for the other languages in which I'm conversant.

0 replies

fukidzon · 2020-02-25T06:06:49Z

fukidzon
Feb 25, 2020
Author

@ceteri @asajatovic thank you for the comments!

I created a colab notebook with custom noun_chunks example for Slovak: https://colab.research.google.com/drive/1tLMUMpFTGvxvp32YQYF5LC-nlTlUdtYz

To create a syntax_iterators for Slovak language would be the best solution - I was already checking it but I think it needs a deeper look into the language structure to make it correctly (the best would be if it's a part of spaCy code, not just a local workaround )

We may be able to add a use_chunks flag that's True by default

I like the idea, that it can be possible to provide some other source of "noun_chunks"

0 replies

fukidzon · 2020-03-04T07:31:48Z

fukidzon
Mar 4, 2020
Author

I found a solution:

import spacy_udpipe, spacy, pytextrank
from spacy.matcher import Matcher
from spacy.attrs import POS

def get_chunks(doc):
    np_label = doc.vocab.strings.add("NP")
    matcher = Matcher(nlp.vocab)
    pattern = [{POS: 'ADJ', "OP": "+"}, {POS: {"IN": ["NOUN", "PROPN"]}, "OP": "+"}]
    matcher.add("Adjective(s), (p)noun", None, pattern)
    matches = matcher(doc)

    for match_id, start, end in matches:
        yield start, end, np_label
        
#spacy_udpipe.download("sk") # download model
nlp = spacy_udpipe.load("sk")
nlp.Defaults.syntax_iterators = {"noun_chunks" : get_chunks}  #noun_chunk replacement

tr = pytextrank.TextRank(logger=None)
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

text = "Wikipédia je webová encyklopédia s otvoreným obsahom, ktorú možno slobodne čítať aj upravovať. Je sponzorovaná neziskovou organizáciou Wikimedia Foundation. Má 285 nezávislých jazykových vydaní vrátane slovenského a najrozsiahlejšieho anglického. Popri článkoch encyklopedického typu obsahuje, najmä anglická encyklopédia, aj články podobajúce sa almanachu, atlasu či stránky aktuálnych udalostí. Wikipédia je jedným z najpopulárnejších zdrojov informácií na webe s približne 13 miliardami zobrazení mesačne. Jej rast je skoro exponenciálny. Wikipédii (takmer 2 milióny). Wikipédia bola spustená 15. januára 2001 ako doplnok k expertmi písanej Nupedii. So stále rastúcou popularitou sa Wikipédia stala podhubím pre sesterské projekty ako Wikislovník (Wiktionary), Wikiknihy (Wikibooks) a Wikisprávy (Wikinews). Jej články sú upravované dobrovoľníkmi vo wiki štýle, čo znamená, že články môže meniť v podstate hocikto. Wikipediáni presadzujú politiku „nestranný uhol pohľadu“. Podľa nej relevantné názory ľudí sú sumarizované bez ambície určiť objektívnu pravdu. Vzhľadom na to, že Wikipédia presadzuje otvorenú filozofiu, jej najväčším problémom je vandalizmus a nepresnosť. "
doc = nlp(text)

print("Noun chunks:")
for nc in doc.noun_chunks:
    print(nc)
    
print("\nKeywords:")
for phrase in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

I'm not sure how clean is this workaround nlp.Defaults.syntax_iterators = {"noun_chunks" : get_chunks}, but it works (It's based on how are the noun_chunks defined in syntax_iterators.py and __init__.py in spaCy/lang/en)

0 replies

ghost · 2020-03-05T17:51:59Z

ghost
Mar 5, 2020

I have much the same problem using the pt_core_news_sm model. Noun_chunks are empty since there is no appropriate syntax_iterator. I tested this solution there and it's also able to produce results now. However, I noticed that it seems that only the noun_chunks can be returned as keywords. I'm not sure why that is.

I'm not sure how clean is this workaround nlp.Defaults.syntax_iterators = {"noun_chunks" : get_chunks}, but it works (It's based on how are the noun_chunks defined in syntax_iterators.py and __init__.py in spaCy/lang/en)

Adding a syntax_iterator seems like the cleanest thing to do. The only concern I would have with the presented solution is that it requires the parser to be after the tagger in the pipeline.

0 replies

andremacola · 2021-05-05T09:07:07Z

andremacola
May 5, 2021

Hi, I'm trying to use nlp.Defaults.syntax_iterators with spaCy v3 with no success. My language (pt) does not have the syntax_iterator.py file in the spacy lang folder.

Is this only works with spacy_udpipe ? Which I'm not using.

0 replies

ceteri · 2021-05-05T19:56:24Z

ceteri
May 5, 2021
Maintainer

Hi @andremacola, could you help us by showing some example code about the pipeline you're building with spaCy 3.x ?
We may be able to help.

The code for syntax_iterators is in https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py

Also, if this is more of a spaCy question, we could move this thread to https://github.com/explosion/spaCy/discussions/ to get more help.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

using "noun_chunks" from custom extension #259

{{title}}

Replies: 11 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

using "noun_chunks" from custom extension #259

fukidzon Feb 21, 2020

Replies: 11 comments

ceteri Feb 23, 2020 Maintainer

ceteri Feb 23, 2020 Maintainer

ceteri Feb 23, 2020 Maintainer

ceteri Feb 23, 2020 Maintainer

asajatovic Feb 24, 2020

ceteri Feb 24, 2020 Maintainer

fukidzon Feb 25, 2020 Author

fukidzon Mar 4, 2020 Author

ghost Mar 5, 2020

andremacola May 5, 2021

ceteri May 5, 2021 Maintainer

fukidzon
Feb 21, 2020

ceteri
Feb 23, 2020
Maintainer

ceteri
Feb 23, 2020
Maintainer

ceteri
Feb 23, 2020
Maintainer

ceteri
Feb 23, 2020
Maintainer

asajatovic
Feb 24, 2020

ceteri
Feb 24, 2020
Maintainer

fukidzon
Feb 25, 2020
Author

fukidzon
Mar 4, 2020
Author

ghost
Mar 5, 2020

andremacola
May 5, 2021

ceteri
May 5, 2021
Maintainer