Replies: 4 comments
-
Hi @imeano, given how TextRank works, there are strict requirements on what the parser needs to produce.
The noun chunking was an extension I added (along with the use of lemmatization) to make the algorithm more effective. Does that help? Also, does https://spacy.io/models/pt provide an effective parser for Portuguese?
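As a rough illustration of why lemmatization makes a TextRank-style ranking more effective (a hypothetical sketch, not pytextrank's internals): without lemmas, surface variants of the same word become separate graph nodes and split their score, while keying nodes on (lemma, POS) merges them.

```python
# Hypothetical sketch: the effect of lemmatized node keys on a TextRank-style graph.
# Tokens are (surface, lemma, pos) triples; the Portuguese words are illustrative.
from collections import Counter

tokens = [
    ("algoritmos", "algoritmo", "NOUN"),  # plural surface form
    ("algoritmo", "algoritmo", "NOUN"),   # singular surface form
    ("eficaz", "eficaz", "ADJ"),
]

surface_nodes = Counter(t[0] for t in tokens)        # 3 distinct nodes
lemma_nodes = Counter((t[1], t[2]) for t in tokens)  # noun variants merge into 1
```

With surface keys the two noun forms would each get a separate node; with (lemma, POS) keys they share one node and accumulate a single score.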
-
Thanks for the response. It does answer my question, even though I didn't ask it as clearly as I could have. I used spaCy's terminology without specifying it: because spaCy's DependencyParser, as a pipeline component, is called simply "parser", I tend to call it that too. From testing, I came up with the following:
So, assuming those features are the only ones pytextrank needs to work properly, it seems I can disable the DependencyParser as long as I include noun chunking and sentence segmentation pipeline components. I was fairly sure I could get it to work with these changes, but was afraid of getting different results.
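The POS-based chunking described above can be sketched, independently of spaCy, as a greedy pattern over (token, POS) pairs. This is a hypothetical illustration of the idea, not the actual component from issue #54:

```python
def pos_noun_chunks(tagged):
    """Approximate parser-based noun_chunks from POS tags alone:
    collect maximal runs of DET/ADJ/NOUN/PROPN that contain at least
    one NOUN or PROPN. `tagged` is a list of (word, pos) pairs."""
    allowed = {"DET", "ADJ", "NOUN", "PROPN"}
    nouns = {"NOUN", "PROPN"}
    chunks, run = [], []
    for word, pos in tagged + [("", "X")]:  # sentinel flushes the last run
        if pos in allowed:
            run.append((word, pos))
        else:
            if any(p in nouns for _, p in run):
                chunks.append(" ".join(w for w, _ in run))
            run = []
    return chunks
```

For Portuguese, where adjectives typically follow the noun, this keeps spans like "o gato preto" together; a parser-based chunker would of course handle harder cases (coordination, embedded PPs) that this pattern misses.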
Mostly effective, I would say. I've worked with linguists, and they couldn't make much use of the syntactic trees produced (errors in the syntactic parse tend to accumulate the farther from ROOT you get). Sentence segmentation is quite good for sentences that aren't too long. As for the Tagger, POS accuracy is quite good, but the TAG_MAP is too large, in my opinion.
-
Can you please elaborate on this more explicitly? i.e., if we can't remove the parser, then which components are redundant? Please spell out the redundant parts more explicitly 🙇 It means a lot when the text is big, and removing any redundant pipeline component would help a lot, memory-wise.
-
Hi @guy4261, no: none of the textgraph algorithms will work with the parser disabled. Disabling NER might be an option. It depends on the language, the versions of the other pipeline components, etc., so you'd need to experiment.
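A minimal sketch of that kind of experimentation (my example, not from the thread): spaCy's rule-based sentencizer gives sentence boundaries without a dependency parser. A blank pipeline is used here only so the snippet runs without downloading a model; in practice you would load a real model such as `pt_core_news_sm` and pass `disable=["ner"]` to `spacy.load()` to trim memory.

```python
import spacy

# Blank Portuguese pipeline: no parser, no NER, no model download needed.
nlp = spacy.blank("pt")
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries

doc = nlp("Olá. Tudo bem?")
sentences = [sent.text for sent in doc.sents]  # two sentences, no parser involved
```

Checking `nlp.pipe_names` before and after dropping components is a quick way to confirm what is actually loaded.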
-
Hello,
I'm using pytextrank with texts in Portuguese. Thanks to issue #54, I'm able to use POS information to produce some basic noun chunking, instead of syntactic information from the parser.
My question is: in this case, where I'm producing chunks from POS, am I losing anything if I disable the parser and create a new pipeline component just for chunking? Is there other relevant information from the parser being used?