
Add reference analysis and apply google style docstring #14

Closed
wants to merge 3 commits into from

Conversation

@Huffon commented Jan 15, 2020

Hi,
I enjoyed your awesome project! Thanks for the hard work.
I read the issue (#2) you opened, and added a naive citation_analysis logic using the Semantic Scholar API (your recommendation).

It calls the API using the input paper's DOI and gets the referenced papers' addresses.
Then it applies the label_paper logic to extract tag information from the referenced papers.

After that, using reference_tag, we can enrich the justification for the label recommendation :)
The result would look like this:

# Title: Inducing Document Structure for Aspect-based Summarization
# Online location: https://www.aclweb.org/anthology/P19-1630.pdf
# CHECK: confidence = 0.9, justification = Matched regex multi-task learning
train-mtl
# CHECK: confidence = 0.9, justification = Matched regex Recurrent Neural Network|RNN|recurrent neural networks, 4 occurrences in the refs
arch-rnn
# CHECK: confidence = 0.9, justification = Matched regex Long Short-term Memory|LSTMs|LSTM, 4 occurrences in the refs
arch-lstm
# CHECK: confidence = 0.9, justification = Matched regex Convolutional Neural Networks|CNNs|convolutional neural network, 3 occurrences in the refs
arch-cnn
# CHECK: confidence = 0.9, justification = Matched regex attention, 4 occurrences in the refs
arch-att
# CHECK: confidence = 0.9, justification = Matched regex coverage, 2 occurrences in the refs
arch-coverage
...

Because of the API's limitations, we sometimes get no information about a referenced paper.
I think that can be future work. Multi-processing is also needed to accelerate the PDF download logic 👍
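The per-tag counting over references described above might be sketched roughly like this (the tag regexes and function names below are illustrative placeholders, not the actual repo code; in the real flow the titles would come from the Semantic Scholar API response for the paper's DOI):

```python
import re
from collections import Counter

# Illustrative tag regexes, mirroring the patterns shown in the output above;
# the real tag list would live in the project's template files.
TAG_PATTERNS = {
    "arch-rnn": r"Recurrent Neural Network|RNN|recurrent neural networks",
    "arch-lstm": r"Long Short-term Memory|LSTMs|LSTM",
    "arch-att": r"attention",
}

def count_reference_tags(reference_titles):
    """Count how many referenced papers match each tag's regex.

    `reference_titles` stands in for the referenced papers fetched via
    the Semantic Scholar API; here it is just a list of title strings.
    """
    counts = Counter()
    for title in reference_titles:
        for tag, pattern in TAG_PATTERNS.items():
            if re.search(pattern, title, flags=re.IGNORECASE):
                counts[tag] += 1
    return counts
```

The counts can then back justifications such as "4 occurrences in the refs" in the output above.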

@neubig (Contributor) commented Jan 15, 2020

First, thank you so much, this is great!

However, this is a little different from what I meant by citation analysis. Currently I believe the logic is: find up to n_ref citations, download them, run the classifier over the references, and note how many referenced papers were classified as including the concept.

Rather, what I meant is something like: search all the citations for particular "known" papers (e.g. for Transformer, Vaswani et al. 2017), and if these papers appear, mark the paper as (potentially) using these concepts.

I'm a little bit hesitant to incorporate this change as-is because, as you noted, downloading all the references adds quite a bit of overhead to the processing of a single paper, so something more light-weight would be preferable.

@Huffon (Author) commented Jan 16, 2020

Thanks for your feedback!
I misunderstood your intention and have changed the citation analysis logic.

The changed logic is as follows:

  1. Take use_cite (bool) as an argument (note: to use the use_cite option, paper_id must be specified!)
  2. If use_cite is set, fetch the list of papers that cite the specified paper_id using the Semantic Scholar API
  3. Then label those citing papers with the specified paper's tag list

The result will look like this:
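The tag-propagation step (3) above might be sketched like this (a minimal sketch with illustrative names; `citing_papers` stands in for whatever the Semantic Scholar API returns as the citations list):

```python
def propagate_tags(source_id, source_tags, citing_papers):
    """Build one annotation block per citing paper, copying the source paper's tags.

    `citing_papers` is a list of (paper_id, title) pairs, e.g. parsed from
    the Semantic Scholar API's citations field for `source_id`.
    """
    blocks = {}
    for paper_id, title in citing_papers:
        lines = [f"# Title: {title}"]
        for tag in source_tags:
            lines.append(f"# CHECK: justification = Found from reference paper {source_id}")
            lines.append(tag)
        # Each block would be written out as auto/<paper_id>.txt.
        blocks[paper_id] = "\n".join(lines)
    return blocks
```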

$ python get_paper.py --paper_id N19-1329 --template template.cpt --feature fulltext --use_cite True
Totally 6 new papers are labeled using N19-1329 tag information.
: [D19-1607] Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model
: [D19-1167] Investigating Multilingual NMT Representations at Scale
: [P19-1283] Correlating neural and symbolic representations of language
: [D19-6106] BERT is Not an Interlingua and the Bias of Tokenization
: [D19-1275] Designing and Interpreting Probes with Control Tasks
: [D19-3022] LINSPECTOR WEB: A Multilingual Probing Suite for Word Representations

$ cat auto/D19-1607.txt
# Title: Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model
# Online location: https://www.aclweb.org/anthology/D19-1607.pdf
# CHECK: justification = Found from reference paper N19-1329
pre-elmo
# CHECK: justification = Found from reference paper N19-1329
arch-lstm
# CHECK: justification = Found from reference paper N19-1329
loss-cca
# CHECK: justification = Found from reference paper N19-1329
loss-svd
# CHECK: justification = Found from reference paper N19-1329
task-seqlab
# CHECK: justification = Found from reference paper N19-1329
task-lm

Now we don't download any PDF files; we just look up a pre-annotated text file.
So I guess the overhead won't be that big!

@neubig (Contributor) commented Jan 17, 2020

Hi! I'm sorry, but it seems like I didn't communicate this clearly.

Currently, the code is looking at a paper (e.g. N19-1329) and applying its tags to all papers that reference it. However, this seems like a bit of overkill. Just because a referenced paper uses a concept doesn't necessarily mean that it is also used in the paper doing the referencing.

What I meant instead is that we have a list of particular papers that are very representative of the concept, e.g.:

  • "Attention is All You Need" -> arch-transformer
  • "Adam: A Method for Stochastic Optimization" -> optim-adam

and if that particular paper appears in the references, then we assign that topic to the paper that includes that reference.
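A minimal sketch of that lookup, assuming the reference titles are already in hand (the KNOWN_PAPERS mapping and function name below are hypothetical illustrations, not project code):

```python
# Hypothetical mapping from landmark paper titles to concept tags.
KNOWN_PAPERS = {
    "Attention is All You Need": "arch-transformer",
    "Adam: A Method for Stochastic Optimization": "optim-adam",
}

def tags_from_references(reference_titles):
    """Assign a tag whenever a known landmark paper shows up in the references."""
    found = set()
    for title in reference_titles:
        for known, tag in KNOWN_PAPERS.items():
            # Case-insensitive substring match against each reference title.
            if known.lower() in title.lower():
                found.add(tag)
    return found
```

This only needs the reference list (one API call), not a download of each referenced paper, which keeps it lightweight.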

@Huffon closed this Jun 10, 2020