
Add reference analysis and apply google style docstring #14

Closed
wants to merge 3 commits into from

Conversation

@Huffon commented Jan 15, 2020

Hi,
I enjoyed your awesome project! Thanks for the hard work.
I read the issue (#2) you opened, and added a naive citation_analysis logic using the Semantic Scholar API (your recommendation).

It calls the API using the input paper's DOI and gets the referenced papers' addresses.
Then it applies the label_paper logic to extract tag information from the referenced papers.

After that, using reference_tag, we can enrich the justification for the label recommendation :)
The result would look like this:

# Title: Inducing Document Structure for Aspect-based Summarization
# Online location: https://www.aclweb.org/anthology/P19-1630.pdf
# CHECK: confidence = 0.9, justification = Matched regex multi-task learning
train-mtl
# CHECK: confidence = 0.9, justification = Matched regex Recurrent Neural Network|RNN|recurrent neural networks, 4 occurrences in the refs
arch-rnn
# CHECK: confidence = 0.9, justification = Matched regex Long Short-term Memory|LSTMs|LSTM, 4 occurrences in the refs
arch-lstm
# CHECK: confidence = 0.9, justification = Matched regex Convolutional Neural Networks|CNNs|convolutional neural network, 3 occurrences in the refs
arch-cnn
# CHECK: confidence = 0.9, justification = Matched regex attention, 4 occurrences in the refs
arch-att
# CHECK: confidence = 0.9, justification = Matched regex coverage, 2 occurrences in the refs
arch-coverage
...

Because of the API's limitations, we sometimes get no information about a referenced paper.
I think that can be future work. Multi-processing is also needed to accelerate the PDF download logic 👍
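The per-tag counting over references described above might be sketched roughly like this (the tag regexes and function names below are illustrative placeholders, not the actual repo code; in the real flow the titles would come from the Semantic Scholar API response for the paper's DOI):

```python
import re
from collections import Counter

# Illustrative tag regexes, mirroring the patterns shown in the output above;
# the real tag list would live in the project's template files.
TAG_PATTERNS = {
    "arch-rnn": r"Recurrent Neural Network|RNN|recurrent neural networks",
    "arch-lstm": r"Long Short-term Memory|LSTMs|LSTM",
    "arch-att": r"attention",
}

def count_reference_tags(reference_titles):
    """Count how many referenced papers match each tag's regex.

    `reference_titles` stands in for the referenced papers fetched via
    the Semantic Scholar API; here it is just a list of title strings.
    """
    counts = Counter()
    for title in reference_titles:
        for tag, pattern in TAG_PATTERNS.items():
            if re.search(pattern, title, flags=re.IGNORECASE):
                counts[tag] += 1
    return counts
```

The counts can then back justifications such as "4 occurrences in the refs" in the output above.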

@neubig (Contributor) commented Jan 15, 2020

First, thank you so much, this is great!

However, this is a little different from what I meant by citation analysis. Currently I believe the logic is: find up to n_ref citations, download them, run the classifier over the references, and note how many referenced papers were classified as including the concept.

Rather, what I meant is something like: search all the citations for particular "known" papers (e.g. for Transformer, Vaswani et al. 2017), and if these papers appear, mark the paper as (potentially) using these concepts.

I'm a little bit hesitant to incorporate this change as-is because, as you noted, downloading all the references adds quite a bit of overhead to the processing of a single paper, so something more light-weight would be preferable.

@Huffon (Author) commented Jan 16, 2020

Thanks for your feedback!
I misunderstood your intention and have changed the citation analysis logic.

The changed logic is as follows:

  1. Take use_cite (bool) as an argument (note: to use the use_cite option, paper_id must be specified!)
  2. If use_cite is set, fetch the list of papers that cite the specified paper_id using the Semantic Scholar API
  3. Then label those citing papers with the specified paper's tag list

The result will look like this:
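The tag-propagation step (3) above might be sketched like this (a minimal sketch with illustrative names; `citing_papers` stands in for whatever the Semantic Scholar API returns as the citations list):

```python
def propagate_tags(source_id, source_tags, citing_papers):
    """Build one annotation block per citing paper, copying the source paper's tags.

    `citing_papers` is a list of (paper_id, title) pairs, e.g. parsed from
    the Semantic Scholar API's citations field for `source_id`.
    """
    blocks = {}
    for paper_id, title in citing_papers:
        lines = [f"# Title: {title}"]
        for tag in source_tags:
            lines.append(f"# CHECK: justification = Found from reference paper {source_id}")
            lines.append(tag)
        # Each block would be written out as auto/<paper_id>.txt.
        blocks[paper_id] = "\n".join(lines)
    return blocks
```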

$ python get_paper.py --paper_id N19-1329 --template template.cpt --feature fulltext --use_cite True
Totally 6 new papers are labeled using N19-1329 tag information.
: [D19-1607] Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model
: [D19-1167] Investigating Multilingual NMT Representations at Scale
: [P19-1283] Correlating neural and symbolic representations of language
: [D19-6106] BERT is Not an Interlingua and the Bias of Tokenization
: [D19-1275] Designing and Interpreting Probes with Control Tasks
: [D19-3022] LINSPECTOR WEB: A Multilingual Probing Suite for Word Representations

$ cat auto/D19-1607.txt
# Title: Zero-shot Reading Comprehension by Cross-lingual Transfer Learning with Multi-lingual Language Representation Model
# Online location: https://www.aclweb.org/anthology/D19-1607.pdf
# CHECK: justification = Found from reference paper N19-1329
pre-elmo
# CHECK: justification = Found from reference paper N19-1329
arch-lstm
# CHECK: justification = Found from reference paper N19-1329
loss-cca
# CHECK: justification = Found from reference paper N19-1329
loss-svd
# CHECK: justification = Found from reference paper N19-1329
task-seqlab
# CHECK: justification = Found from reference paper N19-1329
task-lm

Now we don't download any PDF files; we just look up a pre-annotated text file.
So I guess the overhead won't be that big!

@neubig (Contributor) commented Jan 17, 2020

Hi! I'm sorry, but it seems like I didn't communicate this clearly.

Currently, the code is looking at a paper (e.g. N19-1329) and applying its tags to all papers that reference it. However, this seems like a bit of overkill. Just because a referenced paper uses a concept doesn't necessarily mean that it is also used in the paper doing the referencing.

What I meant instead is that we have a list of particular papers that are very representative of the concept, e.g.:

  • "Attention is All You Need" -> arch-transformer
  • "Adam: A Method for Stochastic Optimization" -> optim-adam

and if that particular paper appears in the references, then we assign that topic to the paper that includes that reference.
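A minimal sketch of that lookup, assuming the reference titles are already in hand (the KNOWN_PAPERS mapping and function name below are hypothetical illustrations, not project code):

```python
# Hypothetical mapping from landmark paper titles to concept tags.
KNOWN_PAPERS = {
    "Attention is All You Need": "arch-transformer",
    "Adam: A Method for Stochastic Optimization": "optim-adam",
}

def tags_from_references(reference_titles):
    """Assign a tag whenever a known landmark paper shows up in the references."""
    found = set()
    for title in reference_titles:
        for known, tag in KNOWN_PAPERS.items():
            # Case-insensitive substring match against each reference title.
            if known.lower() in title.lower():
                found.add(tag)
    return found
```

This only needs the reference list (one API call), not a download of each referenced paper, which keeps it lightweight.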

@Huffon closed this Jun 10, 2020