# BERT ParsCit


## Set up PDF parsing engine s2orc-doc2json

The current doc2json tool uses Grobid to first convert each PDF into XML, then extracts paper components from the XML. If installing Doc2Json or Grobid with `bin/doc2json/scripts/run.sh` fails, try running the following commands:

```shell
cd bin/doc2json
python setup.py develop
cd ../..
```

This will set up Doc2Json.

## Install Grobid

You will need Java installed on your machine. You can then install and run your own version of Grobid, or simply run the following script:

```shell
bash bin/doc2json/scripts/setup_grobid.sh
```

This will set up Grobid, currently hard-coded to version 0.6.1. Then run:

```shell
bash bin/doc2json/scripts/run_grobid.sh
```

to start the Grobid server. Don't worry if it gets stuck at 87%; this is normal and means Grobid is ready to process PDFs.
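Before sending PDFs through the pipeline, you can confirm that the server is actually reachable. A minimal sketch, assuming Grobid is running on its default port 8070 and exposes the standard `/api/isalive` health endpoint:

```python
import urllib.error
import urllib.request


def grobid_is_alive(url="http://localhost:8070/api/isalive", timeout=3):
    """Return True if the Grobid server answers its health-check endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If this returns `False`, check the Grobid console output for startup errors before trying to parse any PDFs.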

## How to Parse Reference Strings

To parse reference strings from a PDF file, try:

```python
from src.pipelines.bert_parscit import predict_for_pdf

results, tokens, tags = predict_for_pdf(filename, output_dir, temp_dir)
```

This will generate a text file of reference strings in the specified `output_dir`, and the JSON representation of the original PDF will be saved in the specified `temp_dir`. By default, `output_dir` is `result/` and `temp_dir` is `temp/`, both relative to your current path. The output `results` is a list of tagged strings that looks like:

```python
['<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>']
```

`tokens` is a list of the original tokens of the input file, and `tags` is the list of predicted tags corresponding to those tokens. `tokens`:

```python
[['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).']]
```

`tags`:

```python
[['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']]
```
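Since `tokens` and `tags` line up one-to-one, it is straightforward to post-process them into labeled spans. A minimal sketch (the `group_spans` helper is ours, not part of the library):

```python
def group_spans(tokens, tags):
    """Merge consecutive tokens that share a tag into (tag, text) spans."""
    spans = []
    for tok, tag in zip(tokens, tags):
        if spans and spans[-1][0] == tag:
            # Same tag as the previous token: extend the current span.
            spans[-1] = (tag, spans[-1][1] + " " + tok)
        else:
            spans.append((tag, tok))
    return spans


tokens = ['Waleed', 'Ammar,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system']
tags = ['author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title']
# group_spans(tokens, tags) →
# [('author', 'Waleed Ammar, and Russell Power.'),
#  ('date', '2017.'),
#  ('title', 'The ai2 system')]
```

Remember that `predict_for_pdf` returns lists of lists (one inner list per reference string), so apply the helper per reference.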

You can also process a single string or parse strings from a text file:

```python
from src.pipelines.bert_parscit import predict_for_string, predict_for_text

str_results, str_tokens, str_tags = predict_for_string(
    "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).")
text_results, text_tokens, text_tags = predict_for_text(filename)
```

## To extract strings from a PDF

You can extract the strings you need with the `pdf2text.py` script. For example, to get reference strings, try:

```shell
python pdf2text.py --input_file file_path --reference --output_dir output/ --temp_dir temp/
```
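If you need to run this over many PDFs, a small wrapper that assembles the same command line can help. A minimal sketch (the `build_pdf2text_cmd` helper is ours; only the flags come from the command above):

```python
import sys


def build_pdf2text_cmd(input_file, output_dir="output/", temp_dir="temp/",
                       reference=True):
    """Assemble the argument list for the pdf2text.py invocation shown above."""
    cmd = [sys.executable, "pdf2text.py", "--input_file", input_file]
    if reference:
        cmd.append("--reference")
    cmd += ["--output_dir", output_dir, "--temp_dir", temp_dir]
    return cmd


# Pass the resulting list to subprocess.run(...) for each PDF in a loop.
```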