EMBO HTML format
We begin with a <div> node whose class attribute has the value article.
Each main-body section (after the Abstract and before the Acknowledgments) is a <div> node with class="section" and an id of the form sec-\d+. The section heading is in an <h2> node under it.
A section <div> node may contain subsection <div> nodes, each with class="subsection" and an id of the form sec-\d+. The subsection heading is in an <h3> node under it.
Paragraphs are <p> nodes under section or subsection nodes. Each paragraph has an id attribute of the form p-\d+.
Note that sections and subsections share a single id counter, so their numbering is continuous.
An example is as follows:
<div class="section" id="sec-1">
  <h2>1st section</h2>
  <div class="subsection" id="sec-2">
    <h3>1st subsection</h3>
    <p id="p-2">1st paragraph</p>
  </div>
  <div class="subsection" id="sec-3">
    <h3>2nd subsection</h3>
    <p id="p-3">2nd paragraph</p>
  </div>
</div>
Hence, extracting main-body paragraphs is easy: select the <p> nodes whose parent has an id matching sec-\d+.
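The rule above can be sketched with Python's standard-library html.parser. This is a minimal illustration, not the project's actual extraction code: the class name MainBodyParagraphs, the helper extract_main_body_paragraphs, and the sample markup (including the id-less fig-caption <div>) are all hypothetical.

```python
import re
from html.parser import HTMLParser

SEC_ID = re.compile(r"sec-\d+")

# Void elements never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainBodyParagraphs(HTMLParser):
    """Collect <p> nodes whose direct parent has an id of the form sec-N."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, id) for every currently open element
        self.paragraphs = []   # (paragraph id, text) pairs in document order
        self._current = None   # [id, text chunks] while inside a matching <p>
        self._p_depth = 0      # stack depth at which the matching <p> opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node_id = dict(attrs).get("id")
        parent_id = self.stack[-1][1] if self.stack else None
        if (tag == "p" and self._current is None
                and parent_id and SEC_ID.fullmatch(parent_id)):
            self._current = [node_id, []]
            self._p_depth = len(self.stack)
        self.stack.append((tag, node_id))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        # The paragraph is finished once the stack is back at its depth.
        if self._current is not None and len(self.stack) <= self._p_depth:
            pid, chunks = self._current
            self.paragraphs.append((pid, "".join(chunks).strip()))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)


def extract_main_body_paragraphs(html):
    parser = MainBodyParagraphs()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical sample: the caption paragraph is skipped because its parent
# <div> carries no sec-N id in this made-up markup.
example = """
<div class="section" id="sec-1">
  <h2>1st section</h2>
  <p id="p-1">Intro paragraph</p>
  <div class="subsection" id="sec-2">
    <h3>1st subsection</h3>
    <p id="p-2">1st paragraph</p>
  </div>
  <div class="fig-caption"><p>Figure 1 caption</p></div>
</div>
"""
print(extract_main_body_paragraphs(example))
```

Matching on the direct parent only, rather than any ancestor, keeps paragraphs nested inside other containers out of the result.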
The abstract is a <div> node with class="section abstract". Under it are one or more paragraph <p> nodes.
A figure caption is a <div> node with class="fig-caption". Under it are one or more paragraph <p> nodes.
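Abstract and caption paragraphs can be collected the same way, keyed on the class of the containing <div> instead of its id. The sketch below again uses the standard-library html.parser and assumes that the <p> nodes are direct children of the container. The names CONTAINERS, container_kind and ContainerParagraphs, as well as the sample markup, are hypothetical.

```python
from html.parser import HTMLParser

# Class tokens that identify each container, as described above.
CONTAINERS = {
    "abstract": {"section", "abstract"},
    "fig-caption": {"fig-caption"},
}

# Void elements never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def container_kind(classes):
    """Return the CONTAINERS key whose tokens are all present, else None."""
    for kind, tokens in CONTAINERS.items():
        if tokens <= classes:
            return kind
    return None


class ContainerParagraphs(HTMLParser):
    """Group <p> text by the kind of <div> that directly contains it."""

    def __init__(self):
        super().__init__()
        self.stack = []                                  # (tag, kind) per open element
        self.found = {kind: [] for kind in CONTAINERS}   # kind -> paragraph texts
        self._kind = None                                # kind of the <p> being read
        self._chunks = []
        self._p_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        kind = None
        if tag == "div":
            classes = set((dict(attrs).get("class") or "").split())
            kind = container_kind(classes)
        elif tag == "p" and self._kind is None and self.stack:
            parent_kind = self.stack[-1][1]
            if parent_kind is not None:
                self._kind, self._chunks = parent_kind, []
                self._p_depth = len(self.stack)
        self.stack.append((tag, kind))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        if self._kind is not None and len(self.stack) <= self._p_depth:
            self.found[self._kind].append("".join(self._chunks).strip())
            self._kind = None

    def handle_data(self, data):
        if self._kind is not None:
            self._chunks.append(data)


# Hypothetical sample markup for illustration only.
sample = """
<div class="section abstract">
  <p>First abstract paragraph.</p>
  <p>Second abstract paragraph.</p>
</div>
<div class="section" id="sec-1"><p id="p-1">Body paragraph.</p></div>
<div class="fig-caption">
  <p>Figure 1. Caption title.</p>
  <p>(A) Panel description.</p>
</div>
"""
parser = ContainerParagraphs()
parser.feed(sample)
parser.close()
print(parser.found)
```

Comparing class tokens as a set means that a plain class="section" <div> is not mistaken for the abstract, and the order of tokens within the attribute does not matter.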
In our training examples, we have not encountered any papers that include tables.
- Citations are <a> nodes covering both author name and year, e.g., <a class="xref-bibr">X et al, 2001</a>.
- Cross-references are also <a> nodes, e.g., <a class="xref-supplementary-material">Supplementary Figure 1F</a>.
- Italic and bold text are marked by inline nodes. There are also <strong> nodes, but they do not occur in the text body.
nodes. But they are not in text body. - and