
EMBO HTML format

Forrest Bao edited this page Jun 28, 2018 · 1 revision

The article content starts at a <div> node whose class attribute has the value article.

Main-body sections and subsections

Each main-body section (after the Abstract and before the Acknowledgments) is a <div> node with class="section" and an id matching sec-\d+. The section heading is in an <h2> node under it.

Subsections are <div> nodes under a section <div> node. Each has class="subsection" and an id matching sec-\d+. The subsection heading is in an <h3> node under it.

Paragraphs are <p> nodes under section or subsection nodes. Each paragraph has an id of the form p-\d+.

Note that sections and subsections share one continuous id counter.

An example is as follows:

<div class="section" id="sec-1">
  <h2>1st section</h2>
  <div class="subsection" id="sec-2">
    <h3>1st subsection</h3>
    <p id="p-2">1st paragraph</p>
  </div>
  <div class="subsection" id="sec-3">
    <h3>2nd subsection</h3>
    <p id="p-3">2nd paragraph</p>
  </div>
</div>

Hence, extracting main-body paragraphs is easy: select the <p> nodes whose parent has an id matching sec-\d+.
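The rule above can be sketched with Python's standard-library html.parser. This is a minimal sketch, not the extraction code used for this project: the names MainBodyParagraphs and extract_main_body_paragraphs are hypothetical, and it assumes well-formed markup in which every non-void element has an explicit closing tag.

```python
import re
from html.parser import HTMLParser

SEC_ID = re.compile(r"sec-\d+")  # id pattern shared by sections and subsections

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainBodyParagraphs(HTMLParser):
    """Collect the text of <p> nodes whose direct parent has an id like sec-<n>."""

    def __init__(self):
        super().__init__()
        self.stack = []       # id (or None) of every currently open element
        self.paragraphs = []  # extracted paragraph texts, in document order
        self._buffer = None   # text pieces of the paragraph being read, if any
        self._p_depth = 0     # stack depth at which that paragraph was opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_id = self.stack[-1] if self.stack else None
        if (tag == "p" and self._buffer is None
                and parent_id and SEC_ID.fullmatch(parent_id)):
            self._buffer = []
            self._p_depth = len(self.stack)
        self.stack.append(dict(attrs).get("id"))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        if (tag == "p" and self._buffer is not None
                and len(self.stack) == self._p_depth):
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_main_body_paragraphs(html):
    """Return the main-body paragraph texts found in an HTML string."""
    parser = MainBodyParagraphs()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Because the check looks only at the direct parent's id, paragraphs in the abstract and in figure captions are skipped without any extra filtering, while inline markup inside a paragraph is flattened to plain text.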

Abstract

The abstract is a <div> node with class="section abstract". Its paragraphs are the <p> nodes under it.

Figure captions

A figure caption is a <div> node with class="fig-caption". Its paragraphs are the <p> nodes under it.
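Abstract and figure-caption paragraphs are both found by the class of their parent node, so one helper can cover both. The sketch below is illustrative only: ParagraphsByParentClass and paragraphs_under are hypothetical names, and it assumes the <p> nodes are direct children of the matching node and that the markup is well formed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParagraphsByParentClass(HTMLParser):
    """Collect <p> text whose direct parent carries all the required class tokens."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.stack = []       # class-token set of every currently open element
        self.paragraphs = []  # extracted paragraph texts, in document order
        self._buffer = None   # text pieces of the paragraph being read, if any
        self._p_depth = 0     # stack depth at which that paragraph was opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_classes = self.stack[-1] if self.stack else set()
        if (tag == "p" and self._buffer is None
                and self.required <= parent_classes):
            self._buffer = []
            self._p_depth = len(self.stack)
        # class="section abstract" holds two tokens: "section" and "abstract"
        self.stack.append(set((dict(attrs).get("class") or "").split()))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        if (tag == "p" and self._buffer is not None
                and len(self.stack) == self._p_depth):
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def paragraphs_under(html, required_classes):
    """Return paragraph texts whose parent has every class in required_classes."""
    parser = ParagraphsByParentClass(required_classes)
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

For example, paragraphs_under(html, ["section", "abstract"]) would select the abstract and paragraphs_under(html, ["fig-caption"]) the figure captions. Passing only ["section"] would also match the abstract, since its class attribute contains the section token.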

Table

In our training examples, we have not encountered any papers that include tables.

Style

  • Citations are <a> nodes covering both author name and year, e.g., <a class="xref-bibr">X et al, 2001</a>.
  • Cross-references are also <a> nodes, e.g., <a class="xref-supplementary-material">Supplementary Figure 1F</a>.
  • means italic and bold. There are also <strong> nodes, but not in the text body.
  • and
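Citations and cross-references can be told apart by the class of the <a> node. The following is a rough sketch under stated assumptions: AnchorCollector and split_references are hypothetical names, and treating every other class that starts with xref- as a cross-reference is a guess based on the single xref-supplementary-material example above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Record (class tokens, text) for every <a> node in the document."""

    def __init__(self):
        super().__init__()
        self.anchors = []     # list of (class-token list, anchor text)
        self._classes = None  # class tokens of the <a> being read, if any
        self._buffer = []     # text pieces of that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self._classes is None:
            self._classes = (dict(attrs).get("class") or "").split()
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a" and self._classes is not None:
            self.anchors.append((self._classes, "".join(self._buffer).strip()))
            self._classes = None

    def handle_data(self, data):
        if self._classes is not None:
            self._buffer.append(data)


def split_references(html):
    """Return (citations, cross_references) as two lists of anchor texts."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    citations = [text for classes, text in parser.anchors
                 if "xref-bibr" in classes]
    # Assumption: any other xref-* class marks a cross-reference.
    cross_refs = [text for classes, text in parser.anchors
                  if any(c.startswith("xref-") and c != "xref-bibr"
                         for c in classes)]
    return citations, cross_refs
```

Keeping the class tokens alongside the text makes it straightforward to replace or drop these anchors later, depending on whether citations should stay in the extracted paragraph text.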