
Nature HTML format

Forrest Bao edited this page May 25, 2018 · 2 revisions

In the HTML page of a Nature paper, the paper begins at an <article> node.

Under the <article> node, the paper, from Abstract through Rights and Permissions, sits under a <div> node like this:

<div data-article-body="true" data-track-component="article body" class="article-body clear">

Under it, there are many <section> nodes, each of which has an aria-labelledby attribute with a value such as abstract, main, results, discussion, methods, references, acknowledgements, author-information, supplimentary-information, or rightslink. There may be other, non-<section> nodes under the body <div> node as well.
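As a rough illustration of the layout described above, the sketch below lists the aria-labelledby values of the <section> nodes found inside the body <div>. It uses only Python's standard html.parser module. The names SectionLister and list_sections and the sample markup are made up for this example and are not part of Nature's pages or of any existing tool.

```python
from html.parser import HTMLParser


class SectionLister(HTMLParser):
    """Collect aria-labelledby values of <section> nodes inside the body <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # <div> nesting depth, counted from the body <div>
        self.labels = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif attrs.get("data-article-body") == "true":
                self.div_depth = 1  # entered the article body
        elif tag == "section" and self.div_depth > 0:
            label = attrs.get("aria-labelledby")
            if label is not None:
                self.labels.append(label)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def list_sections(html):
    parser = SectionLister()
    parser.feed(html)
    parser.close()
    return parser.labels


# Hypothetical, heavily shortened page used only for illustration.
sample = """
<article>
  <div data-article-body="true" data-track-component="article body" class="article-body clear">
    <section aria-labelledby="abstract"><div><p>Short summary.</p></div></section>
    <section aria-labelledby="main"><p>Main text.</p></section>
  </div>
  <section aria-labelledby="outside"><p>Not part of the body.</p></section>
</article>
"""

print(list_sections(sample))  # prints ['abstract', 'main']
```

The parser tracks <div> nesting so that a <section> outside the body <div> is ignored. It assumes the <div> tags are balanced, which may not hold on every real page.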

Regular text paragraphs

Under each <section> node, there can be multiple (grand)(grand)children <p> nodes. In Nature's HTML, italic, superscript, and subscript text is wrapped in <i>, <sup>, and <sub> tags, respectively. References are in <sup> nodes, each of which can contain multiple <a> nodes, like this:

<sup>
   <a title="Newton et al. 2018" href="/articles/nbt.1557#ref5">5</a>,
   <a title="Einstein et al., 2019" href="/articles/nbt.1557#ref6">6</a>
</sup>
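A minimal sketch of how such reference links could be pulled out, assuming the structure shown above (<a> nodes nested inside <sup> nodes). It relies only on the standard html.parser module. CitationExtractor and extract_citations are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class CitationExtractor(HTMLParser):
    """Collect title, href and link text of every <a> found inside a <sup>."""

    def __init__(self):
        super().__init__()
        self.sup_depth = 0    # > 0 while inside a <sup> node
        self.current = None   # the <a> currently being read, if any
        self.citations = []

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.sup_depth += 1
        elif tag == "a" and self.sup_depth > 0:
            attrs = dict(attrs)
            self.current = {
                "title": attrs.get("title"),
                "href": attrs.get("href"),
                "text": "",
            }

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.current["text"] = self.current["text"].strip()
            self.citations.append(self.current)
            self.current = None
        elif tag == "sup" and self.sup_depth > 0:
            self.sup_depth -= 1


def extract_citations(html):
    parser = CitationExtractor()
    parser.feed(html)
    parser.close()
    return parser.citations


sample = """
<sup>
   <a title="Newton et al. 2018" href="/articles/nbt.1557#ref5">5</a>,
   <a title="Einstein et al., 2019" href="/articles/nbt.1557#ref6">6</a>
</sup>
"""

for citation in extract_citations(sample):
    print(citation["text"], citation["title"], citation["href"])
```

Note that ordinary superscripts (for example exponents) also live in <sup> nodes. Requiring an <a> child, as done here, is one way to tell the two apart.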

Figures

Figures are under <figure> nodes

  • The header of the figure is in the only <figcaption> node, e.g.,
   <figcaption>
       <b id="f1" class="block tiny-space-below">What a great figure! </b>
   </figcaption>
  • The figure itself and the detailed caption are under a <div class="small-space-below">:

    • The figure is wrapped in an <a> node that links to the full-size image.
    • The detailed caption is in the only (?) <p> node under a child <div> node.

    So the figure and detailed caption can look like this:

<div class="small-space-below">
  <a href="fullsize.jpeg"><img src="thumbnail.jpeg"></a>
  <div> <p>The result shows .... </p>
  </div>
</div>
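The figure layout described above could be read with something like the following sketch. It assumes each <figure> holds one <figcaption> (the header) and that any <p> inside the figure belongs to the detailed caption. FigureExtractor, extract_figures, and the sample markup are illustrative only.

```python
from html.parser import HTMLParser


class FigureExtractor(HTMLParser):
    """Collect header, detailed caption and image links of each <figure>."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self.fig = None      # the figure currently being read, if any
        self.target = None   # "header" or "caption" while collecting text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "figure":
            self.fig = {"header": "", "caption": "",
                        "full_size": None, "thumbnail": None}
        elif self.fig is None:
            return  # ignore everything outside <figure> nodes
        elif tag == "figcaption":
            self.target = "header"
        elif tag == "p":
            self.target = "caption"
        elif tag == "a" and self.fig["full_size"] is None:
            self.fig["full_size"] = attrs.get("href")
        elif tag == "img" and self.fig["thumbnail"] is None:
            self.fig["thumbnail"] = attrs.get("src")

    def handle_data(self, data):
        if self.fig is not None and self.target is not None:
            self.fig[self.target] += data

    def handle_endtag(self, tag):
        if self.fig is None:
            return
        if tag in ("figcaption", "p"):
            self.target = None
        elif tag == "figure":
            # Collapse runs of whitespace left over from the markup.
            for key in ("header", "caption"):
                self.fig[key] = " ".join(self.fig[key].split())
            self.figures.append(self.fig)
            self.fig = None


def extract_figures(html):
    parser = FigureExtractor()
    parser.feed(html)
    parser.close()
    return parser.figures


sample = """
<figure>
  <figcaption>
    <b id="f1" class="block tiny-space-below">What a great figure! </b>
  </figcaption>
  <div class="small-space-below">
    <a href="fullsize.jpeg"><img src="thumbnail.jpeg"></a>
    <div> <p>The result shows .... </p>
    </div>
  </div>
</figure>
"""

print(extract_figures(sample))
```

If a figure turns out to contain more than one <p> node, this sketch simply concatenates their text into the caption.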

Tables

At this point, we do not have to deal with tables.

Clean up heuristics

To clean up the data, we suggest dropping all <a> nodes.
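One possible reading of this heuristic is to discard every <a> node together with its contents and keep the remaining text. The sketch below does that with the standard html.parser module. LinkDroppingTextExtractor and text_without_links are hypothetical names, and the sample sentence is invented.

```python
from html.parser import HTMLParser


class LinkDroppingTextExtractor(HTMLParser):
    """Collect text content while skipping everything inside <a> nodes."""

    def __init__(self):
        super().__init__()
        self.a_depth = 0   # > 0 while inside an <a> node
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth > 0:
            self.a_depth -= 1

    def handle_data(self, data):
        if self.a_depth == 0:
            self.chunks.append(data)


def text_without_links(html):
    parser = LinkDroppingTextExtractor()
    parser.feed(html)
    parser.close()
    # Join the kept text and collapse runs of whitespace.
    return " ".join("".join(parser.chunks).split())


sample = (
    '<p>Gravity bends light'
    '<sup><a title="Newton et al. 2018" href="/articles/nbt.1557#ref5">5</a></sup>'
    ' as predicted.</p>'
)

print(text_without_links(sample))  # prints: Gravity bends light as predicted.
```

When a <sup> holds several references, the commas between the <a> nodes are not part of any <a> and therefore survive this step. Dropping <sup> nodes that contain links may be a better fit in that case.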
