Nature HTML format
In the HTML page of a paper from Nature, the paper starts at an <article> node.
Under the <article> node, everything from Abstract to Rights and Permissions sits under a <div> node like this:
<div data-article-body="true" data-track-component="article body" class="article-body clear">
Under it, there are many <section> nodes, each of which has an aria-labelledby attribute with a value such as abstract, main, results, discussion, methods, references, acknowledgements, author-information, supplimentary-information, or rightslink. There may be other, non-<section> nodes under the body <div> node as well.
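As a rough illustration of the layout described above, the sketch below lists the aria-labelledby values of the <section> nodes found inside the body <div>. It uses only Python's standard html.parser module. The class name SectionLister and the sample markup are made up for this example, and the sketch assumes the body <div> can be recognised by data-article-body="true".

```python
from html.parser import HTMLParser


class SectionLister(HTMLParser):
    """Collect aria-labelledby values of <section> nodes inside the body <div>.

    Hypothetical helper: assumes the body <div> carries data-article-body="true".
    """

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # <div> nesting depth inside the body <div>; 0 means outside
        self.labels = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif attrs.get("data-article-body") == "true":
                self.div_depth = 1
        elif tag == "section" and self.div_depth > 0:
            label = attrs.get("aria-labelledby")
            if label is not None:
                self.labels.append(label)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


# Made-up sample page; real Nature pages contain far more markup.
sample = """
<article>
  <div data-article-body="true" data-track-component="article body" class="article-body clear">
    <section aria-labelledby="abstract"><div><p>Short summary.</p></div></section>
    <section aria-labelledby="main"><p>Body text.</p></section>
    <section aria-labelledby="references"><p>Reference list.</p></section>
  </div>
  <section aria-labelledby="outside"><p>Not part of the body.</p></section>
</article>
"""

lister = SectionLister()
lister.feed(sample)
lister.close()
section_labels = lister.labels
print(section_labels)
```

Tracking the <div> depth keeps <section> nodes outside the body <div> out of the result, so in the sample the "outside" section is skipped.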
Under each <section> node, there can be multiple <p> nodes at any depth (children, grandchildren, and so on). In Nature's HTML, italic, superscript, and subscript text is in <i>, <sup>, and <sub> tags. References are in <sup> nodes, each of which can contain multiple <a> nodes, like this:
<sup>
<a title="Newton et al. 2018" href="/articles/nbt.1557#ref5">5</a>,
<a title="Einstein et al., 2019" href="/articles/nbt.1557#ref6">6</a>
</sup>
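One possible way to pull the references out of such <sup> nodes is sketched below with Python's standard html.parser module. The class name CitationParser is hypothetical. The sketch assumes that every <a> inside a <sup> is a reference link, so a plain superscript with no <a> inside it is ignored.

```python
from html.parser import HTMLParser


class CitationParser(HTMLParser):
    """Collect the text, title and href of each <a> found inside a <sup> node."""

    def __init__(self):
        super().__init__()
        self.sup_depth = 0    # > 0 while inside a <sup>
        self.current = None   # the <a> currently being read, or None
        self.citations = []

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.sup_depth += 1
        elif tag == "a" and self.sup_depth > 0:
            attrs = dict(attrs)
            self.current = {
                "text": "",
                "title": attrs.get("title"),
                "href": attrs.get("href"),
            }

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "sup" and self.sup_depth > 0:
            self.sup_depth -= 1
        elif tag == "a" and self.current is not None:
            self.current["text"] = self.current["text"].strip()
            self.citations.append(self.current)
            self.current = None


# The reference markup from the example above, plus a plain superscript (m<sup>2</sup>).
sample = """
<p>An area of 3 m<sup>2</sup> was reported earlier<sup>
<a title="Newton et al. 2018" href="/articles/nbt.1557#ref5">5</a>,
<a title="Einstein et al., 2019" href="/articles/nbt.1557#ref6">6</a>
</sup>.</p>
"""

parser = CitationParser()
parser.feed(sample)
parser.close()
citations = parser.citations
for c in citations:
    print(c["text"], c["title"], c["href"])
```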
Figures are under <figure> nodes:
- The header of the figure is in the only <figcaption> node, e.g.,
<figcaption>
<b id="f1" class="block tiny-space-below">What a great figure! </b>
</figcaption>
- The figure itself and the detailed caption are under a <div class="small-space-below"> node:
  - The figure is encapsulated in an <a> node that links to the full-size image.
  - The detailed caption is in the only (?) <p> node under a sub-<div> node.
So the figure and detailed caption can look like this:
<div class="small-space-below">
<a href="fullsize.jpeg"><img src="thumbnail.jpeg"></a>
<div> <p>The result shows .... </p>
</div>
</div>
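The figure layout described above could be read with something like the following sketch, again using only Python's standard html.parser module. The class name FigureParser and the dictionary keys are invented for illustration. It assumes that the first <a> inside a <figure> is the link to the full-size image and that every <p> inside the <figure> belongs to the detailed caption, which may not hold for every paper.

```python
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Collect header, detailed caption and image links for each <figure>."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self.fig = None      # the figure currently being read, or None
        self.target = None   # "header" or "caption": where text is appended

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "figure":
            self.fig = {"header": "", "caption": "",
                        "full_size": None, "thumbnail": None}
        elif self.fig is None:
            return
        elif tag == "figcaption":
            self.target = "header"
        elif tag == "p":
            self.target = "caption"
        elif tag == "a" and self.fig["full_size"] is None:
            # Assumption: the first <a> in the figure wraps the image.
            self.fig["full_size"] = attrs.get("href")
        elif tag == "img" and self.fig["thumbnail"] is None:
            self.fig["thumbnail"] = attrs.get("src")

    def handle_data(self, data):
        if self.fig is not None and self.target is not None:
            self.fig[self.target] += data

    def handle_endtag(self, tag):
        if self.fig is None:
            return
        if tag in ("figcaption", "p"):
            self.target = None
        elif tag == "figure":
            # Collapse runs of whitespace left over from the markup layout.
            for key in ("header", "caption"):
                self.fig[key] = " ".join(self.fig[key].split())
            self.figures.append(self.fig)
            self.fig = None


# The two example fragments from this page, combined into one <figure>.
sample = """
<figure>
  <figcaption>
    <b id="f1" class="block tiny-space-below">What a great figure! </b>
  </figcaption>
  <div class="small-space-below">
    <a href="fullsize.jpeg"><img src="thumbnail.jpeg"></a>
    <div> <p>The result shows .... </p>
    </div>
  </div>
</figure>
"""

parser = FigureParser()
parser.feed(sample)
parser.close()
figures = parser.figures
print(figures)
```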
At this point, we do not have to deal with tables.
To clean up the data, we suggest dropping all <a> nodes.
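A minimal sketch of this cleanup, assuming that "dropping" means removing each <a> node together with everything inside it, is shown below. It uses Python's standard html.parser module, and the class name AnchorStripper is hypothetical. The sketch re-emits all other tags and text unchanged. HTML comments and doctype declarations are not re-emitted.

```python
from html.parser import HTMLParser


class AnchorStripper(HTMLParser):
    """Re-emit the HTML it is fed, minus every <a> node and its contents."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.a_depth = 0  # > 0 while inside an <a>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1
        elif self.a_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unless inside <a>.
        if tag != "a" and self.a_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a":
            if self.a_depth > 0:
                self.a_depth -= 1
        elif self.a_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.a_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.a_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.a_depth == 0:
            self.out.append(f"&#{name};")


sample = (
    '<p>Cells divide<sup>'
    '<a title="Newton et al. 2018" href="/articles/nbt.1557#ref5">5</a>,'
    '<a title="Einstein et al., 2019" href="/articles/nbt.1557#ref6">6</a>'
    '</sup> rapidly.</p>'
)

stripper = AnchorStripper()
stripper.feed(sample)
stripper.close()
cleaned = "".join(stripper.out)
print(cleaned)
```

In this sample the comma between the two references survives inside the <sup> node, so a later step may still need to remove leftover separators or empty <sup> nodes. Because the figure image is also wrapped in an <a> node, this cleanup removes figure images as well.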