add versioning to readthedocs, small warning remaining
piconti committed May 30, 2024
1 parent 3f61a45 commit 2054f32
Showing 25 changed files with 2,406 additions and 49 deletions.
Binary file modified docs/_build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/_build/doctrees/index.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/io.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/rebuild.doctree
Binary file not shown.
Binary file modified docs/_build/doctrees/utils.doctree
Binary file not shown.
Binary file added docs/_build/doctrees/versioning.doctree
Binary file not shown.
1 change: 1 addition & 0 deletions docs/_build/html/_sources/index.rst.txt
@@ -17,4 +17,5 @@ Python module with bits of code (objects, functions) highly reusable within impr
rebuild
utils
images
versioning

43 changes: 43 additions & 0 deletions docs/_build/html/_sources/versioning.rst.txt
@@ -0,0 +1,43 @@
Data Versioning
================================

The ``versioning`` package of ``impresso_commons`` contains several modules and scripts, with classes and functions that make it possible to version Impresso's data at various stages of the processing pipeline.

The main goal of this approach is to version the data and track information at every stage in order to:

1. **Ensure data consistency and ease of debugging:** Data elements should be consistent across stages, and inconsistencies/differences should be justifiable through the identification of data leakage points.
2. **Allow partial updates:** It should be possible to (re)run all or part of the processes on subsets of the data, knowing which version of the data was used at each step. This can be necessary when new media collections arrive, or when an existing collection has been patched.
3. **Ensure transparency:** Citation of the various data stages and datasets should be straightforward; users of the interface should know exactly which versions they are using, and should be able to consult the precise statistics related to them.
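
As an illustration of these goals, the sketch below shows how a per-stage version record could be compared across two runs. It is only a sketch under stated assumptions: ``StageManifest``, ``diff_counts``, the stage name, the version strings and the ``title_a``/``title_b``/``title_c`` keys are all hypothetical and are not part of the ``impresso_commons.versioning`` API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StageManifest:
    """Hypothetical record of one processing stage (NOT the impresso_commons API)."""

    stage: str
    version: str
    # Version of the upstream data this stage consumed (assumed field).
    input_version: str | None = None
    # Per-title counts, standing in for the statistics tracked at this stage.
    counts: dict[str, int] = field(default_factory=dict)


def diff_counts(previous: StageManifest, current: StageManifest) -> dict[str, int]:
    """Return the non-zero per-title change in counts between two manifests."""
    titles = set(previous.counts) | set(current.counts)
    changes = {t: current.counts.get(t, 0) - previous.counts.get(t, 0) for t in titles}
    return {t: delta for t, delta in sorted(changes.items()) if delta != 0}


# Placeholder data: one title is patched and one new title arrives.
old = StageManifest("rebuilt", "v1.0.0", input_version="v2.3.0",
                    counts={"title_a": 100, "title_b": 80})
new = StageManifest("rebuilt", "v1.0.1", input_version="v2.3.1",
                    counts={"title_a": 100, "title_b": 85, "title_c": 40})

print(diff_counts(old, new))  # {'title_b': 5, 'title_c': 40}
```

Recording the upstream version next to the statistics is what would let a partial rerun be traced back to the exact data it consumed, and the per-title difference gives a starting point for explaining any inconsistency between two versions.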


Data Statistics and NewspaperStatistics
------------------------------------------

.. automodule:: impresso_commons.versioning.data_statistics
   :members:
   :undoc-members:
   :show-inheritance:

Data Manifest
--------------------------------------------

.. automodule:: impresso_commons.versioning.data_manifest
   :members:
   :undoc-members:
   :show-inheritance:

Versioning Helpers
--------------------------------------------

.. automodule:: impresso_commons.versioning.helpers
   :members:
   :undoc-members:
   :show-inheritance:

Manifest Computing Script
--------------------------------------------

.. automodule:: impresso_commons.versioning.compute_manifest
   :members:
   :undoc-members:
   :show-inheritance:

328 changes: 314 additions & 14 deletions docs/_build/html/genindex.html

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions docs/_build/html/images.html
@@ -19,6 +19,7 @@
<script src="_static/js/theme.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Data Versioning" href="versioning.html" />
<link rel="prev" title="Utilities" href="utils.html" />
</head>

@@ -77,6 +78,7 @@
</li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="versioning.html">Data Versioning</a></li>
</ul>

</div>
@@ -336,6 +338,7 @@ <h4>Case 4: one jpg only<a class="headerlink" href="#case-4-one-jpg-only" title=
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="utils.html" class="btn btn-neutral float-left" title="Utilities" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="versioning.html" class="btn btn-neutral float-right" title="Data Versioning" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>

<hr/>
8 changes: 8 additions & 0 deletions docs/_build/html/index.html
@@ -47,6 +47,7 @@
<li class="toctree-l1"><a class="reference internal" href="rebuild.html">Text Rebuild</a></li>
<li class="toctree-l1"><a class="reference internal" href="utils.html">Utilities</a></li>
<li class="toctree-l1"><a class="reference internal" href="images.html">Image handling</a></li>
<li class="toctree-l1"><a class="reference internal" href="versioning.html">Data Versioning</a></li>
</ul>

</div>
@@ -104,6 +105,13 @@ <h1>Welcome to Impresso PyCommons’s documentation!<a class="headerlink" href="
<li class="toctree-l2"><a class="reference internal" href="images.html#module-impresso_commons.images.olive_boxes">Olive Boxes</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="versioning.html">Data Versioning</a><ul>
<li class="toctree-l2"><a class="reference internal" href="versioning.html#module-impresso_commons.versioning.data_statistics">Data Statistics and NewspaperStatistics</a></li>
<li class="toctree-l2"><a class="reference internal" href="versioning.html#module-impresso_commons.versioning.data_manifest">Data Manifest</a></li>
<li class="toctree-l2"><a class="reference internal" href="versioning.html#module-impresso_commons.versioning.helpers">Versioning Helpers</a></li>
<li class="toctree-l2"><a class="reference internal" href="versioning.html#module-impresso_commons.versioning.compute_manifest">Manifest Computing Script</a></li>
</ul>
</li>
</ul>
</div>
</section>
3 changes: 2 additions & 1 deletion docs/_build/html/io.html
@@ -86,6 +86,7 @@
<li class="toctree-l1"><a class="reference internal" href="rebuild.html">Text Rebuild</a></li>
<li class="toctree-l1"><a class="reference internal" href="utils.html">Utilities</a></li>
<li class="toctree-l1"><a class="reference internal" href="images.html">Image handling</a></li>
<li class="toctree-l1"><a class="reference internal" href="versioning.html">Data Versioning</a></li>
</ul>

</div>
@@ -391,7 +392,7 @@ <h1>Input/Output<a class="headerlink" href="#input-output" title="Link to this h

<dl class="py function">
<dt class="sig sig-object py" id="impresso_commons.path.path_s3.list_files">
<span class="sig-prename descclassname"><span class="pre">impresso_commons.path.path_s3.</span></span><span class="sig-name descname"><span class="pre">list_files</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">bucket_name</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">file_type</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'issues'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">newspapers_filter</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">list</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">]</span></span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">tuple</span><span class="p"><span class="pre">[</span></span><span class="pre">list</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">]</span></span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span><span class="p"><span class="pre">,</span></span><span class="w"> 
</span><span class="pre">list</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">]</span></span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span><span class="p"><span class="pre">]</span></span></span></span><a class="headerlink" href="#impresso_commons.path.path_s3.list_files" title="Link to this definition"></a></dt>
<span class="sig-prename descclassname"><span class="pre">impresso_commons.path.path_s3.</span></span><span class="sig-name descname"><span class="pre">list_files</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">bucket_name</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">file_type</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'issues'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">newspapers_filter</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">list</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">]</span></span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">tuple</span><span class="p"><span class="pre">[</span></span><span class="pre">list</span><span class="p"><span class="pre">[</span></span><span class="pre">str</span><span class="p"><span class="pre">]</span></span><span class="w"> </span><span class="p"><span class="pre">|</span></span><span class="w"> </span><span class="pre">None</span><span class="p"><span class="pre">]</span></span></span></span><a 
class="headerlink" href="#impresso_commons.path.path_s3.list_files" title="Link to this definition"></a></dt>
<dd><p>List the canonical files located in a given S3 bucket.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
Binary file modified docs/_build/html/objects.inv
Binary file not shown.
21 changes: 21 additions & 0 deletions docs/_build/html/py-modindex.html
@@ -48,6 +48,7 @@
<li class="toctree-l1"><a class="reference internal" href="rebuild.html">Text Rebuild</a></li>
<li class="toctree-l1"><a class="reference internal" href="utils.html">Utilities</a></li>
<li class="toctree-l1"><a class="reference internal" href="images.html">Image handling</a></li>
<li class="toctree-l1"><a class="reference internal" href="versioning.html">Data Versioning</a></li>
</ul>

</div>
@@ -150,6 +151,26 @@ <h1>Python Module Index</h1>
<td>&#160;&#160;&#160;
<a href="utils.html#module-impresso_commons.utils.utils"><code class="xref">impresso_commons.utils.utils</code></a></td><td>
<em></em></td></tr>
<tr class="cg-1">
<td></td>
<td>&#160;&#160;&#160;
<a href="versioning.html#module-impresso_commons.versioning.compute_manifest"><code class="xref">impresso_commons.versioning.compute_manifest</code></a></td><td>
<em></em></td></tr>
<tr class="cg-1">
<td></td>
<td>&#160;&#160;&#160;
<a href="versioning.html#module-impresso_commons.versioning.data_manifest"><code class="xref">impresso_commons.versioning.data_manifest</code></a></td><td>
<em></em></td></tr>
<tr class="cg-1">
<td></td>
<td>&#160;&#160;&#160;
<a href="versioning.html#module-impresso_commons.versioning.data_statistics"><code class="xref">impresso_commons.versioning.data_statistics</code></a></td><td>
<em></em></td></tr>
<tr class="cg-1">
<td></td>
<td>&#160;&#160;&#160;
<a href="versioning.html#module-impresso_commons.versioning.helpers"><code class="xref">impresso_commons.versioning.helpers</code></a></td><td>
<em></em></td></tr>
</table>


23 changes: 18 additions & 5 deletions docs/_build/html/rebuild.html
@@ -76,6 +76,7 @@
</li>
<li class="toctree-l1"><a class="reference internal" href="utils.html">Utilities</a></li>
<li class="toctree-l1"><a class="reference internal" href="images.html">Image handling</a></li>
<li class="toctree-l1"><a class="reference internal" href="versioning.html">Data Versioning</a></li>
</ul>

</div>
@@ -112,7 +113,7 @@ <h2>Rebuild functions<a class="headerlink" href="#rebuild-functions" title="Link
For EPFL members, this script can be scaled by running it using Runai,
as documented on <a class="reference external" href="https://github.com/impresso/impresso-infrastructure/blob/main/howtos/runai.md">https://github.com/impresso/impresso-infrastructure/blob/main/howtos/runai.md</a>.</p>
<dl class="simple">
<dt>Usage:</dt><dd><p>rebuilder.py rebuild_articles –input-bucket=&lt;b&gt; –log-file=&lt;f&gt; –output-dir=&lt;od&gt; –filter-config=&lt;fc&gt; [–format=&lt;fo&gt; –scheduler=&lt;sch&gt; –output-bucket=&lt;ob&gt; –verbose –clear –languages=&lt;lgs&gt; –nworkers=&lt;nw&gt;]</p>
<dt>Usage:</dt><dd><p>rebuilder.py rebuild_articles –input-bucket=&lt;b&gt; –log-file=&lt;f&gt; –output-dir=&lt;od&gt; –filter-config=&lt;fc&gt; [–format=&lt;fo&gt; –scheduler=&lt;sch&gt; –output-bucket=&lt;ob&gt; –verbose –clear –languages=&lt;lgs&gt; –nworkers=&lt;nw&gt; –git-repo=&lt;gr&gt; –temp-dir=&lt;tp&gt; –prev-manifest=&lt;pm&gt;]</p>
</dd>
</dl>
<p>Options:</p>
@@ -139,10 +140,22 @@ <h2>Rebuild functions<a class="headerlink" href="#rebuild-functions" title="Link
<dd><p>Remove output directory before and after rebuilding</p>
</dd>
<dt><kbd><span class="option">--format=<var>&lt;fo&gt;</var></span></kbd></dt>
<dd><p>stuff</p>
<dd><p>Rebuilt format to use (can be “solr” or “passim”)</p>
</dd>
<dt><kbd><span class="option">--languages=<var>&lt;lgs&gt;</var></span></kbd></dt>
<dd><p>Languages to filter the articles to rebuild on.</p>
</dd>
<dt><kbd><span class="option">--nworkers=<var>&lt;nw&gt;</var></span></kbd></dt>
<dd><p>number of workers for (local) dask client</p>
<dd><p>number of workers for (local) Dask client.</p>
</dd>
<dt><kbd><span class="option">--git-repo=<var>&lt;gr&gt;</var></span></kbd></dt>
<dd><p>Local path to the “impresso-text-acquisition” git directory (including it).</p>
</dd>
<dt><kbd><span class="option">--temp-dir=<var>&lt;tp&gt;</var></span></kbd></dt>
<dd><p>Temporary directory in which to clone the impresso-data-release git repository.</p>
</dd>
<dt><kbd><span class="option">--prev-manifest=<var>&lt;pm&gt;</var></span></kbd></dt>
<dd><p>Optional S3 path to the previous manifest to use for the manifest generation.</p>
</dd>
</dl>
<dl class="py function">
@@ -208,7 +221,7 @@ <h2>Rebuild functions<a class="headerlink" href="#rebuild-functions" title="Link

<dl class="py function">
<dt class="sig sig-object py" id="impresso_commons.text.rebuilder.main">
<span class="sig-prename descclassname"><span class="pre">impresso_commons.text.rebuilder.</span></span><span class="sig-name descname"><span class="pre">main</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="headerlink" href="#impresso_commons.text.rebuilder.main" title="Link to this definition"></a></dt>
<span class="sig-prename descclassname"><span class="pre">impresso_commons.text.rebuilder.</span></span><span class="sig-name descname"><span class="pre">main</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">None</span></span></span><a class="headerlink" href="#impresso_commons.text.rebuilder.main" title="Link to this definition"></a></dt>
<dd></dd></dl>

<dl class="py function">
@@ -252,7 +265,7 @@ <h2>Rebuild functions<a class="headerlink" href="#rebuild-functions" title="Link

<dl class="py function">
<dt class="sig sig-object py" id="impresso_commons.text.rebuilder.rebuild_issues">
<span class="sig-prename descclassname"><span class="pre">impresso_commons.text.rebuilder.</span></span><span class="sig-name descname"><span class="pre">rebuild_issues</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">issues</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">input_bucket</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_dir</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">dask_client</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">format</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'solr'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">filter_language</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#impresso_commons.text.rebuilder.rebuild_issues" title="Link to this definition"></a></dt>
<span class="sig-prename descclassname"><span class="pre">impresso_commons.text.rebuilder.</span></span><span class="sig-name descname"><span class="pre">rebuild_issues</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">issues</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">input_bucket</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_dir</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">dask_client</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">_format</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">'solr'</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">filter_language</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">None</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#impresso_commons.text.rebuilder.rebuild_issues" title="Link to this definition"></a></dt>
<dd><p>Rebuild a set of newspaper issues into a given format.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
1 change: 1 addition & 0 deletions docs/_build/html/search.html
@@ -48,6 +48,7 @@
<li class="toctree-l1"><a class="reference internal" href="rebuild.html">Text Rebuild</a></li>
<li class="toctree-l1"><a class="reference internal" href="utils.html">Utilities</a></li>
<li class="toctree-l1"><a class="reference internal" href="images.html">Image handling</a></li>
<li class="toctree-l1"><a class="reference internal" href="versioning.html">Data Versioning</a></li>
</ul>

</div>
2 changes: 1 addition & 1 deletion docs/_build/html/searchindex.js

Large diffs are not rendered by default.
