Update documentation

socius-org · Apr 15, 2024 · 581bcf0
1 parent 9015892
commit 581bcf0
Showing 33 changed files with 1,944 additions and 228 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: c980c25411b2f88e2de822371e07e90a
+config: 87d691a00f9808f83b1e8ac9849980e7
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/_sources/Introduction.md → _sources/about.md b/_sources/Introduction.md → _sources/about.md
@@ -27,7 +27,7 @@ Here's how RedditHarbor empowers your research:
 * **📈 Scalable and Efficient**: Handle pagination seamlessly, even for large datasets with millions of rows. 
 * **🕹️ Customisable Collection**: Tailor your data collection to your specific needs by configuring parameters.
 * **📂 Analysis-Ready**: Export your database to CSV, JSON, or JPEG formats for effortless integration with your preferred analysis tools.
-* **🔄 Temporal Metric Tracking:**: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static "snapshot" databases, such as PushShift or AcademicTorrent. 
+* **🔄 Temporal Metric Tracking**: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static "snapshot" databases, such as PushShift or AcademicTorrent. 
 * **⚡ Smart Update Intervals**: Leverage flexible configurations to automatically adjust update intervals based on dataset size, optimising efficiency while adhering to API constraints. 
 
 With RedditHarbor, you can spend less time wrestling with technical hurdles and more time focusing on your research objectives. 
diff --git a/_sources/data_schema/comment.md b/_sources/data_schema/comment.md
@@ -0,0 +1,23 @@
+# Comment
+
+The `Comment` collection stores information about comments made on Reddit submissions. Each document in this collection has the following schema:
+
+```python
+{
+    "comment_id": str,  # Unique identifier for the comment
+    "link_id": str,  # ID of the submission the comment is associated with
+    "subreddit": str,  # Name of the subreddit the comment is posted in
+    "parent_id": str,  # ID of the parent comment or submission
+    "redditor_id": str,  # ID of the user who posted the comment
+    "created_at": str,  # Datetime when the comment was created (ISO format)
+    "body": str or None,  # Text content of the comment (None if removed)
+    "score": {str: int},  # Dictionary mapping datetimes (ISO format) to the comment's score
+    "edited": bool,  # Whether the comment has been edited
+    "removed": str or None  # "deleted" or "removed" if the comment was removed, otherwise None
+}
+```
+
+The `parent_id` field can have two different formats:
+
+1. If it starts with `"t3_{link_id}"`, it means the comment is a top-level comment, and the parent is a submission.
+2. If it starts with `"t1_{comment_id}"`, it means the comment is a reply to another comment, and the parent is the comment with the specified `comment_id`.
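
Note on the `parent_id` convention documented in `comment.md` above: a minimal sketch of how a consumer of the `Comment` collection might tell top-level comments from replies. The helper name `classify_parent` is hypothetical and is not part of RedditHarbor; it relies only on the `t3_` / `t1_` prefixes described in the added text.

```python
def classify_parent(parent_id: str) -> tuple[str, str]:
    """Split a parent_id into (parent kind, bare id) using its type prefix.

    Hypothetical helper for illustration; assumes the "t3_" / "t1_" prefixes
    described in the Comment schema documentation.
    """
    prefix, sep, bare_id = parent_id.partition("_")
    if not sep or not bare_id:
        raise ValueError(f"parent_id has no type prefix: {parent_id!r}")
    if prefix == "t3":
        # Parent is a submission, so this is a top-level comment.
        return ("submission", bare_id)
    if prefix == "t1":
        # Parent is another comment, so this is a reply.
        return ("comment", bare_id)
    raise ValueError(f"unexpected parent_id prefix: {prefix!r}")


print(classify_parent("t3_abc123"))  # top-level comment
print(classify_parent("t1_def456"))  # reply to another comment
```

Raising on an unknown prefix, rather than guessing, keeps malformed rows visible during analysis.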
diff --git a/_sources/data_schema/submission.md b/_sources/data_schema/submission.md
@@ -0,0 +1,37 @@
+# Submission
+
+The `Submission` collection stores information about Reddit submissions. Each document in this collection has the following schema:
+
+```python
+{
+    "submission_id": str,  # Unique identifier for the submission
+    "redditor_id": str,  # ID of the user who posted the submission
+    "created_at": str,  # Datetime when the submission was created (ISO format)
+    "title": str,  # Title of the submission
+    "text": str,  # Text content of the submission
+    "subreddit": str,  # Name of the subreddit the submission is posted in
+    "permalink": str,  # URL of the submission
+    "attachment": {str: str} or None,  # Dictionary containing URLs of attached media (e.g., {"jpg": "https://example.com/image.jpg"})
+    "flair": {
+        "link": str,  # Link flair text
+        "author": str  # Author flair text
+    },
+    "awards": {
+        "list": dict,  # Dictionary mapping award names to [count, coin_price]
+        "total_awards_count": int,  # Total number of awards received
+        "total_awards_price": int  # Total coin price of all awards received
+    },
+    "score": {str: int},  # Dictionary mapping datetimes (ISO format) to the submission's score
+    "upvote_ratio": {str: float},  # Dictionary mapping datetimes (ISO format) to the upvote ratio
+    "num_comments": {str: int},  # Dictionary mapping datetimes (ISO format) to the number of comments
+    "edited": bool,  # Whether the submission has been edited
+    "archived": bool,  # Whether the submission is archived
+    "removed": bool,  # Whether the submission has been removed
+    "poll": {
+        "total_vote_count": int,  # Total number of votes in the poll
+        "vote_ends_at": str,  # Datetime when the poll ends (ISO format)
+        "options": {str: int},  # Dictionary mapping poll options to the number of votes
+        "closed": bool  # Whether the poll is closed
+    } or None  # None if the submission does not have a poll
+}
+```
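
Note on the time-keyed fields in `submission.md` above (`score`, `upvote_ratio`, `num_comments`): a rough sketch of how such a dictionary could be turned into an ordered time series. The function names are hypothetical, and the sketch assumes the stored keys are ISO strings that `datetime.fromisoformat` can parse and that all keys are consistently timezone-aware or consistently naive.

```python
from datetime import datetime


def to_series(metric: dict[str, int]) -> list[tuple[datetime, int]]:
    """Return (timestamp, value) pairs sorted chronologically.

    Hypothetical helper; assumes every key parses with datetime.fromisoformat.
    """
    return sorted((datetime.fromisoformat(ts), value) for ts, value in metric.items())


def net_change(metric: dict[str, int]) -> int:
    """Difference between the latest and earliest recorded values."""
    series = to_series(metric)
    if len(series) < 2:
        return 0  # Zero or one observation: no change can be measured.
    return series[-1][1] - series[0][1]


# Made-up example data, in the {datetime: score} shape from the schema.
example_score = {
    "2024-04-15T12:00:00+00:00": 42,
    "2024-04-14T12:00:00+00:00": 12,
    "2024-04-16T12:00:00+00:00": 40,
}
print(net_change(example_score))
```

Sorting on the parsed datetimes rather than on the raw strings avoids depending on the keys sharing one exact string format.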
diff --git a/_sources/data_schema/user.md b/_sources/data_schema/user.md
@@ -0,0 +1,29 @@
+# User 
+
+The `User` collection stores information about Reddit users. Each document in this collection has the following schema:
+
+```python
+{
+    "redditor_id": str,  # Unique identifier for the user
+    "name": str,  # User's Reddit username
+    "created_at": str,  # Datetime when the user account was created (ISO format)
+    "karma": {
+        "link": int,  # Link karma
+        "total": int,  # Total karma
+        "awardee": int,  # Karma received from awards
+        "awarder": int,  # Karma awarded to others
+        "comment": int  # Comment karma
+    },
+    "is_gold": bool,  # Whether the user has Reddit Gold
+    "is_mod": {
+        str: [str, int]  # Dictionary mapping subreddit IDs to [subreddit name, number of subscribers]
+    } or None,  # None if the user is not a moderator
+    "trophy": {
+        "list": list,  # List of trophy names
+        "count": int  # Number of trophies
+    } or None,  # None if the user has no trophies
+    "removed": str  # "active" or "suspended"
+}
+```
+
+Note: For suspended users, the `redditor_id` is represented as `"suspended:{name}"`.
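
Note on the suspended-user convention in `user.md` above: a small sketch of how the `"suspended:{name}"` form of `redditor_id` might be detected when joining tables. `suspended_name` is a hypothetical helper, not a RedditHarbor function, and it assumes the prefix is exactly `suspended:` as written in the note.

```python
from typing import Optional

SUSPENDED_PREFIX = "suspended:"  # Taken from the note in the User schema docs.


def suspended_name(redditor_id: str) -> Optional[str]:
    """Return the username if the ID marks a suspended user, otherwise None."""
    if redditor_id.startswith(SUSPENDED_PREFIX):
        return redditor_id[len(SUSPENDED_PREFIX):]
    return None


print(suspended_name("suspended:example_user"))  # username of a suspended account
print(suspended_name("abc123"))                  # regular ID, so None
```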
diff --git a/_sources/pages/install.md → _sources/getting_started/installation.md b/_sources/pages/install.md → _sources/getting_started/installation.md
diff --git a/_sources/pages/prerequisites.md → _sources/getting_started/prerequisites.md b/_sources/pages/prerequisites.md → _sources/getting_started/prerequisites.md
diff --git a/_sources/pages/GettingStarted.md → _sources/getting_started/setting.md b/_sources/pages/GettingStarted.md → _sources/getting_started/setting.md
@@ -66,7 +66,7 @@ CREATE TABLE test_comment(
 ALTER TABLE test_comment ENABLE ROW LEVEL SECURITY;
 ```
 
-This will create the three tables with the necessary columns and data types. Once created, you'll see the new tables available in the "Table Editor". In the future, you can duplicate and rename these tables (instead of "test_...") for your production needs.
+This will create the three tables with the necessary columns and data types. Once created, you'll see the new tables available in the "Table Editor". In the future, you can duplicate and rename these tables (instead of "test_...") for your production needs. For a structured overview of the database schema used by RedditHarbor, including detailed explanations of each field and its data type, see [Database Schema](../data_schema/user.md). 
 
 ```{warning} 
 The RedditHarbor package depends on predefined column names for all user, submission, and comment tables. To ensure proper functionality, it's crucial to create tables with all the specified columns mentioned in the documentation. Failure to do so may lead to errors or incomplete data retrieval.

diff --git a/_sources/pages/database.md → _sources/hands_on/database_driven.md b/_sources/pages/database.md → _sources/hands_on/database_driven.md
diff --git a/_sources/pages/download.md → _sources/hands_on/download_data.md b/_sources/pages/download.md → _sources/hands_on/download_data.md
diff --git a/_sources/pages/keyword.md → _sources/hands_on/keyword_based.md b/_sources/pages/keyword.md → _sources/hands_on/keyword_based.md
diff --git a/_sources/hands_on/scraping_examples.md b/_sources/hands_on/scraping_examples.md
@@ -0,0 +1,7 @@
+# Scraping Examples
+
+This section covers scenarios that researchers frequently encounter when collecting Reddit data.
+
+* [Collecting Subreddit-based Data](../hands_on/subreddit_based.md): Provides guidance on collecting data from specific subreddits
+* [Collecting Keyword-based Data](../hands_on/keyword_based.md): Covers collecting submissions based on specific keywords
+* [Database-Driven Data Collection](../hands_on/database_driven.md): Explains how to leverage an existing database (previously collected Reddit data) to collect additional relevant data
diff --git a/_sources/pages/subreddit.md → _sources/hands_on/subreddit_based.md b/_sources/pages/subreddit.md → _sources/hands_on/subreddit_based.md
diff --git a/_sources/pages/update.md → _sources/hands_on/update_data.md b/_sources/pages/update.md → _sources/hands_on/update_data.md
@@ -2,7 +2,7 @@
 
 ## Unlock temporal insights 📈 with intelligent updates 🔄
 
-`RedditHarbor`'s update module streamlines and automates the process of updating crucial metrics for existing submissions (comment and user is currently working-in-progress!). It provides flexibility and configurability to adjust update intervals and data sources. A key advantage of this update module is the ability to track how various metrics, such as the upvote ratio or score, change over time for specific posts. This capability sets RedditHarbor apart from many other Reddit database resources, such as PushShift or Academic Torrents, which typically provide a static "snapshot" of submissions and comments at a random point in time.
+The `update()` module streamlines and automates the process of updating crucial metrics for existing submissions (support for comments and users is currently a work in progress!). It provides flexibility and configurability to adjust update intervals and data sources. A key advantage of this update module is the ability to track how various metrics, such as the upvote ratio or score, change over time for specific posts. This capability sets RedditHarbor apart from many other Reddit database resources, such as PushShift or Academic Torrents, which typically provide a static "snapshot" of submissions and comments at an arbitrary point in time.
 
 ## Updating Submissions
 To update submission data, follow these steps:

diff --git a/_sources/pages/ScrapingExamples.md b/_sources/pages/ScrapingExamples.md
diff --git a/Introduction.html → about.html b/Introduction.html → about.html
@@ -9,7 +9,7 @@
     <meta charset="utf-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />
 
-    <title>About &#8212; ICWSM 2024 Tutorial</title>
+    <title>About &#8212; RedditHarbor</title>
 
 
 
@@ -60,11 +60,12 @@
 const thebe_selector_output = ".output, .cell_output"
 </script>
     <script async="async" src="_static/sphinx-thebe.js"></script>
-    <script>DOCUMENTATION_OPTIONS.pagename = 'Introduction';</script>
+    <script>DOCUMENTATION_OPTIONS.pagename = 'about';</script>
     <link rel="shortcut icon" href="_static/socius_logo.png"/>
+    <link rel="author" title="About these documents" href="#" />
     <link rel="index" title="Index" href="genindex.html" />
     <link rel="search" title="Search" href="search.html" />
-    <link rel="next" title="Prerequisites" href="pages/prerequisites.html" />
+    <link rel="next" title="Prerequisites" href="getting_started/prerequisites.html" />
   <meta name="viewport" content="width=device-width, initial-scale=1"/>
   <meta name="docsearch:language" content="None"/>
   </head>
@@ -157,20 +158,26 @@
         </ul>
         <p aria-level="2" class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
 <ul class="nav bd-sidenav">
-<li class="toctree-l1"><a class="reference internal" href="pages/prerequisites.html">Prerequisites</a></li>
-<li class="toctree-l1"><a class="reference internal" href="pages/install.html">Installation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="pages/GettingStarted.html">Getting Started</a></li>
+<li class="toctree-l1"><a class="reference internal" href="getting_started/prerequisites.html">Prerequisites</a></li>
+<li class="toctree-l1"><a class="reference internal" href="getting_started/installation.html">Installation</a></li>
+<li class="toctree-l1"><a class="reference internal" href="getting_started/setting.html">Getting Started</a></li>
+</ul>
+<p aria-level="2" class="caption" role="heading"><span class="caption-text">Database Schema</span></p>
+<ul class="nav bd-sidenav">
+<li class="toctree-l1"><a class="reference internal" href="data_schema/user.html">User</a></li>
+<li class="toctree-l1"><a class="reference internal" href="data_schema/submission.html">Submission</a></li>
+<li class="toctree-l1"><a class="reference internal" href="data_schema/comment.html">Comment</a></li>
 </ul>
 <p aria-level="2" class="caption" role="heading"><span class="caption-text">Hands-on RedditHarbor</span></p>
 <ul class="nav bd-sidenav">
-<li class="toctree-l1 has-children"><a class="reference internal" href="pages/ScrapingExamples.html">Scraping Examples</a><input class="toctree-checkbox" id="toctree-checkbox-1" name="toctree-checkbox-1" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-1"><i class="fa-solid fa-chevron-down"></i></label><ul>
-<li class="toctree-l2"><a class="reference internal" href="pages/subreddit.html">Collecting Subreddit-based Data</a></li>
-<li class="toctree-l2"><a class="reference internal" href="pages/keyword.html">Collecting Keyword-based Data</a></li>
-<li class="toctree-l2"><a class="reference internal" href="pages/database.html">Database-Driven Data Collection</a></li>
+<li class="toctree-l1 has-children"><a class="reference internal" href="hands_on/scraping_examples.html">Scraping Examples</a><input class="toctree-checkbox" id="toctree-checkbox-1" name="toctree-checkbox-1" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-1"><i class="fa-solid fa-chevron-down"></i></label><ul>
+<li class="toctree-l2"><a class="reference internal" href="hands_on/subreddit_based.html">Collecting Subreddit-based Data</a></li>
+<li class="toctree-l2"><a class="reference internal" href="hands_on/keyword_based.html">Collecting Keyword-based Data</a></li>
+<li class="toctree-l2"><a class="reference internal" href="hands_on/database_driven.html">Database-Driven Data Collection</a></li>
 </ul>
 </li>
-<li class="toctree-l1"><a class="reference internal" href="pages/download.html">Downloading Data</a></li>
-<li class="toctree-l1"><a class="reference internal" href="pages/update.html">Updating Data</a></li>
+<li class="toctree-l1"><a class="reference internal" href="hands_on/download_data.html">Downloading Data</a></li>
+<li class="toctree-l1"><a class="reference internal" href="hands_on/update_data.html">Updating Data</a></li>
 </ul>
 
     </div>
@@ -242,7 +249,7 @@
 
 
 
-      <li><a href="https://github.com/socius-org/RedditHarbor/issues/new?title=Issue%20on%20page%20%2FIntroduction.html&body=Your%20issue%20content%20here." target="_blank"
+      <li><a href="https://github.com/socius-org/RedditHarbor/issues/new?title=Issue%20on%20page%20%2Fabout.html&body=Your%20issue%20content%20here." target="_blank"
    class="btn btn-sm btn-source-issues-button dropdown-item"
    title="Open an issue"
    data-bs-placement="left" data-bs-toggle="tooltip"
@@ -272,7 +279,7 @@
 
 
 
-      <li><a href="_sources/Introduction.md" target="_blank"
+      <li><a href="_sources/about.md" target="_blank"
    class="btn btn-sm btn-download-source-button dropdown-item"
    title="Download source file"
    data-bs-placement="left" data-bs-toggle="tooltip"
@@ -410,7 +417,7 @@ <h3>RedditHarbor<a class="headerlink" href="#redditharbor" title="Permalink to t
 <li><p><strong>📈 Scalable and Efficient</strong>: Handle pagination seamlessly, even for large datasets with millions of rows.</p></li>
 <li><p><strong>🕹️ Customisable Collection</strong>: Tailor your data collection to your specific needs by configuring parameters.</p></li>
 <li><p><strong>📂 Analysis-Ready</strong>: Export your database to CSV, JSON, or JPEG formats for effortless integration with your preferred analysis tools.</p></li>
-<li><p><strong>🔄 Temporal Metric Tracking:</strong>: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static “snapshot” databases, such as PushShift or AcademicTorrent.</p></li>
+<li><p><strong>🔄 Temporal Metric Tracking</strong>: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static “snapshot” databases, such as PushShift or AcademicTorrent.</p></li>
 <li><p><strong>⚡ Smart Update Intervals</strong>: Leverage flexible configurations to automatically adjust update intervals based on dataset size, optimising efficiency while adhering to API constraints.</p></li>
 </ul>
 <p>With RedditHarbor, you can spend less time wrestling with technical hurdles and more time focusing on your research objectives.</p>
@@ -420,6 +427,8 @@ <h3>RedditHarbor<a class="headerlink" href="#redditharbor" title="Permalink to t
 </div>
 <div class="toctree-wrapper compound">
 </div>
+<div class="toctree-wrapper compound">
+</div>
 </section>
 
     <script type="text/x-thebe-config">
@@ -454,7 +463,7 @@ <h3>RedditHarbor<a class="headerlink" href="#redditharbor" title="Permalink to t
     <div class="footer-article-item"><!-- Previous / next buttons -->
 <div class="prev-next-area">
     <a class="right-next"
-       href="pages/prerequisites.html"
+       href="getting_started/prerequisites.html"
        title="next page">
       <div class="prev-next-info">
         <p class="prev-next-subtitle">next</p>