Update documentation
nick-sh-oh committed Apr 15, 2024
1 parent 9015892 commit 581bcf0
Showing 33 changed files with 1,944 additions and 228 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: c980c25411b2f88e2de822371e07e90a
config: 87d691a00f9808f83b1e8ac9849980e7
tags: 645f666f9bcd5a90fca523b33c5a78b7
2 changes: 1 addition & 1 deletion _sources/Introduction.md → _sources/about.md
@@ -27,7 +27,7 @@ Here's how RedditHarbor empowers your research:
* **📈 Scalable and Efficient**: Handle pagination seamlessly, even for large datasets with millions of rows.
* **🕹️ Customisable Collection**: Tailor your data collection to your specific needs by configuring parameters.
* **📂 Analysis-Ready**: Export your database to CSV, JSON, or JPEG formats for effortless integration with your preferred analysis tools.
* **🔄 Temporal Metric Tracking:**: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static "snapshot" databases, such as PushShift or AcademicTorrent.
* **🔄 Temporal Metric Tracking**: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static "snapshot" databases, such as PushShift or AcademicTorrent.
* **⚡ Smart Update Intervals**: Leverage flexible configurations to automatically adjust update intervals based on dataset size, optimising efficiency while adhering to API constraints.

With RedditHarbor, you can spend less time wrestling with technical hurdles and more time focusing on your research objectives.
23 changes: 23 additions & 0 deletions _sources/data_schema/comment.md
@@ -0,0 +1,23 @@
# Comment

The `Comment` collection stores information about comments made on Reddit submissions. Each document in this collection has the following schema:

```python
{
"comment_id": str, # Unique identifier for the comment
"link_id": str, # ID of the submission the comment is associated with
"subreddit": str, # Name of the subreddit the comment is posted in
"parent_id": str, # ID of the parent comment or submission
"redditor_id": str, # ID of the user who posted the comment
"created_at": str, # Datetime when the comment was created (ISO format)
"body": str or None, # Text content of the comment (None if removed)
"score": {str: int}, # Dictionary mapping datetimes (ISO format) to the comment's score
"edited": bool, # Whether the comment has been edited
"removed": str or None # "deleted" or "removed" if the comment was removed, otherwise None
}
```

The `parent_id` field can have two different formats:

1. If it starts with `"t3_{link_id}"`, it means the comment is a top-level comment, and the parent is a submission.
2. If it starts with `"t1_{comment_id}"`, it means the comment is a reply to another comment, and the parent is the comment with the specified `comment_id`.
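The two `parent_id` formats above can be told apart by their prefix. The following is a minimal sketch, not part of RedditHarbor itself: the helper name `classify_parent` is hypothetical, and it assumes the part after the first underscore is the bare ID of the parent.

```python
from typing import Tuple


def classify_parent(parent_id: str) -> Tuple[str, str]:
    """Return (parent kind, bare ID) for a `parent_id` value.

    Hypothetical helper for illustration; assumes the "t3_" / "t1_"
    prefixes described above.
    """
    prefix, _, bare_id = parent_id.partition("_")
    if prefix == "t3":
        # Top-level comment: the parent is a submission.
        return "submission", bare_id
    if prefix == "t1":
        # Reply: the parent is another comment.
        return "comment", bare_id
    raise ValueError(f"unexpected parent_id prefix: {parent_id!r}")


print(classify_parent("t3_abc123"))
print(classify_parent("t1_def456"))
```

Raising on an unknown prefix, rather than guessing, makes malformed rows easy to spot when reconstructing comment threads.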
37 changes: 37 additions & 0 deletions _sources/data_schema/submission.md
@@ -0,0 +1,37 @@
# Submission

The `Submission` collection stores information about Reddit submissions. Each document in this collection has the following schema:

```python
{
"submission_id": str, # Unique identifier for the submission
"redditor_id": str, # ID of the user who posted the submission
"created_at": str, # Datetime when the submission was created (ISO format)
"title": str, # Title of the submission
"text": str, # Text content of the submission
"subreddit": str, # Name of the subreddit the submission is posted in
"permalink": str, # URL of the submission
"attachment": {str: str} or None, # Dictionary containing URLs of attached media (e.g., {"jpg": "https://example.com/image.jpg"})
"flair": {
"link": str, # Link flair text
"author": str # Author flair text
},
"awards": {
"list": dict, # Dictionary mapping award names to [count, coin_price]
"total_awards_count": int, # Total number of awards received
"total_awards_price": int # Total coin price of all awards received
},
"score": {str: int}, # Dictionary mapping datetimes (ISO format) to the submission's score
"upvote_ratio": {str: float}, # Dictionary mapping datetimes (ISO format) to the upvote ratio
"num_comments": {str: int}, # Dictionary mapping datetimes (ISO format) to the number of comments
"edited": bool, # Whether the submission has been edited
"archived": bool, # Whether the submission is archived
"removed": bool, # Whether the submission has been removed
"poll": {
"total_vote_count": int, # Total number of votes in the poll
"vote_ends_at": str, # Datetime when the poll ends (ISO format)
"options": {str: int}, # Dictionary mapping poll options to the number of votes
"closed": bool # Whether the poll is closed
} or None # None if the submission does not have a poll
}
```
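
Because `score`, `upvote_ratio`, and `num_comments` map ISO datetimes to values, each one can be read as a small time series. The sketch below is illustrative only: the helper `metric_series` and the sample document are made up, and it assumes the keys are in a form that Python's `datetime.fromisoformat` accepts.

```python
from datetime import datetime


def metric_series(metric: dict) -> list:
    """Sort a {ISO datetime: value} mapping into chronological order.

    Hypothetical helper for illustration; returns (datetime, value) pairs.
    """
    return sorted(
        ((datetime.fromisoformat(ts), value) for ts, value in metric.items()),
        key=lambda pair: pair[0],
    )


# Made-up sample document containing only the field used here.
submission = {
    "score": {
        "2024-04-10T12:00:00+00:00": 10,
        "2024-04-12T12:00:00+00:00": 42,
        "2024-04-11T12:00:00+00:00": 25,
    }
}

series = metric_series(submission["score"])
first_time, first_score = series[0]
last_time, last_score = series[-1]

# Change in score between the first and the last recorded observation.
print(last_score - first_score)
```

Sorting on the parsed datetimes rather than on the raw strings keeps the order correct even if the stored keys use different UTC offsets.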
29 changes: 29 additions & 0 deletions _sources/data_schema/user.md
@@ -0,0 +1,29 @@
# User

The `User` collection stores information about Reddit users. Each document in this collection has the following schema:

```python
{
"redditor_id": str, # Unique identifier for the user
"name": str, # User's Reddit username
"created_at": str, # Datetime when the user account was created (ISO format)
"karma": {
"link": int, # Link karma
"total": int, # Total karma
"awardee": int, # Karma received from awards
"awarder": int, # Karma awarded to others
"comment": int # Comment karma
},
"is_gold": bool, # Whether the user has Reddit Gold
"is_mod": {
str: [str, int] # Dictionary mapping subreddit IDs to [subreddit name, number of subscribers]
} or None, # None if the user is not a moderator
"trophy": {
"list": list, # List of trophy names
"count": int # Number of trophies
} or None, # None if the user has no trophies
"removed": str # "active" or "suspended"
}
```

Note: For suspended users, the `redditor_id` is represented as `"suspended:{name}"`.
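
When joining users to submissions or comments, it can help to separate suspended accounts from active ones. A minimal sketch, assuming only the `"suspended:{name}"` convention noted above; the helper name `parse_redditor_id` is hypothetical.

```python
from typing import Tuple

SUSPENDED_PREFIX = "suspended:"


def parse_redditor_id(redditor_id: str) -> Tuple[bool, str]:
    """Return (is_suspended, value) for a `redditor_id`.

    Hypothetical helper for illustration. For suspended users the value
    is the username; otherwise it is the ID unchanged.
    """
    if redditor_id.startswith(SUSPENDED_PREFIX):
        return True, redditor_id[len(SUSPENDED_PREFIX):]
    return False, redditor_id


print(parse_redditor_id("suspended:some_user"))
print(parse_redditor_id("abc123"))
```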
File renamed without changes.
File renamed without changes.
@@ -66,7 +66,7 @@ CREATE TABLE test_comment(
ALTER TABLE test_comment ENABLE ROW LEVEL SECURITY;
```

This will create the three tables with the necessary columns and data types. Once created, you'll see the new tables available in the "Table Editor". In the future, you can duplicate and rename these tables (instead of "test_...") for your production needs.
This will create the three tables with the necessary columns and data types. Once created, you'll see the new tables available in the "Table Editor". In the future, you can duplicate and rename these tables (instead of "test_...") for your production needs. For a structured overview of the database schema used by RedditHarbor, including detailed explanations of each field and its data type, see [Database Schema](../data_schema/user.md).

```{warning}
The RedditHarbor package depends on predefined column names for all user, submission, and comment tables. To ensure proper functionality, it's crucial to create tables with all the specified columns mentioned in the documentation. Failure to do so may lead to errors or incomplete data retrieval.
File renamed without changes.
File renamed without changes.
File renamed without changes.
7 changes: 7 additions & 0 deletions _sources/hands_on/scraping_examples.md
@@ -0,0 +1,7 @@
# Scraping Examples

This section covers scenarios that researchers frequently encounter when collecting Reddit data.

* [Collecting Subreddit-based Data](../hands_on/subreddit_based.md): Provides guidance on collecting data from specific subreddits
* [Collecting Keyword-based Data](../hands_on/keyword_based.md): Covers collecting submissions based on specific keywords
* [Database-Driven Data Collection](../hands_on/database_driven.md): Explains how to leverage an existing database (previously collected Reddit data) to collect additional relevant data
File renamed without changes.
@@ -2,7 +2,7 @@

## Unlock temporal insights 📈 with intelligent updates 🔄

`RedditHarbor`'s update module streamlines and automates the process of updating crucial metrics for existing submissions (support for comments and users is currently a work in progress!). It provides flexibility and configurability to adjust update intervals and data sources. A key advantage of this update module is the ability to track how various metrics, such as the upvote ratio or score, change over time for specific posts. This capability sets RedditHarbor apart from many other Reddit database resources, such as PushShift or Academic Torrents, which typically provide a static "snapshot" of submissions and comments at an arbitrary point in time.
The `update()` module streamlines and automates the process of updating crucial metrics for existing submissions (support for comments and users is currently a work in progress!). It provides flexibility and configurability to adjust update intervals and data sources. A key advantage of this update module is the ability to track how various metrics, such as the upvote ratio or score, change over time for specific posts. This capability sets RedditHarbor apart from many other Reddit database resources, such as PushShift or Academic Torrents, which typically provide a static "snapshot" of submissions and comments at an arbitrary point in time.

## Updating Submissions
To update submission data, follow these steps:
7 changes: 0 additions & 7 deletions _sources/pages/ScrapingExamples.md

This file was deleted.

41 changes: 25 additions & 16 deletions Introduction.html → about.html
@@ -9,7 +9,7 @@
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />

<title>About &#8212; ICWSM 2024 Tutorial</title>
<title>About &#8212; RedditHarbor</title>



@@ -60,11 +60,12 @@
const thebe_selector_output = ".output, .cell_output"
</script>
<script async="async" src="_static/sphinx-thebe.js"></script>
<script>DOCUMENTATION_OPTIONS.pagename = 'Introduction';</script>
<script>DOCUMENTATION_OPTIONS.pagename = 'about';</script>
<link rel="shortcut icon" href="_static/socius_logo.png"/>
<link rel="author" title="About these documents" href="#" />
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Prerequisites" href="pages/prerequisites.html" />
<link rel="next" title="Prerequisites" href="getting_started/prerequisites.html" />
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<meta name="docsearch:language" content="None"/>
</head>
@@ -157,20 +158,26 @@
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="pages/prerequisites.html">Prerequisites</a></li>
<li class="toctree-l1"><a class="reference internal" href="pages/install.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="pages/GettingStarted.html">Getting Started</a></li>
<li class="toctree-l1"><a class="reference internal" href="getting_started/prerequisites.html">Prerequisites</a></li>
<li class="toctree-l1"><a class="reference internal" href="getting_started/installation.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="getting_started/setting.html">Getting Started</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Database Schema</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="data_schema/user.html">User</a></li>
<li class="toctree-l1"><a class="reference internal" href="data_schema/submission.html">Submission</a></li>
<li class="toctree-l1"><a class="reference internal" href="data_schema/comment.html">Comment</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Hands-on RedditHarbor</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1 has-children"><a class="reference internal" href="pages/ScrapingExamples.html">Scraping Examples</a><input class="toctree-checkbox" id="toctree-checkbox-1" name="toctree-checkbox-1" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-1"><i class="fa-solid fa-chevron-down"></i></label><ul>
<li class="toctree-l2"><a class="reference internal" href="pages/subreddit.html">Collecting Subreddit-based Data</a></li>
<li class="toctree-l2"><a class="reference internal" href="pages/keyword.html">Collecting Keyword-based Data</a></li>
<li class="toctree-l2"><a class="reference internal" href="pages/database.html">Database-Driven Data Collection</a></li>
<li class="toctree-l1 has-children"><a class="reference internal" href="hands_on/scraping_examples.html">Scraping Examples</a><input class="toctree-checkbox" id="toctree-checkbox-1" name="toctree-checkbox-1" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-1"><i class="fa-solid fa-chevron-down"></i></label><ul>
<li class="toctree-l2"><a class="reference internal" href="hands_on/subreddit_based.html">Collecting Subreddit-based Data</a></li>
<li class="toctree-l2"><a class="reference internal" href="hands_on/keyword_based.html">Collecting Keyword-based Data</a></li>
<li class="toctree-l2"><a class="reference internal" href="hands_on/database_driven.html">Database-Driven Data Collection</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="pages/download.html">Downloading Data</a></li>
<li class="toctree-l1"><a class="reference internal" href="pages/update.html">Updating Data</a></li>
<li class="toctree-l1"><a class="reference internal" href="hands_on/download_data.html">Downloading Data</a></li>
<li class="toctree-l1"><a class="reference internal" href="hands_on/update_data.html">Updating Data</a></li>
</ul>

</div>
@@ -242,7 +249,7 @@



<li><a href="https://github.com/socius-org/RedditHarbor/issues/new?title=Issue%20on%20page%20%2FIntroduction.html&body=Your%20issue%20content%20here." target="_blank"
<li><a href="https://github.com/socius-org/RedditHarbor/issues/new?title=Issue%20on%20page%20%2Fabout.html&body=Your%20issue%20content%20here." target="_blank"
class="btn btn-sm btn-source-issues-button dropdown-item"
title="Open an issue"
data-bs-placement="left" data-bs-toggle="tooltip"
@@ -272,7 +279,7 @@



<li><a href="_sources/Introduction.md" target="_blank"
<li><a href="_sources/about.md" target="_blank"
class="btn btn-sm btn-download-source-button dropdown-item"
title="Download source file"
data-bs-placement="left" data-bs-toggle="tooltip"
@@ -410,7 +417,7 @@ <h3>RedditHarbor<a class="headerlink" href="#redditharbor" title="Permalink to t
<li><p><strong>📈 Scalable and Efficient</strong>: Handle pagination seamlessly, even for large datasets with millions of rows.</p></li>
<li><p><strong>🕹️ Customisable Collection</strong>: Tailor your data collection to your specific needs by configuring parameters.</p></li>
<li><p><strong>📂 Analysis-Ready</strong>: Export your database to CSV, JSON, or JPEG formats for effortless integration with your preferred analysis tools.</p></li>
<li><p><strong>🔄 Temporal Metric Tracking:</strong>: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static “snapshot” databases, such as PushShift or AcademicTorrent.</p></li>
<li><p><strong>🔄 Temporal Metric Tracking</strong>: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static “snapshot” databases, such as PushShift or AcademicTorrent.</p></li>
<li><p><strong>⚡ Smart Update Intervals</strong>: Leverage flexible configurations to automatically adjust update intervals based on dataset size, optimising efficiency while adhering to API constraints.</p></li>
</ul>
<p>With RedditHarbor, you can spend less time wrestling with technical hurdles and more time focusing on your research objectives.</p>
@@ -420,6 +427,8 @@ <h3>RedditHarbor<a class="headerlink" href="#redditharbor" title="Permalink to t
</div>
<div class="toctree-wrapper compound">
</div>
<div class="toctree-wrapper compound">
</div>
</section>

<script type="text/x-thebe-config">
@@ -454,7 +463,7 @@ <h3>RedditHarbor<a class="headerlink" href="#redditharbor" title="Permalink to t
<div class="footer-article-item"><!-- Previous / next buttons -->
<div class="prev-next-area">
<a class="right-next"
href="pages/prerequisites.html"
href="getting_started/prerequisites.html"
title="next page">
<div class="prev-next-info">
<p class="prev-next-subtitle">next</p>