From 581bcf0fd57f81c9c82f3b3116191cb83b6e63df Mon Sep 17 00:00:00 2001
From: SOCIALSCIENCEai
Date: Tue, 16 Apr 2024 00:11:05 +0100
Subject: [PATCH] Update documentation

---
 .buildinfo | 2 +-
 _sources/{Introduction.md => about.md} | 2 +-
 _sources/data_schema/comment.md | 23 +
 _sources/data_schema/submission.md | 37 ++
 _sources/data_schema/user.md | 29 +
 .../installation.md} | 0
 .../prerequisites.md | 0
 .../setting.md} | 2 +-
 .../database_driven.md} | 0
 .../download.md => hands_on/download_data.md} | 0
 .../keyword.md => hands_on/keyword_based.md} | 0
 _sources/hands_on/scraping_examples.md | 7 +
 .../subreddit_based.md} | 0
 .../update.md => hands_on/update_data.md} | 2 +-
 _sources/pages/ScrapingExamples.md | 7 -
 Introduction.html => about.html | 41 +-
 data_schema/comment.html | 508 +++++++++++++++++
 data_schema/submission.html | 522 ++++++++++++++++++
 data_schema/user.html | 513 +++++++++++++++++
 genindex.html | 31 +-
 .../installation.html | 39 +-
 {pages => getting_started}/prerequisites.html | 45 +-
 .../setting.html | 47 +-
 .../database_driven.html | 45 +-
 .../download_data.html | 45 +-
 .../keyword_based.html | 45 +-
 .../scraping_examples.html | 53 +-
 .../subreddit_based.html | 45 +-
 .../update.html => hands_on/update_data.html | 43 +-
 index.html | 2 +-
 objects.inv | 4 +-
 search.html | 31 +-
 searchindex.js | 2 +-
 33 files changed, 1944 insertions(+), 228 deletions(-)
 rename _sources/{Introduction.md => about.md} (91%)
 create mode 100755 _sources/data_schema/comment.md
 create mode 100755 _sources/data_schema/submission.md
 create mode 100755 _sources/data_schema/user.md
 rename _sources/{pages/install.md => getting_started/installation.md} (100%)
 rename _sources/{pages => getting_started}/prerequisites.md (100%)
 rename _sources/{pages/GettingStarted.md => getting_started/setting.md} (92%)
 rename _sources/{pages/database.md => hands_on/database_driven.md} (100%)
 rename _sources/{pages/download.md => hands_on/download_data.md} (100%)
 rename _sources/{pages/keyword.md => hands_on/keyword_based.md} (100%)
 create mode 100755 _sources/hands_on/scraping_examples.md
 rename _sources/{pages/subreddit.md => hands_on/subreddit_based.md} (100%)
 rename _sources/{pages/update.md => hands_on/update_data.md} (73%)
 delete mode 100755 _sources/pages/ScrapingExamples.md
 rename Introduction.html => about.html (87%)
 create mode 100755 data_schema/comment.html
 create mode 100755 data_schema/submission.html
 create mode 100755 data_schema/user.html
 rename pages/install.html => getting_started/installation.html (84%)
 rename {pages => getting_started}/prerequisites.html (88%)
 rename pages/GettingStarted.html => getting_started/setting.html (89%)
 rename pages/database.html => hands_on/database_driven.html (89%)
 rename pages/download.html => hands_on/download_data.html (89%)
 rename pages/keyword.html => hands_on/keyword_based.html (88%)
 rename pages/ScrapingExamples.html => hands_on/scraping_examples.html (82%)
 rename pages/subreddit.html => hands_on/subreddit_based.html (89%)
 rename pages/update.html => hands_on/update_data.html (85%)

diff --git a/.buildinfo b/.buildinfo
index 088fb0a..f19ac0a 100755
--- a/.buildinfo
+++ b/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: c980c25411b2f88e2de822371e07e90a
+config: 87d691a00f9808f83b1e8ac9849980e7
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/_sources/Introduction.md b/_sources/about.md
similarity index 91%
rename from _sources/Introduction.md
rename to _sources/about.md
index bbe1454..ad8ae23 100755
--- a/_sources/Introduction.md
+++ b/_sources/about.md
@@ -27,7 +27,7 @@ Here's how RedditHarbor empowers your research:
 * **📈 Scalable and Efficient**: Handle pagination seamlessly, even for large datasets with millions of rows.
 * **🕹️ Customisable Collection**: Tailor your data collection to your specific needs by configuring parameters.
 * **📂 Analysis-Ready**: Export your database to CSV, JSON, or JPEG formats for effortless integration with your preferred analysis tools.
-* **🔄 Temporal Metric Tracking:**: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static "snapshot" databases, such as PushShift or AcademicTorrent.
+* **🔄 Temporal Metric Tracking**: Regularly update key metrics like upvote ratios, scores, awards, and comment counts, allowing temporal analysis - a distinct advantage over static "snapshot" databases, such as PushShift or AcademicTorrent.
 * **⚡ Smart Update Intervals**: Leverage flexible configurations to automatically adjust update intervals based on dataset size, optimising efficiency while adhering to API constraints.
 
 With RedditHarbor, you can spend less time wrestling with technical hurdles and more time focusing on your research objectives.
\ No newline at end of file
diff --git a/_sources/data_schema/comment.md b/_sources/data_schema/comment.md
new file mode 100755
index 0000000..2809eeb
--- /dev/null
+++ b/_sources/data_schema/comment.md
@@ -0,0 +1,23 @@
+# Comment
+
+The `Comment` collection stores information about comments made on Reddit submissions.
Each document in this collection has the following schema:
+
+```python
+{
+    "comment_id": str,  # Unique identifier for the comment
+    "link_id": str,  # ID of the submission the comment is associated with
+    "subreddit": str,  # Name of the subreddit the comment is posted in
+    "parent_id": str,  # ID of the parent comment or submission
+    "redditor_id": str,  # ID of the user who posted the comment
+    "created_at": str,  # Datetime when the comment was created (ISO format)
+    "body": str or None,  # Text content of the comment (None if removed)
+    "score": {str: int},  # Dictionary mapping datetimes (ISO format) to the comment's score
+    "edited": bool,  # Whether the comment has been edited
+    "removed": str or None  # "deleted" or "removed" if the comment was removed, otherwise None
+}
+```
+
+The `parent_id` field can have two different formats:
+
+1. If it starts with `"t3_{link_id}"`, it means the comment is a top-level comment, and the parent is a submission.
+2. If it starts with `"t1_{comment_id}"`, it means the comment is a reply to another comment, and the parent is the comment with the specified `comment_id`.
\ No newline at end of file
diff --git a/_sources/data_schema/submission.md b/_sources/data_schema/submission.md
new file mode 100755
index 0000000..98cf0c3
--- /dev/null
+++ b/_sources/data_schema/submission.md
@@ -0,0 +1,37 @@
+# Submission
+
+The `Submission` collection stores information about Reddit submissions.
Each document in this collection has the following schema:
+
+```python
+{
+    "submission_id": str,  # Unique identifier for the submission
+    "redditor_id": str,  # ID of the user who posted the submission
+    "created_at": str,  # Datetime when the submission was created (ISO format)
+    "title": str,  # Title of the submission
+    "text": str,  # Text content of the submission
+    "subreddit": str,  # Name of the subreddit the submission is posted in
+    "permalink": str,  # URL of the submission
+    "attachment": {str: str} or None,  # Dictionary containing URLs of attached media (e.g., {"jpg": "https://example.com/image.jpg"})
+    "flair": {
+        "link": str,  # Link flair text
+        "author": str  # Author flair text
+    },
+    "awards": {
+        "list": dict,  # Dictionary mapping award names to [count, coin_price]
+        "total_awards_count": int,  # Total number of awards received
+        "total_awards_price": int  # Total coin price of all awards received
+    },
+    "score": {str: int},  # Dictionary mapping datetimes (ISO format) to the submission's score
+    "upvote_ratio": {str: float},  # Dictionary mapping datetimes (ISO format) to the upvote ratio
+    "num_comments": {str: int},  # Dictionary mapping datetimes (ISO format) to the number of comments
+    "edited": bool,  # Whether the submission has been edited
+    "archived": bool,  # Whether the submission is archived
+    "removed": bool,  # Whether the submission has been removed
+    "poll": {
+        "total_vote_count": int,  # Total number of votes in the poll
+        "vote_ends_at": str,  # Datetime when the poll ends (ISO format)
+        "options": {str: int},  # Dictionary mapping poll options to the number of votes
+        "closed": bool  # Whether the poll is closed
+    } or None  # None if the submission does not have a poll
+}
+```
diff --git a/_sources/data_schema/user.md b/_sources/data_schema/user.md
new file mode 100755
index 0000000..3c92caf
--- /dev/null
+++ b/_sources/data_schema/user.md
@@ -0,0 +1,29 @@
+# User
+
+The `User` collection stores information about Reddit users.
Each document in this collection has the following schema:
+
+```python
+{
+    "redditor_id": str,  # Unique identifier for the user
+    "name": str,  # User's Reddit username
+    "created_at": str,  # Datetime when the user account was created (ISO format)
+    "karma": {
+        "link": int,  # Link karma
+        "total": int,  # Total karma
+        "awardee": int,  # Karma received from awards
+        "awarder": int,  # Karma awarded to others
+        "comment": int  # Comment karma
+    },
+    "is_gold": bool,  # Whether the user has Reddit Gold
+    "is_mod": {
+        str: [str, int]  # Dictionary mapping subreddit IDs to [subreddit name, number of subscribers]
+    } or None,  # None if the user is not a moderator
+    "trophy": {
+        "list": list,  # List of trophy names
+        "count": int  # Number of trophies
+    } or None,  # None if the user has no trophies
+    "removed": str  # "active" or "suspended"
+}
+```
+
+Note: For suspended users, the `redditor_id` is represented as `"suspended:{name}"`.
\ No newline at end of file
diff --git a/_sources/pages/install.md b/_sources/getting_started/installation.md
similarity index 100%
rename from _sources/pages/install.md
rename to _sources/getting_started/installation.md
diff --git a/_sources/pages/prerequisites.md b/_sources/getting_started/prerequisites.md
similarity index 100%
rename from _sources/pages/prerequisites.md
rename to _sources/getting_started/prerequisites.md
diff --git a/_sources/pages/GettingStarted.md b/_sources/getting_started/setting.md
similarity index 92%
rename from _sources/pages/GettingStarted.md
rename to _sources/getting_started/setting.md
index 754ec19..cec3333 100755
--- a/_sources/pages/GettingStarted.md
+++ b/_sources/getting_started/setting.md
@@ -66,7 +66,7 @@ CREATE TABLE test_comment(
 ALTER TABLE test_comment ENABLE ROW LEVEL SECURITY;
 ```
 
-This will create the three tables with the necessary columns and data types. Once created, you'll see the new tables available in the "Table Editor".
In the future, you can duplicate and rename these tables (instead of "test_...") for your production needs.
+This will create the three tables with the necessary columns and data types. Once created, you'll see the new tables available in the "Table Editor". In the future, you can duplicate and rename these tables (instead of "test_...") for your production needs. For a structured overview of the database schema used by RedditHarbor, including detailed explanations of each field and its data type, see [Database Schema](../data_schema/user.md).
 
 ```{warning}
 The RedditHarbor package depends on predefined column names for all user, submission, and comment tables. To ensure proper functionality, it's crucial to create tables with all the specified columns mentioned in the documentation. Failure to do so may lead to errors or incomplete data retrieval.
diff --git a/_sources/pages/database.md b/_sources/hands_on/database_driven.md
similarity index 100%
rename from _sources/pages/database.md
rename to _sources/hands_on/database_driven.md
diff --git a/_sources/pages/download.md b/_sources/hands_on/download_data.md
similarity index 100%
rename from _sources/pages/download.md
rename to _sources/hands_on/download_data.md
diff --git a/_sources/pages/keyword.md b/_sources/hands_on/keyword_based.md
similarity index 100%
rename from _sources/pages/keyword.md
rename to _sources/hands_on/keyword_based.md
diff --git a/_sources/hands_on/scraping_examples.md b/_sources/hands_on/scraping_examples.md
new file mode 100755
index 0000000..7bedfad
--- /dev/null
+++ b/_sources/hands_on/scraping_examples.md
@@ -0,0 +1,7 @@
+# Scraping Examples
+
+This section will cover frequently used scenarios from researchers for collecting Reddit data.
+
+* [Collecting Subreddit-based Data](../hands_on/subreddit_based.md): Provides guidance on collecting data from specific subreddits
+* [Collecting Keyword-based Data](../hands_on/keyword_based.md): Covers collecting submissions based on specific keywords
+* [Database-Driven Data Collection](../hands_on/database_driven.md): Explains how to leverage an existing database (previously collected Reddit data) to collect additional relevant data
\ No newline at end of file
diff --git a/_sources/pages/subreddit.md b/_sources/hands_on/subreddit_based.md
similarity index 100%
rename from _sources/pages/subreddit.md
rename to _sources/hands_on/subreddit_based.md
diff --git a/_sources/pages/update.md b/_sources/hands_on/update_data.md
similarity index 73%
rename from _sources/pages/update.md
rename to _sources/hands_on/update_data.md
index 0ccb4d7..54e702a 100755
--- a/_sources/pages/update.md
+++ b/_sources/hands_on/update_data.md
@@ -2,7 +2,7 @@
 
 ## Unlock temporal insights 📈 with intelligent updates 🔄
 
-`RedditHarbor`'s update module streamlines and automates the process of updating crucial metrics for existing submissions (comment and user is currently working-in-progress!). It provides flexibility and configurability to adjust update intervals and data sources. A key advantage of this update module is the ability to track how various metrics, such as the upvote ratio or score, change over time for specific posts. This capability sets RedditHarbor apart from many other Reddit database resources, such as PushShift or Academic Torrents, which typically provide a static "snapshot" of submissions and comments at a random point in time.
+The `update()` module streamlines and automates the process of updating crucial metrics for existing submissions (comment and user is currently working-in-progress!). It provides flexibility and configurability to adjust update intervals and data sources.
A key advantage of this update module is the ability to track how various metrics, such as the upvote ratio or score, change over time for specific posts. This capability sets RedditHarbor apart from many other Reddit database resources, such as PushShift or Academic Torrents, which typically provide a static "snapshot" of submissions and comments at a random point in time.
 
 ## Updating Submissions
 To update submission data, follow these steps:
diff --git a/_sources/pages/ScrapingExamples.md b/_sources/pages/ScrapingExamples.md
deleted file mode 100755
index 35fe9dc..0000000
--- a/_sources/pages/ScrapingExamples.md
+++ /dev/null
@@ -1,7 +0,0 @@
-# Scraping Examples
-
-This section will cover frequently used scenarios from researchers for collecting Reddit data.
-
-* [Collecting Subreddit-based Data](../pages/subreddit.md): Provides guidance on collecting data from specific subreddits
-* [Collecting Keyword-based Data](../pages/keyword.md): Covers collecting submissions based on specific keywords
-* [Database-Driven Data Collection](../pages/database.md): Explains how to leverage an existing database (previously collected Reddit data) to collect additional relevant data
\ No newline at end of file
diff --git a/Introduction.html b/about.html
similarity index 87%
rename from Introduction.html
rename to about.html
index d004ef0..3f6bf5b 100755
--- a/Introduction.html
+++ b/about.html
@@ -9,7 +9,7 @@
- About — ICWSM 2024 Tutorial
+ About — RedditHarbor
@@ -60,11 +60,12 @@ const thebe_selector_output = ".output, .cell_output"
- + + - +
@@ -157,20 +158,26 @@

Getting Started

+

Database Schema

+

Hands-on RedditHarbor