1 parent 9015892, commit 581bcf0
Showing 33 changed files with 1,944 additions and 228 deletions.
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: c980c25411b2f88e2de822371e07e90a
+config: 87d691a00f9808f83b1e8ac9849980e7
 tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -0,0 +1,23 @@
# Comment

The `Comment` collection stores information about comments made on Reddit submissions. Each document in this collection has the following schema:

```python
{
    "comment_id": str,  # Unique identifier for the comment
    "link_id": str,  # ID of the submission the comment is associated with
    "subreddit": str,  # Name of the subreddit the comment is posted in
    "parent_id": str,  # ID of the parent comment or submission
    "redditor_id": str,  # ID of the user who posted the comment
    "created_at": str,  # Datetime when the comment was created (ISO format)
    "body": str or None,  # Text content of the comment (None if removed)
    "score": {str: int},  # Dictionary mapping datetimes (ISO format) to the comment's score
    "edited": bool,  # Whether the comment has been edited
    "removed": str or None  # "deleted" or "removed" if the comment was removed, otherwise None
}
```

The `parent_id` field can have two different formats:

1. If it starts with `t3_` (that is, `"t3_{link_id}"`), the comment is a top-level comment and the parent is a submission.
2. If it starts with `t1_` (that is, `"t1_{comment_id}"`), the comment is a reply to another comment and the parent is the comment with the specified `comment_id`.
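
The two `parent_id` formats above can be told apart by their prefix. The following is a minimal sketch, not part of the documented package: the helper names `parse_parent_id` and `is_top_level` are hypothetical, and the code assumes the part after the first underscore is the bare ID.

```python
from typing import Tuple


def parse_parent_id(parent_id: str) -> Tuple[str, str]:
    """Split a parent_id into (parent kind, bare ID).

    Hypothetical helper: "t3_" marks a submission parent and
    "t1_" marks a comment parent, as described above.
    """
    prefix, separator, bare_id = parent_id.partition("_")
    if not separator or not bare_id:
        raise ValueError(f"Malformed parent_id: {parent_id!r}")
    if prefix == "t3":
        return "submission", bare_id
    if prefix == "t1":
        return "comment", bare_id
    raise ValueError(f"Unexpected parent_id prefix: {prefix!r}")


def is_top_level(comment: dict) -> bool:
    """Return True when the comment replies directly to a submission."""
    kind, _ = parse_parent_id(comment["parent_id"])
    return kind == "submission"
```

Raising on an unknown prefix, rather than guessing, keeps malformed documents visible when rebuilding comment threads.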
@@ -0,0 +1,37 @@
# Submission

The `Submission` collection stores information about Reddit submissions. Each document in this collection has the following schema:

```python
{
    "submission_id": str,  # Unique identifier for the submission
    "redditor_id": str,  # ID of the user who posted the submission
    "created_at": str,  # Datetime when the submission was created (ISO format)
    "title": str,  # Title of the submission
    "text": str,  # Text content of the submission
    "subreddit": str,  # Name of the subreddit the submission is posted in
    "permalink": str,  # URL of the submission
    "attachment": {str: str} or None,  # Dictionary containing URLs of attached media (e.g., {"jpg": "https://example.com/image.jpg"})
    "flair": {
        "link": str,  # Link flair text
        "author": str  # Author flair text
    },
    "awards": {
        "list": dict,  # Dictionary mapping award names to [count, coin_price]
        "total_awards_count": int,  # Total number of awards received
        "total_awards_price": int  # Total coin price of all awards received
    },
    "score": {str: int},  # Dictionary mapping datetimes (ISO format) to the submission's score
    "upvote_ratio": {str: float},  # Dictionary mapping datetimes (ISO format) to the upvote ratio
    "num_comments": {str: int},  # Dictionary mapping datetimes (ISO format) to the number of comments
    "edited": bool,  # Whether the submission has been edited
    "archived": bool,  # Whether the submission is archived
    "removed": bool,  # Whether the submission has been removed
    "poll": {
        "total_vote_count": int,  # Total number of votes in the poll
        "vote_ends_at": str,  # Datetime when the poll ends (ISO format)
        "options": {str: int},  # Dictionary mapping poll options to the number of votes
        "closed": bool  # Whether the poll is closed
    } or None  # None if the submission does not have a poll
}
```
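
Fields such as `score`, `upvote_ratio`, and `num_comments` are keyed by ISO-format datetimes, so reading the most recent value means picking the latest key. The sketch below is one possible approach: `latest_observation` is a hypothetical helper, and it assumes every key is accepted by `datetime.fromisoformat` and that all keys in one dictionary share the same timezone convention.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple, Union

Number = Union[int, float]


def latest_observation(series: Dict[str, Number]) -> Optional[Tuple[str, Number]]:
    """Return (datetime key, value) for the most recent entry, or None if empty.

    Keys are parsed as datetimes before comparison, which avoids relying on
    plain string ordering (assumption: keys parse with fromisoformat).
    """
    if not series:
        return None
    latest_key = max(series, key=datetime.fromisoformat)
    return latest_key, series[latest_key]


# Example with made-up values shaped like the "score" field
score = {"2023-05-01T12:00:00": 10, "2023-05-02T08:30:00": 42}
print(latest_observation(score))
```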
@@ -0,0 +1,29 @@
# User

The `User` collection stores information about Reddit users. Each document in this collection has the following schema:

```python
{
    "redditor_id": str,  # Unique identifier for the user
    "name": str,  # User's Reddit username
    "created_at": str,  # Datetime when the user account was created (ISO format)
    "karma": {
        "link": int,  # Link karma
        "total": int,  # Total karma
        "awardee": int,  # Karma received from awards
        "awarder": int,  # Karma awarded to others
        "comment": int  # Comment karma
    },
    "is_gold": bool,  # Whether the user has Reddit Gold
    "is_mod": {
        str: [str, int]  # Dictionary mapping subreddit IDs to [subreddit name, number of subscribers]
    } or None,  # None if the user is not a moderator
    "trophy": {
        "list": list,  # List of trophy names
        "count": int  # Number of trophies
    } or None,  # None if the user has no trophies
    "removed": str  # "active" or "suspended"
}
```

Note: For suspended users, the `redditor_id` is represented as `"suspended:{name}"`.
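
When joining comments or submissions to users, the `"suspended:{name}"` form may need special handling. A minimal sketch, assuming the prefix is exactly `suspended:` as in the note above; `suspended_name` is a hypothetical helper, not part of the documented package.

```python
from typing import Optional

SUSPENDED_PREFIX = "suspended:"


def suspended_name(redditor_id: str) -> Optional[str]:
    """Return the username embedded in a suspended redditor_id, else None."""
    if redditor_id.startswith(SUSPENDED_PREFIX):
        return redditor_id[len(SUSPENDED_PREFIX):]
    return None
```

A `None` result means the ID is not in the suspended form and can be treated as a regular identifier.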
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,7 @@
# Scraping Examples

This section covers scenarios that researchers frequently encounter when collecting Reddit data.

* [Collecting Subreddit-based Data](../hands_on/subreddit_based.md): Provides guidance on collecting data from specific subreddits
* [Collecting Keyword-based Data](../hands_on/keyword_based.md): Covers collecting submissions based on specific keywords
* [Database-Driven Data Collection](../hands_on/database_driven.md): Explains how to leverage an existing database (previously collected Reddit data) to collect additional relevant data
File renamed without changes.
This file was deleted.