# About

## Navigating 🧭 the complexities of APIs and data collection can be daunting, especially for researchers 👨‍💻 with limited coding backgrounds. **RedditHarbor simplifies collecting Reddit data and saving 📥 it to a database**. It **removes the complexity** of working with APIs 🏗️, letting you easily build a "harbor" of data for analysis.

## Overview

![redditharbor_demo](https://github.com/socius-org/RedditHarbor/assets/130935698/7bb4f570-90f7-4e6c-a469-7e8debf9a260)

### Extract, Transform and Load (ETL) Data

RedditHarbor streamlines the ETL process, enabling researchers to efficiently collect, transform, and store Reddit data for analysis.

**Extract**: RedditHarbor connects directly to the Reddit Data API, seamlessly retrieving submissions, comments, and user profiles.

**Transform**: To safeguard user privacy and comply with ethical research practices, RedditHarbor allows researchers to anonymise any personally identifiable information (PII) present in the data.

**Load**: The collected and transformed data is securely stored in a database of your choice, ensuring organised and accessible data management.
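
In RedditHarbor terms, the three stages map onto a handful of calls. Here is a condensed sketch; the full setup, including credentials and table creation, is covered in Getting Started, and the placeholder strings are not real credentials:

```python
import redditharbor.login as login
from redditharbor.dock.pipeline import collect

# Extract: authenticate against the Reddit and Supabase APIs
reddit_client = login.reddit(public_key="...", secret_key="...", user_agent="...")
supabase_client = login.supabase(url="...", private_key="...")

# Load: collected rows are written straight into the configured tables
collect = collect(
    reddit_client=reddit_client,
    supabase_client=supabase_client,
    db_config={"user": "test_redditor", "submission": "test_submission", "comment": "test_comment"},
)
collect.submission_by_keyword(["all"], "data science", limit=5)
```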

### RedditHarbor

[RedditHarbor](https://github.com/socius-org/RedditHarbor/) is designed for researchers who want to focus on their analysis rather than grappling with technical complexities. While third-party API tools such as [PRAW](https://praw.readthedocs.io/en/stable/) offer flexibility for advanced users, RedditHarbor simplifies the process, allowing you to collect the data you need through intuitive commands.

Here's how RedditHarbor empowers your research:

* **✨ Comprehensive Data Collection**: Connect directly to the Reddit Data API and gather submissions, comments, and user profiles with ease.
* **🔒 Privacy-Focused**: Anonymise any personally identifiable information (PII) to protect user privacy and comply with ethical research practices and IRB requirements.
* **📦 Organised Data Storage**: Store your collected data in a secure database that you control, ensuring accessibility and organisation.
* **📈 Scalable and Efficient**: Handle pagination seamlessly, even for large datasets with millions of rows.
* **🕹️ Customisable Collection**: Tailor your data collection to your specific needs by configuring parameters.
* **📂 Analysis-Ready**: Export your database to CSV, JSON, or JPEG formats for effortless integration with your preferred analysis tools.

With RedditHarbor, you can spend less time wrestling with technical hurdles and more time focusing on your research objectives.
# Getting Started

## Setting Up Supabase Tables

First, we need to create three tables in Supabase to store the user, submission, and comment data from Reddit. For testing purposes, we'll name them "test_redditor", "test_submission", and "test_comment".

1. Head to the [Supabase Dashboard](https://app.supabase.com) and open the "SQL Editor" from the sidebar.
2. Click "New Query" to start a new SQL query.
3. Copy and paste the following table creation SQL, then run it:

```sql
-- Create table test_redditor
CREATE TABLE test_redditor (
  redditor_id varchar primary key,
  name varchar,
  created_at timestamptz,
  karma jsonb,
  is_gold boolean,
  is_mod jsonb,
  trophy jsonb,
  removed varchar
);

-- Enable row-level security on test_redditor
ALTER TABLE test_redditor ENABLE ROW LEVEL SECURITY;

-- Create table test_submission
CREATE TABLE test_submission (
  submission_id varchar primary key,
  redditor_id varchar,
  created_at timestamptz,
  title varchar,
  text text,
  subreddit varchar,
  permalink varchar,
  attachment jsonb,
  flair jsonb,
  awards jsonb,
  score jsonb,
  upvote_ratio jsonb,
  num_comments jsonb,
  edited boolean,
  archived boolean,
  removed boolean,
  poll jsonb
);

-- Enable row-level security on test_submission
ALTER TABLE test_submission ENABLE ROW LEVEL SECURITY;

-- Create table test_comment
CREATE TABLE test_comment (
  comment_id varchar primary key,
  link_id varchar,
  subreddit varchar,
  parent_id varchar,
  redditor_id varchar,
  created_at timestamptz,
  body text,
  score jsonb,
  edited boolean,
  removed varchar
);

-- Enable row-level security on test_comment
ALTER TABLE test_comment ENABLE ROW LEVEL SECURITY;
```

This will create the three tables with the necessary columns and data types. Once created, you'll see the new tables in the "Table Editor". In the future, you can duplicate and rename these tables (instead of "test_...") for your production needs.

```{warning}
RedditHarbor depends on predefined column names for all user, submission, and comment tables. To ensure proper functionality, it's crucial to create tables with all the columns specified in the documentation. Missing columns may lead to errors or incomplete data retrieval.
```

## Setting Up for Data Collection

To start collecting Reddit data, create a new Python file (e.g., `run.py`) in your project folder. Running the code directly in a Jupyter Notebook is not recommended, as it may cause errors.

Copy and paste the following code block, which serves as a template to set up RedditHarbor:

```python
import redditharbor.login as login
from redditharbor.dock.pipeline import collect

# Configure authentication
SUPABASE_URL = "<your-supabase-url>"
SUPABASE_KEY = "<your-supabase-api-key>"  # Use the "service_role/secret" key, not "anon/public"
REDDIT_PUBLIC = "<your-reddit-public-key>"
REDDIT_SECRET = "<your-reddit-secret-key>"
REDDIT_USER_AGENT = "<your-reddit-user-agent>"  # Format: <institution:project-name (u/reddit-username)>
# e.g. REDDIT_USER_AGENT = "LondonSchoolofEconomics:ICWSM-tutorial (u/reddit-username)"

# Define database table names
DB_CONFIG = {
    "user": "test_redditor",
    "submission": "test_submission",
    "comment": "test_comment"
}

# Login and create instances of Reddit and Supabase clients
reddit_client = login.reddit(public_key=REDDIT_PUBLIC, secret_key=REDDIT_SECRET, user_agent=REDDIT_USER_AGENT)
supabase_client = login.supabase(url=SUPABASE_URL, private_key=SUPABASE_KEY)

# Initialise an instance of the `collect` class
# (note: this rebinds the name `collect` from the class to the instance,
# which is the pattern used throughout these docs)
collect = collect(reddit_client=reddit_client, supabase_client=supabase_client, db_config=DB_CONFIG)
```

Now you're ready to start collecting Reddit data!
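
As a quick sanity check, you can try a minimal first collection run using `submission_by_keyword`, the same call covered later in the keyword-based collection page (the subreddit and query values below are just placeholders to swap for your own):

```python
# Collect a handful of submissions to confirm the pipeline is wired up correctly
subreddits = ["all"]
query = "data science"
collect.submission_by_keyword(subreddits, query, limit=5)
```

If the call succeeds, the collected rows will appear in the "test_submission" table in the Supabase Table Editor.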
# Scraping Examples

This section covers scenarios researchers frequently encounter when collecting Reddit data:

* [Collecting Subreddit-based Data](../pages/subreddit.md): Provides guidance on collecting data from specific subreddits
* [Collecting Keyword-based Data](../pages/keyword.md): Covers collecting submissions based on specific keywords
* [Database-Driven Data Collection](../pages/database.md): Explains how to leverage an existing database (previously collected Reddit data) to collect additional relevant data
# Database-Driven Data Collection

## Leverage your existing database 📂 to collect additional relevant data, such as comments from specific submissions or user activity.

## Collect Submission Comments

While you cannot directly collect comments containing particular keywords, you can collect comments from submissions that match your keywords of interest. To gather comments from specified submissions, use the following code:

```python
from redditharbor.utils import fetch

fetch_submission = fetch.submission(supabase_client=supabase_client, db_name=DB_CONFIG["submission"])
submission_ids = fetch_submission.id(limit=100)  # Limiting to 100 submission IDs for demonstration. Set limit=None to fetch all submission IDs.

collect.comment_from_submission(submission_ids=submission_ids, level=2)  # Set level=None to collect entire comment threads
```

This will collect comments from the specified 100 submissions up to level 2 (i.e., including replies to top-level comments).

## Collect User Submissions

To collect submissions made by specified users, you'll need to "fetch" user names from your existing database:

```python
from redditharbor.utils import fetch

fetch_user = fetch.user(supabase_client=supabase_client, db_name=DB_CONFIG["user"])
users = fetch_user.name(limit=100)  # This will fetch the first 100 user names from the user database. Set limit=None to fetch all user names.

collect.submission_from_user(users=users, sort_types=["controversial"], limit=10)
```

This will collect the 10 most controversial submissions from the specified users.
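
Since `sort_types` takes a list, several sort orders can presumably be combined in a single call. The sketch below assumes that combining is supported; "controversial" and "new" are the only sort values that appear elsewhere in these docs:

```python
# Assumption: sort_types accepts multiple sort orders at once
collect.submission_from_user(users=users, sort_types=["controversial", "new"], limit=10)
```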

## Collect User Comments

To collect comments made by specified users, use:

```python
collect.comment_from_user(users=users, sort_types=["new"], limit=10)
```

This will collect the 10 most recent comments from the specified users.
# Downloading Data

## Seamlessly download 💾 the data you need, in CSV, JSON, text, or even image formats.

## Downloading Submissions

To download submission data, simply follow these steps:

```python
from redditharbor.utils import download

# Bind the submission downloader to its own name, keeping the `download`
# module available for the comment and user examples below
downloader = download.submission(supabase_client, DB_CONFIG["submission"])
downloader.to_csv(columns="all", file_name="submission", file_path="<your-folder-name>")
```

This will save all columns from the submission table to a `submission.csv` file in the specified folder. You can also customise the output by specifying columns and file formats:

```python
cols = ["submission_id", "title", "score"]
downloader.to_json(columns=cols, file_name="submission", file_path="<your-folder-name>")
```

This will save the "submission_id", "title", and "score" columns from the submission table to a `submission.json` file in the specified folder.

## Downloading Images from Submissions

To download image files from the submission data, use:

```python
downloader = download.submission(supabase_client, DB_CONFIG["submission"])
downloader.to_img(file_path="<your-folder-name>")
```

This will save all `.jpg` and `.png` files attached to the stored submissions in the specified folder.

## Downloading Comments

Extracting comment data is just as straightforward:

```python
downloader = download.comment(supabase_client, DB_CONFIG["comment"])
downloader.to_csv(columns="all", file_name="comment", file_path="<your-folder-name>")
```

## Downloading User Data

And for user data:

```python
downloader = download.user(supabase_client, DB_CONFIG["user"])
downloader.to_csv(columns="all", file_name="user", file_path="<your-folder-name>")
```
# Installation

To begin, install the RedditHarbor package using pip in your terminal:

```bash
pip install redditharbor
```

Additionally, run `pip install "redditharbor[pii]"` to enable anonymising personally identifiable information (PII) in the collected data (the quotes keep shells such as zsh from interpreting the square brackets).

This will download the latest version and install the necessary dependencies. To upgrade an existing installation to the latest version:

```bash
pip install --upgrade redditharbor
```
# Collecting Keyword-based Data

## Collect submissions based on specific keywords 🔍 from your desired subreddits.

## Collect Submissions

To collect submissions containing particular keywords from *all* possible subreddits, use the following code:

```python
subreddits = ["all"]
query = "data science"
collect.submission_by_keyword(subreddits, query, limit=5)
```

You can also collect submissions containing particular keywords from specified subreddits:

```python
subreddits = ["python", "learnpython"]
query = "data science"
collect.submission_by_keyword(subreddits, query, limit=5)
```

This example collects the 5 *most relevant* submissions from the subreddits r/python and r/learnpython that contain the keyword "data science".

You can customise your search using boolean operators:

- `AND`: Requires all words to be present (e.g., "energy AND oil" returns results with both "energy" and "oil")
- `OR`: Requires at least one word to match (e.g., "energy OR oil" returns results with either "energy" or "oil")
- `NOT`: Excludes results with a word (e.g., "energy NOT oil" returns results with "energy" but without "oil")
- `()`: Groups parts of the query

When using multiple boolean operators, you may sometimes get unexpected results. To control the logic flow, use parentheses to group clauses. For example, "renewable energy NOT fossil fuels OR oil OR gas" returns very different results than "renewable energy NOT (fossil fuels OR oil OR gas)".
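
As a sketch, the grouped version of that query can be passed directly as the search string:

```python
# Parentheses keep the NOT scoped to the whole group of excluded terms
subreddits = ["all"]
query = "renewable energy NOT (fossil fuels OR oil OR gas)"
collect.submission_by_keyword(subreddits, query, limit=5)
```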

## Collect Comments

Unfortunately, Reddit's Data API does not currently support searching comments based on keywords. However, RedditHarbor provides other powerful features for collecting relevant comment data, which we'll explore in the next section.
# Prerequisites

For a smooth experience during the ICWSM tutorial, please ensure you have the following prerequisites set up beforehand. We have tried to provide clear and easy-to-follow instructions to make the process as straightforward as possible.

## 👨‍💻 Reddit API

1. **Create a Reddit Account**: You will need a Reddit account to access the Reddit API. If you don't have one already, head over to [reddit.com](https://www.reddit.com/) and sign up for a new account.

2. **Register as a Developer**: Follow [Reddit's API guide](https://www.reddit.com/wiki/api/) to [register as a developer](https://reddithelp.com/hc/en-us/requests/new?ticket_form_id=14868593862164). This step is necessary to create a script app and obtain the required credentials for API access.
<br>
```{image} ../images/reddit_register.png
:width: 400px
:align: center
```
<br>

3. **Create a Script App**: Once registered as a developer, create a [script app](https://old.reddit.com/prefs/apps/). This will provide you with the `PUBLIC_KEY` and `SECRET_KEY` credentials needed to authenticate with the Reddit API during the tutorial.
<br>
```{image} ../images/reddit_createapp.png
:width: 400px
:align: center
```
<br>

## 📦 Supabase API

1. **Sign Up for Supabase**: Visit [supabase.com](https://supabase.com/) and sign up for a new account. This will allow you to create a project and obtain the necessary credentials for storing the Reddit data.

2. **Create a New Project**: After signing up, create a new project in Supabase. This will generate a database `URL` and a `SECRET_KEY` (`service_role`) for your project.
<br>
```{image} ../images/supabase_createnewproject.png
:width: 400px
:align: center
```
<br>

3. **Access Credentials**: Find the database `URL` and `SECRET_KEY` in the "Project Settings > Configuration > API" section. You will need these credentials to connect and store the Reddit data during the tutorial.
<br>
```{image} ../images/supabase_apikey.png
:width: 400px
:align: center
```
<br>

## 🖥️ Environment Setup

1. **Install Visual Studio Code (Recommended)**: We recommend [installing Visual Studio Code](https://code.visualstudio.com/download), a popular and user-friendly code editor. Once installed, make sure to get the Python extension for full support in running and editing Python apps.

   Alternatively, you can use your preferred code editor or IDE, but please note that Jupyter Notebook is not the ideal workspace for running RedditHarbor.

2. **Install Python**: Install a supported version of Python on your system:
   - **Windows**: [Install Python from python.org](https://www.python.org/downloads/). Use the "Download Python" button that appears first on the page to download the latest version.
   - **macOS**: The system install of Python on macOS is not supported. Instead, we recommend using a package management system like [Homebrew](https://brew.sh/). To install Python with Homebrew, run `brew install python3` in the Terminal.

3. **Install the Python Extension (for Visual Studio Code users)**: If you're using Visual Studio Code, open the Extensions view from the sidebar (or press `Ctrl+Shift+X`), search for "python" in the Extensions Marketplace, and install the Python extension.
## 🔣 Command Prompt (Windows Users) | ||
|
||
If you're a Windows user, we recommend using Git Bash, one of the best command prompts for a Linux-style command-line experience. Follow these steps: | ||
|
||
1. [Download Git Bash](https://gitforwindows.org/) | ||
2. Follow the setup wizard, selecting all the default options | ||
3. At the "Adjusting your PATH environment" step, select the "Use Git from the Windows Command Prompt" option | ||
4. Once installed, you will have access to Git Bash, which provides Linux-style command-line utilities and Git functionality in Windows. | ||
|
||
If you have any questions or encounter any difficulties during the setup process, please don't hesitate to reach out to us. We're here to ensure a smooth and enjoyable tutorial experience for everyone. |