Commit

Update documentation
nick-sh-oh committed Apr 9, 2024
0 parents commit dcf994d
Showing 177 changed files with 28,746 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: c980c25411b2f88e2de822371e07e90a
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
537 changes: 537 additions & 0 deletions Introduction.html

Large diffs are not rendered by default.

Binary file added _images/reddit_createapp.png
Binary file added _images/reddit_register.png
Binary file added _images/supabase_apikey.png
Binary file added _images/supabase_createnewproject.png
32 changes: 32 additions & 0 deletions _sources/Introduction.md
# About

## Navigating 🧭 the complexities of APIs and data collection can be a daunting task, especially for researchers 👨‍💻 with limited coding backgrounds. **RedditHarbor simplifies collecting Reddit data and saving 📥 it to a database**. It **removes the complexity** of working with APIs 🏗️, letting you easily build a "harbor" of data for analysis.

## Overview

![redditharbor_demo](https://github.com/socius-org/RedditHarbor/assets/130935698/7bb4f570-90f7-4e6c-a469-7e8debf9a260)

### Extract, Transform and Load (ETL) Data

RedditHarbor streamlines the ETL (Extract, Transform, Load) process, enabling researchers to efficiently collect and store transformed Reddit data for analysis.

**Extract**: RedditHarbor connects directly to the Reddit Data API, seamlessly retrieving submissions, comments, and user profiles.

**Transform**: To safeguard user privacy and comply with ethical research practices, RedditHarbor allows researchers to anonymise any personally identifiable information (PII) present in the data.

**Load**: The collected and transformed data is securely stored in a database of your choice, ensuring organised and accessible data management.

### RedditHarbor

[RedditHarbor](https://github.com/socius-org/RedditHarbor/) is designed for researchers who want to focus on their analysis rather than grappling with technical complexities. While third-party API tools, such as [PRAW](https://praw.readthedocs.io/en/stable/), offer flexibility for advanced users, RedditHarbor simplifies the process, allowing you to effortlessly collect the data you need through intuitive commands.

Here's how RedditHarbor empowers your research:

* **✨ Comprehensive Data Collection**: Connect directly to the Reddit Data API and gather submissions, comments, and user profiles with ease.
* **🔒 Privacy-Focused**: Anonymise any personally identifiable information (PII) to protect user privacy and comply with ethical research practices and IRB requirements.
* **📦 Organised Data Storage**: Store your collected data in a secure database that you control, ensuring accessibility and organisation.
* **📈 Scalable and Efficient**: Handle pagination seamlessly, even for large datasets with millions of rows.
* **🕹️ Customisable Collection**: Tailor your data collection to your specific needs by configuring parameters.
* **📂 Analysis-Ready**: Export your database to CSV, JSON, or image (JPG/PNG) formats for effortless integration with your preferred analysis tools.

With RedditHarbor, you can spend less time wrestling with technical hurdles and more time focusing on your research objectives.
108 changes: 108 additions & 0 deletions _sources/pages/GettingStarted.md
# Getting Started

## Setting Up Supabase Tables

We need to create three tables in Supabase to store the user, submission, and comment data from Reddit. For testing purposes, we'll name them "test_redditor", "test_submission", and "test_comment".

1. Head to the [Supabase Dashboard](https://app.supabase.com) and open the "SQL Editor" from the sidebar.
2. Click "New Query" to start a new SQL query.
3. Copy and paste the following table creation SQL, then run it:

```sql
-- Create table test_redditor
CREATE TABLE test_redditor (
redditor_id varchar primary key,
name varchar,
created_at timestamptz,
karma jsonb,
is_gold boolean,
is_mod jsonb,
trophy jsonb,
removed varchar
);

-- Enable row-level security on test_redditor
ALTER TABLE test_redditor ENABLE ROW LEVEL SECURITY;

-- Create table test_submission
CREATE TABLE test_submission (
submission_id varchar primary key,
redditor_id varchar,
created_at timestamptz,
title varchar,
text text,
subreddit varchar,
permalink varchar,
attachment jsonb,
flair jsonb,
awards jsonb,
score jsonb,
upvote_ratio jsonb,
num_comments jsonb,
edited boolean,
archived boolean,
removed boolean,
poll jsonb
);

-- Enable row-level security on test_submission
ALTER TABLE test_submission ENABLE ROW LEVEL SECURITY;

-- Create table test_comment
CREATE TABLE test_comment(
comment_id varchar primary key,
link_id varchar,
subreddit varchar,
parent_id varchar,
redditor_id varchar,
created_at timestamptz,
body text,
score jsonb,
edited boolean,
removed varchar
);

-- Enable row-level security on test_comment
ALTER TABLE test_comment ENABLE ROW LEVEL SECURITY;
```

This will create the three tables with the necessary columns and data types. Once created, you'll see the new tables available in the "Table Editor". In the future, you can duplicate and rename these tables (instead of "test_...") for your production needs.

```{warning}
The RedditHarbor package depends on predefined column names for all user, submission, and comment tables. To ensure proper functionality, it's crucial to create tables with all the specified columns mentioned in the documentation. Failure to do so may lead to errors or incomplete data retrieval.
```
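
Once you have a Supabase client (the Getting Started template below shows how to create one), you can sanity-check that all three tables respond to a trivial query before collecting any data. The helper below is our own sketch, not part of RedditHarbor; it only assumes supabase-py's `table(...).select(...).limit(...).execute()` call chain:

```python
def tables_exist(client, names):
    """Return the subset of `names` that responds to a trivial select.

    `client` is any object exposing supabase-py's
    `table(name).select(...).limit(...).execute()` chain, such as the
    client returned by `login.supabase(...)` in the next section.
    """
    ok = []
    for name in names:
        try:
            client.table(name).select("*").limit(1).execute()
            ok.append(name)
        except Exception:
            # A failing query usually means the table is missing or misnamed
            pass
    return ok

# Hypothetical usage with the `supabase_client` from the template below:
# print(tables_exist(supabase_client, ["test_redditor", "test_submission", "test_comment"]))
```

If any table name is missing from the result, re-run the SQL above and check for typos before proceeding.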

## Setting Up for Data Collection

To start collecting Reddit data, create a new Python file in your folder (e.g., `run.py`). Running the code directly in Jupyter Notebook is not recommended, as it may cause errors.

Copy and paste the following code block, which serves as a template to set up RedditHarbor:

```python
import redditharbor.login as login
from redditharbor.dock.pipeline import collect

# Configure authentication
SUPABASE_URL = "<your-supabase-url>"
SUPABASE_KEY = "<your-supabase-api-key>" # Use "service_role/secret" key, not "anon/public"
REDDIT_PUBLIC = "<your-reddit-public-key>"
REDDIT_SECRET = "<your-reddit-secret-key>"
REDDIT_USER_AGENT = "<your-reddit-user-agent>" # Format: <institution:project-name (u/reddit-username)>
# e.g. REDDIT_USER_AGENT = "LondonSchoolofEconomics:ICWSM-tutorial (u/reddit-username)"

# Define database table names
DB_CONFIG = {
"user": "test_redditor",
"submission": "test_submission",
"comment": "test_comment"
}

# Login and create instances of Reddit and Supabase clients
reddit_client = login.reddit(public_key=REDDIT_PUBLIC, secret_key=REDDIT_SECRET, user_agent=REDDIT_USER_AGENT)
supabase_client = login.supabase(url=SUPABASE_URL, private_key=SUPABASE_KEY)

# Initialise an instance of the `collect` class
collect = collect(reddit_client=reddit_client, supabase_client=supabase_client, db_config=DB_CONFIG)
```

Now you're ready to start collecting Reddit data!
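
A side note on the credentials above: hardcoding secrets in `run.py` makes them easy to leak via version control. A minimal sketch of reading the same values from environment variables instead (set e.g. with `export SUPABASE_URL=...` in your shell); the variable names are our own convention, not something RedditHarbor requires:

```python
import os

# Read credentials from environment variables instead of hardcoding them.
# Falls back to the same placeholders as the template if a variable is unset.
SUPABASE_URL = os.environ.get("SUPABASE_URL", "<your-supabase-url>")
SUPABASE_KEY = os.environ.get("SUPABASE_KEY", "<your-supabase-api-key>")
REDDIT_PUBLIC = os.environ.get("REDDIT_PUBLIC", "<your-reddit-public-key>")
REDDIT_SECRET = os.environ.get("REDDIT_SECRET", "<your-reddit-secret-key>")
REDDIT_USER_AGENT = os.environ.get("REDDIT_USER_AGENT", "<your-user-agent>")

# Warn about any placeholders still in use before attempting to log in
missing = [v for v in ("SUPABASE_URL", "SUPABASE_KEY", "REDDIT_PUBLIC",
                       "REDDIT_SECRET", "REDDIT_USER_AGENT")
           if os.environ.get(v) is None]
if missing:
    print("Still using placeholders for:", ", ".join(missing))
```
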
7 changes: 7 additions & 0 deletions _sources/pages/ScrapingExamples.md
# Scraping Examples

This section covers scenarios researchers frequently encounter when collecting Reddit data.

* [Collecting Subreddit-based Data](../pages/subreddit.md): Provides guidance on collecting data from specific subreddits
* [Collecting Keyword-based Data](../pages/keyword.md): Covers collecting submissions based on specific keywords
* [Database-Driven Data Collection](../pages/database.md): Explains how to leverage an existing database (previously collected Reddit data) to collect additional relevant data
43 changes: 43 additions & 0 deletions _sources/pages/database.md
# Database-Driven Data Collection

## Leverage your existing database 📂 to collect additional relevant data, such as comments from specific submissions or user activity.

## Collect Submission Comments

While you cannot directly collect comments containing particular keywords, you can collect comments from submissions that match your keywords of interest. To gather comments from specified submissions, use the following code:

```python
from redditharbor.utils import fetch

fetch_submission = fetch.submission(supabase_client=supabase_client, db_name=db_config["submission"])
submission_ids = fetch_submission.id(limit=100) # Limiting to 100 submission IDs for demonstration. Set limit=None to fetch all submission IDs.

collect.comment_from_submission(submission_ids=submission_ids, level=2) # Set level=None to collect entire comment threads
```

This will collect comments from the specified 100 submissions up to level 2 (i.e., top-level comments and their direct replies).

## Collect User Submissions

To collect submissions made by specified users, you'll need to "fetch" user names from your existing database:

```python
from redditharbor.utils import fetch

fetch_user = fetch.user(supabase_client=supabase_client, db_name=DB_CONFIG["user"])
users = fetch_user.name(limit=100) # This will fetch the first 100 user names from the user database. Set limit=None to fetch all user names.

collect.submission_from_user(users=users, sort_types=["controversial"], limit=10)
```

This will collect the 10 most controversial submissions from the specified users.

## Collect User Comments

To collect comments made by specified users, use:

```python
collect.comment_from_user(users=users, sort_types=["new"], limit=10)
```

This will collect the 10 most recent comments from the specified users.
52 changes: 52 additions & 0 deletions _sources/pages/download.md
# Downloading Data

## Seamlessly download 💾 the data you need, in CSV, JSON, text, or even image formats.

## Downloading Submissions

To download submission data, simply follow these steps:

```python
from redditharbor.utils import download

download = download.submission(supabase_client, DB_CONFIG["submission"])
download.to_csv(columns="all", file_name="submission", file_path="<your-folder-name>")
```

This will save all columns from the submission table to a `submission.csv` file in the specified folder. You can also customise the output by specifying columns and file formats:

```python
cols = ["submission_id", "title", "score"]
download.to_json(columns=cols, file_name="submission", file_path="<your-folder-name>")
```

This will save the "submission_id", "title", and "score" columns from the submission table to a `submission.json` file in the specified folder.
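
Once exported, the files plug directly into standard analysis tooling. A small sketch of loading an exported CSV using only the standard library (the folder and file names are hypothetical, matching the example above):

```python
import csv

def load_rows(path):
    """Load a CSV exported by `download.to_csv` into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Hypothetical usage after exporting submission.csv:
# rows = load_rows("your-folder-name/submission.csv")
# print(len(rows), rows[0]["title"])
```
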

## Downloading Images from Submissions

To download image files from the submission data, use:

```python
download = download.submission(supabase_client, DB_CONFIG["submission"])
download.to_img(file_path="<your-folder-name>")
```

This will save all `.jpg` and `.png` files associated with the submissions table in the specified folder directory.

## Downloading Comments

Extracting comment data is just as straightforward:

```python
download = download.comment(supabase_client, DB_CONFIG["comment"])
download.to_csv(columns="all", file_name="comment", file_path="<your-folder-name>")
```

## Downloading User Data

And for user data:

```python
download = download.user(supabase_client, DB_CONFIG["user"])
download.to_csv(columns="all", file_name="user", file_path="<your-folder-name>")
```
15 changes: 15 additions & 0 deletions _sources/pages/install.md
# Installation

To begin, install the RedditHarbor package using pip in your terminal:

```bash
pip install redditharbor
```

Additionally, run `pip install redditharbor[pii]` to enable anonymising any personally identifiable information (PII) from the collected data.

This will download the latest version and install the necessary dependencies. To upgrade an older installation to the latest version:

```bash
pip install --upgrade redditharbor
```
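
To confirm which version ended up installed, you can query package metadata from the standard library; `installed_version` is a helper of our own, not part of RedditHarbor:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# print(installed_version("redditharbor"))  # a version string after a successful install
```
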
36 changes: 36 additions & 0 deletions _sources/pages/keyword.md
# Collecting Keyword-based Data

## Collect submissions based on specific keywords 🔍 from your desired subreddits.

## Collect Submissions

To collect submissions containing particular keywords from *all* possible subreddits, use the following code:

```python
subreddits = ["all"]
query = "data science"
collect.submission_by_keyword(subreddits, query, limit=5)
```

You can also collect submissions containing particular keywords from specified subreddits:

```python
subreddits = ["python", "learnpython"]
query = "data science"
collect.submission_by_keyword(subreddits, query, limit=5)
```

This example collects the 5 *most relevant* submissions from the subreddits r/python and r/learnpython that contain the keyword "data science."

You can customise your search using boolean operators:

- `AND`: Requires all words to be present (e.g., "energy AND oil" returns results with both "energy" and "oil")
- `OR`: Requires at least one word to match (e.g., "energy OR oil" returns results with either "energy" or "oil")
- `NOT`: Excludes results with a word (e.g., "energy NOT oil" returns results with "energy" but without "oil")
- `()`: Groups parts of the query

When using multiple boolean operators, you may sometimes get unexpected results. To control the logic flow, use parentheses to group clauses. For example, "renewable energy NOT fossil fuels OR oil OR gas" returns very different results than "renewable energy NOT (fossil fuels OR oil OR gas)".
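
To see concretely why the grouping matters, here is a toy evaluator for this operator syntax, written by us purely for illustration. Reddit's actual search engine is far more sophisticated; this matcher only checks word presence, with `OR` binding loosest, then `AND` (explicit or implicit between adjacent words), then `NOT`:

```python
def tokenize(query):
    """Split a query into words, operators, and parentheses."""
    return query.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Recursive-descent parser: OR (lowest) < AND/implicit < NOT (highest)."""
    pos = 0

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            node = ("or", node, term())
        return node

    def term():
        nonlocal pos
        node = factor()
        # Adjacent words (or an explicit AND) combine conjunctively
        while pos < len(tokens) and tokens[pos] not in ("OR", ")"):
            if tokens[pos] == "AND":
                pos += 1
            node = ("and", node, factor())
        return node

    def factor():
        nonlocal pos
        if tokens[pos] == "NOT":
            pos += 1
            return ("not", factor())
        if tokens[pos] == "(":
            pos += 1
            node = expr()
            pos += 1  # consume the closing ")"
            return node
        word = tokens[pos].lower()
        pos += 1
        return ("word", word)

    return expr()

def matches(query, text):
    """True if `text` satisfies `query` under simple word-presence semantics."""
    words = set(text.lower().split())

    def ev(node):
        kind = node[0]
        if kind == "word":
            return node[1] in words
        if kind == "not":
            return not ev(node[1])
        if kind == "or":
            return ev(node[1]) or ev(node[2])
        return ev(node[1]) and ev(node[2])

    return ev(parse(tokenize(query)))
```

Under these rules, `matches("renewable energy NOT fossil fuels OR oil OR gas", "renewable energy from oil")` is `True`, because the standalone `OR oil` clause succeeds on its own, while the grouped `matches("renewable energy NOT (fossil fuels OR oil OR gas)", "renewable energy from oil")` is `False`.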

## Collect Comments

Unfortunately, Reddit's Data API does not currently support searching comments based on keywords. However, RedditHarbor provides other powerful features for collecting relevant comment data, which we'll explore in the next section.
66 changes: 66 additions & 0 deletions _sources/pages/prerequisites.md
# Prerequisites

For a smooth experience during the ICWSM tutorial, please ensure you have the following prerequisites set up beforehand. We have tried to provide clear and easy-to-follow instructions to make the process as straightforward as possible.

## 👨‍💻 Reddit API

1. **Create a Reddit Account**: You will need a Reddit account to access the Reddit API. If you don't have one already, head over to [reddit.com](https://www.reddit.com/) and sign up for a new account.

2. **Register as a Developer**: Follow [Reddit's API guide](https://www.reddit.com/wiki/api/) to [register as a developer](https://reddithelp.com/hc/en-us/requests/new?ticket_form_id=14868593862164). This step is necessary to create a script app and obtain the required credentials for API access.
<br>
```{image} ../images/reddit_register.png
:width: 400px
:align: center
```
<br>

3. **Create a Script App**: Once registered as a developer, create a [script app](https://old.reddit.com/prefs/apps/). This will provide you with the `PUBLIC_KEY` and `SECRET_KEY` credentials needed to authenticate with the Reddit API during the tutorial.
<br>
```{image} ../images/reddit_createapp.png
:width: 400px
:align: center
```
<br>

## 📦 Supabase API

1. **Sign Up for Supabase**: Visit [supabase.com](https://supabase.com/) and sign up for a new account. This will allow you to create a project and obtain the necessary credentials for storing the Reddit data.

2. **Create a New Project**: After signing up, create a new project in Supabase. This will generate a database `URL` and a `SECRET_KEY` (`service_role`) for your project.
<br>
```{image} ../images/supabase_createnewproject.png
:width: 400px
:align: center
```
<br>

3. **Access Credentials**: Find the database `URL` and `SECRET_KEY` in the "Project Settings > Configuration > API" section. You will need these credentials to connect to your database and store the Reddit data during the tutorial.
<br>
```{image} ../images/supabase_apikey.png
:width: 400px
:align: center
```
<br>

## 🖥️ Environment Setup

1. **Install Visual Studio Code (Recommended)**: We recommend [installing Visual Studio Code](https://code.visualstudio.com/download), a popular and user-friendly code editor. Once installed, make sure to get the Python extension for full support in running and editing Python apps.

Alternatively, you can use your preferred code editor or IDE, but please note that Jupyter Notebook is not the ideal workspace for running RedditHarbor.

2. **Install Python**: Install a supported version of Python on your system:
- **Windows**: [Install Python from python.org](https://www.python.org/downloads/). Use the "Download Python" button that appears first on the page to download the latest version.
- **macOS**: The system install of Python on macOS is not supported. Instead, we recommend using a package management system like [Homebrew](https://brew.sh/). To install Python using Homebrew on macOS, run `brew install python3` in the Terminal.

3. **Install Python Extension (for Visual Studio Code users)**: If you're using Visual Studio Code, open the Extensions view from the sidebar (or press `Ctrl+Shift+X`), search for "python" in the Extensions Marketplace, and install the Python extension.

## 🔣 Command Prompt (Windows Users)

If you're a Windows user, we recommend using Git Bash, which provides a Linux-style command-line experience on Windows. Follow these steps:

1. [Download Git Bash](https://gitforwindows.org/)
2. Follow the setup wizard, selecting all the default options
3. At the "Adjusting your PATH environment" step, select the "Use Git from the Windows Command Prompt" option
4. Once installed, you will have access to Git Bash, which provides Linux-style command-line utilities and Git functionality in Windows.

If you have any questions or encounter any difficulties during the setup process, please don't hesitate to reach out to us. We're here to ensure a smooth and enjoyable tutorial experience for everyone.