ScreenshotToR2-Node-Crawler

ScreenshotToR2 is a Node.js crawler that runs on AWS Lambda. It collects page information and captures screenshots for an array of URLs, then uploads the screenshots to Cloudflare R2, all within a few minutes.

🚀 Features

URL Accessibility Check: Test if the input URL is still accessible.
SEO & AI Summary Ready: If the URL is accessible, fetch the h1 and h2 headings for later AI summaries or SEO reference, and detect affiliate links.
Screenshot Capture & Storage: Take a screenshot of the webpage and upload it to Cloudflare R2.
Free Tier Utilization: Make the most of AWS Lambda's free tier (even for new users) and Cloudflare R2's free storage for screenshots.
Step Function Integration: If you have many URLs, you can use AWS Step Functions to integrate with this Lambda function for enhanced workflow management.
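
To illustrate the heading and affiliate-link step listed above, here is a minimal sketch that works on a raw HTML string. It is not taken from the project source: the function names, the regex-based parsing, and the AFFILIATE_HINTS list are all assumptions, and a real crawler would more likely read headings from inside the headless browser page.

```javascript
// Hypothetical helpers; not part of the project source.
// URL fragments that often indicate an affiliate link (assumed list).
const AFFILIATE_HINTS = ["tag=", "ref=", "aff_id=", "utm_medium=affiliate"];

// Return the text of every <h1> and <h2> element in an HTML string.
function extractHeadings(html) {
  const headings = { h1: [], h2: [] };
  const pattern = /<(h1|h2)\b[^>]*>([\s\S]*?)<\/\1>/gi;
  for (const match of html.matchAll(pattern)) {
    const level = match[1].toLowerCase();
    // Strip nested tags and collapse whitespace.
    const text = match[2].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
    if (text) headings[level].push(text);
  }
  return headings;
}

// Return every href that contains one of the assumed affiliate markers.
function findAffiliateLinks(html) {
  const links = [];
  for (const match of html.matchAll(/<a\b[^>]*\bhref=["']([^"']+)["']/gi)) {
    const href = match[1];
    if (AFFILIATE_HINTS.some((hint) => href.includes(hint))) links.push(href);
  }
  return links;
}
```

Regex parsing is only good enough for a quick check; malformed or script-generated markup needs a real DOM.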

📊 Free Tier Details

Service	Free Tier Details	Link
Cloudflare R2	10 GB storage, 1 million Class A operations, and 10 million Class B operations per month. Egress (data transfer to Internet) is free.	Cloudflare R2 Pricing
AWS Lambda	1 million free requests and 400,000 GB-seconds of compute time per month.	AWS Lambda Free Tier
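
As a rough guide to what the Lambda allowance covers, the sketch below converts GB-seconds into a monthly invocation count. The memory size and average duration are placeholder assumptions, not measurements from this project; substitute the values from your own function.

```javascript
// Hypothetical estimate; memory size and duration are assumptions, not measurements.
const FREE_GB_SECONDS = 400000; // monthly Lambda compute allowance from the table above
const FREE_REQUESTS = 1000000;  // monthly Lambda request allowance from the table above

// How many invocations fit in the free tier for a given memory size and average duration.
function freeInvocationsPerMonth(memoryMb, avgDurationSeconds) {
  const gbSecondsPerInvocation = (memoryMb / 1024) * avgDurationSeconds;
  const byCompute = Math.floor(FREE_GB_SECONDS / gbSecondsPerInvocation);
  return Math.min(byCompute, FREE_REQUESTS);
}

// Example: 2048 MB and 10 s per invocation is 20 GB-seconds each.
console.log(freeInvocationsPerMonth(2048, 10)); // 20000
```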

📚 Getting Started

Prerequisites

Node.js (version 20.x or higher)
Serverless Framework CLI - npm install -g serverless
AWS account with permissions to create Lambda functions (Sign up for AWS)
Cloudflare account for R2 (Sign up for Cloudflare)

Installation

git clone https://github.com/yourusername/ScreenshotToR2.git
cd ScreenshotToR2
npm install

Environment Variables Setup

Create a .env file based on .env_example and fill in your credentials.
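
For orientation, here is a sketch of how R2 credentials usually map onto an S3-compatible client configuration. The variable names (R2_ACCOUNT_ID, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_BUCKET_NAME) and the buildR2Config helper are assumptions for illustration; check .env_example for the names this project actually expects.

```javascript
// Hypothetical variable names; check .env_example for the real ones.
const REQUIRED_VARS = ["R2_ACCOUNT_ID", "R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY", "R2_BUCKET_NAME"];

// Build an S3-compatible client configuration for Cloudflare R2 from environment variables.
function buildR2Config(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    bucket: env.R2_BUCKET_NAME,
    clientOptions: {
      region: "auto", // R2 uses "auto" as its S3 region
      endpoint: `https://${env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
      credentials: {
        accessKeyId: env.R2_ACCESS_KEY_ID,
        secretAccessKey: env.R2_SECRET_ACCESS_KEY,
      },
    },
  };
}
```

The clientOptions object has the shape accepted by the S3Client constructor in @aws-sdk/client-s3, assuming that is the SDK in use. Failing fast on missing variables gives a clearer error than a rejected upload later.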

Deployment

Configure your AWS credentials

serverless config credentials --provider aws --key YOUR_ACCESS_KEY --secret YOUR_SECRET_KEY

Alternatively, if you already have an AWS profile set up, you can specify it in your serverless.yml:

provider:
  name: aws
  runtime: nodejs20.x
  profile: your-aws-profile
  region: us-east-1

Deploy the service:

serverless deploy

📚 Usage

Invoke the function with an array of URLs to capture screenshots and upload them to Cloudflare R2. Each entry needs a url and a name. Save a payload like the following as data.json:

{
  "urls": [
    { "url": "https://example.com", "name": "example" },
    { "url": "https://another-example.com", "name": "another-example" }
  ]
}

Then invoke the deployed function, passing the payload file:

serverless invoke -f screencapturesToR2 -p data.json
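
If you want to catch malformed payloads before any browser work starts, a check along these lines can help. The parseUrlEntries helper is hypothetical and not part of the project; the real handler may validate its input differently.

```javascript
// Hypothetical validation helper; the real handler may check its input differently.
function parseUrlEntries(event) {
  if (!event || !Array.isArray(event.urls)) {
    throw new Error('Expected an event of the form { "urls": [...] }');
  }
  return event.urls.map((entry, index) => {
    if (!entry || typeof entry.url !== "string" || typeof entry.name !== "string") {
      throw new Error(`Entry ${index} must have string "url" and "name" fields`);
    }
    // new URL() throws a TypeError if the string is not an absolute URL.
    const parsed = new URL(entry.url);
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      throw new Error(`Entry ${index} must use http or https`);
    }
    return { url: parsed.href, name: entry.name };
  });
}
```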

⚠️ TypeScript Note

I opted not to use TypeScript in this project because of compatibility issues with the @sparticuz/chromium package, which can cause errors at build or run time. For more details, refer to the discussion here.

📂 Explanation of Batching Logic

The batching logic (results.length >= BATCH_SIZE) is optional, but it helps manage memory usage and lets the function handle large sets of URLs efficiently.
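
The pattern can be sketched roughly as follows. Only the results.length >= BATCH_SIZE check comes from the project; processUrl, flushBatch, and the default batch size of 10 are stand-ins, so treat this as an outline under those assumptions.

```javascript
// Hypothetical sketch of the batching pattern; processUrl and flushBatch are stand-ins.
const BATCH_SIZE = 10; // assumed default, not the project's actual value

async function processInBatches(entries, processUrl, flushBatch, batchSize = BATCH_SIZE) {
  let results = [];
  let total = 0;
  for (const entry of entries) {
    results.push(await processUrl(entry));
    // Flush once the buffer is full so results do not pile up in memory.
    if (results.length >= batchSize) {
      await flushBatch(results);
      total += results.length;
      results = [];
    }
  }
  // Flush whatever is left after the loop.
  if (results.length > 0) {
    await flushBatch(results);
    total += results.length;
  }
  return total;
}
```

Without the threshold check, every result would stay in memory until the final flush, which is why batching matters more as the URL list grows.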
