heyacherry/ScreenshotToR2-Node-Crawler

ScreenshotToR2-Node-Crawler

ScreenshotToR2 is a Node.js crawler that runs on AWS Lambda. Given an array of URLs, it collects page information, captures a screenshot of each page, and uploads the screenshots to Cloudflare R2 in a few minutes.

🚀 Features

  • URL Accessibility Check: Test if the input URL is still accessible.
  • SEO & AI Summary Ready: If the URL is accessible, extract the h1 and h2 headings for later AI summaries or SEO reference, and detect affiliate links.
  • Screenshot Capture & Storage: Take a screenshot of the webpage and upload it to Cloudflare R2.
  • Free Tier Utilization: Make full use of AWS Lambda's free tier (even for new users) and Cloudflare R2's free storage for screenshots.
  • Step Function Integration: If you have many URLs, you can use AWS Step Functions to integrate with this Lambda function for enhanced workflow management.
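
The first two features can be sketched with plain Node.js helpers. This is a minimal illustration, not the repository's actual code: the function names, the regex-based parsing, and the list of affiliate-style query parameters are all assumptions made for the example.

```javascript
// Sketch only: helper names and the affiliate heuristic are assumptions,
// not the repository's actual implementation.

// Query parameters that often mark affiliate links (hypothetical list).
const AFFILIATE_PARAMS = ["ref", "aff", "aff_id", "affiliate", "tag"];

// Returns true when the URL answers with a 2xx status within the timeout.
// Uses the global fetch available in Node.js 18+.
async function isAccessible(url, timeoutMs = 10000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false; // DNS failure, timeout, TLS error, and similar
  }
}

// Strips nested tags and collapses whitespace inside a heading.
function textOf(innerHtml) {
  return innerHtml.replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
}

// Collects the text of every <h1> and <h2> element with a simple regex.
// A real crawler would more likely query the DOM through the browser page.
function extractHeadings(html) {
  const headings = { h1: [], h2: [] };
  const pattern = /<(h1|h2)\b[^>]*>([\s\S]*?)<\/\1>/gi;
  for (const match of html.matchAll(pattern)) {
    headings[match[1].toLowerCase()].push(textOf(match[2]));
  }
  return headings;
}

// Returns absolute link URLs whose query string has an affiliate-style parameter.
function findAffiliateLinks(html, baseUrl) {
  const links = [];
  for (const match of html.matchAll(/<a\b[^>]*\bhref="([^"]+)"/gi)) {
    let parsed;
    try {
      parsed = new URL(match[1], baseUrl);
    } catch {
      continue; // skip malformed hrefs
    }
    if (AFFILIATE_PARAMS.some((param) => parsed.searchParams.has(param))) {
      links.push(parsed.href);
    }
  }
  return links;
}
```

Regex parsing keeps the sketch dependency-free. Inside a headless browser session the same information is usually easier to read from the rendered DOM.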

📊 Free Tier Details

Service       | Free Tier Details                                                                                                                    | Link
Cloudflare R2 | 10 GB storage, 1 million Class A operations, and 10 million Class B operations per month. Egress (data transfer to Internet) is free. | Cloudflare R2 Pricing
AWS Lambda    | 1 million free requests and 400,000 GB-seconds of compute time per month.                                                            | AWS Lambda Free Tier
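
As a rough worked example of what the Lambda allowance means in practice, the sketch below estimates how many invocations fit in the free tier each month. The memory size (2048 MB) and average duration (10 seconds) are assumptions chosen for illustration, not measured values from this project.

```javascript
// Free tier limits taken from the table above.
const FREE_REQUESTS = 1_000_000; // requests per month
const FREE_GB_SECONDS = 400_000; // compute per month

// Estimates how many invocations per month stay inside the free tier.
// memoryMb and avgDurationSec describe your own function configuration.
function freeInvocationsPerMonth(memoryMb, avgDurationSec) {
  const gbSecondsPerInvocation = (memoryMb / 1024) * avgDurationSec;
  const byCompute = Math.floor(FREE_GB_SECONDS / gbSecondsPerInvocation);
  return Math.min(FREE_REQUESTS, byCompute);
}

// Assumed configuration: 2048 MB of memory, 10 s per invocation.
// 2 GB * 10 s = 20 GB-seconds each, so 400,000 / 20 = 20,000 invocations.
console.log(freeInvocationsPerMonth(2048, 10)); // prints 20000
```

With settings like these, compute time is the limiting factor long before the request count is.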

📚 Getting Started

Prerequisites

Installation

git clone https://github.com/heyacherry/ScreenshotToR2-Node-Crawler.git
cd ScreenshotToR2-Node-Crawler
npm install

Environment Variable Setup

Create a .env file based on .env_example and fill in your credentials.
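
To fail fast when a credential is missing, a small check like the following could run at startup. It is only a sketch: the variable names are hypothetical placeholders (use the names actually listed in .env_example), and `loadR2Config` is not a function from this repository.

```javascript
// Hypothetical variable names; replace them with the ones in .env_example.
const REQUIRED_VARS = [
  "R2_ACCOUNT_ID",
  "R2_ACCESS_KEY_ID",
  "R2_SECRET_ACCESS_KEY",
  "R2_BUCKET_NAME",
];

// Validates the environment and returns the settings an S3-compatible
// client would need to talk to Cloudflare R2.
function loadR2Config(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    // R2 exposes an S3-compatible endpoint per Cloudflare account.
    endpoint: `https://${env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
    accessKeyId: env.R2_ACCESS_KEY_ID,
    secretAccessKey: env.R2_SECRET_ACCESS_KEY,
    bucket: env.R2_BUCKET_NAME,
  };
}
```

Passing `env` as a parameter keeps the check easy to test without touching the real process environment.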

Deployment

  1. Configure your AWS credentials:
serverless config credentials --provider aws --key YOUR_ACCESS_KEY --secret YOUR_SECRET_KEY

Alternatively, if you have an AWS profile set up, you can specify the profile in your serverless.yml:

provider:
  name: aws
  runtime: nodejs20.x
  profile: your-aws-profile
  region: us-east-1

  2. Deploy the service:
serverless deploy

📚 Usage

Invoke the function with an array of URLs to capture screenshots and upload them to Cloudflare R2.

{
  "urls": [
    { "url": "https://example.com", "name": "example" },
    { "url": "https://another-example.com", "name": "another-example" }
  ]
}

Save the payload above as data.json, then invoke the function:

serverless invoke -f screencapturesToR2 -p data.json
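
On the receiving side, a handler would typically validate this payload before launching the browser. The sketch below shows one way to do that. `parseEvent` and the `<name>.png` object key convention are assumptions for illustration, not the repository's actual behavior.

```javascript
// Validates the invocation payload and normalizes each entry.
// The `${name}.png` key format is a hypothetical naming convention.
function parseEvent(event) {
  if (!event || !Array.isArray(event.urls)) {
    throw new TypeError('Expected an event shaped like { "urls": [...] }');
  }
  return event.urls.map(({ url, name }, index) => {
    const parsed = new URL(url); // throws a TypeError on malformed URLs
    if (!["http:", "https:"].includes(parsed.protocol)) {
      throw new TypeError(`urls[${index}]: only http and https are supported`);
    }
    if (typeof name !== "string" || name.trim() === "") {
      throw new TypeError(`urls[${index}]: "name" must be a non-empty string`);
    }
    return { url: parsed.href, name, key: `${name}.png` };
  });
}
```

Rejecting bad entries up front avoids paying for a headless browser launch on input that can never succeed.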

⚠️ TypeScript Note

I opted not to use TypeScript in this project due to compatibility issues with the @sparticuz/chromium package, which can lead to errors during implementation. For more details, refer to the discussion here.

📂 Explanation of Batching Logic

The batching logic (results.length >= BATCH_SIZE) is optional, but it helps manage memory usage and lets the function handle large sets of URLs efficiently.
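
The idea can be sketched as follows. Only the `results.length >= BATCH_SIZE` check comes from the description above. The function name, the callback parameters, and the batch size of 10 are assumptions for the example.

```javascript
const BATCH_SIZE = 10; // hypothetical value; tune it to your memory budget

// Processes items one at a time and hands results to flushBatch whenever
// the buffer reaches batchSize, so results never pile up in memory.
async function processInBatches(items, processItem, flushBatch, batchSize = BATCH_SIZE) {
  let results = [];
  let flushed = 0;
  for (const item of items) {
    results.push(await processItem(item));
    if (results.length >= batchSize) {
      await flushBatch(results);
      flushed += results.length;
      results = []; // drop references so the batch can be garbage collected
    }
  }
  // Flush whatever is left after the last full batch.
  if (results.length > 0) {
    await flushBatch(results);
    flushed += results.length;
  }
  return flushed;
}
```

Here `processItem` would stand for the per-URL work (check, scrape, screenshot) and `flushBatch` for persisting or returning the accumulated results.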
