Create unlighthouse-gTracker.md
Initial commit for new instructions for using the unlighthouse-gtracker tool.
mgifford authored Sep 4, 2024
1 parent 5b8f450 commit 0f112c0
Showing 1 changed file with 121 additions and 0 deletions.
# Unlighthouse Site Scanning and Google Sheets Reporting System

## Overview

This system automates the process of crawling websites, generating site evaluation reports using the Unlighthouse tool, and uploading CSV results to Google Sheets. It consists of two main components:

1. **unlighthouse-gTracker.sh**: A shell script that automates the process of running Unlighthouse scans on websites, based on URLs retrieved from a YAML configuration file. This script runs the Unlighthouse scans for a specific day of the week and logs output.

2. **unlighthouse-gTracker.js**: A Node.js script that processes the CSV results from Unlighthouse scans, uploads the data to Google Sheets, and manages the summary sheet. It handles Google Sheets authentication, retries for scan failures, and ensures proper memory usage.

Together, these scripts allow for an automated, scheduled website scanning and reporting workflow.

## Features

### unlighthouse-gTracker.sh
- Extracts URLs from a YAML configuration file (`unlighthouse-sites.yml`) based on the current day of the week or a specified day.
- Runs Unlighthouse scans for each URL and logs the results.
- Closes Chrome Canary and Chrome Helper processes after each scan to prevent resource exhaustion.
- Forces garbage collection after each scan to manage memory efficiently.
- Logs details such as the Node.js version, start/end times, and URLs scanned.

### unlighthouse-gTracker.js
- Fetches and parses CSV files generated by the Unlighthouse scans.
- Authenticates with Google Sheets API and uploads the scan data to a newly created or existing Google Sheet.
- Handles dynamic creation of new Google Sheets, ensuring unique sheet names.
- Appends metadata, such as the current date and URL, to a summary sheet.
- Implements retry logic with exponential backoff in case of scan failures.
- Monitors memory usage and ensures efficient garbage collection during processing.
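
The retry behavior listed above can be illustrated with a minimal sketch. The README does not show the script's actual implementation, so the helper names (`retryWithBackoff`, `backoffDelay`), the default retry count, and the base delay are all assumptions made for illustration.

```javascript
// Hypothetical helper, not taken from unlighthouse-gTracker.js:
// retries an async operation, doubling the wait after each failure.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelay(attempt, baseDelayMs) {
  return baseDelayMs * 2 ** attempt;
}

async function retryWithBackoff(operation, { retries = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      // Resolve with the first successful result.
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === retries) break; // no wait after the final failure
      await sleep(backoffDelay(attempt, baseDelayMs));
    }
  }
  // Every attempt failed: surface the most recent error.
  throw lastError;
}
```

A scan could then be wrapped as `await retryWithBackoff(() => runScan(url))`, where `runScan` is a placeholder for whatever function launches Unlighthouse.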

## Installation

### Prerequisites
1. **Node.js**: Ensure that Node.js is installed on your system. You can download it from [Node.js](https://nodejs.org/).
2. **Google Cloud Platform**: Set up a project and enable the Google Sheets API and Google Drive API.
3. **OAuth 2.0 Credentials**: Create OAuth 2.0 credentials and download the `credentials.json` file.
4. **Unlighthouse**: Install Unlighthouse globally via npm:
   ```bash
   npm install -g unlighthouse
   ```
5. **Dependencies**: Install the required Node.js modules for `unlighthouse-gTracker.js`:
   ```bash
   npm install axios googleapis js-yaml csv-parse yargs
   ```

### Shell Script Dependencies
The shell script (`unlighthouse-gTracker.sh`) relies on:
- **yq**: A command-line YAML processor. Install it with Homebrew:
  ```bash
  brew install yq
  ```

### Cron Job (Optional)
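You can set up a cron job to run the `unlighthouse-gTracker.sh` script on a schedule. The entry below runs it every Monday at 02:00. Because the script selects URLs by day of the week, a daily schedule (`0 2 * * *`) would be needed to cover sites assigned to other days: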
```bash
0 2 * * 1 /path/to/unlighthouse-gTracker.sh >> /path/to/log/unlighthouse-gTracker.log 2>&1
```

## Usage

### unlighthouse-gTracker.sh

This script runs the Unlighthouse scans for a specific day of the week and manages logging.

#### Command-Line Arguments:
- `-d <day>`: Specify the day of the week for which URLs should be processed (e.g., `-d Monday`). If no day is specified, it defaults to the current day.

#### Example:
```bash
./unlighthouse-gTracker.sh -d Monday
```

This will scan all the URLs scheduled for Monday, as defined in `unlighthouse-sites.yml`.
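
The defaulting rule for `-d` can be sketched as follows. The shell script's own logic is not reproduced in this README, so this is only an illustration of the same rule in Node.js; the function name `resolveDay` and the case-insensitive matching are assumptions.

```javascript
// Day names indexed to match Date.prototype.getDay() (0 = Sunday).
const DAYS = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'];

// Hypothetical helper: returns the requested day, or today's day name
// when no day was given. Unknown names are rejected instead of ignored.
function resolveDay(requestedDay, now = new Date()) {
  if (requestedDay === undefined) {
    return DAYS[now.getDay()];
  }
  const wanted = String(requestedDay).trim().toLowerCase();
  const match = DAYS.find((day) => day.toLowerCase() === wanted);
  if (!match) {
    throw new Error(`Unknown day of the week: ${requestedDay}`);
  }
  return match;
}
```

Rejecting an unrecognized day, rather than silently scanning nothing, makes a mistyped `-d` value visible in the log.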

### unlighthouse-gTracker.js

This script is called by `unlighthouse-gTracker.sh` to process the results of each scan and upload them to Google Sheets.

#### Command-Line Arguments:
- `--url <url>`: Specify the URL to run the Unlighthouse scan for.

#### Example:
```bash
node unlighthouse-gTracker.js --url=https://example.com
```
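
One way to read and validate the `--url` argument is sketched below. The dependency list suggests the real script uses `yargs`; this sketch substitutes Node's built-in `util.parseArgs` (Node.js 18.3 or later) so that it runs without extra packages. The function name `parseUrlArgument` and the restriction to `http`/`https` are assumptions.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical helper: extracts --url from an argv-style array and
// returns the normalized URL string, or throws if it is unusable.
function parseUrlArgument(argv) {
  const { values } = parseArgs({
    args: argv,
    options: { url: { type: 'string' } },
  });
  if (!values.url) {
    throw new Error('Missing required --url argument');
  }
  // The URL constructor throws a TypeError on malformed input.
  const parsed = new URL(values.url);
  if (!['http:', 'https:'].includes(parsed.protocol)) {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href;
}
```

In a script this would be called as `parseUrlArgument(process.argv.slice(2))`; both `--url=https://example.com` and `--url https://example.com` are accepted.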

The script will:
1. Run the Unlighthouse scan for the specified URL.
2. Parse the resulting CSV file.
3. Upload the parsed data to a Google Sheet.
4. Append the data to a summary sheet, ensuring proper logging and memory management.
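
For the upload step, the features list mentions that sheet names are kept unique, but the README does not say how. A numeric suffix is one common approach, sketched here under that assumption; `uniqueSheetTitle` is a hypothetical name, and the list of existing titles would come from the spreadsheet's metadata.

```javascript
// Hypothetical helper: returns `baseTitle` if it is free, otherwise the
// first "baseTitle (n)" with n >= 2 that is not already in use.
function uniqueSheetTitle(baseTitle, existingTitles) {
  const taken = new Set(existingTitles);
  if (!taken.has(baseTitle)) {
    return baseTitle;
  }
  let suffix = 2;
  while (taken.has(`${baseTitle} (${suffix})`)) {
    suffix += 1;
  }
  return `${baseTitle} (${suffix})`;
}
```

For example, if a sheet titled `2024-09-04` already exists, a second scan on the same day would be written to `2024-09-04 (2)`.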

## YAML Configuration (`unlighthouse-sites.yml`)

The `unlighthouse-sites.yml` file is used by `unlighthouse-gTracker.sh` to store site information, including the day of the week each site should be scanned. This file contains URLs, Google Sheets IDs, and other relevant metadata for each site.

An example YAML entry:
```yaml
example-site:
  - url: https://example.com
    sheet_id: '1XyzABC123SheetID'
    start_date: 'Monday'
    max: 500
```
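
Selecting the entries scheduled for a given day could look like the sketch below. It assumes that `start_date` is the field holding the scan day and that each site name maps to a list of entries, as in the example above. The object literal stands in for what a YAML parser such as `js-yaml` would return, so the sketch runs without dependencies; the second site is invented for illustration.

```javascript
// Assumed shape of the parsed unlighthouse-sites.yml.
const config = {
  'example-site': [
    { url: 'https://example.com', sheet_id: '1XyzABC123SheetID', start_date: 'Monday', max: 500 },
  ],
  // Hypothetical second entry, not from the README.
  'another-site': [
    { url: 'https://example.org', sheet_id: 'hypothetical-sheet-id', start_date: 'Tuesday', max: 100 },
  ],
};

// Hypothetical helper: returns every entry whose start_date matches `day`
// (case-insensitive), tagged with the site name it came from.
function sitesForDay(sites, day) {
  const wanted = day.toLowerCase();
  return Object.entries(sites).flatMap(([name, entries]) =>
    entries
      .filter((entry) => String(entry.start_date).toLowerCase() === wanted)
      .map((entry) => ({ name, ...entry })));
}
```

Calling `sitesForDay(config, 'Monday')` returns only the `example-site` entry, and a day with no scheduled sites yields an empty array.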
## Workflow
1. **Add URLs**: Update the `unlighthouse-sites.yml` file with URLs you want to scan and schedule them for specific days.
2. **Run the Shell Script**: Execute `unlighthouse-gTracker.sh` (optionally through a cron job) to run the scheduled scans for the day.
3. **Process CSV Files**: After the scans are complete, the `unlighthouse-gTracker.js` script will process the CSV files, upload them to Google Sheets, and update the summary.

## Logs

Logs for the script are stored at `/Users/mgifford/CA-Sitemap-Scans/unlighthouse-gTracker.log`, containing details of the scan, including URLs processed, errors encountered, and memory usage.
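
A log line carrying memory figures could be produced along these lines. The log format actually used by the scripts is not shown in this README, so the layout and the helper names (`formatMemoryUsage`, `logLine`, `collectGarbageIfExposed`) are assumptions. `process.memoryUsage()` is a standard Node.js call, and `global.gc` is only defined when Node is started with `--expose-gc`.

```javascript
// Hypothetical helper: formats selected process.memoryUsage() fields in MB.
function formatMemoryUsage(usage = process.memoryUsage()) {
  const toMb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  return `rss=${toMb(usage.rss)}MB heapUsed=${toMb(usage.heapUsed)}MB heapTotal=${toMb(usage.heapTotal)}MB`;
}

// Hypothetical helper: prefixes a message with an ISO 8601 timestamp.
function logLine(message, now = new Date()) {
  return `[${now.toISOString()}] ${message}`;
}

// Requests garbage collection only if the runtime exposes it.
function collectGarbageIfExposed() {
  if (typeof global.gc === 'function') {
    global.gc();
    return true;
  }
  return false;
}
```

A typical call after each scan would be `console.log(logLine(formatMemoryUsage()))`, with output redirected to the log file by the shell script or cron entry.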

## License

This project is licensed under the GNU General Public License v3.0. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

---

This README provides a clear overview of how the two scripts (`unlighthouse-gTracker.sh` and `unlighthouse-gTracker.js`) work together to automate the crawling, scanning, and reporting of websites into Google Sheets, along with instructions for setting up and running the scripts.
