# Unlighthouse Site Scanning and Google Sheets Reporting System

## Overview

This system automates the process of crawling websites, generating site evaluation reports with the Unlighthouse tool, and uploading the CSV results to Google Sheets. It consists of two main components:

1. **unlighthouse-gTracker.sh**: A shell script that runs Unlighthouse scans on websites, using URLs retrieved from a YAML configuration file. It processes the URLs scheduled for a given day of the week and logs all output.

2. **unlighthouse-gTracker.js**: A Node.js script that processes the CSV results from Unlighthouse scans, uploads the data to Google Sheets, and maintains a summary sheet. It handles Google Sheets authentication, retries failed scans, and monitors memory usage.

Together, these scripts provide an automated, scheduled website scanning and reporting workflow.

## Features

### unlighthouse-gTracker.sh
- Extracts URLs from a YAML configuration file (`unlighthouse-sites.yml`) based on the current day of the week or a specified day.
- Runs an Unlighthouse scan for each URL and logs the results.
- Closes Chrome Canary and Chrome Helper processes after each scan to prevent resource exhaustion.
- Forces garbage collection after each scan to manage memory efficiently.
- Logs details such as the Node.js version, start/end times, and URLs scanned.

### unlighthouse-gTracker.js
- Fetches and parses the CSV files generated by the Unlighthouse scans.
- Authenticates with the Google Sheets API and uploads the scan data to a newly created or existing Google Sheet.
- Creates new Google Sheets dynamically, ensuring unique sheet names.
- Appends metadata, such as the current date and URL, to a summary sheet.
- Implements retry logic with exponential backoff for scan failures.
- Monitors memory usage and triggers garbage collection during processing.
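The retry behavior can be sketched as follows. This is a minimal illustration of retry with exponential backoff, not the actual code in `unlighthouse-gTracker.js`; the function names, retry count, and base delay are assumptions.

```javascript
// Sketch of retry with exponential backoff (illustrative names, not the
// actual functions in unlighthouse-gTracker.js).
const backoffDelay = (attempt, baseMs = 1000) => baseMs * 2 ** attempt;

async function retryWithBackoff(task, maxRetries = 3, baseMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt === maxRetries) throw err; // give up after the last retry
      const delay = backoffDelay(attempt, baseMs);
      console.log(`Scan failed, retrying in ${delay} ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Example: a task that fails twice, then succeeds.
let calls = 0;
retryWithBackoff(async () => {
  calls++;
  if (calls < 3) throw new Error('scan failed');
  return 'ok';
}, 3, 1).then((result) => console.log(result)); // prints "ok"
```

Doubling the delay on each attempt (1s, 2s, 4s, ...) gives a transiently overloaded site or API time to recover before the next try.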

## Installation

### Prerequisites
1. **Node.js**: Ensure that Node.js is installed on your system. You can download it from [Node.js](https://nodejs.org/).
2. **Google Cloud Platform**: Set up a project and enable the Google Sheets API and the Google Drive API.
3. **OAuth 2.0 Credentials**: Create OAuth 2.0 credentials and download the `credentials.json` file.
4. **Unlighthouse**: Install Unlighthouse globally via npm:
```bash
npm install -g unlighthouse
```
5. **Dependencies**: Install the required Node.js modules for `unlighthouse-gTracker.js`:
```bash
npm install axios googleapis js-yaml csv-parse yargs
```

### Shell Script Dependencies
The shell script (`unlighthouse-gTracker.sh`) relies on:
- **yq**: A command-line YAML processor. Install it with:
```bash
brew install yq
```

### Cron Job (Optional)
You can set up a cron job to run `unlighthouse-gTracker.sh` weekly, e.g. every Monday at 2:00 a.m.:
```bash
0 2 * * 1 /path/to/unlighthouse-gTracker.sh >> /path/to/log/unlighthouse-gTracker.log 2>&1
```

## Usage

### unlighthouse-gTracker.sh

This script runs the Unlighthouse scans scheduled for a specific day of the week and manages logging.

#### Command-Line Arguments:
- `-d <day>`: Specify the day of the week for which URLs should be processed (e.g., `-d Monday`). If no day is specified, it defaults to the current day.

#### Example:
```bash
./unlighthouse-gTracker.sh -d Monday
```

This scans all the URLs scheduled for Monday, as defined in `unlighthouse-sites.yml`.
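The fallback to the current day can be expressed as a small lookup. The sketch below shows the idea in Node.js; the shell script implements the same logic with its own tools, and `resolveScanDay` is an illustrative name, not a function from the scripts.

```javascript
// Resolve the scan day: use the provided argument, otherwise default to today.
// Illustrative sketch of the -d fallback behavior, not the script's own code.
function resolveScanDay(dayArg) {
  if (dayArg) return dayArg;
  return new Date().toLocaleDateString('en-US', { weekday: 'long' });
}

console.log(resolveScanDay('Monday')); // prints "Monday"
console.log(resolveScanDay());         // whichever weekday it is today
```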

### unlighthouse-gTracker.js

This script is called by `unlighthouse-gTracker.sh` to process the results of each scan and upload them to Google Sheets.

#### Command-Line Arguments:
- `--url <url>`: Specify the URL to run the Unlighthouse scan for.

#### Example:
```bash
node unlighthouse-gTracker.js --url=https://example.com
```

The script will:
1. Run the Unlighthouse scan for the specified URL.
2. Parse the resulting CSV file.
3. Upload the parsed data to a Google Sheet.
4. Append the data to a summary sheet, with logging and memory management along the way.
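Conceptually, step 2 turns the CSV text into the row-of-arrays shape that Google Sheets values calls accept. The real script uses the `csv-parse` module, which also handles quoted fields; the stand-in below is a simplified sketch that assumes plain CSV with no quoted commas, and the sample data is invented for illustration.

```javascript
// Convert simple CSV text into the [[row], [row], ...] shape expected when
// writing values to a Google Sheet. Sketch only: the actual script uses the
// csv-parse module, which handles quoting and edge cases this does not.
function csvToRows(csvText) {
  return csvText
    .trim()
    .split('\n')
    .map((line) => line.split(',').map((cell) => cell.trim()));
}

const csv = 'path,score\n/,0.92\n/about,0.88';
const rows = csvToRows(csv);
console.log(rows);
// [ [ 'path', 'score' ], [ '/', '0.92' ], [ '/about', '0.88' ] ]
```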

## YAML Configuration (`unlighthouse-sites.yml`)

The `unlighthouse-sites.yml` file is used by `unlighthouse-gTracker.sh` to store site information, including the day of the week on which each site should be scanned. It contains URLs, Google Sheets IDs, and other relevant metadata for each site.

An example YAML entry:
```yaml
example-site:
  - url: https://example.com
    sheet_id: '1XyzABC123SheetID'
    start_date: 'Monday'
    max: 500
```
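Once the YAML is parsed (with `yq` in the shell script, or the `js-yaml` dependency in Node.js), entries like the one above can be filtered by their `start_date`. The sketch below inlines an already-parsed structure so it stays self-contained; the second site entry and its `sheet_id` are made up for illustration.

```javascript
// Select the URLs scheduled for a given day from a parsed
// unlighthouse-sites.yml. The structure mirrors the example entry above;
// 'another-site' is a hypothetical second entry added for illustration.
const sites = {
  'example-site': [
    { url: 'https://example.com', sheet_id: '1XyzABC123SheetID', start_date: 'Monday', max: 500 },
  ],
  'another-site': [
    { url: 'https://example.org', sheet_id: '1AbcDEF456SheetID', start_date: 'Tuesday', max: 250 },
  ],
};

function urlsForDay(config, day) {
  return Object.values(config)
    .flat()
    .filter((entry) => entry.start_date === day)
    .map((entry) => entry.url);
}

console.log(urlsForDay(sites, 'Monday')); // [ 'https://example.com' ]
```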

## Workflow
1. **Add URLs**: Update the `unlighthouse-sites.yml` file with the URLs you want to scan and schedule them for specific days.
2. **Run the Shell Script**: Execute `unlighthouse-gTracker.sh` (optionally through a cron job) to run the scheduled scans for the day.
3. **Process CSV Files**: After the scans complete, `unlighthouse-gTracker.js` processes the CSV files, uploads them to Google Sheets, and updates the summary.

## Logs

Logs for the script are stored at `/Users/mgifford/CA-Sitemap-Scans/unlighthouse-gTracker.log` and include details of each scan: URLs processed, errors encountered, and memory usage.

## License

This project is licensed under the GNU General Public License v3.0. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

---

This README explains how the two scripts (`unlighthouse-gTracker.sh` and `unlighthouse-gTracker.js`) work together to automate crawling, scanning, and reporting websites into Google Sheets, and how to set up and run them.