---
tags:
- scripts>powershell, software>windows
info: aberto.
date: 2024-09-12
type: post
layout: post
published: true
slug: urls-list-text-scraper
title: 'URLs list text scraper'
---

Reference: `https://github.com/kitsuyui/scraper`

To download the text content of multiple URLs from a list on Windows 11, we'll create a PowerShell script built around `scraper.exe`. Compared with a plain batch file, PowerShell gives better error handling, cleaner output formatting, and more flexibility.

1. First, ensure you have `scraper.exe` set up:
- Download the latest Windows executable from https://github.com/kitsuyui/scraper/releases/latest
- Rename it to `scraper.exe` and place it in a directory that's in your system PATH
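
To confirm the executable is reachable from any directory, a quick check from PowerShell (assuming you kept the name `scraper.exe`):

```powershell
# Prints the full path to scraper.exe if it is on your PATH; errors otherwise
Get-Command scraper.exe | Select-Object -ExpandProperty Source

# If it is not found, you can add its folder to PATH for the current session, e.g.:
# $env:Path += ";C:\Tools\scraper"   # hypothetical folder; use wherever you placed the file
```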

2. Create a file named `scraper-config.json` with the following content:
```json
[
{"type": "xpath", "label": "BodyText", "query": "//body//text()"}
]
```

3. Create a text file named `urls.txt` with one URL per line:
```
https://example.com
https://another-example.com
https://third-example.com
```

4. Create a new file named `Scrape-Urls.ps1` with the following PowerShell script:

```powershell
# Scrape-Urls.ps1
param(
    [string]$UrlFile = "urls.txt",
    [string]$ConfigFile = "scraper-config.json",
    [string]$OutputDir = "scraped_content"
)

# Ensure the output directory exists
New-Item -ItemType Directory -Force -Path $OutputDir | Out-Null

# Read URLs from file
$urls = Get-Content $UrlFile

foreach ($url in $urls) {
    try {
        Write-Host "Processing: $url"

        # Generate a safe filename
        $filename = ($url -replace "https?://", "" -replace "[^a-zA-Z0-9]+", "_") + ".txt"
        $outputPath = Join-Path $OutputDir $filename

        # Download and scrape content
        $content = Invoke-WebRequest -Uri $url -UseBasicParsing | Select-Object -ExpandProperty Content
        $scrapedContent = $content | & scraper -c $ConfigFile | ConvertFrom-Json

        # Extract text from JSON and save
        $bodyText = $scrapedContent | Where-Object { $_.label -eq "BodyText" } | Select-Object -ExpandProperty results
        $bodyText -join " " | Out-File -FilePath $outputPath

        Write-Host "Saved to: $outputPath"
    }
    catch {
        Write-Host "Error processing $url : $_" -ForegroundColor Red
    }

    Write-Host
}
Write-Host "All URLs processed." -ForegroundColor Green
```

5. Open PowerShell and navigate to the directory containing your script and files.

6. Run the script:
```powershell
.\Scrape-Urls.ps1
```
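
Because the script exposes parameters, you can also point it at a different URL list, config file, or output folder; the file names below are only examples:

```powershell
.\Scrape-Urls.ps1 -UrlFile "news-sites.txt" -OutputDir "news_text"
```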

This approach offers several advantages:

- It uses PowerShell, which is more powerful and flexible than batch scripts on Windows.
- It includes error handling to manage issues with individual URLs without stopping the entire process.
- It creates a separate output directory for scraped content, keeping things organized.
- It generates safe filenames based on the URLs, avoiding naming conflicts and invalid characters (see the short sketch after this list).
- It extracts the actual text content from the JSON output, providing clean text files.
- It's more customizable, allowing you to specify different input files, config files, or output directories.
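
If you are curious how the filename rule behaves, you can run the same replace expression from the script on a sample URL in isolation (the URL here is just an illustration):

```powershell
$url = "https://example.com/docs/page?id=42"
# Strip the scheme, then collapse every run of non-alphanumeric characters into "_"
$filename = ($url -replace "https?://", "" -replace "[^a-zA-Z0-9]+", "_") + ".txt"
$filename   # -> example_com_docs_page_id_42.txt
```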

Additional notes:

1. The script processes URLs one at a time, which keeps the request rate modest. For a large number of URLs, consider also adding a delay between requests, as shown below.
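
For example, a short fixed pause could be added at the end of the loop body in `Scrape-Urls.ps1`; the two-second value is arbitrary and should be tuned to the sites you are scraping:

```powershell
# At the bottom of the foreach loop, after the try/catch block
Start-Sleep -Seconds 2   # wait two seconds before requesting the next URL
```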

2. Some websites may block or behave differently with automated requests. You might need to add user-agent headers or other modifications for certain sites:

```powershell
$headers = @{
"User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
$content = Invoke-WebRequest -Uri $url -UseBasicParsing -Headers $headers | Select-Object -ExpandProperty Content
```

3. Always ensure you have permission to scrape the websites you're targeting and that you're complying with their terms of service and robots.txt files.

4. For very large lists of URLs, consider implementing parallel processing or breaking the list into smaller batches to improve efficiency.
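
A minimal sketch of the parallel variant, assuming PowerShell 7 or later (which provides `ForEach-Object -Parallel`); it reuses the same `scraper.exe` pipeline and the default file names from the walkthrough above:

```powershell
# Requires PowerShell 7+; downloads up to 4 URLs at a time
New-Item -ItemType Directory -Force -Path "scraped_content" | Out-Null

Get-Content "urls.txt" | ForEach-Object -Parallel {
    $url = $_
    try {
        # Same filename rule and scraping pipeline as Scrape-Urls.ps1
        $filename = ($url -replace "https?://", "" -replace "[^a-zA-Z0-9]+", "_") + ".txt"
        $content  = Invoke-WebRequest -Uri $url -UseBasicParsing | Select-Object -ExpandProperty Content
        $scraped  = $content | & scraper -c "scraper-config.json" | ConvertFrom-Json
        $body     = $scraped | Where-Object { $_.label -eq "BodyText" } | Select-Object -ExpandProperty results
        $body -join " " | Out-File -FilePath (Join-Path "scraped_content" $filename)
        Write-Host "Saved: $filename"
    }
    catch {
        Write-Host "Error processing $url : $_" -ForegroundColor Red
    }
} -ThrottleLimit 4
```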

5. You may want to add more robust URL validation and error checking, depending on your specific needs and the reliability of your URL list.
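
One way to do that, assuming you only want well-formed absolute http/https URLs, is to filter the list through the .NET `System.Uri` parser before the main loop:

```powershell
# Keep only lines that parse as absolute http/https URLs; warn about the rest
$urls = Get-Content "urls.txt" | Where-Object {
    $uri = $null
    if ([System.Uri]::TryCreate($_, [System.UriKind]::Absolute, [ref]$uri) -and
        $uri.Scheme -in @("http", "https")) {
        $true
    }
    else {
        Write-Warning "Skipping invalid URL: $_"
        $false
    }
}
```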
