A powerful command-line tool that fetches web content and converts it to clean, readable Markdown format.
- Bypass Anti-Scraping Measures: Uses real browsers in headless mode to get past the 403 errors and CAPTCHAs that often block plain HTTP clients
- Multiple Browser Support: Uses Chrome, Firefox, or curl to fetch web content
- Smart HTML Cleaning: Removes unwanted JavaScript, CSS, and metadata while preserving content
- JavaScript Support: Properly renders JavaScript-heavy websites using Chrome or Firefox
- Clean Markdown Output: Converts cleaned HTML to well-formatted Markdown
- AI/LLM Optimized: Produces lightweight, clean text that's perfect for feeding into AI models
md-fetch is especially valuable for AI and Large Language Model (LLM) applications:
- Clean Input: Removes noise like scripts, styles, and metadata that could confuse LLMs
- Token Efficiency: Outputs lightweight Markdown, reducing token usage when feeding content to AI models
- Context Preservation: Maintains important content structure while eliminating irrelevant elements
- Consistent Format: Provides uniformly formatted text regardless of the source website's structure
- Easy Integration: Perfect for automating web content collection for AI training or real-time querying
Many modern websites implement anti-scraping measures that block traditional HTTP requests:
- Return 403 Forbidden errors
- Present CAPTCHAs
- Require JavaScript execution
- Check for browser fingerprints
md-fetch addresses this by driving a real browser (Chrome or Firefox) in headless mode, which:
- Appears as a legitimate browser
- Executes JavaScript properly
- Handles modern web features
- Maintains your existing browser session
Visit our releases page to download pre-built binaries:
- Standard build (dynamically linked)
- Musl build (statically linked, ideal for Alpine Linux and other musl-based systems)
- Debian/Ubuntu package (.deb)
- Red Hat/Fedora package (.rpm)
- Intel (amd64)
- Apple Silicon (arm64)
- 64-bit (amd64)
For Debian/Ubuntu:
sudo dpkg -i md-fetch_<version>_amd64.deb
For Red Hat/Fedora:
sudo rpm -i md-fetch-<version>.x86_64.rpm
go install github.com/nathabonfim59/md-fetch@latest
- Ensure you have Go 1.16 or later installed
- Clone the repository:
git clone https://github.com/nathabonfim59/md-fetch.git
cd md-fetch
- Build the project:
go build -o bin/md-fetch main.go
Basic usage:
md-fetch <url>
With browser selection:
md-fetch -browser <chrome|firefox|curl> <url>
Examples:
# Use default browser (tries Chrome, then Firefox, then curl)
md-fetch https://www.google.com
# Use Chrome specifically
md-fetch -browser chrome https://www.google.com
# Use Firefox specifically
md-fetch -browser firefox https://www.google.com
# Use curl for static content
md-fetch -browser curl https://www.google.com
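The same invocations can be scripted from Go. The following is a minimal sketch, assuming the md-fetch binary is on PATH and writes the resulting Markdown to standard output; the helper names buildArgs and fetchMarkdown are illustrative and not part of md-fetch.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildArgs assembles the md-fetch argument list.
// An empty browser leaves the choice to md-fetch's default order.
func buildArgs(browser, url string) []string {
	if browser == "" {
		return []string{url}
	}
	return []string{"-browser", browser, url}
}

// fetchMarkdown runs md-fetch and returns whatever it prints to stdout.
// Assumption: the converted Markdown is written to stdout.
func fetchMarkdown(browser, url string) (string, error) {
	out, err := exec.Command("md-fetch", buildArgs(browser, url)...).Output()
	if err != nil {
		return "", fmt.Errorf("md-fetch failed: %w", err)
	}
	return string(out), nil
}

func main() {
	md, err := fetchMarkdown("curl", "https://www.example.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(md)
}
```

Keeping argument construction in its own function makes the flag handling easy to check without launching a browser.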
Start the HTTP server:
md-fetch --serve [-port 8080]
The server provides a REST API for fetching content:
curl -X POST http://localhost:8080/fetch \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://www.example.com"],
"browser": "chrome"
}'
curl -X POST http://localhost:8080/fetch \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://www.example.com",
"https://www.google.com",
"https://www.github.com"
],
"browser": "chrome"
}'
Access the OpenAPI specification at:
http://localhost:8080/openapi.yaml
{
"results": {
"https://www.example.com": "# Example Domain\n\nThis domain is for...",
"https://www.google.com": "# Google\n\n[Gmail](https://mail.google.com)..."
},
"errors": {
"https://invalid.url": "error message"
}
}
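For callers that prefer Go over curl, the request and response shapes above can be mirrored with two structs. This is a minimal sketch, assuming a server started with md-fetch --serve on port 8080; the names FetchRequest, FetchResponse, fetchAll, and decodeFetchResponse are illustrative and not part of md-fetch.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// FetchRequest mirrors the JSON body used in the curl examples.
type FetchRequest struct {
	URLs    []string `json:"urls"`
	Browser string   `json:"browser"`
}

// FetchResponse mirrors the documented response: Markdown keyed by URL,
// plus an error message for each URL that failed.
type FetchResponse struct {
	Results map[string]string `json:"results"`
	Errors  map[string]string `json:"errors"`
}

// decodeFetchResponse parses a raw response body.
func decodeFetchResponse(body []byte) (FetchResponse, error) {
	var resp FetchResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

// fetchAll posts the request to <baseURL>/fetch and decodes the reply.
func fetchAll(baseURL string, req FetchRequest) (FetchResponse, error) {
	payload, err := json.Marshal(req)
	if err != nil {
		return FetchResponse{}, err
	}
	httpResp, err := http.Post(baseURL+"/fetch", "application/json", bytes.NewReader(payload))
	if err != nil {
		return FetchResponse{}, err
	}
	defer httpResp.Body.Close()
	body, err := io.ReadAll(httpResp.Body)
	if err != nil {
		return FetchResponse{}, err
	}
	return decodeFetchResponse(body)
}

func main() {
	req := FetchRequest{URLs: []string{"https://www.example.com"}, Browser: "chrome"}
	resp, err := fetchAll("http://localhost:8080", req)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	for url, md := range resp.Results {
		fmt.Printf("%s: %d bytes of Markdown\n", url, len(md))
	}
	for url, msg := range resp.Errors {
		fmt.Fprintf(os.Stderr, "%s: %s\n", url, msg)
	}
}
```

Because results and errors are both keyed by URL, a batch request can partly succeed, so callers should inspect both maps.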
The tool supports multiple browsers in the following priority order:
- Chrome/Chromium: Best for JavaScript-heavy sites (default)
- Firefox: Good alternative for JavaScript support
- curl: Fallback for static content
You can specify which browser to use with the -browser flag.
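The priority order can be pictured as a first-match search over installed executables. The sketch below is one possible way to express it, not md-fetch's actual detection logic; the executable names and the helpers pickBrowser and onPath are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// candidates lists executables to probe for each browser, in the
// documented priority order. The executable names are assumptions.
var candidates = []struct {
	browser string
	bins    []string
}{
	{"chrome", []string{"google-chrome", "chromium", "chromium-browser"}},
	{"firefox", []string{"firefox"}},
	{"curl", []string{"curl"}},
}

// pickBrowser returns the first browser for which exists reports an
// available executable, or false if none is available.
func pickBrowser(exists func(bin string) bool) (string, bool) {
	for _, c := range candidates {
		for _, bin := range c.bins {
			if exists(bin) {
				return c.browser, true
			}
		}
	}
	return "", false
}

// onPath reports whether bin can be found on PATH.
func onPath(bin string) bool {
	_, err := exec.LookPath(bin)
	return err == nil
}

func main() {
	if b, ok := pickBrowser(onPath); ok {
		fmt.Println("would use:", b)
	} else {
		fmt.Println("no supported browser found")
	}
}
```

Passing the availability check as a function keeps the ordering logic testable on machines where none of the browsers are installed.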
- Removes JavaScript code:
  - Anonymous functions and IIFEs
  - Event listeners
  - Window assignments
  - Variable declarations
  - Google-specific scripts
  - MediaWiki RLQ functions
- Cleans CSS content:
  - Inline styles
  - Style blocks
  - Media queries
  - CSS definitions
- Removes metadata:
  - JSON-LD data
  - Schema.org markup
  - Configuration objects
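To give a feel for pattern-based cleaning, here is a minimal sketch that strips script blocks (which also covers JSON-LD), style blocks (which also covers media queries), and inline style attributes. It is not the code in html_cleaner.go; the patterns and the stripNoise name are assumptions, and regular expressions like these will miss edge cases that a real HTML parser handles.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// (?is): case-insensitive, and "." also matches newlines.
	scriptBlock = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script\s*>`)
	styleBlock  = regexp.MustCompile(`(?is)<style\b[^>]*>.*?</style\s*>`)
	// Matches a quoted style="..." or style='...' attribute.
	inlineStyle = regexp.MustCompile(`(?i)\s+style\s*=\s*("[^"]*"|'[^']*')`)
)

// stripNoise removes script blocks, style blocks, and inline style
// attributes, leaving the remaining markup untouched.
func stripNoise(html string) string {
	html = scriptBlock.ReplaceAllString(html, "")
	html = styleBlock.ReplaceAllString(html, "")
	html = inlineStyle.ReplaceAllString(html, "")
	return strings.TrimSpace(html)
}

func main() {
	page := `<style>@media (max-width: 600px) { p { color: red } }</style>
<p style="color:red">Hello</p>
<script type="application/ld+json">{"@type":"Article"}</script>`
	fmt.Println(stripNoise(page))
}
```

Removing whole blocks before touching attributes avoids matching style-like text inside scripts.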
md-fetch/
├── cmd/ # Command-line interface
├── internal/
│ ├── browser/ # Browser implementations
│ │ ├── chrome.go # Chrome/Chromium support
│ │ ├── firefox.go # Firefox support
│ │ ├── curl.go # curl support
│ │ └── html_cleaner.go # HTML cleaning logic
│ ├── converter/ # HTML to Markdown conversion
│ └── fetcher/ # Content fetching coordination
├── bin/ # Compiled binaries
└── main.go # Entry point
go test ./...
- New Browser Support: Implement the Browser interface in internal/browser/browser.go
- HTML Cleaning Rules: Add patterns to html_cleaner.go
- Markdown Conversion: Enhance internal/converter/markdown.go
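As a rough picture of what adding a browser might involve, the sketch below defines a hypothetical Browser interface and a toy implementation. The real interface in internal/browser/browser.go may use different method names and signatures; everything here (Name, Fetch, staticBrowser) is assumed for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// Browser is a hypothetical shape for the interface. Check
// internal/browser/browser.go for the actual definition.
type Browser interface {
	Name() string
	Fetch(url string) (string, error)
}

// staticBrowser is a toy implementation that serves canned HTML,
// which is handy for tests that should not touch the network.
type staticBrowser struct {
	pages map[string]string
}

func (s staticBrowser) Name() string { return "static" }

func (s staticBrowser) Fetch(url string) (string, error) {
	html, ok := s.pages[url]
	if !ok {
		return "", errors.New("page not found: " + url)
	}
	return html, nil
}

func main() {
	var b Browser = staticBrowser{pages: map[string]string{
		"https://www.example.com": "<h1>Example Domain</h1>",
	}}
	html, err := b.Fetch("https://www.example.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(b.Name(), "returned", html)
}
```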
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.