md-fetch

A powerful command-line tool that fetches web content and converts it to clean, readable Markdown format.

Key Features

Bypass Anti-Scraping Measures: Uses real browsers in headless mode to avoid many of the 403 errors and CAPTCHAs that typically block programmatic scraping
Multiple Browser Support: Uses Chrome, Firefox, or curl to fetch web content
Smart HTML Cleaning: Removes unwanted JavaScript, CSS, and metadata while preserving content
JavaScript Support: Properly renders JavaScript-heavy websites using Chrome or Firefox
Clean Markdown Output: Converts cleaned HTML to well-formatted Markdown
AI/LLM Optimized: Produces lightweight, clean text that's perfect for feeding into AI models

Perfect for AI/LLM Applications

md-fetch is especially valuable for AI and Large Language Model (LLM) applications:

Clean Input: Removes noise like scripts, styles, and metadata that could confuse LLMs
Token Efficiency: Outputs lightweight Markdown, reducing token usage when feeding content to AI models
Context Preservation: Maintains important content structure while eliminating irrelevant elements
Consistent Format: Provides uniformly formatted text regardless of the source website's structure
Easy Integration: Perfect for automating web content collection for AI training or real-time querying

Why Use Real Browsers?

Many modern websites implement anti-scraping measures that block traditional HTTP requests:

Return 403 Forbidden errors
Present CAPTCHAs
Require JavaScript execution
Check for browser fingerprints

md-fetch solves this by using real browsers (Chrome/Firefox) in headless mode, which:

Appear as legitimate browsers
Execute JavaScript properly
Handle modern web features
Maintain your existing browser session

Installation

Pre-built Binaries

Visit our releases page to download pre-built binaries:

Linux

Standard build (dynamically linked)
Musl build (statically linked, ideal for Alpine Linux and other musl-based systems)
Debian/Ubuntu package (.deb)
Red Hat/Fedora package (.rpm)

macOS

Intel (amd64)
Apple Silicon (arm64)

Windows

64-bit (amd64)

Package Installation

For Debian/Ubuntu:

sudo dpkg -i md-fetch_<version>_amd64.deb

For Red Hat/Fedora:

sudo rpm -i md-fetch-<version>.x86_64.rpm

Using go install

go install github.com/nathabonfim59/md-fetch@latest

From Source

Ensure you have Go 1.16 or later installed

Clone the repository:

git clone https://github.com/nathabonfim59/md-fetch.git
cd md-fetch

Build the project:
```
go build -o bin/md-fetch main.go
```

Usage

CLI Mode

Basic usage:

md-fetch <url>

With browser selection:

md-fetch -browser <chrome|firefox|curl> <url>

Examples:

# Use default browser (tries Chrome, then Firefox, then curl)
md-fetch https://www.google.com

# Use Chrome specifically
md-fetch -browser chrome https://www.google.com

# Use Firefox specifically
md-fetch -browser firefox https://www.google.com

# Use curl for static content
md-fetch -browser curl https://www.google.com
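
The same commands can be driven from a Go program. The following is a minimal sketch, assuming the `md-fetch` binary is on your `PATH` and that it writes the converted Markdown to standard output; `buildArgs` and `fetchMarkdown` are hypothetical helper names, not part of md-fetch.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// buildArgs assembles the argument list used in the CLI examples.
// An empty browser leaves the choice to md-fetch's default order.
func buildArgs(browser, url string) []string {
	if browser == "" {
		return []string{url}
	}
	return []string{"-browser", browser, url}
}

// fetchMarkdown runs md-fetch and returns whatever it prints to stdout.
// Assumption: md-fetch is on PATH and prints Markdown to stdout.
func fetchMarkdown(browser, url string) (string, error) {
	out, err := exec.Command("md-fetch", buildArgs(browser, url)...).Output()
	if err != nil {
		return "", fmt.Errorf("md-fetch failed: %w", err)
	}
	return string(out), nil
}

func main() {
	// Print the command lines this wrapper would run, without executing them.
	for _, browser := range []string{"", "chrome", "firefox", "curl"} {
		args := buildArgs(browser, "https://www.example.com")
		fmt.Println("md-fetch " + strings.Join(args, " "))
	}
}
```

Call `fetchMarkdown("curl", url)` in place of the print loop once the binary is installed.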

Server Mode

Start the HTTP server:

md-fetch --serve [-port 8080]

The server provides a REST API for fetching content:

Single URL Request

curl -X POST http://localhost:8080/fetch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://www.example.com"],
    "browser": "chrome"
  }'

Batch Request (Parallel Processing)

curl -X POST http://localhost:8080/fetch \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://www.example.com",
      "https://www.google.com",
      "https://www.github.com"
    ],
    "browser": "chrome"
  }'
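
The curl requests above can also be issued from Go. This is a sketch rather than an official client: the `fetchRequest` type and both function names are illustrative, and treating any non-200 status as a failure is an assumption about the server's behavior.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// fetchRequest mirrors the JSON body used in the curl examples.
type fetchRequest struct {
	URLs    []string `json:"urls"`
	Browser string   `json:"browser"`
}

// encodeRequest serializes one or more URLs into a request body.
func encodeRequest(browser string, urls ...string) ([]byte, error) {
	return json.Marshal(fetchRequest{URLs: urls, Browser: browser})
}

// postFetch sends the body to <base>/fetch and returns the raw response.
// Assumption: a successful call answers with HTTP 200.
func postFetch(base string, body []byte) ([]byte, error) {
	resp, err := http.Post(base+"/fetch", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	body, err := encodeRequest("chrome", "https://www.example.com", "https://www.github.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// With a server started via `md-fetch --serve`, send it with:
	//   raw, err := postFetch("http://localhost:8080", body)
}
```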

OpenAPI Specification

Access the OpenAPI specification at:

http://localhost:8080/openapi.yaml

Response Format

{
  "results": {
    "https://www.example.com": "# Example Domain\n\nThis domain is for...",
    "https://www.google.com": "# Google\n\n[Gmail](https://mail.google.com)..."
  },
  "errors": {
    "https://invalid.url": "error message"
  }
}
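
To consume this response from Go, both objects can be decoded into maps keyed by URL. The type and function names below are illustrative and are not taken from the md-fetch source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fetchResponse mirrors the response format: both maps are keyed by URL.
type fetchResponse struct {
	Results map[string]string `json:"results"`
	Errors  map[string]string `json:"errors"`
}

// decodeResponse parses a raw /fetch response body.
func decodeResponse(raw []byte) (fetchResponse, error) {
	var r fetchResponse
	err := json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	raw := []byte(`{
  "results": {"https://www.example.com": "# Example Domain\n\nThis domain is for..."},
  "errors": {"https://invalid.url": "error message"}
}`)
	resp, err := decodeResponse(raw)
	if err != nil {
		panic(err)
	}
	// Successful URLs carry Markdown; failed URLs carry an error string.
	for url, md := range resp.Results {
		fmt.Printf("%s: %d bytes of Markdown\n", url, len(md))
	}
	for url, msg := range resp.Errors {
		fmt.Printf("%s failed: %s\n", url, msg)
	}
}
```

Because a batch can succeed for some URLs and fail for others, a caller should check both maps instead of relying on one of them being empty.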

Browser Support

The tool supports multiple browsers in the following priority order:

Chrome/Chromium: Best for JavaScript-heavy sites (default)
Firefox: Good alternative for JavaScript support
curl: Fallback for static content

You can specify which browser to use with the -browser flag.

HTML Cleaning Features

Removes JavaScript code:
- Anonymous functions and IIFEs
- Event listeners
- Window assignments
- Variable declarations
- Google-specific scripts
- MediaWiki RLQ functions

Cleans CSS content:
- Inline styles
- Style blocks
- Media queries
- CSS definitions

Removes metadata:
- JSON-LD data
- Schema.org markup
- Configuration objects
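
As a rough illustration of this kind of cleaning, the sketch below strips script blocks (which also covers JSON-LD, since it is embedded in `<script>` tags), style blocks, and inline `style` attributes with regular expressions. The patterns and the `cleanHTML` name are assumptions made for this example; the rules in html_cleaner.go may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// (?is): match case-insensitively and let "." span newlines.
	scriptBlock = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script\s*>`)
	styleBlock  = regexp.MustCompile(`(?is)<style\b[^>]*>.*?</style\s*>`)
	// Quoted style="..." or style='...' attributes, including leading space.
	inlineStyle = regexp.MustCompile(`(?i)\s+style\s*=\s*("[^"]*"|'[^']*')`)
)

// cleanHTML removes script blocks, style blocks, and inline style attributes.
func cleanHTML(html string) string {
	html = scriptBlock.ReplaceAllString(html, "")
	html = styleBlock.ReplaceAllString(html, "")
	html = inlineStyle.ReplaceAllString(html, "")
	return strings.TrimSpace(html)
}

func main() {
	page := `<style>p { color: red; }</style>
<p style="margin: 0">Hello</p>
<script type="application/ld+json">{"@type": "Thing"}</script>`
	fmt.Println(cleanHTML(page))
}
```

Regular expressions are fragile on malformed or unusual markup, so an HTML parser is the more robust choice when correctness matters more than brevity.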

Project Structure

md-fetch/
├── cmd/                    # Command-line interface
├── internal/              
│   ├── browser/           # Browser implementations
│   │   ├── chrome.go      # Chrome/Chromium support
│   │   ├── firefox.go     # Firefox support
│   │   ├── curl.go        # curl support
│   │   └── html_cleaner.go # HTML cleaning logic
│   ├── converter/         # HTML to Markdown conversion
│   └── fetcher/           # Content fetching coordination
├── bin/                   # Compiled binaries
└── main.go                # Entry point

Development

Running Tests

go test ./...

Adding New Features

New Browser Support: Implement the Browser interface in internal/browser/browser.go
HTML Cleaning Rules: Add patterns to html_cleaner.go
Markdown Conversion: Enhance internal/converter/markdown.go

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
