# gocrawler

`gocrawler` is an API layer for crawling domains. A crawl request adds a domain to a worker queue configured with a given depth, so crawling stops once that depth is reached. Crawling is restricted to the requested domain, because following external domains (in addition to the requested one) can descend into a near-infinite loop: for example, when a crawl request is received for https://google.com, any child links outside of google.com are not added back to the task queue.
## Features

- `gocrawler` is concurrency-safe, and utilises goroutines to achieve concurrency
- `gocrawler` uses channels to pass references to data between goroutines
- `gocrawler` uses channels to achieve throttled concurrency (see the sketch after this list)
- `gocrawler` fetches `robots.txt` and adheres to the policies of the robots.txt exclusion standard
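As a rough illustration of how channels can both throttle concurrency and pass data between goroutines, here is a self-contained sketch; `fetch` is a stand-in for an HTTP GET, and none of these names come from `gocrawler` itself:

```go
package main

import (
	"fmt"
	"sync"
)

// fetch is a placeholder for fetching a crawl target.
func fetch(u string) string { return "fetched " + u }

func main() {
	urls := []string{"https://a.example", "https://b.example", "https://c.example"}

	limit := make(chan struct{}, 2) // buffered channel caps in-flight workers at 2
	results := make(chan string)    // channel hands results between goroutines

	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			limit <- struct{}{}        // acquire a slot; blocks while 2 fetches are in flight
			defer func() { <-limit }() // release the slot for the next goroutine
			results <- fetch(u)
		}(u)
	}

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	for r := range results {
		fmt.Println(r)
	}
}
```

The buffered channel acts as a counting semaphore: its capacity is the maximum number of goroutines allowed to fetch at once, which is how a single channel primitive provides both data passing and throttling.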
## Pre-requisites

This README is prepared for OSX, and it mostly works on Linux as well. On Windows these instructions may vary significantly; I cannot say by how much, as I do not have a Windows machine to test them on.
- Golang is needed to build `gocrawler`. Steps to install Go can be found here.
- GNU Make. OSX ships with `make`; if you are on a different OS, please consult this link for installation.
## Quickstart

The steps below build and run `gocrawler`. All instructions assume that you are in the `gocrawler` directory and have the pre-requisites installed.
```sh
# run tests and build the binary
make
```
Now let's start `gocrawler`:

```sh
./gocrawler -a 127.0.0.1 -p 8080
```
Accessing help is just an argument away:

```sh
./gocrawler -h
```
API docs are available at http://127.0.0.1:8080/docs, assuming that you started `gocrawler` with the flags `-a 127.0.0.1 -p 8080`.
To run the tests:

```sh
make test
```