# gocrawler

`gocrawler` is an API layer for crawling domains. A crawl request adds a domain to a worker queue configured with a given depth, so crawling stops once that depth is reached. Crawling is restricted to the requested domain, because following external domains (in addition to the requested one) can descend into a near-infinite loop: for example, when a crawl request is received for https://google.com, any child links outside of google.com are not added back to the task queue.
## Features

- `gocrawler` is concurrency-safe, and utilises goroutines to achieve concurrency
- `gocrawler` uses channels to pass references to data between goroutines
- `gocrawler` uses channels to achieve throttled concurrency (see the sketch after this list)
- `gocrawler` fetches `robots.txt` and adheres to the policies of the robots.txt exclusion standard
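As a rough illustration of how channels can both throttle concurrency and pass data between goroutines, here is a self-contained sketch; `fetch` is a stand-in for an HTTP GET, and none of these names come from `gocrawler` itself:

```go
package main

import (
	"fmt"
	"sync"
)

// fetch is a placeholder for fetching a crawl target.
func fetch(u string) string { return "fetched " + u }

func main() {
	urls := []string{"https://a.example", "https://b.example", "https://c.example"}

	limit := make(chan struct{}, 2) // buffered channel caps in-flight workers at 2
	results := make(chan string)    // channel hands results between goroutines

	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			limit <- struct{}{}        // acquire a slot; blocks while 2 fetches are in flight
			defer func() { <-limit }() // release the slot for the next goroutine
			results <- fetch(u)
		}(u)
	}

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	for r := range results {
		fmt.Println(r)
	}
}
```

The buffered channel acts as a counting semaphore: its capacity is the maximum number of goroutines allowed to fetch at once, which is how a single channel primitive provides both data passing and throttling.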
## Pre-requisites

This README is prepared for OSX, and it mostly works on Linux as well. On Windows these instructions may vary significantly; I cannot say by how much, as I do not have a Windows machine to test them on.
- Golang is needed to build `gocrawler`. Steps to install Go can be found here.
- GNU Make. OSX ships with `make`; if you are on a different OS, please consult this link for installation.
## Quickstart

The steps below build and run `gocrawler`. All instructions assume that you are in the `gocrawler` directory and have the pre-requisites installed.
```sh
# run tests and build the binary
make
```
Now let's start `gocrawler`:

```sh
./gocrawler -a 127.0.0.1 -p 8080
```
Accessing help is just an argument away:

```sh
./gocrawler -h
```
API docs are available at http://127.0.0.1:8080/docs, assuming that you started `gocrawler` with the flags `-a 127.0.0.1 -p 8080`.
To run the tests:

```sh
make test
```