-
Notifications
You must be signed in to change notification settings - Fork 41
Getting Started
zhengchun edited this page Dec 29, 2017
·
4 revisions
go get github.com/antchfx/antch
type item struct {
Title string `json:"title"`
Link string `json:"link"`
Desc string `json:"desc"`
}
Create a struct called dmozSpider
that implement Handler
interface.
type dmozSpider struct {}
func (s *dmozSpider) ServeSpider(c chan<- antch.Item, res *http.Response) {}
dmozSpider
will extracting data from received pages and pass data into Pipeline.
doc, err := antch.ParseHTML(res)
for _, node := range htmlquery.Find(doc, "//div[@id='site-list-content']/div") {
v := new(item)
v.Title = htmlquery.InnerText(htmlquery.FindOne(node, "//div[@class='site-title']"))
v.Link = htmlquery.SelectAttr(htmlquery.FindOne(node, "//a"), "href")
v.Desc = htmlquery.InnerText(htmlquery.FindOne(node, "//div[contains(@class,'site-descr')]"))
c <- v
}
htmlquery package, that supports XPath expression extracting data,
and then send Item toGo'Channel c
.
c <- v
Create new Pipeline called jsonOutputPipeline
, implements PipelineHandler
interface.
jsonOutputPipeline
serialize received Item data as JSON format print into console.
type jsonOutputPipeline struct {}
func (p *jsonOutputPipeline) ServePipeline(v Item) {
b, err := json.Marshal(v)
if err != nil {
panic(err)
}
os.Stdout.Write(b)
}
Create a new web crawler instance.
crawler := antch.NewCrawler()
You can enables middleware for HTTP cookies or robots.txt if you want.
- enable cookies middleware for web crawler.
crawler.UseCookies()
- you even registers custom middleware for web crawler.
crawler.UseMiddleware(CustomMiddleware())
Register dmozSpider
to the web crawler instance.
dmozSpider
will process all matches pages if its matches by dmoztools.net
pattern.
crawler.Handle("dmoztools.net", &dmozSpider{})
Register jsonOutputPipeline
to the web crawler instance.
crawler.UsePipeline(newTrimSpacePipeline(), newJsonOutputPipeline())
startURLs := []string{
"http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
"http://dmoztools.net/Computers/Programming/Languages/Python/Resources/",
}
crawler.StartURLs(startURLs)
go run main.go
Enjoy it.