A simple, higher-level interface for Go web scraping.
When scraping with Go, I find myself redefining tree traversal and other utility functions.
This package is a place to put some simple tools that build on top of the Go HTML parsing library (golang.org/x/net/html).
For the full interface, check out the godoc.
Scrape defines traversal functions like `Find` and `FindAll` while attempting to be generic. It also defines convenience functions such as `Attr` and `Text`. For example:
```go
// Parse the page
root, err := html.Parse(resp.Body)
if err != nil {
	// handle error
}
// Search for the title
title, ok := scrape.Find(root, scrape.ByTag(atom.Title))
if ok {
	// Print the title
	fmt.Println(scrape.Text(title))
}
```

A full example, scraping the Hacker News front page:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// request and parse the front page
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}
	// define a matcher
	matcher := func(n *html.Node) bool {
		// must check for nil values
		if n.DataAtom == atom.A && n.Parent != nil && n.Parent.Parent != nil {
			return scrape.Attr(n.Parent.Parent, "class") == "athing"
		}
		return false
	}
	// grab all articles and print them
	articles := scrape.FindAll(root, matcher)
	for i, article := range articles {
		fmt.Printf("%2d %s (%s)\n", i, scrape.Text(article), scrape.Attr(article, "href"))
	}
}
```
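For simpler selections you often don't need a hand-written matcher: the package also ships matcher constructors such as `ByTag`, `ById`, and `ByClass` (see the godoc). Below is a minimal sketch using `ByClass`, assuming it matches any node whose class attribute includes the given class; the `athing` class comes from the Hacker News markup used in the example above, and the rest is illustrative:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
)

func main() {
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}
	// ByClass builds a matcher for nodes carrying the "athing" class;
	// on Hacker News each story row has this class.
	rows := scrape.FindAll(root, scrape.ByClass("athing"))
	fmt.Printf("found %d story rows\n", len(rows))
	for i, row := range rows {
		// Text flattens all text beneath the node into one string
		fmt.Printf("%2d %s\n", i, scrape.Text(row))
	}
}
```

Because a matcher is just a `func(*html.Node) bool`, these constructors compose freely with hand-written closures like the one in the full example.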