With the growth of the Internet, we often need to obtain large amounts of information, and much of it has to be crawled from websites. Among the many ways to do this, crawlers written in Golang can help us collect that information efficiently.
Golang is an intuitive, concise and efficient programming language well suited to high-concurrency, high-performance workloads, and crawling is exactly that kind of task, which makes Golang a good fit for writing crawlers. In this article, we introduce the basic process, commonly used libraries and core technologies for writing crawlers in Golang, to help beginners quickly master the fundamentals.
1. Basic steps for writing crawlers in golang
Before introducing the basic steps for writing crawlers in Golang, it helps to understand the basic structure of an HTML page, since that is what the crawler fetches and parses. A typical crawler does three things: sends an HTTP request, parses the returned HTML, and stores the extracted data.
Golang's standard library already provides the functions needed for HTTP requests. We only need to set the URL, request headers, cookies, and request parameters; with this basic information we can construct the HTTP request we need. The main code is as follows:
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
)

func main() {
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(body))
}
This code uses the http.Get function to send an HTTP request and then reads the response body. The key point is the defer statement, which runs when the function returns and closes the response body, avoiding resource leaks.
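The example above only sends a plain GET request. If you also need the request headers, cookies or parameters mentioned earlier, you can build the request with http.NewRequest instead of http.Get. The following is a minimal sketch; the header and cookie values are placeholders for illustration only:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
)

func main() {
    // Build the request manually so headers and cookies can be attached.
    req, err := http.NewRequest("GET", "http://www.baidu.com/s?wd=golang", nil)
    if err != nil {
        fmt.Println(err)
        return
    }
    // Placeholder header and cookie values for illustration.
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; my-crawler/1.0)")
    req.AddCookie(&http.Cookie{Name: "example", Value: "value"})

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(string(body))
}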
The response data returned by the HTTP request is an HTML document, which we need to parse in order to extract the data we want. In Golang we can use the GoQuery library to parse HTML documents; its API is modeled on jQuery's selector syntax and is easy to use.
The main parsing functions provided by GoQuery are Find, Filter, Each and Attr. The Find function looks up child elements that match a selector, Filter narrows a selection to the elements that meet a condition, Each iterates over all matched elements, and Attr reads an element's attributes. Taking the Baidu homepage as an example, the code is as follows:
package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        log.Fatal(err)
    }
    body := resp.Body
    defer body.Close()

    doc, err := goquery.NewDocumentFromReader(body)
    if err != nil {
        log.Fatal(err)
    }
    doc.Find("title").Each(func(i int, s *goquery.Selection) {
        fmt.Println(s.Text())
    })
}
In the above code, the goquery.NewDocumentFromReader function builds the document object from the response body, the Find method locates the title element, and the Each method iterates over all matching elements and prints their text.
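The example above only uses Find and Each. Filter and Attr work in the same style; the sketch below (the selector and attribute are generic examples, not tied to Baidu's actual markup) filters a page's links and reads their href attributes:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Find all links, keep only those that have an href attribute,
    // and read the attribute value with Attr.
    doc.Find("a").Filter("[href]").Each(func(i int, s *goquery.Selection) {
        href, ok := s.Attr("href")
        if ok {
            fmt.Println(s.Text(), "->", href)
        }
    })
}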
The last step is to save the obtained data. For data storage, we have many ways to choose from, such as databases, files, caches, etc.
Suppose, for example, that we want to save the crawled data to a CSV file. The steps are as follows:
package main

import (
    "encoding/csv"
    "log"
    "os"
)

func main() {
    file, err := os.Create("data.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    writer.Write([]string{"name", "address", "tel"})
    writer.Write([]string{"John Smith", "123 Main St, Los Angeles, CA 90012", "123-456-7890"})
    writer.Write([]string{"Jane Smith", "456 Oak Ave, San Francisco, CA 94107", "123-456-7891"})
}
The above code uses the os.Create function to create a file named data.csv, creates a CSV writer with csv.NewWriter, and writes each row to the file with writer.Write. The deferred writer.Flush call ensures all buffered rows are actually written before the program exits.
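Putting the three steps together, here is a minimal end-to-end sketch that fetches a page, extracts link texts and URLs with GoQuery, and writes them to a CSV file; the selector and column names are just examples:

package main

import (
    "encoding/csv"
    "log"
    "net/http"
    "os"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Step 1: send the HTTP request.
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Step 2: parse the HTML.
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Step 3: save the extracted data to a CSV file.
    file, err := os.Create("links.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    writer.Write([]string{"text", "href"})
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        href, _ := s.Attr("href")
        writer.Write([]string{s.Text(), href})
    })
}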
2. Commonly used libraries for writing crawlers in golang
Writing crawlers in Golang does not require writing a lot of low-level code yourself; several mature crawler libraries are available. The most commonly used ones are described below.
Gocolly is a lightweight crawler framework for Golang that provides many convenient methods for crawling data. It can automatically handle redirects, cookies, proxies, rate limits and so on, letting us focus on defining data-extraction rules. The following code demonstrates how to use Gocolly to get the Baidu homepage title:
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()
    c.OnHTML("head", func(e *colly.HTMLElement) {
        title := e.ChildText("title")
        fmt.Println(title)
    })
    c.Visit("http://www.baidu.com")
}
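Gocolly can also manage request headers and proxies, as mentioned above. A minimal sketch follows; the User-Agent string and proxy address are placeholders, and the SetProxy call is only needed if you actually route traffic through a proxy:

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Set a custom User-Agent on every outgoing request.
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (compatible; my-crawler/1.0)")
    })

    // Route all requests through a proxy; the address below is a placeholder.
    if err := c.SetProxy("http://127.0.0.1:8080"); err != nil {
        log.Fatal(err)
    }

    c.OnHTML("head", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("title"))
    })

    c.Visit("http://www.baidu.com")
}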
The go_commons crawler package (github.com/sundy-li/go_commons/crawler) is another helper library that wraps page fetching and soup-style parsing. The following example uses it to get the Baidu title:

package main

import (
    "fmt"

    "github.com/sundy-li/go_commons/crawler"
)

func main() {
    html := crawler.FetchHTML("http://www.baidu.com", "GET", nil, "")
    bs := crawler.NewSoup(html)
    title := bs.Find("title").Text()
    fmt.Println(title)
}
GoQuery itself, introduced earlier, can also serve as a lightweight crawling toolkit when combined with net/http. For example, to get the Baidu title:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("http://www.baidu.com")
    if err != nil {
        log.Fatal(err)
    }
    body := resp.Body
    defer body.Close()

    doc, err := goquery.NewDocumentFromReader(body)
    if err != nil {
        log.Fatal(err)
    }
    title := doc.Find("title").Text()
    fmt.Println(title)
}
3. Core technologies for writing crawlers in golang
Crawling pages one by one is slow. Golang's goroutines and channels make concurrent crawling straightforward: the following example visits several sites at the same time and collects their titles through a channel.

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    urls := []string{
        "http://www.baidu.com",
        "http://www.sogou.com",
        "http://www.google.com",
    }
    ch := make(chan string, len(urls))
    for _, url := range urls {
        go func(url string) {
            c := colly.NewCollector()
            c.OnHTML("head", func(e *colly.HTMLElement) {
                title := e.ChildText("title")
                ch <- title
            })
            c.Visit(url)
        }(url)
    }
    for range urls {
        title := <-ch
        fmt.Println(title)
    }
}

Each goroutine sends the extracted title into the channel, and the main goroutine reads one result per URL.
Crawling too fast can get your IP restricted. Gocolly's LimitRule lets you limit the number of parallel requests and add a random delay between them:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        RandomDelay: 5 * time.Second,
    })
    c.OnHTML("head", func(e *colly.HTMLElement) {
        title := e.ChildText("title")
        fmt.Println(title)
    })
    c.Visit("http://www.baidu.com")
}
Distributed crawling can effectively avoid being restricted by the target website and improve crawling efficiency. The basic idea is to assign different tasks to different nodes or machines, have them process the tasks independently, and then aggregate the results. Distributed crawling requires scheduling, communication and other techniques, so it is relatively complex. In practice we can use third-party libraries or cloud services to implement it.
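As a rough illustration only (the scheduler address, endpoints and task format below are hypothetical, not part of any particular library), a worker node could repeatedly pull a URL from a central task service over HTTP, crawl it, and report the result back for aggregation:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "net/url"
    "strings"
    "time"
)

// fetchTask asks a hypothetical central scheduler for the next URL to crawl.
func fetchTask(scheduler string) (string, error) {
    resp, err := http.Get(scheduler + "/next-task")
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return strings.TrimSpace(string(body)), nil
}

// reportResult sends the crawled data back to the scheduler for aggregation.
func reportResult(scheduler, task, data string) error {
    resp, err := http.PostForm(scheduler+"/result", url.Values{
        "task": {task},
        "data": {data},
    })
    if err != nil {
        return err
    }
    resp.Body.Close()
    return nil
}

func main() {
    scheduler := "http://scheduler.example.com" // hypothetical central node

    for {
        task, err := fetchTask(scheduler)
        if err != nil || task == "" {
            time.Sleep(5 * time.Second) // no work available, back off and retry
            continue
        }

        resp, err := http.Get(task)
        if err != nil {
            continue
        }
        body, _ := ioutil.ReadAll(resp.Body)
        resp.Body.Close()

        if err := reportResult(scheduler, task, string(body)); err != nil {
            fmt.Println("report failed:", err)
        }
    }
}

In a real deployment the scheduler would typically be a message queue or a service such as Redis, and the workers would also handle retries and deduplication.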
Conclusion
This article has introduced how to write a crawler in Golang, covering the basic steps, commonly used libraries and core technologies. Golang is a high-performance, concise and clear language that meets the needs of crawlers well. In practice, however, we still need to learn more techniques, in particular newer anti-crawling countermeasures, in order to complete crawling tasks successfully.