
Writing a non-directed crawler in Golang


1. Foreword

With the growth of the Internet, web crawlers are used in more and more scenarios. In daily life we can use them to collect all kinds of information, such as news, stock quotes, weather, movies, and music, and they play an especially important role in big data analysis and artificial intelligence. This article explains how to write a non-directed crawler in Golang, that is, a crawler that is not tied to a specific target website, and use it to gather information from the Internet.

2. Introduction to golang

Golang is a programming language developed by Google. Thanks to its built-in concurrency, high performance, and simplicity, it is increasingly favored by programmers. The Go version used in this article is 1.14.2.

3. Implementation ideas

This crawler is mainly divided into the following steps:

  1. Get the starting URL

The starting URL can be obtained by entering it manually, reading it from a file, reading it from a database, and so on.

  2. Send an HTTP request

Send an HTTP request with GET or POST to obtain the response data.

  3. Parse the response data

Parse the data with regular expressions or a third-party library, depending on the format of the response.

  4. Store the data

You can store data in files, in databases, or use other storage methods, depending on your needs.

  5. Parse new URLs

Extract new URLs from the hyperlinks and other information in the response data; these become the next URLs to be crawled.

  6. Repeat the above steps

For each new URL, send the HTTP request again, parse the response data, store the data, and extract further URLs, repeating until no new URLs remain.

4. Code Implementation

In Go, the net/http package is used to send HTTP requests, and the regexp package or a third-party library is used to parse the response data; this article uses the goquery library.
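The snippets below all live in a single main package and share a couple of global variables. A minimal sketch of the package header they assume is shown here; the exact import list depends on which pieces you use.

package main

import (
    "bufio"
    "container/list"
    "flag"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "regexp"
    "strings"
    "time"

    "github.com/PuerkitoBio/goquery"
)

// Global state shared by the functions below
var (
    startUrl string       // starting URL, filled in by the -url flag
    client   *http.Client // shared HTTP client, configured in init
)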

  1. Initialization function

First, we define an init function that obtains the starting URL, sets up the HTTP client, and performs other setup.

func init() {
    // Get the starting URL from the command line
    flag.StringVar(&startUrl, "url", "", "please enter the starting URL")
    flag.Parse()

    // Set up the HTTP client
    client = &http.Client{
        Timeout: 30 * time.Second,
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
}
  2. HTTP request function

Define a function that sends an HTTP request and returns the response data.

func GetHtml(url string) (string, error) {
    // Send a GET request with the shared client
    resp, err := client.Get(url)
    if err != nil {
        log.Println(err)
        return "", err
    }
    defer resp.Body.Close()

    // Read the whole response body
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Println(err)
        return "", err
    }

    return string(body), nil
}
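Many sites reject requests that carry no browser-like headers. If needed, GetHtml can be adapted to build the request explicitly and set a User-Agent header. This is a sketch of that variant; the header value is only a placeholder.

func GetHtmlWithUA(url string) (string, error) {
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return "", err
    }
    // Placeholder User-Agent; replace it with whatever identifies your crawler
    req.Header.Set("User-Agent", "my-crawler/0.1")

    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}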
  3. Response parsing function

The goquery library is used to parse the response data. The implementation is as follows:

func ParseSingleHTML(html string, query string) []string {
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Println(err)
        return nil
    }

    result := make([]string, 0)
    doc.Find(query).Each(func(i int, selection *goquery.Selection) {
        // For elements with an href attribute (e.g. <a>), collect the link;
        // otherwise collect the element's text (e.g. <title>).
        if href, ok := selection.Attr("href"); ok {
            result = append(result, href)
        } else {
            result = append(result, strings.TrimSpace(selection.Text()))
        }
    })

    return result
}
  4. Data storage function

Define a function that writes the data to a file.

func SaveData(data []string) error {
    // Open data.txt in append mode, creating it if necessary
    file, err := os.OpenFile("data.txt", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        log.Println(err)
        return err
    }
    defer file.Close()

    writer := bufio.NewWriter(file)
    for _, line := range data {
        // Write one entry per line
        _, err := writer.WriteString(line + "\n")
        if err != nil {
            log.Println(err)
            return err
        }
    }
    return writer.Flush()
}
  5. URL parsing function

Use a regular expression to extract new URLs from hyperlinks and normalize them to absolute URLs.

func ParseHref(url, html string) []string {
    // Match href attributes inside <a> tags
    re := regexp.MustCompile(`<a[\s\S]+?href="(.*?)"[\s\S]*?>`)
    matches := re.FindAllStringSubmatch(html, -1)

    result := make([]string, 0)
    for _, match := range matches {
        href := match[1]
        if strings.HasPrefix(href, "//") {
            // Protocol-relative link
            href = "http:" + href
        } else if strings.HasPrefix(href, "/") {
            // Root-relative link: join it with the page URL
            href = strings.TrimSuffix(url, "/") + href
        } else if strings.HasPrefix(href, "http://") || strings.HasPrefix(href, "https://") {
            // Already an absolute URL, keep it as is
        } else {
            // Relative link: append it to the page URL
            href = url + "/" + href
        }
        result = append(result, href)
    }

    return result
}
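The manual prefix handling above covers the common cases, but the standard library's net/url package resolves relative references more robustly. A sketch of that alternative is shown below; it requires adding "net/url" to the imports.

func ResolveHref(baseUrl, href string) (string, error) {
    base, err := url.Parse(baseUrl)
    if err != nil {
        return "", err
    }
    ref, err := url.Parse(href)
    if err != nil {
        return "", err
    }
    // ResolveReference handles "//", "/", "../" and absolute URLs uniformly
    return base.ResolveReference(ref).String(), nil
}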
  6. Main function

Finally, we need to define a main function to implement the entire crawler process.

func main() {
    // Make sure a starting URL was supplied
    if startUrl == "" {
        fmt.Println("please specify a starting URL with -url")
        return
    }

    // Initialize the queue of URLs to visit
    queue := list.New()
    queue.PushBack(startUrl)

    // Initialize the set of visited URLs
    visited := make(map[string]bool)

    // Crawl loop
    for queue.Len() > 0 {
        // Pop a URL from the queue
        elem := queue.Front()
        queue.Remove(elem)
        url, ok := elem.Value.(string)
        if !ok {
            log.Println("invalid URL in queue")
            continue
        }

        // Skip URLs that have already been visited
        if visited[url] {
            continue
        }
        visited[url] = true

        // Send the HTTP request and get the response data
        html, err := GetHtml(url)
        if err != nil {
            continue
        }

        // Parse the response data and enqueue the new URLs
        hrefs := ParseHref(url, html)
        for _, href := range hrefs {
            if !visited[href] {
                queue.PushBack(href)
            }
        }

        // Store the page title in the data file
        data := ParseSingleHTML(html, "title")
        err = SaveData(data)
        if err != nil {
            continue
        }
    }
}
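Assuming the code is saved as crawler.go (a file name chosen here for illustration), it can be built and run like this; the page titles collected during the crawl are appended to data.txt.

go mod init crawler                        # module name chosen for illustration
go get github.com/PuerkitoBio/goquery
go run crawler.go -url "https://example.com"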

5. Summary

This is the basic process and implementation of a non-directed crawler written in Golang. Of course, it is only a simple example; in real-world development you also need to consider anti-crawler strategies, thread safety, and other issues. I hope it is helpful to readers.
