Web scrapers are frequently blocked because they send default or otherwise inappropriate user-agents. This article demonstrates a simple way to mitigate this: using randomized fake user-agents in your Go Colly scrapers.
Understanding Fake User-Agents
User-agents are strings identifying the client making a web request. They convey information about the application, operating system (Windows, macOS, Linux), and browser (Chrome, Firefox, Safari). Websites use this information for various purposes, including security and analytics.
A typical user-agent string might look like this (Chrome on Android):
<code>User-Agent: Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36</code>
Go Colly's default user-agent:
<code>"User-Agent": "colly - https://www.php.cn/link/953bd83cb0b9c9f9dc4b3ba0bfc1b236",</code>
easily identifies your scraper, increasing the risk of being blocked. Therefore, employing a custom, randomized user-agent is crucial.
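Before adding randomization, note that the default can be overridden for the whole collector at construction time via Colly's UserAgent option. A minimal sketch (the user-agent string below is just an example):
<code class="language-go">// Sketch: replace Colly's default user-agent collector-wide.
c := colly.NewCollector(
	colly.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"),
)</code>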
Implementing a Fake User-Agent with Go Colly
Custom request headers, including the user-agent, are set with the OnRequest() callback, which Colly runs before every request. The example below uses it to apply a single custom user-agent:
<code class="language-go">package main import ( "bytes" "log" "github.com/gocolly/colly" ) func main() { c := colly.NewCollector(colly.AllowURLRevisit()) c.OnRequest(func(r *colly.Request) { r.Headers.Set("User-Agent", "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148") }) c.OnResponse(func(r *colly.Response) { log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1)) }) for i := 0; i < 5; i++ { c.Visit("httpbin.org/headers") } }</code>
This sets a single user-agent for all requests. For more robust scraping, use a randomized approach.
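Before reaching for a dedicated package, rotation can be hand-rolled from a small pool of strings. A minimal sketch, assuming you maintain the pool yourself (the strings below are illustrative):
<code class="language-go">package main

import (
	"log"
	"math/rand"

	"github.com/gocolly/colly"
)

func main() {
	// Illustrative pool; keep it stocked with current, realistic strings.
	userAgents := []string{
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
		"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
		"Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
	}

	c := colly.NewCollector(colly.AllowURLRevisit())

	// Pick a random pool entry before each request.
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
	})

	for i := 0; i < 5; i++ {
		if err := c.Visit("http://httpbin.org/headers"); err != nil {
			log.Println(err)
		}
	}
}</code>
The drawback is maintenance: stale strings stand out, which is exactly what the package in the next section avoids.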
Rotating Through Random User-Agents
The github.com/lib4u/fake-useragent package simplifies random user-agent selection.
<code class="language-go">package main import ( "bytes" "fmt" "log" "github.com/gocolly/colly" uaFake "github.com/lib4u/fake-useragent" ) func main() { ua, err := uaFake.New() if err != nil { fmt.Println(err) } c := colly.NewCollector(colly.AllowURLRevisit()) c.OnRequest(func(r *colly.Request) { r.Headers.Set("User-Agent", ua.Filter().GetRandom()) }) c.OnResponse(func(r *colly.Response) { log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1)) }) for i := 0; i < 5; i++ { c.Visit("httpbin.org/headers") } }</code>
This code snippet retrieves a random user-agent for each request.
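Since httpbin.org/headers echoes the request headers back, each logged line should show a different User-Agent value, along the lines of this illustrative output (abridged; the exact strings vary per run):
<code>{"headers": {"Accept-Encoding": "gzip", "Host": "httpbin.org", "User-Agent": "Mozilla/5.0 (...) ..."}}</code>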
Using Specific Fake User-Agents
The github.com/lib4u/fake-useragent package also provides filtering options. For example, to use a random desktop Chrome user-agent:
<code class="language-go">r.Headers.Set("User-Agent", ua.Filter().Chrome().Platform(uaFake.Desktop).Get())</code>
Remember to always respect a website's robots.txt and terms of service when scraping. Random user-agents are one technique among many for responsible web scraping; consider proxies and other header management strategies as well.
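As one concrete combination, Colly ships a round-robin proxy switcher in github.com/gocolly/colly/proxy, and browser-like headers can be added in the same OnRequest callback used above. A minimal sketch (the proxy addresses are placeholders):
<code class="language-go">package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	// Rotate requests across a pool of proxies (placeholder addresses).
	rp, err := proxy.RoundRobinProxySwitcher(
		"socks5://127.0.0.1:1337",
		"socks5://127.0.0.1:1338",
	)
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)

	// Send a few browser-like headers alongside the user-agent.
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
		r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
	})

	if err := c.Visit("http://httpbin.org/headers"); err != nil {
		log.Println(err)
	}
}</code>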