Go language practice: How to remove HTML tags?
In web development, we often need to strip HTML tags to obtain plain text, for example when analyzing or processing comments and articles. Go offers several ways to do this, and this article introduces a few of them.
Method 1: Use string replacement
The Go standard library provides the strings package for working with strings. We can use strings.ReplaceAll() to put spaces around every tag, split the result into tokens with strings.Fields(), and drop the tokens that are tags, which leaves the plain text content. The specific implementation code is as follows:
package main

import (
	"fmt"
	"strings"
)

func main() {
	html := "<html><head><title>Test Page</title></head><body><p>Hello, Go!</p></body></html>"

	// Pad every angle bracket with a space so that tags and text become separate tokens.
	text := strings.ReplaceAll(html, "<", " <")
	text = strings.ReplaceAll(text, ">", "> ")

	// Split on whitespace and keep only the tokens that are not tags.
	var words []string
	for _, field := range strings.Fields(text) {
		if strings.HasPrefix(field, "<") && strings.HasSuffix(field, ">") {
			continue
		}
		words = append(words, field)
	}
	fmt.Println(strings.Join(words, " "))
}
In the above code, we first use strings.ReplaceAll() to replace every left angle bracket ("<") with a space followed by a left angle bracket, and every right angle bracket (">") with a right angle bracket followed by a space. This puts a space between each tag and the surrounding text. Next, strings.Fields() splits the string on whitespace, so each tag and each word becomes its own substring. We skip the substrings that start with "<" and end with ">", because those are tags, and finally use strings.Join() to connect the remaining words with single spaces. Because strings.Fields() never returns empty or whitespace-only substrings, no separate call to strings.TrimSpace() is needed.
Running the above code produces the following output:
Test Page Hello, Go!
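The above code is simple to implement, but there are several problems:
Tags with attributes, such as <a href="/docs" class="nav">, contain spaces, so they are split into several pieces and are not removed cleanly.
The contents of script and style elements are kept as if they were ordinary text.
HTML entities such as &amp; are not decoded, and comments are not handled.
Considering these issues, we can use the second method.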
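The limitations of the padding approach are easy to reproduce. The following is a minimal sketch, assuming a hypothetical helper named stripTagsNaive that pads the angle brackets, splits on whitespace, and drops every token that looks like a complete tag. The sample inputs are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTagsNaive is a hypothetical helper that mirrors the padding approach:
// pad angle brackets with spaces, split on whitespace, and drop every token
// that looks like a complete tag such as "<p>" or "</p>".
func stripTagsNaive(s string) string {
	s = strings.ReplaceAll(s, "<", " <")
	s = strings.ReplaceAll(s, ">", "> ")
	var words []string
	for _, f := range strings.Fields(s) {
		if strings.HasPrefix(f, "<") && strings.HasSuffix(f, ">") {
			continue
		}
		words = append(words, f)
	}
	return strings.Join(words, " ")
}

func main() {
	// A tag without attributes is removed as expected.
	fmt.Println(stripTagsNaive("<p>Hello, Go!</p>"))
	// A tag with attributes is split at its spaces, so fragments of it survive.
	fmt.Println(stripTagsNaive(`<a href="/docs" class="nav">Docs</a>`))
	// Script contents are treated as ordinary text.
	fmt.Println(stripTagsNaive("<script>alert(1)</script><p>Hi</p>"))
}
```

The first call prints only the text, the second leaves the pieces of the opening tag in the result, and the third keeps the script body. This is why a real HTML parser is the safer choice for anything beyond very simple markup.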
Method 2: Use the Goquery library
Goquery is a third-party HTML parsing and manipulation library for Go that provides a convenient, jQuery-like API. We can use it to parse the HTML and collect the text of the elements we want, which gives us the plain text content. The specific implementation code is as follows:
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := "<html><head><title>Test Page</title></head><body><p>Hello, Go!</p></body></html>"

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatal(err)
	}

	// Collect the text of every element that has no child elements,
	// skipping script and style elements.
	var text string
	doc.Find(":not(script):not(style)").Each(func(_ int, sel *goquery.Selection) {
		if sel.Children().Length() == 0 {
			text += sel.Text() + " "
		}
	})
	fmt.Println(strings.TrimSpace(text))
}
In the above code, goquery.NewDocumentFromReader() parses the HTML into a goquery.Document object. Next, doc.Find() selects every element except script and style, and sel.Children().Length() checks whether the current element has no child elements, that is, whether it is a leaf element. If so, its text is appended to the text variable. Finally, strings.TrimSpace() removes the whitespace at both ends of the string to give the final plain text content. Note that this check skips text that sits directly inside an element that also has child elements: for <p>Hello <b>Go</b></p>, only "Go" is collected.
Running the above code produces the following output:
Test Page Hello, Go!
Because goquery is built on a real HTML parser, it copes with tags that carry attributes and with other variations in tag format, and the code is easier to read and maintain.
This article introduced two methods for removing HTML tags. Regular expressions are another commonly used approach. In practice, choose the method that best fits the situation.
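For completeness, the regular-expression approach can be sketched with the standard library alone. This is a rough illustration rather than a full solution: the helper name stripTagsRegexp, the pattern, and the sample input are assumptions made for this example, and the pattern only approximates what a tag looks like.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// tagPattern matches anything that looks like a tag: "<", then any run of
// characters other than ">", then ">". It is an approximation, not a parser.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// stripTagsRegexp is a hypothetical helper that replaces every tag-like
// match with a space, decodes HTML entities, and collapses whitespace.
func stripTagsRegexp(s string) string {
	s = tagPattern.ReplaceAllString(s, " ")
	s = html.UnescapeString(s)
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	page := `<html><head><title>Test Page</title></head><body><p class="msg">Hello, <b>Go</b> &amp; friends!</p></body></html>`
	fmt.Println(stripTagsRegexp(page)) // prints "Test Page Hello, Go & friends!"
}
```

Unlike the padding approach, this handles tags with attributes and decodes entities, but it still keeps the contents of script and style elements and can be confused by a ">" inside an attribute value. For untrusted or complex HTML, a parser-based method such as goquery remains the more reliable option.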