
Fixing garbled Chinese text when crawling web pages with Node.js

WBOY
Release: 2016-05-16 16:14:49

When Node.js crawls Chinese web pages that are not UTF-8 encoded, the text comes out garbled. For example, NetEase's homepage is encoded in gb2312, so the following code prints garbled characters:


var request = require('request')
var url = 'http://www.163.com'

// By default request decodes the response body as a UTF-8 string
request(url, function (err, res, body) {
  console.log(body)
})
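To see why the text is garbled, it may help to decode the same bytes two ways. The sketch below is only an illustration: the four bytes are assumed sample input (the GBK/GB2312 bytes for 网易) rather than a real response, and it uses Node's built-in `TextDecoder` instead of a library, which assumes a Node.js build with full ICU.

```javascript
// Bytes of "网易" in GBK/GB2312 (assumed sample input, not fetched from the network)
var gbkBytes = Buffer.from([0xCD, 0xF8, 0xD2, 0xD7])

// Reading the bytes as UTF-8 is what happens when the body is treated as a plain string
var wrong = gbkBytes.toString('utf8')

// Decoding with the page's real encoding recovers the text.
// Assumption: TextDecoder('gbk') is available, which needs a full-ICU Node.js build.
var right = new TextDecoder('gbk').decode(gbkBytes)

console.log(JSON.stringify(wrong)) // replacement characters (U+FFFD)
console.log(right)
```

The bytes themselves are fine. Only the choice of decoder differs, which is why the fix is to obtain the raw bytes and decode them explicitly.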


The iconv-lite module can decode the response with the correct encoding.

Installation


npm install iconv-lite

While we are at it, let's also set a browser-like User-Agent so the site is less likely to block our requests:

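var originRequest = require('request')
var iconv = require('iconv-lite')

var headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
}

// Wrap request so every call sends our headers and returns the raw bytes
function request (url, callback) {
  var options = {
    url: url,
    encoding: null, // null makes body a Buffer instead of a UTF-8 string
    headers: headers
  }
  originRequest(options, callback)
}

var url = 'http://www.163.com'

request(url, function (err, res, body) {
  if (err) throw err
  // Decode the raw bytes using the page's actual encoding
  var html = iconv.decode(body, 'gb2312')
  console.log(html)
})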

The garbled text is gone.
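The code above hard-codes 'gb2312' because we already know NetEase's encoding. For other sites, one possible approach is to read the charset from the Content-Type header or from a meta tag near the top of the page. The following is a minimal sketch under that assumption; `detectCharset` is a hypothetical helper, not part of request or iconv-lite, and real pages may declare their encoding in ways it does not handle.

```javascript
// Hypothetical helper: pick a charset label from the Content-Type header,
// then from a <meta> tag in the first bytes of the body, then a default.
function detectCharset (contentType, bodyBuffer, fallback) {
  var fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType || '')
  if (fromHeader) return fromHeader[1].toLowerCase()

  // latin1 maps every byte to one character, so the ASCII markup survives intact
  var head = bodyBuffer.slice(0, 1024).toString('latin1')
  var fromMeta = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(head)
  if (fromMeta) return fromMeta[1].toLowerCase()

  return fallback || 'utf-8'
}

// Sample inputs (made up for illustration)
var sampleBody = Buffer.from('<html><head><meta charset="gb2312"></head></html>')
console.log(detectCharset('text/html; charset=GBK', sampleBody)) // header wins
console.log(detectCharset('text/html', sampleBody))              // falls back to <meta>
console.log(detectCharset(undefined, Buffer.from('<p>hi</p>')))  // falls back to default
```

Inside the request callback, the result could be passed to iconv.decode in place of the literal, for example `iconv.decode(body, detectCharset(res.headers['content-type'], body))`.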

Use cheerio to parse HTML

cheerio can be thought of, roughly, as jQuery selectors for the server side. Selecting elements with it is far more intuitive than picking HTML apart with regular expressions.

Installation


npm install cheerio

Then load the decoded HTML into cheerio:

var cheerio = require('cheerio')

request(url, function (err, res, body) {
  var html = iconv.decode(body, 'gb2312')
  var $ = cheerio.load(html)
  console.log($('h1').text())
  console.log($('h1').html())
})

The output is as follows

网易
&#x7F51;&#x6613;

Here a new problem appears. The markup returned by $('h1').html() is entity-encoded: 网易 (NetEase) has become &#x7F51;&#x6613;, which makes further string processing awkward.
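Entity-encoded output of this kind is a sequence of numeric character references, each holding the code point of one character. The sketch below shows how such references map back to text; `decodeNumericEntities` is a hypothetical helper written for illustration, it handles only numeric references (not named ones such as `&amp;`), and the sample string assumes the heading text is 网易.

```javascript
// Hypothetical helper: turn numeric character references back into characters
function decodeNumericEntities (text) {
  return text
    // Hexadecimal form, e.g. &#x7F51;
    .replace(/&#x([0-9a-f]+);/gi, function (match, hex) {
      return String.fromCodePoint(parseInt(hex, 16))
    })
    // Decimal form, e.g. &#32593;
    .replace(/&#(\d+);/g, function (match, dec) {
      return String.fromCodePoint(parseInt(dec, 10))
    })
}

console.log(decodeNumericEntities('&#x7F51;&#x6613;')) // → 网易
console.log(decodeNumericEntities('&#32593;&#26131;')) // → 网易
```

Post-processing like this works, but it is usually simpler to stop the encoding from happening in the first place.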

Solving the "garbled" output of cheerio's .html()

The cheerio documentation shows that this entity encoding can be turned off with an option.

Change

var $ = cheerio.load(html)

to

var $ = cheerio.load(html, {decodeEntities: false})

That’s it, the complete code is as follows:


var originRequest = require('request') 
var cheerio = require('cheerio') 
var iconv = require('iconv-lite') 
var headers = { 
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
}

function request (url, callback) { 
  var options = {
    url: url,
    encoding: null,
    headers: headers
  }
  originRequest(options, callback)
}

var url = 'http://www.163.com'

request(url, function (err, res, body) {
  var html = iconv.decode(body, 'gb2312')
  var $ = cheerio.load(html, {decodeEntities: false})
  console.log($('h1').text())
  console.log($('h1').html())
})

source:php.cn