


Summary of garbled-data problems when crawling pages with a Node.js crawler
1. Non-UTF-8 page processing.
1. Background
windows-1251 encoding
For example, the Russian website https://vk.com/cciinniikk
It was a little embarrassing to run into this encoding only now.
This article mainly deals with converting between Windows-1251 (cp1251) and UTF-8. Other encodings, such as GBK, are not considered here.
2. Solution
1.
Use native JavaScript encoding conversion
I have not found a way to do this yet.
Converting from UTF-8 to windows-1251 is possible, though: http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript
// Maps a Unicode code point to the matching windows-1251 byte value.
var DMap = {
    0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15,
    16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31,
    32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47,
    48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63,
    64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79,
    80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95,
    96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111,
    112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127,
    1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154,
    1049: 201, 1045: 197, 1050: 202, 1028: 170, 160: 160, 1040: 192, 1051: 203, 164: 164,
    166: 166, 167: 167, 169: 169, 171: 171, 172: 172, 173: 173, 174: 174, 1053: 205,
    176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183, 8221: 148, 187: 187,
    1029: 189, 1056: 208, 1057: 209, 1058: 210, 8364: 136, 1112: 188, 1115: 158, 1059: 211,
    1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217,
    1031: 175, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149,
    1071: 223, 1072: 224, 8482: 153, 1073: 225, 8240: 137, 1118: 162, 1074: 226, 1110: 179,
    8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150, 1078: 230, 1119: 159,
    1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151,
    1083: 235, 1169: 180, 1084: 236, 1052: 204, 1085: 237, 1035: 142, 1086: 238, 1087: 239,
    1088: 240, 1089: 241, 1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134,
    1093: 245, 8470: 185, 1094: 246, 1054: 206, 1095: 247, 1096: 248, 8249: 139, 1097: 249,
    1098: 250, 1044: 196, 1099: 251, 1111: 191, 1055: 207, 1100: 252, 1038: 161, 8220: 147,
    1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184, 1039: 143,
    1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190
};

// Returns a string whose character codes are the windows-1251 byte values of s.
function UnicodeToWin1251(s) {
    var L = [];
    for (var i = 0; i < s.length; i++) {
        var ord = s.charCodeAt(i);
        if (!(ord in DMap))
            throw "Character " + s.charAt(i) + " isn't supported by win1251!";
        L.push(String.fromCharCode(DMap[ord]));
    }
    return L.join('');
}
This is a good idea. DMap actually stores the mapping from Unicode code points to windows-1251 byte values.
So I planned to do the same thing in the opposite direction.
Going the other way, however, I found that charCodeAt only returns the Unicode (UTF-16) code unit of a character in a JavaScript string, so it cannot be used to get at the raw bytes of another encoding. Since I am using Node.js, I decided to use a suitable module instead.
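The reverse direction becomes possible once the input is treated as raw bytes (a Buffer) rather than as string characters. The following is only a partial sketch under that assumption: `win1251ToUnicode` is a hypothetical name, and it covers ASCII, the contiguous Cyrillic range 0xC0..0xFF, and the letters Ё/ё. The remaining bytes would have to come from inverting the full DMap.

```javascript
// Partial sketch: decode windows-1251 bytes into a JavaScript (Unicode) string.
// Bytes outside the handled ranges are replaced with U+FFFD.
function win1251ToUnicode(buf) {
  var out = [];
  for (var i = 0; i < buf.length; i++) {
    var b = buf[i];
    if (b < 0x80) {
      out.push(String.fromCharCode(b));          // ASCII is unchanged
    } else if (b >= 0xC0) {
      out.push(String.fromCharCode(b + 0x350));  // 0xC0..0xFF -> U+0410..U+044F
    } else if (b === 0xA8) {
      out.push('\u0401');                        // Ё
    } else if (b === 0xB8) {
      out.push('\u0451');                        // ё
    } else {
      out.push('\uFFFD');                        // not covered by this sketch
    }
  }
  return out.join('');
}

// The bytes of "ценности" in windows-1251.
var bytes = Buffer.from([0xF6, 0xE5, 0xED, 0xED, 0xEE, 0xF1, 0xF2, 0xE8]);
console.log(win1251ToUnicode(bytes));
```

The key point is that the loop reads byte values from the Buffer, not character codes from a string, which is what charCodeAt could not provide.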
2.
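For installing and using the Node.js module iconv-lite, see https://www.npmjs.com/package/iconv-lite
According to its documentation, it is used roughly like this:
var iconv = require('iconv-lite');
var Buffer = require('buffer').Buffer;

// Convert windows-1251 encoded data to a JavaScript (Unicode) string.
// str1 should be the data returned by http.get, request, or a similar call.
// The request must be made with the right options, otherwise the result is wrong:
// besides the basic options, remember to pass encoding: 'binary'.
// For example, str1 holds the page text 'ценности ни в ' as windows-1251 bytes.
// Convert the received data to a Buffer, again using the 'binary' encoding;
// 'binary' carries the bytes between encodings untouched.
var buf = Buffer.from(str1, 'binary');
var str2 = iconv.decode(buf, 'win1251');
// str2 is now decoded. The result is a Unicode string by default,
// which is presumably what iconv-lite was designed for.
console.log(str2);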
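As a dependency-free cross-check of the same decoding step, Node's built-in TextDecoder can be tried. This swaps in a different API from iconv-lite and is only a sketch: it assumes a Node.js build that ships full ICU data (the default in recent official releases), otherwise the 'windows-1251' label may be rejected.

```javascript
// Assumption: this Node.js build includes full ICU data, so the
// 'windows-1251' label is available to TextDecoder.
var decoder = new TextDecoder('windows-1251');

// Raw windows-1251 bytes for the first word of the sample text, "ценности".
var raw = Buffer.from([0xF6, 0xE5, 0xED, 0xED, 0xEE, 0xF1, 0xF2, 0xE8]);

// decode() accepts a Buffer because Buffer is a Uint8Array.
var text = decoder.decode(raw);
console.log(text);
```

Here the input is already a Buffer of raw bytes, so no intermediate 'binary' string is needed.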
3.
For installing and using the Node.js module iconv, see https://github.com/bnoordhuis/node-iconv
(In practice, most of the work is installing node-gyp. I had not read the official instructions carefully at first.)
After a straightforward first attempt, the output is usually still garbled. It looks like this: пїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅ
http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8
The solution is to read the response with the 'binary' encoding (the default encoding is utf-8) and turn it back into a Buffer before converting:
var request = require('request');
var iconv = require('iconv');

request({
    uri: website_url,
    method: 'GET',
    encoding: 'binary'
}, function (error, response, body) {
    // Turn the 'binary' string back into raw bytes, then convert them to UTF-8.
    body = Buffer.from(body, 'binary');
    var conv = new iconv.Iconv('WINDOWS-1251', 'utf8');
    body = conv.convert(body).toString();
});
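Why the 'binary' encoding matters can be checked with the standard Buffer API alone. In the sketch below (sample bytes chosen for illustration), decoding windows-1251 bytes as UTF-8 replaces them with U+FFFD, whose UTF-8 bytes EF BF BD are exactly what shows up as "пїЅ" when viewed as windows-1251. Decoding as 'binary' keeps every byte.

```javascript
// Three windows-1251 bytes (the letters ц, е, н) as they would arrive on the wire.
var raw = Buffer.from([0xF6, 0xE5, 0xED]);

// Decoding as UTF-8 (the default) replaces invalid sequences with U+FFFD,
// so the original bytes cannot be recovered afterwards.
var asUtf8 = raw.toString('utf8');
var lostBytes = !Buffer.from(asUtf8, 'utf8').equals(raw);

// Decoding as 'binary' maps each byte to one character, so the round trip is exact.
var asBinary = raw.toString('binary');
var keptBytes = Buffer.from(asBinary, 'binary').equals(raw);

console.log(lostBytes, keptBytes);
```

Once the body has been decoded as UTF-8, no later conversion can repair it, which is why the encoding has to be set on the request itself.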
--> In addition, iconv has some build dependencies. See the official node-gyp instructions: https://github.com/TooTallNate/node-gyp
So:
First, a supported version of Python (such as 2.7) is required;
Second, a compiler toolchain is required (most errors occur on Windows).
Without them, the build fails with a compiler error.
If no specific (or newer) version is configured, the VS2005 build tools are used by default (so the error messages generally refer to VS2005 and Framework SDK 2.0)
Solution:
1. Install Visual Studio 2010
2. Specify the VS build tools version (for VS2012, use 2012)
(Sometimes the version is detected automatically, so this command is not always needed: npm config set msvs_version 2010 --global)
3. If it still reports that the Framework SDK cannot be found, add the SDK installation path to the system PATH environment variable
(2010 corresponds to SDK 4.0; similarly 2008 to SDK 3.5, and 2012 to SDK 4.5?)
Another thing to remember is that only the first matching entry in the environment variable is used!
For example, if the SDK 2.0 path was added to the system environment variable earlier and you now add the SDK 4.0 path after it, only the first one (SDK 2.0) takes effect
So:
Either delete the earlier entry,
or put the path you want to add in front of it
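To see which entry wins, the PATH value can be inspected in order. This is a small sketch, not part of the original setup steps: `firstPathEntry` and the sample directories are hypothetical.

```javascript
var path = require('path');

// Return the first PATH entry containing `needle` (case-insensitive), or null.
// `firstPathEntry` and the sample directories below are hypothetical.
function firstPathEntry(pathValue, needle, delimiter) {
  var entries = pathValue.split(delimiter || path.delimiter);
  for (var i = 0; i < entries.length; i++) {
    if (entries[i].toLowerCase().indexOf(needle.toLowerCase()) !== -1) {
      return entries[i];
    }
  }
  return null;
}

// Two SDK directories in the order they were added; the earlier one is found first.
var sample = 'C:\\SDK\\v2.0\\bin;C:\\SDK\\v4.0\\bin';
console.log(firstPathEntry(sample, 'sdk', ';'));

// With the real variable: firstPathEntry(process.env.PATH || '', 'sdk')
```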
2. Gzip page processing
Sometimes a page displays normally in the browser, but the response to a simulated request comes back garbled. Check the response headers of the browser's request: if they include Content-Encoding: gzip, the page is most likely gzip-compressed, and you need to add the following option to the request:
gzip:true
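When the HTTP client does not decompress for you, the same step can be done by hand with Node's built-in zlib module. The sketch below assumes the response body has already been collected into a Buffer and that the Content-Encoding header value is passed in. `decodeBody` is a hypothetical helper name.

```javascript
var zlib = require('zlib');

// Return the decompressed body when the response was gzip-encoded,
// otherwise return the buffer unchanged. `decodeBody` is a hypothetical helper.
function decodeBody(bodyBuffer, contentEncoding) {
  if ((contentEncoding || '').toLowerCase() === 'gzip') {
    return zlib.gunzipSync(bodyBuffer);
  }
  return bodyBuffer;
}

// Simulate a compressed response body.
var compressed = zlib.gzipSync(Buffer.from('<html>ok</html>', 'utf8'));
console.log(decodeBody(compressed, 'gzip').toString('utf8'));
```

Decompression has to happen on the raw bytes first. Character-set conversion, such as windows-1251 to UTF-8, comes afterwards.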
The above is the entire content of this article, I hope you all like it.
