I use the request module to crawl images. To keep my IP from being blocked I route requests through a proxy, but with the proxy enabled I always get an error. (Node.js, using the request and async modules.)
function download(item, cb) {
  request({
    url: item.img,
    // pick a random proxy from the pool
    proxy: proxys[Math.random() * proxys.length | 0],
    method: 'GET',
    timeout: 5000
  }, function (err, response, body) {
    if (response && response.statusCode == 200) {
      cb(null, item);
    }
  }).on('error', function () {
    console.log('Download error, possibly a pipe problem; requesting again...');
    download(item, cb);
    // cb(null, item);
  }).pipe(fs.createWriteStream(fileDir2 + item.name + '.' + item.url_token + '.jpg'));
}
download(item, cb) is invoked with cb, the callback of the async control flow:
async.eachLimit(items, 10, function (item, cb) {
  download(item, cb);
}, function () { ... });
Every time, after downloading a few files, the process throws an error and stops:
throw new assert.AssertionError({
^
AssertionError: 258 == 0
at ClientRequest.onConnect (C:\Users\fox\WebstormProjects\nodejs\实战\爬虫\node_modules\tunnel-agent\index.js:160:14)
If I remove the proxy option from the request, nothing goes wrong. And if I change download so that on failure it calls cb() directly instead of retrying, no error is thrown even when a request fails:
.on('error', function () {
  console.log('Download error, possibly a pipe problem; requesting again...');
  // download(item, cb);
  cb(null, item);
})
Please take a look and see if you can help me solve this. I have been thinking about it and troubleshooting for a long time, and I still don't know why.
I implemented almost the same thing before: downloading a large batch of pictures directly, it would get part-way through and then throw an error. In the end I wrapped the calls in a layer of setTimeout, and that worked well. I wrote a blog post about this, "nodejs batch downloading pictures"; you can refer to it.
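The setTimeout wrapper described above might look something like this sketch. The download body and the 200 ms delay here are placeholders of my own, not the blog's actual code:

```javascript
// Stand-in for the question's download(item, cb); the real version
// would issue the HTTP request and pipe the response to a file.
function download(item, cb) {
  cb(null, item);
}

// Wrap each download in setTimeout so requests are spaced out
// instead of all firing at once.
function delayedDownload(item, cb) {
  setTimeout(function () {
    download(item, cb);
  }, 200); // 200 ms between requests is an arbitrary choice
}

var finished = false;
delayedDownload({ img: 'http://example.com/a.jpg' }, function (err, item) {
  finished = true;
});
```

Spacing the requests out reduces the chance of tripping the server's rate limiting, which is often what kills long batch downloads part-way through.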
When you hit this kind of problem, the program needs a retry mechanism. A good retry mechanism increases the sleep time appropriately on each successive attempt, to give the request a better chance of succeeding.
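A retry with increasing sleep time could be sketched like this. The helper name retryWithBackoff and the doubling schedule are my own choices, not from the answer:

```javascript
// Minimal retry sketch: wrap any task(cb)-style function and retry on
// error, doubling the sleep time before each new attempt.
function retryWithBackoff(task, maxRetries, baseDelayMs, cb) {
  var attempt = 0;
  function run() {
    task(function (err, result) {
      if (!err) return cb(null, result);
      attempt += 1;
      if (attempt > maxRetries) return cb(err); // give up
      // sleep longer each time: base * 2^(attempt - 1)
      setTimeout(run, baseDelayMs * Math.pow(2, attempt - 1));
    });
  }
  run();
}

// Usage: a simulated flaky task that fails twice, then succeeds.
var calls = 0;
var finalResult = null;
retryWithBackoff(function (cb) {
  calls += 1;
  if (calls < 3) return cb(new Error('transient failure'));
  cb(null, 'ok');
}, 5, 50, function (err, result) {
  finalResult = result;
});
```

In the question's download function, this would replace the bare download(item, cb) call inside the error handler, so a persistently failing proxy eventually gives up instead of recursing forever.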