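The src of the image resource that my crawler extracts from the HTML looks like this:
But when my code turns that resource into a link and downloads the image, the request fails with a 400 error.
However, when I tested in Chrome whether the link exists, I found that what the site's server actually recognizes is:
In other words, the image link I get from the page is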
http://www.neofactory.co.jp/i... 2.jpg
whereas the link that actually returns the image is
http://www.neofactory.co.jp/i...
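Could anyone advise what I should do next? I have searched a lot online and still have not found a solution.
P.S. Strangely, in Firefox the first link above also returns the image, which I really cannot explain.
Code: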
import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// JXDocument comes from the JsoupXpath library; its package name depends on the version in use.

public class Image
{
    private String urlNeo="";

    public String getUrlNeo() {
        return urlNeo;
    }

    public void setUrlNeo(String urlNeo) {
        this.urlNeo = urlNeo;
    }

    // Fetches the page at urlNeo and returns its HTML as one string (line breaks dropped).
    public String getHtml() throws Exception{
        ArrayList<String> list=new ArrayList<String>();
        String line="";
        String Html="";
        URL url=new URL(urlNeo);
        URLConnection connection=url.openConnection();
        InputStream in=connection.getInputStream();
        InputStreamReader isr=new InputStreamReader(in);
        BufferedReader br=new BufferedReader(isr);
        while((line=br.readLine())!=null){
            Html+=line;
            list.add(line);
        }
        br.close();
        isr.close();
        in.close();
        return Html;
    }

    // Selects the nodes that contain the product images and returns them as text.
    public String getImgSrc() throws Exception{
        String html=getHtml();
        String IMGURL_REG_xpath="//p[1]/p[2]/p[2]/p/node()";
        String imginfomation="";
        JXDocument jxDocument = new JXDocument(html);
        // Drop the first and last character of the result's string form.
        imginfomation=(jxDocument.sel(IMGURL_REG_xpath).toString()).substring(1,jxDocument.sel(IMGURL_REG_xpath).toString().length() - 1);
        return imginfomation;
    }

    // Extracts the relative image paths from the selected nodes.
    public List<String> getImgXpath() throws Exception{
        String str="";
        // The backslash must be doubled inside a Java string literal.
        String IMGSRC_REG = "img.product.\\w.*.jpg";
        List<String> list1=new ArrayList<String>();
        List<String> list2=new ArrayList<String>();
        String listimg = getImgSrc();
        Matcher matcher = Pattern.compile(IMGSRC_REG).matcher(listimg);
        while (matcher.find()) {
            list1.add(matcher.group());
        }
        // Keep every second match (indices 1, 3, 5, ...).
        for(int i=1;i<=(list1.size()/2);i++){
            int j=i*2;
            list2.add(list1.get(j-1));
        }
        return list2;
    }

    // Downloads every image into D:\item_neo_photo\<admin_no>.
    public void download(String admin_no) throws Exception{
        List<String> list=new ArrayList<String>();
        list=getImgXpath();
        for(String img:list){
            System.out.println(img);
            // img is appended as-is, so a space in the file name reaches the server unencoded.
            String url="http://www.neofactory.co.jp/"+img;
            URL uri=new URL(url);
            URLConnection con=uri.openConnection();
            con.setConnectTimeout(5000);
            InputStream in=con.getInputStream();
            byte[] buf=new byte[1024];
            int length=0;
            File sf=new File("D:\\item_neo_photo\\"+admin_no);
            if(!sf.exists()){
                sf.mkdirs();
            }
            String[] a=img.split("/");
            OutputStream os=new FileOutputStream(sf.getPath()+"\\"+a[a.length-1]);
            while((length=in.read(buf))!=-1){
                os.write(buf, 0, length);
            }
            os.close();
            in.close();
        }
    }
}
Can't you just concatenate the domain with the img src attribute you extracted?
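For reference, here is a minimal sketch of what plain concatenation produces. The class name `ImageUrlJoiner` and the file name `sample 2.jpg` are hypothetical (the real file name is truncated in the post). It only shows that joining the domain and the src leaves the space in place, which is presumably the form that got the 400 response.

```java
class ImageUrlJoiner {
    // Joins a site root and an img src value with exactly one slash between them.
    // No characters are escaped here, so a space in src stays a space.
    static String join(String siteRoot, String src) {
        String left = siteRoot.endsWith("/")
                ? siteRoot.substring(0, siteRoot.length() - 1)
                : siteRoot;
        String right = src.startsWith("/") ? src.substring(1) : src;
        return left + "/" + right;
    }

    public static void main(String[] args) {
        // Hypothetical src value; the real one is not shown in full in the post.
        String joined = join("http://www.neofactory.co.jp/", "img/product/sample 2.jpg");
        System.out.println(joined);
        System.out.println("contains a raw space: " + joined.contains(" "));
    }
}
```

So concatenation by itself is fine for building the address, but the result still needs its path escaped before it is requested.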
URL-encode it.
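One possible way to do that in Java, as a sketch rather than a definitive fix: the multi-argument `java.net.URI` constructors quote characters that are illegal in a path, so a space becomes `%20`. The class name `ImageUrlEncoder` and the file name `sample 2.jpg` are hypothetical. Note that `URLEncoder.encode` is meant for form data and turns a space into `+`, so it is not a good fit for a path.

```java
import java.net.URI;
import java.net.URISyntaxException;

class ImageUrlEncoder {
    // Builds an http URL for the given host and path, percent-encoding
    // characters that are not legal in a URI path (for example a space).
    // Assumes the path is NOT already encoded: this constructor always
    // quotes '%', so "%20" in the input would become "%2520".
    static String encode(String host, String path) {
        String absolutePath = path.startsWith("/") ? path : "/" + path;
        try {
            return new URI("http", host, absolutePath, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URL for " + path, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical image path containing a space.
        String encoded = encode("www.neofactory.co.jp", "img/product/sample 2.jpg");
        System.out.println(encoded);
        // prints http://www.neofactory.co.jp/img/product/sample%202.jpg
    }
}
```

In the `download` method above, the string returned by `encode` could be passed to `new URL(...)` in place of the plain concatenation.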