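For my graduation project I need to crawl course data from Coursera. I have already crawled the URLs of all the courses and saved them in a txt file, one course URL per line. Now I want to get the details of each course, such as the instructor, syllabus, and detail information, but that means visiting each course's page and scraping it. I'm not a strong coder and would appreciate some guidance; a bit of pseudocode would be even better! Thanks.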
Hello! I'm not sure whether this is the answer you are looking for:
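import requests

# coursera.txt holds one course URL per line; strip the trailing newline from each
with open("coursera.txt", "r") as f:
    urlList = [line.strip() for line in f if line.strip()]

for url in urlList:
    r = requests.get(url)
    # parse r.text here to pull out the instructor, syllabus, and detail information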
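The parsing step left open in the loop above could look roughly like the sketch below. It uses only the standard library, so `urlopen` stands in for `requests.get`. The CSS class names passed in (`instructor-name`, `syllabus-item`) are hypothetical placeholders, not Coursera's real markup; you would need to inspect the actual pages to find the right ones, and if the page content is rendered by JavaScript a plain HTTP fetch may not contain these fields at all.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class FieldParser(HTMLParser):
    """Collect the text of elements whose class attribute contains a wanted class name."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field      # e.g. {"instructor-name": "instructor"} (hypothetical)
        self.fields = {name: [] for name in class_to_field.values()}
        self._field = None                         # field currently being captured, if any
        self._tag = None                           # tag name of the element being captured
        self._depth = 0                            # nesting depth of same-named tags
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            if tag == self._tag:
                self._depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in self.class_to_field:
                self._field, self._tag = self.class_to_field[cls], tag
                self._depth, self._buf = 1, []
                return

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._buf).split())  # collapse whitespace
                if text:
                    self.fields[self._field].append(text)
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._buf.append(data)


def fetch_html(url):
    """Download a page and return it as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_courses(url_file, class_to_field, fetcher=fetch_html):
    """Read one URL per line from url_file and return {url: {field: [texts]}}."""
    results = {}
    with open(url_file, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            parser = FieldParser(class_to_field)
            parser.feed(fetcher(url))
            parser.close()
            results[url] = parser.fields
    return results
```

A call might look like `scrape_courses("coursera.txt", {"instructor-name": "instructor", "syllabus-item": "syllabus"})`. The `fetcher` argument is there so you can swap in `requests`, add delays between requests, or test the parser against saved HTML.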
Good Luck ! ^_<
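If you are crawling Coursera course data, I suggest using scrapy. That way you don't need to collect all the course URLs in advance; you only need to write the URL-matching rules.
scrapy tutorial: http://scrapy-chs.readthedocs.org/zh_CN/0.24/intro/tutorial.html
Project reference: https://github.com/Junnplus/OnlineJudgeCrawlerCore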
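To make "write the URL-matching rules" a bit more concrete, here is a rough sketch of the idea. It is not scrapy code: it is a standard-library illustration of what a link-matching rule does, namely pull every link out of a page and keep only those whose URL matches a pattern. In scrapy the same idea is normally expressed with a link extractor and a regular expression. The `COURSE_URL` pattern below is an assumption about what course URLs look like, so adjust it to the URLs you actually see.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Assumed shape of a course page URL; change this to match the real site.
COURSE_URL = re.compile(r"^https://www\.coursera\.org/learn/[^/?#]+$")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def matching_links(page_url, html, pattern=COURSE_URL):
    """Return the absolute, de-duplicated links in html that match pattern."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    seen, out = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment before matching.
        absolute = urldefrag(urljoin(page_url, href)).url
        if pattern.match(absolute) and absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out
```

A crawler built this way starts from a listing page, calls `matching_links` on each page it downloads, and queues the results for scraping, which is why no pre-collected URL list is needed.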