Python Stock Data Crawler Explained - Python Crawler


python自学网

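China's A-share market got off to a hot start in 2019. As stocks surged, the Shanghai index approached the 3,000-point mark, all three major indices hit new highs for the current rebound, single-day turnover passed one trillion yuan, and many people rushed into the market.

To make money in the stock market, analyzing stock data is essential. Analysis needs data, but collecting data takes a great deal of time and effort. Some websites already publish the data we need, and if we can download it to our own computer, it becomes very useful for later processing with machine learning algorithms. For example, the figure below shows the stock quotes for one trading day:

To get the data in the table above, we can use a web crawler. A web crawler, also called a web spider or web robot and sometimes a web chaser, is a program or script that automatically fetches information (text, images, and so on) from web pages according to a set of rules and then stores what it fetched on your own computer.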

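The program has three main parts: fetching the page source, removing redundant content and tags, and displaying the results.

The steps are as follows:

1. Fetching the page source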

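import random
import re
import time
import urllib.request

url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_1.html'  # target URL
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}  # request header that mimics a browser
request = urllib.request.Request(url=url, headers=headers)  # build the request to the server
response = urllib.request.urlopen(request)  # server response
content = response.read().decode('gbk')  # decode the page source as GBK

# To crawl pages 1 to 7, build each page URL inside a loop.
# Assumption: user_agent is a list of User-Agent strings; the original does not show it.
user_agent = ["Mozilla/5.0 (Windows NT 10.0; WOW64)"]
stock_total = []  # collects the matched text from every page
for page in range(1, 8):
    url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_' + str(page) + '.html'
    request = urllib.request.Request(url=url, headers={"User-Agent": random.choice(user_agent)})  # pick a random element from the user_agent list
    response = urllib.request.urlopen(request)  # send the request for this page
    content = response.read().decode('gbk')  # read the page content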
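
The paging and header logic above can also be wrapped in small helpers. The following is a minimal sketch under stated assumptions, not the author's code: `USER_AGENTS`, `BASE_URL`, `build_page_url`, `build_request`, and `fetch_page` are hypothetical names, and the User-Agent strings are illustrative only. It adds a timeout and replaces undecodable bytes instead of raising an error.

```python
import random
import urllib.request

# Assumption: an illustrative pool of User-Agent strings. The article refers
# to a `user_agent` list but does not show its contents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

# URL pattern taken from the article, with the page number as a placeholder.
BASE_URL = "http://quote.stockstar.com/stock/ranklist_a_3_1_{page}.html"


def build_page_url(page):
    """Return the ranking-list URL for a 1-based page number."""
    return BASE_URL.format(page=page)


def build_request(page):
    """Build a Request whose User-Agent is chosen at random from the pool."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url=build_page_url(page), headers=headers)


def fetch_page(page, timeout=10):
    """Download one page and decode it as GBK, replacing undecodable bytes."""
    with urllib.request.urlopen(build_request(page), timeout=timeout) as response:
        return response.read().decode("gbk", errors="replace")
```

Calling `fetch_page(1)` would perform a real network request, so only the URL and header helpers are safe to try offline.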

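2. Removing redundant content

Once we have the page source, we can extract the data we need from it. The fetched content contains many HTML tags, spaces, and similar clutter, which has to be stripped out of the source. Regular expressions are used here as well; the code is as follows: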

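    # These lines belong inside the for loop over pages.
    pattern = re.compile(r'<tbody[\s\S]*</tbody>')  # the table body holds the quote rows
    body = re.findall(pattern, str(content))
    pattern = re.compile(r'>(.*?)<')  # text between a '>' and the next '<'
    stock_page = re.findall(pattern, body[0])  # regex match
    stock_total.extend(stock_page)  # accumulate the matches from this page
    time.sleep(random.randrange(1, 4))  # pause 1 to 3 seconds between requests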
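
To see what the two patterns actually return, here is a small sketch run on made-up markup. `SAMPLE_HTML` and `extract_cells` are hypothetical and only imitate the shape of a table body; they are not taken from the real page. Note that `>(.*?)<` also matches the empty gaps between adjacent tags, so the sketch filters out empty fragments.

```python
import re

# Hypothetical sample markup shaped like a table; not copied from the real page.
SAMPLE_HTML = (
    "<table><thead><tr><th>Code</th></tr></thead>"
    "<tbody><tr><td><a href='#'>600000</a></td><td>10.50</td></tr>"
    "<tr><td><a href='#'>600004</a></td><td>8.20</td></tr></tbody></table>"
)

TBODY_PATTERN = re.compile(r"<tbody[\s\S]*</tbody>")  # greedy: first <tbody to last </tbody>
TEXT_PATTERN = re.compile(r">(.*?)<")                 # text between '>' and the next '<'


def extract_cells(html):
    """Return the non-empty text fragments found inside the <tbody> element."""
    body = TBODY_PATTERN.findall(html)
    if not body:
        return []  # no table body on this page
    fragments = TEXT_PATTERN.findall(body[0])
    # Adjacent tags such as "</td><td>" produce empty matches; drop them.
    return [text for text in fragments if text.strip()]


cells = extract_cells(SAMPLE_HTML)
print(cells)  # ['600000', '10.50', '600004', '8.20']
```

Because `.` does not match a newline by default, cell text that spans several lines in the source would be skipped by `>(.*?)<`; an HTML parser would be more robust for markup like that.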

3. Displaying the results

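# Assumption: stock_last is stock_total with the empty matches removed; the original does not show this step.
stock_last = [item for item in stock_total if item.strip()]
print('Code', '\t', 'Name', '   ', '\t', 'Latest price', '\t', 'Change %', '\t', 'Change', '\t', '5-min change %')
for i in range(0, len(stock_last), 13):  # the page has 13 columns of data in total
    print(stock_last[i], '\t', stock_last[i+1], ' ', '\t', stock_last[i+2], '  ', '\t', stock_last[i+3], '  ', '\t', stock_last[i+4], '  ', '\t', stock_last[i+5])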
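
The stride-of-13 indexing can be made a little safer by first splitting the flat list into complete rows. The sketch below assumes, as the article states, that each stock occupies 13 consecutive items; `to_rows`, `format_table`, and the English header labels are hypothetical, and the demo values are made up.

```python
COLUMNS_PER_ROW = 13  # the article states the page has 13 columns per stock
HEADERS = ["Code", "Name", "Latest", "Change %", "Change", "5-min %"]


def to_rows(flat, width=COLUMNS_PER_ROW):
    """Split a flat list into complete rows of `width` items, dropping a ragged tail."""
    usable = len(flat) - len(flat) % width
    return [flat[i:i + width] for i in range(0, usable, width)]


def format_table(rows, shown=len(HEADERS)):
    """Return the first `shown` columns of each row as tab-separated lines."""
    lines = ["\t".join(HEADERS)]
    for row in rows:
        lines.append("\t".join(row[:shown]))
    return "\n".join(lines)


# Made-up values purely for illustration: two full rows plus one stray item.
demo = [f"r{r}c{c}" for r in range(2) for c in range(13)] + ["stray"]
print(format_table(to_rows(demo)))
```

Dropping the incomplete tail avoids an `IndexError` when the number of matched items is not a multiple of 13, for example if a page returned partial data.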

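The figure below shows the data obtained with the crawler.

With the data above in hand, we can apply machine learning algorithms and write our own programs to make predictions.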
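
Before any later analysis, the crawled rows need to be stored on the local machine. One possible way, sketched here with Python's standard `csv` module, is to write them to a CSV file. The helper `write_rows_csv`, the column names, and the sample rows are all hypothetical; an in-memory buffer stands in for a real file such as `stocks.csv`.

```python
import csv
import io


def write_rows_csv(rows, header, stream):
    """Write a header line followed by the data rows to an open text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Made-up rows for illustration only.
header = ["code", "name", "latest_price"]
rows = [["600000", "Example A", "10.50"], ["600004", "Example B", "8.20"]]

# The buffer stands in for: open("stocks.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_rows_csv(rows, header, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing the stream in as a parameter keeps the helper easy to try without touching the file system; swapping the buffer for an opened file writes the same content to disk.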
