Is Python suitable for processing large volumes of data? - Python Tutorial - php.cn

Is Python suitable for processing large volumes of data?

WBOY

Jun 06, 2016 04:22 PM

python

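Can Python handle data at the scale of millions of rows in a database?

Which Python libraries are commonly used for large-scale data processing, what are their strengths and weaknesses, and where does each one apply?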

Replies:

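Two points need clarifying before this question can be looked at properly:

1. Millions of rows does not count as a large data volume. Judging by current Internet applications, large data starts at around 1 billion records.
2. It depends on what "processing" means. For loading and distributing data, Python is very efficient. For common statistics and the results of standard algorithms, Python has ready-made, efficient libraries that are implemented in C and parallelized. For a purely home-grown algorithm, with nothing to borrow from and no library that helps, writing it in pure Python is asking for trouble.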
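
To make the "millions of rows" point concrete, here is a minimal sketch using the standard-library sqlite3 module. The table `orders`, its columns, and the helper names `build_demo_db` and `mean_amount` are made up for illustration; a real database would use its own driver, but the batched `fetchmany` pattern is the same idea.

```python
import sqlite3

def build_demo_db(n_rows):
    """Create an in-memory SQLite table with n_rows synthetic rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (id, amount) VALUES (?, ?)",
        ((i, float(i % 100)) for i in range(n_rows)),
    )
    conn.commit()
    return conn

def mean_amount(conn, batch_size=10_000):
    """Stream rows in fixed-size batches so memory use stays bounded."""
    cursor = conn.execute("SELECT amount FROM orders")
    total, count = 0.0, 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        total += sum(amount for (amount,) in rows)
        count += len(rows)
    return total / count if count else 0.0

conn = build_demo_db(1_000_000)
print(mean_amount(conn))  # prints 49.5
```

Only one batch of rows is held in memory at a time, so the same loop should keep working as the table grows, at the cost of a longer run.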

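Python's strength is not runtime efficiency but development efficiency and high maintainability. Picking the right tool for a specific problem is itself a technical skill.

I like Python a lot and process data with it all the time. My work involves NLP, algorithms, recommendation, data mining, and data cleaning, with data volumes ranging from tens of KB to several TB, so let me weigh in.
Data at the million-row level is small data and Python handles it without difficulty, but Python does have some problems when the data gets large.
Disadvantages of Python for big data:
1. Python threads are subject to the GIL. Put simply, only one thread executes Python code at a time, so a multithreaded program effectively runs on one core and wastes a multi-core server. This is fatal in one common scenario: when the concurrent units share a huge data structure (for example a large dict), multiple processes strain memory, multiple threads share the data but do not solve the problem because of the GIL, and writing a separate process just to own reads and writes of that data is both inefficient and cumbersome.
2. Python's execution speed is not high, and that really shows when processing large data. PyPy (a Python interpreter with a JIT, which can be thought of as something that speeds up script execution) can improve speed considerably, but at the time of writing PyPy did not support many classic Python packages such as numpy. (A quick plug for PyPy while I am at it: those with deep pockets can donate via PyPy - Call for donations.)
3. At the vast majority of large companies, both the environment and the accumulated experience for big data processing are much better in Java.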
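Advantages of Python for data processing (not big data processing):
1. Exceptionally fast development, with very little code.
2. A rich set of data-processing packages: regular expressions, HTML parsing, XML parsing, and so on are all very convenient to use.
3. Built-in types are extremely cheap to use and need no extra ceremony (in Java or C++, even using a map takes real effort).
4. Within a company, a large share of data-processing work does not involve very large data.
5. Huge data is not something a language can solve; it needs a data-processing framework (Hadoop, MPI, ...). Although they are niche, Python does have big data frameworks, and some frameworks also support Python.
6. Encoding problems are extremely easy to handle.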
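
As a small illustration of points 1, 2, 3, and 6 above, here is a sketch that decodes raw bytes and counts words with nothing but the standard library. The function name `count_words` and the sample data are invented for this example.

```python
import re
from collections import Counter

def count_words(raw_lines, encoding="utf-8"):
    """Decode byte strings and count word tokens, case-insensitively."""
    counts = Counter()
    for raw in raw_lines:
        # errors="replace" keeps going when a line contains invalid bytes
        text = raw.decode(encoding, errors="replace")
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

sample = ["Python 处理 数据".encode("utf-8"), "python data".encode("utf-8")]
print(count_words(sample).most_common(1))  # [('python', 2)]
```

In Python 3, `\w` matches Unicode word characters by default, so the Chinese tokens are counted the same way as the ASCII ones.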

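In summary:
1. Python can process big data.
2. Python is not necessarily the best choice for processing big data.
3. Using Python alongside other languages (whatever the company mainly promotes) is a very good choice.
4. Because of the development speed, if you process data often, like the Linux terminal, and mostly deal with modest data (under 100 MB), it is well worth learning Python.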

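Python packages for data processing:
1. The built-in regular expression module: sufficient for text processing.
2. cElementTree, lxml: the default XML module is too slow when the data volume is very large.
3. beautifulsoup: HTML processing.
4. hadoop (usable from Python): parallel processing. It supports map-reduce written in Python, which is enough. Incidentally, Alibaba's ODPS is the same kind of thing as Hadoop and supports UDFs written in Python, embedded in SQL statements.
5. numpy, scipy, scikit-learn: numerical computing and data mining.
6. dpark (borrowed from the answer above): something similar to Hadoop.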
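
For item 2, one way to keep memory flat on large XML is incremental parsing. The sketch below uses `iterparse` from the standard-library ElementTree; the `<item price="...">` layout and the function name `sum_prices` are assumptions made for the example, not something from the original answer.

```python
import io
import xml.etree.ElementTree as ET

def sum_prices(xml_stream):
    """Add up the price attribute of every <item> without building the whole tree."""
    total = 0.0
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "item":
            total += float(elem.get("price", "0"))
            elem.clear()  # drop the element's attributes and children once used
    return total

demo = io.BytesIO(b"<items><item price='1.5'/><item price='2.5'/></items>")
print(sum_prices(demo))  # 4.0
```

lxml offers a similar `iterparse` interface, so the same shape of loop should carry over if the standard parser turns out to be too slow.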

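Items 1, 2, 3, and 5 are powerful tools for text data (handling text conveniently is what Python is for, after all). Items 4 and 6 are parallel computing frameworks (efficiency in big data processing comes from good distributed computation logic, not from the language).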
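
To show what "map-reduce written in Python" looks like in spirit, here is a local word-count sketch. It is not Hadoop or dpark code: `mapper` and `reducer` are hypothetical names, and `sorted()` stands in for the shuffle-and-sort step that a real framework performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Sum the counts per word; sorted() simulates the framework's sort by key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["big data", "Big python data"]
print(dict(reducer(mapper(lines))))  # {'big': 2, 'data': 2, 'python': 1}
```

Because each phase only sees a stream of records, the framework is free to split the work across machines, which is where the real speedup comes from.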
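That is all for now. It would help to name a direction: without knowing what kind of data is being processed it is hard to recommend packages, so I do not know where to start in introducing them.

It depends on the specific application scenario. Fundamentally, the problem splits into two aspects: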

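1. CPU-intensive operations
Here the big data job spends most of its time on actual computation, such as matrix inversion, vector similarity, or in-memory word segmentation. This kind of work depends heavily on the efficiency of the language, and Python inevitably performs poorly at it.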
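
A rough sketch of where the cost comes from, using vector similarity as the example. Both functions below are illustrative (the names are made up): the first runs every multiply and add as interpreted Python, the second pushes the iteration into `sum()` and `map()`, which CPython implements in C. Libraries such as numpy go much further in the same direction; actual speedups should be measured, for example with `timeit`.

```python
import math
from operator import mul

def cosine_loop(a, b):
    """Explicit loop: each multiply and add is executed by the interpreter."""
    dot = norm_a = norm_b = 0.0
    for x, y in zip(a, b):
        dot += x * y
        norm_a += x * x
        norm_b += y * y
    return dot / math.sqrt(norm_a * norm_b)

def cosine_builtin(a, b):
    """Same formula, with the iteration done inside the C-implemented sum() and map()."""
    dot = sum(map(mul, a, b))
    norm_a = sum(map(mul, a, a))
    norm_b = sum(map(mul, b, b))
    return dot / math.sqrt(norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]
print(cosine_loop(a, b), cosine_builtin(a, b))
```

Neither version guards against zero-length vectors; that check is left out to keep the sketch short.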

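2. IO-intensive operations
If the big data job involves frequent IO, such as reading one line at a time from a data stream, doing no complex computation, and frequently reading from and writing to the file system, then these operations all go through operating system interfaces and the choice of language no longer matters much.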
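
The line-at-a-time pattern described above might look like the following sketch. `filter_lines` is a hypothetical helper, and `io.StringIO` stands in for real files so the example is self-contained; with `open(...)` objects the loop is unchanged.

```python
import io

def filter_lines(src, dst, keyword):
    """Copy lines containing keyword from src to dst, one line at a time."""
    kept = 0
    for line in src:  # file-like objects yield lines lazily
        if keyword in line:
            dst.write(line)
            kept += 1
    return kept

src = io.StringIO("ok 1\nerror 2\nok 3\nerror 4\n")
dst = io.StringIO()
print(filter_lines(src, dst, "error"))  # 2
```

Almost all of the time in a loop like this goes to reading and writing, so rewriting it in another language would be expected to change little.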

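Conclusion
Use Python for the framework of the whole pipeline and call C functions for the core CPU-intensive parts. That gives good development efficiency and good performance, but the drawback is that it demands more of the team (especially where Python + C multithreading is involved)... so... you cannot have your cake and eat it too. If you insist on having both, you simply have to be very good yourself.

Our company processes data measured in petabytes every day, and one of its platforms, a parallel grep, is written in Python. The original consideration was probably rapid prototyping rather than peak speed, but it has proven to run perfectly well to this day. Big data work often does not worry much about peak speed on each node. Faster is of course better, but once optimization happens at a higher level (for example using data locality to reduce transfer, building indexes for fast joins, sampling to optimize partitioning, using a Bloom filter for fast membership tests, and so on), replacing Python with C does not improve efficiency much.

Many Python libraries are implemented in other languages (mostly C), with Python serving only as a wrapper. The libraries themselves are not slow.

Writing the code costs more than the program's time complexity does. Many machine learning, neural network, and numerical algorithms have existed for decades. These scattered tools were mostly implemented in C and Fortran, until people began gathering them together with Python. So on the surface you are using a Python library, but underneath it is C and Fortran code, and performance is not greatly affected, if you really are dealing with big data.

The bottleneck in processing large amounts of data is IO, not the language. The choice of language really comes down to personal taste.

Stream processing is Python's biggest weakness.

Using Python is fine, but the key modules with high speed requirements still need to be rewritten in C.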
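
One of the higher-level optimizations mentioned in the replies is a Bloom filter for fast membership tests. Here is a minimal pure-Python sketch of the idea. The class, its default sizes, and the choice of `hashlib.blake2b` with a per-hash salt are all assumptions made for illustration; a production system would more likely use a tuned library.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        """Derive num_hashes bit positions by salting one hash function."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("alice")
print("alice" in seen)
```

A lookup that returns False is definitive, so an expensive join or disk read can be skipped for keys that were never added; a lookup that returns True still needs to be confirmed against the real data.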
