Big data full-stack development language – Python-Python Tutorial-php.cn

Some time ago, ThoughtWorks held a community event in Shenzhen, and there was a speech titled "Fullstack JavaScript", which was about using JavaScript for front-end, server-side, and even database (MongoDB) development. A web application developer only needs to learn one. language, you can implement the entire application.

Inspired by this, I discovered that Python can be called a big data full-stack development language. Because Python is a hot language in cloud infrastructure, DevOps, big data processing and other fields.

Field	Popular language
Cloud Infrastructure	Python, Java, Go
DevOps	Python, Shell, Ruby, Go
Web Crawler	Python, PHP, C++
data processing	Python, R, Scala

Just like you can write a complete web application as long as you know JavaScript, you can implement a complete big data processing platform as long as you know Python.

Cloud infrastructure

These days, if we don’t support cloud platforms, massive data, or dynamic scaling, we don’t dare to say that we do big data. At most, we dare to tell others that we do business intelligence (BI).

Cloud platforms are divided into private clouds and public clouds. OpenStack, the popular private cloud platform, is written in Python. CloudStack, the former pursuer, strongly emphasized that it was written in Java and had advantages over Python when it was first launched. As a result, at the beginning of 2015, Citrix, the founder of CloudStack, announced that it would join the OpenStack Foundation, and CloudStack was about to come to an end.

If you find it troublesome and don’t want to build your own private cloud, use public clouds. Whether it’s AWS, GCE, Azure, Alibaba Cloud, or Qingyun, they all provide Python SDKs. GCE only provides Python and JavaScript SDKs, while Qingyun only provides Python SDKs. . It can be seen that various cloud platforms attach great importance to Python.

When it comes to infrastructure construction, we have to mention Hadoop. Today, Hadoop is no longer the first choice for big data processing because its MapReduce data processing speed is not fast enough. However, HDFS and Yarn, the two components of Hadoop, are becoming more and more popular. The more popular it becomes. The development language of Hadoop is Java, and there is no official Python support. However, there are many third-party libraries that encapsulate Hadoop's API interface (pydoop, hadoopy, etc.).

The replacement of Hadoop MapReduce is Spark, which is said to be 100 times faster. Its development language is Scala, but it provides development interfaces for Scala, Java, and Python. It is really unreasonable to want to please so many data scientists who develop in Python without supporting Python. . HDFS alternatives, such as GlusterFS, Ceph, etc., all directly provide Python support. A replacement for Yarn, Mesos is implemented in C++. In addition to C++, it also provides support packages for Java and Python.

DevOps

DevOps has a Chinese name, which is called development and self-operation and maintenance. In the Internet era, only by being able to quickly test new ideas and deliver business value safely and reliably as soon as possible can we remain competitive. The automated build/test/deployment and system measurement and other technical practices advocated by DevOps are indispensable in the Internet era.

Automated construction is easy because of the application. If it is a Python application, because of the existence of tools such as setuptools, pip, virtualenv, tox, flake8, etc., automated construction is very simple. Moreover, because almost all Linux systems have built-in Python interpreters, using Python for automation does not require any pre-installed software on the system.

In terms of automated testing, the Python-based Robot Framework is the favorite automated testing framework for enterprise-level applications, and it has nothing to do with language. Cucumber also has many supporters, and its Python counterpart Lettuce can do exactly the same thing. Locust has also begun to receive more and more attention in automated performance testing.

Automated configuration management tools, old ones such as Chef and Puppet, are developed in Ruby and still maintain a strong momentum. However, the new generation of Ansible and SaltStack - both developed in Python - are more lightweight than the previous two and are welcomed by more and more developers, which has begun to create a lot of pressure on their predecessors.

In terms of system monitoring and measurement, traditional Nagios is gradually declining, upstarts such as Sensu are well received, and New Relic in the form of cloud services has become the standard for startups. None of these are directly implemented through Python, but Python needs to be connected to these tools. , not difficult.

In addition to the above tools, PaaS platforms based on Python that provide complete DevOps functions, such as Cloudify and Deis, have not yet become popular, but they have already received a lot of attention.

Web Crawler

Where does the data of big data come from? Except for some companies that have the ability to generate large amounts of data themselves, most of the time, they need to rely on crawlers to capture Internet data for analysis.

Web crawlers are Python's traditional strong areas. The most popular crawler framework Scrapy, HTTP tool kit urlib2, HTML parsing tool beautifulsoup, XML parser lxml, etc. are all class libraries that can stand alone.

However, web crawlers are not just as simple as opening web pages and parsing HTML. An efficient crawler must be able to support a large number of flexible concurrent operations, and often be able to crawl thousands or even tens of thousands of web pages at the same time. The traditional thread pool method wastes a lot of resources. After the number of threads reaches thousands, system resources are basically wasted. Thread scheduling is on. Because Python can well support coroutine operations, many concurrency libraries have been developed based on this, such as Gevent, Eventlet, and distributed task frameworks such as Celery. ZeroMQ, which is considered more efficient than AMQP, was also the first to provide a Python version. With support for high concurrency, web crawlers can truly reach the scale of big data.

The captured data needs word segmentation processing, and Python is not inferior in this regard. The famous natural language processing package NLTK, and Jieba, which specializes in Chinese word segmentation, are all powerful tools for word segmentation.

data processing

All is ready except for the opportunity. This east wind is the data processing algorithm. From statistical theory, to data mining, machine learning, to the deep learning theory proposed in recent years, data science is in an era where a hundred flowers are blooming. What programming do data scientists use?

If it is in the field of theoretical research, the R language may be the most popular among data scientists, but the problems with the R language are also obvious. Because statisticians created the R language, its syntax is slightly weird. Moreover, if R language wants to realize a large-scale distributed system, it will still take a long time to go on the engineering road. Therefore, many companies use R language for prototype testing. After the algorithm is determined, it is translated into engineering language.

Python is also one of the favorite languages of data scientists. Unlike the R language, Python itself is an engineering language. The algorithms implemented by data scientists in Python can be directly used in products, which is very helpful for big data startups to save costs. Officially because of data scientists' love for Python and R, Spark provides very good support for these two languages in order to please data scientists.

Python has many data processing related libraries. The high-performance scientific computing libraries NumPy and SciPy lay a very good foundation for other advanced algorithms. matploglib makes Python drawing as easy as Matlab. Scikit-learn and Milk implement many machine learning algorithms. Pylearn2 implemented based on these two libraries is an important member of the deep learning field. Theano uses GPU acceleration to achieve high-performance mathematical symbolic calculations and multi-dimensional matrix calculations. Of course, there is also Pandas, a big data processing library that has been widely used in the engineering field. Its DataFrame design is borrowed from the R language, and later inspired the Spark project to implement a similar mechanism.

By the way, there is also iPython. This tool is so useful that I almost regarded it as a standard library and forgot to introduce it. iPython is an interactive Python running environment that allows you to see the results of each piece of Python code in real time. By default, iPython runs on the command line, and you can execute ipython notebook to run it on the web page. Figures drawn with matplotlib can be directly displayed embedded in iPython Notebook.
The notebook files of iPython Notebook can be shared with other people, so that others can reproduce your work results in their own environment; if the other party does not have a running environment, they can also be directly converted into HTML or PDF.

Why Python

It is precisely because application development engineers, operation and maintenance engineers, and data scientists all like Python that Python has become a full-stack development language for big data systems.

For development engineers, the elegance and simplicity of Python are undoubtedly the biggest attraction. In the Python interactive environment, execute import this and read the Zen of Python, and you will understand why Python is so attractive. The Python community has always been very dynamic. Unlike the explosive growth of software packages in the NodeJS community, the growth rate of Python software packages has been relatively stable, and the quality of the software packages is also relatively high. Many people criticize Python for having too strict requirements on spaces, but it is precisely because of this requirement that Python has an advantage over other languages when doing large-scale projects. OpenStack projects total more than 2 million lines of code to prove this.

For operation and maintenance engineers, the biggest advantage of Python is that almost all Linux distributions have built-in Python interpreters. Although Shell is powerful, its syntax is not elegant enough, and it will be painful to write more complex tasks. Using Python to replace Shell to do some complex tasks is a liberation for operation and maintenance personnel.

For data scientists, Python is simple yet powerful. Compared with C/C++, there is no need to do a lot of low-level work and model verification can be carried out quickly; compared with Java, Python has concise syntax and strong expressive ability, and the same work only requires 1/3 of the code; compared with Matlab and Octave, Python's engineering maturity is higher. More than one programming expert has expressed that Python is the most suitable language to use as a university computer science programming course - MIT's introductory computer course uses Python - because Python can let people learn the most important thing about programming - how to solve problems question.

By the way, Microsoft participated in PyCon 2015 and made a high-profile announcement to improve the Python programming experience on Windows, including Visual Studio supporting Python, optimizing the compilation of Python C extensions on Windows, and so on. Imagine a future scenario where Python becomes the default component of Windows.

The above is the detailed content of Big data full-stack development language – Python. For more information, please follow other related articles on the PHP Chinese website!