Performance comparison test of PyPy and CPython

Recently I completed some data mining tasks on Wikipedia. They consisted of these parts:

Parsing the Wikipedia dump of enwiki-pages-articles.xml;

Storing categories and pages into MongoDB;

Using Redis to map category titles.

I tested the performance of CPython 2.7.3 and PyPy 2b on real tasks. The libraries I used are:

redis 2.7.2

pymongo 2.4.2

In addition, CPython was backed by the following C-accelerated libraries:

hiredis

pymongo c-extensions

The tests mainly consist of database processing, so I did not expect much benefit from PyPy (especially since CPython's database drivers are backed by C libraries).

Below I will describe some interesting results.

Extract wiki page names

I needed to create a mapping from wiki page names to page.id for all Wikipedia categories and store it in Redis. The simplest solution would have been to import enwiki-page.sql (which defines an RDB table) into MySQL and then transfer the data to Redis. But I didn't want to add MySQL as a requirement (have backbone! XD), so I wrote a simple SQL insert statement parser in pure Python and imported the data from enwiki-page.sql directly into Redis.
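A pure-Python parser for such INSERT statements can be quite small. The following is only a minimal sketch, not the parser used for the measurements: the names `parse_insert_values` and `_convert` are made up for illustration, a plain dict stands in for Redis, and the code assumes mysqldump-style statements with single-quoted strings and backslash escapes. The column order (page_id, page_namespace, page_title) follows the MediaWiki page table.

```python
def _convert(text, quoted):
    """Turn one raw SQL field into a Python value."""
    if quoted:
        return text
    if text == "NULL":
        return None
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_insert_values(line):
    """Yield one tuple per row of a single-line INSERT ... VALUES statement."""
    if not line.startswith("INSERT INTO"):
        return
    pos = line.find(" VALUES ")
    if pos < 0:
        return
    i, n = pos + len(" VALUES "), len(line)
    row = None       # None while we are between rows
    buf = []         # characters of the field being read
    quoted = False   # True if the current field is a quoted string
    while i < n:
        ch = line[i]
        if row is None:
            if ch == "(":
                row, buf, quoted = [], [], False
            i += 1
        elif ch == "'":
            # Quoted string: copy up to the closing quote. A backslash makes
            # the next character literal (sequences like \n are not translated).
            quoted = True
            i += 1
            while i < n and line[i] != "'":
                if line[i] == "\\" and i + 1 < n:
                    i += 1
                buf.append(line[i])
                i += 1
            i += 1  # skip the closing quote
        elif ch in ",)":
            row.append(_convert("".join(buf), quoted))
            buf, quoted = [], False
            if ch == ")":
                yield tuple(row)
                row = None
            i += 1
        else:
            buf.append(ch)
            i += 1


# Shortened, made-up sample line; real rows have more columns.
line = ("INSERT INTO `page` VALUES "
        "(10,0,'AccessibleComputing','',0),"
        "(12,14,'It\\'s, (odd)',NULL,0.5);")
rows = list(parse_insert_values(line))

# Keep only category pages (namespace 14); the dict stands in for Redis.
title_to_id = {}
for row in rows:
    page_id, namespace, title = row[:3]
    if namespace == 14:
        title_to_id[title] = page_id
print(title_to_id)
```

The character-by-character loop is exactly the kind of pure-Python, CPU-bound work where a JIT can be expected to help.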

This task is more CPU-bound, so I was optimistic about PyPy here.

Interpreter   User time   System time   CPU
PyPy          169.00s     8.52s         90%
CPython       1287.13s    8.10s         96%

I also made a similar mapping for page.id -> category (my laptop's memory is too small to hold all of this information for my tests).

Filtering categories from enwiki

I needed to filter the category pages out of the dump. Therefore I chose a SAX parser, which in both PyPy and CPython is a wrapper around expat, an external natively compiled package.

The code is very simple:

from xml.sax import handler


class Stack(list):
    """Minimal stack helper. The listing uses Stack without showing it,
    so this push/pop/top interface is an assumption."""

    def push(self, item):
        self.append(item)

    def top(self):
        return self[-1]


class WikiCategoryHandler(handler.ContentHandler):
    """Class which detects category pages and stores them separately
    """
    ignored = set(('contributor', 'comment', 'meta'))

    def __init__(self, f_out):
        handler.ContentHandler.__init__(self)
        self.f_out = f_out
        self.curr_page = None
        self.curr_tag = ''
        self.curr_elem = Element('root', {})
        self.root = self.curr_elem
        self.stack = Stack()
        self.stack.push(self.curr_elem)
        self.skip = 0   # nesting depth inside an ignored element

    def startElement(self, name, attrs):
        if self.skip > 0 or name in self.ignored:
            self.skip += 1
            return
        self.curr_tag = name
        elem = Element(name, attrs)
        if name == 'page':
            elem.ns = -1
            self.curr_page = elem
        else:   # we don't want to keep old pages in memory
            self.curr_elem.append(elem)
        self.stack.push(elem)
        self.curr_elem = elem

    def endElement(self, name):
        if self.skip > 0:
            self.skip -= 1
            return
        if name == 'page':
            self.task()
            self.curr_page = None
        self.stack.pop()
        self.curr_elem = self.stack.top()
        self.curr_tag = self.curr_elem.tag

    def characters(self, content):
        if content.isspace():
            return
        if self.skip == 0:
            self.curr_elem.append(TextElement(content))
            if self.curr_tag == 'ns':
                self.curr_page.ns = int(content)

    def startDocument(self):
        self.f_out.write("<root>\n")

    def endDocument(self):
        self.f_out.write("</root>\n")
        print("FINISH PROCESSING WIKIPEDIA")

    def task(self):
        # Namespace 14 is the Category namespace
        if self.curr_page.ns == 14:
            self.f_out.write(self.curr_page.render())


class Element(object):
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs
        self.childrens = []
        self.append = self.childrens.append

    def __repr__(self):
        return "Element {}".format(self.tag)

    def render(self, margin=0):
        # Element without children: self-closing tag
        if not self.childrens:
            return u"{0}<{1}{2} />".format(
                " "*margin,
                self.tag,
                "".join([u' {}="{}"'.format(k, v) for k, v in self.attrs.items()]))
        # Element with a single text child: render on one line
        if isinstance(self.childrens[0], TextElement) and len(self.childrens) == 1:
            return u"{0}<{1}{2}>{3}</{1}>".format(
                " "*margin,
                self.tag,
                "".join([u' {}="{}"'.format(k, v) for k, v in self.attrs.items()]),
                self.childrens[0].render())

        return u"{0}<{1}{2}>\n{3}\n{0}</{1}>".format(
            " "*margin,
            self.tag,
            "".join([u' {}="{}"'.format(k, v) for k, v in self.attrs.items()]),
            "\n".join((c.render(margin+2) for c in self.childrens)))


class TextElement(object):
    def __init__(self, content):
        self.content = content

    def __repr__(self):
        return "TextElement"

    def render(self, margin=0):
        return self.content

Element and TextElement hold the tag and body information and provide a method to render it.
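To illustrate how a SAX handler of this kind is driven, here is a much smaller, self-contained sketch. It is not the handler above: the class name `CategoryTitleHandler` and the sample XML are invented for illustration, and it only collects the titles of pages in namespace 14 (the Category namespace) instead of re-rendering them.

```python
import xml.sax
from xml.sax import handler


class CategoryTitleHandler(handler.ContentHandler):
    """Collect the titles of pages whose <ns> is 14 (Category namespace)."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.titles = []
        self._tag = None     # name of the element currently open, if any
        self._title = []
        self._ns = []

    def startElement(self, name, attrs):
        self._tag = name
        if name == "page":
            self._title, self._ns = [], []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self._tag == "title":
            self._title.append(content)
        elif self._tag == "ns":
            self._ns.append(content)

    def endElement(self, name):
        self._tag = None
        if name == "page" and "".join(self._ns).strip() == "14":
            self.titles.append("".join(self._title).strip())


# Tiny made-up stand-in for the real dump.
SAMPLE = b"""<mediawiki>
  <page><title>Category:Computing</title><ns>14</ns></page>
  <page><title>Anarchism</title><ns>0</ns></page>
</mediawiki>"""

title_handler = CategoryTitleHandler()
xml.sax.parseString(SAMPLE, title_handler)
print(title_handler.titles)
```

For a real dump, `xml.sax.parse(file_or_path, handler_instance)` streams the file through the handler, so the whole document never has to fit in memory.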

Here is the PyPy vs. CPython comparison for this task.

Interpreter   Time
PyPy          2169.90s
CPython       4494.69s

I was very surprised by the results of PyPy.

Computing an interesting set of categories

I wanted to compute an interesting set of categories: in the context of one of my applications, those derived from the Computing category. To do this I needed to build a category graph that stores the category-subcategory relations.

Building the category-subcategory relation graph

This task uses MongoDB as the data source and stores the resulting structure in Redis. The algorithm is:

for each category.id in redis_categories (it holds *category.id -> category title mapping*) do:
    title = redis_categories.get(category.id)
    parent_categories = mongodb get categories for title
    for each parent_cat in parent categories do:
        redis_tree.sadd(parent_cat, title) # add to parent_cat set title

Sorry for writing such pseudo code, but I want it to look more compact.
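For readers who prefer something executable, the pseudocode can be sketched roughly as follows. This is an illustration under assumptions, not the script that was timed: `FakeRedis` is an in-memory stand-in that exposes only the `keys`, `get`, `sadd` and `smembers` calls used here, and `get_parent_titles` is a hypothetical callable standing in for the MongoDB lookup, whose schema is not shown in this article.

```python
class FakeRedis(object):
    """In-memory stand-in for the few Redis commands used below."""

    def __init__(self, strings=None):
        self._strings = dict(strings or {})
        self._sets = {}

    def keys(self):
        return list(self._strings)

    def get(self, key):
        return self._strings.get(key)

    def sadd(self, key, *members):
        target = self._sets.setdefault(key, set())
        added = len(set(members) - target)
        target.update(members)
        return added   # like Redis: number of members that were new

    def smembers(self, key):
        return set(self._sets.get(key, ()))


def build_category_tree(redis_categories, redis_tree, get_parent_titles):
    """For every category, add its title to the set of each parent category."""
    for category_id in redis_categories.keys():
        title = redis_categories.get(category_id)
        for parent_cat in get_parent_titles(title):
            redis_tree.sadd(parent_cat, title)


# Made-up sample data.
redis_categories = FakeRedis({"1": "Computing",
                              "2": "Software",
                              "3": "Programming languages"})
parents = {"Software": ["Computing"],
           "Programming languages": ["Software", "Computing"]}
redis_tree = FakeRedis()

build_category_tree(redis_categories, redis_tree,
                    lambda title: parents.get(title, []))
print(sorted(redis_tree.smembers("Computing")))
```

With a real Redis client the same function should work unchanged, since it relies only on commands that redis-py provides under these names.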

So this task only copies data from one database to another. The results here were obtained after MongoDB had been warmed up (without the warm-up the results would be biased, since the Python task then uses only about 10% of the CPU). The timings are as follows:

Interpreter   User time   System time   CPU
PyPy          175.11s     66.11s        64%
CPython       457.92s     72.86s        81%

Traversing redis_tree (the tree stored in Redis)

Once we have the redis_tree database, the only remaining problem is to traverse all nodes reachable from the Computing category. To avoid getting stuck in cycles, we need to record the nodes that have already been visited. Since I wanted to test Python's database performance, I solved this by using a Redis set.
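A traversal of this kind might look like the sketch below. It is an assumption-laden illustration rather than the measured code: `FakeRedisSets` is an in-memory stand-in for the two set commands involved, and the key name `traversal:visited` is made up. It relies on the fact that SADD returns the number of members actually added, so a return value of 0 means the node was already visited.

```python
from collections import deque


class FakeRedisSets(object):
    """In-memory stand-in for the Redis SADD and SMEMBERS commands."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        target = self._sets.setdefault(key, set())
        added = len(set(members) - target)
        target.update(members)
        return added

    def smembers(self, key):
        return set(self._sets.get(key, ()))


def reachable(r, root, visited_key="traversal:visited"):
    """Breadth-first walk over the category sets, starting at root.

    Visited nodes are recorded in a set stored under visited_key, which
    must not collide with a category name.
    """
    r.sadd(visited_key, root)
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in r.smembers(node):
            # sadd returns 0 if child was already in the visited set
            if r.sadd(visited_key, child):
                queue.append(child)
    return r.smembers(visited_key)


# Made-up sample graph containing a cycle.
r = FakeRedisSets()
r.sadd("Computing", "Software", "Hardware")
r.sadd("Software", "Programming languages", "Computing")  # cycle back
r.sadd("History", "Ancient history")                       # unreachable

result = reachable(r, "Computing")
print(sorted(result))
```

Keeping the visited set in the database costs one round trip per edge, which is what makes this a test of database access rather than of in-process set operations.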

Interpreter   User time   System time   CPU   Total
PyPy          14.79s      6.22s         69%   30.322
CPython       44.20s      13.86s        71%   1:20.91

To be honest, this task also requires building a tabu list (ban list) to avoid entering unwanted categories. But that's not the point of this article.

Conclusion

The tests conducted are just a preview of my final work. That work requires a body of knowledge, which I obtain by extracting the appropriate content from Wikipedia.

Compared with CPython, PyPy improved performance by 2-3 times in my simple database operations. (I'm not counting the SQL parser here, where the speedup is about 8 times.)
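These factors can be checked against the timings reported above. The short calculation below divides the CPython time by the PyPy time for each task, using the user-mode times where they are given; the task labels are informal names chosen for this sketch.

```python
# (PyPy seconds, CPython seconds) as reported in the sections above
timings = {
    "SQL insert parser": (169.00, 1287.13),
    "SAX category filter": (2169.90, 4494.69),
    "Building redis_tree": (175.11, 457.92),
    "Traversing redis_tree": (14.79, 44.20),
}

speedups = {task: cpython / pypy
            for task, (pypy, cpython) in timings.items()}

for task in sorted(speedups):
    print("{0}: {1:.1f}x".format(task, speedups[task]))
```

The three database-heavy tasks come out at roughly 2.1x, 2.6x and 3.0x, and the SQL parser at roughly 7.6x, consistent with the figures quoted.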

Thanks to PyPy, my work was more pleasant: Python became efficient without my rewriting any algorithms, and PyPy did not hog my CPU the way CPython did, which for a while left me unable to use my laptop normally (look at the CPU time percentages).

The tasks are almost all database operations, and CPython has some messy C modules to accelerate them. PyPy doesn't use these, yet its results are faster!

All my work requires a lot of cycles, so I'm really excited to use PyPy.