Home Backend Development PHP Tutorial Use Huagepage and PGO to improve the execution performance of PHP7_php tips

Use Huagepage and PGO to improve the execution performance of PHP7_php tips

May 16, 2016 pm 08:04 PM
php7 performance

Hugepage
PHP7 has just released RC4, which includes some bug fixes and one of our latest performance improvements, which is "HugePageFy PHP TEXT segment". By enabling this feature, PHP7 will "move" its own TEXT segment (execution body) On Huagepage, in previous tests, we could steadily see a 2% to 3% QPS improvement on WordPress.

Regarding what Hugepage is, to put it simply, the default memory is paged in 4KB, and the virtual address and the memory address need to be converted, and this conversion requires a table lookup. In order to speed up the table lookup process, the CPU will Built-in TLB (Translation Lookaside Buffer), it is obvious that if the virtual page is smaller, the number of entries in the table will be more, and the TLB size is limited, the more entries, the higher the Cache Miss of the TLB will be, so if we Enabling large memory pages can indirectly reduce this TLB Cache Miss. As for the detailed introduction, I will not go into details after searching a lot on Google. Here I mainly explain how to enable this new feature, which will bring significant performance improvements.

It has become very easy to enable Hugepage with the new Kernel. Taking my development virtual machine as an example (Ubuntu Server 14.04, Kernel 3.13.0-45), if we view the memory information:

$ cat /proc/meminfo | grep Huge
Copy after login
Copy after login
Copy after login
Copy after login
AnonHugePages:  444416 kB
HugePages_Total:    0
HugePages_Free:    0
HugePages_Rsvd:    0
HugePages_Surp:    0
Hugepagesize:    2048 kB
Copy after login

It can be seen that the size of a Hugepage is 2MB, and HugePages are not currently enabled. Now let us compile PHP RC4 first, remember not to add: –disable-huge-code-pages (this new feature is enabled by default, you add Turn it off after this)

Then configure opcache. Starting from PHP5.5, Opcache has been enabled for compilation by default, but it is used to compile dynamic libraries, so we still need to configure and load it in php.ini.

zend_extension=opcache.so
Copy after login

This new feature is implemented in Opcache, so this feature must also be enabled through Opcache (by setting opcache.huge_code_pages=1), specific configuration:

opcache.huge_code_pages=1
Copy after login

Now let’s configure the OS and allocate some Hugepages:

$ sudo sysctl vm.nr_hugepages=128
vm.nr_hugepages = 128
Copy after login

Now let’s check the memory information again:

$ cat /proc/meminfo | grep Huge
Copy after login
Copy after login
Copy after login
Copy after login
AnonHugePages:  444416 kB
HugePages_Total:   128
HugePages_Free:   128
HugePages_Rsvd:    0
HugePages_Surp:    0
Hugepagesize:    2048 kB
Copy after login

You can see that the 128 Hugepages we allocated are ready, and then we start php-fpm:

$ /home/huixinchen/local/php7/sbin/php-fpm
Copy after login
[01-Oct-2015 09:33:27] NOTICE: [pool www] 'user' directive is ignored when FPM is not running as root
[01-Oct-2015 09:33:27] NOTICE: [pool www] 'group' directive is ignored when FPM is not running as root
Copy after login

Now, check the memory information again:

$ cat /proc/meminfo | grep Huge
Copy after login
Copy after login
Copy after login
Copy after login
AnonHugePages:  411648 kB
HugePages_Total:   128
HugePages_Free:   113
HugePages_Rsvd:    27
HugePages_Surp:    0
Hugepagesize:    2048 kB
Copy after login

Speaking of which, if Hugepages is available, Opcache will actually use Hugepages to store the opcodes cache, so in order to verify that opcache.huge_code_pages is indeed effective, we might as well close opcache.huge_code_pages, and then start it again to see the memory information:

$ cat /proc/meminfo | grep Huge
Copy after login
Copy after login
Copy after login
Copy after login
AnonHugePages:  436224 kB
HugePages_Total:   128
HugePages_Free:   117
HugePages_Rsvd:    27
HugePages_Surp:    0
Hugepagesize:    2048 kB
Copy after login

It can be seen that after huge_code_pages is turned on, 4 more pages are used after fpm is started. Now let’s check the text size of php-fpm:

$ size /home/huixinchen/local/php7/sbin/php-fpm
Copy after login
  text    data     bss     dec     hex   filename
10114565   695200   131528   10941293   a6f36d   /home/huixinchen/local/php7/sbin/php-fpm
Copy after login

可见text段有10114565个字节大小, 总共需要占用4.8个左右的2M的pages, 考虑到对齐以后(尾部不足2M Page部分不挪动), 申请4个pages, 正好和我们看到的相符。

说明配置成功! Enjoy :)

但是有言在先, 启用此特性以后, 会造成一个问题就是你如果尝试通过Perf report/anno 去profiling的时候, 会发现符号丢失(valgrind, gdb不受影响), 这个主要原因是Perf的设计采用监听了mmap,然后记录地址范围, 做IP到符号的转换, 但是目前HugeTLB只支持MAP_ANON, 所以导致Perf认为这部分地址没有符号信息,希望以后版本的Kernel可以修复这个限制吧..

GCC PGO
PGO正如名字所说(Profile Guided Optimization 有兴趣的可以Google), 他需要用一些用例来获得反馈, 也就是说这个优化是需要和一个特定的场景绑定的.

你对一个场景的优化, 也许在另外一个场景就事与愿违了. 它不是一个通用的优化. 所以我们不能简单的就包含这些优化, 也无法直接发布PGO编译后的PHP7.

当然, 我们正在尝试从PGO找出一些共性的优化, 然后手工Apply到PHP7上去, 但这个很明显不能做到针对一个场景的特别优化所能达到的效果, 所以我决定写这篇文章简单介绍下怎么使用PGO来编译PHP7, 让你编译的PHP7能特别的让你自己的独立的应用变得更快.

首先, 要决定的就是拿什么场景去Feedback GCC, 我们一般都会选择: 在你要优化的场景中: 访问量最大的, 耗时最多的, 资源消耗最重的一个页面.

拿Wordpress为例, 我们选择Wordpress的首页(因为首页往往是访问量最大的).

我们以我的机器为例:

Intel(R) Xeon(R) CPU X5687 @ 3.60GHz X 16(超线程),
48G Memory
php-fpm 采用固定32个worker, opcache采用默认的配置(一定要记得加载opcache)

以wordpress 4.1为优化场景..

首先我们来测试下目前WP在PHP7的性能(ab -n 10000 -c 100):

$ ab -n 10000 -c 100 http://inf-dev-maybach.weibo.com:8000/wordpress/
Copy after login
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
 
Benchmarking inf-dev-maybach.weibo.com (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
 
Server Software:    nginx/1.7.12
Server Hostname:    inf-dev-maybach.weibo.com
Server Port:      8000
 
Document Path:     /wordpress/
Document Length:    9048 bytes
 
Concurrency Level:   100
Time taken for tests:  8.957 seconds
Complete requests:   10000
Failed requests:    0
Write errors:      0
Total transferred:   92860000 bytes
HTML transferred:    90480000 bytes
Requests per second:  1116.48 [#/sec] (mean)
Time per request:    89.567 [ms] (mean)
Time per request:    0.896 [ms] (mean, across all concurrent requests)
Transfer rate:     10124.65 [Kbytes/sec] received
Copy after login

可见Wordpress 4.1 目前在这个机器上, 首页的QPS可以到1116.48. 也就是每秒钟可以处理这么多个对首页的请求,

现在, 让我们开始教GCC, 让他编译出跑Wordpress4.1更快的PHP7来, 首先要求GCC 4.0以上的版本, 不过我建议大家使用GCC-4.8以上的版本(现在都GCC-5.1了).

第一步, 自然是下载PHP7的源代码了, 然后做./configure. 这些都没什么区别

接下来就是有区别的地方了, 我们要首先第一遍编译PHP7, 让它生成会产生profile数据的可执行文件:

$ make prof-gen
Copy after login

注意, 我们用到了prof-gen参数(这个是PHP7的Makefile特有的, 不要尝试在其他项目上也这么搞哈 :) )

然后, 让我们开始训练GCC:

$ sapi/cgi/php-cgi -T 100 /home/huixinchen/local/www/htdocs/wordpress/index.php >/dev/null
Copy after login

也就是让php-cgi跑100遍wordpress的首页, 从而生成一些在这个过程中的profile信息.

然后, 我们开始第二次编译PHP7.

$ make prof-clean
$ make prof-use && make install
Copy after login

好的, 就这么简单, PGO编译完成了, 现在我们看看PGO编译以后的PHP7的性能:

$ ab -n10000 -c 100 http://inf-dev-maybach.weibo.com:8000/wordpress/
Copy after login
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
 
Benchmarking inf-dev-maybach.weibo.com (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
 
Server Software:    nginx/1.7.12
Server Hostname:    inf-dev-maybach.weibo.com
Server Port:      8000
 
Document Path:     /wordpress/
Document Length:    9048 bytes
 
Concurrency Level:   100
Time taken for tests:  8.391 seconds
Complete requests:   10000
Failed requests:    0
Write errors:      0
Total transferred:   92860000 bytes
HTML transferred:    90480000 bytes
Requests per second:  1191.78 [#/sec] (mean)
Time per request:    83.908 [ms] (mean)
Time per request:    0.839 [ms] (mean, across all concurrent requests)
Transfer rate:     10807.45 [Kbytes/sec] received
Copy after login

现在每秒钟可以处理1191.78个QPS了, 提升是~7%. 还不赖哈(咦, 你不是说10%么? 怎么成7%了? 呵呵, 正如我之前说过, 我们尝试分析PGO都做了些什么优化, 然后把一些通用的优化手工Apply到PHP7中. 所以也就是说, 那~3%的比较通用的优化已经包含到了PHP7里面了, 当然这个工作还在继续).

于是就这么简单, 大家可以用自己的产品的经典场景来训练GCC, 简单几步, 获得提升, 何乐而不为呢

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Repo: How To Revive Teammates
1 months ago By 尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

The local running performance of the Embedding service exceeds that of OpenAI Text-Embedding-Ada-002, which is so convenient! The local running performance of the Embedding service exceeds that of OpenAI Text-Embedding-Ada-002, which is so convenient! Apr 15, 2024 am 09:01 AM

Ollama is a super practical tool that allows you to easily run open source models such as Llama2, Mistral, and Gemma locally. In this article, I will introduce how to use Ollama to vectorize text. If you have not installed Ollama locally, you can read this article. In this article we will use the nomic-embed-text[2] model. It is a text encoder that outperforms OpenAI text-embedding-ada-002 and text-embedding-3-small on short context and long context tasks. Start the nomic-embed-text service when you have successfully installed o

Performance comparison of different Java frameworks Performance comparison of different Java frameworks Jun 05, 2024 pm 07:14 PM

Performance comparison of different Java frameworks: REST API request processing: Vert.x is the best, with a request rate of 2 times SpringBoot and 3 times Dropwizard. Database query: SpringBoot's HibernateORM is better than Vert.x and Dropwizard's ORM. Caching operations: Vert.x's Hazelcast client is superior to SpringBoot and Dropwizard's caching mechanisms. Suitable framework: Choose according to application requirements. Vert.x is suitable for high-performance web services, SpringBoot is suitable for data-intensive applications, and Dropwizard is suitable for microservice architecture.

PHP array key value flipping: Comparative performance analysis of different methods PHP array key value flipping: Comparative performance analysis of different methods May 03, 2024 pm 09:03 PM

The performance comparison of PHP array key value flipping methods shows that the array_flip() function performs better than the for loop in large arrays (more than 1 million elements) and takes less time. The for loop method of manually flipping key values ​​takes a relatively long time.

How to optimize the performance of multi-threaded programs in C++? How to optimize the performance of multi-threaded programs in C++? Jun 05, 2024 pm 02:04 PM

Effective techniques for optimizing C++ multi-threaded performance include limiting the number of threads to avoid resource contention. Use lightweight mutex locks to reduce contention. Optimize the scope of the lock and minimize the waiting time. Use lock-free data structures to improve concurrency. Avoid busy waiting and notify threads of resource availability through events.

What are the performance considerations for C++ static functions? What are the performance considerations for C++ static functions? Apr 16, 2024 am 10:51 AM

Static function performance considerations are as follows: Code size: Static functions are usually smaller because they do not contain member variables. Memory occupation: does not belong to any specific object and does not occupy object memory. Calling overhead: lower, no need to call through object pointer or reference. Multi-thread-safe: Generally thread-safe because there is no dependence on class instances.

How to use benchmarks to evaluate the performance of Java functions? How to use benchmarks to evaluate the performance of Java functions? Apr 19, 2024 pm 10:18 PM

A way to benchmark the performance of Java functions is to use the Java Microbenchmark Suite (JMH). Specific steps include: Adding JMH dependencies to the project. Create a new Java class and annotate it with @State to represent the benchmark method. Write the benchmark method in the class and annotate it with @Benchmark. Run the benchmark using the JMH command line tool.

How performant are PHP functions? How performant are PHP functions? Apr 18, 2024 pm 06:45 PM

The performance of different PHP functions is crucial to application efficiency. Functions with better performance include echo and print, while functions such as str_replace, array_merge, and file_get_contents have slower performance. For example, the str_replace function is used to replace strings and has moderate performance, while the sprintf function is used to format strings. Performance analysis shows that it only takes 0.05 milliseconds to execute one example, proving that the function performs well. Therefore, using functions wisely can lead to faster and more efficient applications.

Performance comparison of C++ with other languages Performance comparison of C++ with other languages Jun 01, 2024 pm 10:04 PM

When developing high-performance applications, C++ outperforms other languages, especially in micro-benchmarks. In macro benchmarks, the convenience and optimization mechanisms of other languages ​​such as Java and C# may perform better. In practical cases, C++ performs well in image processing, numerical calculations and game development, and its direct control of memory management and hardware access brings obvious performance advantages.

See all articles