Use process information to track down memory leaks
Abstract: Memory leaks are a common problem in long-running background server programs. Many tools can locate them, Valgrind among them, but they require restarting the process, and in some cases the same leak is hard or slow to reproduce after a restart. This article explores a way to analyze an existing process instance that has already leaked memory and to try to pinpoint the leak from it.
1. Problem phenomenon
Bigpipe is Baidu's internal distributed transmission system. Its server module, Broker, is implemented on an asynchronous programming framework and makes extensive use of reference counting to manage the lifetime and release timing of object resources. During stress testing of the Broker module, memory usage was found to grow gradually after the Broker had been running for a long time: a memory leak had occurred.
2. Preliminary analysis
Start from the Broker's recent upgrades to determine which objects might be leaking. The Broker recently gained a monitoring feature, part of which collects statistics on each of the server's parameter objects. Collecting them requires reading the parameter object, and each read increments the object's reference count by one and decrements it by one when the read completes. There are several such parameter objects, so the next step is to determine which one is leaking.
3. Code & Business Analysis
1. To confirm the preliminary analysis, one option is to run the Broker under Valgrind and start the stress program to reproduce the suspected leak. This approach has two problems:
1) The conditions that trigger the leak are not simple, so reproduction may take a very long time, and the same leak may not be reproducible at all;
2) The leaked objects are held in a container, so Valgrind does not report them as leaked after a normal exit;
In a short reproduction attempt on another test cluster, Valgrind reported nothing abnormal.
2. Analyze the existing conditions: Fortunately, the Broker process showing the memory leak is still running, and the answer lies inside it. This live scene should be exploited fully to locate the problem. The initial plan is to debug it with GDB.
3. Challenge: Attaching GDB to the process (attach {PID}) suspends it. By design, when a paired master/slave Broker stops receiving heartbeats from its peer, the Broker exits automatically, and the scene cannot be preserved afterwards. This means there is only one chance to use GDB.
4. Solution: Use GDB to dump the process's memory and look for likely leak points in the dump.
5. Step 1: Run pmap -x {PID} to view the memory map (for example, pmap -x 24671). The output is similar to the following; pay attention to the mappings marked anon:
24671: ./bin/broker
Address Kbytes RSS Anon Locked Mode Mapping
0000000000400000 11508 - - - r-x-- broker
000000000103c000 388 - - - rw--- broker
000000000109d000 144508 - - - rw--- [ anon ]
00007fb3f583b000 4 - - - rw--- libgcc_s-3.4.5-20051201.so.1
---------------- ------ ------ ------ ------
total kB 610180 - - -
6. Step 2: Start gdb ./bin/broker and use the attach {PID} command to attach to the running process. For the process above, whose PID is 24671, use: attach 24671;
7. Step 3: Use set height 0 and set logging on to turn off paging and enable GDB logging; the log is written to gdb.txt;
8. Step 4: Use x/{number of 8-byte words}a {memory address} to print a region of memory. For example, the anon mapping above starts at the heap base address 0x109d000 and occupies 144508 KB, which is 144508 × 1024 / 8 = 18497024 words, so use: x/18497024a 0x000000000109d000. If there are many commands, you can prepare them in an external editor and paste them at the gdb prompt, or write them to a text file such as command.txt and run source command.txt at the gdb prompt to execute them. The content of command.txt is as follows:
set height 0
set logging on
x/18497024a 0x000000000109d000
x/23552a 0x000000317ae09000
x/2048a 0x000000317b65e000
x/512a 0x000000318a821000
x/2560a 0x000000318b18d000
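The count passed to each x/…a command follows from the Kbytes column of pmap: with 8-byte words it is Kbytes × 1024 / 8. As a rough sketch of that arithmetic, the hypothetical helper gdb_dump_command below (not part of the original tooling) turns one pmap -x row into the matching gdb command. It assumes a 64-bit process, where the a format prints 8-byte words.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: turn one `pmap -x` row into a gdb examine command.
// Returns an empty string when the row does not start with
// "<address> <Kbytes>" (for example the header or the "total kB" row).
std::string gdb_dump_command(const std::string& pmap_row) {
    std::istringstream in(pmap_row);
    std::string address;
    std::uint64_t kbytes = 0;
    if (!(in >> address >> kbytes)) {
        return "";
    }
    // Assumption: x86-64, so the `a` format prints 8 bytes per unit.
    const std::uint64_t words = kbytes * 1024 / 8;
    return "x/" + std::to_string(words) + "a 0x" + address;
}
```

For the anon row shown in Step 1 this produces x/18497024a 0x000000000109d000, the first dump command in command.txt.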
9. Step 5: Analyze the information in gdb.txt. Its content looks like the following:
0x1071000 <_ZN7bigpipe13bmq_handler_t16_heart_beat_bodyE+832>: 0x0 0x0
0x1071010 <_ZN7bigpipe13bmq_handler_t16_heart_beat_bodyE+848>: 0x0 0x0
…
0x10710c0 <_zgvz5getippce4lock>: 0x0 0x0
0x10710d0 <_zgvzn7bigpipe13bmq_handler_t14get_heart_beaterie4__sl>: 0x0 0x0
0x10710e0 <_zst8__ioinit>: 0x0 0x0
0x10710f0 <_zgvz5getippce4lock>: 0x0 0x0
…
0x22c2f00: 0x10200d0 <_ZTVN7bigpipe15BigpipeDIEngineE+16> 0x4600000001
0x22c2f10: 0x1 0x117087b
0x22c2f20: 0x0 0x1214495
…
0x22c2f70: 0x0 0x0
0x22c2f80: 0x0 0x0
0x22c2f90: 0x0 0x0
…
Explanation and analysis of gdb.txt: the first column is the memory address of the line, such as 0x22c2f00. The remaining columns are the values stored at that address, in hexadecimal, together with any debug symbol gdb can resolve for them. For example, in 0x10200d0 <_ZTVN7bigpipe15BigpipeDIEngineE+16> 0x4600000001 the three fields are the first 8 bytes, the symbol for that value (note the offset of 16), and the next 8 bytes. Not every value gets a symbol, and the symbol sometimes appears in the second column and sometimes in the third. The word at 0x22c2f00 is the first word of an object of class bigpipe::BigpipeDIEngine, namely its virtual function table pointer (vptr). The address it holds, 0x10200d0, lies inside the virtual function table (vtbl) of BigpipeDIEngine, and the entry stored there is the pointer to the class's virtual destructor, as shown below:
(gdb) x/a 0x10200d0
0x10200d0 <_ZTVN7bigpipe15BigpipeDIEngineE+16>: 0x53e2c6
(gdb) x/i 0x53e2c6
0x53e2c6 : push %rbp
(gdb) x/a 0x53e2c6
0x53e2c6 : 0xec834853e5894855
The value stored at 0x10200d0 is the address of the BigpipeDIEngine destructor, that is, 0x53e2c6, where the destructor's code begins. The output above confirms this: gdb labels 0x53e2c6 with the destructor's name, and the instruction there is push %rbp, a typical function prologue. So the word first seen at 0x22c2f00 is the vptr of an object, and gdb annotates it with the BigpipeDIEngine symbol. Each instance of a class with a virtual destructor leaves one such annotated word in the dump, so counting the annotations shows how many instances of each class exist, and the counts can guide further judgment. To get them, filter and sort gdb.txt into a list of how often each symbol (class or function name) appears. For example, to extract the symbol part in angle brackets and sort by number of occurrences, use a command similar to:
cat gdb.txt | grep "<" | awk -F '<' '{print $2}' | awk -F '>' '{print $1}' | sort | uniq -c | sort -rn > result.txt
Then keep only the vtable symbols with project-related prefixes (such as bmq, Bigpipe, bmeta):
cat result.txt | grep -P "bmq|Bigpipe|bigpipe|bmeta" | grep "_ZTV" > result2.txt
The result is a list similar to the following:
35782 _ZTVN7bigpipe14CConnectE+16
282 _ZTVN3bsl3var4IVarE+16
179 _ZTVN7bigpipe19bmeta_stripe_info_tE+16
26 _ZTV13AutoKylinLockI5MutexE+16
21 _ZTVN6google8protobuf8internal26GeneratedMessageReflectionE+16
8 _ZTVN6comcfg17ConstraintLibrary12WrapFunctionE+16
8 _ZTVN3bsl3var11BasicStringINS_12basic_stringIcNS_14pool_allocatorIcEEEEEE+16
6 _ZTVN7bigpipe19bmeta_broker_info_tE+16
6 _ZTVN7bigpipe15BigpipeDIEngineE+16
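The same counting can be done in a small program instead of a shell pipeline. The following is a minimal sketch in C++, not the author's tooling: count_symbols and is_vtable_symbol are hypothetical names. Like the awk commands, it looks only at the first annotation in angle brackets on each line, so a symbol attached to the second value of a line is not counted.

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> SymbolCount;

// Hypothetical helper: true for vtable symbols such as _ZTVN7bigpipe...E+16.
bool is_vtable_symbol(const std::string& name) {
    return name.compare(0, 4, "_ZTV") == 0;
}

// Count the text between the first '<' and the following '>' on every line
// of a gdb dump, and return the symbols ordered by descending count.
std::vector<SymbolCount> count_symbols(std::istream& dump) {
    std::map<std::string, int> counts;
    std::string line;
    while (std::getline(dump, line)) {
        const std::size_t open = line.find('<');
        if (open == std::string::npos) continue;   // no annotation on this line
        const std::size_t close = line.find('>', open + 1);
        if (close == std::string::npos) continue;  // unterminated annotation
        ++counts[line.substr(open + 1, close - open - 1)];
    }
    std::vector<SymbolCount> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const SymbolCount& a, const SymbolCount& b) {
                         return a.second > b.second;
                     });
    return sorted;
}
```

Feeding it gdb.txt through a std::ifstream and keeping the entries for which is_vtable_symbol returns true gives roughly the per-class instance counts listed above.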
10. The list shows that CConnect is the project-related object that appears most often. Having identified the likely leaked object, the next task is to find the reference-counting bug in the asynchronous framework that keeps the CConnect count from dropping to zero and the object from being released.
11. Tracing the code shows that the newly added monitoring function touches CConnect as follows:
if (atomic_add(&_count, -1) == 0) {
    _free(_conn);
}
4. The truth is revealed
The implementation of atomic_add (shown below) returns the value before the addition, and nothing in the name atomic_add says so. The caller therefore misused it, assuming it returns the value after the addition. When the last reference is dropped, the count goes from 1 to 0 but atomic_add returns 1, so the comparison with 0 fails, _free is never called, and the memory leaks. __sync_fetch_and_add has a counterpart, __sync_add_and_fetch; the difference between the two is whether the value is fetched first and then added to, or added to first and then fetched.
int atomic_add(volatile int *count, int add)
{
    /* __sync_fetch_and_add returns the value *count held before the addition */
    register int __res;
    __res = __sync_fetch_and_add(count, add);
    return __res;
}
5. Solution
Therefore, the code is changed as follows, so that the comparison uses the value after the decrement:
if (atomic_add_and_fetch(&_count, -1) == 0) {
    _free(_conn);
}
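The effect of the two return conventions on a reference count can be checked in isolation. The sketch below is an illustration only: Conn, release_buggy and release_fixed are hypothetical names, the freed flag stands in for the real _free call, and atomic_add_and_fetch is assumed to be a thin wrapper over __sync_add_and_fetch, which the article does not show.

```cpp
#include <cassert>

// Stand-in for the article's helper: returns the value BEFORE the addition.
static int atomic_add(volatile int* count, int add) {
    return __sync_fetch_and_add(count, add);
}

// Assumed shape of the fixed helper: returns the value AFTER the addition.
static int atomic_add_and_fetch(volatile int* count, int add) {
    return __sync_add_and_fetch(count, add);
}

// Hypothetical connection handle; `freed` records whether release happened.
struct Conn {
    volatile int count;
    bool freed;
};

// Mirrors the original check: dropping the last reference (1 -> 0)
// returns 1, so the comparison fails and nothing is freed.
void release_buggy(Conn& conn) {
    if (atomic_add(&conn.count, -1) == 0) {
        conn.freed = true;
    }
}

// Mirrors the fix: dropping the last reference returns 0 and frees.
void release_fixed(Conn& conn) {
    if (atomic_add_and_fetch(&conn.count, -1) == 0) {
        conn.freed = true;
    }
}
```

Starting from a count of 1, release_buggy leaves the count at 0 with the object still allocated, which is the leak described above, while release_fixed frees it.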
6. Summary
1. Problems in programs built on an asynchronous framework are harder to locate and trace, so reproducing and locating them takes a combination of means: logs, gdb, pmap, and others;
2. Valgrind is not the only way to detect memory leaks, and it has limitations;
3. A function's name should convey what the function does as directly as possible, which prevents some caller errors;
4. Read a library function's documentation carefully to understand how to use it.
Applicable scenarios and limitations of this method:
1) When printing memory with gdb, the number of instances must correspond one-to-one with the symbols in the memory dump. In the case above, the CConnect class has a virtual destructor, so every instance's vptr is visible in the dump and maps one-to-one to an occurrence of the symbol, which makes it possible to infer the leak. If the leaked memory leaves no such trace in the dump, the method yields no useful information;
2) It applies when offline attempts to reproduce the leak have failed but the leaking process (the scene) is still running online; the method can then extract more information about the leak from that process;
3) It works directly on the existing process that has already leaked, making full use of the problem scene;
4) It supplements other memory leak debugging methods and is worth trying as a reference.
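The first limitation rests on a layout assumption: every instance of a class with virtual functions carries the same vptr in its first word, and different classes carry different ones. A minimal sketch of that assumption follows. Widget, Gadget and first_word are hypothetical names, and the behavior shown is what GCC and Clang do under the Itanium C++ ABI, not something the C++ standard guarantees.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Two unrelated hypothetical classes, each with a virtual destructor.
struct Widget {
    virtual ~Widget() {}
    int value = 0;
};

struct Gadget {
    virtual ~Gadget() {}
    int value = 0;
};

// Read the first pointer-sized word of an object's storage. Under the
// Itanium C++ ABI this word is the vptr for classes like the two above.
template <typename T>
std::uintptr_t first_word(const T& object) {
    std::uintptr_t word = 0;
    std::memcpy(&word, static_cast<const void*>(&object), sizeof(word));
    return word;
}
```

Two Widget objects report the same first word and a Gadget reports a different one, which is why counting one vtable symbol in the dump approximates counting live instances of one class.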