Detailed explanation of HashTable structure in PHP

HashTable is the most important and most widely used data structure in the Zend engine. It is used to store almost everything.
1.2.1 Data structure
HashTable data structure is defined as follows:


typedef struct bucket {
    ulong h;                    // hash value (or numeric index)
    uint nKeyLength;            // key length; 0 when there is no string key
    void *pData;                // points to the value, which is a copy of the user's data
    void *pDataPtr;             // inline storage for pointer-sized values
    struct bucket *pListNext;   // pListNext and pListLast link the bucket into the
    struct bucket *pListLast;   // doubly linked list that spans the whole HashTable
    struct bucket *pNext;       // pNext and pLast link the bucket into the doubly
    struct bucket *pLast;       // linked list of buckets that share one hash slot
    char arKey[1];              // key; must be the last member
} Bucket;

typedef struct _hashtable {
    uint nTableSize;            // number of slots in arBuckets, always a power of two
    uint nTableMask;            // nTableSize - 1
    uint nNumOfElements;
    ulong nNextFreeElement;     // hash value used by the next append
    Bucket *pInternalPointer;   /* Used for element traversal */
    Bucket *pListHead;
    Bucket *pListTail;
    Bucket **arBuckets;         // hash array
    dtor_func_t pDestructor;    // set when the HashTable is initialized; called when a Bucket is destroyed
    zend_bool persistent;       // whether to use the C memory allocation routines
    unsigned char nApplyCount;
    zend_bool bApplyProtection;
#if ZEND_DEBUG
    int inconsistent;
#endif
} HashTable;

In general, Zend's HashTable is a chained hash table (collisions are resolved with linked lists) that is also optimized for linear traversal.

HashTable combines two data structures: a chained hash table and a doubly linked list. The former gives fast key-value lookup, and the latter makes linear traversal and sorting convenient. Every Bucket belongs to both structures at the same time.
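The dual membership described above can be illustrated with a minimal sketch. Everything in it (`node`, `mini_table`, `mini_insert`, `mini_find`, the fixed size of 8 slots) is hypothetical and much simpler than Zend's code: the links are singly linked, the hash value doubles as the key, and duplicates are not checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified node: not Zend's Bucket, only the same idea. */
typedef struct node {
    unsigned long h;          /* hash value, used directly as the key here */
    int value;
    struct node *chain_next;  /* next node in the same hash slot */
    struct node *list_next;   /* next node in insertion order */
} node;

typedef struct {
    node *slots[8];           /* hash array; size must be a power of two */
    node *head, *tail;        /* global insertion-order list */
} mini_table;

/* Insert without checking for duplicates; returns 0 on success, -1 on OOM. */
int mini_insert(mini_table *t, unsigned long h, int value)
{
    node *n;
    assert(t != NULL);
    n = malloc(sizeof *n);
    if (n == NULL) return -1;
    n->h = h;
    n->value = value;
    /* 1) link into the hash chain for fast lookup */
    n->chain_next = t->slots[h & 7];
    t->slots[h & 7] = n;
    /* 2) link at the tail of the global list for ordered traversal */
    n->list_next = NULL;
    if (t->tail != NULL) t->tail->list_next = n; else t->head = n;
    t->tail = n;
    return 0;
}

/* Look a node up through the hash array only. */
node *mini_find(const mini_table *t, unsigned long h)
{
    for (node *n = t->slots[h & 7]; n != NULL; n = n->chain_next)
        if (n->h == h) return n;
    return NULL;
}
```

Lookup touches only one slot's chain, while a walk from `head` through `list_next` visits every element in insertion order regardless of which slot it hashed to.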
A few notes on this data structure:
Why do the hash chains use doubly linked lists?
An ordinary chained hash table is only ever operated on by key, so singly linked chains are enough. Zend, however, sometimes needs to delete a given Bucket from its chain, and a doubly linked list makes that very efficient.
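As a rough sketch of why the back pointer helps, the function below unlinks a node it is handed directly, without walking the chain to find its predecessor. The names `chain_node` and `chain_unlink` are hypothetical; only the roles of `next` and `prev` correspond to pNext and pLast.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified bucket with only the per-slot chain links. */
typedef struct chain_node {
    struct chain_node *next;  /* plays the role of pNext */
    struct chain_node *prev;  /* plays the role of pLast */
} chain_node;

/* Unlink n from the chain whose head pointer is *slot, in constant time. */
void chain_unlink(chain_node **slot, chain_node *n)
{
    assert(slot != NULL && n != NULL);
    if (n->prev != NULL)
        n->prev->next = n->next;
    else
        *slot = n->next;      /* n was the first node of the chain */
    if (n->next != NULL)
        n->next->prev = n->prev;
    n->next = n->prev = NULL;
}
```

With a singly linked chain the same operation would need a scan from the slot head to locate the predecessor of `n`.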
What does nTableMask do?
This value converts a hash value into an index into the arBuckets array. When initializing a HashTable, Zend first allocates nTableSize slots for the arBuckets array. nTableSize is the smallest power of two (2^n) that is not less than the size the user asked for, so in binary it is a 1 followed by zeros (for example 10000). nTableMask = nTableSize - 1, which in binary is a 0 followed by ones (01111). As a result, h & nTableMask always falls in [0, nTableSize - 1], and Zend uses it as the index into arBuckets.
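The arithmetic can be sketched as follows. `round_up_pow2` and `slot_index` are hypothetical helpers written for this illustration, not Zend functions, and the sketch assumes a 32-bit unsigned int.

```c
#include <assert.h>

/* Smallest power of two that is >= requested (hypothetical helper;
 * assumes 1 <= requested <= 2^31 and a 32-bit unsigned int). */
unsigned int round_up_pow2(unsigned int requested)
{
    unsigned int size = 1;
    assert(requested >= 1 && requested <= 0x80000000u);
    while (size < requested)
        size <<= 1;
    return size;
}

/* Map a hash value to a slot index; table_size must be a power of two. */
unsigned int slot_index(unsigned long h, unsigned int table_size)
{
    unsigned int mask = table_size - 1;   /* e.g. 16 -> binary 01111 */
    return (unsigned int)(h & mask);
}
```

For a power-of-two table size the mask gives the same result as `h % table_size`, but with a single AND instruction instead of a division.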
What does pDataPtr do?
Normally, when the user inserts a key-value pair, Zend makes a copy of the value and points pData at the copy. Copying requires a call to Zend's internal routine emalloc to allocate memory. This is an expensive operation, and it consumes more memory than the value itself (the extra memory holds the allocator's bookkeeping data, the cookie). If the value is small, the relative waste is large. Because HashTable is mostly used to store pointer values, Zend introduces pDataPtr: when the value is the size of a pointer, Zend copies it directly into pDataPtr and points pData at pDataPtr. This avoids the emalloc call and also improves the cache hit rate.
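A minimal sketch of this small-value optimization follows. The type `value_slot` and the functions `slot_store` and `slot_release` are hypothetical, and plain malloc/free stand in for emalloc/efree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified bucket: only the two value fields. */
typedef struct {
    void *data;      /* always points at the stored copy (like pData) */
    void *data_ptr;  /* inline storage for pointer-sized values (like pDataPtr) */
} value_slot;

/* Copy `size` bytes from `src` into the slot. Returns 0 on success. */
int slot_store(value_slot *s, const void *src, size_t size)
{
    assert(s != NULL && src != NULL);
    if (size == sizeof(void *)) {
        /* Pointer-sized value: keep it inside the slot, no allocation. */
        memcpy(&s->data_ptr, src, sizeof(void *));
        s->data = &s->data_ptr;
        return 0;
    }
    s->data_ptr = NULL;
    s->data = malloc(size);          /* stand-in for emalloc */
    if (s->data == NULL) return -1;
    memcpy(s->data, src, size);
    return 0;
}

/* Free the copy, but only if it actually lives on the heap. */
void slot_release(value_slot *s)
{
    if (s->data != (void *)&s->data_ptr)
        free(s->data);
    s->data = NULL;
}
```

Readers of the slot always go through `data`, so they do not need to know which of the two storage strategies was used.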
Why is the arKey size only 1? Why not use pointers to manage keys?
arKey is the array that stores the key, but it is declared with a size of only 1, which is clearly not enough to hold a key. The explanation is in the code that allocates a Bucket when an element is inserted into a HashTable:


p = (Bucket *) pemalloc (sizeof(Bucket) - 1 + nKeyLength, ht->persistent);

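As this shows, Zend allocates a single block of memory for each Bucket, large enough to hold both the Bucket itself and its key. The first part of the block is the Bucket and the second part is the key. Because arKey is deliberately the last member of Bucket, the key can be accessed through arKey. This technique is most common in memory management routines: when memory is allocated, the block actually obtained is larger than the requested size, and the extra leading part, usually called a cookie, stores information about the block such as its size and pointers to the previous and next blocks. Baidu's Transmit program uses the same method.
Not managing the key through a pointer saves one emalloc call and also improves the cache hit rate. Another essential reason is that a key almost never changes once it is stored, so the whole Bucket never has to be reallocated because its key grew longer. This also explains why the value is not allocated inline in the same way: the value can change.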
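The single-allocation layout can be sketched with a C99 flexible array member, which is the standard-conforming form of the `arKey[1]` trick; note that this is a substitution, since the Zend code shown above uses a one-element array and subtracts 1 from the size instead. `key_bucket` and `key_bucket_new` are hypothetical names, and malloc stands in for pemalloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified bucket; `key` must stay the last member. */
typedef struct {
    unsigned long h;
    unsigned int key_len;     /* length including the trailing '\0' */
    char key[];               /* C99 flexible array member */
} key_bucket;

/* Allocate header and key in one block; the caller frees it with free(). */
key_bucket *key_bucket_new(unsigned long h, const char *key)
{
    size_t len;
    key_bucket *b;
    assert(key != NULL);
    len = strlen(key) + 1;
    b = malloc(sizeof *b + len);     /* one allocation instead of two */
    if (b == NULL) return NULL;
    b->h = h;
    b->key_len = (unsigned int)len;
    memcpy(b->key, key, len);        /* key bytes follow the header */
    return b;
}
```

Because header and key sit in one contiguous block, comparing a key during lookup tends to touch memory that is already in cache after reading `h`.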

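1.2.2 PHP arrays
One question about HashTable is still unanswered: what is nNextFreeElement for?
Unlike an ordinary hash table, Zend's HashTable lets the user specify the hash value directly and ignore the key, or even supply no key at all (in which case nKeyLength is 0). HashTable also supports an append operation, in which the user does not even specify a hash value and provides only the value. Zend then uses nNextFreeElement as the hash and increments nNextFreeElement afterwards.
This behavior looks odd, because such a value cannot be looked up by key, so the structure is no longer strictly a hash table. The key to understanding it is that PHP arrays are implemented with HashTable. An associative array adds elements through the normal key-value mapping, with the user-supplied string as the key. A non-associative array uses the array index directly as the hash value and has no key. nNextFreeElement is needed when associative and non-associative use are mixed in one array, or when array_push is used.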
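As for values, PHP arrays use the general-purpose zval structure directly. pData points to a zval*, and as described in the previous section, that zval* is stored directly in pDataPtr. Because zval is used directly, array elements can be of any PHP type.
Array traversal, that is foreach, each and similar operations, walks the HashTable's doubly linked list, with pInternalPointer acting as a cursor that records the current position.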
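Cursor-based traversal can be sketched like this. All names (`item`, `ordered_list`, `cursor_reset`, `cursor_current`, `cursor_forward`, `sum_all`) are hypothetical; only forward traversal is shown, so the list is singly linked here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct item {
    int value;
    struct item *list_next;    /* plays the role of pListNext */
} item;

typedef struct {
    item *head;                /* plays the role of pListHead */
    item *cursor;              /* plays the role of pInternalPointer */
} ordered_list;

/* Move the cursor back to the first element. */
void cursor_reset(ordered_list *l) { l->cursor = l->head; }

/* Returns the current item, or NULL once traversal has run off the end. */
item *cursor_current(const ordered_list *l) { return l->cursor; }

/* Advance the cursor by one element, stopping at the end. */
void cursor_forward(ordered_list *l)
{
    if (l->cursor != NULL)
        l->cursor = l->cursor->list_next;
}

/* Sum every value by walking the list with the cursor, foreach-style. */
int sum_all(ordered_list *l)
{
    int sum = 0;
    assert(l != NULL);
    for (cursor_reset(l); cursor_current(l) != NULL; cursor_forward(l))
        sum += cursor_current(l)->value;
    return sum;
}
```

Since the cursor lives in the table rather than in the caller, a traversal can be paused and resumed across separate calls.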

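1.2.3 Variable symbol table
Besides arrays, HashTable is used to store many other kinds of data, such as PHP functions, variable symbols, loaded modules, and class members.
A variable symbol table is essentially an associative array whose keys are variable names (so very long variable names are not a good idea) and whose values are zval*.
At any moment PHP code can see two variable symbol tables, symbol_table and active_symbol_table. The former stores global variables and is called the global symbol table. The latter is a pointer to the currently active variable symbol table, which is normally the global symbol table. Each time a PHP function is entered (meaning a function the user wrote in PHP code), however, Zend creates a symbol table local to that function and points active_symbol_table at it. Zend always accesses variables through active_symbol_table, which is how the scope of local variables is enforced.
If a function accesses a variable declared global, Zend handles it specially: it creates, in active_symbol_table, a reference to the variable of the same name in symbol_table, first creating that variable if symbol_table does not yet contain it.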
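The scoping mechanism can be modelled in a few lines. This is a sketch under loud assumptions: `sym_table` is a tiny linear map rather than a HashTable, values are `int *` rather than zval*, and `sym_lookup`, `sym_bind` and `sym_import_global` are hypothetical names. Unlike Zend, the sketch does not create a missing global; it asserts instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 8

/* Hypothetical symbol table: a tiny linear map from name to value slot.
 * Zend uses a HashTable of zval* here; this only models the scoping. */
typedef struct {
    const char *names[MAX_SYMS];
    int *values[MAX_SYMS];     /* pointers, so two tables can share a value */
    int count;
} sym_table;

sym_table global_table;                       /* like symbol_table */
sym_table *active_table = &global_table;      /* like active_symbol_table */

/* Find a variable in one table; NULL if it is not visible there. */
int *sym_lookup(sym_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0) return t->values[i];
    return NULL;
}

/* Add a name to a table, pointing at an existing value slot. */
void sym_bind(sym_table *t, const char *name, int *value)
{
    assert(t->count < MAX_SYMS);
    t->names[t->count] = name;
    t->values[t->count] = value;
    t->count++;
}

/* Model of `global $name;` inside a function: alias the global's slot. */
void sym_import_global(const char *name)
{
    int *slot = sym_lookup(&global_table, name);
    assert(slot != NULL);      /* Zend would create the global if missing */
    sym_bind(active_table, name, slot);
}
```

Entering a function corresponds to pointing `active_table` at a fresh table and restoring the previous pointer on return; a write through an imported name then changes the global's value, while other locals disappear with the local table.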

1.3 Memory and files
The resources a program owns generally include memory and files. For an ordinary program these resources are tied to the process: when the process ends, the operating system or the C library automatically reclaims whatever we did not release explicitly.
PHP programs are different because they are page-based. A running page also requests resources such as memory and files, but when the page finishes, the operating system or C library may not know that anything needs to be reclaimed. For example, if PHP is compiled into Apache as a module and Apache runs in prefork or worker mode, the Apache processes or threads are reused, and memory allocated by a PHP page stays allocated for as long as the process lives, in the worst case until it crashes.
To solve this problem, Zend provides a set of memory allocation APIs. They behave like the corresponding C functions, except that they allocate from Zend's own memory pool and support automatic, page-based reclamation. In a module, memory allocated for a page should come from these APIs rather than from the C routines; otherwise Zend will try to efree our memory at the end of the page, and the result is usually a crash.
emalloc()
efree()
estrdup()
estrndup()
ecalloc()
erealloc()
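The idea of page-based reclamation can be sketched with a small pool that remembers every block it hands out. This is not Zend's memory manager: `pool_block`, `request_pool`, `pool_alloc` and `pool_end_request` are hypothetical, individual frees are omitted, and the sketch assumes a C11 compiler for `max_align_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request pool: every block carries a small header that
 * links it into a per-request list, so leftovers can be freed in bulk. */
typedef struct pool_block {
    struct pool_block *next;
    max_align_t payload[];     /* user data starts here, suitably aligned */
} pool_block;

typedef struct {
    pool_block *blocks;        /* every block allocated during this request */
} request_pool;

/* Allocate `size` bytes that belong to the current request. */
void *pool_alloc(request_pool *p, size_t size)
{
    pool_block *b;
    assert(p != NULL);
    b = malloc(sizeof *b + size);
    if (b == NULL) return NULL;
    b->next = p->blocks;
    p->blocks = b;
    return b->payload;
}

/* Called once when the page finishes: frees whatever is still allocated
 * and returns the number of blocks that were reclaimed. */
size_t pool_end_request(request_pool *p)
{
    size_t freed = 0;
    assert(p != NULL);
    while (p->blocks != NULL) {
        pool_block *next = p->blocks->next;
        free(p->blocks);
        p->blocks = next;
        freed++;
    }
    return freed;
}
```

The header in front of each payload also hints at why the two families of routines must not be mixed: a pointer that came from plain malloc has no such header, so a pool-aware free would read garbage in its place.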
In addition, Zend provides a set of macros of the form VCWD_xxx that replace the corresponding file APIs of the C library and the operating system. These macros support PHP's virtual working directory and should always be used in module code. For their definitions, see "TSRM/tsrm_virtual_cwd.h" in the PHP source code. You may notice that these macros include no close operation. This is because close operates on an already opened resource and does not involve a file path, so the C or operating system routine can be used directly. The same applies to operations such as read and write.
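Why only path-taking operations need a replacement can be seen from a minimal sketch of path resolution against a virtual working directory. `virtual_resolve` is a hypothetical helper, not one of the VCWD_xxx macros; it assumes POSIX-style paths and does not normalize "." or ".." components.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Resolve `path` against a per-request virtual working directory instead
 * of the process-wide one. Returns 0 on success, -1 if `out` is too small. */
int virtual_resolve(const char *vcwd, const char *path,
                    char *out, size_t out_size)
{
    int n;
    assert(vcwd != NULL && path != NULL && out != NULL);
    if (path[0] == '/')
        n = snprintf(out, out_size, "%s", path);          /* already absolute */
    else
        n = snprintf(out, out_size, "%s/%s", vcwd, path); /* relative to vcwd */
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Once a file has been opened through the resolved path, later operations work on the descriptor or stream and no longer depend on any working directory.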
