The hash table is the core of PHP, and that is no exaggeration.
PHP's arrays, associative arrays, object properties, function tables, symbol tables, and so on all use HashTable as their container.
PHP's HashTable resolves collisions by separate chaining (the "zipper" method), which needs no further explanation. My main focus today is PHP's hash algorithm and some of the ideas the algorithm itself reveals.
PHP's hash uses the very common DJBX33A (Daniel J. Bernstein, Times 33 with Addition). This algorithm is widely used in many software projects, such as Apache, Perl and Berkeley DB. For strings, it is among the best hash algorithms currently known, because it is very fast to compute and distributes very well (few collisions, even distribution).
The core idea of the algorithm is:
    hash(i) = hash(i-1) * 33 + str[i]
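As a minimal sketch of this recurrence, the plain loop below applies one multiply-and-add per character. The function name `times33` and its `seed` parameter (standing in for the value before the first character) are hypothetical names chosen for this illustration, not taken from any of the projects mentioned.

```c
#include <stddef.h>

/* Plain "times 33 with addition" loop:
 *   hash(i) = hash(i-1) * 33 + str[i]
 * `times33` and `seed` are hypothetical names for this sketch;
 * `seed` plays the role of the value before the first character. */
static unsigned long times33(const char *str, size_t len, unsigned long seed)
{
    unsigned long hash = seed;
    for (size_t i = 0; i < len; i++) {
        /* Cast so that bytes above 0x7f are not sign-extended. */
        hash = hash * 33 + (unsigned char)str[i];
    }
    return hash;
}
```

For example, with a seed of 0 the two-character key "ab" hashes to 97 * 33 + 98 = 3299.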
In zend_hash.h, we can find this algorithm in PHP:
    static inline ulong zend_inline_hash_func(char *arKey, uint nKeyLength)
    {
        register ulong hash = 5381;

        /* variant with the hash unrolled eight times */
        for (; nKeyLength >= 8; nKeyLength -= 8) {
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
            hash = ((hash << 5) + hash) + *arKey++;
        }
        switch (nKeyLength) {
            case 7: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 6: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 5: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 4: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 3: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 2: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
            case 1: hash = ((hash << 5) + hash) + *arKey++; break;
            case 0: break;
            EMPTY_SWITCH_DEFAULT_CASE()
        }
        return hash;
    }
Compare this with the classic Times 33 algorithm as used directly in Apache and Perl:
    # Hashing function used in Perl 5.005:
    # Return the hashed value of a string: $hash = perlhash("key")
    # (Defined by the PERL_HASH macro in hv.h)
    sub perlhash
    {
        $hash = 0;
        foreach (split //, shift) {
            $hash = $hash*33 + ord($_);
        }
        return $hash;
    }
In PHP's hash algorithm, we can see some subtle differences.
First, the most obvious difference is that PHP does not multiply by 33 directly, but uses:
    (hash << 5) + hash
This is, of course, faster than a multiplication.
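The equivalence is easy to check: shifting left by 5 multiplies by 32, and adding the original value makes 33. The sketch below (all function names are hypothetical) also shows why the parentheses matter in C, where `+` binds tighter than `<<`, so an unparenthesized `hash << 5 + hash` would shift by `5 + hash` instead.

```c
/* (h << 5) + h is h*32 + h, i.e. h*33. Unsigned arithmetic wraps
 * around the same way in both forms, so they agree for every h. */
static unsigned long mul33_shift(unsigned long h)
{
    return (h << 5) + h;
}

static unsigned long mul33_plain(unsigned long h)
{
    return h * 33;
}

/* What `h << 5 + h` means without parentheses: `+` has higher
 * precedence than `<<`, so the shift count becomes 5 + h.
 * Only call this with small h, since the shift count must stay
 * below the width of unsigned long. */
static unsigned long shift_by_sum(unsigned long h)
{
    return h << (5 + h);
}
```

With `h = 1`, the first two functions return 33, while `shift_by_sum(1)` returns 1 << 6 = 64.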
Next, the most important point is the use of loop unrolling. A few days ago I read an article about Discuz's caching mechanism. It said that Discuz adopts different caching strategies depending on how popular a post is and, following user habits, caches only the first page of a post (because few people page any further).
In a similar spirit, PHP favors string keys of fewer than 8 characters, and unrolls the loop in units of 8 to improve efficiency. This, too, is a careful and meticulous detail.
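To make the unrolling concrete, here is a simplified restatement in standard C, placed next to a one-character-per-iteration loop that computes the same value. The names `hash_simple`, `hash_unrolled` and the `STEP` macro are hypothetical, and unlike the PHP source this sketch takes `unsigned char` input; it is meant only to show that unrolling changes the number of loop tests, not the result.

```c
#include <stddef.h>

/* One hashing step; relies on local variables `hash` and `key`. */
#define STEP() (hash = ((hash << 5) + hash) + *key++)

/* Reference version: one character per loop iteration. */
static unsigned long hash_simple(const unsigned char *key, size_t len)
{
    unsigned long hash = 5381;
    while (len--) {
        STEP();
    }
    return hash;
}

/* Unrolled version: eight characters per loop iteration, then the
 * remaining 0..7 characters through a fallthrough switch. A key
 * shorter than 8 characters never enters the loop at all. */
static unsigned long hash_unrolled(const unsigned char *key, size_t len)
{
    unsigned long hash = 5381;
    for (; len >= 8; len -= 8) {
        STEP(); STEP(); STEP(); STEP();
        STEP(); STEP(); STEP(); STEP();
    }
    switch (len) {
    case 7: STEP(); /* fallthrough */
    case 6: STEP(); /* fallthrough */
    case 5: STEP(); /* fallthrough */
    case 4: STEP(); /* fallthrough */
    case 3: STEP(); /* fallthrough */
    case 2: STEP(); /* fallthrough */
    case 1: STEP(); break;
    default: break;
    }
    return hash;
}
```

Both functions return the same value for any key and length; for a 7-character key the unrolled version performs a single loop test and one switch dispatch, where the simple loop performs eight loop tests.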
In addition, there are the inline function and the register variable... Clearly the PHP developers went to great lengths to optimize the hash.
Finally, the initial hash value is set to 5381. The Times 33 algorithm in Apache and the hash algorithm in Perl both start from an initial hash of 0, so why choose 5381? I don't know the specific reason, but I did discover some properties of 5381:
    Magic Constant 5381:
      1. odd number
      2. prime number
      3. deficient number
      4. 001/010/100/000/101 (in binary)
Having seen this, I have reason to believe that this initial value was chosen because it gives a better distribution.
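The listed properties can be verified mechanically. The sketch below uses hypothetical helper names and naive trial division, which is perfectly adequate for a number this small. It checks only the arithmetic facts, not whether they actually improve distribution.

```c
#include <stdbool.h>

/* Trial-division primality test; fine for small n. */
static bool is_prime(unsigned long n)
{
    if (n < 2) {
        return false;
    }
    for (unsigned long d = 2; d * d <= n; d++) {
        if (n % d == 0) {
            return false;
        }
    }
    return true;
}

/* Sum of the proper divisors of n (all divisors except n itself). */
static unsigned long proper_divisor_sum(unsigned long n)
{
    unsigned long sum = 0;
    for (unsigned long d = 1; d <= n / 2; d++) {
        if (n % d == 0) {
            sum += d;
        }
    }
    return sum;
}

/* A number is deficient when its proper divisors sum to less than it. */
static bool is_deficient(unsigned long n)
{
    return proper_divisor_sum(n) < n;
}
```

Since 5381 is prime, its only proper divisor is 1, which is why it is also deficient. The 3-bit groups 001/010/100/000/101 are the octal digits 1, 2, 4, 0, 5, so in C the constant can also be written as `012405`.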
As for why it is Times 33 rather than times some other number, the comments on PHP's hash algorithm offer some explanation. I hope it is useful to interested readers:
    DJBX33A (Daniel J. Bernstein, Times 33 with Addition)

    This is Daniel J. Bernstein's popular `times 33' hash function as
    posted by him years ago on comp.lang.c. It basically uses a function
    like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best
    known hash functions for strings. Because it is both computed very
    fast and distributes very well.

    The magic of number 33, i.e. why it works better than many other
    constants, prime or not, has never been adequately explained by
    anyone. So I try an explanation: if one experimentally tests all
    multipliers between 1 and 256 (as RSE did now) one detects that even
    numbers are not useable at all. The remaining 128 odd numbers
    (except for the number 1) work more or less all equally well. They
    all distribute in an acceptable way and this way fill a hash table
    with an average percent of approx. 86%.

    If one compares the Chi^2 values of the variants, the number 33 not
    even has the best value. But the number 33 and a few other equally
    good numbers like 17, 31, 63, 127 and 129 have nevertheless a great
    advantage to the remaining numbers in the large set of possible
    multipliers: their multiply operation can be replaced by a faster
    operation based on just one shift plus either a single addition
    or subtraction operation. And because a hash function has to both
    distribute well _and_ has to be very fast to compute, those few
    numbers should be preferred and seems to be the reason why Daniel J.
    Bernstein also preferred it.

    -- Ralf S. Engelschall
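The multipliers named in the comment share one property: each is a power of two plus or minus one, which is what allows "one shift plus either a single addition or subtraction". As a small sketch (the function names are hypothetical), each can be written like this:

```c
/* Each multiplier is 2^k + 1 or 2^k - 1, so the multiplication
 * reduces to one shift plus one addition or subtraction.
 * Unsigned arithmetic wraps, so each form matches h * k exactly. */
static unsigned long times17(unsigned long h)  { return (h << 4) + h; } /* 16 + 1  */
static unsigned long times31(unsigned long h)  { return (h << 5) - h; } /* 32 - 1  */
static unsigned long times33(unsigned long h)  { return (h << 5) + h; } /* 32 + 1  */
static unsigned long times63(unsigned long h)  { return (h << 6) - h; } /* 64 - 1  */
static unsigned long times127(unsigned long h) { return (h << 7) - h; } /* 128 - 1 */
static unsigned long times129(unsigned long h) { return (h << 7) + h; } /* 128 + 1 */
```

Whether the hand-written shift form is actually faster than `h * 33` depends on the compiler and the CPU; an optimizing compiler may well emit the same instructions for both.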
• Author: Laruence
• This article’s address: http://www.laruence.com/2009/07/23/994.html