We all know that Redis has several commonly used data structures: strings, hashes, lists, sets, and sorted sets. In fact, Redis later added several more, one of which is HyperLogLog, and another is GEO (geolocation), which arrived in version 3.2.
Here we will briefly introduce the HyperLogLog structure.
Let's start with what it's for: this structure lets you count all kinds of things with very little memory, such as the number of registered IPs, the number of distinct IPs visiting per day, a page's real-time UV (for PV you would just use a plain string counter), the number of online users, and so on.
Notice that every use case above is "the number of X". That is the defining trait of this data structure: it can estimate the count you care about fairly accurately, but it cannot tell you what the counted items actually are. For example, if you count the IPs visiting each day, you can get the total number of distinct IPs for that day, but there is no way to list what those IPs were.
Every gain has its cost. Of course, you could count all of the above with sets instead, which gives you both the count and the full list of members. But for a large website, say one with 1 million distinct IPs per day, a rough estimate of 15 bytes per IP means 1 million IPs take about 15 MB, and 10 million take about 150 MB.
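For comparison, here is a minimal sketch of that set-based approach (the key name ip:set:20160929 is just for illustration): every IP is added with SADD, the exact count comes from SCARD, and the full list is available with SMEMBERS, at the cost of storing every member.

redis> SADD ip:set:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 3
redis> SCARD ip:set:20160929
(integer) 3
redis> SMEMBERS ip:set:20160929
1) "1.1.1.1"
2) "2.2.2.2"
3) "3.3.3.3"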
Now look at HyperLogLog: each key occupies at most 12 KB, yet it can count close to 2^64 distinct values, regardless of what you store. From that 12 KB alone you can see the point of this data structure, and also why it cannot tell you the details of what is inside: it is a cardinality-estimation algorithm, which uses a small, fixed amount of memory to identify and count the unique elements of a set. The count is not exact; it is an approximation with a standard error of 0.81%.
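For the curious, that 0.81% is not arbitrary: Redis's HyperLogLog uses m = 16384 registers of 6 bits each (16384 × 6 bits ≈ 12 KB), and the standard error of the HyperLogLog estimator is 1.04 / sqrt(m):

1.04 / sqrt(16384) = 1.04 / 128 ≈ 0.0081 ≈ 0.81%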
The more you record, the starker the contrast with the set-based approach, because a HyperLogLog key takes a fixed 12 KB of memory no matter how many values fall within its counting range.
For example, suppose we record the IPs that visit each day and there are 100 million of them per day. Using sets, that is about 1.5 GB of memory per day, or roughly 45 GB to keep a month of records. With HyperLogLog it is 12 KB per day and 360 KB per month. If we don't need to know the actual IPs, we could keep these records in memory for a year or simply never delete them; and if the full access log is ever needed, we can store it elsewhere by other means. With one key per day, we can also compute the total number of distinct IPs per month or per year by merging the daily keys (duplicates are removed automatically), as sketched below.
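A minimal sketch of that daily-key scheme, assuming one key per day named ip:YYYYMMDD and a monthly rollup key ip:201609 (PFMERGE and PFCOUNT are covered in detail below):

redis> PFADD ip:20160901 "1.1.1.1" "2.2.2.2"
(integer) 1
redis> PFADD ip:20160902 "2.2.2.2" "3.3.3.3"
(integer) 1
redis> PFMERGE ip:201609 ip:20160901 ip:20160902
OK
redis> PFCOUNT ip:201609
(integer) 3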
Next, the HyperLogLog commands. They are quite similar to the set commands, except there are fewer of them and you cannot retrieve the members. Note that this data structure requires Redis 2.8.9 or later~
PFADD
This command updates the internal HyperLogLog structure and reports the result: if the cardinality estimate changed as a result of the call, it returns 1; otherwise (the elements are considered to already exist), it returns 0.
Another nice property is that the command can be called with just a key and no elements, which simply creates an empty HyperLogLog: if the key already exists, nothing is done and 0 is returned; if it does not exist, it is created and 1 is returned.
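Concretely, the value-less form behaves like this (ip:20161001 is just a fresh, illustrative key name):

redis> PFADD ip:20161001
(integer) 1
redis> PFADD ip:20161001
(integer) 0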
The time complexity is O(1) for each element added, so feel free to use it~
Command example:
redis> PFADD ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> PFADD ip:20160929 "2.2.2.2" "4.4.4.4" "5.5.5.5"   # existing elements are skipped; only new ones are added
(integer) 1
redis> PFCOUNT ip:20160929                               # the estimated number of distinct elements
(integer) 5
redis> PFADD ip:20160929 "2.2.2.2"                       # already exists, so nothing changes
(integer) 0
As you can see, it is actually quite accurate when the counts are small, haha.
PFCOUNT
We already used this command above; let's introduce it properly.
When applied to a single key, it returns the cardinality estimate for that key, or 0 if the key does not exist.
When applied to multiple keys, it returns the estimated cardinality of their union, as if the keys had been merged first and the result counted.
When called with a single key, the time complexity is O(1) with a very small average constant time; when called with N keys it is O(N), and the constant time is noticeably higher.
Command example:
redis> PFADD ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> PFCOUNT ip:20160929
(integer) 3
redis> PFADD ip:20160928 "1.1.1.1" "4.4.4.4" "5.5.5.5"
(integer) 1
redis> PFCOUNT ip:20160928 ip:20160929
(integer) 5
PFMERGE
Merges multiple HyperLogLogs into one. This is easy to understand: the cardinality estimate of the merged HyperLogLog approximates the cardinality of the union of the source sets.
The first argument is the destination key and the remaining arguments are the HyperLogLogs to merge. If the destination key does not exist, it is created first and then the merge is performed.
The time complexity is O(N), where N is the number of HyperLogLogs being merged, though the constant factor is fairly high.
Command example:
redis> PFADD ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> PFADD ip:20160928 "1.1.1.1" "4.4.4.4" "5.5.5.5"
(integer) 1
redis> PFMERGE ip:201609 ip:20160928 ip:20160929
OK
redis> PFCOUNT ip:201609
(integer) 5
That covers all of the HyperLogLog commands; yes, at the moment there are only these three. It is really quite simple: once you know how this structure is used, you also know when it is the right fit, and it can be a big help for our precious memory.